Designing a dead-letter queue you can trust

In payments, the interesting question was never what happens when a message processes. It’s what happens when it can’t — and whether the system can tell you the difference between “try again in a second” and “a human needs to look at this.”

I work on the financial core at DoorLoop — fund flows, Stripe and Treasury, the kind of money movement where a mistake is expensive and hard to reverse. The event-driven parts run on CQRS and the outbox pattern, which are well-trodden. The part I’ve spent an unreasonable amount of time on is the dead-letter queue, because that’s where the system goes when it’s confused, and a DLQ nobody can act on is just a graveyard with good intentions.

The poison message

A poison message is one that will never succeed, no matter how many times you retry it. A malformed payload, a reference to a customer that was deleted, an assumption the code makes that this particular event violates. Retrying it forever is a denial-of-service you build for yourself: the consumer wedges on the bad message, the lane backs up behind it, and your throughput quietly goes to zero while the dashboards look busy.

So the first rule is boring and non-negotiable: bounded retries, then set it aside. After N attempts with backoff, the message moves to the DLQ with everything you’ll need to understand it later — the original payload, the error, the attempt count, the consumer version, and a correlation ID back to the fund flow it belongs to.

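// consumer.ts (runs where the handler's failure is caught; `err` is the caught value)
if (attempt >= MAX_ATTEMPTS) {
  // Retries exhausted: park the message with the context needed to debug it later.
  await deadLetter.park({
    payload,
    error: err instanceof Error ? err.message : String(err),
    attempts: attempt,
    consumer: VERSION,
    correlationId: event.correlationId,
  });
  return Ack.Drop; // stop poisoning the lane
}
return Ack.Retry(backoff(attempt)); // treat the failure as transient until the budget runs out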
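
The snippet calls `backoff(attempt)` without showing it. One common shape is capped exponential backoff with full jitter. This is a minimal sketch rather than our production code: the base delay, the cap, and the 1-based attempt count are all assumptions chosen for illustration.

```typescript
// Hypothetical tuning values; pick ones that fit your own lanes.
const BASE_DELAY_MS = 200;
const MAX_DELAY_MS = 30000;

// Capped exponential backoff with full jitter.
// `attempt` is assumed to be 1-based: the first retry waits up to BASE_DELAY_MS.
// `random` is injectable so the function can be tested deterministically.
function backoff(attempt: number, random: () => number = Math.random): number {
  const exponent = Math.max(0, attempt - 1);
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, exponent));
  // Full jitter: pick uniformly in [0, ceiling) so retries from many
  // consumers don't line up and hit a struggling dependency together.
  return Math.floor(random() * ceiling);
}
```

The cap matters as much as the exponent: without it, a message on its tenth attempt would wait minutes, and the retry budget would hide a poison message for far longer than it should.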

Replay has to be safe

A DLQ is only useful if you can put a message back. That means every consumer has to be idempotent: processing the same event twice must leave the system in exactly the state that processing it once would. In a payments system this isn’t a nice-to-have; it’s the property that lets you redrive a thousand parked messages at 2am without double-charging anyone.

Idempotency is a design input, not a retry hack you sprinkle on later. Each command carries a key derived from the event, and the write side refuses to apply the same key twice. The check and the effect commit in one transaction, or neither does.

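// apply.ts
await db.tx(async (t) => {
  // Claim the key first; the insert reports whether this key is new.
  const claim = await t.insertIdempotencyKey(cmd.key);
  if (!claim.inserted) return; // already applied: no-op, safe to replay

  // Same transaction: the events and the outbox rows commit with the key, or not at all.
  await t.appendEvents(aggregate, cmd.events);
  await t.enqueueOutbox(cmd.events);
});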
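
To make the contract concrete, here is a small in-memory sketch of the same idea. The `Ledger` class, the `keyFor` helper, and the command shape are hypothetical stand-ins, and a `Set` plays the role of the key table; in a real system the key and the state change would live in one database transaction, as in the snippet above.

```typescript
// In-memory stand-in for the write side (illustrative only).
interface PayoutCommand {
  eventId: string;     // ID of the event that caused this command
  handler: string;     // which consumer derived the command
  payoutId: string;
  amountCents: number;
}

// Deterministic key: the same event through the same handler always yields
// the same key, so a replay collides with the original.
function keyFor(cmd: PayoutCommand): string {
  return `${cmd.handler}:${cmd.eventId}`;
}

class Ledger {
  private appliedKeys = new Set<string>();
  private balances = new Map<string, number>();

  // Returns true if the command changed state, false if it was a replay.
  apply(cmd: PayoutCommand): boolean {
    const key = keyFor(cmd);
    if (this.appliedKeys.has(key)) return false; // already applied: no-op
    this.appliedKeys.add(key);
    const current = this.balances.get(cmd.payoutId) ?? 0;
    this.balances.set(cmd.payoutId, current + cmd.amountCents);
    return true;
  }

  balanceOf(payoutId: string): number {
    return this.balances.get(payoutId) ?? 0;
  }
}

const ledger = new Ledger();
const cmd: PayoutCommand = {
  eventId: "evt_1",
  handler: "payout-applier",
  payoutId: "po_1",
  amountCents: 5000,
};
ledger.apply(cmd); // first delivery applies the amount
ledger.apply(cmd); // redrive of the same event is a no-op
console.log(ledger.balanceOf("po_1")); // 5000
```

The important choice is that the key comes from the event, not from the delivery. A key generated per attempt would be unique every time and would protect nothing.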

Ordering is a promise you might not be able to keep

Global ordering across a payment system is a fantasy. Per-aggregate ordering — every event for a given payout, in sequence — is achievable and usually what you actually need. But it collides with the DLQ in an awkward way: if event #4 for a payout is poison and you set it aside, what do you do with #5?

There’s no universally right answer, only a decision you have to make on purpose. Block the lane for that aggregate until a human resolves #4, and you preserve correctness at the cost of latency. Skip ahead and process #5, and you keep moving at the cost of applying events out of order against a stale state. For money, I block the aggregate and page someone. The lane for every other payout keeps flowing, so the blast radius is one entity, not the whole system.
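
The block-the-aggregate choice can be sketched as a dispatcher with one logical lane per aggregate. Everything here is hypothetical (`LaneDispatcher`, its callbacks, the in-memory hold queue), and it assumes bounded retries have already happened inside `handle`; it only illustrates the blast-radius argument, not how our consumers are actually built.

```typescript
interface LaneEvent {
  aggregateId: string; // e.g. a payout ID
  seq: number;         // per-aggregate sequence number
}

type Outcome = "processed" | "parked" | "held";

// Hypothetical dispatcher with one logical lane per aggregate.
class LaneDispatcher {
  private blocked = new Set<string>();
  private held = new Map<string, LaneEvent[]>();
  private handle: (e: LaneEvent) => void;
  private park: (e: LaneEvent, reason: string) => void;

  constructor(
    handle: (e: LaneEvent) => void,
    park: (e: LaneEvent, reason: string) => void,
  ) {
    this.handle = handle;
    this.park = park;
  }

  dispatch(e: LaneEvent): Outcome {
    if (this.blocked.has(e.aggregateId)) {
      // Something earlier in this lane is parked: hold, in arrival order.
      const queue = this.held.get(e.aggregateId) ?? [];
      queue.push(e);
      this.held.set(e.aggregateId, queue);
      return "held";
    }
    try {
      this.handle(e); // assumed to throw only after its retries are exhausted
      return "processed";
    } catch (err) {
      this.blocked.add(e.aggregateId); // block this aggregate only
      this.park(e, err instanceof Error ? err.message : String(err));
      return "parked";
    }
  }

  // Call once a human has redriven or dropped the parked event.
  // Releases the held events in their original order.
  release(aggregateId: string): Outcome[] {
    const queue = this.held.get(aggregateId) ?? [];
    this.held.delete(aggregateId);
    this.blocked.delete(aggregateId);
    return queue.map((e) => this.dispatch(e));
  }
}

const done: string[] = [];
const parkedIds: string[] = [];
const dispatcher = new LaneDispatcher(
  (e) => {
    if (e.aggregateId === "payout_A" && e.seq === 4) throw new Error("unknown customer");
    done.push(`${e.aggregateId}#${e.seq}`);
  },
  (e) => { parkedIds.push(`${e.aggregateId}#${e.seq}`); },
);

dispatcher.dispatch({ aggregateId: "payout_A", seq: 4 }); // "parked": lane A blocks
dispatcher.dispatch({ aggregateId: "payout_A", seq: 5 }); // "held": waits behind #4
dispatcher.dispatch({ aggregateId: "payout_B", seq: 1 }); // "processed": other lanes flow
dispatcher.release("payout_A");                           // #5 now runs, in order
```

If a released event fails in turn, the lane blocks again and the rest of the queue goes back on hold, so ordering survives a second poison message too.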

A dead-letter queue isn’t where messages go to die. It’s where the system admits it’s confused and asks for help — loudly, with enough context that help is actually possible.

Schema drift

Events outlive the code that wrote them. A message parked today might be redriven next week against a consumer that’s shipped four times since. So every event is versioned, and every consumer is written to handle shapes it didn’t author — narrowing on a discriminated union rather than trusting the payload it happens to receive.

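// events.ts
import { z } from "zod";
// `Money` is assumed to be a Zod schema defined elsewhere in the codebase.

const PayoutEvent = z.discriminatedUnion("v", [
  z.object({ v: z.literal(1), amount: z.number() }),
  z.object({ v: z.literal(2), amount: Money, currency: z.string() }),
]);

// parse() throws a ZodError for any shape outside the union.
const e = PayoutEvent.parse(raw); // drift becomes a typed error, not a 3am mystery
if (e.v === 1) return migrateV1(e); // narrowed to the v1 shape here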

Zod parses at the boundary; TypeScript narrows after. A message from an old schema doesn’t crash the redrive — it routes to a migration path or, if it genuinely can’t be understood, back to the DLQ with a clear reason instead of a stack trace.
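
The three outcomes (handle, migrate, or park with a reason) can be made explicit as a return type. The sketch below is deliberately written without Zod so it stands alone; with Zod, `safeParse` would do the validation that the hand-written checks do here. The `Route` type and `routePayoutEvent` are hypothetical names, and `amount` is simplified to a plain number.

```typescript
type PayoutEventV1 = { v: 1; amount: number };
type PayoutEventV2 = { v: 2; amount: number; currency: string };

type Route =
  | { kind: "handle"; event: PayoutEventV2 }
  | { kind: "migrate"; event: PayoutEventV1 }
  | { kind: "dead-letter"; reason: string };

function isRecord(x: unknown): x is Record<string, unknown> {
  return typeof x === "object" && x !== null;
}

function deadLetter(reason: string): Route {
  return { kind: "dead-letter", reason };
}

// Decide what to do with a raw payload during a redrive. Never throws:
// anything it cannot understand comes back as a dead-letter with a reason.
function routePayoutEvent(raw: unknown): Route {
  if (!isRecord(raw)) return deadLetter("payload is not an object");
  const { v, amount, currency } = raw;
  if (v !== 1 && v !== 2) return deadLetter(`unknown schema version: ${String(v)}`);
  if (typeof amount !== "number" || !Number.isFinite(amount)) {
    return deadLetter("amount is missing or not a finite number");
  }
  if (v === 1) return { kind: "migrate", event: { v: 1, amount } };
  if (typeof currency !== "string") return deadLetter("v2 event is missing currency");
  return { kind: "handle", event: { v: 2, amount, currency } };
}
```

Returning the reason as data rather than throwing is what lets the DLQ entry say "v2 event is missing currency" instead of carrying a stack trace.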

Operational controls

None of this matters without the boring operational surface: the ability to inspect a parked message, replay one or many, redrive a whole batch after a fix ships, and drop something that’s genuinely unrecoverable — each action audited, attributable, and reversible where it can be. We push the same fund-flow events into an analytics store and query them ad-hoc, so “how many payouts are stuck, since when, and why” is a question you answer in seconds, not a forensic dig through logs.
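
As a rough illustration of that surface, here is an in-memory sketch in which every state change passes through one audited method. `DeadLetterConsole` and its method names are hypothetical, and a real version would sit on a durable store; the point is only the shape: inspect, replay, batch redrive, and a drop that stays reversible.

```typescript
type Status = "parked" | "replayed" | "dropped";
type Action = "replay" | "drop" | "restore";

interface ParkedMessage {
  id: string;
  correlationId: string;
  error: string;
  attempts: number;
  status: Status;
}

interface AuditEntry {
  action: Action;
  messageId: string;
  actor: string; // who did it
  at: Date;      // when
}

// Hypothetical in-memory console; a real one would sit on a durable store.
class DeadLetterConsole {
  readonly audit: AuditEntry[] = [];
  private messages = new Map<string, ParkedMessage>();
  private redeliver: (m: ParkedMessage) => void;

  constructor(redeliver: (m: ParkedMessage) => void) {
    this.redeliver = redeliver;
  }

  park(m: Omit<ParkedMessage, "status">): void {
    this.messages.set(m.id, { ...m, status: "parked" });
  }

  inspect(id: string): ParkedMessage | undefined {
    return this.messages.get(id);
  }

  // Every state change goes through here, so nothing happens unaudited.
  private move(id: string, from: Status, to: Status, action: Action, actor: string): ParkedMessage | undefined {
    const m = this.messages.get(id);
    if (!m || m.status !== from) return undefined;
    m.status = to;
    this.audit.push({ action, messageId: id, actor, at: new Date() });
    return m;
  }

  replay(id: string, actor: string): boolean {
    const m = this.move(id, "parked", "replayed", "replay", actor);
    if (m) this.redeliver(m); // safe only because consumers are idempotent
    return m !== undefined;
  }

  // Redrive every parked message that matches, e.g. after a fix ships.
  redrive(match: (m: ParkedMessage) => boolean, actor: string): number {
    const ids = Array.from(this.messages.values())
      .filter((m) => m.status === "parked" && match(m))
      .map((m) => m.id);
    return ids.filter((id) => this.replay(id, actor)).length;
  }

  // Drop is a soft delete so that it stays reversible.
  drop(id: string, actor: string): boolean {
    return this.move(id, "parked", "dropped", "drop", actor) !== undefined;
  }

  restore(id: string, actor: string): boolean {
    return this.move(id, "dropped", "parked", "restore", actor) !== undefined;
  }
}
```

A batch redrive after a fix is then a single call with a predicate, for example matching on the error text, and the audit log records one entry per message rather than one per batch.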

The test that matters

Can an on-call engineer who didn’t write the consumer take a parked message, understand why it failed, and either replay it or escalate it — without reading the source? If not, the DLQ isn’t done.

What it buys

Done well, the DLQ turns failure from something that takes the system down into something that’s observable and reversible. A bad deploy parks a few hundred messages instead of corrupting state; you fix the consumer, redrive the batch, and the idempotency keys make the replay a no-op for anything that already went through. The failure mode becomes a Tuesday, not an incident.

That’s the whole posture, really — reason about the failure first, make it loud, make it recoverable. In payments, what breaks is the job. The happy path mostly takes care of itself.

Filed under systems. Built something similar, or solved the ordering question differently? Tell me how.