I Replaced Kafka, Redis, and RabbitMQ With One Tool. Here’s What I Learned.
The one tool your backend stack has been missing
Photo by Ivan Bandura on Unsplash
What This Post Is
The guide I wish existed when I first encountered NATS.If you’ve worked with Kafka you’ll feel at home — and then you’ll start noticing what you’ve been missing. If you’ve stitched together Kafka, Redis, and RabbitMQ — NATS does all three in a single binary.
A Note on This Post
Before we dive deep into this post, I want to make something clear. Yes, parts of this blog were written with AI assistance — the em dashes are a dead giveaway.
But before you close the tab: this isn’t cheap AI slop dumped into a text box. Every concept here came from weeks of hands-on research — reading through the official NATS documentation, grinding through Synadia’s interactive NATS guide at nats.synadia.com, watching hours of Synadia’s YouTube deep dives, and drawing from real production experience building and migrating queue processing systems at scale.
The AI didn’t do the learning. It helped me articulate what I already understood.
Think of it less as “AI wrote this” and more as “a backend engineer used AI as a rubber duck that could also type.” The insights, the Kafka comparisons, the architecture decisions — those came from actually working with these systems in production.
So yes, there are em dashes! Yes, Claude Opus 4.6 helped structure and write this. And honestly? I’d rather be upfront about it than pretend otherwise.
Now — onto NATS.
Stack of books and an open notebook beside a coffee cup — a deep dive into NATS messaging:- Photo by freestocks on Unsplash
Core
Screenshot from Synadia's Interactive NATS Guide at nats.synadia.com
Core NATS is the foundation layer — Basically the pub/sub model. Everything that follows — subjects, pub/sub, request/reply, queue groups, JetStream, KV stores — is built on top of Core. Any publisher can reach any subscriber through named subjects.
Core NATS is fire and forget — no persistence, no disk, in-memory only.
Message exists just long enough to forward to connected subscribers
If no subscriber is listening, message is gone
Designed for use cases where stale data is worse than no data (real-time telemetry, live metrics, sensor readings)
Slow Consumer behavior:
If a consumer’s buffer fills up, NATS disconnects it rather than stalling everyone else — it disconnects it (explained below)
Messages queued for that consumer are lost
Consumer must reconnect and re-subscribe itself
This is a circuit breaker / self-protection mechanism — not a recovery mechanism like Kafka rebalancing
Kafka rebalancing = redistribute work so nothing is lost. NATS disconnect = protect the broker, consumer is on its own.
Core NATS — Slow Consumer Behavior
Core NATS allocates a per-connection outbound buffer in server RAM for each subscriber.
Chronology:
Subscriber connects → server allocates outbound buffer for it
Publisher sends → server writes into that buffer
Subscriber reads fast → buffer drains → fine
Subscriber reads slow → buffer fills up faster than it drains
Buffer hits limit (64MB default) → server disconnects that subscriber
Everything in that buffer is gone — permanently
Publisher and other subscribers never knew anything happened
Why it disconnects instead of waiting — Core NATS has no disk log. The only options are stall everyone or cut the slow subscriber. It cuts.
Most common cause in Python — accidentally blocking the asyncio event loop inside your handler. nats.py can’t drain the TCP socket while the loop is blocked, server buffer fills up, disconnect.
JetStream fixes this — message is already on disk before delivery. Slow consumer just builds lag like Kafka. No disconnect, no loss.
Wildcards
NATS subjects are dot-separated tokens:
payments.created
payments.failed
orders.payment.processed
orders.payment.failedTwo wildcard characters:
* — matches exactly one token
payments.* → matches payments.created, payments.failed
does NOT match payments.card.created (two tokens)
orders.*.processed → matches orders.payment.processed
matches orders.refund.processed
does NOT match orders.payment.card.processed> — matches everything after, one or more tokens
orders.> → matches orders.payment.processed
matches orders.payment.failed
matches orders.shipping.tracking.updated
matches anything starting with orders.Key rules:
>can only appear at the end —orders.>valid,orders.>.created.invalid*can appear anywhere in the middlePublishers cannot use wildcards — only subscribers can
JetStream stream subjects can use wildcards to capture multiple subjects into one stream
Subjects
This thing is important to understand, as it is the fundamental differentiating factor when working with NATS.
Like Kafka has partitions, NATS does NOT have partitions
The mental model is: subscribers are like attachment nodes that can attach to subjects with patterns
Unlike fixed number of consumers and partitions in Kafka, you can attach any number of subscribers
Messages are addressed by subjects — meaning instead of “send to partition 3”, you “publish to payments.created” and anyone subscribed to that subject gets it. The subject is the address, not a queue slot.
Many-to-many relationship — any publisher can reach any number of subscribers.
Subject Naming Conventions
Core rule — general to specific, left to right
domain.entity.action ✓
domain.entity.identifier.measurement ✓
action.entity.domain ✗ — wildcards become uselessWildcards only work left to right. Putting specific things first kills your filtering ability.
Identifier placement — critical
# Bad — identifier too early
orders.{order_id}.status
orders.{order_id}.alert
# Good - identifier late, action last
orders.payment.{order_id}.reading
orders.payment.{order_id}.alertWith the good pattern:
orders.payment.*.status— all payment statuses across all ordersorders.payment.{order_id}.>— everything for one orderorders.*.{order_id}.>— all event types for one order
With the bad pattern — orders.*.status matches order IDs, not types
Rule — put identifiers as late as possible, actions last.
Implications on streams
Your naming determines stream boundaries:
Stream bound to: orders.>
→ captures everything in one stream
→ consumers filter by subject
Stream bound to: orders.payment.>
Stream bound to: orders.shipping.>
→ separate streams per measurement type
→ more operational overheadToo broad = one stream catches everything including things it shouldn’t. Too narrow = too many streams, hard to manage. Design subject hierarchy before creating streams.
Implications on consumer filtering
Consumer filter: orders.*.{order_id}.status
→ all event types for one specific order
Consumer filter: orders.payment.*.status
→ all payment statuses across all orders
Consumer filter: orders.>
→ everything - good for audit/monitoring consumerYour naming determines what filtering granularity is possible. Bad naming = can’t filter the way you need without creating separate streams.
Avoid
Practical rules
Lowercase always
Dots as separators only
3–4 tokens is the sweet spot — deep enough for wildcards, shallow enough to reason about
Domain first, identifiers in the middle, action last
Design subject names for the consumer filters you’ll need, not just what the publisher wants to send
Never redesign subject naming after streams are in production — painful to migrate
TCP Foundation
NATS protocol is human-readable text over TCP (you can literally telnet into it).
NATS keeps TCP’s reliability but fixes its weaknesses at scale:
Request & Reply
NATS doing RPC without the gRPC ceremony — no protobuf schemas, no generated stubs, no service mesh.
How it works:
Requester creates a unique temporary inbox subject —
_INBOX.abc123Publishes to the service subject (e.g.
payments.process) with the inbox as reply-to addressResponder receives it, sees the reply-to, publishes response back to
_INBOX.abc123Only the original requester receives it — they’re the only one subscribed to that inbox
Key insight — requester is both publisher and subscriber simultaneously:
Publisher → sends to
payments.processSubscriber → listens on
_INBOX.abc123
Responder flips too — subscriber to receive, publisher to respond. The line between publisher and subscriber disappears completely.
Scatter-gather — send one request, collect responses from multiple responders. Useful for finding the fastest instance, aggregating from shards, or fan-out queries. gRPC needs a load balancer for this. NATS gets it for free.
Timeout behavior — no response means clean timeout. Unlike HTTP hanging connections, failed downstream services don’t cascade into thread exhaustion on the caller side.
The deeper insight — request/reply isn’t a special protocol. It’s just two pub/sub operations choreographed with an inbox trick. Same with queue groups, JetStream, KV store — everything in NATS is just subject-based pub/sub composing into higher-level patterns. One primitive, everything else built on top.
Clarification — is this pattern unique to NATS?
No. Kafka, RabbitMQ, Redis pub/sub can all do request/reply. The pattern isn’t new.
What makes NATS practical for it — subjects are pure routing addresses, zero cost, no disk, no partition assignment. A temporary inbox like _INBOX.abc123 spins up and disappears for free. In Kafka, creating topics is heavy — partitions, replication, disk — so request/reply is technically possible but an anti-pattern nobody actually does.
Redis pub/sub is the closest comparison — also in-memory, ephemeral, lightweight. Conceptually the same for this pattern. NATS wins over Redis when you need the full picture — fire and forget, durable streams, request/reply, queue groups — all from one system, one binary. Redis makes you stitch pub/sub + Streams + Cluster together separately.
No Responders
NATS knows its subscription table. If nobody is subscribed to a subject, the server tells you immediately with a “no responders” status — no waiting, no timeout, no hanging connection. Unlike HTTP where a dead service hangs your request for 30 seconds and blocks a thread.
How it changes in JetStream
“No responders” largely disappears for JetStream publishing. The stream itself catches the message and writes to disk regardless of whether any consumer is connected right now. Nobody home doesn’t matter — the stream is always there.
Failure modes shift to:
Subject not bound to any stream → error
Stream full (hit limits) → error based on discard policy
Where “no responders” still applies
Request/reply pattern — needs a live subscriber actively listening right now to respond to your inbox. If nobody is subscribed, “no responders” fires instantly. This is correct behavior because request/reply by definition requires someone home to answer.
“No responders” is a protocol-level feature, not an application pattern. The server sends it back. The distinction is just — JetStream publish looks for a stream to catch the message, request/reply looks for a live subscriber. Different targets, different behavior.
publish() vs request()
publish() — fire and forget. Send a message, don’t expect a response. Nobody listening = silent drop, no error, publisher never knows.
request() — send and wait for a response. Automatically creates an inbox subject under the hood, waits for someone to reply. Nobody listening = instant “no responders” error. This is the request/reply pattern.
Same mechanism under the hood. Different intent. request() is just publish() with an inbox attached and a wait for reply on top.
The No Responder model only applies to the request() mechanism
Queue Groups
Subscribers join a named queue group. NATS distributes messages across all subscribers in that group — exactly one subscriber per message. No coordinator, no leader election, no partition rebalancing.
vs Kafka Consumer Group
Message distribution — random. NATS picks one subscriber from the group per message. No strict round robin, no sticky routing.
Fan-out — queue groups and regular subscribers coexisting
Both can listen on the same subject simultaneously:
Queue group → exactly one worker gets it
Regular subscribers → every one of them gets it
Same message, one copy distributed among workers, one copy broadcast to all regular subscribers. Useful for workers processing while monitoring/analytics observe simultaneously.
Two queue groups = message duplicated
Each queue group gets its own copy. Within each group exactly one worker gets it. Use this when two completely separate pipelines need to consume the same messages independently.
Slow consumer in a queue group
That worker gets disconnected. NATS immediately routes its share to remaining workers — no pause, no rebalance delay. In Core NATS — messages in that worker’s buffer are lost. In JetStream — redelivered automatically.
Core NATS vs JetStream queue groups
Core NATS queue group — fire and forget. Worker dies mid-processing = message lost, no redelivery.
JetStream durable consumer — same load balancing behavior but message stays in stream until acked. Worker dies = redelivered. This is the right choice for production workloads.
JetStream
Screenshot from Synadia’s Interactive NATS Guide at nats.synadia.com
JetStream is NATS’s persistence layer — adds a durable log on top of Core NATS. Solves the message loss problem. This is the Kafka replacement layer.
Retention Policies — this is important
Controls when messages are removed from the stream.
Limits-based — keep messages up to a cap, discard oldest when limit hit. Configure by:
Max message count
Max bytes
Max age
This is your Kafka-equivalent retention. Set it and forget it.
Interest-based — message stays until every consumer that has registered interest has acked it. Once all consumers ack = deleted automatically. Nothing discarded while any consumer still hasn’t seen it. Useful when you have multiple independent pipelines all needing the same message.
Work-queue — message deleted immediately on ack. Each message processed exactly once then gone. No keeping history, no replay. Closest to a traditional job queue.
Which one for your architecture?
Kafka → JetStream mapping
What doesn’t map cleanly: Kafka scales via partitions → more partitions = more consumers. JetStream scales differently (multiple streams, subject sharding). This needs separate thought when migrating.
Deliver Policy — when a new subscriber attaches, you explicitly tell it where to start:
DeliverAll— from the very first stored messageDeliverNew— only future messages, ignore historyDeliverLast— only the most recent message, then new onesDeliverByStartSequence— from a specific sequence numberDeliverByStartTime— from a specific timestamp
Push vs Pull consumers — JetStream only
Pull consumer — you ask the server for messages when you’re ready. You control batch size and pace. Server waits until you ask. If you’re slow, nothing bad happens — no buffer, no disconnect.
Push consumer — server sends messages to you immediately without being asked. Server controls the pace. If you can’t keep up, buffer fills up, slow consumer problem kicks in. Push consumers are now considered legacy — new projects should use pull consumers.
In Short:-
Push — server pushes messages to you (like core NATS feel)
Pull — you explicitly ask for next batch (like Kafka’s poll loop)
For Kafka replacement workloads → use Pull consumers
Three ways to consume with a pull consumerFetch(n) — request exactly N messages, get them back, loop manually. You control exactly when the next batch arrives. Best for batch processing workloads. Direct equivalent of your Kafka batch consume pattern.
FetchNoWait — same as Fetch but returns immediately if no messages available instead of blocking. No pause between fetches.
Consume — continuous delivery via callback. Consolidates features from both push and pull. Messages arrive as fast as the server can send within flow control limits. Feels like push but pull under the hood — real-time delivery without managing a fetch loop. Best for event-driven processing.
Which to use
Why pull replaced push
Push consumers couldn’t scale horizontally cleanly. Pull consumers distribute naturally — whoever fetches first gets the messages. All new features (priority groups, pinning, overflow) only work on pull consumers. Legacy push API is not being removed but will not receive new features.
How Consume Works Mechanically — JetStream only
Kafka poll — for reference
You drive the loop explicitly. Every iteration is an explicit request to the broker. You control when to poll, batch size, and timeout. The broker just responds to your requests.
while True:
records = consumer.poll(timeout=1000)
# process records
# commit offset
# repeatFetch(n) — direct Kafka poll equivalent
Client sends a fetch request to NATS server. Server responds with up to N messages. You process and ack. You call fetch again when ready. One explicit request per iteration. You drive the loop. You control pace, batch size, and when the next fetch happens.
This is your existing Kafka batch consume pattern mapped 1:1 to JetStream.
Consume — Kafka poll loop hidden inside the client
You provide a callback. The client library internally runs the fetch loop — sends fetch request, server responds, callback fires per message, when batch drains the client automatically sends next fetch request. Repeat forever without your code doing anything.
Server behavior is identical to Fetch — still responding to fetch requests. The only difference is who is calling them — your code vs the client library background loop.
Consume IS Kafka’s poll loop. Just hidden inside the client library instead of your application code.
Flow control difference — important
What the server does in both cases
Identical — responds to fetch requests with available messages. Server doesn’t know or care whether it’s your code calling Fetch or the client library calling it internally via Consume. The distinction is purely client-side.
Ack system — JetStream’s equivalent of Kafka offset commit:
Kafka has no native equivalent of Nak, InProgress, or Term — you’d build all that yourself or rely on frameworks like Spring Kafka.
If you never send an Ack:
AckWaittimer expires → JetStream redelivers automaticallyThis is your at-least-once delivery guarantee
Always set
MaxDeliverto something finite — without it, a poison pill loops forever
Dead letter behavior:
Termis a first-class protocol primitive in JetStream — one line of codeIn Kafka, dead letter queues are application-level code you write yourself (or get from a framework)
JetStream designs it in. Kafka bolted it on.
InProgress vs Kafka heartbeat:
In the Kafka Python client (aiokafka), background heartbeat ran automatically
In JetStream, there is no background heartbeat — liveness is per message
You must manually send
InProgressto reset the AckWait timer while processingSome NATS Python client (
nats.py) has helpers for this — worth checkingThis is more precise than Kafka — a stuck consumer is detected per message, not just per connection
MaxAckPending
Limits how many messages can be delivered but not yet acked across ALL workers on a consumer at any given time. Server enforced — not application level.
MaxAckPending = 100
Worker 1 pulls 50 → processing → not acked yet
Worker 2 pulls 50 → processing → not acked yet
Total unacked = 100 → limit hit
→ server suspends ALL delivery on this consumer
→ every worker’s next pull returns empty
→ Worker 1 acks 10 → total = 90 → delivery resumesNot just new workers blocked — entire consumer pauses. Every worker stops getting messages until pending count drops.
Critical sizing rule:
MaxAckPending must be > batch size × number of workers
10 workers × 50 batch = 500 minimum in-flight
MaxAckPending = 100 ← too low, throughput tanks silently
MaxAckPending = 1000 ← safe headroomUndersizing is a common performance trap — workers keep pulling but get empty responses, no obvious error.
Why it exists — without it workers pull millions of messages, crash, millions redeliver simultaneously, system overwhelmed. MaxAckPending keeps in-flight work bounded so crash recovery stays controlled.
Kafka comparison — closest equivalent is max.in.flight.requests.per.connection but that’s per producer connection. JetStream MaxAckPending is per consumer across all workers — broader scope, stronger guarantee.
Dead Letter Queue — three primitives composed together
No single DLQ config — you build it from MaxDeliver, Term, and advisory subjects.
MaxDeliver — set on consumer config. After N delivery attempts with no ack → JetStream stops redelivering. Message stays in stream but no longer actively delivered.
Term — sent from your code when you detect a poison pill. Immediately stops redelivery without waiting for MaxDeliver to exhaust. One line of code.
Advisory subjects — JetStream automatically fires events when either condition hits:
$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.{stream}.{consumer}— MaxDeliver exhausted$JS.EVENT.ADVISORY.CONSUMER.MSG_TERMINATED.{stream}.{consumer}— Term received
Advisory contains the sequence number of the dead message. You create a separate consumer subscribed to these subjects — on advisory received, fetch original message from stream by sequence number, write to a dedicated dead letter stream for inspection and manual reprocessing.
Full flow:
Message delivered → worker processing
→ success → Ack → done
→ retryable failure → Nak → redelivered with backoff
→ MaxDeliver exhausted → advisory fired → DLQ consumer catches it
→ poison pill → Term → advisory fired → DLQ consumer catches it
DLQ stream → stored for inspection → fix → manually republishKafka comparison on DLQ:
Kafka bolted DLQ on. JetStream designed it in — the primitives are there, you just wire them together.
Ephemeral vs Durable Consumers — JetStream only clarification
Durable consumer — has a name, state persists on the server
Consumer name: “orders-processor”
→ server remembers its position
→ tracks acks, pending, redeliveries
→ if all workers disconnect → consumer still exists on server
→ workers reconnect → pick up exactly where they left offThis is what you want for production workloads. Your Kafka consumer group equivalent.
Ephemeral consumer — no name, state lives only as long as someone is connected
Consumer created with no name
→ server tracks state only while subscribed
→ all subscribers disconnect → consumer deleted automatically after inactivity threshold
→ reconnect → starts fresh, no memory of previous positionWhen to use which?
Practical example
# Your stream monitor workers → durable
consumer: “events-processor”
→ crashes → restarts → picks up from last acked message
# You want to inspect what’s in the stream right now → ephemeral
connect → read last 100 messages → disconnect → consumer cleaned up
automaticallyOne gotchaDurable consumers persist even when nobody is connected. If you forget to delete them, they accumulate on the server. Always clean up durable consumers you no longer need.
The real difference between Kafka and JetStream — Ack granularity|
This is not about persistence. Both Kafka and JetStream are persistent durable logs. The fundamental difference is how you mark messages as done.
Kafka — watermark commit:
Messages: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Commit offset 10 = everything up to 10 is done
You CANNOT say “1,2,3,5 done but 4,6 failed”
It’s a watermark — all or nothing up to that pointJetStream — per message ack:
Messages: 1, 2, 3, 4, 5
Ack 1 ✓ → gone
Nak 2 → redelivered individually later
Ack 3 ✓ → gone
Term 4 → dead lettered
OOM kill before 5 → redelivered automaticallyEvery message is completely independent of every other message.
JetStream Message Distribution and Scaling
Unique distribution — each worker gets different messages. Same message never goes to two workers from the same consumer. Server tracks what’s in-flight to whom.
How tracking works — sequence numbers
JetStream uses sequence numbers, not offsets. Same concept, different name.
Stream: [msg1(seq:1), msg2(seq:2), msg3(seq:3) ...]Consumer tracks:
Last delivered sequence
Which sequences are acked (done)
Which sequences are pending (delivered but not acked yet)
Worker 1 pulls seq 1–50 → server marks in-flight to W1
Worker 2 pulls seq 51–100 → server marks in-flight to W2
Worker 1 acks → server marks done, moves cursor forward
Server never sends the same sequence to two workers from the same consumer.
NATS vs Kafka scaling philosophy
Kafka — plan partitions upfront. Max consumers = partition count. More throughput = repartition = rebalance = disruption.
NATS — no partitions. Just start more workers pointing at the same consumer. Server distributes on demand — whoever asks first gets next batch. More throughput = start another worker. No repartition, no rebalance, no disruption.
Ordering problem and solutions
By default — any worker can get any message. No per-key ordering guarantee across parallel workers.
Subject-per-entity pattern — give each entity its own subject. One consumer per subject means one worker always handles that entity’s messages. Ordering guaranteed. Only works for bounded known entities, doesn’t scale to millions of dynamic keys.
Partitioned consumer group library (Orbit) — client-side library that mimics Kafka consumer groups. Splits stream into virtual partitions by key, assigns workers to partitions. New, not widely used, adds complexity. Only reach for it when you need strict per-key ordering AND parallel processing simultaneously.
For most workloads — neither is needed. Multiple workers on the same consumer is enough. Handle any ordering requirements at the application level.
Exactly-once — when you can’t tolerate duplicates
Two parts work together:
Publisher side — deduplication
Publisher attaches a unique ID to every message (Nats-Msg-Id header). Server remembers IDs for a configurable time window. If publisher sends the same ID twice — server rejects the duplicate silently. Publisher can retry safely without creating duplicates.
Consumer side — double ack
Normal ack is one message: consumer → server. In exactly-once mode it becomes a handshake — server acks back to consumer confirming it received the ack. Eliminates the “ack lost in network” scenario.
When do you actually need exactly-once?
Financial transactions, inventory updates, anything where processing twice causes real damage. For telemetry, sensor data, metrics — at-least-once with idempotent processing is almost always enough and simpler.
Idempotent processing — the practical shortcut!
Instead of implementing full exactly-once machinery, most teams just make their message handler idempotent:
Message: "set order status to fulfilled" → idempotent, safe to repeat Message: "increment counter by 1" → NOT idempotent, dangerous to repeat Message: "insert row with this ID" → idempotent if you use upsert
Design your processing so running it twice = same result as running it once. Then at-least-once delivery is effectively exactly-once in practice.
Atomic Batch Publishing (NATS 2.12+)
The problem it solves
Without atomic batching, each publish is independent. Publish 5 related messages, connection drops after 3 → partial write → system in inconsistent state. No way to tie multiple publishes together as one unit.
What it gives you
All or nothing. Either every message in the batch commits together, or none do. Messages are staged invisibly on the server — consumers see nothing until the commit lands.
How it works — three headers coordinate everything
Every message in a batch carries:
Nats-Batch-Id— unique identifier for the batch (max 64 chars)Nats-Batch-Sequence— incrementing number per message, starting at 1Nats-Batch-Commit— only on the last message, triggers atomic commit of everything staged
Intermediate messages get zero-byte acks — lightweight, no data. Final message gets a full pub ack with batch metadata. All messages appear in the stream simultaneously on commit.
Must be enabled on the stream:AllowAtomicPublish: true in stream config. Off by default.
Optimistic locking
Optional consistency check on the first message of a batch:Nats-Expected-Last-Sequence — only commit if stream’s last sequence matches this number. Someone else published while you were building your batch → sequence moved → batch fails → you retry with fresh data. Same optimistic concurrency control as KV store compare-and-swap.
Limits
Max 1,000 messages per batch
50 concurrent batches in flight per stream
1,000 concurrent batches in flight per server
10 seconds inactivity → batch abandoned and discarded
Batches cannot cross streams — all messages must target the same stream
All limits are configurable in server config
Use cases
Event sourcing — single command produces multiple events that must all land together or none. Fan-out — different messages to different subjects (e.g. orders.warehouse, orders.payment, orders.notification) committed atomically. All subscribers receive their message at the same time. If any message fails, none are delivered.
vs Kafka producer transactions
The gap that remains — Kafka can atomically commit consumed offsets AND produced messages in one transaction (consume-transform-produce). JetStream atomic batching is publish side only. No equivalent for tying acks to publishes atomically. Handle at application level with idempotency keys.
Atomic Batch — Subject vs Stream scope clarification
Multiple subjects in one batch → ✓ works, as long as they belong to the same stream.
Multiple streams in one batch → ✗ not possible.
What it CAN do
Stream “ecommerce-events” bound to: ecommerce.>
One atomic batch:
orders.payment.processed → confirmed
inventory.updated → reserved
notifications.sent → queuedAll different subjects, one stream, one atomic commit. All land together or none do.
What it CANNOT do
Stream “ecommerce-events” → ecommerce-events.>
Stream “ecommerce-audit” → audit.>
Atomic batch across both:
orders.payment.processed → publish order event
audit.orders.payment → publish audit log
→ Impossible. Two streams = not atomic.Kafka producer transactions can span completely separate topics regardless of organization. NATS has no equivalent for cross-stream atomicity — handle consistency between streams at the application level with idempotency keys.
Practical implication
If your architecture separates concerns into different streams — telemetry in one, audit trail in another, dead letter in another — you cannot atomically write across them. Design your stream boundaries carefully with this in mind. If two things must always land together, they must live in the same stream.
Language support
Manual construction — not complex, just boilerplate. Generate a batch UUID, set three headers per message, track sequence number yourself, add commit header on last message. What Orbit abstracts away for Go/Java/Rust is exactly this header management — the protocol itself is simple.
Maturity note
Very new. Current release focuses on correctness and atomicity — not performance. Future releases will optimize throughput. Foundation is solid but battle testing at scale is limited. Verify your NATS server version is 2.12+ before using.
JetStream Retention and Message Expiry
JetStream is a durable log with configurable retention — not a permanent database. Messages leaving the stream is intentional, not a bug.
Ways a message can disappear:
No cron jobs needed — all of the above happen automatically. Just configure the limits and NATS handles cleanup.
Retention — JetStream vs Kafka
The key philosophical difference
Kafka retention is purely time/size based. Kafka doesn’t care whether messages were consumed — message sits for 7 days, gets deleted regardless. Consumer offset is the consumer’s problem.
JetStream retention can be consumption-aware. Interest policy keeps messages until all consumers ack. WorkQueue deletes on ack. The stream reacts to what consumers are doing, not just a clock.
Important — at-least-once guarantee interaction
If a message expires from the stream before your consumer processes it — it’s gone. No redelivery. The at-least-once guarantee only holds while the message exists in the stream.
Rule of thumb — MaxAge should always be orders of magnitude larger than your AckWait and MaxDeliver window. Never let retention expire a message that’s still waiting to be processed.
Pull Consumer Priority Groups (NATS 2.11+)
New mechanism for controlling how multiple workers pull from the same consumer. Before 2.11 — random distribution only, no control. Now you can define priority and overflow behavior.
Two mechanisms:
Overflow — primary workers handle all messages. When overwhelmed, messages spill to secondary workers. Triggered by thresholds:
min_pending— too many messages waiting → overflow kicks inmax_ack_pending— too many unacked messages → overflow kicks in
Pinning — exactly one client receives all messages. Others on standby. Active client disconnects → standby takes over automatically. High availability without running multiple active consumers.
Use cases:
Zone/region awareness — prefer local workers, overflow to remote only when local is overwhelmed. Reduces cross-AZ latency and cost.
Active-passive failover — one worker handles everything, others are hot standbys.
Tiered processing — fast workers handle normal load, slow/expensive workers only activated under pressure.
Important — only available on pull consumers. Not available on legacy JetStream API — requires the new consumer API. NATS 2.11+ server required.
Consumer Pause (NATS 2.11+) — JetStream only
Consumer pausing lets you temporarily halt message delivery to any consumer until a deadline you specify — without losing data, throwing errors, or disconnecting clients.
How it works
Set a PauseUntil deadline — either at consumer creation or via the pause API at runtime. Message delivery stops immediately. The stream still receives messages — they’re just queued up waiting. When the deadline passes, delivery resumes automatically and the consumer catches up on everything published during the pause.
Clients keep receiving heartbeats the whole time — so your application doesn’t know it’s paused. No errors, no disconnects, no code changes needed.
One gotcha — if you have a pull consumer and you pull a large number of messages in one batch, the consumer won’t stop consuming messages from that batch. It will only stop after the entire batch has been consumed and it attempts to pull the next batch.
Pause state is persisted to Raft — server restart does not unpause a paused consumer.
Use cases
Maintenance windows — pause before deploying, resume after
Database migrations — stop processing while schema changes run
Emergency debugging — freeze message flow while investigating production issue
Overwhelmed downstream — give a struggling dependency time to recover without killing your consumer
Notes
Works on both push and pull consumers
Can extend or shorten pause duration at any time by calling pause again with a new deadline
Must specify a deadline — pause is always time-bounded, not indefinite
Monitor paused consumers — easy to forget one is paused and wonder why processing stopped
NATS 2.11+ required
Stream Multi-Get (NATS 2.11+) — JetStream only
Batch retrieval of messages directly from a stream without creating a consumer. One request, multiple messages back, in sequence order.
How it’s invoked — direct API call to the JetStream server. One request/reply round trip. You send a request with your subject filters and optional bounds, server responds with matching messages. No subscription, no consumer, no ongoing connection. Closest mental model is a REST GET call — stateless, one shot, results returned immediately.
MultiLastFor — provide a list of subject filters, get the latest message per subject in one round trip.
Optional bounds:
UpToSeq— only return messages up to this sequence numberUpToTime— only return messages up to this timestamp
No consumer created — no cursor, no ack tracking, no server state. Pure read. For inspection and snapshotting, not for processing workloads.
Use cases — read latest state across multiple subjects in one call, dashboard snapshots, debugging stream contents without spinning up a consumer.
Connection Resilience — Core NATS and JetStream
The problem
Networks are unreliable. Clients lose connection to NATS server at any time — network blip, server restart, NAT timeout killing idle connections. What happens to your application when this occurs determines whether you lose data or recover silently.
Auto-reconnect — built in, no code needed
Every official NATS client including nats.py automatically reconnects on connection drop. Backoff strategy built in — retries with increasing delays. Your application keeps running during reconnect unless you register a disconnect callback.
On reconnect automatically:
Subscriptions re-registered
Pending outbound publishes buffered and flushed
Application code unaware unless you hook disconnect callbacks
JetStream consumers on reconnect
Consumer cursor lives on the server, not the client. So:
Client disconnects mid-processing
→ unacked messages → AckWait expires
→ JetStream redelivers automatically
→ client reconnects → re-subscribes to same durable consumer
→ picks up exactly where it left off
→ zero messages lost, zero manual recoveryCore NATS subscribers on reconnect — start fresh. Miss everything published during the disconnect. No recovery. This is another reason to use JetStream for anything that matters.
Ping/Pong — dead connection detection
Server pings client every N seconds. Client responds with pong. No pong within deadline → connection declared dead and closed. Detects failures in seconds not minutes — this is the fix for TCP’s slow failure detection problem covered in the TCP section.
Configurable:
Ping interval — how often server pings
Max pings outstanding — how many missed pongs before declared dead
Connection event callbacks
Hooks for your application code to react to connection events:
Useful for logging, metrics, updating circuit breaker state, alerting.
Kafka comparison — connection resilience
The key difference — Kafka reconnect triggers a consumer group rebalance. Other consumers pause while partitions are reassigned. NATS durable consumer reconnect is invisible to everyone else — no rebalance, no pause, no impact on other workers. The cursor is on the server, so the worker just picks up where it left off the moment it reconnects.
Graceful Shutdown
Close vs Drain — the critical distinction
close() — hard stop. Cuts connection immediately. Messages in outbound buffer, in-flight subscriptions, pending publishes — all lost. Equivalent of kill -9 on your connection.
drain() — graceful stop. Three phase process:
Stop accepting new messages on all subscriptions
Finish processing everything already received
Flush all pending outbound publishes
Then close
Nothing is lost. Everything in flight completes before the connection dies.
Always drain in production. Never close directly.
Why this matters with Kubernetes
Kubernetes sends SIGTERM before killing a pod. This is your signal to drain gracefully before the hard SIGKILL arrives (default 30 seconds later).
The pattern:
Signal: SIGTERM received
→ stop pulling new messages from JetStream
→ finish processing in-flight batch
→ ack/nak all in-flight messages
→ drain NATS connection (flushes outbound)
→ exit cleanly
If SIGKILL arrives before drain completes:
→ unacked messages → AckWait expires
→ JetStream redelivers to next available worker
→ no data lossNATS drain + JetStream per-message ack = no permanent message loss on pod shutdown even if Kubernetes kills you before you finish. The two features combine perfectly — drain handles the network layer, JetStream handles the message layer.
This is why JetStream + Kubernetes is a natural fit. You get graceful shutdown for free by combining two primitives that were designed independently but compose perfectly.
Practical checklist for production
Always use
drain()notclose()on shutdownRegister
SIGTERMhandler that triggers drainSet Kubernetes
terminationGracePeriodSecondslonger than your worst-case processing timeSet NATS
MaxPingsOutstandingand ping interval to detect dead connections fastLog on
disconnected_cbandreconnected_cbfor visibilityUse durable consumers so reconnects are seamless
Ordering at Scale — Partitioned Consumer Groups (Orbit)
One gap covered earlier in this post is strict per-key ordering across parallel workers. The subject-per-entity pattern solves ordering but doesn’t give you elastic auto-scaling. The Orbit library’s pcgroups module solves exactly this.
Orbit’s partitioned consumer groups allow you to parallelize consumption of messages from a stream while ensuring strict ordering for all messages with the same key — functionally equivalent to what Apache Kafka calls consumer groups.
Two modes available:
Static membership is ideal for scenarios where you have a fixed set of consumers, while elastic membership allows consumers to join and leave dynamically, with the group automatically rebalancing.
Requires NATS server version 2.11 or above.
Language support: Go, Rust, Java, .NET, TypeScript. Python not yet available — worth watching the Synadia roadmap.
When to use: Per-key ordered processing with parallel workers — document delta streams, per-customer event ordering, per-device telemetry sequencing.
When to skip: When you require strict global ordering across all messages on stream, or when your stream is relatively low-throughput and the overhead of partitioning is unnecessary.
NATS 2.12 New Features — JetStream only
Prioritized Pull Consumer Policy
Third policy added on top of 2.11’s overflow and pinning. All three exist to control which workers get messages when multiple workers pull from the same consumer.
The gap overflow had — it waits for a threshold before routing to secondary workers. If no local workers are actively pulling and the threshold isn’t hit quickly, messages sit waiting. Prioritized fixes this — higher priority workers get messages immediately without waiting for any threshold.
Trade-off: faster failover but work can bounce between workers if priorities keep changing — called flip-flopping. Overflow is more stable, prioritized is more responsive.
All three compared:
NATS 2.12+ required. Pull consumers only. New consumer API only.
Distributed Counter CRDT (2.12+)
What is a CRDT first — Conflict-free Replicated Data Type. A data structure designed so that multiple nodes can update it independently and concurrently without coordination, and the results will always converge to the same value when merged. No locks, no leader needed. Math guarantees consistency.
In NATS — enable AllowMsgCounter on a stream. Each subject in the stream becomes an atomic distributed counter. Any worker anywhere can increment or decrement it. The server merges all updates correctly regardless of order or network partitions. Counters are implemented as big.Int — no size limit.
The powerful part — counter streams can be mirrored and sourced across clusters. You can aggregate counters from multiple regions into one global counter, or create verbatim copies per region. Aggregation behavior is configurable.
Use cases — distributed rate limiting, global tallying across services, aggregating counts across regions or clusters. No Redis needed for counters. No coordination code needed in your application.
Delayed Message Scheduling (2.12+)
Publish a message now, have it delivered to consumers after a specified delay. Enable AllowMsgSchedules on stream config, attach a delay header to the message. Server holds the message and makes it available to consumers only after the delay expires.
Before 2.12 — teams used Nak-with-delay as a workaround. That was always a hack. The native scheduler supersedes it entirely and is built on the per-message TTL infrastructure so it handles large numbers of scheduled messages efficiently.
Current limitation — single one-off delays only. Cron-style repeated scheduling is planned for future releases but not yet available. (https://github.com/nats-io/nats-server/discussions/7672#discussioncomment-15311571)
Use cases — job scheduling, delayed notifications, retry-after patterns, deferred processing, anything where a message should not be acted on immediately after publishing.
Data Stores — KV store & Object Store
Screenshot from Synadia's Interactive NATS Guide at nats.synadia.com
NATS KV Store
Built on top of JetStream. Every bucket = a JetStream stream internally. Keys map to subjects, values map to message payloads, revisions map to sequence numbers.
Operations — get, put, delete, purge, watch, history, keys
Bucket config — max history per key, TTL, max bucket size, replication factor, storage type (file or memory)
Watch — native reactive notifications on key changes. Subscribe to a pattern like user.123.> and get notified instantly on any change. No polling. Just a JetStream consumer with a subject filter under the hood.
TTL — configured per bucket and per key. Keys expire automatically.
History — keeps N previous values per key. Every write increments a revision number (r1, r2, r3…).
vs Redis
Durability — FileStorage + Raft replication across cluster nodes. Same guarantees as JetStream. Survives node failures and restarts.
Optimistic concurrency control — Compare and Swap
Every value has a revision number. Put with last known revision — if revision doesn’t match (someone else updated it), operation rejected. No locks, no blocking, just conflict detection. Your code re-reads and retries.
Worker A reads revision 5 → updates → becomes revision 6 ✓ Worker B reads revision 5 → tries update → revision is 6 now → rejected ✗ → re-read → retry
Can you use it as a DB? — no. Good for config, feature flags, distributed state, coordination, service discovery. Bad for complex queries, large datasets, relational data. Think etcd, not PostgreSQL.
Good for circuit breaker state shared across worker processes, configuration hot reload, and worker coordination.
NATS Object Store
Also built on JetStream. Designed for storing large binary files — documents, binaries, large configs.
Chunks files into small pieces, stores each chunk as a stream message. A 1GB file never travels as one payload — streamed in chunks transparently. Same concept as MongoDB GridFS.
Versioning — no. One version per named object. Overwrite = old one gone. No revision history unlike KV store.
Large files and performance — chunking means large files don’t block or slow down the messaging side. Each chunk is a normal JetStream message. NATS handles it the same way it handles any other stream data.
Use cases — storing firmware binaries, config files, model weights, any large blob that needs to live alongside your messaging infrastructure without spinning up a separate file storage system.
The Kafka migration checklist — a clean mapping section at the end before the governance note. Something like:
Migrating from Kafka to NATS JetStream — Mental Model
Photo by Julia Craice on Unsplash
The concepts map, but the philosophy shifts. Here’s what to rethink:
What you gain:
No partition planning
No rebalancing pauses
Per-message ack granularity
DLQ built in
KV store and Object Store in the same system
What you need to rethink:
No consumer group concept — pull consumers scale by sharing same consumer name
Subject naming matters more than topic naming — design it upfront
Ordering per key requires explicit design — not guaranteed by default
Limitations and where Kafka still wins
1. Partitioning, consumer assignment, and ordering
NATS absolutely simplified a lot for me, but calling it a full Kafka replacement would be too broad. The biggest gap is that JetStream still does not map cleanly to Kafka’s partition-centric model. NATS’ own docs say that core queue groups and JetStream durable consumers are partition-less and non-deterministic, so if a workload depends on strict per-key ordering with horizontal scale, the default model requires manual patterns rather than built-in partition assignment the way Kafka handles it natively. Orbit’s pcgroups module addresses this directly as covered above — but it is a client-side library, not server-native behavior, and Python support is not yet available as of April 2026. — Refer above section “Ordering at Scale — Partitioned Consumer Groups (Orbit)”
Kafka still treats partitions and offsets as first-class server-native consumption units. For workloads requiring strict global ordering across all messages regardless of key, Kafka remains the stronger choice. (1)(2)(3)
2. Batch publish is better now, but it is still not Kafka transactions
Some older criticism of JetStream also needs updating. JetStream supports AckAll, which lets an acknowledgment for a later message acknowledge earlier pending messages in the same series, and NATS 2.12 introduced AllowAtomicPublish for atomically publishing a batch of messages into a stream. But that still is not the same thing as Kafka’s transactional model: Kafka producers can send consumer offsets into the same transaction, and those offsets are only committed if the transaction itself commits successfully. So if your architecture depends on atomic read-process-write flows or tighter stream-processing semantics, Kafka still has stronger built-in primitives. (4)(5)(6)
3. Ecosystem and managed-service maturity
The operational story has trade-offs too. Managed NATS does exist through Synadia Cloud, and Synadia also supports managing your own NATS systems through that platform. But Kafka currently has more obvious first-party major-cloud managed options, including Amazon MSK, Azure Event Hubs with Kafka protocol support, and Google Cloud Managed Service for Apache Kafka. There is also no Kafka API drop-in compatibility claim from NATS itself that I could verify; the official integration path I found is a NATS–Kafka bridge. And while Python is officially supported, the Python client story is still evolving: nats.py remains the main Python client, nats-core only had its initial release in February 2026, and there is an open nats.py issue around atomic batch publish. (7)(8)(9)(10)(11)(12)(13)(14)
4. Durability depends heavily on version and configuration
Finally, durability claims should be treated as version- and configuration-sensitive. Jepsen’s analysis of NATS 2.12.1 found scenarios involving acknowledged-write loss and persistent split-brain under certain failures. NATS’ own docs also say JetStream’s file-based streams do not immediately fsync by default; instead, the server uses a configurable sync_interval, with a default of two minutes, and sync_interval: always provides stronger durability at a throughput cost. So for business-critical durability, the right takeaway is not that JetStream is bad, but that you should validate the exact failure model, version, replication settings, and fsync strategy you plan to run in production. (15)(16)
References
(1) NATS Docs, “Subject Mapping and Partitioning.”
(2) Apache Kafka Docs, “Design.”
(3) Apache Kafka Docs, “KafkaConsumer — Offsets and Consumer Position.”
(4) NATS Docs, “Consumers.” (AckAll)
(5) NATS Docs, “NATS 2.12.” (AllowAtomicPublish)
(6) Apache Kafka Docs, “KafkaProducer — sendOffsetsToTransaction.”
(7) Synadia Docs, “Synadia Cloud Overview.”
(8) Synadia Docs, “Synadia Cloud Walkthrough.”
(9) AWS Docs, “What is Amazon MSK?”
(10) Microsoft Docs, “Apache Kafka Protocol Support in Azure Event Hubs.”
(11) Google Cloud Docs, “Managed Service for Apache Kafka Overview.”
(12) GitHub, “NATS-Kafka Bridge.”
(13) NATS Docs, “Developing With NATS.” (Python listed as Synadia-supported)
(14) GitHub, nats.py / nats-core / Issue #811.
(15) Jepsen, “NATS 2.12.1.”
(16) NATS Docs, “JetStream” and “Configuring NATS Server.” (sync_interval, fsync behavior)
A Note on NATS Governance
Photo by Charles Forerunner on Unsplash
In April 2025, Synadia — the primary maintainer of NATS — attempted to withdraw the project from CNCF and move future releases to a Business Source License. It got messy fast. By May 1, 2025 it was resolved — Synadia confirmed no BSL fork, Apache 2.0 stays, and the trademark was formally transferred to the Linux Foundation. You can read the full story here.
As of April 2026 the project is actively maintained, shipping on a 6-month release cycle, and firmly under CNCF stewardship. The drama actually left things in a structurally better place than before — trademark locked with Linux Foundation, Apache 2.0 explicitly reconfirmed, community pressure demonstrated how widely adopted NATS actually is.
The residual risk worth knowing: Synadia is still essentially the only major contributor. If they fold, NATS loses its primary driver. But that’s true of most CNCF projects. For self-hosted Apache 2.0 deployments you’re fully insulated from whatever happens commercially with Synadia.
Eyes open — but safe to bet on.




























