
Kafka vs RabbitMQ: Lessons from Getting It Wrong

April 17, 2026

When I first started working with distributed systems, I treated Kafka and RabbitMQ as interchangeable tools — both move messages from A to B, right? I picked whichever one the team had already set up, or whichever had better documentation for the language I was using.

That worked, until it didn't.

The first time it really bit me was on a project where we were processing payment events. We used RabbitMQ because the team knew it well. A few months in, we needed to add a fraud detection service — and it needed to analyze the last 30 days of payment history. The events were gone. Consumed, acknowledged, deleted. We had to rebuild the event history from database audit logs, which took weeks and was about as fun as it sounds.

The problem wasn't RabbitMQ. The problem was that I didn't understand the fundamental difference in what these tools are designed to do. This post is my attempt to explain that difference clearly, so you don't have to learn it the same way I did.


They're not the same kind of tool

Before comparing features, you need to understand what each tool is at its core.

RabbitMQ is a message broker. Its job is to receive messages and make sure they get delivered to the right consumer. Once a consumer processes a message and acknowledges it, RabbitMQ's job is done — the message is gone. It's like a post office: letters come in, letters go out, nobody keeps copies.

Kafka is a distributed commit log. Its job is to durably record events in order and let any number of consumers read them at their own pace — including consumers that didn't exist when the event was written. Messages are retained for days, weeks, or forever. It's less like a post office and more like a ledger: every transaction is recorded permanently, and anyone with access can read the history.

This single difference cascades into almost every other design decision.


Architecture: how data flows

RabbitMQ

RabbitMQ: Producer → Exchange routes to Queues → Consumers (one message → one consumer, competing consumers)

Producers send messages to an exchange. The exchange routes messages to queues based on routing rules (bindings). The broker then pushes messages to subscribed consumers (subject to prefetch limits). When a consumer acknowledges a message, it's removed from the queue.

RabbitMQ supports four exchange types:

- Direct: routes to queues whose binding key exactly matches the routing key
- Fanout: broadcasts to every bound queue, ignoring the routing key
- Topic: pattern-matches routing keys (order.*, logs.#)
- Headers: routes on message header values instead of the routing key

Kafka

Kafka: Producer → Topic (partitions as append-only log) → Consumer Groups

Producers write to topics, which are split into partitions (ordered, append-only logs). Consumers pull messages at their own pace and track their position via offsets. Multiple consumer groups can read the same topic independently — each group maintains its own offset.
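The offset mechanics are easier to see in a toy model. This sketch is plain Python with no Kafka client; Topic, produce, and poll are made-up names. Partitions are in-memory lists, and each consumer group keeps its own offsets:

```python
from collections import defaultdict

class Topic:
    """Toy model of a Kafka topic: partitions are append-only lists,
    and each consumer group tracks its own offset per partition."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self.offsets = defaultdict(lambda: [0] * num_partitions)  # group -> offsets

    def produce(self, key, value):
        # The same key always lands in the same partition, preserving per-key order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)

    def poll(self, group, partition):
        # Each group reads at its own pace; consuming never deletes the record.
        off = self.offsets[group][partition]
        if off >= len(self.partitions[partition]):
            return None  # caught up
        self.offsets[group][partition] = off + 1
        return self.partitions[partition][off]

topic = Topic(num_partitions=2)
for i in range(4):
    topic.produce(key="user-1", value=f"event-{i}")

p = hash("user-1") % 2
group_a = [topic.poll("analytics", p) for _ in range(4)]
group_b = [topic.poll("billing", p) for _ in range(2)]
print(group_a)  # ['event-0', 'event-1', 'event-2', 'event-3']
print(group_b)  # ['event-0', 'event-1'], billing lags; analytics is unaffected
```

Note that "billing" falling behind costs "analytics" nothing, and the records stay in the log either way.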

Kafka doesn't have routing logic. If you need a message to go to specific consumers, you either use separate topics or handle filtering in the consumer.


Side-by-side comparison

| | RabbitMQ | Kafka |
|---|---|---|
| Mental model | Post office | Ledger / audit log |
| Message lifetime | Until consumed + acked | Configurable retention (hours to forever) |
| Consumer model | Push (broker → consumer) | Pull (consumer reads at own pace) |
| Ordering | Per-queue FIFO | Per-partition ordering |
| Routing | Rich (exchanges, bindings, routing keys) | None (topic-based only) |
| Replay | Not possible by default | Yes — rewind consumer offset |
| Multiple consumers | Competing consumers (one wins per message) | Fan-out across consumer groups |
| Throughput | ~50k–200k msg/s | Millions msg/s |
| Latency | Sub-millisecond | Low milliseconds |
| Message size | No enforced limit (practical: < 128 MB) | Default 1 MB (configurable) |
| Protocol | AMQP, STOMP, MQTT | Custom binary protocol over TCP |
| Dead letter handling | Built-in DLX/DLQ | Manual (separate topic convention) |
| Persistence | Optional per-queue/message | Always (log) |
| Schema enforcement | None built-in | Schema Registry (Confluent) |

When to use RabbitMQ

RabbitMQ shines when you need to distribute work and make sure each task is handled by exactly one worker.

Task queues and work distribution

The classic use case. You have a pool of workers and a stream of jobs to process. RabbitMQ distributes jobs across workers, and only one worker gets each job.

Worker queue: jobs distributed across available workers

Real example: resizing uploaded images. Users upload photos, your API puts a resize job on the queue, and any available worker picks it up. If a worker crashes mid-job, RabbitMQ redelivers the unacknowledged message to another worker. Exactly what you want.

Request-reply (RPC over messaging)

RabbitMQ has a clean pattern for synchronous-style request/response over async messaging. The producer includes a reply_to queue and a correlation_id. The consumer processes and replies to that queue.

Useful when you need the response but want the decoupling — for example, a service that calls a pricing engine and needs the price back before it can continue.
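The pattern is easy to see with in-memory queues standing in for the broker (queue.Queue instead of AMQP queues; the function names here, like get_price, are made up for illustration):

```python
import queue
import uuid

request_q = queue.Queue()  # stands in for the shared RPC request queue

def pricing_engine():
    """Consumer side: process one request and publish the answer to the
    request's reply_to queue, echoing the correlation_id so the caller
    can match the response to its request."""
    req = request_q.get()
    price = len(req["body"]) * 10  # stand-in for real pricing logic
    req["reply_to"].put({"correlation_id": req["correlation_id"], "price": price})

def get_price(item):
    """Producer side: send a request carrying reply_to and correlation_id,
    then block on the reply queue until the matching response arrives."""
    reply_q = queue.Queue()        # exclusive reply queue for this caller
    corr_id = str(uuid.uuid4())
    request_q.put({"body": item, "reply_to": reply_q, "correlation_id": corr_id})
    pricing_engine()               # in real life this runs in another process
    resp = reply_q.get()
    assert resp["correlation_id"] == corr_id  # in practice, discard mismatches
    return resp["price"]

print(get_price("widget"))  # 60
```

With real RabbitMQ the mechanics are the same, just with reply_to and correlation_id carried as AMQP message properties.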

Complex routing logic

If you need to selectively route messages to different consumers based on content, RabbitMQ's exchange types handle this natively.

Example: an order service publishes order.created, order.shipped, order.cancelled events. Your notification service only subscribes to order.shipped. Your accounting service subscribes to order.created and order.cancelled. Your warehouse gets order.created and order.shipped. With a topic exchange and routing key patterns, this is a few lines of configuration.
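Topic-exchange matching treats the routing key as dot-separated words, where * matches exactly one word and # matches zero or more. This is a sketch of just those matching semantics, not RabbitMQ code:

```python
def topic_match(pattern, routing_key):
    """Match an AMQP topic-exchange binding pattern against a routing key.
    '*' matches exactly one dot-separated word; '#' matches zero or more."""
    def rec(pat, key):
        if not pat:
            return not key  # both exhausted -> match
        if pat[0] == "#":
            # '#' can absorb any number of remaining words, including none
            return any(rec(pat[1:], key[i:]) for i in range(len(key) + 1))
        if not key:
            return False
        if pat[0] in ("*", key[0]):
            return rec(pat[1:], key[1:])
        return False
    return rec(pattern.split("."), routing_key.split("."))

# Bindings from the order-service example above
print(topic_match("order.shipped", "order.shipped"))  # True  (notifications)
print(topic_match("order.*", "order.created"))        # True  (one-word wildcard)
print(topic_match("order.#", "order.created.eu"))     # True  (multi-word wildcard)
print(topic_match("order.created", "order.shipped"))  # False
```

Each binding in the example is one pattern like this; the exchange evaluates them for every published message.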

When messages have a natural "done" state

If a message represents a task with a finite lifecycle — "send this email", "resize this image", "charge this card" — and once it's done you genuinely don't care about it anymore, RabbitMQ is a great fit. The auto-deletion keeps your system clean.

Gotchas to watch for

Unacknowledged messages pile up. If your consumer crashes or is slow, messages sit in an unacknowledged state and consume broker memory. Always set a prefetch limit on your consumers.

Quorum queues vs classic queues. Classic mirrored queues have known consistency issues. For production, use quorum queues (added in RabbitMQ 3.8). They're slower but safe.

You can't replay. If a consumer has a bug and acks messages it shouldn't have, those messages are gone. Have a plan for this — usually means logging events somewhere durable separately.


When to use Kafka

Kafka shines when you need to record that something happened, and multiple systems need to know about it — including systems that don't exist yet.

Event sourcing and audit logs

This was the lesson from my payment story. Any time you have business events that you might need to reprocess, replay, or analyze later, Kafka's retention makes this trivial.

Example: every time a user changes their account settings, you publish an AccountSettingsChanged event to Kafka. Your notification service sends a confirmation email. Your security service logs the change. Six months later, you build a compliance feature that needs to know every settings change in the last year — you just rewind the consumer offset.

Multiple independent consumers

When more than one system cares about the same event, Kafka's consumer groups model is elegant. Each consumer group maintains its own offset and reads at its own pace, without blocking or competing with other groups.

Kafka fan-out: all consumer groups receive all events at their own pace

All three services get every order event, independently. If your analytics service is down for maintenance, it just falls behind on its offset and catches up when it restarts — without affecting billing or email.

High-throughput event streams

If you're processing millions of events per second — clickstream data, IoT sensor readings, application logs, metrics — Kafka is built for this. RabbitMQ would need a massive cluster to match, and it still wouldn't match Kafka's sequential disk write performance.

Example: a ride-sharing app tracking vehicle positions every second across 100,000 drivers. That's 100k events/second minimum, with multiple consumers (dispatch, ETA calculation, surge pricing, heatmaps). This is a Kafka workload.

Stream processing

Kafka integrates naturally with stream processing frameworks (Kafka Streams, Apache Flink, Spark Streaming). You can do stateful joins, windowed aggregations, and real-time enrichment on the event stream itself.

Example: detecting unusual login patterns. You stream login events to Kafka, run a Flink job that counts login attempts per user per 5-minute window, and emit fraud alerts for users who exceed a threshold. This is hard to do with RabbitMQ because you'd need external state and can't replay historical data to tune your model.
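The core aggregation is just counting events per (user, window) bucket. Here is a plain-Python sketch of a tumbling-window count; a real Flink or Kafka Streams job would run the same logic continuously over the stream rather than over a list:

```python
from collections import Counter

WINDOW = 300    # 5-minute tumbling windows, in seconds
THRESHOLD = 3   # max login attempts per user per window

def fraud_alerts(events):
    """Count login attempts per (user, window) and flag users who exceed
    THRESHOLD: the same aggregation a streaming job would maintain."""
    counts = Counter()
    for user, ts in events:
        counts[(user, ts // WINDOW)] += 1  # integer division buckets by window
    return {user for (user, _), n in counts.items() if n > THRESHOLD}

events = [("alice", 10), ("bob", 20), ("alice", 50), ("alice", 90),
          ("alice", 120), ("bob", 400)]
print(fraud_alerts(events))  # {'alice'}: 4 attempts in window 0
```

The replay point matters here too: with the events retained in Kafka, you can re-run this job with a different THRESHOLD over last month's data to tune it.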

Decoupled microservices that evolve independently

In a microservices architecture, services often need to react to each other's state changes. Kafka as an event backbone means services don't call each other directly — they just read from topics they care about.

This also means you can add new services later that consume historical events without touching any existing code.

Gotchas to watch for

Consumer lag is silent by default. If your consumer falls behind, Kafka doesn't yell at you — you need to monitor consumer group lag explicitly. Set up alerts before you need them.
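Lag itself is simple arithmetic: the partition's log-end offset minus the group's committed offset. A minimal sketch (the function name is mine, not a client API; real numbers would come from the admin client or a tool like Burrow):

```python
def consumer_lag(end_offsets, committed):
    """Per-partition lag = log-end offset minus the group's committed offset.
    This is the number Kafka will not alert you about on its own."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

# Hypothetical snapshot: partition -> offset
lag = consumer_lag({0: 120, 1: 95, 2: 110}, {0: 120, 1: 60, 2: 110})
print(lag)  # {0: 0, 1: 35, 2: 0}

# Alert on any partition more than 10 messages behind
alerting = [p for p, n in lag.items() if n > 10]
print(alerting)  # [1]
```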

Partition count is hard to change. You set partitions when you create a topic, and adding partitions later changes the key-to-partition mapping, which breaks per-key ordering across the change. Think carefully about partition count upfront; 30 is often a safe starting point.
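The reason is the default partitioner: it maps a key with hash(key) mod partition count, so changing the count re-maps keys. A quick demonstration (crc32 standing in for Kafka's murmur2 so the demo stays in the standard library):

```python
import zlib

def partition_for(key, num_partitions):
    """Kafka's default partitioner is hash(key) mod partition count;
    crc32 stands in for murmur2 here, and the effect is the same."""
    return zlib.crc32(key.encode()) % num_partitions

keys = [f"user-{i}" for i in range(100)]
moved = [k for k in keys
         if partition_for(k, 6) != partition_for(k, 8)]

# Going from 6 to 8 partitions re-maps most keys, so new events for a
# moved key land on a different partition than its old events:
# per-key ordering is lost across the change.
print(f"{len(moved)}/100 keys changed partition")
```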

Kafka is operationally heavy. A proper Kafka cluster (even with KRaft, which removes the ZooKeeper dependency) needs careful tuning of retention, replication factors, and JVM settings. Managed services (Confluent Cloud, AWS MSK, Upstash) save a lot of pain here.

Exactly-once is hard. Kafka supports exactly-once semantics (EOS) but it requires transactional producers and careful consumer design. Most teams get away with idempotent consumers + at-least-once delivery.
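An idempotent consumer can be as simple as remembering which message IDs it has already handled. This sketch uses an in-memory set; a production version would record the ID in the same database transaction as the side effect:

```python
processed = set()  # IDs we have already handled; use durable storage in production
charges = []       # stand-in for the real side effect (charging a card)

def handle(msg):
    """Idempotent consumer: a stable message ID makes redelivery harmless,
    which is usually enough under at-least-once delivery."""
    if msg["id"] in processed:
        return  # duplicate delivery; skip the side effect
    charges.append(msg["amount"])
    processed.add(msg["id"])

# At-least-once delivery: message 1 is redelivered after a consumer crash
deliveries = [{"id": 1, "amount": 50}, {"id": 1, "amount": 50}, {"id": 2, "amount": 30}]
for m in deliveries:
    handle(m)

print(sum(charges))  # 80, not 130: the duplicate was ignored
```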


Scaling

Scaling RabbitMQ

RabbitMQ scales primarily by adding consumers. Got a backlog? Spin up more workers. The broker distributes work across all connected consumers automatically.

RabbitMQ scaling: just add consumers — the broker distributes automatically

Broker-side scaling is more limited. You can cluster RabbitMQ nodes, but queues live on a single node by default (quorum queues are replicated). Federation and shovel plugins exist for cross-datacenter setups but add complexity.

The ceiling for a well-tuned RabbitMQ cluster is roughly in the hundreds of thousands of messages per second. Beyond that, you're fighting the broker.

Scaling Kafka

Kafka scales horizontally by adding partitions and brokers. Partitions are the unit of parallelism — you can have at most as many consumers in a group as partitions.
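The partition-to-consumer relationship is easy to simulate. This toy assign function uses round-robin (not Kafka's actual range or sticky assignors) but shows the constraint: each partition goes to exactly one consumer in the group, so consumers beyond the partition count sit idle:

```python
def assign(partitions, consumers):
    """Round-robin sketch of group assignment: every partition goes to
    exactly one consumer; extra consumers receive nothing."""
    mapping = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        mapping[consumers[i % len(consumers)]].append(p)
    return mapping

print(assign(list(range(6)), ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

print(assign(list(range(2)), ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': []}  (c3 sits idle)
```

This is why partition count caps consumer parallelism within a group, and why it deserves thought upfront.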

Kafka: 6 partitions across 3 brokers, each partition read by one consumer

Adding a broker and rebalancing partitions scales throughput linearly. LinkedIn, where Kafka was created, runs clusters processing trillions of messages per day.

The scaling story is genuinely better than RabbitMQ for high-throughput scenarios, but you pay for it in operational complexity.


How to choose: a decision guide

Decision guide — answer top to bottom:

1. Do multiple independent services need the same events? YES → Kafka
2. Do you need to replay events later or audit history? YES → Kafka
3. Is throughput > 100k msg/s? YES → Kafka
4. Do you need complex routing (topic patterns, headers)? YES → RabbitMQ
5. Do you need request-reply (RPC over messaging)? YES → RabbitMQ
6. Is the message a task with a natural "done" state? YES → RabbitMQ
7. Do you need sub-millisecond latency? YES → RabbitMQ

None of the above? Default to RabbitMQ — it's simpler to operate. Reach for Kafka when you know the event log will become valuable later (replay, audit, new consumers).


Real-world scenario mapping

| Scenario | Tool | Why |
|---|---|---|
| Email/SMS notification queue | RabbitMQ | Task-based, one consumer, done after send |
| Order processing pipeline | RabbitMQ | Work distribution, one worker per job |
| User activity tracking | Kafka | High volume, multiple consumers (analytics, recommendations, A/B) |
| Microservice event backbone | Kafka | Multiple consumers, replay, audit |
| IoT sensor data | Kafka | Massive throughput, time-series analysis |
| Image/video transcoding | RabbitMQ | Job queue, worker pool, task-based |
| Real-time fraud detection | Kafka | Stream processing, historical context |
| Audit log for compliance | Kafka | Retention, immutability, replay |
| Background job scheduling | RabbitMQ | Cron-like tasks, delayed messages |
| Change data capture (CDC) | Kafka | Debezium + Kafka is the standard pattern |

The mistake I actually made

Going back to that payment system: the team chose RabbitMQ for good reasons. We needed to distribute payment processing across workers. The messages were tasks — "process this payment". RabbitMQ was the right choice for that part.

The mistake was using the same payment events as the only record of what happened. We didn't also write them to Kafka (or even a simple database table). When we needed historical events a few months in, they were gone.

The real lesson wasn't "use Kafka instead of RabbitMQ". It was that these tools solve different problems, and sometimes you need both:

- RabbitMQ to distribute the payment-processing tasks across workers
- Kafka (or even a simple append-only database table) as the durable record of every event

Using both in the same system isn't over-engineering — it's using the right tool for each concern. RabbitMQ handles the task execution. Kafka handles the event history. Many mature architectures do exactly this.


Summary

If you're starting a new project and aren't sure: RabbitMQ is simpler to operate and reason about. Reach for Kafka when you need the event log semantics — you'll know when you need it, usually when someone asks "can we replay the last month of events?" and you realize you can't.