When I first started working with distributed systems, I treated Kafka and RabbitMQ as interchangeable tools — both move messages from A to B, right? I picked whichever one the team had already set up, or whichever had better documentation for the language I was using.
That worked, until it didn't.
The first time it really bit me was on a project where we were processing payment events. We used RabbitMQ because the team knew it well. A few months in, we needed to add a fraud detection service — and it needed to analyze the last 30 days of payment history. The events were gone. Consumed, acknowledged, deleted. We had to rebuild the event history from database audit logs, which took weeks and was about as fun as it sounds.
The problem wasn't RabbitMQ. The problem was that I didn't understand the fundamental difference in what these tools are designed to do. This post is my attempt to explain that difference clearly, so you don't have to learn it the same way I did.
They're not the same kind of tool
Before comparing features, you need to understand what each tool is at its core.
RabbitMQ is a message broker. Its job is to receive messages and make sure they get delivered to the right consumer. Once a consumer processes a message and acknowledges it, RabbitMQ's job is done — the message is gone. It's like a post office: letters come in, letters go out, nobody keeps copies.
Kafka is a distributed commit log. Its job is to durably record events in order and let any number of consumers read them at their own pace — including consumers that didn't exist when the event was written. Messages are retained for days, weeks, or forever. It's less like a post office and more like a ledger: every transaction is recorded permanently, and anyone with access can read the history.
This single difference cascades into almost every other design decision.
Architecture: how data flows
RabbitMQ
Producers send messages to an exchange. The exchange routes messages to queues based on routing rules (bindings). Consumers subscribe to queues, and the broker pushes messages to them (subject to prefetch limits). When a consumer acknowledges a message, it's removed from the queue.
RabbitMQ supports four exchange types:
- Direct — routes based on exact routing key match
- Fanout — broadcasts to all bound queues
- Topic — pattern matching on routing keys (orders.*.created)
- Headers — routes based on message header attributes
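To make the topic-exchange semantics concrete, here is a small illustration of how AMQP-style pattern matching works: `*` matches exactly one dot-separated word and `#` matches zero or more. This is a toy reimplementation of the matching rule for explanation, not the pika API.

```python
# Toy illustration of AMQP topic-exchange matching, where '*' matches
# exactly one dot-separated word and '#' matches zero or more words.

def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if routing_key matches the binding pattern."""
    def match(p, k):
        if not p:
            return not k            # both exhausted -> match
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' can absorb zero or more words
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))
```

So a queue bound with `orders.*.created` receives `orders.eu.created` but not `orders.created`, while `orders.#` receives everything under the `orders` prefix.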
Kafka
Producers write to topics, which are split into partitions (ordered, append-only logs). Consumers pull messages at their own pace and track their position via offsets. Multiple consumer groups can read the same topic independently — each group maintains its own offset.
Kafka doesn't have routing logic. If you need a message to go to specific consumers, you either use separate topics or handle filtering in the consumer.
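The mechanics above are easier to see in miniature. The sketch below is a toy model of Kafka's storage and consumption semantics, not a client library: a topic is a set of append-only lists, keys pick a partition, and each consumer group tracks its own offsets (which is also what makes replay possible).

```python
# Toy model of a Kafka topic: N append-only logs (partitions), with
# per-(group, partition) offsets tracked independently for each group.

class ToyTopic:
    def __init__(self, partitions: int):
        self.logs = [[] for _ in range(partitions)]
        self.offsets = {}  # (group, partition) -> next offset to read

    def produce(self, key: str, value: str):
        # Same key -> same partition, which is what preserves per-key ordering.
        p = hash(key) % len(self.logs)
        self.logs[p].append(value)

    def poll(self, group: str, partition: int):
        # Each group reads from its own offset, independent of other groups.
        off = self.offsets.get((group, partition), 0)
        batch = self.logs[partition][off:]
        self.offsets[(group, partition)] = len(self.logs[partition])
        return batch

    def seek(self, group: str, partition: int, offset: int):
        # "Replay" is nothing magic: just move the group's offset backwards.
        self.offsets[(group, partition)] = offset
```

Two groups polling the same partition each see every message, and rewinding a group's offset replays history without affecting anyone else.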
Side-by-side comparison
| | RabbitMQ | Kafka |
|---|---|---|
| Mental model | Post office | Ledger / audit log |
| Message lifetime | Until consumed + acked | Configurable retention (hours to forever) |
| Consumer model | Push (broker → consumer) | Pull (consumer reads at own pace) |
| Ordering | Per-queue FIFO | Per-partition ordering |
| Routing | Rich (exchanges, bindings, routing keys) | None (topic-based only) |
| Replay | Not possible (except via the newer Streams feature) | Yes — rewind consumer offset |
| Multiple consumers | Competing consumers (one wins per message) | Fan-out across consumer groups |
| Throughput | ~50k–200k msg/s | Millions msg/s |
| Latency | Sub-millisecond | Low milliseconds |
| Message size | No enforced limit (practical: < 128MB) | Default 1MB (configurable) |
| Protocol | AMQP, STOMP, MQTT | Custom TCP binary protocol |
| Dead letter handling | Built-in DLX/DLQ | Manual (separate topic convention) |
| Persistence | Optional per-queue/message | Always (log) |
| Schema enforcement | None built-in | Schema Registry (Confluent) |
When to use RabbitMQ
RabbitMQ shines when you need to distribute work and make sure each job is handled by exactly one worker.
Task queues and work distribution
The classic use case. You have a pool of workers and a stream of jobs to process. RabbitMQ distributes jobs across workers, and only one worker gets each job.
Real example: resizing uploaded images. Users upload photos, your API puts a resize job on the queue, and any available worker picks it up. If a worker crashes mid-job, RabbitMQ redelivers the unacknowledged message to another worker. Exactly what you want.
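That ack/redelivery contract is the heart of the work-queue pattern. Here is a toy sketch of it in plain Python (not the pika API): a message that was delivered but never acknowledged goes back on the queue when the worker dies.

```python
# Toy sketch of RabbitMQ's ack/redelivery contract: delivered-but-unacked
# messages are requeued if the worker disappears before acknowledging.

from collections import deque

class ToyQueue:
    def __init__(self):
        self.ready = deque()     # messages waiting for a worker
        self.unacked = {}        # delivery_tag -> in-flight message
        self.next_tag = 0

    def publish(self, msg):
        self.ready.append(msg)

    def deliver(self):
        # Hand the next message to a worker, tracked by a delivery tag.
        self.next_tag += 1
        msg = self.ready.popleft()
        self.unacked[self.next_tag] = msg
        return self.next_tag, msg

    def ack(self, tag):
        del self.unacked[tag]    # done: the broker forgets the message

    def worker_died(self, tag):
        # Crash before ack: the message goes back on the queue.
        self.ready.appendleft(self.unacked.pop(tag))
```

With real RabbitMQ, the requeue happens automatically when the worker's channel or connection closes; you would also call channel.basic_qos to cap how many unacked messages one worker holds.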
Request-reply (RPC over messaging)
RabbitMQ has a clean pattern for synchronous-style request/response over async messaging. The producer includes a reply_to queue and a correlation_id. The consumer processes and replies to that queue.
Useful when you need the response but want the decoupling — for example, a service that calls a pricing engine and needs the price back before it can continue.
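The moving parts of the pattern can be simulated with plain lists standing in for queues; with pika you would set correlation_id and reply_to as message properties instead. The pricing logic here is a made-up stand-in.

```python
# Sketch of the reply_to / correlation_id RPC pattern. Lists simulate
# queues; in real RabbitMQ these would be message properties on a publish.

import uuid

def rpc_call(request_queue, reply_queue, payload):
    # Client tags the request so it can recognise the matching reply.
    corr_id = str(uuid.uuid4())
    request_queue.append({"correlation_id": corr_id,
                          "reply_to": reply_queue,
                          "body": payload})
    return corr_id

def pricing_worker(request_queue):
    # Server consumes a request and replies to the queue named in
    # reply_to, echoing the correlation_id back.
    req = request_queue.pop(0)
    price = len(req["body"]) * 10   # stand-in for a real pricing engine
    req["reply_to"].append({"correlation_id": req["correlation_id"],
                            "body": price})
```

The client then matches incoming replies against the correlation id it stored, which is what lets one reply queue serve many concurrent requests.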
Complex routing logic
If you need to selectively route messages to different consumers based on content, RabbitMQ's exchange types handle this natively.
Example: an order service publishes order.created, order.shipped, order.cancelled events. Your notification service only subscribes to order.shipped. Your accounting service subscribes to order.created and order.cancelled. Your warehouse gets order.created and order.shipped. With a topic exchange and routing key patterns, this is a few lines of configuration.
When messages have a natural "done" state
If a message represents a task with a finite lifecycle — "send this email", "resize this image", "charge this card" — and once it's done you genuinely don't care about it anymore, RabbitMQ is a great fit. The auto-deletion keeps your system clean.
Gotchas to watch for
Unacknowledged messages pile up. If your consumer crashes or is slow, messages sit in an "unacknowledged" state and count against memory. Always set prefetch limits on your consumers.
Quorum queues vs classic queues. Classic mirrored queues have known consistency issues. For production, use quorum queues (added in RabbitMQ 3.8). They're slower but safe.
You can't replay. If a consumer has a bug and acks messages it shouldn't have, those messages are gone. Have a plan for this — usually means logging events somewhere durable separately.
When to use Kafka
Kafka shines when you need to record that something happened, and multiple systems need to know about it — including systems that don't exist yet.
Event sourcing and audit logs
This was the lesson from my payment story. Any time you have business events that you might need to reprocess, replay, or analyze later, Kafka's retention makes this trivial.
Example: every time a user changes their account settings, you publish an AccountSettingsChanged event to Kafka. Your notification service sends a confirmation email. Your security service logs the change. Six months later, you build a compliance feature that needs to know every settings change in the last year — you just rewind the consumer offset.
Multiple independent consumers
When more than one system cares about the same event, Kafka's consumer groups model is elegant. Each consumer group maintains its own offset and reads at its own pace, without blocking or competing with other groups.
Say billing, email, and analytics services all consume the same orders topic: each one gets every order event, independently. If the analytics service is down for maintenance, it simply falls behind on its offset and catches up when it restarts, without affecting billing or email.
High-throughput event streams
If you're processing millions of events per second — clickstream data, IoT sensor readings, application logs, metrics — Kafka is built for this. RabbitMQ would need a massive cluster to match, and it still wouldn't match Kafka's sequential disk write performance.
Example: a ride-sharing app tracking vehicle positions every second across 100,000 drivers. That's 100k events/second minimum, with multiple consumers (dispatch, ETA calculation, surge pricing, heatmaps). This is a Kafka workload.
Stream processing
Kafka integrates naturally with stream processing frameworks (Kafka Streams, Apache Flink, Spark Streaming). You can do stateful joins, windowed aggregations, and real-time enrichment on the event stream itself.
Example: detecting unusual login patterns. You stream login events to Kafka, run a Flink job that counts login attempts per user per 5-minute window, and emit fraud alerts for users who exceed a threshold. This is hard to do with RabbitMQ because you'd need external state and can't replay historical data to tune your model.
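As a stand-in for the Flink job described above, here is the core windowing logic in plain Python: bucket login events into tumbling 5-minute windows per user and flag users over a threshold. Real stream processors add durable state, late-event handling, and fault tolerance on top of exactly this idea.

```python
# Tumbling-window count of login attempts per user. A toy version of the
# windowed aggregation a Flink or Kafka Streams job would run.

from collections import Counter

WINDOW = 300  # window size in seconds (5 minutes)

def flag_suspicious(events, threshold):
    """events: iterable of (user, unix_timestamp) pairs.
    Returns the set of (user, window_start) buckets over the threshold."""
    counts = Counter()
    for user, ts in events:
        window_start = ts - (ts % WINDOW)   # assign event to its window
        counts[(user, window_start)] += 1
    return {key for key, n in counts.items() if n > threshold}
```

Because Kafka retains the raw events, you can rerun this logic over weeks of history to tune the threshold, which is exactly what you cannot do once RabbitMQ has deleted the messages.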
Decoupled microservices that evolve independently
In a microservices architecture, services often need to react to each other's state changes. Kafka as an event backbone means services don't call each other directly — they just read from topics they care about.
This also means you can add new services later that consume historical events without touching any existing code.
Gotchas to watch for
Consumer lag is silent by default. If your consumer falls behind, Kafka doesn't yell at you — you need to monitor consumer group lag explicitly. Set up alerts before you need them.
Partition count is hard to change. You set partitions when you create a topic, and adding partitions later breaks key-based ordering. Think carefully about partition count upfront: size it for your target throughput and peak consumer parallelism, and leave headroom, because you can only add partitions, never remove them.
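The ordering breakage follows directly from how the default partitioner works: key to hash(key) mod num_partitions. Change the partition count and the same key can land on a different partition, so old and new events for that key no longer share one ordered log. (Real Kafka uses murmur2 for this; zlib.crc32 is a deterministic stand-in here.)

```python
# Demonstrates why adding partitions breaks per-key ordering: the
# key -> partition mapping shifts when num_partitions changes.
# Kafka's default partitioner uses murmur2; crc32 is a stand-in.

import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions
```

Any key whose hash lands differently under the old and new modulus gets split across two partitions, and consumers relying on per-key ordering will see its events interleaved out of order.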
Kafka is operationally heavy. A proper Kafka cluster (even with KRaft, which removed Zookeeper) needs careful tuning of retention, replication factors, and JVM settings. Managed services (Confluent Cloud, AWS MSK, Upstash) save a lot of pain here.
Exactly-once is hard. Kafka supports exactly-once semantics (EOS) but it requires transactional producers and careful consumer design. Most teams get away with idempotent consumers + at-least-once delivery.
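The usual alternative mentioned above, at-least-once delivery plus an idempotent consumer, is simple to sketch: every event carries a unique id, and the consumer skips ids it has already processed. In production the seen-id store would be a database table or key-value store with the right transactional guarantees, not an in-memory set.

```python
# Idempotent-consumer sketch: at-least-once delivery becomes effectively
# once-only processing by deduplicating on a unique event id.

def make_idempotent(handler):
    seen = set()   # in production: a durable store, not process memory
    def wrapped(event):
        if event["id"] in seen:
            return False          # duplicate delivery: skip side effects
        handler(event)
        seen.add(event["id"])     # record only after the handler succeeds
        return True
    return wrapped
```

Recording the id only after the handler succeeds means a crash mid-handler causes a retry rather than a lost event, which is the at-least-once half of the bargain.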
Scaling
Scaling RabbitMQ
RabbitMQ scales primarily by adding consumers. Got a backlog? Spin up more workers. The broker distributes work across all connected consumers automatically.
Broker-side scaling is more limited. You can cluster RabbitMQ nodes, but queues live on a single node by default (quorum queues are replicated). Federation and shovel plugins exist for cross-datacenter setups but add complexity.
The ceiling for a well-tuned RabbitMQ cluster is roughly in the hundreds of thousands of messages per second. Beyond that, you're fighting the broker.
Scaling Kafka
Kafka scales horizontally by adding partitions and brokers. Partitions are the unit of parallelism: within a consumer group, each partition is read by exactly one consumer, so any consumers beyond the partition count sit idle.
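That idle-consumer behavior falls out of partition assignment. Here is a simplified round-robin assignor showing why a fifth consumer on a four-partition topic gets nothing; real Kafka ships several assignor strategies (range, round-robin, sticky), but all share the one-consumer-per-partition rule.

```python
# Simplified round-robin partition assignment within one consumer group.
# Each partition goes to exactly one group member; extra members idle.

def assign(partitions: int, consumers: list):
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment
```

This is why partition count caps your consume-side parallelism, and why it belongs in capacity planning from day one.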
Adding a broker and rebalancing partitions scales throughput linearly. LinkedIn (who built Kafka) runs clusters processing trillions of messages per day.
The scaling story is genuinely better than RabbitMQ for high-throughput scenarios, but you pay for it in operational complexity.
How to choose: a decision guide
Decision guide — answer top to bottom:

1. Do multiple independent services need the same events? YES → Kafka
2. Do you need to replay events later or audit history? YES → Kafka
3. Is throughput over 100k msg/s? YES → Kafka
4. Do you need complex routing (topic patterns, headers)? YES → RabbitMQ
5. Do you need request-reply (RPC over messaging)? YES → RabbitMQ
6. Is the message a task with a natural "done" state? YES → RabbitMQ
7. Do you need sub-millisecond latency? YES → RabbitMQ

None of the above? Default to RabbitMQ: it's simpler to operate. Reach for Kafka when you know the event log will become valuable later (replay, audit, new consumers).
Real-world scenario mapping
| Scenario | Tool | Why |
|---|---|---|
| Email/SMS notification queue | RabbitMQ | Task-based, one consumer, done after send |
| Order processing pipeline | RabbitMQ | Work distribution, one worker per order |
| User activity tracking | Kafka | High volume, multiple consumers (analytics, recommendations, A/B) |
| Microservice event backbone | Kafka | Multiple consumers, replay, audit |
| IoT sensor data | Kafka | Massive throughput, time-series analysis |
| Image/video transcoding | RabbitMQ | Job queue, worker pool, task-based |
| Real-time fraud detection | Kafka | Stream processing, historical context |
| Audit log for compliance | Kafka | Retention, immutability, replay |
| Background job scheduling | RabbitMQ | Cron-like tasks, delayed messages |
| Change data capture (CDC) | Kafka | Debezium + Kafka is the standard pattern |
The mistake I actually made
Going back to that payment system: the team chose RabbitMQ for good reasons. We needed to distribute payment processing across workers. The messages were tasks — "process this payment". RabbitMQ was the right choice for that part.
The mistake was using the same payment events as the only record of what happened. We didn't also write them to Kafka (or even a simple database table). When we needed historical events six weeks later, they were gone.
The real lesson wasn't "use Kafka instead of RabbitMQ". It was that these tools solve different problems, and sometimes you need both:
- RabbitMQ to distribute the actual payment processing work
- Kafka (or just a database event log) to record that payments happened
Using both in the same system isn't over-engineering — it's using the right tool for each concern. RabbitMQ handles the task execution. Kafka handles the event history. Many mature architectures do exactly this.
Summary
- RabbitMQ is for moving work. Messages are tasks. Once done, they're gone. Great for job queues, RPC, and complex routing.
- Kafka is for recording events. Messages are facts about things that happened. Great for event streams, multiple consumers, replay, and high throughput.
- The difference isn't throughput or reliability — it's semantics. Ask yourself: is this a task to be done, or an event that happened?
- You'll often need both. That's fine.
If you're starting a new project and aren't sure: RabbitMQ is simpler to operate and reason about. Reach for Kafka when you need the event log semantics — you'll know when you need it, usually when someone asks "can we replay the last month of events?" and you realize you can't.