ADR 005: RabbitMQ as the Activity / Event Bus
Date: 2026-03-23
Status: Accepted
Note: Retrospective ADR written 2026-04-26. The decision was made implicitly across the activity-log work (plan 018, commit 628f12e, 2026-03-23) and earlier RabbitMQ adoption for document ingestion (90d7bca, 2025-12-22). This ADR captures the rationale ex-post.
Context
By Q1 2026 the platform had several cross-service flows that needed asynchronous fan-out:
- Document ingestion: file-service accepts an upload; rag-service must convert + embed; partner-service needs the resulting rag_id to attach to the service record.
- Notifications: calendar bookings, contract state transitions, and partner tasks all generate emails, websocket pushes, and (eventually) push notifications. Multiple consumers, different cadences.
- Activity log (plan 018): calendar-service writes log entries that notification-service must fan out as websocket events to a partner's connected browser tabs in real time. The log write must succeed even if the notification path is down.
A direct HTTP call between services would have coupled liveness (calendar-service blocks on notification-service's health), made fan-out painful (one HTTP call per consumer), and offered no replay or buffer behavior.
Decision
Adopt RabbitMQ topic exchanges as the spine for cross-service events. Keep direct HTTP only for synchronous request/response (gateway → service, service → rag-service search).
Pattern
- Producers publish via amqplib (Node) or aio-pika (Python) to a topic exchange named for the domain (activity_log, document.events, notifications, email, websocket).
- Routing keys are dotted, hierarchical, and namespaced by event (activity.log.created, document.uploaded, document.deleted, email.sent, websocket.emitted).
- Consumers bind their own queues with topic patterns. New consumers add new queues with new bindings; producers don't change.
- The vhost separates environments (portugal_odyssey_dev, portugal_odyssey_qual, portugal_odyssey_prod); see infrastructure/config/rabbitmq/definitions.json.
- A mock mode ships with the RabbitmqService (NestJS) so dev environments without RabbitMQ still boot: log writes succeed, fan-out is a no-op locally.
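The binding behaviour described above can be illustrated without a broker. The sketch below reimplements AMQP 0-9-1 topic matching in plain Python (`*` matches exactly one dot-separated word, `#` matches zero or more) and models a single exchange. The queue names, the `activity.task.assigned` key, and the helper names `topic_matches` and `route` are assumptions made for illustration; they are not taken from the codebase.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if a topic binding pattern matches a routing key.

    '*' matches exactly one dot-separated word; '#' matches zero or more.
    """

    def match(p: list[str], k: list[str]) -> bool:
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' either absorbs nothing, or absorbs one word and stays active.
            return match(rest, k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        return (head == "*" or head == k[0]) and match(rest, k[1:])

    return match(pattern.split("."), routing_key.split("."))


# Hypothetical bindings on the activity_log exchange: queue name -> pattern.
bindings = {
    "notification.activity": "activity.log.*",
    "analytics.activity": "activity.#",  # a later consumer, added without producer changes
}


def route(routing_key: str) -> list[str]:
    """Return the queues that would receive a message with this routing key."""
    return sorted(q for q, pat in bindings.items() if topic_matches(pat, routing_key))


print(route("activity.log.created"))    # both queues receive a copy
print(route("activity.task.assigned"))  # only the '#' binding matches
```

Adding the second entry to `bindings` is the only change needed to introduce a new consumer in this model, which is the property the pattern relies on.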
Why RabbitMQ specifically
- Topic exchanges are the right abstraction for "one producer, many independent consumers with different routing needs" — exactly the activity-log shape.
- AMQP 0.9.1 is universally supported in Node (amqplib) and Python (aio-pika, pika), so the polyglot stack stays simple.
- Dead-lettering, ack/nack, and prefetch are all available out of the box for retry semantics.
- No Kafka complexity — at platform scale (a few hundred events/min, no multi-day replay needs) Kafka's offset/log-retention model is overkill and expensive to operate on a single VPS.
Persistence + reliability stance
The activity-log path treats logging as best-effort observability, not durable audit:
- DB write happens first (activity_log row inserted), so the canonical record survives even if RabbitMQ is unavailable.
- The RabbitMQ publish is non-blocking (logSafe() helper); a failed publish doesn't fail the originating sync operation.
- The notification-service consumer is auto-ack for activity events — a single dropped frame loses one push, but the DB row is still queryable via REST.
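The ordering in the list above (database first, publish second, failures swallowed) can be sketched as follows. This is a Python illustration under stated assumptions, not the actual helper: `log_safe` mirrors the name of the `logSafe()` helper mentioned above but its signature is invented here, `record_activity` and the in-memory `db_rows` list are hypothetical stand-ins, and the sketch is synchronous, so it only shows that a publish failure does not propagate.

```python
import logging
from typing import Callable

logger = logging.getLogger("activity_log")

Publish = Callable[[str, dict], None]


def log_safe(publish: Publish, routing_key: str, payload: dict) -> bool:
    """Attempt a publish and swallow any failure. Returns True on success."""
    try:
        publish(routing_key, payload)
        return True
    except Exception:
        # The DB row is already written, so the event is only lost for live push.
        logger.warning("publish failed for %s", routing_key, exc_info=True)
        return False


def record_activity(db_rows: list, publish: Publish, entry: dict) -> dict:
    """Write the canonical row first, then fan out on a best-effort basis."""
    db_rows.append(entry)  # stand-in for the activity_log INSERT
    log_safe(publish, "activity.log.created", entry)
    return entry


def broken_publish(routing_key: str, payload: dict) -> None:
    """Simulates an unreachable broker."""
    raise ConnectionError("broker unreachable")


rows: list = []
result = record_activity(rows, broken_publish, {"id": 1, "action": "booking.created"})
print(result, len(rows))  # the row is stored even though the publish failed
```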
For document events the trade-off flips: rag-service's consumer ack-on-success ensures Docling failures get retried.
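The difference between the two acknowledgement modes can be seen in a small in-memory simulation. It is a sketch only: `drain`, `make_flaky_handler`, and the `max_deliveries` guard are hypothetical names introduced here, the queue is a plain `deque` rather than a broker, and a real setup would typically hand exhausted messages to a dead-letter exchange instead of returning them.

```python
from collections import deque
from typing import Callable


def drain(queue: deque, handler: Callable[[dict], None],
          auto_ack: bool, max_deliveries: int = 3) -> list:
    """Deliver until the queue is empty; return messages that were never processed."""
    lost = []
    while queue:
        message = queue.popleft()
        message["deliveries"] = message.get("deliveries", 0) + 1
        try:
            handler(message)
        except Exception:
            if auto_ack:
                lost.append(message)   # already acknowledged on delivery
            elif message["deliveries"] < max_deliveries:
                queue.append(message)  # nack with requeue: try again
            else:
                lost.append(message)   # retry budget of this simulation exhausted
    return lost


def make_flaky_handler() -> Callable[[dict], None]:
    """Handler that fails the first time it sees each message id."""
    seen: set = set()

    def handler(message: dict) -> None:
        if message["id"] not in seen:
            seen.add(message["id"])
            raise RuntimeError("conversion failed")

    return handler


lost_auto = drain(deque([{"id": "doc-1"}]), make_flaky_handler(), auto_ack=True)
lost_manual = drain(deque([{"id": "doc-1"}]), make_flaky_handler(), auto_ack=False)
print(len(lost_auto), len(lost_manual))  # auto-ack loses the message; ack-on-success retries it
```

With auto-ack the single failure costs the message, which is acceptable for a websocket push backed by a DB row; with ack-on-success the same failure leads to a second delivery, which is the behaviour wanted for document conversion.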
Consequences
Positive
- Loose coupling: notification-service can be redeployed without touching calendar-service. Consumers come and go without producer changes.
- Built-in fan-out: adding a new consumer (e.g., a future analytics-service that wants activity.log.created for KPI rollups) is one queue + one binding; producers untouched.
- Polyglot: TypeScript (calendar/notification/booking) and Python (rag-service) sit on the same bus.
- Mock-mode dev ergonomics: laptops without RabbitMQ still run the full app; only websocket-push behavior degrades.
- Reuse for bookings/contracts: the same exchange-per-domain pattern extends naturally to booking lifecycle events and contract state transitions.
Negative
- Operational dependency on a third infrastructure container (the broker) alongside Postgres and Redis. Adds memory footprint (Erlang VM) and one more thing to monitor on the single VPS.
- No durable replay: dropped messages during a notification-service outage are gone — the activity REST endpoint becomes the only catch-up mechanism. If true audit-grade durability becomes a requirement, the activity_log table is the source of truth, not the bus.
- No schema enforcement on the wire: payloads are plain JSON. Consumers and producers must keep TypeScript / Pydantic shapes in sync manually. A schema registry was considered and deferred.
- definitions.json is sparse: at the time of writing it declares only vhosts + admin permissions; topic exchanges and bindings are created lazily by services on first publish/subscribe rather than seeded. That's fine while the topology is small but worth revisiting if it grows.
- Mock mode hides production drift: a developer can ship code that publishes to a typo'd routing key and never see the bug locally.
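The mock-mode drift can be made concrete with two toy publishers. Both classes are hypothetical stand-ins written for this illustration (the real RabbitmqService is a NestJS class and may behave differently), and `BoundBus` reports routability directly, which a real client library does not do by default.

```python
class MockBus:
    """Stand-in for a mock-mode publisher: accepts everything, routes nothing."""

    def publish(self, routing_key: str, payload: dict) -> bool:
        return True  # no broker, so nothing can be reported as unroutable


class BoundBus:
    """Toy broker that reports whether any queue is bound to the exact key."""

    def __init__(self, bound_keys: set[str]) -> None:
        self.bound_keys = bound_keys

    def publish(self, routing_key: str, payload: dict) -> bool:
        return routing_key in self.bound_keys


TYPO_KEY = "activty.log.created"  # note the missing "i"
payload = {"action": "booking.created"}

mock_result = MockBus().publish(TYPO_KEY, payload)
bound_result = BoundBus({"activity.log.created"}).publish(TYPO_KEY, payload)
print(mock_result, bound_result)  # the mock accepts the typo; the bound bus has no match
```

Locally the typo'd key looks like a success. Against a real broker the message would match no binding and be dropped, unless the publisher sets the AMQP mandatory flag to have it returned.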
References
- infrastructure/config/rabbitmq/definitions.json: vhost + admin seed
- services/calendar-service/src/common/rabbitmq/rabbitmq.service.ts: Node publisher/subscriber with mock-mode fallback
- services/notification-service/src/modules/notifications/services/activity-log-consumer.service.ts: consumer + websocket bridge
- services/rag-service/app/services/rabbitmq_consumer.py: Python consumer for document.uploaded / document.deleted
- docs/implementation-plans/018-activity-log-system/human/explanation.md: design rationale for the activity-log layer specifically