ADR 005: RabbitMQ as the Activity / Event Bus

Date: 2026-03-23
Status: Accepted

Note: Retrospective ADR written 2026-04-26. The decision was made implicitly across the activity-log work (plan 018, commit 628f12e, 2026-03-23) and earlier RabbitMQ adoption for document ingestion (90d7bca, 2025-12-22). This ADR captures the rationale ex-post.

Context

By Q1 2026 the platform had several cross-service flows that needed asynchronous fan-out:

  • Document ingestion: file-service accepts an upload; rag-service must convert and embed it; partner-service needs the resulting rag_id to attach to the service record.
  • Notifications: calendar bookings, contract state transitions, and partner tasks all generate emails, websocket pushes, and (eventually) push notifications, each with multiple consumers running at different cadences.
  • Activity log (plan 018): calendar-service writes log entries that notification-service must fan out as websocket events to a partner's connected browser tabs in real time. The log write must succeed even if the notification path is down.

A direct HTTP call between services would have coupled liveness (calendar-service blocks on notification-service's health), made fan-out painful (one HTTP call per consumer), and offered no replay or buffer behavior.

Decision

Adopt RabbitMQ topic exchanges as the spine for cross-service events. Keep direct HTTP only for synchronous request/response (gateway → service, service → rag-service search).

Pattern

  • Producers publish via amqplib (Node) or aio-pika (Python) to a topic exchange named for the domain (activity_log, document.events, notifications, email, websocket).
  • Routing keys are dotted, hierarchical, and namespaced by event (activity.log.created, document.uploaded, document.deleted, email.sent, websocket.emitted).
  • Consumers bind their own queues with topic patterns. New consumers add new queues with new bindings — producers don't change.
  • The vhost separates environments (portugal_odyssey_dev, portugal_odyssey_qual, portugal_odyssey_prod); see infrastructure/config/rabbitmq/definitions.json.
  • A mock mode ships with the RabbitmqService (NestJS) so dev environments without RabbitMQ still boot — log writes succeed, fan-out is a no-op locally.
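To make the binding model above concrete, here is a minimal, broker-free sketch of AMQP topic matching (`*` matches exactly one word, `#` matches zero or more words). It is an illustration only: the queue names and the `route` helper are hypothetical, exchanges are collapsed into a single binding table, and the real matching is done by RabbitMQ itself, not by application code.

```python
from typing import Dict, List


def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if a dotted routing key matches an AMQP topic pattern."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words of the routing key.
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j == len(k):
            return False
        if p[i] == "*" or p[i] == k[j]:
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


def route(bindings: Dict[str, List[str]], routing_key: str) -> List[str]:
    """Return the (sorted) queue names whose bindings match the routing key."""
    return sorted(
        queue
        for queue, patterns in bindings.items()
        if any(topic_matches(pat, routing_key) for pat in patterns)
    )


# Hypothetical queue names; only the routing keys come from this ADR.
bindings = {
    "notification.activity": ["activity.log.*"],
    "rag.documents": ["document.*"],
}
before = route(bindings, "activity.log.created")

# A new consumer adds its own queue and binding; the producer is untouched.
bindings["analytics.kpi"] = ["activity.log.created"]
after = route(bindings, "activity.log.created")
```

In this sketch `before` holds only the notification queue and `after` additionally holds the analytics queue, while the published routing key never changes.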

Why RabbitMQ specifically

  • Topic exchanges are the right abstraction for "one producer, many independent consumers with different routing needs" — exactly the activity-log shape.
  • AMQP 0.9.1 has mature client libraries in Node (amqplib) and Python (aio-pika, pika), so the polyglot stack stays simple.
  • Dead-lettering, ack/nack, and prefetch are all available out of the box for retry semantics.
  • No Kafka complexity: at this platform's scale (a few hundred events/min, no multi-day replay needs), Kafka's offset/log-retention model is overkill and expensive to operate on a single VPS.

Persistence + reliability stance

The activity-log path treats logging as best-effort observability, not durable audit:

  • The DB write happens first (activity_log row inserted), so the canonical record survives even if RabbitMQ is unavailable.
  • The RabbitMQ publish is non-blocking (logSafe() helper); a failed publish doesn't fail the originating sync operation.
  • The notification-service consumer is auto-ack for activity events; a single dropped frame loses one push, but the DB row is still queryable via REST.
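The write-first, publish-second ordering can be sketched as follows. This is not the actual logSafe() helper, which lives in the TypeScript calendar-service; `log_safe`, `record_activity`, and the in-memory `rows` list are hypothetical stand-ins, and "non-blocking" is modelled here only as "cannot fail the caller".

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("activity_log")


def log_safe(publish: Callable[[], None]) -> bool:
    """Run a publish callable and swallow any failure (best-effort fan-out)."""
    try:
        publish()
        return True
    except Exception:
        # The DB row already exists, so a lost event costs one push only.
        logger.warning("activity event publish failed", exc_info=True)
        return False


def record_activity(
    rows: List[Dict],
    entry: Dict,
    publish_event: Callable[[Dict], None],
) -> Dict:
    """Persist the entry first, then attempt the fan-out without raising."""
    rows.append(entry)  # stand-in for the INSERT into activity_log
    log_safe(lambda: publish_event(entry))
    return entry


def broken_publisher(entry: Dict) -> None:
    """Simulates RabbitMQ being unavailable."""
    raise ConnectionError("broker unreachable")


rows: List[Dict] = []
saved = record_activity(rows, {"id": "1", "action": "created"}, broken_publisher)
```

Even with the broker down, `rows` contains the entry and the caller sees no exception.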

For document events the trade-off flips: rag-service's consumer acks only on success, so Docling failures get retried.
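A minimal sketch of ack-on-success, with the broker replaced by plain callables so it runs without RabbitMQ. `handle_delivery` and its `ack` / `nack` parameters are hypothetical; in a real consumer they would map to the client library's acknowledge and negative-acknowledge calls. Requeueing on failure is an assumption: this ADR does not say whether rag-service requeues or dead-letters.

```python
from typing import Callable, List


def handle_delivery(
    body: bytes,
    process: Callable[[bytes], None],
    ack: Callable[[], None],
    nack: Callable[[bool], None],
) -> bool:
    """Ack only after processing succeeds; otherwise nack with requeue."""
    try:
        process(body)
    except Exception:
        nack(True)  # requeue=True, so the broker redelivers the message
        return False
    ack()
    return True


events: List[str] = []


def failing_conversion(body: bytes) -> None:
    """Simulates a Docling conversion failure."""
    raise RuntimeError("conversion failed")


ok = handle_delivery(
    b"{}",
    lambda body: None,
    lambda: events.append("ack"),
    lambda requeue: events.append(f"nack requeue={requeue}"),
)
failed = handle_delivery(
    b"{}",
    failing_conversion,
    lambda: events.append("ack"),
    lambda requeue: events.append(f"nack requeue={requeue}"),
)
```

Unbounded requeueing would loop forever on a message that can never be converted, so a real consumer would presumably cap retries or route to a dead-letter exchange.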

Consequences

Positive

  • Loose coupling: notification-service can be redeployed without touching calendar-service. Consumers come and go without producer changes.
  • Built-in fan-out: adding a new consumer (e.g., a future analytics-service that wants activity.log.created for KPI rollups) is one queue + one binding; producers untouched.
  • Polyglot: TypeScript (calendar/notification/booking) and Python (rag-service) sit on the same bus.
  • Mock-mode dev ergonomics: laptops without RabbitMQ still run the full app; only websocket-push behavior degrades.
  • Reuse for bookings/contracts: the same exchange-per-domain pattern extends naturally to booking lifecycle events and contract state transitions.
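The mock-mode behaviour can be pictured with a small sketch. The real RabbitmqService is a NestJS class; this Python analogue, including the `EventPublisher` name and the injected `send` transport, is hypothetical and shows only the shape of the fallback: without a transport, publishing becomes a logged no-op.

```python
import json
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("rabbitmq")

# Hypothetical transport signature: (exchange, routing_key, body) -> None
Send = Callable[[str, str, bytes], None]


class EventPublisher:
    """Publishes through a transport if one is configured, else no-ops."""

    def __init__(self, send: Optional[Send] = None) -> None:
        self._send = send
        self.mock_mode = send is None

    def publish(self, exchange: str, routing_key: str, payload: Dict) -> bool:
        """Return True if the message was handed to a transport."""
        body = json.dumps(payload).encode("utf-8")
        if self._send is None:
            # Mock mode: the app keeps running, nothing reaches a broker.
            logger.debug("mock publish %s %s", exchange, routing_key)
            return False
        self._send(exchange, routing_key, body)
        return True


mock = EventPublisher()
mock_result = mock.publish("activity_log", "activity.log.created", {"id": "1"})

sent = []
live = EventPublisher(send=lambda ex, rk, body: sent.append((ex, rk, body)))
live_result = live.publish("activity_log", "activity.log.created", {"id": "1"})
```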

Negative

  • Operational dependency on a third infrastructure container, the broker, alongside Postgres and Redis. It adds memory footprint (Erlang VM) and one more thing to monitor on the single VPS.
  • No durable replay: dropped messages during a notification-service outage are gone — the activity REST endpoint becomes the only catch-up mechanism. If true audit-grade durability becomes a requirement, the activity_log table is the source of truth, not the bus.
  • No schema enforcement on the wire: payloads are plain JSON. Consumers and producers must keep TypeScript / Pydantic shapes in sync manually. A schema registry was considered and deferred.
  • definitions.json is sparse — at the time of writing it declares only vhosts + admin permissions; topic exchanges and bindings are created lazily by services on first publish/subscribe rather than seeded. That's fine while the topology is small but worth revisiting if it grows.
  • Mock mode hides production drift: a developer can ship code that publishes to a typo'd routing key and never see the bug locally.
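Because payloads travel as plain JSON, each consumer has to check the shape itself. The sketch below shows one way to do that at the consumer boundary using only the standard library. The field names (`id`, `partner_id`, `action`) are invented for illustration and are not the real activity.log.created payload; the Python services in this stack use Pydantic rather than a hand-written dataclass.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("id", "partner_id", "action")  # hypothetical field names


@dataclass(frozen=True)
class ActivityLogCreated:
    """Illustrative payload shape; not the platform's actual schema."""

    id: str
    partner_id: str
    action: str

    @classmethod
    def from_json(cls, raw: bytes) -> "ActivityLogCreated":
        """Parse and validate a message body, raising ValueError on drift."""
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("payload must be a JSON object")
        bad = [f for f in REQUIRED_FIELDS if not isinstance(data.get(f), str)]
        if bad:
            raise ValueError(f"missing or non-string fields: {bad}")
        return cls(id=data["id"], partner_id=data["partner_id"], action=data["action"])


event = ActivityLogCreated.from_json(
    b'{"id": "42", "partner_id": "p-7", "action": "booking.created"}'
)
```

Failing fast at the boundary turns a silent producer/consumer mismatch into an explicit error at the consumer, though it does not replace a shared schema.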

References

  • infrastructure/config/rabbitmq/definitions.json — vhost + admin seed
  • services/calendar-service/src/common/rabbitmq/rabbitmq.service.ts — Node publisher/subscriber with mock-mode fallback
  • services/notification-service/src/modules/notifications/services/activity-log-consumer.service.ts — consumer + websocket bridge
  • services/rag-service/app/services/rabbitmq_consumer.py — Python consumer for document.uploaded / document.deleted
  • docs/implementation-plans/018-activity-log-system/human/explanation.md — design rationale for the activity-log layer specifically