
ADR 002: Wave 1 Service Consolidation (Simplification Lens)

Date: 2026-04-21
Status: Implemented (Wave 1 closed; all 6 candidates landed). Wave 2 in progress (see Wave 2 log below).
Epic: Riff b418e5aa-7f2b-44c7-9f7c-a2f07c8e95d8 (code #2)

Implementation log

| Candidate | Status | Date | Commit |
|---|---|---|---|
| 1 — Elasticsearch + search-service removed | ✅ Implemented | 2026-04-25 | 52caf98 |
| 3a — Loki + Promtail removed | ✅ Implemented | 2026-04-25 | 4f1ed42 |
| 3b — Prometheus + Grafana removed | ✅ Implemented | 2026-04-25 | 4f1ed42 |
| 4 — Duplicate CI build job removed | ✅ Implemented | 2026-04-24 | 514fa5e |
| 5 — mcp-server → ai-service | ✅ Implemented | 2026-04-26 | 80f732c |
| 2 — Qdrant → pgvector | ✅ Implemented | 2026-04-26 | this commit |

Wave 2 log (selected — see Riff Epic #2 + docs/ai/backlog.md for the full set)

| Candidate | Status | Date | Notes |
|---|---|---|---|
| W2-1 — booking-service removed | ✅ Implemented | 2026-04-30 | 3923f0f |
| W2-3 — user-management mothballed | ✅ Implemented | 2026-04-30 | 3981c8f (source kept) |
| W2-4 — analytics-service removed | ✅ Implemented | 2026-04-30 | 3923f0f |
| W2-7 — rag-service → ai-service | ✅ Implemented | 2026-04-30 | this consolidation; ~613 LoC moved, ~80 LoC dies (broken Qdrant monitor.py, duplicate llm_service.py); the 3 rag.* MCP tools become local function calls instead of httpx round-trips; rag_qual/rag_prod Postgres dbs preserved; docling-serve sidecar preserved. As a side effect: silent ai-service consumer crash-loop bug introduced by W2-9 partial cleanup (commit 7427371: dangling settings.AI_TASK_QUEUE + self.callback_ai_task references in start_consuming) is fixed implicitly by the consumer rewrite. |
| W2-8 — auth-service GraphQL dropped | ✅ Implemented | 2026-04-29 | 2be0da6 |
| W2-10 — RabbitMQ orphan exchanges | ✅ Implemented | 2026-04-30 | 4 commits |
| W2-11 — admin AIConsolePage removed | ✅ Implemented | 2026-04-30 | f9be666 |

Verification for Candidate 1 (W1-1): api-gateway tsc --noEmit passes after dropping search: proxy entry; rendered qual compose config contains zero po-elasticsearch/po-search-service/ELASTICSEARCH_URL/SEARCH_SERVICE_URL references; experience-service was confirmed not to consume ELASTICSEARCH_URL (passed but not read); frontends/public-fo/src/hooks/useExperienceSearch.ts continues to use the local mock at ../mocks/api/experiences.api. Manual VPS step required: docker volume rm po-elasticsearch-data + docker network rm po-elasticsearch-network once next deploy lands and orphan containers are stopped.
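The "zero references" check on the rendered compose config can be scripted so it is repeatable on the next deploy. The following is a minimal sketch, assuming the rendered config is available as plain text (for example, captured from docker compose config); the function name find_forbidden_references is hypothetical and not part of the repo.

```python
# Identifiers that must be gone from the rendered qual config after Candidate 1.
FORBIDDEN = (
    "po-elasticsearch",
    "po-search-service",
    "ELASTICSEARCH_URL",
    "SEARCH_SERVICE_URL",
)


def find_forbidden_references(rendered_config: str, tokens=FORBIDDEN) -> dict[str, list[int]]:
    """Return {token: [1-based line numbers]} for every token still present.

    An empty dict means the rendered config is clean.
    """
    hits: dict[str, list[int]] = {}
    for lineno, line in enumerate(rendered_config.splitlines(), start=1):
        for token in tokens:
            if token in line:
                hits.setdefault(token, []).append(lineno)
    return hits
```

Passing a different tokens tuple lets the same helper cover the other candidates' acceptance checks, such as the Loki/Grafana and Qdrant identifiers.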

Verification for Candidate 5 (W1-5): scope expanded from ADR's <1 day estimate to C-full — ai-service is now deployed in qual + prod compose (it was dev-only before), giving the platform a single MCP URL per environment instead of "no external MCP in qual/prod". rag-service's /api/v1/mcp mount removed; rag-service's Traefik labels stripped (now internal-only). New rag-service HTTP endpoints: POST /api/v1/search/, POST /api/v1/convert/ (chat already existed). ai-service's mcp_server.py rewritten to expose 9 tools (6 ai.* local + 3 rag.* proxies via httpx to rag-service's plain HTTP). API-key auth via McpApiKeyMiddleware on the /api/v1/mcp mount; format MCP_API_KEYS=key1:principal1;key2:principal2. Empty value disables auth (dev convenience). Keycloak JWT auth from mcp-server NOT ported — deferred to a follow-up. mcp-server (956 LOC TS) deleted entirely. All four compose stacks validate; both Python services parse clean. Manual VPS step required: set MCP_API_KEYS_QUAL in .env.qual and MCP_API_KEYS in .env.prod on the VPS before next deploy, otherwise the new MCP endpoint is open. Image registry caveat: Python services have no CI build job (pre-existing gap shared with rag-service); first ai-service-{qual,prod} deploy needs a manual docker build && docker push.
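For reference, the MCP_API_KEYS format described above (key1:principal1;key2:principal2, with an empty value disabling auth) could be parsed roughly as follows. This is an illustrative sketch only, not the code in ai-service: the function name parse_mcp_api_keys is hypothetical, and splitting each entry on its first colon is an assumption about how keys and principals are delimited.

```python
def parse_mcp_api_keys(raw: str) -> dict[str, str]:
    """Parse 'key1:principal1;key2:principal2' into {key: principal}.

    An empty or whitespace-only value yields {}, which a caller would treat
    as "auth disabled" (the dev convenience described in the ADR).
    """
    mapping: dict[str, str] = {}
    for position, entry in enumerate(raw.split(";"), start=1):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing or doubled semicolons
        # Assumption: the first ':' separates the key from the principal.
        key, sep, principal = entry.partition(":")
        if not sep or not key or not principal:
            # Report the position only, so the secret never reaches the logs.
            raise ValueError(f"malformed MCP_API_KEYS entry at position {position}")
        mapping[key] = principal
    return mapping
```

Because an empty mapping means an open endpoint, a deploy check that fails when parse_mcp_api_keys returns {} in qual or prod would turn the manual VPS step into an enforced one.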

Verification for Candidate 2 (W1-4): Postgres image swapped postgres:15-alpine → pgvector/pgvector:pg15 in shared.yml + infrastructure.yml. New infrastructure/scripts/rag-vectors-schema.sql mounted at z-rag-vectors-schema.sql (runs after init-multiple-databases.sh creates the rag db). Schema: single rag_document_vectors table (document_id PK, service_id, partner_id, content, embedding vector(1536), two btree indexes + IVFFlat cosine index WITH (lists = 100)). services/rag-service/app/services/vector_service.py rewritten on asyncpg with a singleton pool: same public surface (upsert_document, search, delete_document) so app/api/endpoints/{search,chat}.py and app/services/rabbitmq_consumer.py work unchanged. Dangling document.deleted bug fixed: rabbitmq_consumer.py now binds the rag_service_queue to both document.uploaded AND document.deleted, with a routing-key-aware dispatcher; partner-services.service.ts:512 publishes were previously orphaning vectors. requirements.txt: dropped qdrant-client, langchain* (3 unused packages), numpy, mcp[fastapi] (no longer hosts MCP); added asyncpg>=0.29.0. Compose: Qdrant service blocks + qdrant-data* volumes + qdrant.portugalodyssey.pt Traefik route deleted from infrastructure.yml + qual + prod; QDRANT_URL env var dropped from rag-service everywhere; DATABASE_URL=postgresql://.../rag{,_qual,_prod} added; po-postgres-network added to rag-service-dev's network list. Admin console i18n key renamed qdrantHealth → vectorStoreHealth across 3 locales + AIConsolePage.tsx. VPS deploy gates (manual): the rag_qual and rag_prod databases must be added to POSTGRES_MULTIPLE_DATABASES in .env.shared on the VPS before next shared-stack restart; the rag-vectors-schema.sql must be applied to those databases (the in-tree \connect rag line works for dev only; use psql -d rag_qual -f rag-vectors-schema.sql for the suffixed envs).
Existing Qdrant vectors will be re-ingested via the document.uploaded event flow after first restart of partner-service or via a one-shot replay; lossy if not replayed but the corpus is small per ADR. Acceptance: docker compose config --quiet exits 0 on all four stacks; rag-service Python parses clean; rendered configs have zero qdrant/QDRANT references.
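The routing-key-aware dispatcher mentioned in the Candidate 2 verification can be pictured as a small lookup from routing key to handler. The sketch below is an assumption about the shape of that logic, not the code in rabbitmq_consumer.py: the class name RoutingKeyDispatcher and the dict payload type are hypothetical.

```python
from typing import Awaitable, Callable

# A handler receives the decoded message payload.
Handler = Callable[[dict], Awaitable[None]]


class RoutingKeyDispatcher:
    """Route a message to the handler registered for its routing key."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}

    def register(self, routing_key: str, handler: Handler) -> None:
        self._handlers[routing_key] = handler

    async def dispatch(self, routing_key: str, payload: dict) -> bool:
        """Run the matching handler; return False for unknown routing keys
        so the caller can decide whether to reject or dead-letter."""
        handler = self._handlers.get(routing_key)
        if handler is None:
            return False
        await handler(payload)
        return True
```

In this picture, document.uploaded would be registered against a handler that calls upsert_document and document.deleted against one that calls delete_document, which is what stops deletes from orphaning vectors.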

Preflight outcome (2026-04-25, Candidates 3a + 3b): probe of local Grafana SQLite DB returned 0 rows in dashboard and data_source tables (1.2 MB DB held only schema + sessions). Local Loki volume held 94 MB of buffered chunks but no committed dashboards or LogQL queries existed in repo. User confirmed nobody had logged into the live qual Grafana at monitoring.portugalodyssey.pt to build content. Verdict held at Remove (no downgrade to Defer).
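A probe of this kind is easy to reproduce against a copy of the Grafana SQLite file. The following sketch assumes the database has been copied out of the volume to a local path; the function name count_rows and the -1 sentinel for a missing table are illustrative choices, while the table names dashboard and data_source are the ones named in the preflight above.

```python
import sqlite3


def count_rows(db_path: str, tables=("dashboard", "data_source")) -> dict[str, int]:
    """Return row counts per table; a table that does not exist maps to -1."""
    counts: dict[str, int] = {}
    conn = sqlite3.connect(db_path)
    try:
        existing = {
            row[0]
            for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        }
        for table in tables:
            if table not in existing:
                counts[table] = -1
                continue
            # The name was just verified against sqlite_master, so quoting it
            # into the statement is safe here.
            counts[table] = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        conn.close()
    return counts
```

A result of 0 for both tables corresponds to the "schema + sessions only" finding recorded here.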

Acceptance for 3a + 3b: ✅ all four compose stacks validate (docker compose config --quiet exit 0); ✅ rendered qual config contains zero po-loki/po-promtail/po-prometheus/po-grafana references; ✅ Docker log-opts (max-size: 10m, max-file: "5" → 50 MB per container) applied via merge anchors to all <<: *x_*_env services + explicitly to docling-serve in dev/qual/prod. Known gap: infrastructure.yml's 8 unanchored services (postgres/redis/rabbitmq/kong/etc.) are dev-only and don't yet have the cap; deferred to a hygiene follow-up.

Context

po-platform runs 19 microservices plus 12 infra components (Traefik, Postgres, Redis, RabbitMQ, Elasticsearch, Qdrant, MinIO, Keycloak, Loki, Promtail, Prometheus, Grafana) on a single VPS. For its observed scale (a production-live tourism marketplace with a small catalog, pre-public traffic, and an 18-plan history focused on the calendar/contract domains), this is unusually heavy. The first-pass rule for this review: no component gets kept on vibes. Each must justify itself with traffic, concrete feature coverage, or explicit near-term roadmap. Absence of evidence becomes either Defer (when the probe is clear and cheap) or a verdict of Remove / Consolidate.

Wave 1 covers five numbered candidates (six verdicts, because Candidate 3 is split into 3a and 3b). Wave 2 (booking-service, review-service externals, Strapi scope, Keycloak vs Authentik, remaining 14 services) is tracked separately on the Epic.

Decision

Candidate 1 — Elasticsearch + search-service → Remove

search-service is ~200 LOC of NestJS that indexes exactly one collection (experiences) with ~3 hardcoded mock documents. There is no reindex trigger installed (the Strapi lifecycle hook in services/search-service/CMS_INTEGRATION.md is aspirational), Strapi has no experience content type at all (only CMS page content), and no frontend calls /api/search (frontends/public-fo/src/hooks/useExperienceSearch.ts:39 uses a client-side mock). Elasticsearch is defined twice in compose with divergent heap settings (shared.yml:162-178 vs infrastructure.yml:166-183) and consumes ~1GB RAM. Every query shape in use (multi_match with boost + fuzziness, terms/term/range filters) maps cleanly to Postgres 15 + pg_trgm + GIN, which the platform already runs. When experience data eventually lands, a tsvector column + GIN index on experience-service's own Postgres table is the right home — no separate search service required. Effort: S (~half a day — delete compose blocks, env vars, service directory, gateway route entry).
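To make the Elasticsearch-to-Postgres mapping concrete, here is a rough sketch of what the replacement could look like once experience data exists. Everything named here is an assumption for illustration: the experiences table, its title, description, category and price columns, the 'english' text-search configuration, and the asyncpg-style $n placeholders. Weighted tsvector ranks stand in for multi_match boosts, the pg_trgm % operator for fuzziness, = ANY for terms filters, and BETWEEN for range filters.

```python
# Hypothetical DDL: table and column names are illustrative, not from the repo.
SEARCH_DDL = """
CREATE EXTENSION IF NOT EXISTS pg_trgm;

ALTER TABLE experiences
  ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(description, '')), 'B')
  ) STORED;

CREATE INDEX experiences_search_idx ON experiences USING GIN (search_vector);
CREATE INDEX experiences_title_trgm_idx ON experiences USING GIN (title gin_trgm_ops);
"""

# Full-text match OR trigram-similar title, then terms and range filters.
SEARCH_SQL = """
SELECT id, title,
       ts_rank(search_vector, websearch_to_tsquery('english', $1)) AS rank
FROM experiences
WHERE (search_vector @@ websearch_to_tsquery('english', $1) OR title % $1)
  AND category = ANY($2::text[])
  AND price BETWEEN $3 AND $4
ORDER BY rank DESC, similarity(title, $1) DESC
LIMIT $5
"""


def build_experience_search(text, categories, min_price, max_price, limit=20):
    """Validate inputs and return (sql, params) for a parameterized query."""
    if not text.strip():
        raise ValueError("search text must not be empty")
    if min_price > max_price:
        raise ValueError("min_price must not exceed max_price")
    return SEARCH_SQL, (text, list(categories), min_price, max_price, limit)
```

Keeping the query parameterized means the same statement serves every search shape, so no separate search service or index sync job is needed.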

Candidate 2 — Qdrant + rag-service → Consolidate (pgvector inside existing Postgres; keep rag-service as ingestion/MCP facade)

rag-service does real work coordinating Docling + LLM + RabbitMQ for partner document vectorisation. The vector store, however, is over-specified: one collection (po_platform_docs), 1536-dim embeddings (OpenAI text-embedding-3-small), one vector per document (no chunking — truncated to first 8000 chars per doc), partner-scoped filter queries. Qdrant's differentiators (sharding, quantization, sparse+dense hybrid search) are unused. The codebase has unused langchain* deps in requirements.txt and a dangling document.deleted RabbitMQ event that rag-service never consumes — both signs the service is under-maintained rather than load-bearing. pgvector inside the already-running Postgres (image swap to pgvector/pgvector:pg15) eliminates a container, a volume, a publicly-exposed dashboard at qdrant.portugalodyssey.pt, and enables single-query hybrid retrieval (ORDER BY embedding <=> $1 joined against partner_service_documents with tenant checks). rag-service's ~60 LOC vector_service.py becomes asyncpg/psycopg calls; all upstream callers (experience-service, admin-console, mcp-server) see no change. Effort: S (~1 day).
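A partner-scoped similarity query on pgvector could look roughly like the sketch below. It is not the contents of vector_service.py. The table and column names follow the schema recorded in the implementation log, but the function names, the duck-typed connection, and the choice to send the embedding as text and cast it in SQL (so the example does not depend on a driver-side vector codec) are all assumptions. The join against partner_service_documents is left out for brevity.

```python
from typing import Any, Protocol, Sequence

EMBEDDING_DIM = 1536  # dimension stated in the ADR


class Fetcher(Protocol):
    """Anything with an asyncpg-like fetch(query, *args) coroutine."""

    async def fetch(self, query: str, *args: Any) -> list: ...


# <=> is pgvector's cosine-distance operator; 1 - distance gives similarity.
SEARCH_SQL = """
SELECT document_id, service_id, content,
       1 - (embedding <=> $1::text::vector) AS cosine_similarity
FROM rag_document_vectors
WHERE partner_id = $2
ORDER BY embedding <=> $1::text::vector
LIMIT $3
"""


def to_pgvector_literal(embedding: Sequence[float]) -> str:
    """Render a sequence of floats in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


async def search_documents(conn: Fetcher, embedding: Sequence[float],
                           partner_id: Any, limit: int = 5) -> list:
    """Return the nearest documents for one partner, closest first."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(embedding)}")
    return await conn.fetch(SEARCH_SQL, to_pgvector_literal(embedding), partner_id, limit)
```

Filtering on partner_id in the same statement as the distance ordering is what gives the single-query tenant-scoped retrieval described above.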

Candidate 3a — Loki + Promtail → Remove

Zero application code depends on Loki. All 19 services write to Docker stdout; Promtail tails the JSON log files and pushes to Loki. Configs are duplicated and divergent between monitoring/ (no retention set — unbounded) and infrastructure/monitoring/ (7-day retention). The stack has no committed dashboards (everything lives in the po-grafana-data volume), no alert rules, and Loki is exposed at loki.portugalodyssey.pt behind only HTTP Basic auth. make {dev,qual,prod}-logs and make remote-docker-logs already cover the MVP operator workflow. Before cutover: run du -sh /var/lib/docker/volumes/po-loki-data/_data on the VPS and inspect any Grafana Loki-sourced dashboards via GET /api/search?type=dash-db — if a user has built LogQL queries we don't see in the repo, downgrade to Defer. Concurrent with removal: add Docker log-opts: { max-size: "10m", max-file: "5" } to the compose defaults so container JSON logs don't grow unbounded post-Promtail. Effort: S (~half a day, includes doc update + VPS cleanup).
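The disk ceiling implied by the proposed log-opts is simple arithmetic: per-file size times file count times number of containers. The helper below is a small sketch of that calculation; the name log_cap_mb is hypothetical, and treating the k/m/g suffixes as decimal multiples is an assumption made for the estimate rather than a statement about how Docker parses them.

```python
# Megabytes represented by one unit of each size suffix (decimal, assumed).
_MB_PER_UNIT = {"k": 0.001, "m": 1.0, "g": 1000.0}


def log_cap_mb(max_size: str, max_file: str, containers: int = 1) -> float:
    """Worst-case JSON log footprint in MB for the given log-opts."""
    size = max_size.strip().lower()
    if not size or size[-1] not in _MB_PER_UNIT:
        raise ValueError(f"unsupported size value: {max_size!r}")
    per_file_mb = float(size[:-1]) * _MB_PER_UNIT[size[-1]]
    return per_file_mb * int(max_file) * containers
```

With max-size "10m" and max-file "5" this gives 50 MB per container, and 950 MB across the 19 services if every one of them fills its quota.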

Candidate 3b — Prometheus + Grafana → Remove

Production does not currently run Prometheus or Grafana — both are absent from production.yml. Dev/qual run them but: no persistent TSDB volume (defaults to container filesystem, lost on restart), zero application-level instrumentation (no service imports prom-client or equivalent), only one real scrape target (traefik:8080/metrics), no committed dashboards, no alert rules, no Alertmanager. The stack is a shell that looks like observability without providing any. Traefik's per-request signal (the only real metric source) is already being captured in its access log and shipped via Promtail — or, post-Wave-1, via Docker stdout to make remote-docker-logs. Reintroduce Prometheus + Grafana (or Grafana Cloud's free tier) later when a service actually needs custom metrics and someone commits provisioned dashboards to git. For MVP liveness, add an external uptime probe (UptimeRobot / BetterStack / Healthchecks.io free tier). Effort: S (~half a day, includes stripping metrics.prometheus stanza from infrastructure/traefik/traefik.yml:42-45).

Candidate 4 — CI duplicate build jobs → Remove docker-build-frontend

.gitlab-ci/frontend.yml's docker-build-frontend and .gitlab-ci/services.yml's docker-build-public-fo both trigger on frontends/public-fo/**/* and push to the same three image tags (:latest, :<branch-slug>, :<short-sha>). They race: whichever finishes second wins the registry. The frontend.yml version is strictly inferior — no BuildKit registry cache, no VITE_* build args (so its bundles ship with empty import.meta.env.VITE_API_BASE_URL and friends), no self-hosted runner. Keep docker-build-public-fo; delete docker-build-frontend and the entire .gitlab-ci/frontend.yml file (its only other content is a copy of the shared docker template; lint-frontend/type-check-frontend already live in root .gitlab-ci.yml). Remove the - local: '.gitlab-ci/frontend.yml' include from .gitlab-ci.yml:33. Effort: S (~15 min).

Candidate 5 — mcp-server → Consolidate (fold into ai-service)

services/mcp-server/ is ~956 LOC TypeScript but only ~113 LOC of that is MCP-handler code; the rest is auth (middleware/mcp_auth.ts), RabbitMQ command bus scaffolding, credential vault, and registry stubs, none of which are wired to any real tool or any caller. The tools it exposes (ai.translate, ai.summarize, ai.analyze_sentiment, ai.sanitize, ai.moderate, ai.enhance, rag.search_services, rag.query_rag, rag.convert_document) are pure namespace-prefixed proxies to Python FastMCP endpoints that ai-service and rag-service already expose natively at /api/v1/mcp. Grep for mcp-server|MCP_SERVER_URL|:3011 across services/ and frontends/ returns zero internal callers. It's absent from qualification.yml and production.yml entirely and is deployed only to dev. The platform-MCP URL documented in the (now-archived) docs/AI-AGENT-GUIDE.md (https://api-qual.portugalodyssey.pt/api/rag/mcp) has no matching Traefik route and never served traffic. The real live qual endpoint is https://rag-qual.portugalodyssey.pt/api/v1/mcp served directly by rag-service. Fold the api-key + prefix-allowlist middleware (~50 LoC) into ai-service as FastAPI middleware, register the 3 rag.* tool shims there (ai-service can call rag-service's MCP internally), expose a single MCP endpoint, retire the TS service. If the Phase-2 roadmap (RabbitMQ-backed async tools + credential vault) is real, build it inside ai-service rather than a separate TypeScript container. Effort: S (<1 day since nothing runs in qual/prod).
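The API-key half of that middleware fits in a few lines of plain ASGI, which FastAPI can mount without extra dependencies. The sketch below shows the idea only and is not the middleware that was ported: the class name ApiKeySketchMiddleware, the x-api-key header, and the mcp_principal scope key are all hypothetical, and the prefix-allowlist check is omitted.

```python
import hmac


class ApiKeySketchMiddleware:
    """Reject HTTP requests that do not carry a known API key."""

    def __init__(self, app, keys: dict[str, str], header: bytes = b"x-api-key"):
        self.app = app
        self.keys = keys      # {api_key: principal}; empty dict disables auth
        self.header = header  # assumed header name, lowercase per ASGI

    def _principal_for(self, presented: bytes):
        # Constant-time comparison against each configured key.
        for key, principal in self.keys.items():
            if hmac.compare_digest(presented, key.encode("utf-8")):
                return principal
        return None

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http" or not self.keys:
            await self.app(scope, receive, send)
            return
        headers = dict(scope.get("headers") or [])
        principal = self._principal_for(headers.get(self.header, b""))
        if principal is None:
            await send({
                "type": "http.response.start",
                "status": 401,
                "headers": [(b"content-type", b"application/json")],
            })
            await send({
                "type": "http.response.body",
                "body": b'{"detail":"invalid or missing API key"}',
            })
            return
        scope["mcp_principal"] = principal  # hypothetical key for downstream use
        await self.app(scope, receive, send)
```

Wrapping only the MCP sub-application keeps the rest of ai-service's routes unaffected, and the resolved principal is available to tool handlers for per-caller allowlisting.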

Evidence summary

| Candidate | Verdict | Internal callers | In prod? | Dashboards in repo | App-code dependency | Effort |
|---|---|---|---|---|---|---|
| Elasticsearch + search-service | Remove | None | Yes (idle) | n/a | None | S |
| Qdrant | Consolidate → pgvector | 1 (rag-service) | Yes | 0 | Only rag-service | S |
| Loki + Promtail | Remove | 0 | Yes | 0 committed | None | S |
| Prometheus + Grafana | Remove | 0 | No (absent from production.yml) | 0 committed | None | S |
| CI duplicate build jobs | Remove docker-build-frontend | n/a | n/a | n/a | Cosmetic | S |
| mcp-server | Consolidate → ai-service | 0 | No (dev-only) | n/a | None | S |

Every candidate is S effort. That is itself a signal: the code that matters is thin, and the code being evaluated is easy to remove precisely because nothing has grown up around it.

Consequences

Positive

  • Runtime footprint drops materially on the single VPS: Elasticsearch (~1 GB JVM), Qdrant (container + volume + Traefik route), Prometheus + Grafana (+ provisioned-but-stateless TSDB), Loki + Promtail (+ the duplicated monitoring/ and infrastructure/monitoring/ paths), mcp-server dev container. Rough order: 2–3 GB RAM freed, 4 public Traefik routes retired, 6+ named Docker volumes retired.
  • Cognitive load drops: one Postgres for relational + vector + full-text; one MCP surface (ai-service) instead of three (mcp-server + ai-service + rag-service); one log-access path (make *-logs); one duplicated-config source (not two).
  • Supply-chain surface shrinks: drop @elastic/elasticsearch, qdrant-client, prom-client (if any crept in), the .gitlab-ci/frontend.yml include, Grafana's Docker image, Prometheus's Docker image, Loki's + Promtail's Docker images.
  • Attack surface shrinks: remove public qdrant.portugalodyssey.pt, loki.portugalodyssey.pt, prometheus.portugalodyssey.pt, monitoring.portugalodyssey.pt, mcp-dev.portugalodyssey.pt routes (all today gated only by basic auth or dev-only constraints).
  • ADR backlog caught up: 5 of the major architectural decisions since ADR-001 now have a written rationale.

Negative

  • Future observability effort is front-loaded: when Prometheus+Grafana come back (or Grafana Cloud is adopted), someone must commit provisioned dashboards and SLOs to git instead of clicking them in the Grafana UI. That discipline didn't exist in the Wave-0 stack.
  • pgvector ceiling: at the current workload (thousands of partner docs, one vector each) this is a non-issue, but if the vector corpus grows past ~1–5M vectors or starts needing quantization/hybrid-search, the Qdrant decision has to be revisited. Document the re-evaluation threshold in the follow-up backlog item so the signal is visible.
  • External-agent URL break: any agent (Cursor/Antigravity/n8n) pointing at localhost:3011/api/v1/mcp in dev will need to flip to ai-service's endpoint. Since mcp-server is dev-only, this is a small audience. Still: publish a deprecation window before removing.
  • Log retention: removing Loki means the 7-day LogQL queryable retention vanishes. Compensate with Docker log-opts: { max-size, max-file } and make remote-docker-logs. If an incident investigation later needs "7 days of logs from service X", that gap will surface — document the new ops flow before cutover.
  • Grafana-data volume decision: the po-grafana-data Docker volume holds whatever dashboards someone built manually on the VPS. Decide explicitly: keep the volume (re-attach when Grafana returns later) or delete it. Recommendation — keep; it's small, and preserves recoverable config.

Follow-ups

Each verdict becomes a child task under Riff Epic #2 with its own acceptance criteria and deploy sequencing. See docs/ai/backlog.md §"P2 — Wave 1 consolidations" for the full list.

Wave 2 candidates (deferred to a separate session): booking-service in-memory vs proper DB, review-service mocked externals deliver-or-remove, ai-service+rag-service full consolidation question (orthogonal to the Qdrant decision), Strapi scope, Keycloak vs Authentik, per-service evaluation of the remaining 14 services, RabbitMQ topology review.

References

  • docs/developers/adr/001-calendar-architecture.md — ADR shape reference
  • docs/ai/context-snapshots/project-overview.md — service registry + architecture snapshot
  • docs/project-overview.md — domain model (ensures no removed component is load-bearing for Experience assembly)
  • CLAUDE.md — platform conventions
  • docs/implementation-plans/LIFECYCLE.md — if any follow-up grows into an implementation plan