Prod first-deploy readiness — web-presence apex go-public

Status: PREPARATION (no prod deploy yet). production.yml has never been deployed (no .env.prod on the VPS, no *-prod containers, no *_prod databases). The first make prod is intended to be the web-presence-goes-public event during pre-launch (marketing/business). This doc is the turnkey checklist + the validation that runs the moment it lands. Prepared by sA 2026-05-26; nothing here touches secrets or deploys.

Decision needed before first deploy (operator/José): is the first prod deploy the full 20-service stack (make prod), or a web-presence marketing subset (web-presence + strapi-cms + ai-service + ingress/infra, no payment/booking)? This changes whether live Stripe keys are required at launch (full stack brings up payment-service, which needs live keys to start cleanly) or can be deferred. The rest of this doc assumes full-stack; flag in §6 if subset.


1. Service parity — ✅ COMPLETE

production.yml and qualification.yml are at full 20-service parity (verified 2026-05-26; review-service mothballed 2026-05-31 per Riff #225 / Wave 3, removed from both compose files before prod was ever deployed):

admin-console · ai-service · api-gateway · auth-service · calendar-service
contract-service · docling-serve · document-signing-service · experience-service
file-service · keycloak · minio · minio-init · notification-service
partner-console · partner-service · payment-service · public-fo
strapi-cms · web-presence

Riff #205 closed the 5-service gap (partner/contract/calendar/document-signing/web-presence). comm diff of qual-stems vs prod-stems is empty.
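
The empty comm diff mentioned above can be reproduced with a small helper. This is a minimal sketch, not the script that was actually used: it assumes each stack's service names are available one per line in a text file and that they carry -qual / -prod suffixes (both assumptions; if the names have no suffix, the strip is a no-op). The function names stems and parity_diff are hypothetical.

```shell
#!/bin/sh
# Sketch: compare service "stems" (names with the environment suffix removed)
# between two service lists. The -qual / -prod suffixes are an assumption.

# stems SUFFIX: read service names on stdin, strip a trailing SUFFIX, sort uniquely.
stems() {
  sed "s/$1\$//" | sort -u
}

# parity_diff QUAL_LIST PROD_LIST: print stems present in only one of the lists.
# Empty output means both stacks define the same set of services.
parity_diff() {
  qual_tmp=$(mktemp) || return 1
  prod_tmp=$(mktemp) || return 1
  stems -qual < "$1" > "$qual_tmp"
  stems -prod < "$2" > "$prod_tmp"
  # comm -3 hides lines common to both sorted inputs.
  comm -3 "$qual_tmp" "$prod_tmp"
  rm -f "$qual_tmp" "$prod_tmp"
}
```

The two input lists could be produced with docker compose -f qualification.yml config --services and the same command against production.yml, then passed as parity_diff qual.txt prod.txt. Any line printed is a service missing from one side.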


2. .env.prod required-vars checklist — 48 variables

Every ${VAR} referenced by production.yml. The operator provisions .env.prod (and .env.shared for the DB password — migrate.sh auto-loads POSTGRES_PASSWORD from .env.shared). Cross-check against .env.qual for structure; swap values for prod (live keys, prod DB password). Do not read .env.* into a Claude session — use a plain terminal.

Secrets — must be unique prod values (Tier 2/3)

JWT_SECRET                       OIDC_CONFIDENTIAL_CLIENT_SECRET
KEYCLOAK_ADMIN_PASSWORD          SIGNING_CERT_PASSWORD
POSTGRES_PASSWORD                MCP_API_KEYS
RABBITMQ_PASSWORD                RAG_INGEST_ADMIN_KEY
REDIS_PASSWORD                   AI_SERVICE_INGEST_KEY
MINIO_ROOT_PASSWORD_PROD         CACHE_BUST_TOKEN
STRAPI_ADMIN_JWT_SECRET          STRAPI_JWT_SECRET
STRAPI_API_TOKEN_SALT            STRAPI_APP_KEYS
STRAPI_TRANSFER_TOKEN_SALT       STRAPI_API_TOKEN

External API credentials — LIVE keys for prod (not test/sandbox)

STRIPE_SECRET_KEY                STRIPE_PUBLISHABLE_KEY
STRIPE_WEBHOOK_SECRET            STRIPE_WEBHOOK_SECRET_CONNECTED
ANTHROPIC_API_KEY                OPENAI_API_KEY
GOOGLE_API_KEY                   GOOGLE_MAPS_API_KEY
LANGCHAIN_API_KEY                YELP_API_KEY
TWILIO_ACCOUNT_SID               TWILIO_AUTH_TOKEN
SMTP_PASSWORD

⚠️ Stripe must be LIVE keys for a real go-public. If the first deploy is a marketing subset with no real payments yet, either omit payment-service from the deploy or use test keys + keep payment routes unexposed. Decide per §6.
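
Stripe API keys encode their mode in the prefix (sk_live_ / pk_live_ for live, sk_test_ / pk_test_ for test), so the live-vs-test requirement above can be checked without contacting Stripe. The following is a sketch only; key_mode is a hypothetical helper, and it reports the mode without echoing the key. Run it from a plain terminal, consistent with the rule about .env.* files.

```shell
#!/bin/sh
# key_mode KEY: classify a Stripe API key by its prefix.
# Prints "live", "test", or "unknown"; never prints the key itself.
key_mode() {
  case "$1" in
    sk_live_*|pk_live_*|rk_live_*) echo live ;;
    sk_test_*|pk_test_*|rk_test_*) echo test ;;
    *) echo unknown ;;
  esac
}
```

Example: key_mode "$STRIPE_SECRET_KEY" should print live before a real go-public. Webhook signing secrets (STRIPE_WEBHOOK_SECRET and STRIPE_WEBHOOK_SECRET_CONNECTED) do not follow this prefix scheme and will report unknown, so confirm those against the live-mode endpoints in the Stripe dashboard.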

Config / non-secret

IMAGE_TAG (CI injects $CI_COMMIT_SHORT_SHA)   REGISTRY_IMAGE
KEYCLOAK_ADMIN          POSTGRES_USER          RABBITMQ_USER
MINIO_ROOT_USER_PROD    MCP_KEYCLOAK_URL       LANGCHAIN_PROJECT
LANGCHAIN_TRACING_V2    SMTP_HOST   SMTP_PORT  SMTP_USER
EMAIL_FROM    CONTACT_ADMIN_EMAIL   CONTACT_FROM_EMAIL
PAYMENT_SUCCESS_URL     PAYMENT_CANCEL_URL

PAYMENT_SUCCESS_URL / PAYMENT_CANCEL_URL must point at the prod public-fo host (app.portugalodyssey.pt, see §3), not qual.
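
The checklist in this section can be verified mechanically before the first make prod. The sketch below assumes .env.prod uses plain KEY=value lines (an assumption about the file format); missing_vars is a hypothetical helper that prints only the names of absent or empty variables, never their values.

```shell
#!/bin/sh
# missing_vars ENV_FILE NAME...: print each NAME that has no non-empty
# NAME=value line in ENV_FILE. Empty values and empty quotes ("" or '')
# count as missing. Values are never printed.
missing_vars() {
  env_file=$1
  shift
  for name in "$@"; do
    grep -Eq "^${name}=[\"']?[^\"'[:space:]]" "$env_file" || echo "$name"
  done
}
```

Example: missing_vars .env.prod JWT_SECRET POSTGRES_USER STRIPE_SECRET_KEY PAYMENT_SUCCESS_URL prints nothing when all four are set. Passing all 48 names from the lists above gives a full pre-flight check.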


3. Apex cutover (Plan #028 Slice I / Riff #145) — READY DIFF, apply at go-public

Today production.yml routes the apex to public-fo-prod and web-presence to a subdomain. The go-public flip (documented in the web-presence-prod label comment) swaps them. portugalodissey.pt is an intentional dual-domain (see dual-domain-support.md) — mirror it on both routers.

Change A — web-presence-prod → apex (production.yml, web-presence-prod labels):

- - "traefik.http.routers.web-presence-prod.rule=Host(`web-presence.portugalodyssey.pt`)"
+ - "traefik.http.routers.web-presence-prod.rule=Host(`portugalodyssey.pt`) || Host(`www.portugalodyssey.pt`) || Host(`portugalodissey.pt`) || Host(`www.portugalodissey.pt`)"
+ - "traefik.http.routers.web-presence-prod.priority=10"

Change B — public-fo-prod → app.* (production.yml, public-fo-prod labels):

- - "traefik.http.routers.public-fo-prod.rule=Host(`portugalodyssey.pt`) || Host(`www.portugalodyssey.pt`) || Host(`portugalodissey.pt`) || Host(`www.portugalodissey.pt`)"
+ - "traefik.http.routers.public-fo-prod.rule=Host(`app.portugalodyssey.pt`) || Host(`app.portugalodissey.pt`)"

Prerequisites for the flip (Riff #145):

  • DNS: portugalodyssey.pt + www already resolve to the VPS. Add app.portugalodyssey.pt (+ app.portugalodissey.pt) A records before flipping, else public-fo loses its host.
  • LE cert: Traefik issues certs via the DNS-01 challenge through Cloudflare (shared.yml --certificatesresolvers.letsencrypt.acme.dnschallenge.provider=cloudflare), not HTTP-01, so issuance needs a valid CF_DNS_API_TOKEN (Zone:DNS:Edit on both apex zones) in .env.shared, and a missing/expired token blocks every cert regardless of HTTP reachability. DNS for the host must still exist. Watch ACME rate limits (see letsencrypt-rate-limit.md). Bring the stack up, confirm cert issuance, THEN announce publicly.
  • web-presence content (Strapi) must be Cristina-ready before the public sees the apex.

This diff is not pre-applied to production.yml — it's gated on web-presence content readiness + app.* DNS, and is Plan #028 Slice I's deliverable. Apply it as the go-public step. Leaving public-fo on apex until then is correct.
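
The DNS prerequisite for the flip can be checked right before applying the diff. This is a sketch under the assumption that dig is installed on the machine running the check; resolve_a and unresolved_hosts are hypothetical helper names.

```shell
#!/bin/sh
# resolve_a HOST: print the A records for HOST (empty output if none).
resolve_a() {
  dig +short A "$1"
}

# unresolved_hosts HOST...: print every HOST that has no A record yet.
# Empty output means all hosts resolve and the flip is safe on the DNS side.
unresolved_hosts() {
  for host in "$@"; do
    [ -n "$(resolve_a "$host")" ] || echo "$host"
  done
}
```

Example: unresolved_hosts app.portugalodyssey.pt app.portugalodissey.pt portugalodyssey.pt www.portugalodyssey.pt. Any host printed needs its record created before Change B, otherwise public-fo has no reachable host after the swap. This confirms only that a record exists, not that it points at the VPS.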


4. Migrations — fresh-DB ready

migrate.sh prod (run by deploy-prod BEFORE services start, per ADR-003) creates + migrates the *_prod databases. 10 DBs, 41 migrations total, dbmate forward-only + idempotent → a fresh prod DB applies all from scratch:

calendar 5 · contract 3 · document_signing 1 · experience 9 · iam 2
notification 3 · partner 8 · payment 5 · rag 3 · reviews 2

  • migrate.sh prod auto-loads POSTGRES_PASSWORD from .env.shared (suffix _prod, host postgres, network po-shared-network).
  • First deploy: DBs don't exist → dbmate creates them, applies every migration. No backfill needed (greenfield).
  • Schema check (per CLAUDE.md), run after deploy: psql -c "SELECT table_name FROM information_schema.tables WHERE table_schema='public' ORDER BY table_name" against each *_prod DB, then diff the result against infrastructure/migrations/<db>/.
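
The per-database counts above (10 DBs, 41 migrations) can be re-derived from the repository before deploying. The sketch assumes one .sql file per migration directly under infrastructure/migrations/<db>/, which matches dbmate's usual layout but is an assumption about this repo; count_migrations is a hypothetical helper.

```shell
#!/bin/sh
# count_migrations ROOT: print "<db> <count>" for each subdirectory of ROOT,
# followed by a final "total <n>" line. Only *.sql files are counted.
count_migrations() {
  total=0
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue
    n=$(find "$dir" -maxdepth 1 -type f -name '*.sql' | wc -l | tr -d ' ')
    echo "$(basename "$dir") $n"
    total=$((total + n))
  done
  echo "total $total"
}
```

Example: count_migrations infrastructure/migrations. If the last line does not read total 41, the table in this section is stale and should be updated before the first migrate.sh prod run.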


5. Registry images

Prod pulls :${IMAGE_TAG} (= $CI_COMMIT_SHORT_SHA from CI). deploy-prod does docker compose pull. If a service's image isn't built for the exact SHA (path filter skipped it), pull falls back to :latest (the side-effect rescue valve). Before first prod deploy, confirm the registry has images for web-presence, strapi-cms, ai-service (the marketing-critical trio) at the target tag — push a full pipeline on main if unsure so every service has a fresh :latest.
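
Confirming that the registry holds the marketing-critical images at the target tag can be scripted. The sketch assumes images are named ${REGISTRY_IMAGE}/<service>:<tag> (an assumption; adjust to the real naming) and that the shell is already logged in to the registry. image_exists and missing_images are hypothetical helpers built on docker manifest inspect.

```shell
#!/bin/sh
# image_exists REF: succeed if the registry has a manifest for REF.
image_exists() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# missing_images REGISTRY TAG SERVICE...: print each SERVICE with no image
# at REGISTRY/SERVICE:TAG. Empty output means every image is present.
missing_images() {
  registry=$1
  tag=$2
  shift 2
  for svc in "$@"; do
    image_exists "$registry/$svc:$tag" || echo "$svc"
  done
}
```

Example: missing_images "$REGISTRY_IMAGE" "$IMAGE_TAG" web-presence strapi-cms ai-service. A service printed here would be pulled from :latest instead of the exact SHA, so check that :latest is fresh or re-run the pipeline.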


6. Post-deploy validation (runs the moment prod is up)

Run bash infrastructure/scripts/smoke-prod.sh (added with this doc) + the checklist:

  • [ ] make health-check-prod — all *-prod containers healthy (document-signing-service takes longest: JVM-style .p12 load, retry 6×20s).
  • [ ] Apex serves web-presence with a trusted cert: curl -sI https://portugalodyssey.pt → 200; openssl s_client -connect portugalodyssey.pt:443 -servername portugalodyssey.pt </dev/null | openssl x509 -noout -issuer → Let's Encrypt (NOT TRAEFIK DEFAULT CERT).
  • [ ] https://www.portugalodyssey.pt → 200 (or 301→apex).
  • [ ] https://app.portugalodyssey.pt → serves public-fo (booking app), trusted cert.
  • [ ] Both dual-domain spellings respond (portugalodissey.pt too) or are intentionally parked.
  • [ ] web-presence marketing pages render in PT + EN; RagChat answers (ai-service reachable); Strapi content present.
  • [ ] Append a row to DEPLOYS.md (manual deploy → must log).
  • [ ] Visual: run the parameterized Playwright walk (temp/uat-206/walk.js pattern, BASE=https://portugalodyssey.pt) for the marketing pages — nav, footer-pin, PT locale, hero.

If full-stack: also verify app.* booking flow /start-journey → /checkout (Stripe LIVE — use a real card or Stripe test-clock, and refund).
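
The trusted-cert items in the checklist above come down to inspecting the issuer line. The following sketch is not part of smoke-prod.sh (its contents are not shown here); issuer_ok is a hypothetical helper that reads the output of openssl x509 -noout -issuer on stdin and fails on Traefik's self-signed fallback.

```shell
#!/bin/sh
# issuer_ok: read an `openssl x509 -noout -issuer` line on stdin.
# Succeeds only for a Let's Encrypt issuer; fails for the Traefik
# default certificate or any other issuer.
issuer_ok() {
  issuer=$(cat)
  case "$issuer" in
    *"TRAEFIK DEFAULT CERT"*) return 1 ;;
    *"Let's Encrypt"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Example: openssl s_client -connect app.portugalodyssey.pt:443 -servername app.portugalodyssey.pt </dev/null 2>/dev/null | openssl x509 -noout -issuer | issuer_ok && echo trusted. Repeat for the apex, www, and both spellings of the domain.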


7. Rollback — first-ever deploy has no previous :latest

A normal rollback re-tags the previous :latest and runs up -d --force-recreate. On the first prod deploy there is no previous prod image, so:

  • Fastest rollback = DNS: point portugalodyssey.pt back to the holding page / away from the VPS. Apex traffic stops immediately; qual (*.qual.*) is unaffected.
  • Or stop the stack: docker compose -f production.yml down (keeps *_prod volumes/DBs).
  • Keep web-presence.portugalodyssey.pt (pre-flip subdomain) working during the first hours as a fallback URL if the apex flip misbehaves; don't remove the subdomain router until the apex is proven.
  • Prod DBs persist across down/up (named volumes), so a re-deploy doesn't lose data.


Open items / decisions

  1. §6 decision: full 20-service stack vs web-presence marketing subset for the first deploy (drives whether live Stripe is required now). Owner: operator/José
  2. app.portugalodyssey.pt DNS A record must exist before the apex flip. Owner: operator
  3. Riff #145 / Plan #028 Slice I owns the actual cutover execution + sequencing; this doc is the readiness/checklist companion.