Prod first-deploy readiness — web-presence apex go-public
Status: PREPARATION (no prod deploy yet). production.yml has never been deployed (no .env.prod on the VPS, no *-prod containers, no *_prod databases). The first make prod is intended to be the web-presence-goes-public event during pre-launch (marketing/business). This doc is the turnkey checklist + the validation that runs the moment it lands. Prepared by sA 2026-05-26; nothing here touches secrets or deploys.
Decision needed before first deploy (operator/José): is the first prod deploy the full 20-service stack (make prod), or a web-presence marketing subset (web-presence + strapi-cms + ai-service + ingress/infra, no payment/booking)? This determines whether live Stripe keys are required at launch (the full stack brings up payment-service, which needs live keys to start cleanly) or can be deferred. The rest of this doc assumes the full stack; if the subset is chosen, record it under Open items and skip the full-stack-only checks in §6.
1. Service parity — ✅ COMPLETE
production.yml and qualification.yml are at full 20-service parity
(verified 2026-05-26; review-service mothballed 2026-05-31 per Riff #225 / Wave 3,
removed from both compose files before prod was ever deployed):
admin-console · ai-service · api-gateway · auth-service · calendar-service
contract-service · docling-serve · document-signing-service · experience-service
file-service · keycloak · minio · minio-init · notification-service
partner-console · partner-service · payment-service · public-fo
strapi-cms · web-presence
Riff #205 closed the 5-service gap (partner/contract/calendar/document-signing/web-presence). comm diff of qual-stems vs prod-stems is empty.
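The empty-diff claim can be re-run before deploy. A minimal sketch, assuming each compose file's service keys have been dumped one per line (for example with docker compose -f qualification.yml config --services > qual.txt, or any equivalent dump) and that qual and prod names differ only by a -qual/-prod suffix; the suffix convention, file names and helpers are hypothetical.

```python
"""Hypothetical parity check between two compose service lists (one name per line)."""
import sys
from pathlib import Path

# Assumed environment suffixes; adjust to the real service-key convention.
SUFFIXES = ("-prod", "-qual")


def stem(name: str) -> str:
    """Strip surrounding whitespace and one trailing environment suffix."""
    name = name.strip()
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def parity_gap(qual: list[str], prod: list[str]) -> tuple[set[str], set[str]]:
    """Return (only_in_qual, only_in_prod) by stem; two empty sets mean parity."""
    q = {stem(n) for n in qual if n.strip()}
    p = {stem(n) for n in prod if n.strip()}
    return q - p, p - q


if __name__ == "__main__":
    # Usage: python parity.py qual.txt prod.txt
    only_qual, only_prod = parity_gap(
        Path(sys.argv[1]).read_text().splitlines(),
        Path(sys.argv[2]).read_text().splitlines(),
    )
    print("only in qual:", sorted(only_qual))
    print("only in prod:", sorted(only_prod))
    sys.exit(1 if only_qual or only_prod else 0)
```

The non-zero exit on any gap lets the same check run as a CI gate.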
2. .env.prod required-vars checklist — 48 variables
Every ${VAR} referenced by production.yml. The operator provisions .env.prod (and .env.shared for the DB password — migrate.sh auto-loads POSTGRES_PASSWORD from .env.shared). Cross-check against .env.qual for structure; swap values for prod (live keys, prod DB password). Do not read .env.* into a Claude session — use a plain terminal.
Secrets — must be unique prod values (Tier 2/3)
JWT_SECRET OIDC_CONFIDENTIAL_CLIENT_SECRET
KEYCLOAK_ADMIN_PASSWORD SIGNING_CERT_PASSWORD
POSTGRES_PASSWORD MCP_API_KEYS
RABBITMQ_PASSWORD RAG_INGEST_ADMIN_KEY
REDIS_PASSWORD AI_SERVICE_INGEST_KEY
MINIO_ROOT_PASSWORD_PROD CACHE_BUST_TOKEN
STRAPI_ADMIN_JWT_SECRET STRAPI_JWT_SECRET
STRAPI_API_TOKEN_SALT STRAPI_APP_KEYS
STRAPI_TRANSFER_TOKEN_SALT STRAPI_API_TOKEN
External API credentials — LIVE keys for prod (not test/sandbox)
STRIPE_SECRET_KEY STRIPE_PUBLISHABLE_KEY
STRIPE_WEBHOOK_SECRET STRIPE_WEBHOOK_SECRET_CONNECTED
ANTHROPIC_API_KEY OPENAI_API_KEY
GOOGLE_API_KEY GOOGLE_MAPS_API_KEY
LANGCHAIN_API_KEY YELP_API_KEY
TWILIO_ACCOUNT_SID TWILIO_AUTH_TOKEN
SMTP_PASSWORD
⚠️ Stripe must use LIVE keys for a real go-public. If the first deploy is a marketing subset with no real payments yet, either omit payment-service from the deploy or use test keys and keep payment routes unexposed. Decide per the deployment-scope decision at the top of this doc.
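Whether the Stripe keys in .env.prod are live can be checked without displaying them, because Stripe API keys carry a mode prefix (sk_live_/sk_test_, pk_live_/pk_test_, rk_live_ for restricted keys). A minimal sketch, assuming the operator exports .env.prod into a plain terminal first and that the file is shell-sourceable; the script name and helper are hypothetical. Webhook signing secrets (whsec_) have no mode prefix, so STRIPE_WEBHOOK_SECRET and STRIPE_WEBHOOK_SECRET_CONNECTED cannot be checked this way.

```python
"""Hypothetical guard: confirm Stripe keys are live-mode without printing them."""
import os
import sys

# Stripe key prefixes for live mode; webhook secrets (whsec_) carry no mode and are skipped.
LIVE_PREFIXES = {
    "STRIPE_SECRET_KEY": ("sk_live_", "rk_live_"),
    "STRIPE_PUBLISHABLE_KEY": ("pk_live_",),
}


def stripe_mode_problems(env: dict[str, str]) -> list[str]:
    """One message per key that is unset or not live-mode; values are never echoed."""
    problems = []
    for name, prefixes in LIVE_PREFIXES.items():
        value = env.get(name, "")
        if not value:
            problems.append(f"{name}: not set")
        elif not value.startswith(prefixes):
            problems.append(f"{name}: not a live-mode key")
    return problems


if __name__ == "__main__":
    # Usage (plain terminal): set -a; . ./.env.prod; set +a; python stripe_mode.py
    issues = stripe_mode_problems(dict(os.environ))
    print("\n".join(issues) if issues else "Stripe keys are live-mode")
    sys.exit(1 if issues else 0)
```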
Config / non-secret
IMAGE_TAG (CI injects $CI_COMMIT_SHORT_SHA) REGISTRY_IMAGE
KEYCLOAK_ADMIN POSTGRES_USER RABBITMQ_USER
MINIO_ROOT_USER_PROD MCP_KEYCLOAK_URL LANGCHAIN_PROJECT
LANGCHAIN_TRACING_V2 SMTP_HOST SMTP_PORT SMTP_USER
EMAIL_FROM CONTACT_ADMIN_EMAIL CONTACT_FROM_EMAIL
PAYMENT_SUCCESS_URL PAYMENT_CANCEL_URL
PAYMENT_SUCCESS_URL/PAYMENT_CANCEL_URL must point at the prod public-fo host, not qual: app.portugalodyssey.pt once the §3 flip is applied (until then public-fo is served on the apex).
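The 48-name checklist can drift as production.yml changes, so it may be worth regenerating it. A minimal sketch, to be run by the operator in a plain terminal (it reads .env files but prints variable names only, never values); the script name, helpers, and the assumption that IMAGE_TAG is the only CI-injected variable are hypothetical. Variables with an inline default in production.yml (${VAR:-default}) are reported too, and a key present with an empty value counts as defined.

```python
"""Hypothetical pre-flight: list ${VAR}s used by a compose file that no env file defines.

Prints variable NAMES only, never values.
"""
import re
import sys
from pathlib import Path

# ${VAR}, ${VAR:-default}, ${VAR?err}, ... but not the escaped literal $${VAR}.
VAR_REF = re.compile(r"(?<!\$)\$\{([A-Za-z_][A-Za-z0-9_]*)(?:[:?+-][^}]*)?\}")
# KEY=... or export KEY=...; only the key is captured, the value is ignored.
ENV_KEY = re.compile(r"^[ \t]*(?:export[ \t]+)?([A-Za-z_][A-Za-z0-9_]*)[ \t]*=", re.M)


def referenced_vars(compose_text: str) -> set[str]:
    return set(VAR_REF.findall(compose_text))


def defined_keys(env_text: str) -> set[str]:
    return set(ENV_KEY.findall(env_text))


def missing_vars(compose_text: str, env_texts: list[str],
                 injected: frozenset[str] = frozenset({"IMAGE_TAG"})) -> list[str]:
    """Referenced names not defined in any env file (CI-injected names excluded)."""
    defined = set(injected)
    for text in env_texts:
        defined |= defined_keys(text)
    return sorted(referenced_vars(compose_text) - defined)


if __name__ == "__main__":
    # Usage: python env_check.py production.yml .env.prod .env.shared
    compose_path, *env_paths = sys.argv[1:]
    gaps = missing_vars(Path(compose_path).read_text(),
                        [Path(p).read_text() for p in env_paths])
    print("\n".join(gaps) if gaps else "all referenced variables are defined")
    sys.exit(1 if gaps else 0)
```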
3. Apex cutover (Plan #028 Slice I / Riff #145) — READY DIFF, apply at go-public
Today production.yml routes the apex to public-fo-prod and web-presence to a subdomain. The go-public flip (documented in the web-presence-prod label comment) swaps them. portugalodissey.pt is an intentional dual-domain (see dual-domain-support.md) — mirror it on both routers.
Change A — web-presence-prod → apex (production.yml, web-presence-prod labels):
- - "traefik.http.routers.web-presence-prod.rule=Host(`web-presence.portugalodyssey.pt`)"
+ - "traefik.http.routers.web-presence-prod.rule=Host(`portugalodyssey.pt`) || Host(`www.portugalodyssey.pt`) || Host(`portugalodissey.pt`) || Host(`www.portugalodissey.pt`)"
+ - "traefik.http.routers.web-presence-prod.priority=10"
Change B — public-fo-prod → app.* (production.yml, public-fo-prod labels):
- - "traefik.http.routers.public-fo-prod.rule=Host(`portugalodyssey.pt`) || Host(`www.portugalodyssey.pt`) || Host(`portugalodissey.pt`) || Host(`www.portugalodissey.pt`)"
+ - "traefik.http.routers.public-fo-prod.rule=Host(`app.portugalodyssey.pt`) || Host(`app.portugalodissey.pt`)"
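Both routers use the same dual-domain pattern, so the rule strings can be generated rather than hand-edited, which keeps the two spellings mirrored. A minimal sketch; the helper names are hypothetical, and the OR-ed Host() form is the one already used in the diffs above.

```python
"""Hypothetical generator for the dual-domain Traefik Host() rules in Changes A and B."""

# Both spellings are served on purpose (see dual-domain-support.md).
DOMAINS = ("portugalodyssey.pt", "portugalodissey.pt")


def hosts(prefixes: list[str], domains: tuple[str, ...] = DOMAINS) -> list[str]:
    """Expand subdomain prefixes over every spelling; '' means the bare apex."""
    return [f"{p}.{d}" if p else d for d in domains for p in prefixes]


def host_rule(names: list[str]) -> str:
    """Join hostnames into a Traefik router rule of OR-ed Host() matchers."""
    return " || ".join(f"Host(`{n}`)" for n in names)


if __name__ == "__main__":
    print(host_rule(hosts(["", "www"])))  # web-presence-prod after the flip
    print(host_rule(hosts(["app"])))      # public-fo-prod after the flip
```

The first call reproduces the + rule of Change A and the second the + rule of Change B, so diffing the output against production.yml catches a missed spelling.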
Prerequisites for the flip (Riff #145):
- DNS: portugalodyssey.pt + www already resolve to the VPS. Add app.portugalodyssey.pt (+ app.portugalodissey.pt) A records before flipping, else public-fo loses its host.
- LE cert: Traefik issues certs via the DNS-01 challenge through Cloudflare (shared.yml --certificatesresolvers.letsencrypt.acme.dnschallenge.provider=cloudflare), not HTTP-01 — so issuance needs a valid CF_DNS_API_TOKEN (Zone:DNS:Edit on both apex zones) in .env.shared, and a missing/expired token blocks every cert regardless of HTTP reachability. DNS for the host must still exist. Watch ACME rate limits — see letsencrypt-rate-limit.md. Bring the stack up, confirm cert issuance, THEN announce publicly.
- web-presence content (Strapi) must be Cristina-ready before the public sees the apex.
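The DNS prerequisite can be checked before the flip. A minimal sketch, assuming the records are DNS-only (not proxied) in Cloudflare: a proxied record resolves to Cloudflare edge addresses rather than the VPS and would be reported as mismatched. The host list mirrors the two rules above; the script name and helpers are hypothetical.

```python
"""Hypothetical pre-flip DNS check: each router host should resolve to the VPS IP."""
import socket
import sys

HOSTS = [
    "portugalodyssey.pt", "www.portugalodyssey.pt",
    "portugalodissey.pt", "www.portugalodissey.pt",
    "app.portugalodyssey.pt", "app.portugalodissey.pt",
]


def resolves_to(host: str, expected_ip: str, resolve=socket.gethostbyname_ex) -> bool:
    """True if expected_ip is among the host's IPv4 addresses."""
    try:
        _, _, addresses = resolve(host)
    except OSError:  # socket.gaierror / socket.herror: the name does not resolve
        return False
    return expected_ip in addresses


def unresolved(hosts: list[str], expected_ip: str, resolve=socket.gethostbyname_ex) -> list[str]:
    return [h for h in hosts if not resolves_to(h, expected_ip, resolve)]


if __name__ == "__main__":
    # Usage: python dns_check.py <VPS_IPv4>
    bad = unresolved(HOSTS, sys.argv[1])
    print("missing or not pointing at the VPS:", ", ".join(bad) or "none")
    sys.exit(1 if bad else 0)
```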
This diff is not pre-applied to production.yml: it's gated on web-presence content readiness + app.* DNS, and is Plan #028 Slice I's deliverable. Apply it as the go-public step. Leaving public-fo on the apex until then is correct.
4. Migrations — fresh-DB ready
migrate.sh prod (run by deploy-prod BEFORE services start, per ADR-003) creates + migrates the *_prod databases. 10 DBs, 41 migrations total, dbmate forward-only + idempotent → a fresh prod DB applies all from scratch:
calendar 5 · contract 3 · document_signing 1 · experience 9 · iam 2
notification 3 · partner 8 · payment 5 · rag 3 · reviews 2
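Before the deploy, the table above can be compared with what is actually on disk, so a migration added after this doc was written does not go unnoticed. A minimal sketch, assuming each DB's dbmate migrations are *.sql files directly under infrastructure/migrations/<db>/ and that the directory names match the names in the table (both assumptions; the helpers are hypothetical).

```python
"""Hypothetical check: per-DB migration file counts vs the table in this doc."""
from pathlib import Path

# Counts as listed above: 10 DBs, 41 migrations.
EXPECTED = {
    "calendar": 5, "contract": 3, "document_signing": 1, "experience": 9,
    "iam": 2, "notification": 3, "partner": 8, "payment": 5, "rag": 3,
    "reviews": 2,
}


def count_migrations(root) -> dict[str, int]:
    """Count *.sql files in each immediate subdirectory of root."""
    base = Path(root)
    return {d.name: len(list(d.glob("*.sql"))) for d in base.iterdir() if d.is_dir()}


def drift(actual: dict[str, int], expected: dict[str, int] = EXPECTED) -> dict[str, tuple[int, int]]:
    """{db: (on_disk, expected)} for every DB whose count differs, including missing dirs."""
    names = sorted(actual.keys() | expected.keys())
    return {n: (actual.get(n, 0), expected.get(n, 0))
            for n in names if actual.get(n, 0) != expected.get(n, 0)}


if __name__ == "__main__":
    # Run from the repository root.
    for db, (have, want) in drift(count_migrations("infrastructure/migrations")).items():
        print(f"{db}: {have} migration files on disk, doc lists {want}")
    print(f"doc total: {sum(EXPECTED.values())} migrations in {len(EXPECTED)} DBs")
```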
migrate.sh prod auto-loads POSTGRES_PASSWORD from .env.shared (suffix _prod, host postgres, network po-shared-network).
- First deploy: DBs don't exist → dbmate creates them, applies every migration. No backfill needed (greenfield).
- Post-deploy schema check (per CLAUDE.md): after deploy, run psql -At -c "SELECT table_name FROM information_schema.tables WHERE table_schema='public' ORDER BY table_name" against each *_prod DB and diff the output against infrastructure/migrations/<db>/.
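The diff step can be made concrete. A minimal sketch, assuming dbmate's -- migrate:up / -- migrate:down section markers and plain CREATE TABLE / DROP TABLE statements (renames, views and partitions are not modelled), with the psql query run with -At so it prints one bare table name per line; the script and helper names are hypothetical. dbmate's own schema_migrations table is ignored.

```python
"""Hypothetical post-deploy schema diff for one *_prod database."""
import re
import sys
from pathlib import Path

# Body of each "-- migrate:up" section (up to "-- migrate:down" or end of file).
UP_SECTION = re.compile(r"--\s*migrate:up(.*?)(?:--\s*migrate:down|\Z)", re.S | re.I)
# CREATE TABLE [IF NOT EXISTS] [schema.]name / DROP TABLE [IF EXISTS] [schema.]name
STMT = re.compile(
    r"\b(CREATE|DROP)\s+TABLE\s+(?:IF\s+(?:NOT\s+)?EXISTS\s+)?(?:\w+\.)?\"?(\w+)\"?", re.I
)


def expected_tables(migration_dir) -> set[str]:
    """Replay CREATE/DROP TABLE from the up sections, in filename (timestamp) order."""
    tables: set[str] = set()
    for path in sorted(Path(migration_dir).glob("*.sql")):
        for up in UP_SECTION.findall(path.read_text()):
            for verb, name in STMT.findall(up):
                name = name.lower()  # Postgres folds unquoted identifiers to lower case
                if verb.upper() == "CREATE":
                    tables.add(name)
                else:
                    tables.discard(name)
    return tables


def schema_diff(expected: set[str], actual) -> tuple[list[str], list[str]]:
    """(missing_in_db, not_in_migrations), ignoring dbmate's bookkeeping table."""
    present = set(actual) - {"schema_migrations"}
    return sorted(expected - present), sorted(present - expected)


if __name__ == "__main__":
    # Usage: psql <prod_db> -At -c "SELECT table_name ..." | python schema_diff.py infrastructure/migrations/<db>
    missing, extra = schema_diff(
        expected_tables(sys.argv[1]),
        (line.strip() for line in sys.stdin if line.strip()),
    )
    print("missing in DB:", missing)
    print("not in migrations:", extra)
    sys.exit(1 if missing or extra else 0)
```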
5. Registry images
Prod pulls :${IMAGE_TAG} (= $CI_COMMIT_SHORT_SHA from CI). deploy-prod does docker compose pull. If a service's image isn't built for the exact SHA (path filter skipped it), pull falls back to :latest (the side-effect rescue valve). Before first prod deploy, confirm the registry has images for web-presence, strapi-cms, ai-service (the marketing-critical trio) at the target tag — push a full pipeline on main if unsure so every service has a fresh :latest.
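The fallback rule can be spelled out so the marketing-critical trio is checked explicitly. This is a minimal model of the behaviour described above (exact SHA if built, else :latest), not of the deploy-prod job itself; fetching the tag lists is left out (for example via the Docker Registry HTTP API's GET /v2/<name>/tags/list, or docker manifest inspect per tag), and the helper names are hypothetical.

```python
"""Hypothetical model of the :${IMAGE_TAG} -> :latest fallback for the first prod deploy."""
from typing import Optional

MARKETING_CRITICAL = ("web-presence", "strapi-cms", "ai-service")


def resolve_tag(available: set[str], sha: str) -> Optional[str]:
    """Tag a service would run: the exact SHA if built, else 'latest', else None (pull fails)."""
    if sha in available:
        return sha
    if "latest" in available:
        return "latest"
    return None


def preflight(tags_by_service: dict[str, set[str]], sha: str,
              services: tuple[str, ...] = MARKETING_CRITICAL):
    """Return (tag per service, subset that would NOT run the exact SHA)."""
    report = {s: resolve_tag(tags_by_service.get(s, set()), sha) for s in services}
    not_exact = {s: t for s, t in report.items() if t != sha}
    return report, not_exact
```

Any entry in not_exact runs either an older :latest build or nothing at all; a full pipeline on main for the SHA being deployed should leave it empty.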
6. Post-deploy validation (runs the moment prod is up)
Run bash infrastructure/scripts/smoke-prod.sh (added with this doc) + the checklist:
- [ ] make health-check-prod: all *-prod containers healthy (document-signing-service takes longest: JVM-style .p12 load, retry 6×20s).
- [ ] Apex serves web-presence with a trusted cert: curl -sI https://portugalodyssey.pt → 200; openssl s_client -connect portugalodyssey.pt:443 -servername portugalodyssey.pt </dev/null | openssl x509 -noout -issuer → Let's Encrypt (NOT TRAEFIK DEFAULT CERT).
- [ ] https://www.portugalodyssey.pt → 200 (or 301 → apex).
- [ ] https://app.portugalodyssey.pt → serves public-fo (booking app), trusted cert.
- [ ] Both dual-domain spellings respond (portugalodissey.pt too) or are intentionally parked.
- [ ] web-presence marketing pages render in PT + EN; RagChat answers (ai-service reachable); Strapi content present.
- [ ] Append a row to DEPLOYS.md (manual deploy → must log).
- [ ] Visual: run the parameterized Playwright walk (temp/uat-206/walk.js pattern, BASE=https://portugalodyssey.pt) for the marketing pages: nav, footer-pin, PT locale, hero.
If full-stack: also verify the app.* booking flow /start-journey → /checkout. Stripe is LIVE here, so use a real card and refund the charge afterwards (Stripe test clocks are a test-mode feature and cannot be used with live keys).
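Two of the checks above (HTTP status and a trusted Let's Encrypt issuer) can also be scripted independently of smoke-prod.sh. A minimal sketch using only the Python standard library; the script name, helpers and the accepted status codes (200, or 301 for the www redirect) are assumptions taken from the checklist. Because the default SSL context verifies against the system trust store, Traefik's self-signed default certificate fails the handshake outright instead of being reported as a wrong issuer.

```python
"""Hypothetical smoke check: HTTP status and certificate issuer for each public host."""
import http.client
import socket
import ssl
import sys


def issuer_org(cert: dict):
    """organizationName of the issuer in an SSLSocket.getpeercert() dict, or None."""
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "organizationName":
                return value
    return None


def check(host: str, timeout: float = 10.0):
    """Return (HEAD / status, issuer organization); raises ssl.SSLError on an untrusted cert."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, 443), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            org = issuer_org(tls.getpeercert())
    conn = http.client.HTTPSConnection(host, timeout=timeout, context=ctx)
    try:
        conn.request("HEAD", "/")  # redirects are not followed, so a 301 stays visible
        status = conn.getresponse().status
    finally:
        conn.close()
    return status, org


if __name__ == "__main__":
    # Usage: python tls_smoke.py portugalodyssey.pt www.portugalodyssey.pt app.portugalodyssey.pt
    failed = False
    for host in sys.argv[1:]:
        try:
            status, org = check(host)
        except (OSError, http.client.HTTPException) as exc:  # OSError covers ssl.SSLError, DNS, timeouts
            print(f"FAIL {host}: {exc}")
            failed = True
            continue
        ok = status in (200, 301) and org == "Let's Encrypt"
        failed = failed or not ok
        print(f"{'ok  ' if ok else 'FAIL'} {host}: HTTP {status}, issuer {org}")
    sys.exit(1 if failed else 0)
```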
7. Rollback — first-ever deploy has no previous :latest
A normal rollback re-tags the previous :latest image and runs up -d --force-recreate. On the first prod deploy there is no previous prod image, so:
- Fastest rollback = DNS: point portugalodyssey.pt back to the holding page / away from the VPS. Unless the record is Cloudflare-proxied, resolvers keep the old answer until its TTL expires, so lower the TTL ahead of go-public; qual (*.qual.*) is unaffected.
- Or stop the stack: docker compose -f production.yml down (keeps *_prod volumes/DBs as long as -v/--volumes is not passed).
- Keep web-presence.portugalodyssey.pt (the pre-flip subdomain) working during the first hours as a fallback URL in case the apex flip misbehaves. Change A as written replaces that host, so keep || Host(`web-presence.portugalodyssey.pt`) in the new web-presence-prod rule and drop the subdomain only once the apex is proven.
- Prod DBs persist across down/up (named volumes), so a re-deploy doesn't lose data.
Open items / decisions
- Deployment-scope decision: full 20-service stack vs web-presence marketing subset for the first deploy (drives whether live Stripe is required now). Owner: operator/José.
- app.portugalodyssey.pt (+ app.portugalodissey.pt) DNS A records must exist before the apex flip. Owner: operator.
- Riff #145 / Plan #028 Slice I owns the actual cutover execution + sequencing; this doc is the readiness/checklist companion.