Web-presence → prod launch runbook (subset, subdomain-first)

Step-by-step operator runbook for the first-ever production bring-up of the web-presence marketing site. Decisions locked 2026-06-13 (José):

  • Shape: web-presence subset, not the full 20-service stack. Bring up only the containers the marketing site needs — no payment-service, so no live Stripe keys are exercised at T0.
  • Hostname: subdomain-first, web-presence.portugalodyssey.pt. The apex cutover (portugalodyssey.pt → web-presence, public-fo → app.*) is Plan #028 Slice I / Riff #145, deferred until Cristina's content is ready.

The general 48-variable .env.prod checklist + apex-cutover ready-diff live in prod-first-deploy-readiness.md. This runbook is the narrowed, ordered path for the subset launch.


Co-location on the qual VPS (chosen topology, 2026-06-13)

prod runs on the same VPS as qual (/opt/po-platform), sharing one shared.yml stack (Traefik + Postgres + Redis + RabbitMQ). This is the designed pattern — isolation is by name, not by host, so there are no collisions:

| Resource | qual | prod | Collision |
|---|---|---|---|
| Containers | po-*-qual | po-*-prod | none |
| Postgres (shared po-postgres) | *_qual DBs | *_prod DBs (dbmate creates them via migrate.sh prod) | none |
| Redis (shared po-redis) | DB index 1 | DB index 2 | none |
| RabbitMQ (shared po-rabbitmq) | vhost portugal_odyssey_qual | vhost portugal_odyssey_prod (already seeded in definitions.json) | none |
| Traefik (one po-traefik) | routes *.qual.* | routes web-presence.portugalodyssey.pt | none |
| Host ports | none (all behind Traefik) | none (all behind Traefik) | none |
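
The naming scheme in the table can be captured in a small helper so that ad-hoc commands target the right environment. This is a minimal sketch: the function name po_env_names is hypothetical (it is not part of the repo), and the values are copied from the table above.

```shell
#!/usr/bin/env bash
# po_env_names ENV: print the per-environment names from the co-location table
# for "qual" or "prod". Hypothetical helper, for illustration only.
po_env_names() {
  local env="$1" redis_db
  case "$env" in
    qual) redis_db=1 ;;
    prod) redis_db=2 ;;
    *) echo "unknown environment: $env" >&2; return 1 ;;
  esac
  printf 'containers=po-*-%s\n' "$env"
  printf 'postgres_dbs=*_%s\n' "$env"
  printf 'redis_db=%s\n' "$redis_db"
  printf 'rabbitmq_vhost=portugal_odyssey_%s\n' "$env"
}
```

For example, `po_env_names prod` lists Redis DB index 2 and the portugal_odyssey_prod vhost, which is what a prod-only inspection command should point at.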

This pre-satisfies several steps below:

  • Step 0 / subtask 1 (bootstrap + make networks): SKIP. The repo is already at /opt/po-platform, and traefik-public + po-shared-network already exist.
  • Step 1 / subtask 2 (shared stack + CF_DNS_API_TOKEN): ALREADY DONE. The shared stack is running and .env.shared already carries a working CF_DNS_API_TOKEN (qual's *.qual.* wildcard LE cert proves DNS-01 works).
  • Subtask 5 (DEPLOY_PROD_HOST): set it to the same host value as DEPLOY_QUAL_HOST.

⚠️ Resource pressure — the one real risk. This VPS (Hostinger) hit 91% CPU steal / throttling on 2026-05-18 when the CI runner co-located here (the reason the runner moved to the dev laptop). The subset launch (~6 prod containers, not 20) is the mitigation. After bring-up, watch vmstat 5 / mpstat -P ALL — if %steal climbs and containers flap unhealthy, that's the known cascade: stop the prod subset and reassess sizing before retrying.
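
To put a number on %steal without watching vmstat by eye, two samples of the aggregate cpu line in /proc/stat can be compared. The sketch below assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name steal_pct is hypothetical.

```shell
#!/usr/bin/env bash
# steal_pct BEFORE AFTER: each argument is one aggregate "cpu ..." line from
# /proc/stat. Prints the integer percentage of CPU time stolen by the
# hypervisor between the two samples.
steal_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, " "); split(b, y, " ")
    # Index 1 is the "cpu" label; 2..9 are user nice system idle iowait irq softirq steal.
    for (i = 2; i <= 9; i++) total += y[i] - x[i]
    steal = y[9] - x[9]
    if (total <= 0) { print 0; exit }
    printf "%d\n", (steal * 100) / total
  }'
}
```

On the VPS, take one sample with `grep '^cpu ' /proc/stat`, wait five seconds, take a second, and pass both lines to steal_pct. What threshold counts as trouble is a judgment call; the 91% incident above is the known bad case.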


What's already done in-repo (session sA · Douro, 2026-06-13)

These prod blockers were code-fixed and merged — no operator action needed:

| ID | Fix | Commit |
|---|---|---|
| B1 | deploy-prod now ssh-keyscans $DEPLOY_PROD_HOST (was scanning the qual host → first deploy would die on Host key verification failed) | 68b7252 |
| B2 | web-presence chat/contact URLs are rewritten at container start to the per-env value (one image serves qual+prod); prod compose injects the prod gateway URLs | dab2d4e |
| B3 | deploy-prod has a real post-deploy smoke (HTTP + crash-loop + .env perms), replacing sleep 10 && ps | 68b7252 |
| B4 | env.production.template gained STRAPI_API_TOKEN + SIGNING_CERT_PASSWORD (were referenced by production.yml but missing from the template) | d25ddb0 |
| B5 | readiness doc: DNS-01 (not HTTP-01) cert challenge + 20-service count (review-service mothballed) | d25ddb0 |
| B6 | api-gateway-prod CORS_ORIGIN now allows web-presence.portugalodyssey.pt (else chat + contact die on CORS preflight) | dab2d4e |

The subset: web-presence-prod + api-gateway-prod (chat/contact proxy) + ai-service-prod (RAG chat backend) + notification-service-prod (contact email) + strapi-cms-prod (RAG content) + the shared shared.yml stack (Traefik, Postgres, Redis, RabbitMQ). docling-serve is a soft dep of ai-service ingest.


Operator critical path

0. One-time prod VPS bootstrap (if the VPS is fresh)

# On the prod VPS, as the deploy user:
sudo mkdir -p /opt/po-platform && sudo chown "$(id -un):$(id -gn)" /opt/po-platform && cd /opt/po-platform   # deploy user must own the checkout
git clone <repo> . && git checkout main          # deploy-prod does git reset --hard origin/main
make networks                                     # creates traefik-public + po-shared-network
# Copy the signing cert (document-signing-service mount, ro):
#   secrets/signing/certificate.p12   (mode 644)

1. Shared stack (shared.yml) — Traefik + datastores

production.yml's networks are external: true; the shared stack must be up first.

cd /opt/po-platform
# .env.shared MUST have CF_DNS_API_TOKEN uncommented (Zone:DNS:Edit on BOTH
# portugalodyssey.pt and portugalodissey.pt) — Traefik uses the DNS-01 challenge,
# so a missing/expired token blocks EVERY cert. See env.shared.template:33.
docker compose -f infrastructure/compose/shared.yml --env-file .env.shared up -d
docker compose -f infrastructure/compose/shared.yml --env-file .env.shared ps   # all healthy

2. Provision .env.prod

Path: .env.prod lives at the repo root /opt/po-platform/.env.prod (production.yml's env_file: ../../.env.prod + the deploy job's --env-file .env.prod from /opt/po-platform). NOT under infrastructure/compose/.

Use the scaffold script. It copies the shared Postgres/Redis/RabbitMQ passwords from infrastructure/compose/.env.shared (they MUST match the running shared stack) and generates the internal secrets. Run with --reuse-qual-keys, it also reuses qual's account-level LLM/Resend keys (.env.qual <KEY>_QUAL), leaving nothing for you to type. It never prints a secret value, only a shape report.

cd /opt/po-platform && git pull --ff-only          # get the script + latest compose
bash infrastructure/scripts/scaffold-env-prod.sh   # writes /opt/po-platform/.env.prod (chmod 600)
nano .env.prod                                      # fill the 4 REQUIRED keys it flags:
#   OPENAI_API_KEY · ANTHROPIC_API_KEY · GOOGLE_API_KEY   (RAG chat)
#   SMTP_PASSWORD                                         (Resend — contact-form email)
# Everything else (Stripe/Keycloak/Twilio/Maps/signing/STRAPI_API_TOKEN) stays
# REPLACE_ME — those services aren't in the subset; STRAPI_API_TOKEN is step 6.
bash infrastructure/scripts/scaffold-env-prod.sh --check   # re-verify shape (no values printed)
Manual inspection, if ever needed: shape only, never cat it in a Claude session.
awk -F= '/^[A-Z][A-Z0-9_]*=/{print $1 "=" (length($0) - length($1) - 1) " chars"}' /opt/po-platform/.env.prod   # counts the whole value, even if it contains "="
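
As a complement to the shape report, the four REQUIRED keys can be checked for leftover placeholders without echoing any value. This is a hedged sketch, not the logic of scaffold-env-prod.sh --check: the function name missing_required_keys is hypothetical, and it assumes the placeholder is the literal string REPLACE_ME and that values are unquoted.

```shell
#!/usr/bin/env bash
# missing_required_keys FILE: print each required key that is absent, empty,
# or still set to the REPLACE_ME placeholder. Values are never printed.
# Returns 1 if any key is missing, 0 otherwise.
missing_required_keys() {
  local file="$1" key value status=0
  for key in OPENAI_API_KEY ANTHROPIC_API_KEY GOOGLE_API_KEY SMTP_PASSWORD; do
    # Strip only the "KEY=" prefix so values containing "=" stay intact;
    # if a key is defined twice, the last definition wins.
    value="$(sed -n "s/^${key}=//p" "$file" | tail -n 1)"
    if [ -z "$value" ] || [ "$value" = "REPLACE_ME" ]; then
      echo "$key"
      status=1
    fi
  done
  return "$status"
}
```

Running `missing_required_keys /opt/po-platform/.env.prod` prints only key names, so its output is safe to paste into a ticket or chat.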

3. DNS (Cloudflare)

web-presence.portugalodyssey.pt   A   <prod VPS IP>
(Apex portugalodyssey.pt / app.* records belong to Slice I — not needed now.)
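
Before deploying, it can help to confirm that the record has propagated. The sketch below is a small pipeline filter fed by `dig +short`; the function name a_record_matches and the variable PROD_VPS_IP are hypothetical. It assumes the record is DNS-only: if it is proxied through Cloudflare, the resolved addresses would be Cloudflare's rather than the VPS's.

```shell
#!/usr/bin/env bash
# a_record_matches EXPECTED_IP: read resolved addresses on stdin (one per line,
# as printed by `dig +short A <host>`) and succeed only if EXPECTED_IP is
# among them. Whole-line, fixed-string match, so 1.2.3.4 never matches 1.2.3.40.
a_record_matches() {
  local expected="$1"
  grep -qxF -- "$expected"
}

# Usage on an operator machine (PROD_VPS_IP is a placeholder you set yourself):
#   dig +short A web-presence.portugalodyssey.pt | a_record_matches "$PROD_VPS_IP" \
#     && echo "A record OK" || echo "A record missing or stale"
```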

4. GitLab CI variables (Settings → CI/CD → Variables)

| Variable | Value | Notes |
|---|---|---|
| DEPLOY_PROD_HOST | prod VPS IP/host | required; B1 keyscans this |
| SSH_PRIVATE_KEY (file) | key with access to prod VPS | verify it's in prod authorized_keys |
| DEPLOY_USER | prod deploy user | reuse qual's if same |

Prod web-presence URLs are not CI variables — B2 injects them via production.yml's web-presence-prod.environment, so the single registry image works for both qual and prod. Nothing to set here for URLs.

5. Build + deploy the subset

# Ensure :latest images exist in the registry (push main / run a pipeline first).
cd /opt/po-platform && git reset --hard origin/main
export DOCKER_CONFIG=/tmp/po-deploy-docker && mkdir -p "$DOCKER_CONFIG"
docker login registry.gitlab.com -u <user>        # prompts for the token, keeping it out of shell history
export IMAGE_TAG=latest

# Create the *_prod databases + run migrations on the shared Postgres FIRST.
# (The CI deploy-prod job does this automatically; a manual subset deploy must
# run it by hand — dbmate creates each *_prod DB then applies migrations.)
bash infrastructure/scripts/migrate.sh prod

# SUBSET up — name the services explicitly (NOT a bare `up -d`, which starts all 20).
# Traefik + datastores already came up with shared.yml in step 1. Note: --env-file
# is relative to the CWD (/opt/po-platform), so the repo-root .env.prod.
docker compose -f infrastructure/compose/production.yml --env-file .env.prod \
  up -d strapi-cms-prod ai-service-prod docling-serve-prod notification-service-prod \
        api-gateway-prod web-presence-prod

Clicking the GitLab deploy-prod manual job instead does a full-stack up -d --remove-orphans (all 20) + runs migrations + the B3 smoke. Use that only when you intend the full stack — for the subset, deploy named services manually as above.
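
Since a bare up -d or the CI job would start all 20 services, a quick guard can confirm that only the subset is running. A minimal sketch, assuming Docker Compose v2 (for `ps --services --status running`); the function name unexpected_services is hypothetical, and the service list is copied from the subset command above.

```shell
#!/usr/bin/env bash
# Services named in the subset `up -d` command.
SUBSET="strapi-cms-prod ai-service-prod docling-serve-prod notification-service-prod api-gateway-prod web-presence-prod"

# unexpected_services: read running service names on stdin (one per line) and
# print any that are not part of the subset. Returns 1 if any were found.
unexpected_services() {
  local svc found=0
  while IFS= read -r svc; do
    [ -z "$svc" ] && continue
    case " $SUBSET " in
      *" $svc "*) ;;                     # expected subset member
      *) echo "$svc"; found=1 ;;         # running, but not in the subset
    esac
  done
  return "$found"
}

# Usage from /opt/po-platform:
#   docker compose -f infrastructure/compose/production.yml --env-file .env.prod \
#     ps --services --status running | unexpected_services
```

An empty result means nothing outside the subset is up. Any name it prints is a prod container adding load to the shared VPS that the subset launch was meant to avoid.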

6. Post-first-start: Strapi API token

# Strapi can't pre-generate it. After strapi-cms-prod is up:
#   https://cms.portugalodyssey.pt/admin  → Settings → API Tokens → read-only
# Paste into .env.prod as STRAPI_API_TOKEN, then:
docker compose -f infrastructure/compose/production.yml --env-file .env.prod \
  up -d --force-recreate ai-service-prod

7. Smoke

# In-VPS health:
docker compose -f infrastructure/compose/production.yml --env-file .env.prod ps   # subset healthy, none restarting
# External (subdomain — override the apex-default in smoke-prod.sh):
APEX=web-presence.portugalodyssey.pt bash infrastructure/scripts/smoke-prod.sh   # checks [1][2][5]
curl -sI https://web-presence.portugalodyssey.pt/            # → 200, LE cert (not TRAEFIK DEFAULT)
Then a real browser walk: load the site, ask the chat a question (hits api.portugalodyssey.pt/public/ai/chat), submit the contact form. Both must succeed with no CORS error in the console.
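
The CORS half of the browser walk can also be probed from a terminal by sending a preflight request and inspecting the response headers. This is a sketch under assumptions: the function name cors_allows is hypothetical, and the POST request method in the usage comment is a guess at what the chat endpoint expects.

```shell
#!/usr/bin/env bash
# cors_allows ORIGIN: read HTTP response headers on stdin and succeed only if
# an Access-Control-Allow-Origin header equals ORIGIN (or the wildcard "*").
cors_allows() {
  local origin="$1"
  tr -d '\r' | awk -v origin="$origin" '
    tolower($0) ~ /^access-control-allow-origin:/ {
      sub(/^[^:]*:[ \t]*/, "")           # drop the header name, keep the value
      if ($0 == origin || $0 == "*") ok = 1
    }
    END { exit !ok }'
}

# Usage (preflight against the chat endpoint; POST is an assumption):
#   curl -s -o /dev/null -D - -X OPTIONS \
#     -H 'Origin: https://web-presence.portugalodyssey.pt' \
#     -H 'Access-Control-Request-Method: POST' \
#     https://api.portugalodyssey.pt/public/ai/chat \
#     | cors_allows https://web-presence.portugalodyssey.pt
```

A terminal check does not replace the browser walk, since only the browser enforces the policy end to end, but it fails fast if the B6 origin is missing from the deployed gateway.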

smoke-prod.sh checks [3][4][6] assume the apex/app split — they will report failures that are expected pre-Slice-I. Only [1][2][5] are meaningful for the subdomain launch.
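
The "LE cert (not TRAEFIK DEFAULT)" expectation can be checked explicitly by reading the certificate issuer. A minimal sketch: the function name issuer_is_lets_encrypt is hypothetical, and it matches on the issuer organisation as printed by `openssl x509 -noout -issuer`.

```shell
#!/usr/bin/env bash
# issuer_is_lets_encrypt: read an `openssl x509 -noout -issuer` line on stdin
# and succeed only if the issuer organisation is Let's Encrypt. The pattern
# accepts both "O = Let's Encrypt" and the older "O=Let's Encrypt" spelling.
issuer_is_lets_encrypt() {
  grep -q "O *= *Let's Encrypt"
}

# Usage:
#   openssl s_client -connect web-presence.portugalodyssey.pt:443 \
#     -servername web-presence.portugalodyssey.pt </dev/null 2>/dev/null \
#     | openssl x509 -noout -issuer | issuer_is_lets_encrypt \
#     && echo "LE cert" || echo "not an LE cert (Traefik default?)"
```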

8. Log it

Append a row to DEPLOYS.md (manual deploys must be logged).


Later: apex cutover (Slice I / Riff #145)

When Cristina's content is signed off and app.portugalodyssey.pt DNS exists, apply the Slice I diff (swap web-presence-prod Host rule to apex, move public-fo to app.*), redeploy, and run the full smoke-prod.sh (all 6 checks). CORS already lists the apex. The web-presence.* CORS origin (B6) can stay as a redirect-safety net.