Web-presence → prod launch runbook (subset, subdomain-first)
Step-by-step operator runbook for the first-ever production bring-up of the web-presence marketing site. Decisions locked 2026-06-13 (José):
- Shape: web-presence subset, not the full 20-service stack. Bring up only the containers the marketing site needs — no payment-service, so no live Stripe keys are exercised at T0.
- Hostname: subdomain-first, web-presence.portugalodyssey.pt. The apex cutover (portugalodyssey.pt → web-presence, public-fo → app.*) is Plan #028 Slice I / Riff #145, deferred until Cristina's content is ready.
The general 48-variable .env.prod checklist + apex-cutover ready-diff live in prod-first-deploy-readiness.md. This runbook is the narrowed, ordered path for the subset launch.
Co-location on the qual VPS (chosen topology, 2026-06-13)
prod runs on the same VPS as qual (/opt/po-platform), sharing one shared.yml
stack (Traefik + Postgres + Redis + RabbitMQ). This is the designed pattern —
isolation is by name, not by host, so there are no collisions:
| Resource | qual | prod | Collision |
|---|---|---|---|
| Containers | po-*-qual | po-*-prod | none |
| Postgres (shared po-postgres) | *_qual DBs | *_prod DBs (dbmate creates them via migrate.sh prod) | none |
| Redis (shared po-redis) | DB index 1 | DB index 2 | none |
| RabbitMQ (shared po-rabbitmq) | vhost portugal_odyssey_qual | vhost portugal_odyssey_prod (already seeded in definitions.json) | none |
| Traefik (one po-traefik) | routes *.qual.* | routes web-presence.portugalodyssey.pt | none |
| Host ports | — | none (all behind Traefik) | none |
This pre-satisfies several steps below:
- Step 0 / subtask 1 (bootstrap + make networks): SKIP — repo is already at
/opt/po-platform and traefik-public + po-shared-network already exist.
- Step 1 / subtask 2 (shared stack + CF_DNS_API_TOKEN): ALREADY DONE — the
shared stack is running and .env.shared already carries a working
CF_DNS_API_TOKEN (qual's *.qual.* wildcard LE cert proves DNS-01 works).
- Subtask 5 (DEPLOY_PROD_HOST): = the same host value as DEPLOY_QUAL_HOST.
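The "already exist" claim for the two Docker networks can be confirmed before relying on it. A minimal sketch, assuming only the two network names this runbook mentions; the helper name missing_networks is hypothetical and not part of the repo:

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): reads network names on stdin, one per
# line, and prints each required network that is absent.
missing_networks() {
  local existing required
  existing=$(cat)
  for required in traefik-public po-shared-network; do
    # -x matches the whole line, -F treats the name as a literal string
    grep -qxF "$required" <<<"$existing" || echo "$required"
  done
}

# Usage on the VPS (prints nothing when both networks exist):
#   docker network ls --format '{{.Name}}' | missing_networks
```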
⚠️ Resource pressure: the one real risk. This VPS (Hostinger) hit 91% CPU steal / throttling on 2026-05-18 when the CI runner was co-located here (the reason the runner moved to the dev laptop). The subset launch (~6 prod containers, not 20) is the mitigation. After bring-up, watch vmstat 5 / mpstat -P ALL: if %steal climbs and containers flap unhealthy, that's the known cascade. Stop the prod subset and reassess sizing before retrying.
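If vmstat or mpstat is not installed, the same steal figure can be derived from /proc/stat. The sketch below is an illustration rather than part of the repo: steal_pct is a hypothetical helper, and it assumes the standard Linux field order on the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal).

```bash
#!/usr/bin/env bash
# Sketch: percentage of CPU time stolen between two samples of the aggregate
# "cpu" line of /proc/stat.
steal_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { total = 0
      for (i = 2; i <= 9; i++) total += $i   # user .. steal
      t[NR] = total; s[NR] = $9 }            # field 9 is the steal counter
    END {
      d = t[2] - t[1]
      if (d <= 0) { print 0; exit }
      printf "%d\n", (s[2] - s[1]) * 100 / d
    }'
}

# Usage: sample twice, five seconds apart (the interval is an arbitrary choice):
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   echo "steal: $(steal_pct "$a" "$b")%"
```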
What's already done in-repo (session sA · Douro, 2026-06-13)
These prod blockers were code-fixed and merged — no operator action needed:
| ID | Fix | Commit |
|---|---|---|
| B1 | deploy-prod now ssh-keyscans $DEPLOY_PROD_HOST (was scanning the qual host → first deploy would die on Host key verification failed) | 68b7252 |
| B2 | web-presence chat/contact URLs are rewritten at container start to the per-env value (one image serves qual+prod); prod compose injects the prod gateway URLs | dab2d4e |
| B3 | deploy-prod has a real post-deploy smoke (HTTP + crash-loop + .env perms), replacing sleep 10 && ps | 68b7252 |
| B4 | env.production.template gained STRAPI_API_TOKEN + SIGNING_CERT_PASSWORD (were referenced by production.yml but missing from the template) | d25ddb0 |
| B5 | readiness doc: DNS-01 (not HTTP-01) cert challenge + 20-service count (review-service mothballed) | d25ddb0 |
| B6 | api-gateway-prod CORS_ORIGIN now allows web-presence.portugalodyssey.pt (else chat + contact die on CORS preflight) | dab2d4e |
The subset: web-presence-prod + api-gateway-prod (chat/contact proxy) +
ai-service-prod (RAG chat backend) + notification-service-prod (contact email) +
strapi-cms-prod (RAG content) + the shared shared.yml stack (Traefik, Postgres,
Redis, RabbitMQ). docling-serve is a soft dep of ai-service ingest.
Operator critical path
0. One-time prod VPS bootstrap (if the VPS is fresh)
# On the prod VPS, as the deploy user:
sudo mkdir -p /opt/po-platform && sudo chown "$(id -un):$(id -gn)" /opt/po-platform && cd /opt/po-platform
git clone <repo> . && git checkout main # deploy-prod does git reset --hard origin/main
make networks # creates traefik-public + po-shared-network
# Copy the signing cert (document-signing-service mount, ro):
# secrets/signing/certificate.p12 (mode 644)
1. Shared stack (shared.yml): Traefik + datastores
production.yml's networks are external: true; the shared stack must be up first.
cd /opt/po-platform
# .env.shared MUST have CF_DNS_API_TOKEN uncommented (Zone:DNS:Edit on BOTH
# portugalodyssey.pt and portugalodissey.pt) — Traefik uses the DNS-01 challenge,
# so a missing/expired token blocks EVERY cert. See env.shared.template:33.
docker compose -f infrastructure/compose/shared.yml --env-file .env.shared up -d
docker compose -f infrastructure/compose/shared.yml ps # all healthy
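The "all healthy" check can be made scriptable by filtering container status strings. This is a sketch, not repo code: not_ok_containers is a hypothetical helper, and the po- name prefix is assumed from the container names used in this runbook. A container counts as OK when its status starts with "Up" and is neither unhealthy nor still starting, so containers without a healthcheck pass as long as they are running.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "name|status" lines on stdin and prints the names
# of containers that are not up, are unhealthy, or are still starting.
not_ok_containers() {
  awk -F'|' '$2 !~ /^Up / || $2 ~ /\(unhealthy\)|\(health: starting\)/ { print $1 }'
}

# Usage on the VPS (prints nothing when every po-* container is fine):
#   docker ps -a --filter name=po- --format '{{.Names}}|{{.Status}}' | not_ok_containers
```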
2. Provision .env.prod
Path: .env.prod lives at the repo root /opt/po-platform/.env.prod
(production.yml's env_file: ../../.env.prod + the deploy job's --env-file .env.prod
from /opt/po-platform). NOT under infrastructure/compose/.
Use the scaffold script. It copies the shared Postgres/Redis/RabbitMQ passwords from
infrastructure/compose/.env.shared (they MUST match the running shared stack) and generates the
internal secrets. Run with --reuse-qual-keys, it also reuses qual's account-level LLM/Resend keys
(.env.qual <KEY>_QUAL) and leaves nothing for you to type; without the flag, fill the required keys
it flags by hand. It never prints a secret value, only a shape report.
cd /opt/po-platform && git pull --ff-only # get the script + latest compose
bash infrastructure/scripts/scaffold-env-prod.sh # writes /opt/po-platform/.env.prod (chmod 600)
nano .env.prod # fill the 4 REQUIRED keys it flags:
# OPENAI_API_KEY · ANTHROPIC_API_KEY · GOOGLE_API_KEY (RAG chat)
# SMTP_PASSWORD (Resend — contact-form email)
# Everything else (Stripe/Keycloak/Twilio/Maps/signing/STRAPI_API_TOKEN) stays
# REPLACE_ME — those services aren't in the subset; STRAPI_API_TOKEN is step 6.
bash infrastructure/scripts/scaffold-env-prod.sh --check # re-verify shape (no values printed)
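To check the file's shape without exposing values (never cat it in a Claude session), print only key names and value lengths:
awk -F= '/^[A-Z]/{k=$1; sub(/^[^=]*=/,""); print k"="length($0)" chars"}' /opt/po-platform/.env.prod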
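A quick way to confirm that the four required keys were actually filled, without printing any value. This is a sketch under stated assumptions: unfilled_required is a hypothetical helper, the placeholder is assumed to be the literal string REPLACE_ME, and values are assumed to be unquoted KEY=value lines.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the names (never the values) of required keys
# that are absent, empty, or still set to the REPLACE_ME placeholder.
unfilled_required() {
  local env_file=$1 key value
  for key in OPENAI_API_KEY ANTHROPIC_API_KEY GOOGLE_API_KEY SMTP_PASSWORD; do
    # Take the last assignment of the key; cut -f2- keeps any '=' in the value
    value=$(grep -E "^${key}=" "$env_file" | tail -n 1 | cut -d= -f2-)
    if [ -z "$value" ] || [ "$value" = "REPLACE_ME" ]; then
      echo "$key"
    fi
  done
}

# Usage (prints nothing when all four keys are set):
#   unfilled_required /opt/po-platform/.env.prod
```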
3. DNS (Cloudflare)
(Apex portugalodyssey.pt / app.* records belong to Slice I; not needed now.)
4. GitLab CI variables (Settings → CI/CD → Variables)
| Variable | Value | Notes |
|---|---|---|
| DEPLOY_PROD_HOST | prod VPS IP/host | required; B1 keyscans this |
| SSH_PRIVATE_KEY (file) | key with access to prod VPS | verify it's in prod authorized_keys |
| DEPLOY_USER | prod deploy user | reuse qual's if same |
Prod web-presence URLs are not CI variables. B2 injects them via production.yml's web-presence-prod.environment, so the single registry image works for both qual and prod. Nothing to set here for URLs.
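A deploy script can fail fast when one of the three variables from the table is unset, rather than dying later inside ssh. The sketch below is illustrative only: require_vars is a hypothetical helper, not something the deploy-prod job is known to contain.

```bash
#!/usr/bin/env bash
# Hypothetical guard: returns non-zero and names each variable that is unset
# or empty. Uses bash indirect expansion (${!name}).
require_vars() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage at the top of a deploy script:
#   require_vars DEPLOY_PROD_HOST SSH_PRIVATE_KEY DEPLOY_USER || exit 1
```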
5. Build + deploy the subset
# Ensure :latest images exist in the registry (push main / run a pipeline first).
cd /opt/po-platform && git reset --hard origin/main
export DOCKER_CONFIG=/tmp/po-deploy-docker && mkdir -p "$DOCKER_CONFIG"
docker login registry.gitlab.com -u <user> -p <token>
export IMAGE_TAG=latest
# Create the *_prod databases + run migrations on the shared Postgres FIRST.
# (The CI deploy-prod job does this automatically; a manual subset deploy must
# run it by hand — dbmate creates each *_prod DB then applies migrations.)
bash infrastructure/scripts/migrate.sh prod
# SUBSET up — name the services explicitly (NOT a bare `up -d`, which starts all 20).
# Traefik + datastores already came up with shared.yml in step 1. Note: --env-file
# is relative to the CWD (/opt/po-platform), so the repo-root .env.prod.
docker compose -f infrastructure/compose/production.yml --env-file .env.prod \
up -d strapi-cms-prod ai-service-prod docling-serve-prod notification-service-prod \
api-gateway-prod web-presence-prod
Clicking the GitLab deploy-prod manual job instead does a full-stack up -d --remove-orphans (all 20) + runs migrations + the B3 smoke. Use that only when you intend the full stack; for the subset, deploy named services manually as above.
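Because a bare up -d would start all 20 services, a small wrapper that refuses to run without explicit service names can guard against the slip. This is a sketch, not repo tooling: subset_up and the DRY_RUN switch are hypothetical names introduced here.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: refuses to run `up` without explicit service names.
subset_up() {
  if [ "$#" -eq 0 ]; then
    echo "refusing bare 'up -d': name the subset services explicitly" >&2
    return 1
  fi
  # With DRY_RUN set to a non-empty value, print the command instead of running it.
  ${DRY_RUN:+echo} docker compose -f infrastructure/compose/production.yml \
    --env-file .env.prod up -d "$@"
}

# Usage (from /opt/po-platform):
#   subset_up strapi-cms-prod ai-service-prod docling-serve-prod \
#     notification-service-prod api-gateway-prod web-presence-prod
```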
6. Post-first-start: Strapi API token
# Strapi can't pre-generate it. After strapi-cms-prod is up:
# https://cms.portugalodyssey.pt/admin → Settings → API Tokens → read-only
# Paste into .env.prod as STRAPI_API_TOKEN, then:
docker compose -f infrastructure/compose/production.yml --env-file .env.prod \
up -d --force-recreate ai-service-prod
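Pasting the token by hand works; the sketch below is one way to set it without opening an editor or echoing the value. set_env_key is a hypothetical helper, not part of the repo, and it assumes plain KEY=value lines. It rewrites the file through a temporary file in the same directory and keeps mode 600.

```bash
#!/usr/bin/env bash
# Hypothetical helper: replaces KEY=... in an env file, or appends the key if
# it is absent. Values pass through the environment so awk does not interpret
# backslashes in them.
set_env_key() {
  local file=$1 key=$2 value=$3 tmp
  tmp=$(mktemp "${file}.XXXXXX")   # same directory, so mv is a plain rename
  chmod 600 "$tmp"
  KEY="$key" VALUE="$value" awk -F= '
    $1 == ENVIRON["KEY"] { print ENVIRON["KEY"] "=" ENVIRON["VALUE"]; seen = 1; next }
    { print }
    END { if (!seen) print ENVIRON["KEY"] "=" ENVIRON["VALUE"] }
  ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Usage: read the token without echoing it, then recreate ai-service-prod:
#   read -rsp 'Strapi token: ' token && echo
#   set_env_key /opt/po-platform/.env.prod STRAPI_API_TOKEN "$token"
```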
7. Smoke
# In-VPS health:
docker compose -f infrastructure/compose/production.yml --env-file .env.prod ps # subset healthy, none restarting
# External (subdomain — override the apex-default in smoke-prod.sh):
APEX=web-presence.portugalodyssey.pt bash infrastructure/scripts/smoke-prod.sh # checks [1][2][5]
curl -sI https://web-presence.portugalodyssey.pt/ # → 200, LE cert (not TRAEFIK DEFAULT)
Then, in a browser: open https://web-presence.portugalodyssey.pt/, send a chat message (it calls
api.portugalodyssey.pt/public/ai/chat), and submit the contact form. Both must
succeed with no CORS error in the console.
smoke-prod.sh checks [3][4][6] assume the apex/app split, so they will report failures that are expected pre-Slice-I. Only [1][2][5] are meaningful for the subdomain launch.
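curl -sI prints only the response headers, so it does not reveal which certificate was served. A minimal sketch for that half of the check, assuming openssl is installed on the machine running the smoke; cert_names and is_default_cert are hypothetical helpers, not part of smoke-prod.sh:

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the subject and issuer of the certificate a host
# serves on port 443.
cert_names() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer
}

# Hypothetical helper: succeeds when the text names Traefik's fallback
# certificate (CN "TRAEFIK DEFAULT CERT").
is_default_cert() {
  case "$1" in
    *"TRAEFIK DEFAULT"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   names=$(cert_names web-presence.portugalodyssey.pt)
#   is_default_cert "$names" && echo "still serving Traefik's fallback cert"
```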
8. Log it
Append a row to DEPLOYS.md (manual deploys must be logged).
Later: apex cutover (Slice I / Riff #145)
When Cristina's content is signed off and app.portugalodyssey.pt DNS exists, apply
the Slice I diff (swap web-presence-prod Host rule to apex, move public-fo to
app.*), redeploy, and run the full smoke-prod.sh (all 6 checks). CORS already
lists the apex. The web-presence.* CORS origin (B6) can stay as a redirect-safety net.