UptimeRobot: external uptime probes

External monitoring layer for the qualification environment. Replaces the Prometheus + Grafana stack retired in W1-2 (ADR-002).

Why external: an in-cluster monitoring stack can't tell you the cluster is down. UptimeRobot probes from outside the VPS, so any failure mode (network, Traefik, container crash, DNS) shows up as a missed probe.

Account

Provider: UptimeRobot (uptimerobot.com)
Tier: Free (sufficient for the 7 monitors below at 5-minute intervals, with email alerts)
Credentials: store the login and API key in your password manager under po-platform / uptimerobot
API key location: My Settings → API Settings → Main API Key

Monitors

The po- prefix keeps everything grouped in the UptimeRobot dashboard. Friendly names follow po-<service>-<env> so future prod monitors slot in cleanly (po-public-fo-prod, etc.).

Friendly name	URL	Type	Anchor / keyword
`po-public-fo-qual`	`https://qual.portugalodyssey.pt/`	HTTPS	200 OK
`po-partner-console-qual`	`https://partner.qual.portugalodyssey.pt/`	HTTPS	200 OK
`po-admin-console-qual`	`https://admin.qual.portugalodyssey.pt/`	HTTPS	200 OK
`po-cms-qual`	`https://cms.qual.portugalodyssey.pt/`	HTTPS	200 OK (after redirect)
`po-api-gateway-qual`	`https://api.qual.portugalodyssey.pt/health`	HTTPS keyword	`"status":"ok"`
`po-ai-service-qual`	`https://ai.qual.portugalodyssey.pt/health`	HTTPS keyword	`"status":"healthy"`
`po-sso-qual`	`https://sso.qual.portugalodyssey.pt/realms/portugal-odyssey/.well-known/openid-configuration`	HTTPS keyword	`"issuer"`

Settings (apply to every monitor):

Monitoring interval: 5 minutes (free tier minimum)
Alert threshold: 2 consecutive failures (avoids paging on transient blips)
Alert contacts: email only by default (Telegram/SMS require paid tier)
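
Before creating the monitors, the keyword rows in the table can be spot-checked from a workstation. The sketch below is a local approximation of a keyword check, not UptimeRobot's own logic; the helper names body_has_keyword and probe are illustrative and do not exist in the repo. It assumes curl is installed.

```shell
#!/usr/bin/env bash
# Local spot-check that approximates the keyword monitors in the table.
# Helper names are illustrative; they are not part of the project.

# Succeeds when the literal keyword appears anywhere in the response body.
body_has_keyword() {
  local body="$1" keyword="$2"
  printf '%s\n' "$body" | grep -qF -- "$keyword"
}

# Fetch a URL (following redirects, failing on HTTP errors) and apply the keyword check.
probe() {
  local url="$1" keyword="$2" body
  body="$(curl -fsSL --max-time 10 "$url")" || return 1
  body_has_keyword "$body" "$keyword"
}
```

For example, `probe https://api.qual.portugalodyssey.pt/health '"status":"ok"' && echo up`. The match is a literal substring, so a health endpoint that emits `"status": "ok"` with a space after the colon would fail this check; it is worth confirming the exact response bytes before relying on a keyword monitor.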

Provisioning

Option A: script (recommended)

export UPTIMEROBOT_API_KEY=u1234567-abcdef...
./infrastructure/scripts/uptimerobot-provision.sh

The script is idempotent: existing monitors with matching friendly names are skipped. To re-provision a single monitor, delete it in the UI first.
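
The skip-if-exists behaviour can be pictured roughly as follows. This is a sketch, not the contents of uptimerobot-provision.sh: the function names are hypothetical, it assumes curl and jq are available, and it assumes the UptimeRobot v2 API's getMonitors and newMonitor methods. The keyword_type value in particular should be checked against the API documentation.

```shell
#!/usr/bin/env bash
# Sketch of skip-if-exists provisioning against the UptimeRobot v2 API.
# NOT the real provisioning script; all function names are illustrative.

API="https://api.uptimerobot.com/v2"
TAB="$(printf '\t')"

# Print existing friendly names, one per line.
list_monitor_names() {
  local resp
  resp="$(curl -fsS -X POST "$API/getMonitors" \
    -d "api_key=${UPTIMEROBOT_API_KEY}" -d "format=json")" || return 1
  printf '%s' "$resp" | jq -r '.monitors[].friendly_name'
}

# Succeeds when NAME ($2) is not an exact line in the list EXISTING ($1).
needs_create() {
  if printf '%s\n' "$1" | grep -qxF -- "$2"; then return 1; fi
  return 0
}

# Create one monitor. type: 1 = HTTP(s), 2 = keyword; interval is in seconds.
create_monitor() {
  local name="$1" url="$2" keyword="${3:-}"
  set -- -d "api_key=${UPTIMEROBOT_API_KEY}" -d "format=json" \
    --data-urlencode "friendly_name=${name}" \
    --data-urlencode "url=${url}" -d "interval=300"
  if [ -n "$keyword" ]; then
    # Assumption: keyword_type=2 means "alert when the keyword is NOT found".
    set -- "$@" -d "type=2" -d "keyword_type=2" \
      --data-urlencode "keyword_value=${keyword}"
  else
    set -- "$@" -d "type=1"
  fi
  curl -fsS -X POST "$API/newMonitor" "$@"
}

# Read "name<TAB>url<TAB>keyword" rows on stdin; create only the missing ones.
provision() {
  local existing name url keyword
  existing="$(list_monitor_names)" || return 1
  while IFS="$TAB" read -r name url keyword; do
    if needs_create "$existing" "$name"; then
      create_monitor "$name" "$url" "$keyword"
    else
      echo "skip: $name already exists"
    fi
  done
}
```

Matching on the exact friendly name is what makes a re-run safe, and it is also why a changed URL or keyword is not picked up until the old monitor is deleted. The sketch leaves out alert contacts and the failure threshold, which would still need to be set.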

Option B: UptimeRobot web UI

Log in at uptimerobot.com.
Click + New monitor.
Pick HTTP(s) for rows whose Type is HTTPS, and Keyword for rows whose Type is HTTPS keyword; for the latter, enter the value from the Anchor / keyword column.
Enter the friendly name and URL exactly as in the table.
Set the monitoring interval to 5 minutes and the alert threshold to 2 consecutive failures.
Under alert contacts, select your email contact.
Repeat for all 7 entries.

Verifying alerts

Don't trust an unverified alert pipeline. Trigger a deliberate failure to confirm email delivery:

ssh root@31.97.159.7 docker stop po-public-fo-qual
# Wait 10–15 minutes (2 × 5-min interval)
# An email should arrive at the configured contact
ssh root@31.97.159.7 docker start po-public-fo-qual
# A recovery email should follow within ~5 min
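
The same drill can be wrapped so that the container is restarted even if the wait is cut short. This is a minimal sketch: the host and container come from the commands above, while DRILL_HOST, DRILL_CONTAINER, DRY_RUN and the function names are hypothetical. The wait assumes probes are evenly spaced and that the alert is sent as soon as the second consecutive probe fails.

```shell
#!/usr/bin/env bash
# Fire-drill sketch: stop a container, wait out the alert window, always restart it.
# Variable and function names are illustrative, not part of the project.

DRILL_HOST="${DRILL_HOST:-root@31.97.159.7}"
DRILL_CONTAINER="${DRILL_CONTAINER:-po-public-fo-qual}"

# Worst-case seconds before the alert fires: one probe interval per required failure.
alert_window_seconds() {
  local interval_min="$1" threshold="$2"
  echo $(( interval_min * threshold * 60 ))
}

# Run a command on the VPS, or only print it when DRY_RUN=1.
remote() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "ssh $DRILL_HOST $*"
  else
    ssh "$DRILL_HOST" "$@"
  fi
}

# The body runs in a subshell so the EXIT trap fires when the drill ends,
# not when the calling shell exits.
fire_drill() (
  trap 'remote docker start "$DRILL_CONTAINER"' EXIT
  remote docker stop "$DRILL_CONTAINER"
  wait_s="$(alert_window_seconds 5 2)"
  echo "waiting up to ${wait_s}s for the down alert"
  [ "${DRY_RUN:-0}" = "1" ] || sleep "$wait_s"
)
```

Running `DRY_RUN=1 fire_drill` prints the two ssh commands without touching the VPS. The computed window is 10 minutes, the lower end of the 10–15 minute wait noted above, so allow a few extra minutes for the down email and then for the recovery email after the restart.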

Skipping this step is exactly how the prior 4-month CI silence (Jan–Apr 2026) went unnoticed — silent alert pipelines are worse than no monitoring.

When to update

New service deployed with a public URL → add a monitor (mirror the closest pattern in the table)
URL renamed → delete old monitor, create new (UptimeRobot doesn't support URL edits without losing history; cheaper to recreate)
Going to prod → duplicate every po-*-qual row as po-*-prod once prod hostnames are live
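
For the prod duplication step, the friendly names can be derived mechanically from the qual ones. The helper below is a hypothetical sketch of the po-<service>-<env> convention only; it deliberately does not guess prod URLs, since those hostnames are not defined yet.

```shell
#!/usr/bin/env bash
# Derive a prod friendly name from a qual one (po-<service>-qual -> po-<service>-prod).
# to_prod_name is an illustrative helper, not part of the project.
to_prod_name() {
  local name="$1"
  case "$name" in
    po-*-qual) echo "${name%-qual}-prod" ;;
    *) echo "not a po-*-qual name: $name" >&2; return 1 ;;
  esac
}
```

For example, `to_prod_name po-api-gateway-qual` prints `po-api-gateway-prod`; anything that does not match the convention is rejected rather than silently renamed.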

Telegram integration (optional, paid tier)

Free-tier UptimeRobot only does email. If you want Telegram pings (matching the project's existing Telegram channel), two options:

Pro tier ($7/mo) unlocks webhook + Telegram alert contacts.
Free workaround: configure UptimeRobot to email a parser address (e.g. an n8n / Zapier hook) that reposts to your bot's Telegram channel.
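
Whichever option is chosen, the final hop to Telegram is a single Bot API call. The sketch below shows only that hop, using the Bot API's sendMessage method; parsing the alert email is left to the hook. TELEGRAM_BOT_TOKEN, TELEGRAM_CHAT_ID and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the repost step: format an alert line and send it with sendMessage.
# Environment variable and function names are illustrative.

# Build a short alert line from a monitor name and a state (for example DOWN or UP).
format_alert() {
  printf '%s is %s' "$1" "$2"
}

# Post a text message to the configured chat via the Telegram Bot API.
send_telegram() {
  curl -fsS -X POST \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=$1"
}
```

A hook would call something like `send_telegram "$(format_alert po-cms-qual DOWN)"` once it has extracted the monitor name and state from the alert email.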

Defer until alert volume justifies it.