Runbook: CPU steal spike on the prod VPS
Trigger. cpu-steal-watchdog.timer fires a Telegram alert when
/proc/stat-derived %steal exceeds 20% for ≥5 consecutive 60-second
samples (≈5 minutes sustained). Riff #161; source: 2026-05-18 incident
where Hostinger throttled srv884655 to ~9% of nominal CPU (91% steal)
for ~3 hours before manual detection.
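As a rough illustration of how a %steal figure can be derived from /proc/stat, the sketch below diffs two aggregate `cpu` lines and divides the steal delta by the total delta. It is not the watchdog's actual code (that lives in cpu-steal-watchdog.sh and is not reproduced here); the function name `steal_pct` and the integer rounding are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from cpu-steal-watchdog.sh.
# steal_pct LINE_A LINE_B -> integer %steal between two "cpu ..." lines
# from /proc/stat. Field order after the "cpu" label:
#   user nice system idle iowait irq softirq steal guest guest_nice
steal_pct() {
  local -a a b
  local i total=0 steal
  read -r -a a <<<"$1"
  read -r -a b <<<"$2"
  # Sum the deltas of fields 1..8. guest/guest_nice are skipped because
  # the kernel already counts them inside user/nice.
  for i in 1 2 3 4 5 6 7 8; do
    total=$(( total + b[i] - a[i] ))
  done
  steal=$(( b[8] - a[8] ))
  if (( total <= 0 )); then
    echo 0   # no ticks elapsed between the two samples
    return
  fi
  echo $(( 100 * steal / total ))
}

# Usage (left commented out because it blocks for a full interval):
#   first=$(grep '^cpu ' /proc/stat); sleep 60
#   second=$(grep '^cpu ' /proc/stat)
#   steal_pct "$first" "$second"
```

With counters resembling the 2026-05-18 incident (910 of 1000 elapsed ticks stolen), the function prints 91.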
Quick diagnosis (60 seconds)
Open SSH to the VPS and run, in order:
# 1) Confirm the spike is current, not stale.
vmstat 1 5
# Watch the `st` column (CPU steal), near the right edge. Ignore the first
# row: vmstat reports averages since boot there. If the remaining rows show
# sustained high values (>20), the host hypervisor is throttling.
# 2) Confirm it's host-level, not a process.
mpstat -P ALL 1 3
# %steal should be roughly uniform across all CPUs. If it is concentrated
# on one core, that points to a kernel/scheduling issue, not Hostinger throttling.
# 3) Identify the local CPU hog (if any).
docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}' \
| sort -k2 -h -r | head -20
# If a po-platform container is at >50% CPU for a sustained period,
# the host's burst budget may be exhausted. That looks like steal
# from inside the VM even though the trigger is our workload.
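To turn step 3 into a single number for the "largest container" check, the output can be piped through a small filter. This is a sketch under assumptions: `max_container_cpu` is a hypothetical name, and it expects `NAME CPU%` pairs as produced by `docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}'` (no `table` prefix, so there is no header row to skip).

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads "NAME CPU%" lines on stdin and prints the
# busiest container as "NAME PERCENT" (two decimals). Prints nothing
# when stdin is empty.
max_container_cpu() {
  awk '
    {
      cpu = $2
      sub(/%$/, "", cpu)          # "73.20%" -> "73.20"
      if (NR == 1 || cpu + 0 > max) { max = cpu + 0; name = $1 }
    }
    END { if (NR > 0) printf "%s %.2f\n", name, max }
  '
}

# Usage on the VPS:
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | max_container_cpu
```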
Decision tree
| Observation | Likely cause | Action |
|---|---|---|
| %steal sustained >20% AND docker stats shows one container >50% CPU | Our workload tripped the hypervisor's burst cap | Throttle the heavy container (docker update --cpus=<n>); cap concurrency; defer non-urgent work. Skip to "Hostinger Halp ticket" only if the throttle persists after the workload drops. |
| %steal sustained >20% AND docker stats quiet (all containers <30% CPU) | Noisy neighbour OR Hostinger throttle without local cause | Open a Hostinger Halp ticket (see below). |
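| %steal brief (<2min) spikes | Normal hypervisor scheduling jitter | No action. The watchdog alerts only after ≥5 consecutive 60-second samples above threshold, so spikes this short never reach an alarm; its hysteresis (recovery at 50% of threshold) prevents flapping once an alarm is active. |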
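The table can also be read as a small decision function. The sketch below is only an illustration: `classify` and its keyword outputs are hypothetical, inputs are whole-number percentages and minutes, and treating any spike of 2 minutes or longer as "sustained" is an assumption. The table does not cover a largest container between 30% and 50% CPU, so the sketch reports that case as `resample` rather than guessing.

```shell
#!/usr/bin/env bash
# Hypothetical encoding of the decision table; not part of the watchdog.
# classify STEAL_PCT SPIKE_MINUTES MAX_CONTAINER_CPU_PCT   (integers)
classify() {
  local steal=$1 minutes=$2 max_cpu=$3
  if (( steal <= 20 || minutes < 2 )); then
    echo ignore      # below threshold, or brief scheduling jitter
  elif (( max_cpu > 50 )); then
    echo throttle    # our workload likely tripped the burst cap
  elif (( max_cpu < 30 )); then
    echo ticket      # docker stats is quiet: open a Hostinger Halp ticket
  else
    echo resample    # 30-50%: not covered by the table (assumption)
  fi
}
```

For example, `classify 91 180 12` prints `ticket`, matching the 2026-05-18 situation as seen from inside the VM, while `classify 35 10 80` prints `throttle`.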
Hostinger Halp ticket template
Open Halp → New ticket → category "VPS / performance". Paste:
srv884655 (IP 31.97.159.7) is observing >20% sustained CPU steal over the last NN minutes. Local diagnostics:
- vmstat 1 5 output: (paste here)
- docker stats shows the largest container at NN% CPU (well under the VPS's allocated capacity)
- Total load average: (paste uptime output here)

Please check whether this VM has been throttled by the hypervisor (CPU burst cap, noisy neighbour, or hardware migration), and lift the throttle if it's not load-driven from our side.
Historical baseline: on 2026-05-18 Hostinger lifted a 91%-steal throttle within ~20 minutes of the Halp ticket; their agent confirmed the cause as "sustained Docker CPU activity on the node". That incident drove this runbook + the watchdog. Recurrence indicates we need a bigger VPS plan (see Riff #160 — 4-vCPU upgrade) or to move the runner off the prod VPS (done 2026-05-19 in Riff #159).
Verification after recovery
Once Hostinger lifts the throttle (or the local hog drops), the
watchdog's hysteresis kicks in at %steal < THRESHOLD / 2 = 10% and
fires a ✅ CPU steal recovered Telegram message. Confirm full
recovery:
# %steal should be ≤1% within 1-2 minutes of the throttle lifting.
vmstat 1 5
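# Container healthchecks should all flip back to `(healthy)` within 60s.
# Anything still listed below the header is not healthy yet. Match the
# literal "(healthy)": a bare `grep -v healthy` would also hide "(unhealthy)".
docker ps --format 'table {{.Names}}\t{{.Status}}' | grep -vF '(healthy)'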
# Traefik should resume routing to the previously-unhealthy hosts.
curl -fsSL https://qual.portugalodyssey.pt/health
curl -fsSL https://api.qual.portugalodyssey.pt/health
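The alarm and recovery thresholds described in this section amount to a two-state machine with hysteresis. The following is a minimal sketch of that logic, assuming integer %steal samples; the names `observe`, `state`, `streak`, and `event` are illustrative, and the real script may be structured differently.

```shell
#!/usr/bin/env bash
# Illustrative only; variable and function names are assumptions.
THRESHOLD=${THRESHOLD:-20}                  # percent
SUSTAINED_SAMPLES=${SUSTAINED_SAMPLES:-5}   # consecutive samples to alarm

state=ok    # ok | alarm
streak=0    # consecutive samples above THRESHOLD while in "ok"
event=""    # set by observe: "", "alarm", or "recovered"

# observe STEAL_PCT: feed one sample and update state/streak/event.
observe() {
  local steal=$1
  event=""
  if [ "$state" = ok ]; then
    if (( steal > THRESHOLD )); then
      streak=$(( streak + 1 ))
    else
      streak=0            # any sample at or below threshold resets the run
    fi
    if (( streak >= SUSTAINED_SAMPLES )); then
      state=alarm
      event=alarm
    fi
  elif (( steal < THRESHOLD / 2 )); then
    # Recovery requires dropping below half the threshold, so a value
    # hovering around THRESHOLD does not flap between alarm and ok.
    state=ok
    streak=0
    event=recovered
  fi
  return 0
}
```

Call `observe` directly rather than inside `$(...)`: a command substitution runs in a subshell and would discard the updated `state` and `streak`.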
Activating the watchdog (operator, one-time)
# On the VPS:
sudo cp /opt/po-platform/infrastructure/systemd/cpu-steal-watchdog.{service,timer} \
/etc/systemd/system/
sudo systemctl daemon-reload
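# Provide the Telegram credentials (separate file to avoid leaking into
# .env.qual, which is world-readable in some configurations). Create the
# file with mode 600 first so the token is never readable by other users.
sudo install -m 600 /dev/null /opt/po-platform/.env.cpu-steal-watchdog
sudo tee /opt/po-platform/.env.cpu-steal-watchdog >/dev/null <<'EOF'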
TELEGRAM_BOT_TOKEN=<bot-token>
TELEGRAM_CHAT_ID=<chat-id>
# Optional overrides:
# THRESHOLD=20 # percent
# SUSTAINED_SAMPLES=5 # consecutive 60s samples above threshold
# INTERVAL=60 # seconds between samples
EOF
sudo chmod 600 /opt/po-platform/.env.cpu-steal-watchdog
# Activate
sudo systemctl enable --now cpu-steal-watchdog.timer
sudo systemctl status cpu-steal-watchdog.timer
sudo systemctl list-timers cpu-steal-watchdog.timer
# Smoke test the script standalone (will read /proc/stat twice across 60s):
sudo bash /opt/po-platform/infrastructure/scripts/cpu-steal-watchdog.sh
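Before relying on the timer, it can be worth checking that the credentials actually deliver a message. The sketch below calls the Telegram Bot API `sendMessage` method with `curl`; the function name `send_telegram` is hypothetical, and this is not necessarily how cpu-steal-watchdog.sh sends its alerts.

```shell
#!/usr/bin/env bash
# Hypothetical helper for a manual credentials check.
# Requires TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID in the environment.
send_telegram() {
  local text=$1
  if [ -z "${TELEGRAM_BOT_TOKEN:-}" ] || [ -z "${TELEGRAM_CHAT_ID:-}" ]; then
    echo "TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID must be set" >&2
    return 1
  fi
  # -f: non-zero exit on HTTP errors; -sS: quiet, but still print errors.
  curl -fsS --max-time 10 \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=${text}" >/dev/null
}
```

With both variables exported in the current shell, `send_telegram "cpu-steal-watchdog credentials check"` should produce a message in the configured chat. A non-zero exit status points to a bad token, a bad chat ID, or a blocked network path.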
Smoke test (synthetic high-steal scenario)
You can't easily trigger real CPU steal from inside a guest VM; the hypervisor controls that. The closest synthetic test is to load the VM heavily and watch the watchdog's behaviour:
# 1) On the VPS:
stress-ng --vm 2 --vm-bytes 1G --timeout 600s &
# Burns ~2 CPUs for 10 minutes.
# 2) Watch vmstat in another shell:
vmstat 5
# 3) Verify watchdog output:
sudo journalctl -u cpu-steal-watchdog.service -f
# Expect a "🚨 CPU steal alarm" Telegram message if Hostinger reacts
# by throttling. Most likely it won't on a 10-minute stress run — this
# test mostly verifies the threshold-comparison code path. The true
# proof was the 2026-05-18 incident; the runbook + watchdog are the
# "next time" insurance.
References
- Riff #161: alert spec (this runbook closes it)
- Riff #160: 4-vCPU upgrade (defense-in-depth)
- Riff #159: runner moved off prod VPS (root-cause fix; closed)
- infrastructure/scripts/cpu-steal-watchdog.sh: the watchdog
- infrastructure/systemd/cpu-steal-watchdog.{service,timer}: units
- 2026-05-18 incident postmortem: docs/ai/sessions/active.md § VPS pathology
- Memory: feedback_cpu_steal_first_check.md (the lesson that drove this Riff)