Runbook: CPU steal spike on the prod VPS

Trigger. cpu-steal-watchdog.timer fires a Telegram alert when /proc/stat-derived %steal exceeds 20% for ≥5 consecutive 60-second samples (≈5 minutes sustained). Riff #161; source: 2026-05-18 incident where Hostinger throttled srv884655 to ~9% of nominal CPU (91% steal) for ~3 hours before manual detection.
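
For reference, %steal over a sample window is the change in the steal counter divided by the change in total CPU time between two reads of the aggregate cpu line in /proc/stat. The following is a minimal sketch of that arithmetic, assuming a hypothetical helper named steal_pct; the real cpu-steal-watchdog.sh may compute it differently.

```shell
# Hypothetical helper (not taken from cpu-steal-watchdog.sh): integer
# percent steal between two aggregate "cpu" lines read from /proc/stat.
steal_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      # $1 = "cpu"; $2..$9 = user nice system idle iowait irq softirq steal.
      # guest and guest_nice ($10, $11) are already counted in user/nice.
      total = 0
      for (i = 2; i <= 9; i++) total += $i
      t[NR] = total
      s[NR] = $9
    }
    END {
      dt = t[2] - t[1]
      if (dt <= 0) { print 0; exit }
      printf "%d\n", (s[2] - s[1]) * 100 / dt
    }'
}

# Usage on the VPS (one 60-second sample):
#   a=$(grep '^cpu ' /proc/stat); sleep 60; b=$(grep '^cpu ' /proc/stat)
#   steal_pct "$a" "$b"
```

A result above 20 for five consecutive samples corresponds to the alert condition described above.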

Quick diagnosis (60 seconds)

Open SSH to the VPS and run, in order:

# 1) Confirm the spike is current, not stale.
vmstat 1 5
# Watch the rightmost columns. The `st` column = CPU steal. If you see
# sustained high values (>20), the host hypervisor is throttling.

# 2) Confirm it's host-level, not a process.
mpstat -P ALL 1 3
# %steal should be roughly uniform across all CPUs. If it is concentrated
# on one core, that's a kernel/scheduling issue, not Hostinger throttling.

# 3) Identify the local CPU hog (if any).
docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}' \
  | sort -k2 -h -r | head -20
# If a po-platform container is at >50% CPU for a sustained period,
# the host's burst budget may be exhausted. That looks like steal
# from inside the VM even though the trigger is our workload.

Decision tree

Observation | Likely cause | Action
%steal sustained >20% AND docker stats shows one container >50% CPU | Our workload tripped the hypervisor's burst cap | Throttle the heavy container (docker update --cpus=<n>); cap concurrency; defer non-urgent work. Open a Hostinger Halp ticket only if the throttle persists after the workload drops.
%steal sustained >20% AND docker stats quiet (all containers <30% CPU) | Noisy neighbour OR Hostinger throttle without local cause | Open a Hostinger Halp ticket (see below).
%steal brief (<2min) spikes | Normal hypervisor scheduling jitter | No action. The watchdog's sustained-sample requirement (5 consecutive 60-second samples above threshold) avoids alerts in this case.

Hostinger Halp ticket template

Open Halp → New ticket → category "VPS / performance". Paste:

srv884655 (IP 31.97.159.7) is observing >20% sustained CPU steal over the last NN minutes. Local diagnostics:

  • vmstat 1 5 output: <paste output>
  • docker stats shows the largest container at <NN>% CPU (well under the VPS's allocated capacity)
  • Total load average: <paste uptime output>

Please check whether this VM has been throttled by the hypervisor (CPU burst cap, noisy neighbour, or hardware migration), and lift the throttle if it's not load-driven from our side.

Historical baseline: on 2026-05-18 Hostinger lifted a 91%-steal throttle within ~20 minutes of the Halp ticket; their agent confirmed the cause as "sustained Docker CPU activity on the node". That incident drove this runbook + the watchdog. Recurrence indicates we need a bigger VPS plan (see Riff #160 — 4-vCPU upgrade) or to move the runner off the prod VPS (done 2026-05-19 in Riff #159).
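
To fill in the ticket template quickly, the three diagnostics can be captured in one pass. This is a sketch only: the function names collect_steal_diagnostics and _diag_run are hypothetical and not part of the po-platform repository. The commands themselves are the ones used in the quick diagnosis above.

```shell
# Hypothetical helper: print a header, then the command's output.
# Missing tools or failing commands are noted instead of aborting.
_diag_run() {
  echo "== $* =="
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(command exited with status $?)"
  else
    echo "($1 not installed)"
  fi
}

# Gather the outputs the ticket template asks for.
collect_steal_diagnostics() {
  _diag_run vmstat 1 5
  _diag_run docker stats --no-stream \
    --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'
  _diag_run uptime
}

# Usage: collect_steal_diagnostics > /tmp/steal-diagnostics.txt
```

Paste the relevant parts of the resulting file into the ticket body.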

Verification after recovery

Once Hostinger lifts the throttle (or the local hog drops), the watchdog's hysteresis kicks in at %steal < THRESHOLD / 2 = 10% and fires a ✅ CPU steal recovered Telegram message. Confirm full recovery:

# %steal should be ≤1% within 1-2 minutes of the throttle lifting.
vmstat 1 5

# Container healthchecks should all flip back to `(healthy)` within 60s.
# Match the parenthesised form so `(unhealthy)` rows are not filtered out too.
docker ps --format 'table {{.Names}}\t{{.Status}}' | grep -v '(healthy)'

# Traefik should resume routing to the previously-unhealthy hosts.
curl -fsSL https://qual.portugalodyssey.pt/health
curl -fsSL https://api.qual.portugalodyssey.pt/health
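
Health endpoints may take a little while to come back after the throttle lifts, so a single curl can fail spuriously. A small retry wrapper makes the check less noisy. This is a sketch; retry_until is a hypothetical helper, and the attempt count and delay in the usage lines are assumptions, not values from the watchdog.

```shell
# Hypothetical helper: run a command until it succeeds, up to N attempts,
# sleeping DELAY seconds between attempts.
retry_until() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ok after $i attempt(s): $*"
      return 0
    fi
    i=$((i + 1))
    # Only sleep if another attempt is coming.
    if [ "$i" -le "$attempts" ]; then sleep "$delay"; fi
  done
  echo "still failing after $attempts attempt(s): $*" >&2
  return 1
}

# Usage (up to 12 attempts, 5 seconds apart):
#   retry_until 12 5 curl -fsSL https://qual.portugalodyssey.pt/health
#   retry_until 12 5 curl -fsSL https://api.qual.portugalodyssey.pt/health
```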

Activating the watchdog (operator, one-time)

# On the VPS:
sudo cp /opt/po-platform/infrastructure/systemd/cpu-steal-watchdog.{service,timer} \
        /etc/systemd/system/
sudo systemctl daemon-reload

# Provide the Telegram credentials (separate file to avoid leaking into
# .env.qual which is world-readable in some configurations).
sudo tee /opt/po-platform/.env.cpu-steal-watchdog >/dev/null <<'EOF'
TELEGRAM_BOT_TOKEN=<bot-token>
TELEGRAM_CHAT_ID=<chat-id>
# Optional overrides:
# THRESHOLD=20            # percent
# SUSTAINED_SAMPLES=5     # consecutive 60s samples above threshold
# INTERVAL=60             # seconds between samples
EOF
sudo chmod 600 /opt/po-platform/.env.cpu-steal-watchdog

# Activate
sudo systemctl enable --now cpu-steal-watchdog.timer
sudo systemctl status cpu-steal-watchdog.timer
sudo systemctl list-timers cpu-steal-watchdog.timer

# Smoke test the script standalone (will read /proc/stat twice across 60s):
sudo bash /opt/po-platform/infrastructure/scripts/cpu-steal-watchdog.sh
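
To illustrate how THRESHOLD and SUSTAINED_SAMPLES interact with the recovery hysteresis, here is a sketch of the alarm logic as this runbook describes it: alarm after N consecutive samples above the threshold, recover once a sample drops below half the threshold. The function watchdog_events is hypothetical and takes integer %steal samples as arguments; the real cpu-steal-watchdog.sh may be structured differently.

```shell
# Sketch only, not the shipped watchdog. Feeds a list of integer %steal
# samples through the alarm/recovery rules and prints the events.
watchdog_events() {
  local threshold=${THRESHOLD:-20}
  local needed=${SUSTAINED_SAMPLES:-5}
  local streak=0 alarmed=0 sample
  for sample in "$@"; do
    if [ "$alarmed" -eq 0 ]; then
      # Count consecutive samples above the threshold; any dip resets.
      if [ "$sample" -gt "$threshold" ]; then
        streak=$((streak + 1))
      else
        streak=0
      fi
      if [ "$streak" -ge "$needed" ]; then
        alarmed=1
        echo "ALARM at sample=$sample"
      fi
    elif [ "$sample" -lt $((threshold / 2)) ]; then
      # Hysteresis: recover only below THRESHOLD / 2.
      alarmed=0
      streak=0
      echo "RECOVERED at sample=$sample"
    fi
  done
}

# Usage: watchdog_events 25 30 5 25 25 25 25 25 15 12 9
```

In the usage line, the dip to 5 resets the streak, the alarm fires on the fifth consecutive sample above 20, and recovery is reported only at 9, since 15 and 12 are not below 10.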

Smoke test (synthetic high-steal scenario)

You can't easily trigger real CPU steal from inside a guest VM; the hypervisor controls that. The closest synthetic test is to load the VM heavily and watch the watchdog's behaviour:

# 1) On the VPS:
stress-ng --vm 2 --vm-bytes 1G --timeout 600s &
# Runs two memory-stressor workers (roughly 2 CPUs of load) for 10 minutes.

# 2) Watch vmstat in another shell:
vmstat 5

# 3) Verify watchdog output:
sudo journalctl -u cpu-steal-watchdog.service -f
# Expect a "🚨 CPU steal alarm" Telegram message if Hostinger reacts
# by throttling. Most likely it won't on a 10-minute stress run — this
# test mostly verifies the threshold-comparison code path. The true
# proof was the 2026-05-18 incident; the runbook + watchdog are the
# "next time" insurance.

References

  • Riff #161 — alert spec (this runbook closes it)
  • Riff #160 — 4-vCPU upgrade (defense-in-depth)
  • Riff #159 — runner moved off prod VPS (root-cause fix; closed)
  • infrastructure/scripts/cpu-steal-watchdog.sh — the watchdog
  • infrastructure/systemd/cpu-steal-watchdog.{service,timer} — units
  • 2026-05-18 incident postmortem: docs/ai/sessions/active.md § VPS pathology
  • Memory: feedback_cpu_steal_first_check.md — the lesson that drove this Riff