CI Runner Architecture¶
TL;DR¶
GitLab CI for po-platform runs on a self-hosted runner installed on the dev laptop (jmeireles-Latitude-5401). Every push to main fires a pipeline on GitLab.com; the 23 tagged jobs route to dev-runner; build + deploy execute on the laptop; deploy-qual SSHes from laptop to the qual VPS and brings up new containers; deploy-prod waits for a manual click.
Zero shared-runner SaaS minutes consumed. Zero CPU load on the prod VPS. The dev laptop's only obligation: be online when you want fast deploys (jobs queue at GitLab and fire on reconnect otherwise).
Why this architecture (decision record)¶
The forcing event — 2026-05-18¶
The previous architecture co-located gitlab-runner on the prod VPS (srv884655.hstgr.cloud, 2 vCPU Hostinger plan). With concurrent = 4 in the runner config and trunk-based development firing 5-10 pipelines per day, two simultaneous runner-*-build containers + dind sidecars routinely consumed ~150% of the 2 vCPU budget.
Hostinger's hypervisor responded by throttling our VM to ~9% of nominal CPU (CPU steal = 91% sustained; the historical baseline was 8% cumulative over 21 days of uptime). This cascaded:
- docker exec spawn time stretched from milliseconds to 60+ seconds
- Per-container healthchecks (wget --spider http://127.0.0.1:80, 10s timeout) all failed
- Traefik's docker provider filters unhealthy containers from routing → every qual URL returned 404
- The cascade lasted ~3 hours before it was diagnosed and Hostinger support lifted the throttle
Full forensic chain in docs/ai/sessions/active.md "VPS pathology" section.
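The steal figure quoted above can be measured directly from /proc/stat. The following is a minimal sketch, assuming a Linux guest; steal_pct is a hypothetical helper name, not something that exists in the repo.

```shell
#!/usr/bin/env bash
# Sketch: percentage of CPU time stolen by the hypervisor between two samples
# of the aggregate "cpu" line in /proc/stat.
# steal_pct is a hypothetical helper, not part of po-platform.

steal_pct() {
  # $1, $2: two "cpu ..." lines taken some seconds apart.
  # Fields after the label: user nice system idle iowait irq softirq steal ...
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, " "); split(b, y, " ")
    # Sum user..steal (fields 2-9); guest time is already counted in user.
    for (i = 2; i <= 9; i++) { ta += x[i]; tb += y[i] }
    total = tb - ta
    if (total <= 0) { print 0; exit }
    printf "%d\n", (y[9] - x[9]) * 100 / total
  }'
}

# Usage on a live host (5-second window):
#   before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
#   steal_pct "$before" "$after"
```

A sustained value far above the historical baseline is the signal that the hypervisor, not the workload, is the bottleneck.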
Options considered¶
| # | Option | Trade-off | Verdict |
|---|---|---|---|
| A | Re-enable runner on prod with concurrent=1 | Cheapest stop-gap; medium throttle risk on burst pushes; doesn't fix architecture | Rejected (doesn't break the coupling) |
| B | GitLab.com shared runners (saas-linux-small-amd64 tag) | Free 400 min/mo; €4 per 1000 min after; needs deploy-key rework for SSH → VPS over public internet | Rejected (recurring metered cost; SaaS minutes meter feels wrong for an MVP) |
| C | Dedicated runner VPS (~€5/mo) | Full isolation; recurring cost; one more box to maintain | Rejected (over-engineered for current scale) |
| D | Self-hosted runner on dev laptop | Zero recurring cost; laptop-offline = pipelines queue; no VPS load; reuses existing SSH access to qual | Chosen |
Choice (D) treats the dev laptop's docker daemon as the project's de-facto CI host. The mounted /var/run/docker.sock (DooD) means each CI job container talks to the laptop's normal docker daemon — same one the operator uses for make dev. No nested daemon, no privileged container, no certificate handshake. The trade-off is that CI is online-coupled to the laptop, which is acceptable at MVP scale where deploys are not time-critical.
Components¶
On the dev laptop¶
| Component | Detail |
|---|---|
| Binary | /usr/local/bin/gitlab-runner (v18.11.3 at install) |
| System user | gitlab-runner (in docker group via usermod -aG docker) |
| Systemd unit | /etc/systemd/system/gitlab-runner.service (enabled --now) |
| Config | /etc/gitlab-runner/config.toml (root-owned 0644; backed up on every install run) |
| Tag | dev-runner |
| Concurrency | concurrent = 2 |
| Executor | docker |
| Docker socket | mounted from host (/var/run/docker.sock:/var/run/docker.sock) — DooD, not DinD |
| Default image | docker:24 with pull_policy = ["if-not-present"] |
| Locked to project | yes (po-platform only) |
| Token type | glrt-… (GitLab new runner-creation workflow, server-side tag/lock/access config) |
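The values in this table can be spot-checked against the live config. Below is a rough sketch that greps config.toml for the settings listed above; check_runner_config is a hypothetical helper, and the patterns assume the usual layout written by gitlab-runner register.

```shell
#!/usr/bin/env bash
# Sketch: verify that config.toml matches the component table.
# check_runner_config is a hypothetical name; the path default is the one
# documented above.

check_runner_config() {
  local cfg="${1:-/etc/gitlab-runner/config.toml}"
  local rc=0
  grep -Eq '^[[:space:]]*concurrent[[:space:]]*=[[:space:]]*2[[:space:]]*$' "$cfg" \
    || { echo "concurrent is not 2" >&2; rc=1; }
  grep -Eq '^[[:space:]]*executor[[:space:]]*=[[:space:]]*"docker"' "$cfg" \
    || { echo "executor is not docker" >&2; rc=1; }
  grep -Fq '/var/run/docker.sock:/var/run/docker.sock' "$cfg" \
    || { echo "docker socket is not mounted (DooD)" >&2; rc=1; }
  grep -Eq 'pull_policy[[:space:]]*=.*if-not-present' "$cfg" \
    || { echo "pull_policy is not if-not-present" >&2; rc=1; }
  return "$rc"
}

# Usage: sudo is not needed because the file is 0644.
#   check_runner_config && echo "runner config matches the docs"
```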
On the GitLab project¶
- Settings → CI/CD → Runners → Project runners: one runner registered as Developer laptop — po-platform dev-runner, tag dev-runner, locked.
- Shared runners: enabled at project level but no job tags saas-linux-small-amd64, so the shared runners receive zero traffic from this project. Switch flipped to allow future fallback without re-enabling.
- CI/CD variables (relied on by deploy-qual and deploy-prod):
  - SSH_PRIVATE_KEY (file type): private half of the deploy key. Public half lives in root@31.97.159.7:~/.ssh/authorized_keys.
  - DEPLOY_QUAL_HOST: 31.97.159.7 (or the FQDN once apex DNS is sorted)
  - DEPLOY_PROD_HOST: same VPS today; will move when prod is deployed
  - DEPLOY_USER: root
  - CI_REGISTRY_*: provided by GitLab automatically; runner pushes images here
In the CI YAML¶
Two YAML anchors carry the docker-job baseline. They must stay in sync:
- .gitlab-ci.yml lines ~54-62: .docker_template
- .gitlab-ci/services.yml lines ~5-15: duplicate of the same anchor (services.yml is include:-d, so the second declaration shadows the first; both kept identical defensively)
Anchor shape:
.docker_template: &docker_template
  image: docker:24
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  variables:
    DOCKER_BUILDKIT: 1
    COMPOSE_DOCKER_CLI_BUILD: 1
No services:, no DOCKER_HOST, no DOCKER_TLS_CERTDIR. The docker CLI in the job container talks to the mounted socket at the default path (unix:///var/run/docker.sock).
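Since the two copies of the anchor must stay identical, a drift check could run locally or as a lint step. This is only a sketch: extract_anchor and anchors_in_sync are hypothetical helpers, and the extraction assumes the anchor is a top-level key whose body is indented.

```shell
#!/usr/bin/env bash
# Sketch: compare the .docker_template block in two CI YAML files.
# Helper names are hypothetical; file paths in the usage note are the ones
# named in this document.

extract_anchor() {
  # $1: YAML file, $2: top-level key (e.g. .docker_template)
  awk -v key="$2" '
    index($0, key ":") == 1 { inblock = 1; print; next }
    inblock && /^[^[:space:]#]/ { inblock = 0 }  # next top-level key ends the block
    inblock && NF { print }                      # skip blank lines inside the block
  ' "$1"
}

anchors_in_sync() {
  local a b
  a=$(extract_anchor "$1" .docker_template)
  b=$(extract_anchor "$2" .docker_template)
  # Both copies must exist and match line for line.
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage:
#   anchors_in_sync .gitlab-ci.yml .gitlab-ci/services.yml \
#     || echo "docker_template copies have drifted" >&2
```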
The runner tag is set via a separate anchor, self_hosted_template, used by every docker-build job.
Every docker-build-* job and every deploy-* job extends this template via <<: *self_hosted_template. Lint/test jobs carry their own tags: [dev-runner] declarations (in .gitlab-ci.yml).
Day-to-day operations¶
Push triggers a pipeline → CI auto-deploys to qual¶
This is the steady-state flow. No human action required for qual.
- git push origin main (any author, any commit)
- GitLab.com receives the push, fires a pipeline
- Pipeline jobs scan for runners with matching tags
- Dev laptop's gitlab-runner polls GitLab.com (long-poll), picks up jobs
- Up to 2 jobs run concurrently in docker containers spawned from docker:24
- Build jobs: docker login → docker build → docker push to registry.gitlab.com/portugalodissey/po-platform/<service>:{latest,main,$SHA}
- Deploy-qual: SSHes root@$DEPLOY_QUAL_HOST → on VPS does git reset --hard origin/main && docker compose pull && docker compose up -d
- Post-deploy smoke: external HTTP probes + container-health pass (with 6-retry budget for slow-starting services like document-signing) + crash-loop detector + .env* permission audit
- Smoke passes → pipeline GREEN. Smoke fails → deploy job exits non-zero; qual containers are already up but pipeline is RED until you investigate.
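The three tags pushed by each build job follow from the {latest,main,$SHA} pattern in the flow above. A small sketch of that expansion, assuming the commit SHA comes from GitLab's predefined CI_COMMIT_SHA variable; image_refs and the service name in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: expand a service name and commit SHA into the three image refs
# a build job pushes. image_refs is a hypothetical helper.

image_refs() {
  local service="$1" sha="$2"
  local repo="registry.gitlab.com/portugalodissey/po-platform/${service}"
  local tag
  for tag in latest main "$sha"; do
    printf '%s:%s\n' "$repo" "$tag"
  done
}

# Usage inside a build job ("some-service" is a placeholder):
#   image_refs some-service "$CI_COMMIT_SHA" | while read -r ref; do
#     docker push "$ref"
#   done
```

Pushing the SHA tag alongside :latest keeps an immutable reference for each build, which helps when checking which image a container is actually running.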
Prod deploy (manual)¶
- CI pipeline reaches the deploy:prod stage with deploy-prod in manual status
- Operator opens GitLab → pipeline → clicks "Play" on deploy-prod
- Same SSH flow as qual but to $DEPLOY_PROD_HOST and pulls from production.yml
- Add a row to DEPLOYS.md on success; this is doctrine, not enforced by CI
Reboot / re-install the runner¶
If the laptop is wiped, OS reinstalled, or the runner config corrupted:
# 1. Generate a fresh runner token in GitLab project settings:
# Settings → CI/CD → Runners → New project runner
# Tag: dev-runner; Locked to project: yes; Run untagged: off
# 2. Run the idempotent install script with the token:
GITLAB_RUNNER_TOKEN=glrt-... sudo -E bash infrastructure/scripts/install-dev-runner.sh
The script:
- Downloads gitlab-runner binary if missing
- Creates gitlab-runner system user if missing, adds to docker group
- Installs the systemd unit if missing
- Registers against GitLab using the provided token
- Sets concurrent = 2 and pull_policy = ["if-not-present"] in config.toml
- Enables and starts the systemd service
- Verifies the runner against GitLab
Old config.toml is backed up to config.toml.bak.<timestamp> before changes. The previous runner registration (if any) is left in GitLab's runner list — clean up manually via Settings → CI/CD → Runners → trash icon.
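The backup-then-edit step can be pictured roughly as follows. This is a sketch of the idea only, not the contents of install-dev-runner.sh; set_concurrent is a hypothetical helper and the timestamp format is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: idempotently set "concurrent" in config.toml, keeping a
# timestamped backup first. Not the actual install script.

set_concurrent() {
  local cfg="$1" n="$2" stamp
  stamp=$(date +%Y%m%d%H%M%S)           # assumed timestamp format
  cp -p "$cfg" "${cfg}.bak.${stamp}" || return 1
  if grep -Eq '^[[:space:]]*concurrent[[:space:]]*=' "$cfg"; then
    # Rewrite the existing setting in place.
    sed -E "s/^[[:space:]]*concurrent[[:space:]]*=.*/concurrent = ${n}/" "$cfg" > "${cfg}.tmp"
  else
    # No setting yet: add it as the first line (global section).
    { printf 'concurrent = %s\n' "$n"; cat "$cfg"; } > "${cfg}.tmp"
  fi
  # The replacement file takes the caller's ownership and umask.
  mv "${cfg}.tmp" "$cfg"
}

# Usage:
#   sudo bash -c 'source ./set_concurrent.sh; set_concurrent /etc/gitlab-runner/config.toml 2'
#   sudo systemctl restart gitlab-runner
```

Running it twice leaves the same file content, which is the property that makes a re-run of the installer safe.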
Decommission the runner¶
sudo systemctl stop gitlab-runner
sudo systemctl disable gitlab-runner
sudo gitlab-runner unregister --all-runners
sudo apt-mark unhold gitlab-runner 2>/dev/null || true
sudo apt purge -y gitlab-runner 2>/dev/null || true # if package-installed
sudo rm -f /etc/gitlab-runner/config.toml*
sudo userdel -r gitlab-runner 2>/dev/null || true
Troubleshooting¶
Pipeline jobs stay pending forever¶
The runner isn't picking them up. Causes (most→least common):
- Laptop offline / runner service stopped. Run sudo systemctl status gitlab-runner and restart if inactive.
- concurrent exhausted. With 2 long-running jobs (e.g. a hung Vite build), the runner won't pick a 3rd. docker ps | grep runner- shows current jobs; kill if hung.
- Tag mismatch. Job is tagged something the runner doesn't have (e.g. an older self-hosted tag from a stale branch). Either fix the YAML or add the tag in the GitLab runner UI.
- Runner unregistered. Token rotation or accidental unregister. Re-register per "Reboot / re-install" above.
- Network firewall blocking outbound to gitlab.com:443. Confirm with curl -sS https://gitlab.com.
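The locally checkable causes in this list (service state, slot exhaustion, outbound connectivity) can be walked in one pass. A rough sketch, assuming job containers are named with the runner- prefix as in the grep above; triage_pending and count_runner_jobs are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Sketch: triage the local causes of pending jobs, in order.
# Tag mismatch and runner registration still need a look at the GitLab UI.

count_runner_jobs() {
  # Reads container names on stdin; prints how many start with "runner-".
  grep -c '^runner-' || true
}

triage_pending() {
  local concurrent="${1:-2}" running
  if ! systemctl is-active --quiet gitlab-runner; then
    echo "runner service is not active"
    return 1
  fi
  running=$(docker ps --format '{{.Names}}' | count_runner_jobs)
  if [ "$running" -ge "$concurrent" ]; then
    echo "all ${concurrent} slots busy (${running} job containers)"
    return 1
  fi
  if ! curl -sS -o /dev/null --max-time 10 https://gitlab.com; then
    echo "cannot reach gitlab.com:443"
    return 1
  fi
  echo "service active, ${running}/${concurrent} slots busy, gitlab.com reachable"
}

# Usage: triage_pending 2
```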
docker-build-* job fails with Cannot connect to the Docker daemon at tcp://docker:2376¶
A YAML regression reintroduced the DinD pattern. Search for docker:24-dind or DOCKER_HOST in .gitlab-ci*.yml and remove them. The .docker_template anchor should NOT declare a services: block.
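The search described above can be wrapped in a small guard. This sketch flags any dind image, DOCKER_HOST, or DOCKER_TLS_CERTDIR mention in the files it is given; ci_is_dood_clean is a hypothetical helper, and it will also flag matches inside YAML comments.

```shell
#!/usr/bin/env bash
# Sketch: detect DinD leftovers in CI YAML files.
# Prints offending lines as file:line:text and returns 1 if any are found.

ci_is_dood_clean() {
  if grep -HnE 'docker:[^[:space:]]*dind|DOCKER_HOST|DOCKER_TLS_CERTDIR' "$@"; then
    return 1
  fi
  return 0
}

# Usage from the repo root:
#   ci_is_dood_clean .gitlab-ci.yml .gitlab-ci/*.yml || echo "DinD pattern found" >&2
```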
deploy-qual fails at the post-deploy smoke test¶
Most common: a slow-starting container (e.g. po-document-signing-service-qual) hasn't flipped to healthy in the 120s smoke window. Check docker inspect <container> --format '{{.State.Health.Status}}' on the VPS — if it's starting, wait another minute and the next pipeline will succeed. If it's unhealthy, look at docker logs <container> for the real cause.
The retry budget is at .gitlab-ci/infrastructure.yml — search for "max_attempts=6". Bump higher if your container has a legitimately longer cold-start.
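For orientation, a bounded retry around a health probe looks roughly like this. It is a generic sketch, not the code in .gitlab-ci/infrastructure.yml; retry and container_is_healthy are hypothetical helpers and the delay in the usage note is an assumed value.

```shell
#!/usr/bin/env bash
# Sketch: retry a health probe up to max_attempts times.

retry() {
  # $1: max attempts, $2: seconds between attempts, rest: command to run
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after ${attempt} attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

container_is_healthy() {
  # True only when the container's healthcheck reports "healthy".
  [ "$(docker inspect "$1" --format '{{.State.Health.Status}}' 2>/dev/null)" = "healthy" ]
}

# Usage on the VPS (delay of 20s is an assumption):
#   retry 6 20 container_is_healthy po-document-signing-service-qual
```

A status of starting simply consumes an attempt, while a container stuck in unhealthy exhausts the budget and fails the job, which matches the investigate-first policy described above.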
Pipeline is success but qual is wrong¶
The build pushed :latest to the registry, and docker compose pull brought it down on the VPS, but the running container isn't the new image. Cause: pull_policy mismatch in compose, or the same :latest tag was rebuilt without changing content. Inspect the image ID of the running container, for example with docker inspect <container> --format '{{.Image}}'.
Compare to docker image inspect registry.gitlab.com/portugalodissey/po-platform/<service>:latest --format "{{.Id}}". If they differ, force-recreate: docker compose -p po-qual -f qualification.yml -f shared.yml --env-file /opt/po-platform/.env.qual up -d --force-recreate --pull always <service>-qual.
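The comparison can be scripted so it is repeatable across services. A minimal sketch, assuming it runs on the VPS where both the container and the pulled image are visible; has_image_drift is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: report whether a running container uses a different image than
# the one a tag currently points to locally.
# Returns 0 on drift, 1 when in sync, 2 if either lookup fails.

has_image_drift() {
  local container="$1" image_ref="$2" running wanted
  running=$(docker inspect "$container" --format '{{.Image}}') || return 2
  wanted=$(docker image inspect "$image_ref" --format '{{.Id}}') || return 2
  if [ "$running" = "$wanted" ]; then
    echo "in sync: ${running}"
    return 1
  fi
  echo "drift: running ${running}, pulled ${wanted}"
}

# Usage (container and image names follow the patterns in this document):
#   has_image_drift <service>-qual \
#     registry.gitlab.com/portugalodissey/po-platform/<service>:latest
```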
CPU steal climbs on the laptop during a build¶
Expected — local builds spike CPU briefly. If sustained beyond a single build window, the laptop is overloaded or thermally throttled. Inspect with mpstat 1 5. Mitigation: lower concurrent from 2 to 1 in /etc/gitlab-runner/config.toml.
Migration history¶
- Pre-2026-04: CI ran on srv884655 (Hostinger qual VPS), co-located with prod containers. Worked at low traffic; risks tracked in feedback_runner_vps_shared_docker_config.md (INC-007 follow-up).
- 2026-05-18 ~16:00 UTC: First Hostinger throttle event. 91% CPU steal sustained, cascading 404s on qual. Hostinger Halp agent lifted the throttle. Local cause identified as gitlab-runner concurrent=4 with two simultaneous CI builds.
- 2026-05-18 ~20:00 UTC: Stop-gap: runner capped to concurrent=1, system upgraded with Stage 1 lib patches.
- 2026-05-19 ~01:10 UTC: Migration to dev laptop. New runner registered with tag dev-runner via GitLab new-creation-workflow glrt-… token. CI YAML retagged self-hosted → dev-runner in commit bfb23b2. Docker template switched from DinD to DooD in commit 0723809. Smoke-test retry budget bumped 1 → 6 in commit 76cbc8a. End-to-end pipeline #2535822299 verified.
What we deliberately didn't do¶
- Add tags: [saas-linux-small-amd64] for lint/test jobs. Tempting (free 400 min/mo on shared runners), but it'd split the CI mental model in two and complicate the deploy story. Easy upgrade if laptop runner ever becomes the bottleneck.
- Move runner to a dedicated VPS. ~€5/mo of fixed cost we don't currently need. Re-evaluate if the laptop dependency becomes painful.
- Auto-restart deploy on smoke failure. A flaky smoke test masking a real deploy issue is worse than a false-positive failure. Each failure should be investigated.
See also¶
- ~/.claude/projects/.../memory/feedback_cpu_steal_first_check.md: the diagnostic lesson that drove this migration
- ~/.claude/projects/.../memory/feedback_runner_vps_shared_docker_config.md: INC-007 follow-up on the same anti-pattern (credentials half)
- ~/.claude/projects/.../memory/project_session_state_2026_05_18_sC_vps.md: full session record of the throttle event
- tasks-prod Riff #159 (closed done): the architectural Riff
- tasks-prod Riff #160 (low priority): defense-in-depth: upgrade VPS to ≥4 vCPU
- tasks-prod Riff #161 (medium priority): Prometheus alert on %steal > 20% sustained 5m