
CI Runner Architecture

TL;DR

GitLab CI for po-platform runs on a self-hosted runner installed on the dev laptop (jmeireles-Latitude-5401). Every push to main fires a pipeline on GitLab.com; the 23 tagged jobs route to dev-runner; build + deploy execute on the laptop; deploy-qual SSHes from laptop to the qual VPS and brings up new containers; deploy-prod waits for a manual click.

Zero shared-runner SaaS minutes consumed. Zero CPU load on the prod VPS. The dev laptop's only obligation: be online when you want fast deploys (jobs queue at GitLab and fire on reconnect otherwise).

Why this architecture (decision record)

The forcing event — 2026-05-18

The previous architecture co-located gitlab-runner on the prod VPS (srv884655.hstgr.cloud, 2 vCPU Hostinger plan). With concurrent = 4 in the runner config and trunk-based development firing 5-10 pipelines per day, two simultaneous runner-*-build containers + dind sidecars routinely consumed ~150% of the 2 vCPU budget.

Hostinger's hypervisor responded by throttling our VM to ~9% of nominal CPU (CPU steal = 91% sustained; the historical baseline was 8% cumulative over 21 days of uptime). This cascaded:

  • docker exec spawn time stretched from milliseconds to 60+ seconds
  • Per-container healthchecks (wget --spider http://127.0.0.1:80, 10s timeout) all failed
  • Traefik's docker provider filters unhealthy containers from routing → every qual URL returned 404
  • The cascade lasted ~3 hours before it was diagnosed and Hostinger support lifted the throttle

Full forensic chain in docs/ai/sessions/active.md "VPS pathology" section.
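The steal figure quoted above can be measured directly from /proc/stat. The following is a minimal sketch, not the tooling used during the incident: steal_pct is a hypothetical helper name, and it assumes the standard Linux field order of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# Sketch: percentage of CPU time stolen by the hypervisor between two
# samples of the aggregate "cpu" line in /proc/stat.
# steal_pct is a hypothetical helper name, not part of the repo.
steal_pct() {
  # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user .. steal
      totals[NR] = total
      steal[NR] = $9
    }
    END {
      delta = totals[2] - totals[1]
      if (delta <= 0) { print 0; exit }
      printf "%d\n", (steal[2] - steal[1]) * 100 / delta
    }'
}

# Usage on the VPS (5-second window):
#   before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat)
#   steal_pct "$before" "$after"
```

A sustained value far above the historical baseline is the signal to check before chasing application-level symptoms such as failing healthchecks.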

Options considered

| # | Option | Trade-off | Verdict |
|---|--------|-----------|---------|
| A | Re-enable runner on prod with concurrent=1 | Cheapest stop-gap; medium throttle risk on burst pushes; doesn't fix architecture | Rejected (doesn't break the coupling) |
| B | GitLab.com shared runners (saas-linux-small-amd64 tag) | Free 400 min/mo; €4 per 1000 min after; needs deploy-key rework for SSH → VPS over public internet | Rejected (recurring metered cost; SaaS minutes meter feels wrong for an MVP) |
| C | Dedicated runner VPS (~€5/mo) | Full isolation; recurring cost; one more box to maintain | Rejected (over-engineered for current scale) |
| D | Self-hosted runner on dev laptop | Zero recurring cost; laptop-offline = pipelines queue; no VPS load; reuses existing SSH access to qual | Chosen |

Choice (D) treats the dev laptop's docker daemon as the project's de-facto CI host. The mounted /var/run/docker.sock (DooD) means each CI job container talks to the laptop's normal docker daemon — same one the operator uses for make dev. No nested daemon, no privileged container, no certificate handshake. The trade-off is that CI is online-coupled to the laptop, which is acceptable at MVP scale where deploys are not time-critical.

Components

On the dev laptop

| Component | Detail |
|-----------|--------|
| Binary | /usr/local/bin/gitlab-runner (v18.11.3 at install) |
| System user | gitlab-runner (in docker group via usermod -aG docker) |
| Systemd unit | /etc/systemd/system/gitlab-runner.service (enabled --now) |
| Config | /etc/gitlab-runner/config.toml (root-owned 0644; backed up on every install run) |
| Tag | dev-runner |
| Concurrency | concurrent = 2 |
| Executor | docker |
| Docker socket | mounted from host (/var/run/docker.sock:/var/run/docker.sock); DooD, not DinD |
| Default image | docker:24 with pull_policy = ["if-not-present"] |
| Locked to project | yes (po-platform only) |
| Token type | glrt-… (GitLab new runner-creation workflow, server-side tag/lock/access config) |
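To confirm that a given config.toml still describes the DooD setup listed here, a small check along these lines may help. It is a sketch under assumptions: is_dood_config is a hypothetical helper, and it only looks for the socket mount and for privileged = true; it does not parse TOML.

```shell
# Sketch: succeed only if the runner config read from stdin mounts the
# host Docker socket and does not enable privileged mode.
# is_dood_config is a hypothetical helper, not part of the install script.
is_dood_config() {
  awk '
    /\/var\/run\/docker\.sock:\/var\/run\/docker\.sock/ { sock = 1 }
    /^[[:space:]]*privileged[[:space:]]*=[[:space:]]*true/ { priv = 1 }
    END { if (sock && !priv) exit 0; exit 1 }
  '
}

# Usage on the laptop:
#   is_dood_config < /etc/gitlab-runner/config.toml && echo "DooD config"
```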

On the GitLab project

  • Settings → CI/CD → Runners → Project runners: one runner registered as Developer laptop — po-platform dev-runner, tag dev-runner, locked.
  • Shared runners: enabled at project level, but no job is tagged saas-linux-small-amd64, so the shared runners receive zero traffic from this project. The switch stays on so that a future fallback does not require re-enabling it.
  • CI/CD variables (relied on by deploy-qual and deploy-prod):
      • SSH_PRIVATE_KEY (file type): private half of the deploy key. Public half lives in root@31.97.159.7:~/.ssh/authorized_keys.
      • DEPLOY_QUAL_HOST = 31.97.159.7 (or the FQDN once apex DNS is sorted)
      • DEPLOY_PROD_HOST: same VPS today; will move when prod is deployed
      • DEPLOY_USER = root
      • CI_REGISTRY_*: provided by GitLab automatically; the runner pushes images here
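A deploy job can fail early with a clear message when one of these variables is missing. The sketch below is an assumption about how such a guard could look, not the project's actual script; missing_deploy_vars is a hypothetical helper. Because SSH_PRIVATE_KEY is a file-type variable, GitLab exposes it to the job as a path to a temporary file.

```shell
# Sketch: print the names of required deploy variables that are unset or empty.
# missing_deploy_vars is a hypothetical helper, not part of the repo.
missing_deploy_vars() {
  missing=""
  for name in SSH_PRIVATE_KEY DEPLOY_QUAL_HOST DEPLOY_PROD_HOST DEPLOY_USER; do
    eval "value=\${$name:-}"          # indirect lookup of the variable named $name
    [ -n "$value" ] || missing="$missing $name"
  done
  printf '%s\n' "${missing# }"        # strip the leading space
}

# Usage in a deploy job (the chmod is an assumption; ssh rejects keys that
# are readable by other users):
#   m=$(missing_deploy_vars); [ -z "$m" ] || { echo "missing: $m" >&2; exit 1; }
#   chmod 600 "$SSH_PRIVATE_KEY"
#   ssh -i "$SSH_PRIVATE_KEY" "$DEPLOY_USER@$DEPLOY_QUAL_HOST" true
```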

In the CI YAML

Two YAML anchors carry the docker-job baseline. They must stay in sync:

  • .gitlab-ci.yml lines ~54-62 — .docker_template
  • .gitlab-ci/services.yml lines ~5-15: duplicate of the same anchor (services.yml is pulled in via include:, and YAML anchors are only valid inside the file that defines them, so each file needs its own copy; both are kept identical defensively)

Anchor shape:

.docker_template: &docker_template
  image: docker:24
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  variables:
    DOCKER_BUILDKIT: 1
    COMPOSE_DOCKER_CLI_BUILD: 1

No services:, no DOCKER_HOST, no DOCKER_TLS_CERTDIR. The docker CLI in the job container talks to the mounted socket at the default path (unix:///var/run/docker.sock).
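Since the two copies of the anchor have to stay identical, a drift check could compare them mechanically. This is a minimal sketch with hypothetical helper names (anchor_block, anchors_in_sync); it assumes the anchor is a top-level key and that its body is indented, and it is not a YAML parser.

```shell
# Sketch: extract a top-level YAML block by key and compare it across two files.
# anchor_block and anchors_in_sync are hypothetical helpers.
anchor_block() {
  # $1 = file, $2 = top-level key such as .docker_template
  awk -v key="$2" '
    index($0, key ":") == 1 { inside = 1; print; next }   # block starts here
    inside && /^[^[:space:]#]/ { exit }                    # next top-level key
    inside && NF && !/^#/ { print }                        # indented body lines
  ' "$1"
}

anchors_in_sync() {
  # $1, $2 = files to compare, $3 = top-level key
  a=$(anchor_block "$1" "$3")
  b=$(anchor_block "$2" "$3")
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage from the repo root:
#   anchors_in_sync .gitlab-ci.yml .gitlab-ci/services.yml .docker_template \
#     || echo "docker template copies have drifted" >&2
```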

The runner tag is set via a separate anchor used by every docker-build job:

.self_hosted_template: &self_hosted_template
  tags:
    - dev-runner

Every docker-build-* job and every deploy-* job merges this template via <<: *self_hosted_template. Lint/test jobs do not use the anchor; they declare tags: [dev-runner] directly in .gitlab-ci.yml.

Day-to-day operations

Push triggers a pipeline → CI auto-deploys to qual

This is the steady-state flow. No human action required for qual.

  1. git push origin main (any author, any commit)
  2. GitLab.com receives the push, fires a pipeline
  3. Pipeline jobs scan for runners with matching tags
  4. Dev laptop's gitlab-runner polls GitLab.com (long-poll), picks up jobs
  5. Up to 2 jobs run concurrently in docker containers spawned from docker:24
  6. Build jobs: docker login → docker build → docker push to registry.gitlab.com/portugalodissey/po-platform/<service>:{latest,main,$SHA}
  7. Deploy-qual: SSHes to root@$DEPLOY_QUAL_HOST → on the VPS runs git reset --hard origin/main && docker compose pull && docker compose up -d
  8. Post-deploy smoke: external HTTP probes + container-health pass (with 6-retry budget for slow-starting services like document-signing) + crash-loop detector + .env* permission audit
  9. Smoke passes → pipeline GREEN. Smoke fails → deploy job exits non-zero; qual containers are already up but pipeline is RED until you investigate.
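The three image references pushed in step 6 can be enumerated as in this sketch. image_refs is a hypothetical helper, and the exact variable behind $SHA (for example CI_COMMIT_SHA or its short form) is an assumption; the real build jobs may tag differently.

```shell
# Sketch: list the registry references a build job pushes for one service.
# image_refs is a hypothetical helper, not part of the CI YAML.
image_refs() {
  # $1 = service name, $2 = commit SHA
  repo="registry.gitlab.com/portugalodissey/po-platform/$1"
  for tag in latest main "$2"; do
    printf '%s:%s\n' "$repo" "$tag"
  done
}

# Usage inside a build job (assumes the image was already built and tagged):
#   for ref in $(image_refs "$SERVICE" "$CI_COMMIT_SHA"); do docker push "$ref"; done
```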

Prod deploy (manual)

  1. CI pipeline reaches the deploy:prod stage with deploy-prod in manual status
  2. Operator opens GitLab → pipeline → clicks "Play" on deploy-prod
  3. Same SSH flow as qual but to $DEPLOY_PROD_HOST and pulls from production.yml
  4. Add a row to DEPLOYS.md on success — this is doctrine, not enforced by CI

Reboot / re-install the runner

If the laptop is wiped, OS reinstalled, or the runner config corrupted:

# 1. Generate a fresh runner token in GitLab project settings:
#    Settings → CI/CD → Runners → New project runner
#    Tag: dev-runner; Locked to project: yes; Run untagged: off
# 2. Run the idempotent install script with the token:
GITLAB_RUNNER_TOKEN=glrt-... sudo -E bash infrastructure/scripts/install-dev-runner.sh

The script:

  • Downloads the gitlab-runner binary if missing
  • Creates the gitlab-runner system user if missing and adds it to the docker group
  • Installs the systemd unit if missing
  • Registers against GitLab using the provided token
  • Sets concurrent = 2 and pull_policy = ["if-not-present"] in config.toml
  • Enables and starts the systemd service
  • Verifies the runner against GitLab

Old config.toml is backed up to config.toml.bak.<timestamp> before changes. The previous runner registration (if any) is left in GitLab's runner list — clean up manually via Settings → CI/CD → Runners → trash icon.
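The back-up-then-edit step can be pictured with the sketch below. It is not the code in install-dev-runner.sh: set_concurrent is a hypothetical helper and the timestamp format is an assumption. Running it twice leaves a single concurrent line, which is the idempotence property the script relies on.

```shell
# Sketch: back up a runner config, then set the top-level "concurrent" value.
# set_concurrent is a hypothetical helper; the timestamp format is assumed.
set_concurrent() {
  config=$1
  value=$2
  cp -p "$config" "$config.bak.$(date +%Y%m%d%H%M%S)"
  if grep -q '^concurrent[[:space:]]*=' "$config"; then
    # Replace the existing line in place of adding a second one.
    sed "s/^concurrent[[:space:]]*=.*/concurrent = $value/" "$config" > "$config.new"
  else
    # No setting yet: prepend it so it stays outside any [[runners]] table.
    { printf 'concurrent = %s\n' "$value"; cat "$config"; } > "$config.new"
  fi
  cat "$config.new" > "$config"   # keeps the original owner and mode
  rm -f "$config.new"
}

# Usage: sudo sh -c '. ./set_concurrent.sh; set_concurrent /etc/gitlab-runner/config.toml 2'
```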

Decommission the runner

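sudo systemctl stop gitlab-runner
sudo systemctl disable gitlab-runner
sudo gitlab-runner unregister --all-runners
sudo apt-mark unhold gitlab-runner 2>/dev/null || true
sudo apt purge -y gitlab-runner 2>/dev/null || true  # if package-installed
sudo rm -f /etc/systemd/system/gitlab-runner.service  # unit left by the binary install
sudo systemctl daemon-reload
sudo rm -f /usr/local/bin/gitlab-runner               # binary left by the binary install
sudo rm -f /etc/gitlab-runner/config.toml*
sudo userdel -r gitlab-runner 2>/dev/null || true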

Troubleshooting

Pipeline jobs stay pending forever

The runner isn't picking them up. Causes (most→least common):

  1. Laptop offline / runner service stopped. sudo systemctl status gitlab-runner — restart if inactive.
  2. concurrent exhausted. With 2 long-running jobs (e.g. a hung Vite build), the runner won't pick a 3rd. docker ps | grep runner- shows current jobs; kill if hung.
  3. Tag mismatch. Job is tagged something the runner doesn't have (e.g. an older self-hosted tag from a stale branch). Either fix the YAML or add the tag in the GitLab runner UI.
  4. Runner unregistered. Token rotation or accidental unregister. Re-register per "Reboot / re-install" above.
  5. Network firewall blocking outbound to gitlab.com:443. Confirm with curl -sS https://gitlab.com.
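For cause 2, the number of free job slots can be derived from the container list. This is a sketch with a hypothetical helper name (runner_slots_free); it assumes job containers are the ones whose names start with runner-, as in the docker ps | grep runner- check above.

```shell
# Sketch: report how many job slots remain, given the concurrent limit and
# a list of container names on stdin (one per line).
# runner_slots_free is a hypothetical helper.
runner_slots_free() {
  # $1 = concurrent limit from config.toml
  busy=$(grep -c '^runner-' || true)   # grep -c prints 0 but exits 1 on no match
  echo $(( $1 - busy ))
}

# Usage on the laptop:
#   docker ps --format '{{.Names}}' | runner_slots_free 2
```

A result of 0 or less means the runner will not pick up another job until one of the running containers finishes or is killed. Helper and service containers attached to a job also match the prefix, so the count is an upper bound on running jobs.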

docker-build-* job fails with Cannot connect to the Docker daemon at tcp://docker:2376

A YAML regression reintroduced the DinD pattern. Search for docker:24-dind or DOCKER_HOST in .gitlab-ci.yml and in the files under .gitlab-ci/, and remove them. The .docker_template anchor should NOT declare a services: block.
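The search can be scripted as in the sketch below. find_dind_residue is a hypothetical helper; the pattern covers any docker image tag ending in dind plus the two DinD-only variables, and it may need widening if the regression takes another form.

```shell
# Sketch: list lines that look like leftovers of the DinD pattern.
# find_dind_residue is a hypothetical helper; exit status 0 means "found".
find_dind_residue() {
  # "$@" = CI YAML files to scan
  grep -nE 'docker:[^[:space:]]*dind|DOCKER_HOST|DOCKER_TLS_CERTDIR' "$@"
}

# Usage from the repo root:
#   find_dind_residue .gitlab-ci.yml .gitlab-ci/*.yml
```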

deploy-qual fails at the post-deploy smoke test

Most common: a slow-starting container (e.g. po-document-signing-service-qual) hasn't flipped to healthy in the 120s smoke window. Check docker inspect <container> --format '{{.State.Health.Status}}' on the VPS — if it's starting, wait another minute and the next pipeline will succeed. If it's unhealthy, look at docker logs <container> for the real cause.

The retry budget is defined in .gitlab-ci/infrastructure.yml (search for "max_attempts=6"). Raise it if your container has a legitimately longer cold start.
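A bounded retry loop of the kind the smoke test uses can be sketched as follows. This is an illustration, not the code in infrastructure.yml: retry is a hypothetical helper, and the delay between attempts is an assumption.

```shell
# Sketch: run a command until it succeeds or the attempt budget is spent.
# retry is a hypothetical helper; the real smoke test may differ.
retry() {
  # $1 = max attempts, $2 = seconds between attempts, rest = command to run
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage on the VPS (the 20-second delay is an assumption):
#   is_healthy() { [ "$(docker inspect "$1" --format '{{.State.Health.Status}}')" = healthy ]; }
#   retry 6 20 is_healthy po-document-signing-service-qual
```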

Pipeline status is success but qual is wrong

The build pushed :latest to the registry, and docker compose pull brought it down on the VPS, but the running container isn't the new image. Cause: pull_policy mismatch in compose, or the same :latest tag was rebuilt without changing content. Inspect the running image SHA:

ssh root@31.97.159.7 'docker inspect po-<service>-qual --format "{{.Image}}"'

Compare it, on the VPS, to docker image inspect registry.gitlab.com/portugalodissey/po-platform/<service>:latest --format "{{.Id}}". If the two IDs differ, force-recreate: docker compose -p po-qual -f qualification.yml -f shared.yml --env-file /opt/po-platform/.env.qual up -d --force-recreate --pull always <service>-qual.
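The comparison can be wrapped in a small predicate, sketched here with a hypothetical helper name (image_is_stale). It treats an empty ID as "unknown" rather than "stale", so a failed docker inspect does not trigger a recreate.

```shell
# Sketch: succeed when the running container's image differs from the
# pulled :latest image. image_is_stale is a hypothetical helper.
image_is_stale() {
  # $1 = image ID the container runs, $2 = image ID of the pulled :latest tag
  [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Usage on the VPS, with <service> filled in by the operator:
#   running=$(docker inspect po-<service>-qual --format '{{.Image}}')
#   pulled=$(docker image inspect registry.gitlab.com/portugalodissey/po-platform/<service>:latest --format '{{.Id}}')
#   image_is_stale "$running" "$pulled" && echo "force-recreate needed"
```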

CPU usage climbs on the laptop during a build

Expected: local builds spike CPU briefly. If the load is sustained beyond a single build window, the laptop is overloaded or thermally throttled. Inspect with mpstat 1 5 (on bare metal, watch %usr, %sys, and %idle; %steal is only meaningful inside a VM). Mitigation: lower concurrent from 2 to 1 in /etc/gitlab-runner/config.toml.

Migration history

  • Pre-2026-04: CI ran on srv884655 (Hostinger qual VPS), co-located with prod containers. Worked at low traffic; risks tracked in feedback_runner_vps_shared_docker_config.md (INC-007 follow-up).
  • 2026-05-18 ~16:00 UTC: First Hostinger throttle event. 91% CPU steal sustained, cascading 404s on qual. A Hostinger support agent lifted the throttle. Local cause identified as gitlab-runner concurrent=4 with two simultaneous CI builds.
  • 2026-05-18 ~20:00 UTC: Stop-gap — runner capped to concurrent=1, system upgraded with Stage 1 lib patches.
  • 2026-05-19 ~01:10 UTC: Migration to dev laptop. New runner registered with tag dev-runner via GitLab new-creation-workflow glrt-… token. CI YAML retagged self-hosted → dev-runner in commit bfb23b2. Docker template switched from DinD to DooD in commit 0723809. Smoke-test retry budget bumped 1 → 6 in commit 76cbc8a. End-to-end pipeline #2535822299 verified.

What we deliberately didn't do

  • Add tags: [saas-linux-small-amd64] for lint/test jobs. Tempting (free 400 min/mo on shared runners), but it'd split the CI mental model in two and complicate the deploy story. Easy upgrade if laptop runner ever becomes the bottleneck.
  • Move runner to a dedicated VPS. ~€5/mo of fixed cost we don't currently need. Re-evaluate if the laptop dependency becomes painful.
  • Auto-restart deploy on smoke failure. A flaky smoke test masking a real deploy issue is worse than a false-positive failure. Each failure should be investigated.

See also

  • ~/.claude/projects/.../memory/feedback_cpu_steal_first_check.md — the diagnostic lesson that drove this migration
  • ~/.claude/projects/.../memory/feedback_runner_vps_shared_docker_config.md — INC-007 follow-up on the same anti-pattern (credentials half)
  • ~/.claude/projects/.../memory/project_session_state_2026_05_18_sC_vps.md — full session record of the throttle event
  • tasks-prod Riff #159 (closed done) — the architectural Riff
  • tasks-prod Riff #160 (low priority) — defense-in-depth: upgrade VPS to ≥4 vCPU
  • tasks-prod Riff #161 (medium priority) — Prometheus alert on %steal > 20% sustained 5m