CI Runner Architecture

TL;DR

GitLab CI for po-platform runs on a self-hosted runner installed on the dev laptop (jmeireles-Latitude-5401). Every push to main fires a pipeline on GitLab.com; the 23 tagged jobs route to dev-runner; build + deploy execute on the laptop; deploy-qual SSHes from laptop to the qual VPS and brings up new containers; deploy-prod waits for a manual click.

Zero shared-runner SaaS minutes consumed. Zero CPU load on the prod VPS. The dev laptop's only obligation: be online when you want fast deploys (jobs queue at GitLab and fire on reconnect otherwise).

Why this architecture (decision record)

The forcing event: 2026-05-18

The previous architecture co-located gitlab-runner on the prod VPS (srv884655.hstgr.cloud, 2 vCPU Hostinger plan). With concurrent = 4 in the runner config and trunk-based development firing 5-10 pipelines per day, two simultaneous runner-*-build containers + dind sidecars routinely consumed ~150% of the 2 vCPU budget.

Hostinger's hypervisor responded by throttling our VM to ~9% of nominal CPU (CPU steal = 91% sustained; the historical baseline was 8% cumulative over 21 days of uptime). This cascaded:

- docker exec spawn time stretched from milliseconds to 60+ seconds
- Per-container healthchecks (wget --spider http://127.0.0.1:80, 10s timeout) all failed
- Traefik's docker provider filters unhealthy containers out of routing → every qual URL returned 404
- The cascade lasted ~3 hours before it was diagnosed and Hostinger support lifted the throttle
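The 91% figure is a steal percentage: the share of CPU time in which the guest was ready to run but the hypervisor scheduled something else. As a minimal sketch of how such a number can be derived on a Linux guest, the function below compares two samples of the aggregate cpu line from /proc/stat. The name steal_pct and the five-second window in the usage comment are illustrative assumptions, not part of the project's tooling.

```shell
#!/bin/sh
# steal_pct "<cpu line at t0>" "<cpu line at t1>"   (illustrative helper, not in the repo)
# Prints the integer percentage of CPU time stolen between the two samples.
# /proc/stat field order after the "cpu" label:
#   user nice system idle iowait irq softirq steal guest guest_nice
steal_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user..steal; guest time is already inside user
      steal[NR] = $9
      sum[NR] = total
    }
    END {
      dt = sum[2] - sum[1]
      if (dt <= 0) { print 0; exit }
      printf "%d\n", (steal[2] - steal[1]) * 100 / dt
    }'
}

# Live usage on the VPS (sampling window is an arbitrary choice):
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   steal_pct "$a" "$b"
```

A sustained value near the historical 8% would read as normal for this VPS; a value near 91% matches the throttled state described above.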

Full forensic chain in docs/ai/sessions/active.md "VPS pathology" section.

Options considered

#	Option	Trade-off	Verdict
A	Re-enable runner on prod with `concurrent=1`	Cheapest stop-gap; medium throttle risk on burst pushes; doesn't fix architecture	Rejected (doesn't break the coupling)
B	GitLab.com shared runners (`saas-linux-small-amd64` tag)	Free 400 min/mo; €4 per 1000min after; needs deploy-key rework for SSH → VPS over public internet	Rejected (recurring metered cost; SaaS minutes meter feels wrong for an MVP)
C	Dedicated runner VPS (~€5/mo)	Full isolation; recurring cost; one more box to maintain	Rejected (over-engineered for current scale)
D	Self-hosted runner on dev laptop	Zero recurring cost; laptop-offline = pipelines queue; no VPS load; reuses existing SSH access to qual	Chosen

Choice (D) treats the dev laptop's docker daemon as the project's de-facto CI host. The mounted /var/run/docker.sock (DooD) means each CI job container talks to the laptop's normal docker daemon — same one the operator uses for make dev. No nested daemon, no privileged container, no certificate handshake. The trade-off is that CI is online-coupled to the laptop, which is acceptable at MVP scale where deploys are not time-critical.
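One way to see which of the two patterns a job is using is to look at what the docker CLI would connect to. The sketch below is a hypothetical helper (docker_mode is not a project script): it reports dind when DOCKER_HOST points at a TCP daemon, dood when a unix socket exists at the given path, and none otherwise.

```shell
#!/bin/sh
# docker_mode [socket_path]   (hypothetical helper for illustration)
# Prints one of: dind, dood, none.
docker_mode() {
  sock="${1:-/var/run/docker.sock}"
  case "${DOCKER_HOST:-}" in
    tcp://*) echo dind; return 0 ;;   # a TCP endpoint means a separate (nested) daemon
  esac
  if [ -S "$sock" ]; then
    echo dood                          # unix socket mounted from the host
  else
    echo none
  fi
}

# Inside a CI job container on the laptop this would be expected to print "dood":
#   docker_mode
```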

Components

On the dev laptop

Component	Detail
Binary	`/usr/local/bin/gitlab-runner` (v18.11.3 at install)
System user	`gitlab-runner` (in `docker` group via `usermod -aG docker`)
Systemd unit	`/etc/systemd/system/gitlab-runner.service` (`enabled --now`)
Config	`/etc/gitlab-runner/config.toml` (root-owned 0644; backed up on every install run)
Tag	`dev-runner`
Concurrency	`concurrent = 2`
Executor	docker
Docker socket	mounted from host (`/var/run/docker.sock:/var/run/docker.sock`) — DooD, not DinD
Default image	`docker:24` with `pull_policy = ["if-not-present"]`
Locked to project	yes (po-platform only)
Token type	`glrt-…` (GitLab new runner-creation workflow, server-side tag/lock/access config)
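The three settings in this table that matter most for behaviour (concurrency, socket mount, pull policy) can be spot-checked with grep. The following is a sketch only: check_runner_config is a hypothetical name, and the config.toml layout shown in the comment is the usual gitlab-runner shape assumed here, not a copy of the real file.

```shell
#!/bin/sh
# Assumed shape of /etc/gitlab-runner/config.toml (illustrative, not the real file):
#   concurrent = 2
#   [[runners]]
#     executor = "docker"
#     [runners.docker]
#       image = "docker:24"
#       volumes = ["/var/run/docker.sock:/var/run/docker.sock"]
#       pull_policy = ["if-not-present"]

# check_runner_config <path>   (hypothetical helper)
# Prints "ok" and returns 0 when all three settings are present; otherwise
# prints the first problem found and returns 1.
check_runner_config() {
  cfg="$1"
  grep -Eq '^[[:space:]]*concurrent[[:space:]]*=[[:space:]]*2[[:space:]]*$' "$cfg" \
    || { echo "concurrent is not 2"; return 1; }
  grep -Fq '/var/run/docker.sock:/var/run/docker.sock' "$cfg" \
    || { echo "docker socket is not mounted"; return 1; }
  grep -Eq 'pull_policy[[:space:]]*=[[:space:]]*\["if-not-present"\]' "$cfg" \
    || { echo "pull_policy is not if-not-present"; return 1; }
  echo "ok"
}

# Usage on the laptop: check_runner_config /etc/gitlab-runner/config.toml
```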

On the GitLab project

Settings → CI/CD → Runners → Project runners: one runner registered as Developer laptop — po-platform dev-runner, tag dev-runner, locked.
Shared runners: enabled at project level, but no job is tagged saas-linux-small-amd64, so the shared runners receive zero traffic from this project. The switch is left on so that a future fallback needs no re-enabling.
CI/CD variables (relied on by deploy-qual and deploy-prod):
SSH_PRIVATE_KEY (file type) — private half of the deploy key. Public half lives in root@31.97.159.7:~/.ssh/authorized_keys.
DEPLOY_QUAL_HOST — 31.97.159.7 (or the FQDN once apex DNS is sorted)
DEPLOY_PROD_HOST — same VPS today; will move when prod is deployed
DEPLOY_USER — root
CI_REGISTRY_* — provided by GitLab automatically; runner pushes images here
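Because SSH_PRIVATE_KEY is a file-type variable, it expands to a path on disk inside the job, not to the key material itself. A minimal sketch of how a deploy job could use these variables follows; run_on_qual and the SSH_BIN override are hypothetical names introduced for illustration, and the real deploy jobs may be written differently.

```shell
#!/bin/sh
# run_on_qual <remote command...>   (hypothetical helper)
# Uses the CI/CD variables listed above. SSH_BIN exists only so the function
# can be exercised without a real ssh client; it defaults to ssh.
run_on_qual() {
  # ssh refuses private keys that are readable by group or other
  chmod 600 "$SSH_PRIVATE_KEY" || return 1
  "${SSH_BIN:-ssh}" -i "$SSH_PRIVATE_KEY" -o BatchMode=yes \
    "${DEPLOY_USER}@${DEPLOY_QUAL_HOST}" "$@"
}

# Example inside a deploy job:
#   run_on_qual 'docker ps'
```

BatchMode=yes makes ssh fail immediately instead of prompting for a password, which is usually what a non-interactive job wants.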

In the CI YAML

Two YAML anchors carry the docker-job baseline. They must stay in sync:

.gitlab-ci.yml lines ~54-62: .docker_template
.gitlab-ci/services.yml lines ~5-15: duplicate of the same anchor (services.yml is pulled in via include:, so the second declaration shadows the first; both are kept identical defensively)

Anchor shape:

.docker_template: &docker_template
  image: docker:24
  before_script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
  variables:
    DOCKER_BUILDKIT: 1
    COMPOSE_DOCKER_CLI_BUILD: 1

No services:, no DOCKER_HOST, no DOCKER_TLS_CERTDIR. The docker CLI in the job container talks to the mounted socket at the default path (unix:///var/run/docker.sock).

The runner tag is set via a separate anchor used by every docker-build job:

.self_hosted_template: &self_hosted_template
  tags:
    - dev-runner

Every docker-build-* job and every deploy-* job merges this template via <<: *self_hosted_template. Lint/test jobs carry their own tags: [dev-runner] declarations (in .gitlab-ci.yml).

Day-to-day operations

Push triggers a pipeline → CI auto-deploys to qual

This is the steady-state flow. No human action required for qual.

1. git push origin main (any author, any commit)
2. GitLab.com receives the push and fires a pipeline
3. GitLab queues each job for runners whose tags match
4. The dev laptop's gitlab-runner polls GitLab.com (long-poll) and picks up jobs
5. Up to 2 jobs run concurrently in docker containers spawned from docker:24
6. Build jobs: docker login → docker build → docker push to registry.gitlab.com/portugalodissey/po-platform/<service>:{latest,main,$SHA}
7. Deploy-qual: SSHes to root@$DEPLOY_QUAL_HOST → on the VPS runs git reset --hard origin/main && docker compose pull && docker compose up -d
8. Post-deploy smoke: external HTTP probes + container-health pass (with 6-retry budget for slow-starting services like document-signing) + crash-loop detector + .env* permission audit
9. Smoke passes → pipeline GREEN. Smoke fails → the deploy job exits non-zero; the qual containers are already up, but the pipeline stays RED until you investigate.
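The three-tag push in the build step can be sketched as below. This is an illustration under stated assumptions: image_refs and build_and_push are hypothetical helper names, the commit SHA is passed in as an argument, and the real jobs may build and tag differently.

```shell
#!/bin/sh
# image_refs <service> <commit-sha>   (hypothetical helper)
# Prints one fully qualified image reference per line: latest, main, <sha>.
image_refs() {
  repo="registry.gitlab.com/portugalodissey/po-platform/$1"
  for tag in latest main "$2"; do
    printf '%s:%s\n' "$repo" "$tag"
  done
}

# build_and_push <service> <commit-sha> <build-context>   (hypothetical helper)
# Builds once under the SHA tag, then tags and pushes every reference.
build_and_push() {
  service="$1"; sha="$2"; context="$3"
  primary="registry.gitlab.com/portugalodissey/po-platform/$service:$sha"
  docker build -t "$primary" "$context" || return 1
  for ref in $(image_refs "$service" "$sha"); do
    docker tag "$primary" "$ref" || return 1
    docker push "$ref" || return 1
  done
}
```

Building once and re-tagging keeps all three tags pointing at the same image ID, which makes the image comparison in the troubleshooting section meaningful.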

Prod deploy (manual)

1. The CI pipeline reaches the deploy:prod stage with deploy-prod in manual status
2. The operator opens GitLab → pipeline → clicks "Play" on deploy-prod
3. Same SSH flow as qual, but against $DEPLOY_PROD_HOST and pulling from production.yml
4. Add a row to DEPLOYS.md on success; this is doctrine, not enforced by CI

Reboot / re-install the runner

If the laptop is wiped, OS reinstalled, or the runner config corrupted:

# 1. Generate a fresh runner token in GitLab project settings:
#    Settings → CI/CD → Runners → New project runner
#    Tag: dev-runner; Locked to project: yes; Run untagged: off
# 2. Run the idempotent install script with the token:
GITLAB_RUNNER_TOKEN=glrt-... sudo -E bash infrastructure/scripts/install-dev-runner.sh

The script:

- Downloads gitlab-runner binary if missing
- Creates gitlab-runner system user if missing, adds to docker group
- Installs the systemd unit if missing
- Registers against GitLab using the provided token
- Sets concurrent = 2 and pull_policy = ["if-not-present"] in config.toml
- Enables and starts the systemd service
- Verifies the runner against GitLab

Old config.toml is backed up to config.toml.bak.<timestamp> before changes. The previous runner registration (if any) is left in GitLab's runner list — clean up manually via Settings → CI/CD → Runners → trash icon.
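The backup step can be sketched as follows. This is not an excerpt from install-dev-runner.sh: backup_config is a hypothetical name and the UTC timestamp format is an assumption, since the source only specifies the config.toml.bak.<timestamp> pattern.

```shell
#!/bin/sh
# backup_config <path>   (hypothetical helper; timestamp format is an assumption)
# Copies <path> to <path>.bak.<UTC timestamp> and prints the backup path.
# Does nothing and returns 0 when <path> does not exist (first install).
backup_config() {
  src="$1"
  [ -f "$src" ] || return 0
  dest="$src.bak.$(date -u +%Y%m%dT%H%M%SZ)"
  cp -p "$src" "$dest" && printf '%s\n' "$dest"
}

# Usage (as root): backup_config /etc/gitlab-runner/config.toml
```

cp -p preserves mode and timestamps, so the backup keeps the same permissions as the live file.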

Decommission the runner

sudo systemctl stop gitlab-runner
sudo systemctl disable gitlab-runner
sudo gitlab-runner unregister --all-runners
sudo apt-mark unhold gitlab-runner 2>/dev/null || true
sudo apt purge -y gitlab-runner 2>/dev/null || true  # if package-installed
sudo rm -f /etc/gitlab-runner/config.toml*
sudo userdel -r gitlab-runner 2>/dev/null || true

Troubleshooting

Pipeline jobs stay `pending` forever

The runner isn't picking them up. Causes (most→least common):

1. Laptop offline / runner service stopped. Check with sudo systemctl status gitlab-runner and restart if inactive.
2. concurrent exhausted. With 2 long-running jobs (e.g. a hung Vite build), the runner won't pick up a 3rd. docker ps | grep runner- shows current jobs; kill any that are hung.
3. Tag mismatch. The job is tagged with something the runner doesn't have (e.g. an older self-hosted tag from a stale branch). Either fix the YAML or add the tag in the GitLab runner UI.
4. Runner unregistered. Token rotation or accidental unregister. Re-register per "Reboot / re-install the runner" above.
5. Network firewall blocking outbound to gitlab.com:443. Confirm with curl -sS https://gitlab.com.
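For the concurrent-exhausted case, a rough slot count can be read off the container names. The sketch below assumes that each running job has exactly one container whose name starts with runner- and contains -build (the runner-*-build pattern mentioned earlier in this document); free_slots is a hypothetical helper name.

```shell
#!/bin/sh
# free_slots <concurrent>   (hypothetical helper)
# Reads container names on stdin, one per line, and prints how many job
# slots remain. Assumes one "runner-...-build" container per running job.
free_slots() {
  limit="$1"
  busy=$(grep -c -E '^runner-.*-build' || true)   # grep -c prints 0 but exits 1 on no match
  echo $(( limit - busy ))
}

# Live usage on the laptop:
#   docker ps --format '{{.Names}}' | free_slots 2
```

A result of 0 with pending jobs in GitLab points at cause 2; a result of 2 points at one of the other causes.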

`docker-build-*` job fails with `Cannot connect to the Docker daemon at tcp://docker:2376`

A YAML regression reintroduced the DinD pattern. Search for docker:24-dind or DOCKER_HOST in .gitlab-ci*.yml and remove them. The .docker_template anchor should NOT declare a services: block.
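The search described above can be wrapped in a small check. This is a sketch, not an existing project script: find_dind_refs is a hypothetical name, and it also looks for DOCKER_TLS_CERTDIR, which the anchor description lists as something the DooD template must not set.

```shell
#!/bin/sh
# find_dind_refs [repo-root]   (hypothetical helper)
# Prints file:line:text for every DinD leftover and returns 1;
# prints nothing and returns 0 when the CI config is clean.
find_dind_refs() {
  root="${1:-.}"
  hits=$(grep -rnE 'docker:[0-9.]+-dind|DOCKER_HOST|DOCKER_TLS_CERTDIR' \
           "$root/.gitlab-ci.yml" "$root/.gitlab-ci" 2>/dev/null)
  [ -z "$hits" ] && return 0
  printf '%s\n' "$hits"
  return 1
}

# Usage from the repository root: find_dind_refs .
```

The output is captured before it is tested so that a missing .gitlab-ci directory (which makes grep exit with an error status) cannot hide a real match in .gitlab-ci.yml.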

`deploy-qual` fails at the post-deploy smoke test

Most common: a slow-starting container (e.g. po-document-signing-service-qual) hasn't flipped to healthy in the 120s smoke window. Check docker inspect <container> --format '{{.State.Health.Status}}' on the VPS — if it's starting, wait another minute and the next pipeline will succeed. If it's unhealthy, look at docker logs <container> for the real cause.

The retry budget is set in .gitlab-ci/infrastructure.yml (search for "max_attempts=6"). Raise it if your container has a legitimately longer cold start.
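A retry budget of this kind is typically a bounded loop around a health probe. The sketch below shows the general pattern only; retry and container_healthy are hypothetical names, the delay in the usage comment is arbitrary, and the actual loop in infrastructure.yml may differ.

```shell
#!/bin/sh
# retry <max_attempts> <delay_seconds> <command...>   (hypothetical helper)
# Runs the command until it succeeds or the attempt budget is spent.
retry() {
  max_attempts="$1"; delay="$2"; shift 2
  attempt=1
  while :; do
    "$@" && return 0
    [ "$attempt" -ge "$max_attempts" ] && return 1
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# container_healthy <name>   (hypothetical helper)
# Succeeds only when Docker reports the health status "healthy".
container_healthy() {
  [ "$(docker inspect "$1" --format '{{.State.Health.Status}}' 2>/dev/null)" = "healthy" ]
}

# Example on the VPS (delay value is an arbitrary choice):
#   retry 6 10 container_healthy po-document-signing-service-qual
```

Raising the first argument lengthens the budget without changing the probe itself.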

Pipeline is `success` but qual is wrong

The build pushed :latest to the registry, and docker compose pull brought it down on the VPS, but the running container isn't the new image. Cause: pull_policy mismatch in compose, or the same :latest tag was rebuilt without changing content. Inspect the running image SHA:

ssh root@31.97.159.7 'docker inspect po-<service>-qual --format "{{.Image}}"'

Compare to docker image inspect registry.gitlab.com/portugalodissey/po-platform/<service>:latest --format "{{.Id}}". If they differ, force-recreate: docker compose -p po-qual -f qualification.yml -f shared.yml --env-file /opt/po-platform/.env.qual up -d --force-recreate --pull always <service>-qual.

CPU steal climbs on the laptop during a build

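Expected: local builds spike CPU briefly. (Strictly speaking, steal is only reported inside virtualized guests; assuming the laptop runs Linux on bare metal, the load shows up as %usr and %sys in mpstat rather than %steal.) If the load is sustained beyond a single build window, the laptop is overloaded or thermally throttled. Inspect with mpstat 1 5. Mitigation: lower concurrent from 2 to 1 in /etc/gitlab-runner/config.toml.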
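The concurrency change can be scripted rather than hand-edited. A minimal sketch, assuming the file has a single top-level concurrent = N line; set_concurrent is a hypothetical helper, not part of the install script.

```shell
#!/bin/sh
# set_concurrent <config.toml> <n>   (hypothetical helper)
# Rewrites the "concurrent = N" line in place and leaves every other line alone.
set_concurrent() {
  cfg="$1"; n="$2"
  case "$n" in
    ''|*[!0-9]*) echo "usage: set_concurrent <config.toml> <integer>" >&2; return 1 ;;
  esac
  tmp=$(mktemp) || return 1
  if sed -E "s/^([[:space:]]*concurrent[[:space:]]*=[[:space:]]*)[0-9]+/\\1$n/" "$cfg" > "$tmp"; then
    cat "$tmp" > "$cfg"   # write through the existing file so its owner and mode are kept
    status=$?
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}

# Usage (as root, since config.toml is root-owned):
#   set_concurrent /etc/gitlab-runner/config.toml 1
```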

Migration history

Pre-2026-04: CI ran on srv884655 (Hostinger qual VPS), co-located with prod containers. Worked at low traffic; risks tracked in feedback_runner_vps_shared_docker_config.md (INC-007 follow-up).
2026-05-18 ~16:00 UTC: First Hostinger throttle event. 91% CPU steal sustained, cascading 404s on qual. A Hostinger Help agent lifted the throttle. Local cause identified as gitlab-runner concurrent=4 with two simultaneous CI builds.
2026-05-18 ~20:00 UTC: Stop-gap — runner capped to concurrent=1, system upgraded with Stage 1 lib patches.
2026-05-19 ~01:10 UTC: Migration to dev laptop. New runner registered with tag dev-runner via GitLab new-creation-workflow glrt-… token. CI YAML retagged self-hosted → dev-runner in commit bfb23b2. Docker template switched from DinD to DooD in commit 0723809. Smoke-test retry budget bumped 1 → 6 in commit 76cbc8a. End-to-end pipeline #2535822299 verified.

What we deliberately didn't do

Add tags: [saas-linux-small-amd64] for lint/test jobs. Tempting (free 400 min/mo on shared runners), but it'd split the CI mental model in two and complicate the deploy story. Easy upgrade if laptop runner ever becomes the bottleneck.
Move runner to a dedicated VPS. ~€5/mo of fixed cost we don't currently need. Re-evaluate if the laptop dependency becomes painful.
Auto-restart deploy on smoke failure. A flaky smoke test masking a real deploy issue is worse than a false-positive failure. Each failure should be investigated.