Runbook: enable MinIO as gitlab-runner distributed cache backend
Status: ready to execute. Operator-gated.
Tracks: Riff #147.
Effort: ~30 min operator time + 1 verification cycle.
Prerequisite: dev-laptop gitlab-runner must be running (it is, since 2026-05-19).
Why
After the 2026-05-17/18 incident — cache: clauses hanging every CI job for 1h on the (now-retired) srv884655 runner — sC stripped all cache: clauses from .gitlab-ci.yml. The cold-cache cost is ~30–60s per service npm ci. Acceptable but optimisable: a real distributed cache backend would let us reintroduce caches safely.
CI migrated to the dev-laptop runner on 2026-05-19 (per ci-runner-architecture.md). The chosen cache backend is MinIO on the qual VPS, which is already running for file-service uploads.
Steps
1. Provision MinIO bucket + access key
SSH to qual VPS (po-platform@31.97.159.7). MinIO admin UI is at https://minio.qual.portugalodyssey.pt (or via the mc CLI in the container).
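# Inside the minio-qual container (or with mc configured against the alias).
# Note: $MINIO_ROOT_USER and $MINIO_ROOT_PASSWORD are expanded by the host
# shell here, so they must be set on the host before running this.
docker exec -it po-minio-qual mc alias set local http://localhost:9000 \
  "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"
# Create bucket + lifecycle (7-day delete on objects).
# On recent mc releases the lifecycle subcommand is `mc ilm rule add`.
docker exec po-minio-qual mc mb local/gitlab-cache
docker exec po-minio-qual mc ilm add local/gitlab-cache \
  --expire-days 7
# Create a dedicated access key for the runner. First use the admin policy
# editor in the UI to create a policy scoped to read-write on gitlab-cache only:
#   Policy name: gitlab-cache-rw
#   Resources:   arn:aws:s3:::gitlab-cache   (bucket, needed for s3:ListBucket)
#                arn:aws:s3:::gitlab-cache/* (objects)
#   Actions:     s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket
# Generate the secret into a variable so it can be captured afterwards.
CACHE_SECRET="$(openssl rand -hex 16)"
docker exec po-minio-qual mc admin user add local gitlab-runner-cache \
  "$CACHE_SECRET"
docker exec po-minio-qual mc admin policy attach local gitlab-cache-rw \
  --user gitlab-runner-cache
# The access key is the user name (gitlab-runner-cache); print the secret once.
echo "$CACHE_SECRET"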
Capture the access key + secret — they'll be pasted into the runner config below. Store in your password manager + ~/.secrets/gitlab-runner-cache.env on the dev laptop.
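The policy described in the comments above can also be created from the CLI rather than the UI editor, and the key pair written to the env file with tight permissions. The following is a minimal sketch under assumptions: the function names, the temporary file path, and the variable names CACHE_S3_ACCESS_KEY / CACHE_S3_SECRET_KEY are illustrative and not part of the existing setup, and `mc admin policy create` assumes a recent mc release (older releases used `mc admin policy add`).

```shell
# Print an S3-style policy document scoped to a single bucket.
# s3:ListBucket is a bucket-level action, so it targets the bucket ARN;
# the object-level actions target the objects under it.
render_cache_policy() {
  local bucket="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::${bucket}"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": ["arn:aws:s3:::${bucket}/*"]
    }
  ]
}
EOF
}

# Write the key pair to an env file readable only by the current user.
# The variable names are hypothetical; pick whatever the laptop already uses.
save_cache_credentials() {
  local file="$1" access_key="$2" secret_key="$3"
  mkdir -p "$(dirname "$file")"
  ( umask 077
    printf 'CACHE_S3_ACCESS_KEY=%s\nCACHE_S3_SECRET_KEY=%s\n' \
      "$access_key" "$secret_key" > "$file" )
}

# Usage (not run here; assumes the `local` alias from the step above):
#   render_cache_policy gitlab-cache > /tmp/gitlab-cache-rw.json
#   docker cp /tmp/gitlab-cache-rw.json po-minio-qual:/tmp/gitlab-cache-rw.json
#   docker exec po-minio-qual mc admin policy create local gitlab-cache-rw \
#     /tmp/gitlab-cache-rw.json
#   save_cache_credentials ~/.secrets/gitlab-runner-cache.env \
#     gitlab-runner-cache "<secret>"
```

Creating the policy from a file keeps it reviewable alongside the runbook; the UI route in the step above remains equally valid.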
2. Edit /etc/gitlab-runner/config.toml on the dev laptop
sudo cp /etc/gitlab-runner/config.toml \
/etc/gitlab-runner/config.toml.bak.$(date +%Y-%m-%d-pre-cache)
sudo nano /etc/gitlab-runner/config.toml
Add (or replace) the [runners.cache] block under the relevant [[runners]] entry. The runner tag is dev-runner:
[[runners]]
  name = "jmeireles-Latitude-5401"
  url = "https://gitlab.com/"
  token = "..." # unchanged
  executor = "docker"
  # ... other existing settings ...
  [runners.cache]
    Type = "s3"
    Path = "po-platform" # namespace inside the bucket
    Shared = false # this runner is single-project
    [runners.cache.s3]
      ServerAddress = "minio.qual.portugalodyssey.pt"
      AccessKey = "<from step 1>"
      SecretKey = "<from step 1>"
      BucketName = "gitlab-cache"
      BucketLocation = "us-east-1" # MinIO ignores; field is required
      Insecure = false # MinIO is fronted by Traefik + LE
Validate and restart:
sudo gitlab-runner verify
sudo systemctl restart gitlab-runner
sudo journalctl -u gitlab-runner -n 50 --no-pager
You should see Configuration loaded with no errors.
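gitlab-runner verify checks the runner's registration with GitLab and does not appear to exercise the cache backend, so a quick pre-flight before the smoke test may save a pipeline run. The sketch below confirms that the S3 section landed in the config file and that MinIO answers on its liveness endpoint (/minio/health/live). The function names are illustrative, and the 10-second timeout is an arbitrary choice.

```shell
# Return 0 if the runner config contains an S3 cache section.
has_s3_cache_block() {
  grep -q '^[[:space:]]*\[runners\.cache\.s3\]' "$1"
}

# Return 0 if MinIO answers on its liveness endpoint within 10 seconds.
# $1 is the host name as written in ServerAddress (no scheme).
minio_is_live() {
  curl -fsS --max-time 10 -o /dev/null "https://$1/minio/health/live"
}

# Usage (not run here; config.toml is normally readable by root only):
#   has_s3_cache_block /etc/gitlab-runner/config.toml && echo "s3 block present"
#   minio_is_live minio.qual.portugalodyssey.pt && echo "MinIO reachable"
```

A passing liveness check only shows that the endpoint and TLS are fine; the access key and bucket policy are still only proven by the smoke test itself.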
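3. Smoke test
Push a tiny change to frontends/public-fo/ (e.g. a no-op comment edit) and watch lint-frontend in the pipeline. Because all cache: clauses were stripped from .gitlab-ci.yml, lint-frontend needs a cache: clause on the smoke branch (for example the template from step 4) for the runner to touch the cache backend at all: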
git checkout -b sA/cache-smoke main
echo "// cache smoke" >> frontends/public-fo/src/main.tsx
git commit -am "[sA] chore(ci): smoke MinIO cache backend"
git push -u origin sA/cache-smoke
In the GitLab CI logs for lint-frontend:
- First push: Checking cache for sA/cache-smoke-protected... → No URL provided (no cache yet) → Created cache.
- Second push (push any trivial edit): logs should show Successfully extracted cache and skip npm ci's tarball download. Time savings should be visible.
If the first run hangs again on Checking cache..., abort and roll back via sudo systemctl stop gitlab-runner && sudo cp /etc/gitlab-runner/config.toml.bak.<date>-pre-cache /etc/gitlab-runner/config.toml && sudo systemctl start gitlab-runner (the backup from step 2 carries the -pre-cache suffix). Then file a follow-up Riff with the journalctl output.
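The log check above can be made less eyeball-dependent once a job log has been saved locally. A minimal sketch, assuming the raw log of lint-frontend was downloaded to a file (the file names in the usage lines are hypothetical) and relying only on the two messages quoted in this step:

```shell
# Classify a saved job log by the cache messages named in the smoke test.
# Prints "hit" if a cache was extracted, "created" if one was only
# uploaded, and "none" if neither message appears.
classify_cache_log() {
  if grep -q 'Successfully extracted cache' "$1"; then
    echo hit
  elif grep -q 'Created cache' "$1"; then
    echo created
  else
    echo none
  fi
}

# Usage (not run here):
#   classify_cache_log first-push.log    # expected: created
#   classify_cache_log second-push.log   # expected: hit
```

A result of none on either push suggests the job never reached the cache stage, which is worth checking before assuming the backend is at fault.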
4. Reintroduce per-job caches (after smoke is green)
In .gitlab-ci/frontend.yml, add to the relevant jobs. Suggested set:
.frontend_cache_template: &frontend_cache_template
  cache:
    key:
      files:
        - frontends/public-fo/package-lock.json
    paths:
      - frontends/public-fo/node_modules/
lint-frontend:
  <<: *frontend_cache_template
  # ... rest unchanged
type-check-frontend:
  <<: *frontend_cache_template
  # ...
test-frontend-public-fo:
  <<: *frontend_cache_template
  # ...
Mirror for partner-console and admin-console (replace path with the appropriate frontends/<app>/). For test-services-integration, use a multi-path cache:
test-services-integration:
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - services/*/node_modules/
      - .npm/
Do NOT reintroduce a project-global cache: clause at the top of .gitlab-ci.yml. That's the structural risk that bit us 2026-05-17 — if the cache backend is misconfigured, every job inheriting the global will hang. Per-job caches isolate the blast radius.
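The rule above could be enforced mechanically, for instance as a pre-push check. The sketch below is a heuristic, not a YAML parser: it flags a cache: key at column 0 or one nested directly under a top-level default: block, and it ignores per-job cache: clauses. The function name is illustrative.

```shell
# Return 0 if the CI file declares a pipeline-wide cache, either as a
# top-level `cache:` key or as `cache:` inside a top-level `default:` block.
has_global_cache() {
  awk '
    /^cache:/                      { found = 1 }
    /^[^ \t#]/                     { in_default = ($0 ~ /^default:/) }
    in_default && /^[ \t]+cache:/  { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Usage (not run here):
#   if has_global_cache .gitlab-ci.yml; then
#     echo "global cache: clause found, keep caches per-job" >&2
#     exit 1
#   fi
```

Files pulled in through include: are not followed, so each included file would need its own check.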
5. Update the CI doc
After the smoke is green:
- Add a one-paragraph note to ci-runner-architecture.md under "Operations" pointing here.
- Flip Riff #147 → done.
Rollback plan
If the cache backend starts hanging jobs again (1h timeouts return):
sudo systemctl stop gitlab-runner
sudo cp /etc/gitlab-runner/config.toml.bak.<date>-pre-cache /etc/gitlab-runner/config.toml
sudo systemctl start gitlab-runner
Then revert the per-job cache: blocks in .gitlab-ci/frontend.yml. The pipeline returns to the current cold-cache state.
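When several backups have accumulated, picking the right file by hand during an incident is error-prone. A minimal sketch of a helper that prints the most recently modified backup, assuming the config.toml.bak.* naming from step 2 and file names without spaces; the function name is illustrative.

```shell
# Print the most recently modified config.toml backup in a directory,
# or return 1 if no backup exists there.
latest_config_backup() {
  local dir="$1" newest
  newest="$(ls -1t "$dir"/config.toml.bak.* 2>/dev/null | head -n 1)"
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Usage (not run here; run as root because /etc/gitlab-runner is
# normally not readable by other users):
#   backup="$(latest_config_backup /etc/gitlab-runner)" || exit 1
#   systemctl stop gitlab-runner
#   cp "$backup" /etc/gitlab-runner/config.toml
#   systemctl start gitlab-runner
```

The newest backup is not necessarily the pre-cache one if the file has been backed up again since, so confirm the printed name before copying.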
Acceptance criteria (Riff #147)
- ✅ gitlab-runner config points to MinIO gitlab-cache bucket on minio.qual.portugalodyssey.pt.
- ✅ Two sequential pipelines on the same branch show the second hitting cached node_modules/ (download + extract OK).
- ✅ Per-job cache: clauses reintroduced without 1h timeouts.
Cross-references
- Riff #147: tasks-prod
- CI runner architecture: ci-runner-architecture.md
- Incident context: VPS pathology + cache hang sequence, in docs/ai/sessions/active.md § "VPS pathology" (2026-05-18, sC)
- gitlab-runner cache config docs: https://docs.gitlab.com/runner/configuration/advanced-configuration.html#the-runnerscaches3-section