Fix: acme.json Being Reset by Git¶
Problem¶
The acme.json file (Traefik's certificate storage) was being reset to {} with incorrect permissions (644 instead of 600) after deployments, causing:
- Loss of all Let's Encrypt certificates
- 504 Gateway Timeout errors
- Self-signed certificate warnings
- Traefik error: permissions 644 for /acme.json are too open, please use 600
- Traefik error: error="HTTP challenge is not enabled"
Root Cause¶
The file infrastructure/config/traefik/acme.json was tracked in the git repository. This caused:
- During git pull: Git would overwrite the local file (with certificates) with the repository version (empty {})
- Permission reset: Git tracks only the executable bit, so the rewritten file was created with the default mode (644 under a typical umask) instead of the required 600
- Certificate loss: All certificates stored in acme.json were lost
- Rate limit issues: After the reset, Traefik tried to regenerate all certificates at once, hitting Let's Encrypt rate limits
Solution Applied¶
1. Untracked acme.json from Git¶
The file was removed from git's index, so it is no longer tracked, while the copy on disk was kept.
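The exact command used is not shown here. With standard git, a file is usually untracked while keeping the local copy by using `git rm --cached`. A minimal sketch, where `untrack_keep_local` is a hypothetical helper name and not part of the repository:

```shell
# Sketch: stop tracking a file without deleting the local copy.
# untrack_keep_local is a hypothetical helper, named here for illustration only.
untrack_keep_local() {
    path="$1"
    # --cached removes the path from the index only; the working-tree file stays.
    git rm --cached --quiet -- "$path"
}
```

For example, `untrack_keep_local infrastructure/config/traefik/acme.json`, followed by a commit. Note that other clones pulling that commit will have their working copy of the file deleted, which is why the file must be re-initialized on the VPS afterwards.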
2. Added to .gitignore¶
Added infrastructure/config/traefik/acme.json to .gitignore to prevent it from ever being committed again.
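As an illustration, the ignore entry can be added idempotently so that repeated runs do not duplicate the line. This is a sketch only; `ignore_path` is a hypothetical helper name:

```shell
# Sketch: append a path to .gitignore only if that exact line is not already present.
# ignore_path is a hypothetical helper, not part of the repository.
ignore_path() {
    path="$1"
    gitignore="${2:-.gitignore}"
    # -x matches the whole line, -F treats the path as a fixed string (no regex).
    if ! grep -qxF -- "$path" "$gitignore" 2>/dev/null; then
        printf '%s\n' "$path" >> "$gitignore"
    fi
}
```

Once the file is untracked, `git check-ignore -v infrastructure/config/traefik/acme.json` should print the matching .gitignore rule.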
3. Fixed Makefile Logic¶
Updated the Makefile targets (shared and qual) to:
- Only validate JSON if python3 is available (prevents false "invalid" resets)
- Only initialize the file if it's missing or empty (preserves existing certificates)
- Always set correct permissions (600)
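The three rules above can be sketched as a small shell function. This is not the actual Makefile recipe: the function name `init_acme` is hypothetical, and what the real targets do with a file that fails validation is not stated here, so this sketch only prints a warning and leaves the file untouched.

```shell
# Sketch of the initialization rules; init_acme is a hypothetical name.
init_acme() {
    file="$1"
    if [ ! -s "$file" ]; then
        # Missing or zero-length: create an empty JSON object for Traefik to fill.
        mkdir -p "$(dirname "$file")"
        printf '{}\n' > "$file"
    elif command -v python3 >/dev/null 2>&1; then
        # Validate only when python3 exists, so a missing tool is never
        # mistaken for an invalid file.
        if ! python3 -m json.tool "$file" >/dev/null 2>&1; then
            # Assumption: warn instead of resetting, to avoid destroying certificates.
            echo "warning: $file is not valid JSON; leaving it untouched" >&2
        fi
    fi
    # Always enforce the permissions Traefik requires.
    chmod 600 "$file"
}
```

Calling `init_acme infrastructure/config/traefik/acme.json` on an existing certificate store changes only its mode, never its contents.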
Impact¶
Immediate (One-Time)¶
After applying the fix and pulling changes on the VPS:
- Rate limits expected: Traefik will attempt to regenerate all certificates, hitting Let's Encrypt rate limits
- Wait for reset: Rate limits reset after the specified time (check Traefik logs for "retry after" times)
- Automatic recovery: Traefik will automatically retry and issue certificates after rate limits reset
Long-Term¶
- Certificates persist: After regeneration, certificates will persist across deployments
- No more resets: git pull will no longer overwrite acme.json
- Stable certificates: Services will maintain valid Let's Encrypt certificates
Verification¶
Check if Fix is Applied¶
# On VPS, check if acme.json is tracked
git ls-files infrastructure/config/traefik/acme.json
# Should return nothing (file is untracked)
# Check .gitignore
grep acme.json .gitignore
# Should show: infrastructure/config/traefik/acme.json
Monitor Certificate Status¶
# Use the monitoring script
./infrastructure/scripts/monitor-certificate-rate-limits.sh
# Or manually check
docker logs po-traefik 2>&1 | grep -iE "rateLimited|429" | tail -20
docker logs po-traefik 2>&1 | grep -i "retry after" | tail -5
Check acme.json¶
# Check file exists and has correct permissions
ls -l infrastructure/config/traefik/acme.json
# Should show: -rw------- (600 permissions)
# Check file size (should grow as certificates are stored)
ls -lh infrastructure/config/traefik/acme.json
# Count certificates (if jq is available)
jq '.letsencrypt.Certificates | length' infrastructure/config/traefik/acme.json
Action Required on VPS¶
After pulling the fix, run:
# 1. Pull latest changes (this deletes the local acme.json, because the fix removes it from the repository)
git pull
# 2. Re-initialize acme.json using the fixed Makefile
make qual
# 3. Monitor rate limit status
./infrastructure/scripts/monitor-certificate-rate-limits.sh
# 4. Wait for rate limits to reset (check retry times in logs)
# Traefik will automatically retry and issue certificates
Prevention¶
- Never commit acme.json - It's now in .gitignore
- Never track sensitive files - Certificate stores, keys, and credentials should never be in git
- Use staging for testing - Use Let's Encrypt staging environment for development/testing
- Monitor deployments - Check certificate status after deployments
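As an extra safeguard for the first two points, a pre-commit hook could refuse commits that stage an acme.json (for example via `git add -f`, which bypasses .gitignore). The following is a hypothetical sketch, not something the repository is known to contain; both function names are made up for illustration.

```shell
# Hypothetical pre-commit check; not part of the repository described above.
# Reads newline-separated paths on stdin; succeeds if any path is named acme.json.
contains_acme_json() {
    grep -Eq '(^|/)acme\.json$'
}

# Returns non-zero when an acme.json is staged (added, copied, modified, or renamed).
check_staged_files() {
    if git diff --cached --name-only --diff-filter=ACMR | contains_acme_json; then
        echo "error: acme.json is staged; certificate stores must not be committed" >&2
        return 1
    fi
}
```

To try it, the functions could be placed in an executable `.git/hooks/pre-commit` script that ends with `check_staged_files || exit 1`.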