Fix: acme.json Being Reset by Git

Problem

The acme.json file (Traefik's certificate storage) was being reset to {} with incorrect permissions (644 instead of 600) after deployments, causing:

  1. Loss of all Let's Encrypt certificates
  2. 504 Gateway Timeout errors
  3. Self-signed certificate warnings
  4. Traefik errors: permissions 644 for /acme.json are too open, please use 600
  5. Traefik errors: error="HTTP challenge is not enabled"

Root Cause

The file infrastructure/config/traefik/acme.json was tracked in the git repository. This caused:

  1. During git pull: Git would overwrite the local file (with certificates) with the repository version (empty {})
  2. Permission reset: Git records only the executable bit, not the full file mode, so whenever Git rewrote the file it came back with the default 644 instead of the required 600
  3. Certificate loss: All certificates stored in acme.json were lost
  4. Rate limit issues: After reset, Traefik tried to regenerate ALL certificates at once, hitting Let's Encrypt rate limits
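
The permission reset can be reproduced in a throwaway repository. The sketch below assumes a umask of 022 and uses a temporary directory created only for illustration; it is not taken from the project's scripts. It shows that Git stores mode 100644 for a file that is 600 on disk, and that a file re-created by Git comes back as 644.

```shell
#!/bin/sh
# Illustration only: runs in a throwaway repository, not the real project.
umask 022
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Create the file with the permissions Traefik requires and stage it.
printf '{}\n' > acme.json
chmod 600 acme.json
git add acme.json

# Git records only whether a file is executable, so the index holds 100644.
index_mode=$(git ls-files -s acme.json | cut -d' ' -f1)
echo "mode stored by git: $index_mode"

# Let Git re-create the file, as happens when a checkout rewrites it.
rm acme.json
git checkout -q -- acme.json
restored_perms=$(ls -l acme.json | cut -c1-10)
echo "permissions after checkout: $restored_perms"
```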

Solution Applied

1. Untracked acme.json from Git

git rm --cached infrastructure/config/traefik/acme.json

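This removes the file from Git's index while keeping it on disk in the working tree where the command is run. On other clones, such as the VPS, pulling the resulting commit deletes the file, so it has to be re-created there (see Action Required on VPS).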

2. Added to .gitignore

Added infrastructure/config/traefik/acme.json to .gitignore so that it cannot be accidentally staged and committed again.
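
Steps 1 and 2 can be tried out safely in a scratch repository before touching the real one. The following is a minimal sketch: the temporary repository, the demo commit identity, and the commit messages are all made up for illustration; only the file path comes from this document.

```shell
#!/bin/sh
# Illustration only: a throwaway repository that uses the same relative path.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
acme=infrastructure/config/traefik/acme.json
mkdir -p "$(dirname "$acme")"
printf '{}\n' > "$acme"
git add "$acme"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Track acme.json (the original mistake)"

# Step 1: stop tracking the file. Step 2: ignore it from now on.
git rm -q --cached "$acme"
printf '%s\n' "$acme" >> .gitignore
git add .gitignore
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "Stop tracking acme.json"

# The file is still on disk, no longer tracked, and matched by .gitignore.
tracked=$(git ls-files "$acme")
ignored=$(git check-ignore "$acme")
[ -f "$acme" ] && echo "still on disk"
echo "tracked: '${tracked}'  ignored: '${ignored}'"
```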

3. Fixed Makefile Logic

Updated the Makefile targets (shared and qual) to:

  - Only validate JSON if python3 is available (prevents false "invalid" resets)
  - Only initialize the file if it's missing or empty (preserves existing certificates)
  - Always set correct permissions (600)
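
The Makefile recipe itself is not reproduced here, so the following is only a sketch of the three rules above as a shell function. The function name ensure_acme_file is hypothetical, "empty" is assumed to mean a zero-byte file, and the choice to warn without overwriting when validation fails is an assumption about the intended behaviour.

```shell
#!/bin/sh
# Hypothetical helper; the real Makefile recipe may differ.
ensure_acme_file() {
    file="$1"
    if [ ! -s "$file" ]; then
        # Missing or zero bytes: safe to initialize, nothing can be lost.
        mkdir -p "$(dirname "$file")"
        printf '{}\n' > "$file"
    elif command -v python3 >/dev/null 2>&1; then
        # Validate only when python3 exists, so a missing interpreter
        # is never mistaken for an invalid file.
        if ! python3 -m json.tool "$file" >/dev/null 2>&1; then
            # Assumption: warn and keep the file rather than reset it.
            echo "WARNING: $file is not valid JSON; leaving it untouched" >&2
        fi
    fi
    # Always enforce the permissions Traefik requires.
    chmod 600 "$file"
}
```

A deploy target could then call ensure_acme_file infrastructure/config/traefik/acme.json before starting Traefik. The key design point is that no branch overwrites a non-empty file.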

Impact

Immediate (One-Time)

After applying the fix and pulling changes on the VPS:

  1. Rate limits expected: Traefik will attempt to regenerate all certificates, hitting Let's Encrypt rate limits
  2. Wait for reset: Rate limits reset after the specified time (check Traefik logs for "retry after" times)
  3. Automatic recovery: Traefik will automatically retry and issue certificates after rate limits reset

Long-Term

  1. Certificates persist: After regeneration, certificates will persist across deployments
  2. No more resets: git pull will no longer overwrite acme.json
  3. Stable certificates: Services will maintain valid Let's Encrypt certificates

Verification

Check if Fix is Applied

# On VPS, check if acme.json is tracked
git ls-files infrastructure/config/traefik/acme.json
# Should return nothing (file is untracked)

# Check .gitignore
grep -F 'acme.json' .gitignore
# Should show: infrastructure/config/traefik/acme.json

Monitor Certificate Status

# Use the monitoring script
./infrastructure/scripts/monitor-certificate-rate-limits.sh

# Or manually check
docker logs po-traefik 2>&1 | grep -iE "rateLimited|429" | tail -20
docker logs po-traefik 2>&1 | grep -i "retry after" | tail -5

Check acme.json

# Check file exists and has correct permissions
ls -l infrastructure/config/traefik/acme.json
# Should show: -rw------- (600 permissions)

# Check file size (should grow as certificates are stored)
ls -lh infrastructure/config/traefik/acme.json

# Count certificates (if jq is available; assumes the certificate resolver is named "letsencrypt")
jq '.letsencrypt.Certificates | length' infrastructure/config/traefik/acme.json
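
The manual checks above could also be wrapped in a small function that fails loudly, for example at the end of a deployment. This is a sketch only: check_acme_file is a hypothetical name and is not an existing script in the repository.

```shell
#!/bin/sh
# Hypothetical post-deploy check; not part of the repository described here.
check_acme_file() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "FAIL: $file does not exist" >&2
        return 1
    fi
    # The first 10 characters of `ls -l` are the file type and permission bits.
    perms=$(ls -l "$file" | cut -c1-10)
    if [ "$perms" != "-rw-------" ]; then
        echo "FAIL: $file has permissions $perms, expected -rw-------" >&2
        return 1
    fi
    echo "OK: $file exists with 600 permissions"
}
```

Running check_acme_file infrastructure/config/traefik/acme.json returns a non-zero status when the file is missing or too permissive, which makes it easy to chain into other scripts.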

Action Required on VPS

After pulling the fix, run:

# 1. Pull latest changes (this will delete the tracked acme.json)
git pull

# 2. Re-initialize acme.json using the fixed Makefile
make qual

# 3. Monitor rate limit status
./infrastructure/scripts/monitor-certificate-rate-limits.sh

# 4. Wait for rate limits to reset (check retry times in logs)
# Traefik will automatically retry and issue certificates
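
Because step 1 deletes acme.json from disk, it may be worth copying the file aside before pulling. The sketch below is an optional precaution rather than part of the documented procedure. The function name backup_acme_file and the destination directory are hypothetical, and whether restoring the copy is worthwhile depends on whether the file still holds valid certificates.

```shell
#!/bin/sh
# Hypothetical helper: copy acme.json aside before `git pull` removes it.
backup_acme_file() {
    file="$1"
    dest_dir="$2"   # assumption: a directory outside the git working tree
    # Skip when the file is missing or empty: there is nothing to save.
    [ -s "$file" ] || return 0
    mkdir -p "$dest_dir"
    backup="$dest_dir/acme.json.$(date +%Y%m%d%H%M%S)"
    # -p preserves the mode, so a 600 original yields a 600 backup.
    cp -p "$file" "$backup" && echo "$backup"
}
```

The function prints the path of the copy, so a call such as backup_acme_file infrastructure/config/traefik/acme.json "$HOME/acme-backups" records where the backup went.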

Prevention

  1. Never commit acme.json - It's now in .gitignore
  2. Never track sensitive files - Certificate stores, keys, and credentials should never be in git
  3. Use staging for testing - Use Let's Encrypt staging environment for development/testing
  4. Monitor deployments - Check certificate status after deployments