Traefik Unhealthy and Self-Signed Certificate Fix

Problem

  1. Traefik container is UNHEALTHY
  2. Self-signed certificate instead of Let's Encrypt
  3. Certificate resolver errors persist

Root Cause Analysis

The certificate resolver configuration is correct, but Traefik isn't recognizing it. This causes:

  • Fallback to self-signed certificates
  • Healthcheck failures (if related to certificate issues)
  • "Nonexistent certificate resolver" errors

Possible Causes

  1. ACME file doesn't exist or has wrong permissions
  2. Traefik can't write to acme.json (volume mount issue)
  3. Traefik cannot reach the Let's Encrypt servers (network connectivity)
  4. Configuration not applied (Traefik needs a restart after configuration changes)

Fix Procedure

Step 1: Verify ACME File

# On VPS
cd /opt/po-platform

# Check if acme.json exists
ls -la infrastructure/config/traefik/acme.json

# If it doesn't exist, create it
touch infrastructure/config/traefik/acme.json
chmod 600 infrastructure/config/traefik/acme.json

# Verify permissions
ls -l infrastructure/config/traefik/acme.json
# Should show: -rw------- (600)
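
The create-and-verify steps above can be folded into one idempotent helper. This is a minimal sketch: the function name `ensure_acme_file` is illustrative, and it assumes GNU coreutils `stat` (as found on most Linux VPS images).

```shell
# Create the ACME storage file if it is missing and force mode 600.
# Assumption: GNU coreutils `stat` (Linux); the function name is illustrative.
ensure_acme_file() {
  acme_path="$1"
  # Create the file only when it does not exist, so existing certificates are kept.
  [ -e "$acme_path" ] || touch "$acme_path" || return 1
  chmod 600 "$acme_path" || return 1
  # Re-read the mode to confirm the change actually took effect.
  mode=$(stat -c '%a' "$acme_path") || return 1
  [ "$mode" = "600" ]
}
```

Run from /opt/po-platform, it could be used as `ensure_acme_file infrastructure/config/traefik/acme.json && echo "acme.json ready"`.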

Step 2: Verify Volume Mount

# Check the volume mount in shared.yml
grep "acme.json" infrastructure/compose/shared.yml

# Should show: - ../config/traefik/acme.json:/acme.json:rw
# The :rw ensures write access
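
The grep above only shows what the compose file declares. To see what the running container actually received, the mount list can be read from `docker inspect`. A sketch, assuming the container is named po-traefik as elsewhere in this document; the function name `acme_mount_is_rw` is illustrative.

```shell
# Succeed only if the running container has /acme.json mounted read-write.
# Uses the Destination and RW fields that `docker inspect` reports per mount.
acme_mount_is_rw() {
  container="${1:-po-traefik}"
  docker inspect --format '{{range .Mounts}}{{println .Destination .RW}}{{end}}' "$container" \
    | grep -qx '/acme.json true'
}
```

For example: `acme_mount_is_rw po-traefik && echo "mount OK" || echo "mount missing or read-only"`.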

Step 3: Check Traefik Logs

# Look for ACME-related errors
docker logs po-traefik 2>&1 | grep -i "acme\|certificate\|letsencrypt" | tail -30

# Look for permission or port-binding errors
docker logs po-traefik 2>&1 | grep -i "permission\|too open\|address already in use" | tail -20
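
To turn raw log output into a short list of likely causes, the lines can be piped through a small classifier. This is only a sketch: the function name `scan_traefik_log` is hypothetical, and the matched phrases are assumptions about how Traefik words these messages, so adjust them to what your logs actually contain.

```shell
# Read Traefik log lines on stdin and print one hint per recognized failure pattern.
# Assumption: the phrases below match the wording used in your Traefik version's logs.
scan_traefik_log() {
  log=$(cat)
  printf '%s\n' "$log" | grep -qi 'too open' \
    && echo "acme.json permissions too open (use chmod 600)"
  printf '%s\n' "$log" | grep -qi 'permission denied' \
    && echo "Traefik cannot write a file (check the volume mount)"
  printf '%s\n' "$log" | grep -qi 'nonexistent certificate resolver' \
    && echo "a router references an undefined resolver name"
  # Finding no pattern is not an error.
  return 0
}
```

Usage: `docker logs po-traefik 2>&1 | scan_traefik_log`.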

Step 4: Restart Traefik

# Stop Traefik
docker compose -f infrastructure/compose/shared.yml --env-file infrastructure/compose/.env.shared stop traefik

# Remove container (to ensure fresh start)
docker compose -f infrastructure/compose/shared.yml --env-file infrastructure/compose/.env.shared rm -f traefik

# Start Traefik
docker compose -f infrastructure/compose/shared.yml --env-file infrastructure/compose/.env.shared up -d traefik

# Wait a few seconds
sleep 5

# Check status
docker ps | grep traefik
# Should show: Up (healthy), not (unhealthy)
# (health: starting) means the first healthcheck has not completed yet; check again shortly
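
A fixed `sleep 5` may be too short if the healthcheck interval is longer. As an alternative, the health status can be polled until it settles. A sketch, assuming the container defines a healthcheck; the function name `wait_for_healthy` and the 60-second default are illustrative.

```shell
# Poll the container's health status until it reports "healthy" or the timeout expires.
# Assumption: the container defines a healthcheck; name and timeout are parameters.
wait_for_healthy() {
  container="${1:-po-traefik}"
  timeout="${2:-60}"
  elapsed=0
  status=""
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    if [ "$status" = "healthy" ]; then
      echo "healthy after ${elapsed}s"
      return 0
    fi
    sleep 2
    elapsed=$((elapsed + 2))
  done
  echo "still ${status:-unknown} after ${timeout}s" >&2
  return 1
}
```

Usage: `wait_for_healthy po-traefik 90 || docker logs --tail 50 po-traefik`.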

Step 5: Trigger Certificate Request

# Make an HTTP request (will redirect to HTTPS and trigger certificate request)
curl -I http://qual.portugalodyssey.pt

# Or access via browser
# Navigate to: http://qual.portugalodyssey.pt

Step 6: Monitor Certificate Acquisition

# Watch Traefik logs for ACME activity
docker logs -f po-traefik 2>&1 | grep -i "acme\|certificate\|letsencrypt"

# In another terminal, check acme.json
watch -n 2 'ls -lh /opt/po-platform/infrastructure/config/traefik/acme.json'

# File should grow when certificate is obtained
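
File size alone does not distinguish a registered ACME account from an issued certificate. A rough way to tell them apart is to count certificate entries in the file. This sketch assumes Traefik stores each issued certificate under a "certificate" JSON key; the function name `acme_cert_count` is illustrative, and reading the file usually requires root because of its 600 mode.

```shell
# Count certificate entries stored in acme.json.
# Assumption: each issued certificate appears under a "certificate" JSON key.
acme_cert_count() {
  acme_path="$1"
  # A missing or empty file means nothing has been issued yet.
  [ -s "$acme_path" ] || { echo 0; return 0; }
  grep -o '"certificate"[[:space:]]*:' "$acme_path" | wc -l | tr -d ' '
}
```

Usage: `sudo sh -c '. ./helpers.sh; acme_cert_count /opt/po-platform/infrastructure/config/traefik/acme.json'`, where helpers.sh is a hypothetical file containing the function.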

Verification

After fixes:

# 1. Check Traefik is healthy
docker ps | grep traefik
# Should show: Up (healthy)

# 2. Check certificate (should not be self-signed)
# -k lets curl finish the handshake and print the issuer even if the certificate is self-signed
curl -kvI https://qual.portugalodyssey.pt 2>&1 | grep -i "issuer\|subject"

# Should show Let's Encrypt, not self-signed

# 3. Check certificate resolver errors are gone
docker logs po-traefik 2>&1 | grep -i "nonexistent certificate resolver" | wc -l
# Should output: 0 (no errors)
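
The issuer check can also be scripted so that it yields a single word instead of output to read by eye. A sketch using `openssl`; the function names `fetch_issuer` and `classify_issuer` are illustrative, and the "TRAEFIK DEFAULT CERT" match relies on the common name Traefik uses for its built-in fallback certificate.

```shell
# Print the issuer line of the certificate served for a host on port 443.
fetch_issuer() {
  host="$1"
  openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -issuer
}

# Classify an issuer string into letsencrypt, traefik-default, or other.
classify_issuer() {
  case "$1" in
    *"Let's Encrypt"*)        echo "letsencrypt" ;;
    *"TRAEFIK DEFAULT CERT"*) echo "traefik-default" ;;
    *)                        echo "other" ;;
  esac
}
```

Usage: `classify_issuer "$(fetch_issuer qual.portugalodyssey.pt)"`. A result of `traefik-default` means Traefik is still serving its fallback certificate.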

Troubleshooting

If Traefik Still Unhealthy

# Check healthcheck manually
docker exec po-traefik traefik healthcheck --ping

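# Check Traefik API
curl -s http://localhost:8080/api/rawdata | jq '.routers' | head -20

# Check which certificate resolver each router references
# (rawdata reports dynamic configuration, so the resolver appears under each router's tls settings)
curl -s http://localhost:8080/api/rawdata | jq '.routers[].tls'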

If Self-Signed Certificate Persists

  1. Check ACME file is writable:

    docker exec po-traefik ls -l /acme.json
    docker exec po-traefik touch /acme.json
    

  2. Check network connectivity:

    docker exec po-traefik ping -c 1 acme-v02.api.letsencrypt.org
    

  3. Check Let's Encrypt rate limits:

    • Visit: https://letsencrypt.org/docs/rate-limits/
    • If rate limited, wait before retrying
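
ICMP ping can be blocked even when HTTPS works, so the connectivity check can be complemented with a request to the ACME directory endpoint itself. A sketch: the function name `acme_directory_reachable` is illustrative, and it tests connectivity from wherever it runs (for example the VPS host), not from inside the container.

```shell
# Succeed if the Let's Encrypt production directory answers with HTTP 200.
# Note: this checks the machine it runs on, not the Traefik container's network.
acme_directory_reachable() {
  url="${1:-https://acme-v02.api.letsencrypt.org/directory}"
  # -m 10 caps the whole request at 10 seconds; -w prints only the status code.
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url")
  [ "$code" = "200" ]
}
```

Usage: `acme_directory_reachable && echo "ACME API reachable" || echo "cannot reach ACME API"`.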

If Certificate Resolver Still Not Found

  1. Verify configuration is in command:

    docker inspect po-traefik | jq '.[0].Args' | grep certificatesresolvers
    

  2. Check Traefik version:

    docker exec po-traefik traefik version
    # Should be v3.6
    

  3. Try using TLS challenge instead:

    # Alternative: Use TLS challenge (requires port 443 accessible)
    - --certificatesresolvers.letsencrypt.acme.tlschallenge=true
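
One possible cause of this error is a mismatch between the resolver name defined in the Traefik command and the name that routers reference. The defined names can be extracted from the container arguments for comparison. A sketch; the function name `resolver_names` is illustrative.

```shell
# Extract resolver names from Traefik CLI flags supplied on stdin, one flag per line.
# Flag names are matched case-insensitively; the resolver name is printed as written.
resolver_names() {
  grep -io '^--certificatesresolvers\.[^.=]*' | cut -d. -f2 | sort -u
}
```

Usage: `docker inspect --format '{{range .Args}}{{println .}}{{end}}' po-traefik | resolver_names`. Every name used in a `traefik.http.routers.<router>.tls.certresolver` label should appear in that output with identical spelling.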
    

Expected Timeline

  • Immediate: Traefik should become healthy after restart
  • After HTTPS request: ACME account initializes (may take 1-2 minutes)
  • After certificate obtained: Self-signed certificate replaced, errors disappear