Docker health checks and monitoring that actually help
Docker HEALTHCHECK, restart policies, and how they fit with monitoring tools, log alerts, and external uptime checks.
Docker health checks tell the engine whether a container is fit to serve traffic. Monitoring tools tell humans something is wrong. Those are related but not the same.
A passing healthcheck does not mean users can log in. A failing healthcheck without alerts means the container sits unhealthy, or an orchestrator restarts it in a loop, while you sleep.
HEALTHCHECK in practice
Dockerfile example:
HEALTHCHECK --interval=30s --timeout=5s --start-period=40s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
Compose example:
services:
  api:
    image: my-api:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 40s
docker ps shows healthy, unhealthy, or starting. Orchestrators and compose can wait on service_healthy before starting dependents.
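To read the same status from a script rather than from docker ps, one option is to ask docker inspect for the health object. This is a minimal sketch: it assumes the docker CLI is on PATH, and the function names are made up for illustration. A container with no healthcheck is expected to report null, which the sketch maps to "none".

```python
import json
import subprocess
from typing import Tuple


def parse_health(inspect_json: str) -> Tuple[str, int]:
    """Return (status, failing_streak) from a `.State.Health` JSON document."""
    health = json.loads(inspect_json)
    if health is None:
        # Assumption: Docker prints null when the container defines no healthcheck.
        return ("none", 0)
    return (health["Status"], health.get("FailingStreak", 0))


def container_health(name: str) -> Tuple[str, int]:
    """Ask the Docker CLI for one container's health state."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{json .State.Health}}", name],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_health(result.stdout)
```

Calling container_health("api") with a real container name returns a pair such as the status string and the current failing streak, which is enough to drive a simple alert.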
Healthcheck without curl in the image
Minimal images often lack curl. Alternatives:
# wget (busybox/alpine)
HEALTHCHECK CMD wget -q -O- http://localhost:8080/health || exit 1
# bash TCP check via /dev/tcp (weaker: port open != app healthy; needs bash and timeout in the image)
HEALTHCHECK CMD timeout 1 bash -c '</dev/tcp/localhost/8080' || exit 1
TCP checks catch "nothing listening." They miss "listening but returning 500." Prefer HTTP when the image allows it.
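If the image already ships a Python interpreter, a small probe script is another way to get a real HTTP check without curl or wget. The sketch below is an assumption-laden example, not a standard tool: the default URL mirrors the examples above, and proxies are disabled on purpose so an http_proxy variable in the container cannot redirect a localhost probe.

```python
import http.client
import urllib.error
import urllib.request
from typing import List

# An opener with an empty ProxyHandler ignores proxy environment variables.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True only for a 2xx response received within the timeout."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # 4xx/5xx raise HTTPError (a URLError); refused connections and timeouts land here too.
        return False


def main(argv: List[str]) -> int:
    """Exit-code style entry point: 0 means healthy, 1 means unhealthy."""
    url = argv[1] if len(argv) > 1 else "http://localhost:8080/health"
    return 0 if probe(url) else 1
```

To use it as a healthcheck command, add import sys and sys.exit(main(sys.argv)) at the bottom of the file and point HEALTHCHECK CMD at the script; the script path inside the image is up to you.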
depends_on with health conditions
Compose v2 supports waiting for healthy dependencies:
services:
  api:
    image: my-api:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      retries: 5
      start_period: 30s
  worker:
    image: my-worker:latest
    depends_on:
      api:
        condition: service_healthy
Without condition: service_healthy, the worker starts while the API is still booting and may spam connection errors in logs. Those errors look like incidents in your log viewer unless you expect them during startup.
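When the dependency lives outside the compose file, or restarts later while the worker keeps running, the worker can do its own waiting. A minimal sketch of retry with exponential backoff follows; the function name and default values are illustrative, and check stands for whatever call tells you the API is reachable.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int = 10,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() until it returns True, doubling the pause between tries."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if check():
            return True
        if attempt < attempts:
            # Back off so a booting dependency is not flooded with connections.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False
```

Logging one line per failed attempt at a low level, rather than a stack trace, keeps startup noise from looking like an incident.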
What makes a good health endpoint
- Hits the app process, not just "port open"
- Avoids requiring auth for a trivial /health or /ready
- Separates liveness (process up) from readiness (can serve traffic) if you run Kubernetes later
- Fast: sub-second response, no full DB migration check on every probe unless you mean it
A useful pattern:
- /health or /live: returns 200 if the process is up
- /ready: returns 200 only if DB and cache connections work
Docker has one HEALTHCHECK per container. Combine checks thoughtfully or pick the stricter one for Docker and split later in K8s.
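The liveness and readiness split can be sketched with nothing but the standard library. Everything named here is hypothetical: the READINESS_CHECKS table, its placeholder lambdas, and port 8080 stand in for your real DB and cache pings.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict

# Hypothetical dependency checks; replace the lambdas with real DB and cache pings.
READINESS_CHECKS: Dict[str, Callable[[], bool]] = {
    "db": lambda: True,
    "cache": lambda: True,
}


def status_for(path: str, checks: Dict[str, Callable[[], bool]]) -> int:
    """Map a probe path to an HTTP status code."""
    if path in ("/health", "/live"):
        return 200  # liveness: the process answered, nothing else is checked
    if path == "/ready":
        try:
            ok = all(check() for check in checks.values())
        except Exception:
            ok = False  # a check that raises counts as not ready
        return 200 if ok else 503
    return 404


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(status_for(self.path, READINESS_CHECKS))
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format: str, *args) -> None:
        pass  # keep probe traffic out of the application log


def serve(port: int = 8080) -> None:
    """Run the probe endpoints; blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

With a single Docker HEALTHCHECK, pointing it at /ready gives the stricter signal; /live stays available for a later Kubernetes liveness probe.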
Common mistakes
- Checking curl localhost on the wrong port inside the container
- 120s start_period on a 5s boot app (slows real failure detection)
- Healthcheck that hammers the database every 10s
- No healthcheck at all on stateful services that can deadlock while still running
Another frequent one: the healthcheck command hits the public URL through the load balancer instead of localhost inside the container. That tests nginx and DNS, not the app process. Keep probes inside the container network namespace.
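One way to keep a readiness probe from hammering the database is to cache the result of the expensive check for a few seconds. The class below is a sketch under that assumption; the name CachedCheck and the 15 second default are invented for illustration, and the trade-off is that a failure can go unnoticed for up to one TTL.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Wrap an expensive check so frequent probes reuse a recent result."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl: float = 15.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._expires: Optional[float] = None
        self._last = False

    def __call__(self) -> bool:
        now = self._clock()
        if self._expires is None or now >= self._expires:
            try:
                self._last = bool(self._check())
            except Exception:
                self._last = False  # a check that raises counts as failing
            self._expires = now + self._ttl
        return self._last
```

Wrapping a database ping in CachedCheck lets a 10s probe interval translate into at most one real query per TTL.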
Restart policies
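restart: unless-stopped
| Policy | Behavior |
|---|---|
| no | Default; never restarts automatically |
| on-failure | Restarts after a non-zero exit |
| unless-stopped | Like always, but stays stopped after a manual stop; common for servers |
| always | Restarts on any exit, and comes back on daemon restart even after a manual stop |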
A restart policy plus an unhealthy status does not mean Docker replaces the container. Restart policies react to the process exiting; standalone Docker Engine only marks the container unhealthy, while Swarm replaces unhealthy tasks. Health status informs humans and compose dependencies, so know your version and orchestration behavior.
When a container is stuck in a restart loop, logs are the evidence. A log viewer beats parsing docker events by hand.
Detecting restart storms
Signs something is wrong even when docker ps shows Up:
- Restart count climbing in docker inspect
- Same error line every 30 seconds in logs
- Health flapping between starting and unhealthy
Wire an alert on Docker die or restart events, or a log pattern that appears only on crash. Alert setup walks through Slack and Discord webhooks.
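A restart storm boils down to "too many die events in a short window." The sketch below assumes the timestamps come from JSON lines such as those printed by docker events --filter event=die --format '{{json .}}', where the time field holds Unix seconds; the five minute window and threshold of three are arbitrary starting points.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Iterable


def event_time(line: str) -> datetime:
    """Parse the Unix-seconds `time` field of one JSON event line."""
    return datetime.fromtimestamp(json.loads(line)["time"], tz=timezone.utc)


def is_restart_storm(
    events: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=5),
    threshold: int = 3,
) -> bool:
    """True when at least `threshold` events fall inside the window ending at `now`."""
    cutoff = now - window
    recent = sum(1 for ts in events if cutoff <= ts <= now)
    return recent >= threshold
```

Feeding it the die events of a single container keeps one noisy service from masking another.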
Health checks vs monitoring tools
| Layer | What it does |
|---|---|
| HEALTHCHECK | Local process judgment, Docker status |
| docker stats / cAdvisor | Resource usage |
| Log tail + alerts | App errors in stdout |
| Uptime monitor | User-visible URL from outside |
| Prometheus | Time-series metrics, SLO alerts |
Use all layers that match your risk. A payment API wants external HTTP checks and metrics. An internal wiki might live with healthcheck plus log alerts.
None of these replace the others. HEALTHCHECK can pass while memory leaks. Logs can look clean while TLS is broken at the edge. External uptime can pass while one background worker is stuck.
Wiring health into DockLog workflows
DockLog is not a healthcheck engine. It complements one:
- Log alerts when health endpoint starts returning 500 in access logs
- Docker event alerts on restart storms
- CPU/memory thresholds when leaks precede health failures
- Tail during incidents when unhealthy appears in docker ps
Slack/Teams/Discord alerts covers channel setup.
For team access during firefighting, RBAC limits who can restart containers after you diagnose.
Incident sequence that works
- External monitor or user report says the site is down
- Check docker ps for health status and uptime
- Tail logs in DockLog (or the native app if you are on call away from your laptop)
- If healthcheck fails locally, fix the app; if healthcheck passes but users fail, check nginx, DNS, TLS
- Restart only after you know why; blind restarts hide root cause
External checks matter
Healthchecks run inside the container network namespace. They will not catch:
- Bad TLS cert on nginx
- DNS pointing to the wrong IP
- Cloudflare or CDN misconfig
- Database reachable from app but not from users
Run Uptime Kuma, Better Stack, or similar against the public URL. Self-hosted monitoring guide: on a budget.
Check both the root URL and one authenticated or API path if auth middleware can fail independently of /health.
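Hosted uptime monitors usually cover certificate expiry, but the check itself is small enough to sketch. This example assumes a public HTTPS host on port 443 and uses only the standard library; the function names are illustrative. A failed handshake (expired or mismatched certificate) raises an exception, which is itself the alert condition.

```python
import socket
import ssl
import time
from typing import Optional


def days_left(not_after: str, now: Optional[float] = None) -> float:
    """Days until a certificate's notAfter timestamp (negative once expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Complete a verified TLS handshake and return the leaf certificate's notAfter."""
    ctx = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run it from outside your own network, against the hostname users actually type, so it exercises the same DNS and edge path they do.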
Kubernetes note
Docker HEALTHCHECK maps loosely to liveness and readiness probes. Same principles: cheap liveness, stricter readiness, do not DoS yourself. See Kubernetes monitoring tools for the cluster side.
In K8s you get two probes instead of one. Liveness restarts the pod; readiness removes it from Service endpoints. A DB migration on boot should affect readiness, not liveness, or you restart mid-migration.
DockLog tails pod logs the same way it tails Docker when the server runs in K8s mode. K8s log tailing for setup.
Minimal production checklist
- Every long-running service has a health endpoint
- Compose defines healthcheck and sensible restart
- Log rotation configured (max-size, max-file)
- One monitoring UI with auth for logs and resource peeks
- One external uptime check on the customer-facing URL
- One alert to a channel humans read
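The compose-level items on this list can be checked mechanically. The sketch below assumes the compose file has already been loaded into a dict, for example from docker compose config --format json, and the function names are invented for illustration. A missing healthcheck key is only a hint, since the image's Dockerfile may define one, and Swarm-style deploy.restart_policy is not inspected.

```python
from typing import Any, Dict, List


def audit_service(name: str, service: Dict[str, Any]) -> List[str]:
    """Return checklist findings for one Compose service definition."""
    findings: List[str] = []
    healthcheck = service.get("healthcheck")
    if not healthcheck or healthcheck.get("disable"):
        findings.append(f"{name}: no healthcheck")
    if service.get("restart", "no") == "no":
        findings.append(f"{name}: no restart policy")
    options = (service.get("logging") or {}).get("options") or {}
    for key in ("max-size", "max-file"):
        if key not in options:
            findings.append(f"{name}: logging option {key} not set")
    return findings


def audit(compose: Dict[str, Any]) -> List[str]:
    """Collect findings for every service in a parsed Compose document."""
    findings: List[str] = []
    for name, service in (compose.get("services") or {}).items():
        findings.extend(audit_service(name, service))
    return findings
```

An empty list means the compose-level items are covered; the external uptime check and the alert channel still need a human to confirm.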
Compose production tie-in
Healthchecks belong in the same compose file as TLS, auth, and logging limits. Compose for production shows a full example with DockLog behind Caddy.
Further reading
- Docker monitoring tools roundup
- Compose for production
- Why self-hosted log viewer
- Production reverse proxy
Health checks are the automatic reflex. Monitoring tools are the nervous system that tells the team. Configure both, or you will only notice the gap at 2am.