
Docker Compose service won't come back after restart? Check the restart policy


Debugging a Milvus-dependent service that failed to start in a RAG knowledge base project — full writeup below.

TL;DR

After a host reboot (or a container crash), a group of services didn't come back: the app port had no listener and docker ps -a showed everything Exited. The root cause: docker-compose.yml had no restart policy (default no), so once a container died it stayed dead. Fix: set restart: always on every production service so the infrastructure self-heals after a crash or reboot.

Symptom

The entire RAG pipeline went dark at once: the app's port 3003 had no listener, the process was missing from the process manager, and the frontend chat widget, knowledge-base search, and tenant queries all 404'd.

Starting the app by hand reproduced it:

$ uvicorn app.main:app
pymilvus.exceptions.MilvusException:
Fail connecting to server on localhost:19530, server unavailable
Application startup failed. Exiting.

On startup the app connects to its vector DB dependency (port 19530); the connection fails and it exits. Looking at the containers:

$ docker ps -a
CONTAINER ID   IMAGE                      STATUS
abc...         milvusdb/milvus:v2.4.11    Exited (255) 4 days ago
def...         minio/minio:...            Exited (255) 4 days ago
ghi...         quay.io/coreos/etcd:v3.5   Exited (255) 4 days ago

All three dependency containers had exited 4 days earlier and never restarted. Every app startup failed to connect and exited; after repeated restart failures the process manager dropped the app from its list, which looked like a process that had mysteriously vanished.

Root cause

All three services in docker-compose.yml had no restart:

# The broken setup: no restart, so the default is no
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    # ❌ no restart
  minio:
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    # ❌ no restart
  milvus:
    image: milvusdb/milvus:v2.4.11
    # ❌ no restart

Docker's default restart policy is no — once a container exits (crash, OOM, host reboot), it is never brought back automatically. That default is fine for development, but a ticking bomb in production:

  • Host reboot → every container without a restart policy stays Exited
  • A single container crash (OOM, transient dependency failure) → no self-heal, cascading to every downstream service that depends on it
  • Dependency cascade: etcd/minio down → milvus can't start → app can't connect → app exits → process manager gives up

The failure stays hidden because it only surfaces after a reboot/crash event — builds and deploys look completely normal in the meantime.
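One way to surface this before an outage is to audit a host for containers that have no restart policy. The sketch below is illustrative and not part of the original setup: the function names are hypothetical, and it assumes the JSON shape printed by docker inspect, where the policy sits under HostConfig.RestartPolicy.Name. It also treats an empty name the same as no, on the assumption that some Docker versions report the default that way.

```python
from __future__ import annotations

import json
import subprocess


def containers_without_restart_policy(inspect_json: str) -> list[str]:
    """Return names of containers whose restart policy is unset or "no".

    inspect_json is the JSON array printed by `docker inspect <id> [<id> ...]`.
    """
    flagged = []
    for container in json.loads(inspect_json):
        restart = container.get("HostConfig", {}).get("RestartPolicy", {})
        policy = restart.get("Name", "")
        if policy in ("", "no"):
            # docker inspect reports names with a leading slash, e.g. "/milvus"
            flagged.append(container.get("Name", "").lstrip("/"))
    return flagged


def inspect_all_containers() -> str:
    """Hypothetical helper: run docker inspect over every container on the host."""
    ids = subprocess.run(
        ["docker", "ps", "-aq"], capture_output=True, text=True, check=True
    ).stdout.split()
    if not ids:
        return "[]"
    return subprocess.run(
        ["docker", "inspect", *ids], capture_output=True, text=True, check=True
    ).stdout


# Usage on a host with Docker installed:
#   for name in containers_without_restart_policy(inspect_all_containers()):
#       print(f"no restart policy: {name}")
```

Run from cron or a deploy check, a script like this turns a silent misconfiguration into a visible warning, instead of waiting for the next reboot to reveal it.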

Solution

Add restart: always to every production service:

services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.5
    restart: always # ✅ self-heals after crash or host reboot
  minio:
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    restart: always # ✅
  milvus:
    image: milvusdb/milvus:v2.4.11
    restart: always # ✅
    depends_on:
      - etcd
      - minio

After a reboot, the dependency containers self-recover and the app comes back:

$ docker compose up -d        # with restart policies, a host reboot auto-restarts these
$ curl localhost:9091/healthz # health check 200, dependency ready
$ pm2 start ecosystem.config.js --only rag-service # app up, pipeline restored

The four restart policies compared:

| Policy | When it restarts | Use case |
| --- | --- | --- |
| no (default) | Never | Ephemeral containers, one-off tasks |
| always | Always (even after manual stop + daemon restart) | Production infra, databases |
| unless-stopped | Unless you manually stopped it | Most production services (recommended) |
| on-failure[:N] | Only on non-zero exit, with an optional retry cap | Batch jobs that exit cleanly |
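The "when it restarts" column can be read as a small decision function. The following is a simplified model for illustration only, not Docker's implementation: restarts_after_exit and its restarts_so_far parameter are hypothetical names, and the model covers only a container that exits on its own. It ignores manual stops, daemon restarts, and the delay Docker applies between restart attempts.

```python
def restarts_after_exit(policy: str, exit_code: int, restarts_so_far: int = 0) -> bool:
    """Simplified model: is a container restarted after exiting on its own?

    policy is a restart value such as "no", "always", "unless-stopped",
    "on-failure", or "on-failure:3".
    """
    name, _, cap = policy.partition(":")
    if name == "no":
        return False
    if name in ("always", "unless-stopped"):
        # Both restart on any exit, including a clean exit with code 0.
        return True
    if name == "on-failure":
        if exit_code == 0:
            return False
        # Without a cap, keep retrying; with a cap, stop once it is reached.
        return not cap or restarts_so_far < int(cap)
    raise ValueError(f"unknown restart policy: {policy!r}")


for policy in ("no", "always", "unless-stopped", "on-failure"):
    print(policy, restarts_after_exit(policy, exit_code=0), restarts_after_exit(policy, exit_code=255))
```

In this model always and unless-stopped behave identically; they differ only in how a manual stop is treated when the daemon restarts, which the model deliberately leaves out.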

If you're chasing down Docker service anomalies on a low-spec box, check container status and disk/CPU pressure first — the three-step Docker resource-blackhole troubleshooting guide covers another class of resource-exhaustion-induced service stalls.

Caveats

  • restart: always restarts the container even on a clean exit (exit code 0), and so does unless-stopped. If your service is a "run once and exit" batch job, use on-failure (or leave the default no), or it'll loop forever.
  • depends_on waits for "started", not "ready". Milvus is slow to start, and the app may try to connect before it's ready — add retry logic in the app, or use a healthcheck with depends_on.condition: service_healthy.
  • Verify the policy took effect: docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' <container> should print always or unless-stopped, not no.
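The retry logic mentioned in the depends_on caveat could look roughly like this. It is a minimal sketch under stated assumptions rather than the app's actual code: connect_with_retry and its parameters are hypothetical, and the connection call is passed in as a callable so the helper does not depend on any particular client library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 10,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure.

    Re-raises the last exception once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f"attempt {attempt}/{attempts} failed: {exc}; retrying in {delay:.1f}s")
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise AssertionError("unreachable")


# Possible usage with pymilvus, assuming the app connects at startup:
#   from pymilvus import connections
#   connect_with_retry(lambda: connections.connect(host="localhost", port="19530"))
```

With something like this in the startup path, a slow Milvus start delays the app instead of killing it, so the process manager never sees the repeated crash loop described above.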

FAQ

What restart policies does docker compose support?

Four: no (default, never restart), always (always restart), unless-stopped (unless you manually stopped it), and on-failure[:N] (restart only on non-zero exit, with an optional retry cap). Production services usually use always or unless-stopped.

How do I make a docker container restart automatically?

Add restart: always (or unless-stopped) to the service in docker-compose.yml; or use --restart always with docker run. The container self-recovers after a crash or host reboot.

What is the difference between restart always and unless-stopped?

With always, a container you stopped manually with docker stop is started again when the Docker daemon restarts. With unless-stopped, the manual stop is honored and the container stays down after a daemon restart.


CCLEE

Independent developer, 24 years in e-commerce, focused on grounding AI in real business scenarios.
