Kubernetes Probes — /live, /ready, /health

What? (Concept Overview)

This service separates three health endpoints, two of which back Kubernetes probes. /live (liveness) answers “should the kubelet restart this container because it has hung?”. /ready (readiness) answers “should this pod receive traffic?”. /health is a diagnostic summary used by humans and uptime checkers, not by Kubernetes. Conflating them produces pods that either thrash (over-eager liveness) or receive traffic before they can serve (premature readiness).

Project Context

The FCA Support Agent’s app/routers/health.py exposes three endpoints, each tuned for a different consumer:

  • /api/v1/live: Never touches the DB; returns 200 OK immediately. The Kubernetes liveness probe polls it (every 10s in the pod spec example further down this page).
  • /api/v1/ready: Calls check_db_connection() and returns 503 if Postgres is unreachable. The Kubernetes readiness probe uses it to decide whether the pod stays in the Service’s endpoints.
  • /api/v1/health: Full diagnostic covering all subsystems; humans and external uptime checkers (Pingdom/UptimeRobot) consume this.

How? (Quick Reference Blocks)

3.1 The Health Router

    # app/routers/health.py
    from fastapi import APIRouter, status
    from fastapi.responses import JSONResponse
    from pydantic import BaseModel
    from datetime import datetime, timezone

    from app.config import settings
    from app.database import check_db_connection

    router = APIRouter()


    class HealthResponse(BaseModel):
        status: str
        timestamp: str
        version: str
        environment: str
        checks: dict

3.2 /live — Liveness Probe

    # app/routers/health.py
    @router.get("/live", tags=["Health"])
    async def liveness() -> dict:
        """Used by Kubernetes to determine if pod is alive."""
        return {
            "status": "alive",
            "timestamp": datetime.now(timezone.utc)
            .isoformat().replace("+00:00", "Z"),
        }

3.3 /ready — Readiness Probe

    # app/routers/health.py
    @router.get("/ready", tags=["Health"])
    async def readiness() -> dict:
        db_healthy = await check_db_connection()
        if db_healthy:
            return {
                "status": "ready",
                "timestamp": datetime.now(timezone.utc)
                .isoformat().replace("+00:00", "Z"),
            }
        return JSONResponse(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            content={
                "status": "not_ready",
                "reason": "database_unhealthy",
                "timestamp": datetime.now(timezone.utc)
                .isoformat().replace("+00:00", "Z"),
            },
        )

3.4 /health — Full Diagnostic

    # app/routers/health.py
    @router.get("/health", response_model=HealthResponse, tags=["Health"])
    async def health_check() -> HealthResponse:
        db_healthy = await check_db_connection()
        return HealthResponse(
            status="healthy" if db_healthy else "unhealthy",
            timestamp=datetime.now(timezone.utc)
            .isoformat().replace("+00:00", "Z"),
            version=settings.app_version,
            environment=settings.environment,
            checks={
                "database": {"status": "healthy" if db_healthy else "unhealthy",
                             "type": "postgresql",
                             "pool_size": settings.database_pool_size},
                "redis": {"status": "healthy" if settings.redis_enabled else "disabled",
                          "enabled": settings.redis_enabled},
                "groq_ai": {"status": "configured" if settings.groq_api_key else "not_configured",
                            "model": settings.groq_model},
            },
        )

Why? (Parameter Breakdown)

  • /live MUST NOT touch external dependencies: If liveness hits Postgres and Postgres is slow, every pod fails its liveness probe at the same time and the kubelet restarts all of them → cluster-wide outage. A liveness failure should imply the process itself is broken (deadlock, blocked event loop, exhausted file descriptors).
  • /ready SHOULD touch critical upstream dependencies: A pod is “ready” only if it can serve a real request. Without the DB check, traffic is routed to a pod that fails every query → end-user errors.
  • /health as the human-facing endpoint: Uptime monitors want one endpoint to alarm on, and that endpoint should be the most diagnostic. Don’t point uptime monitors at /ready; its 503s during startup and DB blips are expected and will generate noisy alerts.
  • Three separate endpoints instead of one with a mode flag: Avoids accidental k8s misconfiguration. With /health?mode=live, an ops engineer can fat-finger the probe config (for example, pointing livenessProbe at the DB-checking mode) and restart pods on DB blips. Separation by URL is unambiguous.
  • pydantic.BaseModel response model for /health: Self-documents the schema and validates the response shape at runtime. Useful for uptime-monitor onboarding.
  • UTC timestamps with explicit Z: Parses unambiguously in any SIEM/observability stack (see Structured Logging page).
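The handlers above each repeat the same timestamp expression. A minimal sketch of how that formatting could live in one place follows; the helper name utc_timestamp and its injectable `now` parameter are assumptions for illustration, not part of the project.

```python
from datetime import datetime, timezone
from typing import Optional


def utc_timestamp(now: Optional[datetime] = None) -> str:
    """Return an ISO 8601 UTC timestamp with a trailing 'Z' (hypothetical helper)."""
    # `now` is injectable so a test can pass a fixed instant.
    current = now if now is not None else datetime.now(timezone.utc)
    # Normalise to UTC first so other timezone-aware datetimes also end in 'Z'.
    return current.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(utc_timestamp(fixed))  # 2024-01-02T03:04:05Z
```

Keeping the format in a single function means every endpoint emits the same shape, which is the property the SIEM parsing argument relies on.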

Common Pitfalls

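  1. Liveness probe hitting /health: Cascading failure. A DB hiccup fails the readiness probe, which is fine because traffic simply shifts away. But if the liveness probe also depends on the DB, it fails on every pod at once → the kubelet restarts every container → total outage. Always point liveness at the bare-minimum /live.
  2. Readiness probe timeout shorter than the DB check: If readinessProbe has timeoutSeconds: 1 but the Postgres connect timeout is 30s, any probe issued while Postgres is hanging times out at the kubelet instead of receiving a clean 503, and a database that is merely slow (1-2s) flips pods to not-ready and can leave the Service with zero endpoints. Bound the DB check inside the app and keep that bound below timeoutSeconds.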
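Pitfall 2 comes down to comparing two numbers that usually live in different files. The sketch below shows that comparison as a check that could run in CI; the function name and the idea of feeding it values read from the manifest and the app config are assumptions, not something the project is known to do.

```python
def readiness_budget_ok(probe_timeout_s: float, db_check_timeout_s: float) -> bool:
    """True when the app-side DB check must finish before the kubelet gives up."""
    # The app needs to answer (200 or 503) strictly inside the probe's timeoutSeconds.
    return db_check_timeout_s < probe_timeout_s


# The situation from pitfall 2: 1s probe timeout, 30s connect timeout.
print(readiness_budget_ok(probe_timeout_s=1, db_check_timeout_s=30))  # False
# A bounded DB check that leaves headroom under the probe timeout.
print(readiness_budget_ok(probe_timeout_s=3, db_check_timeout_s=2))   # True
```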

Real-World Interview Prep

Q1: Your pods are flapping — kubectl get pods shows CrashLoopBackOff. Walk through your diagnostic ladder.

A: (1) kubectl describe pod <name>, Events section and Last State: did a liveness or startup probe failure restart the container, or did the process exit on its own (non-zero exit code, OOMKilled)? A failing readiness probe only removes the pod from the Service; it does not cause restarts. (2) kubectl logs <pod> --previous: read the logs of the last crashed container. Look for uncaught exceptions in lifespan startup. (3) If probe-related, raise initialDelaySeconds from the default 0 to 30s in the deployment manifest so the app can boot before the first liveness check. (4) Compare liveness vs readiness: if liveness is failing against a dependency-free /live, the Python process itself is wedged (infinite LLM loop, deadlock on a SQLAlchemy session). (5) Reproduce locally: simulate the same DB-down state and confirm that /live still returns 200.

Q2: Why not use a single /ping endpoint for everything?

A: Probe-purpose mismatch. Liveness needs a dependency-free yes/no. Readiness needs a yes/no that includes critical dependencies. Health needs to report every subsystem without failing fast on the first problem. A single /ping either over-checks (restarting pods on DB blips) or under-checks (routing traffic to broken pods). Three routes, three contracts.

Q3: How do you differentiate “DB slow” from “DB down” with the readiness probe?

A: Add a latency budget. Wrap check_db_connection() in asyncio.wait_for(..., timeout=2.0). If the DB responds within 2s → ready; if the timeout fires → 503 not ready. This way a 30s DB hang produces a 2s readiness failure (and the pod is removed from the Service after failureThreshold consecutive failures) without waiting for the full TCP connect timeout. Set the probe’s timeoutSeconds above the budget (for example 3), because the default of 1s would time out at the kubelet first. Liveness needs no such wrapper: /live makes no dependency call, so its latency ceiling is simply the probe’s timeoutSeconds.
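The latency budget from the answer above can be sketched in isolation. Here check_db_connection is a stand-in that sleeps to simulate database latency (the real function takes no arguments), and db_ready_within is a hypothetical wrapper name; only asyncio.wait_for is the actual mechanism being shown.

```python
import asyncio


async def check_db_connection(delay_s: float) -> bool:
    """Stand-in for the real DB ping; sleeps to simulate database latency."""
    await asyncio.sleep(delay_s)
    return True


async def db_ready_within(delay_s: float, budget_s: float) -> bool:
    """Return True only if the DB ping succeeds inside the latency budget."""
    try:
        return await asyncio.wait_for(check_db_connection(delay_s), timeout=budget_s)
    except asyncio.TimeoutError:
        # Slow counts as not ready; the /ready handler would answer 503 here.
        return False


print(asyncio.run(db_ready_within(delay_s=0.01, budget_s=0.2)))  # True
print(asyncio.run(db_ready_within(delay_s=0.5, budget_s=0.05)))  # False
```

The wrapper turns "slow" and "down" into the same not-ready answer, which is the intent: either way the pod should not take traffic until the database responds within budget.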

Top-to-Bottom Code Walkthrough (app/routers/health.py)

The router exposes four endpoints, each with a distinct purpose; /live and /ready back the Kubernetes liveness and readiness probes. Mixing them up is a common ops mistake: your app restarts unnecessarily because a “live” probe checks the database.

/health — comprehensive

    @router.get("/health")
    async def health_check() -> HealthResponse:
        db_healthy = await check_db_connection()
        overall_status = "healthy" if db_healthy else "unhealthy"
        ...
        return JSONResponse(status_code=503, ...)

Purpose: humans and uptime monitors. Includes all dependency checks (DB, Redis, Groq config). Returns 503 when the overall status is unhealthy, which in this excerpt is driven by the database check.

/ping — immediate response

    @router.get("/ping")
    async def ping() -> PingResponse:
        return PingResponse(status="ok", timestamp=datetime.now(timezone.utc))

Purpose: cheap liveness check. Returns instantly without dependency probing. Used for extremely-fast probe intervals (1s).

/ready — readiness probe

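    @router.get("/ready")
    async def readiness() -> dict:
        db_healthy = await check_db_connection()
        is_ready = db_healthy
        if not is_ready:
            # A non-2xx status is what tells Kubernetes to withhold traffic.
            return JSONResponse(status_code=503, content={"status": "not_ready"})
        return {"status": "ready"}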

Purpose: tells Kubernetes “should I send traffic here?”. 503 = pod is alive but not yet ready. Stops traffic until the pod can serve. Critical for graceful startup — K8s waits for /ready 200 before adding the pod to service load balancers.

/live — liveness probe

    @router.get("/live")
    async def liveness() -> dict:
        return {"status": "alive", "timestamp": ...}

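Purpose: tells Kubernetes “is the pod dead?”. Any 2xx or 3xx response means alive. Only repeated failures (failureThreshold consecutive non-2xx/3xx responses or timeouts) should trigger a container restart. If a liveness probe depends on external services, a flaky Redis or DB causes restarts you don’t want.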
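To make the restart rule concrete, here is a simplified model of how consecutive liveness failures add up. It is a sketch, not kubelet code: the function names are invented, timeouts are not modelled, and the default of 3 mirrors Kubernetes' default failureThreshold.

```python
from typing import Iterable, Optional


def is_probe_success(status_code: int) -> bool:
    """HTTP probes count any status from 200 up to (not including) 400 as success."""
    return 200 <= status_code < 400


def first_restart_index(statuses: Iterable[int], failure_threshold: int = 3) -> Optional[int]:
    """Return the index of the probe that would trigger a restart, or None."""
    consecutive_failures = 0
    for index, status_code in enumerate(statuses):
        if is_probe_success(status_code):
            consecutive_failures = 0  # any success resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return index
    return None


print(first_restart_index([200, 503, 200, 503, 503]))  # None
print(first_restart_index([200, 503, 503, 503, 200]))  # 3
```

Isolated failures are tolerated, so a liveness endpoint that flaps with a dependency only becomes dangerous when the dependency stays down for a full streak, which is exactly what happens during a real outage.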

The critical distinction

/live should never touch dependencies. /ready should touch only critical dependencies (DB). /health checks everything.
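The three contracts can be written down as a small lookup, which is handy as a parametrised test table. This is an illustrative sketch: the function name is made up, and the /health row follows this walkthrough's description of returning 503 when degraded.

```python
from typing import Dict


def expected_status_codes(db_up: bool) -> Dict[str, int]:
    """Status code each endpoint is meant to return for a given database state."""
    return {
        "/live": 200,                      # never consults the database
        "/ready": 200 if db_up else 503,   # gates traffic on the database
        "/health": 200 if db_up else 503,  # full diagnostic, per this walkthrough
    }


print(expected_status_codes(db_up=False))
# {'/live': 200, '/ready': 503, '/health': 503}
```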

Mixing them is one of the most common k8s misconfigurations.

Kubernetes pod spec example

    livenessProbe:
      httpGet:
        path: /api/v1/live
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /api/v1/ready
        port: 8000
      initialDelaySeconds: 5
      periodSeconds: 5

30s initial delay on liveness to let the app boot. 5s on readiness to detect quickly when the pod is healthy.
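The probe settings also determine how long a failure can go unnoticed. The arithmetic below is a rough upper bound, assuming the failure starts just after a successful probe and that failureThreshold and timeoutSeconds are left at their Kubernetes defaults (3 and 1), since the manifest above does not set them. The function name is hypothetical.

```python
def detection_upper_bound_s(period_s: int, failure_threshold: int = 3, timeout_s: int = 1) -> int:
    """Rough worst-case seconds between a failure starting and Kubernetes acting on it."""
    # failure_threshold failing probes, one per period, plus the last probe's timeout.
    return period_s * failure_threshold + timeout_s


print(detection_upper_bound_s(period_s=10))  # liveness in the manifest above: 31
print(detection_upper_bound_s(period_s=5))   # readiness in the manifest above: 16
```

With these values a hung process is restarted after roughly half a minute, while a pod that loses its database stops receiving traffic in about half that time.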

check_db_connection() (from app/database.py)

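    from sqlalchemy import text

    async def check_db_connection() -> bool:
        try:
            # AsyncSessionLocal: the project's async session factory (defined elsewhere).
            async with AsyncSessionLocal() as session:
                await session.execute(text("SELECT 1"))
                return True
        except Exception:
            # Any failure (connect error, timeout, auth) counts as unhealthy.
            return False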

The simplest possible DB ping — one SELECT 1. Does not touch tables, indexes, or row counts.
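The same ping pattern can be tried without a Postgres instance. The sketch below uses the standard library's sqlite3 purely as a stand-in so it runs anywhere; it is synchronous and is not how the project connects to its database.

```python
import sqlite3


def ping(connection: sqlite3.Connection) -> bool:
    """Return True if a trivial query succeeds, False on any database error."""
    try:
        connection.execute("SELECT 1").fetchone()
        return True
    except sqlite3.Error:
        # A closed or broken connection is reported as unhealthy, not raised.
        return False


live_connection = sqlite3.connect(":memory:")
print(ping(live_connection))    # True

closed_connection = sqlite3.connect(":memory:")
closed_connection.close()
print(ping(closed_connection))  # False
```

Returning a boolean instead of raising keeps the health handlers free of exception handling: they only branch on the result.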

Common Pitfalls

Using /health as the liveness probe: a DB outage causes all your pods to restart, compounding the outage (thundering-herd restart).

Forgetting initialDelaySeconds: Kubernetes starts probing while the app is still booting and before the TCP socket is open, so probes fail for the first 30-60s. A failing liveness probe during that window restarts the container before it ever finishes starting. Always delay probes (or use a startupProbe).

Too-tight timeouts on /ready: a readiness probe with timeoutSeconds: 1 and a slow endpoint causes spurious pod removals. Tune carefully.

Real-World Interview Prep

Q1: Why three endpoints instead of one?

A: Kubernetes liveness and readiness probes have different semantics, and human-facing diagnostics are a third concern. Mixing them leads to cascading outages when dependencies fail. The triple-endpoint design expresses “liveness ≠ readiness ≠ diagnostic health” loudly.

Q2: What if the DB has a slow query that makes /ready time out?

A: Add a timeout to check_db_connection (e.g., 1 second via asyncio.wait_for). If the DB is slow, mark not_ready — but never mark the pod dead (only /live can do that).

Q3: How do you expose these in a service mesh like Istio?

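A: With sidecar injection, Istio rewrites HTTP probes by default: the kubelet’s request goes to the sidecar agent (port 15020, paths under /app-health/), which forwards it to your /live and /ready endpoints, so probes keep working under strict mTLS. No EnvoyFilter is needed. The sidecar also has its own readiness endpoint (/healthz/ready on port 15021). The pod’s readiness still decides whether it is listed as an endpoint; the mesh then routes among the ready endpoints.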
