Kubernetes Probes — /live, /ready, /health
What? (Concept Overview)
The pod lifecycle is governed by probes with distinct jobs, and this service pairs them with three endpoints. /live (liveness) answers “should the kubelet restart this container because it has hung?”. /ready (readiness) answers “should the Service send traffic to this pod?”. /health is a diagnostic summary used by humans and uptime checkers, not by Kubernetes. Conflating them produces pods that either thrash (over-eager liveness) or receive traffic before they can serve (premature readiness).
Project Context
The FCA Support Agent’s app/routers/health.py exposes three endpoints, each tuned for a different consumer:
- /api/v1/live: Never touches the DB; instant 200 OK. The Kubernetes liveness probe hits this every 5s.
- /api/v1/ready: Calls check_db_connection(); returns 503 if Postgres is unreachable. The Kubernetes readiness probe gates the Service.
- /api/v1/health: Full diagnostic with all subsystems; humans and external uptime checkers (Pingdom/UptimeRobot) consume this.
How? (Quick Reference Blocks)
3.1 The Health Router
# app/routers/health.py
from fastapi import APIRouter, status
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from datetime import datetime, timezone
from app.config import settings
from app.database import check_db_connection
router = APIRouter()

class HealthResponse(BaseModel):
    status: str
    timestamp: str
    version: str
    environment: str
    checks: dict

3.2 /live: Liveness Probe
# app/routers/health.py
@router.get("/live", tags=["Health"])
async def liveness() -> dict:
    """Used by Kubernetes to determine if pod is alive."""
    return {
        "status": "alive",
        "timestamp": datetime.now(timezone.utc)
        .isoformat().replace("+00:00", "Z"),
    }

3.3 /ready: Readiness Probe
# app/routers/health.py
@router.get("/ready", tags=["Health"])
async def readiness() -> dict:
    db_healthy = await check_db_connection()
    if db_healthy:
        return {
            "status": "ready",
            "timestamp": datetime.now(timezone.utc)
            .isoformat().replace("+00:00", "Z"),
        }
    return JSONResponse(
        status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
        content={
            "status": "not_ready",
            "reason": "database_unhealthy",
            "timestamp": datetime.now(timezone.utc)
            .isoformat().replace("+00:00", "Z"),
        },
    )

3.4 /health: Full Diagnostic
# app/routers/health.py
@router.get("/health", response_model=HealthResponse, tags=["Health"])
async def health_check() -> HealthResponse:
    db_healthy = await check_db_connection()
    return HealthResponse(
        status="healthy" if db_healthy else "unhealthy",
        timestamp=datetime.now(timezone.utc)
        .isoformat().replace("+00:00", "Z"),
        version=settings.app_version,
        environment=settings.environment,
        checks={
            "database": {"status": "healthy" if db_healthy else "unhealthy",
                         "type": "postgresql",
                         "pool_size": settings.database_pool_size},
            "redis": {"status": "healthy" if settings.redis_enabled
                      else "disabled", "enabled": settings.redis_enabled},
            "groq_ai": {"status": "configured" if settings.groq_api_key
                        else "not_configured",
                        "model": settings.groq_model},
        },
    )

Why? (Parameter Breakdown)
- /live MUST NOT touch external dependencies: If liveness hits Postgres and Postgres is slow, all pods get killed simultaneously → cluster-wide outage. Liveness failure should imply the process itself is broken (deadlock, OOM, leaked file descriptors).
- /ready SHOULD touch critical upstream dependencies: A pod is “ready” iff it can serve a real request. Without DB checks, you’ll route traffic to a pod that fails every query → end-user errors.
- /health as the human-facing endpoint: Uptime monitors want one endpoint to alarm on; that endpoint should be the most diagnostic. Don’t ask uptime monitors to poll /ready, because the 503s will be noisy.
- Three separate endpoints instead of one with mode flag: Avoids accidental k8s misconfiguration. If you have /health?mode=live, an ops engineer can fat-finger the readinessProbe config and accidentally kill pods on DB blips. Separation by URL is unambiguous.
- pydantic.BaseModel response model for /health: Self-documents the schema and validates the response shape at runtime. Useful for uptime-monitor onboarding.
- UTC timestamps with explicit Z: Parses unambiguously in any SIEM/observability stack (see Structured Logging page).
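The timestamp expression is repeated in every handler above. As a minimal sketch, it could be factored into one helper; the name utc_timestamp is hypothetical and does not appear in app/routers/health.py.

```python
from datetime import datetime, timezone
from typing import Optional


def utc_timestamp(now: Optional[datetime] = None) -> str:
    """Return an ISO 8601 UTC timestamp ending in 'Z' (hypothetical helper)."""
    moment = now if now is not None else datetime.now(timezone.utc)
    # Normalise any timezone-aware datetime to UTC before formatting.
    moment = moment.astimezone(timezone.utc)
    # isoformat() renders UTC as "+00:00"; swap it for the shorter "Z" suffix.
    return moment.isoformat().replace("+00:00", "Z")


fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(utc_timestamp(fixed))  # 2024-01-02T03:04:05Z
```

Accepting an optional datetime keeps the helper easy to test with a fixed instant.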
Common Pitfalls
- Liveness probe hitting /health: Cascading failure. A DB hiccup fails the readiness probe, which is fine because traffic shifts away. But if liveness also depends on the DB, it fails on every pod at once → Kubernetes restarts every pod → total outage. Always point liveness at the bare-minimum /live.
- Readiness probe timeout shorter than connect timeout: If readinessProbe timeoutSeconds: 1 but the Postgres connect timeout is 30s, any check that takes longer than 1s times out at the probe. During a DB stall the pod is never marked ready → the Service has zero endpoints.
Real-World Interview Prep
Q1: Your pods are flapping — kubectl get pods shows CrashLoopBackOff. Walk through your diagnostic ladder.
A: (1) kubectl describe pod <name>, Events section: did a liveness probe failure restart the container, or did the process exit on its own (check the last state and exit code)? Readiness failures only remove the pod from the Service; they never restart it. (2) kubectl logs <pod> --previous: read the logs of the last crashed container and look for uncaught exceptions in lifespan startup. (3) If probe-related, increase initialDelaySeconds from the default 0 to 30s in the deployment manifest. (4) Compare liveness and readiness: if liveness is failing, the Python process is wedged (infinite LLM loop, deadlock on a SQLAlchemy session). (5) Reproduce locally: simulate the same DB-down state and confirm that /live still returns 200.
Q2: Why not use a single /ping endpoint for everything?
A: Probe-purpose mismatch. Liveness needs a minimal, dependency-free yes/no. Readiness needs a yes/no that covers critical dependencies. Health needs a full report that checks every subsystem rather than stopping at the first failure. A single /ping either over-checks (killing pods on DB blips) or under-checks (routing traffic to broken pods). Three routes, three contracts.
Q3: How do you differentiate “DB slow” from “DB down” with the readiness probe?
A: Add a latency budget. Wrap check_db_connection() in asyncio.wait_for(..., timeout=2.0). If the DB responds within 2s → ready; if the timeout fires → 503 not ready. This way a 30s DB hang produces a 2s readiness fail (and the pod is removed from the Service) without waiting for the full TCP connect timeout. The same pattern works for liveness if pod-level latency matters — but with a tighter ceiling (e.g., 200ms).
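The latency budget described in this answer can be sketched as follows. This is an illustration under assumptions: check_with_budget, fast_db, and hung_db are hypothetical names, and the two stand-in coroutines replace the real check_db_connection() so the example runs without a database.

```python
import asyncio
from typing import Awaitable, Callable


async def check_with_budget(
    check: Callable[[], Awaitable[bool]], budget_seconds: float = 2.0
) -> bool:
    """Run a dependency check; treat exceeding the budget as 'not ready'."""
    try:
        return await asyncio.wait_for(check(), timeout=budget_seconds)
    except asyncio.TimeoutError:
        # The check was cancelled after budget_seconds: report not ready.
        return False


async def fast_db() -> bool:
    # Stand-in for a database that answers immediately.
    return True


async def hung_db() -> bool:
    # Stand-in for a database that hangs far beyond the budget.
    await asyncio.sleep(10)
    return True


async def demo() -> tuple:
    fast = await check_with_budget(fast_db, budget_seconds=0.05)
    hung = await check_with_budget(hung_db, budget_seconds=0.05)
    return fast, hung


results = asyncio.run(demo())
print(results)  # (True, False)
```

In the readiness handler, a False result would map to the 503 response, so a hung database costs the probe only the budget rather than the full connect timeout.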
Top-to-Bottom Code Walkthrough (app/routers/health.py)
The router exposes separate endpoints for liveness, readiness, and diagnostics, each with a distinct purpose. Mixing them is a common ops mistake: for example, the app restarts unnecessarily because the liveness probe checks the database.
/health — comprehensive
@router.get("/health")
async def health_check() -> HealthResponse:
    db_healthy = await check_db_connection()
    overall_status = "healthy" if db_healthy else "unhealthy"
    ...
    return JSONResponse(status_code=503, ...)

Purpose: humans and uptime monitors. Includes all dependency checks (DB, Redis, Groq config). Returns 503 if anything is degraded.
/ping — immediate response
@router.get("/ping")
async def ping() -> PingResponse:
    return PingResponse(status="ok", timestamp=datetime.now(timezone.utc))

Purpose: cheap liveness check. Returns instantly without dependency probing. Used for extremely fast probe intervals (1s).
/ready — readiness probe
@router.get("/ready")
async def readiness() -> dict:
    db_healthy = await check_db_connection()
    is_ready = db_healthy
    if not is_ready:
        # Kubernetes reads only the status code, so "not ready" must be non-2xx.
        return JSONResponse(status_code=503, content={"status": "not_ready"})
    return {"status": "ready"}

Purpose: tells Kubernetes “should I send traffic here?”. 503 = pod is alive but not yet ready. Stops traffic until the pod can serve. Critical for graceful startup: K8s waits for /ready to return 200 before adding the pod to Service load balancers.
/live — liveness probe
@router.get("/live")
async def liveness() -> dict:
    return {"status": "alive", "timestamp": ...}

Purpose: tells Kubernetes “is the container dead?”. 200 = alive. Only a failing response (a status outside 200-399, or a timeout) should trigger a restart. If liveness depended on external services, a flaky Redis or DB would cause restarts you don’t want.
The critical distinction
/live should never touch dependencies.
/ready should touch only critical dependencies (DB).
/health checks everything.
Mixing them is the most common k8s misconfiguration.
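The three contracts can be modelled as plain functions over a snapshot of dependency state. This is a framework-free sketch for illustration: Dependencies, live_status, ready_status, and health_report are hypothetical names, not part of the router.

```python
from dataclasses import dataclass


@dataclass
class Dependencies:
    """Hypothetical snapshot of dependency state, for illustration only."""
    db_ok: bool
    redis_ok: bool


def live_status(deps: Dependencies) -> int:
    # Liveness ignores dependencies entirely: the process answered, so 200.
    return 200


def ready_status(deps: Dependencies) -> int:
    # Readiness looks only at the critical dependency (the database).
    return 200 if deps.db_ok else 503


def health_report(deps: Dependencies) -> dict:
    # Health reports every subsystem for humans and uptime monitors.
    return {
        "database": "healthy" if deps.db_ok else "unhealthy",
        "redis": "healthy" if deps.redis_ok else "unhealthy",
    }


db_down = Dependencies(db_ok=False, redis_ok=True)
print(live_status(db_down), ready_status(db_down), health_report(db_down))
# 200 503 {'database': 'unhealthy', 'redis': 'healthy'}
```

With the database down, liveness stays at 200 (no restart), readiness returns 503 (traffic shifts away), and the health report names the failing subsystem.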
Kubernetes pod spec example
livenessProbe:
  httpGet:
    path: /api/v1/live
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /api/v1/ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5

30s initial delay on liveness to let the app boot. 5s on readiness to detect quickly when the pod is healthy.
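A rough estimate of how long a failure must persist before Kubernetes acts on these settings can be worked out as period × failure threshold. The sketch below assumes failureThreshold is left at the Kubernetes default of 3, which the spec above does not set explicitly; it also ignores timeoutSeconds and probe scheduling jitter, so treat the numbers as approximations. The function name is hypothetical.

```python
def seconds_until_action(period_seconds: int, failure_threshold: int = 3) -> int:
    """Approximate sustained-failure time before the kubelet reacts.

    failure_threshold=3 is an assumption (the Kubernetes default);
    the pod spec in this document does not set it.
    """
    return period_seconds * failure_threshold


liveness_restart_after = seconds_until_action(period_seconds=10)
readiness_removed_after = seconds_until_action(period_seconds=5)

print(liveness_restart_after)   # 30
print(readiness_removed_after)  # 15
```

Under these assumptions, a wedged process is restarted after roughly 30s, while a pod that loses its database leaves the Service after roughly 15s.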
check_db_connection() (from app/database.py)
from sqlalchemy import text

async def check_db_connection() -> bool:
    try:
        async with AsyncSessionLocal() as session:
            # Round-trip to Postgres without touching any table.
            await session.execute(text("SELECT 1"))
        return True
    except Exception:
        return False

The simplest possible DB ping: one SELECT 1. Does not touch tables, indexes, or row counts.
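One possible variation, not taken from app/database.py, keeps the same True/False contract but records why a check failed, since a bare except hides the cause. The names check_dependency, ok_ping, and failing_ping are hypothetical, and the stand-in coroutines replace a real database session so the sketch runs on its own.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("health")


async def check_dependency(name: str, ping: Callable[[], Awaitable[None]]) -> bool:
    """Return True if ping() succeeds; log the failure and return False otherwise."""
    try:
        await ping()
        return True
    except Exception:
        # Same boolean contract as check_db_connection(), but the cause is logged.
        logger.warning("%s check failed", name, exc_info=True)
        return False


async def ok_ping() -> None:
    # Stand-in for a successful SELECT 1.
    return None


async def failing_ping() -> None:
    # Stand-in for an unreachable database.
    raise ConnectionError("connection refused")


outcome = (
    asyncio.run(check_dependency("database", ok_ping)),
    asyncio.run(check_dependency("database", failing_ping)),
)
print(outcome)  # (True, False)
```

The probe response stays a simple yes/no, while the log line gives an on-call engineer the underlying exception.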
Common Pitfalls
Using /health as the liveness probe — DB outage causes all your pods to restart, compounding the outage (thundering-herd restart).
Forgetting initialDelaySeconds: Kubernetes starts probing while the app is still booting, before the TCP socket is open. Probes fail for the first 30-60s and the pod is marked unhealthy. Always delay probes.
Mismatched timeouts on /ready: a readiness probe with timeoutSeconds: 1 against a slow endpoint causes spurious removals from the Service. Tune carefully.
Real-World Interview Prep
Q1: Why three endpoints instead of one?
A: Liveness, readiness, and diagnostic health have different consumers and different semantics. Mixing them leads to cascading outages when dependencies fail. The triple-endpoint design expresses “liveness ≠ readiness ≠ diagnostic health” loudly.
Q2: What if the DB has a slow query that makes /ready time out?
A: Add a timeout to check_db_connection (e.g., 1 second via asyncio.wait_for). If the DB is slow, mark not_ready — but never mark the pod dead (only /live can do that).
Q3: How do you expose these in a service mesh like Istio?
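A: With sidecar injection, Istio rewrites HTTP probes by default so that the kubelet calls the sidecar agent, which forwards the request to the application's own probe path; this keeps probes working under strict mTLS. The sidecar's own readiness endpoint (/healthz/ready) is separate from the application's /ready. Plain HTTP probes should not need an EnvoyFilter, and Kubernetes readiness still decides whether the pod receives traffic.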