Prometheus Instrumentation with prometheus-fastapi-instrumentator
What? (Concept Overview)
prometheus-fastapi-instrumentator is a library that adds a single ASGI middleware to a FastAPI app. The middleware times every request, resolves it to its route template when the request arrives, and records counters and histograms that are exposed at /metrics in the Prometheus exposition format. Call instrument(app) before the application starts serving: Starlette does not allow middleware to be added to a running app, and requests handled before instrumentation are not measured.
Project Context
The FCA Support Agent’s app/main.py mounts the Instrumentator INSIDE create_application(), before any route is registered. The configuration deliberately:
- Sets should_group_status_codes=False to label requests with the exact status code (200, 201, 404, ...) instead of the grouped 2xx, 3xx, 4xx, 5xx classes
- Sets should_ignore_untemplated=True to drop requests that match no registered route (for example scanner traffic hitting random paths); templated routes such as /api/v1/customers/{customer_id} are always labelled by their template
- Sets should_instrument_requests_inprogress=True to track concurrent in-flight requests (not just totals)
- Excludes /metrics, /health, /docs, /openapi.json from instrumentation so that scrape, probe, and documentation traffic does not pollute the request metrics
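The exclusion list deserves a closer look: the library's README shows entries such as ".*admin.*", which suggests the strings are treated as regular expressions rather than literal paths. The sketch below models that with the standard re module only. is_excluded is a hypothetical helper, and the unanchored search is an assumption about the library's behaviour that should be checked against the installed version.

```python
import re

# Patterns as written in the project configuration.
EXCLUDED = ["/metrics", "/health", "/docs", "/openapi.json"]

# Anchored, escaped variants that can only match the exact paths.
ANCHORED = [f"^{re.escape(p)}$" for p in EXCLUDED]


def is_excluded(path: str, patterns: list[str]) -> bool:
    """Return True if any pattern matches anywhere in the path."""
    return any(re.search(p, path) for p in patterns)


print(is_excluded("/health", EXCLUDED))               # exact path
print(is_excluded("/api/v1/health-plans", EXCLUDED))  # substring also matches
print(is_excluded("/api/v1/health-plans", ANCHORED))  # anchored pattern does not
```

If the unanchored assumption holds, a route such as /api/v1/health-plans would silently lose its metrics, so anchoring the patterns is the safer choice.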
How? (Quick Reference Blocks)
3.1 The Instrumentator Configuration Block
# app/main.py, inside create_application()
from prometheus_fastapi_instrumentator import Instrumentator

instrumentator = Instrumentator(
    should_group_status_codes=False,             # exact codes (200, 404), not 2xx/4xx classes
    should_ignore_untemplated=True,              # skip requests that match no route
    should_instrument_requests_inprogress=True,  # gauge for in-flight requests
    excluded_handlers=["/metrics", "/health", "/docs", "/openapi.json"],
)
instrumentator.instrument(app).expose(app, include_in_schema=False)
3.2 Custom Metric: Per-Agent Token Spend (Langfuse Counter Bridge)
For metrics that the Instrumentator doesn’t auto-derive, define a Counter once at module load and increment inside the agent.
# app/observability/metrics.py (illustrative)
from prometheus_client import Counter, Histogram

AGENT_TOKEN_USAGE = Counter(
    "agent_token_usage_total",
    "Tokens consumed per agent per turn",
    labelnames=("agent", "model"),
)
AGENT_LATENCY = Histogram(
    "agent_turn_latency_seconds",
    "Wall-clock latency per agent turn",
    labelnames=("agent",),
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
)

# Inside the agent:
AGENT_TOKEN_USAGE.labels(agent="intent_classifier", model=settings.groq_model).inc(
    usage.prompt_tokens + usage.completion_tokens
)
AGENT_LATENCY.labels(agent="intent_classifier").observe(elapsed_seconds)
Why? (Parameter Breakdown)
- should_group_status_codes=False: With grouping on (the default), you get a single http_requests_total{status="2xx"} series per handler and method, and operators can NOT distinguish 200 from 201 from 204 in alerts. Disabling grouping pushes the literal status code into the label.
- should_ignore_untemplated=True: Noise and cardinality control. Requests that match a route are always labelled with the templated path (/api/v1/customers/{customer_id}), never the concrete URL, so path parameters do not create one series per customer. This flag concerns requests that match NO route (typos, scanners, bots): with it enabled they are not recorded at all.
- should_instrument_requests_inprogress=True: Adds the http_requests_inprogress gauge. Essential for “is my service saturated?” alerts; without it you only see completed-request counters.
- excluded_handlers=[...]: /metrics is itself an HTTP endpoint; instrumenting it records every Prometheus scrape as traffic. /health is hit every 5s by Kubernetes and would dominate the time series; excluding it keeps signal-to-noise high. /docs and /openapi.json are dev-only.
- include_in_schema=False: Keeps /metrics out of the OpenAPI docs. Polish, but it stops a client from “discovering” the metrics endpoint through the schema.
- Counter with explicit labelnames: prometheus_client fixes the label names when the metric is declared, and .labels(...) raises ValueError if the names passed do not match. Declaring metrics once at module load keeps that contract in one place and surfaces duplicate-name registration errors at import time rather than mid-request.
- Histogram buckets=(...): The prometheus_client default buckets suit HTTP latencies (5ms-10s). For ML inference latencies (100ms-30s), override them so the buckets are neither too coarse nor too fine for the range you care about.
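To make the first two flags concrete, here is a small standard-library model of how a request could be turned into status and handler labels. It is a sketch, not the library's code: status_label and handler_label are hypothetical helpers, the template matching is deliberately simplified, and the "none" label for unmatched paths is an assumption based on the library's documented grouping of untemplated requests.

```python
import re


def status_label(code: int, group: bool) -> str:
    """Grouped: 201 becomes '2xx'. Ungrouped: 201 stays '201'."""
    return f"{code // 100}xx" if group else str(code)


def handler_label(path: str, templates: list[str], ignore_untemplated: bool):
    """Return the matching route template, 'none' for unmatched paths,
    or None when unmatched requests are dropped entirely."""
    for template in templates:
        # Turn '/customers/{customer_id}' into '^/customers/[^/]+$'.
        pattern = "^" + re.sub(r"\{[^/]+\}", "[^/]+", template) + "$"
        if re.match(pattern, path):
            return template
    return None if ignore_untemplated else "none"


TEMPLATES = ["/api/v1/customers/{customer_id}", "/api/v1/messages"]

print(status_label(201, group=True), status_label(201, group=False))
print(handler_label("/api/v1/customers/42", TEMPLATES, ignore_untemplated=True))
print(handler_label("/wp-login.php", TEMPLATES, ignore_untemplated=True))
```

Two different customers map to the same template label, which is why path parameters alone do not inflate cardinality.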
Common Pitfalls
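- Calling instrument(app) too late. instrument() registers a middleware, and Starlette refuses to add middleware once the application has started, so calling it from a startup or lifespan handler fails. The order relative to app.include_router(...) should not matter, because the handler label is resolved per request rather than from a snapshot of the route table. Instrumenting inside create_application() is still the simplest way to guarantee it happens before the first request.
- High-cardinality label values. NEVER label with customer_id, conversation_id, or any user-supplied value. Cardinality explodes Prometheus memory and TSDB write cost.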
Real-World Interview Prep
Q1: How do you build an SLO alert from these metrics?
A: Define an SLO like “99% of /api/v1/messages requests complete in under 500ms”. Express the latency side as histogram_quantile(0.99, sum by (le)(rate(http_request_duration_seconds_bucket{handler="/api/v1/messages"}[5m]))) < 0.5. The 5m rate window smooths out spikes; use the rule's for: clause (for example for: 20m) so the alert fires only after a sustained breach. Make sure 0.5 is one of the histogram's bucket boundaries, otherwise the quantile is interpolated across a wider bucket. Pair with an error-rate SLO: sum(rate(http_requests_total{handler="/api/v1/messages",status=~"5.."}[5m])) / sum(rate(http_requests_total{handler="/api/v1/messages"}[5m])) < 0.001. Alert on the burn rate (1h and 6h) for fast + slow detection.
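The burn-rate idea reduces to simple arithmetic, sketched here in plain Python. burn_rate and should_page are hypothetical helpers, and the threshold of 14.4 is only an illustrative value for a fast-burn alert; choose thresholds and windows to suit your own error budget policy.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget


def should_page(short_ratio: float, long_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Fire only when both the short and the long window exceed the threshold."""
    return (burn_rate(short_ratio, slo_target) > threshold
            and burn_rate(long_ratio, slo_target) > threshold)


# 2% errors in both windows against a 99.9% SLO: budget is burning about 20x too fast.
print(should_page(0.02, 0.02, slo_target=0.999, threshold=14.4))
# A short spike that the long window does not confirm does not page.
print(should_page(0.02, 0.005, slo_target=0.999, threshold=14.4))
```

Requiring both windows to agree is what gives fast detection without paging on a brief blip.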
Q2: Your Prometheus dashboard shows 100x request volume on /health but only 1x on /metrics. Why?
A: /health is hit by Kubernetes’ liveness probe (typically every 5-10s) and by external uptime checkers. /metrics is hit by Prometheus (typically every 15-30s). The /health series adds noise but isn’t a real workload signal, so always include "/health" in excluded_handlers AND restrict dashboards and alerts to real workload by filtering at query time (http_requests_total{handler!="/health"}).
Q3: How do you migrate from the FastAPI Instrumentator to OpenTelemetry for traces?
A: (1) Add opentelemetry-instrumentation-fastapi; it records every request as a span whose attributes (route, method, status code) are comparable to the Instrumentator's metric labels. (2) Configure an OTLP exporter pointing to Langfuse/Honeycomb. (3) Keep Prometheus for RED metrics (rate, errors, duration) and use OTel only for traces, since they answer different questions. (4) Use exemplars so a slow latency histogram bucket can deep-link to the trace IDs of requests that landed in it. Migration is incremental; both can run side by side for months.
Top-to-Bottom Code Walkthrough (app/main.py — Instrumentator config + custom metric emissions)
Beyond the auto-instrumentation, the project adds custom metrics for AI/LangChain-specific signals.
Custom LLM token counter
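from prometheus_client import Counter, Histogram

LLM_TOKENS_USED = Counter("llm_tokens_used_total", "Tokens consumed by LLM calls", ["model", "agent"])
# Note: no buckets are given, so the client default (largest finite bucket 10s) applies;
# pass buckets=(...) with larger bounds if quantiles above 10s need to be visible.
LLM_LATENCY = Histogram("llm_call_duration_seconds", "LLM call duration", ["model"])
Why a Counter for tokens: tokens are a monotonically increasing quantity. The counter restarts from zero when the process restarts, so query it with rate() or increase(), which account for resets, rather than reading the raw value.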
Why a Histogram for latency: distributions matter. p50, p95, p99 all tell different stories.
Inside BaseAgent._execute
After each Groq call:
LLM_TOKENS_USED.labels(model=self.model_name, agent=self.agent_name).inc(response.usage.total_tokens)
LLM_LATENCY.labels(model=self.model_name).observe(time.time() - start)
Labels are bounded (model is one of ["llama-3.1-8b-instant", "mixtral-8x7b-32768"]) so cardinality stays manageable.
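One way to keep the timing logic out of the agent body is a small context manager that reports the elapsed time even when the call raises. This is a sketch using only the standard library; timed is a hypothetical helper, and the list stands in for a real histogram's observe method.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(observe):
    """Measure wall-clock duration and pass it to observe(), even on error."""
    start = time.perf_counter()  # monotonic, unaffected by system clock changes
    try:
        yield
    finally:
        observe(time.perf_counter() - start)


durations = []  # stand-in for a histogram's observe()
with timed(durations.append):
    time.sleep(0.01)  # placeholder for the LLM call

print(len(durations), durations[0] > 0)
```

In the agent, the same pattern would wrap the Groq call with timed(LLM_LATENCY.labels(model=self.model_name).observe), so failed calls are still counted in the latency distribution.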
Why not just use Langfuse for everything?
Langfuse gives per-trace detail; Prometheus gives aggregate insight with actionable alerts. They answer different questions. Combining them means:
- Prometheus alerts: “p95 latency > 3s for 5 min” fires when degradation is broad.
- Langfuse dashboard: “Why was conversation ABC-123 slow?” gives drill-down detail.
_in_progress gauge
should_instrument_requests_inprogress=True enables http_requests_inprogress — a Prometheus gauge that tracks how many requests are currently being served.
Useful for “are we backed up?” alerts:
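sum(http_requests_inprogress) > 100
The metric is a gauge, so compare its value directly (or smooth it with avg_over_time); rate() is meant for counters.
Histogram buckets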
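The bucket list [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds is close to the prometheus_client Histogram default, which is what custom histograms such as llm_call_duration_seconds get unless buckets is set. The Instrumentator's own default metrics, per its README, are a per-handler http_request_duration_seconds histogram with only a few coarse buckets and an unlabelled http_request_duration_highr_seconds histogram with many. Fine for typical request/response APIs, but LLM calls can take 30 seconds. Recent releases accept bucket overrides on instrument() rather than on the constructor (verify against the installed version):
instrumentator = Instrumentator()
instrumentator.instrument(
    app,
    latency_lowr_buckets=(0.1, 0.5, 1, 2.5, 5, 10, 30, 60),
)
Labels you should NEVER use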
Unbounded-cardinality labels:
- customer_id
- conversation_id
- email
- Any PII field
A single customer’s 1000 chat sessions would generate 1000 label combinations. Multiply by N customers = OOM.
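The arithmetic behind that warning is a simple product, sketched below. series_count is a hypothetical helper and the label counts are made-up numbers for illustration.

```python
from math import prod


def series_count(label_values: dict[str, int], series_per_combination: int = 1) -> int:
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_values.values()) * series_per_combination


bounded = series_count({"model": 2, "agent": 5})
unbounded = series_count({"model": 2, "agent": 5, "customer_id": 10_000})
# A histogram with 8 finite buckets emits 8 + 1 (+Inf) + _sum + _count per combination.
histogram = series_count({"model": 2}, series_per_combination=8 + 1 + 2)

print(bounded, unbounded, histogram)  # 10 100000 22
```

Adding one unbounded label multiplies every existing combination, and histograms multiply it again by the number of buckets.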
Prometheus alerting rules (sample, routed through Alertmanager)
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
      - alert: SlowLLM
        # Can only fire if llm_call_duration_seconds has bucket bounds above 10s.
        expr: histogram_quantile(0.95, sum by (le) (rate(llm_call_duration_seconds_bucket[5m]))) > 10
        for: 10m
Common Pitfalls
Using a Gauge where a Counter belongs: counters only go up, gauges can go down. Token consumption is a counter; “current concurrency” is a gauge.
Recording custom metrics inside the SSE generator: every yielded event would increment a counter, dwarfing real signal. Record once per logical request.
Calling .observe() with the wrong unit: a histogram named *_seconds expects a float number of seconds, so convert milliseconds or timedelta values before observing.
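For the SSE pitfall, one option is to record in a finally block so the metric is touched exactly once per stream, however it ends. The sketch below uses only the standard library; stream_events and on_complete are hypothetical names, and on_complete stands in for a single counter increment or histogram observation.

```python
def stream_events(chunks, on_complete):
    """Yield SSE-formatted events and report once when the stream ends."""
    sent = 0
    try:
        for chunk in chunks:
            sent += 1
            yield f"data: {chunk}\n\n"
    finally:
        # Runs once: on normal exhaustion, on error, or when the
        # generator is closed because the client disconnected.
        on_complete(sent)


calls = []
events = list(stream_events(["a", "b", "c"], calls.append))
print(len(events), calls)  # 3 [3]
```

Three events were streamed, but the callback ran a single time with the total, which is the shape a per-request metric should have.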
Real-World Interview Prep
Q1: Why Histogram for latency instead of Summary?
A: Histograms aggregate across instances. A summary is computed per-instance and cannot be combined across pods. Prometheus’ strength is scraping from many pods and computing a global distribution.
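The aggregation argument can be checked with a few lines of plain Python. The sketch below sums cumulative bucket counts from two pods and estimates a quantile by linear interpolation, in the spirit of histogram_quantile. It is a simplified model with made-up counts, not Prometheus's implementation.

```python
import math


def merge(*pods):
    """Sum cumulative bucket counts from several pods (same bucket bounds)."""
    return {le: sum(p[le] for p in pods) for le in pods[0]}


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets by linear interpolation."""
    bounds = sorted(buckets)  # math.inf sorts last
    total = buckets[bounds[-1]]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if math.isinf(le):
                return prev_bound  # cannot interpolate into +Inf
            return prev_bound + (le - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = le, count


pod_a = {0.5: 90, 1.0: 98, 5.0: 100, math.inf: 100}  # mostly fast
pod_b = {0.5: 10, 1.0: 20, 5.0: 95, math.inf: 100}   # mostly slow

merged_p95 = histogram_quantile(0.95, merge(pod_a, pod_b))
averaged_p95 = (histogram_quantile(0.95, pod_a) + histogram_quantile(0.95, pod_b)) / 2
print(round(merged_p95, 2), round(averaged_p95, 2))
```

The merged estimate (about 4.74s) differs markedly from the average of the per-pod p95 values (about 2.91s), which is all a pre-computed per-instance summary would let you do.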
Q2: How do you measure token spend against budget?
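A: Multiply the growth of llm_tokens_used_total by the per-model cost: sum by (model) (increase(llm_tokens_used_total[1h])) * {model_cost_per_token}, where {model_cost_per_token} is a placeholder for the model's price. increase() gives tokens consumed in the window, whereas rate() gives tokens per second. Alert when the result exceeds the budget for that window.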
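What the counter-based calculation does can be modelled in a few lines. This sketch computes the growth of a counter across scraped samples while tolerating a restart; increase and spend are hypothetical helpers, the price is a made-up number, and Prometheus's real increase() additionally extrapolates to the window edges.

```python
def increase(samples):
    """Total growth of a counter across samples, tolerating resets to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at 0.
        total += cur - prev if cur >= prev else cur
    return total


def spend(samples, cost_per_token):
    """Cost of the tokens consumed over the sampled window."""
    return increase(samples) * cost_per_token


# Scraped values of a token counter; the process restarted before the 4th sample.
samples = [1000, 4000, 9000, 500, 2500]
print(increase(samples))  # 10500.0
print(spend(samples, cost_per_token=0.000001))
```

Reading the last raw value (2500) would understate consumption, which is why the budget query should be built on increase() or rate() rather than the counter's current value.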
Q3: When is the Counter approach wrong?
A: For metrics that can decrease (e.g. “tokens remaining in budget”), use a Gauge. For monotonic quantities (cumulative events), use a Counter.