Prometheus Instrumentation with prometheus-fastapi-instrumentator

What? (Concept Overview)

prometheus-fastapi-instrumentator wraps a FastAPI app in an ASGI middleware that times every request, resolves the handler label by matching each request against the registered route templates, and exposes the resulting histograms and counters at /metrics (Prometheus exposition format). Calling instrument() BEFORE the application starts serving ensures every request is timed from the first one.

Project Context

The FCA Support Agent’s app/main.py mounts the Instrumentator INSIDE create_application(), before any route is registered. The configuration deliberately:

  • Sets should_group_status_codes=False so the status label carries the exact code (200, 201, 404, 503) instead of the grouped 2xx, 3xx, 4xx, 5xx classes
  • Sets should_ignore_untemplated=True to drop requests that match no route template (for example scanners probing random paths), so raw URLs never enter the handler label set; matched routes are always labelled by their template, such as /api/v1/customers/{customer_id}
  • Sets should_instrument_requests_inprogress=True to track concurrent in-flight requests (not just totals)
  • Excludes /metrics, /health, /docs, /openapi.json from instrumentation so that scrapes, probes, and docs traffic do not pollute the request metrics

How? (Quick Reference Blocks)

3.1 The Instrumentator Configuration Block

    # app/main.py, inside create_application()
    from prometheus_fastapi_instrumentator import Instrumentator

    instrumentator = Instrumentator(
        should_group_status_codes=False,             # exact codes (200, 404, ...) instead of 2xx/4xx
        should_ignore_untemplated=True,              # skip requests that match no route template
        should_instrument_requests_inprogress=True,  # gauge for in-flight requests
        excluded_handlers=["/metrics", "/health", "/docs", "/openapi.json"],
    )
    instrumentator.instrument(app).expose(app, include_in_schema=False)

3.2 Custom Metric: Per-Agent Token Spend (Langfuse Counter Bridge)

For metrics that the Instrumentator doesn’t auto-derive, define a Counter once at module load and increment inside the agent.

    # app/observability/metrics.py (illustrative)
    from prometheus_client import Counter, Histogram

    AGENT_TOKEN_USAGE = Counter(
        "agent_token_usage_total",
        "Tokens consumed per agent per turn",
        labelnames=("agent", "model"),
    )
    AGENT_LATENCY = Histogram(
        "agent_turn_latency_seconds",
        "Wall-clock latency per agent turn",
        labelnames=("agent",),
        buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
    )

    # Inside the agent:
    AGENT_TOKEN_USAGE.labels(agent="intent_classifier", model=settings.groq_model).inc(
        usage.prompt_tokens + usage.completion_tokens
    )
    AGENT_LATENCY.labels(agent="intent_classifier").observe(elapsed_seconds)

Why? (Parameter Breakdown)

  • should_group_status_codes=False: With grouping left on (the default), you get a single http_requests_total{status="2xx"} series and operators can NOT distinguish 200 from 201 from 204 in alerts. Disabling grouping puts the literal status code into the label.
  • should_ignore_untemplated=True: Cardinality and noise control. A matched route such as /api/v1/customers/{customer_id} is always labelled with its template, never with the concrete customer ID. Requests that match no template (typos, scanners probing random paths) are the untemplated ones; with this flag they are skipped entirely instead of being recorded.
  • should_instrument_requests_inprogress=True: Adds the http_requests_inprogress gauge. Essential for “is my service saturated?” alerts; without it you only see completed-request counters.
  • excluded_handlers=[...]: /metrics is itself an HTTP endpoint; instrumenting it records one noise request per scrape. /health is hit every 5s by Kubernetes and would dominate the time series; excluding it keeps signal-to-noise high. /docs and /openapi.json are dev-only.
  • include_in_schema=False: Keeps /metrics out of the OpenAPI docs. A polish item, but it stops a client from accidentally “discovering” the metrics endpoint.
  • Counter with explicit labelnames: Always declare label names up front. prometheus_client fixes the label set at construction and raises an error if a later .labels() call passes different names, so a typo fails loudly instead of silently creating a new series.
  • Histogram buckets=(...): The default buckets suit HTTP latencies (5ms-10s). For ML inference latencies (100ms-30s), override them so that buckets are neither too coarse in the range you care about nor wasted where no samples fall.
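
The effect of the first two flags on label values can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not the library's implementation: status_label, handler_label, the template-to-regex conversion, and the shared "none" label for unmatched requests are all illustrative assumptions.

```python
import re
from typing import Optional


def status_label(code: int, group: bool) -> str:
    """Return the status label: grouped ('2xx') or the exact code ('201')."""
    return f"{code // 100}xx" if group else str(code)


def handler_label(path: str, templates: list[str], ignore_untemplated: bool) -> Optional[str]:
    """Return the route template matching `path`, or None if the request is dropped."""
    for template in templates:
        # Turn '/customers/{customer_id}' into the regex '/customers/[^/]+'.
        pattern = re.sub(r"\{[^/}]+\}", "[^/]+", template)
        if re.fullmatch(pattern, path):
            return template
    # No template matched: skip the request, or (assumed) put it under one shared label.
    return None if ignore_untemplated else "none"
```

A request to /api/v1/customers/42 is therefore labelled with the template regardless of the flag, while a probe for an unknown path is the only kind of request the flag removes.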

Common Pitfalls

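  1. Calling instrument(app) AFTER the app has started. instrument() adds an ASGI middleware, and Starlette refuses to add middleware once the application is serving. The handler label is resolved per request by matching against the route table, so routes registered after instrument() are still timed. Always instantiate + instrument during create_application(), as the project does BEFORE the app.include_router(...) calls.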
  2. High-cardinality label values. NEVER label with customer_id, conversation_id, or any user-supplied value. Cardinality explodes Prometheus memory and TSDB write cost.

Real-World Interview Prep

Q1: How do you build an SLO alert from these metrics?

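A: Define an SLO like “99% of /api/v1/messages requests return in < 500ms”. Express it as histogram_quantile(0.99, sum by (le)(rate(http_request_duration_seconds_bucket{handler="/api/v1/messages"}[5m]))) < 0.5, which is only accurate if 0.5 is a bucket boundary. The 5m rate window smooths out spikes; alert only when the breach persists, for example with a for: clause of about 20 minutes (more than four consecutive 5-minute windows). Pair it with an error-rate SLO: sum(rate(http_requests_total{handler="/api/v1/messages",status=~"5.."}[5m])) / sum(rate(http_requests_total{handler="/api/v1/messages"}[5m])) < 0.001. The sum() on both sides is needed so the division is not matched series by series. Alert on the burn rate over a short and a long window (1h and 6h) for fast + slow detection.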
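
The burn-rate idea at the end of that answer reduces to one division. The sketch below assumes a 99.9% SLO to match the 0.001 error-rate target; the 14.4 and 6.0 thresholds are commonly cited multi-window values, used here as assumptions rather than anything the project configures, and both function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget


def should_page(
    ratio_1h: float,
    ratio_6h: float,
    slo_target: float = 0.999,
    fast_threshold: float = 14.4,  # assumed threshold for the 1h window
    slow_threshold: float = 6.0,   # assumed threshold for the 6h window
) -> bool:
    """Page when either the fast (1h) or the slow (6h) window burns too quickly."""
    return (
        burn_rate(ratio_1h, slo_target) >= fast_threshold
        or burn_rate(ratio_6h, slo_target) >= slow_threshold
    )
```

A burn rate of 1 means the budget lasts exactly the SLO period; the short window catches sudden outages, the long one catches slow leaks.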

Q2: Your Prometheus dashboard shows 100x request volume on /health but only 1x on /metrics. Why?

A: /health is hit by Kubernetes’ liveness probe (typically every 5-10s) and by external uptime checkers. /metrics is hit by Prometheus (typically every 15-30s). The /health series dominates the noise but isn’t a real workload signal. Always add /health to excluded_handlers, AND restrict dashboards and alerts to real workload by filtering at query time (http_requests_total{handler!="/health"}).

Q3: How do you migrate from the FastAPI Instrumentator to OpenTelemetry for traces?

A: (1) Add the opentelemetry-instrumentation-fastapi middleware; it records every request as a span whose attributes (route, method, status code) mirror the Instrumentator’s metric labels. (2) Configure an OTLP exporter pointing to Langfuse/Honeycomb. (3) Keep Prometheus for RED metrics (rate, errors, duration) and use OTel only for traces; they answer different questions. (4) Attach exemplars so a slow latency histogram bucket can deep-link to the trace IDs of the slowest requests. Migration is incremental; both can run side by side for months.

Top-to-Bottom Code Walkthrough (app/main.py: Instrumentator config + custom metric emissions)

Beyond the auto-instrumentation, the project adds custom metrics for AI/LangChain-specific signals.

Custom LLM token counter

    from prometheus_client import Counter, Histogram

    LLM_TOKENS_USED = Counter(
        "llm_tokens_used_total",
        "Tokens consumed by LLM calls",
        ["model", "agent"],
    )
    LLM_LATENCY = Histogram("llm_call_duration_seconds", "LLM call duration", ["model"])

Why a Counter for tokens: tokens are a monotonically increasing quantity. Sum across the lifetime of the service gives the total. Why a Histogram for latency: distributions matter. p50, p95, p99 all tell different stories.

Inside BaseAgent._execute

After each Groq call:

    LLM_TOKENS_USED.labels(model=self.model_name, agent=self.agent_name).inc(
        response.usage.total_tokens
    )
    LLM_LATENCY.labels(model=self.model_name).observe(time.time() - start)
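
One way to keep the timing from being skipped when the Groq call raises is to wrap it in a small context manager. The helper below is a hypothetical sketch (timed is not part of the project or of prometheus_client); it takes any callable that accepts a float, which in this codebase could be LLM_LATENCY.labels(model=...).observe.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed(observe: Callable[[float], None]) -> Iterator[None]:
    """Measure the wrapped block in seconds and pass the result to `observe`."""
    start = time.perf_counter()  # monotonic, unaffected by wall-clock adjustments
    try:
        yield
    finally:
        # Runs on success and on exceptions, so failed calls are timed too.
        observe(time.perf_counter() - start)
```

Usage would look like `with timed(LLM_LATENCY.labels(model=self.model_name).observe): ...` around the call, which removes the need to carry a start variable by hand.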

Labels are bounded (model is one of ["llama-3.1-8b-instant", "mixtral-8x7b-32768"]) so cardinality stays manageable.

Why not just use Langfuse for everything?

Langfuse gives per-trace detail; Prometheus gives aggregate insight with reliable alerts. They answer different questions. Combining them means:

  • Prometheus alerts: “p95 latency > 3s for 5 min”, which fires when degradation is broad.
  • Langfuse dashboard: “Why was conversation ABC-123 slow?”, for drill-down detail.

_in_progress gauge

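should_instrument_requests_inprogress=True enables http_requests_inprogress, a Prometheus gauge that tracks how many requests are currently being served. Useful for “are we backed up?” alerts. Because it is a gauge, query its value (or its average over a window) rather than applying rate(), which is only meaningful for counters:

    sum(avg_over_time(http_requests_inprogress[5m])) > 100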
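
What an in-progress gauge measures can be shown with a stand-in class. This is a minimal sketch of the semantics only; InFlight is a hypothetical name, and in real code prometheus_client's Gauge provides track_inprogress() for the same purpose.

```python
from contextlib import contextmanager
from typing import Iterator


class InFlight:
    """Minimal stand-in for a gauge: the value can rise and fall."""

    def __init__(self) -> None:
        self.value = 0

    @contextmanager
    def track(self) -> Iterator[None]:
        self.value += 1  # request entered
        try:
            yield
        finally:
            self.value -= 1  # request left, even if it raised
```

The value returns to zero when the service is idle, which is why a per-second rate of it carries no meaning while its current level does.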

Histogram buckets

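The list [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds is the prometheus_client Histogram default. Recent releases of prometheus-fastapi-instrumentator ship their own defaults for the built-in latency metrics, including a coarse (0.1, 0.5, 1) set for the handler-labelled http_request_duration_seconds. Either way the defaults target fast request/response APIs, while LLM calls can take 30 seconds. The Instrumentator constructor does not take a buckets argument; override the buckets in the instrument() call from 3.1 instead (parameter name as of the 6.x releases; check the installed version):

    instrumentator.instrument(
        app,
        latency_lowr_buckets=(0.1, 0.5, 1, 2.5, 5, 10, 30, 60),
    )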
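
Why the upper buckets matter can be checked by sorting a few sample latencies into each bucket set. The sketch below is plain Python, not library code; bucket_counts and the sample values are illustrative assumptions.

```python
from bisect import bisect_left
from typing import Sequence

DEFAULT_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)
LLM_BUCKETS = (0.1, 0.5, 1, 2.5, 5, 10, 30, 60)


def bucket_counts(latencies: Sequence[float], buckets: Sequence[float]) -> list[int]:
    """Non-cumulative count per bucket; the last slot is the +Inf overflow."""
    counts = [0] * (len(buckets) + 1)
    for seconds in latencies:
        # Index of the first bucket whose upper bound is >= the sample.
        counts[bisect_left(buckets, seconds)] += 1
    return counts


# Hypothetical LLM call durations in seconds.
samples = [0.2, 1.5, 12.0, 25.0, 45.0]
default_overflow = bucket_counts(samples, DEFAULT_BUCKETS)[-1]
llm_overflow = bucket_counts(samples, LLM_BUCKETS)[-1]
```

Every sample that lands in the overflow slot is indistinguishable from any other slow call, so quantiles computed from such a histogram cannot exceed the largest finite bound.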

Labels you should NEVER use

Unbounded-cardinality labels:

  • customer_id
  • conversation_id
  • email
  • Any PII field

A single customer’s 1000 chat sessions would generate 1000 label combinations. Multiply by N customers = OOM.
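
The arithmetic behind that warning is a simple product. The function below is a rough estimate under stated assumptions (one series per finite bucket, one +Inf bucket, plus _sum and _count per label combination); histogram_series and the example cardinalities are hypothetical.

```python
from math import prod


def histogram_series(label_cardinalities: dict[str, int], n_buckets: int) -> int:
    """Approximate series count for one histogram metric.

    Each label combination gets n_buckets finite buckets, one +Inf bucket,
    plus the _sum and _count series.
    """
    combinations = prod(label_cardinalities.values())
    return combinations * (n_buckets + 1 + 2)


bounded = histogram_series({"agent": 5, "model": 2}, n_buckets=7)
unbounded = histogram_series(
    {"agent": 5, "model": 2, "conversation_id": 100_000}, n_buckets=7
)
```

With bounded labels the metric stays at a hundred series; adding a conversation_id label multiplies that by the number of conversations ever seen.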

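Prometheus alerting rules (sample)

    groups:
      - name: api
        rules:
          - alert: HighErrorRate
            # The Instrumentator's default counter uses the "status" label.
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
            for: 5m
          - alert: SlowLLM
            # Needs buckets above 10s on llm_call_duration_seconds; with the
            # default buckets the estimated quantile is capped at 10.
            expr: histogram_quantile(0.95, sum by (le) (rate(llm_call_duration_seconds_bucket[5m]))) > 10
            for: 10m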

Common Pitfalls

Using a Gauge where a Counter belongs — counters only go up, gauges can go down. Token consumption is a counter; “current concurrency” is a gauge.

Recording custom metrics inside the SSE generator — every yielded event would increment a counter, dwarfing real signal. Record once per logical request.

Forgetting to call .observe() with a float in seconds. The histogram’s name and buckets assume seconds, so pass elapsed seconds as a float, not milliseconds or a timedelta.
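
For the SSE pitfall above, one pattern is to keep a local tally while streaming and report once when the stream ends. This is a sketch under assumptions: stream_events and on_complete are hypothetical names, and on_complete stands in for whatever increments the real counter.

```python
from typing import Callable, Iterable, Iterator


def stream_events(
    chunks: Iterable[str], on_complete: Callable[[int], None]
) -> Iterator[str]:
    """Yield SSE-formatted events and report once when the stream ends."""
    sent = 0
    try:
        for chunk in chunks:
            sent += 1  # local tally only, no metric call per event
            yield f"data: {chunk}\n\n"
    finally:
        # Runs once: on normal completion or when the generator is closed early.
        on_complete(sent)
```

The finally block also covers a client that disconnects mid-stream, since closing a started generator unwinds through it.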

Real-World Interview Prep

Q1: Why Histogram for latency instead of Summary?

A: Histograms aggregate across instances. A summary is computed per-instance and cannot be combined across pods. Prometheus’ strength is scraping from many pods and computing a global distribution.
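
The aggregation argument can be made concrete: bucket counts from two pods add up, and a quantile is then estimated from the merged counts. The sketch below interpolates linearly inside the target bucket, which is the idea behind PromQL's histogram_quantile with edge cases omitted; the function names and the pod data are illustrative assumptions.

```python
Buckets = list[tuple[float, float]]  # cumulative (upper_bound, count) pairs


def histogram_quantile(q: float, buckets: Buckets) -> float:
    """Estimate quantile q (0 < q <= 1) from cumulative bucket counts."""
    total = buckets[-1][1]
    if not 0 < q <= 1 or total <= 0:
        raise ValueError("need 0 < q <= 1 and at least one observation")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def merge(a: Buckets, b: Buckets) -> Buckets:
    """Sum cumulative counts of two pods that share the same bucket bounds."""
    return [(bound, ca + cb) for (bound, ca), (_, cb) in zip(a, b)]


INF = float("inf")
pod_fast = [(0.5, 90), (1.0, 100), (5.0, 100), (INF, 100)]
pod_slow = [(0.5, 0), (1.0, 10), (5.0, 100), (INF, 100)]

global_p95 = histogram_quantile(0.95, merge(pod_fast, pod_slow))
naive_p95 = (histogram_quantile(0.95, pod_fast) + histogram_quantile(0.95, pod_slow)) / 2
```

Averaging the two per-pod p95 values, which is all a per-instance summary would allow, gives a noticeably lower figure than the p95 of the merged distribution.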

Q2: How do you measure token spend against budget?

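A: Multiply the hourly token increase by the per-model cost: sum by (model) (increase(llm_tokens_used_total[1h])) * {model_cost_per_token}, where {model_cost_per_token} is a placeholder for each model’s price. increase() yields the tokens consumed in the window, whereas rate() would give tokens per second. Alert when the result exceeds the hourly budget.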
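
The same calculation can be sketched outside PromQL to show how counter resets are handled. This is a simplified model (no extrapolation, unlike PromQL's increase()); the function names, model names, and prices are hypothetical.

```python
def increase(samples: list[float]) -> float:
    """Total growth of a counter across ordered samples, tolerating resets.

    A drop between samples means the process restarted and the counter began
    again from zero, so the new value itself is the growth for that step.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def spend(tokens_by_model: dict[str, float], price_per_token: dict[str, float]) -> float:
    """Cost of the window: tokens consumed times the per-token price, summed over models."""
    return sum(
        tokens * price_per_token[model] for model, tokens in tokens_by_model.items()
    )
```

Because a restart only ever resets a counter to zero, the growth across it is still recoverable, which is one reason token usage is modelled as a counter in the first place.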

Q3: When is the Counter approach wrong?

A: For metrics that can decrease (e.g. “tokens remaining in budget”), use a Gauge. For monotonic quantities (cumulative events), use a Counter.
