Prometheus Instrumentation with `prometheus-fastapi-instrumentator`

What? (Concept Overview)

prometheus-fastapi-instrumentator is a library that adds a metrics-collecting middleware to the ASGI stack. The middleware times each request, records the results in Prometheus histograms and counters, and exposes them at /metrics in the Prometheus exposition format. Instrumenting the app before it starts serving ensures every request is timed from the first one.

Project Context

The FCA Support Agent’s app/main.py mounts the Instrumentator INSIDE create_application(), before any route is registered. The configuration deliberately:

Sets should_group_status_codes=False to expose literal status codes (200, 201, 404, 503) as label values instead of the grouped 2xx, 3xx, 4xx, 5xx classes
Sets should_ignore_untemplated=True to skip requests that match no route template (for example, scanners probing random paths), so raw URLs never end up in the handler label
Sets should_instrument_requests_inprogress=True to track concurrent in-flight requests (not just totals)
Excludes /metrics, /health, /docs, /openapi.json from instrumentation so that scrapes, probes, and documentation hits do not pollute the request metrics

How? (Quick Reference Blocks)

3.1 The Instrumentator Configuration Block


# app/main.py — create_application
from prometheus_fastapi_instrumentator import Instrumentator
 
instrumentator = Instrumentator(
    should_group_status_codes=False,          # literal codes (200, 404), not 2xx/4xx
    should_ignore_untemplated=True,           # skip requests matching no route template
    should_instrument_requests_inprogress=True,    # gauge for inflight reqs
    excluded_handlers=["/metrics", "/health", "/docs", "/openapi.json"],
)
instrumentator.instrument(app).expose(app, include_in_schema=False)
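
One detail worth checking in the excluded_handlers list: to my understanding the library treats each entry as a regular expression and searches for it within the handler path. The sketch below assumes that behavior. The helper is_excluded and the route /api/v1/metrics-report are hypothetical; the helper only mimics the assumed matching with the standard library and is not the library's code.

```python
import re


def is_excluded(path: str, patterns: list[str]) -> bool:
    """Return True if any pattern is found anywhere in the path (assumed search semantics)."""
    return any(re.search(pattern, path) for pattern in patterns)


loose = ["/metrics", "/health"]
anchored = [r"^/metrics$", r"^/health$"]

# Hypothetical business route whose path happens to contain "/metrics".
report_path = "/api/v1/metrics-report"

loose_hits_report = is_excluded(report_path, loose)        # substring match excludes it too
anchored_hits_report = is_excluded(report_path, anchored)  # anchored pattern leaves it alone
anchored_hits_metrics = is_excluded("/metrics", anchored)  # the real endpoint is still excluded
```

If the assumption holds for the installed version, anchoring the patterns avoids silently dropping unrelated routes from the metrics.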

3.2 Custom Metric: Per-Agent Token Spend (Langfuse Counter Bridge)

For metrics that the Instrumentator doesn’t auto-derive, define a Counter once at module load and increment inside the agent.


# app/observability/metrics.py (illustrative)
from prometheus_client import Counter, Histogram
 
AGENT_TOKEN_USAGE = Counter(
    "agent_token_usage_total",
    "Tokens consumed per agent per turn",
    labelnames=("agent", "model"),
)
 
AGENT_LATENCY = Histogram(
    "agent_turn_latency_seconds",
    "Wall-clock latency per agent turn",
    labelnames=("agent",),
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
)
 
# Inside the agent:
AGENT_TOKEN_USAGE.labels(agent="intent_classifier", model=settings.groq_model).inc(
    usage.prompt_tokens + usage.completion_tokens
)
AGENT_LATENCY.labels(agent="intent_classifier").observe(elapsed_seconds)

Why? (Parameter Breakdown)

should_group_status_codes=False: With grouping on, you get a single http_requests_total{status="2xx"} series, and operators cannot distinguish 200 from 201 from 204 in alerts. Disabling grouping puts the literal status code into the label.
should_ignore_untemplated=True: Cardinality and noise control. Requests that match a route are labeled with the route template (/api/v1/customers/{customer_id}), not the concrete URL, so templated routes do not multiply series. Requests that match no route (typos, scanners probing random paths) have no template; with this flag set they are not recorded at all.
should_instrument_requests_inprogress=True: Adds the http_requests_inprogress gauge. Essential for “is my service saturated?” alerts; without it you only see completed-request counters.
excluded_handlers=[...]: /metrics is itself an HTTP endpoint; instrumenting it records every Prometheus scrape as application traffic. /health is hit every 5s by Kubernetes and would dominate the time series; excluding it keeps signal-to-noise high. /docs and /openapi.json are dev-only.
include_in_schema=False: Keeps /metrics out of the OpenAPI docs. Polish, but it stops an accidental client from “discovering” the metrics endpoint.
Counter with explicit labelnames: prometheus_client requires label names to be declared when the metric is constructed, and calling .labels() with a missing or misspelled name raises ValueError. Defining the metric once at module load keeps the label set in one place and avoids duplicate-registration errors.
Histogram buckets=(...): prometheus_client's default buckets span 5 ms to 10 s, which suits HTTP latencies. For ML inference latencies (100 ms to 30 s), override them so the buckets are neither too coarse nor too fine for the range that matters.
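
To illustrate the grouping trade-off described for should_group_status_codes, here is a small standard-library sketch. The function status_label is hypothetical and only imitates the labeling idea; the sample status codes are made up.

```python
from collections import Counter


def status_label(status_code: int, group: bool) -> str:
    """Return the label value a request would get with or without grouping."""
    return f"{status_code // 100}xx" if group else str(status_code)


observed = [200, 201, 204, 404, 500, 503]  # made-up responses

grouped = Counter(status_label(code, group=True) for code in observed)
literal = Counter(status_label(code, group=False) for code in observed)
```

Grouping collapses the six responses into three label values, while the literal form keeps one value per status code. That is the extra detail alerts gain, at the cost of a few more series per handler.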

Common Pitfalls

Calling instrument(app) after the app has started. instrument() works by adding middleware, and Starlette refuses to add middleware once the application is serving. Always instantiate and instrument during create_application(); doing so before the app.include_router(...) calls, as this project does, keeps the setup in one predictable place.
High-cardinality label values. NEVER label with customer_id, conversation_id, or any user-supplied value. Unbounded cardinality explodes Prometheus memory and TSDB write cost.

Real-World Interview Prep

Q1: How do you build an SLO alert from these metrics?

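A: Define an SLO such as “99% of /api/v1/messages requests complete in under 500 ms”. Express the latency side as histogram_quantile(0.99, sum by (le)(rate(http_request_duration_seconds_bucket{handler="/api/v1/messages"}[5m]))) < 0.5. The 5m rate window smooths out spikes; alert only when the breach persists across more than four consecutive 5-minute evaluations. The quantile is interpolated between bucket bounds, so the histogram needs a bucket boundary at or near 0.5 for the estimate to be trustworthy. Pair it with an error-rate SLO: sum(rate(http_requests_total{handler="/api/v1/messages",status=~"5.."}[5m])) / sum(rate(http_requests_total{handler="/api/v1/messages"}[5m])) < 0.001. Alert on the burn rate over 1h and 6h windows for fast and slow detection.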
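
The burn-rate idea in this answer can be sketched in a few lines. This assumes the 99.9% availability target implied by the 0.001 error-rate threshold; the function names are hypothetical, and the thresholds 14.4 (1h window) and 6 (6h window) are illustrative values often quoted for a 30-day SLO period, not settings taken from this project.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def burn_alerts(
    ratio_1h: float,
    ratio_6h: float,
    slo_target: float,
    fast_threshold: float = 14.4,  # illustrative threshold for the 1h window
    slow_threshold: float = 6.0,   # illustrative threshold for the 6h window
) -> dict[str, bool]:
    """Evaluate the fast (1h) and slow (6h) burn-rate conditions."""
    return {
        "fast": burn_rate(ratio_1h, slo_target) > fast_threshold,
        "slow": burn_rate(ratio_6h, slo_target) > slow_threshold,
    }


SLO_TARGET = 0.999  # implied by the "< 0.001" error-rate objective

# Made-up measurements: 2% errors over the last hour, 0.4% over the last six hours.
alerts = burn_alerts(ratio_1h=0.02, ratio_6h=0.004, slo_target=SLO_TARGET)
```

With these numbers the fast condition fires (burn rate of about 20) while the slow one does not (about 4), which is the pattern of a recent, sharp regression.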

Q2: Your Prometheus dashboard shows 100x request volume on `/health` but only 1x on `/metrics`. Why?

A: /health is hit by Kubernetes’ liveness probe (typically every 5-10s) and by external uptime checkers. /metrics is hit only by Prometheus (typically every 15-30s). The /health series adds volume but is not a real workload signal, so list it in excluded_handlers; where it is instrumented anyway, restrict dashboards and alerts to real workload with a filter such as http_requests_total{handler!="/health"}.

Q3: How do you migrate from the FastAPI Instrumentator to OpenTelemetry for traces?

A: (1) Add the opentelemetry-instrumentation-fastapi middleware; it records every request as a span carrying HTTP attributes (method, route, status code) comparable to the Instrumentator's labels. (2) Configure an OTLP exporter pointing to Langfuse/Honeycomb. (3) Keep Prometheus for RED metrics (rate, errors, duration) and use OTel only for traces, because they answer different questions. (4) Use exemplars so a slow latency histogram bucket can deep-link to the trace IDs of the slow requests. Migration is incremental; both can run side by side for months.

Top-to-Bottom Code Walkthrough (`app/main.py` — `Instrumentator` config + custom metric emissions)

Beyond the auto-instrumentation, the project adds custom metrics for AI/LangChain-specific signals.

Custom LLM token counter


from prometheus_client import Counter, Histogram
 
LLM_TOKENS_USED = Counter("llm_tokens_used_total", "Tokens consumed by LLM calls", ["model", "agent"])
LLM_LATENCY = Histogram("llm_call_duration_seconds", "LLM call duration", ["model"])

Why a Counter for tokens: tokens are a monotonically increasing quantity. Sum across the lifetime of the service gives the total. Why a Histogram for latency: distributions matter. p50, p95, p99 all tell different stories.

Inside `BaseAgent._execute`

After each Groq call:


LLM_TOKENS_USED.labels(model=self.model_name, agent=self.agent_name).inc(response.usage.total_tokens)
LLM_LATENCY.labels(model=self.model_name).observe(time.time() - start)

Labels are bounded (model is one of ["llama-3.1-8b-instant", "mixtral-8x7b-32768"]) so cardinality stays manageable.
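
A minimal sketch of how the timing around an LLM call could be made failure-safe, so that latency is recorded exactly once even when the call raises. The context manager timed is hypothetical; in the agent, the callback would be something like LLM_LATENCY.labels(model=...).observe, which is replaced here by a plain list so the example runs without external dependencies.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def timed(observe: Callable[[float], None]) -> Iterator[None]:
    """Measure the wrapped block with a monotonic clock and report seconds once."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on success and on failure, so slow failures are also visible.
        observe(time.perf_counter() - start)


samples: list[float] = []

with timed(samples.append):
    time.sleep(0.01)  # stand-in for a successful LLM call

try:
    with timed(samples.append):
        raise RuntimeError("simulated LLM failure")
except RuntimeError:
    pass
```

time.perf_counter() is used here because it is monotonic, whereas time.time() can jump if the system clock is adjusted.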

Why not just use Langfuse for everything?

Langfuse gives per-trace detail; Prometheus gives aggregate insight with rule-based alerting. They answer different questions, so the project combines them:

Prometheus alerts: “p95 latency > 3s for 5 min”, which fires when degradation is broad.
Langfuse dashboard: “Why was conversation ABC-123 slow?”, which gives drill-down detail.

`_in_progress` gauge

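should_instrument_requests_inprogress=True enables http_requests_inprogress, a Prometheus gauge that tracks how many requests are currently being served. Useful for “are we backed up?” alerts. Because a gauge reports a current level rather than a running total, alert on its value (optionally smoothed with avg_over_time) instead of wrapping it in rate():


sum(http_requests_inprogress) > 100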
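
To make the gauge semantics concrete, here is a minimal standard-library sketch of an in-flight counter. InFlightGauge is a hypothetical stand-in, not the library's implementation; it only shows that the value rises when a request starts and falls when it ends, which is why the level itself is the quantity to alert on.

```python
from contextlib import contextmanager
from typing import Iterator


class InFlightGauge:
    """Tracks how many units of work are currently running."""

    def __init__(self) -> None:
        self.value = 0

    @contextmanager
    def track(self) -> Iterator[None]:
        self.value += 1          # request entered
        try:
            yield
        finally:
            self.value -= 1      # request finished, successfully or not


gauge = InFlightGauge()
levels: list[int] = []

with gauge.track():
    with gauge.track():
        levels.append(gauge.value)  # two overlapping requests
    levels.append(gauge.value)      # one still running
levels.append(gauge.value)          # idle again
```

A counter would only ever grow in this scenario; the gauge returns to zero once the work drains, so a sustained high reading indicates a backlog.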

Histogram buckets

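prometheus_client's default histogram buckets are [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10] seconds. That is fine for typical request/response APIs, but LLM calls can take 30 seconds, and anything above the top bucket is indistinguishable. In recent releases of the library, buckets are configured on the metric function passed to add() rather than on the Instrumentator constructor (verify against the installed version). Override:


from prometheus_fastapi_instrumentator import Instrumentator, metrics

# Buckets are set on the metric function, not on the constructor.
instrumentator = Instrumentator().add(
    metrics.default(latency_lowr_buckets=(0.1, 0.5, 1, 2.5, 5, 10, 30, 60))
)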
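
Why the top bucket matters can be shown with a simplified, standard-library version of bucket-based quantile estimation. The sketch follows the linear-interpolation approach that histogram_quantile is documented to use, but it is a simplification, not Prometheus's implementation. The function names and the latency samples are made up for illustration.

```python
import math
from bisect import bisect_left


def cumulative_buckets(observations, bounds):
    """Return (edges, cumulative counts) for `le`-style buckets, with +Inf appended."""
    edges = list(bounds) + [math.inf]
    counts = [0] * len(edges)
    for value in observations:
        counts[bisect_left(edges, value)] += 1  # first bucket whose bound is >= value
    running, cumulative = 0, []
    for count in counts:
        running += count
        cumulative.append(running)
    return edges, cumulative


def estimate_quantile(q, edges, cumulative):
    """Simplified quantile estimate by linear interpolation inside the target bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    index = next(i for i, count in enumerate(cumulative) if count >= rank)
    if math.isinf(edges[index]):
        return edges[-2]  # rank falls in +Inf: clamp to the highest finite bound
    lower = edges[index - 1] if index > 0 else 0.0
    below = cumulative[index - 1] if index > 0 else 0
    return lower + (edges[index] - lower) * (rank - below) / (cumulative[index] - below)


latencies = [0.2, 0.4, 0.8, 12.0, 25.0, 28.0, 29.0, 30.0, 31.0, 45.0]  # made-up LLM durations
narrow = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10)
wide = (0.1, 0.5, 1, 2.5, 5, 10, 30, 60)

p95_narrow = estimate_quantile(0.95, *cumulative_buckets(latencies, narrow))
p95_wide = estimate_quantile(0.95, *cumulative_buckets(latencies, wide))
```

With the narrow layout the estimate is pinned at 10 seconds no matter how slow the calls get, while the wider layout yields 52.5 seconds for the same data. An alert such as "p95 > 10" could never fire with the narrow buckets.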

Labels you should NEVER use

Unbounded-cardinality labels:

customer_id
conversation_id
email
Any PII field

A single customer’s 1000 chat sessions would generate 1000 label combinations. Multiply by N customers and Prometheus runs out of memory.
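
The arithmetic behind the warning is simple enough to sketch. The function series_count and all cardinalities below are hypothetical; the per-histogram figure assumes 7 finite buckets plus the +Inf bucket, _sum, and _count, and ignores any extra bookkeeping series a client may export.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int], series_per_label_set: int = 1) -> int:
    """Worst-case number of time series: every label combination times series per combination."""
    return prod(label_cardinalities.values()) * series_per_label_set


# 7 finite buckets + the +Inf bucket + _sum + _count
HISTOGRAM_SERIES = 7 + 1 + 2

bounded_counter = series_count({"agent": 5, "model": 2})
bounded_histogram = series_count({"agent": 5}, HISTOGRAM_SERIES)
unbounded_histogram = series_count(
    {"agent": 5, "conversation_id": 1_000_000}, HISTOGRAM_SERIES
)
```

Bounded labels keep the totals at 10 and 50 series, while a single conversation_id label pushes the same histogram to 50 million series.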

Prometheus alerting rules (sample)


groups:
- name: api
  rules:
  - alert: HighErrorRate
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m
  - alert: SlowLLM
    expr: histogram_quantile(0.95, sum by (le) (rate(llm_call_duration_seconds_bucket[5m]))) > 10
    for: 10m

Common Pitfalls

Using a Gauge where a Counter belongs — counters only go up, gauges can go down. Token consumption is a counter; “current concurrency” is a gauge.

Recording custom metrics inside the SSE generator — every yielded event would increment a counter, dwarfing real signal. Record once per logical request.

Passing .observe() a value in the wrong unit. Histogram expects a duration in seconds as a float, so convert milliseconds (or a timedelta) before observing.
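
The SSE pitfall above can be avoided by recording in a finally block that runs once when the stream ends. The sketch below is a hedged illustration using only the standard library: sse_stream and on_complete are hypothetical names, and in the real service on_complete would update the Prometheus metrics instead of appending to a list.

```python
import asyncio
from typing import AsyncIterator, Callable, Iterable


async def sse_stream(
    chunks: Iterable[str], on_complete: Callable[[int], None]
) -> AsyncIterator[str]:
    """Yield SSE-formatted events and report once, after the stream finishes."""
    sent = 0
    try:
        for chunk in chunks:
            sent += 1
            yield f"data: {chunk}\n\n"  # no metric update per event
    finally:
        on_complete(sent)  # one record per logical request


async def _demo() -> tuple[list[str], list[int]]:
    completions: list[int] = []
    events = [event async for event in sse_stream(["a", "b", "c"], completions.append)]
    return events, completions


events, completions = asyncio.run(_demo())
```

Three events are streamed, yet the callback fires a single time with the event count, so request-level counters stay comparable with non-streaming endpoints.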

Real-World Interview Prep

Q1: Why `Histogram` for latency instead of `Summary`?

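A: Histogram buckets are plain counters, so they can be summed across instances and a global quantile computed at query time with histogram_quantile(). A Summary computes its quantiles inside each process, and precomputed quantiles from different pods cannot be meaningfully averaged or merged. Prometheus’ strength is scraping many pods and computing one global distribution.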
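
A small numeric sketch of why per-pod quantiles do not combine while bucket counts do. The data is made up, the helper names are hypothetical, and nearest_rank is a simple textbook quantile, not what either metric type computes internally.

```python
import math
from bisect import bisect_left


def nearest_rank(values, q):
    """Nearest-rank quantile of a list of samples."""
    ordered = sorted(values)
    return ordered[math.ceil(q * len(ordered)) - 1]


def bucket_counts(values, bounds):
    """Per-bucket (non-cumulative) counts, with a final +Inf bucket."""
    edges = list(bounds) + [math.inf]
    counts = [0] * len(edges)
    for value in values:
        counts[bisect_left(edges, value)] += 1
    return counts


pod_a = [0.1] * 100  # busy pod, fast responses
pod_b = [5.0] * 10   # quiet pod, slow responses

# Summary-style: each pod reports its own p95, then someone averages them.
averaged_p95 = (nearest_rank(pod_a, 0.95) + nearest_rank(pod_b, 0.95)) / 2
global_p95 = nearest_rank(pod_a + pod_b, 0.95)

# Histogram-style: bucket counts from each pod are simply added.
bounds = (0.5, 1, 10)
merged = [a + b for a, b in zip(bucket_counts(pod_a, bounds), bucket_counts(pod_b, bounds))]
same_as_global = merged == bucket_counts(pod_a + pod_b, bounds)
```

Averaging the two per-pod p95 values gives 2.55 seconds, while the true fleet-wide p95 is 5.0 seconds. The summed bucket counts, by contrast, are identical to the buckets of the combined data.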

Q2: How do you measure token spend against budget?

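A: Multiply token usage by a per-model price and compare the result against the budget: sum by (model) (increase(llm_tokens_used_total[1h])) * <cost per token for that model>. increase() gives the tokens consumed in the window, whereas rate() would give tokens per second. Alert when the result exceeds the hourly budget.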
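
The same budget check, written out as plain arithmetic. All prices, token counts, and the budget below are placeholders chosen for the example, not real pricing; the function spend is hypothetical.

```python
def spend(tokens_by_model: dict[str, int], usd_per_million: dict[str, float]) -> float:
    """Total cost in USD for the given token counts."""
    return sum(
        tokens * usd_per_million[model] / 1_000_000
        for model, tokens in tokens_by_model.items()
    )


# Placeholder prices per one million tokens (assumed, not real pricing).
USD_PER_MILLION_TOKENS = {
    "llama-3.1-8b-instant": 0.05,
    "mixtral-8x7b-32768": 0.24,
}

# Stand-in for increase(llm_tokens_used_total[1h]) summed by model.
tokens_last_hour = {
    "llama-3.1-8b-instant": 2_000_000,
    "mixtral-8x7b-32768": 500_000,
}

HOURLY_BUDGET_USD = 0.20  # hypothetical budget

hourly_spend = spend(tokens_last_hour, USD_PER_MILLION_TOKENS)
over_budget = hourly_spend > HOURLY_BUDGET_USD
```

Because prices differ per model, the multiplication has to happen per model before summing; a single blended price would misstate the cost whenever the traffic mix shifts.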

Q3: When is the `Counter` approach wrong?

A: For metrics that can decrease (e.g. “tokens remaining in budget”), use a Gauge. For monotonic quantities (cumulative events), use a Counter.
