Prometheus + Langfuse Observability

What

A two-pillar observability stack for the FCA app: prometheus-fastapi-instrumentator exposes /metrics with request metrics (labelled per status code, in-progress-aware, with high-frequency probe endpoints explicitly excluded), and Langfuse traces LangChain / LangGraph calls through a callback handler, recording every LLM span with cost and latency data.

Project Context

In full_project_context_updated.txt -> app/main.py, the Instrumentator(...) call runs BEFORE the app starts serving, so metrics are recorded from the very first request. excluded_handlers removes /metrics itself, /health, /docs, and /openapi.json from the histogram so that scrape and probe traffic does not dominate. Langfuse keys (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST) flow through app/config.py::Settings and into Langfuse's CallbackHandler for LangChain/LangGraph, which traces ChatGroq calls, pgvector similarity searches, and the multi-agent graph transitions.

How

Prometheus Instrumentator mounted before lifespan

from prometheus_fastapi_instrumentator import Instrumentator

instrumentator = Instrumentator(
    should_group_status_codes=False,
    should_ignore_untemplated=True,
    should_instrument_requests_inprogress=True,
    excluded_handlers=["/metrics", "/health", "/docs", "/openapi.json"],
)
instrumentator.instrument(app).expose(app, include_in_schema=False)
  • should_group_status_codes=False keeps each status code as its own label value (200, 404, 500) instead of folding them into 2xx/3xx/4xx/5xx classes, so SREs can alert on a specific code such as 503.
  • should_ignore_untemplated=True drops requests that match no registered route template (for example scanner traffic hitting random paths), so they never show up as handler labels.
  • should_instrument_requests_inprogress=True exposes the in-progress gauge, which is useful for capacity planning when the SSE endpoints start piling up.
  • excluded_handlers is strongly recommended; otherwise Prometheus's own scrape of /metrics and the health probes are recorded with sub-millisecond latency and dominate the samples. The entries are compiled as regular expressions, not literal paths.
  • include_in_schema=False keeps /metrics out of the OpenAPI doc so a casual /docs probe does not learn the scrape URL.
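Because the excluded_handlers entries are regular expressions, an unanchored entry can hide more routes than intended. The following standard-library sketch illustrates the effect; the helper is_excluded is hypothetical, and the unanchored search it uses is an assumption about how the library matches, not a copy of its code.

```python
import re

# Hypothetical helper: mimics regex-based exclusion with an unanchored search.
def is_excluded(path: str, patterns: list[str]) -> bool:
    return any(re.search(pattern, path) for pattern in patterns)

loose = ["/metrics", "/health"]
strict = [r"^/metrics$", r"^/health$"]

# An unanchored pattern also hides unrelated routes that contain the text.
print(is_excluded("/api/v1/health-records", loose))   # True
print(is_excluded("/api/v1/health-records", strict))  # False
print(is_excluded("/health", strict))                 # True
```

If a business route happens to contain one of the excluded strings, anchoring the patterns with ^ and $ is a cheap safeguard.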

Langfuse wired into LangChain via callback handler

from langfuse.callback import CallbackHandler  # Langfuse SDK v2 import path

from app.config import settings  # assumed location of the Settings instance


class _TracingEnabled:
    """Truthy only when both Langfuse keys are configured."""

    def __bool__(self):
        return bool(
            settings.langfuse_public_key and settings.langfuse_secret_key
        )


tracing_enabled = _TracingEnabled()

langfuse_handler = (
    CallbackHandler(
        public_key=settings.langfuse_public_key,
        secret_key=settings.langfuse_secret_key,
        host=settings.langfuse_host,
    )
    if tracing_enabled
    else None
)

# inside any LangChain call (never pass None as a callback):
# callbacks = [langfuse_handler] if langfuse_handler else []
# chain.invoke(inputs, config={"callbacks": callbacks})
  • The wrapper class with __bool__ lets if tracing_enabled: read cleanly without leaking secrets into logs.
  • Langfuse keys are Optional in Settings; the CallbackHandler only instantiates if both are present so dev runs do not crash on missing keys.
  • config={"callbacks": [...]} is the standard LangChain Runnable contract; callbacks passed this way propagate to child runs, so LangGraph threads the same handler through every node automatically.
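One way to keep the "handler only when both keys are present" rule in a single place is to build the callbacks list through a small helper. This is a minimal sketch, not the app's actual code: the Settings dataclass and build_callbacks are hypothetical stand-ins, and dict replaces Langfuse's CallbackHandler so the example runs without the SDK or network access.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Settings:  # hypothetical stand-in for app/config.py::Settings
    langfuse_public_key: Optional[str] = None
    langfuse_secret_key: Optional[str] = None
    langfuse_host: Optional[str] = None

def build_callbacks(settings: Settings, factory: Callable[..., Any]) -> list:
    """Return [handler] when both keys are set, otherwise an empty list."""
    if not (settings.langfuse_public_key and settings.langfuse_secret_key):
        return []
    return [factory(
        public_key=settings.langfuse_public_key,
        secret_key=settings.langfuse_secret_key,
        host=settings.langfuse_host,
    )]

# dict stands in for the real handler class so the sketch runs offline.
print(build_callbacks(Settings(), dict))  # []
callbacks = build_callbacks(Settings("pk", "sk", "https://example.invalid"), dict)
print(len(callbacks))  # 1
```

Returning an empty list rather than None means call sites can always pass config={"callbacks": callbacks} without a conditional.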

Common Pitfalls

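Calling Instrumentator.instrument(...) AFTER the application has started is too late: instrument() registers middleware, and Starlette raises a RuntimeError when middleware is added to a running app, so no request gets measured. Always call instrument(app) at module level, right after the FastAPI app is created and before it starts serving.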

Setting LANGFUSE_HOST in .env but forgetting the https:// prefix makes the SDK silently fall back to the cloud host, sending your data to a different region than expected. Always include the full URL with scheme.
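A startup check can catch a scheme-less host before any trace is sent. The sketch below uses only the standard library; validate_langfuse_host is a hypothetical helper and the host name is a placeholder.

```python
from urllib.parse import urlparse

def validate_langfuse_host(value: str) -> str:
    """Return the host unchanged if it is a full http(s) URL, else raise."""
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"LANGFUSE_HOST must include a scheme, got {value!r}")
    return value

print(validate_langfuse_host("https://langfuse.example.internal"))
try:
    validate_langfuse_host("langfuse.example.internal")
except ValueError as exc:
    print(exc)
```

Failing fast at configuration time is usually preferable to discovering the wrong destination after data has already left the process.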

Logging LLM payloads to the console before Langfuse wiring leaks PII to the standard logger — defer the logging wiring until after the security controls run.

Real-World Interview Prep

Q1: Prometheus vs OpenTelemetry for FastAPI — when to choose which?

A: Choose Prometheus for operational metrics (request rate, latency percentiles, error rates): RED metrics are first-class, PromQL's histogram_quantile function is the standard building block for latency SLOs, and the storage is optimised for time-series at scale. Choose OpenTelemetry for distributed traces (single request → 12 spans across services, with parent-child relationships and span events). Use BOTH: Prometheus for “is my service healthy” + SLO compliance, OTel for “where did this slow request spend its time”. The Instrumentator (Prometheus) handles one service in isolation; OTel handles cross-service latency. For the FCA app, both mount alongside: Instrumentator for /metrics, Langfuse for LLM spans.
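To make the percentile part of that answer concrete, here is a simplified sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. It is an illustration under stated assumptions (sorted buckets ending in +Inf, made-up counts) and ignores PromQL's other edge cases.

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # fall back to the highest finite bound
            span = count - lower_count
            fraction = (rank - lower_count) / span if span else 0.0
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Made-up data: 100 requests, cumulative counts per latency bound (seconds).
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```

The estimate can only be as precise as the bucket boundaries, which is why bucket layout matters when an SLO threshold sits between two bounds.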

Q2: How do you control Prometheus cardinality with FastAPI route parameters?

A: Two-pronged. (1) Rely on route templates for the handler label: the Instrumentator records /api/v1/customers/{customer_id} rather than the concrete path, and should_ignore_untemplated=True additionally drops requests that match no route, so the label set stays bounded. (2) Never add customer_id, email, or any user-input as a Prometheus label; even with templated paths, if you .labels(customer_id=X), you explode the time series. Cardinality rule of thumb: aim for < 10k unique label combinations per metric. Beyond that, Prometheus memory use and scrape times grow quickly. If you need per-user metrics, send them as OTel spans or push to a warehouse with a sampled-by-default rule.
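The cardinality argument is simple multiplication, which a few lines of Python make visible. This is a back-of-the-envelope sketch; series_count is a hypothetical helper and all the numbers are invented for illustration.

```python
from math import prod

def series_count(label_values: dict[str, int], buckets: int = 0) -> int:
    """Upper bound on series for one metric: product of label cardinalities.

    For a histogram, pass the number of buckets (including +Inf); each label
    combination then also carries a _sum and a _count series.
    """
    combos = prod(label_values.values())
    return combos * (buckets + 2) if buckets else combos

# Hypothetical numbers: 40 templated routes, 4 methods, 12 status codes.
bounded = {"handler": 40, "method": 4, "code": 12}
print(series_count(bounded))  # 1920

# Adding a customer_id label with 50,000 distinct values.
print(series_count({**bounded, "customer_id": 50_000}))  # 96000000
```

A single unbounded label moves the metric from well under the 10k guideline to tens of millions of potential series.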

Q3: Walk through how Langfuse auto-instruments LangChain when wired with a CallbackHandler.

A: Pattern: at module load, instantiate CallbackHandler(public_key=..., secret_key=..., host=...). Anywhere you call .invoke() or .ainvoke() on a chain/agent/tool, pass config={"callbacks": [handler]}. The handler opens a parent span; every LangChain internal call adds a child span (ChatOpenAI, Retriever, AgentExecutor). Token usage is captured via on_llm_end callbacks. Without the callbacks list, the chain runs but produces no spans, and Langfuse stays empty. Pair with instrumentator.instrument(app) for HTTP-level RED metrics; together they answer “what happened, where, with what model, and how long”.

Top-to-Bottom Code Walkthrough (app/main.py — Prometheus + Langfuse wiring)

The application combines metrics (Prometheus, scrape-friendly) and traces (Langfuse, AI-friendly) into one observability story. Each serves a different audience.

Prometheus setup

from prometheus_fastapi_instrumentator import Instrumentator

instrumentator = Instrumentator(
    should_group_status_codes=False,
    should_ignore_untemplated=True,
    should_instrument_requests_inprogress=True,
    excluded_handlers=["/metrics", "/health", "/docs", "/openapi.json"],
)
instrumentator.instrument(app).expose(app, include_in_schema=False)

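Order matters: call this at import time, BEFORE the app starts serving. instrument(app) registers middleware, which then wraps every endpoint (including SSE) regardless of when the routes themselves are added.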

should_group_status_codes=False

Without grouping, every status code is its own label value: code="200", code="404", code="500". With grouping, all 2xx responses collapse into code="2xx". Disabling grouping allows more precise queries at the cost of higher cardinality.
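The trade-off can be seen by counting distinct label values with and without grouping. A minimal sketch follows; group_status is a hypothetical helper that mirrors the idea, not the library's code.

```python
def group_status(code: int) -> str:
    """Collapse a status code into its class label, e.g. 404 -> '4xx'."""
    return f"{code // 100}xx"

observed = [200, 201, 404, 500, 503]

# Ungrouped: one label value per distinct status code.
print(sorted({str(code) for code in observed}))
# ['200', '201', '404', '500', '503']

# Grouped: one label value per status class.
print(sorted({group_status(code) for code in observed}))
# ['2xx', '4xx', '5xx']
```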

should_ignore_untemplated=True

A request to /api/v1/customers/123/conversations is recorded under its route template, /api/v1/customers/{customer_id}/conversations, so /123/ and /456/ share one label value. What this flag controls is requests that match no route template at all (typos, scanners): with it set to True they are dropped entirely, so only templated handlers show up.

excluded_handlers

  • /metrics: Prometheus scrapes this endpoint on every interval; recording those scrapes adds self-generated traffic to the data.
  • /health — every probe hits this; the noise masks real errors.
  • /docs, /openapi.json — OpenAPI browsing noise.

/metrics endpoint

Exposed automatically by instrumentator.expose(app). Prometheus scrapes it every 15-30 seconds. The response is roughly 150 lines of plain text: # HELP and # TYPE comment lines, each followed by that metric's samples:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="POST", handler="/api/v1/messages/process", code="200"} 1453.0
http_requests_total{method="POST", handler="/api/v1/messages/process", code="500"} 2.0

RED metrics (Rate, Errors, Duration) generated automatically.
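As an illustration of the "Errors" part of RED, the sample output above can be parsed and turned into an error ratio with the standard library alone. This is a rough sketch: the parser handles only simple labelled samples like the ones shown, and in practice the ratio would be computed in PromQL over a rate window rather than from raw counters.

```python
import re

SAMPLE = '''\
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="POST", handler="/api/v1/messages/process", code="200"} 1453.0
http_requests_total{method="POST", handler="/api/v1/messages/process", code="500"} 2.0
'''

LINE = re.compile(r'^(\w+)\{(.*)\}\s+([0-9.eE+-]+)$')

def parse(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse labelled samples; comment lines starting with '#' do not match."""
    samples = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            labels = dict(re.findall(r'(\w+)="([^"]*)"', match.group(2)))
            samples.append((match.group(1), labels, float(match.group(3))))
    return samples

samples = parse(SAMPLE)
total = sum(value for _, _, value in samples)
errors = sum(value for _, labels, value in samples if labels["code"].startswith("5"))
print(f"error ratio: {errors / total:.4%}")
```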

Langfuse setup (LLM-specific tracing)

if settings.is_observability_enabled:
    self.langfuse_handler = CallbackHandler()

Wired inside BaseAgent.__init__: the Langfuse callback handler for LangChain is passed into every llm.ainvoke(..., config={"callbacks": [...]}) call. This adds token usage tracking, latency, and prompt logging to the Langfuse dashboard.

Why both

  • Prometheus: aggregate RED metrics, e.g. alert when the error rate exceeds 5% or p99 latency exceeds 2s.
  • Langfuse: per-conversation prompt/response inspection — debug “why did this customer get a wrong answer?”.

They serve different stakeholders:

  • Prometheus → SRE/oncall engineers.
  • Langfuse → data scientists, support engineers.

Conditional enablement

Both are wrapped in if settings.is_observability_enabled. Dev environments without keys run cleanly: is_observability_enabled returns False, so there is no Prometheus instrumentation and there are no Langfuse API calls.

Common Pitfalls

Forgetting excluded_handlers=["/metrics"]: Prometheus's scrape of the endpoint every 15s is then recorded as application traffic and pollutes the data.

Using customer_id as a metric label — unbounded cardinality blows up Prometheus memory within hours. Always use templated paths ({customer_id}).

Sending raw LLM prompts to Langfuse with PII — Presidio must run BEFORE the prompt is logged. Pass the redacted text, not the raw.
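The ordering is the important part: redact first, then hand the text to anything that logs or traces it. The sketch below uses two regular expressions as a deliberately crude stand-in for Presidio, purely so it runs with the standard library; the patterns and the redact helper are hypothetical and far less thorough than a real PII analyzer.

```python
import re

# Crude stand-in patterns; a real deployment would use Presidio's analyzers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)

raw = "Customer jane.doe@example.com called from +44 20 7946 0958 about a refund."
safe = redact(raw)
print(safe)
# Only `safe` should reach llm.ainvoke(...) or any logger.
```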

Real-World Interview Prep

Q1: Why two observability systems?

A: Different concerns. Prometheus aggregates web-tier metrics (suitable for time-series alerting). Langfuse tracks AI-specific signals (token usage, prompt costs, LLM-as-judge quality). Each is best-in-class for its job; combining gives complete coverage.

Q2: How do you alert on a Span that’s too slow in Langfuse?

A: Langfuse has its own alerts: configure a webhook or Slack notification on traces exceeding N seconds. Alternatively, push custom LangChain events back to Prometheus via prometheus_client.Counter, but that’s more code.

Q3: What if /metrics itself fails?

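A: The instrumentation middleware is set up independently of the /metrics route, so counters and histograms for OTHER endpoints keep accumulating in the in-process registry even if a scrape errors out; the next successful scrape still sees them. Prometheus records a failed scrape as up == 0 for that target, which can itself be alerted on, so occasional failures are tolerable.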
