LangGraph Human-in-the-Loop (HITL) Interrupts
What? (Concept Overview)
Human-in-the-loop interrupts let a compiled StateGraph pause execution before a designated node, persist its snapshot to the checkpointer, and resume with externally-supplied state once a human (or admin system) approves or edits a response. The pattern decouples what the graph decided from what was actually sent to the customer.
Project Context
The FCA Support Agent routes product recommendations through a compliance node, then conditionally forwards them to a human_approval node. The graph is compiled with interrupt_before=["human_approval"], so risk-flagged recommendations never reach the customer until a compliance officer signs off. An approve_intervention endpoint on the coordinator re-hydrates the paused thread, injects a human-edited response via aupdate_state, and re-invokes the graph with ainvoke(None, ...).
How? (Quick Reference Blocks)
3.1 Compile-Time Interrupt Registration
The interrupt is registered ONCE — at compile time — and applies to every invocation of the graph.
# app/workflows/message_workflow.py
self.workflow = self.graph.compile(
    checkpointer=self.checkpointer,
    interrupt_before=["human_approval"],
)
3.2 Detecting a Paused Graph in the Coordinator
After invoking the graph, the coordinator introspects the snapshot to decide whether the workflow actually completed or paused.
# app/coordinator/agent_coordinator.py (process_message)
final_snapshot = await workflow.aget_state(config)
if final_snapshot.next and "human_approval" in final_snapshot.next:
    return {
        "status": "paused",
        "awaiting_approval": True,
        "thread_id": conversation_id,
        "agent_response": final_snapshot.values.get("agent_response"),
    }
3.3 Human Resumption via aupdate_state + ainvoke
An admin endpoint reads the snapshot, mutates the state field that needs human input, and re-invokes the graph with None as the input to signal “resume”.
# app/coordinator/agent_coordinator.py (approve_intervention)
snapshot = await workflow_wrapper.workflow.aget_state(config)
if snapshot.next and "human_approval" in snapshot.next:
    await workflow_wrapper.workflow.aupdate_state(
        config,
        {
            "agent_response": new_response,
            "is_compliant": True,
            "agent_type": "human_admin",
        },
    )
    final_state = await workflow_wrapper.workflow.ainvoke(None, config=config)
3.4 Compliance-Driven Routing into HITL
The compliance node is the canonical place where business policy emits an “escalate” decision; the conditional edge then forwards the workflow into the interrupt point.
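A routing function of this kind can stay a pure function of the state, which makes the escalation policy easy to unit test. The sketch below is illustrative only: the `risk_flags` field and the exact escalation rule are assumptions, not the project's actual `_route_compliance`.

```python
from typing import Literal, TypedDict


class WorkflowState(TypedDict, total=False):
    """Hypothetical subset of the workflow state used for routing."""
    agent_response: str
    is_compliant: bool
    risk_flags: list[str]  # assumed field, not confirmed by the project code


def route_compliance(state: WorkflowState) -> Literal["approved", "review"]:
    """Return the edge label consumed by add_conditional_edges."""
    # Escalate when the draft was not marked compliant or any risk flag is set.
    if not state.get("is_compliant", False) or state.get("risk_flags"):
        return "review"
    return "approved"
```

Defaulting to "review" when `is_compliant` is missing is a deliberate fail-closed choice in this sketch: an unclassified draft goes to a human rather than to the customer.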
# app/workflows/message_workflow.py
workflow.add_conditional_edges(
    "compliance",
    self._route_compliance,
    {"approved": "end", "review": "human_approval"},
)
Why? (Parameter Breakdown)
- interrupt_before rather than a Python if/else: Lets the same graph definition serve both trusted and untrusted inputs by changing the pause list at compile time. In tests, vary the interrupt_before argument to assert that happy-path runs never reach human review.
- snapshot.next as the resume signal: When interrupt_before fires, LangGraph populates snapshot.next with the names of the nodes that would run next. A non-empty tuple means “I am stopped”, an empty one means “I am done”. Comparing with the literal "human_approval" defends against false positives if other interrupts exist.
- aupdate_state reads, then writes: aupdate_state records the update as a new checkpoint on the thread and passes each value through its channel's reducer, as if a node had returned it. Plain overwrite fields such as agent_response are safe to retry; fields with accumulating reducers (for example a message list) are not idempotent, so guard admin retries by checking snapshot.next first.
- ainvoke(None, ...) vs ainvoke(initial_state, ...): Passing None ensures LangGraph picks up where the snapshot left off and does NOT re-run upstream nodes. Passing a fresh state would force a re-execution of classify, compliance, etc., which is almost always wrong for a HITL resume.
- agent_response as the canonical injection field: Downstream nodes (including end) read state["agent_response"]. Centralising the field keeps resume routes independent of which compliance rule triggered the pause.
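The snapshot.next check can be isolated into a small helper so the coordinator and its tests share one definition of "paused". This is a minimal sketch: `SnapshotStub` is a hypothetical stand-in that exposes only the `values` and `next` attributes, and the "completed" status string is an assumption.

```python
from typing import Any, NamedTuple


class SnapshotStub(NamedTuple):
    """Stand-in exposing the two snapshot attributes used here."""
    values: dict[str, Any]
    next: tuple[str, ...]


def paused_at(snapshot: Any, node: str) -> bool:
    # An empty `next` means the run finished; a match means it waits at `node`.
    return node in (snapshot.next or ())


def build_status(snapshot: Any, thread_id: str, node: str = "human_approval") -> dict[str, Any]:
    """Build the coordinator response for a paused or completed run."""
    paused = paused_at(snapshot, node)
    return {
        "status": "paused" if paused else "completed",  # "completed" is assumed
        "awaiting_approval": paused,
        "thread_id": thread_id,
        "agent_response": snapshot.values.get("agent_response"),
    }
```

Because the helper only reads two attributes, the same function should work against a real snapshot object and against the stub in unit tests.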
Common Pitfalls
- Forgetting the checkpointer= argument when calling compile. Without a checkpointer there is nowhere to persist the paused snapshot, so the pause cannot be inspected or resumed. To verify: invoke, then check that snapshot.next is non-empty while paused.
- Re-running node work on resume by passing the original initial_state to the second ainvoke. This re-executes upstream nodes (LLM calls, DB lookups) and can produce an inconsistent final state because new LLM completions will not match the original decision. Pass None instead.
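One way to make a missing checkpointer fail loudly is a thin wrapper around compile. The sketch below is an assumption about how a project might enforce this; `compile_with_interrupts` is a hypothetical helper, and `builder` is any object with a `compile(checkpointer=..., interrupt_before=...)` method.

```python
from typing import Any, Sequence


def compile_with_interrupts(
    builder: Any,
    checkpointer: Any,
    interrupt_before: Sequence[str],
) -> Any:
    """Compile a graph builder, refusing interrupts without a checkpointer."""
    if interrupt_before and checkpointer is None:
        # Fail at startup instead of discovering the problem at resume time.
        raise ValueError(
            "interrupt_before requires a checkpointer; "
            "a pause cannot be persisted or resumed without one"
        )
    return builder.compile(
        checkpointer=checkpointer,
        interrupt_before=list(interrupt_before),
    )
```

Calling this once at application startup turns a configuration mistake into an immediate error rather than a thread that can never be resumed.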
Real-World Interview Prep
Q1: How would you design HITL for a regulated workload (e.g., financial advice) where every recommendation needs an officer’s sign-off?
A: Three-part pattern. (1) Compile the graph with interrupt_before=["human_approval"] and pair it with AsyncPostgresSaver (see Checkpointing page). (2) Expose two admin endpoints: POST /approve (read snapshot, validate identity, aupdate_state, ainvoke(None, config=config)) and POST /reject (same, but set state["agent_response"] = None and route to a rejected node). (3) Track an audit row per pause: INSERT INTO hitl_audit(thread_id, snapshot_id, officer_id, decision, ts) VALUES (...). Never trust the in-memory workflow for compliance evidence. Bonus: gate approve with RBAC plus an SLA timer (e.g., auto-reject if no officer replies within 4 hours) to prevent zombie threads.
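The audit row in step (3) might be written as follows. This is a sketch under stated assumptions: SQLite stands in for the production database so the example runs anywhere, the column types and the allowed decision values ('approve', 'reject') are guesses, and `record_decision` is a hypothetical helper.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS hitl_audit (
    thread_id   TEXT NOT NULL,
    snapshot_id TEXT NOT NULL,
    officer_id  TEXT NOT NULL,
    decision    TEXT NOT NULL CHECK (decision IN ('approve', 'reject')),
    ts          TEXT NOT NULL
)
"""


def record_decision(conn, thread_id, snapshot_id, officer_id, decision):
    """Insert one audit row per human decision, using bound parameters."""
    ts = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO hitl_audit (thread_id, snapshot_id, officer_id, decision, ts) "
            "VALUES (?, ?, ?, ?, ?)",
            (thread_id, snapshot_id, officer_id, decision, ts),
        )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_decision(conn, "42", "ckpt-001", "officer-7", "approve")
```

Writing the row before resuming the graph means a crash between the two steps leaves evidence of the decision rather than an unexplained resume.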
Q2: When would you prefer interrupt_before over Command(resume=...), and vice versa?
A: interrupt_before is the frozen-snapshot model: the entire workflow state is checkpointed before the node runs, an external caller decides what to do, then ainvoke(None) resumes with no surprises. This is best for compliance, audit trails, and SLA-driven pauses. Command(resume=value) belongs to the inline-yield model: a node calls interrupt(payload) partway through its work, the graph pauses inside that node, and the caller resumes by invoking the graph with Command(resume=value), which becomes the return value of interrupt() (the node re-runs from its start on resume). This is best for short-lived approvals such as tool approval during a ReAct loop, agent-to-human handoffs inside tool calling, and conversational clarification turns. Use interrupt_before when the regulator asks “show me exactly what state existed at the human decision”; use Command(resume=...) when the human's answer is an input the node itself needs in order to finish.
Q3: How do you test a LangGraph workflow that contains an interrupt, without pausing real Postgres or waiting for a human?
A: Two complementary strategies. Unit test: MemorySaver + interrupt_before=["x"]. In test code, run graph.ainvoke(state, config={"configurable": {"thread_id": "t1"}}) once, then assert that the result is NOT the final state and that (await graph.aget_state(config)).next contains "x". Then mutate state with graph.aupdate_state(config, {...}) and run graph.ainvoke(None, config=config) to assert continuation. Integration test: spin up a Postgres testcontainer, use AsyncPostgresSaver, and assert that a freshly constructed graph (simulating a pod restart) picks up the saved snapshot on the second invocation. In both cases, measure the time-to-resume: anything > 30 seconds in CI is usually a deadlock on checkpointer.setup().
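The coordinator's own resume logic can also be exercised without LangGraph by passing in a test double. The sketch below assumes the coordinator only needs `aget_state`, `aupdate_state`, and `ainvoke`; `FakeWorkflow` and the standalone `approve` function are hypothetical and simplified from the approve_intervention snippet in 3.3.

```python
import asyncio
from types import SimpleNamespace
from typing import Any


async def approve(workflow: Any, config: dict, new_response: str) -> dict:
    """Resume a thread paused before human_approval with an edited response."""
    snapshot = await workflow.aget_state(config)
    if "human_approval" not in (snapshot.next or ()):
        raise RuntimeError("thread is not awaiting approval")
    await workflow.aupdate_state(
        config, {"agent_response": new_response, "is_compliant": True}
    )
    # None as input means "continue from the checkpoint", not "start over".
    return await workflow.ainvoke(None, config=config)


class FakeWorkflow:
    """Hypothetical test double mimicking the three async methods used above."""

    def __init__(self) -> None:
        self.values = {"agent_response": "draft"}
        self.next = ("human_approval",)
        self.invoked_with: list[Any] = []

    async def aget_state(self, config: dict) -> SimpleNamespace:
        return SimpleNamespace(values=dict(self.values), next=self.next)

    async def aupdate_state(self, config: dict, update: dict) -> None:
        self.values.update(update)

    async def ainvoke(self, graph_input: Any, config: dict) -> dict:
        self.invoked_with.append(graph_input)
        self.next = ()  # the fake run is finished after resuming
        return dict(self.values)


fake = FakeWorkflow()
cfg = {"configurable": {"thread_id": "t1"}}
result = asyncio.run(approve(fake, cfg, "edited by officer"))
```

A double like this checks call order and the None input cheaply; it does not replace the MemorySaver and Postgres tests, which verify real checkpoint behaviour.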
Top-to-Bottom Code Walkthrough (app/coordinator/agent_coordinator.py — approve_intervention())
When the graph is paused at human_agent, the coordinator exposes one method that either injects a human-edited response or, on rejection, falls back to the LLM draft. This is the user-facing side of the HITL pattern.
Signature
async def approve_intervention(
    self,
    conversation_id: int,
    approved_response: str,
    agent_metadata: dict | None = None,
) -> dict:
- conversation_id is used as the thread_id that identifies the checkpointed run.
- approved_response is the human-edited text that will replace what the LLM would have generated.
- agent_metadata is optional structured metadata merged into the WorkflowState (e.g. {"reviewed_by": "agent_alice@bank.com"}).
Inside the method (line by line)
- Build the config:
    config = {"configurable": {"thread_id": str(conversation_id)}}
  The same thread_id means the same checkpointed run resumes.
- Merge metadata into the resumed state via Command (LangGraph pattern):
    from langgraph.types import Command
    resume_value = {"approved_response": approved_response, "agent_metadata": agent_metadata}
    output = await self.graph.ainvoke(Command(resume=resume_value), config=config)
  Command(resume=...) is the LangGraph-native way to deliver a human decision to a node paused by interrupt(): the payload becomes the return value of that interrupt() call rather than a state field. The human_agent node reads approved_response from it and writes it as its final outbound text.
- Persist the approved message via MessageService.add_message(...):
  - role = MessageRole.HUMAN_AGENT
  - content = approved_response
  - metadata = {"is_intervention_response": True, "agent_metadata": agent_metadata}
  - This row also clears the is_escalated flag on the Conversation table.
- Return the full final state:
    return output
  The workflow now finishes normally and the end node yields the result.
Fallback: rejecting the intervention
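If the human chooses to reject (no approved response), we want to fall back to the LLM draft:
output = await self.graph.ainvoke(Command(resume=None), config=config)
The human_agent node detects that the resume value is None and falls back to its LLM-drafted response (the LLM originally wrote it; the intervention was only a review, not necessarily an override). If the installed LangGraph version treats a None resume as no value supplied, pass an explicit payload such as {"approved_response": None} instead.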
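The approve and reject paths can be expressed as one pure function that a node like human_agent could call on the value it receives at resume. This is a hypothetical sketch: `resolve_outbound_text` is not part of the project, and treating a blank string as a rejection is an assumption.

```python
from typing import Any, Optional


def resolve_outbound_text(
    llm_draft: str,
    resume_value: Optional[dict[str, Any]],
) -> tuple[str, bool]:
    """Return (text_to_send, was_edited) for a resumed human_agent node."""
    approved = (resume_value or {}).get("approved_response")
    if approved is None or not approved.strip():
        # Rejection or empty payload: fall back to the LLM draft unchanged.
        return llm_draft, False
    return approved, True
```

Returning the `was_edited` flag alongside the text lets the caller record whether the outbound message was a human override or the reviewed draft.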
Common Pitfalls
- Calling ainvoke(None, config=...) instead of ainvoke(Command(resume={...}), ...) resumes with no payload: the human_agent node receives no approved_response and either crashes or sends the LLM draft unedited.
- Losing escalation metadata when the resumed payload replaces state.metadata: have the human_agent node merge the payload into the existing metadata instead of overwriting it.
- Forgetting to mark Conversation.is_escalated = False after a successful intervention: the conversation stays flagged as “needs human” forever.
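To avoid losing escalation metadata, the reviewer's metadata can be merged over the existing metadata instead of replacing it. A minimal sketch, assuming metadata is a flat dictionary; `merge_metadata` and the `escalation_reason` key are hypothetical.

```python
from typing import Any, Mapping, Optional


def merge_metadata(
    existing: Optional[Mapping[str, Any]],
    patch: Optional[Mapping[str, Any]],
) -> dict[str, Any]:
    """Shallow-merge reviewer metadata over existing metadata.

    Neither input is mutated; keys in `patch` win on conflict.
    """
    merged = dict(existing or {})
    merged.update(patch or {})
    return merged
```

Since the function takes (current, update) and returns the new value, it could plausibly also serve as a reducer on a metadata field of the state, though that wiring is not shown in the source.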
Real-World Interview Prep
Q1: Why does the intervention method explicitly persist via MessageService instead of relying on the workflow to write?
A: Because the workflow pauses before the human_agent node runs; if you let the workflow write after resume, you’d be recording a “human agent response” without the human’s edit. The intervention method writes directly to preserve fidelity.
Q2: How do you prompt the human reviewer with the LLM’s draft?
A: Two options. (a) Before calling approve_intervention, your front-end calls a preview endpoint like /conversations/{id}/pending_response which returns the draft the LLM wrote. (b) Stream events from the SSE endpoint and capture the partial-draft before the pause.
Q3: What happens if the human takes too long (the SLA hits)?
A: A background job (Celery beat) sweeps stale paused graphs every 5 minutes and force-resumes with the LLM draft, marking the intervention as {"forced": True, "reason": "sla_timeout"}. The customer gets something instead of silence.
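The selection step of such a sweep might look like the sketch below. It covers only the "which threads are stale" decision; how pause timestamps are stored and how the job resumes each thread are left out, and `find_stale_threads` and the SLA value passed to it are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Marker attached to a forced resume, as described in the answer above.
FORCED_RESUME = {"forced": True, "reason": "sla_timeout"}


def find_stale_threads(
    paused_since: dict[str, datetime],
    now: datetime,
    sla: timedelta,
) -> list[str]:
    """Return thread ids paused for at least `sla`, oldest pause first."""
    stale = [(ts, tid) for tid, ts in paused_since.items() if now - ts >= sla]
    return [tid for ts, tid in sorted(stale)]
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the sweep deterministic under test.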