LangGraph Human-in-the-Loop (HITL) Interrupts

What? (Concept Overview)

Human-in-the-loop interrupts let a compiled StateGraph pause execution before a designated node, persist its snapshot to the checkpointer, and resume with externally-supplied state once a human (or admin system) approves or edits a response. The pattern decouples what the graph decided from what was actually sent to the customer.

Project Context

The FCA Support Agent routes product recommendations through a compliance node, then conditionally forwards them to a human_approval node. The graph is compiled with interrupt_before=["human_approval"], so risk-flagged recommendations never reach the customer until a compliance officer signs off. An approve_intervention endpoint on the coordinator re-hydrates the paused thread, injects a human-edited response via aupdate_state, and re-invokes the graph with ainvoke(None, ...).

How? (Quick Reference Blocks)

3.1 Compile-Time Interrupt Registration

The interrupt is registered ONCE — at compile time — and applies to every invocation of the graph.

# app/workflows/message_workflow.py
self.workflow = self.graph.compile(
    checkpointer=self.checkpointer,
    interrupt_before=["human_approval"],
)

3.2 Detecting a Paused Graph in the Coordinator

After invoking the graph, the coordinator introspects the snapshot to decide whether the workflow actually completed or paused.

# app/coordinator/agent_coordinator.py (process_message)
final_snapshot = await workflow.aget_state(config)
if final_snapshot.next and "human_approval" in final_snapshot.next:
    return {
        "status": "paused",
        "awaiting_approval": True,
        "thread_id": conversation_id,
        "agent_response": final_snapshot.values.get("agent_response"),
    }
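
The check above can be factored into a small helper so the paused and completed branches are easy to test in isolation. This is a minimal sketch, not the project's code: `SnapshotStandIn`, `paused_at`, `build_reply`, and the `"completed"` status value are hypothetical names, and the stand-in only mimics the two snapshot fields (`values`, `next`) used in the snippet.

```python
from typing import Any, NamedTuple


class SnapshotStandIn(NamedTuple):
    """Hypothetical stand-in exposing the two snapshot fields used here."""
    values: dict[str, Any]
    next: tuple[str, ...]


def paused_at(snapshot: Any, node: str) -> bool:
    """True when the run stopped with `node` scheduled to execute next."""
    return node in (snapshot.next or ())


def build_reply(snapshot: Any, conversation_id: int) -> dict[str, Any]:
    """Shape the coordinator's reply for a paused versus a completed run."""
    draft = snapshot.values.get("agent_response")
    if paused_at(snapshot, "human_approval"):
        return {
            "status": "paused",
            "awaiting_approval": True,
            "thread_id": conversation_id,
            "agent_response": draft,
        }
    # "completed" is an assumed status label, not taken from the project.
    return {"status": "completed", "agent_response": draft}


paused = SnapshotStandIn({"agent_response": "draft"}, ("human_approval",))
done = SnapshotStandIn({"agent_response": "final"}, ())
print(build_reply(paused, 42)["status"])  # paused
print(build_reply(done, 42)["status"])    # completed
```

Checking membership rather than truthiness alone keeps the helper correct if other interrupt points are added later.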

3.3 Human Resumption via aupdate_state + ainvoke

An admin endpoint reads the snapshot, mutates the state field that needs human input, and re-invokes the graph with None as the input to signal “resume”.

# app/coordinator/agent_coordinator.py (approve_intervention)
snapshot = await workflow_wrapper.workflow.aget_state(config)
if snapshot.next and "human_approval" in snapshot.next:
    await workflow_wrapper.workflow.aupdate_state(
        config,
        {
            "agent_response": new_response,
            "is_compliant": True,
            "agent_type": "human_admin",
        },
    )
    final_state = await workflow_wrapper.workflow.ainvoke(None, config=config)
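
To see why the `snapshot.next` guard matters for retries, the flow can be exercised against an in-memory stand-in. This is a sketch under stated assumptions: `StubWorkflow` and `StubSnapshot` are hypothetical classes that only imitate the three calls used above (`aget_state`, `aupdate_state`, `ainvoke`), and the `"resumed"` / `"not_paused"` status values are invented for illustration.

```python
import asyncio
from typing import Any


class StubSnapshot:
    def __init__(self, values: dict[str, Any], next_nodes: tuple[str, ...]) -> None:
        self.values = values
        self.next = next_nodes


class StubWorkflow:
    """Hypothetical stand-in for a compiled graph paused before human_approval."""

    def __init__(self, draft: str) -> None:
        self._values: dict[str, Any] = {"agent_response": draft, "is_compliant": False}
        self._next: tuple[str, ...] = ("human_approval",)

    async def aget_state(self, config: dict) -> StubSnapshot:
        return StubSnapshot(dict(self._values), self._next)

    async def aupdate_state(self, config: dict, values: dict[str, Any]) -> None:
        self._values.update(values)

    async def ainvoke(self, input_: Any, config: dict) -> dict[str, Any]:
        self._next = ()  # remaining nodes ran; nothing is pending any more
        return dict(self._values)


async def approve(workflow: Any, thread_id: str, new_response: str) -> dict[str, Any]:
    config = {"configurable": {"thread_id": thread_id}}
    snapshot = await workflow.aget_state(config)
    if "human_approval" not in (snapshot.next or ()):
        # Not paused (or already resumed): never mutate state a second time.
        return {"status": "not_paused", "agent_response": snapshot.values.get("agent_response")}
    await workflow.aupdate_state(
        config,
        {"agent_response": new_response, "is_compliant": True, "agent_type": "human_admin"},
    )
    final_state = await workflow.ainvoke(None, config=config)
    return {"status": "resumed", "agent_response": final_state["agent_response"]}


async def demo() -> tuple[dict[str, Any], dict[str, Any]]:
    wf = StubWorkflow("LLM draft")
    first = await approve(wf, "42", "Officer-edited text")
    retry = await approve(wf, "42", "Different text")  # duplicate admin request
    return first, retry


first, retry = asyncio.run(demo())
print(first["status"], retry["status"])  # resumed not_paused
```

The duplicate request leaves the officer's original edit in place, which is the behavior an admin retry should have.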

3.4 Compliance-Driven Routing into HITL

The compliance node is the canonical place where business policy emits an “escalate” decision; the conditional edge then forwards the workflow into the interrupt point.

# app/workflows/message_workflow.py
workflow.add_conditional_edges(
    "compliance",
    self._route_compliance,
    {"approved": "end", "review": "human_approval"},
)
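
The source does not show `_route_compliance` itself. A router of this kind might look like the sketch below, which assumes the compliance node has already written an `is_compliant` flag into the state; the standalone function name `route_compliance` and the fail-closed default are illustrative choices, not the project's confirmed behavior.

```python
from typing import Literal, TypedDict


class WorkflowState(TypedDict, total=False):
    agent_response: str
    is_compliant: bool


def route_compliance(state: WorkflowState) -> Literal["approved", "review"]:
    """Return the edge label; anything not explicitly compliant goes to review."""
    return "approved" if state.get("is_compliant") is True else "review"


print(route_compliance({"agent_response": "ok", "is_compliant": True}))   # approved
print(route_compliance({"agent_response": "hm", "is_compliant": False}))  # review
print(route_compliance({"agent_response": "no flag set"}))                # review
```

Defaulting to `"review"` when the flag is missing means a bug in the compliance node sends traffic to a human rather than to the customer.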

Why? (Parameter Breakdown)

  • interrupt_before rather than a Python if/else: Lets the same compiled graph serve both trusted and untrusted inputs by changing the pause list. Change the interrupt_before argument in tests to assert that “happy-path” runs never pass through human review.
  • snapshot.next as the resume signal: When interrupt_before fires, LangGraph populates snapshot.next with the names of the nodes that would run next. A non-empty tuple means “I am stopped”; an empty one means “I am done”. Checking for the literal "human_approval" defends against false positives if other interrupts exist.
  • aupdate_state as the write path: aupdate_state writes the edited values as a new checkpoint on the thread, applying each field's reducer. Guarding the call with the snapshot.next check keeps admin retries safe: once the graph has resumed, a repeated request no longer finds human_approval pending and skips the update.
  • ainvoke(None, ...) vs ainvoke(initial_state, ...): Passing None ensures LangGraph picks up where the snapshot left off and does NOT re-run upstream nodes. Passing a fresh state would force a re-execution of classify, compliance, etc., which is almost always wrong for a HITL resume.
  • agent_response as the canonical injection field: Downstream nodes (including end) read state["agent_response"]. Centralising the field keeps resume routes independent of which compliance rule triggered the pause.

Common Pitfalls

  1. Forgetting the checkpointer= argument when calling compile. Without a checkpointer there is no persisted snapshot to inspect or resume, so the pause is effectively invisible to the coordinator. To verify: invoke the graph, then check that snapshot.next is a non-empty tuple containing "human_approval" while the run is paused.
  2. Re-running node work on resume by passing the original initial_state to the second ainvoke. This re-executes upstream nodes (LLM calls, DB lookups) and can produce an inconsistent final state because new LLM completions will not match the original decision. Pass None instead.

Real-World Interview Prep

Q1: How would you design HITL for a regulated workload (e.g., financial advice) where every recommendation needs an officer’s sign-off?

A: Three-part pattern. (1) Compile the graph with interrupt_before=["human_approval"] and pair it with AsyncPostgresSaver (see Checkpointing page). (2) Expose two admin endpoints: POST /approve (read snapshot, validate identity, aupdate_state, ainvoke(None, config=config)) and POST /reject (same, but set state["agent_response"] = None and route to a rejected node). (3) Track an audit row per pause: INSERT INTO hitl_audit(thread_id, snapshot_id, officer_id, decision, ts) VALUES (...). Never rely on the in-memory workflow for compliance evidence. Bonus: gate approve with RBAC and an SLA timer (e.g., auto-reject if no officer replies within 4 hours) to prevent zombie threads.
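
The audit step in (3) can be sketched with the standard library's sqlite3 standing in for the production database. The column names come from the INSERT statement above; the column types, the CHECK constraint, and the helper names `open_audit_db` and `record_decision` are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the audit table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS hitl_audit (
               thread_id   TEXT NOT NULL,
               snapshot_id TEXT NOT NULL,
               officer_id  TEXT NOT NULL,
               decision    TEXT NOT NULL CHECK (decision IN ('approve', 'reject')),
               ts          TEXT NOT NULL
           )"""
    )
    return conn


def record_decision(
    conn: sqlite3.Connection,
    thread_id: str,
    snapshot_id: str,
    officer_id: str,
    decision: str,
) -> None:
    """Insert one audit row; the `with` block commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO hitl_audit(thread_id, snapshot_id, officer_id, decision, ts) "
            "VALUES (?, ?, ?, ?, ?)",
            (thread_id, snapshot_id, officer_id, decision,
             datetime.now(timezone.utc).isoformat()),
        )


conn = open_audit_db()
record_decision(conn, "42", "ckpt-001", "officer-7", "approve")
rows = conn.execute("SELECT thread_id, officer_id, decision FROM hitl_audit").fetchall()
print(rows)  # [('42', 'officer-7', 'approve')]
```

Writing the row before calling ainvoke(None, ...) means the evidence exists even if the resume itself fails.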

Q2: When would you prefer interrupt_before over Command(resume=...), and vice versa?

A: interrupt_before is the frozen-snapshot model: the entire workflow state is checkpointed, an external caller decides what to do, then ainvoke(None) resumes with no surprises. This is best for compliance, audit trails, and SLA-driven pauses. Command(resume=value) is the inline-yield model: a node calls interrupt(), which pauses the graph mid-node (e.g., tool approval during a ReAct loop), and the caller resumes by invoking the graph with Command(resume=value), which becomes the return value of interrupt(). This is best for short-lived approvals, agent→human handoffs inside tool calling, and conversational clarification turns. Use interrupt_before when the regulator asks “show me exactly what state existed at the human decision”; use Command(resume=...) when the node needs the human's answer as a value to continue its own logic.

Q3: How do you test a LangGraph workflow that contains an interrupt, without pausing real Postgres or waiting for a human?

A: Two complementary strategies. Unit test: MemorySaver + interrupt_before=["x"]. In test code, run graph.ainvoke(state, config={"configurable": {"thread_id": "t1"}}) once, then assert that the result is NOT the final state and that (await graph.aget_state(config)).next contains "x". Then edit the state with graph.aupdate_state(config, {...}) and run graph.ainvoke(None, config=config) to assert continuation. Integration test: spin up a Postgres testcontainer, use AsyncPostgresSaver, and assert that after a simulated pod restart the second invocation picks up the saved snapshot. In both cases, measure the time-to-resume; anything over 30 seconds in CI usually points to a hang in checkpointer.setup().

Top-to-Bottom Code Walkthrough (app/coordinator/agent_coordinator.py, approve_intervention())

When the graph is paused at human_agent, the API exposes one method to either inject a human-edited response or fall back to the LLM draft. This is the user-facing side of the HITL pattern.

Signature

async def approve_intervention(
    self,
    conversation_id: int,
    approved_response: str,
    agent_metadata: dict | None = None,
) -> dict:
  • conversation_id is used (as a string) as the thread_id under which the paused run was checkpointed.
  • approved_response is the human-edited text that will replace what the LLM would have generated.
  • agent_metadata is optional structured metadata merged into the WorkflowState (e.g. {"reviewed_by": "agent_alice@bank.com"}).

Inside the method (line by line)

  1. Build the config: config = {"configurable": {"thread_id": str(conversation_id)}}. Same thread_id means the same checkpointed run resumes.
  2. Merge metadata into the resumed state via Command (LangGraph pattern):
    from langgraph.types import Command

    resume_value = {
        "approved_response": approved_response,
        "agent_metadata": agent_metadata,
    }
    output = await self.graph.ainvoke(Command(resume=resume_value), config=config)
    Command(resume=...) is LangGraph's mechanism for handing a human decision to a node that paused by calling interrupt(): the resume value becomes the return value of that call. This assumes human_agent pauses that way; the node then stores approved_response in the state and uses it as its final outbound text.
  3. Persist the approved message via MessageService.add_message(...):
    • role = MessageRole.HUMAN_AGENT.
    • content = approved_response.
    • metadata = {"is_intervention_response": True, "agent_metadata": agent_metadata}.
    • Writing this row also clears the is_escalated flag on the Conversation table.
  4. Return the full final state: return output — the workflow now finishes normally, end-node yields the result.

Fallback: rejecting the intervention

If the human chooses to reject (no approved response), we want to fall back:

output = await self.graph.ainvoke(Command(resume=None), config=config)

The human_agent node detects resume is None and falls back to its LLM-drafted response (the LLM originally wrote it; the intervention was just to review, not necessarily override).
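
The node-side decision described above can be isolated as a pure function. This is a minimal sketch: `choose_outbound_text` is a hypothetical helper, and treating a blank or missing approved_response the same as a rejection is an assumption rather than documented project behavior.

```python
from typing import Any, Optional


def choose_outbound_text(resume_value: Optional[dict[str, Any]], llm_draft: str) -> str:
    """Prefer the reviewer's text; fall back to the draft when none was supplied."""
    if resume_value is None:
        return llm_draft
    approved = resume_value.get("approved_response")
    # Assumption: a missing or blank approval counts as "no override".
    if isinstance(approved, str) and approved.strip():
        return approved
    return llm_draft


print(choose_outbound_text({"approved_response": "Edited reply"}, "LLM draft"))  # Edited reply
print(choose_outbound_text(None, "LLM draft"))                                   # LLM draft
print(choose_outbound_text({"approved_response": "   "}, "LLM draft"))           # LLM draft
```

Keeping this logic out of the node body makes the fallback path testable without a running graph.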

Common Pitfalls

  1. Calling ainvoke(None, config=...) instead of ainvoke(Command(resume={...}), ...). The graph resumes with no payload, so the human_agent node sees no approved_response and either crashes or sends the LLM draft unedited.
  2. Losing escalation metadata when the resume payload replaces state.metadata. Have the node merge the payload into the existing metadata rather than overwrite it.
  3. Forgetting to set Conversation.is_escalated = False after a successful intervention. The conversation stays flagged as “needs human” forever.
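
The metadata pitfall comes down to merging instead of assigning. A minimal sketch, assuming metadata is a flat dict; `merge_metadata` and the `escalation_reason` key are hypothetical names used only for illustration.

```python
from typing import Any, Optional


def merge_metadata(
    existing: Optional[dict[str, Any]],
    patch: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Return a new dict with `patch` layered over `existing`; inputs are not mutated."""
    merged = dict(existing or {})
    merged.update(patch or {})
    return merged


before = {"escalation_reason": "risk_flag"}  # hypothetical key written at escalation time
after = merge_metadata(before, {"reviewed_by": "agent_alice@bank.com"})
print(after)  # {'escalation_reason': 'risk_flag', 'reviewed_by': 'agent_alice@bank.com'}
```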

Real-World Interview Prep

Q1: Why does the intervention method explicitly persist via MessageService instead of relying on the workflow to write?

A: Because the workflow pauses before the human_agent node runs; if you let the workflow write after resume, you’d be recording a “human agent response” without the human’s edit. The intervention method writes directly to preserve fidelity.

Q2: How do you prompt the human reviewer with the LLM’s draft?

A: Two options. (a) Before calling approve_intervention, your front-end calls a preview endpoint like /conversations/{id}/pending_response which returns the draft the LLM wrote. (b) Stream events from the SSE endpoint and capture the partial-draft before the pause.

Q3: What happens if the human takes too long (the SLA hits)?

A: A background job (Celery beat) sweeps stale paused graphs every 5 minutes and force-resumes with the LLM draft, marking the intervention as {"forced": True, "reason": "sla_timeout"}. The customer gets something instead of silence.
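
The selection logic inside such a sweep can be sketched without Celery. Everything here is illustrative: `PausedThread`, `find_stale`, and `forced_metadata` are hypothetical names, the 4-hour SLA is an example value, and how paused threads and their timestamps are actually stored is not shown in the source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PausedThread:
    thread_id: str
    paused_at: datetime


def find_stale(paused: list[PausedThread], now: datetime, sla: timedelta) -> list[str]:
    """Return the thread ids that have been waiting at least as long as the SLA."""
    return [p.thread_id for p in paused if now - p.paused_at >= sla]


def forced_metadata() -> dict:
    """Marker attached to an intervention that was resumed by the sweeper."""
    return {"forced": True, "reason": "sla_timeout"}


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
paused = [
    PausedThread("t1", now - timedelta(hours=5)),     # past the SLA
    PausedThread("t2", now - timedelta(minutes=30)),  # still within the SLA
]
stale = find_stale(paused, now, sla=timedelta(hours=4))
print(stale)  # ['t1']
```

Passing `now` in as an argument keeps the function deterministic, so the sweep can be unit-tested without patching the clock.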