LangGraph Human-in-the-Loop (HITL) Interrupts

What? (Concept Overview)

Human-in-the-loop interrupts let a compiled StateGraph pause execution before a designated node, persist its snapshot to the checkpointer, and resume with externally-supplied state once a human (or admin system) approves or edits a response. The pattern decouples what the graph decided from what was actually sent to the customer.

Project Context

The FCA Support Agent routes product recommendations through a compliance node, then conditionally forwards them to a human_approval node. The graph is compiled with interrupt_before=["human_approval"], so risk-flagged recommendations never reach the customer until a compliance officer signs off. An approve_intervention endpoint on the coordinator re-hydrates the paused thread, injects a human-edited response via aupdate_state, and re-invokes the graph with ainvoke(None, ...).

How? (Quick Reference Blocks)

3.1 Compile-Time Interrupt Registration

The interrupt is registered ONCE — at compile time — and applies to every invocation of the graph.


# app/workflows/message_workflow.py
self.workflow = self.graph.compile(
    checkpointer=self.checkpointer,
    interrupt_before=["human_approval"],
)

3.2 Detecting a Paused Graph in the Coordinator

After invoking the graph, the coordinator introspects the snapshot to decide whether the workflow actually completed or paused.


# app/coordinator/agent_coordinator.py — process_message
final_snapshot = await workflow.aget_state(config)
if final_snapshot.next and "human_approval" in final_snapshot.next:
    return {
        "status": "paused",
        "awaiting_approval": True,
        "thread_id": conversation_id,
        "agent_response": final_snapshot.values.get("agent_response"),
    }
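
The same check can be pulled into a small pure function so it is testable without a running graph. A minimal sketch: SnapshotStub is a hypothetical stand-in that mimics only the values and next fields of LangGraph's StateSnapshot, and summarize_snapshot is an illustrative name, not part of the project.

```python
from typing import Any, NamedTuple


class SnapshotStub(NamedTuple):
    """Hypothetical stand-in for the two StateSnapshot fields used below."""
    values: dict
    next: tuple


def summarize_snapshot(snapshot: Any, conversation_id: int,
                       pause_node: str = "human_approval") -> dict:
    """Translate a snapshot into the coordinator's response payload."""
    # snapshot.next is a tuple of node names; an empty tuple means the run finished.
    return {
        "status": "paused" if snapshot.next else "completed",
        "awaiting_approval": pause_node in snapshot.next,
        "thread_id": conversation_id,
        "agent_response": snapshot.values.get("agent_response"),
    }
```

Keeping the decision in one function means the coordinator and its tests agree on what "paused" means.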

3.3 Human Resumption via `aupdate_state` + `ainvoke`

An admin endpoint reads the snapshot, mutates the state field that needs human input, and re-invokes the graph with None as the input to signal “resume”.


# app/coordinator/agent_coordinator.py — approve_intervention
snapshot = await workflow_wrapper.workflow.aget_state(config)
if snapshot.next and "human_approval" in snapshot.next:
    await workflow_wrapper.workflow.aupdate_state(
        config,
        {
            "agent_response": new_response,
            "is_compliant": True,
            "agent_type": "human_admin",
        },
    )
final_state = await workflow_wrapper.workflow.ainvoke(None, config=config)

3.4 Compliance-Driven Routing into HITL

The compliance node is the canonical place where business policy emits an “escalate” decision; the conditional edge then forwards the workflow into the interrupt point.


# app/workflows/message_workflow.py
workflow.add_conditional_edges(
    "compliance",
    self._route_compliance,
    {"approved": "end", "review": "human_approval"},
)
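
The routing function itself is not shown in the source. A minimal sketch of what it might look like, assuming a hypothetical risk_flags list that the compliance node fills in (the field name and the function body are assumptions, not the project's code):

```python
from typing import TypedDict


class WorkflowState(TypedDict, total=False):
    agent_response: str
    is_compliant: bool
    risk_flags: list  # hypothetical field, assumed for illustration


def route_compliance(state: WorkflowState) -> str:
    """Return the edge label that add_conditional_edges maps to a node."""
    # Only an explicitly compliant, unflagged response skips human review.
    if state.get("is_compliant") and not state.get("risk_flags"):
        return "approved"
    return "review"
```

Defaulting to "review" when the fields are missing keeps the failure mode on the safe side: an incomplete compliance result pauses for a human rather than reaching the customer.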

Why? (Parameter Breakdown)

interrupt_before rather than a Python if/else: Lets the same graph definition serve both trusted and untrusted inputs by changing the pause list instead of the node code. In tests, vary the interrupt_before argument (invoke and ainvoke also accept it per call) to assert that “happy-path” runs never cross human review.
snapshot.next as the resume signal: When interrupt_before fires, snapshot.next holds the names of the nodes that would run next. It is a tuple: non-empty means “I am stopped”, empty means “I am done”. Checking for the literal "human_approval" defends against false positives if other interrupts exist.
aupdate_state writes a new checkpoint: aupdate_state applies the supplied values to the paused thread as if a node had returned them, so each field's reducer (if any) still runs, and the result is saved as a new checkpoint on top of the existing one. Plain overwrite fields such as agent_response end up with the same value if an admin retries; a field with an accumulating reducer (for example a message list) would receive the update twice, so keep the approve payload to overwrite-style fields or guard against retries.
ainvoke(None, ...) vs ainvoke(initial_state, ...): Passing None ensures LangGraph picks up where the snapshot left off and does NOT re-run upstream nodes. Passing a fresh state would force a re-execution of classify, compliance, etc., which is almost always wrong for a HITL resume.
agent_response as the canonical injection field: Downstream nodes (including end) read state["agent_response"]. Centralising the field keeps resume routes independent of which compliance rule triggered the pause.
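
How a repeated state update behaves depends on the reducer attached to each field. The sketch below is a plain-Python simulation of that rule, not LangGraph itself: apply_update is a hypothetical helper, and the reducers mapping mirrors the idea of annotating a state field with a function such as operator.add.

```python
import operator
from typing import Any, Callable, Dict


def apply_update(state: dict, update: dict,
                 reducers: Dict[str, Callable[[Any, Any], Any]]) -> dict:
    """Return a new state: reduce fields that have a reducer, overwrite the rest."""
    merged = dict(state)
    for key, value in update.items():
        if key in reducers and key in merged:
            merged[key] = reducers[key](merged[key], value)
        else:
            merged[key] = value
    return merged


reducers = {"notes": operator.add}  # "notes" accumulates; other fields overwrite
state = {"agent_response": "LLM draft", "notes": ["flagged"]}
update = {"agent_response": "Edited by officer", "notes": ["approved"]}

once = apply_update(state, update, reducers)
twice = apply_update(once, update, reducers)  # simulates an admin retry
```

After the retry, agent_response is unchanged while notes has gained a second "approved" entry, which is why the fields touched on resume are worth choosing deliberately.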

Common Pitfalls

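Forgetting the checkpointer= argument when calling compile. Without a checkpointer there is nowhere to persist the pause, so the thread cannot be inspected with aget_state or resumed later. To verify: invoke, then check that snapshot.next is a non-empty tuple containing "human_approval". Note that snapshot.next is an empty tuple, never None, once the graph has finished.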
Re-running node work on resume by passing the original initial_state to the second ainvoke. This re-executes upstream nodes (LLM calls, DB lookups) and can produce an inconsistent final state because new LLM completions will not match the original decision. Pass None instead.
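
One way to make the second pitfall harder to hit is to derive the graph input from the snapshot instead of from the caller. A minimal sketch; choose_graph_input is a hypothetical helper and next_nodes stands for snapshot.next:

```python
from typing import Optional


def choose_graph_input(next_nodes: tuple, initial_state: dict) -> Optional[dict]:
    """Return None to resume a paused thread, or the initial state for a new run."""
    # A non-empty tuple means a checkpointed run is waiting; resuming with
    # None continues it without re-running upstream nodes.
    if next_nodes:
        return None
    return initial_state
```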

Real-World Interview Prep

Q1: How would you design HITL for a regulated workload (e.g., financial advice) where every recommendation needs an officer’s sign-off?

A: Three-tier pattern. (1) Compile the graph with interrupt_before=["human_approval"] and pair it with AsyncPostgresSaver (see Checkpointing page). (2) Expose two admin endpoints: POST /approve (read snapshot, validate identity, aupdate_state, ainvoke(None, config=config)) and POST /reject (same, but set state["agent_response"] = None and route to a rejected node). (3) Track an audit row per pause: INSERT INTO hitl_audit(thread_id, snapshot_id, officer_id, decision, ts) VALUES (...). Never trust the in-memory workflow for compliance evidence. Bonus: gate approve with RBAC + SLA timer (e.g., auto-reject if no officer replies within 4 hours) to prevent zombie threads.
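
The audit write in step (3) can be sketched as follows. This uses an in-memory SQLite database purely as a stand-in for the real Postgres instance; the column types, the allowed decision values, and the record_decision helper are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone

ALLOWED_DECISIONS = {"approve", "reject"}  # assumed vocabulary


def record_decision(conn: sqlite3.Connection, thread_id: str, snapshot_id: str,
                    officer_id: str, decision: str) -> None:
    """Append one audit row; reject unknown decisions before touching the table."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    conn.execute(
        "INSERT INTO hitl_audit(thread_id, snapshot_id, officer_id, decision, ts) "
        "VALUES (?, ?, ?, ?, ?)",
        (thread_id, snapshot_id, officer_id, decision,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the production database
conn.execute(
    "CREATE TABLE hitl_audit (thread_id TEXT, snapshot_id TEXT, "
    "officer_id TEXT, decision TEXT, ts TEXT)"
)
record_decision(conn, "42", "checkpoint-1", "officer-7", "approve")
rows = conn.execute(
    "SELECT thread_id, officer_id, decision FROM hitl_audit"
).fetchall()
```

Writing the row in the same request that resumes the graph, with parameterised values, keeps the evidence independent of whatever the workflow later does with its own state.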

Q2: When would you prefer `interrupt_before` over `Command(resume=...)`, and vice versa?

A: interrupt_before is the frozen-snapshot model: the entire workflow state is checkpointed before the node runs, an external caller decides what to do, then ainvoke(None, config=config) resumes with no surprises. This is best for compliance, audit trails, and SLA-driven pauses. Command(resume=value) is the inline-yield model: a node calls interrupt(...) partway through its own body, the graph pauses there, and the caller resumes by invoking with Command(resume=value), which becomes the return value of interrupt(...) (e.g., tool approval during a ReAct loop). The node re-executes from its start on resume, so any code before interrupt(...) should be safe to repeat. This is best for short-lived approvals, agent-to-human handoffs inside tool calling, and conversational clarification turns. Use interrupt_before when the regulator asks “show me exactly what state existed at the human decision”; use Command(resume=...) when the human's answer needs to land inside a node rather than be patched into state from outside.

Q3: How do you test a LangGraph workflow that contains an interrupt, without pausing real Postgres or waiting for a human?

A: Two complementary strategies. Unit test: compile with MemorySaver and interrupt_before=["x"]. In test code, run graph.ainvoke(state, config={"configurable": {"thread_id": "t1"}}) once, then assert that (await graph.aget_state(config)).next contains "x", which shows the run stopped short of the final state. Then edit the state with graph.aupdate_state(config, {...}) rather than writing to the checkpointer directly, and run graph.ainvoke(None, config=config) to assert continuation. Integration test: spin up a Postgres testcontainer, use AsyncPostgresSaver, and assert that across a pod restart the second invocation picks up the saved snapshot. In both cases, record the time-to-resume; anything > 30 seconds in CI is usually a deadlock on checkpointer.setup().

Top-to-Bottom Code Walkthrough (`app/coordinator/agent_coordinator.py` — `approve_intervention()`)

When the graph is paused at human_agent, the API exposes one method to either inject a human-edited response or abort the conversation. This is the user-facing side of the HITL pattern.

Signature


async def approve_intervention(
    self,
    conversation_id: int,
    approved_response: str,
    agent_metadata: dict | None = None,
) -> dict:

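conversation_id identifies the paused run: it is converted to the thread_id passed in the config, which must match the thread_id used when the graph was first invoked.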
approved_response is the human-edited text that will replace what the LLM would have generated.
agent_metadata is optional structured metadata merged into the WorkflowState (e.g. {"reviewed_by": "agent_alice@bank.com"}).
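
Because the resume only finds the paused run when the thread_id matches the one used on the first call, it can help to build the config in one place so every call uses the same representation. A minimal sketch; thread_config is a hypothetical helper name:

```python
def thread_config(conversation_id: int) -> dict:
    """Build the per-thread config LangGraph uses to locate a checkpointed run."""
    # Normalise to str so the first call and the resume call use the same key.
    return {"configurable": {"thread_id": str(conversation_id)}}
```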

Inside the method (line by line)

Build the config: config = {"configurable": {"thread_id": str(conversation_id)}}. Same thread_id means the same checkpointed run resumes.
Merge metadata into the resumed state via Command (LangGraph pattern):
```
from langgraph.types import Command
resume_value = {"approved_response": approved_response, "agent_metadata": agent_metadata}
output = await self.graph.ainvoke(Command(resume=resume_value), config=config)
```
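Command(resume=...) is the LangGraph-native way to hand a human decision to a node that paused by calling interrupt(...): the resume value becomes the return value of that interrupt(...) call. The human_agent node reads approved_response from it and writes it as its final outbound text. A thread paused only by interrupt_before has no interrupt(...) call to receive the value; resume it with aupdate_state and ainvoke(None, config=config) as in 3.3.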
Persist the approved message via MessageService.add_message(...):
- role = MessageRole.HUMAN_AGENT.
- content = approved_response.
- metadata = {"is_intervention_response": True, "agent_metadata": agent_metadata}.
- This row also closes the is_escalated flag on the Conversation table.
Return the full final state: return output — the workflow now finishes normally, end-node yields the result.
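
The persistence step can be kept honest with a small builder for the message fields listed above. A sketch only: MessageRole here is a stand-in for the project's enum (the string value is assumed), and build_intervention_message is a hypothetical helper whose result would be passed on to MessageService.add_message.

```python
from enum import Enum
from typing import Optional


class MessageRole(str, Enum):
    """Stand-in for the project's enum; only the member used here is assumed."""
    HUMAN_AGENT = "human_agent"


def build_intervention_message(approved_response: str,
                               agent_metadata: Optional[dict] = None) -> dict:
    """Assemble the fields for the approved message row."""
    return {
        "role": MessageRole.HUMAN_AGENT,
        "content": approved_response,
        "metadata": {
            "is_intervention_response": True,
            "agent_metadata": agent_metadata,
        },
    }
```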

Fallback: rejecting the intervention

If the human chooses to reject (no approved response), we want to fall back:


output = await self.graph.ainvoke(Command(resume=None), config=config)

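The human_agent node detects that the resume value is None and falls back to its LLM-drafted response (the LLM originally wrote it; the intervention was just to review, not necessarily override). If your LangGraph version treats a None resume value as “no resume supplied” and stays paused, send an explicit payload such as {"approved_response": None} instead.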
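
The fallback decision inside the node can be sketched as a pure function. resolve_outbound_text is a hypothetical name, and treating blank text as "no override" is an assumption rather than documented project behaviour.

```python
from typing import Any, Optional


def resolve_outbound_text(resume_value: Any, llm_draft: str) -> str:
    """Pick the text to send: the human's edit if present, else the LLM draft."""
    approved: Optional[str] = None
    if isinstance(resume_value, dict):
        approved = resume_value.get("approved_response")
    # None, a missing key, or blank text all count as "no override".
    if approved is None or not approved.strip():
        return llm_draft
    return approved
```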

Common Pitfalls

Calling ainvoke(None, config=...) instead of ainvoke(Command(resume={...}), ...) supplies no resume payload, so the human edit never reaches the human_agent node. Depending on how the pause was implemented, the node either pauses again at interrupt(...) or proceeds and sends the LLM draft unedited.

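Losing escalation metadata when the resumed node returns a fresh state.metadata dict that replaces the old one. Have the node merge the reviewer's metadata into the existing dict (or give the field a merging reducer) so the state is patched, not overwritten.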

Forgetting to mark Conversation.is_escalated = False after a successful intervention — the conversation stays flagged as “needs human” forever.

Real-World Interview Prep

Q1: Why does the intervention method explicitly persist via `MessageService` instead of relying on the workflow to write?

A: Because the workflow pauses before the human_agent node runs; if you let the workflow write after resume, you’d be recording a “human agent response” without the human’s edit. The intervention method writes directly to preserve fidelity.

Q2: How do you prompt the human reviewer with the LLM’s draft?

A: Two options. (a) Before calling approve_intervention, your front-end calls a preview endpoint like /conversations/{id}/pending_response which returns the draft the LLM wrote. (b) Stream events from the SSE endpoint and capture the partial-draft before the pause.

Q3: What happens if the human takes too long (the SLA hits)?

A: A background job (Celery beat) sweeps stale paused graphs every 5 minutes and force-resumes with the LLM draft, marking the intervention as {"forced": True, "reason": "sla_timeout"}. The customer gets something instead of silence.
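
The selection step of such a sweep can be sketched as below. The paused_since mapping (thread id to the time the pause began) is a hypothetical input that the job would load from its own store, and the 4-hour SLA is only an example value.

```python
from datetime import datetime, timedelta, timezone

# Marker attached to force-resumed interventions, as described above.
FORCED_MARKER = {"forced": True, "reason": "sla_timeout"}


def find_stale_threads(paused_since: dict, now: datetime, sla: timedelta) -> list:
    """Return the ids of threads that have been paused for longer than the SLA."""
    return sorted(tid for tid, since in paused_since.items() if now - since > sla)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
paused_since = {
    "t1": now - timedelta(hours=5),     # past a 4-hour SLA
    "t2": now - timedelta(minutes=30),  # still within it
}
stale = find_stale_threads(paused_since, now, sla=timedelta(hours=4))
```

Passing now in as an argument, rather than reading the clock inside the function, keeps the sweep deterministic under test.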
