Recover the ReAct loop from MAX_TOKENS truncation mid-tool-call#24260
Recover the ReAct loop from MAX_TOKENS truncation mid-tool-call#24260AAraKKe wants to merge 5 commits into
Conversation
When a turn stops on MAX_TOKENS while a tool call is pending, the tool_use block is truncated and never executed. The loop previously exited on any non-TOOL_USE stop reason, leaving a dangling tool_use in history; the next send() then replays it and the provider rejects the request with a 400 (tool_use without a following tool_result). The loop now continues while a tool call is pending and, on a truncated turn, answers each pending tool_use with a synthetic failure result instead of executing it (the call's input is incomplete and unsafe to run). This repairs the conversation and prompts the model to retry with a smaller change. A guard aborts after repeated consecutive truncations to avoid an unrecoverable loop. Also nudges toward smaller edits in the edit_file tool description.
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: c237938 | Docs | Datadog PR Page | Give us feedback! |
native_tool_names returns an immutable tuple; assert against a tuple instead of a list.
78a32fd to
6927671
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6927671c05
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| while response.tool_calls: | ||
| truncated = response.stop_reason == StopReason.MAX_TOKENS |
There was a problem hiding this comment.
Preserve pending max-token tool calls before compacting
When a follow-up response contains tool_calls with StopReason.MAX_TOKENS, this new loop keeps it pending, but the auto-compaction block below can still run before the synthetic tool_results are sent. If that response is over the compaction threshold, compact(response) treats it as a normal non-TOOL_USE turn and sends a summary request after an unresolved tool_use, which recreates the provider 400 this change is trying to avoid; this occurs on truncated retries or any tool-result turn that truncates while requesting another tool in a high-context conversation.
Useful? React with 👍 / 👎.
luisorofino
left a comment
There was a problem hiding this comment.
Thank you for the changes!
| "maximum output token limit, so the tool call is incomplete. Retry with a smaller, more " | ||
| "targeted change — edit a single small unique region instead of rewriting a whole file, or " | ||
| "split the work across several sequential tool calls. For a full-file rewrite prefer " | ||
| "create_file over one huge edit_file." |
There was a problem hiding this comment.
Currently, we don't have any way of easily rewriting an entire file. The only way is using edit_file, since create_file only creates a file if it doesn't exist. Maybe, as a follow up, we could include a flag in create_file to allow overwriting if it already exists.
There was a problem hiding this comment.
The problem still persist, it is that if it wants to edit (or create) a file that is too big it will run into max_tokens error. We could add an overwrite option to write a file but if we do not mention the token limit that might be an issue.
Maybe, since we know the max tokens set, we could include that into the prompt itself? Inejct it as a guideline for the agent not know not to do weird things?
There was a problem hiding this comment.
Makes sense, saw your last commit and looks good. Thank you!
…d pending calls - compact() now checks for pending tool_calls instead of stop_reason == TOOL_USE, so a MAX_TOKENS turn with a dangling tool_use also preserves the last turn through compaction instead of replaying it unresolved and reproducing the provider 400 this recovery path exists to avoid - Add BaseTool.truncated_call_hint so edit_file, create_file, and append_file each give guidance suited to what they do when truncated, instead of one generic message; ToolRegistry.get() looks up the failing tool by name - Type tool_calls as list[ToolCall] instead of bare list
…gistry fixtures ToolRegistry has no external dependencies, so mocking it added nothing but a second, drifting reimplementation of its dispatch logic. Three files each had their own ad hoc FakeTool/MockToolRegistry variants; consolidate into one FakeTool plus fake_tool/tool_registry factory fixtures in tests/ai/conftest.py, used with the real ToolRegistry everywhere. - Add tests/ai/conftest.py: FakeTool double, fake_tool and tool_registry factory fixtures. FakeTool is not meant to be imported directly by tests. - test_process.py: drop MockToolRegistry/RaisingToolRegistry/PerToolRegistry, migrate every test to fake_tool/tool_registry with a real ToolRegistry - test_registry.py: drop its own FakeTool, use the shared fixture - test_anthropic_client.py: drop its own FakeTool, use the shared fixture - test_spawn_subagent.py: drop the MockToolRegistry(ToolRegistry) subclass; FakeProcessFactory now takes a real tool_registry instead of a canned result - Add truncated_call_hint tests for the real EditFileTool/CreateFileTool/ AppendFileTool, and ToolRegistry.get() tests - Add type annotations to every touched test signature
…ed factory ToolRegistry([...]) is a one-liner; wrapping it in a "tool_registry" fixture added a layer that did nothing. Call ToolRegistry directly wherever a registry is needed. fake_tool is now a typed factory function (not the bare class) with real default values, so fake_tool() alone produces a usable tool. Its callable signature is expressed as a Protocol (FakeToolFactory) so tests get proper named/defaulted parameter types instead of Callable[..., FakeTool]. Also fixes a real bug this refactor surfaced: tests annotating parameters with TYPE_CHECKING-only imports (FakeTool/FakeToolFactory) need `from __future__ import annotations`, otherwise the annotation is evaluated eagerly at def time and raises NameError since the import never executes at runtime. Added it to every file using this pattern, matching the existing precedent in anthropic_client.py. Also fixed three tests in test_anthropic_client.py that still referenced FakeTool directly in the test body instead of the fake_tool fixture.
Validation ReportAll 21 validations passed. Show details
|
What does this PR do?
Makes the ReAct loop recover when a model turn stops on
MAX_TOKENSwhile a tool call is still pending.Previously the loop only fed tool results back when
stop_reason == TOOL_USE. If a turn was truncated by the output token limit mid-tool-call, the loop exited leaving a danglingtool_useblock in the agent's history. The nextsend()replayed that history and the provider rejected the request with a 400 (tool_useids found without a followingtool_result).Changes:
tool_callsrather than the stop reason).tool_useis answered with a synthetic failuretool_resultinstead of being executed — the call's input is incomplete and unsafe to run. This keeps the conversation valid and prompts the model to retry with a smaller change.MAX_CONSECUTIVE_TRUNCATIONSback-to-back truncated turns to avoid an unrecoverable loop.edit_filetool description nudges toward small, targeted edits to reduce the chance of hitting the limit in the first place.Motivation
A
ddev meta aiflow run failed with400 ... tool_use ids were found without tool_result blocks. Root cause was a turn truncated at the 8192-output-token limit while emitting a largeedit_filecall: the unexecuted, danglingtool_usethen poisoned the next request (the memory step). This is a harness-level bug independent of any particular flow.Review checklist (to be filled by reviewers)
qa/requiredif this PR needs QA validation, orqa/skip-qaif it does not. Exactly one of the two is required.backport/<branch-name>label to the PR and it will automatically open a backport PR once this one is merged