Skip to content

Recover the ReAct loop from MAX_TOKENS truncation mid-tool-call#24260

Open
AAraKKe wants to merge 5 commits into
loa/openmetrics-ai-genfrom
aarakke/react-truncated-tool-call
Open

Recover the ReAct loop from MAX_TOKENS truncation mid-tool-call#24260
AAraKKe wants to merge 5 commits into
loa/openmetrics-ai-genfrom
aarakke/react-truncated-tool-call

Conversation

@AAraKKe

@AAraKKe AAraKKe commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?

Makes the ReAct loop recover when a model turn stops on MAX_TOKENS while a tool call is still pending.

Previously the loop only fed tool results back when stop_reason == TOOL_USE. If a turn was truncated by the output token limit mid-tool-call, the loop exited leaving a dangling tool_use block in the agent's history. The next send() replayed that history and the provider rejected the request with a 400 (tool_use ids found without a following tool_result).

Changes:

  • The loop now continues while a tool call is pending (keying off tool_calls rather than the stop reason).
  • On a truncated turn, each pending tool_use is answered with a synthetic failure tool_result instead of being executed — the call's input is incomplete and unsafe to run. This keeps the conversation valid and prompts the model to retry with a smaller change.
  • A guard aborts after MAX_CONSECUTIVE_TRUNCATIONS back-to-back truncated turns to avoid an unrecoverable loop.
  • edit_file tool description nudges toward small, targeted edits to reduce the chance of hitting the limit in the first place.

Motivation

A ddev meta ai flow run failed with 400 ... tool_use ids were found without tool_result blocks. Root cause was a turn truncated at the 8192-output-token limit while emitting a large edit_file call: the unexecuted, dangling tool_use then poisoned the next request (the memory step). This is a harness-level bug independent of any particular flow.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

When a turn stops on MAX_TOKENS while a tool call is pending, the tool_use
block is truncated and never executed. The loop previously exited on any
non-TOOL_USE stop reason, leaving a dangling tool_use in history; the next
send() then replays it and the provider rejects the request with a 400
(tool_use without a following tool_result).

The loop now continues while a tool call is pending and, on a truncated turn,
answers each pending tool_use with a synthetic failure result instead of
executing it (the call's input is incomplete and unsafe to run). This repairs
the conversation and prompts the model to retry with a smaller change. A guard
aborts after repeated consecutive truncations to avoid an unrecoverable loop.

Also nudges toward smaller edits in the edit_file tool description.
@AAraKKe AAraKKe added the qa/skip-qa Automatically skip this PR for the next QA label Jun 30, 2026
@dd-octo-sts dd-octo-sts Bot added ddev and removed qa/skip-qa Automatically skip this PR for the next QA labels Jun 30, 2026
@datadog-datadog-prod-us1

datadog-datadog-prod-us1 Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Tests  Code Coverage

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 98.13%
Overall Coverage: 88.45%

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: c237938 | Docs | Datadog PR Page | Give us feedback!

@AAraKKe AAraKKe added the qa/skip-qa Automatically skip this PR for the next QA label Jun 30, 2026
native_tool_names returns an immutable tuple; assert against a tuple instead
of a list.
@AAraKKe AAraKKe force-pushed the aarakke/react-truncated-tool-call branch from 78a32fd to 6927671 Compare June 30, 2026 14:20
@AAraKKe AAraKKe marked this pull request as ready for review June 30, 2026 14:22
@AAraKKe AAraKKe requested a review from a team as a code owner June 30, 2026 14:22

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6927671c05

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +143 to +144
while response.tool_calls:
truncated = response.stop_reason == StopReason.MAX_TOKENS

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve pending max-token tool calls before compacting

When a follow-up response contains tool_calls with StopReason.MAX_TOKENS, this new loop keeps it pending, but the auto-compaction block below can still run before the synthetic tool_results are sent. If that response is over the compaction threshold, compact(response) treats it as a normal non-TOOL_USE turn and sends a summary request after an unresolved tool_use, which recreates the provider 400 this change is trying to avoid; this occurs on truncated retries or any tool-result turn that truncates while requesting another tool in a high-context conversation.

Useful? React with 👍 / 👎.

@luisorofino luisorofino left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes!

Comment thread ddev/src/ddev/ai/react/process.py Outdated
"maximum output token limit, so the tool call is incomplete. Retry with a smaller, more "
"targeted change — edit a single small unique region instead of rewriting a whole file, or "
"split the work across several sequential tool calls. For a full-file rewrite prefer "
"create_file over one huge edit_file."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we don't have any way of easily rewriting an entire file. The only way is using edit_file, since create_file only creates a file if it doesn't exist. Maybe, as a follow up, we could include a flag in create_file to allow overwriting if it already exists.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem still persist, it is that if it wants to edit (or create) a file that is too big it will run into max_tokens error. We could add an overwrite option to write a file but if we do not mention the token limit that might be an issue.

Maybe, since we know the max tokens set, we could include that into the prompt itself? Inejct it as a guideline for the agent not know not to do weird things?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, saw your last commit and looks good. Thank you!

AAraKKe added 3 commits July 1, 2026 14:26
…d pending calls

- compact() now checks for pending tool_calls instead of stop_reason == TOOL_USE,
  so a MAX_TOKENS turn with a dangling tool_use also preserves the last turn
  through compaction instead of replaying it unresolved and reproducing the
  provider 400 this recovery path exists to avoid
- Add BaseTool.truncated_call_hint so edit_file, create_file, and append_file
  each give guidance suited to what they do when truncated, instead of one
  generic message; ToolRegistry.get() looks up the failing tool by name
- Type tool_calls as list[ToolCall] instead of bare list
…gistry fixtures

ToolRegistry has no external dependencies, so mocking it added nothing but a
second, drifting reimplementation of its dispatch logic. Three files each had
their own ad hoc FakeTool/MockToolRegistry variants; consolidate into one
FakeTool plus fake_tool/tool_registry factory fixtures in tests/ai/conftest.py,
used with the real ToolRegistry everywhere.

- Add tests/ai/conftest.py: FakeTool double, fake_tool and tool_registry
  factory fixtures. FakeTool is not meant to be imported directly by tests.
- test_process.py: drop MockToolRegistry/RaisingToolRegistry/PerToolRegistry,
  migrate every test to fake_tool/tool_registry with a real ToolRegistry
- test_registry.py: drop its own FakeTool, use the shared fixture
- test_anthropic_client.py: drop its own FakeTool, use the shared fixture
- test_spawn_subagent.py: drop the MockToolRegistry(ToolRegistry) subclass;
  FakeProcessFactory now takes a real tool_registry instead of a canned result
- Add truncated_call_hint tests for the real EditFileTool/CreateFileTool/
  AppendFileTool, and ToolRegistry.get() tests
- Add type annotations to every touched test signature
…ed factory

ToolRegistry([...]) is a one-liner; wrapping it in a "tool_registry" fixture
added a layer that did nothing. Call ToolRegistry directly wherever a
registry is needed.

fake_tool is now a typed factory function (not the bare class) with real
default values, so fake_tool() alone produces a usable tool. Its callable
signature is expressed as a Protocol (FakeToolFactory) so tests get proper
named/defaulted parameter types instead of Callable[..., FakeTool].

Also fixes a real bug this refactor surfaced: tests annotating parameters
with TYPE_CHECKING-only imports (FakeTool/FakeToolFactory) need
`from __future__ import annotations`, otherwise the annotation is evaluated
eagerly at def time and raises NameError since the import never executes at
runtime. Added it to every file using this pattern, matching the existing
precedent in anthropic_client.py. Also fixed three tests in
test_anthropic_client.py that still referenced FakeTool directly in the test
body instead of the fake_tool fixture.
@dd-octo-sts

dd-octo-sts Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Validation Report

All 21 validations passed.

Show details
Validation Description Status
agent-reqs Verify check versions match the Agent requirements file
ci Validate CI configuration and code coverage settings
codeowners Validate every integration has a CODEOWNERS entry
config Validate default configuration files against spec.yaml
dep Verify dependency pins are consistent and Agent-compatible
http Validate integrations use the HTTP wrapper correctly
imports Validate check imports do not use deprecated modules
integration-style Validate check code style conventions
jmx-metrics Validate JMX metrics definition files and config
labeler Validate PR labeler config matches integration directories
legacy-signature Validate no integration uses the legacy Agent check signature
license-headers Validate Python files have proper license headers
licenses Validate third-party license attribution list
metadata Validate metadata.csv metric definitions
models Validate configuration data models match spec.yaml
openmetrics Validate OpenMetrics integrations disable the metric limit
package Validate Python package metadata and naming
qa-label Validate the pull request declares whether it needs QA for the next Agent release
readmes Validate README files have required sections
saved-views Validate saved view JSON file structure and fields
version Validate version consistency between package and changelog

View full run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ddev qa/skip-qa Automatically skip this PR for the next QA team/agent-integrations

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants