[pull] main from danny-avila:main #112
Merged
* feat: support Langfuse trace metadata config
- fix: ignore empty Langfuse trace attributes
- fix: satisfy Langfuse config lint
- chore: import order in langfuse-config.test.ts
Co-authored-by: Danny Avila <danacordially@gmail.com>
* ⚡ feat: Single Tail Prompt-Cache Breakpoint
Replace the rolling "last two user messages" prompt-cache strategy with a
single breakpoint anchored on the conversation tail, mirroring the approach
used by Claude Code. Anthropic/OpenRouter now place exactly one ephemeral
cache_control marker on the last cacheable block of the final non-synthetic
message; Bedrock places a single cachePoint via the new
addBedrockTailCacheControl. Because the marker always rides the true tail,
the whole prefix is written once and read back as history grows append-only,
instead of re-writing large spans every step.
- Add addTailCacheControl / addBedrockTailCacheControl (single tail marker),
skipping thinking blocks and synthetic skill/meta messages as anchors and
stripping all stale markers in one pass.
- Wire Graph (Anthropic, OpenRouter, Bedrock), AgentContext system-runnable
body path, and summarization to the tail strategy by default.
- Keep legacy addCacheControl / addBedrockCacheControl exported for
compatibility; update affected tests and add cache.tail.test.ts.
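To make the single-marker idea concrete, here is a minimal sketch of a tail-anchoring pass. It is not the actual addTailCacheControl: the `Message`/`Block` shapes, the `synthetic` flag, and the `SKIPPED_TYPES` set are simplified assumptions, and walking further back when the final non-synthetic message has no cacheable block is also an assumption.

```typescript
// Hypothetical, simplified shapes; the real message types are richer.
type CacheControl = { type: 'ephemeral' };
type Block = { type: string; text?: string; cache_control?: CacheControl };
type Message = {
  role: 'user' | 'assistant' | 'tool';
  content: Block[];
  synthetic?: boolean; // assumption: flag for skill/meta messages
};

// Assumption: block types that may not carry the marker.
const SKIPPED_TYPES = new Set<string>(['thinking', 'redacted_thinking']);

function addTailCacheControlSketch(messages: Message[]): Message[] {
  // Copy every message and strip all stale markers in the same pass.
  const out: Message[] = messages.map((m) => ({
    ...m,
    content: m.content.map((b) => {
      const copy: Block = { ...b };
      delete copy.cache_control;
      return copy;
    }),
  }));

  // Walk back from the tail, skipping synthetic messages, and mark the
  // last cacheable block of the first message that has one.
  for (let i = out.length - 1; i >= 0; i--) {
    const msg = out[i];
    if (!msg || msg.synthetic) continue;
    for (let j = msg.content.length - 1; j >= 0; j--) {
      const block = msg.content[j];
      if (block && !SKIPPED_TYPES.has(block.type)) {
        block.cache_control = { type: 'ephemeral' };
        return out; // exactly one marker in the whole list
      }
    }
  }
  return out; // nothing cacheable: no marker placed
}
```

Because the input is copied rather than mutated, the sketch can be re-run on every step of an append-only history without accumulating markers.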
* 🩹 fix: Hoist Bedrock cachePoint out of toolResult body for tail breakpoint
The single tail prompt-cache breakpoint frequently anchors on a tool
result, since agent-loop conversations end with a tool turn before the
next model call. addBedrockTailCacheControl writes the cachePoint into
the tool message content, but the Converse converter wrapped the entire
content (cachePoint included) inside toolResult.content.
A cachePoint is a message-level ContentBlock, not a ToolResultContentBlock.
Bedrock does not reject the nested form — it silently drops the breakpoint
(verified live: cache_creation/cache_read both stay 0), so the tail
strategy produced ZERO caching for the most common agent-loop shape.
Hoist any cachePoint out of toolResult.content to a message-level sibling
after the toolResult block — the only position Bedrock honors. Live
Bedrock Converse now shows the tool-result tail writing the prefix on
turn 1 (cache_creation) and reading it back on turn 2 (cache_read),
matching the Anthropic-direct behavior.
- Hoist cachePoint(s) in convertToolMessageToConverseMessage.
- Add toolResultCachePoint.test.ts (converter hoist + end-to-end).
- Add cache.tail.test.ts case for a trailing string tool-result tail.
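The hoist can be illustrated with a small sketch. This is not the real convertToolMessageToConverseMessage: the function name `hoistCachePoints` and the reduced block types are hypothetical, and only text items are modeled inside the tool result. Whether the real converter keeps several hoisted cachePoints or collapses them to one is not stated here, so the sketch simply keeps them all.

```typescript
// Hypothetical, reduced shapes modeled on Bedrock Converse content blocks.
type CachePoint = { type: 'default' };
type ToolContentItem = { text?: string; cachePoint?: CachePoint };
type ConverseBlock =
  | { toolResult: { toolUseId: string; content: { text: string }[] } }
  | { cachePoint: CachePoint };

function hoistCachePoints(
  toolUseId: string,
  content: ToolContentItem[],
): ConverseBlock[] {
  const inner: { text: string }[] = [];
  const hoisted: ConverseBlock[] = [];
  for (const item of content) {
    if (item.cachePoint) {
      // Message-level sibling: the position described above as honored.
      hoisted.push({ cachePoint: item.cachePoint });
    } else if (item.text !== undefined) {
      inner.push({ text: item.text });
    }
  }
  // The toolResult block comes first; hoisted cachePoints follow it.
  return [{ toolResult: { toolUseId, content: inner } }, ...hoisted];
}
```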
* 🩹 fix: Keep tail cache breakpoint on a block that survives conversion
Two edge cases dropped the single tail breakpoint before the model call,
silently regressing to zero message caching (the legacy strategy marked human
messages, which avoided both paths):
1. Foreign reasoning tail (Anthropic/OpenRouter): isTailCacheableBlock only
excluded native `thinking`/`redacted_thinking`, so on a cross-provider
handoff the marker could anchor on a `reasoning_content`/`reasoning`/
`think` block — which _convertMessagesToAnthropicPayload drops on
assistant turns. The only breakpoint vanished. Now exclude foreign
reasoning types from tail anchoring so the marker lands on a surviving
text/tool block.
2. Thinking-fold ordering: the tail marker was placed before
ensureThinkingBlockInMessages, which folds a trailing non-thinking AI→Tool
chain into a `[Previous agent context]` HumanMessage whose builder copies
text but not cache_control/cachePoint. Move the provider-specific tail
cache insertion (Anthropic, Bedrock, OpenRouter) to run LAST — after
thinking normalization and orphan sanitization — so it anchors on the
final message list.
Verified by inspecting the final _convertMessagesToAnthropicPayload output:
the breakpoint now survives in both cases (and a guard test asserts the old
mark-before-fold order loses it).
- Exclude reasoning_content/reasoning/think in isTailCacheableBlock.
- Reorder tail cache insertion after ensureThinkingBlock/sanitizeOrphan in Graph.
- Add tailCacheConversion.test.ts and foreign-reasoning cases in cache.tail.test.ts.
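The ordering problem in case 2 can be reproduced with a toy pipeline. Everything below is a stand-in: `foldTrailingChain` always folds the trailing non-user chain and copies text only, which merely imitates the relevant property of ensureThinkingBlockInMessages, and `markTail` is a bare-bones marker rather than the real tail helper.

```typescript
type Block = { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } };
type Msg = { role: 'user' | 'assistant' | 'tool'; content: Block[] };

// Stand-in fold: collapse the trailing assistant/tool chain into one user
// message, copying text but not cache markers.
function foldTrailingChain(messages: Msg[]): Msg[] {
  let start = messages.length;
  while (start > 0 && messages[start - 1]?.role !== 'user') start--;
  if (start === messages.length) return messages;
  const text = messages
    .slice(start)
    .map((m) => m.content.map((b) => b.text).join('\n'))
    .join('\n');
  const folded: Msg = {
    role: 'user',
    content: [{ type: 'text', text: `[Previous agent context]\n${text}` }],
  };
  return [...messages.slice(0, start), folded];
}

// Stand-in marker: tag the last block of the last message on a copy.
function markTail(messages: Msg[]): Msg[] {
  const out: Msg[] = messages.map((m) => ({
    ...m,
    content: m.content.map((b) => ({ ...b })),
  }));
  const last = out[out.length - 1];
  const block = last?.content[last.content.length - 1];
  if (block) block.cache_control = { type: 'ephemeral' };
  return out;
}

const hasMarker = (messages: Msg[]): boolean =>
  messages.some((m) => m.content.some((b) => b.cache_control !== undefined));
```

With a history ending in an assistant turn followed by a tool turn, `hasMarker(foldTrailingChain(markTail(history)))` is false while `hasMarker(markTail(foldTrailingChain(history)))` is true, which is the reason for running the marker last.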
* 🩹 fix: Harden tail prompt-cache anchor against dropped/stripped tails
Three more cases where the single tail breakpoint failed to reach the model;
all stem from anchoring on a volatile tail that a later stage drops/rewrites.
1. input_json_delta anchor (Anthropic/OpenRouter): persisted partial tool-input
deltas are dropped by _convertMessagesToAnthropicPayload (input is restored
onto the tool_use block). Anchoring the marker there lost it. Excluded
input_json_delta from tail anchoring (joins the reasoning types), renaming
the set to NON_ANCHORABLE_BLOCK_TYPES.
2. toolOutputReferences annotation (functional regression): prompt caching
rewrites a string ToolMessage tail into a text-block array to host its
marker; annotateMessagesForLLM only applied the live `[ref: …]` annotation
to STRING tool content, so the common tool-result tail silently lost its
reference marker once cached. annotateMessagesForLLM now projects the live
ref (and unresolved warning) onto array tool content too.
3. assistant-prefill strip (Claude 4.6+): stripUnsupportedAssistantPrefill pops
a trailing assistant prefill right before the API call; if the only tail
breakpoint rode it, message caching was lost. It now re-anchors the
breakpoint onto the new tail (only when one was actually removed, so
caching-off requests stay untouched), reusing addTailCacheControl to honor
the same exclusions.
Tests: stripPrefillCache.test.ts (re-anchor); array live-ref cases in
annotateMessagesForLLM.test.ts; input_json_delta is covered by the
NON_ANCHORABLE_BLOCK_TYPES exclusion. tsc + lint clean.
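A rough sketch of case 3 follows, under several assumptions: the function name is hypothetical, the strip is unconditional (the real stripUnsupportedAssistantPrefill presumably decides per model), and the set contents are assembled from the block types named in these commit messages, so whether native thinking types live in the same set is a guess.

```typescript
type CacheControl = { type: 'ephemeral' };
type Block = { type: string; text?: string; cache_control?: CacheControl };
type Msg = { role: 'user' | 'assistant'; content: Block[] };

// Assumption: membership assembled from the types named in the text above.
const NON_ANCHORABLE_BLOCK_TYPES = new Set<string>([
  'thinking', 'redacted_thinking',
  'reasoning_content', 'reasoning', 'think',
  'input_json_delta',
]);

function stripPrefillAndReanchor(messages: Msg[]): Msg[] {
  const last = messages[messages.length - 1];
  if (!last || last.role !== 'assistant') return messages;

  // Drop the trailing assistant prefill; copy the rest so input is untouched.
  const remaining: Msg[] = messages.slice(0, -1).map((m) => ({
    ...m,
    content: m.content.map((b) => ({ ...b })),
  }));

  // Re-anchor only if the removed message actually carried a marker, so
  // requests with caching off are left without one.
  const removedMarker = last.content.some((b) => b.cache_control !== undefined);
  if (!removedMarker) return remaining;

  const tail = remaining[remaining.length - 1];
  if (!tail) return remaining;
  for (let j = tail.content.length - 1; j >= 0; j--) {
    const block = tail.content[j];
    if (block && !NON_ANCHORABLE_BLOCK_TYPES.has(block.type)) {
      block.cache_control = { type: 'ephemeral' };
      break;
    }
  }
  return remaining;
}
```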
* 🩹 fix: Hoist Anthropic tool_result cache_control onto the top-level block
The single tail breakpoint frequently anchors on a tool result. For a string
ToolMessage tail, addTailCacheControl rewrites it to a text-block array carrying
cache_control, and _ensureMessageContents nests that block inside
tool_result.content. The Anthropic API currently honors that nested marker —
verified live with an isolated, system-prompt-free large tool result (control
no-marker => cache_creation 0; nested marker => 10232 written then read) — so it
is not broken today. But Anthropic documents the top-level messages.content
block as the cacheable position and does not document sub-content caching, so
relying on the nested form is fragile.
Hoist any cache_control off the inner tool-result content onto the generated
tool_result block itself (mirrors the Bedrock cachePoint hoist). Live-verified
end to end: control no-marker => cache_creation 0; hoisted marker => 12354
written on turn 1, read on turn 2.
- Add hoistToolResultCacheControl; apply it in _ensureMessageContents.
- tailCacheConversion.test.ts now asserts the marker lands on the tool_result
block, not nested.
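A minimal sketch of the hoist, assuming a reduced `tool_result` shape with text-only inner content. The function name carries a `Sketch` suffix because this is an illustration and not the hoistToolResultCacheControl added in this commit.

```typescript
type CacheControl = { type: 'ephemeral' };
type InnerBlock = { type: 'text'; text: string; cache_control?: CacheControl };
type ToolResultBlock = {
  type: 'tool_result';
  tool_use_id: string;
  content: InnerBlock[];
  cache_control?: CacheControl;
};

function hoistToolResultCacheControlSketch(block: ToolResultBlock): ToolResultBlock {
  // Keep a marker that is already on the top-level block, if any.
  let hoisted: CacheControl | undefined = block.cache_control;
  const content: InnerBlock[] = block.content.map((inner) => {
    if (inner.cache_control === undefined) return inner;
    hoisted = hoisted ?? inner.cache_control;
    // Rebuild the inner block without its nested marker.
    return { type: inner.type, text: inner.text };
  });
  return hoisted
    ? { ...block, content, cache_control: hoisted }
    : { ...block, content };
}
```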
* 🩹 fix: Keep orphan sanitization enabled for prompt-cached sends
Moving the tail cache marker to run after sanitizeOrphanToolBlocks (so the
marker survives the thinking fold) had a side effect: the marker no longer
reassigns finalMessages before the `needsOrphanSanitize` gate is evaluated. For
a prompt-cached Anthropic/Bedrock send whose pruner returned the context
unchanged (finalMessages === messagesToUse), the gate went false and orphaned
AI/tool pairs from persisted history could reach the provider and fail
structural validation — whereas the pre-move code always reassigned first.
Compute the prompt-cache strategy up front and add `willAddTailCache` to the
sanitize gate, so cached sends are cleaned before the marker is applied
(restoring the pre-move guarantee). Collapses the cache-insertion branch to the
same up-front booleans.
* 🩹 fix: Orphan-sanitize system-runnable prompt-cached sends too
The previous gate used "this node will add the marker" (which excludes the
system-runnable path via !systemRunnable). But when a system runnable owns the
system prompt, AgentContext still adds the body cache marker — so those are
cached sends that must be orphan-sanitized as well. With prompt caching +
system runnable + a pruner that returned the context unchanged, orphaned
AI/tool pairs from persisted history could still reach the provider.
Track two separate facts: `providerPromptCacheEnabled` (caching is on for the
provider at all — drives orphan cleanup, system-runnable included) vs. the
node-adds-the-marker condition (Anthropic/OpenRouter minus systemRunnable, or
Bedrock — drives the insertion). The sanitize gate now uses the former.
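The two facts can be written down as plain booleans. The sketch below is an assumption-laden reduction: `GateInput`, `computeGates`, `nodeAddsTailMarker`, and the `prunerChangedContext` input are hypothetical names, and the real sanitize gate may include conditions not mentioned here.

```typescript
type Provider = 'anthropic' | 'openrouter' | 'bedrock' | 'other';

interface GateInput {
  provider: Provider;
  promptCache: boolean;          // caching requested for this send
  systemRunnable: boolean;       // a system runnable owns the system prompt
  prunerChangedContext: boolean; // finalMessages !== messagesToUse
}

function computeGates(input: GateInput) {
  // Caching is on for the provider at all (system-runnable path included).
  const providerPromptCacheEnabled =
    input.promptCache && input.provider !== 'other';

  // This node inserts the marker: Anthropic/OpenRouter minus systemRunnable,
  // or Bedrock.
  const nodeAddsTailMarker =
    providerPromptCacheEnabled &&
    (input.provider === 'bedrock' || !input.systemRunnable);

  // The sanitize gate keys off the broader fact, not the insertion condition.
  const needsOrphanSanitize =
    input.prunerChangedContext || providerPromptCacheEnabled;

  return { providerPromptCacheEnabled, nodeAddsTailMarker, needsOrphanSanitize };
}
```

The case this commit fixes is `{ provider: 'anthropic', promptCache: true, systemRunnable: true, prunerChangedContext: false }`: the node does not add the marker, yet the send is still cached and so still sanitized.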
* 🩹 fix: Break import cycle from the prefill re-anchor
The P3-1 re-anchor imported addTailCacheControl from @/messages/cache into the
Anthropic converter, closing a cycle:
messages/format.ts -> llm/anthropic/utils/message_inputs.ts
-> messages/cache.ts -> messages/format.ts
which the bundler's circular-dependency check (npm run build:dev) flags.
Replace the cross-module reuse with a small local re-anchor that operates on the
already-converted Anthropic payload. This is also more correct: at that stage
the converter has already dropped foreign-reasoning / input_json_delta blocks,
so only native thinking blocks need excluding, and the post-strip tail is always
a user message. Live-reverified: turn1 cache_creation=6264, turn2 read=6264.
* 📊 test: Live reproducible prompt-cache benchmark (tail vs legacy)
Add a committed, live benchmark that empirically justifies the single tail
breakpoint over the legacy "last two user messages" strategy, plus a doc with
representative results.
bench-prompt-cache.ts replays three realistic harness shapes (agent tool loop,
multi-turn chat, realistic agent) under BOTH strategies over the same
conversations in separate cache namespaces, against a real provider, and reports
per-call cache token breakdowns. `fresh` (uncached, full-price input) is derived
provider-agnostically from total_tokens-output_tokens minus the cache buckets,
since Anthropic folds cache tokens into input_tokens while Bedrock reports them
separately.
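The `fresh` derivation described above amounts to one subtraction. In the sketch below the `UsageSketch` field names for the cache buckets are assumptions (the benchmark's own field names are not shown here), the clamp at zero is a defensive addition, and the numbers in the usage note are illustrative, not benchmark results.

```typescript
// Hypothetical normalized usage record for one model call.
interface UsageSketch {
  total_tokens: number;
  output_tokens: number;
  cache_read: number;     // tokens read back from the cache
  cache_creation: number; // tokens written to the cache
}

// fresh = total - output - cache buckets: input tokens billed at full price.
function freshTokens(u: UsageSketch): number {
  const fresh = u.total_tokens - u.output_tokens - u.cache_read - u.cache_creation;
  return Math.max(0, fresh); // clamp is an assumption, guarding odd reports
}
```

For example, a call with 10500 total tokens, 500 output tokens, 6000 cache-read tokens, and 3000 cache-creation tokens leaves 1000 fresh tokens.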
Result (live, claude-sonnet-4-5): the tail strategy is cheaper in every scenario
on both Anthropic and Bedrock. Legacy reprocesses tens of thousands of
full-price tokens in any tool-bearing conversation (its lone user-message marker
leaves the growing transcript uncached); tail reduces that to ~0 and reads the
prefix back. Effective cost −30..−38% (Anthropic), −9..−15% (Bedrock); even in
legacy's best case (frequent user messages), tail ties or wins.
- src/scripts/bench-prompt-cache.ts (excluded from build/CI; real paid calls)
- npm run bench:cache [-- --provider bedrock|anthropic --rounds N --model id]
- docs/prompt-cache-benchmark.md
* 📊 test: Add post-compaction scenario to the prompt-cache benchmark
Covers the two transcript-mutating harness behaviors raised in review:
- Tool truncation: a non-issue for caching — applied once at tool-exec with a
model-fixed (turn-invariant) cap by the already-tested, deterministic
truncateToolResultContent, so a truncated result is a stable prefix block.
Documented; no separate scenario needed (existing tool-loop already exercises
tool results in the cached prefix).
- Compaction (summarization): add a post-compaction scenario — a few pre-
compaction tool rounds, a head→summary swap (one-time cache miss for any
strategy), then continued tool rounds. Confirms the tail strategy
re-establishes append-only caching on the new summary-headed prefix.
Live result (claude-sonnet-4-5): tail wins 4/4 scenarios on BOTH Anthropic and
Bedrock. Post-compaction is among the largest wins (Anthropic effective −41%,
read +76%) because after compaction the summary is the only user message, so
legacy re-sends all continued tool work uncached (fresh 63k → 42).
docs/prompt-cache-benchmark.md updated with the 4-scenario tables and a
truncation/compaction section.
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)