
[pull] main from danny-avila:main #112

Merged
pull[bot] merged 2 commits into innFactory:main from danny-avila:main
Jun 17, 2026

@pull pull Bot commented Jun 17, 2026


See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


upman and others added 2 commits June 17, 2026 13:34
* feat: support Langfuse trace metadata config

* fix: ignore empty Langfuse trace attributes

* fix: satisfy Langfuse config lint

* chore: import order in langfuse-config.test.ts

---------

Co-authored-by: Danny Avila <danacordially@gmail.com>
* ⚡ feat: Single Tail Prompt-Cache Breakpoint

Replace the rolling "last two user messages" prompt-cache strategy with a
single breakpoint anchored on the conversation tail, mirroring the approach
used by Claude Code. Anthropic/OpenRouter now place exactly one ephemeral
cache_control marker on the last cacheable block of the final non-synthetic
message; Bedrock places a single cachePoint via the new
addBedrockTailCacheControl. Because the marker always rides the true tail,
the whole prefix is written once and read back as history grows append-only,
instead of re-writing large spans every step.

- Add addTailCacheControl / addBedrockTailCacheControl (single tail marker),
  skipping thinking blocks and synthetic skill/meta messages as anchors and
  stripping all stale markers in one pass.
- Wire Graph (Anthropic, OpenRouter, Bedrock), AgentContext system-runnable
  body path, and summarization to the tail strategy by default.
- Keep legacy addCacheControl / addBedrockCacheControl exported for
  compatibility; update affected tests and add cache.tail.test.ts.
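As a rough illustration of the single-tail idea described above, the sketch below strips every stale marker and then places one ephemeral `cache_control` on the last cacheable block of the final non-synthetic message. The `Block` and `Message` shapes, the `synthetic` flag, and the name `addTailMarkerSketch` are hypothetical simplifications, not the actual `addTailCacheControl` implementation. What happens when the final message has no cacheable block is also an assumption here (the sketch simply returns without a marker).

```typescript
// Hypothetical, simplified shapes; the real message types are richer.
type Block = { type: string; text?: string; cache_control?: { type: 'ephemeral' } };
type Message = { role: 'user' | 'assistant' | 'tool'; content: Block[]; synthetic?: boolean };

// Block types this sketch refuses to anchor on (native thinking only).
const SKIPPED_TYPES = new Set(['thinking', 'redacted_thinking']);

function addTailMarkerSketch(messages: Message[]): Message[] {
  // Copy every block and strip any stale marker in one pass.
  const out: Message[] = messages.map((m) => ({
    ...m,
    content: m.content.map((b) => {
      const copy: Block = { ...b };
      delete copy.cache_control;
      return copy;
    }),
  }));

  // Walk backwards to the final non-synthetic message.
  for (let i = out.length - 1; i >= 0; i--) {
    const msg = out[i];
    if (!msg || msg.synthetic) continue;
    // Mark its last cacheable block, then stop.
    for (let j = msg.content.length - 1; j >= 0; j--) {
      const block = msg.content[j];
      if (block && !SKIPPED_TYPES.has(block.type)) {
        block.cache_control = { type: 'ephemeral' };
        break;
      }
    }
    // Assumption: only the final non-synthetic message is considered.
    return out;
  }
  return out;
}
```

Because the input is copied rather than mutated, calling this on each step always yields exactly zero or one marker, regardless of what earlier steps left behind.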

* 🩹 fix: Hoist Bedrock cachePoint out of toolResult body for tail breakpoint

The single tail prompt-cache breakpoint frequently anchors on a tool
result, since agent-loop conversations end with a tool turn before the
next model call. addBedrockTailCacheControl writes the cachePoint into
the tool message content, but the Converse converter wrapped the entire
content (cachePoint included) inside toolResult.content.

A cachePoint is a message-level ContentBlock, not a ToolResultContentBlock.
Bedrock does not reject the nested form — it silently drops the breakpoint
(verified live: cache_creation/cache_read both stay 0), so the tail
strategy produced ZERO caching for the most common agent-loop shape.

Hoist any cachePoint out of toolResult.content to a message-level sibling
after the toolResult block — the only position Bedrock honors. Live
Bedrock Converse now shows the tool-result tail writing the prefix on
turn 1 (cache_creation) and reading it back on turn 2 (cache_read),
matching the Anthropic-direct behavior.

- Hoist cachePoint(s) in convertToolMessageToConverseMessage.
- Add toolResultCachePoint.test.ts (converter hoist + end-to-end).
- Add cache.tail.test.ts case for a trailing string tool-result tail.
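The hoist described above can be sketched roughly as follows. The block shapes follow the general Converse layout (`toolResult`, `cachePoint`), but the type names and the function `hoistCachePoints` are hypothetical and much simpler than `convertToolMessageToConverseMessage`.

```typescript
// Simplified Converse-style shapes; names are illustrative only.
type CachePoint = { cachePoint: { type: 'default' } };
type TextContent = { text: string };
type ToolResultBlock = { toolResult: { toolUseId: string; content: TextContent[] } };
type MessageBlock = ToolResultBlock | CachePoint;

// Split tool content into real tool-result content and cache points,
// then emit the cache points as message-level siblings after the toolResult.
function hoistCachePoints(
  toolUseId: string,
  inner: Array<TextContent | CachePoint>,
): MessageBlock[] {
  const kept: TextContent[] = [];
  const hoisted: CachePoint[] = [];
  for (const block of inner) {
    if ('cachePoint' in block) {
      hoisted.push(block);
    } else {
      kept.push(block);
    }
  }
  return [{ toolResult: { toolUseId, content: kept } }, ...hoisted];
}

const blocks = hoistCachePoints('tool-1', [
  { text: 'file contents' },
  { cachePoint: { type: 'default' } },
]);
console.log(JSON.stringify(blocks));
```

The resulting message content has the `toolResult` first and the `cachePoint` second, which is the sibling position the commit message reports Bedrock honoring.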

* 🩹 fix: Keep tail cache breakpoint on a block that survives conversion

Two edge cases dropped the single tail breakpoint before the model call,
silently regressing to zero message caching (the legacy strategy marked human
messages, which avoided both paths):

1. Foreign reasoning tail (Anthropic/OpenRouter): isTailCacheableBlock only
   excluded native `thinking`/`redacted_thinking`, so on a cross-provider
   handoff the marker could anchor on a `reasoning_content`/`reasoning`/
   `think` block — which _convertMessagesToAnthropicPayload drops on
   assistant turns. The only breakpoint vanished. Now exclude foreign
   reasoning types from tail anchoring so the marker lands on a surviving
   text/tool block.

2. Thinking-fold ordering: the tail marker was placed before
   ensureThinkingBlockInMessages, which folds a trailing non-thinking AI→Tool
   chain into a `[Previous agent context]` HumanMessage whose builder copies
   text but not cache_control/cachePoint. Move the provider-specific tail
   cache insertion (Anthropic, Bedrock, OpenRouter) to run LAST — after
   thinking normalization and orphan sanitization — so it anchors on the
   final message list.

Verified by inspecting the final _convertMessagesToAnthropicPayload output:
the breakpoint now survives in both cases (and a guard test asserts the old
mark-before-fold order loses it).

- Exclude reasoning_content/reasoning/think in isTailCacheableBlock.
- Reorder tail cache insertion after ensureThinkingBlock/sanitizeOrphan in Graph.
- Add tailCacheConversion.test.ts and foreign-reasoning cases in cache.tail.test.ts.
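To make the ordering problem in case 2 concrete, here is a toy reproduction under simplified assumptions. `foldTail` stands in for the thinking fold (it copies text but not markers) and `markTail` stands in for the tail cache insertion. Both are hypothetical helpers, not the real `ensureThinkingBlockInMessages` or `addTailCacheControl`.

```typescript
type Block = { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } };
type Msg = { role: string; content: Block[] };

// Collapse the last n messages into one human message, copying text only.
function foldTail(messages: Msg[], n: number): Msg[] {
  const head = messages.slice(0, messages.length - n);
  const tail = messages.slice(messages.length - n);
  const text = tail.flatMap((m) => m.content.map((b) => b.text)).join('\n');
  const folded: Msg = {
    role: 'user',
    content: [{ type: 'text', text: `[Previous agent context]\n${text}` }],
  };
  return [...head, folded];
}

// Put a marker on the last block of the last message.
function markTail(messages: Msg[]): Msg[] {
  return messages.map((m, i) =>
    i === messages.length - 1
      ? {
          ...m,
          content: m.content.map((b, j) =>
            j === m.content.length - 1
              ? { ...b, cache_control: { type: 'ephemeral' as const } }
              : b,
          ),
        }
      : m,
  );
}

const hasMarker = (messages: Msg[]): boolean =>
  messages.some((m) => m.content.some((b) => b.cache_control !== undefined));

const history: Msg[] = [
  { role: 'user', content: [{ type: 'text', text: 'hi' }] },
  { role: 'assistant', content: [{ type: 'text', text: 'calling tool' }] },
  { role: 'tool', content: [{ type: 'text', text: 'result' }] },
];

console.log(hasMarker(foldTail(markTail(history), 2))); // false: marker lost in the fold
console.log(hasMarker(markTail(foldTail(history, 2)))); // true: marker applied last survives
```

Running the marker step last means it always sees the message list that will actually be sent, which is the property the reorder in Graph is after.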

* 🩹 fix: Harden tail prompt-cache anchor against dropped/stripped tails

Three more cases where the single tail breakpoint failed to reach the model;
all stem from anchoring on a volatile tail that a later stage drops/rewrites.

1. input_json_delta anchor (Anthropic/OpenRouter): persisted partial tool-input
   deltas are dropped by _convertMessagesToAnthropicPayload (input is restored
   onto the tool_use block). Anchoring the marker there lost it. Excluded
   input_json_delta from tail anchoring (joins the reasoning types), renaming
   the set to NON_ANCHORABLE_BLOCK_TYPES.

2. toolOutputReferences annotation (functional regression): prompt caching
   rewrites a string ToolMessage tail into a text-block array to host its
   marker; annotateMessagesForLLM only applied the live `[ref: …]` annotation
   to STRING tool content, so the common tool-result tail silently lost its
   reference marker once cached. annotateMessagesForLLM now projects the live
   ref (and unresolved warning) onto array tool content too.

3. assistant-prefill strip (Claude 4.6+): stripUnsupportedAssistantPrefill pops
   a trailing assistant prefill right before the API call; if the only tail
   breakpoint rode it, message caching was lost. It now re-anchors the
   breakpoint onto the new tail (only when one was actually removed, so
   caching-off requests stay untouched), reusing addTailCacheControl to honor
   the same exclusions.

Tests: stripPrefillCache.test.ts (re-anchor); array live-ref cases in
annotateMessagesForLLM.test.ts; input_json_delta is covered by the
NON_ANCHORABLE_BLOCK_TYPES exclusion. tsc + lint clean.

* 🩹 fix: Hoist Anthropic tool_result cache_control onto the top-level block

The single tail breakpoint frequently anchors on a tool result. For a string
ToolMessage tail, addTailCacheControl rewrites it to a text-block array carrying
cache_control, and _ensureMessageContents nests that block inside
tool_result.content. The Anthropic API currently honors that nested marker —
verified live with an isolated, system-prompt-free large tool result (control
no-marker => cache_creation 0; nested marker => 10232 written then read) — so it
is not broken today. But Anthropic documents the top-level messages.content
block as the cacheable position and does not document sub-content caching, so
relying on the nested form is fragile.

Hoist any cache_control off the inner tool-result content onto the generated
tool_result block itself (mirrors the Bedrock cachePoint hoist). Live-verified
end to end: control no-marker => cache_creation 0; hoisted marker => 12354
written on turn 1, read on turn 2.

- Add hoistToolResultCacheControl; apply it in _ensureMessageContents.
- tailCacheConversion.test.ts now asserts the marker lands on the tool_result
  block, not nested.
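A minimal sketch of the hoist, assuming a simplified `tool_result` shape with text-only inner content. The function name `hoistToolResultSketch` and the types are hypothetical. The real `hoistToolResultCacheControl` may differ in how it handles other content types or multiple markers.

```typescript
type CacheControl = { type: 'ephemeral' };
type TextBlock = { type: 'text'; text: string; cache_control?: CacheControl };
type ToolResult = {
  type: 'tool_result';
  tool_use_id: string;
  content: TextBlock[];
  cache_control?: CacheControl;
};

// Move any cache_control found on inner content up to the tool_result block.
function hoistToolResultSketch(block: ToolResult): ToolResult {
  let hoisted: CacheControl | undefined = block.cache_control;
  const content = block.content.map((inner) => {
    if (inner.cache_control === undefined) return inner;
    hoisted = inner.cache_control;
    const copy: TextBlock = { ...inner };
    delete copy.cache_control;
    return copy;
  });
  return hoisted !== undefined
    ? { ...block, content, cache_control: hoisted }
    : { ...block, content };
}

const result = hoistToolResultSketch({
  type: 'tool_result',
  tool_use_id: 'toolu_1',
  content: [{ type: 'text', text: 'output', cache_control: { type: 'ephemeral' } }],
});
console.log(JSON.stringify(result));
```

After the call, the marker sits on the top-level block in `messages.content`, the position the commit message describes as documented, and the inner text block carries no marker.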

* 🩹 fix: Keep orphan sanitization enabled for prompt-cached sends

Moving the tail cache marker to run after sanitizeOrphanToolBlocks (so the
marker survives the thinking fold) had a side effect: the marker no longer
reassigns finalMessages before the `needsOrphanSanitize` gate is evaluated. For
a prompt-cached Anthropic/Bedrock send whose pruner returned the context
unchanged (finalMessages === messagesToUse), the gate went false and orphaned
AI/tool pairs from persisted history could reach the provider and fail
structural validation — whereas the pre-move code always reassigned first.

Compute the prompt-cache strategy up front and add `willAddTailCache` to the
sanitize gate, so cached sends are cleaned before the marker is applied
(restoring the pre-move guarantee). Collapses the cache-insertion branch to the
same up-front booleans.

* 🩹 fix: Orphan-sanitize system-runnable prompt-cached sends too

The previous gate used "this node will add the marker" (which excludes the
system-runnable path via !systemRunnable). But when a system runnable owns the
system prompt, AgentContext still adds the body cache marker — so those are
cached sends that must be orphan-sanitized as well. With prompt caching +
system runnable + a pruner that returned the context unchanged, orphaned
AI/tool pairs from persisted history could still reach the provider.

Track two separate facts: `providerPromptCacheEnabled` (caching is on for the
provider at all — drives orphan cleanup, system-runnable included) vs. the
node-adds-the-marker condition (Anthropic/OpenRouter minus systemRunnable, or
Bedrock — drives the insertion). The sanitize gate now uses the former.
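The two facts can be pictured as two booleans computed up front. The sketch below is a hedged reading of the description. The option names, the `Provider` union, the `contextChanged` flag, and the exact gate expression are assumptions, not the code in Graph.

```typescript
type Provider = 'anthropic' | 'openrouter' | 'bedrock' | 'other';

interface GateInput {
  provider: Provider;
  promptCache: boolean;     // assumed: caching requested for this run
  systemRunnable: boolean;  // assumed: a system runnable owns the system prompt
  contextChanged: boolean;  // assumed: the pruner returned a new message list
}

function cacheGates(input: GateInput) {
  // Fact 1: caching is on for the provider at all (system-runnable included).
  const providerPromptCacheEnabled = input.promptCache && input.provider !== 'other';
  // Fact 2: this node itself inserts the tail marker.
  const nodeAddsMarker =
    providerPromptCacheEnabled &&
    (input.provider === 'bedrock' || !input.systemRunnable);
  // The sanitize gate follows fact 1, so cached sends are always cleaned.
  const needsOrphanSanitize = input.contextChanged || providerPromptCacheEnabled;
  return { providerPromptCacheEnabled, nodeAddsMarker, needsOrphanSanitize };
}

console.log(
  cacheGates({ provider: 'anthropic', promptCache: true, systemRunnable: true, contextChanged: false }),
);
```

In the logged case the node does not add the marker (the system-runnable path does), yet orphan sanitization still runs, which is the gap this commit closes.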

* 🩹 fix: Break import cycle from the prefill re-anchor

The P3-1 re-anchor imported addTailCacheControl from @/messages/cache into the
Anthropic converter, closing a cycle:
  messages/format.ts -> llm/anthropic/utils/message_inputs.ts
    -> messages/cache.ts -> messages/format.ts
which the bundler's circular-dependency check (npm run build:dev) flags.

Replace the cross-module reuse with a small local re-anchor that operates on the
already-converted Anthropic payload. This is also more correct: at that stage
the converter has already dropped foreign-reasoning / input_json_delta blocks,
so only native thinking blocks need excluding, and the post-strip tail is always
a user message. Live-reverified: turn1 cache_creation=6264, turn2 read=6264.
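A local re-anchor of this kind might look like the sketch below, which operates on an already-converted payload. It is illustrative only: the types are simplified, `stripPrefillAndReanchor` is a hypothetical name, and the sketch treats any trailing assistant message as a prefill, whereas the real `stripUnsupportedAssistantPrefill` applies model-specific conditions.

```typescript
type Block = { type: string; text?: string; cache_control?: { type: 'ephemeral' } };
type Msg = { role: 'user' | 'assistant'; content: Block[] };

function stripPrefillAndReanchor(messages: Msg[]): Msg[] {
  const last = messages[messages.length - 1];
  // Nothing to strip unless the tail is an assistant message.
  if (!last || last.role !== 'assistant') return messages;

  const rest = messages.slice(0, -1);
  // Re-anchor only when a marker was actually removed with the prefill,
  // so requests with caching off stay untouched.
  const removedMarker = last.content.some((b) => b.cache_control !== undefined);
  const tail = rest[rest.length - 1];
  if (!removedMarker || !tail) return rest;

  // After conversion only native thinking blocks need excluding.
  let anchor = -1;
  for (let j = tail.content.length - 1; j >= 0; j--) {
    const b = tail.content[j];
    if (b && b.type !== 'thinking' && b.type !== 'redacted_thinking') {
      anchor = j;
      break;
    }
  }
  if (anchor === -1) return rest;

  const content = tail.content.map((b, j) =>
    j === anchor ? { ...b, cache_control: { type: 'ephemeral' as const } } : b,
  );
  return [...rest.slice(0, -1), { ...tail, content }];
}
```

Keeping the helper local to the converter avoids importing from `messages/cache.ts`, which is what closed the cycle in the first place.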

* 📊 test: Live reproducible prompt-cache benchmark (tail vs legacy)

Add a committed, live benchmark that empirically justifies the single tail
breakpoint over the legacy "last two user messages" strategy, plus a doc with
representative results.

bench-prompt-cache.ts replays three realistic harness shapes (agent tool loop,
multi-turn chat, realistic agent) under BOTH strategies over the same
conversations in separate cache namespaces, against a real provider, and reports
per-call cache token breakdowns. `fresh` (uncached, full-price input) is derived
provider-agnostically from total_tokens-output_tokens minus the cache buckets,
since Anthropic folds cache tokens into input_tokens while Bedrock reports them
separately.
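The `fresh` derivation above is simple arithmetic. A small sketch, with hypothetical field names rather than any provider's exact usage schema, and assuming the reported total includes the cache buckets:

```typescript
// Illustrative usage shape; not a provider's actual response format.
interface UsageSketch {
  totalTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheCreationTokens: number;
}

// fresh = total - output - cache_read - cache_creation
function freshInputTokens(u: UsageSketch): number {
  return u.totalTokens - u.outputTokens - u.cacheReadTokens - u.cacheCreationTokens;
}

const fresh = freshInputTokens({
  totalTokens: 10_500,
  outputTokens: 500,
  cacheReadTokens: 8_000,
  cacheCreationTokens: 1_500,
});
console.log(fresh); // 500
```

Starting from the total rather than from `input_tokens` sidesteps the difference in how the two providers report cached input.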

Result (live, claude-sonnet-4-5): the tail strategy is cheaper in every scenario
on both Anthropic and Bedrock. Legacy reprocesses tens of thousands of
full-price tokens in any tool-bearing conversation (its lone user-message marker
leaves the growing transcript uncached); tail reduces that to ~0 and reads the
prefix back. Effective cost −30..−38% (Anthropic), −9..−15% (Bedrock); even in
legacy's best case (frequent user messages), tail ties or wins.

- src/scripts/bench-prompt-cache.ts (excluded from build/CI; real paid calls)
- npm run bench:cache [-- --provider bedrock|anthropic --rounds N --model id]
- docs/prompt-cache-benchmark.md

* 📊 test: Add post-compaction scenario to the prompt-cache benchmark

Covers the two transcript-mutating harness behaviors raised in review:

- Tool truncation: a non-issue for caching — applied once at tool-exec with a
  model-fixed (turn-invariant) cap by the already-tested, deterministic
  truncateToolResultContent, so a truncated result is a stable prefix block.
  Documented; no separate scenario needed (existing tool-loop already exercises
  tool results in the cached prefix).
- Compaction (summarization): add a post-compaction scenario — a few pre-
  compaction tool rounds, a head→summary swap (one-time cache miss for any
  strategy), then continued tool rounds. Confirms the tail strategy
  re-establishes append-only caching on the new summary-headed prefix.

Live result (claude-sonnet-4-5): tail wins 4/4 scenarios on BOTH Anthropic and
Bedrock. Post-compaction is among the largest wins (Anthropic effective −41%,
read +76%) because after compaction the summary is the only user message, so
legacy re-sends all continued tool work uncached (fresh 63k → 42).

docs/prompt-cache-benchmark.md updated with the 4-scenario tables and a
truncation/compaction section.
@pull pull Bot locked and limited conversation to collaborators Jun 17, 2026
@pull pull Bot added the ⤵️ pull label Jun 17, 2026
@pull pull Bot merged commit f32a9aa into innFactory:main Jun 17, 2026
1 check passed