fix(inference): prevent silent truncation of large streaming responses#834
Merged
johntmyers merged 2 commits intomainfrom Apr 14, 2026
Merged
Conversation
The L7 inference proxy silently dropped tool_calls from large streaming responses due to an aggressive 30s per-chunk idle timeout and a reqwest total-request timeout that capped the entire body stream. Reasoning models that pause during "thinking" phases triggered these timeouts, producing valid-looking but truncated HTTP responses with no client-visible error. - Extract prepare_backend_request() helper and create a streaming variant that omits the total request timeout; body stream liveness is now enforced solely by the per-chunk idle timeout - Add 30s connect_timeout to the reqwest Client builder - Increase CHUNK_IDLE_TIMEOUT from 30s to 120s for reasoning models - Inject SSE error events (proxy_stream_error) before the HTTP chunked terminator on all truncation paths so clients can detect data loss - Wrap the streaming relay in BufWriter to reduce per-chunk TLS flush overhead - Bump OCSF severity for streaming truncation from Low to Medium Closes #829
3 tasks
pimlock
previously approved these changes
Apr 14, 2026
The BufWriter introduced in the previous commit buffered SSE frames until the 16KB capacity filled or the stream ended, defeating incremental token delivery and potentially reintroducing client-visible timeouts on healthy streams. Revert to per-chunk write_all+flush but keep the single format_chunk() call that coalesces framing into one write. Also fix the streaming integration test: add a 3s mock delay that exceeds the 1s route timeout so the test actually validates that the streaming path omits the total request timeout. Previously the mock responded immediately, passing regardless of timeout behavior.
Closed
3 tasks
mjamiv
added a commit
to mjamiv/OpenShell
that referenced
this pull request
Apr 15, 2026
format_sse_error only escaped `\` and `"`, leaving two problems: 1. Control characters (`\n`, `\r`, `\t`, and all `\u0000-\u001F`) in `reason` produce output that fails `serde_json::from_str` — defeating NVIDIA#834's goal of giving clients a parseable SSE truncation signal. 2. An unescaped `\n\n` inside `reason` splits the single error event into two SSE frames, letting a misbehaving upstream inject a forged frame (e.g. a fake tool_calls delta) into the client's stream. Latent today since all in-tree callers pass static strings, but a footgun for any future caller passing upstream error text, and the function's docstring already invites dynamic reasons. Replace the manual escape with `serde_json::to_writer` (already a workspace dep of `openshell-sandbox`). Add unit tests for control character escaping and SSE event-boundary injection. Closes NVIDIA#840
mjamiv
added a commit
to mjamiv/OpenShell
that referenced
this pull request
Apr 15, 2026
format_sse_error only escaped `\` and `"`, leaving two problems: 1. Control characters (`\n`, `\r`, `\t`, and all `\u0000-\u001F`) in `reason` produce output that fails `serde_json::from_str` — defeating NVIDIA#834's goal of giving clients a parseable SSE truncation signal. 2. An unescaped `\n\n` inside `reason` splits the single error event into two SSE frames, letting a misbehaving upstream inject a forged frame (e.g. a fake tool_calls delta) into the client's stream. Latent today since all in-tree callers pass static strings, but a footgun for any future caller passing upstream error text, and the function's docstring already invites dynamic reasons. Replace the manual escape with `serde_json::to_writer` (already a workspace dep of `openshell-sandbox`). Add unit tests for control character escaping and SSE event-boundary injection. Closes NVIDIA#840 Signed-off-by: mjamiv <michael.commack@gmail.com>
This was referenced Apr 15, 2026
johntmyers
pushed a commit
that referenced
this pull request
Apr 15, 2026
format_sse_error only escaped `\` and `"`, leaving two problems: 1. Control characters (`\n`, `\r`, `\t`, and all `\u0000-\u001F`) in `reason` produce output that fails `serde_json::from_str` — defeating #834's goal of giving clients a parseable SSE truncation signal. 2. An unescaped `\n\n` inside `reason` splits the single error event into two SSE frames, letting a misbehaving upstream inject a forged frame (e.g. a fake tool_calls delta) into the client's stream. Latent today since all in-tree callers pass static strings, but a footgun for any future caller passing upstream error text, and the function's docstring already invites dynamic reasons. Replace the manual escape with `serde_json::to_writer` (already a workspace dep of `openshell-sandbox`). Add unit tests for control character escaping and SSE event-boundary injection. Closes #840 Signed-off-by: mjamiv <michael.commack@gmail.com>
2 tasks
7 tasks
4 tasks
ericksoa
pushed a commit
to NVIDIA/NemoClaw
that referenced
this pull request
Apr 23, 2026
## Summary Bumps the pinned OpenShell version range from `0.0.29` → `0.0.32` so fresh NemoClaw installs pick up sandbox hardening and TLS improvements from the last three OpenShell releases. ## Notable upstream changes **0.0.30** ([NVIDIA/OpenShell@v0.0.29...v0.0.30](NVIDIA/OpenShell@v0.0.29...v0.0.30)) - Network policy deny rules ([OpenShell#822](NVIDIA/OpenShell#822)) - Preserve ownership on existing `read_write` paths ([OpenShell#827](NVIDIA/OpenShell#827)) - Disable child core dumps ([OpenShell#821](NVIDIA/OpenShell#821)) - Escape control characters in SSE error formatting ([OpenShell#842](NVIDIA/OpenShell#842)) - Fix silent truncation of large streaming inference responses ([OpenShell#834](NVIDIA/OpenShell#834)) **0.0.31** ([NVIDIA/OpenShell@v0.0.30...v0.0.31](NVIDIA/OpenShell@v0.0.30...v0.0.31)) - Inference routed-request header allowlist ([OpenShell#826](NVIDIA/OpenShell#826)) **0.0.32** ([NVIDIA/OpenShell@v0.0.31...v0.0.32](NVIDIA/OpenShell@v0.0.31...v0.0.32)) - **Load system CA certificates for upstream TLS connections** ([OpenShell#862](NVIDIA/OpenShell#862)) - Publish standalone `openshell-gateway` binaries ([OpenShell#853](NVIDIA/OpenShell#853)) ## Changes - `nemoclaw-blueprint/blueprint.yaml`: `min_openshell_version` and `max_openshell_version` → `0.0.32` - `scripts/install-openshell.sh`: `MIN_VERSION` and `MAX_VERSION` → `0.0.32` (`PIN_VERSION` follows `MAX`) - `scripts/brev-launchable-ci-cpu.sh`: default `OPENSHELL_VERSION` → `v0.0.32` - `src/lib/onboard.ts`: blueprint-fallback min version → `0.0.32` - `test/onboard.test.ts`, `test/install-openshell-version-check.test.ts`: fixtures updated; "above MAX" test case moved from `0.0.30` to `0.0.33` Historical `m-dev` comments referencing `0.0.29` left in place — they describe a self-report quirk the sidecar fallback still handles. ## Why not 0.0.33+? `0.0.34` introduced incremental sandbox policy updates and L7 request-target canonicalization — changes with larger surface area against how NemoClaw delivers policy via gRPC. Worth a follow-up PR rather than bundling here. `0.0.35` released hours before this PR was cut — too fresh. ## Type of Change - [x] Code change for a new feature, bug fix, or refactor. ## Testing - [x] `npx vitest run test/install-openshell-version-check.test.ts` — 9 passed - [x] pre-commit hooks (prek) clean: shellcheck, commitlint, gitleaks, YAML validator, CLI test suite - [ ] Nightly E2E on this branch — will be kicked off after PR opens ## Notes - No user-facing CLI behavior changes — just the pinned version range. - Two pre-existing failures in `test/onboard.test.ts` reproduce on clean `main` and are unrelated to this bump. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Updated OpenShell version constraints and default pinned version to v0.0.32 across configuration, install, and onboarding flows. * **Tests** * Updated test fixtures and expectations to match the new OpenShell version (v0.0.32). <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix the L7 inference proxy silently dropping tool_calls from large streaming responses. The proxy had three interacting bugs: an aggressive 30s per-chunk idle timeout that killed reasoning model "think" pauses, a reqwest total-request timeout that capped the entire body stream at 60s, and silent truncation that wrote a valid HTTP terminator on error paths — producing correct-looking but incomplete responses.
Related Issue
Closes #829
Changes
crates/openshell-router/src/backend.rs: Extractprepare_backend_request()helper sharing auth/header/body logic; createsend_backend_request_streaming()that omits the total request timeout — streaming body liveness is now enforced by the sandbox per-chunk idle timeoutcrates/openshell-router/src/lib.rs: Addconnect_timeout(30s)to the reqwest Client buildercrates/openshell-sandbox/src/proxy.rs: IncreaseCHUNK_IDLE_TIMEOUTfrom 30s to 120s; inject SSE error events before chunked terminator on all truncation paths; wrap streaming relay inBufWriterto reduce per-chunk TLS flush overhead; bump OCSF severity from Low to Medium for truncation eventscrates/openshell-sandbox/src/l7/inference.rs: Addformat_sse_error()helper for generating parseable SSE error eventscrates/openshell-router/tests/backend_integration.rs: Add tests verifying streaming proxy completes without total timeout and buffered proxy still enforces itarchitecture/inference-routing.md: Document timeout model, SSE error signaling, and BufWriter behaviorDeviations from Plan
None — implemented as planned
Testing
cargo test --package openshell-router --package openshell-sandboxpasses (499 tests)cargo fmt --all -- --checkpassesformat_sse_error()(valid SSE format, JSON escaping)Tests added:
format_sse_error_produces_valid_sse_json,format_sse_error_escapes_quotes_in_reasoninl7/inference.rsstreaming_proxy_completes_despite_exceeding_route_timeout,buffered_proxy_enforces_route_timeoutinbackend_integration.rsChecklist
Documentation updated:
architecture/inference-routing.md: Updated timeout model, response streaming, and truncation signaling sections