whir: avoid serial 256 MiB copy of combined-statement weights #250
Merged
TomWambsgans merged 1 commit on Jun 9, 2026
The merge of main into devnet4 (8eec56c) changed MleOwned to hold an ArenaVec, but combine_statement still returned a heap Vec, which was then bridged with ArenaVec::from_slice. At n_vars=24 that is a single-threaded memcpy of ~256 MiB of extension elements per proof while all worker threads sit idle, and the data crosses the memory hierarchy twice.

Build the weights directly in an ArenaVec so it is moved, never copied. All writers (compute_eval_eq_packed*, split_at_mut_many) take &mut [T] and work unchanged via deref. The ArenaVec is created inside the same proving phase where it was previously copied into one, so arena-phase semantics are unchanged.

On a Zen5 box (Ryzen 7 9700X) this turns a -5.3% XMSS-aggregation regression vs pre-merge into a +2.2% improvement (215.9 -> 220.7 XMSS/s); run_initial_sumcheck_rounds self time drops from 7.94% back to 0.00%. Proof size is unchanged.
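The copy-versus-move difference described above can be sketched in isolation. This is a minimal illustration, not the project's code: `ArenaBuf` is a hypothetical stand-in for ArenaVec (the real type is arena-backed and its API may differ), `fill_weights` stands in for the slice-based writers, and `u64` stands in for the packed extension-field element type.

```rust
use std::ops::{Deref, DerefMut};

// Hypothetical stand-in for the project's ArenaVec. It is backed by a plain
// boxed slice here; the real type allocates from an arena.
struct ArenaBuf<T> {
    data: Box<[T]>,
}

impl<T: Clone + Default> ArenaBuf<T> {
    // Allocate `len` default-initialised elements.
    fn zeroed(len: usize) -> Self {
        Self { data: vec![T::default(); len].into_boxed_slice() }
    }

    // Copy an existing slice into a fresh buffer (the serial memcpy path).
    fn from_slice(src: &[T]) -> Self {
        Self { data: src.to_vec().into_boxed_slice() }
    }
}

impl<T> Deref for ArenaBuf<T> {
    type Target = [T];
    fn deref(&self) -> &[T] {
        &self.data
    }
}

impl<T> DerefMut for ArenaBuf<T> {
    fn deref_mut(&mut self) -> &mut [T] {
        &mut self.data
    }
}

// A writer that only knows about slices, like the writers named in the text.
fn fill_weights(out: &mut [u64]) {
    for (i, w) in out.iter_mut().enumerate() {
        *w = (i as u64).wrapping_mul(3) + 1;
    }
}

// Before: build in a heap Vec, then copy the whole thing into the buffer.
fn combine_then_copy(len: usize) -> ArenaBuf<u64> {
    let mut weights = vec![0u64; len];
    fill_weights(&mut weights);
    ArenaBuf::from_slice(&weights) // second pass over all the data
}

// After: allocate the destination first and write into it directly.
fn combine_in_place(len: usize) -> ArenaBuf<u64> {
    let mut weights = ArenaBuf::zeroed(len);
    fill_weights(&mut weights); // deref coercion to &mut [u64]
    weights // moved out, no copy of the elements
}

fn main() {
    let a = combine_then_copy(1 << 10);
    let b = combine_in_place(1 << 10);
    assert_eq!(&a[..], &b[..]);
    println!("both paths produce the same {} weights", b.len());
}
```

Because the writer takes `&mut [u64]`, switching the destination type requires no change to the writer itself; only the allocation site and the return type differ between the two paths.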
Problem
The merge of main into devnet4 (8eec56c) changed MleOwned to hold an ArenaVec, but combine_statement in crates/whir/src/open.rs still returned a glibc-heap Vec<EFPacking<EF>>, which was then bridged with ArenaVec::from_slice(&weights). At n_vars=24 that is a single-threaded memcpy of ~256 MiB per proof while all worker threads sit idle, and the same data traverses the memory hierarchy twice.

This was found while investigating a machine-specific perf regression: the copy is a fixed serial tax competing against the merge's parallel-kernel gains. On a Zen4 box the gains outweigh the tax; on a Zen5 box (Ryzen 7 9700X, smaller remaining gains) it showed up as a net −5.3% in XMSS aggregation throughput.

--tracing pinpointed it: run_initial_sumcheck_rounds self time went from 0.01% to 7.94% (~25 ms) with identical children, and perf showed ArenaVec::from_slice only in the regressed binary.

Fix
Build the weights directly in an ArenaVec so it is moved, never copied. All writers (compute_eval_eq_packed*, split_at_mut_many) take &mut [T] and work unchanged via deref. The ArenaVec is created inside the same proving phase where it was previously copied into one, so arena-phase semantics are unchanged. This matches how main already does it; the bug is devnet4-only.

Results (Zen5, Ryzen 7 9700X, interleaved runs, mean of --repeat 10)
[Results table comparing c39f0ff, a44bca4 unpatched, and a44bca4 + fix; cell values not recoverable]

Patched trace shows run_initial_sumcheck_rounds self time back to 0.00%. Proof size unchanged (265 KiB).

Testing
cargo test -p mt-whir -p sub_protocols --release passes (incl. end-to-end prove/verify test_run_whir)
cargo fmt --check and cargo clippy -p mt-whir clean
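The Problem section contrasts a serial copy with writers that keep all worker threads busy. As a rough sketch of why slice-based writers parallelise without caring what owns the memory, the example below splits a destination into disjoint mutable chunks and fills them from scoped threads. It uses only the standard library: `fill_parallel` is a hypothetical helper, `chunks_mut` stands in for the project's split_at_mut_many, and the fill formula is arbitrary.

```rust
use std::thread;

// Fill `out` so that out[i] == 2 * i, using up to `n_threads` scoped threads.
// Each thread receives a disjoint &mut [u64] chunk, so no locking is needed.
fn fill_parallel(out: &mut [u64], n_threads: usize) {
    // Chunk length, rounded up so that all elements are covered; at least 1.
    let chunk = out.len().div_ceil(n_threads.max(1)).max(1);
    thread::scope(|s| {
        for (c, part) in out.chunks_mut(chunk).enumerate() {
            s.spawn(move || {
                let base = c * chunk; // global index of this chunk's first element
                for (i, w) in part.iter_mut().enumerate() {
                    *w = ((base + i) as u64) * 2;
                }
            });
        }
    });
}

fn main() {
    // Any owner that derefs to a mutable slice works as the destination.
    let mut weights: Box<[u64]> = vec![0u64; 1000].into_boxed_slice();
    fill_parallel(&mut weights, 4);
    assert!(weights.iter().enumerate().all(|(i, &w)| w == 2 * i as u64));
    println!("filled {} weights in parallel", weights.len());
}
```

The destination here is a `Box<[u64]>`, but the same call would compile against any container implementing `DerefMut<Target = [u64]>`, which is the property the fix relies on.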