whir: avoid serial 256 MiB copy of combined-statement weights by MegaRedHand · Pull Request #250 · leanEthereum/leanVM

MegaRedHand · 2026-06-09T19:36:43Z

Problem

The merge of main into devnet4 (8eec56c) changed MleOwned to hold an ArenaVec, but combine_statement in crates/whir/src/open.rs still returned a glibc-heap Vec<EFPacking<EF>>, which was then bridged with ArenaVec::from_slice(&weights). At n_vars=24 that is a single-threaded memcpy of ~256 MiB per proof while all worker threads sit idle, plus the same data traversing the memory hierarchy twice.

This was found while investigating a machine-specific perf regression: the copy is a fixed serial tax competing against the merge's parallel-kernel gains. On a Zen4 box the gains outweigh the tax; on a Zen5 box (Ryzen 9700X, smaller remaining gains) it showed up as a net −5.3% in XMSS aggregation throughput. --tracing pinpointed it: run_initial_sumcheck_rounds self time went from 0.01% to 7.94% (~25 ms) with identical children, and perf showed ArenaVec::from_slice only in the regressed binary.

Fix

Build the weights directly in an ArenaVec so it is moved, never copied. All writers (compute_eval_eq_packed*, split_at_mut_many) take &mut [T] and work unchanged via deref. The ArenaVec is created inside the same proving phase where it was previously copied into one, so arena-phase semantics are unchanged. This matches how main already does it; the bug is devnet4-only.

Results (Zen5, Ryzen 7 9700X, interleaved runs, mean of `--repeat 10`)

binary	time (s)	XMSS/s	vs pre-merge `c39f0ff`
pre-merge `c39f0ff`	0.2964	215.9	—
`a44bca4` unpatched	0.3129	204.6	−5.3%
`a44bca4` + fix	0.2900	220.7	+2.2%

Patched trace shows run_initial_sumcheck_rounds self time back to 0.00%. Proof size unchanged (265 KiB).

Testing

cargo test -p mt-whir -p sub_protocols --release passes (incl. end-to-end prove/verify test_run_whir)
cargo fmt --check and cargo clippy -p mt-whir clean

The merge of main into devnet4 (8eec56c) changed MleOwned to hold an ArenaVec, but combine_statement still returned a heap Vec, which was then bridged with ArenaVec::from_slice. At n_vars=24 that is a single-threaded memcpy of ~256 MiB of extension elements per proof while all worker threads sit idle, and the data crosses the memory hierarchy twice. Build the weights directly in an ArenaVec so it is moved, never copied. All writers (compute_eval_eq_packed*, split_at_mut_many) take &mut [T] and work unchanged via deref. The ArenaVec is created inside the same proving phase where it was previously copied into one, so arena-phase semantics are unchanged. On a Zen5 box (Ryzen 9700X) this turns a -5.3% XMSS-aggregation regression vs pre-merge into a +2.2% improvement (215.9 -> 220.7 XMSS/s); run_initial_sumcheck_rounds self time drops from 7.94% back to 0.00%. Proof size is unchanged.

TomWambsgans merged commit 2498896 into leanEthereum:devnet4 Jun 9, 2026

MegaRedHand deleted the fix/whir-combine-statement-arena-copy branch June 9, 2026 19:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

whir: avoid serial 256 MiB copy of combined-statement weights#250

whir: avoid serial 256 MiB copy of combined-statement weights#250
TomWambsgans merged 1 commit into
leanEthereum:devnet4from
lambdaclass:fix/whir-combine-statement-arena-copy

MegaRedHand commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MegaRedHand commented Jun 9, 2026

Problem

Fix

Results (Zen5, Ryzen 7 9700X, interleaved runs, mean of --repeat 10)

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Results (Zen5, Ryzen 7 9700X, interleaved runs, mean of `--repeat 10`)