perf(poseidon): FFT MDS in AIR + RATE=12 MMO sponge by Barnadrot · Pull Request #216 · leanEthereum/leanVM

Barnadrot · 2026-05-09T10:55:06Z

Summary

Three perf changes to the Poseidon path on the leaf-aggregation hot path, plus a small commit adding #[inline] to four cross-crate hot functions. Net -5.58% wall-clock on the production XMSS leaf workload (1550 signatures, log_inv_rate=1).

Commit	Change
`b11aac3a`	`mds_air_16`: Karatsuba (72 mults) → FFT MDS (50 mults)
`27319044`	WHIR Merkle leaf sponge: RATE=8 → RATE=12, capacity=4 (native + zk-DSL verifier)
`2198c0b4`	MMO feedforward sponge — restores 124-bit collision security at RATE=12
`602859ad`	Add `#[inline]` to `mmo_hash_slice`, `mmo_precompute_zero_suffix_state`, `compress_mut`, `permute_mut`

Benchmark

Hetzner AX42-U (Zen 4), RUSTFLAGS="-C target-cpu=native", 1550 signatures, log_inv_rate=1. zk-alloc allocator (workspace default). Production release profile (fat LTO, codegen-units = 1). Warm-proof average over 4 consecutive proofs after a discarded cold warmup; per-proof variance was <1% on both branches.

Branch	Time / proof	XMSS/s	Proof size	Δ vs main
`main` (`19f1c774`)	2.120 s	731	338 KiB	—
`perf/poseidon-fft-mmo`	2.002 s	774	345 KiB	-5.58%

Welch's t-test on individual warm proof times: t = -28.8, df ≈ 6, p < 1e-6.
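Anyone re-running the benchmark can recompute this statistic from the per-proof times. The sketch below is the textbook Welch t statistic with Welch-Satterthwaite degrees of freedom. The sample values are toy numbers, not the PR's measurements. With four proofs per branch and similar variances, df comes out near 6, which is consistent with the reported df ≈ 6.

```rust
/// Sample mean and unbiased sample variance (n - 1 denominator).
fn mean_var(xs: &[f64]) -> (f64, f64) {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var)
}

/// Welch's t statistic and Welch-Satterthwaite degrees of freedom for two
/// independent samples whose variances may differ.
fn welch_t(a: &[f64], b: &[f64]) -> (f64, f64) {
    let (ma, va) = mean_var(a);
    let (mb, vb) = mean_var(b);
    let (na, nb) = (a.len() as f64, b.len() as f64);
    let (sa, sb) = (va / na, vb / nb); // squared standard errors of each mean
    let t = (ma - mb) / (sa + sb).sqrt();
    let df = (sa + sb).powi(2) / (sa.powi(2) / (na - 1.0) + sb.powi(2) / (nb - 1.0));
    (t, df)
}

fn main() {
    // Toy samples (4 values each), NOT real proof timings.
    let (t, df) = welch_t(&[1.0, 2.0, 3.0, 4.0], &[2.0, 3.0, 4.0, 5.0]);
    println!("t = {t:.4}, df = {df:.2}");
}
```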

Reproduce with the production profile (fat LTO + codegen-units = 1):

CARGO_PROFILE_RELEASE_LTO=fat \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
RUSTFLAGS="-C target-cpu=native" \
cargo run --release -- xmss --n-signatures 1550 --log-inv-rate 1

Proof size grows 338 → 345 KiB (+2.0%) because the recursive zk-DSL verifier program adds dispatch logic for RATE=12 (250,208 → 253,755 instructions in the aggregation program). The wall-clock improvement is the net gain after that overhead.
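The derived columns in the benchmark table can be re-checked from its own rounded entries with a few lines of arithmetic. Working from the three-decimal times, the change comes to about -5.57%. The -5.58% headline was therefore presumably computed from unrounded timings, though that is an assumption because the raw times are not given. Proof size comes out at about +2.07%, and the aggregation program's instruction count grows by about 1.4%.

```rust
/// Relative change in percent from `before` to `after`.
fn pct_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    let n_sigs = 1550.0_f64;
    let (t_main, t_branch) = (2.120_f64, 2.002_f64); // s/proof, rounded table values
    println!("XMSS/s: main {:.0}, branch {:.0}", n_sigs / t_main, n_sigs / t_branch);
    println!("time:  {:+.2}%", pct_change(t_main, t_branch));
    println!("size:  {:+.2}%", pct_change(338.0, 345.0)); // KiB
    println!("insns: {:+.2}%", pct_change(250_208.0, 253_755.0)); // aggregation program
}
```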

Per-commit attribution

Each commit was cherry-picked onto main and benchmarked individually under the production profile:

Commit	Standalone Δ
FFT MDS only	-3.0%
RATE=12 + MMO (logically coupled — see security note)	~-2.5% on top of FFT MDS
`#[inline]` annotations alone	~0% on the production profile (see note below)
Bundle	-5.58%

Correctness

All five integration tests pass at HEAD:

test_run_whir
test_xmss_signature
test_type_1_aggregation
test_aggregation
test_type_2_aggregation

End-to-end verification (including the recursive zkVM verifier) succeeds in ~37 ms. Proof remains valid under the existing verifier.

Security — RATE=12 + MMO

RATE=8 with capacity=8 in a plain Sponge gives ~124-bit generic collision security (capacity/2, with 8 elements of ~31 bits each). Bumping to RATE=12 with capacity=4 in a plain Sponge would drop generic collision security to ~62 bits, which is unacceptable.
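The generic bound can be checked numerically. With c capacity elements of log₂p bits each, the generic sponge collision bound is c·log₂p/2 bits. The sketch below computes it for KoalaBear, p = 2^31 - 2^24 + 1. It covers only the generic capacity bound and says nothing about the MMO mode.

```rust
/// KoalaBear prime p = 2^31 - 2^24 + 1.
const KOALABEAR_P: u64 = (1 << 31) - (1 << 24) + 1;

/// Generic sponge collision bound in bits: capacity * log2(p) / 2.
fn sponge_collision_bits(capacity: u32) -> f64 {
    capacity as f64 * (KOALABEAR_P as f64).log2() / 2.0
}

fn main() {
    for (rate, cap) in [(8, 8), (12, 4)] {
        println!("RATE={rate:>2}, capacity={cap}: ~{:.1} bits", sponge_collision_bits(cap));
    }
}
```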

Commit 2198c0b4 swaps the absorption mode to MMO (Matyas-Meyer-Oseas) feedforward:

state' = π(state + (M_i ‖ 0_cap)) + state

The chaining variable between absorbs is the full 16-element state (~496 bits), not the 4-element capacity. Only the final truncated OUT = 8 digest (248 bits) is exposed.
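To make the absorption rule concrete, here is a minimal sketch of MMO absorption with WIDTH=16, RATE=12, OUT=8. It assumes a zero initial state, zero-padding of a short final chunk, and field elements stored as canonical u64 values. `toy_permute` is a hypothetical stand-in for Poseidon-16, and `mmo_hash` is illustrative only. Neither is the crate's `mmo_hash_slice`.

```rust
const P: u64 = (1 << 31) - (1 << 24) + 1; // KoalaBear prime
const WIDTH: usize = 16;
const RATE: usize = 12;
const OUT: usize = 8;

type State = [u64; WIDTH];

/// Hypothetical toy permutation standing in for Poseidon-16 (NOT Poseidon, NOT secure).
/// It is a bijection: x -> x^3 is invertible mod p since gcd(3, p - 1) = 1, and the
/// prefix-sum mixing step is a unit lower-triangular (invertible) linear map.
fn toy_permute(s: &mut State) {
    for round in 0..4u64 {
        for (i, x) in s.iter_mut().enumerate() {
            let y = (*x + round * 16 + i as u64 + 1) % P; // add a round constant
            *x = y * y % P * y % P; // cube S-box
        }
        for i in 1..WIDTH {
            s[i] = (s[i] + s[i - 1]) % P; // in-place prefix sum
        }
    }
}

/// MMO absorption: state' = pi(state + (M_i || 0_cap)) + state for each RATE-sized
/// chunk, chaining the full 16-element state; returns the first OUT elements.
fn mmo_hash(msg: &[u64]) -> [u64; OUT] {
    let mut state: State = [0; WIDTH];
    for chunk in msg.chunks(RATE) {
        let before = state; // feedforward uses the state before message injection
        for (s, m) in state.iter_mut().zip(chunk) {
            *s = (*s + m % P) % P; // add M_i to the rate part only
        }
        toy_permute(&mut state);
        for (s, b) in state.iter_mut().zip(before.iter()) {
            *s = (*s + b) % P; // feedforward over all 16 elements
        }
    }
    let mut out = [0; OUT];
    out.copy_from_slice(&state[..OUT]);
    out
}

fn main() {
    let msg: Vec<u64> = (1..=30).collect(); // 30 elements -> 3 absorbs at RATE=12
    println!("{:?}", mmo_hash(&msg));
}
```

The full 16-element `before` array is what makes the chaining value the whole state rather than just the 4-element capacity.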

Security argument

The collision-resistance claim composes three results:

(1) MMO compression collision-resistance. In the ideal-permutation model, MMO is one of the 12 PGV constructions proven collision-resistant up to 2^{b/2} where b is the state size. At b = 496 bits: 2^{248} on the compression itself.
— Black, Rogaway, Shrimpton. "Black-Box Analysis of the Block-Cipher-Based Hash-Function Constructions from PGV." CRYPTO 2002. ePrint 2002/066

(2) Truncated-permutation Merkle is position-binding. Theorem 1 (strong position-binding) and Theorem 2 (strong extractability) for the Plonky3 truncated-permutation Merkle construction in the ideal-permutation model. Bottleneck: truncated digest space |H|. At OUT = 8: |H| = 248 bits, 2^{|H|/2} = 2^{124}.
— Coratger, Khovratovich, Wagner, Mennink. "The Billion Dollar Merkle Tree." ePrint 2026/089.

(3) +s MMO ≥ truncated-permutation Merkle. Plonky3's compression is trunc(π(L‖R), OUT). Ours adds feedforward: trunc(π(L‖R) + (L‖R), OUT). Feedforward strictly adds collision resistance — any collision in the MMO variant implies a collision either on the truncated output or on π directly. The bounds of (2) carry over at least as tightly.

Composition: collision security ≥ min(2^{b/2}, 2^{|H|/2}) = min(2^{248}, 2^{124}) = 2^{124}.

Why the sponge c/2 bound does not apply

The c/2 bound (Khovratovich, Marhuenda Beltrán, Mennink. ePrint 2023/520) holds in sponge mode where the attacker may extend or align messages to collide on the capacity portion of the chaining state. Our use is structurally different:

Fixed shape. Each mmo_hash_slice call has a length determined by the Merkle protocol. The verifier enforces this shape; an adversary cannot inject a different number of absorbs.
Full-state chaining via MMO. The chaining variable is the full 16-element state (~496 bits), not the 4-element capacity. The capacity is not the "hidden" portion between absorbs — the entire state participates in the feedforward.

Assumptions

Poseidon-16 modeled as an ideal permutation (standard assumption in Plonky3-style Merkle proofs; same assumption Coratger et al. 2026 rely on).
Verifier enforces fixed Merkle shape.

Why the `#[inline]` commit is included

Although the production profile uses fat LTO and inlines the new sponge calls aggressively, the workspace [profile.release] is lto = "thin". Under thin LTO, the new RATE=12 + MMO hot path crosses three crates — mt_whir::merkle::build_merkle_tree_koalabear → mt_symetric::sponge::mmo_hash_slice → Compression::compress_mut (impl on Poseidon1KoalaBear16 in mt_koala_bear) → Permutation::permute_mut — and these calls are left out-of-line. Inside the rayon worker loop that means a stack spill of the full 16-element packed state on every absorb iteration, which dominates per-iteration cost.

Concretely, without the #[inline] commit, the same source built with workspace defaults (cargo run --release -- xmss ...) is 3.2% slower than main. With it, the same command is 4.87% faster than main on the same machine.

#[inline] is just a hint; under fat LTO the compiler already inlines these. The annotations only change codegen under thin LTO, where they let it match what fat LTO already produces. No semantic change.

Files touched

crates/backend/koala-bear/src/poseidon1_koalabear_16.rs — FFT MDS, #[inline]
crates/lean_vm/src/tables/poseidon_16/mod.rs — FFT MDS
crates/backend/symetric/src/sponge.rs — RATE=12, MMO mode, #[inline]
crates/backend/symetric/src/permutation.rs — #[inline]
crates/whir/src/merkle.rs — sponge integration, padding formula
crates/backend/fiat-shamir/src/verifier.rs — sponge integration
crates/rec_aggregation/zkdsl_implem/hashing.py — zk-DSL RATE=12 port

Test plan

cargo test --workspace --release
Production-profile reproducer command above produces and verifies a valid proof
Wall-clock improvement reproduces the -5.6% headline (warm proofs, t-test)
Proof size: 338 → 345 KiB (+2.0%, expected from added verifier instructions)

The Poseidon AIR constraint folder evaluates mds_air_16 8x per row across runtime types (F, EF, FPacking, EFPacking). Previously this used Karatsuba convolution (72 mults). Switch to the same FFT-MDS already used in the permute_simd hot path: DIT_FFT(lambda/16 ⊙ DIF_IFFT(state)), 50 mults. Saves 22 mults × 8 MDS calls per AIR row = 176 mults/row, ~10% reduction in AIR Poseidon eval mult count. AIR Poseidon eval is ~10% of CPU time in the e2e prover (eval_2_full_rounds_16 + eval_last_2_full_rounds_16 + Poseidon16Precompile::eval). The unpacked lambda_over_16 = (DIF_IFFT(MDS_CIRC_COL) * 16^-1) is factored out of the SimdPrecomputed branch and stored at the top of Precomputed; the SIMD branch reuses it (no duplication). FFT helpers (bt/dit/neg_dif/dif_ifft/dit_fft) are ungated from target_feature since they're pure generic Rust, and their bound is relaxed from Algebra<KoalaBear> to PrimeCharacteristicRing + Mul<KoalaBear> to match mds_circ_16 (so EFPacking, which lacks Algebra<KoalaBear>, is admitted). Predicted magnitude: medium (1.0-1.5%).
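The FFT MDS relies on the fact that a circulant matrix is diagonalized by the DFT: circ(c)·x = IDFT(DFT(c) ⊙ DFT(x)). The eigenvalues DFT(c), scaled by 1/16, can be precomputed once, which is the role of `lambda_over_16` above. The sketch below checks this identity over KoalaBear using a naive O(n²) DFT. It is not the PR's 50-mult radix-2 DIF/DIT FFT, and the column `c` is hypothetical rather than the real Poseidon MDS column. Applying the inverse-direction transform first, as the PR does, only swaps ω and ω⁻¹, and the identity holds either way.

```rust
const P: u64 = (1 << 31) - (1 << 24) + 1; // KoalaBear prime; p - 1 = 2^24 * 127
const N: usize = 16;

fn mul(a: u64, b: u64) -> u64 { a * b % P }

fn pow(mut b: u64, mut e: u64) -> u64 {
    let mut r = 1;
    while e > 0 {
        if e & 1 == 1 { r = mul(r, b); }
        b = mul(b, b);
        e >>= 1;
    }
    r
}

/// Primitive 16th root of unity: w = g^((p-1)/16) has order exactly 16 iff w^8 = -1.
fn root_of_unity_16() -> u64 {
    for g in 2..P {
        let w = pow(g, (P - 1) / N as u64);
        if pow(w, 8) == P - 1 { return w; }
    }
    unreachable!("16 divides p - 1, so a primitive 16th root exists")
}

/// Naive DFT: X_k = sum_j x_j * w^(j*k).
fn dft(x: &[u64; N], w: u64) -> [u64; N] {
    let mut out = [0; N];
    for (k, o) in out.iter_mut().enumerate() {
        let wk = pow(w, k as u64);
        let (mut acc, mut t) = (0, 1); // t = w^(j*k), built incrementally
        for &xj in x {
            acc = (acc + mul(xj, t)) % P;
            t = mul(t, wk);
        }
        *o = acc;
    }
    out
}

/// Direct circulant product y_i = sum_j c[(i - j) mod N] * x_j (256 mults).
fn circulant_naive(c: &[u64; N], x: &[u64; N]) -> [u64; N] {
    let mut y = [0; N];
    for i in 0..N {
        for j in 0..N {
            y[i] = (y[i] + mul(c[(i + N - j) % N], x[j])) % P;
        }
    }
    y
}

/// Precomputed eigenvalues of circ(c), scaled by 1/N.
fn eigen_over_n(c: &[u64; N], w: u64) -> [u64; N] {
    let inv_n = pow(N as u64, P - 2); // Fermat inverse of 16
    dft(c, w).map(|l| mul(l, inv_n))
}

/// circ(c) * x = DFT_{w^-1}((lambda / N) ⊙ DFT_w(x)).
fn circulant_via_dft(lam_over_n: &[u64; N], x: &[u64; N], w: u64) -> [u64; N] {
    let w_inv = pow(w, (N - 1) as u64);
    let mut xs = dft(x, w);
    for (v, l) in xs.iter_mut().zip(lam_over_n) { *v = mul(*v, *l); }
    dft(&xs, w_inv)
}

fn main() {
    let w = root_of_unity_16();
    let c: [u64; N] = std::array::from_fn(|i| (i as u64 + 1) * 7); // hypothetical column
    let x: [u64; N] = std::array::from_fn(|i| i as u64 * 1_000_003 % P);
    let lam = eigen_over_n(&c, w);
    println!("{}", circulant_via_dft(&lam, &x, w) == circulant_naive(&c, &x));
}
```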


Reduce Poseidon permutations per Merkle leaf by 22-32% by increasing the sponge absorption rate from 8 to 12 field elements per permutation call. Changes:
- sponge.rs: relax RATE==OUT and WIDTH==OUT+RATE asserts, support arbitrary RATE
- merkle.rs: SPONGE_RATE=12, padded_full_base_width helper, corrected n_zero_suffix_rate_chunks formula for RATE!=WIDTH/2
- verifier.rs: pad base_data to sponge-aligned length before hashing
- hashing.py: zk-DSL slice_hash_rtl rewritten for RATE=12, @inline removed to fix conditional branch fall-through bug
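The saving comes from absorbing ⌈n/RATE⌉ chunks instead of ⌈n/8⌉. The sketch below counts permutation calls for a few hypothetical leaf widths and zero-pads data to a rate multiple, in the spirit of the verifier.rs change. `perms_per_leaf` and `pad_to_rate` are illustrative names, not the crate's helpers. The sketch also ignores the zero-suffix precomputation. The ratio approaches 1 - 8/12 ≈ 33% for wide leaves and shrinks for narrow ones because of rounding. The 22-32% figure depends on the actual leaf widths, which are not reproduced here.

```rust
/// Permutation calls needed to absorb `n` field elements, one per rate-sized chunk.
fn perms_per_leaf(n: usize, rate: usize) -> usize {
    n.div_ceil(rate)
}

/// Zero-pad data to a multiple of the sponge rate (hypothetical helper).
fn pad_to_rate(mut data: Vec<u32>, rate: usize) -> Vec<u32> {
    let padded = data.len().div_ceil(rate) * rate;
    data.resize(padded, 0);
    data
}

fn main() {
    for n in [16usize, 64, 100, 256] { // hypothetical leaf widths
        let (p8, p12) = (perms_per_leaf(n, 8), perms_per_leaf(n, 12));
        let saved = 100.0 * (1.0 - p12 as f64 / p8 as f64);
        println!("n={n:>3}: RATE=8 -> {p8:>2}, RATE=12 -> {p12:>2} ({saved:.1}% fewer)");
    }
}
```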

Replace standard outer-sponge with Matyas-Meyer-Oseas (MMO) feedforward construction. Same Poseidon-16 permutation, same RATE=12, but collision security lifts from 62-bit to 124-bit by chaining the full 16-element state instead of just the 4-element capacity. Changes:
- sponge.rs: mmo_hash_slice, mmo_hash_rtl_iter, mmo_precompute_zero_suffix_state with full-state feedforward (the pre-absorption state is added, mod p, to the permutation output)
- merkle.rs: wire MMO hash functions into Merkle tree construction
- verifier.rs: use MMO hash in verification path
- poseidon_16: new poseidon16_permute precompile (16-element output) for zk-DSL recursive verifier, with AIR constraints and trace generation
- hashing.py: zk-DSL updated to use MMO via poseidon16_permute precompile

Security: standard sponge collision = c*log2(p)/2 = 62 bits (unshippable). MMO collision = birthday bound on the 248-bit truncated output = 124 bits (meets target). Verified against: Coratger-Khovratovich-Wagner-Mennink 2026, SAFE proof (eprint 2023/520), Beetle (CHES 2018).

Under the workspace default thin LTO profile, the new RATE=12 + MMO sponge code introduced cross-crate calls that did not get inlined: mmo_hash_slice, mmo_precompute_zero_suffix_state, compress_mut, permute_mut. The hot loop in build_merkle_tree_koalabear ended up making out-of-line calls into mt_symetric and mt_koala_bear on every absorb, spilling the 16-element state to the stack each iteration. Adding #[inline] makes these functions available for cross-CGU inlining under thin LTO, matching the codegen fat LTO already produces. No semantic change. The functions are short hot-path wrappers/loops that the compiler should inline anyway given the chance.

- rustfmt: re-flow long lines introduced by the MMO commit
- clippy: replace redundant closures in sponge tests with function refs
- clippy: allow too_many_arguments on eval_last_2_full_rounds_16 (AIR helper, 9 args)
- clippy: rewrite full_output_flags loop with .iter().enumerate()

TomWambsgans · 2026-06-10T10:38:08Z

Hi! As discussed by message, the modification of the sponge is vulnerable to collision attacks. For the remaining parts of the PR, I believe everything has now been integrated, except this last part, which I just committed: 133ce0c

tks

Barnadrot · 2026-06-10T10:47:10Z

Agree with the close; the only remaining item was documentation (my initial close was an accidental branch cleanup). For future reference, I'm adding the collision path as a comment so it's documented.

+s MMO Sponge — Security Review

1. The Construction

leanMultisig's Merkle commitment hashes with an MMO (Matyas-Meyer-Oseas) feedforward compression function over KoalaBear (p ≈ 2^31):

F(s, M) = π(s + (M ‖ 0_c)) + s

Poseidon-16 as π, width 16, rate r=12, capacity c=4. The chaining variable between absorbs is the full 16-element state (~496 bits), not the 4-element capacity. The final digest is truncated to OUT=8 elements (~248 bits). Design target: 2^124 collision resistance.

Why MMO instead of plain sponge

RATE=8 with capacity=8 in a plain sponge gives ~124-bit generic collision security (capacity/2, with 8 elements of ~31 bits each). Bumping to RATE=12 with capacity=4 in a plain sponge would drop generic collision security to ~62 bits, which is unacceptable.

The MMO feedforward (+ s) changes the security model. In a plain sponge, the attacker exploits the capacity birthday to collide the hidden state. With MMO, the full state participates in the feedforward — the capacity is not the sole "hidden" portion between absorbs.

Original security argument (PR #216)

The collision-resistance claim composes three results:

(1) MMO compression collision-resistance. In the ideal-permutation model, MMO is one of the 12 PGV constructions proven collision-resistant up to 2^{b/2} where b is the state size. At b = 496 bits: 2^{248} on the compression itself.
— Black, Rogaway, Shrimpton. CRYPTO 2002. ePrint 2002/066.

(2) Truncated-permutation Merkle is position-binding. Theorem 1 (strong position-binding) and Theorem 2 (strong extractability) for the Plonky3 truncated-permutation Merkle construction. Bottleneck: truncated digest space |H|. At OUT=8: |H| = 248 bits, 2^{|H|/2} = 2^{124}.
— Coratger, Khovratovich, Wagner, Mennink. ePrint 2026/089.

(3) +s MMO ≥ truncated-permutation Merkle. Plonky3's compression is trunc(π(L‖R), OUT). MMO adds feedforward: trunc(π(L‖R) + (L‖R), OUT). The original claim was that feedforward strictly adds collision resistance. This is disproven by Section 2 — the feedforward leaks algebraic relationships through the truncation window (known offsets between k samples from one π call), enabling the multicollision shortcut below 2^{124}.

Original composition (incorrect): collision security ≥ min(2^{b/2}, 2^{|H|/2}) = min(2^{248}, 2^{124}) = 2^{124}. Actual security: ~2^{120} at c=4 (see Section 2).

2. The Attack

The original security argument is incomplete. A multicollision variant exploits the structure of the feedforward to reduce collision resistance below 2^{124}.

Standard sponge capacity-birthday (2^{c·31/2} = 2^{62}) fails — the +s feedforward blocks rate-annihilation. But a multicollision variant works:

Find k prefixes sharing the same 124-bit capacity state. Cost: 2^{124·(k-1)/k}
For each, choose the message M_i = X - r_i so that all k π inputs coincide despite their differing rate parts r_i. One π call then yields k hash samples: they differ by known offsets, so they cannot collide with each other, but each is valid for a birthday search against other calls.
Collect kN samples until birthday at 248 bits. Search cost: 2^{124} / k
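The alignment step can be checked concretely. Once k chaining states share a capacity, the choice M_i = X - r_i makes every π input equal to (X ‖ cap). The k outputs π(X ‖ cap) + s_i therefore differ by exactly the known offsets s_i - s_j, and their capacity parts agree. In the sketch below, `toy_permute`, the states, and X are hypothetical. The property does not depend on the permutation, so Poseidon-16 would behave the same way.

```rust
const P: u64 = (1 << 31) - (1 << 24) + 1; // KoalaBear prime
const WIDTH: usize = 16;
const RATE: usize = 12;
type State = [u64; WIDTH];

/// Hypothetical stand-in for Poseidon-16 (NOT Poseidon, NOT secure).
fn toy_permute(s: &mut State) {
    for x in s.iter_mut() {
        *x = *x * *x % P * *x % P; // cube S-box
    }
    s.rotate_left(1);
}

/// One MMO absorb step: pi(s + (M || 0_cap)) + s.
fn mmo_step(s: &State, m: &[u64; RATE]) -> State {
    let mut x = *s;
    for i in 0..RATE {
        x[i] = (x[i] + m[i]) % P;
    }
    toy_permute(&mut x);
    for i in 0..WIDTH {
        x[i] = (x[i] + s[i]) % P;
    }
    x
}

/// Aligning message M = X - r, element-wise mod p.
fn aligning_message(s: &State, x: &[u64; RATE]) -> [u64; RATE] {
    std::array::from_fn(|i| (x[i] + P - s[i]) % P)
}

/// Element-wise difference a - b mod p.
fn diff(a: &State, b: &State) -> State {
    std::array::from_fn(|i| (a[i] + P - b[i]) % P)
}

/// Three states sharing a capacity but with different rate parts, and their outputs.
fn demo() -> ([State; 3], [State; 3]) {
    let cap: [u64; 4] = [11, 22, 33, 44]; // shared capacity from the multicollision setup
    let states: [State; 3] = std::array::from_fn(|j| {
        let mut s = [0; WIDTH];
        for i in 0..RATE {
            s[i] = ((j as u64 + 1) * 1_000_003 + i as u64) % P; // differing rate parts r_j
        }
        s[RATE..].copy_from_slice(&cap);
        s
    });
    let target: [u64; RATE] = std::array::from_fn(|i| 5 + i as u64); // common pi input X
    let outs = states.map(|s| mmo_step(&s, &aligning_message(&s, &target)));
    (states, outs)
}

fn main() {
    let (states, outs) = demo();
    // Outputs differ by exactly the known state offsets.
    println!("{}", diff(&outs[0], &outs[1]) == diff(&states[0], &states[1]));
}
```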

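Setup and search costs balance at k ≈ 26, where the total cost (additive: log₂(2^{setup} + 2^{search})) is 2^{120.3} (~4 bits below target). Minimizing the additive total over integer k gives a marginally lower ~2^{120.0}, at k ≈ 19.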
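The cost model above can be scanned numerically. This sketch uses setup = cap_bits·(k-1)/k and search = 124 - log₂k with 31 bits per capacity element, and combines them additively. At c=4 it reproduces ≈2^{120.3} at k = 26, with the additive minimum ≈2^{120.0} near k = 19. At c=5 the minimum is ≈2^{122.0} at k = 4. The scan range 2..=64 is an arbitrary choice.

```rust
/// log2(2^a + 2^b), computed without overflow.
fn log2_add(a: f64, b: f64) -> f64 {
    let (hi, lo) = if a > b { (a, b) } else { (b, a) };
    hi + (1.0 + (lo - hi).exp2()).log2()
}

/// Additive attack cost (log2) for a capacity of `cap_bits` bits and k-multicollisions,
/// against a 248-bit truncated digest (2^124 output birthday).
fn attack_cost(cap_bits: f64, k: u32) -> f64 {
    let setup = cap_bits * (k as f64 - 1.0) / k as f64; // k-multicollision on the capacity
    let search = 124.0 - (k as f64).log2(); // k samples per pi call
    log2_add(setup, search)
}

/// Cheapest k in 2..=64 and its cost.
fn best_k(cap_bits: f64) -> (u32, f64) {
    (2..=64u32)
        .map(|k| (k, attack_cost(cap_bits, k)))
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .unwrap()
}

fn main() {
    println!("c=4, k=26: 2^{:.2}", attack_cost(124.0, 26));
    for c in [4u32, 5, 6, 7, 8] {
        let (k, cost) = best_k(31.0 * c as f64);
        // The plain 2^124 output birthday is always available, so cap at 124.
        println!("c={c}: best k={k}, ~2^{:.2}", cost.min(124.0));
    }
}
```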

3. Capacity Analysis

The attack generalizes to all capacity values. Only c ≥ 8 (RATE=8) achieves 2^{124}:

c	RATE	Collision resistance	Deficit vs 2^{124}	Wall-clock vs c=8
4	12	~2^{120}	4 bits	-5.58% (current)
5	11	~2^{122}	2 bits	-4.08% (measured)
6	10	~2^{123}	1 bit	~-2.2%
8	8	2^{124}	0	0% (baseline)

Why c=5 does not restore 2^{124}

The multicollision attack at c=5 (155-bit capacity):

k	Setup cost	Search cost	Total
2	2^{77.5}	2^{123.0}	~2^{123.0}
3	2^{103.3}	2^{122.4}	~2^{122.4}
4	2^{116.25}	2^{122.0}	~2^{122.0}
5	2^{124.0}	2^{121.7}	~2^{124.3}

Optimal k at c=5 is ~4, giving ~2^{122}. The setup cost only reaches the output birthday bound (2^{124}) at k=5, so k=4 still provides a shortcut.

To pin collision resistance at exactly 2^{124}, you need the k=2 setup alone to cost ≥ 2^{124}, which requires c·31/2 ≥ 124 bits → c ≥ 8 elements. Any rate strictly above 8 admits a residual multicollision shortcut.

4. Open Questions

Is 2^{120.3} tight? This is an upper bound on security (best known attack). A matching lower bound (security proof) would confirm tightness. Without one, a better attack could exist below 2^{120}.
Does the known-offset structure between k samples from one π call help the attacker further? Potential improvement: < 1 bit.
Does c=5 interact with other protocol constraints? Rate 12→11 affects absorption steps per Merkle leaf and recursive verifier program size.

References

Black, Rogaway, Shrimpton. Black-Box Analysis of the Block-Cipher-Based Hash-Function Constructions from PGV. CRYPTO 2002. ePrint 2002/066.
Coratger, Khovratovich, Wagner, Mennink. The Billion Dollar Merkle Tree. ePrint 2026/089.
Khovratovich, Marhuenda Beltrán, Mennink. Generic Security of the SAFE API and Its Applications. ePrint 2023/520.
Bertoni, Daemen, Peeters, Van Assche. On the Indifferentiability of the Sponge Construction. Eurocrypt 2008.

Barnadrot added 5 commits May 9, 2026 09:53

TomWambsgans force-pushed the main branch 2 times, most recently from eacd019 to 9b2f632 May 25, 2026 00:11

TomWambsgans force-pushed the main branch 2 times, most recently from c5a3050 to 9dc5d68 May 28, 2026 12:02

Barnadrot closed this Jun 10, 2026

Barnadrot deleted the perf/poseidon-fft-mmo branch June 10, 2026 10:03

Barnadrot restored the perf/poseidon-fft-mmo branch June 10, 2026 10:05

Barnadrot reopened this Jun 10, 2026

TomWambsgans closed this Jun 10, 2026

Barnadrot wants to merge 5 commits into leanEthereum:main from Barnadrot:perf/poseidon-fft-mmo