perf: bmt SIMD hasher by acud · Pull Request #5381 · ethersphere/bee

acud · 2026-02-25T23:53:19Z

Description

This PR tries to improve the current BMT hasher to use SIMD when available.

Motivation and Context (Optional)

The current BMT hasher is highly inefficient - it uses massive goroutine spawning to compute a single chunk address. For a full chunk, this means 255 goroutines are spawned per chunk, creating GC stress and significant scheduler stress. For the ReserveSample function used in the redistribution game, this means excessive memory and CPU stress in order to calculate the reserve sample.

The idea here was to use an existing keccak implementation, compile it to assembler and try to call it directly from our go code without having to use cgo which comes with its own set of side-effects.

I used (I==me+Claude) the XKCP project (from the keccak authors) and built a build script that builds, extracts and wraps the compiled code correctly. Currently only linux amd64 is supported. Windows and mac should fall back to the go legacy sha3 hasher.

So far the results are promising:

x1.6 faster BMT hashing on my local machine (laptop)
x2.5 faster on AVX2 supported data-center CPUs (Hetzner)
x5 faster on newer AVX512 architectures

There's a few more things to iron out and test:

is the scalar hashing fallback (to the go crypto legacy keccak hashing) reasonably fast? (I am imagining this only for dev use anyway - mac devs etc) or do we need to fallback to the previous implementation of BMT that spawned all those goroutines?
as of such - do we need a BMT factory? (to spin up the right type of hasher) do we want in this case to maintain both implementations? (copies?)
since SIMD instructions can cause CPU throttling, we must test on different configurations and providers and see that the possible throttling doesn't nullify the performance improvements that SIMD gives us. in other words we have to compare workload benchmarks against the current implementation on trunk.
figure out whether the excessive stack frame allocation can somehow be avoided
see if using bmtpool would improve anything at all
fix the linker error: /usr/bin/ld: warning: /tmp/go-link-1425710637/000001.o: missing .note.GNU-stack section implies executable stack /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker

Worth noting:

I've also tried to port the existing go crypto library legacy sha3 implementation (written in go) to use the new go 1.26 SIMD primitives. Still needs a bit of research.
windows not supported cuz shadow space (don't ask)

Test plan

run on selected production nodes for a month and see if all is well, no panics, better performance. Only linux amd64, various configurations.

Related Issue (Optional)

#5174

References:

https://github.com/XKCP/XKCP
https://github.com/acud/XKCP <- my fork with the necessary build setup to create the necessary .syso files.
https://pkg.go.dev/simd/archsimd

acud · 2026-03-22T09:26:16Z

Just as an update - these changes are being actively tested on a few mainnet nodes to verify everything works well and we don't get any memory corruption/panics

martinconic · 2026-04-15T13:24:39Z

Peer reviewing with Claude found the following, maybe are suitable for addressing:

Catastrophic regression on the default linux/amd64 configuration (severity: high)

When SIMDOptIn=false (the default), the code still takes the new SIMD code path but calls keccak.Sum256x4Scalar — which spawns goroutines internally:

// scalar fallback spawns 4 goroutines per batch
for i := 0; i < 4; i++ { go func() { ... }() }
wg.Wait()
For a 128-branch BMT (64 leaf nodes, batchWidth=4): the scalar path spawns ~124 goroutines vs the original path's 64. Combined with WaitGroup + Mutex overhead on every batch, benchmarks show ~14× slower than the original goroutine path. This affects every linux/amd64 node that doesn't explicitly pass --use-simd-hashing — i.e., all standard deployments.

bmtpool singleton won't benefit from SIMD even when opted in (severity: high)

bmtpool.init() runs before start.go sets SIMDOptIn=true. So pkg/bmtpool/bmtpool.go's global pool — used by cac.go, the upload pipeline, pss/trojan.go, and most proof generation — is always created with SIMD disabled. Only code that calls bmt.NewPrefixHasher() directly (i.e., ReserveSample) gets the benefit.

akrem-chabchoub

Thanks @acud !
I'm trying to understand/test more the PR, but this is a small round of review/clarifications...

janos

Nice one. I have comments that may improve the performance by reducing the gc pressure, otherwise it is solid.

janos · 2026-05-19T11:28:09Z

+	isLeft      bool
+	parent      *simdNode
+	left, right []byte
+	hasher      hash.Hash


It looks to me that hasher from simdNode can be moved to simdTree struct in order to avoid creating hashers for every node (255 hashers per chunk) on line pool_simd.go:137, as hashers are all the same. By moving the hasher to the simdTree, that hasher can be created only once in newSIMDTree and used in hasher_simd.go:142 and hasher_simd.go:99 as h.bmt.hasher instead to get it from the node. That would avoid creating a lot of hashers when hashing which should reduce the gc pressure. The newSIMDNode could accept the hasher size instead of the hasher. But that just a smaller optimization, the main one would be to avoid calling hashfunc() in double nested loop on line 137. It looks to me that this change would be logically correct, if I did not miss something else.

janos · 2026-05-19T11:32:35Z

+		nodes := make([]*simdNode, count)
+		for i := 0; i < count; i++ {
+			parent := prevlevel[i/2]
+			nodes[i] = newSIMDNode(i, parent, hashfunc())


Related to the previous comment on line 114. Hashers are created for every node, but it looks to me that only one hasher is needed, by the usage of the hasher. Since the BMT Pool guarantees that a single hasher instance used by one worker at a time, and the SIMD hasher itself never spawns internal goroutines, it is guaranteed that the single bmt hasher is only ever accessed sequentially.

acud force-pushed the simd-hashing-no-concurrency branch from 304934f to ad70c60 Compare February 25, 2026 23:56

acud changed the title ~~perf: BMT SIMD hasher~~ perf: bmt SIMD hasher Feb 26, 2026

acud marked this pull request as draft February 26, 2026 17:41

acud force-pushed the simd-hashing-no-concurrency branch 3 times, most recently from 7872f4e to e10454f Compare February 27, 2026 03:59

acud force-pushed the simd-hashing-no-concurrency branch from 1846444 to 69f0fd6 Compare March 5, 2026 20:25

perf: bmt simd hasher

94d8ac7

acud force-pushed the simd-hashing-no-concurrency branch from 69f0fd6 to 94d8ac7 Compare March 6, 2026 15:02

chore: fix linker errors

a6f7b68

acud added enhancement enhancement of existing functionality testing labels Mar 19, 2026

acud added 3 commits April 14, 2026 10:56

chore: add cli opt-in flag

3f466bc

Merge branch 'master' into simd-hashing-no-concurrency

bdc0b17

chore: simd opt-in for tests

ea70bec

acud marked this pull request as ready for review April 14, 2026 10:09

acud added 3 commits April 20, 2026 10:09

chore: make goroutine path fall back on runtime

17df728

chore: lint

8c4fe95

chore: comments

f654007

acud force-pushed the simd-hashing-no-concurrency branch 3 times, most recently from 0e1e2f3 to 7293240 Compare April 21, 2026 09:55

chore: width clean, fix builds, more comments

24abb91

acud force-pushed the simd-hashing-no-concurrency branch from 7293240 to 24abb91 Compare April 21, 2026 10:00

chore: revert prover to goroutine path

7aecf7a

acud force-pushed the simd-hashing-no-concurrency branch from d53d875 to 7aecf7a Compare April 22, 2026 13:39

acud added 4 commits April 23, 2026 20:00

chore: lint

6d608b2

chore: last self review

cbd20ff

chore: coalesce files

aad02ba

chore: check both paths in test

578ef8d

akrem-chabchoub requested changes May 5, 2026

View reviewed changes

Comment thread pkg/keccak/CHECKSUM

Comment thread pkg/bmt/hasher_goroutine.go Outdated

Comment thread cmd/bee/cmd/cmd.go

chore: address pr comments

e346523

acud requested a review from akrem-chabchoub May 8, 2026 15:00

akrem-chabchoub approved these changes May 8, 2026

View reviewed changes

akrem-chabchoub force-pushed the simd-hashing-no-concurrency branch from 471286a to e346523 Compare May 8, 2026 16:39

martinconic approved these changes May 19, 2026

View reviewed changes

janos approved these changes May 19, 2026

View reviewed changes

chore: address comments

9c09ab9

acud merged commit 12a3a87 into master May 19, 2026
15 checks passed

acud deleted the simd-hashing-no-concurrency branch May 19, 2026 23:34

nugaon pushed a commit that referenced this pull request May 28, 2026

perf: bmt SIMD hasher (#5381)

da8ccfd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: bmt SIMD hasher#5381

perf: bmt SIMD hasher#5381
acud merged 16 commits into
masterfrom
simd-hashing-no-concurrency

acud commented Feb 25, 2026 •

edited

Loading

Uh oh!

acud commented Mar 22, 2026

Uh oh!

martinconic commented Apr 15, 2026

Uh oh!

akrem-chabchoub left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

janos left a comment

Uh oh!

janos May 19, 2026

Uh oh!

janos May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

acud commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context (Optional)

Test plan

Related Issue (Optional)

References:

Uh oh!

acud commented Mar 22, 2026

Uh oh!

martinconic commented Apr 15, 2026

Uh oh!

akrem-chabchoub left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

janos left a comment

Choose a reason for hiding this comment

Uh oh!

janos May 19, 2026

Choose a reason for hiding this comment

Uh oh!

janos May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

acud commented Feb 25, 2026 •

edited

Loading