Skip to content

perf: bmt SIMD hasher#5381

Merged
acud merged 16 commits into
masterfrom
simd-hashing-no-concurrency
May 19, 2026
Merged

perf: bmt SIMD hasher#5381
acud merged 16 commits into
masterfrom
simd-hashing-no-concurrency

Conversation

@acud

@acud acud commented Feb 25, 2026

Copy link
Copy Markdown
Contributor

Description

This PR tries to improve the current BMT hasher to use SIMD when available.

Motivation and Context (Optional)

The current BMT hasher is highly inefficient - it uses massive goroutine spawning to compute a single chunk address. For a full chunk, this means 255 goroutines are spawned per chunk, creating GC stress and significant scheduler stress. For the ReserveSample function used in the redistribution game, this means excessive memory and CPU stress in order to calculate the reserve sample.

The idea here was to use an existing keccak implementation, compile it to assembler and try to call it directly from our go code without having to use cgo which comes with its own set of side-effects.

I used (I==me+Claude) the XKCP project (from the keccak authors) and built a build script that builds, extracts and wraps the compiled code correctly. Currently only linux amd64 is supported. Windows and mac should fall back to the go legacy sha3 hasher.

So far the results are promising:

  • x1.6 faster BMT hashing on my local machine (laptop)
  • x2.5 faster on AVX2 supported data-center CPUs (Hetzner)
  • x5 faster on newer AVX512 architectures

There's a few more things to iron out and test:

  • is the scalar hashing fallback (to the go crypto legacy keccak hashing) reasonably fast? (I am imagining this only for dev use anyway - mac devs etc) or do we need to fallback to the previous implementation of BMT that spawned all those goroutines?
  • as of such - do we need a BMT factory? (to spin up the right type of hasher) do we want in this case to maintain both implementations? (copies?)
  • since SIMD instructions can cause CPU throttling, we must test on different configurations and providers and see that the possible throttling doesn't nullify the performance improvements that SIMD gives us. in other words we have to compare workload benchmarks against the current implementation on trunk.
  • figure out whether the excessive stack frame allocation can somehow be avoided
  • see if using bmtpool would improve anything at all
  • fix the linker error: /usr/bin/ld: warning: /tmp/go-link-1425710637/000001.o: missing .note.GNU-stack section implies executable stack /usr/bin/ld: NOTE: This behaviour is deprecated and will be removed in a future version of the linker

Worth noting:

  • I've also tried to port the existing go crypto library legacy sha3 implementation (written in go) to use the new go 1.26 SIMD primitives. Still needs a bit of research.
  • windows not supported cuz shadow space (don't ask)

Test plan

  • run on selected production nodes for a month and see if all is well, no panics, better performance. Only linux amd64, various configurations.

Related Issue (Optional)

#5174

References:

@acud acud force-pushed the simd-hashing-no-concurrency branch from 304934f to ad70c60 Compare February 25, 2026 23:56
@acud acud changed the title perf: BMT SIMD hasher perf: bmt SIMD hasher Feb 26, 2026
@acud acud marked this pull request as draft February 26, 2026 17:41
@acud acud marked this pull request as draft February 26, 2026 17:41
@acud acud force-pushed the simd-hashing-no-concurrency branch 3 times, most recently from 7872f4e to e10454f Compare February 27, 2026 03:59
@acud acud force-pushed the simd-hashing-no-concurrency branch from 1846444 to 69f0fd6 Compare March 5, 2026 20:25
@acud acud force-pushed the simd-hashing-no-concurrency branch from 69f0fd6 to 94d8ac7 Compare March 6, 2026 15:02
@acud acud added enhancement enhancement of existing functionality testing labels Mar 19, 2026
@acud

acud commented Mar 22, 2026

Copy link
Copy Markdown
Contributor Author

Just as an update - these changes are being actively tested on a few mainnet nodes to verify everything works well and we don't get any memory corruption/panics

@acud acud marked this pull request as ready for review April 14, 2026 10:09
@martinconic

Copy link
Copy Markdown
Contributor

Peer reviewing with Claude found the following, maybe are suitable for addressing:

  1. Catastrophic regression on the default linux/amd64 configuration (severity: high)

When SIMDOptIn=false (the default), the code still takes the new SIMD code path but calls keccak.Sum256x4Scalar — which spawns goroutines internally:

// scalar fallback spawns 4 goroutines per batch
for i := 0; i < 4; i++ { go func() { ... }() }
wg.Wait()
For a 128-branch BMT (64 leaf nodes, batchWidth=4): the scalar path spawns ~124 goroutines vs the original path's 64. Combined with WaitGroup + Mutex overhead on every batch, benchmarks show ~14× slower than the original goroutine path. This affects every linux/amd64 node that doesn't explicitly pass --use-simd-hashing — i.e., all standard deployments.

  1. bmtpool singleton won't benefit from SIMD even when opted in (severity: high)

bmtpool.init() runs before start.go sets SIMDOptIn=true. So pkg/bmtpool/bmtpool.go's global pool — used by cac.go, the upload pipeline, pss/trojan.go, and most proof generation — is always created with SIMD disabled. Only code that calls bmt.NewPrefixHasher() directly (i.e., ReserveSample) gets the benefit.

@acud acud force-pushed the simd-hashing-no-concurrency branch 3 times, most recently from 0e1e2f3 to 7293240 Compare April 21, 2026 09:55
@acud acud force-pushed the simd-hashing-no-concurrency branch from 7293240 to 24abb91 Compare April 21, 2026 10:00
@acud acud force-pushed the simd-hashing-no-concurrency branch from d53d875 to 7aecf7a Compare April 22, 2026 13:39

@akrem-chabchoub akrem-chabchoub left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @acud !
I'm trying to understand/test more the PR, but this is a small round of review/clarifications...

Comment thread pkg/keccak/CHECKSUM
Comment thread pkg/bmt/hasher_goroutine.go Outdated
Comment thread cmd/bee/cmd/cmd.go
@acud acud requested a review from akrem-chabchoub May 8, 2026 15:00
@akrem-chabchoub akrem-chabchoub force-pushed the simd-hashing-no-concurrency branch from 471286a to e346523 Compare May 8, 2026 16:39

@janos janos left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one. I have comments that may improve the performance by reducing the gc pressure, otherwise it is solid.

Comment thread pkg/bmt/pool_simd.go Outdated
isLeft bool
parent *simdNode
left, right []byte
hasher hash.Hash

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me that hasher from simdNode can be moved to simdTree struct in order to avoid creating hashers for every node (255 hashers per chunk) on line pool_simd.go:137, as hashers are all the same. By moving the hasher to the simdTree, that hasher can be created only once in newSIMDTree and used in hasher_simd.go:142 and hasher_simd.go:99 as h.bmt.hasher instead to get it from the node. That would avoid creating a lot of hashers when hashing which should reduce the gc pressure. The newSIMDNode could accept the hasher size instead of the hasher. But that just a smaller optimization, the main one would be to avoid calling hashfunc() in double nested loop on line 137. It looks to me that this change would be logically correct, if I did not miss something else.

Comment thread pkg/bmt/pool_simd.go Outdated
nodes := make([]*simdNode, count)
for i := 0; i < count; i++ {
parent := prevlevel[i/2]
nodes[i] = newSIMDNode(i, parent, hashfunc())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to the previous comment on line 114. Hashers are created for every node, but it looks to me that only one hasher is needed, by the usage of the hasher. Since the BMT Pool guarantees that a single hasher instance used by one worker at a time, and the SIMD hasher itself never spawns internal goroutines, it is guaranteed that the single bmt hasher is only ever accessed sequentially.

@acud acud merged commit 12a3a87 into master May 19, 2026
15 checks passed
@acud acud deleted the simd-hashing-no-concurrency branch May 19, 2026 23:34
nugaon pushed a commit that referenced this pull request May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement enhancement of existing functionality testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants