Skip to content

fix: prevent second batch store reset wiping postage snapshot on --resync#5499

Open
martinconic wants to merge 1 commit into
masterfrom
fix/resync-snapshot-double-reset
Open

fix: prevent second batch store reset wiping postage snapshot on --resync#5499
martinconic wants to merge 1 commit into
masterfrom
fix/resync-snapshot-double-reset

Conversation

@martinconic

@martinconic martinconic commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

Fixes #5495: a node started with --resync while the postage snapshot is active shuts down during sync with:

get: get batch <id>: storage: not found

Root cause

The batch store is reset twice on startup in this configuration:

  1. snapshotBatchSvc.Start() resets the store (resync) and rebuilds it from the snapshot, then sets postageSyncStart = maxBlockHeight.
  2. The live batchSvc was constructed with the same o.Resync flag, so its Start() reset the store again, discarding the snapshot that was just loaded.

Live sync then started from the snapshot block height against an empty store, so the first top-up/dilute event referencing a batch created before the snapshot height failed with storage: not found, and the node shut down.

This matches the report exactly: two "resync requested, resetting batch store" log lines, and the fact that skip-postage-snapshot works around it (only one reset, and sync replays from contract genesis where every create precedes its top-up).

Note: this is independent of #5343/#5482. The flow originates in #5094 and is still present on master; the reporter's build (2.8.0-41d6efc6) predates the #5343 revert, but reverting #5343 does not address this.

Fix

The two reset sites are not redundant — one rebuilds from the snapshot, the other is the fallback rebuild for the no-snapshot / snapshot-failed / non-mainnet cases — they were just both armed at once.

Construct the live batch service after the snapshot path has run and pass resync only when the snapshot did not rebuild the store:

batchSvc, err = batchservice.New(..., o.Resync && !snapshotLoaded)

The reset logic stays in its single existing place (batchservice.Start()), exactly one service is armed to reset, and there is no runtime toggle. The store is still reset exactly once in every other case (snapshot skipped, failed, not applicable, or a non-resync start).

Testing

  • TestResyncControlsReset in batchservice asserts both directions of the contract node.go relies on:
    • resync=true ⇒ store reset once
    • resync=false ⇒ store not reset (the post-snapshot live-service case)
  • go test -race ./pkg/postage/batchservice/... passes
  • go build ./pkg/node/... passes; go vet, gofumpt, and golangci-lint clean

Open API Spec Version Changes (if applicable)

None.

Related Issue

Closes #5495

AI Disclosure

  • This PR contains code that has been generated by an LLM.
  • I have reviewed the AI generated code thoroughly.
  • I possess the technical expertise to responsibly review the code generated in this PR.

@martinconic martinconic changed the title fix: prevent second batch store reset wiping postage snapshot on --re… fix: prevent second batch store reset wiping postage snapshot on --resync Jun 10, 2026
@martinconic martinconic self-assigned this Jun 10, 2026
@martinconic martinconic added this to the 2026 milestone Jun 10, 2026
// postage snapshot has already reset and rebuilt the batch store, so the live
// sync continues from the snapshot instead of wiping it (see issue #5495).
func (svc *batchService) SkipResync() {
svc.resync = false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to me the fact that there's two places where the batch service could get reset already smells a bit. with this change we have still two places where a reset happens + third path to disable one of them potentially. so the question is asked - why have two places that do this in the first place?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two reset sites aren't pure duplication. They serve different roles:

  • snapshot service → resets + rebuilds from the snapshot
  • live service → resets + rebuilds from chain genesis — this is the fallback path, needed when there's no snapshot
    So you can't simply delete one site without merging the two services (they use different listeners — snapshot filterer vs live chain backend).

I changed the implementation to not modify at all batchservice.go

@martinconic martinconic force-pushed the fix/resync-snapshot-double-reset branch from c26bb91 to a398078 Compare June 11, 2026 07:11
…sync

When a node is started with --resync together with the postage snapshot, the batch store was reset twice. The snapshot batch service resets and rebuilds the store from the snapshot, but the live-chain batch service was constructed with the same resync flag and reset the store a second time in its Start(), discarding the freshly loaded snapshot.

Live sync then began at the snapshot block height against an empty store, so the first top-up or dilute event for any batch created before the snapshot height failed with "get batch <id>: storage: not found" and the node shut down. This is why disabling the snapshot (skip-postage-snapshot) worked around it: the store was reset only once and sync replayed from contract genesis, where every create precedes its top-up.

Construct the live batch service after the snapshot path has run and pass resync only when the snapshot did not rebuild the store (o.Resync && !snapshotLoaded). The store is still reset exactly once in every other case (snapshot skipped, failed, not applicable, or a non-resync start).

Closes #5495
@martinconic martinconic force-pushed the fix/resync-snapshot-double-reset branch from a398078 to 69df1aa Compare June 11, 2026 07:16
@martinconic martinconic requested a review from acud June 11, 2026 07:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Shutdown during --resync with postage snapshot: get: get batch <batchID>: storage: not found

2 participants