Use UFFD only for initial snapshot fork restores #270
Created a monitoring plan for this PR.
What this PR does: Speeds up VM fork fanout in the hot pool by deferring the large snapshot memory file copy when creating Firecracker forks with UFFD enabled. Forked VMs start immediately using the source's memory as backing, and materialize their own copy only when entering standby.
Intended effect:
Note: hypeman has no OTel telemetry; logs route to Railway stdout only. Hot pool metrics proxied through the kernel API are the observable signal. CI pass/fail is the primary authority for correctness.
Risks:
Status updates will be posted automatically on this PR as monitoring progresses.
cc00f35 to dd404e4
hiroTamada left a comment
nice work — the one-shot design cleanly confines the fragile pager/backing dependency to a single post-fork restore, and the lifecycle clearing + the TestFCUFFDOneShotLifecycle integration test (disk + /dev/shm files across the full fork→standby→restore→chained-fork chain) give good confidence on the snapshot-isolation risk.
both bugbot findings look already resolved on this head: the chained-fork backing path via the firecrackerDeferredSnapshotMemoryPath helper (regression-tested at firecracker_test.go:817), and the session close on the network-reconfigure failure path (restore.go:337).
a few non-blocking nits below, mostly around maintainability. approving.
36f0992 to c74c539
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit c74c539.

Summary
Validation
Note: the runtime validation used the Firecracker concurrency branch merged locally, since that branch removes the source alias/fork race this behavior relies on for burst fanout.
Note
High Risk
Changes Firecracker snapshot memory, fork fanout, and UFFD/systemd integration on a critical VM lifecycle path; incorrect backing or locking could affect restore correctness or concurrent forks.
Overview
This PR limits Firecracker UFFD to a one-shot path on standby snapshot fanout, instead of using UFFD on every restore when the backend is configured for UFFD.
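To make the one-shot idea concrete, here is a minimal Go sketch of how a fork's metadata might drive the choice between a UFFD restore and a file-backed restore. Every name in it (ForkMetadata, NewDeferredFork, NextRestore, EnterStandby, and the field names) is hypothetical and chosen for illustration; it is not the PR's code and it ignores locking, persistence, and pager session handling.

```go
package main

// RestoreMode names the two restore paths contrasted in this overview.
type RestoreMode string

const (
	RestoreViaUFFD RestoreMode = "uffd" // pages served by the pager from shared source memory
	RestoreViaFile RestoreMode = "file" // memory loaded from the fork's own file
)

// ForkMetadata is a hypothetical stand-in for per-instance metadata.
type ForkMetadata struct {
	UseUFFDOnNextRestore bool   // one-shot flag, set when the fork skipped the memory copy
	DeferredBackingPath  string // source memory file used as backing until standby
	LocalMemoryPath      string // the fork's own memory file; empty until materialized
}

// NewDeferredFork records the shared backing path instead of copying memory.
func NewDeferredFork(sourceMemoryPath string) *ForkMetadata {
	return &ForkMetadata{
		UseUFFDOnNextRestore: true,
		DeferredBackingPath:  sourceMemoryPath,
	}
}

// NextRestore returns the mode and memory path for one restore and consumes
// the one-shot flag, so at most one restore goes through UFFD.
func (m *ForkMetadata) NextRestore() (RestoreMode, string) {
	if m.UseUFFDOnNextRestore && m.DeferredBackingPath != "" {
		m.UseUFFDOnNextRestore = false
		return RestoreViaUFFD, m.DeferredBackingPath
	}
	return RestoreViaFile, m.LocalMemoryPath
}

// EnterStandby materializes a local memory file from the deferred backing
// (if any) and clears all deferred state. copyFn is injected so the sketch
// stays free of real file I/O.
func (m *ForkMetadata) EnterStandby(localPath string, copyFn func(src, dst string) error) error {
	if m.DeferredBackingPath != "" {
		if err := copyFn(m.DeferredBackingPath, localPath); err != nil {
			return err
		}
	}
	m.LocalMemoryPath = localPath
	m.DeferredBackingPath = ""
	m.UseUFFDOnNextRestore = false
	return nil
}
```

In this sketch the first NextRestore call on a deferred fork reports the UFFD mode with the source's memory path, and any call after EnterStandby reports the file mode with the fork's local path. Consuming the flag inside NextRestore is one possible design; clearing it only after a successful restore would be an equally plausible reading of the description.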
New forks from standby (instance or snapshot) can skip copying snapshot-latest/memory, record deferred backing and FirecrackerUseUFFDOnNextRestore, and restore once via the pager against shared source memory. Standby then materializes local memory via SnapshotOptions.DeferredMemoryBackingPath before diff snapshot; later restores use file-backed memory. Metadata and lifecycle paths clear the one-shot flag and UFFD session state after restore, standby, and stop.
Fork/copy gains optional path skips, CopyRegularFile, and stricter skipping of stale *.sock artifacts. Restore accepts SnapshotMemoryBackingPath; the UFFD supervisor moves env to /run/hypeman/uffd with CI-scoped systemd instance names. CI/Makefile install the pager unit template, pass pager binary env vars, and add an integration test for the full UFFD lifecycle.