Add UFFD snapshot pager graduation #272
Detach running UFFD-backed VMs from their snapshot memory pager after a
soak period instead of leaving them pinned for the life of the restore.
A new pager /sessions/{id}/complete endpoint populates the remaining
pages from the backing file and unregisters userfaultfd, so the VM keeps
running on resident memory with no pager dependency and no pause or
network interruption. This bounds the number of active pager sessions
and lets old pager versions drain to zero and exit.
A background controller (lib/uffdgraduate) drives graduations subject to
min_session_age, max_concurrent, and an optional max_active_sessions
ceiling, prioritising sessions on outdated pager versions. Disabled by
default and only active on the uffd backend. The detach is gated behind
a new hypervisor capability so the controller stays hypervisor-agnostic.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sibling of the UFFD one-shot lifecycle test that detaches a running UFFD-backed VM from its pager and asserts the VM keeps running with its guest memory and disk intact, new writes still work, and a later standby/restore preserves memory. Leaves the existing test unchanged.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Overlapping the graduation test's full memory populate with the sibling UFFD lifecycle test's VMs saturated the CI runner and timed out guest-agent readiness. Drop t.Parallel so peak concurrent UFFD VM load matches the pre-existing single-test profile.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Main advanced the pager to 0.1.3 independently (CLOCK cache eviction), colliding with this branch's bump. Advance to 0.1.4 so the graduation pager change carries a distinct version.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
reviewed end-to-end: solid, careful work, and the populate-then-unregister core is correct by construction. a few things worth a look, one i'd treat as a fix before merge.
should fix
concurrency
questions / confirm intent
test gaps
nits
Remove max_active_sessions: time-based weaning covers the rollout goals and the ceiling semantics were the confusing part of the feature. Graduation now clears all UFFD restore state (like standby/stop) so a graduated VM no longer references the source snapshot and its next standby writes a self-contained Full snapshot. Completion aborts when the caller times out or disconnects, keeping the session serving so the control plane binding stays accurate. Guard the wake pipe against use-after-close, back off failed graduations instead of retrying every scan, and error on a missing recorded pager version instead of guessing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Thanks for the careful review. Every claim checked out against the code, and one of your questions turned out to be a bigger deal than framed. All addressed in ce5009b, point by point:
Should fix (deferred memory path): adopted. Graduation now calls
wakeW use-after-close: fixed with
gctx timeout mid-populate: you were right to poke, and it was worse than "self-heals next scan." Standby in that window is safe (
overCap netting: resolved by deletion. We dropped
Populate-sweep fault latency: intended, and your read of the mechanics is right: a guest thread faulting on a not-yet-swept page blocks until the address-ordered sweep reaches it;
Test gaps: one correction:
Nits: failure backoff added (fixed 5m; each retry is a full-image read, so once a scan was too hot); provider no-op paths now log warnings; the empty-stored-version fallback is now an error instead of guessing (guessing could 404 against the wrong pager and clear a binding for a live session).
Pager VERSION bumped to 0.1.5. Ballooning interaction and
Cursor Bugbot has reviewed your changes using high effort and found 4 potential issues.
Reviewed by Cursor Bugbot for commit ce5009b.
log.InfoContext(ctx, "graduating instance off uffd snapshot memory pager",
	"instance_id", id, "session_id", sessionID, "pager_version", version)

err = m.firecrackerUFFDPager.CompleteSessionVersion(ctx, version, sessionID)
Graduation holds instance lock
Medium Severity
GraduateSnapshotMemoryPager keeps the per-instance mutex for the entire CompleteSessionVersion call, which can run up to completion_timeout (default 10m). StopInstance, StandbyInstance, DeleteInstance, and other paths that take the same lock block for that whole interval on a graduating VM.
	if s.takeCompletion() {
		return
	}
}
Completion waits on deferred faults
Medium Severity
After /complete queues work and wakes the fault loop, takeCompletion runs only when len(deferred)==0 and Poll sees the wake pipe. While EAGAIN page faults sit in deferred, the loop skips Poll, never drains the wake pipe, and does not start populate-unregister even though a completion request is already queued.
	http.Error(w, "session closed before completion", http.StatusInternalServerError)
case <-r.Context().Done():
	http.Error(w, r.Context().Err().Error(), http.StatusGatewayTimeout)
}
Completion success races HTTP handler
Medium Severity
The /complete response uses a buffered channel, so the fault loop can finish populate/unregister, exit, and close sess.done before the HTTP handler reads the success. handleComplete may then take the sess.done branch and return 500 even though the VM already detached, causing graduation to fail and retry until a 404 clears metadata.
// includes the deferred snapshot memory path: memory is fully resident now,
// so the next standby must write a self-contained Full snapshot instead of
// materializing from the source snapshot (which may since be deleted).
clearFirecrackerUFFDRestoreState(stored)
404 clears pager metadata unsafely
Medium Severity
When CompleteSessionVersion returns ErrSessionNotFound, graduation still clears all UFFD restore metadata and returns success. A 404 only means the pager has no session record, not that the running VM finished populate/unregister, so metadata can drop while the guest still depends on userfaultfd.


Summary
Running UFFD-backed VMs are pinned to their snapshot memory pager for the life of the restore. This adds a way to detach a running VM from its pager after it has soaked, so long-lived VMs stop depending on a pager (or on the snapshot backing their restore) and old pager versions can drain to zero and exit.
Detach happens without touching the VM: a new pager endpoint POST /sessions/{id}/complete populates every outstanding page from the backing file and then unregisters userfaultfd. The guest never pauses and its network is untouched; the VM ends up running on resident memory with no pager dependency.

Why not migrate UFFD→UFFD or fall back to the file backend: the memory backend is fixed at the mmap when a VM is restored, so reaching the file backend requires a VMM restart, which drops every TCP connection. Graduation (finish the lazy load, then detach) is the only path that is non-interrupting.
What's here
- lib/uffdpager: POST /sessions/{id}/complete + Supervisor.CompleteSessionVersion. Completion runs in the fault-loop goroutine (woken via a pipe), populates all pages (reusing the existing read/copy path), then UFFDIO_UNREGISTERs the ranges. Unregister happens only after a full populate; otherwise the kernel zero-fills still-absent pages (corruption). On populate failure, or if the caller times out/disconnects (the request context is checked throughout the sweep), the session keeps serving faults and is not torn down, so the control plane's session binding stays accurate for a retry.
- Capabilities().UsesDetachableSnapshotMemoryPager (true for Firecracker) so the controller stays hypervisor-agnostic.
- GraduateSnapshotMemoryPager performs the detach under the instance lock and clears all UFFD restore state (same helper as standby/stop), including the deferred snapshot memory path: a graduated VM no longer references the source snapshot, and its next standby writes a self-contained Full snapshot.
- lib/uffdgraduate: scans for running pager-backed VMs and graduates every one past the soak, prioritising outdated pager versions, with a fixed 5m backoff after a failed attempt.
- hypervisor.firecracker_uffd_graduation: enabled (default false), min_session_age (10m), max_concurrent (1), scan_interval (1m), completion_timeout (10m). Wired in main.go via the existing configure/start pattern (no wire regen).
Behaviour
- uffd backend (a warn is logged if enabled without one).
- min_session_age is graduated, oldest and outdated-version sessions first, paced by max_concurrent. Steady-state pager sessions ≈ restore rate × soak. (An earlier revision also had a max_active_sessions ceiling; it was dropped: time-based weaning covers the rollout goals and the ceiling semantics were confusing.)
Tradeoffs
- EEXIST, skipping the copy but not the read), hence the soak + concurrency pacing.
- DeleteSnapshot has no reference guard, so an operator deleting the source breaks that rollback either way). After the first post-graduation standby the VM has a self-contained snapshot and crash recovery returns.
Test plan
- go build ./..., go vet, and unit tests (incl. -race for the pager) pass for lib/uffdgraduate, lib/uffdpager, lib/instances, lib/providers, cmd/api/config.
- max_concurrent throttling, outdated-version priority, and disabled = no-op. Config Normalize/Validate covered.
- UFFDIO_UNREGISTER ioctl value, the wake pipe, wake-vs-close teardown safety, and that an abandoned completion aborts before unregister.
- TestFCUFFDGraduationLifecycle (real Firecracker + pager, runs in the linux test CI job) covers: graduation of a live fork, guest memory/disk integrity across detach, new writes post-detach, idempotent re-graduation, cleared restore state (session, version, deferred memory path), and file-backed standby/restore afterwards.
- UFFDIO_COPY dirty-neutrality on the host kernel (post-graduation diff snapshot size; a size regression risk, not correctness).
🤖 Generated with Claude Code