Skip to content

Add UFFD snapshot pager graduation#272

Open
sjmiller609 wants to merge 7 commits into
mainfrom
hypeship/uffd-graduation
Open

Add UFFD snapshot pager graduation#272
sjmiller609 wants to merge 7 commits into
mainfrom
hypeship/uffd-graduation

Conversation

@sjmiller609

@sjmiller609 sjmiller609 commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

Summary

Running UFFD-backed VMs are pinned to their snapshot memory pager for the life of the restore. This adds a way to detach a running VM from its pager after it has soaked, so long-lived VMs stop depending on a pager (or on the snapshot backing their restore) and old pager versions can drain to zero and exit.

Detach happens without touching the VM: a new pager endpoint POST /sessions/{id}/complete populates every outstanding page from the backing file and then unregisters userfaultfd. The guest never pauses and its network is untouched; the VM ends up running on resident memory with no pager dependency.

Why not migrate UFFD→UFFD or fall back to the file backend: the memory backend is fixed at the mmap when a VM is restored, so reaching the file backend requires a VMM restart, which drops every TCP connection. Graduation (finish the lazy load, then detach) is the only path that is non-interrupting.

What's here

  • Pager (lib/uffdpager): POST /sessions/{id}/complete + Supervisor.CompleteSessionVersion. Completion runs in the fault-loop goroutine (woken via a pipe), populates all pages (reusing the existing read/copy path), then UFFDIO_UNREGISTERs the ranges. Unregister happens only after a full populate — otherwise the kernel zero-fills still-absent pages (corruption). On populate failure, or if the caller times out/disconnects (the request context is checked throughout the sweep), the session keeps serving faults and is not torn down, so the control plane's session binding stays accurate for a retry.
  • Hypervisor: new Capabilities().UsesDetachableSnapshotMemoryPager (true for Firecracker) so the controller stays hypervisor-agnostic.
  • Manager: GraduateSnapshotMemoryPager performs the detach under the instance lock and clears all UFFD restore state (same helper as standby/stop), including the deferred snapshot memory path — a graduated VM no longer references the source snapshot, and its next standby writes a self-contained Full snapshot.
  • Controller (lib/uffdgraduate): scans for running pager-backed VMs and graduates every one past the soak, prioritising outdated pager versions, with a fixed 5m backoff after a failed attempt.
  • Config (hypervisor.firecracker_uffd_graduation): enabled (default false), min_session_age (10m), max_concurrent (1), scan_interval (1m), completion_timeout (10m). Wired in main.go via the existing configure/start pattern (no wire regen).

Behaviour

  • Disabled by default and only constructed on the uffd backend (a warn is logged if enabled without one).
  • Weaning is purely time-based: every session past min_session_age is graduated, oldest and outdated-version sessions first, paced by max_concurrent. Steady-state pager sessions ≈ restore rate × soak. (An earlier revision also had a max_active_sessions ceiling; it was dropped — time-based weaning covers the rollout goals and the ceiling semantics were confusing.)
  • A failed graduation leaves the VM untouched (still on its pager) and is retried after a 5m backoff, since every attempt re-reads the whole image.

Tradeoffs

  • Graduated pages become resident anonymous memory (reclaimable only to swap, unlike clean file-backed pages), and completion re-reads the whole image once (already-present pages return EEXIST, skipping the copy but not the read) — hence the soak + concurrency pacing.
  • During the populate sweep a guest thread that faults on a not-yet-swept page blocks until the address-ordered sweep reaches it (vCPUs keep running). The soak makes such faults rare; interleaving fault service with the sweep is a possible follow-up.
  • Because graduation drops the deferred memory reference and the fork's own snapshot dir has no memory file until its first standby, a graduated VM whose VMM dies before any post-graduation standby cannot be restored (previously it could roll back to the fork point — but only while the source snapshot still existed, which is exactly the dependency graduation exists to remove; DeleteSnapshot has no reference guard, so an operator deleting the source breaks that rollback either way). After the first post-graduation standby the VM has a self-contained snapshot and crash recovery returns.

Test plan

  • go build ./..., go vet, and unit tests (incl. -race for the pager) pass for lib/uffdgraduate, lib/uffdpager, lib/instances, lib/providers, cmd/api/config.
  • Controller unit tests cover soak gating, list/graduate failure handling, failure backoff, max_concurrent throttling, outdated-version priority, and disabled = no-op. Config Normalize/Validate covered.
  • Pager unit tests guard the hand-computed UFFDIO_UNREGISTER ioctl value, the wake pipe, wake-vs-close teardown safety, and that an abandoned completion aborts before unregister.
  • TestFCUFFDGraduationLifecycle (real Firecracker + pager, runs in the linux test CI job) covers: graduation of a live fork, guest memory/disk integrity across detach, new writes post-detach, idempotent re-graduation, cleared restore state (session, version, deferred memory path), and file-backed standby/restore afterwards.
  • Not yet validated: active-ballooning interaction (after unregister, a ballooned-then-reused page re-faults to zero-fill — safe only if genuinely guest-relinquished) and UFFDIO_COPY dirty-neutrality on the host kernel (post-graduation diff snapshot size — a size regression risk, not correctness).

🤖 Generated with Claude Code

sjmiller609 and others added 5 commits June 6, 2026 18:35
Detach running UFFD-backed VMs from their snapshot memory pager after a
soak period instead of leaving them pinned for the life of the restore.
A new pager /sessions/{id}/complete endpoint populates the remaining
pages from the backing file and unregisters userfaultfd, so the VM keeps
running on resident memory with no pager dependency and no pause or
network interruption. This bounds the number of active pager sessions
and lets old pager versions drain to zero and exit.

A background controller (lib/uffdgraduate) drives graduations subject to
min_session_age, max_concurrent, and an optional max_active_sessions
ceiling, prioritising sessions on outdated pager versions. Disabled by
default and only active on the uffd backend. The detach is gated behind
a new hypervisor capability so the controller stays hypervisor-agnostic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sibling of the UFFD one-shot lifecycle test that detaches a running
UFFD-backed VM from its pager and asserts the VM keeps running with its
guest memory and disk intact, new writes still work, and a later
standby/restore preserves memory. Leaves the existing test unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Overlapping the graduation test's full memory populate with the sibling
UFFD lifecycle test's VMs saturated the CI runner and timed out
guest-agent readiness. Drop t.Parallel so peak concurrent UFFD VM load
matches the pre-existing single-test profile.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread lib/instances/firecracker_uffd_graduate.go
@sjmiller609 sjmiller609 marked this pull request as ready for review June 24, 2026 14:12
Main advanced the pager to 0.1.3 independently (CLOCK cache eviction),
colliding with this branch's bump. Advance to 0.1.4 so the graduation
pager change carries a distinct version.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sjmiller609 sjmiller609 requested a review from hiroTamada June 24, 2026 14:24
@hiroTamada

Copy link
Copy Markdown
Contributor

reviewed end-to-end — solid, careful work, and the populate-then-unregister core is correct by construction. a few things worth a look, one i'd treat as a fix before merge.

should fix

  • lib/instances/firecracker_uffd_graduate.go:75-76 — graduation clears only FirecrackerUFFDSessionID + FirecrackerUFFDPagerVersion, but every other UFFD transition (standby.go:224, stop.go:310, fork.go:552, snapshot.go:321, firecracker_uffd.go:130) uses clearFirecrackerUFFDRestoreState, which also clears FirecrackerDeferredSnapshotMemoryPath. a running UFFD VM has that set, and it's consumed on the next standby (standby.go:151 → materializeDeferredSnapshotMemory, firecracker.go:121), which copies that backing file into the new snapshot dir. for a graduated VM it's stale — wasteful full-image copy at best, and if the source snapshot was deleted since, the copy errors and standby fails (firecracker.go:142-148). suggest replacing the two hand-clears with clearFirecrackerUFFDRestoreState(stored).

concurrency

  • complete_linux.go:76-81 + server_sessions_linux.go:118-121 — use-after-close race on wakeW. teardown runs close() (closes wakeW) before removeSession() deletes from the map, so a concurrent /complete can fetch the session and wake()unix.Write to a closed/recycled fd (wakeW >= 0 is never reset to -1). narrow window (fault loop exiting as a graduate fires, e.g. a VM dying mid-graduation), small blast radius, but real. fix: guard wakeW with a small mutex (set -1 under lock in close(), read+write under lock in wake()).

questions / confirm intent

  • firecracker_uffd_graduate.go:67 — on gctx timeout mid-populateAll, CompleteSessionVersion returns a context error (not ErrSessionNotFound) so the binding isn't cleared, but the server-side fault loop can still finish + tear down the session. self-heals next scan (404→clears) — is standby/health safe in that window?
  • controller.go:163-179overCap = len(insts) - MaxActiveSessions counts outdated sessions, which are graduated unconditionally on a separate branch. with outdated + a ceiling, non-outdated get over-selected below the ceiling (max=8, len=10, 3 outdated → 5 graduated, lands at 5). paced by MaxConcurrent so muted — intended, or should overCap net out the outdated?
  • complete_linux.go:110-140 — completion runs in the fault-loop goroutine, so during the whole-image, address-ordered populate sweep a guest thread touching an unpopulated page stalls until the sweep reaches it (correctness fine; latency can be large). "guest never pauses" holds for vCPUs but an individual fault can block — confirm soak + MaxConcurrent is the intended mitigation. minor: PR body says "reads the whole remaining image"; it re-reads the whole image (resident pages get EEXIST), so the I/O cost is full-image.

test gaps

  • the populate-then-unregister core is exercised only by TestFCUFFDGraduationLifecycle, which skips without KVM+userfaultfd+HYPEMAN_UFFD_PAGER_BINARY — likely not run in normal CI. hermetic linux tests cover only the ioctl constant + wake-pipe drain. consider a seam over uffdCopy/uffdUnregister to assert "unregister never runs if any populate errors" without a real uffd.
  • controller_test.gofakeStore.err is never set, so list-failure and graduate-failure/retry are untested; MaxConcurrent throttling and the overCap↔outdated interaction are untested too.
  • TestFCUFFDGraduationLifecycle:983 asserts session/version cleared but not FirecrackerDeferredSnapshotMemoryPath, and standbys while the source snapshot still exists — so it wouldn't catch the "should fix" item.

nits

  • controller.go:197-202 — a persistently-failing graduation retries every scan (1m) with no backoff; each attempt may re-read the whole image.
  • complete_linux.go:160-176drainUFFDEvents discards pagefault events too; safe only because populateAll covers every page — worth a one-line comment saying why.
  • providers/uffd_graduation.go:62-67 — when the manager type-assertion fails or target version is empty, the controller silently no-ops; a log.Warn would surface a misconfig.
  • firecracker_uffd_graduate.go:59-62 — empty stored version falls back to the current pager version; a resulting 404 clears the binding and could orphan a session on another version. shouldn't happen (version set at restore), just a sharp edge.

Remove max_active_sessions: time-based weaning covers the rollout goals
and the ceiling semantics were the confusing part of the feature.

Graduation now clears all UFFD restore state (like standby/stop) so a
graduated VM no longer references the source snapshot and its next
standby writes a self-contained Full snapshot. Completion aborts when
the caller times out or disconnects, keeping the session serving so the
control plane binding stays accurate. Guard the wake pipe against
use-after-close, back off failed graduations instead of retrying every
scan, and error on a missing recorded pager version instead of guessing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sjmiller609

Copy link
Copy Markdown
Collaborator Author

Thanks for the careful review — every claim checked out against the code, and one of your questions turned out to be a bigger deal than framed. All addressed in ce5009b, point by point:

Should fix — deferred memory path: adopted. Graduation now calls clearFirecrackerUFFDRestoreState like every other UFFD transition, and the lifecycle test asserts the deferred path (and UseUFFDOnNextRestore) are cleared. One nuance we hit while verifying: with the path cleared, the next standby writes a self-contained Full snapshot (toSnapshotCreateParams finds no base) rather than materialize+diff, which is what actually severs the source-snapshot dependency. The flip side — called out in the PR body — is that a graduated VM whose VMM dies before any post-graduation standby can no longer roll back to the fork point. That rollback only worked while the source snapshot existed, and DeleteSnapshot has no reference guard, so it was already one snapshot-deletion away from broken; we took the consistent side of the tradeoff.

wakeW use-after-close: fixed with wakeMu as you suggested — close() resets wakeW to -1 under the lock, wake() reads and writes under it. Added a wake-vs-close hammer test that fails under -race on regression.

gctx timeout mid-populate: you were right to poke, and it was worse than "self-heals next scan." Standby in that window is safe (closeFirecrackerUFFDSession tolerates a missing session and clears the binding regardless). Health is not: the completion ran in the fault-loop goroutine and never observed client abandonment, so any populate outrunning completion_timeout ended with the server finishing later and tearing down the session behind a stale binding. If that was the last session on its pager version — the normal case during a version drain — the pager exits, the health check fails, deriveState returns StateUnknown, and nothing self-heals: the controller's list filters non-Running and the graduate method's own state guard rejects retries. Fixed at the root: completeRequest now carries the request context, the populate sweep checks it per page, and there's a final check before unregister — an abandoned attempt aborts and the session keeps serving, so the binding stays accurate and the retry (after backoff) works. Unit test guards abandoned-ctx-never-reaches-unregister.

overCap netting: resolved by deletion. We dropped max_active_sessions entirely — time-based weaning covers the rollout goals (steady-state sessions ≈ restore rate × soak), and the ceiling semantics were the confusing part of the feature. If a hard ceiling turns out to be needed it can come back as a follow-up with the netting done right.

Populate-sweep fault latency: intended, and your read of the mechanics is right — a guest thread faulting on a not-yet-swept page blocks until the address-ordered sweep reaches it; UFFDIO_COPY wakes the waiters (which is also why the event drain discarding pagefaults is safe — now commented). Soak + max_concurrent is the mitigation; interleaving fault service with the sweep is a reasonable follow-up if the tail matters in practice. PR body wording fixed: completion re-reads the whole image (EEXIST skips the copy, not the read).

Test gaps: one correction — TestFCUFFDGraduationLifecycle does run in normal CI: the linux test job builds the pager and exports HYPEMAN_UFFD_PAGER_BINARY (.github/workflows/test.yml), and the KVM+userfaultfd gates pass on those runners, so the real populate+unregister path executes on every PR. Agreed the hermetic coverage was thin regardless: added controller tests for list failure, graduate failure + backoff, and max_concurrent throttling (the fake's error field is now actually used), plus the pager tests above. A seam over uffdCopy/uffdUnregister for a hermetic populate-error-ordering test is a fair follow-up.

Nits: failure backoff added (fixed 5m — each retry is a full-image read, so once a scan was too hot); provider no-op paths now log warnings; the empty-stored-version fallback is now an error instead of guessing (guessing could 404 against the wrong pager and clear a binding for a live session).

Pager VERSION bumped to 0.1.5. Ballooning interaction and UFFDIO_COPY dirty-neutrality remain the two pre-enablement validations, tracked in the PR body.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using high effort and found 4 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ce5009b. Configure here.

log.InfoContext(ctx, "graduating instance off uffd snapshot memory pager",
"instance_id", id, "session_id", sessionID, "pager_version", version)

err = m.firecrackerUFFDPager.CompleteSessionVersion(ctx, version, sessionID)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graduation holds instance lock

Medium Severity

GraduateSnapshotMemoryPager keeps the per-instance mutex for the entire CompleteSessionVersion call, which can run up to completion_timeout (default 10m). StopInstance, StandbyInstance, DeleteInstance, and other paths that take the same lock block for that whole interval on a graduating VM.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ce5009b. Configure here.

if s.takeCompletion() {
return
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completion waits on deferred faults

Medium Severity

After /complete queues work and wakes the fault loop, takeCompletion runs only when len(deferred)==0 and Poll sees the wake pipe. While EAGAIN page faults sit in deferred, the loop skips Poll, never drains the wake pipe, and does not start populate-unregister even though a completion request is already queued.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ce5009b. Configure here.

http.Error(w, "session closed before completion", http.StatusInternalServerError)
case <-r.Context().Done():
http.Error(w, r.Context().Err().Error(), http.StatusGatewayTimeout)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completion success races HTTP handler

Medium Severity

The /complete response uses a buffered channel, so the fault loop can finish populate/unregister, exit, and close sess.done before the HTTP handler reads the success. handleComplete may then take the sess.done branch and return 500 even though the VM already detached, causing graduation to fail and retry until a 404 clears metadata.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ce5009b. Configure here.

// includes the deferred snapshot memory path: memory is fully resident now,
// so the next standby must write a self-contained Full snapshot instead of
// materializing from the source snapshot (which may since be deleted).
clearFirecrackerUFFDRestoreState(stored)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

404 clears pager metadata unsafely

Medium Severity

When CompleteSessionVersion returns ErrSessionNotFound, graduation still clears all UFFD restore metadata and returns success. A 404 only means the pager has no session record, not that the running VM finished populate/unregister, so metadata can drop while the guest still depends on userfaultfd.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ce5009b. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants