
Use UFFD only for initial snapshot fork restores #270

Merged: sjmiller609 merged 17 commits into main from codex/one-shot-uffd-snapshot-forks-v2 on Jun 4, 2026
@sjmiller609 sjmiller609 commented Jun 3, 2026

Collaborator

Summary

  • add a one-shot Firecracker UFFD restore metadata flag
  • arm UFFD only for the first restore of a new fork from standby snapshot state
  • clear the flag after restore/standby/stop so later direct resumes use file-backed restore
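The one-shot behavior in the bullets above can be sketched as follows. This is a simplified illustration rather than the PR's code: the `instanceMeta` type and `restoreBacking` function are hypothetical, only the field name `FirecrackerUseUFFDOnNextRestore` comes from the PR description, and the sketch clears the flag at restore time only, while the PR also clears it on standby and stop.

```go
package main

import "fmt"

// instanceMeta is a hypothetical stand-in for the instance metadata. Only the
// field name FirecrackerUseUFFDOnNextRestore is taken from the PR description.
type instanceMeta struct {
	FirecrackerUseUFFDOnNextRestore bool
}

// restoreBacking reports which memory backing a restore should use and
// consumes the one-shot flag so that the next restore is file-backed.
func restoreBacking(m *instanceMeta) string {
	if m.FirecrackerUseUFFDOnNextRestore {
		m.FirecrackerUseUFFDOnNextRestore = false
		return "uffd"
	}
	return "file"
}

func main() {
	fork := &instanceMeta{FirecrackerUseUFFDOnNextRestore: true} // armed at fork time
	fmt.Println(restoreBacking(fork))                            // first restore of the new fork
	fmt.Println(restoreBacking(fork))                            // any later direct resume
}
```

Running the sketch prints `uffd` and then `file`, which mirrors the intended sequence: UFFD once, file-backed afterwards.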

Validation

  • go test ./lib/instances ./lib/hypervisor/firecracker ./lib/uffdpager
  • validated on deft with local UFFD + Firecracker concurrency merge:
    • single browser fork -> CDP, standby, file-backed restore -> CDP succeeds
    • 25-way Running fork burst: 25/25 succeeded; all requests started together; no source-state/vsock copy races

Note: the runtime validation used the Firecracker concurrency branch merged locally, since that branch removes the source alias/fork race this behavior relies on for burst fanout.


Note

High Risk
Changes Firecracker snapshot memory, fork fanout, and UFFD/systemd integration on a critical VM lifecycle path; incorrect backing or locking could affect restore correctness or concurrent forks.

Overview
This PR limits Firecracker UFFD to a one-shot path on standby snapshot fanout, instead of using UFFD on every restore when the backend is configured for UFFD.

New forks from standby (instance or snapshot) can skip copying snapshot-latest/memory, record deferred backing and FirecrackerUseUFFDOnNextRestore, and restore once via the pager against shared source memory. Standby then materializes local memory via SnapshotOptions.DeferredMemoryBackingPath before the diff snapshot is taken; later restores use file-backed memory. Metadata and lifecycle paths clear the one-shot flag and UFFD session state after restore, standby, and stop.

Fork/copy gains optional path skips, CopyRegularFile, and stricter skipping of stale *.sock artifacts. Restore accepts SnapshotMemoryBackingPath; the UFFD supervisor moves env to /run/hypeman/uffd with CI-scoped systemd instance names. CI/Makefile install the pager unit template, pass pager binary env vars, and add an integration test for the full UFFD lifecycle.

Reviewed by Cursor Bugbot for commit c74c539. Bugbot is set up for automated code reviews on this repo.

@sjmiller609 sjmiller609 marked this pull request as ready for review June 4, 2026 14:48
Comment thread lib/instances/fork.go
@firetiger-agent


Created a monitoring plan for this PR.

What this PR does: Speeds up VM fork fanout in the hot pool by deferring the large snapshot memory file copy when creating Firecracker forks with UFFD enabled — forked VMs start immediately using the source's memory as backing, and materialize their own copy only when entering standby.

Intended effect:

  • Hot pool warm failures (kernel_hotpool_warm_failure_total): baseline 40K–76K/hr (active hours); confirmed if no sustained increase post-deploy
  • Hot pool hit rate (kernel_hotpool_hit_total): baseline 2.3M–3.7M/hr (active hours); confirmed if rate holds steady or improves within a same-day-of-week window
  • API error rate: baseline ~0.13–0.17% during active hours; confirmed if it stays below 0.5%

Note: hypeman has no OTel telemetry — logs route to Railway stdout only. Hot pool metrics proxied through the kernel API are the observable signal. CI pass/fail is the primary authority for correctness.

Risks:

  • Deferred path pointing to a deleted source: if a source snapshot is deleted while a fork still holds FirecrackerDeferredSnapshotMemoryPath, standby fails with a "materialize deferred snapshot memory" ERROR in Railway stdout; alert if hot pool warm failures exceed 100K/hr for 2+ hours
  • Standby memory materialize failure: the large memory file copy at standby time could fail on an I/O error; this manifests as "snapshot failed, attempting to resume VM" in Railway stdout and elevated hot pool warm failures
  • Hot pool hit rate regression: a bug in the one-shot optimization path could cause forks to fail more often; alert if hit rate drops >50% vs same-day-of-week baseline for 2+ consecutive active hours
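The "exceeds a limit for 2+ hours" alert conditions above can be expressed as a check over the most recent hourly samples. The `sustainedBreach` function and the sample values are illustrative assumptions, not part of the monitoring plan's actual implementation.

```go
package main

import "fmt"

// sustainedBreach reports whether each of the last `hours` hourly samples
// is strictly above limit. It returns false when there is too little data.
func sustainedBreach(hourly []float64, limit float64, hours int) bool {
	if hours <= 0 || len(hourly) < hours {
		return false
	}
	for _, v := range hourly[len(hourly)-hours:] {
		if v <= limit {
			return false
		}
	}
	return true
}

func main() {
	// Made-up hourly warm-failure counts, checked against the 100K/hr limit.
	warmFailures := []float64{60_000, 120_000, 130_000}
	fmt.Println(sustainedBreach(warmFailures, 100_000, 2))
}
```

Requiring every sample in the window to breach the limit avoids alerting on a single noisy hour, which matches the "2+ hours" wording of the risks.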

Status updates will be posted automatically on this PR as monitoring progresses.


Base automatically changed from hypeship/uffd-pager-v2 to main June 4, 2026 15:43
Comment thread lib/instances/restore.go
@sjmiller609 sjmiller609 force-pushed the codex/one-shot-uffd-snapshot-forks-v2 branch from cc00f35 to dd404e4 Compare June 4, 2026 19:01
@sjmiller609 sjmiller609 changed the base branch from main to hypeship/fc-fork-concurrency-v2 June 4, 2026 19:01
Comment thread lib/instances/stop.go Outdated
Comment thread .github/workflows/test.yml
Comment thread lib/forkvm/copy.go
Comment thread lib/hypervisor/firecracker/firecracker.go
Comment thread lib/instances/firecracker_test.go
@sjmiller609 sjmiller609 requested a review from hiroTamada June 4, 2026 21:33

@hiroTamada hiroTamada left a comment

Contributor


nice work — the one-shot design cleanly confines the fragile pager/backing dependency to a single post-fork restore, and the lifecycle clearing + the TestFCUFFDOneShotLifecycle integration test (disk + /dev/shm files across the full fork→standby→restore→chained-fork chain) give good confidence on the snapshot-isolation risk.

both bugbot findings look already resolved on this head: the chained-fork backing path via the firecrackerDeferredSnapshotMemoryPath helper (regression-tested at firecracker_test.go:817), and the session close on the network-reconfigure failure path (restore.go:337).

a few non-blocking nits below, mostly around maintainability. approving.

Comment thread lib/instances/stop.go Outdated
Comment thread lib/hypervisor/firecracker/firecracker.go
Comment thread scripts/install.sh
Comment thread lib/instances/manager.go Outdated
Comment thread lib/instances/fork.go Outdated
Base automatically changed from hypeship/fc-fork-concurrency-v2 to main June 4, 2026 21:40
@sjmiller609 sjmiller609 force-pushed the codex/one-shot-uffd-snapshot-forks-v2 branch from 36f0992 to c74c539 Compare June 4, 2026 21:55

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit c74c539.

Comment thread lib/instances/fork.go
@sjmiller609 sjmiller609 merged commit acc8ac4 into main Jun 4, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the codex/one-shot-uffd-snapshot-forks-v2 branch June 4, 2026 22:07