poll device readiness before formatting sticky disk #111

Draft
devin-ai-integration[bot] wants to merge 3 commits into main from devin/1782265046-fix-sticky-disk-device-readiness

devin-ai-integration[bot] commented Jun 24, 2026

Summary

Fixes a race condition where mkfs.ext4 /dev/vdb fails with "Device size reported to be zero" after a Firecracker drive hot-swap.

Root cause: Firecracker's PATCH /drives is host-synchronous: it returns once the backing file is swapped on the host. The guest kernel learns the new capacity separately, via an async virtio config-change interrupt. When the gRPC response reaches setup-docker-builder and it immediately runs mkfs, only ~50ms may have elapsed, which is not always enough for the guest's virtio-blk driver to update the device capacity.

Fix: Add waitForDeviceReady(device) at the top of maybeFormatBlockDevice. It polls blockdev --getsize64 until the device reports a non-zero size (50ms interval, 5s timeout). The wait is condition-based rather than a fixed delay: it lasts only as long as the kernel needs, to within one 50ms poll interval.

+async function waitForDeviceReady(device: string): Promise<void> {
+  // poll blockdev --getsize64 until size > 0 (50ms interval, 5s timeout)
+}
+
 async function maybeFormatBlockDevice(device: string): Promise<string> {
+  await waitForDeviceReady(device);
   // ... blkid check, mkfs.ext4
 }
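
The diff above elides the body of waitForDeviceReady. As a rough sketch of what such a poll loop could look like (not the code in this PR): readDeviceSize, the SizeReader type, and the extra optional parameters are hypothetical names added here so the loop can be exercised without a real block device. The 50ms interval and 5s timeout are taken from the description.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Hypothetical helper type: returns the device size in bytes, or 0 if unknown.
type SizeReader = (device: string) => Promise<number>;

// Hypothetical helper: asks `blockdev --getsize64` for the size in bytes.
// Any failure (device missing, unparsable output) is treated as "not ready".
async function readDeviceSize(device: string): Promise<number> {
  try {
    const { stdout } = await execFileAsync("blockdev", ["--getsize64", device]);
    const size = Number.parseInt(stdout.trim(), 10);
    return Number.isFinite(size) ? size : 0;
  } catch {
    return 0;
  }
}

// Polls until the device reports a non-zero size, or throws once the timeout
// has passed. The optional parameters are assumptions made for testability.
async function waitForDeviceReady(
  device: string,
  readSize: SizeReader = readDeviceSize,
  intervalMs = 50,
  timeoutMs = 5000,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if ((await readSize(device)) > 0) return; // kernel has seen the new capacity
    if (Date.now() >= deadline) {
      throw new Error(`${device} still reports zero size after ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Injecting the size reader keeps the timing logic separate from the blockdev call, so a fake reader can stand in for the hot-swapped device in unit tests.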

Observed in FastActions/fa#4268 CI — 51ms between PATCH /drives/2 completing on the host and mkfs.ext4 failing in the guest.

cc @brucemakallan

Link to Devin session: https://app.devin.ai/sessions/45a3be2102b746d393226ba6fec5b390


After a Firecracker drive hot-swap (PATCH /drives), the guest kernel
learns the new backing device size via an async virtio config-change
interrupt. Fast consumers like mkfs.ext4 can observe a zero device size
if they run before the interrupt is processed.

Poll blockdev --getsize64 until the device reports non-zero (up to 5s,
50ms interval) before attempting blkid or mkfs.

Co-Authored-By: Paul Bardea <paul@blacksmith.sh>

devin-ai-integration Bot and others added 2 commits June 24, 2026 01:42
Co-Authored-By: Paul Bardea <paul@blacksmith.sh>
The 5s timeout caused the step duration regression test to fail because
in CI the placeholder device never gets hot-swapped, so the full timeout
was consumed. Reduce to 2s (still well above the ~51ms virtio race) and
warn instead of throw on timeout, letting the existing mkfs error
handling proceed naturally.

Co-Authored-By: Paul Bardea <paul@blacksmith.sh>
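
The revised behavior described in this commit (2s timeout, warn instead of throw) might look roughly like the sketch below. It is an illustration under assumptions, not the committed code: pollUntilNonZeroOrWarn, the injected readSize and warn callbacks, and the boolean return value are all hypothetical.

```typescript
// Hypothetical variant of the readiness poll: on timeout it logs a warning and
// returns false instead of throwing, so the caller can carry on to mkfs and let
// its existing error handling report a device that is genuinely unusable.
async function pollUntilNonZeroOrWarn(
  device: string,
  readSize: (device: string) => Promise<number>, // e.g. wraps `blockdev --getsize64`
  intervalMs = 50,
  timeoutMs = 2000,
  warn: (message: string) => void = console.warn,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if ((await readSize(device)) > 0) return true; // device is ready
    if (Date.now() >= deadline) {
      warn(`${device} still reports zero size after ${timeoutMs}ms; continuing`);
      return false; // not ready, but deliberately not an error
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

With this shape, a placeholder device that never gets hot-swapped costs at most about one timeout period per step rather than failing outright, which matches the motivation given in the commit message.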