[Feature Request] Block-level copy-on-write overlay with dirty bitmap tracking for fast VM cloning and reset #5795

@meAmitPatil

Problem

When running multiple Firecracker VMs from the same base rootfs, three operations carry significant overhead:

  1. Cloning — creating a new VM requires copying the entire root disk image. For a 400MB rootfs, that's ~200ms per clone, scaling linearly with VM count.
  2. Resetting — returning a VM's disk to a clean state requires the same full disk copy.
  3. Snapshotting disk state — Firecracker snapshots capture VM state and memory but not disk changes. Operators must manage disk state externally, typically by copying the full disk image.

Proposal

Add an Overlay variant to the FileEngine enum that implements block-level copy-on-write inside the VMM:

  • A shared read-only base image serves reads for unmodified blocks
  • A per-VM sparse overlay file captures writes
  • A dirty bitmap (one bit per 4KB block) tracks which blocks have been written and routes reads to the correct source
  • A delta file format with CRC64 integrity captures only dirty blocks for efficient snapshot persistence

This sits at the same layer as the existing Sync and Async file engines — transparent to the guest, no changes to the virtio protocol, no guest-side components.
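
To make the read and write routing concrete, here is a minimal in-memory sketch of the idea. It is not the code from the fork: the `OverlayDisk` class, its method names, and the use of a dict in place of the sparse overlay file are all assumptions made for illustration, and it only handles whole-block I/O.

```python
BLOCK_SIZE = 4096  # one bitmap bit per 4 KB block, as in the proposal


class OverlayDisk:
    """Hypothetical model of the overlay: shared base plus per-VM writes."""

    def __init__(self, base: bytes):
        if len(base) % BLOCK_SIZE:
            raise ValueError("base image must be a whole number of blocks")
        self.base = base  # shared between VMs, never modified
        self.num_blocks = len(base) // BLOCK_SIZE
        self.overlay = {}  # block index -> bytes; stands in for the sparse file
        self.dirty = bytearray((self.num_blocks + 7) // 8)  # 1 bit per block

    def _check(self, block: int) -> None:
        if not 0 <= block < self.num_blocks:
            raise IndexError("block index out of range")

    def is_dirty(self, block: int) -> bool:
        self._check(block)
        return bool(self.dirty[block // 8] & (1 << (block % 8)))

    def read_block(self, block: int) -> bytes:
        # Dirty blocks come from the overlay, everything else from the base.
        if self.is_dirty(block):
            return self.overlay[block]
        start = block * BLOCK_SIZE
        return self.base[start:start + BLOCK_SIZE]

    def write_block(self, block: int, data: bytes) -> None:
        self._check(block)
        if len(data) != BLOCK_SIZE:
            raise ValueError("this sketch only handles whole-block writes")
        self.overlay[block] = bytes(data)
        self.dirty[block // 8] |= 1 << (block % 8)

    def dirty_blocks(self) -> list:
        return [b for b in range(self.num_blocks) if self.is_dirty(b)]


# Two "VMs" share one base image; a write in one is invisible to the other.
base = bytes(4 * BLOCK_SIZE)
vm_a = OverlayDisk(base)
vm_b = OverlayDisk(base)
vm_a.write_block(2, b"\x01" * BLOCK_SIZE)
```

A real engine would also need to handle writes smaller than a block (reading the rest of the block from the base first) and persist the bitmap, neither of which is shown here.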

API

{
  "drive_id": "rootfs",
  "path_on_host": "/path/to/overlay.ext4",
  "base_path": "/path/to/base-rootfs.ext4",
  "is_root_device": true,
  "io_engine": "Overlay"
}

On snapshot create, an optional block_delta_dir field names a directory into which only the dirty blocks are written:

{
  "snapshot_path": "/path/to/snap.bin",
  "mem_file_path": "/path/to/snap.mem",
  "block_delta_dir": "/path/to/deltas/"
}

On restore, passing the same block_delta_dir applies the saved delta to a fresh overlay. This enables cloning from a snapshot without copying the full disk.
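
The delta file layout is not spelled out above, so the following is only an illustrative sketch of how dirty blocks could be written with a CRC64 trailer and verified on restore. The `OVLDELTA` magic, the header fields, the record layout, and the function names are all hypothetical, and the CRC-64/XZ variant is an arbitrary choice since the proposal only says "CRC64".

```python
import struct

BLOCK_SIZE = 4096
MAGIC = b"OVLDELTA"               # hypothetical 8-byte tag, not from the proposal
_HEADER = struct.Struct("<8sII")  # magic, block size, number of records
_INDEX = struct.Struct("<Q")      # block index preceding each block's data
_TRAILER = struct.Struct("<Q")    # CRC64 over header + records


def crc64_xz(data: bytes) -> int:
    """Bitwise CRC-64/XZ (reflected polynomial 0xC96C5795D7870F42)."""
    crc = 0xFFFFFFFFFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xC96C5795D7870F42 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFFFFFFFFFF


def encode_delta(dirty: dict) -> bytes:
    """Serialize {block index: block bytes}, dirty blocks only."""
    parts = [_HEADER.pack(MAGIC, BLOCK_SIZE, len(dirty))]
    for index in sorted(dirty):
        if len(dirty[index]) != BLOCK_SIZE:
            raise ValueError("each record must be exactly one block")
        parts.append(_INDEX.pack(index))
        parts.append(dirty[index])
    body = b"".join(parts)
    return body + _TRAILER.pack(crc64_xz(body))


def decode_delta(blob: bytes) -> dict:
    """Verify the checksum, then return {block index: block bytes}."""
    if len(blob) < _HEADER.size + _TRAILER.size:
        raise ValueError("delta too short")
    body = blob[:-_TRAILER.size]
    (expected,) = _TRAILER.unpack(blob[-_TRAILER.size:])
    if crc64_xz(body) != expected:
        raise ValueError("delta checksum mismatch")
    magic, block_size, count = _HEADER.unpack_from(body, 0)
    if magic != MAGIC or block_size != BLOCK_SIZE:
        raise ValueError("unrecognized delta header")
    record = _INDEX.size + BLOCK_SIZE
    if len(body) != _HEADER.size + count * record:
        raise ValueError("delta length does not match record count")
    dirty = {}
    offset = _HEADER.size
    for _ in range(count):
        (index,) = _INDEX.unpack_from(body, offset)
        dirty[index] = body[offset + _INDEX.size:offset + record]
        offset += record
    return dirty


# Two dirty blocks round-trip through the hypothetical format.
dirty = {3: b"\x01" * BLOCK_SIZE, 7: b"\x02" * BLOCK_SIZE}
blob = encode_delta(dirty)
restored = decode_delta(blob)
```

With this layout the delta grows with the number of dirty blocks rather than with the disk size, which is the property the clone numbers below depend on.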

Benchmark Results

Bare metal (AMD Ryzen 9 7950X3D, NVMe, 128GB RAM). Guest: 2 vCPUs, 256MB RAM, 396MB rootfs.

Clone

| Metric | Sync (current) | Overlay | Improvement |
|---|---|---|---|
| Clone disk cost | 205ms (full 396MB copy) | 0ms (560KB delta) | no disk copy needed |
| Total clone cost | 430ms | 223ms | ~2x faster |
| Clone data size | 396MB | 560KB | 700x smaller |

Reset

| Method | Time |
|---|---|
| Full disk copy (current) | 190ms |
| Overlay truncate + bitmap clear | 1-2ms |
| Speedup | ~100x |
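
As a rough illustration of why the overlay reset is cheap, the sketch below truncates an overlay file and clears a bitmap sized for the 396MB rootfs. The `reset_overlay` helper and the file name are hypothetical, the rootfs size is assumed to be in MiB, and truncating to zero length (rather than back to the full disk size) is an assumption about what "truncate" means here.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # one bitmap bit per 4 KB block, as in the proposal


def reset_overlay(overlay_path: str, bitmap: bytearray) -> None:
    """Discard all guest writes: empty the overlay and clear every dirty bit."""
    os.truncate(overlay_path, 0)    # drop the overlay's data in one call
    bitmap[:] = bytes(len(bitmap))  # clear in place, keeping the same object


# Bitmap size for the benchmark rootfs (assuming 396 MiB).
disk_bytes = 396 * 1024 * 1024
num_blocks = disk_bytes // BLOCK_SIZE      # 101,376 blocks
bitmap = bytearray((num_blocks + 7) // 8)  # 12,672 bytes

with tempfile.TemporaryDirectory() as tmp:
    overlay_path = os.path.join(tmp, "overlay.ext4")
    with open(overlay_path, "wb") as f:
        f.seek(5 * BLOCK_SIZE)
        f.write(b"\xaa" * BLOCK_SIZE)  # simulate a guest write to block 5
    bitmap[5 // 8] |= 1 << (5 % 8)     # mark block 5 dirty

    reset_overlay(overlay_path, bitmap)
    size_after = os.path.getsize(overlay_path)
```

The work done by the reset is one truncate call plus zeroing roughly 12 KB of bitmap, independent of how much data the guest wrote.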

Snapshot

Snapshot time is the same for both engines (~225ms) because it is dominated by the guest memory dump. The overlay advantage is that no separate disk copy is needed for cloning: the delta file (dirty blocks only) is produced as part of the snapshot.

Implementation

I have a working implementation on my fork.
Happy to open a PR upstream if there is interest.

Labels: Status: Awaiting author, Type: Enhancement
