Conversation


@ticpu ticpu commented Dec 2, 2025

Implement a new storage driver for bcachefs filesystems that uses subvolumes and snapshots for container layer management, similar to the existing btrfs driver.

Features:

  • Implementation using direct ioctl syscalls (see the sketch after this list)
  • Subvolume creation via BCH_IOCTL_SUBVOLUME_CREATE
  • Snapshot creation with BCH_SUBVOL_SNAPSHOT_CREATE flag
  • Subvolume detection using statx() with STATX_SUBVOL
  • Recursive nested subvolume deletion
  • Support for both root and rootless operation
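
Roughly, the approach looks like the simplified Go sketch below. This is not the actual driver code from this PR: the struct layout, the ioctl magic/number, the BCH_SUBVOL_SNAPSHOT_CREATE value, the convention of passing an absolute destination path alongside the parent directory's fd, and the availability of STATX_SUBVOL / Statx_t.Subvol in golang.org/x/sys/unix (Linux >= 6.10) are all assumptions here, based on the kernel's bcachefs_ioctl.h.

```go
package bcachefs

import (
	"os"
	"path/filepath"
	"runtime"
	"unsafe"

	"golang.org/x/sys/unix"
)

// bchIoctlSubvolume mirrors the assumed layout of the kernel's
// struct bch_ioctl_subvolume from bcachefs_ioctl.h.
type bchIoctlSubvolume struct {
	Flags  uint32
	Dirfd  uint32
	Mode   uint16
	Pad    [3]uint16
	DstPtr uint64 // address of NUL-terminated destination path
	SrcPtr uint64 // address of NUL-terminated source path (snapshots only)
}

const (
	bchSubvolSnapshotCreate = 1 << 0 // assumed BCH_SUBVOL_SNAPSHOT_CREATE
	bchIoctlMagic           = 0xbc   // assumed bcachefs ioctl magic
	bchSubvolumeCreateNr    = 16     // assumed BCH_IOCTL_SUBVOLUME_CREATE number
)

// iow reproduces _IOW(type, nr, size) using the common Linux ioctl encoding.
func iow(typ, nr, size uintptr) uintptr {
	const iocWrite = 1
	return iocWrite<<30 | size<<16 | typ<<8 | nr
}

// createSubvolume creates a subvolume at dst; if src is non-empty, dst is
// created as a snapshot of the subvolume at src instead.
func createSubvolume(dst, src string) error {
	dir, err := os.Open(filepath.Dir(dst))
	if err != nil {
		return err
	}
	defer dir.Close()

	dstPtr, err := unix.BytePtrFromString(dst)
	if err != nil {
		return err
	}
	arg := bchIoctlSubvolume{
		Dirfd:  uint32(dir.Fd()),
		Mode:   0o755,
		DstPtr: uint64(uintptr(unsafe.Pointer(dstPtr))),
	}
	var srcPtr *byte
	if src != "" {
		if srcPtr, err = unix.BytePtrFromString(src); err != nil {
			return err
		}
		arg.Flags = bchSubvolSnapshotCreate
		arg.SrcPtr = uint64(uintptr(unsafe.Pointer(srcPtr)))
	}

	req := iow(bchIoctlMagic, bchSubvolumeCreateNr, unsafe.Sizeof(arg))
	_, _, errno := unix.Syscall(unix.SYS_IOCTL, dir.Fd(), req, uintptr(unsafe.Pointer(&arg)))
	runtime.KeepAlive(dstPtr)
	runtime.KeepAlive(srcPtr)
	if errno != 0 {
		return errno
	}
	return nil
}

// isSubvolume reports whether path is the root of a bcachefs subvolume.
// Assumes Linux >= 6.10 and an x/sys/unix recent enough to expose
// STATX_SUBVOL and Statx_t.Subvol; a subvolume root is taken to be a
// directory whose subvolume ID differs from its parent's.
func isSubvolume(path string) (bool, error) {
	var self, parent unix.Statx_t
	if err := unix.Statx(unix.AT_FDCWD, path, unix.AT_SYMLINK_NOFOLLOW, unix.STATX_SUBVOL, &self); err != nil {
		return false, err
	}
	if err := unix.Statx(unix.AT_FDCWD, filepath.Dir(path), unix.AT_SYMLINK_NOFOLLOW, unix.STATX_SUBVOL, &parent); err != nil {
		return false, err
	}
	if self.Mask&unix.STATX_SUBVOL == 0 {
		return false, unix.ENOTSUP // kernel/filesystem did not report a subvolume ID
	}
	return self.Subvol != parent.Subvol, nil
}
```

Creating a layer on top of a parent then amounts to a createSubvolume(childPath, parentPath) call, and the recursive deletion mentioned above would walk nested subvolumes bottom-up (using the same statx-based detection) before destroying each one via the matching destroy ioctl.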

Tested on my system with multiple images:

❯ podman run --rm docker.io/library/nginx:latest nginx -v
Trying to pull docker.io/library/nginx:latest...
Getting image source signatures
Copying blob 53d743880af4 done   | 
Copying blob 0e4bc2bd6656 done   | 
Copying blob 108ab8292820 done   | 
Copying blob 192e2451f875 done   | 
Copying blob 77fa2eb06317 done   | 
Copying blob b5feb73171bf done   | 
Copying blob de57a609c9d5 done   | 
Copying config 60adc2e137 done   | 
Writing manifest to image destination
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
nginx version: nginx/1.29.3

❯ podman run --rm docker.io/library/python:3.12-slim python --version
Trying to pull docker.io/library/python:3.12-slim...
Getting image source signatures
Copying blob b7ba6d2a1fc7 done   | 
Copying blob 490b9a1c25e4 done   | 
Copying blob 0e4bc2bd6656 skipped: already exists  
Copying blob 0674d14a155c done   | 
Copying config 445121148b done   | 
Writing manifest to image destination
Python 3.12.12

❯ podman image mount ceph:v18 
/var/lib/containers/storage/bcachefs/subvolumes/7a295044d828c8a95725ef60009582c7a8a0c455ab9abd9ee9b350b0dd4c6d30

❯ ls /var/lib/containers/storage/bcachefs/subvolumes/7a295044d828c8a95725ef60009582c7a8a0c455ab9abd9ee9b350b0dd4c6d30
afs  bin  boot  dev  etc  home  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

❯ podman run --rm --entrypoint=/bin/python3 -ti ceph:v18 
Python 3.9.21 (main, Feb 10 2025, 00:00:00) 
[GCC 11.5.0 20240719 (Red Hat 11.5.0-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

❯ podman image ls
REPOSITORY                TAG         IMAGE ID      CREATED       SIZE
docker.io/library/python  3.12-slim   445121148b18  13 days ago   123 MB
docker.io/library/nginx   latest      60adc2e137e7  13 days ago   155 MB
docker.io/library/alpine  latest      706db57fb206  7 weeks ago   8.62 MB
quay.io/ceph/ceph         v18         0f5473a1e726  6 months ago  1.27 GB

❯ podman image rm ceph:v18 
Error: image used by 085e9c2853013e627c41e8e1833655a7b73cf4ba45a19556102c3675dc840900: image is in use by a container: consider listing external containers and force-removing image

❯ podman rm -f ceph18 
ceph18

❯ podman image rm ceph:v18 
Untagged: quay.io/ceph/ceph:v18
Deleted: 0f5473a1e726b0feaff0f41f8de8341c0a94f60365d4584f4c10bd6b40d44bc1

❯ pwd && ls | wc -l
/var/lib/containers/storage/bcachefs/subvolumes
11

❯ podman image prune -a
WARNING! This command removes all images without at least one container associated with them.
Are you sure you want to continue? [y/N] y
706db57fb2063f39f69632c5b5c9c439633fda35110e65587c5d85553fd1cc38
60adc2e137e757418d4d771822fa3b3f5d3b4ad58ef2385d200c9ee78375b6d5
445121148b187db67e48799f002500623fa22d9f635e522f4e0f345414bd9107

❯ ls | wc -l
0

Signed-off-by: Jérôme Poulin <jeromepoulin@gmail.com>
@github-actions github-actions bot added the storage label (Related to "storage" package) Dec 2, 2025
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Dec 2, 2025
@podmanbot

✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6559

Member

@Luap99 Luap99 left a comment


ref #146

Personally I am not convinced of the value of yet another storage driver in this code base. bcachefs was dropped from the upstream kernel, which makes any sort of testing more complicated, as it will likely not be available by default in either Fedora or Debian, which are the only distros we test in CI.

We already have largely unmaintained btrfs and zfs drivers in this code base, so I don't want to add more of this.

Or to turn the question around: what does using this driver offer over overlayfs on top of bcachefs?

cc @mtrmac @giuseppe @nalind @mheon

@giuseppe
Member

giuseppe commented Dec 2, 2025

How does it behave when you use user namespaces? E.g., what is the performance when you run a bunch of podman run --userns auto ... containers?

@mtrmac
Contributor

mtrmac commented Dec 2, 2025

Personally I am not convinced of the value of yet another storage driver in this code base.

As in #146 (comment), the future directions that have been most discussed are, indeed, ~incompatible with filesystem-snapshot-based layer storage.

OTOH, *shrug* the generic code is already constrained by the need to support unprivileged vfs, so having 2/3/4 snapshot-based graph drivers is not that much of a difference, assuming that everyone is fine with … “benign neglect”, where PR authors might be asked to ensure the other graph drivers compile but nothing beyond that.

E.g., AFAICS we do no other testing of ZFS in this repo. From time to time, there are issues reported, and they frequently go without substantive response.

But then again, the “no upstream testing” situation seems to work for FreeBSD … maybe, well enough? (Or maybe it is going so badly that I don’t even know how bad it is.)

There is some middle ground where a non-default approach is clearly not the primary focus, not recommended and not shipped as primary release artifact of e.g. Podman, but support present in the main project’s repository makes things a bit easier for people interested in that non-default approach.

Maybe the important components for this are:

  • Most importantly, users are not misled into the non-default approach. (FreeBSD users know they don’t want to use Linux, and that was ~clearly a considered choice.)
  • The cost on the project’s primary deliverables, and maintainers, is close enough to zero that no-one is motivated to propose removing the support.
  • There is, nevertheless, some benefit to maintaining the non-default approach in the projects’ repo directly.

Purely in the abstract, new filesystems are a Big Deal, and … potentially very valuable? The promise of Btrfs as the universal future, I guess, didn’t quite pan out that way (yet?), but maybe something (bcachefs or something created in the future) could happen in this decade (or the next one). And such future filesystems need some place to experiment and grow into maturity, even if users were kept strongly discouraged from deploying the new filesystems into production for many years. So, in principle, I might be fine with paying a trivial non-recurring cost just for the project to have an opportunity to experiment — as long as the experiment is, in some sense, going in the right direction. So, the overlay-over-bcachefs question.


bcachefs was dropped from the upstream kernel, which makes any sort of testing more complicated, as it will likely not be available by default in either Fedora or Debian, which are the only distros we test in CI.

This is important — I think realistically we would not be testing the proposed driver, limiting it to the “will be kept buildable” position.


Or to turn the question around: what does using this driver offer over overlayfs on top of bcachefs?

Yes — the original discussion in #146 motivated this by saying that overlay on top of bcachefs is not possible, but later there was a pointer to a patch set implementing this. Why shouldn’t this be the universally adopted setup?

If overlay is viable in the short (or intermediate?) term (I don’t know), and if we’d prefer users to use overlay over a new graph driver (I’m not sure but it seems very likely), then it might be better to not add a new graph driver that could direct users away from the preferred path.

@Luap99
Member

Luap99 commented Dec 3, 2025

This is important — I think realistically we would not be testing the proposed driver, limiting it to the “will be kept buildable” position.

That is what we have with zfs, yes. But we used to actually have zfs tests, so to me the situation is different.
If we cannot have any testing story from the beginning, then I really don't want to accept code for something as critical as the storage driver.
Do we do anyone a favour if some upgrade breaks the driver in a way where it fails on all storage operations or, in the worst case, corrupts (or deletes) data?

I find it quite unsatisfactory to respond to zfs and btrfs bug reports with: hey, actually no active maintainer cares about these drivers, so issues there are unlikely to be fixed.

Keeping it buildable has a cost too: any time someone needs to change the driver interface, it is more work.
At the very least we now have things like #378; adding new drivers that cannot perform an unlocked extraction is not great, as we know they do not "perform" well when used in parallel.

@LebedevRI

This is important — I think realistically we would not be testing the proposed driver, limiting it to the “will be kept buildable” position.

That is what we have with zfs, yes. But we used to actually have zfs tests, so to me the situation is different. If we cannot have any testing story from the beginning, then I really don't want to accept code for something as critical as the storage driver. Do we do anyone a favour if some upgrade breaks the driver in a way where it fails on all storage operations or, in the worst case, corrupts (or deletes) data?

For the testing story, do you require that the code be in official distro repos, or would a project-specific repo suffice? As far as we are concerned, there are fairly good official repos with DKMS + tools for Debian/Ubuntu and SUSE/Fedora. Would that not suffice? I wouldn't call that "no way to test", at least.

I find it quite unsatisfactory to respond to zfs and btrfs bug reports with: hey, actually no active maintainer cares about these drivers, so issues there are unlikely to be fixed.

Keeping it buildable has a cost too: any time someone needs to change the driver interface, it is more work. At the very least we now have things like #378; adding new drivers that cannot perform an unlocked extraction is not great, as we know they do not "perform" well when used in parallel.

@Luap99
Member

Luap99 commented Dec 3, 2025

For the testing story, do you require that the code be in official distro repos, or would a project-specific repo suffice? As far as we are concerned, there are fairly good official repos with DKMS + tools for Debian/Ubuntu and SUSE/Fedora. Would that not suffice? I wouldn't call that "no way to test", at least.

We maintain our own Debian and Fedora images here: https://github.com/containers/automation_images/
While in theory we have no particular rules against using extra repos, I would be quite hesitant. The system is already quite complex with our large dependency chain and the many regressions we find. Adding more variables there is just not sustainable for us. What if the DKMS build fails, etc.?

So that is why I said no way to test; I didn't mean it is an unsolvable problem.
I really need to be convinced first why this needs to exist (compared to just using overlayfs on top) to justify the additional work indefinitely.

@ticpu
Author

ticpu commented Dec 4, 2025

I understand the concerns about the CI having to test on yet another filesystem. Bcachefs was removed from the kernel because its development is moving a bit too quickly to keep up with release candidates.

The main bcachefs features that make me prefer a direct snapshot driver are:

  • Speed: in my gitlab-runner benchmarks of overlay-on-bcachefs vs the native bcachefs driver, the native driver was ~10% faster for builds; VFS was even slower. I'm willing to provide controlled benchmarks to make sure it wasn't just hot in memory.
  • Native CoW snapshots avoid overlay's copy-up overhead and diff calculations at commit time, and both snapshot creation and deletion are instantaneous operations on bcachefs.
  • We can now add more, slower storage and keep deeper, compressed caches; this will also improve performance over time.

Btrfs offers compression and async snapshot deletion, but will (probably) never offer cache tiers without adding an intermediate layer of complexity.

@giuseppe, I didn't test that; are there any expectations? Is it xattr or something?

For the CI, bcachefs now offers stable/unstable/snapshot repositories for Ubuntu Questing at https://apt.bcachefs.org/, and Arch Linux plus some others already support it well. So it would be a matter of adding it as a DKMS package. By using the stable branch, I'm pretty sure there won't be any breakage. There's a very active community on IRC maintaining various packages that will make sure to point out any 404s in that regard.

Finally, to address the concern about maintenance burden over time: as long as there is no easy plugin architecture for drivers, what level of maintenance commitment would be acceptable? How often is the driver interface expected to change? I'll be using this driver for multiple GitLab runners in production if it works out.
