virtio/fs/macos: use opendir instead of fdopendir and abstract DirStream #542

Closed
slp wants to merge 2 commits into containers:main from slp:fs-macos-refactor-dirstream

@slp
Collaborator

@slp slp commented Feb 10, 2026

fdopendir seems to be bogus on macOS, as a succession of fdopendir() -> readdir() -> closedir() -> fdopendir() leads to a DirStream that always returns null on readdir(). This confused the guest into missing some files when reading a directory.

Let's replace fdopendir() with opendir() which appears to work fine.

This also fixes the issue that we were calling fdopendir() with HandleData->File's fd, with the former taking ownership of it and closing it on closedir(). This could cause an error like this when building the library with debug-assertions:

fatal runtime error: IO Safety violation: owned file descriptor already closed, aborting

Fixes: #539
Fixes: #541

slp added 2 commits February 10, 2026 16:51
fdopendir seems to be bogus on macOS, as a succession of fdopendir() ->
readdir() -> closedir() -> fdopendir() leads to a DirStream that always
returns null on readdir(). This confused the guest into missing some
files when reading a directory.

Let's replace fdopendir() with opendir() which appears to work fine.

This also fixes the issue that we were calling fdopendir() with
HandleData->File's fd, with the former taking ownership of it and
closing it on closedir(). This could cause an error like this when
building the library with debug-assertions:

fatal runtime error: IO Safety violation: owned file descriptor already closed, aborting

Fixes: containers#539
Fixes: containers#541

Signed-off-by: Sergio Lopez <slp@redhat.com>

Abstract DirStream properties and operations to be managed in a cleaner
and more idiomatic way.

Signed-off-by: Sergio Lopez <slp@redhat.com>
@pftbest
Contributor

pftbest commented Feb 10, 2026

It works great, all apt-get errors are gone, thanks!

I'm a little bit curious about the fdopendir, couldn't one just check fcntl(fd, F_GETFD) that fd is valid before calling fdopendir on it? Or I guess it would still be a data race if something happens between the calls. Maybe if you lock a mutex each time you do this check.

@slp
Collaborator Author

slp commented Feb 11, 2026

> I'm a little bit curious about the fdopendir, couldn't one just check fcntl(fd, F_GETFD) that fd is valid before calling fdopendir on it? Or I guess it would still be a data race if something happens between the calls. Maybe if you lock a mutex each time you do this check.

The fd was still valid, and the virtio-fs device operates from a single worker thread. I'm not sure why this is happening on macOS, but this isn't the first time we've had to work around weird behavior on this OS.

@pftbest
Contributor

pftbest commented Feb 11, 2026

I did some more digging, I reverted back to your previous commit (with dup() call) and I found this:

diff --git a/src/devices/src/virtio/fs/macos/passthrough.rs b/src/devices/src/virtio/fs/macos/passthrough.rs
index c1e385d..1a90d15 100644
--- a/src/devices/src/virtio/fs/macos/passthrough.rs
+++ b/src/devices/src/virtio/fs/macos/passthrough.rs
@@ -666,12 +666,13 @@ impl PassthroughFs {
             let dir = unsafe { libc::fdopendir(newfd) };
             if dir.is_null() {
                 let err = io::Error::last_os_error();
                 let _ = unsafe { libc::close(newfd) };
                 return Err(linux_error(err));
             }
+            unsafe { libc::rewinddir(dir) };
             ds.stream = dir as u64;
             dir
         } else {
             ds.stream as *mut libc::DIR
         };

Just adding a rewinddir call after fdopendir fixes the issue with ghost files too. I think some assumptions are broken here because both dup and fdopendir preserve the position from the original fd. So if the dir was already read to the end calling fdopendir on it again will always read null.

@slp
Collaborator Author

slp commented Feb 11, 2026

> Just adding a rewinddir call after fdopendir fixes the issue with ghost files too. I think some assumptions are broken here because both dup and fdopendir preserve the position from the original fd. So if the dir was already read to the end calling fdopendir on it again will always read null.

I think you're right, good point. We were using seekdir(0), but it doesn't seem to do a good job (contrary to what happens on Linux, the offset isn't really an offset but a token to some position).

Let me rework the first commit adding you as co-author.

@pftbest
Contributor

pftbest commented Feb 11, 2026

Ah, you are right, it seems there are two layers of offsets here: the fd has its own position, which can be set with lseek(newfd, 0, SEEK_SET), and fdopendir has some internal buffering whose position is set with seekdir. So if the fd was already read past the end, the newly created fdopendir stream will start from there, and seekdir won't be able to set any position.

So instead of rewinddir we could also do something like this:

diff --git a/src/devices/src/virtio/fs/macos/passthrough.rs b/src/devices/src/virtio/fs/macos/passthrough.rs
index 1a90d15..3fe7265 100644
--- a/src/devices/src/virtio/fs/macos/passthrough.rs
+++ b/src/devices/src/virtio/fs/macos/passthrough.rs
@@ -663,13 +663,14 @@ impl PassthroughFs {
             if newfd < 0 {
                 return Err(linux_error(io::Error::last_os_error()));
             }
+            unsafe { libc::lseek(newfd, 0, libc::SEEK_SET) };
             let dir = unsafe { libc::fdopendir(newfd) };
             if dir.is_null() {
                 let err = io::Error::last_os_error();
                 let _ = unsafe { libc::close(newfd) };
                 return Err(linux_error(err));
             }
-            unsafe { libc::rewinddir(dir) };
+            // unsafe { libc::rewinddir(dir) };
             ds.stream = dir as u64;
             dir
         } else {

I tested it on my machine and it seems to work as well.

@slp
Collaborator Author

slp commented Feb 11, 2026

@pftbest I'd like to add a Co-authored-by line to one of the commits. Could you please give me a name + email to attribute the commit to you?

@pftbest
Contributor

pftbest commented Feb 11, 2026

@slp Sure it's Vadzim Dambrouski <pftbest@gmail.com>

I also did some perf testing and it seems this version with opendir is slightly faster than all alternatives.

| Name | Time |
|---|---|
| macos host | time 0.463 total |
| opendir this PR | 0m1.288s |
| dup+lseek+fdopendir | 0m1.419s |

So the point seems moot, we can just merge this PR. Sorry I caused you some extra work.

@slp
Collaborator Author

slp commented Feb 11, 2026

On second thought, we shouldn't need to rewind or lseek here, since this should be a fresh handle (with a fresh fd) created on opendir() and dedicated exclusively to reading the directory (so it's safe to move the internal offset around). This means that for the issue to arise this way we would have to be reusing an fd, which shouldn't happen.

To make things even weirder, I'm unable to reproduce the issue today. Could you please tell me which version of libkrunfw you have installed on your system?

@pftbest
Contributor

pftbest commented Feb 11, 2026

lseek / rewind are only needed for #540 not for this PR. This one works correctly as is.

opendir indeed creates a new handle which doesn't need any additional fixes. It's just the comments in the PR implied that macOS is broken somehow, so I tried to figure out what is actually wrong, turns out it is working as expected.

My libkrunfw is
commit 6e404e9fdb7d1c581d844ffe5dfb72cf7a9a0b1f HEAD -> main, tag: v5.2.0

@slp
Collaborator Author

slp commented Feb 11, 2026

> lseek / rewind are only needed for #540 not for this PR. This one works correctly as is.

Yes, I was referring to the fdopendir strategy. We shouldn't need to rewind the offset here. Something's off.

@pftbest
Contributor

pftbest commented Feb 11, 2026

I think this is expected behavior of fdopendir and it works like that both on Linux and macOS. I can make a small test to confirm.

@slp
Collaborator Author

slp commented Feb 11, 2026

> I think this is expected behavior of fdopendir and it works like that both on Linux and macOS. I can make a small test to confirm.

Yes, but my point is that we should always be using fdopendir on a newly created fd (that is, one with the offset set to zero). If we aren't, it's because something unexpected is happening in libkrun, not in the OS.

@pftbest
Contributor

pftbest commented Feb 11, 2026

@slp Here is a small test program that shows this issue. You can run it both on Linux and macOS and they behave exactly the same:

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

#define TEST_DIR "/usr"

static int count_entries(DIR *dir) {
    struct dirent *ent;
    int count = 0;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        printf("found: %s\n", ent->d_name);
        count++;
    }
    return count;
}

int main(int argc, char *argv[])
{
    int base_fd = open(TEST_DIR, O_RDONLY | O_DIRECTORY);
    if (base_fd < 0) {
        perror("open base_fd");
        return 1;
    }

    int total1 = 0;
    int total2 = 0;
    {
        int newfd = dup(base_fd);
        if (newfd < 0) {
            perror("dup1");
            close(base_fd);
            return 1;
        }

        DIR *dir = fdopendir(newfd);
        if (!dir) {
            perror("fdopendir1");
            close(newfd);
            return 1;
        }

        total1 = count_entries(dir);
        printf("Total1: %d\n", total1);

        closedir(dir);
    }

    {
        int newfd = dup(base_fd);
        if (newfd < 0) {
            perror("dup2");
            close(base_fd);
            return 1;
        }

        DIR *dir = fdopendir(newfd);
        if (!dir) {
            perror("fdopendir2");
            close(newfd);
            return 1;
        }

        total2 = count_entries(dir);
        printf("Total2: %d\n", total2);

        closedir(dir);
    }

    if (total1 != total2) {
        printf("\nTotal1 != Total2! %d != %d\n", total1, total2);
    }

    close(base_fd);
}

@pftbest
Contributor

pftbest commented Feb 11, 2026

@slp I think I got it. In function open_inode there is a path that calls dup instead of doing a full open:

let fd = match ihandle {
    InodeHandle::VolPath(c_path) => unsafe {
        libc::open(
            c_path.as_ptr(),
            (flags | libc::O_CLOEXEC) & (!libc::O_NOFOLLOW) & (!libc::O_EXLOCK),
        )
    },
    // Check if we have recently unlinked the inode and kept open a file descriptor to it.
    InodeHandle::Fd(fd) => unsafe { libc::dup(fd) },
};

This fd is then duped again and passed to fdopendir.

However, dup carries over the file position of the original fd, which causes the issue we see here.
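
The difference between the two paths can be shown with a small C sketch, assuming a POSIX libc. The helper name `dup_shares_offset` and the scratch file path are hypothetical. A descriptor from dup() refers to the same open file description and therefore shares its offset, while a separate open() of the same path gets its own offset starting at zero.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 when a dup'ed fd shares its offset with
 * the original while a separately opened fd does not, 0 otherwise, and
 * -1 if the setup failed. */
int dup_shares_offset(void)
{
    char path[] = "/tmp/dup_offset_XXXXXX";   /* scratch file for the demo */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int copy = -1, fresh = -1, result = -1;
    if (write(fd, "0123456789", 10) != 10)    /* offset of fd is now 10 */
        goto out;

    copy = dup(fd);                 /* same open file description as fd */
    fresh = open(path, O_RDONLY);   /* new description, offset 0 */
    if (copy < 0 || fresh < 0)
        goto out;

    off_t moved = lseek(copy, 3, SEEK_SET);           /* move via the copy */
    off_t seen_by_fd = lseek(fd, 0, SEEK_CUR);        /* follows the copy */
    off_t seen_by_fresh = lseek(fresh, 0, SEEK_CUR);  /* unaffected */

    result = (moved == 3 && seen_by_fd == 3 && seen_by_fresh == 0);

out:
    if (copy >= 0)
        close(copy);
    if (fresh >= 0)
        close(fresh);
    close(fd);
    unlink(path);
    return result;
}
```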

@slp
Collaborator Author

slp commented Feb 11, 2026

Okay, now it's clearer:

  • The issue only appears on debian bookworm, not on trixie, most likely due to different apt/dpkg versions (or their dependencies). On bookworm, it attempts to unlink a directory without checking whether it's empty, triggering the bug.
  • The culprit was a bug introduced in "virtio/fs/macos: keep a fd to unlinked files" (#513). There, we store an fd in unlinked_fd without checking whether unlinkat succeeded. This is wrong, and leaves an unlinked_fd on an inode that is still linked. Here, it leads to opendir wrongly reusing the same unlinked_fd across multiple iterations, which explains why the fd's internal offset seemed to be moving around.

We need to fix the issue in #513 first. Then, between using opendir or fdopendir, we should favor the latter, as it allows the guest to read the contents of an unlinked directory (weird, but semantically correct).

I'll resurrect #540 and extend it with the fix for unlinked_fd.

@slp
Collaborator Author

slp commented Feb 11, 2026

Closing this one in favor of #544

@slp slp closed this Feb 11, 2026