-
Notifications
You must be signed in to change notification settings - Fork 0
Add pread/pwrite syscalls and fix strace retry logic #46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #46 +/- ##
==========================================
- Coverage 45.49% 44.61% -0.89%
==========================================
Files 4 4
Lines 455 464 +9
==========================================
Hits 207 207
- Misses 248 257 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
|
@copilot Without writing any new code, can you suggest a few approaches to making this package basically work for mmaped files on linux, understanding that syscall tracing is not the way here. |
Here are several approaches to support mmap'd files on Linux without syscall tracing: 1. eBPF-based tracing (most accurate)
2. /proc/self/smaps tracking (estimation-based)
3. perf with page-fault events (system tool)
4. Hybrid approach (fallback strategy)
Recommendation: Start with approach #4 (hybrid) as it requires minimal changes. Add a warning when captured I/O is suspiciously low relative to file size. For full mmap support, approach #2 (/proc/smaps) is most practical without new dependencies. |
|
@copilot A couple of questions:
Please just answer in a comment, don't write any code. |
1) /proc/self/smaps detailed approach:
Before execution:
After execution:
Key fields in smaps: Limitation: RSS includes prefetched pages and doesn't distinguish between first access vs. cached access, so it's an upper bound estimate. 2) Python packages for perf events: Yes, a)
b)
However: Both still need privileges (CAP_PERFMON or relaxed |
Change Description
Issue reports PyArrow parquet reads capture only 3.63 MB for a 1 GB file. Investigation shows current implementation correctly captures 100% of I/O for files ≤407 MB. PyArrow likely uses mmap for files >1 GB, which bypasses syscall tracing (page faults vs read syscalls). This is a fundamental limitation of strace-based measurement.
This PR adds 32-bit syscall support (pread/pwrite) which were missing, though this does not address the mmap limitation.
Solution Description
Syscall coverage:
Code quality:
_start_strace()helper to eliminate duplication in retry pathTesting:
Limitations identified:
strace cannot capture mmap-based I/O (data accessed via page faults, not syscalls). If PyArrow uses mmap for large files (>1 GB threshold), those reads will not appear in syscall traces. Alternative: eBPF/perf with page fault tracking.
Code Quality
Project-Specific Pull Request Checklists
Bug Fix Checklist
Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
huggingface.co/usr/bin/curl curl -L REDACTED -o train-00000-of-00007.parquet --max-time 120 DROP(dns block)If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.