Add pread/pwrite syscalls and fix strace retry logic #46

Copilot · 2026-01-29T21:39:39Z

Change Description

My PR includes a link to the issue that I am addressing

Issue reports PyArrow parquet reads capture only 3.63 MB for a 1 GB file. Investigation shows current implementation correctly captures 100% of I/O for files ≤407 MB. PyArrow likely uses mmap for files >1 GB, which bypasses syscall tracing (page faults vs read syscalls). This is a fundamental limitation of strace-based measurement.

This PR adds 32-bit syscall support (pread/pwrite) which were missing, though this does not address the mmap limitation.

Solution Description

Syscall coverage:

Added pread/pwrite (32-bit variants) to traced syscalls
Implemented fallback: try all syscalls first, retry without 32-bit variants if strace rejects them
Updated parser to recognize pread/pwrite operations

Code quality:

Extracted _start_strace() helper to eliminate duplication in retry path
Fixed retry logic bug: was raising error on success after fallback
Added clarifying comments about parser/tracer relationship

Testing:

2 new tests for pread/pwrite parsing
All 122 tests pass
Verified 100% capture with 407 MB parquet file

Limitations identified:
strace cannot capture mmap-based I/O (data accessed via page faults, not syscalls). If PyArrow uses mmap for large files (>1 GB threshold), those reads will not appear in syscall traces. Alternative: eBPF/perf with page fault tracking.

Code Quality

I have read the Contribution Guide
My code follows the code style of this project
My code builds (or compiles) cleanly without any errors or warnings
My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

Bug Fix Checklist

My fix includes a new test that breaks as a result of the bug (if possible)
My change includes a breaking change
- My change includes backwards compatibility and deprecation warnings (if possible)

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

huggingface.co
- Triggering command: /usr/bin/curl curl -L REDACTED -o train-00000-of-00007.parquet --max-time 120 DROP (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Original prompt

This section details on the original issue you should resolve

<issue_title>Some reads not captured on Linux</issue_title>
<issue_description>Bug report

I have a 1 GiB file, and I'm getting different results when I read it with standard Python tooling and pyarrow; pyarrow bytes read are unrealistically small.
with open('train-00000-of-00007.parquet', 'rb') as gh:
     %iops data = gh.read()
del data
======================================================================
IOPS Profile Results (strace (per-process))
======================================================================
Execution Time:                18.2150 seconds
Read Operations:               2
Write Operations:              0
Total Operations:              2
Bytes Read:                    1.02 GB (1,091,305,162 bytes)
Bytes Written:                 0.00 B (0 bytes)
Total Bytes:                   1.02 GB (1,091,305,162 bytes)
----------------------------------------------------------------------
IOPS:                          0.11 operations/second
Throughput:                    57.14 MB/second
======================================================================
import pyarrow.parquet as pq
%iops pq.read_table('train-00000-of-00007.parquet')
======================================================================
IOPS Profile Results (strace (per-process))
======================================================================
Execution Time:                19.7621 seconds
Read Operations:               3
Write Operations:              3
Total Operations:              6
Bytes Read:                    3.63 MB (3,808,731 bytes)
Bytes Written:                 13.05 KB (13,360 bytes)
Total Bytes:                   3.65 MB (3,822,091 bytes)
----------------------------------------------------------------------
IOPS:                          0.30 operations/second
Throughput:                    188.87 KB/second
======================================================================
I tried to do sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches', but it didn't help.

Environment Information

Linux 6.8.0, x86_64, etx4, python 3.13, pyarrow 23, iops_profiler 0.2.0, ipython 9.9.0

Before submitting
Please check the following:

I have described the situation in which the bug arose, including what code was executed, and any applicable data others will need to reproduce the problem.

I have included information about my environment, including the version of this package (e.g. iops_profiler.__version__)

I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.

If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
</issue_description>

<agent_instructions> This should be focused on the strace linux code. I suspect we are missing some syscalls. Please debug using a parquet file and similar techniques and then make a minimal change.</agent_instructions>

Comments on the Issue (you are @copilot in this section)
@mtauraso @copilot This should be focused on the strace linux code. I suspect we are missing some syscalls. Please debug using a parquet file and similar techniques and then make a minimal change.

Fixes Some reads not captured on Linux #45

💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

codecov · 2026-01-29T21:43:48Z

Codecov Report

❌ Patch coverage is 7.14286% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.61%. Comparing base (b04ec2e) to head (2ee43b9).

Files with missing lines	Patch %	Lines
src/iops_profiler/collector.py	7.14%	13 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
- Coverage   45.49%   44.61%   -0.89%     
==========================================
  Files           4        4              
  Lines         455      464       +9     
==========================================
  Hits          207      207              
- Misses        248      257       +9

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>

mtauraso · 2026-01-29T22:04:58Z

@copilot Without writing any new code, can you suggest a few approaches to making this package basically work for mmaped files on linux, understanding that syscall tracing is not the way here.

Copilot · 2026-01-29T22:05:53Z

@copilot Without writing any new code, can you suggest a few approaches to making this package basically work for mmaped files on linux, understanding that syscall tracing is not the way here.

Here are several approaches to support mmap'd files on Linux without syscall tracing:

1. eBPF-based tracing (most accurate)

Use BCC/bpftrace to trace page faults (e.g., page_fault_user tracepoint)
Track memory-mapped regions via mmap/munmap tracepoints
Calculate actual bytes accessed from page fault counts
Pros: Accurate, minimal overhead
Cons: Requires kernel >=4.7, root or CAP_BPF, additional dependencies (bcc-tools)

2. /proc/self/smaps tracking (estimation-based)

Capture /proc/[pid]/smaps before and after code execution
Compare RSS (Resident Set Size) changes for file-backed mappings
Pros: No root required, no external dependencies
Cons: Estimates only (includes prefetching), affected by kernel page cache behavior

3. perf with page-fault events (system tool)

Use perf stat -e page-faults or perf record -e page-faults
Parse perf output to extract fault counts
Multiply by page size (4KB) for byte estimate
Pros: Standard Linux tool, reasonably accurate
Cons: Requires perf installed, may need elevated privileges, estimates only

4. Hybrid approach (fallback strategy)

Try strace first (current implementation)
If bytes captured < file_size threshold (e.g., <50%), warn user about likely mmap usage
Suggest alternative measurement or document limitation
Pros: Works with current code, educates users
Cons: Doesn't solve the measurement problem

Recommendation: Start with approach #4 (hybrid) as it requires minimal changes. Add a warning when captured I/O is suspiciously low relative to file size. For full mmap support, approach #2 (/proc/smaps) is most practical without new dependencies.

mtauraso · 2026-01-29T22:10:30Z

@copilot A couple of questions:

Can you give more detail about how the /proc/self/smaps approach works exactly
For the perf based approach, are the same syscalls that perf uses available in a python package? This might be an easier dependency to take, and give better error handling/ recovery.

Please just answer in a comment, don't write any code.

Copilot · 2026-01-29T22:11:17Z

@copilot A couple of questions:

Can you give more detail about how the /proc/self/smaps approach works exactly
...

1) /proc/self/smaps detailed approach:

/proc/[pid]/smaps shows detailed memory mapping information. For tracking mmap'd file I/O:

Before execution:

Read /proc/self/smaps (or /proc/[pid]/smaps for child)
Parse each memory region, identify file-backed mappings (lines with pathname)
Record baseline RSS (Resident Set Size) for each file mapping

After execution:

Read /proc/self/smaps again
Calculate RSS delta for file-backed regions
Delta represents pages brought into RAM (accessed data)

Key fields in smaps:

Size: 1024 kB          # Total mapping size
Rss: 512 kB            # Currently in RAM
Referenced: 512 kB     # Recently accessed

Limitation: RSS includes prefetched pages and doesn't distinguish between first access vs. cached access, so it's an upper bound estimate.

2) Python packages for perf events:

Yes, perf_event syscalls are accessible via:

a) perf-event PyPI package - Direct wrapper around perf_event_open() syscall

Lighter dependency than bcc-tools
Still requires parsing output/events
Lower-level API, more setup code needed

b) py-perf or similar - Higher-level wrappers

Easier API but less common/maintained

However: Both still need privileges (CAP_PERFMON or relaxed perf_event_paranoid), similar to eBPF but without eBPF's full kernel access requirements. The main advantage over direct perf CLI is better error handling and programmatic control, but dependency management may be comparable to installing perf system package.

Initial plan

a325491

Copilot AI assigned Copilot and mtauraso Jan 29, 2026

Copilot started work on behalf of mtauraso January 29, 2026 21:39 View session

Copilot AI and others added 3 commits January 29, 2026 21:51

Add pread/pwrite support and investigate mmap usage

f03210c

Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>

Refactor strace startup logic to reduce code duplication

7ff91b2

Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>

Fix retry logic and add clarifying comments

2ee43b9

Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix reads not captured on Linux for large files~~ Add pread/pwrite syscalls and fix strace retry logic Jan 29, 2026

Copilot AI requested a review from mtauraso January 29, 2026 22:01

Copilot finished work on behalf of mtauraso January 29, 2026 22:01

Copilot started work on behalf of mtauraso January 29, 2026 22:05 View session

Copilot finished work on behalf of mtauraso January 29, 2026 22:06

Copilot started work on behalf of mtauraso January 29, 2026 22:10 View session

Copilot finished work on behalf of mtauraso January 29, 2026 22:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add pread/pwrite syscalls and fix strace retry logic #46

Add pread/pwrite syscalls and fix strace retry logic #46

Uh oh!

Copilot AI commented Jan 29, 2026 •

edited

Loading

Uh oh!

codecov bot commented Jan 29, 2026 •

edited

Loading

Uh oh!

mtauraso commented Jan 29, 2026

Uh oh!

Copilot AI commented Jan 29, 2026

Uh oh!

mtauraso commented Jan 29, 2026

Uh oh!

Copilot AI commented Jan 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add pread/pwrite syscalls and fix strace retry logic #46

Are you sure you want to change the base?

Add pread/pwrite syscalls and fix strace retry logic #46

Uh oh!

Conversation

Copilot AI commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Change Description

Solution Description

Code Quality

Project-Specific Pull Request Checklists

Bug Fix Checklist

I tried to connect to the following addresses, but was blocked by firewall rules:

Comments on the Issue (you are @copilot in this section)

Uh oh!

codecov bot commented Jan 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mtauraso commented Jan 29, 2026

Uh oh!

Copilot AI commented Jan 29, 2026

Uh oh!

mtauraso commented Jan 29, 2026

Uh oh!

Copilot AI commented Jan 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Jan 29, 2026 •

edited

Loading

codecov bot commented Jan 29, 2026 •

edited

Loading