Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Jan 29, 2026

Change Description

  • My PR includes a link to the issue that I am addressing

Issue reports PyArrow parquet reads capture only 3.63 MB for a 1 GB file. Investigation shows current implementation correctly captures 100% of I/O for files ≤407 MB. PyArrow likely uses mmap for files >1 GB, which bypasses syscall tracing (page faults vs read syscalls). This is a fundamental limitation of strace-based measurement.

This PR adds 32-bit syscall support (pread/pwrite) which were missing, though this does not address the mmap limitation.

Solution Description

Syscall coverage:

  • Added pread/pwrite (32-bit variants) to traced syscalls
  • Implemented fallback: try all syscalls first, retry without 32-bit variants if strace rejects them
  • Updated parser to recognize pread/pwrite operations

Code quality:

  • Extracted _start_strace() helper to eliminate duplication in retry path
  • Fixed retry logic bug: was raising error on success after fallback
  • Added clarifying comments about parser/tracer relationship

Testing:

  • 2 new tests for pread/pwrite parsing
  • All 122 tests pass
  • Verified 100% capture with 407 MB parquet file

Limitations identified:
strace cannot capture mmap-based I/O (data accessed via page faults, not syscalls). If PyArrow uses mmap for large files (>1 GB threshold), those reads will not appear in syscall traces. Alternative: eBPF/perf with page fault tracking.

Code Quality

  • I have read the Contribution Guide
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

Bug Fix Checklist

  • My fix includes a new test that breaks as a result of the bug (if possible)
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • huggingface.co
    • Triggering command: /usr/bin/curl curl -L REDACTED -o train-00000-of-00007.parquet --max-time 120 DROP (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section details on the original issue you should resolve

<issue_title>Some reads not captured on Linux</issue_title>
<issue_description>Bug report

I have a 1 GiB file, and I'm getting different results when I read it with standard Python tooling and pyarrow; pyarrow bytes read are unrealistically small.

with open('train-00000-of-00007.parquet', 'rb') as gh:
     %iops data = gh.read()
del data
======================================================================
IOPS Profile Results (strace (per-process))
======================================================================
Execution Time:                18.2150 seconds
Read Operations:               2
Write Operations:              0
Total Operations:              2
Bytes Read:                    1.02 GB (1,091,305,162 bytes)
Bytes Written:                 0.00 B (0 bytes)
Total Bytes:                   1.02 GB (1,091,305,162 bytes)
----------------------------------------------------------------------
IOPS:                          0.11 operations/second
Throughput:                    57.14 MB/second
======================================================================
import pyarrow.parquet as pq
%iops pq.read_table('train-00000-of-00007.parquet')
======================================================================
IOPS Profile Results (strace (per-process))
======================================================================
Execution Time:                19.7621 seconds
Read Operations:               3
Write Operations:              3
Total Operations:              6
Bytes Read:                    3.63 MB (3,808,731 bytes)
Bytes Written:                 13.05 KB (13,360 bytes)
Total Bytes:                   3.65 MB (3,822,091 bytes)
----------------------------------------------------------------------
IOPS:                          0.30 operations/second
Throughput:                    188.87 KB/second
======================================================================

I tried to do sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches', but it didn't help.

Environment Information

Linux 6.8.0, x86_64, etx4, python 3.13, pyarrow 23, iops_profiler 0.2.0, ipython 9.9.0

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, and any applicable data others will need to reproduce the problem.
  • I have included information about my environment, including the version of this package (e.g. iops_profiler.__version__)
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
    </issue_description>

<agent_instructions> This should be focused on the strace linux code. I suspect we are missing some syscalls. Please debug using a parquet file and similar techniques and then make a minimal change.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@mtauraso @copilot This should be focused on the strace linux code. I suspect we are missing some syscalls. Please debug using a parquet file and similar techniques and then make a minimal change.

💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

@codecov
Copy link

codecov bot commented Jan 29, 2026

Codecov Report

❌ Patch coverage is 7.14286% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.61%. Comparing base (b04ec2e) to head (2ee43b9).

Files with missing lines Patch % Lines
src/iops_profiler/collector.py 7.14% 13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #46      +/-   ##
==========================================
- Coverage   45.49%   44.61%   -0.89%     
==========================================
  Files           4        4              
  Lines         455      464       +9     
==========================================
  Hits          207      207              
- Misses        248      257       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot AI and others added 3 commits January 29, 2026 21:51
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix reads not captured on Linux for large files Add pread/pwrite syscalls and fix strace retry logic Jan 29, 2026
Copilot AI requested a review from mtauraso January 29, 2026 22:01
@mtauraso
Copy link
Collaborator

@copilot Without writing any new code, can you suggest a few approaches to making this package basically work for mmaped files on linux, understanding that syscall tracing is not the way here.

Copy link
Contributor Author

Copilot AI commented Jan 29, 2026

@copilot Without writing any new code, can you suggest a few approaches to making this package basically work for mmaped files on linux, understanding that syscall tracing is not the way here.

Here are several approaches to support mmap'd files on Linux without syscall tracing:

1. eBPF-based tracing (most accurate)

  • Use BCC/bpftrace to trace page faults (e.g., page_fault_user tracepoint)
  • Track memory-mapped regions via mmap/munmap tracepoints
  • Calculate actual bytes accessed from page fault counts
  • Pros: Accurate, minimal overhead
  • Cons: Requires kernel >=4.7, root or CAP_BPF, additional dependencies (bcc-tools)

2. /proc/self/smaps tracking (estimation-based)

  • Capture /proc/[pid]/smaps before and after code execution
  • Compare RSS (Resident Set Size) changes for file-backed mappings
  • Pros: No root required, no external dependencies
  • Cons: Estimates only (includes prefetching), affected by kernel page cache behavior

3. perf with page-fault events (system tool)

  • Use perf stat -e page-faults or perf record -e page-faults
  • Parse perf output to extract fault counts
  • Multiply by page size (4KB) for byte estimate
  • Pros: Standard Linux tool, reasonably accurate
  • Cons: Requires perf installed, may need elevated privileges, estimates only

4. Hybrid approach (fallback strategy)

  • Try strace first (current implementation)
  • If bytes captured < file_size threshold (e.g., <50%), warn user about likely mmap usage
  • Suggest alternative measurement or document limitation
  • Pros: Works with current code, educates users
  • Cons: Doesn't solve the measurement problem

Recommendation: Start with approach #4 (hybrid) as it requires minimal changes. Add a warning when captured I/O is suspiciously low relative to file size. For full mmap support, approach #2 (/proc/smaps) is most practical without new dependencies.

@mtauraso
Copy link
Collaborator

@copilot A couple of questions:

  1. Can you give more detail about how the /proc/self/smaps approach works exactly
  2. For the perf based approach, are the same syscalls that perf uses available in a python package? This might be an easier dependency to take, and give better error handling/ recovery.

Please just answer in a comment, don't write any code.

Copy link
Contributor Author

Copilot AI commented Jan 29, 2026

@copilot A couple of questions:

  1. Can you give more detail about how the /proc/self/smaps approach works exactly
    ...

1) /proc/self/smaps detailed approach:

/proc/[pid]/smaps shows detailed memory mapping information. For tracking mmap'd file I/O:

Before execution:

  • Read /proc/self/smaps (or /proc/[pid]/smaps for child)
  • Parse each memory region, identify file-backed mappings (lines with pathname)
  • Record baseline RSS (Resident Set Size) for each file mapping

After execution:

  • Read /proc/self/smaps again
  • Calculate RSS delta for file-backed regions
  • Delta represents pages brought into RAM (accessed data)

Key fields in smaps:

Size: 1024 kB          # Total mapping size
Rss: 512 kB            # Currently in RAM
Referenced: 512 kB     # Recently accessed

Limitation: RSS includes prefetched pages and doesn't distinguish between first access vs. cached access, so it's an upper bound estimate.


2) Python packages for perf events:

Yes, perf_event syscalls are accessible via:

a) perf-event PyPI package - Direct wrapper around perf_event_open() syscall

  • Lighter dependency than bcc-tools
  • Still requires parsing output/events
  • Lower-level API, more setup code needed

b) py-perf or similar - Higher-level wrappers

  • Easier API but less common/maintained

However: Both still need privileges (CAP_PERFMON or relaxed perf_event_paranoid), similar to eBPF but without eBPF's full kernel access requirements. The main advantage over direct perf CLI is better error handling and programmatic control, but dependency management may be comparable to installing perf system package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Some reads not captured on Linux

2 participants