Enhance _info method to check file and directory info in parallel #786

Open: yuxin00j wants to merge 12 commits into fsspec:main from ankitaluthra1:optimize-info

yuxin00j (Contributor) commented Mar 25, 2026

Optimize the performance of the _info method by running the exact-path file check and the directory listing concurrently.

  • Early Return Strategy: If _get_object completes first and resolves to a valid file (not a directory marker), the directory scan tasks are cancelled and the file metadata is returned immediately.

  • Fallback Logic: If _get_object fails or yields a directory marker, _info falls back to the result of the directory tree scan.
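
The two bullets above can be sketched with plain asyncio. This is a simplified illustration, not the code in this PR: `info_parallel`, `is_directory_marker`, and the two callables passed in are hypothetical stand-ins for `_info`, `_get_object`, and the directory scan, and the marker test (a zero-byte object whose name ends in "/") is an assumption. The sketch awaits the object lookup first while the directory scan runs in the background, rather than reacting to whichever task finishes first.

```python
import asyncio


def is_directory_marker(info):
    # Assumption: a marker is a zero-byte object whose name ends with "/".
    return info["name"].endswith("/") and info.get("size", 0) == 0


async def info_parallel(get_object, get_directory_info):
    """Start both lookups at once; prefer a real file, else the directory."""
    object_task = asyncio.ensure_future(get_object())
    dir_task = asyncio.ensure_future(get_directory_info())
    try:
        try:
            info = await object_task
        except FileNotFoundError:
            info = None  # no object at this exact path
        if info is not None and not is_directory_marker(info):
            return info  # early return: the directory scan is cancelled below
        return await dir_task  # fallback: use the directory scan result
    finally:
        for task in (object_task, dir_task):
            if not task.done():
                task.cancel()
            elif not task.cancelled():
                task.exception()  # mark any stored exception as retrieved


if __name__ == "__main__":
    async def fake_get_object():
        return {"name": "bucket/data.csv", "size": 10, "type": "file"}

    async def fake_directory_scan():
        await asyncio.sleep(1)  # slower than the object lookup
        return {"name": "bucket/data.csv", "size": 0, "type": "directory"}

    print(asyncio.run(info_parallel(fake_get_object, fake_directory_scan)))
```

Because both tasks are created before anything is awaited, the directory scan costs no extra wall-clock time when the path turns out to be a directory, which is consistent with the folder-info speedup reported below.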

Benchmark results

Folder Info

Execution times consistently dropped by 30% to 60% across all single-threaded and multi-process configurations.

File Info

Results are mixed but generally neutral, showing minor speedups of up to 24.6% in high process count runs. One outlier showed a minor regression in deep regional tests.

Bucket Info

This optimization does not affect info calls on buckets.

codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 90.32258% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.40%. Comparing base (6c5f744) to head (b4d835e).

Files with missing lines    Patch %    Lines
gcsfs/core.py               85.00%     3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #786      +/-   ##
==========================================
+ Coverage   75.96%   76.40%   +0.43%     
==========================================
  Files          14       15       +1     
  Lines        2663     2687      +24     
==========================================
+ Hits         2023     2053      +30     
+ Misses        640      634       -6     

yuxin00j marked this pull request as ready for review March 26, 2026 02:22
yuxin00j changed the title from "Enhance _info method to check file and directory info in parallel.Optimize info" to "Enhance _info method to check file and directory info in parallel" Mar 26, 2026

yuxin00j (Contributor, Author) commented:

Hi @ankitaluthra1, you may check the update on optimization in _info here and in #780

ankitaluthra1 (Collaborator) commented:

/gcbrun

ankitaluthra1 (Collaborator) commented:

@yuxin00j, can you please check the e2e failure?

…wait and simplify parallel task evaluation in _info

yuxin00j (Contributor, Author) commented Apr 2, 2026

Hi @ankitaluthra1, I have fixed the test failure.

        self._get_directory_info(path, bucket, key, generation),
    ]
) as (tasks, done, pending):
    exact_task, dir_task = tasks

Review comment (Contributor):

nit: Can we rename the exact_task variable to get_object_task or some other more meaningful name?

Mahalaxmibejugam (Contributor) commented:

QQ: Was the 30% to 60% improvement also observed for HNS buckets where we are parallelizing get_object and get_folder calls?

placeholder = f"{base_dir}/folder_with_placeholder/"
res = await gcs._info(path)
# Should prefer directory info over marker
assert res["extra"] == "info"

Review comment (Contributor):

Let's also assert the type of the result here, as in the other cases.


    assert found_names == expected_basenames


@pytest.mark.asyncio
async def test_info_parallel(gcs):

Review comment (Contributor):

Consider refactoring these scenarios using @pytest.mark.parametrize. This will make the test cleaner by removing the need for manual mock resets (like mock_get_dir.side_effect = None) and will ensure that a failure in one case doesn't prevent the remaining cases from running. Alternatively, check whether we can break the test down into smaller ones.
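
A rough sketch of what that refactor could look like, under stated assumptions: `fake_info` is a hypothetical stand-in for the behaviour under test, not gcsfs code, and the three scenarios are illustrative rather than taken from the PR's test. A real version would patch the gcsfs methods with mocks per case. Each case gets its own inputs, so nothing needs resetting between scenarios, and each is reported separately.

```python
import asyncio

import pytest


async def fake_info(object_result, dir_result):
    """Hypothetical stand-in for _info: prefer a real file, else the directory."""

    async def resolve(result):
        if isinstance(result, Exception):
            raise result
        return result

    obj, directory = await asyncio.gather(
        resolve(object_result), resolve(dir_result), return_exceptions=True
    )
    # Assumption: a name ending in "/" denotes a directory marker.
    if not isinstance(obj, Exception) and not obj["name"].endswith("/"):
        return obj
    if isinstance(directory, Exception):
        raise directory
    return directory


FILE = {"name": "bucket/a.txt", "type": "file"}
MARKER = {"name": "bucket/d/", "type": "file"}
DIRECTORY = {"name": "bucket/d", "type": "directory"}

CASES = [
    (FILE, FileNotFoundError("no directory"), "file"),
    (MARKER, DIRECTORY, "directory"),
    (FileNotFoundError("no object"), DIRECTORY, "directory"),
]


@pytest.mark.parametrize(
    "object_result, dir_result, expected_type",
    CASES,
    ids=["plain-file", "marker-prefers-directory", "missing-object-falls-back"],
)
def test_info_type(object_result, dir_result, expected_type):
    res = asyncio.run(fake_info(object_result, dir_result))
    assert res["type"] == expected_type


def test_info_missing_everywhere():
    with pytest.raises(FileNotFoundError):
        asyncio.run(
            fake_info(FileNotFoundError("no object"), FileNotFoundError("no directory"))
        )
```

The tests here are synchronous and call `asyncio.run` only to keep the sketch free of plugin dependencies; with pytest-asyncio, as used in the existing test, the same parametrization applies to an `async def` test.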

Review comment (Contributor):

Let's add a test file to cover the code in this file.

Mahalaxmibejugam (Contributor) commented:

File Info: Results are mixed but generally neutral, showing minor speedups of up to 24.6% in high process count runs. One outlier showed a minor regression in deep regional tests.

Is the speedup for file paths related to the changes in this PR? I am assuming it is variance and not related to this PR, as the latency for file paths shouldn't be impacted by this change. Let me know if I am missing something here.

4 participants