Parallelize bucket get and listing in _info for bucket paths #780
yuxin00j wants to merge 7 commits into fsspec:main
Conversation
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #780      +/-   ##
==========================================
+ Coverage   75.84%   76.27%    +0.43%
==========================================
  Files          14       14
  Lines        2645     2651        +6
==========================================
+ Hits         2006     2022       +16
+ Misses        639      629       -10
```

View full report in Codecov by Sentry.
+1, looks good. Is there a specific edge case that was not handled before?
With this PR, the unit tests and benchmark tests cover the edge case where the first GET API call fails. This edge case is already handled in the current _info() method.
Force-pushed from db68d9c to bc07cf3 (Compare)
/gcbrun
gcsfs/core.py (Outdated)

```python
if not get_task.done():
    get_task.cancel()
if not ls_task.done():
    ls_task.cancel()
```
I don't quite understand the reasoning behind this change.
From a first look, I think this could backfire on our customers. I don't see a reason why we should concurrently execute these two requests. The second _ls call only happens when the backend returns a 403 (Forbidden/Permission Error), meaning the user lacks bucket.get permission. In that case, we try the list operation in the hope that the user has bucket.list permission, which makes sense.
If you look at gcsfs/retry.py, the OSError is only raised when the status code is 403. In all other cases, the exception is different, meaning we would never need to execute the _ls call.
However, with this change, no matter the scenario, we end up making two calls where one would usually suffice. This might not be noticeable for a few _info calls, but it will backfire significantly when there is a high volume of them.
I see that you're cancelling the task once we have the result of get_task, but that doesn't really help. The asyncio event loop will still have initiated both requests concurrently, wasting resources just to speed things up for users who lack bucket.get permission while degrading the experience for those who have both permissions, or just bucket.get.
Hence, if my understanding is correct, this PR only improves performance for the failure case where a user lacks bucket.get but has bucket.list permission (a rare case), while degrading performance for the happy path, where the user has bucket.get or both permissions.
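The sequential behavior the reviewer describes (GET first, LIST only when the backend rejects GET with a 403, which gcsfs/retry.py surfaces as OSError) can be sketched roughly as follows. The function and helper names here are illustrative stand-ins, not the actual gcsfs implementation:

```python
import asyncio


async def bucket_info(get_bucket, list_bucket):
    """Sequential fallback: only call LIST when GET fails.

    `get_bucket` and `list_bucket` are hypothetical stand-ins for the
    real gcsfs calls; a 403 surfaces as OSError per gcsfs/retry.py.
    """
    try:
        # Happy path: a single request when the caller has bucket.get.
        return await get_bucket()
    except OSError:
        # 403 Forbidden: fall back to listing, hoping the caller has
        # bucket.list even without bucket.get.
        return await list_bucket()


async def demo():
    async def get_ok():
        return {"name": "bucket", "type": "directory"}

    async def ls_never():
        raise AssertionError("LIST should not run on the happy path")

    return await bucket_info(get_ok, ls_never)


result = asyncio.run(demo())
print(result["name"])  # bucket
```

Under this shape, the happy path issues exactly one request, which is the behavior the reviewer is arguing should be preserved.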
You make good points here. Can we measure?
Yes, you are right. This change aims to speed up the case where _info falls back to bucket.list.
I updated the code to use the same implementation as #786 to handle parallel tasks.
For the happy path, I ran the micro-benchmark on main and on this branch: result. The percentage changes are between -29% and +15%; half of the cases are above zero and half are below. The absolute latencies are small, so I believe this is reasonable variance and the change does not degrade happy-path performance.
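For reference, a common way to implement the "first task wins" handling mentioned here is asyncio.wait with FIRST_COMPLETED, then cancelling whatever is still pending. This is a hedged sketch of the general pattern, not the code merged in #786:

```python
import asyncio


async def first_completed(coro_a, coro_b):
    """Run two coroutines concurrently, return the first result, and
    cancel the loser. All names here are illustrative only."""
    task_a = asyncio.create_task(coro_a)
    task_b = asyncio.create_task(coro_b)
    done, pending = await asyncio.wait(
        {task_a, task_b}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Let the cancellations settle so no "task was never retrieved"
    # warnings leak out of the event loop.
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()


async def demo():
    async def fast():
        return "get"

    async def slow():
        await asyncio.sleep(1)
        return "ls"

    return await first_completed(fast(), slow())


print(asyncio.run(demo()))  # get
```

Note the trade-off the reviewer raised still applies: both requests are initiated regardless of which one wins, so the cancellation only limits wasted work, it does not avoid starting it.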
What is the number of directories for this benchmark case - does it matter?
…est restricted permissions.
…tests for _info fallback
This PR optimizes the GCSFileSystem._info method for bucket roots (gs://bucket) by running the bucket metadata retrieval (GET) and the bucket contents listing (ls) concurrently. It introduces a generic parallel_tasks_first_completed async context manager to ensure clean resource management and early returns.
Key Changes
- New concurrency utility (gcsfs/concurrency.py): adds a parallel_tasks_first_completed async context manager.
- Refactor of GCSFileSystem._info (gcsfs/core.py).
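A minimal sketch of what an async context manager with this shape could look like; all internals below are assumptions for illustration, and the real gcsfs/concurrency.py may differ:

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def parallel_tasks_first_completed(*coros):
    """Start all coroutines, yield the set of tasks that finished
    first, and cancel anything still pending on exit (including on
    early return or exception), giving clean resource management."""
    tasks = [asyncio.create_task(c) for c in coros]
    try:
        done, _pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )
        yield done
    finally:
        for task in tasks:
            if not task.done():
                task.cancel()
        # Swallow CancelledError from the cancelled tasks.
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo():
    async def get_call():
        return "bucket-metadata"

    async def ls_call():
        await asyncio.sleep(1)
        return "listing"

    async with parallel_tasks_first_completed(get_call(), ls_call()) as done:
        # Early return is safe: the finally block cancels ls_call.
        return [t.result() for t in done]


print(asyncio.run(demo()))  # ['bucket-metadata']
```

Putting the cancellation in the context manager's finally block is what makes the early-return path in _info safe: the caller never has to remember to clean up the losing task.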