Skip to content

Parallelize bucket get and listing in _info for bucket paths#780

Open
yuxin00j wants to merge 7 commits intofsspec:mainfrom
ankitaluthra1:info-bucket
Open

Parallelize bucket get and listing in _info for bucket paths#780
yuxin00j wants to merge 7 commits intofsspec:mainfrom
ankitaluthra1:info-bucket

Conversation

@yuxin00j
Copy link
Copy Markdown
Contributor

@yuxin00j yuxin00j commented Mar 19, 2026

This PR optimizes the GCSFileSystem._info method for bucket roots (gs://bucket) by running the bucket metadata retrieval (GET) and the bucket contents listing (ls) concurrently. It introduces a generic parallel_tasks_first_completed async context manager to ensure clean resource management and early returns.

Key Changes

New Concurrency Utility (gcsfs/concurrency.py)

  • Introduces parallel_tasks_first_completed async context manager.
  • Executes a list of coroutines concurrently and yields as soon as the first one completes.
  • Guarantees cleanup: Automatically cancels any remaining pending tasks when exiting the context, preventing resource leaks.

Refactor GCSFileSystem._info (gcsfs/core.py)

  • Uses the new utility for bucket root paths.
  • If GET completes first, it immediately returns the metadata.
  • If GET fails (e.g., due to permissions), it falls back to the ls result to determine if the bucket exists.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.27%. Comparing base (fee3945) to head (ad72448).

Files with missing lines Patch % Lines
gcsfs/core.py 92.30% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #780      +/-   ##
==========================================
+ Coverage   75.84%   76.27%   +0.43%     
==========================================
  Files          14       14              
  Lines        2645     2651       +6     
==========================================
+ Hits         2006     2022      +16     
+ Misses        639      629      -10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@yuxin00j yuxin00j marked this pull request as ready for review March 19, 2026 09:07
@martindurant
Copy link
Copy Markdown
Member

+1 , looks good. Is there a specific edge case that was not handled before?

@yuxin00j
Copy link
Copy Markdown
Contributor Author

yuxin00j commented Mar 20, 2026

+1 , looks good. Is there a specific edge case that was not handled before?

With this PR, the unit tests and benchmark tests can cover the edge case that the first Get API call fails. This edge case is already handled in current _info() method.

@yuxin00j yuxin00j force-pushed the info-bucket branch 3 times, most recently from db68d9c to bc07cf3 Compare March 24, 2026 07:06
yuxin00j added a commit to ankitaluthra1/gcsfs that referenced this pull request Mar 26, 2026
@ankitaluthra1
Copy link
Copy Markdown
Collaborator

/gcbrun

gcsfs/core.py Outdated
if not get_task.done():
get_task.cancel()
if not ls_task.done():
ls_task.cancel()
Copy link
Copy Markdown
Contributor

@googlyrahman googlyrahman Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand the reasoning behind this change.

From a first look, I think this could backfire on our customers. I don't see a reason why we should concurrently execute these two requests. The second _ls call only happens when the backend returns a 403 (Forbidden/Permission Error), meaning the user lacks bucket.get permission. In that case, we try the list operation in the hope that the user has bucket.list permission, which makes sense.

If you look at gcsfs/retry.py, the OSError is only raised when the status code is 403. In all other cases, the exception is different, meaning we would never need to execute the _ls call.

However, with this change, no matter the scenario, we end up making two calls where one would usually suffice. This might not be noticeable for a few _info calls, but it will backfire significantly when there is a high volume of them.

I see that you're cancelling the task once we have the result of get_task, but that doesn't really help. The asyncio event loop will still have initiated both requests concurrently, wasting resources just to speed things up for users who lack bucket.get permission while degrading the experience for those who have both permissions, or just bucket.get.

Hence, if my understanding is correct, this PR would only improve performance for the failure case where a user lacks bucket.get but has bucket.list permissions (rare customer) while degrading performance for the happy path, where the user has bucket.get or both permissions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You make good points here. Can we measure?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. This change aims to make the case when it falls back to bucket.list faster.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the code to use the same implementation as #786 to handle parallel tasks.
For happy path, I ran the micro benchmark for main and this branch: result. The percentage of change are between -29% and 15%. Half of the cases are above 0 and half are below 0. The absolute values of the latencies are small. So I believe these are reasonable variance and this change does not degrade the happy path performance.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the number of directories for this benchmark case - does it matter?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants