Parallelize bucket get and listing in _info for bucket paths #780
yuxin00j wants to merge 7 commits into fsspec:main
Conversation
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #780      +/-   ##
==========================================
+ Coverage   75.84%   76.27%    +0.43%
==========================================
  Files          14       14
  Lines        2645     2651        +6
==========================================
+ Hits         2006     2022       +16
+ Misses        639      629       -10
```

View full report in Codecov by Sentry.
+1, looks good. Is there a specific edge case that was not handled before?
With this PR, the unit tests and benchmark tests cover the edge case where the first GET API call fails. This edge case is already handled in the current _info() method.
Force-pushed from db68d9c to bc07cf3 (Compare)
/gcbrun
gcsfs/core.py (Outdated)

```python
if not get_task.done():
    get_task.cancel()
if not ls_task.done():
    ls_task.cancel()
```
I don't quite understand the reasoning behind this change.
From a first look, I think this could backfire on our customers. I don't see a reason why we should concurrently execute these two requests. The second _ls call only happens when the backend returns a 403 (Forbidden/Permission Error), meaning the user lacks bucket.get permission. In that case, we try the list operation in the hope that the user has bucket.list permission, which makes sense.
If you look at gcsfs/retry.py, the OSError is only raised when the status code is 403. In all other cases, the exception is different, meaning we would never need to execute the _ls call.
However, with this change, no matter the scenario, we end up making two calls where one would usually suffice. This might not be noticeable for a few _info calls, but it will backfire significantly when there is a high volume of them.
I see that you're cancelling the task once we have the result of get_task, but that doesn't really help. The asyncio event loop will still have initiated both requests concurrently, wasting resources just to speed things up for users who lack bucket.get permission while degrading the experience for those who have both permissions, or just bucket.get.
Hence, if my understanding is correct, this PR only improves performance for the failure case where a user lacks bucket.get but has bucket.list permission (a rare case), while degrading performance for the happy path, where the user has bucket.get or both permissions.
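The sequential behavior the reviewer describes (GET first, LIST only when the backend rejects GET with a 403, which gcsfs/retry.py surfaces as OSError) can be sketched roughly as follows. The function and helper names here are illustrative stand-ins, not the actual gcsfs implementation:

```python
import asyncio


async def bucket_info(get_bucket, list_bucket):
    """Sequential fallback: only call LIST when GET fails.

    `get_bucket` and `list_bucket` are hypothetical stand-ins for the
    real gcsfs calls; a 403 surfaces as OSError per gcsfs/retry.py.
    """
    try:
        # Happy path: a single request when the caller has bucket.get.
        return await get_bucket()
    except OSError:
        # 403 Forbidden: fall back to listing, hoping the caller has
        # bucket.list even without bucket.get.
        return await list_bucket()


async def demo():
    async def get_ok():
        return {"name": "bucket", "type": "directory"}

    async def ls_never():
        raise AssertionError("LIST should not run on the happy path")

    return await bucket_info(get_ok, ls_never)


result = asyncio.run(demo())
print(result["name"])  # bucket
```

Under this shape, the happy path issues exactly one request, which is the behavior the reviewer is arguing should be preserved.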
You make good points here. Can we measure?
Yes, you are right. This change aims to speed up the case where _info falls back to bucket.list.
I updated the code to use the same implementation as #786 to handle parallel tasks.
For the happy path, I ran the micro-benchmark on main and on this branch: result. The percentage changes are between -29% and +15%; half of the cases are above zero and half are below. The absolute latencies are small, so I believe this is reasonable variance and the change does not degrade happy-path performance.
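For reference, a common way to implement the "first task wins" handling mentioned here is asyncio.wait with FIRST_COMPLETED, then cancelling whatever is still pending. This is a hedged sketch of the general pattern, not the code merged in #786:

```python
import asyncio


async def first_completed(coro_a, coro_b):
    """Run two coroutines concurrently, return the first result, and
    cancel the loser. All names here are illustrative only."""
    task_a = asyncio.create_task(coro_a)
    task_b = asyncio.create_task(coro_b)
    done, pending = await asyncio.wait(
        {task_a, task_b}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Let the cancellations settle so no "task was never retrieved"
    # warnings leak out of the event loop.
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()


async def demo():
    async def fast():
        return "get"

    async def slow():
        await asyncio.sleep(1)
        return "ls"

    return await first_completed(fast(), slow())


print(asyncio.run(demo()))  # get
```

Note the trade-off the reviewer raised still applies: both requests are initiated regardless of which one wins, so the cancellation only limits wasted work, it does not avoid starting it.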
What is the number of directories for this benchmark case - does it matter?
…est restricted permissions.
…tests for _info fallback
This PR optimizes the GCSFileSystem._info method for bucket roots (gs://bucket) by running the bucket metadata retrieval (GET) and the bucket contents listing (ls) concurrently. It introduces a generic parallel_tasks_first_completed async context manager to ensure clean resource management and early returns.
Key Changes
- New concurrency utility (gcsfs/concurrency.py): adds a parallel_tasks_first_completed async context manager.
- Refactor of GCSFileSystem._info (gcsfs/core.py).
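A minimal sketch of what an async context manager with this shape could look like; all internals below are assumptions for illustration, and the real gcsfs/concurrency.py may differ:

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def parallel_tasks_first_completed(*coros):
    """Start all coroutines, yield the set of tasks that finished
    first, and cancel anything still pending on exit (including on
    early return or exception), giving clean resource management."""
    tasks = [asyncio.create_task(c) for c in coros]
    try:
        done, _pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )
        yield done
    finally:
        for task in tasks:
            if not task.done():
                task.cancel()
        # Swallow CancelledError from the cancelled tasks.
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo():
    async def get_call():
        return "bucket-metadata"

    async def ls_call():
        await asyncio.sleep(1)
        return "listing"

    async with parallel_tasks_first_completed(get_call(), ls_call()) as done:
        # Early return is safe: the finally block cancels ls_call.
        return [t.result() for t in done]


print(asyncio.run(demo()))  # ['bucket-metadata']
```

Putting the cancellation in the context manager's finally block is what makes the early-return path in _info safe: the caller never has to remember to clean up the losing task.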