Enhance _info method to check file and directory info in parallel#786
Enhance _info method to check file and directory info in parallel#786yuxin00j wants to merge 12 commits intofsspec:mainfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #786 +/- ##
==========================================
+ Coverage 75.96% 76.40% +0.43%
==========================================
Files 14 15 +1
Lines 2663 2687 +24
==========================================
+ Hits 2023 2053 +30
+ Misses 640 634 -6 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Hi @ankitaluthra1, you may check the update on optimization in _info here and in #780 |
|
/gcbrun |
|
@yuxin00j Can you please check the e2e failure |
…and format with black
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…wait and simplify parallel task evaluation in _info
|
Hi @ankitaluthra1, I have fixed the test failure. |
| self._get_directory_info(path, bucket, key, generation), | ||
| ] | ||
| ) as (tasks, done, pending): | ||
| exact_task, dir_task = tasks |
There was a problem hiding this comment.
nit: Can we rename exact_task variable name to get_object_task or some other meaningful name?
|
QQ: Was the 30% to 60% improvement also observed for HNS buckets where we are parallelizing get_object and get_folder calls? |
| placeholder = f"{base_dir}/folder_with_placeholder/" | ||
| res = await gcs._info(path) | ||
| # Should prefer directory info over marker | ||
| assert res["extra"] == "info" |
There was a problem hiding this comment.
Let's also assert the type of the result here similar to other cases
|
|
||
| assert found_names == expected_basenames | ||
| @pytest.mark.asyncio | ||
| async def test_info_parallel(gcs): |
There was a problem hiding this comment.
Consider refactoring these scenarios using @pytest.mark.parametrize. This will make the test cleaner by removing the need for manual mock resets (like mock_get_dir.side_effect = None) and ensures that a failure in one case doesn't prevent the remaining cases from running. Or check if we can breakdown the test into smaller ones.
There was a problem hiding this comment.
Let's add a test file to cover the code in this file.
Is the speedup for file paths related to the changes in this PR? I am assuming it is variance and not related to this PR as the latency for file paths shouldn't be impacted by this change, let me know if I am missing something here. |
Optimize the performance of the _info method by enabling concurrent checks for file paths and directory listings.
Early Return Strategy: If _get_object completes first and resolves to a valid file (not a directory marker), the execution cancels the directory scan tasks and returns the file metadata immediately.
Fallback Logic: If _get_object fails or yields a directory marker, it safely falls back to the directory tree scan result.
Benchmark run result
Folder Info
Execution times consistently dropped by 30% to 60% across all single-threaded and multi-process configurations.
File Info
Results are mixed but generally neutral, showing minor speedups of up to 24.6% in high process count runs. One outlier showed a minor regression in deep regional tests.
Bucket Info
This optimization does not affect info call for bucket.