fix #1563 (cdp): resolve page leaks and race conditions in concurrent…#1592
Open
Ahmed-Tawfik94 wants to merge 1 commit into develop from
… crawling

Fix memory leaks and race conditions when using arun_many() with managed CDP browsers. Each crawl now gets proper page isolation with automatic cleanup while maintaining shared browser context.

Key fixes:
- Close non-session pages after crawling to prevent tab accumulation
- Add thread-safe page creation with locks to avoid concurrent access
- Improve page lifecycle management for managed vs non-managed browsers
- Keep session pages alive for authentication persistence
- Prevent TOCTOU (time-of-check-time-of-use) race conditions

This ensures stable parallel crawling without memory growth or browser instability.
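The locking and cleanup behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FakePage`, `FakeContext`, `PageManager`, `get_page`, `release_page` and `crawl` are hypothetical names, and the fake classes stand in for real browser tabs so the sketch runs without a browser. It assumes an asyncio-based design, so the lock is an `asyncio.Lock`.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a browser tab; tracks only closed state."""

    def __init__(self, page_id):
        self.page_id = page_id
        self.closed = False

    async def close(self):
        self.closed = True


class FakeContext:
    """Hypothetical stand-in for the shared browser context."""

    def __init__(self):
        self.pages = []  # every tab ever opened, closed or not

    async def new_page(self):
        await asyncio.sleep(0)  # yield control, as a real round trip would
        page = FakePage(len(self.pages))
        self.pages.append(page)
        return page


class PageManager:
    """Hands out one tab per crawl, or one shared tab per session."""

    def __init__(self, context):
        self.context = context
        self.sessions = {}
        self._lock = asyncio.Lock()

    async def get_page(self, session_id=None):
        # The session lookup and the page creation happen under one lock,
        # so no other task can slip in between the check and the use.
        async with self._lock:
            if session_id is not None and session_id in self.sessions:
                return self.sessions[session_id]
            page = await self.context.new_page()
            if session_id is not None:
                self.sessions[session_id] = page
            return page

    async def release_page(self, page, session_id=None):
        # Session pages stay open; everything else is closed after use.
        if session_id is None:
            await page.close()


async def crawl(manager, url, session_id=None):
    page = await manager.get_page(session_id)
    try:
        await asyncio.sleep(0)  # placeholder for navigation and extraction
        return page
    finally:
        await manager.release_page(page, session_id)


async def demo():
    manager = PageManager(FakeContext())
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    anonymous = await asyncio.gather(*(crawl(manager, u) for u in urls))
    in_session = await asyncio.gather(*(crawl(manager, u, "login") for u in urls))
    return manager, anonymous, in_session
```

Without the lock, three concurrent crawls with the same session id could each miss the session lookup and open their own tab, which is the check-then-act race the description refers to.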
Collaborator
@Ahmed-Tawfik94 Why didn't you implement the recommended solution you provided in the root cause message here?
Collaborator
Author
I'm guessing you're referring to the code that is meant to address the tab accumulation that can happen when opening a new page, which can lead to memory leaks. So I added this logic to close non-session pages after crawling.
#1563 Fix memory leaks and race conditions in CDP managed browser crawling
Summary
Fixes #1563
This PR resolves critical memory leaks and race conditions that occurred when using arun_many() with managed CDP browsers. The main issues were:
The fix ensures that:
- arun_many() gets its own isolated page/tab
List of files changed and why
crawl4ai/async_crawler_strategy.py - Updated page cleanup logic to properly close pages after crawling when using non-managed browsers, while preserving session pages for authentication persistence
crawl4ai/browser_manager.py - Added thread-safe page creation with locks to prevent race conditions, and improved page lifecycle management to distinguish between managed and non-managed browser contexts
docs/md_v2/advanced/cdp-browser-crawling.md - Added comprehensive documentation for CDP browser crawling, including setup instructions, usage examples, and best practices for managed browser workflows
tests/test_arun_many_cdp.py - Created new test suite with both parallel and sequential test cases to verify proper page isolation and cleanup in arun_many() operations with managed CDP browsers
How Has This Been Tested?
The changes have been tested with:
Unit Tests: Created tests/test_arun_many_cdp.py with two test scenarios:
- test_arun_many_with_cdp(): Tests parallel crawling of 3 URLs to verify proper page isolation
- test_arun_many_with_cdp_sequential(): Tests sequential crawling to isolate potential issues
Manual Testing:
- localhost:9222
- arun_many() operations to confirm tabs are created and cleaned up properly
Test Requirements: Tests require a running CDP browser instance (can be started with crwl cdp -d 9222)
All tests pass successfully, confirming that memory leaks and race conditions are resolved.
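One way to check tab cleanup by hand is to count open tabs before and after a crawl. The sketch below is not part of this PR's test suite; `count_page_targets`, `fetch_targets` and `SAMPLE_TARGETS` are hypothetical names. It assumes the browser exposes the standard DevTools HTTP endpoint `/json/list` on the debugging port, which returns a JSON array of targets, each with a `type` field.

```python
import json
from urllib.request import urlopen


def count_page_targets(targets):
    """Count entries whose type is "page", i.e. ordinary tabs."""
    return sum(1 for target in targets if target.get("type") == "page")


def fetch_targets(host="localhost", port=9222):
    """Read the target list from the browser's DevTools HTTP endpoint.

    Needs a browser started with remote debugging on the given port.
    """
    with urlopen(f"http://{host}:{port}/json/list", timeout=5) as response:
        return json.load(response)


# Canned payload in the shape the endpoint returns, so the counting logic
# can be exercised without a running browser. The values are made up.
SAMPLE_TARGETS = [
    {"id": "A1", "type": "page", "url": "about:blank"},
    {"id": "B2", "type": "page", "url": "https://example.com/"},
    {"id": "C3", "type": "service_worker", "url": "https://example.com/sw.js"},
]
```

Calling `count_page_targets(fetch_targets())` before and after an arun_many() run gives a rough signal: if the count returns to its starting value, non-session tabs were most likely closed.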
Checklist: