You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Given the same seed URL and unchanged remote content, BestFirstCrawlingStrategy should crawl the same set of pages in the same order across repeated runs. The priority queue should fully determine which URLs are visited when the max_pages budget is exhausted.
Current Behavior
Repeated crawls of the same seed URL produce different page sets when the max_pages budget is saturated. Pages near the score threshold appear or disappear between runs non-deterministically.
Root Cause
_arun_best_first dequeues up to BATCH_SIZE (10) URLs and fetches them concurrently via arun_many(stream=True). As results stream back, link discovery runs inline for each result as it arrives (lines 240-252 of bff_strategy.py):
The order results return from arun_many depends on network latency per page — this is inherently non-deterministic. Because link discovery and queue insertion happen inside the streaming loop, the priority queue state diverges across runs:
sequenceDiagram
participant Q as Priority Queue
participant B as Batch (arun_many)
participant N as Network
Note over Q,N: Batch: [URL_A (score=0.8), URL_B (score=0.6)]
rect rgb(255, 230, 230)
Note over B,N: Run 1: B finishes before A
B->>N: fetch A, B concurrently
N-->>B: result_B arrives first
B->>Q: discover B's children → enqueue C(0.3), D(0.5)
N-->>B: result_A arrives second
B->>Q: discover A's children → enqueue E(0.4), F(0.2)
Note over Q: Queue: [D(0.5), E(0.4), C(0.3), F(0.2)]
end
rect rgb(230, 255, 230)
Note over B,N: Run 2: A finishes before B
B->>N: fetch A, B concurrently
N-->>B: result_A arrives first
B->>Q: discover A's children → enqueue E(0.4), F(0.2)
N-->>B: result_B arrives second
B->>Q: discover B's children → enqueue C(0.3), D(0.5)
Note over Q: Queue: [D(0.5), E(0.4), C(0.3), F(0.2)]
end
Note over Q: Queue contents are identical!<br/>But pages_crawled count differs<br/>at intermediate steps...
Loading
The real problem is subtler than the final queue state — it's about when _pages_crawled is incremented relative to link_discovery:
flowchart TD
A[Batch: URLs with scores 0.8, 0.6, 0.4] --> B{arun_many streaming}
B -->|"Run 1: 0.4 finishes first"| C[pages_crawled++ → link_discovery<br/>remaining_capacity = 12]
B -->|"Run 2: 0.8 finishes first"| D[pages_crawled++ → link_discovery<br/>remaining_capacity = 12]
C --> E["0.4's children discovered with capacity=12"]
D --> F["0.8's children discovered with capacity=12"]
E --> G["0.8 finishes → pages_crawled++<br/>remaining_capacity = 11"]
F --> H["0.4 finishes → pages_crawled++<br/>remaining_capacity = 11"]
G --> I["0.8's children discovered with capacity=11"]
H --> J["0.4's children discovered with capacity=11"]
style C fill:#fdd
style D fill:#dfd
style E fill:#fdd
style F fill:#dfd
Loading
When max_pages is nearly exhausted, the different remaining_capacity at the moment each page's links are discovered means different links pass the capacity check in link_discovery — producing different queue contents and ultimately different page sets.
Effects
False change detection: Document fingerprints change between runs even though remote content hasn't changed, because a different set of depth-2 pages is crawled each time.
Non-reproducible crawls: Same seed URL + same scorer + same config → different results. Makes debugging and testing difficult.
Wasted API calls: In pipelines that compare crawl results across runs (change detection, sync), non-deterministic page sets trigger unnecessary downstream processing.
Steps to Reproduce
Use BestFirstCrawlingStrategy with max_pages set low enough to be saturated (e.g., 15)
Crawl a seed URL that has many depth-1 links, each of which discovers many depth-2 links
Run the same crawl twice with no content changes
Compare the crawled page sets — they will differ
Proposed Fix
Collect all results from each arun_many batch before running link discovery, then process results in deterministic priority-queue order:
Key change: results_by_url collects all batch results, then the for score, depth, url, parent_url in batch: loop processes them in the original priority-queue dequeue order. This makes link discovery order deterministic regardless of network timing.
Performance impact: Negligible. arun_many still fetches concurrently. The only added latency is waiting for the slowest page in each batch before discovering links — typically a few seconds per batch. In our testing across 4 rounds of 3-grant crawls (15 pages each), there was no measurable runtime difference.
Our Mitigation
We subclass BestFirstCrawlingStrategy and override _arun_best_first with the deterministic version shown above. This resolves the non-determinism for our use case but is obviously not ideal as a permanent downstream workaround.
crawl4ai version
0.8.0 (confirmed still present in 0.8.7 source on
main)Related Issues
Expected Behavior
Given the same seed URL and unchanged remote content,
BestFirstCrawlingStrategyshould crawl the same set of pages in the same order across repeated runs. The priority queue should fully determine which URLs are visited when themax_pagesbudget is exhausted.Current Behavior
Repeated crawls of the same seed URL produce different page sets when the
max_pagesbudget is saturated. Pages near the score threshold appear or disappear between runs non-deterministically.Root Cause
_arun_best_firstdequeues up toBATCH_SIZE(10) URLs and fetches them concurrently viaarun_many(stream=True). As results stream back, link discovery runs inline for each result as it arrives (lines 240-252 ofbff_strategy.py):The order results return from
arun_manydepends on network latency per page — this is inherently non-deterministic. Because link discovery and queue insertion happen inside the streaming loop, the priority queue state diverges across runs:sequenceDiagram participant Q as Priority Queue participant B as Batch (arun_many) participant N as Network Note over Q,N: Batch: [URL_A (score=0.8), URL_B (score=0.6)] rect rgb(255, 230, 230) Note over B,N: Run 1: B finishes before A B->>N: fetch A, B concurrently N-->>B: result_B arrives first B->>Q: discover B's children → enqueue C(0.3), D(0.5) N-->>B: result_A arrives second B->>Q: discover A's children → enqueue E(0.4), F(0.2) Note over Q: Queue: [D(0.5), E(0.4), C(0.3), F(0.2)] end rect rgb(230, 255, 230) Note over B,N: Run 2: A finishes before B B->>N: fetch A, B concurrently N-->>B: result_A arrives first B->>Q: discover A's children → enqueue E(0.4), F(0.2) N-->>B: result_B arrives second B->>Q: discover B's children → enqueue C(0.3), D(0.5) Note over Q: Queue: [D(0.5), E(0.4), C(0.3), F(0.2)] end Note over Q: Queue contents are identical!<br/>But pages_crawled count differs<br/>at intermediate steps...The real problem is subtler than the final queue state — it's about when
_pages_crawledis incremented relative tolink_discovery:flowchart TD A[Batch: URLs with scores 0.8, 0.6, 0.4] --> B{arun_many streaming} B -->|"Run 1: 0.4 finishes first"| C[pages_crawled++ → link_discovery<br/>remaining_capacity = 12] B -->|"Run 2: 0.8 finishes first"| D[pages_crawled++ → link_discovery<br/>remaining_capacity = 12] C --> E["0.4's children discovered with capacity=12"] D --> F["0.8's children discovered with capacity=12"] E --> G["0.8 finishes → pages_crawled++<br/>remaining_capacity = 11"] F --> H["0.4 finishes → pages_crawled++<br/>remaining_capacity = 11"] G --> I["0.8's children discovered with capacity=11"] H --> J["0.4's children discovered with capacity=11"] style C fill:#fdd style D fill:#dfd style E fill:#fdd style F fill:#dfdWhen
max_pagesis nearly exhausted, the differentremaining_capacityat the moment each page's links are discovered means different links pass the capacity check inlink_discovery— producing different queue contents and ultimately different page sets.Effects
Steps to Reproduce
BestFirstCrawlingStrategywithmax_pagesset low enough to be saturated (e.g., 15)Proposed Fix
Collect all results from each
arun_manybatch before running link discovery, then process results in deterministic priority-queue order:Key change:
results_by_urlcollects all batch results, then thefor score, depth, url, parent_url in batch:loop processes them in the original priority-queue dequeue order. This makes link discovery order deterministic regardless of network timing.Performance impact: Negligible.
arun_manystill fetches concurrently. The only added latency is waiting for the slowest page in each batch before discovering links — typically a few seconds per batch. In our testing across 4 rounds of 3-grant crawls (15 pages each), there was no measurable runtime difference.Our Mitigation
We subclass
BestFirstCrawlingStrategyand override_arun_best_firstwith the deterministic version shown above. This resolves the non-determinism for our use case but is obviously not ideal as a permanent downstream workaround.OS
Linux
Python version
3.12
Browser
Chromium (via Playwright)
Browser version
N/A