[Bug]: BestFirstCrawlingStrategy._arun_best_first interleaves link discovery with concurrent batch streaming, producing non-deterministic page sets

## crawl4ai version

0.8.0 (confirmed still present in 0.8.7 source on `main`)

## Related Issues

- #1253 (priority inversion + premature truncation — different bugs, same method)

## Expected Behavior

Given the same seed URL and unchanged remote content, `BestFirstCrawlingStrategy` should crawl the **same set of pages** in the **same order** across repeated runs. The priority queue should fully determine which URLs are visited when the `max_pages` budget is exhausted.

## Current Behavior

Repeated crawls of the same seed URL produce **different page sets** when the `max_pages` budget is saturated. Pages near the score threshold appear or disappear between runs non-deterministically.

## Root Cause

`_arun_best_first` dequeues up to `BATCH_SIZE` (10) URLs and fetches them concurrently via `arun_many(stream=True)`. As results stream back, **link discovery runs inline for each result as it arrives** (lines 240-252 of `bff_strategy.py`):

```python
async for result in stream_gen: # ← order depends on network timing
 ...
 yield result
 if result.success:
 new_links = []
 await self.link_discovery(result, result_url, depth, visited, new_links, depths)
 for new_url, new_parent in new_links:
 new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
 await queue.put((-new_score, new_depth, new_url, new_parent))
```

The order results return from `arun_many` depends on network latency per page — this is inherently non-deterministic. Because link discovery and queue insertion happen **inside** the streaming loop, the priority queue state diverges across runs:

```mermaid
sequenceDiagram
 participant Q as Priority Queue
 participant B as Batch (arun_many)
 participant N as Network

 Note over Q,N: Batch: [URL_A (score=0.8), URL_B (score=0.6)]

 rect rgb(255, 230, 230)
 Note over B,N: Run 1: B finishes before A
 B->>N: fetch A, B concurrently
 N-->>B: result_B arrives first
 B->>Q: discover B's children → enqueue C(0.3), D(0.5)
 N-->>B: result_A arrives second
 B->>Q: discover A's children → enqueue E(0.4), F(0.2)
 Note over Q: Queue: [D(0.5), E(0.4), C(0.3), F(0.2)]
 end

 rect rgb(230, 255, 230)
 Note over B,N: Run 2: A finishes before B
 B->>N: fetch A, B concurrently
 N-->>B: result_A arrives first
 B->>Q: discover A's children → enqueue E(0.4), F(0.2)
 N-->>B: result_B arrives second
 B->>Q: discover B's children → enqueue C(0.3), D(0.5)
 Note over Q: Queue: [D(0.5), E(0.4), C(0.3), F(0.2)]
 end

 Note over Q: Queue contents are identical! But pages_crawled count differs at intermediate steps...
```

The real problem is subtler than the final queue state — it's about **when `_pages_crawled` is incremented relative to `link_discovery`**:

```mermaid
flowchart TD
 A[Batch: URLs with scores 0.8, 0.6, 0.4] --> B{arun_many streaming}
 B -->|"Run 1: 0.4 finishes first"| C[pages_crawled++ → link_discovery remaining_capacity = 12]
 B -->|"Run 2: 0.8 finishes first"| D[pages_crawled++ → link_discovery remaining_capacity = 12]
 C --> E["0.4's children discovered with capacity=12"]
 D --> F["0.8's children discovered with capacity=12"]
 E --> G["0.8 finishes → pages_crawled++ remaining_capacity = 11"]
 F --> H["0.4 finishes → pages_crawled++ remaining_capacity = 11"]
 G --> I["0.8's children discovered with capacity=11"]
 H --> J["0.4's children discovered with capacity=11"]

 style C fill:#fdd
 style D fill:#dfd
 style E fill:#fdd
 style F fill:#dfd
```

When `max_pages` is nearly exhausted, the different `remaining_capacity` at the moment each page's links are discovered means **different links pass the capacity check in `link_discovery`** — producing different queue contents and ultimately different page sets.

## Effects

- **False change detection**: Document fingerprints change between runs even though remote content hasn't changed, because a different set of depth-2 pages is crawled each time.
- **Non-reproducible crawls**: Same seed URL + same scorer + same config → different results. Makes debugging and testing difficult.
- **Wasted API calls**: In pipelines that compare crawl results across runs (change detection, sync), non-deterministic page sets trigger unnecessary downstream processing.

## Steps to Reproduce

1. Use `BestFirstCrawlingStrategy` with `max_pages` set low enough to be saturated (e.g., 15)
2. Crawl a seed URL that has many depth-1 links, each of which discovers many depth-2 links
3. Run the same crawl twice with no content changes
4. Compare the crawled page sets — they will differ

## Proposed Fix

Collect all results from each `arun_many` batch **before** running link discovery, then process results in deterministic priority-queue order:

```python
async def _arun_best_first(self, start_url, crawler, config):
 queue = asyncio.PriorityQueue()
 initial_score = self.url_scorer.score(start_url) if self.url_scorer else 0
 await queue.put((-initial_score, 0, start_url, None))
 visited = set()
 depths = {start_url: 0}

 while not queue.empty() and not self._cancel_event.is_set():
 if self._pages_crawled >= self.max_pages:
 break

 batch_size = min(BATCH_SIZE, self.max_pages - self._pages_crawled)
 if batch_size <= 0:
 break

 batch = []
 for _ in range(batch_size):
 if queue.empty():
 break
 item = await queue.get()
 score, depth, url, parent_url = item
 if url in visited:
 continue
 visited.add(url)
 batch.append(item)

 if not batch:
 continue

 # Fetch concurrently — network parallelism preserved
 urls = [item[2] for item in batch]
 batch_config = config.clone(deep_crawl_strategy=None, stream=True)
 stream_gen = await crawler.arun_many(urls=urls, config=batch_config)

 # Collect ALL results before processing
 results_by_url = {}
 async for result in stream_gen:
 results_by_url[result.url] = result

 # Process in deterministic priority-queue order
 for score, depth, url, parent_url in batch:
 result = results_by_url.get(url)
 if result is None:
 continue

 result.metadata = result.metadata or {}
 result.metadata["depth"] = depth
 result.metadata["parent_url"] = parent_url
 result.metadata["score"] = -score

 if result.success:
 self._pages_crawled += 1
 if self._pages_crawled >= self.max_pages:
 yield result
 break

 yield result

 if result.success:
 new_links = []
 await self.link_discovery(result, url, depth, visited, new_links, depths)
 for new_url, new_parent in new_links:
 new_depth = depths.get(new_url, depth + 1)
 new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
 await queue.put((-new_score, new_depth, new_url, new_parent))
```

**Key change**: `results_by_url` collects all batch results, then the `for score, depth, url, parent_url in batch:` loop processes them in the original priority-queue dequeue order. This makes link discovery order deterministic regardless of network timing.

**Performance impact**: Negligible. `arun_many` still fetches concurrently. The only added latency is waiting for the slowest page in each batch before discovering links — typically a few seconds per batch. In our testing across 4 rounds of 3-grant crawls (15 pages each), there was no measurable runtime difference.

## Our Mitigation

We subclass `BestFirstCrawlingStrategy` and override `_arun_best_first` with the deterministic version shown above. This resolves the non-determinism for our use case but is obviously not ideal as a permanent downstream workaround.

## OS

Linux

## Python version

3.12

## Browser

Chromium (via Playwright)

## Browser version

N/A

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug]: BestFirstCrawlingStrategy._arun_best_first interleaves link discovery with concurrent batch streaming, producing non-deterministic page sets #1998

crawl4ai version

Related Issues

Expected Behavior

Current Behavior

Root Cause

Effects

Steps to Reproduce

Proposed Fix

Our Mitigation

OS

Python version

Browser

Browser version

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: BestFirstCrawlingStrategy._arun_best_first interleaves link discovery with concurrent batch streaming, producing non-deterministic page sets #1998

Description

crawl4ai version

Related Issues

Expected Behavior

Current Behavior

Root Cause

Effects

Steps to Reproduce

Proposed Fix

Our Mitigation

OS

Python version

Browser

Browser version

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions