Skip to content

[Bug]: BestFirstCrawlingStrategy._arun_best_first interleaves link discovery with concurrent batch streaming, producing non-deterministic page sets #1998

@bgk0018

Description

@bgk0018

crawl4ai version

0.8.0 (confirmed still present in 0.8.7 source on main)

Related Issues

Expected Behavior

Given the same seed URL and unchanged remote content, BestFirstCrawlingStrategy should crawl the same set of pages in the same order across repeated runs. The priority queue should fully determine which URLs are visited when the max_pages budget is exhausted.

Current Behavior

Repeated crawls of the same seed URL produce different page sets when the max_pages budget is saturated. Pages near the score threshold appear or disappear between runs non-deterministically.

Root Cause

_arun_best_first dequeues up to BATCH_SIZE (10) URLs and fetches them concurrently via arun_many(stream=True). As results stream back, link discovery runs inline for each result as it arrives (lines 240-252 of bff_strategy.py):

async for result in stream_gen:          # ← order depends on network timing
    ...
    yield result
    if result.success:
        new_links = []
        await self.link_discovery(result, result_url, depth, visited, new_links, depths)
        for new_url, new_parent in new_links:
            new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
            await queue.put((-new_score, new_depth, new_url, new_parent))

The order results return from arun_many depends on network latency per page — this is inherently non-deterministic. Because link discovery and queue insertion happen inside the streaming loop, the priority queue state diverges across runs:

sequenceDiagram
    participant Q as Priority Queue
    participant B as Batch (arun_many)
    participant N as Network

    Note over Q,N: Batch: [URL_A (score=0.8), URL_B (score=0.6)]

    rect rgb(255, 230, 230)
        Note over B,N: Run 1: B finishes before A
        B->>N: fetch A, B concurrently
        N-->>B: result_B arrives first
        B->>Q: discover B's children → enqueue C(0.3), D(0.5)
        N-->>B: result_A arrives second
        B->>Q: discover A's children → enqueue E(0.4), F(0.2)
        Note over Q: Queue: [D(0.5), E(0.4), C(0.3), F(0.2)]
    end

    rect rgb(230, 255, 230)
        Note over B,N: Run 2: A finishes before B
        B->>N: fetch A, B concurrently
        N-->>B: result_A arrives first
        B->>Q: discover A's children → enqueue E(0.4), F(0.2)
        N-->>B: result_B arrives second
        B->>Q: discover B's children → enqueue C(0.3), D(0.5)
        Note over Q: Queue: [D(0.5), E(0.4), C(0.3), F(0.2)]
    end

    Note over Q: Queue contents are identical!<br/>But pages_crawled count differs<br/>at intermediate steps...
Loading

The real problem is subtler than the final queue state — it's about when _pages_crawled is incremented relative to link_discovery:

flowchart TD
    A[Batch: URLs with scores 0.8, 0.6, 0.4] --> B{arun_many streaming}
    B -->|"Run 1: 0.4 finishes first"| C[pages_crawled++ → link_discovery<br/>remaining_capacity = 12]
    B -->|"Run 2: 0.8 finishes first"| D[pages_crawled++ → link_discovery<br/>remaining_capacity = 12]
    C --> E["0.4's children discovered with capacity=12"]
    D --> F["0.8's children discovered with capacity=12"]
    E --> G["0.8 finishes → pages_crawled++<br/>remaining_capacity = 11"]
    F --> H["0.4 finishes → pages_crawled++<br/>remaining_capacity = 11"]
    G --> I["0.8's children discovered with capacity=11"]
    H --> J["0.4's children discovered with capacity=11"]

    style C fill:#fdd
    style D fill:#dfd
    style E fill:#fdd
    style F fill:#dfd
Loading

When max_pages is nearly exhausted, the different remaining_capacity at the moment each page's links are discovered means different links pass the capacity check in link_discovery — producing different queue contents and ultimately different page sets.

Effects

  • False change detection: Document fingerprints change between runs even though remote content hasn't changed, because a different set of depth-2 pages is crawled each time.
  • Non-reproducible crawls: Same seed URL + same scorer + same config → different results. Makes debugging and testing difficult.
  • Wasted API calls: In pipelines that compare crawl results across runs (change detection, sync), non-deterministic page sets trigger unnecessary downstream processing.

Steps to Reproduce

  1. Use BestFirstCrawlingStrategy with max_pages set low enough to be saturated (e.g., 15)
  2. Crawl a seed URL that has many depth-1 links, each of which discovers many depth-2 links
  3. Run the same crawl twice with no content changes
  4. Compare the crawled page sets — they will differ

Proposed Fix

Collect all results from each arun_many batch before running link discovery, then process results in deterministic priority-queue order:

async def _arun_best_first(self, start_url, crawler, config):
    queue = asyncio.PriorityQueue()
    initial_score = self.url_scorer.score(start_url) if self.url_scorer else 0
    await queue.put((-initial_score, 0, start_url, None))
    visited = set()
    depths = {start_url: 0}

    while not queue.empty() and not self._cancel_event.is_set():
        if self._pages_crawled >= self.max_pages:
            break

        batch_size = min(BATCH_SIZE, self.max_pages - self._pages_crawled)
        if batch_size <= 0:
            break

        batch = []
        for _ in range(batch_size):
            if queue.empty():
                break
            item = await queue.get()
            score, depth, url, parent_url = item
            if url in visited:
                continue
            visited.add(url)
            batch.append(item)

        if not batch:
            continue

        # Fetch concurrently — network parallelism preserved
        urls = [item[2] for item in batch]
        batch_config = config.clone(deep_crawl_strategy=None, stream=True)
        stream_gen = await crawler.arun_many(urls=urls, config=batch_config)

        # Collect ALL results before processing
        results_by_url = {}
        async for result in stream_gen:
            results_by_url[result.url] = result

        # Process in deterministic priority-queue order
        for score, depth, url, parent_url in batch:
            result = results_by_url.get(url)
            if result is None:
                continue

            result.metadata = result.metadata or {}
            result.metadata["depth"] = depth
            result.metadata["parent_url"] = parent_url
            result.metadata["score"] = -score

            if result.success:
                self._pages_crawled += 1
                if self._pages_crawled >= self.max_pages:
                    yield result
                    break

            yield result

            if result.success:
                new_links = []
                await self.link_discovery(result, url, depth, visited, new_links, depths)
                for new_url, new_parent in new_links:
                    new_depth = depths.get(new_url, depth + 1)
                    new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
                    await queue.put((-new_score, new_depth, new_url, new_parent))

Key change: results_by_url collects all batch results, then the for score, depth, url, parent_url in batch: loop processes them in the original priority-queue dequeue order. This makes link discovery order deterministic regardless of network timing.

Performance impact: Negligible. arun_many still fetches concurrently. The only added latency is waiting for the slowest page in each batch before discovering links — typically a few seconds per batch. In our testing across 4 rounds of 3-grant crawls (15 pages each), there was no measurable runtime difference.

Our Mitigation

We subclass BestFirstCrawlingStrategy and override _arun_best_first with the deterministic version shown above. This resolves the non-determinism for our use case but is obviously not ideal as a permanent downstream workaround.

OS

Linux

Python version

3.12

Browser

Chromium (via Playwright)

Browser version

N/A

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions