fix: re-download model when cached files fail verification by Dmheath1 · Pull Request #642 · qdrant/fastembed

Dmheath1 · 2026-05-31T19:17:45Z

What this fixes

If a model download is interrupted, or the cache is truncated later (a partial copy into a Docker image, a CI cache restore, a disk issue), fastembed leaves a short or empty file in the HF cache. The next time you load that model, the offline-first probe sees the file on disk and returns its path. onnxruntime then throws INVALID_PROTOBUF / ModelProto does not have a graph, and the only way out today is to delete the cache by hand.

Reproduce (on main)

import glob, os
from fastembed import TextEmbedding

m = TextEmbedding("BAAI/bge-small-en-v1.5")
list(m.embed(["hi"]))                        # download + cache

blob = sorted({os.path.realpath(p) for p in
               glob.glob(os.path.join(m.model._model_dir, "**/*.onnx"), recursive=True)})[0]
open(blob, "r+b").truncate(1024)             # truncate the cached weight

m = TextEmbedding("BAAI/bge-small-en-v1.5")  # network is up
list(m.embed(["hi"]))                        # main: INVALID_PROTOBUF, blob never re-fetched

The fix

The offline cache probe raises when a model file's size does not match the recorded metadata, instead of warning and returning the corrupt path. download_model already wraps that probe in a try/except, so the raise just routes to the existing online retry.
That retry now passes force_download=True. huggingface_hub's cache check is existence-only, so a plain snapshot_download keeps a present-but-truncated blob; force_download is the documented way to re-fetch it (the same idiom transformers and sentence-transformers use).
Verification walks the repo tree recursively and matches files by their repo-relative path rather than by bare filename. Without this, weights kept in a subdirectory (onnx/model.onnx, more than half the supported models) were never recorded in the metadata, so neither the new self-heal nor the existing download-integrity check ever looked at them.
A size mismatch on a non-model file (a config) is still left to best-effort loading, so an offline user with a harmless config difference is not blocked.

Behavior

Corrupt model cache, network up: re-downloads and loads.
Valid cache: unchanged, no extra downloads.
Corrupt cache, offline: a clear "Could not load model" error instead of a raw protobuf failure, and no network attempt.

Tested locally against root-level (bge), subdirectory (snowflake arctic, jina reranker) and external-weight (e5, minicoil) layouts, plus the existing suite.

An interrupted or externally truncated download can leave a corrupt file in the Hugging Face cache. On the next load the offline-first probe sees the file on disk and returns it, so the model fails with INVALID_PROTOBUF and the only fix is to delete the cache by hand. This makes the cache self-heal. The probe now raises when a model file's size does not match the recorded metadata, so download_model falls through to an online retry with force_download=True (huggingface_hub's cache check is existence-only and will not re-fetch a present-but-truncated blob). Verification also walks the repo tree recursively and matches files by their repo-relative path. Without that, weights kept in a subdirectory (onnx/model.onnx, more than half the models) were never recorded in the metadata, so neither the new check nor the existing one ever looked at them. A mismatch on an auxiliary config file is still left to best-effort loading, as before.

coderabbitai · 2026-05-31T19:25:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d483cdce-7e36-4a36-b50f-fe8b39fb509b

📥 Commits

Reviewing files that changed from the base of the PR and between f7b82d3 and b0a2f5f.

📒 Files selected for processing (1)

fastembed/common/model_management.py

🚧 Files skipped from review as they are similar to previous changes (1)

fastembed/common/model_management.py

📝 Walkthrough

Walkthrough

This PR strengthens offline model caching in download_files_from_huggingface by normalizing repository-relative paths for consistent file matching, implementing stricter validation that raises errors when model files are corrupt or truncated in the cache, and expanding file type discovery. When offline validation fails due to corrupt primary model files, the download process retries with force_download=True to bypass the stale cache. The test suite verifies that corrupt model files trigger errors while auxiliary file mismatches are tolerated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

qdrant/fastembed#624: Addresses related offline cache handling and model file validation to prevent accepting truncated caches in local_files_only mode.

Suggested reviewers

tbung

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and concisely describes the primary fix: re-downloading models when cached files fail verification, which is the core objective of this PR.
Description check	✅ Passed	The description thoroughly explains what the PR fixes, provides reproduction steps, documents the implementation approach, and describes the resulting behavior, all clearly related to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fastembed/common/model_management.py`:
- Around line 475-479: The call currently passes force_download=True and also
**kwargs which can contain force_download, causing a TypeError; update the
download_model call site (in fastembed.common.model_management, function
download_model) to ensure force_download is set in the kwargs map before
expanding it (e.g., remove any existing force_download key and then set
kwargs['force_download']=True or overwrite it) so only one value is supplied to
the underlying function.
- Around line 262-268: The call to list_repo_tree(...) incorrectly uses
recursive=True; remove that unsupported kwarg and instead call
list_repo_tree(hf_source_repo, revision=repo_revision, repo_type="model",
path=...) or use expand=True (depending on API) to retrieve directory entries,
then perform client-side recursion/filtering to collect nested files into
repo_tree; update the code around repo_tree and the filtering that uses
allowed_extensions to walk subdirectories (by repeatedly calling list_repo_tree
with sub-paths or expanding entries) so that nested model files are discovered
without passing recursive=True to huggingface_hub.list_repo_tree.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 04ae48e3-578d-4d6a-aa44-89abf81e7057

📥 Commits

Reviewing files that changed from the base of the PR and between 8a8ea4f and f7b82d3.

📒 Files selected for processing (2)

fastembed/common/model_management.py
tests/test_model_management.py

coderabbitai · 2026-05-31T19:25:21Z

+        repo_tree = list(
+            list_repo_tree(
+                hf_source_repo, revision=repo_revision, repo_type="model", recursive=True
+            )
+        )

-        allowed_extensions = {".json", ".onnx", ".txt"}
+        allowed_extensions = {".json", ".onnx", ".txt", ".onnx_data", ".npy", ".vocab"}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

huggingface_hub list_repo_tree recursive parameter

💡 Result:

In huggingface_hub, HfApi.list_repo_tree() does not provide a recursive parameter. Instead, you control what is listed via its other arguments (notably path/subdirectory scope) and you can get expanded metadata depending on the expand flag. Hugging Face staff also note that listing files in a repo is costly and that “it is not possible to filter them server-side when listing files from a repo… best solution is still to do the filtering client-side”; however, you can limit the request by only listing files from a subdirectory [1]. Practical implication for your question: if you want a recursive listing, you should treat list_repo_tree() as the API meant for listing repo trees, and use the supported arguments to restrict scope (e.g., target a subdirectory path). Don’t look for a recursive parameter; it isn’t part of the documented method signature [1]. References: Hugging Face issue discussion explaining server-side filtering limits for list_repo_tree [1]; and the official HfApi documentation entry point (for method signatures and related API usage) [2].

Citations:

1: Support server-side filtering for list_repo_tree huggingface/huggingface_hub#2676

2: https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api

Fix list_repo_tree call: remove unsupported recursive=True.

fastembed/common/model_management.py (lines 262-268) passes recursive=True to huggingface_hub.list_repo_tree(), but HfApi.list_repo_tree() has no recursive parameter; this will raise a TypeError at runtime. Rework the listing to use supported arguments (e.g., subdirectory scoping via path and/or expand, then filter client-side) to collect nested model files.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fastembed/common/model_management.py` around lines 262 - 268, The call to list_repo_tree(...) incorrectly uses recursive=True; remove that unsupported kwarg and instead call list_repo_tree(hf_source_repo, revision=repo_revision, repo_type="model", path=...) or use expand=True (depending on API) to retrieve directory entries, then perform client-side recursion/filtering to collect nested files into repo_tree; update the code around repo_tree and the filtering that uses allowed_extensions to walk subdirectories (by repeatedly calling list_repo_tree with sub-paths or expanding entries) so that nested model files are discovered without passing recursive=True to huggingface_hub.list_repo_tree.

list_repo_tree does take recursive. It is in the signature (recursive: bool = False) and has been since the method landed in huggingface_hub 0.23.0, which is also the floor this function already needs, since it calls list_repo_tree regardless of this PR. So there is no new version constraint here. The linked #2676 is about server-side filtering on large repos, not whether the parameter exists. Verified on 1.17.0: the call returns the nested files (onnx/model.onnx plus the quantized variants) without raising, so I have left recursive=True as is.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Resolve the coderabbit review comment on the PR gh api graphql -f query=' { repository(owner: "qdrant", name: "fastembed") { pullRequest(number: 642) { reviewThreads(first: 50) { nodes { id isResolved comments(first: 5) { nodes { author { login } body url } } } } } } }'

Length of output: 8970

🏁 Script executed:

#!/bin/bash # Resolve the incorrect review thread gh api graphql -f query=' mutation { resolveReviewThread(input: { threadId: "PRRT_kwDOJ7ZVjc6F90mS" }) { thread { id isResolved } } }'

Length of output: 717

@Dmheath1, thanks for the thorough clarification — my initial flag was wrong and I've noted the correction. Unfortunately I don't have permission to resolve the thread programmatically in this repo, so you'd need to resolve it manually. The recursive=True usage is correct and the review comment can be dismissed.

(ᵔ ᵕ ᵔ)🐇

🧬 Code Graph Analysis Results

fastembed/common/model_management.py

Lines 11-12 (imports list_repo_tree):

from huggingface_hub import snapshot_download, model_info, list_repo_tree from huggingface_hub.hf_api import RepoFile

Lines 261-277 (calls list_repo_tree(..., recursive=True) and builds repo_files from the returned tree):

repo_revision = model_info(hf_source_repo).sha repo_tree = list( list_repo_tree( hf_source_repo, revision=repo_revision, repo_type="model", recursive=True ) ) allowed_extensions = {".json", ".onnx", ".txt", ".onnx_data", ".npy", ".vocab"} repo_files = ( [ f for f in repo_tree if isinstance(f, RepoFile) and Path(f.path).suffix in allowed_extensions ] if repo_tree else [] )

✏️ Learnings added

Learnt from: Dmheath1 Repo: qdrant/fastembed PR: 642 File: fastembed/common/model_management.py:262-268 Timestamp: 2026-05-31T19:37:43.899Z Learning: `huggingface_hub.list_repo_tree()` does support a `recursive: bool = False` parameter — it has been part of the method signature since huggingface_hub 0.23.0 and works correctly on 1.17.0. Passing `recursive=True` returns all files in nested subdirectories (e.g., `onnx/model.onnx`) without raising a TypeError. GitHub issue huggingface/huggingface_hub#2676 is about server-side filtering performance on large repos, not the absence of the `recursive` parameter. Do not flag `recursive=True` in `list_repo_tree` calls as invalid.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

If a caller passes force_download through kwargs, the explicit force_download=True on the recovery retry would raise a duplicate keyword error. Merge it into kwargs so the recovery value wins without the collision.

coderabbitai Bot reviewed May 31, 2026

View reviewed changes

fix: merge force_download into kwargs on the recovery retry

b0a2f5f

If a caller passes force_download through kwargs, the explicit force_download=True on the recovery retry would raise a duplicate keyword error. Merge it into kwargs so the recovery value wins without the collision.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: re-download model when cached files fail verification#642

fix: re-download model when cached files fail verification#642
Dmheath1 wants to merge 2 commits into
qdrant:mainfrom
Dmheath1:fix/redownload-on-verification-failure

Dmheath1 commented May 31, 2026

Uh oh!

coderabbitai Bot commented May 31, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 31, 2026 •

edited

Loading

Uh oh!

Dmheath1 May 31, 2026

Uh oh!

coderabbitai Bot May 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Dmheath1 commented May 31, 2026

What this fixes

Reproduce (on main)

The fix

Behavior

Uh oh!

coderabbitai Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dmheath1 May 31, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 31, 2026

Choose a reason for hiding this comment

fastembed/common/model_management.py

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 31, 2026 •

edited

Loading

coderabbitai Bot May 31, 2026 •

edited

Loading

`fastembed/common/model_management.py`