Skip to content

fix: re-download model when cached files fail verification#642

Open
Dmheath1 wants to merge 2 commits into
qdrant:mainfrom
Dmheath1:fix/redownload-on-verification-failure
Open

fix: re-download model when cached files fail verification#642
Dmheath1 wants to merge 2 commits into
qdrant:mainfrom
Dmheath1:fix/redownload-on-verification-failure

Conversation

@Dmheath1
Copy link
Copy Markdown

What this fixes

If a model download is interrupted, or the cache is truncated later (a partial copy into a Docker image, a CI cache restore, a disk issue), fastembed leaves a short or empty file in the HF cache. The next time you load that model, the offline-first probe sees the file on disk and returns its path. onnxruntime then throws INVALID_PROTOBUF / ModelProto does not have a graph, and the only way out today is to delete the cache by hand.

Reproduce (on main)

import glob, os
from fastembed import TextEmbedding

m = TextEmbedding("BAAI/bge-small-en-v1.5")
list(m.embed(["hi"]))                        # download + cache

blob = sorted({os.path.realpath(p) for p in
               glob.glob(os.path.join(m.model._model_dir, "**/*.onnx"), recursive=True)})[0]
open(blob, "r+b").truncate(1024)             # truncate the cached weight

m = TextEmbedding("BAAI/bge-small-en-v1.5")  # network is up
list(m.embed(["hi"]))                        # main: INVALID_PROTOBUF, blob never re-fetched

The fix

  • The offline cache probe raises when a model file's size does not match the recorded metadata, instead of warning and returning the corrupt path. download_model already wraps that probe in a try/except, so the raise just routes to the existing online retry.
  • That retry now passes force_download=True. huggingface_hub's cache check is existence-only, so a plain snapshot_download keeps a present-but-truncated blob; force_download is the documented way to re-fetch it (the same idiom transformers and sentence-transformers use).
  • Verification walks the repo tree recursively and matches files by their repo-relative path rather than by bare filename. Without this, weights kept in a subdirectory (onnx/model.onnx, more than half the supported models) were never recorded in the metadata, so neither the new self-heal nor the existing download-integrity check ever looked at them.
  • A size mismatch on a non-model file (a config) is still left to best-effort loading, so an offline user with a harmless config difference is not blocked.

Behavior

  • Corrupt model cache, network up: re-downloads and loads.
  • Valid cache: unchanged, no extra downloads.
  • Corrupt cache, offline: a clear "Could not load model" error instead of a raw protobuf failure, and no network attempt.

Tested locally against root-level (bge), subdirectory (snowflake arctic, jina reranker) and external-weight (e5, minicoil) layouts, plus the existing suite.

An interrupted or externally truncated download can leave a corrupt
file in the Hugging Face cache. On the next load the offline-first
probe sees the file on disk and returns it, so the model fails with
INVALID_PROTOBUF and the only fix is to delete the cache by hand.

This makes the cache self-heal. The probe now raises when a model
file's size does not match the recorded metadata, so download_model
falls through to an online retry with force_download=True
(huggingface_hub's cache check is existence-only and will not
re-fetch a present-but-truncated blob).

Verification also walks the repo tree recursively and matches files
by their repo-relative path. Without that, weights kept in a
subdirectory (onnx/model.onnx, more than half the models) were never
recorded in the metadata, so neither the new check nor the existing
one ever looked at them. A mismatch on an auxiliary config file is
still left to best-effort loading, as before.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 31, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d483cdce-7e36-4a36-b50f-fe8b39fb509b

📥 Commits

Reviewing files that changed from the base of the PR and between f7b82d3 and b0a2f5f.

📒 Files selected for processing (1)
  • fastembed/common/model_management.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • fastembed/common/model_management.py

📝 Walkthrough

Walkthrough

This PR strengthens offline model caching in download_files_from_huggingface by normalizing repository-relative paths for consistent file matching, implementing stricter validation that raises errors when model files are corrupt or truncated in the cache, and expanding file type discovery. When offline validation fails due to corrupt primary model files, the download process retries with force_download=True to bypass the stale cache. The test suite verifies that corrupt model files trigger errors while auxiliary file mismatches are tolerated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • qdrant/fastembed#624: Addresses related offline cache handling and model file validation to prevent accepting truncated caches in local_files_only mode.

Suggested reviewers

  • tbung
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and concisely describes the primary fix: re-downloading models when cached files fail verification, which is the core objective of this PR.
Description check ✅ Passed The description thoroughly explains what the PR fixes, provides reproduction steps, documents the implementation approach, and describes the resulting behavior, all clearly related to the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fastembed/common/model_management.py`:
- Around line 475-479: The call currently passes force_download=True and also
**kwargs which can contain force_download, causing a TypeError; update the
download_model call site (in fastembed.common.model_management, function
download_model) to ensure force_download is set in the kwargs map before
expanding it (e.g., remove any existing force_download key and then set
kwargs['force_download']=True or overwrite it) so only one value is supplied to
the underlying function.
- Around line 262-268: The call to list_repo_tree(...) incorrectly uses
recursive=True; remove that unsupported kwarg and instead call
list_repo_tree(hf_source_repo, revision=repo_revision, repo_type="model",
path=...) or use expand=True (depending on API) to retrieve directory entries,
then perform client-side recursion/filtering to collect nested files into
repo_tree; update the code around repo_tree and the filtering that uses
allowed_extensions to walk subdirectories (by repeatedly calling list_repo_tree
with sub-paths or expanding entries) so that nested model files are discovered
without passing recursive=True to huggingface_hub.list_repo_tree.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 04ae48e3-578d-4d6a-aa44-89abf81e7057

📥 Commits

Reviewing files that changed from the base of the PR and between 8a8ea4f and f7b82d3.

📒 Files selected for processing (2)
  • fastembed/common/model_management.py
  • tests/test_model_management.py

Comment on lines +262 to +268
repo_tree = list(
list_repo_tree(
hf_source_repo, revision=repo_revision, repo_type="model", recursive=True
)
)

allowed_extensions = {".json", ".onnx", ".txt"}
allowed_extensions = {".json", ".onnx", ".txt", ".onnx_data", ".npy", ".vocab"}
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

huggingface_hub list_repo_tree recursive parameter

💡 Result:

In huggingface_hub, HfApi.list_repo_tree() does not provide a recursive parameter. Instead, you control what is listed via its other arguments (notably path/subdirectory scope) and you can get expanded metadata depending on the expand flag. Hugging Face staff also note that listing files in a repo is costly and that “it is not possible to filter them server-side when listing files from a repo… best solution is still to do the filtering client-side”; however, you can limit the request by only listing files from a subdirectory [1]. Practical implication for your question: if you want a recursive listing, you should treat list_repo_tree() as the API meant for listing repo trees, and use the supported arguments to restrict scope (e.g., target a subdirectory path). Don’t look for a recursive parameter; it isn’t part of the documented method signature [1]. References: Hugging Face issue discussion explaining server-side filtering limits for list_repo_tree [1]; and the official HfApi documentation entry point (for method signatures and related API usage) [2].

Citations:


Fix list_repo_tree call: remove unsupported recursive=True.

fastembed/common/model_management.py (lines 262-268) passes recursive=True to huggingface_hub.list_repo_tree(), but HfApi.list_repo_tree() has no recursive parameter; this will raise a TypeError at runtime. Rework the listing to use supported arguments (e.g., subdirectory scoping via path and/or expand, then filter client-side) to collect nested model files.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fastembed/common/model_management.py` around lines 262 - 268, The call to
list_repo_tree(...) incorrectly uses recursive=True; remove that unsupported
kwarg and instead call list_repo_tree(hf_source_repo, revision=repo_revision,
repo_type="model", path=...) or use expand=True (depending on API) to retrieve
directory entries, then perform client-side recursion/filtering to collect
nested files into repo_tree; update the code around repo_tree and the filtering
that uses allowed_extensions to walk subdirectories (by repeatedly calling
list_repo_tree with sub-paths or expanding entries) so that nested model files
are discovered without passing recursive=True to huggingface_hub.list_repo_tree.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list_repo_tree does take recursive. It is in the signature (recursive: bool = False) and has been since the method landed in huggingface_hub 0.23.0, which is also the floor this function already needs, since it calls list_repo_tree regardless of this PR. So there is no new version constraint here. The linked #2676 is about server-side filtering on large repos, not whether the parameter exists. Verified on 1.17.0: the call returns the nested files (onnx/model.onnx plus the quantized variants) without raising, so I have left recursive=True as is.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Resolve the coderabbit review comment on the PR
gh api graphql -f query='
{
  repository(owner: "qdrant", name: "fastembed") {
    pullRequest(number: 642) {
      reviewThreads(first: 50) {
        nodes {
          id
          isResolved
          comments(first: 5) {
            nodes {
              author { login }
              body
              url
            }
          }
        }
      }
    }
  }
}'

Length of output: 8970


🏁 Script executed:

#!/bin/bash
# Resolve the incorrect review thread
gh api graphql -f query='
mutation {
  resolveReviewThread(input: { threadId: "PRRT_kwDOJ7ZVjc6F90mS" }) {
    thread {
      id
      isResolved
    }
  }
}'

Length of output: 717



@Dmheath1, thanks for the thorough clarification — my initial flag was wrong and I've noted the correction. Unfortunately I don't have permission to resolve the thread programmatically in this repo, so you'd need to resolve it manually. The recursive=True usage is correct and the review comment can be dismissed.

(ᵔ ᵕ ᵔ)🐇


🧬 Code Graph Analysis Results

fastembed/common/model_management.py

Lines 11-12 (imports list_repo_tree):

from huggingface_hub import snapshot_download, model_info, list_repo_tree
from huggingface_hub.hf_api import RepoFile

Lines 261-277 (calls list_repo_tree(..., recursive=True) and builds repo_files from the returned tree):

repo_revision = model_info(hf_source_repo).sha
repo_tree = list(
    list_repo_tree(
        hf_source_repo, revision=repo_revision, repo_type="model", recursive=True
    )
)

allowed_extensions = {".json", ".onnx", ".txt", ".onnx_data", ".npy", ".vocab"}
repo_files = (
    [
        f
        for f in repo_tree
        if isinstance(f, RepoFile) and Path(f.path).suffix in allowed_extensions
    ]
    if repo_tree
    else []
)

✏️ Learnings added
Learnt from: Dmheath1
Repo: qdrant/fastembed PR: 642
File: fastembed/common/model_management.py:262-268
Timestamp: 2026-05-31T19:37:43.899Z
Learning: `huggingface_hub.list_repo_tree()` does support a `recursive: bool = False` parameter — it has been part of the method signature since huggingface_hub 0.23.0 and works correctly on 1.17.0. Passing `recursive=True` returns all files in nested subdirectories (e.g., `onnx/model.onnx`) without raising a TypeError. GitHub issue huggingface/huggingface_hub#2676 is about server-side filtering performance on large repos, not the absence of the `recursive` parameter. Do not flag `recursive=True` in `list_repo_tree` calls as invalid.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment thread fastembed/common/model_management.py Outdated
If a caller passes force_download through kwargs, the explicit
force_download=True on the recovery retry would raise a duplicate
keyword error. Merge it into kwargs so the recovery value wins
without the collision.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant