fix: re-download model when cached files fail verification #642
Dmheath1 wants to merge 2 commits
An interrupted or externally truncated download can leave a corrupt file in the Hugging Face cache. On the next load the offline-first probe sees the file on disk and returns it, so the model fails with INVALID_PROTOBUF and the only fix is to delete the cache by hand. This makes the cache self-heal. The probe now raises when a model file's size does not match the recorded metadata, so download_model falls through to an online retry with force_download=True (huggingface_hub's cache check is existence-only and will not re-fetch a present-but-truncated blob). Verification also walks the repo tree recursively and matches files by their repo-relative path. Without that, weights kept in a subdirectory (onnx/model.onnx, more than half the models) were never recorded in the metadata, so neither the new check nor the existing one ever looked at them. A mismatch on an auxiliary config file is still left to best-effort loading, as before.
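The size check described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `find_size_mismatches` and the `recorded_sizes` mapping (repo-relative path to expected byte size) are hypothetical names, and the real metadata layout in fastembed may differ.

```python
from pathlib import Path


def find_size_mismatches(snapshot_dir: Path, recorded_sizes: dict[str, int]) -> list[str]:
    """Return repo-relative paths whose on-disk size differs from the recorded size.

    `recorded_sizes` is a hypothetical mapping such as {"onnx/model.onnx": 1234};
    the metadata format used by fastembed itself is an assumption here.
    """
    mismatched = []
    for rel_path, expected_size in recorded_sizes.items():
        file_path = snapshot_dir / rel_path
        # A missing file is treated like a truncated one: both fail verification.
        if not file_path.is_file() or file_path.stat().st_size != expected_size:
            mismatched.append(rel_path)
    return mismatched
```

Keying the mapping by repo-relative path (rather than bare file name) is what lets a file such as onnx/model.onnx be checked at all; a caller could raise when the returned list contains a model file and ignore auxiliary config files.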
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@fastembed/common/model_management.py`:
- Around line 475-479: The call currently passes force_download=True and also
**kwargs which can contain force_download, causing a TypeError; update the
download_model call site (in fastembed.common.model_management, function
download_model) to ensure force_download is set in the kwargs map before
expanding it (e.g., remove any existing force_download key and then set
kwargs['force_download']=True or overwrite it) so only one value is supplied to
the underlying function.
- Around line 262-268: The call to list_repo_tree(...) incorrectly uses
recursive=True; remove that unsupported kwarg and instead call
list_repo_tree(hf_source_repo, revision=repo_revision, repo_type="model",
path=...) or use expand=True (depending on API) to retrieve directory entries,
then perform client-side recursion/filtering to collect nested files into
repo_tree; update the code around repo_tree and the filtering that uses
allowed_extensions to walk subdirectories (by repeatedly calling list_repo_tree
with sub-paths or expanding entries) so that nested model files are discovered
without passing recursive=True to huggingface_hub.list_repo_tree.
📒 Files selected for processing (2)
- fastembed/common/model_management.py
- tests/test_model_management.py
      repo_tree = list(
          list_repo_tree(
              hf_source_repo, revision=repo_revision, repo_type="model", recursive=True
          )
      )

    - allowed_extensions = {".json", ".onnx", ".txt"}
    + allowed_extensions = {".json", ".onnx", ".txt", ".onnx_data", ".npy", ".vocab"}
🧩 Analysis chain
🌐 Web query:
huggingface_hub list_repo_tree recursive parameter
💡 Result:
In huggingface_hub, HfApi.list_repo_tree() does not provide a recursive parameter. Instead, you control what is listed via its other arguments (notably path/subdirectory scope) and you can get expanded metadata depending on the expand flag. Hugging Face staff also note that listing files in a repo is costly and that “it is not possible to filter them server-side when listing files from a repo… best solution is still to do the filtering client-side”; however, you can limit the request by only listing files from a subdirectory [1]. Practical implication for your question: if you want a recursive listing, you should treat list_repo_tree() as the API meant for listing repo trees, and use the supported arguments to restrict scope (e.g., target a subdirectory path). Don’t look for a recursive parameter; it isn’t part of the documented method signature [1]. References: Hugging Face issue discussion explaining server-side filtering limits for list_repo_tree [1]; and the official HfApi documentation entry point (for method signatures and related API usage) [2].
Citations:
- 1: Support server-side filtering for list_repo_tree huggingface/huggingface_hub#2676
- 2: https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api
Fix list_repo_tree call: remove unsupported recursive=True.
fastembed/common/model_management.py (lines 262-268) passes recursive=True to huggingface_hub.list_repo_tree(), but HfApi.list_repo_tree() has no recursive parameter; this will raise a TypeError at runtime. Rework the listing to use supported arguments (e.g., subdirectory scoping via path and/or expand, then filter client-side) to collect nested model files.
list_repo_tree does take recursive. It is in the signature (recursive: bool = False) and has been since the method landed in huggingface_hub 0.23.0, which is also the floor this function already needs, since it calls list_repo_tree regardless of this PR. So there is no new version constraint here. The linked #2676 is about server-side filtering on large repos, not whether the parameter exists. Verified on 1.17.0: the call returns the nested files (onnx/model.onnx plus the quantized variants) without raising, so I have left recursive=True as is.
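For reference, a small sketch of the recursive listing plus client-side extension filter. The helper names `filter_model_paths` and `list_model_paths` are made up for illustration; the `list_repo_tree(..., recursive=True)` call and the `RepoFile` type are the ones this thread discusses, and the network call is kept in its own function so the filter can be exercised offline.

```python
from pathlib import PurePosixPath
from typing import Iterable

ALLOWED_EXTENSIONS = {".json", ".onnx", ".txt", ".onnx_data", ".npy", ".vocab"}


def filter_model_paths(repo_paths: Iterable[str]) -> list[str]:
    """Keep repo-relative paths whose suffix is in ALLOWED_EXTENSIONS."""
    return [p for p in repo_paths if PurePosixPath(p).suffix in ALLOWED_EXTENSIONS]


def list_model_paths(repo_id: str, revision: str) -> list[str]:
    """List model files in all subdirectories (needs network and huggingface_hub)."""
    # Imported lazily so the pure filter above works without the dependency.
    from huggingface_hub import list_repo_tree
    from huggingface_hub.hf_api import RepoFile

    entries = list_repo_tree(repo_id, revision=revision, repo_type="model", recursive=True)
    # The tree contains folders as well as files; only files carry weights.
    return filter_model_paths(e.path for e in entries if isinstance(e, RepoFile))
```

The filter keeps the full repo-relative path, so a nested file such as onnx/model.onnx stays distinguishable from a root-level model.onnx.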
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Resolve the coderabbit review comment on the PR
gh api graphql -f query='
{
  repository(owner: "qdrant", name: "fastembed") {
    pullRequest(number: 642) {
      reviewThreads(first: 50) {
        nodes {
          id
          isResolved
          comments(first: 5) {
            nodes {
              author { login }
              body
              url
            }
          }
        }
      }
    }
  }
}'

Length of output: 8970
🏁 Script executed:
#!/bin/bash
# Resolve the incorrect review thread
gh api graphql -f query='
mutation {
  resolveReviewThread(input: { threadId: "PRRT_kwDOJ7ZVjc6F90mS" }) {
    thread {
      id
      isResolved
    }
  }
}'

Length of output: 717
@Dmheath1, thanks for the thorough clarification — my initial flag was wrong and I've noted the correction. Unfortunately I don't have permission to resolve the thread programmatically in this repo, so you'd need to resolve it manually. The recursive=True usage is correct and the review comment can be dismissed.
(ᵔ ᵕ ᵔ)🐇
🧬 Code Graph Analysis Results
fastembed/common/model_management.py
Lines 11-12 (imports list_repo_tree):

    from huggingface_hub import snapshot_download, model_info, list_repo_tree
    from huggingface_hub.hf_api import RepoFile

Lines 261-277 (calls list_repo_tree(..., recursive=True) and builds repo_files from the returned tree):

    repo_revision = model_info(hf_source_repo).sha
    repo_tree = list(
        list_repo_tree(
            hf_source_repo, revision=repo_revision, repo_type="model", recursive=True
        )
    )
    allowed_extensions = {".json", ".onnx", ".txt", ".onnx_data", ".npy", ".vocab"}
    repo_files = (
        [
            f
            for f in repo_tree
            if isinstance(f, RepoFile) and Path(f.path).suffix in allowed_extensions
        ]
        if repo_tree
        else []
    )

✏️ Learnings added
Learnt from: Dmheath1
Repo: qdrant/fastembed PR: 642
File: fastembed/common/model_management.py:262-268
Timestamp: 2026-05-31T19:37:43.899Z
Learning: `huggingface_hub.list_repo_tree()` does support a `recursive: bool = False` parameter — it has been part of the method signature since huggingface_hub 0.23.0 and works correctly on 1.17.0. Passing `recursive=True` returns all files in nested subdirectories (e.g., `onnx/model.onnx`) without raising a TypeError. GitHub issue huggingface/huggingface_hub#2676 is about server-side filtering performance on large repos, not the absence of the `recursive` parameter. Do not flag `recursive=True` in `list_repo_tree` calls as invalid.
If a caller passes force_download through kwargs, the explicit force_download=True on the recovery retry would raise a duplicate keyword error. Merge it into kwargs so the recovery value wins without the collision.
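A minimal sketch of that merge, assuming a generic `download` callable that accepts `force_download` as a keyword. `retry_with_forced_download` is a hypothetical name used only for illustration, not the function in the PR.

```python
from typing import Any, Callable


def retry_with_forced_download(download: Callable[..., str], repo_id: str, **kwargs: Any) -> str:
    """Call `download` with force_download=True, even if the caller supplied it.

    Writing download(repo_id, force_download=True, **kwargs) would raise
    TypeError when kwargs already holds "force_download". Building a new
    dict lets the recovery value overwrite the caller's value instead.
    """
    retry_kwargs = {**kwargs, "force_download": True}
    return download(repo_id, **retry_kwargs)
```

Copying into a new dict rather than mutating `kwargs` also leaves the caller's arguments untouched for any later use.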
What this fixes
If a model download is interrupted, or the cache is truncated later (a partial copy into a Docker image, a CI cache restore, a disk issue), a short or empty file is left in the HF cache. The next time you load that model, the offline-first probe sees the file on disk and returns its path. onnxruntime then throws INVALID_PROTOBUF / ModelProto does not have a graph, and the only way out today is to delete the cache by hand.

Reproduce (on main)
The fix
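- The offline-first probe now raises when a model file's size does not match the recorded metadata. `download_model` already wraps that probe in a try/except, so the raise just routes to the existing online retry.
- The retry passes `force_download=True`. huggingface_hub's cache check is existence-only, so a plain `snapshot_download` keeps a present-but-truncated blob; `force_download` is the documented way to re-fetch it (the same idiom transformers and sentence-transformers use).
- Verification walks the repo tree recursively and matches files by their repo-relative path. Without that, weights kept in a subdirectory (`onnx/model.onnx`, more than half the supported models) were never recorded in the metadata, so neither the new self-heal nor the existing download-integrity check ever looked at them.

Behavior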
Tested locally against root-level (bge), subdirectory (snowflake arctic, jina reranker) and external-weight (e5, minicoil) layouts, plus the existing suite.
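The control flow described under "The fix" can be sketched roughly as below. The two callables are stand-ins (assumptions, not fastembed's real functions): `probe_offline` represents the offline-first cache probe and `download_online` represents the online retry.

```python
from typing import Callable


def load_with_self_heal(
    probe_offline: Callable[[], str],
    download_online: Callable[..., str],
) -> str:
    """Try the local cache first; if the probe raises, re-fetch from the hub."""
    try:
        # Hypothetical probe: returns a model path, or raises when a cached
        # model file's size does not match the recorded metadata.
        return probe_offline()
    except Exception:
        # An existence-only cache check would keep the truncated blob,
        # so the retry explicitly asks for a forced re-download.
        return download_online(force_download=True)
```

With this shape a healthy cache never touches the network, while a corrupt one costs a single re-download instead of a manual cache wipe.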