Add ViLT (dandelin/vilt-b32-finetuned-vqa) visual-question-answering support #951
Draft
ssss141414 wants to merge 1 commit into
…support

Adds OnnxConfig + ModelPatcher for ViLT visual-question-answering, since vendor optimum coverage is absent and the stock ViltEmbeddings.visual_embed is not ONNX-traceable (Python iteration over tensor shapes, torch.multinomial, per-row nonzero loops). The patcher swaps in a static-shape replacement using nn.functional.interpolate for spatial position embeddings and a synthesized all-ones token mask. H/W axes are pinned static; pixel_mask is intentionally dropped since the patched path does not reference it.

Validated on dandelin/vilt-b32-finetuned-vqa @ CPU fp32:
- L0 build: 62.9s, 449.2 MB optimized ONNX
- L1 perf: p50=65.83ms, throughput=14.82 samples/sec (20 iters, warmup 3)
- L2 numerics: cos=1.000000, max_abs_diff=4.2e-5, top-class match (3129-way head)
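As a rough illustration of the static-shape idea described above, the sketch below resizes a square patch position table with `nn.functional.interpolate` and pairs the resulting tokens with an all-ones mask. It is not the code in `vilt.py`: the helper names `resize_pos_embed` and `static_visual_embed`, the toy tensor sizes, and the omission of any class-token handling are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed: torch.Tensor, grid: int, size: tuple[int, int]) -> torch.Tensor:
    """Resize a (1, grid*grid, dim) position table to (1, H*W, dim).

    Hypothetical helper: the shapes and name are illustrative only.
    """
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, grid, grid) so interpolate sees a 2-D feature map.
    spatial_pos = pos_embed.transpose(1, 2).reshape(1, dim, grid, grid)
    # `size` is a plain Python tuple, so no tensor-dependent control flow is traced.
    spatial_pos = F.interpolate(spatial_pos, size=size, mode="bilinear", align_corners=True)
    # (1, dim, H, W) -> (1, H*W, dim)
    return spatial_pos.flatten(2).transpose(1, 2)


def static_visual_embed(patches: torch.Tensor, pos_embed: torch.Tensor, grid: int, size: tuple[int, int]):
    """Add resized positions to patch tokens and return an all-ones token mask."""
    batch, num_tokens, _ = patches.shape
    tokens = patches + resize_pos_embed(pos_embed, grid, size)
    # Every token is treated as valid, so the mask never depends on the input data.
    mask = torch.ones(batch, num_tokens, dtype=torch.long)
    return tokens, mask


if __name__ == "__main__":
    pos = torch.randn(1, 16, 8)          # toy 4x4 grid, 8-dim embeddings
    patches = torch.randn(2, 36, 8)      # toy batch of 6x6 patch tokens
    tokens, mask = static_visual_embed(patches, pos, grid=4, size=(6, 6))
    print(tokens.shape, mask.shape)
```

Because the target size is fixed at export time, the traced graph contains a single resize op rather than per-sample Python loops, which is the property the patcher relies on.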
Summary
Adds first-class support for ViLT under the `visual-question-answering` task, validated on `dandelin/vilt-b32-finetuned-vqa`.

ViLT has no vendor optimum coverage, and its stock `ViltEmbeddings.visual_embed` is fundamentally not ONNX-traceable (Python iteration over tensor shapes, `torch.multinomial`, per-row `nonzero()` loops). Eager works because the loops resolve concretely; tracing fails. This PR therefore ships:

- `ViltVqaOnnxConfig(OnnxConfig)`, registered via `@register_onnx_overwrite("vilt", "visual-question-answering")`.
- `_ViltVisualEmbedPatcher(ModelPatcher)`, which swaps `visual_embed` for a static-shape replacement using `nn.functional.interpolate(spatial_pos, size=(H, W), mode='bilinear', align_corners=True)` and a synthesized all-ones token mask.
- Static H/W axes on `pixel_values`; `pixel_mask` is intentionally omitted from the export signature since the patched path doesn't read it (leaving it in would create a dead graph input).
- The corresponding update to `models/hf/__init__.py`.

Files changed

- `src/winml/modelkit/models/hf/vilt.py`
- `src/winml/modelkit/models/hf/__init__.py`
- `examples/recipes/dandelin_vilt-b32-finetuned-vqa/visual-question-answering_config.json`
- `examples/recipes/README.md`

Validation (dandelin/vilt-b32-finetuned-vqa @ CPU fp32)
Notes for reviewers
- Inputs: `input_ids`, `attention_mask`, `token_type_ids` (dynamic `batch_size` / `sequence_length`), `pixel_values` (only `batch_size` dynamic, H/W=384 static).
- Output: `logits` (3129-way), `batch_size` dynamic.
- `value_range` for mask-of-ones inputs must be `[1, 2]`, not `[0, 1]`, because the `high` bound of `randint` is exclusive; relevant if anyone re-derives this recipe.
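The `value_range` point can be checked with a small sketch. It assumes the recipe's dummy-input generator behaves like `torch.randint`, which samples from the half-open interval `[low, high)`; the helper `dummy_mask` is hypothetical and is not part of the recipe tooling.

```python
import torch


def dummy_mask(shape: tuple[int, ...], value_range: tuple[int, int]) -> torch.Tensor:
    """Hypothetical stand-in for a dummy-input generator driven by value_range."""
    low, high = value_range
    # torch.randint draws integers from [low, high): `high` itself is never returned.
    return torch.randint(low, high, shape)


if __name__ == "__main__":
    ones = dummy_mask((2, 8), (1, 2))    # only the value 1 can be drawn
    zeros = dummy_mask((2, 8), (0, 1))   # only the value 0 can be drawn
    print(ones.unique().tolist(), zeros.unique().tolist())
```

Under that assumption, a `[0, 1]` range would produce an all-zeros mask, which would mark every token as padding instead of valid.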