
Add MGP-STR (alibaba-damo/mgp-str-base) image-to-text task support #952

Draft
ssss141414 wants to merge 1 commit into main from shzhen/add-mgp-str-base


@ssss141414

Contributor

Summary

Adds Effort-L1-light registration so MGP-STR scene-text-recognition models resolve under the user-facing image-to-text task label. The vendor MgpstrOnnxConfig (Optimum) already exposes the 3-head outputs (char_logits, bpe_logits, wp_logits) correctly, but is registered ONLY under feature-extraction. This PR adds a task-label alias + MODEL_CLASS_MAPPING binding to MgpstrForSceneTextRecognition (the head-bearing class — MGP-STR is NOT a generic Vision2Seq).
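The alias idea in the summary can be sketched with a toy registry. This is only an illustration under stated assumptions: `REGISTRY`, `resolve`, and `register_alias` are hypothetical names, and the two config classes are local placeholders rather than the real Optimum or winml classes.

```python
# Toy stand-in for a task registry; NOT the Optimum or winml API.
from typing import Dict, Tuple, Type


class MgpstrOnnxConfig:
    """Placeholder for the vendor config that exposes the three heads."""

    outputs: Tuple[str, ...] = ("char_logits", "bpe_logits", "wp_logits")


class MgpstrImage2TextOnnxConfig(MgpstrOnnxConfig):
    """Placeholder subclass that reuses the vendor outputs unchanged."""


# Hypothetical layout: model_type -> task label -> config class.
REGISTRY: Dict[str, Dict[str, Type[MgpstrOnnxConfig]]] = {
    "mgp-str": {"feature-extraction": MgpstrOnnxConfig},
}


def resolve(model_type: str, task: str) -> Type[MgpstrOnnxConfig]:
    """Return the registered config class, or fail like the baseline build."""
    try:
        return REGISTRY[model_type][task]
    except KeyError:
        raise ValueError(
            f"{model_type} doesn't support task {task} for the onnx backend."
        ) from None


def register_alias(
    model_type: str, task: str, config: Type[MgpstrOnnxConfig]
) -> None:
    """Add a task-label entry without touching the existing ones."""
    REGISTRY.setdefault(model_type, {})[task] = config


# After this call, "image-to-text" resolves next to "feature-extraction".
register_alias("mgp-str", "image-to-text", MgpstrImage2TextOnnxConfig)
```

In this sketch the existing feature-extraction entry is left in place, so the alias only adds a second way to reach the same three-head outputs.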

Files changed (5)

  • src/winml/modelkit/models/hf/mgp_str.py (NEW, 58 lines) — MgpstrImage2TextOnnxConfig(MgpstrOnnxConfig) subclass
  • src/winml/modelkit/models/hf/__init__.py — 3-line wiring
  • examples/recipes/alibaba-damo_mgp-str-base/image-to-text_config.json (NEW, 49 lines) — recipe
  • examples/recipes/README.md — catalog row
  • research/adding-model-support/model_knowledge/mgp_str.json: mgp_str-004 post-mortem finding

Goal-ladder verdict

alibaba-damo/mgp-str-base @ image-to-text @ fp32 @ cpu

| Tier | Verdict | Evidence |
| --- | --- | --- |
| L0 build | PASS | 83.7s, 374 nodes, 564.5 MB optimized; autoconf converged in 2 iters |
| L1 perf | PASS | avg=100.76ms, P90=123.26ms, 9.92 samples/sec (20 iters CPU) |
| L2 numerical | PASS | cosine vs PT: char=0.99999999999992, bpe=0.99999999999974, wp=0.99999999999860; max-abs 5.7e-05 / 2.4e-04 / 2.1e-04 |
| L3 eval | CLI-BLOCKED | image-to-text task has no default dataset (same as vit-gpt2) |
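For readers unfamiliar with the two L2 metrics, here is a minimal sketch of how a cosine similarity and a max-abs difference can be computed for one flattened head. The helper name `cosine_and_max_abs` is hypothetical, and this is not necessarily how the tooling in this PR computes its numbers.

```python
import math
from typing import Sequence, Tuple


def cosine_and_max_abs(
    ref: Sequence[float], out: Sequence[float]
) -> Tuple[float, float]:
    """Compare a flattened reference tensor against an exported one.

    Returns (cosine similarity, maximum absolute element-wise difference).
    """
    if not ref or len(ref) != len(out):
        raise ValueError("inputs must be non-empty and the same length")

    dot = sum(a * b for a, b in zip(ref, out))
    norm = math.sqrt(sum(a * a for a in ref)) * math.sqrt(sum(b * b for b in out))
    # Cosine is undefined when either vector is all zeros.
    cosine = dot / norm if norm else float("nan")
    max_abs = max(abs(a - b) for a, b in zip(ref, out))
    return cosine, max_abs
```

Cosine similarity is insensitive to uniform scaling, which is why a max-abs figure is usually reported alongside it.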

Step 1b verification — real engineering vs catalog-only

  • Gate 1 (auto-config-diff): the recipe is identical to the output of winml config --task image-to-text (recipe is autoconf-faithful)
  • Gate 2 (baseline build on main): fails with "mgp-str doesn't support task image-to-text for the onnx backend.", so this is a real engineering delta, NOT catalog-only.
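A Gate 1 style check amounts to a structural diff of two JSON configs. The sketch below shows one way to do that on already-parsed dictionaries; `diff_configs` and the keys in the usage example are hypothetical and do not reflect the real recipe schema.

```python
from typing import Any, Dict, List


def diff_configs(
    a: Dict[str, Any], b: Dict[str, Any], prefix: str = ""
) -> List[str]:
    """Return dotted paths where two parsed JSON configs differ."""
    paths: List[str] = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}.{key}" if prefix else key
        if key not in a or key not in b:
            # Key present on one side only.
            paths.append(path)
        elif isinstance(a[key], dict) and isinstance(b[key], dict):
            # Recurse into nested objects.
            paths.extend(diff_configs(a[key], b[key], path))
        elif a[key] != b[key]:
            paths.append(path)
    return paths


# Hypothetical configs; an empty result means the recipe matches autoconf.
recipe = {"task": "image-to-text", "target": {"precision": "fp32", "device": "cpu"}}
autoconf = {"task": "image-to-text", "target": {"precision": "fp32", "device": "cpu"}}
differences = diff_configs(recipe, autoconf)
```

An empty list corresponds to the "identical" verdict; any returned path names a field where the recipe departs from the auto-generated config.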

Known gotchas

  • HF model card declares the legacy architectures: ['MGPSTRModel'], but current transformers exports MgpstrModel (CamelCase rename). Without an explicit --task image-to-text, winml inspect/config/build fail with "Cannot import MGPSTRModel from transformers". This is a CLI robustness gap, separate from this PR.
  • 3 Einsum ops in a3_module heads are non-fatal on CPU.
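To make the first gotcha concrete, the sketch below shows how a case-insensitive lookup would map the legacy architecture name onto the current class name. This is purely illustrative: `resolve_architecture` is a hypothetical helper, and nothing in this PR adds such a lookup to winml.

```python
from typing import Iterable, Optional


def resolve_architecture(declared: str, available: Iterable[str]) -> Optional[str]:
    """Match a declared architecture name against exported class names.

    The comparison ignores case, so a legacy spelling such as 'MGPSTRModel'
    resolves to 'MgpstrModel'. Returns None when nothing matches.
    """
    by_folded = {name.casefold(): name for name in available}
    return by_folded.get(declared.casefold())


exported = ["MgpstrModel", "MgpstrForSceneTextRecognition"]
resolved = resolve_architecture("MGPSTRModel", exported)
```

An exact-match import, by contrast, fails on the legacy spelling, which is the behavior described in the gotcha.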

Verification

uv run winml build -c examples/recipes/alibaba-damo_mgp-str-base/image-to-text_config.json -m alibaba-damo/mgp-str-base -o temp/mgp_build --ep cpu --device cpu --rebuild
uv run winml perf -m temp/mgp_build/model.onnx --ep cpu --device cpu --iterations 20
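The L1 figures (avg, P90, samples/sec) can be derived from per-iteration latencies roughly as follows. This is a sketch under assumptions: `summarize_latencies` is a hypothetical helper, the percentile uses the nearest-rank method, and throughput assumes one sample per iteration; the actual winml perf implementation may differ.

```python
from typing import Dict, Sequence


def summarize_latencies(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize per-iteration latencies in milliseconds."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")

    ordered = sorted(latencies_ms)
    avg = sum(ordered) / len(ordered)
    # Nearest-rank P90: ceil(0.9 * n), computed with integer arithmetic.
    rank = -(-9 * len(ordered) // 10)
    p90 = ordered[rank - 1]
    # Assumes batch size 1, so throughput is the inverse of mean latency.
    return {"avg_ms": avg, "p90_ms": p90, "samples_per_sec": 1000.0 / avg}


# Made-up latencies for 20 iterations, not measurements from this PR.
stats = summarize_latencies([100.0] * 18 + [120.0, 140.0])
```

Under the batch-size-1 assumption, the reported average of 100.76 ms corresponds to about 1000 / 100.76 ≈ 9.92 samples/sec, which agrees with the table.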

Adds Effort-L1-light registration so MGP-STR scene-text-recognition models
resolve under the user-facing 'image-to-text' task label. The vendor
MgpstrOnnxConfig (Optimum) already exposes the 3-head outputs (char_logits,
bpe_logits, wp_logits) correctly but is registered only under feature-extraction.
This PR adds a task-label alias plus MODEL_CLASS_MAPPING binding to
MgpstrForSceneTextRecognition.

Files:
- src/winml/modelkit/models/hf/mgp_str.py: MgpstrImage2TextOnnxConfig subclass (58 lines)
- src/winml/modelkit/models/hf/__init__.py: 3-line wiring
- examples/recipes/alibaba-damo_mgp-str-base/image-to-text_config.json: recipe (49 lines)
- examples/recipes/README.md: catalog row
- research/adding-model-support/model_knowledge/mgp_str.json: mgp_str-004 finding

Goal-ladder (alibaba-damo/mgp-str-base @ image-to-text @ fp32 @ cpu):
- L0 PASS: build 83.7s, 374 nodes, 564.5 MB optimized
- L1 PASS: avg=100.76ms, P90=123.26ms, 9.92 samples/sec (20 iters)
- L2 PASS: cosine vs PyTorch reference all 3 heads >=0.999999 (max-abs <3e-4)
- L3 CLI-BLOCKED: image-to-text task has no default dataset (same as
  nlpconnect/vit-gpt2-image-captioning per known limitation)

Step 1b verification: baseline 'winml build' on main fails with
'mgp-str doesn't support task image-to-text' (real engineering delta, not
catalog-only).