[model] support paddle_ocr v1.6#9464
Conversation
There was a problem hiding this comment.
Code Review
This pull request unifies the PaddleOCR-VL models (v1.0, v1.5, and v1.6) under a single model type paddleocr_vl and model architecture paddleocr_vl, removing the redundant paddle_ocr registration. Feedback points out that since these models are now unified, the registered model architecture keys for paddleocr_vl must be updated to support both v1.0 and v1.5/1.6 architectures, which use different paths for the language model, aligner, and vision tower.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the integration of PaddleOCR-VL models, renaming the model type and architecture from paddle_ocr_1_5 to paddleocr_vl, adding support for PaddleOCR-VL-1.6, and introducing version-specific template configurations and transformer version requirements. It also refactors ModelLoader to support a configurable default_trust_remote_code attribute. However, two critical issues were identified: the template argument was incorrectly passed to ModelGroup instead of ModelMeta in swift/model/models/baidu.py, and use_model = True was accidentally removed from PaddleOCR1_5Template in swift/template/templates/baidu.py, which is required for its _post_encode method.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the integration of PaddleOCR-VL models. Specifically, it adds support for PaddleOCR-VL-1.6, updates model types and architectures (renaming paddle_ocr_1_5 to paddleocr_vl), and refactors the model loading and template processing logic. Key changes include adding version-specific transformer requirements, dynamically managing trust_remote_code settings in ModelLoader, and updating PaddleOCRTemplate to support custom image processor arguments and padding-free training. Feedback suggests dynamically retrieving the image_token_id in PaddleOCRTemplate rather than relying on a hardcoded value to prevent potential issues with customized tokenizers.
| labels = encoded['labels'] | ||
| loss_scale = encoded.get('loss_scale', None) | ||
| idx_list = findall(input_ids, -100) | ||
| idx_list = findall(input_ids, self.image_token_id) |
There was a problem hiding this comment.
Hardcoding the image_token_id to 100295 can be fragile if the tokenizer vocabulary changes or if a customized tokenizer is used. It is safer to dynamically retrieve the token ID from the processor or tokenizer, and only fall back to the hardcoded value if it cannot be found.
image_token_id = getattr(self.processor, 'image_token_id', None) or self.tokenizer.convert_tokens_to_ids(self.image_token)
if image_token_id is None or image_token_id == getattr(self.tokenizer, 'unk_token_id', None):
image_token_id = self.image_token_id
idx_list = findall(input_ids, image_token_id)
No description provided.