Conversation

@sywangyi
Contributor

What does this PR do?

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
@sywangyi
Contributor Author

sywangyi commented Dec 22, 2025

from transformers import AutoProcessor

model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id)
text = "Describe the images"
inputs = processor.tokenizer(text)
print(f"Input text: '{text}'")
print(f"Token IDs: {inputs['input_ids']}")
decoded_text = processor.tokenizer.decode(inputs["input_ids"])
print(f"Decoded text: '{decoded_text}'")

output "DescribeĠtheĠimages", expected output "Describe the images"(in transformers 4.57.3)

@sywangyi
Contributor Author

Also, this fixes the failure in the llava test.

@sywangyi
Contributor Author

sywangyi commented Dec 22, 2025

The Mistral3ProcessorTest case fails because of the tokenizer fix.

from transformers import AutoProcessor

model_id = "hf-internal-testing/Mistral-Small-3.1-24B-Instruct-2503-only-processor"
processor = AutoProcessor.from_pretrained(model_id)
text = "Describe the images"
inputs = processor.tokenizer(text)
print(f"Input text: '{text}'")
print(f"Token IDs: {inputs['input_ids']}")
decoded_text = processor.tokenizer.decode(inputs["input_ids"])
print(f"Decoded text: '{decoded_text}'")
Before the fix:
Input text: 'Describe the images'
Token IDs: [1, 5847, 13089, 1278, 8061]
Decoded text: '<s>DescribeĠtheĠimages'

After the fix:
Input text: 'Describe the images'
Token IDs: [1, 5847, 1972, 22326, 1268, 8926]
Decoded text: '<s>Describetheimages'

However, in 4.57.3 the output is
Decoded text: '<s>Describe the images'
which is what is expected. I did some investigation; the pre_tokenizer seems to be incorrect for "hf-internal-testing/Mistral-Small-3.1-24B-Instruct-2503-only-processor" in 5.0.0. It should be

Sequence(pretokenizers=[Split(pattern=Regex("(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\n]+|\s+(?!\S)|\s+"), behavior=Isolated, invert=False), ByteLevel(add_prefix_space=False, trim_offsets=True, use_regex=False)])

instead of

Metaspace(replacement="▁", prepend_scheme=always, split=False)

@sywangyi
Contributor Author

It seems the pre_tokenizer should be loaded from tokenizer.json, but v5.0.0 does not do that.
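One way to confirm what the hub checkpoint actually declares is to read tokenizer.json directly (a sketch using huggingface_hub; the "pre_tokenizer" key is part of the tokenizers JSON schema):

import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("mistral-community/pixtral-12b", "tokenizer.json")
with open(path) as f:
    tokenizer_config = json.load(f)
# The on-disk config declares the pre_tokenizer that should be loaded
print(tokenizer_config["pre_tokenizer"])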

@molbap
Contributor

molbap commented Dec 22, 2025

Normally #42894 should fix the tokenization issues on main and in the next release candidate. It might take some time with the holiday season now, apologies. There are a few different changes in your PR; can you put what fails in the PR description and make sure the PR fixes things minimally? Thanks!

@sywangyi
Contributor Author

sywangyi commented Dec 22, 2025

Hi, I tried #42894, but it does not fix the "hf-internal-testing/Mistral-Small-3.1-24B-Instruct-2503-only-processor" and "mistral-community/pixtral-12b" issues (the repro script above still fails). Beyond that, nearly all cases of pytest tests/models/llava/test_modeling_llava.py::LlavaForConditionalGenerationIntegrationTest fail because of the tokenizer issue and a test-case issue, and I fixed them.
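A minimal round-trip check along these lines could guard against this kind of regression (a sketch, not code from the PR):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("mistral-community/pixtral-12b")
text = "Describe the images"
ids = processor.tokenizer(text)["input_ids"]
# Decoding (minus special tokens) should reproduce the original text
decoded = processor.tokenizer.decode(ids, skip_special_tokens=True)
assert decoded == text, f"round-trip mismatch: {decoded!r}"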

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: llava, pixtral

@sywangyi
Contributor Author

@molbap I updated the PR to fix all the issues I mentioned here, including the llava tests and the tokenizer issues for mistral-community/pixtral-12b and hf-internal-testing/Mistral-Small-3.1-24B-Instruct-2503-only-processor.
