Conversation

@sywangyi
Contributor

What does this PR do?

Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
@sywangyi
Contributor Author

sywangyi commented Dec 22, 2025

from transformers import AutoProcessor

model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id)
text = "Describe the images"
inputs = processor.tokenizer(text)
print(f"Input text: '{text}'")
print(f"Token IDs: {inputs['input_ids']}")
decoded_text = processor.tokenizer.decode(inputs["input_ids"])
print(f"Decoded text: '{decoded_text}'")

output "DescribeĠtheĠimages", expected output "Describe the images"(in transformers 4.57.3)

@sywangyi
Contributor Author

Also, this fixes the failure in the llava test.

@sywangyi
Contributor Author

sywangyi commented Dec 22, 2025

The Mistral3ProcessorTest case fails because of the tokenizer fix.

from transformers import AutoProcessor

model_id = "hf-internal-testing/Mistral-Small-3.1-24B-Instruct-2503-only-processor"
processor = AutoProcessor.from_pretrained(model_id)
text = "Describe the images"
inputs = processor.tokenizer(text)
print(f"Input text: '{text}'")
print(f"Token IDs: {inputs['input_ids']}")
decoded_text = processor.tokenizer.decode(inputs["input_ids"])
print(f"Decoded text: '{decoded_text}'")
Before the fix:
Input text: 'Describe the images'
Token IDs: [1, 5847, 13089, 1278, 8061]
Decoded text: '<s>DescribeĠtheĠimages'

After the fix:
Input text: 'Describe the images'
Token IDs: [1, 5847, 1972, 22326, 1268, 8926]
Decoded text: '<s>Describetheimages'

However, in 4.57.3 the output is
Decoded text: '<s>Describe the images'
which is what is expected. I did some investigation; the pre_tokenizer seems to be incorrect for "hf-internal-testing/Mistral-Small-3.1-24B-Instruct-2503-only-processor" in 5.0.0. It should be

Sequence(pretokenizers=[Split(pattern=Regex("(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\n]+|\s+(?!\S)|\s+"), behavior=Isolated, invert=False), ByteLevel(add_prefix_space=False, trim_offsets=True, use_regex=False)])

instead of

Metaspace(replacement="▁", prepend_scheme=always, split=False)

@sywangyi
Contributor Author

It seems the pre_tokenizer should be loaded from tokenizer.json, but v5.0.0 does not do that.
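One way to confirm what the hub checkpoint actually declares is to read tokenizer.json directly (a sketch using huggingface_hub; the "pre_tokenizer" key is part of the tokenizers JSON schema):

import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("mistral-community/pixtral-12b", "tokenizer.json")
with open(path) as f:
    tokenizer_config = json.load(f)
# The on-disk config declares the pre_tokenizer that should be loaded
print(tokenizer_config["pre_tokenizer"])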

@molbap
Contributor

molbap commented Dec 22, 2025

Normally #42894 should fix the tokenization issues on main and in the next release candidate. It might take some time with the holiday season now, apologies. There are a few different changes in your PR; can you put what fails in the PR description and make sure the PR fixes things minimally? Thanks!

@sywangyi
Contributor Author

sywangyi commented Dec 22, 2025

Hi, I tried #42894, but it does not fix the "hf-internal-testing/Mistral-Small-3.1-24B-Instruct-2503-only-processor" and "mistral-community/pixtral-12b" issues (the repro script above still fails). Beyond that, nearly all cases of pytest tests/models/llava/test_modeling_llava.py::LlavaForConditionalGenerationIntegrationTest fail because of the tokenizer issue and a test-case issue, and I fixed them.
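A minimal round-trip check along these lines could guard against this kind of regression (a sketch, not code from the PR):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("mistral-community/pixtral-12b")
text = "Describe the images"
ids = processor.tokenizer(text)["input_ids"]
# Decoding (minus special tokens) should reproduce the original text
decoded = processor.tokenizer.decode(ids, skip_special_tokens=True)
assert decoded == text, f"round-trip mismatch: {decoded!r}"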

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: llava, pixtral

@sywangyi
Contributor Author

@molbap I updated the PR to fix all the issues I mentioned here, including the llava tests and the tokenizer issues for mistral-community/pixtral-12b and hf-internal-testing/Mistral-Small-3.1-24B-Instruct-2503-only-processor.
