
Conversation

@Maxusmusti Maxusmusti (Collaborator) commented Jan 29, 2026

Updates the library to support HuggingFace Transformers v5.

Changes

  • Replace additional_special_tokens with extra_special_tokens in tokenizer API calls (renamed in v5; see the sketch after this list)
  • Suppress verbose httpx HTTP request logs from huggingface_hub
  • Bump transformers dependency to >=5.0.0
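
A minimal sketch of the rename, using a placeholder model id and hypothetical token strings (the real call sites are in tokenizer_utils.py and data_process.py):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model id

new_tokens = ["<|mask|>", "<|reasoning|>"]  # hypothetical special tokens

# v4.x key (removed in v5):
# tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})

# v5 key, as used in this PR:
tokenizer.add_special_tokens({"extra_special_tokens": new_tokens})
```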

Files Modified

  • requirements.txt
  • src/instructlab/training/tokenizer_utils.py
  • src/instructlab/training/data_process.py
  • src/instructlab/training/main_ds.py

Summary by CodeRabbit

  • Chores

    • Updated core dependency versions (including transformers to 5.0.0) for compatibility.
  • Improvements

    • Reduced HTTP client logging verbosity for cleaner training output.
    • Improved tokenizer handling and registration of masking/reasoning tokens to ensure consistent token behavior.


@Maxusmusti Maxusmusti self-assigned this Jan 29, 2026
@mergify mergify bot added the ci-failure and dependencies labels Jan 29, 2026
@coderabbitai coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

Update transformers dependency to >=5.0.0 and bump related pinned packages; replace tokenizer uses of additional_special_tokens with extra_special_tokens; add startup suppression of httpx logs.

Changes

  • Dependency pins (requirements.txt, constraints-dev.txt): Bumped transformers to >=5.0.0 and updated constraint pins (hf-xet, huggingface-hub, tokenizers, transformers).
  • Tokenizer API updates (src/instructlab/training/data_process.py, src/instructlab/training/tokenizer_utils.py): Replaced additional_special_tokens with extra_special_tokens when registering special tokens; updated related variable names and presence checks.
  • Startup logging (src/instructlab/training/main_ds.py): Configured the httpx logger level to WARNING at startup to suppress verbose HTTP logs (see the sketch below).
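
A minimal sketch of the log suppression, assuming it runs once at process startup (the exact placement inside main_ds.py is not shown here):

```python
import logging

# huggingface_hub issues its HTTP requests via httpx, which logs every
# request at INFO; raising the level keeps training output readable.
logging.getLogger("httpx").setLevel(logging.WARNING)
```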

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble tokens, old to new,
From "additional" to "extra" they flew.
Dependencies climbed to version five,
Quieted logs so trains can drive,
A tiny hop—now code feels new! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "Add transformers v5 compatibility" accurately captures the main objective of the PR, which is to update the library to support HuggingFace Transformers v5.0.0+.



@Maxusmusti Maxusmusti force-pushed the transformer-v5-support branch from 0cba4d3 to eb9b537 on January 29, 2026 at 20:59
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/instructlab/training/data_process.py (1)

1301-1311: Update all test files to use extra_special_tokens instead of additional_special_tokens.

The codebase has a Transformers v5 compatibility gap: production code in data_process.py and tokenizer_utils.py now uses extra_special_tokens, but the test files still use the deprecated additional_special_tokens API. This inconsistency will cause test failures (see the sketch after the file list below).

Update the following test files:

  • tests/unit/test_unmask_messages.py (lines 345, 495)
  • tests/unit/test_data_process.py (lines 856, 912)
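
For reference, a hedged sketch of the production pattern the tests should mirror, per this PR's diff (the model id and token list are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model id
new_tokens = ["<|mask|>"]  # placeholder token

# Per this PR's diff, the v5 tokenizer exposes previously registered
# extras as `extra_special_tokens`; extend the list rather than overwrite it.
extra_special_tokens = tokenizer.extra_special_tokens + new_tokens
tokenizer.add_special_tokens({"extra_special_tokens": extra_special_tokens})
```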
🤖 Fix all issues with AI agents
In `@src/instructlab/training/tokenizer_utils.py`:
- Around lines 31-33: Reformat the tokenizer.add_special_tokens call to satisfy ruff formatting (remove the extra line breaks/whitespace) so the call is a single properly formatted expression. Update the invocation that references {"extra_special_tokens": extra_special_tokens} to a single-line or otherwise ruff-compliant style (see the sketch below), or run `ruff format` on src/instructlab/training/tokenizer_utils.py, so the linter/CI passes.
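
A sketch of the ruff-compliant single-line form, assuming the expression fits the configured line length (tokenizer and extra_special_tokens are defined earlier in the file):

```python
# Single-line call, no stray line breaks, satisfies `ruff format`:
tokenizer.add_special_tokens({"extra_special_tokens": extra_special_tokens})
```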

@Maxusmusti Maxusmusti force-pushed the transformer-v5-support branch from eb9b537 to f2531d1 on January 29, 2026 at 21:05
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@requirements.txt`:
- Line 6: The transformers>=5.0.0 bump may be incompatible with our code; verify it, and either pin to a compatible version or update code/CI/docs. Ensure the Python minimum is 3.10, add or update dependency constraints for huggingface-hub>=1.0.0 and tokenizers to avoid conflicts, and search for and replace removed or deprecated APIs: replace load_in_4bit/load_in_8bit usage with quantization_config, adjust model-loading dtype handling (auto vs. fp32), replace use_auth_token with token, remove references to transformers-cli, update Trainer usages (renamed/removed args), and adapt tokenizers API calls. Update tests, CI images, and docs so that every occurrence (load_in_4bit, load_in_8bit, quantization_config, use_auth_token, Trainer, transformers-cli, tokenizers.*) is fixed or the dependency is pinned.

```diff
 py-cpuinfo
 torch>=2.6.0
-transformers>=4.55.0
+transformers>=5.0.0
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

transformers 5.0.0 minimum Python version requirements dependencies

💡 Result:

transformers 5.0.0 minimum Python version

  • Python >= 3.10.0 (this is the package’s Requires-Python on PyPI). [1]

Note: the project description on PyPI also says “Python 3.9+”, but the install metadata for 5.0.0 is stricter (>=3.10.0). [1]

Runtime (core) dependencies (required)

Listed dependencies for transformers on piwheels are: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm. [2]

Framework requirement (practical)

  • The PyPI description for 5.0.0 states it works with PyTorch 2.1+ (installed separately; commonly via pip install "transformers[torch]"). [1]

Sources: [1], [2]


🌐 Web query:

transformers 5.0.0 breaking changes release notes

💡 Result:

Here are the Transformers 5.0.0 breaking / backward-incompatible changes, straight from the official v5.0.0 release notes (released Jan 26, 2026). [1]

Library-wide

  • from_pretrained(...) default dtype is now auto (instead of forcing FP32). [2]
  • Default Hub shard size is now 50GB (up from 5GB). [2]
  • use_auth_token is deprecated in favor of token everywhere (see the sketch below). [2]
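
A minimal sketch of the token rename, with a placeholder repo id and access token (the token kwarg already exists in late 4.x releases, so the new form works on both sides of the upgrade):

```python
from transformers import AutoModelForCausalLM

# v4: AutoModelForCausalLM.from_pretrained("org/model", use_auth_token="hf_xxx")
# v5: `token` replaces the deprecated `use_auth_token`
model = AutoModelForCausalLM.from_pretrained("org/model", token="hf_xxx")
```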

Attention / Torch / Quantization

  • Removed: head masking, relative positional biases in BERT-like models, and head pruning. [3]
  • Dropped support for torchscript and torch.fx (focus is now on dynamo / export). [3]
  • Removed quantization args load_in_4bit and load_in_8bit → use quantization_config instead (see the sketch below). [4]
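
A minimal sketch of the quantization migration, assuming bitsandbytes is installed and using a placeholder model id:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# v4: from_pretrained("org/model", load_in_4bit=True)  # argument removed in v5
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "org/model", quantization_config=quant_config
)
```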

Configuration

  • Removed nested-config init helpers like from_xxx_config. [4]
  • Can’t load a config class from a URL anymore (must be local path or Hub repo). [4]
  • RoPE config changes: parameters moved under config.rope_parameters; e.g. accessing config.rope_theta can now error. [4]
  • Non-generative models no longer have generation_config; model.config.generation_config may now error. [5]

Tokenization / Processing

  • Tokenizers: consolidation away from slow/fast split; single tokenization_<model>.py with backend selection. [6]
  • Tokenization breaking changes include (highlights): legacy tokenizer files removal, encode_plus -> __call__, batch_decode -> decode, and apply_chat_template now returns a BatchEncoding. [5]
  • Processing: processor serialization changes; XXXFeatureExtractors removed in favor of XXXImageProcessor for vision models. [5]

Modeling / Generate / Trainer / Hub / CLI

  • All models’ forward() now accept **kwargs. [7]
  • Generate: removed deprecated output aliases and some deprecated decoding-related classes. [7]
  • Trainer/TrainingArguments: many deprecated args removed/renamed (e.g. no_cuda -> use_cpu, per_gpu_* -> per_device_*, etc.; see the sketch after this list). [8]
  • PushToHubMixin: removed/changed args; push_to_hub is now keyword-only (except repo_id), no **kwargs, removed use_temp_dir, etc. [8]
  • CLI: transformers-cli removed; transformers ... is the only entry point; transformers chat/serve split, and transformers run removed. [8]
  • Env vars: removed legacy cache env vars like TRANSFORMERS_CACHE in favor of HF_HOME; plus huggingface_hub-related changes (e.g. requests→httpx error type). [9]
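
A minimal sketch of the renamed Trainer arguments (the output directory is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                # placeholder path
    use_cpu=True,                    # v4: no_cuda=True
    per_device_train_batch_size=8,   # v4: per_gpu_train_batch_size=8
)
```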

If you tell me what you’re upgrading from (e.g. 4.57.x → 5.0.0) and which parts you use (Trainer, tokenizers, quantization, CLI), I can produce a short “search/replace + code changes” checklist.


🌐 Web query:

transformers 5.0.0 PyPI package metadata tokenizers huggingface-hub compatibility

💡 Result:

For transformers==5.0.0 (released Jan 26, 2026), the published package metadata indicates:

  • Python requirement: Python >=3.10 (PyPI) [1]
  • Core runtime deps include: huggingface-hub and tokenizers (among others like numpy, safetensors, tqdm, etc.) [2]
  • Compatibility change in v5: transformers now requires huggingface_hub >= 1.0.0 [3]

What this means for tokenizers / huggingface-hub compatibility

Because transformers depends on both huggingface-hub and tokenizers [2] and also enforces huggingface_hub >= 1.0.0 [3], you must use a tokenizers version that does not pin huggingface_hub to <1.0.0—otherwise pip’s resolver will fail with a version conflict (this kind of conflict is documented historically when tokenizers pinned huggingface_hub to an upper bound) [4].

Practical rule: with transformers==5.0.0, keep huggingface_hub on 1.x and ensure tokenizers is new enough to be compatible with that.

Sources:
[1] https://pypi.org/project/transformers/
[2] https://www.piwheels.org/project/transformers/
[3] https://newreleases.io/project/pypi/transformers/release/5.0.0
[4] huggingface/transformers#26276


Verify Transformers v5.0.0 compatibility: the release's breaking changes affect multiple APIs.

The bump to >=5.0.0 introduces major breaking changes beyond Python/dependency updates:

  • Python: Minimum is now 3.10 (not 3.9).
  • Dependencies: Requires huggingface-hub >= 1.0.0 and a compatible tokenizers release (older tokenizers versions may pin huggingface-hub to <1.0.0, causing resolver conflicts).
  • Breaking changes: Model-loading dtype now defaults to auto (not FP32); quantization args (load_in_4bit, load_in_8bit) are removed in favor of quantization_config; the tokenizers API is consolidated; Trainer args are renamed or removed; transformers-cli is removed entirely; use_auth_token is deprecated in favor of token; and many more.

Confirm your code does not use deprecated or removed APIs (quantization, trainer args, CLI, tokenizers functions, model config patterns). Update all affected code, CI, and docs accordingly.


```diff
-tokenizer.add_special_tokens(
-    {"additional_special_tokens": additional_special_tokens}
-)
+extra_special_tokens = tokenizer.extra_special_tokens + new_tokens
```
Member left a comment

Lol these tokens are extra special

@RobotSail RobotSail (Member) left a comment

LGTM. Looks like there are some unit tests still failing; once those are fixed, this can merge

@mergify mergify bot added the one-approval label Jan 29, 2026
@RobotSail RobotSail mentioned this pull request Jan 30, 2026
