
Conversation

@Maxusmusti Maxusmusti (Collaborator) commented Jan 29, 2026

Updates the library to support HuggingFace Transformers v5.

Changes

  • Replace additional_special_tokens with extra_special_tokens in tokenizer API calls (renamed in v5; see the sketch after this list)
  • Suppress verbose httpx HTTP request logs from huggingface_hub
  • Bump transformers dependency to >=5.0.0
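
A minimal sketch of the rename, using a placeholder model id and hypothetical token strings (the real call sites are in tokenizer_utils.py and data_process.py):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model id

new_tokens = ["<|mask|>", "<|reasoning|>"]  # hypothetical special tokens

# v4.x key (removed in v5):
# tokenizer.add_special_tokens({"additional_special_tokens": new_tokens})

# v5 key, as used in this PR:
tokenizer.add_special_tokens({"extra_special_tokens": new_tokens})
```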

Files Modified

  • requirements.txt
  • src/instructlab/training/tokenizer_utils.py
  • src/instructlab/training/data_process.py
  • src/instructlab/training/main_ds.py

Summary by CodeRabbit

  • Chores

    • Updated core dependency versions (including transformers to 5.0.0) for compatibility.
  • Improvements

    • Reduced HTTP client logging verbosity for cleaner training output.
    • Improved tokenizer handling and registration of masking/reasoning tokens to ensure consistent token behavior.


@Maxusmusti Maxusmusti self-assigned this Jan 29, 2026
@mergify mergify bot added the ci-failure and dependencies labels Jan 29, 2026
@coderabbitai coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

Update transformers dependency to >=5.0.0 and bump related pinned packages; replace tokenizer uses of additional_special_tokens with extra_special_tokens; add startup suppression of httpx logs.

Changes

  • Dependency pins (requirements.txt, constraints-dev.txt): Bumped transformers to >=5.0.0 and updated constraint pins (hf-xet, huggingface-hub, tokenizers, transformers).
  • Tokenizer API updates (src/instructlab/training/data_process.py, src/instructlab/training/tokenizer_utils.py): Replaced additional_special_tokens with extra_special_tokens when registering special tokens; updated related variable names and presence checks.
  • Startup logging (src/instructlab/training/main_ds.py): Configured the httpx logger level to WARNING at startup to suppress verbose HTTP logs (see the sketch below).
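
A minimal sketch of the log suppression, assuming it runs once at process startup (the exact placement inside main_ds.py is not shown here):

```python
import logging

# huggingface_hub issues its HTTP requests via httpx, which logs every
# request at INFO; raising the level keeps training output readable.
logging.getLogger("httpx").setLevel(logging.WARNING)
```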

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble tokens, old to new,
From "additional" to "extra" they flew.
Dependencies climbed to version five,
Quieted logs so trains can drive,
A tiny hop—now code feels new! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "Add transformers v5 compatibility" accurately captures the main objective of the PR, which is to update the library to support HuggingFace Transformers v5.0.0+.



@Maxusmusti Maxusmusti force-pushed the transformer-v5-support branch from 0cba4d3 to eb9b537 on January 29, 2026 at 20:59
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/instructlab/training/data_process.py (1)

1301-1311: Update all test files to use extra_special_tokens instead of additional_special_tokens.

The codebase has a Transformers v5 compatibility gap: production code in data_process.py and tokenizer_utils.py now uses extra_special_tokens, but the test files still use the deprecated additional_special_tokens API. This inconsistency will cause test failures (see the sketch after the file list below).

Update the following test files:

  • tests/unit/test_unmask_messages.py (lines 345, 495)
  • tests/unit/test_data_process.py (lines 856, 912)
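
For reference, a hedged sketch of the production pattern the tests should mirror, per this PR's diff (the model id and token list are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model id
new_tokens = ["<|mask|>"]  # placeholder token

# Per this PR's diff, the v5 tokenizer exposes previously registered
# extras as `extra_special_tokens`; extend the list rather than overwrite it.
extra_special_tokens = tokenizer.extra_special_tokens + new_tokens
tokenizer.add_special_tokens({"extra_special_tokens": extra_special_tokens})
```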
🤖 Fix all issues with AI agents
In `@src/instructlab/training/tokenizer_utils.py`:
- Around lines 31-33: Reformat the tokenizer.add_special_tokens call to satisfy ruff formatting (remove the extra line breaks/whitespace) so the call is a single properly formatted expression. Update the invocation that references {"extra_special_tokens": extra_special_tokens} to a single-line or otherwise ruff-compliant style (see the sketch below), or run `ruff format` on src/instructlab/training/tokenizer_utils.py, so the linter/CI passes.
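
A sketch of the ruff-compliant single-line form, assuming the expression fits the configured line length (tokenizer and extra_special_tokens are defined earlier in the file):

```python
# Single-line call, no stray line breaks, satisfies `ruff format`:
tokenizer.add_special_tokens({"extra_special_tokens": extra_special_tokens})
```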

@Maxusmusti Maxusmusti force-pushed the transformer-v5-support branch from eb9b537 to f2531d1 on January 29, 2026 at 21:05
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@requirements.txt`:
- Line 6: The transformers>=5.0.0 bump may be incompatible with our code; verify it, and either pin to a compatible version or update code/CI/docs. Ensure the Python minimum is 3.10, add or update dependency constraints for huggingface-hub>=1.0.0 and tokenizers to avoid conflicts, and search for and replace removed or deprecated APIs: replace load_in_4bit/load_in_8bit usage with quantization_config, adjust model-loading dtype handling (auto vs. fp32), replace use_auth_token with token, remove references to transformers-cli, update Trainer usages (renamed/removed args), and adapt tokenizers API calls. Update tests, CI images, and docs so that every occurrence (load_in_4bit, load_in_8bit, quantization_config, use_auth_token, Trainer, transformers-cli, tokenizers.*) is fixed or the dependency is pinned.

```diff
 py-cpuinfo
 torch>=2.6.0
-transformers>=4.55.0
+transformers>=5.0.0
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

transformers 5.0.0 minimum Python version requirements dependencies

💡 Result:

transformers 5.0.0 minimum Python version

  • Python >= 3.10.0 (this is the package’s Requires-Python on PyPI). [1]

Note: the project description on PyPI also says “Python 3.9+”, but the install metadata for 5.0.0 is stricter (>=3.10.0). [1]

Runtime (core) dependencies (required)

Listed dependencies for transformers on piwheels are: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm. [2]

Framework requirement (practical)

  • The PyPI description for 5.0.0 states it works with PyTorch 2.1+ (installed separately; commonly via pip install "transformers[torch]"). [1]

Sources: [1], [2]


🌐 Web query:

transformers 5.0.0 breaking changes release notes

💡 Result:

Here are the Transformers 5.0.0 breaking / backward-incompatible changes, straight from the official v5.0.0 release notes (released Jan 26, 2026). [1]

Library-wide

  • from_pretrained(...) default dtype is now auto (instead of forcing FP32). [2]
  • Default Hub shard size is now 50GB (up from 5GB). [2]
  • use_auth_token is deprecated in favor of token everywhere (see the sketch below). [2]
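
A minimal sketch of the token rename, with a placeholder repo id and access token (the token kwarg already exists in late 4.x releases, so the new form works on both sides of the upgrade):

```python
from transformers import AutoModelForCausalLM

# v4: AutoModelForCausalLM.from_pretrained("org/model", use_auth_token="hf_xxx")
# v5: `token` replaces the deprecated `use_auth_token`
model = AutoModelForCausalLM.from_pretrained("org/model", token="hf_xxx")
```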

Attention / Torch / Quantization

  • Removed: head masking, relative positional biases in BERT-like models, and head pruning. [3]
  • Dropped support for torchscript and torch.fx (focus is now on dynamo / export). [3]
  • Removed quantization args load_in_4bit and load_in_8bit → use quantization_config instead (see the sketch below). [4]
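
A minimal sketch of the quantization migration, assuming bitsandbytes is installed and using a placeholder model id:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# v4: from_pretrained("org/model", load_in_4bit=True)  # argument removed in v5
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "org/model", quantization_config=quant_config
)
```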

Configuration

  • Removed nested-config init helpers like from_xxx_config. [4]
  • Can’t load a config class from a URL anymore (must be local path or Hub repo). [4]
  • RoPE config changes: parameters moved under config.rope_parameters; e.g. accessing config.rope_theta can now error. [4]
  • Non-generative models no longer have generation_config; model.config.generation_config may now error. [5]

Tokenization / Processing

  • Tokenizers: consolidation away from slow/fast split; single tokenization_<model>.py with backend selection. [6]
  • Tokenization breaking changes include (highlights): legacy tokenizer files removal, encode_plus -> __call__, batch_decode -> decode, and apply_chat_template now returns a BatchEncoding. [5]
  • Processing: processor serialization changes; XXXFeatureExtractors removed in favor of XXXImageProcessor for vision models. [5]

Modeling / Generate / Trainer / Hub / CLI

  • All models’ forward() now accept **kwargs. [7]
  • Generate: removed deprecated output aliases and some deprecated decoding-related classes. [7]
  • Trainer/TrainingArguments: many deprecated args removed/renamed (e.g. no_cuda -> use_cpu, per_gpu_* -> per_device_*, etc.; see the sketch after this list). [8]
  • PushToHubMixin: removed/changed args; push_to_hub is now keyword-only (except repo_id), no **kwargs, removed use_temp_dir, etc. [8]
  • CLI: transformers-cli removed; transformers ... is the only entry point; transformers chat/serve split, and transformers run removed. [8]
  • Env vars: removed legacy cache env vars like TRANSFORMERS_CACHE in favor of HF_HOME; plus huggingface_hub-related changes (e.g. requests→httpx error type). [9]
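
A minimal sketch of the renamed Trainer arguments (the output directory is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                # placeholder path
    use_cpu=True,                    # v4: no_cuda=True
    per_device_train_batch_size=8,   # v4: per_gpu_train_batch_size=8
)
```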

If you tell me what you’re upgrading from (e.g. 4.57.x → 5.0.0) and which parts you use (Trainer, tokenizers, quantization, CLI), I can produce a short “search/replace + code changes” checklist.


🌐 Web query:

transformers 5.0.0 PyPI package metadata tokenizers huggingface-hub compatibility

💡 Result:

For transformers==5.0.0 (released Jan 26, 2026), the published package metadata indicates:

  • Python requirement: Python >=3.10 (PyPI) [1]
  • Core runtime deps include: huggingface-hub and tokenizers (among others like numpy, safetensors, tqdm, etc.) [2]
  • Compatibility change in v5: transformers now requires huggingface_hub >= 1.0.0 [3]

What this means for tokenizers / huggingface-hub compatibility

Because transformers depends on both huggingface-hub and tokenizers [2] and also enforces huggingface_hub >= 1.0.0 [3], you must use a tokenizers version that does not pin huggingface_hub to <1.0.0—otherwise pip’s resolver will fail with a version conflict (this kind of conflict is documented historically when tokenizers pinned huggingface_hub to an upper bound) [4].

Practical rule: with transformers==5.0.0, keep huggingface_hub on 1.x and ensure tokenizers is new enough to be compatible with that.

Sources:
[1] https://pypi.org/project/transformers/
[2] https://www.piwheels.org/project/transformers/
[3] https://newreleases.io/project/pypi/transformers/release/5.0.0
[4] huggingface/transformers#26276


Verify Transformers v5.0.0 compatibility: the release's breaking changes affect multiple APIs.

The bump to >=5.0.0 introduces major breaking changes beyond Python/dependency updates:

  • Python: Minimum is now 3.10 (not 3.9).
  • Dependencies: Requires huggingface-hub >= 1.0.0 and a compatible tokenizers release (older tokenizers versions may pin huggingface-hub to <1.0.0, causing resolver conflicts).
  • Breaking changes: Model-loading dtype now defaults to auto (not FP32); quantization args (load_in_4bit, load_in_8bit) are removed in favor of quantization_config; the tokenizers API is consolidated; Trainer args are renamed or removed; transformers-cli is removed entirely; use_auth_token is deprecated in favor of token; and many more.

Confirm your code does not use deprecated or removed APIs (quantization, trainer args, CLI, tokenizers functions, model config patterns). Update all affected code, CI, and docs accordingly.


```diff
-tokenizer.add_special_tokens(
-    {"additional_special_tokens": additional_special_tokens}
-)
+extra_special_tokens = tokenizer.extra_special_tokens + new_tokens
```
Member left a comment

Lol these tokens are extra special

@RobotSail RobotSail (Member) left a comment

LGTM. Looks like there are some unit tests still failing; once those are fixed, this can merge

@mergify mergify bot added the one-approval label Jan 29, 2026
@RobotSail RobotSail mentioned this pull request Jan 30, 2026
