
Conversation

@hpsun1109 (Contributor) commented Nov 11, 2025

Added a split truncation strategy, used primarily during pre-training so that tokens exceeding the length limit can still participate in training.
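As a rough illustration of the idea (the helper name below is hypothetical, not the actual swift implementation), an over-long token sequence is cut into consecutive max_length-sized pieces instead of being discarded:

    def split_into_chunks(input_ids, max_length):
        # Consecutive max_length-sized chunks; the final chunk may be shorter.
        return [input_ids[i:i + max_length] for i in range(0, len(input_ids), max_length)]

    # A 10-token sequence with max_length=4 keeps every token across 3 chunks.
    print(split_into_chunks(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]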

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information belonging to this PR.

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @hpsun1109, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new split truncation strategy, enhancing the system's ability to process lengthy input sequences. Unlike traditional truncation methods that discard excess tokens, the split strategy intelligently breaks down inputs into smaller, max_length-sized chunks. This ensures that all tokens are utilized, which is particularly beneficial for pre-training large language models where comprehensive data exposure is crucial.

Highlights

  • New Truncation Strategy: Introduced a split truncation strategy to handle sequences exceeding max_length.
  • Pre-training Utility: This strategy is designed to allow all tokens, even those beyond the limit, to participate in the training process, especially useful during pre-training.
  • Codebase Integration: The split option has been added to the truncation_strategy type hints in swift/llm/argument/base_args/template_args.py and swift/llm/template/base.py.
  • Chunking Logic: New logic in the _encode_truncated method of swift/llm/template/base.py splits oversized inputs into max_length-sized chunks, managing input_ids, labels, and loss_scale for each chunk and setting the first label of each subsequent chunk to -100 (a sketch follows below).
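A minimal sketch of that chunking step, assuming plain text-only, decoder-only inputs (this approximates the behaviour described above; it is not the exact code in _encode_truncated):

    def chunk_encoded(input_ids, labels, loss_scale, max_length):
        chunks = []
        for start in range(0, len(input_ids), max_length):
            end = start + max_length
            chunk_labels = list(labels[start:end])
            chunk_loss_scale = None if loss_scale is None else list(loss_scale[start:end])
            if start > 0:
                # Mask the first label of every continuation chunk so no loss is
                # computed on a token whose context was cut at the chunk boundary.
                chunk_labels[0] = -100
            chunks.append({
                'input_ids': list(input_ids[start:end]),
                'labels': chunk_labels,
                'loss_scale': chunk_loss_scale,
            })
        return chunks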

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a 'split' truncation strategy, which is a valuable addition for handling long sequences during pre-training. The implementation correctly adds the new strategy option and includes the core logic for chunking sequences. My review focuses on the robustness and correctness of this new feature. I've identified a critical issue where the 'split' strategy could be used with incompatible model types (multimodal, encoder-decoder), a high-severity issue in batch data collation that would cause errors when the batch size is greater than one, and a medium-severity suggestion for consistency in loss masking. Addressing these points will make the new feature safer and more reliable.

        input_ids, labels, loss_scale, truncation_strategy=self.truncation_strategy)
    length = self._get_length(input_ids, labels)
elif self.truncation_strategy == 'split':
    encoded_chunks = []

Severity: critical

The 'split' truncation strategy appears to be designed for text-only, decoder-only models, which is great for pre-training. However, it doesn't handle multimodal data or encoder-decoder architectures. Using encoded.copy() performs a shallow copy, which means that for multimodal inputs, each chunk would incorrectly reference the same full list of images/videos/audios. Similarly, for encoder-decoder models, other fields like prompt_input_ids would be copied without being chunked, leading to inconsistencies. To prevent incorrect usage and potential crashes, it's crucial to add checks to ensure this strategy is only used in supported scenarios.

                if self.is_encoder_decoder:
                    raise ValueError("The 'split' truncation strategy is not supported for encoder-decoder models.")
                if inputs.is_multimodal:
                    raise ValueError("The 'split' truncation strategy is not supported for multimodal inputs.")
                encoded_chunks = []
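A small self-contained example of the shallow-copy pitfall described above (the dict below is illustrative, not the actual encoded structure):

    # dict.copy() is shallow: the copy still references the same nested list objects,
    # so every chunk would point at the full, unchunked multimodal payload.
    encoded = {'input_ids': [1, 2, 3, 4], 'images': ['img_0.png', 'img_1.png']}
    chunk = encoded.copy()
    chunk['input_ids'] = encoded['input_ids'][:2]  # reassigned, so this field is independent
    print(chunk['images'] is encoded['images'])    # True -- both chunks share one images list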

Comment on lines 527 to 528
if self.truncation_strategy == 'split':
    return encoded

Severity: high

Returning a list of encoded chunks here is correct for the 'split' strategy. However, this has a downstream implication for the data_collator method that needs to be addressed. When batch_size > 1, the data loader will pass a list of these lists to data_collator. The current implementation of data_collator (around line 1421) does not flatten this nested list for the 'split' strategy (it only does so for self.packing), which will lead to errors. Please update data_collator to handle this case, for example by changing the condition if self.packing and isinstance(batch[0], list): to if (self.packing or self.truncation_strategy == 'split') and isinstance(batch[0], list):.
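A hedged sketch of the flattening step suggested above (function and field names are illustrative, not the actual data_collator code):

    # With batch_size > 1 and the 'split' strategy, each sample reaching the collator
    # may itself be a list of chunk dicts; flattening restores the flat list of
    # examples that the rest of the collator expects.
    def flatten_split_batch(batch):
        if batch and isinstance(batch[0], list):
            return [chunk for sample in batch for chunk in sample]
        return batch

    batch = [[{'input_ids': [1, 2]}, {'input_ids': [3]}], [{'input_ids': [4, 5]}]]
    print(len(flatten_split_batch(batch)))  # 3 chunks instead of 2 nested samples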

Comment on lines +1234 to +1235
chunk_labels = list(chunk_labels)
chunk_labels[0] = -100

Severity: medium

When setting chunk_labels[0] to -100 to prevent loss calculation on the first token of a new chunk, it's good practice to also update the corresponding loss_scale for consistency. If loss_scale is being used, its first element should be set to 0.0 to align with the masked label. This ensures that the loss scaling is correctly handled, especially in scenarios where is_loss_scale_binary is false.

                        chunk_labels = list(chunk_labels)
                        chunk_labels[0] = -100
                        if chunk_loss_scale is not None and len(chunk_loss_scale) > 0:
                            chunk_loss_scale = list(chunk_loss_scale)
                            chunk_loss_scale[0] = 0.0

@Jintao-Huang (Collaborator) commented

thanks 😊

@Jintao-Huang (Collaborator) commented

This feature will be supported in #6672.

@Jintao-Huang (Collaborator) commented

Thank you for your contribution; I will close this PR.
