
Conversation

@hpsun1109 (Contributor) commented Nov 11, 2025

Added a split truncation strategy, used primarily during pre-training so that tokens exceeding the length limit can still participate in training.
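As a rough illustration of the idea (the helper name below is hypothetical, not the actual swift implementation), an over-long token sequence is cut into consecutive max_length-sized pieces instead of being discarded:

    def split_into_chunks(input_ids, max_length):
        # Consecutive max_length-sized chunks; the final chunk may be shorter.
        return [input_ids[i:i + max_length] for i in range(0, len(input_ids), max_length)]

    # A 10-token sequence with max_length=4 keeps every token across 3 chunks.
    print(split_into_chunks(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]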

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information belonging to this PR.

Experiment results

Paste your experiment results here (if needed).

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @hpsun1109, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new split truncation strategy, enhancing the system's ability to process lengthy input sequences. Unlike traditional truncation methods that discard excess tokens, the split strategy intelligently breaks down inputs into smaller, max_length-sized chunks. This ensures that all tokens are utilized, which is particularly beneficial for pre-training large language models where comprehensive data exposure is crucial.

Highlights

  • New Truncation Strategy: Introduced a split truncation strategy to handle sequences exceeding max_length.
  • Pre-training Utility: This strategy is designed to allow all tokens, even those beyond the limit, to participate in the training process, especially useful during pre-training.
  • Codebase Integration: The split option has been added to the truncation_strategy type hints in swift/llm/argument/base_args/template_args.py and swift/llm/template/base.py.
  • Chunking Logic: New logic in the _encode_truncated method of swift/llm/template/base.py splits oversized inputs into max_length-sized chunks, managing input_ids, labels, and loss_scale for each chunk and setting the first label of each subsequent chunk to -100 (a sketch follows below).
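A minimal sketch of that chunking step, assuming plain text-only, decoder-only inputs (this approximates the behaviour described above; it is not the exact code in _encode_truncated):

    def chunk_encoded(input_ids, labels, loss_scale, max_length):
        chunks = []
        for start in range(0, len(input_ids), max_length):
            end = start + max_length
            chunk_labels = list(labels[start:end])
            chunk_loss_scale = None if loss_scale is None else list(loss_scale[start:end])
            if start > 0:
                # Mask the first label of every continuation chunk so no loss is
                # computed on a token whose context was cut at the chunk boundary.
                chunk_labels[0] = -100
            chunks.append({
                'input_ids': list(input_ids[start:end]),
                'labels': chunk_labels,
                'loss_scale': chunk_loss_scale,
            })
        return chunks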

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a 'split' truncation strategy, which is a valuable addition for handling long sequences during pre-training. The implementation correctly adds the new strategy option and includes the core logic for chunking sequences. My review focuses on the robustness and correctness of this new feature. I've identified a critical issue where the 'split' strategy could be used with incompatible model types (multimodal, encoder-decoder), a high-severity issue in batch data collation that would cause errors when the batch size is greater than one, and a medium-severity suggestion for consistency in loss masking. Addressing these points will make the new feature safer and more reliable.

        input_ids, labels, loss_scale, truncation_strategy=self.truncation_strategy)
    length = self._get_length(input_ids, labels)
elif self.truncation_strategy == 'split':
    encoded_chunks = []

Severity: critical

The 'split' truncation strategy appears to be designed for text-only, decoder-only models, which is great for pre-training. However, it doesn't handle multimodal data or encoder-decoder architectures. Using encoded.copy() performs a shallow copy, which means that for multimodal inputs, each chunk would incorrectly reference the same full list of images/videos/audios. Similarly, for encoder-decoder models, other fields like prompt_input_ids would be copied without being chunked, leading to inconsistencies. To prevent incorrect usage and potential crashes, it's crucial to add checks to ensure this strategy is only used in supported scenarios.

                if self.is_encoder_decoder:
                    raise ValueError("The 'split' truncation strategy is not supported for encoder-decoder models.")
                if inputs.is_multimodal:
                    raise ValueError("The 'split' truncation strategy is not supported for multimodal inputs.")
                encoded_chunks = []
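A small self-contained example of the shallow-copy pitfall described above (the dict below is illustrative, not the actual encoded structure):

    # dict.copy() is shallow: the copy still references the same nested list objects,
    # so every chunk would point at the full, unchunked multimodal payload.
    encoded = {'input_ids': [1, 2, 3, 4], 'images': ['img_0.png', 'img_1.png']}
    chunk = encoded.copy()
    chunk['input_ids'] = encoded['input_ids'][:2]  # reassigned, so this field is independent
    print(chunk['images'] is encoded['images'])    # True -- both chunks share one images list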

Comment on lines 527 to 528
if self.truncation_strategy == 'split':
    return encoded

Severity: high

Returning a list of encoded chunks here is correct for the 'split' strategy. However, this has a downstream implication for the data_collator method that needs to be addressed. When batch_size > 1, the data loader will pass a list of these lists to data_collator. The current implementation of data_collator (around line 1421) does not flatten this nested list for the 'split' strategy (it only does so for self.packing), which will lead to errors. Please update data_collator to handle this case, for example by changing the condition if self.packing and isinstance(batch[0], list): to if (self.packing or self.truncation_strategy == 'split') and isinstance(batch[0], list):.
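A hedged sketch of the flattening step suggested above (function and field names are illustrative, not the actual data_collator code):

    # With batch_size > 1 and the 'split' strategy, each sample reaching the collator
    # may itself be a list of chunk dicts; flattening restores the flat list of
    # examples that the rest of the collator expects.
    def flatten_split_batch(batch):
        if batch and isinstance(batch[0], list):
            return [chunk for sample in batch for chunk in sample]
        return batch

    batch = [[{'input_ids': [1, 2]}, {'input_ids': [3]}], [{'input_ids': [4, 5]}]]
    print(len(flatten_split_batch(batch)))  # 3 chunks instead of 2 nested samples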

Comment on lines +1234 to +1235
chunk_labels = list(chunk_labels)
chunk_labels[0] = -100

Severity: medium

When setting chunk_labels[0] to -100 to prevent loss calculation on the first token of a new chunk, it's good practice to also update the corresponding loss_scale for consistency. If loss_scale is being used, its first element should be set to 0.0 to align with the masked label. This ensures that the loss scaling is correctly handled, especially in scenarios where is_loss_scale_binary is false.

                        chunk_labels = list(chunk_labels)
                        chunk_labels[0] = -100
                        if chunk_loss_scale is not None and len(chunk_loss_scale) > 0:
                            chunk_loss_scale = list(chunk_loss_scale)
                            chunk_loss_scale[0] = 0.0

@Jintao-Huang (Collaborator) commented

thanks 😊

@Jintao-Huang (Collaborator) commented

This feature will be supported in #6672.

@Jintao-Huang (Collaborator) commented

Thank you for your contribution; I will close this PR.
