feat: Add split truncation_strategy #6541
Conversation
Summary of Changes

Hello @hpsun1109, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request introduces a new 'split' truncation strategy.
Code Review
This pull request introduces a 'split' truncation strategy, a valuable addition for handling long sequences during pre-training. The implementation correctly adds the new strategy option and includes the core logic for chunking sequences. My review focuses on the robustness and correctness of this new feature. I've identified a critical issue where the 'split' strategy could be used with incompatible model types (multimodal, encoder-decoder); a high-severity issue with data collation that would cause errors when the batch size is greater than one; and a medium-severity suggestion for consistency in loss masking. Addressing these points will make the new feature safer and more reliable.
        input_ids, labels, loss_scale, truncation_strategy=self.truncation_strategy)
    length = self._get_length(input_ids, labels)
elif self.truncation_strategy == 'split':
    encoded_chunks = []
The 'split' truncation strategy appears to be designed for text-only, decoder-only models, which is great for pre-training. However, it doesn't handle multimodal data or encoder-decoder architectures. Using encoded.copy() performs a shallow copy, which means that for multimodal inputs, each chunk would incorrectly reference the same full list of images/videos/audios. Similarly, for encoder-decoder models, other fields like prompt_input_ids would be copied without being chunked, leading to inconsistencies. To prevent incorrect usage and potential crashes, it's crucial to add checks to ensure this strategy is only used in supported scenarios.
if self.is_encoder_decoder:
    raise ValueError("The 'split' truncation strategy is not supported for encoder-decoder models.")
if inputs.is_multimodal:
    raise ValueError("The 'split' truncation strategy is not supported for multimodal inputs.")
encoded_chunks = []
swift/llm/template/base.py
if self.truncation_strategy == 'split':
    return encoded
Returning a list of encoded chunks here is correct for the 'split' strategy. However, this has a downstream implication for the data_collator method that needs to be addressed. When batch_size > 1, the data loader will pass a list of these lists to data_collator. The current implementation of data_collator (around line 1421) does not flatten this nested list for the 'split' strategy (it only does so for self.packing), which will lead to errors. Please update data_collator to handle this case, for example by changing the condition if self.packing and isinstance(batch[0], list): to if (self.packing or self.truncation_strategy == 'split') and isinstance(batch[0], list):.
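A minimal, self-contained sketch of the suggested flattening (the helper name and signature here are hypothetical, not the repository's actual data_collator):

from typing import Any, Dict, List, Union

Encoded = Dict[str, Any]

def flatten_split_batch(batch: List[Union[Encoded, List[Encoded]]],
                        packing: bool,
                        truncation_strategy: str) -> List[Encoded]:
    # Hypothetical helper mirroring the suggested condition: with packing or
    # the 'split' strategy, each sample may arrive as a list of encoded
    # chunks rather than a single dict, so flatten one level before collating.
    if (packing or truncation_strategy == 'split') and batch and isinstance(batch[0], list):
        return [encoded for chunks in batch for encoded in chunks]
    return batch

# Example: the first sample was split into two chunks by the 'split' strategy.
batch = [[{'input_ids': [1, 2]}, {'input_ids': [3]}], [{'input_ids': [4, 5]}]]
print(flatten_split_batch(batch, packing=False, truncation_strategy='split'))
# -> [{'input_ids': [1, 2]}, {'input_ids': [3]}, {'input_ids': [4, 5]}]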
chunk_labels = list(chunk_labels)
chunk_labels[0] = -100
When setting chunk_labels[0] to -100 to prevent loss calculation on the first token of a new chunk, it's good practice to also update the corresponding loss_scale for consistency. If loss_scale is being used, its first element should be set to 0.0 to align with the masked label. This ensures that the loss scaling is correctly handled, especially in scenarios where is_loss_scale_binary is false.
chunk_labels = list(chunk_labels)
chunk_labels[0] = -100
if chunk_loss_scale is not None and len(chunk_loss_scale) > 0:
    chunk_loss_scale = list(chunk_loss_scale)
    chunk_loss_scale[0] = 0.0
thanks 😊

This feature will be supported in this PR.

Thank you for your contribution; I will close this PR.
Added a 'split' truncation strategy, primarily used during pre-training, to ensure that tokens exceeding the length limit can still participate in training.
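For illustration, here is a loose standalone sketch of the chunking idea (a hypothetical helper, not the PR's actual code), assuming text-only, decoder-only inputs: a sequence longer than max_length is cut into max_length-sized chunks so the overflow still contributes to training, with the first label of each later chunk masked as discussed in the review above.

from typing import List, Tuple

def split_encode(input_ids: List[int], labels: List[int],
                 max_length: int) -> List[Tuple[List[int], List[int]]]:
    # Cut an over-long sequence into max_length-sized chunks instead of
    # discarding the overflow; mask the first label of every chunk after
    # the first, since that token has no preceding context in its chunk.
    chunks = []
    for start in range(0, len(input_ids), max_length):
        chunk_ids = input_ids[start:start + max_length]
        chunk_labels = list(labels[start:start + max_length])
        if start > 0:
            chunk_labels[0] = -100  # ignored by the cross-entropy loss
        chunks.append((chunk_ids, chunk_labels))
    return chunks

For example, split_encode(list(range(10)), list(range(10)), max_length=4) yields chunks of lengths 4, 4, and 2, with the labels at positions 4 and 8 masked to -100.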
PR type
PR information
Write the detailed information that belongs to this PR.
Experiment results
Paste your experiment results here (if needed).