
[Prototype] Improved model preprocessing, new batch structure#471

Draft
jlamypoirier wants to merge 2 commits into jlp_simplify_mtp from jlp_batch

Conversation


jlamypoirier (Collaborator) commented Feb 17, 2026

✨ Description

Main features:

  • Move (most of) model preprocessing to the data loader. This makes it not only simpler, but also faster since it runs in parallel processes.
  • Add a PreprocessedBatch structure to handle this preprocessing, and potentially replace the arbitrary kwargs we pass to the layers.
  • Drop the artificial concept of samples (mostly). We now pack documents directly into batches and always use the varlen implementation of mixers (no cross-document attention). Merge the batch and sequence dimensions into a single token dimension (see the sketch after this list).
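
To make the packing idea concrete, here is a minimal sketch (not Fast-LLM's actual code; the function name, the `cu_seqlens` convention and the padding handling are assumptions) of how documents can be concatenated into a single token dimension, with cumulative sequence lengths for a varlen mixer and at most one padding run at the end:

```python
import torch


def pack_documents(
    documents: list[torch.Tensor], batch_tokens: int, pad_token: int = 0
) -> tuple[torch.Tensor, torch.Tensor, int]:
    """Pack documents into a single flat token tensor of length `batch_tokens`."""
    tokens = torch.cat(documents)
    num_tokens = min(tokens.numel(), batch_tokens)
    tokens = tokens[:batch_tokens]
    # Cumulative sequence lengths, in the style expected by flash-attn varlen kernels,
    # so attention never crosses a document boundary.
    lengths = torch.tensor([len(d) for d in documents], dtype=torch.int32)
    cu_seqlens = torch.zeros(len(documents) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, 0)
    cu_seqlens = cu_seqlens.clamp(max=batch_tokens)
    # At most one padding run, placed at the very end of the batch.
    if num_tokens < batch_tokens:
        padding = tokens.new_full((batch_tokens - num_tokens,), pad_token)
        tokens = torch.cat([tokens, padding])
    return tokens, cu_seqlens, num_tokens
```

A varlen kernel given `cu_seqlens` then attends within each document only, which is what dropping the `cross_document_attention` field relies on.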

Detailed changes (non-exhaustive):

  • [In progress] Add BatchPreprocessingConfig structure (e.g. LanguageModelBatchPreprocessingConfig) which configures both data and model preprocessing.
  • Rework the Sample/Batch structure into Document (single document) and Batch (multiple documents; inherits from Sample). Convert them to dataclasses and remove methods from the abstract base classes, since the functionality may depend on the document type.
  • Add PreprocessedBatch structure which handles model preprocessing and stores its result as a list of MicroBatch (actually micro-sequences). See the sketch after this list.
  • Rework document padding: there is now at most a single padding sequence at the end. Add a num_tokens kwarg/attribute to keep track of this padding.
  • Adjust the naming convention with sample -> document/batch.
  • Rework how absent spans/patches are handled. Replace empty readers with null readers which always return None. Handle these None entries in Batch.from_documents and in PreprocessedBatch.from_batch.
  • Move memmap dataset, writers and readers to a new memmap directory.
  • Fix dataset sampling with MTP and no sample truncation.
  • Add get_preprocessing_config to base models and layers, which helps construct the model preprocessing config based on what the model needs. This follows the same structure as the preprocess method, which it aims to replace.
  • Remove cross_document_attention field, always use varlen attention.
  • [In progress] Merge the batch and sequence dimensions in mixers. Simplify backup attention.
  • [In progress] Remove preprocess_meta; adjust preprocess_batch to take a PreprocessedBatch.
  • [Todo] Remove samples from the batch config and data sampling.
  • [?] Rename phase names to their lowercase version. This may have a small impact on the names of logged metrics (e.g. in wandb).
  • [?] Simplify SamplingData. Rework GPTData.setup into sample_dataset, which samples one dataset at a time and which evaluators call directly.
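
A rough sketch of how the reworked structures could fit together; all class and field names below are illustrative assumptions, not the actual Fast-LLM definitions. Null readers yield `None`, which `Batch.from_documents` handles, and `num_tokens` tracks the real (non-padding) token count:

```python
import dataclasses

import torch


@dataclasses.dataclass
class Document:
    tokens: torch.Tensor
    loss_masking_spans: torch.Tensor | None = None  # Null readers return None here.


@dataclasses.dataclass
class Batch:
    tokens: torch.Tensor  # Flat token dimension (batch x sequence merged).
    num_tokens: int  # Real tokens, excluding the single trailing padding run.
    loss_masking_spans: list[torch.Tensor] | None = None

    @classmethod
    def from_documents(cls, documents: list[Document]) -> "Batch":
        tokens = torch.cat([d.tokens for d in documents])
        spans = [d.loss_masking_spans for d in documents]
        # If every document came from a null reader, drop the field entirely.
        spans = None if all(s is None for s in spans) else [s for s in spans if s is not None]
        return cls(tokens=tokens, num_tokens=tokens.numel(), loss_masking_spans=spans)


@dataclasses.dataclass
class MicroBatch:
    tokens: torch.Tensor
    kwargs: dict  # Candidate replacement for the arbitrary kwargs passed to layers.


@dataclasses.dataclass
class PreprocessedBatch:
    micro_batches: list[MicroBatch]  # Actually micro-sequences, per the description.

    @classmethod
    def from_batch(cls, batch: Batch, micro_sequence_length: int) -> "PreprocessedBatch":
        splits = batch.tokens.split(micro_sequence_length)
        return cls([MicroBatch(tokens=t, kwargs={}) for t in splits])
```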

Future steps:

  • Replace kwargs in layers with the MicroBatch structure (see the sketch after this list).
  • Remove the preprocess method in base models and layers. (Still needed for rotary embeddings and the stochastic mixer.)
  • Clarify preprocess_batch, which now takes an already preprocessed batch as input; it now seems to be mostly about running reference models.
  • Expand the data tests to cover model preprocessing.
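
As an illustration of the first future step, a layer's forward could consume the (hypothetical) MicroBatch from the sketch above instead of loose kwargs; this is an assumption about the eventual interface, not the current one:

```python
import torch


class AttentionLayer(torch.nn.Module):
    def forward(self, hidden_states: torch.Tensor, micro_batch: "MicroBatch") -> torch.Tensor:
        # Preprocessing results (e.g. cu_seqlens for the varlen kernel, rotary
        # frequencies, loss masks) travel on the typed micro-batch instead of an
        # untyped kwargs dict.
        cu_seqlens = micro_batch.kwargs["cu_seqlens"]
        ...  # attention computation using cu_seqlens would go here
        return hidden_states
```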

Open questions:

  • Preference spans have not been working for a while, and are causing trouble. Do we still want them?
  • Should the default for use_loss_masking_spans be set to True?
  • What to do with Mamba, which doesn't support varlen? [bug] Can't compile varlen mamba with base image 25.11 #416
  • Blended datasets take each "sample" from a single dataset. Effectively this means each micro-batch takes only documents from one of the datasets, so sampling is uneven unless we have lots of sequential or parallel micro-batches. Is this fixable? (A toy illustration follows this list.)
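
To illustrate the last point with a toy example (the blend weights, dataset names and micro-batch count are made up), the realized per-step mixture is only as fine-grained as the number of micro-batches drawn per step:

```python
import random

# Hypothetical blend weights and per-step micro-batch count, for illustration only.
weights = {"web": 0.9, "code": 0.1}
micro_batches_per_step = 4

counts = dict.fromkeys(weights, 0)
for _ in range(micro_batches_per_step):
    # Each micro-batch is drawn entirely from a single dataset.
    name = random.choices(list(weights), weights=list(weights.values()))[0]
    counts[name] += 1

# With only 4 draws per step, the realized mix is often 4/0 or 3/1 rather than 0.9/0.1;
# the target blend is only approached across many sequential or parallel micro-batches.
print(counts)
```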
