
[Prototype] Improved model preprocessing, new batch structure#471

Draft
jlamypoirier wants to merge 2 commits into jlp_simplify_mtp from jlp_batch

Conversation


jlamypoirier (Collaborator) commented Feb 17, 2026

✨ Description

Main features:

  • Move (most of) model preprocessing to the data loader. This makes it not only simpler, but also faster since it runs in parallel processes.
  • Add a PreprocessedBatch structure to handle this preprocessing, and potentially replace the arbitrary kwargs we pass to the layers.
  • Drop the artificial concept of samples (mostly). We now pack documents directly into batches and always use the varlen implementation of mixers (no cross-document attention). Merge the batch and sequence dimensions into a single token dimension (see the sketch after this list).
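
To make the packing idea concrete, here is a minimal sketch (not Fast-LLM's actual code; the function name, the `cu_seqlens` convention and the padding handling are assumptions) of how documents can be concatenated into a single token dimension, with cumulative sequence lengths for a varlen mixer and at most one padding run at the end:

```python
import torch


def pack_documents(
    documents: list[torch.Tensor], batch_tokens: int, pad_token: int = 0
) -> tuple[torch.Tensor, torch.Tensor, int]:
    """Pack documents into a single flat token tensor of length `batch_tokens`."""
    tokens = torch.cat(documents)
    num_tokens = min(tokens.numel(), batch_tokens)
    tokens = tokens[:batch_tokens]
    # Cumulative sequence lengths, in the style expected by flash-attn varlen kernels,
    # so attention never crosses a document boundary.
    lengths = torch.tensor([len(d) for d in documents], dtype=torch.int32)
    cu_seqlens = torch.zeros(len(documents) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, 0)
    cu_seqlens = cu_seqlens.clamp(max=batch_tokens)
    # At most one padding run, placed at the very end of the batch.
    if num_tokens < batch_tokens:
        padding = tokens.new_full((batch_tokens - num_tokens,), pad_token)
        tokens = torch.cat([tokens, padding])
    return tokens, cu_seqlens, num_tokens
```

A varlen kernel given `cu_seqlens` then attends within each document only, which is what dropping the `cross_document_attention` field relies on.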

Detailed changes (non-exhaustive):

  • [In progress] Add BatchPreprocessingConfig structure (e.g. LanguageModelBatchPreprocessingConfig) which configures both data and model preprocessing.
  • Rework the Sample/Batch structure into Document (single document) and Batch (multiple documents; inherits from Sample). Convert them to dataclasses and remove methods from the abstract base classes, since the functionality may depend on the document type.
  • Add PreprocessedBatch structure which handles model preprocessing and stores its result as a list of MicroBatch (actually micro-sequences). See the sketch after this list.
  • Rework document padding: there is now at most a single padding sequence at the end. Add a num_tokens kwarg/attribute to keep track of this padding.
  • Adjust the naming convention with sample -> document/batch.
  • Rework how absent spans/patches are handled. Replace empty readers with null readers which always return None. Handle these None entries in Batch.from_documents and in PreprocessedBatch.from_batch.
  • Move memmap dataset, writers and readers to a new memmap directory.
  • Fix dataset sampling with MTP and no sample truncation.
  • Add get_preprocessing_config to base models and layers, which helps construct the model preprocessing config based on what the model needs. This follows the same structure as the preprocess method, which it aims to replace.
  • Remove cross_document_attention field, always use varlen attention.
  • [In progress] Merge the batch and sequence dimensions in mixers. Simplify backup attention.
  • [In progress] Remove preprocess_meta; adjust preprocess_batch to take a PreprocessedBatch.
  • [Todo] Remove samples from the batch config and data sampling.
  • [?] Rename phase names to their lowercase version. This may have a small impact on the names of logged metrics (e.g. in wandb).
  • [?] Simplify SamplingData. Rework GPTData.setup into sample_dataset, which samples one dataset at a time and which evaluators call directly.
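
A rough sketch of how the reworked structures could fit together; all class and field names below are illustrative assumptions, not the actual Fast-LLM definitions. Null readers yield `None`, which `Batch.from_documents` handles, and `num_tokens` tracks the real (non-padding) token count:

```python
import dataclasses

import torch


@dataclasses.dataclass
class Document:
    tokens: torch.Tensor
    loss_masking_spans: torch.Tensor | None = None  # Null readers return None here.


@dataclasses.dataclass
class Batch:
    tokens: torch.Tensor  # Flat token dimension (batch x sequence merged).
    num_tokens: int  # Real tokens, excluding the single trailing padding run.
    loss_masking_spans: list[torch.Tensor] | None = None

    @classmethod
    def from_documents(cls, documents: list[Document]) -> "Batch":
        tokens = torch.cat([d.tokens for d in documents])
        spans = [d.loss_masking_spans for d in documents]
        # If every document came from a null reader, drop the field entirely.
        spans = None if all(s is None for s in spans) else [s for s in spans if s is not None]
        return cls(tokens=tokens, num_tokens=tokens.numel(), loss_masking_spans=spans)


@dataclasses.dataclass
class MicroBatch:
    tokens: torch.Tensor
    kwargs: dict  # Candidate replacement for the arbitrary kwargs passed to layers.


@dataclasses.dataclass
class PreprocessedBatch:
    micro_batches: list[MicroBatch]  # Actually micro-sequences, per the description.

    @classmethod
    def from_batch(cls, batch: Batch, micro_sequence_length: int) -> "PreprocessedBatch":
        splits = batch.tokens.split(micro_sequence_length)
        return cls([MicroBatch(tokens=t, kwargs={}) for t in splits])
```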

Future steps:

  • Replace kwargs in layers with the MicroBatch structure (see the sketch after this list).
  • Remove the preprocess method in base models and layers. (Still needed for rotary embeddings and the stochastic mixer.)
  • Clarify preprocess_batch, which now takes an already preprocessed batch as input; it now seems to be mostly about running reference models.
  • Expand the data tests to cover model preprocessing.
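
As an illustration of the first future step, a layer's forward could consume the (hypothetical) MicroBatch from the sketch above instead of loose kwargs; this is an assumption about the eventual interface, not the current one:

```python
import torch


class AttentionLayer(torch.nn.Module):
    def forward(self, hidden_states: torch.Tensor, micro_batch: "MicroBatch") -> torch.Tensor:
        # Preprocessing results (e.g. cu_seqlens for the varlen kernel, rotary
        # frequencies, loss masks) travel on the typed micro-batch instead of an
        # untyped kwargs dict.
        cu_seqlens = micro_batch.kwargs["cu_seqlens"]
        ...  # attention computation using cu_seqlens would go here
        return hidden_states
```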

Open questions:

  • Preference spans have not been working for a while, and are causing trouble. Do we still want them?
  • Should the default for use_loss_masking_spans be set to True?
  • What to do with Mamba, which doesn't support varlen? [bug] Can't compile varlen mamba with base image 25.11 #416
  • Blended datasets take each "sample" from a single dataset. Effectively this means each micro-batch takes only documents from one of the datasets, so sampling is uneven unless we have lots of sequential or parallel micro-batches. Is this fixable? (A toy illustration follows this list.)
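
To illustrate the last point with a toy example (the blend weights, dataset names and micro-batch count are made up), the realized per-step mixture is only as fine-grained as the number of micro-batches drawn per step:

```python
import random

# Hypothetical blend weights and per-step micro-batch count, for illustration only.
weights = {"web": 0.9, "code": 0.1}
micro_batches_per_step = 4

counts = dict.fromkeys(weights, 0)
for _ in range(micro_batches_per_step):
    # Each micro-batch is drawn entirely from a single dataset.
    name = random.choices(list(weights), weights=list(weights.values()))[0]
    counts[name] += 1

# With only 4 draws per step, the realized mix is often 4/0 or 3/1 rather than 0.9/0.1;
# the target blend is only approached across many sequential or parallel micro-batches.
print(counts)
```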
