Conversation

@oleksost (Contributor) commented Dec 19, 2025

✨ Description

  1. Improves the memory efficiency of the loss and gradient computation in the reverse KL loss. (Benchmark plot: the lower curves are with these fixes, the upper ones without.)
  2. Normalises the loss and gradients by sequence length instead of by valid-token count.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

  • Manual reverse KL computation with in-place operations
  • torch.compile for the reverse KL loss
  • Loss & gradient normalisation using sequence length instead of valid-token count (see discussion in [Prototype] Normalising by valid tokens #426)
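For illustration, a minimal sketch of the two normalisation schemes being swapped here. The names (`per_token_loss`, `loss_mask`) are hypothetical and not Fast-LLM's API; the point is only the denominator change:

```python
import torch

# Hypothetical per-token losses and padding mask, not Fast-LLM's actual API.
torch.manual_seed(0)
per_token_loss = torch.rand(4, 8)   # (batch, seq_len) token losses
loss_mask = torch.ones(4, 8)
loss_mask[:, 6:] = 0                # last two positions are padding

# Before: normalise by the number of valid tokens (denominator varies per batch).
loss_valid = (per_token_loss * loss_mask).sum() / loss_mask.sum()

# After (this PR): normalise by the full sequence length (fixed denominator).
loss_seq = (per_token_loss * loss_mask).sum() / loss_mask.numel()
```

With padding present, the fixed denominator is larger, so the sequence-length-normalised loss is smaller; the trade-off is a denominator that does not fluctuate with per-batch padding.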

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

@jlamypoirier (Collaborator) left a comment


Seems like a good idea, but have you tried with @torch.compile instead? It would speed things up in addition to the memory savings.

@oleksost oleksost changed the title from "Efficient Rev. KL" to "Reverse KL: more efficient implementation + normalisation by sequence length" on Dec 19, 2025
log_ratio = distributed_log_softmax(logits, group=group)

student_probs = log_ratio.exp()
log_ratio.sub_(teacher_log_probs) # In-place: log_ratio = student_log_probs - teacher_log_probs
Review comment (Collaborator):

torch.compile already handles in-place operations, so it's better to leave these as out-of-place to avoid issues with torch.compile.

@oleksost oleksost merged commit 44b14ac into main Dec 23, 2025
4 checks passed
@oleksost oleksost deleted the rev_kl_improvements branch December 23, 2025 17:45