Skip to content

[CI] diff driven test selection#8077

Open
stas00 wants to merge 2 commits into
masterfrom
stas00/diff-driven-test-selection
Open

[CI] diff driven test selection#8077
stas00 wants to merge 2 commits into
masterfrom
stas00/diff-driven-test-selection

Conversation

@stas00

@stas00 stas00 commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

TLDR: Analyze PR's diff and filter out tests that aren't exercising what has changed, potentially cutting down runtime and expense by 95-99% most of the time.

Detailed:

Deepspeed's CI takes forever - most of the time burning $$ and wastes dev time for no reason as most changes require just a few tests to run.

HF Transformers has a system to select which tests to run based on the diff of the PR - Sylvain Gugger wrote it many years ago since that repo has now probably thousands of tests. Deepspeed's CI isn't too bad but can easily take hours.

So I asked Claude Opus 4.8 to replicate the system for Deepspeed. Please have a look. It looks super complicated, so I'm not sure how easy it'd be to maintain/operate unless we always use AI to continue maintaining it. I asked Claude to leave a detailed state file so that it or another model could pick it up where it left.

I started with just the slowest costliest workload .github/workflows/modal-torch-latest.yml to see if it works well. If you're happy we can replicate it to the rest of the workloads.

CC: @loadams, @tjruwase - please tag others if you think they would be helpful to discuss this.

Signed-off-by: Stas Bekman <stas@stason.org>
@stas00 stas00 requested a review from loadams as a code owner June 18, 2026 22:56

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f84aec4ea5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread .github/workflows/modal-torch-latest.yml Outdated
Comment thread .github/workflows/modal-torch-latest.yml
Signed-off-by: Stas Bekman <stas@stason.org>
@stas00

stas00 commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator Author

this is absolutely hilarious - why do we even bother to be in the loop - AI CI flags problems another AI then fixes them, I push the change. We are going to be so replaced really soon now.

I guess it's important that the CI and the code writing aren't done by the same model, otherwise it'd be pointless to check quality and detect problems if it's the same entity.

@PKUWZP

PKUWZP commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

@stas00 May I ask why we are trying to replicate HF Transformers' system? It would be great to learn from your insights on this:)

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a diff-driven test selection mechanism for the modal-torch-latest GPU CI workflow to reduce CI runtime by running only tests/unit/v1 tests that are (statically or dynamically) impacted by a PR’s changes, with conservative fail-safe fallbacks to running the full suite.

Changes:

  • Add a new selector (ci/tests_fetcher.py) that computes impacted tests from the PR diff via a repo-wide import graph (+ curated dynamic edges), emitting mode=all|subset|none and a test target list artifact.
  • Update the modal runner (ci/torch_latest.py) and workflow (.github/workflows/modal-torch-latest.yml) to consume the generated test target list and gate the deploy job accordingly.
  • Add docs and self-tests for maintainability (.github/workflows/TEST_SELECTION*.md, ci/test_tests_fetcher.py, CONTRIBUTING update, and gitignore for selector outputs).

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
CONTRIBUTING.md Documents how diff-based test selection works and how to debug/override it.
ci/torch_latest.py Reads selected test targets and passes them to pytest (fallback to full suite).
ci/tests_fetcher.py New diff/import-graph-based test selection engine + GitHub Actions outputs/summary.
ci/test_tests_fetcher.py New stdlib self-tests that validate selector behavior on synthetic repos.
.gitignore Ignores selector output directory ci/.test_selection/.
.github/workflows/TEST_SELECTION.md Design/usage/security documentation for the selector system.
.github/workflows/TEST_SELECTION_STATE.md Implementation handoff/state notes for ongoing maintenance.
.github/workflows/modal-torch-latest.yml Adds collect-tests job, artifacts, and deploy gating based on selection mode.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread ci/tests_fetcher.py
Comment on lines +32 to +35
* **Testable.** All repo/config state lives on ``TestSelector`` (constructed with
an arbitrary ``repo_root``), so the logic can be unit-tested against synthetic
repos -- see ``tests/unit/test_tests_fetcher.py``.

Comment thread ci/tests_fetcher.py
Comment on lines +229 to +236
status = parts[0]
if status.startswith("R") and len(parts) >= 3:
deleted.append(parts[1])
changed.append(parts[2])
elif status.startswith("D"):
deleted.append(parts[1])
elif len(parts) >= 2:
changed.append(parts[1])
Comment thread ci/tests_fetcher.py

@staticmethod
def _reachable_with_parents(seed: Path, reverse: dict[Path, set[Path]]) -> dict[Path, Path | None]:
"""BFS from ``seed`` recording a predecessor for each node (for --explain)."""
Comment on lines +181 to +185
# NOTE: under pull_request_target this runs PR-authored code; the modal/HF
# secrets live on this runner, which is why the workflow is gated on
# review_requested / ready_for_review (a maintainer vets the PR first). The
# CI orchestration itself is restored from base in the next step so a PR
# can't repoint it at the secrets.
@PKUWZP

PKUWZP commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

@stas00 I think it's a great idea to trigger different CI tests based on the PRs. I am actually wondering how accurate the algorithms can be. For example, what's the potential risk of under-testing or over-testing?

@stas00

stas00 commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator Author

HF Transformers has been using it for many years now, this is the same code (from their repo) adapted to DeepSpeed. You can make it much more conservative and run more tests.

What Claude also proposed is to always run the full test suite either on PR merge or have a nightly run that always does the full thing. So in the worse case scenario you'd catch a problem at most 24h later - so basically make no new release until you have run the full test suite. This covers all bases, but waiting for sometimes days to get a ready PR merged because of flakey tests and meanwhile someone merging a PR and needing to start from scratch is a madness - and when we run out of modal.com free credits who will pay for CI?

I'm going to make another version for the new Deepspeed-based RL training repo we are working on https://github.com/Snowflake-AI-Research/Arctic-Platform since there are too many tests already, so I will be testing it out there should the DS maintainers not wanting to take this very small risk. A PR there is at Snowflake-AI-Research/Arctic-Platform#14

@stas00

stas00 commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator Author

@stas00 May I ask why we are trying to replicate HF Transformers' system? It would be great to learn from your insights on this:)

What do you mean? We aren't replicating their system, we are trying to reuse a sub-system that can reduce the CI run time+$$ by 95+%. It's been on my list of things I wanted to port since forever, since it'd be so time consuming, but with a good AI model this is now a matter of a few hours of tokens.

If you'd like to write one from scratch that's a totally valid option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants