
Add trainerv2 with mindspeed#201

Open
typhoonzero wants to merge 3 commits into master from add_trainerv2_mindspeed

Conversation

Contributor

@typhoonzero commented Apr 27, 2026

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive guide for fine-tuning Qwen3 models on Huawei Ascend NPUs using Kubeflow Trainer v2, including step-by-step instructions and resource configuration.
    • Updated the main fine-tuning tutorial to reference the Ascend NPU approach and document required training dependencies.


coderabbitai Bot commented Apr 27, 2026

Warning

Rate limit exceeded

@typhoonzero has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 56 minutes and 51 seconds before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dc5f7c54-ac98-4c4e-a680-389f88bd55d1

📥 Commits

Reviewing files that changed from the base of the PR and between 1e40109 and 45b2683.

📒 Files selected for processing (1)
  • docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb

Walkthrough

This PR introduces documentation for fine-tuning Qwen3 on Huawei Ascend NPUs using Kubeflow Trainer v2 and MindSpeed-LLM. It adds a comprehensive Jupyter notebook with Kubeflow manifests (TrainingRuntime and TrainJob), environment setup, dataset preparation, checkpoint conversion, and training commands, plus a reference section in the existing tutorial guide.
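As a rough illustration of the manifests the walkthrough mentions, a TrainJob referencing a TrainingRuntime might look like the sketch below. This is not the notebook's actual manifest: the names, NPU resource key, and node count are placeholders, and the field layout assumes the Kubeflow Trainer v2 `trainer.kubeflow.org/v1alpha1` API.

```yaml
# Hedged sketch only. Names and values are placeholders, not the ones in
# the PR's notebook; field layout assumes the Kubeflow Trainer v2 API.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: qwen3-sft-ascend          # placeholder job name
spec:
  runtimeRef:
    name: mindspeed-llm-runtime   # a TrainingRuntime installed separately
  trainer:
    numNodes: 1
    resourcesPerNode:
      limits:
        huawei.com/Ascend910: "8" # exact key depends on the Ascend device plugin
```

The TrainJob stays small because the heavy configuration (image, command, volumes) lives in the referenced TrainingRuntime, which the notebook defines separately.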

Changes

  • Ascend NPU Fine-Tuning Documentation (docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb): New Jupyter notebook detailing Kubeflow Trainer v2 setup for Qwen3 fine-tuning on Ascend NPUs, including TrainingRuntime/TrainJob manifests, MindSpeed-LLM SFT configuration, checkpoint conversion, dataset preprocessing, and deployment steps.
  • Trainer v2 Tutorial Update (docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx): Added a reference section documenting the Ascend-specific fine-tuning path, including expected runtime configuration, resource requests, workflow steps, and image dependencies.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Poem

🐰 Ascend the heights with MindSpeed grace,
Qwen3 fine-tuned at quickened pace,
NPU clusters now documented bright,
Trainer v2 shines with Ascend's might!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title check: Passed. The title 'Add trainerv2 with mindspeed' directly corresponds to the main changes: adding a new documentation notebook for Kubeflow Trainer v2 with MindSpeed-LLM fine-tuning on Ascend NPUs and updating the existing tutorial to reference this new path.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Linked Issues check: Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Review rate limit: 0/1 reviews remaining, refill in 56 minutes and 51 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.


coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb (1)

292-300: Container securityContext duplicates pod-level fields.

runAsNonRoot, runAsUser, and runAsGroup are already set on the pod spec at lines 109-113 with the same values, so the container-level copies are redundant; only the truly container-scoped settings (allowPrivilegeEscalation, capabilities, seccompProfile) need to live here. Removing the duplicates reduces the chance of the two blocks drifting apart later.

🧹 Proposed cleanup
                   securityContext:
                     allowPrivilegeEscalation: true
                     capabilities:
                       add: ["IPC_LOCK", "SYS_PTRACE"]
-                    runAsNonRoot: true
-                    runAsUser: 1001
-                    runAsGroup: 0
                     seccompProfile:
                       type: RuntimeDefault
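To illustrate the scoping the comment describes, here is a minimal sketch of how the two levels relate once the duplicates are gone. The container name is a placeholder; the securityContext values are the ones quoted in the review.

```yaml
# Sketch only: pod-level runAs* settings apply to every container in the
# pod, so the container block keeps just the container-scoped keys.
spec:
  securityContext:                # pod level: authoritative for runAs*
    runAsNonRoot: true
    runAsUser: 1001
    runAsGroup: 0
  containers:
    - name: trainer               # placeholder container name
      securityContext:            # container level: container-scoped keys only
        allowPrivilegeEscalation: true
        capabilities:
          add: ["IPC_LOCK", "SYS_PTRACE"]
        seccompProfile:
          type: RuntimeDefault
```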
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb` around
lines 292 - 300, The container-level securityContext duplicates pod-level fields
runAsNonRoot, runAsUser, and runAsGroup (same values set earlier); remove those
three keys from the container's securityContext block and leave only
container-scoped keys (allowPrivilegeEscalation, capabilities, seccompProfile)
so the pod-level runAs* settings remain authoritative; locate the container
securityContext in the YAML snippet under the container spec (the block
containing allowPrivilegeEscalation/capabilities/seccompProfile) and delete
runAsNonRoot, runAsUser, and runAsGroup entries there.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb`:
- Around line 158-163: The YAML block scalar fails because the JSONL and PYCHECK
heredocs are not indented to the required column; keep the printf change for
JSONL and apply the same approach to PYCHECK: either replace the PYCHECK heredoc
with a printf that emits the content (like the JSONL fix) or indent every
non-empty line inside the PYCHECK heredoc by at least 20 spaces to match the
surrounding block scalar (ensure symbols RAW_DATA_FILE, JSONL, and PYCHECK are
updated accordingly and that set -o pipefail block indentation is preserved).
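The printf-based fix the comment describes can be sketched as follows. This is not the notebook's actual script: the path and the JSONL records are placeholders, and `mktemp` stands in for the real `RAW_DATA_FILE` location.

```shell
# Hypothetical sketch: printf emits each JSONL record as its own argument,
# so the script's indentation never leaks into the data. A heredoc body,
# by contrast, must be indented to the surrounding YAML block scalar's
# column, which is what broke the original cell.
set -o pipefail

RAW_DATA_FILE="$(mktemp)"   # placeholder for the notebook's real path

printf '%s\n' \
  '{"instruction": "Say hello.", "output": "Hello!"}' \
  '{"instruction": "Say goodbye.", "output": "Goodbye!"}' \
  > "$RAW_DATA_FILE"

wc -l < "$RAW_DATA_FILE"
```

The same pattern applies to the PYCHECK content: either emit it with printf, or indent every non-empty heredoc line to the block scalar's column.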

In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx`:
- Line 84: The GitHub URL
"https://github.com/alauda/aml-docs/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb"
is malformed and 404s; update that link (and the other two occurrences using the
same pattern) to either a proper GitHub blob URL including the branch (e.g., add
"/blob/main/" between "alauda/aml-docs" and the path) or convert it to a
site-relative link to the notebook within the docs (so it resolves on the
rendered site); search for the same broken pattern on the page (lines near where
the current link appears) and apply the same fix to each occurrence.
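The URL fix can be sketched in one line; the `/blob/main/` segment follows the reviewer's own example and may need to match the repository's actual default branch.

```shell
# Insert "/blob/main/" after the repo root so GitHub serves the file view
# instead of returning 404 for the raw path.
broken="https://github.com/alauda/aml-docs/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb"
printf '%s\n' "$broken" | sed 's|aml-docs/|aml-docs/blob/main/|'
# prints https://github.com/alauda/aml-docs/blob/main/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb
```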

---

Nitpick comments:
In `@docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb`:
- Around line 292-300: The container-level securityContext duplicates pod-level
fields runAsNonRoot, runAsUser, and runAsGroup (same values set earlier); remove
those three keys from the container's securityContext block and leave only
container-scoped keys (allowPrivilegeEscalation, capabilities, seccompProfile)
so the pod-level runAs* settings remain authoritative; locate the container
securityContext in the YAML snippet under the container spec (the block
containing allowPrivilegeEscalation/capabilities/seccompProfile) and delete
runAsNonRoot, runAsUser, and runAsGroup entries there.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 92b4c5fb-fd8a-43c6-b5d3-b92c9aa4f9be

📥 Commits

Reviewing files that changed from the base of the PR and between d4c4a48 and 1e40109.

📒 Files selected for processing (2)
  • docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb
  • docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx

Comment thread docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx

cloudflare-workers-and-pages Bot commented Apr 27, 2026

Deploying alauda-ai with Cloudflare Pages

Latest commit: 45b2683
Status: ✅  Deploy successful!
Preview URL: https://06552d9f.alauda-ai.pages.dev
Branch Preview URL: https://add-trainerv2-mindspeed.alauda-ai.pages.dev


@xZhizhizhizhizhi

/test-pass


Labels: None yet
Projects: None yet
2 participants