From 1e401092d8ccf5820c0fb81b9cbef665dde85fae Mon Sep 17 00:00:00 2001 From: Wu Yi Date: Mon, 27 Apr 2026 17:36:27 +0800 Subject: [PATCH 1/3] add trainerv2 with mindspeed --- ...e-tune-with-trainer-v2-mindspeed-npu.ipynb | 503 ++++++++++++++++++ .../how_to/fine-tune-with-trainer-v2.mdx | 13 + 2 files changed, 516 insertions(+) create mode 100644 docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb diff --git a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb new file mode 100644 index 00000000..c162fb6b --- /dev/null +++ b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb @@ -0,0 +1,503 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Fine-Tuning Qwen3 on Ascend NPUs with Kubeflow Trainer v2 and MindSpeed-LLM\n", + "\n", + "This notebook shows how to run a **Kubeflow Trainer v2** `TrainJob` for Qwen3 fine-tuning on Huawei Ascend NPUs using **MindSpeed-LLM**.\n", + "\n", + "The flow is intentionally close to `qwen3_finetune_verify.ipynb`, but moves the work into a reusable Trainer v2 `TrainingRuntime`:\n", + "\n", + "1. Use the pre-built MindSpeed-LLM NPU runtime image.\n", + "2. Create a `TrainingRuntime` that converts Hugging Face weights, preprocesses Alpaca-format data, and launches MindSpeed-LLM SFT.\n", + "3. Submit a `TrainJob` that mounts the shared model PVC and requests Ascend resources.\n", + "4. Monitor the Trainer v2 job and logs.\n", + "\n", + "The example defaults are smoke-test settings for Qwen3-0.6B: `TRAIN_ITERS=1`, `SEQ_LENGTH=128`, `TP=1`, `PP=1`, and one Ascend 910B4 device. Increase these values for production runs." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "| Requirement | Example used in this notebook |\n", + "|---|---|\n", + "| Kubeflow Trainer v2 | `trainer.kubeflow.org/v1alpha1` |\n", + "| Namespace | `kubeflow-admin-cpaas-io` |\n", + "| Ascend scheduler/runtime | `schedulerName: hami-scheduler`, `runtimeClassName: ascend` |\n", + "| Shared model PVC | `team-model-cache-pvc` mounted at `/mnt/models` |\n", + "| Base model | `/mnt/models/Qwen3-0.6B` |\n", + "| Accelerator resource keys | `huawei.com/Ascend910B4` and `huawei.com/Ascend910B4-memory` |\n", + "\n", + "For larger models, make sure `TP * PP <= NPU count` and that the model architecture arguments in the runtime match the model `config.json`. The provided arguments target Qwen3-0.6B." + ] + }, + { + "cell_type": "markdown", + "id": "893e3ce4", + "metadata": {}, + "source": [ + "## Step 1: Use the Pre-Built Runtime Image\n", + "\n", + "Use the pre-built CANN PyTorch workbench image. It includes the Ascend runtime dependencies and matching versions of `torch`, `torch_npu`, `mindspeed`, and `mindspeed_llm`.\n", + "\n", + "The runtime YAML uses the public image:\n", + "\n", + "```text\n", + "alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7\n", + "```\n", + "\n", + "\n", + "Important version rule: do not clone `MindSpeed-LLM` HEAD at runtime unless the image was built from the same source revision. The tested path uses installed package modules such as `python -m mindspeed_llm.tasks.checkpoint.convert`, which avoids repo/package drift." + ] + }, + { + "cell_type": "markdown", + "id": "6ea09abe", + "metadata": {}, + "source": [ + "## Step 2: Create the MindSpeed-LLM TrainingRuntime\n", + "\n", + "This runtime contains one `trainer` replicated job. 
It performs all work in a single NPU pod:\n", + "\n", + "- validates the MindSpeed/PyTorch/NPU environment;\n", + "- converts HF weights to Megatron format with `mindspeed_llm.tasks.checkpoint.convert`;\n", + "- preprocesses Alpaca-format data with `mindspeed_llm.core.datasets.dataset_preprocess`;\n", + "- launches SFT with `torchrun -m mindspeed_llm.tasks.posttrain.launcher`.\n", + "\n", + "The runtime creates a tiny built-in JSONL dataset if `RAW_DATA_FILE` does not already exist. For real training, mount or generate your dataset and override `RAW_DATA_FILE`, `TRAIN_ITERS`, and `SEQ_LENGTH` in the `TrainJob`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a457750b", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile kf-trainingruntime-mindspeed-npu.yaml\n", + "apiVersion: trainer.kubeflow.org/v1alpha1\n", + "kind: TrainingRuntime\n", + "metadata:\n", + " name: mindspeed-llm-qwen3-npu-runtime\n", + " namespace: kubeflow-admin-cpaas-io\n", + " labels:\n", + " trainer.kubeflow.org/framework: torch\n", + "spec:\n", + " mlPolicy:\n", + " numNodes: 1\n", + " torch:\n", + " numProcPerNode: auto\n", + " template:\n", + " spec:\n", + " replicatedJobs:\n", + " - name: trainer\n", + " template:\n", + " metadata:\n", + " labels:\n", + " trainer.kubeflow.org/trainjob-ancestor-step: trainer\n", + " spec:\n", + " backoffLimit: 0\n", + " template:\n", + " spec:\n", + " schedulerName: hami-scheduler\n", + " runtimeClassName: ascend\n", + " securityContext:\n", + " runAsNonRoot: true\n", + " runAsUser: 1001\n", + " runAsGroup: 0\n", + " fsGroup: 1000\n", + " volumes:\n", + " - name: workspace\n", + " emptyDir: {}\n", + " - name: dshm\n", + " emptyDir:\n", + " medium: Memory\n", + " sizeLimit: 4Gi\n", + " containers:\n", + " - name: node\n", + " image: alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7\n", + " command: [\"bash\", \"-lc\"]\n", + " args:\n", + " - |\n", + " set -o pipefail\n", + "\n", + " # The Ascend env 
scripts may reference unset shell variables or probe\n", + " # optional libraries. Source them before enabling set -e.\n", + " set +e\n", + " for f in /usr/local/Ascend/cann/set_env.sh /usr/local/Ascend/ascend-toolkit/set_env.sh /usr/local/Ascend/nnal/atb/set_env.sh; do\n", + " [ -f \"$f\" ] && source \"$f\"\n", + " done\n", + " set -e\n", + "\n", + " export CUDA_DEVICE_MAX_CONNECTIONS=1\n", + " export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True\n", + " export ASCEND_PROCESS_LOG_PATH=/mnt/workspace/ascendlog\n", + " mkdir -p \"$ASCEND_PROCESS_LOG_PATH\"\n", + "\n", + " WORK_DIR=${WORK_DIR:-/mnt/workspace/qwen3-0.6b-mindspeed}\n", + " HF_MODEL_DIR=${HF_MODEL_DIR:-/mnt/models/Qwen3-0.6B}\n", + " RAW_DATA_FILE=${RAW_DATA_FILE:-${WORK_DIR}/data/alpaca_sample.jsonl}\n", + " PROCESSED_DATA_PREFIX=${PROCESSED_DATA_PREFIX:-${WORK_DIR}/data/alpaca}\n", + " MCORE_WEIGHTS_DIR=${MCORE_WEIGHTS_DIR:-${WORK_DIR}/model_weights/qwen3_mcore_tp${TP:-1}_pp${PP:-1}}\n", + " OUTPUT_DIR=${OUTPUT_DIR:-${WORK_DIR}/output/qwen3_0_6b_finetuned}\n", + "\n", + " TP=${TP:-1}\n", + " PP=${PP:-1}\n", + " SEQ_LENGTH=${SEQ_LENGTH:-128}\n", + " TRAIN_ITERS=${TRAIN_ITERS:-1}\n", + " MBS=${MBS:-1}\n", + " LR=${LR:-1.25e-6}\n", + " MIN_LR=${MIN_LR:-1.25e-7}\n", + "\n", + " mkdir -p \"$(dirname \"$RAW_DATA_FILE\")\" \"$MCORE_WEIGHTS_DIR\" \"$OUTPUT_DIR\"\n", + " if [ ! 
-s \"$RAW_DATA_FILE\" ]; then\n", + " cat >\"$RAW_DATA_FILE\" <<'JSONL'\n", + "{\"instruction\":\"Who are you?\",\"input\":\"\",\"output\":\"I am XiaoLing, an AI assistant from Alauda AI Platform.\",\"system\":\"\"}\n", + "{\"instruction\":\"What is Alauda AI Platform?\",\"input\":\"\",\"output\":\"Alauda AI Platform helps teams build, train, and serve AI workloads on Kubernetes.\",\"system\":\"\"}\n", + "JSONL\n", + " fi\n", + "\n", + " python - <<'PYCHECK'\n", + "import importlib.metadata as md\n", + "import importlib.util\n", + "import torch\n", + "import torch_npu\n", + "for mod in [\"torch\", \"torch_npu\", \"mindspeed\", \"mindspeed_llm\"]:\n", + " assert importlib.util.find_spec(mod), f\"missing {mod}\"\n", + "print(\"torch:\", torch.__version__)\n", + "print(\"torch_npu:\", torch_npu.__version__)\n", + "print(\"mindspeed:\", md.version(\"mindspeed\"))\n", + "print(\"mindspeed_llm:\", md.version(\"mindspeed-llm\"))\n", + "print(\"npu_count:\", torch.npu.device_count())\n", + "assert torch.npu.is_available(), \"NPU is not available\"\n", + "PYCHECK\n", + "\n", + " python -m mindspeed_llm.tasks.checkpoint.convert \\\n", + " --load-model-type hf \\\n", + " --save-model-type mg \\\n", + " --target-tensor-parallel-size \"$TP\" \\\n", + " --target-pipeline-parallel-size \"$PP\" \\\n", + " --load-dir \"$HF_MODEL_DIR\" \\\n", + " --save-dir \"$MCORE_WEIGHTS_DIR\" \\\n", + " --model-type-hf qwen3\n", + "\n", + " python -m mindspeed_llm.core.datasets.dataset_preprocess \\\n", + " --input \"$RAW_DATA_FILE\" \\\n", + " --tokenizer-name-or-path \"$HF_MODEL_DIR\" \\\n", + " --output-prefix \"$PROCESSED_DATA_PREFIX\" \\\n", + " --handler-name AlpacaStyleInstructionHandler \\\n", + " --tokenizer-type PretrainedFromHF \\\n", + " --workers 1 \\\n", + " --log-interval 1 \\\n", + " --enable-thinking none \\\n", + " --prompt-type qwen3\n", + "\n", + " NPROC=$(python -c 'import torch, torch_npu; print(torch.npu.device_count())')\n", + " DP=$(( NPROC / (TP * PP) ))\n", + " GBS=$(( 
DP * MBS ))\n", + " [ \"$GBS\" -ge 1 ] || { echo \"Invalid parallelism: NPROC=$NPROC TP=$TP PP=$PP MBS=$MBS\"; exit 1; }\n", + "\n", + " torchrun \\\n", + " --nproc_per_node \"$NPROC\" \\\n", + " --nnodes 1 \\\n", + " --node_rank 0 \\\n", + " --master_addr localhost \\\n", + " --master_port 6000 \\\n", + " -m mindspeed_llm.tasks.posttrain.launcher \\\n", + " --use-mcore-models \\\n", + " --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \\\n", + " --kv-channels 128 \\\n", + " --qk-layernorm \\\n", + " --tensor-model-parallel-size \"$TP\" \\\n", + " --pipeline-model-parallel-size \"$PP\" \\\n", + " --sequence-parallel \\\n", + " --use-distributed-optimizer \\\n", + " --use-flash-attn \\\n", + " --num-layers 28 \\\n", + " --hidden-size 1024 \\\n", + " --num-attention-heads 16 \\\n", + " --ffn-hidden-size 3072 \\\n", + " --max-position-embeddings 40960 \\\n", + " --seq-length \"$SEQ_LENGTH\" \\\n", + " --make-vocab-size-divisible-by 1 \\\n", + " --padded-vocab-size 151936 \\\n", + " --rotary-base 1000000 \\\n", + " --use-rotary-position-embeddings \\\n", + " --micro-batch-size \"$MBS\" \\\n", + " --global-batch-size \"$GBS\" \\\n", + " --disable-bias-linear \\\n", + " --swiglu \\\n", + " --train-iters \"$TRAIN_ITERS\" \\\n", + " --tokenizer-type PretrainedFromHF \\\n", + " --tokenizer-name-or-path \"$HF_MODEL_DIR\" \\\n", + " --normalization RMSNorm \\\n", + " --position-embedding-type rope \\\n", + " --norm-epsilon 1e-6 \\\n", + " --hidden-dropout 0 \\\n", + " --attention-dropout 0 \\\n", + " --no-gradient-accumulation-fusion \\\n", + " --attention-softmax-in-fp32 \\\n", + " --exit-on-missing-checkpoint \\\n", + " --no-masked-softmax-fusion \\\n", + " --group-query-attention \\\n", + " --num-query-groups 8 \\\n", + " --min-lr \"$MIN_LR\" \\\n", + " --lr \"$LR\" \\\n", + " --weight-decay 1e-1 \\\n", + " --clip-grad 1.0 \\\n", + " --adam-beta1 0.9 \\\n", + " --adam-beta2 0.95 \\\n", + " --initial-loss-scale 4096 \\\n", + " --no-load-optim \\\n", + " 
--no-load-rng \\\n", + " --seed 42 \\\n", + " --bf16 \\\n", + " --data-path \"$PROCESSED_DATA_PREFIX\" \\\n", + " --split 100,0,0 \\\n", + " --log-interval 1 \\\n", + " --save-interval \"$TRAIN_ITERS\" \\\n", + " --eval-interval \"$TRAIN_ITERS\" \\\n", + " --eval-iters 0 \\\n", + " --finetune \\\n", + " --stage sft \\\n", + " --is-instruction-dataset \\\n", + " --prompt-type qwen3 \\\n", + " --no-pad-to-seq-lengths \\\n", + " --distributed-backend nccl \\\n", + " --load \"$MCORE_WEIGHTS_DIR\" \\\n", + " --save \"$OUTPUT_DIR\" \\\n", + " --transformer-impl local \\\n", + " --no-save-optim \\\n", + " --no-save-rng\n", + " env:\n", + " - name: WORK_DIR\n", + " value: /mnt/workspace/qwen3-0.6b-mindspeed\n", + " - name: HF_MODEL_DIR\n", + " value: /mnt/models/Qwen3-0.6B\n", + " - name: TP\n", + " value: \"1\"\n", + " - name: PP\n", + " value: \"1\"\n", + " - name: SEQ_LENGTH\n", + " value: \"128\"\n", + " - name: TRAIN_ITERS\n", + " value: \"1\"\n", + " - name: MBS\n", + " value: \"1\"\n", + " securityContext:\n", + " allowPrivilegeEscalation: true\n", + " capabilities:\n", + " add: [\"IPC_LOCK\", \"SYS_PTRACE\"]\n", + " runAsNonRoot: true\n", + " runAsUser: 1001\n", + " runAsGroup: 0\n", + " seccompProfile:\n", + " type: RuntimeDefault\n", + " volumeMounts:\n", + " - name: workspace\n", + " mountPath: /mnt/workspace\n", + " - name: dshm\n", + " mountPath: /dev/shm\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c32621a4", + "metadata": {}, + "outputs": [], + "source": [ + "kubectl apply -f kf-trainingruntime-mindspeed-npu.yaml\n", + "kubectl get trainingruntime mindspeed-llm-qwen3-npu-runtime -n kubeflow-admin-cpaas-io" + ] + }, + { + "cell_type": "markdown", + "id": "48bf871f", + "metadata": {}, + "source": [ + "## Step 3: Submit a TrainJob\n", + "\n", + "The `TrainJob` mounts the shared PVC at `/mnt/models` and requests one Ascend 910B4 device. 
The PVC should already contain the Hugging Face Qwen3 model directory used by `HF_MODEL_DIR`.\n", + "\n", + "If your cluster does not expose `huawei.com/Ascend910B4-memory`, remove that resource or replace it with the memory key used by your Ascend device plugin." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08996076", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile kf-trainjob-mindspeed-npu.yaml\n", + "apiVersion: trainer.kubeflow.org/v1alpha1\n", + "kind: TrainJob\n", + "metadata:\n", + " generateName: trainjob-mindspeed-qwen3-\n", + " namespace: kubeflow-admin-cpaas-io\n", + " # If Kueue is enabled, uncomment and set your LocalQueue name.\n", + " # labels:\n", + " # kueue.x-k8s.io/queue-name: local-queue\n", + "spec:\n", + " runtimeRef:\n", + " apiGroup: trainer.kubeflow.org\n", + " kind: TrainingRuntime\n", + " name: mindspeed-llm-qwen3-npu-runtime\n", + " podTemplateOverrides:\n", + " - targetJobs:\n", + " - name: trainer\n", + " spec:\n", + " volumes:\n", + " - name: models-cache\n", + " persistentVolumeClaim:\n", + " claimName: team-model-cache-pvc\n", + " containers:\n", + " - name: node\n", + " volumeMounts:\n", + " - name: models-cache\n", + " mountPath: /mnt/models\n", + " trainer:\n", + " numNodes: 1\n", + " env:\n", + " - name: HF_MODEL_DIR\n", + " value: /mnt/models/Qwen3-0.6B\n", + " - name: TRAIN_ITERS\n", + " value: \"1\"\n", + " - name: SEQ_LENGTH\n", + " value: \"128\"\n", + " - name: TP\n", + " value: \"1\"\n", + " - name: PP\n", + " value: \"1\"\n", + " resourcesPerNode:\n", + " requests:\n", + " cpu: \"4\"\n", + " memory: \"8Gi\"\n", + " huawei.com/Ascend910B4: \"1\"\n", + " huawei.com/Ascend910B4-memory: \"32G\"\n", + " limits:\n", + " cpu: \"8\"\n", + " memory: \"32Gi\"\n", + " huawei.com/Ascend910B4: \"1\"\n", + " huawei.com/Ascend910B4-memory: \"32G\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc0c6148", + "metadata": {}, + "outputs": [], + "source": [ + "# Use 
create instead of apply because the TrainJob uses generateName.\n",
+    "kubectl create -f kf-trainjob-mindspeed-npu.yaml"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eed626b7",
+   "metadata": {},
+   "source": [
+    "## Step 4: Monitor the Job\n",
+    "\n",
+    "Trainer v2 creates a JobSet and one trainer pod for this single-node example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d328d4b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "kubectl get trainjobs -n kubeflow-admin-cpaas-io\n",
+    "kubectl get jobsets,jobs,pods -n kubeflow-admin-cpaas-io | grep trainjob-mindspeed-qwen3 || true"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2219b86a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Replace <trainjob-name> and <trainer-pod-name> with the generated names.\n",
+    "kubectl describe trainjob <trainjob-name> -n kubeflow-admin-cpaas-io\n",
+    "kubectl logs -f <trainer-pod-name> -n kubeflow-admin-cpaas-io"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9140ccae",
+   "metadata": {},
+   "source": [
+    "## Step 5: Production Adjustments\n",
+    "\n",
+    "For a real run, change these values before submitting the `TrainJob`:\n",
+    "\n",
+    "| Setting | Where | Guidance |\n",
+    "|---|---|---|\n",
+    "| `TRAIN_ITERS` | `spec.trainer.env` | Increase from `1` to the required training length. |\n",
+    "| `SEQ_LENGTH` | `spec.trainer.env` | Use `4096` or your target context length if memory allows. |\n",
+    "| `TP` / `PP` | runtime env or `TrainJob` env | Match model size and available NPU count. |\n",
+    "| model architecture args | `TrainingRuntime` command | Must match `config.json`; this notebook is for Qwen3-0.6B. |\n",
+    "| dataset | `RAW_DATA_FILE` or PVC content | Use Alpaca JSONL with `instruction`, `input`, `output`, and optional `system`. |\n",
+    "| resources | `resourcesPerNode` | Use the exact Ascend resource keys and memory slices exposed by your cluster. 
|\n",
+    "\n",
+    "For multi-node training, set `spec.trainer.numNodes > 1` only after validating HCCN/device IPs, link state, cross-node reachability, and the HCCL environment for your cluster."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6bdc1b8d",
+   "metadata": {},
+   "source": [
+    "## Step 6: Cleanup\n",
+    "\n",
+    "Delete generated TrainJobs when they are no longer needed. Delete the runtime only if no other experiments reference it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e5e9591",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "kubectl delete trainjob <trainjob-name> -n kubeflow-admin-cpaas-io\n",
+    "kubectl delete trainingruntime mindspeed-llm-qwen3-npu-runtime -n kubeflow-admin-cpaas-io"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aa759c18",
+   "metadata": {},
+   "source": [
+    "## Validation Notes from NPU Dev\n",
+    "\n",
+    "The NPU dev cluster validated the Trainer v2 smoke path with the same PyTorch CANN workbench image family: the job scheduled through `hami-scheduler`, used `runtimeClassName: ascend`, imported `torch`, `torch_npu`, `mindspeed`, and `mindspeed_llm`, and saw one allocated Ascend 910B4 device.\n",
+    "\n",
+    "The notebook intentionally uses installed package module entrypoints instead of cloning `MindSpeed-LLM` HEAD at runtime. In NPU dev, cloning HEAD produced a mismatch with installed `mindspeed 0.12.1`. If HAMi reports `CardInsufficientMemory`, wait for other NPU workloads to finish or reduce the requested `huawei.com/Ascend910B4-memory` value according to your cluster policy. If the cluster cannot reach Docker Hub directly, mirror or preload `alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7` into a registry reachable from NPU nodes."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx index 6ab6649f..944dd6de 100644 --- a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx +++ b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx @@ -79,6 +79,19 @@ Use our pre-built image `alaudadockerhub/fine_tune_with_llamafactory:v0.1.11` or 1. Download the [notebook](https://github.com/alauda/aml-docs/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb) to your current workbench in **Alauda AI**, create a new workbench if you don't have one, and open the notebook. 2. Follow the instructions in the notebook to create a `TrainingRuntime` and submit a `TrainJob` for fine-tuning a LLaMA-Factory model. The notebook includes example configurations for using the `team-model-cache-pvc` shared PVC and Git credentials. +## Fine-Tuning on Ascend NPUs with MindSpeed-LLM + +For Huawei Ascend NPU clusters, use the [MindSpeed-LLM NPU notebook](https://github.com/alauda/aml-docs/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb) instead of the LlamaFactory GPU notebook. + +The MindSpeed-LLM notebook shows how to: + +- Use the pre-built `alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7` image. +- Create a Trainer v2 `TrainingRuntime` with `runtimeClassName: ascend` and `schedulerName: hami-scheduler`. +- Submit a Qwen3 fine-tuning `TrainJob` that requests Ascend resources such as `huawei.com/Ascend910B4`. +- Run the MindSpeed-LLM workflow: Hugging Face checkpoint conversion, dataset preprocessing, and SFT training. 
+ +Use this notebook when your cluster provides Ascend NPUs and your model training image must include `torch_npu`, `mindspeed`, and `mindspeed_llm`. + ## Scheduling with Kueue [Kueue](https://kueue.sigs.k8s.io/) provides job queuing, quota management, and fair scheduling for Kubernetes workloads. When Kueue is installed in your cluster, TrainJobs are held in a **suspended** state until Kueue admits them based on available quota. From 63eb7f5892c48fabf8d281999bfd2afe58e0be58 Mon Sep 17 00:00:00 2001 From: Wu Yi Date: Thu, 30 Apr 2026 18:29:17 +0800 Subject: [PATCH 2/3] update mindspeed trainingruntime yaml --- ...e-tune-with-trainer-v2-mindspeed-npu.ipynb | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb index c162fb6b..327eb818 100644 --- a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb +++ b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb @@ -157,25 +157,25 @@ " mkdir -p \"$(dirname \"$RAW_DATA_FILE\")\" \"$MCORE_WEIGHTS_DIR\" \"$OUTPUT_DIR\"\n", " if [ ! 
-s \"$RAW_DATA_FILE\" ]; then\n", " cat >\"$RAW_DATA_FILE\" <<'JSONL'\n", - "{\"instruction\":\"Who are you?\",\"input\":\"\",\"output\":\"I am XiaoLing, an AI assistant from Alauda AI Platform.\",\"system\":\"\"}\n", - "{\"instruction\":\"What is Alauda AI Platform?\",\"input\":\"\",\"output\":\"Alauda AI Platform helps teams build, train, and serve AI workloads on Kubernetes.\",\"system\":\"\"}\n", - "JSONL\n", + " {\"instruction\":\"Who are you?\",\"input\":\"\",\"output\":\"I am XiaoLing, an AI assistant from Alauda AI Platform.\",\"system\":\"\"}\n", + " {\"instruction\":\"What is Alauda AI Platform?\",\"input\":\"\",\"output\":\"Alauda AI Platform helps teams build, train, and serve AI workloads on Kubernetes.\",\"system\":\"\"}\n", + " JSONL\n", " fi\n", "\n", " python - <<'PYCHECK'\n", - "import importlib.metadata as md\n", - "import importlib.util\n", - "import torch\n", - "import torch_npu\n", - "for mod in [\"torch\", \"torch_npu\", \"mindspeed\", \"mindspeed_llm\"]:\n", - " assert importlib.util.find_spec(mod), f\"missing {mod}\"\n", - "print(\"torch:\", torch.__version__)\n", - "print(\"torch_npu:\", torch_npu.__version__)\n", - "print(\"mindspeed:\", md.version(\"mindspeed\"))\n", - "print(\"mindspeed_llm:\", md.version(\"mindspeed-llm\"))\n", - "print(\"npu_count:\", torch.npu.device_count())\n", - "assert torch.npu.is_available(), \"NPU is not available\"\n", - "PYCHECK\n", + " import importlib.metadata as md\n", + " import importlib.util\n", + " import torch\n", + " import torch_npu\n", + " for mod in [\"torch\", \"torch_npu\", \"mindspeed\", \"mindspeed_llm\"]:\n", + " assert importlib.util.find_spec(mod), f\"missing {mod}\"\n", + " print(\"torch:\", torch.__version__)\n", + " print(\"torch_npu:\", torch_npu.__version__)\n", + " print(\"mindspeed:\", md.version(\"mindspeed\"))\n", + " print(\"mindspeed_llm:\", md.version(\"mindspeed-llm\"))\n", + " print(\"npu_count:\", torch.npu.device_count())\n", + " assert torch.npu.is_available(), \"NPU is 
not available\"\n", + " PYCHECK\n", "\n", " python -m mindspeed_llm.tasks.checkpoint.convert \\\n", " --load-model-type hf \\\n", From 45b26835f1616d9362eb0cb9f08d4d205db969b6 Mon Sep 17 00:00:00 2001 From: Wu Yi Date: Thu, 30 Apr 2026 18:33:42 +0800 Subject: [PATCH 3/3] update --- ...e-tune-with-trainer-v2-mindspeed-npu.ipynb | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb index 327eb818..94ac6980 100644 --- a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb +++ b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb @@ -157,25 +157,25 @@ " mkdir -p \"$(dirname \"$RAW_DATA_FILE\")\" \"$MCORE_WEIGHTS_DIR\" \"$OUTPUT_DIR\"\n", " if [ ! -s \"$RAW_DATA_FILE\" ]; then\n", " cat >\"$RAW_DATA_FILE\" <<'JSONL'\n", - " {\"instruction\":\"Who are you?\",\"input\":\"\",\"output\":\"I am XiaoLing, an AI assistant from Alauda AI Platform.\",\"system\":\"\"}\n", - " {\"instruction\":\"What is Alauda AI Platform?\",\"input\":\"\",\"output\":\"Alauda AI Platform helps teams build, train, and serve AI workloads on Kubernetes.\",\"system\":\"\"}\n", - " JSONL\n", + " {\"instruction\":\"Who are you?\",\"input\":\"\",\"output\":\"I am XiaoLing, an AI assistant from Alauda AI Platform.\",\"system\":\"\"}\n", + " {\"instruction\":\"What is Alauda AI Platform?\",\"input\":\"\",\"output\":\"Alauda AI Platform helps teams build, train, and serve AI workloads on Kubernetes.\",\"system\":\"\"}\n", + " JSONL\n", " fi\n", "\n", " python - <<'PYCHECK'\n", - " import importlib.metadata as md\n", - " import importlib.util\n", - " import torch\n", - " import torch_npu\n", - " for mod in [\"torch\", \"torch_npu\", \"mindspeed\", \"mindspeed_llm\"]:\n", - " assert importlib.util.find_spec(mod), f\"missing {mod}\"\n", - " print(\"torch:\", torch.__version__)\n", - " 
print(\"torch_npu:\", torch_npu.__version__)\n", - " print(\"mindspeed:\", md.version(\"mindspeed\"))\n", - " print(\"mindspeed_llm:\", md.version(\"mindspeed-llm\"))\n", - " print(\"npu_count:\", torch.npu.device_count())\n", - " assert torch.npu.is_available(), \"NPU is not available\"\n", - " PYCHECK\n", + " import importlib.metadata as md\n", + " import importlib.util\n", + " import torch\n", + " import torch_npu\n", + " for mod in [\"torch\", \"torch_npu\", \"mindspeed\", \"mindspeed_llm\"]:\n", + " assert importlib.util.find_spec(mod), f\"missing {mod}\"\n", + " print(\"torch:\", torch.__version__)\n", + " print(\"torch_npu:\", torch_npu.__version__)\n", + " print(\"mindspeed:\", md.version(\"mindspeed\"))\n", + " print(\"mindspeed_llm:\", md.version(\"mindspeed-llm\"))\n", + " print(\"npu_count:\", torch.npu.device_count())\n", + " assert torch.npu.is_available(), \"NPU is not available\"\n", + " PYCHECK\n", "\n", " python -m mindspeed_llm.tasks.checkpoint.convert \\\n", " --load-model-type hf \\\n",