From 1e401092d8ccf5820c0fb81b9cbef665dde85fae Mon Sep 17 00:00:00 2001 From: Wu Yi Date: Mon, 27 Apr 2026 17:36:27 +0800 Subject: [PATCH 1/3] add trainerv2 with mindspeed --- ...e-tune-with-trainer-v2-mindspeed-npu.ipynb | 503 ++++++++++++++++++ .../how_to/fine-tune-with-trainer-v2.mdx | 13 + 2 files changed, 516 insertions(+) create mode 100644 docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb diff --git a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb new file mode 100644 index 00000000..c162fb6b --- /dev/null +++ b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb @@ -0,0 +1,503 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Fine-Tuning Qwen3 on Ascend NPUs with Kubeflow Trainer v2 and MindSpeed-LLM\n", + "\n", + "This notebook shows how to run a **Kubeflow Trainer v2** `TrainJob` for Qwen3 fine-tuning on Huawei Ascend NPUs using **MindSpeed-LLM**.\n", + "\n", + "The flow is intentionally close to `qwen3_finetune_verify.ipynb`, but moves the work into a reusable Trainer v2 `TrainingRuntime`:\n", + "\n", + "1. Use the pre-built MindSpeed-LLM NPU runtime image.\n", + "2. Create a `TrainingRuntime` that converts Hugging Face weights, preprocesses Alpaca-format data, and launches MindSpeed-LLM SFT.\n", + "3. Submit a `TrainJob` that mounts the shared model PVC and requests Ascend resources.\n", + "4. Monitor the Trainer v2 job and logs.\n", + "\n", + "The example defaults are smoke-test settings for Qwen3-0.6B: `TRAIN_ITERS=1`, `SEQ_LENGTH=128`, `TP=1`, `PP=1`, and one Ascend 910B4 device. Increase these values for production runs." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "| Requirement | Example used in this notebook |\n", + "|---|---|\n", + "| Kubeflow Trainer v2 | `trainer.kubeflow.org/v1alpha1` |\n", + "| Namespace | `kubeflow-admin-cpaas-io` |\n", + "| Ascend scheduler/runtime | `schedulerName: hami-scheduler`, `runtimeClassName: ascend` |\n", + "| Shared model PVC | `team-model-cache-pvc` mounted at `/mnt/models` |\n", + "| Base model | `/mnt/models/Qwen3-0.6B` |\n", + "| Accelerator resource keys | `huawei.com/Ascend910B4` and `huawei.com/Ascend910B4-memory` |\n", + "\n", + "For larger models, make sure `TP * PP <= NPU count` and that the model architecture arguments in the runtime match the model `config.json`. The provided arguments target Qwen3-0.6B." + ] + }, + { + "cell_type": "markdown", + "id": "893e3ce4", + "metadata": {}, + "source": [ + "## Step 1: Use the Pre-Built Runtime Image\n", + "\n", + "Use the pre-built CANN PyTorch workbench image. It includes the Ascend runtime dependencies and matching versions of `torch`, `torch_npu`, `mindspeed`, and `mindspeed_llm`.\n", + "\n", + "The runtime YAML uses the public image:\n", + "\n", + "```text\n", + "alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7\n", + "```\n", + "\n", + "\n", + "Important version rule: do not clone `MindSpeed-LLM` HEAD at runtime unless the image was built from the same source revision. The tested path uses installed package modules such as `python -m mindspeed_llm.tasks.checkpoint.convert`, which avoids repo/package drift." + ] + }, + { + "cell_type": "markdown", + "id": "6ea09abe", + "metadata": {}, + "source": [ + "## Step 2: Create the MindSpeed-LLM TrainingRuntime\n", + "\n", + "This runtime contains one `trainer` replicated job. 
It performs all work in a single NPU pod:\n", + "\n", + "- validates the MindSpeed/PyTorch/NPU environment;\n", + "- converts HF weights to Megatron format with `mindspeed_llm.tasks.checkpoint.convert`;\n", + "- preprocesses Alpaca-format data with `mindspeed_llm.core.datasets.dataset_preprocess`;\n", + "- launches SFT with `torchrun -m mindspeed_llm.tasks.posttrain.launcher`.\n", + "\n", + "The runtime creates a tiny built-in JSONL dataset if `RAW_DATA_FILE` does not already exist. For real training, mount or generate your dataset and override `RAW_DATA_FILE`, `TRAIN_ITERS`, and `SEQ_LENGTH` in the `TrainJob`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a457750b", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile kf-trainingruntime-mindspeed-npu.yaml\n", + "apiVersion: trainer.kubeflow.org/v1alpha1\n", + "kind: TrainingRuntime\n", + "metadata:\n", + " name: mindspeed-llm-qwen3-npu-runtime\n", + " namespace: kubeflow-admin-cpaas-io\n", + " labels:\n", + " trainer.kubeflow.org/framework: torch\n", + "spec:\n", + " mlPolicy:\n", + " numNodes: 1\n", + " torch:\n", + " numProcPerNode: auto\n", + " template:\n", + " spec:\n", + " replicatedJobs:\n", + " - name: trainer\n", + " template:\n", + " metadata:\n", + " labels:\n", + " trainer.kubeflow.org/trainjob-ancestor-step: trainer\n", + " spec:\n", + " backoffLimit: 0\n", + " template:\n", + " spec:\n", + " schedulerName: hami-scheduler\n", + " runtimeClassName: ascend\n", + " securityContext:\n", + " runAsNonRoot: true\n", + " runAsUser: 1001\n", + " runAsGroup: 0\n", + " fsGroup: 1000\n", + " volumes:\n", + " - name: workspace\n", + " emptyDir: {}\n", + " - name: dshm\n", + " emptyDir:\n", + " medium: Memory\n", + " sizeLimit: 4Gi\n", + " containers:\n", + " - name: node\n", + " image: alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7\n", + " command: [\"bash\", \"-lc\"]\n", + " args:\n", + " - |\n", + " set -o pipefail\n", + "\n", + " # The Ascend env 
scripts may reference unset shell variables or probe\n", + " # optional libraries. Source them before enabling set -e.\n", + " set +e\n", + " for f in /usr/local/Ascend/cann/set_env.sh /usr/local/Ascend/ascend-toolkit/set_env.sh /usr/local/Ascend/nnal/atb/set_env.sh; do\n", + " [ -f \"$f\" ] && source \"$f\"\n", + " done\n", + " set -e\n", + "\n", + " export CUDA_DEVICE_MAX_CONNECTIONS=1\n", + " export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True\n", + " export ASCEND_PROCESS_LOG_PATH=/mnt/workspace/ascendlog\n", + " mkdir -p \"$ASCEND_PROCESS_LOG_PATH\"\n", + "\n", + " WORK_DIR=${WORK_DIR:-/mnt/workspace/qwen3-0.6b-mindspeed}\n", + " HF_MODEL_DIR=${HF_MODEL_DIR:-/mnt/models/Qwen3-0.6B}\n", + " RAW_DATA_FILE=${RAW_DATA_FILE:-${WORK_DIR}/data/alpaca_sample.jsonl}\n", + " PROCESSED_DATA_PREFIX=${PROCESSED_DATA_PREFIX:-${WORK_DIR}/data/alpaca}\n", + " MCORE_WEIGHTS_DIR=${MCORE_WEIGHTS_DIR:-${WORK_DIR}/model_weights/qwen3_mcore_tp${TP:-1}_pp${PP:-1}}\n", + " OUTPUT_DIR=${OUTPUT_DIR:-${WORK_DIR}/output/qwen3_0_6b_finetuned}\n", + "\n", + " TP=${TP:-1}\n", + " PP=${PP:-1}\n", + " SEQ_LENGTH=${SEQ_LENGTH:-128}\n", + " TRAIN_ITERS=${TRAIN_ITERS:-1}\n", + " MBS=${MBS:-1}\n", + " LR=${LR:-1.25e-6}\n", + " MIN_LR=${MIN_LR:-1.25e-7}\n", + "\n", + " mkdir -p \"$(dirname \"$RAW_DATA_FILE\")\" \"$MCORE_WEIGHTS_DIR\" \"$OUTPUT_DIR\"\n", + " if [ ! 
-s \"$RAW_DATA_FILE\" ]; then\n", + " cat >\"$RAW_DATA_FILE\" <<'JSONL'\n", + "{\"instruction\":\"Who are you?\",\"input\":\"\",\"output\":\"I am XiaoLing, an AI assistant from Alauda AI Platform.\",\"system\":\"\"}\n", + "{\"instruction\":\"What is Alauda AI Platform?\",\"input\":\"\",\"output\":\"Alauda AI Platform helps teams build, train, and serve AI workloads on Kubernetes.\",\"system\":\"\"}\n", + "JSONL\n", + " fi\n", + "\n", + " python - <<'PYCHECK'\n", + "import importlib.metadata as md\n", + "import importlib.util\n", + "import torch\n", + "import torch_npu\n", + "for mod in [\"torch\", \"torch_npu\", \"mindspeed\", \"mindspeed_llm\"]:\n", + " assert importlib.util.find_spec(mod), f\"missing {mod}\"\n", + "print(\"torch:\", torch.__version__)\n", + "print(\"torch_npu:\", torch_npu.__version__)\n", + "print(\"mindspeed:\", md.version(\"mindspeed\"))\n", + "print(\"mindspeed_llm:\", md.version(\"mindspeed-llm\"))\n", + "print(\"npu_count:\", torch.npu.device_count())\n", + "assert torch.npu.is_available(), \"NPU is not available\"\n", + "PYCHECK\n", + "\n", + " python -m mindspeed_llm.tasks.checkpoint.convert \\\n", + " --load-model-type hf \\\n", + " --save-model-type mg \\\n", + " --target-tensor-parallel-size \"$TP\" \\\n", + " --target-pipeline-parallel-size \"$PP\" \\\n", + " --load-dir \"$HF_MODEL_DIR\" \\\n", + " --save-dir \"$MCORE_WEIGHTS_DIR\" \\\n", + " --model-type-hf qwen3\n", + "\n", + " python -m mindspeed_llm.core.datasets.dataset_preprocess \\\n", + " --input \"$RAW_DATA_FILE\" \\\n", + " --tokenizer-name-or-path \"$HF_MODEL_DIR\" \\\n", + " --output-prefix \"$PROCESSED_DATA_PREFIX\" \\\n", + " --handler-name AlpacaStyleInstructionHandler \\\n", + " --tokenizer-type PretrainedFromHF \\\n", + " --workers 1 \\\n", + " --log-interval 1 \\\n", + " --enable-thinking none \\\n", + " --prompt-type qwen3\n", + "\n", + " NPROC=$(python -c 'import torch, torch_npu; print(torch.npu.device_count())')\n", + " DP=$(( NPROC / (TP * PP) ))\n", + " GBS=$(( 
DP * MBS ))\n", + " [ \"$GBS\" -ge 1 ] || { echo \"Invalid parallelism: NPROC=$NPROC TP=$TP PP=$PP MBS=$MBS\"; exit 1; }\n", + "\n", + " torchrun \\\n", + " --nproc_per_node \"$NPROC\" \\\n", + " --nnodes 1 \\\n", + " --node_rank 0 \\\n", + " --master_addr localhost \\\n", + " --master_port 6000 \\\n", + " -m mindspeed_llm.tasks.posttrain.launcher \\\n", + " --use-mcore-models \\\n", + " --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \\\n", + " --kv-channels 128 \\\n", + " --qk-layernorm \\\n", + " --tensor-model-parallel-size \"$TP\" \\\n", + " --pipeline-model-parallel-size \"$PP\" \\\n", + " --sequence-parallel \\\n", + " --use-distributed-optimizer \\\n", + " --use-flash-attn \\\n", + " --num-layers 28 \\\n", + " --hidden-size 1024 \\\n", + " --num-attention-heads 16 \\\n", + " --ffn-hidden-size 3072 \\\n", + " --max-position-embeddings 40960 \\\n", + " --seq-length \"$SEQ_LENGTH\" \\\n", + " --make-vocab-size-divisible-by 1 \\\n", + " --padded-vocab-size 151936 \\\n", + " --rotary-base 1000000 \\\n", + " --use-rotary-position-embeddings \\\n", + " --micro-batch-size \"$MBS\" \\\n", + " --global-batch-size \"$GBS\" \\\n", + " --disable-bias-linear \\\n", + " --swiglu \\\n", + " --train-iters \"$TRAIN_ITERS\" \\\n", + " --tokenizer-type PretrainedFromHF \\\n", + " --tokenizer-name-or-path \"$HF_MODEL_DIR\" \\\n", + " --normalization RMSNorm \\\n", + " --position-embedding-type rope \\\n", + " --norm-epsilon 1e-6 \\\n", + " --hidden-dropout 0 \\\n", + " --attention-dropout 0 \\\n", + " --no-gradient-accumulation-fusion \\\n", + " --attention-softmax-in-fp32 \\\n", + " --exit-on-missing-checkpoint \\\n", + " --no-masked-softmax-fusion \\\n", + " --group-query-attention \\\n", + " --num-query-groups 8 \\\n", + " --min-lr \"$MIN_LR\" \\\n", + " --lr \"$LR\" \\\n", + " --weight-decay 1e-1 \\\n", + " --clip-grad 1.0 \\\n", + " --adam-beta1 0.9 \\\n", + " --adam-beta2 0.95 \\\n", + " --initial-loss-scale 4096 \\\n", + " --no-load-optim \\\n", + " 
--no-load-rng \\\n", + " --seed 42 \\\n", + " --bf16 \\\n", + " --data-path \"$PROCESSED_DATA_PREFIX\" \\\n", + " --split 100,0,0 \\\n", + " --log-interval 1 \\\n", + " --save-interval \"$TRAIN_ITERS\" \\\n", + " --eval-interval \"$TRAIN_ITERS\" \\\n", + " --eval-iters 0 \\\n", + " --finetune \\\n", + " --stage sft \\\n", + " --is-instruction-dataset \\\n", + " --prompt-type qwen3 \\\n", + " --no-pad-to-seq-lengths \\\n", + " --distributed-backend nccl \\\n", + " --load \"$MCORE_WEIGHTS_DIR\" \\\n", + " --save \"$OUTPUT_DIR\" \\\n", + " --transformer-impl local \\\n", + " --no-save-optim \\\n", + " --no-save-rng\n", + " env:\n", + " - name: WORK_DIR\n", + " value: /mnt/workspace/qwen3-0.6b-mindspeed\n", + " - name: HF_MODEL_DIR\n", + " value: /mnt/models/Qwen3-0.6B\n", + " - name: TP\n", + " value: \"1\"\n", + " - name: PP\n", + " value: \"1\"\n", + " - name: SEQ_LENGTH\n", + " value: \"128\"\n", + " - name: TRAIN_ITERS\n", + " value: \"1\"\n", + " - name: MBS\n", + " value: \"1\"\n", + " securityContext:\n", + " allowPrivilegeEscalation: true\n", + " capabilities:\n", + " add: [\"IPC_LOCK\", \"SYS_PTRACE\"]\n", + " runAsNonRoot: true\n", + " runAsUser: 1001\n", + " runAsGroup: 0\n", + " seccompProfile:\n", + " type: RuntimeDefault\n", + " volumeMounts:\n", + " - name: workspace\n", + " mountPath: /mnt/workspace\n", + " - name: dshm\n", + " mountPath: /dev/shm\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c32621a4", + "metadata": {}, + "outputs": [], + "source": [ + "kubectl apply -f kf-trainingruntime-mindspeed-npu.yaml\n", + "kubectl get trainingruntime mindspeed-llm-qwen3-npu-runtime -n kubeflow-admin-cpaas-io" + ] + }, + { + "cell_type": "markdown", + "id": "48bf871f", + "metadata": {}, + "source": [ + "## Step 3: Submit a TrainJob\n", + "\n", + "The `TrainJob` mounts the shared PVC at `/mnt/models` and requests one Ascend 910B4 device. 
The PVC should already contain the Hugging Face Qwen3 model directory used by `HF_MODEL_DIR`.\n", + "\n", + "If your cluster does not expose `huawei.com/Ascend910B4-memory`, remove that resource or replace it with the memory key used by your Ascend device plugin." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08996076", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile kf-trainjob-mindspeed-npu.yaml\n", + "apiVersion: trainer.kubeflow.org/v1alpha1\n", + "kind: TrainJob\n", + "metadata:\n", + " generateName: trainjob-mindspeed-qwen3-\n", + " namespace: kubeflow-admin-cpaas-io\n", + " # If Kueue is enabled, uncomment and set your LocalQueue name.\n", + " # labels:\n", + " # kueue.x-k8s.io/queue-name: local-queue\n", + "spec:\n", + " runtimeRef:\n", + " apiGroup: trainer.kubeflow.org\n", + " kind: TrainingRuntime\n", + " name: mindspeed-llm-qwen3-npu-runtime\n", + " podTemplateOverrides:\n", + " - targetJobs:\n", + " - name: trainer\n", + " spec:\n", + " volumes:\n", + " - name: models-cache\n", + " persistentVolumeClaim:\n", + " claimName: team-model-cache-pvc\n", + " containers:\n", + " - name: node\n", + " volumeMounts:\n", + " - name: models-cache\n", + " mountPath: /mnt/models\n", + " trainer:\n", + " numNodes: 1\n", + " env:\n", + " - name: HF_MODEL_DIR\n", + " value: /mnt/models/Qwen3-0.6B\n", + " - name: TRAIN_ITERS\n", + " value: \"1\"\n", + " - name: SEQ_LENGTH\n", + " value: \"128\"\n", + " - name: TP\n", + " value: \"1\"\n", + " - name: PP\n", + " value: \"1\"\n", + " resourcesPerNode:\n", + " requests:\n", + " cpu: \"4\"\n", + " memory: \"8Gi\"\n", + " huawei.com/Ascend910B4: \"1\"\n", + " huawei.com/Ascend910B4-memory: \"32G\"\n", + " limits:\n", + " cpu: \"8\"\n", + " memory: \"32Gi\"\n", + " huawei.com/Ascend910B4: \"1\"\n", + " huawei.com/Ascend910B4-memory: \"32G\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc0c6148", + "metadata": {}, + "outputs": [], + "source": [ + "# Use 
create instead of apply because the TrainJob uses generateName.\n",
+    "kubectl create -f kf-trainjob-mindspeed-npu.yaml"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eed626b7",
+   "metadata": {},
+   "source": [
+    "## Step 4: Monitor the Job\n",
+    "\n",
+    "Trainer v2 creates a JobSet and one trainer pod for this single-node example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d328d4b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "kubectl get trainjobs -n kubeflow-admin-cpaas-io\n",
+    "kubectl get jobsets,jobs,pods -n kubeflow-admin-cpaas-io | grep trainjob-mindspeed-qwen3 || true"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2219b86a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Replace <trainjob-name> and <trainer-pod-name> with the generated names.\n",
+    "kubectl describe trainjob <trainjob-name> -n kubeflow-admin-cpaas-io\n",
+    "kubectl logs -f <trainer-pod-name> -n kubeflow-admin-cpaas-io"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9140ccae",
+   "metadata": {},
+   "source": [
+    "## Step 5: Production Adjustments\n",
+    "\n",
+    "For a real run, change these values before submitting the `TrainJob`:\n",
+    "\n",
+    "| Setting | Where | Guidance |\n",
+    "|---|---|---|\n",
+    "| `TRAIN_ITERS` | `spec.trainer.env` | Increase from `1` to the required training length. |\n",
+    "| `SEQ_LENGTH` | `spec.trainer.env` | Use `4096` or your target context length if memory allows. |\n",
+    "| `TP` / `PP` | runtime env or `TrainJob` env | Match model size and available NPU count. |\n",
+    "| model architecture args | `TrainingRuntime` command | Must match `config.json`; this notebook is for Qwen3-0.6B. |\n",
+    "| dataset | `RAW_DATA_FILE` or PVC content | Use Alpaca JSONL with `instruction`, `input`, `output`, and optional `system`. |\n",
+    "| resources | `resourcesPerNode` | Use the exact Ascend resource keys and memory slices exposed by your cluster. 
|\n",
+    "\n",
+    "For multi-node training, set `spec.trainer.numNodes > 1` only after validating HCCN/device IPs, link state, cross-node reachability, and the HCCL environment for your cluster."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6bdc1b8d",
+   "metadata": {},
+   "source": [
+    "## Step 6: Cleanup\n",
+    "\n",
+    "Delete generated TrainJobs when they are no longer needed. Delete the runtime only if no other experiments reference it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e5e9591",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "kubectl delete trainjob <trainjob-name> -n kubeflow-admin-cpaas-io\n",
+    "kubectl delete trainingruntime mindspeed-llm-qwen3-npu-runtime -n kubeflow-admin-cpaas-io"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aa759c18",
+   "metadata": {},
+   "source": [
+    "## Validation Notes from NPU Dev\n",
+    "\n",
+    "The NPU dev cluster validated the Trainer v2 smoke path with the same PyTorch CANN workbench image family: the job scheduled through `hami-scheduler`, used `runtimeClassName: ascend`, imported `torch`, `torch_npu`, `mindspeed`, and `mindspeed_llm`, and saw one allocated Ascend 910B4 device.\n",
+    "\n",
+    "The notebook intentionally uses installed package module entrypoints instead of cloning `MindSpeed-LLM` HEAD at runtime. In NPU dev, cloning HEAD produced a mismatch with installed `mindspeed 0.12.1`. If HAMi reports `CardInsufficientMemory`, wait for other NPU workloads to finish or reduce the requested `huawei.com/Ascend910B4-memory` value according to your cluster policy. If the cluster cannot reach Docker Hub directly, mirror or preload `alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7` into a registry reachable from NPU nodes."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "pygments_lexer": "ipython3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx index 6ab6649f..944dd6de 100644 --- a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx +++ b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.mdx @@ -79,6 +79,19 @@ Use our pre-built image `alaudadockerhub/fine_tune_with_llamafactory:v0.1.11` or 1. Download the [notebook](https://github.com/alauda/aml-docs/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2.ipynb) to your current workbench in **Alauda AI**, create a new workbench if you don't have one, and open the notebook. 2. Follow the instructions in the notebook to create a `TrainingRuntime` and submit a `TrainJob` for fine-tuning a LLaMA-Factory model. The notebook includes example configurations for using the `team-model-cache-pvc` shared PVC and Git credentials. +## Fine-Tuning on Ascend NPUs with MindSpeed-LLM + +For Huawei Ascend NPU clusters, use the [MindSpeed-LLM NPU notebook](https://github.com/alauda/aml-docs/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb) instead of the LlamaFactory GPU notebook. + +The MindSpeed-LLM notebook shows how to: + +- Use the pre-built `alaudadockerhub/alauda-workbench-jupyter-pytorch-cann-py312-ubi9:v0.1.7` image. +- Create a Trainer v2 `TrainingRuntime` with `runtimeClassName: ascend` and `schedulerName: hami-scheduler`. +- Submit a Qwen3 fine-tuning `TrainJob` that requests Ascend resources such as `huawei.com/Ascend910B4`. +- Run the MindSpeed-LLM workflow: Hugging Face checkpoint conversion, dataset preprocessing, and SFT training. 
+ +Use this notebook when your cluster provides Ascend NPUs and your model training image must include `torch_npu`, `mindspeed`, and `mindspeed_llm`. + ## Scheduling with Kueue [Kueue](https://kueue.sigs.k8s.io/) provides job queuing, quota management, and fair scheduling for Kubernetes workloads. When Kueue is installed in your cluster, TrainJobs are held in a **suspended** state until Kueue admits them based on available quota. From 63eb7f5892c48fabf8d281999bfd2afe58e0be58 Mon Sep 17 00:00:00 2001 From: Wu Yi Date: Thu, 30 Apr 2026 18:29:17 +0800 Subject: [PATCH 2/3] update mindspeed trainingruntime yaml --- ...e-tune-with-trainer-v2-mindspeed-npu.ipynb | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb index c162fb6b..327eb818 100644 --- a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb +++ b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb @@ -157,25 +157,25 @@ " mkdir -p \"$(dirname \"$RAW_DATA_FILE\")\" \"$MCORE_WEIGHTS_DIR\" \"$OUTPUT_DIR\"\n", " if [ ! 
-s \"$RAW_DATA_FILE\" ]; then\n", " cat >\"$RAW_DATA_FILE\" <<'JSONL'\n", - "{\"instruction\":\"Who are you?\",\"input\":\"\",\"output\":\"I am XiaoLing, an AI assistant from Alauda AI Platform.\",\"system\":\"\"}\n", - "{\"instruction\":\"What is Alauda AI Platform?\",\"input\":\"\",\"output\":\"Alauda AI Platform helps teams build, train, and serve AI workloads on Kubernetes.\",\"system\":\"\"}\n", - "JSONL\n", + " {\"instruction\":\"Who are you?\",\"input\":\"\",\"output\":\"I am XiaoLing, an AI assistant from Alauda AI Platform.\",\"system\":\"\"}\n", + " {\"instruction\":\"What is Alauda AI Platform?\",\"input\":\"\",\"output\":\"Alauda AI Platform helps teams build, train, and serve AI workloads on Kubernetes.\",\"system\":\"\"}\n", + " JSONL\n", " fi\n", "\n", " python - <<'PYCHECK'\n", - "import importlib.metadata as md\n", - "import importlib.util\n", - "import torch\n", - "import torch_npu\n", - "for mod in [\"torch\", \"torch_npu\", \"mindspeed\", \"mindspeed_llm\"]:\n", - " assert importlib.util.find_spec(mod), f\"missing {mod}\"\n", - "print(\"torch:\", torch.__version__)\n", - "print(\"torch_npu:\", torch_npu.__version__)\n", - "print(\"mindspeed:\", md.version(\"mindspeed\"))\n", - "print(\"mindspeed_llm:\", md.version(\"mindspeed-llm\"))\n", - "print(\"npu_count:\", torch.npu.device_count())\n", - "assert torch.npu.is_available(), \"NPU is not available\"\n", - "PYCHECK\n", + " import importlib.metadata as md\n", + " import importlib.util\n", + " import torch\n", + " import torch_npu\n", + " for mod in [\"torch\", \"torch_npu\", \"mindspeed\", \"mindspeed_llm\"]:\n", + " assert importlib.util.find_spec(mod), f\"missing {mod}\"\n", + " print(\"torch:\", torch.__version__)\n", + " print(\"torch_npu:\", torch_npu.__version__)\n", + " print(\"mindspeed:\", md.version(\"mindspeed\"))\n", + " print(\"mindspeed_llm:\", md.version(\"mindspeed-llm\"))\n", + " print(\"npu_count:\", torch.npu.device_count())\n", + " assert torch.npu.is_available(), \"NPU is 
not available\"\n", + " PYCHECK\n", "\n", " python -m mindspeed_llm.tasks.checkpoint.convert \\\n", " --load-model-type hf \\\n", From 45b26835f1616d9362eb0cb9f08d4d205db969b6 Mon Sep 17 00:00:00 2001 From: Wu Yi Date: Thu, 30 Apr 2026 18:33:42 +0800 Subject: [PATCH 3/3] update --- ...e-tune-with-trainer-v2-mindspeed-npu.ipynb | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb index 327eb818..94ac6980 100644 --- a/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb +++ b/docs/en/kubeflow/how_to/fine-tune-with-trainer-v2-mindspeed-npu.ipynb @@ -157,25 +157,25 @@ " mkdir -p \"$(dirname \"$RAW_DATA_FILE\")\" \"$MCORE_WEIGHTS_DIR\" \"$OUTPUT_DIR\"\n", " if [ ! -s \"$RAW_DATA_FILE\" ]; then\n", " cat >\"$RAW_DATA_FILE\" <<'JSONL'\n", - " {\"instruction\":\"Who are you?\",\"input\":\"\",\"output\":\"I am XiaoLing, an AI assistant from Alauda AI Platform.\",\"system\":\"\"}\n", - " {\"instruction\":\"What is Alauda AI Platform?\",\"input\":\"\",\"output\":\"Alauda AI Platform helps teams build, train, and serve AI workloads on Kubernetes.\",\"system\":\"\"}\n", - " JSONL\n", + " {\"instruction\":\"Who are you?\",\"input\":\"\",\"output\":\"I am XiaoLing, an AI assistant from Alauda AI Platform.\",\"system\":\"\"}\n", + " {\"instruction\":\"What is Alauda AI Platform?\",\"input\":\"\",\"output\":\"Alauda AI Platform helps teams build, train, and serve AI workloads on Kubernetes.\",\"system\":\"\"}\n", + " JSONL\n", " fi\n", "\n", " python - <<'PYCHECK'\n", - " import importlib.metadata as md\n", - " import importlib.util\n", - " import torch\n", - " import torch_npu\n", - " for mod in [\"torch\", \"torch_npu\", \"mindspeed\", \"mindspeed_llm\"]:\n", - " assert importlib.util.find_spec(mod), f\"missing {mod}\"\n", - " print(\"torch:\", torch.__version__)\n", - " 
print(\"torch_npu:\", torch_npu.__version__)\n", - " print(\"mindspeed:\", md.version(\"mindspeed\"))\n", - " print(\"mindspeed_llm:\", md.version(\"mindspeed-llm\"))\n", - " print(\"npu_count:\", torch.npu.device_count())\n", - " assert torch.npu.is_available(), \"NPU is not available\"\n", - " PYCHECK\n", + " import importlib.metadata as md\n", + " import importlib.util\n", + " import torch\n", + " import torch_npu\n", + " for mod in [\"torch\", \"torch_npu\", \"mindspeed\", \"mindspeed_llm\"]:\n", + " assert importlib.util.find_spec(mod), f\"missing {mod}\"\n", + " print(\"torch:\", torch.__version__)\n", + " print(\"torch_npu:\", torch_npu.__version__)\n", + " print(\"mindspeed:\", md.version(\"mindspeed\"))\n", + " print(\"mindspeed_llm:\", md.version(\"mindspeed-llm\"))\n", + " print(\"npu_count:\", torch.npu.device_count())\n", + " assert torch.npu.is_available(), \"NPU is not available\"\n", + " PYCHECK\n", "\n", " python -m mindspeed_llm.tasks.checkpoint.convert \\\n", " --load-model-type hf \\\n",