diff --git a/.gitignore b/.gitignore index fe07d484..5d33b157 100644 --- a/.gitignore +++ b/.gitignore @@ -13,3 +13,4 @@ .claude CLAUDE.md +.omx/ diff --git a/docs/en/dify/install.mdx b/docs/en/dify/install.mdx index a2cc1922..ff3210d3 100644 --- a/docs/en/dify/install.mdx +++ b/docs/en/dify/install.mdx @@ -161,9 +161,7 @@ ingress: - ``` - - -### Storage (S3 and PVC) +### Storage (S3 and PVC) \{#storage-s3-and-pvc} **PVC (default):** API and plugin daemon each use a PVC when enabled. Override storage class and size as needed. diff --git a/docs/en/installation/ai-cluster.mdx b/docs/en/installation/ai-cluster.mdx index 8b5e1d63..8734563b 100644 --- a/docs/en/installation/ai-cluster.mdx +++ b/docs/en/installation/ai-cluster.mdx @@ -16,9 +16,7 @@ If your use case requires `Knative` functionality, which enables advanced featur [Recommended deployment option](https://kserve.github.io/website/docs/admin-guide/overview#generative-inference): For generative inference workloads, the **Standard** approach (previously known as RawKubernetes Deployment) is recommended as it provides the most control over resource allocation and scaling. ::: - - -## Downloading +## Downloading \{#downloading} **Operator Components**: @@ -38,9 +36,7 @@ If your use case requires `Knative` functionality, which enables advanced featur You can download the app named 'Alauda AI' and 'Knative Operator' from the Marketplace on the Customer Portal website. ::: - - -## Uploading +## Uploading \{#uploading} We need to upload both `Alauda AI` and `Knative Operator` to the cluster where Alauda AI is to be used. @@ -163,9 +159,7 @@ Confirm that the **Alauda AI** tile shows one of the following states: For detailed installation steps, see [Install KServe](../kserve/install.mdx) in Alauda Build of KServe. - - -## Enabling Knative Functionality +## Enabling Knative Functionality \{#enabling-knative-functionality} Knative functionality is an optional capability that requires an additional operator and instance to be deployed. diff --git a/docs/en/kserve/install.mdx b/docs/en/kserve/install.mdx index 36611f59..9c79b790 100644 --- a/docs/en/kserve/install.mdx +++ b/docs/en/kserve/install.mdx @@ -26,9 +26,7 @@ Before installing **Alauda Build of KServe**, you need to ensure the following d 1. **Required Dependencies**: All required dependencies must be installed before installing Alauda Build of KServe. 2. **GIE Integration**: GIE is bundled and enabled by default. If your environment already has GIE installed separately, set `gie.builtIn` to `false` in the operator configuration to disable the built-in installation. - - -## Upload Operator +## Upload Operator \{#upload-operator} Download the Alauda Build of KServe Operator installation file (e.g., `kserve-operator.ALL.xxxx.tgz`). @@ -137,9 +135,7 @@ kubectl get kserve default-kserve -n kserve-operator The instance is ready when the status shows `DEPLOYED: True`. - - -### Envoy Gateway Configuration +### Envoy Gateway Configuration \{#envoy-gateway-configuration} | Field | Description | Default | |-------|-------------|---------| @@ -148,18 +144,14 @@ The instance is ready when the status shows `DEPLOYED: True`. | `preset.envoy_gateway.create_instance` | Create an Envoy Gateway instance to manage inference traffic with bundled extensions. | `true` | | `preset.envoy_gateway.instance_name` | Name of the Envoy Gateway instance to create. 
| `aieg` | - - -### Envoy AI Gateway Configuration +### Envoy AI Gateway Configuration \{#envoy-ai-gateway-configuration} | Field | Description | Default | |-------|-------------|---------| | `preset.envoy_ai_gateway.service` | Kubernetes service name for Envoy AI Gateway. | `ai-gateway-controller` | | `preset.envoy_ai_gateway.port` | Port number used by Envoy AI Gateway. | `1063` | - - -### KServe Gateway Configuration +### KServe Gateway Configuration \{#kserve-gateway-configuration} | Field | Description | Default | |-------|-------------|---------| @@ -169,9 +161,7 @@ The instance is ready when the status shows `DEPLOYED: True`. | `preset.kserve_gateway.gateway_class` | Optional custom GatewayClass name. If empty, derived as `{namespace}-{name}`. | `""` | | `preset.kserve_gateway.port` | Port number used by the KServe Gateway. | `80` | - - -### GIE (gateway-api-inference-extension) Configuration +### GIE (gateway-api-inference-extension) Configuration \{#gie-gateway-api-inference-extension-configuration} | Field | Description | Default | |-------|-------------|---------| diff --git a/docs/en/label_studio/install.mdx b/docs/en/label_studio/install.mdx index 481a5550..d53b5756 100644 --- a/docs/en/label_studio/install.mdx +++ b/docs/en/label_studio/install.mdx @@ -185,9 +185,7 @@ redirectURIs: ### 4. Configure User Management - - -#### 4.1 Disable User Registration +#### 4.1 Disable User Registration \{#41-disable-user-registration} User registration can be disabled by setting the following fields: diff --git a/docs/en/llama_stack/quickstart.mdx b/docs/en/llama_stack/quickstart.mdx index 9b36acc4..3ecd4c41 100644 --- a/docs/en/llama_stack/quickstart.mdx +++ b/docs/en/llama_stack/quickstart.mdx @@ -30,9 +30,7 @@ The notebook demonstrates: ## FAQ - - -### How to prepare Python 3.12 in Notebook +### How to prepare Python 3.12 in Notebook \{#how-to-prepare-python-312-in-notebook} 1. Download the pre-compiled Python installation package: diff --git a/docs/en/model_inference/inference_service/functions/inference_service.mdx b/docs/en/model_inference/inference_service/functions/inference_service.mdx index 5e699205..7d71f3c7 100644 --- a/docs/en/model_inference/inference_service/functions/inference_service.mdx +++ b/docs/en/model_inference/inference_service/functions/inference_service.mdx @@ -57,9 +57,7 @@ The core definition of the inference service feature is to deploy trained machin - Automatically generates Swagger documentation to facilitate user integration and invocation of inference services. - Provides real-time monitoring and alarm features to ensure stable service operation. - - -## Create inference service +## Create inference service \{#create-inference-service} diff --git a/docs/en/model_inference/inference_service/how_to/vllm_expert_parallel.mdx b/docs/en/model_inference/inference_service/how_to/vllm_expert_parallel.mdx index ec6c97ea..608fde42 100644 --- a/docs/en/model_inference/inference_service/how_to/vllm_expert_parallel.mdx +++ b/docs/en/model_inference/inference_service/how_to/vllm_expert_parallel.mdx @@ -213,9 +213,7 @@ Multi-node EP deployments require additional distributed runtime and networking This page focuses on the single-node configuration pattern. If you need multi-node EP, refer to the official vLLM guide and adapt the deployment model to your cluster topology and runtime environment. 
::: - - -## References +## References \{#references} - [Expert Parallel Deployment - vLLM](https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/) - [Data Parallel Deployment - vLLM](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/) diff --git a/docs/en/model_inference/inference_service/how_to/vllm_speculative_decoding.mdx b/docs/en/model_inference/inference_service/how_to/vllm_speculative_decoding.mdx new file mode 100644 index 00000000..c1240972 --- /dev/null +++ b/docs/en/model_inference/inference_service/how_to/vllm_speculative_decoding.mdx @@ -0,0 +1,629 @@ +--- +weight: 13 +i18n: + title: + en: Speculative Decoding for vLLM Inference Services + zh: 为 vLLM 推理服务启用 Speculative Decoding +--- + +# Speculative Decoding for vLLM Inference Services + +## Introduction + +Speculative decoding lets a vLLM server propose several tokens per decode step and verify them with a single forward pass of the target model, lowering per-token latency on interactive workloads without changing the output distribution. + +This page focuses on how to enable, configure, verify, and roll back speculative decoding for an `InferenceService` running on Alauda AI. For the upstream technique itself and the full list of methods supported by vLLM, see the [vLLM speculative decoding documentation](https://docs.vllm.ai/en/latest/features/speculative_decoding/). + +:::warning +Speculative decoding involves runtime-version-sensitive flags. The exact `--speculative-config` JSON keys, supported `method` values, and the metric names referenced below depend on the vLLM version inside your runtime image. Treat all snippets here as starting points and confirm against the vLLM version you ship. +::: + +## Before You Decide + +Speculative decoding helps when the **per-request decode loop** dominates end-to-end latency and the proposed tokens are accepted often enough to amortize the proposal overhead. + +It tends to help on: + +- Interactive chat / agent loops with relatively predictable continuations. +- Summarization, RAG answers, and code completion, where output overlaps the prompt. + +It can hurt or be neutral on: + +- High-temperature sampling, where acceptance rate collapses. +- High-QPS / batch-saturated services, where decode capacity is no longer idle. The vLLM team's 2024 V0-engine benchmarks reported **1.4×–1.8× slowdowns** on the same datasets at high QPS. The V1 engine schedules differently, so the magnitude may differ on your runtime, but the direction of the risk is the same. +- Very small target models, where the verification step is already cheap. + +Run a representative workload before committing speculative decoding as a default. See [Verify and Measure the Impact](#verify-and-measure-the-impact). + +## Methods Validated in This Guide on Alauda AI \{#methods-available-on-alauda-ai} + +The two methods below are the ones this guide covers and that have been exercised end-to-end on Alauda AI. vLLM upstream supports additional methods (for example MTP for models that ship multi-token-prediction heads, Medusa, MLP Speculator, Suffix, Draft Model), and those methods may also be usable on Alauda AI through the same `--speculative-config` flag. They are out of scope for this page, so refer to the upstream documentation and validate on your own setup before promoting to production. 
+ +| Method | What you provide | Trade-off | +| ------- | ----------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- | +| N-gram | Target model only | No extra weights, no training. Benefit depends on prompt-output token overlap. | +| EAGLE-3 | Target model **and** a matching EAGLE-3 draft head | Requires a draft head trained against the exact target model. Small additional GPU memory. | + +Notes: + +- vLLM upstream describes N-gram as "effective for use cases like summarization and question-answering, where there is a significant overlap between the prompt and the answer". +- vLLM upstream describes EAGLE-3 as "the current SOTA for speculative decoding algorithms" (snapshot from the latest features page; revisit per release). + +## Recommended Starting Points \{#recommended-starting-points} + +There is no single best method for every workload. The following are conservative starting points to reduce trial cost. Always validate against your own traffic before promoting to production. + +| If you have... | Start with | +| --------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- | +| A general chat / instruction model with an available EAGLE-3 head | EAGLE-3, with `num_speculative_tokens: 3` initially. | +| Heavy prompt-output overlap (RAG, summarization, code completion) and no EAGLE-3 head | N-gram, with `num_speculative_tokens: 5` initially. | +| None of the above | Defer enabling speculative decoding until one of the above conditions is met. | + +### Internal Validation Snapshot — N-gram + +The starting points above are **guidance, not guarantees**. The measurement below is one concrete data point from Alauda AI's internal lab, intended to help calibrate expectations on similar single-GPU serving setups. Your own model, GPU, runtime version, and traffic will produce different numbers — always benchmark before promoting to production. 
+ +- **Hardware:** NVIDIA A30 24 GB × 1 +- **Model:** Qwen3-8B (BF16, HuggingFace `Qwen/Qwen3-8B`) +- **Runtime:** vLLM 0.19.1 (V1 engine) +- **Request parameters:** `temperature=0`, `seed=42`, `max_tokens=1024`, `enable_thinking=false`, single concurrent request, 1 warmup discarded + 3 timed runs (median reported) + +**Baseline command (no spec decode):** + +```bash +python3 -m vllm.entrypoints.openai.api_server \ + --port 8080 \ + --served-model-name t-ng \ + --model /mnt/models \ + --gpu-memory-utilization 0.8 \ + --max-model-len 4096 \ + --max-num-seqs 8 \ + --seed 42 +``` + +**N-gram command (only differs by `--speculative-config`):** + +```bash +python3 -m vllm.entrypoints.openai.api_server \ + --port 8080 \ + --served-model-name t-ng \ + --model /mnt/models \ + --gpu-memory-utilization 0.8 \ + --max-model-len 4096 \ + --max-num-seqs 8 \ + --seed 42 \ + --speculative-config '{"method":"ngram","num_speculative_tokens":5,"prompt_lookup_max":4,"prompt_lookup_min":2}' +``` + +**Workloads:** + +- _code refactor (high prompt-output overlap):_ ask the model to add docstrings and type annotations to a 30-line Python class and return the full updated class +- _general chat (no prompt-output overlap):_ ask the model to explain a concept in ≥800 words + +**Results:** + +| Workload | Baseline tok/s | N-gram tok/s | Speedup | Wall delta | +| ------------------------ | -------------- | ------------ | ----------- | ---------- | +| Code refactor (high overlap) | 47.02 | 45.92 | **0.98×** | +524 ms | +| General chat (no overlap) | 47.13 | 39.94 | **0.85×** | +3914 ms | + +**Interpretation:** + +- On this single-GPU 8B setup, N-gram registered as a slight regression on the code-refactor workload and a clear ~15% regression on chat. The proposer's CPU work, the verification of five candidate tokens per step, and the fact that vLLM disables async scheduling under N-gram together cost more than the accepted tokens save. +- The acceptance rate for the high-overlap code workload is healthy (mean acceptance length ≈ 3 in earlier informal probes), but acceptance rate alone does not predict end-to-end speedup — the per-step overhead must be amortized against actual decode time of the target model. On a small target model on a single GPU, decode is already cheap and there is little room to amortize. +- The chat result confirms the [Caveats](#caveats-and-known-limitations) about workloads without prompt-output overlap. + +The same method on a larger target model (where each verify step costs more), with multi-GPU tensor parallelism, or under higher concurrency may behave very differently. Treat this snapshot as a reminder to measure, not as a verdict on N-gram itself. + +### Internal Validation Snapshot — EAGLE-3 + +The starting points above are **guidance, not guarantees**. The measurement below is one concrete data point from Alauda AI's internal lab, intended to help calibrate expectations on similar single-GPU EAGLE-3 setups. Your own model, GPU, runtime version, and traffic will produce different numbers — always benchmark before promoting to production. 
+ +- **Hardware:** NVIDIA A30 24 GB × 1 +- **Model:** Meta-Llama-3.1-8B-Instruct (BF16, HuggingFace `meta-llama/Meta-Llama-3.1-8B-Instruct`) with EAGLE-3 draft `yuhuili/EAGLE3-LLaMA3.1-Instruct-8B` +- **Runtime:** vLLM 0.19.1 (V1 engine) +- **Request parameters:** `temperature=0`, `seed=42`, `max_tokens=1024`, single concurrent request, 1 warmup discarded + 3 timed runs (median reported) + +**Baseline command (no spec decode):** + +```bash +python3 -m vllm.entrypoints.openai.api_server \ + --port 8080 \ + --served-model-name eagle \ + --model /mnt/models/Meta-Llama-3.1-8B-Instruct \ + --dtype auto \ + --gpu-memory-utilization 0.8 \ + --max-model-len 4096 \ + --max-num-seqs 8 \ + --seed 42 +``` + +**EAGLE-3 command (only differs by `--speculative-config`):** + +```bash +python3 -m vllm.entrypoints.openai.api_server \ + --port 8080 \ + --served-model-name eagle \ + --model /mnt/models/Meta-Llama-3.1-8B-Instruct \ + --dtype auto \ + --gpu-memory-utilization 0.8 \ + --max-model-len 4096 \ + --max-num-seqs 8 \ + --seed 42 \ + --speculative-config '{"method":"eagle3","model":"/mnt/models/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":3}' +``` + +**Workloads:** + +- _code refactor (high prompt-output overlap):_ ask the model to add docstrings and type annotations to a 30-line Python class and return the full updated class +- _general chat (no prompt-output overlap):_ ask the model to explain a concept in ≥800 words + +**Results:** + +| Workload | Baseline tok/s | EAGLE-3 tok/s | Speedup | Wall delta (median) | +| ---------------------------- | -------------- | ------------- | ---------- | ------------------- | +| Code refactor (high overlap) | 47.84 | 88.25 | **1.84×** | −6171 ms | +| General chat (no overlap) | 47.87 | 47.45 | **0.99×** | +2416 ms | + +Speedup is the tok/s ratio (completion-length-invariant). Wall delta compares median wall-clock time directly; the chat runs generated different amounts of output (baseline 588 vs EAGLE-3 709 tokens), so Speedup is the more reliable indicator there. + +**Speculative-decoding behaviour (EAGLE-3 side, from `SpecDecoding metrics` log windows):** + +| Workload | Mean accept length | Avg Draft accept rate | Per-position accept rate | +| ---------------------------- | ------------------ | --------------------- | ------------------------ | +| Code refactor (high overlap) | ≈ 2.54 | ≈ 51% | 0.50 / 0.40 / 0.33 | +| General chat (no overlap) | ≈ 1.19 | ≈ 6% | 0.16 / 0.02 / 0.01 | + +Mean acceptance length and acceptance rates are draft-weighted across the `SpecDecoding metrics` log windows that covered each benchmark run; per-position values are from the sustained-load windows inside each run. + +**Interpretation:** + +- EAGLE-3 delivered a **~1.84× speedup on code-refactor** and was **essentially break-even on general chat (~0.99×)** on this single-GPU 8B setup. The two baseline runs sat on top of each other at ~47.8 tok/s, as expected — base decode rate is a model-and-hardware property and does not depend on prompt content. All of the observable gap comes from the EAGLE-3 side. +- **Why code wins and chat doesn't** — acceptance data tells the mechanism directly. On code the draft head landed ~2.54 tokens per decode step at ~51% acceptance, so most steps emit multiple tokens; per-position acceptance decays slowly (0.50 / 0.40 / 0.33), so even the 3rd speculative slot still pays off a third of the time. 
On chat mean acceptance length sits at ~1.19 with only ~6% acceptance, and per-position acceptance collapses by the 2nd slot (0.16 / 0.02 / 0.01) — almost every step emits just the verified token and the drafted ones are discarded. +- **Realized vs theoretical.** Mean acceptance length is the theoretical upper bound on speedup with zero proposer overhead. Code realized 1.84× against a 2.54× ceiling (~72% converted), i.e. proposer CPU work, verification of rejected proposals, and async-scheduling costs ate about a quarter of the headroom. Chat's 1.19× theoretical ceiling was **entirely consumed by overhead** and tipped into a slight regression. This is consistent with the Caveats: on small models on a single GPU, per-step overhead has little idle decode capacity to hide behind. + +The same method on a larger target model (where each verify step costs more), with multi-GPU tensor parallelism, or under higher concurrency may behave very differently. Treat this snapshot as a reminder to measure, not as a verdict on EAGLE-3 itself. + +## Prerequisites + +- A Kubernetes cluster with KServe installed and a namespace where you can create `InferenceService` resources. +- A vLLM serving runtime registered on the platform whose vLLM version supports the speculative method you plan to use. To check the version, exec into a running pod with that runtime: `kubectl exec -- python3 -c "import vllm; print(vllm.__version__)"`. +- Your target model is accessible to the service through its storage source (model repository, PVC, or OCI image). +- For EAGLE-3: a draft head whose architecture, tokenizer, and base version match the **exact** target model. A mismatched head silently degrades acceptance rate and may not surface as a startup error. +- For EAGLE-3: a model-artifact loading mechanism that can deliver both target and draft into the same pod. See [Providing Model Artifacts on Alauda AI](#providing-model-artifacts-on-alauda-ai). + +## Configuration Surface + +In vLLM v1, speculative decoding is enabled by a single argument: + +```text +--speculative-config '{"method": "", "num_speculative_tokens": , ...}' +``` + +Common keys: + +- `method`: the proposer to use. Values used in this guide: `ngram` and `eagle3`. Other values exist upstream (for example `medusa`, or model-specific MTP names such as `deepseek_mtp`) — confirm the exact value for your method in the vLLM speculative decoding documentation. +- `num_speculative_tokens`: how many tokens to propose per step. Higher values can increase speedup but also waste compute on rejected proposals. +- `model`: for methods that load a separate draft artifact (such as EAGLE-3), the path to that artifact inside the container. +- Method-specific keys, such as `prompt_lookup_max` / `prompt_lookup_min` for N-gram. These names have changed across vLLM releases — verify against the version you ship. + +All other vLLM arguments (`--model`, `--tensor-parallel-size`, `--gpu-memory-utilization`, …) work the same as in a non-speculative deployment. + +## Providing Model Artifacts on Alauda AI \{#providing-model-artifacts-on-alauda-ai} + +Different methods need different files inside the predictor pod. + +### Single-artifact pattern (N-gram) + +For N-gram only the target model is required. Use `storageUri` exactly as for any other inference service: + +```yaml +spec: + predictor: + model: + storageUri: hf:// +``` + +The model lands at `/mnt/models` and is passed to vLLM through `--model`. 
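For orientation, a minimal sketch of the matching vLLM arguments for this single-artifact layout (Example 1 below assembles the full command): only the target path is passed, and the n-gram proposer keys shown are the same ones used elsewhere on this page. Their names have shifted across vLLM releases, so verify against your runtime image.

```text
--model /mnt/models \
--speculative-config '{"method":"ngram","num_speculative_tokens":5,"prompt_lookup_max":4,"prompt_lookup_min":2}'
```
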
+ +### Two-artifact pattern (EAGLE-3 and similar) + +EAGLE-3 needs both the target model **and** a matching draft head loaded into the same pod. There are three supported ways to deliver them. Pick based on your platform version, network access, and operational preference. + +#### Option A — KServe `storageUris` (preferred when available) + +`storageUris` is a KServe field that accepts multiple storage locations and mounts each at a declared path. It is the cleanest option when your platform's KServe version supports it (KServe 0.16 and later). + +```yaml +spec: + predictor: + model: + storageUris: + - uri: hf:// + mountPath: /mnt/models/target + - uri: hf:// + mountPath: /mnt/models/draft +``` + +Then point vLLM at the two paths: + +```text +--model /mnt/models/target \ +--speculative-config '{"method":"eagle3","model":"/mnt/models/draft","num_speculative_tokens":3}' +``` + +Constraints to be aware of: + +- `storageUri` (singular) and `storageUris` (plural) are mutually exclusive. +- All `mountPath` values must be absolute and share a common parent directory (for example `/mnt/models/target` and `/mnt/models/draft`). +- For private repositories, attach the appropriate credentials secret to the service account used by the predictor pod. + +If your platform's KServe version does not yet include `storageUris`, use Option B or Option C. + +#### Option B — Single OCI Modelcar containing both artifacts + +Package the target model and the draft head into one OCI image under predictable subdirectories (for example `/models/target` and `/models/draft`), then deploy with `storageUri: oci://...`. See [Using KServe Modelcar for Model Storage](./using_modelcar.mdx) for the packaging steps. Sample on-disk layout to bake into the image: + +```text +/models/ +├── target/ +│ └── ... target model files ... +└── draft/ + └── ... EAGLE-3 head files ... +``` + +The vLLM command then references the same paths: + +```text +--model /mnt/models/target \ +--speculative-config '{"method":"eagle3","model":"/mnt/models/draft","num_speculative_tokens":3}' +``` + +This option is well-suited to offline / air-gapped clusters because the artifacts are versioned together and pulled from your own registry. + +#### Option C — Pre-staged on a shared PVC + +Stage both artifacts onto a PVC under a known directory layout, mount the PVC, and reference the local paths from the vLLM command. This is the simplest option if you already manage model files on a shared filesystem. + +### Picking between A / B / C + +| Constraint | Use | +| --------------------------------------------------------- | -------- | +| Online cluster, KServe ≥ 0.16, want declarative manifests | Option A | +| Offline / air-gapped, want a single versioned artifact | Option B | +| Already have model files on a shared PVC | Option C | + +## End-to-End Examples + +The two examples below cover the methods listed in [Methods Available on Alauda AI](#methods-available-on-alauda-ai). Replace ``, ``, and storage URIs with values from your environment. + +### Example 1 — N-gram + +```yaml +apiVersion: serving.kserve.io/v1beta1 +kind: InferenceService +metadata: + annotations: + aml-model-repo: Qwen2.5-7B-Instruct # [!code callout] + serving.knative.dev/progress-deadline: 1800s + serving.kserve.io/deploymentMode: Standard + labels: + aml.cpaas.io/runtime-type: vllm + name: qwen-ngram-spec + namespace: +spec: + predictor: + minReplicas: 1 + maxReplicas: 1 + model: + command: + - bash + - -c + - | + set -ex + + MODEL_PATH="/mnt/models/${MODEL_NAME}" + if [ ! 
-d "${MODEL_PATH}" ]; then + MODEL_PATH="/mnt/models" + fi + + python3 -m vllm.entrypoints.openai.api_server \ + --port 8080 \ + --served-model-name {{.Name}} {{.Namespace}}/{{.Name}} \ + --model "${MODEL_PATH}" \ + --dtype ${DTYPE} \ + --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \ + --speculative-config '{"method":"ngram","num_speculative_tokens":5,"prompt_lookup_max":4,"prompt_lookup_min":2}' # [!code callout] + - bash + env: + - name: DTYPE + value: half + - name: GPU_MEMORY_UTILIZATION + value: '0.85' + - name: MODEL_NAME + value: '{{ index .Annotations "aml-model-repo" }}' + modelFormat: + name: transformers + protocolVersion: v2 + resources: + limits: + cpu: '8' + memory: 32Gi + nvidia.com/gpu: '1' + requests: + cpu: '4' + memory: 16Gi + runtime: + storageUri: hf:// + securityContext: + seccompProfile: + type: RuntimeDefault +``` + + +1. Replace with your actual model name; this annotation is used by the platform for display. +2. The `prompt_lookup_*` keys belong to the n-gram proposer. Their names have changed between vLLM releases — verify against the version inside your runtime image. + + +### Example 2 — EAGLE-3 with target + draft on a shared PVC + +This manifest matches the setup used for the [Internal Validation Snapshot — EAGLE-3](#recommended-starting-points) above. Both the target model and the EAGLE-3 draft head are pre-staged inside a single PVC under predictable subdirectories; the PVC is mounted at `/mnt/models/` by `storageUri: pvc://...`, and the vLLM command references the two subdirectories directly. + +```yaml +apiVersion: serving.kserve.io/v1beta1 +kind: InferenceService +metadata: + annotations: + aml-model-repo: Meta-Llama-3.1-8B-Instruct + serving.knative.dev/progress-deadline: 1800s + serving.kserve.io/deploymentMode: Standard + labels: + aml.cpaas.io/runtime-type: vllm + name: llama-eagle3-spec + namespace: +spec: + predictor: + minReplicas: 1 + maxReplicas: 1 + model: + command: + - bash + - -c + - | + set -ex + + python3 -m vllm.entrypoints.openai.api_server \ + --port 8080 \ + --served-model-name {{.Name}} {{.Namespace}}/{{.Name}} \ + --model /mnt/models/Meta-Llama-3.1-8B-Instruct \ + --dtype ${DTYPE} \ + --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \ + --max-model-len 4096 \ + --max-num-seqs 8 \ + --seed 42 \ + --speculative-config '{"method":"eagle3","model":"/mnt/models/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":3}' # [!code callout] + - bash + env: + - name: DTYPE + value: auto + - name: GPU_MEMORY_UTILIZATION + value: '0.8' # [!code callout] + modelFormat: + name: transformers + protocolVersion: v2 + resources: + limits: + cpu: '8' + memory: 48Gi + nvidia.com/gpu: '1' + requests: + cpu: '4' + memory: 24Gi + runtime: + storageUri: pvc:/// # [!code callout] + securityContext: + seccompProfile: + type: RuntimeDefault +``` + + +1. Both paths in the vLLM command (`--model` and the `model` key inside `--speculative-config`) must match the directory names **inside** the PVC exactly. If your PVC lays the artifacts out under different names, adjust these two paths together. +2. The EAGLE-3 head occupies GPU memory outside the `--gpu-memory-utilization` budget. Leaving headroom (here `0.8` instead of `0.9`) reduces the chance of OOM when both artifacts are loaded. +3. `pvc:///` expects a PVC pre-staged with **both** the target model and the EAGLE-3 draft head; the PVC root is mounted at `/mnt/models/`, so the two artifacts must live at `/mnt/models//` and `/mnt/models//`. See the expected layout below. 
If you prefer declarative multi-URI mounts (KServe 0.16+) or bundling target + draft into a single OCI image instead, see [Option A or Option B in Providing Model Artifacts](#providing-model-artifacts-on-alauda-ai). + + +Expected layout inside the PVC (mounted at `/mnt/models/` in the pod): + +```text +/ +├── Meta-Llama-3.1-8B-Instruct/ +│ └── ... target model files ... +└── EAGLE3-LLaMA3.1-Instruct-8B/ + └── ... EAGLE-3 draft head files ... +``` + +Verify the layout from inside the predictor pod once it starts: + +```bash +kubectl exec -n -- ls /mnt/models/ +# Expected: EAGLE3-LLaMA3.1-Instruct-8B/ Meta-Llama-3.1-8B-Instruct/ +``` + +Apply any of the manifests above with: + +```bash +kubectl apply -f .yaml -n +``` + +## Verify and Measure the Impact \{#verify-and-measure-the-impact} + +Verifying that speculative decoding was configured is one step. Verifying that it **helps your workload** is a different step. + +### 1. Confirm the configuration was applied + +```bash +kubectl get inferenceservice -n -o yaml +``` + +Look for `--speculative-config` in the predictor command and confirm the readiness state: + +```bash +kubectl get pods -n -l serving.kserve.io/inferenceservice= +``` + +### 2. Confirm speculative decoding is actually running + +The first startup-time signal is the engine-config log line; it prints the `speculative_config` the engine resolved, so you can verify the method and draft path took effect: + +```bash +kubectl logs -n -l serving.kserve.io/inferenceservice= \ + | grep -m1 'Initializing a V1 LLM engine' +# Expected to contain: speculative_config=SpeculativeConfig(method='eagle3', model='...', num_spec_tokens=3) +``` + +For live counters, vLLM exposes Prometheus metrics at `/metrics`. The exact metric names depend on the vLLM version, so cast a wide net first: + +```bash +kubectl exec -n -- curl -s localhost:8080/metrics | grep -iE 'spec_decode|draft|acceptance' +``` + +If that returns nothing, the pod either hasn't served any requests yet (counters only publish once the first generation completes) or the metric names in your vLLM build differ — in which case fall back to the predictor logs. + +vLLM prints a per-window summary line that is the most readable live picture. This is the real shape of the line on vLLM 0.19.1 with `num_speculative_tokens=3`: + +```text +SpecDecoding metrics: Mean acceptance length: 2.68, Accepted throughput: 65.69 tokens/s, +Drafted throughput: 116.98 tokens/s, Accepted: 657 tokens, Drafted: 1170 tokens, +Per-position acceptance rate: 0.664, 0.559, 0.462, Avg Draft acceptance rate: 56.2% +``` + +How to read it: + +- **Mean acceptance length** — average tokens delivered per decode step. Baseline is `1`. This is the practical upper bound for the speedup you can hope to get on this workload. +- **Avg Draft acceptance rate** — overall fraction of proposed tokens that were accepted. A single number for "is the proposer mostly paying off or mostly wasted?". +- **Per-position acceptance rate** — per-slot acceptance for slots `1..num_speculative_tokens`. **You will see exactly `num_speculative_tokens` values** — the example above has 3 because the run used `num_speculative_tokens=3`; an `ngram` run with `num_speculative_tokens=5` prints 5 values. A healthy curve decays slowly; a curve that collapses to near-zero by the 2nd slot means the workload is not a fit for this proposer. + +### 3. Measure end-to-end impact + +Run the same representative workload twice: + +1. With `--speculative-config` removed (baseline). +2. 
With it enabled (everything else identical, including `--seed`). + +Capture three numbers per run: + +- Time to first token (TTFT). +- Per-token latency (or end-to-end latency at fixed output length). +- Throughput (tokens/second) under the QPS you actually serve. + +Speculative decoding is worth keeping on if all three improve at your target QPS. A common failure mode is improvement at low QPS but regression at production QPS — measure where you actually run. + +### 4. How to report or compare numbers + +Performance numbers without their context cannot be reproduced or trusted. Any time you publish a comparison — internally, in a customer report, or back to the platform team — include the five fields below. Numbers that omit any of them should be treated as anecdotal, not as evidence. + +```markdown +**Hardware:** +**Model:** +**Runtime:** +**Request parameters:** + +**Baseline command (no spec decode):** +```text +python3 -m vllm.entrypoints.openai.api_server \ + --port 8080 \ + --served-model-name \ + --model /mnt/models \ + --gpu-memory-utilization 0.8 \ + --max-model-len 4096 \ + --max-num-seqs 8 \ + --seed 42 +``` + +**Spec-decode command (only differs by --speculative-config):** +```text +python3 -m vllm.entrypoints.openai.api_server \ + --port 8080 \ + --served-model-name \ + --model /mnt/models \ + --gpu-memory-utilization 0.8 \ + --max-model-len 4096 \ + --max-num-seqs 8 \ + --seed 42 \ + --speculative-config '{"method":"ngram","num_speculative_tokens":5,"prompt_lookup_max":4,"prompt_lookup_min":2}' +``` + +**Results:** + +| Workload | Baseline TTFT | Spec TTFT | Baseline tok/s | Spec tok/s | Mean accept length | Avg accept rate | Speedup (tok/s) | +| --- | --- | --- | --- | --- | --- | --- | --- | +| chat | … | … | … | … | … | … | … | +| code | … | … | … | … | … | … | … | +| rag | … | … | … | … | … | … | … | + + +Two practical rules when running the comparison: + +- Use the same `--seed` and `temperature=0` for both sides, and warm up each service with 3 discarded requests before timing — otherwise sampling and compile-cache noise will dominate the differences you measure. +- Run baseline and spec-decode against the **same fixed prompt list, in the same order**, at least 5–10 times per prompt, and compare medians rather than averages. + +## Rollback \{#rollback} + +To disable speculative decoding without changing anything else, remove the `--speculative-config` line from the predictor command and re-apply: + +```bash +kubectl edit inferenceservice -n +# delete the --speculative-config line, save, exit +``` + +Or re-apply a manifest that omits the flag: + +```bash +kubectl apply -f .yaml -n +``` + +The service rolls to a new revision without the speculative proposer. No model artifact changes are required for N-gram. For EAGLE-3 the draft head remains mounted but is unused — if you want to reclaim disk, remove the draft-head artifact on the next change (delete the matching `storageUris` entry for Option A, rebuild the OCI image without the draft directory for Option B, or drop the draft subdirectory from the PVC for Option C). 
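A quick, hedged way to confirm the rollback is actually serving is to check that no running predictor pod still carries the flag in its command. The `<isvc-name>` and `<namespace>` placeholders below are illustrative, and during a rollout the old revision's pod may linger briefly:

```bash
# Expect no output once only the rolled-back revision's pods remain
kubectl get pods -n <namespace> -l serving.kserve.io/inferenceservice=<isvc-name> \
  -o jsonpath='{.items[*].spec.containers[*].command}' | grep -o 'speculative-config'
```
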
+ +## Troubleshooting + +| Symptom | Likely cause | What to check | +| ----------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | +| Pod fails to start with a vLLM argument error mentioning `speculative` or unknown JSON keys | The `--speculative-config` keys do not match the vLLM version in the runtime image | `kubectl exec -- python3 -c "import vllm; print(vllm.__version__)"` and align flags to that version | +| Pod fails to start with an unknown `method` value | A typo in `method`, or a value that your vLLM version does not support (for example `eagle` instead of `eagle3`) | Confirm the supported `method` values for your vLLM release in the upstream speculative decoding docs | +| OOM during model load with EAGLE-3 enabled | EAGLE-3 head memory was not budgeted | Lower `--gpu-memory-utilization` by 0.05–0.10, or reduce other workloads on the GPU | +| Service Ready but acceptance rate near zero | Tokenizer / architecture mismatch between target and draft, or sampling temperature too high | Re-verify the draft head matches the exact target model; reduce sampling temperature for evaluation | +| TTFT or latency regress at production QPS | Proposal overhead is no longer hidden by idle decode capacity | Disable on this service or reduce `num_speculative_tokens`; see [Rollback](#rollback) | +| `storageUris` rejected by the API server | KServe version on the platform predates `storageUris` | Use Option B (Modelcar) or Option C (PVC) instead | +| Knative marks the revision NotReady during rollout with a progress-deadline timeout | Cold start with a draft artifact is slower than without — torch.compile of both backbone and EAGLE head + engine profiling can push it past the default progress deadline | Raise `serving.knative.dev/progress-deadline` (our EAGLE-3 cold start on A30 + Llama-3.1-8B was ~5 min; the Example 1 and Example 2 manifests on this page set it to `1800s` for this reason) | +| Client sees unexpected sampling behaviour when using `min_p` or `logit_bias` under spec decode | Both parameters are silently ignored by vLLM when speculative decoding is enabled (warning printed at engine init) | Drop the parameter from the request, or disable speculative decoding on services whose clients rely on it | + +For pod-level issues, the standard inference-service troubleshooting commands apply: + +```bash +kubectl describe inferenceservice -n +kubectl logs -n -l serving.kserve.io/inferenceservice= +``` + +## Caveats and Known Limitations \{#caveats-and-known-limitations} + +- **Outcomes swing widely with workload shape — regression and speedup are both real.** Upstream V0 benchmarks reported 1.4×–1.8× slowdowns at high QPS. Our own A30 + Qwen3-8B N-gram test (see [Internal Validation Snapshot — N-gram](#recommended-starting-points)) saw a slight regression even on a high-overlap code workload. On the same hardware, EAGLE-3 on Llama-3.1-8B (see [Internal Validation Snapshot — EAGLE-3](#recommended-starting-points)) hit a **1.84× speedup on code-refactor but was break-even on chat (~0.99×)** — same model, same method, same pod, 2× swing in realized benefit between two prompt shapes. Always validate against your production traffic profile. 
+- **N-gram disables async scheduling.** In recent vLLM versions, enabling the `ngram` method forces async scheduling off (the predictor logs `Async scheduling not supported with ngram-based speculative decoding and will be disabled`). If your service depends on async scheduling for throughput, prefer EAGLE-3, or measure the trade-off explicitly. +- **`storageUris` availability.** The field is available from KServe 0.16. Older platform releases must use the Modelcar or PVC option. +- **Draft head mismatch is silent.** A draft head that does not exactly match the target model usually starts up and serves traffic correctly but with very low acceptance rate. Always check acceptance rate after enabling. +- **Sampling parameters affect acceptance.** High temperature reduces acceptance rate; benchmark with sampling settings that reflect production usage. +- **`gpu-memory-utilization` budget.** Draft artifacts (EAGLE-3 head, MLP speculator, draft model) are not included in the `--gpu-memory-utilization` budget; reduce that value when adding a draft artifact. +- **Image dependencies.** The runtime image must include the libraries required by the chosen method. If a method fails to initialize, rebuild or replace the runtime image — see [Extend Inference Runtimes](./custom_inference_runtime.mdx). +- **`min_p` and `logit_bias` are silently ignored.** Under speculative decoding, vLLM logs the warning `min_p and logit_bias parameters won't work with speculative decoding.` during engine init. Requests that pass either of these sampling parameters will still receive a 200 response, but the parameters are not honored — validate this against your client assumptions if your traffic relies on them. +- **Composition with other features.** Speculative decoding composes with tensor parallelism and continuous batching but interacts with autoscaling and with EP / advanced parallelism in ways that depend on the vLLM version. Cold start is notably more expensive with a draft artifact: on our A30 + Llama-3.1-8B + EAGLE-3 head lab setup, the predictor went from container-ready to `Application startup complete` in **~5 minutes** (weight load ~45 s, draft weights ~5 s, `torch.compile` backbone ~48 s, `torch.compile` EAGLE head ~17 s, CUDA-graph capture and warmup ~10 s, plus ~2 minutes of engine profiling and KV-cache sizing). Size your Knative `progress-deadline` annotation and any autoscaling scale-from-zero SLO to this, not to a non-speculative baseline. +- **Output equivalence.** vLLM states that speculative decoding does not change the output distribution. This is a vLLM property, not an Alauda AI guarantee — if exact equivalence under your runtime image is required, validate it as part of acceptance testing. 
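Several of the caveats above announce themselves as engine-init log lines: the async-scheduling notice for N-gram and the `min_p` / `logit_bias` warning. A hedged spot-check is to grep the predictor logs for both; exact wording can differ between vLLM builds, and the placeholders are illustrative:

```bash
# Look for the caveat-related warnings quoted above; message text may vary by vLLM version
kubectl logs -n <namespace> -l serving.kserve.io/inferenceservice=<isvc-name> \
  | grep -iE 'async scheduling|min_p and logit_bias'
```
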
+ +## References + +- [Speculative Decoding - vLLM](https://docs.vllm.ai/en/latest/features/speculative_decoding/) +- [How Speculative Decoding Boosts vLLM Performance by up to 2.8x — vLLM Blog](https://vllm.ai/blog/spec-decode) +- [Speculative Decoding Guide — vllm-ascend](https://docs.vllm.ai/projects/ascend/en/main/user_guide/feature_guide/speculative_decoding.html) +- [KServe Multiple Storage URIs](https://kserve.github.io/website/docs/model-serving/storage/multiple-storage-uris) +- [Using KServe Modelcar for Model Storage](./using_modelcar.mdx) +- [Extend Inference Runtimes](./custom_inference_runtime.mdx) +- [Enable Expert Parallel for vLLM Inference Services](./vllm_expert_parallel.mdx) +- [Create Inference Service using CLI](./create_inference_service_cli.mdx) diff --git a/docs/en/model_inference/model_management/functions/model_repository.mdx b/docs/en/model_inference/model_management/functions/model_repository.mdx index 292bcc0f..595c3f5b 100644 --- a/docs/en/model_inference/model_management/functions/model_repository.mdx +++ b/docs/en/model_inference/model_management/functions/model_repository.mdx @@ -64,8 +64,6 @@ The core definition of the model repository feature is to provide a Git-based ve - Misconfigured `README.md` metadata may block inference deployment. - - -## Create Model and Upload Model Files +## Create Model and Upload Model Files \{#create-model-repository} Refer to [Upload Models Using Notebook](../how_to/upload_models_using_notebook.mdx) for detailed steps on uploading model files to the model repository. diff --git a/docs/en/trustyai/ai-guardrails.mdx b/docs/en/trustyai/ai-guardrails.mdx index 6e58842f..49cfddd9 100644 --- a/docs/en/trustyai/ai-guardrails.mdx +++ b/docs/en/trustyai/ai-guardrails.mdx @@ -91,9 +91,7 @@ Change `regex` under `built-in-detector` to the desired algorithm (e.g. `- email The Guardrails Orchestrator is exposed by a Service named `-service`. Port numbers depend on whether authentication is enabled (annotation `security.opendatahub.io/enable-auth: "true"` on the GuardrailsOrchestrator). - - -### Ports and roles +### Ports and roles \{#ports-and-roles} | Port name | Auth disabled | Auth enabled | Role | |------------------------|---------------|--------------|------| @@ -178,8 +176,7 @@ curl -k -s -X POST "https://:8480/api/v1/text/contents" \ - -### Orchestrator API: per-request detectors (`/api/v2/chat/completions-detection`) +### Orchestrator API: per-request detectors (`/api/v2/chat/completions-detection`) \{#orchestrator-api-per-request-detectors} Use the **orchestrator** port (8032 or 8432) when the caller must choose which detectors run on each request. Request body: `model`, `messages`, and optionally `detectors` (e.g. `input` / `output` with detector params).