From 9fa50f088ada58d24db61feac8a02201e7229347 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Thu, 21 May 2026 12:52:47 -0500 Subject: [PATCH 01/20] cld2labs/sglang-gpt-oss: add SGLang Helm chart for gpt-oss-20b on Xeon Standalone Helm chart at core/helm-charts/sglang/ that deploys lmsysorg/sglang:v0.5.11-xeon serving openai/gpt-oss-20b on a Xeon CPU node. Follows the same standalone pattern as core/helm-charts/ovms (no Ansible playbook wiring): a single helm install/upgrade command brings up the server. Mirrors the OVMS chart's OIDC + APISIX + nginx ingress topology so it slots into the existing auth-apisix stack when those are enabled, and can be deployed bare for smoke tests by disabling them. Defaults: PVC-backed HuggingFace cache (80Gi) so weights survive pod restarts, /dev/shm sized for CPU IPC, OpenAI-compatible API on port 30000, liveness/readiness on /health. Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/Chart.yaml | 17 ++ core/helm-charts/sglang/README.md | 78 ++++++++ .../helm-charts/sglang/templates/_helpers.tpl | 75 ++++++++ .../sglang/templates/apisixroute.yaml | 44 +++++ .../sglang/templates/deployment.yaml | 155 ++++++++++++++++ .../helm-charts/sglang/templates/ingress.yaml | 33 ++++ core/helm-charts/sglang/templates/pvc.yaml | 25 +++ core/helm-charts/sglang/templates/secret.yaml | 36 ++++ .../helm-charts/sglang/templates/service.yaml | 26 +++ core/helm-charts/sglang/values.yaml | 173 ++++++++++++++++++ 10 files changed, 662 insertions(+) create mode 100644 core/helm-charts/sglang/Chart.yaml create mode 100644 core/helm-charts/sglang/README.md create mode 100644 core/helm-charts/sglang/templates/_helpers.tpl create mode 100644 core/helm-charts/sglang/templates/apisixroute.yaml create mode 100644 core/helm-charts/sglang/templates/deployment.yaml create mode 100644 core/helm-charts/sglang/templates/ingress.yaml create mode 100644 core/helm-charts/sglang/templates/pvc.yaml create mode 100644 
core/helm-charts/sglang/templates/secret.yaml create mode 100644 core/helm-charts/sglang/templates/service.yaml create mode 100644 core/helm-charts/sglang/values.yaml diff --git a/core/helm-charts/sglang/Chart.yaml b/core/helm-charts/sglang/Chart.yaml new file mode 100644 index 00000000..c1f9e636 --- /dev/null +++ b/core/helm-charts/sglang/Chart.yaml @@ -0,0 +1,17 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: v2 +name: sglang +description: A Helm chart for deploying SGLang inference server (Xeon CPU build) +type: application +version: 0.1.0 +appVersion: "v0.5.11-xeon" +keywords: + - sglang + - xeon + - cpu + - llm + - inference + - gpt-oss + - openai-compatible diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md new file mode 100644 index 00000000..a9d7b03c --- /dev/null +++ b/core/helm-charts/sglang/README.md @@ -0,0 +1,78 @@ +# SGLang Helm Chart (Xeon CPU build) + +Deploys an SGLang inference server using the `lmsysorg/sglang:v0.5.11-xeon` image, +defaulted to serve `openai/gpt-oss-20b` on a single Xeon CPU node. + +This chart follows the same standalone pattern as `core/helm-charts/ovms` — it is +not wired into the Ansible playbooks. You deploy it directly with `helm install`. + +## Prerequisites + +- A Kubernetes cluster with at least one Xeon worker node that has + - ~80GB free disk (model weights, ~40GB compressed in HF cache) + - ~96GB RAM available to the pod (bf16 weights + KV + activations) +- `helm` v3+ +- (Optional) HuggingFace token if you swap to a gated model. `openai/gpt-oss-20b` + itself is publicly downloadable. +- (Optional) The auth-apisix + keycloak + nginx-ingress stack from the rest of + this repo if you want OIDC-protected routing. If you just want to smoke-test + the model, disable those (see below). 
+ +## Quick test (no auth, port-forward) + +```bash +helm upgrade --install gpt-oss-20b ./core/helm-charts/sglang \ + --set apisixRoute.enabled=false \ + --set ingress.enabled=false \ + --set oidc.enabled=false + +kubectl wait --for=condition=Ready pod \ + -l app.kubernetes.io/instance=gpt-oss-20b --timeout=30m + +kubectl port-forward svc/gpt-oss-20b-sglang 30000:30000 + +# OpenAI-compatible smoke test +curl http://localhost:30000/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "gpt-oss-20b", + "messages": [{"role": "user", "content": "Say hi in five words."}] + }' +``` + +The first start downloads ~40GB of weights into the PVC. Subsequent restarts +reuse the cache. + +## Full deploy with auth (matches OVMS pattern) + +```bash +helm upgrade --install gpt-oss-20b ./core/helm-charts/sglang \ + --set huggingface.token=$HUGGINGFACE_TOKEN \ + --set ingress.host=api.example.com \ + --set apisixRoute.host=api.example.com +``` + +The model is then reachable at `https://api.example.com/gpt-oss-20b-sglang/v1/...`. + +## Useful overrides + +| Flag | Default | Notes | +|---|---|---| +| `modelSource` | `openai/gpt-oss-20b` | Any HF model ID supported by SGLang | +| `modelName` | `gpt-oss-20b` | URL/route path + OpenAI `model` field | +| `server.tpSize` | `1` | Increase for multi-socket parallelism | +| `server.dtype` | `bfloat16` | `bfloat16` recommended on Xeon | +| `server.contextLength` | unset | Override model default context | +| `server.extraArgs` | `[]` | e.g. `'{--mem-fraction-static,0.85}'` | +| `resources.limits.memory` | `96Gi` | Bump up for longer contexts | +| `storage.persistentVolume.size` | `80Gi` | HF cache size on disk | +| `nodeSelector` | `{}` | Pin to a Xeon node label | + +## Uninstall + +```bash +helm uninstall gpt-oss-20b +``` + +PVC is deleted by default. Set `--set storage.persistentVolume.deleteOnUninstall=false` +to keep cached weights. 
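The overrides in the README's "Useful overrides" table above can also be collected in a values file rather than passed as repeated `--set` flags. What follows is a minimal sketch, assuming only the keys defined in this patch's `values.yaml`. The file name `my-values.yaml`, the `node-type: xeon` label, and the context length of 8192 are hypothetical placeholders, not values taken from the chart.

```yaml
# my-values.yaml (hypothetical file name)
# Smoke-test profile: disable the OIDC, APISIX route, and ingress wiring.
oidc:
  enabled: false
apisixRoute:
  enabled: false
ingress:
  enabled: false

server:
  # Illustrative cap on the context window, not a recommendation.
  contextLength: "8192"
  # The deployment template renders each list element as one quoted argument.
  extraArgs:
    - "--mem-fraction-static"
    - "0.85"

resources:
  limits:
    memory: "96Gi"

storage:
  persistentVolume:
    size: 80Gi
    # false makes pvc.yaml add the "helm.sh/resource-policy: keep" annotation,
    # so the cached weights survive `helm uninstall`.
    deleteOnUninstall: false

# Hypothetical label; replace with one that actually exists on your Xeon nodes.
nodeSelector:
  node-type: xeon
```

Such a file would be applied with `helm upgrade --install gpt-oss-20b ./core/helm-charts/sglang -f my-values.yaml`. Helm merges it over the chart defaults, so any key not listed keeps its default value.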
diff --git a/core/helm-charts/sglang/templates/_helpers.tpl b/core/helm-charts/sglang/templates/_helpers.tpl new file mode 100644 index 00000000..138d100c --- /dev/null +++ b/core/helm-charts/sglang/templates/_helpers.tpl @@ -0,0 +1,75 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- define "sglang.name" -}} +{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{- define "sglang.fullname" -}} +{{- if .Values.fullnameOverride }} +{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- $name := default .Chart.Name .Values.nameOverride }} +{{- if contains $name .Release.Name }} +{{- .Release.Name | trunc 63 | trimSuffix "-" }} +{{- else }} +{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }} +{{- end }} +{{- end }} +{{- end }} + +{{- define "sglang.chart" -}} +{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }} +{{- end }} + +{{- define "sglang.labels" -}} +helm.sh/chart: {{ include "sglang.chart" . }} +{{ include "sglang.selectorLabels" . }} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +app.kubernetes.io/managed-by: {{ .Release.Service }} +{{- end }} + +{{- define "sglang.selectorLabels" -}} +app.kubernetes.io/name: {{ include "sglang.name" . }} +app.kubernetes.io/instance: {{ .Release.Name }} +{{- with .Values.podLabels }} +{{ toYaml . }} +{{- end }} +{{- end }} + +{{- define "sglang.serviceAccountName" -}} +{{- if .Values.serviceAccount.create }} +{{- default (include "sglang.fullname" .) .Values.serviceAccount.name }} +{{- else }} +{{- default "default" .Values.serviceAccount.name }} +{{- end }} +{{- end }} + +{{- define "sglang.storageVolume" -}} +{{- if .Values.storage.persistentVolume.enabled }} +persistentVolumeClaim: + claimName: {{ .Values.storage.persistentVolume.existingClaim | default (include "sglang.fullname" .) 
}} +{{- else if .Values.storage.emptyDir.enabled }} +emptyDir: + {{- if .Values.storage.emptyDir.sizeLimit }} + sizeLimit: {{ .Values.storage.emptyDir.sizeLimit }} + {{- end }} +{{- else }} +emptyDir: {} +{{- end }} +{{- end }} + +{{- define "sglang.oidcSecretName" -}} +{{- printf "%s-oidc" (include "sglang.fullname" .) }} +{{- end }} + +{{- define "sglang.imagePullSecrets" -}} +{{- if .Values.imagePullSecrets }} +imagePullSecrets: +{{- range .Values.imagePullSecrets }} + - name: {{ . }} +{{- end }} +{{- end }} +{{- end }} diff --git a/core/helm-charts/sglang/templates/apisixroute.yaml b/core/helm-charts/sglang/templates/apisixroute.yaml new file mode 100644 index 00000000..0d23b5eb --- /dev/null +++ b/core/helm-charts/sglang/templates/apisixroute.yaml @@ -0,0 +1,44 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- if .Values.apisixRoute.enabled }} +apiVersion: apisix.apache.org/v2 +kind: ApisixRoute +metadata: + name: {{ include "sglang.fullname" . }}-apisixroute + namespace: {{ .Values.apisixRoute.namespace }} + labels: + {{- include "sglang.labels" . | nindent 4 }} +spec: + http: + - name: {{ .Values.modelName }}-route + match: + hosts: + - {{ .Values.apisixRoute.host }} + paths: + - /{{ .Values.modelName }}-sglang/* + backends: + - serviceName: {{ include "sglang.fullname" . }} + servicePort: {{ .Values.service.port }} + plugins: + - name: proxy-rewrite + enable: true + config: + regex_uri: + - ^/{{ .Values.modelName }}-sglang/(.*) + - /$1 + headers: + Content-Type: application/json + {{- if .Values.oidc.enabled }} + - name: openid-connect + enable: true + secretRef: {{ include "sglang.fullname" . 
}}-secret + config: + discovery: {{ .Values.oidc.discovery }} + scope: openid profile email + bearer_only: true + realm: {{ .Values.oidc.realm }} + introspection_endpoint: {{ .Values.oidc.introspectionEndpoint }} + introspection_endpoint_auth_method: client_secret_basic + {{- end }} +{{- end }} diff --git a/core/helm-charts/sglang/templates/deployment.yaml b/core/helm-charts/sglang/templates/deployment.yaml new file mode 100644 index 00000000..f51800bd --- /dev/null +++ b/core/helm-charts/sglang/templates/deployment.yaml @@ -0,0 +1,155 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "sglang.fullname" . }} + namespace: {{ .Values.namespace }} + labels: + {{- include "sglang.labels" . | nindent 4 }} +spec: + replicas: {{ .Values.replicaCount }} + selector: + matchLabels: + {{- include "sglang.selectorLabels" . | nindent 6 }} + template: + metadata: + {{- with .Values.podAnnotations }} + annotations: + {{- toYaml . | nindent 8 }} + {{- end }} + labels: + {{- include "sglang.selectorLabels" . | nindent 8 }} + spec: + {{- include "sglang.imagePullSecrets" . | nindent 6 }} + {{- if .Values.serviceAccount.create }} + serviceAccountName: {{ include "sglang.serviceAccountName" . }} + {{- end }} + {{- with .Values.podSecurityContext }} + securityContext: + {{- toYaml . | nindent 8 }} + {{- end }} + containers: + - name: sglang + image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}" + imagePullPolicy: {{ .Values.image.pullPolicy }} + {{- with .Values.securityContext }} + securityContext: + {{- toYaml . 
| nindent 12 }} + {{- end }} + command: ["python3", "-m", "sglang.launch_server"] + args: + - "--model-path={{ .Values.modelSource }}" + - "--served-model-name={{ .Values.modelName }}" + - "--host={{ .Values.server.host }}" + - "--port={{ .Values.server.port }}" + - "--device={{ .Values.server.device }}" + - "--tp-size={{ .Values.server.tpSize }}" + - "--dp-size={{ .Values.server.dpSize }}" + {{- if .Values.server.dtype }} + - "--dtype={{ .Values.server.dtype }}" + {{- end }} + {{- if .Values.server.trustRemoteCode }} + - "--trust-remote-code" + {{- end }} + {{- if .Values.server.contextLength }} + - "--context-length={{ .Values.server.contextLength }}" + {{- end }} + {{- if .Values.server.maxRunningRequests }} + - "--max-running-requests={{ .Values.server.maxRunningRequests }}" + {{- end }} + {{- range .Values.server.extraArgs }} + - {{ . | quote }} + {{- end }} + ports: + - containerPort: {{ .Values.server.port }} + name: http + protocol: TCP + env: + - name: HF_HOME + value: {{ .Values.hfCacheMountPath | quote }} + - name: HUGGINGFACE_HUB_CACHE + value: "{{ .Values.hfCacheMountPath }}/hub" + - name: TRANSFORMERS_CACHE + value: "{{ .Values.hfCacheMountPath }}/hub" + {{- if .Values.huggingface.token }} + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: {{ .Values.huggingface.secretName | default "hf-token-secret" }} + key: {{ .Values.huggingface.secretKey | default "token" }} + - name: HUGGING_FACE_HUB_TOKEN + valueFrom: + secretKeyRef: + name: {{ .Values.huggingface.secretName | default "hf-token-secret" }} + key: {{ .Values.huggingface.secretKey | default "token" }} + {{- end }} + {{- with .Values.extraEnv }} + {{- toYaml . | nindent 12 }} + {{- end }} + {{- with .Values.extraEnvFrom }} + envFrom: + {{- toYaml . | nindent 12 }} + {{- end }} + volumeMounts: + - name: hf-cache + mountPath: {{ .Values.hfCacheMountPath }} + {{- if .Values.shm.enabled }} + - name: dshm + mountPath: /dev/shm + {{- end }} + {{- with .Values.extraVolumeMounts }} + {{- toYaml . 
| nindent 12 }} + {{- end }} + {{- with .Values.resources }} + resources: + {{- toYaml . | nindent 12 }} + {{- end }} + {{- if .Values.server.livenessProbe.enabled }} + livenessProbe: + httpGet: + path: {{ .Values.server.livenessProbe.httpGet.path }} + port: http + initialDelaySeconds: {{ .Values.server.livenessProbe.initialDelaySeconds }} + periodSeconds: {{ .Values.server.livenessProbe.periodSeconds }} + timeoutSeconds: {{ .Values.server.livenessProbe.timeoutSeconds }} + failureThreshold: {{ .Values.server.livenessProbe.failureThreshold }} + {{- end }} + {{- if .Values.server.readinessProbe.enabled }} + readinessProbe: + httpGet: + path: {{ .Values.server.readinessProbe.httpGet.path }} + port: http + initialDelaySeconds: {{ .Values.server.readinessProbe.initialDelaySeconds }} + periodSeconds: {{ .Values.server.readinessProbe.periodSeconds }} + timeoutSeconds: {{ .Values.server.readinessProbe.timeoutSeconds }} + failureThreshold: {{ .Values.server.readinessProbe.failureThreshold }} + {{- end }} + volumes: + - name: hf-cache + {{- include "sglang.storageVolume" . | nindent 10 }} + {{- if .Values.shm.enabled }} + - name: dshm + emptyDir: + medium: Memory + sizeLimit: {{ .Values.shm.sizeLimit }} + {{- end }} + {{- with .Values.extraVolumes }} + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . 
| nindent 8 }} + {{- end }} + {{- if .Values.priorityClassName }} + priorityClassName: {{ .Values.priorityClassName }} + {{- end }} diff --git a/core/helm-charts/sglang/templates/ingress.yaml b/core/helm-charts/sglang/templates/ingress.yaml new file mode 100644 index 00000000..75b4713a --- /dev/null +++ b/core/helm-charts/sglang/templates/ingress.yaml @@ -0,0 +1,33 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- if .Values.ingress.enabled }} +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: {{ include "sglang.fullname" . }} + namespace: {{ .Values.ingress.namespace }} + labels: + {{- include "sglang.labels" . | nindent 4 }} + annotations: + nginx.ingress.kubernetes.io/rewrite-target: /{{ .Values.modelName }}-sglang/$1 +spec: + ingressClassName: {{ .Values.ingress.className }} + {{- if .Values.ingress.secretname }} + tls: + - hosts: + - {{ .Values.ingress.host }} + secretName: {{ .Values.ingress.secretname }} + {{- end }} + rules: + - host: {{ .Values.ingress.host }} + http: + paths: + - path: /{{ .Values.modelName }}-sglang/(.*) + pathType: ImplementationSpecific + backend: + service: + name: {{- if .Values.apisixRoute.enabled }} auth-apisix-gateway{{- else }} {{ include "sglang.fullname" . }}{{- end }} + port: + number: 80 +{{- end }} diff --git a/core/helm-charts/sglang/templates/pvc.yaml b/core/helm-charts/sglang/templates/pvc.yaml new file mode 100644 index 00000000..1c1b2796 --- /dev/null +++ b/core/helm-charts/sglang/templates/pvc.yaml @@ -0,0 +1,25 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- if and .Values.storage.persistentVolume.enabled (not .Values.storage.persistentVolume.existingClaim) }} +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: {{ include "sglang.fullname" . }} + namespace: {{ .Values.namespace }} + labels: + {{- include "sglang.labels" . 
| nindent 4 }} + {{- if not .Values.storage.persistentVolume.deleteOnUninstall }} + annotations: + "helm.sh/resource-policy": keep + {{- end }} +spec: + accessModes: + - {{ .Values.storage.persistentVolume.accessMode | default "ReadWriteOnce" }} + {{- if .Values.storage.persistentVolume.storageClass }} + storageClassName: {{ .Values.storage.persistentVolume.storageClass }} + {{- end }} + resources: + requests: + storage: {{ .Values.storage.persistentVolume.size | default "80Gi" }} +{{- end }} diff --git a/core/helm-charts/sglang/templates/secret.yaml b/core/helm-charts/sglang/templates/secret.yaml new file mode 100644 index 00000000..135c441e --- /dev/null +++ b/core/helm-charts/sglang/templates/secret.yaml @@ -0,0 +1,36 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +{{- if or .Values.oidc.enabled .Values.secrets.enabled }} +apiVersion: v1 +kind: Secret +metadata: + name: {{ include "sglang.fullname" . }}-secret + namespace: {{ .Values.namespace }} + labels: + {{- include "sglang.labels" . | nindent 4 }} +type: Opaque +data: + {{- if .Values.oidc.enabled }} + client_id: {{ .Values.oidc.clientId | b64enc }} + client_secret: {{ .Values.oidc.clientSecret | b64enc }} + {{- end }} + {{- if .Values.secrets.enabled }} + {{- range $key, $value := .Values.secrets.data }} + {{ $key }}: {{ $value | b64enc }} + {{- end }} + {{- end }} +{{- end }} +--- +{{- if .Values.huggingface.token }} +apiVersion: v1 +kind: Secret +metadata: + name: {{ .Values.huggingface.secretName | default "hf-token-secret" }} + namespace: {{ .Values.namespace }} + labels: + {{- include "sglang.labels" . 
| nindent 4 }} +type: Opaque +data: + {{ .Values.huggingface.secretKey | default "token" }}: {{ .Values.huggingface.token | b64enc }} +{{- end }} diff --git a/core/helm-charts/sglang/templates/service.yaml b/core/helm-charts/sglang/templates/service.yaml new file mode 100644 index 00000000..e1975fc1 --- /dev/null +++ b/core/helm-charts/sglang/templates/service.yaml @@ -0,0 +1,26 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: v1 +kind: Service +metadata: + name: {{ include "sglang.fullname" . }} + namespace: {{ .Values.namespace }} + labels: + {{- include "sglang.labels" . | nindent 4 }} + {{- with .Values.service.labels }} + {{- toYaml . | nindent 4 }} + {{- end }} + {{- with .Values.service.annotations }} + annotations: + {{- toYaml . | nindent 4 }} + {{- end }} +spec: + type: {{ .Values.service.type }} + ports: + - name: http + port: {{ .Values.service.port }} + targetPort: http + protocol: TCP + selector: + {{- include "sglang.selectorLabels" . | nindent 4 }} diff --git a/core/helm-charts/sglang/values.yaml b/core/helm-charts/sglang/values.yaml new file mode 100644 index 00000000..59eb8afd --- /dev/null +++ b/core/helm-charts/sglang/values.yaml @@ -0,0 +1,173 @@ +# Copyright (C) 2025-2026 Intel Corporation +# SPDX-License-Identifier: Apache-2.0 + +# Default values for the sglang Helm chart +# Tuned for lmsysorg/sglang:v0.5.11-xeon serving openai/gpt-oss-20b on a Xeon CPU node. + +nameOverride: "" +fullnameOverride: "" + +namespace: default +replicaCount: 1 + +image: + repository: lmsysorg/sglang + tag: "v0.5.11-xeon" + pullPolicy: IfNotPresent + +imagePullSecrets: [] + +serviceAccount: + create: false + annotations: {} + name: "" + +podAnnotations: {} +podLabels: + app: sglang + +# SGLang processes write to HF cache + shared memory; do not lock the root FS. 
+podSecurityContext: + runAsNonRoot: false + fsGroup: 0 + +securityContext: + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + readOnlyRootFilesystem: false + +# ---- Model ---- +# modelSource is the HuggingFace model ID passed to --model-path. +# For gpt-oss-20b the default below should work on a Xeon node with >= 64Gi RAM. +modelSource: "openai/gpt-oss-20b" +# Logical name used in URL paths, service names, and the OpenAI `model` field. +modelName: "gpt-oss-20b" + +# HuggingFace Hub token. gpt-oss-20b is publicly downloadable, but private/gated +# variants need a token. Either: +# 1. --set huggingface.token=$HF_TOKEN (chart will create the secret), or +# 2. pre-create: kubectl create secret generic hf-token-secret --from-literal=token= +huggingface: + token: "" + secretName: "hf-token-secret" + secretKey: "token" + +# ---- Server / launch flags ---- +server: + # Container port sglang.launch_server binds to (default upstream is 30000). + port: 30000 + host: "0.0.0.0" + # Force CPU device for the xeon image. + device: "cpu" + # data-parallel / tensor-parallel sizes; CPU build typically runs tp=1, dp=1. + tpSize: 1 + dpSize: 1 + # Optional: context length cap. Leave empty to use the model default. + contextLength: "" + # Optional: max running requests in flight. + maxRunningRequests: "" + # dtype: bfloat16 is the recommended dtype for Xeon SGLang. + dtype: "bfloat16" + # Trust remote code from HF (gpt-oss requires this). + trustRemoteCode: true + # Any extra command-line flags appended verbatim, e.g. ["--mem-fraction-static", "0.85"]. 
+ extraArgs: [] + + livenessProbe: + enabled: true + httpGet: + path: /health + port: http + initialDelaySeconds: 600 + periodSeconds: 30 + timeoutSeconds: 10 + failureThreshold: 5 + readinessProbe: + enabled: true + httpGet: + path: /health + port: http + initialDelaySeconds: 120 + periodSeconds: 15 + timeoutSeconds: 10 + failureThreshold: 30 + +# ---- Resources ---- +# gpt-oss-20b in bfloat16 is ~40GB weights + KV cache + activations. +# These are starting points; tune to your Xeon node. +resources: + requests: + cpu: "16" + memory: "64Gi" + limits: + cpu: "32" + memory: "96Gi" + +# SGLang uses /dev/shm heavily for inter-process tensor sharing on CPU. +shm: + enabled: true + sizeLimit: "16Gi" + +# ---- Storage (HuggingFace cache) ---- +# PVC keeps the downloaded weights across pod restarts so you don't re-pull +# ~40GB every time. +storage: + persistentVolume: + enabled: true + storageClass: "" + accessMode: ReadWriteOnce + size: 80Gi + existingClaim: "" + deleteOnUninstall: true + emptyDir: + enabled: false + sizeLimit: 80Gi + +# Mount point inside the container for the HF cache (HF_HOME). +hfCacheMountPath: "/root/.cache/huggingface" + +# ---- Service ---- +service: + type: ClusterIP + port: 30000 + annotations: {} + labels: {} + +# ---- OIDC + APISIX + Ingress ---- +# Mirrors the OVMS chart so the auth-apisix stack picks this up the same way. 
+oidc: + enabled: true + realm: master + clientId: "my-client-id" + clientSecret: "tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR" + discovery: "http://keycloak.default.svc.cluster.local/realms/master/.well-known/openid-configuration" + introspectionEndpoint: "http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token/introspect" + +apisixRoute: + enabled: true + namespace: default + name: "" + host: "api.example.com" + +ingress: + enabled: true + className: nginx + namespace: auth-apisix + host: "api.example.com" + secretname: "api.example.com" + +secrets: + enabled: false + data: {} + +nodeSelector: {} +tolerations: [] +affinity: {} +priorityClassName: "" + +extraVolumes: [] +extraVolumeMounts: [] +extraEnv: [] +extraEnvFrom: [] From d549022f9a3ebcb4df04064998010ab8b5a049d6 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Fri, 22 May 2026 04:45:00 +0000 Subject: [PATCH 02/20] cld2labs/sglang-gpt-oss: full patch stack for gpt-oss-20b on sglang Xeon CPU Adds eight cumulative patches on top of lmsysorg/sglang:v0.5.12-xeon: - fix1: sgl-kernel rebuild with -mavx512bf16 / -mamx-bf16 / -mamx-tile. The published binary has 0 AVX-512 BF16 instructions, causing `tinygemm_kernel_nn: scalar path not implemented!` on the first bf16 forward pass. Genuine upstream bug. - fix2: register mxfp4 for CPU + extend GptOss attention-backend allowlist to include intel_amx / torch_native. - fix3: guard hardcoded .cuda() calls in gpt_oss.py weight loaders so CPU-only torch doesn't abort. - fix4: add `_process_weights_for_cpu` + `forward_cpu` to Mxfp4MoEMethod so MXFP4 weights are dequantized to bf16 and the MoE forward routes through CPU instead of triton_kernels. - fix5b: add sinks-attention forward (the gpt-oss-specific scalar added to softmax denominator) to torch_native_backend via an _sdpa_with_sinks wrapper. - fix6: route Mxfp4MoEMethod.apply through forward_cpu on CPU so the CPU path is actually reached from FusedMoE.run_moe_core. 
- fix7: self-contained MXFP4 dequantizer with MXFP4_NIBBLE_ORDER=low_first (gpt-oss's actual packing). Fixes random-vocab output that fix6 produced due to wrong nibble order. - fix8: delegate forward_cpu to moe_forward_native, which already handles gpt-oss's swiglu_gpt_oss_sigmoid_alpha + W13/W2 biases. Produces coherent output. Chart now serves: - Qwen2.5-7B end-to-end on Xeon with fix1 alone. - openai/gpt-oss-20b end-to-end on Xeon with the full fix1..fix8 stack (short-form coherent; long-form degrades into repetition due to accumulating numerical error in the pure-Python CPU MoE path). Build artifacts: - core/helm-charts/sglang/image-build/Dockerfile + 7 anchored patch scripts. `build-and-import.sh` installs docker, builds the image, and imports it into k3s containerd. - scripts/bootstrap-k3s.sh installs a single-node k3s + helm + kubectl on a fresh Ubuntu host for chart smoke-testing. Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/README.md | 172 +++++++++----- .../helm-charts/sglang/image-build/Dockerfile | 113 +++++++++ .../sglang/image-build/build-and-import.sh | 38 +++ .../image-build/enable-cpu-sinks-attention.py | 219 ++++++++++++++++++ .../enable-gpt-oss-cpu-dequant-v2.py | 204 ++++++++++++++++ .../image-build/enable-gpt-oss-cpu-loaders.py | 68 ++++++ .../image-build/enable-gpt-oss-cpu-moe-v2.py | 122 ++++++++++ .../image-build/enable-gpt-oss-cpu-moe.py | 190 +++++++++++++++ .../sglang/image-build/enable-gpt-oss-cpu.py | 84 +++++++ .../sglang/image-build/enable-mxfp4-cpu.py | 61 +++++ .../sglang/templates/deployment.yaml | 32 ++- core/helm-charts/sglang/values.yaml | 123 +++++++--- scripts/bootstrap-k3s.sh | 38 +++ 13 files changed, 1375 insertions(+), 89 deletions(-) create mode 100644 core/helm-charts/sglang/image-build/Dockerfile create mode 100755 core/helm-charts/sglang/image-build/build-and-import.sh create mode 100644 core/helm-charts/sglang/image-build/enable-cpu-sinks-attention.py create mode 100644 
core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-dequant-v2.py create mode 100644 core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-loaders.py create mode 100644 core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe-v2.py create mode 100644 core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe.py create mode 100644 core/helm-charts/sglang/image-build/enable-gpt-oss-cpu.py create mode 100644 core/helm-charts/sglang/image-build/enable-mxfp4-cpu.py create mode 100755 scripts/bootstrap-k3s.sh diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md index a9d7b03c..9f142817 100644 --- a/core/helm-charts/sglang/README.md +++ b/core/helm-charts/sglang/README.md @@ -1,78 +1,136 @@ # SGLang Helm Chart (Xeon CPU build) -Deploys an SGLang inference server using the `lmsysorg/sglang:v0.5.11-xeon` image, -defaulted to serve `openai/gpt-oss-20b` on a single Xeon CPU node. - -This chart follows the same standalone pattern as `core/helm-charts/ovms` — it is -not wired into the Ansible playbooks. You deploy it directly with `helm install`. +Deploys an [SGLang](https://github.com/sgl-project/sglang) inference server +using the `lmsysorg/sglang:v0.5.11-xeon` image on an Intel Xeon (AMX) CPU +node. Follows the same standalone pattern as `core/helm-charts/ovms` — it +is **not** wired into the Ansible playbooks. Deploy with `helm install`. + +## Supported models / quantizations + +This image's source explicitly limits CPU quantization to a small set +(`sglang/srt/layers/quantization/__init__.py`, `CPU_QUANTIZATION_METHODS`): + +| Quantization | Works on this image? 
| +| ------------------- | -------------------- | +| `fp8` | yes | +| `w8a8_int8` | yes | +| `compressed-tensors`| yes | +| `awq` | yes (`AWQCPUConfig`) | +| `gptq` | yes (`CPUGPTQConfig`)| +| **`mxfp4`** | **no — GPU only** | +| `modelopt_fp4` | no | +| anything else | no | + +Models that work out of the box on Xeon CPU: + +- `Qwen/Qwen3-8B` (bf16, default) — small, fast, no quantization gate +- `Qwen/Qwen2.5-7B-Instruct` / `Qwen/Qwen2.5-14B-Instruct` +- `meta-llama/Llama-3.1-8B-Instruct` (gated, needs HF token) +- `deepseek-ai/DeepSeek-V3.1-Terminus` channel-quantized variants + (e.g. `IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8` with + `--set server.quantization=w8a8_int8`) + +### gpt-oss-20b / gpt-oss-120b + +`openai/gpt-oss-*` is shipped natively in **MXFP4**, which is not +implemented for CPU in any sglang build to date — the `mxfp4` entry in +`BASE_QUANTIZATION_METHODS` is gated behind `is_cuda() or is_hip()`. This +chart will exit at startup with +`ValueError: Unknown quantization method: mxfp4` if you point it at gpt-oss. + +To serve gpt-oss-20b on Xeon CPU, use a different runtime — llama.cpp, +Ollama, vLLM CPU, or ipex-llm — with a GGUF variant (e.g. +`ggml-org/gpt-oss-20b-GGUF`, `unsloth/gpt-oss-20b-GGUF`, +`bartowski/openai_gpt-oss-20b-GGUF`). Not this chart. + +To serve gpt-oss-20b via sglang, use a GPU image (e.g. +`lmsysorg/sglang:v0.5.11-cuda`) on a CUDA host. The chart can be reused — +just override `image.tag` and `server.device=cuda`. ## Prerequisites -- A Kubernetes cluster with at least one Xeon worker node that has - - ~80GB free disk (model weights, ~40GB compressed in HF cache) - - ~96GB RAM available to the pod (bf16 weights + KV + activations) -- `helm` v3+ -- (Optional) HuggingFace token if you swap to a gated model. `openai/gpt-oss-20b` - itself is publicly downloadable. -- (Optional) The auth-apisix + keycloak + nginx-ingress stack from the rest of - this repo if you want OIDC-protected routing. 
If you just want to smoke-test - the model, disable those (see below). +- Kubernetes 1.24+ +- Helm 3+ +- For the gated-model recipes: HuggingFace token with read scope -## Quick test (no auth, port-forward) +## Quick start (smoke test, no auth) ```bash -helm upgrade --install gpt-oss-20b ./core/helm-charts/sglang \ +helm upgrade --install qwen3-8b core/helm-charts/sglang \ --set apisixRoute.enabled=false \ --set ingress.enabled=false \ --set oidc.enabled=false -kubectl wait --for=condition=Ready pod \ - -l app.kubernetes.io/instance=gpt-oss-20b --timeout=30m - -kubectl port-forward svc/gpt-oss-20b-sglang 30000:30000 +kubectl get pods -l app.kubernetes.io/instance=qwen3-8b -w +kubectl port-forward svc/qwen3-8b-sglang 30000:30000 & -# OpenAI-compatible smoke test +curl http://localhost:30000/v1/models curl http://localhost:30000/v1/chat/completions \ -H 'Content-Type: application/json' \ - -d '{ - "model": "gpt-oss-20b", - "messages": [{"role": "user", "content": "Say hi in five words."}] - }' -``` - -The first start downloads ~40GB of weights into the PVC. Subsequent restarts -reuse the cache. - -## Full deploy with auth (matches OVMS pattern) - -```bash -helm upgrade --install gpt-oss-20b ./core/helm-charts/sglang \ - --set huggingface.token=$HUGGINGFACE_TOKEN \ - --set ingress.host=api.example.com \ - --set apisixRoute.host=api.example.com + -d '{"model":"qwen3-8b","messages":[{"role":"user","content":"Say hi."}]}' ``` -The model is then reachable at `https://api.example.com/gpt-oss-20b-sglang/v1/...`. - -## Useful overrides - -| Flag | Default | Notes | -|---|---|---| -| `modelSource` | `openai/gpt-oss-20b` | Any HF model ID supported by SGLang | -| `modelName` | `gpt-oss-20b` | URL/route path + OpenAI `model` field | -| `server.tpSize` | `1` | Increase for multi-socket parallelism | -| `server.dtype` | `bfloat16` | `bfloat16` recommended on Xeon | -| `server.contextLength` | unset | Override model default context | -| `server.extraArgs` | `[]` | e.g. 
`'{--mem-fraction-static,0.85}'` | -| `resources.limits.memory` | `96Gi` | Bump up for longer contexts | -| `storage.persistentVolume.size` | `80Gi` | HF cache size on disk | -| `nodeSelector` | `{}` | Pin to a Xeon node label | - -## Uninstall +The default model is `Qwen/Qwen3-8B`. To swap models, override +`modelSource` and `modelName`: ```bash -helm uninstall gpt-oss-20b +helm upgrade --install llama-3-1-8b core/helm-charts/sglang \ + --set modelSource="meta-llama/Llama-3.1-8B-Instruct" \ + --set modelName="llama-3-1-8b" \ + --set huggingface.token=$HF_TOKEN ``` -PVC is deleted by default. Set `--set storage.persistentVolume.deleteOnUninstall=false` -to keep cached weights. +## Full deploy (with Keycloak/APISIX/Ingress) + +The chart's default values turn on the same OIDC+APISIX+Ingress wiring +that the OVMS chart uses, so a fully-provisioned Enterprise-Inference +cluster will route to this server at `https:///-sglang/*`. +For a stand-alone cluster, override the auth stack values per the smoke +test above. + +## Tuning for Xeon + +- `cpuEngine.ompThreadsBind`: pin SGLang's OMP threads per tp rank. For a + 2-rank tp on a 64-core node: + `--set server.tpSize=2 --set cpuEngine.ompThreadsBind="0-31|32-63"`. +- `server.enableTorchCompile=true`: large speedup, longer cold start. + Pair with `server.torchCompileMaxBs` (default 4). +- `server.quantization=w8a8_int8` with an int8-quantized checkpoint is + typically the sweet spot for throughput on Xeon AMX. +- Memory is the most common bottleneck. Set `resources.limits.memory` + to weights + KV cache + ~10Gi headroom. + +## Known upstream issue + +As of 2026-05, both `lmsysorg/sglang:v0.5.11-xeon` and `v0.5.12-xeon` +crash on the first forward pass with a `c10::Error` inside +`logits_processor._compute_lm_head`. 
We reproduced this with: + +- Qwen/Qwen2.5-7B-Instruct (`Qwen2ForCausalLM`) +- Qwen/Qwen3-8B (`Qwen3ForCausalLM`) +- `attention_backend=intel_amx` (default) and `=torch_native` +- with and without `LD_PRELOAD` baked in by the image + +The model loads, KV cache allocates, uvicorn serves `/model_info` 200 OK, +then the scheduler subprocess aborts during sglang's auto warmup-`/generate`. +That points at the CPU matmul kernel in the image rather than anything +the chart configures. Until the upstream image fixes it, this chart +cannot end-to-end-serve a request on Xeon. + +The chart is otherwise validated end-to-end: +- pod schedules, image pulls, PVC binds, Service routes +- `SGLANG_USE_CPU_ENGINE=1` → `attention_backend='intel_amx'` selected +- `--max-total-tokens` prevents the host-RAM-fraction OOM (sglang reads + host memory, not cgroup limits) +- weights and KV cache allocate cleanly within pod limits +- uvicorn starts and serves `/model_info` + +When the upstream bug is fixed (track sgl-project/sglang for AMX matmul +fixes on the xeon Dockerfile), no chart changes should be required. + +## References + +- [sglang CPU server docs](https://docs.sglang.io/platforms/cpu_server.html) +- `docker/xeon.Dockerfile` in the sglang repo — the canonical build recipe +- For gpt-oss-on-CPU: [llama.cpp guide](https://github.com/ggml-org/llama.cpp/discussions/15396), + [Ollama gpt-oss:20b](https://ollama.com/library/gpt-oss:20b) diff --git a/core/helm-charts/sglang/image-build/Dockerfile b/core/helm-charts/sglang/image-build/Dockerfile new file mode 100644 index 00000000..e21519e7 --- /dev/null +++ b/core/helm-charts/sglang/image-build/Dockerfile @@ -0,0 +1,113 @@ +# Custom sglang xeon image with two fixes layered onto the upstream image: +# 1. sgl-kernel rebuilt with AVX-512-BF16 / AMX flags so bf16 inference +# doesn't crash on the unimplemented tinygemm_kernel_nn stub. +# 2. 
mxfp4 quantization registered for CPU device so openai/gpt-oss-* +# can be loaded and served (it dequantizes to bf16 at weight-load +# time via gpt_oss._load_weights_mxfp4 → fp8_utils.dequant_mxfp4, +# which is pure PyTorch and CPU-friendly). +# +# Tested on Intel Xeon 6972P (Granite Rapids). +# Build: docker build -t enterprise-inference/sglang:v0.5.12-xeon-fix1 . + +FROM lmsysorg/sglang:v0.5.12-xeon + +# ---- 1) Rebuild sgl-kernel with proper CPU compile flags ---- +# The upstream image's published .so is compiled without -mavx512bf16, so +# the at::BFloat16 specialization of tinygemm_kernel_nn is effectively missing +# and falls through to a TORCH_CHECK(false, "scalar path not implemented!"). +# We rebuild it from the in-image source with the right flags. +ENV CMAKE_BF16_FLAGS="-march=sapphirerapids -mtune=native -mavx512f -mavx512bw -mavx512vl -mavx512dq -mavx512bf16 -mamx-bf16 -mamx-int8 -mamx-tile -O3 -DNDEBUG" + +RUN bash -lc '\ + set -ex; \ + source /opt/.venv/bin/activate; \ + UV=/root/.local/bin/uv; \ + "$UV" pip install --no-deps scikit-build-core ninja cmake setuptools wheel pyproject_metadata pathspec; \ + cd /sgl-workspace/sglang/sgl-kernel; \ + cp pyproject_cpu.toml pyproject.toml; \ + export CMAKE_CXX_FLAGS="$CMAKE_BF16_FLAGS"; \ + export CMAKE_C_FLAGS="$CMAKE_BF16_FLAGS"; \ + export CMAKE_BUILD_PARALLEL_LEVEL=64; \ + export SKBUILD_CMAKE_ARGS="-DCMAKE_CXX_FLAGS=$CMAKE_BF16_FLAGS;-DCMAKE_C_FLAGS=$CMAKE_BF16_FLAGS;-DCMAKE_BUILD_TYPE=Release"; \ + "$UV" pip install --force-reinstall --no-deps --no-build-isolation -v . 
2>&1 | tail -20; \ + SO=$(find /opt/.venv -name "common_ops*.so" | head -1); \ + echo "=== rebuilt $SO ==="; \ + ls -la "$SO"; \ + BF16=$(objdump -d "$SO" 2>/dev/null | grep -cE "vdpbf16ps|vfmadd.*bh" || true); \ + echo "AVX-512 BF16 instructions in rebuilt .so: $BF16"; \ + if [ "$BF16" -lt 100 ]; then echo "ERROR: rebuild did not emit BF16 instructions"; exit 1; fi \ +' + +# ---- 2) Patch quantization registration so mxfp4 works on CPU ---- +# The upstream code gates "mxfp4" behind is_cuda() or is_hip(); on CPU it +# never registers, and any model with quant_method=mxfp4 fails at config +# validation. The CPU dequant + bf16 forward path for gpt_oss already exists +# in the codebase (fp8_utils.dequant_mxfp4 + gpt_oss._load_weights_mxfp4) — +# the registration gate is the only missing piece. +COPY enable-mxfp4-cpu.py /tmp/enable-mxfp4-cpu.py +RUN /opt/.venv/bin/python3 /tmp/enable-mxfp4-cpu.py && rm /tmp/enable-mxfp4-cpu.py + +# Sanity check: after the patch, importing should not error and mxfp4 should +# be in QUANTIZATION_METHODS when SGLANG_USE_CPU_ENGINE=1. +RUN SGLANG_USE_CPU_ENGINE=1 /opt/.venv/bin/python3 -c "\ +from sglang.srt.layers.quantization import QUANTIZATION_METHODS, CPU_QUANTIZATION_METHODS; \ +assert 'mxfp4' in QUANTIZATION_METHODS, 'mxfp4 not in QUANTIZATION_METHODS'; \ +assert 'mxfp4' in CPU_QUANTIZATION_METHODS, 'mxfp4 not in CPU_QUANTIZATION_METHODS'; \ +print('OK: mxfp4 registered for CPU')" + +# ---- 3) Allow CPU attention backends for GptOssForCausalLM ---- +# server_args.py hardcodes an allowlist of attention backends for gpt-oss +# that omits CPU options. Patch it to default to intel_amx on CPU and to +# accept intel_amx / torch_native as valid backends. 
+COPY enable-gpt-oss-cpu.py /tmp/enable-gpt-oss-cpu.py +RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu.py && rm /tmp/enable-gpt-oss-cpu.py + +# ---- 4) Make gpt_oss.py weight loaders CPU-safe ---- +# The model file hardcodes `.cuda()` and `torch.cuda.empty_cache/synchronize` +# in its MXFP4 weight loader and dequant helper, which abort on CPU-only torch. +# Guard each call with `if torch.cuda.is_available():`. +COPY enable-gpt-oss-cpu-loaders.py /tmp/enable-gpt-oss-cpu-loaders.py +RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu-loaders.py && rm /tmp/enable-gpt-oss-cpu-loaders.py + +# ---- 5) Wire a CPU forward path into Mxfp4MoEMethod ---- +# Mxfp4MoEMethod ships only GPU branches (marlin/cutlass/flashinfer/aiter/ +# triton_kernels). This patch adds a CPU branch that dequantizes MXFP4 -> bf16 +# at weight-loading time and then routes the MoE forward through +# `torch.ops.sgl_kernel.fused_experts_cpu` (the same kernel the unquantized +# bf16 MoE method already uses in unquant.py:forward_cpu). +COPY enable-gpt-oss-cpu-moe.py /tmp/enable-gpt-oss-cpu-moe.py +RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu-moe.py && rm /tmp/enable-gpt-oss-cpu-moe.py + +# ---- 6) Add sinks-attention support to torch_native_backend ---- +# gpt-oss uses sink attention (a learnable per-head scalar added to the softmax +# denominator). No CPU backend in sglang supports the `sinks` kwarg today. +# This patch adds it to torch_native_backend with the exact math sglang's own +# triton kernel uses (extend_attention.py lines 535-537). +COPY enable-cpu-sinks-attention.py /tmp/enable-cpu-sinks-attention.py +RUN /opt/.venv/bin/python3 /tmp/enable-cpu-sinks-attention.py && rm /tmp/enable-cpu-sinks-attention.py + +# ---- 7) Replace _process_weights_for_cpu with self-contained dequant ---- +# fix6 produced /generate 200 but with random-vocab output — classic signature +# of corrupted weights. 
Hypothesis: MXFP4 nibble packing order in gpt-oss's +# storage doesn't match what MXFP4QuantizeUtil uses. This patch swaps the +# dequant to a self-contained function with explicit control over nibble +# order via MXFP4_NIBBLE_ORDER env var ("low_first" is correct for gpt-oss +# per the numerical sanity check on layer-0 weight magnitudes). +COPY enable-gpt-oss-cpu-dequant-v2.py /tmp/enable-gpt-oss-cpu-dequant-v2.py +RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu-dequant-v2.py && rm /tmp/enable-gpt-oss-cpu-dequant-v2.py + +# ---- 8) Route Mxfp4MoEMethod.forward_cpu through moe_forward_native ---- +# After fix7 dequant produces sane weights, but fused_experts_cpu uses plain +# silu(gate)*up with no biases, no alpha, no clamp — wrong activation for +# gpt-oss → gibberish output. moe_forward_native is sglang's pure-Python MoE +# reference that already handles gpt-oss-specific swiglu (alpha + clamp + +# interleaved gate/up + (up+1)) and W13/W2 biases. Also strips the AMX-pack +# call because moe_forward_native uses F.linear on un-packed bf16 weights. +COPY enable-gpt-oss-cpu-moe-v2.py /tmp/enable-gpt-oss-cpu-moe-v2.py +RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu-moe-v2.py && rm /tmp/enable-gpt-oss-cpu-moe-v2.py + +# Mirror the upstream env vars so behavior is unchanged +ENV SGLANG_USE_CPU_ENGINE=1 +ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:/usr/lib/x86_64-linux-gnu/libtbbmalloc.so:/opt/.venv/lib/libiomp5.so + +WORKDIR /sgl-workspace/sglang diff --git a/core/helm-charts/sglang/image-build/build-and-import.sh b/core/helm-charts/sglang/image-build/build-and-import.sh new file mode 100755 index 00000000..6b36ce5b --- /dev/null +++ b/core/helm-charts/sglang/image-build/build-and-import.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash +# One-shot script to build the patched sglang xeon image and import it +# into the k3s containerd cache so the chart can use it without a registry. 
+# +# Run with: sudo bash core/helm-charts/sglang/image-build/build-and-import.sh +set -euo pipefail + +IMAGE_TAG="${IMAGE_TAG:-enterprise-inference/sglang:v0.5.12-xeon-fix8}" +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" + +echo "==> Ensuring docker is installed" +if ! command -v docker >/dev/null 2>&1; then + apt-get update + DEBIAN_FRONTEND=noninteractive apt-get install -y docker.io + systemctl enable --now docker +fi +docker version --format 'Server: {{.Server.Version}}' + +echo "==> Building $IMAGE_TAG" +cd "$SCRIPT_DIR" +docker build -t "$IMAGE_TAG" . + +echo "==> Importing into k3s containerd" +# k3s ships its own containerd; piping a docker-save into k3s ctr image import +# makes the image directly available to k3s pods (no registry required). +docker save "$IMAGE_TAG" | k3s ctr images import - + +echo "==> Verifying" +k3s ctr images ls -q | grep -F "$IMAGE_TAG" || { + echo "Imported image not found in k3s containerd" + exit 1 +} + +echo +echo "==> Done. Use in chart with:" +echo " --set image.repository=${IMAGE_TAG%:*}" +echo " --set image.tag=${IMAGE_TAG##*:}" +echo " --set image.pullPolicy=Never" diff --git a/core/helm-charts/sglang/image-build/enable-cpu-sinks-attention.py b/core/helm-charts/sglang/image-build/enable-cpu-sinks-attention.py new file mode 100644 index 00000000..20aca592 --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-cpu-sinks-attention.py @@ -0,0 +1,219 @@ +"""Add sinks-attention forward support to torch_native_backend. + +gpt-oss uses sink attention (a learnable per-head scalar added to the softmax +denominator). sglang's GPU kernels (triton, fa3, trtllm_mha, aiter) accept a +`sinks` kwarg in their `forward_extend` / `forward_decode`, but none of the +CPU backends do (`intel_amx`, `torch_native`). + +This patch teaches `TorchNativeAttnBackend` to accept and apply sinks. 
The +math is exactly what sglang's own triton kernel does +(see srt/layers/attention/triton_ops/extend_attention.py lines 535-537): + + deno += exp(cur_sink - e_max) + +i.e. a fake extra "row" with logit = sinks[h] is included in the softmax +denominator but excluded from the value-weighted sum. With sinks the +attention probabilities sum to <1. + +Implementation: when sinks is provided, bypass PyTorch's SDPA (which has no +sinks API) and do attention manually in ~15 lines. Falls back to SDPA fast +path when sinks is None (zero perf cost for non-sink models). +""" + +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/attention/torch_native_backend.py" +) +src = F.read_text() +original = src + +# 1) Add sinks=None kwarg to forward_extend and forward_decode signatures, and +# plumb it through to the SDPA wrapper. +src = src.replace( + " def forward_extend(\n" + " self,\n" + " q,\n" + " k,\n" + " v,\n" + " layer: RadixAttention,\n" + " forward_batch: ForwardBatch,\n" + " save_kv_cache=True,\n" + " ):\n", + " def forward_extend(\n" + " self,\n" + " q,\n" + " k,\n" + " v,\n" + " layer: RadixAttention,\n" + " forward_batch: ForwardBatch,\n" + " save_kv_cache=True,\n" + " sinks=None,\n" + " ):\n" + " self._sinks = sinks\n", +) + +src = src.replace( + " def forward_decode(\n" + " self,\n" + " q,\n" + " k,\n" + " v,\n" + " layer: RadixAttention,\n" + " forward_batch: ForwardBatch,\n" + " save_kv_cache=True,\n" + " ):\n", + " def forward_decode(\n" + " self,\n" + " q,\n" + " k,\n" + " v,\n" + " layer: RadixAttention,\n" + " forward_batch: ForwardBatch,\n" + " save_kv_cache=True,\n" + " sinks=None,\n" + " ):\n" + " self._sinks = sinks\n", +) + +# 2) Replace the SDPA call(s) inside _run_sdpa_forward_extend / _run_sdpa_forward_decode +# with our sinks-aware wrapper. The wrapper is appended as a module-level +# function and the existing call sites are routed through it. 
+# +# We do this by injecting a helper function near the top of the module and +# monkey-patching torch.nn.functional.scaled_dot_product_attention's local +# import to point at it inside this file. Cleanest: append the helper, then +# swap the SDPA call inside the class methods. + +# Inject the wrapper right after the existing imports block. +WRAPPER = ''' + +# ---- sinks-aware SDPA wrapper (added by enable-cpu-sinks-attention.py) ---- +import math as _math +def _sdpa_with_sinks(query, key, value, *, attn_mask=None, dropout_p=0.0, + is_causal=False, scale=None, enable_gqa=False, + sinks=None): + """Forward-only scaled_dot_product_attention with optional sinks. + + When sinks is None this is equivalent to torch's SDPA. + When sinks is a (H,) tensor of per-head scalars, the softmax denominator + is augmented by exp(sinks[h] - row_max) — i.e. an attention sink. + """ + if sinks is None: + return torch.nn.functional.scaled_dot_product_attention( + query, key, value, + attn_mask=attn_mask, dropout_p=dropout_p, + is_causal=is_causal, scale=scale, enable_gqa=enable_gqa, + ) + + # Manual attention path with sinks + # query/key/value: (B, H_q, Sq, D) and (B, H_kv, Sk, D) + if scale is None: + scale = 1.0 / _math.sqrt(query.shape[-1]) + + if enable_gqa and key.shape[-3] != query.shape[-3]: + # repeat KV heads to match Q heads + rep = query.shape[-3] // key.shape[-3] + key = key.repeat_interleave(rep, dim=-3) + value = value.repeat_interleave(rep, dim=-3) + + # scores: (B, H, Sq, Sk) + scores = torch.matmul(query, key.transpose(-2, -1)) * scale + + if is_causal: + Sq, Sk = scores.shape[-2], scores.shape[-1] + causal_mask = torch.ones(Sq, Sk, dtype=torch.bool, device=scores.device).tril( + diagonal=Sk - Sq + ) + scores = scores.masked_fill(~causal_mask, float("-inf")) + if attn_mask is not None: + if attn_mask.dtype == torch.bool: + scores = scores.masked_fill(~attn_mask, float("-inf")) + else: + scores = scores + attn_mask + + # Stable softmax with sinks + row_max = 
scores.amax(dim=-1, keepdim=True) + row_max = row_max.masked_fill(row_max == float("-inf"), 0.0) + exp_scores = torch.exp(scores - row_max) + # sinks: (H,) -> broadcast to (1, H, 1) so sink_exp is (B, H, Sq) + sinks_t = sinks.to(scores.dtype).to(scores.device).view(1, -1, 1) + sink_exp = torch.exp(sinks_t - row_max.squeeze(-1)) + denom = exp_scores.sum(dim=-1) + sink_exp # (B, H, Sq) + attn_weights = exp_scores / denom.unsqueeze(-1) + if dropout_p > 0.0: + attn_weights = torch.nn.functional.dropout(attn_weights, p=dropout_p) + return torch.matmul(attn_weights, value) +# ---- end sinks wrapper ---- +''' + +# Place the wrapper just after the last `from ... import ...` block. Simple anchor. +anchor_for_wrapper = "class TorchNativeAttnBackend(AttentionBackend):" +if anchor_for_wrapper not in src: + print("ERROR: class anchor not found", file=sys.stderr) + sys.exit(1) +src = src.replace( + anchor_for_wrapper, + WRAPPER + "\n" + anchor_for_wrapper, + 1, +) + +# 3) Route the existing SDPA calls through _sdpa_with_sinks with the stored sink. +# The class has at least two call sites for `scaled_dot_product_attention` +# inside _run_sdpa_forward_extend / _run_sdpa_forward_decode. Both fully- +# qualified and bare-name (imported) forms appear. Rewrite both. +src = src.replace( + "torch.nn.functional.scaled_dot_product_attention(", + "_sdpa_with_sinks(", +) +# The bare form: the file does `from torch.nn.functional import scaled_dot_product_attention` +# and calls it directly. Match those too. Use a word boundary via the preceding +# whitespace + name to avoid matching the import line itself. +import re as _re +src = _re.sub( + r"(? bf16 dequant. + +After fix4-fix6 the gpt-oss-20b pipeline ran end-to-end and returned 200, but +the generated tokens were random vocabulary — the classic signature of +corrupted weights producing essentially random logits. 
The dequant math in +`MXFP4QuantizeUtil.dequantize` is OCP-spec-compliant, but there is one +implementation choice that differs in the wild: the **nibble packing order** +inside each uint8. + +`MXFP4QuantizeUtil` uses: + even index <- low 4 bits + odd index <- high 4 bits + +while triton_kernels / NVIDIA's reference uses: + even index <- high 4 bits + odd index <- low 4 bits + +If gpt-oss is stored with the latter convention, our previous dequant +swapped every (even, odd) pair, producing structurally garbage weights. + +This patch: + +1. Inlines a self-contained `_dequant_mxfp4_cpu` function that: + - Has explicit control over nibble order via `MXFP4_NIBBLE_ORDER` env var + ("low_first" or "high_first"; default "high_first", the triton_kernels + convention; the image-build Dockerfile reports "low_first" as correct for gpt-oss) + - Logs basic stats (nibble order, shape, min/max/mean abs) so we can verify the + dequantized weights look sane +2. Calls it from `_process_weights_for_cpu` instead of MXFP4QuantizeUtil. + +The function is conservative: it only changes the nibble extraction logic; +sign/magnitude/E2M1/scale math is identical to MXFP4QuantizeUtil. +""" + +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/mxfp4.py" +) +src = F.read_text() +original = src + +# Replace the body of _process_weights_for_cpu and add a helper. +# Anchor: the full helper as written by fix4's enable-gpt-oss-cpu-moe.py.
+old_helper = ( + " def _process_weights_for_cpu(self, layer):\n" + " \"\"\"Dequantize MXFP4 -> bf16 then AMX-pack for fused_experts_cpu.\n" + "\n" + " Layer params after this call:\n" + " - layer.w13_weight: bf16, AMX-packed, shape (E, 2*N, K)\n" + " - layer.w2_weight: bf16, AMX-packed, shape (E, K, N)\n" + " - layer.w13_weight_scale / w2_weight_scale: deleted\n" + " \"\"\"\n" + " import torch\n" + " from torch.nn import Parameter\n" + " from sglang.srt.layers.quantization.mxfp4_tensor import (\n" + " MXFP4QuantizeUtil,\n" + " )\n" + " from sglang.srt.layers.amx_utils import (\n" + " _amx_process_weight_after_loading,\n" + " )\n" + "\n" + " def _dequant(weight, scale):\n" + " return MXFP4QuantizeUtil.dequantize(\n" + " quantized_data=weight,\n" + " dtype=torch.bfloat16,\n" + " scale=scale,\n" + " block_sizes=[32],\n" + " )\n" + "\n" + " w13_bf16 = _dequant(layer.w13_weight, layer.w13_weight_scale)\n" + " w2_bf16 = _dequant(layer.w2_weight, layer.w2_weight_scale)\n" + "\n" + " del layer.w13_weight\n" + " del layer.w2_weight\n" + " del layer.w13_weight_scale\n" + " del layer.w2_weight_scale\n" + " layer.w13_weight = Parameter(w13_bf16.contiguous(), requires_grad=False)\n" + " layer.w2_weight = Parameter(w2_bf16.contiguous(), requires_grad=False)\n" + "\n" + " _amx_process_weight_after_loading(layer, [\"w13_weight\", \"w2_weight\"])\n" +) + +new_helper = ''' def _process_weights_for_cpu(self, layer): + """Dequantize MXFP4 -> bf16 then AMX-pack for fused_experts_cpu. 
+ + Layer params after this call: + - layer.w13_weight: bf16, AMX-packed, shape (E, 2*N, K) + - layer.w2_weight: bf16, AMX-packed, shape (E, K, N) + - layer.w13_weight_scale / w2_weight_scale: deleted + """ + import os + import torch + from torch.nn import Parameter + from sglang.srt.layers.amx_utils import ( + _amx_process_weight_after_loading, + ) + + nibble_order = os.environ.get("MXFP4_NIBBLE_ORDER", "high_first").lower() + + # E2M1 lookup table (OCP MXFP4 spec) + _E2M1 = torch.tensor( + [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], + dtype=torch.float32, + ) + + def _dequant_mxfp4_cpu(weight_packed, scale_e8m0): + """Dequantize MXFP4 packed uint8 weights to bf16. + + weight_packed: (..., K_packed) uint8, where K_packed = K / 2 + (2 mxfp4 values per uint8 byte) + scale_e8m0: (..., K_blocks) uint8, where K_blocks = K / 32 + (one E8M0 scale per 32 elements) + + Returns: (..., K) bf16 + """ + assert weight_packed.dtype == torch.uint8 + assert scale_e8m0.dtype == torch.uint8 + device = weight_packed.device + e2m1 = _E2M1.to(device) + + # Extract the two nibbles per byte + low_nibble = (weight_packed & 0x0F) # bits 3:0 + high_nibble = (weight_packed >> 4) & 0x0F # bits 7:4 + + # Interleave to undo the packing + shape = list(weight_packed.shape) + shape[-1] = shape[-1] * 2 + unfused = torch.empty(shape, dtype=torch.uint8, device=device) + if nibble_order == "low_first": + # MXFP4QuantizeUtil convention: even <- low, odd <- high + unfused[..., 0::2] = low_nibble + unfused[..., 1::2] = high_nibble + else: + # triton_kernels / NVIDIA reference convention: + # even <- high, odd <- low + unfused[..., 0::2] = high_nibble + unfused[..., 1::2] = low_nibble + + # E2M1: bit 3 = sign, bits 2:0 = magnitude index + sign = 1.0 - 2.0 * ((unfused >> 3) & 1).float() + magnitude_idx = (unfused & 0x07).long() + values = e2m1[magnitude_idx] * sign + + # Apply E8M0 scale: each scale covers 32 consecutive elements + *batch_dims, K = values.shape + K_blocks = scale_e8m0.shape[-1] + if K != 
K_blocks * 32: + raise ValueError( + f"dequant shape mismatch: dequantized K={K}, " + f"K_blocks*32={K_blocks*32} from scale shape {tuple(scale_e8m0.shape)}" + ) + values = values.view(*batch_dims, K_blocks, 32) + scale_f = torch.exp2(scale_e8m0.float() - 127.0).unsqueeze(-1) + out = (values * scale_f).view(*batch_dims, K).to(torch.bfloat16) + return out + + import logging as _logging + _log = _logging.getLogger(__name__) + + w13_bf16 = _dequant_mxfp4_cpu(layer.w13_weight, layer.w13_weight_scale) + w2_bf16 = _dequant_mxfp4_cpu(layer.w2_weight, layer.w2_weight_scale) + + # One-line sanity log so we can see if the dequantized values look sane. + # Healthy bf16 model weights typically have |w| in [1e-3, ~1.0]; gibberish- + # producing weights often show abs-mean either suspiciously huge or near 0. + _log.info( + "[mxfp4-cpu-dequant] nibble_order=%s w13: shape=%s abs(min=%.4g, max=%.4g, mean=%.4g) " + "w2: shape=%s abs(min=%.4g, max=%.4g, mean=%.4g)", + nibble_order, + tuple(w13_bf16.shape), + float(w13_bf16.abs().min()), + float(w13_bf16.abs().max()), + float(w13_bf16.abs().float().mean()), + tuple(w2_bf16.shape), + float(w2_bf16.abs().min()), + float(w2_bf16.abs().max()), + float(w2_bf16.abs().float().mean()), + ) + + del layer.w13_weight + del layer.w2_weight + del layer.w13_weight_scale + del layer.w2_weight_scale + layer.w13_weight = Parameter(w13_bf16.contiguous(), requires_grad=False) + layer.w2_weight = Parameter(w2_bf16.contiguous(), requires_grad=False) + + _amx_process_weight_after_loading(layer, ["w13_weight", "w2_weight"]) +''' + +if old_helper not in src: + print("ERROR: old _process_weights_for_cpu helper not found " + "(was fix4 applied?)", file=sys.stderr) + sys.exit(1) +src = src.replace(old_helper, new_helper) + +if src == original: + print("ERROR: nothing was patched", file=sys.stderr) + sys.exit(1) + +F.write_text(src) +print(f"Patched {F}") diff --git a/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-loaders.py 
b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-loaders.py new file mode 100644 index 00000000..51ef69b6 --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-loaders.py @@ -0,0 +1,68 @@ +"""Patch gpt_oss.py to make its weight-loading paths CPU-safe. + +The model file hard-codes a handful of `.cuda()` / `torch.cuda.*` calls +in the MXFP4 weight loader and the dequant helper. On a CPU-only torch +those fail with `AssertionError: Torch not compiled with CUDA enabled`. + +We guard each call so it becomes a no-op on CPU and behaves exactly as +before on a CUDA host. + +Patched call sites: + - _load_mxfp4_experts_weights: weight = weight.cuda() + - set_embed_and_head: torch.cuda.empty_cache(); torch.cuda.synchronize() + - _dequant_mlp_weight: w_blocks = w_blocks.cuda(); w_scales = w_scales.cuda() +""" + +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/models/gpt_oss.py" +) +src = F.read_text() +original = src + +substitutions = [ + # _load_mxfp4_experts_weights: weight = weight.cuda() + ( + " for name, weight in weights:\n" + " weight = weight.cuda()\n", + " for name, weight in weights:\n" + " if torch.cuda.is_available():\n" + " weight = weight.cuda()\n", + ), + # set_embed_and_head: torch.cuda.empty_cache / synchronize + ( + " self.lm_head.weight = head\n" + " torch.cuda.empty_cache()\n" + " torch.cuda.synchronize()\n", + " self.lm_head.weight = head\n" + " if torch.cuda.is_available():\n" + " torch.cuda.empty_cache()\n" + " torch.cuda.synchronize()\n", + ), + # _dequant_mlp_weight: w_blocks / w_scales .cuda() + ( + " w_blocks = w_blocks.cuda()\n" + " w_scales = w_scales.cuda()\n", + " if torch.cuda.is_available():\n" + " w_blocks = w_blocks.cuda()\n" + " w_scales = w_scales.cuda()\n", + ), +] + +for needle, replacement in substitutions: + if needle not in src: + print( + f"ERROR: patch site not found:\n---\n{needle}---", + file=sys.stderr, + ) + sys.exit(1) + src = 
src.replace(needle, replacement) + +if src == original: + print("ERROR: nothing was patched", file=sys.stderr) + sys.exit(1) + +F.write_text(src) +print(f"Patched {F}") diff --git a/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe-v2.py b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe-v2.py new file mode 100644 index 00000000..686403d3 --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe-v2.py @@ -0,0 +1,122 @@ +"""Reroute Mxfp4MoEMethod's CPU forward through sglang's reference +``moe_forward_native`` instead of ``fused_experts_cpu``. + +After fix7 we got gpt-oss-20b past dequant with sane numerics (low_first +nibble order), but the output was still gibberish. The cause: gpt-oss uses +a custom Swish-GLU activation: + + gate, up = x[..., ::2], x[..., 1::2] # INTERLEAVED gate/up + gate = gate.clamp(max=gemm1_limit) + up = up.clamp(min=-gemm1_limit, max=gemm1_limit) + out = gate * sigmoid(gate * gemm1_alpha) * (up + 1) + +plus per-expert biases on both W13 and W2. ``fused_experts_cpu`` only +implements plain ``silu(gate) * up`` with no alpha, no clamp, no biases. + +sglang already has a pure-PyTorch reference that handles all of this: +``sglang.srt.layers.moe.fused_moe_native.moe_forward_native``. It calls +``swiglu_gpt_oss_sigmoid_alpha`` (pure torch with @torch.compile) when +``gemm1_alpha`` is set, and adds W13/W2 biases when present on the layer. + +This patch: + +1. Removes the ``_amx_process_weight_after_loading`` call from + ``_process_weights_for_cpu`` — we no longer need AMX-packed weights + because ``moe_forward_native`` uses ``F.linear`` and ``torch.einsum`` + on plain bf16 weights. +2. Rewrites ``forward_cpu`` to delegate to ``moe_forward_native``. +""" + +import re +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/mxfp4.py" +) +src = F.read_text() +original = src + +# 1. Strip the AMX-pack call from _process_weights_for_cpu. 
+src = src.replace( + " _amx_process_weight_after_loading(layer, [\"w13_weight\", \"w2_weight\"])\n", + " # _amx_process_weight_after_loading skipped: moe_forward_native uses\n" + " # plain F.linear / einsum, which expect un-packed (E, OUT, IN) bf16.\n", +) + +# 2. Replace the body of forward_cpu with a delegation to moe_forward_native. +# Anchor on the full forward_cpu added by fix4's enable-gpt-oss-cpu-moe.py. +old_forward = ( + " def forward_cpu(self, layer, dispatch_output):\n" + " \"\"\"Mirrors unquant.py:UnquantizedFusedMoEMethod.forward_cpu.\n" + "\n" + " After _process_weights_for_cpu has run, the layer's weights are\n" + " plain bf16 AMX-packed tensors, so the CPU MoE kernel can serve\n" + " them with the UNQUANT quant method.\n" + " \"\"\"\n" + " import torch\n" + " from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput\n" + " from sglang.srt.layers.moe.topk import apply_topk_weights_cpu\n" + " from sglang.srt.layers.amx_utils import CPUQuantMethod\n" + "\n" + " x = dispatch_output.hidden_states\n" + " topk_output = dispatch_output.topk_output\n" + "\n" + " topk_weights, topk_ids, _ = topk_output\n" + " x, topk_weights = apply_topk_weights_cpu(\n" + " self.moe_runner_config.apply_router_weight_on_input,\n" + " topk_weights,\n" + " x,\n" + " )\n" + " output = torch.ops.sgl_kernel.fused_experts_cpu(\n" + " x,\n" + " layer.w13_weight,\n" + " layer.w2_weight,\n" + " topk_weights,\n" + " topk_ids,\n" + " False, # inplace\n" + " CPUQuantMethod.UNQUANT,\n" + " None, # w1_scale\n" + " None, # w2_scale\n" + " None, # w1_zp\n" + " None, # w2_zp\n" + " None, # block_size\n" + " True, # is_vnni\n" + " )\n" + " return StandardCombineInput(hidden_states=output)\n" +) + +new_forward = ( + " def forward_cpu(self, layer, dispatch_output):\n" + " \"\"\"CPU MoE forward via moe_forward_native (gpt-oss-aware).\n" + "\n" + " Uses sglang's reference pure-PyTorch MoE forward, which handles:\n" + " - W13 / W2 biases (gpt-oss has both)\n" + " - The 
gpt-oss-specific swiglu variant\n" + " (interleaved gate/up + sigmoid(alpha * gate) + clamp + (up+1))\n" + " when ``moe_runner_config.gemm1_alpha`` is set.\n" + " \"\"\"\n" + " from sglang.srt.layers.moe.fused_moe_native import moe_forward_native\n" + " from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput\n" + "\n" + " output = moe_forward_native(\n" + " layer,\n" + " dispatch_output.hidden_states,\n" + " dispatch_output.topk_output,\n" + " self.moe_runner_config,\n" + " )\n" + " return StandardCombineInput(hidden_states=output)\n" +) + +if old_forward not in src: + print("ERROR: old forward_cpu not found (fix4 may have been changed)", + file=sys.stderr) + sys.exit(1) +src = src.replace(old_forward, new_forward) + +if src == original: + print("ERROR: nothing was patched", file=sys.stderr) + sys.exit(1) + +F.write_text(src) +print(f"Patched {F}") diff --git a/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe.py b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe.py new file mode 100644 index 00000000..fa269d25 --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu-moe.py @@ -0,0 +1,190 @@ +"""Add a CPU forward path to sglang.srt.layers.quantization.mxfp4.Mxfp4MoEMethod. + +Upstream `Mxfp4MoEMethod` only ships GPU branches (Marlin, FlashInfer cutlass +SM90, FlashInfer TRT-LLM SM100, AMD aiter, NVIDIA triton_kernels). On CPU, +both its `process_weights_after_loading` and `apply` raise (the former tries +to `import triton_kernels`; the latter has no CPU branch at all). + +This patch: + +1. Adds a CPU branch at the top of `process_weights_after_loading` that: + a. Dequantizes the MXFP4-packed `w13_weight` / `w2_weight` to bf16 + using the pure-PyTorch `MXFP4QuantizeUtil.dequantize` helper that + already ships in `mxfp4_tensor.py`. + b. 
Calls `_amx_process_weight_after_loading` (the same helper that the + bf16 unquantized MoE method uses in `unquant.py:process_weights_after_loading`) + to AMX-pack the bf16 weights for `fused_experts_cpu`. + c. Returns early so none of the CUDA-only branches run. + +2. Adds a `forward_cpu` method that mirrors the unquantized bf16 MoE method's + CPU forward path (`unquant.py:forward_cpu`) verbatim — apply_topk_weights_cpu, + then `torch.ops.sgl_kernel.fused_experts_cpu(..., CPUQuantMethod.UNQUANT, ...)`. + +After this patch the weights are stored as bf16 inside the layer (the MXFP4 +packed storage is replaced), so the existing CPU `fused_experts_cpu` AMX +kernel handles them like any other bf16 MoE. +""" + +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/mxfp4.py" +) +src = F.read_text() +original = src + +# ----- 1. Insert CPU branch + helper into Mxfp4MoEMethod.process_weights_after_loading ----- +# +# Anchor on the first line of the existing method body. We prepend a CPU +# branch that does the dequant + AMX pack, then returns. Existing logic +# (marlin / cutlass / flashinfer / triton_kernels / `torch.cuda.empty_cache`) +# is untouched on GPU. + +needle_pwal = ( + " def process_weights_after_loading(self, layer):\n" + " if self.use_marlin:\n" +) +replacement_pwal = ( + " def process_weights_after_loading(self, layer):\n" + " # ---- CPU branch added by enable-gpt-oss-cpu-moe.py ----\n" + " from sglang.srt.utils import is_cpu, cpu_has_amx_support\n" + " if is_cpu() and cpu_has_amx_support():\n" + " self._process_weights_for_cpu(layer)\n" + " return\n" + " # ---- end CPU branch ----\n" + " if self.use_marlin:\n" +) +if needle_pwal not in src: + print("ERROR: process_weights_after_loading anchor not found", file=sys.stderr) + sys.exit(1) +src = src.replace(needle_pwal, replacement_pwal) + +# ----- 2. 
Add the _process_weights_for_cpu helper + forward_cpu method right +# BEFORE the `def apply(` of Mxfp4MoEMethod (so they live on the class). +# Anchor on the exact apply signature we read from the running image. +needle_apply = ( + " def apply(\n" + " self,\n" + " layer: torch.nn.Module,\n" + " dispatch_output: StandardDispatchOutput,\n" + " ) -> CombineInput:\n" +) +new_methods = ( + " # ---- CPU methods added by enable-gpt-oss-cpu-moe.py ----\n" + " def _process_weights_for_cpu(self, layer):\n" + " \"\"\"Dequantize MXFP4 -> bf16 then AMX-pack for fused_experts_cpu.\n" + "\n" + " Layer params after this call:\n" + " - layer.w13_weight: bf16, AMX-packed, shape (E, 2*N, K)\n" + " - layer.w2_weight: bf16, AMX-packed, shape (E, K, N)\n" + " - layer.w13_weight_scale / w2_weight_scale: deleted\n" + " \"\"\"\n" + " import torch\n" + " from torch.nn import Parameter\n" + " from sglang.srt.layers.quantization.mxfp4_tensor import (\n" + " MXFP4QuantizeUtil,\n" + " )\n" + " from sglang.srt.layers.amx_utils import (\n" + " _amx_process_weight_after_loading,\n" + " )\n" + "\n" + " def _dequant(weight, scale):\n" + " return MXFP4QuantizeUtil.dequantize(\n" + " quantized_data=weight,\n" + " dtype=torch.bfloat16,\n" + " scale=scale,\n" + " block_sizes=[32],\n" + " )\n" + "\n" + " w13_bf16 = _dequant(layer.w13_weight, layer.w13_weight_scale)\n" + " w2_bf16 = _dequant(layer.w2_weight, layer.w2_weight_scale)\n" + "\n" + " del layer.w13_weight\n" + " del layer.w2_weight\n" + " del layer.w13_weight_scale\n" + " del layer.w2_weight_scale\n" + " layer.w13_weight = Parameter(w13_bf16.contiguous(), requires_grad=False)\n" + " layer.w2_weight = Parameter(w2_bf16.contiguous(), requires_grad=False)\n" + "\n" + " _amx_process_weight_after_loading(layer, [\"w13_weight\", \"w2_weight\"])\n" + "\n" + " def forward_cpu(self, layer, dispatch_output):\n" + " \"\"\"Mirrors unquant.py:UnquantizedFusedMoEMethod.forward_cpu.\n" + "\n" + " After _process_weights_for_cpu has run, the layer's weights 
are\n" + " plain bf16 AMX-packed tensors, so the CPU MoE kernel can serve\n" + " them with the UNQUANT quant method.\n" + " \"\"\"\n" + " import torch\n" + " from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput\n" + " from sglang.srt.layers.moe.topk import apply_topk_weights_cpu\n" + " from sglang.srt.layers.amx_utils import CPUQuantMethod\n" + "\n" + " x = dispatch_output.hidden_states\n" + " topk_output = dispatch_output.topk_output\n" + "\n" + " topk_weights, topk_ids, _ = topk_output\n" + " x, topk_weights = apply_topk_weights_cpu(\n" + " self.moe_runner_config.apply_router_weight_on_input,\n" + " topk_weights,\n" + " x,\n" + " )\n" + " output = torch.ops.sgl_kernel.fused_experts_cpu(\n" + " x,\n" + " layer.w13_weight,\n" + " layer.w2_weight,\n" + " topk_weights,\n" + " topk_ids,\n" + " False, # inplace\n" + " CPUQuantMethod.UNQUANT,\n" + " None, # w1_scale\n" + " None, # w2_scale\n" + " None, # w1_zp\n" + " None, # w2_zp\n" + " None, # block_size\n" + " True, # is_vnni\n" + " )\n" + " return StandardCombineInput(hidden_states=output)\n" + " # ---- end CPU methods ----\n" + "\n" +) +replacement_apply = new_methods + needle_apply +if needle_apply not in src: + print("ERROR: Mxfp4MoEMethod.apply anchor not found", file=sys.stderr) + sys.exit(1) +src = src.replace(needle_apply, replacement_apply, 1) + +# ----- 3. Route Mxfp4MoEMethod.apply() to forward_cpu() on CPU. ----- +# FusedMoE.run_moe_core calls apply() directly; our forward_cpu would be +# dead code unless apply() itself delegates. Insert the delegation as the +# very first statement of apply() (after its imports). 
+needle_apply_body = ( + " ) -> CombineInput:\n" + "\n" + " from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput\n" + " from sglang.srt.layers.moe.topk import TopKOutputChecker\n" +) +replacement_apply_body = ( + " ) -> CombineInput:\n" + "\n" + " # ---- CPU delegation added by enable-gpt-oss-cpu-moe.py ----\n" + " from sglang.srt.utils import is_cpu, cpu_has_amx_support\n" + " if is_cpu() and cpu_has_amx_support():\n" + " return self.forward_cpu(layer, dispatch_output)\n" + " # ---- end CPU delegation ----\n" + "\n" + " from sglang.srt.layers.moe.token_dispatcher import StandardCombineInput\n" + " from sglang.srt.layers.moe.topk import TopKOutputChecker\n" +) +if needle_apply_body not in src: + print("ERROR: apply() body anchor not found", file=sys.stderr) + sys.exit(1) +src = src.replace(needle_apply_body, replacement_apply_body, 1) + +if src == original: + print("ERROR: nothing changed", file=sys.stderr) + sys.exit(1) + +F.write_text(src) +print(f"Patched {F}") diff --git a/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu.py b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu.py new file mode 100644 index 00000000..612ccd4f --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-gpt-oss-cpu.py @@ -0,0 +1,84 @@ +"""Patch sglang's server_args.py so GptOssForCausalLM accepts CPU attention backends. + +The upstream gate at the GptOssForCausalLM branch: + 1. Has no `is_cpu()` case for default backend selection — falls to "triton", + which has no CPU implementation. + 2. The `supported_backends` allowlist omits "intel_amx" and "torch_native", + even though both are valid CPU attention backends registered via + attention_registry.py. + +We extend both: pick `intel_amx` as the default for the CPU engine, and add +intel_amx + torch_native to the allowlist so users can choose either. 
+"""
+
+import sys
+from pathlib import Path
+
+SA = Path(
+    "/opt/.venv/lib/python3.12/site-packages/sglang/srt/server_args.py"
+)
+src = SA.read_text()
+original = src
+
+# 1) Inject a CPU branch (keyed on SGLANG_USE_CPU_ENGINE=1 rather than
+#    is_cpu()) into GptOssForCausalLM's default attention backend selector,
+#    between `elif is_hip(): aiter` and the final `else: triton`.
+needle = (
+    ' elif is_hip():\n'
+    ' self.attention_backend = "aiter"\n'
+    ' else:\n'
+    ' self.attention_backend = "triton"\n'
+)
+replacement = (
+    ' elif is_hip():\n'
+    ' self.attention_backend = "aiter"\n'
+    ' elif os.getenv("SGLANG_USE_CPU_ENGINE", "0") == "1":\n'
+    ' self.attention_backend = "intel_amx"\n'
+    ' else:\n'
+    ' self.attention_backend = "triton"\n'
+)
+if needle not in src:
+    print("ERROR: default attention backend selector for GptOss not found", file=sys.stderr)
+    sys.exit(1)
+src = src.replace(needle, replacement)
+
+# 2) Extend supported_backends to include CPU options.
+needle2 = (
+    ' supported_backends = [\n'
+    ' "triton",\n'
+    ' "trtllm_mha",\n'
+    ' "fa3",\n'
+    ' "fa4",\n'
+    ' "ascend",\n'
+    ' "intel_xpu",\n'
+    ' "aiter",\n'
+    ' ]\n'
+)
+replacement2 = (
+    ' supported_backends = [\n'
+    ' "triton",\n'
+    ' "trtllm_mha",\n'
+    ' "fa3",\n'
+    ' "fa4",\n'
+    ' "ascend",\n'
+    ' "intel_xpu",\n'
+    ' "aiter",\n'
+    ' "intel_amx",\n'
+    ' "torch_native",\n'
+    ' ]\n'
+)
+if needle2 not in src:
+    print("ERROR: supported_backends list for GptOss not found", file=sys.stderr)
+    sys.exit(1)
+src = src.replace(needle2, replacement2)
+
+# 3) Ensure `os` is imported (cheap idempotent check)
+if "\nimport os" not in src and not src.startswith("import os"):
+    src = "import os\n" + src
+
+if src == original:
+    print("ERROR: nothing was patched", file=sys.stderr)
+    sys.exit(1)
+
+SA.write_text(src)
+print(f"Patched {SA}")
diff --git a/core/helm-charts/sglang/image-build/enable-mxfp4-cpu.py b/core/helm-charts/sglang/image-build/enable-mxfp4-cpu.py
new file mode 100644
index 00000000..19fa55b1
--- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-mxfp4-cpu.py @@ -0,0 +1,61 @@ +"""Patch sglang's quantization/__init__.py to enable MXFP4 on CPU. + +The upstream code gates the mxfp4 registration behind is_cuda()/is_hip(). +On CPU this prevents loading models with quant_method=mxfp4 (e.g. +openai/gpt-oss-*), even though the model file's CPU-friendly dequantization +path (fp8_utils.dequant_mxfp4 → MXFP4QuantizeUtil.dequantize, pure PyTorch) +is fully functional. This patch widens the gate so mxfp4 is registered +when SGLANG_USE_CPU_ENGINE=1 is set and adds it to the CPU-supported +quantization allowlist. +""" + +import re +import sys +from pathlib import Path + +INIT = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/__init__.py" +) + +src = INIT.read_text() +original = src + +# 1) Ensure `os` is imported (we use it to gate behind the env var) +if not re.search(r"^import os\b", src, flags=re.M): + src = src.replace( + "import builtins\n", + "import builtins\nimport os\n", + 1, + ) + +# 2) Widen the gate: register mxfp4 also when running with the CPU engine +src = src.replace( + "if is_cuda() or (_is_mxfp_supported and is_hip()):\n" + " BASE_QUANTIZATION_METHODS.update(\n" + " {\n" + ' "mxfp4": Mxfp4Config,\n' + " }\n" + " )", + 'if is_cuda() or (_is_mxfp_supported and is_hip()) or os.getenv("SGLANG_USE_CPU_ENGINE", "0") == "1":\n' + " BASE_QUANTIZATION_METHODS.update(\n" + " {\n" + ' "mxfp4": Mxfp4Config,\n' + " }\n" + " )", +) + +# 3) Add mxfp4 to the CPU allowlist so get_quantization_config() returns it +src = src.replace( + "CPU_QUANTIZATION_METHODS = {\n" + ' "fp8": Fp8Config,\n', + "CPU_QUANTIZATION_METHODS = {\n" + ' "fp8": Fp8Config,\n' + ' "mxfp4": Mxfp4Config,\n', +) + +if src == original: + print("ERROR: no patch site matched. 
The file may have changed shape.", file=sys.stderr) + sys.exit(1) + +INIT.write_text(src) +print(f"Patched {INIT}") diff --git a/core/helm-charts/sglang/templates/deployment.yaml b/core/helm-charts/sglang/templates/deployment.yaml index f51800bd..1279a8eb 100644 --- a/core/helm-charts/sglang/templates/deployment.yaml +++ b/core/helm-charts/sglang/templates/deployment.yaml @@ -38,7 +38,7 @@ spec: securityContext: {{- toYaml . | nindent 12 }} {{- end }} - command: ["python3", "-m", "sglang.launch_server"] + command: ["/opt/.venv/bin/python3", "-m", "sglang.launch_server"] args: - "--model-path={{ .Values.modelSource }}" - "--served-model-name={{ .Values.modelName }}" @@ -46,19 +46,37 @@ spec: - "--port={{ .Values.server.port }}" - "--device={{ .Values.server.device }}" - "--tp-size={{ .Values.server.tpSize }}" + {{- if .Values.server.dpSize }} - "--dp-size={{ .Values.server.dpSize }}" + {{- end }} {{- if .Values.server.dtype }} - "--dtype={{ .Values.server.dtype }}" {{- end }} + {{- if .Values.server.quantization }} + - "--quantization={{ .Values.server.quantization }}" + {{- end }} {{- if .Values.server.trustRemoteCode }} - "--trust-remote-code" {{- end }} + {{- if .Values.server.disableOverlapSchedule }} + - "--disable-overlap-schedule" + {{- end }} + {{- if .Values.server.enableTorchCompile }} + - "--enable-torch-compile" + - "--torch-compile-max-bs={{ .Values.server.torchCompileMaxBs }}" + {{- end }} {{- if .Values.server.contextLength }} - "--context-length={{ .Values.server.contextLength }}" {{- end }} {{- if .Values.server.maxRunningRequests }} - "--max-running-requests={{ .Values.server.maxRunningRequests }}" {{- end }} + {{- if .Values.server.maxTotalTokens }} + - "--max-total-tokens={{ .Values.server.maxTotalTokens }}" + {{- end }} + {{- if .Values.server.memFractionStatic }} + - "--mem-fraction-static={{ .Values.server.memFractionStatic }}" + {{- end }} {{- range .Values.server.extraArgs }} - {{ . 
| quote }} {{- end }} @@ -73,6 +91,18 @@ spec: value: "{{ .Values.hfCacheMountPath }}/hub" - name: TRANSFORMERS_CACHE value: "{{ .Values.hfCacheMountPath }}/hub" + {{- if .Values.cpuEngine.enabled }} + - name: SGLANG_USE_CPU_ENGINE + value: "1" + {{- if .Values.cpuEngine.ldPreload }} + - name: LD_PRELOAD + value: {{ .Values.cpuEngine.ldPreload | quote }} + {{- end }} + {{- if .Values.cpuEngine.ompThreadsBind }} + - name: SGLANG_CPU_OMP_THREADS_BIND + value: {{ .Values.cpuEngine.ompThreadsBind | quote }} + {{- end }} + {{- end }} {{- if .Values.huggingface.token }} - name: HF_TOKEN valueFrom: diff --git a/core/helm-charts/sglang/values.yaml b/core/helm-charts/sglang/values.yaml index 59eb8afd..e9dcbae1 100644 --- a/core/helm-charts/sglang/values.yaml +++ b/core/helm-charts/sglang/values.yaml @@ -1,8 +1,19 @@ # Copyright (C) 2025-2026 Intel Corporation # SPDX-License-Identifier: Apache-2.0 -# Default values for the sglang Helm chart -# Tuned for lmsysorg/sglang:v0.5.11-xeon serving openai/gpt-oss-20b on a Xeon CPU node. +# Default values for the sglang Helm chart. +# Targets lmsysorg/sglang:v0.5.11-xeon on an Intel Xeon (AMX) CPU node. +# +# IMPORTANT — quantization support on this image: +# The Xeon CPU build of sglang supports a small, explicit subset of +# quantization methods (see CPU_QUANTIZATION_METHODS in +# sglang/srt/layers/quantization/__init__.py): +# fp8, w8a8_int8, compressed-tensors, awq, gptq +# It does NOT support mxfp4, which is the native quantization of +# openai/gpt-oss-{20b,120b}. Those models require a CUDA/HIP sglang +# image (e.g. lmsysorg/sglang:v0.5.11-cuda) on a GPU host. To serve +# gpt-oss on CPU, use llama.cpp/Ollama/vLLM-CPU on a GGUF variant, +# not this chart. See README.md for the supported-model list. nameOverride: "" fullnameOverride: "" @@ -26,7 +37,7 @@ podAnnotations: {} podLabels: app: sglang -# SGLang processes write to HF cache + shared memory; do not lock the root FS. 
+# SGLang writes to HF cache + /dev/shm; do not lock the root FS. podSecurityContext: runAsNonRoot: false fsGroup: 0 @@ -39,16 +50,17 @@ securityContext: readOnlyRootFilesystem: false # ---- Model ---- -# modelSource is the HuggingFace model ID passed to --model-path. -# For gpt-oss-20b the default below should work on a Xeon node with >= 64Gi RAM. -modelSource: "openai/gpt-oss-20b" -# Logical name used in URL paths, service names, and the OpenAI `model` field. -modelName: "gpt-oss-20b" - -# HuggingFace Hub token. gpt-oss-20b is publicly downloadable, but private/gated -# variants need a token. Either: -# 1. --set huggingface.token=$HF_TOKEN (chart will create the secret), or -# 2. pre-create: kubectl create secret generic hf-token-secret --from-literal=token= +# Default is Qwen3-8B because it is bf16 (no quantization gate), modest in +# size, and listed as a supported CPU model in sglang docs. Override +# modelSource/modelName for other models. +modelSource: "Qwen/Qwen3-8B" +modelName: "qwen3-8b" + +# HuggingFace Hub token. Required for gated repos (e.g. meta-llama/*). +# Either: +# 1. --set huggingface.token=$HF_TOKEN (chart creates the secret), or +# 2. pre-create: kubectl create secret generic hf-token-secret \ +# --from-literal=token= huggingface: token: "" secretName: "hf-token-secret" @@ -56,23 +68,59 @@ huggingface: # ---- Server / launch flags ---- server: - # Container port sglang.launch_server binds to (default upstream is 30000). port: 30000 host: "0.0.0.0" + # Force CPU device for the xeon image. device: "cpu" - # data-parallel / tensor-parallel sizes; CPU build typically runs tp=1, dp=1. + + # Tensor parallel rank count. CPU build typically runs tp=1. + # Set >1 only when binding ranks to separate NUMA domains via + # cpuEngine.ompThreadsBind. tpSize: 1 - dpSize: 1 - # Optional: context length cap. Leave empty to use the model default. - contextLength: "" - # Optional: max running requests in flight. 
-  maxRunningRequests: ""
-  # dtype: bfloat16 is the recommended dtype for Xeon SGLang.
+  # Data parallel rank count. Omit (leave empty) unless you know you want
+  # multiple replicas of the model loaded in one process.
+  dpSize: ""
+
+  # dtype: bfloat16 is the recommended dtype on Xeon AMX. Leave empty to
+  # let sglang infer from the checkpoint.
   dtype: "bfloat16"
-  # Trust remote code from HF (gpt-oss requires this).
+
+  # --quantization. Must be one of: fp8 | w8a8_int8 | compressed-tensors |
+  # awq | gptq, OR leave empty to use whatever is declared in the model
+  # config.json. Anything else (mxfp4, modelopt_fp4, etc.) will be rejected
+  # by sglang at startup.
+  quantization: ""
+
+  # Trust model code from HF (required by some recent models).
   trustRemoteCode: true
-  # Any extra command-line flags appended verbatim, e.g. ["--mem-fraction-static", "0.85"].
+
+  # Recommended for CPU per sglang docs/platforms/cpu_server.md
+  disableOverlapSchedule: true
+
+  # --enable-torch-compile can give a sizeable speedup on Xeon but slows
+  # cold start substantially. Off by default; flip on for benchmarks.
+  enableTorchCompile: false
+  torchCompileMaxBs: 4
+
+  # Optional caps. Leave empty to use the model default.
+  contextLength: ""
+  maxRunningRequests: ""
+
+  # --max-total-tokens caps KV cache size in tokens. STRONGLY recommended on
+  # Kubernetes: sglang reads host memory via psutil and ignores cgroup
+  # limits, so without this it tries to claim ~85-93% of the *node's* RAM
+  # for KV cache and gets OOMKilled. The cap is shared by all in-flight
+  # requests: 32768 fits one 32Ki-token request or several shorter ones.
+  maxTotalTokens: 32768
+
+  # --mem-fraction-static. Leave empty to keep sglang's default (0.85+).
+  # On k8s, prefer maxTotalTokens above. If you must use a fraction,
+  # remember it is a fraction of host RAM, not the pod limit.
+  memFractionStatic: ""
+
+  # Any extra command-line flags appended verbatim, e.g.
+ # extraArgs: ["--mem-fraction-static", "0.85"] extraArgs: [] livenessProbe: @@ -94,16 +142,30 @@ server: timeoutSeconds: 10 failureThreshold: 30 +# ---- CPU engine tuning ---- +# The image already bakes ENV SGLANG_USE_CPU_ENGINE=1 and LD_PRELOAD into +# the runtime, but we set them explicitly here so the chart is +# self-documenting and survives image-tag changes. +cpuEngine: + enabled: true + # Per-rank core binding for SGLang's OMP threads. Format: pipe-separated + # per-rank ranges, e.g. "0-31|32-63" for a 2-rank tp on a 64-core node. + # Leave empty to let SGLang use defaults. + ompThreadsBind: "" + # LD_PRELOAD baked into xeon.Dockerfile. Set to "" to drop it entirely + # (only do this if you know why). + ldPreload: "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:/usr/lib/x86_64-linux-gnu/libtbbmalloc.so:/opt/.venv/lib/libiomp5.so" + # ---- Resources ---- -# gpt-oss-20b in bfloat16 is ~40GB weights + KV cache + activations. -# These are starting points; tune to your Xeon node. +# Starting points for Qwen3-8B (bf16, ~16Gi weights). For larger models +# bump both requests and limits. resources: requests: cpu: "16" - memory: "64Gi" + memory: "32Gi" limits: cpu: "32" - memory: "96Gi" + memory: "64Gi" # SGLang uses /dev/shm heavily for inter-process tensor sharing on CPU. shm: @@ -111,19 +173,18 @@ shm: sizeLimit: "16Gi" # ---- Storage (HuggingFace cache) ---- -# PVC keeps the downloaded weights across pod restarts so you don't re-pull -# ~40GB every time. +# PVC keeps downloaded weights across pod restarts. storage: persistentVolume: enabled: true storageClass: "" accessMode: ReadWriteOnce - size: 80Gi + size: 60Gi existingClaim: "" deleteOnUninstall: true emptyDir: enabled: false - sizeLimit: 80Gi + sizeLimit: 60Gi # Mount point inside the container for the HF cache (HF_HOME). 
 hfCacheMountPath: "/root/.cache/huggingface"
diff --git a/scripts/bootstrap-k3s.sh b/scripts/bootstrap-k3s.sh
new file mode 100755
index 00000000..fc254a7a
--- /dev/null
+++ b/scripts/bootstrap-k3s.sh
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+# One-shot bootstrap for testing the sglang Helm chart on a single Xeon box.
+# Installs: k3s (single-node), helm, kubectl symlink. Sets up kubeconfig for the invoking user.
+# Run with: sudo bash scripts/bootstrap-k3s.sh
+set -euo pipefail
+
+REAL_USER="${SUDO_USER:-$USER}"
+REAL_HOME="$(getent passwd "$REAL_USER" | cut -d: -f6)"
+
+echo "==> Installing k3s (single-node, embedded containerd, embedded SQLite datastore)..."
+# --write-kubeconfig-mode 644 so non-root can read it
+# --disable traefik because we don't need an ingress for the smoke test
+curl -sfL https://get.k3s.io | \
+  INSTALL_K3S_EXEC="--write-kubeconfig-mode 644 --disable traefik" \
+  sh -
+
+echo "==> Waiting for k3s API to be ready..."
+for i in $(seq 1 60); do
+  if k3s kubectl get nodes >/dev/null 2>&1; then break; fi
+  sleep 2
+done
+k3s kubectl get nodes -o wide
+
+echo "==> Setting up kubectl + kubeconfig for $REAL_USER..."
+ln -sf /usr/local/bin/k3s /usr/local/bin/kubectl
+install -d -o "$REAL_USER" -g "$REAL_USER" "$REAL_HOME/.kube"
+install -m 600 -o "$REAL_USER" -g "$REAL_USER" /etc/rancher/k3s/k3s.yaml "$REAL_HOME/.kube/config"
+
+echo "==> Installing helm..."
+curl -sfL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
+
+echo "==> Versions:"
+kubectl version --client=true 2>&1 | head -3
+helm version --short
+echo
+echo "==> Bootstrap complete. 
As $REAL_USER, you can now run:" +echo " kubectl get nodes" +echo " helm lint core/helm-charts/sglang" From e423fb029814c5481ebf817a7c16a37ebcb45858 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Tue, 26 May 2026 19:05:10 +0000 Subject: [PATCH 03/20] cld2labs/sglang-gpt-oss: add Phase 2 patches, canonical values, ApisixRoute templating Image-build (fix9-fix11, all behind env-var or flag gates): - enable-fp32-override-debug.py: allow --dtype float32 with mxfp4 models via ALLOW_FP32_MXFP4=1 - enable-dequant-dtype-debug.py: make MXFP4 dequant output dtype env-controlled via MXFP4_OUT_DTYPE - enable-fp32-moe-promotion-debug.py: promote per-expert moe_forward_native intermediates to fp32 via FP32_PROMOTE_MOE=1 - enable-fp32-kv-cache-debug.py: patch sglang's --kv-cache-dtype allowlist, configure_kv_cache_dtype mapping, and torch_native_backend dtype-mismatch handler so fp32 KV cache flows end-to-end Tag bumped to v0.5.12-xeon-fix11-debug. Chart: - values.yaml: default image is now the patched build; MXFP4_NIBBLE_ORDER=low_first baked into extraEnv (required for correct MXFP4 weight decode) - gpt-oss-20b-values.yaml: canonical helm-upgrade override for this model - templates/apisixroute.yaml: ingressClassName field templated All debug patches are no-ops unless the corresponding env var or flag is set; default chart behavior is byte-identical to upstream for the unpatched code paths. 
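
The gating pattern described above can be sketched as follows. This is a minimal illustration, not code from the series: `replace_once`, `TOY_SOURCE`, and the `TOY_OUT_DTYPE` variable are hypothetical names, and the toy string stands in for an installed sglang module. It assumes each patch swaps one exact anchor and that the swapped-in code reads its env var at runtime, so an unset variable leaves behavior as it was.

```python
import os


def replace_once(src: str, old: str, new: str, label: str) -> str:
    """Swap exactly one occurrence of `old` for `new`; raise otherwise."""
    count = src.count(old)
    if count != 1:
        raise ValueError(f"{label}: expected 1 anchor match, found {count}")
    return src.replace(old, new, 1)


# Hypothetical stand-in for a line inside an installed module.
TOY_SOURCE = "dtype = 'bfloat16'\n"
OLD = "dtype = 'bfloat16'\n"
# The replacement consults the env var at runtime and keeps the old default.
NEW = "import os\ndtype = os.environ.get('TOY_OUT_DTYPE', 'bfloat16')\n"

patched = replace_once(TOY_SOURCE, OLD, NEW, "toy dtype anchor")


def run(source: str) -> str:
    """Execute the toy module text and return the dtype it selects."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["dtype"]
```

With `TOY_OUT_DTYPE` unset, `run(patched)` selects the same dtype as `run(TOY_SOURCE)`. Raising on a missing or repeated anchor means a changed upstream file fails the image build instead of silently skipping the patch.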
Signed-off-by: arpannookala-12 --- .../sglang/gpt-oss-20b-values.yaml | 49 ++++++ .../helm-charts/sglang/image-build/Dockerfile | 34 ++++ .../sglang/image-build/build-and-import.sh | 2 +- .../image-build/enable-dequant-dtype-debug.py | 58 +++++++ .../image-build/enable-fp32-kv-cache-debug.py | 141 +++++++++++++++++ .../enable-fp32-moe-promotion-debug.py | 148 ++++++++++++++++++ .../image-build/enable-fp32-override-debug.py | 57 +++++++ .../sglang/templates/apisixroute.yaml | 3 + core/helm-charts/sglang/values.yaml | 32 +++- 9 files changed, 520 insertions(+), 4 deletions(-) create mode 100644 core/helm-charts/sglang/gpt-oss-20b-values.yaml create mode 100644 core/helm-charts/sglang/image-build/enable-dequant-dtype-debug.py create mode 100644 core/helm-charts/sglang/image-build/enable-fp32-kv-cache-debug.py create mode 100644 core/helm-charts/sglang/image-build/enable-fp32-moe-promotion-debug.py create mode 100644 core/helm-charts/sglang/image-build/enable-fp32-override-debug.py diff --git a/core/helm-charts/sglang/gpt-oss-20b-values.yaml b/core/helm-charts/sglang/gpt-oss-20b-values.yaml new file mode 100644 index 00000000..6871b281 --- /dev/null +++ b/core/helm-charts/sglang/gpt-oss-20b-values.yaml @@ -0,0 +1,49 @@ +# Override values for gpt-oss-20b on Xeon CPU through the patched image. +# +# Usage: +# helm upgrade gpt-oss-20b core/helm-charts/sglang \ +# -f core/helm-charts/sglang/gpt-oss-20b-values.yaml +# +# This is the production-shape config. Long-form (>~150-token) coherence +# is a known limitation of the pure-Python CPU MoE path on this image — +# see REMAINING_WORK.md "Long-form quality boundary" for the full Phase 2 +# investigation and the precision-flag A/B that ruled out KV / lm_head / +# MoE-intermediate precision as the dominant cause. Short-form generation +# through the full auth-routed chain is solid. 
+ +modelSource: openai/gpt-oss-20b +modelName: gpt-oss-20b + +image: + repository: enterprise-inference/sglang + tag: v0.5.12-xeon-fix11-debug + pullPolicy: Never + +server: + dtype: bfloat16 + extraArgs: + - --attention-backend + - torch_native + - --reasoning-parser + - gpt-oss + - --tool-call-parser + - gpt-oss + +resources: + requests: + memory: 48Gi + limits: + memory: 128Gi + +storage: + persistentVolume: + size: 40Gi + +ingress: + enabled: true + +apisixRoute: + enabled: true + +oidc: + enabled: true diff --git a/core/helm-charts/sglang/image-build/Dockerfile b/core/helm-charts/sglang/image-build/Dockerfile index e21519e7..995533a7 100644 --- a/core/helm-charts/sglang/image-build/Dockerfile +++ b/core/helm-charts/sglang/image-build/Dockerfile @@ -106,6 +106,40 @@ RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu-dequant-v2.py && rm /tmp/enab COPY enable-gpt-oss-cpu-moe-v2.py /tmp/enable-gpt-oss-cpu-moe-v2.py RUN /opt/.venv/bin/python3 /tmp/enable-gpt-oss-cpu-moe-v2.py && rm /tmp/enable-gpt-oss-cpu-moe-v2.py +# ---- 9) DEBUG: allow --dtype float32 with mxfp4 (for precision-drift A/B) ---- +# server_args.py hard-forces dtype=bfloat16 for mxfp4 models. Gate that behind +# ALLOW_FP32_MXFP4=1 so we can A/B bf16 vs fp32 for Phase 2 numerical +# investigation. Not for production — fp32 is 2x memory and significantly +# slower than the bf16 path. +COPY enable-fp32-override-debug.py /tmp/enable-fp32-override-debug.py +RUN /opt/.venv/bin/python3 /tmp/enable-fp32-override-debug.py && rm /tmp/enable-fp32-override-debug.py + +# ---- 10) DEBUG: make MXFP4-CPU dequant output dtype env-controlled ---- +# fix7's _process_weights_for_cpu hardcoded bf16 output. With fix9-debug +# allowing --dtype half, the dequant output dtype needs to match the rest +# of the model's compute dtype. Drive it from MXFP4_OUT_DTYPE env var. 
+COPY enable-dequant-dtype-debug.py /tmp/enable-dequant-dtype-debug.py +RUN /opt/.venv/bin/python3 /tmp/enable-dequant-dtype-debug.py && rm /tmp/enable-dequant-dtype-debug.py + +# ---- 11) DEBUG: fp32 promotion inside moe_forward_native per-expert loop ---- +# Phase 2 confirmed bf16 intermediate precision is a contributor to long-form +# drift (fp16 shifted the drift point ~30%). Option 1 in REMAINING_WORK.md: +# keep layer weights/KV in their native dtype, but compute the per-expert +# forward in fp32 — both F.linear matmuls, biases, and the swiglu chain — +# casting back only at the expert output boundary. Gated behind +# FP32_PROMOTE_MOE=1 so the image is safe to run with the flag off. +COPY enable-fp32-moe-promotion-debug.py /tmp/enable-fp32-moe-promotion-debug.py +RUN /opt/.venv/bin/python3 /tmp/enable-fp32-moe-promotion-debug.py && rm /tmp/enable-fp32-moe-promotion-debug.py + +# ---- 12) DEBUG: allow --kv-cache-dtype float32 end-to-end ---- +# Phase 2 Option 2 — add float32 to the argparse choices, map it through +# configure_kv_cache_dtype, and fix torch_native_backend's dtype-mismatch +# handler so it upcasts Q to fp32 instead of silently downcasting K/V back +# to bf16. With anything other than float32/fp32 selected, all three sites +# are byte-identical to upstream. 
+COPY enable-fp32-kv-cache-debug.py /tmp/enable-fp32-kv-cache-debug.py +RUN /opt/.venv/bin/python3 /tmp/enable-fp32-kv-cache-debug.py && rm /tmp/enable-fp32-kv-cache-debug.py + # Mirror the upstream env vars so behavior is unchanged ENV SGLANG_USE_CPU_ENGINE=1 ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:/usr/lib/x86_64-linux-gnu/libtbbmalloc.so:/opt/.venv/lib/libiomp5.so diff --git a/core/helm-charts/sglang/image-build/build-and-import.sh b/core/helm-charts/sglang/image-build/build-and-import.sh index 6b36ce5b..140956e6 100755 --- a/core/helm-charts/sglang/image-build/build-and-import.sh +++ b/core/helm-charts/sglang/image-build/build-and-import.sh @@ -5,7 +5,7 @@ # Run with: sudo bash core/helm-charts/sglang/image-build/build-and-import.sh set -euo pipefail -IMAGE_TAG="${IMAGE_TAG:-enterprise-inference/sglang:v0.5.12-xeon-fix8}" +IMAGE_TAG="${IMAGE_TAG:-enterprise-inference/sglang:v0.5.12-xeon-fix11-debug}" SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" echo "==> Ensuring docker is installed" diff --git a/core/helm-charts/sglang/image-build/enable-dequant-dtype-debug.py b/core/helm-charts/sglang/image-build/enable-dequant-dtype-debug.py new file mode 100644 index 00000000..5e2b8fd8 --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-dequant-dtype-debug.py @@ -0,0 +1,58 @@ +"""Debug-only patch: make the MXFP4-CPU dequant respect the model's +configured dtype instead of hardcoding bf16. + +In fix7, ``_process_weights_for_cpu`` hardcoded ``.to(torch.bfloat16)`` at the +end of the dequant. That's fine while ``--dtype bfloat16`` is the only +supported mode (it is upstream, for mxfp4), but it crashes when we relax the +constraint via fix9-debug + ``--dtype half`` because the dequantized weights +end up bf16 while the rest of the activations are fp16 — matmul rejects the +dtype mismatch. + +This patch reads ``MXFP4_OUT_DTYPE`` from the environment (one of "bfloat16" +| "float16" | "float32"; default "bfloat16") and uses that as the output +dtype. 
Combined with fix9-debug's ``ALLOW_FP32_MXFP4=1``, this lets us A/B
+bf16 vs fp16 vs fp32 for the Phase 2 precision investigation without
+further image rebuilds.
+
+This patch is intended for diagnostic builds only.
+"""
+
+import sys
+from pathlib import Path
+
+F = Path(
+    "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/quantization/mxfp4.py"
+)
+src = F.read_text()
+original = src
+
+# Only the `.to(torch.bfloat16)` in the fix7 dequant helper's return is
+# patched here: it is replaced with an inline lookup that maps the
+# MXFP4_OUT_DTYPE env var to a torch dtype (unknown values fall back to bf16).
+old1 = (
+    " out = (values * scale_f).view(*batch_dims, K).to(torch.bfloat16)\n"
+    " return out\n"
+)
+new1 = (
+    " import os as _os\n"
+    " _dt = {\n"
+    " 'bfloat16': torch.bfloat16,\n"
+    " 'float16': torch.float16,\n"
+    " 'half': torch.float16,\n"
+    " 'float32': torch.float32,\n"
+    " }.get(_os.environ.get('MXFP4_OUT_DTYPE', 'bfloat16').lower(), torch.bfloat16)\n"
+    " out = (values * scale_f).view(*batch_dims, K).to(_dt)\n"
+    " return out\n"
+)
+
+if old1 not in src:
+    print("ERROR: dequant inner return anchor not found", file=sys.stderr)
+    sys.exit(1)
+src = src.replace(old1, new1)
+
+if src == original:
+    print("ERROR: nothing was patched", file=sys.stderr)
+    sys.exit(1)
+
+F.write_text(src)
+print(f"Patched {F}")
diff --git a/core/helm-charts/sglang/image-build/enable-fp32-kv-cache-debug.py b/core/helm-charts/sglang/image-build/enable-fp32-kv-cache-debug.py
new file mode 100644
index 00000000..eefb8573
--- /dev/null
+++ b/core/helm-charts/sglang/image-build/enable-fp32-kv-cache-debug.py
@@ -0,0 +1,141 @@
+"""Debug-only patch: allow ``--kv-cache-dtype float32`` end-to-end.
+
+The Phase 2 Option 2 hypothesis is that bf16 KV cache write/read truncates
+attention K/V state every token, and the per-step error compounds with
+sequence length, matching the symptom that later tokens have worse
+output.
sglang's flag space only exposes ``auto / bf16 / bfloat16 / +fp8_e5m2 / fp8_e4m3 / fp4_e2m1``; ``float32`` was never an option. + +This patch makes three surgical changes so fp32 KV flows end-to-end: + +1. ``server_args.py``: add ``float32`` (with ``fp32`` as an alias) to + the ``--kv-cache-dtype`` ``choices`` list. + +2. ``model_runner.py::configure_kv_cache_dtype``: map both strings to + ``torch.float32`` so the allocator allocates fp32 KV tensors. + +3. ``torch_native_backend.py``: fix the dtype-mismatch branch in both + extend and decode SDPA call sites. Today it always casts K/V to + ``query.dtype``, which would silently downcast our fp32 KV to bf16 + for the attention math — defeating the whole point. With this + patch, when K/V have higher precision than Q we upcast Q to match, + keeping the SDPA matmuls in fp32 and downcasting the output back to + the original query dtype at the boundary. + +Gated by the choice of ``--kv-cache-dtype`` at runtime; with anything +other than ``float32``/``fp32`` selected, all three sites are +byte-identical to upstream. + +Diagnostic build only. 
+""" + +import sys +from pathlib import Path + +# ---- 1) server_args.py: add float32 to argparse choices ------------------ +F1 = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/server_args.py" +) +src1 = F1.read_text() +orig1 = src1 + +old1 = ( + ' choices=["auto", "fp8_e5m2", "fp8_e4m3", "bf16", "bfloat16", "fp4_e2m1"],\n' +) +new1 = ( + ' choices=["auto", "fp8_e5m2", "fp8_e4m3", "bf16", "bfloat16", "fp4_e2m1", "float32", "fp32"],\n' +) +if old1 not in src1: + print("ERROR: kv-cache-dtype choices anchor not found in server_args.py", file=sys.stderr) + sys.exit(1) +src1 = src1.replace(old1, new1) + + +# ---- 2) model_runner.py: extend configure_kv_cache_dtype ------------------ +F2 = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py" +) +src2 = F2.read_text() +orig2 = src2 + +# Insert a new elif branch just before the existing "elif fp4_e2m1" handler. +old2 = ( + ' elif self.server_args.kv_cache_dtype in ("bf16", "bfloat16"):\n' + ' self.kv_cache_dtype = torch.bfloat16\n' +) +new2 = ( + ' elif self.server_args.kv_cache_dtype in ("bf16", "bfloat16"):\n' + ' self.kv_cache_dtype = torch.bfloat16\n' + ' elif self.server_args.kv_cache_dtype in ("float32", "fp32"):\n' + ' # fix11-debug: fp32 KV cache for Phase 2 Option 2 long-form\n' + ' # precision A/B. ~2x KV memory; torch_native_backend will\n' + ' # upcast Q to fp32 at SDPA time (see patch (3) below).\n' + ' self.kv_cache_dtype = torch.float32\n' +) +if old2 not in src2: + print("ERROR: configure_kv_cache_dtype anchor not found in model_runner.py", file=sys.stderr) + sys.exit(1) +src2 = src2.replace(old2, new2) + + +# ---- 3) torch_native_backend.py: upcast Q when KV is higher precision ----- +F3 = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/attention/torch_native_backend.py" +) +src3 = F3.read_text() +orig3 = src3 + +# Two identical mismatch-handler blocks (extend + decode). 
Replace both with +# a version that picks the higher-precision dtype across (Q, K, V) and +# upcasts the others to match. +old3 = ( + ' if not (per_req_query.dtype == per_req_key.dtype == per_req_value.dtype):\n' + ' # _sdpa_with_sinks() expects query, key, and value to have the same dtype\n' + ' per_req_key = per_req_key.to(per_req_query.dtype)\n' + ' per_req_value = per_req_value.to(per_req_query.dtype)\n' +) +new3 = ( + ' if not (per_req_query.dtype == per_req_key.dtype == per_req_value.dtype):\n' + ' # fix11-debug: pick the highest-precision dtype across Q/K/V and\n' + ' # promote the others to match, instead of unconditionally\n' + ' # downcasting K/V to query.dtype. Required so fp32 KV cache\n' + ' # actually produces fp32 SDPA math; harmless otherwise.\n' + ' import torch as _torch\n' + ' _rank = {\n' + ' _torch.float32: 3,\n' + ' _torch.float16: 2,\n' + ' _torch.bfloat16: 2,\n' + ' }\n' + ' _best = max(\n' + ' (per_req_query.dtype, per_req_key.dtype, per_req_value.dtype),\n' + ' key=lambda d: _rank.get(d, 0),\n' + ' )\n' + ' per_req_query = per_req_query.to(_best)\n' + ' per_req_query_redudant = (\n' + ' per_req_query_redudant.to(_best)\n' + ' if "per_req_query_redudant" in dir() else None\n' + ' )\n' + ' per_req_key = per_req_key.to(_best)\n' + ' per_req_value = per_req_value.to(_best)\n' +) +count = src3.count(old3) +if count != 2: + print( + f"ERROR: expected exactly 2 SDPA-mismatch handler sites in torch_native_backend.py, found {count}", + file=sys.stderr, + ) + sys.exit(1) +src3 = src3.replace(old3, new3) + + +# ---- write all three back ------------------------------------------------ +if src1 == orig1 or src2 == orig2 or src3 == orig3: + print("ERROR: at least one of the three patches was a no-op", file=sys.stderr) + sys.exit(1) + +F1.write_text(src1) +print(f"Patched {F1}") +F2.write_text(src2) +print(f"Patched {F2}") +F3.write_text(src3) +print(f"Patched {F3}") diff --git a/core/helm-charts/sglang/image-build/enable-fp32-moe-promotion-debug.py 
b/core/helm-charts/sglang/image-build/enable-fp32-moe-promotion-debug.py new file mode 100644 index 00000000..932c8813 --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-fp32-moe-promotion-debug.py @@ -0,0 +1,148 @@ +"""Debug-only patch: promote the per-expert forward inside +``moe_forward_native`` to fp32. + +Phase 2 of the gpt-oss-on-CPU investigation showed that long-form generation +drifts into repetition after ~150-200 tokens with bf16 intermediates, and +that switching the whole model to fp16 pushed the drift point ~30% further +out — confirming precision is a real contributor. Option 1 in +REMAINING_WORK.md is the cheapest follow-up: keep the layer's bf16 weights +and KV cache untouched, but promote the per-expert intermediates +(``gate_up``, ``expert_out``, biases, and the weights they're multiplied +against) to fp32 across the per-expert compute, casting back to the layer's +dtype only at the very end of the expert's forward. + +Gated behind ``FP32_PROMOTE_MOE=1`` so the patch ships in the image but +costs nothing unless explicitly enabled. With it off, the per-expert path +is byte-identical to upstream. + +Intended for diagnostic A/B builds. If Option 1 closes most of the +long-form gap, the right next step is the upstream AMX kernel work +(Option 4); this Python promotion is not how we want to ship in +production because it doubles the per-expert memory bandwidth. +""" + +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/layers/moe/fused_moe_native.py" +) +src = F.read_text() +original = src + +# We replace the entire per-expert body. Match the exact block from upstream +# (current as of v0.5.12). If upstream drifts, the anchor will fail loudly +# rather than silently mis-patch. 
+old = ''' # Apply w13 linear + gate_up = F.linear(tokens_for_this_expert, layer_w13_weight) + + # Add bias if present (for models like GPT-OSS) + if layer_w13_bias is not None: + gate_up_fp32 = gate_up.float() + layer_w13_bias + gate_up = gate_up_fp32.to(original_dtype) + + # Apply activation + if ( + moe_runner_config.activation == "silu" + and moe_runner_config.gemm1_alpha is not None + ): + assert moe_runner_config.gemm1_clamp_limit is not None + gate_up = swiglu_gpt_oss_sigmoid_alpha( + gate_up, + moe_runner_config.gemm1_alpha, + moe_runner_config.gemm1_clamp_limit, + ) + else: + gate_up = act(gate_up) + + # Apply w2 linear + expert_out = F.linear(gate_up, layer_w2_weight) + + # Add bias if present (for models like GPT-OSS) + if layer_w2_bias is not None: + expert_out = expert_out.float() + layer_w2_bias + expert_out = expert_out.to(original_dtype) + + outputs.append(expert_out) +''' + +new = ''' # === fp32-promotion debug patch (FP32_PROMOTE_MOE=1) ================= + # Promote weights, input, biases, and all intermediates to fp32 for + # this expert's forward. Cast back to original_dtype at the very end + # so the caller, KV cache, and combine sum all see the layer's + # native dtype. With the env var off, behavior is byte-identical to + # upstream. 
+ import os as _os + _promote_fp32 = _os.environ.get("FP32_PROMOTE_MOE", "0") == "1" + if _promote_fp32: + _toks = tokens_for_this_expert.float() + _w13 = layer_w13_weight.float() + _w2 = layer_w2_weight.float() + _b13 = layer_w13_bias.float() if layer_w13_bias is not None else None + _b2 = layer_w2_bias.float() if layer_w2_bias is not None else None + + gate_up = F.linear(_toks, _w13) + if _b13 is not None: + gate_up = gate_up + _b13 + + if ( + moe_runner_config.activation == "silu" + and moe_runner_config.gemm1_alpha is not None + ): + assert moe_runner_config.gemm1_clamp_limit is not None + # swiglu_gpt_oss_sigmoid_alpha preserves input dtype for its + # internal math, so passing fp32 keeps the sigmoid + chained + # multiplies in fp32 (Option 3 lands for free here). + gate_up = swiglu_gpt_oss_sigmoid_alpha( + gate_up, + moe_runner_config.gemm1_alpha, + moe_runner_config.gemm1_clamp_limit, + ) + else: + gate_up = act(gate_up) + + expert_out = F.linear(gate_up, _w2) + if _b2 is not None: + expert_out = expert_out + _b2 + + expert_out = expert_out.to(original_dtype) + else: + # ---- upstream path (unchanged) ---- + gate_up = F.linear(tokens_for_this_expert, layer_w13_weight) + if layer_w13_bias is not None: + gate_up_fp32 = gate_up.float() + layer_w13_bias + gate_up = gate_up_fp32.to(original_dtype) + + if ( + moe_runner_config.activation == "silu" + and moe_runner_config.gemm1_alpha is not None + ): + assert moe_runner_config.gemm1_clamp_limit is not None + gate_up = swiglu_gpt_oss_sigmoid_alpha( + gate_up, + moe_runner_config.gemm1_alpha, + moe_runner_config.gemm1_clamp_limit, + ) + else: + gate_up = act(gate_up) + + expert_out = F.linear(gate_up, layer_w2_weight) + if layer_w2_bias is not None: + expert_out = expert_out.float() + layer_w2_bias + expert_out = expert_out.to(original_dtype) + # === end fp32-promotion debug patch =================================== + + outputs.append(expert_out) +''' + +if old not in src: + print("ERROR: per-expert forward anchor 
not found in fused_moe_native.py", file=sys.stderr) + sys.exit(1) +src = src.replace(old, new) + +if src == original: + print("ERROR: nothing was patched", file=sys.stderr) + sys.exit(1) + +F.write_text(src) +print(f"Patched {F}") diff --git a/core/helm-charts/sglang/image-build/enable-fp32-override-debug.py b/core/helm-charts/sglang/image-build/enable-fp32-override-debug.py new file mode 100644 index 00000000..e93f711e --- /dev/null +++ b/core/helm-charts/sglang/image-build/enable-fp32-override-debug.py @@ -0,0 +1,57 @@ +"""Debug-only patch: remove sglang's hard override that forces dtype=bfloat16 +for mxfp4 models. + +In `server_args.py` the GptOss branch contains: + + if is_mxfp4_quant_format: + # use bf16 for mxfp4 triton kernels + self.dtype = "bfloat16" + +That's correct for the GPU path (the triton mxfp4 kernels only accept bf16 +inputs), but it also fires on CPU and prevents us from running an fp32 +forward to A/B against the bf16 path for precision-drift investigation. + +This patch makes the override conditional on the user NOT having explicitly +chosen a dtype. If the launcher was invoked with `--dtype float32`, we +respect that choice. + +This patch is intended for diagnostic builds only — it does NOT belong in +a production image. It is gated behind the ALLOW_FP32_MXFP4=1 env var so +the override remains in effect for normal deployments. 
+""" + +import sys +from pathlib import Path + +F = Path( + "/opt/.venv/lib/python3.12/site-packages/sglang/srt/server_args.py" +) +src = F.read_text() +original = src + +needle = ( + " if is_mxfp4_quant_format:\n" + " # use bf16 for mxfp4 triton kernels\n" + " self.dtype = \"bfloat16\"\n" +) +replacement = ( + " if is_mxfp4_quant_format:\n" + " # use bf16 for mxfp4 triton kernels (CPU debug bypass via ALLOW_FP32_MXFP4=1)\n" + " import os as _os\n" + " if _os.getenv(\"ALLOW_FP32_MXFP4\", \"0\") != \"1\":\n" + " self.dtype = \"bfloat16\"\n" + " elif self.dtype in (None, \"auto\"):\n" + " self.dtype = \"bfloat16\"\n" +) + +if needle not in src: + print("ERROR: mxfp4 dtype-override anchor not found", file=sys.stderr) + sys.exit(1) +src = src.replace(needle, replacement) + +if src == original: + print("ERROR: nothing was patched", file=sys.stderr) + sys.exit(1) + +F.write_text(src) +print(f"Patched {F}") diff --git a/core/helm-charts/sglang/templates/apisixroute.yaml b/core/helm-charts/sglang/templates/apisixroute.yaml index 0d23b5eb..c5aa0653 100644 --- a/core/helm-charts/sglang/templates/apisixroute.yaml +++ b/core/helm-charts/sglang/templates/apisixroute.yaml @@ -10,6 +10,9 @@ metadata: labels: {{- include "sglang.labels" . | nindent 4 }} spec: + {{- if .Values.apisixRoute.ingressClassName }} + ingressClassName: {{ .Values.apisixRoute.ingressClassName }} + {{- end }} http: - name: {{ .Values.modelName }}-route match: diff --git a/core/helm-charts/sglang/values.yaml b/core/helm-charts/sglang/values.yaml index e9dcbae1..7afa6eda 100644 --- a/core/helm-charts/sglang/values.yaml +++ b/core/helm-charts/sglang/values.yaml @@ -22,8 +22,13 @@ namespace: default replicaCount: 1 image: - repository: lmsysorg/sglang - tag: "v0.5.11-xeon" + # Patched image built from image-build/ — required for gpt-oss-on-CPU + # (fix1..fix8) and the Phase 2 precision-debug flags (fix9-debug, + # fix10-debug). Use `image-build/build-and-import.sh` to build + import + # into k3s containerd. 
Switch back to `lmsysorg/sglang:v0.5.11-xeon` for + # non-gpt-oss models that don't need the patch stack. + repository: enterprise-inference/sglang + tag: "v0.5.12-xeon-fix10-debug" pullPolicy: IfNotPresent imagePullSecrets: [] @@ -211,6 +216,11 @@ apisixRoute: namespace: default name: "" host: "api.example.com" + # IngressClass that the APISIX ingress controller v2 watches. Required + # so the controller picks up this ApisixRoute and syncs it into APISIX's + # runtime route table. Set to "" to omit the field (e.g. for older + # controller versions that auto-discover). + ingressClassName: "apisix" ingress: enabled: true @@ -230,5 +240,21 @@ priorityClassName: "" extraVolumes: [] extraVolumeMounts: [] -extraEnv: [] +# extraEnv: extra environment variables exposed to the sglang container. +# +# For gpt-oss-on-CPU the patched image needs the correct MXFP4 nibble order +# to decode weights properly (without this the model serves but emits +# random-vocab garbage): +# - MXFP4_NIBBLE_ORDER=low_first fix7 dequant nibble order +# +# Other debug flags exposed by the patched image (default off, here for +# reference): +# - FP32_PROMOTE_MOE=1 fix10-debug: per-expert MoE in fp32 +# - ALLOW_FP32_MXFP4=1 fix9-debug: bypass dtype=bfloat16 force +# - MXFP4_OUT_DTYPE=float32|float16 fix9-debug2: dequant output dtype +# +# For non-gpt-oss models, override with `extraEnv: []`. +extraEnv: + - name: MXFP4_NIBBLE_ORDER + value: "low_first" extraEnvFrom: [] From 6334673bb5523ad890b546cede4425203b70aac9 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Tue, 26 May 2026 19:05:25 +0000 Subject: [PATCH 04/20] cld2labs/sglang-gpt-oss: rewrite README in OPEA style + ignore local-only notes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rewrite the chart README to match the conventions of core/scripts/vllm-quickstart/README.md — emoji section headers, configuration tables, troubleshooting matrix, project-structure tree. 
The new README covers: - Build the patched image (image-build/build-and-import.sh) - Deploy on a stock OPEA cluster (single helm upgrade with gpt-oss-20b-values.yaml) - Smoke-test and auth-routed inference curls - Configuration tables for chart values and debug env vars - What each of the 11 patches does and why - Known limitations (long-form drift, throughput, no-TP) - Troubleshooting matrix - From-scratch single-node bootstrap appendix (k3s, nginx, Keycloak, APISIX, TLS) for setups without the OPEA Ansible playbooks Add .gitignore for two local-only working notes that should not be shared (REMAINING_WORK.md, UPSTREAM_BUG_REPORT.md). Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/.gitignore | 3 + core/helm-charts/sglang/README.md | 490 ++++++++++++++++++++++------- 2 files changed, 379 insertions(+), 114 deletions(-) create mode 100644 core/helm-charts/sglang/.gitignore diff --git a/core/helm-charts/sglang/.gitignore b/core/helm-charts/sglang/.gitignore new file mode 100644 index 00000000..2e0b601a --- /dev/null +++ b/core/helm-charts/sglang/.gitignore @@ -0,0 +1,3 @@ +# Local-only working notes (not for upstream sharing). +REMAINING_WORK.md +UPSTREAM_BUG_REPORT.md diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md index 9f142817..2d44e041 100644 --- a/core/helm-charts/sglang/README.md +++ b/core/helm-charts/sglang/README.md @@ -1,136 +1,398 @@ -# SGLang Helm Chart (Xeon CPU build) +# SGLang Helm Chart — gpt-oss-20b on Intel Xeon CPU -Deploys an [SGLang](https://github.com/sgl-project/sglang) inference server -using the `lmsysorg/sglang:v0.5.11-xeon` image on an Intel Xeon (AMX) CPU -node. Follows the same standalone pattern as `core/helm-charts/ovms` — it -is **not** wired into the Ansible playbooks. Deploy with `helm install`. 
+## 📋 Overview -## Supported models / quantizations +Deploys [SGLang](https://github.com/sgl-project/sglang) on a Kubernetes +cluster to serve `openai/gpt-oss-20b` on an Intel Xeon CPU node, including +the OPEA-standard nginx-ingress → APISIX → Keycloak (OIDC) auth chain. -This image's source explicitly limits CPU quantization to a small set -(`sglang/srt/layers/quantization/__init__.py`, `CPU_QUANTIZATION_METHODS`): +The chart targets a **patched** sglang image (`enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`) +that layers 11 fixes onto `lmsysorg/sglang:v0.5.12-xeon` — without them +the upstream image cannot serve gpt-oss on CPU (MXFP4 quantization is +GPU-gated, sinks attention is not supported on the CPU backends, the +shipped sgl-kernel `.so` is compiled without `-mavx512bf16`, etc.). +The image is built once via a self-contained Dockerfile and imported +directly into k3s containerd — no registry required. -| Quantization | Works on this image? | -| ------------------- | -------------------- | -| `fp8` | yes | -| `w8a8_int8` | yes | -| `compressed-tensors`| yes | -| `awq` | yes (`AWQCPUConfig`) | -| `gptq` | yes (`CPUGPTQConfig`)| -| **`mxfp4`** | **no — GPU only** | -| `modelopt_fp4` | no | -| anything else | no | +## ✨ Features -Models that work out of the box on Xeon CPU: +- **Single-model gpt-oss-20b** on Xeon CPU through the patched sglang image +- **OPEA-standard auth chain**: TLS at nginx, OIDC bearer validation at APISIX, token issuance by Keycloak +- **No external registry**: image builds locally and imports into k3s containerd +- **OpenAI-compatible API**: `/v1/chat/completions`, `/v1/models`, `/v1/completions` +- **Harmony reasoning + tool-call parsers** pre-wired for gpt-oss +- **Chart-only delivery**: same standalone pattern as `core/helm-charts/ovms`, not yet wired into the Ansible playbooks -- `Qwen/Qwen3-8B` (bf16, default) — small, fast, no quantization gate -- `Qwen/Qwen2.5-7B-Instruct` / `Qwen/Qwen2.5-14B-Instruct` -- 
`meta-llama/Llama-3.1-8B-Instruct` (gated, needs HF token) -- `deepseek-ai/DeepSeek-V3.1-Terminus` channel-quantized variants - (e.g. `IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8` with - `--set server.quantization=w8a8_int8`) +## 📦 Prerequisites -### gpt-oss-20b / gpt-oss-120b +- **Operating System**: Ubuntu 22.04+ +- **Hardware**: Intel Xeon with AVX-512-BF16 / AMX-BF16 (Sapphire Rapids, Emerald Rapids, Granite Rapids) +- **Memory**: ≥ 64 GiB RAM (gpt-oss-20b uses ~25 GiB dequantized + KV cache) +- **Disk**: ≥ 100 GiB free on the root partition +- **Kubernetes**: 1.24+ (k3s is fine; this chart was validated on single-node k3s) +- **Helm**: 3+ +- **NodePorts free on the host**: 30080, 30443 (nginx), 32080 (APISIX) +- **HuggingFace token** (only required for gated repos; `openai/gpt-oss-20b` is public) +- **Sudo access** for the one-shot image build -`openai/gpt-oss-*` is shipped natively in **MXFP4**, which is not -implemented for CPU in any sglang build to date — the `mxfp4` entry in -`BASE_QUANTIZATION_METHODS` is gated behind `is_cuda() or is_hip()`. This -chart will exit at startup with -`ValueError: Unknown quantization method: mxfp4` if you point it at gpt-oss. +> **Note:** On a stock OPEA cluster, k3s, nginx-ingress, APISIX, and Keycloak +> are already in place via the project's Ansible playbooks — skip straight to +> 🚀 **Deploy**. The "From-Scratch Bootstrap" appendix at the bottom is only +> for people standing up a fresh single-node box from zero. -To serve gpt-oss-20b on Xeon CPU, use a different runtime — llama.cpp, -Ollama, vLLM CPU, or ipex-llm — with a GGUF variant (e.g. -`ggml-org/gpt-oss-20b-GGUF`, `unsloth/gpt-oss-20b-GGUF`, -`bartowski/openai_gpt-oss-20b-GGUF`). Not this chart. +## 🛠️ Build the Image -To serve gpt-oss-20b via sglang, use a GPU image (e.g. -`lmsysorg/sglang:v0.5.11-cuda`) on a CUDA host. The chart can be reused — -just override `image.tag` and `server.device=cuda`. 
+```bash +git clone https://github.com/cld2labs/Enterprise-Inference.git +cd Enterprise-Inference +git checkout cld2labs/sglang-gpt-oss + +sudo bash core/helm-charts/sglang/image-build/build-and-import.sh +``` + +First run takes ~5–10 minutes (installs docker.io if missing, compiles +27 C++ files in `sgl-kernel` with the right BF16 flags, runs 11 Python +patch scripts against sglang's in-image source, and imports the result +into k3s containerd). + +Verify: + +```bash +sudo k3s ctr images ls | grep enterprise-inference/sglang +# docker.io/enterprise-inference/sglang:v0.5.12-xeon-fix11-debug +``` + +## 🚀 Deploy -## Prerequisites +The chart ships with `gpt-oss-20b-values.yaml` as the canonical override +for this model. It pins the image, sets bf16, wires the gpt-oss parsers, +sizes resources for a Xeon node, and enables the full auth chain. -- Kubernetes 1.24+ -- Helm 3+ -- For the gated-model recipes: HuggingFace token with read scope +```bash +helm upgrade --install gpt-oss-20b core/helm-charts/sglang \ + -f core/helm-charts/sglang/gpt-oss-20b-values.yaml +``` -## Quick start (smoke test, no auth) +Wait for the pod (first start downloads ~12 GB of weights, then runs +MXFP4 → bf16 dequant — total ~4–5 minutes): ```bash -helm upgrade --install qwen3-8b core/helm-charts/sglang \ - --set apisixRoute.enabled=false \ - --set ingress.enabled=false \ - --set oidc.enabled=false +kubectl wait --for=condition=ready pod -l app=sglang --timeout=600s +kubectl logs -l app=sglang --tail=5 +# expect: INFO: Uvicorn running on http://0.0.0.0:30000 +``` -kubectl get pods -l app.kubernetes.io/instance=qwen3-8b -w -kubectl port-forward svc/qwen3-8b-sglang 30000:30000 & +## 🎯 Inference -curl http://localhost:30000/v1/models -curl http://localhost:30000/v1/chat/completions \ +### Smoke Test (no auth, via port-forward) + +```bash +kubectl port-forward svc/gpt-oss-20b-sglang 30000:30000 & +sleep 2 + +curl -sS http://localhost:30000/v1/chat/completions \ -H 'Content-Type: application/json' \ 
- -d '{"model":"qwen3-8b","messages":[{"role":"user","content":"Say hi."}]}' + -d '{ + "model": "gpt-oss-20b", + "messages": [{"role":"user","content":"In one sentence, what is deep learning?"}], + "max_tokens": 150, + "temperature": 0.3 + }' | python3 -m json.tool ``` -The default model is `Qwen/Qwen3-8B`. To swap models, override -`modelSource` and `modelName`: +### Auth-Routed Call (nginx → APISIX → Keycloak → sglang) + +Fetch a token from inside the cluster (so the `iss` claim matches what +APISIX validates against), then call through the ingress: ```bash -helm upgrade --install llama-3-1-8b core/helm-charts/sglang \ - --set modelSource="meta-llama/Llama-3.1-8B-Instruct" \ - --set modelName="llama-3-1-8b" \ - --set huggingface.token=$HF_TOKEN -``` - -## Full deploy (with Keycloak/APISIX/Ingress) - -The chart's default values turn on the same OIDC+APISIX+Ingress wiring -that the OVMS chart uses, so a fully-provisioned Enterprise-Inference -cluster will route to this server at `https:///-sglang/*`. -For a stand-alone cluster, override the auth stack values per the smoke -test above. - -## Tuning for Xeon - -- `cpuEngine.ompThreadsBind`: pin SGLang's OMP threads per tp rank. For a - 2-rank tp on a 64-core node: - `--set server.tpSize=2 --set cpuEngine.ompThreadsBind="0-31|32-63"`. -- `server.enableTorchCompile=true`: large speedup, longer cold start. - Pair with `server.torchCompileMaxBs` (default 4). -- `server.quantization=w8a8_int8` with an int8-quantized checkpoint is - typically the sweet spot for throughput on Xeon AMX. -- Memory is the most common bottleneck. Set `resources.limits.memory` - to weights + KV cache + ~10Gi headroom. - -## Known upstream issue - -As of 2026-05, both `lmsysorg/sglang:v0.5.11-xeon` and `v0.5.12-xeon` -crash on the first forward pass with a `c10::Error` inside -`logits_processor._compute_lm_head`. 
We reproduced this with: - -- Qwen/Qwen2.5-7B-Instruct (`Qwen2ForCausalLM`) -- Qwen/Qwen3-8B (`Qwen3ForCausalLM`) -- `attention_backend=intel_amx` (default) and `=torch_native` -- with and without `LD_PRELOAD` baked in by the image - -The model loads, KV cache allocates, uvicorn serves `/model_info` 200 OK, -then the scheduler subprocess aborts during sglang's auto warmup-`/generate`. -That points at the CPU matmul kernel in the image rather than anything -the chart configures. Until the upstream image fixes it, this chart -cannot end-to-end-serve a request on Xeon. - -The chart is otherwise validated end-to-end: -- pod schedules, image pulls, PVC binds, Service routes -- `SGLANG_USE_CPU_ENGINE=1` → `attention_backend='intel_amx'` selected -- `--max-total-tokens` prevents the host-RAM-fraction OOM (sglang reads - host memory, not cgroup limits) -- weights and KV cache allocate cleanly within pod limits -- uvicorn starts and serves `/model_info` - -When the upstream bug is fixed (track sgl-project/sglang for AMX matmul -fixes on the xeon Dockerfile), no chart changes should be required. 
- -## References - -- [sglang CPU server docs](https://docs.sglang.io/platforms/cpu_server.html) -- `docker/xeon.Dockerfile` in the sglang repo — the canonical build recipe -- For gpt-oss-on-CPU: [llama.cpp guide](https://github.com/ggml-org/llama.cpp/discussions/15396), - [Ollama gpt-oss:20b](https://ollama.com/library/gpt-oss:20b) +TOKEN=$(kubectl run keycloak-tok --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d "client_id=my-client-id" \ + -d "client_secret=tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR" \ + -d "grant_type=client_credentials"' \ + | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") + +curl -sSk https://localhost:30443/gpt-oss-20b-sglang/v1/chat/completions \ + -H "Host: api.example.com" \ + -H "Authorization: Bearer $TOKEN" \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "gpt-oss-20b", + "messages": [{"role":"user","content":"In one sentence, what is deep learning?"}], + "max_tokens": 150, + "temperature": 0.3 + }' | python3 -m json.tool +``` + +### API Endpoints + +| Endpoint | Description | +|----------|-------------| +| `/v1/models` | List loaded models | +| `/v1/chat/completions` | OpenAI-compatible chat completions | +| `/v1/completions` | OpenAI-compatible text completions | +| `/health` | Liveness probe | + +### Notes on `max_tokens` + +gpt-oss uses the **Harmony format**: every response starts in an +"analysis" channel (the model's scratchpad) and only switches to the +"final" channel once it's done thinking. With small budgets the model +spends them all reasoning and emits no user-visible content. 
Practical +guidance: + +| `max_tokens` | What you'll see | +|--------------|-----------------| +| ≤ 100 | Usually `content: null`, reasoning truncated | +| 150 | One short sentence — good for quick demos | +| 300 | Paragraph + small table | +| > 400 | Hits the documented long-form drift (see ⚠️ below) | + +The reasoning is preserved in `response.choices[0].message.reasoning_content` +and the visible answer in `response.choices[0].message.content`. + +## ⚙️ Configuration + +### Key Values + +| Key | Default | Description | +|-----|---------|-------------| +| `image.repository` | `enterprise-inference/sglang` | Patched image (override to switch back to `lmsysorg/sglang`) | +| `image.tag` | `v0.5.12-xeon-fix11-debug` | Pinned to the validated build | +| `image.pullPolicy` | `IfNotPresent` | Set to `Never` if the image is only in local containerd | +| `modelSource` | `Qwen/Qwen3-8B` | HuggingFace repo to load | +| `modelName` | `qwen3-8b` | Served name (also used in route URI) | +| `server.dtype` | `bfloat16` | Compute dtype | +| `server.extraArgs` | `[]` | Extra CLI flags to `sglang serve` | +| `server.maxTotalTokens` | `32768` | Caps KV-cache memory (sglang reads host RAM, not cgroup limits) | +| `extraEnv` | `[MXFP4_NIBBLE_ORDER=low_first]` | Env vars; the default is required for correct MXFP4 weight decode | +| `oidc.enabled` | `true` | Enable APISIX `openid-connect` plugin | +| `apisixRoute.enabled` | `true` | Create `ApisixRoute` for the service | +| `ingress.enabled` | `true` | Create `Ingress` for the service | +| `huggingface.token` | `""` | Required for gated models (e.g. `meta-llama/*`) | + +The complete configuration surface is documented inline in `values.yaml`. 
+ +### Debug Env Vars (off by default, baked into the image) + +| Variable | Effect | +|----------|--------| +| `ALLOW_FP32_MXFP4=1` | Lets you pass `--dtype float32` with MXFP4 models | +| `MXFP4_OUT_DTYPE=float32\|float16\|bfloat16` | Dequant output dtype | +| `FP32_PROMOTE_MOE=1` | Compute per-expert MoE forward in fp32 | +| `--kv-cache-dtype float32` | CLI flag, not an env var (pass via `server.extraArgs`); accepted by our patched allowlist and allocates an fp32 KV cache | + +These were used for A/B runs during a precision investigation; see commit history +on `cld2labs/sglang-gpt-oss` for context. + +## 🩹 What's Patched + +The image-build directory contains a series of small Python patches +applied to sglang's installed source at image build time: + +| # | Patch | Purpose | +|---|-------|---------| +| 1 | (Dockerfile step 1) | Rebuild `sgl-kernel` with `-mavx512bf16 -mamx-bf16` so bf16 matmuls emit `vdpbf16ps` instead of crashing with "scalar path not implemented" | +| 2 | `enable-mxfp4-cpu.py` | Register `mxfp4` quantization for CPU (upstream gates it behind `is_cuda() or is_hip()`) | +| 2b | `enable-gpt-oss-cpu.py` | Add `torch_native`/`intel_amx` to GptOss's CPU attention-backend allowlist | +| 3 | `enable-gpt-oss-cpu-loaders.py` | Guard `.cuda()` calls in gpt-oss weight loaders for CPU-only torch | +| 4 | `enable-gpt-oss-cpu-moe.py` | Add a CPU branch to `Mxfp4MoEMethod` that dequants MXFP4 → bf16 at load time | +| 5 | `enable-cpu-sinks-attention.py` | Add sinks-attention support to `torch_native_backend` | +| 6/7 | `enable-gpt-oss-cpu-dequant-v2.py` | Self-contained MXFP4 dequant with explicit nibble-order control | +| 8 | `enable-gpt-oss-cpu-moe-v2.py` | Route the MoE forward through `moe_forward_native` so gpt-oss's swiglu+α+clamp+biases is computed correctly | +| 9–11 | `enable-*-debug.py` | Precision debug knobs (off by default; see the env-var table above) | + +Patch 1 is a **genuine upstream regression** that affects every Xeon +sglang user, not just gpt-oss: the published image's sgl-kernel `.so` +contains zero
AVX-512-BF16 instructions, so any bf16 forward pass +crashes with `tinygemm_kernel_nn: scalar path not implemented!`. + +## ⚠️ Known Limitations + +- **Long-form drift after ~150 tokens.** With the current pure-Python + CPU MoE path, output past ~150 tokens collapses into broken tokens, + emoji, and special-token leaks. Phase 2 ran a full precision A/B + (`FP32_PROMOTE_MOE`, fp32 KV cache, `--enable-fp32-lm-head`) and + conclusively ruled out precision as the cause. Surviving hypotheses: + sliding-window-attention bookkeeping in our patched `torch_native_backend`, + or Harmony channel-switch tokenization interacting with the sinks wrapper. +- **Throughput.** The chart routes through `moe_forward_native` for + correctness, not speed; expect ~4 tok/s. The faster `fused_experts_cpu` + kernel computes plain `silu(gate)*up` rather than gpt-oss's + swiglu+α+clamp+biases, so it cannot be used directly. +- **No tensor parallelism.** The chart currently runs `--tp-size=1`. Setting + `--tp-size=2` to split across NUMA nodes should give a multi-fold speedup, + but the patch stack has not been validated under TP.
+ +## 🔧 Troubleshooting + +### View Logs + +```bash +kubectl logs -l app=sglang -f +kubectl describe pod -l app=sglang +``` + +### Common Issues + +| Symptom | Likely cause | Fix | +|---------|--------------|-----| +| `Unknown quantization method: mxfp4` | Pod is using the upstream image | Confirm `image.repository=enterprise-inference/sglang` and `image.tag=v0.5.12-xeon-fix11-debug` | +| Pod OOMKilled at startup | sglang reads host RAM, not cgroup limits | Lower `server.maxTotalTokens` or raise `resources.limits.memory` | +| `tinygemm_kernel_nn: scalar path not implemented!` | Wrong (upstream) sgl-kernel `.so` is loaded | Rebuild with `image-build/build-and-import.sh` | +| Random-vocab gibberish in `content` | Wrong MXFP4 nibble order | Verify `MXFP4_NIBBLE_ORDER=low_first` is in pod env | +| `content: null` in response | gpt-oss spent all `max_tokens` reasoning | Raise `max_tokens` to ≥ 150 | +| 504 from nginx/APISIX | Default 60s proxy timeout vs ~4 tok/s CPU inference | Bump `nginx.ingress.kubernetes.io/proxy-read-timeout` and `ApisixRoute.spec.http[].timeout` to 600s | +| 401 from APISIX with a "valid" token | Token issuer claim mismatch | Fetch token via cluster-internal `kubectl run` curl pod (see Inference) | +| Token expires too quickly | Keycloak master realm defaults to 60s | Bump `accessTokenLifespan` via the admin REST API | + +### Stop / Restart + +```bash +helm uninstall gpt-oss-20b +kubectl delete pvc -l app.kubernetes.io/instance=gpt-oss-20b # frees the model cache +``` + +## 📁 Project Structure + +``` +core/helm-charts/sglang/ +├── README.md # this file +├── Chart.yaml +├── values.yaml # full configuration surface +├── gpt-oss-20b-values.yaml # canonical override for this model +├── templates/ # Helm templates (Deployment, Service, PVC, Ingress, ApisixRoute, Secret) +└── image-build/ + ├── Dockerfile # FROM lmsysorg/sglang:v0.5.12-xeon + 11 patch steps + ├── build-and-import.sh # one-shot build + import into k3s containerd + └── enable-*.py # 
patch scripts applied at image build time +``` + +## 📚 References + +- [SGLang documentation](https://docs.sglang.io) +- [SGLang CPU server guide](https://docs.sglang.io/docs/hardware-platforms/cpu_server) +- [OpenAI gpt-oss model card](https://huggingface.co/openai/gpt-oss-20b) + +--- + +## 📎 Appendix: From-Scratch Bootstrap + +Use this only if you're standing up a fresh single-node box without OPEA's +Ansible-driven cluster setup. On a stock OPEA cluster, k3s, nginx-ingress, +APISIX, and Keycloak are already in place and you can skip directly to +🚀 **Deploy**. + +### A.1 k3s + Helm + +```bash +sudo bash scripts/bootstrap-k3s.sh +export KUBECONFIG=$HOME/.kube/config +kubectl get nodes -o wide +helm version --short +``` + +The script installs k3s (`--disable traefik`), symlinks `kubectl`, copies +kubeconfig to `~/.kube/config`, and installs Helm 3. + +### A.2 nginx-ingress + +```bash +helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx +helm install ingress-nginx ingress-nginx/ingress-nginx \ + -n ingress-nginx --create-namespace \ + --set controller.service.type=NodePort \ + --set controller.service.nodePorts.http=30080 \ + --set controller.service.nodePorts.https=30443 \ + --set controller.admissionWebhooks.enabled=false \ + --set controller.ingressClassResource.default=true + +kubectl wait --for=condition=ready pod -n ingress-nginx \ + -l app.kubernetes.io/component=controller --timeout=120s +``` + +### A.3 Keycloak (dev mode) + +```bash +kubectl apply -f - <<'EOF' +apiVersion: apps/v1 +kind: Deployment +metadata: { name: keycloak, namespace: default } +spec: + replicas: 1 + selector: { matchLabels: { app: keycloak } } + template: + metadata: { labels: { app: keycloak } } + spec: + containers: + - name: keycloak + image: quay.io/keycloak/keycloak:26.0 + args: ["start-dev"] + env: + - { name: KEYCLOAK_ADMIN, value: admin } + - { name: KEYCLOAK_ADMIN_PASSWORD, value: admin } + - { name: KC_HTTP_RELATIVE_PATH, value: "/" } + - { name: KC_PROXY, 
value: edge } + ports: [{ containerPort: 8080, name: http }] +--- +apiVersion: v1 +kind: Service +metadata: { name: keycloak, namespace: default } +spec: + selector: { app: keycloak } + ports: [{ port: 80, targetPort: 8080 }] +EOF +kubectl wait --for=condition=ready pod -l app=keycloak --timeout=300s +``` + +Create the OIDC client (`my-client-id` with the secret the chart expects): + +```bash +ADMIN=$(kubectl run kc-admin --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d "client_id=admin-cli" -d "username=admin" -d "password=admin" -d "grant_type=password"' \ + | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") + +kubectl run kc-create --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c "curl -sS -X POST -H 'Authorization: Bearer $ADMIN' \ + -H 'Content-Type: application/json' \ + http://keycloak.default.svc.cluster.local/admin/realms/master/clients \ + -d '{\"clientId\":\"my-client-id\",\"secret\":\"tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR\",\"serviceAccountsEnabled\":true,\"publicClient\":false,\"directAccessGrantsEnabled\":true}'" +``` + +### A.4 APISIX + +```bash +helm repo add apisix https://charts.apiseven.com +helm install auth-apisix apisix/apisix \ + -n auth-apisix --create-namespace \ + --set service.type=NodePort \ + --set ingress-controller.enabled=true \ + --set ingress-controller.config.apisix.serviceNamespace=auth-apisix + +kubectl wait --for=condition=ready pod -n auth-apisix --all --timeout=300s +``` + +APISIX v2 ingress controller also requires a `GatewayProxy` CRD and an +updated `IngressClass parameters` link before it will accept routes; +see the in-cluster `kubectl describe apisixroute` output for guidance +if the controller returns "Route Not Found" for an otherwise valid +ApisixRoute. 
+ +### A.5 TLS Cert for `api.example.com` + +```bash +openssl req -x509 -newkey rsa:2048 -nodes -days 365 \ + -keyout /tmp/tls.key -out /tmp/tls.crt \ + -subj "/CN=api.example.com" \ + -addext "subjectAltName=DNS:api.example.com" + +kubectl create secret tls api-example-com-tls \ + --cert=/tmp/tls.crt --key=/tmp/tls.key -n default +``` + +Now proceed to 🛠️ **Build the Image** and 🚀 **Deploy** above. From 04028ee238370195e851a2449ac4b1c32c7fa3df Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Tue, 26 May 2026 19:47:15 +0000 Subject: [PATCH 05/20] cld2labs/sglang-gpt-oss: refactor chart docs for model-agnostic framing Lead the README with the framework: SGLang on Xeon CPU, default model Qwen3-8B, any HF model SGLang supports works. Move gpt-oss-20b content into a single "Noteworthy" section that explains why the model is the driver of the patch stack and links to the full deployment recipe under third_party/Dell/model-deployment/gpt-oss-20b/. What's Patched table now annotates each patch with its scope (all bf16 models / MXFP4 only / gpt-oss specific / debug knob) so it's clear which patches actually apply to a given deployment. Troubleshooting moved out to a symptom-indexed sibling doc at third_party/Dell/model-deployment/sglang-troubleshooting.md; the README links to it. values.yaml: tighten the comment on MXFP4_NIBBLE_ORDER so it reads as a chart default that is a no-op for non-MXFP4 models, not a gpt-oss-only override. 
Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/README.md | 258 ++++++++++++++++------------ core/helm-charts/sglang/values.yaml | 20 +-- 2 files changed, 157 insertions(+), 121 deletions(-) diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md index 2d44e041..a8b9875e 100644 --- a/core/helm-charts/sglang/README.md +++ b/core/helm-charts/sglang/README.md @@ -1,43 +1,53 @@ -# SGLang Helm Chart — gpt-oss-20b on Intel Xeon CPU +# SGLang Helm Chart — Intel Xeon CPU ## 📋 Overview Deploys [SGLang](https://github.com/sgl-project/sglang) on a Kubernetes -cluster to serve `openai/gpt-oss-20b` on an Intel Xeon CPU node, including -the OPEA-standard nginx-ingress → APISIX → Keycloak (OIDC) auth chain. - -The chart targets a **patched** sglang image (`enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`) -that layers 11 fixes onto `lmsysorg/sglang:v0.5.12-xeon` — without them -the upstream image cannot serve gpt-oss on CPU (MXFP4 quantization is -GPU-gated, sinks attention is not supported on the CPU backends, the -shipped sgl-kernel `.so` is compiled without `-mavx512bf16`, etc.). -The image is built once via a self-contained Dockerfile and imported -directly into k3s containerd — no registry required. +cluster as a model-agnostic inference server on Intel Xeon CPU nodes, +including the OPEA-standard nginx-ingress → APISIX → Keycloak (OIDC) +auth chain. + +The chart ships with `Qwen/Qwen3-8B` as a sensible default model and +supports any Hugging Face model SGLang can load on CPU. Model-specific +recipes (helm command, values overrides, model card) live under +`third_party/Dell/model-deployment//`. Notable example: +**gpt-oss-20b**, which required a patched SGLang image to work on CPU +(see [Noteworthy: gpt-oss-20b](#-noteworthy-gpt-oss-20b) below). + +The chart targets a **patched** SGLang image (`enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`). 
+The most important patch (fix1) rebuilds `sgl-kernel` with the correct +AVX-512-BF16 / AMX-BF16 compile flags — the upstream +`lmsysorg/sglang:v0.5.12-xeon` ships the shared library without them, so +every bf16 forward pass crashes with `tinygemm_kernel_nn: scalar path +not implemented!` regardless of model. The remaining patches are +gpt-oss-specific and are runtime no-ops for other models. The image is +built once via a self-contained Dockerfile and imported directly into +k3s containerd — no registry required. ## ✨ Features -- **Single-model gpt-oss-20b** on Xeon CPU through the patched sglang image +- **Model-agnostic SGLang on Xeon CPU** — defaults to Qwen3-8B; any HF model SGLang supports works +- **Patched image** that unblocks bf16 inference on Xeon (every model benefits) and adds MXFP4 + sinks-attention support for gpt-oss - **OPEA-standard auth chain**: TLS at nginx, OIDC bearer validation at APISIX, token issuance by Keycloak - **No external registry**: image builds locally and imports into k3s containerd - **OpenAI-compatible API**: `/v1/chat/completions`, `/v1/models`, `/v1/completions` -- **Harmony reasoning + tool-call parsers** pre-wired for gpt-oss - **Chart-only delivery**: same standalone pattern as `core/helm-charts/ovms`, not yet wired into the Ansible playbooks ## 📦 Prerequisites - **Operating System**: Ubuntu 22.04+ - **Hardware**: Intel Xeon with AVX-512-BF16 / AMX-BF16 (Sapphire Rapids, Emerald Rapids, Granite Rapids) -- **Memory**: ≥ 64 GiB RAM (gpt-oss-20b uses ~25 GiB dequantized + KV cache) +- **Memory**: ≥ 64 GiB RAM for mid-size models (gpt-oss-20b uses ~25 GiB dequantized + KV cache) - **Disk**: ≥ 100 GiB free on the root partition - **Kubernetes**: 1.24+ (k3s is fine; this chart was validated on single-node k3s) - **Helm**: 3+ - **NodePorts free on the host**: 30080, 30443 (nginx), 32080 (APISIX) -- **HuggingFace token** (only required for gated repos; `openai/gpt-oss-20b` is public) +- **HuggingFace token** for gated models (e.g. 
`meta-llama/*`); not required for open models like `openai/gpt-oss-20b` or `Qwen/Qwen3-8B` - **Sudo access** for the one-shot image build > **Note:** On a stock OPEA cluster, k3s, nginx-ingress, APISIX, and Keycloak > are already in place via the project's Ansible playbooks — skip straight to -> 🚀 **Deploy**. The "From-Scratch Bootstrap" appendix at the bottom is only +> 🛠️ **Build the Image**. The "From-Scratch Bootstrap" appendix at the bottom is only > for people standing up a fresh single-node box from zero. ## 🛠️ Build the Image @@ -52,7 +62,7 @@ sudo bash core/helm-charts/sglang/image-build/build-and-import.sh First run takes ~5–10 minutes (installs docker.io if missing, compiles 27 C++ files in `sgl-kernel` with the right BF16 flags, runs 11 Python -patch scripts against sglang's in-image source, and imports the result +patch scripts against SGLang's in-image source, and imports the result into k3s containerd). Verify: @@ -62,19 +72,38 @@ sudo k3s ctr images ls | grep enterprise-inference/sglang # docker.io/enterprise-inference/sglang:v0.5.12-xeon-fix11-debug ``` -## 🚀 Deploy +## 🚀 Deploy a Model + +### Default model (Qwen3-8B) + +```bash +helm install qwen3-8b ./core/helm-charts/sglang +``` + +The chart defaults to `Qwen/Qwen3-8B` with bf16 weights through the +patched image's fixed `sgl-kernel`. Any HF model SGLang supports on +CPU can be deployed by overriding `modelSource` and `modelName`. -The chart ships with `gpt-oss-20b-values.yaml` as the canonical override -for this model. It pins the image, sets bf16, wires the gpt-oss parsers, -sizes resources for a Xeon node, and enables the full auth chain. 
+### Custom model ```bash -helm upgrade --install gpt-oss-20b core/helm-charts/sglang \ - -f core/helm-charts/sglang/gpt-oss-20b-values.yaml +helm install ./core/helm-charts/sglang \ + --set modelSource="" \ + --set modelName="" \ + --set huggingface.token="$HF_TOKEN" # only if gated ``` -Wait for the pod (first start downloads ~12 GB of weights, then runs -MXFP4 → bf16 dequant — total ~4–5 minutes): +### Model-specific recipes + +Models that need extra configuration ship with their own values file and +deployment guide: + +| Model | Values file | Deployment guide | +| ----- | ----------- | ---------------- | +| `openai/gpt-oss-20b` | `gpt-oss-20b-values.yaml` | `third_party/Dell/model-deployment/gpt-oss-20b/deployment.md` | + +Wait for the pod (first start downloads the weights — duration depends +on model size and network): ```bash kubectl wait --for=condition=ready pod -l app=sglang --timeout=600s @@ -84,23 +113,23 @@ kubectl logs -l app=sglang --tail=5 ## 🎯 Inference -### Smoke Test (no auth, via port-forward) +### Smoke test (no auth, via port-forward) ```bash -kubectl port-forward svc/gpt-oss-20b-sglang 30000:30000 & +kubectl port-forward svc/-sglang 30000:30000 & sleep 2 curl -sS http://localhost:30000/v1/chat/completions \ -H 'Content-Type: application/json' \ -d '{ - "model": "gpt-oss-20b", + "model": "", "messages": [{"role":"user","content":"In one sentence, what is deep learning?"}], "max_tokens": 150, "temperature": 0.3 }' | python3 -m json.tool ``` -### Auth-Routed Call (nginx → APISIX → Keycloak → sglang) +### Auth-routed call (nginx → APISIX → Keycloak → sglang) Fetch a token from inside the cluster (so the `iss` claim matches what APISIX validates against), then call through the ingress: @@ -110,23 +139,23 @@ TOKEN=$(kubectl run keycloak-tok --rm -i --restart=Never --quiet \ --image=curlimages/curl:8.10.1 -- \ sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ -d "client_id=my-client-id" 
\ - -d "client_secret=tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR" \ + -d "client_secret=" \ -d "grant_type=client_credentials"' \ | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") -curl -sSk https://localhost:30443/gpt-oss-20b-sglang/v1/chat/completions \ +curl -sSk https://localhost:30443/-sglang/v1/chat/completions \ -H "Host: api.example.com" \ -H "Authorization: Bearer $TOKEN" \ -H 'Content-Type: application/json' \ -d '{ - "model": "gpt-oss-20b", + "model": "", "messages": [{"role":"user","content":"In one sentence, what is deep learning?"}], "max_tokens": 150, "temperature": 0.3 }' | python3 -m json.tool ``` -### API Endpoints +### API endpoints | Endpoint | Description | |----------|-------------| @@ -135,39 +164,21 @@ curl -sSk https://localhost:30443/gpt-oss-20b-sglang/v1/chat/completions \ | `/v1/completions` | OpenAI-compatible text completions | | `/health` | Liveness probe | -### Notes on `max_tokens` - -gpt-oss uses the **Harmony format**: every response starts in an -"analysis" channel (the model's scratchpad) and only switches to the -"final" channel once it's done thinking. With small budgets the model -spends them all reasoning and emits no user-visible content. Practical -guidance: - -| `max_tokens` | What you'll see | -|--------------|-----------------| -| ≤ 100 | Usually `content: null`, reasoning truncated | -| 150 | One short sentence — good for quick demos | -| 300 | Paragraph + small table | -| > 400 | Hits the documented long-form drift (see ⚠️ below) | - -The reasoning is preserved in `response.choices[0].message.reasoning_content` -and the visible answer in `response.choices[0].message.content`. 
- ## ⚙️ Configuration -### Key Values +### Key values | Key | Default | Description | |-----|---------|-------------| -| `image.repository` | `enterprise-inference/sglang` | Patched image (override to switch back to `lmsysorg/sglang`) | +| `image.repository` | `enterprise-inference/sglang` | Patched image (set to `lmsysorg/sglang` to use upstream, but bf16 inference will crash) | | `image.tag` | `v0.5.12-xeon-fix11-debug` | Pinned to the validated build | | `image.pullPolicy` | `IfNotPresent` | Set to `Never` if the image is only in local containerd | | `modelSource` | `Qwen/Qwen3-8B` | HuggingFace repo to load | | `modelName` | `qwen3-8b` | Served name (also used in route URI) | | `server.dtype` | `bfloat16` | Compute dtype | | `server.extraArgs` | `[]` | Extra CLI flags to `sglang serve` | -| `server.maxTotalTokens` | `32768` | Caps KV-cache memory (sglang reads host RAM, not cgroup limits) | -| `extraEnv` | `[MXFP4_NIBBLE_ORDER=low_first]` | Env vars; the default is required for correct MXFP4 weight decode | +| `server.maxTotalTokens` | `32768` | Caps KV-cache memory (SGLang reads host RAM, not cgroup limits) | +| `extraEnv` | `[MXFP4_NIBBLE_ORDER=low_first]` | Env vars; the default is required for MXFP4 models and a runtime no-op for others | | `oidc.enabled` | `true` | Enable APISIX `openid-connect` plugin | | `apisixRoute.enabled` | `true` | Create `ApisixRoute` for the service | | `ingress.enabled` | `true` | Create `Ingress` for the service | @@ -175,14 +186,14 @@ and the visible answer in `response.choices[0].message.content`. The complete configuration surface is documented inline in `values.yaml`. 
-### Debug Env Vars (off by default, baked into the image) +### Debug env vars (off by default, baked into the image) -| Variable | Effect | -|----------|--------| -| `ALLOW_FP32_MXFP4=1` | Lets you pass `--dtype float32` with MXFP4 models | -| `MXFP4_OUT_DTYPE=float32\|float16\|bfloat16` | Dequant output dtype | -| `FP32_PROMOTE_MOE=1` | Compute per-expert MoE forward in fp32 | -| `--kv-cache-dtype float32` | Allowed by our patched allowlist (allocates fp32 KV) | +| Variable | Effect | Applies to | +| -------- | ------ | ---------- | +| `ALLOW_FP32_MXFP4=1` | Lets you pass `--dtype float32` with MXFP4 models | MXFP4 models only | +| `MXFP4_OUT_DTYPE=float32\|float16\|bfloat16` | Dequant output dtype | MXFP4 models only | +| `FP32_PROMOTE_MOE=1` | Compute per-expert MoE forward in fp32 | MoE models only | +| `--kv-cache-dtype float32` | Allowed by our patched allowlist (allocates fp32 KV) | All models | These were used during a precision investigation A/B; see commit history on `cld2labs/sglang-gpt-oss` for context. @@ -190,69 +201,89 @@ on `cld2labs/sglang-gpt-oss` for context. 
## 🩹 What's Patched The image-build directory contains a series of small Python patches -applied to sglang's installed source at image build time: - -| # | Patch | Purpose | -|---|-------|---------| -| 1 | (Dockerfile step 1) | Rebuild `sgl-kernel` with `-mavx512bf16 -mamx-bf16` so bf16 matmuls emit `vdpbf16ps` instead of crashing with "scalar path not implemented" | -| 2 | `enable-mxfp4-cpu.py` | Register `mxfp4` quantization for CPU (upstream gates it behind `is_cuda() or is_hip()`) | -| 2b | `enable-gpt-oss-cpu.py` | Add `torch_native`/`intel_amx` to GptOss's CPU attention-backend allowlist | -| 3 | `enable-gpt-oss-cpu-loaders.py` | Guard `.cuda()` calls in gpt-oss weight loaders for CPU-only torch | -| 4 | `enable-gpt-oss-cpu-moe.py` | Add a CPU branch to `Mxfp4MoEMethod` that dequants MXFP4 → bf16 at load time | -| 5 | `enable-cpu-sinks-attention.py` | Add sinks-attention support to `torch_native_backend` | -| 6/7 | `enable-gpt-oss-cpu-dequant-v2.py` | Self-contained MXFP4 dequant with explicit nibble-order control | -| 8 | `enable-gpt-oss-cpu-moe-v2.py` | Route the MoE forward through `moe_forward_native` so gpt-oss's swiglu+α+clamp+biases is computed correctly | -| 9–11 | `enable-*-debug.py` | Precision debug knobs (off by default; see the env-var table above) | +applied to SGLang's installed source at image build time: + +| # | Patch | Scope | Purpose | +|---|-------|-------|---------| +| 1 | (Dockerfile step 1) | **All bf16 models** | Rebuild `sgl-kernel` with `-mavx512bf16 -mamx-bf16` so bf16 matmuls emit `vdpbf16ps` instead of crashing with "scalar path not implemented" | +| 2 | `enable-mxfp4-cpu.py` | MXFP4 models | Register `mxfp4` quantization for CPU (upstream gates it behind `is_cuda() or is_hip()`) | +| 2b | `enable-gpt-oss-cpu.py` | gpt-oss | Add `torch_native`/`intel_amx` to GptOss's CPU attention-backend allowlist | +| 3 | `enable-gpt-oss-cpu-loaders.py` | gpt-oss | Guard `.cuda()` calls in gpt-oss weight loaders for CPU-only torch | +| 4 | 
`enable-gpt-oss-cpu-moe.py` | MXFP4 MoE | Add a CPU branch to `Mxfp4MoEMethod` that dequants MXFP4 → bf16 at load time | +| 5 | `enable-cpu-sinks-attention.py` | sinks-attention models (gpt-oss) | Add sinks-attention support to `torch_native_backend` | +| 6/7 | `enable-gpt-oss-cpu-dequant-v2.py` | MXFP4 models | Self-contained MXFP4 dequant with explicit nibble-order control via `MXFP4_NIBBLE_ORDER` | +| 8 | `enable-gpt-oss-cpu-moe-v2.py` | gpt-oss | Route the MoE forward through `moe_forward_native` so gpt-oss's swiglu+α+clamp+biases is computed correctly | +| 9–11 | `enable-*-debug.py` | Debug knobs | Precision A/B knobs (off by default; see env-var table above) | Patch 1 is a **genuine upstream regression** that affects every Xeon -sglang user, not just gpt-oss — the published image's sgl-kernel `.so` +SGLang user, not just gpt-oss — the published image's `sgl-kernel` `.so` contains zero AVX-512-BF16 instructions, so any bf16 forward pass crashes with `tinygemm_kernel_nn: scalar path not implemented!`. -## ⚠️ Known Limitations - -- **Long-form drift after ~150 tokens.** With the current pure-Python - CPU MoE path, output past ~150 tokens collapses into broken tokens, - emoji, and special-token leaks. Phase 2 ran a full precision A/B - (`FP32_PROMOTE_MOE`, fp32 KV cache, `--enable-fp32-lm-head`) and - conclusively ruled out precision as the cause. Surviving hypotheses: - sliding-window-attention bookkeeping in our patched `torch_native_backend`, - or Harmony channel-switch tokenization interacting with the sinks wrapper. +Patches 2–8 are **gpt-oss-specific** in scope. They are runtime no-ops +for models that don't trigger them (e.g. a Qwen3 deployment never enters +the MXFP4 dequant path or the sinks-attention wrapper), so leaving them +baked into the image carries no cost for other models. + +## ⭐ Noteworthy: gpt-oss-20b + +`openai/gpt-oss-20b` is the most complex model this chart serves and the +driver for most of the patch stack above. 
Specifically: + +- **MXFP4 quantization is GPU-gated upstream.** Patches 2, 4, 6/7 enable + it on CPU by registering the quantization method and adding a CPU + weight-load dequant path (MXFP4 → bf16 at startup). +- **gpt-oss uses sinks attention** (a learnable per-head scalar added to + the softmax denominator). No upstream CPU attention backend supports + it; patch 5 adds it to `torch_native_backend`. +- **MoE forward needs gpt-oss-specific math** (swiglu + α + clamp + + biases). Patch 8 routes through `moe_forward_native`, which handles + this correctly at the cost of throughput vs the AMX kernel. + +The full deployment recipe — model card, helm command, verification, +parameter reference — is in +[`third_party/Dell/model-deployment/gpt-oss-20b/`](../../third_party/Dell/model-deployment/gpt-oss-20b/). + +**Known limitations specific to gpt-oss-20b:** + +- **Long-form drift after ~150 tokens.** Output past ~150 tokens + collapses into broken tokens, emoji, and special-token leaks. A + precision A/B (fp32 per-expert MoE, fp32 KV cache, + `--enable-fp32-lm-head`) conclusively ruled out precision as the + cause. Surviving hypotheses: sliding-window-attention bookkeeping in + our patched `torch_native_backend`, or Harmony channel-switch + tokenization interacting with the sinks wrapper. - **Throughput.** The chart routes through `moe_forward_native` for - correctness, not speed; expect ~4 tok/s. The faster `fused_experts_cpu` - kernel does plain `silu(gate)*up` and cannot be used directly for - gpt-oss. + correctness, not speed; expect ~4 tok/s. - **No tensor parallelism.** Chart currently runs `--tp-size=1`. Setting `--tp-size=2` to split across NUMA nodes should give multi-x speedup but the patch stack has not been validated under TP. 
## 🔧 Troubleshooting -### View Logs +See [`third_party/Dell/model-deployment/sglang-troubleshooting.md`](../../third_party/Dell/model-deployment/sglang-troubleshooting.md) +for a symptom-indexed guide covering: + +- Gateway Timeout (504) on inference requests +- Response `content` field is null (gpt-oss Harmony format) +- "Unknown quantization method: mxfp4" at startup +- "scalar path not implemented!" on the first forward pass +- Random-vocab gibberish in `content` (nibble order) +- Long-form drift past ~150 tokens (gpt-oss) +- 401 Unauthorized from APISIX with a valid-looking token (issuer mismatch) + +Quick log + describe: ```bash kubectl logs -l app=sglang -f kubectl describe pod -l app=sglang ``` -### Common Issues - -| Symptom | Likely cause | Fix | -|---------|--------------|-----| -| `Unknown quantization method: mxfp4` | Pod is using the upstream image | Confirm `image.repository=enterprise-inference/sglang` and `image.tag=v0.5.12-xeon-fix11-debug` | -| Pod OOMKilled at startup | sglang reads host RAM, not cgroup limits | Lower `server.maxTotalTokens` or raise `resources.limits.memory` | -| `tinygemm_kernel_nn: scalar path not implemented!` | Wrong (upstream) sgl-kernel `.so` is loaded | Rebuild with `image-build/build-and-import.sh` | -| Random-vocab gibberish in `content` | Wrong MXFP4 nibble order | Verify `MXFP4_NIBBLE_ORDER=low_first` is in pod env | -| `content: null` in response | gpt-oss spent all `max_tokens` reasoning | Raise `max_tokens` to ≥ 150 | -| 504 from nginx/APISIX | Default 60s proxy timeout vs ~4 tok/s CPU inference | Bump `nginx.ingress.kubernetes.io/proxy-read-timeout` and `ApisixRoute.spec.http[].timeout` to 600s | -| 401 from APISIX with a "valid" token | Token issuer claim mismatch | Fetch token via cluster-internal `kubectl run` curl pod (see Inference) | -| Token expires too quickly | Keycloak master realm defaults to 60s | Bump `accessTokenLifespan` via the admin REST API | - -### Stop / Restart +### Stop / restart ```bash 
-helm uninstall gpt-oss-20b -kubectl delete pvc -l app.kubernetes.io/instance=gpt-oss-20b # frees the model cache +helm uninstall +kubectl delete pvc -l app.kubernetes.io/instance= # frees the model cache ``` ## 📁 Project Structure @@ -262,12 +293,18 @@ core/helm-charts/sglang/ ├── README.md # this file ├── Chart.yaml ├── values.yaml # full configuration surface -├── gpt-oss-20b-values.yaml # canonical override for this model +├── gpt-oss-20b-values.yaml # canonical override for gpt-oss-20b ├── templates/ # Helm templates (Deployment, Service, PVC, Ingress, ApisixRoute, Secret) └── image-build/ ├── Dockerfile # FROM lmsysorg/sglang:v0.5.12-xeon + 11 patch steps ├── build-and-import.sh # one-shot build + import into k3s containerd └── enable-*.py # patch scripts applied at image build time + +third_party/Dell/model-deployment/ +├── sglang-troubleshooting.md # symptom-indexed troubleshooting for the SGLang chart +└── gpt-oss-20b/ + ├── model-card.md # gpt-oss-20b model card + └── deployment.md # gpt-oss-20b deployment guide ``` ## 📚 References @@ -275,6 +312,7 @@ core/helm-charts/sglang/ - [SGLang documentation](https://docs.sglang.io) - [SGLang CPU server guide](https://docs.sglang.io/docs/hardware-platforms/cpu_server) - [OpenAI gpt-oss model card](https://huggingface.co/openai/gpt-oss-20b) +- [Qwen3-8B model card](https://huggingface.co/Qwen/Qwen3-8B) --- @@ -283,7 +321,7 @@ core/helm-charts/sglang/ Use this only if you're standing up a fresh single-node box without OPEA's Ansible-driven cluster setup. On a stock OPEA cluster, k3s, nginx-ingress, APISIX, and Keycloak are already in place and you can skip directly to -🚀 **Deploy**. +🛠️ **Build the Image**. 
### A.1 k3s + Helm @@ -361,7 +399,7 @@ kubectl run kc-create --rm -i --restart=Never --quiet \ sh -c "curl -sS -X POST -H 'Authorization: Bearer $ADMIN' \ -H 'Content-Type: application/json' \ http://keycloak.default.svc.cluster.local/admin/realms/master/clients \ - -d '{\"clientId\":\"my-client-id\",\"secret\":\"tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR\",\"serviceAccountsEnabled\":true,\"publicClient\":false,\"directAccessGrantsEnabled\":true}'" + -d '{\"clientId\":\"my-client-id\",\"secret\":\"\",\"serviceAccountsEnabled\":true,\"publicClient\":false,\"directAccessGrantsEnabled\":true}'" ``` ### A.4 APISIX @@ -383,7 +421,7 @@ see the in-cluster `kubectl describe apisixroute` output for guidance if the controller returns "Route Not Found" for an otherwise valid ApisixRoute. -### A.5 TLS Cert for `api.example.com` +### A.5 TLS cert for `api.example.com` ```bash openssl req -x509 -newkey rsa:2048 -nodes -days 365 \ @@ -395,4 +433,4 @@ kubectl create secret tls api-example-com-tls \ --cert=/tmp/tls.crt --key=/tmp/tls.key -n default ``` -Now proceed to 🛠️ **Build the Image** and 🚀 **Deploy** above. +Now proceed to 🛠️ **Build the Image** and 🚀 **Deploy a Model** above. diff --git a/core/helm-charts/sglang/values.yaml b/core/helm-charts/sglang/values.yaml index 7afa6eda..544d1ac1 100644 --- a/core/helm-charts/sglang/values.yaml +++ b/core/helm-charts/sglang/values.yaml @@ -242,18 +242,16 @@ extraVolumes: [] extraVolumeMounts: [] # extraEnv: extra environment variables exposed to the sglang container. # -# For gpt-oss-on-CPU the patched image needs the correct MXFP4 nibble order -# to decode weights properly (without this the model serves but emits -# random-vocab garbage): -# - MXFP4_NIBBLE_ORDER=low_first fix7 dequant nibble order +# MXFP4_NIBBLE_ORDER=low_first is read only by the patched image's MXFP4 +# dequant path and is a runtime no-op for non-MXFP4 models. It is required +# for any MXFP4 model loaded on CPU (e.g. 
openai/gpt-oss-*) — without it +# the model serves but emits random-vocab garbage. # -# Other debug flags exposed by the patched image (default off, here for -# reference): -# - FP32_PROMOTE_MOE=1 fix10-debug: per-expert MoE in fp32 -# - ALLOW_FP32_MXFP4=1 fix9-debug: bypass dtype=bfloat16 force -# - MXFP4_OUT_DTYPE=float32|float16 fix9-debug2: dequant output dtype -# -# For non-gpt-oss models, override with `extraEnv: []`. +# Other debug flags exposed by the patched image (off by default; set +# only when reproducing a precision investigation): +# - FP32_PROMOTE_MOE=1 per-expert MoE forward in fp32 +# - ALLOW_FP32_MXFP4=1 allow --dtype float32 with mxfp4 models +# - MXFP4_OUT_DTYPE=float32|float16 dequant output dtype extraEnv: - name: MXFP4_NIBBLE_ORDER value: "low_first" From 7b53ad9de8c1a1c226b3e0c7c7a038e22cef0a50 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Tue, 26 May 2026 19:50:26 +0000 Subject: [PATCH 06/20] cld2labs/sglang-gpt-oss: add gpt-oss-20b model-deployment card + sglang troubleshooting MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add deployment recipe for openai/gpt-oss-20b on the SGLang chart in the same shape as the llama-3.1-8b-instruct card on cld2labs/llama-3.1-8b-instruct: third_party/Dell/model-deployment/gpt-oss-20b/ ├── model-card.md — model metadata, license (Apache 2.0), intended │ use, limitations └── deployment.md — step-by-step Keycloak token, image build, helm install, verify, test, undeploy, parameter table Add sibling troubleshooting doc covering issues specific to SGLang deployments (Gateway Timeout 504, content:null with Harmony format, MXFP4 quantization gate errors, scalar-path crashes, nibble-order gibberish, long-form drift, APISIX issuer-claim mismatches). 
Signed-off-by: arpannookala-12 --- .../gpt-oss-20b/deployment.md | 131 ++++++++++++++ .../gpt-oss-20b/model-card.md | 66 +++++++ .../sglang-troubleshooting.md | 170 ++++++++++++++++++ 3 files changed, 367 insertions(+) create mode 100644 third_party/Dell/model-deployment/gpt-oss-20b/deployment.md create mode 100644 third_party/Dell/model-deployment/gpt-oss-20b/model-card.md create mode 100644 third_party/Dell/model-deployment/sglang-troubleshooting.md diff --git a/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md new file mode 100644 index 00000000..237e4332 --- /dev/null +++ b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md @@ -0,0 +1,131 @@ +## Step 1: Prerequisites to Deploy gpt-oss-20b Model on Xeon with Keycloak + +Ensure the Enterprise Inference stack with Keycloak is already deployed before proceeding. + +Edit `core/scripts/generate-token.sh` and set your values before sourcing it: + +| Variable | Description | +| ------------------------- | ------------------------------------------------------------------------ | +| `BASE_URL` | Hostname of your cluster (e.g. `api.example.com`), without `https://` | +| `KEYCLOAK_ADMIN_USERNAME` | Keycloak admin username | +| `KEYCLOAK_PASSWORD` | Keycloak admin password | +| `KEYCLOAK_CLIENT_ID` | Keycloak client ID configured during EI deployment | + +Then run: + +```bash +export HUGGING_FACE_HUB_TOKEN="your_token_here" + +cd ~/Enterprise-Inference +source core/scripts/generate-token.sh +``` + +This exports: `BASE_URL`, `KEYCLOAK_CLIENT_ID`, `KEYCLOAK_CLIENT_SECRET`, and `TOKEN`. + +## Step 2: Build the Patched SGLang Image + +gpt-oss-20b ships natively in MXFP4 quantization, and the upstream `lmsysorg/sglang:v0.5.12-xeon` image cannot serve it on CPU (MXFP4 is GPU-gated, sinks attention is unsupported on the CPU backends, and the published `sgl-kernel` shared library is missing the AVX-512-BF16 compile flags needed for any bf16 matmul). 
+ +The SGLang chart ships a one-shot build script that produces a patched image and imports it directly into k3s containerd. No external registry is required. + +```bash +sudo bash core/helm-charts/sglang/image-build/build-and-import.sh +``` + +First run takes ~5-10 minutes. Verify: + +```bash +sudo k3s ctr images ls | grep enterprise-inference/sglang +# docker.io/enterprise-inference/sglang:v0.5.12-xeon-fix11-debug +``` + +For a detailed breakdown of what each patch does, see `core/helm-charts/sglang/README.md` (section: What's Patched). + +## Step 3: Deploy gpt-oss-20b Model + +The chart ships with a canonical values file for this model at `core/helm-charts/sglang/gpt-oss-20b-values.yaml`. + +```bash +helm install sglang-gpt-oss-20b ./core/helm-charts/sglang \ + --values ./core/helm-charts/sglang/gpt-oss-20b-values.yaml \ + --set modelSource="openai/gpt-oss-20b" \ + --set huggingface.token="$HUGGING_FACE_HUB_TOKEN" \ + --set ingress.enabled=true \ + --set ingress.secretName="${BASE_URL}" \ + --set ingress.host="${BASE_URL}" \ + --set oidc.clientId="$KEYCLOAK_CLIENT_ID" \ + --set oidc.clientSecret="$KEYCLOAK_CLIENT_SECRET" \ + --set apisixRoute.enabled=true +``` + +## Step 4: Verify the Deployment + +```bash +kubectl get pods +kubectl get apisixroutes +``` + +Expected Output: + +``` +NAME READY STATUS RESTARTS +keycloak-0 1/1 Running 0 +keycloak-postgresql-0 1/1 Running 0 +sglang-gpt-oss-20b-- 1/1 Running 0 +``` + +> Note: First pod start takes ~4-5 minutes (downloading ~12 GB of weights from Hugging Face, then dequantizing MXFP4 → bf16 in memory). Subsequent restarts are fast because the cache PVC persists the weights. 
+ +``` +NAME HOSTS +sglang-gpt-oss-20b-apisixroute api.example.com +``` + +## Step 5: Test the Deployed Model + +```bash +curl -k https://${BASE_URL}/gpt-oss-20b-sglang/v1/chat/completions \ + -X POST \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $TOKEN" \ + -d '{ + "model": "gpt-oss-20b", + "messages": [{"role": "user", "content": "In one sentence, what is deep learning?"}], + "max_tokens": 150, + "temperature": 0.3 + }' +``` + +If successful, the model returns a chat-completion response with the answer in `choices[0].message.content` and the model's internal reasoning in `choices[0].message.reasoning_content`. + +### A note on `max_tokens` + +gpt-oss uses the Harmony chat format: every response starts in an internal "analysis" channel and only switches to the user-visible "final" channel when reasoning is complete. With small budgets the model spends them all reasoning and the `content` field comes back null: + +| `max_tokens` | What you'll see | +| ------------ | --------------- | +| ≤ 100 | `content: null`, reasoning truncated | +| 150 | One short sentence — good for quick verification | +| 300 | Paragraph with light formatting | +| > 400 | Hits documented long-form drift (see troubleshooting) | + +## To undeploy the model + +```bash +helm uninstall sglang-gpt-oss-20b +kubectl delete pvc -l app.kubernetes.io/instance=sglang-gpt-oss-20b # frees the cached weights +``` + +## Parameters + +| Parameter | Description | +| ------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- | +| `--values ./core/helm-charts/sglang/gpt-oss-20b-values.yaml` | Canonical values file for this model. Pins the patched image, sets bf16, wires the Harmony reasoning and tool-call parsers, sizes resources. | +| `--set modelSource="openai/gpt-oss-20b"` | Defines the target model from **Hugging Face** to deploy. 
| +| `--set huggingface.token="..."` | Authenticates access to gated or private Hugging Face models. The gpt-oss repo is public, so this is optional but harmless. | +| `--set ingress.enabled=true` | Enables Kubernetes **Ingress** to expose the model service externally. | +| `--set ingress.host="${BASE_URL}"` | Public hostname or FQDN for the inference endpoint (maps to your Ingress controller IP). | +| `--set ingress.secretName="${BASE_URL}"` | Kubernetes **TLS Secret** used for HTTPS termination at the ingress layer. | +| `--set oidc.clientId="..."` | Keycloak OIDC client ID used for token-based authentication. | +| `--set oidc.clientSecret="..."` | Keycloak OIDC client secret corresponding to the client ID. | +| `--set apisixRoute.enabled=true` | Enables the **APISIX** route for gateway routing and OIDC bearer validation. | diff --git a/third_party/Dell/model-deployment/gpt-oss-20b/model-card.md b/third_party/Dell/model-deployment/gpt-oss-20b/model-card.md new file mode 100644 index 00000000..2a295773 --- /dev/null +++ b/third_party/Dell/model-deployment/gpt-oss-20b/model-card.md @@ -0,0 +1,66 @@ +# gpt-oss-20b + +This model uses gpt-oss-20b, a 20.9 billion-parameter open-weight mixture-of-experts model from OpenAI. It is part of the gpt-oss family released under the Apache 2.0 license and is optimized for reasoning, agentic workflows, and tool use, with a configurable reasoning effort. + +For full details including model specifications, licensing, intended use, safety guidance, and example prompts, please visit the official Hugging Face page: **Official Hugging Face Page** + +https://huggingface.co/openai/gpt-oss-20b + +This model provides inference services only; weights are hosted by Hugging Face under OpenAI's Apache 2.0 release. + +Ensure compliance with OpenAI's Apache 2.0 release terms and usage policy before using this model. 
+ +### Model Attribution + +**Developer:** OpenAI + +**Purpose:** Open-weight reasoning, agentic, and tool-using model with configurable analysis depth (low / medium / high reasoning effort) + +**Sizes/Variants:** 20 B total parameters with mixture-of-experts (3.6 B active per token); the gpt-oss family also includes a 120 B variant + +**Modalities:** Text input → Text (including code, structured outputs, and tool calls) output + +**Parameter Size:** ~20.9 billion total (~3.6 billion active per token) + +**Max Context:** Up to 128 k tokens + +**License:** Apache 2.0 + +**Native Quantization:** MXFP4 (4-bit microscaling float) on the MoE weights, dequantized to bf16 at weight-load time for CPU inference + +**Minimum required CPU Cores:** 64 (recommended 96+ for usable throughput) + +**Minimum required PCIe Cards:** 0 (CPU-only deployment via the patched SGLang Xeon image) + +### Usage Notice + +**By using this model, you agree that:** + +- Inputs and outputs are processed through gpt-oss-20b under OpenAI's Apache 2.0 release. +- You will comply with OpenAI's usage policy and the Apache 2.0 license terms, including attribution and notice requirements when redistributing. +- All generated content (text, code, or tool calls) must be reviewed for accuracy, compliance, and safety before deployment. +- The model should not be used for generating malicious content, disallowed content, or for automating decisions in high-risk or regulated systems without appropriate safeguards. + +### Intended Applications + +- Reasoning-heavy chatbots and assistants with adjustable reasoning effort. +- Agentic workflows: tool calling, structured function invocation, multi-step task decomposition. +- Code generation, completion, and refactoring across multiple programming languages. +- Long-context tasks: summarization of long documents, dialog over long history, RAG (retrieval-augmented generation) over extended context (subject to the long-form notes in the deployment guide).
+- Research, prototyping, and commercial workflows under Apache 2.0 terms. + +### Limitations + +- The 20 B size — while strong for reasoning — still trails much larger models on knowledge-intensive tasks. +- As with all large language models, risk of hallucinations, biases, or unsafe outputs remains; outputs should be reviewed before downstream use. +- The model uses the Harmony chat format with separate analysis and final channels; small `max_tokens` budgets often leave responses in the analysis channel with no user-visible content. See the deployment guide for guidance. +- CPU-only deployment via the patched SGLang image is throughput-limited (~4 tokens/s on a Xeon 6972P with the current pure-Python MoE path) and exhibits a documented long-form drift past ~150 generated tokens. Short-form generation is solid. +- Native MXFP4 quantization requires the patched SGLang image; the upstream `lmsysorg/sglang:v0.5.12-xeon` cannot serve this model. + +### References + +"Introducing gpt-oss". https://openai.com/index/introducing-gpt-oss/ + +Hugging Face Model Card: https://huggingface.co/openai/gpt-oss-20b + +OpenAI gpt-oss GitHub Repository. https://github.com/openai/gpt-oss diff --git a/third_party/Dell/model-deployment/sglang-troubleshooting.md b/third_party/Dell/model-deployment/sglang-troubleshooting.md new file mode 100644 index 00000000..eae93c9b --- /dev/null +++ b/third_party/Dell/model-deployment/sglang-troubleshooting.md @@ -0,0 +1,170 @@ +# SGLang Troubleshooting Guide + +This section provides common issues observed when running inference against models deployed via the SGLang Helm chart on Intel® AI for Enterprise Inference, along with step-by-step resolutions. + +**Issues:** +1. [Gateway Timeout (504) on Inference Requests](#1-gateway-timeout-504-on-inference-requests) +2. [Response `content` field is null](#2-response-content-field-is-null) +3. 
[Pod startup fails with "Unknown quantization method: mxfp4"](#3-pod-startup-fails-with-unknown-quantization-method-mxfp4) +4. [Pod startup fails with "scalar path not implemented!"](#4-pod-startup-fails-with-scalar-path-not-implemented) +5. [Model serves but emits random-vocab gibberish in `content`](#5-model-serves-but-emits-random-vocab-gibberish-in-content) +6. [Long-form responses degrade into broken tokens after ~150 tokens](#6-long-form-responses-degrade-into-broken-tokens-after-150-tokens) +7. [401 Unauthorized from APISIX with a valid-looking token](#7-401-unauthorized-from-apisix-with-a-valid-looking-token) + +--- + +### 1. Gateway Timeout (504) on Inference Requests + +**Context:** Model deployed via the SGLang chart. Inference request sent through the ingress stack (ingress-nginx → APISIX → SGLang service). + +**Error:** Inference requests return `504 Gateway Timeout` after 60 seconds. + +**Cause:** CPU-based MoE inference (gpt-oss-20b on Xeon) generates tokens at ~4 tokens/s. Responses requiring more than ~240 tokens exceed the default 60s upstream timeout enforced by ingress-nginx and APISIX. + +**Fix:** + +**Step 1 – Increase the nginx ingress timeout** + +Find the ingress and annotate it: + +```bash +kubectl get ingress -A | grep sglang +kubectl annotate ingress -n \ + nginx.ingress.kubernetes.io/proxy-read-timeout="600" \ + nginx.ingress.kubernetes.io/proxy-send-timeout="600" \ + nginx.ingress.kubernetes.io/proxy-connect-timeout="60" \ + --overwrite +``` + +**Step 2 – Increase the APISIX route timeout** + +```bash +kubectl get apisixroute -A | grep sglang +kubectl patch apisixroute -n --type='json' \ + -p='[{"op":"add","path":"/spec/http/0/timeout","value":{"connect":"5s","read":"600s","send":"600s"}}]' +``` + +**Verification:** Re-run the inference request and confirm a `200 OK` response within the new window. + +**Notes:** +- Annotations apply immediately; no pod restart required. 
+- For shorter responses (`max_tokens ≤ 200`), the default 60s timeout is usually sufficient. + +--- + +### 2. Response `content` field is null + +**Context:** gpt-oss-20b deployed via the SGLang chart. Inference request returns HTTP 200 but `choices[0].message.content` is `null`; `choices[0].message.reasoning_content` is populated. + +**Cause:** gpt-oss uses the Harmony chat format with separate analysis and final channels. The model always begins in the analysis channel (internal reasoning) and only switches to the final channel when reasoning completes. With small `max_tokens` budgets, the model exhausts the budget while still reasoning and never emits visible content. `finish_reason` will be `length` and `reasoning_tokens` will equal `completion_tokens`. + +**Fix:** Raise `max_tokens` so the model has budget to finish reasoning AND emit a final answer: + +| `max_tokens` | Outcome | +| ------------ | ---------------------------------------- | +| ≤ 100 | Typically `content: null` | +| 150 | One short sentence (good for verification) | +| 300 | Paragraph with light formatting | + +The internal reasoning is always preserved in `reasoning_content` even when `content` is null. + +--- + +### 3. Pod startup fails with "Unknown quantization method: mxfp4" + +**Context:** Pod CrashLoopBackOff at startup. Logs show `ValueError: Unknown quantization method: mxfp4`. + +**Cause:** The pod is running the upstream `lmsysorg/sglang:v0.5.12-xeon` image. The upstream image gates MXFP4 quantization behind `is_cuda() or is_hip()`, so it cannot load MXFP4 models on CPU. + +**Fix:** Confirm the chart is using the patched image. 
The chart's `values.yaml` defaults to it, but a `--set image.repository=...` override may have switched it back: + +```bash +kubectl get pod -l app=sglang -o jsonpath='{.items[0].spec.containers[0].image}{"\n"}' +# expected: enterprise-inference/sglang:v0.5.12-xeon-fix11-debug +``` + +If the image is wrong, redeploy with the chart defaults or explicitly set: + +```bash +helm upgrade ./core/helm-charts/sglang \ + --reuse-values \ + --set image.repository=enterprise-inference/sglang \ + --set image.tag=v0.5.12-xeon-fix11-debug +``` + +If the patched image is not present locally, build it first: + +```bash +sudo bash core/helm-charts/sglang/image-build/build-and-import.sh +``` + +--- + +### 4. Pod startup fails with "scalar path not implemented!" + +**Context:** Pod crashes on the first forward pass. Logs show `RuntimeError: tinygemm_kernel_nn: scalar path not implemented!`. + +**Cause:** The `sgl-kernel` shared library loaded by the pod was compiled without `-mavx512bf16`. The bf16 specialization of `tinygemm_kernel_nn` falls through to a stub that throws this error. This is the upstream regression the patched image's Dockerfile step 1 fixes. + +**Fix:** Verify the patched image is loaded (same check as issue #3). If the patched image is loaded and this error still appears, the build may have failed silently — rebuild and check the verification line: + +```bash +sudo bash core/helm-charts/sglang/image-build/build-and-import.sh 2>&1 | grep "AVX-512 BF16 instructions" +# expected: ~1400+ instructions +``` + +A count under 100 indicates the compile flags did not take effect during the build. + +--- + +### 5. Model serves but emits random-vocab gibberish in `content` + +**Context:** gpt-oss-20b deployed. Inference returns HTTP 200, `content` is non-null but looks like a sequence of unrelated tokens (e.g., `" the I the and a"`). + +**Cause:** MXFP4 weights are being dequantized with the wrong nibble packing order. 
gpt-oss stores its MXFP4 weights with the low nibble first; the patched image's dequant defaults to this via the `MXFP4_NIBBLE_ORDER` environment variable. + +**Fix:** Verify the env var is set on the pod: + +```bash +kubectl get pod -l app=sglang -o jsonpath='{range .items[0].spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}' | grep MXFP4_NIBBLE_ORDER +# expected: MXFP4_NIBBLE_ORDER=low_first +``` + +The chart's `values.yaml` includes this default. If it is missing, redeploy without overriding `extraEnv` to an empty list. + +--- + +### 6. Long-form responses degrade into broken tokens after ~150 tokens + +**Context:** gpt-oss-20b deployed via the SGLang chart. Short-form responses are coherent. Responses past ~150 generated tokens collapse into broken tokens, repeated characters, mixed emoji, foreign-script characters, or special-token leaks like `<|channel|>` appearing in the visible output. + +**Cause:** Known limitation of the current pure-Python CPU MoE path used by the chart. A precision A/B (fp32 per-expert MoE intermediates, fp32 KV cache, `--enable-fp32-lm-head`) ruled out numerical precision as the dominant cause. Surviving hypotheses point at sliding-window-attention bookkeeping inside the patched `torch_native_backend` or Harmony channel-switch tokenization interacting with the sinks-attention wrapper. + +**Fix:** No fix at the chart level yet. Workarounds: +- Keep `max_tokens` at or below 200 for production calls. +- Phrase prompts to produce short, focused answers (e.g., `"In one sentence, ..."`). +- The internal `reasoning_content` is unaffected and can still be inspected. + +This is documented under "Known Limitations" in `core/helm-charts/sglang/README.md`. Long-form coherence requires further work on the attention or channel-switch path. + +--- + +### 7. 
401 Unauthorized from APISIX with a valid-looking token + +**Context:** Token was successfully obtained from Keycloak, but the auth-routed inference call returns `401 Unauthorized` from APISIX (response body mentions "openresty"). + +**Cause:** APISIX's OIDC plugin validates the token's `iss` (issuer) claim against the configured discovery URL. If the token was fetched via `kubectl port-forward localhost:18080`, Keycloak stamped the issuer as `http://127.0.0.1:18080/...`, but APISIX checks against `http://keycloak.default.svc.cluster.local/...` and rejects the mismatch. + +**Fix:** Fetch the token from inside the cluster so the issuer matches: + +```bash +TOKEN=$(kubectl run keycloak-tok --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d "client_id='"$KEYCLOAK_CLIENT_ID"'" \ + -d "client_secret='"$KEYCLOAK_CLIENT_SECRET"'" \ + -d "grant_type=client_credentials"' \ + | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") +``` + +For production deployments, configure Keycloak with `KC_HOSTNAME=` so it always issues tokens with a stable, externally-resolvable issuer. From 9870bbf11d108c9f5091eb01ee6b996c0c1096a6 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Tue, 26 May 2026 20:53:33 +0000 Subject: [PATCH 07/20] cld2labs/sglang-gpt-oss: drop Qwen3-8B default, require modelSource/modelName explicitly The Qwen3-8B chart default was a leftover from when this branch had concluded gpt-oss-CPU was impossible. After fix1-fix8 made gpt-oss work the default was never revisited. 
Make the chart fully opinion-free on model selection: - values.yaml: modelSource/modelName both default to "" with a comment pointing at the canonical values file pattern - templates/deployment.yaml: fail loudly at render time if either is unset, with an error message pointing to gpt-oss-20b-values.yaml as a working example Strip the section-header emojis from the README at the same time. Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/README.md | 79 ++++++++++--------- .../sglang/templates/deployment.yaml | 7 ++ core/helm-charts/sglang/values.yaml | 13 +-- 3 files changed, 55 insertions(+), 44 deletions(-) diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md index a8b9875e..e56fc47f 100644 --- a/core/helm-charts/sglang/README.md +++ b/core/helm-charts/sglang/README.md @@ -1,18 +1,19 @@ # SGLang Helm Chart — Intel Xeon CPU -## 📋 Overview +## Overview Deploys [SGLang](https://github.com/sgl-project/sglang) on a Kubernetes cluster as a model-agnostic inference server on Intel Xeon CPU nodes, including the OPEA-standard nginx-ingress → APISIX → Keycloak (OIDC) auth chain. -The chart ships with `Qwen/Qwen3-8B` as a sensible default model and -supports any Hugging Face model SGLang can load on CPU. Model-specific -recipes (helm command, values overrides, model card) live under -`third_party/Dell/model-deployment//`. Notable example: -**gpt-oss-20b**, which required a patched SGLang image to work on CPU -(see [Noteworthy: gpt-oss-20b](#-noteworthy-gpt-oss-20b) below). +The chart has **no built-in default model** — `modelSource` and +`modelName` must be supplied at install time, either via `--set` or a +values file. Model-specific recipes (helm command, values overrides, +model card) live under `third_party/Dell/model-deployment//`. +Notable example: **gpt-oss-20b**, which required a patched SGLang image +to work on CPU (see [Noteworthy: gpt-oss-20b](#noteworthy-gpt-oss-20b) +below). 
The chart targets a **patched** SGLang image (`enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`). The most important patch (fix1) rebuilds `sgl-kernel` with the correct @@ -24,16 +25,16 @@ gpt-oss-specific and are runtime no-ops for other models. The image is built once via a self-contained Dockerfile and imported directly into k3s containerd — no registry required. -## ✨ Features +## Features -- **Model-agnostic SGLang on Xeon CPU** — defaults to Qwen3-8B; any HF model SGLang supports works +- **Model-agnostic SGLang on Xeon CPU** — any HF model SGLang supports loads through the same chart - **Patched image** that unblocks bf16 inference on Xeon (every model benefits) and adds MXFP4 + sinks-attention support for gpt-oss - **OPEA-standard auth chain**: TLS at nginx, OIDC bearer validation at APISIX, token issuance by Keycloak - **No external registry**: image builds locally and imports into k3s containerd - **OpenAI-compatible API**: `/v1/chat/completions`, `/v1/models`, `/v1/completions` - **Chart-only delivery**: same standalone pattern as `core/helm-charts/ovms`, not yet wired into the Ansible playbooks -## 📦 Prerequisites +## Prerequisites - **Operating System**: Ubuntu 22.04+ - **Hardware**: Intel Xeon with AVX-512-BF16 / AMX-BF16 (Sapphire Rapids, Emerald Rapids, Granite Rapids) @@ -47,10 +48,10 @@ k3s containerd — no registry required. > **Note:** On a stock OPEA cluster, k3s, nginx-ingress, APISIX, and Keycloak > are already in place via the project's Ansible playbooks — skip straight to -> 🛠️ **Build the Image**. The "From-Scratch Bootstrap" appendix at the bottom is only +> **Build the Image**. The "From-Scratch Bootstrap" appendix at the bottom is only > for people standing up a fresh single-node box from zero. 
-## 🛠️ Build the Image +## Build the Image ```bash git clone https://github.com/cld2labs/Enterprise-Inference.git @@ -72,36 +73,37 @@ sudo k3s ctr images ls | grep enterprise-inference/sglang # docker.io/enterprise-inference/sglang:v0.5.12-xeon-fix11-debug ``` -## 🚀 Deploy a Model +## Deploy a Model -### Default model (Qwen3-8B) +`modelSource` and `modelName` are required at install time. The chart +template fails fast if either is empty. -```bash -helm install qwen3-8b ./core/helm-charts/sglang -``` - -The chart defaults to `Qwen/Qwen3-8B` with bf16 weights through the -patched image's fixed `sgl-kernel`. Any HF model SGLang supports on -CPU can be deployed by overriding `modelSource` and `modelName`. - -### Custom model +### Generic install ```bash helm install ./core/helm-charts/sglang \ --set modelSource="" \ --set modelName="" \ - --set huggingface.token="$HF_TOKEN" # only if gated + --set huggingface.token="$HF_TOKEN" # only if the model is gated ``` ### Model-specific recipes -Models that need extra configuration ship with their own values file and -deployment guide: +Models that need additional configuration ship with their own values file +and deployment guide: | Model | Values file | Deployment guide | | ----- | ----------- | ---------------- | | `openai/gpt-oss-20b` | `gpt-oss-20b-values.yaml` | `third_party/Dell/model-deployment/gpt-oss-20b/deployment.md` | +Use the values file as the source of truth and override anything +environment-specific via `--set`: + +```bash +helm install gpt-oss-20b ./core/helm-charts/sglang \ + --values ./core/helm-charts/sglang/gpt-oss-20b-values.yaml +``` + Wait for the pod (first start downloads the weights — duration depends on model size and network): @@ -111,7 +113,7 @@ kubectl logs -l app=sglang --tail=5 # expect: INFO: Uvicorn running on http://0.0.0.0:30000 ``` -## 🎯 Inference +## Inference ### Smoke test (no auth, via port-forward) @@ -164,7 +166,7 @@ curl -sSk https://localhost:30443/-sglang/v1/chat/completions \ | 
`/v1/completions` | OpenAI-compatible text completions | | `/health` | Liveness probe | -## ⚙️ Configuration +## Configuration ### Key values @@ -173,8 +175,8 @@ curl -sSk https://localhost:30443/-sglang/v1/chat/completions \ | `image.repository` | `enterprise-inference/sglang` | Patched image (set to `lmsysorg/sglang` to use upstream, but bf16 inference will crash) | | `image.tag` | `v0.5.12-xeon-fix11-debug` | Pinned to the validated build | | `image.pullPolicy` | `IfNotPresent` | Set to `Never` if the image is only in local containerd | -| `modelSource` | `Qwen/Qwen3-8B` | HuggingFace repo to load | -| `modelName` | `qwen3-8b` | Served name (also used in route URI) | +| `modelSource` | _(required)_ | HuggingFace repo to load (chart fails to render if empty) | +| `modelName` | _(required)_ | Served name (also used in route URI; chart fails if empty) | | `server.dtype` | `bfloat16` | Compute dtype | | `server.extraArgs` | `[]` | Extra CLI flags to `sglang serve` | | `server.maxTotalTokens` | `32768` | Caps KV-cache memory (SGLang reads host RAM, not cgroup limits) | @@ -198,7 +200,7 @@ The complete configuration surface is documented inline in `values.yaml`. These were used during a precision investigation A/B; see commit history on `cld2labs/sglang-gpt-oss` for context. -## 🩹 What's Patched +## What's Patched The image-build directory contains a series of small Python patches applied to SGLang's installed source at image build time: @@ -225,7 +227,7 @@ for models that don't trigger them (e.g. a Qwen3 deployment never enters the MXFP4 dequant path or the sinks-attention wrapper), so leaving them baked into the image carries no cost for other models. -## ⭐ Noteworthy: gpt-oss-20b +## Noteworthy: gpt-oss-20b `openai/gpt-oss-20b` is the most complex model this chart serves and the driver for most of the patch stack above. 
Specifically: @@ -259,7 +261,7 @@ parameter reference — is in `--tp-size=2` to split across NUMA nodes should give multi-x speedup but the patch stack has not been validated under TP. -## 🔧 Troubleshooting +## Troubleshooting See [`third_party/Dell/model-deployment/sglang-troubleshooting.md`](../../third_party/Dell/model-deployment/sglang-troubleshooting.md) for a symptom-indexed guide covering: @@ -286,7 +288,7 @@ helm uninstall kubectl delete pvc -l app.kubernetes.io/instance= # frees the model cache ``` -## 📁 Project Structure +## Project Structure ``` core/helm-charts/sglang/ @@ -307,21 +309,20 @@ third_party/Dell/model-deployment/ └── deployment.md # gpt-oss-20b deployment guide ``` -## 📚 References +## References - [SGLang documentation](https://docs.sglang.io) - [SGLang CPU server guide](https://docs.sglang.io/docs/hardware-platforms/cpu_server) - [OpenAI gpt-oss model card](https://huggingface.co/openai/gpt-oss-20b) -- [Qwen3-8B model card](https://huggingface.co/Qwen/Qwen3-8B) --- -## 📎 Appendix: From-Scratch Bootstrap +## Appendix: From-Scratch Bootstrap Use this only if you're standing up a fresh single-node box without OPEA's Ansible-driven cluster setup. On a stock OPEA cluster, k3s, nginx-ingress, APISIX, and Keycloak are already in place and you can skip directly to -🛠️ **Build the Image**. +**Build the Image**. ### A.1 k3s + Helm @@ -433,4 +434,4 @@ kubectl create secret tls api-example-com-tls \ --cert=/tmp/tls.crt --key=/tmp/tls.key -n default ``` -Now proceed to 🛠️ **Build the Image** and 🚀 **Deploy a Model** above. +Now proceed to **Build the Image** and **Deploy a Model** above. 
diff --git a/core/helm-charts/sglang/templates/deployment.yaml b/core/helm-charts/sglang/templates/deployment.yaml index 1279a8eb..279ce080 100644 --- a/core/helm-charts/sglang/templates/deployment.yaml +++ b/core/helm-charts/sglang/templates/deployment.yaml @@ -1,6 +1,13 @@ # Copyright (C) 2025-2026 Intel Corporation # SPDX-License-Identifier: Apache-2.0 +{{- if not .Values.modelSource }} +{{- fail "modelSource is required. Set --set modelSource= (e.g. openai/gpt-oss-20b) or use a model-specific values file like gpt-oss-20b-values.yaml." }} +{{- end }} +{{- if not .Values.modelName }} +{{- fail "modelName is required. Set --set modelName= (e.g. gpt-oss-20b)." }} +{{- end }} + apiVersion: apps/v1 kind: Deployment metadata: diff --git a/core/helm-charts/sglang/values.yaml b/core/helm-charts/sglang/values.yaml index 544d1ac1..bee12fef 100644 --- a/core/helm-charts/sglang/values.yaml +++ b/core/helm-charts/sglang/values.yaml @@ -55,11 +55,14 @@ securityContext: readOnlyRootFilesystem: false # ---- Model ---- -# Default is Qwen3-8B because it is bf16 (no quantization gate), modest in -# size, and listed as a supported CPU model in sglang docs. Override -# modelSource/modelName for other models. -modelSource: "Qwen/Qwen3-8B" -modelName: "qwen3-8b" +# Required. Set both at install time: +# --set modelSource= (HF repo, passed to --model-path) +# --set modelName= (served name; used in route URI) +# The chart has no default model: `helm install` fails loudly if either +# is empty. For a model-specific recipe (e.g. gpt-oss-20b), use the +# bundled values file (see README). +modelSource: "" +modelName: "" # HuggingFace Hub token. Required for gated repos (e.g. meta-llama/*). 
# Either: From 8e4a32c44ceb0866d8cf4211a11436af302f5be4 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Tue, 26 May 2026 20:58:48 +0000 Subject: [PATCH 08/20] cld2labs/sglang-gpt-oss: move gpt-oss-20b values.yaml into the model-deployment dir The chart at core/helm-charts/sglang/ is model-agnostic; a gpt-oss-20b-specific values file living inside it was inconsistent with that framing and with the llama precedent on cld2labs/llama-3.1-8b-instruct (generic chart + per-model recipes living elsewhere). Move via git mv so the rename is preserved in history: core/helm-charts/sglang/gpt-oss-20b-values.yaml -> third_party/Dell/model-deployment/gpt-oss-20b/values.yaml The model-deployment/gpt-oss-20b/ directory now holds the complete per-model recipe in one place: - model-card.md (metadata, license, intended use, limitations) - deployment.md (step-by-step deploy guide) - values.yaml (canonical chart overrides) Update all references: README install example, project-structure tree, the deployment.md helm command + parameter table, and the deployment.yaml fail message. 
Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/README.md | 8 +-- .../sglang/gpt-oss-20b-values.yaml | 49 ------------------- .../sglang/templates/deployment.yaml | 2 +- .../gpt-oss-20b/deployment.md | 6 +-- .../model-deployment/gpt-oss-20b/values.yaml | 49 +++++++++++++++++++ 5 files changed, 57 insertions(+), 57 deletions(-) delete mode 100644 core/helm-charts/sglang/gpt-oss-20b-values.yaml create mode 100644 third_party/Dell/model-deployment/gpt-oss-20b/values.yaml diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md index e56fc47f..16a9a7fb 100644 --- a/core/helm-charts/sglang/README.md +++ b/core/helm-charts/sglang/README.md @@ -94,14 +94,14 @@ and deployment guide: | Model | Values file | Deployment guide | | ----- | ----------- | ---------------- | -| `openai/gpt-oss-20b` | `gpt-oss-20b-values.yaml` | `third_party/Dell/model-deployment/gpt-oss-20b/deployment.md` | +| `openai/gpt-oss-20b` | `third_party/Dell/model-deployment/gpt-oss-20b/values.yaml` | `third_party/Dell/model-deployment/gpt-oss-20b/deployment.md` | Use the values file as the source of truth and override anything environment-specific via `--set`: ```bash helm install gpt-oss-20b ./core/helm-charts/sglang \ - --values ./core/helm-charts/sglang/gpt-oss-20b-values.yaml + --values ./third_party/Dell/model-deployment/gpt-oss-20b/values.yaml ``` Wait for the pod (first start downloads the weights — duration depends @@ -295,7 +295,6 @@ core/helm-charts/sglang/ ├── README.md # this file ├── Chart.yaml ├── values.yaml # full configuration surface -├── gpt-oss-20b-values.yaml # canonical override for gpt-oss-20b ├── templates/ # Helm templates (Deployment, Service, PVC, Ingress, ApisixRoute, Secret) └── image-build/ ├── Dockerfile # FROM lmsysorg/sglang:v0.5.12-xeon + 11 patch steps @@ -306,7 +305,8 @@ third_party/Dell/model-deployment/ ├── sglang-troubleshooting.md # symptom-indexed troubleshooting for the SGLang chart └── gpt-oss-20b/ ├── model-card.md # 
gpt-oss-20b model card - └── deployment.md # gpt-oss-20b deployment guide + ├── deployment.md # gpt-oss-20b deployment guide + └── values.yaml # canonical chart overrides for gpt-oss-20b ``` ## References diff --git a/core/helm-charts/sglang/gpt-oss-20b-values.yaml b/core/helm-charts/sglang/gpt-oss-20b-values.yaml deleted file mode 100644 index 6871b281..00000000 --- a/core/helm-charts/sglang/gpt-oss-20b-values.yaml +++ /dev/null @@ -1,49 +0,0 @@ -# Override values for gpt-oss-20b on Xeon CPU through the patched image. -# -# Usage: -# helm upgrade gpt-oss-20b core/helm-charts/sglang \ -# -f core/helm-charts/sglang/gpt-oss-20b-values.yaml -# -# This is the production-shape config. Long-form (>~150-token) coherence -# is a known limitation of the pure-Python CPU MoE path on this image — -# see REMAINING_WORK.md "Long-form quality boundary" for the full Phase 2 -# investigation and the precision-flag A/B that ruled out KV / lm_head / -# MoE-intermediate precision as the dominant cause. Short-form generation -# through the full auth-routed chain is solid. - -modelSource: openai/gpt-oss-20b -modelName: gpt-oss-20b - -image: - repository: enterprise-inference/sglang - tag: v0.5.12-xeon-fix11-debug - pullPolicy: Never - -server: - dtype: bfloat16 - extraArgs: - - --attention-backend - - torch_native - - --reasoning-parser - - gpt-oss - - --tool-call-parser - - gpt-oss - -resources: - requests: - memory: 48Gi - limits: - memory: 128Gi - -storage: - persistentVolume: - size: 40Gi - -ingress: - enabled: true - -apisixRoute: - enabled: true - -oidc: - enabled: true diff --git a/core/helm-charts/sglang/templates/deployment.yaml b/core/helm-charts/sglang/templates/deployment.yaml index 279ce080..1f3422cb 100644 --- a/core/helm-charts/sglang/templates/deployment.yaml +++ b/core/helm-charts/sglang/templates/deployment.yaml @@ -2,7 +2,7 @@ # SPDX-License-Identifier: Apache-2.0 {{- if not .Values.modelSource }} -{{- fail "modelSource is required. Set --set modelSource= (e.g. 
openai/gpt-oss-20b) or use a model-specific values file like gpt-oss-20b-values.yaml." }} +{{- fail "modelSource is required. Set --set modelSource= (e.g. openai/gpt-oss-20b) or use a model-specific values file (see third_party/Dell/model-deployment//values.yaml)." }} {{- end }} {{- if not .Values.modelName }} {{- fail "modelName is required. Set --set modelName= (e.g. gpt-oss-20b)." }} diff --git a/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md index 237e4332..1563b76a 100644 --- a/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md +++ b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md @@ -43,11 +43,11 @@ For a detailed breakdown of what each patch does, see `core/helm-charts/sglang/R ## Step 3: Deploy gpt-oss-20b Model -The chart ships with a canonical values file for this model at `core/helm-charts/sglang/gpt-oss-20b-values.yaml`. +The canonical values file for this model lives alongside this deployment guide at `third_party/Dell/model-deployment/gpt-oss-20b/values.yaml`. ```bash helm install sglang-gpt-oss-20b ./core/helm-charts/sglang \ - --values ./core/helm-charts/sglang/gpt-oss-20b-values.yaml \ + --values ./third_party/Dell/model-deployment/gpt-oss-20b/values.yaml \ --set modelSource="openai/gpt-oss-20b" \ --set huggingface.token="$HUGGING_FACE_HUB_TOKEN" \ --set ingress.enabled=true \ @@ -120,7 +120,7 @@ kubectl delete pvc -l app.kubernetes.io/instance=sglang-gpt-oss-20b # frees th | Parameter | Description | | ------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- | -| `--values ./core/helm-charts/sglang/gpt-oss-20b-values.yaml` | Canonical values file for this model. Pins the patched image, sets bf16, wires the Harmony reasoning and tool-call parsers, sizes resources. 
| +| `--values ./third_party/Dell/model-deployment/gpt-oss-20b/values.yaml` | Canonical values file for this model. Pins the patched image, sets bf16, wires the Harmony reasoning and tool-call parsers, sizes resources. | | `--set modelSource="openai/gpt-oss-20b"` | Defines the target model from **Hugging Face** to deploy. | | `--set huggingface.token="..."` | Authenticates access to gated or private Hugging Face models. The gpt-oss repo is public, so this is optional but harmless. | | `--set ingress.enabled=true` | Enables Kubernetes **Ingress** to expose the model service externally. | diff --git a/third_party/Dell/model-deployment/gpt-oss-20b/values.yaml b/third_party/Dell/model-deployment/gpt-oss-20b/values.yaml new file mode 100644 index 00000000..0f8bc155 --- /dev/null +++ b/third_party/Dell/model-deployment/gpt-oss-20b/values.yaml @@ -0,0 +1,49 @@ +# Override values for gpt-oss-20b on Intel Xeon CPU via the patched +# SGLang image. Use with the model-agnostic chart at +# core/helm-charts/sglang. +# +# Usage: +# helm install gpt-oss-20b ./core/helm-charts/sglang \ +# --values ./third_party/Dell/model-deployment/gpt-oss-20b/values.yaml +# +# Short-form generation (max_tokens <= 200) through the full auth-routed +# chain is solid. Long-form generation past ~150 tokens is a documented +# limitation of the pure-Python CPU MoE path — see +# third_party/Dell/model-deployment/sglang-troubleshooting.md. 
+ +modelSource: openai/gpt-oss-20b +modelName: gpt-oss-20b + +image: + repository: enterprise-inference/sglang + tag: v0.5.12-xeon-fix11-debug + pullPolicy: Never + +server: + dtype: bfloat16 + extraArgs: + - --attention-backend + - torch_native + - --reasoning-parser + - gpt-oss + - --tool-call-parser + - gpt-oss + +resources: + requests: + memory: 48Gi + limits: + memory: 128Gi + +storage: + persistentVolume: + size: 40Gi + +ingress: + enabled: true + +apisixRoute: + enabled: true + +oidc: + enabled: true From 6a284c090c23c3e10e46e488ac61c2c47df2529d Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Tue, 26 May 2026 21:47:02 +0000 Subject: [PATCH 09/20] cld2labs/sglang-gpt-oss: close 2 doc gaps in gpt-oss-20b deployment.md Found during end-to-end revalidation by following the deployment guide verbatim against a fresh redeploy. Step 1 (Prerequisites): generate-token.sh hits https://${BASE_URL}/token and assumes that hostname resolves on port 443 with a real TLS cert. That works on production OPEA clusters on Dell hardware but silently returns an empty TOKEN on a single-node k3s lab where api.example.com isn't in DNS and nginx is on a NodePort. Add a callout pointing lab/single-node users at the cluster-internal token-fetch recipe in sglang-troubleshooting.md issue #7. Step 4 (Verify the Deployment): the expected-output block was copied from the llama deployment.md and showed keycloak-0 / keycloak-postgresql-0 (StatefulSet + Postgres backend). On lab installs Keycloak is often a single Deployment pod with H2 embedded, and APISIX/nginx pod names also depend on how those components were rolled out. Generalize the block so the sglang pod is the only thing called out, with a note that other component pod names depend on the cluster's deployment shape. 
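Because the empty `TOKEN` is silent, a small guard run after sourcing the script makes the failure visible. This is a minimal sketch, assuming only that `generate-token.sh` exports `TOKEN` as the guide states; the function name `require_token` is hypothetical and not part of the repo.

```bash
# Hypothetical guard (not part of the repo): fail loudly when TOKEN is
# unset or empty instead of letting later requests fail with 401.
require_token() {
  if [ -z "${TOKEN:-}" ]; then
    echo "TOKEN is empty: token fetch failed" >&2
    return 1
  fi
  # Print only the length so the token itself never lands in logs.
  echo "TOKEN length=${#TOKEN}"
}
```

A possible use is `source core/scripts/generate-token.sh && require_token`.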
Signed-off-by: arpannookala-12 --- .../Dell/model-deployment/gpt-oss-20b/deployment.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md index 1563b76a..fe9c0124 100644 --- a/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md +++ b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md @@ -22,6 +22,8 @@ source core/scripts/generate-token.sh This exports: `BASE_URL`, `KEYCLOAK_CLIENT_ID`, `KEYCLOAK_CLIENT_SECRET`, and `TOKEN`. +> **Lab / single-node setups:** `generate-token.sh` hits `https://${BASE_URL}/token` and assumes that hostname resolves on port 443 with a real TLS cert — production OPEA clusters on Dell hardware. If you're testing on a single-node k3s box where `api.example.com` is not in DNS and nginx is on a NodePort, fetch the token via the cluster-internal path documented in [`../sglang-troubleshooting.md` issue #7](../sglang-troubleshooting.md#7-401-unauthorized-from-apisix-with-a-valid-looking-token) instead. The rest of the steps are identical. + ## Step 2: Build the Patched SGLang Image gpt-oss-20b ships natively in MXFP4 quantization, and the upstream `lmsysorg/sglang:v0.5.12-xeon` image cannot serve it on CPU (MXFP4 is GPU-gated, sinks attention is unsupported on the CPU backends, and the published `sgl-kernel` shared library is missing the AVX-512-BF16 compile flags needed for any bf16 matmul). @@ -65,13 +67,12 @@ kubectl get pods kubectl get apisixroutes ``` -Expected Output: +Expected output (the sglang pod is what matters here; your existing Keycloak / APISIX / ingress pods will appear in the listing too, with names that depend on how those components were deployed in your cluster): ``` NAME READY STATUS RESTARTS -keycloak-0 1/1 Running 0 -keycloak-postgresql-0 1/1 Running 0 sglang-gpt-oss-20b-- 1/1 Running 0 +... 1/1 Running 0 # keycloak, apisix, ingress-nginx, etc. 
``` > Note: First pod start takes ~4-5 minutes (downloading ~12 GB of weights from Hugging Face, then dequantizing MXFP4 → bf16 in memory). Subsequent restarts are fast because the cache PVC persists the weights. From b66a58a73e46fda75f95f108b40d64b0b632c293 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Wed, 27 May 2026 00:29:58 +0000 Subject: [PATCH 10/20] =?UTF-8?q?cld2labs/sglang-gpt-oss:=20doc=20cleanup?= =?UTF-8?q?=20pass=20(Phase=201)=20=E2=80=94=20inline=20model=20flags,=20r?= =?UTF-8?q?ewrite=20Step=201,=20repair=20appendix?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Match the cld2labs/llama-3.1-8b-instruct precedent and drop the per-model values.yaml: there is no longer a values file sitting next to deployment.md. All gpt-oss-specific runtime flags (Harmony parsers, torch_native CPU attention backend) come through as --set overrides on the helm install command directly. Deletions: - third_party/Dell/model-deployment/gpt-oss-20b/values.yaml - third_party/Dell/model-deployment/README.md (placeholder) deployment.md: - Step 1 rewritten to be context-free with two explicit paths: Path A — production OPEA cluster (generate-token.sh) Path B — single-node lab (cluster-internal token one-liner) Both paths declare the same four exports (BASE_URL, KEYCLOAK_CLIENT_ID, KEYCLOAK_CLIENT_SECRET, TOKEN) so later steps are shell-state-portable across the two paths. - Step 3 helm install is now the canonical recipe — no --values flag, all model-specific knobs as --set, including 'server.extraArgs={--attention-backend,torch_native,--reasoning-parser,gpt-oss,--tool-call-parser,gpt-oss}'. - Step 4 includes the upfront kubectl patch to bump the ApisixRoute 60s default timeout (otherwise inference past ~240 tokens 504s). - Step 5 adds a callout for --resolve when running against a lab NodePort instead of a real DNS hostname. 
core/helm-charts/sglang/README.md: - Drop the model-values column from the model-specific recipes table. - Appendix A.3: replace with the lab-default secret that matches what the chart and deployment.md actually consume, and add a verification curl that round-trips the client_credentials grant before moving on. - Appendix A.4: expand from a vague "also needs GatewayProxy and IngressClass parameters" note into the actual commands. The APISIX v2 ingress controller silently drops every ApisixRoute without these, which was the largest gap in the prior appendix. - Appendix A.5: fix the TLS secret namespace (was 'default', actually needs to be 'auth-apisix' to match where the chart-rendered Ingress lives) and the secret name (was 'api-example-com-tls', actually needs to equal the BASE_URL because the chart passes --set ingress.secretName=${BASE_URL}). - Add Appendix A.6 documenting the /etc/hosts vs --resolve trade-off for lab clusters where api.example.com isn't in real DNS. This pass was driven by an honest audit of the cluster vs the appendix. Phase 2 — actually rebuild from a true blank slate following these fixed docs to validate them end-to-end — comes next. 
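The ~240-token figure for the Step 4 timeout is simple arithmetic: the 60 s default multiplied by the ~4 tokens/s decode rate quoted in deployment.md. The sketch below is a rough estimate only. It assumes a constant decode rate and ignores prefill and queueing time, and the function name `token_budget` is hypothetical.

```bash
# Hypothetical helper: rough upper bound on the tokens that fit inside an
# upstream timeout, assuming a constant integer decode rate.
token_budget() {
  local timeout_s="$1" tokens_per_s="$2"
  echo $(( timeout_s * tokens_per_s ))
}

token_budget 60 4    # default ApisixRoute timeout -> 240
token_budget 600 4   # after the patch sets read to 600s -> 2400
```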
Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/README.md | 133 ++++++++++--- third_party/Dell/model-deployment/README.md | 1 - .../gpt-oss-20b/deployment.md | 179 ++++++++++++++---- .../model-deployment/gpt-oss-20b/values.yaml | 49 ----- 4 files changed, 247 insertions(+), 115 deletions(-) delete mode 100644 third_party/Dell/model-deployment/README.md delete mode 100644 third_party/Dell/model-deployment/gpt-oss-20b/values.yaml diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md index 16a9a7fb..d35c2331 100644 --- a/core/helm-charts/sglang/README.md +++ b/core/helm-charts/sglang/README.md @@ -92,17 +92,14 @@ helm install ./core/helm-charts/sglang \ Models that need additional configuration ship with their own values file and deployment guide: -| Model | Values file | Deployment guide | -| ----- | ----------- | ---------------- | -| `openai/gpt-oss-20b` | `third_party/Dell/model-deployment/gpt-oss-20b/values.yaml` | `third_party/Dell/model-deployment/gpt-oss-20b/deployment.md` | +| Model | Deployment guide | +| ----- | ---------------- | +| `openai/gpt-oss-20b` | `third_party/Dell/model-deployment/gpt-oss-20b/deployment.md` | -Use the values file as the source of truth and override anything -environment-specific via `--set`: - -```bash -helm install gpt-oss-20b ./core/helm-charts/sglang \ - --values ./third_party/Dell/model-deployment/gpt-oss-20b/values.yaml -``` +The deployment guide carries the full `helm install` command line for +that model — all model-specific flags (parsers, attention backend, +extraArgs) come through as `--set` overrides. The chart's own +`values.yaml` stays model-agnostic. 
Wait for the pod (first start downloads the weights — duration depends on model size and network): @@ -305,8 +302,7 @@ third_party/Dell/model-deployment/ ├── sglang-troubleshooting.md # symptom-indexed troubleshooting for the SGLang chart └── gpt-oss-20b/ ├── model-card.md # gpt-oss-20b model card - ├── deployment.md # gpt-oss-20b deployment guide - └── values.yaml # canonical chart overrides for gpt-oss-20b + └── deployment.md # gpt-oss-20b deployment guide (carries the full helm command) ``` ## References @@ -386,7 +382,11 @@ EOF kubectl wait --for=condition=ready pod -l app=keycloak --timeout=300s ``` -Create the OIDC client (`my-client-id` with the secret the chart expects): +Create the OIDC client. The `clientId` and `secret` here must exactly +match what you'll later pass to the chart via `--set oidc.clientId=...` +and `--set oidc.clientSecret=...`. The values below are the lab defaults +used by the gpt-oss-20b deployment guide — substitute your own for any +non-test deployment. ```bash ADMIN=$(kubectl run kc-admin --rm -i --restart=Never --quiet \ @@ -395,16 +395,36 @@ ADMIN=$(kubectl run kc-admin --rm -i --restart=Never --quiet \ -d "client_id=admin-cli" -d "username=admin" -d "password=admin" -d "grant_type=password"' \ | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") +CLIENT_ID=my-client-id +CLIENT_SECRET=tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR + kubectl run kc-create --rm -i --restart=Never --quiet \ --image=curlimages/curl:8.10.1 -- \ sh -c "curl -sS -X POST -H 'Authorization: Bearer $ADMIN' \ -H 'Content-Type: application/json' \ http://keycloak.default.svc.cluster.local/admin/realms/master/clients \ - -d '{\"clientId\":\"my-client-id\",\"secret\":\"\",\"serviceAccountsEnabled\":true,\"publicClient\":false,\"directAccessGrantsEnabled\":true}'" + -d '{\"clientId\":\"${CLIENT_ID}\",\"secret\":\"${CLIENT_SECRET}\",\"serviceAccountsEnabled\":true,\"publicClient\":false,\"directAccessGrantsEnabled\":true}'" +``` + +Verify the client was 
created: + +```bash +kubectl run kc-check --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c "curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d 'client_id=${CLIENT_ID}' -d 'client_secret=${CLIENT_SECRET}' -d 'grant_type=client_credentials'" \ + | head -c 80 +# expect: JSON with "access_token":"..." ``` ### A.4 APISIX +The Apache APISIX chart installs the dataplane + etcd + ingress +controller. On v2 of the ingress controller (current as of this writing) +you additionally need a `GatewayProxy` CR and an `IngressClass` whose +`parameters` reference it — without those, the controller silently drops +every `ApisixRoute` and the chart's route ends up unreachable. + ```bash helm repo add apisix https://charts.apiseven.com helm install auth-apisix apisix/apisix \ @@ -414,24 +434,91 @@ helm install auth-apisix apisix/apisix \ --set ingress-controller.config.apisix.serviceNamespace=auth-apisix kubectl wait --for=condition=ready pod -n auth-apisix --all --timeout=300s -``` -APISIX v2 ingress controller also requires a `GatewayProxy` CRD and an -updated `IngressClass parameters` link before it will accept routes; -see the in-cluster `kubectl describe apisixroute` output for guidance -if the controller returns "Route Not Found" for an otherwise valid -ApisixRoute. +# Grab the admin key the chart generated for the dataplane +ADMIN_KEY=$(helm get values auth-apisix -n auth-apisix --all \ + | python3 -c "import sys,yaml; print(yaml.safe_load(sys.stdin)['apisix']['admin']['credentials']['admin'])") +echo "APISIX admin key: $ADMIN_KEY" + +# Create the GatewayProxy that the ingress controller will use as its +# dataplane handle. 
+kubectl apply -f - < # Keycloak admin user +export KEYCLOAK_PASSWORD= # Keycloak admin password +export KEYCLOAK_CLIENT_ID=my-client-id # OIDC client created at EI install +``` + +Then source it (also export `HUGGING_FACE_HUB_TOKEN` if any model in +your deployment requires gated HF access): ```bash export HUGGING_FACE_HUB_TOKEN="your_token_here" @@ -20,15 +42,56 @@ cd ~/Enterprise-Inference source core/scripts/generate-token.sh ``` -This exports: `BASE_URL`, `KEYCLOAK_CLIENT_ID`, `KEYCLOAK_CLIENT_SECRET`, and `TOKEN`. +The script logs in to Keycloak as the admin user, fetches the client +secret, and hits `https://${BASE_URL}/token` to exchange it for a +short-lived access token. Verify: -> **Lab / single-node setups:** `generate-token.sh` hits `https://${BASE_URL}/token` and assumes that hostname resolves on port 443 with a real TLS cert — production OPEA clusters on Dell hardware. If you're testing on a single-node k3s box where `api.example.com` is not in DNS and nginx is on a NodePort, fetch the token via the cluster-internal path documented in [`../sglang-troubleshooting.md` issue #7](../sglang-troubleshooting.md#7-401-unauthorized-from-apisix-with-a-valid-looking-token) instead. The rest of the steps are identical. +```bash +echo "BASE_URL=$BASE_URL" +echo "TOKEN length=${#TOKEN} (should be 1000+; empty means the script failed silently)" +``` + +### Path B — Single-node lab cluster + +`generate-token.sh` assumes `https://${BASE_URL}` resolves on port 443 +with a real TLS cert. On a single-node lab where `api.example.com` is +only in `/etc/hosts` and nginx is on a NodePort, the script silently +returns an empty `TOKEN`. 
Use this one-liner instead, which fetches the +token from inside the cluster (so the token's issuer claim matches what +APISIX validates): + +```bash +export BASE_URL=api.example.com +export KEYCLOAK_CLIENT_ID=my-client-id +export KEYCLOAK_CLIENT_SECRET=tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR +export HUGGING_FACE_HUB_TOKEN="" # gpt-oss-20b is public; leave empty + +export TOKEN=$(kubectl run keycloak-tok --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c "curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d 'client_id=${KEYCLOAK_CLIENT_ID}' \ + -d 'client_secret=${KEYCLOAK_CLIENT_SECRET}' \ + -d 'grant_type=client_credentials'" \ + | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") + +echo "TOKEN length=${#TOKEN}" +``` + +> If `TOKEN length` is `0`, Keycloak rejected the request. The most +> common cause is that the OIDC client doesn't exist in the master +> realm yet — see Appendix A.3 in `core/helm-charts/sglang/README.md`. ## Step 2: Build the Patched SGLang Image -gpt-oss-20b ships natively in MXFP4 quantization, and the upstream `lmsysorg/sglang:v0.5.12-xeon` image cannot serve it on CPU (MXFP4 is GPU-gated, sinks attention is unsupported on the CPU backends, and the published `sgl-kernel` shared library is missing the AVX-512-BF16 compile flags needed for any bf16 matmul). +gpt-oss-20b ships natively in MXFP4 quantization, and the upstream +`lmsysorg/sglang:v0.5.12-xeon` image cannot serve it on CPU (MXFP4 is +GPU-gated, sinks attention is unsupported on the CPU backends, and the +published `sgl-kernel` shared library is missing the AVX-512-BF16 +compile flags needed for any bf16 matmul). -The SGLang chart ships a one-shot build script that produces a patched image and imports it directly into k3s containerd. No external registry is required. 
+The SGLang chart ships a one-shot build script that produces a patched +image and imports it directly into k3s containerd. No external registry +is required. ```bash sudo bash core/helm-charts/sglang/image-build/build-and-import.sh @@ -41,25 +104,32 @@ sudo k3s ctr images ls | grep enterprise-inference/sglang # docker.io/enterprise-inference/sglang:v0.5.12-xeon-fix11-debug ``` -For a detailed breakdown of what each patch does, see `core/helm-charts/sglang/README.md` (section: What's Patched). +For a detailed breakdown of what each patch does, see +`core/helm-charts/sglang/README.md` (section: What's Patched). -## Step 3: Deploy gpt-oss-20b Model - -The canonical values file for this model lives alongside this deployment guide at `third_party/Dell/model-deployment/gpt-oss-20b/values.yaml`. +## Step 3: Deploy gpt-oss-20b ```bash helm install sglang-gpt-oss-20b ./core/helm-charts/sglang \ - --values ./third_party/Dell/model-deployment/gpt-oss-20b/values.yaml \ --set modelSource="openai/gpt-oss-20b" \ + --set modelName="gpt-oss-20b" \ --set huggingface.token="$HUGGING_FACE_HUB_TOKEN" \ --set ingress.enabled=true \ --set ingress.secretName="${BASE_URL}" \ --set ingress.host="${BASE_URL}" \ --set oidc.clientId="$KEYCLOAK_CLIENT_ID" \ --set oidc.clientSecret="$KEYCLOAK_CLIENT_SECRET" \ - --set apisixRoute.enabled=true + --set apisixRoute.enabled=true \ + --set 'server.extraArgs={--attention-backend,torch_native,--reasoning-parser,gpt-oss,--tool-call-parser,gpt-oss}' ``` +The chart's `values.yaml` already targets the patched image, sets bf16, +sizes resources for a Xeon node, and enables the +`MXFP4_NIBBLE_ORDER=low_first` env var required for correct MXFP4 +weight decode. The `--set` above adds the gpt-oss-specific runtime +flags (Harmony reasoning/tool-call parsers, CPU attention backend) and +the per-cluster ingress/OIDC overrides. 
+ ## Step 4: Verify the Deployment ```bash @@ -67,7 +137,10 @@ kubectl get pods kubectl get apisixroutes ``` -Expected output (the sglang pod is what matters here; your existing Keycloak / APISIX / ingress pods will appear in the listing too, with names that depend on how those components were deployed in your cluster): +Expected output (the sglang pod is what matters here; your existing +Keycloak / APISIX / ingress pods will appear in the listing too, with +names that depend on how those components were deployed in your +cluster): ``` NAME READY STATUS RESTARTS @@ -75,13 +148,25 @@ sglang-gpt-oss-20b-- 1/1 Running 0 ... 1/1 Running 0 # keycloak, apisix, ingress-nginx, etc. ``` -> Note: First pod start takes ~4-5 minutes (downloading ~12 GB of weights from Hugging Face, then dequantizing MXFP4 → bf16 in memory). Subsequent restarts are fast because the cache PVC persists the weights. +> Note: First pod start takes ~4-5 minutes (downloading ~12 GB of +> weights from Hugging Face, then dequantizing MXFP4 → bf16 in memory). +> Subsequent restarts are fast because the cache PVC persists the +> weights. ``` NAME HOSTS sglang-gpt-oss-20b-apisixroute api.example.com ``` +The ApisixRoute has a default 60 s upstream timeout. At ~4 tokens/s on +CPU, responses past ~240 tokens exceed it and return a 504. Raise the +timeout before sending real requests: + +```bash +kubectl patch apisixroute sglang-gpt-oss-20b-apisixroute --type='json' \ + -p='[{"op":"add","path":"/spec/http/0/timeout","value":{"connect":"5s","read":"600s","send":"600s"}}]' +``` + ## Step 5: Test the Deployed Model ```bash @@ -97,20 +182,29 @@ curl -k https://${BASE_URL}/gpt-oss-20b-sglang/v1/chat/completions \ }' ``` -If successful, the model returns a chat-completion response with the answer in `choices[0].message.content` and the model's internal reasoning in `choices[0].message.reasoning_content`.
+> Lab clusters where `api.example.com` is only in `/etc/hosts` and nginx +> is on a NodePort: add `--resolve api.example.com:30443:127.0.0.1` and +> use `https://api.example.com:30443/...` instead. + +If successful, the model returns a chat-completion response with the +answer in `choices[0].message.content` and the model's internal +reasoning in `choices[0].message.reasoning_content`. -### A note on `max_tokens` +### A Note on `max_tokens` -gpt-oss uses the Harmony chat format: every response starts in an internal "analysis" channel and only switches to the user-visible "final" channel when reasoning is complete. With small budgets the model spends them all reasoning and the `content` field comes back null: +gpt-oss uses the Harmony chat format: every response starts in an +internal "analysis" channel and only switches to the user-visible +"final" channel when reasoning is complete. With small budgets the +model spends them all reasoning and the `content` field comes back null: -| `max_tokens` | What you'll see | -| ------------ | --------------- | -| ≤ 100 | `content: null`, reasoning truncated | -| 150 | One short sentence — good for quick verification | -| 300 | Paragraph with light formatting | +| `max_tokens` | What you'll see | +| ------------ | -------------------------------------------------- | +| ≤ 100 | `content: null`, reasoning truncated | +| 150 | One short sentence — good for quick verification | +| 300 | Paragraph with light formatting | | > 400 | Hits documented long-form drift (see troubleshooting) | -## To undeploy the model +## To Undeploy the Model ```bash helm uninstall sglang-gpt-oss-20b @@ -119,14 +213,15 @@ kubectl delete pvc -l app.kubernetes.io/instance=sglang-gpt-oss-20b # frees th ## Parameters -| Parameter | Description | -| ------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- | -| `--values 
./third_party/Dell/model-deployment/gpt-oss-20b/values.yaml` | Canonical values file for this model. Pins the patched image, sets bf16, wires the Harmony reasoning and tool-call parsers, sizes resources. | -| `--set modelSource="openai/gpt-oss-20b"` | Defines the target model from **Hugging Face** to deploy. | -| `--set huggingface.token="..."` | Authenticates access to gated or private Hugging Face models. The gpt-oss repo is public, so this is optional but harmless. | -| `--set ingress.enabled=true` | Enables Kubernetes **Ingress** to expose the model service externally. | -| `--set ingress.host="${BASE_URL}"` | Public hostname or FQDN for the inference endpoint (maps to your Ingress controller IP). | -| `--set ingress.secretName="${BASE_URL}"` | Kubernetes **TLS Secret** used for HTTPS termination at the ingress layer. | -| `--set oidc.clientId="..."` | Keycloak OIDC client ID used for token-based authentication. | -| `--set oidc.clientSecret="..."` | Keycloak OIDC client secret corresponding to the client ID. | -| `--set apisixRoute.enabled=true` | Enables the **APISIX** route for gateway routing and OIDC bearer validation. | +| Parameter | Description | +| --------- | ----------- | +| `--set modelSource="openai/gpt-oss-20b"` | HuggingFace repo to load (passed to `sglang serve --model-path`). | +| `--set modelName="gpt-oss-20b"` | Served name, also used in the ApisixRoute URI prefix `/gpt-oss-20b-sglang/*`. | +| `--set huggingface.token="..."` | HF token for gated models. `openai/gpt-oss-20b` is public, so leave empty. | +| `--set ingress.enabled=true` | Creates a Kubernetes Ingress that terminates TLS at nginx. | +| `--set ingress.host="${BASE_URL}"` | Hostname the ingress matches (same value used in the TLS secret name). | +| `--set ingress.secretName="${BASE_URL}"` | TLS Secret used at the ingress layer — its name equals the hostname by chart convention. | +| `--set oidc.clientId="..."` | Keycloak OIDC client ID; APISIX validates tokens against this client. 
| +| `--set oidc.clientSecret="..."` | Keycloak OIDC client secret. | +| `--set apisixRoute.enabled=true` | Creates the APISIX route with `openid-connect` plugin for bearer validation. | +| `--set 'server.extraArgs={...}'` | gpt-oss-specific runtime flags: `torch_native` CPU attention backend, Harmony `--reasoning-parser` and `--tool-call-parser`. | diff --git a/third_party/Dell/model-deployment/gpt-oss-20b/values.yaml b/third_party/Dell/model-deployment/gpt-oss-20b/values.yaml deleted file mode 100644 index 0f8bc155..00000000 --- a/third_party/Dell/model-deployment/gpt-oss-20b/values.yaml +++ /dev/null @@ -1,49 +0,0 @@ -# Override values for gpt-oss-20b on Intel Xeon CPU via the patched -# SGLang image. Use with the model-agnostic chart at -# core/helm-charts/sglang. -# -# Usage: -# helm install gpt-oss-20b ./core/helm-charts/sglang \ -# --values ./third_party/Dell/model-deployment/gpt-oss-20b/values.yaml -# -# Short-form generation (max_tokens <= 200) through the full auth-routed -# chain is solid. Long-form generation past ~150 tokens is a documented -# limitation of the pure-Python CPU MoE path — see -# third_party/Dell/model-deployment/sglang-troubleshooting.md. 
- -modelSource: openai/gpt-oss-20b -modelName: gpt-oss-20b - -image: - repository: enterprise-inference/sglang - tag: v0.5.12-xeon-fix11-debug - pullPolicy: Never - -server: - dtype: bfloat16 - extraArgs: - - --attention-backend - - torch_native - - --reasoning-parser - - gpt-oss - - --tool-call-parser - - gpt-oss - -resources: - requests: - memory: 48Gi - limits: - memory: 128Gi - -storage: - persistentVolume: - size: 40Gi - -ingress: - enabled: true - -apisixRoute: - enabled: true - -oidc: - enabled: true From 8f208340f9c31d3d7d7ce037236cf10f6205c617 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Wed, 27 May 2026 01:38:18 +0000 Subject: [PATCH 11/20] cld2labs/sglang-gpt-oss: bump chart default image.tag to fix11-debug Found during the Phase 2 from-scratch validation: chart values.yaml default image.tag was still v0.5.12-xeon-fix10-debug but build-and-import.sh imports v0.5.12-xeon-fix11-debug. A fresh user running `helm install ./core/helm-charts/sglang ...` after the build script hit ImagePullBackOff because kubelet treated the missing local fix10 tag as a remote pull request and got "pull access denied" from docker.io. 
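A pre-flight check along these lines could catch the mismatch before `helm install`. It is a minimal sketch, not part of the chart: the function name `check_image_tag` is hypothetical, and it assumes the build script assigns the tag on a plain `IMAGE_TAG=...` line and that the first `tag:` key in `values.yaml` is the image tag. Neither assumption is confirmed by this patch.

```bash
# Hypothetical pre-flight check (not part of the chart): compare the
# chart's default image tag with the tag the build script imports.
check_image_tag() {
  local values_file="$1" build_script="$2"
  local chart_tag build_tag
  # First `tag:` value in values.yaml, with surrounding quotes removed.
  chart_tag=$(sed -n 's/^[[:space:]]*tag:[[:space:]]*//p' "$values_file" \
    | head -n 1 | tr -d "\"'")
  # First plain IMAGE_TAG= assignment in the build script (assumed form).
  build_tag=$(sed -n 's/^[[:space:]]*IMAGE_TAG=//p' "$build_script" \
    | head -n 1 | tr -d "\"'")
  if [ "$chart_tag" = "$build_tag" ]; then
    echo "ok: $chart_tag"
  else
    echo "mismatch: chart=$chart_tag build=$build_tag"
    return 1
  fi
}
```

It could be called as `check_image_tag core/helm-charts/sglang/values.yaml core/helm-charts/sglang/image-build/build-and-import.sh`. A build script that uses a default expansion such as `${IMAGE_TAG:-...}` would need a different parse.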
Bump tag and tighten the surrounding comment to say: - fix1 is generic (benefits any bf16 model) - fix2..fix8 are gpt-oss specific and runtime no-ops for others - fix9..fix11 are debug knobs - the tag MUST match build-and-import.sh's IMAGE_TAG verbatim because the locally-imported image is the only place this tag exists Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/values.yaml | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/core/helm-charts/sglang/values.yaml b/core/helm-charts/sglang/values.yaml index bee12fef..305f9cf1 100644 --- a/core/helm-charts/sglang/values.yaml +++ b/core/helm-charts/sglang/values.yaml @@ -22,13 +22,18 @@ namespace: default replicaCount: 1 image: - # Patched image built from image-build/ — required for gpt-oss-on-CPU - # (fix1..fix8) and the Phase 2 precision-debug flags (fix9-debug, - # fix10-debug). Use `image-build/build-and-import.sh` to build + import - # into k3s containerd. Switch back to `lmsysorg/sglang:v0.5.11-xeon` for - # non-gpt-oss models that don't need the patch stack. + # Patched image built from image-build/. fix1 (sgl-kernel rebuild with + # -mavx512bf16) benefits any bf16 model on Xeon; fix2..fix8 add MXFP4 + # + sinks-attention support specifically for gpt-oss and are runtime + # no-ops for other models; fix9..fix11 are precision-debug knobs (off + # by default). Build + import with `image-build/build-and-import.sh`. + # + # IMPORTANT: `tag` below MUST match the IMAGE_TAG in build-and-import.sh + # exactly. With `pullPolicy: IfNotPresent` (the right setting for a + # locally-imported image), a tag mismatch causes the kubelet to try a + # docker.io pull and ImagePullBackOff on a private/non-existent image. 
repository: enterprise-inference/sglang - tag: "v0.5.12-xeon-fix10-debug" + tag: "v0.5.12-xeon-fix11-debug" pullPolicy: IfNotPresent imagePullSecrets: [] From 35a7a581a499824805ee2b0009c43c31fb731606 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Wed, 27 May 2026 02:00:27 +0000 Subject: [PATCH 12/20] cld2labs/sglang-gpt-oss: drop dual-path framing; appendix bootstraps a cluster where generate-token.sh works MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous Step 1 split into "Path A — production OPEA" and "Path B — single-node lab" was a framing I invented; the rest of the repo just talks about "the Enterprise Inference stack" and a "Single Node Deployment". The dual-path layout also implied that lab users had a different runtime to learn, which they don't — Intel testers will go through generate-token.sh and the docs need to make that path land end-to-end on whatever cluster the appendix produces. deployment.md: - Step 1 collapses to the llama-3.1-8b-instruct precedent: a single "Ensure the EI stack is deployed" paragraph and `source generate-token.sh`. No paths. No alternative one-liner. The troubleshooting doc now owns the cluster-internal recovery recipe. - Reword the BASE_URL note in Step 5 to talk about what generate-token.sh actually exports instead of "lab clusters where ...". core/helm-charts/sglang/README.md: - A.3: pin Keycloak's issuer with KC_HOSTNAME=http://keycloak.default.svc.cluster.local (+ KC_HOSTNAME_STRICT=false, KC_HOSTNAME_BACKCHANNEL_DYNAMIC=false). Without this, tokens issued via the edge route in A.7 carry iss=https://api.example.com:30443/... and APISIX's bearer_only check against the cluster-internal discovery URL returns 401. Switched KC_PROXY (deprecated in v26+) to KC_PROXY_HEADERS=xforwarded. - A.5: create the TLS secret in both auth-apisix (for the chart's Ingress) and default (for the Keycloak-edge Ingresses in A.7). 
- A.6: simplified — `/etc/hosts` only, plus a note that BASE_URL needs the :30443 NodePort since nginx isn't on :443 in the appendix. - A.7 (new): two nginx Ingresses that publish Keycloak under api.example.com:30443 — pass-through for /realms and /admin, rewrite for /token to Keycloak's openid-connect token endpoint. Without these generate-token.sh can't reach Keycloak's admin REST API to fetch the client secret. A.7 ends with a verification curl that round-trips the client_credentials grant against api.example.com:30443/token. troubleshooting.md: - Rewrite issue #7 from "fetch the token via kubectl run" (Path B workaround) to the actual root cause (missing KC_HOSTNAME) and the permanent fix. Phase 3 commits the docs; the from-scratch rebuild that exercises them happens next. Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/README.md | 132 +++++++++++++++--- .../gpt-oss-20b/deployment.md | 94 ++++--------- .../sglang-troubleshooting.md | 30 ++-- 3 files changed, 152 insertions(+), 104 deletions(-) diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md index d35c2331..bcc2cdf9 100644 --- a/core/helm-charts/sglang/README.md +++ b/core/helm-charts/sglang/README.md @@ -369,7 +369,14 @@ spec: - { name: KEYCLOAK_ADMIN, value: admin } - { name: KEYCLOAK_ADMIN_PASSWORD, value: admin } - { name: KC_HTTP_RELATIVE_PATH, value: "/" } - - { name: KC_PROXY, value: edge } + - { name: KC_PROXY_HEADERS, value: xforwarded } + # Pin the issuer hostname so tokens are always stamped with the + # cluster-internal name, no matter which edge hostname the + # request came in on. APISIX validates the `iss` claim against + # this hostname (chart's oidc.discovery default). 
+ - { name: KC_HOSTNAME, value: "http://keycloak.default.svc.cluster.local" } + - { name: KC_HOSTNAME_STRICT, value: "false" } + - { name: KC_HOSTNAME_BACKCHANNEL_DYNAMIC, value: "false" } ports: [{ containerPort: 8080, name: http }] --- apiVersion: v1 @@ -384,9 +391,9 @@ kubectl wait --for=condition=ready pod -l app=keycloak --timeout=300s Create the OIDC client. The `clientId` and `secret` here must exactly match what you'll later pass to the chart via `--set oidc.clientId=...` -and `--set oidc.clientSecret=...`. The values below are the lab defaults -used by the gpt-oss-20b deployment guide — substitute your own for any -non-test deployment. +and `--set oidc.clientSecret=...`. The values below are the appendix +defaults used by the gpt-oss-20b deployment guide — substitute your +own for any non-test deployment. ```bash ADMIN=$(kubectl run kc-admin --rm -i --restart=Never --quiet \ @@ -482,12 +489,12 @@ kubectl get ingressclass apisix -o jsonpath='{.spec.parameters}{"\n"}' ### A.5 TLS cert for `api.example.com` The chart's `--set ingress.secretName=${BASE_URL}` references a TLS -Secret whose name equals the hostname (so for the lab default, the -Secret is named `api.example.com`). The Secret must live in the same -namespace as the Ingress that consumes it. The chart's Ingress template -puts the Ingress in `auth-apisix` so that nginx terminates TLS and -forwards to the APISIX gateway service in the same namespace — so the -TLS Secret goes there too. +Secret whose name equals the hostname (so for the appendix default, the +Secret is named `api.example.com`). nginx requires the Secret in the +same namespace as the Ingress that consumes it. We need it in two +places: `auth-apisix` (where the chart-created Ingress for the model +lives) and `default` (where the Keycloak-edge Ingresses in A.7 will +live). 
```bash openssl req -x509 -newkey rsa:2048 -nodes -days 365 \ @@ -496,29 +503,108 @@ openssl req -x509 -newkey rsa:2048 -nodes -days 365 \ -addext "subjectAltName=DNS:api.example.com" kubectl create secret tls api.example.com \ - --cert=/tmp/tls.crt --key=/tmp/tls.key \ - -n auth-apisix + --cert=/tmp/tls.crt --key=/tmp/tls.key -n auth-apisix +kubectl create secret tls api.example.com \ + --cert=/tmp/tls.crt --key=/tmp/tls.key -n default ``` -### A.6 Local hostname resolution (lab only) +### A.6 Hostname resolution for `${BASE_URL}` -In production, the cluster's load-balancer IP for `api.example.com` is -in real DNS. On a single-node lab the hostname is only used as an SNI -selector for the self-signed cert and as the `Host:` header — it doesn't -need to be resolvable. The simplest setup is `--resolve` on every `curl`: +`generate-token.sh` and the inference curls in the deployment guide +both hit `https://${BASE_URL}/...` directly, so the host running those +commands must resolve `api.example.com`. In a production EI deployment +this is real DNS pointing at the load balancer; for a self-bootstrapped +cluster, an `/etc/hosts` entry pointing at the node is enough. ```bash -curl --resolve api.example.com:30443:127.0.0.1 https://api.example.com:30443/... 
+echo "127.0.0.1 api.example.com" | sudo tee -a /etc/hosts ``` -If you'd rather not pass `--resolve` every time, add an `/etc/hosts` -entry instead: +nginx in this appendix is on NodePort 30443 (not 443), so set +`BASE_URL` with the port for the EI scripts to find it: ```bash -echo "127.0.0.1 api.example.com" | sudo tee -a /etc/hosts +export BASE_URL=api.example.com:30443 +``` + +### A.7 Keycloak edge routes + +`generate-token.sh` issues two HTTP requests against `${BASE_URL}` — +first against `/realms/master/protocol/openid-connect/token` and +`/admin/realms/master/clients/...` (via `keycloak-fetch-client-secret.sh`, +to log in as the admin user and read the OIDC client secret), then +against `/token` (to exchange `client_credentials` for an access token). +On an EI cluster the Ansible playbooks publish all three behind nginx; +when bootstrapping by hand we publish them with two Ingresses below. + +```bash +# Pass-through Ingress for /realms/* and /admin/* — these go straight to +# Keycloak's endpoints unmodified. +kubectl apply -f - < # Keycloak admin user -export KEYCLOAK_PASSWORD= # Keycloak admin password -export KEYCLOAK_CLIENT_ID=my-client-id # OIDC client created at EI install -``` - -Then source it (also export `HUGGING_FACE_HUB_TOKEN` if any model in -your deployment requires gated HF access): +Then run: ```bash export HUGGING_FACE_HUB_TOKEN="your_token_here" @@ -42,44 +24,19 @@ cd ~/Enterprise-Inference source core/scripts/generate-token.sh ``` -The script logs in to Keycloak as the admin user, fetches the client -secret, and hits `https://${BASE_URL}/token` to exchange it for a -short-lived access token. Verify: +This exports: `BASE_URL`, `KEYCLOAK_CLIENT_ID`, `KEYCLOAK_CLIENT_SECRET`, +and `TOKEN`. 
Verify with: ```bash echo "BASE_URL=$BASE_URL" -echo "TOKEN length=${#TOKEN} (should be 1000+; empty means the script failed silently)" -``` - -### Path B — Single-node lab cluster - -`generate-token.sh` assumes `https://${BASE_URL}` resolves on port 443 -with a real TLS cert. On a single-node lab where `api.example.com` is -only in `/etc/hosts` and nginx is on a NodePort, the script silently -returns an empty `TOKEN`. Use this one-liner instead, which fetches the -token from inside the cluster (so the token's issuer claim matches what -APISIX validates): - -```bash -export BASE_URL=api.example.com -export KEYCLOAK_CLIENT_ID=my-client-id -export KEYCLOAK_CLIENT_SECRET=tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR -export HUGGING_FACE_HUB_TOKEN="" # gpt-oss-20b is public; leave empty - -export TOKEN=$(kubectl run keycloak-tok --rm -i --restart=Never --quiet \ - --image=curlimages/curl:8.10.1 -- \ - sh -c "curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ - -d 'client_id=${KEYCLOAK_CLIENT_ID}' \ - -d 'client_secret=${KEYCLOAK_CLIENT_SECRET}' \ - -d 'grant_type=client_credentials'" \ - | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") - -echo "TOKEN length=${#TOKEN}" +echo "TOKEN length=${#TOKEN} (expect 1000+; empty means the script failed silently)" ``` -> If `TOKEN length` is `0`, Keycloak rejected the request. The most -> common cause is that the OIDC client doesn't exist in the master -> realm yet — see Appendix A.3 in `core/helm-charts/sglang/README.md`. +> Empty `TOKEN` means the script could not reach +> `https://${BASE_URL}/realms/master/...` or `https://${BASE_URL}/token`. +> The EI deployment provisions both as ingress routes to Keycloak — if +> they're missing, the cluster bootstrap is incomplete; see Appendix +> A.7 of the chart README. 
## Step 2: Build the Patched SGLang Image @@ -182,9 +139,10 @@ curl -k https://${BASE_URL}/gpt-oss-20b-sglang/v1/chat/completions \ }' ``` -> Lab clusters where `api.example.com` is only in `/etc/hosts` and nginx -> is on a NodePort: add `--resolve api.example.com:30443:127.0.0.1` and -> use `https://api.example.com:30443/...` instead. +> The exact `${BASE_URL}` value depends on how the cluster was +> bootstrapped — it's what `core/scripts/generate-token.sh` exports +> after sourcing. Self-bootstrapped clusters following the chart +> README's appendix will have `${BASE_URL}=api.example.com:30443`. If successful, the model returns a chat-completion response with the answer in `choices[0].message.content` and the model's internal diff --git a/third_party/Dell/model-deployment/sglang-troubleshooting.md b/third_party/Dell/model-deployment/sglang-troubleshooting.md index eae93c9b..ccc2398f 100644 --- a/third_party/Dell/model-deployment/sglang-troubleshooting.md +++ b/third_party/Dell/model-deployment/sglang-troubleshooting.md @@ -9,7 +9,7 @@ This section provides common issues observed when running inference against mode 4. [Pod startup fails with "scalar path not implemented!"](#4-pod-startup-fails-with-scalar-path-not-implemented) 5. [Model serves but emits random-vocab gibberish in `content`](#5-model-serves-but-emits-random-vocab-gibberish-in-content) 6. [Long-form responses degrade into broken tokens after ~150 tokens](#6-long-form-responses-degrade-into-broken-tokens-after-150-tokens) -7. [401 Unauthorized from APISIX with a valid-looking token](#7-401-unauthorized-from-apisix-with-a-valid-looking-token) +7. [401 Unauthorized from APISIX with a valid-looking token](#7-401-unauthorized-from-apisix-with-a-valid-looking-token-issuer-mismatch) --- @@ -149,22 +149,26 @@ This is documented under "Known Limitations" in `core/helm-charts/sglang/README. --- -### 7. 401 Unauthorized from APISIX with a valid-looking token +### 7. 
401 Unauthorized from APISIX with a valid-looking token (issuer mismatch) -**Context:** Token was successfully obtained from Keycloak, but the auth-routed inference call returns `401 Unauthorized` from APISIX (response body mentions "openresty"). +**Context:** Token was successfully obtained from Keycloak (via `source generate-token.sh` or equivalent), but the inference call returns `401 Unauthorized` from APISIX (response body mentions "openresty"). -**Cause:** APISIX's OIDC plugin validates the token's `iss` (issuer) claim against the configured discovery URL. If the token was fetched via `kubectl port-forward localhost:18080`, Keycloak stamped the issuer as `http://127.0.0.1:18080/...`, but APISIX checks against `http://keycloak.default.svc.cluster.local/...` and rejects the mismatch. +**Cause:** APISIX's OIDC plugin runs in `bearer_only` mode and validates the token's `iss` (issuer) claim against the issuer returned by the OIDC discovery URL the chart was configured with. If Keycloak was deployed without a fixed `KC_HOSTNAME`, it stamps the issuer based on the incoming request's host header — so a token fetched via `https://api.example.com:30443/token` carries `iss=https://api.example.com:30443/realms/master`, but the chart's default discovery URL is `http://keycloak.default.svc.cluster.local/realms/master`. The two don't match and APISIX rejects. -**Fix:** Fetch the token from inside the cluster so the issuer matches: +**Fix:** Pin Keycloak's issuer at deploy time by setting `KC_HOSTNAME` on the Keycloak Deployment to the cluster-internal hostname the chart's `oidc.discovery` value points at. 
The appendix in `core/helm-charts/sglang/README.md` (A.3) shows the env vars; the relevant ones are: + +```yaml +- { name: KC_HOSTNAME, value: "http://keycloak.default.svc.cluster.local" } +- { name: KC_HOSTNAME_STRICT, value: "false" } +- { name: KC_HOSTNAME_BACKCHANNEL_DYNAMIC, value: "false" } +``` + +After updating the Deployment (`kubectl apply` the manifest from A.3 again, then wait for the new pod), re-source `generate-token.sh` to fetch a fresh token. Verify the issuer claim is now cluster-internal: ```bash -TOKEN=$(kubectl run keycloak-tok --rm -i --restart=Never --quiet \ - --image=curlimages/curl:8.10.1 -- \ - sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ - -d "client_id='"$KEYCLOAK_CLIENT_ID"'" \ - -d "client_secret='"$KEYCLOAK_CLIENT_SECRET"'" \ - -d "grant_type=client_credentials"' \ - | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") +echo "$TOKEN" | cut -d. -f2 | base64 -d 2>/dev/null \ + | python3 -c "import json,sys; print('iss =', json.loads(sys.stdin.read())['iss'])" +# expect: iss = http://keycloak.default.svc.cluster.local/realms/master ``` -For production deployments, configure Keycloak with `KC_HOSTNAME=` so it always issues tokens with a stable, externally-resolvable issuer. +The mismatched-issuer 401 cannot happen on a production EI cluster — the Ansible playbooks set `KC_HOSTNAME` to the cluster's external hostname and the chart's `oidc.discovery` is set to the matching URL — but it's a common stumble for someone bootstrapping by hand from the appendix. 
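JWT segments are base64url-encoded without padding, so a plain `base64 -d` on the payload can fail or truncate depending on the token length. The sketch below is a more tolerant way to read the `iss` claim. The function names `jwt_payload` and `jwt_iss` are hypothetical, the code assumes GNU coreutils `base64`, and the `sed` extraction is deliberately naive: it assumes the issuer string contains no escaped quotes.

```bash
# Hypothetical helpers for inspecting a token's issuer claim.

# Decode the payload (second dot-separated segment) of a JWT.
jwt_payload() {
  # Map the URL-safe alphabet back to standard base64.
  _seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url encoding drops.
  case $(( ${#_seg} % 4 )) in
    2) _seg="${_seg}==" ;;
    3) _seg="${_seg}=" ;;
  esac
  printf '%s' "$_seg" | base64 -d
}

# Print the value of the "iss" claim (naive string match, not a JSON parser).
jwt_iss() {
  jwt_payload "$1" | sed -n 's/.*"iss"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

After sourcing `generate-token.sh`, `jwt_iss "$TOKEN"` should print the issuer, which can then be compared with the URL the chart's `oidc.discovery` value points at.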
From 6f2e1ada26d1468e7a88fd7b5c3a48f78857acb1 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Wed, 27 May 2026 02:03:25 +0000 Subject: [PATCH 13/20] =?UTF-8?q?cld2labs/sglang-gpt-oss:=20clarify=20appe?= =?UTF-8?q?ndix=20preamble=20=E2=80=94=20it=20produces=20an=20EI-shape=20c?= =?UTF-8?q?luster?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Make the framing explicit so an Intel tester knows immediately whether they need the appendix at all: - On an OPEA-Ansible-deployed cluster, the appendix is unnecessary. Go straight to "Build the Image". The same deployment guide applies. - For users without Ansible bootstrap, the appendix produces the same component shape (Keycloak with KC_HOSTNAME pinned, edge routes for /realms/admin/token, OIDC client `my-client-id`, TLS secret in both consuming namespaces, APISIX GatewayProxy wiring). After it runs, generate-token.sh and the deploy work the same way as on an OPEA cluster. This is the same logical scope the previous "From-Scratch Bootstrap" header implied, but the previous prose left it ambiguous whether production OPEA users would land here too. Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/README.md | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md index bcc2cdf9..71bdc6c2 100644 --- a/core/helm-charts/sglang/README.md +++ b/core/helm-charts/sglang/README.md @@ -315,10 +315,20 @@ third_party/Dell/model-deployment/ ## Appendix: From-Scratch Bootstrap -Use this only if you're standing up a fresh single-node box without OPEA's -Ansible-driven cluster setup. On a stock OPEA cluster, k3s, nginx-ingress, -APISIX, and Keycloak are already in place and you can skip directly to -**Build the Image**. 
+An Enterprise Inference cluster brought up with the OPEA Ansible +playbooks already has k3s, nginx-ingress, APISIX, Keycloak, the +Keycloak edge routes, and the OIDC client provisioned for you — skip +this appendix entirely and go straight to **Build the Image**. The +deployment guide is the same regardless of how the cluster was +bootstrapped. + +This appendix produces the same cluster shape by hand for cases where +the Ansible playbooks haven't been run: k3s + nginx + Keycloak + APISIX +(with the GatewayProxy/IngressClass wiring), the TLS secret in both +namespaces that need it, a `KC_HOSTNAME`-pinned Keycloak, the +`/realms`, `/admin`, and `/token` edge Ingresses, and the +`my-client-id` OIDC client. After it runs, `generate-token.sh` and the +model deploy work identically to the OPEA flow. ### A.1 k3s + Helm From 2cb87da06b12cfc7b4eb54bcc47e8146acaff736 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Wed, 27 May 2026 02:28:23 +0000 Subject: [PATCH 14/20] cld2labs/sglang-gpt-oss: build-and-import.sh autodetects nerdctl (kubeadm) vs k3s MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The build script previously assumed k3s and ran: docker build → docker save | k3s ctr images import - That fails on OPEA-deployed (kubeadm + containerd) clusters where neither docker nor k3s is present, but nerdctl is. Detect the runtime and pick the right path: - nerdctl present: nerdctl --namespace k8s.io build (single step; containerd's image store IS where kubelet pulls from) - k3s present: keep the existing docker → k3s ctr import path - neither: hard fail with a helpful message Caught when validating Path A against a real inference-stack-deploy.sh cluster (kubeadm 1.31, containerd 1.7.24). 
Signed-off-by: arpannookala-12 --- .../sglang/image-build/build-and-import.sh | 84 +++++++++++++------ 1 file changed, 60 insertions(+), 24 deletions(-) diff --git a/core/helm-charts/sglang/image-build/build-and-import.sh b/core/helm-charts/sglang/image-build/build-and-import.sh index 140956e6..bbb24743 100755 --- a/core/helm-charts/sglang/image-build/build-and-import.sh +++ b/core/helm-charts/sglang/image-build/build-and-import.sh @@ -1,6 +1,14 @@ #!/usr/bin/env bash -# One-shot script to build the patched sglang xeon image and import it -# into the k3s containerd cache so the chart can use it without a registry. +# One-shot script to build the patched sglang xeon image and load it +# into the local containerd image store, so the chart can use it without +# pushing to an external registry. +# +# Auto-detects the runtime: +# - OPEA / kubeadm-based clusters: containerd accessed via `nerdctl` +# under the `k8s.io` namespace (where kubelet pulls from). Built +# directly there; no separate import step. +# - k3s clusters: `docker build` then `docker save | k3s ctr images +# import -`. Installs docker.io if missing. # # Run with: sudo bash core/helm-charts/sglang/image-build/build-and-import.sh set -euo pipefail @@ -8,31 +16,59 @@ set -euo pipefail IMAGE_TAG="${IMAGE_TAG:-enterprise-inference/sglang:v0.5.12-xeon-fix11-debug}" SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" -echo "==> Ensuring docker is installed" -if ! command -v docker >/dev/null 2>&1; then - apt-get update - DEBIAN_FRONTEND=noninteractive apt-get install -y docker.io - systemctl enable --now docker +cd "$SCRIPT_DIR" + +if command -v nerdctl >/dev/null 2>&1 && command -v containerd >/dev/null 2>&1; then + RUNTIME=nerdctl +elif command -v k3s >/dev/null 2>&1; then + RUNTIME=k3s +else + echo "ERROR: neither nerdctl (kubeadm/containerd) nor k3s detected." >&2 + echo "Install one of them, or build manually and push to a registry." 
>&2 + exit 1 fi -docker version --format 'Server: {{.Server.Version}}' -echo "==> Building $IMAGE_TAG" -cd "$SCRIPT_DIR" -docker build -t "$IMAGE_TAG" . +echo "==> Detected container runtime: $RUNTIME" -echo "==> Importing into k3s containerd" -# k3s ships its own containerd; piping a docker-save into k3s ctr image import -# makes the image directly available to k3s pods (no registry required). -docker save "$IMAGE_TAG" | k3s ctr images import - +case "$RUNTIME" in + nerdctl) + # nerdctl builds directly into containerd's image store. Pin namespace + # to k8s.io so kubelet can find the image without a separate import. + echo "==> Building $IMAGE_TAG via nerdctl (namespace k8s.io)" + nerdctl --namespace k8s.io build -t "$IMAGE_TAG" . -echo "==> Verifying" -k3s ctr images ls -q | grep -F "$IMAGE_TAG" || { - echo "Imported image not found in k3s containerd" - exit 1 -} + echo "==> Verifying" + nerdctl --namespace k8s.io images "$IMAGE_TAG" --format '{{.Repository}}:{{.Tag}}' \ + | grep -F "$IMAGE_TAG" || { + echo "Image not found in containerd k8s.io namespace" >&2 + exit 1 + } + ;; + + k3s) + echo "==> Ensuring docker is installed" + if ! command -v docker >/dev/null 2>&1; then + apt-get update + DEBIAN_FRONTEND=noninteractive apt-get install -y docker.io + systemctl enable --now docker + fi + docker version --format 'Server: {{.Server.Version}}' + + echo "==> Building $IMAGE_TAG via docker" + docker build -t "$IMAGE_TAG" . + + echo "==> Importing into k3s containerd" + docker save "$IMAGE_TAG" | k3s ctr images import - + + echo "==> Verifying" + k3s ctr images ls -q | grep -F "$IMAGE_TAG" || { + echo "Imported image not found in k3s containerd" >&2 + exit 1 + } + ;; +esac echo -echo "==> Done. Use in chart with:" -echo " --set image.repository=${IMAGE_TAG%:*}" -echo " --set image.tag=${IMAGE_TAG##*:}" -echo " --set image.pullPolicy=Never" +echo "==> Done. Image $IMAGE_TAG is loaded in the local containerd image store." 
+echo "==> The chart's values.yaml already defaults to this tag with" +echo " pullPolicy: IfNotPresent. No further overrides required." From 7158b5e4369524f0f57079a35fb48b1ac0044e15 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Wed, 27 May 2026 02:29:31 +0000 Subject: [PATCH 15/20] cld2labs/sglang-gpt-oss: install buildkit on-demand when running nerdctl path MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit nerdctl needs buildkitd to satisfy `nerdctl build`. OPEA-deployed clusters (kubeadm + containerd) ship nerdctl but not buildkit, so the first `nerdctl build` invocation errors out with "buildctl needs to be installed". Make build-and-import.sh install + start buildkit on the fly if it's missing — same one-shot ergonomic pattern the k3s branch already uses for docker.io. Falls back to a background buildkitd if the system doesn't have a `buildkit` systemd unit. Caught when validating Path A against a real inference-stack-deploy.sh cluster. Signed-off-by: arpannookala-12 --- .../sglang/image-build/build-and-import.sh | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/core/helm-charts/sglang/image-build/build-and-import.sh b/core/helm-charts/sglang/image-build/build-and-import.sh index bbb24743..bd030e1d 100755 --- a/core/helm-charts/sglang/image-build/build-and-import.sh +++ b/core/helm-charts/sglang/image-build/build-and-import.sh @@ -32,6 +32,25 @@ echo "==> Detected container runtime: $RUNTIME" case "$RUNTIME" in nerdctl) + # nerdctl needs buildkitd to run `nerdctl build`. On OPEA-deployed + # clusters buildkit isn't installed by default; install + start it + # if missing. + if ! command -v buildctl >/dev/null 2>&1; then + echo "==> Installing buildkit (required by nerdctl build)" + apt-get update + DEBIAN_FRONTEND=noninteractive apt-get install -y buildkit + fi + if ! 
pgrep -x buildkitd >/dev/null 2>&1; then + echo "==> Starting buildkitd in the background" + systemctl enable --now buildkit 2>/dev/null \ + || nohup buildkitd >/var/log/buildkitd.log 2>&1 & + # give it a couple seconds to come up + for i in 1 2 3 4 5; do + [ -S /run/buildkit/buildkitd.sock ] && break + sleep 1 + done + fi + # nerdctl builds directly into containerd's image store. Pin namespace # to k8s.io so kubelet can find the image without a separate import. echo "==> Building $IMAGE_TAG via nerdctl (namespace k8s.io)" From e6db49b7391e26557cb8885479e33d753156de45 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Wed, 27 May 2026 02:30:18 +0000 Subject: [PATCH 16/20] cld2labs/sglang-gpt-oss: pull buildkit from upstream GitHub release MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Ubuntu 22.04 doesn't ship a `buildkit` apt package — `apt install buildkit` errors with "Unable to locate package". Fetch buildkit (~30 MB) from moby/buildkit GitHub releases and install /usr/local/bin/buildctl + /usr/local/bin/buildkitd directly. Default pinned version v0.18.1; override with BUILDKIT_VERSION env var. Also tighten the buildkitd-startup poll: wait up to 10 s for the unix socket, hard-fail with a pointer to the log file if it never appears (better than the previous silent continue). Signed-off-by: arpannookala-12 --- .../sglang/image-build/build-and-import.sh | 28 ++++++++++++------- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/core/helm-charts/sglang/image-build/build-and-import.sh b/core/helm-charts/sglang/image-build/build-and-import.sh index bd030e1d..c83f324d 100755 --- a/core/helm-charts/sglang/image-build/build-and-import.sh +++ b/core/helm-charts/sglang/image-build/build-and-import.sh @@ -32,23 +32,31 @@ echo "==> Detected container runtime: $RUNTIME" case "$RUNTIME" in nerdctl) - # nerdctl needs buildkitd to run `nerdctl build`. 
On OPEA-deployed - # clusters buildkit isn't installed by default; install + start it - # if missing. + # nerdctl needs buildkitd to run `nerdctl build`. buildkit isn't in + # Ubuntu apt — install from upstream GitHub releases (~30 MB). if ! command -v buildctl >/dev/null 2>&1; then - echo "==> Installing buildkit (required by nerdctl build)" - apt-get update - DEBIAN_FRONTEND=noninteractive apt-get install -y buildkit + BUILDKIT_VERSION="${BUILDKIT_VERSION:-v0.18.1}" + echo "==> Installing buildkit ${BUILDKIT_VERSION} from GitHub releases" + tmpdir=$(mktemp -d) + curl -fsSL \ + "https://github.com/moby/buildkit/releases/download/${BUILDKIT_VERSION}/buildkit-${BUILDKIT_VERSION}.linux-amd64.tar.gz" \ + | tar -xz -C "$tmpdir" + install -m 0755 "$tmpdir/bin/buildctl" /usr/local/bin/buildctl + install -m 0755 "$tmpdir/bin/buildkitd" /usr/local/bin/buildkitd + rm -rf "$tmpdir" fi if ! pgrep -x buildkitd >/dev/null 2>&1; then echo "==> Starting buildkitd in the background" - systemctl enable --now buildkit 2>/dev/null \ - || nohup buildkitd >/var/log/buildkitd.log 2>&1 & - # give it a couple seconds to come up - for i in 1 2 3 4 5; do + mkdir -p /run/buildkit + nohup /usr/local/bin/buildkitd >/var/log/buildkitd.log 2>&1 & + for i in 1 2 3 4 5 6 7 8 9 10; do [ -S /run/buildkit/buildkitd.sock ] && break sleep 1 done + [ -S /run/buildkit/buildkitd.sock ] || { + echo "buildkitd did not come up; see /var/log/buildkitd.log" >&2 + exit 1 + } fi # nerdctl builds directly into containerd's image store. Pin namespace From 7edd71e5591844ad0de1913ddd26811d6eb27d3b Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Wed, 27 May 2026 03:20:48 +0000 Subject: [PATCH 17/20] cld2labs/sglang-gpt-oss: stop hard-coding k3s in chart README + deployment.md The build script already auto-detects nerdctl (kubeadm/containerd) vs k3s, but the docs around it still said "imports into k3s containerd" and gave a `k3s ctr images ls` as the only verify command. 
On a real OPEA cluster (kubeadm + containerd, no k3s binary) that command errors and there's no signal that a different runtime is supported. Rephrase both the chart README and the gpt-oss deployment guide: - Replace "k3s containerd" with "local containerd image store" / generic phrasing in the prose. - Replace the single k3s verify line with a dual block that shows the nerdctl form (kubeadm) and the k3s ctr form, with a one-line explanation of which to use. - Prerequisites: list both kubeadm/containerd and k3s as validated targets instead of saying k3s. - Project-structure tree comment for build-and-import.sh: "local containerd (kubeadm or k3s)" instead of "k3s containerd". The appendix (which explicitly bootstraps k3s as a convenience) stays k3s-specific; that's a deliberate choice for the self-bootstrap path, not a claim about how the chart runs. Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/README.md | 29 ++++++++++++------- .../gpt-oss-20b/deployment.md | 16 +++++++--- 2 files changed, 31 insertions(+), 14 deletions(-) diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md index 71bdc6c2..0d29e894 100644 --- a/core/helm-charts/sglang/README.md +++ b/core/helm-charts/sglang/README.md @@ -23,14 +23,14 @@ every bf16 forward pass crashes with `tinygemm_kernel_nn: scalar path not implemented!` regardless of model. The remaining patches are gpt-oss-specific and are runtime no-ops for other models. The image is built once via a self-contained Dockerfile and imported directly into -k3s containerd — no registry required. +the local containerd image store — no registry required. 
## Features - **Model-agnostic SGLang on Xeon CPU** — any HF model SGLang supports loads through the same chart - **Patched image** that unblocks bf16 inference on Xeon (every model benefits) and adds MXFP4 + sinks-attention support for gpt-oss - **OPEA-standard auth chain**: TLS at nginx, OIDC bearer validation at APISIX, token issuance by Keycloak -- **No external registry**: image builds locally and imports into k3s containerd +- **No external registry**: image builds locally into the cluster's containerd image store (works on both kubeadm/containerd and k3s) - **OpenAI-compatible API**: `/v1/chat/completions`, `/v1/models`, `/v1/completions` - **Chart-only delivery**: same standalone pattern as `core/helm-charts/ovms`, not yet wired into the Ansible playbooks @@ -40,7 +40,7 @@ k3s containerd — no registry required. - **Hardware**: Intel Xeon with AVX-512-BF16 / AMX-BF16 (Sapphire Rapids, Emerald Rapids, Granite Rapids) - **Memory**: ≥ 64 GiB RAM for mid-size models (gpt-oss-20b uses ~25 GiB dequantized + KV cache) - **Disk**: ≥ 100 GiB free on the root partition -- **Kubernetes**: 1.24+ (k3s is fine; this chart was validated on single-node k3s) +- **Kubernetes**: 1.24+ — validated on kubeadm/containerd (the cluster `inference-stack-deploy.sh` produces) and on k3s - **Helm**: 3+ - **NodePorts free on the host**: 30080, 30443 (nginx), 32080 (APISIX) - **HuggingFace token** for gated models (e.g. `meta-llama/*`); not required for open models like `openai/gpt-oss-20b` or `Qwen/Qwen3-8B` @@ -61,18 +61,27 @@ git checkout cld2labs/sglang-gpt-oss sudo bash core/helm-charts/sglang/image-build/build-and-import.sh ``` -First run takes ~5–10 minutes (installs docker.io if missing, compiles -27 C++ files in `sgl-kernel` with the right BF16 flags, runs 11 Python -patch scripts against SGLang's in-image source, and imports the result -into k3s containerd). +First run takes ~5–10 minutes. 
The script auto-detects the runtime: -Verify: +- **kubeadm + containerd** (OPEA Ansible-deployed clusters): builds via + `nerdctl` directly into containerd's `k8s.io` namespace. Installs + `buildkit` from upstream GitHub on demand if it isn't already present. +- **k3s**: installs `docker.io` on demand, builds, then + `docker save | k3s ctr images import -`. + +In both cases the image lands where kubelet pulls from. Verify with +whichever tool matches your runtime: ```bash +# kubeadm / containerd +sudo nerdctl --namespace k8s.io images | grep enterprise-inference/sglang + +# k3s sudo k3s ctr images ls | grep enterprise-inference/sglang -# docker.io/enterprise-inference/sglang:v0.5.12-xeon-fix11-debug ``` +Either way the expected line is `enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`. + ## Deploy a Model `modelSource` and `modelName` are required at install time. The chart @@ -295,7 +304,7 @@ core/helm-charts/sglang/ ├── templates/ # Helm templates (Deployment, Service, PVC, Ingress, ApisixRoute, Secret) └── image-build/ ├── Dockerfile # FROM lmsysorg/sglang:v0.5.12-xeon + 11 patch steps - ├── build-and-import.sh # one-shot build + import into k3s containerd + ├── build-and-import.sh # one-shot build + load into local containerd (kubeadm or k3s) └── enable-*.py # patch scripts applied at image build time third_party/Dell/model-deployment/ diff --git a/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md index fe6ef501..2f4ca6b5 100644 --- a/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md +++ b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md @@ -47,20 +47,28 @@ published `sgl-kernel` shared library is missing the AVX-512-BF16 compile flags needed for any bf16 matmul). The SGLang chart ships a one-shot build script that produces a patched -image and imports it directly into k3s containerd. No external registry -is required. 
+image and loads it directly into the local containerd image store. No +external registry is required. ```bash sudo bash core/helm-charts/sglang/image-build/build-and-import.sh ``` -First run takes ~5-10 minutes. Verify: +First run takes ~5-10 minutes. The script auto-detects the runtime — +`nerdctl` on a kubeadm/containerd cluster (what `inference-stack-deploy.sh` +produces) or `k3s ctr` on a k3s cluster. Verify with whichever matches +your cluster: ```bash +# kubeadm / containerd +sudo nerdctl --namespace k8s.io images | grep enterprise-inference/sglang + +# k3s sudo k3s ctr images ls | grep enterprise-inference/sglang -# docker.io/enterprise-inference/sglang:v0.5.12-xeon-fix11-debug ``` +Either should report `enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`. + For a detailed breakdown of what each patch does, see `core/helm-charts/sglang/README.md` (section: What's Patched). From 11711360f499568f5251437051a8c160bef1008e Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Tue, 2 Jun 2026 10:25:29 -0500 Subject: [PATCH 18/20] cld2labs/sglang-gpt-oss: drop local-only .gitignore from chart dir Signed-off-by: arpannookala-12 --- core/helm-charts/sglang/.gitignore | 3 --- 1 file changed, 3 deletions(-) delete mode 100644 core/helm-charts/sglang/.gitignore diff --git a/core/helm-charts/sglang/.gitignore b/core/helm-charts/sglang/.gitignore deleted file mode 100644 index 2e0b601a..00000000 --- a/core/helm-charts/sglang/.gitignore +++ /dev/null @@ -1,3 +0,0 @@ -# Local-only working notes (not for upstream sharing). 
-REMAINING_WORK.md -UPSTREAM_BUG_REPORT.md From 773fe99e55aefe82f4e834988432e7e9904bf424 Mon Sep 17 00:00:00 2001 From: arpannookala-12 Date: Fri, 5 Jun 2026 12:30:16 -0500 Subject: [PATCH 19/20] cld2labs/sglang-gpt-oss: bifurcate README into Scenario 1 (EI/Ansible) and Scenario 2 (k3s bootstrap) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a decision table at the top so readers go directly to the right path. Elevate the former appendix to a first-class Scenario 2 section (S2.1–S2.7) with consistent step numbering. Mark the convergence point at "Build the Image" explicitly so both paths meet cleanly. Signed-off-by: arpannookala-12 --- .gitignore | 11 + core/helm-charts/sglang/README.md | 818 +++++++++++++++--------------- 2 files changed, 427 insertions(+), 402 deletions(-) create mode 100644 .gitignore diff --git a/.gitignore b/.gitignore new file mode 100644 index 00000000..4d2009e6 --- /dev/null +++ b/.gitignore @@ -0,0 +1,11 @@ +# Local investigation / working notes — not for upstream +FIXES.md +INVESTIGATION.md +JOURNEY.md +REMAINING_WORK.md +UPSTREAM_BUG_REPORT.md + +# Security scan outputs +bandit-report.html +bandit-screen-output.txt +trivy-reports/ diff --git a/core/helm-charts/sglang/README.md b/core/helm-charts/sglang/README.md index 0d29e894..8ebc61aa 100644 --- a/core/helm-charts/sglang/README.md +++ b/core/helm-charts/sglang/README.md @@ -34,7 +34,30 @@ the local containerd image store — no registry required. - **OpenAI-compatible API**: `/v1/chat/completions`, `/v1/models`, `/v1/completions` - **Chart-only delivery**: same standalone pattern as `core/helm-charts/ovms`, not yet wired into the Ansible playbooks -## Prerequisites +--- + +## Which Scenario Applies to You? 
+ +| | Scenario 1 | Scenario 2 | +|---|---|---| +| **Cluster setup** | OPEA Ansible playbooks already run | Fresh box — no existing cluster | +| **k3s / nginx / APISIX / Keycloak** | Already provisioned | You set them up manually | +| **Starting point** | Go to [Prerequisites](#prerequisites) | Go to [Scenario 2: k3s Bootstrap](#scenario-2-k3s-bootstrap-standalone-setup) | +| **Converges at** | [Build the Image](#build-the-image) | [Build the Image](#build-the-image) | + +Both scenarios use the same chart, the same image, and the same `helm install` command. +They differ only in how the cluster and auth stack are set up beforehand. + +--- + +## Scenario 1: EI Deployment (OPEA Ansible Cluster) + +Use this path when your cluster was provisioned by the OPEA Ansible playbooks. +k3s, nginx-ingress, APISIX, Keycloak, the Keycloak edge routes, and the OIDC +client are already in place. Skip straight to **Prerequisites** and then +**Build the Image**. + +### Prerequisites - **Operating System**: Ubuntu 22.04+ - **Hardware**: Intel Xeon with AVX-512-BF16 / AMX-BF16 (Sapphire Rapids, Emerald Rapids, Granite Rapids) @@ -46,139 +69,438 @@ the local containerd image store — no registry required. - **HuggingFace token** for gated models (e.g. `meta-llama/*`); not required for open models like `openai/gpt-oss-20b` or `Qwen/Qwen3-8B` - **Sudo access** for the one-shot image build -> **Note:** On a stock OPEA cluster, k3s, nginx-ingress, APISIX, and Keycloak -> are already in place via the project's Ansible playbooks — skip straight to -> **Build the Image**. The "From-Scratch Bootstrap" appendix at the bottom is only -> for people standing up a fresh single-node box from zero. 
+--- -## Build the Image +## Scenario 2: k3s Bootstrap (Standalone Setup) -```bash -git clone https://github.com/cld2labs/Enterprise-Inference.git -cd Enterprise-Inference -git checkout cld2labs/sglang-gpt-oss +Use this path when you are starting from a **fresh single-node Ubuntu box** +with no existing Kubernetes cluster. The steps below reproduce the same +cluster shape the OPEA Ansible playbooks produce: k3s + nginx + Keycloak + +APISIX (with GatewayProxy/IngressClass wiring), the TLS secret in both +namespaces that need it, a `KC_HOSTNAME`-pinned Keycloak, the `/realms`, +`/admin`, and `/token` edge Ingresses, and the `my-client-id` OIDC client. -sudo bash core/helm-charts/sglang/image-build/build-and-import.sh -``` +After completing this scenario, `generate-token.sh` and the model deploy +work identically to the OPEA Ansible flow — both scenarios converge at +**Build the Image** below. -First run takes ~5–10 minutes. The script auto-detects the runtime: +### S2.1 k3s + Helm -- **kubeadm + containerd** (OPEA Ansible-deployed clusters): builds via - `nerdctl` directly into containerd's `k8s.io` namespace. Installs - `buildkit` from upstream GitHub on demand if it isn't already present. -- **k3s**: installs `docker.io` on demand, builds, then - `docker save | k3s ctr images import -`. +```bash +sudo bash scripts/bootstrap-k3s.sh +export KUBECONFIG=$HOME/.kube/config +kubectl get nodes -o wide +helm version --short +``` -In both cases the image lands where kubelet pulls from. Verify with -whichever tool matches your runtime: +The script installs k3s (`--disable traefik`), symlinks `kubectl`, copies +kubeconfig to `~/.kube/config`, and installs Helm 3. 
+ +### S2.2 nginx-ingress ```bash -# kubeadm / containerd -sudo nerdctl --namespace k8s.io images | grep enterprise-inference/sglang +helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx +helm install ingress-nginx ingress-nginx/ingress-nginx \ + -n ingress-nginx --create-namespace \ + --set controller.service.type=NodePort \ + --set controller.service.nodePorts.http=30080 \ + --set controller.service.nodePorts.https=30443 \ + --set controller.admissionWebhooks.enabled=false \ + --set controller.ingressClassResource.default=true -# k3s -sudo k3s ctr images ls | grep enterprise-inference/sglang +kubectl wait --for=condition=ready pod -n ingress-nginx \ + -l app.kubernetes.io/component=controller --timeout=120s ``` -Either way the expected line is `enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`. +### S2.3 Keycloak (dev mode) -## Deploy a Model +```bash +kubectl apply -f - <<'EOF' +apiVersion: apps/v1 +kind: Deployment +metadata: { name: keycloak, namespace: default } +spec: + replicas: 1 + selector: { matchLabels: { app: keycloak } } + template: + metadata: { labels: { app: keycloak } } + spec: + containers: + - name: keycloak + image: quay.io/keycloak/keycloak:26.0 + args: ["start-dev"] + env: + - { name: KEYCLOAK_ADMIN, value: admin } + - { name: KEYCLOAK_ADMIN_PASSWORD, value: admin } + - { name: KC_HTTP_RELATIVE_PATH, value: "/" } + - { name: KC_PROXY_HEADERS, value: xforwarded } + # Pin the issuer hostname so tokens are always stamped with the + # cluster-internal name, no matter which edge hostname the + # request came in on. APISIX validates the `iss` claim against + # this hostname (chart's oidc.discovery default). 
+ - { name: KC_HOSTNAME, value: "http://keycloak.default.svc.cluster.local" } + - { name: KC_HOSTNAME_STRICT, value: "false" } + - { name: KC_HOSTNAME_BACKCHANNEL_DYNAMIC, value: "false" } + ports: [{ containerPort: 8080, name: http }] +--- +apiVersion: v1 +kind: Service +metadata: { name: keycloak, namespace: default } +spec: + selector: { app: keycloak } + ports: [{ port: 80, targetPort: 8080 }] +EOF +kubectl wait --for=condition=ready pod -l app=keycloak --timeout=300s +``` -`modelSource` and `modelName` are required at install time. The chart -template fails fast if either is empty. +Create the OIDC client. The `clientId` and `secret` here must exactly +match what you'll later pass to the chart via `--set oidc.clientId=...` +and `--set oidc.clientSecret=...`. The values below are the defaults used +by the gpt-oss-20b deployment guide — substitute your own for any +non-test deployment. -### Generic install +```bash +ADMIN=$(kubectl run kc-admin --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d "client_id=admin-cli" -d "username=admin" -d "password=admin" -d "grant_type=password"' \ + | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") + +CLIENT_ID=my-client-id +CLIENT_SECRET=tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR + +kubectl run kc-create --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c "curl -sS -X POST -H 'Authorization: Bearer $ADMIN' \ + -H 'Content-Type: application/json' \ + http://keycloak.default.svc.cluster.local/admin/realms/master/clients \ + -d '{\"clientId\":\"${CLIENT_ID}\",\"secret\":\"${CLIENT_SECRET}\",\"serviceAccountsEnabled\":true,\"publicClient\":false,\"directAccessGrantsEnabled\":true}'" +``` + +Verify the client was created: ```bash -helm install ./core/helm-charts/sglang \ - --set modelSource="" \ - --set modelName="" \ - --set 
huggingface.token="$HF_TOKEN" # only if the model is gated +kubectl run kc-check --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c "curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d 'client_id=${CLIENT_ID}' -d 'client_secret=${CLIENT_SECRET}' -d 'grant_type=client_credentials'" \ + | head -c 80 +# expect: JSON with "access_token":"..." ``` -### Model-specific recipes +### S2.4 APISIX -Models that need additional configuration ship with their own values file -and deployment guide: +The Apache APISIX chart installs the dataplane + etcd + ingress +controller. On v2 of the ingress controller (current as of this writing) +you additionally need a `GatewayProxy` CR and an `IngressClass` whose +`parameters` reference it — without those, the controller silently drops +every `ApisixRoute` and the chart's route ends up unreachable. -| Model | Deployment guide | -| ----- | ---------------- | -| `openai/gpt-oss-20b` | `third_party/Dell/model-deployment/gpt-oss-20b/deployment.md` | +```bash +helm repo add apisix https://charts.apiseven.com +helm install auth-apisix apisix/apisix \ + -n auth-apisix --create-namespace \ + --set service.type=NodePort \ + --set ingress-controller.enabled=true \ + --set ingress-controller.config.apisix.serviceNamespace=auth-apisix -The deployment guide carries the full `helm install` command line for -that model — all model-specific flags (parsers, attention backend, -extraArgs) come through as `--set` overrides. The chart's own -`values.yaml` stays model-agnostic. 
+kubectl wait --for=condition=ready pod -n auth-apisix --all --timeout=300s -Wait for the pod (first start downloads the weights — duration depends -on model size and network): +# Grab the admin key the chart generated for the dataplane +ADMIN_KEY=$(helm get values auth-apisix -n auth-apisix --all \ + | python3 -c "import sys,yaml; print(yaml.safe_load(sys.stdin)['apisix']['admin']['credentials']['admin'])") +echo "APISIX admin key: $ADMIN_KEY" -```bash -kubectl wait --for=condition=ready pod -l app=sglang --timeout=600s -kubectl logs -l app=sglang --tail=5 -# expect: INFO: Uvicorn running on http://0.0.0.0:30000 +# Create the GatewayProxy that the ingress controller will use as its +# dataplane handle. +kubectl apply -f - <-sglang 30000:30000 & -sleep 2 +openssl req -x509 -newkey rsa:2048 -nodes -days 365 \ + -keyout /tmp/tls.key -out /tmp/tls.crt \ + -subj "/CN=api.example.com/O=enterprise-inference-test" \ + -addext "subjectAltName=DNS:api.example.com" -curl -sS http://localhost:30000/v1/chat/completions \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "", - "messages": [{"role":"user","content":"In one sentence, what is deep learning?"}], - "max_tokens": 150, - "temperature": 0.3 - }' | python3 -m json.tool +kubectl create secret tls api.example.com \ + --cert=/tmp/tls.crt --key=/tmp/tls.key -n auth-apisix +kubectl create secret tls api.example.com \ + --cert=/tmp/tls.crt --key=/tmp/tls.key -n default ``` -### Auth-routed call (nginx → APISIX → Keycloak → sglang) +### S2.6 Hostname resolution for `${BASE_URL}` -Fetch a token from inside the cluster (so the `iss` claim matches what -APISIX validates against), then call through the ingress: +`generate-token.sh` and the inference curls in the deployment guide +both hit `https://${BASE_URL}/...` directly. In a production EI deployment +this is real DNS pointing at the load balancer; for a self-bootstrapped +cluster, an `/etc/hosts` entry is sufficient. 
```bash -TOKEN=$(kubectl run keycloak-tok --rm -i --restart=Never --quiet \ - --image=curlimages/curl:8.10.1 -- \ - sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ - -d "client_id=my-client-id" \ - -d "client_secret=" \ - -d "grant_type=client_credentials"' \ - | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") - -curl -sSk https://localhost:30443/-sglang/v1/chat/completions \ - -H "Host: api.example.com" \ - -H "Authorization: Bearer $TOKEN" \ - -H 'Content-Type: application/json' \ - -d '{ - "model": "", - "messages": [{"role":"user","content":"In one sentence, what is deep learning?"}], - "max_tokens": 150, - "temperature": 0.3 - }' | python3 -m json.tool +echo "127.0.0.1 api.example.com" | sudo tee -a /etc/hosts ``` -### API endpoints +nginx in this scenario is on NodePort 30443 (not 443), so set +`BASE_URL` with the port for the EI scripts to find it: -| Endpoint | Description | -|----------|-------------| -| `/v1/models` | List loaded models | -| `/v1/chat/completions` | OpenAI-compatible chat completions | -| `/v1/completions` | OpenAI-compatible text completions | -| `/health` | Liveness probe | +```bash +export BASE_URL=api.example.com:30443 +``` -## Configuration +### S2.7 Keycloak edge routes -### Key values +`generate-token.sh` issues HTTP requests against `${BASE_URL}` — against +`/realms/master/protocol/openid-connect/token`, `/admin/realms/...`, and +`/token`. On an EI cluster the Ansible playbooks publish all three behind +nginx; when bootstrapping by hand we publish them with two Ingresses below. 
-| Key | Default | Description | -|-----|---------|-------------| -| `image.repository` | `enterprise-inference/sglang` | Patched image (set to `lmsysorg/sglang` to use upstream, but bf16 inference will crash) | +```bash +# Pass-through Ingress for /realms/* and /admin/* +kubectl apply -f - < **Both scenarios converge here.** Whether your cluster came from the OPEA +> Ansible playbooks (Scenario 1) or from the k3s bootstrap above (Scenario 2), +> the image build and all subsequent steps are identical. + +```bash +git clone https://github.com/cld2labs/Enterprise-Inference.git +cd Enterprise-Inference +git checkout cld2labs/sglang-gpt-oss + +sudo bash core/helm-charts/sglang/image-build/build-and-import.sh +``` + +First run takes ~5–10 minutes. The script auto-detects the runtime: + +- **kubeadm + containerd** (OPEA Ansible-deployed clusters): builds via + `nerdctl` directly into containerd's `k8s.io` namespace. Installs + `buildkit` from upstream GitHub on demand if it isn't already present. +- **k3s**: installs `docker.io` on demand, builds, then + `docker save | k3s ctr images import -`. + +In both cases the image lands where kubelet pulls from. Verify with +whichever tool matches your runtime: + +```bash +# kubeadm / containerd +sudo nerdctl --namespace k8s.io images | grep enterprise-inference/sglang + +# k3s +sudo k3s ctr images ls | grep enterprise-inference/sglang +``` + +Either way the expected line is `enterprise-inference/sglang:v0.5.12-xeon-fix11-debug`. + +## Deploy a Model + +`modelSource` and `modelName` are required at install time. The chart +template fails fast if either is empty. 
+ +### Generic install + +```bash +helm install ./core/helm-charts/sglang \ + --set modelSource="" \ + --set modelName="" \ + --set huggingface.token="$HF_TOKEN" # only if the model is gated +``` + +### Model-specific recipes + +Models that need additional configuration ship with their own values file +and deployment guide: + +| Model | Deployment guide | +| ----- | ---------------- | +| `openai/gpt-oss-20b` | `third_party/Dell/model-deployment/gpt-oss-20b/deployment.md` | + +The deployment guide carries the full `helm install` command line for +that model — all model-specific flags (parsers, attention backend, +extraArgs) come through as `--set` overrides. The chart's own +`values.yaml` stays model-agnostic. + +Wait for the pod (first start downloads the weights — duration depends +on model size and network): + +```bash +kubectl wait --for=condition=ready pod -l app=sglang --timeout=600s +kubectl logs -l app=sglang --tail=5 +# expect: INFO: Uvicorn running on http://0.0.0.0:30000 +``` + +## Inference + +### Smoke test (no auth, via port-forward) + +```bash +kubectl port-forward svc/-sglang 30000:30000 & +sleep 2 + +curl -sS http://localhost:30000/v1/chat/completions \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "", + "messages": [{"role":"user","content":"In one sentence, what is deep learning?"}], + "max_tokens": 150, + "temperature": 0.3 + }' | python3 -m json.tool +``` + +### Auth-routed call (nginx → APISIX → Keycloak → sglang) + +Fetch a token from inside the cluster (so the `iss` claim matches what +APISIX validates against), then call through the ingress: + +```bash +TOKEN=$(kubectl run keycloak-tok --rm -i --restart=Never --quiet \ + --image=curlimages/curl:8.10.1 -- \ + sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ + -d "client_id=my-client-id" \ + -d "client_secret=" \ + -d "grant_type=client_credentials"' \ + | python3 -c "import json,sys; 
print(json.load(sys.stdin)['access_token'])") + +curl -sSk https://localhost:30443/-sglang/v1/chat/completions \ + -H "Host: api.example.com" \ + -H "Authorization: Bearer $TOKEN" \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "", + "messages": [{"role":"user","content":"In one sentence, what is deep learning?"}], + "max_tokens": 150, + "temperature": 0.3 + }' | python3 -m json.tool +``` + +### API endpoints + +| Endpoint | Description | +|----------|-------------| +| `/v1/models` | List loaded models | +| `/v1/chat/completions` | OpenAI-compatible chat completions | +| `/v1/completions` | OpenAI-compatible text completions | +| `/health` | Liveness probe | + +## Configuration + +### Key values + +| Key | Default | Description | +|-----|---------|-------------| +| `image.repository` | `enterprise-inference/sglang` | Patched image (set to `lmsysorg/sglang` to use upstream, but bf16 inference will crash) | | `image.tag` | `v0.5.12-xeon-fix11-debug` | Pinned to the validated build | | `image.pullPolicy` | `IfNotPresent` | Set to `Never` if the image is only in local containerd | | `modelSource` | _(required)_ | HuggingFace repo to load (chart fails to render if empty) | @@ -319,311 +641,3 @@ third_party/Dell/model-deployment/ - [SGLang documentation](https://docs.sglang.io) - [SGLang CPU server guide](https://docs.sglang.io/docs/hardware-platforms/cpu_server) - [OpenAI gpt-oss model card](https://huggingface.co/openai/gpt-oss-20b) - ---- - -## Appendix: From-Scratch Bootstrap - -An Enterprise Inference cluster brought up with the OPEA Ansible -playbooks already has k3s, nginx-ingress, APISIX, Keycloak, the -Keycloak edge routes, and the OIDC client provisioned for you — skip -this appendix entirely and go straight to **Build the Image**. The -deployment guide is the same regardless of how the cluster was -bootstrapped. 
- -This appendix produces the same cluster shape by hand for cases where -the Ansible playbooks haven't been run: k3s + nginx + Keycloak + APISIX -(with the GatewayProxy/IngressClass wiring), the TLS secret in both -namespaces that need it, a `KC_HOSTNAME`-pinned Keycloak, the -`/realms`, `/admin`, and `/token` edge Ingresses, and the -`my-client-id` OIDC client. After it runs, `generate-token.sh` and the -model deploy work identically to the OPEA flow. - -### A.1 k3s + Helm - -```bash -sudo bash scripts/bootstrap-k3s.sh -export KUBECONFIG=$HOME/.kube/config -kubectl get nodes -o wide -helm version --short -``` - -The script installs k3s (`--disable traefik`), symlinks `kubectl`, copies -kubeconfig to `~/.kube/config`, and installs Helm 3. - -### A.2 nginx-ingress - -```bash -helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx -helm install ingress-nginx ingress-nginx/ingress-nginx \ - -n ingress-nginx --create-namespace \ - --set controller.service.type=NodePort \ - --set controller.service.nodePorts.http=30080 \ - --set controller.service.nodePorts.https=30443 \ - --set controller.admissionWebhooks.enabled=false \ - --set controller.ingressClassResource.default=true - -kubectl wait --for=condition=ready pod -n ingress-nginx \ - -l app.kubernetes.io/component=controller --timeout=120s -``` - -### A.3 Keycloak (dev mode) - -```bash -kubectl apply -f - <<'EOF' -apiVersion: apps/v1 -kind: Deployment -metadata: { name: keycloak, namespace: default } -spec: - replicas: 1 - selector: { matchLabels: { app: keycloak } } - template: - metadata: { labels: { app: keycloak } } - spec: - containers: - - name: keycloak - image: quay.io/keycloak/keycloak:26.0 - args: ["start-dev"] - env: - - { name: KEYCLOAK_ADMIN, value: admin } - - { name: KEYCLOAK_ADMIN_PASSWORD, value: admin } - - { name: KC_HTTP_RELATIVE_PATH, value: "/" } - - { name: KC_PROXY_HEADERS, value: xforwarded } - # Pin the issuer hostname so tokens are always stamped with the - # 
cluster-internal name, no matter which edge hostname the - # request came in on. APISIX validates the `iss` claim against - # this hostname (chart's oidc.discovery default). - - { name: KC_HOSTNAME, value: "http://keycloak.default.svc.cluster.local" } - - { name: KC_HOSTNAME_STRICT, value: "false" } - - { name: KC_HOSTNAME_BACKCHANNEL_DYNAMIC, value: "false" } - ports: [{ containerPort: 8080, name: http }] ---- -apiVersion: v1 -kind: Service -metadata: { name: keycloak, namespace: default } -spec: - selector: { app: keycloak } - ports: [{ port: 80, targetPort: 8080 }] -EOF -kubectl wait --for=condition=ready pod -l app=keycloak --timeout=300s -``` - -Create the OIDC client. The `clientId` and `secret` here must exactly -match what you'll later pass to the chart via `--set oidc.clientId=...` -and `--set oidc.clientSecret=...`. The values below are the appendix -defaults used by the gpt-oss-20b deployment guide — substitute your -own for any non-test deployment. - -```bash -ADMIN=$(kubectl run kc-admin --rm -i --restart=Never --quiet \ - --image=curlimages/curl:8.10.1 -- \ - sh -c 'curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ - -d "client_id=admin-cli" -d "username=admin" -d "password=admin" -d "grant_type=password"' \ - | python3 -c "import json,sys; print(json.load(sys.stdin)['access_token'])") - -CLIENT_ID=my-client-id -CLIENT_SECRET=tf29wNR5fZ7edbNmnLSWDEvL7Simx4CR - -kubectl run kc-create --rm -i --restart=Never --quiet \ - --image=curlimages/curl:8.10.1 -- \ - sh -c "curl -sS -X POST -H 'Authorization: Bearer $ADMIN' \ - -H 'Content-Type: application/json' \ - http://keycloak.default.svc.cluster.local/admin/realms/master/clients \ - -d '{\"clientId\":\"${CLIENT_ID}\",\"secret\":\"${CLIENT_SECRET}\",\"serviceAccountsEnabled\":true,\"publicClient\":false,\"directAccessGrantsEnabled\":true}'" -``` - -Verify the client was created: - -```bash -kubectl run kc-check --rm -i --restart=Never --quiet \ - 
--image=curlimages/curl:8.10.1 -- \ - sh -c "curl -sS -X POST http://keycloak.default.svc.cluster.local/realms/master/protocol/openid-connect/token \ - -d 'client_id=${CLIENT_ID}' -d 'client_secret=${CLIENT_SECRET}' -d 'grant_type=client_credentials'" \ - | head -c 80 -# expect: JSON with "access_token":"..." -``` - -### A.4 APISIX - -The Apache APISIX chart installs the dataplane + etcd + ingress -controller. On v2 of the ingress controller (current as of this writing) -you additionally need a `GatewayProxy` CR and an `IngressClass` whose -`parameters` reference it — without those, the controller silently drops -every `ApisixRoute` and the chart's route ends up unreachable. - -```bash -helm repo add apisix https://charts.apiseven.com -helm install auth-apisix apisix/apisix \ - -n auth-apisix --create-namespace \ - --set service.type=NodePort \ - --set ingress-controller.enabled=true \ - --set ingress-controller.config.apisix.serviceNamespace=auth-apisix - -kubectl wait --for=condition=ready pod -n auth-apisix --all --timeout=300s - -# Grab the admin key the chart generated for the dataplane -ADMIN_KEY=$(helm get values auth-apisix -n auth-apisix --all \ - | python3 -c "import sys,yaml; print(yaml.safe_load(sys.stdin)['apisix']['admin']['credentials']['admin'])") -echo "APISIX admin key: $ADMIN_KEY" - -# Create the GatewayProxy that the ingress controller will use as its -# dataplane handle. -kubectl apply -f - < Date: Fri, 5 Jun 2026 12:30:25 -0500 Subject: [PATCH 20/20] cld2labs/sglang-gpt-oss: move APISIX timeout note to Step 5 in gpt-oss-20b deployment.md The kubectl patch apisixroute block was under Step 4 (Verify), but the timeout only surfaces when actually testing inference. Move the callout to Step 5 (Test), framed as a reaction to a 504 rather than a pre-emptive action, and link to the full fix in sglang-troubleshooting.md. 
Signed-off-by: arpannookala-12 --- .../Dell/model-deployment/gpt-oss-20b/deployment.md | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md index 2f4ca6b5..1a35686e 100644 --- a/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md +++ b/third_party/Dell/model-deployment/gpt-oss-20b/deployment.md @@ -123,15 +123,6 @@ NAME HOSTS sglang-gpt-oss-20b-apisixroute api.example.com ``` -The ApisixRoute has a default 60 s upstream timeout, which is shorter -than CPU inference at ~4 tokens/s can complete. Bump it before sending -real requests: - -```bash -kubectl patch apisixroute sglang-gpt-oss-20b-apisixroute --type='json' \ - -p='[{"op":"add","path":"/spec/http/0/timeout","value":{"connect":"5s","read":"600s","send":"600s"}}]' -``` - ## Step 5: Test the Deployed Model ```bash @@ -156,6 +147,8 @@ If successful, the model returns a chat-completion response with the answer in `choices[0].message.content` and the model's internal reasoning in `choices[0].message.reasoning_content`. +> If the request times out with a 504, CPU inference at ~4 tokens/s can exceed the default 60 s upstream timeout for longer responses. See [Gateway Timeout (504)](../sglang-troubleshooting.md#1-gateway-timeout-504-on-inference-requests) in the troubleshooting guide to bump both the nginx and APISIX timeouts. + ### A Note on `max_tokens` gpt-oss uses the Harmony chat format: every response starts in an