
docs: add NVIDIA Jetson / aarch64 CUDA source-build guide (sm_87 / JetPack)#1929

Open
neil-the-nowledgeable wants to merge 1 commit into bitsandbytes-foundation:main from neil-the-nowledgeable:docs-jetson-sm87

Conversation


neil-the-nowledgeable commented Apr 22, 2026

Summary

The existing aarch64 CUDA installation matrix (docs/source/installation.mdx lines ~55, ~165) lists supported compute capabilities as sm75, sm80, sm90 (CUDA 11.8-12.6) and sm75, sm80, sm90, sm100, sm110, sm120, sm121 (CUDA 12.8-13.0). NVIDIA Jetson Orin devices report sm_87, which is not covered by any published aarch64 wheel.

The upstream source of this gap is .github/scripts/build-cuda.sh, which hardcodes the aarch64 capability set:

elif [ "${build_arch}" = "aarch64" ]; then
    build_capability="75;80;90"  # sm_87 not included

The CMake source itself does support sm_87 — it appears in CMAKE_CUDA_ARCHITECTURES_ALL for CUDA 11.8–12.9 (50 52 53 60 61 62 70 72 75 80 86 87 89 90). The restriction is purely a CI-wheel choice, not a source limitation. Users on Jetson Orin can therefore source-build with the right flag and the library works correctly; this PR documents that path.
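The gap described above can be illustrated with a small self-contained check. The two capability strings below are quoted verbatim from this PR's summary (the aarch64 set in `.github/scripts/build-cuda.sh` and `CMAKE_CUDA_ARCHITECTURES_ALL` for CUDA 11.8–12.9); this is a sketch of the mismatch, not an inspection of the live repository:

```python
# Capability sets as quoted in this PR summary (assumed verbatim from the repo):
wheel_caps = set("75;80;90".split(";"))  # .github/scripts/build-cuda.sh, aarch64 branch
cmake_all = set("50 52 53 60 61 62 70 72 75 80 86 87 89 90".split())  # CMAKE_CUDA_ARCHITECTURES_ALL

# Jetson Orin reports sm_87: absent from the published wheel set,
# present in the CMake-supported set, hence the source-build path.
print("87" in wheel_caps)  # wheel coverage: False
print("87" in cmake_all)   # source-build coverage: True
```

This makes the PR's point concrete: the restriction lives in the CI wheel matrix, not in what the source can compile.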

Users installing bitsandbytes on Jetson Orin Nano / Orin NX / AGX Orin hit a runtime failure on the first CUDA kernel launch:

RuntimeError: Error named symbol not found at line 233 in file /src/csrc/ops.cu

This error surfaces at the first quantize_4bit / dequantize_4bit call, not at import, so a smoke test that only imports bitsandbytes looks healthy and the failure appears in the first training backward pass.

This PR adds a short section to docs/source/installation.mdx (placed after the existing "ARM/aarch64" install section) that:

  • Names the failure mode so users searching for the error find a canonical answer
  • Documents the source-build recipe with -DCOMPUTE_CAPABILITY=87
  • Includes JetPack → CUDA version mapping (6.0/6.1 → 12.2; 6.2 → 12.6)
  • Provides a python -m bitsandbytes verification snippet showing Highest Compute Capability: (8, 7) / SUCCESS!
  • Documents a known paged-optimizer limitation on Jetson (per Hackster.io tutorial) and points users to adamw_bnb_8bit instead
  • Documents a profile-memory limitation on 8 GB Orin Nano Super (torch.profiler full-detail config can trigger device reboot)
  • Provides a measured memory-envelope reference table at 1B/3B/4B
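The JetPack → CUDA mapping in the list above can be sketched as a small lookup; the version pairs are taken from this PR's summary (6.0/6.1 → 12.2, 6.2 → 12.6) and the helper name `cuda_for_jetpack` is hypothetical, for illustration only:

```python
# JetPack release → bundled CUDA toolkit version, as documented in the new section.
# Treat this as a convenience lookup for the releases named in the PR,
# not an exhaustive table of all JetPack versions.
JETPACK_CUDA = {"6.0": "12.2", "6.1": "12.2", "6.2": "12.6"}

def cuda_for_jetpack(jetpack: str) -> str:
    """Return the CUDA version bundled with a given JetPack release."""
    try:
        return JETPACK_CUDA[jetpack]
    except KeyError:
        raise ValueError(f"JetPack {jetpack!r} is not covered by this guide") from None

print(cuda_for_jetpack("6.2"))  # 12.6
```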

See the file diff for the full text.

Behavioral validation

The source-built wheel has been validated beyond "compiles and imports." On Carroll-16, a 16-problem held-out logic-reasoning benchmark, three independent lines of evidence confirm the sm_87 kernels produce numerically correct outputs:

  1. 4-bit NF4 base inference at 1B (TinyLlama 1.1B) scored within 1 problem (of 16) of the same model's Ollama Q4_K_M reference, showing quantization preserves base reasoning at 1B.
  2. Same-stack 4-bit QLoRA vs bf16 LoRA training at matched config (seed, data, hyperparameters; only load_in_4bit differs) produced training losses within 0.4% (0.2963 vs 0.2951) and downstream benchmark scores within single-problem noise — quantization is behaviorally equivalent to bf16 training.
  3. 4-bit NF4 base inference at 3B (Qwen2.5-3B-Instruct) scored 93.75% keyword (15/16) and 0.418 judge composite on Carroll-16, the highest non-reasoning-model result tested. This confirms the kernels produce correct outputs across two model families at different scales.
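The "within 0.4%" claim in point 2 follows directly from the two quoted losses; the arithmetic can be checked in a few lines (loss values taken from this PR, nothing else assumed):

```python
# Training losses quoted in point 2: QLoRA (load_in_4bit) vs bf16 LoRA,
# matched seed/data/hyperparameters.
qlora_loss, bf16_loss = 0.2963, 0.2951

rel_gap = abs(qlora_loss - bf16_loss) / bf16_loss
print(f"{rel_gap:.2%}")  # ~0.41%, which rounds to the quoted 0.4%
```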

Why this matters

  • "Does bitsandbytes work on Jetson?" is a recurring question in community forums with no canonical answer. The current docs imply it is unsupported; the reality is "yes, source-build with COMPUTE_CAPABILITY=87."
  • Jetson is widely used for edge-ML deployments (vehicles, monitoring stations, robotics). QLoRA on Jetson-class hardware opens meaningful on-device fine-tuning for domain-adapted deployment, with real applications for latency-sensitive and offline-capable edge LLMs.
  • The source-build path has been behaviorally validated end-to-end across two model families and two scales (see "Behavioral validation" above), not just "compiles and imports."

Not in scope

This PR is docs-only. A follow-up CI task to add sm_87 to the aarch64 wheel matrix would eliminate the need for users to source-build, but that is infrastructure work for the maintainers. A complementary issue tracks that request.

Reviewer checklist

  • Does the placement in docs/source/installation.mdx fit your docs structure?
  • Is the suggested CUDA toolchain path (/usr/local/cuda-12.6/bin) canonical for JetPack or should this be noted as platform-specific?
  • Would you prefer a separate docs/source/installation_jetson.mdx page vs. the inline subsection?

