docs: add NVIDIA Jetson / aarch64 CUDA source-build guide (sm_87 / JetPack) #1929
Open
neil-the-nowledgeable wants to merge 1 commit into bitsandbytes-foundation:main from
Summary
The existing aarch64 CUDA installation matrix (`docs/source/installation.mdx`, lines ~55 and ~165) lists the supported compute capabilities as sm75, sm80, sm90 (CUDA 11.8–12.6) and sm75, sm80, sm90, sm100, sm110, sm120, sm121 (CUDA 12.8–13.0). NVIDIA Jetson Orin devices report sm_87, which is not covered by any published aarch64 wheel.

The upstream source of this gap is `.github/scripts/build-cuda.sh`, which hardcodes the aarch64 capability set. The CMake source itself does support sm_87 — it appears in `CMAKE_CUDA_ARCHITECTURES_ALL` for CUDA 11.8–12.9 (50 52 53 60 61 62 70 72 75 80 86 87 89 90). The restriction is purely a CI-wheel choice, not a source limitation. Users on Jetson Orin can therefore build from source with the right flag, and the library works correctly; this PR documents that path.

Users installing bitsandbytes on Jetson Orin Nano / Orin NX / AGX Orin hit a runtime failure on the first CUDA kernel launch.
This error appears at the first `quantize_4bit`/`dequantize_4bit` call, not at import — so a smoke test that only imports bitsandbytes looks healthy, and the failure surfaces in the first training backward pass.

This PR adds a short section to `docs/source/installation.mdx` (placed after the existing "ARM/aarch64" install section) that:

- documents the source build with `-DCOMPUTE_CAPABILITY=87`
- includes a `python -m bitsandbytes` verification snippet showing `Highest Compute Capability: (8, 7)` / `SUCCESS!`
- notes `adamw_bnb_8bit` instead

See the file diff for the full text.
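Because the failure only fires at the first kernel launch, an import-only check is not sufficient. The following is a hedged sketch of an op-level smoke test; it assumes the public `bitsandbytes.functional.quantize_4bit`/`dequantize_4bit` API and degrades to "skip" when torch, bitsandbytes, or a CUDA device is unavailable:

```python
def op_level_smoke_test() -> str:
    """Exercise a 4-bit round-trip instead of merely importing bitsandbytes.

    Returns "skip" when torch/bitsandbytes/CUDA are unavailable, "ok" when
    the kernels run, "kernel-error" on the launch failure described above.
    """
    try:
        import torch
        import bitsandbytes.functional as F  # import alone succeeds even without sm_87 kernels
    except Exception:
        return "skip"
    if not torch.cuda.is_available():
        return "skip"
    try:
        x = torch.randn(64, 64, device="cuda", dtype=torch.float16)
        q, state = F.quantize_4bit(x)   # first kernel launch happens here,
        F.dequantize_4bit(q, state)     # not at import time
        return "ok"
    except Exception:
        return "kernel-error"

print(op_level_smoke_test())
```

On a correctly built Jetson Orin wheel this should report "ok"; with a wheel that lacks sm_87 kernels it reports "kernel-error" at the quantize call, matching the failure mode described above.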
Behavioral validation
The source-built wheel has been validated beyond "compiles and imports." On Carroll-16, a 16-problem held-out logic-reasoning benchmark, three independent lines of evidence confirm the sm_87 kernels produce numerically correct outputs:
A comparison run (`load_in_4bit` differs) produced training losses within 0.4% (0.2963 vs 0.2951) and downstream benchmark scores within single-problem noise — quantization is behaviorally equivalent to bf16 training.

Why this matters
COMPUTE_CAPABILITY=87."

Not in scope
This PR is docs-only. A follow-up CI task to add sm_87 to the aarch64 wheel matrix would eliminate the need for users to source-build, but that is infrastructure work for the maintainers. A complementary issue tracks that request.
Reviewer checklist
- Does the new section in `docs/source/installation.mdx` fit your docs structure?
- Is the CUDA path (`/usr/local/cuda-12.6/bin`) canonical for JetPack, or should this be noted as platform-specific?
- Would you prefer a separate `docs/source/installation_jetson.mdx` page vs. the inline subsection?