Commit d474a5b

Merge pull request #2502 from madeline-underwood/dgx
Dgx_JA to review
2 parents 3d0c227 + c0f4e1a

6 files changed: +381 -322 lines changed
Lines changed: 24 additions & 216 deletions

@@ -1,241 +1,49 @@
 ---
-title: Verify Grace Blackwell system readiness for AI inference
+title: Explore Grace Blackwell architecture for efficient quantized LLM inference
 weight: 2
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Introduction to Grace Blackwell architecture
+## Overview
 
-In this session, you will explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads.
+In this Learning Path you will explore the architecture and system design of NVIDIA DGX Spark, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads. The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.
 
-You will also perform hands-on verification steps to ensure your DGX Spark environment is properly configured for subsequent GPU-accelerated LLM sessions.
+The GB10 platform combines:
+- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency
+- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads
+- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks
 
-The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop.
-The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.
+This GB10 platform design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform that lets you build and test AI models locally before scaling them to larger systems.
 
-The NVIDIA Grace Blackwell DGX Spark (GB10) platform combines:
-- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency.
+You can find out more about NVIDIA DGX Spark on the [NVIDIA website](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/).
 
-- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads.
-- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks.
+## Benefits of Grace Blackwell for quantized LLM inference
 
-This design delivers up to one petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision.
-DGX Spark is a compact yet powerful development platform for modern AI workloads.
+Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip, which brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference.
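
As a rough sense of scale, a 7B-parameter model stored at FP16 needs about 7 billion × 2 bytes ≈ 14 GB for the weights alone, while a Q4_K_M build at roughly 4.5 bits per weight needs only about 4-5 GB. These figures are approximate and vary by quantization scheme, but they show why several quantized models can sit side by side in the 128 GB unified memory.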
 
-DGX Spark represents a major step toward NVIDIA's vision of AI Everywhere, empowering developers to prototype, fine-tune, and deploy large-scale AI models locally while seamlessly connecting to cloud or data-center environments when needed.
+On Arm-based systems, quantized LLM inference is especially efficient because the Grace CPU delivers high single-thread performance and energy efficiency, while the Blackwell GPU accelerates matrix operations using Arm-optimized CUDA libraries. The unified memory architecture means you don't need to manually manage data movement between CPU and GPU, which is a common challenge on traditional x86-based platforms. This is particularly valuable when working with large models or running multiple inference tasks in parallel, as it reduces latency and simplifies development.
 
-### Why Grace Blackwell for quantized LLMs?
+## Grace Blackwell features and their impact on quantized LLMs
 
-Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip.
+The table below shows how specific hardware features enable efficient quantized model inference:
 
-| **Feature** | **Impact on Quantized LLMs** |
+| **Feature** | **Impact on quantized LLMs** |
 |--------------|------------------------------|
-| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high IPC (instructions per cycle). |
-| Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers. |
-| High Bandwidth + Low Latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU-GPU workloads. |
-| Unified 128 GB Memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer. |
-| Energy-Efficient Arm Design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads. |
+| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high instructions per cycle (IPC) |
+| Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers |
+| High bandwidth and low latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU-GPU workloads |
+| Unified 128 GB memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer |
+| Energy-efficient Arm design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads |
 
+## Overview of a typical quantized LLM workflow
 
 In a typical quantized LLM workflow:
-- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks.
-- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput.
-- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space, reducing copy overhead and enabling near-real-time inference.
+- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks
+- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput
+- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space, reducing copy overhead and enabling near-real-time inference
 
 Together, these features make the GB10 a developer-grade AI laboratory for running, profiling, and scaling quantized LLMs efficiently in a desktop form factor.
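
To make the division of labor concrete, here is a minimal sketch of such a run using llama.cpp. The model path is a placeholder, and `-ngl 99` simply asks llama.cpp to offload as many transformer layers as possible to the GPU:

```bash
# Illustrative example: the Grace CPU tokenizes and orchestrates while the
# Blackwell GPU executes the offloaded quantized transformer layers.
./llama-cli -m models/qwen2-7b-instruct-q4_k_m.gguf \
            -ngl 99 \
            -p "Explain unified memory in one sentence."
```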
 
 
-### Inspecting your GB10 environment
-
-Let's verify that your DGX Spark system is configured and ready for building and running quantized LLMs.
-
-#### Step 1: Check CPU information
-
-Run the following command to print the CPU information:
-
-```bash
-lscpu
-```
-
-Expected output:
-
-```output
-Architecture: aarch64
-CPU op-mode(s): 64-bit
-Byte Order: Little Endian
-CPU(s): 20
-On-line CPU(s) list: 0-19
-Vendor ID: ARM
-Model name: Cortex-X925
-Model: 1
-Thread(s) per core: 1
-Core(s) per socket: 10
-Socket(s): 1
-Stepping: r0p1
-CPU(s) scaling MHz: 89%
-CPU max MHz: 4004.0000
-CPU min MHz: 1378.0000
-BogoMIPS: 2000.00
-Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
-Model name: Cortex-A725
-Model: 1
-Thread(s) per core: 1
-Core(s) per socket: 10
-Socket(s): 1
-Stepping: r0p1
-CPU(s) scaling MHz: 99%
-CPU max MHz: 2860.0000
-CPU min MHz: 338.0000
-BogoMIPS: 2000.00
-Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti ecv afp wfxt
-Caches (sum of all):
-L1d: 1.3 MiB (20 instances)
-L1i: 1.3 MiB (20 instances)
-L2: 25 MiB (20 instances)
-L3: 24 MiB (2 instances)
-NUMA:
-NUMA node(s): 1
-NUMA node0 CPU(s): 0-19
-Vulnerabilities:
-Gather data sampling: Not affected
-Itlb multihit: Not affected
-L1tf: Not affected
-Mds: Not affected
-Meltdown: Not affected
-Mmio stale data: Not affected
-Reg file data sampling: Not affected
-Retbleed: Not affected
-Spec rstack overflow: Not affected
-Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
-Spectre v1: Mitigation; __user pointer sanitization
-Spectre v2: Not affected
-Srbds: Not affected
-Tsx async abort: Not affected
-```
-
-The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations.
-
-The following table summarizes the key specifications of the Grace CPU and explains their relevance to quantized LLM inference.
-
-| **Category** | **Specification** | **Description / Impact for LLM Inference** |
-|---------------|-------------------|---------------------------------------------|
-| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions. |
-| Core Configuration | 20 cores total: 10× Cortex-X925 (Performance) + 10× Cortex-A725 (Efficiency) | Heterogeneous CPU design balancing high performance and power efficiency. |
-| Threads per Core | 1 | Optimized for deterministic scheduling and predictable latency. |
-| Clock Frequency | Up to **4.0 GHz** (Cortex-X925)<br>Up to **2.86 GHz** (Cortex-A725) | High per-core speed ensures strong single-thread inference for token orchestration. |
-| Cache Hierarchy | L1: 1.3 MiB × 20<br>L2: 25 MiB × 20<br>L3: 24 MiB × 2 | Large shared L3 cache enhances data locality for multi-threaded inference workloads. |
-| Instruction Set Features | SVE / SVE2, BF16, I8MM, AES, SHA3, SM4, CRC32 | Vector and mixed-precision instructions accelerate quantized (Q4/Q8) math operations. |
-| NUMA Topology | Single NUMA node (node0: 0-19) | Simplifies memory access patterns for unified memory workloads. |
-| Security & Reliability | Not affected by Meltdown, Spectre, Retbleed, or similar vulnerabilities | Ensures stable and secure operation for long-running inference tasks. |
-
-Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities make it ideal for quantized LLM workloads, providing a power-efficient foundation for both CPU-only inference and CPU-GPU hybrid processing.
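
If you want a quick way to confirm these vector features, you can filter the `lscpu` flags directly. This one-liner is a suggested convenience check, not part of the original verification steps:

```bash
# Show only the quantization-relevant CPU features (SVE2, BF16, INT8 matmul)
lscpu | grep -oE 'sve2|bf16|i8mm' | sort -u
```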
-
-You can also verify the operating system running on your DGX Spark by using the following command:
-
-```bash
-lsb_release -a
-```
-
-Expected output:
-
-```log
-No LSB modules are available.
-Distributor ID: Ubuntu
-Description: Ubuntu 24.04.3 LTS
-Release: 24.04
-Codename: noble
-```
-As shown above, DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendly Linux distribution.
-It provides excellent compatibility with AI frameworks, compiler toolchains, and system utilities, making it an ideal environment for building and deploying quantized LLM workloads.
-
-#### Step 2: Verify Blackwell GPU and driver
-
-After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads.
-
-```bash
-nvidia-smi
-```
-
-You will see output similar to:
-
-```output
-Wed Oct 22 09:26:54 2025
-+-----------------------------------------------------------------------------------------+
-| NVIDIA-SMI 580.95.05 Driver Version: 580.95.05 CUDA Version: 13.0 |
-+-----------------------------------------+------------------------+----------------------+
-| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
-| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
-| | | MIG M. |
-|=========================================+========================+======================|
-| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
-| N/A 32C P8 4W / N/A | Not Supported | 0% Default |
-| | | N/A |
-+-----------------------------------------+------------------------+----------------------+
-
-+-----------------------------------------------------------------------------------------+
-| Processes: |
-| GPU GI CI PID Type Process name GPU Memory |
-| ID ID Usage |
-|=========================================================================================|
-| 0 N/A N/A 3094 G /usr/lib/xorg/Xorg 43MiB |
-| 0 N/A N/A 3172 G /usr/bin/gnome-shell 16MiB |
-+-----------------------------------------------------------------------------------------+
-```
-
-The `nvidia-smi` tool reports GPU hardware specifications and provides valuable runtime information, including driver status, temperature, power usage, and GPU utilization. This information helps verify that the system is ready for AI workloads.
-
-The table below provides more explanation of the `nvidia-smi` output:
-
-| **Category** | **Specification (from nvidia-smi)** | **Description / Impact for LLM Inference** |
-|---------------|--------------------------------------|---------------------------------------------|
-| GPU Name | NVIDIA GB10 | Confirms the system recognizes the Blackwell GPU integrated into the Grace Blackwell Superchip. |
-| Driver Version | 580.95.05 | Indicates that the system is running the latest driver package required for CUDA 13 compatibility. |
-| CUDA Version | 13.0 | Confirms that the CUDA runtime supports GB10 (sm_121) and is ready for accelerated quantized LLM workloads. |
-| Architecture / Compute Capability | Blackwell (sm_121) | Supports FP4, FP8, and BF16 Tensor Core operations optimized for LLMs. |
-| Memory | Unified 128 GB LPDDR5X (shared with CPU via NVLink-C2C) | Enables zero-copy data access between Grace CPU and GPU for a unified inference memory space. |
-| Power & Thermal Status | ~4 W at idle, 32°C temperature | Confirms the GPU is powered on and thermally stable while idle. |
-| GPU Utilization | 0% (idle) | Indicates no active compute workloads; the GPU is ready for new inference jobs. |
-| Memory Usage | Not Supported (headless GPU configuration) | DGX Spark operates in headless compute mode; display memory metrics may not be exposed. |
-| Persistence Mode | On | Ensures the GPU remains initialized and ready for rapid inference startup. |
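
For scripted health checks, `nvidia-smi` can also report just the fields you care about. The query below is a suggested convenience, not a required step:

```bash
# Print a compact CSV summary of GPU identity and state
nvidia-smi --query-gpu=name,driver_version,temperature.gpu,utilization.gpu --format=csv
```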
-
-
-#### Step 3: Check CUDA Toolkit
-
-To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed.
-
-The `nvcc --version` command confirms that the CUDA compiler is available and compatible with CUDA 13.
-This ensures that CMake can correctly detect and compile the GPU-accelerated components.
-
-```bash
-nvcc --version
-```
-
-You will see output similar to:
-
-```output
-nvcc: NVIDIA (R) Cuda compiler driver
-Copyright (c) 2005-2025 NVIDIA Corporation
-Built on Wed_Aug_20_01:57:39_PM_PDT_2025
-Cuda compilation tools, release 13.0, V13.0.88
-Build cuda_13.0.r13.0/compiler.36424714_0
-```
-
-{{% notice Note %}}
-The nvcc compiler is required only during the CUDA-enabled build process; it is not needed at runtime for inference.
-{{% /notice %}}
-
-This confirms that the CUDA 13 toolkit is installed and ready for GPU compilation.
-If the command is missing or reports an older version (e.g., 12.x), you should update to CUDA 13.0 or later to ensure compatibility with the Blackwell GPU (sm_121).
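
Beyond `nvcc --version`, you can optionally confirm that the toolchain produces a working binary for the Blackwell target. This sketch assumes the `sm_121` architecture mentioned above; the file name and program are illustrative only:

```bash
# Compile a trivial CUDA program for GB10 (sm_121) and run it
cat > gpu_check.cu <<'EOF'
#include <cstdio>
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);        // number of visible CUDA devices
    printf("CUDA devices found: %d\n", count);
    return count > 0 ? 0 : 1;          // non-zero exit if no GPU is visible
}
EOF
nvcc -arch=sm_121 gpu_check.cu -o gpu_check && ./gpu_check
```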
-
-At this point, you have verified that:
-- The Grace CPU (Arm Cortex-X925 / A725) is correctly recognized and supports Armv9 extensions.
-- The Blackwell GPU is active with driver 580.95.05 and CUDA 13 runtime.
-- The CUDA toolkit 13.0 is available for building the GPU-enabled version of llama.cpp.
-
-Your DGX Spark environment is now fully prepared for the next section, where you will build and configure both CPU and GPU versions of llama.cpp, laying the foundation for running quantized LLMs efficiently on the Grace Blackwell platform.
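
As a preview, a CUDA-enabled llama.cpp build typically takes the shape sketched below; treat the next section as the authoritative steps, since the repository location and CMake flags can change over time:

```bash
# Suggested shape of the GPU-enabled build (details follow in the next section)
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B build -DGGML_CUDA=ON   # enable the CUDA backend
cmake --build build -j"$(nproc)"
```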
