---
title: Explore Grace Blackwell architecture for efficient quantized LLM inference
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Overview

In this Learning Path, you will explore the architecture and system design of NVIDIA DGX Spark, a next-generation Arm-based CPU-GPU hybrid platform for large-scale AI workloads. DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. At its heart, the NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.

The GB10 platform combines:
- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency
- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads
- A 128 GB unified memory subsystem, connected via NVLink-C2C, enabling the CPU and GPU to share the same address space and eliminating data-transfer bottlenecks

This design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform that lets you build and test AI models locally before scaling them to larger systems.

You can find out more about DGX Spark on the [NVIDIA website](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/).

## Benefits of Grace Blackwell for quantized LLM inference

Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit directly from the hybrid architecture of the Grace Blackwell Superchip. The unified CPU-GPU design eliminates traditional data-movement bottlenecks while providing specialized compute capabilities for the different stages of inference.
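
To make the benefit concrete, here is a rough back-of-the-envelope footprint estimate, assuming a Q4_K_M-style format at about 4.5 bits per weight (a commonly cited figure; exact sizes vary with the quantization mix):

```latex
\text{weight footprint} \approx N_{\text{params}} \times \frac{\text{bits per weight}}{8}\ \text{bytes}
% FP16:   8 \times 10^{9} \times 16/8  \approx 16\ \text{GB}
% Q4_K_M: 8 \times 10^{9} \times 4.5/8 \approx 4.5\ \text{GB}
```

At roughly 4.5 GB, a quantized 8B model occupies well under five percent of the 128 GB unified memory pool, leaving ample headroom for the KV cache and for running several models side by side.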

On Arm-based systems, quantized LLM inference is especially efficient: the Grace CPU delivers high single-thread performance and energy efficiency, while the Blackwell GPU accelerates matrix operations using Arm-optimized CUDA libraries. Because the memory is unified, you don't need to manually manage data movement between CPU and GPU, a common challenge on traditional x86 platforms with discrete GPUs. This is particularly valuable when working with large models or running multiple inference tasks in parallel, as it reduces latency and simplifies development.
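
One consequence of unified memory is easy to observe from the command line. The commands below are a minimal check (exact output varies by driver version): the OS sees the full shared pool, and because that same pool backs the GPU, `nvidia-smi` may report dedicated-memory fields as `[N/A]` or `Not Supported` rather than a separate VRAM size:

```bash
# Total system memory visible to the Grace CPU (the shared 128 GB pool)
free -g

# How the driver reports GPU memory; on a unified-memory system there is
# no separate VRAM pool, so this field may show [N/A]
nvidia-smi --query-gpu=name,memory.total --format=csv
```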

## Grace Blackwell features and their impact on quantized LLMs

The table below shows how specific hardware features enable efficient quantized model inference:

| **Feature** | **Impact on quantized LLMs** |
|--------------|------------------------------|
| Grace CPU (Arm Cortex-X925 / A725) | Handles token orchestration, memory paging, and lightweight inference efficiently with high instructions per cycle (IPC) |
| Blackwell GPU (CUDA 13, FP4/FP8 Tensor Cores) | Provides massive parallelism and precision flexibility, ideal for accelerating 4-bit or 8-bit quantized transformer layers |
| High bandwidth and low latency | NVLink-C2C delivers 900 GB/s of bidirectional bandwidth, enabling synchronized CPU-GPU workloads |
| Unified 128 GB memory (NVLink-C2C) | CPU and GPU share the same memory space, so quantized model weights can be accessed without explicit data transfer |
| Energy-efficient Arm design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads |
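
As a quick sanity check, you can confirm that the Armv9 vector features behind the CPU row of this table are exposed on your system. This is a minimal sketch using standard Linux tooling; the flags it searches for (SVE2, the INT8 matrix-multiply extension, and BF16) are what quantized math kernels typically rely on:

```bash
# Look for the Armv9 features used by quantized math kernels:
# sve2 (scalable vectors), i8mm (INT8 matrix multiply), bf16 (bfloat16)
lscpu | grep -o -E 'sve2|i8mm|bf16' | sort -u
```

If all three flags are printed, the Grace CPU can run vectorized quantized kernels efficiently.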

## A typical quantized LLM workflow

In a typical quantized LLM workflow:
- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks
- The Blackwell GPU executes the transformer layers using quantized matrix multiplications for optimal throughput
- Unified memory allows models like Qwen2-7B or LLaMA3-8B (Q4_K_M) to fit directly into the shared memory space, reducing copy overhead and enabling near-real-time inference (see the sketch after this list)
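
As an illustration, here is what such a run might look like with a llama.cpp-style command-line runner, which you will build later in this Learning Path. The binary path, model filename, and prompt below are placeholders, so adjust them to your own build and model directory; `-ngl 999` asks the runner to offload all transformer layers to the Blackwell GPU while the Grace CPU handles tokenization and scheduling:

```bash
# Hypothetical paths and model file - adjust to your environment.
# The Q4_K_M weights sit in the unified 128 GB pool, so offloading
# layers to the GPU does not trigger a separate copy into VRAM.
./build/bin/llama-cli \
  -m models/llama-3-8b-instruct.Q4_K_M.gguf \
  -ngl 999 \
  -p "Summarize the benefits of unified memory for LLM inference."
```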

Together, these features make the GB10 a developer-grade AI laboratory for running, profiling, and scaling quantized LLMs efficiently in a desktop form factor.