content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md (10 additions, 10 deletions)
@@ -1,5 +1,5 @@
---
-title: Discover the Grace Blackwell architecture
+title: Explore Grace Blackwell architecture for efficient quantized LLM inference
weight: 2

### FIXED, DO NOT MODIFY
@@ -8,24 +8,24 @@ layout: learningpathall

## Overview

-Explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads.
-
-The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.
+In this Learning Path you will explore the architecture and system design of NVIDIA DGX Spark, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads. The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.

The GB10 platform combines:
- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency
- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads
- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks

-This GB10 platform design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform for modern AI workloads, bringing powerful AI development capabilities to your desktop and letting you build and test AI models locally before scaling them to larger systems.
+This GB10 platform design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform that lets you build and test AI models locally before scaling them to larger systems.
+
+You can find out more about NVIDIA DGX Spark on the [NVIDIA website](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/).

-## Benefits of Grace Blackwell for quantized LLMs
+## Benefits of Grace Blackwell for quantized LLM inference

-Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip.
+Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip, which brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference.

-The Grace Blackwell architecture brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference.
+On Arm-based systems, quantized LLM inference is especially efficient because the Grace CPU delivers high single-thread performance and energy efficiency, while the Blackwell GPU accelerates matrix operations using Arm-optimized CUDA libraries. The unified memory architecture means you don't need to manually manage data movement between CPU and GPU, which is a common challenge on traditional x86-based platforms. This is particularly valuable when working with large models or running multiple inference tasks in parallel, as it reduces latency and simplifies development.

-###Grace Blackwell features and their impact on quantized LLMs
+## Grace Blackwell features and their impact on quantized LLMs

The table below shows how specific hardware features enable efficient quantized model inference:

@@ -37,7 +37,7 @@
| Unified 128 GB memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer |
| Energy-efficient Arm design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads |

-### Quantized LLM workflow
+## Overview of a typical quantized LLM workflow

In a typical quantized LLM workflow:
- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks
content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md (11 additions, 9 deletions)
@@ -6,13 +6,13 @@ weight: 3
layout: learningpathall
---

-## Overview
+## Set up your Grace Blackwell environment

Before building and running quantized LLMs on your DGX Spark, you need to verify that your system is fully prepared for AI workloads. This includes checking your Arm-based Grace CPU configuration, confirming your operating system, ensuring the Blackwell GPU and CUDA drivers are active, and validating that the CUDA toolkit is installed. These steps provide a solid foundation for efficient LLM inference and development on Arm.

This section is organized into three main steps: verifying your CPU, checking your operating system, and confirming your GPU and CUDA toolkit setup. You'll also find additional context and technical details throughout, should you wish to explore the platform's capabilities more deeply.

-## Step 1: check your CPU configuration
+## Step 1: Verify your CPU configuration

Before running LLM workloads, it's helpful to understand more about the CPU you're working with. The DGX Spark uses Arm-based Grace processors, which bring some unique advantages for AI inference.
@@ -82,13 +82,13 @@ Vulnerabilities:
Tsx async abort: Not affected
```

-Great! If you have seen this message your system is using Armv9 cores, which are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations.
+If you have seen this message, your system is using Armv9 cores, great! These are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations.

### Grace CPU specification

The following table provides more information about the key specifications of the Grace CPU and explains their relevance to quantized LLM inference:

-|**Category**|**Specification**|**Description / Impact for LLM Inference**|
+|**Category**|**Specification**|**Description/Impact for LLM Inference**|
| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions|
| Core Configuration | 20 cores total - 10× Cortex-X925 (performance) + 10× Cortex-A725 (efficiency) | Heterogeneous CPU design balancing high performance and power efficiency |
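The lscpu-style output shown in the hunk above is only the tail end of the check; the command that produces it sits outside this diff view. As a hedged illustration of the kind of verification this step refers to (not necessarily the exact invocation in the Learning Path), a sketch:

```bash
# Print CPU architecture details; on DGX Spark this should report
# "Architecture: aarch64" and the 20 Cortex-X925 / Cortex-A725 cores
# described in the table above.
lscpu

# Optionally confirm the Armv9 features the table calls out
# (sve2, bf16, and i8mm should appear in the flags list).
lscpu | grep -iE 'sve2|bf16|i8mm'
```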
@@ -101,6 +101,8 @@

Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities are what make it ideal for quantized LLM workloads, as these provide a power-efficient foundation for both CPU-only inference and CPU-GPU hybrid processing.

+### Verify OS
+
You can also verify the operating system running on your DGX Spark by using the following command:

```bash
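The body of the command block above is cut off in this diff view, so the exact check is not shown here. As a hedged placeholder (not necessarily the command the Learning Path uses), one generic way to read the OS release on Ubuntu:

```bash
# Print the OS identification fields; on DGX Spark this is expected to
# report Ubuntu 24.04 LTS.
cat /etc/os-release
```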
@@ -120,7 +122,7 @@

Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, so you can move on to the next step.

-## Step 2: verify the Blackwell GPU and driver
+## Step 2: Verify the Blackwell GPU and driver

After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads by using the following:
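The GPU check itself falls outside this hunk, but the next hunk header confirms the section explains `nvidia-smi` output. As a sketch of the check this step refers to:

```bash
# Query the GPU and driver; a healthy GB10 setup lists the Blackwell GPU,
# the installed driver version, and the CUDA version reported by the driver.
nvidia-smi
```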
@@ -174,7 +176,7 @@ The table below provides more explanation of the `nvidia-smi` output:
Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference.

-## Step 3: check the CUDA toolkit
+## Step 3: Check the CUDA toolkit

To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed.
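The toolkit check is not visible in this hunk. One common way to confirm a CUDA toolkit is installed and on the PATH (a sketch; the Learning Path may show a different command):

```bash
# The release line should report the expected toolkit version
# (CUDA 13 on this setup).
nvcc --version
```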
@@ -214,8 +216,8 @@

In this entire setup section, you have achieved the following:

-- Verified your Arm-based Grace CPU and its capabilities—you've confirmed that your system is running Armv9 cores with SVE2, BF16, and INT8 matrix multiplication support, which are perfect for quantized LLM inference
-- Confirmed your Blackwell GPU and CUDA driver are ready—the GB10 GPU is active, properly recognized, and set up with CUDA 13, so you're all set for GPU-accelerated workloads
-- Checked your operating system and CUDA toolkit—Ubuntu 24.04 LTS provides a solid foundation, and the CUDA compiler is installed and ready for building GPU-enabled inference tools
+- Verified your Arm-based Grace CPU and its capabilities by confirming that your system is running Armv9 cores with SVE2, BF16, and INT8 matrix multiplication support, which are perfect for quantized LLM inference
+- Confirmed your Blackwell GPU and CUDA driver are ready by seeing that the GB10 GPU is active, properly recognized, and set up with CUDA 13, so you're all set for GPU-accelerated workloads
+- Checked your operating system and CUDA toolkit - Ubuntu 24.04 LTS provides a solid foundation, and the CUDA compiler is installed and ready for building GPU-enabled inference tools

You're now ready to move on to building and running quantized LLMs on your DGX Spark. The next section walks you through compiling llama.cpp for both CPU and GPU, so you can start running AI inference on this platform.
content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md (15 additions, 18 deletions)
@@ -6,13 +6,9 @@ layout: "learningpathall"

## How do I build the GPU version of llama.cpp on GB10?

-In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment.
+In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, which is a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs. Llama.cpp is an open-source project by Georgi Gerganov that provides efficient and dependency-free large language model inference on both CPUs and GPUs.

-Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, which is a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs.
-
-llama.cpp is an open-source project by Georgi Gerganov that provides efficient and dependency-free large language model inference on both CPUs and GPUs.
-
-### Step 1: Preparation
+## Step 1: Install dependencies

In this step, you will install the necessary build tools and download a small quantized model for validation:

After the download completes, you'll find the models in the `~/models` directory.

+
+**Tip:** Always activate your Python virtual environment with `source venv/bin/activate` before installing packages or running Python-based tools. This ensures dependencies are isolated and prevents conflicts with system-wide packages.
{{% /notice %}}

Great! You’ve installed all the required build tools and downloaded a quantized model for validation. Your environment is ready for source code setup.
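The install and download commands themselves are elided from this diff view. As a heavily hedged sketch only, the package list, virtual-environment name, model repository, and file name below are assumptions, not the Learning Path's exact choices (the text elsewhere does reference `source venv/bin/activate`, the `~/models` directory, and a TinyLlama Q8_0 model):

```bash
# Illustrative only: typical build dependencies for llama.cpp
sudo apt update
sudo apt install -y build-essential cmake git

# Create and activate the Python virtual environment referenced by the tip above
python3 -m venv venv
source venv/bin/activate

# Download a small quantized GGUF model for validation (repository and file
# name are assumptions; use whichever model the Learning Path specifies)
pip install -U "huggingface_hub[cli]"
mkdir -p ~/models
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  tinyllama-1.1b-chat-v1.0.Q8_0.gguf --local-dir ~/models
```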
-###Step 2: Clone the llama.cpp repository
+## Step 2: Clone the llama.cpp repository

Use the commands below to download the source code for llama.cpp from GitHub:

@@ -55,7 +54,7 @@ cd ~/llama.cpp

Nice work! You now have the latest llama.cpp source code on your DGX Spark system.
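The clone commands are cut from this view between the sentence above and the next hunk. As a sketch (the upstream repository is the public llama.cpp project, and the destination path follows the `cd ~/llama.cpp` context line shown in the hunk header):

```bash
# Fetch the llama.cpp source into your home directory, then enter it
git clone https://github.com/ggerganov/llama.cpp.git ~/llama.cpp
cd ~/llama.cpp
```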
-###Step 3: Configure and Build the CUDA-Enabled Version (GPU Mode)
+## Step 3: Configure and build the CUDA-enabled version (GPU Mode)

Run the following `cmake` command to configure the build system for GPU acceleration:

@@ -73,13 +72,13 @@ cmake .. \
  -DCMAKE_CUDA_COMPILER=nvcc
```

-This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels
+This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels.

### Explanation of key flags:

Here's what each configuration flag does:

-|**Feature**|**Description / Impact**|
+|**Feature**|**Description/Impact**|
|--------------|------------------------------|
| -DGGML_CUDA=ON | Enables the CUDA backend in llama.cpp, allowing matrix operations and transformer layers to be offloaded to the GPU for acceleration|
| -DGGML_CUDA_F16=ON | Enables FP16 (half-precision) CUDA kernels, reducing memory usage and increasing throughput — especially effective for quantized models (for example, Q4, Q5) |
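Only the tail of the configure command is visible in the hunk above. Based on the flags named in the table and the visible `-DCMAKE_CUDA_COMPILER=nvcc` line, the configure-and-build sequence looks roughly like the following sketch (the Learning Path's full flag list may differ):

```bash
cd ~/llama.cpp
mkdir -p build && cd build

# Configure with the CUDA backend and FP16 CUDA kernels enabled,
# using nvcc from the installed CUDA 13 toolkit
cmake .. \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_F16=ON \
  -DCMAKE_CUDA_COMPILER=nvcc

# Compile using all available Grace CPU cores
cmake --build . --config Release -j"$(nproc)"
```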
@@ -119,11 +118,9 @@

These binaries provide all necessary tools for quantized model inference (llama-cli) and for serving GPU inference using HTTP API (llama-server).

-Excellent! The CUDA-enabled build is complete. Your binaries are optimized for the Blackwell GPU and ready for validation. You are now ready to test quantized LLMs with full GPU acceleration in the next step.
-
-Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility.
+Excellent! The CUDA-enabled build is complete. Your binaries are optimized for the Blackwell GPU and ready for validation. Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility. You are now ready to test quantized LLMs with full GPU acceleration in the next step.

-###Step 4: Validate the CUDA-enabled build
+## Step 4: Validate the CUDA-enabled build

After the build completes successfully, verify that the GPU-enabled binary of llama.cpp is correctly linked to the NVIDIA CUDA runtime.
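The validation command is not shown in this view. One quick, generic way to confirm CUDA linkage is to inspect the binary's shared-library dependencies (the binary path below assumes a default CMake layout and is an assumption, not taken from the Learning Path):

```bash
# The GPU-enabled binary should be dynamically linked against CUDA libraries;
# entries such as libcudart and libcublas should appear in the output.
ldd ~/llama.cpp/build/bin/llama-cli | grep -i 'cuda\|cublas'
```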
@@ -184,7 +181,7 @@
nvtop

This command displays GPU utilization, memory usage, temperature, and power consumption. You can use this to verify that CUDA kernels are active during model inference.

-The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark.
+The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark:

![nvtop Tinyllama](nvtop_tinyllama.png "Live GPU usage of TinyLlama")

@@ -196,12 +193,12 @@ The nvtop interface shows:

You have now successfully built and validated the CUDA-enabled version of llama.cpp on DGX Spark.

-## What have I achieved?
+## What you have accomplished

You have:
- Installed all required tools and dependencies
- Downloaded a quantized model for testing
- Built the CUDA-enabled version of llama.cpp
- Verified GPU linkage and successful inference

-You’re ready to move on to building and testing the CPU-only version! You will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference.
+You’re ready to move on to building and testing the CPU-only version. You will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference.

content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md (3 additions, 8 deletions)

@@ -11,9 +11,6 @@

## Configure and build the CPU-only version

-
-In this session, you will configure and build the CPU-only version of llama.cpp, optimized for the Armv9-based Grace CPU.
-
This build runs entirely on the Grace CPU (Arm Cortex-X925 and Cortex-A725), which supports advanced Armv9 vector extensions including SVE2, BFloat16, and I8MM, making it highly efficient for quantized inference workloads even without GPU acceleration.
To ensure a clean separation from the GPU build artifacts, start from a clean directory.

@@ -123,8 +120,8 @@ To monitor live CPU utilization and power metrics during inference, use `htop`:
htop
```

-The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement.
-](htop.png"TinyLlama CPU Utilization")
+The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement:
+![TinyLlama CPU Utilization](htop.png "TinyLlama CPU Utilization")

The `htop` interface shows:

@@ -145,6 +142,4 @@ In this section you have:
- Tested quantized model inference using the TinyLlama Q8_0 model.
- Used monitoring tools (htop) to confirm efficient CPU utilization.

-You have now successfully built and validated the CPU-only version of llama.cpp on the Grace CPU.
-
-In the next section, you will learn how to use the Process Watch tool to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU.
+You have now successfully built and validated the CPU-only version of llama.cpp on the Grace CPU. In the next section, you will learn how to use the Process Watch tool to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU.
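The CPU-only configure command referenced in the last file ("start from a clean directory") is not visible in this view. As a hedged sketch of that separate-build-directory approach (the directory name and flag set are assumptions; the Learning Path may add explicit Arm optimization flags):

```bash
cd ~/llama.cpp
mkdir -p build-cpu && cd build-cpu

# Configure without the CUDA backend so inference runs entirely on the
# Grace CPU's Armv9 cores (SVE2, BF16, and I8MM via the CPU backend)
cmake .. -DGGML_CUDA=OFF

# Build with all CPU cores
cmake --build . --config Release -j"$(nproc)"
```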