
Commit c0f4e1a

Final tweaks
1 parent c8a2a36 commit c0f4e1a

6 files changed (+51, -61 lines)


content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md

Lines changed: 10 additions & 10 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: Discover the Grace Blackwell architecture
2+
title: Explore Grace Blackwell architecture for efficient quantized LLM inference
33
weight: 2
44

55
### FIXED, DO NOT MODIFY
@@ -8,24 +8,24 @@ layout: learningpathall
88

99
## Overview
1010

11-
Explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads.
12-
13-
The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.
11+
In this Learning Path, you will explore the architecture and system design of NVIDIA DGX Spark, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads. The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.
1412

1513
The GB10 platform combines:
1614
- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency
1715
- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads
1816
- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks
1917

20-
This GB10 platform design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform for modern AI workloads, bringing powerful AI development capabilities to your desktop and letting you build and test AI models locally before scaling them to larger systems.
18+
This GB10 platform design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform that lets you build and test AI models locally before scaling them to larger systems.
19+
20+
You can find out more about NVIDIA DGX Spark on the [NVIDIA website](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/).
2121

22-
## Benefits of Grace Blackwell for quantized LLMs
22+
## Benefits of Grace Blackwell for quantized LLM inference
2323

24-
Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip.
24+
Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip, which brings several key advantages to these workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference.
2525

26-
The Grace Blackwell architecture brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference.
26+
On Arm-based systems, quantized LLM inference is especially efficient because the Grace CPU delivers high single-thread performance and energy efficiency, while the Blackwell GPU accelerates matrix operations using Arm-optimized CUDA libraries. The unified memory architecture means you don't need to manually manage data movement between CPU and GPU, which is a common challenge on traditional x86-based platforms. This is particularly valuable when working with large models or running multiple inference tasks in parallel, as it reduces latency and simplifies development.
2727

28-
### Grace Blackwell features and their impact on quantized LLMs
28+
## Grace Blackwell features and their impact on quantized LLMs
2929

3030
The table below shows how specific hardware features enable efficient quantized model inference:
3131

@@ -37,7 +37,7 @@ The table below shows how specific hardware features enable efficient quantized
3737
| Unified 128 GB memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer |
3838
| Energy-efficient Arm design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads |
3939

40-
### Quantized LLM workflow
40+
## Overview of a typical quantized LLM workflow
4141

4242
In a typical quantized LLM workflow:
4343
- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks

content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md

Lines changed: 11 additions & 9 deletions
@@ -6,13 +6,13 @@ weight: 3
66
layout: learningpathall
77
---
88

9-
## Overview
9+
## Set up your Grace Blackwell environment
1010

1111
Before building and running quantized LLMs on your DGX Spark, you need to verify that your system is fully prepared for AI workloads. This includes checking your Arm-based Grace CPU configuration, confirming your operating system, ensuring the Blackwell GPU and CUDA drivers are active, and validating that the CUDA toolkit is installed. These steps provide a solid foundation for efficient LLM inference and development on Arm.
1212

1313
This section is organized into three main steps: verifying your CPU, checking your operating system, and confirming your GPU and CUDA toolkit setup. You'll also find additional context and technical details throughout, should you wish to explore the platform's capabilities more deeply.
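For orientation, the checks in this section map onto a handful of standard Linux commands, shown below as a quick sketch. The steps that follow walk through each check in detail, and the exact commands and output on your system may differ slightly:

```bash
lscpu                 # CPU architecture, core count, and feature flags (Step 1)
cat /etc/os-release   # Operating system name and version (Step 1)
nvidia-smi            # GPU model, driver version, and CUDA runtime (Step 2)
nvcc --version        # CUDA toolkit compiler (Step 3)
```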
1414

15-
## Step 1: check your CPU configuration
15+
## Step 1: Verify your CPU configuration
1616

1717
Before running LLM workloads, it's helpful to understand more about the CPU you're working with. The DGX Spark uses Arm-based Grace processors, which bring some unique advantages for AI inference.
1818

@@ -82,13 +82,13 @@ Vulnerabilities:
8282
Tsx async abort: Not affected
8383
```
8484

85-
Great! If you have seen this message your system is using Armv9 cores, which are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations.
85+
If you see this output, great! Your system is using Armv9 cores, which are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it well suited to quantized LLM inference and tensor operations.
8686

8787
### Grace CPU specification
8888

8989
The following table provides more information about the key specifications of the Grace CPU and explains their relevance to quantized LLM inference:
9090

91-
| **Category** | **Specification** | **Description / Impact for LLM Inference** |
91+
| **Category** | **Specification** | **Description/Impact for LLM Inference** |
9292
|---------------|-------------------|---------------------------------------------|
9393
| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions|
9494
| Core Configuration | 20 cores total - 10× Cortex-X925 (performance) + 10× Cortex-A725 (efficiency) | Heterogeneous CPU design balancing high performance and power efficiency |
@@ -101,6 +101,8 @@ The following table provides more information about the key specifications of th
101101

102102
Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities are what make it ideal for quantized LLM workloads, as these provide a power-efficient foundation for both CPU-only inference and CPU-GPU hybrid processing.
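If you want to confirm that these extensions are exposed to user space, you can filter the CPU feature flags directly. This is an optional check, shown as a sketch; the flag names below (`sve2`, `bf16`, `i8mm`) are the standard Linux aarch64 names, and the output order may vary by kernel version:

```bash
# Print only the vector and matrix-multiply features relevant to quantized inference
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E '^(sve2|bf16|i8mm)$'
```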
103103

104+
### Verify your operating system
105+
104106
You can also verify the operating system running on your DGX Spark by using the following command:
105107

106108
```bash
@@ -120,7 +122,7 @@ This shows you that your DGX Spark runs on Ubuntu 24.04 LTS, a developer-friendl
120122

121123
Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, so you can move on to the next step.
122124

123-
## Step 2: verify the Blackwell GPU and driver
125+
## Step 2: Verify the Blackwell GPU and driver
124126

125127
After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads by using the following:
126128

@@ -174,7 +176,7 @@ The table below provides more explanation of the `nvidia-smi` output:
174176
Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference.
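If you prefer a compact view of just the fields discussed here, `nvidia-smi` also supports direct queries. This is optional; the query fields below are standard, but availability can depend on your driver version:

```bash
# Report the GPU name, driver version, and memory figures as CSV
nvidia-smi --query-gpu=name,driver_version,memory.total,memory.used --format=csv
```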
175177

176178

177-
## Step 3: check the CUDA toolkit
179+
## Step 3: Check the CUDA toolkit
178180

179181
To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed.
180182

@@ -214,8 +216,8 @@ Your DGX Spark environment is now fully prepared for the next section, where yo
214216

215217
In this entire setup section, you have achieved the following:
216218

217-
- Verified your Arm-based Grace CPU and its capabilities—you've confirmed that your system is running Armv9 cores with SVE2, BF16, and INT8 matrix multiplication support, which are perfect for quantized LLM inference
218-
- Confirmed your Blackwell GPU and CUDA driver are ready—the GB10 GPU is active, properly recognized, and set up with CUDA 13, so you're all set for GPU-accelerated workloads
219-
- Checked your operating system and CUDA toolkit—Ubuntu 24.04 LTS provides a solid foundation, and the CUDA compiler is installed and ready for building GPU-enabled inference tools
219+
- Verified your Arm-based Grace CPU and its capabilities by confirming that your system is running Armv9 cores with SVE2, BF16, and INT8 matrix multiplication support, which are perfect for quantized LLM inference
220+
- Confirmed your Blackwell GPU and CUDA driver are ready by checking that the GB10 GPU is active, properly recognized, and set up with CUDA 13, so you're all set for GPU-accelerated workloads
221+
- Checked your operating system and CUDA toolkit - Ubuntu 24.04 LTS provides a solid foundation, and the CUDA compiler is installed and ready for building GPU-enabled inference tools
220222

221223
You're now ready to move on to building and running quantized LLMs on your DGX Spark. The next section walks you through compiling llama.cpp for both CPU and GPU, so you can start running AI inference on this platform.

content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md

Lines changed: 15 additions & 18 deletions
@@ -6,13 +6,9 @@ layout: "learningpathall"
66

77
## How do I build the GPU version of llama.cpp on GB10?
88

9-
In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment.
9+
In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, a lightweight, portable inference engine well suited to quantized LLM workloads on NVIDIA Blackwell GPUs. An open-source project created by Georgi Gerganov, llama.cpp provides efficient, dependency-free large language model inference on both CPUs and GPUs.
1010

11-
Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, which is a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs.
12-
13-
llama.cpp is an open-source project by Georgi Gerganov that provides efficient and dependency-free large language model inference on both CPUs and GPUs.
14-
15-
### Step 1: Preparation
11+
## Step 1: Install dependencies
1612

1713
In this step, you will install the necessary build tools and download a small quantized model for validation:
1814

@@ -23,7 +19,7 @@ sudo apt install -y git cmake build-essential nvtop htop
2319

2420
These packages provide the C/C++ compiler toolchain, CMake build system, and GPU monitoring utility (nvtop) required to compile and test llama.cpp.
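As an optional sanity check, you can confirm that the toolchain installed correctly by printing each tool's version (a quick sketch; `nvtop --version` is supported in recent packaged builds):

```bash
git --version       # Source control, used to clone llama.cpp
cmake --version     # Build system generator
gcc --version       # C/C++ compiler from build-essential
nvtop --version     # GPU monitoring utility
```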
2521

26-
### Download a test model
22+
## Download a test model
2723

2824
To test your GPU build, you'll need a quantized model. In this section, you'll download a lightweight model that's perfect for validation.
2925

@@ -33,17 +29,20 @@ First, ensure that you have the latest Hugging Face Hub CLI installed and downlo
3329
mkdir ~/models
3430
cd ~/models
3531
python3 -m venv venv
32+
source venv/bin/activate
3633
pip install -U huggingface_hub
3734
hf download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF --local-dir TinyLlama-1.1B
3835
```
3936

4037
{{% notice Note %}}
4138
After the download completes, you'll find the models in the `~/models` directory.
39+
40+
**Tip:** Always activate your Python virtual environment with `source venv/bin/activate` before installing packages or running Python-based tools. This ensures dependencies are isolated and prevents conflicts with system-wide packages.
4241
{{% /notice %}}
4342

4443
Great! You’ve installed all the required build tools and downloaded a quantized model for validation. Your environment is ready for source code setup.
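Before moving on, you can optionally confirm that the GGUF files are where you expect them. The path below assumes the download location used above; the exact filenames depend on which quantizations were fetched:

```bash
# List the downloaded quantized model files and their sizes
ls -lh ~/models/TinyLlama-1.1B/*.gguf
```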
4544

46-
### Step 2: Clone the llama.cpp repository
45+
## Step 2: Clone the llama.cpp repository
4746

4847
Use the commands below to download the source code for llama.cpp from GitHub:
4948

@@ -55,7 +54,7 @@ cd ~/llama.cpp
5554

5655
Nice work! You now have the latest llama.cpp source code on your DGX Spark system.
5756

58-
### Step 3: Configure and Build the CUDA-Enabled Version (GPU Mode)
57+
## Step 3: Configure and build the CUDA-enabled version (GPU mode)
5958

6059
Run the following `cmake` command to configure the build system for GPU acceleration:
6160

@@ -73,13 +72,13 @@ cmake .. \
7372
-DCMAKE_CUDA_COMPILER=nvcc
7473
```
7574

76-
This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels
75+
This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels.
7776

7877
### Explanation of key flags:
7978

8079
Here's what each configuration flag does:
8180

82-
| **Feature** | **Description / Impact** |
81+
| **Feature** | **Description/Impact** |
8382
|--------------|------------------------------|
8483
| -DGGML_CUDA=ON | Enables the CUDA backend in llama.cpp, allowing matrix operations and transformer layers to be offloaded to the GPU for acceleration|
8584
| -DGGML_CUDA_F16=ON | Enables FP16 (half-precision) CUDA kernels, reducing memory usage and increasing throughput — especially effective for quantized models (for example, Q4, Q5) |
@@ -119,11 +118,9 @@ After the build completes, you'll find the GPU-accelerated binaries located unde
119118

120119
These binaries provide all the necessary tools for quantized model inference (llama-cli) and for serving GPU inference over an HTTP API (llama-server).
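As an illustration of how these binaries are typically invoked, here is a minimal sketch. The build output path and the model filename below are assumptions based on a default llama.cpp CMake build and the TinyLlama download from Step 1; adjust them to match your system:

```bash
cd ~/llama.cpp/build/bin

# One-shot inference, offloading all layers to the Blackwell GPU (-ngl 99)
./llama-cli -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -p "Explain unified memory in one sentence." -n 64 -ngl 99

# Serve the same model over an HTTP API on port 8080
./llama-server -m ~/models/TinyLlama-1.1B/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --port 8080 -ngl 99
```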
121120

122-
Excellent! The CUDA-enabled build is complete. Your binaries are optimized for the Blackwell GPU and ready for validation. You are now ready to test quantized LLMs with full GPU acceleration in the next step.
123-
124-
Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility.
121+
Excellent! The CUDA-enabled build is complete. Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility, and your binaries are ready for validation. You are now ready to test quantized LLMs with full GPU acceleration in the next step.
125122

126-
### Step 4: Validate the CUDA-enabled build
123+
## Step 4: Validate the CUDA-enabled build
127124

128125
After the build completes successfully, verify that the GPU-enabled binary of llama.cpp is correctly linked to the NVIDIA CUDA runtime.
129126

@@ -184,7 +181,7 @@ nvtop
184181

185182
This command displays GPU utilization, memory usage, temperature, and power consumption. You can use this to verify that CUDA kernels are active during model inference.
186183

187-
The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark.
184+
The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark:
188185

189186
![nvtop terminal interface displaying real-time GPU metrics, including GPU utilization, memory usage, temperature, power consumption, and active processes for the NVIDIA GB10 GPU during model inference on DGX Spark. alt-text#center](nvtop.png "TinyLlama GPU Utilization")
190187

@@ -196,12 +193,12 @@ The nvtop interface shows:
196193

197194
You have now successfully built and validated the CUDA-enabled version of llama.cpp on DGX Spark.
198195

199-
## What have I achieved?
196+
## What you have accomplished
200197

201198
You have:
202199
- Installed all required tools and dependencies
203200
- Downloaded a quantized model for testing
204201
- Built the CUDA-enabled version of llama.cpp
205202
- Verified GPU linkage and successful inference
206203

207-
You’re ready to move on to building and testing the CPU-only version! You will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference.
204+
You’re ready to move on to building and testing the CPU-only version. You will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference.

content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md

Lines changed: 3 additions & 8 deletions
@@ -11,9 +11,6 @@ The Grace CPU features Arm Cortex-X925 and Cortex-A725 cores with advanced vecto
1111

1212
## Configure and build the CPU-only version
1313

14-
15-
In this session, you will configure and build the CPU-only version of llama.cpp, optimized for the Armv9-based Grace CPU.
16-
1714
This build runs entirely on the Grace CPU (Arm Cortex-X925 and Cortex-A725), which supports advanced Armv9 vector extensions including SVE2, BFloat16, and I8MM, making it highly efficient for quantized inference workloads even without GPU acceleration.
1815
To ensure a clean separation from the GPU build artifacts, start from a clean directory.
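As a minimal sketch of that clean-separation approach (illustrative only; the configuration flags you use for the CPU build in this Learning Path may differ), you would create a separate build directory and configure with the CUDA backend disabled:

```bash
cd ~/llama.cpp
mkdir -p build-cpu && cd build-cpu

# Configure without the CUDA backend so the build targets the Grace CPU only
cmake .. -DGGML_CUDA=OFF -DCMAKE_BUILD_TYPE=Release

# Build using all available cores
cmake --build . -j"$(nproc)"
```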
1916

@@ -123,8 +120,8 @@ To monitor live CPU utilization and power metrics during inference, use `htop`:
123120
htop
124121
```
125122

126-
The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement.
127-
![htop display showing 20 Grace CPU cores at 75-85% utilization during TinyLlama inference with OpenMP threading alt-text#center](htop.png)](htop.png "TinyLlama CPU Utilization")
123+
The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement:
124+
![htop display showing 20 Grace CPU cores at 75-85% utilization during TinyLlama inference with OpenMP threading alt-text#center](htop.png "TinyLlama CPU utilization")
128125

129126
The `htop` interface shows:
130127

@@ -145,6 +142,4 @@ In this section you have:
145142
- Tested quantized model inference using the TinyLlama Q8_0 model.
146143
- Used monitoring tools (htop) to confirm efficient CPU utilization.
147144

148-
You have now successfully built and validated the CPU-only version of llama.cpp on the Grace CPU.
149-
150-
In the next section, you will learn how to use the Process Watch tool to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU.
145+
You have now successfully built and validated the CPU-only version of llama.cpp on the Grace CPU. In the next section, you will learn how to use the Process Watch tool to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU.
