content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1_gb10_introduction.md (10 additions, 10 deletions)
@@ -1,5 +1,5 @@
---
-title: Discover the Grace Blackwell architecture
+title: Explore Grace Blackwell architecture for efficient quantized LLM inference
weight: 2

### FIXED, DO NOT MODIFY
@@ -8,24 +8,24 @@ layout: learningpathall

## Overview

-Explore the architecture and system design of the [NVIDIA DGX Spark](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/) platform, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads.
-
-The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.
+In this Learning Path you will explore the architecture and system design of NVIDIA DGX Spark, a next-generation Arm-based CPU-GPU hybrid for large-scale AI workloads. The NVIDIA DGX Spark is a personal AI supercomputer that brings data center-class AI computing directly to the developer desktop. The NVIDIA GB10 Grace Blackwell Superchip fuses CPU and GPU into a single unified compute engine.

The GB10 platform combines:
- The NVIDIA Grace CPU, featuring 10 Arm [Cortex-X925](https://www.arm.com/products/cortex-x) and 10 [Cortex-A725](https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a725) cores built on the Armv9 architecture, offering exceptional single-thread performance and power efficiency
- The NVIDIA Blackwell GPU, equipped with next-generation CUDA cores and 5th-generation Tensor Cores, optimized for FP8 and FP4 precision workloads
- A 128 GB unified memory subsystem, enabling both CPU and GPU to share the same address space with NVLink-C2C, eliminating data-transfer bottlenecks

-This GB10 platform design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform for modern AI workloads, bringing powerful AI development capabilities to your desktop and letting you build and test AI models locally before scaling them to larger systems.
+This GB10 platform design delivers up to 1 petaFLOP (1,000 TFLOPs) of AI performance at FP4 precision. DGX Spark is a compact yet powerful development platform that lets you build and test AI models locally before scaling them to larger systems.
+
+You can find out more about NVIDIA DGX Spark on the [NVIDIA website](https://www.nvidia.com/en-gb/products/workstations/dgx-spark/).

-## Benefits of Grace Blackwell for quantized LLMs
+## Benefits of Grace Blackwell for quantized LLM inference

-Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip.
+Quantized Large Language Models (LLMs), such as those using Q4, Q5, or Q8 precision, benefit from the hybrid architecture of the Grace Blackwell Superchip, which brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference.

-The Grace Blackwell architecture brings several key advantages to quantized LLM workloads. The unified CPU-GPU design eliminates traditional bottlenecks while providing specialized compute capabilities for different aspects of inference.
+On Arm-based systems, quantized LLM inference is especially efficient because the Grace CPU delivers high single-thread performance and energy efficiency, while the Blackwell GPU accelerates matrix operations using Arm-optimized CUDA libraries. The unified memory architecture means you don't need to manually manage data movement between CPU and GPU, which is a common challenge on traditional x86-based platforms. This is particularly valuable when working with large models or running multiple inference tasks in parallel, as it reduces latency and simplifies development.

-###Grace Blackwell features and their impact on quantized LLMs
+## Grace Blackwell features and their impact on quantized LLMs

The table below shows how specific hardware features enable efficient quantized model inference:

@@ -37,7 +37,7 @@
| Unified 128 GB memory (NVLink-C2C) | CPU and GPU share the same memory space, allowing quantized model weights to be accessed without explicit data transfer |
| Energy-efficient Arm design | Armv9 cores maintain strong performance-per-watt, enabling sustained inference for extended workloads |

-### Quantized LLM workflow
+## Overview of a typical quantized LLM workflow

In a typical quantized LLM workflow:
- The Grace CPU orchestrates text tokenization, prompt scheduling, and system-level tasks
content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/1a_gb10_setup.md (11 additions, 9 deletions)
@@ -6,13 +6,13 @@ weight: 3
layout: learningpathall
---

-## Overview
+## Set up your Grace Blackwell environment

Before building and running quantized LLMs on your DGX Spark, you need to verify that your system is fully prepared for AI workloads. This includes checking your Arm-based Grace CPU configuration, confirming your operating system, ensuring the Blackwell GPU and CUDA drivers are active, and validating that the CUDA toolkit is installed. These steps provide a solid foundation for efficient LLM inference and development on Arm.

This section is organized into three main steps: verifying your CPU, checking your operating system, and confirming your GPU and CUDA toolkit setup. You'll also find additional context and technical details throughout, should you wish to explore the platform's capabilities more deeply.

-## Step 1: check your CPU configuration
+## Step 1: Verify your CPU configuration

Before running LLM workloads, it's helpful to understand more about the CPU you're working with. The DGX Spark uses Arm-based Grace processors, which bring some unique advantages for AI inference.
@@ -82,13 +82,13 @@ Vulnerabilities:
Tsx async abort: Not affected
```

-Great! If you have seen this message your system is using Armv9 cores, which are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations.
+If you have seen this message, your system is using Armv9 cores, great! These are ideal for quantized LLM workloads. The Grace CPU implements the Armv9-A instruction set and supports advanced vector extensions, making it ideal for quantized LLM inference and tensor operations.

### Grace CPU specification

The following table provides more information about the key specifications of the Grace CPU and explains their relevance to quantized LLM inference:

-|**Category**|**Specification**|**Description / Impact for LLM Inference**|
+|**Category**|**Specification**|**Description/Impact for LLM Inference**|
| Architecture | Armv9-A (64-bit, aarch64) | Modern Arm architecture supporting advanced vector and AI extensions|
| Core Configuration | 20 cores total - 10× Cortex-X925 (performance) + 10× Cortex-A725 (efficiency) | Heterogeneous CPU design balancing high performance and power efficiency |
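The lscpu-style output shown in the hunk above is only the tail end of the check; the command that produces it sits outside this diff view. As a hedged illustration of the kind of verification this step refers to (not necessarily the exact invocation in the Learning Path), a sketch:

```bash
# Print CPU architecture details; on DGX Spark this should report
# "Architecture: aarch64" and the 20 Cortex-X925 / Cortex-A725 cores
# described in the table above.
lscpu

# Optionally confirm the Armv9 features the table calls out
# (sve2, bf16, and i8mm should appear in the flags list).
lscpu | grep -iE 'sve2|bf16|i8mm'
```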
@@ -101,6 +101,8 @@

Its SVE2, BF16, and INT8 matrix multiplication (I8MM) capabilities are what make it ideal for quantized LLM workloads, as these provide a power-efficient foundation for both CPU-only inference and CPU-GPU hybrid processing.

+### Verify OS
+
You can also verify the operating system running on your DGX Spark by using the following command:

```bash
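The body of the command block above is cut off in this diff view, so the exact check is not shown here. As a hedged placeholder (not necessarily the command the Learning Path uses), one generic way to read the OS release on Ubuntu:

```bash
# Print the OS identification fields; on DGX Spark this is expected to
# report Ubuntu 24.04 LTS.
cat /etc/os-release
```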
@@ -120,7 +122,7 @@

Nice work! You've confirmed your operating system is Ubuntu 24.04 LTS, so you can move on to the next step.

-## Step 2: verify the Blackwell GPU and driver
+## Step 2: Verify the Blackwell GPU and driver

After confirming your CPU configuration, verify that the Blackwell GPU inside the GB10 Grace Blackwell Superchip is available and ready for CUDA workloads by using the following:
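The GPU check itself falls outside this hunk, but the next hunk header confirms the section explains `nvidia-smi` output. As a sketch of the check this step refers to:

```bash
# Query the GPU and driver; a healthy GB10 setup lists the Blackwell GPU,
# the installed driver version, and the CUDA version reported by the driver.
nvidia-smi
```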
@@ -174,7 +176,7 @@ The table below provides more explanation of the `nvidia-smi` output:
Excellent! Your Blackwell GPU is recognized and ready for CUDA workloads. This means your system is set up for GPU-accelerated LLM inference.

-## Step 3: check the CUDA toolkit
+## Step 3: Check the CUDA toolkit

To build the CUDA version of llama.cpp, the system must have a CUDA toolkit installed.
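The toolkit check is not visible in this hunk. One common way to confirm a CUDA toolkit is installed and on the PATH (a sketch; the Learning Path may show a different command):

```bash
# The release line should report the expected toolkit version
# (CUDA 13 on this setup).
nvcc --version
```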
@@ -214,8 +216,8 @@

In this entire setup section, you have achieved the following:

-- Verified your Arm-based Grace CPU and its capabilities—you've confirmed that your system is running Armv9 cores with SVE2, BF16, and INT8 matrix multiplication support, which are perfect for quantized LLM inference
-- Confirmed your Blackwell GPU and CUDA driver are ready—the GB10 GPU is active, properly recognized, and set up with CUDA 13, so you're all set for GPU-accelerated workloads
-- Checked your operating system and CUDA toolkit—Ubuntu 24.04 LTS provides a solid foundation, and the CUDA compiler is installed and ready for building GPU-enabled inference tools
+- Verified your Arm-based Grace CPU and its capabilities by confirming that your system is running Armv9 cores with SVE2, BF16, and INT8 matrix multiplication support, which are perfect for quantized LLM inference
+- Confirmed your Blackwell GPU and CUDA driver are ready by seeing that the GB10 GPU is active, properly recognized, and set up with CUDA 13, so you're all set for GPU-accelerated workloads
+- Checked your operating system and CUDA toolkit - Ubuntu 24.04 LTS provides a solid foundation, and the CUDA compiler is installed and ready for building GPU-enabled inference tools

You're now ready to move on to building and running quantized LLMs on your DGX Spark. The next section walks you through compiling llama.cpp for both CPU and GPU, so you can start running AI inference on this platform.
content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/2_gb10_llamacpp_gpu.md (15 additions, 18 deletions)
@@ -6,13 +6,9 @@ layout: "learningpathall"

## How do I build the GPU version of llama.cpp on GB10?

-In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment.
+In the previous section, you verified that your DGX Spark system is correctly configured with the Grace CPU, Blackwell GPU, and CUDA 13 environment. Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, which is a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs. Llama.cpp is an open-source project by Georgi Gerganov that provides efficient and dependency-free large language model inference on both CPUs and GPUs.

-Now that your hardware and drivers are ready, this section focuses on building the GPU-enabled version of llama.cpp, which is a lightweight, portable inference engine optimized for quantized LLM workloads on NVIDIA Blackwell GPUs.
-
-llama.cpp is an open-source project by Georgi Gerganov that provides efficient and dependency-free large language model inference on both CPUs and GPUs.
-
-### Step 1: Preparation
+## Step 1: Install dependencies

In this step, you will install the necessary build tools and download a small quantized model for validation:

After the download completes, you'll find the models in the `~/models` directory.

+
+**Tip:** Always activate your Python virtual environment with `source venv/bin/activate` before installing packages or running Python-based tools. This ensures dependencies are isolated and prevents conflicts with system-wide packages.
{{% /notice %}}

Great! You’ve installed all the required build tools and downloaded a quantized model for validation. Your environment is ready for source code setup.
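The install and download commands themselves are elided from this diff view. As a heavily hedged sketch only, the package list, virtual-environment name, model repository, and file name below are assumptions, not the Learning Path's exact choices (the text elsewhere does reference `source venv/bin/activate`, the `~/models` directory, and a TinyLlama Q8_0 model):

```bash
# Illustrative only: typical build dependencies for llama.cpp
sudo apt update
sudo apt install -y build-essential cmake git

# Create and activate the Python virtual environment referenced by the tip above
python3 -m venv venv
source venv/bin/activate

# Download a small quantized GGUF model for validation (repository and file
# name are assumptions; use whichever model the Learning Path specifies)
pip install -U "huggingface_hub[cli]"
mkdir -p ~/models
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  tinyllama-1.1b-chat-v1.0.Q8_0.gguf --local-dir ~/models
```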
-###Step 2: Clone the llama.cpp repository
+## Step 2: Clone the llama.cpp repository

Use the commands below to download the source code for llama.cpp from GitHub:

@@ -55,7 +54,7 @@ cd ~/llama.cpp

Nice work! You now have the latest llama.cpp source code on your DGX Spark system.
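The clone commands are cut from this view between the sentence above and the next hunk. As a sketch (the upstream repository is the public llama.cpp project, and the destination path follows the `cd ~/llama.cpp` context line shown in the hunk header):

```bash
# Fetch the llama.cpp source into your home directory, then enter it
git clone https://github.com/ggerganov/llama.cpp.git ~/llama.cpp
cd ~/llama.cpp
```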
-###Step 3: Configure and Build the CUDA-Enabled Version (GPU Mode)
+## Step 3: Configure and build the CUDA-enabled version (GPU Mode)

Run the following `cmake` command to configure the build system for GPU acceleration:

@@ -73,13 +72,13 @@ cmake .. \
  -DCMAKE_CUDA_COMPILER=nvcc
```

-This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels
+This command enables CUDA support and prepares llama.cpp for compiling GPU-optimized kernels.

### Explanation of key flags:

Here's what each configuration flag does:

-|**Feature**|**Description / Impact**|
+|**Feature**|**Description/Impact**|
|--------------|------------------------------|
| -DGGML_CUDA=ON | Enables the CUDA backend in llama.cpp, allowing matrix operations and transformer layers to be offloaded to the GPU for acceleration|
| -DGGML_CUDA_F16=ON | Enables FP16 (half-precision) CUDA kernels, reducing memory usage and increasing throughput — especially effective for quantized models (for example, Q4, Q5) |
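Only the tail of the configure command is visible in the hunk above. Based on the flags named in the table and the visible `-DCMAKE_CUDA_COMPILER=nvcc` line, the configure-and-build sequence looks roughly like the following sketch (the Learning Path's full flag list may differ):

```bash
cd ~/llama.cpp
mkdir -p build && cd build

# Configure with the CUDA backend and FP16 CUDA kernels enabled,
# using nvcc from the installed CUDA 13 toolkit
cmake .. \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_F16=ON \
  -DCMAKE_CUDA_COMPILER=nvcc

# Compile using all available Grace CPU cores
cmake --build . --config Release -j"$(nproc)"
```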
@@ -119,11 +118,9 @@

These binaries provide all necessary tools for quantized model inference (llama-cli) and for serving GPU inference using HTTP API (llama-server).

-Excellent! The CUDA-enabled build is complete. Your binaries are optimized for the Blackwell GPU and ready for validation. You are now ready to test quantized LLMs with full GPU acceleration in the next step.
-
-Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility.
+Excellent! The CUDA-enabled build is complete. Your binaries are optimized for the Blackwell GPU and ready for validation. Together, these options ensure that the build targets the Grace Blackwell GPU with full CUDA 13 compatibility. You are now ready to test quantized LLMs with full GPU acceleration in the next step.

-###Step 4: Validate the CUDA-enabled build
+## Step 4: Validate the CUDA-enabled build

After the build completes successfully, verify that the GPU-enabled binary of llama.cpp is correctly linked to the NVIDIA CUDA runtime.
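The validation command is not shown in this view. One quick, generic way to confirm CUDA linkage is to inspect the binary's shared-library dependencies (the binary path below assumes a default CMake layout and is an assumption, not taken from the Learning Path):

```bash
# The GPU-enabled binary should be dynamically linked against CUDA libraries;
# entries such as libcudart and libcublas should appear in the output.
ldd ~/llama.cpp/build/bin/llama-cli | grep -i 'cuda\|cublas'
```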
@@ -184,7 +181,7 @@
nvtop

This command displays GPU utilization, memory usage, temperature, and power consumption. You can use this to verify that CUDA kernels are active during model inference.

-The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark.
+The following screenshot shows GPU utilization during TinyLlama inference on DGX Spark:

![nvtop Tinyllama](nvtop_tinyllama.png "Live GPU usage of TinyLlama")

@@ -196,12 +193,12 @@ The nvtop interface shows:

You have now successfully built and validated the CUDA-enabled version of llama.cpp on DGX Spark.

-## What have I achieved?
+## What you have accomplished

You have:
- Installed all required tools and dependencies
- Downloaded a quantized model for testing
- Built the CUDA-enabled version of llama.cpp
- Verified GPU linkage and successful inference

-You’re ready to move on to building and testing the CPU-only version! You will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference.
+You’re ready to move on to building and testing the CPU-only version. You will build the optimized CPU-only version of llama.cpp and explore how the Grace CPU executes Armv9 vector instructions during inference.

content/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/3_gb10_llamacpp_cpu.md (3 additions, 8 deletions)

@@ -11,9 +11,6 @@

## Configure and build the CPU-only version

-
-In this session, you will configure and build the CPU-only version of llama.cpp, optimized for the Armv9-based Grace CPU.
-
This build runs entirely on the Grace CPU (Arm Cortex-X925 and Cortex-A725), which supports advanced Armv9 vector extensions including SVE2, BFloat16, and I8MM, making it highly efficient for quantized inference workloads even without GPU acceleration.
To ensure a clean separation from the GPU build artifacts, start from a clean directory.

@@ -123,8 +120,8 @@ To monitor live CPU utilization and power metrics during inference, use `htop`:
htop
```

-The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement.
-](htop.png"TinyLlama CPU Utilization")
+The following screenshot shows CPU utilization and thread activity during TinyLlama inference on DGX Spark, confirming full multi-core engagement:
+![TinyLlama CPU Utilization](htop.png "TinyLlama CPU Utilization")

The `htop` interface shows:

@@ -145,6 +142,4 @@ In this section you have:
- Tested quantized model inference using the TinyLlama Q8_0 model.
- Used monitoring tools (htop) to confirm efficient CPU utilization.

-You have now successfully built and validated the CPU-only version of llama.cpp on the Grace CPU.
-
-In the next section, you will learn how to use the Process Watch tool to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU.
+You have now successfully built and validated the CPU-only version of llama.cpp on the Grace CPU. In the next section, you will learn how to use the Process Watch tool to visualize instruction-level execution and better understand how Armv9 vectorization (SVE2 and NEON) accelerates quantized LLM inference on the Grace CPU.
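The CPU-only configure command referenced in the last file ("start from a clean directory") is not visible in this view. As a hedged sketch of that separate-build-directory approach (the directory name and flag set are assumptions; the Learning Path may add explicit Arm optimization flags):

```bash
cd ~/llama.cpp
mkdir -p build-cpu && cd build-cpu

# Configure without the CUDA backend so inference runs entirely on the
# Grace CPU's Armv9 cores (SVE2, BF16, and I8MM via the CPU backend)
cmake .. -DGGML_CUDA=OFF

# Build with all CPU cores
cmake --build . --config Release -j"$(nproc)"
```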