Commit aea352d

Add auto-diff tutorial 2 (#142)

This adds a tutorial showing how to implement a tiny inline MLP using Slang with autodiff.

1 parent 2ea72ba commit aea352d

File tree

3 files changed: +380 -1 lines changed

_data/documentation.yaml

Lines changed: 5 additions & 1 deletion

```diff
@@ -78,4 +78,8 @@ tutorials:
   - title: "Slang Automatic Differentiation Tutorial - 1"
     description: "Learn how to use Slang’s automatic differentiation feature to compute the gradient of a shader."
     link_url: "https://docs.shader-slang.org/en/latest/auto-diff-tutorial-1.html"
-    link_label: "Slang Automatic Differentiation Tutorial - 1"
+    link_label: "Automatic Differentiation Tutorial - 1"
+  - title: "Slang Automatic Differentiation Tutorial - 2"
+    description: "Learn how to use Slang’s automatic differentiation feature to compute the gradient of a shader."
+    link_url: "https://docs.shader-slang.org/en/latest/auto-diff-tutorial-2.html"
+    link_label: "Automatic Differentiation Tutorial - 2"
```

docs/auto-diff-tutorial-2.md

Lines changed: 374 additions & 0 deletions

---
title: Automatic Differentiation Tutorial - 2
layout: page
description: Autodiff tutorial 2
permalink: "/docs/auto-diff-tutorial-2"
intro_image_absolute: true
intro_image_hide_on_mobile: false
---

## 1. What Will We Learn

In the [last tutorial](auto-diff-tutorial-1.md), we covered the basics of Slang's autodiff feature and introduced some advanced techniques for custom derivative implementations and custom differential types.

In this tutorial, we will walk through a real-world example that applies what we learned there: implementing a tiny Multi-Layer Perceptron (MLP) that approximates a set of polynomial functions using Slang's automatic differentiation capabilities.

### Problem Setup

- **Input**: Two scalar variables $x$ and $y$

- **Target Function**: A set of polynomial expressions:

$$f(x,y) = \begin{bmatrix}
f_{1}(x, y) \\
f_{2}(x, y) \\
f_{3}(x, y) \\
f_{4}(x, y)
\end{bmatrix} = \begin{bmatrix}
\frac{x + y}{1 + y^{2}} \\
2x + y \\
0.5x^{2} + 1.2y \\
x + 0.5y^{2}
\end{bmatrix}$$

- **Network Architecture**: 4 inputs → 16 hidden units → 4 outputs
  - We will encode the 2 inputs $x$ and $y$ into the 4 network inputs $x$, $y$, $x^2$ and $y^2$.

- **Learning Method**: Simple gradient descent

## 2. Mathematical Formulation

### The Learning Problem

Given a neural network $MLP(x,y;\theta)$ with parameters $\theta$, we want to minimize:

$$L(\theta) = \left\| MLP(x,y;\theta) - f(x,y) \right\|^{2}$$

Where:

- $MLP(x,y;\theta)$ is our neural network's output
- $f(x,y)$ is the ground truth polynomial set
- $L(\theta)$ is the squared error for a single sample; averaged over samples it gives the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)

### The Workflow

- **Forward Pass**: Compute $MLP(x,y;\theta)$ and $L(\theta)$
- **Backward Pass**: Compute $\frac{\partial L}{\partial\theta}$ (gradients w.r.t. parameters)
- **Parameter Update**: $\theta \leftarrow \theta - \alpha\frac{\partial L}{\partial\theta}$ (gradient descent)

## 3. Forward Pass Architecture Design

### 3.1 Top-Level Network Definition

Our MLP is a composition of two feed-forward layers:

```hlsl
public struct MyNetwork
{
    public FeedForwardLayer<4, 16> layer1; // Input to hidden
    public FeedForwardLayer<16, 4> layer2; // Hidden to output

    internal MLVec<4> _eval(MLVec<4> input)
    {
        // return layer2.eval(layer1.eval(input));
    }

    public half4 eval(no_diff half x, no_diff half y)
    {
        // construct MLVec<4> from x, y, x^2, y^2
        // call internal _eval
        // convert MLVec<4> to half4
    }
}
```

In our interface design, we introduce an internal vector type, `MLVec<4>`, for the network evaluation. This design choice hides the details of how the arithmetic and logic operations on the vector are performed, since they are irrelevant to how the network itself is structured. For now, you can treat `MLVec` as an opaque type; we will cover its details in a later section. The public `eval` method simply converts the inputs to an `MLVec`, calls the internal `_eval` method, and converts the resulting `MLVec` back to a normal vector. The internal `_eval` method just chains the evaluations of the two layers together.

> Note that we are using the `half` type, a [16-bit floating-point type](external/slang/docs/user-guide/02-conventional-features.md#scalar-types), for better throughput and reduced memory usage.
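
To make that flow concrete, here is a minimal sketch of how the two method bodies could be filled in. It only uses the `MLVec` conversion helpers introduced in section 3.3; treat it as an illustration of the design rather than the repository's exact code:

```hlsl
internal MLVec<4> _eval(MLVec<4> input)
{
    // Chain the two layers: 4 inputs -> 16 hidden units -> 4 outputs.
    return layer2.eval(layer1.eval(input));
}

public half4 eval(no_diff half x, no_diff half y)
{
    // Encode the 2 scalar inputs into the 4 network inputs: x, y, x^2, y^2.
    half encoded[4] = { x, y, x * x, y * y };
    let result = _eval(MLVec<4>::fromArray(encoded));

    // Convert the internal vector type back to a plain half4.
    let r = result.toArray();
    return half4(r[0], r[1], r[2], r[3]);
}
```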

### 3.2 Feed-Forward Layer Definition

Each layer computes $LeakyRelu(Wx+b)$, so we can create a struct to abstract this as follows:

```hlsl
public struct FeedForwardLayer<int InputSize, int OutputSize>
{
    public NFloat* weights;     // OutputSize x InputSize weight matrix (shared, global)
    public NFloat* weightsGrad; // Gradient accumulator for the weights
    public NFloat* biases;      // OutputSize bias vector
    public NFloat* biasesGrad;  // Gradient accumulator for the biases

    public MLVec<OutputSize> eval(MLVec<InputSize> input)
    {
        // out = matrix-multiply-add(input, weights, biases);
        // return leaky_relu(out, alpha);
    }
}
```

The evaluation is very simple: a matrix-vector multiplication plus a bias vector, followed by a LeakyRelu activation. But note that we use pointers in this struct to reference the parameters (e.g. weights and biases) instead of declaring arrays or global storage buffers. Our MLP runs on the GPU, and each evaluation function executes per-thread; if we declared an array for each layer, a separate copy would be allocated for every thread and GPU memory usage would explode. For the same reason, we cannot declare a storage buffer directly within the layer struct, because each thread would create its own copy of the storage buffer containing identical data, leading to massive memory waste. Instead, we use pointers to reference a single shared global storage buffer.
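
For example, the host can allocate one parameter buffer and one gradient buffer for the whole network, and each layer's pointers can be set to views into those buffers. The helper below is purely hypothetical (the repository sets this up on the host side), but it illustrates the layout idea:

```hlsl
// Hypothetical helper: carve one layer's views out of shared parameter and
// gradient buffers, advancing a running offset. Assumes weights are stored
// first (OutputSize x InputSize values), followed by OutputSize biases.
FeedForwardLayer<InputSize, OutputSize> makeLayer<int InputSize, int OutputSize>(
    NFloat* params, NFloat* grads, inout uint offset)
{
    FeedForwardLayer<InputSize, OutputSize> layer;
    layer.weights     = params + offset;
    layer.weightsGrad = grads + offset;
    offset += InputSize * OutputSize;

    layer.biases      = params + offset;
    layer.biasesGrad  = grads + offset;
    offset += OutputSize;
    return layer;
}
```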

### 3.3 Vector Type Definition

The `MLVec` type represents vectors:

```hlsl
public struct MLVec<int N>
{
    public half data[N];

    public static MLVec<N> fromArray(half[N] values)
    {
        // construct MLVec<N> from values
    }

    public half[N] toArray()
    {
        return data;
    }
}
```

For `MLVec`, we declare two methods that convert to and from a normal array.

**Supporting Operations**:

- Matrix-vector multiplication with bias:
```hlsl
MLVec<OutputSize> matMulAdd<int OutputSize, int InputSize>(MLVec<InputSize> input, NFloat* matrix, NFloat* bias);
```
- Transposed matrix-vector multiplication (no bias term):
```hlsl
MLVec<OutputSize> matMulTransposed<int OutputSize, int InputSize>(MLVec<InputSize> input, NFloat* matrix);
```
- Outer product of two vectors, accumulated into a matrix:
```hlsl
void outerProductAccumulate<int M, int N>(MLVec<M> v0, MLVec<N> v1, NFloat* matrix);
```

The first two operations are straightforward: a matrix-vector multiplication and its transposed version. The last operation computes the outer product of two vectors, which produces a matrix: $\text{x} \otimes \text{y} = \text{x} \cdot \text{y}^{T}$, where $\text{x} \in R^{M}$ and $\text{y} \in R^{N}$ are column vectors, giving an $M \times N$ result.
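
As a point of reference, a plain (non-accelerated) `matMulAdd` boils down to a doubly nested loop. The sketch below assumes a row-major weight layout (`matrix[row * InputSize + col]`); the actual repository may use a different layout or a hardware-accelerated path:

```hlsl
MLVec<OutputSize> matMulAdd<int OutputSize, int InputSize>(
    MLVec<InputSize> input, NFloat* matrix, NFloat* bias)
{
    MLVec<OutputSize> result;
    for (int row = 0; row < OutputSize; row++)
    {
        // result[row] = bias[row] + dot(matrix[row, :], input)
        NFloat sum = bias[row];
        for (int col = 0; col < InputSize; col++)
            sum += matrix[row * InputSize + col] * input.data[col];
        result.data[row] = sum;
    }
    return result;
}
```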

### 3.4 Loss Function Definition

The loss function measures how far our network output is from the target. Since we have already defined the interfaces for the MLP network, we can implement the loss function directly:

```hlsl
public half loss(MyNetwork* network, half x, half y)
{
    let networkResult = network->eval(x, y); // MLP(x,y; θ)
    let gt = groundtruth(x, y);              // target(x,y)
    let diff = networkResult - gt;           // Error vector
    return dot(diff, diff);                  // Squared error
}
```

And the `groundtruth` implementation is as trivial as:

```hlsl
public half4 groundtruth(half x, half y)
{
    return
    {
        (x + y) / (1 + y * y), // f₁(x,y)
        2 * x + y,             // f₂(x,y)
        0.5 * x * x + 1.2 * y, // f₃(x,y)
        x + 0.5 * y * y,       // f₄(x,y)
    };
}
```

## 4. Backward Pass Design

After implementing the forward pass of the network evaluation, we need to implement the backward pass. You will see how effortless this is with Slang's autodiff. We will work from the end of the workflow back to the beginning, because that is the direction in which the gradients flow.

### 4.1 Backward Pass of Loss

To implement the backward derivative of the loss function, we only need to mark it as `[Differentiable]`, as we learned in the [last tutorial](auto-diff-tutorial-1.md#forward-mode-differentiation):

```hlsl
[Differentiable]
public half loss(MyNetwork* network, no_diff half x, no_diff half y)
{
    let networkResult = network->eval(x, y); // MLP(x,y; θ)
    let gt = no_diff groundtruth(x, y);      // target(x,y)
    let diff = networkResult - gt;           // Error vector
    return dot(diff, diff);                  // Squared error
}
```

And from the Slang kernel function, we just need to call the backward mode of the `loss` function like this:

```hlsl
bwd_diff(loss)(network, input.x, input.y, 1.0h);
```

One important thing to notice is that we use the [`no_diff` attribute](external/slang/docs/user-guide/07-autodiff.html#excluding-parameters-from-differentiation) on the inputs `x` and `y`, as well as on the `groundtruth` call, because in the backward pass we only care about $\frac{\partial loss}{\partial\theta}$. The `no_diff` attribute tells Slang to treat the marked variables or instructions as non-differentiable, so no backward-mode instructions are generated for them. Since we don't care about the derivative of the loss function with respect to the inputs, we can safely mark them as non-differentiable.

Now that the `loss` function is differentiable, every instruction inside it must be differentiable except those marked `no_diff`. In particular, `network->eval(x, y)` must be differentiable, so next we will implement the backward pass for this method.
201+
### 4.2 Automatic Propagation to MLP
202+
203+
Just like the `loss` function, the only thing we need to do for `MyNetwork::eval` in order to use it with autodiff is to mark it as differentiable:
204+
```hlsl
205+
public struct MyNetwork
206+
{
207+
public FeedForwardLayer<4, 16> layer1; // Input to hidden
208+
public FeedForwardLayer<16, 4> layer2; // Hidden to output
209+
210+
[Differentiable]
211+
internal public MLVec<4> _eval(half x, half y)
212+
{
213+
// layer2.eval(layer1.eval());
214+
}
215+
216+
[Differentiable]
217+
public half4 eval(no_diff half x, no_diff half y)
218+
{
219+
// construct MLVec<4> from x, y
220+
// call internal _eval
221+
// convert MLVec<4> to half4
222+
}
223+
}
224+
```

### 4.3 Custom Backward Pass for Layers

Following the propagation direction of the gradients, we will next implement the backward derivative of `FeedForwardLayer`. Here we will do something different: instead of asking Slang to automatically synthesize the backward pass, we will provide a custom derivative implementation. Because the layer's parameters and gradients live in global buffer storage, we have to provide a custom derivative that writes the gradients back to that storage; see [how to propagate derivatives to global buffer storage](auto-diff-tutorial-1.md#how-to-propagate-derivatives-to-global-buffer-storage) in the last tutorial to refresh your memory. Another benefit of providing a custom derivative here is that our layer is just a matrix multiplication with bias, whose derivative is quite simple, so there are many options to accelerate it with specialized hardware (e.g. NVIDIA tensor cores). It is therefore good practice to implement the custom derivative.

First, let's revisit the mathematical formulas. Given:

$$Z = W \cdot x + b$$
$$Y = LeakyRelu(Z)$$

where $W \in R^{m \times n}$, $x \in R^{n}$ and $b \in R^{m}$, the gradients of $W$, $x$ and $b$ are:

$$dZ = dY \cdot (Z > 0 ? 1 : \alpha)$$
$$dW = dZ \cdot x^{T}$$
$$db = dZ$$
$$dx = W^{T} \cdot dZ$$

Therefore, the implementation should be:

```hlsl
[BackwardDerivativeOf(eval)]
public void evalBwd(inout DifferentialPair<MLVec<InputSize>> x, MLVec<OutputSize> dResult)
{
    // Re-compute the forward pass. Leaky ReLU preserves sign, so the
    // output's sign tells us which branch of the activation was taken.
    let Z = eval(x.p);

    // Step 1: Backward through the activation function
    // dZ = dY * (Z > 0 ? 1 : alpha)
    for (int i = 0; i < OutputSize; i++)
    {
        if (Z.data[i] < 0.0)
            dResult.data[i] *= 0.01h; // Derivative of leaky ReLU (alpha = 0.01)
    }

    // Step 2: Accumulate weight gradients
    // dW = dZ * x^T
    outerProductAccumulate(dResult, x.p, weightsGrad);

    // Step 3: Accumulate bias gradients
    // db = dZ
    for (int i = 0; i < OutputSize; i++)
    {
        NFloat originalValue;
        InterlockedAddF16Emulated(biasesGrad + i, dResult.data[i], originalValue);
    }

    // Step 4: Compute input gradients (for chaining to the previous layer)
    // dx = W^T * dZ
    let dx = matMulTransposed<InputSize>(dResult, weights);
    x = {x.p, dx}; // Update the differential pair with the input gradient
}
```

The key point in this implementation is that we use an atomic add when writing the gradients to the global storage buffer, e.g.:

```hlsl
InterlockedAddF16Emulated(biasesGrad + i, dResult.data[i], originalValue);
```

From the forward pass, we already know that the parameters stored in the global storage buffer are shared by all threads, and so are the gradients. During the backward pass, each thread accumulates its gradient data into the shared storage buffer, so we must use atomic add operations to accumulate all the gradients without race conditions.

The implementations of `outerProductAccumulate` and `matMulTransposed` are trivial for-loop multiplications, so we will not show the details in this tutorial. The complete code can be found [here](https://github.com/shader-slang/neural-shading-s25/tree/main/hardware-acceleration/mlp-training).
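
That said, it is worth noting that `outerProductAccumulate` must also accumulate atomically, because it writes per-thread weight gradients into the same shared buffer. A straightforward version might look like the following sketch, which assumes the same row-major layout as before:

```hlsl
void outerProductAccumulate<int M, int N>(MLVec<M> v0, MLVec<N> v1, NFloat* matrix)
{
    // matrix[i][j] += v0[i] * v1[j], accumulated atomically because every
    // thread adds its own gradient contribution to the shared buffer.
    for (int i = 0; i < M; i++)
    {
        for (int j = 0; j < N; j++)
        {
            NFloat originalValue;
            InterlockedAddF16Emulated(matrix + i * N + j, v0.data[i] * v1.data[j], originalValue);
        }
    }
}
```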

### 4.4 Making the Vector Differentiable

If we compiled what we have right now, we would get a compile error: `MLVec` is not a differentiable type, so Slang does not expect the layer's backward `eval` method to have the signature:

```hlsl
public void evalBwd(inout DifferentialPair<MLVec<InputSize>> x, MLVec<OutputSize> dResult)
```

Therefore, we have to update `MLVec` to make it differentiable:

```hlsl
public struct MLVec<int N> : IDifferentiable
{
    public half data[N];

    [Differentiable]
    public static MLVec<N> fromArray(half[N] values) { ... }

    [Differentiable]
    public half[N] toArray() { ... }
}
```

Because `MLVec`'s only member is an array of `half`, which is itself differentiable, Slang can synthesize the associated `Differential` type for us automatically; we don't need to write one by hand.
307+
### 4.5 Parameter Update
308+
309+
After backpropagation, the last step is to update the parameters by the gradients we just computed:
310+
```hlsl
311+
public struct Optimizer
312+
{
313+
public static const NFloat learningRate = 0.01h; // Step size
314+
public static void step(inout NFloat param, inout NFloat grad)
315+
{
316+
param = param - grad * learningRate; // gradient descent
317+
grad = 0.f; // Reset gradient for next iteration
318+
}
319+
}
320+
```

## 5. Putting It All Together

We will create two kernel functions: one for training and another for parameter updates.

The training kernel is:

```hlsl
[shader("compute")]
[numthreads(256, 1, 1)]
void learnGradient(
    uint32_t tid : SV_DispatchThreadID,
    uniform MyNetwork* network,
    uniform Atomic<uint32_t>* lossBuffer,
    uniform float2* inputs,
    uniform uint32_t count)
{
    if (tid >= count) return;

    var input = (half2)inputs[tid];

    // Compute all gradients automatically
    bwd_diff(loss)(network, input.x, input.y, 1.0h);

    // Also compute the loss value for monitoring
    let thisLoss = (float)loss(network, input.x, input.y);
    let maxLoss = WaveActiveMax(thisLoss);
    if (WaveIsFirstLane())
    {
        lossBuffer->max(bit_cast<uint32_t>(maxLoss));
    }
}
```

Note that tracking the maximum loss through an integer atomic `max` of `bit_cast<uint32_t>(maxLoss)` works because the loss is never negative, and the bit patterns of non-negative IEEE 754 floats compare in the same order as the floats themselves.

The parameter update kernel is:

```hlsl
[shader("compute")]
[numthreads(256, 1, 1)]
void adjustParameters(
    uint32_t tid : SV_DispatchThreadID,
    uniform NFloat* params,
    uniform NFloat* gradients,
    uniform uint32_t count)
{
    if (tid >= count) return;

    // Apply gradient descent
    Optimizer::step(params[tid], gradients[tid]);
}
```

The training process is a loop that alternately invokes the Slang compute kernels `learnGradient` and `adjustParameters` until the loss converges below an acceptable threshold.

The host-side implementation for this example is not shown, as it is simply boilerplate code to set up the graphics pipeline and allocate buffers for the parameters. You can access the [GitHub repo](https://github.com/shader-slang/neural-shading-s25/tree/main/hardware-acceleration/mlp-training) for this example's complete code, including the host-side implementation. Alternatively, you can use the more powerful tool [SlangPy](https://github.com/shader-slang/slangpy) to run this MLP example without writing any graphics boilerplate code.

docs/index.rst

Lines changed: 1 addition & 0 deletions

```diff
@@ -37,3 +37,4 @@ Slang Documentation
    Using Slang Parameter Blocks <parameter-blocks>
    Migrating from GLSL to Slang <coming-from-glsl>
    AutoDiff Tutorial 1 <auto-diff-tutorial-1>
+   AutoDiff Tutorial 2 <auto-diff-tutorial-2>
```
