155 changes: 155 additions & 0 deletions articles/batch-normalization.md
@@ -0,0 +1,155 @@
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **Layer Normalization** - You just implemented normalizing across features within each sample. Batch normalization normalizes across the batch for each feature instead.
- **Mean and Variance** - Computing $\mu = \frac{1}{N}\sum x_i$ and $\sigma^2 = \frac{1}{N}\sum(x_i - \mu)^2$ along the batch axis
- **Training vs Inference** - Batch norm behaves differently in training (uses batch statistics) and inference (uses running statistics). This dual behavior is the trickiest part.

---

## Concept

Layer normalization normalizes across features within each sample (axis=1). Batch normalization flips the axis: it normalizes across the batch for each feature (axis=0).

For a batch of $N$ samples, each with $D$ features:

**During training:**
1. Compute mean and variance for each feature across the batch
2. Normalize: $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
3. Scale and shift: $y = \gamma \cdot \hat{x} + \beta$ (learned parameters)
4. Update running statistics via exponential moving average

**During inference:**
Use the accumulated running statistics instead of batch statistics. This is critical because at inference time you might process a single sample, making batch statistics meaningless.

The running statistics update uses momentum $m$:

$$\text{running\_mean} = (1 - m) \cdot \text{running\_mean} + m \cdot \mu_B$$

This exponential moving average gives recent batches more weight while smoothing over the entire training history.
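
The update is small enough to run in isolation. A minimal sketch, using the initial running mean of zeros and the batch mean from the walkthrough below as illustrative values:

```python
import numpy as np

def update_running(running, batch_stat, m):
    # EMA: keep (1 - m) of the history, blend in m of the new batch statistic
    return (1 - m) * running + m * batch_stat

running_mean = np.zeros(4)
mu_B = np.array([5.0, 6.0, 7.0, 8.0])
running_mean = update_running(running_mean, mu_B, m=0.1)
print(running_mean)  # [0.5 0.6 0.7 0.8]
```

With momentum `m = 0.1`, each batch nudges the running estimate 10% of the way toward the batch statistic, so early batches decay geometrically in influence.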

Why does batch norm help? It reduces "internal covariate shift": as earlier layers update during training, the distribution of inputs to later layers constantly changes. Batch norm re-centers and re-scales these distributions, allowing higher learning rates and faster convergence.

---

## Solution

### Intuition

In training mode: compute mean and variance along axis=0, normalize, apply affine transform, update running stats. In inference mode: skip the batch statistics entirely and use the running mean/variance that were accumulated during training.

### Implementation

::tabs-start
```python
import numpy as np
from typing import Tuple, List


class Solution:
    def batch_norm(self, x: List[List[float]], gamma: List[float], beta: List[float],
                   running_mean: List[float], running_var: List[float],
                   momentum: float, eps: float, training: bool) -> Tuple[List[List[float]], List[float], List[float]]:
        x = np.array(x)
        gamma = np.array(gamma)
        beta = np.array(beta)
        running_mean = np.array(running_mean, dtype=np.float64)
        running_var = np.array(running_var, dtype=np.float64)

        if training:
            # Normalize with live batch statistics (per feature, axis=0)
            batch_mean = np.mean(x, axis=0)
            batch_var = np.var(x, axis=0)
            x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)
            # Update running statistics via exponential moving average
            running_mean = (1 - momentum) * running_mean + momentum * batch_mean
            running_var = (1 - momentum) * running_var + momentum * batch_var
        else:
            # Inference: use the accumulated running statistics
            x_hat = (x - running_mean) / np.sqrt(running_var + eps)

        out = gamma * x_hat + beta
        return (np.round(out, 4).tolist(), np.round(running_mean, 4).tolist(), np.round(running_var, 4).tolist())
```
::tabs-end


### Walkthrough

Given a batch of 3 samples with 4 features, `gamma = [1,1,1,1]`, `beta = [0,0,0,0]`, `momentum = 0.1`, initial `running_mean = [0,0,0,0]`, initial `running_var = [1,1,1,1]`, and `training=True`:

```
x = [[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]]
```

| Step | Computation | Result |
|---|---|---|
| Batch mean (axis=0) | $[1+5+9, 2+6+10, ...] / 3$ | $\mu_B = [5, 6, 7, 8]$ |
| Batch variance | $[(1-5)^2 + (5-5)^2 + (9-5)^2] / 3, ...$ | $\sigma_B^2 = [10.667, 10.667, 10.667, 10.667]$ |
| Normalize row 1 | $(1 - 5) / \sqrt{10.667}$ | $-1.2247$ |
| Normalize row 2 | $(5 - 5) / \sqrt{10.667}$ | $0.0$ |
| Normalize row 3 | $(9 - 5) / \sqrt{10.667}$ | $1.2247$ |
| Update running mean | $0.9 \cdot [0,0,0,0] + 0.1 \cdot [5,6,7,8]$ | $[0.5, 0.6, 0.7, 0.8]$ |
| Update running var | $0.9 \cdot [1,1,1,1] + 0.1 \cdot [10.667,...]$ | $[1.9667, 1.9667, 1.9667, 1.9667]$ |

With a non-trivial `gamma = [2, 0.5, 1.5, 1]` and `beta = [1, -1, 0.5, 0]` (one entry per feature), the affine transform would scale and shift each feature independently.
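
The affine step can be seen in isolation. A sketch with hypothetical 4-element `gamma`/`beta` applied to the normalized rows computed above:

```python
import numpy as np

x_hat = np.array([[-1.2247, -1.2247, -1.2247, -1.2247],
                  [ 0.0,     0.0,     0.0,     0.0   ],
                  [ 1.2247,  1.2247,  1.2247,  1.2247]])
gamma = np.array([2.0, 0.5, 1.5, 1.0])   # per-feature scale
beta = np.array([1.0, -1.0, 0.5, 0.0])   # per-feature shift

# Broadcasting applies each column's scale and shift to every row
out = gamma * x_hat + beta
print(out[1])  # the zero row maps straight to beta: [ 1.  -1.   0.5  0. ]
```

Note that the middle row (all zeros after normalization) comes out equal to `beta`, which is why `beta` acts as the learned per-feature mean of the output.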

### Time & Space Complexity

- Time: $O(N \cdot D)$ where $N$ is batch size and $D$ is features
- Space: $O(N \cdot D)$ for the normalized output (plus $O(D)$ for running statistics)

---

## Common Pitfalls

### Using Batch Statistics During Inference

During inference (`training=False`), you must use the running statistics, not the current batch. Using batch statistics at inference time means your model's output depends on what other samples are in the batch.

::tabs-start
```python
# Wrong: always using batch statistics
batch_mean = np.mean(x, axis=0)
batch_var = np.var(x, axis=0)
x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)

# Correct: check the training flag
if training:
    batch_mean = np.mean(x, axis=0)
    batch_var = np.var(x, axis=0)
    x_hat = (x - batch_mean) / np.sqrt(batch_var + eps)
else:
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
```
::tabs-end


### Normalizing Along the Wrong Axis

Batch norm normalizes across the batch (axis=0), not across features (axis=1). Normalizing across features gives you layer normalization instead.

::tabs-start
```python
# Wrong: this is layer normalization (axis=1)
batch_mean = np.mean(x, axis=1, keepdims=True)

# Correct: batch normalization normalizes across samples (axis=0)
batch_mean = np.mean(x, axis=0)
```
::tabs-end


---

## In the GPT Project

This becomes `model/batch_normalization.py`. While modern transformers (GPT, LLaMA) use **layer normalization** rather than batch normalization, understanding batch norm is essential context. Batch norm was the breakthrough that made training deep CNNs practical (ResNets), and the distinction between batch vs layer normalization is a common interview question. The key tradeoff: batch norm depends on batch size and behaves differently at train/eval time, while layer norm is batch-independent and consistent.
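
The batch-vs-layer distinction is mechanically just the normalization axis. A minimal sketch contrasting the two on the same matrix:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [9.0, 10.0, 11.0, 12.0]])
eps = 1e-5

# Batch norm: statistics per feature, computed across samples (axis=0)
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Layer norm: statistics per sample, computed across features (axis=1)
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

print(bn.mean(axis=0))  # each column ~0: features are centered across the batch
print(ln.mean(axis=1))  # each row ~0: features are centered within each sample
```

Because layer norm's statistics depend only on the current sample, it produces identical results for any batch size and needs no running statistics, which is why transformers prefer it.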

---

## Key Takeaways

- Batch normalization normalizes across the batch for each feature (axis=0), while layer normalization normalizes across features for each sample (axis=1). The axis flip changes everything about when and where each technique works best.
- The train/inference split is the hardest part: during training you use live batch statistics and update running estimates. During inference you use those accumulated running estimates because batch statistics from a single sample are meaningless.
- Running statistics use an exponential moving average controlled by momentum. This smooths over the randomness of individual mini-batches while tracking the distribution as the network learns.
13 changes: 8 additions & 5 deletions articles/multi-headed-self-attention.md
@@ -17,18 +17,19 @@ If the total attention dimension is $d$ and we have $h$ heads, each head operate
1. Create $h$ heads, each with attention dimension $d/h$.
2. Run each head independently on the same input.
3. Concatenate all head outputs along the feature dimension.
4. Apply a learned output projection $W^O$ to combine the heads.

Why does this help? Different heads learn different things. In practice, researchers have observed that some heads learn syntactic relationships (attending to the previous word), some learn semantic relationships (attending to the subject of a sentence), and some learn positional patterns (attending to nearby tokens). A single head would have to compromise between all these patterns; multiple heads can specialize.

The output shape is $(B, T, d)$, the same as a single head with the full dimension. This makes multi-head attention a drop-in replacement for single-head attention.
The final output projection $W^O$ (a linear layer of size $d \times d$) lets the model learn how to best combine information from all heads. The output shape is $(B, T, d)$, the same as a single head with the full dimension, making multi-head attention a drop-in replacement.
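
A shape-only sketch of the concat-then-project step (NumPy stands in for PyTorch here, and the random matrices are placeholders for learned weights):

```python
import numpy as np

T, d, h = 3, 8, 4          # sequence length, attention dim, number of heads
head_dim = d // h          # each head works in a d/h-dimensional subspace
rng = np.random.default_rng(0)

# Pretend each head has already produced its (T, d/h) output
head_outputs = [rng.normal(size=(T, head_dim)) for _ in range(h)]

concatenated = np.concatenate(head_outputs, axis=-1)  # (T, d)
W_O = rng.normal(size=(d, d))                         # learned output projection
combined = concatenated @ W_O                         # (T, d), same shape as one full-width head

print(concatenated.shape, combined.shape)  # (3, 8) (3, 8)
```

Without $W^O$, feature $j$ of the output could only ever come from one head's slice of the concatenation; the projection lets every output feature mix information from all heads.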

---

## Solution

### Intuition

Create a list of `SingleHeadAttention` modules, each with attention dimension `attention_dim // num_heads`. Run each head on the same input. Concatenate outputs along the last dimension.
Create a list of `SingleHeadAttention` modules, each with attention dimension `attention_dim // num_heads`. Run each head on the same input. Concatenate outputs along the last dimension. Apply a learned output projection ($W^O$) to the concatenated result.

### Implementation

@@ -46,13 +47,14 @@ class MultiHeadedSelfAttention(nn.Module):
self.att_heads = nn.ModuleList()
for i in range(num_heads):
self.att_heads.append(self.SingleHeadAttention(embedding_dim, attention_dim // num_heads))
self.output_proj = nn.Linear(attention_dim, attention_dim, bias=False)

def forward(self, embedded: TensorType[float]) -> TensorType[float]:
head_outputs = []
for head in self.att_heads:
head_outputs.append(head(embedded))
concatenated = torch.cat(head_outputs, dim = 2)
return torch.round(concatenated, decimals=4)
return torch.round(self.output_proj(concatenated), decimals=4)

class SingleHeadAttention(nn.Module):
def __init__(self, embedding_dim: int, attention_dim: int):
@@ -92,8 +94,9 @@ For `embedding_dim = 8`, `attention_dim = 8`, `num_heads = 4`, sequence of 3 tok
| Head 2 | $(B, 3, 8)$ | $(B, 3, 2)$ | $(B, 3, 2)$ |
| Head 3 | $(B, 3, 8)$ | $(B, 3, 2)$ | $(B, 3, 2)$ |
| Concat | 4 outputs along dim=2 | | $(B, 3, 8)$ |
| $W^O$ | Linear projection $d \to d$ | | $(B, 3, 8)$ |

Each head projects from 8 to 2 dimensions (8/4 = 2), and concatenation restores the full 8 dimensions.
Each head projects from 8 to 2 dimensions (8/4 = 2), concatenation restores the full 8 dimensions, and $W^O$ learns how to best combine the heads' outputs.

### Time & Space Complexity

@@ -148,6 +151,6 @@ This becomes `model/multi_head_attention.py`. The GPT model uses multi-headed at

## Key Takeaways

- Multi-headed attention runs several attention heads in parallel, each specializing in different relationship patterns, without increasing total computation over a single large head.
- Multi-headed attention runs several attention heads in parallel, each specializing in different relationship patterns, with a learned output projection ($W^O$) that combines their outputs.
- Each head operates on a $d/h$ dimensional subspace, and concatenation reconstructs the full dimension, making it a drop-in replacement for single-head attention.
- Using `nn.ModuleList` (not a plain Python list) is essential so PyTorch can track and update each head's parameters during training.
153 changes: 153 additions & 0 deletions articles/multi-layer-backpropagation.md
@@ -0,0 +1,153 @@
## Prerequisites

Before attempting this problem, you should be comfortable with:

- **Single-neuron backpropagation** - The chain rule through one neuron ($z \to \sigma \to L$). Now you're chaining through multiple layers, but the principle is identical.
- **ReLU activation** - Unlike sigmoid, ReLU's derivative is binary: 1 where $z > 0$, 0 elsewhere. This creates the "dead neuron" problem when $z \leq 0$ for all inputs.
- **Matrix multiplication** - Gradients through linear layers involve transposing weight matrices. Understanding $z = xW^T + b$ and its Jacobian is essential.

---

## Concept

Single-neuron backprop had three links in the chain rule: loss $\to$ activation $\to$ weights. A multi-layer network has more links but the same idea: multiply local derivatives as you walk backward from the loss.

For a 2-layer MLP with ReLU:

$$x \xrightarrow{W_1, b_1} z_1 \xrightarrow{\text{ReLU}} a_1 \xrightarrow{W_2, b_2} z_2 \xrightarrow{\text{MSE}} L$$

Each arrow is one step in the chain rule. Working backward from $L$:

1. $\frac{\partial L}{\partial z_2}$ is the error signal from MSE
2. $\frac{\partial L}{\partial W_2}$ and $\frac{\partial L}{\partial b_2}$ use $a_1$ (the layer's input)
3. $\frac{\partial L}{\partial a_1}$ passes the gradient backward through $W_2$
4. $\frac{\partial L}{\partial z_1}$ multiplies by the ReLU mask (binary: 1 or 0)
5. $\frac{\partial L}{\partial W_1}$ and $\frac{\partial L}{\partial b_1}$ use $x$ (the network's input)

The ReLU derivative is the critical piece: where $z_1 > 0$, the gradient passes through unchanged. Where $z_1 \leq 0$, the gradient is zeroed out. This is why neurons can "die" during training: if a neuron's pre-activation is always negative, it permanently stops learning.
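
The gating behavior is easy to see directly. A small illustrative sketch:

```python
import numpy as np

z1 = np.array([2.0, -3.0, 0.0, 1.5])       # pre-activations from layer 1
upstream = np.array([0.4, 0.4, 0.4, 0.4])  # gradient arriving from the layer above

relu_mask = (z1 > 0).astype(float)  # 1 where the neuron fired, 0 elsewhere
dz1 = upstream * relu_mask

print(dz1)  # gradients for the neurons with z1 <= 0 are zeroed out
```

Note the neuron with `z1 = 0.0` is also gated off: the convention here (matching `z > 0` in the solution) assigns ReLU a zero derivative at zero.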

---

## Solution

### Intuition

Run the forward pass to get all intermediate values ($z_1$, $a_1$, $z_2$), compute MSE loss, then walk backward applying the chain rule at each layer. Each layer's weight gradient is the outer product of the incoming error signal and the layer's input.

### Implementation

::tabs-start
```python
import numpy as np
from typing import List


class Solution:
    def forward_and_backward(self,
                             x: List[float],
                             W1: List[List[float]], b1: List[float],
                             W2: List[List[float]], b2: List[float],
                             y_true: List[float]) -> dict:
        x = np.array(x)
        W1 = np.array(W1)
        b1 = np.array(b1)
        W2 = np.array(W2)
        b2 = np.array(b2)
        y_true = np.array(y_true)

        # Forward pass
        z1 = x @ W1.T + b1       # pre-activation layer 1
        a1 = np.maximum(0, z1)   # ReLU activation
        z2 = a1 @ W2.T + b2      # output (predictions)
        loss = np.mean((z2 - y_true) ** 2)

        # Backward pass
        n = len(y_true) if y_true.ndim > 0 else 1
        dz2 = 2 * (z2 - y_true) / n                   # dL/dz2
        dW2 = dz2.reshape(-1, 1) @ a1.reshape(1, -1)  # dL/dW2
        db2 = dz2                                     # dL/db2

        da1 = dz2.reshape(1, -1) @ W2                 # dL/da1
        da1 = da1.flatten()
        dz1 = da1 * (z1 > 0).astype(float)            # ReLU derivative
        dW1 = dz1.reshape(-1, 1) @ x.reshape(1, -1)   # dL/dW1
        db1 = dz1                                     # dL/db1

        return {
            'loss': round(float(loss), 4),
            'dW1': np.round(dW1, 4).tolist(),
            'db1': np.round(db1, 4).tolist(),
            'dW2': np.round(dW2, 4).tolist(),
            'db2': np.round(db2, 4).tolist(),
        }
```
::tabs-end


### Walkthrough

Given `x = [1.0, 2.0]`, `W1 = [[1, 0], [0, 1]]` (identity), `b1 = [0, 0]`, `W2 = [[0.5, 0.5]]`, `b2 = [0]`, `y_true = [1.0]`:

| Step | Operation | Result |
|---|---|---|
| Layer 1 linear | $z_1 = x \cdot W_1^T + b_1$ | $[1.0, 2.0]$ |
| ReLU | $a_1 = \max(0, z_1)$ | $[1.0, 2.0]$ (all positive, all pass) |
| Layer 2 linear | $z_2 = a_1 \cdot W_2^T + b_2$ | $[1.5]$ |
| MSE loss | $(1.5 - 1.0)^2$ | $0.25$ |
| Output gradient | $\frac{2(1.5 - 1.0)}{1}$ | $dz_2 = [1.0]$ |
| Layer 2 weights | $[1.0] \cdot [1.0, 2.0]$ | $dW_2 = [[1.0, 2.0]]$ |
| Gradient to $a_1$ | $[1.0] \cdot W_2 = [0.5, 0.5]$ | Passes through ReLU (mask is all 1s) |
| Layer 1 weights | $[0.5, 0.5]^T \cdot [1.0, 2.0]$ | $dW_1 = [[0.5, 1.0], [0.5, 1.0]]$ |
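
These numbers can be checked by running the walkthrough inputs through NumPy directly:

```python
import numpy as np

x = np.array([1.0, 2.0])
W1 = np.eye(2); b1 = np.zeros(2)           # identity first layer
W2 = np.array([[0.5, 0.5]]); b2 = np.zeros(1)
y_true = np.array([1.0])

# Forward pass
z1 = x @ W1.T + b1
a1 = np.maximum(0, z1)
z2 = a1 @ W2.T + b2                        # [1.5]

# Backward pass
dz2 = 2 * (z2 - y_true) / len(y_true)      # [1.0]
dW2 = dz2.reshape(-1, 1) @ a1.reshape(1, -1)
dz1 = (dz2.reshape(1, -1) @ W2).flatten() * (z1 > 0)
dW1 = dz1.reshape(-1, 1) @ x.reshape(1, -1)

print(dW2)  # [[1. 2.]]
print(dW1)  # [[0.5 1. ], [0.5 1. ]]
```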

### Time & Space Complexity

- Time: $O(d_1 \cdot d_2 + d_2 \cdot d_3)$ where $d_i$ are layer dimensions (matrix multiplications dominate)
- Space: $O(d_1 \cdot d_2 + d_2 \cdot d_3)$ for the gradient matrices

---

## Common Pitfalls

### Forgetting the ReLU Mask

The ReLU derivative is not 1 everywhere. Where $z_1 \leq 0$, the gradient must be zeroed. Omitting this gives incorrect gradients for any neuron that was in the "dead zone."

::tabs-start
```python
# Wrong: gradient flows through regardless of ReLU
dz1 = da1 # ignores the ReLU mask entirely

# Correct: multiply by the ReLU indicator
dz1 = da1 * (z1 > 0).astype(float)
```
::tabs-end


### Wrong Reshape for Outer Product

The weight gradient $dW = \delta^T \cdot x$ requires the error signal as a column vector and the input as a row vector. Forgetting to reshape gives either a scalar (dot product) or an error.

::tabs-start
```python
# Wrong: this computes a dot product (scalar), not a matrix
dW2 = dz2 @ a1

# Correct: outer product via reshape
dW2 = dz2.reshape(-1, 1) @ a1.reshape(1, -1)
```
::tabs-end


---

## In the GPT Project

This becomes `foundations/multi_layer_backprop.py`. Understanding multi-layer backprop is what makes the rest of the course click: when you call `loss.backward()` in PyTorch, it is doing exactly these chain-rule computations automatically through the entire transformer. The ReLU dead-zone issue you encounter here also appears in the feed-forward network inside each transformer block.

---

## Key Takeaways

- Multi-layer backpropagation is the same chain rule as single-neuron backprop, just applied to more links. Each layer's weight gradient is the outer product of the error signal arriving from above and the activation arriving from below.
- The ReLU derivative acts as a binary gate: gradients flow through neurons that fired ($z > 0$) and are killed for neurons that didn't. This is computationally cheap but creates the dead neuron problem.
- Saving intermediate values ($z_1$, $a_1$) during the forward pass is essential. You need them to compute gradients during the backward pass. This is why training uses more memory than inference.
5 changes: 3 additions & 2 deletions articles/transformer-block.md
@@ -91,13 +91,14 @@ class TransformerBlock(nn.Module):
self.att_heads = nn.ModuleList()
for i in range(num_heads):
self.att_heads.append(self.SingleHeadAttention(model_dim, model_dim // num_heads))
self.output_proj = nn.Linear(model_dim, model_dim, bias=False)

def forward(self, embedded: TensorType[float]) -> TensorType[float]:
head_outputs = []
for head in self.att_heads:
head_outputs.append(head(embedded))
concatenated = torch.cat(head_outputs, dim = 2)
return concatenated
return self.output_proj(concatenated)

class VanillaNeuralNetwork(nn.Module):

@@ -123,7 +124,7 @@ For `model_dim = 8` and `num_heads = 2`, with input shape $(B, T, 8)$:
| Step | Operation | Shape |
|---|---|---|
| LayerNorm 1 | Normalize across dim 8 | $(B, T, 8)$ |
| Multi-Head Attention | 2 heads, each head_size=4, concat | $(B, T, 8)$ |
| Multi-Head Attention | 2 heads, each head_size=4, concat + $W^O$ | $(B, T, 8)$ |
| Residual 1 | $x + \text{attention}(LN(x))$ | $(B, T, 8)$ |
| LayerNorm 2 | Normalize the sum | $(B, T, 8)$ |
| FFN up-project | Linear $8 \to 32$ + ReLU | $(B, T, 32)$ |