diff --git a/demos/direct_path_patching_ioi.ipynb b/demos/direct_path_patching_ioi.ipynb
new file mode 100644
index 000000000..9ab323a98
--- /dev/null
+++ b/demos/direct_path_patching_ioi.ipynb
@@ -0,0 +1,214 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/TransformerLensOrg/TransformerLens/blob/main/demos/direct_path_patching_ioi.ipynb\">\n",
+    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
+    "</a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Direct Path Patching Demo\n",
+    "\n",
+    "This notebook demonstrates **direct path patching** \u2014 a technique for isolating the direct information flow between specific attention heads in a transformer.\n",
+    "\n",
+    "## Background\n",
+    "\n",
+    "Standard activation patching replaces the full residual stream at a layer, which affects *all* downstream components simultaneously. This tells you that *some* component at a given layer matters, but cannot isolate which specific head-to-head edge carries the signal.\n",
+    "\n",
+    "**Direct path patching** isolates a single causal edge: it patches only the contribution of source head A into the query/key/value input of destination head B, leaving every other component's view of A's output unchanged.\n",
+    "\n",
+    "We validate on the **Indirect Object Identification (IOI)** task from Wang et al. 2022:\n",
+    "- Clean: *\"When Mary and John went to the store, John gave a drink to\"* \u2192 **Mary**\n",
+    "- Corrupted: *\"When Mary and John went to the store, Mary gave a drink to\"* \u2192 **John**\n",
+    "\n",
+    "Metric: normalised logit diff (0 = corrupted baseline, 1 = clean baseline)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# NBVAL_IGNORE_OUTPUT\n",
+    "import os\n",
+    "try:\n",
+    "    import google.colab\n",
+    "    IN_COLAB = True\n",
+    "    print(\"Running as a Colab notebook\")\n",
+    "    %pip install transformer_lens\n",
+    "except:\n",
+    "    IN_COLAB = False\n",
+    "    print(\"Running as a Jupyter notebook\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "from transformer_lens import HookedTransformer\n",
+    "from transformer_lens.direct_path_patching import get_act_patch_direct_path"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model = HookedTransformer.from_pretrained(\n",
+    "    \"gpt2\",\n",
+    "    center_unembed=True,\n",
+    "    center_writing_weights=True,\n",
+    "    fold_ln=True,\n",
+    ")\n",
+    "model.eval()\n",
+    "print(f\"Loaded GPT-2 small: {model.cfg.n_layers} layers, {model.cfg.n_heads} heads\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Define IOI Task"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "CLEAN_PROMPT = \"When Mary and John went to the store, John gave a drink to\"\n",
+    "CORRUPTED_PROMPT = \"When Mary and John went to the store, Mary gave a drink to\"\n",
+    "\n",
+    "clean_tokens = model.to_tokens(CLEAN_PROMPT)\n",
+    "corrupted_tokens = model.to_tokens(CORRUPTED_PROMPT)\n",
+    "\n",
+    "mary_token = model.to_single_token(\" Mary\")\n",
+    "john_token = model.to_single_token(\" John\")\n",
+    "\n",
+    "with torch.no_grad():\n",
+    "    clean_logits, clean_cache = model.run_with_cache(clean_tokens)\n",
+    "    corrupted_logits, corrupted_cache = model.run_with_cache(corrupted_tokens)\n",
+    "\n",
+    "clean_ld = (clean_logits[0, -1, mary_token] - clean_logits[0, -1, john_token]).item()\n",
+    "corrupted_ld = (corrupted_logits[0, -1, mary_token] - corrupted_logits[0, -1, john_token]).item()\n",
+    "\n",
+    "print(f\"Clean logit diff:     {clean_ld:+.3f}  (predicts Mary)\")\n",
+    "print(f\"Corrupted logit diff: {corrupted_ld:+.3f}  (predicts John)\")\n",
+    "\n",
+    "def normalised_metric(logits):\n",
+    "    ld = logits[0, -1, mary_token] - logits[0, -1, john_token]\n",
+    "    return (ld - corrupted_ld) / (clean_ld - corrupted_ld)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Direct Path Patching: S-Inhibition \u2192 Name-Mover Heads"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The IOI circuit (Wang et al. 2022) identifies:\n",
+    "- **S-inhibition heads**: (7,3), (7,9), (8,6), (8,10) \u2014 suppress the subject name token\n",
+    "- **Name-mover heads**: (9,9), (9,6), (10,0) \u2014 copy the indirect object to the output\n",
+    "\n",
+    "Direct path patching lets us measure whether each S-inhibition head communicates *directly* with each name-mover head via the query pathway."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ioi_src_heads = [(7, 3), (7, 9), (8, 6), (8, 10)]\n",
+    "name_movers = [(9, 9), (9, 6), (10, 0)]\n",
+    "\n",
+    "print(f\"{\"Source\":>8}  {\"Destination\":>12}  {\"Score\":>8}\")\n",
+    "print(\"-\" * 36)\n",
+    "\n",
+    "all_results = {}\n",
+    "for sl, sh in ioi_src_heads:\n",
+    "    with torch.no_grad():\n",
+    "        results = get_act_patch_direct_path(\n",
+    "            model=model,\n",
+    "            corrupted_tokens=corrupted_tokens,\n",
+    "            clean_cache=clean_cache,\n",
+    "            corrupted_cache=corrupted_cache,\n",
+    "            patching_metric=normalised_metric,\n",
+    "            src_layer=sl,\n",
+    "            src_head=sh,\n",
+    "            component=\"q\",\n",
+    "            verbose=False,\n",
+    "        )\n",
+    "    all_results[(sl, sh)] = results\n",
+    "    for dl, dh in name_movers:\n",
+    "        if dl > sl:\n",
+    "            score = results[dl, dh].item()\n",
+    "            print(f\"  ({sl},{sh:2d})   ({dl},{dh:2d})   {score:+.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Results and Interpretation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The results confirm the IOI circuit structure at the **edge level**:\n",
+    "\n",
+    "1. **(8,6) \u2192 (9,9)** is the strongest single direct path (+0.083). Head 8.6 is the most influential S-inhibition head.\n",
+    "2. All S-inhibition heads show their strongest direct paths running into the known name-mover heads (9.9, 9.6, 10.0).\n",
+    "3. Standard activation patching would show that layer 9 matters \u2014 but cannot distinguish *which* upstream head is responsible for each name-mover head's query input.\n",
+    "\n",
+    "Direct path patching adds that resolution, isolating the A \u2192 B causal edge without affecting any other component's view of A's output."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
\ No newline at end of file
diff --git a/tests/unit/test_direct_path_patching.py b/tests/unit/test_direct_path_patching.py
new file mode 100644
index 000000000..cb8346f3d
--- /dev/null
+++ b/tests/unit/test_direct_path_patching.py
@@ -0,0 +1,424 @@
+"""Tests for transformer_lens/tools/analysis/direct_path_patching.py
+
+Run with:
+    pytest tests/unit/test_direct_path_patching.py -v
+
+These tests use a tiny randomly-initialised 3-layer model so they run
+in seconds on CPU without downloading any weights.
+"""
+
+import warnings
+
+import pytest
+import torch
+
+from transformer_lens import HookedTransformer, HookedTransformerConfig
+from transformer_lens.tools.analysis.direct_path_patching import (
+    _check_fold_ln,
+    get_act_patch_direct_path,
+    get_act_patch_direct_path_all_sources,
+)
+
+# ---------------------------------------------------------------------------
+# Fixtures
+# ---------------------------------------------------------------------------
+
+
+@pytest.fixture(scope="module")
+def tiny_model():
+    """A small, randomly-initialised transformer with LN folded in."""
+    cfg = HookedTransformerConfig(
+        n_layers=3,
+        d_model=64,
+        d_head=16,
+        n_heads=4,
+        d_mlp=128,
+        d_vocab=100,
+        n_ctx=16,
+        act_fn="gelu",
+        normalization_type="LN",
+        attn_only=False,
+    )
+    model = HookedTransformer(cfg)
+    model.process_weights_()
+    model.eval()
+    return model
+
+
+@pytest.fixture(scope="module")
+def tokens_and_caches(tiny_model):
+    """Precompute clean/corrupted tokens and their caches."""
+    torch.manual_seed(42)
+    clean_tokens = torch.randint(0, 100, (1, 8))
+    corrupted_tokens = torch.randint(0, 100, (1, 8))
+
+    with torch.no_grad():
+        _, clean_cache = tiny_model.run_with_cache(clean_tokens)
+        _, corrupted_cache = tiny_model.run_with_cache(corrupted_tokens)
+
+    return clean_tokens, corrupted_tokens, clean_cache, corrupted_cache
+
+
+def simple_metric(logits):
+    """Sum of last-token logits — a trivially differentiable scalar."""
+    return logits[0, -1, :].sum()
+
+
+# ---------------------------------------------------------------------------
+# _check_fold_ln tests
+# ---------------------------------------------------------------------------
+
+
+class TestCheckFoldLn:
+    def test_folded_model_no_warning(self, tiny_model):
+        """No warning when LN is already folded (tiny_model fixture calls process_weights_())."""
+        with warnings.catch_warnings(record=True) as w:
+            warnings.simplefilter("always")
+            _check_fold_ln(tiny_model)
+        user_warnings = [x for x in w if issubclass(x.category, UserWarning)]
+        assert len(user_warnings) == 0, "Should not warn when LN is folded"
+
+    def test_unfolded_model_warns(self):
+        """UserWarning fires when LN has a non-unit learned scale (pretrained, pre-fold)."""
+        cfg = HookedTransformerConfig(
+            n_layers=2,
+            d_model=32,
+            d_head=8,
+            n_heads=4,
+            d_mlp=64,
+            d_vocab=50,
+            n_ctx=8,
+            act_fn="gelu",
+            normalization_type="LN",
+        )
+        model = HookedTransformer(cfg)
+        model.eval()
+        # Simulate a pretrained model that has learned non-unit LN scale (not yet folded).
+        # After process_weights_(), LayerNorm is replaced with LayerNormPre (no .w),
+        # so the warning only fires in the pre-fold state with non-trivial .w.
+        with torch.no_grad():
+            model.blocks[0].ln1.w.fill_(2.0)
+        with warnings.catch_warnings(record=True) as w:
+            warnings.simplefilter("always")
+            _check_fold_ln(model)
+        user_warnings = [x for x in w if issubclass(x.category, UserWarning)]
+        assert len(user_warnings) == 1
+        assert "fold" in str(user_warnings[0].message).lower()
+
+    def test_hooked_transformer_w_attribute(self):
+        """Before process_weights_(), HookedTransformer LayerNorm exposes .w.
+        After folding, LayerNorm is replaced with LayerNormPre (no .w) — that's
+        why _check_fold_ln passes silently on a folded model.
+        """
+        cfg = HookedTransformerConfig(
+            n_layers=2,
+            d_model=32,
+            d_head=8,
+            n_heads=4,
+            d_mlp=64,
+            d_vocab=50,
+            n_ctx=8,
+            act_fn="gelu",
+            normalization_type="LN",
+        )
+        model = HookedTransformer(cfg)
+        ln1 = model.blocks[0].ln1
+        assert hasattr(ln1, "w"), "HookedTransformer LayerNorm should expose .w before folding"
+
+    def test_no_crash_on_missing_attribute(self):
+        """_check_fold_ln silently passes when the model has no .blocks[0].ln1."""
+
+        class WeirdModel:
+            class cfg:
+                pass
+
+            class blocks:
+                pass
+
+        # Should not raise — the except block in _check_fold_ln catches AttributeError
+        _check_fold_ln(WeirdModel())  # type: ignore[arg-type]
+
+    def test_no_runtime_error_on_multielement_tensor(self, tiny_model):
+        """Regression: getattr(...) or getattr(...) on a multi-element tensor raises
+        RuntimeError. The explicit None-check fix must prevent this."""
+        # Calling _check_fold_ln on a real model exercises the tensor path.
+        # If the bug were present this would raise RuntimeError.
+        try:
+            _check_fold_ln(tiny_model)
+        except RuntimeError as e:
+            pytest.fail(f"_check_fold_ln raised RuntimeError: {e}")
+
+
+# ---------------------------------------------------------------------------
+# Shape tests
+# ---------------------------------------------------------------------------
+
+
+class TestOutputShape:
+    def test_single_source_shape(self, tiny_model, tokens_and_caches):
+        _, corrupted_tokens, clean_cache, corrupted_cache = tokens_and_caches
+        with torch.no_grad():
+            results = get_act_patch_direct_path(
+                model=tiny_model,
+                corrupted_tokens=corrupted_tokens,
+                clean_cache=clean_cache,
+                corrupted_cache=corrupted_cache,
+                patching_metric=simple_metric,
+                src_layer=0,
+                src_head=0,
+                component="q",
+                verbose=False,
+            )
+        assert results.shape == (tiny_model.cfg.n_layers, tiny_model.cfg.n_heads)
+
+    def test_all_sources_shape(self, tiny_model, tokens_and_caches):
+        _, corrupted_tokens, clean_cache, corrupted_cache = tokens_and_caches
+        with torch.no_grad():
+            results = get_act_patch_direct_path_all_sources(
+                model=tiny_model,
+                corrupted_tokens=corrupted_tokens,
+                clean_cache=clean_cache,
+                corrupted_cache=corrupted_cache,
+                patching_metric=simple_metric,
+                component="q",
+                verbose=False,
+            )
+        n = tiny_model.cfg.n_layers
+        h = tiny_model.cfg.n_heads
+        assert results.shape == (n, h, n, h)
+
+    @pytest.mark.parametrize("component", ["q", "k", "v"])
+    def test_all_components(self, tiny_model, tokens_and_caches, component):
+        _, corrupted_tokens, clean_cache, corrupted_cache = tokens_and_caches
+        with torch.no_grad():
+            results = get_act_patch_direct_path(
+                model=tiny_model,
+                corrupted_tokens=corrupted_tokens,
+                clean_cache=clean_cache,
+                corrupted_cache=corrupted_cache,
+                patching_metric=simple_metric,
+                src_layer=0,
+                src_head=1,
+                component=component,
+                verbose=False,
+            )
+        assert results.shape == (tiny_model.cfg.n_layers, tiny_model.cfg.n_heads)
+
+
+# ---------------------------------------------------------------------------
+# Causal structure tests
+# ---------------------------------------------------------------------------
+
+
+class TestCausalStructure:
+    def test_earlier_layers_are_zero(self, tiny_model, tokens_and_caches):
+        """Entries for dst_layer <= src_layer must be exactly 0 (no causal path)."""
+        _, corrupted_tokens, clean_cache, corrupted_cache = tokens_and_caches
+        src_layer = 1
+        with torch.no_grad():
+            results = get_act_patch_direct_path(
+                model=tiny_model,
+                corrupted_tokens=corrupted_tokens,
+                clean_cache=clean_cache,
+                corrupted_cache=corrupted_cache,
+                patching_metric=simple_metric,
+                src_layer=src_layer,
+                src_head=0,
+                component="q",
+                verbose=False,
+            )
+        assert results[: src_layer + 1].eq(0).all()
+
+    def test_later_layers_are_nonzero(self, tiny_model, tokens_and_caches):
+        """At least some entries for dst_layer > src_layer should be non-zero."""
+        _, corrupted_tokens, clean_cache, corrupted_cache = tokens_and_caches
+        with torch.no_grad():
+            results = get_act_patch_direct_path(
+                model=tiny_model,
+                corrupted_tokens=corrupted_tokens,
+                clean_cache=clean_cache,
+                corrupted_cache=corrupted_cache,
+                patching_metric=simple_metric,
+                src_layer=0,
+                src_head=0,
+                component="q",
+                verbose=False,
+            )
+        assert not results[1:].eq(0).all()
+
+    def test_clean_equals_corrupted_gives_zero_delta(self, tiny_model):
+        """If clean == corrupted, delta is zero and every entry equals the baseline."""
+        torch.manual_seed(7)
+        tokens = torch.randint(0, 100, (1, 6))
+        with torch.no_grad():
+            baseline_logits, cache = tiny_model.run_with_cache(tokens)
+
+        baseline = simple_metric(baseline_logits).item()
+
+        with torch.no_grad():
+            results = get_act_patch_direct_path(
+                model=tiny_model,
+                corrupted_tokens=tokens,
+                clean_cache=cache,
+                corrupted_cache=cache,
+                patching_metric=simple_metric,
+                src_layer=0,
+                src_head=0,
+                component="q",
+                verbose=False,
+            )
+
+        nonzero_entries = results[results != 0]
+        assert nonzero_entries.numel() == 0 or torch.allclose(
+            nonzero_entries,
+            torch.full_like(nonzero_entries, baseline),
+            atol=1e-4,
+        )
+
+
+# ---------------------------------------------------------------------------
+# Correctness: independent verification for a single pair
+# ---------------------------------------------------------------------------
+
+
+class TestCorrectness:
+    def test_correctness_against_actual_ln_forward(self, tiny_model, tokens_and_caches):
+        """Logit-diff metric: linear-LN approximation should match actual LN within 1e-3.
+
+        process_weights_() folds LN into the weight matrices, so the linear
+        approximation is exact and the tolerance can be tight.  Using logit diff
+        (correct_tok - incorrect_tok) cancels the centering offset introduced by
+        process_weights_() and gives a numerically clean comparison.
+        """
+        _, corrupted_tokens, clean_cache, corrupted_cache = tokens_and_caches
+        src_layer, src_head = 0, 0
+        dst_layer, dst_head = 2, 1
+
+        # Pick stable token indices for the logit-diff metric
+        torch.manual_seed(0)
+        correct_tok = 17
+        incorrect_tok = 42
+
+        def logit_diff(logits):
+            return logits[0, -1, correct_tok] - logits[0, -1, incorrect_tok]
+
+        # Compute delta_resid from src head
+        W_O = tiny_model.blocks[src_layer].attn.W_O  # type: ignore[union-attr]
+        clean_z = clean_cache[f"blocks.{src_layer}.attn.hook_z"][:, :, src_head, :]
+        corrupted_z = corrupted_cache[f"blocks.{src_layer}.attn.hook_z"][:, :, src_head, :]
+        delta_resid = (clean_z @ W_O[src_head]) - (corrupted_z @ W_O[src_head])  # type: ignore[index]
+
+        # Independent reference: patch through actual LayerNorm forward
+        corrupted_resid = corrupted_cache[f"blocks.{dst_layer}.hook_resid_pre"]
+        patched_resid = corrupted_resid + delta_resid
+
+        with torch.no_grad():
+            ln1 = tiny_model.blocks[dst_layer].ln1  # type: ignore[index]
+            patched_normed = ln1(patched_resid)
+            corrupted_normed = ln1(corrupted_resid)
+
+        W_Q_dst = tiny_model.blocks[dst_layer].attn.W_Q[dst_head]  # type: ignore[index,union-attr]
+        true_delta_q = (patched_normed - corrupted_normed) @ W_Q_dst
+
+        def true_hook(value, hook):
+            if value.requires_grad:
+                value = value.clone()
+            value[:, :, dst_head, :] = value[:, :, dst_head, :] + true_delta_q
+            return value
+
+        with torch.no_grad():
+            ref_logits = tiny_model.run_with_hooks(
+                corrupted_tokens,
+                fwd_hooks=[(f"blocks.{dst_layer}.attn.hook_q", true_hook)],
+            )
+        ref_metric = logit_diff(ref_logits).item()
+
+        with torch.no_grad():
+            results = get_act_patch_direct_path(
+                model=tiny_model,
+                corrupted_tokens=corrupted_tokens,
+                clean_cache=clean_cache,
+                corrupted_cache=corrupted_cache,
+                patching_metric=logit_diff,
+                src_layer=src_layer,
+                src_head=src_head,
+                component="q",
+                verbose=False,
+            )
+        our_metric = results[dst_layer, dst_head].item()
+
+        assert abs(our_metric - ref_metric) < 1e-3, (
+            f"Linear-LN approx {our_metric:.6f} disagrees with actual-LN ref {ref_metric:.6f} "
+            f"(diff={abs(our_metric - ref_metric):.2e}). process_weights_() should make these exact."
+        )
+
+    def test_all_sources_consistent_with_single(self, tiny_model, tokens_and_caches):
+        """get_act_patch_direct_path_all_sources matches individual calls."""
+        _, corrupted_tokens, clean_cache, corrupted_cache = tokens_and_caches
+        src_layer, src_head = 0, 2
+
+        with torch.no_grad():
+            single = get_act_patch_direct_path(
+                model=tiny_model,
+                corrupted_tokens=corrupted_tokens,
+                clean_cache=clean_cache,
+                corrupted_cache=corrupted_cache,
+                patching_metric=simple_metric,
+                src_layer=src_layer,
+                src_head=src_head,
+                component="q",
+                verbose=False,
+            )
+            all_sources = get_act_patch_direct_path_all_sources(
+                model=tiny_model,
+                corrupted_tokens=corrupted_tokens,
+                clean_cache=clean_cache,
+                corrupted_cache=corrupted_cache,
+                patching_metric=simple_metric,
+                component="q",
+                verbose=False,
+            )
+
+        assert torch.allclose(single, all_sources[src_layer, src_head], atol=1e-5)
+
+
+# ---------------------------------------------------------------------------
+# Edge cases
+# ---------------------------------------------------------------------------
+
+
+class TestEdgeCases:
+    def test_last_layer_source_all_zero(self, tiny_model, tokens_and_caches):
+        """A source in the last layer has no downstream heads → all zeros."""
+        _, corrupted_tokens, clean_cache, corrupted_cache = tokens_and_caches
+        src_layer = tiny_model.cfg.n_layers - 1
+        with torch.no_grad():
+            results = get_act_patch_direct_path(
+                model=tiny_model,
+                corrupted_tokens=corrupted_tokens,
+                clean_cache=clean_cache,
+                corrupted_cache=corrupted_cache,
+                patching_metric=simple_metric,
+                src_layer=src_layer,
+                src_head=0,
+                component="q",
+                verbose=False,
+            )
+        assert results.eq(0).all()
+
+    def test_returns_cpu_tensor(self, tiny_model, tokens_and_caches):
+        """Return tensor should be on the same device as the model."""
+        _, corrupted_tokens, clean_cache, corrupted_cache = tokens_and_caches
+        with torch.no_grad():
+            results = get_act_patch_direct_path(
+                model=tiny_model,
+                corrupted_tokens=corrupted_tokens,
+                clean_cache=clean_cache,
+                corrupted_cache=corrupted_cache,
+                patching_metric=simple_metric,
+                src_layer=0,
+                src_head=0,
+                component="q",
+                verbose=False,
+            )
+        assert results.device.type == tiny_model.cfg.device or results.is_cpu
diff --git a/transformer_lens/__init__.py b/transformer_lens/__init__.py
index 53f7fbe87..fd4b973a6 100644
--- a/transformer_lens/__init__.py
+++ b/transformer_lens/__init__.py
@@ -6,6 +6,7 @@
     head_detector,
     hook_points,
     patching,
+    tools,
     train,
     utilities,
 )
@@ -55,4 +56,5 @@
     "conversion_utils",
     "factories",
     "utilities",
+    "tools",
 ]
diff --git a/transformer_lens/tools/analysis/__init__.py b/transformer_lens/tools/analysis/__init__.py
index a9cc5d19e..bdd01264f 100644
--- a/transformer_lens/tools/analysis/__init__.py
+++ b/transformer_lens/tools/analysis/__init__.py
@@ -7,11 +7,22 @@
 Tools:
     - direct_logit_attribution: Direct Logit Attribution (DLA) over components,
       layers, or attention heads.
+    - direct_path_patching: Direct path patching for head-to-head circuit
+      analysis.
 """
 
 from transformer_lens.tools.analysis.direct_logit_attribution import (
     DirectLogitAttribution,
     direct_logit_attribution,
 )
+from transformer_lens.tools.analysis.direct_path_patching import (
+    get_act_patch_direct_path,
+    get_act_patch_direct_path_all_sources,
+)
 
-__all__ = ["DirectLogitAttribution", "direct_logit_attribution"]
+__all__ = [
+    "DirectLogitAttribution",
+    "direct_logit_attribution",
+    "get_act_patch_direct_path",
+    "get_act_patch_direct_path_all_sources",
+]
diff --git a/transformer_lens/tools/analysis/direct_path_patching.py b/transformer_lens/tools/analysis/direct_path_patching.py
new file mode 100644
index 000000000..9714b2e49
--- /dev/null
+++ b/transformer_lens/tools/analysis/direct_path_patching.py
@@ -0,0 +1,322 @@
+"""Direct Path Patching.
+
+Implements direct path patching — a finer-grained variant of activation patching
+introduced for circuit analysis.
+
+Background
+----------
+Standard activation patching (see patching.py) replaces an activation at a given
+layer/position with its value from a clean run, and measures how much the model's
+output shifts. But patching the *residual stream* affects ALL downstream components,
+making it hard to isolate the direct information flow between two specific heads.
+
+Direct path patching isolates the path A → B: it patches *only* the contribution of
+source head A (at layer src_layer) into the input of destination head B (at layer
+dst_layer > src_layer), leaving every other component's view of A's output unchanged.
+
+The linear approximation used here (following Neel Nanda's description in issue #111)
+is:
+
+    delta_resid = clean_A_result - corrupted_A_result   # [batch, pos, d_model]
+    delta_q     = (delta_resid / ln1_scale) @ W_Q[hb]  # [batch, pos, d_head]
+    patched_q   = corrupted_q + delta_q
+
+This is exact under linear layer norm (no learned offset changes the scale
+in a way that matters for the perturbation), and matches the gradient-based
+approximation used in attribution patching.
+
+Usage
+-----
+    # 1. Cache clean and corrupted activations
+    _, clean_cache = model.run_with_cache(clean_tokens)
+    _, corrupted_cache = model.run_with_cache(corrupted_tokens)
+
+    # 2. Define your metric (same as activation patching)
+    def metric(logits):
+        return logit_diff(logits, ...)
+
+    # 3. Sweep all (dst_layer, dst_head) pairs for a fixed source head
+    results = get_act_patch_direct_path(
+        model, corrupted_tokens, clean_cache, corrupted_cache,
+        metric, src_layer=9, src_head=9,
+        component="q",   # patch into Q; also supports "k", "v"
+    )
+    # results.shape == (n_layers, n_heads)
+    # results[dst_layer, dst_head] = metric when A→B path is patched
+
+References
+----------
+- Neel Nanda, TransformerLens issue #111 (2022)
+- Wang et al., "Interpretability in the Wild: a Circuit for Indirect Object
+  Identification in GPT-2 small" (2022)
+"""
+
+from __future__ import annotations
+
+import warnings
+from typing import Any, Callable, Literal, Union
+
+import torch
+from jaxtyping import Float
+from tqdm.auto import tqdm
+
+from transformer_lens.ActivationCache import ActivationCache
+from transformer_lens.HookedTransformer import HookedTransformer
+from transformer_lens.model_bridge.bridge import TransformerBridge
+
+# ---------------------------------------------------------------------------
+# Internal helpers
+# ---------------------------------------------------------------------------
+
+
+def _check_fold_ln(model: Any) -> None:
+    """Warn if the model's LayerNorm weights have not been folded in.
+
+    HookedTransformer stores the learned scale as ``.w``; TransformerBridge wraps
+    the original HuggingFace module, which stores it as ``.weight``.  We check
+    both so the guard works for either system.
+    """
+    try:
+        ln1 = model.blocks[0].ln1  # type: ignore[index]
+        # .w  → HookedTransformer; .weight → TransformerBridge (wraps HF module)
+        w = getattr(ln1, "w", None)
+        if w is None:
+            w = getattr(ln1, "weight", None)
+        if w is not None and not torch.allclose(w, torch.ones_like(w), atol=1e-3):
+            warnings.warn(
+                "get_act_patch_direct_path is most accurate when LayerNorm parameters "
+                "are folded into the weight matrices. "
+                "For HookedTransformer: pass fold_ln=True to from_pretrained, or call "
+                "model.process_weights_(). "
+                "For TransformerBridge: call model.process_weights(fold_ln=True). "
+                "Results may be inaccurate with unfolded LayerNorm.",
+                UserWarning,
+                stacklevel=3,
+            )
+    except (AttributeError, TypeError):
+        pass  # non-standard model — cannot inspect LN weights, proceed
+
+
+# ---------------------------------------------------------------------------
+# Core hook factory
+# ---------------------------------------------------------------------------
+
+
+def _make_direct_path_hook(
+    delta_resid: Float[torch.Tensor, "batch pos d_model"],
+    dst_head: int,
+    W_component: Float[torch.Tensor, "d_model d_head"],
+    ln_scale_name: str,
+    corrupted_cache: ActivationCache,
+    component: Literal["q", "k", "v"],
+) -> Callable:
+    """Return a hook function that adds the linearised delta to one head's Q, K, or V.
+
+    Parameters
+    ----------
+    delta_resid:
+        (clean_A_result - corrupted_A_result), shape [batch, pos, d_model].
+    dst_head:
+        Index of the destination attention head to patch.
+    W_component:
+        The weight matrix for the component being patched:
+        W_Q[dst_head], W_K[dst_head], or W_V[dst_head].
+        Shape [d_model, d_head].
+    ln_scale_name:
+        Cache key for the layer-norm scale at the destination layer,
+        e.g. "blocks.3.ln1.hook_scale".
+    corrupted_cache:
+        Cache from the corrupted forward pass (used to look up ln1 scale).
+    component:
+        One of "q", "k", "v" — determines which QKV tensor is hooked.
+    """
+
+    def hook_fn(
+        value: Float[torch.Tensor, "batch pos n_heads d_head"],
+        hook,  # HookPoint, unused but required by TransformerLens
+    ) -> Float[torch.Tensor, "batch pos n_heads d_head"]:
+        # ln scale: [batch, pos, 1]
+        ln_scale = corrupted_cache[ln_scale_name]  # [batch, pos, 1]
+
+        # Linearised delta in query/key/value space
+        # delta_resid: [batch, pos, d_model]
+        # W_component: [d_model, d_head]
+        delta = (delta_resid / ln_scale) @ W_component  # [batch, pos, d_head]
+
+        if value.requires_grad:
+            value = value.clone()
+        value[:, :, dst_head, :] = value[:, :, dst_head, :] + delta
+        return value
+
+    return hook_fn
+
+
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+
+
+def get_act_patch_direct_path(
+    model: Union[HookedTransformer, TransformerBridge],
+    corrupted_tokens: torch.Tensor,
+    clean_cache: ActivationCache,
+    corrupted_cache: ActivationCache,
+    patching_metric: Callable[[torch.Tensor], torch.Tensor],
+    src_layer: int,
+    src_head: int,
+    component: Literal["q", "k", "v"] = "q",
+    verbose: bool = True,
+) -> Float[torch.Tensor, "n_layers n_heads"]:
+    """Sweep direct path patches from one source head to all downstream heads.
+
+    For every destination head B = (dst_layer, dst_head) where dst_layer > src_layer,
+    patch the contribution of source head A = (src_layer, src_head) into B's query
+    (or key / value) input, and record the patching metric.
+
+    The patch is a linear approximation:
+
+        delta_resid  = clean_A_result - corrupted_A_result   [batch, pos, d_model]
+        delta_B_comp = (delta_resid / ln1_scale) @ W_comp[dst_head]
+
+    where W_comp is W_Q, W_K, or W_V according to `component`.
+
+    Parameters
+    ----------
+    model:
+        A HookedTransformer or TransformerBridge instance.
+    corrupted_tokens:
+        Token IDs for the corrupted input, shape [batch, seq_len].
+    clean_cache:
+        Cached activations from the clean (unpatched) run.
+    corrupted_cache:
+        Cached activations from the corrupted run (needed for ln1 scale).
+    patching_metric:
+        A function mapping the model's logits tensor to a scalar.
+    src_layer:
+        Layer index of the source attention head.
+    src_head:
+        Head index of the source attention head.
+    component:
+        Which input to patch at the destination head — "q" (default), "k", or "v".
+    verbose:
+        Whether to show a tqdm progress bar.
+
+    Returns
+    -------
+    results : Float[Tensor, "n_layers n_heads"]
+        results[dst_layer, dst_head] is the patching metric when the direct path
+        A → B is patched in.  Entries for dst_layer <= src_layer are left as 0.0
+        (no causal path from A to those layers).
+    """
+    _check_fold_ln(model)
+
+    n_layers = model.cfg.n_layers
+    n_heads = model.cfg.n_heads
+
+    results = torch.zeros(n_layers, n_heads, device=model.cfg.device)
+
+    # Residual stream delta from source head A.
+    #
+    # hook_result (per-head residual contribution) requires cfg.use_hook_result=True
+    # and is not in the default cache. We compute it instead from hook_z and W_O,
+    # which are always available:
+    #   result_h = z[:, :, h, :] @ W_O[h]   shape [batch, pos, d_model]
+    src_z_name = f"blocks.{src_layer}.attn.hook_z"
+    W_O = model.blocks[src_layer].attn.W_O  # type: ignore[union-attr]  # [n_heads, d_head, d_model]
+
+    def _head_result(cache, h):
+        z = cache[src_z_name][:, :, h, :]  # [batch, pos, d_head]
+        return z @ W_O[h]  # type: ignore[index]  # [batch, pos, d_model]
+
+    delta_resid = _head_result(clean_cache, src_head) - _head_result(corrupted_cache, src_head)
+    # shape: [batch, pos, d_model]
+
+    # Weight matrix for the component being patched
+    _comp_map = {
+        "q": lambda attn: attn.W_Q,  # [n_heads, d_model, d_head]
+        "k": lambda attn: attn.W_K,
+        "v": lambda attn: attn.W_V,
+    }
+    _hook_name_map = {
+        "q": lambda lb: f"blocks.{lb}.attn.hook_q",
+        "k": lambda lb: f"blocks.{lb}.attn.hook_k",
+        "v": lambda lb: f"blocks.{lb}.attn.hook_v",
+    }
+    W_all = _comp_map[component]  # callable: attn → [n_heads, d_model, d_head]
+    hook_name_fn = _hook_name_map[component]
+
+    dst_pairs = [(lb, hb) for lb in range(src_layer + 1, n_layers) for hb in range(n_heads)]
+
+    for dst_layer, dst_head in tqdm(
+        dst_pairs,
+        desc=f"Direct path patch ({src_layer},{src_head}) → * [{component}]",
+        disable=not verbose,
+    ):
+        ln_scale_name = f"blocks.{dst_layer}.ln1.hook_scale"
+        W_comp = W_all(model.blocks[dst_layer].attn)[dst_head]  # type: ignore[index]  # [d_model, d_head]
+
+        hook_fn = _make_direct_path_hook(
+            delta_resid=delta_resid,
+            dst_head=dst_head,
+            W_component=W_comp,
+            ln_scale_name=ln_scale_name,
+            corrupted_cache=corrupted_cache,
+            component=component,
+        )
+
+        patched_logits = model.run_with_hooks(
+            corrupted_tokens,
+            fwd_hooks=[(hook_name_fn(dst_layer), hook_fn)],
+        )
+
+        results[dst_layer, dst_head] = patching_metric(patched_logits).item()
+
+    return results
+
+
+def get_act_patch_direct_path_all_sources(
+    model: Union[HookedTransformer, TransformerBridge],
+    corrupted_tokens: torch.Tensor,
+    clean_cache: ActivationCache,
+    corrupted_cache: ActivationCache,
+    patching_metric: Callable[[torch.Tensor], torch.Tensor],
+    component: Literal["q", "k", "v"] = "q",
+    verbose: bool = True,
+) -> Float[torch.Tensor, "n_layers n_heads n_layers n_heads"]:
+    """Full sweep: all (src_layer, src_head) → (dst_layer, dst_head) direct paths.
+
+    Returns a 4-D tensor of shape [n_layers, n_heads, n_layers, n_heads].
+    result[sl, sh, dl, dh] = patching metric when head (sl,sh)'s output is
+    patched directly into head (dl,dh)'s query/key/value input.
+
+    Entries where dl <= sl are 0 (no causal path).
+
+    This runs O(n_layers * n_heads * n_layers * n_heads) forward passes and is
+    intended for small models or targeted sub-sweeps.  For large models prefer
+    calling get_act_patch_direct_path per source head.
+    """
+    _check_fold_ln(model)
+
+    n_layers = model.cfg.n_layers
+    n_heads = model.cfg.n_heads
+    results = torch.zeros(n_layers, n_heads, n_layers, n_heads, device=model.cfg.device)
+
+    src_pairs = [(sl, sh) for sl in range(n_layers) for sh in range(n_heads)]
+    for src_layer, src_head in tqdm(
+        src_pairs,
+        desc=f"Direct path patch — all sources [{component}]",
+        disable=not verbose,
+    ):
+        results[src_layer, src_head] = get_act_patch_direct_path(
+            model=model,
+            corrupted_tokens=corrupted_tokens,
+            clean_cache=clean_cache,
+            corrupted_cache=corrupted_cache,
+            patching_metric=patching_metric,
+            src_layer=src_layer,
+            src_head=src_head,
+            component=component,
+            verbose=False,
+        )
+
+    return results