
GraphNetz

Statistically rigorous GNN benchmarking



Why GraphNetz

Most GNN benchmarks report point-estimate accuracies on a handful of citation graphs and declare a winner without confidence intervals, multiple-comparison correction, or rank aggregation across datasets. GraphNetz's default output is a structured statistical report, not a raw accuracy table:

  • multi-seed Student's t confidence intervals per cell,
  • Holm–Bonferroni-corrected paired t-tests (or Wilcoxon signed-rank tests) within each task,
  • Demšar critical-difference diagrams from Friedman ranks with a Nemenyi post-hoc test.
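
The Holm step-down correction named in the second bullet is a few lines of plain Python. A minimal sketch of the standard procedure (not GraphNetz's internal code):

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values (controls family-wise error rate).

    The i-th smallest p-value (i counted from 0, m hypotheses total) is
    multiplied by (m - i); a running maximum enforces monotonicity, and
    values are clipped at 1.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

A hypothesis is rejected at level α when its adjusted p-value falls below α, which is uniformly more powerful than plain Bonferroni at the same family-wise error rate.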

The catalogue is organised along a category × task taxonomy:

  • 63 dataset loaders across 10 scientific categories,
  • 4 task kinds (node classification, graph classification, graph regression, link prediction),
  • 5 canonical architectures (GCN, GAT, GIN, GraphSAGE, Graph Transformer), wired into the task kinds through a small set of adapters.

Install

uv add graphnetz
# or, in an existing environment:
pip install graphnetz

For local development:

git clone https://github.com/quant-sci/graphnetz
cd graphnetz
uv sync --group dev

GraphNetz requires Python ≥ 3.10, torch ≥ 2.6, and torch-geometric ≥ 2.6.

Quick start

from graphnetz import GCN, train_node_classification, plot_history
from graphnetz.datasets.social import cora

ds = cora("data/cora")
model = GCN(ds.num_features, 64, ds.num_classes)
history = train_node_classification(model, ds[0], epochs=200)
fig, ax = plot_history(history, title="GCN on Cora")

For a full benchmark run with the default statistical report:

from graphnetz import GAT, GCN, GraphSAGE, GraphTransformer, run_benchmark

report = run_benchmark(
    "social",
    {"GCN": GCN, "GAT": GAT, "GraphSAGE": GraphSAGE, "GraphTransformer": GraphTransformer},
    seeds=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    kind="node_cls",          # restrict to one task family
)
print(report.summary())       # per-(task, model) mean ± t-CI
print(report.pairwise())      # Holm-corrected paired t-tests (or Wilcoxon)
fig, _ = report.plot_critical_difference(alpha=0.05)

Task kinds

| Kind | Symbol | Metric | Examples |
|---|---|---|---|
| Node classification | node_cls | test accuracy | Cora, Roman-empire |
| Graph classification | graph_cls | val accuracy | MUTAG, MNIST-superpixels |
| Graph regression | graph_reg | val MAE | ZINC, QM9 |
| Link prediction | link_pred | test AUC | FB15k-237, Internet AS |

Unlabelled graphs (Netzschleuder, synthetic combinatorial, Ising lattice) enter the benchmark through link prediction on a held-out edge split, so every cell carries a real test-time metric — there is no self-supervised pretext loss in the headline report.
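
A held-out edge split of this shape can be sketched in plain Python (a hypothetical minimal version; a full evaluation would also sample negative, i.e. absent, edges so that test AUC is well defined):

```python
import random

def edge_split(edges, test_frac=0.2, seed=0):
    """Hold out a fraction of edges as test positives for link prediction."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n_test = int(len(edges) * test_frac)
    # Train on the remaining edges; test positives are never seen in training.
    return edges[n_test:], edges[:n_test]
```

The key property is that the test edges are disjoint from the training graph, so the headline number is a genuine held-out metric rather than a reconstruction loss.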

Dataset categories

| Category | # | Task kinds | Loaders |
|---|---|---|---|
| Combinatorial | 6 | LP | random TSP, VRP, max-flow, bipartite matching, coloring, max-cut |
| Biology | 12 | GC, GR, LP | MUTAG, PROTEINS, ENZYMES, Peptides-func/struct, PPI, C. elegans, Budapest connectome, hospital/high-school contacts, ogbg-molhiv†, ogbg-molpcba† |
| Social | 16 | NC, LP | Cora, CiteSeer, PubMed, WikiCS, Roman-empire, Amazon-ratings, Minesweeper, Tolokers, Questions, MovieLens-100k, Karate, Facebook friends, DBLP coauthor, DNC emails, ogbn-arxiv†, ogbl-collab† |
| Knowledge | 3 | LP | FB15k-237, WordNet18-RR, WordNet (Netz) |
| Infrastructure | 6 | LP | power grid, EuroRoad, US roads, EU airlines, London transport, urban streets |
| Finance | 5 | NC, LP | Elliptic Bitcoin, product space, board of directors, US patents, ogbn-products† |
| Computing | 4 | LP | Internet AS, Internet topology, AS-Skitter, route views |
| Vision | 5 | GC, NC | MNIST/CIFAR-10 superpixels, ModelNet10/40, ShapeNet |
| Physics | 3 | GR, LP | QM9, ZINC, Ising lattice |
| Security | 3 | GC, LP | MalNet-Tiny, 9/11 terrorists, train terrorists |

† Requires the optional ogb extra (pip install graphnetz[ogb]). The five OGB loaders are folded into their domain categories rather than exposed as a separate ogb category, so they appear in run_benchmark(category, ...) alongside the curated built-ins.

from graphnetz.datasets.social import cora, roman_empire
from graphnetz.datasets.biology import peptides_func
from graphnetz.datasets.computing import internet_as

# Optional OGB loaders live in their domain modules (require `pip install graphnetz[ogb]`):
from graphnetz.datasets.social import ogbn_arxiv     # node_cls
from graphnetz.datasets.biology import ogbg_molhiv   # graph_cls

ds_cora = cora("data/cora")
ds_rom  = roman_empire("data/roman_empire")        # heterophilic
ds_pep  = peptides_func("data/peptides_func")      # LRGB
ds_inet = internet_as("data/internet_as")          # Netzschleuder

For arbitrary Netzschleuder networks:

from graphnetz import Netz
ds = Netz(root="data", dataset_name="urban_streets", network_name="brasilia")

Models

| Model | Kinds | Source |
|---|---|---|
| GCN | all four | Kipf & Welling, ICLR 2017 |
| GAT | all four | Veličković et al., ICLR 2018 |
| GIN | graph_cls, graph_reg | Xu et al., ICLR 2019 |
| GraphSAGE | all four | Hamilton et al., NeurIPS 2017 |
| GraphTransformer | all four | Shi et al., 2021 |
| DGI | (utility) | Veličković et al., ICLR 2019 |

Node-level encoders enter every task kind through three small adapters: a graph-level pooling head, a dot-product link-prediction head, and the DGI self-supervised wrapper for optional unsupervised pre-training.
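
The two deterministic heads are essentially one line each. A toy numpy sketch with hypothetical shapes (the real adapters operate on torch tensors inside the training loop):

```python
import numpy as np

# Toy node embeddings from a node-level encoder: one row per node.
z = np.random.default_rng(0).normal(size=(5, 8))

# Graph-level pooling head: mean-pool node embeddings into one graph
# vector, which a classifier or regressor then consumes.
graph_repr = z.mean(axis=0)  # shape (8,)

# Dot-product link-prediction head: score an edge (u, v) by <z_u, z_v>.
def link_score(z, u, v):
    return float(z[u] @ z[v])
```

Because both heads are parameter-light, benchmark differences between cells are attributable to the encoders rather than to the task-specific plumbing.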

Custom models

from graphnetz import register_model

# 1. Decorator
@register_model(kinds="node_cls")
class MyGNN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels): ...

# 2. Class attribute (no decorator)
class MyGNN(torch.nn.Module):
    task_kinds = {"node_cls", "graph_cls"}

# 3. Inline tuple at run-time
run_benchmark(
    "social",
    {"MyGNN": (MyGNN, "node_cls",
               lambda i, h, o: MyGNN(i, h, o, dropout=0.3))},
)

The statistical report

run_benchmark(...) returns a BenchmarkReport with the following methods:

| Method | Output |
|---|---|
| report.summary(ci=0.95) | DataFrame of per-(task, model) mean ± t-CI half-width |
| report.pairwise(alpha=0.05) | Holm-corrected paired t-tests or Wilcoxon signed-rank tests within each task |
| report.plot_critical_difference() | Demšar / Nemenyi CD diagram across tasks |
| report.plot_pairwise(layout=...) | matrix or list view of pairwise significance |
| report.plot_forest() | per-task forest plot of mean ± CI |
| report.plot_learning_curves() | shared-y learning curves with t-CI bands |
| report.to_latex(path) | publication-ready LaTeX table with best results bolded |
| report.pairwise_to_latex(path) | Holm pairwise LaTeX table (parametric or non-parametric) |
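
The half-width behind summary() is the standard Student's t construction. A minimal sketch, assuming scipy is available (this is generic statistics, not GraphNetz's internal code):

```python
import math
from scipy import stats

def t_ci_halfwidth(values, ci=0.95):
    """Half-width of a Student's t confidence interval for the mean."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / (n - 1)  # sample variance
    tcrit = stats.t.ppf(0.5 + ci / 2, df=n - 1)  # two-sided critical value
    return tcrit * math.sqrt(var / n)
```

With 10 seeds (df = 9) the 95% critical value is about 2.26 rather than the normal approximation's 1.96, one reason few-seed point estimates overstate certainty.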

Notebooks

Worked examples live under examples/:

  • 01_benchmark.ipynb — the cross-category dashboard (multi-seed report, bootstrap CIs, custom-model integration).
  • 02_knowledge.ipynb — relational link prediction on FB15k-237 / WN18-RR using the DistMult decoder.

Reproducing the paper

PYTHONPATH=src uv run python paper/experiment.py   # train + cache + figures
latexmk -pdf paper/main.tex                        # compile PDF

The script trains 5 architectures × 10 seeds across the 10 surviving categories, caches the histories under paper/_cache_*.pkl, and writes every figure (paper/figures/) and LaTeX table (paper/tables/) referenced by paper/main.tex. Total runtime on a recent laptop CPU is under 30 minutes.

Contributing

Pull requests welcome. Read CONTRIBUTING.md first — the short version is: every benchmark cell must carry a real held-out metric, every change must thread through the multi-seed pipeline, and every PR must be ruff clean.

uv run pytest
uv run ruff check

License

MIT — see LICENCE.txt.
