BorgwardtLab/polygraph-benchmark

PolyGraph is a Python library for evaluating graph generative models by providing standardized datasets and metrics (including PolyGraph Discrepancy). Full documentation for this library can be found here.

PolyGraph Discrepancy (PGD) is a new metric we introduce, which provides the following advantages over maximum mean discrepancy (MMD):

| Property | MMD | PGD |
|---|---|---|
| Range | [0, ∞) | [0, 1] |
| Intrinsic Scale | ✗ | ✓ |
| Descriptor Comparison | ✗ | ✓ |
| Multi-Descriptor Aggregation | ✗ | ✓ |
| Single Ranking | ✗ | ✓ |

It also provides a number of other advantages over MMD which we discuss in our paper.
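At its core, PGD is classifier-based: a classifier is trained to distinguish reference descriptors from generated ones, and its held-out performance is mapped to [0, 1], where 0 means the two samples are indistinguishable. The toy numpy sketch below illustrates the idea with a decision stump on 1-D values; it is not PolyGraph's actual estimator, which uses graph descriptors and stronger classifiers.

```python
import numpy as np

def stump_discrepancy(ref, gen, seed=0):
    """Toy classifier two-sample discrepancy on 1-D descriptor values.

    Trains a decision stump to separate `ref` from `gen` and maps its
    held-out balanced accuracy to [0, 1] via 2 * acc - 1.
    """
    rng = np.random.default_rng(seed)
    ref = rng.permutation(np.asarray(ref))
    gen = rng.permutation(np.asarray(gen))
    r_tr, r_te = np.array_split(ref, 2)
    g_tr, g_te = np.array_split(gen, 2)

    def balanced_acc(t, r, g):
        # Stump: predict "generated" when the value exceeds threshold t.
        return 0.5 * ((r <= t).mean() + (g > t).mean())

    # Fit: pick the threshold with the best training balanced accuracy.
    grid = np.quantile(np.concatenate([r_tr, g_tr]), np.linspace(0.01, 0.99, 99))
    best_t = max(grid, key=lambda t: max(balanced_acc(t, r_tr, g_tr),
                                         1.0 - balanced_acc(t, r_tr, g_tr)))
    # Evaluate on the held-out half, ignoring label orientation.
    acc = balanced_acc(best_t, r_te, g_te)
    acc = max(acc, 1.0 - acc)
    return max(0.0, 2.0 * acc - 1.0)
```

Because classifier accuracy is bounded, the resulting score has an intrinsic scale, which is what makes scores from different descriptors directly comparable.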

Installation

pip install polygraph-benchmark

No manual compilation of ORCA is required.

If you'd like to use SBM graph dataset validation with graph_tool, use a mamba or pixi environment. More detailed installation instructions are available in the documentation.

At a glance

This library provides the following datasets and metrics:

  • 🗂️ Datasets: ready-to-use splits for procedural and real-world graphs
    • Procedural datasets: PlanarLGraphDataset, SBMLGraphDataset, LobsterLGraphDataset
    • Real-world: QM9, MOSES, Guacamol, DobsonDoigGraphDataset, ModelNet10GraphDataset
    • Also: EgoGraphDataset, PointCloudGraphDataset
  • 📊 Metrics: unified, fit-once/compute-many interface with convenience wrappers, avoiding redundant computations.
    • MMD2: GaussianTVMMD2Benchmark, RBFMMD2Benchmark
    • Kernel hyperparameter optimization with MaxDescriptorMMD2.
    • PolyGraphDiscrepancy: StandardPGD, MolecularPGD (for molecule descriptors).
    • Validation/Uniqueness/Novelty: VUN.
    • Uncertainty quantification for benchmarking (GaussianTVMMD2BenchmarkInterval, RBFMMD2BenchmarkInterval, StandardPGDInterval)
  • 🧩 Extendable: Users can instantiate custom metrics by specifying descriptors, kernels, or classifiers (PolyGraphDiscrepancy, DescriptorMMD2). PolyGraph defines all necessary interfaces but imposes no requirements on the data type of graph objects.
  • ⚙️ Portability: works on Apple Silicon Macs and Linux.
  • ✅ Tested, type-checked, and documented
⚠️ Important - Dataset Usage Warning

To help reproduce previous results, we provide the following datasets:

  • PlanarGraphDataset
  • SBMGraphDataset
  • LobsterGraphDataset

However, these datasets should not be used for benchmarking, as they yield unreliable metric estimates (see our paper for more details).

We provide larger datasets that should be used instead:

  • PlanarLGraphDataset
  • SBMLGraphDataset
  • LobsterLGraphDataset

Tutorial

Our demo script showcases some features of our library in action.

Datasets

Instantiate a benchmark dataset as follows:

import networkx as nx
from polygraph.datasets import PlanarGraphDataset

reference = PlanarGraphDataset("test").to_nx()

# Let's also generate some graphs coming from another distribution.
generated = [nx.erdos_renyi_graph(64, 0.1) for _ in range(40)]
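The metrics below compare graph distributions via descriptor functions that map each graph to a feature vector (degree, clustering, orbit counts, spectral features). As a stdlib-only illustration of what a descriptor is (not PolyGraph's implementation), a normalized degree histogram can be computed from an edge list:

```python
from collections import Counter

def degree_histogram(edges, num_nodes, max_degree=16):
    """Normalized degree histogram as a fixed-length descriptor vector.

    Degrees above max_degree are clipped into the last bin.
    """
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    hist = [0.0] * (max_degree + 1)
    for node in range(num_nodes):
        hist[min(deg[node], max_degree)] += 1.0 / num_nodes
    return hist
```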

Metrics

Maximum Mean Discrepancy

To compute existing MMD2 formulations (e.g. based on the TV pseudokernel), one can use the following:

from polygraph.metrics import GaussianTVMMD2Benchmark # Can also be RBFMMD2Benchmark

gtv_benchmark = GaussianTVMMD2Benchmark(reference)

print(gtv_benchmark.compute(generated))  # {'orbit': ..., 'clustering': ..., 'degree': ..., 'spectral': ...}
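Conceptually, these benchmarks estimate the squared MMD between descriptor distributions. The numpy sketch below shows a biased MMD² estimator with a Gaussian kernel over total-variation distances between histograms; the kernel parameterization and descriptors are simplified assumptions, not PolyGraph's exact formulation.

```python
import numpy as np

def gaussian_tv_kernel(p, q, sigma=1.0):
    # Gaussian kernel on the total-variation distance between two histograms.
    tv = 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
    return np.exp(-tv**2 / (2 * sigma**2))

def mmd2(X, Y, kernel=gaussian_tv_kernel):
    """Biased estimate of MMD^2 between two sets of descriptor histograms."""
    avg = lambda A, B: np.mean([[kernel(a, b) for b in B] for a in A])
    return avg(X, X) + avg(Y, Y) - 2.0 * avg(X, Y)
```

The biased estimate is exactly 0 when both sets coincide, and larger values indicate greater distributional mismatch.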

PolyGraphDiscrepancy

Similarly, you can compute our proposed PolyGraphDiscrepancy, like so:

from polygraph.metrics import StandardPGD

pgd = StandardPGD(reference)
print(pgd.compute(generated)) # {'pgd': ..., 'pgd_descriptor': ..., 'subscores': {'orbit': ..., }}

The pgd_descriptor entry indicates which descriptor was used to report the final score.
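Each subscore estimates a lower bound on the same underlying divergence, so the best descriptor is the one with the largest subscore. The values below are hypothetical, purely to illustrate the structure of the returned dictionary:

```python
# Hypothetical PGD result, for illustration only.
result = {
    "pgd": 0.55,
    "pgd_descriptor": "orbit",
    "subscores": {"degree": 0.42, "clustering": 0.31, "orbit": 0.55},
}

# Each subscore lower-bounds the same divergence,
# so the tightest (largest) one is reported.
best = max(result["subscores"], key=result["subscores"].get)
```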

By default, PGD uses TabPFN v2.5 weights. The v2.5 weights are hosted on a gated Hugging Face repository (Prior-Labs/tabpfn_2_5) and require authentication:

pip install huggingface_hub
huggingface-cli login

Alternatively, you can use TabPFN v2.0 weights, which are licensed under the Prior Labs License (Apache 2.0 with an additional attribution clause) and permit commercial use. The v2.5 weights, in contrast, use a non-commercial license that prohibits commercial and production use without a separate enterprise license from Prior Labs:

from tabpfn import TabPFNClassifier
from polygraph.metrics import StandardPGD

classifier = TabPFNClassifier(device="auto", n_estimators=4)
pgd = StandardPGD(reference, classifier=classifier)

A logistic regression classifier can also be used as a lightweight alternative, although it yields a looser bound in practice:

from sklearn.linear_model import LogisticRegression
from polygraph.metrics import StandardPGD

pgd = StandardPGD(reference, classifier=LogisticRegression())

Validity, uniqueness and novelty

VUN values follow a similar interface:

from polygraph.metrics import VUN
reference_ds = PlanarGraphDataset("test")
vun = VUN(reference, validity_fn=reference_ds.is_valid, confidence_level=0.95) # if applicable, validity functions are defined as a dataset attribute
print(vun.compute(generated))  # {'valid': ..., 'valid_unique_novel': ..., 'valid_novel': ..., 'valid_unique': ...}

Metric uncertainty quantification

For MMD and PGD, uncertainty quantification is obtained through subsampling. For VUN, a confidence interval is obtained with a binomial test, by specifying a confidence level when instantiating the metric as shown above.
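For a proportion such as validity, a binomial confidence interval has a closed form. The Wilson score interval below is one standard construction, shown purely for illustration; PolyGraph's exact binomial procedure may differ:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 -> ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```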

For MMD and PGD, the Interval suffix marks the classes that implement subsampling:

from polygraph.metrics import GaussianTVMMD2BenchmarkInterval, RBFMMD2BenchmarkInterval, StandardPGDInterval
from tqdm import tqdm

metrics = [
  GaussianTVMMD2BenchmarkInterval(reference, subsample_size=8, num_samples=10), # specify size of each subsample, and the number of samples
  RBFMMD2BenchmarkInterval(reference, subsample_size=8, num_samples=10),
  StandardPGDInterval(reference, subsample_size=8, num_samples=10)
]

for metric in tqdm(metrics):
    metric_results = metric.compute(generated)
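Conceptually, the Interval classes recompute the metric on random subsample pairs and report the spread. The generic numpy sketch below illustrates this scheme; metric_fn and the toy size_gap metric are placeholders, not part of the library:

```python
import numpy as np

def subsample_metric(metric_fn, reference, generated,
                     subsample_size, num_samples, seed=0):
    """Mean and standard deviation of a metric over random subsample pairs."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(num_samples):
        ref_idx = rng.choice(len(reference), size=subsample_size, replace=False)
        gen_idx = rng.choice(len(generated), size=subsample_size, replace=False)
        values.append(metric_fn([reference[i] for i in ref_idx],
                                [generated[i] for i in gen_idx]))
    return float(np.mean(values)), float(np.std(values))

# Toy metric on toy "graphs" (represented here only by their node counts).
size_gap = lambda R, G: abs(float(np.mean(R)) - float(np.mean(G)))
```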

Example Benchmark

The following results mirror the tables from our paper. Values are multiplied by 100 for legibility. Standard deviations are obtained with subsampling using StandardPGDInterval and MolecularPGDInterval. Specific parameters are discussed in the paper.

| Method | Planar-L | Lobster-L | SBM-L | Proteins | Guacamol | Moses |
|---|---|---|---|---|---|---|
| AutoGraph | 34.0 ± 1.8 | 18.0 ± 1.6 | 5.6 ± 1.5 | 67.7 ± 7.4 | 22.9 ± 0.5 | 29.6 ± 0.4 |
| AutoGraph* | 10.4 ± 1.2 | | | | | |
| DiGress | 45.2 ± 1.8 | 3.2 ± 2.6 | 17.4 ± 2.3 | 88.1 ± 3.1 | 32.7 ± 0.5 | 33.4 ± 0.5 |
| GRAN | 99.7 ± 0.2 | 85.4 ± 0.5 | 69.1 ± 1.4 | 89.7 ± 2.7 | | |
| ESGG | 45.0 ± 1.4 | 69.9 ± 0.6 | 99.4 ± 0.2 | 79.2 ± 4.3 | | |

* AutoGraph* denotes a variant that leverages additional training heuristics as described in the paper.

Reproducibility

The reproducibility/ directory contains scripts to reproduce all tables and figures from the paper.

Quick Start

# 1. Install dependencies
pixi install

# 2. Download the graph data (~3GB)
cd reproducibility
python download_data.py

# 3. Generate all tables and figures
make all

Data Download

The generated graph data (~3GB) is hosted on Proton Drive. After downloading, extract to data/polygraph_graphs/ in the repository root.

# Full dataset (required for complete reproducibility)
python download_data.py

# Small subset for testing/CI (~50 graphs per model)
python download_data.py --subset

Expected data structure after extraction:

data/polygraph_graphs/
├── AUTOGRAPH/
│   ├── planar.pkl
│   ├── lobster.pkl
│   ├── sbm.pkl
│   └── proteins.pkl
├── DIGRESS/
│   ├── planar.pkl
│   ├── lobster.pkl
│   ├── sbm.pkl
│   ├── proteins.pkl
│   ├── denoising-iterations/
│   │   └── {15,30,45,60,75,90}_steps.pkl
│   └── training-iterations/
│       └── {119,209,...,3479}_steps.pkl
├── ESGG/
│   └── *.pkl
├── GRAN/
│   └── *.pkl
└── molecule_eval/
    └── *.smiles

Scripts Overview

Table Generation

| Script | Output | Description |
|---|---|---|
| generate_benchmark_tables.py | tables/benchmark_results.tex | Main PGD benchmark (Table 1) comparing AUTOGRAPH, DiGress, GRAN, ESGG |
| generate_mmd_tables.py | tables/mmd_gtv.tex, tables/mmd_rbf_biased.tex | MMD² metrics with GTV and RBF kernels |
| generate_gklr_tables.py | tables/gklr.tex | PGD with Kernel Logistic Regression using WL and SP kernels |
| generate_concatenation_tables.py | tables/concatenation.tex | Ablation comparing individual vs concatenated descriptors |

Figure Generation

| Script | Output | Description |
|---|---|---|
| generate_subsampling_figures.py | figures/subsampling/ | Bias-variance tradeoff as a function of sample size |
| generate_perturbation_figures.py | figures/perturbation/ | Metric sensitivity to edge perturbations |
| generate_model_quality_figures.py | figures/model_quality/ | PGD vs training/denoising steps for DiGress |
| generate_phase_plot.py | figures/phase_plot/ | Training dynamics showing PGD vs VUN |

Each script can be run independently with --subset for quick testing:

# Tables (full computation)
python generate_benchmark_tables.py
python generate_mmd_tables.py
python generate_gklr_tables.py
python generate_concatenation_tables.py

# Tables (quick testing with --subset)
python generate_benchmark_tables.py --subset
python generate_mmd_tables.py --subset

# Figures (full computation)
python generate_subsampling_figures.py
python generate_perturbation_figures.py
python generate_model_quality_figures.py
python generate_phase_plot.py

# Figures (quick testing)
python generate_subsampling_figures.py --subset
python generate_perturbation_figures.py --subset

Make Targets

make download        # Download full dataset (manual step required)
make download-subset # Create small subset for CI testing
make tables          # Generate all LaTeX tables
make figures         # Generate all figures
make all             # Generate everything
make tables-submit   # Submit table jobs to SLURM cluster
make tables-collect  # Collect results from completed SLURM jobs
make clean           # Remove generated outputs
make help            # Show available targets

Hardware Requirements

  • Memory: 16GB RAM recommended for full dataset
  • Storage: ~4GB for data + outputs
  • Time: Full generation takes ~2-4 hours on a modern CPU

The --subset flag uses ~50 graphs per model, runs in minutes, and verifies code correctness (results are not publication-quality).

Cluster Submission

Table generation scripts support SLURM cluster submission via submitit. Install the cluster extras first:

pip install -e ".[cluster]"

SLURM parameters are configured in YAML files (see reproducibility/configs/slurm_default.yaml):

slurm:
  partition: "cpu"
  timeout_min: 360
  cpus_per_task: 8
  mem_gb: 32

Submit jobs, then collect results after completion:

cd reproducibility

# Submit all table jobs to SLURM
python generate_benchmark_tables.py --slurm-config configs/slurm_default.yaml

# After jobs complete, collect results and generate tables
python generate_benchmark_tables.py --collect

# Or use Make targets
make tables-submit                                        # submit all
make tables-submit SLURM_CONFIG=configs/my_cluster.yaml   # custom config
make tables-collect                                       # collect all

Use --local with --slurm-config to test the submission pipeline in-process without SLURM.

Troubleshooting

Memory issues: Use --subset flag for testing, process one dataset at a time, or increase system swap space.

Missing data: Verify data/polygraph_graphs/ exists in repo root, run python download_data.py to check data status, or download manually from Proton Drive.

TabPFN issues: TabPFN v2.0.9 or later is required: pip install "tabpfn>=2.0.9".

Citing

To cite our paper:

@inproceedings{krimmel2026polygraph,
  title={PolyGraph Discrepancy: a classifier-based metric for graph generation},
  author={Markus Krimmel and Philip Hartout and Karsten Borgwardt and Dexiong Chen},
  booktitle={International Conference on Learning Representations},
  year={2026},
}

About

Benchmarking framework for graph generative models (ICLR 2026)
