BorgwardtLab/polygraph-benchmark

PolyGraph is a Python library for evaluating graph generative models by providing standardized datasets and metrics (including PolyGraph Discrepancy). Full documentation for this library can be found here.

PolyGraph Discrepancy (PGD) is a new metric we introduce, which provides the following advantages over maximum mean discrepancy (MMD):

| Property | MMD | PGD |
|---|---|---|
| Range | [0, ∞) | [0, 1] |
| Intrinsic Scale | ✗ | ✓ |
| Descriptor Comparison | ✗ | ✓ |
| Multi-Descriptor Aggregation | ✗ | ✓ |
| Single Ranking | ✗ | ✓ |

It also provides a number of other advantages over MMD which we discuss in our paper.
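At its core, PGD is classifier-based: a classifier is trained to distinguish reference descriptors from generated ones, and its held-out performance is mapped to [0, 1], where 0 means the two samples are indistinguishable. The toy numpy sketch below illustrates the idea with a decision stump on 1-D values; it is not PolyGraph's actual estimator, which uses graph descriptors and stronger classifiers.

```python
import numpy as np

def stump_discrepancy(ref, gen, seed=0):
    """Toy classifier two-sample discrepancy on 1-D descriptor values.

    Trains a decision stump to separate `ref` from `gen` and maps its
    held-out balanced accuracy to [0, 1] via 2 * acc - 1.
    """
    rng = np.random.default_rng(seed)
    ref = rng.permutation(np.asarray(ref))
    gen = rng.permutation(np.asarray(gen))
    r_tr, r_te = np.array_split(ref, 2)
    g_tr, g_te = np.array_split(gen, 2)

    def balanced_acc(t, r, g):
        # Stump: predict "generated" when the value exceeds threshold t.
        return 0.5 * ((r <= t).mean() + (g > t).mean())

    # Fit: pick the threshold with the best training balanced accuracy.
    grid = np.quantile(np.concatenate([r_tr, g_tr]), np.linspace(0.01, 0.99, 99))
    best_t = max(grid, key=lambda t: max(balanced_acc(t, r_tr, g_tr),
                                         1.0 - balanced_acc(t, r_tr, g_tr)))
    # Evaluate on the held-out half, ignoring label orientation.
    acc = balanced_acc(best_t, r_te, g_te)
    acc = max(acc, 1.0 - acc)
    return max(0.0, 2.0 * acc - 1.0)
```

Because classifier accuracy is bounded, the resulting score has an intrinsic scale, which is what makes scores from different descriptors directly comparable.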

Installation

pip install polygraph-benchmark

No manual compilation of ORCA is required.

If you'd like to use SBM graph dataset validation with graph_tool, use a mamba or pixi environment. More detailed installation instructions are available in the documentation.

At a glance

This library provides the following datasets and metrics:

  • 🗂️ Datasets: ready-to-use splits for procedural and real-world graphs
    • Procedural datasets: PlanarLGraphDataset, SBMLGraphDataset, LobsterLGraphDataset
    • Real-world: QM9, MOSES, Guacamol, DobsonDoigGraphDataset, ModelNet10GraphDataset
    • Also: EgoGraphDataset, PointCloudGraphDataset
  • 📊 Metrics: unified, fit-once/compute-many interface with convenience wrappers, avoiding redundant computations.
    • MMD2: GaussianTVMMD2Benchmark, RBFMMD2Benchmark
    • Kernel hyperparameter optimization with MaxDescriptorMMD2.
    • PolyGraphDiscrepancy: StandardPGD, MolecularPGD (for molecule descriptors).
    • Validation/Uniqueness/Novelty: VUN.
    • Uncertainty quantification for benchmarking (GaussianTVMMD2BenchmarkInterval, RBFMMD2BenchmarkInterval, StandardPGDInterval)
  • 🧩 Extendable: Users can instantiate custom metrics by specifying descriptors, kernels, or classifiers (PolyGraphDiscrepancy, DescriptorMMD2). PolyGraph defines all necessary interfaces but imposes no requirements on the data type of graph objects.
  • ⚙️ Portability: works on Apple Silicon Macs and Linux.
  • ✅ Tested, type-checked, and documented
⚠️ Important - Dataset Usage Warning

To help reproduce previous results, we provide the following datasets:

  • PlanarGraphDataset
  • SBMGraphDataset
  • LobsterGraphDataset

However, these datasets should not be used for benchmarking, as they yield unreliable metric estimates (see our paper for more details).

We provide larger datasets that should be used instead:

  • PlanarLGraphDataset
  • SBMLGraphDataset
  • LobsterLGraphDataset

Tutorial

Our demo script showcases some features of our library in action.

Datasets

Instantiate a benchmark dataset as follows:

import networkx as nx
from polygraph.datasets import PlanarGraphDataset

reference = PlanarGraphDataset("test").to_nx()

# Let's also generate some graphs coming from another distribution.
generated = [nx.erdos_renyi_graph(64, 0.1) for _ in range(40)]
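The metrics below compare graph distributions via descriptor functions that map each graph to a feature vector (degree, clustering, orbit counts, spectral features). As a stdlib-only illustration of what a descriptor is (not PolyGraph's implementation), a normalized degree histogram can be computed from an edge list:

```python
from collections import Counter

def degree_histogram(edges, num_nodes, max_degree=16):
    """Normalized degree histogram as a fixed-length descriptor vector.

    Degrees above max_degree are clipped into the last bin.
    """
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    hist = [0.0] * (max_degree + 1)
    for node in range(num_nodes):
        hist[min(deg[node], max_degree)] += 1.0 / num_nodes
    return hist
```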

Metrics

Maximum Mean Discrepancy

To compute existing MMD2 formulations (e.g. based on the TV pseudokernel), one can use the following:

from polygraph.metrics import GaussianTVMMD2Benchmark # Can also be RBFMMD2Benchmark

gtv_benchmark = GaussianTVMMD2Benchmark(reference)

print(gtv_benchmark.compute(generated))  # {'orbit': ..., 'clustering': ..., 'degree': ..., 'spectral': ...}
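Conceptually, these benchmarks estimate the squared MMD between descriptor distributions. The numpy sketch below shows a biased MMD² estimator with a Gaussian kernel over total-variation distances between histograms; the kernel parameterization and descriptors are simplified assumptions, not PolyGraph's exact formulation.

```python
import numpy as np

def gaussian_tv_kernel(p, q, sigma=1.0):
    # Gaussian kernel on the total-variation distance between two histograms.
    tv = 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
    return np.exp(-tv**2 / (2 * sigma**2))

def mmd2(X, Y, kernel=gaussian_tv_kernel):
    """Biased estimate of MMD^2 between two sets of descriptor histograms."""
    avg = lambda A, B: np.mean([[kernel(a, b) for b in B] for a in A])
    return avg(X, X) + avg(Y, Y) - 2.0 * avg(X, Y)
```

The biased estimate is exactly 0 when both sets coincide, and larger values indicate greater distributional mismatch.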

PolyGraphDiscrepancy

Similarly, you can compute our proposed PolyGraphDiscrepancy, like so:

from polygraph.metrics import StandardPGD

pgd = StandardPGD(reference)
print(pgd.compute(generated)) # {'pgd': ..., 'pgd_descriptor': ..., 'subscores': {'orbit': ..., }}

The pgd_descriptor entry indicates which descriptor was used to report the final score.
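Each subscore estimates a lower bound on the same underlying divergence, so the best descriptor is the one with the largest subscore. The values below are hypothetical, purely to illustrate the structure of the returned dictionary:

```python
# Hypothetical PGD result, for illustration only.
result = {
    "pgd": 0.55,
    "pgd_descriptor": "orbit",
    "subscores": {"degree": 0.42, "clustering": 0.31, "orbit": 0.55},
}

# Each subscore lower-bounds the same divergence,
# so the tightest (largest) one is reported.
best = max(result["subscores"], key=result["subscores"].get)
```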

By default, PGD uses TabPFN v2.5 weights. The v2.5 weights are hosted on a gated Hugging Face repository (Prior-Labs/tabpfn_2_5) and require authentication:

pip install huggingface_hub
huggingface-cli login

Alternatively, you can use TabPFN v2.0 weights, which are licensed under the Prior Labs License (Apache 2.0 with an additional attribution clause) and permit commercial use. The v2.5 weights, in contrast, use a non-commercial license that prohibits commercial and production use without a separate enterprise license from Prior Labs:

from tabpfn import TabPFNClassifier
from polygraph.metrics import StandardPGD

classifier = TabPFNClassifier(device="auto", n_estimators=4)
pgd = StandardPGD(reference, classifier=classifier)

A logistic regression classifier can also be used as a lightweight alternative, although it yields a looser bound in practice:

from sklearn.linear_model import LogisticRegression
from polygraph.metrics import StandardPGD

pgd = StandardPGD(reference, classifier=LogisticRegression())

Validity, uniqueness and novelty

VUN values follow a similar interface:

from polygraph.metrics import VUN
reference_ds = PlanarGraphDataset("test")
vun = VUN(reference, validity_fn=reference_ds.is_valid, confidence_level=0.95) # if applicable, validity functions are defined as a dataset attribute
print(vun.compute(generated))  # {'valid': ..., 'valid_unique_novel': ..., 'valid_novel': ..., 'valid_unique': ...}

Metric uncertainty quantification

For MMD and PGD, uncertainty quantification is obtained through subsampling. For VUN, a confidence interval is obtained with a binomial test, by specifying a confidence level when instantiating the metric as shown above.
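For a proportion such as validity, a binomial confidence interval has a closed form. The Wilson score interval below is one standard construction, shown purely for illustration; PolyGraph's exact binomial procedure may differ:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 -> ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```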

For MMD and PGD, the Interval suffix marks the classes that implement subsampling:

from polygraph.metrics import GaussianTVMMD2BenchmarkInterval, RBFMMD2BenchmarkInterval, StandardPGDInterval
from tqdm import tqdm

metrics = [
  GaussianTVMMD2BenchmarkInterval(reference, subsample_size=8, num_samples=10), # specify size of each subsample, and the number of samples
  RBFMMD2BenchmarkInterval(reference, subsample_size=8, num_samples=10),
  StandardPGDInterval(reference, subsample_size=8, num_samples=10)
]

for metric in tqdm(metrics):
    metric_results = metric.compute(generated)
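Conceptually, the Interval classes recompute the metric on random subsample pairs and report the spread. The generic numpy sketch below illustrates this scheme; metric_fn and the toy size_gap metric are placeholders, not part of the library:

```python
import numpy as np

def subsample_metric(metric_fn, reference, generated,
                     subsample_size, num_samples, seed=0):
    """Mean and standard deviation of a metric over random subsample pairs."""
    rng = np.random.default_rng(seed)
    values = []
    for _ in range(num_samples):
        ref_idx = rng.choice(len(reference), size=subsample_size, replace=False)
        gen_idx = rng.choice(len(generated), size=subsample_size, replace=False)
        values.append(metric_fn([reference[i] for i in ref_idx],
                                [generated[i] for i in gen_idx]))
    return float(np.mean(values)), float(np.std(values))

# Toy metric on toy "graphs" (represented here only by their node counts).
size_gap = lambda R, G: abs(float(np.mean(R)) - float(np.mean(G)))
```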

Example Benchmark

The following results mirror the tables from our paper. Values are multiplied by 100 for legibility. Standard deviations are obtained with subsampling using StandardPGDInterval and MolecularPGDInterval. Specific parameters are discussed in the paper.

| Method | Planar-L | Lobster-L | SBM-L | Proteins | Guacamol | Moses |
|---|---|---|---|---|---|---|
| AutoGraph | 34.0 ± 1.8 | 18.0 ± 1.6 | 5.6 ± 1.5 | 67.7 ± 7.4 | 22.9 ± 0.5 | 29.6 ± 0.4 |
| AutoGraph* | 10.4 ± 1.2 | | | | | |
| DiGress | 45.2 ± 1.8 | 3.2 ± 2.6 | 17.4 ± 2.3 | 88.1 ± 3.1 | 32.7 ± 0.5 | 33.4 ± 0.5 |
| GRAN | 99.7 ± 0.2 | 85.4 ± 0.5 | 69.1 ± 1.4 | 89.7 ± 2.7 | | |
| ESGG | 45.0 ± 1.4 | 69.9 ± 0.6 | 99.4 ± 0.2 | 79.2 ± 4.3 | | |

* AutoGraph* denotes a variant that leverages additional training heuristics as described in the paper.

Reproducibility

The reproducibility/ directory contains scripts to reproduce all tables and figures from the paper.

Quick Start

# 1. Install dependencies
pixi install

# 2. Download the graph data (~3GB)
cd reproducibility
python download_data.py

# 3. Generate all tables and figures
make all

Data Download

The generated graph data (~3GB) is hosted on Proton Drive. After downloading, extract to data/polygraph_graphs/ in the repository root.

# Full dataset (required for complete reproducibility)
python download_data.py

# Small subset for testing/CI (~50 graphs per model)
python download_data.py --subset

Expected data structure after extraction:

data/polygraph_graphs/
├── AUTOGRAPH/
│   ├── planar.pkl
│   ├── lobster.pkl
│   ├── sbm.pkl
│   └── proteins.pkl
├── DIGRESS/
│   ├── planar.pkl
│   ├── lobster.pkl
│   ├── sbm.pkl
│   ├── proteins.pkl
│   ├── denoising-iterations/
│   │   └── {15,30,45,60,75,90}_steps.pkl
│   └── training-iterations/
│       └── {119,209,...,3479}_steps.pkl
├── ESGG/
│   └── *.pkl
├── GRAN/
│   └── *.pkl
└── molecule_eval/
    └── *.smiles

Scripts Overview

Table Generation

| Script | Output | Description |
|---|---|---|
| generate_benchmark_tables.py | tables/benchmark_results.tex | Main PGD benchmark (Table 1) comparing AUTOGRAPH, DiGress, GRAN, ESGG |
| generate_mmd_tables.py | tables/mmd_gtv.tex, tables/mmd_rbf_biased.tex | MMD² metrics with GTV and RBF kernels |
| generate_gklr_tables.py | tables/gklr.tex | PGD with Kernel Logistic Regression using WL and SP kernels |
| generate_concatenation_tables.py | tables/concatenation.tex | Ablation comparing individual vs concatenated descriptors |

Figure Generation

| Script | Output | Description |
|---|---|---|
| generate_subsampling_figures.py | figures/subsampling/ | Bias-variance tradeoff as a function of sample size |
| generate_perturbation_figures.py | figures/perturbation/ | Metric sensitivity to edge perturbations |
| generate_model_quality_figures.py | figures/model_quality/ | PGD vs training/denoising steps for DiGress |
| generate_phase_plot.py | figures/phase_plot/ | Training dynamics showing PGD vs VUN |

Each script can be run independently with --subset for quick testing:

# Tables (full computation)
python generate_benchmark_tables.py
python generate_mmd_tables.py
python generate_gklr_tables.py
python generate_concatenation_tables.py

# Tables (quick testing with --subset)
python generate_benchmark_tables.py --subset
python generate_mmd_tables.py --subset

# Figures (full computation)
python generate_subsampling_figures.py
python generate_perturbation_figures.py
python generate_model_quality_figures.py
python generate_phase_plot.py

# Figures (quick testing)
python generate_subsampling_figures.py --subset
python generate_perturbation_figures.py --subset

Make Targets

make download        # Download full dataset (manual step required)
make download-subset # Create small subset for CI testing
make tables          # Generate all LaTeX tables
make figures         # Generate all figures
make all             # Generate everything
make tables-submit   # Submit table jobs to SLURM cluster
make tables-collect  # Collect results from completed SLURM jobs
make clean           # Remove generated outputs
make help            # Show available targets

Hardware Requirements

  • Memory: 16GB RAM recommended for full dataset
  • Storage: ~4GB for data + outputs
  • Time: Full generation takes ~2-4 hours on a modern CPU

The --subset flag uses ~50 graphs per model, runs in minutes, and verifies code correctness (results are not publication-quality).

Cluster Submission

Table generation scripts support SLURM cluster submission via submitit. Install the cluster extras first:

pip install -e ".[cluster]"

SLURM parameters are configured in YAML files (see reproducibility/configs/slurm_default.yaml):

slurm:
  partition: "cpu"
  timeout_min: 360
  cpus_per_task: 8
  mem_gb: 32

Submit jobs, then collect results after completion:

cd reproducibility

# Submit all table jobs to SLURM
python generate_benchmark_tables.py --slurm-config configs/slurm_default.yaml

# After jobs complete, collect results and generate tables
python generate_benchmark_tables.py --collect

# Or use Make targets
make tables-submit                                        # submit all
make tables-submit SLURM_CONFIG=configs/my_cluster.yaml   # custom config
make tables-collect                                       # collect all

Use --local with --slurm-config to test the submission pipeline in-process without SLURM.

Troubleshooting

Memory issues: Use --subset flag for testing, process one dataset at a time, or increase system swap space.

Missing data: Verify data/polygraph_graphs/ exists in repo root, run python download_data.py to check data status, or download manually from Proton Drive.

TabPFN issues: TabPFN v2.0.9 or later is required: pip install "tabpfn>=2.0.9".

Citing

To cite our paper:

@inproceedings{krimmel2026polygraph,
  title={PolyGraph Discrepancy: a classifier-based metric for graph generation},
  author={Markus Krimmel and Philip Hartout and Karsten Borgwardt and Dexiong Chen},
  booktitle={International Conference on Learning Representations},
  year={2026},
}

About

Benchmarking framework for graph generative models (ICLR 2026)
