PolyGraph is a Python library for evaluating graph generative models by providing standardized datasets and metrics (including PolyGraph Discrepancy). Full documentation for this library can be found here.
PolyGraph Discrepancy (PGD) is a new metric we introduce, which provides the following advantages over maximum mean discrepancy (MMD):
| Property | MMD | PGD |
|---|---|---|
| Range | [0, ∞) | [0, 1] |
| Intrinsic Scale | ❌ | ✅ |
| Descriptor Comparison | ❌ | ✅ |
| Multi-Descriptor Aggregation | ❌ | ✅ |
| Single Ranking | ❌ | ✅ |
It also provides a number of other advantages over MMD which we discuss in our paper.
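For intuition on why PGD has an intrinsic [0, 1] scale, here is a minimal, self-contained sketch of the general idea behind classifier-based discrepancies. This is an illustration only, not PolyGraph's actual estimator; PGD operates on graph descriptors and selects among them as described in the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Toy descriptor vectors for a "reference" and a "generated" sample.
reference = rng.normal(0.0, 1.0, size=(200, 4))
generated = rng.normal(0.5, 1.0, size=(200, 4))

# Train a classifier to distinguish the two samples; map its held-out
# accuracy to [0, 1]: chance level (0.5) -> 0, perfect separation -> 1.
X = np.vstack([reference, generated])
y = np.concatenate([np.zeros(len(reference)), np.ones(len(generated))])
pred = cross_val_predict(LogisticRegression(), X, y, cv=5)
accuracy = (pred == y).mean()
score = max(0.0, 2.0 * accuracy - 1.0)
assert 0.0 <= score <= 1.0  # bounded, unlike MMD
```

When the two distributions coincide, the held-out accuracy hovers around chance and the score is near 0, which is what makes scores comparable across descriptors and datasets.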
```shell
pip install polygraph-benchmark
```

No manual compilation of ORCA is required. For details on interaction with `graph_tool`, see the more detailed installation instructions in the docs.
If you'd like to use SBM graph dataset validation with `graph_tool`, use a mamba or pixi environment. More information is available in the documentation.
This library provides the following datasets and metrics:
- 🗂️ Datasets: ready-to-use splits for procedural and real-world graphs
  - Procedural: `PlanarLGraphDataset`, `SBMLGraphDataset`, `LobsterLGraphDataset`
  - Real-world: `QM9`, `MOSES`, `Guacamol`, `DobsonDoigGraphDataset`, `ModelNet10GraphDataset`
  - Also: `EgoGraphDataset`, `PointCloudGraphDataset`
- 📊 Metrics: unified, fit-once/compute-many interface with convenience wrappers, avoiding redundant computation
  - MMD²: `GaussianTVMMD2Benchmark`, `RBFMMD2Benchmark`
  - Kernel hyperparameter optimization with `MaxDescriptorMMD2`
  - PolyGraphDiscrepancy: `StandardPGD`, `MolecularPGD` (for molecule descriptors)
  - Validity/Uniqueness/Novelty: `VUN`
  - Uncertainty quantification for benchmarking (`GaussianTVMMD2BenchmarkInterval`, `RBFMMD2BenchmarkInterval`, `StandardPGDInterval`)
- 🧩 Extendable: users can instantiate custom metrics by specifying descriptors, kernels, or classifiers (`PolyGraphDiscrepancy`, `DescriptorMMD2`). PolyGraph defines all necessary interfaces but imposes no requirements on the data type of graph objects.
- ⚙️ Interoperable: works on Apple Silicon Macs and Linux.
- ✅ Tested, type-checked, and documented
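As a hypothetical illustration of the extendability point above: a graph descriptor is, conceptually, a function mapping a graph to a fixed-length feature vector. The sketch below shows what such a function might look like; the exact interface PolyGraph expects is specified in its documentation:

```python
import networkx as nx
import numpy as np

def degree_histogram_descriptor(graph: nx.Graph, max_degree: int = 16) -> np.ndarray:
    """Map a graph to a fixed-length, normalized degree histogram.

    Degrees above `max_degree` are clipped into the last bin so that
    every graph yields a vector of the same length.
    """
    counts = np.zeros(max_degree + 1)
    for _, deg in graph.degree():
        counts[min(deg, max_degree)] += 1
    return counts / max(graph.number_of_nodes(), 1)

vec = degree_histogram_descriptor(nx.erdos_renyi_graph(64, 0.1))
assert vec.shape == (17,)
assert np.isclose(vec.sum(), 1.0)
```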
⚠️ Important - Dataset Usage Warning
To help reproduce previous results, we provide the following datasets: `PlanarGraphDataset`, `SBMGraphDataset`, `LobsterGraphDataset`.
These should not be used for benchmarking, due to unreliable metric estimates (see our paper for more details).
We provide larger datasets that should be used instead: `PlanarLGraphDataset`, `SBMLGraphDataset`, `LobsterLGraphDataset`.
Our demo script showcases some features of our library in action.
Instantiate a benchmark dataset as follows:

```python
import networkx as nx
from polygraph.datasets import PlanarGraphDataset

reference = PlanarGraphDataset("test").to_nx()

# Let's also generate some graphs coming from another distribution.
generated = [nx.erdos_renyi_graph(64, 0.1) for _ in range(40)]
```

To compute existing MMD² formulations (e.g. based on the TV pseudokernel), one can use the following:
```python
from polygraph.metrics import GaussianTVMMD2Benchmark  # can also be RBFMMD2Benchmark

gtv_benchmark = GaussianTVMMD2Benchmark(reference)
print(gtv_benchmark.compute(generated))  # {'orbit': ..., 'clustering': ..., 'degree': ..., 'spectral': ...}
```

Similarly, you can compute our proposed PolyGraphDiscrepancy, like so:
```python
from polygraph.metrics import StandardPGD

pgd = StandardPGD(reference)
print(pgd.compute(generated))  # {'pgd': ..., 'pgd_descriptor': ..., 'subscores': {'orbit': ..., }}
```

`pgd_descriptor` names the best descriptor, which is used to report the final score.
By default, PGD uses TabPFN v2.5 weights. The v2.5 weights are hosted on a gated Hugging Face repository (Prior-Labs/tabpfn_2_5) and require authentication:
```shell
pip install huggingface_hub
huggingface-cli login
```

Alternatively, you can use TabPFN v2.0 weights, which are licensed under the Prior Labs License (Apache 2.0 with an additional attribution clause) and permit commercial use. The v2.5 weights, in contrast, use a non-commercial license that prohibits commercial and production use without a separate enterprise license from Prior Labs:
```python
from tabpfn import TabPFNClassifier
from polygraph.metrics import StandardPGD

classifier = TabPFNClassifier(device="auto", n_estimators=4)
pgd = StandardPGD(reference, classifier=classifier)
```

A logistic regression classifier can also be used as a lightweight alternative, although it yields a looser bound in practice:
```python
from sklearn.linear_model import LogisticRegression
from polygraph.metrics import StandardPGD

pgd = StandardPGD(reference, classifier=LogisticRegression())
```

VUN values follow a similar interface:
```python
from polygraph.metrics import VUN

reference_ds = PlanarGraphDataset("test")
# If applicable, validity functions are defined as a dataset attribute.
vun = VUN(reference, validity_fn=reference_ds.is_valid, confidence_level=0.95)
print(vun.compute(generated))  # {'valid': ..., 'valid_unique_novel': ..., 'valid_novel': ..., 'valid_unique': ...}
```

For MMD and PGD, uncertainty quantification is obtained through subsampling. For VUN, a confidence interval is obtained with a binomial test.
For VUN, the results can be obtained by specifying a confidence level when instantiating the metric.
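For intuition, a binomial interval of this kind can be sketched with SciPy's exact (Clopper-Pearson) construction. This illustrates the behavior, not necessarily the library's exact implementation:

```python
from scipy.stats import binomtest

# Suppose 31 of 40 generated graphs are valid; an exact binomial
# (Clopper-Pearson) confidence interval for the true validity rate:
ci = binomtest(k=31, n=40).proportion_ci(confidence_level=0.95)
print(f"valid: {31 / 40:.3f}, 95% CI: [{ci.low:.3f}, {ci.high:.3f}]")
```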
For the other metrics, classes with the `Interval` suffix implement subsampling.
```python
from tqdm import tqdm

from polygraph.metrics import GaussianTVMMD2BenchmarkInterval, RBFMMD2BenchmarkInterval, StandardPGDInterval

metrics = [
    # Specify the size of each subsample and the number of subsamples.
    GaussianTVMMD2BenchmarkInterval(reference, subsample_size=8, num_samples=10),
    RBFMMD2BenchmarkInterval(reference, subsample_size=8, num_samples=10),
    StandardPGDInterval(reference, subsample_size=8, num_samples=10),
]
for metric in tqdm(metrics):
    metric_results = metric.compute(generated)
```

The following results mirror the tables from our paper. Bold indicates best, and underlined indicates second-best. Values are multiplied by 100 for legibility. Standard deviations are obtained with subsampling using `StandardPGDInterval` and `MoleculePGDInterval`. Specific parameters are discussed in the paper.
| Method | Planar-L | Lobster-L | SBM-L | Proteins | Guacamol | Moses |
|---|---|---|---|---|---|---|
| AutoGraph | 34.0 ± 1.8 | 18.0 ± 1.6 | 5.6 ± 1.5 | 67.7 ± 7.4 | 22.9 ± 0.5 | 29.6 ± 0.4 |
| AutoGraph* | — | — | — | — | 10.4 ± 1.2 | — |
| DiGress | 45.2 ± 1.8 | 3.2 ± 2.6 | 17.4 ± 2.3 | 88.1 ± 3.1 | 32.7 ± 0.5 | 33.4 ± 0.5 |
| GRAN | 99.7 ± 0.2 | 85.4 ± 0.5 | 69.1 ± 1.4 | 89.7 ± 2.7 | — | — |
| ESGG | 45.0 ± 1.4 | 69.9 ± 0.6 | 99.4 ± 0.2 | 79.2 ± 4.3 | — | — |
* AutoGraph* denotes a variant that leverages additional training heuristics as described in the paper.
The reproducibility/ directory contains scripts to reproduce all tables and figures from the paper.
```shell
# 1. Install dependencies
pixi install

# 2. Download the graph data (~3GB)
cd reproducibility
python download_data.py

# 3. Generate all tables and figures
make all
```

The generated graph data (~3GB) is hosted on Proton Drive. After downloading, extract it to `data/polygraph_graphs/` in the repository root.
```shell
# Full dataset (required for complete reproducibility)
python download_data.py

# Small subset for testing/CI (~50 graphs per model)
python download_data.py --subset
```

Expected data structure after extraction:
```
data/polygraph_graphs/
├── AUTOGRAPH/
│   ├── planar.pkl
│   ├── lobster.pkl
│   ├── sbm.pkl
│   └── proteins.pkl
├── DIGRESS/
│   ├── planar.pkl
│   ├── lobster.pkl
│   ├── sbm.pkl
│   ├── proteins.pkl
│   ├── denoising-iterations/
│   │   └── {15,30,45,60,75,90}_steps.pkl
│   └── training-iterations/
│       └── {119,209,...,3479}_steps.pkl
├── ESGG/
│   └── *.pkl
├── GRAN/
│   └── *.pkl
└── molecule_eval/
    └── *.smiles
```
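A hypothetical sketch of reading one of these files, assuming each `.pkl` stores a pickled list of graphs (an assumption; the reproducibility scripts define the actual on-disk format). Here we round-trip a toy list instead of touching the real data:

```python
import pickle
import tempfile

import networkx as nx

# Assumed format: one pickled list of graphs per file.
graphs = [nx.erdos_renyi_graph(16, 0.2) for _ in range(3)]
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump(graphs, f)

with open(f.name, "rb") as fh:
    loaded = pickle.load(fh)
assert len(loaded) == 3
```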
| Script | Output | Description |
|---|---|---|
| `generate_benchmark_tables.py` | `tables/benchmark_results.tex` | Main PGD benchmark (Table 1) comparing AUTOGRAPH, DiGress, GRAN, ESGG |
| `generate_mmd_tables.py` | `tables/mmd_gtv.tex`, `tables/mmd_rbf_biased.tex` | MMD² metrics with GTV and RBF kernels |
| `generate_gklr_tables.py` | `tables/gklr.tex` | PGD with Kernel Logistic Regression using WL and SP kernels |
| `generate_concatenation_tables.py` | `tables/concatenation.tex` | Ablation comparing individual vs concatenated descriptors |
| Script | Output | Description |
|---|---|---|
| `generate_subsampling_figures.py` | `figures/subsampling/` | Bias-variance tradeoff as function of sample size |
| `generate_perturbation_figures.py` | `figures/perturbation/` | Metric sensitivity to edge perturbations |
| `generate_model_quality_figures.py` | `figures/model_quality/` | PGD vs training/denoising steps for DiGress |
| `generate_phase_plot.py` | `figures/phase_plot/` | Training dynamics showing PGD vs VUN |
Each script can be run independently with `--subset` for quick testing:
```shell
# Tables (full computation)
python generate_benchmark_tables.py
python generate_mmd_tables.py
python generate_gklr_tables.py
python generate_concatenation_tables.py

# Tables (quick testing with --subset)
python generate_benchmark_tables.py --subset
python generate_mmd_tables.py --subset

# Figures (full computation)
python generate_subsampling_figures.py
python generate_perturbation_figures.py
python generate_model_quality_figures.py
python generate_phase_plot.py

# Figures (quick testing)
python generate_subsampling_figures.py --subset
python generate_perturbation_figures.py --subset
```

```shell
make download         # Download full dataset (manual step required)
make download-subset  # Create small subset for CI testing
make tables           # Generate all LaTeX tables
make figures          # Generate all figures
make all              # Generate everything
make tables-submit    # Submit table jobs to SLURM cluster
make tables-collect   # Collect results from completed SLURM jobs
make clean            # Remove generated outputs
make help             # Show available targets
```

- Memory: 16GB RAM recommended for full dataset
- Storage: ~4GB for data + outputs
- Time: Full generation takes ~2-4 hours on a modern CPU
The `--subset` flag uses ~50 graphs per model, runs in minutes, and verifies code correctness (results are not publication-quality).
Table generation scripts support SLURM cluster submission via submitit. Install the cluster extras first:
```shell
pip install -e ".[cluster]"
```

SLURM parameters are configured in YAML files (see `reproducibility/configs/slurm_default.yaml`):
```yaml
slurm:
  partition: "cpu"
  timeout_min: 360
  cpus_per_task: 8
  mem_gb: 32
```

Submit jobs, then collect results after completion:
```shell
cd reproducibility

# Submit all table jobs to SLURM
python generate_benchmark_tables.py --slurm-config configs/slurm_default.yaml

# After jobs complete, collect results and generate tables
python generate_benchmark_tables.py --collect

# Or use Make targets
make tables-submit                                       # submit all
make tables-submit SLURM_CONFIG=configs/my_cluster.yaml  # custom config
make tables-collect                                      # collect all
```

Use `--local` with `--slurm-config` to test the submission pipeline in-process without SLURM.
Memory issues: use the `--subset` flag for testing, process one dataset at a time, or increase system swap space.

Missing data: verify that `data/polygraph_graphs/` exists in the repo root, run `python download_data.py` to check data status, or download manually from Proton Drive.

TabPFN issues: TabPFN v2.0.9 or later is required: `pip install "tabpfn>=2.0.9"` (quote the requirement so the shell does not interpret `>` as a redirect).
To cite our paper:
@inproceedings{krimmel2026polygraph,
title={PolyGraph Discrepancy: a classifier-based metric for graph generation},
author={Markus Krimmel and Philip Hartout and Karsten Borgwardt and Dexiong Chen},
booktitle={International Conference on Learning Representations},
year={2026},
}