Audio Restoration Benchmark by Diffio AI

Self-contained benchmark for speech restoration on archival Voices for Christ audio.

Benchmark metadata lives in benchmark_manifest.json. Current versions:

benchmark: 0.3.0
dataset: 0.1.0

Overview

task: restore each archival clip to improve intelligibility and perceptual quality
dataset: 100 clips, 30 seconds each, 50 minutes total
layout: benchmark_data/original plus one benchmark_data/<mode> directory per system
current public baselines:
- adobe_podcast: Adobe Podcast
- diffio_3_5: Diffio.ai 3.5

Current working assumption: the source Voices for Christ material is pre-1990 and open-domain or public-domain compatible, but that has not been independently verified.

Metrics

SCOREQ: primary no-reference quality score, higher is better
WER: proxy intelligibility score against frozen transcripts, lower is better
DNSMOS P.835: no-reference perceptual quality score, higher is better

Removed from the default benchmark:

speaker preservation
NISQA
SRMR

Run

Audio is tracked with Git LFS, so install Git LFS and pull the audio files before scoring.

git lfs install
git lfs pull
python score_benchmark.py
python score_benchmark.py --metrics wer
python score_benchmark.py --metrics scoreq dnsmos
python score_benchmark.py --list-metrics
python plot_results.py

score_benchmark.py runs the full benchmark by default. Use --metrics to run only the stages you want.

Full benchmark outputs:

scores.csv
scores_metadata.json
reference_transcripts.csv
reference_transcripts_metadata.json
transcripts.csv
wer.csv
wer_metadata.json
dnsmos.csv
dnsmos_metadata.json

plot_results.py reads the benchmark CSVs and writes images into ./plots/.

Submission

Submit a pull request that adds a new benchmark_data/<mode> directory.

Each submission should:

keep the exact same filenames as benchmark_data/original
include one output file per original file
avoid truncation, silence padding, or file count mismatches
describe the method, version, and inference settings in the pull request
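
Before opening a pull request, the filename and file-count requirements above can be checked locally. The following is a minimal sketch, not the benchmark's own validation code: the function name `check_submission` is hypothetical, and it only compares filenames, so it does not detect truncation or silence padding.

```python
from pathlib import Path


def check_submission(original_dir, mode_dir):
    """Compare filenames in a submission directory against the originals.

    Returns a list of human-readable problems; an empty list means the
    two directories contain exactly the same filenames.
    """
    original = {p.name for p in Path(original_dir).iterdir() if p.is_file()}
    submitted = {p.name for p in Path(mode_dir).iterdir() if p.is_file()}

    problems = []
    # Every original clip needs exactly one restored counterpart.
    for name in sorted(original - submitted):
        problems.append(f"missing: {name}")
    # Extra files would cause a file count mismatch.
    for name in sorted(submitted - original):
        problems.append(f"unexpected: {name}")
    return problems
```

For example, `check_submission("benchmark_data/original", "benchmark_data/diffio_3_5")` should return an empty list for a well-formed submission.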

Recommended naming:

benchmark_data/<system_name>_<system_version>
example: benchmark_data/diffio_3_5

Plots

Leaderboard Summary

compares mean SCOREQ, mean WER, and mean DNSMOS OVR
WER omits original because the original clips are the frozen reference source and score zero by construction
good for a fast headline comparison between systems like Adobe Podcast and Diffio.ai 3.5

Metric Distributions

shows per-file spread for SCOREQ, WER, and DNSMOS OVR
WER omits original for the same reason as above
useful for checking consistency, not just averages

Relative Improvement Heatmap

shows mean improvement relative to the degraded original clips
positive values are better in every column
useful for seeing where Diffio.ai 3.5 and Adobe Podcast differ
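
One way to read "positive values are better in every column" is as a sign-normalized difference of per-file scores. The sketch below assumes an absolute per-file difference averaged over clips, with the sign flipped for lower-is-better metrics. The README does not state whether the actual plot uses absolute or percentage differences, and the function name `mean_improvement` is hypothetical.

```python
from statistics import mean


def mean_improvement(original_scores, restored_scores, higher_is_better=True):
    """Mean per-file improvement of restored clips over the originals.

    Scores are paired by position. The result is positive when the
    restored clips are better, regardless of the metric's direction.
    """
    if len(original_scores) != len(restored_scores):
        raise ValueError("score lists must have the same length")

    deltas = [r - o for o, r in zip(original_scores, restored_scores)]
    if not higher_is_better:
        # For lower-is-better metrics, a decrease counts as an improvement.
        deltas = [-d for d in deltas]
    return mean(deltas)
```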

Methods

WER

There is no human transcript ground truth, so WER is a proxy metric.

freeze one strong ASR decode of benchmark_data/original into reference_transcripts.csv
score restored outputs with a weaker ASR decode against those frozen transcripts

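Scoring with a weaker recognizer makes the benchmark more sensitive to restoration gains, since a weaker model is more easily thrown off by residual noise and artifacts. Freezing the reference transcripts keeps the target stable across reruns.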
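
For reference, WER between a frozen transcript and a restored-clip decode is the word-level edit distance divided by the number of reference words. The sketch below is a plain-Python illustration under simple assumptions (lowercasing and whitespace tokenization). The benchmark's actual text normalization and ASR models are not described here, and `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.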

DNSMOS

DNSMOS P.835 is run locally through TorchMetrics.

audio is resampled to 16 kHz mono
reported outputs are p808_mos, sig, bak, and ovr
no external scoring API is used

Notes

score_benchmark.py is the normal entrypoint
audio/layout validation requires every non-empty mode directory to match the original filenames exactly
the benchmark currently assumes a CUDA-capable environment for the default scoring path
