Audio Restoration Benchmark by Diffio AI
Self-contained benchmark for speech restoration on archival Voices for Christ audio.
Benchmark metadata lives in benchmark_manifest.json. Current versions:
- benchmark: 0.3.0
- dataset: 0.1.0
- task: restore each archival clip to improve intelligibility and perceptual quality
- dataset: 100 clips, 30 seconds each, 50 minutes total
- layout: benchmark_data/original plus one benchmark_data/<mode> directory per system
- current public baselines:
  - adobe_podcast: Adobe Podcast
  - diffio_3_5: Diffio.ai 3.5
Current working assumption: the source Voices for Christ material is pre-1990 and open-domain or public-domain compatible, but that has not been independently verified.
- SCOREQ: primary no-reference quality score, higher is better
- WER: proxy intelligibility score against frozen transcripts, lower is better
- DNSMOS P.835: no-reference perceptual quality score, higher is better
Removed from the default benchmark:
- speaker preservation
- NISQA
- SRMR
Audio should be stored with Git LFS.
git lfs install
git lfs pull
python score_benchmark.py
python score_benchmark.py --metrics wer
python score_benchmark.py --metrics scoreq dnsmos
python score_benchmark.py --list-metrics
python plot_results.py
score_benchmark.py runs the full benchmark by default. Use --metrics to run only the stages you want.
Full benchmark outputs:
- scores.csv
- scores_metadata.json
- reference_transcripts.csv
- reference_transcripts_metadata.json
- transcripts.csv
- wer.csv
- wer_metadata.json
- dnsmos.csv
- dnsmos_metadata.json
plot_results.py reads the benchmark CSVs and writes images into ./plots/.
Submit a pull request that adds a new benchmark_data/<mode> directory.
Each submission should:
- keep the exact same filenames as benchmark_data/original
- include one output file per original file
- avoid truncation, silence padding, or file count mismatches
- describe the method, version, and inference settings in the pull request
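Before opening a pull request, the filename and file-count requirements above can be checked locally. The following is a minimal sketch, not the benchmark's own validation code: the function name `validate_submission` and the message wording are hypothetical, and it only compares filenames. It does not detect truncation or silence padding, which would require reading the audio.

```python
from pathlib import Path


def validate_submission(original_dir: Path, mode_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the filenames match.

    Hypothetical helper for illustration only. It compares filenames, not
    audio content, so duration mismatches are not caught here.
    """
    original = {p.name for p in original_dir.iterdir() if p.is_file()}
    submitted = {p.name for p in mode_dir.iterdir() if p.is_file()}

    problems = []
    # Every original clip needs exactly one restored counterpart.
    for name in sorted(original - submitted):
        problems.append(f"missing output for {name}")
    # Extra files would cause a file count mismatch.
    for name in sorted(submitted - original):
        problems.append(f"unexpected file {name}")
    return problems


if __name__ == "__main__":
    # The directory names follow the layout described in this README.
    issues = validate_submission(
        Path("benchmark_data/original"), Path("benchmark_data/diffio_3_5")
    )
    print("\n".join(issues) if issues else "layout OK")
```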
Recommended naming:
- benchmark_data/<system_name>_<system_version>
- example: benchmark_data/diffio_3_5
- compares mean SCOREQ, mean WER, and mean DNSMOS OVR
- WER omits original because the original clips are the frozen reference source and score zero by construction
- good for a fast headline comparison between systems like Adobe Podcast and Diffio.ai 3.5
- shows per-file spread for SCOREQ, WER, and DNSMOS OVR
- WER omits original for the same reason as above
- useful for checking consistency, not just averages
- shows mean improvement relative to the degraded original clips
- positive values are better in every column
- useful for seeing where Diffio.ai 3.5 and Adobe Podcast differ
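Since WER is lower-is-better while SCOREQ and DNSMOS are higher-is-better, "positive values are better in every column" implies that the sign of the WER difference is flipped. The sketch below illustrates that convention with toy values. The names `METRIC_DIRECTION` and `mean_improvement` and the dictionary inputs are hypothetical, and how plot_results.py actually derives its baseline values is not described here.

```python
from statistics import mean

# +1 if higher is better, -1 if lower is better (illustrative metric keys).
METRIC_DIRECTION = {"scoreq": 1, "wer": -1, "dnsmos_ovr": 1}


def mean_improvement(
    original: dict[str, float], restored: dict[str, float], metric: str
) -> float:
    """Mean per-file gain over the original, signed so that positive is better."""
    direction = METRIC_DIRECTION[metric]
    # Only compare files that have a score on both sides.
    shared = sorted(original.keys() & restored.keys())
    return mean(direction * (restored[name] - original[name]) for name in shared)


if __name__ == "__main__":
    # Toy values, not benchmark results.
    baseline = {"clip_001.wav": 0.40, "clip_002.wav": 0.20}
    system = {"clip_001.wav": 0.30, "clip_002.wav": 0.10}
    # A drop in WER is reported as a positive improvement.
    print(mean_improvement(baseline, system, "wer"))
```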
There is no human transcript ground truth, so WER is a proxy metric.
- freeze one strong ASR decode of benchmark_data/original into reference_transcripts.csv
- score restored outputs with a weaker ASR decode against those frozen transcripts
This makes the benchmark more sensitive to restoration gains while keeping the target stable across reruns.
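For readers unfamiliar with the metric, WER is the word-level edit distance between a hypothesis and a reference, divided by the number of reference words. The following is a minimal, dependency-free sketch of that calculation. The text normalization (lowercasing and whitespace splitting) and the handling of an empty reference are assumptions made for illustration; the benchmark's own normalization and ASR models are not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # WER is undefined for an empty reference; this convention is an assumption.
        return 0.0 if not hyp else 1.0

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    frozen_reference = "the cat sat on the mat"  # stands in for a frozen transcript
    restored_decode = "the bat sat on mat"       # stands in for a weaker ASR decode
    print(word_error_rate(frozen_reference, restored_decode))
```

Because the reference side is frozen, rerunning this comparison changes only when the restored audio or its decode changes.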
DNSMOS P.835 is run locally through TorchMetrics.
- audio is resampled to 16 kHz mono
- reported outputs are p808_mos, sig, bak, and ovr
- no external scoring API is used
- score_benchmark.py is the normal entrypoint
- audio/layout validation requires every non-empty mode directory to match the original filenames exactly
- the benchmark currently assumes a CUDA-capable environment for the default scoring path


