transcript-tools

Turn a raw meeting transcript (Zoom or Microsoft Stream .vtt, or an .srt) into YouTube-ready captions and a readable transcript — with a project glossary that fixes domain terms, and an optional, guarded LLM polish.

Built for PolicyEngine webinars and talks, but the glossary is just a file, so it works for any project.

Why

Auto-transcripts get the easy 95% right and fumble exactly the words that matter: "Policy Engine" instead of PolicyEngine, "GPT 5.5" instead of GPT-5.5, "policy bench" instead of PolicyBench. Generic cleaners strip filler but don't know your vocabulary. transcript-tools does both, and it keeps two outputs that serve different jobs:

| Output | File | Treatment |
| --- | --- | --- |
| Captions | `<name>.srt`, `<name>.vtt` | Verbatim. Only the glossary is applied, so they stay in sync with the audio. Speaker labels are added on speaker change. |
| Transcript | `<name>-transcript.txt` | Readable. Merged by speaker, filler removed, with sentence-aware re-capitalization. For the video description, show notes, or a blog. |
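
To make the transcript treatment concrete, here is a minimal sketch of the two steps that matter most: merging consecutive cues by speaker and stripping filler. It is an illustration, not the package's actual code. The `Cue` shape, the `FILLER` list, and the function names `merge_by_speaker` and `strip_filler` are all assumptions made for this example.

```python
import re

# Hypothetical cue shape: (speaker, text). The real tool's data model may differ.
Cue = tuple[str, str]

# Illustrative filler list; in the tool this comes from the glossary.
FILLER = ["you know", "um", "uh"]


def merge_by_speaker(cues: list[Cue]) -> list[Cue]:
    """Join consecutive cues from the same speaker into one turn."""
    turns: list[Cue] = []
    for speaker, text in cues:
        if turns and turns[-1][0] == speaker:
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return turns


def strip_filler(text: str, filler: list[str] = FILLER) -> str:
    """Remove whole-word filler phrases plus a trailing comma, then tidy spacing."""
    for phrase in filler:
        pattern = rf"\b{re.escape(phrase)}\b,?\s*"
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()


cues = [
    ("Jane Doe", "So, um, the model"),
    ("Jane Doe", "runs on SNAP data."),
    ("Audience", "Great."),
]
turns = [(speaker, strip_filler(text)) for speaker, text in merge_by_speaker(cues)]
```

Captions would skip both steps, which is why they stay aligned with the audio.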

Install

uv tool install git+https://github.com/PolicyEngine/transcript-tools
# or, in a project:
uv pip install git+https://github.com/PolicyEngine/transcript-tools

Use

clean-transcript GMT20260625-Recording.transcript.vtt

Writes GMT20260625-Recording.transcript.srt, .vtt, and ...-transcript.txt next to the input. Then upload the .srt in YouTube Studio → your video → Subtitles → Add.

Common options:

# custom output name + directory
clean-transcript in.vtt -n policybench-webinar -o ./captions

# anonymize an audience member in both outputs
clean-transcript in.vtt -s "Jane Doe=Audience"

# add a per-talk glossary on top of the default (one-off names / mishears)
clean-transcript in.vtt -g this-talk.yaml

# title the readable transcript
clean-transcript in.vtt --title "PolicyBench webinar" --subtitle "June 25, 2026"

Glossary

The packaged glossary.yaml holds the PolicyEngine vocabulary. Layer your own with -g/--glossary (it merges on top, yours wins):

terms:                                   # applied to captions AND transcript
  - { pattern: '[Pp]olicy\s+[Ee]ngine', replace: PolicyEngine }
filler:                                  # removed from the transcript only
  - you know
proper_nouns: [PolicyEngine, GPT, SNAP]  # keep caps mid-sentence
speakers: { "Jane Doe": "Audience" }     # caption label override

terms are deterministic regex replacements — safe enough to run on the verbatim captions. Keep one-off, talk-specific corrections in a -g file so the default glossary stays general.
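
As a rough sketch of what layering and applying terms could look like, the example below uses plain Python dicts that mirror the YAML layout. It is not the package's implementation. The helper names `merge_terms` and `apply_terms` are hypothetical, and the rule that a per-talk entry overrides a default entry with the identical pattern is an assumption about what "yours wins" means.

```python
import re

# Term lists as plain dicts, mirroring the `terms:` entries in the YAML above.
default_terms = [
    {"pattern": r"[Pp]olicy\s+[Ee]ngine", "replace": "PolicyEngine"},
]
# Hypothetical per-talk additions, as a -g file might supply them.
talk_terms = [
    {"pattern": r"[Pp]olicy\s+[Bb]ench", "replace": "PolicyBench"},
]


def merge_terms(base: list[dict], extra: list[dict]) -> list[dict]:
    """Layer `extra` on `base`; an identical pattern in `extra` overrides (assumed rule)."""
    merged = {term["pattern"]: term["replace"] for term in base}
    merged.update({term["pattern"]: term["replace"] for term in extra})
    return [{"pattern": p, "replace": r} for p, r in merged.items()]


def apply_terms(text: str, terms: list[dict]) -> str:
    """Run each regex replacement in order; same input always gives same output."""
    for term in terms:
        text = re.sub(term["pattern"], term["replace"], text)
    return text


terms = merge_terms(default_terms, talk_terms)
fixed = apply_terms("Welcome to policy engine and Policy  Bench.", terms)
```

Because each step is a plain regex substitution with no model in the loop, the result is repeatable, which is what makes it reasonable to run on the verbatim captions.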

Optional: LLM polish

The deterministic transcript is good. For publication-grade prose, add --llm, which also stitches Zoom's sentence fragments back together:

uv pip install 'transcript-tools[llm]'
export ANTHROPIC_API_KEY=...
clean-transcript in.vtt --llm

It cleans each speaker turn, but never silently changes facts: a guardrail rejects any polished turn that alters a number, dollar amount, or percentage (or balloons the text), falling back to the deterministic version and telling you how many turns it kept. Captions are never sent to the model.
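
The guardrail idea can be sketched in a few lines. This is an illustration under stated assumptions, not the package's actual check: the `NUMBER` regex, the function names `facts`, `accept_polish`, and `guarded`, and the `max_growth` threshold of 1.5 are all made up for this example.

```python
import re

# Digits with optional "$" prefix, thousands separators, decimals, and "%" suffix.
NUMBER = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?")


def facts(text: str) -> list[str]:
    """Collect every numeric token, sorted so that word order does not matter."""
    return sorted(NUMBER.findall(text))


def accept_polish(original: str, polished: str, max_growth: float = 1.5) -> bool:
    """Accept only if the numeric tokens match and the text has not ballooned."""
    if facts(original) != facts(polished):
        return False
    return len(polished) <= max_growth * len(original)


def guarded(original: str, polished: str) -> str:
    """Return the polished turn when it passes, else fall back to the original."""
    return polished if accept_polish(original, polished) else original


original = "so um we saw a 12% drop to $1,200 in 2024, right"
good = "We saw a 12% drop to $1,200 in 2024."
bad = "We saw a 15% drop to $1,200 in 2024."
```

Here `guarded(original, good)` returns the polished sentence, while `guarded(original, bad)` falls back to the original because 12% became 15%. A check like this only sees digits, so a number spelled out in words would pass through unnoticed.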

Develop

uv run --extra dev pytest

Prior art

For generic, glossary-free cleanup there are good tools already — clean-transcribe (LLM-based, can re-transcribe from audio), and simple strippers like VTTCleaner. This repo exists for the parts they don't cover: a project glossary, the captions-vs-transcript split, and a fact-preserving guardrail on the LLM pass.

License

MIT
