50 changes: 50 additions & 0 deletions content/posts/2026-02-01-generating-ai-audio/index.md
@@ -0,0 +1,50 @@
---
author: "Gethin James"
title: "Generating AI Audio"
description: "Exploring the use of Generative AI to create accessible and engaging audio content from long-form documents"
draft: false
date: 2026-01-01
tags: ["AI", "Generative AI", "Audio", "Accessibility"]
categories: ["AI"]
ShowToc: false
TocOpen: false
---
[You may prefer to listen to the audio version of this blog post.](audio_podcast.mp3)

At the DVLA Emerging Technology Lab, we wondered whether Generative AI could be used to make long-form documents more accessible and engaging.

Many individuals find reading extensive written documents challenging. Recent technological advances have enabled the generation of "audio overviews" and "podcasts" from text content. Our idea was to explore how far this technology might assist neurodiverse individuals, or those for whom English is not a first language.

As a government agency, we are committed to ensuring that content is handled securely and remains accessible only to authorised staff within the agency. To achieve this, we created a Microsoft Teams Bot called "Audry". Audry allows a user to upload a document and automatically transform it into a podcast or news briefing. We wanted to produce audio that features authentic regional UK accents.

{{<figure align=center src="audry.png" caption="Audry Teams Bot">}}

## Agentic Review
Many advances in Generative AI have originated in the United States, resulting in some technology displaying a US bias. For example, generated transcripts occasionally contained American expressions that are unsuitable for a UK audience, such as "DMV" instead of "DVLA". To address this, we adopted an agentic approach to reviewing the transcript. Using [LangGraph](https://www.langchain.com/langgraph), we created three personas to review the transcript, and a fourth expert to edit it based on the feedback.

{{<figure src="agentic_edit_diagram.png" caption="Agentic Review Process">}}

- British Expert: Assessed grammar and verified the use of appropriate British cultural references.
- Content Reviewer: Moderated content for compliance with UK government standards.
- Expressive Delivery Advisor (think drama teacher): Suggested emotional and non-verbal sound cues, such as adding pauses or varying tone.
- Editor: Incorporated feedback from the previous three experts and rewrote the transcript accordingly.
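
The review flow described above can be sketched in plain Python. This is a minimal illustration, not Audry's code: it does not use LangGraph, and each persona is a simple rule-based stand-in for what would be an LLM call. The names `Suggestion`, `ReviewState`, `US_TO_UK` and `review_and_edit` are hypothetical, and the only substitution included is the "DMV" example mentioned earlier.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Suggestion:
    reviewer: str
    note: str
    find: Optional[str] = None      # text to swap out, when the fix is mechanical
    replace: Optional[str] = None   # text to swap in


@dataclass
class ReviewState:
    transcript: str
    suggestions: list[Suggestion] = field(default_factory=list)


# Illustrative only: a single US-to-UK substitution (an assumption, not a real rule set).
US_TO_UK = {"DMV": "DVLA"}


def british_expert(transcript: str) -> list[Suggestion]:
    """Stand-in for the British Expert persona: flags US terms."""
    return [
        Suggestion("British Expert", f"Use '{uk}' rather than '{us}'", us, uk)
        for us, uk in US_TO_UK.items()
        if us in transcript
    ]


def content_reviewer(transcript: str) -> list[Suggestion]:
    """Stand-in for the Content Reviewer persona: this placeholder flags nothing."""
    return []


def delivery_advisor(transcript: str) -> list[Suggestion]:
    """Stand-in for the Expressive Delivery Advisor: asks for a pause cue if none exists."""
    if "[pause]" in transcript:
        return []
    return [Suggestion("Expressive Delivery Advisor", "Consider adding a [pause] cue")]


REVIEWERS: list[Callable[[str], list[Suggestion]]] = [
    british_expert,
    content_reviewer,
    delivery_advisor,
]


def editor(state: ReviewState) -> ReviewState:
    """Stand-in for the Editor: applies only the mechanical find/replace suggestions.

    Free-text notes are kept on the state; a real editor would pass them to an LLM.
    """
    for suggestion in state.suggestions:
        if suggestion.find is not None and suggestion.replace is not None:
            state.transcript = state.transcript.replace(suggestion.find, suggestion.replace)
    return state


def review_and_edit(transcript: str) -> ReviewState:
    """Run all three reviewers on the same transcript, then hand the feedback to the editor."""
    state = ReviewState(transcript)
    for reviewer in REVIEWERS:
        state.suggestions.extend(reviewer(transcript))
    return editor(state)


if __name__ == "__main__":
    result = review_and_edit("Host A: You can renew online without visiting the DMV.")
    print(result.transcript)
    for s in result.suggestions:
        print(f"- {s.reviewer}: {s.note}")
```

Every reviewer sees the original transcript rather than another reviewer's output, so the three reviews are independent and the editor is the only step that changes the text.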


## Challenges
- Voice quality has improved significantly thanks to recent advances. However, accessing the latest multi-speaker voice models remains difficult. These models are often still in preview and provide limited support for British English.
- Achieving consistent voice generation is challenging. Submitting the same parameters to a large language model (LLM) does not always produce identical results. This variability is part of what makes generative AI powerful, but it also prevents reliable, repeatable voice output. We experimented with dividing large transcripts (five minutes or longer) into smaller requests. However, when the segments were combined, the voices often changed noticeably mid-conversation.
- Regional accents can be influenced through specific prompts, for example requesting a Welsh or Scottish accent. In our experience, this approach was not consistently reliable. Further work is needed to create uniform regional accents.
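
As an illustration of the transcript-splitting experiment mentioned above, here is a minimal sketch of one way to divide a transcript into smaller requests. It assumes the transcript is already a list of speaker turns and uses a word budget per request; the function name `split_transcript` and the `max_words` parameter are hypothetical, and the actual splitting strategy used in Audry may differ.

```python
def split_transcript(turns: list[str], max_words: int = 150) -> list[list[str]]:
    """Group speaker turns into chunks that stay within a word budget.

    Turns are never split mid-sentence; a single turn longer than
    max_words becomes a chunk on its own.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    count = 0
    for turn in turns:
        words = len(turn.split())
        # Start a new chunk when adding this turn would exceed the budget.
        if current and count + words > max_words:
            chunks.append(current)
            current, count = [], 0
        current.append(turn)
        count += words
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    turns = [
        "A: Hello there",
        "B: Good morning",
        "A: Shall we begin",
        "B: Yes of course",
    ]
    for chunk in split_transcript(turns, max_words=6):
        print(chunk)
```

Splitting on turn boundaries keeps each request self-contained, but it does nothing to keep a voice consistent from one request to the next, which is the problem described above.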


## Technology
Here are some of the technologies we used:
- Microsoft Teams AI and Bot Framework
- Azure Document Intelligence, Cosmos DB, Speech Service, App Service
- Google Gemini 2.5 Flash and Text-to-Speech (TTS) models
- ElevenLabs Text-to-Speech API
- LangGraph for agentic review

## Conclusions
This technology is still emerging, and producing consistently accurate regional British audio content remains a challenge. However, it may already be good enough for practical use. [This podcast was generated by uploading this blog post through our system](audio_podcast.mp3). You can decide for yourself whether we succeeded.

Following Government Service Standards, the [code for Audry is open source](https://github.com/dvla/audry).