[whisper] word timings in verbose_json #9306

@eglia

Is your feature request related to a problem? Please describe.
When calling the transcription endpoint with the response format verbose_json, LocalAI currently returns only a list of segments with text, start, and end. The words attribute is always None.

Describe the solution you'd like
LocalAI should also provide word-level timestamps when they are requested with timestamp_granularities=["word"].
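
To make the request concrete, here is a rough sketch of the form fields and the response handling I have in mind. The field names follow the OpenAI transcription API that LocalAI mirrors; the model name `whisper-1`, the helper names, and the two sample JSON bodies are placeholders I made up for illustration, not actual LocalAI output.

```python
import json


def build_form_fields(model="whisper-1"):
    """Form fields to send as multipart/form-data next to the audio file.

    The model name is a placeholder; use whatever model is configured.
    """
    return [
        ("model", model),
        ("response_format", "verbose_json"),
        ("timestamp_granularities[]", "word"),
    ]


def extract_words(response_body):
    """Return (word, start, end) tuples, or [] when `words` is missing or null."""
    payload = json.loads(response_body)
    words = payload.get("words") or []
    return [(w["word"], w["start"], w["end"]) for w in words]


# Illustrative bodies only: the current behaviour (words is null)
# versus the shape this request asks for.
current = '{"text": "hello world", "segments": [], "words": null}'
desired = (
    '{"text": "hello world", "words": ['
    '{"word": "hello", "start": 0.0, "end": 0.4}, '
    '{"word": "world", "start": 0.5, "end": 0.9}]}'
)

print(extract_words(current))
print(extract_words(desired))
```

With the current behaviour the first call yields an empty list, so any client that relies on word timings has nothing to work with.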

Describe alternatives you've considered

Additional context
I'm trying to generate subtitles for videos with hardware acceleration (Vulkan). The output returned with the srt format works, but its timestamps are inaccurate. I would like to use stable-ts to improve the timestamps, but without word-level timestamps the results are still suboptimal.
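
As an illustration of why word timings matter for subtitles, the sketch below groups a word list into SRT cues. It is not what stable-ts does; it is a plain fixed-size grouping, and it assumes each word is a dict with `word`, `start`, and `end` keys (times in seconds), as in the OpenAI verbose_json format. The function names and the `max_words` parameter are hypothetical.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group consecutive words into numbered SRT cues of at most max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"].strip() for w in chunk)
        # Each cue spans from its first word's start to its last word's end.
        start, end = chunk[0]["start"], chunk[-1]["end"]
        cues.append(
            f"{len(cues) + 1}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text}\n"
        )
    return "\n".join(cues)


sample_words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
]
print(words_to_srt(sample_words, max_words=1))
```

Because every cue boundary comes from a word boundary, the cue timing is only as good as the word timestamps the server returns.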

Labels: enhancement (New feature or request)
