Fix OpenAI-compatible STT for "Speech to text selected lines" #11291
dkakaie wants to merge 2 commits into
Pull request overview
This PR fixes OpenAI-compatible speech-to-text behavior when transcribing selected subtitle lines, ensuring transcriptions attach to the correct extracted clip and multi-segment results become multiple subtitle lines.
Changes:
- Matches OpenAI-compatible STT results back to the original clip filename instead of the transcoded temporary upload file.
- Splits selected-line transcription output whenever multiple paragraphs/segments are returned, rather than only for lines longer than 10 seconds.
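The first change can be illustrated with a small sketch. It is written in Python rather than the project's C#, and every name in it (`Job`, `ResultEntry`, `find_job`, the file names) is hypothetical and not taken from the PR. It only shows why a result tagged with the temporary upload file fails to find its clip, while a result tagged with the original clip filename does.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """One queued transcription job for an extracted clip (hypothetical)."""
    clip_file: str    # original extracted clip
    upload_file: str  # transcoded temporary file sent to the STT endpoint


@dataclass
class ResultEntry:
    """A transcription result tagged with the file it claims to belong to."""
    file_name: str
    text: str


def find_job(jobs: List[Job], result: ResultEntry) -> Optional[Job]:
    """Return the job whose original clip filename matches the result."""
    for job in jobs:
        if job.clip_file == result.file_name:
            return job
    return None


jobs = [Job(clip_file="clip_0001.wav", upload_file="tmp_a1b2.mp3")]

# Result tagged with the temporary upload file: no job matches.
by_upload = find_job(jobs, ResultEntry("tmp_a1b2.mp3", "hello"))

# Result tagged with the original clip filename: the job is found.
by_clip = find_job(jobs, ResultEntry("clip_0001.wav", "hello"))
```

In this sketch the lookup itself is unchanged; the fix is entirely in which filename the result carries.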
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/ui/Features/Video/SpeechToText/SpeechToTextViewModel.cs | Updates OpenAI-compatible transcription result assignment to use _videoFileName, aligning with other STT paths. |
| src/ui/Features/Main/MainViewModel.cs | Removes the duration gate so multi-paragraph transcription results replace the selected line with multiple timed lines. |
Thanks for the PR! Just a quick heads-up regarding the logic here: the condition selectedLine.Duration.TotalSeconds > 10 is used to switch from single-line mode to one-selection-to-many-lines mode. I think your changes might break this distinction...
Thanks for the reply. What user scenario was the 10-second threshold meant to address? My thinking was that modern ASR models are generally quite good at segmentation and timestamping. When a transcription contains multiple timestamped segments, concatenating them back into a single subtitle line, especially for longer clips, can discard some of that structure. If the 10-second threshold was added as a UX choice rather than due to limitations of the transcription engine, perhaps we could rely on the model's segmentation directly and remove the single-line-mode branching. Another option would be to make the threshold configurable, although that may be unnecessary if the segmentation quality is consistently good. I'm interested in the original rationale and would be curious to hear your thoughts on whether the model's segmentation should drive subtitle splitting here.
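To make the options in this thread concrete, here is a minimal sketch of the two behaviours behind one optional threshold. It is Python, not the project's C#, and the names (`Segment`, `Line`, `apply_transcription`, `min_split_seconds`) are hypothetical. It also assumes that segment timestamps are relative to the start of the extracted clip, which the PR does not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    start: float  # seconds, assumed relative to the start of the clip
    end: float
    text: str


@dataclass
class Line:
    start: float  # seconds on the subtitle timeline
    end: float
    text: str


def apply_transcription(selected: Line, segments: List[Segment],
                        min_split_seconds: Optional[float] = None) -> List[Line]:
    """Return the line(s) that should replace `selected`.

    min_split_seconds=10 mimics a duration gate like the one discussed;
    None lets the model's segmentation alone decide.
    """
    if not segments:
        return [selected]  # nothing transcribed: keep the line unchanged
    duration = selected.end - selected.start
    gate_open = min_split_seconds is None or duration > min_split_seconds
    if len(segments) > 1 and gate_open:
        # One subtitle line per segment, shifted onto the subtitle timeline.
        return [Line(selected.start + s.start, selected.start + s.end, s.text)
                for s in segments]
    # Single-line mode: concatenate all segment text into the selected line.
    joined = " ".join(s.text.strip() for s in segments)
    return [Line(selected.start, selected.end, joined)]


line = Line(100.0, 106.0, "")
segs = [Segment(0.0, 2.5, "Hello there."), Segment(3.0, 6.0, "How are you?")]

gated = apply_transcription(line, segs, min_split_seconds=10)  # 6 s line stays single
ungated = apply_transcription(line, segs)                      # split by segments
```

With a 6-second selection, the gated call keeps one concatenated line and the ungated call yields two timed lines. A configurable threshold would amount to exposing `min_split_seconds` as a setting.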
Two fixes to the OpenAI-compatible STT engine when transcribing selected lines:
- Transcription results are now assigned using _videoFileName, like the whisper engines.