Add MOSS-TTS-Realtime fastapi, RTF and TTFB by Zhyw0 · Pull Request #76 · OpenMOSS/MOSS-TTS

Zhyw0 · 2026-03-12T14:35:05Z

Added server usage instructions for starting the MOSS-TTS-Realtime FastAPI service.
Updated the original streaming generation method.
Measured TTFB and RTF.

iamyishan · 2026-03-13T02:51:06Z

@Zhyw0 Thanks for the reply, but why is the generated audio file out_streaming_split.wav 0 bytes and unplayable?
Here is the server log:
(moss-tts) yangxun@R5300G5:~/test-project/MOSS-TTS/moss_tts_realtime$ python fast_api.py
INFO: Started server process [2983596]
INFO: Waiting for application startup.
[warmup] Loading backend ...
You are using a model of type moss_tts_realtime to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Loading weights: 100%|████████████████████████████████████████████████████████████████████████████| 403/403 [00:00<00:00, 3704.70it/s, Materializing param=local_transformer.model.norm.weight]
Loading weights: 100%|███████████████████████████████████████████| 1600/1600 [00:00<00:00, 3719.46it/s, Materializing param=quantizer.quantizers.31.out_proj.parametrizations.weight.original1]
[warmup] Backend loaded.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8083 (Press CTRL+C to quit)
INFO: 127.0.0.1:52696 - "POST /tts/session/start HTTP/1.1" 200 OK
/home/yangxun/test-project/MOSS-TTS/moss_tts_realtime/fast_api.py:219: UserWarning: torchaudio._backend.utils.info has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see pytorch/audio#3902 for more information. It will be removed from the 2.9 release.
info = torchaudio.info(str(src))
/home/yangxun/miniconda3/envs/moss-tts/lib/python3.11/site-packages/torchaudio/_backend/ffmpeg.py:20: UserWarning: torio.io._streaming_media_decoder.StreamingMediaDecoder has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see pytorch/audio#3902 for more information. It will be removed from the 2.9 release.
s = torchaudio.io.StreamReader(src, format, None, buffer_size)
/home/yangxun/miniconda3/envs/moss-tts/lib/python3.11/site-packages/torchaudio/_backend/ffmpeg.py:27: UserWarning: torchaudio._backend.common.AudioMetaData has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see pytorch/audio#3902 for more information. It will be removed from the 2.9 release.
return AudioMetaData(
/home/yangxun/miniconda3/envs/moss-tts/lib/python3.11/site-packages/torchaudio/_backend/utils.py:213: UserWarning: In 2.9, this function's implementation will be changed to use torchaudio.load_with_torchcodec` under the hood. Some parameters like normalize, ``format``, ``buffer_size``, and ``backend`` will be ignored. We recommend that you port your code to rely directly on TorchCodec's decoder instead: https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.AudioDecoder.html#torchcodec.decoders.AudioDecoder.
warnings.warn(
/home/yangxun/miniconda3/envs/moss-tts/lib/python3.11/site-packages/torchaudio/_backend/ffmpeg.py:88: UserWarning: torio.io._streaming_media_decoder.StreamingMediaDecoder has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see pytorch/audio#3902 for more information. It will be removed from the 2.9 release.
s = torchaudio.io.StreamReader(src, format, None, buffer_size)
INFO: 127.0.0.1:52712 - "POST /tts/session/push HTTP/1.1" 200 OK
INFO: 127.0.0.1:52722 - "GET /tts/session/db3cc292-020b-478c-8fc1-756162fcfe64/audio HTTP/1.1" 200 OK
INFO: 127.0.0.1:52736 - "POST /tts/session/push HTTP/1.1" 200 OK
INFO: 127.0.0.1:52738 - "POST /tts/session/push HTTP/1.1" 200 OK
INFO: 127.0.0.1:52744 - "POST /tts/session/push HTTP/1.1" 200 OK

Here is the client request log:
(moss-tts) yangxun@R5300G5:~/test-project/MOSS-TTS/moss_tts_realtime$ python tts_client.py
[main] SESSION_ID=db3cc292-020b-478c-8fc1-756162fcfe64
[start] first_delta='Welcome to the world of MOSS TTS Realtime. Experie'
[start] payload={'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'assistant_text': 'Welcome to the world of MOSS TTS Realtime. Experie', 'user_text': None, 'prompt_audio': './audio/prompt_audio.mp3', 'user_audio': None, 'new_turn': True}
[start] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'message': 'turn started'}
[audio] connect -> http://127.0.0.1:8083/tts/session/db3cc292-020b-478c-8fc1-756162fcfe64/audio
[push #2] is_final=False, text='nce how text transforms into smooth, human-like sp'
[audio] headers: sr=24000, ch=1, codec=pcm_s16le
[push #2] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'accepted_text_len': 50, 'is_final': False}
[push #3] is_final=False, text='eech in real time. MOSS TTS Realtime is a context-'
[push #3] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'accepted_text_len': 50, 'is_final': False}
[push #4] is_final=False, text='aware multi-turn streaming TTS, a speech generatio'
[push #4] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'accepted_text_len': 50, 'is_final': False}
[push #5] is_final=True, text='n foundation model designed for voice agents.'
[push #5] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'accepted_text_len': 45, 'is_final': True}
[audio] done, chunks=231, bytes=887040
[audio] saved pcm -> out_streaming_split.pcm
[audio] saved wav -> out_streaming_split.wav
[close] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'message': 'session closed'}
[main] done

Zhyw0 · 2026-03-13T02:57:14Z

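This is probably because the WAV file generated locally cannot be played in real time; you have to wait until all of the audio has been generated before playing it. If you want real-time playback, I recommend launching with app.py instead. We have also re-optimized the generation speed in that code.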
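
For reference, the client log above reports `sr=24000, ch=1, codec=pcm_s16le` and `bytes=887040`, which works out to 887040 / (24000 × 1 × 2) = 18.48 s of audio. The sketch below is not the code in tts_client.py; it is a minimal illustration, assuming the saved out_streaming_split.pcm is intact, of how raw PCM with those parameters could be wrapped in a WAV container using Python's standard `wave` module. The function names are hypothetical.

```python
import io
import wave


def pcm_duration_seconds(n_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Duration of raw PCM audio; defaults match the logged pcm_s16le headers."""
    return n_bytes / (sample_rate * channels * sample_width)


def pcm_to_wav_bytes(pcm, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw little-endian PCM bytes in a WAV container and return the bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # 2 bytes per sample for 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    # wave does not close a file object it did not open, so buf is still readable
    return buf.getvalue()


# Duration implied by the byte count in the client log
print(pcm_duration_seconds(887040))

# Hypothetical usage with the file saved by the client:
# with open("out_streaming_split.pcm", "rb") as f:
#     wav = pcm_to_wav_bytes(f.read())
# with open("out_streaming_split.wav", "wb") as f:
#     f.write(wav)
```

The WAV header can only be finalized once the total data length is known, which is consistent with the advice above to wait until generation has finished before playing the file.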

xiami2019 · 2026-03-13T04:36:21Z

Please also update the Chinese and English READMEs on the home page.

iamyishan · 2026-03-13T05:06:42Z

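I want to implement streaming text input and streaming speech output over WebSocket. The client script tts_stream_client.py and the server script tts_stream_server.py are attached below.
I have also integrated the warmup code from your fast_api into tts_stream_server.py.
tts_stream_client.py
tts_stream_server.py
The problem: inference is still slow, with RTF around 1.7. The detailed log is below; could you help me see what is going wrong?
flash_attention_2 is installed as well.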
[warmup] Loading backend ...
[backend] attn_implementation=flash_attention_2, device=cuda:0, dtype=torch.bfloat16
[warmup] Model ready.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
INFO: 10.35.168.73:11390 - "WebSocket /ws/stream_tts" [accepted]
[WS] connection accepted
[TTS] worker started
INFO: connection open
[WS] received END, total chars=83
[TTS] chunk # 1 samples= 1920 dur=0.080s cost_per_1s=0.000s
[TTS] chunk # 2 samples= 3840 dur=0.160s cost_per_1s=2.152s
[TTS] chunk # 3 samples= 5760 dur=0.240s cost_per_1s=1.934s
[TTS] chunk # 4 samples= 7680 dur=0.320s cost_per_1s=1.816s
[TTS] chunk # 5 samples= 9600 dur=0.400s cost_per_1s=1.768s
[TTS] chunk # 6 samples= 11520 dur=0.480s cost_per_1s=1.730s
[TTS] chunk # 7 samples= 11520 dur=0.480s cost_per_1s=1.721s
[TTS] chunk # 8 samples= 11520 dur=0.480s cost_per_1s=1.721s
[TTS] chunk # 9 samples= 11520 dur=0.480s cost_per_1s=1.731s
[TTS] chunk # 10 samples= 11520 dur=0.480s cost_per_1s=1.725s
[TTS] chunk # 11 samples= 11520 dur=0.480s cost_per_1s=1.719s
[TTS] chunk # 12 samples= 11520 dur=0.480s cost_per_1s=1.717s
[TTS] chunk # 13 samples= 11520 dur=0.480s cost_per_1s=1.717s
[TTS] chunk # 14 samples= 11520 dur=0.480s cost_per_1s=1.722s
[TTS] chunk # 15 samples= 11520 dur=0.480s cost_per_1s=1.720s
[TTS] chunk # 16 samples= 11520 dur=0.480s cost_per_1s=1.715s
[TTS] chunk # 17 samples= 11520 dur=0.480s cost_per_1s=1.717s
[TTS] chunk # 18 samples= 11520 dur=0.480s cost_per_1s=1.718s
[TTS] chunk # 19 samples= 11520 dur=0.480s cost_per_1s=1.716s
[TTS] chunk # 20 samples= 11520 dur=0.480s cost_per_1s=1.717s
[TTS] chunk # 21 samples= 11520 dur=0.480s cost_per_1s=1.718s
[TTS] chunk # 22 samples= 11520 dur=0.480s cost_per_1s=1.715s
[TTS] chunk # 23 samples= 11520 dur=0.480s cost_per_1s=1.728s
[TTS] chunk # 24 samples= 11520 dur=0.480s cost_per_1s=1.721s
[TTS] chunk # 25 samples= 11520 dur=0.480s cost_per_1s=1.720s
[TTS] chunk # 26 samples= 11520 dur=0.480s cost_per_1s=1.722s
[TTS] chunk # 27 samples= 11520 dur=0.480s cost_per_1s=1.720s
[TTS] chunk # 28 samples= 11520 dur=0.480s cost_per_1s=1.716s
[TTS] chunk # 29 samples= 11520 dur=0.480s cost_per_1s=1.721s
[TTS] chunk # 30 samples= 11520 dur=0.480s cost_per_1s=1.718s
[TTS] chunk # 31 samples= 11520 dur=0.480s cost_per_1s=1.724s
[TTS] chunk # 32 samples= 11520 dur=0.480s cost_per_1s=1.722s
[TTS] chunk # 33 samples= 11520 dur=0.480s cost_per_1s=1.718s
[TTS] chunk # 34 samples= 11520 dur=0.480s cost_per_1s=1.732s
[TTS] chunk # 35 samples= 11520 dur=0.480s cost_per_1s=1.723s
[TTS] chunk # 36 samples= 11520 dur=0.480s cost_per_1s=1.722s
[TTS] chunk # 37 samples= 11520 dur=0.480s cost_per_1s=1.718s
[TTS] chunk # 38 samples= 11520 dur=0.480s cost_per_1s=1.731s
[TTS] chunk # 39 samples= 11520 dur=0.480s cost_per_1s=1.721s
[TTS] chunk # 40 samples= 1920 dur=0.080s cost_per_1s=4.203s
[TTS] DONE chunks=40 audio=17.60s elapsed=30.53s RTF=1.734
[TTS] sent END
[WS] done
INFO: connection closed
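
The totals in the DONE line can be cross-checked from the per-chunk lines. The sketch below is only an illustration; it assumes the usual definition of RTF as elapsed wall-clock time divided by audio duration, and a 24000 Hz sample rate, which matches `samples=1920 dur=0.080s` in the log. The helper names are hypothetical.

```python
SAMPLE_RATE = 24000  # Hz; 1920 samples / 0.080 s in the log above


def audio_seconds(chunk_samples, sample_rate=SAMPLE_RATE):
    """Total audio duration from a list of per-chunk sample counts."""
    return sum(chunk_samples) / sample_rate


def real_time_factor(elapsed_s, audio_s):
    """RTF below 1 means audio is generated faster than it plays back."""
    return elapsed_s / audio_s


# Sample counts copied from the log: chunks 1-5 ramp up, 6-39 are 11520, 40 is 1920
chunk_samples = [1920, 3840, 5760, 7680, 9600] + [11520] * 34 + [1920]

audio_s = audio_seconds(chunk_samples)
rtf = real_time_factor(30.53, audio_s)  # elapsed=30.53s from the DONE line
print(f"chunks={len(chunk_samples)} audio={audio_s:.2f}s RTF={rtf:.2f}")
```

The summed duration agrees with `audio=17.60s`, and the ratio comes out at about 1.73, in line with the logged RTF.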

Zhyw0 · 2026-03-13T06:59:46Z

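Oh, flash-attn2 can't be used here; I recommend sdpa instead. sdpa supports compile, whereas flash-attn2 currently conflicts with torch.compile, so sdpa + compile will be faster. Thanks for pointing this out; we will document it in the README.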
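
A minimal sketch of that selection rule follows. It is not the repository's code: `backend_kwargs` is a hypothetical helper, and the only values it uses are the `attn_implementation` strings `"sdpa"` and `"flash_attention_2"` (the latter appears in the backend log line above). How the backend actually passes this setting to the model loader is not shown in this thread.

```python
def backend_kwargs(use_compile, flash_attn_installed):
    """Pick an attention implementation string for a model loader.

    Follows the advice above: when torch.compile is going to be used,
    prefer "sdpa"; only fall back to "flash_attention_2" when compile
    is off and flash-attn is actually installed.
    """
    if use_compile or not flash_attn_installed:
        attn = "sdpa"
    else:
        attn = "flash_attention_2"
    return {"attn_implementation": attn}


print(backend_kwargs(use_compile=True, flash_attn_installed=True))
```

The returned dictionary is meant to be forwarded to whatever loader accepts an `attn_implementation` argument, with the compile step applied to the loaded model afterwards.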

iamyishan · 2026-03-13T08:04:52Z

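After uninstalling flash-attn2 the speed did improve. I tested several times: each time RTF is > 1 for the first 2 to 3 seconds, then it stabilizes at around 0.8, which is basically usable. Thanks!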
[WS] received END, total chars=520
[TTS] chunk # 1 samples= 1920 dur=0.080s cost_per_1s=0.000s
[TTS] chunk # 2 samples= 3840 dur=0.160s cost_per_1s=1.273s
[TTS] chunk # 3 samples= 5760 dur=0.240s cost_per_1s=1.041s
[TTS] chunk # 4 samples= 7680 dur=0.320s cost_per_1s=0.945s
[TTS] chunk # 5 samples= 9600 dur=0.400s cost_per_1s=0.888s
[TTS] chunk # 6 samples= 11520 dur=0.480s cost_per_1s=0.850s
[TTS] chunk # 7 samples= 11520 dur=0.480s cost_per_1s=0.844s
[TTS] chunk # 8 samples= 11520 dur=0.480s cost_per_1s=0.842s
[TTS] chunk # 9 samples= 11520 dur=0.480s cost_per_1s=0.841s
[TTS] chunk # 10 samples= 11520 dur=0.480s cost_per_1s=0.856s
[TTS] chunk # 11 samples= 11520 dur=0.480s cost_per_1s=0.848s
[TTS] chunk # 12 samples= 11520 dur=0.480s cost_per_1s=0.847s
[TTS] chunk # 13 samples= 11520 dur=0.480s cost_per_1s=0.849s
[TTS] chunk # 14 samples= 11520 dur=0.480s cost_per_1s=0.848s
[TTS] chunk # 15 samples= 11520 dur=0.480s cost_per_1s=0.847s
[TTS] chunk # 16 samples= 11520 dur=0.480s cost_per_1s=0.848s
[TTS] chunk # 17 samples= 11520 dur=0.480s cost_per_1s=0.845s
[TTS] chunk # 18 samples= 11520 dur=0.480s cost_per_1s=0.847s
[TTS] chunk # 19 samples= 11520 dur=0.480s cost_per_1s=0.854s
[TTS] chunk # 20 samples= 11520 dur=0.480s cost_per_1s=0.850s
[TTS] chunk # 21 samples= 11520 dur=0.480s cost_per_1s=0.846s
[TTS] chunk # 22 samples= 11520 dur=0.480s cost_per_1s=0.849s
[TTS] chunk # 23 samples= 11520 dur=0.480s cost_per_1s=0.849s
[TTS] chunk # 24 samples= 11520 dur=0.480s cost_per_1s=0.847s
[TTS] chunk # 25 samples= 11520 dur=0.480s cost_per_1s=0.861s

Zhyw0 · 2026-03-13T08:13:33Z

Thanks for the feedback, but on an A800 it still shouldn't be this slow. We will do further testing and optimization.

iamyishan · 2026-03-13T08:48:12Z

INFO: connection open
[WS] received END, total chars=520
/home/yangxun/miniconda3/envs/moss-tts/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:282: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting torch.set_float32_matmul_precision('high') for better performance.
warnings.warn(
[TTS] chunk # 1 samples= 1920 dur=0.080s cost_per_1s=0.000s
[TTS] chunk # 2 samples= 3840 dur=0.160s cost_per_1s=112.194s
[TTS] chunk # 3 samples= 5760 dur=0.240s cost_per_1s=1.086s
[TTS] chunk # 4 samples= 7680 dur=0.320s cost_per_1s=0.980s
[TTS] chunk # 5 samples= 9600 dur=0.400s cost_per_1s=0.917s
[TTS] chunk # 6 samples= 11520 dur=0.480s cost_per_1s=0.885s
[TTS] chunk # 7 samples= 11520 dur=0.480s cost_per_1s=0.871s
[TTS] chunk # 8 samples= 11520 dur=0.480s cost_per_1s=0.872s
[TTS] chunk # 9 samples= 11520 dur=0.480s cost_per_1s=0.873s
[TTS] chunk # 10 samples= 11520 dur=0.480s cost_per_1s=0.872s
[TTS] chunk # 11 samples= 11520 dur=0.480s cost_per_1s=18.575s
[TTS] chunk # 12 samples= 11520 dur=0.480s cost_per_1s=0.886s

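I also noticed that the second second of the first request is very slow, as shown above.
The second request is better, as shown below.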
[TTS] worker started
INFO: connection open
[WS] received END, total chars=520
[TTS] chunk # 1 samples= 1920 dur=0.080s cost_per_1s=0.000s
[TTS] chunk # 2 samples= 3840 dur=0.160s cost_per_1s=1.302s
[TTS] chunk # 3 samples= 5760 dur=0.240s cost_per_1s=1.064s
[TTS] chunk # 4 samples= 7680 dur=0.320s cost_per_1s=0.965s
[TTS] chunk # 5 samples= 9600 dur=0.400s cost_per_1s=0.913s
[TTS] chunk # 6 samples= 11520 dur=0.480s cost_per_1s=0.869s
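
To make spikes like the ones in the first-request log easier to spot, the chunk lines can be parsed and filtered. The sketch below is an illustration only: the regular expression is written against the log format shown here, the 5.0 threshold is arbitrary, and `cost_per_1s` is assumed (not confirmed in this thread) to mean seconds of compute per second of audio.

```python
import re

# Matches lines such as:
# [TTS] chunk # 2 samples= 3840 dur=0.160s cost_per_1s=112.194s
CHUNK_RE = re.compile(
    r"\[TTS\] chunk #\s*(\d+)\s+samples=\s*(\d+)\s+dur=([\d.]+)s\s+cost_per_1s=([\d.]+)s"
)


def parse_chunks(log_text):
    """Return (index, samples, duration_s, cost_per_1s) tuples from a log."""
    return [
        (int(m.group(1)), int(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in CHUNK_RE.finditer(log_text)
    ]


def slow_chunks(rows, threshold=5.0):
    """Indices of chunks whose cost_per_1s exceeds the (arbitrary) threshold."""
    return [idx for idx, _samples, _dur, cost in rows if cost > threshold]


# A few lines copied from the first-request log
FIRST_REQUEST_LOG = """
[TTS] chunk # 1 samples= 1920 dur=0.080s cost_per_1s=0.000s
[TTS] chunk # 2 samples= 3840 dur=0.160s cost_per_1s=112.194s
[TTS] chunk # 3 samples= 5760 dur=0.240s cost_per_1s=1.086s
[TTS] chunk # 10 samples= 11520 dur=0.480s cost_per_1s=0.872s
[TTS] chunk # 11 samples= 11520 dur=0.480s cost_per_1s=18.575s
[TTS] chunk # 12 samples= 11520 dur=0.480s cost_per_1s=0.886s
"""

rows = parse_chunks(FIRST_REQUEST_LOG)
print(slow_chunks(rows))
```

Running the same filter over the second-request log would flag nothing, which matches the observation that only the first request is affected.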

Also, from my testing MOSS-TTS-Realtime does not seem to support dialects, and there is no speed parameter, like speed=1.0 in CosyVoice, for controlling the generation speed?

Zhyw0 · 2026-03-13T08:50:46Z

MOSS-TTS-Realtime does not yet support dialects or control of the generation speed.

add: moss tts realtime fastapi, RTF and TTFB

1a87b7d

Zhyw0 mentioned this pull request Mar 12, 2026

实时生成速度有点慢 (Real-time generation is a bit slow) #74

Open

gaoyang07 assigned Zhyw0 Mar 13, 2026

gaoyang07 self-requested a review March 13, 2026 03:25

add: moss tts realtime TTFB

dc0f5c4

YWMditto approved these changes Mar 13, 2026

View reviewed changes

fix: moss tts realtime app front-end layout

f70d5b4

gaoyang07 changed the title ~~add: moss tts realtime fastapi, RTF and TTFB~~ Add MOSS-TTS-Realtime fastapi, RTF and TTFB Mar 13, 2026

Zhyw0 merged commit d15e9f5 into main Mar 13, 2026

Zhyw0 deleted the add/mosstts_fastapi branch March 13, 2026 08:51

Zhyw0 merged 3 commits into main from add/mosstts_fastapi
