Add MOSS-TTS-Realtime fastapi, RTF and TTFB#76

Merged
Zhyw0 merged 3 commits into main from add/mosstts_fastapi
Mar 13, 2026

@Zhyw0
Collaborator

@Zhyw0 Zhyw0 commented Mar 12, 2026

  1. Added usage instructions for starting the MOSS-TTS-Realtime FastAPI server.
  2. Updated the original streaming generation method.
  3. Measured TTFB and RTF.
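
For readers unfamiliar with the two metrics, here is a minimal sketch of how TTFB (time to first byte) and RTF (real-time factor, processing time divided by audio duration) could be measured over any stream of PCM chunks. It is not the measurement code from this PR. `measure_stream` and `fake_stream` are hypothetical names, and the 24 kHz, 16-bit mono format is an assumption taken from the `sr=24000, ch=1, codec=pcm_s16le` header in the client log further down the thread.

```python
import time
from typing import Dict, Iterable, Iterator

SAMPLE_RATE = 24_000   # assumed: matches sr=24000 in the client log
BYTES_PER_SAMPLE = 2   # assumed: pcm_s16le, mono


def measure_stream(chunks: Iterable[bytes]) -> Dict[str, float]:
    """Consume a stream of PCM chunks and report TTFB and RTF."""
    start = time.perf_counter()
    ttfb = float("nan")
    got_first = False
    total_bytes = 0
    for chunk in chunks:
        if not got_first:
            # Time until the first audio chunk arrives.
            ttfb = time.perf_counter() - start
            got_first = True
        total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    audio_seconds = total_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE)
    # RTF < 1 means audio is produced faster than it plays back.
    rtf = elapsed / audio_seconds if audio_seconds > 0 else float("inf")
    return {
        "ttfb": ttfb,
        "elapsed": elapsed,
        "audio_seconds": audio_seconds,
        "rtf": rtf,
    }


def fake_stream(n_chunks: int = 5, delay: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a real audio stream (silence only)."""
    for _ in range(n_chunks):
        time.sleep(delay)
        yield b"\x00" * 3840  # 1920 samples = 0.08 s of audio


if __name__ == "__main__":
    print(measure_stream(fake_stream()))
```

With a real client, the iterator would be the chunked HTTP or WebSocket response body instead of `fake_stream`.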

@iamyishan

@Zhyw0 Thanks for the reply. But why is the generated audio file out_streaming_split.wav 0 bytes and unplayable?
Here is the server log:
(moss-tts) yangxun@R5300G5:~/test-project/MOSS-TTS/moss_tts_realtime$ python fast_api.py
INFO: Started server process [2983596]
INFO: Waiting for application startup.
[warmup] Loading backend ...
You are using a model of type moss_tts_realtime to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Loading weights: 100%|████████████████████████████████████████████████████████████████████████████| 403/403 [00:00<00:00, 3704.70it/s, Materializing param=local_transformer.model.norm.weight]
Loading weights: 100%|███████████████████████████████████████████| 1600/1600 [00:00<00:00, 3719.46it/s, Materializing param=quantizer.quantizers.31.out_proj.parametrizations.weight.original1]
[warmup] Backend loaded.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8083 (Press CTRL+C to quit)
INFO: 127.0.0.1:52696 - "POST /tts/session/start HTTP/1.1" 200 OK
/home/yangxun/test-project/MOSS-TTS/moss_tts_realtime/fast_api.py:219: UserWarning: torchaudio._backend.utils.info has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see pytorch/audio#3902 for more information. It will be removed from the 2.9 release.
info = torchaudio.info(str(src))
/home/yangxun/miniconda3/envs/moss-tts/lib/python3.11/site-packages/torchaudio/_backend/ffmpeg.py:20: UserWarning: torio.io._streaming_media_decoder.StreamingMediaDecoder has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see pytorch/audio#3902 for more information. It will be removed from the 2.9 release.
s = torchaudio.io.StreamReader(src, format, None, buffer_size)
/home/yangxun/miniconda3/envs/moss-tts/lib/python3.11/site-packages/torchaudio/_backend/ffmpeg.py:27: UserWarning: torchaudio._backend.common.AudioMetaData has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see pytorch/audio#3902 for more information. It will be removed from the 2.9 release.
return AudioMetaData(
/home/yangxun/miniconda3/envs/moss-tts/lib/python3.11/site-packages/torchaudio/_backend/utils.py:213: UserWarning: In 2.9, this function's implementation will be changed to use torchaudio.load_with_torchcodec` under the hood. Some parameters like normalize, ``format``, ``buffer_size``, and ``backend`` will be ignored. We recommend that you port your code to rely directly on TorchCodec's decoder instead: https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.AudioDecoder.html#torchcodec.decoders.AudioDecoder.
warnings.warn(
/home/yangxun/miniconda3/envs/moss-tts/lib/python3.11/site-packages/torchaudio/_backend/ffmpeg.py:88: UserWarning: torio.io._streaming_media_decoder.StreamingMediaDecoder has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see pytorch/audio#3902 for more information. It will be removed from the 2.9 release.
s = torchaudio.io.StreamReader(src, format, None, buffer_size)
INFO: 127.0.0.1:52712 - "POST /tts/session/push HTTP/1.1" 200 OK
INFO: 127.0.0.1:52722 - "GET /tts/session/db3cc292-020b-478c-8fc1-756162fcfe64/audio HTTP/1.1" 200 OK
INFO: 127.0.0.1:52736 - "POST /tts/session/push HTTP/1.1" 200 OK
INFO: 127.0.0.1:52738 - "POST /tts/session/push HTTP/1.1" 200 OK
INFO: 127.0.0.1:52744 - "POST /tts/session/push HTTP/1.1" 200 OK

Here is the client request log:
(moss-tts) yangxun@R5300G5:~/test-project/MOSS-TTS/moss_tts_realtime$ python tts_client.py
[main] SESSION_ID=db3cc292-020b-478c-8fc1-756162fcfe64
[start] first_delta='Welcome to the world of MOSS TTS Realtime. Experie'
[start] payload={'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'assistant_text': 'Welcome to the world of MOSS TTS Realtime. Experie', 'user_text': None, 'prompt_audio': './audio/prompt_audio.mp3', 'user_audio': None, 'new_turn': True}
[start] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'message': 'turn started'}
[audio] connect -> http://127.0.0.1:8083/tts/session/db3cc292-020b-478c-8fc1-756162fcfe64/audio
[push #2] is_final=False, text='nce how text transforms into smooth, human-like sp'
[audio] headers: sr=24000, ch=1, codec=pcm_s16le
[push #2] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'accepted_text_len': 50, 'is_final': False}
[push #3] is_final=False, text='eech in real time. MOSS TTS Realtime is a context-'
[push #3] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'accepted_text_len': 50, 'is_final': False}
[push #4] is_final=False, text='aware multi-turn streaming TTS, a speech generatio'
[push #4] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'accepted_text_len': 50, 'is_final': False}
[push #5] is_final=True, text='n foundation model designed for voice agents.'
[push #5] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'accepted_text_len': 45, 'is_final': True}
[audio] done, chunks=231, bytes=887040
[audio] saved pcm -> out_streaming_split.pcm
[audio] saved wav -> out_streaming_split.wav
[close] resp={'ok': True, 'session_id': 'db3cc292-020b-478c-8fc1-756162fcfe64', 'message': 'session closed'}
[main] done

@Zhyw0
Collaborator Author

Zhyw0 commented Mar 13, 2026

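This is probably because the locally generated WAV file cannot be played in real time; it can only be played after all of the audio has been generated. If you want real-time playback, we recommend launching with app.py. We have also re-optimized the generation speed in that code.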
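
As a workaround sketch (not part of this PR), the raw PCM file that the client also saves can be wrapped into a playable WAV once the stream has finished. The example uses only Python's standard `wave` module. The format parameters are assumptions taken from the client log above (`sr=24000, ch=1, codec=pcm_s16le`), and `pcm_to_wav` is a hypothetical helper name. With those parameters, the 887040 bytes the client reports correspond to 887040 / (24000 × 2) = 18.48 s of audio.

```python
import wave
from pathlib import Path


def pcm_to_wav(pcm_path: str, wav_path: str, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> float:
    """Wrap raw PCM bytes in a WAV container and return the duration in seconds.

    Defaults assume 24 kHz, mono, 16-bit little-endian PCM, as in the client log.
    """
    pcm = Path(pcm_path).read_bytes()
    with wave.open(str(wav_path), "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        # The header sizes are finalized when the file is closed.
        wf.writeframes(pcm)
    return len(pcm) / (sample_rate * channels * sample_width)


# Example call, assuming the client has already written the .pcm file:
# seconds = pcm_to_wav("out_streaming_split.pcm", "out_streaming_split.wav")
```

Because the WAV header stores the total data length, the file is only complete after the writer is closed, which is consistent with the reply above that the file becomes playable only once generation has finished.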

@gaoyang07 gaoyang07 self-requested a review March 13, 2026 03:25
@xiami2019
Member

Please also update the Chinese and English READMEs on the home page.

@iamyishan

iamyishan commented Mar 13, 2026

@Zhyw0 Hi, my machine has 8 A800 GPUs with 80 GB of memory each, as shown below:
(moss-tts) yangxun@R5300G5:~/test-project/MOSS-TTS/moss_tts_realtime$ nvidia-smi
Fri Mar 13 12:59:41 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.211.01 Driver Version: 570.211.01 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A800 80GB PCIe Off | 00000000:0C:00.0 Off | 0 |
| N/A 52C P0 78W / 300W | 14026MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A800 80GB PCIe Off | 00000000:0D:00.0 Off | 0 |
| N/A 43C P0 48W / 300W | 13MiB / 81920MiB | 0% Default |
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3834 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2929025 C /usr/bin/python3 1496MiB |
| 0 N/A N/A 3048895 C python 12510MiB |
| 1 N/A N/A 3834 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 3834 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 3834 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 3834 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 3834 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 3834 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 3834 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+

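I want to implement streaming text input and streaming speech output over WebSocket. The client script tts_stream_client.py and the server script tts_stream_server.py are attached below.
I have also integrated the warmup code from your fast_api into tts_stream_server.py.
tts_stream_client.py
tts_stream_server.py
The problem: inference is still slow, with an RTF of about 1.7. The detailed log is below; could you help me see what is going wrong?
flash_attention_2 is installed as well.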
[warmup] Loading backend ...
[backend] attn_implementation=flash_attention_2, device=cuda:0, dtype=torch.bfloat16
[warmup] Model ready.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8002 (Press CTRL+C to quit)
INFO: 10.35.168.73:11390 - "WebSocket /ws/stream_tts" [accepted]
[WS] connection accepted
[TTS] worker started
INFO: connection open
[WS] received END, total chars=83
[TTS] chunk # 1 samples= 1920 dur=0.080s cost_per_1s=0.000s
[TTS] chunk # 2 samples= 3840 dur=0.160s cost_per_1s=2.152s
[TTS] chunk # 3 samples= 5760 dur=0.240s cost_per_1s=1.934s
[TTS] chunk # 4 samples= 7680 dur=0.320s cost_per_1s=1.816s
[TTS] chunk # 5 samples= 9600 dur=0.400s cost_per_1s=1.768s
[TTS] chunk # 6 samples= 11520 dur=0.480s cost_per_1s=1.730s
[TTS] chunk # 7 samples= 11520 dur=0.480s cost_per_1s=1.721s
[TTS] chunk # 8 samples= 11520 dur=0.480s cost_per_1s=1.721s
[TTS] chunk # 9 samples= 11520 dur=0.480s cost_per_1s=1.731s
[TTS] chunk # 10 samples= 11520 dur=0.480s cost_per_1s=1.725s
[TTS] chunk # 11 samples= 11520 dur=0.480s cost_per_1s=1.719s
[TTS] chunk # 12 samples= 11520 dur=0.480s cost_per_1s=1.717s
[TTS] chunk # 13 samples= 11520 dur=0.480s cost_per_1s=1.717s
[TTS] chunk # 14 samples= 11520 dur=0.480s cost_per_1s=1.722s
[TTS] chunk # 15 samples= 11520 dur=0.480s cost_per_1s=1.720s
[TTS] chunk # 16 samples= 11520 dur=0.480s cost_per_1s=1.715s
[TTS] chunk # 17 samples= 11520 dur=0.480s cost_per_1s=1.717s
[TTS] chunk # 18 samples= 11520 dur=0.480s cost_per_1s=1.718s
[TTS] chunk # 19 samples= 11520 dur=0.480s cost_per_1s=1.716s
[TTS] chunk # 20 samples= 11520 dur=0.480s cost_per_1s=1.717s
[TTS] chunk # 21 samples= 11520 dur=0.480s cost_per_1s=1.718s
[TTS] chunk # 22 samples= 11520 dur=0.480s cost_per_1s=1.715s
[TTS] chunk # 23 samples= 11520 dur=0.480s cost_per_1s=1.728s
[TTS] chunk # 24 samples= 11520 dur=0.480s cost_per_1s=1.721s
[TTS] chunk # 25 samples= 11520 dur=0.480s cost_per_1s=1.720s
[TTS] chunk # 26 samples= 11520 dur=0.480s cost_per_1s=1.722s
[TTS] chunk # 27 samples= 11520 dur=0.480s cost_per_1s=1.720s
[TTS] chunk # 28 samples= 11520 dur=0.480s cost_per_1s=1.716s
[TTS] chunk # 29 samples= 11520 dur=0.480s cost_per_1s=1.721s
[TTS] chunk # 30 samples= 11520 dur=0.480s cost_per_1s=1.718s
[TTS] chunk # 31 samples= 11520 dur=0.480s cost_per_1s=1.724s
[TTS] chunk # 32 samples= 11520 dur=0.480s cost_per_1s=1.722s
[TTS] chunk # 33 samples= 11520 dur=0.480s cost_per_1s=1.718s
[TTS] chunk # 34 samples= 11520 dur=0.480s cost_per_1s=1.732s
[TTS] chunk # 35 samples= 11520 dur=0.480s cost_per_1s=1.723s
[TTS] chunk # 36 samples= 11520 dur=0.480s cost_per_1s=1.722s
[TTS] chunk # 37 samples= 11520 dur=0.480s cost_per_1s=1.718s
[TTS] chunk # 38 samples= 11520 dur=0.480s cost_per_1s=1.731s
[TTS] chunk # 39 samples= 11520 dur=0.480s cost_per_1s=1.721s
[TTS] chunk # 40 samples= 1920 dur=0.080s cost_per_1s=4.203s
[TTS] DONE chunks=40 audio=17.60s elapsed=30.53s RTF=1.734
[TTS] sent END
[WS] done
INFO: connection closed
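
The summary line of this log can be cross-checked with a little arithmetic. The sketch below assumes a 24 kHz sample rate, which is consistent with `samples=1920` being logged as `dur=0.080s`. It rebuilds the total audio duration from the logged chunk sizes and recomputes RTF as elapsed time divided by audio duration. The variable names are illustrative only.

```python
SAMPLE_RATE = 24_000  # assumed: 1920 samples / 0.080 s in the log above

# Chunk sizes as logged: a ramp from 1920 to 9600 samples (chunks 1-5),
# then 34 chunks of 11520 samples (chunks 6-39), then a final 1920.
samples = [1920 * k for k in range(1, 6)] + [11520] * 34 + [1920]

total_samples = sum(samples)
audio_s = total_samples / SAMPLE_RATE   # total audio duration in seconds
elapsed_s = 30.53                       # "elapsed" from the DONE line
rtf = elapsed_s / audio_s               # RTF > 1 means slower than real time

print(f"chunks={len(samples)} audio={audio_s:.2f}s RTF={rtf:.3f}")
```

The recomputed duration matches the logged `audio=17.60s`, and the RTF agrees with the logged value to within rounding of the elapsed time. An RTF of about 1.73 means the server needs roughly 1.73 s of compute per second of audio, so playback would stall.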

@Zhyw0
Collaborator Author

Zhyw0 commented Mar 13, 2026

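Oh, flash-attn2 should not be used here; we recommend sdpa instead. sdpa supports compile, whereas flash-attn2 currently conflicts with torch.compile, so sdpa + compile is faster. Thanks for pointing this out; we will document it in the README.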
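
One way to encode this advice in a server script is a small selection helper, sketched below under stated assumptions. `pick_attn_implementation` is a hypothetical name, and the thread does not show how the MOSS-TTS-Realtime backend receives the value (the server log only prints `attn_implementation=...`). `attn_implementation` is the keyword accepted by Hugging Face Transformers `from_pretrained`, and `flash_attn` is the import name of the flash-attn package.

```python
import importlib.util


def pick_attn_implementation(use_compile: bool) -> str:
    """Choose an attention backend name.

    Follows the advice in this thread: when torch.compile is wanted,
    always use "sdpa". Otherwise use flash_attention_2 only if the
    flash_attn package is importable, and fall back to "sdpa".
    """
    if use_compile:
        return "sdpa"
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if has_flash else "sdpa"


if __name__ == "__main__":
    attn = pick_attn_implementation(use_compile=True)
    print(f"attn_implementation={attn}")
    # Hypothetical usage, not taken from this repository:
    #   model = SomeModel.from_pretrained(path, attn_implementation=attn)
    #   model = torch.compile(model)
```

Selecting the backend explicitly avoids having to uninstall flash-attn just to change which implementation is picked up.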

@iamyishan

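After uninstalling flash-attn2 the speed did improve. I tested several times: each time the RTF is above 1 for the first 2 to 3 seconds, then it stabilizes at around 0.8, which is basically usable. Thanks!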
[WS] received END, total chars=520
[TTS] chunk # 1 samples= 1920 dur=0.080s cost_per_1s=0.000s
[TTS] chunk # 2 samples= 3840 dur=0.160s cost_per_1s=1.273s
[TTS] chunk # 3 samples= 5760 dur=0.240s cost_per_1s=1.041s
[TTS] chunk # 4 samples= 7680 dur=0.320s cost_per_1s=0.945s
[TTS] chunk # 5 samples= 9600 dur=0.400s cost_per_1s=0.888s
[TTS] chunk # 6 samples= 11520 dur=0.480s cost_per_1s=0.850s
[TTS] chunk # 7 samples= 11520 dur=0.480s cost_per_1s=0.844s
[TTS] chunk # 8 samples= 11520 dur=0.480s cost_per_1s=0.842s
[TTS] chunk # 9 samples= 11520 dur=0.480s cost_per_1s=0.841s
[TTS] chunk # 10 samples= 11520 dur=0.480s cost_per_1s=0.856s
[TTS] chunk # 11 samples= 11520 dur=0.480s cost_per_1s=0.848s
[TTS] chunk # 12 samples= 11520 dur=0.480s cost_per_1s=0.847s
[TTS] chunk # 13 samples= 11520 dur=0.480s cost_per_1s=0.849s
[TTS] chunk # 14 samples= 11520 dur=0.480s cost_per_1s=0.848s
[TTS] chunk # 15 samples= 11520 dur=0.480s cost_per_1s=0.847s
[TTS] chunk # 16 samples= 11520 dur=0.480s cost_per_1s=0.848s
[TTS] chunk # 17 samples= 11520 dur=0.480s cost_per_1s=0.845s
[TTS] chunk # 18 samples= 11520 dur=0.480s cost_per_1s=0.847s
[TTS] chunk # 19 samples= 11520 dur=0.480s cost_per_1s=0.854s
[TTS] chunk # 20 samples= 11520 dur=0.480s cost_per_1s=0.850s
[TTS] chunk # 21 samples= 11520 dur=0.480s cost_per_1s=0.846s
[TTS] chunk # 22 samples= 11520 dur=0.480s cost_per_1s=0.849s
[TTS] chunk # 23 samples= 11520 dur=0.480s cost_per_1s=0.849s
[TTS] chunk # 24 samples= 11520 dur=0.480s cost_per_1s=0.847s
[TTS] chunk # 25 samples= 11520 dur=0.480s cost_per_1s=0.861s

@Zhyw0
Collaborator Author

Zhyw0 commented Mar 13, 2026

Thanks for the feedback. That said, it still should not be this slow on an A800; we will run further tests and optimizations.

@gaoyang07 gaoyang07 changed the title add: moss tts realtime fastapi, RTF and TTFB Add MOSS-TTS-Realtime fastapi, RTF and TTFB Mar 13, 2026
@iamyishan

iamyishan commented Mar 13, 2026

INFO: connection open
[WS] received END, total chars=520
/home/yangxun/miniconda3/envs/moss-tts/lib/python3.11/site-packages/torch/_inductor/compile_fx.py:282: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting torch.set_float32_matmul_precision('high') for better performance.
warnings.warn(
[TTS] chunk # 1 samples= 1920 dur=0.080s cost_per_1s=0.000s
[TTS] chunk # 2 samples= 3840 dur=0.160s cost_per_1s=112.194s
[TTS] chunk # 3 samples= 5760 dur=0.240s cost_per_1s=1.086s
[TTS] chunk # 4 samples= 7680 dur=0.320s cost_per_1s=0.980s
[TTS] chunk # 5 samples= 9600 dur=0.400s cost_per_1s=0.917s
[TTS] chunk # 6 samples= 11520 dur=0.480s cost_per_1s=0.885s
[TTS] chunk # 7 samples= 11520 dur=0.480s cost_per_1s=0.871s
[TTS] chunk # 8 samples= 11520 dur=0.480s cost_per_1s=0.872s
[TTS] chunk # 9 samples= 11520 dur=0.480s cost_per_1s=0.873s
[TTS] chunk # 10 samples= 11520 dur=0.480s cost_per_1s=0.872s
[TTS] chunk # 11 samples= 11520 dur=0.480s cost_per_1s=18.575s
[TTS] chunk # 12 samples= 11520 dur=0.480s cost_per_1s=0.886s

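I also found that the first request is very slow around its 2nd second (see chunk # 2 above).
The second request is better, as shown below: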
[TTS] worker started
INFO: connection open
[WS] received END, total chars=520
[TTS] chunk # 1 samples= 1920 dur=0.080s cost_per_1s=0.000s
[TTS] chunk # 2 samples= 3840 dur=0.160s cost_per_1s=1.302s
[TTS] chunk # 3 samples= 5760 dur=0.240s cost_per_1s=1.064s
[TTS] chunk # 4 samples= 7680 dur=0.320s cost_per_1s=0.965s
[TTS] chunk # 5 samples= 9600 dur=0.400s cost_per_1s=0.913s
[TTS] chunk # 6 samples= 11520 dur=0.480s cost_per_1s=0.869s

Also, from my tests MOSS-TTS-Realtime does not seem to support dialects, and there is no speed parameter for controlling the speech rate, like speed=1.0 in CosyVoice?

@Zhyw0
Collaborator Author

Zhyw0 commented Mar 13, 2026

MOSS-TTS-Realtime does not yet support dialects or control of the generation speed.

@Zhyw0 Zhyw0 merged commit d15e9f5 into main Mar 13, 2026
@Zhyw0 Zhyw0 deleted the add/mosstts_fastapi branch March 13, 2026 08:51
4 participants