F5R-TTS: Improving Flow Matching based Text-to-Speech with Group Relative Policy Optimization
Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, Baoxun Wang
Platform and Content Group, Tencent
Abstract.
We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS as probabilistic Gaussian distributions, our approach makes the model compatible with reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, we apply a GRPO-driven enhancement stage that uses two reward metrics: word error rate (WER), computed via automatic speech recognition, and speaker similarity (SIM), assessed by speaker verification models. Experimental results on zero-shot voice cloning show that F5R-TTS improves both speech intelligibility (a 29.5% relative WER reduction) and speaker similarity (a 4.6% relative SIM score increase) compared with conventional flow-matching based TTS systems. Audio samples are available at https://frontierlabs.github.io/F5R
This page is for research demonstration purposes only.
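As a rough illustration of the two ideas named in the abstract (Gaussian-distributed model outputs and group-relative rewards built from WER and SIM), the following minimal sketch may help. It is not the authors' implementation: the function names, the linear reward combination, the weights, and the sample scores are all assumptions made for this example.

```python
import math
from statistics import mean, pstdev


def combined_reward(wer, sim, wer_weight=1.0, sim_weight=1.0):
    """Hypothetical scalar reward: lower WER and higher SIM both raise it."""
    return sim_weight * sim - wer_weight * wer


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def gaussian_log_prob(x, mu, log_var):
    """Log-density of a scalar x under N(mu, exp(log_var)).

    A model that predicts a mean and a log-variance instead of a single
    deterministic value can assign a likelihood like this to each sampled
    output, which is the quantity a policy-gradient update needs.
    """
    return -0.5 * (math.log(2.0 * math.pi) + log_var
                   + (x - mu) ** 2 / math.exp(log_var))


# Made-up (WER, SIM) scores for four samples generated from one prompt.
group = [(0.10, 0.70), (0.05, 0.72), (0.20, 0.65), (0.05, 0.60)]
rewards = [combined_reward(wer, sim) for wer, sim in group]
advantages = group_relative_advantages(rewards)

# Index of the sample that the group-relative signal favors most.
best = max(range(len(group)), key=lambda i: advantages[i])
print(best, [round(a, 3) for a in advantages])
```

Because advantages are computed relative to other samples from the same prompt, no separate value network is needed to provide a baseline; samples that beat their group average receive positive weight and the rest receive negative weight.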
Overview


The overall pipeline of the proposed system.
Basic Zero-shot: Common Audio & Text
Audio Prompt | Text Prompt | F5 Generation | F5-P Generation | F5-R Generation |
---|---|---|---|---|
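海王的三个女儿都没有腿,只有一条鱼尾巴。 ("The Sea King's three daughters have no legs, only a fish tail.") | 下楼梯时,杨女士慌乱中不小心把脚给崴了。 ("While going down the stairs, Ms. Yang twisted her ankle in her panic.") | | | |
本次活动共计收到来自全球,六百余位新人的报名。 ("This event received registrations from more than six hundred newcomers from around the world.") | 他们都可以证明,蚯蚓王的通讯网站搜索很准,动物界家喻户晓。 ("They can all attest that the Earthworm King's communication website search is very accurate and known to every household in the animal kingdom.") | | | |
那么比黄光波长更短的绿光蓝光,和紫光为什么不用呢。 ("So why not use green, blue, and violet light, which have shorter wavelengths than yellow light?") | 我得用回想与幻想补充我所缺少的饮食,安慰我所得到的痛苦。 ("I have to use recollection and fantasy to make up for the food I lack and to console the pain I have received.") | | | |
但早已经听说无双惨败于苍天手下的消息。 ("But the news of Wushuang's crushing defeat at the hands of Cangtian had long since been heard.") | 自结婚以来,我那位再也没有给我送过一朵玫瑰。 ("Since we got married, my other half has never given me a single rose again.") | | | |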
Complex Text Handling: Common Audio & Complex Text
Audio Prompt | Text Prompt | F5 Generation | F5-P Generation | F5-R Generation |
---|---|---|---|---|
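台湾文武庙与中华贤母园,结为友好景区。 ("Taiwan's Wenwu Temple and the Zhonghua Xianmu Garden have become partner scenic areas.") | 快乐宝贝学写字。小猫喜欢吃鱼刺。糖果蜂蜜甜丝丝。大家千万别忘记。 ("Happy baby learns to write. The kitten likes to eat fish bones. Candy and honey are so sweet. Everyone, be sure not to forget.") | | | |
七月财政存款预计增加三千亿元。 ("Fiscal deposits are expected to increase by three hundred billion yuan in July.") | 小蜘蛛,拉银丝,来来回回把网织。织网干什么?专吃苍蝇和蚊子。 ("Little spider, spinning silver thread, weaves its web back and forth. What is the web for? Eating flies and mosquitoes.") | | | |
就连余波,也将下面流淌的岩浆激荡的卷起来。 ("Even the aftershock churned up the lava flowing below.") | 孟姜女整日以泪洗面,苦苦盼望丈夫早日回家回家回家回家回家。 ("Meng Jiangnü wept all day long, longing bitterly for her husband to come home soon, come home, come home, come home, come home.") | | | |
请减速慢行,不要连续踩刹车。 ("Please slow down and do not brake repeatedly.") | 一二三四五六七八,八个娃娃树下画花,桃花杏花梨花李花,菊花梅花荷花兰花。 ("One two three four five six seven eight, eight children painting flowers under a tree: peach, apricot, pear, and plum blossoms; chrysanthemum, mei blossom, lotus, and orchid.") | | | |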
Noisy Audio Handling: Noisy Audio & Common Text
Audio Prompt | Text Prompt | F5 Generation | F5-P Generation | F5-R Generation |
---|---|---|---|---|
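请听那些贵族们所讲的话吧。 ("Please listen to what those nobles are saying.") | 目前共享出行市场处于高速增长阶段。 ("The shared mobility market is currently in a stage of rapid growth.") | | | |
之后因赔偿问题,一直同老板扯不清。 ("Afterwards, the compensation issue was never settled with the boss.") | 共同建设面向未来的交通,和出行服务新生态。 ("Jointly build a future-oriented transportation and mobility service ecosystem.") | | | |
北京在出行规模,城市影响力方面表现优异。 ("Beijing performs excellently in travel volume and urban influence.") | 导航开始,全程二十五公里,预计需要十二分钟。 ("Navigation started; the total distance is twenty-five kilometers and is expected to take twelve minutes.") | | | |
他不是要跑,只是有点迟疑和慌张。 ("He was not trying to run; he was just a little hesitant and flustered.") | 深刻的宗教文化是厄瓜多尔社会,充斥着反同态度的主因。 ("A deep-rooted religious culture is the main reason Ecuadorian society is filled with anti-gay attitudes.") | | | |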
Speaker Finetune
Audio Prompt | Text Prompt | F5-SFT | F5R-SFT |
---|---|---|---|
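一定会的,谢谢你我的朋友。 ("I certainly will. Thank you, my friend.") | 今天的阳光真温暖,适合出去走走。 ("The sunshine is so warm today, perfect for going out for a walk.") | | |
一定会的,谢谢你我的朋友。 ("I certainly will. Thank you, my friend.") | 请把窗户关上,外面风太大了。 ("Please close the window; it is too windy outside.") | | |
一定会的,谢谢你我的朋友。 ("I certainly will. Thank you, my friend.") | 我是人工智能助手,也可以称为机器人,我通过计算机程序来运行,能够理解和生成人类语言,帮助回答问题、提供信息、进行对话等。 ("I am an AI assistant, which can also be called a robot. I run through computer programs and can understand and generate human language, helping to answer questions, provide information, hold conversations, and so on.") | | |
一定会的,谢谢你我的朋友。 ("I certainly will. Thank you, my friend.") | 如果你有什么问题或者想聊的话题,可以直接告诉我,我会尽力用简单易懂的语言和你交流的。 ("If you have any questions or topics you would like to talk about, just tell me, and I will do my best to communicate with you in simple, easy-to-understand language.") | | |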