F5R-TTS: Improving Flow Matching based Text-to-Speech with Group Relative Policy Optimization

Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, Baoxun Wang

Platform and Content Group, Tencent

Abstract. We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS as probabilistic Gaussian distributions, our approach enables the integration of reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, a GRPO-driven enhancement stage uses two reward metrics: word error rate (WER), computed via automatic speech recognition, and speaker similarity (SIM), assessed by speaker verification models. Experimental results on zero-shot voice cloning show that F5R-TTS improves both speech intelligibility (a 29.5% relative WER reduction) and speaker similarity (a 4.6% relative SIM score increase) over conventional flow-matching based TTS systems. Audio samples are available at https://frontierlabs.github.io/F5R

This page is for research demonstration purposes only.

Overview

Fig 1: The architecture of the backbone.
Fig 2: The pipeline of the GRPO phase.

The overall pipeline of the proposed system.
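
The RL phase described in the abstract can be illustrated with a small sketch. It shows three ingredients under stated assumptions: a scalar reward built from WER and SIM, the group-relative normalization that gives GRPO its name, and the Gaussian log-likelihood that becomes available once the model output is treated as a distribution rather than a point estimate. The function names, the additive reward form, the weights, and the toy numbers are all hypothetical and are not taken from the F5R-TTS implementation.

```python
import math
from statistics import mean, pstdev


def combined_reward(wer, sim, wer_weight=1.0, sim_weight=1.0):
    """Hypothetical scalar reward: lower WER and higher SIM both raise it.

    The additive form and the weights are assumptions for illustration.
    """
    return wer_weight * (1.0 - wer) + sim_weight * sim


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is the reward minus the group mean, divided by the
    group standard deviation, so no separate value model is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gaussian_log_prob(x, mu, sigma):
    """Log-density of x under independent Gaussians N(mu_i, sigma_i^2)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - (xi - m) ** 2 / (2.0 * s * s)
        for xi, m, s in zip(x, mu, sigma)
    )


# Toy group of four generations for one prompt; the (WER, SIM) pairs are made up.
group = [(0.10, 0.70), (0.02, 0.72), (0.30, 0.65), (0.05, 0.68)]
rewards = [combined_reward(w, s) for w, s in group]
advantages = group_relative_advantages(rewards)

for (w, s), a in zip(group, advantages):
    print(f"WER={w:.2f} SIM={s:.2f} advantage={a:+.3f}")
```

In a policy-gradient update, samples with positive advantage would have their log-probability increased and samples with negative advantage decreased. The clipping and KL regularization usually paired with GRPO are omitted here for brevity.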

Basic Zero-shot: Common Audio & Text

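Audio Prompt | Text Prompt | F5 Generation | F5-P Generation | F5-R Generation
海王的三个女儿都没有腿,只有一条鱼尾巴。 (None of the Sea King's three daughters has legs; each has only a fish tail.)
下楼梯时,杨女士慌乱中不小心把脚给崴了。 (While going down the stairs, Ms. Yang, in her panic, accidentally sprained her ankle.)
本次活动共计收到来自全球,六百余位新人的报名。 (This event received applications from more than six hundred newcomers around the world.)
他们都可以证明,蚯蚓王的通讯网站搜索很准,动物界家喻户晓。 (They can all attest that the Earthworm King's communication website has very accurate search and is a household name in the animal kingdom.)
那么比黄光波长更短的绿光蓝光,和紫光为什么不用呢。 (So why not use green, blue, and violet light, whose wavelengths are shorter than that of yellow light?)
我得用回想与幻想补充我所缺少的饮食,安慰我所得到的痛苦。 (I have to use recollection and fantasy to make up for the food I lack and to soothe the pain I have received.)
但早已经听说无双惨败于苍天手下的消息。 (But the news of Wushuang's crushing defeat at the hands of Cangtian had long since been heard.)
自结婚以来,我那位再也没有给我送过一朵玫瑰。 (Since we got married, my other half has never given me another rose.)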

Complex Text Handling: Common Audio & Complex Text

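Audio Prompt | Text Prompt | F5 Generation | F5-P Generation | F5-R Generation
台湾文武庙与中华贤母园,结为友好景区。 (Taiwan's Wenwu Temple and the Chinese Virtuous Mothers Garden have become partner scenic areas.)
快乐宝贝学写字。小猫喜欢吃鱼刺。糖果蜂蜜甜丝丝。大家千万别忘记。 (Happy babies learn to write. Kittens like to eat fish bones. Candy and honey are sweet. Everyone, be sure not to forget.)
七月财政存款预计增加三千亿元。 (Fiscal deposits are expected to increase by 300 billion yuan in July.)
小蜘蛛,拉银丝,来来回回把网织。织网干什么?专吃苍蝇和蚊子。 (Little spider, spinning silver thread, weaves its web back and forth. What is the web for? To eat flies and mosquitoes.)
就连余波,也将下面流淌的岩浆激荡的卷起来。 (Even the aftershock churned up the lava flowing below.)
孟姜女整日以泪洗面,苦苦盼望丈夫早日回家回家回家回家回家。 (Meng Jiangnü wept all day long, longing bitterly for her husband to come home soon, come home, come home, come home, come home.)
请减速慢行,不要连续踩刹车。 (Please slow down and do not brake repeatedly.)
一二三四五六七八,八个娃娃树下画花,桃花杏花梨花李花,菊花梅花荷花兰花。 (One two three four five six seven eight, eight children paint flowers under a tree: peach blossoms, apricot blossoms, pear blossoms, plum blossoms, chrysanthemums, mei blossoms, lotus flowers, orchids.)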

Noisy Audio Handling: Noisy Audio & Common Text

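Audio Prompt | Text Prompt | F5 Generation | F5-P Generation | F5-R Generation
请听那些贵族们所讲的话吧。 (Please listen to what those nobles have to say.)
目前共享出行市场处于高速增长阶段。 (The shared mobility market is currently in a stage of rapid growth.)
之后因赔偿问题,一直同老板扯不清。 (Afterwards, because of the compensation issue, the dispute with the boss was never settled.)
共同建设面向未来的交通,和出行服务新生态。 (Jointly build a future-oriented new ecosystem of transportation and mobility services.)
北京在出行规模,城市影响力方面表现优异。 (Beijing performs excellently in travel scale and urban influence.)
导航开始,全程二十五公里,预计需要十二分钟。 (Navigation started. The total distance is twenty-five kilometers, with an estimated travel time of twelve minutes.)
他不是要跑,只是有点迟疑和慌张。 (He is not trying to run; he is just a little hesitant and flustered.)
深刻的宗教文化是厄瓜多尔社会,充斥着反同态度的主因。 (A deep-rooted religious culture is the main reason Ecuadorian society is rife with anti-gay attitudes.)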

Speaker Finetune

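Audio Prompt | Text Prompt | F5-SFT | F5R-SFT
一定会的,谢谢你我的朋友。 (It certainly will. Thank you, my friend.)
今天的阳光真温暖,适合出去走走。 (The sunshine is so warm today, perfect for going out for a walk.)
请把窗户关上,外面风太大了。 (Please close the window; the wind outside is too strong.)
我是人工智能助手,也可以称为机器人,我通过计算机程序来运行,能够理解和生成人类语言,帮助回答问题、提供信息、进行对话等。 (I am an AI assistant, which you can also call a robot. I run on computer programs and can understand and generate human language to help answer questions, provide information, hold conversations, and so on.)
如果你有什么问题或者想聊的话题,可以直接告诉我,我会尽力用简单易懂的语言和你交流的。 (If you have any questions or topics you would like to talk about, just tell me, and I will do my best to communicate with you in simple, easy-to-understand language.)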