F5R-TTS: Improving Flow Matching based Text-to-Speech with Group Relative Policy Optimization

Xiaohui Sun, Ruitong Xiao, Jianye Mo, Bowen Wu, Qun Yu, Baoxun Wang

Platform and Content Group, Tencent

Abstract. We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching based TTS as probabilistic Gaussian distributions, our approach allows reinforcement learning algorithms to be integrated into TTS systems. During pretraining, we train a reformulated flow-matching based model derived from F5-TTS. In the subsequent reinforcement learning phase, we apply a GRPO-driven enhancement stage that uses two reward metrics: word error rate (WER), computed via automatic speech recognition, and speaker similarity (SIM), assessed by verification models. Experimental results on zero-shot voice cloning show that F5R-TTS improves both speech intelligibility (a 29.5% relative reduction in WER) and speaker similarity (a 4.6% relative increase in SIM score) compared to conventional flow-matching based TTS systems.
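As a rough illustration of what treating the model output as a Gaussian distribution can look like, the sketch below samples from a diagonal Gaussian and evaluates the log-probability of that sample, which is the quantity a policy-gradient method needs. The function names, the mean-plus-log-standard-deviation parameterization, and the toy values are assumptions made for this example and are not taken from the F5R-TTS implementation.

```python
import math
import random


def gaussian_log_prob(x, mean, log_std):
    """Log-density of x under independent Gaussians, summed over dimensions.

    Assumption: the model predicts one mean and one log standard deviation
    per output dimension (a diagonal Gaussian).
    """
    total = 0.0
    for xi, mi, lsi in zip(x, mean, log_std):
        var = math.exp(2.0 * lsi)
        total += -0.5 * ((xi - mi) ** 2 / var + 2.0 * lsi + math.log(2.0 * math.pi))
    return total


def sample_gaussian(mean, log_std, rng):
    """Draw one sample per dimension from the predicted distribution."""
    return [rng.gauss(m, math.exp(ls)) for m, ls in zip(mean, log_std)]


# Toy values standing in for one predicted output frame (hypothetical).
rng = random.Random(0)
mean_vec = [0.1, -0.2, 0.3]
log_std = [-1.0, -1.0, -1.0]

sample = sample_gaussian(mean_vec, log_std, rng)
logp = gaussian_log_prob(sample, mean_vec, log_std)
```

With a deterministic output there is no such log-probability to differentiate, which is why a probabilistic reformulation is a natural prerequisite for reinforcement learning.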

This page is for research demonstration purposes only.

Overview

Fig 1: The architecture of the backbone.
Fig 2: The pipeline of the GRPO phase.

The overall pipeline of the proposed system.
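The group-relative step that gives GRPO its name can be sketched as follows: several generations for the same prompt are scored, and each reward is normalized against the mean and standard deviation of its own group. How WER and SIM are combined into one scalar is an assumption here (a simple weighted difference with hypothetical weights), and the scores are made-up numbers, not results from the system.

```python
from statistics import mean, pstdev


def combined_reward(wer, sim, wer_weight=1.0, sim_weight=1.0):
    """Hypothetical scalar reward: lower WER and higher SIM are both better."""
    return sim_weight * sim - wer_weight * wer


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up (WER, SIM) scores for four generations of one prompt.
group = [(0.10, 0.70), (0.02, 0.75), (0.30, 0.60), (0.05, 0.72)]
rewards = [combined_reward(wer, sim) for wer, sim in group]
advantages = group_relative_advantages(rewards)
```

Samples that score above their group average receive positive advantages and are reinforced, while below-average samples are discouraged, so no separate value model is required.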

Basic Zero-shot: Common Audio & Text

Audio Prompt | Text Prompt | F5 Generation | F5-P Generation | F5-R Generation
海王的三个女儿都没有腿,只有一条鱼尾巴。
下楼梯时,杨女士慌乱中不小心把脚给崴了。
本次活动共计收到来自全球,六百余位新人的报名。
他们都可以证明,蚯蚓王的通讯网站搜索很准,动物界家喻户晓。
那么比黄光波长更短的绿光蓝光,和紫光为什么不用呢。
我得用回想与幻想补充我所缺少的饮食,安慰我所得到的痛苦。
但早已经听说无双惨败于苍天手下的消息。
自结婚以来,我那位再也没有给我送过一朵玫瑰。

Complex Text Handling: Common Audio & Complex Text

Audio Prompt | Text Prompt | F5 Generation | F5-P Generation | F5-R Generation
台湾文武庙与中华贤母园,结为友好景区。
快乐宝贝学写字。小猫喜欢吃鱼刺。糖果蜂蜜甜丝丝。大家千万别忘记。
七月财政存款预计增加三千亿元。
小蜘蛛,拉银丝,来来回回把网织。织网干什么?专吃苍蝇和蚊子。
就连余波,也将下面流淌的岩浆激荡的卷起来。
孟姜女整日以泪洗面,苦苦盼望丈夫早日回家回家回家回家回家。
请减速慢行,不要连续踩刹车。
一二三四五六七八,八个娃娃树下画花,桃花杏花梨花李花,菊花梅花荷花兰花。

Noisy Audio Handling: Noisy Audio & Common Text

Audio Prompt | Text Prompt | F5 Generation | F5-P Generation | F5-R Generation
请听那些贵族们所讲的话吧。
目前共享出行市场处于高速增长阶段。
之后因赔偿问题,一直同老板扯不清。
共同建设面向未来的交通,和出行服务新生态。
北京在出行规模,城市影响力方面表现优异。
导航开始,全程二十五公里,预计需要十二分钟。
他不是要跑,只是有点迟疑和慌张。
深刻的宗教文化是厄瓜多尔社会,充斥着反同态度的主因。

Emotionally Expressive Voice Generation

Emotion | Audio Prompt | Text Prompt | F5 Generation | F5-P Generation | F5-R Generation
Angry
于是,国王和王后一起把萝卜埋回地里。
气死我了,真是受不了!
Sad
小刚和小妞妞,也捏了几个漂亮的小动物。
养了三年的花突然枯了,怎么浇水都没救回来。
Happy
也往往不过是可能影响,某些成分的吸收而已。
今天天气超好,出去玩开心到飞起!
Surprised
中国有句俗话,说人无利益,谁肯早起。
真的假的!你能中奖可太幸运了吧!