Guo Lulu, Gao Shang. A Text-to-Speech Method Based on Non-Autoregressive Model[J]. Journal of Nanjing Normal University (Natural Science Edition), 2025, 48(05): 129-138. [doi:10.3969/j.issn.1001-4616.2025.05.015]

A Text-to-Speech Method Based on Non-Autoregressive Model

Journal of Nanjing Normal University (Natural Science Edition) [ISSN:1001-4616 / CN:32-1239/N]

Volume:
48
Issue:
2025, No. 05
Pages:
129-138
Column:
Computer Science and Technology
Publication Date:
2025-10-20

Article Info

Title:
A Text-to-Speech Method Based on Non-Autoregressive Model
Article ID:
1001-4616(2025)05-0129-10
Author(s):
Guo Lulu, Gao Shang
(School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China)
Keywords:
speech synthesis; autoregressive model; non-autoregressive model; attention mechanism; post-processing network
CLC Number:
TP391
DOI:
10.3969/j.issn.1001-4616.2025.05.015
Document Code:
A
Abstract:
Text-to-Speech (TTS) is a technology that synthesizes speech from given text and has broad application prospects. Compared with autoregressive TTS models, non-autoregressive TTS models synthesize speech significantly faster. However, non-autoregressive models still leave room for improvement in both synthesis speed and speech quality on long-sequence tasks. To this end, this paper proposes EnhanceSpeech, a non-autoregressive TTS model. First, the model uses learnable external memory vectors to simplify the attention computation, effectively reducing computational complexity and memory usage while improving inference speed. Second, it introduces a post-processing network based on hierarchical squeeze attention, which uses two-dimensional convolutions to treat mel-spectrogram generation as an image-processing task, significantly improving the quality of the generated mel-spectrograms. Experimental results show that EnhanceSpeech is over 60 times faster than its autoregressive counterparts. Moreover, it outperforms comparable non-autoregressive models, bringing its performance closer to that of leading autoregressive models.
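
The abstract names the key mechanism, learnable external memory vectors, without giving its formulation. Below is a minimal PyTorch sketch, assuming the layer resembles the common "external attention" design in which a small bank of learnable memory vectors replaces pairwise token-to-token attention; the per-layer cost then drops from O(n²·d) to O(n·m·d) for sequence length n and m ≪ n memory slots, which is consistent with the claimed savings in computation and memory. All names and sizes here are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalMemoryAttention(nn.Module):
    """Attention against a small bank of learnable memory vectors.

    Hypothetical illustration, not the paper's exact layer: the memory
    keys/values are model parameters shared across the whole dataset,
    so no O(n^2) token-to-token score matrix is ever built.
    """

    def __init__(self, d_model: int, num_slots: int = 64):
        super().__init__()
        self.mem_k = nn.Linear(d_model, num_slots, bias=False)  # acts as memory-key matrix M_k
        self.mem_v = nn.Linear(num_slots, d_model, bias=False)  # acts as memory-value matrix M_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model)
        attn = F.softmax(self.mem_k(x), dim=-1)                # (batch, n, num_slots)
        attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-9)   # second normalization over tokens
        return self.mem_v(attn)                                # (batch, n, d_model)
```

With, say, n = 1,000 spectrogram frames and m = 64 slots, the attention map is 1,000 × 64 rather than 1,000 × 1,000, which is where the speed and memory savings on long sequences would come from.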
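Likewise, the "hierarchical squeeze attention" post-processing network is only named in the abstract. One plausible reading, sketched below under the assumption that the squeeze attention behaves like squeeze-and-excitation channel gating, is a residual 2D-convolutional refiner that treats the (n_mels × frames) mel-spectrogram as a one-channel image; every detail here is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SqueezeAttention2d(nn.Module):
    """Squeeze-and-excitation-style channel gate (assumed stand-in for the
    paper's 'hierarchical squeeze attention', whose details are not given here)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, n_mels, frames)
        w = x.mean(dim=(2, 3))                     # squeeze: global average pool
        return x * self.gate(w)[..., None, None]   # excite: per-channel reweighting

class PostNet2d(nn.Module):
    """Residual refiner that treats the mel-spectrogram as a one-channel image."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            SqueezeAttention2d(hidden),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames); unsqueeze adds the image channel axis
        return mel + self.refine(mel.unsqueeze(1)).squeeze(1)
```

A coarse decoder output of shape (batch, 80, frames) passes through PostNet2d unchanged in shape, so the network only has to learn a correction term, a common design choice that keeps post-net training stable.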

Memo:
Received: 2024-09-09.
Foundation item: National Natural Science Foundation of China (62376109).
Corresponding author: Gao Shang, Ph.D., professor; research interest: pattern recognition. E-mail: gao_shang@just.edu.cn
Last Update: 2025-10-20