References:
[1]ARIK S O,KLIEGL M,CHILD R,et al. Convolutional recurrent neural networks for small-footprint keyword spotting[C]//Proceedings of the 18th Annual Conference of the International Speech Communication Association(INTERSPEECH 2017). Stockholm,Sweden:ISCA,2017:1606-1610.
[2]LI N,LIU S,LIU Y,et al. Neural speech synthesis with transformer network[J]. Proceedings of the AAAI conference on artificial intelligence,2019,33(1):6706-6713.
[3]SHEN J,PANG R,WEISS R,et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP). Calgary,Canada:IEEE,2018:4779-4783.
[4]WANG Y,SKERRY-RYAN R,STANTON D,et al. Tacotron:towards end-to-end speech synthesis[C]//Proceedings of the 18th Annual Conference of the International Speech Communication Association(INTERSPEECH 2017). Stockholm,Sweden:ISCA,2017:4006-4010.
[5]LIU J,LI C,REN Y,et al. DiffSinger:singing voice synthesis via shallow diffusion mechanism[J]. Proceedings of the AAAI conference on artificial intelligence,2022,36(10):11020-11028.
[6]PENG Y,LIU B. Attention-based neural network for short-text question answering[C]//Proceedings of the 2018 2nd International Conference on Deep Learning Technologies. New York,NY,USA:Association for Computing Machinery,2018:21-26.
[7]REN Y,LIU J,TAN X,et al. SimulSpeech:end-to-end simultaneous speech to text translation[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online:ACL,2020:3787-3796.
[8]YANG B,ZHONG J,LIU S. Pre-trained text representations for improving front-end text processing in mandarin Text-to-Speech synthesis[C]//Proceedings of the 20th Annual Conference of the International Speech Communication Association(INTERSPEECH 2019). Graz,Austria:ISCA,2019:4480-4484.
[9]REN Y,TAN X,QIN T,et al. Almost unsupervised text to speech and automatic speech recognition[C]//Proceedings of the 36th International Conference on Machine Learning. Long Beach,CA,USA:PMLR,2019,97.
[10]LEE Y,SHIN J,JUNG K. Bidirectional variational inference for non-autoregressive Text-to-Speech[C/OL]//International Conference on Learning Representations. Online,2020. https://openreview.net/forum?id=S1g_G1HwDB.
[11]HAYASHI T,YAMAMOTO R,YOSHIMURA T,et al. ESPnet2-TTS:extending the edge of TTS research[J]. arXiv Preprint arXiv:2110.07840,2021.
[12]JEONG M,KIM H,CHEON S J,et al. Diff-TTS:a denoising diffusion model for Text-to-Speech[C]//Proceedings of the 22nd Annual Conference of the International Speech Communication Association(INTERSPEECH 2021). Brno,Czech Republic:ISCA,2021:3605-3609.
[13]LIM D,JANG W,O G,et al. JDI-T:jointly trained duration informed transformer for Text-to-Speech without explicit alignment[C]//Proceedings of the 21st Annual Conference of the International Speech Communication Association(INTERSPEECH 2020). Shanghai,China:ISCA,2020:4004-4008.
[14]KIM G,HONG S,FRANZ M,et al. Improving cross-platform binary analysis using representation learning via graph alignment[C]//Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. New York,NY,USA:Association for Computing Machinery,2022:151-163.
[15]HUANG W C,WU Y C,HAYASHI T. Any-to-one sequence-to-sequence voice conversion using self-supervised discrete speech representations[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP). Toronto,ON,Canada:IEEE,2021:5944-5948.
[16]REN Y,RUAN Y,TAN X,et al. FastSpeech:fast,robust and controllable text to speech[C]//Advances in Neural Information Processing Systems. Vancouver,Canada:Curran Associates,Inc.,2019,32.
[17]YU J,XU Z,HE X,et al. DIA-TTS:deep-inherited attention-based Text-to-Speech synthesizer[J]. Entropy,2022,25(1):41.
[18]ZHOU K,SISMAN B,LI H. Limited data emotional voice conversion leveraging Text-to-Speech:two-stage sequence-to-sequence training[C]//Proceedings of the 22nd Annual Conference of the International Speech Communication Association(INTERSPEECH 2021). Brno,Czech Republic:ISCA,2021:811-815.
[19]LI N,LIU Y,WU Y,et al. RobuTrans:a robust transformer-based Text-to-Speech model[J]. Proceedings of the AAAI conference on artificial intelligence,2020,34(5):8228-8235.
[20]OKAMOTO T,TODA T,SHIGA Y,et al. Transformer-based Text-to-Speech with weighted forced attention[C]//ICASSP 2020-2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP). Barcelona,Spain:IEEE,2020:6729-6733.
[21]REN Y,HU C,TAN X,et al. FastSpeech 2:fast and high-quality end-to-end text to speech[C/OL]//International Conference on Learning Representations. Online,2021. https://openreview.net/forum?id=pi-n2_533c.
[22]LIAN J,ZHANG C,ANUMANCHIPALLI G K,et al. Unsupervised TTS acoustic modeling for TTS with conditional disentangled sequential VAE[J]. IEEE/ACM transactions on audio,speech,and language processing,2023,31:2548-2557.
[23]JING X,CHANG Y,YANG Z. U-DiT TTS:U-Diffusion vision transformer for Text-to-Speech[C]//Proceedings of the 49th DAGA Conference on Acoustics. Hamburg,Germany:VDE VERLAG GMBH,2023:110-113.
[24]JIANG Z,REN Y,YE Z,et al. Mega-TTS:zero-shot Text-to-Speech at scale with intrinsic inductive bias[J]. arXiv Preprint arXiv:2306.03509,2023.
[25]JIANG Z,LIU J,REN Y,et al. Mega-TTS 2:boosting prompting mechanisms for zero-shot speech synthesis[J]. arXiv Preprint arXiv:2307.07218,2023.
[26]LIU H,HUANG R,LIN X,et al. ViT-TTS:visual Text-to-Speech with scalable diffusion transformer[C]//Conference on Empirical Methods in Natural Language Processing. Singapore:ACL,2023:15957-15969.
[27]LEE K,KIM D W,KIM J,et al. DiTTo-TTS:efficient and scalable zero-shot Text-to-Speech with diffusion transformer[J]. arXiv Preprint arXiv:2406.11427,2024.
[28]ITO K,JOHNSON L. The LJ Speech Dataset[DB/OL].(2017). https://keithito.com/LJ-Speech-Dataset/.
[29]DATABAKER. Chinese standard mandarin speech corpus[DB/OL].(2019). https://www.data-baker.com/.
[30]PARK K,KIM J. g2pE:a simple Python module for English grapheme-to-phoneme conversion[DB/OL].(2019). https://github.com/Kyubyong/g2p.
[31]PARK K. g2pC:a grapheme-to-phoneme converter for Chinese[DB/OL].(2019). https://github.com/kyubyong/g2pc.
[32]LI X,CHENG Z Q,HE J Y,et al. MM-TTS:a unified framework for multimodal,prompt-induced emotional Text-to-Speech synthesis[J]. arXiv Preprint arXiv:2404.18398,2024.
[33]GUAN W,SU Q,ZHOU H,et al. Reflow-TTS:a rectified flow model for high-fidelity Text-to-Speech[C]//ICASSP 2024-2024 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP). Seoul,Republic of Korea:IEEE,2024:10501-10505.
[34]RAITIO T,LI J,SESHADRI S. Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS[C]//ICASSP 2022-2022 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP). Singapore:IEEE,2022:7587-7591.
[35]KONG J,KIM J,BAE J. HiFi-GAN:generative adversarial networks for efficient and high fidelity speech synthesis[J]. Advances in neural information processing systems,2020,33:17022-17033.
[36]KIM J,KIM S,KONG J,et al. Glow-TTS:a generative flow for Text-to-Speech via monotonic alignment search[C]//Advances in Neural Information Processing Systems. Online:Curran Associates,Inc.,2020,33.
[37]VASWANI A,SHAZEER N M,PARMAR N,et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. Long Beach,CA,USA:Curran Associates,Inc.,2017,30:5998-6008.
[38]CHEN Z,WU G,GAO H,et al. Local aggregation and global attention network for hyperspectral image classification with spectral-induced aligned superpixel segmentation[J]. Expert systems with applications,2023,232:120828.
[39]LI W,HOU Z,ZHOU J,et al. SiamBAG:band attention grouping-based siamese object tracking network for hyperspectral videos[J]. IEEE transactions on geoscience and remote sensing,2023,61:1-12.