[1] Zhu Danhao, Wang Zhen, Huang Xiaoyu, et al. Research on Long Speech Accent Recognition Based on Deep Learning[J]. Journal of Nanjing Normal University (Natural Science Edition), 2022, 45(04): 110-118. [doi:10.3969/j.issn.1001-4616.2022.04.015]

Research on Long Speech Accent Recognition Based on Deep Learning

Journal of Nanjing Normal University (Natural Science Edition) [ISSN: 1001-4616 / CN: 32-1239/N]

Volume:
Vol. 45
Issue:
2022, No. 04
Pages:
110-118
Section:
Computer Science and Technology
Publication date:
2022-12-15

Article Info

Title:
Research on Long Speech Accent Recognition Based on Deep Learning
Article ID:
1001-4616(2022)04-0110-09
Author(s):
Zhu Danhao (1), Wang Zhen (2), Huang Xiaoyu (3), Ma Zhuang (4), Xu Jie (4)
(1. Department of Criminal Science and Technology, Jiangsu Police Institute, Nanjing 210031, China)
(2. Department of Cadre Training, Jiangsu Police Institute, Nanjing 210031, China)
(3. Department of Computer Information and Network Security, Jiangsu Police Institute, Nanjing 210031, China)
(4. Zhangjiagang Public Security Bureau, Suzhou, Jiangsu Province 215600, China)
Keywords:
deep learning; accent recognition; long speech; Mandarin
CLC number:
TP18; TN912.34
DOI:
10.3969/j.issn.1001-4616.2022.04.015
Document code:
A
Abstract:
Mandarin accent recognition is an important technique in forensic evidence identification. Existing Mandarin accent recognition methods are mainly built on traditional machine learning and are not specifically designed for long speech, so their recognition accuracy is limited. To address these problems, this paper proposes a deep-learning-based accent recognition method for long speech. The method first splits a long recording into multiple sentence-level short segments, then extracts features from each segment with a pre-trained X-vectors model, then fuses the sentence-level features using different methods, and finally applies AM-Softmax (additive margin softmax) to maximize the margin between accent classes and perform classification. Experiments on a real forensic accent recognition dataset show that the proposed method reaches a recognition accuracy of 94.1%, an improvement of 21.6% and 2.1% over the non-deep-learning baseline and the X-vectors-based baseline, respectively, which verifies the effectiveness of the method and its ability to recognize accents in long speech.
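The last two stages of the pipeline described in the abstract (fusing sentence-level features, then classifying with an additive margin softmax) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say which fusion methods were used, so plain mean pooling is assumed here. The names `mean_fuse` and `am_softmax_loss`, the scale `s=30.0`, the margin `m=0.35`, and the toy 2-dimensional vectors are all illustrative assumptions. Segment splitting and X-vectors extraction are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_fuse(sentence_embeddings):
    """Fuse sentence-level embeddings into one recording-level vector.

    Assumption: simple averaging; the paper compares several fusion
    methods that the abstract does not name.
    """
    n = len(sentence_embeddings)
    dim = len(sentence_embeddings[0])
    return [sum(e[d] for e in sentence_embeddings) / n for d in range(dim)]


def am_softmax_loss(embedding, class_weights, target, s=30.0, m=0.35):
    """Additive margin softmax loss for a single example.

    Logits are scaled cosine similarities between the normalized
    embedding and each normalized class weight vector; the margin m is
    subtracted from the target-class cosine only. s and m are
    hypothetical values chosen for illustration.
    """
    x = l2_normalize(embedding)
    cosines = [
        sum(a * b for a, b in zip(x, l2_normalize(w))) for w in class_weights
    ]
    logits = [
        s * (c - m) if j == target else s * c for j, c in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp over all classes.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_z - logits[target]


# Toy usage: three sentence-level embeddings from one long recording,
# and two accent classes with hand-picked weight vectors.
sentences = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]
fused = mean_fuse(sentences)
weights = [[1.0, 0.0], [0.0, 1.0]]
loss_if_class_0 = am_softmax_loss(fused, weights, target=0)
loss_if_class_1 = am_softmax_loss(fused, weights, target=1)
```

In this toy setup the fused vector points toward the first class weight, so the loss is much smaller when the label is class 0 than when it is class 1. Because the margin is subtracted only from the target-class cosine, an example must match its own class by more than `m` before the loss becomes small, which is how the method widens the gap between accent classes.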



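Memo:
Received: 2022-07-27.
Funding: National Natural Science Foundation of China (71974094); Jiangsu Provincial Social Science Foundation (19TQD002); Natural Science Project of the Jiangsu Provincial Department of Education (21KJB520004); Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
Corresponding author: Zhu Danhao, PhD, lecturer; research interests: deep learning, natural language processing. E-mail: zhudanhao@jspi.cn
Last Update: 2022-12-15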