
Research on Long Speech Accent Recognition Based on Deep Learning

Journal of Nanjing Normal University (Natural Science Edition) [ISSN:1001-4616/CN:32-1239/N]

Issue:
2022, No. 4
Page:
110-118
Research Field:
Computer Science and Technology
Info

Title:
Research on Long Speech Accent Recognition Based on Deep Learning
Author(s):
Zhu Danhao1, Wang Zhen2, Huang Xiaoyu3, Ma Zhuang4, Xu Jie4
(1.Department of Criminal Science and Technology,Jiangsu Police Institute,Nanjing 210031,China)
(2.Department of Cadre Training,Jiangsu Police Institute,Nanjing 210031,China)
(3.Department of Computer Information and Network Security,Jiangsu Police Institute,Nanjing 210031,China)
(4.Jiangsu Province Zhangjiagang Public Security Bureau,Suzhou 215600,China)
Keywords:
deep learning; accent recognition; long speech; Mandarin
PACS:
TP18; TN912.34
DOI:
10.3969/j.issn.1001-4616.2022.04.015
Abstract:
Mandarin accent recognition is an important technical tool for identifying judicial evidence. At present, Mandarin accent recognition is based mainly on traditional machine learning methods and is not specially designed for long speech, so recognition accuracy is low. To address these problems, this paper proposes a long-speech accent recognition method based on deep learning. The method first cuts the long speech into multiple sentence-level short segments, then extracts features from each segment with a pre-trained X-vectors model, then fuses the sentence features with several different methods, and finally uses AM-Softmax to maximize the margin between accent categories and perform classification. Experimental results on a real public-security accent recognition dataset show that the proposed method achieves a recognition accuracy of 94.1%, outperforming the non-deep-learning baseline by 21.6% and the X-vectors-based baseline by 2.1%, verifying the effectiveness of the proposed method and its ability to recognize accents in long speech.
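The fusion step of the pipeline described above (segment the long recording, embed each sentence, fuse the sentence embeddings into one utterance-level vector) can be sketched as follows. This is a minimal illustration, assuming the per-sentence x-vector embeddings have already been extracted by a pre-trained model; the fusion method names are illustrative, not the paper's exact variants:

```python
import numpy as np

def fuse_segment_embeddings(embeddings, method="mean"):
    """Fuse per-sentence embeddings into one utterance-level vector.

    embeddings: sequence of equal-length vectors, one per sentence segment.
    method: "mean" averages over segments; "stats" concatenates the
    per-dimension mean and standard deviation (a common pooling choice).
    """
    E = np.asarray(embeddings, dtype=float)   # shape: (num_segments, dim)
    if method == "mean":
        return E.mean(axis=0)                 # shape: (dim,)
    if method == "stats":
        return np.concatenate([E.mean(axis=0), E.std(axis=0)])  # (2*dim,)
    raise ValueError(f"unknown fusion method: {method}")
```

The fused vector is then what the final classification layer sees, so the choice of pooling directly controls how much segment-level variation survives into the accent decision.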

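The AM-Softmax objective used for the final classification step subtracts an additive margin m from the target-class cosine similarity before a scaled softmax, which pushes accent categories apart. A minimal NumPy sketch follows; the default values s=30.0 and m=0.35 are common choices from the AM-Softmax literature, not necessarily the paper's settings:

```python
import numpy as np

def am_softmax_loss(x, W, y, s=30.0, m=0.35):
    """AM-Softmax cross-entropy loss.

    x: feature matrix, shape (batch, dim)
    W: class-weight matrix, shape (dim, num_classes)
    y: integer target labels, shape (batch,)
    """
    # L2-normalize features and class weights so logits are cosine similarities
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    W = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = x @ W                                  # (batch, num_classes)
    # subtract the additive margin from the target-class cosine only
    cos[np.arange(len(y)), y] -= m
    logits = s * cos
    # numerically stable log-softmax followed by negative log-likelihood
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()
```

Because the margin lowers the target-class logit, the loss with m > 0 is strictly larger than plain normalized softmax on the same inputs, which is exactly the pressure that widens the inter-class margin during training.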
References:

[1]OUYANG G L,LI Z F. Problems and countermeasures of dialect recognition in criminal investigation[J]. Journal of Shanxi Police College,2017,25(1):51-54. (in Chinese)
[2]HOU J,LIU Y,ZHENG T F,et al. Multi-layered features with SVM for Chinese accent identification[C]//2010 International Conference on Audio,Language and Image Processing. Shanghai,2010:25-30.
[3]PANG C,WANG X L,ZHANG J,et al. GMM-based Mandarin accent recognition with multi-feature fusion[J]. Journal of Huazhong University of Science and Technology(Natural Science Edition),2015(S1):5. (in Chinese)
[4]YANG W,YANG J J. A study of automatic accent recognition based on phonological example characters[J]. Chinese Journal of Forensic Sciences,2021(2):5. (in Chinese)
[5]YANG S W,CHI P H,CHUANG Y S,et al. SUPERB:speech processing universal performance benchmark[DB/OL]. arXiv preprint arXiv:2105.01051. [2021-03-03]. https://doi.org/10.48550/arXiv.2105.01051
[6]BAI Z,ZHANG X L. Speaker recognition based on deep learning:an overview[J]. Neural networks,2021,140:65-99.
[7]SNYDER D,GARCIA-ROMERO D,SELL G,et al. X-vectors:robust dnn embeddings for speaker recognition[C]//2018 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP),Calgary,Canada:IEEE,2018:5329-5333.
[8]HAJIBABAEI M,DAI D. Unified hypersphere embedding for speaker recognition[DB/OL]. arXiv preprint arXiv:1807.08312. [2018-07-22]. https://doi.org/10.48550/arXiv.1807.08312
[9]WANG F,CHENG J,LIU W Y,et al. Additive margin softmax for face verification[J]. IEEE signal processing letters,2018,25(7):926-930.
[10]SHI X,YU F,LU Y,et al. The accented english speech recognition challenge 2020:Open datasets,tracks,baselines,results and methods[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP),Toronto,Canada:IEEE,2021:6918-6922.
[11]ZHANG Z,WANG Y,YANG J. Accent recognition with hybrid phonetic features[J]. Sensors,2021,21(18):6258.
[12]WANG W,ZHANG C,WU X. Deep discriminative feature learning for accent recognition[DB/OL]. arXiv preprint arXiv:2011.12461. [2020-11-25]. https://doi.org/10.48550/arXiv.2011.12461
[13]PENG Y,ZHANG J,ZHANG H,et al. Multilingual approach to joint speech and accent recognition with DNN-HMM Framework[C]//2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference(APSIPA ASC),Tokyo,Japan:IEEE,2021:1043-1048.
[14]DEHAK N,KENNY P,DEHAK R,et al. Front-end factor analysis for speaker verification[J]. IEEE transactions on audio,speech,and language processing,2011,19(4):788-798.
[15]SNYDER D,GARCIA R D,POVEY D,et al. Deep neural network embeddings for text-independent speaker verification[C]//Interspeech,Stockholm,Sweden,2017:999-1003.
[16]PEDDINTI V,POVEY D,KHUDANPUR S. A time delay neural network architecture for efficient modeling of long temporal contexts[C]//Sixteenth Annual Conference of the International Speech Communication Association,Dresden,Germany:2015.
[17]CHUNG J S,NAGRANI A,ZISSERMAN A. Voxceleb2:deep speaker recognition[DB/OL]. arXiv preprint arXiv:1806.05622. [2018-06-14]. https://doi.org/10.21437/Interspeech.2018-1929
[18]OKABE K,KOSHINAKA T,SHINODA K. Attentive statistics pooling for deep speaker embedding[DB/OL]. arXiv preprint arXiv:1803.10963. [2018-03-29]. https://doi.org/10.21437/Interspeech.2018-993
[19]JIAARO. pydub[EB/OL]. (2021-03-10)[2022-07-04]. https://github.com/jiaaro/pydub
[20]SPEECHBRAIN. Speaker verification with xvector embeddings on Voxceleb[EB/OL]. (2021-05-03)[2021-07-04]. https://huggingface.co/speechbrain/spkrec-xvect-voxceleb
[21]HOCHREITER S,SCHMIDHUBER J. Long short-term memory[J]. Neural computation,1997,9(8):1735-1780.
[22]ZAREMBA W,SUTSKEVER I,VINYALS O. Recurrent neural network regularization[DB/OL]. arXiv preprint arXiv:1409.2329. [2014-09-08]. https://arxiv.org/pdf/1409.2329.pdf
[23]GAO Q,WU H,SUN Y,et al. An end-to-end speech accent recognition method based on hybrid CTC/attention transformer ASR[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP),Toronto,Canada:IEEE,2021:7253-7257.
[24]SNYDER D,CHEN G,POVEY D. MUSAN:a music,speech,and noise corpus[DB/OL]. arXiv:1510.08484v1. [2015-10-28]. https://doi.org/10.48550/arXiv.1510.08484
[25]RAVANELLI M,PARCOLLET T,PLANTINGA P,et al. SpeechBrain:a general-purpose speech toolkit[DB/OL]. arXiv preprint arXiv:2106.04624. [2021-06-08]. https://doi.org/10.48550/arXiv.2106.04624

Last Update: 2022-12-15