Indexed in:
Abstract:
Variations of speech content increase the difficulty of speaker verification. In this paper, to alleviate the negative effect of the variations, phoneme-unit-specific time-delay neural network (PUSTDNN) is proposed and applied to the state-of-the-art x-vector system. It models each phoneme unit with an individual time-delay neural network (TDNN). That is to say, each TDNN mainly deals with a phoneme unit. Compared with handling all phoneme units together, when handling a phoneme unit, a TDNN can extract more discriminative speaker information, thus improving the system performance. Two realizations of the PUSTDNN are proposed. The first one can retain speech temporal information. The second one further combines all the TDNNs in a PUSTDNN into a larger TDNN to reduce computational complexity. To avoid model overfitting, the phoneme units are obtained by clustering phonemes based on the phonetic knowledge and phonetic sparsity degree. The PUSTDNN is also compared with two other techniques, i.e., phonetic vector and multitask. Experiments on the Fisher, NIST SRE10, and VoxCeleb datasets show that the phonetic vector technique is most robust to the phoneme unit recognition accuracy. When the accuracy is high enough, the multitask performs better than the phonetic vector, and the PUSTDNN performs best and can achieve over 10% relative improvement compared with the x-vector baseline. © 2014 IEEE.
Keywords:
Corresponding author information:
Email address:
Source:
IEEE/ACM Transactions on Audio, Speech, and Language Processing
ISSN: 2329-9290
Year: 2021
Volume: 29
Pages: 1243-1255
5.400 (JCR@2022)
ESI subject: ENGINEERING;
ESI highly cited threshold: 87
JCR quartile: 1
Affiliated department: