Automatic Speaker Characterization; Automatic Identification of Gender, Age, Language and Accent from Speech Signals (Automatische sprekercharacterisatie; Automatische identificatie van geslacht, leeftijd, taal en accent uit stemopnamen)

Book - Dissertation

Speech signals carry important information about a speaker, such as age, gender, language, accent and emotional/psychological state. Automatic recognition of speaker characteristics has a wide range of commercial, medical and forensic applications, such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. This research aims to develop accurate methods and tools to identify different physical characteristics of speakers. Because the databases required for other characteristics are lacking, our experiments cover gender recognition, age estimation, language recognition and accent/dialect identification. However, similar approaches and techniques can be applied to identify other characteristics such as emotional/psychological state.
For speaker characterization, we first convert variable-duration speech signals into fixed-dimensional vectors suitable for classification/regression algorithms. This is done by fitting a probability density function to acoustic features extracted from the speech signals. Since the distribution of acoustic features is complex, Gaussian mixture models (GMMs) are used to model it. A short utterance does not provide enough data to train a separate acoustic model. Therefore, parametric utterance adaptation methods are applied to adapt the universal background model (UBM) to the characteristics of each utterance. The parameters of each adapted GMM characterize the corresponding utterance. An effective approach is to adapt the UBM to the speech signal using the maximum a posteriori (MAP) scheme. Then, the Gaussian means of the adapted GMM are extracted and concatenated to form a Gaussian mean supervector for the given utterance. Finally, a classification or regression algorithm is used to identify the speaker characteristics.
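The supervector extraction described above can be illustrated with a minimal sketch. It assumes a diagonal-covariance UBM and the common relevance-factor form of MAP mean adaptation; the function names, the relevance value of 16 and the toy UBM are hypothetical choices for illustration and are not taken from the dissertation.

```python
import numpy as np

def log_gaussians_diag(X, means, variances):
    """Log density of every frame under every diagonal Gaussian, shape (T, C)."""
    diff = X[:, None, :] - means[None, :, :]                      # (T, C, D)
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)    # (C,)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)     # (T, C)
    return -0.5 * (log_norm[None, :] + mahal)

def map_mean_supervector(X, weights, means, variances, relevance=16.0):
    """MAP-adapt the UBM means to one utterance and stack them into a supervector.

    X is a (T, D) array of acoustic feature frames. Only the means are
    adapted; weights and variances are kept from the UBM (an assumption).
    """
    # Frame-level posterior probability of each mixture component.
    log_joint = np.log(weights)[None, :] + log_gaussians_diag(X, means, variances)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)              # (T, C)

    # Zeroth- and first-order sufficient statistics.
    n = post.sum(axis=0)                                  # (C,)
    f = post.T @ X                                        # (C, D)

    # Interpolate between the data mean and the UBM mean per component.
    adapted = (f + relevance * means) / (n + relevance)[:, None]
    return adapted.reshape(-1)                            # length C * D

# Toy usage: a hypothetical 2-component UBM in 2 dimensions.
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [10.0, 10.0]])
ubm_vars = np.ones((2, 2))
frames = np.full((50, 2), 1.0)
supervector = map_mean_supervector(frames, ubm_weights, ubm_means, ubm_vars)
print(supervector.shape)
```

Every utterance, whatever its duration, maps to a vector of the same length (number of components times feature dimension), which is what makes a standard classifier or regressor applicable afterwards.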
While effective, Gaussian mean supervectors are high-dimensional, which results in high computational cost and makes it difficult to obtain a robust model when data are limited. In the field of speaker recognition, recent advances using the i-vector framework have increased classification accuracy considerably. This framework, which provides a compact representation of an utterance in the form of a low-dimensional feature vector, applies a simple factor analysis to the GMM means. Motivated by this success, we apply the i-vector framework to the age estimation problem. In this approach, each utterance is modeled by its corresponding i-vector. Then, within-class covariance normalization (WCCN) is used for session variability compensation. Finally, least squares support vector regression (LSSVR) is applied to estimate the age of speakers. The proposed method is trained and tested on telephone conversations from the National Institute of Standards and Technology (NIST) 2010 and 2008 speaker recognition evaluation (SRE) databases. Evaluation results show that the proposed method yields a significantly lower mean absolute estimation error and a higher Pearson correlation coefficient between chronological and estimated speaker age than various conventional schemes. Finally, the effects of some major factors influencing the proposed age estimation system, namely utterance length and spoken language, are analyzed.
Our experiments on age estimation show that GMM weights carry important information about the speaker. However, state-of-the-art language/speaker recognition systems usually do not use this information. In this research, a non-negative factor analysis (NFA) approach is developed for GMM weight decomposition and adaptation. This modeling suggests a new low-dimensional utterance representation method, which uses a factor analysis similar to that of the i-vector framework.
The obtained subspace vectors are then applied in conjunction with i-vectors to the language/dialect recognition problem. The suggested approach is evaluated on the NIST 2011 and RATS language recognition evaluation (LRE) corpora and on the QCRI Arabic dialect recognition evaluation (DRE) corpus. The assessment results show that the proposed adaptation method yields more accurate recognition results than three conventional weight adaptation approaches, namely maximum likelihood re-estimation, non-negative matrix factorization, and a subspace multinomial model. Experimental results also show that intermediate-level fusion of i-vectors and NFA subspace vectors improves the performance of the state-of-the-art i-vector framework.
Motivated by the success of the NFA framework in language/dialect recognition, we introduce a hybrid architecture combining the NFA approach and the i-vector framework for the speaker age estimation problem. Evaluation on the NIST 2010 and 2008 SRE corpora shows that the proposed hybrid architecture improves the results of the i-vector framework considerably.
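The back end of the i-vector age estimation system described in the abstract (WCCN followed by LSSVR) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the dissertation's implementation: the class labels used for WCCN, the RBF kernel, the hyperparameter values and all function names are hypothetical, and the random vectors only stand in for real i-vectors. The NFA and hybrid components are not covered.

```python
import numpy as np

def wccn_projection(X, labels, ridge=1e-6):
    """Return B with B @ B.T equal to the inverse within-class covariance."""
    dim = X.shape[1]
    W = np.zeros((dim, dim))
    classes = np.unique(labels)
    for c in classes:
        Xc = X[labels == c]
        centered = Xc - Xc.mean(axis=0)
        W += centered.T @ centered / len(Xc)
    W = W / len(classes) + ridge * np.eye(dim)   # small ridge (assumption)
    W_inv = np.linalg.inv(W)
    W_inv = 0.5 * (W_inv + W_inv.T)              # enforce symmetry
    return np.linalg.cholesky(W_inv)             # lower-triangular B

def rbf_kernel(A, B, width):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * width ** 2))

def lssvr_fit(X, y, gamma=10.0, width=1.0):
    """Solve the LS-SVM regression linear system for the bias and dual weights."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, width) + np.eye(n) / gamma
    solution = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return solution[0], solution[1:]             # bias, alphas

def lssvr_predict(X_train, bias, alphas, X_new, width=1.0):
    """Predict targets for new points from the fitted dual weights."""
    return rbf_kernel(X_new, X_train, width) @ alphas + bias

# Toy usage on synthetic data standing in for i-vectors and speaker ages.
rng = np.random.default_rng(0)
ivectors = rng.normal(size=(80, 3))
class_ids = np.repeat(np.arange(4), 20)          # hypothetical WCCN classes
ages = rng.uniform(20.0, 70.0, size=80)

B = wccn_projection(ivectors, class_ids)
normalized = ivectors @ B                        # row-vector form of B.T @ x
bias, alphas = lssvr_fit(normalized, ages)
predicted = lssvr_predict(normalized, bias, alphas, normalized)

# The two metrics named in the abstract, here on the training data only.
mae = np.mean(np.abs(predicted - ages))
pearson = np.corrcoef(predicted, ages)[0, 1]
print(mae, pearson)
```

Unlike a standard support vector regressor, LSSVR uses equality constraints, so training reduces to solving a single linear system rather than a quadratic program.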
Publication year: 2014