Fusing prosodic and acoustic information for speaker recognition

Doctoral thesis by Mireia Farrús Cabecerán

Automatic speaker recognition is the use of a machine to identify an individual from a spoken sentence. Recently, this technology has undergone increasing use in applications such as access control, transaction authentication, law enforcement, forensics, and system customisation, among others. One of the central questions addressed by this field is what it is in the speech signal that conveys speaker identity. Traditionally, automatic speaker recognition systems have relied mostly on short-term features related to the spectrum of the voice. However, human speaker recognition relies on other sources of information as well; therefore, there is reason to believe that these sources can also play an important role in the automatic speaker recognition task, adding complementary knowledge to traditional spectrum-based recognition systems and thus improving their accuracy.

The main objective of this thesis is to add prosodic information to a traditional spectral system in order to improve its performance. To this end, several characteristics related to human speech prosody, which is conveyed through intonation, rhythm and stress, are selected and combined with the existing spectral features. Furthermore, this thesis also focuses on the use of additional acoustic features, namely jitter and shimmer, to improve the performance of the proposed spectral-prosodic verification system. Both features are related to the shape and dimension of the vocal tract, and they have been widely used to detect voice pathologies.

Since almost all the above-mentioned applications can be used in a multimodal environment, this thesis also aims to combine the voice features used in the speaker recognition system with another biometric identifier, the face, in order to improve the global performance. To this end, several normalisation and fusion techniques are used, and the final fusion results are improved by applying different fusion strategies based on sequences of several steps. Furthermore, multimodal fusion is also improved by applying histogram equalisation to the unimodal score distributions as a normalisation technique.

On the other hand, it is well known that humans are able to identify others from voice even when their voices are disguised. The question arises as to how vulnerable automatic speaker recognition systems are to different voice disguises, such as human imitation or artificial voice conversion, which are potential threats to security systems that rely on automatic speaker recognition. The thesis concludes with an analysis of the robustness of such systems against human voice imitations and synthetic converted voices, and of the influence of foreign accents and dialects, as a sort of imitation, on auditory speaker recognition.
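The abstract mentions jitter and shimmer only at a high level. As an illustration of what these measures capture, the minimal sketch below computes the standard relative jitter and shimmer values from sequences of glottal-cycle periods and peak amplitudes; it is not the feature extractor used in the thesis, and the function names and example values are assumptions.

```python
import numpy as np

def relative_jitter(periods):
    """Relative jitter: mean absolute difference between consecutive
    glottal-cycle periods, normalised by the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def relative_shimmer(amplitudes):
    """Relative shimmer: mean absolute difference between consecutive
    cycle peak amplitudes, normalised by the mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Hypothetical period (seconds) and amplitude values for one voiced segment
periods = [0.0102, 0.0100, 0.0103, 0.0101, 0.0099]
amplitudes = [0.81, 0.79, 0.83, 0.80, 0.78]
print(f"jitter  = {relative_jitter(periods):.4f}")
print(f"shimmer = {relative_shimmer(amplitudes):.4f}")
```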

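The multimodal part of the abstract refers to score normalisation, histogram equalisation of the unimodal score distributions, and fusion of voice and face scores. The sketch below is only a hedged illustration of these ideas, not the procedure used in the thesis: each hypothetical score stream is mapped through the empirical CDF of a development-set sample (which roughly equalises its distribution onto [0, 1]) and the two streams are then fused with a weighted sum; the weight and all score values are assumptions.

```python
import numpy as np

def equalise_scores(scores, dev_scores):
    """Map scores through the empirical CDF of a development-set sample,
    so both modalities end up on a comparable [0, 1] scale."""
    dev_sorted = np.sort(np.asarray(dev_scores, dtype=float))
    ranks = np.searchsorted(dev_sorted, scores, side="right")
    return ranks / len(dev_sorted)

def weighted_sum_fusion(voice_scores, face_scores, voice_weight=0.6):
    """Score-level fusion: weighted sum of the two normalised streams."""
    voice_scores = np.asarray(voice_scores, dtype=float)
    face_scores = np.asarray(face_scores, dtype=float)
    return voice_weight * voice_scores + (1.0 - voice_weight) * face_scores

# Hypothetical development and test scores for each modality
voice_dev, face_dev = np.random.randn(1000) * 2.0, np.random.rand(1000)
voice_test, face_test = [1.3, -0.4, 2.1], [0.72, 0.35, 0.90]

fused = weighted_sum_fusion(equalise_scores(voice_test, voice_dev),
                            equalise_scores(face_test, face_dev))
print(fused)
```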
 

Academic data of the doctoral thesis "Fusing prosodic and acoustic information for speaker recognition"

  • Thesis title:  Fusing prosodic and acoustic information for speaker recognition
  • Author:  Mireia Farrús Cabecerán
  • University:  Universitat Politècnica de Catalunya
  • Thesis defence date:  29/10/2008

 

Supervision and examination committee

  • Thesis supervisor
    • Francisco Javier Hernando Pericas
  • Examination committee
    • Committee chair: José Bernardo Mariño Acebal
    • Elisabeth Zetterholm (member)
    • Javier Rodríguez Saeta (member)
    • Ramón Cerdà Massó (member)

 
