Similarity measures for clustering sequences and sets of data

Doctoral thesis by Darío García García

The main objective of this PhD thesis is the definition of new similarity measures for data sequences, with the final purpose of clustering those sequences. Clustering consists in partitioning a dataset into isolated subsets or clusters. Data within a given cluster should be similar to each other and, at the same time, different from data in other clusters. The relevance of clustering data sequences is ever-increasing, due to the abundance of this kind of data (multimedia sequences, movement analysis, stock market evolution, etc.) and the usefulness of clustering as an unsupervised exploratory analysis method. It is this lack of supervision that makes similarity measures extremely important for clustering, since they are the only guide of the learning process.

The first part of the thesis focuses on the development of similarity measures leveraging dynamical models, which can capture relationships between the elements of a given sequence. Following this idea, two lines are explored:

– Likelihood-based measures: building on the popular framework of likelihood-matrix-based similarity measures, we present a novel method based on a re-interpretation of such a matrix. That interpretation stems from the assumption of a latent model space, so the models used to build the likelihood matrix are seen as samples from that space. The method is extremely flexible, since it allows for the use of any probabilistic model for representing the individual sequences.

– State-space-trajectory-based measures: we introduce a new way of defining affinities between sequences, addressing the main drawbacks of the likelihood-based methods. Working with state-space models makes it possible to identify sequences with the trajectories that they induce in the state space. This way, comparing sequences amounts to comparing the corresponding trajectories. Using a common hidden Markov model for all the sequences in the dataset makes those comparisons extremely simple, since trajectories can be identified with transition matrices. This new paradigm improves the scalability of the affinity measures with respect to the dataset size, as well as the performance of those measures when the sequences are short.

The second part of the thesis deals with the case where the dynamics of the sequences are discarded, so the sequences become sets of vectors or points. This step can be taken without harming the learning process when the static features (probability densities) of the different sets are informative enough for the task at hand, which is true for many real scenarios. Work along this line can be further subdivided into two areas:

– Sets-of-vectors clustering based on the support of the distributions in a feature space: we propose clustering the sets using a notion of similarity related to the intersection of the supports of their underlying distributions in a Hilbert space. Such a clustering can be efficiently carried out in a hierarchical fashion, in spite of the potentially infinite dimensionality of the feature space. To this end, we propose an algorithm based on simple geometrical arguments. Support estimation is inherently a simpler problem than density estimation, which is the usual starting step for obtaining similarities between probability distributions.

– Classifier-based affinity and divergence measures: it is quite natural to link the notion of similarity between sets with the separability between those sets. That separability can be quantified using binary classifiers. This intuitive idea is then extended via generalizations of the family of f-divergences, which contains many of the best-known divergences in statistics and machine learning. The proposed generalizations present interesting theoretical properties, and at the same time they have promising practical applications, such as the development of new estimators for standard divergences.
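As a rough illustration of the general likelihood-matrix framework mentioned in the first part of the abstract (not the novel method proposed in the thesis), the sketch below fits one model per sequence, evaluates every sequence under every model, and symmetrizes the result into a pairwise distance. Everything specific here is an assumption made for the example: the first-order Markov chain as the probabilistic model, the additive smoothing, the particular symmetrization, and all function names.

```python
import numpy as np

def fit_markov_chain(seq, n_symbols, alpha=1.0):
    """Fit a first-order Markov chain to one discrete sequence.

    Assumed model for illustration only; any probabilistic model could
    play this role. `alpha` is additive smoothing so that unseen
    transitions keep a non-zero probability.
    """
    counts = np.full((n_symbols, n_symbols), alpha)
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    trans = counts / counts.sum(axis=1, keepdims=True)
    init = np.full(n_symbols, 1.0 / n_symbols)  # uniform initial distribution
    return init, trans

def log_likelihood(seq, model):
    """Log-likelihood of a discrete sequence under a (init, trans) model."""
    init, trans = model
    ll = np.log(init[seq[0]])
    for a, b in zip(seq[:-1], seq[1:]):
        ll += np.log(trans[a, b])
    return ll

def likelihood_matrix(seqs, n_symbols):
    """L[i, j] = length-normalized log-likelihood of sequence j under model i."""
    models = [fit_markov_chain(s, n_symbols) for s in seqs]
    L = np.empty((len(seqs), len(seqs)))
    for i, model in enumerate(models):
        for j, s in enumerate(seqs):
            L[i, j] = log_likelihood(s, model) / len(s)
    return L

def symmetric_distance(L):
    """One common symmetrization (an assumption, not the thesis's measure):
    D[i, j] = 0.5 * (L[i, i] + L[j, j] - L[i, j] - L[j, i])."""
    d = np.diag(L)
    return 0.5 * (d[:, None] + d[None, :] - (L + L.T))

# Toy data: two alternating sequences and two "sticky" sequences.
seqs = [
    [0, 1] * 10,
    [1, 0] * 10,
    [0] * 10 + [1] * 10,
    [1] * 10 + [0] * 10,
]
L = likelihood_matrix(seqs, n_symbols=2)
D = symmetric_distance(L)
print(np.round(D, 3))
```

A distance matrix of this kind could then be passed to any clustering algorithm that accepts pairwise distances or affinities, such as hierarchical or spectral clustering. Note that every sequence must be evaluated under every model, so the cost grows quadratically with the number of sequences.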

 

Academic details of the doctoral thesis «Similarity measures for clustering sequences and sets of data»

  • Thesis title:  Similarity measures for clustering sequences and sets of data
  • Author:  Darío García García
  • University:  Carlos III de Madrid
  • Thesis defense date:  06/04/2011

 

Supervision and committee

  • Thesis supervisor
    • Fernando Díaz de María
  • Committee
    • Committee chair: Jesús Cid Sueiro
    • John Shawe-Taylor (member)
    • Sancho Salcedo Sanz (member)
    • Luis Ignacio Santamaría Caballero (member)

 
