< Terug naar vorige pagina

Project

Algorithms for Analyzing Biological Sequences (Algoritmen voor de analyse van biologische sequenties)

In 1995, scientists achieved the first complete genetic map of a free living organism, the bacterium Haemophilus influenzae. Since then, a hugeamount of genomic and proteomic data has been generated due to the developments in biological and computer sciences. This availability of data has led to a new challenge, namely, the analysis and interpretation of this data. These involve many tasks such as: the identification of genes,regulatory regions, and repetitive elements; phylogenetic analysis; identification of proteins and peptides expressed
in an organism, cell or tissue; and protein function prediction.

In this thesis, we investigate computational methods for analyzing nucleotide and amino acid sequences. We do it in the context of three biological problems: phylogenetic tree reconstruction, protein subfamily identification, and peptide identification using mass spectrometry data.

In phylogenetic tree reconstruction</>, one is interested in inferring the most likely tree topology that explains the evolutionary history of genes and organisms. We propose a method that, given a set of nucleotide or amino acid sequences, builds a phylogenetic tree in a top-down way. Our method is based on a conceptual clustering method that extends the well-known decisiontree learning approach. We start from a single cluster containing all sequences and repeatedly divide it into subclusters until all sequences form a different cluster. We assume that a split can be described by referring to particular polymorphic positions of the multiple sequence alignment and propose a heuristic to choose the best split at each iteration.This allows us to identify important mutations that might have given rise to different evolutionary lineages. The trees generated by our methodare therefore more informative than the ones generated by standard methods. Moreover, we show that our method is comparable to standard ones interms of accuracy of the produced phylogenetic tree.

In protein subfamily identification</>, given a set of sequences belonging to a protein family, the goal is to identify subgroups of functionally closely related proteins. This can be done by using evolutionary information. We propose a method that first uses the phylogenetic tree reconstructionmethod described in the previous paragraph, and then cuts the tree to extract clusters (subfamilies) of sequences. We show that the proposed method yields good results with advantages over those produced by a state-of-the-art method. More specifically, it identifies subfamily-specific positions that might be important for the function(s) associated to the subfamily, and it allows easy classification of new sequences into one ofthe identified subfamilies.

Finally, we consider the problem of inferring the peptide identification of mass spectrometry data</>. Inmass spectrometry, an unknown peptide undergoes fragmentation, and its fragment masses are registered in a fragmentation spectrum. Then, computational methods infer the peptide sequence from its spectrum. We proposea method that does this by searching for the peptide identification in the six-frame translation of the genome. Differently from other methods that allow a similar search, it does not use a filtering procedure to limit the computation of the scoring function to a small subset of sequences that are likely to obtain a high score. Instead, it performs an exhaustive scan of the genome translation. We show that our strategy can identify more peptides than a non-exhaustive search.

As an additionalcontribution of this thesis, we consider a problem purely computational, namely, the problem of estimating the certainty of a prediction given by a decision tree</>. We propose a method that uses a transductive procedure to tackle this task. We show that our method improves the certainty estimates given by standard decision trees, and is especially suitable for ranking estimation. The ideas we use in our method can be easilygeneralized to other machine learning techniques, as well as combined with existing methods for prediction certainty estimation.
Datum:24 nov 2008 →  10 jul 2013
Trefwoorden:prediction
Disciplines:Toegepaste wiskunde
Project type:PhD project