
Publication

Statistical methods for the analysis of high-throughput proteomic and genomic data

Book - Dissertation

In this dissertation, we proposed statistical methods for datasets from proteomics and genomics workflows. Over the past decade, mass spectrometry (MS)-based proteomics has emerged as a high-throughput method for identifying and quantifying proteins in complex samples. High-resolution MS data contain a large amount of noisy, redundant, and irrelevant information. Only part of the data carries biologically meaningful signal, i.e., peptides and small proteins, which makes it difficult to separate peptide/protein peaks from peaks generated by noise. To overcome this obstacle, prior information about the physical properties of the peptide/protein, i.e., its isotopic distribution, is needed. In addition, a similarity measure is required to distinguish peptide peak clusters from noise peak clusters. In Chapter 4, we considered Pearson's χ² statistic and the Mahalanobis distance for this purpose. We evaluated the performance of the two similarity measures on a designed MALDI-TOF experiment. The results, which could extend to any high-resolution mass spectrum, indicated that Pearson's χ² statistic offered better discriminative power for detecting putative peptide clusters than the Mahalanobis distance.

Protein identification is a key step in proteomics, and shotgun proteomics is recognized as one of the main techniques for protein identification and quantification. In a standard computational pipeline, MS/MS spectra from a mass spectrometer are interpreted with database search engines or de novo sequencing approaches. In database search algorithms, fragment ions derived from the unidentified protein are compared with theoretical data, and a score is assigned according to how well the two sets of data match. The top score is expected to identify the unknown protein. The limiting factor in all database search tools is the tradeoff between false positives and false negatives.
It is essential to keep false positives to a minimum during protein identification. In principle, peptide identification based on tandem MS and database search algorithms does not take into account information about the isotope distributions of the precursor ions. To determine how well these search algorithms distinguish between correct and incorrect peptide assignments, in Chapter 5 we proposed an additional metric that uses Pearson's χ² statistic to quantify the similarity between the theoretical isotopic distributions of the precursor ions selected for tandem MS and the experimental mass spectra. The observed association between Pearson's χ² statistic and the score function indicated that good scores can be obtained for molecules that exhibit atypical isotope profiles, while low scores can be obtained for fragment spectra that have a clear peptide-like isotope pattern. These results demonstrated that Pearson's χ² statistic can be used in conjunction with the score of database search algorithms to increase the sensitivity and specificity of peptide identification.

Many search engines are available for the analysis of proteomics data produced by MS/MS. These search algorithms vary in accuracy, sensitivity, and specificity because their underlying scoring mechanisms rest on different principles. It is therefore of interest to measure the degree of agreement between different search engines in terms of peptide identification. For instance, how likely is it that a peptide identification obtained from SEQUEST is also reported by MASCOT? In Chapter 6, we proposed Cohen's kappa coefficient (chance-corrected agreement) to determine the level of agreement between MASCOT and SEQUEST. The results suggested that there is, in general, good agreement between the peptide assignments of the two search engines.
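As an illustration of the chance-corrected agreement described above, the following minimal sketch computes Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from the marginal label frequencies. The function name `cohens_kappa` and the two label vectors are hypothetical and are not data from the dissertation; each entry stands for one spectrum, with 1 meaning the engine accepted the peptide assignment and 0 meaning it did not.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)

    # Observed agreement: fraction of items on which both raters agree.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal frequencies, summed over categories.
    p_expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )

    if p_expected == 1.0:
        # Both raters used a single identical category; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical accept/reject calls for ten spectra (illustrative only).
mascot_calls = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
sequest_calls = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(mascot_calls, sequest_calls)
```

In this toy example the engines agree on 8 of 10 spectra (p_o = 0.8), while the marginals give p_e = 0.52, so κ is about 0.58. Raw percentage agreement alone would overstate the concordance, because part of it is expected by chance.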
The advent of high-throughput sequencing methods, such as next-generation sequencing (NGS), has greatly accelerated biological and medical research and discovery. NGS provides an effective way to identify, on a large scale, polymorphic DNA loci that serve as molecular markers for locating the gene loci responsible for a trait of interest. In Chapters 9, 10, and 11, we introduced different variants and generalizations of the basic hidden Markov model (HMM) proposed in [109], which was used to map various quantitative trait loci (QTLs) responsible for high ethanol-tolerance in S. cerevisiae with NGS.

One possible extension of the Markovian model underlying the basic HMM concerns the direction of modelling. Both the preceding state of the (i − 1)-th SNP and the following state of the (i + 1)-th SNP carry useful information about the current i-th SNP. Uni-directional HMMs ignore this influence, hence the motivation for applying the DHMM in Chapter 10. The comparison of the uni-directional HMM and the DHMM for chromosome XIV revealed only a slight difference in the parameter estimates, with a minimal gain in precision of estimation for the DHMM. As a result, the DHMM and the uni-directional HMMs assigned the SNPs to the same states. The main advantage of the DHMM is that it produces a single set of estimates of the parameters of interest, i.e., the emission (concordance) probabilities.

In Chapter 10, we proposed the non-homogeneous HMM (NH-HMM). The advantage of the NH-HMM is that it allows the transition probabilities of the basic HMM to vary with distance by exploiting covariate information. Our model assumed that the distance between neighboring SNPs can influence the state assigned to each SNP. The NH-HMM was able to detect gene loci responsible for high ethanol-tolerance in S. cerevisiae. In Chapter 11, we considered a joint HMM for two pools of segregants analysed at the same time.
The motivation was that significant differences in the state-dependent probabilities between the two pools might point to genomic regions containing the gene loci of interest. The joint HMM was able to detect potential genomic regions for high ethanol-tolerance on chromosome XIV. However, the same approach did not work properly for chromosome IX.
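The idea behind the non-homogeneous HMM, transition probabilities that depend on the distance between neighboring SNPs, can be sketched as follows. This is an illustrative toy under stated assumptions, not the model fitted in the dissertation: it uses two hidden states, reduces each SNP to a single binary observation (1 = concordant call), and lets the probability of switching state grow with distance through a hypothetical exponential form with a hypothetical `scale_bp` parameter. The function names and all numbers are invented for the example.

```python
import math


def transition_matrix(distance_bp, scale_bp=10_000.0):
    """Hypothetical distance-dependent transitions for a two-state chain.

    The switch probability rises from near 0 (adjacent SNPs) toward 0.5
    (SNPs so far apart that the previous state is uninformative).
    """
    if distance_bp <= 0:
        raise ValueError("SNP positions must be strictly increasing")
    p_switch = 0.5 * (1.0 - math.exp(-distance_bp / scale_bp))
    return [[1.0 - p_switch, p_switch], [p_switch, 1.0 - p_switch]]


def viterbi(observations, positions, emission_probs, scale_bp=10_000.0):
    """Most likely state path for binary observations at given SNP positions.

    emission_probs[s] is the assumed probability of observing 1 in state s
    and must lie strictly between 0 and 1.
    """
    def log_emit(state, obs):
        p = emission_probs[state]
        return math.log(p if obs == 1 else 1.0 - p)

    # Uniform initial distribution over the two states (an assumption).
    score = [math.log(0.5) + log_emit(s, observations[0]) for s in range(2)]
    back_pointers = []
    for i in range(1, len(observations)):
        # The transition matrix is recomputed for every gap between SNPs.
        trans = transition_matrix(positions[i] - positions[i - 1], scale_bp)
        new_score, pointers = [], []
        for s in range(2):
            best_prev = max(range(2), key=lambda r: score[r] + math.log(trans[r][s]))
            pointers.append(best_prev)
            new_score.append(
                score[best_prev]
                + math.log(trans[best_prev][s])
                + log_emit(s, observations[i])
            )
        score = new_score
        back_pointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(2), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical usage: state 1 is "linked to the trait" (high concordance).
positions = [100 * k for k in range(1, 9)]
linked_path = viterbi([1] * 8, positions, emission_probs=(0.5, 0.95))
```

A homogeneous HMM would use one fixed transition matrix for every pair of neighboring SNPs; here closely spaced SNPs are strongly encouraged to share a state, while widely separated SNPs are treated as nearly independent.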
Number of pages: 162
Year of publication: 2014
Accessibility: Open