Scalable parallel algorithms and software for large scale proteomics

Gaurav Ramesh Kulkarni

The problem of identifying the sequence that constitutes a peptide or a protein molecule is referred to as “peptide identification” and is of fundamental importance in proteomics research. The most popular approach to identifying peptides is database search, in which an experimental spectrum generated from fragments of an unknown peptide using mass spectrometry is compared with a database of known protein sequences. The goal is to identify the unknown peptide by comparing its spectrum against the database sequences. The exponential growth rates and overwhelming sizes of sequence databases make this an ideal application for parallel computing. However, the present generation of software tools is not expected to scale to the magnitudes of data that will be generated in the next few years, as they are either serial algorithms or parallel strategies that are inherently serial. In this thesis, we present an efficient parallel approach for peptide identification through database search. Three key factors distinguish our approach from existing solutions: i) (space) Given p processors and a database with N residues, we provide the first space-optimal parallel algorithm (O(N/p)); ii) (time) By masking communication with computation and by using MPI one-sided communication primitives, our algorithm minimizes parallel overhead and ensures efficient scaling of run-time; and iii) (quality) The performance gains achieved using parallel processing have allowed us to incorporate highly accurate, albeit compute-intensive, statistical models. We present the design and evaluation of two different algorithms that implement our approach. Experimental results using 2.65 million microbial proteins show linear scaling up to 128 processors, with parallel efficiency maintained at ~50%. As a concrete demonstration of a real-world application, we applied our approach to de novo identification of species from metagenomics data.
To the best of our knowledge, this is the first time mass spectral data have been used for metagenomics species identification. Collectively, we expect that the approaches presented in this thesis will be critical to meeting the data-intensive and qualitative demands stemming from this important application domain.

