Evaluation of the MapReduce paradigm for parallel suffix array construction

Meenakshi Rameshkumar

The suffix array of a string, which is the lexicographic ordering of all its suffixes, is one of the most powerful string indexing data structures with a broad range of applications in bioinformatics and other scientific computing domains that rely on string processing and text mining. The problem of constructing the suffix array of a string is akin to integer sorting, and there are three well-known serial linear time solutions for construction. However, the problem becomes difficult to parallelize because of inherent irregularities and data locality/movement requirements that originate during computation. Yet parallelism is needed to be able to scale to large data sizes. In this research project, we evaluate the MapReduce paradigm in the context of constructing suffix arrays in parallel. The MapReduce paradigm is emerging to be the de facto standard for data-intensive scientific applications. Due to the relative nascency of the model, multiple open source implementations are being developed and made available for use by the community. We aim to cross-evaluate a couple of these implementations, viz. the Hadoop and Phoenix MapReduce libraries, for the construction of suffix arrays in parallel. We base our algorithms on the linear-time serial DC3 algorithm given by Karkainnen and Sanders, 2003. More specifically, we present recursive, multi-stage MapReduce parallel algorithms for constructing suffix arrays under the Hadoop and Phoenix frameworks. The Hadoop implementation represents parallelization under a distributed memory, IO-bound setting, whereas the Phoenix implementation represents parallelization under a shared memory, multicore setting. We tested our implementations on different platforms using as input chromosomal sequences extracted from the human genome. While our experimental results suggest that the Phoenix shared memory model is better suited for parallelizing this problem, the studies presented offer insights into the different design level challenges and the different tradeoffs imposed by these two platforms (distributed and shared memory) for parallelizing this problem. We believe that this research project can also serve as a case-study for those trying to parallelize an irregular combinatorial problem using MapReduce or those trying to implement recursive, multi-stage MapReduce methods under shared and/or distributed memory environments.

Evaluation of the MapReduce paradigm for parallel suffix array construction

Files and links (1)

Abstract

Metrics

Details