Journal article
Efficient clustering of large EST data sets on parallel computers
Nucleic acids research, Vol.31(11), pp.2963-2974
06/01/2003
Handle: https://hdl.handle.net/2376/109004
PMCID: PMC156714
PMID: 12771222
Abstract
Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and identifying important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: (i) design of memory efficient algorithms to reduce the memory required to linear in the size of the input, (ii) a combination of algorithmic techniques to reduce the computational work without sacrificing the quality of clustering, and (iii) use of parallel processing to reduce run-time and facilitate clustering of larger data sets. Using a combination of these techniques, we report the clustering of 168 200 Arabidopsis ESTs in 15 min on an IBM xSeries cluster with 30 dual-processor nodes. We also clustered 327 632 rat ESTs in 47 min and 420 694 Triticum aestivum ESTs in 3 h and 15 min. We demonstrate the quality of our software using benchmark Arabidopsis EST data, and by comparing it with CAP3, a software widely used for EST assembly. Our software allows clustering of much larger EST data sets than is possible with current software. Because of its speed, it also facilitates multiple runs with different parameters, providing biologists a tool to better analyze EST sequence data. Using PaCE, we clustered EST data from 23 plant species and the results are available at the PlantGDB website.
Details
- Title
- Efficient clustering of large EST data sets on parallel computers
- Creators
- Anantharaman Kalyanaraman - Department of Computer Science, Iowa State University, Ames, IA 50011, USA
- Srinivas Aluru - Department of Computer Science, Iowa State University, Ames, IA 50011, USA
- Suresh Kothari - Department of Computer Science, Iowa State University, Ames, IA 50011, USA
- Volker Brendel - Department of Computer Science, Iowa State University, Ames, IA 50011, USA
- Publication Details
- Nucleic acids research, Vol.31(11), pp.2963-2974
- Academic Unit
- Electrical Engineering and Computer Science, School of
- Publisher
- Oxford University Press; Oxford, UK
- Identifiers
- 99900547349301842
- Language
- English
- Resource Type
- Journal article