Graph clustering MapReduce algorithm Shingling Algorithm Dense subgraph detection Protein domain family
Identifying close-knit communities (or “clusters”) in graphs is an advanced operation with a broad range of scientific applications. While theoretical formulations of this operation are either intractable or computationally prohibitive, practical algorithmic heuristics exist to efficiently tackle the problem. However, implementing these heuristics to work for large real world graphs still remains a significant challenge, owing to a combination of factors that include magnitude of the data, irregular data access patterns and computer-intensive operations to better the approximation. In this paper, we propose i) a novel MapReduce-based [2] algorithm for a well known serial graph clustering heuristic called Shingling [3]; and ii) a novel application of the method to cluster biological graphs built out of proteins and domains. Operating on an input graph that is simply represented as a list of edges, our algorithm uses a combination of shuffling and sorting operations, and pipelined MapReduce stages to implement the various phases of the algorithm. Preliminary results show linear scaling of the time-dominant phase up to 64 cores on a relatively small real world graph containing 8.41M vertices (8,407,839 proteins and 11,823 domains) and 11M edges (protein to domain connections). More importantly, MapReduce parallelization has allowed us to enhance the problem size reach by about two to three orders of magnitude (from 20K to 8M vertices) relative to our previous serial implementation, in roughly the same amount of time.
Metrics
190 File views/ downloads
254 Record Views
Details
Title
An efficient MapReduce algorighm for parallelizing large-scale graph clustering
Creators
Inna Rytsareva (Author)
Ananth Kalyanaraman (Author)
Conference
Washington State University Academic Showcase (Pullman, Washington, 03/30/2012)
Academic Unit
WSU Academic Showcase 2012
Grant note
Funding : This research was supported in parts by DOE award DE-SC-0006516 and NSF grant IIS 0916463. Washington State University, Pullman, WA
Identifiers
99900501588201842
Copyright
In copyright ; openAccess ; http://rightsstatements.org/vocab/InC/1.0/ ; http://purl.org/eprint/accessRights/OpenAccess