Scalable and Efficient Distributed Frameworks for Influence Maximization

Reet Barik

doi:10.7273/000007206

Influence maximization is a fundamental operation among graph problems that involve simulating a stochastic diffusion process on real-world networks. Given a graph G(V, E), the objective is to identify a small set of key influential “seeds”—i.e., a fixed-size set of k nodes, which when influenced is likely to lead to the maximum number of nodes in the network getting influenced. The problem has numerous applications including (but not limited to) viral marketing in social networks, epidemic control in contact networks, and in finding influential proteins in molecular networks. Despite its importance, application of influence maximization at scale continues to pose significant challenges. The problem is NP-hard, but there exist efficient polynomial time approximations. The intuition behind the two main schools of such approaches can be distilled to either trying to answer the question “who influences whom?” or “who gets influenced by whom?”. The current state-of-the-art distributed algorithms available for these two paradigms of approximating influence maximization, namely greedy hill climbing and reverse influence sampling, involves direct parallel adaptations of sequential greedy approaches. Scaling these algorithms remains a daunting task due to the complexities associated with steps involving stochastic sampling and large-scale aggregations. This dissertation first highlights the uniqueness and dynamism of the influence maximization workload by contrasting the performance gains (or lack thereof) from a popular preprocessing step against those achieved in classical prototypical graph workloads. With this motivation in mind, the dissertation introduces two novel frameworks for approximating influence maximization. The first is aimed at optimizing the greedy hill climbing approach while the second targets the paradigm of reverse influence sampling. Both frameworks are built with the theme of concurrently processing sub-units of the overall workload and leveraging the property of submodularity that persists in the subsets representing these workloads. Experimental results show that these methods are able to increase problem size reach, reduce time to solution, maintain quality, and extend the scaling boundary compared to their state-of-the-art counterparts. We believe that the contributions from this disseration will be helpful in making influence maximization more accessible as an application promoting its adoption in large scale scientific pipelines. Additionally, parts of the work described in the disseration are also generic enough to be extended to a broader base of application settings that use submodular optimization. This has the potential to motivate and inspire research into such applications which have been hitherto impractical due to their high computational costs and/or memory footprint.

Scalable and Efficient Distributed Frameworks for Influence Maximization

Files and links (1)

Abstract

Metrics

Details