SCALABLE ALGORITHMS AND SOFTWARE FOR GENOME ASSEMBLY AND HYBRID SCAFFOLDING

Oieswarya Bhowmik

The increasing accuracy and accessibility of long read sequencing technologies, such as PacBio HiFi, have revolutionized genome assembly. However, assembling complex genomes with high contiguity and minimal computational overhead remains a significant challenge. Current state-of-the-art tools struggle with scalability as genome size increases, often requiring weeks to months for mammalian-sized genomes while consuming hundreds of gigabytes of memory. Moreover, existing approaches leave substantial gaps in genome completeness—only 1.5-6.5% of assemblies in public repositories achieve reference quality, creating a backlog of over 2.5 million fragmented draft genomes. This dissertation addresses these challenges through two complementary graph-theoretic approaches: a parallel hybrid scaffolding workflow that upgrades existing draft assemblies using long reads, and a scalable long read assembly framework based on explicit vertex reordering. Our first contribution, Maptcha, is a novel parallel workflow for hybrid genome scaffolding that combines pre-constructed assemblies with newly sequenced long reads. Maptcha employs alignment-free sketch-based mapping to rapidly identify long reads spanning multiple contigs, then constructs a ⟨contig,contig⟩ graph capturing these linkages. We introduce a novel graph-theoretic "wiring" heuristic that iteratively connects contigs by selecting the most credible links based on long read support while avoiding questionable connections in repetitive regions. A batching design partitions the scaffolding problem into independent parallel subproblems, enabling efficient distributed execution while bounding memory consumption. Our second contribution, Tile-X, addresses scalability of de novo long read assembly by treating read ordering as an algorithmic problem beforehand the genome assembly starts. Rather than implicitly determining read order during graph traversal, Tile-X explicitly computes read orderings using graph-theoretic vertex reordering techniques—including Reverse Cuthill-McKee (Tile-RCM), Metis-based partitioning (Tile-Metis), and community detection (Tile-Grappolo), to group reads from similar genomic regions. We also introduce Tile-Far, a novel sparsification heuristic inspired by minimum tiling paths that selects a minimal informative subset of reads covering the genome with reduced redundancy. The computed ordering enables partitioned parallel assembly where consecutive batches of reads are independently assembled with controlled overlap, dramatically reducing per-batch memory requirements while maintaining assembly continuity. We present extensive experimental evaluation of both frameworks across diverse genomes ranging from bacteria to human. Maptcha consistently produces high contiguos scaffolds with values exceeding input contigs by over three orders of magnitude while outperforming state-of-the-art hybrid scaffolders in both assembly quality and runtime performance, reducing time-to-solution from hours to minutes while maintaining memory usage under 20 GB. Tile-X demonstrates that explicit vertex reordering and sparsification maintain assembly quality comparable to state-of-the-art long read assemblers while improving scalability and reducing computational requirements. Both frameworks are available as open-source software, providing practical and efficient solutions for generating high-quality assemblies of large and complexgenomes.

SCALABLE ALGORITHMS AND SOFTWARE FOR GENOME ASSEMBLY AND HYBRID SCAFFOLDING

Files and links (1)

Abstract

Metrics

Details