Understanding and Detecting Cross-Language Defects

Haoran Yang

Modern software increasingly integrates multiple programming languages within a single system to combine complementary strengths (e.g., the efficiency of C with the programmability of Python). This multilingual paradigm underpins platforms such as Android and machine-learning frameworks like PyTorch, but it also introduces nontrivial security and reliability risks. Heterogeneous language semantics, foreign-function interfaces, and cross-boundary data transformations complicate reasoning about control and data flows, raising the likelihood of defects that are difficult to detect and diagnose. Traditional techniques, namely program analysis and fuzzing, were largely devised for single-language settings and struggle to scale across language boundaries. Static analyses often lose precision or soundness when flows traverse heterogeneous runtimes, while greybox fuzzing suffers from incomplete and misleading coverage when units written in other languages are treated as opaque. Moreover, the runtime ecosystems of multilingual software are intricate (and frequently multi-language themselves), which further exacerbates these limitations and underscores the need for analysis methods that explicitly model cross-language interactions. This dissertation addresses these challenges through a comprehensive program of empirical study and tool design. First, it establishes a practitioner-grounded problem space via a large-scale analysis of Stack Overflow discussions on multilingual development, revealing recurring issues in interfacing, data representation, and tooling. Next, it constructs empirical foundations by systematically characterizing cross-language bugs in real-world Python–C and Java–C projects and by conducting a focused study of native (C/C++) bugs within Python applications, yielding curated datasets and taxonomies of symptoms, root causes, and fixes.
Building on these insights, the dissertation introduces two complementary techniques: xLoc, a deep-learning approach for detecting and localizing bugs near cross-language boundaries using control-flow–aware encodings; and PolyFlow, a neuro-symbolic static information-flow framework that combines traditional taint analysis with LLM-guided semantic reasoning to track data flows across heterogeneous languages. Together with released benchmarks and artifacts, these contributions advance the empirical and methodological foundations for understanding and detecting cross-language defects in modern multilingual software.
