INFERENCE, AGGREGATION, AND EMBEDDING FOR LEARNING PROBLEMS ON NETWORK AND TIME SERIES DATA

Md Mahedi Hasan

doi:10.7273/000007907

Modern statistical learning methods have transformed many scientific fields and offer domain experts a wide variety of tools for analyzing their data. However, these methods often require large amounts of data to function, both empirically, in terms of generating results that can be validated against existing examples, and theoretically, in terms of being able to apply asymptotic statistical theorems with confidence. A common challenge is determining how best to approach applications where the availability and fidelity of data are limited compared to the idealized case. This dissertation brings an applied statistical perspective to bear on several of these situations, particularly where the underlying data gathering or generating process is understood but limited. Specific examples include the analysis of social network data in settings like classrooms, where the sizes of individual networks are limited, and geophysical time series data, where physics-informed synthetic datasets are often used to train classification models that are then evaluated on empirical measurements.
The first analysis presented here focuses on the issue of statistical conclusions derived from aggregating social network data. Aggregation is a natural approach, particularly in the education literature, where the networks are defined by individual classrooms. Each class tends to be relatively small, meaning that it may not be possible to draw strong conclusions from any single classroom. A similar issue exists for training deep learning models, including graph convolutional neural networks (GCNNs), on these data, as they require a large amount of training data to succeed. One approach to mitigate this concern is to make measurements across several classrooms, but this requires additional distributional assumptions that may not be satisfied.
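As a minimal sketch of why pooling across classrooms can mislead, the following uses made-up counts for two hypothetical classrooms and two hypothetical demographic groups (A and B). None of these numbers, names, or the outcome itself come from the dissertation; they are chosen only so that the classrooms differ in both base rate and group composition, which is enough to produce a Simpson's-paradox-style reversal.

```python
# Hypothetical counts, not data from the dissertation:
# (students with the outcome, students in the group) per classroom.
classrooms = {
    "classroom_1": {"A": (4, 5), "B": (14, 20)},
    "classroom_2": {"A": (6, 20), "B": (1, 5)},
}


def rate(successes, total):
    """Proportion of students in a group who have the outcome."""
    return successes / total


def within_rates(data):
    """Outcome rate for each group, computed separately in each classroom."""
    return {
        room: {group: rate(*counts) for group, counts in groups.items()}
        for room, groups in data.items()
    }


def pooled_rates(data):
    """Outcome rate for each group after summing counts over all classrooms."""
    totals = {}
    for groups in data.values():
        for group, (successes, total) in groups.items():
            prev_s, prev_n = totals.get(group, (0, 0))
            totals[group] = (prev_s + successes, prev_n + total)
    return {group: rate(s, n) for group, (s, n) in totals.items()}


per_classroom = within_rates(classrooms)
pooled = pooled_rates(classrooms)

for room, rates in per_classroom.items():
    print(room, rates)
print("pooled", pooled)
```

With these counts, group A has the higher rate inside each classroom (4/5 against 14/20, and 6/20 against 1/5), yet the lower rate once the counts are pooled (10/25 against 15/25). The reversal arises because most of group A sits in the classroom with the low overall base rate.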
Given a collection of networks representing the friendships and interactions between children in different classrooms, we study the impact of different approaches to aggregating the tabular data associated with the nodes, observing that the base rate of demographic statistics across classrooms has a potentially significant impact on the final conclusions derived from these approaches. Motivated by a recently published tutorial and related generative models for this type of classroom data, we show that aggregating data across several classrooms may lead to Simpson’s paradox-like conclusions, related to base rates of incidence of the target quantity in each classroom. We extend this analysis by considering the problem of node-level classification and prediction using a variety of statistical and deep learning methods, including incorporating node-level embeddings as statistical covariates.
The next analysis considers an increasingly common setting where a large collection of physics-informed synthetic data is used to train a learning algorithm that must then perform classification on much sparser empirical data. Here the issue is that the generated data may not be a perfect fit to their real-world counterpart, leading to problems of generalizability. Motivated by an application to distinguishing earthquakes from explosions using geophysical measurements, this chapter explores the use of modern deep learning techniques for this problem and presents state-of-the-art performance on several examples. Key contributions include implementing effective hybrid models, incorporating data augmentation and domain adaptation, and constructing statistical models of the low-dimensional representations of the data to provide recommendations about training approaches. We also implement methods for improving the interpretability of the results on this data, applying Grad-CAM to spectrograms representing the underlying data.
The final chapter brings together both of these topics, considering time series of network data. Motivated by a recent paper using the Random Dot Product Graph model to detect change points in network data, we consider a novel stress function that directly incorporates the community structure of the underlying generative process. This allows us to formulate a new model for change point detection, as well as to better understand the resulting node embedding technique. We also return to the example networks from the first analysis, exploring how this generative model allows for more effective expressions of the classroom network properties.
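For readers unfamiliar with the Random Dot Product Graph model, the following is a minimal sketch of its generative step only: each node has a latent position, and an edge between two nodes appears with probability equal to the dot product of their positions. The two latent positions, the community sizes, and the helper names are all assumptions made for illustration. The sketch does not implement the dissertation's stress function or its change point method.

```python
import random


def edge_probabilities(latent):
    """Matrix of dot products between latent positions (RDPG edge probabilities)."""
    n = len(latent)
    return [
        [sum(a * b for a, b in zip(latent[i], latent[j])) for j in range(n)]
        for i in range(n)
    ]


def sample_graph(probs, rng):
    """Sample an undirected graph without self-loops from edge probabilities."""
    n = len(probs)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < probs[i][j]:
                adj[i][j] = adj[j][i] = 1
    return adj


def density(adj, rows, cols):
    """Fraction of distinct node pairs (one from rows, one from cols) joined by an edge."""
    pairs = [(i, j) for i in rows for j in cols if i != j]
    return sum(adj[i][j] for i, j in pairs) / len(pairs)


rng = random.Random(0)

# Assumed setup: two communities of 15 nodes, each sharing one latent position.
positions = {0: (0.8, 0.2), 1: (0.2, 0.8)}
labels = [0] * 15 + [1] * 15
latent = [positions[c] for c in labels]

probs = edge_probabilities(latent)
adj = sample_graph(probs, rng)

group0 = [i for i, c in enumerate(labels) if c == 0]
group1 = [i for i, c in enumerate(labels) if c == 1]
within_density = density(adj, group0, group0)
between_density = density(adj, group0, group1)
print("within:", within_density, "between:", between_density)
```

Because nodes in the same community share a latent position, their dot product (about 0.68 here) exceeds the dot product across communities (about 0.32), so the sampled graph is denser within communities than between them. This is the sense in which community structure is built into the generative process.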
