Recent empirical work has demonstrated that, although large social and information networks often exhibit meaningful "small-scale" structure (e.g., clustering structure around a single individual at the size scale of roughly 100 individuals), analogous "large-scale" structure (e.g., meaningful or statistically significant properties of tens or hundreds of thousands of individuals) either is lacking entirely or takes a form that is extremely difficult for traditional machine learning and data analysis tools to identify reliably. For example, there are often small clusters that provide a "bottleneck" to diffusions (e.g., the diffusion-based dynamic processes of interest in viral marketing applications and tipping-point models of network dynamics), but there are typically no large clusters with analogous bottlenecks. Thus, popular diffusion-based metrics, as well as the associated machine learning and data analysis tools, are simply much less meaningful (i.e., discriminative or useful) for analyzing networks at large size scales. At root, the issue is that most machine learning and data analysis tools implicitly assume that the data consist of some sort of global structure plus a bit of noise, whereas it is better to think of large social and information networks as consisting of local pockets of structure on top of a sparse, noisy, and globally quasi-random scaffolding.

To establish our main empirical observations, we used locally-biased spectral and flow-based methods to compute isoperimetric properties of networks at various size scales. This novel application of ideas from theoretical computer science to Internet data analysis required the development of new algorithmic tools, as well as a reinterpretation of the implicit statistical basis underlying traditional worst-case approximation algorithms. This empirical work will be briefly reviewed, and its implications for extracting insight from large networks (and large-scale data more generally) with popular machine learning and data analysis tools will be discussed in detail.
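The isoperimetric quantity behind these observations is the conductance of a cluster: the number of edges crossing the cluster boundary divided by the smaller of the two sides' edge volumes, so a low value indicates a diffusion "bottleneck." As a minimal illustrative sketch (not the talk's own code, and using the `networkx` library rather than the locally-biased methods described above), one can measure the conductance of a candidate cluster directly:

```python
# Illustrative sketch: conductance phi(S) = cut(S, V\S) / min(vol(S), vol(V\S))
# as a measure of how strong a diffusion bottleneck a cluster S forms.
# Uses networkx (an assumption for illustration, not the talk's tooling).
import networkx as nx

def cluster_conductance(G, nodes):
    """Conductance of the node set `nodes` in graph G; lower = better bottleneck."""
    return nx.conductance(G, set(nodes))

# Two 10-node cliques joined by a single edge: the "small-scale" cluster
# consisting of one clique is separated from the rest by only one edge,
# so its conductance is very low (1 crossing edge / volume 91).
G = nx.barbell_graph(10, 0)
small_cluster = range(10)  # one of the two cliques
phi = cluster_conductance(G, small_cluster)
print(phi)  # 1/91, roughly 0.011
```

Sweeping such a score over candidate clusters of increasing size is one way to see the size-scale effect the abstract describes: small sets with low conductance exist, while large sets with comparably low conductance typically do not.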
Michael Mahoney's research interests focus on theoretical and applied aspects of algorithms for large-scale data problems in scientific and Internet applications. Currently, he is working on geometric network analysis methods; developing approximate computation and regularization methods for large informatics graphs; and applications to community detection, clustering, learning, and information dynamics in large social and information networks. In the past, he has worked on the design and analysis of randomized algorithms for matrices, as well as applications of those methods in genetics and medical imaging. He has been a faculty member at Yale University and a researcher at Yahoo, and his PhD, from Yale University, was in computational statistical mechanics.