Authors: David Borland, Wenyuan Wang, Jonathan Zhang, Joshua Shrestha, David Gotz
Abstract: The collection of large, complex datasets has become commonplace across a wide variety of domains. Visual analytics tools increasingly play a key role in exploring and answering complex questions about these large datasets. However, many visualizations are not designed to concurrently visualize the large number of dimensions present in complex datasets (e.g., tens of thousands of distinct codes in an electronic health record system). This fact, combined with the ability of many visual analytics systems to enable rapid, ad-hoc specification of groups, or cohorts, of individuals based on a small subset of visualized dimensions, leads to the possibility of introducing selection bias: when a given cohort is created based on a specified set of dimensions, differences across many other unseen dimensions may also be introduced. These unintended side effects may result in the cohort no longer being representative of the larger population intended to be studied, and can negatively affect the validity of any subsequent analysis. We present techniques for selection bias tracking and visualization that can be incorporated into high-dimensional exploratory visual analytics systems. These techniques include: (1) tree-based cohort provenance and visualization, with a user-specified baseline cohort that all other cohorts are compared against, and visual encoding of the "drift" for each cohort, indicating where selection bias may have occurred, and (2) two novel visualization approaches to compare in detail the per-dimension differences between the baseline and a user-specified focus cohort, based on existing data hierarchies. We present example use cases in the context of a medical temporal event sequence visual analytics tool, and report findings from domain expert user interviews.
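
To make the idea of tree-based cohort provenance with a per-cohort drift value concrete, the following is a minimal illustrative sketch and not the implementation described in the paper. It rests on several assumptions: the `Cohort` class, the `records` layout, and all function names are hypothetical; each dimension is treated as a binary "has this code" flag; and the Hellinger distance between baseline and cohort prevalences is used as one possible drift measure, averaged over dimensions.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical data layout: individual id -> set of codes (dimensions) present.
Records = Dict[str, Set[str]]


@dataclass
class Cohort:
    """A node in a cohort provenance tree (illustrative only)."""
    name: str
    members: Set[str]
    parent: Optional["Cohort"] = None
    children: List["Cohort"] = field(default_factory=list)

    def filter(self, name: str, records: Records, code: str) -> "Cohort":
        """Create a child cohort of members that have `code`, keeping provenance."""
        kept = {m for m in self.members if code in records[m]}
        child = Cohort(name, kept, parent=self)
        self.children.append(child)
        return child


def prevalence(cohort: Cohort, records: Records, code: str) -> float:
    """Fraction of cohort members that have `code` (0.0 for an empty cohort)."""
    if not cohort.members:
        return 0.0
    return sum(code in records[m] for m in cohort.members) / len(cohort.members)


def hellinger_binary(p: float, q: float) -> float:
    """Hellinger distance between Bernoulli(p) and Bernoulli(q), in [0, 1]."""
    return sqrt(((sqrt(p) - sqrt(q)) ** 2 + (sqrt(1 - p) - sqrt(1 - q)) ** 2) / 2)


def drift(baseline: Cohort, cohort: Cohort, records: Records,
          codes: List[str]) -> Tuple[Dict[str, float], float]:
    """Per-dimension distances from the baseline, plus their mean (assumed aggregate)."""
    per_dim = {
        c: hellinger_binary(prevalence(baseline, records, c),
                            prevalence(cohort, records, c))
        for c in codes
    }
    return per_dim, sum(per_dim.values()) / len(per_dim)


# Toy example: selecting on "diabetes" also shifts the unseen "hypertension" dimension.
records: Records = {
    "p1": {"diabetes", "hypertension"},
    "p2": {"diabetes", "hypertension"},
    "p3": {"hypertension"},
    "p4": set(),
}
baseline = Cohort("all patients", set(records))
focus = baseline.filter("has diabetes", records, "diabetes")
per_dim, mean_drift = drift(baseline, focus, records, ["diabetes", "hypertension"])
```

In this toy data the focus cohort is selected only on `diabetes`, yet its `hypertension` prevalence also differs from the baseline, which is the kind of unintended side effect the abstract describes. A real system would need a measure suited to its data types and hierarchies.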