Authors: Nan-Chen Chen, Been Kim
Abstract: Developing sophisticated artificial intelligence (AI) systems requires AI researchers to experiment with different designs and analyze results from evaluations (we refer to this task as evaluation analysis). In this paper, we tackle the challenges of evaluation analysis in the domain of question-answering (QA) systems. Through in-depth studies with QA researchers, we identify the tasks and goals of evaluation analysis and derive a set of design rationales, based on which we propose a novel approach termed prismatic analysis. Prismatic analysis examines data through multiple ways of categorization (referred to as angles). Categories in each angle are measured by aggregate metrics to enable diverse comparison scenarios.

To facilitate prismatic analysis of QA evaluations, we design and implement the Question Space Anglyzer (QSAnglyzer), a visual analytics (VA) tool. In QSAnglyzer, the high-dimensional space formed by questions is divided into categories based on several angles (e.g., topic and question type). Each category is summarized by accuracy, the number of questions, and accuracy variance across evaluations. QSAnglyzer visualizes these angles so that QA researchers can examine and compare evaluations from various aspects, both individually and collectively. Furthermore, QA researchers can filter questions on any angle by clicking, thereby constructing complex queries. We validate QSAnglyzer through controlled experiments and expert reviews. The results indicate that when using QSAnglyzer, users perform analysis tasks faster (p < 0.01) and more accurately (p < 0.05), and quickly gain new insights. We discuss how prismatic analysis and QSAnglyzer scaffold evaluation analysis, and provide directions for future research.
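The core aggregation behind prismatic analysis can be illustrated with a small sketch: questions are grouped into categories along a chosen angle (e.g., topic), and each category is summarized by its question count, overall accuracy, and accuracy variance across evaluations. The data shapes and names below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of prismatic-analysis aggregation (not the paper's code).
from statistics import mean, pvariance

# Each question: (category along the chosen angle, {evaluation name: correct?})
questions = [
    ("matter", {"eval_A": True,  "eval_B": False}),
    ("matter", {"eval_A": True,  "eval_B": True}),
    ("energy", {"eval_A": False, "eval_B": False}),
]

def aggregate_by_angle(questions):
    """Group questions by category, then compute per-category metrics."""
    categories = {}
    for category, results in questions:
        categories.setdefault(category, []).append(results)
    summary = {}
    for category, rows in categories.items():
        eval_names = rows[0].keys()
        # Accuracy of each evaluation within this category
        accs = [mean(1.0 if r[e] else 0.0 for r in rows) for e in eval_names]
        summary[category] = {
            "num_questions": len(rows),
            "accuracy": mean(accs),       # overall accuracy in this category
            "variance": pvariance(accs),  # accuracy variance across evaluations
        }
    return summary
```

With the toy data above, the "matter" category holds two questions; eval_A answers both correctly (accuracy 1.0) while eval_B answers one (accuracy 0.5), so the category shows overall accuracy 0.75 with nonzero variance, flagging it as a point of disagreement worth inspecting.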