Overview: The Design, Adoption, and Analysis of a Visual Document Mining Tool For Investigative Journalists

Matthew Brehmer, Stephen Ingram, Jonathan Stray, and Tamara Munzner



Abstract

For an investigative journalist, a large collection of documents obtained from a Freedom of Information Act request or a leak is both a blessing and a curse: such material may contain multiple newsworthy stories, but it can be difficult and time-consuming to find relevant documents. Standard text search is useful, but even if the search target is known it may not be possible to formulate an effective query. In addition, summarization is an important non-search task. We present Overview, an application for the systematic analysis of large document collections based on document clustering, visualization, and tagging. This work contributes to the small set of design studies that evaluate a visualization system "in the wild", and we report on six case studies where Overview was voluntarily used by self-initiated journalists to produce published stories. We find that the frequently used language of "exploring" a document collection is both too vague and too narrow to capture how journalists actually used our application. Our iterative process, including multiple rounds of deployment and observations of real-world usage, led to a much more specific characterization of tasks. We analyze and justify the visual encoding and interaction techniques used in Overview's design with respect to our final task abstractions, and propose generalizable lessons for visualization design methodology.
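To make the idea of clustering documents by content similarity concrete, the following is a minimal sketch, not Overview's actual implementation. It assumes TF-IDF term weighting, cosine similarity, and average-linkage agglomerative clustering, none of which are confirmed by this page; the function names (`tfidf_vectors`, `cosine`, `cluster_tree`) and the sample documents are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF; terms that occur in every document get weight 0.
        vectors.append({t: c * math.log((1 + n) / (1 + df[t])) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_tree(docs):
    """Assumed technique: average-linkage agglomerative clustering.

    Returns a binary tree as nested tuples whose leaves are document indices.
    """
    if not docs:
        raise ValueError("need at least one document")
    vecs = tfidf_vectors(docs)
    # Each cluster is (subtree, list of member document indices).
    clusters = [(i, [i]) for i in range(len(docs))]

    def avg_sim(m1, m2):
        total = sum(cosine(vecs[i], vecs[j]) for i in m1 for j in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average similarity.
        a, b = max(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda p: avg_sim(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][0]


if __name__ == "__main__":
    # Hypothetical sample documents, not taken from any real collection.
    sample = [
        "oil drilling permit gulf",
        "gulf oil drilling spill",
        "budget meeting schedule",
        "meeting schedule budget agenda",
    ]
    print(cluster_tree(sample))
```

The pairwise search makes this sketch far too slow for collections of thousands of documents; it is meant only to show how a tree of clusters can fall out of content similarity, with similar documents merged low in the tree and dissimilar groups joined near the root.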

Paper

Overview: The Design, Adoption, and Analysis of a Visual Document Mining Tool For Investigative Journalists
In IEEE Transactions on Visualization and Computer Graphics (TVCG), 20(12), p. 2271-2280.
Proceedings of IEEE Conference on Information Visualization (InfoVis), Paris, France, 2014.
→ [Pre-Print PDF] [IEEE link] [BibTeX]

Overview: A Web-Based Visual Document Mining System

Stories

Stories completed with Overview:

Talk

This paper was presented by Matthew Brehmer on Friday, Nov. 14, in the "Documents, Search, and Images" session of InfoVis 2014.

Slides (35 MB PDF)
Slides (99 MB Keynote)
Video (31 MB MP4)

Video Preview


High-Resolution Figures

Fig. 1. Overview is a multiple-view application intended for the systematic search, summarization, annotation, and reading of a large collection of text documents, hierarchically clustered based on content similarity and visualized as a tree (left). Pictured: a collection of White House email messages concerning drilling in the Gulf of Mexico prior to the 2010 BP oil spill.
Fig. 2. Timeline of Overview's development, deployment, and adoption phases: deployments are represented as yellow squares; deployment-phase case studies are represented as purple circles, while adoption-phase case studies are represented as green circles. The dotted red lines indicate which version of Overview was used in each case study.
Fig. 3. Detail from "A full-text visualization of the Iraq War Logs" (WARLOGS) by Jonathan Stray, in which distinct clusters of documents are visible; these documents pertain to "criminal incidents" during the Iraqi civil war involving abductions and blindfolding.
Fig. 4. Overview v2, a desktop application released in Winter 2012. Shown here are 6,849 of the U.S. State Department diplomatic cables released by WikiLeaks, those pertaining to Venezuela. The "Oil industry" tag is selected; clusters containing documents having this tag are emphasized in pink in the Topic Tree and are shown in the Cluster List as a set of keywords. Individual documents having the "Oil industry" tag are emphasized in the scatterplot and shown in the Document List as a set of keywords. The fifth document is selected; its contents are displayed in the Document Viewer and it is marked as a larger black dot in the scatterplot.
Fig. 5. Overview v4, a web-based application released in Summer 2013. Shown here are 625 White House email messages concerning drilling in the Gulf of Mexico prior to the 2010 BP oil spill. The "Obama letter" tag is selected; clusters containing documents having this tag are highlighted in green in the Topic Tree. One of these clusters is selected and its keywords are displayed in a tooltip; the 66 documents in this cluster are listed in the Document List. Selecting a document from this list reveals the Document Viewer (cf. Figure 1).
Fig. 6. The human-centred design process development cycle, in which Lloyd and Dykes (2011) distinguish between alternative entry points (A, B) and between traditional (green), grounded (blue), and their own approach in which example designs are used to establish context of use and elicit requirements (red). In contrast, we begin with some requirements at point C, and only after multiple deployments do we arrive at a clear understanding of context of use (purple). Figure adapted and extended from Lloyd and Dykes (2011).

Supplemental Material

Reading through thousands of documents quickly with Overview by Jonathan Stray.
A full-text visualization of the Iraq War Logs. Jonathan Stray. Associated Press, Dec. 10, 2010.
Overview v1. A research prototype deployed in Fall 2011, used in the CARACAS pilot case study. See Hierarchical Clustering and Tagging of Mostly Disconnected Data. Stephen Ingram, Jonathan Stray, and Tamara Munzner. University of British Columbia Department of Computer Science Technical Report TR-2012-01 (2012).
Overview v3. The first web-based version of Overview deployed in Summer 2012, used in the GUNS case study (CS#4).
Further reading:
Matthew Brehmer
Last modified: Nov 25, 2014.