Overview: The Design, Adoption, and Analysis of a Visual Document Mining Tool For Investigative Journalists

Matthew Brehmer, Stephen Ingram, Jonathan Stray, and Tamara Munzner

Abstract

For an investigative journalist, a large collection of documents obtained from a Freedom of Information Act request or a leak is both a blessing and a curse: such material may contain multiple newsworthy stories, but it can be difficult and time-consuming to find relevant documents. Standard text search is useful, but even if the search target is known, it may not be possible to formulate an effective query. In addition, summarization is an important non-search task. We present Overview, an application for the systematic analysis of large document collections based on document clustering, visualization, and tagging. This work contributes to the small set of design studies that evaluate a visualization system "in the wild", and we report on six case studies where Overview was voluntarily used by self-initiated journalists to produce published stories. We find that the frequently-used language of "exploring" a document collection is both too vague and too narrow to capture how journalists actually used our application. Our iterative process, including multiple rounds of deployment and observations of real-world usage, led to a much more specific characterization of tasks. We analyze and justify the visual encoding and interaction techniques used in Overview's design with respect to our final task abstractions, and propose generalizable lessons for visualization design methodology.

Paper

Overview: The Design, Adoption, and Analysis of a Visual Document Mining Tool For Investigative Journalists

Matthew Brehmer, Stephen Ingram, Jonathan Stray, and Tamara Munzner

In IEEE Transactions on Visualization and Computer Graphics (TVCG), 20(12), pp. 2271-2280.
Proceedings of IEEE Conference on Information Visualization (InfoVis), Paris, France, 2014.
→ [Pre-Print PDF] [IEEE link] [BibTeX]

Overview: A Web-Based Visual Document Mining System

Overview: log into your account or create an account
The Overview blog
The Overview Project on Github
@overviewproject on Twitter
Document Mining with Overview: A Digital Tools Tutorial. Poynter News University, Mar 15, 2013.

Stories

Stories completed with Overview:

"A losing battle: How the Army denies veterans justice without anyone knowing" by Alissa Figueroa - Fusion, Nov. 6, 2014
"Private memo reveals winding tale involving John McCain, the NRA and ... condors" by Nancy Watzman - Sunlight Foundation, Sep. 18, 2014
"Missouri swore it wouldn't use a controversial execution drug. It did" by Chris McDaniel - St. Louis Public Radio, Sep. 2, 2014
"The brilliance of Louis C.K.'s emails: He writes like a politician" by Adrienne Lafrance - The Atlantic, Jul. 16, 2014
"Report backs truck driver in Skagit River Bridge collapse" by Mike Lindblom - Seattle Times, Jun. 11, 2014
"Locally, veterans have mixed opinions" by Scott Donnelly - PostStar.com, 2014
"Surprise! Many credit card agreements allow repossession" by Fred O. Williams - creditcards.com, Feb. 18, 2014
"For their eyes only: Police misconduct hidden from public by secrecy law, weak oversight" by Sandra Peddie and Adam Playford - Newsday, Dec. 28, 2013 [Case Study #6: NEWYORK]. This story was a finalist for the 2014 Pulitzer Prize in Journalism (Public Service).
"DHHS downplayed food stamp issues" by Tyler Dukes - WRAL, Dec. 9, 2013
"Own a gun? Tell us why" by Michael Keller - The Daily Beast, Dec. 22, 2012 [Case Study #4: GUNS]
"Ryan asked for federal help as he championed cuts" by Jack Gillum - Associated Press, Oct. 12, 2012 [Case Study #3: RYAN]
"TPD working through flawed mobile system" by Jarrel Wade - Tulsa World, Jun. 3, 2012 [Case Study #2: TULSA]
"What did private security contractors do in Iraq?" by Jonathan Stray - Associated Press, Feb. 21, 2012 [Case Study #1: IRAQ-SEC]

Talk

This paper was presented by Matthew Brehmer on Friday, Nov. 14, in the "Documents, Search, and Images" session of InfoVis 2014.

→ Slides (35 MB PDF)
→ Slides (99 MB Keynote)
→ Video (31 MB MP4)

High-Resolution Figures

Fig. 1. Overview is a multiple-view application intended for the systematic search, summarization, annotation, and reading of a large collection of text documents, hierarchically clustered based on content similarity and visualized as a tree (left). Pictured: a collection of White House email messages concerning drilling in the Gulf of Mexico prior to the 2010 BP oil spill.

Fig. 2. Timeline of Overview's development, deployment, and adoption phases: deployments are represented as yellow squares; deployment-phase case studies are represented as purple circles, while adoption-phase case studies are represented as green circles. The dotted red lines indicate which version of Overview was used in each case study.

Fig. 3. Detail from "A full-text visualization of the Iraq War Logs" (WARLOGS) by Jonathan Stray, in which distinct clusters of documents are visible; these documents pertain to "criminal incidents" during the Iraqi civil war involving abductions and blindfolding.

Fig. 4. Overview v2, a desktop application released in Winter 2012. Shown here are 6,849 of the U.S. State Department diplomatic cables released by WikiLeaks, those pertaining to Venezuela. The "Oil industry" tag is selected; clusters containing documents having this tag are emphasized in pink in the Topic Tree and are shown in the Cluster List as a set of keywords. Individual documents having the "Oil industry" tag are emphasized in the scatterplot and shown in the Document List as a set of keywords. The fifth document is selected; its contents are displayed in the Document Viewer and it is marked as a larger black dot in the scatterplot.

Fig. 5. Overview v4, a web-based application released in Summer 2013. Shown here are 625 White House email messages concerning drilling in the Gulf of Mexico prior to the 2010 BP oil spill. The "Obama letter" tag is selected; clusters containing documents having this tag are highlighted in green in the Topic Tree. One of these clusters is selected and its keywords are displayed in a tooltip; the 66 documents in this cluster are listed in the Document List. Selecting a document from this list reveals the Document Viewer (cf. Figure 1).

Fig. 6. The human-centred design process development cycle, in which Lloyd and Dykes (2011) distinguish between alternative entry points (A,B) and between traditional (green), grounded (blue), and their own approach in which example designs are used to establish context of use and elicit requirements (red). In contrast, we begin with some requirements at point C, and only after multiple deployments do we arrive at a clear understanding of context of use (purple). Figure adapted and extended from Lloyd and Dykes (2011).

Supplemental Material

Reading through thousands of documents quickly with Overview by Jonathan Stray.

A full-text visualization of the Iraq War Logs. Jonathan Stray. Associated Press, Dec. 10, 2010.

Overview v1. A research prototype deployed in Fall 2011, used in the CARACAS pilot case study. See Hierarchical Clustering and Tagging of Mostly Disconnected Data. Stephen Ingram, Jonathan Stray, and Tamara Munzner. University of British Columbia Department of Computer Science Technical Report TR-2012-01 (2012).

Overview v3. The first web-based version of Overview deployed in Summer 2012, used in the GUNS case study (CS#4).