An AI That Reads Privacy Policies So That You Don't Have To

Polisis, a machine-learning-trained tool, automatically produces readable charts of where your data ends up for any online service.

You don't read privacy policies. And of course, that's because they're not actually written for you, or any of the other billions of people who click to agree to their inscrutable legalese. Instead, like bad poetry and teenagers' diaries, those millions upon millions of words are produced for the benefit of their authors, not readers—the lawyers who wrote those get-out clauses to protect their Silicon Valley employers.

But one group of academics has proposed a way to make those virtually illegible privacy policies into the actual tool of consumer protection they pretend to be: an artificial intelligence that's fluent in fine print. Today, researchers at Switzerland's Federal Institute of Technology at Lausanne (EPFL), the University of Wisconsin, and the University of Michigan announced the release of Polisis—short for "privacy policy analysis"—a new website and browser extension that uses their machine-learning-trained app to automatically read and make sense of any online service's privacy policy, so you don't have to.

In about 30 seconds, Polisis can read a privacy policy it's never seen before and extract a readable summary, displayed in a graphic flow chart, of what kind of data a service collects, where that data could be sent, and whether a user can opt out of that collection or sharing. Polisis' creators have also built a chat interface they call Pribot that's designed to answer questions about any privacy policy, intended as a sort of privacy-focused paralegal advisor. Together, the researchers hope those tools can unlock the secrets of how tech firms use your data that have long been hidden in plain sight.

"What if we visualize what’s in the policy for the user?" asks Hamza Harkous, an EPFL researcher who led the work, describing the thoughts that led the group to their work on Polisis and Pribot. "Not to give every piece of the policy, but just the interesting stuff... What if we turned privacy policies into a conversation?"

Plug in the website for Pokemon Go, for instance, and Polisis will immediately find its privacy policy and show you the vast panoply of information that the game collects, from IP addresses and device IDs to location and demographics, as well as how those data sources are split between advertising, marketing, and use by the game itself. It also shows that only a small sliver of that data is subject to clear opt-in consent. (See how Polisis lays out those data flows in the chart below.) Feed it the website for DNA analysis app Helix, and Polisis shows that health and demographic information is collected for analytics and basic services, but, reassuringly, none of it is used for advertising and marketing, and most of the sensitive data collection is opt-in.

Polisis' AI-generated visualization of the privacy policy for Pokemon Go. Pribot

"The information is there, it defines how companies can use your data, but no one reads it," says Florian Schaub, a University of Michigan researcher who worked on the project. "So we want to foreground it."

Polisis isn't actually the first attempt to use machine learning to pull human-readable information out of privacy policies. Both Carnegie Mellon University and Columbia have made their own attempts at similar projects in recent years, points out NYU Law Professor Florencia Marotta-Wurgler, who has focused her own research on user interactions with terms of service contracts online. (One of her own studies showed that only .07 percent of users actually click on a terms of service link before clicking "agree.") The Usable Privacy Policy Project, a collaboration that includes both Columbia and CMU, released its own automated tool to annotate privacy policies just last month. But Marotta-Wurgler notes that Polisis' visual and chat-bot interfaces haven't been tried before, and says the latest project is also more detailed in how it defines different kinds of data. "The granularity is really nice," Marotta-Wurgler says. "It’s a way of communicating this information that’s more interactive."

To build Polisis, the Michigan, Wisconsin, and Lausanne researchers trained their AI on a set of 115 privacy policies that had been analyzed and annotated in detail by a group of Fordham Law students, as well as 130,000 more privacy policies scraped from apps on the Google Play Store. The annotated fine print allowed their software engine to learn how privacy policy language translates into simpler, more straightforward statements about data collection and sharing. The larger corpus of raw, uninterpreted policies supplemented that training, teaching the engine terms that didn't appear in those 115 annotated documents by giving it enough examples to compare passages and find matching context.
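To get a rough sense of how that kind of two-stage training can work, here is a minimal sketch in Python using gensim and scikit-learn: word embeddings learned from a pile of unannotated policy text, then a small supervised classifier trained on expert-labeled segments. The toy sentences, the label names, and the logistic-regression classifier are illustrative assumptions, not the Polisis team's actual architecture.

```python
# Sketch of the two-stage idea: (1) unsupervised word embeddings from a large,
# unlabeled corpus of policy text; (2) a supervised classifier trained on a much
# smaller, expert-annotated set. Data and labels here are made up for illustration.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

def tokenize(text):
    return text.lower().replace(",", " ").split()

# Stage 1: learn word vectors from raw, unannotated policies.
raw_corpus = [
    "we collect your location and device identifiers",
    "we may share aggregated data with advertising partners",
    "you can opt out of marketing emails at any time",
    "third parties may receive demographic information for analytics",
]
embeddings = Word2Vec(
    sentences=[tokenize(s) for s in raw_corpus],
    vector_size=50, window=5, min_count=1, epochs=50,
)

def segment_vector(text):
    """Represent a policy segment as the average of its word vectors."""
    vecs = [embeddings.wv[w] for w in tokenize(text) if w in embeddings.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(embeddings.wv.vector_size)

# Stage 2: supervised training on a small set of annotated segments
# (standing in for the 115 expert-annotated policies).
annotated_segments = [
    ("we collect your location for advertising", "data-collection"),
    ("we share device identifiers with partners", "third-party-sharing"),
    ("you can opt out of data sharing", "user-choice"),
    ("demographic information is used for analytics", "data-collection"),
]
X = np.array([segment_vector(text) for text, _ in annotated_segments])
y = [label for _, label in annotated_segments]
classifier = LogisticRegression(max_iter=1000).fit(X, y)

# Classify a policy segment the model has never seen before.
print(classifier.predict([segment_vector("we may share your location with advertisers")]))
```

The point of the split is that the embeddings, trained on the big unlabeled corpus, let the classifier generalize to wording that never appears in the small annotated set.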

After all of that training, the Polisis AI can interpret a privacy policy with results that agree with Fordham's experts 88 percent of the time, once those results are translated into broader statements about a service's data-collection practices. And while that's hardly a perfect system, the researchers note that Fordham's experts only agreed with each other about that often, too. "When there are internal contradictions, the outcomes are somewhat fuzzy," NYU's Marotta-Wurgler notes. And even aside from those contradictions, it's worth noting that no amount of close reading of a privacy policy can resolve some ambiguities, such as whom a company may be sharing private data with when it names only unspecified "third parties."

An example conversation with Pribot about the details of a privacy policy. Hamza Harkous

The researchers' legalese-interpretation apps do still have some kinks to work out. Their conversational bot, in particular, seemed to misinterpret plenty of questions in WIRED's testing. And for the moment, that bot still answers queries by flagging an intimidatingly large chunk of the original privacy policy; a feature to automatically simplify that excerpt into a short sentence or two remains "experimental," the researchers warn.

But the researchers see their AI engine in part as the groundwork for future tools. They suggest that future apps could use their trained AI to automatically flag data practices that a user asks to be warned about, or to automate comparisons between different services' policies, ranking how aggressively each one siphons up and shares your sensitive data.

"Caring about your privacy shouldn't mean you have to read paragraphs and paragraphs of text," says Michigan's Schaub. But with more eyes on companies' privacy practices—even automated ones—perhaps those information stewards will think twice before trying to bury their data collection bad habits under a mountain of legal minutiae.