Who Is “Public” Data For?
Jer Thorp on Openness and Access in Our Era of Technology and Information
“Public” is a word that has, in the last decade, become bound tightly to data. Loosely defined, any data that is available in the public domain falls into this category, but the term is most often used to describe data that might serve some kind of civic purpose: census data or environmental data or health data, along with transparency-focused data like government budgets and reports. Often sidled up to “public” is the word “open.” Although the Venn diagram between the two words has ample overlap (public data is often open, and vice versa), the word “open” typically refers to whether and how the data is accessible, rather than to the ends it might serve.
Both words—“public” and “open”—invite a question: For whom? Despite the efforts of Mae and Gareth, and Tom Grundner and many others, the internet as it exists is hardly a public space. Many people still find themselves excluded from full participation. Access to anything posted on a city web page or on a .gov domain is restricted by barriers of cost and technical ability. Getting this data can be particularly hard for communities that are already marginalized, and both barriers—financial and technical—can be nearly impassable in places with limited resources and literacies.
Data.gov, the United States’ “open data portal,” lists nearly 250,000 data sets, an apparent bounty of free information. Spend some time on data.gov and other portals, though, and you’ll find out that public data as it exists is messy and often confusing. Many hosted “data sets” are links to URLs that are no longer active. Trying to access data about Native American communities from the American Community Survey on data.gov brought me first to a census site with an unlabeled list of file folders. Downloading a zip file and unpacking it resulted in 64,086 cryptically named text files each containing zero kilobytes of data. As someone who has spent much of the last decade working with these kinds of data, I can tell you that this is not an uncommon experience. All too often, working with public data feels like assembling particularly complicated Ikea furniture with no tools, no instructions, and an unknown number of missing pieces.
Today’s public data serves a particular type of person and a specific type of purpose. Mostly, it supports technically adept entrepreneurs. Civic data initiatives haven’t been shy about this; on data.gov’s impact page you’ll find a kind of hall-of-fame list of companies that are “public data success stories”: Kayak, Trulia, Foursquare, LinkedIn, Realtor.com, Zillow, Zocdoc, AccuWeather, Carfax. All of these corporations have, in some fashion, built profit models around public data, often charging for access to the very information that the state touts as “accessible, discoverable, and usable.”
On an 1842 visit to Washington, DC, Charles Dickens wrote of “public buildings that need but a public to be complete.” While he was speaking of the capital’s wide but underpopulated streets, he might as well have been speaking of today’s information publics, sparsely populated places with data piled high on the curbs.
Over the last decade, I’ve attended many open data events: for scientists working with satellites, for librarians and archivists, for city planners, for software developers, for policy makers. As I’ve sat in the audience at these events, it’s occurred to me again and again that we’re not all talking about the same thing when we say “open data.”
Indeed, there seemed to be a kind of fractal misunderstanding about what the word “open” means, or more specifically whom the “open” is for. Earth observers had a different idea from librarians: while scientists seemed to be focused on making their projects open to other scientists, the library people were mostly considering humanities researchers. But even within those groups there didn’t seem to be common ground. People who worked at the Library of Congress didn’t agree with one another about who the audience for “open data” was, nor did people within a single department at the European Space Agency. Honestly, I’d be surprised if any two people at any of the events I went to could agree on whom they were meant to be making their data open to.
I’m still Canadian enough to believe that open means open. I’m with the Open Knowledge Foundation when it says that “open means anyone can freely access, use, modify, and share for any purpose.” Under this definition I’d argue that very few of the so-called open data projects from the last decade are actually open, unless we manufacture a definition for “anyone” that includes only people who look and think a lot like ourselves.
Let’s try an experiment. Pick an open data project, your own or someone else’s, and give it a score of zero. Because we’re feeling charitable, let’s give the project one point just for the word “open,” assuming the data is accessible in some way, through an API or a file download or a carrier pigeon service. Next, give your project one additional point for each of these questions you can answer yes to:
1. Does the project have comprehensible documentation, examples, and tutorials?
2. Are there materials (teaching curricula, blog posts, videos, and so on) that offer context around the data so that someone unfamiliar with the project can understand why it might be important?
3. Can a nonprogrammer access the data?
4. Is there documentation available in more than one language (for example, English and Spanish)?
5. Is your documentation and the site it is hosted on compatible with screen readers? Have you tested it?
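For anyone who would rather tally this in code than on paper, here is a minimal sketch of the scoring in Python. The function name and the short labels for the five questions are invented for illustration; the only rules taken from the test itself are one point for being accessible at all and one point for each yes.

```python
# Hypothetical labels for the five questions above, in order.
QUESTIONS = (
    "documentation",         # 1. comprehensible docs, examples, tutorials
    "context",               # 2. materials that explain why the data matters
    "nonprogrammer_access",  # 3. a nonprogrammer can get to the data
    "multilingual_docs",     # 4. documentation in more than one language
    "screen_reader_tested",  # 5. works with screen readers, and was tested
)


def openness_score(accessible, answers):
    """Return the score for a project.

    accessible: True if the data can be reached at all (API, download, ...).
    answers: dict mapping a question label to True (yes) or False (no).
             Missing labels are treated as "no".
    """
    score = 0
    if accessible:
        score += 1  # the charitable point for the word "open"
    # One additional point for each question answered yes.
    score += sum(1 for q in QUESTIONS if answers.get(q, False))
    return score


# A made-up project: accessible, documented, usable without code.
example = {"documentation": True, "nonprogrammer_access": True}
print(openness_score(True, example))
```

The answers themselves still have to come from honestly testing the project; the code only does the adding.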
How did you do? Data.gov scores a 2, if we’re being generous. New York City’s Open Data portal gets a 3.
Turning the red pen back on myself, I find I didn’t do much better. The three open data projects that I built with the Office for Creative Research—Floodwatch, Into the Okavango, and the Elephant Atlas—scored 2 points, 2 points, and 3 points, respectively. I think it should be a minimum goal for every data project that wants to legitimately use the term “open” to score at least a 3 on this test. But scoring a 3 is like scoring a C; it’s the minimum viable open, just enough so your parents can’t ground you. Even with a score of 5, you’ve arrived at open-ish data, open-esque data at best. How might we do better?
A lot of the answers are encoded in the questions above. Write understandable documentation, examples, and tutorials, and write them for an audience that isn’t you. Post interviews with good communicators who can give context and narrative. Provide easy-to-use visualization tools to foster comprehension. Think about making your data human-readable as well as machine-readable.
Early in 2014, a group of scientists began a series of closely spaced transects in single-engine planes, flying low across wide swaths of savanna in 21 African countries. The planes were equipped with laser altimeters, and the pilots worked meticulously to keep flight speed constant. Out of the rear windows, an observer on each side of the plane counted elephants. Over the next two years, they’d fly thousands of kilometers and count 352,271 elephants, in the first pan-continental survey since the early 1970s.
At the Office for Creative Research, we built the Elephant Atlas in 2016, a public front end to the enormous data set that had come from all of those hours in flight. The task of making the data public was tricky. Each of the countries where data had been collected had agreed to its own specific terms about how (and if) the data could be released. For some flights, no geospatial data could be released. For others, the path of the plane could be made public, but not the location of the elephants. Even the high-level country-by-country counts were complicated. To really understand the numbers, and how they’d changed over four decades, is to track a set of parallel narratives: conservation policies, ivory demand, food scarcity, habitat loss, human conflict.
We knew there would be a handful of researchers and policy wonks who’d want to dive deep into the data. For them, we built an API that would spit out detailed JSON files (a common format meant to be read by machines), time-stamped records of every flight and every elephant counted. These files could be computed upon, analyzed, mapped, charted, placed neatly into scientific papers. The point of the census, though, wasn’t to generate more scholarship; it was to effect policy change, particularly in the countries where the elephants lived. So we programmed the project API to return printable PDF reports, with maps and charts that were generated on the fly in response to the user’s request. Reports that could be printed on paper, stapled together, dropped on a politician’s desk. Real, tangible things that could be mailed or stuck with a thumbtack into a corkboard.
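The idea of one data source with two kinds of output can be sketched in a few lines of Python. This is not the Elephant Atlas code: the records are invented placeholders, the `render` function and its `fmt` parameter are hypothetical, and a plain-text summary stands in for the printable PDF with maps and charts.

```python
import json

# Invented placeholder records; these are not figures from the census.
RECORDS = [
    {"region": "Region A", "year": 2015, "elephants_counted": 120},
    {"region": "Region B", "year": 2015, "elephants_counted": 45},
]


def render(records, fmt="json"):
    """Return the same records as machine-readable JSON or a readable report."""
    if fmt == "json":
        # For researchers: structured output that can be computed upon.
        return json.dumps(records, indent=2)
    if fmt == "report":
        # For everyone else: something that can be printed and handed over.
        lines = ["Elephant count summary", ""]
        for r in records:
            lines.append(
                f"{r['region']} ({r['year']}): "
                f"{r['elephants_counted']} elephants counted"
            )
        total = sum(r["elephants_counted"] for r in records)
        lines.extend(["", f"Total: {total}"])
        return "\n".join(lines)
    raise ValueError(f"unknown format: {fmt}")


print(render(RECORDS, "report"))
```

The design choice worth noticing is that both outputs come from the same records, so the printable version is never a stale copy of the data the researchers see.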
To get to the outer reaches of open, where our data is really and truly serving the public, we need to consider exclusion and accessibility. Put more directly, we have to think about people who aren’t us. Who is being excluded by the technologies that we are using and by the ways in which we are communicating? I spent four years working on data-focused conservation efforts along the Okavango River, which spans three countries with three official languages. We offered API documentation only in English. All three of the OCR’s supposedly public data projects are tedious (or impossible) to access with a screen reader, making all of our open data very much closed to people with visual impairments.
To make a data project truly open to anyone, we need to think about outreach, past the computer.
Excerpted from Living in Data by Jer Thorp. Reprinted with permission of the publisher, Farrar, Straus and Giroux. Copyright © 2021 by Jer Thorp.