Save the data, save the world

Data Rescue DC brought together 200 librarians, scientists, coders, and concerned citizens to archive at-risk EPA data and websites. https://datarefuge.github.io/datarescue-dc/

In the first four weeks of the Trump administration, headlines have been dominated by the most dramatic events: travel bans, border wall plans, coziness with Russia, claims of “fake news,” and record-breaking protests. While many Americans have been in a constant state of whiplash, a group of activists has focused on another possible threat: the security of federal data infrastructure. In response, researchers at the UPenn Program in the Environmental Humanities started Data Refuge, an initiative to archive at-risk federal data. This effort has quickly grown into a nationwide network of librarians, scientists, programmers, and other concerned citizens. Many events have sprung up, including the Data Rescue DC event this Presidents’ Day weekend at Georgetown University. These events provide participants with tools and a collaborative space to archive government websites and effectively “bag and tag” data for deposit in the Data Refuge archive. To date, events have focused on archiving the most critical websites and data at risk: environmental and climate-relevant information from agencies like the EPA and NOAA.

get all the data

Maybe we can’t get ALL the data, but we can start with data that we think is most at risk.

The Data Rescue DC event occurred within an urgent context: the appointment of a climate change denier to lead the EPA, the silencing of civil servants on certain topics, and attempts to restrict data collection on racial disparities in affordable housing, among other news, served as the backdrop. For me, participation in Data Rescue DC felt like the most important near-term contribution that I could make as a scientist. Federal data writ large are absolutely critical, not only for climate science research but also for countless local applications. As part of a panel on Saturday, Denise Ross of New America reminded participants how important accurate housing data were after Hurricane Katrina for informing response efforts and literally saving lives. Reviewing and backing up countless EPA websites reminded me that the EPA collects and maintains decades of data on toxic wastes, pesticides, radiation, and many other critical topics. These data were collected over many years using American taxpayer dollars. Federal data also have the advantage of being consistently collected and free of the biases inherent in private-sector data; for instance, Ms. Ross shared that Google Street View could not be used to assess certain low-income neighborhoods in New Orleans because the camera-equipped cars had failed to include them.

So how does one save the data to save the world? At the event, participants divided into groups: seeders and sorters, researchers and harvesters, checkers and baggers, describers, and storytellers. Each group contributed one piece of the data archiving process. I seeded and sorted for the day, methodically clicking through and archiving EPA websites and marking pages that could not be crawled automatically (like databases or interactive maps) as uncrawlable. Researchers and harvesters dug deeper into the sites marked as uncrawlable and worked to capture these data. Checkers and baggers provided layers of quality control for the harvested data, while describers wrote comprehensive metadata. Storytellers visited each group to capture participants’ stories and experiences through tweets, videos, blogs, and other media. Organizers were attuned to the sensitive nature of the event; participants who wished to remain anonymous wore white name tags. Signs posted around the room reminded us not to share pictures of people on social media without permission.

In addition to technical datasets, I discovered some fun things on the EPA website.

At Data Rescue DC, more than 200 participants sorted and seeded 4,776 URLs, harvested 20 GB of data, bagged 15 datasets, and described 40 datasets. This contributed about 40% of the datasets that are currently in the Data Refuge archive! Data Rescue DC and similar efforts begin to address the issue of at-risk federal data infrastructure by archiving data. But of course, this is not enough. It may be impossible to archive all of the data that are at risk. And more challenging: there is no guarantee that these data will continue to be collected in the future. Quiet activism like data archiving, in concert with continued vocal political pressure to support data collection efforts (e.g., maintaining or increasing agency funding), will be instrumental. As suggested by Bethany Wiggin of UPenn (one of Data Refuge’s founders), data users can play a role from the bottom up by talking about the data that we use and amplifying their importance. This can help “humanize” these data and emphasize their value to society.

I hope that public access to federal data remains open. In the meantime, activists will continue archiving – just in case.

Interested in getting involved to rescue data? Learn more on the UPenn Program in the Environmental Humanities page.