Data Security for Data Scientists

Ten practical tips for protecting your data (and more importantly, everyone else’s!)

Andrew Therriault
Sep 12, 2017

Another day, another breach. The Equifax credit data breach is just the latest in a series of stories about major organizations’ data being exposed. It happened to Target’s customer credit card database, it happened to Anthem’s health insurance records, and it even happened to the federal Office of Personnel Management’s background check forms. Even worse, these are just a handful of the biggest examples; critical servers and databases are compromised every single day. These breaches are only becoming more frequent, and it’s safe to say the problem will get a lot worse before it gets better.

Anxious yet? Good. You should be. Security is no longer just a niche specialty of database admins and network engineers. Everyone who creates, manages, analyzes, or even just has access to data is a potential point of failure in an organization’s security plan. So if you use data which is at all sensitive — that is, any data you wouldn’t freely give out to any random stranger on the internet — then it’s your responsibility to make sure that data is protected appropriately.

I had the importance of data security hammered home for me in 2016. While the DNC hack targeted the organization’s email servers (which my team didn’t interact with, except as normal email users), it’s not hard to imagine that someone who could get into those systems could also have found their way into our databases of voters and campaigns. Whether that actually happened or not, I have no idea (I had already left that job by the time the extent of the hack became known), but the incident underlined how important it is for data scientists to be invested in the security of their own data. Simply put, we cannot naively assume that someone else will take care of security for us.

For most data scientists, this topic is probably unfamiliar, since typical grad school and bootcamp training programs rarely cover security, if they cover it at all. (They certainly should, but I assume most readers are already past the point where that would help them.) That’s no excuse for neglecting data security, though, especially when one small misstep in that area can overshadow everything else you’re trying to do. So if you want to start taking better care of your data now, where do you begin?

Before starting to worry about things that are specific to data science, I’d recommend going through a basic security checkup, starting with some general best practices for keeping control of your accounts and assets online. Some specific recommendations:

  • Use (but don’t reuse!) complex passwords, and keep track of them in a password manager like LastPass or Dashlane.
  • Activate two-factor authentication wherever you can, preferably with an authenticator app or a physical token rather than text messages, which are easier to intercept.
  • Don’t mess around with sketchy wifi / computers / flash drives / software, and always encrypt any portable device (flash drive, portable hard drive, phone, tablet, laptop, etc.) which you could potentially lose.
  • Always connect through a VPN (a paid one, not just a free one, and be sure to do your research) when on public wifi networks.
  • Update your computer and phone regularly and use anti-virus / anti-malware / firewall software (though note that the value of these tools is increasingly being questioned).
  • Don’t store your account credentials in text files or embed them in scripts (one safer pattern is sketched just after this list).
  • And maintain control over the physical security of your computers, tablets, and phones as well, or all of these steps can be undone in the time it takes for your phone to be left unattended on a table.
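
On that last point about credentials: here’s a minimal sketch of what keeping secrets out of your scripts can look like in Python. The variable names are placeholders of my own, and a dedicated secrets manager is even better where you have access to one.

```python
import os

def get_db_credentials():
    """Fetch database credentials from the environment, failing loudly if absent."""
    # DB_USER / DB_PASSWORD are illustrative names, not a standard
    user = os.environ.get("DB_USER")
    password = os.environ.get("DB_PASSWORD")
    if not user or not password:
        raise RuntimeError(
            "DB_USER and DB_PASSWORD must be set in the environment -- "
            "never commit them to a script or repository."
        )
    return user, password
```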

These practices are good advice for everyone who has an email address, but if you work with data for a living, they’re just the beginning. So to help you take the next step toward becoming a security-literate data scientist, here are ten things I’ve learned throughout my career that I’d recommend for every data scientist:

  1. Take only what you need. The old spy’s mantra — sharing only on a “need to know basis” — is also the first rule of data security. You can’t lose data that you don’t have in the first place, so don’t collect sensitive data unless you have a clear need that justifies the risk. And even then, only get the absolute minimum you need to accomplish a task. As tempting as it is for data scientists to collect more data just in case they need it later, this kind of stockpiling can mean the difference between a minor cybersecurity incident and a major disaster, so don’t do it.
  2. Understand the data you have, and don’t keep data you don’t need anymore. Presumably you already have some data, so apply the same principles from #1 above to your existing data. Keep a regular inventory of the data you have on hand, assess the sensitivity of each dataset, get rid of data you don’t need, and consider taking steps to mitigate the risks inherent in the data you do keep: for example, by removing or redacting unstructured text fields, which can hide potentially sensitive data like names and phone numbers (a rough redaction sketch follows this list). And when you think about data sensitivity, don’t just think about your own interests: if you have data about other people, be sure to put yourself in their shoes.
  3. Encrypt data whenever and wherever possible. Encrypting your data (both “in motion” and “at rest”) is not the magic bullet we’d like it to be, but it’s typically a low-cost way to add an extra layer of protection in case a hard drive or network connection gets compromised. Performance is also no longer the deal-breaker for encryption that it used to be: unless you’re working on applications with extreme performance demands, the overhead is negligible, and plenty of high-performance applications and services now come with encryption built in (it’s a standard feature in Microsoft’s Azure SQL Database, for example). So if you have sensitive data, it should be encrypted by default. (Note that several early readers suggested I was too forgiving here; in their view, there is no excuse at this point for not encrypting sensitive data at every step in the process.) A file-level encryption sketch follows this list.
  4. Use secure sharing services rather than email, web servers, or basic FTP servers. The simple and quick methods of sharing files are fine for turning in class papers or sending cute dog pictures, but they’re risky ways to share files with sensitive data. Instead, use a service which is specifically designed for sharing files securely. For some, this might mean an access-controlled S3 bucket on AWS (where you can manage sharing of encrypted files with other AWS users; see the presigned-URL sketch after this list) or an SFTP server (which implements secure file transfers over an encrypted connection). But even just moving to a service like Dropbox or Google Drive is an improvement. Though they’re not meant to be as security-focused as some other tools, they still provide better fundamental security (both Dropbox and Google encrypt files at rest, for example) and allow for more fine-grained access control than sending files via email or dumping them on a minimally-secured server. For those looking for an upgrade from Dropbox or Google, a service like SpiderOak One can provide end-to-end encryption for file storage and sharing while maintaining an easy-to-use interface, at a price point that’s accessible for almost anyone ($5/mo for 100GB, $12/mo for 1TB).
  5. If you use cloud services like AWS or Azure, be sure to lock them down. Don’t make the mistake of assuming that because someone else runs the servers, you don’t have to worry about security. Quite the opposite, actually: there are a whole host of best practices for securing these systems that you need to be aware of. (I’d also suggest reading recommendations from those services’ own users, such as this one.) These include things like turning on authentication for S3 buckets and other file stores (one such step is sketched after this list), securing ports on servers so that only the ones you need are accessible, and limiting access to your services to approved IP addresses or a VPN tunnel.
  6. Share conscientiously. For sensitive data, grant individual users (both internal and external) access to individual datasets rather than granting access in bulk, and only give access when it’s actually needed (think #1 above, but for other people). Likewise, only give access for specific use cases and timeframes (think #2 above). Make your collaborators sign nondisclosure and data usage agreements (even if they’re not punitively enforced, they lay out expectations for how others will handle the data you’re sharing) and regularly check logs to ensure they’re complying with the intended usage.
  7. Secure not only data stores but also applications, backup copies, analytic servers, and so forth. Basically, anything that touches your data should be secured. Otherwise, you might create the Fort Knox of databases, but all that work is useless if your dashboard server caches all that data to disk and isn’t protected. And likewise, remember that your system backups will often make copies of your data files as well, and these will likely endure even after you delete the files themselves (which is, after all, the point of a backup copy). So these backups should not only be protected themselves, they should also be purged when no longer needed. Otherwise, they could become a buried treasure for a hacker — why bother with your carefully-pruned operational database when everything you’ve ever had is still on the backup drive?
  8. Make sure raw data isn’t hidden in outputs that you might share. Some machine learning models package up data (such as words and phrases from original documents) as part of a trained model object, so sharing a model could inadvertently reveal training data (a short demonstration follows this list). Along the same lines, dashboards, graphs, or maps might have the raw data embedded in the final output, while all you see on the surface is the aggregate result. And even if you’re just sharing a static image of a chart, there are tools out there to reconstruct the original datasets, so don’t assume that you’re not revealing raw data just because you’re not sharing tables. Know what it is you’re sharing, and think through what someone with bad intentions could do with it.
  9. Understand the privacy implications of “de-identified” or “anonymized” data (and make sure you’ve done it correctly). If you don’t need to keep personally identifiable information (PII) in a dataset, removing those fields is an obvious way to reduce the potential impact of a breach, and it’s a mandatory step before you share data publicly. But even when you’ve removed PII from a dataset, that doesn’t guarantee that someone else couldn’t figure out who’s who. Could the data be re-identified if combined with some other data? Are the non-PII characteristics unique enough to apply to only specific people? Did you believe somebody who foolishly told you that hashing was a good idea? I once received an “anonymized” consumer data file in which it took me less than two minutes to find my own record. (I was the only person with my unique combination of age, race, gender, and length of residence in my Census block.) Without much effort I could have found the records of many others as well, and with the help of a voter registration file (which is public record in most places) it would’ve been feasible to match most of those records to individuals’ names, addresses, and dates of birth. (A simple uniqueness check along these lines is sketched after this list.) There’s no perfect standard for de-identification, but if you plan to rely on it to protect privacy, I’d highly recommend following the Department of Health and Human Services’ standards for de-identification of protected health information. It’s not an absolute guarantee of privacy protection, but it’s the closest thing you’ll find that still allows your data to be useful.
  10. Know your worst-case scenarios. Even after all of these precautions, you can’t eliminate risk entirely, so think through what the worst potential outcomes would be if your data got out. After you’ve done that, go back to #1 and #2. No matter how hard you try to stop breaches, no solution is foolproof, so if you can’t tolerate the potential risks, you shouldn’t keep sensitive data around in the first place.
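
To make a few of these tips concrete before moving on, here are some short Python sketches. First, for tip #2: a rough pass at redacting obvious PII from an unstructured text field. The patterns and column name are illustrative; real redaction needs patterns tuned to your own data, and no regex will catch everything.

```python
import re
import pandas as pd

# Illustrative patterns for US phone numbers and email addresses
PHONE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> str:
    """Replace phone numbers and emails with placeholder tokens."""
    return EMAIL.sub("[EMAIL]", PHONE.sub("[PHONE]", text))

# 'notes' is a hypothetical free-text column in your dataset
df = pd.DataFrame({"notes": ["Call Jane at 617-555-0123",
                             "Follow up: bob@example.com"]})
df["notes"] = df["notes"].astype(str).map(redact)
print(df["notes"].tolist())  # ['Call Jane at [PHONE]', 'Follow up: [EMAIL]']
```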
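
For tip #3, a minimal sketch of encrypting a file at rest using the cryptography package’s Fernet recipe (symmetric, authenticated encryption). The filename is a placeholder, and key management is the genuinely hard part: in real use, the key belongs in a secrets manager, never next to the data.

```python
from cryptography.fernet import Fernet

# In real use, load the key from a secrets store rather than generating
# it inline -- whoever holds the key can read the data.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt a (hypothetical) CSV of sensitive records
with open("customers.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())
with open("customers.csv.enc", "wb") as f:
    f.write(ciphertext)

# Decrypt only when you actually need the data back
plaintext = fernet.decrypt(ciphertext)
```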
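
For tip #4, one way sharing from an access-controlled S3 bucket can work in practice: a presigned URL that grants time-limited access to a single object, instead of emailing the file itself. The bucket and key names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Generate a link to one object in a private bucket that expires in an hour
url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": "my-private-bucket",      # hypothetical bucket
            "Key": "reports/q3_analysis.csv"},  # hypothetical object
    ExpiresIn=3600,
)
print(url)  # share this link, not the file
```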
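
For tip #5, one concrete lock-it-down step on AWS: blocking all public access to an S3 bucket with boto3. This is just one control among many (IAM policies, security groups, VPC rules), and the bucket name is again a placeholder.

```python
import boto3

s3 = boto3.client("s3")

# Ensure nothing in this (hypothetical) bucket can be made public,
# regardless of individual object ACLs or bucket policies
s3.put_public_access_block(
    Bucket="my-sensitive-data-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```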
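
For tip #8, a quick demonstration of training data riding along inside a fitted model object: scikit-learn’s CountVectorizer, for example, keeps every token it saw during fitting, so pickling and sharing the fitted object shares the vocabulary too.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["patient John Smith reported chest pain",
        "call back at 617-555-0123"]

vec = CountVectorizer().fit(docs)

# Names and phone-number fragments from the raw documents are sitting
# right in the fitted object, and would travel with a pickled copy of it.
print(sorted(vec.vocabulary_))
```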
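
And for tip #9, a simple uniqueness check in the spirit of that consumer-file story: group the “de-identified” data by its quasi-identifiers and count how many records are one of a kind. The filename and column names here are hypothetical.

```python
import pandas as pd

df = pd.read_csv("deidentified.csv")  # hypothetical de-identified extract

# Quasi-identifiers: fields that aren't PII on their own but can be
# combined to single someone out
quasi_ids = ["age", "race", "gender", "years_at_address", "census_block"]

group_sizes = df.groupby(quasi_ids).size()
unique_records = int((group_sizes == 1).sum())
print(f"{unique_records} of {len(df)} records are unique "
      "on their quasi-identifiers alone")
```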

To be clear, these steps won’t protect you from every danger out there. If a state-sponsored hacking group is trying to find a way in, you’re probably over-matched. (That’s why I put recommendation #1 where I did — not having anything worth hacking is the only guaranteed defense!) But for the 99.9% of data scientists whose data is mainly of interest to a less-elite class of hackers, these tips should cover most of the topics you’ll need to know. That doesn’t mean you’re done — figuring out how to do all these things well is a much longer read — but you’re at least on the path to being a responsible guardian of your (and our) data.

Hopefully this helps make you a bit better at securing data, and therefore a better data scientist. If I missed anything here, or if you have other suggestions to add for other readers, please leave a response below.

Thanks to Mike Sager, Ilse Ackerman, Bill Fitzgerald, Angela Bassa, Cassie DeWitt, and others for feedback and suggestions. But no thanks to those unnamed hackers who really forced my interest in security back in 2016 — this is a topic I still wish I’d never had to learn this much about!


Andrew Therriault

Data science consultant and educator. Formerly Chief Data Officer @CityofBoston, Director of Data Science @TheDemocrats, and Data Science Manager @Facebook.