The data science ecosystem part 2: Data wrangling

opinion
Apr 01, 2015 | 8 mins
Analytics

Data scientists spend 80% of their time converting data into a usable form. There are many tools out there to help, and I will go over some of the most interesting.

There was a money quote from Michael Cavaretta, a data scientist at Ford Motor, in a recent article in The New York Times. The piece was about the challenges data scientists face going about their daily business. Cavaretta said: “We really need better tools so we can spend less time on data wrangling and get to the sexy stuff.” Data wrangling is cleaning data, connecting tools and getting data into a usable format; the sexy stuff is predictive analysis and modeling. Considering that the first is sometimes referred to as “janitor work,” you can guess which one is a bit more enjoyable.

In CrowdFlower’s recent survey, we found that data scientists spent a solid 80% of their time wrangling data. Given how expensive a resource data scientists are, it’s surprising there are not more companies in this space.

When I was writing about our ecosystem last time, I noted how structured databases were originally built for finance and operations while unstructured ones were pushed forward by data scientists. I see a similar thing happening in this space. Since structured databases are an older industry, there were already myriad tools available for operations and finance people who have always worked with data. But there is also a new class of tools designed specifically for data scientists who have many of the same problems, but often need additional flexibility.

We’ll start with an area I know well.

Data enrichment

Data enrichment improves raw data. Original data sources can be messy, in different formats, from multiple applications (and so on), which makes running predictive analysis on it difficult, if not impossible. Enrichment cleans that data so data scientists don’t have to.

I’ve broken this category into “human” and “automated,” but both approaches involve both people and machines. Human data enrichment means having a human being transform every row of your data set, but this requires a lot of computer automation to keep it reliable. Likewise, automated data enrichment involves setting up rules and scripts to transform data but requires a human to set up and check those rules.

Human enrichment relies on the fact that there are tasks people are simply better at than machines. Take image classification, for example. Humans can easily tell if a satellite photo contains clouds. Machines still struggle to consistently do that.

Language is another big use case for human data enrichment. Natural language processing algorithms can do amazing things, but they can’t spot sarcasm or irony or slang nearly as well as a person can. You’ll often see PR firms and marketers analyze sentiment this way.

Human-enriched data can also be used to train search algorithms, and people can read and collect disparate information better than a machine can. Again, this requires the tasks to be well set up, and for the software to contain quality control safeguards, but if you get thousands of people working in tandem on simple jobs that people do better than machines, you can enrich tons of data at impressive speeds. My company, CrowdFlower, is in this space, but so are others like WorkFusion and in some ways, Amazon Mechanical Turk.
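To make the quality-control point concrete, here is a minimal sketch of one common safeguard: collecting several independent judgments per row and keeping the majority label. The example texts, labels, and agreement rule below are invented for illustration, not any particular platform’s method.

    # Toy sketch of a quality-control safeguard in human enrichment:
    # keep the label most annotators agree on. All data is invented.
    from collections import Counter

    judgments = {
        "Love the new release!": ["positive", "positive", "positive"],
        "SO great, it broke in a day.": ["negative", "negative", "positive"],
    }

    for text, labels in judgments.items():
        # most_common(1) returns [(label, vote_count)] for the top label.
        label, votes = Counter(labels).most_common(1)[0]
        agreement = votes / len(labels)
        print(f"{label} ({agreement:.0%} agreement): {text}")

Real systems layer on more than this (test questions, per-worker accuracy tracking), but majority agreement is the simplest version of the idea.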

Automated enrichment has the same goals, but uses scripts to have machines, instead of people, transform raw data into usable data. As I mentioned above, you still need a smart data scientist setting up those scripts and checking the results when enrichment is complete, but automated enrichment can be incredibly powerful if all the i’s are dotted. Data with small errors and inconsistencies can be transformed into usable data near instantaneously with the right scripting.

Notably, automated solutions work well for cleaning data that doesn’t need a human eye. Examples range from simple tasks like name and date formatting to more complicated ones like dynamically pulling in metadata from the internet. Trifacta, Tamr, Paxata and Pentaho come to mind as great automated solutions, but this is a space I see growing quickly as companies rush in to give data scientists some of their valuable time back.
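To give a feel for what that scripting looks like, here is a minimal sketch using pandas; the column names and formats are hypothetical stand-ins, not tied to any of the products above.

    # Minimal automated-enrichment sketch: normalize names and parse
    # mixed date formats. Columns and values are hypothetical.
    import pandas as pd
    from dateutil import parser

    raw = pd.DataFrame({
        "name": ["  alice SMITH", "Bob Jones ", "carol ng"],
        "signup_date": ["2015-03-01", "03/02/2015", "Mar 3, 2015"],
    })

    clean = raw.copy()
    # Strip stray whitespace and standardize capitalization.
    clean["name"] = clean["name"].str.strip().str.title()
    # Collapse three different date formats into one datetime type.
    clean["signup_date"] = clean["signup_date"].apply(parser.parse)

    print(clean)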

ETL/blending

ETL stands for Extract, Transform and Load, and the name gets to the heart of what the tools in this section of our ecosystem do. Essentially, what ETL/blending solutions do for data scientists is take dissimilar data sources and marry them so analysis can be done.

Here’s an example of what I’m talking about. Say you have a financial database that contains a list of your customers, how much they pay, and what they buy. That lives in one place. Now say you have a different database containing each customer’s geographical business address. The companies in this space help combine that data into a single, usable database, so a data scientist can look into questions like which regions buy the most of a certain product, which parts of the country are your target market, etc.
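In code, that blend can be as simple as a join on a shared customer key. Here is a sketch with pandas; the tables, columns, and numbers are made up for illustration.

    # Blend a purchases table with an address table, then answer the
    # regional question. All data here is invented for illustration.
    import pandas as pd

    purchases = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "product": ["A", "B", "A"],
        "amount": [1200, 800, 450],
    })
    addresses = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "region": ["West", "Midwest", "West"],
    })

    # Join on the shared key, then total sales by region and product.
    blended = purchases.merge(addresses, on="customer_id", how="left")
    print(blended.groupby(["region", "product"])["amount"].sum())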

And this is just a simple example; blending jobs can get much more complex. But essentially every data scientist will need to do some blending in their day-to-day responsibilities. Multiple data sources are frequently formatted in different ways, and if you want a holistic view of a client or your enterprise at large, you need to blend these sources together to do deep analysis.

Alteryx, Astera, CloverETL and etleap all have software that can handle this sort of data blending. And though ETL has been around since the days of structured databases, it figures to become increasingly vital. After all, more data sources means more discordant formatting. The promise of big data rests on being able to get both a granular and bird’s eye view of any of this information, for whatever analysis needs doing.

Data integration

Data integration solutions overlap significantly with ETL/blending software. Companies in both spaces aim to integrate data, but data integration is more concerned with unifying data applications and specific formats (as opposed to working on generic sets of data).

Think of what I mentioned last time, how there are third-party cloud applications that take care of everything from sales and marketing data to social reach and email campaigns. How do you combine each application into a usable data set on which a data scientist can run predictive analysis? With software like ClearStory or Databricks or SnapLogic.

Informatica has been in the space for years and does over a billion dollars of revenue. It also does quite a bit of work in each category of data wrangling as I’ve defined it here. Microsoft actually has two large offerings that would fit in this category: Azure Data Factory and SQL Server Integration Services.

Much like the ETL/blending tools, data integration programs are mainly focused on combining data from the left side of our ecosystem so it can be modeled by software on the right. In other words, integration tools like Apatar or Zoomdata and the like allow you to marry data from cloud applications like Hootsuite or Gainsight so you can get BI from Domo or Chartio.

API connectors

Lastly, let’s talk about API connectors. These companies don’t focus so much on transforming data as they do on integrating with as many separate APIs as possible. When companies like these started forming, I don’t think many of us predicted how big this space would actually be.

But these can be really, really powerful tools in the right hands. To start with a fairly non-technical example, I think IFTTT is a great way to understand what happens with an API connector. IFTTT (which stands for “if this, then that”) allows someone who posts an Instagram picture to save it immediately to Dropbox or post it on Twitter. You can think of it as an API connector that non-data scientists use to stay on top of their online persona. But it’s important to include here because a lot of data scientists I talk to use it as a lightweight tool for personal applications and for work.

Zapier is like IFTTT but focused on being a lightweight connector for business applications, which may make it more relevant for many data science teams.

MuleSoft, meanwhile, connects all of your business applications. Say a user logs onto your site. Who needs to know about it? Does your sales team need the lead? Does your nurture team need to know that user is back again? How about marketing? Do they want to know their email campaign is working? A single API connector can trigger all these actions simultaneously.
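Here is a hedged sketch of that fan-out pattern, not MuleSoft’s actual API: one incoming event is forwarded to every system that cares about it. The endpoint URLs are hypothetical placeholders.

    # One event, several downstream notifications. URLs are placeholders.
    import requests

    DOWNSTREAM = {
        "sales_crm": "https://example.com/crm/leads",
        "nurture": "https://example.com/nurture/events",
        "marketing": "https://example.com/marketing/attribution",
    }

    def on_user_login(event):
        """Forward a single login event to every interested system."""
        for name, url in DOWNSTREAM.items():
            resp = requests.post(url, json=event, timeout=5)
            print(f"{name}: {resp.status_code}")

    on_user_login({"user_id": 42, "action": "login"})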

Lastly, Segment.io connects your product to many of the SaaS business applications on the left of the infographic and more.
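In practice, a single tracking call is roughly all your product emits, and Segment routes it to whichever downstream tools you enable. The sketch below uses Segment’s analytics-python library; the write key, user ID, and event are placeholders.

    # Send one product event to Segment; all values are placeholders.
    import analytics

    analytics.write_key = "YOUR_WRITE_KEY"
    analytics.track("user-42", "Signed Up", {"plan": "pro"})
    analytics.flush()  # push queued events before the script exits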

API connectors simply don’t exist without the abundance of tools in this ecosystem to actually connect to. And while they weren’t totally designed for data scientists, data scientists use them, especially in conjunction with blending and integration tools.

Open-source tools

There are far fewer open-source tools specific to data wrangling than there are for data stores or analytics. Google open-sourced its very interesting OpenRefine project. For the most part, we see companies building their own ad hoc tools, mainly in Python, though Kettle is an open-source ETL tool with some traction.

My last installment of this ecosystem is coming up next week, where we’ll focus on Data Applications. Till then, thanks for reading.

Co-founder and CEO

Lukas Biewald is the co-founder and CEO of CrowdFlower. As a former data scientist, Lukas was frustrated by the amount of time he had to spend cleaning and labeling data instead of actually using it to solve business problems, leading him to co-found CrowdFlower in 2009. Today, the CrowdFlower platform connects over 5 million data enrichers in almost every country who work around the clock to provide companies with clean and actionable data.

Lukas graduated from Stanford University with a BS in Mathematics and an MS in Computer Science. Lukas is also an expert-level Go player.

The opinions expressed in this blog are those of Lukas Biewald and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.
