A Reference Stack for Modern Data Science

By MacRae Linton @ Truss_Works, Adam Fletcher and Jonathan Mortensen @ Gyroscope · Nov 6, 2017

At Gyroscope’s inception, the company consisted entirely of its two founders: a data scientist and an infrastructure engineer. We had lofty goals, modeled after companies like Google, to develop a modern stack that allowed for accurate, repeatable data science and data exploration while also providing a real-time Machine Learning (ML) prediction platform for developers. You know, small goals.

Juggling the challenges of operating a burgeoning startup with a lean team made it difficult for us to build the solid infrastructure we needed to keep our growing business successful. We turned to the team at Truss to help us implement a foundation of best practices so that we could keep growing without accruing the technical debt that is so common in these early stages, when it’s so tempting to trade good fundamentals for accelerated growth. Truss took our vision and prototype and worked with us on the design and implementation of the modern ML stack described below.

In this post, we:

  1. define a framework with which to understand a data science system;
  2. discuss the key properties a production-ready data science system should exhibit; and,
  3. describe the system we’ve built and the core components we’ve selected to meet those properties.

The Engineering Challenges of Data Science

When moving data science from a research endeavour into a core component of a business (i.e., into production), you need a reproducible and predictable data science process. However, during this process, a host of issues arise that the community is only starting to tackle. It’s useful to consider those issues by asking four key questions:

  1. What and where are the data?
  2. How do you write data science code?
  3. Where and how do you store results?
  4. How is work translated into production?

Table 1: The state of data science systems today

Today, the answers to those questions aren’t ideal (Table 1). Without well-versioned data, code, and results, the work becomes irreproducible. Further, the lack of well-defined standards around data formats, data storage, code infrastructure, development tooling, and translation to production leads to seemingly endless debugging and a large gap between the data and methods used in production and those used in research. This chasm means that the work a data scientist does is not predictive of how things will go in production. As a consequence, the bulk of time is spent in translation (i.e., fixing all the inconsistencies and bridging that big gap) instead of in doing novel work. The net result of today’s process is a very long iteration time.

Our goal was to minimize the iteration time during the data science process and any associated engineering processes. Toward that end, we specified the properties that are central to an effective system that supports the data science process:

Necessary properties of a data science system

  1. Ability to develop and test in any language
  2. Reproducible builds
  3. Ability to run the entire stack locally for development
  4. Local, Continuous Integration (CI), Test, Staging, and Production environments are identical
  5. All research and experiments happen in Production (code development and testing can still happen locally)
  6. Every piece of production data (inputs or outputs) is versioned and queryable later on
  7. Ability to trace production data through the system
  8. Discourage work outside the system to ensure its use and improvement (e.g., hinder use of non-production data beyond testing)

In such a system, all data science tooling and code is housed in the same place: data is versioned and code has a commit. As an added bonus, because there is no gap between research and production, we can move from experimentation to testing to production by changing configurations, not by translating code and methods. In short, a system with these eight properties enables reproducible data science with a fast iteration time.
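To make that configuration-driven promotion concrete, here is a minimal sketch of what environment selection could look like. The environment names, contexts, buckets, and environment variable below are hypothetical, not Gyroscope’s actual configuration.

    # Minimal sketch: one config per environment, identical code everywhere.
    # All names here are illustrative, not Gyroscope's actual settings.
    import os

    ENVIRONMENTS = {
        "local": {"kube_context": "minikube", "data_bucket": "gs://example-local"},
        "test": {"kube_context": "gke-test", "data_bucket": "gs://example-test"},
        "staging": {"kube_context": "gke-staging", "data_bucket": "gs://example-staging"},
        "production": {"kube_context": "gke-prod", "data_bucket": "gs://example-prod"},
    }

    def current_environment():
        """Select an environment by configuration; the code path never changes."""
        return ENVIRONMENTS[os.environ.get("GYRO_ENV", "local")]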

In your data science system:

  • What are the answers to the four questions?
  • Which of the eight properties does it meet?

The Reference Stack

We didn’t see any existing stack or system that met the necessary properties. So, to build the system we desired, we needed to stand up a stack from scratch with all the core components necessary:

Core components of a modern stack

  1. Service Composability — How you modularize your services, often microservices vs. monolithic
  2. Repository — Where you house and version your services and tools
  3. Interface Definition Language and API specification — How modules communicate
  4. Build System — How you create artifacts from dependencies and source code
  5. Artifacts — The outputs of the build system, along with their storage and versioning
  6. Deployment — Where and how artifacts are deployed and updated, hermetically
  7. Monitoring — How to inspect a running system, review its history, and alert
  8. Testing — How you test each individual component and their integration
  9. Data — Storage location, versioning, and querying of all data

Today, there are many options for each stack component. Does yours have each? What tools have you chosen? Let’s discuss our own journey in selecting the best option for each.

Beginning the Journey

Doing ML in the cloud meant that the web services that Gyroscope deployed needed to be easy to experiment with and change on the fly. It was important that a new ML model could be written, tested against old data, and then run in production easily and quickly. We needed the ability to deploy many microservices that all could interact with each other while being agnostic about the language choice. Several requirements came out of this, such as ease of deployment and a system for saving and re-running old data through new models. But most crucially for our infrastructure design, it meant that we needed the ability to deploy many different backend services written in different languages that were able to communicate with each other.

Figure 1. gRPC and protocol buffers are the means by which Gyroscope defines its APIs and transports data.

While JSON over HTTP is fairly de rigueur in the industry, we chose gRPC instead. gRPC let us describe each service in a language-agnostic definition file from which complete client and server interfaces could be generated in any language, which was a huge improvement over defining a JSON API for each service and having each service write to that spec on its own. Breaking API changes (though they are discouraged) would raise errors at compile time, as opposed to 400s at runtime. There were a host of additional benefits that gRPC brought to the table, but that feature made it the right choice for us. A canonical, shared interface definition is a critical requirement for composable services and language flexibility.
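To illustrate what the generated interfaces look like on the Python side, here is a minimal, hedged sketch of a server built from gRPC codegen output. The prediction_pb2 / prediction_pb2_grpc modules, the service name, and the message fields are hypothetical stand-ins, not Gyroscope’s actual .proto definitions.

    # Sketch of a Python service implementing a generated gRPC interface.
    # The generated modules and message/field names below are hypothetical.
    from concurrent import futures

    import grpc
    import prediction_pb2
    import prediction_pb2_grpc

    class PredictionService(prediction_pb2_grpc.PredictionServiceServicer):
        def Predict(self, request, context):
            # Request/response types come from the shared .proto file, so a
            # breaking field change fails at codegen/compile time, not as a 400.
            score = 0.5  # placeholder for a real model call
            return prediction_pb2.PredictReply(score=score)

    def serve():
        server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
        prediction_pb2_grpc.add_PredictionServiceServicer_to_server(
            PredictionService(), server)
        server.add_insecure_port("[::]:50051")
        server.start()
        server.wait_for_termination()

    if __name__ == "__main__":
        serve()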

The Right Tool for Keeping Versioning Consistent

Our gRPC choice necessitated our next architectural choice. Since different services were all compiling gRPC definitions as part of their build step, it was important to make it easy to keep the version of the definition that they were using to communicate with each other in sync. Putting everything in a mono-repo made that entire class of problem disappear — versioning, reproducibility, and dependency management questions were resolved by keeping all the code in a single repository and building and deploying all of that code together every time.

Figure 2

The next question to address was building. Most languages come with some sort of build and package management system, but we had a heterogeneous backend, so our overarching interface for building different services needed to be language-agnostic — and such tools are not abundant.

Make, the old standby, was an option, but even when used well it could be brittle and complex. A few recent entrants to the field (Buck, Pants, and Bazel, all descendants of Google’s internal build system Blaze) provided alternatives to Make. Ultimately, we chose Bazel because its laser focus on hermetic, reproducible builds was attractive to us as we tried to cut down on the ways builds could vary between developers. Divorcing the build system from the programming language solves the short-term problem of build coordination between languages and the long-term problem of build-system complexity.
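As a rough illustration of that split, here is a hedged Starlark (BUILD file) sketch of per-language services sharing one proto target. Package names, target names, and the exact proto/gRPC rule names and load() statements are illustrative and vary with the rules versions in use.

    # protos/BUILD: one shared definition target (names are illustrative)
    proto_library(
        name = "prediction_proto",
        srcs = ["prediction.proto"],
    )

    # services/model/BUILD: a Python service
    py_binary(
        name = "model_server",
        srcs = ["model_server.py"],
        deps = ["//protos:prediction_py"],  # generated Python bindings; rule name varies
    )

    # services/gateway/BUILD: a Go service
    go_binary(
        name = "gateway",
        srcs = ["main.go"],
        deps = ["//protos:prediction_go"],  # generated Go bindings; rule name varies
    )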

Bazel made it very easy to have our gRPC definitions be common code, in one place, that other projects could depend on. It ensured that every time we made a change to a service’s API in a .proto file, all dependent services started using the updated definition on the next build and deploy. Bazel is language-agnostic and has great support for building Go and Python, our first two targets, so our build system had a consistent interface for developers no matter what they were writing. With our build system chosen, the next question was: what should we build?

Choosing the Right Container

Docker images were an obvious choice. Docker, too, is language-agnostic, so we could build services any way we wanted and have a standardized deployable artifact. While Docker was easy to deploy on Amazon Web Services’ nascent EC2 Container Service, that service’s configuration left something to be desired when compared to our pick: Kubernetes. Kubernetes orchestrates the deployment and running of Docker containers across a pool of machines. Its configuration was straightforward and its primary abstractions, Deployments and Services, were a pleasure to work with. As a bonus, it’s the lingua franca of Google Container Engine, which we chose to deploy to.

To further standardize our development environments and ensure that our built products would run correctly in Linux Docker containers, we pulled our build environment off the local file system and put it into its own Docker container. Instead of running Bazel commands directly, we spun up a builder container that had our source code mapped in as a volume and we wrapped the Bazel command to run in that container instead. That meant Bazel’s installation and dependencies were built into a shared Docker Image instead of being installed during developer setup. We even reused that build image when building in CI, guaranteeing that everywhere we built our software, the environment was the same.
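A minimal sketch of that wrapper is below; the builder image name, mount point, and flags are hypothetical stand-ins rather than our actual script.

    #!/usr/bin/env python
    """Hedged sketch of running Bazel inside a shared builder container."""
    import os
    import subprocess
    import sys

    BUILDER_IMAGE = "example/bazel-builder:latest"  # hypothetical image name

    def run_bazel_in_container(bazel_args):
        """Run `bazel <args>` in the builder image with the repo mounted as a volume."""
        repo_root = os.getcwd()
        cmd = [
            "docker", "run", "--rm",
            "-v", repo_root + ":/workspace",  # source mapped in as a volume
            "-w", "/workspace",               # run Bazel from the repo root
            BUILDER_IMAGE,
            "bazel",
        ] + list(bazel_args)
        return subprocess.call(cmd)

    if __name__ == "__main__":
        sys.exit(run_bazel_in_container(sys.argv[1:]))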

Figure 3. Our data science system has hermetic environments and is deployable with a single command.

The local development story was especially important to get right. Fortunately, Kubernetes has a great one, called Minikube. It runs a single node Kubernetes cluster on your dev machine inside a VirtualBox instance. We could easily deploy our entire environment directly on our local machines. Now, no matter which service we were working on, it could be tested locally amongst the full cadre of its peers, and the results of those tests were valid in production, staging, and all other environments (Figure 3).

Bringing It All Together

What was the glue that bound this infrastructure together? We wrote Python scripts to coordinate interactions with GKE, deployments to Kubernetes, updates to DNS, and more. When it came to interacting with various Google services like DNS and Deployment Manager, we could use Python SDKs instead of their command-line utilities. Using Python instead of Bash let us build libraries that could be reused across scripts, and because Python is a full-fledged programming language, string interpolation, for loops, and if statements are all hard to get wrong. These scripts and their dependencies were assembled and run by Bazel, just like everything else.
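As a hedged illustration of this kind of glue (not our actual scripts), here is a small helper that rolls a Kubernetes Deployment to a new image using the official Python client; the deployment, namespace, and image names are hypothetical.

    """Sketch of deploy glue using an SDK rather than shelling out to kubectl."""
    from kubernetes import client, config

    def set_deployment_image(name, namespace, container, image):
        """Point one container of a Kubernetes Deployment at a new image tag."""
        config.load_kube_config()  # use the current kubectl context
        apps = client.AppsV1Api()
        patch = {
            "spec": {
                "template": {
                    "spec": {
                        "containers": [{"name": container, "image": image}],
                    }
                }
            }
        }
        apps.patch_namespaced_deployment(name=name, namespace=namespace, body=patch)

    # Example with hypothetical names:
    # set_deployment_image("model-server", "default", "model-server",
    #                      "gcr.io/example-project/model-server:abc123")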

Figure 4

An unexpected side effect of using Bazel was that it ended up defining a user interface to our entire development system. Bazel can be told to run scripts as well as build them, so we configured our deploy script to depend on the built docker images of our services. That meant the deploy script was written assuming that it had access to all the freshly built services, and with one “run” command you could kick off whatever needed to be done to make that true. Because Bazel is careful to encode all dependencies a target might have, this meant that building all the services, including the most recent gRPC definitions, putting them into Docker Images, tagging those images, pushing them to Google Cloud, and finally deploying them all to a test environment, could be done with a single command: bazel run deploy:deploy-to-test. It didn’t matter if none of those services had been built yet on this machine, or if we had been building and deploying all day. Bazel only built what was necessary, all wrapped up in the one command.
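For a sense of how a deploy target can depend on the images it ships, here is a hedged BUILD-file sketch. The target names and the container-image targets are illustrative; the exact image rules (for example, rules_docker’s container_image and container_push) depend on the rules in use.

    # deploy/BUILD: a runnable deploy script whose data deps pull in the images.
    # `bazel run deploy:deploy-to-test` then builds everything it needs first.
    py_binary(
        name = "deploy-to-test",
        srcs = ["deploy.py"],
        main = "deploy.py",
        data = [
            "//services/model:image",    # hypothetical container image targets
            "//services/gateway:image",
            "//kube:test_configs",       # hypothetical Kubernetes config files
        ],
    )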

That made our CI configuration story very straightforward. In our case, it only took a few lines of code to configure an environment in CircleCI, and then a single Bazel command built everything and deployed it in CI, just like we did locally. This harmony of local and CI environments made it easy to avoid problems with deployments or building in CI that appeared to work locally.

We configured Circle to deploy to a testing environment once tests passed on every branch in GitHub. When a pull request was merged to master, that triggered pushing that code directly to stage and then (if the smoke tests passed) production. Our Circle environment mirrored our development environment, and the commands we ran there were the same as the ones we ran locally.

Figure 5. CircleCI config file. Note how simple the commands are and that they are almost entirely Bazel invocations. CircleCI has no knowledge of build steps or dependencies beyond this.

It wasn’t all roses. Bazel still felt fairly young; every time we used it for something new, it took a day or so to figure out how to make it work. Some compatibilities just aren’t implemented, and the docs occasionally lagged. Releases don’t seem to be a huge priority for the project; for one of our issues, we were told to just build Bazel from the top of tree on our own. We tried a handful of different methods of parameterizing our Kubernetes configuration before settling on something dead simple. And while Kubernetes is a very nice abstraction, it is still complex. Several hours were spent staring at the config wondering why two services couldn’t talk to each other.

Another issue is that most commercial CI vendors expect a single repository in a single language per project. Fortunately, by picking a dedicated build system like Bazel and using containers, we could sidestep this problem and have the CI systems trigger the same commands we used when building locally. Across the board, once things worked, they kept working, which made the struggles worth it.

The Sum of Its Parts

What did this all add up to? To add a new parameter to the communication between two services, our dev environment allowed us to work on both services in tandem. First, we updated the gRPC service definition to include the new parameter. Then, we worked on either the sender or the receiver first, briefly taking advantage of the fact that gRPC handles version mismatches by silently dropping new parameters.

    Steps to build and deploy
    Step 1: > bazel run deploy:deploy-to-minikube
    Step 2: > echo "Done."

Crucially, we could get started on both together, get the controller sending something, get the model receiving it, and have them both up and running with one command: bazel run deploy:deploy-to-minikube. That command generated the gRPC definitions for each language, rebuilt the services and their Docker images, and updated their Kubernetes Deployments in Minikube so that the pods restarted running the new images. On a first build that might take minutes as all the dependencies compile, but for incremental changes Bazel only rebuilds what changed, so it takes around 10 seconds.

Now we had a tight development loop where it was easy to make changes to either side of this new feature and immediately interact with it in the full Gyroscope environment, locally. If we were working on something complex, we could even write a short test script in any language that could use the same gRPC definitions to talk to our service in isolation.
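For example, a hedged sketch of such a throwaway test script in Python is below; the generated modules, service name, and fields are hypothetical, and the only assumption is that the service is reachable locally (say, port-forwarded from Minikube).

    """Throwaway test script that talks to one service in isolation over gRPC."""
    import grpc

    import prediction_pb2
    import prediction_pb2_grpc

    def main():
        # Assumes the service is reachable locally, e.g. port-forwarded from Minikube.
        channel = grpc.insecure_channel("localhost:50051")
        stub = prediction_pb2_grpc.PredictionServiceStub(channel)
        reply = stub.Predict(prediction_pb2.PredictRequest(features=[0.1, 0.2, 0.3]))
        print("score:", reply.score)

    if __name__ == "__main__":
        main()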

This is how we laid the engineering foundation (i.e., meeting properties 1 through 4) for creating a sustainable data science system. We’ve also given you a framework to think about how your data science system stacks up: 4 questions to answer, 8 properties to meet, and 9 core stack components. If your system doesn’t stack up well, consider swapping components out piecewise — there are great options out there now. Indeed, we’re beginning to see convergence on the principal components required for a modern stack (i.e., Bazel, Docker, Kubernetes, gRPC), but there remains a long road toward integrating them in a way that is effective, scalable, and follows best practices — Truss was instrumental in helping Gyroscope achieve that, today.

A reference build & deploy stack

In a future post, we’ll discuss the higher-level aspects of our infrastructure — versioned data, data storage, monitoring, testing, and other features. These crucial components can’t be built without the foundation described here. We’re sure many readers would have done a lot of this differently, and we would love to hear feedback on our design choices. Leave a comment below with your thoughts.

Follow Gyroscope and Truss on Twitter: @GyroscopeHQ and @Truss_Works

Gyroscope is always looking for curious engineers and data scientists who want to work on hard problems. Interested? Send us an email at careers@gyroscope.cc. Interested in machine learning solutions for your app or game? Check us out at getgyroscope.com. Check out Gyroscope’s other articles on our blog.

Where do you find that your work doesn’t translate into production? Truss can help you assess your infrastructure, and work side-by-side with your team to accelerate your engineering cycles and make every release predictable and reliable. Contact us at why@truss.works. Truss also writes extensively on infrastructure topics on their blog.
