The Data Science Lifecycle Process

The Data Science Lifecycle Process is a set of prescriptive steps and best practices to enable data science teams to consistently deliver value. It includes issue templates for common data science work types, a branching strategy that fits the data science development flow, and prescriptive guidance on how to piece together all the various tools and workflows required to make data science work.

Getting Started

To get started, you can read about the steps here.
A template repository for projects following this process can be found here.

For more background on what this is and why you should consider adopting it, continue reading below.

Why did we create this and why should you adopt it?

Data Science is hard. Enterprise Data Science is harder. Consistently going from asking a question to delivering a solution is something very few teams, and even fewer enterprises, have figured out. There have been major steps forward in this space: the proliferation of DataOps and MLOps is standardizing practices for building robust data pipelines and model deployments, and tools like AzureML make parts of the data science process significantly simpler.

However, even with these important evolutions, we were still seeing huge gaps in data science teams across the board. Simply put, while many of these solutions solve key problems data science teams face, no one was answering the question: "How do we put all of this together into a single workflow?"

This meant everyone was solving these problems from scratch. The Data Science Lifecycle Process answers that question with an approach designed to be simple, lightweight, and effective.

Key Components

To address this need, we came up with workflows and prescriptive guidance in several key areas.

Team Workflows and Orchestration

We provide guidance and best practices on how to coordinate your data science teams, development teams, DevOps teams, and stakeholders. We illustrate how to move between the various phases of the process and make sure everyone stays on the same page.

As data scientists ourselves, we wanted to let data scientists keep working in the ways they are used to. At the same time, we wanted to give the data science process more rigor and make it easier to integrate with the rest of the organization, so that solutions can reach production more easily. Our approach is a bridge between the two.

Issue Templates

We've created a set of Issue Templates to bring consistency to your work and make sure there are clear next steps at each stage of the process. Read about Issue types here.

Branching Strategy

The typical GitHub Flow or Git Flow branching strategies are a great starting point, but they don’t lend themselves to the experimental nature of data science. We created a branching strategy that works for data science workflows while still being familiar to your development teams. Read about the branch types here.

Artifact Management

We provide guidance on how to track all of your various artifacts, including exploratory analyses, experiment logs, serialized models, and machine learning pipelines. A short, illustrative sketch follows the examples below.

Examples

  • Model registry
  • Experiment logging
  • Container registry
  • ML pipelines
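
As one illustration only (the DSLP does not mandate a specific tool), here is a minimal sketch of experiment logging plus model serialization and registration using MLflow. The experiment and model names are hypothetical, and registering a model this way assumes an MLflow tracking server with a registry-capable backend.

```python
# Minimal sketch (not part of the DSLP itself): experiment logging and model
# registration with MLflow. Names below are hypothetical placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

mlflow.set_experiment("demand-forecast-exploration")  # hypothetical experiment

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestRegressor(**params).fit(X_train, y_train)

    # Experiment logging: parameters and metrics for this run
    mlflow.log_params(params)
    mlflow.log_metric("mae", mean_absolute_error(y_test, model.predict(X_test)))

    # Serialized model plus a model-registry entry in one call
    # (registered_model_name requires a registry-capable tracking backend)
    mlflow.sklearn.log_model(model, "model", registered_model_name="demand-forecast")
```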

MLOps

There is a great body of work and set of tools for adopting MLOps. MLOps lets us apply DevOps practices to our training, deployment, monitoring, and retraining processes. We directly integrate the best MLOps practices into the Data Science Lifecycle Process so we can make the most of the developments in this area.

Examples

  • Versioning models
  • Logging and monitoring
  • Continuous deployment and rollback strategy
  • A/B testing
  • Automated retraining
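
To make items like monitoring and automated retraining concrete, here is a small, purely illustrative sketch of the kind of retraining gate those practices rely on. The metric names and thresholds are invented for this example and the DSLP does not prescribe them; in practice this check usually lives inside a scheduled pipeline in your MLOps tooling.

```python
# Illustrative sketch only: a retraining gate driven by monitoring signals.
# Metric names and thresholds below are hypothetical.
from dataclasses import dataclass


@dataclass
class MonitoringSnapshot:
    rolling_mae: float          # live model error over a recent window
    feature_drift_score: float  # e.g. a population stability index


def should_retrain(
    snapshot: MonitoringSnapshot,
    mae_threshold: float = 8.0,
    drift_threshold: float = 0.2,
) -> bool:
    """Return True when monitoring suggests the deployed model has degraded."""
    return (
        snapshot.rolling_mae > mae_threshold
        or snapshot.feature_drift_score > drift_threshold
    )


if __name__ == "__main__":
    snapshot = MonitoringSnapshot(rolling_mae=9.1, feature_drift_score=0.05)
    if should_retrain(snapshot):
        # In a real setup this would trigger your training pipeline
        # (e.g. an AzureML job or other CI/CD workflow) rather than print.
        print("Degradation detected: triggering retraining pipeline")
```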

You are of course free to use any component on its own, but we find it's most effective when you combine everything.

Design Goals

We designed the DSLP to be:

  • Lightweight and easy to adopt
  • Flexible and easy to adapt
  • Minimally opinionated

When we say minimally opinionated, we mean we tried to make as few assumptions as possible about:

  • Team and organization structure
  • Type of problem being solved
  • Technologies being used

Where we do make assumptions, it's because we think the benefits are high and universally applicable.

This means you can start adopting the DSLP today without major disruption to your team. You can also adopt it piecemeal to smooth out the learning curve (which is pretty flat to begin with). Ultimately, we hope that teams will take this as a baseline so they can focus on creating new value and responding to the needs of their own organizations, instead of spending so much time working on process.

