The History of the Service Mesh

Feb 13th, 2018 12:05pm by William Morgan

Feature image: Twitter’s service architecture circa 2013.

William Morgan, CEO and co-founder of Buoyant

William is the co-founder and CEO of Buoyant, the creator of the open source service mesh projects Conduit and Linkerd. Prior to Buoyant, he was an infrastructure engineer at Twitter, where he helped move Twitter from a failing monolithic Ruby on Rails app to a highly distributed, fault-tolerant microservice architecture. He was a software engineer at Powerset, Microsoft, and Adap.tv, a research scientist at MITRE, and holds an MS in computer science from Stanford University.

The idea of a service mesh is still fairly new to most people, so it may seem a little funny to already be talking about its history. But at this point Linkerd has been running in production at companies around the world for over 18 months (an eternity in the cloud-native ecosystem), and we can trace its conceptual lineage back to developments at web-scale companies in the early 2010s. So there’s certainly a history to explore and understand.

Before we dive in, though, let’s stay in the present a bit longer. What is a service mesh, and why is it suddenly a hot topic?

A service mesh is a software infrastructure layer for controlling and monitoring internal, service-to-service traffic in microservices applications. It typically takes the form of a “data plane” of network proxies deployed alongside application code, and a “control plane” for interacting with these proxies. In this model, developers (“service owners”) are blissfully unaware of the existence of the service mesh, while operators (“platform engineers”) are granted a new set of tools for ensuring reliability, security and visibility.

What about service mesh and Kubernetes? Why is this suddenly so interesting? The short story is that, for many companies, tools like Docker and Kubernetes have “solved deploys” — at least to a first approximation. But they haven’t solved runtime. This is where the service mesh comes in.

What does “solving deploys” mean? Using things like Docker and Kubernetes dramatically reduces the incremental operational burden of deployment. With these tools, deploying 100 apps or services is no longer 100 times the work of deploying a single app. That’s a huge step forward, and for many companies it results in a dramatic reduction in the cost of adopting microservices. This is possible not just because Docker and Kubernetes provide powerful abstractions at all the right levels, but because they standardize the patterns for packaging and deployment across the entire organization.

But once the apps are running — what then? After all, deployment is not the final step to production; the app still has to run. So the question becomes: Can we standardize the runtime operations of our applications in the same way that Docker and Kubernetes standardized deploy-time ops?

To answer this, we turn to the service mesh. At its heart, the service mesh provides uniform, global ways to both control and measure all request traffic between apps or services (in datacenter parlance, the “east-west” traffic). For companies that have adopted microservices, this request traffic plays a critical role in runtime behavior. Because services work by responding to incoming requests and issuing outgoing requests, the flow of requests becomes a critical determining factor of how the application behaves at runtime. Thus, standardizing the management of this traffic becomes a tool for standardizing the application’s runtime.

By providing APIs to analyze and operate on this traffic, the service mesh provides a standardized mechanism for runtime operations across the organization — including ways to ensure reliability, security and visibility. And like any good infrastructure layer, the service mesh (ideally!) works independent of how the service was built.

How Is Service Mesh Formed?

So where did the service mesh come from? By doing some software archeology, we find that the core features the service mesh provides — things like request-level load balancing, circuit-breaking, retries, instrumentation — are not fundamentally new features. Instead, the service mesh is ultimately a repackaging of functionality; a shift in where, not what.
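To make one of these features concrete, here is a minimal circuit-breaker sketch in Python. It is an illustration only, not code from Linkerd or any other project: the names `CircuitBreaker`, `CircuitOpenError`, `max_failures` and `reset_after` are assumptions made up for this example. After a run of consecutive failures the breaker stops calling the struggling service and fails fast, then allows a single trial call once a cool-down has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical breaker: opens after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # cool-down in seconds
        self.clock = clock               # injectable so behavior can be tested
        self.failures = 0                # consecutive failures seen so far
        self.opened_at = None            # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # (re)open the circuit
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for whatever function actually makes the request. Whether this logic lives in a library or in a proxy is exactly the “where, not what” question.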

The origins of the service mesh start with the rise of the three-tiered model of application architecture circa 2010 — a simple architecture that, for a time, powered the vast majority of applications on the web. In this model, request traffic plays a role (there are two hops!), but it is very specialized in nature. Application traffic is first handled by a “web tier,” which in turn talks to an “app tier,” which in turn talks to a “database tier.” The web servers in the web tier are designed to handle high volumes of incoming requests very rapidly, handing them off carefully to relatively slow app servers. (Apache, NGINX and other popular web servers all have very sophisticated logic for handling this situation.) Likewise, the app layer uses database libraries to communicate with the backing stores. These libraries typically handle caching, load balancing, routing, flow control, etc., in a way that’s optimized for this use case.

So far so good, but this model starts to break down under heavy load, especially at the app layer, which can become quite large over time. Early web-scale companies (Google, Facebook, Netflix, Twitter) learned to break the monolith apart into lots of independently running pieces, spawning the rise of microservices. The moment microservices were introduced, east-west traffic was also introduced. In this world, communication was no longer specialized; it flowed between all services. And when it went wrong, the site went down.

These companies all responded in a similar way: they wrote “fat client” libraries to handle request traffic. These libraries (Stubby at Google, Hystrix at Netflix, Finagle at Twitter) provided a uniform approach to runtime operations across all services. Developers or service owners would use these libraries to make requests to other services, and under the hood, the libraries would handle load balancing, routing, circuit breaking and telemetry. By providing uniform behavior, visibility, and points of control across every service in the application, these libraries formed what were arguably the very first service meshes, just without the fancy name.
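As a rough illustration of the fat-client idea, the Python sketch below wraps every outbound call with round-robin load balancing, retries and a few counters. It does not reflect the actual APIs of Stubby, Hystrix or Finagle; `FatClient`, `send`, `max_attempts` and the metric names are all hypothetical, and the transport is passed in as a plain function so the example stays self-contained.

```python
import itertools
from collections import Counter


class FatClient:
    """Hypothetical client library: balancing, retries and counters in-process."""

    def __init__(self, endpoints, send, max_attempts=3):
        endpoints = list(endpoints)
        if not endpoints or max_attempts < 1:
            raise ValueError("need at least one endpoint and one attempt")
        self.send = send                               # send(endpoint, payload) -> response
        self.max_attempts = max_attempts
        self._endpoints = itertools.cycle(endpoints)   # round-robin iterator
        self.metrics = Counter()                       # minimal telemetry

    def request(self, payload):
        last_error = None
        for _ in range(self.max_attempts):
            endpoint = next(self._endpoints)           # next replica in turn
            self.metrics["attempts"] += 1
            try:
                response = self.send(endpoint, payload)
            except ConnectionError as err:
                self.metrics["failures"] += 1
                last_error = err
                continue                               # retry on the next replica
            self.metrics["successes"] += 1
            return response
        raise last_error                               # every attempt failed
```

The service owner only calls `client.request(...)`; the balancing, retry and measurement behavior comes along for free. The catch is that this logic is linked into every service, so a library like this has to exist in, and be upgraded for, every language and every deployment that uses it.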

The Rise of the Proxy

Fast forward to the modern, cloud-native world. These libraries still exist, of course. But libraries are significantly less appealing in light of the operational convenience provided by out-of-process proxies — especially when combined with the dramatic drop in deployment complexity made possible by the advent of containers and orchestrators.

Proxies sidestep many of the downsides of libraries. For example, when a library changes, these changes must be deployed across every service, a process which often entails a complex organizational dance. Proxies, by contrast, can be upgraded without recompiling and redeploying every application. Likewise, proxies allow for polyglot systems, in which applications are composed of services written in different languages, an approach that is prohibitively expensive with libraries, since each library must be reimplemented for every language.

Perhaps most importantly for larger organizations, implementing the service mesh in proxies rather than libraries moves responsibility for providing the functionality necessary for runtime operations out of the hands of the service owners, and into the hands of those who are the end consumers of this functionality — the platform engineering team. This alignment of provider and consumer allows these teams to own their own destiny, and decouples complex dependencies between dev and ops.

These factors have all contributed to the rise of the service mesh as a means to bring sanity to runtime operations. By deploying a distributed “mesh” of proxies which can be maintained as part of the underlying infrastructure and not the application itself, and by providing centralized APIs to analyze and operate on this traffic, the service mesh provides a standardized mechanism for runtime operations across the organization — including ways to ensure reliability, security, and visibility.

Back to the Future

So what’s next for the service mesh? At this point, having spent 18+ months helping organizations adopt Linkerd, we’ve learned quite a few things about what can go right — and wrong — when running mission-critical, cloud-native applications. In the next article, I’ll explore a few concrete examples, and describe what led to the development of a whole new service mesh project designed specifically for Kubernetes: Conduit.