The Impact of Spectre and Meltdown on the Cloud

Feb 8th, 2018 10:18am by Craig McLuckie

Craig is the founder and CEO of Heptio, a startup focused on making Kubernetes accessible to the enterprise living in a multi-cloud world. Prior to starting Heptio, Craig was a product manager at Google where he founded the Kubernetes project and worked with the industry to create the Cloud Native Computing Foundation that he also chaired.

The new year got off to an interesting start with the publication of two new security vulnerabilities: Spectre and Meltdown. Recent research revealed that nearly every computer processor manufactured in the last 20 years contains these fundamental security issues, and they represent a class of vulnerability that the world has not seen before in a widely exploitable form. IT and security professionals at companies large and small are working hard to understand what this means for their business. The issues are pretty esoteric in nature, but represent a very real threat.

This article will provide some basic framing for the issues. The focus here isn’t on the threats themselves, so much as what this new class of security vulnerability means, and how to start thinking about it as it pertains to your business, and your hosting environment in particular. It is safe to say that some of the greatest long-term impacts may be on cloud computing architectures, costs and security practices — yet it’s an aspect of the threat that few are discussing openly thus far.

Background on Meltdown and Spectre Issues

The tech industry has been talking about processor “side-channel” attacks for years. This refers to the — previously mostly theoretical — idea that you could write a piece of perfectly secure code, but that someone else with bad intentions could exploit problems with the underlying processor running the code to access your information.

We now live in this world. Meltdown showed a practical way to exploit problems with the underlying processor architecture, and Spectre provided a second path to potentially do the same.

Both of these exploits rely on the incredibly smart optimization algorithms used by modern processors to do sneaky things that they perhaps should not be able to do. The exploits rely on the almost “clairvoyant” attempt by these algorithms to predict what is going to happen next. From the standpoint of the processor, this has the potential to run your code faster; for the exploit, it allows access to things that they should not be able to access.
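To make the mechanism concrete, here is a toy model written in plain Python. It is not an exploit and does not model any real processor. It only mimics the pattern described above: a bounds-checked read that is run "early" on a guess, leaving a footprint in a cache that an attacker can later observe through timing. Every name in it (ToyCPU, victim_read, probe_is_fast, leak_byte) is hypothetical, and the "cache" is just a Python set standing in for fast versus slow memory accesses.

```python
# Toy model only: this simulates the idea in plain Python and is not an exploit.
# ToyCPU, victim_read, probe_is_fast and leak_byte are hypothetical names.

class ToyCPU:
    def __init__(self, memory: bytes, public_len: int):
        self.memory = memory          # public bytes followed by secret bytes
        self.public_len = public_len  # indexes below this are legal to read
        self.cache = set()            # probe lines currently "cached"

    def flush(self) -> None:
        """Attacker evicts every probe line before each attempt."""
        self.cache.clear()

    def victim_read(self, index: int):
        """Bounds-checked read with a modeled speculative side effect."""
        # Modeled speculation: the CPU guesses the check will pass and runs
        # the body early, touching a probe line selected by the byte it read.
        if 0 <= index < len(self.memory):
            self.cache.add(self.memory[index])
        # Architectural result: the check is enforced, so an out-of-range
        # index returns nothing to the caller.
        if 0 <= index < self.public_len:
            return self.memory[index]
        return None

    def probe_is_fast(self, line: int) -> bool:
        """Stand-in for timing a memory access: cached lines are 'fast'."""
        return line in self.cache


def leak_byte(cpu: ToyCPU, index: int) -> int:
    """Recover one byte by observing which probe line became fast."""
    cpu.flush()
    cpu.victim_read(index)  # the return value is ignored on purpose
    fast = [line for line in range(256) if cpu.probe_is_fast(line)]
    return fast[0]


public, secret = b"hello", b"key!"
cpu = ToyCPU(public + secret, len(public))

# The program itself refuses the out-of-range read...
print(cpu.victim_read(len(public)))
# ...yet the cache footprint still reveals the bytes one at a time.
print(bytes(leak_byte(cpu, len(public) + i) for i in range(len(secret))))
```

The point of the sketch is that the code is "correct" in the ordinary sense, since the bounds check is always honored in the returned value, while the information still escapes through a side effect the program never intended to expose.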

In the case of Meltdown, the only way to address this is to aggressively “detune” code and force the processor to behave correctly. In the case of Spectre, we don’t yet know how to mitigate it.

So What Is the Big Deal?

Meltdown is pretty bad, but we understand it and there are mitigations in place. By changing the operating system, we can create more determinism in how things run, and make sure that the vulnerability doesn’t exist. However, no one is happy with the fact that the mitigations can have significant performance implications. Actual mileage will vary, but in some really bad cases it could slow running code by something like 30 percent or worse. Regardless, the point is that the exploit is understood, and mitigations are in place. It certainly feels like another Heartbleed-level issue, and in many ways it is, but business is adaptable and we will figure this one out.

Spectre is worse. Not because we actually know how to use it to do something bad. Yet. It is worse because we don’t know how to mitigate it in a generic way, and because it proves that Meltdown wasn’t a flash in the pan. The threat remains, and erodes trust. When and if someone devises a practical exploit, we may indeed see a mitigation for it show up quickly. The exploit may be responsibly disclosed by security researchers at a company like Google. But it is entirely possible that hostile state actors will get there first, and the damages done between now and then may be considerable and difficult to quantify.

As someone who has worked in this industry for a while, this feels like a watershed moment. The implications of Spectre aren’t reminiscent of a Heartbleed-style incident. It feels a lot more like what happened when Edward Snowden disclosed the nature and shape of government surveillance. It forced smart people to really step back and rethink things. It eroded trust in a system.

How Spectre and Meltdown Impact Cloud Computing

There are myriad places where this is problematic. These challenges exist not just in Intel processors that we know and love, but across the spectrum of different processor architectures. The implications on mobile and client device security will likely be profound. However, the domain I live in is the nuts and bolts, ‘behind the scenes’ world of IT infrastructure, and the biggest place where these exploits will be challenging is in cloud computing.

Cloud is by its nature multitenant (meaning the value is in sharing the same resources with other people), and this is likely to represent the primary source of cognitive dissonance in that world.

Impacting the Intrinsic Elasticity of Cloud Computing. Much of the cloud’s value proposition is grounded in the shared nature of cloud resources, and the ability to use those resources on demand. It allows organizations to operate far more dynamically. Infrastructure has moved from a fixed asset, to a dynamic utility; it isn’t accidental that Amazon’s leading compute product is called “Elastic Compute Cloud.” Transitioning to a world where customers require hard lines of isolation from potentially hostile actors will be challenging indeed.

As a result of increased sensitivity around security, we may well see the basic unit of consumption for many enterprises move from a “virtual machine” to a “full server.” Virtual machines will still be accessible and the primary way to run workloads and apportion the work that a single provider consumes, but the scaling unit will become far more coarse-grained.

Some cloud products already make allowances for this today for highly sensitive or regulated workloads, but this may well become the norm in the side-channel attack-aware world. This will put pressure on the operating efficiencies of cloud providers as they work to restructure their products to meet these needs. It may well drive costs up over time as overall utilization goes down.
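The link between utilization and cost can be shown with a little arithmetic. The sketch below uses entirely hypothetical numbers (a server costing $2.00 per hour with 64 virtual CPUs, at 80 percent versus 50 percent utilization); none of them come from a real provider's pricing. It simply divides the fixed hourly cost of a server by the capacity that is actually being used.

```python
# Hypothetical numbers throughout; none come from a real provider's pricing.

def cost_per_used_vcpu_hour(server_cost_per_hour: float,
                            vcpus_per_server: int,
                            utilization: float) -> float:
    """Fixed hourly server cost spread over the vCPUs actually in use."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return server_cost_per_hour / (vcpus_per_server * utilization)


# Assumed scenario: the same server, packed with many tenants versus
# reserved whole for one customer who cannot fill it.
shared = cost_per_used_vcpu_hour(2.00, 64, 0.80)
dedicated = cost_per_used_vcpu_hour(2.00, 64, 0.50)

print(f"shared:    ${shared:.4f} per used vCPU-hour")
print(f"dedicated: ${dedicated:.4f} per used vCPU-hour")
print(f"ratio:     {dedicated / shared:.2f}x")
```

Under these assumptions the hardware is unchanged, yet each unit of useful work costs noticeably more once the server can no longer be shared, and someone (provider or customer) has to absorb that difference.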

Increased barriers to entry for cloud providers. If anything, we are likely to see more consolidation around the “big three” cloud providers going forward. It will be increasingly challenging to remain nimble in the face of concern over where virtual machines are running, while managing the greater complexity of provisioning full servers. Smaller cloud players, either operating in certain regions, or focusing on niche applications, will struggle to make the same adjustments and could see trust in their offerings erode.

While in aggregate the emergence of side-channel attacks may chill enterprise enthusiasm for cloud to some extent, it may also raise the barriers to entry for aspirants and entrench the positions of the “big three” cloud providers.

This may be even more challenging at the network edge, where the overall resource pool is smaller. There is tremendous potential to run compute workloads even closer to users in smaller, local facilities, and care will need to be taken in those environments to ensure adequate isolation of workloads.

Code-level protection. Given the nature of these exploits, a compiled binary (a block of machine code that is rendered down to its lowest possible form) becomes a scary thing. It is relatively opaque, and challenging for a cloud provider to vet for exploits. Higher-level services that include the compilation process, turning code written by a developer into something that a machine can read, offer ways for organizations to make sure that nothing untoward happens.

Services such as Amazon’s Lambda or Azure’s Functions could be used to layer in additional security measures. By looking at the code, and by controlling the compilation process, these services could be made more intrinsically secure and could retain the flexibility of cloud. We may ironically see more emphasis on these particular types of highly multi-tenant services as a result of exploits like Spectre and Meltdown because they could be made intrinsically safer than running an opaque binary in an uncontrolled environment.

Increased scrutiny on cloud provider implementations. Another effect of the emergence of these security vulnerabilities will be increased scrutiny by forward-leaning organizations on how exactly the cloud providers run themselves. It isn’t only adjacent customer workloads that are vulnerable through these exploits if they share a processor, but all the other things that run on the same servers to stitch them into the cloud provider’s infrastructure. This may well increase demand on cloud providers for additional audit and review, and this may further slow or chill the rate at which cloud grows.

Looking Ahead

It will be interesting indeed to see what the future holds with respect to side-channel attacks. Now that the spotlight has been shone on this new class of exploit, it would be safe to guess that security researchers will use ever more powerful tools and capabilities to uncover similar threats.

My advice to enterprises is to continue their journey to the cloud, but keep the option open to run on-premises if needed. Focus on avoiding deep levels of provider-specific lock-in for now, since we don’t yet fully know where things are going to land, and where possible start with less sensitive workloads (which is always sensible). A big part of this is going to be really understanding the security sensitivity of workloads and thinking deliberately about risk tolerances. A one-size-fits-all plan may not be optimal.
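One way to think deliberately about this is to write the policy down. The sketch below is purely illustrative: the three sensitivity tiers, the placement labels and the example workloads are all hypothetical, and a real organization would define its own. It shows a per-workload placement rule rather than a single plan for everything, and an ordering that moves the least sensitive workloads first.

```python
# Illustrative sketch: the tiers, placements and workloads are hypothetical.
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


# Assumed policy: tighter isolation as sensitivity rises.
PLACEMENT = {
    Sensitivity.LOW: "shared virtual machine",
    Sensitivity.MODERATE: "dedicated full server",
    Sensitivity.HIGH: "on-premises",
}


@dataclass(frozen=True)
class Workload:
    name: str
    sensitivity: Sensitivity


def place(workload: Workload) -> str:
    """Look up where a workload should run under the assumed policy."""
    return PLACEMENT[workload.sensitivity]


def migration_order(workloads: list[Workload]) -> list[Workload]:
    """Least sensitive workloads go first."""
    return sorted(workloads, key=lambda w: w.sensitivity)


workloads = [
    Workload("payments-ledger", Sensitivity.HIGH),
    Workload("marketing-site", Sensitivity.LOW),
    Workload("internal-reporting", Sensitivity.MODERATE),
]

for w in migration_order(workloads):
    print(f"{w.name}: {place(w)}")
```

Keeping the rule in one small table also makes it easy to revise as the industry's understanding of these threats changes.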

I would also advocate for a healthy level of skepticism toward both those crying that the sky is falling and those saying everything is fine. It will take time for the industry to really understand the implications of this new trend. Those implications could well be significant, but we will have to educate ourselves and take the time to really understand the threat and the impact.

Feature image by Johannes Plenio on Unsplash.