
Beyond CI/CD: How Continuous Hacking of Docker Containers and Pipeline Driven Security Keeps Ygrene Secure

Apr 25th, 2018 9:32am by Austin Adams and Zach Arnold

Austin Adams, Senior Software Engineer
Austin Adams is a passionately curious technologist who is people-oriented, fast-learning and excited to challenge tough problems with modern solutions. Able to lead and follow, he has management experience and knows how to see a project through all stages of execution. He loves to have fun, be with family and work in the community. He is a family man with dreams to have a mini farm and create a sustainable living environment.

Recently, we have been thinking a lot about how to up the ante on security in our organization. We hope we don't need to convince you of the need for software security; if we do, opening CNN.com in your favorite browser should do the trick.

We take security extremely seriously. Aside from the general guidelines put forth in the CISSP certification for all-around information security, we have automated infrastructure scans for compliance, automated penetration tests (and folks who love to run them manually), and we continuously monitor changelogs for the words “security update.” We have a great rhythm when it comes to infrastructure security today. However, it tends to be mostly reactive: our scanners detect an issue and we respond. As our security focus matured, our team felt the need for more assurance that when we launch an application, it does not betray all of the infrastructure-level work we are doing to stay secure. So we started working on a more powerful, preventative strategy and approach to application/container security.

We would like to coin the phrase “continuous hacking.” All the cool kids (and companies) are making up new slogans for the era of cloud-native computing, so we thought we would throw ours in there. We’re defining Continuous Hacking as both a set of tools and an ideal to which every company that retains sensitive data should aspire.

The implementation we came up with for Continuous Hacking is called pipeline driven security (PDS). We wanted to drive security adherence and adoption among our engineers, and the most logical place to check security (other than in code review) is the build/deploy pipeline. Our first iteration of this effort was a mish-mash of tools crammed into the CI pipeline. PDS started to show some value, but it was clear we weren’t organized with our tools or our goals.

As we refined our strategy, we began to adopt the STRIDE approach to security and apply it to our pipeline. We’ll break down the acronym and then introduce you to the tools we are using to break our containers before they ship. STRIDE stands for Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service and Escalation of Privilege. Each one is explored in detail below.

Zach Arnold, Software Engineer, Ygrene Energy Fund
Zach Arnold currently works for Ygrene Energy Fund as a Software Engineer spearheading the organization's adoption of Kubernetes for production workloads. He works with Austin on championing the microservice movement at Ygrene and is helping to establish information security best practices for the entire PACE financing industry. He is fascinated by all things IoT and distributed systems, and his personal hobby is building machine learning models to do all kinds of interesting learning tasks at Ygrene.

Spoofing happens when a malicious program or person pretends to be some trusted entity in order to gain access to sensitive data or otherwise compromise the confidentiality, integrity or availability of your information. Examples of where spoofing can take place are system resources, users, websites, authorities or even Docker containers. To combat this, we do the following things to prevent spoofing at build time:

  1. We run a script that accepts Dockerfiles whose FROM directives point to a smart whitelist of base images and rejects Dockerfiles that pull random, untrusted images. The whitelist is smart because it has a specific list of approved values but also has resolution strategies for when a value is not in the list; one of them is to check whether the base image is a Docker standard library image. We call this tool “Lineage,” and it’s soon to be open sourced. Follow us to get updates when that happens. (A minimal sketch of this kind of check appears after this list.)
  2. Within our build pipeline, we use Notary (a CNCF project) in the form of Docker Content Trust to ensure we are only pulling cryptographically signed base images. Using signed images doesn’t necessarily mean the base image is friendly, but it does guarantee that the publisher we expect made the image and that each layer is the one we intended to pull. When paired with Lineage, this gives firmer assurance that we are starting clean.
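
To make the first check concrete, here is a minimal sketch of a FROM-whitelist gate in the spirit of what Lineage does. The approved registry paths and the official-library fallback rule are illustrative assumptions, not Lineage’s actual implementation:

    import re
    import sys

    # Illustrative whitelist: in practice this would hold your approved base image names.
    APPROVED_BASES = {
        "ourregistry.example.com/base/ruby",
        "ourregistry.example.com/base/node",
    }

    def base_images(dockerfile_path):
        # Yield the image referenced by each FROM directive in the Dockerfile.
        with open(dockerfile_path) as f:
            for line in f:
                match = re.match(r"\s*FROM\s+(\S+)", line, re.IGNORECASE)
                if match:
                    yield match.group(1)

    def is_official_library_image(image):
        # Docker standard library images have no registry or namespace, e.g. "alpine:3.7".
        return "/" not in image

    def check(dockerfile_path):
        for image in base_images(dockerfile_path):
            name = image.split(":")[0]  # drop the tag for the whitelist lookup
            if name in APPROVED_BASES or is_official_library_image(image):
                continue
            print("Rejected untrusted base image: %s" % image)
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(check(sys.argv[1]))

A build step could run something like this against every Dockerfile in the repository and fail the pipeline on a non-zero exit code.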

Tampering is caused by an attacker maliciously changing some data to carry out an attack on a system. In Docker containers, this could occur if an attacker changed an image stored in a registry without changing its associated metadata. Alternatively, an attacker could tamper with the metadata of an associated tag for an image and as a result, the image pulled is not the expected one. Here is how we combat these particular forms of tampering:

  1. Using Docker Content Trust, our build pipeline cryptographically signs the metadata of every pushed image, so that when the image is pulled later, if its metadata doesn’t match the decrypted metadata from the Notary server, the image is rejected at runtime. By signing at build time, we ensure safe retrieval and execution.
  2. Currently, we only use Linux distributions in our containers whose package managers check the integrity of packages using built-in security features. Most package managers will do this. Think APT, YUM and RPM.
  3. When pulling third-party dependencies manually (both OS dependencies and code-level dependencies like NPM and Bundler), we make sure to code review each image, only allow resources to be retrieved over SSL, including git repos, and ensure the checksums are validated.

*We are planning to add a lint rule that will automatically check whether a checksum is validated at image build time. This would be possible by restricting wget and curl and only allowing a wrapper that takes the URL to fetch, the checksum to validate and, optionally, the hashing algorithm (a sketch of such a wrapper appears after this list). Follow us to get updates when this is released.

  4. We lint our Dockerfiles to ensure that we pull base images from a specific tag, not just the generic “latest” tag. Pulling base images from latest can lead to scenarios where you don’t really know what is being run inside your container. Great Dockerfile linters are hadolint and dockerfilelint.
  5. We use the Docker directive COPY, not ADD. This is another recommendation straight from the CIS benchmark guide for Docker. ADD can pull remote resources much like curl or wget, and we already established that we only want to do that when we can validate the checksum. Additionally, ADD allows you to reference URLs and performs automatic decompression of local archives, which you may or may not want. Without manually fetching resources, decompressing them and validating checksums, you run the risk of downloading malicious software or a version of software you didn’t intend to download. Our recommendation: if you are careful with ADD it’s fine to use, but it depends on how strict you want to be.
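
As a rough illustration of the wrapper described in the note above, here is a minimal sketch in Python; the function name, arguments and default algorithm are our assumptions for the example, not a released tool:

    import hashlib
    import sys
    import urllib.request

    def fetch_verified(url, expected_checksum, dest, algorithm="sha256"):
        # Download url to dest, keeping the file only if its digest matches expected_checksum.
        data = urllib.request.urlopen(url).read()
        digest = hashlib.new(algorithm, data).hexdigest()
        if digest != expected_checksum:
            raise RuntimeError("checksum mismatch for %s: got %s" % (url, digest))
        with open(dest, "wb") as f:
            f.write(data)

    if __name__ == "__main__":
        # usage: fetch_verified.py <url> <expected_checksum> <destination_path>
        fetch_verified(sys.argv[1], sys.argv[2], sys.argv[3])

Restricting wget and curl in the image and allowing only a wrapper like this makes “is the checksum validated?” a question the linter can answer mechanically.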

Repudiation occurs when a malicious entity does something and also removes the ability for others to prove that they did it. In Docker, build and push audit logs reveal who created which versions of which containers at what times; they are a crucial part of any remediation of a security issue. Within our build pipeline, we attempt to establish countermeasures in a few ways:

  1. Using a Docker registry with verbose audit logs.
  2. Protecting our build server’s run/audit logs, because they contain tons of information about who ran what job at what time.
  3. Limiting as much as possible any manual steps taken on production infrastructure. Manual steps remove the ability to programmatically or easily check who did what and when. If it can’t be versioned in git, we “git” suspicious.

Information disclosure occurs when an attacker does something to gain information the organization wants to keep private. This can also happen when an application is poorly designed and leaks sensitive information to users who should not have it (even users with no malice aforethought). We rebuff these tactics using the following countermeasures:

  1. We don’t allow images to be built whose Dockerfile specifies a sensitive host path as a volume mount; that is, we scan the Dockerfile for volume mounts like /proc or /. If a container like this were built and put into our Kubernetes cluster, there is an increased possibility that, if compromised, it could be used to expose information about the host and aid further penetration of our entire infrastructure.
  2. We squash our images (an experimental Docker feature). Sometimes during image construction you will need a private key or credentials to download the associated resources required; for example, private Ruby gems you have written. Unfortunately, when those keys or secrets are put into the container at build time, they are left hiding in the filesystem even though they aren’t needed at runtime. When we absolutely need something like this, we use the Docker directive COPY to bring it into the container, then after it has been used we use RUN rm … to remove it. When you follow that procedure with --squash added to the docker build command, the key that was just deleted will not end up in any layer of the final built image. This means the key or secret that was in the container previously is now permanently gone from all layers. Once you push the squashed image, it is free of those files you would rather keep secret.
  3. We do something fun with each of our microservices’ code. We run static code analysis on the codebase for known code-level security vulnerabilities. This is language specific, and the quality of the information will vary based on your technology choices, but it’s a great way to kick out code with clear anti-patterns for security. Here is a list of scanners from OWASP to choose from, and some other ones we love.
  4. We run automated dependency scanners to check that we are using the latest, most secure versions of our code dependencies. This is pretty standard, but it is worth mentioning if your organization isn’t there yet. It can be a pain to keep things up to date, but one way to get the “upgrade priority” you need to pull these tickets out of the backlog is by saying “it’s a security issue.” Again, OWASP is a lifesaver.
  5. The last and possibly most fun thing we do is attack our containers to test whether any new code has introduced vulnerabilities that automated penetration tests can exploit. This is where “continuous hacking” really comes into the mix. Basically, we spin up the service and its attached resources in a pseudo test mode, point automated penetration bots at the running containers and see what happens. If the scanners come back with warnings, we reject the build. Currently, we direct the scanners to the API-docs URL and expect the scanner to crawl whatever it can find and attack it. One area we would like to improve is being able to send the scanner a list of known paths so that specific areas can be tested for issues. A great tool that makes this possible is zaproxy (a minimal invocation sketch follows this list).
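
For a sense of what that last step can look like, here is a minimal sketch of a pipeline step that points OWASP ZAP’s baseline scanner at a running service and rejects the build on a non-zero exit code. The target URL, the host networking choice and the owasp/zap2docker-stable image are illustrative assumptions about the setup, not a description of our exact configuration:

    import subprocess
    import sys

    def zap_baseline_scan(target_url):
        # Run ZAP's baseline scan in a container against the target and return its exit code.
        # zap-baseline.py exits non-zero when it reports findings (exact behavior depends on flags).
        result = subprocess.run([
            "docker", "run", "--rm", "-t",
            "--network", "host",  # so the scanner can reach a service bound on the build host
            "owasp/zap2docker-stable", "zap-baseline.py",
            "-t", target_url,
        ])
        return result.returncode

    if __name__ == "__main__":
        # Hypothetical: the service spun up in pseudo test mode, exposing its API docs.
        target = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:8080/api-docs"
        code = zap_baseline_scan(target)
        if code != 0:
            print("ZAP baseline scan reported issues; rejecting the build.")
        sys.exit(code)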

A denial of service happens when an attacker takes an action that prevents legitimate traffic from proceeding under otherwise normal circumstances. In other words, one user or program can degrade the experience of all users. Within the context of building a Docker image with uptime in mind, a clear way to prevent DoS attacks is to remove vulnerabilities in the image, since security vulnerabilities of any kind can cause downtime when exploited. With that in mind, here is what we do to scan our images:

  1. In the pipeline, we use tools to scan for malware and vulnerable packages, tuned to reject images with high-severity vulnerabilities. Because these tools are still new where Docker is concerned, we use more than one: one runs within the build and prevents the pipeline from continuing, and another passively scans all images stored in the image registry. The build-time scanner lets us stop a bad container from deploying if a CVE is detected and published the same day; if an image was clean at the time of shipping but a CVE is published later, the registry scanner alerts us as the CVE databases are updated over time. The two working together give us robust visibility and prevention. Some tools we love are Dagda, ClamAV and Clair.
  2. We use the Docker directive HEALTHCHECK, and we use it in a specific way: to make sure the application’s base process is running correctly. Many orchestration platforms (like Kubernetes) already have some sort of network health check for the container. However, some containers don’t have open ports or incoming HTTP/gRPC connections, such as a background job processing container. Here we can help the orchestration tool by running commands that check that the process is still up or that the process is using an expected amount of memory (a sketch of such a check follows this list).
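
As an example of the kind of command a HEALTHCHECK directive could run inside a container with no open ports, here is a minimal sketch; the process name and the memory ceiling are hypothetical values for illustration:

    import os
    import sys

    PROCESS_NAME = "worker"   # hypothetical background job process
    MAX_RSS_KB = 512 * 1024   # hypothetical ceiling: 512 MB of resident memory

    def find_rss_kb(name):
        # Return the resident memory (in kB) of the first process whose cmdline contains name.
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/cmdline" % pid, "rb") as f:
                    cmdline = f.read().replace(b"\x00", b" ").decode(errors="replace")
                if name not in cmdline:
                    continue
                with open("/proc/%s/status" % pid) as f:
                    for line in f:
                        if line.startswith("VmRSS:"):
                            return int(line.split()[1])
            except OSError:
                continue  # the process exited while we were inspecting it
        return None

    if __name__ == "__main__":
        rss = find_rss_kb(PROCESS_NAME)
        if rss is None or rss > MAX_RSS_KB:
            sys.exit(1)  # process missing or over its memory budget: report unhealthy
        sys.exit(0)      # healthy

The Dockerfile would then declare something like HEALTHCHECK CMD python3 /healthcheck.py so the orchestrator can act on the exit code.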

Finally, escalation of privilege occurs when a user gains more access than the application creators intended. Since Docker containers are just like baby computers, we can have escalation issues in the same way. We try to prevent these issues by doing the following:

  1. We use linters and static code analysis to reject images whose Dockerfile leaves the root user as the one that will execute the program inside the container; conversely, we greenlight images that create a new user for the application runtime (a sketch of this kind of check appears after this list).
  2. We are very explicit about the packages inside the container. We run scripts that list everything inside the container on the build server. This forces the developers to see what’s inside their containers and can help them spot packages they don’t need. We don’t yet have tooling to enforce a minimal set of packages necessary for runtime (mostly because this impinges on a developer’s freedom to construct a service with dependencies however they see fit), but just increasing awareness has enough value to be viable in our scenario. An example of displaying all packages installed on an RPM-based distro: docker exec $INSTANCE_ID rpm -qa
  3. We also remove setuid and setgid privileges. This one is a little more advanced, but it comes straight from the Docker CIS benchmarks. Using a tool called docker-bench, we are able to see how many executables with setuid/setgid permissions exist and then remediate. A way to see those executables is by running docker run <Image_ID> find / -perm +6000 -type f -exec ls -ld {} \; 2> /dev/null
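
To illustrate the first check, here is a minimal sketch of a lint rule that rejects a Dockerfile whose final USER directive is missing or still root; treating a bare “0” as root is our assumption for the example:

    import re
    import sys

    def final_user(dockerfile_path):
        # Return the value of the last USER directive in the Dockerfile, or None if absent.
        user = None
        with open(dockerfile_path) as f:
            for line in f:
                match = re.match(r"\s*USER\s+(\S+)", line, re.IGNORECASE)
                if match:
                    user = match.group(1)
        return user

    def check(dockerfile_path):
        user = final_user(dockerfile_path)
        if user is None or user in ("root", "0"):
            print("Rejected: the container process would run as root (no non-root USER directive).")
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(check(sys.argv[1]))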

And with that, we have completed a STRIDE analysis of how to improve the security of Docker containers. We know what you’re saying: “Whoa, that’s quite a lot of work to do.” It is, but if you imagine that you are a hacker who gets to hack a Docker container at every commit, it might motivate you. Docker security is still very new (because so is Docker!) and there is a lot of room to innovate. As you can see, one way we are trying to innovate in that area is by “continuously hacking” our images. Happy Hacking.

Have you come up with something novel that we missed? Please share with us! We want to learn and grow too!

Austin and Zach will be speaking on “Good Enough for the Finance Industry: Achieving High Security at Scale with Microservices in Kubernetes” at KubeCon + CloudNativeCon EU, May 2-4, 2018 in Copenhagen, Denmark.

This post was contributed by Ygrene on behalf of KubeCon + CloudNativeCon Europe, a sponsor of The New Stack.

Feature image via Pixabay.

TNS owner Insight Partners is an investor in: Docker, Kubernetes.