Application Pauses When Running JVM Inside Linux Control Groups

Linux cgroups-based solutions (e.g., Docker, CoreOS) are increasingly being used to host multiple applications on the same host. We have been using cgroups at LinkedIn to build our own containerization product called LPS (LinkedIn Platform as a Service) and to investigate the impact of resource-limiting policies on application performance. This post presents our findings on how CPU scheduling affects the performance of Java applications in cgroups. We found that Java applications can experience more frequent and longer application pauses when using CFS (Completely Fair Scheduler) in conjunction with CFS Bandwidth Control quotas. During these pauses, the application is not responding to user requests, so this is a severe performance problem that we need to understand and address.

These increased pauses are caused by the interactions between the JVM’s GC (Garbage Collection) mechanisms and CFS scheduling. In CFS, a cgroup is assigned a certain CPU quota (i.e., cfs_quota), which can be quickly drained by the JVM’s multi-threaded GC activity, causing the application to be throttled. For example, the following may occur:

  1. If an application aggressively uses its CPU quota in a scheduling period, then the application is throttled (no more CPU is given) and stops responding for the remaining duration of the scheduling period.
  2. The multi-threaded JVM GC makes the problem much worse, because the cfs_quota is counted across all threads of the application, so the CPU quota can be used up even more quickly. JVM GC also has many concurrent phases that are non-STW (stop the world); however, running them consumes cfs_quota as well, practically making the entire application STW.

In this post, we’ll share our findings after investigating this issue and our recommendations about CFS/JVM tuning to mitigate the negative impact. Specifically:

  1. Sufficient CPU quota should be assigned to the cgroup that hosts the Java application; and,
  2. JVM GC threads should be appropriately tuned down to mitigate the pauses.

Linux cgroups background

Linux cgroups (Control Groups) are used to limit various types of resource usage by applications. For the CPU resource, the CPU subsystem schedules CPU access to cgroups, and CFS is one of the two supported schedulers. CFS is a proportional share scheduler that allocates CPU access to cgroups based on the cgroups’ weights.

For the RHEL7 (Red Hat Enterprise Linux 7) machine we use, there are multiple tunables. Two CFS tunables are used for ceiling enforcement, that is, for limiting the CPU resources used by cgroups: cpu.cfs_period_us and cpu.cfs_quota_us, both in microseconds. cpu.cfs_period_us specifies the CFS period, the enforcement interval for which a quota is applied, and cpu.cfs_quota_us specifies the quota the cgroup can use in each CFS period. cpu.cfs_quota_us essentially sets the hard limit (i.e., ceiling) on the CPU resource: a cgroup (along with its processes) is only allowed to occupy the CPU cores for the duration specified in cpu.cfs_quota_us within each period. Hence, to give a cgroup N cores, cpu.cfs_quota_us is set to N times cpu.cfs_period_us.
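
As a concrete illustration, the following sketch sets such a ceiling by writing the quota file. The cgroup v1 mount point /sys/fs/cgroup/cpu and the cgroup name “myapp” are assumptions for illustration; writing these files requires appropriate privileges and Java 11+ for Files.readString/writeString.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class SetCfsQuota {
        public static void main(String[] args) throws Exception {
            // Hypothetical cgroup v1 path; adjust to your mount point and cgroup name.
            Path cgroup = Paths.get("/sys/fs/cgroup/cpu/myapp");
            long periodUs = Long.parseLong(
                    Files.readString(cgroup.resolve("cpu.cfs_period_us")).trim());
            int cores = 3;  // desired CPU ceiling, expressed in cores
            // To allow N cores, the quota must be N times the CFS period.
            long quotaUs = (long) cores * periodUs;
            Files.writeString(cgroup.resolve("cpu.cfs_quota_us"), Long.toString(quotaUs));
            System.out.printf("period=%dus, quota=%dus (%d cores)%n", periodUs, quotaUs, cores);
        }
    }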

Workload and setup

For our analysis, we created a synthetic Java application for testing CFS behavior. This Java application simply keeps allocating objects on the Java heap. After the number of allocated objects reaches a certain threshold, a portion of them are released. There are two application threads, each independently performing object allocation and object release. The time taken by each object allocation is recorded as the allocation latency. The source code for this synthetic Java application is on GitHub.
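
The real implementation is in the GitHub repository; the sketch below only illustrates the structure described above (the class name, object size, and threshold are our own hypothetical choices, not taken from that code).

    import java.util.ArrayList;
    import java.util.List;

    public class AllocationWorkload implements Runnable {
        static final int THRESHOLD = 1_000_000;  // hypothetical object-count threshold

        @Override
        public void run() {
            List<byte[]> objects = new ArrayList<>();
            while (!Thread.currentThread().isInterrupted()) {
                long start = System.nanoTime();
                objects.add(new byte[128]);                  // allocate an object on the heap
                long latencyNs = System.nanoTime() - start;  // allocation latency
                if (latencyNs > 50_000_000L) {
                    System.out.println("large allocation latency: " + latencyNs / 1_000_000 + "ms");
                }
                // Once the threshold is reached, release a portion of the objects.
                if (objects.size() >= THRESHOLD) {
                    objects.subList(0, THRESHOLD / 2).clear();
                }
            }
        }

        public static void main(String[] args) {
            // Two application threads, each allocating and releasing independently.
            new Thread(new AllocationWorkload()).start();
            new Thread(new AllocationWorkload()).start();
        }
    }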

The performance metrics we considered include: (1) application throughput (the object allocation rate); (2) object allocation latencies; (3) the cgroup’s statistics, including the cgroup’s CPU usage, nr_throttled (number of CFS periods that are throttled), and throttled_time (total throttled time).
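
nr_throttled and throttled_time are exposed by the cgroup’s cpu.stat file (on the cgroup v1 cpu controller); a minimal sampler like the sketch below (again assuming a hypothetical /sys/fs/cgroup/cpu/myapp cgroup) is enough to record them once per second.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class CpuStatSampler {
        public static void main(String[] args) throws Exception {
            Path stat = Paths.get("/sys/fs/cgroup/cpu/myapp/cpu.stat");
            long prevThrottledNs = 0;
            while (true) {
                long throttledNs = 0, nrThrottled = 0;
                for (String line : Files.readAllLines(stat)) {
                    String[] kv = line.split("\\s+");
                    if (kv[0].equals("throttled_time")) throttledNs = Long.parseLong(kv[1]);
                    if (kv[0].equals("nr_throttled"))   nrThrottled = Long.parseLong(kv[1]);
                }
                // The per-second increase in throttled_time shows exactly when throttling occurs.
                System.out.printf("nr_throttled=%d, throttled_time delta=%dms%n",
                        nrThrottled, (throttledNs - prevThrottledNs) / 1_000_000);
                prevThrottledNs = throttledNs;
                Thread.sleep(1000);
            }
        }
    }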

The machine we used runs RHEL7 with the 3.10.0-327 kernel and has 24 HT (hyper-threading) cores. The CPU resources were limited using CFS ceiling enforcement. The cgroup hosting the Java application was by default assigned a quota of three CPU cores, considering that there were two application threads plus GC activity. In later tests we also varied the number of cores assigned in order to gain additional insights. The cfs_period by default was 100ms. Each run of the workload took 20 minutes (1,200 seconds), so with a cfs_period of 100ms, there were 12,000 CFS periods in each run.

Investigation of a large application pause

We’ll start with a detailed analysis of a particular application pause in order to shed light on the reasons behind the pause.

Application stop
At 22:57:34, both application threads stopped for about three seconds (i.e., 2,917ms and 2,916ms).

JVM GC STW
To understand what caused the three-second application freeze, we first examined the JVM GC log. We found that at 22:57:37.771, a STW (stop the world) GC pause occurred, lasting about 0.12 seconds.

The 0.12-second GC pause accounts for only a small part of the three-second application stop; the remaining 2.88 seconds (i.e., 3 - 0.12) must have some other cause.

CFS throttling
We suspected that the extra application pause was caused by the CFS scheduler throttling the cgroup, so we gathered the various cgroup statistics reported for every second the application was running. The metric “throttled_time” turned out to be of particular interest: it reports the accumulated total time (in nanoseconds) for which entities of the group have been throttled.

We noticed that “throttled_time” started increasing at 22:57:33, while the application was frozen. Over the frozen period, the increase (i.e., difference) in “throttled_time” was about 5.28 seconds. We believe this throttling contributed directly to the application freeze.

JVM GC threads
We found that some CFS scheduling periods incur substantial “throttled_time,” which we believe is caused by the (multiple) JVM GC threads. Briefly, when GC starts, the JVM launches multiple GC threads to do the work, using internal formulas to decide how many. Specifically, on a machine with 24 cores, the JVM uses 18 parallel GC threads and 5 concurrent GC threads. With this many threads, the cgroup’s CPU quota is quickly used up, causing all of the application’s threads (including the GC threads) to be paused.
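
The exact heuristic varies across JVM versions and collectors, but for the HotSpot JVM it is roughly the rule sketched below (our own rendering of the sizing rule, not the JVM’s actual code; ConcGCThreads in particular is derived differently by different collectors).

    public class GcThreadHeuristic {
        // Roughly HotSpot's default: all cores up to 8, then 5/8 of each additional core.
        static int parallelGcThreads(int ncpus) {
            return ncpus <= 8 ? ncpus : 8 + (ncpus - 8) * 5 / 8;
        }

        // Concurrent GC threads are derived from the parallel count;
        // this variant matches the 18/5 split we observed on 24 cores.
        static int concGcThreads(int parallelThreads) {
            return (parallelThreads + 3) / 4;
        }

        public static void main(String[] args) {
            int ncpus = 24;                           // cores seen by the JVM
            int parallel = parallelGcThreads(ncpus);  // 18 on a 24-core host
            System.out.println("ParallelGCThreads=" + parallel
                    + ", ConcGCThreads=" + concGcThreads(parallel));  // 5
        }
    }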

How does the CFS scheduler cause application pauses?

The CFS scheduler can lead to long application pauses: there are scenarios in which the cgroup (and the application running inside it) is throttled in such a way that the application pauses for a long time. Although the brief answer, that GC threads use up the CFS quota (cpu.cfs_quota_us) more quickly, is straightforward to give, we would like to understand the issue more thoroughly before proposing solutions.

There are three levels of issues with regard to application pauses when using the CFS scheduler; we will explain them one by one. To better illustrate the issues, we use an example with concrete values (e.g., cfs_period and cfs_quota).

Ideal scenario (expected scenario)
Let’s assume the cfs_period is 300ms and the cgroup (with a single application) has a cfs_quota of 90ms. We’ll also assume the application only has one active thread to simplify the presentation.

Ideally, the CPU scheduler schedules the application to run sparsely within each CFS period, so the application does not pause for long. As shown in the following diagram, the application is scheduled to run three times in a 300ms CFS period. There are application pauses between the scheduled runs, and each expected pause is 70ms (assuming the application fully uses its 90ms CPU quota).

[Figure cgroups21: the ideal scenario, where the application’s three runs are spread across the 300ms CFS period with 70ms pauses in between]

Bad scenario for both Java and non-Java applications
The first issue occurs when the application uses up its entire 90ms CPU quota early, say during the first 90ms of a CFS period. Because the quota is exhausted, the application is paused for the remaining 210ms, and users experience a 210ms latency. Note that the issue is even more severe for multi-threaded applications, since the CPU quota can be used up even more quickly.

[Figure cgroups22: the quota is exhausted early in the CFS period, so the application is throttled for the remaining 210ms]

Bad scenario for Java applications (STW phases during GC)
This issue is more severe for Java applications during STW (stop the world) GC pauses, as the JVM can use multiple GC threads to collect garbage in parallel.

Let’s assume that during some CFS period, after the application runs for 30ms, a STW GC needs to be done. We’ll also assume that the GC work requires 60ms of CPU and that the JVM has 4 GC threads. Then the entire 90ms CPU quota can be eaten up within just 45ms of wall-clock time: the application consumes 30ms of CPU during its “Run” phase, and the 4 GC threads consume the other 60ms of quota in only 15ms of actual time (i.e., 60ms of GC CPU / 4 threads). The remaining 255ms of the period becomes application pause time.
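
The same arithmetic can be written out directly; the sketch below simply reproduces the numbers of this hypothetical example.

    public class ThrottleMath {
        public static void main(String[] args) {
            int periodMs  = 300;  // cfs_period
            int quotaMs   = 90;   // cfs_quota
            int appRunMs  = 30;   // wall-clock time the application runs before GC
            int gcCpuMs   = 60;   // CPU time the STW GC needs
            int gcThreads = 4;    // parallel GC threads

            // CPU consumed: 30ms by the application + 60ms by GC = the whole 90ms quota.
            int cpuUsedMs = appRunMs + gcCpuMs;
            // Wall-clock time to burn it: 30ms of application run + 60ms / 4 GC threads.
            int wallClockMs = appRunMs + gcCpuMs / gcThreads;  // 45ms
            // The rest of the period is pure throttled (pause) time.
            int pauseMs = periodMs - wallClockMs;               // 255ms

            System.out.printf("%dms of quota (limit %dms) used in %dms; throttled for %dms%n",
                    cpuUsedMs, quotaMs, wallClockMs, pauseMs);
        }
    }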

Clearly, with more GC threads, the CPU quota can be drained even faster. Note that on modern machines the number of GC threads can be much larger, since each JVM running in a cgroup still sets its GC parallelization level based on the entire host’s cores. For instance, on a 12-core machine, by default the JVM uses 9 GC threads; a 24-core machine causes the JVM to use 18 GC threads.

[Figure cgroups23: during a STW GC, the parallel GC threads drain the 90ms quota within 45ms, leaving a 255ms pause]

Bad scenario for Java applications (Concurrent phases during GC)
For popular JVM garbage collectors such as CMS and G1, GC has multiple phases; some phases are STW, while others are concurrent (non-STW). Although concurrent GC phases (run by concurrent GC threads) are designed to avoid JVM STW pauses, the use of cgroups completely defeats that purpose. During concurrent GC phases, the CPU time used also counts toward the cgroup’s CPU usage, which practically causes the application to experience an even longer, effectively STW, pause.

[Figure cgroups24: concurrent GC phases also consume the cgroup’s quota, effectively stalling the application even outside STW pauses]

Recommendations

We have seen that Java applications running in Linux cgroups can experience longer application pauses due to the interactions between JVM GC and CFS scheduling. To mitigate this issue, we recommend the following tunings:

  1. Sufficiently provisioning CPU resources (i.e., CFS quota). Naturally, the larger the CPU quota assigned to a cgroup, the less likely the cgroup is to be throttled.

  2. Appropriately tuning down the GC threads.

Sufficiently provisioning CPU resources

For our workload, which has only 2 active application threads, it might seem that 2 cores of CPU would meet the need. However, we found that the application benefits from a much larger CPU quota (i.e., many more CPU cores) because of GC activity.

Application performance (throughput and latencies)
To understand how many CPU cores this particular JVM workload needs, we varied the CPU quota assigned to the hosting cgroup (expressed in CPU cores, where one core of quota equals one full CFS period). We found that the application throughput kept increasing as more cores were assigned, up to about 12 cores. Compared with the 2 application threads, this discrepancy (i.e., 12 versus 2) shows how important it is to sufficiently provision CPU quota for a Java application.

We also counted the number of large allocation latencies (i.e., more than 50ms), and found that with more cores, the number of large latencies continues to drop until about 8 cores.

[Figure cgroups25: application throughput and the number of large (>50ms) allocation latencies versus the number of assigned CPU cores]

Cgroup’s CPU usage
The CPU usage (both user time and system time) of the cgroup also increases as more cores are assigned, as shown in the figures below. Note that the values are aggregated across all cores. With a 1,200-second run time, 6,000 seconds of CPU usage corresponds to an overall usage of 5 cores.

These results suggest that for this particular Java application with 2 active application threads, a lot more cores need to be assigned to the hosting cgroup.

[Figure cgroups26: the cgroup’s aggregated CPU usage (user and system time) versus the number of assigned CPU cores]

Tuning down GC threads (ParallelGCThreads)

We also varied ParallelGCThreads from 1 to 24 to understand the performance impact. Note that when ParallelGCThreads is adjusted, the JVM internally adjusts the value of ConcGCThreads as well.
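
Both values can also be set explicitly on the command line (e.g., -XX:ParallelGCThreads=4, -XX:ConcGCThreads=2). A quick way to verify what a given JVM actually ended up with is to query the HotSpot diagnostic MXBean, as in this sketch (HotSpot-specific).

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    public class GcThreadSettings {
        public static void main(String[] args) {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // Print the effective values; when only ParallelGCThreads is set,
            // the JVM derives ConcGCThreads from it internally.
            for (String flag : new String[] {"ParallelGCThreads", "ConcGCThreads"}) {
                System.out.println(flag + " = " + bean.getVMOption(flag).getValue());
            }
        }
    }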

More GC threads tend to use up the CFS quota faster, as reflected in the number of throttled periods and the amount of throttled time. As a result, more large latencies are observed. The application throughput does not change dramatically, though for this particular workload it actually drops as more GC threads are used.

These results suggest tuning down GC threads might be beneficial to application performance. However, applications differ in many aspects (e.g., GC frequencies, heap size, characteristics of application threads), so the impact of these tunings needs to be evaluated for each application. Due to space concerns and the complex nature of further investigations, we will not delve deeper into this aspect.

Cgroup performance (number of throttled periods and throttled time)

[Figure cgroup27: number of throttled periods and throttled time versus ParallelGCThreads]

Application performance (latencies and throughput)

[Figure cgroups28: allocation latencies and application throughput versus ParallelGCThreads]

Discussion

CFS ceiling enforcement or relative sharing
CFS scheduling supports both ceiling enforcement (i.e., hard limits) and relative sharing. CFS relative sharing uses the cpu.shares tunable to control the weight of the CPU resources a cgroup can use. Although it allows a cgroup to use otherwise-idle CPU resources, relative sharing suffers from poor performance predictability, because performance depends on the other cgroups. An application will see different performance depending on the availability of CPU cores, which makes capacity planning and performance monitoring difficult.
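
For comparison, relative sharing is configured by writing a weight to cpu.shares instead of a hard quota (same hypothetical cgroup path as in the earlier sketches; 1024 is the default weight).

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SetCpuShares {
        public static void main(String[] args) throws Exception {
            // A weight of 3072 gives this cgroup three times the weight of a
            // default (1024) cgroup, but imposes no hard ceiling on CPU usage.
            Files.writeString(Paths.get("/sys/fs/cgroup/cpu/myapp/cpu.shares"), "3072");
        }
    }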

On a machine where only one cgroup is actively using CPU, relative sharing allows the “busy” cgroup to “steal” CPU resources beyond its logical share (indicated by its portion of cpu.shares), so the impact of JVM GC is less visible. However, when a machine is fully packed (i.e., all CPU cores are busy), relative sharing is effectively comparable to ceiling enforcement, as each cgroup is only allowed to consume its corresponding portion of the CPU resources. Hence, the performance issues caused by multi-threaded JVM GC, and in turn the severe application pauses, will show up.

Workload dependency about tuning GC threads
Decreasing the number of GC threads helps reduce the chance of the CFS quota being used up too early, which in turn helps reduce the long application pauses caused by CFS throttling. However, fewer GC threads could also result in longer GC STW pauses, since the GC work needs more time to complete. So there is a tradeoff between these two concerns, and the appropriate number of GC threads depends on the workload characteristics and latency SLAs (Service Level Agreements); it should be chosen carefully and accordingly.

Conclusions

Running Java applications in Linux cgroups requires a thorough understanding of how JVM GC interacts with the cgroups’ CPU scheduling. We found that applications can experience longer pauses due to intensive GC activity. To ensure satisfactory application performance, sufficient CPU quota has to be provisioned, and, depending on the scenario, the number of GC threads should be tuned down.