Making our test suite faster with parallel tests

These past few days, I had the luck to work on one of those features that will improve the developer experience at Yago and indirectly will enhance the experience our users will have when they use our platform.

I was able to spend a bit of time tinkering with our test suite to make it faster; how much faster? We went from 45+ minutes to ~10 minutes to run all the tests locally.

I will admit this is one of those projects that is as fun as it is frustrating, so strap in as I tell you all about it.

Some context

As mentioned above, our test suite takes a long time to run, upwards of 45 minutes. This we all acknowledged on the team is a big problem. On the one hand, we cannot reliably run our tests locally before making a pull request. On the other hand, we will make the pull request and ask for reviews before the tests have finished.

This scenario has unfolded too many times: One of us makes a PR and forgets because the tests take too long, only for the reviewer to ask that the tests get fixed before reviewing.

We identified the time the specs took to run as the main culprit; if we managed to get this to a reasonable timeframe, we would be able to make the tests run locally before writing a PR and even expect the CI to have finished before we tag the reviewers.

What we are solving and what we are not solving

In an ideal world, we would have taken the time to make our tests faster (and probably we will do so one day). But we needed a quick way to make them usable. Parallelizing their execution makes them, by definition, faster.

So we are not solving the underlying problem. Instead, we are focusing on the output of having our tests delivered faster.

The (pro|pre)mise

Our computers have multiple cores; rspec by default uses only one core. If we use multiple cores, we can dramatically decrease the time the tests take to run.

In an ideal world, and all tests running for the same time, this would be the scenario:

| 1 core     | 2 cores     | 4 cores     | 8 cores     |
|--------    |---------    |---------    |---------    |
| 40 min     | 20 min      | 10 min      | 5 min       |

Of course, due to the initialization time and the fact that some tests are inherently longer than others, the ideal scenario will not always happen.

We will be happy if the test suite takes between 10 and 20 minutes to run locally.

Finally, the solution

It all begins with a tweet

I asked on Twitter how I could make my tests faster, and @gvpmahesh came to my rescue by suggesting I use parallel_tests. I experimented a bit, and the tests seemed to go faster. An excellent start!

Parallel tests

Let me start by saying that this gem is fantastic! It will do exactly what you expect; just add: gem parallel_tests to your gemfile, run bundle install, and then you can run:

bundle exec rake parallel:setup
bundle exec rake parallel:spec

And your tests will all run very, very fast.

That being said, it does have a couple of potential gotchas:

The output is a bit all over the place
Some tests break when you parallelize them

So we need to solve these two errors.

Fixing gotchas

Here two scenarios unfold:

Either you have no errors due to parallelization, and it's just a matter of 'prettifying' the output, in which case I have excellent news for you: Turbo Tests will normalize the result (thanks for pointing it out @bhserna)

Or you have both outputs all over the place, and some tests become flaky when you parallelize them(In our case, when tests tried to access a file).

This second scenario has several potential solutions locally: 1- Re-run parallel tests adding the '--only-failures' to test options from the command line 2- Run 'rspec --only-failures' on the command line 3- Create a rake task that will do that for you.

Let's explore the different solutions.

Re-run parallel tests

You can add an option to bundle exec parallel::tests\[,,'-only-failures',\] to re-run only the failed tests. Re-running them in parallel might not result in trustworthy outputs since the specs were flaky because of the parallelization.

On top of that, how do you decide if the last spec is really failing or it's the parallelization playing tricks on you?

Run rspec with only failures

This works quite well, but it has the drawback that you have to run both commands manually to work. On top of that, our CI tool (CodeShip) will stop if a command does not exit with a 0 status.

Create a rake task to automate it

And here we are, what we all have been waiting for! How do we automate my parallelization to use it comfortably on my machine and our CI tool?

Let's first look at the example in full, and I will then proceed to break down the different parts and gotchas that I encountered:

namespace :test do
  task :in_parallel => :environment do |t|
    start = Time.now
    p "start: #{start}"
    puts 'starting tests first round'

    status = 0
    begin
      Rake::Task["parallel:spec"].invoke(nil,nil,"--format progress","--first-is-1 --group-by runtime --allowed-missing 100 --runtime-log parallel_runtime_rspec.log --quiet")
      end_time = Time.now
      p "end: #{end_time}"
      difference = (end_time - start).floor
      minutes, seconds = difference/60, difference%60
      p "elapsed_time: #{minutes} minutes #{seconds} seconds"
      puts("\e[32mAll specs successful\e[0m")
      exit(status)
    rescue SystemExit => e
      exit(0) if e.status == 0
      begin
        puts 'starting tests second parallel round'
        Rake::Task["parallel:spec"].reenable
        Rake::Task["parallel:spec"].invoke(nil,nil,"--format progress --only-failures","--first-is-1 --group-by runtime --allowed-missing 100 --runtime-log parallel_runtime_rspec.log --quiet")
        end_time = Time.now
        p "end: #{end_time}"
        difference = (end_time - start).floor
        minutes, seconds = difference/60, difference%60
        p "elapsed_time: #{minutes} minutes #{seconds} seconds"
        puts("\e[32mAll specs successful\e[0m")
        exit(status)
      rescue SystemExit => e
        exit(0) if e.status == 0
        puts 'starting tests third and final non-parallel round'
        output = system "RAILS_ENV=test bundle exec rspec --only-failures"
        status = 1 if output == false
        end_time = Time.now
        p "end: #{end_time}"
        difference = (end_time - start).floor
        minutes, seconds = difference/60, difference%60
        p "elapsed_time: #{minutes} minutes #{seconds} seconds"
        puts(output ? "\e[32mAll specs successful\e[0m" : "\e[31mThere are failures, see above\e[0m")
        exit(status)
      end
    end
  end
end

OK, this is a mouthful, and many moving pieces are involved. Before breaking everything down and going into tangents, I will address the first thing that came into your mind: THIS IS NOT DRY!. And indeed, it is not! That is because I avoid putting methods in rake tasks on the one hand, and on the other, I believe this not being dry does not hinder too much its readability.

Now for...

The breakdown of the task

    start = Time.now
    p "start: #{start}"
    puts 'starting tests first round'

    status = 0

There is nothing to see above; we are just setting our start variable, setting out status if all goes well, and giving some info to the user.

    begin
      Rake::Task["parallel:spec"].invoke(nil,nil,"--format progress","--first-is-1 --group-by runtime --allowed-missing 100 --runtime-log parallel_runtime_rspec.log --quiet")
      end_time = Time.now
      p "end: #{end_time}"
      difference = (end_time - start).floor
      minutes, seconds = difference/60, difference%60
      p "elapsed_time: #{minutes} minutes #{seconds} seconds"
      puts("\e[32mAll specs successful\e[0m")
      exit(status)

Here things start to get spicy; we enter our first begin block. We do this because we later will want to rescue SystemExit to make sure our exit status is correct.

Let's dig further into this block:

Rake::Task["parallel:spec"].invoke(nil,nil,"--format progress","--first-is-1 --group-by runtime --allowed-missing 100 --runtime-log parallel_runtime_rspec.log --quiet")

This line is the main workhorse of the machine. It will invoke the parallel:spec rake task and assign its arguments.

The task parallel:spec takes four arguments:

The number of processes -> It will infer as many as your machine has
The pattern for where to run the specs -> We want to run all of them
The Rspec(test) options -> To avoid clutter, we set it to progress
The parallel:spec options
1. --first-is-1 is well explained on the parallel_tests documentation
2. We want them to run by runtime if it is there, so we set: --group-by runtime --allowed-missing 100 --runtime-log parallel_runtime_rspec.log
3. We set it to quiet to reduce as much as possible the amount of noise.

end_time = Time.
p "end: #{end_time}"
difference = (end_time - start).floor
minutes, seconds = difference/60, difference%60
p "elapsed_time: #{minutes} minutes #{seconds} seconds"
puts("\e[32mAll specs successful\e[0m")
exit(status)

This ensures that if the task is successful, we will get a nice output telling us when the specs ended, the time it took to run, and a lovely green All specs successful with an exit status of 0

rescue SystemExit => e
exit(0) if e.status == 0

This part is critical, and it took me a long time to find it. Rescuing without specifying SystemExit will not work since it will just rescue exceptions. If the status is 0, we don't want to do anything, though, so we just continue to exit with a status of 0.

begin
   puts 'starting tests second parallel round'
   Rake::Task["parallel:spec"].reenable
   Rake::Task["parallel:spec"].invoke(nil,nil,"--format progress --only-failures","--first-is-1 --group-by runtime --allowed-missing 100 --runtime-log parallel_runtime_rspec.log --quiet")
   end_time = Time.now
   p "end: #{end_time}"
   difference = (end_time - start).floor
   minutes, seconds = difference/60, difference%60
   p "elapsed_time: #{minutes} minutes #{seconds} seconds"
   puts("\e[32mAll specs successful\e[0m")
   exit(status)

This part should look familiar, so I will not dwell on it too much. Just a couple of details:

We create a new begin, so we can rescue it later as needed
We reenable the task parallel:spec. It will not run if we don't
We add --only-failures to the rspec(test) options. Please make sure that you have this line in spec_helper.rb: config.example_status_persistence_file_path = 'examples.txt' it will not work if you don't

  rescue SystemExit => e
      exit(0) if e.status == 0
      puts 'starting tests third and final non-parallel round'
      output = system "RAILS_ENV=test bundle exec rspec --only-failures"
      status = 1 if output == false
      [...]
      exit(status)

In our final rescue, we run our safety net, 'normal' rspec, but only for the failed tests. Thus we need to do it with system "RAILS_ENV=test bundle exec rspec --only-failures". This will return true if the tests pass and false if they don't, which allows us to set our final status for the grand exit of the rake task.

Visualize the task's flow:

Untitled Diagram.drawio(7).png