Building a Tennis Simulation

Ian Dorward
DraftKings Engineering
Aug 17, 2023


Tennis has become one of the most popular sports at DraftKings, ranking as the fourth most popular sport behind football, basketball and baseball. Its popularity stems from its year-round availability, with more than 20 tournaments taking place globally on most weeks.

In this first article in the Sports Data Science pillar of our Sports Intelligence series, we will use tennis to explore the process of building a Monte Carlo simulation model, the basis for the majority of our sports engines at DraftKings. The model that we currently use at DraftKings is more advanced than what we will build here and we will go into more detail on the intricacies of that in future articles, but it is important to understand the underlying modeling techniques that we employ.

Whether you are a seasoned sports bettor with your own models or new to the world of mathematically predicting sporting outcomes, this series aims to provide valuable insights into the world of sports modeling from the perspective of the bookmaker.

Why Do We Use Simulations?

The first question is why we build the model as a simulation rather than as a calculation. Many academic papers focus on methods of calculating the outcomes of sporting events, and simple algebra is enough to calculate the probabilities of winning games, sets and matches in tennis. However, at DraftKings, we have chosen a Monte Carlo simulation approach to modeling sports rather than the more traditional calculation-based approach used by many other companies in the industry and in academia.

We feel that this approach gives us the best opportunity to deliver the most extensible and accurate product. Compared with a traditional calculation-based model, it gives us more flexibility to leverage machine learning models and to build in more advanced momentum-related features. It also allows us to implicitly account for the correlations between outcomes that are crucial in the development of our market-leading Same Game Parlay (SGP) product, something that we will revisit in greater detail in a future article.

Understanding the Match

At its heart, tennis is a simple sport. Players play points, those points count toward games, those games count toward sets and the player that wins the most sets wins the match. As a result, the process for building a simulation model for tennis is also simple.

We need code that loops through each point, predicts which player will win the point and checks whether that point means that the game, set or match is complete. As we need to be able to offer markets on the sport, when any of these conditions are met, we need to store the outcome. When we have simulated enough points that one player has won the match, the match outcomes are stored and we run another simulation.

The Key Inputs

Based on our understanding of how we want our simulation to flow, we need to set up three key objects — a simulated features object to record features based on the current state of the simulation, a simulation records object to record the outcome of our simulations and an input features object that contains information on the relative strengths of the players and any other variables that are needed to control the simulation outcomes.

Simulated Features

The simulated features object contains all of the features related to the current state of the game in a given simulation that we need to track. This will be sport-specific, but in tennis, it will contain information such as which player is currently serving, whether we are in a tiebreak, and the current score in the game, set, and match. When we initially train and test the ML models that are included in the engine, all of the features at each stage of the match are known, because the models are trained on historical data.

However, when it comes to using these models for prediction, in each simulation, the values of these features will be different and do not technically exist in a real-world sense. Instead, we are effectively simulating the values of each of these features and using these as the inputs into the ML models at each decision point.

In our production model, we have a lot more features than those listed here, but this gives a basic idea of the types of simulated features that we use in our engines.
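As a rough illustration, the simulated features listed above could be held in a small dataclass. The class and field names below are assumptions made for this sketch and are not the production schema.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedFeatures:
    """State of one simulated match (hypothetical field names)."""
    server: int = 0             # index of the player currently serving (0 or 1)
    in_tiebreak: bool = False   # True while a tiebreak is being played
    # Each score is a two-element list: [player 0, player 1]
    points: list = field(default_factory=lambda: [0, 0])  # current game
    games: list = field(default_factory=lambda: [0, 0])   # current set
    sets: list = field(default_factory=lambda: [0, 0])    # match


state = SimulatedFeatures()
state.points[0] += 1   # player 0 wins the opening point
```

Using default_factory gives every simulation its own fresh score lists, so state never leaks from one simulation into the next.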

Simulation Records

The simulation records object is created to record all of the outcomes in a simulation that we will require to create the many different markets that we offer on the sport. At each relevant moment in a match, we will add to the appropriate record to signify that outcome occurred in that particular simulation. At the end of all of the simulations, we can perform functions on those simulation records to get the probabilities of each different outcome.
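One possible shape for such a records object is sketched below. The class, its fields and the record_match method are hypothetical names chosen for illustration; a production engine would likely track many more outcomes.

```python
from dataclasses import dataclass, field


@dataclass
class SimulationRecords:
    """Outcomes collected across simulations (hypothetical layout)."""
    # One (sets won by player 0, sets won by player 1) tuple per simulation
    match_sets: list = field(default_factory=list)
    # One list of (games for player 0, games for player 1) per simulation
    set_scores: list = field(default_factory=list)

    def record_match(self, sets_won, scores):
        """Store the final result of a single simulated match."""
        self.match_sets.append(tuple(sets_won))
        self.set_scores.append([tuple(s) for s in scores])


records = SimulationRecords()
records.record_match([2, 1], [(6, 4), (3, 6), (7, 5)])
```

Keeping one entry per simulation means every market can later be computed by counting over the same index.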

Input Features

The input features contain the information that we require to represent the different strengths of the competitors, plus any additional variables that are required to more accurately represent how the sport plays out. In this situation, the key variables are the two player serve percentages, representing the probability of each player winning a point on their serve, plus the n0 parameter that controls momentum as we shall see later in this article.
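A minimal sketch of an input features container follows. The class name, the field names and the validation rules are assumptions for this example; only the two serve percentages and the n0 parameter come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputFeatures:
    """Pre-match inputs for the simulation (hypothetical field names)."""
    serve_pct: tuple   # (p0, p1): probability each player wins a point on serve
    n0: float          # weight given to the pre-match estimate (momentum control)

    def __post_init__(self):
        # Basic sanity checks so bad inputs fail before any simulation runs
        if len(self.serve_pct) != 2:
            raise ValueError("serve_pct needs one value per player")
        if not all(0.0 < p < 1.0 for p in self.serve_pct):
            raise ValueError("serve percentages must lie strictly between 0 and 1")
        if self.n0 <= 0:
            raise ValueError("n0 must be positive")


inputs = InputFeatures(serve_pct=(0.66, 0.63), n0=100.0)
```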

Building the Simulation

This simple framework is all we need for a tennis simulation model. With just two input parameters, the serve point win percentage for each of the two players, we can run our simulations, generate simulation records, and create probabilities for the various possible outcomes of the match. We can use these percentages and a random number in our get_point_winner function to return a winner for each point in our simulation.
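A get_point_winner function along these lines might look like the sketch below. The function name comes from the text, but its signature (the server index, a pair of serve percentages and an optional random number generator) is an assumption.

```python
import random


def get_point_winner(server, serve_pct, rng=random):
    """Return the index (0 or 1) of the player who wins the point.

    server    -- index of the player serving this point
    serve_pct -- pair of serve point win probabilities, one per player
    rng       -- source of random numbers (the random module or a Random instance)
    """
    if rng.random() < serve_pct[server]:
        return server        # server holds the point
    return 1 - server        # returner wins the point


# Rough check: over many points the server's share approaches serve_pct
rng = random.Random(0)
n_points = 20_000
server_share = sum(get_point_winner(0, (0.65, 0.60), rng) == 0
                   for _ in range(n_points)) / n_points
```

Passing a seeded Random instance makes a run reproducible, which helps when debugging a single simulation.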

So now we have a way of determining which player wins each point. All we need to do now is to ensure that we correctly update our simulated features after each point.

After each point, we call a function to update the simulated features and the simulation records, then run through various checks to determine whether that point has led to the end of a game, set or the match. If any of these criteria are met, then we perform further feature and record updates to reflect this.

This gives us a very simplistic tennis simulation model. We can use it to run 100,000 simulations of a match and record the result of each one, which gives us a list of results that we can use to calculate the probability of various outcomes.
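The point loop and the game, set and match checks described in this section can be sketched in one function. This is a simplified illustration and not the production engine: it assumes a best-of-3 match, a first-to-7 tiebreak at 6-6 in every set, and that player 0 serves first. The name simulate_match and its arguments are hypothetical.

```python
import random


def get_point_winner(server, serve_pct, rng):
    return server if rng.random() < serve_pct[server] else 1 - server


def simulate_match(serve_pct, sets_to_win=2, rng=None):
    """Simulate one match; return (sets won, list of set scores)."""
    rng = rng or random.Random()
    sets, set_scores = [0, 0], []
    server = 0                                  # assumption: player 0 serves first
    while max(sets) < sets_to_win:              # match-over check
        games = [0, 0]
        while True:                             # one set
            tiebreak = games == [6, 6]
            target = 7 if tiebreak else 4       # points needed to win the game
            points, played = [0, 0], 0
            while True:                         # one game
                if tiebreak and ((played + 1) // 2) % 2 == 1:
                    point_server = 1 - server   # tiebreak serve rotation: A, B, B, A, A, ...
                else:
                    point_server = server
                winner = get_point_winner(point_server, serve_pct, rng)
                points[winner] += 1
                played += 1
                # Game-over check: reach the target and lead by two points
                if points[winner] >= target and points[winner] - points[1 - winner] >= 2:
                    break
            games[winner] += 1
            server = 1 - server                 # serve alternates after every game
            # Set-over check: tiebreak decided, or six games with a two-game lead
            if tiebreak or (games[winner] >= 6 and games[winner] - games[1 - winner] >= 2):
                break
        sets[winner] += 1
        set_scores.append(tuple(games))
    return tuple(sets), set_scores


rng = random.Random(2023)
results = [simulate_match((0.66, 0.63), rng=rng) for _ in range(2000)]
p0_wins = sum(s[0] > s[1] for s, _ in results) / len(results)
```

The example runs 2,000 simulations to keep it quick; the same loop scales to the 100,000 mentioned above.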

[Figures: Match Correct Scores; Set Correct Scores by simulation]

In the match sets outcome array, each row corresponds to a single simulation: the first value is the number of sets won by the first player and the second value is the number of sets won by the second player. In each simulation, we also generate a list of scores for each set, in which each row represents a set in the match.

Let us assume this represents the Wimbledon final between Carlos Alcaraz and Novak Djokovic. We can see that in the first simulation, Novak Djokovic won by 3 sets to 1, losing the first set 6–7 before winning the remaining three sets 6–3, 7–6, 6–3. In the second simulation, Novak Djokovic also won the match, but by a 3–0 scoreline of 7–6, 6–4, 6–4.

Once we have all of these results stored, it is clear how we build up markets from these simulation records. The match winner market is calculated by counting the number of simulations in which Alcaraz won and dividing it by the total number of simulations. If Alcaraz won the match in 40,000 of the simulations and Djokovic won in the other 60,000, the probabilities of winning the match are 40% for Alcaraz and 60% for Djokovic. We can do the same for any other market that derives from the simulation records that we have (e.g. match correct score, total games, first set winner, etc.).
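The counting step can be sketched with a handful of made-up records. The four simulations below are hypothetical data, with scores listed as (Alcaraz, Djokovic); the first two rows mirror the two simulations described above, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical records from four simulations, listed as (Alcaraz, Djokovic)
match_sets = [(1, 3), (0, 3), (3, 2), (1, 3)]
set_scores = [
    [(7, 6), (3, 6), (6, 7), (3, 6)],
    [(6, 7), (4, 6), (4, 6)],
    [(6, 4), (3, 6), (7, 6), (4, 6), (6, 3)],
    [(4, 6), (6, 3), (5, 7), (2, 6)],
]

n = len(match_sets)

# Match winner: share of simulations in which the first player won more sets
alcaraz_win = sum(a > b for a, b in match_sets) / n

# Match correct score: share of simulations ending on each sets scoreline
correct_score = {score: count / n for score, count in Counter(match_sets).items()}

# Total games: one value per simulation, ready for over/under markets
total_games = [sum(a + b for a, b in sets) for sets in set_scores]

# First set winner: share of simulations in which the first player took set one
alcaraz_first_set = sum(sets[0][0] > sets[0][1] for sets in set_scores) / n
```

With only four rows the probabilities are coarse (for example, alcaraz_win is 0.25), but the same expressions apply unchanged to 100,000 rows.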

Different Formats

Now that we have our basic simulation, we can start to expand the logic to make it more adaptable. Tennis has multiple different formats, ranging from the longest format of best-of-5 with a final set super tiebreak that is currently used at the four Grand Slam events, to the shortest formats of one single super tiebreak that is used in the occasional exhibition Tie Break Tens events.

Given the way we originally built our simulation, we have a straightforward way to add in this logic. We already have functions for set_over and match_over where we update the state, so we just have to add our expanded logic into these. In order to keep this tidy, at the start of the simulation, we create an object that will hold the relevant information about our desired match format.
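A match format object of this kind could be as small as the sketch below. The class name, the field names and the default values are assumptions for illustration, and the single super tiebreak format is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchFormat:
    """Rules for one match format (hypothetical field names)."""
    sets_to_win: int = 2                  # 2 for best-of-3, 3 for best-of-5
    games_to_win_set: int = 6             # games needed to win a set
    tiebreak_points: int = 7              # points needed in a standard tiebreak
    final_set_tiebreak_points: int = 10   # points needed in a final set super tiebreak


BEST_OF_THREE = MatchFormat()
GRAND_SLAM = MatchFormat(sets_to_win=3)
```

Freezing the dataclass prevents a simulation from changing the rules part-way through a match by accident.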

Once we have this object, we can use it in our simulation functions to determine whether the set or the match is over, based on the values specific to the desired format. After adapting the previous code to pass in our new match_format object, we can use these functions.

The match_over function simply checks whether the winner of the set has now reached the number of sets required to win the match. The game_over function is slightly more complex as it first checks whether we are in a tiebreak, then checks whether the required game winning criteria have been met.
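These checks might be written as follows. The function names match_over, set_over and game_over come from the text, but the signatures and the MatchFormat fields are assumptions, and the sketch assumes a tiebreak is played at 6-6 in every set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchFormat:
    sets_to_win: int = 2
    games_to_win_set: int = 6
    tiebreak_points: int = 7
    final_set_tiebreak_points: int = 10


def game_over(points, in_tiebreak, is_final_set, fmt):
    """True once either player has won the current game or tiebreak."""
    if in_tiebreak:
        target = fmt.final_set_tiebreak_points if is_final_set else fmt.tiebreak_points
    else:
        target = 4   # a regular game needs four points
    # Either way the winner must lead by at least two points
    return max(points) >= target and abs(points[0] - points[1]) >= 2


def set_over(games, fmt):
    """True once either player has won the current set."""
    if max(games) > fmt.games_to_win_set:    # 7-5, or 7-6 after a tiebreak
        return True
    return max(games) == fmt.games_to_win_set and abs(games[0] - games[1]) >= 2


def match_over(sets, fmt):
    """True once either player has the sets required by the format."""
    return max(sets) >= fmt.sets_to_win


fmt = MatchFormat(sets_to_win=3)
```

For example, game_over([7, 5], True, True, fmt) is False because a final set super tiebreak runs to 10 points, while the same score ends a standard tiebreak.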

Momentum

The final thing that we want to add into our initial basic simulation is the concept of momentum. In our current version, we are assuming that all points are independent of one another and that what happens on one service point has no impact on any other service points in the match.

This is an assumption often made in academic research on tennis modeling, but almost everyone who has watched the sport knows that it is not necessarily the case. Even if points really are independent, from our perspective we can view momentum as updating our initial pre-match priors on the two players’ service percentages.

Let us imagine that each tennis player in any given match has a natural serve point win percentage. This percentage is based upon the player's true underlying serving ability and the true underlying returning ability of the opponent, adjusted for factors such as surface and altitude. On any given day, a player may perform above or below this level because of any number of factors, such as the weather, fatigue and pure random chance.

We have models to give us our best estimate of this natural percentage that we will look at in more detail later in this series, but as the match progresses, we can use the actual outcomes of serve points to adjust this estimate to better reflect what is happening on the day. With each point that passes, we can infer additional information to update our estimate of the serve win percentage for each player.

We make a prediction by sampling from the posterior Beta distribution, where the alpha value is given by the sum of our pseudo wins and actual wins and the beta value is given by the sum of our pseudo losses and actual losses.
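A sketch of such a sampler is shown below. It assumes that the pseudo wins and pseudo losses are the pre-match serve percentage and its complement, each scaled by n0; that split is an assumption of this example, as are the function name and its signature.

```python
import random


def sample_serve_pct(prior_pct, n0, wins, losses, rng=random):
    """Draw a serve point win probability from the posterior Beta distribution.

    prior_pct -- pre-match estimate of the serve point win percentage
    n0        -- number of pseudo points that the prior is worth (assumption)
    wins      -- serve points actually won so far in this simulated match
    losses    -- serve points actually lost so far in this simulated match
    """
    alpha = prior_pct * n0 + wins            # pseudo wins + actual wins
    beta = (1.0 - prior_pct) * n0 + losses   # pseudo losses + actual losses
    return rng.betavariate(alpha, beta)


# Before any points are played, draws are centred on the pre-match estimate
rng = random.Random(7)
draws = [sample_serve_pct(0.65, 100, 0, 0, rng) for _ in range(20_000)]
mean_draw = sum(draws) / len(draws)
```

Sampling a probability instead of using the posterior mean directly carries the remaining uncertainty about the player's level on the day into each simulated point.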

Breaking this down, at the start of the match, we have our initial best estimates of the serve percentage for a player. There are no actual points won or lost at this stage, so we purely use these estimates. As the match progresses, we observe actual outcomes for the specific match, so we start to adjust the initial estimates with these outcomes. If a player is consistently winning more points on serve than we initially expected, this will gradually increase the likelihood of the player winning a point on serve. Similarly, if the player is losing more than expected, it will lower this likelihood.

We can control the weighting of the initial percentages and the actual outcomes by adjusting the n0 value in our function. As we increase it, more weight is given to the initial values and it takes far more actual points to move the estimate, whereas if we decrease it, what we are actually observing is given far greater weight.
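The effect of this weighting is easiest to see on the posterior mean. The sketch below again assumes that the prior is worth n0 pseudo points split according to the pre-match percentage; the function name and the example numbers are hypothetical.

```python
def posterior_mean(prior_pct, n0, wins, losses):
    """Mean of the Beta posterior: (pseudo wins + wins) / (all pseudo and actual points)."""
    return (prior_pct * n0 + wins) / (n0 + wins + losses)


# A player with a 65% pre-match estimate wins 18 of the first 20 serve points
means = {n0: posterior_mean(0.65, n0, wins=18, losses=2) for n0 in (20, 100, 500)}
```

Under these assumptions the estimate moves to about 0.775 with n0 = 20, about 0.692 with n0 = 100 and only about 0.660 with n0 = 500, so a larger n0 makes the simulation slower to react to what happens in the match.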

What Next?

This provides us with a framework to run many simulations of a tennis match, update our estimates for the key variables based on the progress of the match and create simulation records that we can use to generate probabilities for markets. This is an approach we have adopted across the majority of our sports and it is one that provides a base to layer in additional complexity at virtually any stage of the game.

In upcoming articles, we will show how this approach can be easily adapted to significantly more complex sports, such as football. We will also investigate how we can incorporate machine learning models to drive the key decision points within matches and we will show how we have developed ratings systems to drive the key inputs in the simulation engines.

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!

