tennis-prediction

Who is the ATP Superstar?

Jeremy, Pranav

Introduction

The Association of Tennis Professionals (ATP) is the main governing body for men’s professional tennis, and the ATP Tour is a worldwide series of tournaments featuring the best tennis players from around the globe.

In our dataset, we have 74,906 matches spanning 2000-2024.

There are 49 columns in our dataset. We used 28 of them. Many of these are already self-explanatory, but we note clarifications on the columns we used, as necessary, below:

Some initial questions we brainstormed are:

Ultimately, the question we plan to investigate further in this project is: How strongly do pre-match player statistics, such as rank, age, height, and seeding, predict the match winner?

This question delves into the core predictability of tennis matches based on readily available player information before the match starts. Also, it explores fundamental factors often discussed by commentators and analysts.

Data Cleaning and Exploratory Data Analysis

Data Cleaning and Imputation

First, we need to combine our datasets. The dataset we collected from Kaggle splits the matches by year, so we can simply concatenate our CSV files. We will add two new columns, match_id and year, to better differentiate matches.
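
As a sketch, the combination step might look like the following (the folder and filename pattern atp_matches_YYYY.csv are assumptions about the Kaggle download, not confirmed names):

    import glob

    import pandas as pd

    # Read each per-year CSV and tag every match with its source year.
    frames = []
    for path in sorted(glob.glob("data/atp_matches_*.csv")):
        year = int(path.split("_")[-1].removesuffix(".csv"))
        df = pd.read_csv(path)
        df["year"] = year
        frames.append(df)

    # Stack all years into one DataFrame and give every match a unique id.
    matches_df = pd.concat(frames, ignore_index=True)
    matches_df["match_id"] = matches_df.index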

tourney_id tourney_name surface draw_size ... loser_rank loser_rank_points year match_id
0 2000-301 Auckland Hard 32 ... 63.0 595.0 2000 0
1 2000-301 Auckland Hard 32 ... 49.0 723.0 2000 1
2 2000-301 Auckland Hard 32 ... 59.0 649.0 2000 2
... ... ... ... ... ... ... ... ... ...
74903 2024-M-DC-2024-WG2-PO-VIE-RSA-01 Davis Cup WG2 PO: VIE vs RSA Hard 4 ... NaN NaN 2024 74903
74904 2024-M-DC-2024-WG2-PO-VIE-RSA-01 Davis Cup WG2 PO: VIE vs RSA Hard 4 ... 416.0 109.0 2024 74904
74905 2024-M-DC-2024-WG2-PO-VIE-RSA-01 Davis Cup WG2 PO: VIE vs RSA Hard 4 ... NaN NaN 2024 74905

74906 rows × 51 columns

For ease of access and analysis later on, we first changed the tournament date to a YYYY-MM-DD datetime format using pandas.
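
Assuming the raw tourney_date column arrives as a YYYYMMDD integer (as in the underlying ATP match data), the conversion is a one-liner:

    # Parse the YYYYMMDD integers (assumed raw format) into datetime64
    # values, which make sorting and year-based filtering easy later.
    matches_df["tourney_date"] = pd.to_datetime(
        matches_df["tourney_date"].astype(str), format="%Y%m%d"
    )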

Next, we consider columns with missing values; we later fill these in using imputation strategies, ensuring consistent data across the board for our model to analyze. We found the following columns that need to be cleaned (due to missing entries or inconsistency within the category):

surface, winner_seed, winner_entry, winner_ht, winner_age, loser_seed, loser_entry, loser_hand, loser_ht, loser_age, minutes, w_ace, w_df, w_svpt, w_1stIn, w_1stWon, w_2ndWon, w_SvGms, w_bpSaved, w_bpFaced, l_ace, l_df, l_svpt, l_1stIn, l_1stWon, l_2ndWon, l_SvGms, l_bpSaved, l_bpFaced, winner_rank, winner_rank_points, loser_rank, loser_rank_points
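
These counts can be surfaced with a quick check (a minimal sketch on the combined matches_df):

    # Count missing entries per column, keeping only columns with any.
    missing = matches_df.isna().sum()
    print(missing[missing > 0])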

The resulting table shows the distribution of missing values in the columns as follows:

surface                  53
winner_seed           43786
winner_entry          65400
winner_ht              1425
winner_age                5
loser_seed            57668
loser_entry           59484
loser_hand                4
loser_ht               2909
loser_age                 3
minutes                8174
w_ace                  6520
w_df                   6520
w_svpt                 6520
w_1stIn                6520
w_1stWon               6520
w_2ndWon               6520
w_SvGms                6520
w_bpSaved              6520
w_bpFaced              6520
l_ace                  6520
l_df                   6520
l_svpt                 6520
l_1stIn                6520
l_1stWon               6520
l_2ndWon               6520
l_SvGms                6520
l_bpSaved              6520
l_bpFaced              6520
winner_rank             573
winner_rank_points      573
loser_rank             1468
loser_rank_points      1468

Handling Categorical Features

Handling Numerical Features

Because we imputed numerous columns, displaying visualizations for all of them would detract from our focus. Thus, we selected the three columns with the highest missing counts (which we imputed) to visualize their distributions before and after imputation.

Overall, the imputation largely preserved the original right-skewed distribution of match durations, but the significantly taller peak around 80-100 minutes in the “After” plot shows that the missing values were filled using a central tendency measure (median).

The distribution of winner aces remains highly right-skewed after imputation; the main visual impact is the increased frequency (taller bars) for lower ace counts, especially around the median number of aces (which was used to fill the missing data points).

Imputing missing heights significantly increased the data density around the typical player height range (approx. 180-195 cm), with the pronounced central peak in the “After” plot, reflecting missing values being filled by the global median height.
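
As a concrete illustration of the strategy described above, here is a minimal sketch of the global-median fill (only a few of the imputed columns are listed; the full set matches the table above):

    # Fill each numeric column's missing entries with that column's
    # global median, per the imputation strategy described above.
    for col in ["minutes", "w_ace", "winner_ht", "loser_ht"]:
        matches_df[col] = matches_df[col].fillna(matches_df[col].median())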

Univariate Analysis

This plot shows the frequency of different ranks among match winners. We expect a skew towards small rank numbers (i.e., better players), because better-ranked players win more often. The log scale helps visualize the long right tail of wins by players with large rank numbers. This insight helps us answer our question: it tells us that our model should place greater weight on a player’s rank, since better-ranked players accumulate far more wins than lower-ranked ones.
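
A sketch of how this histogram can be generated (we assume here that the log scale is applied to the count axis):

    import matplotlib.pyplot as plt

    # Histogram of winner ranks; the log-scaled count axis makes the
    # sparse right tail (wins by players with large rank numbers) visible.
    plt.hist(matches_df["winner_rank"].dropna(), bins=100)
    plt.yscale("log")
    plt.xlabel("winner_rank")
    plt.ylabel("number of wins (log scale)")
    plt.show()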

This plot shows the frequency of matches played on each surface in the dataset. Hard courts are by far the most common, followed by Clay, Grass, and Carpet. We can use this information to gain more insight into our question by examining how the surface (and its match count) matters at each tournament level (higher stakes, “winner’s court”).

Bivariate Analysis

This plot investigates whether height difference correlates with winning, especially in the context of rank difference. Points in the bottom-left quadrant represent shorter players winning against higher-ranked opponents (an upset). Points in the top-right represent taller players winning against lower-ranked opponents. We see a large cluster near the origin, with some notable outliers in terms of height difference.

Interesting Aggregates

surface          Carpet    Clay   Grass    Hard
tourney_level
A                  92.8    99.7    92.1    95.9
D                 101.6   103.9   103.6   104.0
F                 103.8     0.0     0.0   103.5
G                   0.0   147.9   141.7   149.9
M                  93.8   102.2     0.0    99.2
O                   0.0    99.6     0.0     0.0

This table compares the average match length, in minutes, across different tournament levels (i.e. Grand Slams ‘G’, Masters ‘M’, etc.) and surfaces. Entries of 0 mean that no match was played on that surface at that tournament level. We can observe that Grand Slam matches on Clay, Grass, and Hard surfaces run considerably longer than other ATP matches. This is to be expected: Grand Slam matches are played as best of five sets rather than best of three, and they are extremely high-stakes, career-defining games.
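
A pivot table like the one above can be computed in a single call (a sketch; fill_value=0 produces the zero entries for empty level-surface combinations):

    # Mean match length in minutes by tournament level and surface.
    avg_minutes = matches_df.pivot_table(
        index="tourney_level",
        columns="surface",
        values="minutes",
        aggfunc="mean",
        fill_value=0,
    ).round(1)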

surface          Carpet    Clay   Grass    Hard
tourney_level
A                  38.1    37.8    37.9    35.6
D                  30.0    24.9    32.3    26.2
F                  46.7     0.0     0.0    35.3
G                   0.0    28.6    30.3    27.8
M                  39.2    35.1     0.0    35.9
O                   0.0    25.0     0.0     0.0

This table shows the percentage of matches won by the lower-ranked player (‘upsets’) across different tournament levels and surfaces. The is_upset variable is True only for those rows in matches_df where the player listed as the winner had a worse rank (reminder: a higher rank number means a lower actual rank) than the player listed as the loser. Overall, this table helps identify environments where rankings are less predictive and upsets are more common.
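
The upset flag and the table above can be computed with the same pivot pattern (a sketch):

    # An upset: the winner's rank number is larger (worse) than the loser's.
    matches_df["is_upset"] = matches_df["winner_rank"] > matches_df["loser_rank"]

    # Share of upsets, in %, per tournament level and surface.
    upset_pct = (
        matches_df.pivot_table(
            index="tourney_level",
            columns="surface",
            values="is_upset",
            aggfunc="mean",
            fill_value=0,
        )
        * 100
    ).round(1)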

Framing a Prediction Problem

As hinted in the introduction, our prediction problem is to predict the outcome of a specific ATP tour tennis match between two designated players (‘player 1’ and ‘player 2’), using only information that would be known before the match commences (to make our model actually useful). This prediction problem is a Binary Classification task because the model will predict one of two possible classes: either player 1 wins, or player 1 loses (so player 2 wins).

The response variable (the variable being predicted) is outcome. This is a binary variable where outcome = 1 if player 1 won the match and outcome = 0 if player 1 lost (i.e., player 2 won).

This variable directly represents the result we set out to predict (win/loss for a designated ‘player 1’). It also simplifies the prediction task compared to predicting the winner’s name (which would be a high-cardinality multiclass problem).

We will primarily use accuracy as the main metric to evaluate our model. Recall that accuracy measures the overall proportion of matches correctly predicted, i.e., across both player 1 wins and player 1 losses. It provides a clear baseline measure of model performance. Since our data preparation method (detailed in the next section) involves duplicating matches (once with the winner as player 1, once with the loser as player 1), the target variable distribution in the training set is perfectly balanced (50% wins, 50% losses), so accuracy will not be skewed by class imbalance.

Information that we would know at “time of prediction” includes:

Data Preprocessing for Matchup Data

This section explains the steps performed by the create_matchup_df function, which transforms raw match records into a clean, structured DataFrame suitable for modeling head-to-head player matchups.

Goal: Build a DataFrame containing all matches between two specified players, labeling each row with which player won.

Selecting Relevant Columns

Filtering Matches Between the Two Players

Renaming Columns for Consistency

Objective: Standardize column names so that columns always refer to “player1” or “player2”, regardless of who actually won in the raw data.

Combining and Cleaning Data

Final Touches

Result: The returned DataFrame has one row per match, with columns:

This structured DataFrame can now be used as input for modeling tasks (e.g., predicting head-to-head outcomes) or for exploratory analysis of player performance over time.
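
A condensed sketch of create_matchup_df, following the steps above (the winner_/loser_ column names come from the dataset; only one representative p1_/p2_ column pair is mapped here, and the other feature pairs follow the same pattern):

    def create_matchup_df(matches_df, player1, player2):
        # Keep only the matches where the two given players met.
        mask = (
            (matches_df["winner_name"] == player1)
            & (matches_df["loser_name"] == player2)
        ) | (
            (matches_df["winner_name"] == player2)
            & (matches_df["loser_name"] == player1)
        )
        h2h = matches_df[mask].copy()

        # Label each row relative to player1, regardless of who won.
        h2h["did_player1_win"] = (h2h["winner_name"] == player1).astype(int)

        # Map winner_/loser_ columns onto p1_/p2_ columns so that p1
        # always refers to player1 (shown for rank only).
        p1_won = h2h["did_player1_win"] == 1
        h2h["p1_rank"] = h2h["winner_rank"].where(p1_won, h2h["loser_rank"])
        h2h["p2_rank"] = h2h["loser_rank"].where(p1_won, h2h["winner_rank"])
        return h2h.sort_values("tourney_date").reset_index(drop=True)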

For example, the following is between Roger Federer and Fernando Gonzalez.

match_date p1_name p1_age p1_rank ... p2_ioc p2_rank_points p2_id did_player1_win
0 2004-03-08 Roger Federer 22 1 ... nan 1120.0 103602 1
1 2004-05-10 Roger Federer 22 1 ... nan 1430.0 103602 1
2 2005-04-11 Roger Federer 23 1 ... nan 1200.0 103602 1
... ... ... ... ... ... ... ... ... ...
10 2007-11-12 Roger Federer 26 1 ... nan 1905.0 103602 0
11 2008-05-25 Roger Federer 26 1 ... nan 1160.0 103602 1
12 2009-03-12 Roger Federer 27 2 ... nan 2650.0 103602 1

13 rows × 22 columns

This matchup_df acts as an intermediate step for our final DataFrame for prediction. It currently has the following issues that we need to solve:

  1. Scarcity in the data: most players have met each other only a handful of times from 2000-2024. Training a model on 10-40 data points will likely lead to poor performance and overfitting, even for predicting future matches between those same two players.
  2. Lack of generalizability: a model trained only on (for instance) Nadal vs. Federer data will learn patterns specific to their interactions (e.g., the effect of surface, maybe specific psychological edges). It will lack predictive power for a match between any other two players because it hasn’t seen data representing their skills, ranks, or interactions.

Solution: instead of subsetting to one specific pair of players, use the entire matches_df. We need to restructure it so that each row represents a match with p1/p2 features, alongside the outcome relative to p1, as sketched below.
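
A sketch of this mirroring (only the rank columns are shown; in practice every winner_/loser_ feature pair is renamed the same way):

    # One row per (match, orientation): the real winner as player 1
    # (outcome = 1), then the real loser as player 1 (outcome = 0).
    as_winner = matches_df.rename(
        columns={"winner_rank": "p1_rank", "loser_rank": "p2_rank"}
    )
    as_winner["outcome"] = 1

    as_loser = matches_df.rename(
        columns={"loser_rank": "p1_rank", "winner_rank": "p2_rank"}
    )
    as_loser["outcome"] = 0

    # The mirrored dataset is perfectly balanced: 50% wins, 50% losses.
    training_df = pd.concat([as_winner, as_loser], ignore_index=True)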

Train-Validation-Test Split

We filter training_df into train_df, val_df, and test_df based on the specified year ranges using tourney_date. This creates a natural “future” prediction problem, where we use historical data to predict matches that occur later. We then separate the features and target for each set.
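
A sketch of the chronological split (the year boundaries shown here are illustrative, not the exact ones we used):

    # Train on older matches; validate and test on later ones.
    years = training_df["tourney_date"].dt.year
    train_df = training_df[years <= 2018]
    val_df = training_df[years.between(2019, 2021)]
    test_df = training_df[years >= 2022]

    X_train, y_train = train_df.drop(columns="outcome"), train_df["outcome"]
    X_val, y_val = val_df.drop(columns="outcome"), val_df["outcome"]
    X_test, y_test = test_df.drop(columns="outcome"), test_df["outcome"]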

Baseline Model

We begin our prediction model with simple Logistic Regression, passing in the player ranks (p1_rank, p2_rank) and the match surface. Within our model pipeline, p1_rank and p2_rank are treated as quantitative features, while surface is treated as a nominal feature. A preprocessing step is applied using ColumnTransformer: the two quantitative rank features are scaled using StandardScaler, while the single nominal surface feature is encoded using OneHotEncoder (with drop='first' to avoid multicollinearity). This baseline model achieved an overall accuracy of 64% on the testing set. It performs better than picking a winner at random, but 64% is too low an accuracy to do anything actually meaningful.
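
A sketch of the baseline pipeline as described (hyperparameters are left at defaults except max_iter, which is an illustrative choice):

    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Scale the two rank features; one-hot encode surface, dropping
    # the first category to avoid multicollinearity.
    preprocessor = ColumnTransformer([
        ("ranks", StandardScaler(), ["p1_rank", "p2_rank"]),
        ("surface", OneHotEncoder(drop="first"), ["surface"]),
    ])

    baseline = Pipeline([
        ("preprocess", preprocessor),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    baseline.fit(X_train, y_train)
    print(f"test accuracy: {baseline.score(X_test, y_test):.2f}")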

The accompanying classification report shows identical precision, recall, and F1-scores of 0.64 for predicting both class 0 (player 1 lost) and class 1 (player 1 won). This uniformity is expected given the perfectly balanced nature of the test set (in the preprocessing step, we mirrored p1 winning and losing, creating symmetrical entries for each match outcome). Hence, we place high importance on accuracy as the main evaluation criterion for our model.

The confusion matrix visually confirms this balanced performance: the number of correctly predicted wins (true positives for class 1: 3871) is very close to the number of correctly predicted losses (true negatives for class 0: 3853). Similarly, the number of false positives (2209) is nearly identical to the number of false negatives (2191).

    baseline model classification report (on the testing set):
                   precision    recall  f1-score   support
    
               0       0.64      0.64      0.64      6062
               1       0.64      0.64      0.64      6062
    
        accuracy                           0.64     12124
       macro avg       0.64      0.64      0.64     12124
    weighted avg       0.64      0.64      0.64     12124

Final Model

In this final step, we develop a more robust predictive model to forecast the outcome of tennis matches. The procedure involves several key stages, detailed below:

Feature Engineering

Before training the model, the data undergoes significant feature engineering, which ended up enhancing predictive accuracy:

Preprocessing Pipeline

After engineering these features, the data is processed for model training:

Model Selection and Training

The predictive model selected is a HistGradientBoostingClassifier, a robust gradient-boosting algorithm well suited to structured tabular data. HistGradientBoostingClassifier excels at capturing complex, non-linear relationships and interactions between features. It builds decision trees sequentially, with each new tree attempting to correct the errors made by the previous ones, which can yield higher accuracy, especially when provided with engineered features. A minimal sketch follows.
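
This sketch reuses the preprocessor and splits from the baseline section as a stand-in for the full engineered-feature pipeline; the hyperparameters shown are illustrative, not our tuned values:

    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.pipeline import Pipeline

    # Gradient-boosted trees over the (engineered) feature matrix.
    final_model = Pipeline([
        ("preprocess", preprocessor),
        ("clf", HistGradientBoostingClassifier(max_iter=300, learning_rate=0.1)),
    ])
    final_model.fit(X_train, y_train)
    print(f"test accuracy: {final_model.score(X_test, y_test):.2f}")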

Model Evaluation

    final model classification report (on the testing set):
                   precision    recall  f1-score   support
    
               0       0.92      0.92      0.92      6062
               1       0.92      0.92      0.92      6062
    
        accuracy                           0.92     12124
       macro avg       0.92      0.92      0.92     12124
    weighted avg       0.92      0.92      0.92     12124

Conclusion and Practical Implications: This carefully crafted predictive modeling approach provides a powerful tool for accurately forecasting tennis match outcomes. Such a model is valuable for sports analytics, betting strategies, player assessments, and understanding factors driving match results.