tennis-prediction
Who is the ATP Superstar?
Jeremy, Pranav
Introduction
The Association of Tennis Professionals (ATP) is the main governing body for men’s professional tennis, and the ATP Tour is a worldwide series of tournaments featuring the best tennis players around the globe.
In our dataset, we have 74,906 matches spanning 2000-2024.
There are 49 columns in our dataset. We used 28 of them. Many of these are already self-explanatory, but we note clarifications on the columns we used, as necessary, below:
tourney_id
- a unique identifier for each tournament, such as 2020-888. The exact formats are borrowed from several different sources, so while the first four characters are always the year, the rest of the ID doesn’t follow a predictable structure.
tourney_name
surface
tourney_level
- For men: ‘G’ = Grand Slams, ‘M’ = Masters 1000s, ‘A’ = other tour-level events, ‘C’ = Challengers, ‘S’ = Satellites/ITFs, ‘F’ = Tour finals and other season-ending events, and ‘D’ = Davis Cup
- For women, there are several additional tourney_level codes, including ‘P’ = Premier, ‘PM’ = Premier Mandatory, and ‘I’ = International. The various levels of ITFs are given by the prize money (in thousands), such as ‘15’ = ITF $15,000. Other codes, such as ‘T1’ for Tier I (and so on) are used for older WTA tournament designations. ‘D’ is used for Federation/Fed/Billie Jean King Cup, and also for Wightman Cup and Bonne Bell Cup.
- Others, eventually for both genders: ‘E’ = exhibition (events not sanctioned by the tour, though the definitions can be ambiguous), ‘J’ = juniors, and ‘T’ = team tennis, which does not yet appear anywhere in the dataset but will at some point.
tourney_date
- eight digits, YYYYMMDD, usually the Monday of the tournament week.
(winner/loser)_id
- the player_id used in this repo for the winner / loser of the match
(winner/loser)_seed
- the winner / loser’s seed in the tournament
(winner/loser)_entry
- ‘WC’ = wild card, ‘Q’ = qualifier, ‘LL’ = lucky loser, ‘PR’ = protected ranking, ‘ITF’ = ITF entry, and there are a few others that are occasionally used.
(winner/loser)_name
(winner/loser)_hand
- R = right, L = left, U = unknown. For ambidextrous players, this is their serving hand.
(winner/loser)_ht
- height in centimeters, where available
(winner/loser)_ioc
- three-character country code
(winner/loser)_age
- age, in years, as of the tourney_date
best_of
- ‘3’ or ‘5’, indicating the number of sets for this match
round
minutes
- match length, where available
(winner/loser)_rank
- winner / loser’s ATP or WTA rank, as of the tourney_date, or the most recent ranking date before the tourney_date
(winner/loser)_rank_points
- number of ranking points, where available
Some initial questions we brainstormed are:
- How much does player rank predict match outcomes? Is the rank gap a significant factor?
- Does the surface type significantly impact match statistics like rally length (minutes)?
- How strongly do pre-match player statistics, such as rank, age, height, and seeding, predict the match winner?
- Is there a performance difference between left-handed and right-handed players, potentially varying by surface?
Ultimately, the question we plan to investigate further in this project is: How strongly do pre-match player statistics, such as rank, age, height, and seeding, predict the match winner?
This question delves into the core predictability of tennis matches based on readily available player information before the match starts. It also explores fundamental factors often discussed by commentators and analysts.
Data Cleaning and Exploratory Data Analysis
Data Cleaning and Imputation
First, we need to combine our datasets. The dataset we collected from Kaggle is split by year, so we can simply concatenate our CSV files. We will add two new columns, `match_id` and `year`, to better differentiate matches.
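A minimal sketch of this step, assuming the yearly Kaggle files live in a `data/` directory with names like `atp_matches_2000.csv` (the paths are assumptions, not the project's exact layout):

```python
import pandas as pd

yearly_dfs = []
for year in range(2000, 2025):
    df_year = pd.read_csv(f"data/atp_matches_{year}.csv")  # assumed file naming
    df_year["year"] = year                 # new column: season the match belongs to
    yearly_dfs.append(df_year)

matches_df = pd.concat(yearly_dfs, ignore_index=True)
matches_df["match_id"] = matches_df.index  # new column: unique row identifier
```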
| | tourney_id | tourney_name | surface | draw_size | ... | loser_rank | loser_rank_points | year | match_id |
|---|---|---|---|---|---|---|---|---|---|
0 | 2000-301 | Auckland | Hard | 32 | ... | 63.0 | 595.0 | 2000 | 0 |
1 | 2000-301 | Auckland | Hard | 32 | ... | 49.0 | 723.0 | 2000 | 1 |
2 | 2000-301 | Auckland | Hard | 32 | ... | 59.0 | 649.0 | 2000 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
74903 | 2024-M-DC-2024-WG2-PO-VIE-RSA-01 | Davis Cup WG2 PO: VIE vs RSA | Hard | 4 | ... | NaN | NaN | 2024 | 74903 |
74904 | 2024-M-DC-2024-WG2-PO-VIE-RSA-01 | Davis Cup WG2 PO: VIE vs RSA | Hard | 4 | ... | 416.0 | 109.0 | 2024 | 74904 |
74905 | 2024-M-DC-2024-WG2-PO-VIE-RSA-01 | Davis Cup WG2 PO: VIE vs RSA | Hard | 4 | ... | NaN | NaN | 2024 | 74905 |
74906 rows × 51 columns
For ease of access and analysis later on, we first changed the tournament date to a YYYY-MM-DD datetime format using pandas.
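In pandas this is a one-line conversion; a sketch assuming the combined `matches_df` from above:

```python
# tourney_date arrives as an 8-digit integer (YYYYMMDD); parse it into a datetime
matches_df["tourney_date"] = pd.to_datetime(
    matches_df["tourney_date"].astype(str), format="%Y%m%d"
)
```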
Next, we consider columns with missing values; we later fill in these values using imputation strategies, ensuring consistent data across the board for our model to analyze. We found that the following columns need to be cleaned (due to missing entries or inconsistency within the category):
`surface`, `winner_seed`, `winner_entry`, `winner_ht`, `winner_age`, `loser_seed`, `loser_entry`, `loser_hand`, `loser_ht`, `loser_age`, `minutes`, `w_ace`, `w_df`, `w_svpt`, `w_1stIn`, `w_1stWon`, `w_2ndWon`, `w_SvGms`, `w_bpSaved`, `w_bpFaced`, `l_ace`, `l_df`, `l_svpt`, `l_1stIn`, `l_1stWon`, `l_2ndWon`, `l_SvGms`, `l_bpSaved`, `l_bpFaced`, `winner_rank`, `winner_rank_points`, `loser_rank`, `loser_rank_points`
The resulting table shows the distribution of missing values in the columns as follows:
```
surface                   53
winner_seed            43786
winner_entry           65400
winner_ht               1425
winner_age                 5
loser_seed             57668
loser_entry            59484
loser_hand                 4
loser_ht                2909
loser_age                  3
minutes                 8174
w_ace                   6520
w_df                    6520
w_svpt                  6520
w_1stIn                 6520
w_1stWon                6520
w_2ndWon                6520
w_SvGms                 6520
w_bpSaved               6520
w_bpFaced               6520
l_ace                   6520
l_df                    6520
l_svpt                  6520
l_1stIn                 6520
l_1stWon                6520
l_2ndWon                6520
l_SvGms                 6520
l_bpSaved               6520
l_bpFaced               6520
winner_rank              573
winner_rank_points       573
loser_rank              1468
loser_rank_points       1468
```
Handling Categorical Features
- Surface: The surface is consistent across matches within a tournament, so our first-layer imputation fills a missing match surface with the tournament’s surface. However, some tournaments did not report a surface at all, so our second-layer imputation sets the match surface to the most common surface of that year.
- Hand: For players whose dominant hand is missing, our imputation strategy defaults the dominant hand to right (the more common hand).
- Seed: A missing seed means the player was not seeded for that tournament, so to maintain consistency with seed values (1-32), we set the default seed for unseeded players to 99 (higher number = lower seeding priority).
- Entry: Missing entry into the tournament for a player means they entered via main draw, so we can impute with the string “MD” which simply means main draw.
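A sketch of these categorical imputations, assuming the combined `matches_df`; the per-tournament mode logic is one possible implementation of the two-layer surface strategy:

```python
# Layer 1: fill missing surfaces from the same tournament's reported surface
matches_df["surface"] = matches_df.groupby("tourney_id")["surface"].transform(
    lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s
)
# Layer 2: fall back to the most common surface of that year
yearly_mode = matches_df.groupby("year")["surface"].transform(lambda s: s.mode().iloc[0])
matches_df["surface"] = matches_df["surface"].fillna(yearly_mode)

# Hand: default missing dominant hands to right (more common hand bias)
matches_df["loser_hand"] = matches_df["loser_hand"].fillna("R")

# Seed: unseeded players get 99 (higher number = lower seeding priority)
for col in ["winner_seed", "loser_seed"]:
    matches_df[col] = pd.to_numeric(matches_df[col], errors="coerce").fillna(99)

# Entry: a missing entry means the player came through the main draw
for col in ["winner_entry", "loser_entry"]:
    matches_df[col] = matches_df[col].fillna("MD")
```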
Handling Numerical Features
- Age, Height: For the sake of simplicity, we imputed missing player age / height with the median age / height of all players recorded.
- Duration (minutes): We imputed missing match durations with the median duration of matches from the same round.
- In-Match Stats: For stats typically recorded during a match (aces, double faults, serve points, etc.), our imputation strategy used median imputation.
- Rank / Rank Points: Logically, players who are consistently ranked are less likely to have missing rank data, so we impute unranked players with the worst observed rank.
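The numerical imputations can be sketched similarly; filling missing rank points with 0 is an extra assumption not spelled out above:

```python
# Age / height: global median imputation
for col in ["winner_age", "loser_age", "winner_ht", "loser_ht"]:
    matches_df[col] = matches_df[col].fillna(matches_df[col].median())

# Duration: median duration of matches from the same round
round_median = matches_df.groupby("round")["minutes"].transform("median")
matches_df["minutes"] = matches_df["minutes"].fillna(round_median)

# In-match stats (w_ace, l_df, ...): global median imputation
stat_cols = [c for c in matches_df.columns if c.startswith(("w_", "l_"))]
matches_df[stat_cols] = matches_df[stat_cols].fillna(matches_df[stat_cols].median())

# Rank: unranked players get the worst observed rank;
# filling rank points with 0 is an assumption for illustration
for side in ["winner", "loser"]:
    matches_df[f"{side}_rank"] = matches_df[f"{side}_rank"].fillna(
        matches_df[f"{side}_rank"].max()
    )
    matches_df[f"{side}_rank_points"] = matches_df[f"{side}_rank_points"].fillna(0)
```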
Because we imputed numerous columns, displaying visualizations for all of them would detract from our focus. Thus, we selected the three columns with the highest missing counts to visualize their distributions before and after imputation.
Overall, the imputation largely preserved the original right-skewed distribution of match durations, but the significantly taller peak around 80-100 minutes in the “After” plot shows that the missing values were filled using a central tendency measure (median).
The distribution of winner aces remains highly right-skewed after imputation; the main visual impact is the increased frequency (taller bars) for lower ace counts, especially around the median number of aces (which was used to fill the missing data points).
Imputing missing heights significantly increased the data density around the typical player height range (approx. 180-195 cm), with the pronounced central peak in the “After” plot, reflecting missing values being filled by the global median height.
Univariate Analysis
This plot shows the frequency of different ranks among match winners. We expect a skew toward low rank numbers because better players win more often. The log scale helps visualize the long right tail of winners with worse (numerically higher) ranks. This insight helps us answer our question because it tells us that our model should place greater emphasis on a player’s rank, since highly ranked players win far more often than lower-ranked players.
This plot shows the frequency of matches played on different surfaces in the dataset. We observe that hard courts are the most common, followed by clay, grass, and carpet. We can use this information to gain more insight into our question by seeing how the surface (and its match count) matters at each tournament level (higher stakes, “winner’s court”).
Bivariate Analysis
This plot investigates if height difference correlates with winning, especially in the context of rank difference. Points in the bottom-left quadrant represent shorter players winning against higher-ranked opponents (upset). Points in the top-right represent taller players winning against lower-ranked opponents. We see that there is a large cluster near the origin with some notable outliers in terms of their height difference.
Interesting Aggregates
| tourney_level | Carpet | Clay | Grass | Hard |
|---|---|---|---|---|
A | 92.8 | 99.7 | 92.1 | 95.9 |
D | 101.6 | 103.9 | 103.6 | 104.0 |
F | 103.8 | 0.0 | 0.0 | 103.5 |
G | 0.0 | 147.9 | 141.7 | 149.9 |
M | 93.8 | 102.2 | 0.0 | 99.2 |
O | 0.0 | 99.6 | 0.0 | 0.0 |
This table compares the average length of matches (in minutes) across different tournament levels (i.e. Grand Slams ‘G’, Masters ‘M’, etc.) and surfaces. Entries with 0 mean that no match was played on that surface at that tournament level. We can observe that Grand Slam matches on clay, grass, and hard surfaces are typically longer than other ATP matches. This is to be expected since Grand Slam matches are played best-of-five sets and carry extremely high, career-defining stakes.
| tourney_level | Carpet | Clay | Grass | Hard |
|---|---|---|---|---|
A | 38.1 | 37.8 | 37.9 | 35.6 |
D | 30.0 | 24.9 | 32.3 | 26.2 |
F | 46.7 | 0.0 | 0.0 | 35.3 |
G | 0.0 | 28.6 | 30.3 | 27.8 |
M | 39.2 | 35.1 | 0.0 | 35.9 |
O | 0.0 | 25.0 | 0.0 | 0.0 |
This table shows the percentage of matches won by the lower-ranked player (“upsets”) across different tournament levels and surfaces. The `is_upset` variable becomes True only for those rows in matches_df where the player listed as the winner actually had a worse rank (reminder: a higher rank value means an actually lower rank) than the player listed as the loser. Overall, this table provides insight into identifying environments where rankings might be less predictive and upsets are more common.
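A sketch of how `is_upset` and this table can be computed; the `pivot_table` call is our reconstruction, not necessarily the exact code used:

```python
# An upset: the winner's rank number is larger (i.e. worse) than the loser's
matches_df["is_upset"] = matches_df["winner_rank"] > matches_df["loser_rank"]

# Mean of the boolean flag per (tournament level, surface) cell, as a percentage
upset_pct = (
    matches_df.pivot_table(
        index="tourney_level",
        columns="surface",
        values="is_upset",
        aggfunc="mean",
    )
    * 100
).round(1)
print(upset_pct)
```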
Framing a Prediction Problem
As hinted in the introduction, our prediction problem is to predict the outcome of a specific ATP tour tennis match between two designated players (‘player 1’ and ‘player 2’), using only information that would be known before the match commences (to make our model actually useful). This prediction problem is a Binary Classification task because the model will predict one of two possible classes: either player 1 wins, or player 1 loses (so player 2 wins).
The response variable (the variable being predicted) is outcome. This will be a binary variable where:
- outcome = 1 signifies that Player 1 won the match.
- outcome = 0 signifies that Player 1 lost the match.
This variable directly represents the result we set out to predict (win/loss for a designated ‘player 1’). It also simplifies the prediction task compared to predicting the winner’s name (which would be a high-cardinality multiclass problem).
We will primarily use accuracy as the main metric to evaluate our model. Recall that accuracy measures the overall proportion of matches correctly predicted, i.e., across both the player-1-wins and player-2-wins classes. It provides a clear baseline measure of model performance. Since our data preparation method (detailed in the next section) involves duplicating matches (once with the winner as player 1, once with the loser as player 1), the target variable distribution in the training set is perfectly balanced (50% wins, 50% losses), so accuracy won’t be skewed by class imbalance.
Information that we would know at “time of prediction” includes:
- Player attributes: id, age, hand, height, nationality (IOC).
- Player status: rank, rank points, tournament seed, tournament entry.
- Match context: tournament name, tournament level, surface, round within the tournament, format (‘best_of’ 3 or 5 sets).
- Historical performance (calculated before the current match via feature engineering):
- head-to-head (H2H) win/loss record between the two specific players prior to this match.
Data Preprocessing for Matchup Data
This section explains the steps performed by the `create_matchup_df` function, which transforms raw match records into a clean, structured DataFrame suitable for modeling head-to-head player matchups.
Goal: Build a DataFrame containing all matches between two specified players, labeling each row with which player won.
- Input Arguments: `player1_name` (str), the name of the first player, and `player2_name` (str), the name of the second player.
- Output: A pandas DataFrame (`matchup_df`) sorted by match date, with standardized column names and correct data types.
Selecting Relevant Columns
- Define Expected Columns: A list of column names (`extracted_cols`) that includes match metadata, winner attributes (age, rank, seed, etc.), and loser attributes.
- Filter by Availability: Create `available_cols` by intersecting `extracted_cols` with the actual columns in the source DataFrame `matches_df`. This ensures compatibility even if some columns are missing.
Filtering Matches Between the Two Players
- Player1 Wins: Rows where `winner_name == player1_name` and `loser_name == player2_name`.
- Player2 Wins: Rows where `winner_name == player2_name` and `loser_name == player1_name`.
- Extract these subsets into `p1_wins_df` and `p2_wins_df`, respectively, using only `available_cols`.
Renaming Columns for Consistency
Objective: Standardize column names so that columns always refer to “player1” or “player2”, regardless of who actually won in the raw data.
- Player1-Win Subset (`p1_wins_df`):
  - Map raw columns: `winner_name` → `p1_name`; `loser_name` → `p2_name`; `winner_age`, `winner_rank`, … → `p1_age`, `p1_rank`, …; `loser_age`, `loser_rank`, … → `p2_age`, `p2_rank`, …
  - Add binary label: `did_player1_win = 1`
- Player2-Win Subset (`p2_wins_df`):
  - Swap roles in the rename map: `loser_name` → `p1_name` (because player1 lost); `winner_name` → `p2_name`.
  - Rename other attributes accordingly.
  - Add binary label: `did_player1_win = 0`
Combining and Cleaning Data
- Concatenate Subsets: Merge `p1_wins_df` and `p2_wins_df` into a single DataFrame `matchup_df`.
- Type Casting:
  - Convert age, rank, and IDs to integer types.
  - Convert rank points, heights, and seeds to numeric (float) types, allowing for missing values.
  - Ensure categorical fields (e.g., `p1_hand`, `p2_entry`, `p1_ioc`) are strings.
  - Enforce the victory indicator `did_player1_win` as integer.
Final Touches
- Sorting: Sort the DataFrame by `match_date` in ascending order.
- Index Reset: Reset the DataFrame index to ensure a clean, continuous index after sorting.

Result: The returned DataFrame has one row per match, with columns:

- `match_date`: Date of the match (`datetime`)
- `p1_name`, `p1_age`, `p1_rank`, etc.: Attributes for player1
- `p2_name`, `p2_age`, `p2_rank`, etc.: Attributes for player2
- `did_player1_win`: Binary target variable (1 if player1 won, 0 otherwise)
This structured DataFrame can now be used as input for modeling tasks (e.g., predicting head-to-head outcomes) or for exploratory analysis of player performance over time.
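A minimal sketch of what `create_matchup_df` could look like under the steps above; the exact `extracted_cols` list is abbreviated here, and the type-casting details are omitted:

```python
def create_matchup_df(matches_df, player1_name, player2_name):
    """Return all matches between two players, standardized to p1/p2 columns."""
    extracted_cols = [
        "tourney_date", "surface",
        "winner_name", "winner_age", "winner_rank", "winner_rank_points",
        "loser_name", "loser_age", "loser_rank", "loser_rank_points",
    ]
    available_cols = [c for c in extracted_cols if c in matches_df.columns]

    # Split by who won
    p1_wins_df = matches_df[
        (matches_df["winner_name"] == player1_name)
        & (matches_df["loser_name"] == player2_name)
    ][available_cols].copy()
    p2_wins_df = matches_df[
        (matches_df["winner_name"] == player2_name)
        & (matches_df["loser_name"] == player1_name)
    ][available_cols].copy()

    def rename_map(winner_prefix, loser_prefix):
        # e.g. winner_rank -> p1_rank when the winner is player1
        return {c: c.replace("winner", winner_prefix).replace("loser", loser_prefix)
                for c in available_cols if c.startswith(("winner", "loser"))}

    # When player1 won, winner_* columns describe player1
    p1_wins_df = p1_wins_df.rename(columns=rename_map("p1", "p2"))
    p1_wins_df["did_player1_win"] = 1
    # When player2 won, the roles are swapped
    p2_wins_df = p2_wins_df.rename(columns=rename_map("p2", "p1"))
    p2_wins_df["did_player1_win"] = 0

    matchup_df = pd.concat([p1_wins_df, p2_wins_df], ignore_index=True)
    matchup_df = matchup_df.rename(columns={"tourney_date": "match_date"})
    return matchup_df.sort_values("match_date").reset_index(drop=True)
```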
For example, the following is between Roger Federer and Fernando Gonzalez.
| | match_date | p1_name | p1_age | p1_rank | ... | p2_ioc | p2_rank_points | p2_id | did_player1_win |
|---|---|---|---|---|---|---|---|---|---|
0 | 2004-03-08 | Roger Federer | 22 | 1 | ... | nan | 1120.0 | 103602 | 1 |
1 | 2004-05-10 | Roger Federer | 22 | 1 | ... | nan | 1430.0 | 103602 | 1 |
2 | 2005-04-11 | Roger Federer | 23 | 1 | ... | nan | 1200.0 | 103602 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10 | 2007-11-12 | Roger Federer | 26 | 1 | ... | nan | 1905.0 | 103602 | 0 |
11 | 2008-05-25 | Roger Federer | 26 | 1 | ... | nan | 1160.0 | 103602 | 1 |
12 | 2009-03-12 | Roger Federer | 27 | 2 | ... | nan | 2650.0 | 103602 | 1 |
13 rows × 22 columns
This `matchup_df` acts as an intermediate step toward our final DataFrame for prediction. It currently has the following issues that we need to solve:
- Scarcity in the data: Most player pairs have only met a handful of times from 2000-2024. Training a model on 10-40 data points will likely lead to poor performance and overfitting, even for predicting future matches between those same two players.
- Lack of generalizability: A model trained only on (for instance) Nadal vs. Federer data will learn patterns specific to their interactions (e.g., the effect of surface, or specific psychological edges). It will lack predictive power for a match between any other two players because it hasn’t seen data representing their skills, ranks, or interactions.
Solution: Instead of subsetting to one specific pair of players, we use the entire matches_df, restructuring it so that each row represents a match with p1/p2 features alongside the outcome relative to p1.
Train-Validation-Test Split
We filter the `training_df` into `train_df`, `val_df`, and `test_df` based on the specified year ranges using the tourney_date. This creates a natural “future” prediction problem where we use historical data to predict matches that occur in the future. We then separate the features and target for each set.
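A sketch of this split; the training cutoff of 2019 is an assumption, while the validation (2020-2022) and test (2023 onward) ranges follow the model-evaluation section below:

```python
# Year-based split using tourney_date to avoid training on "future" matches
train_df = training_df[training_df["tourney_date"].dt.year <= 2019]
val_df = training_df[training_df["tourney_date"].dt.year.between(2020, 2022)]
test_df = training_df[training_df["tourney_date"].dt.year >= 2023]

# Separate features and target for each set
feature_cols = [c for c in training_df.columns if c != "did_player1_win"]
X_train, y_train = train_df[feature_cols], train_df["did_player1_win"]
X_val, y_val = val_df[feature_cols], val_df["did_player1_win"]
X_test, y_test = test_df[feature_cols], test_df["did_player1_win"]
```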
Baseline Model
We begin our prediction model using simple Logistic Regression, passing in player ranks (`p1_rank`, `p2_rank`) and the match surface. Within our model pipeline, `p1_rank` and `p2_rank` are treated as quantitative features, while `surface` is treated as a nominal feature. A preprocessing step is applied using `ColumnTransformer`: the two quantitative rank features are scaled using `StandardScaler`, while the single nominal surface feature is encoded using `OneHotEncoder` (with `drop='first'` to avoid multicollinearity). This baseline model achieved an overall accuracy of 64% on the testing set. Our model performs better than picking a winner by random chance, but 64% is too low an accuracy to do anything actually meaningful.
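The baseline pipeline can be sketched as follows, using the `X_train` / `y_train` split from above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

baseline_model = Pipeline([
    ("preprocess", ColumnTransformer([
        # scale the two quantitative rank features
        ("ranks", StandardScaler(), ["p1_rank", "p2_rank"]),
        # one-hot encode surface, dropping the first category to avoid multicollinearity
        ("surface", OneHotEncoder(drop="first"), ["surface"]),
    ])),  # remaining columns are dropped by default
    ("clf", LogisticRegression()),
])

baseline_model.fit(X_train, y_train)
print(f"Test accuracy: {baseline_model.score(X_test, y_test):.2f}")
```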
The accompanying classification report shows identical precision, recall, and f1-scores of 0.64 for predicting both class 0 (player 1 lost) and class 1 (player 1 won). This uniformity is expected given the perfectly balanced nature of the test set (in the preprocessing step, we mirrored p1 winning and losing, creating symmetrical entries for each match outcome). Hence, we place high importance on accuracy as the main evaluation criterion for our model.
The confusion matrix visually confirms this balanced performance: the number of correctly predicted wins (true positives for class 1, 3871) is very close to the number of correctly predicted losses (true negatives for class 0, 3853). Similarly, the number of false positives (2209) is nearly identical to the number of false negatives (2191).
Baseline model classification report (on the testing set):

```
              precision    recall  f1-score   support

           0       0.64      0.64      0.64      6062
           1       0.64      0.64      0.64      6062

    accuracy                           0.64     12124
   macro avg       0.64      0.64      0.64     12124
weighted avg       0.64      0.64      0.64     12124
```
Final Model
In this final step, a more robust predictive model is developed to forecast the outcome of tennis matches. The procedure involves several key stages, detailed clearly below:
Feature Engineering
Before training the model, the data undergoes significant feature engineering, which did end up enhancing predictive accuracy:
- Differences in Player Statistics: Three critical new features are created (see the sketch following this list):
  - `age_diff`: The difference in age between the two players. Instead of using the individual age columns, this difference can indicate disparities in on-court experience versus physical prime or stamina, factors that particularly influence performance.
  - `rank_diff`: The difference in player rankings, indicating relative skill level. This feature directly quantifies the perceived skill gap based on rank. It helps the model differentiate expected close contests from likely mismatches more effectively than using individual ranks alone.
  - `seed_diff`: The difference in tournament seedings, reflecting tournament expectations. Seeding reflects the tournament organizers’ expectations within the specific event context (considering factors beyond just rank), so this difference provides additional context on expected dominance or competitiveness.
- Tournament Round Encoding: The round of the tournament (`R128`, `R64`, …, `SF`, `F`) is transformed into a numeric value (`round_num`), with higher rounds assigned greater values, capturing the importance of match context. Later rounds inherently involve higher stakes, potentially accumulated fatigue, and usually stronger remaining opponents. Thus, encoding this feature captures the changing context and pressure as a tournament progresses.
- Head-to-Head (H2H) Performance: For each player pair, a historical winning ratio (`h2h`) is computed based on their past encounters. A strong H2H record can often explain deviations from rank-based expectations and is a powerful predictor in established rivalries. If there are no previous matches, a neutral value (0.5) is assigned, reflecting no bias toward either player.
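A sketch of these engineered features; the column names follow the p1/p2 convention above, and the `round_order` mapping and cumulative H2H computation are our illustrative implementations:

```python
training_df["age_diff"] = training_df["p1_age"] - training_df["p2_age"]
training_df["rank_diff"] = training_df["p1_rank"] - training_df["p2_rank"]
training_df["seed_diff"] = training_df["p1_seed"] - training_df["p2_seed"]

# Later rounds get larger values (RR = round robin at season-ending events)
round_order = {"R128": 1, "R64": 2, "R32": 3, "R16": 4, "RR": 4, "QF": 5, "SF": 6, "F": 7}
training_df["round_num"] = training_df["round"].map(round_order)

# Head-to-head win ratio from p1's perspective, using only matches *before* the
# current one to avoid leakage. Because the data is mirrored, each physical match
# appears exactly once per ordered (p1_name, p2_name) group.
training_df = training_df.sort_values("tourney_date")
grp = training_df.groupby(["p1_name", "p2_name"])
prior_meetings = grp.cumcount()
prior_p1_wins = grp["did_player1_win"].transform(lambda s: s.shift().cumsum())
# First meeting: no history, so default to a neutral 0.5
training_df["h2h"] = (prior_p1_wins / prior_meetings).fillna(0.5)
```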
Preprocessing Pipeline
After engineering these features, the data is processed for model training:
- Numerical Features (`age_diff`, `rank_diff`, `seed_diff`, `round_num`, `best_of`, `minutes`, `h2h`) are standardized using `StandardScaler`. Again, this ensures each feature contributes “equally”, preventing features with larger numeric scales from disproportionately influencing the model.
- Categorical Features (`surface`, `tourney_level`) undergo One-Hot Encoding (`OneHotEncoder`) to convert them into binary features, allowing the model to interpret categorical distinctions effectively.
- Finally, these preprocessing steps are combined into a pipeline.
Model Selection and Training
The predictive model selected is a `HistGradientBoostingClassifier`, a robust gradient-boosting algorithm ideal for structured tabular data. `HistGradientBoostingClassifier` excels at capturing complex, non-linear relationships and interactions between features. It builds decision trees sequentially, with each new tree attempting to correct the errors made by the previous ones, leading to potentially higher accuracy, especially when provided with engineered features.
- Hyperparameter Optimization: A randomized hyperparameter search (`RandomizedSearchCV`) explores multiple parameter combinations to identify optimal model settings. This method efficiently explores a defined distribution of possible hyperparameter values by sampling a fixed number of combinations (`n_iter=30`), making it faster than an exhaustive grid search:
  - `learning_rate`: Controls how quickly the model learns (best: 0.1).
  - `max_iter`: Determines the maximum number of iterations for training (best: 300).
  - `max_depth`: Limits the complexity of decision trees to prevent overfitting (best: 3).
  - `l2_regularization`: Reduces model complexity and overfitting by penalizing complex solutions (best: 0.5).
- Cross-Validation: The search employs 5-fold cross-validation to objectively evaluate and select the best model configuration based on accuracy.
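Putting it together, the pipeline and search might look like the following sketch; the candidate values in the search space are illustrative, while the best configuration matches the values reported above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age_diff", "rank_diff", "seed_diff", "round_num", "best_of", "minutes", "h2h"]
categorical_cols = ["surface", "tourney_level"]

final_pipeline = Pipeline([
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])),
    ("clf", HistGradientBoostingClassifier(random_state=42)),
])

# Candidate values are illustrative; the reported best configuration is
# learning_rate=0.1, max_iter=300, max_depth=3, l2_regularization=0.5
param_distributions = {
    "clf__learning_rate": [0.01, 0.05, 0.1, 0.2],
    "clf__max_iter": [100, 200, 300, 500],
    "clf__max_depth": [3, 5, 7, None],
    "clf__l2_regularization": [0.0, 0.1, 0.5, 1.0],
}

search = RandomizedSearchCV(
    final_pipeline,
    param_distributions,
    n_iter=30,
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(f"Validation accuracy: {search.best_estimator_.score(X_val, y_val):.4f}")
```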
Model Evaluation
- Validation Set Performance (2020-2022): The optimized model achieves an accuracy of 91.38%, demonstrating strong predictive capability on recent matches not seen during training.
- Test Set Performance (2023 and onward): Evaluated on entirely unseen future data, the model achieves an impressive 92.01% accuracy, confirming its robustness and generalization. This marked improvement of roughly 28 percentage points over the baseline underscores the combined value of good feature engineering, which gave the model richer, more comparative information about each matchup, and of selecting a more powerful algorithm than simple Logistic Regression, one capable of effectively utilizing those features to capture the complex dynamics of tennis match outcomes.
Final model classification report (on the testing set):

```
              precision    recall  f1-score   support

           0       0.92      0.92      0.92      6062
           1       0.92      0.92      0.92      6062

    accuracy                           0.92     12124
   macro avg       0.92      0.92      0.92     12124
weighted avg       0.92      0.92      0.92     12124
```
Conclusion and Practical Implications: This carefully crafted predictive modeling approach provides a powerful tool for accurately forecasting tennis match outcomes. Such a model is valuable for sports analytics, betting strategies, player assessments, and understanding factors driving match results.