---
license: mit
language:
- en
metrics:
- accuracy
base_model:
- gpt-omni/mini-omni
tags:
- code
---
# ATP Tennis Match Analysis and Anomaly Detection

This project analyzes ATP tennis match data with a deep learning model that learns joint embeddings of categorical and numerical match features. The objective is to detect anomalies in professional men's tennis tournament draws. It uses PyTorch to build and train the neural network, Optuna for hyperparameter optimization, and DBSCAN for anomaly detection.

## Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Setup](#setup)
- [Usage](#usage)
- [Model Architecture](#model-architecture)
- [Hyperparameter Optimization](#hyperparameter-optimization)
- [Anomaly Detection](#anomaly-detection)
- [Results](#results)
- [Contributing](#contributing)
- [License](#license)

## Overview

The project aims to identify irregularities in tennis matches by examining patterns and discrepancies in player rankings, ages, and other match-related features. This analysis can help detect potential biases or unusual outcomes in tournament draws.

## Features

- **Data Loading and Preprocessing:** Handles ATP match data from multiple years, with preprocessing steps including encoding categorical features and handling missing values.
- **Feature Engineering:** Creates new features such as age difference and rank difference between players.
- **Joint Embedding Neural Network:** A PyTorch-based model that combines categorical and numerical features for robust prediction of match outcomes.
- **Hyperparameter Tuning:** Uses Optuna for efficient optimization of model hyperparameters.
- **Anomaly Detection:** Applies DBSCAN clustering to the embeddings generated by the model to identify anomalies in player performance.
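
The loading and feature-engineering steps listed above might look roughly like the sketch below. It is an illustration only, not the project's actual code: the function names `load_matches` and `add_features` are hypothetical, and the column names (`winner_age`, `loser_rank`, `tourney_id`, and so on) are assumptions about the CSV schema.

```python
from pathlib import Path

import pandas as pd


def load_matches(data_dir: str) -> pd.DataFrame:
    """Concatenate every atp_matches_<year>.csv file found in data_dir."""
    files = sorted(Path(data_dir).glob("atp_matches_*.csv"))
    if not files:
        raise FileNotFoundError(f"no atp_matches_<year>.csv files in {data_dir}")
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, add difference features, encode categoricals."""
    # Assumed column names; adjust them to the real CSV schema.
    numeric = ["winner_age", "loser_age", "winner_rank", "loser_rank"]
    out = df.dropna(subset=numeric).copy()
    out["age_diff"] = out["winner_age"] - out["loser_age"]
    out["rank_diff"] = out["winner_rank"] - out["loser_rank"]
    # Integer codes (0..n-1) are the input form an embedding layer expects.
    for col in ["winner_id", "loser_id", "tourney_id"]:
        out[col + "_code"] = out[col].astype("category").cat.codes
    return out.reset_index(drop=True)
```

Dropping rows with missing ranks or ages is only one way to handle missing values; imputation would be an equally reasonable choice.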

## Setup

### Prerequisites

- Python 3.8 or later
- PyTorch
- Optuna
- Scikit-learn
- Matplotlib
- Pandas
- NumPy

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/atp-tennis-analysis.git
   cd atp-tennis-analysis
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Download the ATP match data files and place them in the project directory. Ensure the files are named in the format `atp_matches_<year>.csv` (e.g., `atp_matches_2000.csv`).

## Usage

Run the main script to load data, preprocess it, and train the model:

```bash
python main.py
```

The script then proceeds through the following steps:

1. **Model training:** The script trains the model using the preprocessed data, optimizing hyperparameters with Optuna, and saves the best-performing model.
2. **Anomaly detection:** The model's predictions are used to perform anomaly detection, identifying unusual matches or player performances.
3. **View results:** Results, including anomaly plots and metrics, will be saved in the output directory. CSV files summarizing the anomalies per player, year, and tournament will also be generated.
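
As a rough illustration of the per-player, per-year, and per-tournament summaries mentioned above, the following sketch aggregates a boolean anomaly flag by an arbitrary grouping column. The function name `summarize_anomalies` and the `is_anomaly` column are hypothetical; the real output files may be organized differently.

```python
import pandas as pd


def summarize_anomalies(matches: pd.DataFrame, by: str) -> pd.DataFrame:
    """Count flagged matches and compute the anomaly rate for each group."""
    summary = (
        matches.groupby(by)["is_anomaly"]
        .agg(matches="count", anomalies="sum")
        .reset_index()
    )
    summary["anomaly_rate"] = summary["anomalies"] / summary["matches"]
    return summary.sort_values("anomalies", ascending=False, ignore_index=True)


# Tiny stand-in table; `year` and `is_anomaly` are assumed column names.
demo = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001],
    "is_anomaly": [True, False, True, True, False],
})
per_year = summarize_anomalies(demo, "year")
print(per_year)
```

Calling the same function with a player or tournament column and writing each result with `DataFrame.to_csv` would produce one summary file per grouping.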

## Model Architecture
The `JointEmbeddedModel` consists of:

- **Embeddings for Categorical Features:** Each categorical variable (e.g., player IDs, tournament IDs) is embedded into a dense vector.
- **Fully Connected Layers:** These layers combine embeddings and numerical features to predict match outcomes.
- **Dropout Layers:** Used to prevent overfitting and improve model generalization.
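
A minimal PyTorch sketch of a model with these three components is shown below. Only the class name comes from this README; the constructor arguments, layer sizes, depth, and the choice to return the hidden representation alongside the prediction are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class JointEmbeddedModel(nn.Module):
    """Embeds categorical inputs and joins them with numerical inputs."""

    def __init__(self, cardinalities, num_numeric, emb_dim=8, hidden=64, dropout=0.2):
        super().__init__()
        # One embedding table per categorical feature (e.g. player, tournament).
        self.embeddings = nn.ModuleList(
            [nn.Embedding(n, emb_dim) for n in cardinalities]
        )
        in_features = emb_dim * len(cardinalities) + num_numeric
        self.encoder = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, x_cat, x_num):
        # x_cat: (batch, n_categorical) integer codes; x_num: (batch, num_numeric).
        embedded = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        joint = self.encoder(torch.cat(embedded + [x_num], dim=1))
        # Return the prediction and the joint representation of each match.
        return self.head(joint).squeeze(1), joint


model = JointEmbeddedModel(cardinalities=[10, 5], num_numeric=2)
x_cat = torch.stack([torch.randint(0, 10, (4,)), torch.randint(0, 5, (4,))], dim=1)
prediction, joint = model(x_cat, torch.randn(4, 2))
print(prediction.shape, joint.shape)
```

Returning the joint representation makes it easy to pass per-match vectors to a clustering step afterwards.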

## Hyperparameter Optimization
The project uses Optuna to automatically search for the best combination of model parameters, including:

- Embedding dimension
- Hidden layer size
- Learning rate
- Batch size
- Dropout rate

## Anomaly Detection
Anomalies are detected by comparing expected and actual rank differences in matches using DBSCAN clustering. Anomalies can indicate unexpected match outcomes, potential biases, or errors in player rankings.
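
One common way to use DBSCAN for this purpose is to treat the points it labels as noise (label `-1`) as anomalies. The sketch below applies that idea to a synthetic set of vectors; the function name `find_anomalies`, the `eps` and `min_samples` values, and the use of noise points as the anomaly criterion are assumptions, not a description of the project's exact procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def find_anomalies(vectors: np.ndarray, eps: float = 0.5, min_samples: int = 5) -> np.ndarray:
    """Return the row indices that DBSCAN leaves outside every cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(vectors)
    return np.flatnonzero(labels == -1)  # -1 is DBSCAN's noise label


# Synthetic stand-in: 200 tightly grouped points plus 3 isolated ones.
rng = np.random.default_rng(0)
typical = rng.normal(0.0, 0.1, size=(200, 2))
unusual = np.array([[5.0, 5.0], [-5.0, 5.0], [5.0, -5.0]])
anomalies = find_anomalies(np.vstack([typical, unusual]))
print(anomalies)
```

In practice `eps` depends on the scale of the input vectors, so it usually needs tuning or the inputs need standardizing first.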

## Results
- **Positive Anomalies:** Matches where the predicted rank difference was significantly lower than expected.
- **Negative Anomalies:** Matches where the predicted rank difference was significantly higher than expected.

The results are visualized using t-SNE plots and saved as images and CSV files.
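
A t-SNE plot of this kind could be produced along the lines of the sketch below, which projects high-dimensional vectors to two dimensions and highlights flagged rows. The function name `plot_tsne`, the output file name, the perplexity, and the random stand-in data are all hypothetical.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE


def plot_tsne(vectors, is_anomaly, path="tsne_anomalies.png"):
    """Project vectors to 2-D with t-SNE and save a scatter plot."""
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)
    fig, ax = plt.subplots(figsize=(6, 5))
    ax.scatter(*coords[~is_anomaly].T, s=8, c="tab:blue", label="typical")
    ax.scatter(*coords[is_anomaly].T, s=24, c="tab:red", label="anomaly")
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
    ax.legend()
    fig.savefig(Path(path), dpi=150)
    plt.close(fig)
    return coords


# Random stand-in data: 100 vectors of dimension 16, the last 5 flagged.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 16))
flags = np.zeros(100, dtype=bool)
flags[-5:] = True
coords = plot_tsne(vectors, flags)
```

Note that t-SNE requires the perplexity to be smaller than the number of samples, so very small datasets need a lower value.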

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an Issue for any improvements or bugs you encounter.

## License
This project is licensed under the MIT License. See the LICENSE file for more details.