In this project, I will use the Kaggle "European Soccer Database" which can be found at https://www.kaggle.com/hugomathien/soccer. The "European Soccer Database" consists on 7 tables: Country, League, Match, Player, Player_Attributes, Team and Team_Attributes.
The Match table contains data over 8 soccer seasons and stores: the date a match was played, the teams that played the match, the number of goals each team made, the players in the match and other match statistics.
The Player table contains players name and vital statistics such as weight, height and birthdate.
The Player_Attributes table contains the players statistics such as their overall rating, potential, prefered foot etc.
The Team_Attributes table records the team's attributes such as build up play, chance creation and defence scores at a given time.
The Team table stores team id and team names that are used for referencing.
The Country table stores the country id and country name used for referencing.
The League tables stores league id, country id and league name used for referencing.
I will be analyzing a dataset derived from the "European Soccer Database". The dataset ecompasses data from the Match, Team, Team_Attributes, Country and League tables. The Dataset was derived usng the SQL query below:
CREATE TABLE team_match_data AS
WITH team_data AS (
SELECT *
FROM Team_Attributes ta
JOIN Team t
ON ta.team_api_id = t.team_api_id),
teamdate_data AS (
SELECT *,
lead(team_api_id) OVER (PARTITION BY team_api_id ORDER BY "date") AS lead_team_api_id,
lead("date",1,"2016-05-25 00:00:00") OVER (PARTITION BY team_api_id ORDER BY "date") AS end_date
FROM team_data)
SELECT m.match_api_id,
m."date" match_dt,
m.country_id,
c.name country_name,
m.league_id,
l.name league_name,
m.season,
m.home_team_api_id,
m.away_team_api_id,
td.team_api_id,
td.team_long_name,
td.team_short_name,
m.home_team_goal,
m.away_team_goal,
td."date" t_attr_start_dt,
td.end_date t_attr_end_dt,
td.buildUpPlaySpeed bu_play_speed,
td.buildUpPlaySpeedClass bu_play_speed_class,
td.buildUpPlayDribbling bu_play_dribbling,
td.buildUpPlayDribblingClass bu_play_dribbling_class,
td.buildUpPlayPassing bu_play_passing,
td.buildUpPlayPassingClass bu_play_passing_class,
td.buildUpPlayPositioningClass bu_play_positioning_class,
td.chanceCreationPassing cc_passing,
td.chanceCreationPassingClass cc_passing_class,
td.chanceCreationCrossing cc_crossing,
td.chanceCreationCrossingClass cc_crossing_class,
td.chanceCreationShooting cc_shooting,
td.chanceCreationShootingClass cc_shooting_class,
td.chanceCreationPositioningClass cc_positioning_class,
td.defencePressure defence_pressure,
td.defencePressureClass defence_pressure_class,
td.defenceAggression defence_aggression,
td.defenceAggressionClass defence_agression_class,
td.defenceTeamWidth defence_team_width,
td.defenceTeamWidthClass defence_team_width_class,
td.defenceDefenderLineClass defence_line_class
FROM Match m
JOIN teamdate_data td
ON td.team_api_id = m.home_team_api_id
OR td.team_api_id = m.away_team_api_id
JOIN League l
ON l.id = m.league_id
JOIN Country c
ON c.id = m.country_id
WHERE m."date" BETWEEN td."date" AND td.end_date
ORDER BY td.team_api_id, td.end_date DESC;
I will be focusing on team goals and how their attributes contributed to these goals.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('team_match_data.csv')
df.head()
Imported the team_match_data csv file into a data frame and viewed the first 5 rows
df.shape
There are 39,882 rows and 37 columns in the data frame
df.describe()
df.info()
The bu_play_dribbling column has null data. There are 27,133 records with null data
The data contains 39,882 rows and 37 columns.
The data consists of match data, team data, and team attribute data at the time the match took place. For each team attribute respresented with a numerical value, there is a corresponding classifaction.
The match_api_id, country_id, country_name, league_id, league_name and season columns can be dropped since they are not relevant to determine how a team's attributes contribute to their wins.
The data will need to be cleaned, to fill the null values for the bu_play_dribbling column. This will be done by determing with class the null value corresponds to and filled with the mean of that class.
# Drop columns that are not needed
df.drop(['match_api_id','country_id','league_id','country_name','league_name', 'season'], axis=1, inplace = True)
df.head()
df.shape
After dropping the columns, there are now 39,882 rows and 31 columns in the data frame
df.info()
The bu_play_dribbling column has null data. There are 27,133 records with null data
# Find the bu_play_dribbling clases
df.bu_play_dribbling_class.unique()
There are 3 classes for bu_play_dribbling: Normal, Little and Lots
df[df.bu_play_dribbling.isnull()]['bu_play_dribbling_class'].unique()
All of the null values from the bu_play_dribbling column belong to the bu_play_dribbling_class Little.
We need to fill those null values with the mean of the Little bu_play_dribbling columns
Function returns a data frame for a specified Build Up Play Dribbling Class
def filter_dribblng_class(df, dribbling_class):
return df.query('bu_play_dribbling_class == "%s"' % dribbling_class)
Function prints the minimum and maximum of a specified column
def col_minmax_range(col):
print("({},{})".format(col.min(), col.max()))
df_normal_dribbling_class = filter_dribblng_class(df, "Normal")
Created a data frame for the Normal bu_play_dribbling_class
col_minmax_range(df_normal_dribbling_class.bu_play_dribbling)
The range of values for the Normal bu_play_dribbling_class is (34.0, 66.0)
df_lots_dribbling_class = filter_dribblng_class(df, "Lots")
Created a data frame for the Lots bu_play_dribbling_class
col_minmax_range(df_lots_dribbling_class.bu_play_dribbling)
The range of values for the Lots bu_play_dribbling_class is (67.0, 77.0)
df_little_dribbling_class = filter_dribblng_class(df, "Little")
Created a data frame for the Little bu_play_dribbling_class
col_minmax_range(df_little_dribbling_class.bu_play_dribbling)
The range of values for the Little bu_play_dribbling_class is (24.0, 33.0)
# Find the mean of the Little Build Up Dribbling class df
df_little_dribbling_class['bu_play_dribbling'].mean()
The mean of the bu_play_dribbling_class for the Little class is 30.44859813084112
# Fill nan values with the mean of the litle dribbling class since they were all inthe little dribblng class
df['bu_play_dribbling'].fillna(df_little_dribbling_class['bu_play_dribbling'].mean(), inplace=True)
The null values for bu_play_dribbling were filled with the mean of the Little class
df.info()
The bu_play_dribbling column has all null values filled
df.duplicated().sum()
There are no duplicates in the data frame
df.shape
There are still 39,882 rows and 31 columns in the data frame
df.describe()
# Use this, and more code cells, to explore your data. Don't forget to add
# Markdown cells to document your observations and findings.
# Get matches that did not result in a Tied Match
df_match_no_draw = df.query('home_team_goal != away_team_goal')
df_match_no_draw.shape
There are 29,758 records that consist of matches that resulted in no draw
df_match_no_draw.describe()
#Create a column to identify the winning team
df_match_no_draw['win_team_id'] = np.where(df_match_no_draw['home_team_goal'] > df_match_no_draw['away_team_goal'], df_match_no_draw['home_team_api_id'], df_match_no_draw['away_team_api_id'])
df_match_no_draw.head()
# Filter the teams attributes for the team that won
df_win_team = df_match_no_draw.query('team_api_id == win_team_id')
df_win_team.shape
There are 15,010 matches that resulted in a win
df_win_team.head()
# Count the number of wins for each team and sort in decending order
df_num_wins=df_win_team.groupby(['team_long_name'])['win_team_id'].count().sort_values(ascending=False)
# Top Ten Soccer Teams
top_10 = df_num_wins.head(10)
top_10
The Top 10 European Soccer teams with the most wins are:
FC Barcelona, Real Madrid CF, Celtic, FC Bayern Munich, Juventus, SL Benfica, FC Porto, Manchester City, Paris, Saint-Germain, Manchester United
df_num_wins.idxmax()
# Plot a bar chart showing the Top 10 Soccer Teams
top_10.plot(kind='bar');
plt.title('Top 10 Winning Soccer Teams')
plt.xlabel('Team Name')
plt.ylabel('Total Wins');
# Count the number of wins for an attribute period
df_win_team.groupby(['win_team_id','team_long_name','t_attr_start_dt','t_attr_end_dt'])['win_team_id'].count()
This shows the number of wins a team had with during a given team attribute start and end time
df_win_team['wins_by_attribute_period'] = df_win_team.groupby(['win_team_id','t_attr_start_dt','t_attr_end_dt'])['win_team_id'].transform("size")
df_win_team.head()
df_win_team.drop(['match_dt','home_team_api_id','away_team_api_id','team_api_id','home_team_goal','away_team_goal'],axis=1, inplace=True)
Remove the columns that are no longer necessary: match_dt, home_team_api_id, away_team_api_id, team_api_id, home_team_goal and away_team_goal
df_win_team.duplicated().sum()
There are 13,864 duplicate records for winning matches.
df_win_team.drop_duplicates(inplace=True)
Drop the duplicates from the data frane
df_win_team.duplicated().sum()
There are no more duplicates
df_win_team.head()
df_win_team.hist(figsize=(12,10));
df_win_team.plot(x="bu_play_dribbling", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Build Up Play Dribbling")
plt.xlabel("Build Up Play Dribbling")
plt.ylabel("Team Wins")
There is no correlation between the number of wins and build up play dribbling
df_win_team.plot(x="bu_play_passing", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Build Up Play Passing")
plt.xlabel("Build Up Play Passing")
plt.ylabel("Team Wins")
There is no correlation between wins and build up play passing
df_win_team.plot(x="cc_passing", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Chance Creation Passing")
plt.xlabel("Chance Creation Passing")
plt.ylabel("Team Wins")
There is no correlation between the number of wins and chance creation passing
df_win_team.plot(x="cc_crossing", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Chance Creation Crossing")
plt.xlabel("Chance Creation Crossing")
plt.ylabel("Team Wins")
There is no correlation between the number of wins and chance creation crossing
df_win_team.plot(x="cc_shooting", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Chance Creation Crossing")
plt.xlabel("Chance Creation Shooting")
plt.ylabel("Team Wins")
There is no correlation between the number of wins and chance creation shooting
df_win_team.plot(x="defence_pressure", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Defence Pressure")
plt.xlabel("Defence Pressure")
plt.ylabel("Team Wins")
There is no correlation between the number of wins and defence pressure
df_win_team.plot(x="defence_aggression", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Defence Aggression")
plt.xlabel("Defence Aggression")
plt.ylabel("Team Wins")
There is no correlation between the number of wins and defence agression
df_win_team.plot(x="defence_team_width", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Defence Team Width")
plt.xlabel("Defence Team Width")
plt.ylabel("Team Wins")
There is no correlation between the number of wins and defence team width
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_Soccer_Database.ipynb'])