What European Soccer Team Attributes Lead To More Wins?

Table of Contents

Introduction

In this project, I will use the Kaggle "European Soccer Database" which can be found at https://www.kaggle.com/hugomathien/soccer. The "European Soccer Database" consists on 7 tables: Country, League, Match, Player, Player_Attributes, Team and Team_Attributes.

The Match table contains data over 8 soccer seasons and stores: the date a match was played, the teams that played the match, the number of goals each team made, the players in the match and other match statistics.

The Player table contains players name and vital statistics such as weight, height and birthdate.

The Player_Attributes table contains the players statistics such as their overall rating, potential, prefered foot etc.

The Team_Attributes table records the team's attributes such as build up play, chance creation and defence scores at a given time.

The Team table stores team id and team names that are used for referencing.

The Country table stores the country id and country name used for referencing.

The League tables stores league id, country id and league name used for referencing.


I will be analyzing a dataset derived from the "European Soccer Database". The dataset ecompasses data from the Match, Team, Team_Attributes, Country and League tables. The Dataset was derived usng the SQL query below:

CREATE TABLE team_match_data AS 
WITH team_data AS (
SELECT * 
FROM Team_Attributes ta
JOIN Team t
ON ta.team_api_id = t.team_api_id),


    teamdate_data AS (
SELECT *,
lead(team_api_id) OVER (PARTITION BY team_api_id ORDER BY "date") AS lead_team_api_id,
lead("date",1,"2016-05-25 00:00:00") OVER (PARTITION BY team_api_id ORDER BY "date") AS end_date
FROM team_data)


SELECT m.match_api_id, 
m."date" match_dt,
m.country_id,
c.name country_name,
m.league_id,
l.name league_name,
m.season,
m.home_team_api_id,
m.away_team_api_id,
td.team_api_id,
td.team_long_name,
td.team_short_name,
m.home_team_goal,
m.away_team_goal,
td."date" t_attr_start_dt,
td.end_date t_attr_end_dt,
td.buildUpPlaySpeed bu_play_speed,
td.buildUpPlaySpeedClass bu_play_speed_class,
td.buildUpPlayDribbling bu_play_dribbling,
td.buildUpPlayDribblingClass bu_play_dribbling_class,
td.buildUpPlayPassing bu_play_passing,
td.buildUpPlayPassingClass bu_play_passing_class,
td.buildUpPlayPositioningClass bu_play_positioning_class,
td.chanceCreationPassing cc_passing,
td.chanceCreationPassingClass cc_passing_class,
td.chanceCreationCrossing cc_crossing,
td.chanceCreationCrossingClass cc_crossing_class,
td.chanceCreationShooting cc_shooting,
td.chanceCreationShootingClass cc_shooting_class,
td.chanceCreationPositioningClass cc_positioning_class,
td.defencePressure defence_pressure,
td.defencePressureClass defence_pressure_class,
td.defenceAggression defence_aggression,
td.defenceAggressionClass defence_agression_class,
td.defenceTeamWidth defence_team_width,
td.defenceTeamWidthClass defence_team_width_class,
td.defenceDefenderLineClass defence_line_class
FROM Match m
JOIN teamdate_data td
ON td.team_api_id = m.home_team_api_id
OR td.team_api_id = m.away_team_api_id
JOIN League l
ON l.id = m.league_id
JOIN Country c
ON c.id = m.country_id
WHERE m."date"  BETWEEN td."date" AND td.end_date
ORDER BY td.team_api_id, td.end_date DESC;


Questions

I will be focusing on team goals and how their attributes contributed to these goals.

  1. Which Teams won the most matches?
  2. What team attributes contribute to the most wins?
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data Wrangling

General Properties

In [2]:
df = pd.read_csv('team_match_data.csv')
df.head()
Out[2]:
match_api_id match_dt country_id country_name league_id league_name season home_team_api_id away_team_api_id team_api_id ... cc_shooting cc_shooting_class cc_positioning_class defence_pressure defence_pressure_class defence_aggression defence_agression_class defence_team_width defence_team_width_class defence_line_class
0 1994692 2015-09-26 00:00:00 15722 Poland 15722 Poland Ekstraklasa 2015/2016 2186 1601 1601 ... 50 Normal Organised 43 Medium 44 Press 49 Normal Cover
1 1994701 2015-10-02 00:00:00 15722 Poland 15722 Poland Ekstraklasa 2015/2016 1601 8023 1601 ... 50 Normal Organised 43 Medium 44 Press 49 Normal Cover
2 1994710 2015-10-17 00:00:00 15722 Poland 15722 Poland Ekstraklasa 2015/2016 2182 1601 1601 ... 50 Normal Organised 43 Medium 44 Press 49 Normal Cover
3 1994716 2015-10-23 00:00:00 15722 Poland 15722 Poland Ekstraklasa 2015/2016 1601 8025 1601 ... 50 Normal Organised 43 Medium 44 Press 49 Normal Cover
4 1994723 2015-10-30 00:00:00 15722 Poland 15722 Poland Ekstraklasa 2015/2016 10265 1601 1601 ... 50 Normal Organised 43 Medium 44 Press 49 Normal Cover

5 rows × 37 columns

Imported the team_match_data csv file into a data frame and viewed the first 5 rows

In [3]:
df.shape
Out[3]:
(39882, 37)

There are 39,882 rows and 37 columns in the data frame

In [4]:
df.describe()
Out[4]:
match_api_id country_id league_id home_team_api_id away_team_api_id team_api_id home_team_goal away_team_goal bu_play_speed bu_play_dribbling bu_play_passing cc_passing cc_crossing cc_shooting defence_pressure defence_aggression defence_team_width
count 3.988200e+04 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 11749.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000
mean 1.360436e+06 11763.067148 11763.067148 9956.776992 9954.216087 9665.847425 1.554787 1.173888 52.412818 49.088433 47.927210 52.411664 54.034627 54.270473 46.466978 49.474625 52.337596
std 4.222426e+05 7520.607243 7520.607243 13767.842735 13738.480474 11287.448154 1.306828 1.147342 11.930429 9.800439 11.081696 10.400321 11.184020 10.440281 10.215524 9.871917 9.435304
min 6.541530e+05 1.000000 1.000000 1601.000000 1601.000000 1601.000000 0.000000 0.000000 20.000000 24.000000 20.000000 21.000000 20.000000 22.000000 23.000000 24.000000 30.000000
25% 1.025090e+06 4769.000000 4769.000000 8472.000000 8472.000000 8481.000000 1.000000 0.000000 45.000000 42.000000 39.000000 47.000000 48.000000 49.000000 39.000000 44.000000 48.000000
50% 1.260174e+06 10257.000000 10257.000000 8697.000000 8697.000000 8721.000000 1.000000 1.000000 52.000000 49.000000 49.000000 52.000000 53.000000 54.000000 46.000000 48.000000 52.000000
75% 1.724266e+06 17642.000000 17642.000000 9925.000000 9925.000000 9925.000000 2.000000 2.000000 62.000000 55.000000 55.000000 60.000000 63.000000 62.000000 53.000000 56.000000 59.000000
max 2.216672e+06 24558.000000 24558.000000 274581.000000 274581.000000 274581.000000 10.000000 9.000000 80.000000 77.000000 80.000000 80.000000 80.000000 80.000000 72.000000 72.000000 73.000000
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39882 entries, 0 to 39881
Data columns (total 37 columns):
match_api_id                 39882 non-null int64
match_dt                     39882 non-null object
country_id                   39882 non-null int64
country_name                 39882 non-null object
league_id                    39882 non-null int64
league_name                  39882 non-null object
season                       39882 non-null object
home_team_api_id             39882 non-null int64
away_team_api_id             39882 non-null int64
team_api_id                  39882 non-null int64
team_long_name               39882 non-null object
team_short_name              39882 non-null object
home_team_goal               39882 non-null int64
away_team_goal               39882 non-null int64
t_attr_start_dt              39882 non-null object
t_attr_end_dt                39882 non-null object
bu_play_speed                39882 non-null int64
bu_play_speed_class          39882 non-null object
bu_play_dribbling            11749 non-null float64
bu_play_dribbling_class      39882 non-null object
bu_play_passing              39882 non-null int64
bu_play_passing_class        39882 non-null object
bu_play_positioning_class    39882 non-null object
cc_passing                   39882 non-null int64
cc_passing_class             39882 non-null object
cc_crossing                  39882 non-null int64
cc_crossing_class            39882 non-null object
cc_shooting                  39882 non-null int64
cc_shooting_class            39882 non-null object
cc_positioning_class         39882 non-null object
defence_pressure             39882 non-null int64
defence_pressure_class       39882 non-null object
defence_aggression           39882 non-null int64
defence_agression_class      39882 non-null object
defence_team_width           39882 non-null int64
defence_team_width_class     39882 non-null object
defence_line_class           39882 non-null object
dtypes: float64(1), int64(16), object(20)
memory usage: 11.3+ MB

The bu_play_dribbling column has null data. There are 27,133 records with null data

Data Cleaning

The data contains 39,882 rows and 37 columns.

The data consists of match data, team data, and team attribute data at the time the match took place. For each team attribute respresented with a numerical value, there is a corresponding classifaction.

The match_api_id, country_id, country_name, league_id, league_name and season columns can be dropped since they are not relevant to determine how a team's attributes contribute to their wins.

The data will need to be cleaned, to fill the null values for the bu_play_dribbling column. This will be done by determing with class the null value corresponds to and filled with the mean of that class.

In [6]:
# Drop columns that are not needed
df.drop(['match_api_id','country_id','league_id','country_name','league_name', 'season'], axis=1, inplace = True)
In [7]:
df.head()
Out[7]:
match_dt home_team_api_id away_team_api_id team_api_id team_long_name team_short_name home_team_goal away_team_goal t_attr_start_dt t_attr_end_dt ... cc_shooting cc_shooting_class cc_positioning_class defence_pressure defence_pressure_class defence_aggression defence_agression_class defence_team_width defence_team_width_class defence_line_class
0 2015-09-26 00:00:00 2186 1601 1601 Ruch Chorzów CHO 2 1 2015-09-10 00:00:00 2016-05-25 00:00:00 ... 50 Normal Organised 43 Medium 44 Press 49 Normal Cover
1 2015-10-02 00:00:00 1601 8023 1601 Ruch Chorzów CHO 0 2 2015-09-10 00:00:00 2016-05-25 00:00:00 ... 50 Normal Organised 43 Medium 44 Press 49 Normal Cover
2 2015-10-17 00:00:00 2182 1601 1601 Ruch Chorzów CHO 2 2 2015-09-10 00:00:00 2016-05-25 00:00:00 ... 50 Normal Organised 43 Medium 44 Press 49 Normal Cover
3 2015-10-23 00:00:00 1601 8025 1601 Ruch Chorzów CHO 1 0 2015-09-10 00:00:00 2016-05-25 00:00:00 ... 50 Normal Organised 43 Medium 44 Press 49 Normal Cover
4 2015-10-30 00:00:00 10265 1601 1601 Ruch Chorzów CHO 0 0 2015-09-10 00:00:00 2016-05-25 00:00:00 ... 50 Normal Organised 43 Medium 44 Press 49 Normal Cover

5 rows × 31 columns

In [8]:
df.shape
Out[8]:
(39882, 31)

After dropping the columns, there are now 39,882 rows and 31 columns in the data frame

In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39882 entries, 0 to 39881
Data columns (total 31 columns):
match_dt                     39882 non-null object
home_team_api_id             39882 non-null int64
away_team_api_id             39882 non-null int64
team_api_id                  39882 non-null int64
team_long_name               39882 non-null object
team_short_name              39882 non-null object
home_team_goal               39882 non-null int64
away_team_goal               39882 non-null int64
t_attr_start_dt              39882 non-null object
t_attr_end_dt                39882 non-null object
bu_play_speed                39882 non-null int64
bu_play_speed_class          39882 non-null object
bu_play_dribbling            11749 non-null float64
bu_play_dribbling_class      39882 non-null object
bu_play_passing              39882 non-null int64
bu_play_passing_class        39882 non-null object
bu_play_positioning_class    39882 non-null object
cc_passing                   39882 non-null int64
cc_passing_class             39882 non-null object
cc_crossing                  39882 non-null int64
cc_crossing_class            39882 non-null object
cc_shooting                  39882 non-null int64
cc_shooting_class            39882 non-null object
cc_positioning_class         39882 non-null object
defence_pressure             39882 non-null int64
defence_pressure_class       39882 non-null object
defence_aggression           39882 non-null int64
defence_agression_class      39882 non-null object
defence_team_width           39882 non-null int64
defence_team_width_class     39882 non-null object
defence_line_class           39882 non-null object
dtypes: float64(1), int64(13), object(17)
memory usage: 9.4+ MB

The bu_play_dribbling column has null data. There are 27,133 records with null data

In [10]:
# Find the bu_play_dribbling clases
df.bu_play_dribbling_class.unique()
Out[10]:
array(['Normal', 'Little', 'Lots'], dtype=object)

There are 3 classes for bu_play_dribbling: Normal, Little and Lots

In [11]:
df[df.bu_play_dribbling.isnull()]['bu_play_dribbling_class'].unique()
Out[11]:
array(['Little'], dtype=object)

All of the null values from the bu_play_dribbling column belong to the bu_play_dribbling_class Little.

We need to fill those null values with the mean of the Little bu_play_dribbling columns

In [ ]:
 

Function returns a data frame for a specified Build Up Play Dribbling Class

In [12]:
def filter_dribblng_class(df, dribbling_class):
    return df.query('bu_play_dribbling_class == "%s"' % dribbling_class)

Function prints the minimum and maximum of a specified column

In [13]:
def col_minmax_range(col):
    print("({},{})".format(col.min(), col.max()))
In [14]:
df_normal_dribbling_class = filter_dribblng_class(df, "Normal")

Created a data frame for the Normal bu_play_dribbling_class

In [15]:
col_minmax_range(df_normal_dribbling_class.bu_play_dribbling)
(34.0,66.0)

The range of values for the Normal bu_play_dribbling_class is (34.0, 66.0)

In [16]:
df_lots_dribbling_class = filter_dribblng_class(df, "Lots")

Created a data frame for the Lots bu_play_dribbling_class

In [17]:
col_minmax_range(df_lots_dribbling_class.bu_play_dribbling)
(67.0,77.0)

The range of values for the Lots bu_play_dribbling_class is (67.0, 77.0)

In [18]:
df_little_dribbling_class = filter_dribblng_class(df, "Little")

Created a data frame for the Little bu_play_dribbling_class

In [19]:
col_minmax_range(df_little_dribbling_class.bu_play_dribbling)
(24.0,33.0)

The range of values for the Little bu_play_dribbling_class is (24.0, 33.0)

In [20]:
# Find the mean of the Little Build Up Dribbling class df
df_little_dribbling_class['bu_play_dribbling'].mean()
Out[20]:
30.44859813084112

The mean of the bu_play_dribbling_class for the Little class is 30.44859813084112

In [21]:
# Fill nan values with the mean of the litle dribbling class since they were all inthe little dribblng class
df['bu_play_dribbling'].fillna(df_little_dribbling_class['bu_play_dribbling'].mean(), inplace=True)

The null values for bu_play_dribbling were filled with the mean of the Little class

In [22]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39882 entries, 0 to 39881
Data columns (total 31 columns):
match_dt                     39882 non-null object
home_team_api_id             39882 non-null int64
away_team_api_id             39882 non-null int64
team_api_id                  39882 non-null int64
team_long_name               39882 non-null object
team_short_name              39882 non-null object
home_team_goal               39882 non-null int64
away_team_goal               39882 non-null int64
t_attr_start_dt              39882 non-null object
t_attr_end_dt                39882 non-null object
bu_play_speed                39882 non-null int64
bu_play_speed_class          39882 non-null object
bu_play_dribbling            39882 non-null float64
bu_play_dribbling_class      39882 non-null object
bu_play_passing              39882 non-null int64
bu_play_passing_class        39882 non-null object
bu_play_positioning_class    39882 non-null object
cc_passing                   39882 non-null int64
cc_passing_class             39882 non-null object
cc_crossing                  39882 non-null int64
cc_crossing_class            39882 non-null object
cc_shooting                  39882 non-null int64
cc_shooting_class            39882 non-null object
cc_positioning_class         39882 non-null object
defence_pressure             39882 non-null int64
defence_pressure_class       39882 non-null object
defence_aggression           39882 non-null int64
defence_agression_class      39882 non-null object
defence_team_width           39882 non-null int64
defence_team_width_class     39882 non-null object
defence_line_class           39882 non-null object
dtypes: float64(1), int64(13), object(17)
memory usage: 9.4+ MB

The bu_play_dribbling column has all null values filled

In [23]:
df.duplicated().sum()
Out[23]:
0

There are no duplicates in the data frame

In [24]:
df.shape
Out[24]:
(39882, 31)

There are still 39,882 rows and 31 columns in the data frame

In [25]:
df.describe()
Out[25]:
home_team_api_id away_team_api_id team_api_id home_team_goal away_team_goal bu_play_speed bu_play_dribbling bu_play_passing cc_passing cc_crossing cc_shooting defence_pressure defence_aggression defence_team_width
count 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000 39882.000000
mean 9956.776992 9954.216087 9665.847425 1.554787 1.173888 52.412818 35.939783 47.927210 52.411664 54.034627 54.270473 46.466978 49.474625 52.337596
std 13767.842735 13738.480474 11287.448154 1.306828 1.147342 11.930429 10.024826 11.081696 10.400321 11.184020 10.440281 10.215524 9.871917 9.435304
min 1601.000000 1601.000000 1601.000000 0.000000 0.000000 20.000000 24.000000 20.000000 21.000000 20.000000 22.000000 23.000000 24.000000 30.000000
25% 8472.000000 8472.000000 8481.000000 1.000000 0.000000 45.000000 30.448598 39.000000 47.000000 48.000000 49.000000 39.000000 44.000000 48.000000
50% 8697.000000 8697.000000 8721.000000 1.000000 1.000000 52.000000 30.448598 49.000000 52.000000 53.000000 54.000000 46.000000 48.000000 52.000000
75% 9925.000000 9925.000000 9925.000000 2.000000 2.000000 62.000000 38.000000 55.000000 60.000000 63.000000 62.000000 53.000000 56.000000 59.000000
max 274581.000000 274581.000000 274581.000000 10.000000 9.000000 80.000000 77.000000 80.000000 80.000000 80.000000 80.000000 72.000000 72.000000 73.000000

Exploratory Data Analysis

Question 1 : Which Teams won the most matches?

In [26]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.

# Get matches that did not result in a Tied Match
df_match_no_draw = df.query('home_team_goal != away_team_goal')
In [27]:
df_match_no_draw.shape
Out[27]:
(29758, 31)

There are 29,758 records that consist of matches that resulted in no draw

In [28]:
df_match_no_draw.describe()
Out[28]:
home_team_api_id away_team_api_id team_api_id home_team_goal away_team_goal bu_play_speed bu_play_dribbling bu_play_passing cc_passing cc_crossing cc_shooting defence_pressure defence_aggression defence_team_width
count 29758.000000 29758.000000 29758.000000 29758.000000 29758.000000 29758.000000 29758.000000 29758.000000 29758.000000 29758.000000 29758.000000 29758.000000 29758.00000 29758.000000
mean 9961.051415 9899.828685 9626.486491 1.743733 1.233248 52.347402 35.914585 47.896129 52.440117 54.045433 54.308354 46.573190 49.51361 52.355400
std 13757.721340 13420.665962 10930.258097 1.381442 1.229052 11.931015 9.982912 11.071162 10.413259 11.176362 10.476461 10.228084 9.87437 9.414251
min 1601.000000 1601.000000 1601.000000 0.000000 0.000000 20.000000 24.000000 20.000000 21.000000 20.000000 22.000000 23.000000 24.00000 30.000000
25% 8481.000000 8472.000000 8483.000000 1.000000 0.000000 45.000000 30.448598 39.000000 47.000000 48.000000 49.000000 39.000000 44.00000 48.000000
50% 8697.000000 8697.000000 8721.000000 2.000000 1.000000 52.000000 30.448598 49.000000 52.000000 53.000000 54.000000 46.000000 48.00000 52.000000
75% 9925.000000 9925.000000 9925.000000 3.000000 2.000000 62.000000 38.000000 55.000000 60.000000 63.000000 63.000000 53.000000 56.00000 59.000000
max 274581.000000 274581.000000 274581.000000 10.000000 9.000000 80.000000 77.000000 80.000000 80.000000 80.000000 80.000000 72.000000 72.00000 73.000000
In [29]:
#Create a column to identify the winning team
df_match_no_draw['win_team_id'] = np.where(df_match_no_draw['home_team_goal'] > df_match_no_draw['away_team_goal'], df_match_no_draw['home_team_api_id'], df_match_no_draw['away_team_api_id'])
/root/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [30]:
df_match_no_draw.head()
Out[30]:
match_dt home_team_api_id away_team_api_id team_api_id team_long_name team_short_name home_team_goal away_team_goal t_attr_start_dt t_attr_end_dt ... cc_shooting_class cc_positioning_class defence_pressure defence_pressure_class defence_aggression defence_agression_class defence_team_width defence_team_width_class defence_line_class win_team_id
0 2015-09-26 00:00:00 2186 1601 1601 Ruch Chorzów CHO 2 1 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Normal Organised 43 Medium 44 Press 49 Normal Cover 2186
1 2015-10-02 00:00:00 1601 8023 1601 Ruch Chorzów CHO 0 2 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Normal Organised 43 Medium 44 Press 49 Normal Cover 8023
3 2015-10-23 00:00:00 1601 8025 1601 Ruch Chorzów CHO 1 0 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Normal Organised 43 Medium 44 Press 49 Normal Cover 1601
5 2015-11-07 00:00:00 1601 8030 1601 Ruch Chorzów CHO 3 2 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Normal Organised 43 Medium 44 Press 49 Normal Cover 1601
6 2015-11-21 00:00:00 8019 1601 1601 Ruch Chorzów CHO 0 3 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Normal Organised 43 Medium 44 Press 49 Normal Cover 1601

5 rows × 32 columns

In [31]:
# Filter the teams attributes for the team that won
df_win_team = df_match_no_draw.query('team_api_id == win_team_id')
In [32]:
df_win_team.shape
Out[32]:
(15010, 32)

There are 15,010 matches that resulted in a win

In [33]:
df_win_team.head()
Out[33]:
match_dt home_team_api_id away_team_api_id team_api_id team_long_name team_short_name home_team_goal away_team_goal t_attr_start_dt t_attr_end_dt ... cc_shooting_class cc_positioning_class defence_pressure defence_pressure_class defence_aggression defence_agression_class defence_team_width defence_team_width_class defence_line_class win_team_id
3 2015-10-23 00:00:00 1601 8025 1601 Ruch Chorzów CHO 1 0 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Normal Organised 43 Medium 44 Press 49 Normal Cover 1601
5 2015-11-07 00:00:00 1601 8030 1601 Ruch Chorzów CHO 3 2 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Normal Organised 43 Medium 44 Press 49 Normal Cover 1601
6 2015-11-21 00:00:00 8019 1601 1601 Ruch Chorzów CHO 0 3 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Normal Organised 43 Medium 44 Press 49 Normal Cover 1601
8 2015-12-01 00:00:00 1601 8245 1601 Ruch Chorzów CHO 2 1 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Normal Organised 43 Medium 44 Press 49 Normal Cover 1601
11 2015-12-19 00:00:00 177361 1601 1601 Ruch Chorzów CHO 0 1 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Normal Organised 43 Medium 44 Press 49 Normal Cover 1601

5 rows × 32 columns

In [34]:
# Count the number of wins for each team and sort in decending order
df_num_wins=df_win_team.groupby(['team_long_name'])['win_team_id'].count().sort_values(ascending=False)
In [35]:
# Top Ten Soccer Teams
top_10 = df_num_wins.head(10)
In [36]:
top_10
Out[36]:
team_long_name
FC Barcelona           189
Real Madrid CF         185
Celtic                 180
FC Bayern Munich       159
Juventus               156
SL Benfica             153
FC Porto               149
Manchester City        148
Paris Saint-Germain    147
Manchester United      146
Name: win_team_id, dtype: int64

The Top 10 European Soccer teams with the most wins are:

FC Barcelona, Real Madrid CF, Celtic, FC Bayern Munich, Juventus, SL Benfica, FC Porto, Manchester City, Paris, Saint-Germain, Manchester United

In [37]:
df_num_wins.idxmax()
Out[37]:
'FC Barcelona'
In [38]:
# Plot a bar chart showing the Top 10 Soccer Teams
top_10.plot(kind='bar');
plt.title('Top 10 Winning Soccer Teams')
plt.xlabel('Team Name')
plt.ylabel('Total Wins');

Question 2 : What team attributes contribute to the most wins?

In [39]:
# Count the number of wins for an attribute period
df_win_team.groupby(['win_team_id','team_long_name','t_attr_start_dt','t_attr_end_dt'])['win_team_id'].count()
Out[39]:
win_team_id  team_long_name           t_attr_start_dt      t_attr_end_dt      
1601         Ruch Chorzów             2010-02-22 00:00:00  2011-02-22 00:00:00    10
                                      2011-02-22 00:00:00  2012-02-22 00:00:00    15
                                      2012-02-22 00:00:00  2013-09-20 00:00:00    16
                                      2013-09-20 00:00:00  2014-09-19 00:00:00    14
                                      2014-09-19 00:00:00  2015-09-10 00:00:00    10
                                      2015-09-10 00:00:00  2016-05-25 00:00:00     8
1773         Oud-Heverlee Leuven      2012-02-22 00:00:00  2013-09-20 00:00:00     9
                                      2014-09-19 00:00:00  2016-05-25 00:00:00     7
1957         Jagiellonia Białystok    2010-02-22 00:00:00  2011-02-22 00:00:00    14
                                      2011-02-22 00:00:00  2012-02-22 00:00:00    11
                                      2012-02-22 00:00:00  2013-09-20 00:00:00    17
                                      2013-09-20 00:00:00  2014-09-19 00:00:00    10
                                      2014-09-19 00:00:00  2015-09-10 00:00:00    13
                                      2015-09-10 00:00:00  2016-05-25 00:00:00     7
2033         S.C. Olhanense           2010-02-22 00:00:00  2011-02-22 00:00:00     8
                                      2011-02-22 00:00:00  2012-02-22 00:00:00     6
                                      2012-02-22 00:00:00  2013-09-20 00:00:00    10
                                      2013-09-20 00:00:00  2014-09-19 00:00:00     5
2182         Lech Poznań              2010-02-22 00:00:00  2011-02-22 00:00:00    15
                                      2011-02-22 00:00:00  2012-02-22 00:00:00    16
                                      2012-02-22 00:00:00  2013-09-20 00:00:00    28
                                      2013-09-20 00:00:00  2014-09-19 00:00:00    15
                                      2014-09-19 00:00:00  2015-09-10 00:00:00    13
                                      2015-09-10 00:00:00  2016-05-25 00:00:00    12
2183         P. Warszawa              2010-02-22 00:00:00  2011-02-22 00:00:00    10
                                      2011-02-22 00:00:00  2012-02-22 00:00:00    17
                                      2012-02-22 00:00:00  2013-09-20 00:00:00    14
2186         Cracovia                 2010-02-22 00:00:00  2011-02-22 00:00:00     6
                                      2011-02-22 00:00:00  2012-02-22 00:00:00     9
                                      2012-02-22 00:00:00  2014-09-19 00:00:00    14
                                                                                  ..
10265        Wisła Kraków             2011-02-22 00:00:00  2012-02-22 00:00:00    17
                                      2012-02-22 00:00:00  2013-09-20 00:00:00    17
                                      2013-09-20 00:00:00  2014-09-19 00:00:00    14
                                      2014-09-19 00:00:00  2015-09-10 00:00:00     8
                                      2015-09-10 00:00:00  2016-05-25 00:00:00     6
10267        Valencia CF              2010-02-22 00:00:00  2011-02-22 00:00:00    23
                                      2011-02-22 00:00:00  2012-02-22 00:00:00    18
                                      2012-02-22 00:00:00  2013-09-20 00:00:00    26
                                      2013-09-20 00:00:00  2014-09-19 00:00:00    14
                                      2014-09-19 00:00:00  2015-09-10 00:00:00    20
                                      2015-09-10 00:00:00  2016-05-25 00:00:00    11
10268        Elche CF                 2013-09-20 00:00:00  2014-09-19 00:00:00    10
                                      2014-09-19 00:00:00  2015-09-10 00:00:00    10
10269        VfB Stuttgart            2010-02-22 00:00:00  2011-02-22 00:00:00    12
                                      2011-02-22 00:00:00  2012-02-22 00:00:00    14
                                      2012-02-22 00:00:00  2013-09-20 00:00:00    22
                                      2013-09-20 00:00:00  2014-09-19 00:00:00     6
                                      2014-09-19 00:00:00  2015-09-10 00:00:00     9
                                      2015-09-10 00:00:00  2016-05-25 00:00:00     9
10278        Hércules Club de Fútbol  2010-02-22 00:00:00  2011-02-22 00:00:00     7
                                      2011-02-22 00:00:00  2012-02-22 00:00:00     2
10281        Real Valladolid          2010-02-22 00:00:00  2011-02-22 00:00:00     4
                                      2012-02-22 00:00:00  2013-09-20 00:00:00    12
                                      2013-09-20 00:00:00  2014-09-19 00:00:00     6
108893       AC Arles-Avignon         2010-02-22 00:00:00  2011-02-22 00:00:00     1
                                      2011-02-22 00:00:00  2012-02-22 00:00:00     2
158085       FC Arouca                2014-09-19 00:00:00  2015-09-10 00:00:00     8
                                      2015-09-10 00:00:00  2016-05-25 00:00:00    11
208931       Carpi                    2015-09-10 00:00:00  2016-05-25 00:00:00     9
274581       Royal Excel Mouscron     2015-09-10 00:00:00  2016-05-25 00:00:00     7
Name: win_team_id, Length: 1146, dtype: int64

This shows the number of wins a team had with during a given team attribute start and end time

In [40]:
df_win_team['wins_by_attribute_period'] = df_win_team.groupby(['win_team_id','t_attr_start_dt','t_attr_end_dt'])['win_team_id'].transform("size")
/root/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [41]:
df_win_team.head()
Out[41]:
match_dt home_team_api_id away_team_api_id team_api_id team_long_name team_short_name home_team_goal away_team_goal t_attr_start_dt t_attr_end_dt ... cc_positioning_class defence_pressure defence_pressure_class defence_aggression defence_agression_class defence_team_width defence_team_width_class defence_line_class win_team_id wins_by_attribute_period
3 2015-10-23 00:00:00 1601 8025 1601 Ruch Chorzów CHO 1 0 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Organised 43 Medium 44 Press 49 Normal Cover 1601 8
5 2015-11-07 00:00:00 1601 8030 1601 Ruch Chorzów CHO 3 2 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Organised 43 Medium 44 Press 49 Normal Cover 1601 8
6 2015-11-21 00:00:00 8019 1601 1601 Ruch Chorzów CHO 0 3 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Organised 43 Medium 44 Press 49 Normal Cover 1601 8
8 2015-12-01 00:00:00 1601 8245 1601 Ruch Chorzów CHO 2 1 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Organised 43 Medium 44 Press 49 Normal Cover 1601 8
11 2015-12-19 00:00:00 177361 1601 1601 Ruch Chorzów CHO 0 1 2015-09-10 00:00:00 2016-05-25 00:00:00 ... Organised 43 Medium 44 Press 49 Normal Cover 1601 8

5 rows × 33 columns

In [42]:
df_win_team.drop(['match_dt','home_team_api_id','away_team_api_id','team_api_id','home_team_goal','away_team_goal'],axis=1, inplace=True)
/root/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:3697: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)

Remove the columns that are no longer necessary: match_dt, home_team_api_id, away_team_api_id, team_api_id, home_team_goal and away_team_goal

In [43]:
df_win_team.duplicated().sum()
Out[43]:
13864

There are 13,864 duplicate records for winning matches.

In [44]:
df_win_team.drop_duplicates(inplace=True)
/root/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Drop the duplicates from the data frane

In [45]:
df_win_team.duplicated().sum()
Out[45]:
0

There are no more duplicates

In [46]:
df_win_team.head()
Out[46]:
team_long_name team_short_name t_attr_start_dt t_attr_end_dt bu_play_speed bu_play_speed_class bu_play_dribbling bu_play_dribbling_class bu_play_passing bu_play_passing_class ... cc_positioning_class defence_pressure defence_pressure_class defence_aggression defence_agression_class defence_team_width defence_team_width_class defence_line_class win_team_id wins_by_attribute_period
3 Ruch Chorzów CHO 2015-09-10 00:00:00 2016-05-25 00:00:00 47 Balanced 48.000000 Normal 38 Mixed ... Organised 43 Medium 44 Press 49 Normal Cover 1601 8
28 Ruch Chorzów CHO 2014-09-19 00:00:00 2015-09-10 00:00:00 53 Balanced 48.000000 Normal 38 Mixed ... Organised 43 Medium 44 Press 49 Normal Cover 1601 10
53 Ruch Chorzów CHO 2013-09-20 00:00:00 2014-09-19 00:00:00 53 Balanced 30.448598 Little 55 Mixed ... Organised 43 Medium 44 Press 49 Normal Cover 1601 14
83 Ruch Chorzów CHO 2012-02-22 00:00:00 2013-09-20 00:00:00 53 Balanced 30.448598 Little 55 Mixed ... Organised 43 Medium 44 Press 49 Normal Cover 1601 16
134 Ruch Chorzów CHO 2011-02-22 00:00:00 2012-02-22 00:00:00 48 Balanced 30.448598 Little 51 Mixed ... Organised 46 Medium 48 Press 50 Normal Cover 1601 15

5 rows × 27 columns

In [47]:
df_win_team.hist(figsize=(12,10));

Scatterplot of wins against various features

In [48]:
df_win_team.plot(x="bu_play_dribbling", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Build Up Play Dribbling")
plt.xlabel("Build Up Play Dribbling")
plt.ylabel("Team Wins")
Out[48]:
Text(0, 0.5, 'Team Wins')

There is no correlation between the number of wins and build up play dribbling

In [49]:
df_win_team.plot(x="bu_play_passing", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Build Up Play Passing")
plt.xlabel("Build Up Play Passing")
plt.ylabel("Team Wins")
Out[49]:
Text(0, 0.5, 'Team Wins')

There is no correlation between wins and build up play passing

In [50]:
df_win_team.plot(x="cc_passing", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Chance Creation Passing")
plt.xlabel("Chance Creation Passing")
plt.ylabel("Team Wins")
Out[50]:
Text(0, 0.5, 'Team Wins')

There is no correlation between the number of wins and chance creation passing

In [51]:
df_win_team.plot(x="cc_crossing", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Chance Creation Crossing")
plt.xlabel("Chance Creation Crossing")
plt.ylabel("Team Wins")
Out[51]:
Text(0, 0.5, 'Team Wins')

There is no correlation between the number of wins and chance creation crossing

In [52]:
df_win_team.plot(x="cc_shooting", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Chance Creation Crossing")
plt.xlabel("Chance Creation Shooting")
plt.ylabel("Team Wins")
Out[52]:
Text(0, 0.5, 'Team Wins')

There is no correlation between the number of wins and chance creation shooting

In [53]:
df_win_team.plot(x="defence_pressure", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Defence Pressure")
plt.xlabel("Defence Pressure")
plt.ylabel("Team Wins")
Out[53]:
Text(0, 0.5, 'Team Wins')

There is no correlation between the number of wins and defence pressure

In [54]:
df_win_team.plot(x="defence_aggression", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Defence Aggression")
plt.xlabel("Defence Aggression")
plt.ylabel("Team Wins")
Out[54]:
Text(0, 0.5, 'Team Wins')

There is no correlation between the number of wins and defence agression

In [55]:
df_win_team.plot(x="defence_team_width", y="wins_by_attribute_period", kind="scatter");
plt.title("Relationship Between Team Match Wins and Defence Team Width")
plt.xlabel("Defence Team Width")
plt.ylabel("Team Wins")
Out[55]:
Text(0, 0.5, 'Team Wins')

There is no correlation between the number of wins and defence team width

Conclusions

Results

  1. The Top 10 European Soccer teams with the most wins are: FC Barcelona, Real Madrid CF, Celtic, FC Bayern Munich, Juventus, SL Benfica, FC Porto, Manchester City, Paris, Saint-Germain, Manchester United.
  2. There is no correlation between the number of wins a team has and any of their team attributes.

Limitations

  1. More than half of the build up dribbling team attribute had null data that was filled with the mean for its category. This may have made an impact on the correlation to number of team wins.
In [56]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_Soccer_Database.ipynb'])
Out[56]:
0