EDA_on_IPL_data

IPL data Analysis & Match Winner Prediction :¶

1. Load & View the Dataset¶

The dataset is downloaded from Kaggle. It covers 12 IPL seasons and contains two CSV files: one with match-level data, and one with the data of each delivery of every match across all the seasons.
I will treat the matches data as the main dataset, and I will derive a few features from the deliveries dataset to add to the matches data.

1.1 Loading the common important libraries¶

In [79]:

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np

import warnings
warnings.filterwarnings("ignore")

1.2 Loading the datasets¶

In [80]:

matches_df = pd.read_csv('matches.csv')

balls_df =  pd.read_csv('deliveries.csv')

In [81]:

matches_df.head()

Out[81]:

	id	season	city	date	team1	team2	toss_winner	toss_decision	result	winner	win_by_runs	win_by_wickets	player_of_match	venue	umpire1	umpire2	umpire3
0	1	2017	Hyderabad	2017-04-05	Sunrisers Hyderabad	Royal Challengers Bangalore	Royal Challengers Bangalore	field	normal	Sunrisers Hyderabad	35	0	Yuvraj Singh	Rajiv Gandhi International Stadium, Uppal	AY Dandekar	NJ Llong	NaN
1	2	2017	Pune	2017-04-06	Mumbai Indians	Rising Pune Supergiant	Rising Pune Supergiant	field	normal	Rising Pune Supergiant	0	7	SPD Smith	Maharashtra Cricket Association Stadium	A Nand Kishore	S Ravi	NaN
2	3	2017	Rajkot	2017-04-07	Gujarat Lions	Kolkata Knight Riders	Kolkata Knight Riders	field	normal	Kolkata Knight Riders	0	10	CA Lynn	Saurashtra Cricket Association Stadium	Nitin Menon	CK Nandan	NaN
3	4	2017	Indore	2017-04-08	Rising Pune Supergiant	Kings XI Punjab	Kings XI Punjab	field	normal	Kings XI Punjab	0	6	GJ Maxwell	Holkar Cricket Stadium	AK Chaudhary	C Shamshuddin	NaN
4	5	2017	Bangalore	2017-04-08	Royal Challengers Bangalore	Delhi Daredevils	Royal Challengers Bangalore	bat	normal	Royal Challengers Bangalore	15	0	KM Jadhav	M Chinnaswamy Stadium	NaN	NaN	NaN

In [82]:

matches_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 756 entries, 0 to 755
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               756 non-null    int64 
 1   season           756 non-null    int64 
 2   city             749 non-null    object
 3   date             756 non-null    object
 4   team1            756 non-null    object
 5   team2            756 non-null    object
 6   toss_winner      756 non-null    object
 7   toss_decision    756 non-null    object
 8   result           756 non-null    object
 9   dl_applied       756 non-null    int64 
 10  winner           752 non-null    object
 11  win_by_runs      756 non-null    int64 
 12  win_by_wickets   756 non-null    int64 
 13  player_of_match  752 non-null    object
 14  venue            756 non-null    object
 15  umpire1          754 non-null    object
 16  umpire2          754 non-null    object
 17  umpire3          119 non-null    object
dtypes: int64(5), object(13)
memory usage: 106.4+ KB

There are 7 null values for city, 4 each for winner and player_of_match, 2 each for umpire1 and umpire2, and 637 for umpire3.
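The same counts can be read off directly instead of subtracting from the info() output. This is a minimal sketch on a toy frame (the rows are made up for illustration); on the real data the same call would be applied to matches_df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for matches_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "city": ["Hyderabad", None, "Pune"],
    "winner": ["SRH", "RPS", None],
    "umpire3": [np.nan, np.nan, "Umpire X"],
})

# Number of missing values per column, largest first.
null_counts = toy_df.isna().sum().sort_values(ascending=False)
print(null_counts)
```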

2. Handling of Null values :¶

player_of_match is useful for judging the strength of a team, but the few missing values don't matter much, so we can go with the available names.
We remove the rows where winner is NaN, as the winner outcome is essential for prediction.
The null values of city don't matter for now; I will handle them later, in the important-features part.

In [83]:

matches_df[matches_df['winner'].isna()]

Out[83]:

	id	season	city	date	team1	team2	toss_winner	toss_decision	result	winner	player_of_match	venue	umpire1	umpire2	umpire3
300	301	2011	Delhi	2011-05-21	Delhi Daredevils	Pune Warriors	Delhi Daredevils	bat	no result	NaN	NaN	Feroz Shah Kotla	SS Hazare	RJ Tucker	NaN
545	546	2015	Bangalore	2015-04-29	Royal Challengers Bangalore	Rajasthan Royals	Rajasthan Royals	field	no result	NaN	NaN	M Chinnaswamy Stadium	JD Cloete	PG Pathak	NaN
570	571	2015	Bangalore	2015-05-17	Delhi Daredevils	Royal Challengers Bangalore	Royal Challengers Bangalore	field	no result	NaN	NaN	M Chinnaswamy Stadium	HDPK Dharmasena	K Srinivasan	NaN
744	11340	2019	Bengaluru	30/04/19	Royal Challengers Bangalore	Rajasthan Royals	Rajasthan Royals	field	no result	NaN	NaN	M. Chinnaswamy Stadium	Nigel Llong	Ulhas Gandhe	Anil Chaudhary

So, if we remove the rows with a null winner, the null values of player_of_match and the 'no result' values of the result column are removed as well.

In [84]:

print('Matches dataset shape before Nan values of winner : ', matches_df.shape)
matches_df.dropna(subset = ['winner'], inplace=True)
print('Matches dataset shape after removing Nan values : ', matches_df.shape)

Matches dataset shape before Nan values of winner :  (756, 18)
Matches dataset shape after removing Nan values :  (752, 18)

In [85]:

matches_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 752 entries, 0 to 755
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               752 non-null    int64 
 1   season           752 non-null    int64 
 2   city             745 non-null    object
 3   date             752 non-null    object
 4   team1            752 non-null    object
 5   team2            752 non-null    object
 6   toss_winner      752 non-null    object
 7   toss_decision    752 non-null    object
 8   result           752 non-null    object
 9   dl_applied       752 non-null    int64 
 10  winner           752 non-null    object
 11  win_by_runs      752 non-null    int64 
 12  win_by_wickets   752 non-null    int64 
 13  player_of_match  752 non-null    object
 14  venue            752 non-null    object
 15  umpire1          750 non-null    object
 16  umpire2          750 non-null    object
 17  umpire3          118 non-null    object
dtypes: int64(5), object(13)
memory usage: 111.6+ KB

Apart from city and the umpire columns, all the features now have the same number of non-null values. Luckily, the null values of player_of_match were in the same rows as the null values of winner, so they have been removed too.
I will fill in the null values of city later, when we discuss its importance.

In [86]:

matches_df.result.unique()

Out[86]:

array(['normal', 'tie'], dtype=object)

result : after dropping the 'no result' rows, the result feature contains 2 possible values. We are not interested in 'tie' matches; we only need 'normal' matches, in which one of the 2 teams wins outright. So, we filter the data on result == 'normal'.

In [87]:

len(matches_df.query("result=='tie'"))

Out[87]:

9

There are 9 tied matches, so we remove them.

In [88]:

matches_df.query("result=='normal'", inplace=True)
print('Matches dataset shape after removing tie matches  : ', matches_df.shape)

Matches dataset shape after removing tie matches  :  (743, 18)

In [89]:

matches_df.query("dl_applied==1").head()

Out[89]:

	id	season	city	date	team1	team2	toss_winner	toss_decision	result	dl_applied	winner	win_by_runs	win_by_wickets	player_of_match	venue	umpire1	umpire2	umpire3
56	57	2017	Bangalore	2017-05-17	Sunrisers Hyderabad	Kolkata Knight Riders	Kolkata Knight Riders	field	normal	1	Kolkata Knight Riders	0	7	NM Coulter-Nile	M Chinnaswamy Stadium	AK Chaudhary	Nitin Menon	NaN
99	100	2008	Delhi	2008-05-17	Delhi Daredevils	Kings XI Punjab	Delhi Daredevils	bat	normal	1	Kings XI Punjab	6	0	DPMD Jayawardene	Feroz Shah Kotla	AV Jayaprakash	RE Koertzen	NaN
102	103	2008	Kolkata	2008-05-18	Kolkata Knight Riders	Chennai Super Kings	Kolkata Knight Riders	bat	normal	1	Chennai Super Kings	3	0	M Ntini	Eden Gardens	Asad Rauf	K Hariharan	NaN
119	120	2009	Cape Town	2009-04-19	Kings XI Punjab	Delhi Daredevils	Delhi Daredevils	field	normal	1	Delhi Daredevils	0	10	DL Vettori	Newlands	MR Benson	SD Ranade	NaN
122	123	2009	Durban	2009-04-21	Kings XI Punjab	Kolkata Knight Riders	Kolkata Knight Riders	field	normal	1	Kolkata Knight Riders	11	0	CH Gayle	Kingsmead	DJ Harper	SD Ranade	NaN

When the Duckworth-Lewis (D/L) method is applied, only the overs and runs are reduced; all the remaining data is unchanged.

Removing 'result', 'dl_applied', 'umpire1', 'umpire2', 'umpire3' and 'venue', which I consider less important for the analysis.
I use city instead of venue to measure the home_team advantage, so I am removing venue.

In [90]:

matches_df.drop(['result','dl_applied','umpire1','umpire2','umpire3','venue'], axis=1, inplace=True)

matches_df.columns

Out[90]:

Index(['id', 'season', 'city', 'date', 'team1', 'team2', 'toss_winner',
       'toss_decision', 'winner', 'win_by_runs', 'win_by_wickets',
       'player_of_match'],
      dtype='object')

In [91]:

matches_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 743 entries, 0 to 755
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               743 non-null    int64 
 1   season           743 non-null    int64 
 2   city             736 non-null    object
 3   date             743 non-null    object
 4   team1            743 non-null    object
 5   team2            743 non-null    object
 6   toss_winner      743 non-null    object
 7   toss_decision    743 non-null    object
 8   winner           743 non-null    object
 9   win_by_runs      743 non-null    int64 
 10  win_by_wickets   743 non-null    int64 
 11  player_of_match  743 non-null    object
dtypes: int64(4), object(8)
memory usage: 75.5+ KB

3. Feature Analysis¶

All the important features are categorical, so we will use groupby aggregations and bar graphs to analyse the data.

Questions Raised :¶

season : how does the data change over the seasons ?
city : is there a home_team advantage ? Chennai seems to have a strong one.
toss_winner & toss_decision : how does the toss affect winning ?
player_of_match : who is the top player, and does he always belong to the same team ?
win_by_runs & win_by_wickets : which teams won by bigger margins (runs & wickets), and do they win more trophies ?
death matches (last matches) : are they important ?

3.1 Season :¶

In [92]:

matches_df['season'].unique()

Out[92]:

array([2017, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2018,
       2019], dtype=int64)

In [93]:

plt.figure(figsize=(14, 6))
sns.countplot(x='season', data=matches_df)
plt.show()

3.1.1 Let's shorten the team names so that the graphs are easier to read.¶

In [94]:

teams_short_names = {"Mumbai Indians":"MI", "Delhi Capitals":"DC", "Delhi Daredevils":"DC", 
               "Sunrisers Hyderabad":"SRH", "Deccan Chargers":"SRH", "Rajasthan Royals":"RR", 
               "Kolkata Knight Riders":"KKR", "Kings XI Punjab":"KXIP", 
               "Chennai Super Kings":"CSK", "Royal Challengers Bangalore":"RCB",
              "Kochi Tuskers Kerala":"KTK", "Rising Pune Supergiants":"RPS", "Rising Pune Supergiant":"RPS", "Pune Warriors":"RPS",
              "Gujarat Lions":"GL"}

matches_df.replace(teams_short_names, inplace=True)
balls_df.replace(teams_short_names, inplace=True)

The names "Rising Pune Supergiants", "Rising Pune Supergiant" and "Pune Warriors" are treated as the same team because they all refer to the Pune team; likewise "Sunrisers Hyderabad" and "Deccan Chargers" belong to the Hyderabad team, and "Delhi Daredevils" and "Delhi Capitals" to the Delhi team.
I have grouped these based on my domain knowledge of IPL cricket.
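Calling replace on the whole DataFrame looks at every column. A slightly narrower variant, sketched here on a toy frame, applies the mapping only to the team columns. The column list and the placeholder player names are my assumptions for illustration.

```python
import pandas as pd

# A small subset of the mapping above, for illustration.
short_names = {"Delhi Daredevils": "DC", "Delhi Capitals": "DC", "Mumbai Indians": "MI"}

toy_df = pd.DataFrame({
    "team1": ["Delhi Daredevils", "Mumbai Indians"],
    "team2": ["Mumbai Indians", "Delhi Capitals"],
    "winner": ["Delhi Daredevils", "Delhi Capitals"],
    "player_of_match": ["Player A", "Player B"],  # placeholder names
})

# Only these columns hold team names, so only they are rewritten.
team_cols = ["team1", "team2", "winner"]
toy_df[team_cols] = toy_df[team_cols].replace(short_names)
print(toy_df)
```

Both Delhi names collapse to DC, while player_of_match is left untouched.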

3.2 Winner¶

3.2.1 Total wins by each team¶

In [95]:

plt.figure(figsize=(14, 6))
sns.countplot(x='winner', data=matches_df)
plt.show()

3.2.2 Total Matches played by each team¶

In [96]:

print(matches_df["winner"].value_counts())

MI      107
CSK     100
KKR      92
SRH      86
RCB      83
KXIP     80
DC       76
RR       73
RPS      27
GL       13
KTK       6
Name: winner, dtype: int64

In [97]:

team1_played_matches = dict(matches_df["team1"].value_counts())
team2_played_matches = dict(matches_df["team2"].value_counts())

total_matches_played_by_team = {}

for i in team1_played_matches.keys():
    total_matches_played_by_team[i] = team1_played_matches[i]+team2_played_matches[i]

total_matches_played_by_team

Out[97]:

{'SRH': 181,
 'MI': 185,
 'KXIP': 174,
 'CSK': 163,
 'RCB': 175,
 'KKR': 175,
 'DC': 173,
 'RR': 142,
 'RPS': 75,
 'GL': 29,
 'KTK': 14}

In the above data, 'Gujarat Lions' (GL) and 'Kochi Tuskers Kerala' (KTK) have played very few matches. But we may not want to remove them, as their players' data might be helpful.
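The loop above assumes that every team appears in both the team1 and team2 columns. A possible alternative, sketched on a toy frame, adds the two value_counts Series with fill_value=0, so a team that appears in only one of the columns is still counted.

```python
import pandas as pd

# Toy matches: KKR appears only as team2 and GL only as team1.
toy_df = pd.DataFrame({
    "team1": ["MI", "CSK", "MI", "GL"],
    "team2": ["CSK", "MI", "KKR", "MI"],
})

# Align the two counts on team name; a missing side counts as 0.
played = (
    toy_df["team1"].value_counts()
    .add(toy_df["team2"].value_counts(), fill_value=0)
    .astype(int)
    .sort_values(ascending=False)
)
print(played)
```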

In [98]:

fig = plt.figure(figsize = (10, 5))

plt.bar(total_matches_played_by_team.keys(), total_matches_played_by_team.values(), color ='orange', width = 0.8)

Out[98]:

<BarContainer object of 11 artists>

Let's see which seasons Kochi Tuskers played in, and remove its data if the players' data is not very important.

In [22]:

matches_df.query("team1=='KTK'").season.unique()

Out[22]:

array([2011], dtype=int64)

In [23]:

matches_df.query("team1=='GL'").season.unique()

Out[23]:

array([2017, 2016], dtype=int64)

It is better to remove Kochi Tuskers: it played only one season, so after the train/test split KTK would have no data in the test set and would add an extra prediction class, which makes the prediction a little messy.
We will keep GL, because the test data is split from 2017, so GL appears in both train and test.

In [24]:

matches_df.query("team1!='KTK' & team2!='KTK'").shape

Out[24]:

(729, 12)

Without KTK, 729 matches remain, and the train and test data are balanced in the teams they contain.
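Note that query() returns a new frame, so the result has to be assigned for the rows to actually go away. A minimal sketch on a toy frame (the name without_ktk is hypothetical):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "team1": ["KTK", "MI", "CSK"],
    "team2": ["MI", "KTK", "RCB"],
    "winner": ["MI", "KTK", "CSK"],
})

# Keep only the matches in which KTK did not play, and store the result.
without_ktk = toy_df.query("team1 != 'KTK' and team2 != 'KTK'").reset_index(drop=True)
print(without_ktk.shape)
```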

3.2 City : let's see the home_team advantage¶

In [25]:

plt.figure(figsize=(16, 6))
sns.countplot(x='city', data=matches_df)
plt.xticks(rotation='vertical')
plt.show()

Mumbai hosted the most matches, followed by Kolkata, Delhi, Bangalore, Hyderabad and Chennai.

In [99]:

gdf = pd.DataFrame(matches_df.groupby(['city', 'winner']).size(), columns=['cnt'])

mgdf = gdf.query("city=='Mumbai'")
print('total matches : ', sum(mgdf['cnt']))
mgdf

total matches :  100

Out[99]:

		cnt
city	winner
Mumbai	CSK	11
	DC	4
	GL	1
	KKR	3
	KTK	1
	KXIP	5
	MI	52
	RCB	5
	RPS	6
	RR	7
	SRH	5

100 matches were held in Mumbai and MI won 52 of them, so MI does not have much of a home_team advantage.

In [100]:

cgdf = gdf.query("city=='Chennai'")
print('total matches : ', sum(cgdf['cnt']))
cgdf

total matches :  56

Out[100]:

		cnt
city	winner
Chennai	CSK	40
	DC	2
	KKR	2
	KXIP	1
	MI	5
	RCB	2
	RPS	1
	RR	1
	SRH	2

56 matches were held in Chennai and CSK won 40 of them (about 71%), so CSK has a clear home_team advantage. (Whistle Podu)

In [101]:

bgdf = gdf.query("city=='Bangalore'")
print('total matches : ', sum(bgdf['cnt']))
bgdf

total matches :  63

Out[101]:

		cnt
city	winner
Bangalore	CSK	4
	DC	3
	GL	1
	KKR	6
	KXIP	5
	MI	8
	RCB	29
	RPS	1
	RR	3
	SRH	3

RCB won 29 out of 63 (about 46%), which is less than 50%.

In [102]:

kgdf = gdf.query("city=='Kolkata'")
print('total matches : ', sum(kgdf['cnt']))
kgdf

total matches :  77

Out[102]:

		cnt
city	winner
Kolkata	CSK	5
	DC	2
	GL	2
	KKR	45
	KTK	1
	KXIP	3
	MI	10
	RCB	4
	RPS	1
	RR	2
	SRH	2

KKR won 45 out of 77 matches (about 58%), that's interesting.

Chennai and Kolkata have made good use of the home_team advantage.
So, there is some home_team advantage, and we can consider it as a feature.
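To compare the four cities side by side, the counts from the tables above can be turned into percentages. This is only a rough sketch: treating MI, CSK, RCB and KKR as the home teams of these cities is my assumption, and the totals include matches hosted in a city where the home team did not play, so the percentages understate the true home record.

```python
import pandas as pd

# (home wins, matches hosted) copied from the groupby tables above.
hosted = pd.DataFrame(
    {
        "city": ["Mumbai", "Chennai", "Bangalore", "Kolkata"],
        "home_team": ["MI", "CSK", "RCB", "KKR"],  # assumed city -> team mapping
        "home_wins": [52, 40, 29, 45],
        "matches": [100, 56, 63, 77],
    }
).set_index("city")

# Share of all matches hosted in the city that the home team won.
hosted["home_win_pct"] = (100 * hosted["home_wins"] / hosted["matches"]).round(1)
print(hosted.sort_values("home_win_pct", ascending=False))
```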

3.3 : toss_winner & toss_decision¶

In [103]:

plt.figure(figsize=(12, 6))
sns.countplot(x='toss_decision', data=matches_df)
plt.show()

Teams that win the toss choose to field (chase) more often than to bat.
Let's see how winning the toss affects winning the match.

In [104]:

total_matches = len(matches_df)
toss_win_match_wins = len(np.where(matches_df['toss_winner'] == matches_df['winner'])[0])

print("Toss win match win percentage : ", (toss_win_match_wins/total_matches)*100)

Toss win match win percentage :  52.2207267833109
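The same percentage can be computed without np.where: comparing the two columns gives a boolean Series, and its mean is the fraction of matches in which the toss winner also won the match. A small sketch on toy data:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "toss_winner": ["MI", "CSK", "KKR", "RR"],
    "winner": ["MI", "MI", "KKR", "DC"],
})

# True where the toss winner also won the match.
toss_and_match = toy_df["toss_winner"] == toy_df["winner"]

# The mean of a boolean Series is the share of True values.
toss_win_pct = toss_and_match.mean() * 100
print(toss_win_pct)
```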

3.3.1 Team wise Toss winner & toss decision effect :¶

In [106]:

team_wins = dict(matches_df.groupby(['winner']).size())
team_wins

Out[106]:

{'CSK': 100,
 'DC': 76,
 'GL': 13,
 'KKR': 92,
 'KTK': 6,
 'KXIP': 80,
 'MI': 107,
 'RCB': 83,
 'RPS': 27,
 'RR': 73,
 'SRH': 86}

In [34]:

teams = sorted(team_wins, key=team_wins.get, reverse=True) # in descending order of their wins

In [35]:

for team_name in sorted(team_wins, key=team_wins.get, reverse=True):
    total_matches_played = len(matches_df.query("team1=='"+team_name+"' | team2=='"+team_name+"'"))
    total_wins = team_wins[team_name]
    toss_wins = len(matches_df.query("toss_winner=='"+team_name+"'"))
    toss_choose_bat = len(matches_df.query("toss_winner=='"+team_name+"' & toss_decision=='bat'"))
    toss_choose_field = len(matches_df.query("toss_winner=='"+team_name+"' & toss_decision=='field'"))
    toss_win_match_win = len(matches_df.query("toss_winner=='"+team_name+"' & winner=='"+team_name+"'"))
    print("{} played {} matches & won {}; {} times won the toss & {} times won both toss & match. Win_% : {}; Win_by_Toss_% : {}".format(team_name, total_matches_played, total_wins, toss_wins, toss_win_match_win, round((total_wins/total_matches_played)*100, 1), round((toss_win_match_win/toss_wins)*100, 1)))

MI played 185 matches & won 107; 97 times won the toss & 55 times won both toss & match. Win_% : 57.8; Win_by_Toss_% : 56.7
CSK played 163 matches & won 100; 88 times won the toss & 57 times won both toss & match. Win_% : 61.3; Win_by_Toss_% : 64.8
KKR played 175 matches & won 92; 91 times won the toss & 53 times won both toss & match. Win_% : 52.6; Win_by_Toss_% : 58.2
SRH played 181 matches & won 86; 89 times won the toss & 42 times won both toss & match. Win_% : 47.5; Win_by_Toss_% : 47.2
RCB played 175 matches & won 83; 78 times won the toss & 40 times won both toss & match. Win_% : 47.4; Win_by_Toss_% : 51.3
KXIP played 174 matches & won 80; 80 times won the toss & 34 times won both toss & match. Win_% : 46.0; Win_by_Toss_% : 42.5
DC played 173 matches & won 76; 88 times won the toss & 41 times won both toss & match. Win_% : 43.9; Win_by_Toss_% : 46.6
RR played 142 matches & won 73; 77 times won the toss & 41 times won both toss & match. Win_% : 51.4; Win_by_Toss_% : 53.2
RPS played 30 matches & won 15; 13 times won the toss & 8 times won both toss & match. Win_% : 50.0; Win_by_Toss_% : 61.5
GL played 29 matches & won 13; 14 times won the toss & 10 times won both toss & match. Win_% : 44.8; Win_by_Toss_% : 71.4
PW played 45 matches & won 12; 20 times won the toss & 3 times won both toss & match. Win_% : 26.7; Win_by_Toss_% : 15.0
KTK played 14 matches & won 6; 8 times won the toss & 4 times won both toss & match. Win_% : 42.9; Win_by_Toss_% : 50.0

CSK has a good Win_% and Win_by_Toss_%, followed by MI, KKR and RR. GL has a high Win_by_Toss_%, but it has played far fewer matches than the other teams.

3.4 player_of_match : who is the top player ? & does he always belong to the same team ?¶

In [36]:

len(matches_df.groupby(['player_of_match']).size())

Out[36]:

In [37]:

All_best_players = dict(matches_df.groupby(['player_of_match']).size())

3.4.1 Let's see the top 20 best players¶

In [38]:

best_players_in_desc_order = sorted(All_best_players, key=All_best_players.get, reverse=True)
# best_players_in_desc_order
for i in best_players_in_desc_order[:20]:
    print(i, ' : ', All_best_players[i])

CH Gayle  :  21
AB de Villiers  :  20
DA Warner  :  17
MS Dhoni  :  17
RG Sharma  :  17
SR Watson  :  15
YK Pathan  :  15
SK Raina  :  14
G Gambhir  :  13
AM Rahane  :  12
MEK Hussey  :  12
A Mishra  :  11
AD Russell  :  11
DR Smith  :  11
V Kohli  :  11
V Sehwag  :  11
JH Kallis  :  10
KA Pollard  :  10
AT Rayudu  :  9
SP Narine  :  9

These are the top 20 players. Let's see which teams they belong to.

In [39]:

all_entries = []
for player in best_players_in_desc_order[:20]:
    teams = balls_df.query("batsman=='"+player+"'").batting_team.unique()
    all_entries += list(teams)
    print(player, ' : ', teams)

CH Gayle  :  ['RCB' 'KKR' 'KXIP']
AB de Villiers  :  ['RCB' 'DC']
DA Warner  :  ['SRH' 'DC']
MS Dhoni  :  ['RPS' 'CSK']
RG Sharma  :  ['MI' 'SRH']
SR Watson  :  ['RCB' 'RR' 'CSK']
YK Pathan  :  ['KKR' 'RR' 'SRH']
SK Raina  :  ['GL' 'CSK']
G Gambhir  :  ['KKR' 'DC']
AM Rahane  :  ['RPS' 'MI' 'RR']
MEK Hussey  :  ['CSK' 'MI']
A Mishra  :  ['DC' 'SRH']
AD Russell  :  ['DC' 'KKR']
DR Smith  :  ['GL' 'MI' 'SRH' 'CSK']
V Kohli  :  ['RCB']
V Sehwag  :  ['DC' 'KXIP']
JH Kallis  :  ['RCB' 'KKR']
KA Pollard  :  ['MI']
AT Rayudu  :  ['MI' 'CSK']
SP Narine  :  ['KKR']

In [40]:

d = {}

for i in set(all_entries):
    d[i] = all_entries.count(i)
best_player_teams = sorted(d, key=d.get, reverse=True)
for i in best_player_teams:
    print(i, ' : ', d[i])

DC  :  6
MI  :  6
KKR  :  6
CSK  :  6
SRH  :  5
RCB  :  5
RR  :  3
RPS  :  2
GL  :  2
KXIP  :  2

As per the above analysis :¶

Teams like CSK, KKR, MI, SRH, RCB and DC are the top teams, as they have had the top players.
Of these teams, CSK has reached the final the most times and MI has won the most trophies.
KKR and SRH have won the trophy 2 times each (counting the Deccan Chargers title under SRH).
DC and RCB have no trophy, but RCB is one of the best teams, with top-class players.

3.5 win_by_runs & win_by_wkts : which teams won by more margin¶

In [42]:

win_by_runs_matches = matches_df.query("win_by_runs!=0")
win_by_runs_matches.shape

Out[42]:

(337, 12)

In 337 matches, the team batting first won.
Let's see which teams won by a large margin, say win_by_runs > 100.

In [43]:

top_win_by_runs_matches = matches_df.query("win_by_runs>100")
top_win_by_runs_matches

Out[43]:

	id	season	city	date	team1	team2	toss_winner	toss_decision	winner	win_by_runs	player_of_match
43	44	2017	Delhi	2017-05-06	MI	DC	DC	field	MI	146	LMP Simmons
59	60	2008	Bangalore	2008-04-18	KKR	RCB	RCB	field	KKR	140	BB McCullum
114	115	2008	Mumbai	2008-05-30	RR	DC	DC	field	RR	105	SR Watson
295	296	2011	Dharamsala	2011-05-17	KXIP	RCB	KXIP	bat	KXIP	111	AC Gilchrist
410	411	2013	Bangalore	2013-04-23	RCB	PW	PW	field	RCB	130	CH Gayle
556	557	2015	Bangalore	2015-05-06	RCB	KXIP	KXIP	field	RCB	138	CH Gayle
619	620	2016	Bangalore	2016-05-14	RCB	GL	GL	field	RCB	144	AB de Villiers
676	7934	2018	Kolkata	09/05/18	MI	KKR	KKR	field	MI	102	Ishan Kishan
706	11147	2019	Hyderabad	31/03/19	SRH	RCB	RCB	field	SRH	118	J Bairstow

3.5.1 Team wise total win by runs¶

In [107]:

winners = list(win_by_runs_matches['winner'])
runs = list(win_by_runs_matches['win_by_runs'])

wd = {}

for i in range(len(winners)):
    if winners[i] in wd.keys():
        old_runs = wd[winners[i]]
        wd[winners[i]] = old_runs+runs[i]
    else:
        wd[winners[i]] = runs[i]

best_batting_teams = sorted(wd, key=wd.get, reverse=True)

for i in best_batting_teams:
    print(i, ' : ', wd[i])

MI  :  1866
CSK  :  1778
RCB  :  1252
SRH  :  1134
KKR  :  1086
KXIP  :  925
RR  :  895
DC  :  767
RPS  :  176
PW  :  139
KTK  :  23
GL  :  1

So, the teams with the most wins also have the largest total winning margins in runs.
The net run rate of each team would show its scoring performance more meaningfully than win_by_runs, but this is still a useful feature for identifying the best teams.
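The running total above can also be written as a groupby. Since a total grows with the number of wins, the sketch below also reports the average margin per win, which is my own addition and not part of the analysis above. The toy numbers are illustrative.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "winner": ["MI", "MI", "CSK", "RCB", "CSK"],
    "win_by_runs": [146, 0, 20, 130, 0],  # 0 means the win was by wickets
})

# Keep only the matches won by the team batting first.
batting_first_wins = toy_df[toy_df["win_by_runs"] > 0]

# Total margin, number of such wins, and average margin per win.
margins = (
    batting_first_wins.groupby("winner")["win_by_runs"]
    .agg(total="sum", wins="count", mean="mean")
    .sort_values("total", ascending=False)
)
print(margins)
```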

We can't infer much from win_by_wickets. Winning by many wickets only tells us about a team's top-order batting, and that varies a lot as players change teams or teams change their batting order.

3.6 death matches (last matches) important¶

Death matches are the matches played in the second half of an IPL season.
After the first half, a few teams will already have qualified for the playoffs (or semi-finals), a few teams will have no chance of reaching the playoffs, and the other teams will be struggling to get into them.
So the way a team plays differs a bit depending on its status.
So, I want to add one more feature called 'is_death_match'.
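One possible way to derive this flag, as a sketch under assumptions: order the matches of each season by date and mark those in the second half. The toy dates below are all ISO formatted; the real date column mixes formats such as 2017-04-05 and 30/04/19, so it would need careful parsing first.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "season": [2017, 2017, 2017, 2017, 2018, 2018],
    "date": ["2017-04-05", "2017-04-06", "2017-05-10", "2017-05-21",
             "2018-04-07", "2018-05-27"],
})
toy_df["date"] = pd.to_datetime(toy_df["date"])
toy_df = toy_df.sort_values(["season", "date"]).reset_index(drop=True)

# Position of each match within its season (1 = first match of the season).
match_no = toy_df.groupby("season").cumcount() + 1
# Number of matches in the same season, repeated on every row.
season_size = toy_df.groupby("season")["date"].transform("count")

# 1 for matches in the second half of the season, 0 otherwise.
toy_df["is_death_match"] = (match_no > season_size / 2).astype(int)
print(toy_df)
```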

4. Preparing Training Data¶

I have selected team1, team2, toss_winner, toss_decision, winner, home_team, is_death_match, team_1_score, team_2_score as the final features.
home_team, is_death_match, team_1_score, team_2_score are derived from the given features.

In [232]:

train_matches_df = pd.read_csv('final_features/train_matches_data.csv')
test_matches_df = pd.read_csv('final_features/test_matches_data.csv')

In [233]:

train_matches_df = train_matches_df.replace({'team1': {10: 0},'team2': {10: 0},'home_team': {10: 0},'toss_winner': {10: 0},'winner': {10: 0}})
test_matches_df = test_matches_df.replace({'team1': {10: 0},'team2': {10: 0},'home_team': {10: 0},'toss_winner': {10: 0},'winner': {10: 0}})

test_matches_df.head()

Out[233]:

	Unnamed: 0	team1	team2	toss_winner	toss_decision	winner	home_team	team_1_score	team_2_score
0	0	5	7	7	0	5	5	33	32
1	1	4	9	9	0	9	9	36	32
2	2	0	1	1	0	1	0	38	36
3	4	7	8	7	1	7	7	35	20
4	3	9	6	6	0	6	6	34	21

4.1 Preparing Train & Test data¶

In [234]:

y = train_matches_df[['winner']]

train_matches_df.drop(['winner'], axis=1, inplace=True)

X = train_matches_df

In [235]:

y_test = test_matches_df[['winner']]

test_matches_df.drop(['winner'], axis=1, inplace=True)

X_test_team1 = np.array(test_matches_df['team1'])
X_test_team2 = np.array(test_matches_df['team2'])
X_test_toss_winner = np.array(test_matches_df['toss_winner'])

X_test = test_matches_df

Here I define a few variables, X_test_team1, X_test_team2 and X_test_toss_winner, to use after the winner prediction. Because the winner is a multi-class prediction, the model may sometimes predict a team that is neither of the 2 teams playing; in that case we replace the predicted team with the team that won the toss in that match.
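The fallback rule can be sketched with np.where on small arrays. The team codes and predictions below are toy values, and the names not_playing and y_fixed are hypothetical.

```python
import numpy as np

team1 = np.array([5, 4, 0])
team2 = np.array([7, 9, 1])
toss_winner = np.array([7, 9, 1])
y_pred = np.array([7, 2, 1])  # the second prediction (2) is not one of the two teams

# True where the predicted team is neither team1 nor team2.
not_playing = (y_pred != team1) & (y_pred != team2)

# Fall back to the toss winner only for those rows.
y_fixed = np.where(not_playing, toss_winner, y_pred)
print(not_playing.sum(), y_fixed)
```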

In [236]:

X.head()

Out[236]:

	Unnamed: 0	team1	team2	toss_winner	toss_decision	home_team	team_1_score	team_2_score
0	58	1	7	7	0	7	14	28
1	60	3	8	3	1	8	39	27
2	59	2	6	2	1	6	50	18
3	61	4	7	4	1	4	21	33
4	62	5	1	5	1	1	26	16

In [237]:

X_test.head()

Out[237]:

	Unnamed: 0	team1	team2	toss_winner	toss_decision	home_team	team_1_score	team_2_score
0	0	5	7	7	0	5	33	32
1	1	4	9	9	0	9	36	32
2	2	0	1	1	0	0	38	36
3	4	7	8	7	1	7	35	20
4	3	9	6	6	0	6	34	21

5 Machine Learning Modelling :¶

Let us try different models and compare their scores.

In [253]:

# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
#parameters = {'max_depth': [1, 5, 10, 50]}
parameters = {'max_depth': [1, 5, 10, 50], 'min_samples_split': [2, 5, 10, 100, 500]}
clf = GridSearchCV(dt, parameters, cv=3, scoring='f1_weighted') # 'accuracy'
clf.fit(X, y)
best_depth_set1 = clf.best_estimator_.max_depth
best_samples_split_set1 = clf.best_estimator_.min_samples_split
f1_score = clf.score(X_test, y_test)
print("max depth : {} && Samples split : {} && f1_score : {}".format(best_depth_set1, best_samples_split_set1, f1_score))

max depth : 50 && Samples split : 10 && f1_score : 0.422548514991622

In [254]:

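y_pred = clf.predict(X_test)

print(y_pred)

y_test_list = np.array(y_test['winner'])

# prec counts predictions that name a team not playing in the match
prec = 0

for i in range(len(y_pred)):
    if y_pred[i] != X_test_team1[i] and y_pred[i] != X_test_team2[i]:
        prec += 1
        # fall back to the toss winner for such predictions
        y_pred[i] = X_test_toss_winner[i]

# mc counts correct predictions after the fallback
mc = 0
for i in range(len(y_pred)):
    if y_pred[i] == y_test_list[i]:
        mc += 1
mc, (mc/len(y_pred))*100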

[7 4 1 8 6 5 2 6 8 2 1 4 0 1 6 8 2 6 6 0 8 6 0 4 8 0 7 4 1 7 1 6 8 6 1 4 8
 8 8 8 6 7 4 0 1 6 6 8 6 8 5 1 9 8 4 6 2 4 2 7 5 3 6 3 4 6 4 1 3 6 8 4 3 5
 2 1 9 2 5 8 5 2 7 5 2 5 5 2 4 5 1 6 2 5 3 5 5 3 1 5 2 1 9 2 5 6 1 4 5 2 3
 5 8 2 2 5 5 2 2 4 5 5 2 1 4 3 4 7 2 6 5 2 5 5 2 4 7 3 7 2 6 2 8 5 6 2 5 4
 5 2 6 5 8 5 1 2 3 2 7 5 2 5 5 9 7 2 5 3 7 2 5 5 8 2 2]

Out[254]:

(102, 58.285714285714285)

In [255]:

from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm)

In [256]:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred, normalize='all')
cmd = ConfusionMatrixDisplay(cm, display_labels=['GL','KKR','CSK','RR','MI','SRH','KXIP','RCB','DC','RPS'])
fig, ax = plt.subplots(figsize=(10,10))
cmd.plot(ax = ax)
cmd.ax_.set(xlabel='Predicted Values', ylabel='Actual Values')

Out[256]:

[Text(0.5, 0, 'Predicted Values'), Text(0, 0.5, 'Actual Values')]

In [258]:

# pip install xgboost

In [259]:

# import warnings
# warnings.filterwarnings("ignore")
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier
gbdt = XGBClassifier()

parameters = {'max_depth': [i for i in range(2, 9)], 'n_estimators': [50, 100, 150]}
clf = RandomizedSearchCV(gbdt, parameters, cv=3, scoring='f1_weighted', n_iter = 5)
clf.fit(X, y)

best_depth_set1 = clf.best_estimator_.max_depth
best_n_estimators_set1 = clf.best_estimator_.n_estimators
# auc_set1 = clf.score(X_te, y_test)
print("max depth : {} && n-estimators : {}".format(best_depth_set1, best_n_estimators_set1))

max depth : 6 && n-estimators : 150

In [260]:

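y_pred = clf.predict(X_test)
print(y_pred)

y_test_list = np.array(y_test['winner'])

# prec counts predictions that name a team not playing in the match
prec = 0

for i in range(len(y_pred)):
    if y_pred[i] != X_test_team1[i] and y_pred[i] != X_test_team2[i]:
        prec += 1
        # fall back to the toss winner for such predictions
        y_pred[i] = X_test_toss_winner[i]

# mc counts correct predictions after the fallback
mc = 0
for i in range(len(y_pred)):
    if y_pred[i] == y_test_list[i]:
        mc += 1
mc, (mc/len(y_pred))*100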

[7 4 1 8 6 1 4 6 9 5 1 4 0 5 6 8 4 1 6 0 8 6 0 8 9 0 1 4 1 0 1 6 7 6 1 7 9
 8 1 8 6 5 8 0 1 5 1 8 6 9 5 1 9 8 4 1 4 4 2 6 1 3 2 3 5 7 8 1 7 6 8 7 3 5
 3 1 7 5 3 8 5 2 5 8 4 3 7 8 4 3 1 4 7 5 4 3 5 3 1 8 3 1 7 5 4 6 1 4 7 2 3
 5 4 2 5 3 1 5 2 4 1 3 2 1 7 3 6 7 2 6 3 2 8 7 2 5 8 3 5 2 4 3 8 3 7 2 5 7
 6 5 8 1 8 3 5 7 3 5 7 3 2 3 4 7 5 8 6 3 5 6 4 4 8 2 2]

Out[260]:

(92, 52.57142857142857)

In [261]:

from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm)

In [262]:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred, normalize='all')
cmd = ConfusionMatrixDisplay(cm, display_labels=['GL','KKR','CSK','RR','MI','SRH','KXIP','RCB','DC','RPS'])
fig, ax = plt.subplots(figsize=(10,10))
cmd.plot(ax = ax)
cmd.ax_.set(xlabel='Predicted Values', ylabel='Actual Values')

Out[262]:

[Text(0.5, 0, 'Predicted Values'), Text(0, 0.5, 'Actual Values')]

6 Analysis :¶

From the EDA, we can say:
Chennai and Kolkata have made good use of the home_team advantage.
MI, CSK, KKR and RR have good win percentages.

Positive Results :

CSK is predicted more accurately. Its captain is Dhoni, who is capable of keeping the team consistent whoever the players are, so both its win % and its predictions are good, and it has a good home_team advantage.
MI has a high win %, but lags a bit behind CSK and KKR in home_team advantage and toss_decision.
KKR is good at toss_decision and home_team advantage, and has the next-best win percentage after CSK and MI.

Negative Results :

RR has a good win percentage and toss_decision value, but our model fails to predict its results well. RR and CSK were both out of the IPL for 2 seasons, and RR's squad changed much more than CSK's, which may be why its predictions are wrong.
Also, for a few teams the model makes hardly any predictions.

7 Deployment¶

I have prepared code that is ready for deployment, and I have tested it locally.
I have exported the model as a .pkl file so that it can be reused on a remote server.
I have pushed the code to GitHub at https://github.com/siva097/predict-ipl-match-winner, and deployed it to Heroku.
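For reference, exporting and reloading a model as a .pkl file could look roughly like the sketch below. It assumes the standard pickle module and a tiny toy classifier; the file name is hypothetical, and the export code in the repository may differ.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative training set; columns stand for team1, team2, toss_winner.
X_toy = np.array([[1, 2, 1], [1, 2, 2], [3, 4, 3], [3, 4, 4]])
y_toy = np.array([1, 2, 3, 4])
model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Hypothetical file name, written to the system temp directory.
model_path = Path(tempfile.gettempdir()) / "ipl_winner_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# On the server, load the file and predict with the restored model.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```

A pickled model should only be loaded from a trusted source, and ideally with the same scikit-learn version that was used to train it.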

Check the working app : http://predict-ipl-match-winner.herokuapp.com/

8 Improvements :¶

More data would be useful. I have data only up to IPL 2019; IPL 2020 & IPL 2021 might help in better prediction.
We can experiment with deriving new features from the deliveries.csv dataset, which contains the players' data. The players' contribution is very important for a team to win a match.
It is not that easy to use players' data in a way that helps model training, because we have to calculate each player's performance in every match, and each player's value to the team for every match, so that the model can understand the player's importance in winning or losing that particular match.
For example, as the IPL is a Twenty20 format, players are expected to score at least a run a ball, and a slow run-maker is not that valuable to a team. We can make the model understand this by calculating the strike rate of each player for each match before training.
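A per-match strike rate could be sketched as below. Only the batsman column is confirmed by the code above; the column names match_id, batsman_runs and wide_runs are assumptions about deliveries.csv, and the rows are toy data.

```python
import pandas as pd

# Toy deliveries; match_id, batsman_runs and wide_runs are assumed column names.
toy_balls = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 1, 1],
    "batsman": ["A", "A", "A", "B", "B", "A"],
    "batsman_runs": [4, 0, 6, 1, 0, 2],
    "wide_runs": [0, 0, 0, 0, 1, 0],
})

# Wides do not count as balls faced by the batsman.
legal = toy_balls[toy_balls["wide_runs"] == 0]

# Runs scored and balls faced per batsman per match.
per_match = legal.groupby(["match_id", "batsman"])["batsman_runs"].agg(
    runs="sum", balls="count"
)

# Strike rate = runs per 100 balls faced.
per_match["strike_rate"] = (100 * per_match["runs"] / per_match["balls"]).round(1)
print(per_match)
```

A strike rate of 100 corresponds to one run per ball, which is the threshold mentioned above.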

Conclusion¶

An IPL cricket match is highly uncertain: every match and every ball is hard to predict, and this becomes even harder in the death overs, where a few bowlers like Rashid Khan and Sunil Narine become powerful hitters.
Still, more than win prediction, match analysis and an understanding of each player's value to a team can really help the team selectors choose a better team.
