IPL Data Analysis & Match Winner Prediction¶
1. Load & View the Dataset¶
- The dataset is downloaded from Kaggle and covers 12 IPL seasons. It contains two CSV files: one with match-level data and one with ball-by-ball data for every delivery of every match across all seasons.
- I will treat the matches data as the main dataset and derive a few features from the deliveries dataset to add to it.
1.1 Loading the common important libraries¶
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings("ignore")
1.2 Loading the datasets¶
matches_df = pd.read_csv('matches.csv')
balls_df =  pd.read_csv('deliveries.csv')
matches_df.head()
| | id | season | city | date | team1 | team2 | toss_winner | toss_decision | result | dl_applied | winner | win_by_runs | win_by_wickets | player_of_match | venue | umpire1 | umpire2 | umpire3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017 | Hyderabad | 2017-04-05 | Sunrisers Hyderabad | Royal Challengers Bangalore | Royal Challengers Bangalore | field | normal | 0 | Sunrisers Hyderabad | 35 | 0 | Yuvraj Singh | Rajiv Gandhi International Stadium, Uppal | AY Dandekar | NJ Llong | NaN | 
| 1 | 2 | 2017 | Pune | 2017-04-06 | Mumbai Indians | Rising Pune Supergiant | Rising Pune Supergiant | field | normal | 0 | Rising Pune Supergiant | 0 | 7 | SPD Smith | Maharashtra Cricket Association Stadium | A Nand Kishore | S Ravi | NaN | 
| 2 | 3 | 2017 | Rajkot | 2017-04-07 | Gujarat Lions | Kolkata Knight Riders | Kolkata Knight Riders | field | normal | 0 | Kolkata Knight Riders | 0 | 10 | CA Lynn | Saurashtra Cricket Association Stadium | Nitin Menon | CK Nandan | NaN | 
| 3 | 4 | 2017 | Indore | 2017-04-08 | Rising Pune Supergiant | Kings XI Punjab | Kings XI Punjab | field | normal | 0 | Kings XI Punjab | 0 | 6 | GJ Maxwell | Holkar Cricket Stadium | AK Chaudhary | C Shamshuddin | NaN | 
| 4 | 5 | 2017 | Bangalore | 2017-04-08 | Royal Challengers Bangalore | Delhi Daredevils | Royal Challengers Bangalore | bat | normal | 0 | Royal Challengers Bangalore | 15 | 0 | KM Jadhav | M Chinnaswamy Stadium | NaN | NaN | NaN | 
matches_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 756 entries, 0 to 755
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               756 non-null    int64
 1   season           756 non-null    int64
 2   city             749 non-null    object
 3   date             756 non-null    object
 4   team1            756 non-null    object
 5   team2            756 non-null    object
 6   toss_winner      756 non-null    object
 7   toss_decision    756 non-null    object
 8   result           756 non-null    object
 9   dl_applied       756 non-null    int64
 10  winner           752 non-null    object
 11  win_by_runs      756 non-null    int64
 12  win_by_wickets   756 non-null    int64
 13  player_of_match  752 non-null    object
 14  venue            756 non-null    object
 15  umpire1          754 non-null    object
 16  umpire2          754 non-null    object
 17  umpire3          119 non-null    object
dtypes: int64(5), object(13)
memory usage: 106.4+ KB
- There are 7 null values for city, 4 each for winner and player_of_match, 2 each for umpire1 and umpire2, and 637 (most of the column) for umpire3.
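The null counts above are read off the `info()` output by subtraction; they can also be computed directly. A minimal sketch on a toy frame (the rows are made up for illustration and are not taken from matches.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for matches_df; the values are illustrative only.
toy_df = pd.DataFrame({
    "city": ["Mumbai", np.nan, "Delhi"],
    "winner": ["MI", "CSK", np.nan],
    "umpire3": [np.nan, np.nan, np.nan],
})

# Count missing values per column, most affected columns first.
null_counts = toy_df.isna().sum().sort_values(ascending=False)
print(null_counts)
```

On the real data the same call would be `matches_df.isna().sum()`.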
2. Handling of Null values :¶
- player_of_match is useful for gauging a team's strength, but the few missing values have little impact, so we can work with the available names.
- We remove the rows where winner is NaN, since the winner is the prediction target.
- The null values of city do not matter for now; I will handle them later, in the important features part.
matches_df[matches_df['winner'].isna()]
| | id | season | city | date | team1 | team2 | toss_winner | toss_decision | result | dl_applied | winner | win_by_runs | win_by_wickets | player_of_match | venue | umpire1 | umpire2 | umpire3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 300 | 301 | 2011 | Delhi | 2011-05-21 | Delhi Daredevils | Pune Warriors | Delhi Daredevils | bat | no result | 0 | NaN | 0 | 0 | NaN | Feroz Shah Kotla | SS Hazare | RJ Tucker | NaN | 
| 545 | 546 | 2015 | Bangalore | 2015-04-29 | Royal Challengers Bangalore | Rajasthan Royals | Rajasthan Royals | field | no result | 0 | NaN | 0 | 0 | NaN | M Chinnaswamy Stadium | JD Cloete | PG Pathak | NaN | 
| 570 | 571 | 2015 | Bangalore | 2015-05-17 | Delhi Daredevils | Royal Challengers Bangalore | Royal Challengers Bangalore | field | no result | 0 | NaN | 0 | 0 | NaN | M Chinnaswamy Stadium | HDPK Dharmasena | K Srinivasan | NaN | 
| 744 | 11340 | 2019 | Bengaluru | 30/04/19 | Royal Challengers Bangalore | Rajasthan Royals | Rajasthan Royals | field | no result | 0 | NaN | 0 | 0 | NaN | M. Chinnaswamy Stadium | Nigel Llong | Ulhas Gandhe | Anil Chaudhary | 
- So removing the rows with a null winner also removes the null values of player_of_match and the 'no result' rows of the result column.
print('Matches dataset shape before removing NaN values of winner : ', matches_df.shape)
matches_df.dropna(subset = ['winner'], inplace=True)
print('Matches dataset shape after removing NaN values : ', matches_df.shape)
Matches dataset shape before removing NaN values of winner : (756, 18)
Matches dataset shape after removing NaN values : (752, 18)
matches_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 752 entries, 0 to 755
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               752 non-null    int64
 1   season           752 non-null    int64
 2   city             745 non-null    object
 3   date             752 non-null    object
 4   team1            752 non-null    object
 5   team2            752 non-null    object
 6   toss_winner      752 non-null    object
 7   toss_decision    752 non-null    object
 8   result           752 non-null    object
 9   dl_applied       752 non-null    int64
 10  winner           752 non-null    object
 11  win_by_runs      752 non-null    int64
 12  win_by_wickets   752 non-null    int64
 13  player_of_match  752 non-null    object
 14  venue            752 non-null    object
 15  umpire1          750 non-null    object
 16  umpire2          750 non-null    object
 17  umpire3          118 non-null    object
dtypes: int64(5), object(13)
memory usage: 111.6+ KB
- Apart from city and the umpire columns, all features now have the same number of non-null values. Conveniently, the null values of player_of_match were in the same rows as the null winners, so they are gone as well.
- I will fill the null values of city later, when we discuss its importance.
matches_df.result.unique()
array(['normal', 'tie'], dtype=object)
- result : the result feature now contains 2 possible values. We are not interested in 'tie' matches; we only need 'normal' matches, in which one of the 2 teams wins outright. So we filter the data on result == 'normal'.
len(matches_df.query("result=='tie'"))
9
- There are 9 tied matches, so we remove them.
matches_df.query("result=='normal'", inplace=True)
print('Matches dataset shape after removing tie matches  : ', matches_df.shape)
Matches dataset shape after removing tie matches : (743, 18)
matches_df.query("dl_applied==1").head()
| | id | season | city | date | team1 | team2 | toss_winner | toss_decision | result | dl_applied | winner | win_by_runs | win_by_wickets | player_of_match | venue | umpire1 | umpire2 | umpire3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | 57 | 2017 | Bangalore | 2017-05-17 | Sunrisers Hyderabad | Kolkata Knight Riders | Kolkata Knight Riders | field | normal | 1 | Kolkata Knight Riders | 0 | 7 | NM Coulter-Nile | M Chinnaswamy Stadium | AK Chaudhary | Nitin Menon | NaN | 
| 99 | 100 | 2008 | Delhi | 2008-05-17 | Delhi Daredevils | Kings XI Punjab | Delhi Daredevils | bat | normal | 1 | Kings XI Punjab | 6 | 0 | DPMD Jayawardene | Feroz Shah Kotla | AV Jayaprakash | RE Koertzen | NaN | 
| 102 | 103 | 2008 | Kolkata | 2008-05-18 | Kolkata Knight Riders | Chennai Super Kings | Kolkata Knight Riders | bat | normal | 1 | Chennai Super Kings | 3 | 0 | M Ntini | Eden Gardens | Asad Rauf | K Hariharan | NaN | 
| 119 | 120 | 2009 | Cape Town | 2009-04-19 | Kings XI Punjab | Delhi Daredevils | Delhi Daredevils | field | normal | 1 | Delhi Daredevils | 0 | 10 | DL Vettori | Newlands | MR Benson | SD Ranade | NaN | 
| 122 | 123 | 2009 | Durban | 2009-04-21 | Kings XI Punjab | Kolkata Knight Riders | Kolkata Knight Riders | field | normal | 1 | Kolkata Knight Riders | 11 | 0 | CH Gayle | Kingsmead | DJ Harper | SD Ranade | NaN | 
- When the D/L (Duckworth-Lewis) method is applied, only the overs and runs are reduced; all the other data is unaffected.
- I am removing 'result', 'dl_applied', 'umpire1', 'umpire2', 'umpire3' and 'venue', which I consider less important for the analysis.
- I use city instead of venue to measure the home_team advantage, so venue is removed too.
matches_df.drop(['result','dl_applied','umpire1','umpire2','umpire3','venue'], axis=1, inplace=True)
matches_df.columns
Index(['id', 'season', 'city', 'date', 'team1', 'team2', 'toss_winner',
       'toss_decision', 'winner', 'win_by_runs', 'win_by_wickets',
       'player_of_match'],
      dtype='object')
matches_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 743 entries, 0 to 755
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               743 non-null    int64
 1   season           743 non-null    int64
 2   city             736 non-null    object
 3   date             743 non-null    object
 4   team1            743 non-null    object
 5   team2            743 non-null    object
 6   toss_winner      743 non-null    object
 7   toss_decision    743 non-null    object
 8   winner           743 non-null    object
 9   win_by_runs      743 non-null    int64
 10  win_by_wickets   743 non-null    int64
 11  player_of_match  743 non-null    object
dtypes: int64(4), object(8)
memory usage: 75.5+ KB
3. Feature Analysis¶
- All the important features are categorical, so we will use groupby operations and bar graphs to analyse the data.
Questions Raised :¶
- season : how does the data change over the seasons?
- city : is there a home_team advantage? Chennai seems to have a strong one.
- toss_winner & toss_decision : how does the toss affect the match result?
- player_of_match : who is the top player, and does he always belong to the same team?
- win_by_runs & win_by_wickets : which teams won by larger margins (runs and wickets), and did they win more trophies?
- death matches (the last matches of a season) : are they important?
3.1 Season :¶
matches_df['season'].unique()
array([2017, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2018,
       2019], dtype=int64)
plt.figure(figsize=(14, 6))
sns.countplot(x='season', data=matches_df)
plt.show()
3.1.1 Let's shorten the team names so that the graphs are displayed neatly.¶
teams_short_names = {"Mumbai Indians":"MI", "Delhi Capitals":"DC", "Delhi Daredevils":"DC", 
               "Sunrisers Hyderabad":"SRH", "Deccan Chargers":"SRH", "Rajasthan Royals":"RR", 
               "Kolkata Knight Riders":"KKR", "Kings XI Punjab":"KXIP", 
               "Chennai Super Kings":"CSK", "Royal Challengers Bangalore":"RCB",
              "Kochi Tuskers Kerala":"KTK", "Rising Pune Supergiants":"RPS", "Rising Pune Supergiant":"RPS", "Pune Warriors":"RPS",
              "Gujarat Lions":"GL"}
matches_df.replace(teams_short_names, inplace=True)
balls_df.replace(teams_short_names, inplace=True)
- The names "Rising Pune Supergiants", "Rising Pune Supergiant" and "Pune Warriors" are treated as the same team because they all represent Pune. Likewise, "Sunrisers Hyderabad" and "Deccan Chargers" represent Hyderabad, and "Delhi Daredevils" and "Delhi Capitals" represent Delhi.
- I have grouped these based on my domain knowledge of IPL cricket.
3.2 Winner¶
3.2.1 Total wins by each team¶
plt.figure(figsize=(14, 6))
sns.countplot(x='winner', data=matches_df)
plt.show()
3.2.2 Total Matches played by each team¶
print(matches_df["winner"].value_counts())
MI      107
CSK     100
KKR      92
SRH      86
RCB      83
KXIP     80
DC       76
RR       73
RPS      27
GL       13
KTK       6
Name: winner, dtype: int64
team1_played_matches = dict(matches_df["team1"].value_counts())
team2_played_matches = dict(matches_df["team2"].value_counts())
total_matches_played_by_team = {}
for i in team1_played_matches.keys():
    total_matches_played_by_team[i] = team1_played_matches[i]+team2_played_matches[i]
total_matches_played_by_team
{'SRH': 181,
 'MI': 185,
 'KXIP': 174,
 'CSK': 163,
 'RCB': 175,
 'KKR': 175,
 'DC': 173,
 'RR': 142,
 'RPS': 75,
 'GL': 29,
 'KTK': 14}
- In the above data, Gujarat Lions (GL) and Kochi Tuskers Kerala (KTK) have played very few matches. But we may not remove them, as their players' data might be helpful.
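The loop above assumes every team that appears in team1 also appears in team2; otherwise the dictionary lookup raises a KeyError. A minimal sketch of a more defensive variant, using `Series.add` with `fill_value=0` on a toy frame (the rows are made up for illustration):

```python
import pandas as pd

# Toy stand-in for matches_df; KTK never appears as team2 and RCB never as team1.
toy_matches = pd.DataFrame({
    "team1": ["MI", "CSK", "MI", "KTK"],
    "team2": ["CSK", "MI", "RCB", "MI"],
})

team1_counts = toy_matches["team1"].value_counts()
team2_counts = toy_matches["team2"].value_counts()

# fill_value=0 treats a team that is missing on one side as having 0 matches there.
total_matches = team1_counts.add(team2_counts, fill_value=0).astype(int)
total_matches_by_team = total_matches.to_dict()
print(total_matches_by_team)
```

The resulting dictionary has the same shape as total_matches_played_by_team, so it could be passed to the bar plot unchanged.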
fig = plt.figure(figsize = (10, 5))
plt.bar(total_matches_played_by_team.keys(), total_matches_played_by_team.values(), color ='orange', width = 0.8)
<BarContainer object of 11 artists>
- Let's see in which seasons Kochi Tuskers played, and remove their data if the players' data is not very important.
matches_df.query("team1=='KTK'").season.unique()
array([2011], dtype=int64)
matches_df.query("team1=='GL'").season.unique()
array([2017, 2016], dtype=int64)
- It is better to remove Kochi Tuskers, as the team played only one season (2011). When we do the train/test split, KTK will have no data in the test set, leaving an extra prediction class that makes the prediction a little messy.
- We will keep GL, as we split off the test data from 2017 onwards, so GL appears in both the train and the test data.
matches_df.query("team1!='KTK' & team2!='KTK'").shape
(729, 12)
- Now the train and test data will contain the same set of teams.
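Note that `query` returns a new frame, so the filtered result has to be assigned for the KTK rows to actually be dropped. A minimal sketch of the filter followed by a season-based split, on a toy frame; the cut-off constant `TEST_FROM_SEASON` and the names `train_part` and `test_part` are assumptions for illustration, not the code that produced the final feature files:

```python
import pandas as pd

# Toy stand-in for matches_df; the rows are illustrative only.
toy_matches = pd.DataFrame({
    "season": [2015, 2016, 2016, 2017, 2018],
    "team1": ["MI", "GL", "KTK", "GL", "CSK"],
    "team2": ["CSK", "RCB", "MI", "MI", "SRH"],
})

# Drop every match involving KTK and keep the filtered frame.
filtered = toy_matches.query("team1 != 'KTK' and team2 != 'KTK'")

# Assumed cut-off: seasons from 2017 onwards form the test set.
TEST_FROM_SEASON = 2017
train_part = filtered[filtered["season"] < TEST_FROM_SEASON]
test_part = filtered[filtered["season"] >= TEST_FROM_SEASON]
print(len(train_part), len(test_part))
```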
3.2 City : let's see the home_team advantage¶
plt.figure(figsize=(16, 6))
sns.countplot(x='city', data=matches_df)
plt.xticks(rotation='vertical')
plt.show()
- Mumbai hosted the most matches, followed by Kolkata, Delhi, Bangalore, Hyderabad and Chennai.
gdf = pd.DataFrame(matches_df.groupby(['city', 'winner']).size(), columns=['cnt'])
mgdf = gdf.query("city=='Mumbai'")
print('total matches : ', sum(mgdf['cnt']))
mgdf
total matches : 100
| city | winner | cnt |
|---|---|---|
| Mumbai | CSK | 11 |
| Mumbai | DC | 4 |
| Mumbai | GL | 1 |
| Mumbai | KKR | 3 |
| Mumbai | KTK | 1 |
| Mumbai | KXIP | 5 |
| Mumbai | MI | 52 |
| Mumbai | RCB | 5 |
| Mumbai | RPS | 6 |
| Mumbai | RR | 7 |
| Mumbai | SRH | 5 |
- 100 matches were held in Mumbai, and MI won 52 of them (52%). MI does not have much of a home_team advantage.
cgdf = gdf.query("city=='Chennai'")
print('total matches : ', sum(cgdf['cnt']))
cgdf
total matches : 56
| city | winner | cnt |
|---|---|---|
| Chennai | CSK | 40 |
| Chennai | DC | 2 |
| Chennai | KKR | 2 |
| Chennai | KXIP | 1 |
| Chennai | MI | 5 |
| Chennai | RCB | 2 |
| Chennai | RPS | 1 |
| Chennai | RR | 1 |
| Chennai | SRH | 2 |
- 56 matches were held in Chennai, and CSK won 40 of them (about 71%). CSK has a clear home_team advantage. (Whistle Podu)
bgdf = gdf.query("city=='Bangalore'")
print('total matches : ', sum(bgdf['cnt']))
bgdf
total matches : 63
| city | winner | cnt |
|---|---|---|
| Bangalore | CSK | 4 |
| Bangalore | DC | 3 |
| Bangalore | GL | 1 |
| Bangalore | KKR | 6 |
| Bangalore | KXIP | 5 |
| Bangalore | MI | 8 |
| Bangalore | RCB | 29 |
| Bangalore | RPS | 1 |
| Bangalore | RR | 3 |
| Bangalore | SRH | 3 |
- RCB won 29 of the 63 matches in Bangalore, less than 50%.
kgdf = gdf.query("city=='Kolkata'")
print('total matches : ', sum(kgdf['cnt']))
kgdf
total matches : 77
| city | winner | cnt |
|---|---|---|
| Kolkata | CSK | 5 |
| Kolkata | DC | 2 |
| Kolkata | GL | 2 |
| Kolkata | KKR | 45 |
| Kolkata | KTK | 1 |
| Kolkata | KXIP | 3 |
| Kolkata | MI | 10 |
| Kolkata | RCB | 4 |
| Kolkata | RPS | 1 |
| Kolkata | RR | 2 |
| Kolkata | SRH | 2 |
- KKR won 45 of the 77 matches in Kolkata (about 58%), that's interesting.
- Chennai and Kolkata have made good use of the home_team advantage.
- So there is some home_team advantage, and we can consider it as a feature.
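The per-city checks above can be collapsed into one calculation. A minimal sketch, assuming a hand-written city-to-team mapping (`HOME_TEAM_BY_CITY` and `home_win_share` are hypothetical names, and the toy rows are made up). Like the tables above, it counts every match held in a city, including any in which the home team did not play:

```python
import pandas as pd

# Hypothetical mapping from host city to home team (an assumption for illustration).
HOME_TEAM_BY_CITY = {"Mumbai": "MI", "Chennai": "CSK", "Kolkata": "KKR", "Bangalore": "RCB"}

def home_win_share(matches: pd.DataFrame, home_team_by_city: dict) -> pd.Series:
    """Share of matches in each mapped city that were won by that city's home team."""
    home_team = matches["city"].map(home_team_by_city)
    has_home = home_team.notna()
    known = matches[has_home]
    home_won = known["winner"] == home_team[has_home]
    return home_won.groupby(known["city"]).mean()

# Toy stand-in for matches_df; Pune has no mapped home team and is ignored.
toy_matches = pd.DataFrame({
    "city": ["Chennai", "Chennai", "Chennai", "Chennai", "Mumbai", "Mumbai", "Pune"],
    "winner": ["CSK", "CSK", "MI", "CSK", "MI", "CSK", "RPS"],
})
share = home_win_share(toy_matches, HOME_TEAM_BY_CITY)
print(share)
```

With the counts reported above, the same ratio works out to 40/56 ≈ 0.71 for Chennai, 45/77 ≈ 0.58 for Kolkata, 52/100 = 0.52 for Mumbai and 29/63 ≈ 0.46 for Bangalore.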
3.3 : toss_winner & toss_decision¶
plt.figure(figsize=(12, 6))
sns.countplot(x='toss_decision', data=matches_df)
plt.show()
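- Toss winners chose to field first (chase) more often than to bat first. Note that this plot counts toss decisions, not match wins.
- Let's see the effect of winning the toss on winning the match.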
total_matches = len(matches_df)
toss_win_match_wins = len(np.where(matches_df['toss_winner'] == matches_df['winner'])[0])
print("Toss win match win percentage : ", (toss_win_match_wins/total_matches)*100)
Toss win match win percentage : 52.2207267833109
3.3.1 Team wise Toss winner & toss decision effect :¶
team_wins = dict(matches_df.groupby(['winner']).size())
team_wins
{'CSK': 100,
 'DC': 76,
 'GL': 13,
 'KKR': 92,
 'KTK': 6,
 'KXIP': 80,
 'MI': 107,
 'RCB': 83,
 'RPS': 27,
 'RR': 73,
 'SRH': 86}
teams = sorted(team_wins, key=team_wins.get, reverse=True) # in descending order of their wins
for team_name in sorted(team_wins, key=team_wins.get, reverse=True):
    total_matches_played = len(matches_df.query("team1=='"+team_name+"' | team2=='"+team_name+"'"))
    total_wins = team_wins[team_name]
    toss_wins = len(matches_df.query("toss_winner=='"+team_name+"'"))
    toss_choose_bat = len(matches_df.query("toss_winner=='"+team_name+"' & toss_decision=='bat'"))
    toss_choose_field = len(matches_df.query("toss_winner=='"+team_name+"' & toss_decision=='field'"))
    toss_win_match_win = len(matches_df.query("toss_winner=='"+team_name+"' & winner=='"+team_name+"'"))
    print("{} played {} matches & won {}; {} times won the toss & {} times won both toss & match. Win_% : {}; Win_by_Toss_% : {}".format(team_name, total_matches_played, total_wins, toss_wins, toss_win_match_win, round((total_wins/total_matches_played)*100, 1), round((toss_win_match_win/toss_wins)*100, 1)))
MI played 185 matches & won 107; 97 times won the toss & 55 times won both toss & match. Win_% : 57.8; Win_by_Toss_% : 56.7
CSK played 163 matches & won 100; 88 times won the toss & 57 times won both toss & match. Win_% : 61.3; Win_by_Toss_% : 64.8
KKR played 175 matches & won 92; 91 times won the toss & 53 times won both toss & match. Win_% : 52.6; Win_by_Toss_% : 58.2
SRH played 181 matches & won 86; 89 times won the toss & 42 times won both toss & match. Win_% : 47.5; Win_by_Toss_% : 47.2
RCB played 175 matches & won 83; 78 times won the toss & 40 times won both toss & match. Win_% : 47.4; Win_by_Toss_% : 51.3
KXIP played 174 matches & won 80; 80 times won the toss & 34 times won both toss & match. Win_% : 46.0; Win_by_Toss_% : 42.5
DC played 173 matches & won 76; 88 times won the toss & 41 times won both toss & match. Win_% : 43.9; Win_by_Toss_% : 46.6
RR played 142 matches & won 73; 77 times won the toss & 41 times won both toss & match. Win_% : 51.4; Win_by_Toss_% : 53.2
RPS played 30 matches & won 15; 13 times won the toss & 8 times won both toss & match. Win_% : 50.0; Win_by_Toss_% : 61.5
GL played 29 matches & won 13; 14 times won the toss & 10 times won both toss & match. Win_% : 44.8; Win_by_Toss_% : 71.4
PW played 45 matches & won 12; 20 times won the toss & 3 times won both toss & match. Win_% : 26.7; Win_by_Toss_% : 15.0
KTK played 14 matches & won 6; 8 times won the toss & 4 times won both toss & match. Win_% : 42.9; Win_by_Toss_% : 50.0
- CSK has the best win % and a high win-toss-win-match %, followed by MI, KKR and RR. GL has a high win-toss-win-match % too, but played far fewer matches than the other teams.
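The per-team toss figures can also be computed without building query strings in a loop. A minimal sketch using a single groupby on a toy frame (the rows and the output column names are made up for illustration):

```python
import pandas as pd

# Toy stand-in for matches_df; the rows are illustrative only.
toy_matches = pd.DataFrame({
    "toss_winner": ["MI", "MI", "CSK", "CSK", "CSK"],
    "winner":      ["MI", "CSK", "CSK", "CSK", "MI"],
})

# Flag the matches where the toss winner also won the match, then aggregate per team.
toss_stats = (
    toy_matches.assign(won_after_toss=toy_matches["toss_winner"] == toy_matches["winner"])
    .groupby("toss_winner")["won_after_toss"]
    .agg(toss_wins="size", toss_and_match_wins="sum")
)
toss_stats["win_by_toss_pct"] = (
    100 * toss_stats["toss_and_match_wins"] / toss_stats["toss_wins"]
).round(1)
print(toss_stats)
```

For teams with only a handful of toss wins, such as GL and KTK above, the percentage rests on very few matches and should be read with caution.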
3.4 player_of_match : who is the top player ? & does he always belong to the same team ?¶
len(matches_df.groupby(['player_of_match']).size())
224
All_best_players = dict(matches_df.groupby(['player_of_match']).size())
3.4.1 Let's see the top 20 best players¶
best_players_in_desc_order = sorted(All_best_players, key=All_best_players.get, reverse=True)
# best_players_in_desc_order
for i in best_players_in_desc_order[:20]:
    print(i, ' : ', All_best_players[i])
CH Gayle : 21
AB de Villiers : 20
DA Warner : 17
MS Dhoni : 17
RG Sharma : 17
SR Watson : 15
YK Pathan : 15
SK Raina : 14
G Gambhir : 13
AM Rahane : 12
MEK Hussey : 12
A Mishra : 11
AD Russell : 11
DR Smith : 11
V Kohli : 11
V Sehwag : 11
JH Kallis : 10
KA Pollard : 10
AT Rayudu : 9
SP Narine : 9
- These are the top 20 players; let's see which teams they have played for.
all_entries = []
for player in best_players_in_desc_order[:20]:
    teams = balls_df.query("batsman=='"+player+"'").batting_team.unique()
    all_entries += list(teams)
    print(player, ' : ', teams)
CH Gayle : ['RCB' 'KKR' 'KXIP']
AB de Villiers : ['RCB' 'DC']
DA Warner : ['SRH' 'DC']
MS Dhoni : ['RPS' 'CSK']
RG Sharma : ['MI' 'SRH']
SR Watson : ['RCB' 'RR' 'CSK']
YK Pathan : ['KKR' 'RR' 'SRH']
SK Raina : ['GL' 'CSK']
G Gambhir : ['KKR' 'DC']
AM Rahane : ['RPS' 'MI' 'RR']
MEK Hussey : ['CSK' 'MI']
A Mishra : ['DC' 'SRH']
AD Russell : ['DC' 'KKR']
DR Smith : ['GL' 'MI' 'SRH' 'CSK']
V Kohli : ['RCB']
V Sehwag : ['DC' 'KXIP']
JH Kallis : ['RCB' 'KKR']
KA Pollard : ['MI']
AT Rayudu : ['MI' 'CSK']
SP Narine : ['KKR']
d = {}
for i in set(all_entries):
    d[i] = all_entries.count(i)
best_player_teams = sorted(d, key=d.get, reverse=True)
for i in best_player_teams:
    print(i, ' : ', d[i])
DC : 6
MI : 6
KKR : 6
CSK : 6
SRH : 5
RCB : 5
RR : 3
RPS : 2
GL : 2
KXIP : 2
As per the above analysis :¶
- Teams like CSK, KKR, MI, SRH, RCB and DC are the top teams, as they have had the top players.
- Of those teams, CSK has reached the final the most times and MI has won the most trophies.
- KKR and SRH have each won the trophy twice (counting the Deccan Chargers title under SRH).
- DC and RCB have no trophy, but RCB is one of the best teams, with top-class players.
3.5 win_by_runs & win_by_wkts : which teams won by more margin¶
win_by_runs_matches = matches_df.query("win_by_runs!=0")
win_by_runs_matches.shape
(337, 12)
- In 337 matches, the team batting first won.
- Let's see which teams won by a large margin, say win_by_runs > 100.
top_win_by_runs_matches = matches_df.query("win_by_runs>100")
top_win_by_runs_matches
| | id | season | city | date | team1 | team2 | toss_winner | toss_decision | winner | win_by_runs | win_by_wickets | player_of_match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 43 | 44 | 2017 | Delhi | 2017-05-06 | MI | DC | DC | field | MI | 146 | 0 | LMP Simmons | 
| 59 | 60 | 2008 | Bangalore | 2008-04-18 | KKR | RCB | RCB | field | KKR | 140 | 0 | BB McCullum | 
| 114 | 115 | 2008 | Mumbai | 2008-05-30 | RR | DC | DC | field | RR | 105 | 0 | SR Watson | 
| 295 | 296 | 2011 | Dharamsala | 2011-05-17 | KXIP | RCB | KXIP | bat | KXIP | 111 | 0 | AC Gilchrist | 
| 410 | 411 | 2013 | Bangalore | 2013-04-23 | RCB | PW | PW | field | RCB | 130 | 0 | CH Gayle | 
| 556 | 557 | 2015 | Bangalore | 2015-05-06 | RCB | KXIP | KXIP | field | RCB | 138 | 0 | CH Gayle | 
| 619 | 620 | 2016 | Bangalore | 2016-05-14 | RCB | GL | GL | field | RCB | 144 | 0 | AB de Villiers | 
| 676 | 7934 | 2018 | Kolkata | 09/05/18 | MI | KKR | KKR | field | MI | 102 | 0 | Ishan Kishan | 
| 706 | 11147 | 2019 | Hyderabad | 31/03/19 | SRH | RCB | RCB | field | SRH | 118 | 0 | J Bairstow | 
3.5.1 Team wise total win by runs¶
winners = list(win_by_runs_matches['winner'])
runs = list(win_by_runs_matches['win_by_runs'])
wd = {}
for i in range(len(winners)):
    if winners[i] in wd.keys():
        old_runs = wd[winners[i]]
        wd[winners[i]] = old_runs+runs[i]
    else:
        wd[winners[i]] = runs[i]
best_batting_teams = sorted(wd, key=wd.get, reverse=True)
for i in best_batting_teams:
    print(i, ' : ', wd[i])
MI : 1866
CSK : 1778
RCB : 1252
SRH : 1134
KKR : 1086
KXIP : 925
RR : 895
DC : 767
RPS : 176
PW : 139
KTK : 23
GL : 1
- So the top teams have accumulated significantly more runs of winning margin in total.
- The net run rate of each team would be a more meaningful measure of scoring performance than win_by_runs, but win_by_runs is still a useful feature for identifying the best teams.
- We can't infer much from win_by_wickets. It only hints at the strength of a team's top-order batting when they win with many wickets in hand, and that varies a lot as players change teams or teams change their batting order.
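The running-total loop above is equivalent to a groupby sum. Because a total grows with the number of matches a team has won batting first, the sketch below also reports the win count and the mean margin per win. It runs on a toy frame; the rows and the output column names are made up for illustration:

```python
import pandas as pd

# Toy stand-in for matches_df; win_by_runs is 0 when the chasing team won.
toy_matches = pd.DataFrame({
    "winner": ["MI", "CSK", "MI", "RCB", "CSK"],
    "win_by_runs": [146, 0, 102, 130, 24],
})

# Keep only the matches won by the team batting first.
batting_first_wins = toy_matches[toy_matches["win_by_runs"] > 0]

margin_by_team = (
    batting_first_wins.groupby("winner")["win_by_runs"]
    .agg(total_runs="sum", wins="count", mean_margin="mean")
    .sort_values("total_runs", ascending=False)
)
print(margin_by_team)
```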
3.6 death matches (last matches) important¶
- Death matches are the matches played in the second half of an IPL season.
- After the first half, a few teams will have all but qualified for the playoffs (or semi-finals), a few will have no chance of reaching them, and the remaining teams will be fighting for a playoff spot.
- So the way a team plays differs a bit depending on its status.
- So I want to add one more feature called 'is_death_match'.
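One way such a flag could be derived is sketched below, under the assumption that "second half" means the later half of each season's matches in date order. This is an illustration on a toy frame, not the code that produced the final feature files. The real date column mixes formats (for example 2017-04-05 and 30/04/19), so it would need to be parsed consistently first:

```python
import pandas as pd

# Toy stand-in for matches_df with already consistent ISO dates.
toy_matches = pd.DataFrame({
    "season": [2018, 2018, 2018, 2018, 2019, 2019],
    "date": ["2018-04-07", "2018-04-08", "2018-05-01", "2018-05-20",
             "2019-03-23", "2019-05-12"],
})
toy_matches["date"] = pd.to_datetime(toy_matches["date"])

# Position of each match within its season (1 = first match of the season).
sorted_matches = toy_matches.sort_values(["season", "date"])
position = sorted_matches.groupby("season").cumcount() + 1
season_size = sorted_matches.groupby("season")["date"].transform("size")

# Flag matches in the second half of their season; assignment aligns on the index.
toy_matches["is_death_match"] = (position > season_size / 2).astype(int)
print(toy_matches)
```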
4. Preparing Training Data¶
- I have selected team1, team2, toss_winner, toss_decision, winner, home_team, is_death_match, team_1_score and team_2_score as the final features.
- home_team, is_death_match, team_1_score and team_2_score are derived from the given features.
train_matches_df = pd.read_csv('final_features/train_matches_data.csv')
test_matches_df = pd.read_csv('final_features/test_matches_data.csv')
train_matches_df = train_matches_df.replace({'team1': {10: 0},'team2': {10: 0},'home_team': {10: 0},'toss_winner': {10: 0},'winner': {10: 0}})
test_matches_df = test_matches_df.replace({'team1': {10: 0},'team2': {10: 0},'home_team': {10: 0},'toss_winner': {10: 0},'winner': {10: 0}})
test_matches_df.head()
| | Unnamed: 0 | team1 | team2 | toss_winner | toss_decision | winner | home_team | is_death_match | team_1_score | team_2_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5 | 7 | 7 | 0 | 5 | 5 | 0 | 33 | 32 | 
| 1 | 1 | 4 | 9 | 9 | 0 | 9 | 9 | 0 | 36 | 32 | 
| 2 | 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 38 | 36 | 
| 3 | 4 | 7 | 8 | 7 | 1 | 7 | 7 | 0 | 35 | 20 | 
| 4 | 3 | 9 | 6 | 6 | 0 | 6 | 6 | 0 | 34 | 21 | 
4.1 Preparing Train & Test data¶
y = train_matches_df[['winner']]
train_matches_df.drop(['winner'], axis=1, inplace=True)
X = train_matches_df
y_test = test_matches_df[['winner']]
test_matches_df.drop(['winner'], axis=1, inplace=True)
X_test_team1 = np.array(test_matches_df['team1'])
X_test_team2 = np.array(test_matches_df['team2'])
X_test_toss_winner = np.array(test_matches_df['toss_winner'])
X_test = test_matches_df
- Here I define a few variables, X_test_team1, X_test_team2 and X_test_toss_winner, for use after the winner prediction. Since the winner is a multi-class prediction, the model may sometimes predict a team that is neither of the 2 teams playing the match. In that case we replace the predicted team with the team that won the toss in that match.
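The replacement rule described above can be sketched as a small vectorised helper. The function name `fallback_to_toss_winner` is hypothetical, and the toy arrays are made up for illustration:

```python
import numpy as np

def fallback_to_toss_winner(y_pred, team1, team2, toss_winner):
    """Replace predictions that name neither playing team with the toss winner.

    Returns the corrected predictions and the number of replacements made.
    """
    y_pred = np.asarray(y_pred)
    team1 = np.asarray(team1)
    team2 = np.asarray(team2)
    toss_winner = np.asarray(toss_winner)
    # A prediction is invalid when it matches neither of the two teams in the match.
    invalid = (y_pred != team1) & (y_pred != team2)
    corrected = np.where(invalid, toss_winner, y_pred)
    return corrected, int(invalid.sum())

# Toy encoded teams: the third prediction (1) is not one of the playing teams (0 and 2).
corrected, n_replaced = fallback_to_toss_winner(
    y_pred=[7, 4, 1], team1=[5, 4, 0], team2=[7, 9, 2], toss_winner=[7, 9, 2]
)
print(corrected, n_replaced)
```

Returning the number of replacements makes it easy to see how often the model predicts a team that is not even playing.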
X.head()
| | Unnamed: 0 | team1 | team2 | toss_winner | toss_decision | home_team | is_death_match | team_1_score | team_2_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 1 | 7 | 7 | 0 | 7 | 0 | 14 | 28 | 
| 1 | 60 | 3 | 8 | 3 | 1 | 8 | 0 | 39 | 27 | 
| 2 | 59 | 2 | 6 | 2 | 1 | 6 | 0 | 50 | 18 | 
| 3 | 61 | 4 | 7 | 4 | 1 | 4 | 0 | 21 | 33 | 
| 4 | 62 | 5 | 1 | 5 | 1 | 1 | 0 | 26 | 16 | 
X_test.head()
| | Unnamed: 0 | team1 | team2 | toss_winner | toss_decision | home_team | is_death_match | team_1_score | team_2_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5 | 7 | 7 | 0 | 5 | 0 | 33 | 32 | 
| 1 | 1 | 4 | 9 | 9 | 0 | 9 | 0 | 36 | 32 | 
| 2 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 38 | 36 | 
| 3 | 4 | 7 | 8 | 7 | 1 | 7 | 0 | 35 | 20 | 
| 4 | 3 | 9 | 6 | 6 | 0 | 6 | 0 | 34 | 21 | 
5 Machine Learning Modelling :¶
- Let us try different models and compare their accuracy.
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
#parameters = {'max_depth': [1, 5, 10, 50]}
parameters = {'max_depth': [1, 5, 10, 50], 'min_samples_split': [2, 5, 10, 100, 500]}
clf = GridSearchCV(dt, parameters, cv=3, scoring='f1_weighted') # 'accuracy'
clf.fit(X, y)
best_depth_set1 = clf.best_estimator_.max_depth
best_samples_split_set1 = clf.best_estimator_.min_samples_split
f1_score = clf.score(X_test, y_test)
print("max depth : {} && Samples split : {} && f1_score : {}".format(best_depth_set1, best_samples_split_set1, f1_score))
max depth : 50 && Samples split : 10 && f1_score : 0.422548514991622
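y_pred = clf.predict(X_test)
print(y_pred)
y_test_list = np.array(y_test['winner'])
# prec counts the predictions that name a team not playing in the match
prec = 0
for i in range(len(y_pred)):
    if y_pred[i] != X_test_team1[i] and y_pred[i] != X_test_team2[i]:
        prec += 1
        # fall back to the toss winner of that match
        y_pred[i] = X_test_toss_winner[i]
# mc counts the corrected predictions that match the actual winner
mc = 0
for i in range(len(y_pred)):
    if y_pred[i] == y_test_list[i]:
        mc += 1
# number of correct predictions and accuracy in percent
mc, (mc/len(y_pred))*100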
[7 4 1 8 6 5 2 6 8 2 1 4 0 1 6 8 2 6 6 0 8 6 0 4 8 0 7 4 1 7 1 6 8 6 1 4 8 8 8 8 6 7 4 0 1 6 6 8 6 8 5 1 9 8 4 6 2 4 2 7 5 3 6 3 4 6 4 1 3 6 8 4 3 5 2 1 9 2 5 8 5 2 7 5 2 5 5 2 4 5 1 6 2 5 3 5 5 3 1 5 2 1 9 2 5 6 1 4 5 2 3 5 8 2 2 5 5 2 2 4 5 5 2 1 4 3 4 7 2 6 5 2 5 5 2 4 7 3 7 2 6 2 8 5 6 2 5 4 5 2 6 5 8 5 1 2 3 2 7 5 2 5 5 9 7 2 5 3 7 2 5 5 8 2 2]
(102, 58.285714285714285)
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# confusion matrix normalised over all test samples
cm = confusion_matrix(y_test, y_pred, normalize='all')
cmd = ConfusionMatrixDisplay(cm, display_labels=['GL','KKR','CSK','RR','MI','SRH','KXIP','RCB','DC','RPS'])
fig, ax = plt.subplots(figsize=(10,10))
cmd.plot(ax = ax)
cmd.ax_.set(xlabel='Predicted Values', ylabel='Actual Values')
[Text(0.5, 0, 'Predicted Values'), Text(0, 0.5, 'Actual Values')]
# pip install xgboost
# import warnings
# warnings.filterwarnings("ignore")
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier
gbdt = XGBClassifier()
parameters = {'max_depth': [i for i in range(2, 9)], 'n_estimators': [50, 100, 150]}
clf = RandomizedSearchCV(gbdt, parameters, cv=3, scoring='f1_weighted', n_iter = 5)
clf.fit(X, y)
best_depth_set1 = clf.best_estimator_.max_depth
best_n_estimators_set1 = clf.best_estimator_.n_estimators
# auc_set1 = clf.score(X_te, y_test)
print("max depth : {} && n-estimators : {}".format(best_depth_set1, best_n_estimators_set1))
max depth : 6 && n-estimators : 150
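y_pred = clf.predict(X_test)
print(y_pred)
y_test_list = np.array(y_test['winner'])
# prec counts the predictions that name a team not playing in the match
prec = 0
for i in range(len(y_pred)):
    if y_pred[i] != X_test_team1[i] and y_pred[i] != X_test_team2[i]:
        prec += 1
        # fall back to the toss winner of that match
        y_pred[i] = X_test_toss_winner[i]
# mc counts the corrected predictions that match the actual winner
mc = 0
for i in range(len(y_pred)):
    if y_pred[i] == y_test_list[i]:
        mc += 1
# number of correct predictions and accuracy in percent
mc, (mc/len(y_pred))*100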
[7 4 1 8 6 1 4 6 9 5 1 4 0 5 6 8 4 1 6 0 8 6 0 8 9 0 1 4 1 0 1 6 7 6 1 7 9 8 1 8 6 5 8 0 1 5 1 8 6 9 5 1 9 8 4 1 4 4 2 6 1 3 2 3 5 7 8 1 7 6 8 7 3 5 3 1 7 5 3 8 5 2 5 8 4 3 7 8 4 3 1 4 7 5 4 3 5 3 1 8 3 1 7 5 4 6 1 4 7 2 3 5 4 2 5 3 1 5 2 4 1 3 2 1 7 3 6 7 2 6 3 2 8 7 2 5 8 3 5 2 4 3 8 3 7 2 5 7 6 5 8 1 8 3 5 7 3 5 7 3 2 3 4 7 5 8 6 3 5 6 4 4 8 2 2]
(92, 52.57142857142857)
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# confusion matrix normalised over all test samples
cm = confusion_matrix(y_test, y_pred, normalize='all')
cmd = ConfusionMatrixDisplay(cm, display_labels=['GL','KKR','CSK','RR','MI','SRH','KXIP','RCB','DC','RPS'])
fig, ax = plt.subplots(figsize=(10,10))
cmd.plot(ax = ax)
cmd.ax_.set(xlabel='Predicted Values', ylabel='Actual Values')
[Text(0.5, 0, 'Predicted Values'), Text(0, 0.5, 'Actual Values')]
6 Analysis :¶
- From the EDA, we can say:
- Chennai and Kolkata have made good use of the home_team advantage.
- MI, CSK, KKR and RR have good win percentages.
Positive Results :
- CSK is predicted more accurately, as their captain is Dhoni, who is capable of keeping the team consistent whoever the players are. So the win % is good, and so is the prediction, and the team has a good home_team advantage.
- MI has a high win %, but lags a bit in home_team advantage and toss_decision compared to CSK and KKR.
- KKR is good at toss_decision and home_team advantage, and has a good win percentage too, after CSK and MI.
Negative Results :
- RR has a good win percentage and toss_decision value, but our model fails to predict its results well. RR and CSK were out of the IPL for 2 seasons, and the RR team changed much more than CSK did, so its predictions might be wrong.
- For a few teams, the model makes hardly any predictions at all.
7 Deployment¶
- I have prepared code that is ready for deployment and tested it locally.
- I have exported the model as a .pkl file to reuse it on a remote server.
- I have pushed the code to GitHub at https://github.com/siva097/predict-ipl-match-winner and deployed it to Heroku.
Check that the app is working : http://predict-ipl-match-winner.herokuapp.com/¶
8 Improvements :¶
- More data would be useful. I have data only until IPL 2019; IPL 2020 and IPL 2021 might help in better prediction.
- We can experiment with deriving new features from the deliveries.csv dataset, which contains player-level data. The players' contribution is very important for a team to win a match.
- It is not easy to use players' data in model training, because we have to calculate each player's performance in every match and each player's value to the team for every match, so that the model can understand the player's importance in winning or losing that particular match.
- For example, as IPL is a Twenty20 format, players are expected to score at least a run a ball. A slow scorer is less valuable to a team, and we can make the model understand this by calculating each player's strike rate for each match before training.
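The strike-rate idea can be sketched as follows. Strike rate is runs scored per 100 balls faced, and wides are not counted as balls faced. The column names match_id, batsman_runs and wide_runs are assumptions about the layout of deliveries.csv, and the toy rows are made up for illustration:

```python
import pandas as pd

# Toy stand-in for balls_df; the column names are assumed, not confirmed above.
toy_balls = pd.DataFrame({
    "match_id":     [1, 1, 1, 1, 1, 2],
    "batsman":      ["A", "A", "A", "B", "B", "A"],
    "batsman_runs": [4, 0, 1, 6, 0, 2],
    "wide_runs":    [0, 1, 0, 0, 0, 0],
})

# Balls faced exclude wides; runs are summed over all deliveries.
legal_balls = toy_balls[toy_balls["wide_runs"] == 0]
balls_faced = legal_balls.groupby(["match_id", "batsman"]).size()
runs = toy_balls.groupby(["match_id", "batsman"])["batsman_runs"].sum()

# Strike rate per player per match: runs per 100 balls faced.
strike_rate = (100 * runs / balls_faced).rename("strike_rate")
print(strike_rate)
```

To avoid leaking the result of the match being predicted, a feature built this way would have to use only matches played before it.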
Conclusion¶
- An IPL cricket match is highly uncertain: every match and every ball is hard to predict, and it becomes even harder in the death overs, where a few bowlers like Rashid Khan and Sunil Narine become powerful hitters.
- Anyhow, more than win prediction, match analysis and understanding the value of a player to a team will really help the team selectors choose a better team.
 