In this post I will try to figure out who is the most consistent player during the 2016-17 season using python and the NBA api. My goal is to find how consistent a player is from one game to another and therefore I need to be able to measure the performnce on a game by game basis. In order to do that I'm using a player's gamescore since it allows to calculate productivity per game from the boxscore.

In [1]:
%matplotlib inline
import NBAapi as nba
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import progressbar
from bokeh.plotting import figure, show
from bokeh.charts import ColumnDataSource
from bokeh.io import output_notebook
from bokeh.models import HoverTool
import time

output_notebook()


Let's write a function that calculates the gamescore:

In [2]:
def gamescore(df):
'''
gamescore accepts a DataFrame with boxscore infromation and computes the gamescore for each game
'''
Points = df['PTS'].values
FGM = df['FGM'].values
FGA = df['FGA'].values
FTM = df['FTM'].values
FTA = df['FTA'].values
OREB = df['OREB'].values
DREB = df['DREB'].values
STL = df['STL'].values
AST = df['AST'].values
BLK = df['BLK'].values
TO = df['TOV'].values
PF = df['PF'].values
return 1.0*Points + 0.4*FGM - 0.7*FGA - 0.4*(FTA-FTM) + 0.7*OREB + 0.3*DREB + STL + 0.7*AST + 0.7*BLK -0.4*PF - TO


First, we need to load the boxscore information for each player and then calculate the gamescore for each of those games. For example I can load the game log data for LeBron James and Anthony Davis from the NBA api and compare the distribution of gamescores for each player using a histogram:

In [3]:
player_list = nba.player.commonallplayers(currentseason=1,season='2016-17') # load a current list of NBA players
james_id = player_list[player_list['DISPLAY_LAST_COMMA_FIRST']=='James, LeBron'].PERSON_ID # get players id
james_gamelog = nba.player.gamelog(james_id,Season='2016-17') # load all game logs
james_GS = gamescore(james_gamelog) # calculate game score for each game

# do the same for Anthony Davis
davis_id = player_list[player_list['DISPLAY_LAST_COMMA_FIRST']=='Davis, Anthony'].PERSON_ID
davis_gamelog = nba.player.gamelog(davis_id,Season='2016-17')
davis_GS = gamescore(davis_gamelog)

# now we are ready to plot
plt.hist(james_GS,bins=np.arange(0,50,5),alpha=0.5,normed=True); # plot the distribution using a histogram
plt.hist(davis_GS,bins=np.arange(0,50,5),alpha=0.5,normed=True);
plt.legend(['James','Davis'])
plt.ylabel('Density',fontsize=16)
plt.xlabel('Gamescore',fontsize=16)

Out[3]:

We can see that the average gamescore is about the same for both players but that James has a narrower distribution meaning that he is more consistent.

Now let's load the game logs for the 2016-17 season for each player and calculate the mean gamescore and the standard deviation:

In [4]:
player_gs = []
for i,player_id in enumerate(player_list.PERSON_ID):
player_gamelog = nba.player.gamelog(player_id,Season='2016-17')
GS = gamescore(player_gamelog)
if len(player_gamelog)!=0:
player_gs.append([np.mean(GS),np.std(GS),len(player_gamelog)])  # create list with mean,std and game played
else:
player_gs.append([0,0,0]) # if player did not play insert 0s
bar.update(i)
time.sleep(0.5) # sometimes NBA api seems to get stuck with freequent calls
df = pd.DataFrame(player_gs,index=player_list.DISPLAY_LAST_COMMA_FIRST,columns=['GAMESCORE_MEAN','GAMESCORE_STD','GP'])

 99% (490 of 491) |############################################################################ | Elapsed Time: 0:07:34 ETA: 0:00:00
Out[4]:
GAMESCORE_MEAN GAMESCORE_STD GP
DISPLAY_LAST_COMMA_FIRST
Abrines, Alex 3.59 3.93 68
Acy, Quincy 4.13 4.49 38
Afflalo, Arron 5.38 4.67 61
Ajinca, Alexis 4.28 4.44 39

We can now plot the standard deviation (std) and mean gamescore for each player in the league:

In [5]:
mask = df.GP>=10 # Only include players that played over 10 games
fig, ax = plt.subplots(subplot_kw=dict(axisbg='#EEEEEE'),figsize=(12,8))
ax.grid(color='white', linestyle='solid')
# write players name on top of scatter plot
ax.text(x,y,name.split(',')[0],horizontalalignment='center',verticalalignment='center')
plt.title('Who is the most consistent player?',fontsize=20)
plt.xlabel('Gamescore Mean',fontsize=16)
plt.ylabel('Gamescore STD',fontsize=16)
plt.xlim([-2,27])
plt.ylim([1,10])

# plot powerlaw fit on top of data
x= np.linspace(0,26,100)
a = 10**p[1]
b = p[0]
plt.plot(x,a*x**b,'--b')

Out[5]:
[]

We can use the Bokeh to get an interactive plot. If you hover over the data point with the mouse the name of the player and the gamescore mean and std appears on the figure:

In [6]:
source = ColumnDataSource(df.loc[mask,:].reset_index())

# create hover tool
hover = HoverTool(
tooltips=[
("Player", "@DISPLAY_LAST_COMMA_FIRST"),
("mean", "@GAMESCORE_MEAN"),
("std", "@GAMESCORE_STD"),
]
)

p = figure(plot_width=800, plot_height=500, tools=[hover],title='Who is the most consistent player?')

p.circle('GAMESCORE_MEAN', 'GAMESCORE_STD', size=7, fill_alpha=0.5,source=source) # scatter plot
p.line(x=x, y=a*x**b,color='black') # powerlaw fit
p.xaxis.axis_label = "Gamescore Mean"
p.yaxis.axis_label = "Gamescore STD"

show(p)

Out[6]:

In[6]>

• I'm having some issues rendering the Bokeh plots with pelican so for now you can see the plot here

We see a strong correlations between the mean gamescore and the std. This means that a player that contributes more tends to have larger swings around the average. This makes sense! When you don't produce much the fluctuations are smaller. But it also means that it is harder to compare the consistency of players with different productivity.

We can try to compare players with similar gamescores to see who is more consistent. If we look at the players with gamescore above 20, James is the most consistent. For players with a gamescore above 10, Gobert is the most consistent.

We can try and measure how volatile a player is. We can use the coefficient of variation for that which measures the disparity relative to the output (i.e. std/mean). This shows how much a players gamescore fluctuates relative to his mean output:

In [7]:
df['GAMESCORE_CV'] = df.GAMESCORE_STD/np.abs(df.GAMESCORE_MEAN) # create new column with coefficient of variation

fig, ax = plt.subplots(subplot_kw=dict(axisbg='#EEEEEE'),figsize=(12,8))
ax.grid(color='white', linestyle='solid')
# write player names of figure
plt.text(x,y,name.split(',')[0],horizontalalignment='center',verticalalignment='center')
plt.ylim([0.2,1])
plt.xlim([0,27])
plt.xlabel('Gamescore Mean',fontsize=16)
plt.ylabel('Gamescore Coefficient of Variation',fontsize=16)

Out[7]:

We can use an interactive plot again:

In [8]:
source = ColumnDataSource(df.loc[mask,:].reset_index())
hover = HoverTool(
tooltips=[
("Player", "@DISPLAY_LAST_COMMA_FIRST"),
("mean", "@GAMESCORE_MEAN"),
("CV", "@GAMESCORE_CV"),
]
)
p = figure(plot_width=800, plot_height=500, tools=[hover],title='Who is the most consistent?')

p.circle('GAMESCORE_MEAN', 'GAMESCORE_CV', size=7, fill_alpha=0.5,source=source)
p.xaxis.axis_label = "Gamescore Mean"
p.yaxis.axis_label = "Gamescore CV"

show(p)

Out[8]:

In[8]>

In this case it is clear the LeBron and the most consistent player with Durant and Thomas close beind.

We can also see it in a table format:

In [9]:
np.round(df.loc[mask,:].sort_values(by='GAMESCORE_CV').reset_index().head(15),2)

Out[9]:
DISPLAY_LAST_COMMA_FIRST GAMESCORE_MEAN GAMESCORE_STD GP GAMESCORE_CV
0 James, LeBron 22.80 6.33 74 0.28
1 Durant, Kevin 22.06 6.67 62 0.30
2 Thomas, Isaiah 21.20 6.61 76 0.31
3 Leonard, Kawhi 20.41 7.00 74 0.34
4 Wall, John 19.61 6.78 78 0.35
5 Harden, James 24.28 8.72 81 0.36
6 Westbrook, Russell 24.92 9.11 81 0.37
7 Towns, Karl-Anthony 20.87 7.72 82 0.37
8 Curry, Stephen 20.04 7.64 79 0.38
9 Paul, Chris 18.25 7.01 61 0.38
10 Gobert, Rudy 15.51 6.01 81 0.39
11 Lowry, Kyle 18.38 7.31 60 0.40
12 DeRozan, DeMar 18.74 7.48 74 0.40
13 Gallinari, Danilo 13.93 5.65 63 0.41
14 Antetokounmpo, Giannis 20.40 8.35 80 0.41

We can also check a player's stability over time. Let's take LeBron James as an example:

In [10]:
years = np.arange(2003,2017)
seasons = []
james_gamelog = []
for year in years:
string1 = str(year)
string2 = str(year+1)
season = '{}-{}'.format(string1,string2[-2:])
temp = nba.player.gamelog(james_id,Season=season)
temp['SEASON'] = season
james_gamelog.append(temp)
time.sleep(0.5)
df_james = pd.concat(james_gamelog,ignore_index=True)
df_james['GS'] = gamescore(df_james)
james_evolution = df_james.loc[:,['SEASON','GS']].groupby('SEASON').agg([np.size, np.mean, np.std])
james_evolution.columns = ['GP','GS_MEAN','GS_STD']
np.round(james_evolution,2)

Out[10]:
GP GS_MEAN GS_STD
SEASON
2003-04 79.0 14.51 8.14
2004-05 80.0 22.10 7.64
2005-06 79.0 23.61 8.66
2006-07 78.0 20.34 6.68
2007-08 75.0 24.15 8.32
2008-09 81.0 24.27 7.67
2009-10 76.0 25.58 7.28
2010-11 79.0 21.74 7.30
2011-12 62.0 22.93 7.05
2012-13 76.0 24.41 6.12
2013-14 77.0 22.64 7.57
2014-15 69.0 19.66 7.24
2015-16 76.0 20.74 5.83
2016-17 74.0 22.80 6.38

Let's plot it:

In [11]:
x = np.arange(len(james_evolution))
p = np.polyfit(x,james_evolution.GS_STD,1)
cm = matplotlib.cm.get_cmap('jet')
color = james_evolution.GS_MEAN
plt.figure(figsize=(10,6))
plt.scatter(x,james_evolution.GS_STD,s = james_evolution.GS_MEAN**2,c=color,vmin=0,vmax=30,cmap=cm);
plt.xticks(x, james_evolution.index, rotation=30,fontsize=16)
plt.colorbar()
plt.ylabel('Gamescore STD',fontsize=16)
plt.text(1.0,5.25,'Color and size reflect gamescore mean',verticalalignment='center',fontsize=12)
plt.title('LeBron James - consistency over time',fontsize=16)
plt.ylim([5.0,9.0])
plt.plot(x,p[0]*x+p[1],'k--')

Out[11]:
[]

Clearly LeBron is getting more consistent with time. This seems to agree with the intuition the experienced players tend to be more consistent but we only have an n of 1 so we are not going to jump to conclusions. We have seem that there is a correlation between a players gamescore mean and std but in LeBron's case his gamescore average is very consistent since his second year in the league.

Let's try another player. This time I'm going to choose Tim Duncan:

In [12]:
player_list = nba.player.commonallplayers(currentseason=1,season='2015-16') # load a current list of NBA players
duncan_id = player_list[player_list['DISPLAY_LAST_COMMA_FIRST']=='Duncan, Tim'].PERSON_ID # get players id
years = np.arange(1997,2016)
seasons = []
duncan_gamelog = []
for year in years:
string1 = str(year)
string2 = str(year+1)
season = '{}-{}'.format(string1,string2[-2:])
temp = nba.player.gamelog(duncan_id,Season=season)
temp['SEASON'] = season
duncan_gamelog.append(temp)
time.sleep(0.5)
df_duncan = pd.concat(duncan_gamelog,ignore_index=True)
df_duncan['GS'] = gamescore(df_duncan)
duncan_evolution = df_duncan.loc[:,['SEASON','GS']].groupby('SEASON').agg([np.size, np.mean, np.std])
duncan_evolution.columns = ['GP','GS_MEAN','GS_STD']

In [13]:
x = np.arange(len(duncan_evolution))
p = np.polyfit(x,duncan_evolution.GS_STD,1)
color = duncan_evolution.GS_MEAN
plt.figure(figsize=(10,6))
plt.scatter(x,duncan_evolution.GS_STD,s = duncan_evolution.GS_MEAN**2,c=color,vmin=0,vmax=30,cmap=cm);
plt.xticks(x, duncan_evolution.index, rotation=45,fontsize=14)
plt.colorbar()
plt.ylabel('Gamescore STD',fontsize=16)
plt.xlim([-1.0,len(duncan_evolution)+0.5])
plt.text(1.0,5.25,'Color and size reflect gamescore mean',verticalalignment='center',fontsize=12)
plt.title('Tim Duncan - consistency over time',fontsize=16)
plt.ylim([5.0,9.0])
plt.plot(x,p[0]*x+p[1],'k--')

Out[13]:
[]

We can also see a trend in Duncan's case but it also seems like his gamescore has been declining over the years which we know is correlated with lower std. Let's check the correlation between gamescore mean and std for Duncan's career. The color represents the number of years h has been in the league:

In [14]:
plt.figure(figsize=(10,6))
color = np.arange(1,len(duncan_evolution)+1)
plt.scatter(duncan_evolution.GS_MEAN,duncan_evolution.GS_STD,s=200,c=color,vmin=1,vmax=19,cmap=matplotlib.cm.get_cmap('Blues'))
plt.colorbar()
plt.xlabel('Gamescore Mean',fontsize=16)
plt.ylabel('Gamescore STD',fontsize=16)

Out[14]:

And the same figure in Bokeh:

In [15]:
source = ColumnDataSource(duncan_evolution.reset_index())
hover = HoverTool(
tooltips=[
("Season", "@SEASON"),
("mean", "@GS_MEAN"),
("std", "@GS_STD"),
]
)

c = [
"#%02x%02x%02x" % (int(r), int(g), int(b)) for r, g, b, _ in 255*matplotlib.cm.Blues(matplotlib.colors.Normalize()(np.arange(len(duncan_evolution))))
]
p = figure(plot_width=600, plot_height=420, tools=[hover],title='Duncan')

p.circle('GS_MEAN', 'GS_STD', size=20, fill_color=c,source=source)
p.xaxis.axis_label = "Gamescore MEAN"
p.yaxis.axis_label = "Gamescore STD"
p.background_fill_color = "navy"
p.background_fill_alpha = 0.05

show(p)

Out[15]:

In[15]>