1 Overview

Note that all the R code used in this book is accessible on GitHub.

The addition of the three-point line has drastically changed the game of basketball. Kirk Goldsberry is one of the pioneers in visualizing the impact of this rule change. As displayed in his book SprawlBall or in this article, the vast majority of shots in the NBA now come from behind the three-point line and at the rim. This was not always the case: the popularity of the mid-range shot has dropped dramatically in the last decade1. Folks caught on to the idea that the extra point offered by a shot from above the break outweighs the slight decrease in field goal percentage compared with shooting from the mid-range.

1.1 Setting our Expectations

Consider a player who makes 35% of their shots from three and another player who makes 45% of their shots from the mid-range. We can simulate 1000 shots for each player using the sample() function. A summary of the 2000 shots is displayed in Table 1.1.

# Load libraries
library(tidyverse) # wrangling

# Sample Size
n_shots <- 1000

# Setting the seed to ensure reproducibility
set.seed(2021)

# Simulating n shots for both shooters
shots <- tibble(
  three_shooter = sample(
    x = c("Make", "Miss"), size = n_shots,
    prob = c(0.35, 1 - 0.35), replace = TRUE
  ),
  two_shooter = sample(
    x = c("Make", "Miss"), size = n_shots,
    prob = c(0.45, 1 - 0.45), replace = TRUE
  )
)

Table 1.1: Results of 2000 randomly generated shots

                       Make   Miss
Three-Point Shooter     343    657
Two-Point Shooter       453    547

Unsurprisingly, the three-point shooter made 343 out of 1000 shots, which results in a shooting percentage of 34.3%. The mid-range shooter's shooting percentage was 45.3% in our 1000 simulated shots.
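The code above generates the shots but does not show how they are tallied. One possible way to produce a summary shaped like Table 1.1 is sketched below. It is an assumption, not necessarily how the table in this book was built: it uses base R only (a plain data frame instead of a tibble), and the helper count_outcomes() is a hypothetical name introduced for illustration.

```r
# Re-create the simulated shots (same seed and probabilities as above)
set.seed(2021)
n_shots <- 1000

shots <- data.frame(
  three_shooter = sample(
    x = c("Make", "Miss"), size = n_shots,
    prob = c(0.35, 1 - 0.35), replace = TRUE
  ),
  two_shooter = sample(
    x = c("Make", "Miss"), size = n_shots,
    prob = c(0.45, 1 - 0.45), replace = TRUE
  )
)

# Hypothetical helper: count makes and misses in one column,
# fixing the levels so both outcomes always appear
count_outcomes <- function(x) {
  table(factor(x, levels = c("Make", "Miss")))
}

# One row per shooter, one column per outcome
shot_summary <- t(sapply(shots, count_outcomes))
shot_summary
```

Each row of shot_summary sums to n_shots, so dividing the Make column by n_shots gives each player's simulated shooting percentage.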

On the surface, it seems like the mid-range shooter outperformed the three-point shooter2. However, when we consider that a made shot for the three-point shooter is worth three points instead of two, we see that they scored 1029 points3 compared to 906 points4 for the mid-range shooter. We can also divide the number of points each player scored by the number of shots they took to get their average number of points per shot. The player shooting from three averaged \((343 \times 3)/1000 \approx 1.03\) points per shot compared to \((453 \times 2)/1000 \approx 0.91\) points per shot for the mid-range shooter.
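The arithmetic in this paragraph can be checked with a few lines of base R. The make counts come from Table 1.1; the variable names are illustrative only.

```r
# Makes out of 1000 attempts, taken from Table 1.1
attempts <- 1000
makes <- c(three = 343, two = 453)

# Points awarded for a made shot of each type
point_value <- c(three = 3, two = 2)

# Total points scored and average points per shot
total_points <- makes * point_value
points_per_shot <- total_points / attempts

total_points      # three: 1029, two: 906
points_per_shot   # three: 1.029, two: 0.906
```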

We can see from this simplistic example that the two-point shooting percentage needs to be much greater than the three-point percentage to score more points per attempt on average. In fact, the shooting percentage of a two-point shooter needs to be 1.5 times the shooting percentage of a player shooting beyond the arc for the expected points per shot to be equal. For example, shooting 45% from two would result in \(0.45 \times 2 = 0.9\) points per shot, which is the same as shooting 30% from three5.
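The 1.5 factor follows from setting the expected points per shot equal, \(2 \times p_2 = 3 \times p_3\), which gives \(p_2 = 1.5 \times p_3\). A minimal sketch of that break-even calculation is shown below; break_even_two_pct() is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: two-point percentage needed to match the
# expected points per shot of a given three-point percentage
break_even_two_pct <- function(three_pct) {
  three_pct * 3 / 2
}

break_even_two_pct(0.30)  # 0.45, the example from the text
break_even_two_pct(0.35)  # 0.525
```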

We know from studying millions of NBA shots that the shooting percentage is highest around the rim (roughly 60%). That percentage drops sharply to approximately 40% for the remainder of the two-point area and drops slightly to 35% beyond the arc. Shots at the rim exceed the 1.5 times threshold established earlier, since 60% is more than \(1.5 \times 35\% = 52.5\%\), but shooting from anywhere else in the two-point area does not meet this criterion.
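As a rough illustration, the approximate percentages quoted above can be turned into expected points per shot for each area of the floor. The zone labels and column names below are assumptions made for this sketch, and the percentages are the rounded figures from the text rather than measured data.

```r
# Approximate shooting percentages quoted in the text
zones <- data.frame(
  zone = c("At the rim", "Mid-range", "Three-point"),
  fg_pct = c(0.60, 0.40, 0.35),
  value = c(2, 2, 3)
)

# Expected points per shot in each zone
zones$exp_points <- zones$fg_pct * zones$value

# Does the zone beat a three-point attempt on expected points?
three_exp <- zones$exp_points[zones$zone == "Three-point"]
zones$beats_three <- zones$exp_points > three_exp

zones
```

With these inputs, only shots at the rim (about 1.2 expected points) beat the three-point attempt (about 1.05), while the rest of the two-point area (about 0.8) falls short.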

The insight that three is greater than two may seem obvious in hindsight or from an outside perspective, but it wasn't until the location of each shot was recorded and analyzed that the light bulb went off. A team could look at their shooting percentage from the two-point area (say roughly 50%) and their three-point shooting percentage (say 30%) and conclude that they should shoot more twos since \(0.5 \times 2 = 1\) expected point per shot is greater than \(0.3 \times 3 = 0.9\). This approach fails to consider that not all two-point attempts are created equal. The same can be said about three-point shots, but the contrast is sharper for two-point shots.
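To see how an overall two-point percentage can hide this difference, consider the sketch below. The even split of attempts between the rim and the mid-range is a made-up assumption chosen for illustration; the zone percentages are the rough figures quoted earlier.

```r
# Hypothetical breakdown of a team's two-point attempts
two_pt <- data.frame(
  zone = c("At the rim", "Mid-range"),
  share = c(0.5, 0.5),    # assumption: half of the attempts in each zone
  fg_pct = c(0.60, 0.40)  # rough percentages quoted in the text
)

# Overall two-point percentage is the attempt-weighted average
overall_two_pct <- sum(two_pt$share * two_pt$fg_pct)
overall_two_pct           # 0.5

# Expected points per shot: overall versus by zone
overall_two_pct * 2       # 1
two_pt$exp_points <- two_pt$fg_pct * 2
two_pt
```

Under these assumptions the blended figure of 1 expected point per shot is an average of about 1.2 at the rim and about 0.8 from the mid-range, so only the rim attempts are actually worth more than a 30% three-point shot (0.9).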

1.2 Motivation

Despite the new popularity of threes, lower-level teams do not have an easy way to analyze their shooting performance by means other than the field goal percentage or effective field goal percentage typically found in the box score. Some coaches track shot locations using pen and paper to manually create shot charts for the team and each player. This approach has obvious limitations. The measurement error is likely to be significant. After all, the person tracking the shots is simply eyeballing the release location and approximating it again to place the dot on the page. This method also makes it practically impossible to look at long-term trends since the data are not stored in a database from which the charts could be reproduced. Drawing conclusions from a team's shot chart of a particular game is dangerous since the number of shot attempts is not big enough to reveal the "true" underlying patterns. This is even more true when trying to evaluate a player's performance by looking at their game shot chart.

Many Android or iOS applications allow teams to track box score statistics. However, not many applications allow teams to track shot locations. Even fewer allow the user to export the \((x, ~y)\) coordinates of each shot in a usable format such as a CSV file. Easy Stats is one of the rare applications that does. It automatically creates shot charts and allows the user to easily export the coordinates.

This book will walk you through how to create Goldsberry-like shot charts and analyze the spatial structure of basketball shots using R, the popular tidyverse package, and the sf package.

1.3 Why R?

Reproducibility is the main advantage of using R over standard spatial analysis software such as ArcGIS. R uses scripts, which means that analyses can be shared and reproduced easily. The programming language was built for and by statisticians who decided to keep it open source. As a result, many packages have emerged to extend its functionality to specific niches. The RStudio IDE6 makes R more approachable. This book was produced using R Markdown and bookdown, which are integrated within RStudio.

1.4 Why the tidyverse package?

"The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures." - Tidyverse Developers

Two key characteristics of the tidyverse are tidy data and piping.

There are three interrelated rules which make a dataset tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

The analysis in this book makes use of tidy data and piping. Furthermore, it attempts to follow this coding style guide as much as possible.
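As a small illustration of tidy data and piping, the sketch below reshapes a box-score style table into tidy form and then chains a few steps with the pipe. The players and counts are made up for this example; pivot_longer(), mutate(), group_by() and summarise() are functions from the tidyr and dplyr packages loaded with the tidyverse.

```r
library(tidyverse)

# Hypothetical wide table: the shot type is hidden in the column names
wide <- tibble(
  player = c("A", "B"),
  two_made = c(5, 3),
  three_made = c(2, 4)
)

# Tidy version: one row per player and shot type,
# one column per variable (player, shot_type, made)
tidy <- wide %>%
  pivot_longer(
    cols = c(two_made, three_made),
    names_to = "shot_type",
    values_to = "made"
  )

# Piping: each step feeds its result into the next one
points <- tidy %>%
  mutate(points = made * if_else(shot_type == "three_made", 3, 2)) %>%
  group_by(player) %>%
  summarise(points = sum(points))

points
```

Reading the pipeline from top to bottom mirrors the order of operations: reshape, compute points for each row, then total them by player.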

1.5 Why the sf package?

The simple features (sf) package was built to modernize the widely used sp package. Part of this modernization was to treat spatial objects as data frames so that they would be compatible with the tidyverse.
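The sketch below gives a first taste of what "spatial objects as data frames" means in practice. The coordinates are invented for illustration (in feet, with the hoop assumed to sit at the origin), and the 4-foot circle called rim_zone is an arbitrary choice for this example, not a definition used elsewhere in this book.

```r
library(sf)

# Hypothetical shot coordinates in feet, hoop assumed at (0, 0)
shots <- data.frame(
  x = c(0, 2, -10, 22, 0),
  y = c(1, 3, 12, 0, 25),
  made = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Convert to an sf object: still a data frame, now with a geometry column
shots_sf <- st_as_sf(shots, coords = c("x", "y"))
class(shots_sf)

# A point for the hoop and an illustrative 4-foot circle around it
hoop <- st_sfc(st_point(c(0, 0)))
rim_zone <- st_buffer(hoop, dist = 4)

# Spatial operations return results that fit into regular columns
shots_sf$near_rim <- as.vector(st_within(shots_sf, rim_zone, sparse = FALSE))
shots_sf$distance <- as.vector(st_distance(shots_sf, hoop))

shots_sf
```

Because no coordinate reference system is set, sf treats the coordinates as planar, so the distances are in the same units as the inputs.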

It is definitely possible to create advanced shot charts using base R or with the tidyverse. In fact, this book was heavily influenced by Todd W. Schneider's BallR Shiny app. The reason that we will use the sf package is that it makes it easier to work with spatial data.

We won't have to continually reinvent the wheel.

Spatial data often requires special treatment. Observations may not be independent of their neighbours. In fact, Tobler's First Law of Geography states that "everything is related to everything else, but near things are more related than distant things." Creating maps and polygons using non-spatial tools such as the tidyverse can be laborious and computationally inefficient. Using tools from the sf package makes it much more efficient to create and analyze spatial data. Moreover, an argument can be made that nearly all data are spatio-temporal in nature since they are collected somewhere at some time. At the time of this writing, spatial data have never been more abundant. Yet, the online tutorial community has not yet caught up to the tsunami of coordinate data.
