2 Exploring Basketball Shots Data

Note that all the R code used in this book is accessible on GitHub.

2.1 Tracking Basketball Shots

Basketball shot coordinates can be tracked manually using pen and paper. The $(x, ~y)$ coordinates of each dot could be estimated once a reference frame for the basketball court has been chosen. Let's consider a FIBA basketball court⁷ and focus on the half-court for now. We can set the origin of our two-dimensional coordinate system at the bottom left corner of the image below (Figure 2.1).

Figure 2.1: Choosing our coordinate reference system (crs)

Once we've picked a coordinate system, we can visually estimate the $(x, ~y)$ coordinates of each shot that was tracked on paper. Of course, this is not ideal since visual estimates are imprecise. I built this Desmos file in the early days to manually estimate the shot coordinates on an NBA court.

This is where the Easy Stats iOS application comes in. You can watch this video tutorial to see how to track and export more precise shot coordinates. In short, we can use the app to keep track of the shooter, the outcome of the shot (made or missed), and the location of the shot. This play-by-play data can be exported as a CSV file via email.

2.2 Getting To Know Our Data

A basic shot data set was put together for this analysis. Let's load the data set and see what we're working with.

# Load the tidyverse library to be able to use %>% and dplyr to wrangle
library(tidyverse)

# Load the artificial shot data
shots <- readRDS(file = "data/shots.rds")

Table 2.1: Here are the first few shots
player	shot_made_numeric	loc_x	loc_y
Player 7	1	4.386864	7.955280
Player 3	0	5.779800	8.510016
Player 7	1	3.003072	7.083552
Player 3	0	3.377976	7.202424
Player 13	0	3.161568	7.202424
Player 7	0	3.329208	7.141464

Table 2.1 tells us that Player 7 made their first shot at location $(4.39, 7.96)$. We can try to place this first shot on a basketball court. We know that $4.39 < 7.5$, where 7.5 meters is half of the court's width, which implies that the shot was taken on the right side of the court⁸.

Figure 2.2: Estimated locations of first 6 shots in the sample

# Display the general structure of the data
str(shots)

## 'data.frame':    1163 obs. of  4 variables:
##  $ player           : Factor w/ 18 levels "Player 1","Player 2",..: 7 3 7 3 13 7 7 7 11 18 ...
##  $ shot_made_numeric: num  1 0 1 0 0 0 1 1 1 0 ...
##  $ loc_x            : num  4.39 5.78 3 3.38 3.16 ...
##  $ loc_y            : num  7.96 8.51 7.08 7.2 7.2 ...

We see from the output above that 18 distinct players took shots in our data set. We are working with a categorical variable (player), a binary numeric variable (made shot? yes/no), and two continuous variables for the shot coordinates.

# Explore the distribution of each variable
summary(shots)

##        player    shot_made_numeric     loc_x             loc_y      
##  Player 13:176   Min.   :0.0000    Min.   : 0.3666   Min.   :1.073  
##  Player 14:166   1st Qu.:0.0000    1st Qu.: 6.3132   1st Qu.:1.756  
##  Player 7 :148   Median :0.0000    Median : 7.4653   Median :2.954  
##  Player 11:142   Mean   :0.4136    Mean   : 7.8274   Mean   :4.089  
##  Player 3 : 79   3rd Qu.:1.0000    3rd Qu.: 9.4755   3rd Qu.:6.370  
##  Player 10: 74   Max.   :1.0000    Max.   :14.7104   Max.   :9.802  
##  (Other)  :378

Note that 4 players took just over half of the shots in the sample (632 of 1163). This implies that the other 14 players took relatively few shots each. Taking the average of all 1163 ones and zeros, we get an overall shooting percentage of 41.36%. The $x$ component of the location stays between 0 and 15. This makes sense since the data set was created using a FIBA-sized basketball court, which is 15 meters wide and 28 meters long. It also makes sense that the farthest recorded shot had a $y$ component of 9.8 meters, given that the top of the three-point arc is 8.325 meters from the baseline.
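As a quick sanity check on the numbers quoted above, the sketch below recomputes the share of shots taken by the four most frequent shooters. The counts and the total of 1163 are copied from the `summary()` output; the variable names are illustrative and not part of the book's code.

```r
# Shot counts for the four most frequent shooters, copied from summary(shots)
top4_counts <- c("Player 13" = 176, "Player 14" = 166,
                 "Player 7" = 148, "Player 11" = 142)
total_shots <- 1163

# Share of all shots taken by these four players
top4_share <- sum(top4_counts) / total_shots
round(top4_share, 3)

# With the data loaded, the overall shooting percentage is the mean of the
# 0/1 outcomes, e.g. mean(shots$shot_made_numeric)
```

The four players account for 632 shots, or roughly 54% of the sample.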

2.3 Augmenting Our Data

We need to create a FIBA basketball court in R to plot these points exactly. But first, let's add a few columns to our data. We can convert the player variable to a factor variable and reorder its levels. We can also create a factor variable for the binary outcome of whether the shot was made or not and properly label its levels.

# Add a few variables and clean others
shots <- shots %>%
  # Convert shots to a tibble
  as_tibble() %>%
  # Add Columns
  mutate(
    # convert players to a factor
    player =  factor(
      player, 
      # Re-level P1, P2, ..., P18
      levels = paste("Player", 1:length(unique(shots$player)))
      ),
    # Create  a factor binary variable for whether the shot was made or not
    shot_made_factor = recode_factor(factor(shot_made_numeric),
      "0" = "Miss", 
      "1" = "Make"
      )
  )

Table 2.2: Adding a categorical variable for the outcome of the shot
player	shot_made_numeric	loc_x	loc_y	shot_made_factor
Player 7	1	4.386864	7.955280	Make
Player 3	0	5.779800	8.510016	Miss
Player 7	1	3.003072	7.083552	Make
Player 3	0	3.377976	7.202424	Miss
Player 13	0	3.161568	7.202424	Miss
Player 7	0	3.329208	7.141464	Miss

2.3.1 Shot Distance

We can calculate the 2D distance between each shot⁹ and the center of the hoop¹⁰. To do so, we can use the distance¹¹ formula.

\[ d = \sqrt{(x - 7.5)^2 + (y - 1.575)^2} \]

Note that this formula for distance works since it is essentially the Pythagorean Theorem. Consider the simple example of a shot located at $(11.5, ~4.575)$. You can use the formula to calculate the distance, or you can see from Figure 2.3 below that the two legs of the triangle measure 4 and 3 meters, so the distance should be 5 meters by the Pythagorean Theorem.

Figure 2.3: Calculating the shot distance with the Pythagorean Theorem
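The worked example can also be checked numerically. This is a minimal sketch: `shot_distance()` is a hypothetical helper written for illustration, and the hoop center coordinates are taken from the distance formula.

```r
# Hoop center in meters, taken from the distance formula
hoop_x <- 7.5
hoop_y <- 1.575

# Hypothetical helper (not part of the book's code):
# 2D distance between a shot location and the hoop center
shot_distance <- function(x, y) {
  sqrt((x - hoop_x)^2 + (y - hoop_y)^2)
}

# The example shot: legs of 4 m and 3 m
shot_distance(11.5, 4.575)
```

Up to floating-point rounding, this returns 5 meters, in agreement with the 3-4-5 right triangle.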

Note that FIBA uses the metric system for its court dimensions. The NBA and many basketball fans communicate distances in terms of feet. For this reason, we'll convert the distance from meters to feet with the following equivalence.

\[ d_{\mbox{feet}} = d_{\mbox{meters}} \times \left( 3.28084 ~ \frac{\mbox{ft}}{\mbox{m}} \right) \]
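For instance, the 5-meter shot from the earlier example can be converted as follows. The helper name `meters_to_feet()` is an assumption made for this sketch.

```r
# Conversion factor from the equivalence above
feet_per_meter <- 3.28084

# Hypothetical helper for illustration
meters_to_feet <- function(d_meters) {
  d_meters * feet_per_meter
}

# A 5-meter shot expressed in feet
meters_to_feet(5)
```

A 5-meter shot is therefore a little over 16.4 feet from the hoop.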

# Define FIBA court width and y-coordinate of hoop center in meters
width <- 15 
hoop_center_y <- 1.575

# Calculate the shot distances
shots <- shots %>%
  # Add Columns
  mutate(
    dist_meters = sqrt((loc_x-width/2)^2 + (loc_y-hoop_center_y)^2),
    dist_feet = dist_meters * 3.28084
  )

Table 2.3: Converting the distance from meters to feet
player	loc_x	loc_y	dist_meters	dist_feet
Player 7	4.386864	7.955280	7.099267	23.29156
Player 3	5.779800	8.510016	7.145176	23.44218
Player 7	3.003072	7.083552	7.111013	23.33010
Player 3	3.377976	7.202424	6.975599	22.88582
Player 13	3.161568	7.202424	7.105624	23.31242
Player 7	3.329208	7.141464	6.955647	22.82037

2.3.2 Shot Angle

We can also calculate the angle $\theta$ between the shot location and the center line. Figure 2.4 is the easiest way to grasp how we define the shot angle.

Figure 2.4: Reference system for the shot angle

The shot angle is $\theta$ (the red angle). To calculate it, we can first calculate $\phi$¹² and subtract it from $90^{\circ}$, or $\frac{\pi}{2}$ radians, to get the magnitude of $\theta$. We can use SOH CAH TOA to calculate $\phi$. Since we have the opposite and adjacent sides to the angle $\phi$, we can use the tangent ratio ($\tan(\phi) = \frac{O}{A}$). Thus, we get that $\phi = \arctan\left(\frac{y - 1.575}{x - 7.5}\right)$. The magnitude of the angle is $\frac{\pi}{2} - \phi$, and flipping its sign gives $\theta = \phi - \frac{\pi}{2}$ for the shot in the picture above.

Note that the shot angle is negative since we defined shots on the left-hand side of the court¹³ to have negative angle values. The calculation of the shot angle depends on which quadrant the shot is released from, but the same logic applies.
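To make the derivation concrete, here is a small sketch that applies it to a hypothetical shot at $(11.5, ~4.575)$. It assumes the shot lies in the same quadrant as the pictured one, that is, $x > 7.5$ and $y > 1.575$; the variable names are illustrative.

```r
# Hoop center in meters
hoop_x <- 7.5
hoop_y <- 1.575

# Hypothetical shot on the left-hand side, farther from the baseline than the hoop
x <- 11.5
y <- 4.575

# phi from the tangent ratio: opposite = y - 1.575, adjacent = x - 7.5
phi <- atan((y - hoop_y) / (x - hoop_x))

# theta is negative for shots on the left-hand side of the court
theta <- phi - pi / 2
theta
```

Here the opposite and adjacent sides are 3 and 4 meters, so $\phi = \arctan(3/4)$ and $\theta$ comes out negative, as expected for this side of the court.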

By default, R's trigonometric functions return angles in radians. However, most mortals don't know what a shot angle of 0.7853982 radians¹⁴ means, so we'll convert to degrees to communicate our results. We can easily convert the angles to degrees by using the following equivalence.

\[ \theta_{\mbox{degrees}} = \theta_{\mbox{radians}} \times \left( \frac{180 ~ \mbox{degrees}}{\pi ~ \mbox{radians}} \right) \]
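As a quick illustration of this conversion, the sketch below applies it to the angle mentioned in the text. The helper name `rad_to_deg()` is an assumption and not part of the book's code.

```r
# Hypothetical helper for illustration
rad_to_deg <- function(theta_rad) {
  theta_rad * (180 / pi)
}

# The angle quoted in the text
rad_to_deg(0.7853982)

# A right angle, for reference
rad_to_deg(pi / 2)
```

The first value is approximately 45 degrees, which is far easier to picture than 0.7853982 radians.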

# Calculate the shot angles
shots <- shots %>%
  # Add Columns
  mutate(
    theta_rad = case_when(
      # Quadrant 1: Shots from left side higher than the rim
      loc_x > width/2 & loc_y > hoop_center_y ~
        atan((loc_x-width/2)/(loc_y-hoop_center_y)),
      # Quadrant 2: Shots from right side higher than the rim
      loc_x < width/2 & loc_y > hoop_center_y ~
        atan((width/2-loc_x)/(loc_y-hoop_center_y)),
      # Quadrant 3: Shots from right side lower than the rim
      loc_x < width/2 & loc_y < hoop_center_y ~
        atan((hoop_center_y-loc_y)/(width/2-loc_x))+(pi/2),
      # Quadrant 4: Shots from left side lower than the rim
      loc_x > width/2 & loc_y < hoop_center_y ~
        atan((hoop_center_y-loc_y)/(loc_x-width/2))+(pi/2),
      # Special Cases
      loc_x == width/2  & loc_y >= hoop_center_y ~ 0, # Directly centered front
      loc_x == width/2  & loc_y < hoop_center_y ~ pi, # Directly centered back
      loc_y == hoop_center_y ~ pi/2, # Directly parallel to hoop center
    ),
    # Make the angle negative if the shot is on the left-side
    theta_rad = ifelse(loc_x > width/2, -theta_rad, theta_rad),
    # Convert the angle from radians to degrees
    theta_deg = theta_rad * (180/pi)
  )

Table 2.4: Adding the shot angles in both radians and degrees.
player	dist_meters	dist_feet	theta_rad	theta_deg
Player 7	7.099267	23.29156	0.4539458	26.00918
Player 3	7.145176	23.44218	0.2431384	13.93080
Player 7	7.111013	23.33010	0.6846336	39.22662
Player 3	6.975599	22.88582	0.6321993	36.22235
Player 13	7.105624	23.31242	0.6567714	37.63023
Player 7	6.955647	22.82037	0.6430346	36.84317
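The angles in Table 2.4 can be cross-checked with a more compact formulation. The sketch below is an alternative, not the book's method: it assumes that base R's two-argument `atan2()` can replace the quadrant-by-quadrant logic, with the sign convention (right side positive, left side negative) following from the order of the arguments. The function name `shot_angle_atan2()` is hypothetical.

```r
# Court constants in meters, as defined earlier
width <- 15
hoop_center_y <- 1.575

# Alternative formulation (an assumption, not the book's code):
# atan2 handles all quadrants and the sign of the angle in one call
shot_angle_atan2 <- function(loc_x, loc_y) {
  atan2(width / 2 - loc_x, loc_y - hoop_center_y)
}

# Coordinates of the first two shots, copied from the tables above
loc_x <- c(4.386864, 5.779800)
loc_y <- c(7.955280, 8.510016)

theta_rad <- shot_angle_atan2(loc_x, loc_y)
theta_deg <- theta_rad * (180 / pi)
round(cbind(theta_rad, theta_deg), 4)
```

For these two shots the results should agree with the first two rows of Table 2.4, up to the rounding of the printed coordinates. The explicit `case_when()` version remains easier to map onto Figure 2.4, which is a reasonable argument for keeping it in a teaching context.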

We can save our newly augmented data so we can load it in the next chapters without the need to manually add the extra columns each time.

# Save the augmented data
saveRDS(shots, file = "data/shots_augmented.rds")

In the next chapter, we will start to explore how the different variables influence the probability of making a shot.
