2 Exploring Basketball Shots Data
Note that all the R code used in this book is accessible on GitHub.
2.1 Tracking Basketball Shots
Basketball shot coordinates can be tracked manually using pen and paper. The \((x, ~y)\) coordinates of each shot marked on paper can be estimated once a reference frame for the basketball court has been chosen. Let's consider a FIBA basketball court and focus on the half-court for now. We can set the origin of our two-dimensional coordinate system at the bottom left corner of the image below (Figure 2.1).
Once we've picked a coordinate system, we can visually estimate the \((x, ~y)\) coordinates of each shot that was tracked on paper. Of course, this is not ideal. I built this Desmos file in the early days to manually estimate the shot coordinates on an NBA court.
This is where the Easy Stats iOS application comes in. You can watch this video tutorial to see how one can easily track and export more precise shot coordinates. In short, we can use the app to keep track of the shooter, the outcome of the shot (made or missed), and the location of the shot. This play-by-play data can be exported as a CSV file via email.
2.2 Getting To Know Our Data
A basic shot data set was put together for this analysis. Let's load the dataset and see what we're working with.
# Load the tidyverse library to be able to use %>% and dplyr to wrangle
library(tidyverse)
# Load the artificial shot data
shots <- readRDS(file = "data/shots.rds")
player | shot_made_numeric | loc_x | loc_y |
---|---|---|---|
Player 7 | 1 | 4.386864 | 7.955280 |
Player 3 | 0 | 5.779800 | 8.510016 |
Player 7 | 1 | 3.003072 | 7.083552 |
Player 3 | 0 | 3.377976 | 7.202424 |
Player 13 | 0 | 3.161568 | 7.202424 |
Player 7 | 0 | 3.329208 | 7.141464 |
Table 2.1 tells us that Player 7 made their first shot at location \((4.39, 7.96)\). We can try to place this first shot on a basketball court. We know that \(4.39 < 7.5\), which implies that the shot was taken on the right side of the court.
# Display the general structure of the data
str(shots)
## 'data.frame': 1163 obs. of 4 variables:
## $ player : Factor w/ 18 levels "Player 1","Player 2",..: 7 3 7 3 13 7 7 7 11 18 ...
## $ shot_made_numeric: num 1 0 1 0 0 0 1 1 1 0 ...
## $ loc_x : num 4.39 5.78 3 3.38 3.16 ...
## $ loc_y : num 7.96 8.51 7.08 7.2 7.2 ...
We see from the output above that 18 distinct players took shots in our data set. We are working with a categorical variable (player), a binary numeric variable (made shot? yes/no), and two continuous variables for the shot coordinates.
# Explore the distribution of each variable
summary(shots)
## player shot_made_numeric loc_x loc_y
## Player 13:176 Min. :0.0000 Min. : 0.3666 Min. :1.073
## Player 14:166 1st Qu.:0.0000 1st Qu.: 6.3132 1st Qu.:1.756
## Player 7 :148 Median :0.0000 Median : 7.4653 Median :2.954
## Player 11:142 Mean :0.4136 Mean : 7.8274 Mean :4.089
## Player 3 : 79 3rd Qu.:1.0000 3rd Qu.: 9.4755 3rd Qu.:6.370
## Player 10: 74 Max. :1.0000 Max. :14.7104 Max. :9.802
## (Other) :378
Note that 4 players took around half of the shots in the sample (632 of 1163, or about 54%). This implies that some players did not take many shots. Taking the average of all the 1163 ones and zeros, we get an overall shooting percentage of 41.36%. The \(x\) component of the location stays between 0 and 15. This makes sense since the data set was created using a FIBA-sized basketball court, which has a width of 15 meters and a length of 28 meters. It also makes sense that the highest recorded shot had a \(y\) component of 9.8 meters, given that the top of the three-point arc is 8.325 meters from the baseline (6.75 meters from the hoop center, which is itself 1.575 meters from the baseline).
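As a quick check of the arithmetic above, here is a minimal base R sketch. It uses only the six rows shown in Table 2.1 and the player counts printed by summary(), not the full data set, so the toy percentage will differ from the 41.36% reported for all 1163 shots. The object names are illustrative and do not appear elsewhere in the book.

```r
# Outcomes of the six shots shown in Table 2.1 (1 = made, 0 = missed)
toy_outcomes <- c(1, 0, 1, 0, 0, 0)

# The mean of a 0/1 variable is the proportion of ones,
# i.e. the shooting percentage expressed as a fraction
toy_pct <- mean(toy_outcomes)

# Shot counts of the four most active players, read off the summary() output
top4_counts <- c(176, 166, 148, 142)
total_shots <- 1163

# Share of all shots taken by these four players
top4_share <- sum(top4_counts) / total_shots

round(toy_pct, 4)
round(top4_share, 4)
```

The same idea applied to the full data would be mean(shots$shot_made_numeric).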
2.3 Augmenting Our Data
We need to create a FIBA basketball court in R to plot these points exactly. But first, let's add a few columns to our data. We can convert the player variable to a factor variable and reorder its levels. We can also create a factor variable for the binary outcome of whether the shot was made or not and properly label its levels.
# Add a few variables and clean others
shots <- shots %>%
  # Convert shots to a tibble format
  tibble() %>%
  # Add Columns
  mutate(
    # Convert players to a factor
    player = factor(
      player,
      # Re-level P1, P2, ..., P18
      levels = paste("Player", 1:length(unique(shots$player)))
    ),
    # Create a factor binary variable for whether the shot was made or not
    shot_made_factor = recode_factor(factor(shot_made_numeric),
      "0" = "Miss",
      "1" = "Make"
    )
  )
player | shot_made_numeric | loc_x | loc_y | shot_made_factor |
---|---|---|---|---|
Player 7 | 1 | 4.386864 | 7.955280 | Make |
Player 3 | 0 | 5.779800 | 8.510016 | Miss |
Player 7 | 1 | 3.003072 | 7.083552 | Make |
Player 3 | 0 | 3.377976 | 7.202424 | Miss |
Player 13 | 0 | 3.161568 | 7.202424 | Miss |
Player 7 | 0 | 3.329208 | 7.141464 | Miss |
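To see why it can help to set the factor levels explicitly, here is a small base R sketch on a hypothetical three-player vector. It also shows factor() with labels as one possible base R alternative to recode_factor(). This is an illustration under those assumptions, not the code used to build the data set.

```r
# A few player labels in the order they might appear in the data
toy_players <- c("Player 7", "Player 3", "Player 13")

# Default: levels are sorted as text, so "Player 13" comes before "Player 3"
default_levels <- levels(factor(toy_players))

# Explicit levels: Player 1, Player 2, ..., Player 18 in numeric order
ordered_players <- factor(toy_players, levels = paste("Player", 1:18))

# Base R alternative to recode_factor(): label 0 as "Miss" and 1 as "Make"
toy_outcomes <- c(1, 0, 1)
outcome_factor <- factor(toy_outcomes, levels = c(0, 1), labels = c("Miss", "Make"))

default_levels
as.integer(ordered_players)
as.character(outcome_factor)
```

With explicit levels, the integer codes behind the factor line up with the player numbers, which keeps plots and tables in a natural order.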
2.3.1 Shot Distance
We can calculate the 2D distance between each shot and the center of the hoop. To do so, we can use the distance formula.
\[ d = \sqrt{(x - 7.5)^2 + (y - 1.575)^2} \]
Note that this formula works because it is the Pythagorean Theorem applied to the horizontal and vertical offsets between the shot and the hoop center at \((7.5, ~1.575)\). Consider the simple example of a shot located at \((11.5, ~4.575)\). The offsets are 4 meters and 3 meters, so the formula gives \(\sqrt{4^2 + 3^2} = 5\) meters, which is also what Figure 2.3 below shows.
Note that FIBA uses the metric system for its court dimensions. The NBA and many basketball fans communicate distances in terms of feet. For this reason, we'll convert the distance from meters to feet with the following equivalence.
\[ d_{\mbox{feet}} = d_{\mbox{meters}} \times \left( 3.28084 ~ \frac{\mbox{ft}}{\mbox{m}} \right) \]
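The worked example can be checked numerically. The sketch below is a minimal base R version, where shot_distance_m() is a hypothetical helper name introduced only for this illustration.

```r
# FIBA half-court constants used in this chapter (meters)
width <- 15
hoop_center_y <- 1.575

# Hypothetical helper: straight-line distance from a shot to the hoop center
shot_distance_m <- function(x, y) {
  sqrt((x - width / 2)^2 + (y - hoop_center_y)^2)
}

# Worked example from the text: offsets of 4 m and 3 m give a 3-4-5 triangle
d_m <- shot_distance_m(11.5, 4.575)

# Convert meters to feet with the factor given above
d_ft <- d_m * 3.28084

d_m
d_ft
```

Because the helper is vectorized, it can also be called on whole columns such as loc_x and loc_y.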
# Define FIBA court width and y-coordinate of hoop center in meters
width <- 15
hoop_center_y <- 1.575
# Calculate the shot distances
shots <- shots %>%
  # Add Columns
  mutate(
    dist_meters = sqrt((loc_x - width/2)^2 + (loc_y - hoop_center_y)^2),
    dist_feet = dist_meters * 3.28084
  )
player | loc_x | loc_y | dist_meters | dist_feet |
---|---|---|---|---|
Player 7 | 4.386864 | 7.955280 | 7.099267 | 23.29156 |
Player 3 | 5.779800 | 8.510016 | 7.145176 | 23.44218 |
Player 7 | 3.003072 | 7.083552 | 7.111013 | 23.33010 |
Player 3 | 3.377976 | 7.202424 | 6.975599 | 22.88582 |
Player 13 | 3.161568 | 7.202424 | 7.105624 | 23.31242 |
Player 7 | 3.329208 | 7.141464 | 6.955647 | 22.82037 |
2.3.2 Shot Angle
We can also calculate the angle \(\theta\) between the shot location and the center line. Looking at Figure 2.4 is almost certainly a better way to grasp how we defined the shot angle.
The shot angle is \(\theta\) (the red angle). To calculate it, we can calculate \(\phi\) and compare it with \(90^{\circ}\), or \(\frac{\pi}{2}\) radians. We can use SOH CAH TOA to calculate \(\phi\). Since we have the opposite and adjacent sides to the angle \(\phi\), we can use the tangent ratio (\(\tan(\phi) = \frac{O}{A}\)). Thus, we get that \(\phi = \arctan\left(\frac{y - 1.575}{x - 7.5}\right)\). The size of the shot angle is \(\frac{\pi}{2} - \phi\), and once the sign is taken into account we have \(\theta = \phi - \frac{\pi}{2}\) for the shot in the picture above.
Note that the shot angle is negative since we defined shots on the left-hand side of the court to have negative angle values. The calculation of the shot angle depends on which quadrant the shot is released from, but the same logic applies.
R's trigonometric functions, such as atan(), return angles in radians. However, most mortals don't know what a shot angle of 0.7853982 radians means, so we'll convert to degrees to communicate our results. We can easily convert the angles to degrees by using the following equivalence.
\[ \theta_{\mbox{degrees}} = \theta_{\mbox{radians}} \times \left( \frac{180 ~ \mbox{degrees}}{\pi ~ \mbox{radians}} \right) \]
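As a worked example, the sketch below computes the shot angle for the point \((11.5, ~4.575)\) from Section 2.3.1. Using that point here is an assumption made for illustration, since the shot drawn in Figure 2.4 may sit elsewhere. The variable names are also illustrative.

```r
# Offsets of the example shot at (11.5, 4.575) from the hoop center (7.5, 1.575)
dx <- 11.5 - 7.5      # horizontal offset in meters
dy <- 4.575 - 1.575   # vertical offset in meters

# phi is measured from the line through the hoop that runs parallel to the baseline
phi <- atan(dy / dx)

# With the sign convention in the text, a shot with x > 7.5 has a negative angle
theta_rad <- phi - pi / 2

# Convert radians to degrees
theta_deg <- theta_rad * (180 / pi)

# The 0.7853982 radians mentioned in the text is 45 degrees
quarter_pi_deg <- 0.7853982 * (180 / pi)

theta_deg
quarter_pi_deg
```

The example shot comes out at roughly -53 degrees, that is, about 53 degrees away from the center line on the negative side.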
# Calculate the shot angles
shots <- shots %>%
  # Add Columns
  mutate(
    theta_rad = case_when(
      # Quadrant 1: Shots from left side higher than the rim
      loc_x > width/2 & loc_y > hoop_center_y ~
        atan((loc_x - width/2) / (loc_y - hoop_center_y)),
      # Quadrant 2: Shots from right side higher than the rim
      loc_x < width/2 & loc_y > hoop_center_y ~
        atan((width/2 - loc_x) / (loc_y - hoop_center_y)),
      # Quadrant 3: Shots from right side lower than the rim
      loc_x < width/2 & loc_y < hoop_center_y ~
        atan((hoop_center_y - loc_y) / (width/2 - loc_x)) + (pi/2),
      # Quadrant 4: Shots from left side lower than the rim
      loc_x > width/2 & loc_y < hoop_center_y ~
        atan((hoop_center_y - loc_y) / (loc_x - width/2)) + (pi/2),
      # Special Cases
      loc_x == width/2 & loc_y >= hoop_center_y ~ 0, # Directly centered front
      loc_x == width/2 & loc_y < hoop_center_y ~ pi, # Directly centered back
      loc_y == hoop_center_y ~ pi/2, # Directly parallel to hoop center
    ),
    # Make the angle negative if the shot is on the left-side
    theta_rad = ifelse(loc_x > width/2, -theta_rad, theta_rad),
    # Convert the angle from radians to degrees
    theta_deg = theta_rad * (180/pi)
  )
player | dist_meters | dist_feet | theta_rad | theta_deg |
---|---|---|---|---|
Player 7 | 7.099267 | 23.29156 | 0.4539458 | 26.00918 |
Player 3 | 7.145176 | 23.44218 | 0.2431384 | 13.93080 |
Player 7 | 7.111013 | 23.33010 | 0.6846336 | 39.22662 |
Player 3 | 6.975599 | 22.88582 | 0.6321993 | 36.22235 |
Player 13 | 7.105624 | 23.31242 | 0.6567714 | 37.63023 |
Player 7 | 6.955647 | 22.82037 | 0.6430346 | 36.84317 |
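One way to cross-check the quadrant logic is base R's two-argument atan2(), which handles all four quadrants and the on-axis cases in a single call. The sketch below is an alternative formulation, not the book's method, and shot_angle_rad() is a hypothetical helper name.

```r
width <- 15
hoop_center_y <- 1.575

# Hypothetical helper: signed shot angle in radians via atan2().
# Positive when loc_x < width/2 and negative when loc_x > width/2,
# matching the sign convention of the case_when() code above.
shot_angle_rad <- function(x, y) {
  atan2(width / 2 - x, y - hoop_center_y)
}

# First row of the table above
theta_first <- shot_angle_rad(4.386864, 7.955280)

# Example shot from Section 2.3.1, on the negative-angle side
theta_example <- shot_angle_rad(11.5, 4.575)

# Shots on the center line: in front of and behind the hoop
theta_front <- shot_angle_rad(7.5, 5)
theta_back <- shot_angle_rad(7.5, 1)

theta_first * (180 / pi)
theta_example * (180 / pi)
```

Comparing a few values like these against the theta_rad column is a cheap sanity check before moving on to modelling.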
We can save our newly augmented data so we can load it in the next chapters without the need to manually add the extra columns each time.
# Save the augmented data
saveRDS(shots, file = "data/shots_augmented.rds")
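To confirm that an RDS file preserves the data exactly, a round trip can be tested on a small stand-in data frame. The sketch below writes to a temporary file rather than to the data folder; toy_shots and tmp_path are illustrative names.

```r
# Small stand-in data frame (first two shots from Table 2.1)
toy_shots <- data.frame(
  player = c("Player 7", "Player 3"),
  loc_x = c(4.386864, 5.779800),
  loc_y = c(7.955280, 8.510016)
)

# Write to a temporary file so the real data folder is left untouched
tmp_path <- tempfile(fileext = ".rds")
saveRDS(toy_shots, file = tmp_path)

# Read it back and confirm nothing changed
reloaded <- readRDS(tmp_path)
round_trip_ok <- identical(toy_shots, reloaded)

round_trip_ok
```

Unlike a CSV export, the RDS format keeps column types such as factors, so the augmented columns should not need to be rebuilt after loading.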
In the next chapter, we will start to explore how the different variables influence the probability of making a shot.