Random Effects in Linear Models

The Question

My project question is “How does the number of field goal attempts affect the field goal percentage?

During the timeouts of NBA games, we can always hear the coach shouting at the players, “Keep shooting!

It’s believed by most NBA professionals that one can finally find his rhythm if he keeps shooting the ball to the basket even though he’s freezing cold now. Actually, the underlying assumption is that the increase in the number of field goal attempts can positively affect the field goal made (and finally increase the field goal percentage).

Some players even complain that they are not performing well in terms of the field goal percentage just because they are not allowed to shoot as many as the superstars do on court.

The question of this mini-project is to address the aforementioned phenomenon with statistical modeling.

The Dataset

The dataset is the NBA players’ stats per game in the 2020–2021 NBA season, which is downloaded from the website basketball reference.

Here are the codes in R for data cleaning.

my_tab = read_excel("sportsref_download.xlsx")
dup_players = my_tab[with(my_tab, Tm == "TOT"),]$Player
my_tab_filtered = my_tab[with(my_tab,((Player %in% dup_players) & Tm == "TOT")| !(Player %in% dup_players)),] %>%
mutate(Pos = replace(Pos, Pos == "SG-PG", "SG"))%>%
mutate(Pos = replace(Pos, Pos == "SF-SG", "SF"))%>%
mutate(Pos = replace(Pos, Pos == "PF-SF", "PF"))%>%
mutate(Pos = replace(Pos, Pos == "C-PF", "C"))%>%
mutate(Pos = replace(Pos, Pos == "PG-SG", "PG"))%>%
mutate(Pos = replace(Pos, Pos == "SG-SF", "SG"))%>%
mutate(Pos = replace(Pos, Pos == "SF-PF", "SF"))%>%
mutate(Pos = replace(Pos, Pos == "PF-C", "PF"))
my_tab_filtered = my_tab_filtered[complete.cases(my_tab_filtered),]

First, I removed the duplicated players who have been traded between teams within the season from the data. Then, I modified the players’ positions to their primary positions if they can play multiple positions. Finally, I removed the NA values from the data.

To check whether the player names are unique inside the data, I used the following code.

length(unique(my_tab_filtered$Player)) == nrow(my_tab_filtered)

It gives me “TRUE”, which means the players are unique.

Since I’m trying to comment on the field goal attempts’ effects on the field goal percentage, I plotted them against each other in the following scatter plot.

Scatter Plot FG% vs FGA

There seems to be some linear relationship between the field goal attempts number and the field goal percentage from the plot. Next, I will use statistical methods to check the observation.

The Modeling

If we don’t consider any possible grouping structures inside the data, we can simply apply the linear regression model to it, which is to fit everything into one linear model.

null.lm = lm(`FG%` ~ FGA, data = my_tab_filtered)

it gives us,

## Call:
## lm(formula = `FG%` ~ FGA, data = my_tab_filtered)
## Residuals:
## Min 1Q Median 3Q Max
## -0.42711 -0.04531 -0.00926 0.03243 0.32324
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4254855 0.0084259 50.497 < 2e-16 ***
## FGA 0.0032440 0.0009531 3.404 0.000741 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.08477 on 358 degrees of freedom
## Multiple R-squared: 0.03134, Adjusted R-squared: 0.02864
## F-statistic: 11.58 on 1 and 358 DF, p-value: 0.0007406

The p-value of FGA (field goal attempts) looks good, isn’t it?

I also checked the residuals,

plot(null.lm, which = 1)
Residuals vs Fitted value in the null linear model
plot(null.lm, which = 2)
Q-Q plot of the null linear model

These diagnoses plots do show some abnormal patterns, but the majority of data looks OK. Please refer to this article for more about diagnosing linear models.

If I stop here, I could have drawn the conclusion that the field goal attempts do positively affect the field goal percentage significantly. However, I have assumed independence between observations in the linear model, which is totally wrong.

Let’s look at the players’ FG% in different positions,

boxplot(`FG%` ~ Pos, data = my_tab_filtered, las=2,xlab = "")
Field goal percentage in five positions

We can see from the boxplot that different positions do have different levels of field goal percentage. For example, the centers have much higher FG% than the other positions, which makes sense because they are closer to the basket.

If we plot again the scatter plot with positions of players as the color, we can see the patterns of grouping (clustering) as well,

FG% against FGA with color-coding

Therefore, I cannot ignore the “position” variable even though it is not my interest in modeling.

In such a case, it’s necessary to induce the concepts of fixed effects and random effects in linear models. Simply speaking, a fixed effect is an unknown constant that we are trying to estimate from the data, whereas a random effect is a random variable that we try to estimate the distribution parameters of (Faraway, Julian J. , 2016).

For fixed effects, we aim to estimate their coefficients in the linear model in order to comment on the relationship with the dependent variable. However, we are not interested in the effect of each specific level in the random effect variable, and we just want to test whether the variation of the random effect is larger than zero.

Since I am interested in the FGA’s effects on the FG% in a general way instead of how it looks like in each individual position, I used the position as random effects instead of fixed effects in the linear model. The more conceptual differences between fixed effects and random effects can be found here.

To perform the mixed (fixed effects + random effects) linear model in R, the package lme4 is needed.

Then, I extended the previous linear model with the position as the random effects.

mixed.lm = lmer(`FG%` ~ FGA + (1|Pos), data = my_tab_filtered)

which gives me,

## Linear mixed model fit by REML ['lmerMod']## Formula: `FG%` ~ FGA + (1 | Pos)
## Data: my_tab_filtered
## REML criterion at convergence: -859.3
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.2524 -0.5565 0.0096 0.5838 3.5555
## Random effects:
## Groups Name Variance Std.Dev.
## Pos (Intercept) 0.003318 0.05760
## Residual 0.004878 0.06984
## Number of obs: 360, groups: Pos, 5
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.4178439 0.0266955 15.65
## FGA 0.0048149 0.0007998 6.02
## Correlation of Fixed Effects:
## (Intr)
## FGA -0.223

In the report above, the Random effects part tells us how much variance we found among the grouping factor (here are players’ positions) as well as the residual variance.

And the part of the Fixed effect is very similar to that of the simple linear model report, which shows the estimate of coefficients as well as the standard error of the coefficient.

If we first look at the random effects part of the report above, we can see that the variance from “Pos” is very similar to that from “Residual”. After calculating the proportion,

0.003318/(0.003318 + 0.004878)

I got 40.4% of the variance is not explained by the fixed effects. These results again confirm the necessity of including “Pos” as the random effects.

If we look at the estimate and standard error of the FGA coefficient in the fixed effects part, we can see that the standard error is much smaller than the estimate, which indicates the coefficient of FGA is different from zero.

You may already notice that there is no “p-value” in the report of the mixed linear model by package lme4.

“No p-value provided” is a major disadvantage of using random effects in the linear model. There are actually a lot of ways to calculate the approximate p-value, but none of them is that precise theoretically because of the complicated calculation of the degree of freedom. The discussion of this specific topic can be very complicated and long, so we are not going through the details here.

A practical workaround to calculate the p-value for random effects is to use the likelihood ratio test statistics in bootstrap. The basic idea of the likelihood ratio test is to compare two models’ likelihood to the observed data by including a different number of variables. If the ratio of likelihood is significantly different from 1, the inclusion of the extra variable is useful then.

To test whether the likelihood ratio is significantly different from one is the same as to test whether the difference between the log-likelihood ratio is different from zero.

The test statistic can be calculated as the following,

my_stat = 2*(logLik(mixed.lm) - logLik(null.lm,REML = TRUE))

Then, we create a bootstrap method to simulate the data under the null hypothesis (the simple linear model) and then generate a large number of likelihood ratio test statistics.

LRT_stat = numeric(1000)
for(i in 1:1000){
y = simulate(null.lm)$sim_1
null_md = lm(y ~ FGA, data = my_tab_filtered)
mixed_md = lmer(y ~ FGA + (1|Pos), data = my_tab_filtered)
LRT_stat[i] = as.numeric(2*(logLik(mixed_md) - logLik(null_md,REML = TRUE)))

Then, we test how many simulated cases exceed the observed statistics in our data,

sum(LRT_stat > my_stat)

In our case, the number is close to zero, which indicates the significance of the random effects.

The same likelihood ratio test could be applied to testing the fixed effects in the linear model. I will not repeat the process again in this article since it’s very similar to that of the test of random effects. The only difference is to set REML to FALSE when you calculate the likelihood ratio statistics between two models that differ in the fixed effects. The code is as below:

my_stat = 2*(logLik(mixed.lm) - logLik(null.lm,REML = FALSE))
LRT_stat = numeric(1000)
for(i in 1:1000){
y = simulate(null.lm)$sim_1
null_md = lmer(y ~ 1 + (1|Pos), data = my_tab_filtered, method = "ML")
mixed_md = lmer(y ~ FGA + (1|Pos), data = my_tab_filtered, method = "ML")
LRT_stat[i] = as.numeric(2*(logLik(mixed_md) - logLik(null_md,REML = FALSE)))
sum(LRT_stat > my_stat)

In summary, we found that field goal attempts significantly affect the field goal percentage of a player after considering the random effects from the players’ positions.


The real-life problems could be more complicated than the ones I showed above. For example, if you have more than 1 grouping factor in your random effects, you need to consider the type of the random effects. There are mainly two types of random effects, crossed effects and nested effects. If the subjects in one level of the random effects do not appear in any other levels, then it’s nested effects, otherwise, it’s crossed effects.

For example, in the aforementioned question about the relationship between family income and student grades, you have both school and class as the random effects. The school and class variables are nested effects because one class could only belong to one specific school. The sudo code for such a case could be:

nested_mod = lmer(grades ~ income + (1|school) + (1|school:class), data = school_grades)


nested_mod = lmer(grades ~ income + (1|school/class), data = school_grades)

For another example, if you are trying to measure the relationship between the body size and bodyweight of birds in a large range of mountain areas, you may need to consider two random effects, the season and the mountain location.

Both the season when and the location where the bird is captured can affect the relationship between the body size and weight of birds. Since each mountain will of course have all of four seasons, these two random effects are crossed effects. The sudo code is like,

crossed_mod = lmer(grades ~ income + (1|season) + (1|mountain_id), data = birds_data)

These aforementioned usages of random effects in the linear models have covered most possible situations. Hope these are helpful to you.

If you are puzzling about when to use random effects or not, there are a lot of debates and criteria about that. If you are interested, you can read more here. However, my suggestion is that as long as you notice the effects from any grouping factors in the dataset, go ahead with random effects, which is also recommended by Gelman, Andrew (2005).


Leave a Comment