So far, we have been using linear models which assume that our response variable is continuous. In earth and life sciences (ecology in particular) we are often working with discrete data, such as count data and binomial (presence/absence) data.
The linear models we have been using so far have been assuming a normal (or gaussian) distribution in our data. Generalised linear models (GLMs) allow us to fit alternative distributions to our data in order to more accurately analyse them.
GLMs do make some important assumptions which we will need to check when we construct the model.
Our binomial (logistic regression) does have some assumptions, but thankfully it is fairly resiliant and we dont need to test them. For any other distribution (poisson, gamma etc.) these are cruical.
It is important to note that part of fitting a GLM is using a link function. I won’t be explaining these in detail (yet), all you need to know is the default link method for binomial data is the logit() method. For more information see ?family
.
For this analysis, we will be using the nestpredation.csv dataset
str(nest) # view the structure
## 'data.frame': 20 obs. of 2 variables:
## $ shrubcover : int 16 20 11 15 19 31 5 12 9 10 ...
## $ nestattacked: int 1 0 1 1 1 0 1 1 1 0 ...
# This is okay. Our nest attacked column is an integer, but the glm will tell it to input as binomial so we dont need to change anything.
nest.bin<-glm(nestattacked~shrubcover, data=nest, family=binomial)
The glm()
commands follows the same structure as the lm()
and aov()
with the inclusion of the extra argument family. Family is where we specify our distribution. In this case, for our logistic regression, we specify a binomial distribution.
Once we have constructed our model, we can use the anova()
command and the summary()
commands to look at our results. The summary()
commands p-values tend to be a little weird, so I prefer to use the anova()
command to look at variable significance, and summary()
to look at the model equation if I need it.
anova(nest.bin, test="Chisq") # anova test using a chisq instead of F
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: nestattacked
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 27.526
## shrubcover 1 9.0911 18 18.434 0.002569 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is strong evidence that the probability of a nest being attacked varies with shrubcover (p<0.01).
summary(nest.bin)
##
## Call:
## glm(formula = nestattacked ~ shrubcover, family = binomial, data = nest)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8424 -0.5183 -0.2135 0.8024 1.5148
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.3782 1.6025 2.108 0.035 *
## shrubcover -0.1883 0.0857 -2.198 0.028 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27.526 on 19 degrees of freedom
## Residual deviance: 18.434 on 18 degrees of freedom
## AIC: 22.434
##
## Number of Fisher Scoring iterations: 5
Look at the differences in this table and the anova table. It’s hard to understand what is happening here and doesn’t provide you with the overall model effects, in most cases.