Homepage > Statistical Analysis > Generalised linear models (GLM's)

Generalised linear models (GLM's)

So far, we have been using linear models which assume that our response variable is continuous. In earth and life sciences (ecology in particular) we are often working with discrete data, such as count data and binomial (presence/absence) data.

The linear models we have been using so far have been assuming a normal (or gaussian) distribution in our data. Generalised linear models (GLMs) allow us to fit alternative distributions to our data in order to more accurately analyse them.

GLMs do make some important assumptions which we will need to check when we construct the model.
Our binomial (logistic regression) does have some assumptions, but thankfully it is fairly resiliant and we dont need to test them. For any other distribution (poisson, gamma etc.) these are cruical.

It is important to note that part of fitting a GLM is using a link function. I won’t be explaining these in detail (yet), all you need to know is the default link method for binomial data is the logit() method. For more information see ?family.

For this analysis, we will be using the nestpredation.csv dataset

str(nest) # view the structure

## 'data.frame':    20 obs. of  2 variables:
##  $ shrubcover  : int  16 20 11 15 19 31 5 12 9 10 ...
##  $ nestattacked: int  1 0 1 1 1 0 1 1 1 0 ...

# This is okay. Our nest attacked column is an integer, but the glm will tell it to input as binomial so we dont need to change anything. 

nest.bin<-glm(nestattacked~shrubcover, data=nest, family=binomial)

The glm() commands follows the same structure as the lm() and aov() with the inclusion of the extra argument family. Family is where we specify our distribution. In this case, for our logistic regression, we specify a binomial distribution.

Once we have constructed our model, we can use the anova() command and the summary() commands to look at our results. The summary() commands p-values tend to be a little weird, so I prefer to use the anova() command to look at variable significance, and summary() to look at the model equation if I need it.

anova(nest.bin, test="Chisq") # anova test using a chisq instead of F

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: nestattacked
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
## NULL                          19     27.526            
## shrubcover  1   9.0911        18     18.434 0.002569 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There is strong evidence that the probability of a nest being attacked varies with shrubcover (p<0.01).

summary(nest.bin)

## 
## Call:
## glm(formula = nestattacked ~ shrubcover, family = binomial, data = nest)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8424  -0.5183  -0.2135   0.8024   1.5148  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   3.3782     1.6025   2.108    0.035 *
## shrubcover   -0.1883     0.0857  -2.198    0.028 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27.526  on 19  degrees of freedom
## Residual deviance: 18.434  on 18  degrees of freedom
## AIC: 22.434
## 
## Number of Fisher Scoring iterations: 5

Look at the differences in this table and the anova table. It’s hard to understand what is happening here and doesn’t provide you with the overall model effects, in most cases.