# Data Analysis using Hierarchical Generalized Linear Models with R

The book Data Analysis using Hierarchical Generalized Linear Models with R is a collection of examples. The examples include analyses with linear mixed models and generalized linear mixed models, as well as their extensions within the class of models referred to as Hierarchical Generalized Linear Models (HGLMs) and Double HGLMs. All the examples are accompanied with R code. The book guides the reader using several packages with a development from a package with a simple interface to more advanced applications using the packages hglm, dhglm, mdhglm and frailtyHL. So, the reader is introduced bit by bit to the wide range of models covered in the book. The theory is introduced and discussed in connection to the examples.

## Crack growth example

An example capturing the essence of the book is the analysis of the crack growth data. Crack lengths were measured on a compact tension steel from 21 metallic specimens (Hudak et al. 1978). Growth of crack lengths were recorded every 10,000 cycles giving up to 12 repeated measures on each specimen. The response (y) is increment of crack length in inches, there is a baseline effect included in the mean part of the model (crack0) and we will also include the number of cycles divivided by 1,000,000 as covariate for the dispersion part of the model (cycle). The number of different effects included in the model is limited and the example is therefore rather easy to follow. Nevertheless, the example gives a glimpse of the variety of models available within the HGLM and Double HGLM framework.

The data set is included in the dhglm package.

```library(dhglm)
data(data_crack_growth)
summary(data_crack_growth)
```
```##        y               crack0         specimen         cycle
##  Min.   :0.01000   Min.   :0.900   Min.   : 1.00   Min.   :0.01000
##  1st Qu.:0.04000   1st Qu.:0.970   1st Qu.: 6.00   1st Qu.:0.03000
##  Median :0.05000   Median :1.080   Median :11.00   Median :0.06000
##  Mean   :0.05664   Mean   :1.117   Mean   :11.34   Mean   :0.06266
##  3rd Qu.:0.07000   3rd Qu.:1.230   3rd Qu.:16.00   3rd Qu.:0.09000
##  Max.   :0.19000   Max.   :1.580   Max.   :21.00   Max.   :0.12000
```
```interaction.plot(data_crack_growth\$cycle,
data_crack_growth\$specimen, data_crack_growth\$y,
xlab="cycle", ylab="crack growth", col=c(1:10), legend=F)
```

We start by assuming a Gamma distribution for the response (which is a reasonable start since all the response values are positive). A first naive model might be a GLM:

```glm(y ~ crack0, family= Gamma(link=log),
data= data_crack_growth )
```
```##
## Call:  glm(formula = y ~ crack0, family = Gamma(link = log), data = data_crack_growth)
##
## Coefficients:
## (Intercept)       crack0
##      -5.851        2.569
##
## Degrees of Freedom: 240 Total (i.e. Null);  239 Residual
## Null Deviance:	    69.05
## Residual Deviance: 15.89 	AIC: -1416
```

However, this analysis does not account for the longitudinal nature of the data and we can extend the model to a generalized linear mixed model using the hglm package:

```library(hglm)
hglm2(y ~ crack0 + (1|specimen), family= Gamma(link=log),
data= data_crack_growth )
```
```## Call:
## hglm2.formula(meanmodel = y ~ crack0 + (1 | specimen), data = data_crack_growth,
##     family = Gamma(link = log))
##
## ---------------------------
## Estimates of the mean model
## ---------------------------
##
## Fixed effects:
## (Intercept)      crack0
##   -5.647508    2.380398
##
## Random effects:
## (Intercept)| specimen:1 (Intercept)| specimen:2 (Intercept)| specimen:3
##               0.3101382               0.1860307               0.1785982
## ...
## (Intercept)| specimen:20 (Intercept)| specimen:21
##               -0.2847846               -0.3161135
## NOTE: to show all the random effects estimates, use print(hglm.object, print.ranef = TRUE).
##
## Dispersion parameter for the mean model: 0.03629696
##
## Dispersion parameter for the random effects: 0.0340396
##
## Estimation converged in 3 iterations
```

So far we have made standard analyses using a GLM and a GLMM. Now we will move on to HGLMs, which extends the GLMM in two ways: allow other distributions for the random effect other than Gaussian, and dispersion modelling. We will use the model checking plots available in the dhglm package to evaluate the model specifications and thereby come to a model we can rely on.

First, we show how dispersion modelling can be added to the GLMM using the hglm package. We add cycle as covariate for the dispersion parameter (with a default log link). The dispersion parameter for a Gamma distribution is the coefficient-of-variation (CV) squared, so the effect of cycle shows how the CV changes over time in this model.

```hglm2(y ~ crack0 + (1|specimen), family= Gamma(link=log),
data= data_crack_growth, disp= ~cycle )
```
```## Call:
## hglm2.formula(meanmodel = y ~ crack0 + (1 | specimen), data = data_crack_growth,
##     family = Gamma(link = log), disp = ~cycle)
##
## ---------------------------
## Estimates of the mean model
## ---------------------------
##
## Fixed effects:
## (Intercept)      crack0
##   -5.682685    2.410647
##
## Random effects:
## (Intercept)| specimen:1 (Intercept)| specimen:2 (Intercept)| specimen:3
##               0.2918435               0.1686401               0.1691708
## ...
## (Intercept)| specimen:20 (Intercept)| specimen:21
##               -0.2785844               -0.3082014
## NOTE: to show all the random effects estimates, use print(hglm.object, print.ranef = TRUE).
##
## ------------------------------------------
## Estimates of the residual dispersion model
## ------------------------------------------
##
##
## Effects:
## (Intercept)       cycle
##   -2.706502  -10.655624
##
## Dispersion parameter for the random effects: 0.03117587
##
## Estimation converged in 3 iterations
```

Now we move on to the dhglm package. Here, we model the distribution of random effects as inverse gamma. The two parts of the model (mean and dispersion) are specified below with a constant dispersion parameter. Thereafter, these model specifications are included in the fitting algorithm: dhglmfit.

```require(dhglm)
## HGLM I ##
LinPred = y ~ crack0 + (1|specimen),
RandDist = "inverse-gamma")
model_phi <- DHGLMMODELING(Model="dispersion")
res_HGLM1 <- dhglmfit(RespDist="gamma", DataMain=data_crack_growth,
MeanModel=model_mu, DispersionModel=model_phi)
```
```## Distribution of Main Response :
##                          "gamma"
## [1] "Estimates from the model(mu)"
## y ~ crack0 + (1 | specimen)
## [1] "log"
##             Estimate Std. Error t-value
## (Intercept)   -5.678    0.09202  -61.71
## crack0         2.381    0.07334   32.47
## [1] "Estimates for logarithm of lambda=var(u_mu)"
## [1] "inverse-gamma"
##          Estimate Std. Error t-value
## specimen   -3.365     0.3301  -10.19
## [1] "Estimates from the model(phi)"
## phi ~ 1
## <environment: 0x000000000ce8b6e0>
## [1] "log"
##             Estimate Std. Error t-value
## (Intercept)   -3.322     0.0949  -35.01
## [1] "========== Likelihood Function Values and Condition AIC =========="
##                                         [,1]
## -2ML (-2 p_v(mu) (h))          :  -1515.3409
## -2RL (-2 p_beta(mu),v(mu) (h)) :  -1507.4803
## cAIC                           :  -1546.2835
## Scaled Deviance                :    220.6465
## df                             :    220.6465
```

In our second model (HGLM II), cycle is included as covariate in the model for the dispersion.

```## HGLM II ##
LinPred = y ~ crack0 + (1|specimen),
RandDist="inverse-gamma")
model_phi <- DHGLMMODELING(Model = "dispersion", Link = "log",
LinPred = phi ~ cycle)
res_HGLM2 <- dhglmfit(RespDist = "gamma", DataMain = data_crack_growth,
MeanModel = model_mu, DispersionModel = model_phi)
```
```## Distribution of Main Response :
##                          "gamma"
## [1] "Estimates from the model(mu)"
## y ~ crack0 + (1 | specimen)
## [1] "log"
##             Estimate Std. Error t-value
## (Intercept)   -5.711    0.08944  -63.86
## crack0         2.412    0.06825   35.33
## [1] "Estimates for logarithm of lambda=var(u_mu)"
## [1] "inverse-gamma"
##          Estimate Std. Error t-value
## specimen   -3.454     0.3298  -10.47
## [1] "Estimates from the model(phi)"
## phi ~ cycle
## [1] "log"
##             Estimate Std. Error t-value
## (Intercept)   -2.716     0.1995 -13.616
## cycle        -10.594     2.8489  -3.719
## [1] "========== Likelihood Function Values and Condition AIC =========="
##                                         [,1]
## -2ML (-2 p_v(mu) (h))          :  -1528.8586
## -2RL (-2 p_beta(mu),v(mu) (h)) :  -1520.7618
## cAIC                           :  -1560.1893
## Scaled Deviance                :    220.6101
## df                             :    220.6101
```

Finally, we fit a Double HGLM where specimen is included as random effect both in the model for the mean and in the model for the dispersion. As we shall see from the model checking plots, Double HGLMs is a useful class of models enabling robust modelling of longitudinal data. The random effects in the dispersion part captures the majority of potential outliers.

```## DHGLM I ##
LinPred = y ~ crack0 + (1|specimen), RandDist="inverse-gamma")
LinPred = phi ~ cycle + (1|specimen), RandDist="gaussian")
res_DHGLM1 <- dhglmfit(RespDist = "gamma", DataMain = data_crack_growth,
MeanModel = model_mu, DispersionModel = model_phi)
```
```## Distribution of Main Response :
##                          "gamma"
## [1] "Estimates from the model(mu)"
## y ~ crack0 + (1 | specimen)
## [1] "log"
##             Estimate Std. Error t-value
## (Intercept)   -5.654    0.07604  -74.36
## crack0         2.364    0.05442   43.44
## [1] "Estimates for logarithm of lambda=var(u_mu)"
## [1] "inverse-gamma"
##          Estimate Std. Error t-value
## specimen   -3.385     0.3272  -10.34
## [1] "Estimates from the model(phi)"
## phi ~ cycle + (1 | specimen)
## [1] "log"
##             Estimate Std. Error t-value
## (Intercept)   -2.997     0.1977 -15.160
## cycle        -10.229     2.2400  -4.566
## [1] "Estimates for logarithm of var(u_phi)"
##          Estimate Std. Error t-value
## specimen   -1.099     0.2982  -3.684
## [1] "========== Likelihood Function Values and Condition AIC =========="
##                                                          [,1]
## -2ML (-2 p_v(mu),v(phi) (h))          :            -1579.5630
## -2RL (-2 p_beta(mu),v(mu),beta(phi),v(phi) (h)) :  -1571.0711
## cAIC                           :                   -1617.4322
## Scaled Deviance                :                     220.3202
## df                             :                     220.3202
```

A great advantage of using HGLMs and Double HGLMs is that model checking plots, equivalent to those used for GLMs, can be used to check model specifications.

For the model HGLM1, the points in the normal probability plot do not follow the 45-degree line well.

```plotdhglm(res_HGLM1)
```

The normal probability plot is not substantially improved by adding cycle as covariate for the dispersion in HGLM2.

```plotdhglm(res_HGLM2)
```

When specimen is added as random effect in the dispersion model, the normal probability plot looks fine and we finally have a model we can rely on.

```plotdhglm(res_DHGLM1)
```

The results from the model DHGLM1 (above) we can see that the baseline effect of crack0 has a significant positive effect on the mean. Hence, an initially large crack gives faster crack growth. Furthermore, cycle has a significantly negative effect on the dispersion, which means that the CV for crack growth decreases over time (possibly because there is an upper limit to how much the cracks can grow).