This page provides explanations and example R codes for the General Linear Model.
The two terms General Linear Model and Generalized Linear Models have different meanings. General Linear model is an extension of the least square analysis where the dependent variable is Guassian (parametric, normally distributed measurements). Generalized Linear Models is an extension and adaptation of the General Linear Model to include dependent variables that are non-parametric, and includes Binomial Logistic Regression, Multinomial Regression, Ordinal Regression, and Poisson Regression. In this page, the abbreviation GLM refers to the General Linear Model for Gaussian dependent variables
GLM (lm in R) differs from Analysis of Vairance (aov in R) only in the method of calculation. The results obtained are the same except for very minor rounding differences. Analysis of variance uses formulae based on estimation of variances, while GLM uses maximum Likelihood approximations.
The aov procedure produces a table of analysis of variance, where each variable is tested for statistical significance. The lm procedure produces coefficients used to estimate the value of the dependent varible using the values of the independent variables.
The advantage of using the aov and lm procedures in R are flexibility
- If all the independent variables are binary, ordinal, or parametric measurements, the model become that of multiple regression
- If all the variables are factors (group names), then the model becomes the factorial analysis of variance
- If the independent variables are a mixture of measurements and factors, then the model is that of analysis of covariance
References
https://rcompanion.org/rcompanion/e_04.html An R Companion for the Handbook of Biological Statistics by Salvatore S. Mangiafico. Chapter on analysis of covariance
https://en.wikipedia.org/wiki/General_linear_model
https://en.wikipedia.org/wiki/Generalized_linear_model
This section presents the R code in total so it can be copied and pasted into RStudio as a template. Detailed explanations for each step are in the next panel
# General Linear Model.R:
# Multiple Regression, Analysis of Variance and Covariance
# Step 1: Data entry to dataframe
myDat = ("
Sex Ethn Gest BWt
Girl Greek 37 3048
Boy German 36 2813
Girl French 41 3622
Girl Greek 36 2706
Boy German 35 2581
Boy French 39 3442
Girl Greek 40 3453
Boy German 37 3172
Girl French 35 2386
Boy Greek 39 3555
Girl German 37 3029
Boy French 37 3185
Girl Greek 36 2670
Boy German 38 3314
Girl French 41 3596
Boy Greek 38 3312
Girl German 39 3200
Boy French 41 3667
Boy Greek 40 3643
Girl German 38 3212
Girl French 38 3135
Girl Greek 39 3366
")
myDataFrame <- read.table(textConnection(myDat),header=TRUE)
summary(myDataFrame)
# Step 2: Plot Data
par(pin=c(4.2, 3)) # set plotting window to 4.2x3 inches
plot(x = myDataFrame$Gest, # Gest on the x axis
y = myDataFrame$BWt, # BWt on the y axis
col = myDataFrame$Sex:myDataFrame$Ethn, # in 6 colors (sex=2,Ethn=3)
pch = 16, # size of dot
xlab = "Gestation", # x label
ylab = "Birth Weight") # y lable
legend('bottomright', # placelegendon bottomright
legend = levels(myDataFrame$Sex:myDataFrame$Ethn), #all combinations
col = 1:6, # 6 colors for 6groups
cex = 1,
pch = 16)
# Step 3: Analysis of variance testing interactions
testResults<-aov(BWt~Sex+Ethn+Gest+Sex*Ethn*Gest,data=myDataFrame) #Test for interactions
summary(testResults) #show results Analysis of Variance Table
# Step 4: Final analysis of variance without interactions
aovResults<-aov(BWt~Sex+Ethn+Gest,data=myDataFrame) #analysis of covariance
summary(aovResults) #show results Analysis of Variance Table
# Step 5: General Linear Model formula
glmResults<-lm(BWt~Sex+Ethn+Gest,data=myDataFrame) #analysis of covariance
summary(glmResults) #show results Analysis of Variance Table
# Step 6: Check that formula works by calculating values and add to dataframe
myDataFrame$Fitted<-fitted.values(glmResults)
myDataFrame
# Step 7: Plot original against calculated
par(pin=c(4.2, 3)) # set plotting window to 4.2x3 inches
plot(x = myDataFrame$BWt, # Bwt on the x axis
y = myDataFrame$Fitted, # Fit on the y axis
pch = 16, # size of dot
xlab = "Actual Birth Weight", # x label
ylab = "Calculated Birth Weight") # y lable
x_eq <- function(x){x} #curve function y=x
plot(x_eq, 2450, 3700, add=TRUE) #add curve to existing plot
# Step 8: Use model on a different set of data
# Step 8a:
newDat = ("
Sex Ethn Gest
Girl Greek 37
Boy German 36
Girl French 41
Girl Greek 36
")
# create new dataframe from new input data
newDataFrame <- read.table(textConnection(newDat),header=TRUE)
summary(newDataFrame)
#step 8b:
newDataFrame$Model <- predict(glmResults, newdata=newDataFrame, type='response')
newDataFrame
#Step 9: Optional saving and loading of coefficients
#save(glmResults, file = "glmResults.rda") #save results to rda file
#load("glmResults.rda") #load the rda file
# Step 1: Data entry to dataframe
myDat = ("
Sex Ethn Gest BWt
Girl Greek 37 3048
Boy German 36 2813
...
")
myDataFrame <- read.table(textConnection(myDat),header=TRUE)
summary(myDataFrame)
Step 1 creates the dataframe that contains the data as in myDat. The data is computer generated, and purports to be from a study of birthweight. Sex is the sex of the baby, Ethn is the ethnicity of the mother, Gest is the gestational age at birth in completed weeks, and BWt is the birthweight in grams.
The model is to explain and predict the dependent variable birthweight (BWt) using the independent variables sex, Ethn, and Gest.
# Step 2: Plot Data
par(pin=c(4.2, 3)) # set plotting window to 4.2x3 inches
plot(x = myDataFrame$Gest, # Gest on the x axis
y = myDataFrame$BWt, # BWt on the y axis
col = myDataFrame$Sex:myDataFrame$Ethn, # in 6 colors (sex=2,Ethn=3)
pch = 16, # size of dot
xlab = "Gestation", # x label
ylab = "Birth Weight") # y lable
legend('bottomright', # placelegendon bottomright
legend = levels(myDataFrame$Sex:myDataFrame$Ethn), #all combinations
col = 1:6, # 6 colors for 6groups
cex = 1,
pch = 16)
|
Step 2 is optional, and included here to provide a template for plotting. It produces a plot of the data. Please note that the names of variables and labels are from the column names of the data, and need to be changed by the user to conform with his/her data set and names
|
# Step 3: Analysis of variance testing interactions
testResults<-aov(BWt~Sex+Ethn+Gest+Sex*Ethn*Gest,data=myDataFrame) #Test for interactions
summary(testResults) #show results Analysis of Variance Table
Step 3 and 4 are analysis of variance. Step 3 is optional, a test for interactions between the multiple independent variables. Analysis of variance and covariance can only proceed if there is no interaction. If interaction exists then analysis must be carried out with combinations of variables.
Df Sum Sq Mean Sq F value Pr(>F)
Sex 1 122427 122427 21.300 0.000958 ***
Ethn 2 278266 139133 24.206 0.000147 ***
Gest 1 2272527 2272527 395.376 2.27e-09 ***
Sex:Ethn 2 9208 4604 0.801 0.475688
Sex:Gest 1 4668 4668 0.812 0.388680
Ethn:Gest 2 8719 4360 0.759 0.493517
Sex:Ethn:Gest 2 71921 35960 6.256 0.017292 *
Residuals 10 57478 5748
The results of step 3 show significant interaction between Sex, Ethn and Gest. The interpretation is that different combinations of baby's sex and mother's ethnicity have different growth rates. If the data is real, and the sample size sufficiently large, then a more detailed analysis of this interaction is required before proceeding further. As this data is computer generated, the sample size small, and the intention is to demonstrate the procedures, this significant interaction is ignored, and analysis will proceed to step 4
# Step 4: Final analysis of variance without interactions
aovResults<-aov(BWt~Sex+Ethn+Gest,data=myDataFrame) #analysis of covariance
summary(aovResults) #show results Analysis of Variance Table
Step 4 is the aov procedure without the interaction terms, and produces the following results
Df Sum Sq Mean Sq F value Pr(>F)
Sex 1 122427 122427 13.69 0.001776 **
Ethn 2 278266 139133 15.56 0.000144 ***
Gest 1 2272527 2272527 254.18 1.17e-11 ***
Residuals 17 151994 8941
The results of step 4 are in a standard table of Analysis of variance. It shows that all 3 independent variables contributed significantly to vlues of birth weight.
# Step 5: General Linear Model formula
glmResults<-lm(BWt~Sex+Ethn+Gest,data=myDataFrame) # General Linear Model
summary(glmResults) #show results Analysis of Variance Table
Step 5 produces the formula for the General Linear Model. The summary of results are shown as follows
Residuals:
Min 1Q Median 3Q Max
-127.51 -76.71 3.61 75.65 153.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4021.50 467.16 -8.608 1.33e-07 ***
SexGirl -165.93 41.07 -4.040 0.000851 ***
EthnGerman 58.49 54.91 1.065 0.301682
EthnGreek 77.14 49.75 1.551 0.139430
Gest 190.61 11.96 15.943 1.17e-11 ***
The results to focus on is the column under the label Estimates, as this is the formula that models birth weight (BWt) according to Sex, Ethn, and Gestation
- For factors (groups), unless otherwise assigned, the first group in alphabetical order is the reference group. In this example, Boy comes before Girl, so Boy is the reference group and girls are 165.93g less heavy than boys. For Ethnicity French comes before German or Greeks, so is the reference group. German babies are 58.49g and Greek babies 77.14g heavier than French babies
- For measurements, the weight is the measurement x coefficient. This means that babies grows 190.61g per week within the range of this data.
- The calculations are therefore as follows
- A French Boy, at 36 week is -4021.5 + 36(190.61) = 2840.46g
- A girl would weigh 165.93g less
- A German baby would weigh 58.49g more and a Greek baby 77.14g more
- For each week after 36 weeks to 42 weeks, the newborn weighs 190.61g more
However, there is no need to laboriously calculating and estimate bithweight, as R will perform the task, as in Step 6
# Step 6: Check that formula works by calculating values and add to dataframe
myDataFrame$Fitted<-fitted.values(glmResults)
myDataFrame
Step 6 and 7 are optional and for demonstration only. It shows how the results produced can be used to calculate and estimate the value for each case in the dataset. Step 7 calculate the fitted values, add them to the data frame, and displays the appended dataframe.
Sex Ethn Gest BWt Fitted
1 Girl Greek 37 3048 2942.461
2 Boy German 36 2813 2899.123
3 Girl French 41 3622 3627.777
4 Girl Greek 36 2706 2751.846
5 Boy German 35 2581 2708.508
....
# Step 7: Plot original against calculated
par(pin=c(4.2, 3)) # set plotting window to 4.2x3 inches
plot(x = myDataFrame$BWt, # Bwt on the x axis
y = myDataFrame$Fitted, # Fit on the y axis
pch = 16, # size of dot
xlab = "Actual Birth Weight", # x label
ylab = "Calculated Birth Weight") # y lable
x_eq <- function(x){x} #curve function y=x
plot(x_eq, 2450, 3700, add=TRUE) #add curve to existing plot
|
Step 7, also optional, plots the calculated birth weight (Fitted) against the actual birth weight (BWt). The resulting plot is to the left
|
# Step 8: Use model on a different set of data
# Step 8a:
newDat = ("
Sex Ethn Gest
Girl Greek 37
Boy German 36
Girl French 41
Girl Greek 36
")
# create new dataframe from new input data
newDataFrame <- read.table(textConnection(newDat),header=TRUE)
summary(newDataFrame)
#step 8b:
newDataFrame$Model <- predict(glmResults, newdata=newDataFrame, type='response')
newDataFrame
Step 8 demonstrate how the results of analysis (glmResults) can be used to estimate birth weight from a different data set. Step 8a creates a new dataframe using the first 3 rows of data from the original data, and step 8b estimates the birth weight using the formula. Please note the following
- The formula in step 6 cannot be use, as it requires that the dataframe has the same name and has the same number of rows as the original data.
- With the procedure in step 8, a dataframe of different name and different number of rows can be handled. However, the names of the columns used for calculation must still be the same as that from the original dataframe.
#Step 9: Optional saving and loading of coefficients
#save(glmResults, file = "glmResults.rda") #save results to rda file
#load("glmResults.rda") #load the rda file
Step 9 is optional, and in fact commented out, and included as a template only. It allows the results of analysis to be stored as a .rda file, that can be reloaded on a different occasion and used on a different dataset. Please note the following
- R Studio read and write files using the User/Document/ folder. The path needs to be reset if the user wishes to save and load files using a specific folder. Discussion are in the file I/O panel of the R Explained Page
- Once loaded, the conditions for analysis are the same as that in step 8.