This page provides explanations and example R code for Binomial Logistic Regression, one of the algorithms based on the Generalized Linear Models.
The two terms General Linear Model and Generalized Linear Models have different meanings. The General Linear Model is an extension of least squares analysis in which the dependent variable is Gaussian (parametric, normally distributed measurements); it is discussed in the General Linear Model Explained Page. Generalized Linear Models are an extension and adaptation of the General Linear Model to include dependent variables that are non-parametric, and include Binomial Logistic Regression, Multinomial Regression, Ordinal Regression, and Poisson Regression. R uses the function glm for Generalized Linear Models.
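As a rough sketch, the shape of a glm() call is the same across these models, with the family argument selecting the distribution. A minimal, runnable illustration using R's built-in mtcars data (not part of this page's example):
# The family argument selects the model: binomial() gives logistic regression,
# poisson() gives Poisson regression, and so on
glm(am ~ wt + hp, data = mtcars, family = binomial())   # am is a 0/1 outcome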
The Binomial Logistic Regression is commonly referred to as logistic regression, and on this page the abbreviation BinLogReg is used. In this model, the dependent variable is binomial, one of two outcomes: no/yes, true/false, 0/1.
In R, the independent variables can be measurements or factors
- Measurements are numerical, and can be binary (0/1), ordinal, or parametric
- Factors are text, and consist of group names. Unless otherwise assigned, R arranges group names alphabetically and uses the first name as the reference group (see the sketch below)
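A minimal sketch of how factor levels are ordered, and how the reference level can be changed with relevel() (the small data frame here is illustrative, not part of this page's example):
# Illustrative only: factor levels default to alphabetical order
d <- data.frame(HighBP = c("No", "Yes", "Yes", "No"), stringsAsFactors = TRUE)
levels(d$HighBP)                            # "No" "Yes" - "No" is the reference
d$HighBP <- relevel(d$HighBP, ref = "Yes")  # make "Yes" the reference instead
levels(d$HighBP)                            # "Yes" "No"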
The formula obtained is y = intercept + b1x1 + b2x2 + ..., where each coefficient b is also the log(odds ratio) against the reference value. The reference values are 0 for measurements and the reference group for factors. y is then the log odds of the outcome against the reference outcome (the outcome name that is alphabetically first).
The probability of the outcome is then calculated by the logistic transform p = 1 / (1 + exp(-y))
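As a quick check, this transform is available in base R as plogis(); a minimal sketch:
# The logistic transform p = 1 / (1 + exp(-y))
y <- -0.2118            # an example log odds value (see the worked example below)
1 / (1 + exp(-y))       # about 0.447
plogis(y)               # same result using the built-in function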
This section presents the R code in full so it can be copied and pasted into RStudio as a template. Detailed explanations for each step are in the next panel.
# Binomial Logistic Regression
# Step 1: Data entry
myDat = ("
Parity Diabetes HighBP Height CS
Multipara 1_Mild No 157 No
Nullipara 0_No Yes 157 No
Multipara 0_No No 153 Yes
Multipara 2_Severe No 165 No
Nullipara 0_No Yes 158 Yes
Nullipara 0_No No 158 Yes
Multipara 1_Mild No 151 Yes
Nullipara 0_No Yes 157 No
Multipara 0_No No 163 Yes
Nullipara 2_Severe No 163 Yes
Multipara 0_No Yes 162 No
Nullipara 0_No No 165 No
Multipara 1_Mild No 159 No
Nullipara 1_Mild Yes 159 Yes
Multipara 0_No No 157 Yes
Nullipara 2_Severe No 161 Yes
Multipara 0_No Yes 156 No
Nullipara 0_No No 159 Yes
Nullipara 1_Mild No 160 No
Multipara 0_No Yes 160 No
Multipara 0_No No 158 No
Multipara 2_Severe No 160 Yes
")
myDataFrame <- read.table(textConnection(myDat),header=TRUE,stringsAsFactors=TRUE) # text columns as factors (needed in R >= 4.0)
summary(myDataFrame)
#Step 2: Binomial Logistic Regression
binLogRegResults <- glm(CS~Parity+Diabetes+HighBP+Height,data=myDataFrame,family=binomial())
summary(binLogRegResults)
#Step 3: Calculate probability of outcome using data
myDataFrame$Fit<-fitted.values(binLogRegResults) #add predicted probability to dataframe
myDataFrame #optional display of dataframe
#Step 4: Plot box and dots of Fitted Probabilities
#install.packages("ggplot2") # if not already installed
library(ggplot2)
ggplot(myDataFrame, aes(x = CS, y = Fit)) +
geom_boxplot() +
geom_dotplot(binaxis="y",binwidth=.03,stackdir="center")
#Step 5: Test results on a different set of data
newDat = ("
Parity Diabetes HighBP Height
Multipara 1_Mild No 157
Nullipara 0_No Yes 157
Multipara 0_No No 153
Multipara 2_Severe No 165
")
newDataFrame <- read.table(textConnection(newDat),header=TRUE)
summary(newDataFrame)
newDataFrame$Model <- predict(binLogRegResults, newdata=newDataFrame, type='response')
newDataFrame
#Step 6: Optional saving and loading of coefficients
#save(binLogRegResults, file = "binLogRegResults.rda") #save results to rda file
#load("binLogRegResults.rda") #load the rda file
# Step 1: Data entry to dataframe
myDat = ("
Parity Diabetes HighBP Height CS
Multipara 1_Mild No 157 No
Nullipara 0_No Yes 157 No
Multipara 0_No No 153 Yes
...
")
myDataFrame <- read.table(textConnection(myDat),header=TRUE,stringsAsFactors=TRUE) # text columns as factors (needed in R >= 4.0)
summary(myDataFrame)
Step 1 creates the dataframe that contains the data as in myDat. The data is computer generated, and purports to be from a study of childbirth.
- Parity is previous pregnancies: Multipara are those who have had a baby before, and Nullipara are those having their first baby
- Diabetes is divided into no, mild, and severe, prefixed with a number so that 0_No is the reference group
- HighBP is hypertension, No/Yes. As No is alphabetically first, it is the reference group
- Height of the mother is as measured in cm
- CS is Caesarean Section. No is alphabetically first and so is the reference group. CS is the dependent variable, so the model estimates the probability of Caesarean Section.
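The factor ordering can be confirmed after Step 1 with levels(); a quick check, assuming the text columns were read in as factors as in Step 1 above:
# Confirm the reference (first) level of each factor
levels(myDataFrame$CS)        # "No" "Yes" - "No" is the reference outcome
levels(myDataFrame$Diabetes)  # "0_No" "1_Mild" "2_Severe" - "0_No" is the reference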
#Step 2: Binomial Logistic Regression
binLogRegResults <- glm(CS~Parity+Diabetes+HighBP+Height,data=myDataFrame,family=binomial())
summary(binLogRegResults)
Step 2 performs the analysis.
Please note that all the programs of the Generalized Linear Models use the R function glm. The differences are in the names of the variables and in the family, which in this case is binomial. The results are as follows
Call:
glm(formula = CS ~ Parity + Diabetes + HighBP + Height, family = binomial(),
data = myDataFrame)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.25281 -0.89430 0.05138 0.62860 1.98546
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 60.2918 32.4084 1.860 0.0628 .
ParityNullipara 1.7912 1.2301 1.456 0.1454
Diabetes1_Mild -1.1484 1.3376 -0.859 0.3906
Diabetes2_Severe 2.0643 1.8111 1.140 0.2544
HighBPYes -2.0810 1.3889 -1.498 0.1340
Height -0.3811 0.2049 -1.860 0.0629 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 30.498 on 21 degrees of freedom
Residual deviance: 22.015 on 16 degrees of freedom
AIC: 34.015
Number of Fisher Scoring iterations: 5
The results of Step 2 show that none of the independent variables in this analysis has a statistically significant (p<0.05) influence on the outcome. This is not surprising, as the data is computer generated and the sample size very small.
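Because each coefficient is a log(odds ratio), the odds ratios themselves can be obtained by exponentiating the estimates; a minimal sketch:
# Odds ratios: exponentiate the coefficients from the fitted model
exp(coef(binLogRegResults))        # e.g. exp(1.7912) = 6.0 for ParityNullipara
# exp(confint(binLogRegResults))   # optional: 95% confidence intervals (profile likelihood)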
The results to focus on are in the Estimate column, which contains the coefficients. As an example, we will use the fifth row of the data to demonstrate the math
- Calculation begins with y = intercept = 60.2918
- With Parity=Nullipara, y = 60.2918 + 1.7912
- With Diabetes=0_No, y = 60.2918 + 1.7912 + 0
- With HighBP=Yes, y = 60.2918 + 1.7912 + 0 - 2.0810
- With Height = 158, y = 60.2918 + 1.7912 + 0 - 2.0810 - 0.3811(158)
- y = 60.2918 + 1.7912 + 0 - 2.0810 - 0.3811(158) = -0.2118 which is the log(Odds Ratio) for Caesarean Section (CS:Yes) for this row
- The probability of CS:Yes is p = Logistic(y) = 1 / (1+exp(-y)) = 0.45
These manual calculations are unnecessary, as R will perform this for you
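Should you wish to verify the arithmetic, the coefficients can be extracted from the fitted model with coef() and the logistic transform applied with plogis(); a minimal sketch for row 5:
# Reproduce the worked example for row 5 (Nullipara, no diabetes, HighBP Yes, height 158)
b <- coef(binLogRegResults)                            # named vector of estimates
y <- b["(Intercept)"] + b["ParityNullipara"] + b["HighBPYes"] + b["Height"] * 158
unname(plogis(y))                                      # about 0.45, matching Fit for row 5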
#Step 3: Calculate probability of outcome using data
myDataFrame$Fit<-fitted.values(binLogRegResults) #add predicted probability to dataframe
myDataFrame #optional display of dataframe
Step 3 calculates the probability of CS for each row of the data and appends it to the dataframe. When printed, the results are as follows
Parity Diabetes HighBP Height CS Fit
1 Multipara 1_Mild No 157 No 0.33558017
2 Nullipara 0_No Yes 157 No 0.54377606
3 Multipara 0_No No 153 Yes 0.87970051
4 Multipara 2_Severe No 165 No 0.37311861
5 Nullipara 0_No Yes 158 Yes 0.44880349
.....
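The fitted values are the same as those produced by predict() on the original data, as used in Step 5; a quick check:
# fitted.values() and predict(type = "response") agree on the original data
all.equal(unname(fitted.values(binLogRegResults)),
          unname(predict(binLogRegResults, type = "response")))  # TRUE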
#Step 4: Plot box and dots of Fitted Probabilities
#install.packages("ggplot2") # if not already installed
library(ggplot2)
ggplot(myDataFrame, aes(x = CS, y = Fit)) +
geom_boxplot() +
geom_dotplot(binaxis="y",binwidth=.03,stackdir="center")
Step 4 is optional, a template for a box plot with overlaid dots of the fitted probabilities in the two outcome groups. The data points are circles, the horizontal line is the median, the box covers the interquartile (25th to 75th percentile) range, and the whiskers extend to the most extreme values within 1.5 times that range.
#Step 5: Test results on a different set of data
newDat = ("
Parity Diabetes HighBP Height
Multipara 1_Mild No 157
Nullipara 0_No Yes 157
Multipara 0_No No 153
Multipara 2_Severe No 165
")
newDataFrame <- read.table(textConnection(newDat),header=TRUE)
summary(newDataFrame)
newDataFrame$Model <- predict(binLogRegResults, newdata=newDataFrame, type='response')
newDataFrame
Step 5 is optional and for demonstration only. It shows how the results produced can be used to calculate an estimated probability of the outcome for each case in a dataset. It creates a new dataframe using the first 4 rows of the original data, calculates the fitted values, adds them to the dataframe, and displays the appended dataframe.
Parity Diabetes HighBP Height Model
1 Multipara 1_Mild No 157 0.34
2 Nullipara 0_No Yes 157 0.54
3 Multipara 0_No No 153 0.88
4 Multipara 2_Severe No 165 0.37
Please note the following
- The approach in Step 3 (fitted.values) cannot be used here, as it requires that the dataframe has the same name and the same number of rows as the original data.
- With the procedure in Step 5, a dataframe with a different name and a different number of rows can be handled. However, the names of the columns used for the calculation must still be the same as those in the original dataframe (see the sketch below).
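For example, a single hypothetical case can be scored, provided the column names match those used to fit the model (the values below are made up for illustration):
# Score one hypothetical new case; column names must match the model's predictors
oneCase <- data.frame(Parity = "Nullipara", Diabetes = "2_Severe",
                      HighBP = "Yes", Height = 155)
predict(binLogRegResults, newdata = oneCase, type = "response")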
#Step 6: Optional saving and loading of coefficients
#save(binLogRegResults, file = "binLogRegResults.rda") #save results to rda file
#load("binLogRegResults.rda") #load the rda file
Step 6 is optional, and in fact commented out, included as a template only. It allows the results of the analysis to be stored in an .rda file that can be reloaded on a different occasion and used on a different dataset. Please note the following
- RStudio reads and writes files using the default User/Documents/ folder. The path needs to be reset if the user wishes to save and load files in a specific folder, as discussed in the file I/O panel of the R Explained Page (see the sketch below)
- Once loaded, the conditions for analysis are the same as those in Step 5.
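A minimal sketch of redirecting the save and load to a chosen folder (the path shown is a placeholder, not a path assumed by this page):
# Illustrative only: set the working folder before saving or loading the .rda file
# setwd("C:/MyProject")                                  # placeholder path
# save(binLogRegResults, file = "binLogRegResults.rda")  # writes into that folder
# load("binLogRegResults.rda")                           # restores binLogRegResults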