![]() | Content Disclaimer Copyright @2020. All Rights Reserved. |
Links : Home Index (Subjects) Contact StatsToDo |
Related Links:
This page provides explanations and example R codes for Multinomial Logistic Regression, which is one of the algorithms based on the Generalized Linear Models.
The two terms General Linear Model and Generalized Linear Models have different meanings. General Linear model is an extension of the least square analysis where the dependent variable is Guassian (parametric, normally distributed measurements) is discussed in the General Linear Model Explained Page . Generalized Linear Models is an extension and adaptation of the General Linear Model to include dependent variables that are non-parametric, and includes Binomial Logistic Regression, Multinomial Regression, Ordinal Regression, and Poisson Regression. R uses the function glm for Generalized Linear Models. The Multinomial Logistic Regression is an extension of the Binomial Logistic Regression, as explained in the Binomial Logistic Regression Explained Page . The dependent variable in this case are group names, with more than two groups In R, the independent variables can be measurements or factors
Referenceshttps://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/https://en.wikipedia.org/wiki/Multinomial_logistic_regression
This section presents the R code in total so it can be copied and pasted into RStudio as a template. Detailed explanations for each step are in the next panel
# Multinomial Logistic Regression # Step 1: Data entry myDat = (" Vomit Pulse Temp Diagnosis No 74 102 Appen Yes 68 97 UTI No 78 99 Flu Yes 72 98 UTI No 72 101 Flu Yes 92 96 Appen No 76 100 UTI No 86 100 Appen No 74 98 Flu No 72 99 UTI Yes 84 100 Flu Yes 85 98 Appen No 64 103 Flu No 78 97 UTI Yes 72 99 Appen ") myDataFrame <- read.table(textConnection(myDat),header=TRUE) summary(myDataFrame) # Step 2: Multinomial Logistic Regression #install.packages("nnet") # if not already installed library(nnet) mulLogRegResults <- multinom(formula = Diagnosis ~ Vomit + Pulse + Temp, data = myDataFrame) summary(mulLogRegResults) #display results # Step 3: Place fitted values into data frame myFitted<-fitted.values(mulLogRegResults) myPredict <- predict(mulLogRegResults, myDataFrame) myDataFrame <- cbind(myDataFrame, myFitted, myPredict) myDataFrame with(myDataFrame, table(Diagnosis, myPredict)) #check table of prediction chisq.test(myDataFrame$Diagnosis, myDataFrame$myPredict) # Step 4: Plot fitted values for the 3 ethnic groups # Step 4a: Plot 3 charts #install.Package("ggplot2") # if not already installed library(ggplot2) plot1<-ggplot(myDataFrame, aes(x = myDataFrame$Diagnosis, y = myDataFrame$Appen)) + coord_cartesian(ylim= c(0, 1)) + geom_boxplot() + geom_dotplot(binaxis="y",binwidth=.05,stackdir="center") #fitted Appen probability plot2<-ggplot(myDataFrame, aes(x = myDataFrame$Diagnosis, y = myDataFrame$UTI)) + coord_cartesian(ylim= c(0, 1)) + geom_boxplot() + geom_dotplot(binaxis="y",binwidth=.05,stackdir="center") #fitted Flu probability plot3<-ggplot(myDataFrame, aes(x = myDataFrame$Diagnosis, y = myDataFrame$Flu)) + coord_cartesian(ylim= c(0, 1)) + geom_boxplot() + geom_dotplot(binaxis="y",binwidth=.05,stackdir="center") #fitted UTI probability #Plot1 #plot2 #plot3 # Step 4b: Combine 3 charts into one #install.package("Grid") # only if not already installed library(grid) # Make a list from the ... arguments and plotlist myPlots <- c(list(plot1, plot2, plot3)) #change list to your list of ggplots myCols = 2 #change the number of plot columns per row of plots # The rest of the codes need no change myNumPlots = length(myPlots) layout <- matrix(seq(1, myCols * ceiling(myNumPlots/myCols)), ncol = myCols, nrow = ceiling(myNumPlots/myCols)) grid.newpage() pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout)))) # Make each plot, in the correct location for (i in 1:myNumPlots) { # Get the i,j matrix positions of the regions that contain this subplot matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE)) print(myPlots[[i]], vp = viewport(layout.pos.row = matchidx$row, layout.pos.col = matchidx$col)) } # Step 5 Use the coefficients on a new set of data newDat = (" Vomit Pulse Temp No 74 102 Yes 68 97 No 78 99 ") newDataFrame <- read.table(textConnection(newDat),header=TRUE) summary(newDataFrame) newProbs <- predict(mulLogRegResults, newdata=newDataFrame, type='probs') newProbs newDiag <- predict(mulLogRegResults, newdata=newDataFrame, type='class') newDiag newDataFrame <- cbind(newDataFrame, newProbs, newDiag) newDataFrame #Step 6: Optional saving and loading of coefficients #save(mulLogRegResults, file = "MulLogRegResults.rda") #save results to rda file #load("MulLogRegResults.rda") #load the rda file # Step 1: Data entry to dataframe myDat = (" Vomit Pulse Temp Diagnosis No 74 102 Appen Yes 68 97 UTI No 78 99 Flu Yes 72 98 UTI No 72 101 Flu ... ") myDataFrame <- read.table(textConnection(myDat),header=TRUE) summary(myDataFrame)Step 1 creates the dataframe that contains the data as in myDat. The data is computer generated, and purports to be from a study of admissions from Emergency Department who felt unwell and has abdominal pain. The model is to make a diagnosis between appendicitis (Appen), Urinary Tract Infection (UTI), and the flu (Flu), based on the presence of vomiting (Vomit), and the teperature (Temp) and pulse rate (Pulse). # Step 2: Multinomial Logistic Regression #install.packages("nnet") # if not already installed library(nnet) mulLogRegResults <- multinom(formula = Diagnosis ~ Vomit + Pulse + Temp, data = myDataFrame) summary(mulLogRegResults) #display resultsStep 2 performs the analysis. The results are as follows multinom(formula = Diagnosis ~ Vomit + Pulse + Temp, data = myDataFrame) Coefficients: (Intercept) VomitYes Pulse Temp Flu 21.9547 -1.290127 -0.1272113 -0.1159828 UTI 143.5231 -2.036092 -0.3264960 -1.1910271 Std. Errors: (Intercept) VomitYes Pulse Temp Flu 0.02710540 1.641638 0.1015629 0.07724494 UTI 0.02030991 1.824531 0.1451480 0.10953534 Residual Deviance: 22.26156 AIC: 38.26156The results from Step 2 to focus on are the rows under the heading Coefficients. As Appen is alphabetically the first group it is the reference group, and each row of the formulae produces the Y=log(Odds Ratio) for that diagnosis against that of appendicitis (Appen). The following shows how the calculations are done, using the first row of the data as an example (Vomit=No, Pulse=74, Temp=102)
# Step 3: Place fitted values into data frame myFitted<-fitted.values(mulLogRegResults) myPredict <- predict(mulLogRegResults, myDataFrame) myDataFrame <- cbind(myDataFrame, myFitted, myPredict) myDataFrame with(myDataFrame, table(Diagnosis, myPredict)) #check table of prediction chisq.test(myDataFrame$Diagnosis, myDataFrame$myPredict)In step 3, the fitted.values function creates the 3 columns of probabilities for the 3 outcomes, and the predict function make the decision which outcome is the most likely. The cbind function concatenates these results to the dataframe, which is then displayed. The table function creates the 3x3 table of counts between the original Diagnosis and the estimated myPredict, to test the accuracy of the algorithm. The chisq.test tests that the table is not random. The results are as follows Vomit Pulse Temp Diagnosis Appen Flu UTI myPredict 1 No 74 102 Appen 0.31691544 0.64513376 0.037950792 Flu 2 Yes 68 97 UTI 0.02178191 0.04675669 0.931461399 UTI 3 No 78 99 Flu 0.25714354 0.44566188 0.297194581 Flu .... > with(myDataFrame, table(Diagnosis, myPredict)) #check table of prediction myPredict Diagnosis Appen Flu UTI Appen 3 1 1 Flu 1 3 1 UTI 0 1 4 > chisq.test(myDataFrame$Diagnosis, myDataFrame$myPredict) Pearson's Chi-squared test data: myDataFrame$Diagnosis and myDataFrame$myPredict X-squared = 8.1, df = 4, p-value = 0.08798The dataframe now contains the original columns, plus the 3 columns of probability for each diagnosis, as well as the conclusion which diagnosis is most likely. The table shows the correct prediction along the diagonal, 10 cases out of 15, and 5 cases wrongly assigned. Chi Sq p=0.09 which is statistically insignificantly different from random distribution. This result is not surprising given the small size of sample, and computer generated values. Step 4 is optional, and plots the results to visually demonstrate the results of analysis. It is divided into two parts. Step 4a produces 3 plots, and step 4b combines these 3 plots into a single one for presentation # Step 4a: Plot 3 charts #install.Package("ggplot2") # if not already installed library(ggplot2) plot1<-ggplot(myDataFrame, aes(x = myDataFrame$Diagnosis, y = myDataFrame$Appen)) + coord_cartesian(ylim= c(0, 1)) + geom_boxplot() + geom_dotplot(binaxis="y",binwidth=.05,stackdir="center") #fitted Appen probability plot2<-ggplot(myDataFrame, aes(x = myDataFrame$Diagnosis, y = myDataFrame$UTI)) + coord_cartesian(ylim= c(0, 1)) + geom_boxplot() + geom_dotplot(binaxis="y",binwidth=.05,stackdir="center") #fitted Flu probability plot3<-ggplot(myDataFrame, aes(x = myDataFrame$Diagnosis, y = myDataFrame$Flu)) + coord_cartesian(ylim= c(0, 1)) + geom_boxplot() + geom_dotplot(binaxis="y",binwidth=.05,stackdir="center") #fitted UTI probability #Plot1 #plot2 #plot3Plot1 plots the probability of Appen in the 3 groups of Appen, Flu, and UTI. Plot2 the probability of Flu and plot3 the probability of UTI in the same 3 groups. At this stage, the 3 plots can be separately displayed. The problem is that, in RStudio, each plot over-writes the previous one, so the 3 cannot be visualized at the same time. To do this, we need to do 4b. # Step 4b: Combine 3 charts into one #install.package("Grid") # only if not already installed library(grid) # Make a list from the ... arguments and plotlist myPlots <- c(list(plot1, plot2, plot3)) #change list to your list of ggplots myCols = 2 #change the number of plot columns per row of plots # The rest of the codes need no change myNumPlots = length(myPlots) layout <- matrix(seq(1, myCols * ceiling(myNumPlots/myCols)), ncol = myCols, nrow = ceiling(myNumPlots/myCols)) grid.newpage() pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout)))) # Make each plot, in the correct location for (i in 1:myNumPlots) { # Get the i,j matrix positions of the regions that contain this subplot matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE)) print(myPlots[[i]], vp = viewport(layout.pos.row = matchidx$row, layout.pos.col = matchidx$col)) }
# Step 5 Use the coefficients on a new set of data newDat = (" Vomit Pulse Temp No 74 102 Yes 68 97 No 78 99 ") newDataFrame <- read.table(textConnection(newDat),header=TRUE) summary(newDataFrame) newProbs <- predict(mulLogRegResults, newdata=newDataFrame, type='probs') newProbs newDiag <- predict(mulLogRegResults, newdata=newDataFrame, type='class') newDiag newDataFrame <- cbind(newDataFrame, newProbs, newDiag) newDataFrameStep 5 is optional and for demonstration only. It shows how the results produced can be used to calculate and estimate the value for each case in a different dataset. Step 5 creates a new dataframe using the first 3 rows of the original data. Predict with type='probs' produces the 3 columns of probability, and type='class' the classification. These can then be concatenated to the dataframe, and displayed. Vomit Pulse Temp Appen Flu UTI newDiag 1 No 74 102 0.317 0.645 0.038 Flu 2 Yes 68 97 0.022 0.047 0.931 UTI 3 No 78 99 0.257 0.446 0.297 FluPlease note the following
#Step 6: Optional saving and loading of coefficients #save(mulLogRegResults, file = "MulLogRegResults.rda") #save results to rda file #load("MulLogRegResults.rda") #load the rda fileStep 6 is optional, and in fact commented out, and included as a template only. It allows the results of analysis to be stored as a .rda file, that can be reloaded on a different occasion and used on a different dataset. Please note the following
<
|