Content Disclaimer Copyright @2020. All Rights Reserved. 
Links : Home Index (Subjects) Contact StatsToDo 
Related Links:
This page provides explanations and example R codes for Negative Binomial Regression, which is one of the algorithms based on the Generalized Linear Models, but the dependent variable is a discrete positive integer with a negative binomial.
Negative Binomial distribution is the reverse of the binomial distribution. Instead of stating a proportion, the statement is the number of one group is counted before a stated number of the other group is found. In Negative Binomial Regression, the algorithm is based on the number of one group found (failure, negative, 0, false) before finding each case of the other group (success, positive, 1, true), mostly counts of events or number of units. An examples is, instead of stating that the caesarean section rate is 20%, the Negative Binomial statement is having 5 vaginal deliveries for each Caesarean Section seen. Because Negative Binomial distribution is less rigid than Poisson distribution, it is often adapted to analyse any measurements of discrete positive value where the Poisson distribution cannot be assured. The model Negative Binomial regression in the example on this page uses the same data as that in the Poisson Regression Explained Page . The results, other than minor rounding differences, are almost identical to the analysis assuming Poisson distribution. Negative Binomial regression, similar to other Generalized Linear Models, is conceptually and procedurally simple and easy to understand. The contraint however is that the count in the modelling data must be a positive integer, a value >0. When the expected count is low however, the count in some of the reference data may be zero, and this distorts the model. The remedy for this problem is to include a Zero Inflated Model in the algorithm, and this is explained in the panel Example Explained Referenceshttps://en.wikipedia.org/wiki/Negative_binomial_distributionhttps://stats.idre.ucla.edu/r/dae/negativebinomialregression/
This section presents the R code in total so it can be copied and pasted into RStudio as a template. Detailed explanations for each step are in the next panel
# NegBinRegression.R : Negative Binomial Regression # Step 1: Data entry myDat = (" Time Day BLY Births PM Wday 4.5 6 PM Wend 5.6 4 Night Wday 6.0 3 PM Wday 5.9 8 Night Wday 5.2 3 AM Wend 5.8 4 Night Wend 5.9 7 AM Wend 5.0 0 Night Wday 5.5 4 Night Wday 4.1 6 PM Wday 5.4 3 Night Wday 4.2 5 PM Wday 4.7 3 AM Wday 4.4 4 AM Wend 4.7 2 Night Wend 4.8 5 PM Wend 4.8 6 PM Wday 4.5 5 AM Wday 5.0 9 ") myDataFrame < read.table(textConnection(myDat),header=TRUE) summary(myDataFrame) #Step 2: Negative Binomial Regression #Step 2a. using glm #install.packages(MASS) library(MASS) glmResults<glm.nb(formula = Births ~ Time + Day + BLY, data = myDataFrame) summary(glmResults) glmCount<fitted.values(glmResults) #count = exp(y) myDataFrame < cbind(myDataFrame, glmCount) #append glmCount to myDataFrame myDataFrame #Step 2b. Zeroinflation #install.packages("pscl") # only if not previously installed library(pscl) ziResults<zeroinfl(Births ~ Time + Day + BLY, dist="negbin", data = myDataFrame) #zero inlation model summary(ziResults) ziCount < predict(ziResults, myDataFrame) #zero inflated count myDataFrame < cbind(myDataFrame,ziCount) #concatenate calculated count to myDataFrame myDataFrame vuong(glmResults, ziResults)# comparing the two models cor.test( ~ as.numeric(Births) + as.numeric(glmCount), data=myDataFrame, method = "spearman",exact=FALSE) cor.test( ~ as.numeric(Births) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE) cor.test( ~ as.numeric(glmCount) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE) #Step 3: Plot fitted values old.par < par(mfrow=c(2,2)) par(pin=c(3, 2)) # set plotting window to 4.2x3 inches plot(x = myDataFrame$Births, # Births on the x axis y = myDataFrame$glmCount, # glmCount on the y axis xlim = c(min(myDataFrame$Births),max(myDataFrame$Births)), ylim = c(min(myDataFrame$glmCount),max(myDataFrame$glmCount)), pch = 16) # size of dot plot(x = myDataFrame$Births, # Births on the x axis y = myDataFrame$ziCount, # ziCount on the y axis xlim = c(min(myDataFrame$Births),max(myDataFrame$Births)), ylim = c(min(myDataFrame$ziCount),max(myDataFrame$ziCount)), pch = 16) # size of dot plot(x = myDataFrame$glmCount, # Births on the x axis y = myDataFrame$ziCount, # ziCount on the y axis xlim = c(min(myDataFrame$glmCount),max(myDataFrame$glmCount)), ylim = c(min(myDataFrame$ziCount),max(myDataFrame$ziCount)), pch = 16) # size of dot par(old.par) # Step 4: Test coefficients on new data set newDat = (" Time Day BLY PM Wday 4.5 Night Wday 6.0 AM Wend 5.8 ") newDataFrame < read.table(textConnection(newDat),header=TRUE) summary(newDataFrame) newGlmCount < predict(glmResults, newdata=newDataFrame, type='response') newZiCount < predict(ziResults, newdata=newDataFrame, type='response') newDataFrame < cbind(newDataFrame, newGlmCount, newZiCount) newDataFrame #Step 5: Optional saving and loading of coefficients #save(glmResults, file = "NegBinGlmResults.rda") #save glm results to rda file #load("NegBinGlmResults.rda") #load the glm rda file #save(ziResults, file = "NegBinZiResults.rda") #save zi results to rda file #load("NegBinZiResults.rda") #load the zi rda file # Step 1: Data entry myDat = (" Time Day BLY Births PM Wday 4.5 6 PM Wend 5.6 4 Night Wday 6.0 3 .... ") myDataFrame < read.table(textConnection(myDat),header=TRUE) summary(myDataFrame)Step 1 creates the dataframe that contains the data as in myDat. The data is computer generated, and purports to be from a study of workload in the labour ward. Time is the time of the 3 shifts, AM for morning, PM for afternoon, and Night. Day is for day of the week, being Weekday or Weekend, BLY is the number of births last year in thousands. The model is to make a estimate the number of births per shift (Births) from these 3 parameters, assuming that Births is a count and conforms to the Negative Binomial distribution
#Step 2: Negative Binomial Regression #Step 2a. using glm #install.packages(MASS) library(MASS) glmResults<glm.nb(formula = Births ~ Time + Day + BLY, data = myDataFrame) glmCount<fitted.values(glmResults) #count = exp(y) summary(glmResults) myDataFrame < cbind(myDataFrame, glmCount) #append glmCount to myDataFrame myDataFrameStep 2a performs the analysis using the glm model. The results are as follows glm.nb(formula = Births ~ Time + Day + BLY, data = myDataFrame, init.theta = 54599.73884, link = log) Deviance Residuals: Min 1Q Median 3Q Max 2.66322 0.91210 0.05635 0.47380 2.03454 Coefficients: Estimate Std. Error z value Pr(>z) (Intercept) 1.27840 0.92893 1.376 0.169 TimeNight 0.15967 0.30042 0.531 0.595 TimePM 0.21978 0.29647 0.741 0.458 DayWend 0.16692 0.24589 0.679 0.497 BLY 0.03089 0.18594 0.166 0.868 (Dispersion parameter for Negative Binomial(54599.74) family taken to be 1) Null deviance: 21.614 on 18 degrees of freedom Residual deviance: 20.151 on 14 degrees of freedom AIC: 93.01 Number of Fisher Scoring iterations: 1The results from Step 2a to focus on are the rows under the heading Estimate, as this is the formula for calculating the log(count). We will use the first row of data (Time=PM, Day=Wday, BLY=4.5) to demonstrate how this is done.
Time Day BLY Births glmCount 1 PM Wday 4.5 6 5.140811 2 PM Wend 5.6 4 4.500898 3 Night Wday 6.0 3 5.070469 4 PM Wday 5.9 8 5.368042 .... #install.packages("pscl") # only if not previously installed library(pscl) ziResults<zeroinfl(Births ~ Time + Day + BLY, dist="negbin", data = myDataFrame) #zero inlation model summary(ziResults) ziCount < predict(ziResults, myDataFrame) #zero inflated count myDataFrame < cbind(myDataFrame,ziCount) #concatenate calculated count to myDataFrame myDataFrame vuong(glmResults, ziResults)# comparing the two models cor.test( ~ as.numeric(Births) + as.numeric(glmCount), data=myDataFrame, method = "spearman",exact=FALSE) cor.test( ~ as.numeric(Births) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE) cor.test( ~ as.numeric(glmCount) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE)Step 2b performs the zero inflated analysis, produces the output count accordingly, and append the results to the dataframe. The Vuong test compares the counts produced by the general linear model (glm) and zero inflated model (zi) and display whether the diffrence is significant. The 3 Spearman correlation coefficient assess the correlation between pairs of Births, glmCount, and ziCount. The results are as follow zeroinfl(formula = Births ~ Time + Day + BLY, data = myDataFrame, dist = "negbin") Pearson residuals: Min 1Q Median 3Q Max 1.05117 0.59996 0.01863 0.49214 1.87321 Count model coefficients (negbin with log link): Estimate Std. Error z value Pr(>z) (Intercept) 1.51331 0.92356 1.639 0.101 TimeNight 0.01727 0.29613 0.058 0.953 TimePM 0.04216 0.29228 0.144 0.885 DayWend 0.05730 0.24285 0.236 0.813 BLY 0.01385 0.18458 0.075 0.940 Log(theta) 12.88701 167.64480 0.077 0.939 Zeroinflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>z) (Intercept) 14.445 18942.295 0.001 0.999 TimeNight 20.209 25664.897 0.001 0.999 TimePM 20.310 25826.329 0.001 0.999 DayWend 20.276 18942.290 0.001 0.999 BLY 1.281 3.020 0.424 0.672 Theta = 395147.6678 Number of iterations in BFGS optimization: 24 Loglikelihood: 38.26 on 11 Df > ziCount < predict(ziResults, myDataFrame) #zero inflated count > myDataFrame < cbind(myDataFrame,ziCount) #concatenate calculated count to myDataFrame > myDataFrame Time Day BLY Births glmCount ziCount 1 PM Wday 4.5 6 5.140811 5.041842 2 PM Wend 5.6 4 4.500898 4.834124 3 Night Wday 6.0 3 5.070469 4.850651 4 PM Wday 5.9 8 5.368042 5.140529 .... > > vuong(glmResults, ziResults)# comparing the two models Vuong NonNested Hypothesis TestStatistic: (teststatistic is asymptotically distributed N(0,1) under the null that the models are indistinguishible)  Vuong zstatistic H_A pvalue Raw 0.7669298 model2 > model1 0.22156 AICcorrected 0.9437815 model1 > model2 0.17264 BICcorrected 1.7516127 model1 > model2 0.03992 > > cor.test( ~ as.numeric(Births) + as.numeric(glmCount), data=myDataFrame, method = "spearman",exact=FALSE) Spearman's rank correlation rho data: as.numeric(Births) and as.numeric(glmCount) S = 976.91, pvalue = 0.559 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.1430578 > cor.test( ~ as.numeric(Births) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE) Spearman's rank correlation rho data: as.numeric(Births) and as.numeric(ziCount) S = 896.89, pvalue = 0.3807 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.2132539 > cor.test( ~ as.numeric(glmCount) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE) Spearman's rank correlation rho data: as.numeric(glmCount) and as.numeric(ziCount) S = 178.16, pvalue = 5.647e06 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.8437226The results of the zero inflated analysis presents coefficients for combined Negative Binomial and Binmomial distributions to be used in estimating the counts. The method of calculation using these coefficients is by numerical approximation, and too complex to be explained here. Those interested in numerical methods should consult the references provided. The results (ziCount) are then compared with those from the generalized linear model (glmCount) using the Voung test. The test producedby the basic calculation is labelled Raw, AIC has a correction by Akaike's information criterion, and BIC has a correction by the Bayesian information criterion. In practice they are different methods of calculating the same thing. One way of interpreting the 3 results is to assume that the two models are different if any of the 3 comparisons is statistically significant, as is the case with the current example data. The 3 Spearman's correlations test how closely the two estimates related to the original outcome, and to each other. The results shows that both performed poorly with the reference outcome (Births:glmCount, ρ=0.14; Births:ziCount, ρ=0.21; both not statistically significant). The two estimates however are closely but not perfectly correlated to each other (glmCount:ziCount, ρ=0.84, p<0.0001). Such poor results are not surprising, given that the data is randomly generated by the computer and the sample size very small. It merely serves to demonstrate the numerical methods. #Step 3: Plot fitted values old.par < par(mfrow=c(2,2)) par(pin=c(3, 2)) # set plotting window to 4.2x3 inches plot(x = myDataFrame$Births, # Births on the x axis y = myDataFrame$glmCount, # glmCount on the y axis xlim = c(min(myDataFrame$Births),max(myDataFrame$Births)), ylim = c(min(myDataFrame$glmCount),max(myDataFrame$glmCount)), pch = 16) # size of dot plot(x = myDataFrame$Births, # Births on the x axis y = myDataFrame$ziCount, # ziCount on the y axis xlim = c(min(myDataFrame$Births),max(myDataFrame$Births)), ylim = c(min(myDataFrame$ziCount),max(myDataFrame$ziCount)), pch = 16) # size of dot plot(x = myDataFrame$glmCount, # Births on the x axis y = myDataFrame$ziCount, # ziCount on the y axis xlim = c(min(myDataFrame$glmCount),max(myDataFrame$glmCount)), ylim = c(min(myDataFrame$ziCount),max(myDataFrame$ziCount)), pch = 16) # size of dot par(old.par)
# Step 4: Test coefficients on new data set newDat = (" Time Day BLY PM Wday 4.5 Night Wday 6.0 AM Wend 5.8 ") newDataFrame < read.table(textConnection(newDat),header=TRUE) summary(newDataFrame) newGlmCount < predict(glmResults, newdata=newDataFrame, type='response') newZiCount < predict(ziResults, newdata=newDataFrame, type='response') newDataFrame < cbind(newDataFrame, newGlmCount, newZiCount) newDataFrameStep 4 is optional and for demonstration only. It shows how the results produced can be used to calculate and estimate the value for each case in a different dataset. Step 4 creates a new dataframe using the first 3 rows of the original data. Predict with type='response' to produce both the glmCount and ziCount. These can then be concatenated to the dataframe, and displayed. Please note that, in practice, only one of the counts need to be estimated, as by this stage a decision whether glm or zi should be used has been made. The results are presented as follows Time Day BLY newGlmCount newZiCount 1 PM Wday 4.5 5.140811 5.041842 2 Night Wday 6.0 5.070469 4.850651 3 AM Wend 5.8 3.635224 3.864226Please note the following
#Step 5: Optional saving and loading of coefficients #save(glmResults, file = "NegBinGlmResults.rda") #save glm results to rda file #load("NegBinGlmResults.rda") #load the glm rda file #save(ziResults, file = "NegBinZiResults.rda") #save zi results to rda file #load("NegBinZiResults.rda") #load the zi rda fileStep 5 is optional, and in fact commented out, and included as a template only. It allows the results of analysis to be stored as a .rda file, that can be reloaded on a different occasion and used on a different dataset. Please note the following
<
