![]() | Content Disclaimer Copyright @2020. All Rights Reserved. |
Links : Home Index (Subjects) Contact StatsToDo |
Related Links:
This page provides explanations and example R codes for Poisson Regression, which is one of the algorithms based on the Generalized Linear Models, but the dependent variable is a discrete positive integer with Poisson distribution, mostly counts of events or number of units. Examples are the number of cells seen in a field under the microscope, the number of asthma attacks per 100 child year, the number of pregnancies per 100 women year using a particular form of contraception.
The Poisson Regression is a powerful analytical tool, providing that the data conforms to the Poisson distribution. When there is doubt about the distribution, the less powerful methods with less rigorous assumptions should be used. These are the Ordinal Regression as explained in the Ordinal Logistic Regression Explained Page , or the Negative Binomial Regression, as explained in the Negative Binomial Regression Explained Page Poisson regression, similar to other Generalized Linear Models, is conceptually and procedurally simple and easy to understand. The contraint however is that the count in the modelling data must be a positive integer, a value >0. When the expected count is low however, the count in some of the reference data may be zero, and this distorts the model. The remedy for this problem is to include a Zero Inflated Model in the algorithm, and this is explained in the panel Example Explained Referenceshttps://stats.idre.ucla.edu/r/dae/poisson-regression/ https://stats.idre.ucla.edu/r/dae/zip/ https://en.wikipedia.org/wiki/Zero-inflated_model https://en.wikipedia.org/wiki/Vuong%27s_closeness_test https://www.rdocumentation.org/packages/mpath/versions/0.1-20/topics/vuong.test
This section presents the R code in total so it can be copied and pasted into RStudio as a template. Detailed explanations for each step are in the next panel
# PoissonRegression.R # Step 1: Data entry myDat = (" Time Day BLY Births PM Wday 4.5 6 PM Wend 5.6 4 Night Wday 6.0 3 PM Wday 5.9 8 Night Wday 5.2 3 AM Wend 5.8 4 Night Wend 5.9 7 AM Wend 5.0 0 Night Wday 5.5 4 Night Wday 4.1 6 PM Wday 5.4 3 Night Wday 4.2 5 PM Wday 4.7 3 AM Wday 4.4 4 AM Wend 4.7 2 Night Wend 4.8 5 PM Wend 4.8 6 PM Wday 4.5 5 AM Wday 5.0 9 ") myDataFrame <- read.table(textConnection(myDat),header=TRUE) summary(myDataFrame) #Step 2: Poisson Regression #Step 2a. using glm glmResults<-glm(formula = Births ~ Time + Day + BLY, data = myDataFrame, family=poisson()) summary(glmResults) #display results glmCount<-fitted.values(glmResults) #count = exp(y) myDataFrame <- cbind(myDataFrame, glmCount) #append glmCount to myDataFrame myDataFrame #Step 2b. Zero-inflation #install.packages("pscl") # only if not previously installed library(pscl) ziResults<-zeroinfl(Births ~ Time + Day + BLY, dist="poisson", data = myDataFrame) #zero inlation model summary(ziResults) ziCount <- predict(ziResults, myDataFrame) #zero inflated count myDataFrame <- cbind(myDataFrame,ziCount) #concatenate calculated count to myDataFrame vuong(glmResults, ziResults)# comparing the two models cor.test( ~ as.numeric(Births) + as.numeric(glmCount), data=myDataFrame, method = "spearman",exact=FALSE) cor.test( ~ as.numeric(Births) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE) cor.test( ~ as.numeric(glmCount) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE) #Step 3: Plot fitted values old.par <- par(mfrow=c(2,2)) par(pin=c(3, 2)) # set plotting window to 4.2x3 inches plot(x = myDataFrame$Births, # Births on the x axis y = myDataFrame$glmCount, # glmCount on the y axis xlim = c(min(myDataFrame$Births),max(myDataFrame$Births)), ylim = c(min(myDataFrame$glmCount),max(myDataFrame$glmCount)), pch = 16) # size of dot plot(x = myDataFrame$Births, # Births on the x axis y = myDataFrame$ziCount, # ziCount on the y axis xlim = c(min(myDataFrame$Births),max(myDataFrame$Births)), ylim = c(min(myDataFrame$ziCount),max(myDataFrame$ziCount)), pch = 16) # size of dot plot(x = myDataFrame$glmCount, # Births on the x axis y = myDataFrame$ziCount, # ziCount on the y axis xlim = c(min(myDataFrame$glmCount),max(myDataFrame$glmCount)), ylim = c(min(myDataFrame$ziCount),max(myDataFrame$ziCount)), pch = 16) # size of dot par(old.par) # Step 4: Test coefficients on new data set newDat = (" Time Day BLY PM Wday 4.5 Night Wday 6.0 AM Wend 5.8 ") newDataFrame <- read.table(textConnection(newDat),header=TRUE) summary(newDataFrame) newGlmCount <- predict(glmResults, newdata=newDataFrame, type='response') newZiCount <- predict(ziResults, newdata=newDataFrame, type='response') newDataFrame <- cbind(newDataFrame, newGlmCount, newZiCount) newDataFrame #Step 5: Optional saving and loading of coefficients #save(glmResults, file = "PoissonGlmResults.rda") #save glm results to rda file #load("PoissonGlmResults.rda") #load the glm rda file #save(ziResults, file = "PoissonZiResults.rda") #save zi results to rda file #load("PoissonZiResults.rda") #load the zi rda file # Step 1: Data entry myDat = (" Time Day BLY Births PM Wday 4.5 6 PM Wend 5.6 4 Night Wday 6.0 3 .... ") myDataFrame <- read.table(textConnection(myDat),header=TRUE) summary(myDataFrame)Step 1 creates the dataframe that contains the data as in myDat. The data is computer generated, and purports to be from a study of workload in the labour ward. Time is the time of the 3 shifts, AM for morning, PM for afternoon, and Night. Day is for day of the week, being Weekday or Weekend, BLY is the number of births last year in thousands. The model is to make a estimate the number of births per shift (Births) from these 3 parameters, assuming that Births is a count and conforms to the Poisson distribution
#Step 2a. using glm glmResults<-glm(formula = Births ~ Time + Day + BLY, data = myDataFrame, family=poisson()) summary(glmResults) #display results glmCount<-fitted.values(glmResults) #count = exp(y) myDataFrame <- cbind(myDataFrame, glmCount) #append glmCount to myDataFrame myDataFrameStep 2a performs the analysis using the glm model. The results are as follows Call: glm(formula = Births ~ Time + Day + BLY, family = poisson(), data = myDataFrame) Deviance Residuals: Min 1Q Median 3Q Max -2.66327 -0.91213 -0.05635 0.47382 2.03464 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.27841 0.92889 1.376 0.169 TimeNight 0.15966 0.30041 0.531 0.595 TimePM 0.21978 0.29646 0.741 0.458 DayWend -0.16691 0.24588 -0.679 0.497 BLY 0.03089 0.18593 0.166 0.868 (Dispersion parameter for poisson family taken to be 1) Null deviance: 21.615 on 18 degrees of freedom Residual deviance: 20.153 on 14 degrees of freedom AIC: 91.01 Number of Fisher Scoring iterations: 5The results from Step 2a to focus on are the rows under the heading Estimate, as this is the formula for calculating the log(count). We will use the first row of data (Time=PM, Day=Wday, BLY=4.5) to demonstrate how this is done.
Time Day BLY Births glmCount 1 PM Wday 4.5 6 5.140809 2 PM Wend 5.6 4 4.500904 3 Night Wday 6.0 3 5.070451 .... #Step 2b. Zero-inflation #install.packages("pscl") # only if not previously installed library(pscl) ziResults<-zeroinfl(Births ~ Time + Day + BLY, dist="poisson", data = myDataFrame) #zero inlation model summary(ziResults) ziCount <- predict(ziResults, myDataFrame) #zero inflated count myDataFrame <- cbind(myDataFrame,ziCount) #concatenate calculated count to myDataFrame vuong(glmResults, ziResults)# comparing the two models cor.test( ~ as.numeric(Births) + as.numeric(glmCount), data=myDataFrame, method = "spearman",exact=FALSE) cor.test( ~ as.numeric(Births) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE) cor.test( ~ as.numeric(glmCount) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE)Step 2b performs the zero inflated analysis, produces the output count accordingly, and append the results to the dataframe. The Vuong test compares the counts produced by the general linear model (glm) and zero inflated model (zi) and display whether the diffrence is significant. The 3 Spearman correlation coefficient assess the correlation between pairs of Births, glmCount, and ziCount. The results are as follow zeroinfl(formula = Births ~ Time + Day + BLY, data = myDataFrame, dist = "poisson") Pearson residuals: Min 1Q Median 3Q Max -1.05159 -0.59996 -0.01862 0.49214 1.87321 Count model coefficients (poisson with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) 1.51333 0.92356 1.639 0.101 TimeNight -0.01729 0.29613 -0.058 0.953 TimePM 0.04214 0.29228 0.144 0.885 DayWend -0.05728 0.24285 -0.236 0.814 BLY 0.01384 0.18458 0.075 0.940 Zero-inflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) -14.472 18736.205 -0.001 0.999 TimeNight -20.209 25675.676 -0.001 0.999 TimePM -20.310 25837.885 -0.001 0.999 DayWend 20.249 18736.200 0.001 0.999 BLY -1.270 3.015 -0.421 0.674 Number of iterations in BFGS optimization: 17 Log-likelihood: -38.26 on 10 Df > ziCount <- predict(ziResults, myDataFrame) #zero inflated count > myDataFrame <- cbind(myDataFrame,ziCount) #concatenate calculated count to myDataFrame > myDataFrame Time Day BLY Births glmCount ziCount 1 PM Wday 4.5 6 5.140809 5.041806 2 PM Wend 5.6 4 4.500904 4.834177 3 Night Wday 6.0 3 5.070451 4.850580 4 PM Wday 5.9 8 5.368030 5.140475 ...... > vuong(glmResults, ziResults)# comparing the two models Vuong Non-Nested Hypothesis Test-Statistic: (test-statistic is asymptotically distributed N(0,1) under the null that the models are indistinguishible) ------------------------------------------------------------- Vuong z-statistic H_A p-value Raw -0.7671979 model2 > model1 0.221482 AIC-corrected 0.9441481 model1 > model2 0.172547 BIC-corrected 1.7522791 model1 > model2 0.039863 > > cor.test( ~ as.numeric(Births) + as.numeric(glmCount), data=myDataFrame, method = "spearman",exact=FALSE) Spearman's rank correlation rho data: as.numeric(Births) and as.numeric(glmCount) S = 976.91, p-value = 0.559 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.1430578 > cor.test( ~ as.numeric(Births) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE) Spearman's rank correlation rho data: as.numeric(Births) and as.numeric(ziCount) S = 896.89, p-value = 0.3807 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.2132539 > cor.test( ~ as.numeric(glmCount) + as.numeric(ziCount), data=myDataFrame, method = "spearman",exact=FALSE) Spearman's rank correlation rho data: as.numeric(glmCount) and as.numeric(ziCount) S = 178.16, p-value = 5.647e-06 alternative hypothesis: true rho is not equal to 0 sample estimates: rho 0.8437226The results of the zero inflated analysis presents coefficients for combined Poisson and Binmomial distributions to be used in estimating the counts. The method of calculation using these coefficients is by numerical approximation, and too complex to be explained here. Those interested in numerical methods should consult the references provided. The results (ziCount) are then compared with those from the generalized linear model (glmCount) using the Voung test. The test producedby the basic calculation is labelled Raw, AIC has a correction by Akaike's information criterion, and BIC has a correction by the Bayesian information criterion. In practice they are different methods of calculating the same thing. One way of interpreting the 3 results is to assume that the two models are different if any of the 3 comparisons is statistically significant, as is the case with the current example data. The 3 Spearman's correlations test how closely the two estimates related to the original outcome, and to each other. The results shows that both performed poorly with the reference outcome (Births:glmCount, ρ=0.14; Births:ziCount, ρ=0.21; both not statistically significant). The two estimates however are closely but not perfectly correlated to each other (glmCount:ziCount, ρ=0.84, p<0.0001). Such poor results are not surprising, given that the data is randomly generated by the computer and the sample size very small. It merely serves to demonstrate the numerical methods. #Step 3: Plot fitted values old.par <- par(mfrow=c(2,2)) par(pin=c(3, 2)) # set plotting window to 4.2x3 inches plot(x = myDataFrame$Births, # Births on the x axis y = myDataFrame$glmCount, # glmCount on the y axis xlim = c(min(myDataFrame$Births),max(myDataFrame$Births)), ylim = c(min(myDataFrame$glmCount),max(myDataFrame$glmCount)), pch = 16) # size of dot plot(x = myDataFrame$Births, # Births on the x axis y = myDataFrame$ziCount, # ziCount on the y axis xlim = c(min(myDataFrame$Births),max(myDataFrame$Births)), ylim = c(min(myDataFrame$ziCount),max(myDataFrame$ziCount)), pch = 16) # size of dot plot(x = myDataFrame$glmCount, # Births on the x axis y = myDataFrame$ziCount, # ziCount on the y axis xlim = c(min(myDataFrame$glmCount),max(myDataFrame$glmCount)), ylim = c(min(myDataFrame$ziCount),max(myDataFrame$ziCount)), pch = 16) # size of dot par(old.par)
# Step 4: Test coefficients on new data set newDat = (" Time Day BLY PM Wday 4.5 Night Wday 6.0 AM Wend 5.8 ") newDataFrame <- read.table(textConnection(newDat),header=TRUE) summary(newDataFrame) newGlmCount <- predict(glmResults, newdata=newDataFrame, type='response') newZiCount <- predict(ziResults, newdata=newDataFrame, type='response') newDataFrame <- cbind(newDataFrame, newGlmCount, newZiCount) newDataFrameStep 4 is optional and for demonstration only. It shows how the results produced can be used to calculate and estimate the value for each case in a different dataset. Step 4 creates a new dataframe using the first 3 rows of the original data. Predict with type='response' to produce both the glmCount and ziCount. These can then be concatenated to the dataframe, and displayed. Please note that, in practice, only one of the counts need to be estimated, as by this stage a decision whether glm or zi should be used has been made. The results are presented as follows Time Day BLY newGlmCount newZiCount 1 PM Wday 4.5 5.140809 5.041806 2 Night Wday 6.0 5.070451 4.850580 3 AM Wend 5.8 3.635237 3.859393Please note the following
#Step 5: Optional saving and loading of coefficients #save(glmResults, file = "PoissonGlmResults.rda") #save glm results to rda file #load("PoissonGlmResults.rda") #load the glm rda file #save(ziResults, file = "PoissonZiResults.rda") #save zi results to rda file #load("PoissonZiResults.rda") #load the zi rda fileStep 5 is optional, and in fact commented out, and included as a template only. It allows the results of analysis to be stored as a .rda file, that can be reloaded on a different occasion and used on a different dataset. Please note the following
<
|