Copyright ©2020. All Rights Reserved.
Explanations and References

Currently, multivariate logistic regressions (binomial, multinomial, or ordinal) are used to establish the regression relationship between one or more independent variables and a probability (proportion, risk) as the dependent variable. These algorithms are flexible and widely accepted, but they require specialized software and an understanding of complex multivariate statistics. StatsToDo presents some code samples in R for those who wish to access these algorithms (see Index Subjects).
This page provides an earlier algorithm to perform simple linear regression between a single ordinal predictor and an outcome that is a proportion. The calculations are based on the Chi Square distribution.
The entry data consist of 3 columns: X (the ordinal predictor, here the year), NPos (the number of positive outcomes), and NNeg (the number of negative outcomes).
The data in the example were artificially created to demonstrate the procedure and are not real. They purport to be from a study of business failures over the years.
The data were compiled, and the probability of failure (proportion, risk, Ppos) calculated. The results are presented in the table to the right. It can be seen that the failure rates were 9.1% for 1990, 13.8% for 1992, and 16.9% for 1995, and that the overall failure rate was 14.9%.
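The overall rate quoted above is simply the pooled count of failures divided by the pooled total. A quick check in R, using the counts from the example data entered in Section 1 below:

```r
# Overall failure rate = total failures / total businesses
# (counts from the example data: 10/110, 8/58, 61/361)
sum(c(10, 8, 61)) / sum(c(110, 58, 361))  # 79 / 529 = 0.1493, i.e. 14.9%
```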
The program now partitions the Chi Square, as shown in the table to the left. The analysis shows that the Chi Square for regression is significant at the p&lt;0.05 level. Once this is partitioned out, the residual Chi Square is not statistically significant. A conclusion can therefore be drawn that, other than an increasing trend, the proportions of business failures were otherwise homogeneous during those years.

Finally, the regression coefficient is calculated. The change in proportion per unit row value = 0.015, which indicates that, between 1990 and 1995, the rate of business failures increased by 1.5% per year.

References

Steel R.G.D., Torrie J.H., Dickey D.A. Principles and Procedures of Statistics: A Biomedical Approach. 3rd Ed. (1997) ISBN 0-07-061028-2 p. 520-521
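Note that the total Chi Square in the partition is simply Pearson's chi-square for the 3x2 table of failures and survivals, so it can be cross-checked with base R's chisq.test. A minimal sketch using the example counts (no continuity correction is applied for tables larger than 2x2):

```r
# Pearson chi-square of the 3x2 table (failures vs survivals by year);
# this reproduces the "Total" row of the partition table
m <- matrix(c(10, 100,
               8,  50,
              61, 300),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("1990", "1992", "1995"), c("NPos", "NNeg")))
chisq.test(m)  # X-squared = 4.1113, df = 2, p-value = 0.128
```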
The R program for regression of proportion is a single continuous program. To make it easier to follow, the listing is divided into 2 sections.
Section 1: Initial data input and matrix of summaries
# Section 1: Preparation
dat = ("
X NPos NNeg
1990 10 100
1992 8 50
1995 61 300
")
df <- read.table(textConnection(dat),header=TRUE) # conversion to data frame
df$RowTot <- df$NPos + df$NNeg # total number each row
df$Prob <- df$NPos / df$RowTot # probability of Pos each row
df # Summary of Input Data
The initial data frame, with all the data necessary for the calculations, is as follows:
> df # Summary of Input Data
X NPos NNeg RowTot Prob
1 1990 10 100 110 0.09090909
2 1992 8 50 58 0.13793103
3 1995 61 300 361 0.16897507
Section 2: The actual calculations
# Preparation for calculation
rows = nrow(df)
posTot = sum(df$NPos)
negTot = sum(df$NNeg)
tot = sum(df$RowTot)
# vectors for results
Source <- vector()
ChiSq <- vector()
DF <- vector()
P<- vector()
# calculate total chi sq
zw = 0
chiTot = 0
dfTot = rows - 1
for(i in 1:rows) # for each row
{
zw = zw + df$X[i] * df$RowTot[i] # row value x row count
e = df$RowTot[i] * posTot / tot; # expected
o = df$NPos[i] # observed number pos
chiTot = chiTot + (o - e)**2 / e # add to Chi Sq
e = df$RowTot[i] * negTot / tot; # expected
o = df$NNeg[i] # observed number neg
chiTot = chiTot + (o - e)**2 / e # add to Chi Sq
}
pTot = 1 - pchisq(chiTot, df=dfTot)
Source <- append(Source, "Total") # add to vectors for eventual display
ChiSq <- append(ChiSq, chiTot)
DF <- append(DF, dfTot)
P<- append(P,pTot)
#c(chiTot,pTot)
# Calculate regression and its chi sq
p2 = posTot / tot; # probability of col 1
top = 0;
bot = 0;
for(i in 1:rows)
{
top = top + df$X[i] * df$NPos[i] # sum row value x col 1
bot = bot + df$X[i]^2 * df$RowTot[i] # row val sq x row count
}
#Calculation of regression coefficient
top = top - posTot * zw / tot
bot = bot - zw^2 / tot
reg = top / bot # regression coefficient
# calculate chi sq regression
chiReg = top^2 / (bot * p2 * (1 - p2)) # chi sq regression
pReg = 1 - pchisq(chiReg, df=1)
Source <- append(Source, "Regression") # add to vectors for eventual display
ChiSq <- append(ChiSq, chiReg)
DF <- append(DF, 1)
P<- append(P,pReg)
# Calculate residual chi sq
chiRes = chiTot - chiReg # chi sq residual
dfRes = dfTot - 1
pRes = 1 - pchisq(chiRes, df=dfRes)
Source <- append(Source, "Residual") # add to vectors for eventual display
ChiSq <- append(ChiSq, chiRes)
DF <- append(DF, dfRes)
P<- append(P,pRes)
# output
dfRes <- data.frame(Source, ChiSq, DF, P) # combine vectors into data frame for display
dfRes # display chi sq, df, and significance in p
# Regression coefficient
reg # regression coefficient: change in probability per unit of X
The results are as follows:
> dfRes
Source ChiSq DF P
1 Total 4.11131566 2 0.12800860
2 Regression 4.01760141 1 0.04502771
3 Residual 0.09371425 1 0.75950732
> reg # regression coefficient: change in probability per unit of X
[1] 0.014958
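The Regression row can also be reproduced with base R's prop.trend.test (the Chi-squared Test for Trend in Proportions), which computes the same trend chi-square when the years are supplied as scores:

```r
# Chi-square test for trend in proportions, using the same example data;
# the statistic matches the "Regression" row of the partition above
prop.trend.test(x = c(10, 8, 61),             # NPos per year
                n = c(110, 58, 361),          # row totals per year
                score = c(1990, 1992, 1995))  # year as the ordinal score
# X-squared = 4.0176, df = 1, p-value = 0.04503
```

Note that score must be given explicitly here: its default is the row index 1, 2, 3, which would not reflect the unequal spacing of the years.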