Related Links:
R Explained Page
CUSUM Charts Explained Page
Introduction
Example R Code
Example Explained
This page provides explanations and example R code for CUSUM quality control charts, for detecting changes in proportions that conform to the Bernoulli Distribution.
CUSUM Generally
CUSUM is a set of statistical procedures used in quality control. CUSUM stands for Cumulative Sum of Deviations.
In any ongoing process, be it manufacture or delivery of services and products, once the process is established and running, the outcome should be stable and within defined limits near a benchmark. The situation is then said to be In Control.
When things go wrong, the outcomes depart from the defined benchmark. The situation is then said to be Out of Control.
In some cases, things go catastrophically wrong, and the outcomes depart from the benchmark in a dramatic and obvious manner, so that investigation and remedy follow. For example, the gear in an engine may fracture, causing the machine to seize. An example in health care is the employment of an unqualified fraud as a surgeon, followed by a sudden and massive increase in mortality and morbidity.
The detection of catastrophic departure from the benchmark is usually by the Shewhart Chart, not covered on this site. Usually, some statistically improbable outcome, such as two consecutive measurements outside 3 Standard Deviations, or 3 consecutive measurements outside 2 Standard Deviations, is used to trigger an alarm that all is not well.
In many instances, however, the departures from the benchmark are gradual and small in scale, and these are difficult to detect. Examples are changes in the size and shape of products caused by progressive wearing out of machine parts, reduced success rates over time as experienced staff are gradually replaced by novices in a work team, or increases in client complaints to a service department following a loss of adequate supervision.
CUSUM is a statistical process of sampling outcomes and summing departures from the benchmark. When the situation is in control, the departures caused by random variations cancel each other numerically. In the out of control situation, departures from the benchmark tend to be unidirectional, so that the sum of departures accumulates until it becomes statistically identifiable.
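The idea of cancelling versus accumulating deviations can be illustrated with a few lines of R. This is a toy sketch with made-up numbers, not part of the CUSUM algorithm described on this page:

```r
# Toy illustration: deviations from a benchmark cancel out when in
# control, but accumulate when the process drifts in one direction.
# The numbers here are made up for demonstration only.
benchmark <- 0.2
inControl <- c(0.18, 0.22, 0.19, 0.21, 0.20)    # random scatter around 0.2
outOfControl <- c(0.24, 0.26, 0.25, 0.27, 0.25) # consistently above 0.2
cumsum(inControl - benchmark)    # hovers around 0
cumsum(outOfControl - benchmark) # grows steadily
```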
The mathematical process for CUSUM is in 2 parts. The common part is the summation of departures from the benchmark (CUSUM), and graphically demonstrating it. The unique part is the calculation of the decision interval, abbreviated as DI or h, and the reference value, abbreviated as k, which continuously adjusts the CUSUM and its variance. The two values of h and k depend on the following parameters:
 The in control values
 The out of control values
 The Type I Error or false positive rate, expressed as the Average Run Length, abbreviated as ARL: the number of samples expected before a false positive decision occurs when the situation is in control. ARL is the inverse of the false positive rate; a false positive rate of 1% would have ARL = 100
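The relationship between the false positive rate and ARL is simple arithmetic, shown here as a quick worked example:

```r
# ARL is the reciprocal of the false positive rate
falsePositiveRate <- 0.01   # 1% false alarms when in control
ARL <- 1 / falsePositiveRate
ARL # 100: on average 100 samples between false alarms
```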
Proportions
Proportions can be handled under 3 common types of distribution:
 The Binomial Distribution, where the measurement is the number of positive cases in a group of set sample size. The advantage of such an approach is that the results tend to be stable, as short term variations are evened out over many cases. The disadvantage is that evaluation can only take place when the planned sample size per group has been reached, so conclusions tend to take a long time.
 The Negative Binomial Distribution, where the measurement is the number of negative cases between a set number of positive cases. Evaluation can take place each time the set number of positive cases is reached, so conclusions can be reached sooner. However, the results tend to be more variable, as they are influenced by short term variations.
 The Bernoulli Distribution, where the measurement is either positive or negative for each case. Evaluation therefore takes place after each observation, so conclusions can be reached very quickly, but the results tend to be more chaotic, as they vary with each observation.
 This page describes the Bernoulli Distribution. CUSUM for the Binomial Distribution is discussed in the CUSUM for Binomial Distributed Proportion Explained Page, and that for the Negative Binomial Distribution in the CUSUM for Negative Binomial Distributed Counts Explained Page.
CUSUM for Proportions based on the Bernoulli Distribution
The Bernoulli Distribution is based on the probability of any individual observation being positive (yes, +, TRUE, 1) or negative (no, -, FALSE, 0), and in the algorithm provided by StatsToDo these are represented by 1 and 0. An example is the monitoring of Caesarean Section rates in a maternity unit, where a case delivered normally is represented by 0 and one delivered by Caesarean Section by 1.
The advantage of using the Bernoulli distribution for CUSUM is that the CUSUM value can be calculated with every case, based on whether the case is positive (1) or negative (0). It is therefore more responsive to changes, as it does not have to wait for a group of cases to be collected before a proportion can be calculated.
The disadvantage of doing CUSUM using the Bernoulli distribution is that the model is highly sensitive to any change, so that short term variations may cause marked changes to the CUSUM and trigger false alarms. An example is monitoring adverse surgical outcomes when most of the dangerous operations are carried out on a particular day by a senior surgeon, so that the adverse outcomes peak on one day of the week rather than being averaged over the whole week, causing a false alarm to be triggered.
As I could not find an easily usable existing algorithm for CUSUM using the Bernoulli distribution in the R archive CRAN, I developed a program based on the algorithm described in the paper by Reynolds and Stoumbos (see references). I have adopted the terminology used in the paper, which is similar to, but different from, that of the other CUSUM programs on StatsToDo.
The input parameters are:
 The Proportion in control (inControlProp), between 0 and 1 so that 0.2 means 20%
 The Proportion out of control (outOfControlProp).
 The Average Number of Observations to Signal (ANOS), the same concept as the Average Run Length (ARL) in other CUSUM programs: the expected average number of cases between false alarms
 The Data (dat) is a vector (array) of values, with 0 representing negative and 1 positive
The algorithm calculates the same statistical parameters as other CUSUMs, but again uses the names from the paper:
 The Reference Value (γB or gb) is equivalent to k in other CUSUM programs, except that it is used differently to construct the CUSUM.
 The Decision Interval, h, is the same as in other CUSUM models, and is used to trigger an alarm
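As a worked example, the reference value for the parameters used later on this page (in control proportion 0.2, out of control 0.25) can be computed directly from the r1, r2 and gb formulas that appear in the code in the next panel:

```r
# Worked example: reference value gammaB (gb) for p0 = 0.2, p1 = 0.25,
# using the same r1, r2 and gb formulas as the main program
p0 <- 0.2   # in control proportion
p1 <- 0.25  # out of control proportion
r1 <- -log((1 - p1) / (1 - p0))
r2 <- log((p1 * (1 - p0)) / (p0 * (1 - p1))) # equation 3
gb <- r1 / r2                                # equation 4
round(gb, 7) # 0.2243397, matching the output shown later on this page
```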
References
Reynolds Jr. MR and Stoumbos ZG (1999) A CUSUM Chart for Monitoring a Proportion when Inspecting Continuously. Journal of Quality Technology vol 31, No. 1, p. 87-108
This section presents the R code in full so it can be copied and pasted into RStudio as a template. Detailed explanations of each step are in the next panel.
For those unfamiliar with R, basic information on setting up R and common R procedures can be found in the R Explained Page
# CUSUM for Bernoulli Distribution
# Section 1: parameters and data
inControlProp = 0.2
outOfControlProp = 0.25
anos = 100
dat=c(
0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
0,0,0,0,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,0,0,0)
# Section 2: Algorithm Translated from Paper by
# Reynolds Jr. MR and Stoumbos ZG (1999) A CUSUM Chart for Monitoring a Proportion when
# Inspecting Continuously. Journal of Quality Technology vol 31, No. 1, p. 87-108
# EpsilonZ3 is an intermediary variable necessary for calculating Epsilon(p).
# The calculation is the first part of equation C1.
EpsilonZ3 <- function(p)
{
x = (1 - p) / p
return (x^0.5 - x^-0.5)
}
# Epsilon(p) is equation 6, an intermediary coefficient in calculating ANOS
EpsilonP <- function(p)
{
if(p<0.01)
{
return ((sqrt((1 - p) / p) - sqrt(p / (1 - p))) / 3)
}
if(p<=0.5)
{
return (0.41 - 0.0842 * log(p) - 0.0391 * log(p)^3 -
0.00376 * log(p)^4 - 0.000008 * log(p)^7)
}
return (EpsilonZ3(p) / 3 + EpsilonP(1 - p)) # equation C2
}
# KSiFunction is the function (C5) for a particular KSi value.
# This is repeatedly called by the following function, iterating until
# a KSi that returns zero is found
KSiFunction <- function(p0, p1, p, KSi) # equation C5
{
t = 1 - ((1 - p1) / (1 - p0))^KSi
b1 = (p1 / p0)^KSi
b2 = ((1 - p1) / (1 - p0))^KSi
return (p - t / (b1 - b2))
}
# FindKSi is an iterative (bisection) algorithm to find the KSi value
# for which equation C5 returns zero (0) for a particular combination of p0, p1 and p.
# KSi is necessary to calculate ANOS when p is not p0, p1, or r1/r2.
# KSi is 1 when p = p0 and -1 when p = p1, so the search brackets are -1 and 1
FindKSi <- function (p0, p1, p)
{
el = -1
er = 1
em = 0.5
ep = KSiFunction(p0, p1, p, em)
while(abs(ep)>0.000001)
{
if(ep>0)
{
er = em
}
else
{
el = em
}
em = (el + er) / 2
ep = KSiFunction(p0, p1, p, em)
}
em
}
# FindNOS is the key function, finding the number of samples required to signal (equation C3)
# p0 and p1 are the proportions in and out of control. p = proportion of interest
# r1 and r2 were already calculated from p0 and p1
# h is the decision value
FindNOS <- function(p0, p1, r1, r2, h, p) # equation C3
{
eP0 = EpsilonP(p0)
if(p1>p0)
{
hsb = h + eP0 * sqrt(p0 * (1 - p0)) # equation 5 for detecting increase
}
else
{
hsb = h - eP0 * sqrt(p0 * (1 - p0)) # equation 10 for detecting decrease
}
if(p==p0)
{
return ((exp(hsb * r2) - hsb * r2 - 1) / abs(r2 * p0 - r1))
}
if(p==p1)
{
return ((exp(-hsb * r2) + hsb * r2 - 1) / abs(r2 * p1 - r1))
}
if(p==(r1 / r2))
{
return (hsb * (hsb + abs(r1 / r2)) * r2^2 / (r1 * (r2 - r1)))
}
KSiP = FindKSi(p0, p1, p)
return ((exp(KSiP * hsb * r2) - KSiP * hsb * r2 - 1) / abs(KSiP * (r2 * p - r1)))
}
# Section 3: Translating algorithm from paper to usable parameters
# function FindH iterates h through FindNOS until the correct h is found for a particular NOS
FindH <- function(p0, p1, r1, r2, p, nos)
{
hl = 0
if(p1>p0)
{
hh = 50
hm = 25
}
else
{
hh = -50
hm = -25
}
a = FindNOS(p0, p1, r1, r2, hm, p)
while(abs(a - nos)>0.1)
{
if(a>nos)
{
hh = hm
}
else
{
hl = hm
}
hm = (hl + hh) / 2
a = FindNOS(p0, p1, r1, r2, hm, p)
}
hm
}
# Main program Begins
# Step 1: Calculate gb and h
p0 = inControlProp
p1 = outOfControlProp
r1 = -log((1 - p1) / (1 - p0))
r2 = log((p1 * (1 - p0)) / (p0 * (1 - p1))) # equation 3
gb = r1 / r2 # equation 4
h = FindH(p0, p1, r1, r2, p0, anos)
cat("Reference Value gammaB gb =",gb,"\tDecision Interval h=", h, "\n")
# Step 2: Calculate and plot CUSUM
# Step 2a: Calculate CUSUM
cusum <- vector()
cusumValue = 0
for(i in 1 : length(dat))
{
if(dat[i]!=0)
{
cusumValue = cusumValue + 1 - gb
}
else
{
cusumValue = cusumValue - gb
}
if(p1>p0) # Up
{
if(cusumValue<0)
{
cusumValue = 0
}
}
else # down
{
if(cusumValue>0)
{
cusumValue = 0
}
}
cusum[i] = cusumValue
}
# Step 2b: Plot CUSUM
plot(cusum,type="l")
abline(h=h)
# Step 3: Optional export of results
#myDataFrame <- data.frame(dat,cusum) #combine dat and cusum into a dataframe
#myDataFrame #display dataframe
#write.csv(myDataFrame, "CusumBernoulli.csv") # write dataframe to .csv file
The example is a made up one to demonstrate the numerical process, and the data was generated by computer. It purports to be from a quality control exercise in an obstetric unit, using the Caesarean Section Rate as the quality indicator.
 From records in the past, we established the benchmark Caesarean Section Rate to be 20% (0.2), and it can be kept at this level if the junior staff and midwives are well trained and closely supervised.
 With time, however, experienced staff leave and are replaced by the less experienced and less well trained. The standard of supervision would gradually deteriorate, resulting in an increase in the Caesarean Section rate.
 We would like to trigger an alarm and reorganize the working and supervision framework when the Caesarean Section Rate increases to 25% (0.25) or more.
 As reorganizing the working framework is time consuming and disruptive, we would like any false alarm to occur no more frequently than once every 100 observations on average, so the average number of observations (anos) = 100
The data is entered in Section 1. This is the only part of the program that needs any editing.
# Section 1: parameters and data
inControlProp = 0.2
outOfControlProp = 0.25
anos = 100
dat=c(
0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
0,0,0,0,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,0,0,0)
Section 1 contains the parameters and the data. This is the part the user can edit, changing the values to those required for his/her own analysis.
The first 3 lines set the parameters required for the analysis.
c() is a function that creates a vector (array) containing the comma separated values within the brackets. In this case, dat is the name of the vector, which contains the outcomes of 60 childbirths, with 0 representing normal delivery and 1 Caesarean Section.
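Before running the analysis, it may be worth checking that the data was entered correctly by summarising the vector. A small sketch:

```r
# Check the data entry: count the observations and tally the outcomes
dat <- c(
0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
0,0,0,0,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,1,0,0,0,0)
length(dat) # 60 deliveries
sum(dat)    # 14 Caesarean Sections
mean(dat)   # observed Caesarean Section rate
```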
Section 2 contains the algorithms to find ANOS from the input parameters. I have merely translated the formulae from the paper, and am unable to explain why it is done the way it is. Those interested are referred to the original paper, which explains both the theory and the algorithm in greater detail.
Section 3 translates the input parameters into the CUSUM, and is separated into the function FindH and the main program.
# function FindH iterates h through FindNOS until the correct h is found for a particular NOS
FindH <- function(p0, p1, r1, r2, p, nos)
{
hl = 0
if(p1>p0)
{
hh = 50
hm = 25
}
else
{
hh = -50
hm = -25
}
a = FindNOS(p0, p1, r1, r2, hm, p)
while(abs(a - nos)>0.1)
{
if(a>nos)
{
hh = hm
}
else
{
hl = hm
}
hm = (hl + hh) / 2
a = FindNOS(p0, p1, r1, r2, hm, p)
}
hm
}
FindH calculates the decision interval h (hm inside the function) from the in and out of control proportions (p0, p1), the proportion of interest (p, which in planning is the same as p0), the anos, and the r1 and r2 calculated in the main program. It returns the value of the decision interval h.
The main program takes the input parameters from section 1 to produce all the results and the CUSUM plot
# Step 1: Calculate gb and h
p0 = inControlProp
p1 = outOfControlProp
r1 = -log((1 - p1) / (1 - p0))
r2 = log((p1 * (1 - p0)) / (p0 * (1 - p1))) # equation 3
gb = r1 / r2 # equation 4
h = FindH(p0, p1, r1, r2, p0, anos)
cat("Reference Value gammaB gb =",gb,"\tDecision Interval h=", h, "\n")
Step 1 calculates the parameters required by the function FindH, including the reference value γB (gb), calls the FindH function to obtain the decision interval h, and displays the results.
Reference Value gammaB gb = 0.2243397 Decision Interval h= 3.164673
Step 2 is divided into 2 parts. Step 2a calculates the cusum vector, and 2b plots the vector and h in a graph.
# Step 2a: Calculate CUSUM
cusum <- vector()
cusumValue = 0
for(i in 1 : length(dat))
{
if(dat[i]!=0)
{
cusumValue = cusumValue + 1 - gb
}
else
{
cusumValue = cusumValue - gb
}
if(p1>p0) # Up
{
if(cusumValue<0)
{
cusumValue = 0
}
}
else # down
{
if(cusumValue>0)
{
cusumValue = 0
}
}
cusum[i] = cusumValue
}
The first 2 lines of code in step 2a create the empty cusum vector and set the initial CUSUM value to 0. The remaining code calculates the cusum value for each case (childbirth in this example), and places it in the cusum vector.
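The loop implements the standard one-sided CUSUM update for detecting an increase, S[t] = max(0, S[t-1] + x[t] - gb). The same vector can be produced more compactly with Reduce(); this is a sketch using a few illustrative observations and the gb value calculated in Step 1:

```r
gb <- 0.2243397       # reference value from Step 1
x <- c(0, 0, 1, 0, 1) # a few illustrative observations
# S[t] = max(0, S[t-1] + x[t] - gb), starting from S[0] = 0;
# accumulate = TRUE keeps every intermediate value, [-1] drops S[0]
cusum <- Reduce(function(s, xt) max(0, s + xt - gb), x,
                init = 0, accumulate = TRUE)[-1]
round(cusum, 4) # 0.0000 0.0000 0.7757 0.5513 1.3270
```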
# Step 2b: Plot the cusum vector and h
plot(cusum,type="l")
abline(h=h)
In step 2b, the first line plots the cusum vector, and the second line draws the decision interval h. The resulting plot is shown to the left.
# Step 3: Optional export of results
#myDataFrame <- data.frame(dat,cusum) #combine dat and cusum into a dataframe
#myDataFrame #display dataframe
#write.csv(myDataFrame, "CusumBernoulli.csv") # write dataframe to .csv file
Step 3 is optional, and in fact commented out, included as a template only. Each line can be activated by removing the leading #.
The first line places the two vectors, dat and cusum, together into a dataframe.
The second line displays the data, along with row numbers, in the console, from which it can be copied and pasted into other applications for further processing.
The third line saves the dataframe as a comma delimited .csv file. This is needed if the data is too large to handle by copy and paste from the console.
Please note: RStudio writes files to the User/Document/ folder by default. The path needs to be reset if the user wishes to save files to a specific folder. This is discussed in the file I/O panel of the R Explained Page.
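For example, the file name passed to write.csv can be given a full path. The sketch below uses tempdir() purely so the example runs anywhere, and the data frame contents are made up for illustration:

```r
# Hypothetical example: write the results to an explicit folder rather
# than relying on the default working directory
myDataFrame <- data.frame(dat = c(0, 1, 0), cusum = c(0, 0.7757, 0.5513))
outFile <- file.path(tempdir(), "CusumBernoulli.csv") # replace tempdir() with your own folder
write.csv(myDataFrame, outFile, row.names = FALSE)
read.csv(outFile) # read the file back to confirm it was saved
```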