Copyright © 2020.
All Rights Reserved.
StatsToDo : Compare Two Counts

Counts of events, modelled by the Poisson distribution, are frequently encountered in medical research. Examples are the number of falls, the number of asthma attacks, the number of cells, and so on. The Poisson parameter Lambda (λ) is the total number of events (k) divided by the number of observation units (n) in the data (λ = k/n). The unit forms the basis or denominator for calculating the average, and need not be individual cases or research subjects. For example, the number of asthma attacks may be based on the number of child months, or the number of pregnancies on the number of woman years of using a particular contraceptive.

The Poisson rate differs from the binomial parameter of proportion or risk, which is the number of individuals classified as positive (p) divided by the total number of individuals in the data (r = p/n). A proportion or risk must always be a number between 0 and 1, while λ may be any positive number.

For example, if we have 100 people, and only 90 of them go shopping in a week, then the binomial risk of shopping is 90/100 = 0.9. However, some of the people will go shopping more than once in the week, so the total number of shopping trips among the 100 people may be 160, and the Poisson Lambda is 160/100 = 1.6 trips per person week.
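The arithmetic of the shopping example can be sketched in a few lines of Python. The variable names are illustrative only and are not taken from the programs on this site.

```python
# Shopping example from the text: 100 people observed for one week.
people = 100     # observation units (person weeks)
shoppers = 90    # people who went shopping at least once
trips = 160      # total shopping trips made by all 100 people

risk = shoppers / people   # binomial proportion, always between 0 and 1
lam = trips / people       # Poisson rate, may exceed 1

print(f"risk of shopping = {risk:.2f}")
print(f"lambda = {lam:.2f} trips per person week")
```

The same data therefore give two different summaries: the risk counts each person at most once, while the rate counts every event.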

Counts with a large Poisson mean, say over 200, are approximately normally distributed, and the count (or sqrt(count)) can be used as a parametric measurement. If the events occur so few times per individual that individuals can be classified as positive or negative cases, then the binomial distribution can be assumed and statistics related to proportions used. In between, or when events are infrequent, the Poisson distribution is used.
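As a rough illustration of why large counts can be treated as normally distributed, the sketch below compares the cumulative Poisson probabilities with a normal curve of the same mean and a standard deviation of sqrt(mean). The function names and the continuity correction of 0.5 are choices made for this example, not part of the programs on this site.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def max_cdf_gap(mean):
    """Largest gap between the Poisson CDF and its normal approximation.

    The normal curve uses the same mean, a standard deviation of
    sqrt(mean), and a continuity correction of 0.5.
    """
    sd = math.sqrt(mean)
    upper = int(mean + 10 * sd) + 10   # far enough into the upper tail
    term = math.exp(-mean)             # P(X = 0)
    cumulative = 0.0
    gap = 0.0
    for x in range(upper + 1):
        if x > 0:
            term *= mean / x           # P(X = x) from P(X = x - 1)
        cumulative += term
        gap = max(gap, abs(cumulative - normal_cdf(x + 0.5, mean, sd)))
    return gap

for mean in (2, 20, 200):
    print(f"Poisson mean {mean:>3}: largest CDF gap = {max_cdf_gap(mean):.4f}")
```

The gap shrinks as the mean grows, which is the sense in which the normal distribution becomes an acceptable stand-in for large counts.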

Some clarification of nomenclature may be useful.

  • Counts of events (e.g. number of asthma attacks recorded) are represented by k. This count must be stated in terms of how many events occurred over a defined period or environment (e.g. 100 attacks in 300 children over 6 months, or 10 cells seen in 5 microlitres of fluid).
  • The mean count, or count rate (k/n), is represented by λ, e.g. 100 attacks (k) in 1800 child months (n) gives λ = 100/1800 = 0.06 attacks per child month.
This page provides calculations comparing two sets of count data. Three algorithms are available for this comparison:
  • The most commonly used method, initially described by Przyborowski and Wilenski (see reference), is known as the Conditional Test (the C Test). The test is based on the null hypothesis that the ratio of the two count rates (λ2 / λ1) is equal to 1.
  • More recently, Krishnamoorthy and Thomson (see reference) proposed an improvement on the C Test, known as the E Test, where the null hypothesis is that the difference between the two count rates (λ2 - λ1) is equal to 0. Although the computation for this test is more complex, the advantages are that it is more robust and has greater power.
  • Whitehead (see reference), in his text book on unpaired sequential analysis, provided algorithms to determine sample sizes for non-sequential methods, and a method for comparing two counts at the end of the sequence. This test depends on a transformation of the difference into a normally distributed mean and compares the means against the null hypothesis of 0.
  • Comparing the 3 tests: In most cases, the results from the 3 tests on the same set of data are approximately the same. There is only a need to choose when the sample size is small or when the difference between the two groups is minor, so that the statistical significance is ambiguous.
    • The advantage of the Whitehead algorithm is the speed of computation, as there is no need for the repeated estimation of factorial numbers that is necessary in the other two tests. Users may prefer this test if the sample sizes are large or there are numerous calculations to be done.
    • The C Test has been used the longest, and is quoted in most textbooks.
    • The E Test is the most robust, least likely to result in a false statistical significance, and so is the safest one to use. With large numbers, however, it is also the most computationally intensive and time consuming.
    In most cases, therefore, the C Test should be used. If the sample size is too large and the calculations take too long, then the Whitehead algorithm can be used. When the statistical significance of the results is ambiguous (e.g. p close to 0.05) then the E Test is preferred.
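A minimal sketch of the C Test follows, assuming the usual formulation: given the total count k1 + k2, and under the null hypothesis of equal rates, k1 follows a binomial distribution with probability n1/(n1 + n2). The two-sided p value here is taken as twice the smaller tail, capped at 1, which is one common convention and may differ from the program on this site. The function names are illustrative.

```python
import math

def binomial_pmf(x, size, prob):
    """Binomial probability computed on the log scale to avoid overflow."""
    log_p = (math.lgamma(size + 1) - math.lgamma(x + 1)
             - math.lgamma(size - x + 1)
             + x * math.log(prob) + (size - x) * math.log(1.0 - prob))
    return math.exp(log_p)

def c_test(k1, n1, k2, n2):
    """Two-sided conditional test that the rates k1/n1 and k2/n2 are equal."""
    total = k1 + k2
    if total == 0:
        return 1.0                       # no events, nothing to compare
    prob = n1 / (n1 + n2)                # expected share of events in group 1
    lower = sum(binomial_pmf(x, total, prob) for x in range(0, k1 + 1))
    upper = sum(binomial_pmf(x, total, prob) for x in range(k1, total + 1))
    return min(1.0, 2.0 * min(lower, upper))

# Hypothetical data: 5 events in 100 units against 20 events in 100 units.
print(f"C Test p = {c_test(5, 100, 20, 100):.4f}")
```

Working with log-gamma values rather than factorials keeps the calculation stable for larger counts, at the cost of a small rounding error.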

Data Input

The data is a table with 4 columns. Each row represents a separate test.

The columns are

  • Col 1: k1, the total count in group 1
  • Col 2: n1, the sample size (number of observation units) in group 1
  • Col 3: k2, the total count in group 2
  • Col 4: n2, the sample size (number of observation units) in group 2
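The sketch below shows how rows in this layout might be read and tested. It assumes whitespace-separated columns, which may not match the input format of the program on this site, and it applies a version of the E Test built from the published description: the rate difference is standardized by its estimated standard error, and the p value sums the joint Poisson probabilities, under the pooled rate, of all pairs of counts whose statistic is at least as extreme as the observed one. The truncation of the sums and the small comparison tolerance are assumptions of this example, and all function names are illustrative.

```python
import math

def parse_rows(text):
    """Turn lines of 'k1 n1 k2 n2' into (k1, n1, k2, n2) tuples."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        k1, n1, k2, n2 = fields
        rows.append((int(k1), float(n1), int(k2), float(n2)))
    return rows

def poisson_pmf(x, mean):
    """P(X = x) for a Poisson variable, computed on the log scale."""
    if mean == 0:
        return 1.0 if x == 0 else 0.0
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))

def statistic(x1, n1, x2, n2):
    """Rate difference divided by its estimated standard error."""
    variance = x1 / n1**2 + x2 / n2**2
    if variance == 0:
        return 0.0                        # both counts are zero
    return (x1 / n1 - x2 / n2) / math.sqrt(variance)

def e_test(k1, n1, k2, n2):
    """Two-sided E Test p value for the null hypothesis of equal rates."""
    observed = abs(statistic(k1, n1, k2, n2))
    pooled = (k1 + k2) / (n1 + n2)        # common rate under the null
    mean1, mean2 = n1 * pooled, n2 * pooled
    # Assumption: the infinite sums are cut off far into the upper tails.
    top1 = int(mean1 + 10 * math.sqrt(mean1)) + 20
    top2 = int(mean2 + 10 * math.sqrt(mean2)) + 20
    pmf2 = [poisson_pmf(x, mean2) for x in range(top2 + 1)]
    p_value = 0.0
    for x1 in range(top1 + 1):
        weight1 = poisson_pmf(x1, mean1)
        for x2 in range(top2 + 1):
            # Small tolerance so that ties are not lost to rounding.
            if abs(statistic(x1, n1, x2, n2)) >= observed - 1e-12:
                p_value += weight1 * pmf2[x2]
    return min(1.0, p_value)

# Hypothetical input: two rows, each a separate test.
rows = parse_rows("10 100 10 100\n5 100 20 100")
for k1, n1, k2, n2 in rows:
    print(k1, n1, k2, n2, f"E Test p = {e_test(k1, n1, k2, n2):.4f}")
```

The double loop is what makes the E Test slow for large counts, since its cost grows with the product of the two expected counts.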

More Complex Models

If there are more than 2 groups or if the model is multivariate (2 or more independent variables), and if the raw data is available, then the best approach for comparison is the Generalized Linear Model algorithm of Multiple Regression for Poisson Data. R codes and references for this calculation are presented in GLM.php on this site.


Poisson Probability
    Poisson Distribution by Wikipedia

    Steel RGD, Torrie JH, Dickey DA (1997) Principles and Procedures of Statistics. A Biometrical Approach. The McGraw-Hill Companies, Inc, New York. p. 558

The C Test
    Przyborowski J and Wilenski H (1940) Homogeneity of results in testing samples from Poisson series. Biometrika 31:313-323.
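The E Test
    Krishnamoorthy K and Thomson J (2004) A more powerful test for comparing two Poisson means. Journal of Statistical Planning and Inference 119:23-35.
Whitehead's Algorithm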

    Whitehead J (1992) The Design and Analysis of Sequential Clinical Trials (Revised 2nd Edition). John Wiley & Sons Ltd., Chichester. ISBN 0 47197550 8. p. 48-50