Copyright @2020.
All Rights Reserved.

StatsToDo: Probability Explained

Introduction

This page provides a series of related discussions on probability, the theoretical basis for nearly all the procedures presented on this site.

Probability and Statistics

The scientific approach that characterises western civilisation is based on reproducible empirical observations. The idea is that repeatedly observed relationships or differences are more likely to reflect reality. In other words, a proposition cannot be accepted unless it is supported by repeated observations.

The problem with repeated observations is that the results, often similar, are not always the same. Experience compels us to abandon the binary idea that something is either true or false. Rather we increasingly see true or false merely as extremes while most of reality is a continuum in between.

Similarly, when we consider a scale (e.g. how tall is a man), we can only state an approximation, a range that most would fit in.

The uncertainty of reality therefore needs to be approached in a consistent and logical way. Probability is a measurement of how likely things are to occur and is one of the ways to represent uncertainty. Statistics is the set of tools to handle probability.

Mean, SD, and SE

The ancient Phoenicians were great traders and seafarers, but they tended to overload their boats. In stormy weather, goods had to be thrown overboard in order to save the ship. Owners of lost goods were then compensated by those who did not lose their goods, and the amounts involved depended on the estimated value of the goods. This arrangement was named havara, and this term evolved over the centuries to become average. Statistically, the most common and useful expression of average is the mean.

The astronomer Gauss measured distances between stars. He noticed that it was difficult to reproduce his measurements exactly. However, the measurements clustered around a central value, being more common near the mean and less common further away from it. He concluded that any set of measurements would normally distribute to this pattern and called it the Normal Distribution. The formula for the Normal Distribution curve had been derived earlier by De Moivre, as an approximation to the binomial distribution.

Once the formula was available, the features of the Normal Distribution could be handled mathematically. Using calculus, the area under the curve (or any part of it) can be estimated.

Fisher developed the concept further and used the area under the curve as a measure of probability. He argued that if the area under the whole curve is considered the totality, then any portion represents the probability of the events that the portion describes. From this, he was able to calculate the probability of obtaining a measurement that exceeds a given deviation from the mean value. He standardized the measurement of this deviation by expressing it in units of the Standard Deviation (SD), called the result the Normal Standard Deviate (z), and derived the relationships between z and probability.
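
The relationship between z and probability can be illustrated with a few lines of Python. This is only a sketch using the standard library; the function name tail_probability is an assumption made for this example and is not part of any procedure on this site.

```python
from statistics import NormalDist


def tail_probability(z: float, two_sided: bool = True) -> float:
    """Area under the standard Normal curve beyond a deviate of size |z|."""
    upper_tail = 1.0 - NormalDist().cdf(abs(z))
    # A two-sided probability counts deviations in either direction.
    return 2.0 * upper_tail if two_sided else upper_tail


for z in (1.0, 1.96, 2.58):
    print(f"z = {z:4.2f}  two-sided probability = {tail_probability(z):.4f}")
```

A deviate of 1.96 leaves about 5% of the area in the two tails combined, which is where the familiar 0.05 threshold meets the z scale.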

Fisher went on to develop the idea of the Standard Error of the Mean (SE for short). He argued that the true mean is difficult to find, as this requires measuring everyone in a population, or measuring an infinite number of times. The mean value obtained in a set of observations is therefore only the sample mean, an estimate of the underlying true mean, and this would vary from sample to sample. An estimate of this variation is called the Standard Error of the Mean (SE). Conceptually, SE represents the Standard Deviation (SD) of the mean values if repeated samples of the same size were taken. In other words, if the mean value is calculated for each repeated sample from the population, the SD of these mean values is the SE of the mean.
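
The idea that SE is the SD of repeated sample means can be checked by simulation. The sketch below draws repeated samples from a hypothetical Normal population (mean 100, SD 15, both invented for illustration) and compares the SD of the sample means with the usual formula SE = SD / √n.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 100.0, 15.0   # hypothetical population values
SAMPLE_SIZE, REPEATS = 25, 5000  # hypothetical sampling design

sample_means = []
for _ in range(REPEATS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.fmean(sample))

empirical_se = statistics.stdev(sample_means)  # SD of the sample means
theoretical_se = POP_SD / SAMPLE_SIZE ** 0.5   # SD / sqrt(n)

print(f"SD of {REPEATS} sample means: {empirical_se:.3f}")
print(f"SD / sqrt(n):             {theoretical_se:.3f}")
```

The two values should agree closely, and the agreement tightens as the number of repeated samples increases.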

Type I Error

From the idea of the Standard Error of the mean, Fisher went on to develop the idea of the Standard Error of the Difference. The idea was that, in comparing two groups of observations, the difference between the means of the groups is a sample mean, which also has a Standard Error.

Fisher then derived formulations for the calculation of this Standard Error of the Difference.

Once these assumptions are accepted, then it follows that the difference between two means is merely the sample mean of a normally distributed measurement. The probability of having a particular value can therefore be calculated.

Fisher then asked the question: what is the probability of a particular estimate being different from zero (null)?

Being a mathematician, Fisher then described this idea in mathematical terms. In mathematics, one can never prove anything to be true, but one can demonstrate that it is not always true.

Fisher then made the following propositions.

Let us propose that there is no difference between two sets of measurements, that the difference is null (the null hypothesis H₀).
Let us then say this proposal is not true (reject the null hypothesis).
Let us then say that we are wrong to do so (error in rejecting the null hypothesis).
Can we estimate how likely it is that we are wrong (probability of error in rejecting the null hypothesis)?

Subsequently, error in rejecting the null hypothesis was abbreviated to Type I Error, and the probability of this error abbreviated to alpha (α).
This precise but convoluted mathematical statement is generally interpreted as the probability that there is no difference. The smaller the probability, the more confidently one can reject the null hypothesis.

Conceptually the ideas can be simply summarised in the diagram to the right.

The development of the Probability of Type I Error (α) greatly assisted decision making in agriculture and industry, because it provided a measure of confidence that an innovation produces a better outcome (actually the reverse, the confidence that the innovation makes no difference).
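
The chain of reasoning in this section, from the Standard Error of the Difference to α, can be sketched as follows. This is a minimal illustration that assumes samples large enough for the Normal approximation (small samples would normally use the t distribution instead); the function names and the example figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def se_of_difference(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Standard Error of the difference between two independent means."""
    return sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


def p_value_for_difference(mean1, sd1, n1, mean2, sd2, n2) -> float:
    """Two-sided probability of a difference at least this large under H0."""
    z = (mean1 - mean2) / se_of_difference(sd1, n1, sd2, n2)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical data: two groups of 50, SD 5 in each, means 52 and 50.
se = se_of_difference(5.0, 50, 5.0, 50)
p = p_value_for_difference(52.0, 5.0, 50, 50.0, 5.0, 50)
print(f"SE of difference = {se:.3f}, two-sided probability = {p:.4f}")
```

With these invented figures the Standard Error of the Difference is 1.0, the difference is 2 Standard Errors from zero, and the probability is a little under 0.05.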

Type II Error

Nearly a generation after Fisher proposed α, Egon Pearson, working with Jerzy Neyman, proposed that Type I error was only half the story, and had serious flaws that needed to be addressed. He pointed out a number of problems in using Type I Error (α).

α allows the decision to reject the null hypothesis, but provides no solution when the null hypothesis cannot be rejected.
α is sample size dependent. With a very large sample size, even a trivial difference produces a small α. There is no provision as to what constitutes the correct number of observations.
α is a scale of confidence and not a decision.
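
The sample size dependence listed above is easy to demonstrate. In the sketch below the same trivial difference (0.5 on a scale with a within group SD of 10, both values hypothetical) is tested with increasing numbers per group, using a large-sample Normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(difference: float, sd: float, n_per_group: int) -> float:
    """Two-sided probability for a difference between two equal-sized groups."""
    se = sd * sqrt(2.0 / n_per_group)  # SE of the difference, common SD assumed
    z = abs(difference) / se
    return 2.0 * (1.0 - NormalDist().cdf(z))


DIFFERENCE, SD = 0.5, 10.0  # hypothetical trivial difference and within group SD

for n in (100, 1000, 10000, 100000):
    print(f"n per group = {n:>6}  probability = {two_sided_p(DIFFERENCE, SD, n):.4f}")
```

The difference never changes, yet the probability moves from clearly non-significant to vanishingly small purely because more observations were collected.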

Pearson proposed that a second hypothesis be stated, the alternative hypothesis (H₁), that a difference exists. This hypothesis is also to be rejected, and the error of that rejection calculated.

This error is called the Type II Error, and its probability is called beta (β). Once this schema is set up, α and β can be calculated for any difference (mean) that is observed. The following diagram summarises Pearson's proposals.

Although Pearson's proposals are conceptually elegant and acceptable, they were mathematically difficult to implement, as the range of non-zero differences is infinite.

To make the Type II error concept useful, Pearson proposed the following, as shown in the diagram to the right.

An alpha (α) value be nominated as a critical value for the rejection of the null hypothesis. The most commonly used value since is 0.05.
A beta (β) value be nominated as a critical value for rejecting the alternative hypothesis. The most commonly used value since is 0.2.
A critical difference be nominated. This should be a non-trivial value that is meaningful in practical terms.
The background, population, or within group variation, in terms of a Standard Deviation, be estimated.
Once these parameters are available, calculations can be made as to the sample size that will be needed for such a model.
Once all the data (with the correct sample size) are collected, the observed difference between the two groups (the value) is compared with the proposed critical value.
- If the observed difference is smaller than the critical value, the alternative hypothesis is rejected, the null hypothesis is accepted, and the observed difference is considered not statistically significant.
- If the observed difference is the same as or larger than the critical value, the null hypothesis is rejected, the alternative hypothesis is accepted, and the difference is considered statistically significant.
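
The sample size step in the list above can be sketched with a commonly used large-sample approximation for comparing two means, n per group = 2 × ((z for α + z for β) × SD / critical difference)². The formula choice, the two-sided α, and the function name are assumptions made for this illustration; dedicated sample size programs may use more exact methods.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(alpha: float, beta: float,
                          critical_difference: float, sd: float) -> int:
    """Approximate number per group needed to compare two means."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided Type I Error
    z_beta = NormalDist().inv_cdf(1.0 - beta)          # Type II Error
    n = 2.0 * ((z_alpha + z_beta) * sd / critical_difference) ** 2
    return ceil(n)  # round up to whole subjects


# Hypothetical plan: alpha 0.05, beta 0.2, critical difference 5, SD 10.
print(sample_size_per_group(0.05, 0.2, 5.0, 10.0))
```

With these hypothetical values the approximation gives 63 per group. Halving the critical difference roughly quadruples the requirement, which is why the choice of a meaningful critical difference matters so much.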

Pearson's model gained great popularity during most of the 20th century, particularly amongst the biomedical and social research community.

The concept of Type II error has frequently been replaced with the term power, where power = 1-β, and this represents the ability to detect a difference if it is really there. Often power is further modified by multiplying it by 100, and presented as a percent. In other words, a β of 0.2 is often presented as a power of 0.8 or 80%.
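
Power can be calculated from the same parameters. The sketch below uses the large-sample Normal approximation for two equal-sized groups and ignores the negligible chance of a significant result in the wrong direction; the function name and the figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(alpha: float, critical_difference: float,
                     sd: float, n_per_group: int) -> float:
    """Approximate power (1 - beta) to detect the critical difference."""
    se = sd * sqrt(2.0 / n_per_group)                  # SE of the difference
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical z
    return NormalDist().cdf(abs(critical_difference) / se - z_alpha)


# Hypothetical study: 63 per group, critical difference 5, SD 10, alpha 0.05.
power = power_two_groups(0.05, 5.0, 10.0, 63)
print(f"power = {power:.3f} ({power * 100:.0f}%)")
```

With these figures the power is close to 0.8, or 80%, corresponding to a β of about 0.2.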

Misuse of Statistical Significance

Statistical significance, as proposed by Pearson, was used extensively for many years, but criticism appeared in the last twenty years of the twentieth century, as it was increasingly noticed that many conclusions drawn using this method were not reproducible.

The main problem is that the model is highly precise, and the conclusions drawn are only valid if the assumptions made and the procedures carried out comply with the requirements of the model. In practice, this compliance is very difficult to achieve.

A major difficulty is the correct estimation of the background or within group Standard Deviation at the planning stage, because this value is mostly unknown. Researchers often use published data or conduct pilot studies to obtain this value, but these are also sample estimates, which tend to be unstable and often differ from the value found in the data.

This problem is compounded when the within group Standard Deviation found in the data differs from that proposed during planning, and the observed SD is used instead of the one proposed during planning. Many researchers seem not to recognise that observed values are sample values and are subject to variation, so they must in most cases differ from the true values.

Once the within group Standard Deviation is changed, the sample size initially calculated is no longer valid, and the whole foundation of the model crumbles.
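
The instability of a planning SD can be illustrated by simulation. The sketch below repeatedly draws a pilot study of 20 subjects from a hypothetical population whose true SD is 10, then shows how the sample size requirement (same large-sample approximation, α 0.05, β 0.2, critical difference 5) changes with the pilot estimate. All figures are invented for illustration.

```python
import random
import statistics
from math import ceil
from statistics import NormalDist

random.seed(2)  # fixed seed so the run is repeatable

TRUE_SD = 10.0             # hypothetical true within group SD
PILOT_SIZE = 20            # hypothetical pilot study size
CRITICAL_DIFFERENCE = 5.0  # hypothetical critical difference
Z_SUM = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.8)


def required_n(sd: float) -> int:
    """Approximate number per group for alpha 0.05 (two-sided), beta 0.2."""
    return ceil(2.0 * (Z_SUM * sd / CRITICAL_DIFFERENCE) ** 2)


pilot_sds = []
for _ in range(1000):
    pilot = [random.gauss(0.0, TRUE_SD) for _ in range(PILOT_SIZE)]
    pilot_sds.append(statistics.stdev(pilot))

pilot_sds.sort()
low_sd, high_sd = pilot_sds[25], pilot_sds[974]  # middle 95% of pilot estimates

print(f"true SD {TRUE_SD:.1f} -> n per group {required_n(TRUE_SD)}")
print(f"pilot SD {low_sd:.1f} -> n per group {required_n(low_sd)}")
print(f"pilot SD {high_sd:.1f} -> n per group {required_n(high_sd)}")
```

Pilot studies drawn from exactly the same population can suggest sample sizes that differ severalfold, which is the instability described above.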

A common mistake is to use Pearson's model to calculate sample size, and then to calculate Type I Error (α) using the observations obtained, rejecting or accepting the null hypothesis using the 0.05 criterion. In doing so, the researcher commits the following errors.

Fisher's original model is mixed with Pearson's model, when Pearson's proposal was meant as an alternative to Fisher's.
An assumption that the sample size is sufficient to make an accurate estimation of within group variation, when the sample size was based on a supposedly known and constant within group (population) variation in the first place.
An assumption that the sample size is sufficient to make a true or false decision based on an α of 0.05 calculated from the data, when the sample size was calculated to make that decision by comparing the observed difference with the nominated critical value, on the assumption that the nominated Standard Deviation remains a representative population parameter regardless of what the observed values are.

Some researchers may even estimate the sample size that could be practically collected, then work backwards to nominate the critical difference or the within group Standard Deviation. In doing so, they present an apparently elegant and legitimate research proposal that can be accepted by regulating bodies, but in reality they have cheated and presented a research model with unrealistic assumptions, so that the results produced are unstable and misleading.

Recent Developments

Largely because of errors and misuse, Pearson's model has been increasingly criticised since the 1980s, and research results using this model considered unreliable. The following additional or alternative models have been proposed and are increasingly used.

Examining the power

Instead of asking the question whether the difference is statistically significant, one may ask what is the power of the data in detecting the difference observed with a nominated probability of Type I Error (α). A decision can then be made if the power in the data exceeds a nominated value, such as 0.8.
Power calculation after all the data has been collected makes no assumptions about population variations, and all the necessary parameters are in the data itself.

The confidence interval

Mathematically, the confidence interval uses the same calculations as Fisher's proposed Standard Error of the Mean (SE). This approach rejects Pearson's notion that a decision to accept or reject any hypothesis is valid or necessary, and provides a scale of confidence much like that proposed by Fisher.
The confidence interval can also address the major criticism of Fisher's original model. Instead of a probability of error statement, it allows a comparison of the result with the null hypothesis, and allows a decision whether the difference observed is so small that it can be considered trivial even if it is not null.
The use of the confidence interval has increasingly replaced statistical significance.
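
As a rough illustration, a confidence interval for the difference between two means can be built from the Standard Error of the Difference. The sketch assumes large samples (Normal approximation rather than t), and the function name and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def difference_ci(mean1, sd1, n1, mean2, sd2, n2, confidence=0.95):
    """Confidence interval for the difference between two independent means."""
    difference = mean1 - mean2
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)           # SE of the difference
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # 1.96 for 95%
    return difference - z * se, difference + z * se


# Hypothetical data: two groups of 50, SD 5 in each, means 52 and 50.
low, high = difference_ci(52.0, 5.0, 50, 50.0, 5.0, 50)
print(f"difference = 2.0, 95% CI = {low:.2f} to {high:.2f}")
```

With these invented figures the interval runs from about 0.04 to 3.96. It excludes zero, but its lower end is so close to zero that a reader can judge for themselves whether the true difference might be trivial, which a bare probability does not allow.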

Meta-analysis

The use of meta-analysis is based on the proposal that the results from any single research study are unstable, and conclusions should be based on combining the results from multiple research projects.
Meta-analysis is the mathematical arm of systematic review, which aims to develop evidence that can be the basis of clinical practice. As this is a very large and complex subject, it will not be discussed in depth here.
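
Although the subject is not covered in depth here, the basic idea of combining results can be shown with one simple and widely used approach, inverse-variance weighting under a fixed-effect assumption. This is only one of several pooling methods, and the function name and study figures are hypothetical.

```python
from math import sqrt


def pooled_estimate(estimates, standard_errors):
    """Fixed-effect pooled estimate, weighting each study by 1 / SE squared."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical differences from three studies, with their Standard Errors.
pooled, pooled_se = pooled_estimate([2.0, 4.0, 3.0], [1.0, 2.0, 1.5])
print(f"pooled difference = {pooled:.2f}, SE = {pooled_se:.2f}")
```

More precise studies (smaller SE) carry more weight, and the pooled SE is smaller than that of any single study, reflecting the greater stability of the combined result.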

Bayesian probability

Those who advocate this method believe it to be more useful than, and expect it to replace, the bulk of current statistical procedures (collectively labelled as frequentist statistics).
The idea is to move away from using the data to prove a proposed hypothesis true or false, and towards a logical method whereby a researcher's current belief is modified by a set of observations, accepting that, with additional information, belief can be further modified.
The two papers by Goodman (in the references panel) are recommended reading as they explain clearly the advantages of the Bayesian approach.
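
The idea of belief being modified by successive sets of observations can be illustrated with about the simplest Bayesian model available, a Beta prior for a proportion updated by counts of successes and failures. This is a toy sketch, not the method discussed in Goodman's papers, and the function names and data are hypothetical.

```python
def update_beta(prior_a: float, prior_b: float, successes: int, failures: int):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution, the expected proportion."""
    return a / (a + b)


belief = (1.0, 1.0)  # uniform prior: no initial preference for any proportion
print(f"prior belief: expected proportion {beta_mean(*belief):.3f}")

# Two hypothetical batches of observations arriving one after the other.
for successes, failures in [(7, 3), (12, 8)]:
    belief = update_beta(*belief, successes, failures)
    print(f"after {successes} successes, {failures} failures: "
          f"expected proportion {beta_mean(*belief):.3f}")
```

The posterior from the first batch becomes the prior for the second, so the belief is never declared true or false, only revised as information accumulates.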

References

This reference section presents some useful and interesting reading materials for those interested in understanding statistical probability in greater depth.

Siegel S and Castellan NJ Jr (1988) Nonparametric statistics for the behavioural sciences. McGraw-Hill International Editions Statistics series. ISBN 0-07-057357-3.

Most statistical textbooks provide a discussion on probability, alpha, beta, and so on, but most are far too complex, so that if you can understand them you probably do not need to read them. Chapter 1 of Siegel, however, is an exception in my opinion. Reading it was the first time I understood what it was all about.

Machin D, Campbell M, Fayers P, Pinol A (1997) Sample Size Tables for Clinical Studies. Second Ed. Blackwell Science. ISBN 0-86542-870-0

Machin's book provides many algorithms and tables for sample size and power calculations, and its introductory chapter provides a good explanation of what Pearson's model is all about.

Cohen J (1988) Statistical power analysis for the behavioral sciences. Second edition. Lawrence Erlbaum Associates, Publishers. London. ISBN 0-8058-0283-5

This book is frequently quoted as a reference, and I include it for completeness. The book is highly technical and not altogether reader friendly, and I would not recommend it for the beginner. Chapter 1 provides some definitions of the terms.

Goodman SN (1999) Towards evidence-based medical statistics 1: The p value fallacy. Ann Intern Med. 1999;130: p 995-1004

Goodman SN (1999) Towards evidence-based medical statistics 2: The Bayes Factor. Ann Intern Med. 1999;130: p 1005-1013

Although these two papers are mostly advocating Bayesian probability, they, particularly the first one, provide an excellent discussion on the problems and misuse of Type I and II error. I highly recommend reading these two papers.