Copyright @2014.
All Rights Reserved.
StatsToDo : Data Testing for Normal Distribution Explained


Related link :
Data Testing for Normal Distribution Program Page
Normal Distribution Plot Program Page


This page supports the programs in the Data Testing for Normal Distribution Program Page, which test the hypothesis that a data set is normally distributed. It also provides a basic description of the data set, similar to the data description procedure in SPSS. The example used here is the default example data in the Data Testing for Normal Distribution Program Page.

Data Description : The following statistics are calculated and presented.

  • The sample size, minimum and maximum values, mean, Standard Deviation, and Standard Error of the mean
  • Median, and percentile values from 5th percentile at 4 percentile intervals
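
The description above can be sketched in a few lines of Python. This is a minimal illustration, not the program's own code: the function names, the sample values, the choice of percentiles and the interpolation rule for percentiles are all assumptions made for the example.

```python
import math
import statistics

def describe(data):
    """Basic description of a sample: size, range, mean, SD, SE and median."""
    n = len(data)
    sd = statistics.stdev(data)              # sample SD (n - 1 denominator)
    return {
        "n": n,
        "min": min(data),
        "max": max(data),
        "mean": statistics.fmean(data),
        "sd": sd,
        "se": sd / math.sqrt(n),             # standard error of the mean
        "median": statistics.median(data),
    }

def percentile(data, p):
    """p-th percentile (0 to 100) by linear interpolation between ordered
    values. Other conventions exist; this one is an assumption."""
    xs = sorted(data)
    pos = (len(xs) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

# Hypothetical data, for illustration only.
sample = [2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.3, 5.9, 6.4, 7.7]
summary = describe(sample)
selected_percentiles = {p: percentile(sample, p) for p in (5, 25, 50, 75, 95)}
```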

Simple Tests of Normal Distribution : The following were the most commonly used tests of normal distribution until more complex, computationally intensive algorithms became available.

  • Skewness evaluates how symmetrically the data are distributed around the mean. Data truncated in the lower values and with a long tail in the higher values have a positive skew, and the reverse a negative skew. Calculations produce a measure of skewness and its 95% confidence interval. If this interval includes zero (0) then there is no significant skew.
  • Kurtosis evaluates whether the fall off in frequencies away from the mean conforms to the normal distribution. Where an excess of data lies near the mean, the distribution curve is excessively peaked. Where data are spread evenly across a wide range, the distribution curve is flattened. Calculations produce a measure of kurtosis, scaled so that a normal distribution scores zero, and its 95% confidence interval. If this interval includes zero (0) then there is no significant departure from the kurtosis of a normal distribution.
  • Significant difference between mean and median. This evaluates the probability of z, where z = (mean - median) / Standard Error of the mean. This is more an alternative test of skewness, and if p<=0.05 then a significant skew exists.
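
The three simple tests can be sketched as follows. The source does not state which estimators the program uses, so this example assumes the bias-adjusted sample skewness and excess kurtosis with their usual standard errors under normality, a 95% interval of plus or minus 1.96 standard errors, and a two-sided p value for z. The function name and the example data are hypothetical.

```python
import math
from statistics import NormalDist, fmean, median, stdev

def simple_normality_checks(data):
    """Skewness and excess kurtosis (each with a 95% CI) plus the
    mean-median z test. Estimator choice is an assumption."""
    n = len(data)
    mean, sd = fmean(data), stdev(data)
    z = [(x - mean) / sd for x in data]

    # Bias-adjusted sample skewness and excess kurtosis (zero for a normal curve).
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

    # Standard errors of both measures when the data are normal.
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

    # z = (mean - median) / standard error of the mean, with a two-sided p.
    z_mm = (mean - median(data)) / (sd / math.sqrt(n))
    p_mm = 2 * (1 - NormalDist().cdf(abs(z_mm)))

    return {
        "skew": skew,
        "skew_ci": (skew - 1.96 * se_skew, skew + 1.96 * se_skew),
        "kurt": kurt,
        "kurt_ci": (kurt - 1.96 * se_kurt, kurt + 1.96 * se_kurt),
        "z_mean_median": z_mm,
        "p_mean_median": p_mm,
    }

# Hypothetical data: one symmetric set and one with a long upper tail.
symmetric = simple_normality_checks([1, 2, 3, 4, 5, 6, 7, 8, 9])
right_tailed = simple_normality_checks([1, 1, 1, 2, 2, 3, 4, 8, 15, 30])
```

In the symmetric set the skewness interval straddles zero and the mean equals the median, while the long upper tail in the second set gives a positive skewness and a mean above the median.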

The following are more formal tests of normal distribution.

The Chi Square Goodness of Fit. The data are divided into groups 1 standard deviation wide, and the chi square test is used to see whether the numbers in the groups differ significantly from those expected if the data are normally distributed. The program also produces a normal distribution plot so that users can visualize the actual distribution of the data.
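
A minimal sketch of this grouping and test follows. It assumes six groups with boundaries at -2, -1, 0, +1 and +2 standard deviations from the mean (open-ended at both tails) and 3 degrees of freedom (six groups, less one, less two for the estimated mean and SD). The source does not state how the program groups the tails or counts degrees of freedom, so these are illustrative choices, as are the function name and the example data. The p value uses the closed-form upper tail of the chi square distribution, which is valid for 3 degrees of freedom only.

```python
import math
from statistics import NormalDist, fmean, stdev

def chi_square_normal_fit(data):
    """Chi square goodness of fit to a normal curve, using groups 1 SD wide."""
    n = len(data)
    mean, sd = fmean(data), stdev(data)
    fitted = NormalDist(mean, sd)

    # Boundaries at -2..+2 SD; the two outer groups are open-ended (assumption).
    cuts = [mean + k * sd for k in (-2, -1, 0, 1, 2)]
    edges = [-math.inf] + cuts + [math.inf]

    observed = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(observed)):
            if edges[i] <= x < edges[i + 1]:
                observed[i] += 1
                break

    # Expected counts from the fitted normal distribution.
    cum = [0.0] + [fitted.cdf(c) for c in cuts] + [1.0]
    expected = [n * (cum[i + 1] - cum[i]) for i in range(len(observed))]

    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1 - 2   # two parameters estimated from the data

    # Upper-tail probability of chi square with exactly 3 degrees of freedom.
    p = (math.erfc(math.sqrt(chi2 / 2))
         + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))
    return {"chi2": chi2, "df": df, "p": p,
            "observed": observed, "expected": expected}

# Hypothetical, near-normal data: 200 evenly spaced normal quantiles.
near_normal = [NormalDist(50, 10).inv_cdf((i + 0.5) / 200) for i in range(200)]
fit = chi_square_normal_fit(near_normal)
```

With small samples the expected counts in the outer groups become very small, which makes the chi square approximation unreliable, so this sketch is better suited to larger data sets.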

The Kolmogorov-Smirnov test is the most commonly accepted test to see whether a set of data violates the assumption of normality. The data are first placed in order of magnitude, and the cumulative probability for each data point is calculated and compared with the theoretical cumulative probability from a normal distribution. The largest difference between these two probabilities is tested against the sample size. The result of the test shows whether the data deviate significantly from normality.
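
The steps described above can be sketched as follows. This example assumes the normal distribution is fitted with the sample mean and SD, and it uses the asymptotic Kolmogorov distribution for the p value. When the mean and SD are estimated from the same data, that p value tends to be conservative, and the source does not say which correction, if any, the program applies. The function name and the example data are hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev

def ks_normal(data):
    """Largest gap D between empirical and fitted normal cumulative
    probability, with an asymptotic p value (an approximation)."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))

    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The empirical curve steps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)

    lam = math.sqrt(n) * d
    if lam < 0.2:
        p = 1.0   # the series below is numerically unstable here; p is ~1
    else:
        p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                    for k in range(1, 101))
        p = min(1.0, max(0.0, p))
    return d, p

# Hypothetical data: evenly spaced quantiles of a normal and of an
# exponential distribution (the latter is strongly right-skewed).
near_normal = [NormalDist(50, 10).inv_cdf((i + 0.5) / 200) for i in range(200)]
skewed = [-math.log(1 - (i + 0.5) / 200) for i in range(200)]

d_normal, p_normal = ks_normal(near_normal)
d_skewed, p_skewed = ks_normal(skewed)
```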

The Shapiro-Francia test also tests whether the data deviate significantly from normality, and has been argued by some to be a better test than the Kolmogorov-Smirnov test when the sample size is small. This is because, with smaller sample sizes (n<1000), the maximum difference between theoretical and actual cumulative probability may be more variable, so the Kolmogorov-Smirnov test can be less stable.
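
The Shapiro-Francia statistic, usually written W', is the squared correlation between the ordered data and the normal scores expected for a sample of that size. The sketch below computes W' only. It approximates the expected scores with Blom's formula, which is a common choice but an assumption here, and it leaves out the significance calculation, which depends on published approximations not reproduced on this page. The function name and the example data are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def shapiro_francia_w(data):
    """Squared correlation between ordered data and approximate
    expected normal order statistics (Blom scores)."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    scores = [std_normal.inv_cdf((i - 0.375) / (n + 0.25))
              for i in range(1, n + 1)]

    mx, ms = fmean(xs), fmean(scores)
    sxy = sum((x - mx) * (s - ms) for x, s in zip(xs, scores))
    sxx = sum((x - mx) ** 2 for x in xs)
    sss = sum((s - ms) ** 2 for s in scores)
    return sxy * sxy / (sxx * sss)

# Hypothetical data: near-normal and right-skewed samples of 200 values.
near_normal = [NormalDist(50, 10).inv_cdf((i + 0.5) / 200) for i in range(200)]
skewed = [-math.log(1 - (i + 0.5) / 200) for i in range(200)]

w_normal = shapiro_francia_w(near_normal)
w_skewed = shapiro_francia_w(skewed)
```

Values of W' close to 1 indicate data close to normal, and smaller values indicate a larger departure.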

The P Plot : A plot of cumulative probability against the data. A data set that has an exactly normal distribution will plot along the diagonal, so the plot provides a visual check of how closely the data follow a normal distribution.
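
One common way to build such a plot is to pair, for each ordered data point, its observed cumulative probability with the cumulative probability expected from a normal distribution fitted to the data. The sketch below computes those pairs without drawing them. The plotting position (i - 0.5)/n, the function name and the example data are assumptions for illustration, and the program's own plot may be constructed differently.

```python
from statistics import NormalDist, fmean, stdev

def pp_plot_points(data):
    """(observed, theoretical) cumulative probability for each ordered value."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    return [((i - 0.5) / n, fitted.cdf(x)) for i, x in enumerate(xs, start=1)]

# Hypothetical near-normal data: 40 evenly spaced normal quantiles.
sample = [NormalDist(50, 10).inv_cdf((i + 0.5) / 40) for i in range(40)]
points = pp_plot_points(sample)

# Largest vertical distance of any point from the diagonal.
max_gap = max(abs(observed - theoretical) for observed, theoretical in points)
```

Plotting the pairs with any charting tool, together with the line from (0, 0) to (1, 1), gives the visual comparison described above.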

The Correlation between actual and theoretical distribution also provides a measure of how close the data are to a normal distribution. This correlation is often used to optimise a transformation towards normal distribution, but has limited practical use in determining whether the assumption of normality is valid in a set of data. This is because the correlation coefficient tends to be high in any case, and it is difficult to determine a cut off point for a decision.
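
The use of this correlation to choose a transformation can be sketched as follows. The example correlates the ordered data with approximate expected normal scores (Blom's formula, an assumption) and compares three candidate transformations on a hypothetical right-skewed data set. The function name, the candidate list and the data are illustrative only.

```python
import math
from statistics import NormalDist, fmean

def normal_plot_correlation(data):
    """Pearson correlation between ordered data and approximate normal scores."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    scores = [std_normal.inv_cdf((i - 0.375) / (n + 0.25))
              for i in range(1, n + 1)]
    mx, ms = fmean(xs), fmean(scores)
    sxy = sum((x - mx) * (s - ms) for x, s in zip(xs, scores))
    sxx = sum((x - mx) ** 2 for x in xs)
    sss = sum((s - ms) ** 2 for s in scores)
    return sxy / math.sqrt(sxx * sss)

# Hypothetical right-skewed data (quantiles of a log-normal distribution).
skewed = [math.exp(NormalDist().inv_cdf((i + 0.5) / 50)) for i in range(50)]

# Candidate transformations; the one with the highest correlation is preferred.
candidates = {"none": lambda x: x, "sqrt": math.sqrt, "log": math.log}
r_by_transform = {name: normal_plot_correlation([f(x) for x in skewed])
                  for name, f in candidates.items()}
best = max(r_by_transform, key=r_by_transform.get)
```

The correlation only ranks the candidates against each other. As noted above, it does not by itself provide a cut off for deciding whether the data are normal.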

