Content Disclaimer Copyright @2014. All Rights Reserved. 
Concordance, in the context of statistics, is the study of agreement between judges, scales, and measurements. This page provides support for all concordance programs provided by StatsToDo.
The information on this page is organized in separate panels according to the numerical nature of the measurements, and consists of the following:
Nominal Data : where the numbers used are labels or names, and the scales are unrelated. Typical nominal numbers are 0=no illness, 1=appendicitis, 2=urinary infection, or 1=Caucasian, 2=Asian, 3=African, and so on. StatsToDo provides 1 program evaluating concordance for nominal data, Kappa.
Binary Data : where the numbers used are 0/1, representing binary outcomes of no/yes, false/true, male/female, negative/positive, and so on. StatsToDo provides 1 program evaluating concordance for binary data, the Kuder Richardson Coefficient.
Ordinal Data : where the numbers are whole integers representing ranks or order, so that 3>2>1, but the distances between the numbers are not specified, so that 3-2 is not necessarily the same as 2-1. The simplest ordinal scale is a 3 point scale of 0=no, 1=perhaps, 2=yes. The most commonly used ordinal scale is Likert, where 0=strongly disagree, 1=disagree, 2=neutral, 3=agree, and 4=strongly agree. StatsToDo provides 3 programs evaluating concordance for ordinal data.
Continuous Data : where the numbers are continuous (or near continuous) measurements that are normally distributed. StatsToDo provides 3 programs evaluating concordance for continuous data.
Historical notes :
In 1937, Kuder and Richardson proposed a coefficient to evaluate the reliability of measurements composed of multiple binary items. In 1941 Hoyt modified this coefficient, adjusting it for continuity, and named it the Kuder Richardson Hoyt Coefficient. Cronbach in 1951 showed that this coefficient can be used generally in all scaled measurements. As he intended this to be a starting point for developing even better indices, he named it Coefficient Alpha. This index is now known as Cronbach's Alpha, and is a widely accepted measurement of the internal consistency (reliability) of a multivariate measurement composed of correlated items. If Cronbach's Alpha is applied to binary data, the result is the same as the Kuder Richardson Coefficient (KR 20).
The initial Cronbach's Alpha, calculated from the covariance matrix, is now known as the Unstandardized Alpha. This value tends to be unstable, and is influenced by the scales of measurement used. A better Alpha is considered to be the Standardized Alpha, calculated from the correlation matrix. This is thought to be better because all variables are standardized to a mean of 0 and a Standard Deviation of 1, so the resulting Alpha is independent of the scales used.
Both indices can be used to measure the internal consistency of multiple-item measurements, representing the averaged correlation between the items. As multiple-item measurements are in theory repeated measurements of the same thing, these indices represent the reliability of the overall set of measurements.
Indices of reliability are often used in the early stages of developing a multiple-item measurement, to ensure that all the items measure a common concept. Items are added, removed, and modified according to whether the indices of reliability improve, usually until Alpha is greater than 0.7.
A recent development is the calculation of the Standard Error of Alpha, from which the confidence interval can be derived.
This algorithm, by Duhachek and Iacobucci, is now included in StatsToDo. The development of the Standard Error measurement allows statistical comparison and significance testing. As well as the 95% confidence interval, z = Alpha / SE can be calculated, and the probability that Alpha does not differ from zero can be obtained from the normal z distribution.
Example
Data entry and interpretation of results are best demonstrated using the default example data from the Cronbach's Alpha Program Page. In this example, we administered 4 multiple choice questions to 20 students, using 0 for a wrong answer and 1 for a correct answer. The data is therefore a table of 20 rows, each from a student, and 4 columns, each for one of the questions (tests). We wish to know if the tests are similar in difficulty, that is, if the correct and incorrect answers agree.
The program first produces the covariance matrix, the diagonal of which is the variance of each measurement (test), and the off diagonal cells the covariance between pairs of measurements (tests). Please note that the covariance matrix can also be used as a second option for data entry.
The program then calculates the unstandardized Alpha, which is
Unstandardized Alpha = 0.61, n=20, SE=0.14, 95%CI=0.33 to 0.89
The program then converts the covariance matrix to a correlation matrix, from which the standardized Alpha is produced.
Standardized Alpha = 0.60, n=20, SE=0.16, 95%CI=0.28 to 0.91
Sample Size Calculations for Cronbach's Alpha
StatsToDo provides two sets of sample size programs related to Cronbach's Alpha
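The two versions of Alpha described in this panel can be sketched in a few lines of Python. This is a minimal illustration and not the StatsToDo program: the function names are hypothetical, the covariance matrix uses the usual n-1 denominator, and the standardized Alpha is computed from the mean off diagonal correlation. The Standard Error algorithm of Duhachek and Iacobucci is not reproduced; `alpha_interval` only applies the Alpha ± z × SE step to an SE obtained elsewhere.

```python
from statistics import mean


def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator); rows = subjects, columns = items."""
    n, k = len(rows), len(rows[0])
    cols = [[r[j] for r in rows] for j in range(k)]
    means = [mean(c) for c in cols]
    return [[sum((cols[a][i] - means[a]) * (cols[b][i] - means[b])
                 for i in range(n)) / (n - 1)
             for b in range(k)] for a in range(k)]


def unstandardized_alpha(cov):
    """Alpha from the covariance matrix: k/(k-1) * (1 - trace / sum of all cells)."""
    k = len(cov)
    total = sum(sum(row) for row in cov)
    trace = sum(cov[j][j] for j in range(k))
    return k / (k - 1) * (1 - trace / total)


def standardized_alpha(cov):
    """Alpha from the correlation matrix, using the mean off diagonal correlation.

    Assumes every item has non-zero variance.
    """
    k = len(cov)
    sd = [cov[j][j] ** 0.5 for j in range(k)]
    r = [cov[a][b] / (sd[a] * sd[b])
         for a in range(k) for b in range(k) if a != b]
    r_bar = sum(r) / len(r)
    return k * r_bar / (1 + (k - 1) * r_bar)


def alpha_interval(alpha, se, z=1.96):
    """Confidence interval as alpha +/- z * se (se must be supplied by the caller)."""
    return alpha - z * se, alpha + z * se
```

With two items that are perfectly correlated but measured on different scales (for example one item being double the other), `standardized_alpha` returns 1 while `unstandardized_alpha` is lower, which illustrates the scale dependence mentioned above.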
Kappa for nominal data was first described by Fleiss in 1969. Fleiss went on to describe another Kappa for ordinal data, and his name is often associated with this second Kappa. The first Kappa, which is discussed in this panel, is generally known as Kappa for Nominal Data. The program is in the Kappa for Nominal Data Program Page.
Kappa is a measurement of concordance or agreement between two or more judges, in the way they classify or categorise subjects into different groups or categories. The following terms are often used
We used a class of 10 students in their final school year. Each of 5 counsellors interviews every student, and classifies them into one of the following categories.
The alternative data entry is a table of counts. The table has 10 rows, representing the 10 students, and 3 columns representing the 3 classifications. Each cell contains the number of times that student is classified in that category. In this example, the first column is for caring profession, the second engineering, and the third business, so that
Kappa in this example is 0.41, with a Standard Error of 0.08, and a 95% confidence interval of 0.27 to 0.57.
This Kappa is a measurement of agreement between the 5 counsellors. Conventionally, a Kappa of <0.2 is considered poor agreement, 0.21-0.4 fair, 0.41-0.6 moderate, 0.61-0.8 strong, and more than 0.8 near complete agreement.
Given that Kappa is an estimate from a sample, the Standard Error (se) provides an estimate of error. The 95% confidence interval is Kappa ± 1.96 se. If a different confidence interval is required, the table for probability of z in the Probability of z Explained and Table Page can be consulted.
Although concordance is usually used as a scalar measurement of agreement, a 95% confidence interval of Kappa that does not cross the zero value does allow a conclusion that significant concordance exists.
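The calculation of Kappa from the table of counts can be sketched as follows. This is a minimal illustration using the commonly published multi-rater formula for nominal data, which is assumed (not confirmed) to match the StatsToDo program; the function names are hypothetical and the Standard Error is not calculated.

```python
def ratings_to_counts(ratings, categories):
    """Convert a subjects-by-raters table of labels into a subjects-by-categories table of counts."""
    return [[row.count(c) for c in categories] for row in ratings]


def kappa_nominal(counts):
    """Kappa for nominal data from a table of counts.

    counts[i][j] is the number of raters who placed subject i in category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject must be rated by the same number of raters")
    n_categories = len(counts[0])
    # Agreement within each subject: proportion of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_observed = sum(p_i) / n_subjects
    # Agreement expected by chance, from the overall share of each category.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(n_categories)]
    p_expected = sum(p * p for p in p_j)
    return (p_observed - p_expected) / (1 - p_expected)
```

Either data entry format leads to the same result, because the table of raw classifications is first converted to the table of counts with `ratings_to_counts`.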
The Kuder Richardson Coefficient of reliability (KR 20) is used to test the reliability of binary measurements such as exam questions, to see if the items within the instrument obtain the same binary (no/yes, right/wrong) results over a population of testing subjects.
The formula for the coefficient can easily be obtained from Wikipedia on the Internet.
Please note that the KR 20 was first described in 1937. Hoyt in 1941 modified the formula so that it can be applied to measurements that are not binary. Hoyt's modification was eventually popularised and is now known as Cronbach's Alpha. Cronbach's Alpha, when applied to binary data, will therefore produce the same result as KR 20. Cronbach's Alpha is now much preferred, and is discussed in its own panel on this page.
Example
Data input and interpretation of results are best demonstrated using the default example in the Kuder Richardson Coefficient for Binary Data Program Page.
We have 4 multiple choice questions (T1 to T4), administered to 5 students. 0 represents a wrong answer and 1 a correct answer, as shown in the table on the left.
The data set to be used is as shown in the table to the right, and the result is KR 20 = 0.75.
The interpretation of the KR 20 value is similar to that of Kappa. A KR 20 of <0.2 is considered poor agreement, 0.21-0.4 fair, 0.41-0.6 moderate, 0.61-0.8 strong, and more than 0.8 near complete agreement. The original descriptions of KR 20 provided no test of statistical significance or confidence interval, although these can be obtained using the Cronbach's Alpha algorithm.
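The KR 20 calculation can be sketched as below. This is a minimal illustration using the commonly published formula, KR 20 = k/(k-1) × (1 - Σpq / variance of total scores), with the variance of the total scores taken with an n denominator so that it is consistent with the p × q item variances; whether the StatsToDo program uses the same denominator is an assumption. The function name and the data in the usage note are hypothetical and are not the example table from the program page.

```python
def kr20(rows):
    """Kuder Richardson Coefficient (KR 20) for a subjects-by-items table of 0/1 scores."""
    n, k = len(rows), len(rows[0])
    # p = proportion answering each item correctly, q = 1 - p.
    p = [sum(r[j] for r in rows) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    # Variance of each subject's total score (n denominator, matching p * q).
    totals = [sum(r) for r in rows]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    return k / (k - 1) * (1 - sum_pq / var_total)
```

For example, calling `kr20` on a made-up table of 5 students and 4 questions in which each student answers one fewer question correctly than the previous one, `[[1,1,1,1],[1,1,1,0],[1,1,0,0],[1,0,0,0],[0,0,0,0]]`, gives 0.8 (up to floating point rounding).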
Cohen's Kappa
Cohen's Kappa is a measurement of concordance or agreement between two raters or methods of measurement. The method can be applied to data that are not normally distributed, even binary (no/yes), but is best suited to a closed ended ordinal scale, such as the 5 point Likert Scale.
The algorithm is well described in the original papers, in text books, and on the Internet (see references). The book by Fleiss is particularly useful as it combines all the developments and enhancements, including the algorithm for estimating variance.
There are two ways of calculating Cohen's Kappa, and these produce different results. The first is Cohen's original 1960 algorithm, now generally known as the unweighted Kappa. The second is the weighted method, described by Cohen later in 1968, which includes a weighting for each cell, where the weight for cell i,j is w_{ij} = 1 - |i - j| / (g - 1), g being the number of categories of scores. Cohen argued that the weighted Kappa should be used particularly if the variables have more than two categories (more than yes and no), because the distance from agreement should be taken into consideration. The results of both calculations are presented, and the recommendation is to use the weighted value.
Fleiss's Kappa
Fleiss's Kappa is an extension of Cohen's Kappa to evaluate concordance or agreement between multiple raters, but no weighting is applied. Therefore, Fleiss's Kappa is similar to Cohen's unweighted Kappa (except for rounding errors) if the same data from two raters are submitted to the Fleiss algorithm.
Nomenclature
Ordinal data : These are data sets where the numbers are in order, but the distances between numbers are unstated. In other words 3 is bigger than 2 and 2 is bigger than 1, but 3-2 is not necessarily the same as 2-1. A common ordinal scale is the Likert scale, where 1=strongly disagree, 2=disagree, 3=neutral, 4=agree, and 5=strongly agree.
Although these numbers are in order, the difference between strongly agree and agree (5-4) is not necessarily the same as that between disagree and strongly disagree (2-1).
Instrument : any method of measurement. For example, a ruler, a Likert Scale (5 point scale from strongly disagree to strongly agree), or a machine (e.g. ultrasound measurement of bone length).
Raters : one or more of the instruments. A rater is usually a person, hopefully trained, who determines what score should be given to a subject. A carer evaluating a pain score is a rater, and a judge in a beauty contest is a rater.
Subjects : the subjects of the measurements. They are patients, school children, members of the public, monkeys, rats, and so on.
Scores or measurements : the quantities produced by the instruments or raters. Measurements usually means something that is measured physically or chemically. Scores are usually results produced by human decision.
Concordance : This usually means how much the scores or measurements produced by different instruments agree. Commonly, concordance is expressed as a number between 0 and 1, where 0 represents no agreement at all, and 1 represents complete replication. In some concordance measurements a negative value may be produced, which signifies opposite results.
Examples
Example 1. Cohen's Kappa : Data entry option 1
The data consists of two obstetricians (raters) palpating the abdomens of 30 pregnant women (subjects), and rating each baby as growth retarded (0), normal (1) or macrosomic (2). The data is a table, where each row represents a pregnant abdomen palpated, the two columns represent the two obstetricians, and the value in each cell is the classification given to that baby by that obstetrician.
The result consists firstly of the display of the count matrix, with rows representing obstetrician 1's scoring, columns obstetrician 2's scoring, and each cell the number of cases so scored by the two obstetricians.
The diagonal cells represent where the two agree, and the other cells where the two do not agree. Weighted Cohen's Kappa = 0.28, 95%CI = -0.01 to 0.57. As the 95% confidence interval overlaps the null value (0), the conclusion is that no significant agreement between the two obstetricians has been shown.
Example 2. Cohen's Kappa : Data Entry option 2
In this case, the data is a square matrix of counts, where two midwives reviewed 85 women at the beginning of labour as to how likely the delivery will require a Caesarean Section: no risk at all (1), minimal risk (2), high risk (3) and almost certain (4). Rows represent the evaluation of midwife 1, and columns the evaluation of midwife 2. The diagonals are where the two agreed: no risk (25), minimal risk (9), high risk (12), and certain (21). The cells below the diagonal are the counts where midwife 1 evaluated a higher risk than midwife 2, and those above the diagonal the other way around.
Weighted Cohen's Kappa = 0.82, 95% confidence interval = 0.74 to 0.90. As this interval does not overlap the null value (0), the conclusion that the risk assessments of these two midwives significantly agree can be made.
Example 3. Fleiss's Kappa : Data Entry option 1 and 2
The data table is similar to that for option 1 in Cohen's Kappa, except that more than two raters are involved. In this example, we have 5 midwives examining 10 pregnant abdomens, and classifying each baby as growth retarded (0), normal (1) or macrosomic (2). The data is therefore a 5 column table, each row representing a baby being assessed, each column one of the midwives, and each cell containing the score.
The program first creates a count array, where each row represents a baby, each column represents a score (in this case 3 scores of 0, 1, and 2), and each cell the number of times that baby received that score. The sum of each row must therefore be 5 for the 5 rating midwives. In data entry option 2, the counting table can be entered directly.
For both data entry options, Fleiss's Kappa for this example is 0.42, 95% confidence interval = 0.28 to 0.56. As this interval does not cross the null (0) value, the conclusion that the midwives agree significantly with each other can be made.
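The unweighted and weighted Cohen's Kappa described in this panel can be sketched as follows. This is a minimal illustration, not the StatsToDo program: the function names are hypothetical, the weights are the linear weights w_{ij} = 1 - |i - j| / (g - 1) given above, categories are assumed to be coded 0 to g-1, and the variance and confidence interval are not calculated.

```python
def count_matrix(pairs, g):
    """Build the g-by-g count matrix from (rater 1 score, rater 2 score) pairs coded 0..g-1."""
    table = [[0] * g for _ in range(g)]
    for a, b in pairs:
        table[a][b] += 1
    return table


def cohen_kappa(table, weighted=True):
    """Cohen's Kappa from a square count matrix (rows = rater 1, columns = rater 2)."""
    g = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(table[i]) / n for i in range(g)]
    col_p = [sum(table[i][j] for i in range(g)) / n for j in range(g)]

    def weight(i, j):
        if weighted:
            return 1 - abs(i - j) / (g - 1)   # linear agreement weight
        return 1.0 if i == j else 0.0         # unweighted: exact agreement only

    p_observed = sum(weight(i, j) * table[i][j] / n
                     for i in range(g) for j in range(g))
    p_expected = sum(weight(i, j) * row_p[i] * col_p[j]
                     for i in range(g) for j in range(g))
    return (p_observed - p_expected) / (1 - p_expected)
```

With only two categories the weighted and unweighted values are identical, because every disagreement then has a weight of 0. With three or more categories the weighted value gives partial credit to near misses, which is the argument Cohen made for preferring it.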
Kendall's coefficient of concordance for ranks (W) calculates agreement between 3 or more rankers according to the ranking order each placed on the individuals being ranked.
The idea is that n subjects are ranked (0 to n-1) by each of the rankers, and the statistic evaluates how much the rankers agree with each other.
The program from the Kendall's W for Ranks Program Page modifies the input, so that the values entered by each ranker are ranked before calculation. This means the program can be used when the input data are scores, measurements, or ranks, and even if the scales of measurement used by different rankers are different (providing that they are ranking the same issues and in the same direction). For example, in cases of thyroid dysfunction, how much levels of T3, T4, and TSH agree with each other can be evaluated, as each measurement is converted to ranks before comparison.
Kendall's W is therefore useful in that it provides a calculation of concordance for many measurements without any assumption of distribution pattern. Data entry and interpretation are best demonstrated using the example data from the Kendall's W for Ranks Program Page.
Example
In a beauty contest with 10 finalists, 3 judges are to evaluate their relative beauty, with the least beautiful scoring 0 and the most beautiful scoring 9. The data is therefore a table, each row representing a finalist, each column one of the judges, and each cell containing the rank in beauty that judge gives to that contestant.
The results are Kendall W = 0.43, Chi Square = 11.65, degrees of freedom = 9, p = 0.23
Please note : Here the test of statistical significance is for significant agreement and not significant difference. This means that this set of results indicates that the 3 judges have no significant agreement on their ranking of beauty.
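The ranking step and the calculation of W can be sketched as follows. This is a minimal illustration, not the StatsToDo program: the function names are hypothetical, tied values are given their average rank, no correction for ties is applied to W, and the Chi Square is taken as m(n-1)W with n-1 degrees of freedom. Ranks here run from 1 to n rather than 0 to n-1, which does not change W.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


def kendall_w(rows):
    """Kendall's W for a table with rows = subjects and columns = rankers.

    Returns (W, chi square, degrees of freedom). No tie correction is applied.
    """
    n, m = len(rows), len(rows[0])
    # Each ranker's column is converted to ranks before comparison.
    ranked_cols = [average_ranks([r[j] for r in rows]) for j in range(m)]
    rank_sums = [sum(col[i] for col in ranked_cols) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    w = 12 * s / (m ** 2 * (n ** 3 - n))
    return w, m * (n - 1) * w, n - 1
```

Because each column is ranked separately, columns measured on different scales (as in the T3, T4, and TSH example) can be entered directly, provided they point in the same direction.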
There are a number of simple statistical methods of evaluating agreement between normally distributed measurements, with correlation, regression, and paired differences immediately coming to mind. Each of these methods, however, evaluates only one aspect of the difference or agreement between two sets of measurements.
Most users, however, require methods that evaluate agreement in a more nuanced and comprehensive manner. StatsToDo presents 3 such methods, and they are Intraclass Correlation, the Bland Altman Plot, and Lin's Concordance Index.
Bland and Altman (1986) discussed in detail how agreement between two methods of measuring the same thing can be evaluated. The details and terminology of the methods are described in 3 references in the reference section, and will not be repeated here.
The rest of this panel will take users through the calculations presented in the Agreement Program Page.
Data Entry
The data is a matrix with two columns. Each row contains the measurements from a case, and the two columns, separated by white space, are the measurements from the two methods being evaluated (v1 and v2).
Initial Evaluation
The first step is to estimate the paired mean ((v1+v2)/2) and paired difference (v1-v2) of each pair. The original data and the estimates are presented as in the table to the right. The rest of the evaluations use the paired mean and paired difference.
Statistical Evaluations
The first row is the mean and Standard Deviation of the paired difference (bias and distribution of bias), and the t test for significant departure from 0. The paired difference is named bias in the context of this evaluation. If significant bias exists (p_{bias}<0.05), then the two measurements produce different results, and this has to be corrected for when they are used.
The second row is the result of a linear regression analysis where paired difference = a + b(paired mean). The regression coefficient b represents changes in paired differences related to changing paired means, and is named proportional bias. If significant proportional bias exists (p_{b}<0.05), then the paired differences change significantly with the values of measurement, so the bias cannot be considered stable.
The remainder of the table estimates the confidence intervals of bias. Please note that these results differ from the common references of Bland & Altman, and Hanneman (see references) as follows
95% CI Precision Estimates by Bland & Altman Approximations
The results of analysis using the algorithm described in the paper by Bland & Altman are shown in the table to the right. These are presented because the Bland and Altman plot is now commonly used to evaluate agreement between measurements.
Step 1, the 95% confidence interval, for all sample sizes, is calculated using t=2, so that 95% CI = bias ±2SD. From the example used in Agreement Program Page
Step 2 is to calculate t, based on p=0.05 for the 95% confidence interval, and the sample size of the data. In the example from Agreement Program Page, the sample size is 30, so t=2.0452. This t value is then used to calculate the precision of the mean and of the 95% confidence interval borders.
Step 3 is to calculate the precision of the mean bias
Step 4 is to calculate the precision of the confidence interval borders
The Bland Altman Plot
After numerical analysis, the Agreement Program Page produces the Bland Altman plot, to assist the user in interpreting the agreement parameters. The plot follows that described in the reference, specifically
Bland and Altman (see references) suggested that duplicate or multiple measurements be taken from each case, and the variation between measurements compared with that within measurements. As these analyses represent an additional level of complexity and precision, they are not presented.
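The numerical steps in this panel can be sketched as follows. This is a minimal illustration and not the Agreement Program Page itself: the function name and dictionary keys are hypothetical, the Standard Errors used for the precision of the bias (SD/√n) and of the borders (√(3SD²/n)) are the approximations published by Bland & Altman (1986) and are assumed to be what Steps 3 and 4 refer to, and the t value must be supplied by the caller (for example 2.0452 for a sample size of 30). Probability values are not calculated.

```python
from math import sqrt


def bland_altman(v1, v2, t_crit):
    """Bias, proportional bias, limits of agreement, and their precision.

    v1, v2 : paired measurements from the two methods
    t_crit : two tail t value for the chosen confidence and n - 1 degrees of freedom
    """
    n = len(v1)
    means = [(a + b) / 2 for a, b in zip(v1, v2)]      # paired means
    diffs = [a - b for a, b in zip(v1, v2)]            # paired differences
    bias = sum(diffs) / n
    sd = sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    se_bias = sd / sqrt(n)
    # Regression: paired difference = a + b * paired mean (b = proportional bias).
    mean_of_means = sum(means) / n
    sxx = sum((x - mean_of_means) ** 2 for x in means)
    sxy = sum((x - mean_of_means) * (d - bias) for x, d in zip(means, diffs))
    b = sxy / sxx
    a = bias - b * mean_of_means
    # Step 1: borders of agreement using t = 2 for all sample sizes.
    lower, upper = bias - 2 * sd, bias + 2 * sd
    # Steps 3 and 4: precision of the bias and of each border.
    se_border = sqrt(3 * sd ** 2 / n)
    return {
        "bias": bias, "sd": sd, "t_bias": bias / se_bias,
        "intercept": a, "slope": b,
        "lower": lower, "upper": upper,
        "bias_ci": (bias - t_crit * se_bias, bias + t_crit * se_bias),
        "lower_ci": (lower - t_crit * se_border, lower + t_crit * se_border),
        "upper_ci": (upper - t_crit * se_border, upper + t_crit * se_border),
    }
```

The t value is passed in rather than computed because the Python standard library has no t distribution; it can be read from a t table or obtained from a statistics package.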
The Concordance Correlation Coefficient (Lin, 1989), abbreviated as CCC or ρ_{c}, evaluates the degree to which pairs of observations fall on the 45 degree line (the line of no difference) through the origin of a 2 dimensional plot of the two measurements. Although it can be used to compare any two measurements of the same thing, it is particularly suitable for testing a measurement against a gold standard.
There are plenty of references easily available (see references), so the rationale and details of calculations will not be repeated here. Wikipedia provides an excellent introduction, and the algorithm and symbols used on the Agreement Program Page are based on Chapter 812 of PASS Sample Size Software, both readily accessible on the www.
Data Entry
The data is a matrix of 2 columns. Each row represents a case being measured. The two columns, separated by white space, are the paired measurements. If a gold standard is being compared, the gold standard is in the first (left) column.
The default example data from Agreement Program Page are computer generated and not real. There are only 30 pairs, too few for proper evaluation, but easier to demonstrate in this example. The data are shown in the table to the right. They represent 30 paired measurements of blood pressure. The first column (v1) is the gold standard, intra-arterial catheter, and the second (v2) the measurement being compared, external electronic cuffed manometer.
Calculations and Results
Please note that the program produces numbers with 4 decimal places, but 2 decimal places are presented in this discussion for easier reading.
Step 1 evaluates the basic parameters, in this example sample size n=30 pairs, means μ_{v1}=120 and μ_{v2}=120.77, Standard Deviations σ_{v1}=9.27 and σ_{v2}=9.34
Step 2 evaluates precisions and accuracies in detail
Step 3 calculates Lin's Coefficient of Concordance and its 95% confidence interval
Interpretation According to McBride (see references)
The one tail lower border is usually used. In this example, the one tail lower border is 0.76, so the immediate interpretation is that the manometer measurement has poor agreement with the gold standard intra-arterial measurement. A more detailed examination shows that the accuracy is high (χ_{a}=0.997), but the precision is low (ρ=0.87). It is the low precision of the manometer measurements that contributes to the poor agreement with the gold standard.
Finally, the data are plotted against the diagonal line of no difference, as shown in the plot to the right. This allows the user to review the pattern of the relationship as well as examining the numerical output.
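The point estimates in this panel can be sketched as follows. This is a minimal illustration of Lin's published formula and not the Agreement Program Page or the PASS algorithm: the function name is hypothetical, variances and covariance use an n denominator as in Lin's original description (a program using n-1 will give slightly different values), and the confidence interval is not calculated.

```python
from math import sqrt


def lin_ccc(v1, v2):
    """Lin's Concordance Correlation Coefficient with its precision and accuracy parts.

    Returns (ccc, precision, accuracy), where precision is the Pearson correlation
    and accuracy is the bias correction factor, so that ccc = precision * accuracy.
    Assumes both columns have non-zero variance and non-zero correlation.
    """
    n = len(v1)
    m1, m2 = sum(v1) / n, sum(v2) / n
    var1 = sum((a - m1) ** 2 for a in v1) / n
    var2 = sum((b - m2) ** 2 for b in v2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(v1, v2)) / n
    ccc = 2 * cov / (var1 + var2 + (m1 - m2) ** 2)
    precision = cov / sqrt(var1 * var2)
    accuracy = ccc / precision
    return ccc, precision, accuracy
```

Separating the two parts mirrors the interpretation above: a measurement can be perfectly correlated with the gold standard (precision of 1) and still have a reduced coefficient if it is shifted away from the line of no difference.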
Intraclass Correlation Coefficient (ICC) is a general measurement of agreement or consensus, where the measurements used are assumed to be parametric (continuous and Normally distributed). The Coefficient represents agreement between two or more raters or evaluation methods on the same set of subjects.
ICC has advantages over the correlation coefficient, in that it is adjusted for the effects of the scale of measurements, and that it can represent agreement between more than two raters or measuring methods.
The calculation involves an initial Two Way Analysis of Variance, so the program can also be used to conduct a parametric Two Way Analysis of Variance. Data input and interpretation of results are best demonstrated using the default example data in the Intraclass Correlation Program Page.
Example
We are testing different methods of measuring blood pressure, and wish to know if the readings from the mercury and electronic manometers agree with each other. The data is therefore a two column table, where each row represents a patient, and the two columns the two methods of measurement.
Please note : This is a tiny made up data set to demonstrate the method. In reality there may be many more methods to compare (more than 2 columns), and the data set should contain many more cases for the results to be stable.
The initial Two Way Analysis of Variance produces the table to the right. It shows that variations between patients (rows) are significantly greater than random measurement error (p=0.0006), but the difference between methods of measurement (columns) is not statistically significant (p=0.78). We can therefore draw the conclusion that, although blood pressure varies from patient to patient, there is no significant difference between the methods of measurement in any patient. Please note that the significance test in this case is a test of significant difference and not of agreement, and no significant difference between columns indicates that there is no significant disagreement.
The program now proceeds to provide a coefficient of agreement. It produces 6 in fact, described as follows. There are three models for Intraclass Correlation
In most cases, unless the methodology involves special arrangements, Model 2, individual, is used. From our example data, therefore, the Intraclass Correlation Coefficient is 0.98.
Portney & Watkins suggested (see reference) that the Intraclass Correlation Coefficient can be interpreted as follows: 0-0.2 indicates poor agreement; 0.3-0.4 indicates fair agreement; 0.5-0.6 indicates moderate agreement; 0.7-0.8 indicates strong agreement; and >0.8 indicates almost perfect agreement.
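The Two Way Analysis of Variance and the coefficient for Model 2, individual, can be sketched as follows. This is a minimal illustration and not the Intraclass Correlation Program Page: the function names are hypothetical, and it assumes that Model 2, individual, corresponds to the two way random effects, single measurement coefficient commonly written as ICC(2,1). The other five coefficients and the probability values are not calculated.

```python
def two_way_mean_squares(rows):
    """Mean squares from a Two Way Analysis of Variance without replication.

    rows : table with rows = subjects and columns = raters or methods
    Returns (MS rows, MS columns, MS error).
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_error = ss_total - ss_rows - ss_cols
    return (ss_rows / (n - 1),
            ss_cols / (k - 1),
            ss_error / ((n - 1) * (k - 1)))


def icc_model2_individual(rows):
    """ICC for two way random effects, single measurement (assumed Model 2, individual)."""
    n, k = len(rows), len(rows[0])
    ms_rows, ms_cols, ms_error = two_way_mean_squares(rows)
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)
```

The same mean squares feed all the Intraclass Correlation models; only the way they are combined differs, which is why the program performs the Analysis of Variance first.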
