Related links :
Factor Analysis Program Page
Factor Analysis - Principal Component Extraction Program Page
Factor Analysis - Factor Rotation Program Page
Factor Analysis - Produce Factor Scores Program Page
Factor Analysis - Parallel Analysis Explained, Tables, and Program Page
Introduction
Single Program
Additional programs
R Codes
Related Topics
This page describes the suite of explanations, tables, and programs related to Exploratory Factor Analysis that are available in StatsToDo.
Excellent statistical packages for Factor Analysis are widely available in commercial software such as SAS, STATA, SPSS, and LISREL, and excellent free packages can also be downloaded (see the References). All of these require users to set up the options, then perform all the procedures in a single session.
StatsToDo provides tools for exploratory Factor Analysis in three different formats, each described in a separate panel of this page
- Factor Analysis Program Page
provides a simple one-stop program to perform Factor Analysis, with commonly used default options built in. Detailed descriptions are in the Single Program panel
- The components of the Factor Analysis program are also presented as separate units.
These programs are explained in the Additional Programs panel of this page.
- The Factor Analysis program, together with the additional programs, is also presented as R code, available in the R Code panel of this page
Exploratory Factor Analysis
StatsToDo presents only a simplified and cursory explanation for exploratory Factor Analysis, sufficient to help users of the programs here. Users looking for further information are referred to the references section.
Exploratory Factor Analysis has no a priori theory or hypothesis, and is sometimes called unsupervised clustering. The variables are clustered according to how they correlate with each other.
Exploratory Factor Analysis has two models
- Principal Component Analysis is mainly used to reduce multiple measurements into fewer factors. It is carried out using the covariance or correlation matrix, and evaluates the relationship between the measurements and factors in the process.
- Principal Factor Analysis is used mainly to evaluate the relationships within a set of multiple measurements. It uses the correlation matrix, but replaces the diagonal elements with the communalities, or the largest correlation coefficient in each column. In doing so, it produces a more precise estimate of the relationships between measurements and factors.
As the results of Factor Analysis depend on the scale of the values, StatsToDo follows common practice and uses the correlation matrix to produce Principal Components; when raw data are presented, they are reduced to a common scale of standardized z values (z=(value-mean)/SD).
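As a minimal R sketch of this standardization (dat is a hypothetical name for a data frame of raw measurements):
zMx <- scale(dat)   # standardized z values: (value - mean) / SD for each column
corMx <- cor(dat)   # correlation matrix used to produce the Principal Components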
StatsToDo also uses the Principal Component Analysis model by default, as in most cases exploratory Factor Analysis is used to condense multiple measurements into fewer factors, and the relationship between measurements is a secondary consideration that helps interpret what the resulting factors represent.
Procedures and Options
The following steps are used for Factor Analysis in StatsToDo
- Data entry is in one of two formats
- A matrix of correlation coefficients. A covariance matrix can also be used, but is not recommended
- A matrix of values, where the columns represent variables and the rows cases. The program then converts the values to a correlation matrix, which is used for the rest of the Factor Analysis
- Eigen Analysis, which produces an array of Eigen values in descending order of magnitude
- The complete matrix of Principal Components (factors), in the same order as the Eigen Values
- The decision on how many factors to retain for further analysis. One of the following 3 options is available in StatsToDo
- The user may arbitrarily determine the number of factors to retain.
- The K1 rule, where a factor is retained if its Eigen value is >= 1. This is commonly used and is the default option in StatsToDo
- Parallel Analysis. Eigen values are calculated over multiple iterations (default=1000), using data of the same size but containing normally distributed random numbers. From these, the 95th percentile values of the Eigen values are obtained. A factor is then retained if its Eigen value is >= the corresponding 95th percentile value
- Factor rotation. The retained factors are subjected to rotation, so that each variable loads predominantly onto one factor. The following rotations are commonly used
- Orthogonal rotation, where the resulting factors are not correlated to each other. The usual procedure is the Standardized Varimax Rotation
- Oblique rotation, where the factors are allowed to be correlated, further enhancing the tendency for each variable to load predominantly onto one factor. The Oblimin rotation is usually used, as the correlations between the resulting factors are also calculated and presented. The Promax rotation is provided in the R code; it is said to run more quickly for large matrices, but it does not estimate the correlations between factors
There are different approaches to choosing which rotation to use. Usually, the reasons for doing the Factor Analysis and the nature of the variables included determine whether the factors should be correlated (oblique) or uncorrelated (orthogonal).
If there are no prior theoretical assumptions, then the results from the oblique rotation (Oblimin in this case) should be adopted initially, as this provides the closest fit between variables and factors. However, if the oblique factors have no significant correlation with each other, then the results of the orthogonal rotation (Varimax in this case) should be adopted, as what each factor represents is much more clearly defined.
- Calculating factor scores. This requires 3 sets of data: a matrix of values with the same number of variables (usually the original data matrix), a two-column matrix of means and Standard Deviations (SDs) from the original data set, and the rotated factor matrix
- The rotated factor matrix is converted to the coefficient matrix.
- Each value (v) in the data matrix is converted to a standardized z value, where z = (v-mean)/SD
- The factor score is the product of each z and its coefficient, summed across all variables
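A minimal R sketch of these scoring steps, using the same conversion as the programs here (loadMx and dat are hypothetical names for the rotated factor matrix and the original data):
coeffMx <- loadMx %*% solve(t(loadMx) %*% loadMx) # convert loadings to score coefficients
zMx <- scale(dat)                                 # z = (value - mean) / SD for each column
scoreMx <- zMx %*% coeffMx                        # each score: sum of z * coefficient across variables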
Technical Considerations
Results of Factor Analysis from different programs and platforms are often similar but not identical. This is because much of the calculation is iterative, so the results are approximate; depending on the version of the algorithm, the initial values, and the limits of iteration, results will differ slightly. The three most common discrepancies are discussed here.
- Values from different programs may differ at the third or later decimal places, more so in the minor factors, and more so if the sample size is small. These differences can be accepted.
- After rotation, the factors are often in different orders. Users should interpret final factors according to what the variable loadings indicate, and not in the order they appear in the results
- The positive and negative values for each loading may be opposite in results from different programs and procedures. However, the interpretation of each factor can be reversed by changing all the signs in that factor. For example, a factor representing happiness becomes one for unhappiness if all the signs of its loadings are reversed. The thing to remember is that, following Oblimin rotation, changing the signs of the loadings in a factor will also reverse that factor's correlation coefficients with all the other factors
References
Algorithms : It is difficult to find the algorithms for the calculations
associated with Factor Analysis, as most modern text books and technical manuals
advise users to use one of the commercial packages. I eventually found some
useful algorithms in old text books, as follows
Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1989). Numerical Recipes in Pascal.
Cambridge University Press ISBN 0-521-37516-9 p.395-396 and p.402-404.
Jacobi method for finding Eigen values and Eigen vectors
Norusis MJ (1979) SPSS Statistical Algorithms Release 8. SPSS Inc Chicago
- p. 86 for converting Eigen values and vectors to Principal Components
- p. 91-93 for Varimax Rotation
- p. 94-97 for Oblimin Rotation
- p. 97-98 for Factor scores
Text books : I learnt Factor Analysis some time ago, so all my text books are old,
but they are adequate in explaining the basic concepts and provide the calculations used in these pages.
Users should search for better and newer text books.
- Thurstone LL (1947) Multiple Factor Analysis. University of Chicago Press. I have not read this
book, but it is quoted almost universally, as it is the original Factor
Analysis text book which set out the principles and procedures
- Gorsuch RL (1974) Factor Analysis. W. B. Saunders Company London ISBN 0-7216-4170-9.
A standard text book teaching Factor Analysis at the Master's level. This is my
copy, and I believe later editions of this book are available
- Overall JE and Klett CJ (1972) Applied Multivariate Analysis. McGraw Hill Series in Psychology.
McGraw Hill Book Company New York. Library of Congress No. 73-14716407-047935-6 p.137-156.
Orthogonal Powered-Vector Factor Analysis
Sample Size
Mundfrom DJ, Shaw DG, Tian LK (2005) Minimum sample size recommendations for conducting factor analysis.
International Journal of Testing 5(2):159-168
Free Factor Analysis software and its user manual can be downloaded from
http://psico.fcep.urv.es/utilitats/factor/Download.html. This is a package for Windows written by Drs. Lorenzo-Seva and Ferrando from Universitat Rovira i Virgili in Tarragona, Spain. The presentation and options are very similar to those of SPSS, and the manual is excellent. The best part is that it is free, and yes, it is in English.
Teaching and discussion papers on the www : an enormous number of
discussion papers, technical notes, and tutorials can easily be found on the www by a Google search.
This panel supports the Factor Analysis Program Page
, which is a web page program to perform exploratory Factor Analysis. It is designed as a one-click program, with commonly used default values for all options. This program is best suited for a quick analysis, or for exploration by the inexperienced, as it produces most of the results required with minimal input from the user.
Data Input
1.0000 | 0.6557 | 0.2314 | 0.2791 | 0.1181 | 0.2402 |
0.6557 | 1.0000 | 0.2330 | 0.0524 | 0.2293 | 0.3559 |
0.2314 | 0.2330 | 1.0000 | 0.6560 | 0.2608 | 0.2264 |
0.2791 | 0.0524 | 0.6560 | 1.0000 | 0.3850 | 0.3061 |
0.1181 | 0.2293 | 0.2608 | 0.3850 | 1.0000 | 0.7629 |
0.2402 | 0.3559 | 0.2264 | 0.3061 | 0.7629 | 1.0000 |
Two options for data input are available
The first is a covariance, or preferably a correlation, matrix, such as the one shown to the left. This is placed into the text area when the Example button of the matrix input row is clicked. The user may paste his/her own matrix into the text area, and click the Perform button to perform the analysis
0.178 | -0.338 | 0.470 | 1.211 | 0.508 | 0.354 |
0.304 | -0.230 | -0.981 | -1.218 | -0.337 | -0.780 |
1.155 | 0.350 | -0.310 | -0.083 | -2.224 | -1.502 |
0.562 | 0.148 | 1.376 | 0.616 | 0.887 | 1.342 |
-0.770 | -0.626 | 0.117 | 0.381 | -0.417 | -0.111 |
-0.081 | -0.836 | -0.968 | -0.421 | -0.285 | -0.171 |
-0.957 | -0.620 | 0.601 | 0.350 | -1.793 | -0.379 |
-0.001 | 0.291 | 0.137 | 0.319 | -0.591 | -0.616 |
-0.110 | -0.374 | 0.825 | 0.009 | -0.146 | -0.590 |
0.893 | 0.498 | -1.345 | -0.887 | -0.410 | 1.270 |
-0.773 | -0.304 | 0.337 | -1.158 | -2.117 | -1.175 |
-1.552 | -0.883 | -1.229 | -0.287 | 0.237 | 0.034 |
-0.024 | -0.787 | -1.604 | 0.373 | -1.327 | -1.333 |
0.373 | 0.067 | -0.327 | -0.560 | 1.200 | 0.685 |
0.094 | 0.876 | 1.400 | -0.166 | 0.569 | 0.466 |
1.366 | 1.356 | -1.003 | 0.157 | 0.768 | 0.718 |
-0.596 | -0.048 | 0.399 | 0.269 | -0.555 | -0.712 |
0.636 | 0.440 | 0.295 | -0.024 | -1.415 | -0.450 |
0.660 | 0.375 | 1.356 | 0.744 | 0.098 | 0.662 |
-0.171 | 0.062 | 1.276 | 1.049 | 2.275 | 1.267 |
1.429 | 0.141 | 1.768 | 0.661 | -1.110 | -0.090 |
0.261 | 0.391 | -1.238 | -0.666 | -0.952 | -0.504 |
0.494 | 0.374 | 1.512 | 0.390 | -0.548 | -1.754 |
-0.834 | 0.461 | -1.511 | -2.062 | -1.119 | -0.404 |
0.980 | 1.140 | 1.659 | 1.267 | 0.807 | 1.430 |
The second option is a matrix of data, where the columns represent variables and the rows represent cases, as shown to the right (25 cases in rows and 6 variables in columns). If the Example button of the data row is clicked, this set of default values is inserted into the text box. The user may paste his/her own data matrix into the text area.
Mean | SD |
0.1406 | 0.7599 |
0.077 | 0.5908 |
0.1205 | 1.1104 |
0.0106 | 0.7961 |
-0.3199 | 1.081 |
-0.0937 | 0.9073 |
With data entry, the program immediately produces a table of means and Standard Deviations for each column, as shown to the left. The program also produces the correlation matrix that will be used to perform the Factor Analysis
Please note that the default example data are normally distributed random numbers, used to demonstrate how the program works.
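A minimal R sketch of this summary step (dat is a hypothetical name for the data matrix above, as a data frame):
arMean <- colMeans(dat)   # mean of each variable (column)
arSD <- apply(dat, 2, sd) # Standard Deviation of each variable
corMx <- cor(dat)         # correlation matrix used for the Factor Analysis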
The Number of Factors to Retain
The user can insert a number in the text box provided, which determines the number of factors to retain.
The options are as follows
- If the user has a number in mind, that number can be used. The number must be from 1 to the number of variables
- If the user follows the common convention of using the K1 rule, then the value 0 can be inserted. The program will then retain all factors with Eigen values >=1. Using the default example data, 3 factors are retained.
- If the user decides based on Parallel Analysis, then the program is run twice. In the first run, the Eigen values are noted. The user then uses either the program or the tables in the Factor Analysis - Parallel Analysis Explained, Tables, and Program Page
to obtain the 95th percentiles of random Eigen values, and the number of factors to retain depends on how many from the data exceed those from the Parallel Analysis. The Factor Analysis program is then run a second time with the correct number of factors to retain inserted. Using the default example data, one (1) factor is retained based on Parallel Analysis.
Factor Analysis
Regardless of which button is clicked, the program proceeds to perform all the procedures in Factor Analysis, using the data now in the text area, in the following order.
Step 1. Creating Eigen Value array and Principal Component matrix
Eigen Value Array |
2.6753 | 1.3097 | 1.1512 | 0.4477 | 0.2253 | 0.1908 |
Principal Component Matrix |
-0.5959 | -0.6554 | -0.2225 | 0.3505 | -0.0626 | 0.1975 |
-0.6026 | -0.6999 | 0.0553 | -0.2888 | 0.1416 | -0.2012 |
-0.6441 | 0.2741 | -0.5705 | -0.3895 | -0.0885 | 0.1582 |
-0.6738 | 0.4273 | -0.4637 | 0.2989 | 0.0773 | -0.2303 |
-0.7223 | 0.3366 | 0.5082 | 0.0195 | 0.2823 | 0.1632 |
-0.7525 | 0.1392 | 0.5477 | 0.0008 | -0.3283 | -0.0811 |
The Eigen value array and Eigen vector matrix are calculated from the correlation matrix using the Jacobi method, then ordered from left to right in descending magnitude of Eigen values.
The Principal Component matrix is then calculated by multiplying each value of the Eigen vector by the square root of the Eigen value of the same column.
The results are shown in the tables to the left
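In R, this step can be sketched as follows (corMx is assumed to hold the correlation matrix above; eigen() is an equivalent of the Jacobi decomposition):
e <- eigen(corMx)                           # Eigen values and vectors, in descending order of Eigen value
pComp <- e$vectors %*% diag(sqrt(e$values)) # scale each Eigen vector by the square root of its Eigen value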
Step 2. Determine the number of factors to retain
| f1 | f2 | f3 | Variance |
v1 | -0.5959 | -0.6554 | -0.2225 | 0.8342 |
v2 | -0.6026 | -0.6999 | 0.0553 | 0.856 |
v3 | -0.6441 | 0.2741 | -0.5705 | 0.8154 |
v4 | -0.6738 | 0.4273 | -0.4637 | 0.8517 |
v5 | -0.7223 | 0.3366 | 0.5082 | 0.8933 |
v6 | -0.7525 | 0.1392 | 0.5477 | 0.8856 |
Variance | 2.6753 | 1.3097 | 1.1512 | |
The default K1 rule is used to determine the number of factors to retain. This rule stipulates that all factors with an Eigen value >= 1 should be retained. As the first 3 Eigen values are >= 1, the first 3 factors are retained and forwarded for rotation.
The retained factors, accompanied by their variance, are shown in the table to the right
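Continuing the R sketch above, the K1 rule is a one-liner:
nf <- sum(e$values >= 1)     # K1 rule: number of Eigen values >= 1
retainedMx <- pComp[, 1:nf]  # the first nf factors, forwarded for rotation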
Step 3. Oblique (Oblimin) Rotation
Factor Pattern (Loading) Matrix |
| f1 | f2 | f3 |
v1 | 0.0946 | -0.8959 | -0.1375 |
v2 | -0.1316 | -0.9056 | 0.108 |
v3 | 0.0514 | -0.0822 | -0.8968 |
v4 | -0.1078 | 0.0635 | -0.898 |
v5 | -0.9345 | 0.0804 | -0.082 |
v6 | -0.9174 | -0.1137 | 0.0324 |
Correlation Between Factors Matrix |
| f1 | f2 | f3 |
f1 | 1 | 0.234 | 0.2916 |
f2 | 0.234 | 1 | 0.2071 |
f3 | 0.2916 | 0.2071 | 1 |
Oblimin rotation, with the default setting that allows maximum correlation between factors (δ=0), is carried out, and the results are shown to the left.
It can be seen that variables 5 and 6 load predominantly on factor 1, variables 1 and 2 on factor 2, and variables 3 and 4 on factor 3.
It can also be seen that the 3 factors are correlated, with correlation coefficients of 0.234 between factors 1 and 2, 0.292 between factors 1 and 3, and 0.207 between factors 2 and 3.
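A minimal sketch of this rotation, using the GPArotation package as in the R Code panel (retainedMx from the previous sketch):
library(GPArotation)
obmn <- oblimin(retainedMx, normalize=TRUE) # oblique rotation (quartimin, equivalent to δ=0)
obmn$loadings                               # factor pattern (loading) matrix
obmn$Phi                                    # correlations between the rotated factors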
Step 4. Orthogonal (Varimax) Rotation
| f1 | f2 | f3 | Variance |
v1 | -0.2079 | -0.889 | 0.0248 | 0.8342 |
v2 | 0.0006 | -0.9003 | 0.2134 | 0.856 |
v3 | -0.8839 | -0.1634 | 0.0858 | 0.8154 |
v4 | -0.8939 | -0.0374 | 0.2264 | 0.8517 |
v5 | -0.2069 | -0.0347 | 0.9216 | 0.8933 |
v6 | -0.1103 | -0.2136 | 0.9099 | 0.8856 |
Variance | 1.6786 | 1.6758 | 1.7819 | |
The results of varimax rotation, accompanied by the row (variable) and column (factor) variances, are shown to the right.
The following comparisons can be noted
- Comparing with the original Principal Component matrix before rotation
- The variance contribution from each variable (rows) remains the same
- The variance contribution from each factor has changed. Instead of decreasing, they are now similar in scale
- Comparing with the results of Oblimin rotation
- The factors are now orthogonal, uncorrelated with each other
- The orders of the factors are different
- variables 1 and 2 load predominantly on factor 2 in both Oblimin and Varimax. Factor 2 therefore represents the same measurements in both results.
- variables 3 and 4 load predominantly on factor 3 in Oblimin and factor 1 in Varimax. Oblimin 3 and Varimax 1 therefore represent the same measurements
- variables 5 and 6 load predominantly on factor 1 in Oblimin and factor 3 in Varimax. Also, the coefficients are negative in Oblimin and positive in Varimax. Oblimin 1 and Varimax 3 therefore represent the same measurements
- As the signs of the larger loading coefficients for Oblimin f1 and Varimax f3 are opposite, the two factors measure opposite directions of the same thing. To represent a factor in the opposing direction, the signs of its coefficients need to be reversed (+ to -, and - to +). If this is done in an oblique factor matrix, the factor's correlation coefficients with all other factors also need to be reversed.
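A minimal R sketch of this sign reversal (loadMx, phi, and j are hypothetical names for an oblique loading matrix, its factor correlation matrix, and the factor to reverse):
flipFactor <- function(loadMx, phi, j)
{
  loadMx[, j] <- -loadMx[, j] # reverse all loadings of factor j
  phi[j, ] <- -phi[j, ]       # reverse its correlations with the other factors
  phi[, j] <- -phi[, j]       # the diagonal element flips twice, so it stays 1
  list("loadings"=loadMx, "phi"=phi)
}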
Factor Scores
Oblimin Rotation |
Coefficient Matrix |
| f1 | f2 | f3 |
v1 | 0.0732 | -0.5429 | -0.076 |
v2 | -0.0617 | -0.5474 | 0.0787 |
v3 | 0.0524 | -0.0409 | -0.5453 |
v4 | -0.0411 | 0.0502 | -0.5438 |
v5 | -0.5332 | 0.0661 | -0.0283 |
v6 | -0.5227 | -0.053 | 0.043 |
Factor Scores |
| f1 | f2 | f3 |
| -0.6648 | 0.445 | -1.051 |
| 0.4631 | 0.1699 | 1.2907 |
| 1.8042 | -1.0018 | 0.1932 |
| -1.3613 | -0.385 | -1.0262 |
| 0.0243 | 1.3204 | -0.2521 |
| 0.0723 | 1.0238 | 0.7253 |
| 0.8632 | 1.3601 | -0.4259 |
| 0.3835 | -0.0644 | -0.1938 |
| 0.2564 | 0.6104 | -0.408 |
| -0.7355 | -1.0153 | 1.3805 |
| 1.5317 | 0.8773 | 0.7282 |
| -0.4593 | 2.1562 | 0.8989 |
| 1.1851 | 1.0153 | 0.4683 |
| -1.1665 | -0.1288 | 0.5821 |
| -0.7794 | -0.7436 | -0.3933 |
| -1.0804 | -1.9907 | 0.5095 |
| 0.4141 | 0.6698 | -0.2794 |
| 0.7652 | -0.745 | -0.0516 |
| -0.6022 | -0.6651 | -1.095 |
| -2.0914 | 0.3386 | -1.2508 |
| 0.5491 | -1.048 | -1.3528 |
| 0.4978 | -0.3842 | 1.1561 |
| 1.1181 | -0.472 | -1.011 |
| 0.469 | 0.2392 | 2.3716 |
| -1.4562 | -1.5822 | -1.5133 |
Varimax Rotation |
Coefficient Matrix |
| f1 | f2 | f3 |
v1 | -0.0424 | -0.5608 | -0.1454 |
v2 | 0.1386 | -0.5648 | 0.0177 |
v3 | -0.5732 | -0.0009 | -0.1368 |
v4 | -0.5664 | 0.1025 | -0.0288 |
v5 | 0.0399 | 0.1254 | 0.5626 |
v6 | 0.1232 | -0.0046 | 0.5491 |
Factor Scores |
| f1 | f2 | f3 |
| -1.0425 | 0.6171 | 0.5957 |
| 1.2676 | 0.0172 | -0.2845 |
| 0.0348 | -1.2348 | -1.9728 |
| -0.8461 | -0.1693 | 1.2419 |
| -0.3818 | 1.3805 | 0.0792 |
| 0.6578 | 0.9861 | 0.1362 |
| -0.6849 | 1.3502 | -0.8216 |
| -0.2508 | -0.0892 | -0.4369 |
| -0.5154 | 0.638 | -0.2621 |
| 1.6337 | -1.0894 | 0.8601 |
| 0.468 | 0.6849 | -1.4106 |
| 0.8153 | 2.1935 | 0.8394 |
| 0.2334 | 0.885 | -1.0691 |
| 0.7834 | -0.0628 | 1.2933 |
| -0.236 | -0.653 | 0.6829 |
| 0.8579 | -1.9852 | 0.9944 |
| -0.4085 | 0.672 | -0.403 |
| -0.0967 | -0.8424 | -0.8898 |
| -1.0005 | -0.5302 | 0.4056 |
| -1.0408 | 0.671 | 2.0532 |
| -1.3986 | -1.0212 | -0.8802 |
| 1.1705 | -0.5462 | -0.399 |
| -1.1722 | -0.515 | -1.3675 |
| 2.3894 | -0.0045 | -0.1296 |
| -1.2369 | -1.3523 | 1.1447 |
If the data entered are measurement values rather than a correlation matrix, the program proceeds to calculate factor scores for those data. The coefficient matrix is first calculated from the factor loadings, and the data values are converted into standardized z values (z=(value-mean)/SD). The factor scores are then calculated as the sums of products between the z values and the coefficients.
The coefficient matrix and factor scores calculated from the Oblimin factors are presented to the left, and those from Varimax to the right.
Factor scores can also be calculated for other, independent sets of data, with the following rules (a sketch follows the list)
- The program from the Factor Analysis - Produce Factor Scores Program Page
is used
- The new data may have a different number of cases (rows), but the variables (columns) must be in the same order as in the original
- The Mean and SD table from the original data is used for calculating the z values
- The coefficient matrix can be copied and pasted into the program
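A minimal sketch of scoring a new data set in R (newDat is a hypothetical new data matrix; arMean, arSD, and coeffMx come from the original analysis):
zNew <- sweep(sweep(as.matrix(newDat), 2, arMean, "-"), 2, arSD, "/") # z values using the original means and SDs
newScores <- zNew %*% coeffMx                                         # factor scores for the new cases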
The components of the single program from the Factor Analysis Program Page
, described in the Single Program panel, are also presented as separate programs. These programs allow the procedures to be carried out in steps, and the intermediate data to be edited, should the user wish to do so.
Only brief descriptions are provided here, as the same default data as in the Factor Analysis Program Page
are used, and examples and explanations have already been provided in the previous panel.
Factor Analysis - Principal Component Extraction Program Page
performs the Principal Component analysis. It accepts either raw data or a correlation matrix, and produces the Eigen value array and the full Principal Component matrix
Factor Analysis - Factor Rotation Program Page
uses the full Principal Component matrix as the data, and allows the user to stipulate the number of factors to retain. The program then performs both Oblimin and Varimax rotations using the retained factors
Factor Analysis - Produce Factor Scores Program Page
requires 3 sets of data input. These are
- The data to be converted to factor scores. These must have the variables (columns) in the same order as the data used to create the factors
- The mean and Standard Deviation from all the variables, as a two column matrix. These must be obtained from the original data, and not from the data to be converted.
- The factor pattern (loading) matrix produced by one of the rotations.
The program then converts the data into standardized z values (z=(value-mean)/SD), derives the coefficient matrix from the loading matrix, and calculates the factor scores as the sums of products between the z values and the coefficients.
Factor Analysis - Parallel Analysis Explained, Tables, and Program Page
explains Parallel Analysis, and provides calculations and tables to determine the number of factors to retain. As Parallel Analysis is a complex subject and fully described on that page, it is not further described here.
Introduction
R Code in Total
R Code Explained
During the development of the web page based program for Factor Analysis, R was used to check for errors, and to compare the results obtained from the program written in PHP for the web page against those produced by packages developed by experts in the R community.
The resulting R code has been tidied up and presented on this page for any user who may be interested in using it.
The R code presents all the procedures for Factor Analysis in a single program, including the Parallel Analysis
Users unfamiliar with R and wishing to use it are referred to the R Explained Page
for help
Using the R code provided has the following advantages
- The R program is easy to follow, so that the subroutines used, and calculations produced are all transparent to the user
- The user can insert preferences, for example choosing the method to determine the number of factors to retain
- The user can delete, add, or modify the program according to preferences and needs
- The user can display intermediate results by inserting print statements into the program
The R code is presented twice, in the next two sub-panels. Firstly, in a single block, so that the user can copy and paste it into the program panel of RStudio. Secondly, each section and the results produced are presented and discussed separately, to help the user follow the program
This panel presents the R program for Factor Analysis in total, including the example data. Explanations are presented in the next panel
# Factor Analysis.R
# SECTION 1: FUNCTIONS
# create eigen value, eigen vector and principal components from correlation matrix
CorToPComp <- function(corMx)
{
eigenResults <- eigen(corMx) # Calculate eigen value and vector
evl <- eigenResults$values # evl = eigen value
evc <- eigenResults$vectors # evc = eigen vector
nv = length(evl)
evm <- matrix(data=0, nrow = nv, ncol=nv) # evm = principal component
for(j in 1:nv)
{
x = sqrt(evl[j])
evm[,j] = evc[,j] * x
}
list("evl"= evl, "evc"= evc, "evm"= evm) # returns eigen values, eigen vectors and prinComp
}
# Create eigen value, eigen vector and principal components from data matrix
DataFrameToPCom <- function(df)
{
cMx <- cor(df)
print(cMx) # correlation matrix
res <- CorToPComp(cMx) # returns eigen values, eigen vectors and prinComp
}
# Calculate the number of factors to retain by K1 rule
NumberOfFactorsByK1 <- function(arEigenValue)
{
f = 0
for(i in 1:length(arEigenValue))
{
if(arEigenValue[i]>=1)
{
f = f + 1
}
}
f # return number of factors
}
# Calculate the number of factors to retain by Parallel Analysis
NumberOfFactorsByParallel <- function(arEigenValue, ssiz, ite) # ite=number of iterations
{
nc = ssiz #sample size, number of rows of data
nv = length(arEigenValue) #number of variables
# parallel analysis begin
ex <- rep(0, nv)
exx <- rep(0, nv)
for(i in 1:ite)
{
dMx <- replicate(nv, rnorm(nc)) # random data matrix
pComp <- eigen(cor(dMx))$values # array of eigen values
for(j in 1:nv)
{
ex[j] = ex[j] + pComp[j]
exx[j] = exx[j] + pComp[j]^2
}
}
resAr <- rep(0,nv)
for(j in 1:nv)
{
mean = ex[j] / ite # mean
sd = sqrt((exx[j] - ex[j]^2/ite)/(ite-1)) # SD
lim = mean + qnorm(0.95) * sd # 95 percentile
resAr[j] = lim
}
print(resAr) # array of 95 percentile eigen values
print(arEigenValue)
f = 0
for(i in 1:nv)
{
if(arEigenValue[i]>=resAr[i])
{
f = f + 1
}
}
f
}
# SECTION 2: MAIN PROGRAM
# Step 1. Data input
myDat = ("
0.178 -0.338 0.470 1.211 0.508 0.354
0.304 -0.230 -0.981 -1.218 -0.337 -0.780
1.155 0.350 -0.310 -0.083 -2.224 -1.502
0.562 0.148 1.376 0.616 0.887 1.342
-0.770 -0.626 0.117 0.381 -0.417 -0.111
-0.081 -0.836 -0.968 -0.421 -0.285 -0.171
-0.957 -0.620 0.601 0.350 -1.793 -0.379
-0.001 0.291 0.137 0.319 -0.591 -0.616
-0.110 -0.374 0.825 0.009 -0.146 -0.590
0.893 0.498 -1.345 -0.887 -0.410 1.270
-0.773 -0.304 0.337 -1.158 -2.117 -1.175
-1.552 -0.883 -1.229 -0.287 0.237 0.034
-0.024 -0.787 -1.604 0.373 -1.327 -1.333
0.373 0.067 -0.327 -0.560 1.200 0.685
0.094 0.876 1.400 -0.166 0.569 0.466
1.366 1.356 -1.003 0.157 0.768 0.718
-0.596 -0.048 0.399 0.269 -0.555 -0.712
0.636 0.440 0.295 -0.024 -1.415 -0.450
0.660 0.375 1.356 0.744 0.098 0.662
-0.171 0.062 1.276 1.049 2.275 1.267
1.429 0.141 1.768 0.661 -1.110 -0.090
0.261 0.391 -1.238 -0.666 -0.952 -0.504
0.494 0.374 1.512 0.390 -0.548 -1.754
-0.834 0.461 -1.511 -2.062 -1.119 -0.404
0.980 1.140 1.659 1.267 0.807 1.430
")
myDataFrame <- read.table(textConnection(myDat),header=FALSE)
summary(myDataFrame)
# Step 2. Calculate Principal Component Matrix
nc = nrow(myDataFrame) # number of cases
nv = ncol(myDataFrame) # number of variables
res <- DataFrameToPCom(myDataFrame) # eigen values, eigen vector, Principal components
res$evl # eigen values
res$evc # eigen vector
pComp <- res$evm # All Principal components
pComp # All Principal components
# USER INPUT REQUIRED
# Step 3. User to determine number of factors to retain (one of 3 options)
#nf = 3 # choice a specify a number of factors
nf = NumberOfFactorsByK1(res$evl) # choice b using k1 rule
#nf = NumberOfFactorsByParallel(res$evl, nc, 1000) # choice c using parallel analysis
nf # number of factors to retain
# END USER INPUT
# Step 4. Create z scores required for calculating factor values for each case
arMean <- rep(0,nv)
arSD <- rep(0,nv)
zMx <- matrix(0,nrow=nc, ncol=nv)
for(i in 1:nv)
{
arMean[i] = mean(myDataFrame[ , i])
arSD[i] = sd(myDataFrame[ , i])
zMx[, i] <- (myDataFrame[ , i] - arMean[i]) / arSD[i]
}
arMean # array of means
arSD # array of SDs
zMx # matrix of z values
# Step 5. Calculate Factor Score if there is only 1 Principal Component
if(nf<2)
{
pCompMx <- matrix(pComp[,1:nf])
print("Principal Component")
print(pCompMx) # principal component matrix to be used for subsequent rotation and processing
# Calculate coefficient matrix for scores
coeffMx <- pCompMx %*% solve(t(pCompMx) %*% pCompMx)
print("Coefficients")
print(coeffMx) # coefficient matrix
# Calculate Factor Scores
scoreMx <- zMx %*% coeffMx
print("Factor Scores")
print(scoreMx) # factor scores for the single retained factor
} else
# Step 6 Performs Factor rotation if there is more than 1 principal component
{
pCompMx <- pComp[,1:nf]
print("Principal Components")
print(pCompMx) # principal component matrix to be used for subsequent rotation and processing
# perform Varimax rotation
vMax <- varimax(pCompMx)
#vMax
vMx <- pCompMx %*% vMax$rotmat
print("Varimax Factor Loadings")
print(vMx) #varimax loading matrix
# Calculate coefficient matrix for scores
vCoeffMx <- vMx %*% solve(t(vMx) %*% vMx)
print("Coefficient Matrix")
print(vCoeffMx) # varimax coefficient matrix
# Calculate Factor Scores
vScoreMx <- zMx %*% vCoeffMx
print("Factor Scores")
print(vScoreMx) # factor scores for varimax
# Perform Promax rotation
pMax <- promax(pCompMx)
#pMax
pMx <- pCompMx %*% pMax$rotmat
print("Promax Factor Loadings")
print(pMx) # promax loading matrix
# Calculate coefficient matrix for scores
pCoeffMx <- pMx %*% solve(t(pMx) %*% pMx)
print("Coefficient Matrix")
print(pCoeffMx) # promax coefficient matrix
# Calculate Factor Scores
pScoreMx <- zMx %*% pCoeffMx
print("Factor Scores")
print(pScoreMx) # factor scores for promax
# Performs Oblimin Rotation
#install.packages("GPArotation") # if not already installed
library(GPArotation)
obmn <- oblimin(pCompMx, normalize=TRUE)
print("Oblimin Factor Loadings and Factor Correlations")
print(obmn)
oMx <- obmn$loadings
#oMx # oblimin loading matrix (factor pattern)
# Calculate coefficient matrix for scores
oCoeffMx <- oMx %*% solve(t(oMx) %*% oMx)
print("Coefficient Matrix")
print(oCoeffMx) # oblimin coefficient matrix
# Calculate Factor Scores
oScoreMx <- zMx %*% oCoeffMx
print("Factor Scores")
print(oScoreMx) # factor scores for oblimin
}
This panel provides explanations for each section of the program for Factor Analysis. The program is divided into 2 sections. Section 1 contains all the subroutine functions, and section 2 the main program.
Please note that user input is required in 2 places. Firstly the data input, and secondly the choice of method to determine the number of factors to retain.
Section 1. Functions
Four (4) functions are presented here
# create eigen value, eigen vector and principal components from correlation matrix
CorToPComp <- function(corMx)
{
eigenResults <- eigen(corMx) # Calculate eigen value and vector
evl <- eigenResults$values # evl = eigen value
evc <- eigenResults$vectors # evc = eigen vector
nv = length(evl)
evm <- matrix(data=0, nrow = nv, ncol=nv) # evm = principal component
for(j in 1:nv)
{
x = sqrt(evl[j])
evm[,j] = evc[,j] * x
}
list("evl"= evl, "evc"= evc, "evm"= evm) # returns eigen values, eigen vectors and prinComp
}
The CorToPComp function takes a correlation matrix (corMx), and returns a list containing 3 structures. These are
- evl, the eigen value array
- evc, the eigen vector matrix
- evm, the principal component matrix
# Create eigen value, eigen vector and principal components from data matrix
DataFrameToPCom <- function(df)
{
cMx <- cor(df)
print(cMx) # correlation matrix
res <- CorToPComp(cMx) # returns eigen values, eigen vectors and prinComp
}
The DataFrameToPCom function takes a dataframe containing a matrix of measurements (df), with rows representing cases and columns representing variables. From this, the correlation matrix (cMx) is calculated and passed to the CorToPComp function, which returns the eigen values, eigen vectors, and principal components. These 3 structures are then returned.
# Calculate the number of factors to retain by K1 rule
NumberOfFactorsByK1 <- function(arEigenValue)
{
f = 0
for(i in 1:length(arEigenValue))
{
if(arEigenValue[i]>=1)
{
f = f + 1
}
}
f # return number of factors
}
The NumberOfFactorsByK1 function takes the eigen value array, counts the number of values that are >= 1, and returns that count as the number of factors to retain
# Calculate the number of factors to retain by Parallel Analysis
NumberOfFactorsByParallel <- function(arEigenValue, ssiz, ite) # ite=number of iterations
{
nc = ssiz #sample size, number of rows of data
nv = length(arEigenValue) #number of variables
# parallel analysis begin
ex <- rep(0, nv)
exx <- rep(0, nv)
for(i in 1:ite)
{
dMx <- replicate(nv, rnorm(nc)) # random data matrix
pComp <- eigen(cor(dMx))$values # array of eigen values
for(j in 1:nv)
{
ex[j] = ex[j] + pComp[j]
exx[j] = exx[j] + pComp[j]^2
}
}
resAr <- rep(0,nv)
for(j in 1:nv)
{
mean = ex[j] / ite # mean
sd = sqrt((exx[j] - ex[j]^2/ite)/(ite-1)) # SD
lim = mean + qnorm(0.95) * sd # 95 percentile
resAr[j] = lim
}
print(resAr) # array of 95 percentile eigen values
print(arEigenValue)
f = 0
for(i in 1:nv)
{
if(arEigenValue[i]>=resAr[i])
{
f = f + 1
}
}
f
}
The NumberOfFactorsByParallel function accepts 3 parameters
- arEigenValue is the array of eigen value calculated from the data or correlation matrix
- ssiz is the sample size (number of rows) of the data
- ite is the number of iterations used to calculate the 95th percentile. This should be 100 or more; usually 1000 is sufficient
The function then iterates, calculating eigen values from normally distributed random data, and establishes the 95th percentile values by simulation
The simulated values are then compared with the input eigen value array. The number of eigen values from the input array that are >= the corresponding simulated values is counted, and returned as the number of factors to retain.
Section 2. Main Program
The code in section 2 is executed in the order presented
# Step 1. Data input
myDat = ("
0.178 -0.338 0.470 1.211 0.508 0.354
0.304 -0.230 -0.981 -1.218 -0.337 -0.780
1.155 0.350 -0.310 -0.083 -2.224 -1.502
0.562 0.148 1.376 0.616 0.887 1.342
......
")
myDataFrame <- read.table(textConnection(myDat),header=FALSE)
summary(myDataFrame)
The data is a table (matrix) without a header, with rows representing cases and columns representing variables. A summary of the data is presented as follows
V1 V2 V3 V4 V5 V6
Min. :-1.5520 Min. :-0.88300 Min. :-1.6040 Min. :-2.06200 Min. :-2.2240 Min. :-1.75400
1st Qu.:-0.1710 1st Qu.:-0.33800 1st Qu.:-0.9810 1st Qu.:-0.42100 1st Qu.:-1.1100 1st Qu.:-0.61600
Median : 0.1780 Median : 0.14100 Median : 0.2950 Median : 0.15700 Median :-0.4100 Median :-0.17100
Mean : 0.1406 Mean : 0.07696 Mean : 0.1205 Mean : 0.01056 Mean :-0.3199 Mean :-0.09372
3rd Qu.: 0.6360 3rd Qu.: 0.39100 3rd Qu.: 1.2760 3rd Qu.: 0.39000 3rd Qu.: 0.5080 3rd Qu.: 0.66200
Max. : 1.4290 Max. : 1.35600 Max. : 1.7680 Max. : 1.26700 Max. : 2.2750 Max. : 1.43000
# Step 2. Calculate Principal Component Matrix
nc = nrow(myDataFrame) # number of cases
nv = ncol(myDataFrame) # number of variables
res <- DataFrameToPCom(myDataFrame) # eigen values, eigen vector, Principal components
res$evl # eigen values
#res$evc # eigen vector
pComp <- res$evm # All Principal components
pComp # All Principal components
Step 2 calculates the eigen values, eigen vectors, and the Principal Components. It prints out the correlation matrix, the eigen value array, and the Principal Component matrix, as follows
Correlation matrix
V1 V2 V3 V4 V5 V6
V1 1.0000000 0.65568487 0.2313851 0.27914974 0.1181262 0.2401638
V2 0.6556849 1.00000000 0.2329578 0.05241047 0.2292866 0.3559332
V3 0.2313851 0.23295783 1.0000000 0.65596377 0.2608266 0.2263553
V4 0.2791497 0.05241047 0.6559638 1.00000000 0.3849751 0.3061149
V5 0.1181262 0.22928662 0.2608266 0.38497508 1.0000000 0.7628703
V6 0.2401638 0.35593319 0.2263553 0.30611491 0.7628703 1.0000000
eigen value array
[1] 2.6752733 1.3097426 1.1512484 0.4477224 0.2252526 0.1907607
Principal Components Matrix
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.5959219 0.6554206 0.22250426 -0.3505435042 0.06255386 0.19748157
[2,] -0.6026316 0.6998645 -0.05526649 0.2888334979 -0.14161525 -0.20122309
[3,] -0.6440947 -0.2740522 0.57047741 0.3895179069 0.08849390 0.15823263
[4,] -0.6738459 -0.4272552 0.46373143 -0.2988512091 -0.07734747 -0.23031085
[5,] -0.7223140 -0.3366373 -0.50819274 -0.0194946053 -0.28227127 0.16315911
[6,] -0.7525434 -0.1392112 -0.54766295 -0.0007828776 0.32831983 -0.08105191
# USER INPUT REQUIRED
# Step 3. User to determine number of factors to retain (one of 3 options)
#nf = 3 # choice a specify a number of factors
nf = NumberOfFactorsByK1(res$evl) # choice b using k1 rule
#nf = NumberOfFactorsByParallel(res$evl, nc, 1000) # choice c using parallel analysis
nf # number of factors to retain
# END USER INPUT
Step 3 is for the user to choose the method of determining the number of factors to retain. Three options are available
- nf=n, the user can arbitrarily stipulate the number of factors to retain (e.g. 3)
- nf = NumberOfFactorsByK1(res$evl), calls the k1 function with the eigen value array
- nf = NumberOfFactorsByParallel(res$evl, nc, ite), calls the Parallel Analysis function with the eigen value array, the number of cases (row, sample size), and the number of iterations to use (e.g. 1000)
The user chooses by commenting (placing # as the first character) or uncommenting (removing #) the relevant line of code. In the default code, the K1 rule is chosen, and it returns 3 factors to retain, as follows
> nf # number of factors to retain
[1] 3
# Step 4. Create z scores required for calculating factor values for each case
arMean <- rep(0,nv)
arSD <- rep(0,nv)
zMx <- matrix(0,nrow=nc, ncol=nv)
for(i in 1:nv)
{
arMean[i] = mean(myDataFrame[ , i])
arSD[i] = sd(myDataFrame[ , i])
zMx[, i] <- (myDataFrame[ , i] - arMean[i]) / arSD[i]
}
arMean # array of means
arSD # array of SDs
zMx # matrix of z values
Step 4 calculates the mean and Standard Deviation of each variable (column), then transforms the values into standardized z values (z=(value-mean)/SD). These are used later for calculating factor scores. The results are as follows
> arMean # array of means
[1] 0.14064 0.07696 0.12048 0.01056 -0.31988 -0.09372
> arSD # array of SDs
[1] 0.7599281 0.5908474 1.1104260 0.7961356 1.0809935 0.9073096
> zMx # matrix of z values
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.04916255 -0.70231326 0.314762090 1.507833555 0.76585104 0.493458903
[2,] 0.21496772 -0.51952497 -0.991943656 -1.543154170 -0.01583728 -0.756390101
[3,] 1.33481056 0.46211590 -0.387671047 -0.117517666 -1.76145370 -1.552149344
[4,] 0.55447354 0.12023408 1.130665195 0.760473449 1.11645445 1.582392604
......
# Step 5. Calculate Factor Score if there is only 1 Principal Component
if(nf<2)
{
pCompMx <- matrix(pComp[,1:nf])
print("Principal Component")
print(pCompMx) # principal component matrix to be used for subsequent rotation and processing
# Calculate coefficient matrix for scores
coeffMx <- pCompMx %*% solve(t(pCompMx) %*% pCompMx)
print("Coefficients")
print(coeffMx) # coefficient matrix
# Calculate Factor Scores
scoreMx <- zMx %*% coeffMx
print("Factor Scores")
print(scoreMx) # factor scores for the single retained factor
} else
Step 5 is performed if only one (1) Principal Component is retained. The principal component is translated into the coefficient matrix, and the factor scores are calculated. As this is not the option in the example code, no output is shown. However, if the number of factors had been set to 1, or if Parallel Analysis had been used, the following results would be shown
[1] "Principal Component"
[,1]
[1,] -0.5959219
[2,] -0.6026316
[3,] -0.6440947
[4,] -0.6738459
[5,] -0.7223140
[6,] -0.7525434
[1] "Coefficients"
[,1]
[1,] -0.2227518
[2,] -0.2252598
[3,] -0.2407585
[4,] -0.2518793
[5,] -0.2699963
[6,] -0.2812959
[1] "Factor Scores"
[,1]
[1,] -0.65390672
[2,] 0.91369654
[3,] 0.63370706
......
# Step 6 Performs Factor rotation if there is more than 1 principal component
{
pCompMx <- pComp[,1:nf]
print("Principal Components")
print(pCompMx) # principal component matrix to be used for subsequent rotation and processing
# perform Varimax rotation
vMax <- varimax(pCompMx)
#vMax
vMx <- pCompMx %*% vMax$rotmat
print("Varimax Factor Loadings")
print(vMx) #varimax loading matrix
# Calculate coefficient matrix for scores
vCoeffMx <- vMx %*% solve(t(vMx) %*% vMx)
print("Coefficient Matrix")
print(vCoeffMx) # varimax coefficient matrix
# Calculate Factor Scores
vScoreMx <- zMx %*% vCoeffMx
print("Factor Scores")
print(vScoreMx) # factor scores for varimax
# Perform Promax rotation
pMax <- promax(pCompMx)
#pMax
pMx <- pCompMx %*% pMax$rotmat
print("Promax Factor Loadings")
print(pMx) # promax loading matrix
# Calculate coefficient matrix for scores
pCoeffMx <- pMx %*% solve(t(pMx) %*% pMx)
print("Coefficient Matrix")
print(pCoeffMx) # promax coefficient matrix
# Calculate Factor Scores
pScoreMx <- zMx %*% pCoeffMx
print("Factor Scores")
print(pScoreMx) # factor scores for promax
# Performs Oblimin Rotation
#install.packages("GPArotation") # if not already installed
library(GPArotation)
obmn <- oblimin(pCompMx, normalize=TRUE)
print("Oblimin Factor Loadings and Factor Correlations")
print(obmn)
oMx <- obmn$loadings
#oMx # oblimin loading matrix (factor pattern)
# Calculate coefficient matrix for scores
oCoeffMx <- oMx %*% solve(t(oMx) %*% oMx)
print("Coefficient Matrix")
print(oCoeffMx) # oblimin coefficient matrix
# Calculate Factor Scores
oScoreMx <- zMx %*% oCoeffMx
print("Factor Scores")
print(oScoreMx) # factor scores for oblimin
}
Step 6 is executed if more than 1 Principal Component is retained. The retained factors are subjected to rotation. The rotated factors are then converted to the coefficient matrix, and the factor scores are calculated using the z matrix and the coefficient matrix.
Three types of rotation are carried out: Varimax, Promax, and Oblimin. The results have the same structure but different values, and are shown separately as follows
Three factors have eigen values >= 1, and these are retained, as follows.
[1] "Principal Components"
[,1] [,2] [,3]
[1,] -0.5959219 0.6554206 0.22250426
[2,] -0.6026316 0.6998645 -0.05526649
[3,] -0.6440947 -0.2740522 0.57047741
[4,] -0.6738459 -0.4272552 0.46373143
[5,] -0.7223140 -0.3366373 -0.50819274
[6,] -0.7525434 -0.1392112 -0.54766295
The first rotation is Orthogonal (Varimax) and the results are as follows
[1] "Varimax Factor Loadings"
[,1] [,2] [,3]
[1,] -0.02453600 0.88913790 0.207458613
[2,] -0.21308829 0.90034535 -0.001106254
[3,] -0.08583428 0.16383896 0.883853119
[4,] -0.22653935 0.03792675 0.893814143
[5,] -0.92160147 0.03512097 0.206734722
[6,] -0.90981342 0.21391356 0.110075392
[1] "Coefficient Matrix"
[,1] [,2] [,3]
[1,] 0.14559564 0.560732513 0.04210757
[2,] -0.01747185 0.564782857 -0.13890447
[3,] 0.13675107 0.001150852 0.57323095
[4,] 0.02871355 -0.102217672 0.56642661
[5,] -0.56263156 -0.125188522 -0.03989932
[6,] -0.54909048 0.004753418 -0.12330412
[1] "Factor Scores"
[,1] [,2] [,3]
[1,] -0.59607769 -0.616382588 1.04273054
[2,] 0.28465380 -0.017896153 -1.26758238
[3,] 1.97319968 1.234168820 -0.03510730
[4,] -1.24194545 0.170139383 0.84586987
.......
The second rotation is Oblique (Promax) and the results are as follows
[1] "Promax Factor Loadings"
[,1] [,2] [,3]
[1,] 0.10895336 0.90058699 0.13062313
[2,] -0.12048997 0.91037793 -0.11768389
[3,] 0.06214795 0.07796628 0.90044284
[4,] -0.09963805 -0.06950278 0.90178181
[5,] -0.93819973 -0.08681418 0.07649840
[6,] -0.91956666 0.10906335 -0.04013362
[1] "Coefficient Matrix"
[,1] [,2] [,3]
[1,] 0.06917274 0.53965839 0.07797890
[2,] -0.06418369 0.54467281 -0.07464688
[3,] 0.04806237 0.04494003 0.54263501
[4,] -0.04451634 -0.04438377 0.54157032
[5,] -0.53110712 -0.05804648 0.03337254
[6,] -0.52092238 0.05964160 -0.03703614
[1] "Factor Scores"
[,1] [,2] [,3]
[1,] -0.66731997 -0.42380216 1.05094056
[2,] 0.47166717 -0.18724198 -1.29096075
[3,] 1.79334119 0.96951114 -0.20571498
[4,] -1.36613427 0.41134414 1.03830371
.......
The third rotation is Oblique (Oblimin) and the results are as follows. Please note that the matrix Phi is the correlation matrix between the rotated factors. The rotating matrix is an intermediate result produced by R and can be ignored.
[1] "Oblimin Factor Loadings and Factor Correlations"
Oblique rotation method Oblimin Quartimin converged.
Loadings:
[,1] [,2] [,3]
[1,] 0.0946 0.8959 0.1375
[2,] -0.1316 0.9056 -0.1080
[3,] 0.0514 0.0822 0.8968
[4,] -0.1078 -0.0635 0.8980
[5,] -0.9345 -0.0804 0.0820
[6,] -0.9174 0.1137 -0.0324
Rotating matrix:
[,1] [,2] [,3]
[1,] 0.534 -0.418 -0.461
[2,] 0.339 0.944 -0.487
[3,] 0.856 0.126 0.817
Phi:
[,1] [,2] [,3]
[1,] 1.000 -0.234 -0.292
[2,] -0.234 1.000 0.207
[3,] -0.292 0.207 1.000
[1] "Coefficient Matrix"
[,1] [,2] [,3]
[1,] 0.07318699 0.54287847 0.07602563
[2,] -0.06171397 0.54743794 -0.07871153
[3,] 0.05236099 0.04092966 0.54530484
[4,] -0.04109614 -0.05017189 0.54376227
[5,] -0.53316646 -0.06611237 0.02825672
[6,] -0.52274532 0.05300352 -0.04301873
[1] "Factor Scores"
[,1] [,2] [,3]
[1,] -0.66482370 -0.44502856 1.05097451
[2,] 0.46311661 -0.16992759 -1.29069359
[3,] 1.80422927 1.00183304 -0.19319602
[4,] -1.36133404 0.38501672 1.02623918
...........
Sample Size
Parallel Analysis
Confirmatory Factor Analysis
Currently there is no estimation of sample size for factor analysis that is based on
statistical theory. Recommendations from different sources vary greatly, and the commonly used rules of thumb
are as follows
- For each factor, 5 variables. For each variable, 5 subjects. In other words, 25 subjects per factor
- Sample size should be 3 to 20 times the number of variables used, or absolute numbers of 100 to 1000.
Mundfrom (see references) and others in 2005 used empirical simulations to estimate minimal sample sizes that are likely to
produce reproducible results. This page summarises the more commonly used parts of table 1 and 2 from this paper, allowing
quick references to the minimal sample sizes required for factor analysis under the more usual clinical scenario. It is
recommended that users read the original paper to gain a clearer understanding of how the sample sizes are derived, and to obtain
the full tables of sample sizes suitable for a wider range of conditions.
Sample size based on the communality of the model
In general, sample size depends on two criteria, the ratio of the number of
variables to the number of factors, and the communality of the factors extracted.
Communality is a value between 0 and 1, and represents the
proportion of the total variance in the data that is extracted by the factor analysis.
This page summarises table 1 of the paper.
Note : rows are p/f (the ratio of the number of variables to the number of factors), and columns are the number of factors
Where the communality is expected to be high (0.6 or more)
p/f | 1 | 2 | 3 | 4 | 5 | 6 |
3 | 32 | 320 | 600 | 800 | 1000 | 1200 |
4 | 27 | 150 | 260 | 350 | 450 | 500 |
5 | 21 | 75 | 130 | 260 | 260 | 300 |
6 | 19 | 55 | 95 | 160 | 200 | 160 |
7 | 18 | 45 | 75 | 110 | 130 | 110 |
8 | 18 | 45 | 75 | 90 | 75 | 70 |
9 | 17 | 40 | 60 | 65 | 80 | 80 |
10 | 15 | 35 | 60 | 70 | 65 | 65 |
11 | 16 | 35 | 55 | 60 | 60 | 75 |
12 | 15 | 35 | 55 | 55 | 65 | 75 |
Where the communality is expected to be not so high (0.2 to 0.6)
p/f | 1 | 2 | 3 | 4 | 5 | 6 |
3 | 110 | 710 | 1300 | 1400 | 1400 | 1600 |
4 | 65 | 220 | 350 | 700 | 900 | 900 |
5 | 50 | 130 | 200 | 300 | 300 | 350 |
6 | 50 | 95 | 140 | 180 | 200 | 180 |
7 | 40 | 75 | 105 | 160 | 150 | 130 |
8 | 36 | 65 | 90 | 90 | 130 | 110 |
9 | 33 | 55 | 70 | 85 | 90 | 100 |
10 | 32 | 55 | 75 | 80 | 85 | 95 |
11 | 36 | 50 | 65 | 75 | 85 | 95 |
12 | 30 | 50 | 70 | 75 | 85 | 95 |
Sample size where the ratio of variables to factors is 7 or more
A simpler approach is to use models where the ratios of variables to factors (p/f) are at least 7
and assuming that the model will have communalities usable in the clinical situation.
In this case, minimal sample size required depends only on the number of factors in the model.
The sample sizes required can be more simply and conveniently presented. Based on table 2 of the paper, they are :
- 18-60 for 1 factor with 7 or more variables
- 45-80 for 2 factors with 14 or more variables
- 75-100 for 3 factors with 21 or more variables
- 110-180 for 4 factors with 28 or more variables
- 130-170 for 5 factors with 35 or more variables
- 110-140 for 6 factors with 42 or more variables
Where there are more than 6 factors, the minimum sample size required is 100.
This applies until the number of factors is 15 with 105 or more variables, when the
sample size should exceed the number of variables.
In short, a sample size of 180 can be used where the ratio of variables to
factors is 7 or more, and there are fewer than 15 factors.
StatsToDo does not offer any calculations for Confirmatory Factor Analysis, as the procedures are complex,
the choices numerous, and the pitfalls aplenty. StatsToDo takes the view that those undertaking Confirmatory
Factor Analysis should have expertise not only in the subject being investigated, but also in statistics at a
professional level. The following is a brief introduction to the subject, based on the algorithms available in the statistical
software package LISREL. The main purpose is to demonstrate the complexity involved.
Confirmatory factor analysis answers the question whether a set of data fits
a prescribed factor pattern. It is usually used for two purposes.
The first is to test or confirm that a factor pattern, say from a survey tool,
is stable and therefore can be confirmed by an independent set of data. In this,
the factor pattern comes from an existing tool or a theoretical construct,
and whether this fits with a set of data is then tested.
The second is in the development of a multivariate instrument or tool, such
as a questionnaire to evaluate racism. In this, the number of factors (concepts,
constructs, or dimensions) is first defined, then a number of variables
(questions or measurements) that may reflect each of these constructs are developed.
Data are then collected and tested against the factor-variable relationship.
Those variables that do not fit neatly into a single factor are then replaced
or changed, and new data are collected and tested. This process is repeated
until a set of data collected fits the required pattern.
Confirmatory factor analysis uses the Maximum Likelihood method of extraction,
because it is robust and allows for significance testing. In practice, however, statistical significance is difficult to
interpret, as it is determined not only by how well the data fit the
theoretical construct, but also by the sample size and the number of variables and factors.
Another problem is that Confirmatory Factor Analysis and the Maximum Likelihood method of extraction
work best under the Principal Factor model (where the correlation matrix has its diagonal
elements replaced with the communalities). Performing Confirmatory Factor Analysis
on factors developed from the Principal Component model, or using the
Maximum Likelihood method on a correlation matrix without communality correction, will produce
strange-looking and uninterpretable results, particularly the Heywood case,
where the factors cannot be comprehensively extracted.
Test of fit between data and construct
The Chi Square Test is the primary statistical test. However, the Chi Square
tends to increase with sample size, with a decreasing number of variables, and with an
increasing number of factors. So, unless the theoretical construct was developed
using Maximum Likelihood Factor Analysis that had exhausted all the correlations
in the original matrix, and the sample size of the testing data set is
similar to that of the original data set used to develop the construct, the Chi Square
may not truly reflect how well the data fit the construct.
As the Chi Square Test is statistically robust, yet problematic because of the
complexity of confirmatory factor analysis, statisticians have developed an
array of adaptations of the Chi Square that adjust for sample size and the
relationship between the number of variables and factors.
From the clinical point of view however, the following decision making steps
are recommended by some of the publications.
- Step 1. Examine the critical number. This is the minimum sample size,
below which the results cannot be validly interpreted. Only if the sample
size exceeds the critical number can interpretation proceed.
- Step 2. Look at the significance of the Chi Square. The Minimum Fit
Chi Square can be used if one can be sure that all the variables are continuous
measurements and normally distributed. If this cannot be assured, as in
almost all cases, then the Normal Theory Weighted Least Squares
Chi Square is used. If the Chi Square is not significant (p>=0.05), then
a decision that a good fit exists between data and theory can be made.
Subsequent steps are only necessary if the Chi Square is significant (p<0.05).
- Step 3. Examine the Chi Square to Degrees of Freedom Ratio, obtained
as Ratio = Chi Square / Degrees of Freedom. If the Ratio is 2 or less, then a
decision that a reasonably good fit exists can be made. If the ratio exceeds 2, go to step 4.
- Step 4. Examine the Goodness of Fit Index. If this index is 0.9 or
more, then a decision that a reasonably good fit exists can be made. If
not, then the conclusion that the data fit the theory poorly should be made.
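The decision sequence can be summarised as a small R sketch; all inputs are hypothetical values read from a confirmatory analysis output:
assessFit <- function(n, criticalN, chiSq, df, p, gfi)
{
  if(n < criticalN) return("sample size below the critical number: do not interpret") # Step 1
  if(p >= 0.05) return("good fit")                                                    # Step 2
  if(chiSq / df <= 2) return("reasonably good fit")                                   # Step 3
  if(gfi >= 0.9) return("reasonably good fit")                                        # Step 4
  "data fit the theory poorly"
}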
Possible actions when data and theory do not fit.
If the conclusion is that the data and theory fit, then the statistical
exercise ends. The researcher accepts the validity of the theory and moves on.
When the conclusion is that the data do not fit the theory, the possible actions
depend on the reasons for conducting the confirmatory test in the first place.
- Option 1. Reject the factor construct and move on. This is of course the
primary purpose of the exercise: the question is whether the data fit the
theory, the answer is no, and the matter ends. Another reason for doing this is that
the data themselves are problematic, so that a fit can never be obtained. This
occurs if the number of variables in a factor is too few, if the correlations
are such that one or more variables load across more than 1 factor, or if the
correlations within a factor vary greatly (as it would then not be possible
to extract all the correlation out of the matrix).
- Option 2. Ask why it did not fit, and whether it is fixable. This option can be
taken if the fit is nearly good enough and only a few variables are problematic.
Variables that do not fit well can be removed and the exercise repeated. This is feasible if the exercise is
part of developing a new statistical instrument of measurement, but not suitable when testing a set of data
against an established factor model.