Content Disclaimer: Copyright ©2020. All Rights Reserved.


Introduction
StatsToDo provides R codes in some of its pages, and this page provides an overview and orientation for using them.
The inclusion of R code is a work in progress. Initially I aimed to create resources for the more complex multivariate statistical procedures that I am unable to program myself. Eventually I aim to supplement most programs in StatsToDo with R code and examples, using resources provided by CRAN where they are available, or rewriting my programs in R where no CRAN program can be found. The initiative began in April 2020, so not all algorithms will be accompanied by R code for some time.

The R code in StatsToDo is basic, simple, and the minimum necessary to complete an initial analysis. The aim is to create a shallow learning curve for those without great statistical experience, to allow them to explore their data, and to provide the minimum results required by most clinical journals for publication. In most cases the code merely repeats the programs written in PHP or Javascript for the web page, and serves as no more than a validation of the algorithm. Experienced statisticians will find the details provided incomplete, or even insufficient, as the safeguards and full descriptions of the results are often not provided, although references are given for users to obtain further information and resources. Inexperienced users are strongly urged to seek advice from experienced statisticians before drawing conclusions from the results obtained.

All data in the examples are computer generated, to demonstrate the analysis and interpretation. Although attempts were made to produce results that are plausible and easy to understand, users should realise that the data are artificial and do not represent reality. Given my background in obstetrics, the examples mostly concern childbirth and hospitals. The interests are mostly clinical: classification, surveys, quality control, clinical discovery, and trials.
Three colors are used to clarify the intent of the text:
- **Maroon** is used for R code
- **Navy blue** is used for results generated by R in the console
- **Black** is used for explanations and descriptions
All R programs in StatsToDo assume no missing data. How to handle missing data is described in the Missing Data panel of this page.

## Setting up R and RStudio
The first thing to do is to set up R on your computer. Using your web browser, download and install the latest version of R from https://cran.r-project.org/, and RStudio from its own website. After this, run all R programs using RStudio.

## RStudio
RStudio is a powerful programming platform, and there are many ways it can be used. Included here is the basic information for the beginner, covering only those issues that arise in StatsToDo.

## Anatomy of RStudio
When activated, RStudio consists of 4 panels.
- The bottom left panel is called the **console**. We start here because you can analyse data using the console alone. Commands can be typed into the console and run interactively. The console is also where the results of calculations (other than graphics) are displayed.
- The top left panel is called the **source**. This is essentially a text editor, where files can be loaded, edited, and saved. Part or all of the code in the source can be selected with the mouse, and on clicking the run button the output is displayed in the console. **Please note** that the example code from this site is intended to be copied and pasted into this source panel, and run from there.
- The top right panel is called **environment and history**. This provides a list of variables and procedures created and run during a session in the console. The idea is that these are saved when exiting RStudio, and reloaded in future sessions to provide continuity. This panel is not used in these pages.
- The bottom right panel is called the **viewer**, which contains a number of sub-panels. This is where packages and help can be viewed. Graphical output is also displayed in this panel.
## Packages
Many statistical programs are installed automatically with R. Additional programs written by the wider R user community are available as packages. These packages need to be installed on your computer (only once), using the install.packages("PackageName") command.
Once installed, a package can be activated during analysis with the library("PackageName") command.
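A common pattern is to install a package only when it is absent, then activate it. This is a minimal sketch: the base package stats is used here only so the sketch runs anywhere, and should be replaced with the name of the package actually needed.

```r
# Install a package only if it is not already available, then activate it.
pkg <- "stats"   # placeholder: substitute the package name you actually need
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)                 # one-off installation to the computer
}
library(pkg, character.only = TRUE)     # activation, needed in every session
```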
Required packages are included in the example code. The explanation and resources available for each package, once installed, can be accessed with the help command, described under Help and further information below.

## How to use the example codes
For any procedure, the following steps can be used:
- Activate RStudio
- Copy and paste the example code (in **Maroon**) into the **source** panel, in the same order as presented in the web page
- Select the code and click **run** at the top right corner of the source panel
- Check that the program runs as intended
- Replace example names and data with the user's own
- Add, delete, or edit code as appropriate
- Select and run again to produce the results required
## Help and further information
Help for any code or function provided by R can easily be obtained within RStudio, by typing the following into the bottom left panel (console):
- help(subjectName) will bring up explanations of the subject in the bottom right panel (help)
- ??packageName will bring up documentation of the package in the bottom right panel (help)
- https://cran.r-project.org/doc/manuals/R-intro.html An Introduction to R from CRAN
- https://cran.r-project.org/doc/contrib/Short-refcard.pdf Reference card cheat sheet for R
- An R Companion for the Handbook of Biological Statistics by Salvatore S. Mangiafico. This is a textbook-like site that provides resources for many commonly used statistical procedures in biomedical statistics
- Summary and Analysis of Extension Program Evaluation in R by Salvatore S. Mangiafico. This is a textbook-like site providing resources for the more advanced statistical procedures commonly used in complex data analysis
- R for Dummies by Vries and Meys. John Wiley & Sons Inc. ISBN 978-1-119-05580-9. This is good for the novice with no prior experience of R.
- R Cookbook by Teetor. O'Reilly Media Inc. ISBN 978-93-5023-379-5. This takes the user step by step through using and programming R, and introduces the basic R resources
- R Graphics Cookbook by Chang. O'Reilly Media Inc ISBN 978-1-449-31695-2. This is an excellent book providing detailed instructions on how to do graphics using R
## Dataframe
R can handle data in numerous formats; the most commonly used is the data frame. A data frame is an object (a computer construct that can contain a large amount of information). In most cases, particularly in these pages, the dataframe contains a table with columns for variables and rows for cases.

The first step in any analysis is therefore converting the table of data into a dataframe, which R can then analyse. The standard data table has columns for variables and rows for subjects, with the first row containing the name of each column. The table can be imported into R using one of the following methods.

## Direct data I/O
Nearly all examples of R code on this site use direct data I/O. This is the simplest form of data entry for small sets of data. The example code is as follows

```r
myTxt = ("
  Col1 Col2 Col3
  A    1    2
  B    3    4
  C    5    6
")                                                        # Example data in text
myDF <- read.table(textConnection(myTxt), header=TRUE)    # Make data frame
#myDF                                                     # Optional: show dataframe in console
```

Please note that:
- The table, in text, is between **("** and **")**
- The first row contains the column names
- The columns are separated by spaces or tabs
- The values of Col1 are text, which R calls factors, and those of Col2 and Col3 are numerical.
**Please note:** the type of values in each column must be consistent.
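The column types R has inferred can be checked immediately after import; a minimal sketch, using the example table above:

```r
# Build the example data frame shown above and inspect the column types.
myTxt = ("
  Col1 Col2 Col3
  A    1    2
  B    3    4
  C    5    6
")
myDF <- read.table(textConnection(myTxt), header = TRUE)
str(myDF)             # compact overview: one line per column, showing its type
sapply(myDF, class)   # class of each column
```

Note that from R 4.0.0 onwards, read.table reads text columns as character rather than factor by default; add stringsAsFactors=TRUE to get the older factor behaviour.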
## Files
By default, R reads and writes all files using the standard Documents folder. It is useful to set the default folder to the same one as the file currently loaded into the source panel. To do so, at the menu bar at the top click Session -> Set Working Directory -> To Source File Location. To check whether the directory is correct, use the following code

```r
getwd()
list.files()
```

This will display the folder being used and the files it contains.

## Comma delimited (.csv) files
The code for input from a comma delimited (.csv) file is

```r
myDF <- read.csv("myCsvIn.csv")
#myDF
```

Please note that:
- myCsvIn.csv is the name of the comma delimited file containing the data table
- The first line reads the table into the dataframe
- The second line is an optional display of the dataframe

```r
write.csv(myDF, "myCsvOut.csv", row.names = FALSE)
```

Please note that:
- myDF is the name of the dataframe to be saved to the .csv file. It should already exist
- myCsvOut.csv is the name of the output comma delimited (.csv) file. Users should change this name to one that is appropriate
- If row.names = FALSE is not used, a first column containing the row numbers will be included
- If the named .csv file already exists and is closed, it will be overwritten. An attempt to write to an open file with the same name will fail and flag an error message
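The write and read steps above can be checked together; a minimal sketch using a temporary file, so that no existing file is touched:

```r
# Write a small data frame to a temporary .csv file and read it back.
myDF <- data.frame(Col1 = c("A", "B", "C"), Col2 = c(1, 3, 5), Col3 = c(2, 4, 6))
csvPath <- tempfile(fileext = ".csv")    # temporary path; replace with your own file name
write.csv(myDF, csvPath, row.names = FALSE)
backDF <- read.csv(csvPath)
identical(dim(myDF), dim(backDF))        # same number of rows and columns round-trip
```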
## Excel worksheet (.xlsx) files
To access .xlsx files, the package xlsx must already be installed. If not, the command install.packages("xlsx") will do so.
Once installed, each time a .xlsx file is accessed the library xlsx must be called.
For data input:

```r
#install.packages("xlsx")   # use only if the package is not already installed
library("xlsx")
myDF <- read.xlsx("myXlsxIn.xlsx", sheetName="mySheet")
#myDF
```

Please note that:
- The xlsx package needs to be installed once per computer
- The library xlsx must be called
- myXlsxIn.xlsx is the name of the Excel .xlsx file to be read. Users should change this to their own file name
- mySheet is the name of the worksheet to be read. This can be the name of the sheet, or a number (without the quotes) counting from 1
- When testing, uncomment the last command (remove the #) to see if the correct data has been read
```r
library("xlsx")
write.xlsx2(myDF, "myXlsxOut.xlsx", sheetName = "mySheet", row.names = FALSE, append=TRUE)
```

Please note that:
- The package xlsx must be pre-installed and the library called
- myDF is the dataframe to be saved to file, a table with columns and rows
- myXlsxOut.xlsx is the name of the Excel workbook to be saved to. Users should change this to their own file name
- mySheet is the name of the sheet to save the data to
- If row.names = FALSE is not used, a first column containing the row numbers will be included
- If the file already exists, it must be closed. An attempt to write to an open file will crash the program
- If the append option is FALSE, any existing file containing a worksheet with the same name will be overwritten
- If the append option is TRUE and no worksheet with the same name exists, a new worksheet with that name will be created and the data written to it. If the append option is TRUE and a worksheet with the same name already exists, the write will fail and the program crashes
## Creating a report of analysis using Knit
RStudio provides a utility which runs the code in the source panel and collects all the results into a file, which can be an html, Word, or pdf file. To do so, the package rmarkdown must be pre-installed using the code install.packages("rmarkdown"). Once a set of R code has been tested and found satisfactory, at the menu bar at the top of RStudio:
- Click File -> Knit Document
- Select the type of file; msWord is recommended as it is the easiest to edit afterwards

Please note:
- Knit can be incompatible with some other packages, especially those that read or write files or produce graphics. The program merely crashes without much explanation. For example, knit will crash when trying to read from or write to Excel worksheets.
- In some graphics programs, the labelling of the axes may be misplaced or absent in the report file
This panel provides some useful templates for handling arrays and matrices.
## 1: Data Frames and Matrices
Template 1.1: Create a matrix from a numerical data frame

```r
txt = ("
  Col1 Col2 Col3
  7    1    2
  8    3    4
  9    5    6
")
df <- read.table(textConnection(txt), header=TRUE)
mx <- data.matrix(df)
colnames(mx) <- NULL   # optional: remove column names
mx
```

The results are

```
> mx
     [,1] [,2] [,3]
[1,]    7    1    2
[2,]    8    3    4
[3,]    9    5    6
```

Template 1.2: Create a matrix from the numerical fields of a mixed factor (text) and value (numerical) data frame
```r
txt = ("
  Col1 Col2 Col3
  A    1    2
  B    3    4
  C    5    6
")
df <- read.table(textConnection(txt), header=TRUE)
#df   # data frame
mx <- matrix(c(df$Col2, df$Col3), ncol=2)
mx
```

Note: Col2 and Col3 are numerical, and are extracted to form a 2 column matrix. The results are

```
> mx
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
```

Template 1.3: Convert a matrix into a data frame
```r
# Template 1.3: matrix to data frame
#mx <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12), nrow=4)   # alternative method to create the matrix
col1 <- c(1,2,3,4)
col2 <- c(5,6,7,8)
col3 <- c(9,10,11,12)
mx <- cbind(col1, col2, col3)
mx                            # matrix
# turn matrix into data frame
df = as.data.frame(mx)
df
# add another column to an existing data frame
df$col4 <- c(20,21,22,23)     # method 1
df
ar <- c(30,31,32,33)          # method 2
df$col5 <- ar
df
```

The results are as follows

```
> df
  col1 col2 col3
1    1    5    9
2    2    6   10
3    3    7   11
4    4    8   12
> # add another column to existing data frame
> df$col4 <- c(20,21,22,23)   # method 1
> df
  col1 col2 col3 col4
1    1    5    9   20
2    2    6   10   21
3    3    7   11   22
4    4    8   12   23
> ar <- c(30,31,32,33)        # method 2
> df$col5 <- ar
> df
  col1 col2 col3 col4 col5
1    1    5    9   20   30
2    2    6   10   21   31
3    3    7   11   22   32
4    4    8   12   23   33
```

## 2: Arrays

```
# create an array with contents
ar <- c(1,2,3,4)
ar
[1] 1 2 3 4

# create an array as a sequence
start = 1
finish = 10
increment = 2
ar <- seq(start, finish, by=increment)
ar
[1] 1 3 5 7 9

# create an empty vector
ar <- vector()

# create a vector of defined length, filled with a default value
default = 5
siz = 3
ar <- array(default, siz)
ar
[1] 5 5 5

# operations
ar1 <- c(1,2,3)
ar2 <- c(4,5,6)
# add
ar1 + ar2
[1] 5 7 9
# subtract
ar2 - ar1
[1] 3 3 3
# multiply elements
ar1 * ar2
[1] 4 10 18
# multiply vectors, column by row (default is column)
ar1 %*% t(ar2)
     [,1] [,2] [,3]
[1,]    4    5    6
[2,]    8   10   12
[3,]   12   15   18
# multiply vectors, row by column
t(ar1) %*% ar2
     [,1]
[1,]   32
```

## 3: Matrices

```
# matrix with values
ar <- c(1,2,3,4,5,6)
> mx <- matrix(data=ar, nrow=2, ncol=3)   # byrow=FALSE is the default
> mx
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> mx <- matrix(data=ar, nrow=2, ncol=3, byrow=TRUE)
> mx
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

# matrix with a default value
default = 5
> mx <- matrix(data=default, nrow=2, ncol=3)
> mx
     [,1] [,2] [,3]
[1,]    5    5    5
[2,]    5    5    5

# matrix joining two arrays
a1 <- c(1,2,3)
a2 <- c(4,5,6)
> mx <- matrix(data=c(a1,a2), nrow=2, ncol=3, byrow=TRUE)
> mx
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
> mx <- matrix(data=c(a1,a2), nrow=3, ncol=2)
> mx
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

# rows
> mx[1,]
[1] 1 4
> mx[2,]
[1] 2 5
> mx[3,]
[1] 3 6
# columns
> mx[,1]
[1] 1 2 3
> mx[,2]
[1] 4 5 6

# operations
> mx1 <- matrix(c(c(3,5,1,2), c(4,1,3,1), c(2,3,1,1)), nrow=4, ncol=3)
> mx1
     [,1] [,2] [,3]
[1,]    3    4    2
[2,]    5    1    3
[3,]    1    3    1
[4,]    2    1    1
> mx2 <- matrix(c(c(1,2,1), c(2,3,1)), nrow=3, ncol=2)
> mx2
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    1    1

# transpose
> mx3 <- t(mx1)
> mx3
     [,1] [,2] [,3] [,4]
[1,]    3    5    1    2
[2,]    4    1    3    1
[3,]    2    3    1    1

# add
> mx1 + mx1
     [,1] [,2] [,3]
[1,]    6    8    4
[2,]   10    2    6
[3,]    2    6    2
[4,]    4    2    2

# subtract
> mx2 - mx2
     [,1] [,2]
[1,]    0    0
[2,]    0    0
[3,]    0    0

# multiply elements
> mx2 * mx2
     [,1] [,2]
[1,]    1    4
[2,]    4    9
[3,]    1    1

# multiply matrices
> mx1 %*% mx2
     [,1] [,2]
[1,]   13   20
[2,]   10   16
[3,]    8   12
[4,]    5    8

# division by elements
> mx1 / mx1
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    1
[4,]    1    1    1

# transpose
> t(mx2)
     [,1] [,2] [,3]
[1,]    1    2    1
[2,]    2    3    1

# square matrix
> mx <- matrix(c(c(4,2,3,2), c(2,5,3,1), c(3,3,6,2), c(2,1,2,3)), nrow=4, ncol=4)
> mx
     [,1] [,2] [,3] [,4]
[1,]    4    2    3    2
[2,]    2    5    3    1
[3,]    3    3    6    2
[4,]    2    1    2    3

# invert
> solve(mx)
            [,1]        [,2]        [,3]        [,4]
[1,]  0.50000000 -0.07142857 -0.14285714 -0.21428571
[2,] -0.07142857  0.29591837 -0.12244898  0.03061224
[3,] -0.14285714 -0.12244898  0.32653061 -0.08163265
[4,] -0.21428571  0.03061224 -0.08163265  0.52040816

# rank (uses QR decomposition)
> qr(mx)$rank
[1] 4

# determinant
> det(mx)
[1] 98

# eigenvalues and eigenvectors
> eigenResults <- eigen(mx)
> eigenResults
$values
[1] 11.501474  3.143052  2.000000  1.355473
$vectors
           [,1]        [,2]       [,3]        [,4]
[1,] -0.4779381 -0.36246433  0.3779645  0.70522169
[2,] -0.4961149  0.76831305  0.3779645 -0.14390264
[3,] -0.6487310 -0.05918796 -0.7559289 -0.06493346
[4,] -0.3234090 -0.52422462  0.3779645 -0.69118597

# square root of the matrix, using the eigen decomposition
> sr <- eigenResults$vectors %*% diag(sqrt(eigenResults$values)) %*% solve(eigenResults$vectors)
> sr
          [,1]      [,2]      [,3]      [,4]
[1,] 1.7886506 0.3942986 0.6321685 0.4956015
[2,] 0.3942986 2.1073918 0.6176967 0.1479165
[3,] 0.6321685 0.6176967 2.2465112 0.4147301
[4,] 0.4956015 0.1479165 0.4147301 1.6001559

# validate the square root matrix
> sr %*% sr
     [,1] [,2] [,3] [,4]
[1,]    4    2    3    2
[2,]    2    5    3    1
[3,]    3    3    6    2
[4,]    2    1    2    3
```
## Explanations &amp; References
This panel explains and provides example codes for handling missing data using R.

## R Codes for Missing Values
In research, missing data is a common occurrence: subjects are lost, errors are made in collecting and transcribing information, and a whole host of other reasons create holes in the data table. R provides options for handling missing data in nearly all its procedures, but this requires the analyst to be familiar with how missing data may affect a particular procedure, and which options each procedure provides for handling it.

For the sake of simplicity, all the code provided in StatsToDo assumes that the data is already clean and contains no missing data. This separates the procedures for handling missing data from the statistical algorithms. This panel therefore provides the algorithms for handling missing data in the final stages of data preparation, to produce a complete set of data for analysis.

## How missing data are represented in R
Within the dataframe object, missing values are represented by NA in numerical columns and &lt;NA&gt; in text columns. However, in data I/O, the following is used:
- When data is presented directly in the R code as a text table, or read in from a comma delimited (.csv) file, missing data is represented by **NA**. Any other representation is interpreted by R as a value in the data
- When data is read in from an Excel worksheet using the package xlsx, missing data are blank cells in the worksheet. Anything else will be interpreted as an actual value and processed accordingly
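How R flags these values after import can be checked with is.na; a minimal sketch:

```r
# A small table with missing values marked as NA, as described above.
txt <- ("
  Sex  Gest
  Girl 37
  NA   36
  Boy  NA
")
df <- read.table(textConnection(txt), header = TRUE)
is.na(df)            # logical matrix: TRUE wherever a value is missing
colSums(is.na(df))   # number of missing values in each column
```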
## Different options for dealing with missing data
R provides an extensive collection of methods for handling missing data. Only a few of the more commonly used ones are presented in this page. This panel discusses the options conceptually; the complete code, and how it works, is presented in the R Code panel.

## Option 1: Casewise deletion
This is the easiest and most widely used method: all records containing missing values are deleted. The method is appropriate if the analyst is sure that data is lost at random, so that removing records containing missing data does not create a bias leading to misinterpretation. The amount of missing data should also be small, say less than 1% of the cases.

## Option 2: K Nearest Neighbour
For each missing value, the program searches for the k completed records that are nearest to it (in similarity, not location), replacing the missing value with the average of those records for a numerical (value) column, and with the most frequent value for a text (factor) column. k can be specified in the formula; if not specified, the default k=10 is used.

This is a robust method, and can be used even if some biased process is implicated in the data loss, as the missing value is replaced by values from similar records. There are, however, some issues. Firstly, for every missing value, k (10) completed records are required. Secondly, the whole database is searched for the nearest records, which is time consuming if the database is large and the missing data numerous. The method was devised by those working on big data and artificial intelligence, where thousands or even millions of records are available, and the data can be analysed on powerful computers over prolonged periods. Clinical data is caught between databases that are not large enough, so that k has to be reduced, and the long time required for processing on desktop computers. Although the method is excellent in theory, it cannot always be used successfully in the clinical setting. However, it is worth a try. If the program crashes or takes too long to run, k can be progressively reduced until the program works.

## Option 3: General imputation
The program randomly selects a missing value and replaces it with an estimate obtained from the available data by multiple regression. This estimate is then included in the available data when estimating the next randomly selected missing value. The process is repeated until all missing values are replaced by estimated (imputed) values.

As later estimates are influenced by earlier estimated values, the results differ slightly depending on the random sequence. The program copes with this by iterating the process a number of times (m) and averaging the results. The number of iterations (m) can be specified by the user; if not specified, the default is m=5. Controversy exists over what m should be, and some statisticians argue that m should be the same as the number of missing values in the data.

This method is most suited to the small data sets that are common in clinical studies, especially surveys and clinical trials where the sample size is around 100. The only proviso is that at least one (1) numerical column must exist in the data set for the algorithm to work. Users should also be aware that the same program and data will produce slightly different results when repeated, as the random sequence is generated at run time, so it differs each time.

## Option 4: Numerical imputation
The program is a mathematical algorithm that uses existing values in the same column of the data set to estimate replacements for the missing values. The methods available are mean, median, mode, and interpolation (the average of the available values on the two sides of the missing value). The method only works on columns of numerical data, and ignores missing values in text columns. It is quick to implement and the results are easy to interpret. It can be used if such mathematical replacement is appropriate to the analyst's needs. The interpolation method is especially useful in time series data, such as continuous monitoring, as the interpolated result is close to what the missing value should be.

## Additional information
## Checking the results
It is important to check the results of fixing missing data before the data set is used for analysis. There are numerous methods for doing so, but they are not covered in this page. The example code provides basic comparisons using the summary command, which counts the different values in text columns, and shows the minimum, maximum, quartiles, and mean in numerical columns.

## References
https://en.wikipedia.org/wiki/Missing_data
https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html
https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
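The before/after check described above can be done with counts of missing values as well as with summary; a minimal sketch, using casewise deletion (Option 1) on a small invented table:

```r
# Count missing values per column before and after casewise deletion.
df <- data.frame(Gest = c(37, NA, 35, 40), BWt = c(3048, 2813, NA, 3453))
colSums(is.na(df))      # missing values per column before fixing
clean <- na.omit(df)    # Option 1: delete every row with any missing value
colSums(is.na(clean))   # all zero afterwards
summary(clean)          # compare with summary(df), as described above
```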
This panel provides the R code for handling missing values.
## Example data
## Direct data entry

```r
myTxt = ("
  Sex  Ethn   Gest BWt
  Girl Greek  37   3048
  Boy  German 36   2813
  Girl French 41   3622
  NA   Greek  36   2706
  Boy  German 35   2581
  Boy  NA     NA   3442
  Girl Greek  40   3453
  Boy  German 37   3172
  Girl French 35   NA
  Boy  Greek  39   3555
  Girl German 37   3029
  Boy  French 37   3185
  Girl NA     36   2670
  Boy  German NA   3314
  Girl French 41   3596
  Boy  Greek  38   3312
  NA   NA     39   3200
  Boy  French 41   3667
  Boy  Greek  40   3643
  Girl German 38   3212
  Girl French 38   3135
  Girl Greek  39   3366
")
myDF <- read.table(textConnection(myTxt), header=TRUE)
summary(myDF)
#myDF
```

The example data is computer generated, and purports to come from a study of birth weight: Sex is the sex of the baby, Ethn the ethnicity of the mother, Gest the number of completed weeks of gestation, and BWt the weight of the baby at birth.
The table has columns for variables and rows for subjects (each baby). The first row contains the names of the columns, and NA represents missing data. myTxt is the name given to this table; users can change this to any other name. The table is followed by 3 lines of code, which:
- Import the data into the dataframe object myDF
- Display the summary, which can be used for comparison with the results of changes to the data set
- Optionally display the data as represented within the dataframe. This can be activated by uncommenting the line (removing the #)
```
  Sex       Ethn        Gest            BWt      
 Boy :10   French:6   Min.   :35.00   Min.   :2581  
 Girl:10   German:6   1st Qu.:36.75   1st Qu.:3048  
 NA's: 2   Greek :7   Median :38.00   Median :3212  
           NA's  :3   Mean   :38.00   Mean   :3225  
                      3rd Qu.:39.25   3rd Qu.:3453  
                      Max.   :41.00   Max.   :3667  
                      NA's   :2       NA's   :1     
```

There are 22 subjects. Sex and Ethn are text columns, and the number of rows with each value is counted. Gest and BWt are numerical columns, and the quartile and mean values are presented. Missing values, represented as NA, are also counted for each column.

## Handling Missing Values
## Option 1: Casewise deletion

```r
casewiseDeletedDataFrame <- na.omit(myDF)
summary(casewiseDeletedDataFrame)
#casewiseDeletedDataFrame
```

The first line creates a new dataframe, casewiseDeletedDataFrame, containing only those rows with no missing data.
The second line provides the summary, which can be compared with that from the input data. The third line is an optional display of the result data, so it can be copied and pasted into other applications; it can be activated by removing the #. The summary is as follows

```
 Sex      Ethn       Gest            BWt      
 Boy :8   French:5   Min.   :35.00   Min.   :2581  
 Girl:8   German:5   1st Qu.:37.00   1st Qu.:3113  
          Greek :6   Median :38.00   Median :3262  
                     Mean   :38.38   Mean   :3274  
                     3rd Qu.:40.00   3rd Qu.:3565  
                     Max.   :41.00   Max.   :3667  
```

Six (6) rows with one or more missing values are deleted, so the data set now has 16 rows.

## Option 2: K Nearest Neighbours

```r
#install.packages("DMwR")
library(DMwR)
knnDataFrame <- knnImputation(myDF, k=10)
summary(knnDataFrame)
#knnDataFrame
```

Line 1 installs the package DMwR, a data mining package from which this algorithm is obtained. It is commented out, as it is not needed once the package has been installed on the computer.
Line 2 calls the installed library. This must be done before running the program.
Line 3 creates a new dataframe with the missing values replaced by the estimated values. The value of k can be specified by the user; if not specified, the default is k=10. If the number of completed records is insufficient, or if the run time of the program is too long, k can be reduced.
Line 4 displays the summary of the new dataframe, which can be compared with that from the input data.
Line 5 is an optional display of the result data, so it can be copied and pasted into other applications; it can be activated by removing the #.
The summary is as follows

```
 Sex       Ethn       Gest            BWt      
 Boy :11   French:6   Min.   :35.00   Min.   :2581  
 Girl:11   German:7   1st Qu.:37.00   1st Qu.:3046  
           Greek :9   Median :38.00   Median :3206  
                      Mean   :38.07   Mean   :3217  
                      3rd Qu.:39.12   3rd Qu.:3450  
                      Max.   :41.00   Max.   :3667  
```

Twenty-two (22) rows remain in the data set, but the counts, interquartile values, and means have changed, as the missing values have been replaced by the estimated values.

## Option 3: General imputation

```r
#install.packages("mice")
library(mice)
impute <- mice(myDF, m = 5, print = FALSE)
fit <- with(data = impute, lm(BWt ~ Sex + Ethn + Gest))
pool <- pool(fit)
miceDataFrame <- complete(impute)
summary(miceDataFrame)
#miceDataFrame
```

Line 1 installs the package mice, which contains the imputation program. It is commented out, as it is not needed once the package has been installed on the computer.
Line 2 calls the installed library. This must be done before running the program.
Line 3 creates a data matrix containing the estimated values from the iterations. The number of iterations (m) can be specified by the user; if not specified, the default is m=5.
Line 4 estimates the imputed values, using a regression formula. The formula should contain the names of all the columns that the analyst intends to use to estimate missing values.
Lines 5 and 6 pool the results and create a new dataframe with the missing values replaced by the imputation estimates.
Please note the following:
- The variables used in the regression formula are the column names in the input data. Analysts should replace these with those from their own data
- At least one (1) column in the data must be a numerical one. Without this, the program fails and an error is flagged.
Line 7 is an optional display of the result data, so it can be copied and pasted into other applications; it can be activated by removing the #.
The summary is as follows

```
 Sex       Ethn       Gest            BWt      
 Boy :10   French:8   Min.   :35.00   Min.   :2581  
 Girl:12   German:6   1st Qu.:37.00   1st Qu.:3034  
           Greek :8   Median :38.00   Median :3206  
                      Mean   :38.05   Mean   :3206  
                      3rd Qu.:39.00   3rd Qu.:3450  
                      Max.   :41.00   Max.   :3667  
```

Twenty-two (22) rows remain in the data set, but the counts, interquartile values, and means have changed, as the missing values have been replaced by the estimated values.

## Option 4: Numerical imputation
Activate the library

```r
#install.packages("ggplot2")    # only if not already installed
#install.packages("imputeTS")   # only if not already installed
library(imputeTS)
```

The package imputeTS is required for numerical imputation. The algorithms in this package call functions in the package ggplot2, so that package also needs to be installed. The 2 installation commands are commented out because the packages only need to be installed on the computer once.
Once the library is called, the analyst can choose one of the mathematical models.

Replace missing values by the column mean

```r
meanDataFrame <- na.mean(myDF, option = "mean")
summary(meanDataFrame)
#meanDataFrame
```

The summary is

```
  Sex       Ethn        Gest         BWt      
 Boy :10   French:6   Min.   :35   Min.   :2581  
 Girl:10   German:6   1st Qu.:37   1st Qu.:3070  
 NA's: 2   Greek :7   Median :38   Median :3218  
           NA's  :3   Mean   :38   Mean   :3225  
                      3rd Qu.:39   3rd Qu.:3450  
                      Max.   :41   Max.   :3667  
```

Replace missing values by the column median

```r
medianDataFrame <- na.mean(myDF, option = "median")
summary(medianDataFrame)
#medianDataFrame
```

The summary is

```
  Sex       Ethn        Gest         BWt      
 Boy :10   French:6   Min.   :35   Min.   :2581  
 Girl:10   German:6   1st Qu.:37   1st Qu.:3070  
 NA's: 2   Greek :7   Median :38   Median :3212  
           NA's  :3   Mean   :38   Mean   :3224  
                      3rd Qu.:39   3rd Qu.:3450  
                      Max.   :41   Max.   :3667  
```

Replace missing values by the column mode

```r
modeDataFrame <- na.mean(myDF, option = "mode")
summary(modeDataFrame)
#modeDataFrame
```

The summary is

```
  Sex       Ethn        Gest            BWt      
 Boy :10   French:6   Min.   :35.00   Min.   :2581  
 Girl:10   German:6   1st Qu.:37.00   1st Qu.:3034  
 NA's: 2   Greek :7   Median :37.50   Median :3206  
           NA's  :3   Mean   :37.91   Mean   :3196  
                      3rd Qu.:39.00   3rd Qu.:3450  
                      Max.   :41.00   Max.   :3667  
```

Replace missing values by interpolation (the average of the values on each side of the missing value)

```r
interpolationDataFrame <- na.interpolation(myDF)
summary(interpolationDataFrame)
#interpolationDataFrame
```

Note that only the numerical columns are imputed: missing values in the text columns (Sex and Ethn) remain as NA, and are still counted in the summary. Note also that recent versions of imputeTS have renamed na.mean and na.interpolation to na_mean and na_interpolation.
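If the imputeTS package is not available, the idea behind interpolation can be sketched in base R using approx() for linear interpolation between the nearest non-missing neighbours. This is a minimal sketch of the concept on a single numeric vector, not the package's own implementation:

```r
# Linear interpolation of missing values in a numeric vector, base R only.
gest <- c(37, 36, NA, 36, 35, NA, 40)   # invented gestation values with gaps
idx  <- seq_along(gest)
# Interpolate at every position, using only the non-missing points as anchors.
filled <- approx(idx[!is.na(gest)], gest[!is.na(gest)], xout = idx)$y
filled   # known values are unchanged; each NA becomes a value between its neighbours
```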