R Exp

Content Disclaimer
Copyright @2020.
All Rights Reserved.

StatsToDo: R: Introduction and Explanation of the Basics

Links : Home Index (Subjects) Contact StatsToDo

Introduction
StatsToDo provides R codes in some of its pages, and this page provides an overview and orientation for using them.
The inclusion of R codes is a work in progress. Initially I aimed to create resources for the more complex multivariate statistical procedures that I am unable to program myself. Eventually I aim to supplement most programs in StatsToDo with R codes and examples, using resources provided by CRAN if they are available, or to rewrite my programs using R if the progam cannot be found in CRAN. The initiative began in April 2020, so not all algorithms will be accompanied by R codes for some time
The R codes in StatsToDo are basic, simple, and the minimum necessary to conclude an initial analysis. It is aimed to create a shallow learning curve for those with no great statistical experience, to explore his/her data, and to provide the minimum results that are required by most clinical journals for publication. In most cases they merely repeat the programs written in php or Javascript for the web page, and no more than a validation of the algorithm. Most experienced statisticians will find the details provided incomplete, or even insufficient, as the safeguards and complete descriptions of the results are often not provided. Although references will be provided for users to obtain further information and resources. Inexperienced users are strongly urged to seek advice from experienced statisticians before drawing conclusions from the results obtained.
All data in the examples are computer generated, to demonstrate the analysis and interpretation. Although attempts were made to produce results that are plausible and easy to understand, users should realise that the data are artificial and do not represent reality. Given that I had a background in obstetrics, the examples are mostly issues in childbirth and hospitals. The interests are mostly clinical, towards classification, survey, quality control, clinical discovery and trials.
Format
Three colors are used, to clarify the intent of the text

Maroon color is used to represent R codes
Navy blue color is used to show results generated by R in the console
Black is used for explanations and descriptions

All R programs in StatsToDo assume no missing data. How to handle missing data is described in the Missing Data panel of this page
R Studio

Setting up R and RStudio
The first thing to do is to set up the R package in your computer. Using your web browser download the latest version of R and RStudio from from https://cran.r-project.org/, making sure that you include RStudio. After this, run all R programs using RStudio
RStudio
RStudio is a powerful programming platform, and there are many ways it can be used. Included here are the basic information for the beginner, and includes only those issues that are covered in StatsToDo
Anatomy of RStudio :
When activated, RStudio consists of 4 panels.

The bottom left panel is called the console. we start here because you can analyse data just using the console. Commands can be typed into the console and run interactively. The console is also where the results of calculations (other than graphics) are displayed
The top left panel is called the source. This is essentially a text editor, where files can be loaded in, edited, and saved. Part or all of the codes in the source can also be marked by the mouse, and on clicking the run button the output will be displayed in the console. Please note that the example codes from this site is intended to be copied and pasted into this source panel, and run from here
The top right panel is called environment and history. This provides a list of variables and procedures created and run during a session in the console. The idea is that these are saved when exiting RStudio, and reloaded in future sessions to provide continuity. This panel is not used in these pages
The right bottom panel is called the viewer which contained a number of sub-panels. This is where packages and help can be viewed. Graphical output is also displayed in this panel

Packages
Many statistical programs are automatically installed with R. Additional programs are written by the wider R user community, available as packages. These packages need to be installed into your computer (only once), using the install.package("PackageName") command. Once installed, the package can be activated during analysis with the library("PackageName") command.
Required packages are included in the example codes. The install.packages("PackageName") command is usually commented out, as once installed it is always available.
The Explanation and resources available for eack package, once installed, can be accessed by the ??"PackageName" command typed into the console
How to use the example codes
For any procedure, the following procedures can be used

Activate RStudio
Copy and paste the example codes (in Maroon) into the source panel, in the same order that they are presented in the web page
Mark the codes and click run at the top right corner of the source panel
Check that the program runs as intended
Replace example names and data with the user's own
Add, delete, or edit codes as appropriate
Mark and run again to produce the results required
If the console becomes cluttered from repeated analysis the combined keys ctrl+l (el not one) will clear the console.
Help and further information
Help for any code or function provided by R can be easily obtained within RStudio, by typing the following in the bottom left panel of RStudio(console)

help(thesubject) will bring up explanations of the subject in the right bottom panel (help)
??packagename will bring up documentation of the package in the right bottom panel (help)
In addition to using search engines of the internet to obtain advice on how to use R, the following web sites are useful, particularly for the beginners

https://cran.r-project.org/doc/manuals/R-intro.html An Introduction to R from CRAN
https://cran.r-project.org/doc/contrib/Short-refcard.pdf Reference card cheat sheet for R
An R Companion for the Handbook of Biological Statistics by Salvatore S. Mangiafico. This is a textbook like site and provides resources for many commonly used statistical procedures in bio-medical statistics
Summary and Analysis of Extension Program Evaluation in R by Salvatore S. Mangiafico. This is a textbook like site providing resources for the more advanced statistical procedures commonly used in complex data analysis
The following text books are easy to read, and provide advice and templates for common statistical tasks

R for Dummies by Vries and Meys. John Wiley & Son's Inc. ISBN 978-1-119-05580-9. This is good for the novice with no prior experience of R.
R Cookbook by Teetor. O'Reilly Media Inc ISBN 978-93-5023-379-5. This takes the user step by step learning how to use and program R, and introduces the basic R resources
R Graphics Cookbook by Chang. O'Reilly Media Inc ISBN 978-1-449-31695-2. This is an excellent book providing detailed instructions on how to do graphics using R

Data I/O

Dataframe
R is able to handle data in numerous format, the most commonly used one is a data frame. Data frame is an object (a computer construct that contains large amount of information). In most cases, particularly in these pages, the dataframe contains a table with columns for variables, and rows for cases.
The first step in any analysis is therefore converting the table of data into a dataframe, which R can then analyse. The standard data table has columns for variables and rows for subjects. The first row containing the name of each column. The table can then be imported into R using one of the following methods
Direct data I/O
Nearly all examples of R codes on this site will use direct data I/O. This is the simplest form of data entry for small sets of data. The example codes are as follows
myTxt = (" Col1 Col2 Col2 A 1 2 B 3 4 C 5 6 ") # Example data in text myDF <- read.table(textConnection(myTxt),header=TRUE) # Make data frame #myDF # Optional show dataframe in console
Please note that:

The table, in text, is in between (" and ")
The first row contains the column names
The columns are separated by spaces or tabs
The values of Col1 are text, which R called factors, and those of Col2 and Col3 are numerical. Please Note: the type of values in each column must be consistent.
Results of analysis are displayed in the console or the graphic panel. These can be copied to the clipboard by marking and the combined keys of ctrl+c and pasted into any other application.
Files
By default, R reads and writes all file using the standard Documents folder. It is useful to change and Set the default folder to the same as the current file loaded into the source condole. To do so the following procedures are used. At the menu bar at the top click Session->Set Working Directory->To Source File Location. To check whether the directory is correct or not, use the following codes
getwd() list.files()
This will display the folder being used and the files it contains
Comma delimited (.csv) files
The codes for input from a comma delimited (.csv) file are
myDF <- read.csv("myCsvIn.csv") #myDF
Please Note that:

myCsvIn.csv is the name of the comma delimited file containing the data table.
The first line reads the table to the dataframe
The second line is optional display of the dataframe
The codes for writing an object, such as a dataframe, to a comma delimited (.csv) file are
write.csv(myDF, "myCsvOut.csv", row.names = FALSE)
Please Note that:

myDF is the name of the dataframe to be saved to the .csv. It should already exist
myCsvOut.csv is the name of the output comma delimited (.csv) file. User should change this name to one that is appropriate
If row.names= False is not used, then a first column containing row number will be included
If the named .csv file pre-exists and is closed, it will be overwritten. Attempt to write to an open file with the same name will fail and flag an error message

Excel worksheet (.xlsx) files
To access .xlsx files, the package xlsx must be already installed. If not installed, the command install.packages("xlsx") will do so. Once installed, each time a .xlsx file is accessed, the library xlsx must be called
For data input
#install.packages("xlsx") #use only if the library is not already installed library("xlsx") myDF <- read.xlsx("myXlsxIn.xlsx", sheetName="mySheet") #myDF
Please note that:

The xlsx package needs to be already installed once in a computer
The library xlsx must be called
myXlsxIn.xlsx is the name of the excel .xlsx file to be read. User should change this to his/her own file name
mySheet is the name of the worksheet to be read. This can be the name of the sheet, or a number (without the quotes) counting from 1
When testing, uncomment the last command (remove #) to see if the correct data has been read
For output of an object such as a dataframe to an excel workbook (.xlsx)
lbrary("xlsx") write.xlsx2(myDF, "myXlsxOut.xlsx", sheetName = "mySheet", row.names = FALSE, append=TRUE)
Please Note that:

The package xlsx must be pre-installed and is called
myDF is a dataframe to be saved to file, a table with columns and rows.
myXlsxOut.xlsx is the name of the excel workbook to be saved to. Users should change this to his/her own file name
mySheet is the name of the sheet to save the data to
If row.names= False is not used, then a first column containing row number will be included
If the file pre-exists, it must be closed. Attempt to write to an open file will crash the program
If the append option is FALSE, any existing file containing a wroksheet with the same name will be overwritten
If the append option is TRUE and no worksheet with the same name exists, a new worksheet with the name will be created and the data written to it
If the append option is TRUE and a worksheet with the same name already exists, write to the worksheet will fail and the program crashes

Creating a report of analysis using Knit
RStudio provides an utility which will run the codes in the source panel, and provides all the results in a file, which can be a html, Word, or pdf file. To do so, the package markdown must be pre-installed using the code install.packages("rmarkdown").
Once a set of R codes are tested and found to be satisfactory, at the menu bar on the top of RStudio,
click File->Knit Document
select type of file, msWord is recommended as this is the easiest to edit afterwards
The file will be created including all codes, results, and graphics. The user can then edit and improve on this as he/she thinks appropriate.
Please note: The knit program is quite temperamental, and have the following problems

It can be incompatible with some other packages, especially those that read or write to files or produce graphics. The program merely crashes without too much explanation. An example is that knit will crash when trying to read from or write to excel worksheets.
In some graphics programs, the labeling for the axis may be misplaced or absent in the report file
These are problems encountered so far, but there may be more.
Array & Matrices
This panel provides some useful templates for handling arrays and matrices.
1:Data Frames and Matrices

Template 1.1: Convert a text fields into a matrix. This is done by firstly import the text field (all columns must be numerical values) into a data frame, then convert the data frame into the matrix
txt = (" Col1 Col2 Col3 7 1 2 8 3 4 9 5 6" ) df <- read.table(textConnection(txt),header=TRUE) mx <- data.matrix(df) colnames(mx) <- NULL # optional removing column names mx
The rresults are
> mx [,1] [,2] [,3] [1,] 7 1 2 [2,] 8 3 4 [3,] 9 5 6
Template 1.2: Create matrix from the numerical fields of a mixed factor(text) and value (numerical) data frame
txt = (" Col1 Col2 Col3 A 1 2 B 3 4 C 5 6" ) df <- read.table(textConnection(txt),header=TRUE) #df # data frame mx <- matrix(c(df$Col2,df$Col3), ncol=2) mx
Note: cols 2 and 3 are numerical, and extracted to form a 2 column matrix. The results are
> mx [,1] [,2] [1,] 1 2 [2,] 3 4 [3,] 5 6
Template 1.3: Convert matrix into a data frame
# Input 3. matrix to data frame #mx = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=4) # alternative method to create matrix col1 <- c(1,2,3,4) col2 <- c(5,6,7,8) col3 <- c(9,10,11,12) mx <- cbind(col1, col2, col3) mx #matrix # turn matrix into data frame df = as.data.frame(mx) df # add another column to existing data frame df$col4 <- c(20,21,22,23) # method 1 df ar <- c(30,31,32,33) # method 2 df$col5 <- ar df
Results as follows
> df col1 col2 col3 1 1 5 9 2 2 6 10 3 3 7 11 4 4 8 12 > # add another column to existing data frame > df$col4 <- c(20,21,22,23) # method 1 > df col1 col2 col3 col4 1 1 5 9 20 2 2 6 10 21 3 3 7 11 22 4 4 8 12 23 > ar <- c(30,31,32,33) # method 2 > df$col5 <- ar > df col1 col2 col3 col4 col5 1 1 5 9 20 30 2 2 6 10 21 31 3 3 7 11 22 32 4 4 8 12 23 33 >

2:Arrays

# create an array with contents ar <- c(1,2,3,4) ar

[1] 1 2 3 4

# Create an array with sequence start = 1 finish = 10 increment = 2 ar <- seq(start, finish, by=increment) ar

[1] 1 3 5 7 9

# Create an empty vector ar <- vector() # Create vector of defined length default = 5 siz = 3 ar <- array(default,siz) ar

[1] 5 5 5

# Operations ar1 <- c(1,2,3) ar2 <- c(4,5,6) # add ar1 + ar2

[1] 5 7 9

#subtract ar2 - ar1

[1] 3 3 3

#multiply elements ar1 * ar2

[1] 4 10 18

# multiply vectors col by row # default is column ar1 %*% t(ar2)

[,1] [,2] [,3] [1,] 4 5 6 [2,] 8 10 12 [3,] 12 15 18

#multiply row by col t(ar1) %*% ar2

[,1] [1,] 32

3:Matrices

#matrix with values ar <- c(1,2,3,4,5,6) mx <- matrix(data=ar, nrow=2,ncol=3) #byrow=FALSE is default mx mx <- matrix(data=ar, nrow=2,ncol=3, byrow=TRUE) mx

> mx <- matrix(data=ar, nrow=2,ncol=3) #byrow=FALSE is default > mx [,1] [,2] [,3] [1,] 1 3 5 [2,] 2 4 6 > mx <- matrix(data=ar, nrow=2,ncol=3, byrow=TRUE) > mx [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6

# matrix with default value default = 5 mx <- matrix(data=default, nrow=2,ncol=3) mx

> mx [,1] [,2] [,3] [1,] 5 5 5 [2,] 5 5 5

#matrix joining two arrays a1 <- c(1,2,3) a2 <- c(4,5,6) mx <- matrix(data=c(a1,a2), nrow=2,ncol=3, byrow=TRUE) mx mx <- matrix(data=c(a1,a2), nrow=3,ncol=2) mx

> mx <- matrix(data=c(a1,a2), nrow=2,ncol=3, byrow=TRUE) > mx [,1] [,2] [,3] [1,] 1 2 3 [2,] 4 5 6 > mx <- matrix(data=c(a1,a2), nrow=3,ncol=2) > mx [,1] [,2] [1,] 1 4 [2,] 2 5 [3,] 3 6

# rows mx[1,] mx[2,] mx[3,] #cols mx[,1] mx[,2]

> # rows > mx[1,] [1] 1 4 > mx[2,] [1] 2 5 > mx[3,] [1] 3 6 > #cols > mx[,1] [1] 1 2 3 > mx[,2] [1] 4 5 6

# operations mx1 <- matrix(c(c(3,5,1,2),c(4,1,3,1),c(2,3,1,1)),nrow=4,ncol=3) mx1 mx2 <- matrix(c(c(1,2,1),c(2,3,1)),nrow=3,ncol=2) mx2

> mx1 [,1] [,2] [,3] [1,] 3 4 2 [2,] 5 1 3 [3,] 1 3 1 [4,] 2 1 1 > mx2 [,1] [,2] [1,] 1 2 [2,] 2 3 [3,] 1 1

# transpose mx3 <- t(mx1) mx1 mx3

> mx1 [,1] [,2] [,3] [1,] 3 4 2 [2,] 5 1 3 [3,] 1 3 1 [4,] 2 1 1 > mx3 [,1] [,2] [,3] [,4] [1,] 3 5 1 2 [2,] 4 1 3 1 [3,] 2 3 1 1

# add mx1 + mx1

> mx1 + mx1 [,1] [,2] [,3] [1,] 6 8 4 [2,] 10 2 6 [3,] 2 6 2 [4,] 4 2 2

# subtract mx2 - mx2

> mx2 - mx2 [,1] [,2] [1,] 0 0 [2,] 0 0 [3,] 0 0

# multiply elements mx2 * mx2

> mx2 * mx2 [,1] [,2] [1,] 1 4 [2,] 4 9 [3,] 1 1

#multiply matrices mx1 %*% mx2

> mx1 %*% mx2 [,1] [,2] [1,] 13 20 [2,] 10 16 [3,] 8 12 [4,] 5 8

#division by elements mx1 / mx1

> mx1 / mx1 [,1] [,2] [,3] [1,] 1 1 1 [2,] 1 1 1 [3,] 1 1 1 [4,] 1 1 1

# transpose t(mx2)

> t(mx2) [,1] [,2] [,3] [1,] 1 2 1 [2,] 2 3 1

# square matrix mx <- matrix(c(c(4,2,3,2),c(2,5,3,1),c(3,3,6,2),c(2,1,2,3)),nrow=4,ncol=4) mx

> mx [,1] [,2] [,3] [,4] [1,] 4 2 3 2 [2,] 2 5 3 1 [3,] 3 3 6 2 [4,] 2 1 2 3

# invert solve(mx)

> solve(mx) [,1] [,2] [,3] [,4] [1,] 0.50000000 -0.07142857 -0.14285714 -0.21428571 [2,] -0.07142857 0.29591837 -0.12244898 0.03061224 [3,] -0.14285714 -0.12244898 0.32653061 -0.08163265 [4,] -0.21428571 0.03061224 -0.08163265 0.52040816

# Rank qr(mx)$rank # uses qr decomposition

[1] 4

# Determinant det(mx)

> det(mx) [1] 98

# Eigen values and matrices eigenResults <- eigen(mx) eigenResults

> eigenResults $values [1] 11.501474 3.143052 2.000000 1.355473 $vectors [,1] [,2] [,3] [,4] [1,] -0.4779381 -0.36246433 0.3779645 0.70522169 [2,] -0.4961149 0.76831305 0.3779645 -0.14390264 [3,] -0.6487310 -0.05918796 -0.7559289 -0.06493346 [4,] -0.3234090 -0.52422462 0.3779645 -0.69118597

#Square root of matrix after obtaining eigen sr <- eigenResults$vectors %*% diag(sqrt(eigenResults$values)) %*% solve(eigenResults$vectors) sr

> sr [,1] [,2] [,3] [,4] [1,] 1.7886506 0.3942986 0.6321685 0.4956015 [2,] 0.3942986 2.1073918 0.6176967 0.1479165 [3,] 0.6321685 0.6176967 2.2465112 0.4147301 [4,] 0.4956015 0.1479165 0.4147301 1.6001559

# Validate square root matrix sr %*% sr

> sr %*% sr [,1] [,2] [,3] [,4] [1,] 4 2 3 2 [2,] 2 5 3 1 [3,] 3 3 6 2 [4,] 2 1 2 3

Missing Data
Explanations & References
This panel explains and provides example codes for handling missing data using R.
In research and handling data, missing data is a common occurrence. Subjects are lost, errors are made in collecting and transcribing information, and whole host of reasons creating holes in the data table.
R provides an option for how to handle missing data in nearly all its formulae, but this requires the analyst to be familiar with how missing data may affect a particular procedure and which option each procedure provides for handling missing data.
For the sake of simplicity, all the codes provided in StatsToDo assume that the data is already clean and contain no missing data. This separates the procedures handling missing data from the statistical algorithms.
This panel therefore provides the algorithms for handling missing data at the final stages of data preparation, to produce a complete set of data for analysis
How missing data are represented in R
Within the object dataframe, missing values are represented by NA in the numerical columns and <NA> in text columns. However, in data I/O, the following is used

When data is presented directly in the R Code as a text table, or when read in from a comma delimited file (.csv), missing data is represented by NA. Any other representation is interpreted by R as values in the data
When data is read in from a Excel worksheet using the package xlsx, missing data are blank cells in the Excel worksheet. Anything else will be interpreted as actual values and processed accordingly
When deciding on how data are to be inputed into R, how missing values are prepresented in the different input media should be carefully considered, and testing with a small dataset would avoid error later.
Different options in dealing with missing data
R provides an extensive collection of methods of handling missing data. Only a few of the more commonly used ones are presented in this page. This panel discusses the options conceptually, the complete set of codes and how the codes work are presented in the R Code panel
Option 1. Casewise deletion
This is the easiest, and widely used method. All records containing missing values are deleted.
This method is appropriate if the analyst is sure that data is lost at random, so that removing records containing missing data would not create a bias leading to misinterpretation. The amount of missing data should also be small, say in less than 1% of the cases
Option 2. K Nearest Neighbour
For each missing value, the program searches for k completed records that are nearest (similarity not location) to it, replacing it with the average for a numerical (value) column, and the most frequent value for a text (factor) column. k can be specified in the formula. If not specified, the default k=10 is used.
This is a robust method, and can be used even if some bias process is implicated in data loss, as the missing value is replaced by values from similar records.
There are, however, some issues involved. Firstly, for every missing value, k (10) completed records are required. Secondly, the whole database is searched for the nearest records, and this is time consuming if the database is large and missing data numerous.
The method was devised by those working on big data and artificial intelligence, when thousands or even millions of records are available, and the data can be analysed using powerful computers over prolonged periods.
Clinical data are caught between having database not large enough so that k has to be reduced, and the long time required for processing using desk top computers.
Although the method is excellent in theory, it cannot always be successfully used in the clinical setting. However, it is worth a try. If the program crashes or takes too long to run, k can be progressively reduced until the program works. Be aware however that, as k is reduced, the risk of producing bias replacements increases
Option 3. General Imputation
The program randomly selects a missing data value, and replaces it with an estimate using the available data and multiple regression. This is then included in the available data to estimate the next randomly selected missing value. This process is repeated until all missing values are replaced by estimated (imputated) values.
As latter estimations are influenced by earlier estimated values, the results are slightly different depending on the random sequence. The program copes with this by iterating the process a number of times (m) and averaged the results. The number of iterations (m) can be specified by the user. If not specified, the default is m=5. Controversy exists as to what m should be, and some statisticians argue that m should be the same as the number of missing values in the data.
This method is most suited to the small data sets that are common in clinical studies, especially in survey and clinical trials where the sample size is around 100.
The only proviso is that at least one (1) numerical column must exist in the data set for the algorithm to work.
Users should also be aware that the same program and data will produce slightly different results when repeated, as the random sequence is generated at run time so are different each time
Option 4. Numerical Imputation
The program is a mathematical algorithm using existing values in the same column of the data set to estimate a replacement value for the missing values. The methods available are mean. median, mode, and interpolation (average of the available values on the two sides of the missing value).
The method only works in columns of numerical data, and ignores missing values in columns that are text. It is quick to implement and the results easy to interpret. It can be used if such mathematical replacement is appropriate to the analyst's needs
The interpolation method is especially useful in time series data such as continuous monitoring, as the interpolation result is close to what the missing data should be.
Additional information

Checking the results
It is important to check the results of fixing missing data before the data set is used for analysis. There are numerous methods for doing so, but they are not covered in this page. The example codes provide the basic comparisons using the summary command, which will count the different values in text columns, and minimum, maximum, quartile values, means and standard deviation in numerical columns
References
https://en.wikipedia.org/wiki/Missing_data
https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html
https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
R Codes for Missing Values
This panel provides the R codes for handling missing values
Example data

Direct data entry

myTxt = (" Sex Ethn Gest BWt Girl Greek 37 3048 Boy German 36 2813 Girl French 41 3622 NA Greek 36 2706 Boy German 35 2581 Boy NA NA 3442 Girl Greek 40 3453 Boy German 37 3172 Girl French 35 NA Boy Greek 39 3555 Girl German 37 3029 Boy French 37 3185 Girl NA 36 2670 Boy German NA 3314 Girl French 41 3596 Boy Greek 38 3312 NA NA 39 3200 Boy French 41 3667 Boy Greek 40 3643 Girl German 38 3212 Girl French 38 3135 Girl Greek 39 3366 ") myDF <- read.table(textConnection(myTxt),header=TRUE) summary(myDF) #myDF
The example data is computer generated, and purports to come from a study of birth weight. Sex being sex of the baby, Ethn the ethnicity of the mother, Gest the number of completed weeks in gestation, and BWt the weight of the baby at birth.
The table has columns for variables and rows for subjects (each baby). The first row contains the names for each column. NA represents missing data
myTxt is the name given to this table. User can change this to any other name
This is followed by 3 lines of codes

Imports the data into the dataframe object myDF
Displays the summary, which can be used for comparison with the results of changes to the data set
Optional display of the data as represented within the dataframe. This can be activated by uncommenting the line (remove #)
Summary is as follows
Sex Ethn Gest BWt Boy :10 French:6 Min. :35.00 Min. :2581 Girl:10 German:6 1st Qu.:36.75 1st Qu.:3048 NA's: 2 Greek :7 Median :38.00 Median :3212 NA's :3 Mean :38.00 Mean :3225 3rd Qu.:39.25 3rd Qu.:3453 Max. :41.00 Max. :3667 NA's :2 NA's :1
There are 22 subjects
Sex and Ethn are text columns, and the number of rows with each value is counted
Gest and BWt are numerical columns, and the quartile and mean values are presented
Missing values, represented as NA are also counted for each column
Handling Missing Values

Option 1: casewise deletion

casewiseDeletedDataFrame <- na.omit(myDF) summary(casewiseDeletedDataFrame) #casewiseDeletedDataFrame
The first line creates a new dataframe casewiseDeletedDataFrame containing only those rows with no missing data
The second line provides the summary which can be used to compare with the input data
The third line is an optional display of the result data, so it can be copied and pasted to other applications. This can be activated by removing the #
The summary is as follows
Sex Ethn Gest BWt Boy :8 French:5 Min. :35.00 Min. :2581 Girl:8 German:5 1st Qu.:37.00 1st Qu.:3113 Greek :6 Median :38.00 Median :3262 Mean :38.38 Mean :3274 3rd Qu.:40.00 3rd Qu.:3565 Max. :41.00 Max. :3667
Six (6) rows with one or more missing values are deleted, so the data set now has 16 rows.
Option 2: K Nearest Neighbours

#install.packages("DMwR") library(DMwR) knnDataFrame <- knnImputation(myDF,k=10) summary(knnDataFrame) #knnDataFrame
Line 1 installs the package "DMwR" which is a package for deep learning, from which this algorithm is obtained. This is commented out, as this is not needed repeatedly once the package is installed on the computer
Line 2 calls the installed library. This must be done prior to running the program
Line 3 creates a new dataframe, which has the missing values replaced by the estimated values. The value for k can be specified by the user. If not specified, the default is k=10. If the amount of completed record is insufficient, or if the run time of the program is too long, k can be reduced.
Line 4 displays the summary of the new dataframe, which can be compared with that from the input data
Line 5 is an optional display of the result data, so it can be copied and pasted to other applications. This can be activated by removing the #
The summary is as follows
Sex Ethn Gest BWt Boy :11 French:6 Min. :35.00 Min. :2581 Girl:11 German:7 1st Qu.:37.00 1st Qu.:3046 Greek :9 Median :38.00 Median :3206 Mean :38.07 Mean :3217 3rd Qu.:39.12 3rd Qu.:3450 Max. :41.00 Max. :3667
Twentytwo (22) rows remain in the data set, but the counts, interquartile values, and means have changed as the missing values are replaced by the estimated values
Option 3: General Imputation

#install.packages("mice") library(mice) impute <- mice(myDF, m = 5, print = FALSE) fit<-with(data = impute, lm(BWt ~ Sex+Ethn+Gest)) pool<-pool(fit) miceDataFrame<-complete(impute) summary(miceDataFrame) #miceDataFrame
Line 1 installs the package "mice" which contains the imputation program. This is commented out, as this is not needed repeatedly once the package is installed on the computer
Line 2 calls the installed library. This must be done prior to running the program
Line 3 creates a data matrix containing the estimated values from the iterations. The number of iterations (m) can be specified by the user. If not specified, the default is m=5.
Line 4 estimates the imputed values, using a regression formula. The formula should contain the names of all the columns that the analyst intend to use to estimate missing values.
Lines 5 and 6 pool the results, and created a new database with the missing values replaced by the imputation estimates.
Please note the following

The variables used in the regression formula are the column names in the inputed data. Analysts should replace these with those from his/her own data
At least one (1) column in the data must be a numerical one. Without this, the program fails and an error is flagged.
Line 6 displays the summary of the new dataframe, which can be compared with that from the input data
Line 7 is an optional display of the result data, so it can be copied and pasted to other applications. This can be activated by removing the #
The summary is as follows
Sex Ethn Gest BWt Boy :10 French:8 Min. :35.00 Min. :2581 Girl:12 German:6 1st Qu.:37.00 1st Qu.:3034 Greek :8 Median :38.00 Median :3206 Mean :38.05 Mean :3206 3rd Qu.:39.00 3rd Qu.:3450 Max. :41.00 Max. :3667
Twentytwo (22) rows remain in the data set, but the counts, interquartile values, and means have changed as the missing values are replaced by the estimated values
Option 3: General Imputation
Activate library
#install.packages("ggplot2") #only if not already installed #install.packages("imputeTS") #only if not already installed library(imputeTS)
The package imputeTS is required for numerical imputation. The algorithms in this package call functions in the package ggplot2, so that package also needs to be installed. The 2 installation commands are commented out because they only need to be installed to the computer once.
For each of the numerical imputation method, the library imputeTS needs to be activated.
Once the library is called, the analyst can choose one of the mathematical models. Please note: that only numerical data are imputed. Columns containing text data are ignored.
Replace missing values by column mean
meanDataFrame<-na.mean(myDF, option = "mean") summary(meanDataFrame) #meanDataFrame
The summary is
Sex Ethn Gest BWt Boy :10 French:6 Min. :35 Min. :2581 Girl:10 German:6 1st Qu.:37 1st Qu.:3070 NA's: 2 Greek :7 Median :38 Median :3218 NA's :3 Mean :38 Mean :3225 3rd Qu.:39 3rd Qu.:3450 Max. :41 Max. :3667

Replace missing values by column median
medianDataFrame<-na.mean(myDF, option = "median") summary(medianDataFrame) #medianDataFrame
The summary is
Sex Ethn Gest BWt Boy :10 French:6 Min. :35 Min. :2581 Girl:10 German:6 1st Qu.:37 1st Qu.:3070 NA's: 2 Greek :7 Median :38 Median :3212 NA's :3 Mean :38 Mean :3224 3rd Qu.:39 3rd Qu.:3450 Max. :41 Max. :3667

Replace missing values by column mode
modeDataFrame<-na.mean(myDF, option = "mode") summary(modeDataFrame) #modeDataFrame
The summary is
Sex Ethn Gest BWt Boy :10 French:6 Min. :35.00 Min. :2581 Girl:10 German:6 1st Qu.:37.00 1st Qu.:3034 NA's: 2 Greek :7 Median :37.50 Median :3206 NA's :3 Mean :37.91 Mean :3196 3rd Qu.:39.00 3rd Qu.:3450 Max. :41.00 Max. :3667

Replace missing values by interpolation (average of the values on each side of the missing value)
interpolationDataFrame<-na.interpolation(modeDataFrame) summary(interpolationDataFrame) #interpolationDataFrame
The summary is
Sex Ethn Gest BWt Boy :10 French:6 Min. :35.00 Min. :2581 Girl:10 German:6 1st Qu.:37.00 1st Qu.:3034 NA's: 2 Greek :7 Median :37.50 Median :3206 NA's :3 Mean :37.91 Mean :3196 3rd Qu.:39.00 3rd Qu.:3450 Max. :41.00 Max. :3667