This page explains and provides example codes for handling missing data using R. For those unfamiliar with R but wishes to start using it, explanations on how to set up and start using R are provided in
R Explained Page
In research and handling data, missing data is a common occurrence. Subjects are lost, errors are made in collecting and transcribing information, and whole host of reasons creating holes in the data table.
R provides an option for how to handle missing data in nearly all its formulae, but this requires the analyst to be familiar with how missing data may affect a particular procedure and which option each procedure provides for handling missing data.
For the sake of simplicity, all the codes provided in StatsToDo assume that the data is already clean and contain no missing data. This separates the procedures handling missing data (which is described in this page) from the statistical algorithms.
This page therefore provides the algorithms for handling missing data at the final stages of data preparation, to produce a complete set of data for analysis
How missing data are represented in R
Within the object dataframe, missing values are represented by
NA in the numerical columns and
<NA> in text columns. However, in data I/O, the following is used
- When data is presented directly in the R Code as a text table, or when read in from a comma delimited file (.csv), missing data is represented by NA. Any other representation is interpreted by R as values in the data
- When data is read in from a Excel worksheet using the package xlsx, missing data are blank cells in the Excel worksheet. Anything else will be interpreted as actual values and processed accordingly
When deciding on how data are to be inputed into R, how missing values are prepresented in the different input media should be carefully considered, and testing with a small dataset would avoid error later.
Different options in dealing with missing data
R provides an extensive collection of methods of handling missing data. Only a few of the more commonly used ones are presented in this page. This panel discusses the options conceptually, the complete set of codes and how the codes work are presented in the other two panels
Option 1. Casewise deletion
This is the easiest, and widely used method. All records containing missing values are deleted.
This method is appropriate if the analyst can reassure himself that data is lost at random, so that removing records containing missing data would not create a bias leading to misinterpretation. The amount of missing data should also be small, say in less than 1% of the cases
Option 2. K Nearest Neighbour
For each missing value, the program searches for k completed records that are nearest (similarity not location) to it, replacing it with the average for a numerical column, and the most frequent value for a text column. k can be specified in the formula. If not specified, the default k=10 is used.
This is a robust method, and can be used even if some bias process is implicated in data loss, as the missing value is replaced by values from similar records.
There are, however, some issues involved. Firstly, for every missing value, k (10) completed records are required. Secondly, the whole database is searched for the nearest records, and this is time consuming if the database is large and missing data numerous.
The method was devised by those working on big data and artificial intelligence, when thousands or even millions of records are available, and the data can be analysed using powerful computers over prolonged periods.
Clinical data are caught between having database not large enough so that k has to be reduced, and the long time required for processing using desk top computers.
Although the method is excellent in theory, it cannot always be successfully used in the clinical setting. However, it is worth a try. If the program crashes or takes too long to run, k can be progressively reduced until the program works. Be aware however that, as k is reduced, the risk of producing bias replacements increases
Option 3. General Imputation
The program randomly selects a missing data value, and replaces it with an estimate using the available data and multiple regression. This is then included in the available data to estimate the next randomly selected missing value. This process is repeated until all missing values are replaced by estimated (imputated) values.
As latter estimations are influenced by earlier estimated values, the results are slightly different depending on the random sequence. The program copes with this by iterating the process a number of times (m) and averaged the results. The number of iterations (m) can be specified by the user. If not specified, the default is m=5. Controversy exists as to what m should be, and some statisticians argue that m should be the same as the number of missing values in the data.
This method is most suited to the small data sets that are common in clinical studies, especially in survey and clinical trials where the sample size is around 100.
The only proviso is that at least one (1) numerical column must exist in the data set for the algorithm to work.
Users should also be aware that the same program and data will produce slightly different results when repeated, as the random sequence is generated at run time so are different each time
Option 4. Numerical Imputation
The program is a mathematical algorithm using existing values in the same column of the data set to estimate a replacement value for the missing values. The methods available are
mean.
median,
mode, and
interpolation (average of the available values on the two sides of the missing value).
The method only works in columns of numerical data, and ignores missing values in columns that are text. It is quick to implement and the results easy to interpret. It can be used if such mathematical replacement is appropriate to the analyst's needs
The interpolation method is especially useful in time series data such as continuous monitoring, as the interpolation result is close to what the missing data should be.
Additional information
Checking the results
It is important to check the results of fixing missing data before the data set is used for analysis. There are numerous methods for doing so, but they are not covered in this page. The example codes provide the basic comparisons using the summary command, which will count the different values in text columns, and minimum, maximum, quartile values, means and standard deviation in numerical columns
References
https://en.wikipedia.org/wiki/Missing_data
https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html
https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
This panel explains the codes presented in the previous panel in segments
Data handling
Direct data entry
myTextTable = ("
Sex Ethn Gest BWt
Girl Greek 37 3048
Boy German 36 2813
Girl French 41 3622
NA Greek 36 2706
Boy German 35 2581
Boy NA NA 3442
Girl Greek 40 3453
Boy German 37 3172
Girl French 35 NA
Boy Greek 39 3555
.... etc
")
x_dataFrame <- read.table(textConnection(myTextTable),header=TRUE)
summary(x_dataFrame)
#x_dataFrame
The example data is computer generated, and purports to come from a study of birth weight.
Sex being sex of the baby,
Ethn the ethnicity of the mother,
Gest the number of completed weeks in gestation, and
BWt the weight of the baby at birth.
The table has columns for variables and rows for subjects (each baby). The first row contains the names for each column. NA represents missing data
myTextTable is the name given to this table. User can change this to any other name
This is followed by 3 lines of codes
- Imports the data into the dataframe object x_dataFrame
- Displays the summary, which can be used for comparison with the results of changes to the data set
- Optional display of the data as represented within the dataframe. This can be activated by uncommenting the line (remove #)
Summary is as follows
Sex Ethn Gest BWt
Boy :10 French:6 Min. :35.00 Min. :2581
Girl:10 German:6 1st Qu.:36.75 1st Qu.:3048
NA's: 2 Greek :7 Median :38.00 Median :3212
NA's :3 Mean :38.00 Mean :3225
3rd Qu.:39.25 3rd Qu.:3453
Max. :41.00 Max. :3667
NA's :2 NA's :1
There are 22 subjects
Sex and Ethn are text columns, and the number of rows with each value is counted
Gest and BWt are numerical columns, and the quartile and mean values are presented
Missing values, represented as NA are also counted for each column
Missing Values
Option 1: casewise deletion
x_casewiseDeletedDataFrame <- na.omit(x_dataFrame)
summary(x_casewiseDeletedDataFrame)
#x_casewiseDeletedDataFrame
The first line creates a new dataframe
x_casewiseDeletedDataFrame containing only those rows with no missing data
The second line provides the summary which can be used to compare with the input data
The third line is an optional display of the result data, so it can be copied and pasted to other applications. This can be activated by removing the #
The summary is as follows
Sex Ethn Gest BWt
Boy :8 French:5 Min. :35.00 Min. :2581
Girl:8 German:5 1st Qu.:37.00 1st Qu.:3113
Greek :6 Median :38.00 Median :3262
Mean :38.38 Mean :3274
3rd Qu.:40.00 3rd Qu.:3565
Max. :41.00 Max. :3667
Six (6) rows with one or more missing values are deleted, so the data set now has 16 rows.
Option 2: K Nearest Neighbours
#install.packages("DMwR")
library(DMwR)
x_knnDataFrame <- knnImputation(x_dataFrame,k=10)
summary(x_knnDataFrame)
#x_knnDataFrame
Line 1 installs the package "DMwR" which is a package for deep learning, from which this algorithm is obtained. This is commented out, as this is not needed repeatedly once the package is installed on the computer
Line 2 calls the installed library. This must be done prior to running the program
Line 3 creates a new dataframe, which has the missing values replaced by the estimated values. The value for k can be specified by the user. If not specified, the default is k=10. If the amount of completed record is insufficient, or if the run time of the program is too long, k can be reduced.
Line 4 displays the summary of the new dataframe, which can be compared with that from the input data
Line 5 is an optional display of the result data, so it can be copied and pasted to other applications. This can be activated by removing the #
The summary is as follows
Sex Ethn Gest BWt
Boy :11 French:6 Min. :35.00 Min. :2581
Girl:11 German:7 1st Qu.:37.00 1st Qu.:3046
Greek :9 Median :38.00 Median :3206
Mean :38.07 Mean :3217
3rd Qu.:39.12 3rd Qu.:3450
Max. :41.00 Max. :3667
Twentytwo (22) rows remain in the data set, but the counts, interquartile values, and means have changed as the missing values are replaced by the estimated values
Option 3: General Imputation
#install.packages("mice")
library(mice)
x_impute<-mice(x_dataFrame, m = 5, print = FALSE)
x_fit<-with(data = x_impute, lm(BWt ~ Sex+Ethn+Gest))
x_pool<-pool(x_fit)
x_miceDataFrame<-complete(x_impute)
summary(x_miceDataFrame)
#x_miceDataFrame
Line 1 installs the package "mice" which contains the imputation program. This is commented out, as this is not needed repeatedly once the package is installed on the computer
Line 2 calls the installed library. This must be done prior to running the program
Line 3 creates a data matrix containing the estimated values from the iterations. The number of iterations (m) can be specified by the user. If not specified, the default is m=5.
Line 4 estimates the imputed values, using a regression formula. The formula should contain the names of all the columns that the analyst intend to use to estimate missing values.
Lines 5 and 6 pool the results, and created a new database with the missing values replaced by the imputation estimates.
Please note the following
- The variables used in the regression formula are the column names in the inputed data. Analysts should replace these with those from his/her own data
- At least one (1) column in the data must be a numerical one. Without this, the program fails and an error is flagged.
Line 6 displays the summary of the new dataframe, which can be compared with that from the input data
Line 7 is an optional display of the result data, so it can be copied and pasted to other applications. This can be activated by removing the #
The summary is as follows
Sex Ethn Gest BWt
Boy :10 French:8 Min. :35.00 Min. :2581
Girl:12 German:6 1st Qu.:37.00 1st Qu.:3034
Greek :8 Median :38.00 Median :3206
Mean :38.05 Mean :3206
3rd Qu.:39.00 3rd Qu.:3450
Max. :41.00 Max. :3667
Twentytwo (22) rows remain in the data set, but the counts, interquartile values, and means have changed as the missing values are replaced by the estimated values
Option 3: General Imputation
Activate library
#install.packages("ggplot2") #only if not already installed
#install.packages("imputeTS") #only if not already installed
library(imputeTS)
The package
imputeTS is required for numerical imputation. The algorithms in this package call functions in the package
ggplot2, so that package also needs to be installed. The 2 installation commands are commented out because they only need to be installed to the computer once.
For each of the numerical imputation method, the library imputeTS needs to be activated.
Once the library is called, the analyst can choose one of the mathematical models. Please note: that only numerical data are imputed. Columns containing text data are ignored.
Replace missing values by column mean
x_meanDataFrame<-na.mean(x_dataFrame, option = "mean")
summary(x_meanDataFrame)
#x_meanDataFrame
The summary is
Sex Ethn Gest BWt
Boy :10 French:6 Min. :35 Min. :2581
Girl:10 German:6 1st Qu.:37 1st Qu.:3070
NA's: 2 Greek :7 Median :38 Median :3218
NA's :3 Mean :38 Mean :3225
3rd Qu.:39 3rd Qu.:3450
Max. :41 Max. :3667
Replace missing values by column median
x_medianDataFrame<-na.mean(x_dataFrame, option = "median")
summary(x_medianDataFrame)
#x_medianDataFrame
The summary is
Sex Ethn Gest BWt
Boy :10 French:6 Min. :35 Min. :2581
Girl:10 German:6 1st Qu.:37 1st Qu.:3070
NA's: 2 Greek :7 Median :38 Median :3212
NA's :3 Mean :38 Mean :3224
3rd Qu.:39 3rd Qu.:3450
Max. :41 Max. :3667
Replace missing values by column mode
x_modeDataFrame<-na.mean(x_dataFrame, option = "mode")
summary(x_modeDataFrame)
#x_modeDataFrame
The summary is
Sex Ethn Gest BWt
Boy :10 French:6 Min. :35.00 Min. :2581
Girl:10 German:6 1st Qu.:37.00 1st Qu.:3034
NA's: 2 Greek :7 Median :37.50 Median :3206
NA's :3 Mean :37.91 Mean :3196
3rd Qu.:39.00 3rd Qu.:3450
Max. :41.00 Max. :3667
Replace missing values by interpolation (average of the values on each side of the missing value)
x_interpolationDataFrame<-na.interpolation(x_modeDataFrame)
summary(x_interpolationDataFrame)
#x_interpolationDataFrame
The summary is
Sex Ethn Gest BWt
Boy :10 French:6 Min. :35.00 Min. :2581
Girl:10 German:6 1st Qu.:37.00 1st Qu.:3034
NA's: 2 Greek :7 Median :37.50 Median :3206
NA's :3 Mean :37.91 Mean :3196
3rd Qu.:39.00 3rd Qu.:3450
Max. :41.00 Max. :3667