Content Disclaimer Copyright ©2020. All Rights Reserved.
Introduction
StatsToDo provides R codes in some of its pages, and this page provides an overview and orientation for using them.
R Studio
The inclusion of R codes is a work in progress. Initially I aimed to create resources for the more complex multivariate statistical procedures that I am unable to program myself. Eventually I aim to supplement most programs in StatsToDo with R codes and examples, using resources provided by CRAN if they are available, or rewriting my programs in R if no equivalent can be found on CRAN. The initiative began in April 2020, so not all algorithms will be accompanied by R codes for some time.

The R codes in StatsToDo are basic, simple, and the minimum necessary to complete an initial analysis. The aim is a shallow learning curve, so that those without great statistical experience can explore their data and obtain the minimum results that most clinical journals require for publication. In most cases the codes merely repeat the programs written in php or Javascript for the web page, and serve as no more than a validation of the algorithm.

Experienced statisticians will find the details provided incomplete, or even insufficient, as safeguards and complete descriptions of the results are often not provided, although references are given for users to obtain further information and resources. Inexperienced users are strongly urged to seek advice from experienced statisticians before drawing conclusions from the results obtained.

All data in the examples are computer generated, to demonstrate the analysis and interpretation. Although attempts were made to produce results that are plausible and easy to understand, users should realise that the data are artificial and do not represent reality. Given my background in obstetrics, the examples mostly concern issues in childbirth and hospitals. The interests are mostly clinical, directed towards classification, surveys, quality control, clinical discovery and trials.

Format

Three colors are used to clarify the intent of the text
All R programs in StatsToDo assume no missing data. How to handle missing data is described in the Missing Data panel of this page.

Setting up R and RStudio

The first thing to do is to set up R on your computer. Using your web browser, download the latest version of R from https://cran.r-project.org/, and then download RStudio, which is distributed separately from its own website. After this, run all R programs using RStudio.

RStudio

RStudio is a powerful programming platform, and there are many ways it can be used. Included here is only the basic information a beginner needs for the material covered in these pages.

Anatomy of RStudio: When activated, RStudio consists of 4 panels.
Packages

Many statistical programs are automatically installed with R. Additional programs, written by the wider R user community, are available as packages. A package needs to be installed on your computer only once, using the install.packages("PackageName") command. Once installed, the package can be activated during analysis with the library("PackageName") command.

Required packages are included in the example codes. The install.packages("PackageName") command is usually commented out, as once installed the package is always available. The explanation and resources available for each package, once installed, can be accessed by typing ??"PackageName" into the console.

How to use the example codes

For any procedure, the following steps can be used
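The install-then-load pattern described above can be sketched as follows. The package MASS is used purely as an illustration; it ships with R, so the install line is normally unnecessary.

```r
# One-off installation, commented out once done (as described above):
# install.packages("MASS")

library(MASS)                 # activate the package for this session
# ??"MASS"                    # browse the package's help resources in RStudio

"package:MASS" %in% search()  # confirm the package is attached: TRUE
```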
Help and further information

Help for any code or function provided by R can easily be obtained within RStudio, by typing the following in the bottom left panel of RStudio (the console)
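For example, any of the following typed at the console prompt will display the documentation for read.table; the search phrase in the last line is only an illustration.

```r
?read.table                  # quick help on a function
help(read.table)             # the same, in function form
help.search("read table")    # fuzzy search across all installed packages
```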
Dataframe

R is able to handle data in numerous formats; the most commonly used one is the data frame. A data frame is an object (a computer construct that contains a large amount of information). In most cases, and particularly in these pages, the dataframe contains a table with columns for variables and rows for cases.

The first step in any analysis is therefore converting the table of data into a dataframe, which R can then analyse. The standard data table has columns for variables and rows for subjects, with the first row containing the name of each column. The table can then be imported into R using one of the following methods.

Direct data I/O

Nearly all examples of R codes on this site use direct data I/O. This is the simplest form of data entry for small sets of data. The example codes are as follows

myTxt = ("
Col1 Col2 Col3
A 1 2
B 3 4
C 5 6
")                                                      # Example data in text
myDF <- read.table(textConnection(myTxt),header=TRUE)   # Make data frame
#myDF                                                   # Optional: show dataframe in console

Please note that:
Files

By default, R reads and writes all files using the standard Documents folder. It is useful to change the default folder to the one containing the current file loaded into the source panel. To do so, at the menu bar at the top click Session -> Set Working Directory -> To Source File Location. To check whether the directory is correct, use the following codes

getwd()
list.files()

This will display the folder being used and the files it contains.

Comma delimited (.csv) files

The codes for input from a comma delimited (.csv) file are

myDF <- read.csv("myCsvIn.csv")
#myDF

Please note that:
write.csv(myDF, "myCsvOut.csv", row.names = FALSE)

Please note that:
Excel worksheet (.xlsx) files

To access .xlsx files, the package xlsx must already be installed. If not, the command install.packages("xlsx") will do so. Once installed, each time a .xlsx file is accessed the library xlsx must be called.

For data input

#install.packages("xlsx")   # use only if the package is not already installed
library("xlsx")
myDF <- read.xlsx("myXlsxIn.xlsx", sheetName="mySheet")
#myDF

Please note that:
library("xlsx")
write.xlsx2(myDF, "myXlsxOut.xlsx", sheetName = "mySheet", row.names = FALSE, append=TRUE)

Please note that:
Creating a report of analysis using Knit

RStudio provides a utility which will run the codes in the source panel and deliver all the results in a file, which can be an html, Word, or pdf file. To do so, the package rmarkdown must be pre-installed using the code install.packages("rmarkdown"). Once a set of R codes is tested and found to be satisfactory, at the menu bar on the top of RStudio,
select the type of file; MS Word is recommended as this is the easiest to edit afterwards.

Please note: The knit program is quite temperamental, and has the following problems
This panel provides some useful templates for handling arrays and matrices.
Missing Data
1: Data Frames and Matrices

Template 1.1: Convert a text field into a matrix. This is done by first importing the text field (all columns must be numerical values) into a data frame, then converting the data frame into a matrix

txt = ("
Col1 Col2 Col3
7 1 2
8 3 4
9 5 6
")
df <- read.table(textConnection(txt),header=TRUE)
mx <- data.matrix(df)
colnames(mx) <- NULL   # optional: remove column names
mx

The results are

> mx
     [,1] [,2] [,3]
[1,]    7    1    2
[2,]    8    3    4
[3,]    9    5    6

Template 1.2: Create a matrix from the numerical fields of a mixed factor (text) and value (numerical) data frame

txt = ("
Col1 Col2 Col3
A 1 2
B 3 4
C 5 6
")
df <- read.table(textConnection(txt),header=TRUE)
#df   # data frame
mx <- matrix(c(df$Col2,df$Col3), ncol=2)
mx

Note: cols 2 and 3 are numerical, and are extracted to form a 2 column matrix. The results are

> mx
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

Template 1.3: Convert a matrix into a data frame

# Input 3. matrix to data frame
#mx = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12),nrow=4)   # alternative method to create matrix
col1 <- c(1,2,3,4)
col2 <- c(5,6,7,8)
col3 <- c(9,10,11,12)
mx <- cbind(col1, col2, col3)
mx   # matrix
# turn matrix into data frame
df = as.data.frame(mx)
df
# add another column to existing data frame
df$col4 <- c(20,21,22,23)   # method 1
df
ar <- c(30,31,32,33)        # method 2
df$col5 <- ar
df

Results as follows

> df
  col1 col2 col3
1    1    5    9
2    2    6   10
3    3    7   11
4    4    8   12
> # add another column to existing data frame
> df$col4 <- c(20,21,22,23) # method 1
> df
  col1 col2 col3 col4
1    1    5    9   20
2    2    6   10   21
3    3    7   11   22
4    4    8   12   23
> ar <- c(30,31,32,33) # method 2
> df$col5 <- ar
> df
  col1 col2 col3 col4 col5
1    1    5    9   20   30
2    2    6   10   21   31
3    3    7   11   22   32
4    4    8   12   23   33

2: Arrays

# create an array with contents
ar <- c(1,2,3,4)
ar
[1] 1 2 3 4
# create an array with a sequence
start = 1
finish = 10
increment = 2
ar <- seq(start, finish, by=increment)
ar
[1] 1 3 5 7 9
# create an empty vector
ar <- vector()
# create a vector of defined length
default = 5
siz = 3
ar <- array(default,siz)
ar
[1] 5 5 5
# operations
ar1 <- c(1,2,3)
ar2 <- c(4,5,6)
# add
ar1 + ar2
[1] 5 7 9
# subtract
ar2 - ar1
[1] 3 3 3
# multiply elements
ar1 * ar2
[1]  4 10 18
# multiply vectors col by row (default is column)
ar1 %*% t(ar2)
     [,1] [,2] [,3]
[1,]    4    5    6
[2,]    8   10   12
[3,]   12   15   18
# multiply row by col
t(ar1) %*% ar2
     [,1]
[1,]   32

3: Matrices

# matrix with values
ar <- c(1,2,3,4,5,6)
mx <- matrix(data=ar, nrow=2,ncol=3)   # byrow=FALSE is default
mx
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
mx <- matrix(data=ar, nrow=2,ncol=3, byrow=TRUE)
mx
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
# matrix with a default value
default = 5
mx <- matrix(data=default, nrow=2,ncol=3)
mx
     [,1] [,2] [,3]
[1,]    5    5    5
[2,]    5    5    5
# matrix joining two arrays
a1 <- c(1,2,3)
a2 <- c(4,5,6)
mx <- matrix(data=c(a1,a2), nrow=2,ncol=3, byrow=TRUE)
mx
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
mx <- matrix(data=c(a1,a2), nrow=3,ncol=2)
mx
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
# rows
mx[1,]
[1] 1 4
mx[2,]
[1] 2 5
mx[3,]
[1] 3 6
# cols
mx[,1]
[1] 1 2 3
mx[,2]
[1] 4 5 6
# operations
mx1 <- matrix(c(c(3,5,1,2),c(4,1,3,1),c(2,3,1,1)),nrow=4,ncol=3)
mx1
     [,1] [,2] [,3]
[1,]    3    4    2
[2,]    5    1    3
[3,]    1    3    1
[4,]    2    1    1
mx2 <- matrix(c(c(1,2,1),c(2,3,1)),nrow=3,ncol=2)
mx2
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    1    1
# transpose
mx3 <- t(mx1)
mx3
     [,1] [,2] [,3] [,4]
[1,]    3    5    1    2
[2,]    4    1    3    1
[3,]    2    3    1    1
# add
mx1 + mx1
     [,1] [,2] [,3]
[1,]    6    8    4
[2,]   10    2    6
[3,]    2    6    2
[4,]    4    2    2
# subtract
mx2 - mx2
     [,1] [,2]
[1,]    0    0
[2,]    0    0
[3,]    0    0
# multiply elements
mx2 * mx2
     [,1] [,2]
[1,]    1    4
[2,]    4    9
[3,]    1    1
# multiply matrices
mx1 %*% mx2
     [,1] [,2]
[1,]   13   20
[2,]   10   16
[3,]    8   12
[4,]    5    8
# division by elements
mx1 / mx1
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    1
[4,]    1    1    1
# transpose
t(mx2)
     [,1] [,2] [,3]
[1,]    1    2    1
[2,]    2    3    1
# square matrix
mx <- matrix(c(c(4,2,3,2),c(2,5,3,1),c(3,3,6,2),c(2,1,2,3)),nrow=4,ncol=4)
mx
     [,1] [,2] [,3] [,4]
[1,]    4    2    3    2
[2,]    2    5    3    1
[3,]    3    3    6    2
[4,]    2    1    2    3
# invert
solve(mx)
            [,1]        [,2]        [,3]        [,4]
[1,]  0.50000000 -0.07142857 -0.14285714 -0.21428571
[2,] -0.07142857  0.29591837 -0.12244898  0.03061224
[3,] -0.14285714 -0.12244898  0.32653061 -0.08163265
[4,] -0.21428571  0.03061224 -0.08163265  0.52040816
# rank (uses QR decomposition)
qr(mx)$rank
[1] 4
# determinant
det(mx)
[1] 98
# eigenvalues and eigenvectors
eigenResults <- eigen(mx)
eigenResults
$values
[1] 11.501474  3.143052  2.000000  1.355473
$vectors
           [,1]        [,2]       [,3]        [,4]
[1,] -0.4779381 -0.36246433  0.3779645  0.70522169
[2,] -0.4961149  0.76831305  0.3779645 -0.14390264
[3,] -0.6487310 -0.05918796 -0.7559289 -0.06493346
[4,] -0.3234090 -0.52422462  0.3779645 -0.69118597
# square root of matrix, after obtaining the eigen decomposition
sr <- eigenResults$vectors %*% diag(sqrt(eigenResults$values)) %*% solve(eigenResults$vectors)
sr
          [,1]      [,2]      [,3]      [,4]
[1,] 1.7886506 0.3942986 0.6321685 0.4956015
[2,] 0.3942986 2.1073918 0.6176967 0.1479165
[3,] 0.6321685 0.6176967 2.2465112 0.4147301
[4,] 0.4956015 0.1479165 0.4147301 1.6001559
# validate square root matrix
sr %*% sr
     [,1] [,2] [,3] [,4]
[1,]    4    2    3    2
[2,]    2    5    3    1
[3,]    3    3    6    2
[4,]    2    1    2    3
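As a side note to the inversion shown above, solve() can also solve a system of linear equations directly when given a right-hand side. This small example is an addition, not from the original page; the matrix and vector are invented for illustration.

```r
A <- matrix(c(2,1, 1,3), nrow=2, byrow=TRUE)  # coefficient matrix
b <- c(5, 10)                                 # right-hand side
x <- solve(A, b)                              # solves A %*% x == b
x                                             # [1] 1 3
A %*% x                                       # check: reproduces b (5, 10)
```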
Explanations & References
This panel explains and provides example codes for handling missing data using R.
R Codes
In research, missing data is a common occurrence: subjects are lost, errors are made in collecting and transcribing information, and a whole host of other reasons create holes in the data table. R provides options for handling missing data in nearly all its procedures, but this requires the analyst to be familiar with how missing data may affect a particular procedure and which options each procedure provides for handling it. For the sake of simplicity, all the codes provided in StatsToDo assume that the data is already clean and contains no missing data. This separates the procedures for handling missing data from the statistical algorithms. This panel therefore provides algorithms for handling missing data at the final stage of data preparation, to produce a complete set of data for analysis.

How missing data are represented in R

Within the dataframe object, missing values are represented by NA in numerical columns and <NA> in text columns. However, in data I/O, the following is used
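A small sketch of how NA appears in practice; the tiny data frame here is invented purely for illustration.

```r
df <- data.frame(Name=c("Ann", NA, "Cal"), Score=c(5, 7, NA))
df                  # missing text prints as <NA>, missing numbers as NA
is.na(df)           # TRUE marks each missing cell
colSums(is.na(df))  # count of missing values per column: Name 1, Score 1
```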
Different options in dealing with missing data

R provides an extensive collection of methods for handling missing data. Only a few of the more commonly used ones are presented in this page. This panel discusses the options conceptually; the complete set of codes, and how the codes work, are presented in the R Codes panel.

Option 1: Casewise deletion

This is the easiest and most widely used method. All records containing missing values are deleted. This method is appropriate if the analyst is sure that data are lost at random, so that removing records containing missing data does not create a bias leading to misinterpretation. The amount of missing data should also be small, say less than 1% of the cases.

Option 2: K Nearest Neighbour

For each missing value, the program searches for the k completed records that are nearest to it (in similarity, not location), replacing the missing value with the average for a numerical (value) column, and the most frequent value for a text (factor) column. k can be specified in the formula; if not specified, the default k=10 is used.

This is a robust method, and can be used even if some biased process is implicated in data loss, as the missing value is replaced by values from similar records. There are, however, some issues involved. Firstly, for every missing value, k (10) completed records are required. Secondly, the whole database is searched for the nearest records, and this is time consuming if the database is large and missing data numerous. The method was devised by those working on big data and artificial intelligence, where thousands or even millions of records are available and the data can be analysed on powerful computers over prolonged periods. Clinical data are caught between databases not being large enough, so that k has to be reduced, and the long time required for processing on desktop computers. Although the method is excellent in theory, it cannot always be successfully used in the clinical setting. However, it is worth a try.
If the program crashes or takes too long to run, k can be progressively reduced until the program works. Be aware, however, that as k is reduced, the risk of producing biased replacements increases.

Option 3: General Imputation

The program randomly selects a missing data value and replaces it with an estimate obtained from the available data by multiple regression. This estimate is then included in the available data to estimate the next randomly selected missing value. The process is repeated until all missing values are replaced by estimated (imputed) values. As later estimations are influenced by earlier estimated values, the results differ slightly depending on the random sequence. The program copes with this by iterating the process a number of times (m) and averaging the results. The number of iterations (m) can be specified by the user; if not specified, the default is m=5. Controversy exists as to what m should be, and some statisticians argue that m should be the same as the number of missing values in the data.

This method is most suited to the small data sets that are common in clinical studies, especially surveys and clinical trials where the sample size is around 100. The only proviso is that at least one (1) numerical column must exist in the data set for the algorithm to work. Users should also be aware that the same program and data will produce slightly different results when repeated, as the random sequence is generated at run time and so differs each time.

Option 4: Numerical Imputation

The program is a mathematical algorithm that uses the existing values in the same column of the data set to estimate replacements for the missing values. The methods available are mean, median, mode, and interpolation (the average of the available values on the two sides of the missing value). The method only works on columns of numerical data, and ignores missing values in text columns. It is quick to implement and the results are easy to interpret.
It can be used if such mathematical replacement is appropriate to the analyst's needs. The interpolation method is especially useful in time series data, such as continuous monitoring, as the interpolated result is close to what the missing value should be.

Additional information

Checking the results

It is important to check the results of fixing missing data before the data set is used for analysis. There are numerous methods for doing so, but they are not covered in this page. The example codes provide basic comparisons using the summary command, which counts the different values in text columns, and shows the minimum, maximum, quartile and mean values in numerical columns.

References

https://en.wikipedia.org/wiki/Missing_data
https://www.kdnuggets.com/2017/09/missing-data-imputation-using-r.html
https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
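The before-and-after check described above can be sketched as follows, using base R only; the tiny data frame and the choice of mean imputation are invented for illustration.

```r
df <- data.frame(Gest=c(37, NA, 39, 38), BWt=c(3048, 2813, NA, 3212))

# replace each numerical NA with its column mean (one simple repair method)
fixed <- df
fixed[is.na(fixed$Gest), "Gest"] <- mean(df$Gest, na.rm=TRUE)
fixed[is.na(fixed$BWt),  "BWt"]  <- mean(df$BWt,  na.rm=TRUE)

summary(df)     # before: NA counts appear in each column
summary(fixed)  # after: no NAs; compare quartiles and means with the original
```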
This panel explains the codes presented in the previous panel in segments
Data handling

Direct data entry

myTxt = ("
Sex Ethn Gest BWt
Girl Greek 37 3048
Boy German 36 2813
Girl French 41 3622
NA Greek 36 2706
Boy German 35 2581
Boy NA NA 3442
Girl Greek 40 3453
Boy German 37 3172
Girl French 35 NA
Boy Greek 39 3555
Girl German 37 3029
Boy French 37 3185
Girl NA 36 2670
Boy German NA 3314
Girl French 41 3596
Boy Greek 38 3312
NA NA 39 3200
Boy French 41 3667
Boy Greek 40 3643
Girl German 38 3212
Girl French 38 3135
Girl Greek 39 3366
")
myDF <- read.table(textConnection(myTxt),header=TRUE)
summary(myDF)
#myDF

The example data is computer generated, and purports to come from a study of birth weight: Sex is the sex of the baby, Ethn the ethnicity of the mother, Gest the number of completed weeks of gestation, and BWt the weight of the baby at birth. The table has columns for variables and rows for subjects (each baby). The first row contains the names of the columns, and NA represents missing data. myTxt is the name given to this table; users can change this to any other name. This is followed by 3 lines of codes
   Sex        Ethn        Gest            BWt
 Boy :10   French:6   Min.   :35.00   Min.   :2581
 Girl:10   German:6   1st Qu.:36.75   1st Qu.:3048
 NA's: 2   Greek :7   Median :38.00   Median :3212
           NA's  :3   Mean   :38.00   Mean   :3225
                      3rd Qu.:39.25   3rd Qu.:3453
                      Max.   :41.00   Max.   :3667
                      NA's   :2       NA's   :1

There are 22 subjects. Sex and Ethn are text columns, and the number of rows with each value is counted. Gest and BWt are numerical columns, and the quartile and mean values are presented. Missing values, represented as NA, are also counted for each column.

Missing Values

Option 1: casewise deletion

casewiseDeletedDataFrame <- na.omit(myDF)
summary(casewiseDeletedDataFrame)
#casewiseDeletedDataFrame

The first line creates a new dataframe, casewiseDeletedDataFrame, containing only those rows with no missing data. The second line provides the summary, which can be compared with that of the input data. The third line is an optional display of the resulting data, so it can be copied and pasted to other applications; it can be activated by removing the #. The summary is as follows

  Sex       Ethn        Gest            BWt
 Boy :8   French:5   Min.   :35.00   Min.   :2581
 Girl:8   German:5   1st Qu.:37.00   1st Qu.:3113
          Greek :6   Median :38.00   Median :3262
                     Mean   :38.38   Mean   :3274
                     3rd Qu.:40.00   3rd Qu.:3565
                     Max.   :41.00   Max.   :3667

Six (6) rows with one or more missing values are deleted, so the data set now has 16 rows.

Option 2: K Nearest Neighbours

#install.packages("DMwR")
library(DMwR)
knnDataFrame <- knnImputation(myDF,k=10)
summary(knnDataFrame)
#knnDataFrame

Line 1 installs the package DMwR (Data Mining with R), from which this algorithm is obtained. It is commented out, as it is not needed again once the package is installed on the computer. Line 2 calls the installed library; this must be done before running the program. Line 3 creates a new dataframe in which the missing values are replaced by estimated values. The value for k can be specified by the user; if not specified, the default is k=10.
If the number of completed records is insufficient, or if the run time of the program is too long, k can be reduced. Line 4 displays the summary of the new dataframe, which can be compared with that from the input data. Line 5 is an optional display of the resulting data, so it can be copied and pasted to other applications; it can be activated by removing the #. The summary is as follows

   Sex        Ethn        Gest            BWt
 Boy :11   French:6   Min.   :35.00   Min.   :2581
 Girl:11   German:7   1st Qu.:37.00   1st Qu.:3046
           Greek :9   Median :38.00   Median :3206
                      Mean   :38.07   Mean   :3217
                      3rd Qu.:39.12   3rd Qu.:3450
                      Max.   :41.00   Max.   :3667

Twenty-two (22) rows remain in the data set, but the counts, quartile values, and means have changed, as the missing values have been replaced by estimated values.

Option 3: General Imputation

#install.packages("mice")
library(mice)
impute <- mice(myDF, m = 5, print = FALSE)
fit <- with(data = impute, lm(BWt ~ Sex+Ethn+Gest))
pool <- pool(fit)
miceDataFrame <- complete(impute)
summary(miceDataFrame)
#miceDataFrame

Line 1 installs the package mice, which contains the imputation program. It is commented out, as it is not needed again once the package is installed on the computer. Line 2 calls the installed library; this must be done before running the program. Line 3 creates a data matrix containing the estimated values from the iterations. The number of iterations (m) can be specified by the user; if not specified, the default is m=5. Line 4 estimates the imputed values using a regression formula. The formula should contain the names of all the columns that the analyst intends to use to estimate missing values. Lines 5 and 6 pool the results and create a new dataframe with the missing values replaced by the imputation estimates. Please note the following
Line 7 is an optional display of the resulting data, so it can be copied and pasted to other applications; it can be activated by removing the #. The summary is as follows

   Sex        Ethn        Gest            BWt
 Boy :10   French:8   Min.   :35.00   Min.   :2581
 Girl:12   German:6   1st Qu.:37.00   1st Qu.:3034
           Greek :8   Median :38.00   Median :3206
                      Mean   :38.05   Mean   :3206
                      3rd Qu.:39.00   3rd Qu.:3450
                      Max.   :41.00   Max.   :3667

Twenty-two (22) rows remain in the data set, but the counts, quartile values, and means have changed, as the missing values have been replaced by estimated values.

Option 4: Numerical Imputation

Activate library

#install.packages("ggplot2")    # only if not already installed
#install.packages("imputeTS")   # only if not already installed
library(imputeTS)

The package imputeTS is required for numerical imputation. The algorithms in this package call functions in the package ggplot2, so that package also needs to be installed. The two installation commands are commented out because each package only needs to be installed once. For each of the numerical imputation methods, the library imputeTS needs to be activated. Once the library is called, the analyst can choose one of the mathematical models. Please note that only numerical data are imputed; columns containing text data are ignored.

Replace missing values by column mean

meanDataFrame <- na.mean(myDF, option = "mean")
summary(meanDataFrame)
#meanDataFrame

The summary is

   Sex        Ethn        Gest          BWt
 Boy :10   French:6   Min.   :35   Min.   :2581
 Girl:10   German:6   1st Qu.:37   1st Qu.:3070
 NA's: 2   Greek :7   Median :38   Median :3218
           NA's  :3   Mean   :38   Mean   :3225
                      3rd Qu.:39   3rd Qu.:3450
                      Max.   :41   Max.   :3667

Replace missing values by column median

medianDataFrame <- na.mean(myDF, option = "median")
summary(medianDataFrame)
#medianDataFrame

The summary is

   Sex        Ethn        Gest          BWt
 Boy :10   French:6   Min.   :35   Min.   :2581
 Girl:10   German:6   1st Qu.:37   1st Qu.:3070
 NA's: 2   Greek :7   Median :38   Median :3212
           NA's  :3   Mean   :38   Mean   :3224
                      3rd Qu.:39   3rd Qu.:3450
                      Max.   :41   Max.   :3667

Replace missing values by column mode

modeDataFrame <- na.mean(myDF, option = "mode")
summary(modeDataFrame)
#modeDataFrame

The summary is

   Sex        Ethn        Gest            BWt
 Boy :10   French:6   Min.   :35.00   Min.   :2581
 Girl:10   German:6   1st Qu.:37.00   1st Qu.:3034
 NA's: 2   Greek :7   Median :37.50   Median :3206
           NA's  :3   Mean   :37.91   Mean   :3196
                      3rd Qu.:39.00   3rd Qu.:3450
                      Max.   :41.00   Max.   :3667

Replace missing values by interpolation (the average of the values on each side of the missing value)

interpolationDataFrame <- na.interpolation(myDF)
summary(interpolationDataFrame)
#interpolationDataFrame