StatsToDo provides R codes in some of its pages, and this page provides an overview and orientation for using them.
The inclusion of R codes is a work in progress. Initially I aim to create resources for the more complex multivariate statistical procedures that I am unable to program myself. Eventually I aim to supplement most programs in StatsToDo with R codes and examples, using resources provided by CRAN if they are available, or to rewrite my programs using R if the program cannot be found in CRAN. The initiative began in April 2020, so not all algorithms will be accompanied by R codes for some time.
The R codes in StatsToDo are basic, simple, and the minimum necessary to conclude an initial analysis. The aim is to create a shallow learning curve for those with no great statistical experience to explore their data, and to provide the minimum results that are required by most clinical journals for publication. In most cases the codes merely repeat the programs written in php for the web page, and are no more than a validation of the algorithm. Most experienced statisticians will find the details provided incomplete, or even insufficient, as the safeguards and complete descriptions of the results are often not provided. Although references will be provided for users to obtain further information and resources, inexperienced users are strongly urged to seek advice from experienced statisticians before drawing conclusions from the results obtained.
All data in the examples are computer generated, to demonstrate the analysis and interpretation. Although attempts were made to produce results that are plausible and easy to understand, users should realise that the data are artificial and do not represent reality. Given that I had a background in obstetrics, the examples are mostly issues in childbirth and hospitals. The interests are mostly clinical, towards classification, survey, quality control, clinical discovery and trials.
Format
Simple codes are provided once. Those involving multiple steps are presented twice, once in total for easy copy and paste, then in individual steps with explanations and expected results.
Three colors are used, to clarify the intent of the text
- Maroon color is used to represent R codes
- Navy blue color is used to show results generated by R in the console
- Black is used for explanations and descriptions
All R programs in StatsToDo assume no missing data. How to handle missing data is described in Handling Missing Data Using R Explained Page
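Because the programs assume complete data, it may be worth checking a dataframe for missing values before running an analysis. The following is a minimal sketch only; the dataframe and its values are made up for illustration, and the names x_hasMissing and x_missingPerColumn are hypothetical.

```r
# Made-up example data with one missing value (NA) in Col2
x_dataFrame <- data.frame(
  Col1 = c("A", "B", "C"),
  Col2 = c(1, NA, 5),
  Col3 = c(2, 4, 6)
)

# TRUE if any cell in the dataframe is missing
x_hasMissing <- any(is.na(x_dataFrame))

# Number of missing values in each column
x_missingPerColumn <- colSums(is.na(x_dataFrame))

x_hasMissing          # show in console
x_missingPerColumn    # show in console
```

If x_hasMissing is TRUE, the missing values should be dealt with before the example codes are applied to the data.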
Setting up R and RStudio
The first thing to do is to set up R on your computer. Using your web browser, download the latest version of R from
https://cran.r-project.org/, and make sure that you also install RStudio, which is a separate download from its developer's web site. After this, run all R programs using RStudio
RStudio
RStudio is a powerful programming platform, and there are many ways it can be used. Included here is the basic information for the beginner, covering only those issues that are used in this package
Anatomy of RStudio :
When activated, RStudio consists of 4 panels.
- The bottom left panel is called the console. We start here because you can analyse data just using the console. Commands can be typed into the console and run interactively. The console is also where the results of calculations (other than graphics) are displayed
- The top left panel is called the source. This is essentially a text editor, where files can be loaded, edited, and saved. Part or all of the codes in the source can also be marked by the mouse, and on clicking the run button the output will be displayed in the console. Please note that the example codes from this site are intended to be copied and pasted into this source panel, and run from here
- The top right panel is called environment and history. This provides a list of variables and procedures created and run during a session in the console. The idea is that these are saved when exiting RStudio, and reloaded in future sessions to provide continuity. This panel is not used in these pages
- The right bottom panel is called the viewer, which contains a number of sub-panels. This is where packages and help can be viewed. Graphical output is also displayed in this panel
Packages
Many statistical programs are automatically installed with R. Additional programs are written by the wider R user community, available as packages. These packages need to be installed into your computer (only once), using the
install.packages("PackageName") command.
Once installed, the package can be activated during analysis with the
library("PackageName") command.
Required packages are included in the example codes. The install.packages("PackageName") command is usually commented out, as once installed the package is always available.
The explanation and resources available for each package, once installed, can be accessed by the ??"PackageName" command typed into the console
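As an alternative to commenting out the install command, the installation can be made conditional, so that it runs only when the package is missing. This is a sketch under assumptions: x_isInstalled is a hypothetical helper name, and "stats" (which ships with R) stands in for whichever package is actually needed.

```r
# Hypothetical helper: TRUE if the named package is already installed
x_isInstalled <- function(packageName) {
  requireNamespace(packageName, quietly = TRUE)
}

x_package <- "stats"   # stand-in name; replace with the package required

# Install only if the package is not yet on this computer
if (!x_isInstalled(x_package)) {
  install.packages(x_package)
}

# Activate the package; character.only is needed because the
# package name is held in a variable rather than typed directly
library(x_package, character.only = TRUE)
```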
How to use the example codes
For any procedure, the following procedures can be used
- Activate RStudio
- Copy and paste the example codes into the source panel
- Mark the codes and click run at the top right corner of the source panel
- Check that the program runs as intended
- Replace example names and data with the user's own
- Mark and run again to produce the results required
If the console becomes cluttered from repeated analysis the combined keys ctrl+l (el not 1) will clear the console.
Help and further information
Help for any code or function provided by R can be easily obtained within RStudio, by typing the following in the bottom left panel of RStudio (the console)
- help(thesubject) will bring up explanations of the subject in the right bottom panel (help)
- ??packagename will bring up documentation of the package in the right bottom panel (help)
In addition to using search engines of the internet to obtain advice on how to use R, the following web sites are useful, particularly for the beginners
The following text books are easy to read, and provide advice and templates for common statistical tasks
- R for Dummies by de Vries and Meys. John Wiley & Sons, Inc. ISBN 978-1-119-05580-9. This is good for the novice with no prior experience of R.
- R Cookbook by Teetor. O'Reilly Media Inc ISBN 978-93-5023-379-5. This takes the user step by step learning how to use and program R, and introduces the basic R resources
- R Graphics Cookbook by Chang. O'Reilly Media Inc ISBN 978-1-449-31695-2. This is an excellent book providing detailed instructions on how to do graphics using R
Dataframe
R is able to handle data in numerous formats, the most commonly used one being a dataframe. A dataframe is an object (a computer construct that contains a large amount of information). In most cases, particularly in these pages, the dataframe contains a table with columns for variables, and rows for cases.
The first step in any analysis is therefore converting the table of data into a dataframe, which R can then analyse. The standard data table has columns for variables and rows for subjects, with the first row containing the name of each column. The table can then be imported into R using one of the following methods
Direct data I/O
Nearly all examples of R codes on this site will use direct data I/O. This is the simplest form of data entry for small sets of data.
The example codes are as follows
x_textTable = ("
Col1 Col2 Col3
A 1 2
B 3 4
C 5 6
") # Example data in text
x_dataFrame <- read.table(textConnection(x_textTable),header=TRUE)
# Make data frame
#x_dataFrame # Optional show dataframe in console
Please note that:
- The table, in text, is in between (" and ")
- The first row contains the column names
- The columns are separated by spaces or tabs
- The values of Col1 are text, which R stores as character strings from version 4.0.0 onwards (earlier versions convert them to factors by default), and those of Col2 and Col3 are numerical. Please note: the type of values in each column must be consistent.
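The points above can be checked after the table is read. The following sketch uses the text argument of read.table, which is an alternative to textConnection, and then inspects the type of each column. The data are the same made-up values as in the example.

```r
# Example data in text; the first row holds the column names
x_textTable <- "
Col1 Col2 Col3
A 1 2
B 3 4
C 5 6
"

# Make data frame directly from the text
x_dataFrame <- read.table(text = x_textTable, header = TRUE)

# Show the structure: number of rows, column names and column types
str(x_dataFrame)

# Show the type of each column on its own
sapply(x_dataFrame, class)
```

A column that was meant to be numerical but is reported as text usually contains a stray non-numeric value somewhere in the data.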
Results of analysis are displayed in the console or the graphic panel. These can be copied to the clipboard by marking and the combined keys of ctrl+c and pasted into any other application.
Files
By default, R reads and writes all files using the standard Documents folder. It is useful to change the default folder to the one containing the file currently loaded in the source panel. To do so, at the menu bar at the top click Session->Set Working Directory->To Source File Location. To check whether the directory is correct or not, use the following codes
getwd()
list.files()
This will display the folder being used and the files it contains
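The working folder can also be changed with code rather than through the menu, using setwd. The sketch below is an illustration only: tempdir() is used as a stand-in for the user's own folder, and the names x_oldDir and x_newDir are hypothetical.

```r
# Remember the folder currently in use
x_oldDir <- getwd()

# Stand-in folder for illustration; replace with your own folder path
x_newDir <- tempdir()

# Change the working folder, then check it and list its files
setwd(x_newDir)
getwd()
list.files()

# Return to the original folder
setwd(x_oldDir)
```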
Comma delimited (.csv) files
The codes for input from a comma delimited (.csv) file are
x_dataFrame <- read.csv("myCsvIn.csv")
#x_dataFrame
Please Note that:
- myCsvIn.csv is the name of the comma delimited file containing the data table.
- The first line reads the table to the dataframe
- The second line is optional display of the dataframe
The codes for writing an object, such as a dataframe, to a comma delimited (.csv) file are
write.csv(x_dataFrame, "myCsvOut.csv", row.names = FALSE)
Please Note that:
- x_dataFrame is the name of the dataframe to be saved to the .csv file. It should already exist
- myCsvOut.csv is the name of the output comma delimited (.csv) file. Users should change this name to one that is appropriate
- If row.names = FALSE is not used, then a first column containing row numbers will be included
- If the named .csv file pre-exists and is closed, it will be overwritten. Attempt to write to an open file with the same name will fail and flag an error message
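A simple way to confirm that writing and reading work as described is to write a dataframe to a .csv file and read it back. This is a minimal sketch; the data are made up, and the file is placed in a temporary folder (tempdir()) purely so that the example does not touch the user's own files.

```r
# Made-up example dataframe
x_dataFrame <- data.frame(
  Col1 = c("A", "B", "C"),
  Col2 = c(1, 3, 5),
  Col3 = c(2, 4, 6)
)

# File in a temporary folder; replace with your own file name
x_csvFile <- file.path(tempdir(), "myCsvOut.csv")

# Write without row numbers, then read the file back
write.csv(x_dataFrame, x_csvFile, row.names = FALSE)
x_readBack <- read.csv(x_csvFile)

x_readBack   # show in console; should match x_dataFrame
```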
Excel worksheet (.xlsx) files
To access .xlsx files, the package xlsx must be already installed. If not installed, the command
install.packages("xlsx") will do so.
Once installed, each time a .xlsx file is accessed, the library xlsx must be called
For data input
library("xlsx")
x_dataFrame <- read.xlsx("myXlsxIn.xlsx", sheetName="mySheet")
#x_dataFrame
Please note that:
- The xlsx package needs to be already installed once in a computer
- The library xlsx must be called
- myXlsxIn.xlsx is the name of the excel .xlsx file to be read. Users should change this to their own file name
- mySheet is the name of the worksheet to be read. Alternatively, the sheet can be selected by its position, using sheetIndex=1 (a number without quotes, counting from 1) in place of sheetName="mySheet"
- When testing, uncomment the last command (remove #) to see if the correct data has been read
For output of an object such as a dataframe to an excel workbook (.xlsx)
library("xlsx")
write.xlsx2(x_dataFrame, "myXlsxOut.xlsx", sheetName = "mySheet",
row.names = FALSE, append=TRUE)
Please Note that:
- The package xlsx must be pre-installed and called
- x_dataFrame is a dataframe to be saved to file, a table with columns and rows.
- myXlsxOut.xlsx is the name of the excel workbook to be saved to. Users should change this to their own file name
- mySheet is the name of the sheet to save the data to
- If row.names = FALSE is not used, then a first column containing row numbers will be included
- If the file pre-exists, it must be closed. Attempt to write to an open file will crash the program
- If the append option is FALSE, any existing file with the same name will be overwritten
- If the append option is TRUE and no worksheet with the same name exists, a new worksheet with the name will be created and the data written to it
- If the append option is TRUE and a worksheet with the same name already exists, writing to the worksheet will fail and the program crashes
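The writing and reading of .xlsx files can be tested together by writing a dataframe to a workbook and reading it back. The following is a sketch under assumptions: the data are made up, the file is placed in a temporary folder (tempdir()), and the codes run only if the xlsx package (which also needs Java on the computer) can be loaded. The names x_hasXlsx, x_xlsxFile and x_readBack are hypothetical.

```r
# TRUE only if the xlsx package is installed and can be loaded
x_hasXlsx <- requireNamespace("xlsx", quietly = TRUE)

if (x_hasXlsx) {
  # Made-up example dataframe
  x_dataFrame <- data.frame(
    Col1 = c("A", "B", "C"),
    Col2 = c(1, 3, 5),
    Col3 = c(2, 4, 6)
  )

  # File in a temporary folder; replace with your own file name
  x_xlsxFile <- file.path(tempdir(), "myXlsxOut.xlsx")

  # append = FALSE starts a new workbook, so no clash of sheet names
  xlsx::write.xlsx2(x_dataFrame, x_xlsxFile, sheetName = "mySheet",
                    row.names = FALSE, append = FALSE)

  # Read the same sheet back and show it in the console
  x_readBack <- xlsx::read.xlsx(x_xlsxFile, sheetName = "mySheet")
  print(x_readBack)
} else {
  message("Package xlsx is not available on this computer")
}
```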
Creating a report of analysis using Knit
RStudio provides a utility which will run the codes in the source panel, and provide all the results in a file, which can be an html, Word, or pdf file. To do so, the package rmarkdown must be pre-installed using the code
install.packages("rmarkdown").
Once a set of R codes has been tested and found to be satisfactory, at the menu bar on the top of RStudio,
click File->Knit Document
select the type of file. MS Word is recommended as this is the easiest to edit afterwards
The file will be created including all codes, results, and graphics. The user can then edit and improve on this as he/she thinks appropriate.
Please note: The knit program is quite temperamental, and has the following problems
- It can be incompatible with some other packages, especially those that read or write to files or produce graphics. The program merely crashes without too much explanation. An example is that knit will crash when trying to read from or write to excel worksheets.
- In some graphics programs, the labeling for the axis may be misplaced or absent in the report file
These are problems encountered so far, but there may be more.
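The report can also be produced from the console rather than the menu, using the render function of the rmarkdown package. The following is a sketch only, and not necessarily what RStudio does internally. It assumes that rmarkdown is installed and that pandoc (which RStudio normally supplies) can be found; the script file myAnalysis.R and its two lines of content are made up for illustration, and the names x_canKnit, x_scriptFile and x_reportFile are hypothetical.

```r
# TRUE only if rmarkdown can be loaded and pandoc is available
x_canKnit <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (x_canKnit) {
  # Made-up script file standing in for the user's tested R codes
  x_scriptFile <- file.path(tempdir(), "myAnalysis.R")
  writeLines(c(
    "x_values <- c(1, 3, 5)",
    "mean(x_values)"
  ), x_scriptFile)

  # Run the script and write codes and results to a Word document
  x_reportFile <- rmarkdown::render(
    x_scriptFile,
    output_format = "word_document",
    output_dir = tempdir(),
    quiet = TRUE
  )
  print(x_reportFile)   # path of the report created
} else {
  message("rmarkdown or pandoc is not available, so no report was created")
}
```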