R is arguably one of the most versatile and well-liked software environments for statistical programming and applied machine learning.
Kaggle, a competition platform for data science, and KDnuggets, a leading site on machine learning, both point to R as the platform of choice for successful practicing data scientists, based on numerous surveys. Put simply, if you’re genuinely considering making the shift to machine learning, you will undoubtedly benefit from learning and coding in R.
This is largely because the platform is open-source and packed with specialized algorithms. Moreover, it has a large community of contributors who regularly add to R’s functionality and is frequently used by academics in the field of statistics.
Collaboration ensures that R is always evolving and remains state-of-the-art. Moreover, R has the support of a growing global community, including leading data analysts and data scientists. This active and supportive community makes transitioning to the statistical programming language a fairly straightforward process for first-time users.
In this post, you will discover what R is, where it came from and some of its most important features, after which you can follow the step-by-step beginner’s guide to complete your first project in R.
A Brief History of R
R was created by Ross Ihaka and Robert Gentleman of the University of Auckland’s Statistics Department. Development began in 1993 as an experiment that used Lisp to build a testbed for trying out ideas on how a statistical environment might be built.
In 1995, the source code was released via FTP under the GNU General Public License, which allows users to freely use, modify and share user-created functions, code and datasets. This marked a significant turning point in the development of R, and by 1996 dedicated forums were needed to handle the bug reports and suggestions coming in from contributors.
In 1997, a larger ‘core group’ was set up to make changes directly to the source code’s CVS archive.
R Contributors and Development Community
Since the establishment of the core group of contributors in mid-1997, the evolution of R really started to pick up pace. The growing online development community that followed further accelerated progress.
According to Ross Ihaka, R has outgrown its origins and its development is now a collaborative and truly global effort, undertaken using the internet to exchange ideas and distribute the results.
The core group established the R Foundation in 2003 as a not-for-profit organization to provide support to the R Project and other innovations in statistical computing.
The foundation also serves to provide a reference point for individuals, institutions and commercial entities that are interested in supporting or interacting with the R development community.
The R Project supports two conference series that are regularly organised by R community members:
- useR!, organised annually since 2004, provides a forum for the R user community (the 2017 edition was held in Belgium in July, with the 2018 edition to be held in Australia).
- DSC, a platform for developers of statistical software, has been organised periodically since 1999; its 10th edition was held in Belgium in July 2017.
Bolstered by years of social innovation and shared learning, R’s popularity has surged in recent years, and this is evident in scholarly literature as well as in numerous polls and surveys of data miners.
The data science community as a whole now believes that R stacks up well against paid statistical software such as SAS, SPSS and Stata. It is R’s strong community backing and ongoing collective research and development that have allowed it to evolve into an indispensable tool in a data analyst’s arsenal.
Important Features of R
Designed as a statistical environment, R is particularly useful for statisticians and data scientists because it contains a number of built-in mechanisms for organizing data and running calculations on it, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification and clustering.
It also helps create high-quality graphical representations of datasets. Additionally, R runs on a wide variety of UNIX platforms as well as Windows and macOS.
R distinguishes itself from other programming languages by virtue of some really unique and interesting features. We will briefly cover some of these features below:
1. Multiple calculations using vectors
Since R is a vector language, the user can perform multiple calculations at once by applying a function to an entire vector, without writing a loop. This makes the code both faster and more concise than in many other languages.
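As a minimal sketch (the variable names and values here are purely illustrative), a single expression operates on every element of a vector at once:
heights_cm <- c(172, 165, 180, 158)  # a numeric vector of example values
heights_m <- heights_cm / 100        # division applies to every element, no loop required
mean(heights_m)                      # built-in functions also operate on whole vectors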
2. Running code without a compiler
Unlike languages such as Java or C, R does not require a compiler to run code: the interpreter parses and executes R scripts (programs) that are entered directly at the console or loaded from a file (.R extension). This greatly speeds up code development.
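For example, a statement can be typed straight into the console and evaluated immediately, or a saved script can be executed with source() (the file name below is just a placeholder):
1 + 1                  # evaluated immediately at the console
source("analysis.R")   # runs every statement in the script analysis.R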
3. Serves as a glue language
R allows users to write glue code that pieces together different datasets and the features of multiple smaller, directly incompatible components, making the software useful for rapid prototyping.
4. Graphic support
One of the key features of R is its ability to produce data visualizations. As we are predominantly visual beings, our minds are better equipped to understand pictures than numbers. Publication-quality graphs and visuals can be created from datasets in R, making it particularly useful to the data science community. For example, check out the visualization below that shows the countries where various international football stars played their league football in 2014.
5. More than a statistical language
R was designed as a true computer language and is Turing complete, meaning that the user can program new functionality (that is, write programs) simply by defining new functions. Moreover, though most R code is written in R itself, computationally intensive routines can be written in C, C++ or Fortran and then linked to R. It is this versatility and efficiency that has allowed R to be applied in academia and in fields such as biology, finance, genetics and medicine.
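As a small illustration of extending the language, the hypothetical function below defines entirely new functionality in a few lines by standardising a numeric vector:
standardize <- function(x) {
  (x - mean(x)) / sd(x)  # centre the values and scale them by the standard deviation
}
standardize(c(2, 4, 6, 8))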
Applications of R in Data Science
Given the broad scope of functionality that these and other features provide, R has been used for a wide variety of tasks. Here are some of its applications:
1. In Finance: R is often the go-to programming tool for quantitative analysts in finance. It handles everything from data import, sorting, exploration, visualisation and analysis to simulations, predictive modelling and the production of trading applications.
2. At Facebook: R is used to build statistical reports and to mine data in order to target material and improve the user experience in newsfeeds and other services. Facebook’s data scientists have even created freely available course material for a MOOC on exploratory data analysis.
3. By Microsoft: About a year after its 2015 purchase of Revolution Analytics, a leading commercial provider of software and services for R, Microsoft released Microsoft R Server for statistical analysis using R.
4. By Google: Google uses R for pattern recognition in search data, and to isolate and analyze trends in ad pricing.
5. By Pfizer: In a bid to streamline its drug development processes, Pfizer created customised R packages to test datasets during non-clinical drug studies, eliminating the need for third-party statisticians.
By now, you’re clearly aware why you need to learn R. You know about the history, the community, the packages, its strengths and applications. So how about getting started with the language itself? Even if you’re a beginner at programming, you’ll be able to follow along.
R Packages
One of the key factors in the success of the R Project has been R packages. Contributors and other users within the R community create, modify and share R packages, which are essentially self-contained collections of new R functions, compiled source code, documentation and sample data for testing the functions.
In short, packages distribute statistical methodologies just as journal articles distribute scientific information.
R Packages are uploaded to an online repository from where they are accessible to every R user. Users can download these packages to enhance the functionality of their base R platform. On your system, the packages are stored in a directory called the library. R is preloaded with a standard set of packages. Others can easily be downloaded, installed and loaded into a session when needed.
In short, the packages allow for seamless, transparent and cross-platform extensions to the R source code. Currently, the Comprehensive R Archive Network (CRAN), which is the official online repository, has 11,765 freely available packages. Many more packages are available on the internet via other repositories, including Bioconductor and GitHub.
Setting Up R
Setting up R is a fairly straightforward process. First head to r-project.org to download and install R on your system. As mentioned earlier, it runs on a number of operating systems and on both 32-bit and 64-bit architectures.
Once you’ve installed R we recommend that you install the free R Integrated Development Environment (IDE) called RStudio (you can graduate to R itself once you’re familiar with the basics).
RStudio is packed with useful coding features, including tab-key auto-completion that suggests likely functions and file names, and syntax highlighting. Also useful is its four-pane workspace (pictured below), which allows the user to manage multiple R windows at the same time.
The top left window is the R code editor that allows you to create a file with multiple lines of R code or open an existing file and then run either the whole file or just part of it. This is likely where you will do most of your work.
On the bottom left is an interactive console where you can type in R statements one line at a time. All the lines of code that are run through the editor will also show up in the console.
The top right window has a workspace tab that shows you the list of objects currently in memory and a history tab that lists your prior commands. The history tab is quite handy, as you can select one, some or all of the lines of code you’ve already used and send them to either the console or whichever file is active in the editor.
The window at the bottom right has a plots tab that displays the data visualisations created with your R code. From the plots tab you can also view a history of previous plots and export a plot to an image file or PDF.
This window also has tabs that display the packages available on your system (packages tab), the files in your working directory (files tab) and help pages when you request them from the console (help tab).
Installing and using packages
As a beginner’s task, it’s a good idea to familiarise yourself with downloading, installing and running R packages. CRAN has free packages to suit most statistical analysis requirements.
To install a package in RStudio, type the command install.packages("packagename") in the editor, or, if you prefer not to type the command, use the packages tab to select and install packages.
To see which packages are already installed on your system, type installed.packages(), or in RStudio simply check the packages tab.
To use an installed package you must first load it by typing library("packagename").
To update all your packages to their latest versions, type update.packages()
If you want to remove a package from your system, type remove.packages("packagename")
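Putting those commands together, a typical package workflow looks something like the sketch below (ggplot2 is used purely as an example package name):
install.packages("ggplot2")   # download and install a package from CRAN
installed.packages()          # list the packages already on your system
library("ggplot2")            # load an installed package into the current session
update.packages()             # update all installed packages to their latest versions
remove.packages("ggplot2")    # remove a package from your system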
RStudio offers a fairly comprehensive help tool. To find out more about a function, simply type a question mark followed by the function name, as in ?functionName, or type help(functionName).
You can also search R’s help documentation for a specific term by typing help.search("your search term") or by using the shortcut ??"your search term" (the quotes can be dropped if the search term is a single word without spaces).
Let’s say you already know what a function does but you’re not exactly sure how to use it properly. Well, all you have to do is type example(functionName) for a list of examples of how the function can be used (that is, if examples are available).
The args() function, as in args(functionName), displays a list of a function’s arguments.
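Using the median() function as an example, the help commands described above look like this:
?median                        # open the help page for median()
help(median)                   # the same help page via help()
help.search("linear model")    # search the documentation for a phrase
??regression                   # shortcut search for a single term
example(median)                # run the examples from median()'s help page
args(median)                   # list the arguments that median() accepts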
Setting your working directory
To change your working directory, use the setwd() function, as in setwd("~/mydirectory")
For example, on Windows, the command might look like setwd("C:/UPX/Documents/RProjects")
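A quick way to confirm the change is to check the working directory before and after (the path is just a placeholder; use a directory that exists on your system):
getwd()                               # print the current working directory
setwd("C:/UPX/Documents/RProjects")   # change it
getwd()                               # confirm the new location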
If you are using RStudio, click on Session in the menu bar and then click on Set Working Directory.
Learning the Shortcuts
According to Hadley Wickham, RStudio’s chief scientist, the three most important keyboard shortcuts in RStudio are:
- Tab key: The tab key provides a generic auto-complete function. While typing in the console or editor, if you hit the tab key, RStudio will suggest functions or file names. Then you simply have to select the function or file you want and hit tab or enter to accept it.
- Control + the Up Arrow key: Control + the Up Arrow key (command + up arrow on a Mac) also provides a form of auto-completion. To use this shortcut, start typing a command and then hit the shortcut keys. This displays every command you’ve entered that starts with the same keys you’ve just used. Now select the one you want and hit return. This feature only works in the console.
- Control + enter: This (command + enter on a Mac) copies the current line of code typed into the editor, sends it to the console and executes it. You can use this shortcut to select multiple lines of code in the editor and then have all of them run via the console.
This is by no means an exhaustive list of RStudio features and shortcuts. For more information check out the online documentation on RStudio.
Step-by-step R tutorial (with the Iris dataset)
Now that you are familiar with a few R basics, its structure, history and important features, it’s time to start actually using the programming language. This tutorial will give you an understanding of machine learning in R and by the end of it, you will have completed your first machine learning project in R.
Here’s what you are going to do using the tutorial:
- Download, install and load R
- Install a package
- Load a dataset and use it to test out a few important functions in R (summarise and visualise datasets)
- Create 5 machine learning models, test and analyse them
- Pick the best model and validate your choice by checking its accuracy
If you are new to the field of machine learning and interested in R then this tutorial is what you’re looking for.
We are going to work on a small end-to-end machine learning project using a well-understood dataset on the classification of iris flowers (yes, actual flowers).
Here we go!
1. Download and install R
First, download and install R, then RStudio, using the instructions above (RStudio requires R to be installed first).
2. Launch RStudio
Once you’ve installed and launched RStudio, your console window will look like the screenshot below
3. Install packages
To install the caret package, type install.packages("caret") and if that doesn’t work then try
install.packages("caret", dependencies=c("Depends", "Suggests"))
You will need internet access as RStudio will now start to download the caret package (multiple files) from an online R repository. Be patient as this step may take up to an hour depending on internet speed. Once the download is complete, you can move on to the next step or read ahead while you’re waiting.
The caret package is a great tool for machine learning in R as it contains a number of machine learning algorithms as well as functions and code for data visualization, data resampling, fine tuning of models, model comparison and a host of other features. Visit the caret homepage for more information.
4. Load the package
Load the caret package from your system’s library by typing library(caret)
5. Load the data
Now that the caret package is downloaded, installed and loaded, we can load the iris dataset (which ships with base R) by typing data(iris). The later steps refer to the data as dataset, so copy it into a new variable by typing dataset <- iris
This dataset is ideal for a beginner in need of a short test project, as it contains 150 observations of iris flowers. The dataset has four columns of measurements (in centimetres) of flower parts. A fifth column gives the species to which each flower belongs, one of three observed species. For more information, check out the iris dataset on Wikipedia.
6. Create a Validation Dataset
All machine learning models must be tested so that we know we’ve picked the best one for the task. To do this we are going to split the dataset into two parts so that one part can be used to train our statistical models of choice and the other part can be used to validate the models.
By holding back some data (say 20%) from the algorithms, we are able to get an independent estimate of the best model’s accuracy. We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
First, we will partition the data by selecting 80% of rows which will later be used to train the models. Type validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
Now we will select the 20% of the dataset to be held back for validation by typing validation <- dataset[-validation_index,]
Next we will use the remaining 80% to train and test the models.
Type dataset <- dataset[validation_index,]
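Putting the whole split together, and adding a fixed random seed so the partition is reproducible (the seed value 7 is arbitrary and not part of the steps above), the code looks like this:
set.seed(7)                                                                   # make the random split reproducible
validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)  # indices of the 80% training rows
validation <- dataset[-validation_index,]                                      # 20% held back for validation
dataset <- dataset[validation_index,]                                          # 80% used to train and test the models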
7. Summarize the dataset
Data can be summarised in a number of ways depending on how you need to view it. Here are some ways to summarise a dataset in R.
You can view the dimensions of the dataset (in this case 120 rows and 5 columns) by typing dim(dataset)
You can view the types of attributes, which is useful when choosing how to summarise the data, by typing sapply(dataset, class).
It is always a good idea to look over your actual data. You can view the first six rows of the data by typing head(dataset).
The class variable (the 5th column) is a factor, as it holds multiple class labels, or levels. In this case, you can view the three class levels by typing levels(dataset$Species).
You may find it useful to know the class distribution, that is, the number of instances (rows) that belong to each class. R can report the class distribution both as absolute counts and as percentages. To view the class distribution, type:
percentage <- prop.table(table(dataset$Species)) * 100
On the next line type:
cbind(freq=table(dataset$Species), percentage=percentage)
The number of instances in each class will be displayed in absolute values and percentages (40 rows or 33% of the dataset each).
You can view a statistical summary of each attribute including the mean, minimum and maximum values, and percentiles (25th, 50th or median and 75th) by typing summary(dataset).
8. Visualize the dataset
Now that you have had a look at the various ways of viewing and summarising data we can move on to visualisations. We will look at univariate plots to get a better understanding of each attribute and multivariate plots to better understand the relationships between attributes.
To view univariate plots we will first split the input (measurements of flower parts) and output (species) variables so that they can be plotted separately. To do this type the following commands on two separate lines: x <- dataset[,1:4]
y <- dataset[,5]
Now we can create univariate plots of just the input data, which is numeric. Let us create a box-and-whisker plot for each input attribute. Type par(mfrow=c(1,4))
for(i in 1:4) {
boxplot(x[,i], main=names(iris)[i])
}
A box and whisker plot (pictured below) for each input attribute will appear in the visualization window giving a clearer and visual representation of the data distribution.
We can now plot a bar graph of the output variable or species class to show its distribution using the command plot(y)
The bar graph (above) clearly shows that the instances are evenly distributed just as shown in the class distribution step earlier.
Multivariate plots allow the user to get a visual representation of the relationship between the input and output attributes of the dataset. We will thus be able to see how each species class relates to the four types of measurement data. We can illustrate this using a scatter plot that uses colour and ellipses to separate the data, making it easier to spot trends and draw conclusions from the plot.
To get a scatter plot with ellipses type featurePlot(x=x, y=y, plot="ellipse")
To get a clearer view of the linear separation between the classes, we can use box-and-whisker plots that show each variable separately for each class. Type featurePlot(x=x, y=y, plot="box")
Histograms can show the same information as the box-and-whisker plots above, but using a probability density plot smooths each distribution into a curve, making patterns easier to see and your graphs more presentable.
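One way to produce such density plots with caret’s featurePlot() is sketched below; the scales argument simply lets each panel use its own axis range:
scales <- list(x=list(relation="free"), y=list(relation="free"))  # free axes per panel
featurePlot(x=x, y=y, plot="density", scales=scales)              # one density curve per class for each attribute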
9. Evaluating the algorithms
Now that we’ve had a look at a few ways in which data can be represented graphically, let’s move on to evaluating a few algorithms. We will do this by creating five models (algorithms) and then testing their accuracy against unseen data. As we are using the iris dataset, each of these models will be tested to ascertain which one is the most accurate at identifying the species of an iris flower from the given variables (the 4 measurements of flower parts).
We will first set up a test harness (an automated test framework) for 10-fold cross-validation that will estimate the accuracy of the models. 10-fold cross-validation splits the dataset (this refers to the 80% partition that was created for training and testing, not the 20% validation partition created at the beginning of the tutorial) into 10 equal parts, of which 9 are used to train the model and 1 is used to test it in each of the 10 folds (repetitions).
When the 10-fold cross-validation is complete, each of the 10 parts will have been used to train the model 9 times and to test it once. We won’t do so here, but you may want to repeat the whole process 3 times for each model to obtain a more reliable estimate.
10-fold cross-validation will provide us with a fairly good estimate of each model’s accuracy. We will use the accuracy metric for evaluation; the accuracy of each model will be shown as the percentage of correctly predicted instances. To execute this step, type the code on two lines as such:
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
Now, let’s move on to building the models. The five models that are appropriate for the task (don’t worry if you haven’t heard of them before) are:
- Linear Discriminant Analysis (LDA). This is a good linear model.
- Classification and Regression Trees (CART). This is a good nonlinear model.
- k-Nearest Neighbors (kNN), which again is a good nonlinear model.
- Support Vector Machines (SVM) with a linear kernel.
- Random Forest (RF).
To ensure that each model gets a fair shot and is directly comparable to the others, we will reset the random number seed (think of it as the starting point) before training each algorithm, as in the sketch below. This makes sure that each algorithm is trained and tested against the same data splits.
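A sketch of how the five models might be trained with caret’s train() function follows; the method strings are caret’s identifiers for each algorithm ("svmLinear" is used here to match the linear-kernel SVM described above), the fit.* names match those used in the next step, and the seed value 7 is arbitrary:
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)       # Linear Discriminant Analysis
set.seed(7)
fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)    # Classification and Regression Trees
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)       # k-Nearest Neighbors
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmLinear", metric=metric, trControl=control) # SVM with a linear kernel
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)         # Random Forest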
After successfully building and training the 5 models, we can move on to identifying the best one based on how accurately it predicts the class of unseen data. To do this, we will list the accuracy of each of the created models by typing the following on two lines:
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
You will now be able to see the accuracy of each model as well as other metrics like Kappa.
We can also view the evaluation results for each model graphically by typing dotplot(results).
In the graph we can easily compare the spread and mean accuracy of each model. It is clear from the results that LDA is the most accurate model for the dataset. The Kappa results also identify LDA as the best model.
Now let’s take a closer look at the best model to understand the data that was used to train it and the details of its accuracy. To do this, type print(fit.lda)
10. Making predictions
Thus far, we have tested the models against unseen data belonging to the 80% partition that was set aside to train and test them. Now we can test the LDA model against our validation dataset, which is truly unseen data (that is, data the algorithm has never encountered), to confirm the accuracy results. This, then, is the real test of the model’s accuracy. To run LDA against the validation dataset, type the following on two lines:
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
The results show an accuracy of 100% against the validation set, confirming that the trained LDA model is well suited to identifying the species of iris flowers from the four types of measurement data.
You have now successfully finished your first end-to-end machine learning tutorial in R. Pat yourself on the back and give it another go to better familiarize yourself with the features we’ve tested.