June 7, 2018

Step-by-step tutorial to machine learning on Azure

Accredian Research Team

If you are looking for a guide to machine learning (ML) on a cloud computing platform, you have landed in the right place. In this tutorial, we will take a detailed step-by-step look at machine learning on Microsoft’s Azure cloud platform.

By the end of this tutorial, you will know what Microsoft Azure is, why it is used for machine learning, which steps are involved in building and deploying a machine learning model, and which algorithms are used to build it (with examples).

So, let us begin this tutorial at once! 3..2..1..Ignition.

Table of Contents

1. What is Microsoft Azure?

2. Why is Microsoft Azure becoming a popular choice for ML?

3. Advantages of Microsoft Azure

4. Machine Learning on Azure: A Walkthrough

4.1. Machine Learning Studio

4.2. Cortana Intelligence Gallery

4.3. Modules and algorithms library

5. Building a Machine Learning Model on Azure

6. Exercise: Building a predictive model on the ‘Titanic’ dataset

6.1. Titanic data overview

What is Microsoft Azure?

Azure is a cloud computing service for building, testing, deploying, and managing applications and services through a global network of data centers. These data centers are operated by Microsoft. IT professionals and developers use Azure to build everything from simple mobile apps to enterprise-scale business solutions.

Microsoft Azure is a platform integrated with tools, DevOps and a marketplace to support the applications developed by professionals.

Azure has also been gaining popularity for machine learning. It makes it easier to quickly create and deploy ML-based analytics solutions. Let us see why it is a popular choice for ML.

Why is Microsoft Azure becoming a popular choice for ML?

Machine learning has wide applications. It makes apps and devices smarter. It makes forecasting and prediction from large datasets easier. ML powers online recommendation engines, aids fraud detection, and much more. But how does Azure make machine learning more efficient?

Azure lets you work with a ‘ready-to-use’ library of algorithms, so you do not need to code one for your ML model. It enables you to quickly create ‘predictive models’ in the Azure ML Studio and deploy them.

Azure’s ‘Cortana Intelligence Gallery’ provides you with readily usable examples and solutions, so that you can get started quickly. It provides not only a workbench for creating predictive analytics models, but also a full service for marketing your solution.

You simply select algorithms from the algorithm library, build a predictive analytics model in the Azure ML Studio, test it, and deploy it on the web as a service. Modeling an ML-based cloud analytics solution has never been easier.

Advantages of Microsoft Azure

No need to code algorithms! Just select a suitable one from the algorithm library.
You can quickly create models in the ML Studio and deploy them on the web.
Azure provides you with ‘ready-to-use’ examples and models for reference.
You can also deploy your analytics solution as a web service via the Azure platform.

Companies such as Heineken, Rolls-Royce and Adobe use Azure extensively, which speaks to the platform’s credibility, robustness and reliability.

Let us now walk through the basics of machine learning on Azure.

Machine Learning on Azure: A Walkthrough

Let us quickly review what machine learning looks like on Azure. The following are the basic tools and steps involved in building a predictive analytics model on this platform. All of them are discussed in detail (with examples) further ahead.

You will encounter the following tools while working on Azure:

1. Machine Learning Studio

Azure ML Studio comes with an interactive ‘drag and drop’ interface. You can quickly create an analytics model by dragging ‘modules’ onto the canvas, connecting them, and trying various combinations.

2. Cortana Intelligence Gallery

This online gallery hosts a collection of analytics models built by other developers. You can refer to them as examples for your own work. You can also comment, ask questions and share experiments with fellow developers here.

3. Modules and algorithms library

This is the warehouse from which you select suitable algorithms for your predictive model. You can choose from Python and R packages, sample experiments, and sophisticated algorithms used in products like Xbox and Bing.

You will go through the following steps while creating a predictive ML model on Azure:

Accessing the Azure ML studio
Acquiring the data for analysis
Preprocessing or preparing the datasets for analysis
Defining the Machine Learning features for your model
Choosing and applying a learning algorithm to your model
Testing the prediction ability of your model
Deploying the model on the web as a service

We will now take a detailed look at the steps and processes involved in building an ML model on Azure. Then we will build a classification model based on the Titanic dataset as an example.

From a given ‘training’ dataset of the RMS Titanic’s passengers and their details, we will try to predict which passengers survived the Titanic disaster. This is a classic prediction problem suitable for beginners as well as professionals. For this model, we will employ a two-class (binary) classification algorithm, such as ‘Decision tree’ or ‘Decision forest.’

Let us get going!

Building a Machine Learning Model on Azure

We will now see the steps involved in building a predictive ML model on Azure. Once you get an idea of the ‘know-how’ involved, we will proceed with an example, as mentioned above.

Predictive analytics model building starts from this point.

Start by opening the Azure Machine learning studio. Sign up for it on the official Microsoft Website and choose from the free or paid service options. Follow these steps from there on!

Step 1: Sourcing the data

ML without the data is like a body without the soul. Azure provides you with two options to source the data. You can import your own data from various sources, or use the sample data available in ML studio. This will be known as the ‘Raw’ data. Following are the steps to do so.

Create a new experiment by clicking the ‘+NEW’ button at the bottom of the ML Studio window.
Click ‘Experiment’ followed by ‘Blank Experiment.’
Select the default title at the top of the experiment window and replace it with a name of your choice, say ‘XYZ.’
Now, search for a relevant dataset for your model in the search box at the top-left of the screen. Let’s call it ‘xyzdata.’
Drag and drop the xyzdata dataset on the experiment window/canvas.

At this stage, you have successfully sourced/imported the data for your ML model. It is time to ‘prepare’ it for analytical use.
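For readers who like to see the idea in code, here is a minimal sketch of what ‘sourcing the data’ amounts to outside the Studio: reading a table of rows with named columns. It uses plain Python rather than any Azure API, and the tiny inline dataset and the name `load_rows` are made up for illustration.

```python
import csv
import io

# A tiny, made-up stand-in for the raw dataset ('xyzdata' in the steps above).
RAW_CSV = """make,price,normalized-losses
audi,13950,164
bmw,16430,
volvo,12940,103
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(RAW_CSV)
print(len(rows), "rows with columns", list(rows[0].keys()))
```

Note that the second row has an empty ‘normalized-losses’ value. Raw data often contains such gaps, which is why a preparation step is needed before training.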

Step 2: Preprocessing/preparing the data

Preprocessing the data is a crucial step before running predictions on a dataset. It accounts for missing and incorrect values; running an ‘unprepared’ dataset will give unreliable results. So let us prepare the data for our experiment. Here we will see the steps to exclude a column that has many missing values and then remove the remaining rows that still contain missing values.

In the top-left search box, enter ‘Select Columns.’ Click the ‘Select Columns in Dataset’ module from the suggestions. This module enables ‘inclusion’ or ‘exclusion’ of the columns of the data.
Now, connect the ‘output port’ of the ‘xyzdata’ module to the ‘input port’ of the ‘Select Columns in Dataset’ module by dragging from one port to the other.
Click and select the ‘Select Columns in Dataset’ module.
Go to the Properties pane and click Launch column selector.
On the left, click the With rules option.
Click ‘All Columns’ under the ‘Begin with’ option.
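You will see two dropdown lists. Select the ‘Exclude’ option from the first and ‘Column names’ from the second. A list of columns is displayed on the screen. Choose the ‘normalized-losses’ column; it should be added to the text box. (This column name comes from the automobile price sample dataset in ML Studio; with your own data, pick whichever column has many missing values.)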
Now, click the ‘tick mark’ button on the bottom right of the window. This will close the ‘column selector’ window.
At this point, the Properties pane shows that all the columns, except ‘normalized-losses’, will pass through.
You should now drag the ‘Clean Missing data’ module to the experiment window. Connect it with the ‘Select Columns in Dataset’ module, just as you connected the first two modules.
Under the ‘Cleaning mode’ option in the ‘Properties’ pane, select ‘Remove entire row.’ With this, the rows with missing values get removed.
Double-click the ‘Clean Missing Data’ module and type the comment ‘Remove missing value rows.’
Click the ‘RUN’ button located at the bottom of the page.

At this stage, you have successfully preprocessed the ‘raw’ data for your ML model. Let’s proceed to the next step!
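The two preparation steps above (exclude a column, then remove rows with missing values) can be sketched in a few lines of plain Python. This is only an illustration of the logic, not how the Studio modules are implemented; the function names and sample rows are hypothetical, and an empty string stands in for a missing value.

```python
def exclude_columns(rows, names):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


def remove_rows_with_missing(rows):
    """Drop every row that has at least one empty or None value."""
    return [row for row in rows if all(v not in ("", None) for v in row.values())]


# Made-up rows; an empty string marks a missing value.
rows = [
    {"make": "audi", "price": "13950", "normalized-losses": "164"},
    {"make": "bmw", "price": "16430", "normalized-losses": ""},
    {"make": "volvo", "price": "", "normalized-losses": "103"},
]

without_column = exclude_columns(rows, {"normalized-losses"})
cleaned = remove_rows_with_missing(without_column)
print(cleaned)
```

The order matters: because the sparse column is excluded first, the ‘bmw’ row survives the cleaning step. Only the ‘volvo’ row, which is missing its price, is removed.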

Step 3: Define features for your ML model

Features are individual, measurable properties of interest. For example, in the Titanic dataset, each row describes one passenger, and each column represents an individual feature of that passenger (age, gender, etc.). To build a precise predictive model, you need to understand the problem your model is going to solve. Based on that, you select the set of features to be evaluated during prediction.

To define the features for your analytics model, follow these steps:

Search for another Select Columns in Dataset module and drag it to the experiment window.
Connect the ‘left output’ port of the Clean Missing Data module to the ‘input port’ of the new Select Columns in Dataset module.
Double-click the module and type the comment ‘Select features for prediction.’
In the Properties pane, select Launch Column selector.
Select the With rules option.
Under the Begin with option, select No columns.
From the two drop-down lists, select Include and Column Names respectively.
Select the column names (Features) for your ML model. The selected features should appear in the text box given.
Confirm the selection by clicking on the ‘Tick’ mark button.

At this point, you have defined the ‘features’ of your ML predictive model. Let us get going with the algorithms now.
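Feature selection with ‘No columns’ plus ‘Include’ is simply a projection of each row onto a chosen list of column names. A minimal sketch in plain Python, with a hypothetical helper `select_features` and made-up rows:

```python
def select_features(rows, feature_names):
    """Keep only the listed columns, in the listed order."""
    selected = []
    for row in rows:
        unknown = [name for name in feature_names if name not in row]
        if unknown:
            raise KeyError(f"unknown columns: {unknown}")
        selected.append({name: row[name] for name in feature_names})
    return selected


# Made-up rows for illustration.
rows = [
    {"make": "audi", "body-style": "sedan", "horsepower": 102, "price": 13950},
    {"make": "bmw", "body-style": "sedan", "horsepower": 101, "price": 16430},
]
features = select_features(rows, ["make", "horsepower", "price"])
print(features[0])
```

Raising an error for an unknown column name is a design choice of this sketch; it catches typos in the feature list early.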

Step 4: Choosing and applying a suitable learning algorithm to your model

You have now prepared the data to be used in your predictive model. This data will be used to ‘train’ your model and then test it for predictions. To make the model work, you need to find and apply a suitable algorithm to it. In our example of the Titanic dataset, we will use a two-class (binary) classification algorithm (Decision forest or Decision tree). Since the prediction will be either ‘survived’ or ‘died’ for each passenger, we clearly have a binary classification problem.

Here we will see the steps involved in selecting and applying a learning algorithm to the model.

Search for the Split Data module and drag it to the experiment screen/canvas.
Now, join the Split Data module to the last module i.e. Select Columns in Dataset module.
Select the Split Data module by clicking it.
Find the Fraction of rows in the first output dataset in the Properties pane located to the right of the screen. Set the value to 0.75.
Run the experiment.
Now, select the algorithm by clicking on the Machine Learning category located at the left side.
Click on the Initialize Model option to expand a list. Here you will see a list of various algorithm modules. A suitable one will be used to initialize your ML model.
Click on the ‘Classification’ category. Then select the Decision forest or Decision tree algorithm module, since that is what we will use in our ML model. You could equally select an algorithm module from the ‘Regression’ or any other category, depending on the problem and objective of your model.
Drag the Decision tree or Decision forest algorithm module to the experiment screen.
Now, we need to ‘train’ the model. Search for the Train Model module in the search box. Drag and drop it to the experiment canvas.
Connect the output port of the algorithm module (whichever you selected) to the left input port of the Train Model module.
Connect the left output port of the Split Data module to the right input port of the Train Model module.
Select the Train Model module by clicking it.
In the Properties pane at the right, select the Launch Column selector option.
Now, we will select the column that the model is going to predict. Select the ‘Survived’ column (since we want the machine to predict whether a passenger survived or not) and move it to the Selected columns list.
Select the ‘Tick’ button to confirm the selections.
Run the experiment.

At this checkpoint, you have successfully integrated a learning algorithm to your model. Your model is loaded with the training data and ready to learn. Now, we will move on to testing and predicting the data.
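To make the effect of the ‘Fraction of rows in the first output dataset’ setting concrete, here is a minimal sketch of a randomized split in plain Python. The helper `split_rows`, its `seed` parameter and the rounding rule (truncating to a whole number of rows) are assumptions of this sketch, not a description of how the Split Data module works internally.

```python
import random


def split_rows(rows, fraction, seed=0):
    """Shuffle a copy of the rows and split it into two lists.

    `fraction` is the share of rows that goes to the first output.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 data rows
train, test = split_rows(rows, 0.75)
print(len(train), len(test))
```

Every row lands in exactly one of the two outputs, so the model is later scored on rows it has never seen during training.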

Step 5: Getting predictions from your model

Remember how we split the data in Step 4 (when we set the Fraction of rows in the first output dataset value to 0.75)? This means our model can now make predictions on 25 percent of the data, based on its training on the other 75 percent. Let us begin!

Begin by searching for the Score Model module in the search box. Drag it to the experiment canvas.
Connect the left input port of the Score Model module to the output of the Train Model module.
Connect the right output port of the Split Data module (the test data) to the right input port of the Score Model module.
Run the experiment.
Now, click the ‘output port’ of the Score Model module. Here, select the Visualize option. This will enable you to view the output from the Score Model module.
You should now be able to see the ‘Known values’ from the dataset and the ‘Predicted values’ by the model. You can check for accuracy of your predictive model now!

There! These are the basic steps involved in building a predictive machine learning model on Microsoft Azure.
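Checking accuracy by comparing ‘known values’ with ‘predicted values’ boils down to counting matches. A minimal sketch in plain Python, with made-up labels and a hypothetical helper `accuracy`:

```python
def accuracy(known, predicted):
    """Fraction of rows where the predicted label equals the known label."""
    if len(known) != len(predicted):
        raise ValueError("both lists must have the same length")
    if not known:
        raise ValueError("cannot score an empty dataset")
    matches = sum(1 for k, p in zip(known, predicted) if k == p)
    return matches / len(known)


# Made-up labels: 1 = positive class, 0 = negative class.
known = [1, 0, 0, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 0, 0, 1]
print(accuracy(known, predicted))
```

Here six of the eight predictions match the known values, so the accuracy is 0.75.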

Let us now apply these steps to an actual model. As promised above, we will be working with the ‘Titanic’ dataset. You will now be building a classification ML model to predict the survival (Yes/No) of passengers in the Titanic disaster.

Exercise: Building a predictive model on the ‘Titanic’ dataset

The ‘Titanic’ dataset is an excellent choice to build sample ML models. It is readily available from various online sources. Our predictive model will aim to predict ‘whether a passenger will survive or die’ based on the ‘features’ or circumstances of each individual.

Titanic data overview

The training data for the Titanic dataset contains 891 rows and 12 columns. Each row represents a passenger aboard the ship on the night of the disaster. Each column represents the passenger’s information, demographics and circumstances.

These columns include passenger ID, name, age, gender, number of siblings and spouses aboard the ship, number of parents or children aboard the ship, ticket class, ticket number, fare, cabin number, port of embarkation and whether the passenger survived or not.

Based on the available information, we will first train our model with a cleaned dataset, then make predictions. Let us do some cool engineering!

Step 1: Sourcing the Titanic’s passenger data

As you already know, we need to source the raw data. You can import the data from ML studio or source it from online channels. You can also find the Titanic dataset via this link.

Step 2: Preprocessing the raw Titanic data

We will begin by dropping the columns that have no significance, because their values are not relevant to our prediction problem. If we include them, the model may pick up on these irrelevant values (noise) and return less reliable results. Follow these steps to start:

In the top-left search box, enter ‘Select Columns.’ Click the ‘Select Columns in Dataset’ module from the suggestions. This module enables ‘inclusion’ or ‘exclusion’ of the columns of the data.
Now, connect the ‘output port’ of the ‘Titanic Dataset’ module to the ‘input port’ of the ‘Select Columns in Dataset’ module by dragging from one port to the other.
Click and select the ‘Select Columns in Dataset’ module.
Go to the Properties pane and click Launch column selector.
On the left, click the With rules option.
Click ‘All Columns’ under the ‘Begin with’ option.
You will see two dropdown lists. Select the ‘Exclude’ option from the first and ‘Column names’ from the second. A list of columns gets displayed on the screen.
Select the ‘PassengerId’ column for exclusion. An arbitrary identifier could introduce a spurious correlation into our survival prediction.
Similarly, select the ‘Name’ column for exclusion. We would only need it if we wished to derive another column from the names for advanced operations.
Drop the ‘Ticket’ (ticket number) column as well. We have no information on how ticket numbers were assigned, so we cannot know how they relate to the survival or death of the Titanic’s passengers.
Lastly, drop the ‘Cabin’ column as well. Since we do not have any data on the layout of the ship, we cannot know how the location of each cabin affected the survival or death of an individual.
All the selected columns should now be displayed in the text box.
Now, click the ‘tick mark’ button on the bottom right of the window. This will close the ‘column selector’ window.
At this point, the Properties pane shows that all the columns, except the selected ones, will pass through for data evaluation.

*Note: We will not exclude a ‘normalized-losses’ column in this case, since the Titanic dataset has no such column. That example appeared in the general steps above just for reference.

Step 3: Define the categorical variables

This step tells the model which columns hold ‘categorical’ values rather than continuous numbers. Since ‘Survived’, ‘Sex’ (gender), ‘Pclass’ (ticket class) and ‘Embarked’ (port of embarkation) can only take a few distinct values, it is important to ‘categorize’ them. It would look absurd if the survival status of a particular passenger were predicted as 0.34! The same goes for gender and the other categorical features.

Follow these steps:

Search for the ‘Metadata Editor’ module and drag it to the experiment canvas.
Connect it to the Select Columns in Dataset module. You can now cast or edit the columns.
In the column selector, choose the columns listed above; then, under the Categorical dropdown, select the ‘Make Categorical’ option.
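Conceptually, making a column categorical means treating its values as labels drawn from a small fixed set of levels rather than as numbers on a scale. A minimal sketch in plain Python; the helper `make_categorical` and the sample rows are hypothetical, and this is not how the Studio stores categories internally.

```python
def make_categorical(rows, column):
    """Turn a column's values into string labels and return the sorted levels."""
    levels = sorted({str(row[column]) for row in rows})
    converted = [{**row, column: str(row[column])} for row in rows]
    return converted, levels


# Made-up passengers; Pclass is stored as a number but is really a label.
rows = [
    {"Pclass": 3, "Fare": 7.25},
    {"Pclass": 1, "Fare": 71.28},
    {"Pclass": 3, "Fare": 8.05},
]
converted, levels = make_categorical(rows, "Pclass")
print(levels, converted[0])
```

After the conversion, ‘Pclass’ is one of the labels "1" or "3", so no downstream step can compute a meaningless value such as an average ticket class of 2.33.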

Step 4: Accounting for the missing data/Cleaning the missing data

In Microsoft Azure, you can use the ‘Clean Missing Data’ module to clean the missing values.

Simply follow these steps:

Search and drag the ‘Clean missing data’ module to the experiment canvas.
Connect the output port of the Metadata Editor module to the Clean Missing Data module.
Click the Clean Missing Data module. In the column selector, choose the ‘Column type’ and ‘Numeric’ options.
This action will clean the missing values in numeric columns such as ‘Age.’
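The steps above do not say which cleaning mode to use for ‘Age.’ One common option is to replace each missing value with the column mean, which keeps the row instead of discarding it. The sketch below assumes that mode; the helper name and the sample ages are made up, and `None` marks a missing value.

```python
def fill_missing_with_mean(rows, column):
    """Replace None in a numeric column with the mean of the present values."""
    present = [row[column] for row in rows if row[column] is not None]
    if not present:
        raise ValueError(f"column {column!r} has no values to average")
    mean = sum(present) / len(present)
    return [
        {**row, column: mean if row[column] is None else row[column]}
        for row in rows
    ]


# Made-up passengers; the second one has no recorded age.
rows = [
    {"Sex": "male", "Age": 22.0},
    {"Sex": "female", "Age": None},
    {"Sex": "female", "Age": 26.0},
    {"Sex": "male", "Age": 36.0},
]
filled = fill_missing_with_mean(rows, "Age")
print(filled[1])
```

The three known ages average to 28.0, so that is the value filled in for the second passenger. Removing the row instead would be the ‘Remove entire row’ mode used in the general steps earlier.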

Step 5: Specifying the response class

At this stage, we will define which ‘feature’ or ‘attribute’ we want our algorithm to train on and predict. This step is called ‘Labelling.’

In Azure, we achieve this as follows:

Search for the ‘Metadata Editor’ module and drag it to the canvas. Connect it to the Clean Missing Data module, i.e. the last module of the model.
Click the Metadata Editor module to select it. In the column selector, select the ‘Survived’ column.
Then change the ‘Fields’ parameter to ‘Label.’

*Note: A dataset can only contain ‘one’ label at a time.
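In code terms, labelling separates the one column to be predicted from the columns used as inputs. A minimal sketch in plain Python, with a hypothetical helper `split_label` and made-up passengers:

```python
def split_label(rows, label):
    """Return (features, labels): each row without the label, and the label values."""
    features = [{k: v for k, v in row.items() if k != label} for row in rows]
    labels = [row[label] for row in rows]
    return features, labels


# Made-up passengers.
rows = [
    {"Survived": 0, "Pclass": 3, "Sex": "male", "Age": 22.0},
    {"Survived": 1, "Pclass": 1, "Sex": "female", "Age": 38.0},
]
features, labels = split_label(rows, "Survived")
print(labels, features[0])
```

Keeping the label out of the feature set matters: a model that could see ‘Survived’ as an input would trivially predict it.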

We now have a preprocessed ‘Titanic’ dataset. Let us split the data for training and apply a suitable learning algorithm to it.

Step 6: Splitting the data

- Search and drag the Split Data module on the canvas.
- Now, connect the Split Data module to the last module, i.e. the Metadata Editor module from Step 5.
- Select the Split Data module by clicking it.
- Find the Fraction of rows in the first output dataset in the Properties pane located to the right of the screen. Set the value to 0.70 (we used 0.75 in the steps above, but a 70/30 split is a common choice).
- Run the experiment.
- This will randomly send 70 percent of the rows to the left output port and 30 percent to the right output port.
- The algorithm we select will train on the 70 percent and make predictions on the remaining 30 percent.

Step 7: Applying an Algorithm

Now, select the algorithm by clicking on the Machine Learning category located at the left side.
Click on the Initialize Model option to expand a list. Here you will see a list of various algorithm modules. A suitable one will be used to initialize your ML model.
Click on the ‘Classification’ category. Then select the Decision forest or Decision tree algorithm module. For the sake of this experiment, we will use the Decision Tree algorithm.
Drag the Decision tree algorithm module to the experiment screen.

Step 8: Training the model

Search for the Train Model module in the search box. Drag and drop it to the experiment canvas.
Connect the left input of the Train Model module to the output port of Decision Tree Algorithm module.
Connect the output port of Split Data module (left port) with input of the Train Model module (right port).
Select the Train Model module by clicking it.
In the Properties pane at the right, select the Launch Column selector option.
Now, we will select the column that the model is going to predict. Select the ‘Survived’ column (since we want the machine to predict whether a passenger survived or not) and move it to the Selected columns list. This is the column you ‘labelled’ in Step 5.
Select the ‘Tick’ button to confirm the selections.
Run the experiment.
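The Train Model module hides the learning algorithm itself. To give a feel for what ‘training’ a tree-based classifier means, here is a sketch of the simplest possible case: a one-level decision tree (a ‘decision stump’) on a single numeric feature. It is a teaching illustration with made-up data and hypothetical function names, not the algorithm Azure uses.

```python
def train_stump(values, labels):
    """Find the threshold on one numeric feature that misclassifies the fewest rows.

    The stump predicts `low_label` when value <= threshold and the other
    label otherwise. Returns (threshold, low_label).
    """
    if not values or len(values) != len(labels):
        raise ValueError("need equally long, non-empty value and label lists")
    best = None
    for threshold in sorted(set(values)):
        for low_label in (0, 1):
            errors = sum(
                1
                for v, y in zip(values, labels)
                if (low_label if v <= threshold else 1 - low_label) != y
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label)
    _, threshold, low_label = best
    return threshold, low_label


def predict_stump(model, value):
    """Apply a trained stump to one value."""
    threshold, low_label = model
    return low_label if value <= threshold else 1 - low_label


# Made-up training data: fare paid, and survived (1) or not (0).
fares = [7.25, 8.05, 13.0, 26.55, 53.1, 71.28]
survived = [0, 0, 0, 1, 1, 1]

model = train_stump(fares, survived)
print(model, predict_stump(model, 10.0), predict_stump(model, 60.0))
```

On this toy data the stump learns the rule ‘fare at most 13.0 means did not survive.’ A full decision tree repeats this kind of split recursively over many features, and a decision forest combines many such trees.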

Step 9: Start making predictions

Search for the Score Model module in the search box. Drag it to the experiment canvas.
Connect the left input port of the Score Model module to the output of the Train Model module.
Connect the right output port of the Split Data module (the test data) to the right input port of the Score Model module.
Run the experiment.
Now, click the ‘output port’ of the Score Model module.
Here, select the Visualize option. This will enable you to view the output from the Score Model module.
You should now be able to see the ‘Known values’ from the dataset and the ‘Predicted values’ by the model.

Step 10: Evaluate your model

The Visualize option presents you with several model metrics. Which metrics you need to evaluate depends on the objective of your model. For our model, we will focus on its ‘overall performance.’ Therefore, we will evaluate the ROC AUC (area under the ROC curve) metric. Use the following ROC AUC ranges as a rough guide to evaluate your prediction model.

0.9-1 – Overly good (Suspicious case)

0.8-0.9 – Excellent model

0.7-0.8 – Good model

Below 0.6 – Useless model

You can compare similar models with respect to these parameters!
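For intuition about what the ROC AUC number measures: it equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it directly from that definition in plain Python. The function name and sample scores are made up, and the pairwise loop is only practical for small datasets.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half. Labels are 1 (positive) or 0 (negative).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative row")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up known labels and model scores (predicted probability of survival).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

In this example, eight of the nine positive/negative pairs are ranked correctly, giving an AUC of about 0.889. A model that scores at random would land near 0.5.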

Step 11: Deploying a machine learning model

Now that you have a predictive analytics ML model that is performing well, it is ready to be deployed on the web. Here are the steps to do it:

Select the Set up Web service option, located beside the Run option.
Now, select the Retraining web service option. Here, the input and output nodes are added automatically to your model.
Click the RUN button.
Select the Deploy Web service option. Here you will see two options, Deploy Web service [Classic] and Deploy Web service [New].
Selecting either one will deploy your predictive analytics model on the web. You can now use it as a predictive web service.

The users on Azure can now send input data to your model and get predictions. You can update, retrain and deploy new versions of the model anytime from the experiment canvas.
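Once deployed, the model is called over HTTP. The sketch below shows how a client might build such a request with Python’s standard library. The URL and API key are placeholders, and the JSON layout (`Inputs`, `input1`, `ColumnNames`, `Values`) is an assumption of this sketch; the sample request shown on your own web service’s page is the authority on the exact shape. The request is only built here, not sent.

```python
import json
import urllib.request

# Both values are placeholders; copy the real ones from your web service's page.
SERVICE_URL = "https://example.invalid/score"
API_KEY = "YOUR-API-KEY"


def build_request(passenger):
    """Build (but do not send) a POST request carrying one passenger as JSON."""
    payload = {
        "Inputs": {
            "input1": {
                "ColumnNames": list(passenger.keys()),
                "Values": [[str(v) for v in passenger.values()]],
            }
        },
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        SERVICE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )


request = build_request({"Pclass": 3, "Sex": "male", "Age": 22, "Fare": 7.25})
body = json.loads(request.data)
print(request.get_method(), body["Inputs"]["input1"]["ColumnNames"])
# To actually call a real service: urllib.request.urlopen(request).read()
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing the client at a live endpoint.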

There, you did it! You just built, tested, iterated on and deployed a predictive analytics ML model on the web using Azure.

In the upcoming tutorials, we will see some advanced uses of Azure for Machine Learning and predictive analytics. Keep building the prediction engines till then. Cheers!
