We get it! You’re curious about Machine Learning algorithms.
You’ve been following the latest trends and tracking the recent applications.
Also, we get that you’re lost.
You’re overwhelmed with information, flooded by Machine Learning algorithms, and struggling to find your feet. It’s like staring at 1,000 pieces of a jigsaw puzzle with no clue where to start!
Luckily, we know exactly where you should start. Here’s a little listicle covering the machine learning algorithms that are popular and a must-learn for the Sexiest Job of the 21st century!
What’s Machine Learning got to do with Data Science?
Machine learning is a machine’s ability to learn to solve a problem using:
- instructions already fed to it
- previous and similar experiences
Machine learning algorithms use statistical models to achieve this. Large, chunky datasets are everyone’s nightmare. Algorithms make life easier by shifting the burden to the machines.
ML algorithms you can’t miss out on!
How do I pick the right Machine Learning algorithm?
Think of picking the right machine learning algorithm like shopping at a supermarket. You have a list of things you need, and that list is your business problem.
Surely, you wouldn’t go barging your shopping cart into the nearest aisle. You’ll need to understand just how things are categorized at the supermarket. Similarly, you need to understand how we categorize these algorithms.
First, let’s talk about regression.
What is Regression & why do we care?
Regression helps us find out whether a relationship exists between variables or data-sets.
If a problem requires predicting outputs that are continuous and real-valued, it’s a regression problem. For example, if we use height and body stats to predict people’s weights, the predictions can take any real value.
Now let’s talk more about the most helpful machine learning algorithms for a regression problem.
Linear Regression
A linear regression model is one of the most widely used algorithms around.
It assumes a linear relationship between your data-sets and then predicts real values of output based on this assumption.
How does Linear regression work?
In linear regression, we need:
- an independent variable (X)
- a dependent variable (Y)

We then measure the changes in the dependent variable (Y) as we change the input values of the independent variable (X).
This is what a linear regression equation looks like:

Y = β0 + β1(X)

where Y = dependent variable; X = independent variable; β0 = intercept; β1 = coefficient of X
Of course, it’s your choice to use either Simple Linear Regression with one independent variable (X) or Multiple Linear Regression with two or more independent variables (X1, X2, X3…Xn).
Linear regression is most popular in predictive modelling, which forecasts outcomes and helps us understand exactly which factors affect the outcome or the dependent variable.
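Curious what this looks like in code? Here’s a minimal sketch using scikit-learn (our choice of library, not something the algorithm demands); the height/weight numbers are made up purely for illustration:

```python
# A minimal linear regression sketch with scikit-learn.
# The height/weight values below are invented example data.
import numpy as np
from sklearn.linear_model import LinearRegression

heights_cm = np.array([[150], [160], [170], [180], [190]])  # independent variable X
weights_kg = np.array([50, 58, 66, 75, 85])                 # dependent variable Y

model = LinearRegression()
model.fit(heights_cm, weights_kg)

print("intercept (b0):", model.intercept_)
print("coefficient (b1):", model.coef_[0])
print("predicted weight for 175 cm:", model.predict([[175]])[0])
```

The fitted intercept and coefficient are exactly the β0 and β1 from the equation above.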
What’s next after regression?
Sometimes, a machine learning problem requires grouping data into categories. The output changes from actual values to classes. This is where classification comes into picture.
Imagine a human resource screening system wanting to classify candidates into categories of “freshers” and “experienced”, or an identification tool wanting to classify people into “male” and “female”.
Now, our task is classification of data and we need a whole other set of Machine Learning algorithms.
Logistic Regression
Okay, we know what you’re thinking.
What is logistic regression doing in classification instead of regression?
Just like linear regression, logistic regression explores the relationship between independent and dependent variables. This is done by estimating probabilities using a logistic or sigmoid function. Hence the name.
However, it is used more for classification problems than regression ones. Logistic regression goes a step further and uses a nonlinear logistic function to render the final output (Y).
The output belongs to one of two class values, for example, yes (> 0.5) and no (< 0.5). The logistic/sigmoid function itself has a characteristic S-shaped graph.
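To see that 0.5 cut-off in action, here’s a rough sketch with scikit-learn; the toy one-feature dataset is invented just for this example:

```python
# A rough logistic regression sketch on an invented one-feature dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])  # a single feature
y = np.array([0, 0, 0, 1, 1, 1])                          # two classes: no (0) / yes (1)

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns the sigmoid output; values above 0.5 map to class 1
print(clf.predict_proba([[4.5]]))
print(clf.predict([[4.5]]))
```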
Classification And Regression Tree
A Classification And Regression Tree algorithm uses decision trees to make a decision. This decision tree has two parts:
- Classification Tree: decides the class into which the variable falls.
- Regression Tree: predicts the value of the output when it is continuous.
The CART algorithm is just a series of questions. The answer at each stage leads to further questions, and these continue until we reach a terminal node, after which no further questions are possible.
What will my Decision Tree look like?
Where do people use CART?
CART makes decision making simpler! Popular uses of CART are in credit scoring, crime risk assessment, medical diagnosis, and predicting the success of new products or techniques.
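Here’s a minimal CART-style sketch using scikit-learn’s decision tree; the tiny credit-scoring-flavoured numbers are invented for illustration:

```python
# A minimal decision-tree sketch; the dataset is an invented credit-scoring example.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [income_in_thousands, has_existing_loan]
X = [[20, 1], [35, 0], [50, 1], [80, 0], [120, 0]]
y = [0, 0, 1, 1, 1]  # 0 = reject, 1 = approve

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text prints the series of yes/no questions the tree learned
print(export_text(tree, feature_names=["income", "has_loan"]))
```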
Random Forest
We saw how one decision tree works. Now we’ll talk about how we can use multiple decision trees to create a forest, and in turn, a more accurate output.
What would you do if you had to buy a new phone? You’ll probably ask your friends for advice. Friend A might ask you what kind of phone camera you’re looking for, how much you post on social media and suggest a model. Friend B might ask you how many apps you’d want, your storage needs and suggest another option. Every friend would suggest options with some personal bias.
You can make a decision tree for each friend’s suggestion and put together a forest of decision trees. Similarly, as a machine learning algorithm, a random forest uses conditions and rules to predict outcomes. It then tallies the votes for every predicted target and returns the one with the most votes as the final prediction.
Where you missed Random Forests… Random forests work quietly in the background to suggest products on online shopping websites. Banks also use random forests to predict a customer’s repayment behavior.
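A quick sketch of a random forest voting over many trees; the phone-shopping features and labels below are made up for illustration:

```python
# Sketch of a random forest "voting" over many trees on invented phone-choice data.
from sklearn.ensemble import RandomForestClassifier

# features: [camera_megapixels, storage_gb, budget_in_hundreds]
X = [[12, 64, 3], [48, 128, 6], [108, 256, 9], [8, 32, 2], [50, 128, 7]]
y = ["budget", "mid-range", "flagship", "budget", "mid-range"]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# each of the 100 trees votes; the majority class becomes the prediction
print(forest.predict([[64, 128, 8]]))
```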
Naive Bayes Classifier
Let’s say we need to classify 1000 vegetables into carrot, cabbage or other on the basis of whether they’re leafy, red or long. The categories leafy, red and long are also called features.
| Vegetable | Leafy | Red | Long | Total |
|-----------|-------|-----|------|-------|
| Carrot | 400 | 400 | 350 | 500 |
| Cabbage | 0 | 350 | 150 | 300 |
| Other | 100 | 50 | 150 | 200 |
| Total | 500 | 800 | 650 | 1000 |
We can now use the algorithm to know the probability of a vegetable being a carrot, cabbage or other; given that the input vegetable is leafy, red and long.
What’s in a name?
Naive Bayes is based on Bayes’ theorem of conditional probability. It calculates the probability of an event given that another event has already occurred. The algorithm makes the naive assumption that all features are independent of each other. And hence, the name!
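Here’s a minimal naive Bayes sketch on binary leafy/red/long features; the handful of rows below are illustrative only, not the exact counts from the table above:

```python
# A minimal naive Bayes sketch on binary (0/1) features; the rows are invented.
from sklearn.naive_bayes import BernoulliNB

# features: [leafy, red, long]
X = [[0, 1, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y = ["carrot", "carrot", "cabbage", "cabbage", "other", "other"]

nb = BernoulliNB()
nb.fit(X, y)

# probability of each class given a vegetable that is leafy, red and long
print(nb.classes_)
print(nb.predict_proba([[1, 1, 1]]))
```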
KNN
Next, let us explore an algorithm that makes almost no assumptions about the data.
K-nearest neighbors stores the training data and, for any new data point, looks for similar cases in this repository. The algorithm finds and returns the k cases from its repository that are closest to the new point, using a distance function to identify the closest neighbors. The algorithm can be summed up by the proverb:
You are known by the company you keep.
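A quick KNN sketch with scikit-learn; the points and labels are invented, and k is set to 3 just as an example:

```python
# A quick KNN sketch on invented 2-D points.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
y = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest neighbours
knn.fit(X, y)

# the new point is labelled by the majority class among its 3 closest neighbours
print(knn.predict([[5, 5]]))
```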
Support Vector Machines
It gets a little tricky with SVM.
SVM divides data-sets into classes using the concept of hyper-planes. A hyper-plane is simply a subspace of one less dimension than the original n-dimensional space. For example, for a three-dimensional space, a hyper-plane would be two-dimensional.
In a dataset with two features, a hyper-plane would be a line (one-dimensional) dividing the two-dimensional space, splitting the dataset into two classes.
How do you choose the right hyper-plane?
The most accurate hyper-plane is the one where the data points on either side are at a maximum distance from it. This distance is called the margin, and our goal is to maximize it.
The data points closest to the hyper-plane are the most important. They are called support vectors and decide the orientation and position of the hyper-plane. Remember, the wider the margin, the better the chances of classifying new data correctly.
Support Vector Machines are the brains behind:
- face detection systems
- handwriting recognition
- classification of images
- bioinformatics
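Here’s a small SVM sketch with a linear kernel; the two-feature points are made up so the classes stay cleanly separable:

```python
# A small SVM sketch on invented, linearly separable points.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="linear")  # a linear kernel learns a separating hyper-plane (a line here)
svm.fit(X, y)

# the support vectors are the points closest to the hyper-plane
print(svm.support_vectors_)
print(svm.predict([[4, 4]]))
```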
With that, we’ve covered the popular machine learning algorithms under regression and classification. This brings us to clustering.
In clustering, we divide data points in a dataset into clusters such that all data points within one cluster are most similar to each other. Clustering is used to identify structures and groupings in a dataset.
K-means
K-means clusters data points into k clusters based on their similarity. We divide the data among these k clusters according to their features.
Where have you seen K-means used most?
The K-means algorithm is used extensively by search engines to group user searches and minimize the time taken to return web results.
All user searches for the word “apple” would be grouped into clusters of web searches for the fruit and for the company, adding more relevance to search results.
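A toy k-means sketch; the 2-D points below are invented stand-ins for search queries that have already been turned into numeric features:

```python
# A toy k-means sketch on invented 2-D points.
from sklearn.cluster import KMeans

X = [[1, 1], [1.5, 2], [1, 1.5], [8, 8], [8.5, 9], [9, 8]]

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
kmeans.fit(X)

print(kmeans.labels_)           # which of the 2 clusters each point fell into
print(kmeans.cluster_centers_)  # the centre of each cluster
```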
The last two machine learning algorithms are grouped under dimensionality reduction.
Why care about dimensionality reduction?
Most real-world datasets are large, and dealing with all the features of a dataset can become impractical. The higher the number of features, the harder it is to classify the data-set. To add to that, some features are dependent and overlap with each other.
Our much-needed fix, dimensionality reduction, reduces the number of features in an optimal manner so that we have fewer variables to work with.
Linear Discriminant Analysis
Let’s say you have collected data for people across 4 cities (A, B, C and D), with the features being height, weight, individual income and family income.
Some of these features can overlap. We can use LDA to get rid of the duplication. LDA generates a set of new features that minimize the overlap and give us better separation of classes, drawing clearer boundaries around clusters so that the classes are as separated as possible.
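A rough LDA sketch; the height/weight/income numbers and city labels below are invented (and we’ve kept it to three cities to stay short):

```python
# A rough LDA sketch on invented city data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# features: [height, weight, individual income, family income]
X = np.array([
    [170, 65, 30, 60], [172, 68, 32, 65],   # city A
    [160, 55, 20, 40], [162, 58, 22, 45],   # city B
    [180, 80, 50, 90], [182, 82, 55, 95],   # city C
])
y = ["A", "A", "B", "B", "C", "C"]

lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X, y)

# the 4 original features are projected onto 2 new axes that best separate the cities
print(lda.transform(X))
```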
| PCA | LDA |
|-----|-----|
| Finds the component axes that maximize the variance. | Maximizes the component axes for class separation. |
Principal Component Analysis
Our last algorithm for the day is Principal Component Analysis.
PCA generates a new set of features for our dataset, called principal components. The components are arranged in descending order, so the component that captures the maximum variance in the original data comes first.
Every subsequent principal component is orthogonal to the previous ones and attempts to capture the variance not already captured by its predecessors.
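One last sketch, this time for PCA; the numbers are arbitrary and only there to show the ordering of components by explained variance:

```python
# A brief PCA sketch on invented data.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([
    [2.5, 2.4, 0.5], [0.5, 0.7, 1.1], [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.8], [3.1, 3.0, 0.2], [2.3, 2.7, 0.6],
])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# components are ordered by how much of the original variance they capture
print(pca.explained_variance_ratio_)
print(X_reduced)
```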
You made it to the end of the article! We’re digging the Machine Learning enthusiast in you.
We hope we cleared some of your doubts on which Machine Learning algorithms to look out for as an ML newbie. Watch out for this space as we come up with new, interesting ways to help the Data Scientist in you.
Let us know if you have something to share in our comments section below!