Top FAQs from a Machine Learning newbie

April 30, 2020

Are you a Machine Learning novice getting started with your career?

You must have a lot of questions. In this blog, we answer the queries you are likely to have when starting out in this field. Read on to understand what hurdles to overcome and how to crack the world of Machine Learning.

1. What is Machine Learning?

Machine Learning is a branch of Data Science that deals with programming systems to automatically learn and improve with experience. Data is fed to the system, and self-sufficient models automatically learn from that data rather than being explicitly programmed.

2. What are Machine Learning Algorithms?

Machine learning algorithms are procedures that are fed data in order to predict output values. As they continue to receive data, they adjust and improve, learning over time.

Machine learning algorithms are commonly classified into two main groups: supervised and unsupervised.

3. What’s the difference between supervised & unsupervised Machine Learning?

Consider taking a walk in a zoo. Your brain classifies every animal you come across based on previous knowledge and awareness of animal characteristics, features and behaviors.

This is supervised learning. In this case, your brain acts as a supervisor, relating everything you’ve seen to previously acquired knowledge and finally assigning an animal to a certain class. Simply put, supervised learning needs a supervisor.

Now imagine a child is walking in that zoo with you. This is the child’s first visit to a zoo and he has no previous knowledge of the animal kingdom. Now if you were to not help the child, the child would be totally unsupervised when learning about these animals.

If this child were to make sense of what he saw at the zoo that day, he would probably describe the animals as being big or small, having wings or not, having stripes or a plain coat, having a long neck or a long trunk, and so on.

The child would put the animals that have the most in common into the same group. With no one to name the groups for him, the child is completely unsupervised. This is a case of unsupervised learning.
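The zoo analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the animals, the two yes/no features (wings, stripes) and the helper names `classify` and `group` are all made up for this example. The supervised function needs labeled examples to compare against; the unsupervised one only groups animals that look alike and never learns a name for any group.

```python
# Hypothetical zoo data: each animal is described by (has_wings, has_stripes).
labeled = [
    ((1, 0), "bird"),
    ((0, 1), "zebra"),
    ((0, 0), "elephant"),
]

def classify(features):
    """Supervised: return the label of the closest labeled example."""
    def distance(example):
        known_features, _label = example
        return sum((a - b) ** 2 for a, b in zip(known_features, features))
    return min(labeled, key=distance)[1]

def group(animals):
    """Unsupervised: bucket animals with identical features; no labels used."""
    groups = {}
    for name, features in animals:
        groups.setdefault(features, []).append(name)
    return list(groups.values())

# The "adult" recognizes a winged, unstriped animal from prior knowledge.
prediction = classify((1, 0))

# The "child" only notices that animals a and c look alike.
unlabeled = [("a", (1, 0)), ("b", (0, 1)), ("c", (1, 0))]
clusters = group(unlabeled)
```

Real unsupervised algorithms group by similarity rather than exact matches, but the idea is the same: structure is found without anyone supplying the answers.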

4. What’s the difference between classification and regression?

Regression helps us find out whether, and how, the variables in a dataset are related.

If a problem requires predicting outputs that are continuous and real-valued, it’s a regression problem. For example, if we use height and body stats to predict people’s weights, each prediction can take any real value.
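As a rough sketch of the height-and-weight example, the snippet below fits a straight line with ordinary least squares. The four data points are invented for illustration, and `fit_line` and `predict_weight` are hypothetical helper names, not part of any library.

```python
from statistics import mean

# Hypothetical measurements, made up for this example.
heights_cm = [150, 160, 170, 180]
weights_kg = [52, 58, 71, 79]

def fit_line(xs, ys):
    """Ordinary least squares for a single input variable."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(heights_cm, weights_kg)

def predict_weight(height_cm):
    """The prediction is a real number, not a category."""
    return slope * height_cm + intercept
```

Calling `predict_weight(175)` returns a continuous value between the weights seen for 170 cm and 180 cm, which is what makes this a regression problem.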

Sometimes, a machine learning problem requires grouping data into categories. The output changes from real values to classes. This is where classification comes into the picture.

Imagine a human resource screening system that classifies candidates into “freshers” and “experienced”, or an identification tool that classifies people into “male” and “female”.
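Here is a minimal sketch of the screening example, assuming candidates are described only by their years of experience. It uses a nearest-centroid classifier, one simple technique among many; the training rows and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical training rows: (years_of_experience, label).
training = [
    (0, "fresher"),
    (1, "fresher"),
    (3, "experienced"),
    (6, "experienced"),
]

def fit_centroids(rows):
    """Compute the average years of experience for each class."""
    by_label = {}
    for years, label in rows:
        by_label.setdefault(label, []).append(years)
    return {label: mean(values) for label, values in by_label.items()}

def classify(years, centroids):
    """Assign the class whose average is closest to the candidate."""
    return min(centroids, key=lambda label: abs(years - centroids[label]))

centroids = fit_centroids(training)
```

Unlike regression, `classify(4, centroids)` can only ever return one of the known class names.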

5. What is meant by ‘Training set’ and ‘Test Set’?

Training Data: This is the part of the data fed to your model to train it. The model observes and learns from the input and output data.

Test Data: This is the part of the dataset held back from training. The trained model generates output for it based on what it has learned from the training data, and comparing that output with the known answers gives you an unbiased estimate of how effective your model is.
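One way to produce the two sets is to shuffle the rows and cut off a fraction for testing. The sketch below uses only the standard library; the function name, the 20% test fraction and the fixed seed are assumptions chosen for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # stand-in for ten rows of a real dataset
train, test = train_test_split(data)
```

Every row lands in exactly one of the two sets, so the model is never evaluated on data it has already seen.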

6. What kind of projects can I undertake in Machine Learning?

You can find many project ideas online, but the best advice is to pick a dataset entirely of your own choice. Broadly, a project can take one of these forms:

Investigate a property of a Machine Learning tool or library.
Investigate the behavior of a Machine Learning algorithm.
Investigate and characterize a data set or Machine Learning problem.
Implement a Machine Learning algorithm in your favorite programming language.

7. Where to find sources for your data?

These are some platforms from where you can source your data:

Kaggle
All things Google are the best, aren’t they? Kaggle is Google’s online platform for data scientists and machine learning buffs. The vast online community is supportive and talented, learning as they communicate. You can find a variety of datasets on Kaggle for your project, and abundant help with them too!
UCI Repository
The UCI Repository collects databases, data generators and domain theories. The repository currently has around 474 datasets to work on, and machine learning students have long trusted it as a place to explore data.
Data.world
Data.world is another open data source for budding Data Scientists. You can use the platform to source, copy, modify, analyze and download data to work on your own machine learning projects.

These are a few sources where you can look for your desired dataset. Remember, there is no best dataset or machine learning algorithm. You need to practice on a wide range, and you can get started here!

8. How to approach Machine Learning projects?

For any given project, we have identified 7 steps that you must follow, like an outline, to ace machine learning projects. Let’s find out what these steps are.

Step 1: Data Reading
Step 2: Pre-processing
Step 3: Data Normalization
Step 4: Data Standardization
Step 5: Data Split
Step 6: Apply the ML Algorithm
Step 7: Evaluate the performance of your model
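Steps 3 and 4 are easy to mix up, so here is a small sketch of the usual definitions: normalization as min-max rescaling into the range 0 to 1, and standardization as the z-score (mean 0, standard deviation 1). The function names and the sample column are assumptions for illustration, and both functions assume the values are not all identical.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max normalization: rescale values into the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Z-score standardization: mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [2, 4, 6, 8]  # hypothetical feature column
normalized = normalize(ages)
standardized = standardize(ages)
```

Many practitioners compute the minimum, maximum, mean and standard deviation from the training portion only, so that nothing about the test data leaks into the model. In that case the split in Step 5 is done before the scaling.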

9. What is ‘overfit’ and ‘underfit’?

Overfit: Overfitting in machine learning models happens when the model is a tad too well trained for its own good. The model loses its ability to generalize. It picks up the training data along with its noise and fluctuations, and then performs poorly on new, fresh datasets.
Underfit: Underfitting is also not ideal. Underfit models do not learn properly from the training data and consequently cannot apply their learning elsewhere. They fail to capture the relationship between the input and the output, and you might need to restart with a different algorithm altogether.
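The difference shows up when you compare the error on training data with the error on unseen data. The toy sketch below assumes a made-up dataset where the output is roughly twice the input, and three hand-rolled models: one that memorizes the training points (overfit), one that always answers with the average (underfit), and a straight line fitted by least squares as a reasonable middle ground. All names and numbers are hypothetical.

```python
from statistics import mean

# Hypothetical data: y is roughly 2 * x, with some noise in the training set.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
test = [(1.4, 2.8), (2.6, 5.2), (3.4, 6.8)]

def mse(model, rows):
    """Mean squared error of a model over (input, output) rows."""
    return mean([(model(x) - y) ** 2 for x, y in rows])

def memorizer(x):
    # Overfit: repeat the output of the closest training point, noise included.
    return min(train, key=lambda row: abs(row[0] - x))[1]

def constant(x):
    # Underfit: ignore the input and always answer with the average output.
    return mean([y for _, y in train])

def fit_line(rows):
    """Least-squares straight line through the rows."""
    mean_x = mean([x for x, _ in rows])
    mean_y = mean([y for _, y in rows])
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return lambda x: slope * x + (mean_y - slope * mean_x)

line = fit_line(train)

for name, model in [("memorizer", memorizer), ("constant", constant), ("line", line)]:
    print(name, "train:", round(mse(model, train), 3), "test:", round(mse(model, test), 3))
```

The memorizer is perfect on the training data yet worse than the line on the test data, while the constant model is poor on both.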

10. What is Dimensionality Reduction in Machine Learning?

Many datasets are large, and dealing with all the features of a dataset can become impractical. The higher the number of features, the harder it is to work with the dataset. To add to that, some features depend on and overlap with each other.

Our much-needed fix, dimensionality reduction, reduces the number of features in an optimal manner so that we have fewer variables to work with.
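One simple way to deal with overlapping features is to drop any feature that is almost perfectly correlated with one you have already kept. The sketch below shows that idea only; it is not the only approach (Principal Component Analysis is a common alternative). The feature names, the sample values and the 0.95 threshold are assumptions made for this example.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length columns."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread = sqrt(sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b))
    return covariance / spread

def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with a kept one."""
    kept = {}
    for name, column in features.items():
        if all(abs(pearson(column, other)) < threshold for other in kept.values()):
            kept[name] = column
    return kept

heights_cm = [150, 160, 170, 180, 190]
features = {
    "height_cm": heights_cm,
    "height_in": [h / 2.54 for h in heights_cm],  # same information, other unit
    "weight_kg": [70, 55, 80, 60, 75],
}
reduced = drop_redundant(features)
```

Here `height_in` carries no information beyond `height_cm`, so it is dropped and three features become two.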

These are some of the basic questions that you need to understand before delving into Machine Learning. We hope your problems got solved, but if you have any more questions about Machine Learning, comment below.