Data Munging: The Messy Business of Cleaning up Data

Data munging has a bad reputation outside the data science world.

It is assumed to be a dull task that should be delegated to non-experts.

Data scientists, however, are clear about its importance.

They are aware that, even though they may spend the majority of some days cleaning and transforming data, the task is essential to understanding the problem and developing the right solution.

Simply put, the time data scientists spend building their models and improving the performance of their algorithms would be fruitless if the bulk of the time had not been spent on data munging.

In this article, we’ll not only dispel the notion that data munging is a menial task, but also explain the fundamentals of the task and offer a step-by-step guide to some munging functions using a very popular dataset.

So, let’s get started!

  1. What is data munging?
  2. Is data munging messy?
  3. Six steps of data munging
  4. Data Munging Tutorial
    a. Dropping missing values
    b. Filtering data
    c. Grouping data
    d. Time-series data handling
    e. Exporting data

1. What is data munging?

In old computing slang, to ‘mung’ data actually meant to mangle it beyond repair, close to the exact opposite of what the term means in the field today.

In data science, data munging, or data wrangling as it is also known, is a set of steps involved in preparing a dataset for use in solving a problem.

This may involve removing missing values, formatting the dataset, eliminating duplicates, among much else — all with the purpose of making it more suitable for solving the problem at hand.

Let’s say you are building a machine learning algorithm that can predict how university students will fare in their exams, based on study patterns and results of students over the past five years.

Before you begin, you will want to clean up and familiarise yourself with the dataset.

You would, for example, eliminate the records of any students who were absent in a particular year, or convert all data points to the same format (e.g. change months and years to days), or even filter data by state of origin, gender, etc.

In some cases, you would need to change continuous values to categorical values. All of these tasks are part of data munging.
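To make this concrete, here is a minimal Pandas sketch of converting a continuous column into categories (the scores, bins and labels below are invented for illustration):

import pandas as pd

# Hypothetical exam scores on a 0-100 scale
scores = pd.Series([35, 67, 88, 52, 91])

# Bucket the continuous scores into three categories
grades = pd.cut(scores, bins=[0, 50, 75, 100], labels=['fail', 'pass', 'distinction'])
print(grades)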

Data munging is particularly essential when you need to create visualisations of your dataset.

2. Is data munging messy?

It’s clear why data munging is viewed as a mundane task.

It involves a lot of repetition, which may seem like drudgery when there’s more interesting work at hand, such as building a machine learning model!


More experienced professionals do, of course, delegate such work while they themselves take on more challenging stuff, but newbies should gain solid experience in the task.

After all, cleaning up data is a big part of the job and is essential to building a career in this field.

Moreover, to aid in the process, data scientists have a large toolkit of reusable code available which can be combined and run in a number of ways.

Let us spend some time understanding the steps involved in data munging, as well as some of the commonly used munging operations.

3. Six steps of data munging

Business knowledge is a valuable asset while munging or preparing data yourself.

If you don’t already have a good understanding of the data, interact with people who do; it will end up saving you a lot of time.

This is true for even messy tasks like web scraping.

It’s only after you gain a thorough understanding of the business problem that you truly understand how the data needs to be wrangled.


Discovering: During this step, you should develop an understanding of what goes into your raw dataset. Analyse its unique elements, such as outliers and value distribution.

This information will enable you to perform a more educated analysis on the data.

Structuring: Your data comes in different shapes and sizes, and one of the critical decisions to take while data munging is the format into which you want to convert the raw data to make it best suited for visualization and exploration.

The set of actions that one can take include separating, blending and un-nesting.


Cleaning: In this step, we clean the data to rid it of any inconsistencies such as null and misspelled values.

Data such as state abbreviations may also be standardized to a single format to make actions such as filtering easier.
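A tiny hypothetical sketch of that kind of standardization in Pandas (the column and its values are invented for illustration):

import pandas as pd

# Hypothetical column with inconsistently formatted state abbreviations
df = pd.DataFrame({'state': ['ny', 'NY', ' Ny', 'ca']})

# Normalize everything to a single trimmed, uppercase format
df['state'] = df['state'].str.strip().str.upper()
print(df['state'].unique())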


Enriching: Once you have a cleaned dataset, you need to think of ways to add value to it.

For this, think along the lines of what other data might be useful for the analysis.

For example, if you are working on data related to voter turnout, adding a weather dataset would be important for your analysis.


Validating: Here we verify that the data is consistent and of high quality.

If any inconsistencies were missed in the previous steps, we can go back and correct them.


Publishing: After following the above steps in data munging, we can now publish the results to downstream analytics tools (i.e. load the data into specific analysis packages).

Here, we can also plan for future project needs, such as documenting and archiving transformation logic so that it can be reused on other datasets.

The next step after publishing of data would be analytics.

 

4. Data Munging Tutorial: Learning with the Titanic Dataset

To make it easier for beginners to understand data munging, we will demonstrate the process by wrangling the Titanic dataset with the Python library Pandas.

And for the few of you who are unfamiliar with the Titanic disaster: the RMS Titanic sank in 1912 after colliding with an iceberg, in one of the most infamous shipwrecks in history.

Of the estimated 2,224 passengers and crew aboard, just over 700 survived, mainly because there weren’t enough lifeboats for all.

It makes for an interesting machine learning dataset because, as the tragedy was unfolding, the decision was made to save certain groups over others, such as women, children and the upper class.

Armed with this dataset, we will engage in some munging, to investigate different cuts of the sample and also, of course, to find out what data munging is all about.

What is Pandas?

We will be using Pandas, a Python library that is very popular for data munging, to learn about various data transformation techniques and data formats.

We will be working with Pandas dataframes, which are structured as tables.

In these frames, Python code can be used to manipulate rows and columns with ease.

If you want to learn more about Pandas, particularly if you enjoy learning with examples, Julia Evans’ Pandas Cookbook is a good read.

You won’t really need it for this tutorial, though.

What you will need is the ‘train.csv’ dataset, which you can download from the Kaggle website.

You could also simply follow along, but it’s a lot more fun as an interactive experience.

a. Dropping missing values
One of the first transformation techniques for any project is the dropping of missing or null values from the dataset.

Most datasets have missing data points and it is very important to understand how to work with missing data.

If you’re a student of statistics, you may already have come across the term data imputation, which is an important part of data munging, too.

Imputation means replacing missing data points with substituted values (such as the column mean) so that the gaps do not bias your analysis; the simpler alternative, which we use here, is to drop the incomplete rows altogether.

Here, we will try to identify which columns in the ‘train.csv’ file contain missing values and drop those missing values to get clean data.

To do so, you will first import Pandas, read the dataset into a dataframe, and check which columns contain missing values.

You can enter the commands shown in Fig 1 to do so:

Fig 1. Listing missing values
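As a sketch, the commands behind Fig 1 would look something like this (assuming ‘train.csv’ sits in your working directory):

import pandas as pd

# Read the Kaggle Titanic training set into a dataframe
df = pd.read_csv('train.csv')

# For each column, report whether it contains any missing values
print(df.isnull().any())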

You’ll notice that the columns ‘Age’, ‘Cabin’ and ‘Embarked’ in Fig 1 are marked TRUE, which means that these columns contain missing or null values.

Our task is to delete all rows which contain missing values.

In Fig 2 below, you’ll notice that there are missing values marked as ‘NaN’ in the column ‘Cabin’.

These values need to be deleted in the process of data cleaning.

Fig 2. Checking columns

With these values dropped, the table above will contain only complete rows.
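A sketch of the dropping step; dropna() returns a copy of the dataframe with the incomplete rows removed:

# Drop every row that contains at least one missing value
clean_df = df.dropna()

# Or drop rows only where specific columns are missing
clean_df = df.dropna(subset=['Age', 'Cabin', 'Embarked'])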

This technique of data munging is very useful for datasets that are incomplete.

Using incomplete data can bias your analysis or lead you to draw erroneous conclusions.

Another simple example of deleting missing data is when you have a list of stores and the sales they have generated.

Some stores may have given you blank or null values and your task is to calculate the averages for those stores with valid entries.

You will need to again delete the rows with missing data or null values in order to be able to do a reliable data analysis.
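A hypothetical sketch of that store example (the data is invented; note that Pandas’ mean() skips NaN values on its own, but dropping the rows explicitly keeps the cleaning step visible):

import pandas as pd

# Hypothetical store sales with some missing entries
stores = pd.DataFrame({
    'store': ['A', 'B', 'C', 'D'],
    'sales': [250.0, None, 320.0, None],
})

# Keep only stores with valid sales figures, then average them
valid = stores.dropna(subset=['sales'])
print(valid['sales'].mean())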

Knowing how to drop invalid values will strengthen your data analysis skills. Once we’ve deleted the missing values, we can begin playing around with the data.

b. Filtering data

During the exploratory data analysis phase, you may want to filter the data using some specific criteria and arrive at a subset of the original data.

For instance, you may want to filter the data so that you can see only data points where the ‘Age’ of the person is less than 20 or the ‘Sex’ is ‘Male’.

To start with, we will isolate those rows which have passengers less than 20 years of age.

By keying in df[df['Age'] < 20], the data will appear as follows:

Fig 3. Age less than 20

Next, we will isolate the details of those passengers who are female, by keying in df[df['Sex'] == 'female'].

The resulting data table is as follows:

Fig 4. Sex is female

We can also filter data by using multiple conditions.

For example, we can isolate passengers who were both female AND under the age of 20, by keying in df[(df['Age'] < 20) & (df['Sex'] == 'female')]

Fig 5. Females under 20

We can also assign such a filtered group a name, such as ‘femaleunder20’, and then do exploratory data analysis on that specific segment of the population.

For example (as you can see in Fig 6), you would be able to find out how many females under the age of 20 were on the ship, how many of them survived, which cabins they were in, and even what the mean, median and standard deviation of this group were.

Fig 6. Exploring chances of survival for females under 20
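A sketch of this kind of segment analysis, using the column names from the Kaggle dataset (‘femaleunder20’ is simply our own variable name):

# Select female passengers under the age of 20
femaleunder20 = df[(df['Age'] < 20) & (df['Sex'] == 'female')]

# How many such passengers were aboard, and how many survived?
print(len(femaleunder20))
print(femaleunder20['Survived'].sum())

# Mean, standard deviation, quartiles and more for the group
print(femaleunder20.describe())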

Given what munging entails, it should be pretty clear already what the objective is and why it’s essential for the person in charge of finding insights in the data to perform this task.

It should also be clear just how domain knowledge could help you understand the data.

c. Grouping data

A more challenging analysis technique than filtering is grouping data.

For example, let’s say we want to find out if there is any correlation between survival and gender.

In Pandas, the groupby command returns a DataFrameGroupBy object, which has a variety of methods.

Calling value_counts() on a column returns the count of each of its unique values.
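A sketch of grouping survival counts by gender (in the Kaggle dataset, ‘Survived’ is 0 for died and 1 for survived):

# Count survival outcomes within each gender group
print(df.groupby('Sex')['Survived'].value_counts())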

From this grouped output, we can see that the chance of survival was much lower for male passengers than for female passengers.

To discover larger trends, you will want to group data by a larger segment.

For example, in a dataset containing sales data for different stores, you can derive trends city-by-city instead of store-by-store by grouping the data by geographical region.
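A hypothetical sketch of that idea (the ‘sales’ dataframe and its values are invented for illustration):

import pandas as pd

# Hypothetical store-level sales data
sales = pd.DataFrame({
    'store': ['A', 'B', 'C', 'D'],
    'city': ['Boston', 'Boston', 'Austin', 'Austin'],
    'revenue': [1200, 800, 1500, 700],
})

# Aggregate to city level to see the larger trend
print(sales.groupby('city')['revenue'].sum())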

d. Time-series data handling

Time-series data is common in datasets, particularly if you’re required to analyse the stock market, housing prices, the weather, etc.

And given how often business problems involve tracking change over time, you’re going to be working with these types of datasets a lot.

The Titanic dataset itself does not contain timestamps, but imagine that it recorded the time of each event during the sinking.

Using Pandas’ datetime functionality, you could then analyse the number of events that took place by the hour, by the minute, even by the second.
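Since the real dataset lacks timestamps, here is a minimal, hypothetical sketch of the idea (the event log below is invented for illustration):

import pandas as pd

# Hypothetical event log with timestamps
events = pd.DataFrame({
    'timestamp': ['1912-04-15 00:05', '1912-04-15 00:45', '1912-04-15 02:20'],
    'event': ['flooding reported', 'lifeboat lowered', 'ship sank'],
})

# Parse the strings into proper datetime objects
events['timestamp'] = pd.to_datetime(events['timestamp'])

# Count how many events took place in each hour
print(events.set_index('timestamp').resample('h')['event'].count())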

e. Exporting data

Finally, we come to the data exporting function. This is required when you want to share the cleaned and filtered data with a colleague, or anyone else who is not familiar with Pandas but instead uses Excel to analyze data.

The commands used for exporting the result to an Excel sheet are as follows (here, exporting the femaleunder20 dataset):

result = df[(df['Age'] < 20) & (df['Sex'] == 'female')]

result.to_excel('femaleunder20.xlsx')
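Note that to_excel() writes the dataframe index as an extra column by default, and writing .xlsx files requires an Excel engine such as openpyxl to be installed. To leave the index out:

result.to_excel('femaleunder20.xlsx', index=False)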

Conclusion

This is by no means an exhaustive list of functions in data munging.

But we hope it’s given you a good idea of the underlying concepts of data munging.

If we’ve not stressed this enough already: data munging is an important part of any data analysis, and you will want to make sure your data is in good shape and ready for convenient consumption before you apply any machine learning algorithms to it.

By dropping null values, filtering and selecting the right data, and handling time-series data correctly, you can ensure that any machine learning model or other treatment you apply to your cleaned-up data is fully effective.
