Fascinating Data Sets to improve your Data Science skills | Part-2

We have created an archive of data sets that you can use to practise and improve your skills as a data scientist. This is a 3-part blog series, so look out for the other parts. Welcome to Part-2.

This repository covers a range of themes, difficulty levels, sizes and attributes. The data sets are categorised along those lines, so there is something suitable for everyone.

They let you challenge your knowledge and get hands-on practice in areas including, but not limited to, exploratory data analysis, data visualisation, data wrangling and machine learning.

We recommend you test yourself with all the distinct data sets we’ve provided. Feel free to use them in any way you wish.

1) Can you predict the price of a house?

Level: Beginner

Recommended Use: Regression Models

Domain: Real Estate

Click here for: Dataset

With 414 rows and 7 columns related to various attributes of a house, this data set provides the market historical data of real estate valuations which are collected from Sindian Dist., New Taipei City, Taiwan.
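As a starting point for a regression exercise like this one, the sketch below fits a straight line by ordinary least squares using only the Python standard library. The variable names `house_age` and `unit_price` and all of the numbers are made up for illustration; they are assumptions, not columns or rows taken from the data set.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical illustration values, NOT rows from the real data set.
house_age = [1.0, 5.0, 10.0, 20.0, 30.0]
unit_price = [50.0, 46.0, 41.0, 31.0, 21.0]

slope, intercept = fit_line(house_age, unit_price)
score = r_squared(house_age, unit_price, slope, intercept)
print(f"price = {slope:.2f} * age + {intercept:.2f} (R^2 = {score:.3f})")
```

With the real data you would load the file, pick one or more attributes as predictors, and hold out some rows to check how well the fit generalises.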

2) Can you estimate location from Wi-Fi signal strength?

Level: Beginner

Recommended Use: Classification Models

Domain: Mobile/Location

Click here for: Dataset

This beginner-level data set has 2,000 rows and 8 columns. It contains the Wi-Fi signal strength observed on a smartphone from 7 Wi-Fi devices in an indoor space, which can be used to estimate which of four rooms the phone is in.
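One simple way to approach this kind of task is a k-nearest-neighbours classifier: a new reading is assigned the room that is most common among the most similar known readings. The sketch below uses only the standard library. The signal values and room labels are invented for illustration and are not taken from the data set; the function name `knn_predict` is likewise just a hypothetical helper.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Predict a room by majority vote among the k closest training readings.

    train: list of (signal_vector, room) pairs
    query: signal vector with the same length as the training vectors
    """
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(room for _, room in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical readings: 7 signal strengths per row plus a room label.
train = [
    ([-40, -55, -60, -45, -70, -80, -82], 1),
    ([-42, -57, -58, -47, -68, -79, -85], 1),
    ([-60, -40, -62, -70, -50, -75, -78], 2),
    ([-62, -42, -60, -68, -52, -77, -76], 2),
    ([-75, -70, -45, -60, -65, -50, -55], 3),
    ([-73, -72, -47, -62, -63, -52, -53], 3),
    ([-80, -78, -70, -50, -40, -60, -45], 4),
    ([-82, -76, -72, -52, -42, -58, -47], 4),
]

query = [-61, -41, -61, -69, -51, -76, -77]
print(knn_predict(train, query))
```

On the real data you would split the 2,000 rows into training and test sets and compare a few values of `k` by test accuracy.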

3) Estimate compressive strength of concrete

Level: Intermediate

Recommended Use: Regression Models

Domain: Civil Engineering/Construction

Click here for: Dataset

This data set has 1,030 rows and 9 columns. Concrete is the most important material in civil engineering, and its compressive strength is a highly nonlinear function of age and ingredients. The actual compressive strength (MPa) of each mixture at a specific age (days) was determined in a laboratory.

4) Discover patterns relating liver disorder and alcohol consumption

Level: Intermediate

Recommended Use: Classification/Regression/Clustering Models

Domain: Healthcare

Click here for: Dataset

This data set has 345 rows and 7 columns. It does not contain any variable representing the presence or absence of a liver disorder. The first five columns hold the results of various blood tests that may be of use in diagnosing alcohol-related liver disorders.

The sixth represents the number of alcoholic drinks consumed per day by the subject (self-reported).

5) Predict which stock will provide the greatest rate of return

Level: Intermediate

Recommended Use: Clustering/Regression/Classification Models

Domain: Business/Finance

Click here for: Dataset

This data set has 750 rows and 16 columns. It contains weekly data for the Dow Jones Industrial Index, used in computational investing research. Each record holds one week of data for a stock, including the percentage return that stock has in the following week.

Ideally, this could be used to determine which stock will produce the greatest rate of return in the following week.
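One possible way to frame this as a classification problem is to derive, for each week, the stock with the highest following-week return and use it as the label to predict. The sketch below shows that labelling step only. The record layout `(week, stock, next_week_return)`, the ticker names and the numbers are all assumptions made for illustration, not the data set's actual schema or values.

```python
def best_stock_per_week(records):
    """Map each week to the (stock, return) pair with the highest return.

    records: iterable of (week, stock, next_week_return) tuples
    """
    best = {}
    for week, stock, next_week_return in records:
        if week not in best or next_week_return > best[week][1]:
            best[week] = (stock, next_week_return)
    return best


# Hypothetical records; tickers and percentages are invented.
records = [
    ("week-1", "AAA", 1.2),
    ("week-1", "BBB", -0.4),
    ("week-1", "CCC", 2.5),
    ("week-2", "AAA", 0.3),
    ("week-2", "BBB", 0.9),
    ("week-2", "CCC", -1.1),
]

labels = best_stock_per_week(records)
for week, (stock, pct) in labels.items():
    print(week, stock, pct)
```

The remaining columns for each week could then serve as features, with care taken that nothing from the following week leaks into them.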

6) Assess heating and cooling load requirements of buildings

Level: Intermediate

Recommended Use: Regression/Classification Models

Domain: Energy

Click here for: Dataset

This data set has 768 rows and 10 columns. It can be used for assessing the heating load and cooling load requirements of buildings (that is, energy efficiency) as a function of building parameters. 

The buildings differ with respect to the glazing area, the glazing area distribution and the orientation, amongst other parameters.

7) Determine the type of glass using oxide content

Level: Intermediate

Recommended Use: Classification Models

Domain: Physical

Click here for: Dataset

This data set has 214 rows and 10 columns. It provides details about 6 types of glass, defined in terms of their oxide content (e.g. Na, Fe, K).

8) Predict chance of survival

Level: Intermediate

Recommended Use: Classification Models

Domain: Healthcare

Click here for: Dataset

This data set has 155 rows and 20 columns, and provides various attributes of patients suffering from hepatitis. It can be used to predict a patient’s chance of survival, or for other purposes.

9) Predict if a note is genuine

Level: Advanced

Recommended Use: Classification Models

Domain: Banking/Finance

Click here for: Dataset

This advanced-level data set has 1,372 rows and 5 columns. The data was extracted from images of genuine and forged banknote-like specimens, which were taken and digitised for the evaluation of a banknote authentication procedure.

A wavelet transform tool was used to extract features from the images.

10) Produce a short-term forecast of the electricity consumption of a single home

Level: Advanced

Recommended Use: Regression/Clustering Models

Domain: Electricity

Click here for: Dataset

This data set has 2,075,259 rows and 9 columns. It provides measurements of electric power consumption in one household, sampled once per minute over a period of almost 4 years.

Different electrical quantities and some sub-metering values are available.
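With over two million one-minute readings, a common first step before forecasting is to downsample to a coarser interval. The sketch below averages readings per hour using only the standard library. The `(timestamp, kilowatts)` record layout, the use of `None` for a missing reading, the dates and the values are all assumptions for illustration, not the data set's actual format.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def hourly_mean(readings):
    """Average one-minute readings into one value per hour.

    readings: iterable of (datetime, kilowatts) pairs; kilowatts may be
    None for a missing reading (an assumption), which is skipped.
    """
    buckets = defaultdict(list)
    for timestamp, kilowatts in readings:
        if kilowatts is None:
            continue
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(kilowatts)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical data: two hours of one-minute readings with one gap.
start = datetime(2020, 1, 1, 0, 0)
readings = [
    (start + timedelta(minutes=i), 1.0 if i < 60 else 2.0) for i in range(120)
]
readings[10] = (readings[10][0], None)  # simulate a missing reading

hourly = hourly_mean(readings)
for hour, value in hourly.items():
    print(hour.isoformat(), value)
```

The hourly series is much smaller and smoother, which makes it easier to try a simple baseline forecast before moving to heavier models.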
