Fascinating Data Sets to improve your Data Science skills | Part-3

We have created an archive of Data Sets for you to practise with and improve your skills as a Data Scientist. This is a 3-part blog series and this post is the final part, so look out for the other two.

This repository covers a range of themes, difficulty levels, sizes and attributes. The data sets are categorised along those lines, which makes the collection suitable for everyone.

They let you challenge your knowledge and get hands-on practice in areas including, but not limited to, exploratory data analysis, data visualisation, data wrangling and machine learning.

We recommend you test yourself with all the distinct data sets we’ve provided. Feel free to use them in any way you wish.

1) Predict acceptability of a car

Level: Beginner

Recommended Use: Classification Models

Domain: Automobile

Click here for: Dataset

The data set has 1,728 rows and 7 columns. Six of the columns describe car attributes, such as price and technology, through variables like “Buying Price”, “Maintenance” and “Safety”.

Each of the 6 variables takes one of several categorical levels. The car’s acceptability, the seventh attribute, is the outcome variable.
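
Because every input here is categorical, a common first step is to turn each level into a number before fitting a classifier. The sketch below shows one way to do that in plain Python. The column names, level names, level orderings and sample rows are illustrative assumptions, not values taken from the data set, so check them against the actual file.

```python
# Assumed level orderings for three of the six inputs (hypothetical names).
LEVELS = {
    "buying": ["low", "med", "high", "vhigh"],
    "maint": ["low", "med", "high", "vhigh"],
    "safety": ["low", "med", "high"],
}


def encode_row(row):
    """Replace each categorical level with its position in the assumed ordering."""
    return [LEVELS[column].index(row[column]) for column in LEVELS]


# Made-up rows, for illustration only.
sample_rows = [
    {"buying": "vhigh", "maint": "med", "safety": "low"},
    {"buying": "low", "maint": "low", "safety": "high"},
]

encoded = [encode_row(row) for row in sample_rows]
print(encoded)
```

An ordinal encoding like this only makes sense when the levels have a natural order. For unordered categories, one-hot encoding is usually the safer choice.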

2) Predict seminal quality of an individual

Level: Beginner

Recommended Use: Regression/Classification Models

Domain: Healthcare/Life

Click here for: Dataset

This data set has 10 attributes. It includes semen samples from 100 volunteers, analysed according to the WHO 2010 criteria.

It can be used to determine if it’s possible to reach a diagnosis without a laboratory approach, which includes expensive tests that are sometimes uncomfortable for the patients. 

The attributes in this data set can be collected easily with a questionnaire and then used to estimate sperm concentration.

3) Find patterns from spending data at wholesale

Level: Intermediate

Recommended Use: Classification/Clustering

Domain: Business/Retail

Click here for: Dataset

This data set has 440 rows and 8 columns. The data refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

4) Group similar travel reviews

Level: Intermediate

Recommended Use: Clustering/Classification Models

Domain: Web

Click here for: Dataset

This data set, populated by crawling TripAdvisor.com, has 980 rows and 11 columns. It includes reviews of destinations across East Asia in 10 categories.

Each traveller rating is mapped as Excellent (4), Very Good (3), Average (2), Poor (1) and Terrible (0), and the value stored for each category is the user’s average rating in that category.
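
To make the rating scheme concrete, here is a minimal sketch of how per-user, per-category averages could be computed from raw review labels. The user IDs, category names and review rows are hypothetical. The data set itself already ships with the averaged values, so this only illustrates the mapping described above.

```python
from collections import defaultdict

# Label-to-score mapping as described in the text.
RATING_SCORE = {
    "Excellent": 4,
    "Very Good": 3,
    "Average": 2,
    "Poor": 1,
    "Terrible": 0,
}

# Hypothetical raw reviews: (user, category, rating label).
reviews = [
    ("user_1", "museums", "Excellent"),
    ("user_1", "museums", "Average"),
    ("user_1", "beaches", "Poor"),
    ("user_2", "museums", "Very Good"),
]


def average_ratings(rows):
    """Return the mean numeric score for every (user, category) pair."""
    scores = defaultdict(list)
    for user, category, label in rows:
        scores[(user, category)].append(RATING_SCORE[label])
    return {key: sum(values) / len(values) for key, values in scores.items()}


averages = average_ratings(reviews)
for (user, category), score in averages.items():
    print(user, category, score)
```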

5) Relate returns of Istanbul Stock Exchange with other international indices

Level: Intermediate

Recommended Use: Regression/Classification Models

Domain: Business/Finance

Click here for: Dataset

This data set has 536 rows and 9 columns. It includes returns of the Istanbul Stock Exchange (ISE) alongside seven other international indices: SP, DAX, FTSE, NIKKEI, BOVESPA, MSCI_EU and MSCI_EM.

It can be used to find a predictive relationship between the ISE100 and other international stock market indices.
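
A simple starting point before any modelling is to measure how strongly two return series move together. The sketch below computes a Pearson correlation coefficient by hand. The return values are made up for illustration and the variable names are assumptions, so substitute the real columns once the file is loaded.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Made-up daily returns, for illustration only.
ise_returns = [0.010, -0.020, 0.015, 0.000, -0.010]
dax_returns = [0.008, -0.015, 0.010, 0.002, -0.012]

r = pearson(ise_returns, dax_returns)
print(round(r, 3))
```

A high correlation only shows that the series move together. It does not by itself establish that one index predicts the other, so a regression on lagged returns would be a natural next step.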

6) Predict bike rental count (hourly/daily) based on the environmental and seasonal settings

Level: Intermediate

Recommended Use: Regression Models

Domain: Social

Click here for: Dataset

This data set, consisting of 17,379 rows and 17 columns, contains the hourly and daily count of rental bikes between 2011 and 2012 in the Capital bike-share system with the corresponding weather and seasonal information. 

The bike-sharing rental process is highly correlated with environmental and seasonal settings.
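
The hourly and daily counts mentioned above are two views of the same rentals, and the daily view can be rebuilt by summing the hourly one. Here is a minimal sketch of that aggregation. The records and the tuple layout are hypothetical and do not reflect the actual column names in the file.

```python
from collections import defaultdict

# Hypothetical hourly records: (date, hour, rental count).
hourly = [
    ("2011-01-01", 0, 16),
    ("2011-01-01", 1, 40),
    ("2011-01-02", 0, 17),
    ("2011-01-02", 1, 17),
    ("2011-01-02", 2, 9),
]


def daily_totals(rows):
    """Sum hourly rental counts into one total per date."""
    totals = defaultdict(int)
    for date, _hour, count in rows:
        totals[date] += count
    return dict(totals)


daily = daily_totals(hourly)
print(daily)
```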

7) Detect Room Occupancy through Light, Temperature, Humidity and CO2 sensors

Level: Intermediate

Recommended Use: Classification Models

Domain: Energy/Buildings

Click here for: Dataset

This data set has 20,560 rows and 7 attributes. It provides experimental data for binary classification (whether an office room is occupied) from Temperature, Humidity, Light and CO2 readings. Ground-truth occupancy was obtained from time-stamped pictures taken every minute.

8) Estimate whether a person’s income exceeds $50K/year

Level: Intermediate

Recommended Use: Classification Models

Domain: Social/Government

Click here for: Dataset

This data set was extracted from the census bureau database. It contains 48,842 instances and 15 attributes, which include age, sex, education level and other relevant details of a person.

9) Predict the number of shares on social networks

Level: Advanced

Recommended Use: Regression/Classification Models

Domain: Business/Web

Click here for: Dataset

This data set has 39,644 rows and 61 columns. It summarises a heterogeneous set of features about articles published by Mashable over a period of 2 years and can be used to predict the number of shares an article receives on social networks.

10) Amazon Product Reviews Data

Level: Advanced

Recommended Use: Text Analytics

Domain: E-commerce

Click here for: Dataset

This data set includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand and image features) and links (also viewed/also bought graphs). It is well suited to sentiment analysis tasks.
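
For sentiment analysis, the star rating is often used as a stand-in label for the review text. The sketch below shows one possible way to derive such labels. The thresholds, field names and sample reviews are assumptions made for illustration and are not part of the data set’s documentation.

```python
def sentiment_label(stars):
    """Map a star rating to a coarse sentiment class (assumed thresholds)."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


# Hypothetical reviews, for illustration only.
sample_reviews = [
    {"rating": 5, "text": "Works perfectly."},
    {"rating": 3, "text": "Does the job, nothing special."},
    {"rating": 1, "text": "Broke after two days."},
]

labelled = [(r["text"], sentiment_label(r["rating"])) for r in sample_reviews]
for text, label in labelled:
    print(label, "-", text)
```

Treating 3-star reviews as neutral is a judgment call. Some projects drop them entirely to get a cleaner two-class problem.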
