We have created an archive of data sets for you to use to practise and improve your skills as a data scientist. This is the final part of a 3-part blog series, so look out for the other parts if you haven't read them yet.
The data sets in this repository span a range of themes, difficulty levels, sizes and attributes, and are categorised accordingly, so there is something suitable for everyone.
They let you challenge your knowledge and get hands-on practice in areas including, but not limited to, exploratory data analysis, data visualisation, data wrangling and machine learning.
We recommend you test yourself with all the distinct data sets we’ve provided. Feel free to use them in any way you wish.
1) Predict acceptability of a car
Level: Beginner
Recommended Use: Classification Models
Domain: Automobile
Click here for: Dataset
The data set has 1,728 rows and 7 columns. Six variables, such as “Buying Price”, “Maintenance” and “Safety”, describe car attributes relating to price and technology.
Each of the 6 variables takes one of several categorical values. The car’s acceptability, the seventh attribute, is the outcome variable.
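Because every variable here is categorical, a common first step before fitting a classifier is to turn the labels into numbers. The snippet below is a minimal sketch of that idea in plain Python. The attribute names, the level names and the two toy rows are assumptions made up for illustration, so check the actual file for the real labels.

```python
# Hypothetical ordered levels for two of the six attributes.
# These names are assumptions; the real file may label them differently.
LEVELS = {
    "buying_price": ["low", "med", "high", "vhigh"],
    "safety": ["low", "med", "high"],
}


def encode_row(row):
    """Replace each categorical label with its position in the ordered list."""
    return {name: LEVELS[name].index(value) for name, value in row.items()}


# Toy rows invented for illustration, not taken from the data set.
rows = [
    {"buying_price": "vhigh", "safety": "low"},
    {"buying_price": "low", "safety": "high"},
]

encoded = [encode_row(r) for r in rows]
print(encoded)
```

An ordinal encoding like this keeps the natural order of the levels. If you would rather not impose an order, one-hot encoding is the usual alternative.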
2) Predict seminal quality of an individual
Level: Beginner
Recommended Use: Regression/Classification Models
Domain: Healthcare/Life
Click here for: Dataset
This data set has 10 attributes. It includes semen samples from 100 volunteers, analysed according to the WHO 2010 criteria.
It can be used to determine whether a diagnosis can be reached without a laboratory approach, which involves expensive tests that are sometimes uncomfortable for patients.
The attributes in this data set can be collected easily using a questionnaire and used to estimate sperm concentration.
3) Find patterns from spending data at wholesale
Level: Intermediate
Recommended Use: Classification/Clustering
Domain: Business/Retail
Click here for: Dataset
This data set has 440 rows and 8 columns. The data refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.
4) Group similar travel reviews
Level: Intermediate
Recommended Use: Clustering/Classification Models
Domain: Web
Click here for: Dataset
This data set, populated by crawling TripAdvisor.com, has 980 rows and 11 columns. It includes reviews of destinations across East Asia in 10 categories.
Each traveller rating is mapped as Excellent (4), Very Good (3), Average (2), Poor (1) or Terrible (0), and the average rating per category is recorded for each user.
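To make the rating scheme concrete, here is a small sketch of how labelled reviews could be turned into per-user, per-category averages. The score mapping comes from the description above. The function name, the user and category names and the sample reviews are hypothetical, and the published file already contains the averaged values, so this only illustrates the idea.

```python
from collections import defaultdict

# Label-to-score mapping as described for this data set.
RATING_SCORE = {
    "Excellent": 4,
    "Very Good": 3,
    "Average": 2,
    "Poor": 1,
    "Terrible": 0,
}


def average_ratings(reviews):
    """Return {(user, category): mean score} for (user, category, label) triples."""
    totals = defaultdict(lambda: [0, 0])  # running [sum, count] per key
    for user, category, label in reviews:
        entry = totals[(user, category)]
        entry[0] += RATING_SCORE[label]
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Invented sample reviews, for illustration only.
reviews = [
    ("user_1", "museums", "Excellent"),
    ("user_1", "museums", "Average"),
    ("user_1", "beaches", "Poor"),
]

averages = average_ratings(reviews)
print(averages)
```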
5) Relate returns of Istanbul Stock Exchange with other international indices
Level: Intermediate
Recommended Use: Regression/Classification Models
Domain: Business/Finance
Click here for: Dataset
This data set has 536 rows and 9 columns. It includes returns of the Istanbul Stock Exchange (ISE) alongside those of seven other international indices: SP, DAX, FTSE, NIKKEI, BOVESPA, MSCI_EU and MSCI_EM.
It can be used to find a predictive relationship between the ISE100 and other international stock market indices.
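One simple way to start looking for such a relationship is an ordinary least-squares fit of one index's returns against another's. The sketch below does this in plain Python for a single predictor. The function name and the return values are invented for illustration and are not taken from the data set; in practice you would load the real columns and probably use several indices at once.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy daily returns, made up for illustration only.
other_index = [0.010, -0.020, 0.015, 0.000, -0.005]
ise = [0.5 * r + 0.001 for r in other_index]  # constructed to be exactly linear

slope, intercept = fit_line(other_index, ise)
print(round(slope, 6), round(intercept, 6))
```

Real returns will not line up this neatly, so it is worth checking the fit on held-out rows before reading much into the slope.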
6) Predict bike rental count (hourly/daily) based on the environmental and seasonal settings
Level: Intermediate
Recommended Use: Regression Models
Domain: Social
Click here for: Dataset
This data set, consisting of 17,379 rows and 17 columns, contains the hourly and daily count of rental bikes between 2011 and 2012 in the Capital bike-share system with the corresponding weather and seasonal information.
Bike-sharing rental activity is highly correlated with environmental and seasonal settings.
7) Detect Room Occupancy through Light, Temperature, Humidity and CO2 sensors
Level: Intermediate
Recommended Use: Classification Models
Domain: Energy/Buildings
Click here for: Dataset
This data set has 20,560 rows and 7 attributes. It provides experimental data for binary classification (whether an office room is occupied) based on Temperature, Humidity, Light and CO2 readings. Ground-truth occupancy was obtained from time-stamped pictures taken every minute.
8) Estimate whether a person’s income exceeds $50K/year
Level: Intermediate
Recommended Use: Classification Models
Domain: Social/Government
Click here for: Dataset
This data set was extracted from the census bureau database. It has 48,842 instances and 15 attributes, which include age, sex, education level and other relevant details of a person.
9) Predict the number of shares on social networks
Level: Advanced
Recommended Use: Regression/Classification Models
Domain: Business/Web
Click here for: Dataset
This data set has 39,644 rows and 61 columns. It summarises a heterogeneous set of features about articles published by Mashable over a period of 2 years and can be used to predict the number of shares an article receives on social networks.
10) Amazon Product Reviews Data
Level: Advanced
Recommended Use: Text Analytics
Domain: E-commerce
Click here for: Dataset
This data set includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand and image features) and links (also viewed/also bought graphs). It is probably best suited to sentiment-analysis tasks.