We have created an archive of data sets for you to use, to practise and improve your skills as a Data Scientist. This will be a 3-part blog series, so look out for the other parts.
This repository covers a range of themes, difficulty levels, sizes and attributes. The data sets are categorised along those lines, so there is something suitable for everyone.
They let you challenge your knowledge and get hands-on practice in areas including, but not limited to, exploratory data analysis, data visualisation, data wrangling and machine learning.
We recommend you test yourself with all the distinct data sets we’ve provided. Feel free to use them in any way you wish.
1) Find out the age of Abalone from physical measurements
Level: Beginner
Recommended Use: Regression Models
Domain: Environment
Click here for: Dataset
2) Predict a student’s knowledge level
Level: Beginner
Recommended Use: Classification/Clustering
Domain: Education/Web
Click here for: Dataset
This data set has 403 rows and 6 columns. It is a real data set about the students’ knowledge status on the subject of Electrical DC Machines.
3) Can you predict the fuel-efficiency of a car?
Level: Intermediate
Recommended Use: Regression Models
Domain: Automobiles
Click here for: Dataset
This data set has 398 rows and 9 columns. It provides mileage, horsepower, model year and other technical specifications for cars.
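As a rough illustration of what a regression model does with data like this, here is a minimal sketch of a one-feature least-squares fit in plain Python. The `fit_line` helper and the four horsepower/mileage pairs are invented for illustration and are not taken from the data set; a real analysis would load the actual rows and would probably use several features.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy values, made up for this example only (not real measurements).
horsepower = [50, 100, 150, 200]
mpg = [40, 30, 20, 10]

slope, intercept = fit_line(horsepower, mpg)

# Predict mileage for a hypothetical car with 120 horsepower.
predicted = slope * 120 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predicted={predicted:.2f}")
```

With the real data set, the same idea extends to multiple columns, at which point a library regression model is usually the more practical choice.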
4) Was that chest pain an indicator of heart disease?
Level: Intermediate
Recommended Use: Classification Models
Domain: Health Sciences
Click here for: Dataset
This data set provides health examination data for 303 patients who presented with chest pain and might have been suffering from heart disease. It has 14 attributes, which can be used to predict whether a patient was diagnosed with heart disease.
5) Predict the total daily demand for orders
Level: Intermediate
Recommended Use: Regression Models
Domain: Business
Click here for: Dataset
This intermediate-level data set has 60 rows and 13 columns. The data was collected over 60 days from a real database at a Brazilian logistics company. It has twelve predictive attributes and one target, the total number of orders for daily treatment.
6) Find out if a donor will give blood in March 2007
Level: Intermediate
Recommended Use: Classification Models
Domain: Business
Click here for: Dataset
This data set has 748 instances and 5 attributes. The data comes from the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. The centre drives its blood transfusion service bus to a university in Hsin-Chu City to collect donated blood about every 3 months.
7) Forecast pollution level of a city
Level: Intermediate
Recommended Use: Regression Models
Domain: Environment
Click here for: Dataset
This data set has 43,824 rows and 13 columns. It contains the PM2.5 data from the US Embassy in Beijing. Meteorological data from Beijing Capital International Airport is also included. The data set can be used for pollution level forecasting using the Air Quality attributes provided. It will also offer experience in Multivariate Time Series Forecasting.
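One common way to approach a forecasting task like this is to reframe the series as a supervised learning problem, where the previous few readings become the features and the next reading becomes the target. The sketch below shows that idea for a single series in plain Python. The `make_lagged_rows` helper and the sample readings are hypothetical and invented for illustration; the real data set is multivariate, so each lagged row would carry several columns rather than one.

```python
def make_lagged_rows(series, n_lags):
    """Turn a sequence into (features, target) pairs.

    Each target is one reading; its features are the n_lags readings
    that came immediately before it.
    """
    rows = []
    for i in range(n_lags, len(series)):
        features = list(series[i - n_lags:i])
        target = series[i]
        rows.append((features, target))
    return rows


# Hypothetical hourly readings, made up for this example only.
readings = [10, 12, 15, 11, 9, 14]

pairs = make_lagged_rows(readings, n_lags=3)
for features, target in pairs:
    print(features, "->", target)
```

The resulting pairs can then be fed to any regression model. Keep the rows in time order when splitting into training and test sets, so the model is never evaluated on readings that precede its training data.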
8) Will the patient survive for at least one year after a heart attack?
Level: Intermediate
Recommended Use: Classification Models
Domain: Health Sciences
Click here for: Dataset
This data set has 132 rows and 12 columns. It can be used to classify whether patients will survive for at least one year after a heart attack. All patients listed in the data set suffered heart attacks at some point in the past. Some are still alive and some are not.
9) Detect Autistic Spectrum Disorder (ASD) Cases
Level: Advanced
Recommended Use: Classification Models
Domain: Healthcare/Social Sciences
Click here for: Dataset
This advanced-level data set contains Autistic Spectrum Disorder (ASD) screening test data for 704 adults. It has 21 attributes, including test takers’ demographics and their answers to 10 screening questions. The ASD status of each test taker is recorded under the Class/ASD variable.
10) Estimate the probability of Default
Level: Advanced
Recommended Use: Classification Models
Domain: Business/Finance
Click here for: Dataset
This data set has 30,000 rows and 24 columns. It can be used to estimate the probability that a credit card client will default on a payment.