Getting your Dataset ready for Machine Learning

It is important to have a good dataset in addition to a large number of instances to get big results. In this blog post, I am going to focus on preprocessing your raw dataset with python to a format that is ML-friendly.

For now, I am assuming you have a large dataset to play around with.

If you are just starting with machine learning, here are some great public datasets you can use to familiarize yourself with ML concepts.

Credit Card default dataset: Can you establish the credibility of a credit card holder? An interesting dataset with over 30,000 records and 24 columns contains information like their age, gender, etc about the credit card holders.
Census data: Can you predict whether a person makes over 50k in a year? Census data is a dataset containing over 40000 records and 14 attributes with information about their education, age, work-hours, etc.
Hepatitis dataset: Predict the chances of survival of a patient suffering from Hepatitis. This dataset is relatively smaller with 155 records and 19 attributes and contains information about bloodwork, and other symptoms related to this disease.

Now we have our raw dataset, now on to perfecting it. A perfect dataset is like a unicorn of sorts, but thankfully it is neither rare nor elusive. A fine dataset can easily be created with a bit of preprocessing and a bit of understanding what you need from the dataset you have.

If you are working with python, the scikit package is a great package to use for preprocessing your data. If you don’t have scikit, no worries. Just install it by copy-pasting the following in your terminal,

1
pip install scikit-learn

Next, import the package to your code with,

1
from sklearn import preprocessing

Now, we are set to clean our data, A good way to go about this is to quantify the quality of the data. For this, I am taking the weather dataset as an example, print out the total missing values along with the percentage. This gives you a better idea on which attributes you should be dropping and focusing on.

1
Total Rows: 142193
2
              total_missing  perc_missing
3
Date                       0      0.000000
4
Location                   0      0.000000
5
MinTemp                  637      0.447983
6
MaxTemp                  322      0.226453
7
Rainfall                1406      0.988797
8
Evaporation            60843     42.789026
9
Sunshine               67816     47.692924
10
WindGustDir             9330      6.561504
11
WindGustSpeed           9270      6.519308
12
WindDir9am             10013      7.041838
13
WindDir3pm              3778      2.656952
14
WindSpeed9am            1348      0.948007
15
WindSpeed3pm            2630      1.849599
16
Humidity9am             1774      1.247600
17
Humidity3pm             3610      2.538803
18
Pressure9am            14014      9.855619
19
Pressure3pm            13981      9.832411
20
Cloud9am               53657     37.735332
21
Cloud3pm               57094     40.152469
22
Temp9am                  904      0.635756
23
Temp3pm                 2726      1.917113
24
RainToday               1406      0.988797
25
RISK_MM                    0      0.000000
26
RainTomorrow               0      0.000000

Next, for some machine learning algorithms, you need to change the non-numerical data to a numerical format.

You can easily do this with preprocessing package from sklearn. For example, I want to change RainToday which is a non-numerical attribute

1
Change_number:preprocessing.Labelencoder()
2
dataset['RainToday'] :Change_number.fit_transform(dataset['RainToday'].astype('str'))

You can easily get a concise summary of your dataset including the type using

1
dataset.info()

You are almost there with creating the perfect dataset. Often different attributes have a different range, which makes them difficult to compare. A great way to overcome this to introduce normalization in the dataset. Again, preprocessing is a winner with this.

There are different types of “normalizers” you can make use of, Let me show you an example.

Look at this example,

1
d:{'Values1': [26, 23, 26, 24, 23], 'Values2': [1026, 1034, 1000, 1350, 1211]}

The difference in the range of values is quite obvious. So, I was to plot this directly without any normalization, the graph would look something like this.

This doesn’t give an idea about the comparison between the values of the two attributes. Now, let’s normalize the values using the scikit package.

1
normalise:preprocessing.MinMaxScaler()
2
x:dataset[['Values2']].values.astype(float)
3
dataset['Values2']:normalise.fit_transform(x)
4
x:dataset[['Values1']].values.astype(float)
5
dataset['Values1']:normalise.fit_transform(x)

There, it is as simple as that. I am using min-max scaler just as an example. The preprocessing package offers more than one type of normalization functions.

Moving on, let’s plot the graph now after normalization.

Drastically different, right?

The key is to play around and see what preprocessing functions are used on similar datasets. This might give you an idea of what can and can not be expected.

There, you have your processed dataset now. Have fun :P

Subscribe to our newsletter

Get the latest updates from our team delivered directly to your inbox.

Getting your Dataset ready for Machine Learning

Subscribe to our newsletter

Related Posts

5 Tips to use DialogFlow the right way

Here's a very crucial thing people who build AI know, but you don't

Apache Spark - Which language to pick?