Why do we need feature scaling in Machine Learning and how to do it using SciKit Learn?
When you’re working with a learning model, it is important to scale the features to a range which is centered around zero. This is done so that the variances of the features are in the same range. If a feature’s variance is orders of magnitude larger than the variance of the other features, that feature might dominate the others in the dataset, which is not something we want happening in our model.
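With standardisation, the aim is for each feature to have zero mean and unit variance (this rescales the data, it does not make it Gaussian). There are many ways of scaling features; the two most popular are standardisation and normalisation, the latter usually meaning rescaling each feature to a fixed range such as 0 to 1.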
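To make the two methods concrete, here is a minimal sketch of both computed by hand with NumPy. The `ages` values are invented for illustration and are not from the example dataset, and normalisation is assumed to mean min-max scaling to the range 0 to 1.

```python
import numpy as np

# Hypothetical feature column (values invented for illustration).
ages = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

# Standardisation: subtract the mean, then divide by the standard deviation.
standardised = (ages - ages.mean()) / ages.std()

# Normalisation (min-max): rescale the values to the range 0 to 1.
normalised = (ages - ages.min()) / (ages.max() - ages.min())

print(standardised)  # mean is approximately 0, standard deviation is 1
print(normalised)    # smallest value maps to 0, largest to 1
```

Standardisation keeps the shape of the distribution and only shifts and rescales it, while min-max normalisation pins the smallest and largest values to the ends of the range.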
Whichever method you choose, the SciKit Learn library provides a class for it: StandardScaler for standardisation and MinMaxScaler for normalisation. We’ll use the StandardScaler class here. Now that we know why we need to scale our features, let’s see how to do it.
We’ll consider the same example dataset which we have been using in the previous posts; I’ve reproduced it below for reference.
The columns Country, Age, and Salary are the features in this dataset. The column Purchased is the dependent variable. Since the first column is categorical, we’ll be label encoding it and then one hot encoding it. We also have some missing data in the other two columns, so we’ll be imputing those values with the mean of the respective columns. After we do all that, we’ll have a dataset which looks like the following:
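As a rough sketch of those preprocessing steps, the snippet below imputes and encodes a tiny dataset. The rows are hypothetical stand-ins and not the actual example dataset, and the variable names are my own. It also assumes a reasonably recent scikit-learn, where OneHotEncoder accepts string categories directly, so the separate label encoding step is folded into it.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows (Country, Age, Salary); np.nan marks missing values.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])
numeric = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
numeric = imputer.fit_transform(numeric)

# One-hot encode the categorical column (one 0/1 column per country).
encoder = OneHotEncoder()
country_columns = encoder.fit_transform(countries).toarray()

# Final feature matrix: the one-hot columns followed by Age and Salary.
x = np.hstack([country_columns, numeric])
print(encoder.categories_[0])
print(x)
```

The one-hot columns only contain 0 and 1, while Age and Salary sit on very different scales, which is exactly the mismatch that scaling is meant to fix.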
As we can see now, the features are not at all on the same scale. We definitely need to scale them. Let’s look at the code for doing that:
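from sklearn.preprocessing import StandardScaler
# Create a scaler with the default options
standardScalerX = StandardScaler()
# Learn each column's mean and standard deviation from x, then scale x
x = standardScalerX.fit_transform(x)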
In this example, we have assumed that the dataset is contained in a variable called ‘x.’ As you can see, the code is very simple. We have used all default values in this case.
These defaults are parameters of the StandardScaler constructor. First, fit_transform() works on a copy of the dataset, because the ‘copy’ parameter defaults to True. Then the data is centered before scaling, because the parameter ‘with_mean’ is set to True by default. Finally, because the parameter ‘with_std’ also defaults to True, each feature is divided by its standard deviation, so it ends up with unit variance (and, equivalently, unit standard deviation).
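The effect of these defaults can be checked with a small sketch. The array below holds hypothetical Age and Salary values, not the example dataset, and `scaler` and `x_scaled` are names chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age and Salary columns (values invented for illustration).
x = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])

# Defaults written out explicitly: copy=True, with_mean=True, with_std=True.
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
x_scaled = scaler.fit_transform(x)

print(scaler.mean_)           # per-column means learned from x
print(scaler.scale_)          # per-column standard deviations learned from x
print(x_scaled.mean(axis=0))  # approximately 0 for each column
print(x_scaled.std(axis=0))   # approximately 1 for each column
print(x[0])                   # the original array is left unchanged
```

Storing the result in a new variable, as done here, keeps the unscaled data around for comparison.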
If you want to have a different behaviour, you can always change these options. You can try out all combinations on your dataset to see which one satisfies your needs.
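As an illustration of changing these options, the sketch below turns off one step at a time. The data is a small made-up array, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column dataset (values invented for illustration).
x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

# Centre only: subtract the mean but keep the original spread.
centred = StandardScaler(with_std=False).fit_transform(x)

# Scale only: divide by the standard deviation without centring.
scaled = StandardScaler(with_mean=False).fit_transform(x)

print(centred)
print(scaled)
```

Scaling without centring is mainly useful for sparse data, where subtracting the mean would turn the zeros into non-zero values.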
Now, after running this code with the dataset given above, we end up with a nicely scaled set of features as shown below:
As you can see, the whole dataset is neatly centered and scaled. Now we can pass this data to our model, which often gives better results, especially for models that are sensitive to the scale of their features.