Normalizing or Standardizing distribution in Machine Learning
Updated on: May 30, 2020 · 6 mins read
Normalizing and standardizing are common techniques in statistics that bring the features of a dataset under a common scale without changing the relative differences between the values. Does that sound odd? Don’t worry, keep going and you will know what I am talking about.
In this post, we will cover:
1. Why do we want to Normalize the distribution
2. How to normalize/ standardize distribution
3. Apply Standardization on a dataset
4. Apply Normalization on a dataset

Let’s first start with the why, and then move on to the further questions.

Why do we want to Normalize the distribution
The values of different features can vary a lot. For example, a feature like is_old can have binary values (0 and 1), while a feature like cost can have values ranging from $100 to $10000 depending on the item under consideration.
If we don’t normalize/standardize the dataset, the feature with the larger range will contribute more toward the learning, leading to bias in the model. In the above example, the cost values would dominate the training.
We generally normalize values when the features in a dataset have different ranges. Normalization doesn’t change the relative differences between the values; it only rescales them to a common range. Standardization rescales the values of a distribution so that its mean becomes 0 and its standard deviation becomes 1.
How to normalize/ standardize distribution
A standard way to normalize a distribution is to apply the following formula to each column:

$\frac{x - x_{min}}{x_{max} - x_{min}}$

This rescales all the values of the column into the range 0 to 1 while preserving their relative differences.
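To make the formula concrete, here is a small sketch of min-max normalization applied by hand with NumPy. The cost values are made up for illustration; they are not from the dataset used later in this post.

```python
import numpy as np

# Hypothetical cost values, purely illustrative
cost = np.array([100.0, 250.0, 4000.0, 10000.0])

# Min-max normalization: (x - x_min) / (x_max - x_min)
normalized = (cost - cost.min()) / (cost.max() - cost.min())

print(normalized)
```

Note that the smallest value maps to 0, the largest maps to 1, and everything else lands in between in the same relative order.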
For standardization, we use the following formula,
$\frac{x - x_{mean}}{x_{std}}$
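And here is the standardization formula applied by hand to the same illustrative values. After subtracting the mean and dividing by the standard deviation, the resulting distribution has mean 0 and standard deviation 1:

```python
import numpy as np

# Same illustrative cost values as above
cost = np.array([100.0, 250.0, 4000.0, 10000.0])

# Standardization: (x - mean) / standard deviation
standardized = (cost - cost.mean()) / cost.std()

print(standardized.mean(), standardized.std())
```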
Apply Standardization on a dataset
Data source used: GitHub of Data Source
# Import everything
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Create a DataFrame
df = pd.read_csv('KNN_Project_Data')
# Print the head of the data.
df.head()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS', axis=1))
sc_transform = scaler.transform(df.drop('TARGET CLASS', axis=1))
sc_df = pd.DataFrame(sc_transform, columns=df.drop('TARGET CLASS', axis=1).columns)
# Now you can safely use sc_df as your input features.
sc_df.head()
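You can verify that StandardScaler did what the formula promises by checking the column statistics. Since the KNN_Project_Data file may not be available to you, this sketch uses a small synthetic DataFrame as a stand-in:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature columns of the real dataset
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'a': rng.normal(50, 10, 100),      # roughly bell-shaped values
    'b': rng.uniform(0, 1000, 100),    # much larger range
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)

# Every column now has mean ~0 and (population) standard deviation ~1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

StandardScaler uses the population standard deviation (ddof=0), which is why the check above passes it explicitly.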
Next, let’s normalize the same dataset using the min-max approach we discussed earlier.
Apply Normalization on a dataset
from sklearn.preprocessing import MinMaxScaler
minmaxscaler = MinMaxScaler()
minmaxscaler.fit(df.drop('TARGET CLASS', axis=1))
sc_transform = minmaxscaler.transform(df.drop('TARGET CLASS', axis=1))
sc_df = pd.DataFrame(sc_transform, columns=df.drop('TARGET CLASS', axis=1).columns)
sc_df.head()
That’s how MinMaxScaler works in sklearn. One important note: when you have separate train and test sets, fit the StandardScaler (or MinMaxScaler) on the train data only, and then use it to transform both the train and the test data.
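The fit-on-train, transform-on-both pattern can be sketched like this. The feature matrix here is synthetic, since only the workflow matters:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix standing in for a real dataset
rng = np.random.default_rng(1)
X = rng.normal(100, 25, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_sc = scaler.transform(X_test)        # reuse the train statistics

# Train columns are exactly standardized; test columns only approximately,
# because the test set was scaled with the train set's mean and std
print(X_train_sc.mean(axis=0).round(6))
print(X_test_sc.mean(axis=0).round(6))
```

Fitting on the full dataset before splitting would leak information from the test set into the scaler, which can make evaluation look better than it really is.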
Hope you liked the post. Leave a comment if you have any question. Also, do subscribe to the newsletter if you want to read more such posts.