Normalizing or Standardizing a Distribution in Machine Learning
Normalizing and standardizing are common concepts in statistics that bring data onto a common scale without changing the relative differences between the values in the data. Does that sound odd? Don't worry, keep going and you will know what I am talking about.
In this post we will cover:

1. Why do we want to normalize/standardize a distribution
2. How to normalize/standardize a distribution
3. Apply Standardization on a dataset
4. Apply Normalization on a dataset
Why do we want to normalize/standardize a distribution

Let's start with the why and then move to the further questions. The values of different features can vary a lot. For example, a feature like is_old can have binary values (0 and 1), while a feature like cost can have values going up to $10000 depending upon the item under consideration. If we don't normalize/standardize our dataset, the feature with the larger range will contribute more toward the learning, leading to bias in the model. In the above example, the cost value will dominate what the model learns. We generally normalize values when features with different ranges are present in the dataset. Normalization rescales values to a fixed range (typically [0, 1]) without changing the relative differences between them. Standardization rescales the values so that the mean of the distribution becomes 0 and the standard deviation becomes 1.
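To make the range problem concrete, here is a tiny sketch with made-up numbers (the is_old and cost values below are invented for illustration) showing how a distance-based learner would be dominated by the large-range feature:

```python
import numpy as np
import pandas as pd

# Toy dataset (made-up values): a binary feature and a large-range feature
df = pd.DataFrame({
    "is_old": [0, 1, 0, 1],
    "cost": [100.0, 5000.0, 250.0, 10000.0],
})

# Euclidean distance between the first two rows
a, b = df.iloc[0].to_numpy(), df.iloc[1].to_numpy()
dist = np.sqrt(((a - b) ** 2).sum())

# The is_old difference (1) is negligible next to the cost difference (4900),
# so cost alone effectively decides which points look "close"
print(dist)
```

This is exactly the bias described above: without rescaling, cost drives the result and is_old is essentially ignored.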
How to normalize/standardize a distribution

A standard way to normalize a distribution is to apply the min-max formula to each and every column:

x_norm = (x - min(x)) / (max(x) - min(x))

To standardize instead, subtract the column mean and divide by the column standard deviation:

x_std = (x - mean(x)) / std(x)
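Both transformations can be sketched directly with NumPy before reaching for sklearn (the sample column values below are made up):

```python
import numpy as np

x = np.array([10.0, 20.0, 40.0, 100.0])  # a made-up feature column

# Min-max normalization: rescales the column to [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: shifts the mean to 0 and scales the std to 1
x_std = (x - x.mean()) / x.std()

print(x_norm)                 # smallest value -> 0.0, largest -> 1.0
print(x_std.mean(), x_std.std())  # approximately 0.0 and 1.0
```

Note that both are monotonic rescalings: the ordering of the values, and their relative differences, are preserved.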
Apply Standardization on a dataset

Data source used: GitHub of Data Source
```python
# Import everything
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Create a DataFrame
df = pd.read_csv('KNN_Project_Data')

# Print the head of the data
df.head()
```
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS', axis=1))
sc_transform = scaler.transform(df.drop('TARGET CLASS', axis=1))
sc_df = pd.DataFrame(sc_transform)

# Now you can safely use sc_df as your input features.
sc_df.head()
```
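You can sanity-check that the transform did what we claimed. Since I can't ship the CSV here, the snippet below uses a small synthetic frame with hypothetical column names (feat_a, feat_b) standing in for the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CSV: two features on very different scales
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feat_a": rng.normal(1000, 300, 100),
    "feat_b": rng.normal(5, 2, 100),
    "TARGET CLASS": rng.integers(0, 2, 100),
})

features = df.drop("TARGET CLASS", axis=1)
sc_df = pd.DataFrame(StandardScaler().fit_transform(features),
                     columns=features.columns)

# Every standardized column now has mean ~0 and (population) std ~1
print(sc_df.mean().round(6).tolist())
print(sc_df.std(ddof=0).round(6).tolist())
```

Passing `columns=features.columns` also keeps the original column names, which the plain `pd.DataFrame(sc_transform)` call above loses.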
For normalization, sklearn provides a MinMaxScaler, which implements the min-max approach we discussed earlier.
Apply Normalization on a dataset
```python
from sklearn.preprocessing import MinMaxScaler

minmaxscaler = MinMaxScaler()
minmaxscaler.fit(df.drop('TARGET CLASS', axis=1))
sc_transform = minmaxscaler.transform(df.drop('TARGET CLASS', axis=1))
sc_df = pd.DataFrame(sc_transform)
sc_df.head()
```
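Again the result is easy to verify: after min-max scaling, every column's minimum maps to 0 and its maximum to 1. The frame below uses hypothetical feature names and made-up values in place of the real CSV:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in features (hypothetical names, not the real CSV columns)
df = pd.DataFrame({
    "feat_a": [3.0, 9.0, 15.0, 21.0],
    "feat_b": [100.0, 400.0, 250.0, 1000.0],
    "TARGET CLASS": [0, 1, 0, 1],
})

features = df.drop("TARGET CLASS", axis=1)
mm_df = pd.DataFrame(MinMaxScaler().fit_transform(features),
                     columns=features.columns)

# Per-column minimums are 0 and maximums are 1
print(mm_df.min().tolist())
print(mm_df.max().tolist())
```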
This is how MinMaxScaler works in sklearn.
One last but important point: when you split your data, fit the StandardScaler on the train data only, then transform both the train and test data with it. Fitting on the full dataset would leak information from the test set into training. Hope you liked the post. Leave a comment if you have any questions. Also, do subscribe to the newsletter if you want to read more such posts.
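The fit-on-train, transform-both pattern looks like this (the feature matrix below is random stand-in data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix and labels
rng = np.random.default_rng(42)
X = rng.normal(50, 10, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit the scaler on the training split only...
scaler = StandardScaler().fit(X_train)

# ...then apply the same learned mean/std to both splits
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)

# Train columns are exactly standardized; test columns only approximately,
# because the test split never influenced the scaler's statistics
print(X_train_sc.mean(axis=0).round(6))
print(X_test_sc.mean(axis=0).round(3))
```

The test-set means land near 0 but not exactly on it, and that is the point: the scaler's statistics come from training data alone.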
Did you enjoy reading or think it can be improved? Don’t forget to leave your thoughts in the comments section below! If you liked this article, please share it with your friends, and read a few more!