Data preprocessing - normalization and standardization
2022-06-25 15:10:00 【A window full of stars and milky way】
Standardization of data (normalization) means scaling the data so that it falls within a small, specific interval.
It removes the unit restrictions of the data, converting values into dimensionless pure numbers, which makes it convenient to compare and weight indicators measured in different units or on different scales.
The most typical case is normalization, which maps the data uniformly onto the [0, 1] interval.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
%matplotlib inline

0-1 Standardization
Also called Max-Min standardization. The formula:

x' = (x − min) / (max − min)
# Create data
df = pd.DataFrame({
    'value1': np.random.rand(100)*10,
    'value2': np.random.rand(100)*100})
print(df.head())
print('---------------')

def maxmin(df, *cols):
    df_m = df.copy()
    for col in cols:
        ma = df[col].max()
        mi = df[col].min()
        df_m[col + '_m'] = (df[col] - mi) / (ma - mi)
    return df_m

df1 = maxmin(df, 'value1', 'value2')
print(df1.head())

     value1     value2
0 7.363287 15.749935
1 5.713568 33.233757
2 6.108123 21.522650
3 0.804442 85.003204
4 6.387467 21.264910
---------------
value1 value2 value1_m value2_m
0 7.363287 15.749935 0.740566 0.151900
1 5.713568 33.233757 0.574296 0.329396
2 6.108123 21.522650 0.614062 0.210505
3 0.804442 85.003204 0.079521 0.854962
4 6.387467 21.264910 0.642216 0.207888
# Use the scale functions in sklearn
minmax_scaler = preprocessing.MinMaxScaler()  # create a MinMaxScaler object
df_m1 = minmax_scaler.fit_transform(df)  # apply the standardization
df_m1 = pd.DataFrame(df_m1, columns=['value1_m', 'value2_m'])
df_m1.head()

|   | value1_m | value2_m |
|---|---|---|
| 0 | 0.740566 | 0.151900 |
| 1 | 0.574296 | 0.329396 |
| 2 | 0.614062 | 0.210505 |
| 3 | 0.079521 | 0.854962 |
| 4 | 0.642216 | 0.207888 |
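One practical point: the min/max learned by fit_transform can be reused on new data with transform, so training and test sets share the same scaling. A minimal sketch, where new_df is a hypothetical DataFrame constructed here just for illustration:

# Reuse the min/max fitted on df to scale new, unseen data
# (new_df is a made-up example with the same columns as df)
new_df = pd.DataFrame({'value1': np.random.rand(10)*10,
                       'value2': np.random.rand(10)*100})
new_scaled = minmax_scaler.transform(new_df)  # uses df's min/max, not new_df's
print(pd.DataFrame(new_scaled, columns=['value1_m', 'value2_m']).head())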
Z-Score
Also called the z score, it is a quantity expressed in equal units: the quotient of the difference between a raw score and the group mean, divided by the standard deviation. It measures how many standard deviations a raw score lies above or below the mean.
- It is an abstract, dimensionless value, unaffected by the original unit of measurement, and can undergo further statistical processing.
- The processed values follow the standard normal distribution with mean 0 and variance 1.
- It is a centering method: it changes the distribution of the original data, so it is not suitable for sparse data.

The formula:

z = (x − μ) / σ
# Create a function to standardize the data
def data_Znorm(df, *cols):
    df_n = df.copy()
    for col in cols:
        u = df_n[col].mean()
        std = df_n[col].std()
        df_n[col + '_Zn'] = (df_n[col] - u) / std
    return df_n

df_z = data_Znorm(df, 'value1', 'value2')
u_z = df_z['value1_Zn'].mean()
std_z = df_z['value1_Zn'].std()
print(df_z.head())
print('Mean of value1 after standardization: %.2f, std: %.2f' % (u_z, std_z))
# The processed data conform to the standard normal distribution,
# i.e. mean 0 and standard deviation 1.
# What is Z-score standardization good for:
# in classification and clustering algorithms, when distance is used to
# measure similarity, Z-score performs better.

     value1     value2  value1_Zn  value2_Zn
0 7.363287 15.749935 0.744641 -1.164887
1 5.713568 33.233757 0.196308 -0.550429
2 6.108123 21.522650 0.327450 -0.962008
3 0.804442 85.003204 -1.435387 1.268973
4 6.387467 21.264910 0.420298 -0.971066
Mean of value1 after standardization: -0.00, std: 1.00
# Z-Score Standardization
zscore_scale = preprocessing.StandardScaler()
df_z1 = zscore_scale.fit_transform(df)
df_z1 = pd.DataFrame(df_z1,columns=['value1_z','value2_z'])
df_z1.head()

|   | value1_z | value2_z |
|---|---|---|
| 0 | 0.748393 | -1.170755 |
| 1 | 0.197297 | -0.553202 |
| 2 | 0.329100 | -0.966855 |
| 3 | -1.442619 | 1.275366 |
| 4 | 0.422416 | -0.975959 |
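Note that the sklearn values differ slightly from the manual data_Znorm results (e.g. 0.748393 vs 0.744641). This is because pandas' std() defaults to the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). A quick check:

# pandas .std() defaults to ddof=1 (sample std); StandardScaler uses ddof=0
manual_z = (df['value1'] - df['value1'].mean()) / df['value1'].std(ddof=0)
print(np.allclose(manual_z, df_z1['value1_z']))  # True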
MaxAbs
Maximum-absolute-value standardization. Similar to the Max-Min method, it scales the data into a fixed interval, here [-1, 1], by dividing each value by the column's maximum absolute value: x' = x / max(|x|). Unlike Max-Min it involves no shifting, so MaxAbs does not destroy the structure (sparsity) of the data and can be used for sparse data, such as sparse CSR (compressed sparse row) and CSC (compressed sparse column) matrices (two storage formats for sparse matrices).
# MaxAbs standardization
maxabs_scaler = preprocessing.MaxAbsScaler()
df_ma = maxabs_scaler.fit_transform(df)
df_ma = pd.DataFrame(df_ma, columns=['value1_ma', 'value2_ma'])
df_ma.head()

|   | value1_ma | value2_ma |
|---|---|---|
| 0 | 0.740969 | 0.158626 |
| 1 | 0.574957 | 0.334715 |
| 2 | 0.614661 | 0.216766 |
| 3 | 0.080951 | 0.856112 |
| 4 | 0.642772 | 0.214170 |
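To illustrate the sparse-data point, here is a small sketch on constructed data (not from the original post) applying MaxAbsScaler to a scipy CSR matrix; the result stays sparse because no shifting is involved:

from scipy import sparse

# A small sparse CSR matrix; most entries are zero
X = sparse.csr_matrix([[0.0, -4.0], [2.0, 0.0], [0.0, 1.0]])
X_ma = preprocessing.MaxAbsScaler().fit_transform(X)
print(type(X_ma))      # still a sparse matrix
print(X_ma.toarray())  # zeros preserved, values scaled into [-1, 1]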
RobustScaler
In some cases the data contain outliers. We could standardize with Z-Score, but the result is not ideal, because the distinctive features of the outliers tend to be lost after standardization. For data with outliers you can use RobustScaler, which centers on the median and scales by the interquartile range.
This method gives stronger control over the robustness of data centering and data scaling.
————《Python Data Analysis and Data Operation》
# RobustScaler Standardization
robustscaler = preprocessing.RobustScaler()
df_r = robustscaler.fit_transform(df)
df_r = pd.DataFrame(df_r,columns=['value1_r','value2_r'])
df_r.head()

|   | value1_r | value2_r |
|---|---|---|
| 0 | 0.360012 | -0.644051 |
| 1 | 0.055296 | -0.303967 |
| 2 | 0.128174 | -0.531764 |
| 3 | -0.851457 | 0.703016 |
| 4 | 0.179770 | -0.536777 |
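A small sketch on constructed data (not from the original post) showing why RobustScaler is preferable with outliers: a single extreme value drags the mean and inflates the std used by StandardScaler, but barely shifts the median and interquartile range:

# One extreme outlier (1000) among otherwise small values
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

print(preprocessing.StandardScaler().fit_transform(X).ravel())
# the four normal points are squashed together near -0.5

print(preprocessing.RobustScaler().fit_transform(X).ravel())
# the normal points stay spread out; only the outlier remains extreme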
Plot scatter diagrams of the standardized data
data_list = [df, df_m1, df_ma, df_z1, df_r]
title_list = ['source_data', 'maxmin_scaler',
              'maxabs_scaler', 'zscore_scaler',
              'robustscaler']
fig = plt.figure(figsize=(12, 6))
for i, j in enumerate(data_list):
    # enumerate turns an iterable (e.g. a list or a string) into an index
    # sequence, yielding both the index and the value at once; it is mostly
    # used for counting inside for loops
    plt.subplot(2, 3, i + 1)
    plt.scatter(j.iloc[:, 0], j.iloc[:, 1])
    plt.title(title_list[i])
plt.show()