Data preprocessing - normalization and standardization
2022-06-25 15:10:00 【A window full of stars and milky way】
Standardization of data (normalization) means scaling the data so that it falls within a small, specific interval.
It removes the unit restrictions of the data, converting values into dimensionless pure numbers, which makes it convenient to compare and weight indicators measured in different units or on different scales.
The most typical case is normalization, which maps the data uniformly onto the [0, 1] interval.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
%matplotlib inline

0-1 Standardization
Also called Max-Min standardization. The formula:

x' = (x − min) / (max − min)
# Create data
df = pd.DataFrame({
    'value1': np.random.rand(100)*10,
    'value2': np.random.rand(100)*100})
print(df.head())
print('---------------')

def maxmin(df, *cols):
    df_m = df.copy()
    for col in cols:
        ma = df[col].max()
        mi = df[col].min()
        df_m[col + '_m'] = (df[col] - mi) / (ma - mi)
    return df_m

df1 = maxmin(df, 'value1', 'value2')
print(df1.head())

     value1     value2
0 7.363287 15.749935
1 5.713568 33.233757
2 6.108123 21.522650
3 0.804442 85.003204
4 6.387467 21.264910
---------------
value1 value2 value1_m value2_m
0 7.363287 15.749935 0.740566 0.151900
1 5.713568 33.233757 0.574296 0.329396
2 6.108123 21.522650 0.614062 0.210505
3 0.804442 85.003204 0.079521 0.854962
4 6.387467 21.264910 0.642216 0.207888
# Use the scale functions in sklearn
minmax_scaler = preprocessing.MinMaxScaler()  # create a MinMaxScaler object
df_m1 = minmax_scaler.fit_transform(df)  # apply the standardization
df_m1 = pd.DataFrame(df_m1, columns=['value1_m', 'value2_m'])
df_m1.head()

|   | value1_m | value2_m |
|---|---|---|
| 0 | 0.740566 | 0.151900 |
| 1 | 0.574296 | 0.329396 |
| 2 | 0.614062 | 0.210505 |
| 3 | 0.079521 | 0.854962 |
| 4 | 0.642216 | 0.207888 |
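One practical point: the min/max learned by fit_transform can be reused on new data with transform, so training and test sets share the same scaling. A minimal sketch, where new_df is a hypothetical DataFrame constructed here just for illustration:

# Reuse the min/max fitted on df to scale new, unseen data
# (new_df is a made-up example with the same columns as df)
new_df = pd.DataFrame({'value1': np.random.rand(10)*10,
                       'value2': np.random.rand(10)*100})
new_scaled = minmax_scaler.transform(new_df)  # uses df's min/max, not new_df's
print(pd.DataFrame(new_scaled, columns=['value1_m', 'value2_m']).head())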
Z-Score
Also called the z score, it is a quantity expressed in equal units: the quotient of the difference between a raw score and the group mean, divided by the standard deviation. It measures how many standard deviations a raw score lies above or below the mean.
- It is an abstract, dimensionless value, unaffected by the original unit of measurement, and can undergo further statistical processing.
- The processed values follow the standard normal distribution with mean 0 and variance 1.
- It is a centering method: it changes the distribution of the original data, so it is not suitable for sparse data.

The formula:

z = (x − μ) / σ
# Create a function to standardize the data
def data_Znorm(df, *cols):
    df_n = df.copy()
    for col in cols:
        u = df_n[col].mean()
        std = df_n[col].std()
        df_n[col + '_Zn'] = (df_n[col] - u) / std
    return df_n

df_z = data_Znorm(df, 'value1', 'value2')
u_z = df_z['value1_Zn'].mean()
std_z = df_z['value1_Zn'].std()
print(df_z.head())
print('Mean of value1 after standardization: %.2f, std: %.2f' % (u_z, std_z))
# The processed data conform to the standard normal distribution,
# i.e. mean 0 and standard deviation 1.
# What is Z-score standardization good for:
# in classification and clustering algorithms, when distance is used to
# measure similarity, Z-score performs better.

     value1     value2  value1_Zn  value2_Zn
0 7.363287 15.749935 0.744641 -1.164887
1 5.713568 33.233757 0.196308 -0.550429
2 6.108123 21.522650 0.327450 -0.962008
3 0.804442 85.003204 -1.435387 1.268973
4 6.387467 21.264910 0.420298 -0.971066
Mean of value1 after standardization: -0.00, std: 1.00
# Z-Score Standardization
zscore_scale = preprocessing.StandardScaler()
df_z1 = zscore_scale.fit_transform(df)
df_z1 = pd.DataFrame(df_z1,columns=['value1_z','value2_z'])
df_z1.head()

|   | value1_z | value2_z |
|---|---|---|
| 0 | 0.748393 | -1.170755 |
| 1 | 0.197297 | -0.553202 |
| 2 | 0.329100 | -0.966855 |
| 3 | -1.442619 | 1.275366 |
| 4 | 0.422416 | -0.975959 |
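Note that the sklearn values differ slightly from the manual data_Znorm results (e.g. 0.748393 vs 0.744641). This is because pandas' std() defaults to the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). A quick check:

# pandas .std() defaults to ddof=1 (sample std); StandardScaler uses ddof=0
manual_z = (df['value1'] - df['value1'].mean()) / df['value1'].std(ddof=0)
print(np.allclose(manual_z, df_z1['value1_z']))  # True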
MaxAbs
Maximum-absolute-value standardization. Similar to the Max-Min method, it scales the data into a fixed interval, here [-1, 1], by dividing each value by the column's maximum absolute value: x' = x / max(|x|). Unlike Max-Min it involves no shifting, so MaxAbs does not destroy the structure (sparsity) of the data and can be used for sparse data, such as sparse CSR (compressed sparse row) and CSC (compressed sparse column) matrices (two storage formats for sparse matrices).
# MaxAbs standardization
maxabs_scaler = preprocessing.MaxAbsScaler()
df_ma = maxabs_scaler.fit_transform(df)
df_ma = pd.DataFrame(df_ma, columns=['value1_ma', 'value2_ma'])
df_ma.head()

|   | value1_ma | value2_ma |
|---|---|---|
| 0 | 0.740969 | 0.158626 |
| 1 | 0.574957 | 0.334715 |
| 2 | 0.614661 | 0.216766 |
| 3 | 0.080951 | 0.856112 |
| 4 | 0.642772 | 0.214170 |
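To illustrate the sparse-data point, here is a small sketch on constructed data (not from the original post) applying MaxAbsScaler to a scipy CSR matrix; the result stays sparse because no shifting is involved:

from scipy import sparse

# A small sparse CSR matrix; most entries are zero
X = sparse.csr_matrix([[0.0, -4.0], [2.0, 0.0], [0.0, 1.0]])
X_ma = preprocessing.MaxAbsScaler().fit_transform(X)
print(type(X_ma))      # still a sparse matrix
print(X_ma.toarray())  # zeros preserved, values scaled into [-1, 1]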
RobustScaler
In some cases the data contain outliers. We could standardize with Z-Score, but the result is not ideal, because the distinctive features of the outliers tend to be lost after standardization. For data with outliers you can use RobustScaler, which centers on the median and scales by the interquartile range.
This method gives stronger control over the robustness of data centering and data scaling.
————《Python Data Analysis and Data Operation》
# RobustScaler Standardization
robustscaler = preprocessing.RobustScaler()
df_r = robustscaler.fit_transform(df)
df_r = pd.DataFrame(df_r,columns=['value1_r','value2_r'])
df_r.head()

|   | value1_r | value2_r |
|---|---|---|
| 0 | 0.360012 | -0.644051 |
| 1 | 0.055296 | -0.303967 |
| 2 | 0.128174 | -0.531764 |
| 3 | -0.851457 | 0.703016 |
| 4 | 0.179770 | -0.536777 |
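A small sketch on constructed data (not from the original post) showing why RobustScaler is preferable with outliers: a single extreme value drags the mean and inflates the std used by StandardScaler, but barely shifts the median and interquartile range:

# One extreme outlier (1000) among otherwise small values
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

print(preprocessing.StandardScaler().fit_transform(X).ravel())
# the four normal points are squashed together near -0.5

print(preprocessing.RobustScaler().fit_transform(X).ravel())
# the normal points stay spread out; only the outlier remains extreme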
Plot scatter diagrams of the standardized data
data_list = [df, df_m1, df_ma, df_z1, df_r]
title_list = ['source_data', 'maxmin_scaler',
              'maxabs_scaler', 'zscore_scaler',
              'robustscaler']
fig = plt.figure(figsize=(12, 6))
for i, j in enumerate(data_list):
    # enumerate turns an iterable (e.g. a list or a string) into an index
    # sequence, yielding both the index and the value at once; it is mostly
    # used for counting inside for loops
    plt.subplot(2, 3, i + 1)
    plt.scatter(j.iloc[:, 0], j.iloc[:, 1])
    plt.title(title_list[i])
plt.show()