Data preprocessing - normalization and standardization
2022-06-25 15:10:00 【A window full of stars and milky way】
Standardization (normalization) of data means scaling the data so that it falls into a small, specific range. It removes the unit constraints of the data and converts it into dimensionless pure values, which makes it convenient to compare and weight indicators measured in different units or at different scales.
The most typical case is normalization, which maps the data uniformly onto the [0,1] interval.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
%matplotlib inline

0-1 Standardization
Also called Max-Min standardization. The formula: x' = (x − min) / (max − min)
# Create data
df = pd.DataFrame({
    'value1': np.random.rand(100)*10,
    'value2': np.random.rand(100)*100})
print(df.head())
print('---------------')
def maxmin(df, *cols):
    df_m = df.copy()
    for col in cols:
        ma = df[col].max()
        mi = df[col].min()
        df_m[col + '_m'] = (df[col] - mi) / (ma - mi)
    return df_m
df1 = maxmin(df,'value1','value2')
print(df1.head())

     value1     value2
0 7.363287 15.749935
1 5.713568 33.233757
2 6.108123 21.522650
3 0.804442 85.003204
4 6.387467 21.264910
---------------
value1 value2 value1_m value2_m
0 7.363287 15.749935 0.740566 0.151900
1 5.713568 33.233757 0.574296 0.329396
2 6.108123 21.522650 0.614062 0.210505
3 0.804442 85.003204 0.079521 0.854962
4 6.387467 21.264910 0.642216 0.207888
# Use the scale functions in sklearn
minmax_scaler = preprocessing.MinMaxScaler()  # create a MinMaxScaler object
df_m1 = minmax_scaler.fit_transform(df)  # fit and transform in one step
df_m1 = pd.DataFrame(df_m1,columns=['value1_m','value2_m'])
df_m1.head()

|   | value1_m | value2_m |
|---|----------|----------|
| 0 | 0.740566 | 0.151900 |
| 1 | 0.574296 | 0.329396 |
| 2 | 0.614062 | 0.210505 |
| 3 | 0.079521 | 0.854962 |
| 4 | 0.642216 | 0.207888 |
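A point worth adding here (a minimal sketch, not in the original post): unlike the hand-written function, the fitted scaler object stores the learned min/max, so the same scaling can be reapplied to new data and reversed. new_df below is a hypothetical DataFrame with the same columns as df.

new_df = pd.DataFrame({
    'value1': np.random.rand(10)*10,
    'value2': np.random.rand(10)*100})  # hypothetical new data with the same columns
new_scaled = minmax_scaler.transform(new_df)  # reuse the min/max learned from df
restored = minmax_scaler.inverse_transform(new_scaled)  # map back to the original scale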
Z-Score
Also called the z-score, it is a quantity expressed in equal units: the quotient obtained by dividing the difference between a raw score and the group mean by the standard deviation. It measures how many standard deviations a raw score lies above or below the mean.
- It is an abstract value, unaffected by the original unit of measurement, and can be used in further statistical processing
- The processed values follow a standard normal distribution with mean 0 and variance 1
- It is a centering method: it changes the distribution of the original data and is not suitable for sparse data

The formula: z = (x − μ) / σ
def data_Znorm(df, *cols):
    df_n = df.copy()
    for col in cols:
        u = df_n[col].mean()
        std = df_n[col].std()
        df_n[col + '_Zn'] = (df_n[col] - u) / std
    return df_n
# Standardize the data with the function defined above
df_z = data_Znorm(df,'value1','value2')
u_z = df_z['value1_Zn'].mean()
std_z = df_z['value1_Zn'].std()
print(df_z.head())
print('After standardization, the mean of value1: %.2f, the standard deviation: %.2f' % (u_z, std_z))
# The processed data follow the standard normal distribution: mean 0, standard deviation 1
# Where Z-score standardization helps: in classification and clustering algorithms
# that measure similarity by distance, Z-scored features tend to perform better

     value1     value2  value1_Zn  value2_Zn
0 7.363287 15.749935 0.744641 -1.164887
1 5.713568 33.233757 0.196308 -0.550429
2 6.108123 21.522650 0.327450 -0.962008
3 0.804442 85.003204 -1.435387 1.268973
4 6.387467 21.264910 0.420298 -0.971066
After standardization, the mean of value1: -0.00, the standard deviation: 1.00
# Z-Score Standardization
zscore_scale = preprocessing.StandardScaler()
df_z1 = zscore_scale.fit_transform(df)
df_z1 = pd.DataFrame(df_z1,columns=['value1_z','value2_z'])
df_z1.head()

|   | value1_z | value2_z |
|---|----------|----------|
| 0 | 0.748393 | -1.170755 |
| 1 | 0.197297 | -0.553202 |
| 2 | 0.329100 | -0.966855 |
| 3 | -1.442619 | 1.275366 |
| 4 | 0.422416 | -0.975959 |
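Note that these values differ slightly from the value1_Zn column produced by the hand-written function above. The cause is the standard deviation estimator: pandas' .std() defaults to the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). A minimal sketch reproducing the sklearn result by hand:

u = df['value1'].mean()
std_pop = df['value1'].std(ddof=0)  # population standard deviation, as StandardScaler uses
manual_z = (df['value1'] - u) / std_pop
print(manual_z.head())  # matches the value1_z column above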
MaxAbs
Maximum-absolute-value standardization is similar to the Max-Min method, but it scales the data into the range [-1,1]. Because MaxAbs only divides by the maximum absolute value without shifting the data, it does not destroy the data structure, so it can be used on sparse data, including sparse CSR (compressed sparse row) and CSC (compressed sparse column) matrices (two storage formats for sparse matrices); see the sparse-matrix sketch after the code below.
# MaxAbs standardization
maxabs_scaler = preprocessing.MaxAbsScaler()
df_ma = maxabs_scaler.fit_transform(df)
df_ma = pd.DataFrame(df_ma,columns=['value1_ma','value2_ma'])
df_ma.head()

|   | value1_ma | value2_ma |
|---|-----------|-----------|
| 0 | 0.740969 | 0.158626 |
| 1 | 0.574957 | 0.334715 |
| 2 | 0.614661 | 0.216766 |
| 3 | 0.080951 | 0.856112 |
| 4 | 0.642772 | 0.214170 |
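To illustrate the sparse-data point above (a sketch, not from the original post): MaxAbsScaler accepts scipy CSR/CSC matrices directly and keeps them sparse, because zeros remain zero after dividing each column by its maximum absolute value.

from scipy import sparse

X = sparse.csr_matrix([[1.0, -2.0, 0.0],
                       [0.0,  0.0, 4.0]])
X_ma = preprocessing.MaxAbsScaler().fit_transform(X)
print(type(X_ma))      # still a sparse matrix
print(X_ma.toarray())  # [[ 1. -1.  0.], [ 0.  0.  1.]]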
RobustScaler
In some cases, when the data contain outliers, we can still apply Z-Score standardization, but the result is not ideal: the outlying features tend to lose their outlier character after standardization. In that situation, RobustScaler, which standardizes with statistics that are robust to outliers, can be used instead.
This method gives stronger control over the robustness of the data centering and data scaling.
————《Python Data Analysis and Data Operation》
# RobustScaler Standardization
robustscaler = preprocessing.RobustScaler()
df_r = robustscaler.fit_transform(df)
df_r = pd.DataFrame(df_r,columns=['value1_r','value2_r'])
df_r.head()

|   | value1_r | value2_r |
|---|----------|----------|
| 0 | 0.360012 | -0.644051 |
| 1 | 0.055296 | -0.303967 |
| 2 | 0.128174 | -0.531764 |
| 3 | -0.851457 | 0.703016 |
| 4 | 0.179770 | -0.536777 |
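With its default settings, RobustScaler centers on the median and scales by the interquartile range (the 75th percentile minus the 25th). A minimal sketch reproducing the value1_r column by hand:

med = df['value1'].median()
iqr = df['value1'].quantile(0.75) - df['value1'].quantile(0.25)
manual_r = (df['value1'] - med) / iqr  # same as RobustScaler's default behaviour
print(manual_r.head())  # matches the value1_r column above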
Plot scatter diagrams of the standardized data
data_list = [df, df_m1, df_ma, df_z1, df_r]
title_list = ['source_data', 'maxmin_scaler',
              'maxabs_scaler', 'zscore_scaler',
              'robustscaler']
fig = plt.figure(figsize=(12,6))
for i, j in enumerate(data_list):
    # enumerate turns an iterable (e.g. a list or a string) into an index sequence,
    # yielding the index and the value together; it is mostly used for counting in for loops
    plt.subplot(2, 3, i + 1)
    plt.scatter(j.iloc[:, 0], j.iloc[:, 1])  # first column on x, second on y
    plt.title(title_list[i])
plt.show()