Data preprocessing of machine learning
2022-07-26 17:09:00 【Full stack programmer webmaster】
An earlier post summarized common data analysis methods with sklearn; this post summarizes data preprocessing.
After obtaining a dataset, we usually go through the following steps:
- (1) Determine how many features the dataset has, which are continuous and which are categorical
- (2) Check for missing values and choose a suitable way to fill them in so that the data is complete
- (3) Standardize the continuous numerical features
- (4) Encode the categorical features
- (5) Decide, based on the actual problem, whether the features need a functional transformation
We again use the housing price data and carry out the steps above in order.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
housing = pd.read_csv('./datasets/housing/housing.csv')
1. Determine how many features the dataset has, which are continuous and which are categorical
print(housing.shape)
(20640, 10)
print(housing.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
total_bedrooms 20433 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None
2. Check for missing values and choose a suitable way to fill them in so that the data is complete
The output of info() shows that:
- ocean_proximity has dtype object, so it is a categorical (text) attribute, which the code in section 6 refers to as the label; all the other attributes are float64, i.e. continuous numerical features
- total_bedrooms has missing values (20433 non-null out of 20640 entries)
2.1 Missing value handling
There are three common options:
(1) Discard the rows that contain missing values
(2) Discard the attribute (column) that contains missing values
(3) Set the missing values to some value (0, the mean, the median or the most frequent value)
print(housing[housing.isnull().any(axis=1)][:5])  # print the first 5 rows that contain NaN
longitude latitude housing_median_age total_rooms total_bedrooms \
290 -122.16 37.77 47.0 1256.0 NaN
341 -122.17 37.75 38.0 992.0 NaN
538 -122.28 37.78 29.0 5154.0 NaN
563 -122.24 37.75 45.0 891.0 NaN
696 -122.10 37.69 41.0 746.0 NaN
population households median_income median_house_value ocean_proximity
290 570.0 218.0 4.3750 161900.0 NEAR BAY
341 732.0 259.0 1.6196 85100.0 NEAR BAY
538 3741.0 1273.0 2.5762 173400.0 NEAR BAY
563 384.0 146.0 4.9489 247100.0 NEAR BAY
696 387.0 161.0 3.9063 178400.0 NEAR BAY
2.1.1 Delete the rows with missing values
# Drop the rows where total_bedrooms is missing
housing1 = housing.dropna(subset=['total_bedrooms'])
print(housing1.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20433 entries, 0 to 20639
Data columns (total 10 columns):
longitude 20433 non-null float64
latitude 20433 non-null float64
housing_median_age 20433 non-null float64
total_rooms 20433 non-null float64
total_bedrooms 20433 non-null float64
population 20433 non-null float64
households 20433 non-null float64
median_income 20433 non-null float64
median_house_value 20433 non-null float64
ocean_proximity 20433 non-null object
dtypes: float64(9), object(1)
memory usage: 1.7+ MB
None
2.1.2 Delete the column with missing values
# Drop the total_bedrooms column
housing2 = housing.drop(['total_bedrooms'], axis=1)
print(housing2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
longitude 20640 non-null float64
latitude 20640 non-null float64
housing_median_age 20640 non-null float64
total_rooms 20640 non-null float64
population 20640 non-null float64
households 20640 non-null float64
median_income 20640 non-null float64
median_house_value 20640 non-null float64
ocean_proximity 20640 non-null object
dtypes: float64(8), object(1)
memory usage: 1.4+ MB
None
2.1.3 Replace the missing values
# Fill the missing values with the column mean
mean = housing['total_bedrooms'].mean()
print('mean:', mean)
housing3 = housing.fillna({'total_bedrooms': mean})
print(housing3[290:291])
mean: 537.8705525375618
longitude latitude housing_median_age total_rooms total_bedrooms \
290 -122.16 37.77 47.0 1256.0 537.870553
population households median_income median_house_value ocean_proximity
290 570.0 218.0 4.375 161900.0 NEAR BAY
3. Standardize continuous numerical features
When the numerical attributes of a dataset differ greatly in scale, machine learning algorithms often perform poorly, with a few exceptions. In practice, models that are solved by gradient descent usually need scaled inputs, including linear regression, logistic regression, support vector machines and neural networks. Decision trees do not: taking C4.5 as an example, a node is split mainly according to the information gain ratio of dataset D with respect to feature X, and the information gain ratio does not depend on whether the feature has been scaled.
The common methods of feature scaling are:
- Min-max scaling (also called normalization): rescale the values so that they end up in the range 0 to 1, (current - min) / (max - min)
- Standardization: (current - mean) / std, where std is the standard deviation, so that the result has zero mean and unit variance. Compared with min-max scaling, standardization is less affected by outliers
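The two formulas above can be tried on a small example. The following is a minimal sketch with NumPy; the `values` array is made-up toy data (not taken from the housing dataset) and contains one deliberate outlier.

```python
import numpy as np

# Hypothetical toy column: four ordinary values and one outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: (current - min) / (max - min), result lies in [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: (current - mean) / std, result has zero mean and unit variance.
standardized = (values - values.mean()) / values.std()

print(min_max)
print(standardized)
```

With min-max scaling the outlier squeezes the four ordinary values into a narrow band near 0. In scikit-learn the same two transformations are available per column as MinMaxScaler and StandardScaler.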
4. Encode the categorical features
4.1 Why encode
In supervised learning, apart from a few models such as decision trees, the predicted value has to be compared with the actual value (the label) and a loss function is then optimized numerically. Categorical values therefore have to be converted to a numeric type before they can take part in the calculation.
4.2 How to encode
The common encoding methods are ordinal encoding, one-hot encoding and binary encoding.
4.2.1 Ordinal encoding
Ordinal encoding is usually used for data whose categories have an order relationship. For example, grades can be divided into three levels, low, medium and high, with the order 'high > medium > low'. Ordinal encoding assigns each category a numeric ID according to that order, for example 3 for high, 2 for medium and 1 for low.
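The grade example can be sketched in a few lines of plain Python. The `grades` list is made-up sample data and `grade_to_id` is a hypothetical helper name; the only point is that the assigned IDs preserve the order low < medium < high.

```python
# Hypothetical sample data with an inherent order: low < medium < high.
grades = ["low", "high", "medium", "low", "high"]

# Assign IDs that preserve the order relationship.
grade_to_id = {"low": 1, "medium": 2, "high": 3}

encoded = [grade_to_id[g] for g in grades]
print(encoded)  # [1, 3, 2, 1, 3]
```

Because the IDs keep the order, comparisons such as `grade_to_id["high"] > grade_to_id["low"]` remain meaningful after encoding.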
4.2.2 One-hot encoding
One-hot encoding is usually used for features whose categories have no order relationship. For example, blood type takes 4 values (A, B, AB and O). One-hot encoding turns blood type into a 4-dimensional sparse vector: type A is (1,0,0,0), type B is (0,1,0,0), type AB is (0,0,1,0) and type O is (0,0,0,1).
When a feature has many category values, pay attention to the following when using one-hot encoding:
(1) Use sparse vectors to save space
Under one-hot encoding, only one dimension of the feature vector is 1 and all other positions are 0, so a sparse vector representation saves space effectively, and most current algorithms accept input in sparse vector form.
(2) Combine it with feature selection to reduce the dimensionality
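The blood type example and the space-saving idea in point (1) can be illustrated as follows. This is a minimal sketch with NumPy; `samples` is made-up data, and the "sparse" form here is simply the list of positions of the single 1 in each row, used as a stand-in for a real sparse matrix type.

```python
import numpy as np

# Fixed category order: A -> column 0, B -> 1, AB -> 2, O -> 3.
blood_types = ["A", "B", "AB", "O"]
index_of = {t: i for i, t in enumerate(blood_types)}

# Hypothetical sample data.
samples = ["A", "O", "AB", "B", "A"]

# Dense one-hot matrix: one row per sample, one column per category.
dense = np.zeros((len(samples), len(blood_types)), dtype=int)
dense[np.arange(len(samples)), [index_of[s] for s in samples]] = 1

# Sparse view: each row has exactly one 1, so storing its column index is enough.
sparse = [index_of[s] for s in samples]

print(dense)
print(sparse)
```

The dense matrix stores 4 numbers per sample, while the sparse view stores 1, and the gap grows with the number of categories.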
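4.2.3 Binary encoding
Binary encoding works in two steps: first assign each category an ID as in ordinal encoding, then use the binary representation of that ID as the encoding. In essence it uses binary to hash-map the ID, which gives a 0/1 feature vector with fewer dimensions than one-hot encoding and so saves storage space.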
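A minimal sketch of the two steps in plain Python, reusing the blood type example. Starting the IDs at 1 is an assumption made for illustration; any fixed ID assignment would work the same way.

```python
blood_types = ["A", "B", "AB", "O"]

# Step 1: assign each category an ID (assumed here to start at 1).
ids = {t: i + 1 for i, t in enumerate(blood_types)}

# Step 2: write each ID in binary, padded to the width of the largest ID.
n_bits = max(ids.values()).bit_length()
binary = {t: [int(b) for b in format(i, f"0{n_bits}b")] for t, i in ids.items()}

for t in blood_types:
    print(t, ids[t], binary[t])
```

Four categories need only 3 binary dimensions here instead of the 4 used by one-hot encoding, and the saving grows quickly as the number of categories increases.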
5. Decide, based on the actual problem, whether the features need a functional transformation
When analyzing a dataset you may find interesting relationships between attributes, especially with the target attribute. Before feeding the data to a machine learning algorithm, it is worth trying out combinations of attributes.
Take the housing price dataset above as an example. The total number of rooms in a district is not very useful if you do not know how many households it has; what you really want is the number of rooms per household. Likewise, the total number of bedrooms is not very informative by itself; you probably want to compare it with the total number of rooms, or combine population and households into the population per household.
5.1 View the correlation between the original attributes and the median house value
# corr() only applies to numeric columns, so drop the text column first
corr_matrix = housing.drop(columns=['ocean_proximity']).corr()
print(corr_matrix['median_house_value'].sort_values(ascending=False))
median_house_value 1.000000
median_income 0.688075
total_rooms 0.134153
housing_median_age 0.105623
households 0.065843
total_bedrooms 0.049686
population -0.024650
longitude -0.045967
latitude -0.144160
Name: median_house_value, dtype: float64
5.2 View the correlation between the combined attributes and the median house value
housing4 = housing.copy()
housing4['rooms_per_household'] = housing4['total_rooms'] / housing4['households']
housing4['bedrooms_per_room'] = housing4['total_bedrooms'] / housing4['total_rooms']
housing4['population_per_household'] = housing4['population'] / housing4['households']
# Compute the correlation on housing4 so that the new combined attributes are included
corr_matrix1 = housing4.drop(columns=['ocean_proximity']).corr()
print(corr_matrix1['median_house_value'].sort_values(ascending=False))
Looking at the result, bedrooms_per_room is much more strongly correlated with the median house value than the total number of rooms or bedrooms, so it pays to try several attribute combinations.
6. Data preprocessing with sklearn.pipeline
6.1 Code implementation
# Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
class DataFrameSelector(BaseEstimator, TransformerMixin):
    # Select the given columns of a DataFrame and return them as a NumPy array
    def __init__(self, attr_name):
        self.attr_name = attr_name
    def fit(self, X, Y=None):
        return self
    def transform(self, X, Y=None):
        return X[self.attr_name].values
features_attr = list(housing.columns[:-1])  # the 9 numerical columns
labels_attr = [housing.columns[-1]]         # the categorical column ocean_proximity
# Numerical columns: fill missing values with the mean, then standardize
feature_pipeline = Pipeline([('selector', DataFrameSelector(features_attr)),
                             ('imputer', SimpleImputer(strategy='mean')),
                             ('scaler', StandardScaler()),])
# Categorical column: one-hot encode
label_pipeline = Pipeline([('selector', DataFrameSelector(labels_attr)),
                           ('encoder', OneHotEncoder()),])
# Concatenate the outputs of both pipelines column-wise
full_pipeline = FeatureUnion(transformer_list=[('feature_pipeline', feature_pipeline),
                                               ('label_pipeline', label_pipeline),])
housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)
(20640, 14)
Reference material:
- (1) 《Hands-On Machine Learning with Scikit-Learn and TensorFlow》
- (2) 《百面机器学习》
Publisher: Full stack programmer webmaster. When reprinting, please cite the source: https://javaforall.cn/120018.html Link to the original text: https://javaforall.cn