
Data preprocessing of machine learning

2022-07-26 17:09:00 【Full stack programmer webmaster】


An earlier post summarized the common data analysis methods in sklearn. This post summarizes data preprocessing.

After getting a dataset, we usually go through the following steps:

(1) Find out how many features the dataset has, which are continuous and which are categorical
(2) Check for missing values and choose a suitable way to fill them in so that the data is complete
(3) Standardize the continuous numerical features
(4) Encode the categorical features
(5) Decide, based on the actual problem, whether the features need a functional transformation

We again use the housing price dataset and carry out the steps above in order.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

housing = pd.read_csv('./datasets/housing/housing.csv')

1. Find out how many features the dataset has, which are continuous and which are categorical

print(housing.shape)

(20640, 10)

print(housing.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None

2. Check for missing values and choose a suitable way to fill them in so that the data is complete

The output of info() shows that:

Except for ocean_proximity, whose dtype is object, every column is float64, so ocean_proximity is a categorical (label-type) attribute and the rest are numerical features
total_bedrooms has missing values (20433 non-null entries out of 20640)
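The same conclusions can be reached programmatically. The sketch below is one possible way to do it with pandas; it uses a tiny made-up frame (df) as a stand-in for housing, so the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the housing frame; the values are made up for illustration.
df = pd.DataFrame({
    'total_rooms': [1256.0, 992.0, 5154.0],
    'total_bedrooms': [np.nan, 210.0, 1100.0],
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'NEAR BAY'],
})

# Numeric columns are treated as continuous features, everything else as categorical.
numeric_cols = df.select_dtypes(include='number').columns.tolist()
categorical_cols = df.select_dtypes(exclude='number').columns.tolist()

# Number of missing values per column.
missing_counts = df.isnull().sum()

print(numeric_cols)
print(categorical_cols)
print(missing_counts[missing_counts > 0])
```

Applied to the real housing frame, the same three calls should single out ocean_proximity as the only non-numeric column and total_bedrooms as the only column with missing values.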

2.1 Missing value handling

(1) Drop the rows that contain missing values

(2) Drop the attribute (that is, the column) that contains missing values

(3) Fill the missing values with some value (0, the mean, the median, or the most frequent value)

print(housing[housing.isnull().any(axis=1)][:5])  # print the first 5 rows that contain NaN

     longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
290    -122.16     37.77                47.0       1256.0             NaN   
341    -122.17     37.75                38.0        992.0             NaN   
538    -122.28     37.78                29.0       5154.0             NaN   
563    -122.24     37.75                45.0        891.0             NaN   
696    -122.10     37.69                41.0        746.0             NaN   

     population  households  median_income  median_house_value ocean_proximity  
290       570.0       218.0         4.3750            161900.0        NEAR BAY  
341       732.0       259.0         1.6196             85100.0        NEAR BAY  
538      3741.0      1273.0         2.5762            173400.0        NEAR BAY  
563       384.0       146.0         4.9489            247100.0        NEAR BAY  
696       387.0       161.0         3.9063            178400.0        NEAR BAY

2.1.1 Drop the rows with missing values

# Drop the rows in which total_bedrooms is missing
housing1 = housing.dropna(subset=['total_bedrooms'])
print(housing1.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20433 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20433 non-null float64
latitude              20433 non-null float64
housing_median_age    20433 non-null float64
total_rooms           20433 non-null float64
total_bedrooms        20433 non-null float64
population            20433 non-null float64
households            20433 non-null float64
median_income         20433 non-null float64
median_house_value    20433 non-null float64
ocean_proximity       20433 non-null object
dtypes: float64(9), object(1)
memory usage: 1.7+ MB
None

2.1.2 Drop the column with missing values

# Drop the total_bedrooms column
housing2 = housing.drop(['total_bedrooms'], axis=1)
print(housing2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(8), object(1)
memory usage: 1.4+ MB
None

2.1.3 Replace missing values

# Replace missing values with the column mean
mean = housing['total_bedrooms'].mean()
print('mean:', mean)
housing3 = housing.fillna({'total_bedrooms': mean})
print(housing3[290:291])

mean: 537.8705525375618
     longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
290    -122.16     37.77                47.0       1256.0      537.870553   

     population  households  median_income  median_house_value ocean_proximity  
290       570.0       218.0          4.375            161900.0        NEAR BAY
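The median mentioned in option (3) can be used in exactly the same way. The following is a minimal sketch on a made-up column (the Series s is hypothetical, not taken from the housing data); it shows why the median is sometimes preferred when the column contains extreme values.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one extreme value.
s = pd.Series([100.0, 200.0, np.nan, 300.0, 5000.0], name='total_bedrooms')

mean = s.mean()      # pulled upward by the extreme value 5000.0
median = s.median()  # middle of the observed values, less sensitive to it

# Fill the missing entry with the median instead of the mean.
filled = s.fillna(median)

print('mean:', mean, 'median:', median)
print(filled.tolist())
```

For the housing frame the equivalent call would be housing.fillna({'total_bedrooms': housing['total_bedrooms'].median()}).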

3. Standardize continuous numerical features

When the numerical attributes of a dataset differ greatly in scale, machine learning algorithms often perform poorly, with a few exceptions. In practice, models that are solved by gradient descent usually need normalization, including linear regression, logistic regression, support vector machines and neural networks. Decision trees do not. Taking C4.5 as an example, a node is split mainly according to the information gain ratio of dataset D with respect to feature X, and the information gain ratio does not depend on whether the feature has been normalized.
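To illustrate why tree splits are insensitive to scaling, the sketch below computes the best information gain ratio over binary threshold splits of a made-up feature, before and after min-max scaling. The helper names entropy and best_gain_ratio and the data are hypothetical, and the exhaustive threshold search is a simplification rather than a full C4.5 implementation.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_gain_ratio(xs, ys):
    """Best information gain ratio over all binary splits of the form x <= t."""
    base = entropy(ys)
    best = 0.0
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        p = len(left) / len(ys)
        gain = base - p * entropy(left) - (1 - p) * entropy(right)
        split_info = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        best = max(best, gain / split_info)
    return best

xs = [1000.0, 2000.0, 3000.0, 4000.0]   # made-up feature on a large scale
ys = [0, 0, 1, 1]                       # made-up class labels
lo, hi = min(xs), max(xs)
xs_scaled = [(x - lo) / (hi - lo) for x in xs]

print(best_gain_ratio(xs, ys), best_gain_ratio(xs_scaled, ys))
```

Both calls return the same value because the gain ratio depends only on how the samples are partitioned, and a monotonic rescaling does not change the order of the feature values.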

The common scaling methods are:

Min-max scaling (also called normalization): rescales the values so that they end up in the range 0 to 1, (current - min) / (max - min)
Standardization: (current - mean) / std, where std is the standard deviation, so that the result has zero mean and unit variance. Compared with min-max scaling, standardization is much less affected by outliers
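Both formulas can be written directly in NumPy. The sketch below applies them to a made-up array x containing one outlier; x.std() is the population standard deviation, which is an assumption about which variant of the standard deviation is meant.

```python
import numpy as np

# Made-up feature values with one large outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: (x - min) / (max - min), the result lies in [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, the result has zero mean and unit variance.
x_std = (x - x.mean()) / x.std()

print(x_minmax)
print(x_std)
```

With min-max scaling the outlier squeezes the four ordinary values into a narrow band near 0. In sklearn the corresponding transformers are MinMaxScaler and StandardScaler.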

4. Encode the categorical features

4.1 Why encode

In supervised learning, except for a few models such as decision trees, the predicted value has to be compared with the actual value (that is, the label), and a loss function is then optimized by the algorithm. This requires the label to be converted to a numeric type so that it can be used in calculations.

4.2 How to encode

The common encoding methods are ordinal encoding, one-hot encoding and binary encoding.

4.2.1 Ordinal encoding

Ordinal encoding is usually used for data whose categories have an order. Grades, for example, can be divided into three levels, low, medium and high, with the ordering 'high > medium > low'. Ordinal encoding assigns each category a numeric ID that respects this order, for example 3 for high, 2 for medium and 1 for low.
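A minimal sketch of ordinal encoding with pandas, using an explicit mapping so that the order is under our control (the grades Series is made up):

```python
import pandas as pd

# Made-up ordered categories.
grades = pd.Series(['low', 'high', 'medium', 'low'])

# Explicit mapping that preserves the order low < medium < high.
order = {'low': 1, 'medium': 2, 'high': 3}
encoded = grades.map(order)

print(encoded.tolist())  # [1, 3, 2, 1]
```

An explicit dictionary is used here because automatic encoders typically assign IDs in alphabetical order, which would not match the intended ranking.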

4.2.2 One-hot encoding

One-hot encoding is usually used for features whose categories have no order. Blood type, for example, takes 4 values (A, B, AB, O). One-hot encoding turns blood type into a 4-dimensional sparse vector: type A is (1,0,0,0), type B is (0,1,0,0), type AB is (0,0,1,0) and type O is (0,0,0,1).
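The blood type example can be reproduced with pandas.get_dummies. This is a sketch on made-up samples; the category order is fixed explicitly so that the columns come out as A, B, AB, O.

```python
import pandas as pd

# Made-up samples of the blood type feature.
blood = pd.Series(['A', 'B', 'AB', 'O', 'A'])

# Fix the category order so that the columns are A, B, AB, O.
blood = pd.Series(pd.Categorical(blood, categories=['A', 'B', 'AB', 'O']))
one_hot = pd.get_dummies(blood, dtype=int)

print(one_hot)
```

Each row contains exactly one 1, in the column of its blood type.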

When a feature has many category values, pay attention to the following when using one-hot encoding:

(1) Use sparse vectors to save space

Under one-hot encoding, only one dimension of the feature vector is 1 and all other positions are 0, so a sparse representation of the vector can save a lot of space. Most current algorithms accept input in sparse vector form.

(2) Combine it with feature selection to reduce the dimensionality
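To give a feeling for point (1), the sketch below builds a one-hot matrix for made-up category IDs as a SciPy CSR matrix and compares its memory footprint with the dense equivalent. The sizes (1000 samples, 500 categories) are arbitrary assumptions.

```python
import numpy as np
from scipy import sparse

n_samples, n_categories = 1000, 500
rng = np.random.default_rng(0)
ids = rng.integers(0, n_categories, size=n_samples)  # made-up category IDs

# Each row has a single 1 in the column given by its category ID.
rows = np.arange(n_samples)
one_hot = sparse.csr_matrix((np.ones(n_samples), (rows, ids)),
                            shape=(n_samples, n_categories))

# Memory used by the dense array versus the three arrays of the CSR format.
dense_bytes = one_hot.toarray().nbytes
sparse_bytes = one_hot.data.nbytes + one_hot.indices.nbytes + one_hot.indptr.nbytes

print(dense_bytes, sparse_bytes)
```

The sparse form stores only the non-zero entries and their positions, so its size grows with the number of samples rather than with samples times categories.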

4.2.3 Binary encoding

Binary encoding essentially maps each category ID to its binary representation, which acts as a hash of the ID. The result is a 0/1 feature vector with fewer dimensions than one-hot encoding, which saves storage space.
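A small sketch of binary encoding for the blood type example, in plain Python. Starting the IDs at 1 is an assumption made for illustration; starting at 0 would work as well.

```python
categories = ['A', 'B', 'AB', 'O']

# Step 1: assign each category an ordinal ID (starting from 1 by assumption).
ids = {c: i + 1 for i, c in enumerate(categories)}

# Step 2: write each ID in binary with a fixed number of bits.
n_bits = max(ids.values()).bit_length()
binary = {c: [int(b) for b in format(i, f'0{n_bits}b')] for c, i in ids.items()}

print(binary)
# {'A': [0, 0, 1], 'B': [0, 1, 0], 'AB': [0, 1, 1], 'O': [1, 0, 0]}
```

Four categories need only 3 bits here instead of the 4 dimensions required by one-hot encoding, and the saving grows as the number of categories increases.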

5. Decide, based on the actual problem, whether the features need a functional transformation

While analyzing the dataset you may find interesting relationships between attributes, especially with the target attribute. Before feeding the data to a machine learning algorithm, you should try combining attributes.

Take the housing price dataset above as an example. The total number of rooms in a district is not very useful if you do not know how many households there are; what you really want is the number of rooms per household. Likewise, the total number of bedrooms on its own is not very meaningful; you probably want to compare it with the total number of rooms. The population per household is another combination worth looking at.

5.1 View the correlation between the original attributes and the median house value

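# Correlation is only defined for numeric columns, so ocean_proximity is excluded
corr_matrix = housing.select_dtypes(include='number').corr()
print(corr_matrix['median_house_value'].sort_values(ascending=False))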

median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64

5.2 View the correlation between the combined attributes and the median house value

housing4 = housing.copy()
housing4['rooms_per_household'] = housing4['total_rooms'] / housing4['households']
housing4['bedrooms_per_room'] = housing4['total_bedrooms'] / housing4['total_rooms']
housing4['population_per_household'] = housing4['population'] / housing4['households']

# Compute the correlations on housing4, which holds the combined attributes
corr_matrix1 = housing4.select_dtypes(include='number').corr()
print(corr_matrix1['median_house_value'].sort_values(ascending=False))

The printed ranking contains the three combined attributes alongside the original ones. bedrooms_per_room is much more strongly correlated with the median house value than either the total number of rooms or the total number of bedrooms, so it is worth trying several attribute combinations.

6. Data preprocessing with sklearn.pipeline

6.1 Code implementation

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attr_name):
        self.attr_name = attr_name

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.attr_name].values

# All columns except the last are numeric; the last one (ocean_proximity) is categorical
features_attr = list(housing.columns[:-1])
labels_attr = [housing.columns[-1]]

# Numeric columns: fill missing values with the mean, then standardize.
# SimpleImputer replaces Imputer, which was deprecated in 0.20 and removed in 0.22.
feature_pipeline = Pipeline([('selector', DataFrameSelector(features_attr)),
                             ('imputer', SimpleImputer(strategy='mean')),
                             ('scaler', StandardScaler()),])

# Categorical column: one-hot encode
label_pipeline = Pipeline([('selector', DataFrameSelector(labels_attr)),
                           ('encoder', OneHotEncoder()),])

# Concatenate the outputs of both pipelines column by column
full_pipeline = FeatureUnion(transformer_list=[('feature_pipeline', feature_pipeline),
                                               ('label_pipeline', label_pipeline),])

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.shape)

(20640, 14)
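In sklearn 0.20 and later, the same kind of preprocessing can also be expressed with ColumnTransformer, which selects columns by name and removes the need for a custom selector class. This is an alternative sketch, not the approach used above; it runs on a tiny made-up frame (df) with only three of the housing columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in for the housing frame; the values are made up.
df = pd.DataFrame({
    'total_rooms': [1256.0, 992.0, 5154.0, 891.0],
    'total_bedrooms': [np.nan, 210.0, 1100.0, 180.0],
    'ocean_proximity': ['NEAR BAY', 'INLAND', 'NEAR BAY', 'INLAND'],
})

num_cols = ['total_rooms', 'total_bedrooms']
cat_cols = ['ocean_proximity']

# Numeric columns: fill missing values with the mean, then standardize.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])

# Apply the numeric pipeline and the one-hot encoder to their respective columns.
preprocess = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', OneHotEncoder(), cat_cols),
])

prepared = preprocess.fit_transform(df)
print(prepared.shape)  # (4, 4)
```

The result has 2 scaled numeric columns plus 2 one-hot columns, one for each distinct value of ocean_proximity in the made-up data.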

References:

(1) 《Hands-On Machine Learning with Scikit-Learn and TensorFlow》
(2) 《百面机器学习》

Publisher: Full stack programmer webmaster. When reprinting, please credit the source: https://javaforall.cn/120018.html Original link: https://javaforall.cn
