[recommended collection] these 8 common missing value filling skills must be mastered
2022-06-26 19:01:00 【I love Python data mining】
In many datasets we inevitably encounter missing data. It may come from errors when the data was recorded, or the data may simply not exist (for example, a user left a field blank, or a stock was suspended so that day's trading record is empty). How these missing values are treated can play a crucial role in the model's final predictions, because missing data can lead to:
Dataset distortion: a large amount of missing data can distort the distribution of a variable, artificially increasing or decreasing the frequency of particular categories in the dataset.
Biased training and prediction: missing data can introduce bias into the dataset, which may in turn bias both model training and the resulting predictions.
This article introduces some missing value imputation techniques commonly used in data competitions.
01 CCA (Complete Case Analysis)
Complete Case Analysis (CCA) is a very simple way to handle missing data: it drops every row that contains a missing value, so that only rows with complete data are kept.
This strategy is direct and easy to implement, but it also discards a lot of information and can bias later predictions. It is generally recommended only when the dataset is large enough, the proportion of missing data is small, and the values are missing at random.
data_cca = df.dropna(axis=0)  # keep only the rows with no missing values
02 Arbitrary value filling
This is an important technique for filling missing values that works for both numerical and categorical variables. We treat the missing values as their own group and assign them a special sentinel value, such as 99999999 or -999999. This preserves the fact that the value was missing, which in some situations carries useful signal and can noticeably improve results. The drawback is also clear: such sentinel values suit some models better than others (for example, they can cause trouble for neural networks), which complicates preprocessing, training, and prediction.
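A minimal sketch of sentinel-value filling with pandas. The toy DataFrame, column names, and the sentinel -999999 are illustrative assumptions, not data from this article:

```python
import numpy as np
import pandas as pd

# Toy data with missing values (hypothetical example)
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "city": ["NY", None, "LA", "SF"],
})

# Numeric column: fill with a sentinel far outside the normal value range
df["age"] = df["age"].fillna(-999999)

# Categorical column: fill with an explicit "Missing" label
df["city"] = df["city"].fillna("Missing")

print(df)
```

Tree-based models usually tolerate such sentinels well, since they can split the sentinel into its own branch.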
03 Mode filling for categorical features
Replace the missing values with the most frequent value of the feature, i.e., use the column's mode as the replacement. This method is simple to implement and widely used, but when a large share of the data is missing it can heavily skew the distribution and make predictions worse.
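A short sketch of mode filling using scikit-learn's SimpleImputer with strategy="most_frequent" (the toy data is a hypothetical example):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical data (hypothetical example); "red" is the mode
X = pd.DataFrame({"color": ["red", "blue", np.nan, "red", np.nan]})

# Replace each missing entry with the most frequent value in its column
imputer = SimpleImputer(strategy="most_frequent")
X_filled = imputer.fit_transform(X)

print(X_filled.ravel())
```

The same imputer also works on numeric columns, where the mode is the most frequent number.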
04 Filling with statistics
Fill missing values with a statistic such as the mean or the median. Sometimes we can also use a normal-distribution strategy:
- Use values within 2 standard deviations of the mean: generate random numbers in the interval (mean - 2*std, mean + 2*std) to fill the missing values.
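The statistic fills and the mean-plus-minus-two-standard-deviations strategy above can be sketched as follows (the toy Series and the random seed are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

s = pd.Series([10.0, 12.0, np.nan, 11.0, np.nan, 13.0])

# Simple statistic fills
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

# Random fill: draw uniformly from (mean - 2*std, mean + 2*std)
mu, sigma = s.mean(), s.std()
n_missing = s.isna().sum()
random_values = rng.uniform(mu - 2 * sigma, mu + 2 * sigma, size=n_missing)

s_random = s.copy()
s_random[s_random.isna()] = random_values
print(s_random)
```

The random variant avoids piling every imputed value onto a single point, which the plain mean fill does.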
05 Using linear regression
When two variables are strongly correlated, fit a linear regression between them and fill the missing values with its predictions. Be careful: outliers can strongly distort a linear regression fit.
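A minimal sketch of regression-based filling, assuming a toy DataFrame in which y is strongly correlated with x (column names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2*x, with two missing entries (hypothetical example)
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, np.nan, 8.1, np.nan],
})

# Fit the regression on the rows where y is known
known = df["y"].notna()
model = LinearRegression()
model.fit(df.loc[known, ["x"]], df.loc[known, "y"])

# Predict y where it is missing, using the fitted line
df.loc[~known, "y"] = model.predict(df.loc[~known, ["x"]])
print(df)
```

Because the fit here uses ordinary least squares, a single extreme outlier in the known rows would shift every imputed value.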
06 XGBoost iterative filling
Iterative imputation models each feature that has missing values as a function of the other features, for example as a regression problem that predicts the missing entries. It can be seen as an extended version of the linear regression approach above, and it has produced very good results in many data competitions.
# loading modules
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
import xgboost

data = df.copy()

# setting up the imputer
# note: tree_method='gpu_hist' requires a GPU; remove it to run on CPU
imp = IterativeImputer(
    estimator=xgboost.XGBRegressor(
        n_estimators=5,
        random_state=1,
        tree_method='gpu_hist',
    ),
    missing_values=np.nan,
    max_iter=5,
    initial_strategy='mean',
    imputation_order='ascending',
    verbose=2,
    random_state=1,
)
data[:] = imp.fit_transform(data)
07 Nearest neighbor filling
As the name suggests, this fills missing values using a nearest-neighbor strategy; the most common tool is sklearn's KNNImputer. By default it finds the nearest samples using a Euclidean distance metric that ignores missing entries. Each missing feature value is then imputed from the non-missing values of that feature in the N nearest samples.
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]

# each missing value is replaced by the mean of that feature
# in the 2 nearest neighbours (nan-aware Euclidean distance)
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed_x = imputer.fit_transform(X)
08 Missing value indicator
This method is also very common: it converts the dataset's features into a corresponding binary matrix that marks whether each value is missing. The indicator columns can be fed to the model alongside the imputed features.
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[-1, -1, 1, 3],
              [4, -1, 0, -1],
              [8, -1, 1, 0]])

# boolean mask marking where the sentinel value -1 appears
# (by default, only features that contain missing values are returned)
indicator = MissingIndicator(missing_values=-1)
mask_missing_values_only = indicator.fit_transform(X)