[Recommended collection] 8 common missing-value imputation techniques you must master
2022-06-26 19:01:00 【I love Python data mining】
In many real-world datasets we inevitably encounter missing data. It may come from errors in recording, or the data may simply not exist (for example, a user did not fill in a field, or a stock was suspended and so has no trading record for that day). How these missing values are treated can be crucial for the model's final predictions, because missing data can lead to:
Dataset distortion: a large amount of missing data can distort the distribution of a variable, inflating or deflating the share of particular categories in the dataset.
Biased training and prediction: missing data introduces bias into the dataset, which can in turn bias the model during training and prediction.
In this article we introduce some missing-value imputation techniques commonly used in data competitions.
01 CCA (Complete Case Analysis)
Complete Case Analysis (CCA) is a very simple way to handle missing data: it directly deletes the rows that contain missing values, so that only rows with complete data are kept.
This strategy is direct and easy to implement, but it also discards a lot of information and can bias predictions. It is generally recommended only when the dataset is large enough, the proportion of missing data is small, and the values are missing at random.
data_cca = df.dropna(axis=0)  # keep only rows with no missing values
02 Arbitrary value filling
This is a useful technique that handles both numeric and categorical variables. The missing values of a column are grouped together and assigned a special sentinel value, for example 99999999 or -999999. In this way we preserve the fact that the value was missing, which in some special cases brings very good results. The drawback is also obvious: the strategy only suits some models, and for others, such as neural networks, it creates challenges for preprocessing, training, and prediction.
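A minimal sketch of this strategy with sklearn's `SimpleImputer` (the data frame and its columns here are hypothetical, for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical data with missing values
df = pd.DataFrame({"age": [25.0, np.nan, 40.0], "city": ["NY", np.nan, "LA"]})

# fill the numeric column with a sentinel value such as -999999
num_imp = SimpleImputer(strategy="constant", fill_value=-999999)
df[["age"]] = num_imp.fit_transform(df[["age"]])

# fill the categorical column with a sentinel label
cat_imp = SimpleImputer(strategy="constant", fill_value="MISSING")
df[["city"]] = cat_imp.fit_transform(df[["city"]])
```

The sentinel effectively turns "missing" into its own category or value that the model can learn from.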
03 Mode filling for categorical features
Replace the missing values with the most frequent value of the feature column, i.e. its mode. This method is simple to implement and quite commonly used, but when a large share of the data is missing, such filling skews the distribution heavily and can make predictions worse.
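Mode filling can be sketched with `SimpleImputer` and `strategy="most_frequent"` (the toy column below is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical categorical column: "red" is the most frequent value
df = pd.DataFrame({"color": ["red", "blue", np.nan, "red"]})

# replace missing values with the mode of the column
imp = SimpleImputer(strategy="most_frequent")
df[["color"]] = imp.fit_transform(df[["color"]])
```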
04 Statistic filling
Fill with a statistic, such as the mean or the median. Sometimes we can also use a strategy based on the normal distribution:
- Use values within 2 standard deviations of the mean: generate random numbers in the interval (mean - 2*std, mean + 2*std) to fill the missing values.
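Both variants can be sketched in a few lines of pandas/NumPy (the column is hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, np.nan, 3.0]})

# simple statistic fills
mean_filled = df["x"].fillna(df["x"].mean())
median_filled = df["x"].fillna(df["x"].median())

# random fill drawn uniformly from (mean - 2*std, mean + 2*std)
mu, sigma = df["x"].mean(), df["x"].std()
n_missing = df["x"].isna().sum()
random_filled = df["x"].copy()
random_filled[random_filled.isna()] = rng.uniform(
    mu - 2 * sigma, mu + 2 * sigma, size=n_missing
)
```

Note that the statistics are computed from the observed values only, so a heavily missing column will yield unreliable estimates.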
05 Linear regression filling
Fit a linear regression between two strongly correlated variables and use it to predict the missing values. Pay attention to the influence of outliers on the fit.
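A minimal sketch, assuming a hypothetical feature "a" that is complete and strongly correlated with a feature "b" that has gaps:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical data: "b" is roughly 2*a + 1 with some values missing
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.1, np.nan, 7.0, np.nan, 11.2],
})

# fit on the rows where "b" is observed
known = df["b"].notna()
reg = LinearRegression()
reg.fit(df.loc[known, ["a"]], df.loc[known, "b"])

# predict the missing values of "b" from the correlated feature "a"
df.loc[~known, "b"] = reg.predict(df.loc[~known, ["a"]])
```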
06 XGBoost iterative filling
Iterative imputation models each feature as a function of the other features, e.g. as a regression problem that predicts the missing values. It can be seen as an extension of the linear-regression approach above, and it has produced very good results in many data competitions.
# loading modules
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer
import numpy as np
import xgboost
data = df.copy()
# setting up the imputer: each feature with missing values is modelled
# as an XGBoost regression on the other features and imputed iteratively
imp = IterativeImputer(
    estimator=xgboost.XGBRegressor(
        n_estimators=5,
        random_state=1,
        tree_method='gpu_hist',  # requires a GPU; use 'hist' on CPU
    ),
    missing_values=np.nan,
    max_iter=5,
    initial_strategy='mean',
    imputation_order='ascending',
    verbose=2,
    random_state=1,
)
data[:] = imp.fit_transform(data)
07 Nearest-neighbour filling
As the name suggests, a nearest-neighbour strategy is used for filling; the common choice is KNNImputer. By default, a Euclidean distance metric that ignores missing entries is used to find the nearest samples, and each missing feature is imputed from the non-missing values of the n nearest samples.
import numpy as np
from sklearn.impute import KNNImputer
X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
# each missing entry is filled from its 2 nearest neighbours (uniform weights)
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed_x = imputer.fit_transform(X)
08 Missing-value indicator
This method is also very common: it converts the features of the dataset into a corresponding binary matrix that marks where the missing values are.
import numpy as np
from sklearn.impute import MissingIndicator
X = np.array([[-1, -1, 1, 3],
              [4, -1, 0, -1],
              [8, -1, 1, 0]])
# -1 marks a missing value; the result is a boolean mask of missing entries
indicator = MissingIndicator(missing_values=-1)
mask_missing_values_only = indicator.fit_transform(X)