[recommended collection] these 8 common missing value filling skills must be mastered
2022-06-26 19:01:00 【I love Python data mining】
In many datasets we inevitably encounter missing data. It may come from errors when the data was recorded, or the data may simply not exist (for example, a user left a field blank, or a stock was suspended so that day's trading record is empty). How these missing values are treated can play a crucial role in the model's final predictions, because missing data can lead to:
Dataset distortion: a large amount of missing data can distort the distribution of a variable, artificially increasing or decreasing the frequency of particular categories in the dataset.
Biased training and prediction: missing data can introduce bias into the dataset, which may in turn bias both model training and the resulting predictions.
This article introduces some missing value imputation techniques commonly used in data competitions.
01 CCA (Complete Case Analysis)
Complete Case Analysis (CCA) is a very simple way to handle missing data: it drops every row that contains a missing value, so that only rows with complete data are kept.
This strategy is direct and easy to implement, but it also discards a lot of information and can bias later predictions. It is generally recommended only when the dataset is large enough, the proportion of missing data is small, and the values are missing at random.
data_cca = df.dropna(axis=0)  # keep only the rows with no missing values
02 Arbitrary value filling
This is an important technique for filling missing values that works for both numerical and categorical variables. We treat the missing values as their own group and assign them a special sentinel value, such as 99999999 or -999999. This preserves the fact that the value was missing, which in some situations carries useful signal and can noticeably improve results. The drawback is also clear: such sentinel values suit some models better than others (for example, they can cause trouble for neural networks), which complicates preprocessing, training, and prediction.
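A minimal sketch of sentinel-value filling with pandas. The toy DataFrame, column names, and the sentinel -999999 are illustrative assumptions, not data from this article:

```python
import numpy as np
import pandas as pd

# Toy data with missing values (hypothetical example)
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "city": ["NY", None, "LA", "SF"],
})

# Numeric column: fill with a sentinel far outside the normal value range
df["age"] = df["age"].fillna(-999999)

# Categorical column: fill with an explicit "Missing" label
df["city"] = df["city"].fillna("Missing")

print(df)
```

Tree-based models usually tolerate such sentinels well, since they can split the sentinel into its own branch.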
03 Mode filling for categorical features
Replace the missing values with the most frequent value of the feature, i.e., use the column's mode as the replacement. This method is simple to implement and widely used, but when a large share of the data is missing it can heavily skew the distribution and make predictions worse.
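A short sketch of mode filling using scikit-learn's SimpleImputer with strategy="most_frequent" (the toy data is a hypothetical example):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical data (hypothetical example); "red" is the mode
X = pd.DataFrame({"color": ["red", "blue", np.nan, "red", np.nan]})

# Replace each missing entry with the most frequent value in its column
imputer = SimpleImputer(strategy="most_frequent")
X_filled = imputer.fit_transform(X)

print(X_filled.ravel())
```

The same imputer also works on numeric columns, where the mode is the most frequent number.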
04 Filling with statistics
Fill missing values with a statistic such as the mean or the median. Sometimes we can also use a normal-distribution strategy:
- Use values within 2 standard deviations of the mean: generate random numbers in the interval (mean - 2*std, mean + 2*std) to fill the missing values.
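The statistic fills and the mean-plus-minus-two-standard-deviations strategy above can be sketched as follows (the toy Series and the random seed are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

s = pd.Series([10.0, 12.0, np.nan, 11.0, np.nan, 13.0])

# Simple statistic fills
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

# Random fill: draw uniformly from (mean - 2*std, mean + 2*std)
mu, sigma = s.mean(), s.std()
n_missing = s.isna().sum()
random_values = rng.uniform(mu - 2 * sigma, mu + 2 * sigma, size=n_missing)

s_random = s.copy()
s_random[s_random.isna()] = random_values
print(s_random)
```

The random variant avoids piling every imputed value onto a single point, which the plain mean fill does.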
05 Using linear regression
When two variables are strongly correlated, fit a linear regression between them and fill the missing values with its predictions. Be careful: outliers can strongly distort a linear regression fit.
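A minimal sketch of regression-based filling, assuming a toy DataFrame in which y is strongly correlated with x (column names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: y is roughly 2*x, with two missing entries (hypothetical example)
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, np.nan, 8.1, np.nan],
})

# Fit the regression on the rows where y is known
known = df["y"].notna()
model = LinearRegression()
model.fit(df.loc[known, ["x"]], df.loc[known, "y"])

# Predict y where it is missing, using the fitted line
df.loc[~known, "y"] = model.predict(df.loc[~known, ["x"]])
print(df)
```

Because the fit here uses ordinary least squares, a single extreme outlier in the known rows would shift every imputed value.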
06 XGBoost iterative filling
Iterative imputation models each feature that has missing values as a function of the other features, for example as a regression problem that predicts the missing entries. It can be seen as an extended version of the linear regression approach above, and it has produced very good results in many data competitions.
# loading modules
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
import xgboost

data = df.copy()

# setting up the imputer
# note: tree_method='gpu_hist' requires a GPU; remove it to run on CPU
imp = IterativeImputer(
    estimator=xgboost.XGBRegressor(
        n_estimators=5,
        random_state=1,
        tree_method='gpu_hist',
    ),
    missing_values=np.nan,
    max_iter=5,
    initial_strategy='mean',
    imputation_order='ascending',
    verbose=2,
    random_state=1,
)
data[:] = imp.fit_transform(data)
07 Nearest neighbor filling
As the name suggests, this fills missing values using a nearest-neighbor strategy; the most common tool is sklearn's KNNImputer. By default it finds the nearest samples using a Euclidean distance metric that ignores missing entries. Each missing feature value is then imputed from the non-missing values of that feature in the N nearest samples.
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]

# each missing value is replaced by the mean of that feature
# in the 2 nearest neighbours (nan-aware Euclidean distance)
imputer = KNNImputer(n_neighbors=2, weights="uniform")
imputed_x = imputer.fit_transform(X)
08 Missing value indicator
This method is also very common: it converts the dataset's features into a corresponding binary matrix that marks whether each value is missing. The indicator columns can be fed to the model alongside the imputed features.
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[-1, -1, 1, 3],
              [4, -1, 0, -1],
              [8, -1, 1, 0]])

# boolean mask marking where the sentinel value -1 appears
# (by default, only features that contain missing values are returned)
indicator = MissingIndicator(missing_values=-1)
mask_missing_values_only = indicator.fit_transform(X)