当前位置:网站首页>[advanced data processing technology] data filtering, advanced data filling, initial and advanced data transformation
[advanced data processing technology] data filtering, advanced data filling, initial and advanced data transformation
2022-07-24 20:29:00 【Sunny qt01】
Data filtering :
Although some single variables can distinguish the overall customer group , But it may not be able to distinguish specific customer groups
Case study : Such as age , When applying for scoring , The field age has a certain degree of discrimination , The older the age, the higher the default rate ( Bad parts rate ) The lower the . But if the sample is divided into high-income groups and low-income groups according to income , It can be seen that the failure rate (bad%) In high-income clusters , The difference in age is not obvious ,

If we can use the customer segmentation model , Find suitable variables for each different customer group , It can greatly improve the discrimination of the overall model .
Case study : Credit card behavior scoring model
If the transaction period is less than five months, it will be excluded , There is not enough historical data as the source of independent variables
Delayed customers are applicable to the collection scoring model , To exclude
According to our business knowledge and statistical evidence, it is pointed out that all accounts are cleared (Transactor) And the loop makes the user (Revolver) There are significant differences in risk degree and risk form , So choose customers 【 Is circulation used 】 As the main way of grouping

Complete household clearance and circulation make users cut apart , Use different fields for modeling learning .
type 1: Data filtering mode 1

Divide the data into training sets and test sets ,8:2 about , Use income to differentiate data . Yes 3 Different customer groups build different models .
Then the test set is clustered in the same way .

Data from different groups , Use different models to predict .

type 2: Data filtering mode 2

Divide the data into training sets , Test set . Not this time... Percent 80 The training set of . Continue to model him in general . And then we do the test set 3 All groups make estimates
Understand the effectiveness of the model for different groups . Determine the marketing focus of the next wave , For example, after modeling , Less than 50 The group modeling effect of 10000 income is not good , Marketing focuses on the first two ,

Then the marketing effect will be restored after marketing , Then decide the marketing strategy of the next wave .
- Advanced filling techniques for missing values
First, let's talk about simple handling methods
A simple way to deal with missing data :
1 Direct neglect method : Delete the whole data ,
It is the simplest way to deal with data loss
(1) When a lot of data is collected , When missing data only accounts for a part , You can use this method
(2) Or when doing classification modeling , If the classification mark of data (class lable) It's empty , This data cannot be classified correctly , You can directly delete .( The only solution , Machine learning does not allow null values in the target field . There must be a target field , Otherwise, failure to learn will result in errors )
(3) When the missing data in the field is more than 50% , Delete directly
*(4) There is another special case , When you think that null values are another manifestation of behavior ,( When null values also make sense ), You can use the method of indicating variables (indicator variable) That is, dummy variables . Special methods for dealing with missing data , It is also when the missing data in the field is greater than 50 Directly delete .
Directly ignore the shortcomings of the method : When the proportion of data missing is large , It will cause a lot of data loss
2 Manual filling method :
(1) When the birthday field of a member data is missing , Directly ask employees to call members to ask about their birthday fields .
(2) Understand why data is missing , Missing reason , Then fill in with the appropriate value . For example, gender can be filled with ID number , Some proportional fields , It may be because there is no consumption , So it leads to control ,.
It's very accurate
Disadvantages of manual filling method : When there are many missing data , It will consume a lot of time and labor, and the burden is serious .
3( machine ) Automatic filling method :
If it is a category field , Such as marriage , Education and so on .
(1) You can fill in a general constant : For example, unknown (unknown) Become a new category

1 For undergraduate ,2 Bachelor degree or above ,0 It's empty ,3 For the unknown ,456 Is an outlier , Use it 0,456
(2) Fill in the mode of this field : But this method is not objective , You can use clustering , Use the group mode in the cluster ( The following figure can be based on credit To divide into groups to fill the mode )

The algorithm needs to be able to accept null values in input fields , Algorithms that can also be modeled
You can take KNN fill , Random forest filling ,XGboosting fill
Numeric fields
(1) Fill in a general constant : If fill in 0( But it needs to be checked (check) What it means )
(2) Fill in the average , You can also use clustering ( For example, use age groups to fill the average value of assets )
(3) A more accurate method is to use the model to calculate the more possible values : That is, the problem of filling in the missing value is solved as the problem of predicting the value category .

You need to be able to accept empty values in the input fields . The algorithm is also KNN fill , Random forest filling ,XGboosting fill
Data cleaning is done with training sets .
- Primary data conversion technology
Code of category field , Code of sequential field
Coding method of categorical variables :
One-hot encoding( Coding in machine learning is called )
Dummy( The code in statistics is called )

Virtual variable traps will appear when using virtual variables , That is, there will be a high degree of multicollinearity or high correlation between variables . such as Male and female Two dummy variables do one-hot encoding Coding will lead to high correlation , Like gender ,01,10, One field will be redundant . Therefore, a row of fields will generally be deleted . Make sure there is no high correlation
If a field has multiple description feature labels, it should be coded according to the following figure .

Sequential variable coding method :

As a numerical variable , Code at the same interval .
Coding disadvantages . The size of the value cannot be used directly 1234 code , Consider the degree , Like in the picture high It should be 3. But because there is a ratio in the middle medium A little more , become 4 了 , The effect will not be so ideal .
- Advanced data conversion technology
Category type fields generalize classification variables ( Rank variable )
Data generalization : Generally used for address, etc , The address is too small , It can be transformed into a city , Or region

Data generalization 2: When it is necessary to convert classified variables into numerical values . You can replace the category with the probability of the target field

Numerical field trend discretization :
Why should we discretize numerical data ?
Modeling experience has led us to find , Generally, the instability of numerical data is easy to lead to model errors .
There is often a situation , The predicted value of the training data is very accurate , But the prediction of the test data set becomes very inaccurate .
Most of the reason is that the field distribution of the training data set is inconsistent with the field probability distribution of the test data set .
Pictured , stay 25 At the age of , The overdue probability of the training data set is 0.1, But in the test data set is percent 100.
Pearson correlation coefficient shows that , Not only are they not used to , There are even opposite tendencies .

If age is an important input field for modeling, the accuracy of the two models must be very different
So numeric fields are discretized , It helps us build models .
Advantages of discretization of numeric fields :
1. Simplify data , Reduce data complexity , Make the data easier to interpret
2. It can support many algorithms that cannot handle numeric fields
For example, association rules (Association Rules) Algorithm
3. It can improve the stability of the classifier , And then improve the accuracy of the classification model
4. You can find the trend of this input field in the target field (TREND) Sex helps to interpret in the future ( So the best result of our discretization is to reflect the trend of the target field )
Using discretization , Cut out several intervals , To replace many data values in the value range
1. Manual separation method :
Based on experience , Experts suggest that , Cut the data , Such as age
2. Basic packing method (binning Method):
Equal width (Equal-Width-Interval) Packing method , The range of each box is the same
Equal frequency (Equal-Frequency-interval) Packing method , The frequency in each box is the same
If we equalize the distribution of the previous figure (8 Intervals , Each interval 6.125) Compartmentalize , The Pearson correlation coefficient can be obtained as 0.711 positive correlation , Stability and accuracy are greatly improved .

Mainly, it is also very important that the results can be easily understood and explained
A good discretization result that is easy to understand and explain , Input fields and target fields will have obvious trends
The figure below x The axis is the quota utilization ,y The shaft is the probability of deteriorating customers

According to business experience , The utilization rate of credit card quota becomes higher , The probability of customers' bad debts will also increase , But the result of discretization reflects fluctuations , Ups and downs . If we do business modeling , Even if adding this field will make the model better , Enterprises dare not use this field ,( Because it is uncertain whether there will be different trends in future data changes ) That is, it is inconsistent with business experience .

A good discretization result should be as shown in the figure above ,x Shaft with y The axis is monotonically increasing , Consistent with business experience .
Feasible approach : First, use equal width or equal frequency to set the value of the numerical field It is divided into 15~20 Between communities , Then merge them appropriately according to the distribution presented .
Case study :

The first line is divided into 15~20 Between communities ,

When the default rate goes from high to low, it is our cutting point , The graph can be merged into 4 Intervals .
4-21,22-41,42-61,62-72
because 62-72 There is only one number in this range , Therefore, it is additionally merged into 42-72
The final cut is 3 Intervals

As can be seen from the above figure , The result is quite ideal , The default rate is gradually increasing , Conform to business common sense .
This is a better
边栏推荐
- Application layer - typical protocol analysis
- [msp430g2553] graphical development notes (2) system clock and low power consumption mode
- [shader realizes the flicker effect of three primary colors of television signal _shader effect Chapter 5]
- Solve the problem of error l6218e undefined symbol XXX
- Leetcode 560 and the subarray of K (with negative numbers, one-time traversal prefix and), leetcode 438 find all alphabetic ectopic words in the string (optimized sliding window), leetcode 141 circula
- Cloud native observability tracking technology in the eyes of Baidu engineers
- Fluoronisin peptide nucleic acid oligomer complex | regular active group alkyne, SH thiol alkynyl modified peptide nucleic acid
- clip:learning transferable visual models from natural language supervision
- Safe way -- Analysis of single pipe reverse connection back door
- Hook 32-bit function using the method modified to JMP instruction
猜你喜欢

Implementation of OA office system based on JSP

Lights of thousands of families in the year of xinchou

Todolist case

Substr and substring function usage in SQL

C form application treeview control use

Valdo2021 - vascular space segmentation in vascular disease detection challenge (I)

Apache atlas version 2.2 installation

Evolution of network IO model

Redis basic knowledge, application scenarios, cluster installation

Leetcode 560 and the subarray of K (with negative numbers, one-time traversal prefix and), leetcode 438 find all alphabetic ectopic words in the string (optimized sliding window), leetcode 141 circula
随机推荐
Delete remote and local branches
Generate self signed certificate: generate certificate and secret key
Luogu - p1616 crazy herb picking
How to integrate Kata in kubernetes cluster
What does software testing need to learn?
[sciter]: window communication
Unitywebgl project summary (unfinished)
Leetcode 206 reverse linked list, 3 longest substring without repeated characters, 912 sorted array (fast row), the kth largest element in 215 array, 53 largest subarray and 152 product largest subarr
How to learn automated testing
Valdo2021 - vascular space segmentation in vascular disease detection challenge (I)
[JVM] selection of garbage collector
How to learn automated testing? Can you teach yourself?
Istio II traffic hijacking process
Analysis of xmldecoder parsing process
PD user manual
Machine learning job interview summary: five key points that resume should pay attention to
Solve the problem of error l6218e undefined symbol XXX
Docker builds redis and clusters
Hook 32-bit function using the method modified to JMP instruction
[training Day8] series [matrix multiplication]