
[Advanced data processing techniques] Data filtering, advanced missing-value filling, basic and advanced data transformation

2022-07-24 20:29:00 【Sunny qt01】

Data filtering:

A single variable may discriminate well across the whole customer base and still fail to discriminate within a specific customer group.

Case study: take age. In application scoring, age has some discriminating power: the older the applicant, the lower the default rate (bad rate). But if the sample is split by income into high-income and low-income groups, the bad rate (bad%) within the high-income group barely differs by age.

If we use a customer segmentation model and find suitable variables for each customer group, the discrimination of the overall model can improve greatly.

Case study: credit card behavior scoring model

Accounts with fewer than five months of transaction history are excluded, because they do not have enough historical data to serve as the source of independent variables.

Delinquent customers fall under the collection scoring model, so they are excluded as well.

Business knowledge and statistical evidence both indicate that customers who pay off their balance in full (Transactor) and customers who carry a revolving balance (Revolver) differ significantly in risk level and risk pattern, so whether the customer uses revolving credit is chosen as the main grouping criterion.

Transactors and revolvers are split apart, and each group is modeled with its own fields.

Type 1: data filtering mode 1

Split the data into a training set and a test set, roughly 8:2, and use income to segment the data. Build a separate model for each of the 3 customer groups.

Then segment the test set in the same way.

Data from each group is predicted with that group's model.
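The flow of mode 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the income cut-offs in `income_segment`, the toy `fit_bad_rate_model` (a bad rate per age decade standing in for a real scoring model), and the synthetic records are all hypothetical and not taken from the article.

```python
import random
from collections import defaultdict

def income_segment(income):
    # Hypothetical cut-offs: the article only says income splits the data into 3 groups.
    if income < 500_000:
        return "low"
    if income < 1_000_000:
        return "mid"
    return "high"

def fit_bad_rate_model(rows):
    # Toy stand-in for a real scoring model: bad rate per age decade,
    # falling back to the segment's overall bad rate for unseen decades.
    counts = defaultdict(lambda: [0, 0])  # decade -> [bad count, total count]
    for row in rows:
        counts[row["age"] // 10][0] += row["bad"]
        counts[row["age"] // 10][1] += 1
    overall = sum(row["bad"] for row in rows) / len(rows)
    table = {decade: bad / total for decade, (bad, total) in counts.items()}
    return lambda row: table.get(row["age"] // 10, overall)

# Synthetic records, for illustration only.
random.seed(0)
data = [
    {"age": random.randint(20, 69),
     "income": random.choice([200_000, 700_000, 1_500_000]),
     "bad": random.randint(0, 1)}
    for _ in range(500)
]

# Roughly 8:2 split into training and test sets.
cut = int(len(data) * 0.8)
train, test = data[:cut], data[cut:]

# Segment the training set by income and fit one model per group.
train_by_segment = defaultdict(list)
for row in train:
    train_by_segment[income_segment(row["income"])].append(row)
models = {seg: fit_bad_rate_model(rows) for seg, rows in train_by_segment.items()}

# Segment the test set the same way and score each row with its group's model.
scores = [models[income_segment(row["income"])](row) for row in test]
```

The key point is that the same segmentation function is applied to both sets, so every test record is routed to the model trained on its own group.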

Type 2: data filtering mode 2

Split the data into a training set and a test set. This time the 80% training set is not segmented; a single overall model is built on it. The test set is then split into the 3 groups and the model is evaluated on each group.

This shows how effective the model is for each group and determines the focus of the next marketing wave. For example, if the model performs poorly on the group with income below 500,000, marketing focuses on the other two groups.

After the campaign, the marketing results are collected and used to decide the strategy for the next wave.
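Mode 2 amounts to scoring one overall model separately on each group of the test set. A minimal sketch, assuming a hypothetical `overall_model` rule in place of the real model, hypothetical income cut-offs, and made-up test records:

```python
from collections import defaultdict

def income_segment(row):
    # Hypothetical cut-offs; the article only mentions 3 income groups.
    if row["income"] < 500_000:
        return "low"
    if row["income"] < 1_000_000:
        return "mid"
    return "high"

def overall_model(row):
    # Stand-in for the single model trained on the unsegmented training set.
    return 1 if row["age"] < 30 else 0

def accuracy_by_segment(rows, predict, segment_of):
    # Share of correct predictions inside each segment of the test set.
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        seg = segment_of(row)
        totals[seg] += 1
        hits[seg] += int(predict(row) == row["bad"])
    return {seg: hits[seg] / totals[seg] for seg in totals}

# Made-up test records, for illustration only.
test_rows = [
    {"age": 25, "income": 300_000, "bad": 0},
    {"age": 45, "income": 300_000, "bad": 1},
    {"age": 25, "income": 800_000, "bad": 1},
    {"age": 50, "income": 800_000, "bad": 0},
    {"age": 28, "income": 2_000_000, "bad": 1},
    {"age": 60, "income": 2_000_000, "bad": 0},
]

report = accuracy_by_segment(test_rows, overall_model, income_segment)
print(report)
```

In this made-up data the model is right on every mid- and high-income record and wrong on every low-income one, which is the kind of per-group difference the evaluation is meant to surface.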

Advanced filling techniques for missing values

First, the simple handling methods.

Simple ways to handle missing data:

1. Ignore (deletion) method: delete the whole record.

It is the simplest way to handle missing data.

(1) When a large amount of data has been collected and missing data makes up only a small part of it, this method can be used.

(2) In classification modeling, if a record's class label is empty, the record cannot be used for classification and can be deleted directly. (This is the only option: machine learning does not allow null values in the target field. Every record must have a target value, otherwise training fails with an error.)

(3) When more than 50% of the values in a field are missing, delete the field directly.

*(4) A special case: when you believe a null value is itself a form of behavior (that is, the null is meaningful), you can use an indicator variable, i.e. a dummy variable. This is a special way of handling missing data, and it also applies when more than 50% of a field is missing and the field would otherwise be deleted directly.

Drawback of the ignore method: when a large share of the data is missing, a lot of data is lost.
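The deletion rules and the indicator variable can be sketched together as below. The function name `clean_missing`, its parameters, and the sample records are hypothetical; `None` stands for a missing value, and all records are assumed to share the same fields.

```python
def clean_missing(rows, target, max_missing_ratio=0.5, indicator_fields=()):
    # Rule (2): records with an empty target cannot be used and are dropped.
    kept = [dict(row) for row in rows if row.get(target) is not None]
    if not kept:
        return kept
    fields = [f for f in kept[0] if f != target]
    for field in fields:
        missing = sum(row[field] is None for row in kept)
        if field in indicator_fields:
            # Rule (4): the null is meaningful, so record it as a 0/1 indicator.
            for row in kept:
                row[field + "_missing"] = int(row[field] is None)
        elif missing / len(kept) > max_missing_ratio:
            # Rule (3): more than 50% missing, so the whole field is dropped.
            for row in kept:
                del row[field]
    return kept

# Made-up records, for illustration only.
rows = [
    {"age": 30, "phone": None, "last_purchase": None, "label": 1},
    {"age": None, "phone": None, "last_purchase": 12, "label": 0},
    {"age": 41, "phone": None, "last_purchase": None, "label": None},
    {"age": 52, "phone": "555", "last_purchase": None, "label": 0},
]

clean = clean_missing(rows, target="label", indicator_fields=("last_purchase",))
```

Here the third record is dropped for its empty label, `phone` is dropped because two of the three remaining values are missing, and `last_purchase` is kept alongside a new `last_purchase_missing` column.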

2. Manual filling method:

(1) When the birthday field of a member record is missing, have staff call the member and ask for the birthday.

(2) Understand why the data is missing, then fill in an appropriate value. For example, gender can be filled in from the ID number, and some ratio fields may be null simply because there was no spending.

This method is very accurate.

Drawback of manual filling: when a lot of data is missing, it consumes a great deal of time and labor.

3. (Machine) automatic filling method:

For categorical fields, such as marital status or education:

(1) Fill in a global constant: for example, "unknown" becomes a new category.

For example, with 1 = undergraduate, 2 = above undergraduate, 0 = empty, 3 = unknown, and 4, 5, 6 = outliers, the constant is used for the values 0, 4, 5 and 6.

(2) Fill in the mode of the field. On its own this is not very objective, so you can group the records first and use the mode within each group (in the figure below, records can be grouped by credit and filled with the group mode).

(3) A more accurate method is to use a model to estimate the most likely value, that is, to treat missing-value filling as a classification prediction problem.

The algorithm must be able to accept null values in its input fields and still build a model.

Options include KNN filling, random forest filling, and XGBoost filling.
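Filling a categorical field with the mode of its group, as in method (2), might look like the sketch below. The function name `fill_with_group_mode`, the field names, and the records are hypothetical; `None` marks a missing value, and the field is assumed to have at least one known value overall.

```python
from collections import Counter, defaultdict

def fill_with_group_mode(rows, field, group_field):
    # Count known values overall and inside each group.
    overall = Counter(row[field] for row in rows if row[field] is not None)
    by_group = defaultdict(Counter)
    for row in rows:
        if row[field] is not None:
            by_group[row[group_field]][row[field]] += 1

    filled = []
    for row in rows:
        row = dict(row)  # leave the input records untouched
        if row[field] is None:
            # Use the group's mode; fall back to the overall mode
            # when the group has no known values at all.
            counts = by_group.get(row[group_field]) or overall
            row[field] = counts.most_common(1)[0][0]
        filled.append(row)
    return filled

# Made-up records, for illustration only.
rows = [
    {"credit": "good", "education": "bachelor"},
    {"credit": "good", "education": "bachelor"},
    {"credit": "good", "education": None},
    {"credit": "bad", "education": "high school"},
    {"credit": "bad", "education": "high school"},
    {"credit": "bad", "education": "bachelor"},
    {"credit": "bad", "education": None},
]

filled = fill_with_group_mode(rows, field="education", group_field="credit")
```

The two missing values receive different fills because each takes the most frequent education level of its own credit group.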

Numeric fields

(1) Fill in a global constant, such as 0 (but check what 0 means for that field).

(2) Fill in the mean. Grouping can be used here as well (for example, fill asset values with the mean of each age group).

(3) A more accurate method is to use a model to estimate the most likely value, that is, to treat missing-value filling as a numeric prediction (regression) problem.

The algorithm must again accept null values in its input fields. The options are again KNN filling, random forest filling, and XGBoost filling.

Data cleaning is done using the training set.
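As a rough illustration of KNN filling for a numeric field, the sketch below fills a missing value with the mean over the k nearest training records. The function `knn_impute`, the field names, and the data are hypothetical. It also simplifies: the features used for the distance are assumed to be complete and on comparable scales, which a real implementation would have to handle.

```python
import math

def knn_impute(train_rows, row, field, features, k=3):
    # Neighbours come only from training records where the field is known.
    candidates = [t for t in train_rows if t[field] is not None]
    target_point = [row[f] for f in features]
    candidates.sort(key=lambda t: math.dist([t[f] for f in features], target_point))
    nearest = candidates[:k]
    # Numeric field: use the mean of the neighbours' values.
    return sum(t[field] for t in nearest) / len(nearest)

# Made-up training records, for illustration only.
train = [
    {"age": 25, "tenure": 1, "assets": 10.0},
    {"age": 27, "tenure": 2, "assets": 14.0},
    {"age": 30, "tenure": 3, "assets": 18.0},
    {"age": 55, "tenure": 20, "assets": 90.0},
    {"age": 60, "tenure": 25, "assets": None},
]

test_row = {"age": 26, "tenure": 2, "assets": None}
test_row["assets"] = knn_impute(train, test_row, "assets", ["age", "tenure"], k=3)
```

Because the neighbours are drawn from the training set only, the same fill logic can be applied to test records without looking at test-set values. For a categorical field, the mean would be replaced by the most common value among the neighbours.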

Basic data transformation techniques

Encoding of categorical fields and of ordinal fields

Encoding methods for categorical variables:

One-hot encoding (the name used in machine learning)

Dummy coding (the name used in statistics)

Dummy variables bring the dummy variable trap: high multicollinearity, that is, high correlation, among the variables. For example, one-hot encoding gender into two dummy variables, male and female (01 and 10), yields two highly correlated fields, one of which is redundant. So one of the dummy columns is generally deleted to make sure no high correlation remains.
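The redundancy is easy to see in a small sketch. The helper `one_hot` and its `drop_first` parameter are hypothetical names used for illustration, not a library API.

```python
def one_hot(values, drop_first=True):
    # One 0/1 column per category; optionally drop the first category
    # (in sorted order) to avoid the dummy variable trap.
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return {cat: [int(v == cat) for v in values] for cat in kept}

gender = ["male", "female", "female", "male"]

full = one_hot(gender, drop_first=False)
reduced = one_hot(gender, drop_first=True)

# With both columns, each one is fully determined by the other.
redundant = all(m + f == 1 for m, f in zip(full["male"], full["female"]))
```

With both columns kept, `male` is always `1 - female`, so dropping one column loses no information.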

If a field carries several descriptive labels at once, it should be encoded as shown in the figure below.

Encoding method for ordinal variables:

Treat the field as a numeric variable and code the levels at equal intervals.

Drawback of this coding: the levels cannot always be coded directly as 1, 2, 3, 4, because the degree each level represents has to be considered. In the figure, high should be 3, but because there is a level slightly above medium in between, it becomes 4, and the result is less ideal.
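The effect described above can be reproduced with a small sketch. The helper `encode_ordinal` and the level names (including "medium-high") are hypothetical, since the original figure is not available.

```python
def encode_ordinal(values, ordered_levels):
    # Equal-interval coding: the i-th level in the ordering gets code i + 1.
    codes = {level: i + 1 for i, level in enumerate(ordered_levels)}
    return [codes[v] for v in values]

ratings = ["low", "high", "medium"]

three_levels = encode_ordinal(ratings, ["low", "medium", "high"])
four_levels = encode_ordinal(ratings, ["low", "medium", "medium-high", "high"])
```

With three levels, high is coded 3. Adding one intermediate level pushes it to 4, although the meaning of "high" has not changed, which is the weakness of equal-interval coding.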

Advanced data transformation techniques

Generalization of categorical fields (categorical and ordinal variables)

Data generalization 1: generally used for fields such as address. An address is too fine-grained, so it can be generalized to a city or a region.

Data generalization 2: when a categorical variable has to be converted into a numeric value, each category can be replaced with the probability of the target field within that category.
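Replacing a category with the target probability can be sketched as follows, assuming a binary 0/1 target. The helper `target_rate_by_category` and the data are hypothetical; in practice the rates would be computed on the training set and then applied to new data.

```python
from collections import defaultdict

def target_rate_by_category(categories, targets):
    # Mean of the 0/1 target inside each category.
    sums = defaultdict(int)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in counts}

# Made-up data, for illustration only.
city = ["A", "A", "B", "B", "B", "C"]
bad = [1, 0, 0, 0, 1, 1]

rates = target_rate_by_category(city, bad)
encoded = [rates[c] for c in city]  # numeric replacement for the category
```

Each city name is replaced by the share of bad customers observed in that city, so the field becomes numeric while keeping its relation to the target.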

Trend-based discretization of numeric fields:

Why should we discretize numeric data?

Modeling experience shows that instability in numeric data easily leads to model errors.

A common situation: predictions on the training data are very accurate, but predictions on the test data set become very inaccurate.

Most of the time the reason is that a field's distribution in the training data set is inconsistent with its distribution in the test data set.

As shown in the figure, at age 25 the overdue probability is 0.1 in the training data set but 100% in the test data set.

The Pearson correlation coefficient shows that the two are not only uncorrelated but even tend in opposite directions.

If age is an important input field for modeling, the model's accuracy on the two data sets is bound to differ greatly.

So discretizing numeric fields helps us build models.
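The comparison between the two data sets can be made concrete by computing the Pearson correlation coefficient between the per-age overdue rates of the training set and of the test set. The rates below are made up for illustration and are not the values from the article's figure.

```python
import math

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the spreads.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical overdue rates at four ages, in each data set.
train_rate = [0.1, 0.3, 0.2, 0.4]
test_rate = [1.0, 0.2, 0.5, 0.1]

r = pearson(train_rate, test_rate)
```

A negative `r` means the field points in opposite directions in the two data sets, which is the unstable situation that discretization is meant to reduce.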

Advantages of discretizing numeric fields:

1. It simplifies the data and reduces its complexity, making the data easier to interpret.

2. It makes it possible to use algorithms that cannot handle numeric fields,

for example the association rules (Association Rules) algorithm.

3. It can improve the stability of the classifier and, in turn, the accuracy of the classification model.

4. It reveals the trend (TREND) of the input field against the target field, which helps later interpretation (so the best discretization result is one that reflects the trend in the target field).

Discretization cuts the value range into several intervals, which replace the many individual values in that range.

1. Manual binning:

Cut the data based on experience and expert advice, for example for age.

2. Basic binning methods (binning method):

Equal-width (Equal-Width-Interval) binning: every bin covers the same range.

Equal-frequency (Equal-Frequency-Interval) binning: every bin holds the same number of records.

If we apply equal-width binning to the distribution in the previous figure (8 intervals, each 6.125 wide), the Pearson correlation coefficient becomes 0.711, a positive correlation, and stability and accuracy improve greatly.
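Both basic binning methods fit in a few lines. The helper names are hypothetical, and the age range of 21 to 70 is an assumption chosen only because it gives 8 bins of width 6.125; the article does not state the actual range.

```python
def equal_width_edges(values, n_bins):
    # Bin edges that split [min, max] into n_bins intervals of equal width.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def bin_index(value, edges):
    # Index of the bin holding value; the last bin includes the right edge.
    for i in range(len(edges) - 2):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2

def equal_frequency_bins(values, n_bins):
    # Sort the values and cut them into n_bins groups of (nearly) equal size.
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

# Assumed age range 21..70: (70 - 21) / 8 = 6.125 per bin.
ages = [21, 24, 30, 35, 41, 48, 56, 63, 70]
edges = equal_width_edges(ages, 8)
width = edges[1] - edges[0]

groups = equal_frequency_bins(list(range(1, 13)), 4)
```

Equal-width bins can end up with very different record counts when the data is skewed, while equal-frequency bins keep the counts balanced at the cost of uneven ranges.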

It is equally important that the results are easy to understand and explain.

In a good discretization result that is easy to understand and explain, the input field shows a clear trend against the target field.

In the figure below, the x-axis is credit limit utilization and the y-axis is the probability of the customer going bad.

According to business experience, as credit limit utilization rises, the probability of bad debt rises too. But this discretization result fluctuates up and down. In business modeling, even if adding this field improves the model, the business would not dare use it (because it is uncertain whether the trend will change in future data); in other words, it is inconsistent with business experience.

A good discretization result should look like the figure above: y increases monotonically with x, consistent with business experience.

A feasible approach: first use equal-width or equal-frequency binning to divide the numeric field into 15~20 intervals, then merge them appropriately according to the distribution they show.

Case study:

The first row divides the values into 15~20 intervals.

Each point where the default rate turns from high to low is a cut point, so the bins in the figure can be merged into 4 intervals:

4-21, 22-41, 42-61, 62-72

Because the 62-72 interval contains only one value, it is merged further into 42-72.

The final result is 3 intervals.

As the figure above shows, the result is quite good: the default rate rises steadily, which matches business common sense.

This is a better discretization result.
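The merging procedure from the case study can be sketched as below. Everything in it is an illustration under assumptions: the fine bins and their counts are made up (and fewer than 15~20 for brevity), the helper names are hypothetical, and "only one value" is interpreted as an interval holding too few records, controlled by a `min_total` threshold.

```python
def merge_at_drops(bins):
    # bins: (low, high, bad_count, total) per fine bin, in ascending order.
    # A new merged interval starts wherever the bad rate drops.
    merged = [list(bins[0])]
    prev_rate = bins[0][2] / bins[0][3]
    for low, high, bad, total in bins[1:]:
        rate = bad / total
        if rate < prev_rate:
            merged.append([low, high, bad, total])
        else:
            merged[-1][1] = high
            merged[-1][2] += bad
            merged[-1][3] += total
        prev_rate = rate
    return merged

def absorb_small(intervals, min_total):
    # Fold any interval with too few records into the interval before it.
    result = []
    for interval in intervals:
        if result and interval[3] < min_total:
            result[-1][1] = interval[1]
            result[-1][2] += interval[2]
            result[-1][3] += interval[3]
        else:
            result.append(list(interval))
    return result

# Made-up fine bins, for illustration only.
fine_bins = [
    (4, 12, 2, 100), (13, 21, 6, 100),
    (22, 31, 5, 100), (32, 41, 11, 100),
    (42, 51, 10, 100), (52, 61, 18, 100),
    (62, 72, 3, 20),
]

merged = merge_at_drops(fine_bins)
final = absorb_small(merged, min_total=50)
final_rates = [bad / total for _, _, bad, total in final]
```

On this data the first step yields 4-21, 22-41, 42-61 and 62-72, and the second step folds the thin last interval into 42-72, leaving 3 intervals whose bad rates rise from one interval to the next. Cutting at drops does not guarantee a monotonic result in general, so the merged rates should still be checked.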

Original site

Copyright notice
This article was written by [Sunny qt01]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/203/202207210521485645.html
