Credit default prediction based on a simplified scorecard, SMOTE sampling, and random forest
2022-07-27 22:16:00 【Extension Research Office】
Full text: http://tecdat.cn/?p=27949
Original source: the Tecdat official WeChat account
Author: Youming Zhang
With the rapid development of the Internet economy, the scale of personal credit has grown explosively in recent years, and credit risk management and control has long been a hot research topic for financial institutions. The goal of credit default prediction is twofold: first, to help debtors use the model to make sound financial decisions; second, to let creditors predict whether a borrower will fall into financial difficulties after the loan is issued. We use data from the LendingClub credit platform as our credit data sample and build a classic traditional credit application scorecard model and a random forest prediction model to help decide whether to lend. The dataset contains 240,000 records and 59 features.
Solution
Task / Goal
Based on the applicants' indicator data, we use machine learning algorithms to solve the binary classification problem of whether or not to lend.
Data source preparation
Before working with the data, let's first understand its basic structure to prepare for the subsequent data preprocessing, feature engineering, and modeling. First, we define the target feature as loan_status. In the original dataset, "fully paid" means the loan was completely repaid and "charged off" means the loan was written off as bad debt. For the convenience of the logistic regression algorithm used later, we directly convert it into a numerical variable, using 1 to indicate default and 0 to indicate normal repayment.
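As a minimal sketch of this encoding step (the file name and the exact status strings in the raw file are assumptions, since the post does not show its loading code):

```python
import pandas as pd

# Load the LendingClub sample and keep only the two terminal loan statuses.
# "lending_club.csv" and the exact status spellings are assumptions.
df = pd.read_csv("lending_club.csv")
df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])]

# 1 = default (charged off), 0 = normal repayment (fully paid)
df["loan_status"] = df["loan_status"].map({"Charged Off": 1, "Fully Paid": 0})
```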
Data exploration
Because the algorithms we use are a logistic-regression-based scorecard and a random forest model, linear correlation between features affects model building, so we use a heatmap to show the correlation between features.
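A possible way to produce such a heatmap with seaborn (the original code and styling are not shown, so this is only an illustrative sketch):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric features, drawn as a heatmap.
corr = df.select_dtypes("number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="viridis")  # with this colormap, lighter cells mean larger values
plt.show()
```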

The lighter the color in the figure, the stronger the correlation between the corresponding pair of features; these are the features we need to watch in the preprocessing and feature engineering steps.
Data preprocessing
For missing values, we delete features with more than 80% of their values missing and fill the rest with the mode. For single-valued variables, we delete any variable in which one value accounts for more than 95% of the records. Because there are few outliers, the corresponding rows are simply dropped. The figure below shows that the sample is imbalanced.
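The rules above could be implemented roughly as follows (the 80% and 95% thresholds come from the text; everything else is an assumption):

```python
import pandas as pd

# Drop features with more than 80% missing values.
missing_ratio = df.isnull().mean()
df = df.drop(columns=missing_ratio[missing_ratio > 0.8].index)

# Fill the remaining missing values with each column's mode.
for col in df.columns[df.isnull().any()]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Drop near-constant features where a single value covers more than 95% of rows.
dominant = df.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])
df = df.drop(columns=dominant[dominant > 0.95].index)
```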

We use the SMOTE oversampling algorithm to balance the two classes.
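A minimal sketch with imbalanced-learn's SMOTE (it assumes all features are already numeric, so in practice it runs after the encoding step below; the variable names and random seed are assumptions):

```python
from imblearn.over_sampling import SMOTE

X = df.drop(columns=["loan_status"])
y = df["loan_status"]

# Oversample the minority (default) class until the two classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(y_res.value_counts())
```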

Finally, we apply LabelEncoder to the categorical features emp_length, home_ownership, verification_status, term, addr_state, and purpose to convert them into numerical codes.
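A sketch of this encoding step (the column names follow the text; casting to string is an assumption to handle any residual missing values):

```python
from sklearn.preprocessing import LabelEncoder

cat_cols = ["emp_length", "home_ownership", "verification_status",
            "term", "addr_state", "purpose"]

# Replace each categorical value with an integer code.
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```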
Feature engineering
We know there is collinearity between the features. After screening with two statistics, VIF and the correlation coefficient, we obtain the new heatmap below; the collinearity is clearly much reduced.
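One way to compute VIF for this screening, using statsmodels (the post does not state its cutoffs, so the usual VIF > 10 rule is only an assumption):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

num_df = df.select_dtypes("number").drop(columns=["loan_status"])

vif = pd.Series(
    [variance_inflation_factor(num_df.values, i) for i in range(num_df.shape[1])],
    index=num_df.columns,
)
print(vif.sort_values(ascending=False))
# Features with a very high VIF (e.g. > 10) or a very high pairwise correlation
# are candidates for removal.
```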

Then we bin each feature so as to maximize its IV (information value) and delete features whose IV is small. The binning results for some of the continuous features are shown below.

The binned results are then WOE-transformed, which lets us see the substantive distance between the bins.
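The original binning code is not shown; as an illustrative sketch, the helper below does equal-frequency binning, computes the per-bin WOE and the feature's IV, and replaces the raw values with WOE (the bin count, the smoothing constant, and the example column annual_inc are assumptions):

```python
import numpy as np
import pandas as pd

def woe_iv_transform(x, y, bins=10):
    """Bin a continuous feature, compute per-bin WOE and the feature's IV,
    and return the WOE-encoded column together with the IV."""
    binned = pd.qcut(x, q=bins, duplicates="drop")
    counts = pd.crosstab(binned, y)                     # good (0) / bad (1) counts per bin
    dist_good = counts[0] / counts[0].sum()
    dist_bad = counts[1] / counts[1].sum()
    woe = np.log((dist_bad + 1e-6) / (dist_good + 1e-6))
    iv = ((dist_bad - dist_good) * woe).sum()
    return binned.map(woe.to_dict()).astype(float), iv

df["annual_inc_woe"], iv_annual_inc = woe_iv_transform(df["annual_inc"], df["loan_status"])
# Features whose IV is very small (e.g. < 0.02) would be dropped.
```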
Splitting into training and test sets
We use sklearn to split the dataset: the training set is used to fit the model, the validation set is used to tune parameters, and the test set is used to evaluate the model's predictive ability.
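A minimal sketch of the split with sklearn (the split ratios and seed are assumptions, as the post does not state them):

```python
from sklearn.model_selection import train_test_split

X = df_woe.drop(columns=["loan_status"])   # df_woe: the WOE-encoded dataset (assumed name)
y = df_woe["loan_status"]

# Hold out a test set, then carve a validation set out of the remaining data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)
```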
Modeling
Simplified scorecard model
The innovation here lies in simplifying the construction of the scorecard model: we do not generate an actual scorecard. Instead, we take a core idea of the scorecard model, the WOE transformation, and combine it with a logistic regression model to obtain a simplified scorecard model.
There are many ways to process features. We choose the WOE transformation because WOE-transformed variables have a monotonic relationship with the linear expression of logistic regression, which better measures the quantitative relationship between groups.
Random forests
A forest is built in a random way. It consists of many decision trees, and there is no correlation between the individual trees. Once the forest is built, each time a new sample arrives, every decision tree in the forest judges which class the sample should belong to (for classification), and the class chosen most often becomes the prediction for that sample.
Simplified scorecard
We replace each feature's raw values in the dataset with the WOE value of its bin, and use a logistic regression model for classification. The model's accuracy reaches above 0.7, with AUC = 0.78 and KS = 0.41, indicating good predictive power. We also plot the learning curve on the validation set and find that the accuracy does not improve significantly: after ten rounds it improves by only 0.02.
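A sketch of fitting the simplified scorecard (logistic regression on the WOE features) and computing AUC and KS; the 0.78 and 0.41 figures above come from the original run, not from this code:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

proba = lr.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba)
ks = (tpr - fpr).max()                     # KS statistic
print(f"accuracy={lr.score(X_test, y_test):.3f}, "
      f"AUC={roc_auc_score(y_test, proba):.3f}, KS={ks:.3f}")
```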

Random forests
Random forest uses the bagging method, with decision trees as weak learners, to build a strong learner. It can accommodate more information and, by voting across many trees, largely avoids overfitting; here it serves as a reference for the scorecard. We use n to denote the number of underlying decision trees: in general, the more trees, the more stable the model, but the higher the demand on computing resources. With 41 trees, the model's accuracy reaches 90.9%, AUC reaches 0.93, and KS reaches 0.82, so the classification result is quite good.
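A sketch of the 41-tree random forest (only n_estimators comes from the text; the other hyperparameters are left at sklearn's defaults and the seed is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve

rf = RandomForestClassifier(n_estimators=41, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

proba = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba)
print(f"accuracy={rf.score(X_test, y_test):.3f}, "
      f"AUC={roc_auc_score(y_test, proba):.3f}, KS={(tpr - fpr).max():.3f}")
```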

About the author

We sincerely thank Youming Zhang for his contribution to this article. He is skilled in machine learning, feature engineering, and data preprocessing.