Credit default prediction based on a simplified scorecard, SMOTE sampling, and random forest
2022-07-27 22:16:00 【Extension Research Office】
Full text: http://tecdat.cn/?p=27949
Original source: the Tecdat official WeChat account
Author: Youming Zhang
With the rapid development of the Internet economy, the scale of personal credit has grown explosively in recent years, and credit risk management and control has long been a hot research topic for financial institutions. The goal of credit default prediction is twofold: first, to help debtors use the model to make sound financial decisions; second, to let creditors predict whether a borrower will fall into financial difficulties after the loan is issued. We use data from the LendingClub credit platform as our credit data sample and build a classic traditional credit application scorecard model and a random forest prediction model to help decide whether to lend. The dataset contains 240,000 records and 59 features.
Solution
Task / Goal
Based on the applicants' indicator data, we use machine learning algorithms to solve the binary classification problem of whether or not to lend.
Data source preparation
Before working with the data, let's first understand its basic structure to prepare for the subsequent data preprocessing, feature engineering, and modeling. First, we define the target feature as loan_status. In the original dataset, "fully paid" means the loan was completely repaid and "charged off" means the loan was written off as bad debt. For the convenience of the logistic regression algorithm used later, we directly convert it into a numerical variable, using 1 to indicate default and 0 to indicate normal repayment.
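As a minimal sketch of this encoding step (the file name and the exact status strings in the raw file are assumptions, since the post does not show its loading code):

```python
import pandas as pd

# Load the LendingClub sample and keep only the two terminal loan statuses.
# "lending_club.csv" and the exact status spellings are assumptions.
df = pd.read_csv("lending_club.csv")
df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])]

# 1 = default (charged off), 0 = normal repayment (fully paid)
df["loan_status"] = df["loan_status"].map({"Charged Off": 1, "Fully Paid": 0})
```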
Data exploration
Because the algorithms we use are a logistic-regression-based scorecard and a random forest model, linear correlation between features affects model building, so we use a heatmap to show the correlation between features.
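A possible way to produce such a heatmap with seaborn (the original code and styling are not shown, so this is only an illustrative sketch):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric features, drawn as a heatmap.
corr = df.select_dtypes("number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="viridis")  # with this colormap, lighter cells mean larger values
plt.show()
```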

The lighter the color in the figure, the stronger the correlation between the corresponding pair of features; these are the features we need to watch in the preprocessing and feature engineering steps.
Data preprocessing
For missing values, we delete features with more than 80% of their values missing and fill the rest with the mode. For single-valued variables, we delete any variable in which one value accounts for more than 95% of the records. Because there are few outliers, the corresponding rows are simply dropped. The figure below shows that the sample is imbalanced.
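The rules above could be implemented roughly as follows (the 80% and 95% thresholds come from the text; everything else is an assumption):

```python
import pandas as pd

# Drop features with more than 80% missing values.
missing_ratio = df.isnull().mean()
df = df.drop(columns=missing_ratio[missing_ratio > 0.8].index)

# Fill the remaining missing values with each column's mode.
for col in df.columns[df.isnull().any()]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Drop near-constant features where a single value covers more than 95% of rows.
dominant = df.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])
df = df.drop(columns=dominant[dominant > 0.95].index)
```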

We use the SMOTE oversampling algorithm to balance the two classes.
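A minimal sketch with imbalanced-learn's SMOTE (it assumes all features are already numeric, so in practice it runs after the encoding step below; the variable names and random seed are assumptions):

```python
from imblearn.over_sampling import SMOTE

X = df.drop(columns=["loan_status"])
y = df["loan_status"]

# Oversample the minority (default) class until the two classes are balanced.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(y_res.value_counts())
```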

Finally, we apply LabelEncoder to the categorical features emp_length, home_ownership, verification_status, term, addr_state, and purpose to convert them into numerical codes.
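A sketch of this encoding step (the column names follow the text; casting to string is an assumption to handle any residual missing values):

```python
from sklearn.preprocessing import LabelEncoder

cat_cols = ["emp_length", "home_ownership", "verification_status",
            "term", "addr_state", "purpose"]

# Replace each categorical value with an integer code.
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```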
Feature engineering
We know there is collinearity between the features. After screening with two statistics, VIF and the correlation coefficient, we obtain the new heatmap below; the collinearity is clearly much reduced.
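One way to compute VIF for this screening, using statsmodels (the post does not state its cutoffs, so the usual VIF > 10 rule is only an assumption):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

num_df = df.select_dtypes("number").drop(columns=["loan_status"])

vif = pd.Series(
    [variance_inflation_factor(num_df.values, i) for i in range(num_df.shape[1])],
    index=num_df.columns,
)
print(vif.sort_values(ascending=False))
# Features with a very high VIF (e.g. > 10) or a very high pairwise correlation
# are candidates for removal.
```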

Then we bin each feature so as to maximize its IV (information value) and delete features whose IV is small. The binning results for some of the continuous features are shown below.

The binned results are then WOE-transformed, which lets us see the substantive distance between the bins.
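The original binning code is not shown; as an illustrative sketch, the helper below does equal-frequency binning, computes the per-bin WOE and the feature's IV, and replaces the raw values with WOE (the bin count, the smoothing constant, and the example column annual_inc are assumptions):

```python
import numpy as np
import pandas as pd

def woe_iv_transform(x, y, bins=10):
    """Bin a continuous feature, compute per-bin WOE and the feature's IV,
    and return the WOE-encoded column together with the IV."""
    binned = pd.qcut(x, q=bins, duplicates="drop")
    counts = pd.crosstab(binned, y)                     # good (0) / bad (1) counts per bin
    dist_good = counts[0] / counts[0].sum()
    dist_bad = counts[1] / counts[1].sum()
    woe = np.log((dist_bad + 1e-6) / (dist_good + 1e-6))
    iv = ((dist_bad - dist_good) * woe).sum()
    return binned.map(woe.to_dict()).astype(float), iv

df["annual_inc_woe"], iv_annual_inc = woe_iv_transform(df["annual_inc"], df["loan_status"])
# Features whose IV is very small (e.g. < 0.02) would be dropped.
```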
Splitting into training and test sets
We use sklearn to split the dataset: the training set is used to fit the model, the validation set is used to tune parameters, and the test set is used to evaluate the model's predictive ability.
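A minimal sketch of the split with sklearn (the split ratios and seed are assumptions, as the post does not state them):

```python
from sklearn.model_selection import train_test_split

X = df_woe.drop(columns=["loan_status"])   # df_woe: the WOE-encoded dataset (assumed name)
y = df_woe["loan_status"]

# Hold out a test set, then carve a validation set out of the remaining data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)
```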
Modeling
Simplified scorecard model
The innovation here lies in simplifying the construction of the scorecard model: we do not generate an actual scorecard. Instead, we take a core idea of the scorecard model, the WOE transformation, and combine it with a logistic regression model to obtain a simplified scorecard model.
There are many ways to process features. We choose the WOE transformation because WOE-transformed variables have a monotonic relationship with the linear expression of logistic regression, which better measures the quantitative relationship between groups.
Random forests
A forest is built in a random way. It consists of many decision trees, and there is no correlation between the individual trees. Once the forest is built, each time a new sample arrives, every decision tree in the forest judges which class the sample should belong to (for classification), and the class chosen most often becomes the prediction for that sample.
Simplified scorecard
We replace each feature's raw values in the dataset with the WOE value of its bin, and use a logistic regression model for classification. The model's accuracy reaches above 0.7, with AUC = 0.78 and KS = 0.41, indicating good predictive power. We also plot the learning curve on the validation set and find that the accuracy does not improve significantly: after ten rounds it improves by only 0.02.
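A sketch of fitting the simplified scorecard (logistic regression on the WOE features) and computing AUC and KS; the 0.78 and 0.41 figures above come from the original run, not from this code:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

proba = lr.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba)
ks = (tpr - fpr).max()                     # KS statistic
print(f"accuracy={lr.score(X_test, y_test):.3f}, "
      f"AUC={roc_auc_score(y_test, proba):.3f}, KS={ks:.3f}")
```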

Random forests
Random forest uses the bagging method, with decision trees as weak learners, to build a strong learner. It can accommodate more information and, by voting across many trees, largely avoids overfitting; here it serves as a reference for the scorecard. We use n to denote the number of underlying decision trees: in general, the more trees, the more stable the model, but the higher the demand on computing resources. With 41 trees, the model's accuracy reaches 90.9%, AUC reaches 0.93, and KS reaches 0.82, so the classification result is quite good.
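A sketch of the 41-tree random forest (only n_estimators comes from the text; the other hyperparameters are left at sklearn's defaults and the seed is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve

rf = RandomForestClassifier(n_estimators=41, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

proba = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba)
print(f"accuracy={rf.score(X_test, y_test):.3f}, "
      f"AUC={roc_auc_score(y_test, proba):.3f}, KS={(tpr - fpr).max():.3f}")
```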

About the author

We sincerely thank Youming Zhang for his contribution to this article. He is skilled in machine learning, feature engineering, and data preprocessing.