Financial risk control practice - decision tree rule mining template
2022-07-07 05:58:00 【Grateful_Dead424】
This dataset covers Didi drivers who take out oil (refueling) loans.
The bad-debt rate of the oil loan product is as high as 5%, which is far too high to be profitable, and the losses cannot be handled by fraud detection alone.
Drivers who apply for the oil loan already have an A-card (application scorecard) grade ranging from A to F. Originally only grade F was refused; for the oil loan, however, the product only avoids losing money when lending to grade-A applicants.
Didi cooperates with many gas stations, and the gas stations provide Didi with the drivers' refueling data.
Import related packages
import pandas as pd
import numpy as np
# suppress warnings
import warnings
warnings.filterwarnings("ignore")
Read the oil loan dataset
data = pd.read_excel("oil_data_for_tree.xlsx")
data.head()
View information about the data
set(data.class_new)
#{'A', 'B', 'C', 'D', 'E', 'F'}
Calculate the bad debt rate
set(data.bad_ind)
#{0, 1}
data.bad_ind.sum()/data.bad_ind.count()
#0.01776363887846035
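It can also help to look at how the existing A-card grades are distributed; this quick check is my addition and is not part of the original walkthrough:
data.class_new.value_counts()  # distribution of the existing A-card grades (A-F)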
The features are divided into 3 groups:
- org_lst : basic fields that need no special transformation, only deduplication
- agg_lst : numeric variables to be aggregated per user
- dstc_lst : categorical variables on which we count distinct values
org_lst = ['uid','create_dt','oil_actv_dt','class_new','bad_ind']
agg_lst = ['oil_amount','discount_amount','sale_amount','amount','pay_amount','coupon_amount','payment_coupon_amount']
dstc_lst = ['channel_code','oil_code','scene','source_app','call_source']
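As a quick sanity check of my own (not in the original), you can confirm that the three lists only reference columns that actually exist in the raw data:
# every listed feature name should be a column of the raw data
assert set(org_lst + agg_lst + dstc_lst).issubset(data.columns)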
Data reorganization
# copy the three groups of columns into one working DataFrame (missing values are checked next)
df = data[org_lst].copy()
df[agg_lst] = data[agg_lst].copy()
df[dstc_lst] = data[dstc_lst].copy()
Check for missing values
df.isna().sum()
#uid 0
#create_dt 4944
#oil_actv_dt 0
#class_new 0
#bad_ind 0
#oil_amount 4944
#discount_amount 4944
#sale_amount 4944
#amount 4944
#pay_amount 4944
#coupon_amount 4944
#payment_coupon_amount 4946
#channel_code 0
#oil_code 0
#scene 0
#source_app 0
#call_source 0
#dtype: int64
Fill in the create_dt feature
Missing create_dt values are filled with oil_actv_dt, and only the most recent 6 months of records are kept.
When constructing variables you cannot simply accumulate all historical data; otherwise the variable distributions will drift heavily over time.
def time_isna(x, y):
    # if create_dt is missing (NaT), fall back to oil_actv_dt
    if str(x) == "NaT":
        x = y
    return x

df2 = df.sort_values(["uid", "create_dt"], ascending=False)
df2["create_dt"] = df2.apply(lambda x: time_isna(x.create_dt, x.oil_actv_dt), axis=1)
# gap in days between the activation date and each record date
df2["dtn"] = (df2.oil_actv_dt - df2.create_dt).apply(lambda x: x.days)
# keep only records within the last 6 months (180 days)
df = df2[df2["dtn"] < 180]
df
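As a side note of mine (not in the original), the row-wise apply above can be replaced with a vectorized fillna, which does the same fill and is faster on large frames:
# equivalent, vectorized alternative to the time_isna + apply approach
df2["create_dt"] = df2["create_dt"].fillna(df2["oil_actv_dt"])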
Process the org_lst features
For org_lst, find the largest historical gap in loan days (dtn) and deduplicate so that each uid keeps a single row.
base = df[org_lst]
base["dtn"] = df["dtn"]
base = base.sort_values(['uid','create_dt'],ascending = False)
base = base.drop_duplicates(['uid'],keep = 'first') # Deduplication
base.shape
#(11099, 6)
Variable derivation (the key step)
# Variable derivation
gn = pd.DataFrame()
for i in agg_lst:
    # _cnt: number of records per uid
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: len(df[i])).reset_index())
    tp.columns = ['uid', i + '_cnt']
    if gn.empty:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # _num: number of records with a positive value
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.where(df[i] > 0, 1, 0).sum()).reset_index())
    tp.columns = ['uid', i + '_num']
    if gn.empty:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # _tot: sum, ignoring NaN
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nansum(df[i])).reset_index())
    tp.columns = ['uid', i + '_tot']
    if gn.empty:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # _avg: mean, ignoring NaN
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nanmean(df[i])).reset_index())
    tp.columns = ['uid', i + '_avg']
    if gn.empty:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # _max: maximum
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nanmax(df[i])).reset_index())
    tp.columns = ['uid', i + '_max']
    if gn.empty:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # _min: minimum
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nanmin(df[i])).reset_index())
    tp.columns = ['uid', i + '_min']
    if gn.empty:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # _var: variance
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nanvar(df[i])).reset_index())
    tp.columns = ['uid', i + '_var']
    if gn.empty:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
    # _bianyi: mean divided by variance (variance floored at 1), a dispersion-style feature
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: np.nanmean(df[i]) / max(np.nanvar(df[i]), 1)).reset_index())
    tp.columns = ['uid', i + '_bianyi']
    if gn.empty:
        gn = tp
    else:
        gn = pd.merge(gn, tp, on='uid', how='left')
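The loop above is easy to follow but repetitive. A more compact sketch of mine (not the original author's code, assuming the same suffix naming convention) computes all aggregates in one pass per column:
def derive_agg_features(df, agg_lst):
    # one pass per column: count, positive count, sum, mean, max, min, variance
    out = []
    g = df.groupby('uid')
    for col in agg_lst:
        s = g[col]
        feat = pd.DataFrame({
            col + '_cnt': s.size(),                          # rows per uid (incl. NaN)
            col + '_num': s.apply(lambda x: (x > 0).sum()),  # positive-valued rows
            col + '_tot': s.sum(),                           # NaN-ignoring sum
            col + '_avg': s.mean(),
            col + '_max': s.max(),
            col + '_min': s.min(),
            col + '_var': s.var(ddof=0),                     # ddof=0 matches np.nanvar
        })
        feat[col + '_bianyi'] = feat[col + '_avg'] / feat[col + '_var'].clip(lower=1)
        out.append(feat)
    return pd.concat(out, axis=1).reset_index()

# gn = derive_agg_features(df, agg_lst)  # should reproduce the columns built above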
Count distinct values of the dstc_lst variables
gc = pd.DataFrame()
for i in dstc_lst:
    tp = pd.DataFrame(df.groupby('uid').apply(lambda df: len(set(df[i]))).reset_index())
    tp.columns = ['uid', i + '_dstc']
    if gc.empty:
        gc = tp
    else:
        gc = pd.merge(gc, tp, on='uid', how='left')
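Equivalently, a sketch of mine (not in the original) uses pandas' nunique to get the distinct counts in one call; note it drops NaN by default, so pass dropna=False if missing values should count as their own category, as len(set(...)) does:
gc_alt = df.groupby('uid')[dstc_lst].nunique(dropna=False).reset_index()
gc_alt.columns = ['uid'] + [c + '_dstc' for c in dstc_lst]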
Combine variables
fn = pd.merge(base,gn,on= 'uid')
fn = pd.merge(fn,gc,on= 'uid')
fn
fn = fn.fillna(0)  # merging leaves NaN for users without matching records; fill them with 0
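A quick check of my own (not in the original) that the merge-and-fill left no missing values:
fn.isna().sum().sum()  # expected to be 0 after fillna(0)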
Train a decision tree
Because bad_ind is 0/1, fitting a regressor means the value predicted in each leaf is simply the bad rate of the samples in that leaf.
x = fn.drop(['uid','oil_actv_dt','create_dt','bad_ind','class_new'],axis = 1)
y = fn.bad_ind.copy()
from sklearn import tree
dtree = tree.DecisionTreeRegressor(max_depth=2, min_samples_leaf=500, min_samples_split=5000)
# max_depth: limit the maximum depth of the tree
# min_samples_leaf: minimum number of samples a leaf must contain
# min_samples_split: minimum number of samples a node must contain before it can be split
dtree = dtree.fit(x, y)  # fit on x to predict y (whether the user goes bad)
Plot the decision tree and derive the decision rules
feature_names = x.columns
import matplotlib.pyplot as plt
plt.figure(figsize=(12,9),dpi=80)
tree.plot_tree(dtree,filled = True,feature_names = feature_names)
In the plot, value is the bad rate of each node.
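If a textual view of the learned splits is easier to read than the plot, sklearn's export_text prints them; this snippet is my addition, not part of the original post:
from sklearn.tree import export_text
print(export_text(dtree, feature_names=list(feature_names)))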
Recalculate the bad debt rate
sum(fn.bad_ind)/len(fn.bad_ind)
#0.04658077304261645
fn has one row per uid, so this is the per-user bad rate (about 4.66%), which is naturally higher than the per-record rate of about 1.78% computed on the raw data earlier.
# Generate the strategy: split users into three levels using the two thresholds learned by the tree
dff1 = fn.loc[(fn.amount_tot>48077.5)&(fn.coupon_amount_cnt>3.5)].copy()
dff1['level'] = 'oil_A'
dff2 = fn.loc[(fn.amount_tot>48077.5)&(fn.coupon_amount_cnt<=3.5)].copy()
dff2['level'] = 'oil_B'
dff3 = fn.loc[(fn.amount_tot<=48077.5)].copy()
dff3['level'] = 'oil_C'
dff1.head()
dff1 = pd.concat([dff1, dff2, dff3])
dff1 = dff1.reset_index(drop=True)
dff1.head()
last = dff1[['class_new','level','bad_ind','uid','oil_actv_dt']].copy()  # keep each column once
last['oil_actv_dt'] = last['oil_actv_dt'].apply(lambda x: str(x)[:7])  # keep only year-month
last.head(5)
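As a follow-up check of my own (not shown in the original), the volume and bad rate of each rule-based level can be compared directly:
# volume and bad rate per rule-based segment
print(last.groupby('level')['bad_ind'].agg(cnt='count', bad_rate='mean'))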