Credit Card Customer Churn Prediction
2022-06-10 08:26:00 【时宜_】
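Dataset overview
The dataset covers about 10,000 customers and includes their age, salary, marital status, credit card limit, credit card category, and more. We use it to predict which customers are about to churn.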
1️⃣ Preparation
Import the Python packages this project needs:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as ex
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
sns.set_style('darkgrid')
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
plt.rc('figure',figsize=(18,9))
%pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
Load the data
c_data = pd.read_csv('./BankChurners.csv')
c_data = c_data[c_data.columns[:-2]]
Display the first three rows to see all the fields in the data:
c_data.head(3)

2️⃣ Data analysis
Customer age distribution:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Customer_Age'],name='Age Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Customer_Age'],name='Age Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of Customer Ages")
fig.show()
Customer age is roughly normally distributed, so the age feature can be used later on under a normality assumption.
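The "roughly normal" reading of the histogram can also be checked numerically. The sketch below is only an illustration: it uses a made-up sample in place of `c_data['Customer_Age']` (the mean of 46 and standard deviation of 8 are assumptions, not values from the dataset) and reports skewness and excess kurtosis, both of which are close to 0 for a normal distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for c_data['Customer_Age']; loc and scale are assumed values.
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(loc=46, scale=8, size=10_000).round(), name='Customer_Age')

age_skew = ages.skew()   # close to 0 for a symmetric distribution
age_kurt = ages.kurt()   # excess kurtosis, close to 0 for a normal distribution
print(f'skew={age_skew:.3f}, excess kurtosis={age_kurt:.3f}')
```

On the real data, replacing `ages` with `c_data['Customer_Age']` would give the same two statistics for the actual column.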
Customer gender distribution:
ex.pie(c_data,names='Gender',title='Proportion Of Customer Genders')

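The dataset contains more women than men, but the percentage gap is small, so for now we treat gender as evenly distributed.
Distribution of the number of dependents (close family size):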
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Dependent_count'],name='Dependent count Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Dependent_count'],name='Dependent count Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of Dependent counts (close family size)")
fig.show()

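As the plot shows, the number of dependents per customer is also roughly normal, with a slight right skew; this may be useful in later analysis.
Customer education level:
ex.pie(c_data,names='Education_Level',title='Proportion Of Education Levels')
Suppose most customers whose education level is Unknown have no formal education. Then more than 70% of customers have a formal education, with about 35% at master's level or above and 45% at bachelor's level or above.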
Customer marital status:
ex.pie(c_data,names='Marital_Status',title='Proportion Of Different Marriage Statuses')
Almost half of the bank's customers are married. Interestingly, nearly all of the other half are single, and only about 7% are divorced.
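Distribution of customer income and card category:
ex.pie(c_data,names='Income_Category',title='Proportion Of Different Income Levels')
ex.pie(c_data,names='Card_Category',title='Proportion Of Different Card Categories')
Most customers have an annual income below $60K.

Among the card categories held, Blue cards make up the vast majority.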
Months on book (how long the customer has been with the bank):
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Months_on_book'],name='Months on book Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Months_on_book'],name='Months on book Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of months the customer is part of the bank")
fig.show()

The peak in the middle is extremely high, so this feature is clearly not normally distributed.
Number of bank products held by the customer:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Total_Relationship_Count'],name='Total no. of products Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Total_Relationship_Count'],name='Total no. of products Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of Total no. of products held by the customer")
fig.show()

The values are spread fairly evenly, so the distribution by itself tells us little.
Number of inactive months per customer:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Months_Inactive_12_mon'],name='number of months inactive Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Months_Inactive_12_mon'],name='number of months inactive Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of the number of months inactive in the last 12 months")
fig.show()

Could it be that the less active a customer is, the more likely they are to churn?
Customer credit limit distribution:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Credit_Limit'],name='Credit_Limit Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Credit_Limit'],name='Credit_Limit Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of the Credit Limit")
fig.show()

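Most credit limits fall between 0 and 10k; so far there is no visible relationship with churn.
Distribution of the customer's total transaction amount: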
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Total_Trans_Amt'],name='Total_Trans_Amt Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Total_Trans_Amt'],name='Total_Trans_Amt Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of the Total Transaction Amount (Last 12 months)")
fig.show()

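The total transaction amount has a multimodal distribution, which suggests several underlying groups. Clustering customers on this feature, checking how similar the members of each group are, and characterizing each group may be useful for the final churn analysis.
Distribution of churned customers: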
ex.pie(c_data,names='Attrition_Flag',title='Proportion of churn vs not churn customers')
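Only about 16% of the samples are churned customers. In the following steps, SMOTE is used to oversample the churned class until it matches the size of the existing-customer class, giving the models chosen later a better chance of picking up fine details.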
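The size of the imbalance can be worked out from the class totals implied by the confusion-matrix counts reported at the end of this post (7,709 + 791 existing customers, 1,130 + 497 churned customers). The short calculation below shows the churn share and how many synthetic churn rows an oversampler has to add to balance the classes one to one.

```python
# Class totals implied by the confusion-matrix counts reported at the end of this post.
n_existing = 7709 + 791   # customers who stayed
n_churned = 1130 + 497    # customers who left

churn_share = n_churned / (n_existing + n_churned)

# To match the majority class, oversampling has to add this many synthetic churn rows.
n_synthetic = n_existing - n_churned

print(f'churn share: {churn_share:.1%}, synthetic rows needed: {n_synthetic}')
```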
3️⃣ Data preprocessing
Before applying SMOTE, the categorical features need to be one-hot encoded:
c_data.Attrition_Flag = c_data.Attrition_Flag.replace({'Attrited Customer':1,'Existing Customer':0})
c_data.Gender = c_data.Gender.replace({'F':1,'M':0})
c_data = pd.concat([c_data,pd.get_dummies(c_data['Education_Level']).drop(columns=['Unknown'])],axis=1)
c_data = pd.concat([c_data,pd.get_dummies(c_data['Income_Category']).drop(columns=['Unknown'])],axis=1)
c_data = pd.concat([c_data,pd.get_dummies(c_data['Marital_Status']).drop(columns=['Unknown'])],axis=1)
c_data = pd.concat([c_data,pd.get_dummies(c_data['Card_Category']).drop(columns=['Platinum'])],axis=1)
c_data.drop(columns = ['Education_Level','Income_Category','Marital_Status','Card_Category','CLIENTNUM'],inplace=True)
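To make the encoding pattern above easier to follow, here is a minimal sketch on a toy frame. The category names mirror the dataset, but the four rows are made up. Each category becomes an indicator column, and dropping `Unknown` means a row with all zeros stands for the unknown case.

```python
import pandas as pd

# Toy frame; the category values mirror the dataset, the rows are made up.
toy = pd.DataFrame({'Marital_Status': ['Married', 'Single', 'Unknown', 'Divorced']})

# One indicator column per category, then drop 'Unknown' so an all-zero row encodes it.
dummies = pd.get_dummies(toy['Marital_Status']).drop(columns=['Unknown'])
encoded = pd.concat([toy, dummies], axis=1).drop(columns=['Marital_Status'])

print(encoded.astype(int))
```

Dropping one level per variable also avoids indicator columns that always sum to 1, which would otherwise be perfectly collinear.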
Show the correlation heatmap:
sns.heatmap(c_data.corr('pearson'),annot=True)
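4️⃣ SMOTE oversampling
SMOTE is often used to deal with class imbalance. It changes the distribution of an imbalanced dataset by adding synthetic minority-class samples, and it is one of the popular ways to improve classifier performance on imbalanced data.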
oversample = SMOTE()
X, y = oversample.fit_resample(c_data[c_data.columns[1:]], c_data[c_data.columns[0]])
usampled_df = X.assign(Churn = y)
ohe_data =usampled_df[usampled_df.columns[15:-1]].copy()
usampled_df = usampled_df.drop(columns=usampled_df.columns[15:-1])
sns.heatmap(usampled_df.corr('pearson'),annot=True)
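The idea behind SMOTE's synthetic samples can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation in `imblearn`: the function name `smote_like` and the toy points are hypothetical. Each new point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np

def smote_like(minority, n_new, k=5, seed=0):
    """Simplified SMOTE-style sampler (hypothetical helper, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)                      # cannot have more neighbours than other points
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                    # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]      # indices of the k nearest neighbours
    base = rng.integers(0, len(minority), size=n_new)  # random starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                       # position along the segment, in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])

# Made-up minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
synthetic = smote_like(minority, n_new=10, k=3)
print(synthetic.shape)
```

Because every synthetic point is an interpolation between two real minority points, it always stays inside the region the minority class already occupies.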

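5️⃣ Principal component analysis
Principal component analysis is used to reduce the dimensionality of the one-hot encoded categorical variables and thereby reduce variance. Using a few principal components instead of dozens of one-hot features should also help me build a better model.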
N_COMPONENTS = 4
pca_model = PCA(n_components = N_COMPONENTS )
pc_matrix = pca_model.fit_transform(ohe_data)
evr = pca_model.explained_variance_ratio_
cumsum_evr = np.cumsum(evr)
ax = sns.lineplot(x=np.arange(0,len(cumsum_evr)),y=cumsum_evr,label='Explained Variance Ratio')
ax.set_title('Explained Variance Ratio Using {} Components'.format(N_COMPONENTS))
ax = sns.lineplot(x=np.arange(0,len(cumsum_evr)),y=evr,label='Explained Variance Of Component X')
ax.set_xticks([i for i in range(0,len(cumsum_evr))])
ax.set_xlabel('Component number #')
ax.set_ylabel('Explained Variance')
plt.show()
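The two curves plotted above are the per-component explained variance ratio and its running sum. A small sketch on made-up data shows how they relate; the card categories and their probabilities here are assumptions chosen only for illustration. With three indicator columns that always sum to 1, two components already capture essentially all of the variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up categorical column standing in for the one-hot block (ohe_data)
cats = pd.Series(rng.choice(['Blue', 'Silver', 'Gold'], size=500, p=[0.8, 0.15, 0.05]))
onehot = pd.get_dummies(cats).astype(float)

pca = PCA(n_components=2).fit(onehot)
evr = pca.explained_variance_ratio_   # share of variance per component
cumsum_evr = np.cumsum(evr)           # running total, what the cumulative curve shows

print(np.round(evr, 3), np.round(cumsum_evr, 3))
```

On the real data, the value of `cumsum_evr[-1]` indicates how much of the one-hot block's variance the chosen `N_COMPONENTS` retains.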

usampled_df_with_pcs = pd.concat([usampled_df,pd.DataFrame(pc_matrix,columns=['PC-{}'.format(i) for i in range(0,N_COMPONENTS)])],axis=1)
usampled_df_with_pcs

With the principal components added, the features stand out more clearly:
sns.heatmap(usampled_df_with_pcs.corr('pearson'),annot=True)
6️⃣ Model selection and testing
Select the following features, split the data into training and test sets, and train:
X_features = ['Total_Trans_Ct','PC-3','PC-1','PC-0','PC-2','Total_Ct_Chng_Q4_Q1','Total_Relationship_Count']
X = usampled_df_with_pcs[X_features]
y = usampled_df_with_pcs['Churn']
train_x,test_x,train_y,test_y = train_test_split(X,y,random_state=42)
Cross-validation
Let's see how three models perform: random forest, AdaBoost, and SVM:
rf_pipe = Pipeline(steps =[ ('scale',StandardScaler()), ("RF",RandomForestClassifier(random_state=42)) ])
ada_pipe = Pipeline(steps =[ ('scale',StandardScaler()), ("ADA",AdaBoostClassifier(random_state=42,learning_rate=0.7)) ])
svm_pipe = Pipeline(steps =[ ('scale',StandardScaler()), ("SVM",SVC(random_state=42,kernel='rbf')) ])
f1_cross_val_scores = cross_val_score(rf_pipe,train_x,train_y,cv=5,scoring='f1')
ada_f1_cross_val_scores=cross_val_score(ada_pipe,train_x,train_y,cv=5,scoring='f1')
svm_f1_cross_val_scores=cross_val_score(svm_pipe,train_x,train_y,cv=5,scoring='f1')
# Draw all three panels on one figure and call plt.show() only once at the end
plt.subplot(3,1,1)
ax = sns.lineplot(x=range(0,len(f1_cross_val_scores)),y=f1_cross_val_scores)
ax.set_title('Random Forest Cross Val Scores')
ax.set_xticks([i for i in range(0,len(f1_cross_val_scores))])
ax.set_xlabel('Fold Number')
ax.set_ylabel('F1 Score')
plt.subplot(3,1,2)
ax = sns.lineplot(x=range(0,len(ada_f1_cross_val_scores)),y=ada_f1_cross_val_scores)
ax.set_title('Adaboost Cross Val Scores')
ax.set_xticks([i for i in range(0,len(ada_f1_cross_val_scores))])
ax.set_xlabel('Fold Number')
ax.set_ylabel('F1 Score')
plt.subplot(3,1,3)
ax = sns.lineplot(x=range(0,len(svm_f1_cross_val_scores)),y=svm_f1_cross_val_scores)
ax.set_title('SVM Cross Val Scores')
ax.set_xticks([i for i in range(0,len(svm_f1_cross_val_scores))])
ax.set_xlabel('Fold Number')
ax.set_ylabel('F1 Score')
plt.tight_layout()
plt.show()
Here is how the three models compare:



Random forest clearly has the highest F1 score, reaching 0.92.
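Besides reading the per-fold curves, the fold scores can be summarized by their mean and spread. The sketch below runs the same kind of pipeline on synthetic data from `make_classification`, which is a stand-in for `train_x` and `train_y`; the sample size and the 50 trees are assumptions chosen to keep it fast, so its numbers say nothing about the churn data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for train_x / train_y
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=42)

pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('RF', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# One F1 score per fold; the scaler is refit inside each training fold
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='f1')

print('F1 per fold:', np.round(scores, 3))
print('mean = {:.3f}, std = {:.3f}'.format(scores.mean(), scores.std()))
```

A small standard deviation across folds suggests the score does not depend much on which rows ended up in which fold.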
Model prediction
Predict on the test set and compare the three models:
rf_pipe.fit(train_x,train_y)
rf_prediction = rf_pipe.predict(test_x)
ada_pipe.fit(train_x,train_y)
ada_prediction = ada_pipe.predict(test_x)
svm_pipe.fit(train_x,train_y)
svm_prediction = svm_pipe.predict(test_x)
# f1_score expects (y_true, y_pred) in that order
print('F1 Score of Random Forest Model On Test Set - {}'.format(f1(test_y,rf_prediction)))
print('F1 Score of AdaBoost Model On Test Set - {}'.format(f1(test_y,ada_prediction)))
print('F1 Score of SVM Model On Test Set - {}'.format(f1(test_y,svm_prediction)))

Model prediction on the original data (before oversampling)
Next, run the models on the original data:
ohe_data =c_data[c_data.columns[16:]].copy()
pc_matrix = pca_model.transform(ohe_data) # reuse the PCA fitted on the resampled data instead of refitting it
original_df_with_pcs = pd.concat([c_data,pd.DataFrame(pc_matrix,columns=['PC-{}'.format(i) for i in range(0,N_COMPONENTS)])],axis=1)
unsampled_data_prediction_RF = rf_pipe.predict(original_df_with_pcs[X_features])
unsampled_data_prediction_ADA = ada_pipe.predict(original_df_with_pcs[X_features])
unsampled_data_prediction_SVM = svm_pipe.predict(original_df_with_pcs[X_features])
The results are as follows:

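The random forest model again has the highest F1, at 0.63. That is on the low side but fairly normal: on an imbalanced dataset like this one, a high recall is hard to achieve.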
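A side note on the projection step used for the original data: a PCA object keeps the axes it learned in `fit`, and `transform` projects new rows onto those same axes, whereas calling `fit_transform` on new rows learns a fresh set of axes. The sketch below illustrates the difference on made-up arrays (`train` and `new` are hypothetical names, not variables from this post).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.1])
train = rng.normal(size=(200, 5)) * scales   # made-up data the PCA is fitted on
new = rng.normal(size=(50, 5)) * scales      # made-up later batch to be projected

pca = PCA(n_components=2).fit(train)
axes_before = pca.components_.copy()
projected = pca.transform(new)               # reuses the axes learned from `train`
axes_after_transform = pca.components_.copy()

refit = PCA(n_components=2).fit(new)         # what fit_transform(new) would learn instead

print(np.allclose(axes_before, axes_after_transform))   # transform leaves the axes unchanged
print(np.allclose(axes_before, refit.components_))      # a refit learns different axes
```

Since the classifiers were trained on components from the first fit, features produced by a refit are not guaranteed to mean the same thing to them.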
Results
Finally, here is the result of running the random forest model on the original data:
ax = sns.heatmap(confusion_matrix(unsampled_data_prediction_RF,original_df_with_pcs['Attrition_Flag']),annot=True,cmap='coolwarm',fmt='d')
ax.set_title('Prediction On Original Data With Random Forest Model Confusion Matrix')
ax.set_xticklabels(['Not Churn','Churn'],fontsize=18)
ax.set_yticklabels(['Predicted Not Churn','Predicted Churn'],fontsize=18)
plt.show()

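In conclusion: among customers who did not churn, 7,709 were predicted correctly and 791 were missed. Among churned customers, 1,130 were predicted correctly and 497 were missed.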
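These four counts are enough to recompute precision, recall, and F1 for the churn class by hand, which is a quick way to check them against the 0.63 reported earlier. The variable names below are just labels for the counts in the sentence above.

```python
# Counts reported in the sentence above
tp = 1130   # churned customers predicted as churned
fn = 497    # churned customers the model missed
fp = 791    # existing customers predicted as churned
tn = 7709   # existing customers predicted as existing

precision = tp / (tp + fp)   # of the predicted churners, how many really churned
recall = tp / (tp + fn)      # of the real churners, how many were caught
f1_score = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.3f}, recall={recall:.3f}, F1={f1_score:.3f}')
```

The result is an F1 of about 0.64, in line with the reported score, with recall near 0.69 and precision near 0.59.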