Credit card customer churn forecast
2022-06-10 08:36:00 【Timely_】
Data set introduction
The dataset consists of 10,000 customers and includes their age, salary, marital status, credit card limit, credit card category, and more. We use it to predict which customers are about to churn.
Contents
Gender distribution of customers
Distribution of customers' dependent counts
Customers' education levels
Customers' marital status
Distribution of customer income and card types
Customer tenure (months on book)
Number of products held by the customer
Months inactive in the last 12 months
Credit limit distribution
Total transaction amount distribution
5️⃣ Principal component analysis
6️⃣ Model selection and testing
Model predictions on the original (pre-sampling) data
1️⃣ Preparation
Import the Python packages:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as ex
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
sns.set_style('darkgrid')
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
plt.rc('figure',figsize=(18,9))
%pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
Load data
c_data = pd.read_csv('./BankChurners.csv')
c_data = c_data[c_data.columns[:-2]]  # drop the last two columns (pre-computed classifier outputs that should not be used as features)
c_data.head(3)
Display the first three rows of the data to see all of its fields:

2️⃣ Data analysis
Customer age distribution
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Customer_Age'],name='Age Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Customer_Age'],name='Age Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of Customer Ages")
fig.show()
The age distribution of customers roughly follows a normal distribution, so the age feature can be used in later steps under an assumption of normality.
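As a quick, optional check of that assumption (not part of the original notebook), one could run D'Agostino's normality test from scipy:
from scipy import stats

# Hedged sketch: a small p-value argues against strict normality, though
# with ~10,000 samples the test is sensitive to even tiny deviations
stat, p_value = stats.normaltest(c_data['Customer_Age'])
print(f'normaltest statistic={stat:.2f}, p-value={p_value:.4f}')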
Gender distribution of customers:
ex.pie(c_data,names='Gender',title='Proportion Of Customer Genders')

So, in this dataset there are more women than men, but the percentage difference is not that significant; for now we treat gender as evenly distributed.
Distribution of customers' dependent counts:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Dependent_count'],name='Dependent count Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Dependent_count'],name='Dependent count Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of Dependent counts (close family size)")
fig.show()

As the figure shows, the number of dependents per customer is also roughly normally distributed, with a slight right skew; this may be usable in later analysis.
Customers' education levels:
ex.pie(c_data,names='Education_Level',title='Proportion Of Education Levels')
If we assume that customers whose education level is Unknown have no formal education, we can still state that more than 70% of our customers have a formal education; about 35% hold a master's degree or above, and about 45% have reached undergraduate level or above.
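These percentages can be verified directly; a minimal check (not in the original code):
# Share of each education level, as a fraction of all customers
print(c_data['Education_Level'].value_counts(normalize=True).round(3))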
Customers' marital status:
ex.pie(c_data,names='Marital_Status',title='Proportion Of Different Marriage Statuses')
Almost half of the bank's customers are married; interestingly, almost all of the rest are single, and only about 7% are divorced.
Distribution of customer income and card types:
ex.pie(c_data,names='Income_Category',title='Proportion Of Different Income Levels')
ex.pie(c_data,names='Card_Category',title='Proportion Of Different Card Categories')
It can be seen that most customers have an annual income below 60K US dollars.

As for the type of cards held, Blue cards make up the large majority.
Customer tenure (months on book):
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Months_on_book'],name='Months on book Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Months_on_book'],name='Months on book Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of months the customer is part of the bank")
fig.show()

The very tall spike in the middle makes it clear that this feature is not normally distributed.
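A quick way to locate that spike (an illustrative check, not in the original) is to list the most frequent values:
# The most common Months_on_book values; the spike corresponds to the top count
print(c_data['Months_on_book'].value_counts().head())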
Number of products held by the customer:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Total_Relationship_Count'],name='Total no. of products Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Total_Relationship_Count'],name='Total no. of products Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of Total no. of products held by the customer")
fig.show()

It is roughly uniformly distributed, so on its own this feature does not tell us much.
Months inactive in the last 12 months:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Months_Inactive_12_mon'],name='number of months inactive Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Months_Inactive_12_mon'],name='number of months inactive Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of the number of months inactive in the last 12 months")
fig.show()

Could it be that the less active a customer is, the more likely they are to churn?
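That hypothesis is easy to eyeball with a groupby; a hedged sketch, not from the original notebook:
# Churn rate for each value of Months_Inactive_12_mon (Attrition_Flag is
# still the raw string label at this point)
churn_rate = (c_data['Attrition_Flag']
              .eq('Attrited Customer')
              .groupby(c_data['Months_Inactive_12_mon'])
              .mean())
print(churn_rate)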
Credit limit distribution:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Credit_Limit'],name='Credit_Limit Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Credit_Limit'],name='Credit_Limit Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of the Credit Limit")
fig.show()

Most customers' credit limits fall between 0 and 10k; for now, no obvious relationship with churn is visible.
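One illustrative way to probe for such a relationship (not in the original) is to compare the credit-limit statistics of the two groups:
# Mean and median credit limit for churned vs. existing customers
print(c_data.groupby('Attrition_Flag')['Credit_Limit'].agg(['mean', 'median']))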
Total transaction amount distribution:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Total_Trans_Amt'],name='Total_Trans_Amt Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Total_Trans_Amt'],name='Total_Trans_Amt Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of the Total Transaction Amount (Last 12 months)")
fig.show()

It can be seen that the total transaction amount has a multimodal ("multiple groups") distribution. Clustering customers into groups based on this feature, examining what the members of each group have in common, and fitting separate curves for each could be useful for the final churn analysis, as sketched below.
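A minimal sketch of that idea, assuming KMeans with 3 clusters as a starting point (neither the cluster count nor this step appears in the original notebook):
from sklearn.cluster import KMeans

# Cluster customers on the single Total_Trans_Amt feature
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(c_data[['Total_Trans_Amt']])
# Range and size of each transaction-amount group
print(c_data['Total_Trans_Amt'].groupby(labels).agg(['min', 'max', 'count']))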
Distribution of churned customers:
ex.pie(c_data,names='Attrition_Flag',title='Proportion of churn vs not churn customers')
We can see that only 16% of the samples represent churned customers. In the next steps, we use SMOTE to oversample the churned class until it matches the number of existing customers, giving the models selected later a better chance of capturing the small details.
3️⃣ Data preprocessing
Before applying SMOTE, the categorical features need to be one-hot encoded:
# Encode the target and gender as 0/1
c_data.Attrition_Flag = c_data.Attrition_Flag.replace({'Attrited Customer':1,'Existing Customer':0})
c_data.Gender = c_data.Gender.replace({'F':1,'M':0})
# One-hot encode the remaining categorical columns, dropping one level from
# each to avoid perfectly collinear dummy columns
c_data = pd.concat([c_data,pd.get_dummies(c_data['Education_Level']).drop(columns=['Unknown'])],axis=1)
c_data = pd.concat([c_data,pd.get_dummies(c_data['Income_Category']).drop(columns=['Unknown'])],axis=1)
c_data = pd.concat([c_data,pd.get_dummies(c_data['Marital_Status']).drop(columns=['Unknown'])],axis=1)
c_data = pd.concat([c_data,pd.get_dummies(c_data['Card_Category']).drop(columns=['Platinum'])],axis=1)
# Drop the original categorical columns and the customer ID
c_data.drop(columns = ['Education_Level','Income_Category','Marital_Status','Card_Category','CLIENTNUM'],inplace=True)
Display the correlation heat map:
sns.heatmap(c_data.corr('pearson'),annot=True)
4️⃣ SMOTE oversampling
SMOTE is commonly used to address class imbalance: it changes the distribution of an imbalanced dataset by generating additional minority-class samples, and it is one of the popular ways to improve the performance of classification models on imbalanced data.
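To make the mechanism concrete, here is a minimal sketch of the core SMOTE idea: a synthetic sample is a random interpolation between a minority-class point and one of its k nearest minority neighbors. This is illustrative only; the actual oversampling below uses imbalanced-learn.
from sklearn.neighbors import NearestNeighbors

def smote_one_sample(minority_X, k=5, rng=np.random.default_rng(42)):
    """Generate one synthetic minority sample from a NumPy array (sketch)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_X)
    i = rng.integers(len(minority_X))
    _, neighbors = nn.kneighbors(minority_X[i:i + 1])
    j = neighbors[0][rng.integers(1, k + 1)]  # skip index 0: the point itself
    gap = rng.random()                        # interpolation factor in [0, 1)
    return minority_X[i] + gap * (minority_X[j] - minority_X[i])
Now apply the real implementation to our data: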
oversample = SMOTE()
# The target (Attrition_Flag) is the first column; everything else is a feature
X, y = oversample.fit_resample(c_data[c_data.columns[1:]], c_data[c_data.columns[0]])
usampled_df = X.assign(Churn = y)
# Set aside the one-hot encoded columns for PCA, then drop them from the main frame
ohe_data = usampled_df[usampled_df.columns[15:-1]].copy()
usampled_df = usampled_df.drop(columns=usampled_df.columns[15:-1])
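A quick sanity check (not in the original) confirms that the two classes are now balanced:
# After SMOTE, both classes should have equal counts
print(usampled_df['Churn'].value_counts())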
sns.heatmap(usampled_df.corr('pearson'),annot=True)

5️⃣ Principal component analysis
Principal component analysis is used here to reduce the dimensionality of the one-hot encoded categorical variables, and with it the variance. Using a few principal components instead of dozens of one-hot features should help build a better model.
N_COMPONENTS = 4
pca_model = PCA(n_components = N_COMPONENTS )
pc_matrix = pca_model.fit_transform(ohe_data)
evr = pca_model.explained_variance_ratio_
cumsum_evr = np.cumsum(evr)
ax = sns.lineplot(x=np.arange(0,len(cumsum_evr)),y=cumsum_evr,label='Explained Variance Ratio')
ax.set_title('Explained Variance Ratio Using {} Components'.format(N_COMPONENTS))
ax = sns.lineplot(x=np.arange(0,len(cumsum_evr)),y=evr,label='Explained Variance Of Component X')
ax.set_xticks([i for i in range(0,len(cumsum_evr))])
ax.set_xlabel('Component number #')
ax.set_ylabel('Explained Variance')
plt.show()

usampled_df_with_pcs = pd.concat([usampled_df,pd.DataFrame(pc_matrix,columns=['PC-{}'.format(i) for i in range(0,N_COMPONENTS)])],axis=1)
usampled_df_with_pcs

The correlation structure of the features is now more apparent:
sns.heatmap(usampled_df_with_pcs.corr('pearson'),annot=True)
6️⃣ Model selection and testing
Select the following features, split the data into training and test sets, and train:
X_features = ['Total_Trans_Ct','PC-3','PC-1','PC-0','PC-2','Total_Ct_Chng_Q4_Q1','Total_Relationship_Count']
X = usampled_df_with_pcs[X_features]
y = usampled_df_with_pcs['Churn']
train_x,test_x,train_y,test_y = train_test_split(X,y,random_state=42)
Cross-validation
Let's see how three models behave: random forest, AdaBoost, and SVM:
rf_pipe = Pipeline(steps =[ ('scale',StandardScaler()), ("RF",RandomForestClassifier(random_state=42)) ])
ada_pipe = Pipeline(steps =[ ('scale',StandardScaler()), ("ADA",AdaBoostClassifier(random_state=42,learning_rate=0.7)) ])
svm_pipe = Pipeline(steps =[ ('scale',StandardScaler()), ("SVM",SVC(random_state=42,kernel='rbf')) ])
f1_cross_val_scores = cross_val_score(rf_pipe,train_x,train_y,cv=5,scoring='f1')
ada_f1_cross_val_scores=cross_val_score(ada_pipe,train_x,train_y,cv=5,scoring='f1')
svm_f1_cross_val_scores=cross_val_score(svm_pipe,train_x,train_y,cv=5,scoring='f1')
plt.subplot(3,1,1)
ax = sns.lineplot(x=range(0,len(f1_cross_val_scores)),y=f1_cross_val_scores)
ax.set_title('Random Forest Cross Val Scores')
ax.set_xticks([i for i in range(0,len(f1_cross_val_scores))])
ax.set_xlabel('Fold Number')
ax.set_ylabel('F1 Score')
plt.subplot(3,1,2)
ax = sns.lineplot(x=range(0,len(ada_f1_cross_val_scores)),y=ada_f1_cross_val_scores)
ax.set_title('Adaboost Cross Val Scores')
ax.set_xticks([i for i in range(0,len(ada_f1_cross_val_scores))])
ax.set_xlabel('Fold Number')
ax.set_ylabel('F1 Score')
plt.subplot(3,1,3)
ax = sns.lineplot(x=range(0,len(svm_f1_cross_val_scores)),y=svm_f1_cross_val_scores)
ax.set_title('SVM Cross Val Scores')
ax.set_xticks([i for i in range(0,len(svm_f1_cross_val_scores))])
ax.set_xlabel('Fold Number')
ax.set_ylabel('F1 Score')
plt.tight_layout()
plt.show()  # a single show() keeps all three subplots in one figure
Let's look at how the three models perform:



You can see that random forest has the highest F1 score, reaching 0.92.
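To compare the models numerically rather than only by eye, a short summary sketch (not in the original):
# Mean and spread of each model's 5-fold cross-validation F1 scores
for name, scores in [('Random Forest', f1_cross_val_scores),
                     ('AdaBoost', ada_f1_cross_val_scores),
                     ('SVM', svm_f1_cross_val_scores)]:
    print(f'{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})')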
Model prediction
Fit each of the three models and predict on the test set:
rf_pipe.fit(train_x,train_y)
rf_prediction = rf_pipe.predict(test_x)
ada_pipe.fit(train_x,train_y)
ada_prediction = ada_pipe.predict(test_x)
svm_pipe.fit(train_x,train_y)
svm_prediction = svm_pipe.predict(test_x)
print('F1 Score of Random Forest Model On Test Set - {}'.format(f1(test_y,rf_prediction)))
print('F1 Score of AdaBoost Model On Test Set - {}'.format(f1(test_y,ada_prediction)))
print('F1 Score of SVM Model On Test Set - {}'.format(f1(test_y,svm_prediction)))

Model predictions on the original (pre-sampling) data
Next, run the trained models on the original data:
# Re-extract the one-hot columns from the original (pre-SMOTE) data and
# project them with PCA (note that fit_transform refits the PCA here)
ohe_data = c_data[c_data.columns[16:]].copy()
pc_matrix = pca_model.fit_transform(ohe_data)
original_df_with_pcs = pd.concat([c_data,pd.DataFrame(pc_matrix,columns=['PC-{}'.format(i) for i in range(0,N_COMPONENTS)])],axis=1)
unsampled_data_prediction_RF = rf_pipe.predict(original_df_with_pcs[X_features])
unsampled_data_prediction_ADA = ada_pipe.predict(original_df_with_pcs[X_features])
unsampled_data_prediction_SVM = svm_pipe.predict(original_df_with_pcs[X_features])
The results are as follows:

The highest F1 score, from the random forest model, is 0.63. That is low, but expected: on such an unevenly distributed dataset, it is hard for recall to score well.
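For a fuller picture than F1 alone, one could print per-class precision and recall (a sketch, not in the original notebook):
from sklearn.metrics import classification_report

# Attrition_Flag was recoded earlier: 0 = existing, 1 = churned
print(classification_report(original_df_with_pcs['Attrition_Flag'],
                            unsampled_data_prediction_RF,
                            target_names=['Existing Customer', 'Attrited Customer']))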
Results
Finally, let's look at the random forest model's results on the original data:
# Predictions are passed first, so rows are predicted labels and columns are actual labels
ax = sns.heatmap(confusion_matrix(unsampled_data_prediction_RF,original_df_with_pcs['Attrition_Flag']),annot=True,cmap='coolwarm',fmt='d')
ax.set_title('Prediction On Original Data With Random Forest Model Confusion Matrix')
ax.set_xticklabels(['Not Churn','Churn'],fontsize=18)
ax.set_yticklabels(['Predicted Not Churn','Predicted Churn'],fontsize=18)
plt.show()

In conclusion: of the customers who did not churn, 7,709 were predicted correctly and 791 were not; of the customers who did churn, 1,130 were predicted correctly and 497 were missed.
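Working backwards from those counts (a small arithmetic check, consistent with the ≈0.63 F1 reported above):
tp, fn = 1130, 497   # churned customers: predicted correctly / missed
tn, fp = 7709, 791   # existing customers: predicted correctly / flagged as churn
recall = tp / (tp + fn)                                   # ≈ 0.69
precision = tp / (tp + fp)                                # ≈ 0.59
f1_churn = 2 * precision * recall / (precision + recall)  # ≈ 0.64
print(f'churn recall={recall:.2f}, precision={precision:.2f}, F1={f1_churn:.2f}')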