Credit card customer churn forecast
2022-06-10 08:36:00 【Timely_】
Data set introduction
The dataset consists of 10,000 customers and includes their age, salary, marital status, credit card limit, credit card category, and more. We use it to predict which customers are about to churn.
Contents
Gender distribution of customers
Distribution of customers' dependent counts
Customers' education levels
Customers' marital status
Distribution of customer income and card types
Customer tenure (months on book)
Number of products held by the customer
Months inactive in the last 12 months
Credit limit distribution
Total transaction amount distribution
5️⃣ Principal component analysis
6️⃣ Model selection and testing
Model predictions on the original (pre-sampling) data
1️⃣ Preparation
Import the Python packages:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as ex
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
sns.set_style('darkgrid')
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
plt.rc('figure',figsize=(18,9))
%pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
Load data
c_data = pd.read_csv('./BankChurners.csv')
c_data = c_data[c_data.columns[:-2]]  # drop the last two columns (pre-computed classifier outputs that should not be used as features)
c_data.head(3)
Display the first three rows of the data to see all of its fields:

2️⃣ Data analysis
Customer age distribution
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Customer_Age'],name='Age Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Customer_Age'],name='Age Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of Customer Ages")
fig.show()
The age distribution of customers roughly follows a normal distribution, so the age feature can be used in later steps under an assumption of normality.
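As a quick, optional check of that assumption (not part of the original notebook), one could run D'Agostino's normality test from scipy:
from scipy import stats

# Hedged sketch: a small p-value argues against strict normality, though
# with ~10,000 samples the test is sensitive to even tiny deviations
stat, p_value = stats.normaltest(c_data['Customer_Age'])
print(f'normaltest statistic={stat:.2f}, p-value={p_value:.4f}')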
Gender distribution of customers:
ex.pie(c_data,names='Gender',title='Proportion Of Customer Genders')

So, in this dataset there are more women than men, but the percentage difference is not that significant; for now we treat gender as evenly distributed.
Distribution of customers' dependent counts:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Dependent_count'],name='Dependent count Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Dependent_count'],name='Dependent count Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of Dependent counts (close family size)")
fig.show()

As the figure shows, the number of dependents per customer is also roughly normally distributed, with a slight right skew; this may be usable in later analysis.
Customers' education levels:
ex.pie(c_data,names='Education_Level',title='Proportion Of Education Levels')
If we assume that customers whose education level is Unknown have no formal education, we can still state that more than 70% of our customers have a formal education; about 35% hold a master's degree or above, and about 45% have reached undergraduate level or above.
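These percentages can be verified directly; a minimal check (not in the original code):
# Share of each education level, as a fraction of all customers
print(c_data['Education_Level'].value_counts(normalize=True).round(3))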
Customers' marital status:
ex.pie(c_data,names='Marital_Status',title='Proportion Of Different Marriage Statuses')
Almost half of the bank's customers are married; interestingly, almost all of the rest are single, and only about 7% are divorced.
Distribution of customer income and card types:
ex.pie(c_data,names='Income_Category',title='Proportion Of Different Income Levels')
ex.pie(c_data,names='Card_Category',title='Proportion Of Different Card Categories')
It can be seen that most customers have an annual income below 60K US dollars.

As for the type of cards held, Blue cards make up the large majority.
Customer tenure (months on book):
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Months_on_book'],name='Months on book Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Months_on_book'],name='Months on book Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of months the customer is part of the bank")
fig.show()

The very tall spike in the middle makes it clear that this feature is not normally distributed.
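A quick way to locate that spike (an illustrative check, not in the original) is to list the most frequent values:
# The most common Months_on_book values; the spike corresponds to the top count
print(c_data['Months_on_book'].value_counts().head())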
Number of products held by the customer:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Total_Relationship_Count'],name='Total no. of products Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Total_Relationship_Count'],name='Total no. of products Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of Total no. of products held by the customer")
fig.show()

It is roughly uniformly distributed, so on its own this feature does not tell us much.
Months inactive in the last 12 months:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Months_Inactive_12_mon'],name='number of months inactive Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Months_Inactive_12_mon'],name='number of months inactive Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of the number of months inactive in the last 12 months")
fig.show()

Could it be that the less active a customer is, the more likely they are to churn?
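That hypothesis is easy to eyeball with a groupby; a hedged sketch, not from the original notebook:
# Churn rate for each value of Months_Inactive_12_mon (Attrition_Flag is
# still the raw string label at this point)
churn_rate = (c_data['Attrition_Flag']
              .eq('Attrited Customer')
              .groupby(c_data['Months_Inactive_12_mon'])
              .mean())
print(churn_rate)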
Credit limit distribution:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Credit_Limit'],name='Credit_Limit Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Credit_Limit'],name='Credit_Limit Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of the Credit Limit")
fig.show()

Most customers' credit limits fall between 0 and 10k; for now, no obvious relationship with churn is visible.
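One illustrative way to probe for such a relationship (not in the original) is to compare the credit-limit statistics of the two groups:
# Mean and median credit limit for churned vs. existing customers
print(c_data.groupby('Attrition_Flag')['Credit_Limit'].agg(['mean', 'median']))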
Total transaction amount distribution:
fig = make_subplots(rows=2, cols=1)
tr1=go.Box(x=c_data['Total_Trans_Amt'],name='Total_Trans_Amt Box Plot',boxmean=True)
tr2=go.Histogram(x=c_data['Total_Trans_Amt'],name='Total_Trans_Amt Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=700, width=1200, title_text="Distribution of the Total Transaction Amount (Last 12 months)")
fig.show()

It can be seen that the total transaction amount has a multimodal ("multiple groups") distribution. Clustering customers into groups based on this feature, examining what the members of each group have in common, and fitting separate curves for each could be useful for the final churn analysis, as sketched below.
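A minimal sketch of that idea, assuming KMeans with 3 clusters as a starting point (neither the cluster count nor this step appears in the original notebook):
from sklearn.cluster import KMeans

# Cluster customers on the single Total_Trans_Amt feature
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(c_data[['Total_Trans_Amt']])
# Range and size of each transaction-amount group
print(c_data['Total_Trans_Amt'].groupby(labels).agg(['min', 'max', 'count']))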
Distribution of churned customers:
ex.pie(c_data,names='Attrition_Flag',title='Proportion of churn vs not churn customers')
We can see that only 16% of the samples represent churned customers. In the next steps, we use SMOTE to oversample the churned class until it matches the number of existing customers, giving the models selected later a better chance of capturing the small details.
3️⃣ Data preprocessing
Before applying SMOTE, the categorical features need to be one-hot encoded:
# Encode the target and gender as 0/1
c_data.Attrition_Flag = c_data.Attrition_Flag.replace({'Attrited Customer':1,'Existing Customer':0})
c_data.Gender = c_data.Gender.replace({'F':1,'M':0})
# One-hot encode the remaining categorical columns, dropping one level from
# each to avoid perfectly collinear dummy columns
c_data = pd.concat([c_data,pd.get_dummies(c_data['Education_Level']).drop(columns=['Unknown'])],axis=1)
c_data = pd.concat([c_data,pd.get_dummies(c_data['Income_Category']).drop(columns=['Unknown'])],axis=1)
c_data = pd.concat([c_data,pd.get_dummies(c_data['Marital_Status']).drop(columns=['Unknown'])],axis=1)
c_data = pd.concat([c_data,pd.get_dummies(c_data['Card_Category']).drop(columns=['Platinum'])],axis=1)
# Drop the original categorical columns and the customer ID
c_data.drop(columns = ['Education_Level','Income_Category','Marital_Status','Card_Category','CLIENTNUM'],inplace=True)
Display the correlation heat map:
sns.heatmap(c_data.corr('pearson'),annot=True)
4️⃣ SMOTE oversampling
SMOTE is commonly used to address class imbalance: it changes the distribution of an imbalanced dataset by generating additional minority-class samples, and it is one of the popular ways to improve the performance of classification models on imbalanced data.
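To make the mechanism concrete, here is a minimal sketch of the core SMOTE idea: a synthetic sample is a random interpolation between a minority-class point and one of its k nearest minority neighbors. This is illustrative only; the actual oversampling below uses imbalanced-learn.
from sklearn.neighbors import NearestNeighbors

def smote_one_sample(minority_X, k=5, rng=np.random.default_rng(42)):
    """Generate one synthetic minority sample from a NumPy array (sketch)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_X)
    i = rng.integers(len(minority_X))
    _, neighbors = nn.kneighbors(minority_X[i:i + 1])
    j = neighbors[0][rng.integers(1, k + 1)]  # skip index 0: the point itself
    gap = rng.random()                        # interpolation factor in [0, 1)
    return minority_X[i] + gap * (minority_X[j] - minority_X[i])
Now apply the real implementation to our data: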
oversample = SMOTE()
# The target (Attrition_Flag) is the first column; everything else is a feature
X, y = oversample.fit_resample(c_data[c_data.columns[1:]], c_data[c_data.columns[0]])
usampled_df = X.assign(Churn = y)
# Set aside the one-hot encoded columns for PCA, then drop them from the main frame
ohe_data = usampled_df[usampled_df.columns[15:-1]].copy()
usampled_df = usampled_df.drop(columns=usampled_df.columns[15:-1])
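A quick sanity check (not in the original) confirms that the two classes are now balanced:
# After SMOTE, both classes should have equal counts
print(usampled_df['Churn'].value_counts())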
sns.heatmap(usampled_df.corr('pearson'),annot=True)

5️⃣ Principal component analysis
Principal component analysis is used here to reduce the dimensionality of the one-hot encoded categorical variables, and with it the variance. Using a few principal components instead of dozens of one-hot features should help build a better model.
N_COMPONENTS = 4
pca_model = PCA(n_components = N_COMPONENTS )
pc_matrix = pca_model.fit_transform(ohe_data)
evr = pca_model.explained_variance_ratio_
cumsum_evr = np.cumsum(evr)
ax = sns.lineplot(x=np.arange(0,len(cumsum_evr)),y=cumsum_evr,label='Explained Variance Ratio')
ax.set_title('Explained Variance Ratio Using {} Components'.format(N_COMPONENTS))
ax = sns.lineplot(x=np.arange(0,len(cumsum_evr)),y=evr,label='Explained Variance Of Component X')
ax.set_xticks([i for i in range(0,len(cumsum_evr))])
ax.set_xlabel('Component number #')
ax.set_ylabel('Explained Variance')
plt.show()

usampled_df_with_pcs = pd.concat([usampled_df,pd.DataFrame(pc_matrix,columns=['PC-{}'.format(i) for i in range(0,N_COMPONENTS)])],axis=1)
usampled_df_with_pcs

The correlation structure of the features is now more apparent:
sns.heatmap(usampled_df_with_pcs.corr('pearson'),annot=True)
6️⃣ Model selection and testing
Select the following features, split the data into training and test sets, and train:
X_features = ['Total_Trans_Ct','PC-3','PC-1','PC-0','PC-2','Total_Ct_Chng_Q4_Q1','Total_Relationship_Count']
X = usampled_df_with_pcs[X_features]
y = usampled_df_with_pcs['Churn']
train_x,test_x,train_y,test_y = train_test_split(X,y,random_state=42)
Cross-validation
Let's see how three models behave: random forest, AdaBoost, and SVM:
rf_pipe = Pipeline(steps =[ ('scale',StandardScaler()), ("RF",RandomForestClassifier(random_state=42)) ])
ada_pipe = Pipeline(steps =[ ('scale',StandardScaler()), ("ADA",AdaBoostClassifier(random_state=42,learning_rate=0.7)) ])
svm_pipe = Pipeline(steps =[ ('scale',StandardScaler()), ("SVM",SVC(random_state=42,kernel='rbf')) ])
f1_cross_val_scores = cross_val_score(rf_pipe,train_x,train_y,cv=5,scoring='f1')
ada_f1_cross_val_scores=cross_val_score(ada_pipe,train_x,train_y,cv=5,scoring='f1')
svm_f1_cross_val_scores=cross_val_score(svm_pipe,train_x,train_y,cv=5,scoring='f1')
plt.subplot(3,1,1)
ax = sns.lineplot(x=range(0,len(f1_cross_val_scores)),y=f1_cross_val_scores)
ax.set_title('Random Forest Cross Val Scores')
ax.set_xticks([i for i in range(0,len(f1_cross_val_scores))])
ax.set_xlabel('Fold Number')
ax.set_ylabel('F1 Score')
plt.subplot(3,1,2)
ax = sns.lineplot(x=range(0,len(ada_f1_cross_val_scores)),y=ada_f1_cross_val_scores)
ax.set_title('Adaboost Cross Val Scores')
ax.set_xticks([i for i in range(0,len(ada_f1_cross_val_scores))])
ax.set_xlabel('Fold Number')
ax.set_ylabel('F1 Score')
plt.subplot(3,1,3)
ax = sns.lineplot(x=range(0,len(svm_f1_cross_val_scores)),y=svm_f1_cross_val_scores)
ax.set_title('SVM Cross Val Scores')
ax.set_xticks([i for i in range(0,len(svm_f1_cross_val_scores))])
ax.set_xlabel('Fold Number')
ax.set_ylabel('F1 Score')
plt.tight_layout()
plt.show()  # a single show() keeps all three subplots in one figure
Let's look at how the three models perform:



You can see that random forest has the highest F1 score, reaching 0.92.
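To compare the models numerically rather than only by eye, a short summary sketch (not in the original):
# Mean and spread of each model's 5-fold cross-validation F1 scores
for name, scores in [('Random Forest', f1_cross_val_scores),
                     ('AdaBoost', ada_f1_cross_val_scores),
                     ('SVM', svm_f1_cross_val_scores)]:
    print(f'{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})')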
Model prediction
Fit each of the three models and predict on the test set:
rf_pipe.fit(train_x,train_y)
rf_prediction = rf_pipe.predict(test_x)
ada_pipe.fit(train_x,train_y)
ada_prediction = ada_pipe.predict(test_x)
svm_pipe.fit(train_x,train_y)
svm_prediction = svm_pipe.predict(test_x)
print('F1 Score of Random Forest Model On Test Set - {}'.format(f1(test_y,rf_prediction)))
print('F1 Score of AdaBoost Model On Test Set - {}'.format(f1(test_y,ada_prediction)))
print('F1 Score of SVM Model On Test Set - {}'.format(f1(test_y,svm_prediction)))

Model predictions on the original (pre-sampling) data
Next, run the trained models on the original data:
# Re-extract the one-hot columns from the original (pre-SMOTE) data and
# project them with PCA (note that fit_transform refits the PCA here)
ohe_data = c_data[c_data.columns[16:]].copy()
pc_matrix = pca_model.fit_transform(ohe_data)
original_df_with_pcs = pd.concat([c_data,pd.DataFrame(pc_matrix,columns=['PC-{}'.format(i) for i in range(0,N_COMPONENTS)])],axis=1)
unsampled_data_prediction_RF = rf_pipe.predict(original_df_with_pcs[X_features])
unsampled_data_prediction_ADA = ada_pipe.predict(original_df_with_pcs[X_features])
unsampled_data_prediction_SVM = svm_pipe.predict(original_df_with_pcs[X_features])
The results are as follows:

The highest F1 score, from the random forest model, is 0.63. That is low, but expected: on such an unevenly distributed dataset, it is hard for recall to score well.
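For a fuller picture than F1 alone, one could print per-class precision and recall (a sketch, not in the original notebook):
from sklearn.metrics import classification_report

# Attrition_Flag was recoded earlier: 0 = existing, 1 = churned
print(classification_report(original_df_with_pcs['Attrition_Flag'],
                            unsampled_data_prediction_RF,
                            target_names=['Existing Customer', 'Attrited Customer']))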
Results
Finally, let's look at the random forest model's results on the original data:
# Predictions are passed first, so rows are predicted labels and columns are actual labels
ax = sns.heatmap(confusion_matrix(unsampled_data_prediction_RF,original_df_with_pcs['Attrition_Flag']),annot=True,cmap='coolwarm',fmt='d')
ax.set_title('Prediction On Original Data With Random Forest Model Confusion Matrix')
ax.set_xticklabels(['Not Churn','Churn'],fontsize=18)
ax.set_yticklabels(['Predicted Not Churn','Predicted Churn'],fontsize=18)
plt.show()

In conclusion: of the customers who did not churn, 7,709 were predicted correctly and 791 were not; of the customers who did churn, 1,130 were predicted correctly and 497 were missed.
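Working backwards from those counts (a small arithmetic check, consistent with the ≈0.63 F1 reported above):
tp, fn = 1130, 497   # churned customers: predicted correctly / missed
tn, fp = 7709, 791   # existing customers: predicted correctly / flagged as churn
recall = tp / (tp + fn)                                   # ≈ 0.69
precision = tp / (tp + fp)                                # ≈ 0.59
f1_churn = 2 * precision * recall / (precision + recall)  # ≈ 0.64
print(f'churn recall={recall:.2f}, precision={precision:.2f}, F1={f1_churn:.2f}')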