
ML with SHAP: an interpretable XGBoost case study on the adult census income binary classification dataset (predicting whether annual income exceeds 50K), using SHAP values

2022-07-06 06:44:00 【A Virgo procedural ape】

ML and SHAP: a detailed walkthrough of an interpretable XGBoost model on the adult census income binary classification dataset (predicting whether annual income exceeds 50K), using SHAP values

Contents

An interpretable XGBoost case study on the adult census income binary classification dataset (predicting whether annual income exceeds 50K), using SHAP values

1、 Define the dataset

2、 Dataset preprocessing

# 2.1、 Preliminary screening of modeling features

# 2.2、 Binarizing the target feature

# 2.3、 Encoding categorical features as numbers

# 2.4、 Separating features from labels

# 2.5、 Dataset splitting

#3、 Model training and inference

# 3.1、 Dataset splitting

# 3.2、 Model building and training

# 3.3、 Model prediction

#4、 Visualizing and interpreting model feature importance

#4.1、 Visualizing global feature importance

# T1、 Feature importance reported by the model itself

# T2、 Explaining the XGBR model with SHAP values

#4.2、 Visualizing local feature importance

# (1)、 Single-sample, all-feature bar plot

# (2)、 Single-feature to two-feature, all-sample partial dependence scatter plot

# (3)、 Two-feature, all-sample scatter plot

# 4.3、 Model feature selection

# (1)、 Clustering-based SHAP feature selection visualization

5、 Interpretability of model predictions (mainly for analyzing misclassified samples)

# 5.1、 Force plot analysis: visualize the contribution of each feature in one or more samples and compare it with the model's predicted value, to explore misclassified samples

(1)、 Single-sample force plot, compared with the prediction

(2)、 Multi-sample force plot

# 5.2、 Decision plot analysis: how the model makes its decisions

# (1)、 Single-sample decision plot

# (2)、 Multi-sample decision plot
An interpretable XGBoost case study on the adult census income binary classification dataset (predicting whether annual income exceeds 50K), using SHAP values

1、 Define the dataset

dtypes_len: 15

	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	capital_loss	hours_per_week	native_country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	<=50K
5	37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	0	40	United-States	<=50K
6	49	Private	160187	9th	5	Married-spouse-absent	Other-service	Not-in-family	Black	Female	0	0	16	Jamaica	<=50K
7	52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	45	United-States	>50K
8	31	Private	45781	Masters	14	Never-married	Prof-specialty	Not-in-family	White	Female	14084	0	50	United-States	>50K
9	42	Private	159449	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	5178	0	40	United-States	>50K

2、 Dataset preprocessing

# 2.1、 Preliminary screening of modeling features

df.columns
14

# 2.2、 Binarizing the target feature

# 2.3、 Encoding categorical features as numbers

filt_dtypes_len: 13 [('age', 'float32'), ('workclass', 'category'), ('fnlwgt', 'float32'), ('education_num', 'float32'), ('marital_status', 'category'), ('occupation', 'category'), ('relationship', 'category'), ('race', 'category'), ('sex', 'category'), ('capital_gain', 'float32'), ('capital_loss', 'float32'), ('hours_per_week', 'float32'), ('native_country', 'category')]
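The two steps above (binarizing `salary` and turning category columns into integer codes) can be sketched with pandas as follows. This is a minimal illustration on four rows taken from the table in section 1, not the article's original code; using `cat.codes` is an assumption that happens to agree with the `sex` codes shown later (Female = 0, Male = 1).

```python
import pandas as pd

# Four illustrative rows copied from the raw table (rows 0, 1, 2 and 8).
df = pd.DataFrame({
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private"],
    "sex": ["Male", "Male", "Male", "Female"],
    "salary": ["<=50K", "<=50K", "<=50K", ">50K"],
})

# 2.2: binarize the target, 1 if income is above 50K, else 0.
df["salary"] = (df["salary"].str.strip() == ">50K").astype(int)

# 2.3: encode each categorical column as integer codes.
# Codes follow the sorted order of the categories present in the column.
for col in ["workclass", "sex"]:
    df[col] = df[col].astype("category").cat.codes

print(df)
```

Because the codes depend on which categories are present, this toy frame maps `Private` to 0, whereas the full dataset shown in `df_adult` maps it to 4.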

# 2.4、 Separating features from labels

df_adult_display

	age	workclass	education_num	marital_status	occupation	relationship	race	sex	capital_gain	capital_loss	hours_per_week	native_country	salary
0	39	State-gov	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	0
1	50	Self-emp-not-inc	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	0
2	38	Private	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	0	40	United-States	0
3	53	Private	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	0
4	28	Private	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	0
5	37	Private	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	0	40	United-States	0
6	49	Private	5	Married-spouse-absent	Other-service	Not-in-family	Black	Female	0	0	16	Jamaica	0
7	52	Self-emp-not-inc	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	45	United-States	1
8	31	Private	14	Never-married	Prof-specialty	Not-in-family	White	Female	14084	0	50	United-States	1
9	42	Private	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	5178	0	40	United-States	1

df_adult

	age	workclass	education_num	marital_status	occupation	relationship	race	sex	capital_gain	capital_loss	hours_per_week	native_country	salary
0	39	7	13	4	1	1	4	1	2174	0	40	39	0
1	50	6	13	2	4	0	4	1	0	0	13	39	0
2	38	4	9	0	6	1	4	1	0	0	40	39	0
3	53	4	7	2	6	0	2	1	0	0	40	39	0
4	28	4	13	2	10	5	2	0	0	0	40	5	0
5	37	4	14	2	4	5	4	0	0	0	40	39	0
6	49	4	5	3	8	1	2	0	0	0	16	23	0
7	52	6	9	2	4	0	4	1	0	0	45	39	1
8	31	4	14	4	10	1	4	0	14084	0	50	39	1
9	42	4	13	2	4	0	4	1	5178	0	40	39	1

# 2.5、 Dataset splitting

df_len: 32561, train_test_index: 30933
X.shape, y.shape: (30933, 12) (30933,)
X_test.shape, y_test.shape: (1628, 12) (1628,)
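The printed sizes are consistent with holding out the last 5% of rows as a test set. The sketch below reproduces only the shapes; the 5% ratio and the positional split are assumptions inferred from the numbers above, and the arrays are placeholders rather than the real data.

```python
import numpy as np

n_rows = 32561           # df_len from the output above
holdout_fraction = 0.05  # assumption: inferred from 30933 / 32561

# Index that separates the modeling rows from the held-out test rows.
train_test_index = round(n_rows * (1 - holdout_fraction))

# Placeholder feature matrix (12 features) and label vector.
features = np.zeros((n_rows, 12), dtype=np.float32)
labels = np.zeros(n_rows, dtype=np.int64)

X, y = features[:train_test_index], labels[:train_test_index]
X_test, y_test = features[train_test_index:], labels[train_test_index:]

print(train_test_index, X.shape, y.shape, X_test.shape, y_test.shape)
```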

#3、 Model training and inference

# 3.1、 Dataset splitting

# 3.2、 Model building and training

# 3.3、 Model prediction

	age	workclass	education_num	marital_status	occupation	relationship	race	sex	capital_gain	capital_loss	hours_per_week	native_country	y_val_predi	y_val
11311	29	4	9	4	1	3	2	0	0	0	60	39	0	0
12519	33	4	10	4	3	1	2	1	8614	0	40	39	1	1
29225	27	4	13	4	10	1	4	1	0	0	45	39	0	0
5428	22	4	9	2	7	0	4	1	0	0	40	39	0	0
2400	32	7	10	4	1	1	2	0	0	0	40	39	0	0
4319	45	4	10	2	4	0	4	1	0	0	40	39	1	0
26564	43	4	9	2	6	0	4	1	0	0	40	39	0	0
4721	60	0	13	2	0	0	4	1	0	0	8	39	0	1
19518	29	6	9	2	12	0	4	1	0	0	35	39	0	0
25013	33	4	5	2	6	0	4	1	0	0	40	39	0	0

#4、 Visualizing and interpreting model feature importance

#4.1、 Visualizing global feature importance

# T1、 Feature importance reported by the model itself

XGBR_importance_dict: [('age', 130), ('capital_gain', 125), ('education_num', 86), ('capital_loss', 75), ('hours_per_week', 63), ('relationship', 59), ('marital_status', 52), ('occupation', 52), ('workclass', 20), ('sex', 13), ('native_country', 10), ('race', 6)]
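A ranked list like the one above can be produced by sorting a feature-to-score mapping. The sketch below starts from the scores printed above; treating them as the output of the booster's `get_score(importance_type="weight")` (split counts) is an assumption suggested by the integer values, not something the article states.

```python
# Scores copied from the output above, in column order.
# Assumption: this mirrors a dict such as booster.get_score(importance_type="weight").
importance = {
    "age": 130, "workclass": 20, "education_num": 86, "marital_status": 52,
    "occupation": 52, "relationship": 59, "race": 6, "sex": 13,
    "capital_gain": 125, "capital_loss": 75, "hours_per_week": 63,
    "native_country": 10,
}

# Sort by score, highest first; ties keep their original order.
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name:15s} {score}")
```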

# T2、 Explaining the XGBR model with SHAP values

SHAP's built-in functions visualize each feature's contribution. The resulting feature importance ranking is similar to the one above, but not identical.

# (1)、 Create the Explainer and compute SHAP values

# T2.1、 Output as a shap.Explanation object

# T2.2、 Output as a numpy.array

shap2exp.values.shape (30933, 12) 
 [[ 0.31074238 -0.16607898  0.5617416  ... -0.04660619 -0.09465054
   0.00530914]
 [ 0.34912622 -0.16633348  0.65308005 ... -0.06718991 -0.9804511
   0.00515459]
 [ 0.21971266  0.02263742 -0.299867   ... -0.0583196  -0.09738331
   0.00415599]
 ...
 [-0.48140627  0.07019287 -0.30844492 ... -0.04253047 -0.10924102
   0.00649792]
 [ 0.39729887 -0.2313431  -0.45257783 ... -0.06502013  0.27416423
   0.00587647]
 [ 0.27594262  0.03170239  0.78293955 ... -0.06743324  0.31613
   0.00530914]]
shap2array.shape (30933, 12) 
 [[ 0.31074238 -0.16607898  0.5617416  ... -0.04660619 -0.09465054
   0.00530914]
 [ 0.34912622 -0.16633348  0.65308005 ... -0.06718991 -0.9804511
   0.00515459]
 [ 0.21971266  0.02263742 -0.299867   ... -0.0583196  -0.09738331
   0.00415599]
 ...
 [-0.48140627  0.07019287 -0.30844492 ... -0.04253047 -0.10924102
   0.00649792]
 [ 0.39729887 -0.2313431  -0.45257783 ... -0.06502013  0.27416423
   0.00587647]
 [ 0.27594262  0.03170239  0.78293955 ... -0.06743324  0.31613
   0.00530914]]
Are the two matrices shap2exp.values and shap2array equal: True

# (2)、 All-sample feature SHAP value bar plot

# Higher-order SHAP interaction value visualization

# (3)、 All-sample feature SHAP value beeswarm plot

# (4)、 Global feature importance ranking scatter plot

#4.2、 Visualizing local feature importance

# (1)、 Single-sample, all-feature bar plot

Sample under test: 0

.values =
array([ 0.31074238, -0.16607898,  0.5617416 , -0.58709425, -0.08897061,
       -0.6133537 ,  0.01539118,  0.04758333, -0.3988452 , -0.04660619,
       -0.09465054,  0.00530914], dtype=float32)
.base_values =
-1.3270257
.data =
array([3.900e+01, 7.000e+00, 1.300e+01, 4.000e+00, 1.000e+00, 1.000e+00,
       4.000e+00, 1.000e+00, 2.174e+03, 0.000e+00, 4.000e+01, 3.900e+01])

Sample under test: 1

.values =
array([ 0.34912622, -0.16633348,  0.65308005,  0.3069151 ,  0.26878497,
        0.5229906 ,  0.01030679,  0.04531586, -0.15429462, -0.06718991,
       -0.9804511 ,  0.00515459], dtype=float32)
.base_values =
-1.3270257
.data =
array([50.,  6., 13.,  2.,  4.,  0.,  4.,  1.,  0.,  0., 13., 39.])

Sample under test: 10

.values =
array([ 0.27578622,  0.02686635, -0.0699547 ,  0.2820353 ,  0.3097189 ,
        0.55229187, -0.03686382,  0.05135565, -0.1607191 , -0.06321771,
        0.38190693,  0.02023092], dtype=float32)
.base_values =
-1.3270257
.data =
array([37.,  4., 10.,  2.,  4.,  0.,  2.,  1.,  0.,  0., 80., 39.])

Sample under test: 20

.values =
array([ 0.31008577,  0.00316932,  1.3133987 ,  0.16768128,  0.18239255,
        0.6863757 ,  0.00508371,  0.05159741, -0.15813455, -0.06736177,
        0.31327826,  0.01936885], dtype=float32)
.base_values =
-1.3270257
.data =
array([40.,  4., 16.,  2., 10.,  0.,  4.,  1.,  0.,  0., 60., 39.])
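Each `.values` array above, added to `.base_values`, should give the model's raw output for that sample. The sketch below checks this by hand for the four samples printed above. It assumes the explained output is the log-odds of a binary logistic model (the article does not state this), so a sigmoid converts it to a probability.

```python
import math

base_value = -1.3270257  # .base_values printed above

# .values printed above for samples 0, 1, 10 and 20.
shap_values = {
    0: [0.31074238, -0.16607898, 0.5617416, -0.58709425, -0.08897061,
        -0.6133537, 0.01539118, 0.04758333, -0.3988452, -0.04660619,
        -0.09465054, 0.00530914],
    1: [0.34912622, -0.16633348, 0.65308005, 0.3069151, 0.26878497,
        0.5229906, 0.01030679, 0.04531586, -0.15429462, -0.06718991,
        -0.9804511, 0.00515459],
    10: [0.27578622, 0.02686635, -0.0699547, 0.2820353, 0.3097189,
         0.55229187, -0.03686382, 0.05135565, -0.1607191, -0.06321771,
         0.38190693, 0.02023092],
    20: [0.31008577, 0.00316932, 1.3133987, 0.16768128, 0.18239255,
         0.6863757, 0.00508371, 0.05159741, -0.15813455, -0.06736177,
         0.31327826, 0.01936885],
}


def sigmoid(z: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Additivity: raw output = base value + sum of per-feature contributions.
margins = {i: base_value + sum(v) for i, v in shap_values.items()}
probabilities = {i: sigmoid(m) for i, m in margins.items()}

for i in shap_values:
    print(f"sample {i}: margin={margins[i]:.4f}, probability={probabilities[i]:.4f}")
```

Under this assumption, samples 0 and 1 land below a probability of 0.5 and samples 10 and 20 above it.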

# (2)、 Single-feature to two-feature, all-sample partial dependence scatter plot

# (3)、 Two-feature, all-sample scatter plot

# 4.3、 Model feature selection

# (1)、 Clustering-based SHAP feature selection visualization

5、 Interpretability of model predictions (mainly for analyzing misclassified samples)

This part provides the details behind a prediction, focusing on how an individual prediction is produced. It helps decision makers trust the model and explains how each feature influences a single decision of the model.

# 5.1、 Force plot analysis: visualize the contribution of each feature in one or more samples and compare it with the model's predicted value, to explore misclassified samples

A force plot explains a single model prediction. It can be used for error analysis, to find an explanation for the prediction of a particular instance. Taking sample 0 as an example:
(1)、 Model output: 5.89;
(2)、 Base value: the base value is explainer.expected_value, that is, the mean model output over the training data;
(3)、 The numbers below the plot arrows are this instance's feature values, e.g. Age=39;
(4)、 Red marks features with a positive contribution (they push the prediction higher), and blue marks features with a negative contribution (they push the prediction lower). Length indicates influence: the longer the arrow, the larger the feature's contribution to the output. The tick values on the x axis show by how much the output decreases or increases.
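The red and blue groups described in (4) can be reproduced by hand from one sample's SHAP values. The sketch below reuses the `.values` printed for sample 0 in section 4.2 (1); it assumes the values follow the column order of the `df_adult` table, and it only lists the contributions rather than drawing the plot.

```python
feature_names = [
    "age", "workclass", "education_num", "marital_status", "occupation",
    "relationship", "race", "sex", "capital_gain", "capital_loss",
    "hours_per_week", "native_country",
]

# SHAP values of sample 0 from section 4.2 (1).
values = [0.31074238, -0.16607898, 0.5617416, -0.58709425, -0.08897061,
          -0.6133537, 0.01539118, 0.04758333, -0.3988452, -0.04660619,
          -0.09465054, 0.00530914]

contributions = list(zip(feature_names, values))

# "Red" features push the output up, "blue" features push it down;
# each group is sorted so the longest arrow comes first.
push_up = sorted((c for c in contributions if c[1] > 0),
                 key=lambda c: c[1], reverse=True)
push_down = sorted((c for c in contributions if c[1] < 0),
                   key=lambda c: c[1])

print("push up:  ", [name for name, _ in push_up])
print("push down:", [name for name, _ in push_down])
```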

(1)、 Single-sample force plot, compared with the prediction

Current test sample: 0

mode_exp_value: -1.3270257
<IPython.core.display.HTML object>
Current test sample: 0
 age               29.0
workclass          4.0
education_num      9.0
marital_status     4.0
occupation         1.0
relationship       3.0
race               2.0
sex                0.0
capital_gain       0.0
capital_loss       0.0
hours_per_week    60.0
native_country    39.0
y_val_predi        0.0
y_val              0.0
Name: 11311, dtype: float64
True label of the current test sample: 0
Predicted probability of the current test sample: 0

Current test sample: 1

Current test sample: 1
 age                 33.0
workclass            4.0
education_num       10.0
marital_status       4.0
occupation           3.0
relationship         1.0
race                 2.0
sex                  1.0
capital_gain      8614.0
capital_loss         0.0
hours_per_week      40.0
native_country      39.0
y_val_predi          1.0
y_val                1.0
Name: 12519, dtype: float64
True label of the current test sample: 1
Predicted probability of the current test sample: 1

Current test sample: 5

Current test sample: 5
 age               45.0
workclass          4.0
education_num     10.0
marital_status     2.0
occupation         4.0
relationship       0.0
race               4.0
sex                1.0
capital_gain       0.0
capital_loss       0.0
hours_per_week    40.0
native_country    39.0
y_val_predi        1.0
y_val              0.0
Name: 4319, dtype: float64
True label of the current test sample: 0
Predicted probability of the current test sample: 1

Current test sample: 7

Current test sample: 7
 age               60.0
workclass          0.0
education_num     13.0
marital_status     2.0
occupation         0.0
relationship       0.0
race               4.0
sex                1.0
capital_gain       0.0
capital_loss       0.0
hours_per_week     8.0
native_country    39.0
y_val_predi        0.0
y_val              1.0
Name: 4721, dtype: float64
True label of the current test sample: 1
Predicted probability of the current test sample: 0
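Samples 5 and 7 above are the two misclassified rows among the ten validation rows listed in section 3.3. A minimal way to pick such rows out, sketched here in plain Python on those ten rows (row index, `y_val_predi`, `y_val`):

```python
# (row index, predicted label, true label) copied from the table in section 3.3.
rows = [
    (11311, 0, 0), (12519, 1, 1), (29225, 0, 0), (5428, 0, 0), (2400, 0, 0),
    (4319, 1, 0), (26564, 0, 0), (4721, 0, 1), (19518, 0, 0), (25013, 0, 0),
]

# Predicted >50K but actually <=50K.
false_positives = [idx for idx, pred, true in rows if pred == 1 and true == 0]
# Predicted <=50K but actually >50K.
false_negatives = [idx for idx, pred, true in rows if pred == 0 and true == 1]

print("false positives:", false_positives)
print("false negatives:", false_negatives)
```

These are the rows whose force plots are worth inspecting first when looking for the features that misled the model.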

(2)、 Multi-sample force plot

# (2.1)、 Feature contribution visualization: use the dark red and dark blue plot to visualize the first 5 prediction explanations; the X dataset can be used.

# (2.2)、 Force plot of misclassified samples: this must use the X_val dataset, because it involves model predictions.
If multiple samples are explained, the single-sample plots above are rotated 90 degrees and stacked horizontally side by side, giving a variant of the force plot