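ML / shap: Outlier detection on the adult census income binary classification dataset (predicting whether annual income exceeds 50K) using SHAP decision plots with a LightGBM model, a detailed walkthrough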
2022-07-07 00:33:00 【一个处女座的程序猿】
# 1. Define the dataset
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | salary |
| 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |
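A minimal sketch of loading data in this layout with pandas, assuming the raw file follows the UCI `adult.data` convention (no header row, a blank after each comma). The two inline rows are copied from the table above; the `COLUMNS` list and the in-memory `raw` string are illustrative stand-ins for the real file, which the walkthrough does not show.

```python
import io
import pandas as pd

# Column names as shown in the table above.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "salary",
]

# Two rows from the table, standing in for the real file (assumed header-less CSV).
raw = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K\n"
)

# skipinitialspace strips the blank that follows each comma.
df = pd.read_csv(io.StringIO(raw), names=COLUMNS, skipinitialspace=True)
print(df.shape)
print(df[["age", "workclass", "salary"]])
```

Passing `names=` explicitly keeps the first data row from being treated as a header.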
# 2. Dataset preprocessing
# 2.1. Preliminary screening of model input features
df.columns
14
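The encoded tables further down no longer contain `fnlwgt` or `education`, so the screening step presumably removes them (`education_num` already carries the education level as a number). A hedged sketch on a toy frame; the `DROP_COLS` name is illustrative and the exact column list is an inference from the tables, not taken from the author's code.

```python
import pandas as pd

# Toy frame with a subset of the raw columns (values from the first two rows).
df = pd.DataFrame({
    "age": [39, 50],
    "fnlwgt": [77516, 83311],
    "education": ["Bachelors", "Bachelors"],
    "education_num": [13, 13],
    "salary": ["<=50K", "<=50K"],
})

# Assumption: these are the columns removed before modelling.
DROP_COLS = ["fnlwgt", "education"]
df = df.drop(columns=DROP_COLS)
print(list(df.columns))
```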
# 2.2. Binarize the target feature
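One plausible way to binarize the target, sketched with pandas: `>50K` becomes 1 and `<=50K` becomes 0, which agrees with the `salary` column in the encoded tables. The variable names are illustrative.

```python
import pandas as pd

salary = pd.Series(["<=50K", "<=50K", ">50K"], name="salary")

# 1 when income exceeds 50K, else 0; strip() guards against stray whitespace.
salary_bin = (salary.str.strip() == ">50K").astype(int)
print(salary_bin.tolist())
```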
# 2.3. Encode categorical features as numbers
| | age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | salary |
| 0 | 39 | 7 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | 0 |
| 1 | 50 | 6 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | 0 |
| 2 | 38 | 4 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
| 3 | 53 | 4 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | 0 |
| 4 | 28 | 4 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | 0 |
| 5 | 37 | 4 | 14 | 2 | 4 | 5 | 4 | 0 | 0 | 0 | 40 | 39 | 0 |
| 6 | 49 | 4 | 5 | 3 | 8 | 1 | 2 | 0 | 0 | 0 | 16 | 23 | 0 |
| 7 | 52 | 6 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 45 | 39 | 1 |
| 8 | 31 | 4 | 14 | 4 | 10 | 1 | 4 | 0 | 14084 | 0 | 50 | 39 | 1 |
| 9 | 42 | 4 | 13 | 2 | 4 | 0 | 4 | 1 | 5178 | 0 | 40 | 39 | 1 |
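The integer codes in the table are consistent with alphabetical label encoding (for example Female → 0 and Male → 1). A small sketch using pandas category codes, which assigns codes in sorted order; the `CATEGORICAL` list is illustrative, and whether the author used this or a dedicated label encoder is not shown. Codes depend on which categories are present, so this toy sample does not reproduce the full-dataset codes for `race`.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [39, 28, 37],
    "sex": ["Male", "Female", "Female"],
    "race": ["White", "Black", "White"],
})

# Assumption: the string-valued columns to encode are listed explicitly.
CATEGORICAL = ["sex", "race"]

for col in CATEGORICAL:
    # Categories are sorted alphabetically, and codes follow that order.
    df[col] = df[col].astype("category").cat.codes

print(df)
```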
# 2.4. Separate features and labels
| age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
| 39 | 7 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 |
| 50 | 6 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 |
| 38 | 4 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 |
| 53 | 4 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 |
| 28 | 4 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 |
| 37 | 4 | 14 | 2 | 4 | 5 | 4 | 0 | 0 | 0 | 40 | 39 |
| 49 | 4 | 5 | 3 | 8 | 1 | 2 | 0 | 0 | 0 | 16 | 23 |
| 52 | 6 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 45 | 39 |
| 31 | 4 | 14 | 4 | 10 | 1 | 4 | 0 | 14084 | 0 | 50 | 39 |
| 42 | 4 | 13 | 2 | 4 | 0 | 4 | 1 | 5178 | 0 | 40 | 39 |
| salary |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 1 |
| 1 |
| 1 |
# 3. Model training and inference
# 3.1. Split the dataset
X_test
| | age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
| 1342 | 47 | 3 | 10 | 0 | 1 | 1 | 4 | 1 | 0 | 0 | 40 | 35 |
| 1338 | 71 | 3 | 13 | 0 | 13 | 3 | 4 | 0 | 2329 | 0 | 16 | 35 |
| 189 | 58 | 6 | 16 | 2 | 10 | 0 | 4 | 1 | 0 | 0 | 1 | 35 |
| 1332 | 23 | 3 | 9 | 4 | 7 | 1 | 2 | 1 | 0 | 0 | 35 | 35 |
| 1816 | 46 | 2 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 1902 | 40 | 35 |
| 1685 | 37 | 3 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 1902 | 45 | 35 |
| 657 | 34 | 3 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 45 | 35 |
| 1846 | 21 | 0 | 10 | 4 | 0 | 3 | 4 | 0 | 0 | 0 | 40 | 35 |
| 554 | 33 | 1 | 11 | 0 | 3 | 4 | 2 | 0 | 0 | 0 | 40 | 35 |
| 1963 | 49 | 3 | 13 | 2 | 12 | 0 | 4 | 1 | 0 | 0 | 50 | 35 |
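The shuffled row labels in `X_test` (1342, 1338, 189, ...) point to a random hold-out split. The sketch below does such a split with pandas alone; the 30% fraction and the seed are assumptions, and a helper such as scikit-learn's `train_test_split` would be a common alternative.

```python
import pandas as pd

df = pd.DataFrame({
    "age": range(20, 30),
    "hours_per_week": [40] * 10,
    "salary": [0, 1] * 5,
})

X = df.drop(columns="salary")
y = df["salary"]

# Hold out 30% of the rows at random (fraction and seed are assumptions).
X_test = X.sample(frac=0.3, random_state=1)
y_test = y.loc[X_test.index]

# Everything that was not sampled becomes the training set.
X_train = X.drop(index=X_test.index)
y_train = y.loc[X_train.index]
print(X_test.index.tolist())
```

Selecting labels through the sampled index keeps features and targets aligned row by row.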
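# 3.2. Build and train the model
import lightgbm as lgb

# LightGBM parameters for a binary classifier
params = {
    "max_bin": 512, "learning_rate": 0.05,
    "boosting_type": "gbdt", "objective": "binary",
    "metric": "binary_logloss", "verbose": -1,
    "min_data": 100, "random_state": 1,
    "boost_from_average": True, "num_leaves": 10}
# lgbD_train and lgbD_test are lgb.Dataset objects built from the split above.
# Train for up to 10000 rounds; stop once the validation loss has not improved for 50 rounds.
# The original call passed early_stopping_rounds=50 and verbose_eval=1000 directly,
# which LightGBM 4.0 removed in favour of callbacks.
LGBMC = lgb.train(params, lgbD_train, 10000,
                  valid_sets=[lgbD_test],
                  callbacks=[lgb.early_stopping(50), lgb.log_evaluation(1000)])
# 3.3. Model prediction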
| | age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | y_test_predi | y_test |
| 1342 | 47 | 3 | 10 | 0 | 1 | 1 | 4 | 1 | 0 | 0 | 40 | 35 | 0.045225575 | 0 |
| 1338 | 71 | 3 | 13 | 0 | 13 | 3 | 4 | 0 | 2329 | 0 | 16 | 35 | 0.074799172 | 0 |
| 189 | 58 | 6 | 16 | 2 | 10 | 0 | 4 | 1 | 0 | 0 | 1 | 35 | 0.30014332 | 1 |
| 1332 | 23 | 3 | 9 | 4 | 7 | 1 | 2 | 1 | 0 | 0 | 35 | 35 | 0.003966427 | 0 |
| 1816 | 46 | 2 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 1902 | 40 | 35 | 0.363861294 | 0 |
| 1685 | 37 | 3 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 1902 | 45 | 35 | 0.738628671 | 1 |
| 657 | 34 | 3 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 45 | 35 | 0.376412174 | 0 |
| 1846 | 21 | 0 | 10 | 4 | 0 | 3 | 4 | 0 | 0 | 0 | 40 | 35 | 0.002309884 | 0 |
| 554 | 33 | 1 | 11 | 0 | 3 | 4 | 2 | 0 | 0 | 0 | 40 | 35 | 0.060345836 | 1 |
| 1963 | 49 | 3 | 13 | 2 | 12 | 0 | 4 | 1 | 0 | 0 | 50 | 35 | 0.703506366 | 1 |
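With the `binary` objective, the `y_test_predi` column holds predicted probabilities for the positive class. As a rough illustration, the sketch below thresholds the ten values from the table and lists the rows where the prediction disagrees with `y_test`; the 0.5 cut-off and the names `res`, `THRESHOLD`, and `missed` are assumptions, not part of the original walkthrough.

```python
import pandas as pd

# Predicted probabilities and true labels copied from the table above.
res = pd.DataFrame(
    {
        "y_test_predi": [0.045225575, 0.074799172, 0.30014332, 0.003966427,
                         0.363861294, 0.738628671, 0.376412174, 0.002309884,
                         0.060345836, 0.703506366],
        "y_test": [0, 0, 1, 0, 0, 1, 0, 0, 1, 1],
    },
    index=[1342, 1338, 189, 1332, 1816, 1685, 657, 1846, 554, 1963],
)

THRESHOLD = 0.5  # assumed cut-off
res["y_pred"] = (res["y_test_predi"] >= THRESHOLD).astype(int)

# Rows where the thresholded prediction and the label disagree.
missed = res[res["y_pred"] != res["y_test"]]
print(missed)
```

Rows such as these are natural candidates to inspect more closely with SHAP.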
# 4. Outlier detection with SHAP decision plots
# 4.1. Sample a small subset from both the raw data and the preprocessed data
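A sketch of drawing the same rows from both versions of the data, so that plots can be computed on the encoded features and labelled with the original values. The frames `df_raw` and `X_test` here are toy stand-ins sharing one index, and `N_SAMPLES` is illustrative; the SHAP arrays printed in this walkthrough have 100 rows, which suggests 100 samples were drawn.

```python
import pandas as pd

# Toy stand-ins: raw (human-readable) rows and their encoded counterparts,
# sharing the same index.
df_raw = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]},
                      index=[1342, 1338, 189, 1332])
X_test = pd.DataFrame({"sex": [1, 0, 0, 1]}, index=df_raw.index)

N_SAMPLES = 2  # small for the toy data; the walkthrough appears to use 100
X_sample = X_test.sample(n=N_SAMPLES, random_state=1)

# Look up the same rows in the raw data by index label.
raw_sample = df_raw.loc[X_sample.index]
print(raw_sample)
```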
# 4.2. Create the Explainer and compute SHAP values
shap2exp.values.shape (100, 12, 2)
[[[-5.97178729e-01 5.97178729e-01]
[-5.18879297e-03 5.18879297e-03]
[ 1.70566444e-01 -1.70566444e-01]
...
[ 0.00000000e+00 0.00000000e+00]
[ 6.58794799e-02 -6.58794799e-02]
[ 0.00000000e+00 0.00000000e+00]]
[[-4.45574118e-01 4.45574118e-01]
[-1.00665452e-03 1.00665452e-03]
[-8.12237233e-01 8.12237233e-01]
...
[ 0.00000000e+00 0.00000000e+00]
[ 8.56381961e-01 -8.56381961e-01]
[ 0.00000000e+00 0.00000000e+00]]
[[-3.87412165e-01 3.87412165e-01]
[ 1.52848351e-01 -1.52848351e-01]
[-1.02755954e+00 1.02755954e+00]
...
[ 0.00000000e+00 0.00000000e+00]
[ 1.10240434e+00 -1.10240434e+00]
[ 0.00000000e+00 0.00000000e+00]]
...
[[-5.28928223e-01 5.28928223e-01]
[ 7.14116015e-03 -7.14116015e-03]
[-8.82241728e-01 8.82241728e-01]
...
[ 0.00000000e+00 0.00000000e+00]
[ 7.47521189e-02 -7.47521189e-02]
[ 0.00000000e+00 0.00000000e+00]]
[[ 2.20002984e+00 -2.20002984e+00]
[ 7.75916086e-03 -7.75916086e-03]
[ 3.95152810e-01 -3.95152810e-01]
...
[ 0.00000000e+00 0.00000000e+00]
[ 1.52566789e-01 -1.52566789e-01]
[ 0.00000000e+00 0.00000000e+00]]
[[-8.28965461e-01 8.28965461e-01]
[-4.43687947e-02 4.43687947e-02]
[ 3.37305776e-01 -3.37305776e-01]
...
[ 0.00000000e+00 0.00000000e+00]
[ 8.26477289e-03 -8.26477289e-03]
[ 0.00000000e+00 0.00000000e+00]]]
shap2array.shape (100, 12)
LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
[[ 5.97178729e-01 5.18879297e-03 -1.70566444e-01 ... 0.00000000e+00
-6.58794799e-02 0.00000000e+00]
[ 4.45574118e-01 1.00665452e-03 8.12237233e-01 ... 0.00000000e+00
-8.56381961e-01 0.00000000e+00]
[ 3.87412165e-01 -1.52848351e-01 1.02755954e+00 ... 0.00000000e+00
-1.10240434e+00 0.00000000e+00]
...
[ 5.28928223e-01 -7.14116015e-03 8.82241728e-01 ... 0.00000000e+00
-7.47521189e-02 0.00000000e+00]
[-2.20002984e+00 -7.75916086e-03 -3.95152810e-01 ... 0.00000000e+00
-1.52566789e-01 0.00000000e+00]
[ 8.28965461e-01 4.43687947e-02 -3.37305776e-01 ... 0.00000000e+00
-8.26477289e-03 0.00000000e+00]]
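Comparing the two printouts, `shap2array` matches the index-1 slice along the last axis of `shap2exp.values`, that is, the contributions toward the positive class. This is inferred from the printed numbers rather than from the author's code. A NumPy sketch using the first few printed values:

```python
import numpy as np

# First three feature entries of the first two samples, copied from the
# printed (samples, features, classes) output.
values = np.array([
    [[-5.97178729e-01, 5.97178729e-01],
     [-5.18879297e-03, 5.18879297e-03],
     [1.70566444e-01, -1.70566444e-01]],
    [[-4.45574118e-01, 4.45574118e-01],
     [-1.00665452e-03, 1.00665452e-03],
     [-8.12237233e-01, 8.12237233e-01]],
])

# Keep index 1 of the last axis: contributions toward the positive class.
shap_pos = values[:, :, 1]
print(shap_pos.shape)

# In this output the two class columns mirror each other.
print(np.allclose(values[:, :, 0], -shap_pos))
```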
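mode_exp_value: -1.9982244224656025
# 4.3. Visualize the SHAP decision plot
# Overlaying the decision plots helps locate outliers based on their SHAP values, i.e. samples whose paths deviate from the dense cluster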
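With the shap package, the plot itself is typically drawn with `shap.decision_plot(base_value, shap_values, features)`. Each line in that plot starts at the base value and accumulates one SHAP value per feature. The sketch below is a rough numerical analogue of "deviating from the dense cluster" and is not the author's method: it rebuilds those cumulative paths with NumPy and flags paths far from the median path using a robust z-score. The helper names, the threshold, the column-order accumulation, and the synthetic data with one planted outlier are all assumptions.

```python
import numpy as np

def decision_paths(base_value, shap_values):
    """Per-sample cumulative log-odds: the base value, then one step per feature."""
    start = np.full((shap_values.shape[0], 1), base_value)
    return np.hstack([start, base_value + np.cumsum(shap_values, axis=1)])

def flag_path_outliers(paths, z_thresh=3.5):
    """Indices of samples whose path lies unusually far from the median path."""
    dist = np.linalg.norm(paths - np.median(paths, axis=0), axis=1)
    center = np.median(dist)
    mad = np.median(np.abs(dist - center))
    scale = 1.4826 * mad if mad > 0 else 1.0  # robust spread estimate
    return np.flatnonzero((dist - center) / scale > z_thresh)

# Synthetic SHAP matrix (100 samples x 12 features) with one planted outlier.
rng = np.random.default_rng(0)
shap_values = rng.normal(0.0, 0.05, size=(100, 12))
shap_values[7, 0] += 3.0

base_value = -1.9982244224656025  # mode_exp_value reported in section 4.2
paths = decision_paths(base_value, shap_values)
outliers = flag_path_outliers(paths)
print(outliers)

# The end of each path is the model output in log-odds; the logistic
# function converts it to a probability.
probs = 1.0 / (1.0 + np.exp(-paths[:, -1]))
print(probs[:3])
```

Applied to the base value alone, the same logistic conversion gives a probability of roughly 0.12, the model's average predicted chance of the positive class in this run.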
