当前位置:网站首页>ML之shap:基于adult人口普查收入二分类预测数据集(预测年收入是否超过50k)利用shap决策图结合LightGBM模型实现异常值检测案例之详细攻略
ML之shap:基于adult人口普查收入二分类预测数据集(预测年收入是否超过50k)利用shap决策图结合LightGBM模型实现异常值检测案例之详细攻略
2022-07-07 00:33:00 【一个处女座的程序猿】
ML之shap:基于adult人口普查收入二分类预测数据集(预测年收入是否超过50k)利用shap决策图结合LightGBM模型实现异常值检测案例之详细攻略
目录
基于adult人口普查收入二分类预测数据集(预测年收入是否超过50k)利用shap决策图结合LightGBM模型实现异常值检测案例之详细攻略
相关文章
ML之shap:基于adult人口普查收入二分类预测数据集(预测年收入是否超过50k)利用shap决策图结合LightGBM模型实现异常值检测案例之详细攻略
ML之shap:基于adult人口普查收入二分类预测数据集(预测年收入是否超过50k)利用shap决策图结合LightGBM模型实现异常值检测案例之详细攻略实现
基于adult人口普查收入二分类预测数据集(预测年收入是否超过50k)利用shap决策图结合LightGBM模型实现异常值检测案例之详细攻略
# 1、定义数据集
| age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | salary |
| 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |
# 2、数据集预处理
# 2.1、入模特征初步筛选
df.columns
14
# 2.2、目标特征二值化
# 2.3、类别型特征编码数字化
| age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | salary | |
| 0 | 39 | 7 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | 0 |
| 1 | 50 | 6 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | 0 |
| 2 | 38 | 4 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
| 3 | 53 | 4 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | 0 |
| 4 | 28 | 4 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | 0 |
| 5 | 37 | 4 | 14 | 2 | 4 | 5 | 4 | 0 | 0 | 0 | 40 | 39 | 0 |
| 6 | 49 | 4 | 5 | 3 | 8 | 1 | 2 | 0 | 0 | 0 | 16 | 23 | 0 |
| 7 | 52 | 6 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 45 | 39 | 1 |
| 8 | 31 | 4 | 14 | 4 | 10 | 1 | 4 | 0 | 14084 | 0 | 50 | 39 | 1 |
| 9 | 42 | 4 | 13 | 2 | 4 | 0 | 4 | 1 | 5178 | 0 | 40 | 39 | 1 |
# 2.4、分离特征与标签
| age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
| 39 | 7 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 |
| 50 | 6 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 |
| 38 | 4 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 |
| 53 | 4 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 |
| 28 | 4 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 |
| 37 | 4 | 14 | 2 | 4 | 5 | 4 | 0 | 0 | 0 | 40 | 39 |
| 49 | 4 | 5 | 3 | 8 | 1 | 2 | 0 | 0 | 0 | 16 | 23 |
| 52 | 6 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 45 | 39 |
| 31 | 4 | 14 | 4 | 10 | 1 | 4 | 0 | 14084 | 0 | 50 | 39 |
| 42 | 4 | 13 | 2 | 4 | 0 | 4 | 1 | 5178 | 0 | 40 | 39 |
| salary |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 1 |
| 1 |
| 1 |
#3、模型训练与推理
# 3.1、数据集切分
X_test
| age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | |
| 1342 | 47 | 3 | 10 | 0 | 1 | 1 | 4 | 1 | 0 | 0 | 40 | 35 |
| 1338 | 71 | 3 | 13 | 0 | 13 | 3 | 4 | 0 | 2329 | 0 | 16 | 35 |
| 189 | 58 | 6 | 16 | 2 | 10 | 0 | 4 | 1 | 0 | 0 | 1 | 35 |
| 1332 | 23 | 3 | 9 | 4 | 7 | 1 | 2 | 1 | 0 | 0 | 35 | 35 |
| 1816 | 46 | 2 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 1902 | 40 | 35 |
| 1685 | 37 | 3 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 1902 | 45 | 35 |
| 657 | 34 | 3 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 45 | 35 |
| 1846 | 21 | 0 | 10 | 4 | 0 | 3 | 4 | 0 | 0 | 0 | 40 | 35 |
| 554 | 33 | 1 | 11 | 0 | 3 | 4 | 2 | 0 | 0 | 0 | 40 | 35 |
| 1963 | 49 | 3 | 13 | 2 | 12 | 0 | 4 | 1 | 0 | 0 | 50 | 35 |
# 3.2、模型建立并训练
params = {
"max_bin": 512, "learning_rate": 0.05,
"boosting_type": "gbdt", "objective": "binary",
"metric": "binary_logloss", "verbose": -1,
"min_data": 100, "random_state": 1,
"boost_from_average": True, "num_leaves": 10 }
LGBMC = lgb.train(params, lgbD_train, 10000,
valid_sets=[lgbD_test],
early_stopping_rounds=50,
verbose_eval=1000)# 3.3、模型预测
| age | workclass | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | y_test_predi | y_test | |
| 1342 | 47 | 3 | 10 | 0 | 1 | 1 | 4 | 1 | 0 | 0 | 40 | 35 | 0.045225575 | 0 |
| 1338 | 71 | 3 | 13 | 0 | 13 | 3 | 4 | 0 | 2329 | 0 | 16 | 35 | 0.074799172 | 0 |
| 189 | 58 | 6 | 16 | 2 | 10 | 0 | 4 | 1 | 0 | 0 | 1 | 35 | 0.30014332 | 1 |
| 1332 | 23 | 3 | 9 | 4 | 7 | 1 | 2 | 1 | 0 | 0 | 35 | 35 | 0.003966427 | 0 |
| 1816 | 46 | 2 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 1902 | 40 | 35 | 0.363861294 | 0 |
| 1685 | 37 | 3 | 9 | 2 | 4 | 0 | 4 | 1 | 0 | 1902 | 45 | 35 | 0.738628671 | 1 |
| 657 | 34 | 3 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 45 | 35 | 0.376412174 | 0 |
| 1846 | 21 | 0 | 10 | 4 | 0 | 3 | 4 | 0 | 0 | 0 | 40 | 35 | 0.002309884 | 0 |
| 554 | 33 | 1 | 11 | 0 | 3 | 4 | 2 | 0 | 0 | 0 | 40 | 35 | 0.060345836 | 1 |
| 1963 | 49 | 3 | 13 | 2 | 12 | 0 | 4 | 1 | 0 | 0 | 50 | 35 | 0.703506366 | 1 |
# 4、利用shap决策图进行异常值检测
# 4.1、原始数据和预处理后的数据各采样一小部分样本
# 4.2、创建Explainer并计算SHAP值
shap2exp.values.shape (100, 12, 2)
[[[-5.97178729e-01 5.97178729e-01]
[-5.18879297e-03 5.18879297e-03]
[ 1.70566444e-01 -1.70566444e-01]
...
[ 0.00000000e+00 0.00000000e+00]
[ 6.58794799e-02 -6.58794799e-02]
[ 0.00000000e+00 0.00000000e+00]]
[[-4.45574118e-01 4.45574118e-01]
[-1.00665452e-03 1.00665452e-03]
[-8.12237233e-01 8.12237233e-01]
...
[ 0.00000000e+00 0.00000000e+00]
[ 8.56381961e-01 -8.56381961e-01]
[ 0.00000000e+00 0.00000000e+00]]
[[-3.87412165e-01 3.87412165e-01]
[ 1.52848351e-01 -1.52848351e-01]
[-1.02755954e+00 1.02755954e+00]
...
[ 0.00000000e+00 0.00000000e+00]
[ 1.10240434e+00 -1.10240434e+00]
[ 0.00000000e+00 0.00000000e+00]]
...
[[-5.28928223e-01 5.28928223e-01]
[ 7.14116015e-03 -7.14116015e-03]
[-8.82241728e-01 8.82241728e-01]
...
[ 0.00000000e+00 0.00000000e+00]
[ 7.47521189e-02 -7.47521189e-02]
[ 0.00000000e+00 0.00000000e+00]]
[[ 2.20002984e+00 -2.20002984e+00]
[ 7.75916086e-03 -7.75916086e-03]
[ 3.95152810e-01 -3.95152810e-01]
...
[ 0.00000000e+00 0.00000000e+00]
[ 1.52566789e-01 -1.52566789e-01]
[ 0.00000000e+00 0.00000000e+00]]
[[-8.28965461e-01 8.28965461e-01]
[-4.43687947e-02 4.43687947e-02]
[ 3.37305776e-01 -3.37305776e-01]
...
[ 0.00000000e+00 0.00000000e+00]
[ 8.26477289e-03 -8.26477289e-03]
[ 0.00000000e+00 0.00000000e+00]]]
shap2array.shape (100, 12)
LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
[[ 5.97178729e-01 5.18879297e-03 -1.70566444e-01 ... 0.00000000e+00
-6.58794799e-02 0.00000000e+00]
[ 4.45574118e-01 1.00665452e-03 8.12237233e-01 ... 0.00000000e+00
-8.56381961e-01 0.00000000e+00]
[ 3.87412165e-01 -1.52848351e-01 1.02755954e+00 ... 0.00000000e+00
-1.10240434e+00 0.00000000e+00]
...
[ 5.28928223e-01 -7.14116015e-03 8.82241728e-01 ... 0.00000000e+00
-7.47521189e-02 0.00000000e+00]
[-2.20002984e+00 -7.75916086e-03 -3.95152810e-01 ... 0.00000000e+00
-1.52566789e-01 0.00000000e+00]
[ 8.28965461e-01 4.43687947e-02 -3.37305776e-01 ... 0.00000000e+00
-8.26477289e-03 0.00000000e+00]]
mode_exp_value: -1.9982244224656025# 4.3、shap决策图可视化
# 将决策图叠加在一起有助于根据shap定位异常值,即偏离密集群处的样本

边栏推荐
- TCC of distributed transaction solutions
- 数字IC面试总结(大厂面试经验分享)
- Get the way to optimize the one-stop worktable of customer service
- 集群、分布式、微服务的区别和介绍
- Web Authentication API兼容版本信息
- An example of multi module collaboration based on NCF
- 常用消息队列有哪些?
- 成为资深IC设计工程师的十个阶段,现在的你在哪个阶段 ?
- Wechat applet Bluetooth connects hardware devices and communicates. Applet Bluetooth automatically reconnects due to abnormal distance. JS realizes CRC check bit
- Mysql-centos7 install MySQL through yum
猜你喜欢
![Paper reading [MM21 pre training for video understanding challenge:video captioning with pre training techniqu]](/img/9c/1f031400f0e201df47bd51547ff73f.png)
Paper reading [MM21 pre training for video understanding challenge:video captioning with pre training techniqu]

sql查询:将下一行减去上一行,并做相应的计算

5. Data access - entityframework integration

往图片添加椒盐噪声或高斯噪声

架构设计的五个核心要素
![Reading the paper [sensor enlarged egocentric video captioning with dynamic modal attention]](/img/db/feb719e2715c7b9c669957995e1d83.png)
Reading the paper [sensor enlarged egocentric video captioning with dynamic modal attention]

SAP ABAP BDC (batch data communication) -018

English grammar_ Noun possessive

三级菜单数据实现,实现嵌套三级菜单数据

Common skills and understanding of SQL optimization
随机推荐
不同网段之间实现GDB远程调试功能
SAP webservice 测试出现404 Not found Service cannot be reached
bat 批示处理详解
Reading the paper [sensor enlarged egocentric video captioning with dynamic modal attention]
nVisual网络可视化
EMMC打印cqhci: timeout for tag 10提示分析与解决
Why does the data center need a set of infrastructure visual management system
Interview questions and salary and welfare of Shanghai byte
R language [logic control] [mathematical operation]
Paper reading [semantic tag enlarged xlnv model for video captioning]
make makefile cmake qmake都是什么,有什么区别?
What are the common message queues?
Message queue: how to handle repeated messages?
成为资深IC设计工程师的十个阶段,现在的你在哪个阶段 ?
Lombok plug-in
目标检测中的损失函数与正负样本分配:RetinaNet与Focal loss
Web authentication API compatible version information
软件测试面试技巧
R语言【逻辑控制】【数学运算】
"Multimodal" concept