Chapter 5 Decision Tree and Random Forest Practice
2022-07-27 03:58:00 【Sang zhiweiluo 0208】
1 The overfitting problem of decision trees
1.1 Problem description
A decision tree classifies its training data well, but it may not classify unseen test data equally well. Its generalization ability is weak; that is, overfitting may have occurred.
1.2 Solutions
(1) Pruning


(2) Reasonable and effective sampling
Bagging:

OOB data
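A minimal sketch of bootstrap sampling and the resulting out-of-bag (OOB) data, using only the standard library. The function name `bootstrap_split` and the sample size are illustrative assumptions. Each bootstrap sample draws N indices with replacement; a given sample is missed with probability (1 - 1/N)^N, which approaches 1/e (about 36.8%) as N grows, and those missed samples form the OOB set for that round.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in_bag, oob) index lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    # OOB data: samples that were never drawn in this bootstrap round.
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob


rng = random.Random(0)
n_samples = 10000  # hypothetical data set size
in_bag, oob = bootstrap_split(n_samples, rng)
oob_fraction = len(oob) / n_samples
print(f"OOB fraction: {oob_fraction:.3f}")  # expected to be close to 1/e
```

Because the OOB samples were not used to train that round's model, they can serve as a built-in validation set.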

Random forests:

- Relationship between random forests/bagging and decision trees
A decision tree is the usual base classifier.
Other classifiers such as SVM and logistic regression can also serve as base classifiers; the resulting "total classifier" is still called a random forest.
Example: a regression problem
2 Regression
2.1 Algorithm procedure
Run bootstrap 100 times, each time obtaining a data set Di of length N. For each Di, fit a curve with local regression (LOESS). Then average these curves to get the final fitted curve, in which the overfitting of the individual curves is reduced.
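The procedure can be sketched as follows. To stay within the standard library, this sketch swaps LOESS for a simpler Gaussian kernel smoother (a locally weighted average); that substitution, the bandwidth, the synthetic sine data, and all function names are assumptions made for illustration.

```python
import math
import random


def kernel_smooth(xs, ys, query, bandwidth=0.1):
    """Locally weighted average at `query` (a stand-in for LOESS)."""
    weights = [math.exp(-0.5 * ((query - x) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


def bagged_curve(xs, ys, grid, n_rounds=100, seed=0):
    """Average the curves fitted on n_rounds bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    totals = [0.0] * len(grid)
    for _ in range(n_rounds):
        # Bootstrap sample D_i: N draws with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        for k, q in enumerate(grid):
            totals[k] += kernel_smooth(bx, by, q)
    return [t / n_rounds for t in totals]


# Synthetic noisy data (hypothetical): a sine wave plus Gaussian noise.
data_rng = random.Random(42)
xs = [i / 49 for i in range(50)]
ys = [math.sin(2 * math.pi * x) + data_rng.gauss(0, 0.3) for x in xs]
grid = [i / 20 for i in range(21)]
curve = bagged_curve(xs, ys, grid)
```

Each individual curve reacts to the particular points drawn into its bootstrap sample; averaging 100 of them smooths out those sample-specific wiggles.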
2.2 Examples
Voting: (1) simple voting mechanisms: one-vote veto, majority rule, threshold voting; (2) Bayesian voting mechanism.
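The three simple voting mechanisms might look like this for a list of base-classifier predictions. The function names, the 0/1 label encoding, and the default threshold are assumptions for illustration; the Bayesian mechanism is not sketched here.

```python
from collections import Counter


def majority_vote(votes):
    """Majority rule: return the most frequent label.

    Ties are broken in favor of the label seen first.
    """
    return Counter(votes).most_common(1)[0][0]


def one_vote_veto(votes, veto_label=0, accept_label=1):
    """One-vote veto: a single veto_label overrides everything else."""
    return veto_label if veto_label in votes else accept_label


def threshold_vote(votes, positive_label=1, threshold=0.5):
    """Threshold voting: accept if the positive share reaches the threshold."""
    share = sum(1 for v in votes if v == positive_label) / len(votes)
    return share >= threshold


votes = [1, 0, 1, 1, 0]  # hypothetical predictions from five classifiers
print(majority_vote(votes))
print(one_vote_veto(votes))
print(threshold_vote(votes, threshold=0.75))
```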
Movie ratings: make (formula missing) as large as possible.
3 Uses of random forests
3.1 Using a random forest to compute the similarity between samples
Principle: the more often two samples land in the same leaf node, the more similar they are.
Algorithm procedure: let N be the number of samples and initialize an N×N zero matrix S, where S[i,j] is the similarity of samples i and j. For a random forest of m decision trees, traverse all leaf nodes of all trees; whenever samples i and j appear in the same leaf node, add 1 to S[i,j]. When the traversal ends, S is the similarity matrix between samples.
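A small sketch of this counting procedure. It assumes the forest has already been reduced to a table `leaf_ids`, where `leaf_ids[t][i]` is the leaf that sample i reaches in tree t (for example, the transposed output of scikit-learn's `RandomForestClassifier.apply`); the table below is made up for illustration.

```python
def similarity_matrix(leaf_ids):
    """Count, for every pair of samples, the trees in which they share a leaf."""
    n = len(leaf_ids[0])
    S = [[0] * n for _ in range(n)]
    for tree_leaves in leaf_ids:
        # Group sample indices by the leaf they fall into in this tree.
        groups = {}
        for i, leaf in enumerate(tree_leaves):
            groups.setdefault(leaf, []).append(i)
        # Every pair inside the same leaf gets one more co-occurrence.
        for members in groups.values():
            for i in members:
                for j in members:
                    S[i][j] += 1
    return S


# Hypothetical leaf assignments: 3 trees, 4 samples.
leaf_ids = [
    [0, 0, 1, 1],
    [2, 2, 2, 3],
    [5, 6, 5, 6],
]
S = similarity_matrix(leaf_ids)
for row in S:
    print(row)
```

The diagonal equals the number of trees, so dividing S by m gives a similarity in [0, 1].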
3.2 Using a random forest to compute feature importance
(1) Track the nodes that the positive examples pass through, and judge the importance of each feature from indicators such as the number of nodes passed and the Gini coefficient.
(2) Randomly replace one column of the data, rebuild the decision tree, and measure how much the accuracy of the new model changes; the size of the change indicates the importance of the feature in that column.
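Method (2) can be sketched as below. Several things here are assumptions for illustration: the column is "replaced" by shuffling it, a one-split decision stump stands in for a full decision tree, accuracy is measured on a held-out tail of the data, and the synthetic data makes column 0 fully informative and column 1 pure noise.

```python
import random


def predict(stump, row):
    f, t, polarity = stump
    return polarity if row[f] > t else 1 - polarity


def accuracy(stump, X, y):
    return sum(predict(stump, r) == label for r, label in zip(X, y)) / len(y)


def fit_stump(X, y):
    """Return the (feature, threshold, polarity) split with the best training accuracy."""
    best, best_acc = None, -1.0
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            for polarity in (1, 0):  # label predicted when row[f] > threshold
                stump = (f, (lo + hi) / 2, polarity)
                acc = accuracy(stump, X, y)
                if acc > best_acc:
                    best, best_acc = stump, acc
    return best


def rebuild_and_score(X, y, n_train):
    """Train on the first n_train rows, report accuracy on the rest."""
    stump = fit_stump(X[:n_train], y[:n_train])
    return accuracy(stump, X[n_train:], y[n_train:])


def column_importance(X, y, n_train, rng):
    """Accuracy drop after shuffling each column and rebuilding the model."""
    baseline = rebuild_and_score(X, y, n_train)
    drops = []
    for f in range(len(X[0])):
        column = [row[f] for row in X]
        rng.shuffle(column)
        X_perm = [row[:f] + [v] + row[f + 1:] for row, v in zip(X, column)]
        drops.append(baseline - rebuild_and_score(X_perm, y, n_train))
    return drops


rng = random.Random(0)
y = [rng.randrange(2) for _ in range(200)]
X = [[float(label), rng.random()] for label in y]  # column 0 = signal, column 1 = noise
importance = column_importance(X, y, n_train=150, rng=rng)
print(importance)
```

Shuffling the noise column leaves accuracy unchanged, while shuffling the informative column destroys it, so the accuracy drop ranks the features.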
3.3 Isolation Forest
Isolation Forest detects outliers by isolating sample points.
Features and split points are chosen at random to grow a decision tree of limited depth, an iTree; several iTrees form an iForest.
For each iTree, compute the path length f(x) of sample x from the root to its leaf, then compute F(x), the sum of f(x) over the iForest.
Detection criterion: samples x with a smaller F(x) are outliers.
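A compact sketch of the steps above, using only the standard library. It follows the simplified description given here: every iTree is grown on the full data set and F(x) is a plain sum of path lengths, whereas the published Isolation Forest algorithm also subsamples the data and normalizes the path length. The data, the tree count, the depth limit, and the function names are assumptions for illustration.

```python
import random


def build_itree(points, depth, max_depth, rng):
    """Grow an iTree; a leaf is None, an inner node is (feature, split, left, right)."""
    if len(points) <= 1 or depth >= max_depth:
        return None
    feature = rng.randrange(len(points[0]))  # random feature
    values = [p[feature] for p in points]
    lo, hi = min(values), max(values)
    if lo == hi:  # nothing left to split on this feature
        return None
    split = rng.uniform(lo, hi)  # random split point
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return (feature, split,
            build_itree(left, depth + 1, max_depth, rng),
            build_itree(right, depth + 1, max_depth, rng))


def path_length(tree, x):
    """f(x): number of edges from the root to the leaf that x falls into."""
    depth = 0
    while tree is not None:
        feature, split, left, right = tree
        tree = left if x[feature] < split else right
        depth += 1
    return depth


rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(63)]  # dense cluster
data.append((10.0, 10.0))  # one obvious outlier, stored last

forest = [build_itree(data, 0, 8, rng) for _ in range(100)]
# F(x): sum of path lengths over all iTrees in the iForest.
scores = [sum(path_length(tree, x) for tree in forest) for x in data]
most_isolated = min(range(len(data)), key=lambda i: scores[i])
print(most_isolated, scores[most_isolated])
```

The far-away point is usually cut off by the first or second random split, so its summed path length is much smaller than that of any point inside the cluster.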
Summary
Decision tree and random forest code is clear and its logic is simple. Besides handling classification problems, these models can also serve as a first algorithm for exploring the data distribution.
The ensemble idea behind random forests can also be applied to the design of other classifiers.