Decision Trees in Machine Learning
2022-07-03 06:10:00 【Master core technology】
The decision tree algorithm represents the classification of data as a tree structure, where each leaf node corresponds to a decision result.
Split selection: we want the branch nodes of the decision tree to contain samples belonging to the same class as far as possible; that is, we want nodes of high purity.
ID3 decision tree: information gain
"Information entropy" is the most commonly used measure of the purity of a sample set. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$ ($k = 1, 2, \ldots, |y|$; for binary classification, $|y| = 2$). The information entropy of $D$ is then defined as

$$Ent(D) = -\sum_{k=1}^{|y|} p_k \log_2 p_k$$
The smaller the value of $Ent(D)$, the higher the purity of $D$.
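To make the definition concrete, here is a minimal NumPy sketch of the entropy computation (the helper name `entropy` and the toy labels are illustrative, not from the original post):

```python
import numpy as np

def entropy(y):
    """Information entropy Ent(D) of a label array, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()          # class proportions p_k (all > 0)
    return -np.sum(p * np.log2(p))     # -sum_k p_k * log2(p_k)

print(entropy(np.array(["yes", "yes", "no", "no"])))    # 1.0 (least pure binary set)
print(entropy(np.array(["yes", "yes", "yes", "yes"])))  # -0.0 (pure set)
```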
Suppose discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \ldots, a^V\}$. Splitting the sample set $D$ on $a$ produces $V$ branch nodes, where the $v$-th branch node contains all samples in $D$ whose value on attribute $a$ is $a^v$; denote this subset $D^v$. Since different branch nodes contain different numbers of samples, each branch is given the weight $|D^v|/|D|$. The "information gain" obtained by splitting the sample set $D$ on attribute $a$ is then

$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$
Generally speaking, the larger the information gain, the greater the "purity improvement" obtained by splitting on attribute $a$.
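A sketch of the gain computation, reusing the `entropy` helper defined above (the function name and toy data are again illustrative):

```python
import numpy as np

def information_gain(x, y):
    """Gain(D, a): parent entropy minus the weighted entropy of the
    subsets D^v induced by each value of attribute column x.
    Assumes the entropy() helper from the previous sketch."""
    n = len(y)
    weighted = sum((x == v).sum() / n * entropy(y[x == v])
                   for v in np.unique(x))
    return entropy(y) - weighted

x = np.array(["a", "a", "b", "b"])        # attribute values
y = np.array(["yes", "yes", "no", "no"])  # class labels
print(information_gain(x, y))  # 1.0: the split removes all parent entropy
```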
C4.5 decision tree: gain ratio
In fact, the information gain criterion is biased toward attributes with many possible values. To reduce the potential adverse effects of this bias, C4.5 selects the optimal split attribute using the gain ratio:
$$Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}$$

where

$$IV(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$
Note that the gain ratio criterion is in turn biased toward attributes with fewer possible values. Therefore, C4.5 does not simply pick the candidate attribute with the largest gain ratio; instead it uses a heuristic: first select, from the candidate attributes, those whose information gain is above average, then choose the one with the highest gain ratio among them.
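Since $IV(a)$ is just the entropy of the split proportions themselves, the sketch below (building on the two helpers above; the ID-like attribute is an illustrative extreme case) also shows why the gain ratio penalizes many-valued attributes:

```python
import numpy as np

def gain_ratio(x, y):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a).
    Assumes entropy() and information_gain() from the sketches above."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    iv = -np.sum(p * np.log2(p))       # IV(a): entropy of the split itself
    return information_gain(x, y) / iv

y = np.array(["yes", "yes", "no", "no"])
ident = np.array(["1", "2", "3", "4"])   # ID-like: one value per sample
good = np.array(["a", "a", "b", "b"])    # genuinely informative attribute
print(information_gain(ident, y), gain_ratio(ident, y))  # 1.0 0.5
print(information_gain(good, y), gain_ratio(good, y))    # 1.0 1.0
```

The ID-like attribute achieves maximal information gain, but its large intrinsic value drags its gain ratio down to half that of the genuinely informative attribute.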
CART decision tree (Classification and Regression Tree): Gini index
The purity of a data set $D$ can also be measured by the Gini value:

$$Gini(D) = \sum_{k=1}^{|y|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|y|} p_k^2$$

Intuitively, $Gini(D)$ reflects the probability that two samples drawn at random from the data set carry different class labels. Therefore, the smaller $Gini(D)$, the higher the purity of $D$.
The Gini index of attribute $a$ is defined as

$$Gini\_index(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} Gini(D^v)$$

The attribute that minimizes the Gini index after splitting is selected as the optimal split attribute.
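A minimal sketch of both quantities (self-contained this time; the names and toy data are illustrative):

```python
import numpy as np

def gini(y):
    """Gini(D) = 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(x, y):
    """Gini_index(D, a): |D^v|/|D|-weighted Gini of the subsets
    induced by each value of attribute column x."""
    n = len(y)
    return sum((x == v).sum() / n * gini(y[x == v]) for v in np.unique(x))

y = np.array(["yes", "yes", "no", "no"])
x = np.array(["a", "a", "b", "b"])
print(gini(y))           # 0.5: a 50/50 binary set is maximally impure
print(gini_index(x, y))  # 0.0: this attribute separates the classes perfectly
```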
Pruning: the primary means of dealing with "overfitting" in decision trees. There are two main strategies: pre-pruning (prepruning), which halts tree growth early, and post-pruning (postpruning), which simplifies a fully grown tree.
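As a practical illustration, here is a scikit-learn sketch (not from the original post): scikit-learn's trees are CART-style, pre-pruning corresponds to growth limits such as `max_depth` and `min_samples_leaf`, and its built-in post-pruning is minimal cost-complexity pruning via `ccp_alpha` (a different procedure from the validation-set post-pruning described in textbooks):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth / leaf-size limits.
pre = DecisionTreeClassifier(criterion="gini", max_depth=3,
                             min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

# Post-pruning: grow fully, then remove subtrees whose cost-complexity
# improvement is below ccp_alpha (larger alpha -> more aggressive pruning).
post = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.02,
                              random_state=0).fit(X_tr, y_tr)

print("pre-pruned :", pre.score(X_te, y_te))
print("post-pruned:", post.score(X_te, y_te))
```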