Decision trees in machine learning
2022-07-03 06:10:00 【Master core technology】
The decision tree algorithm represents the classification of data as a tree structure, where each leaf node corresponds to a decision result.
Split selection: we want the branch nodes of a decision tree to contain samples belonging to the same class as far as possible, i.e., nodes of high purity.
ID3 decision tree: information gain
Information entropy is the most commonly used measure of the purity of a sample set. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$ ($k = 1, 2, \ldots, |y|$; for binary classification, $|y| = 2$). Then the information entropy of $D$ is defined as
$$Ent(D) = -\sum_{k=1}^{|y|} p_k \log_2 p_k$$
The smaller the value of $Ent(D)$, the higher the purity of $D$.
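To make the definition concrete, here is a minimal NumPy sketch; the function name `entropy` and the example label counts are my own illustrative choices, not from the original post:

```python
import numpy as np

def entropy(labels):
    """Information entropy Ent(D) of a 1-D label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # class proportions p_k
    return -np.sum(p * np.log2(p))       # -sum_k p_k * log2(p_k)

# Example: a binary set with 9 positives and 8 negatives
D = np.array([1] * 9 + [0] * 8)
print(entropy(D))   # about 0.998, close to the 1-bit maximum for |y| = 2
```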
Suppose a discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \ldots, a^V\}$. If $a$ is used to split the sample set $D$, $V$ branch nodes are produced; the $v$-th branch node contains all samples in $D$ whose value on attribute $a$ is $a^v$, denoted $D^v$. Since different branch nodes contain different numbers of samples, each branch is weighted by $|D^v|/|D|$. The information gain obtained by splitting the sample set $D$ on attribute $a$ is then:
$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$
Generally speaking, the larger the information gain, the greater the purity improvement obtained by splitting on attribute $a$; ID3 selects the attribute with the largest information gain.
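A minimal sketch of the gain computation, assuming `attribute` and `labels` are aligned 1-D NumPy arrays for a discrete attribute; the entropy helper is repeated so the snippet runs on its own:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)."""
    weighted = 0.0
    for v in np.unique(attribute):
        mask = attribute == v            # branch D^v: samples with a = a^v
        weighted += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted
```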
C4.5 decision tree: gain ratio
In fact, the information gain criterion is biased toward attributes with many possible values. To reduce the potential adverse effects of this bias, C4.5 uses the gain ratio to select the optimal splitting attribute:
$$Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}$$
where $$IV(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$ is called the intrinsic value of attribute $a$: the more possible values $a$ has, the larger $IV(a)$ tends to be.
Note that the gain ratio criterion in turn prefers attributes with fewer possible values. Therefore, C4.5 does not simply select the candidate attribute with the largest gain ratio; instead it uses a heuristic: first pick out the candidate attributes whose information gain is above average, then choose the one with the highest gain ratio among them.
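The gain ratio and the intrinsic value $IV(a)$ can be sketched the same way (helper names are mine; the earlier helpers are repeated so the snippet is self-contained, and it does not guard against $IV(a) = 0$ for a constant attribute):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attribute, labels):
    branches = [labels[attribute == v] for v in np.unique(attribute)]
    return entropy(labels) - sum(len(b) / len(labels) * entropy(b)
                                 for b in branches)

def intrinsic_value(attribute):
    """IV(a): entropy of the branch proportions |D^v|/|D| themselves."""
    _, counts = np.unique(attribute, return_counts=True)
    w = counts / counts.sum()
    return -np.sum(w * np.log2(w))

def gain_ratio(attribute, labels):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    return information_gain(attribute, labels) / intrinsic_value(attribute)
```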
CART decision tree (Classification and Regression Tree): Gini index
The purity of a data set $D$ can be measured by the Gini value:
$$Gini(D) = \sum_{k=1}^{|y|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|y|} p_k^2$$
$Gini(D)$ reflects the probability that two samples drawn at random from the data set carry different class labels. Therefore, the smaller $Gini(D)$, the higher the purity of $D$.
The Gini index of attribute $a$ is defined as:
$$Gini\_index(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} Gini(D^v)$$
The attribute that minimizes the Gini index after splitting is selected as the optimal splitting attribute.
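A corresponding sketch for the Gini value and Gini index, under the same assumptions about the inputs as the earlier snippets:

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(attribute, labels):
    """Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v)."""
    total = 0.0
    for v in np.unique(attribute):
        mask = attribute == v
        total += mask.mean() * gini(labels[mask])
    return total
```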
Pruning is the primary means of combating overfitting in decision trees. There are two main strategies: pre-pruning (prepruning) and post-pruning (postpruning).
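For practical use, Gini- and entropy-based trees plus cost-complexity post-pruning are available off the shelf in scikit-learn (C4.5's gain ratio is not implemented there). A minimal sketch; the dataset, the `ccp_alpha` value, and the random seeds are illustrative choices, not from the original post:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" is the CART-style split; criterion="entropy" uses
# information gain instead. ccp_alpha > 0 turns on cost-complexity
# post-pruning; limits such as max_depth or min_samples_leaf act as
# pre-pruning-style constraints.
clf = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```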