Decision Trees in Machine Learning
2022-07-03 06:10:00 【Master core technology】
The decision tree algorithm represents the classification of data as a tree structure, in which each leaf node corresponds to a decision result.
Split selection: we want each branch node of the decision tree to contain samples that belong to the same class as far as possible, that is, nodes of high purity.
ID3 Decision Tree: Information Gain
"Information entropy" is the most commonly used measure of the purity of a sample set. Suppose the proportion of class-$k$ samples in the current sample set $D$ is $p_k$ ($k = 1, 2, \dots, |y|$; for binary classification, $|y| = 2$). The information entropy of $D$ is defined as
$$Ent(D) = -\sum_{k=1}^{|y|} p_k \log_2 p_k$$
The smaller the value of $Ent(D)$, the higher the purity of $D$.
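As a quick illustration, here is a minimal Python sketch of this definition; the helper name `entropy` and the toy label lists are ours, not from the original post:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(D): information entropy of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0 (pure set; may print -0.0)
print(entropy(["yes", "no", "yes", "no"]))    # 1.0 (maximally impure binary set)
```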
Suppose a discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \dots, a^V\}$. Splitting the sample set $D$ on $a$ produces $V$ branch nodes, where the $v$-th branch node contains all samples in $D$ whose value on attribute $a$ is $a^v$, denoted $D^v$. Since different branch nodes contain different numbers of samples, each branch is given the weight $|D^v|/|D|$. The "information gain" obtained by splitting $D$ on attribute $a$ is then:
$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$$
Generally speaking, the larger the information gain, the greater the "purity improvement" obtained by splitting on attribute $a$; ID3 therefore chooses the attribute with the largest information gain, as the sketch below illustrates.
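A minimal sketch of the gain computation for one discrete attribute, assuming samples are given as parallel lists of attribute values and class labels (the function names and toy data are ours; the `entropy` helper from the previous sketch is repeated so the block runs on its own):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(D, a): values[i] is sample i's value on attribute a,
    labels[i] is its class label."""
    groups = defaultdict(list)                 # the subsets D^v
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

values = ["a1", "a1", "a2", "a2", "a2"]        # attribute values of 5 samples
labels = ["yes", "yes", "no", "no", "yes"]     # their class labels
print(information_gain(values, labels))        # ~0.420
```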
C4.5 Decision Tree: Gain Ratio
In fact, the information gain criterion is biased toward attributes with many possible values. To reduce the potential adverse effect of this bias, C4.5 uses the gain ratio to choose the optimal splitting attribute:
$$Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}$$
where $IV(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$ is called the intrinsic value of attribute $a$.
Note that the gain ratio criterion is biased toward attributes with few possible values. Therefore, C4.5 does not simply pick the candidate attribute with the largest gain ratio; it first selects, from the candidate splitting attributes, those whose information gain is above average, and then picks the one with the highest gain ratio among them.
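A sketch of this two-step heuristic, with hypothetical per-attribute gains and intrinsic values (the numbers and attribute names are made up purely to show why the above-average filter matters):

```python
def c45_select(gains, ivs):
    """C4.5 heuristic: restrict to attributes whose information gain is
    above average, then pick the highest gain ratio Gain(D,a)/IV(a)."""
    avg = sum(gains.values()) / len(gains)
    candidates = [a for a in gains if gains[a] >= avg]
    return max(candidates, key=lambda a: gains[a] / ivs[a])

# Hypothetical per-attribute information gains and intrinsic values:
gains = {"color": 0.60, "texture": 0.10, "id": 0.97}
ivs   = {"color": 1.58, "texture": 0.20, "id": 4.09}

# Pure gain would pick the many-valued "id" (0.97); pure gain ratio would
# pick the few-valued "texture" (0.10 / 0.20 = 0.50); the heuristic picks "color".
print(c45_select(gains, ivs))                  # color
```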
CART Decision Tree (Classification and Regression Tree): Gini Index
The purity of a data set $D$ can be measured by its Gini value:
$$Gini(D) = \sum_{k=1}^{|y|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|y|} p_k^2$$
$Gini(D)$ reflects the probability that two samples drawn at random from the data set carry different class labels; hence the smaller $Gini(D)$ is, the higher the purity of $D$.
The Gini index of attribute $a$ is defined as:
$$Gini\_index(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} Gini(D^v)$$
The attribute that minimizes the Gini index after the split is chosen as the optimal splitting attribute, as in the sketch below.
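A minimal sketch of the Gini criterion in the same style as the earlier snippets (function names and toy data are ours):

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini(D) = 1 - sum of p_k^2 over the class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(values, labels):
    """Gini_index(D, a): Gini of each subset D^v, weighted by |D^v|/|D|."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

values = ["a1", "a1", "a2", "a2", "a2"]
labels = ["yes", "yes", "no", "no", "yes"]
print(gini_index(values, labels))              # ~0.267; lower is better
```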
Pruning is the primary means of dealing with overfitting in decision trees. There are two main strategies: pre-pruning (prepruning), which stops growing the tree early, and post-pruning (postpruning), which grows a full tree and then removes branches.
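The two strategies can be tried out with scikit-learn's `DecisionTreeClassifier` (a CART implementation). This is only a sketch: hyperparameters such as `max_depth` and `min_samples_leaf` act as pre-pruning, while scikit-learn's post-pruning is specifically cost-complexity pruning via `ccp_alpha`; the dataset and parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop the tree from growing in the first place.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)

# Post-pruning: grow the full tree, then cut it back; scikit-learn
# implements cost-complexity pruning, controlled by ccp_alpha.
post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_tr, y_tr)

for name, model in [("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "depth =", model.get_depth(), "test acc =", model.score(X_te, y_te))
```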