Naive Bayes in machine learning
2022-07-03 06:10:00 【Master core technology】
1. Basic concepts and ideas
Naive Bayes is a generative classification method based on Bayes' theorem and the assumption of conditional independence of features.
Building the model: for a given training data set, we first learn the joint probability distribution of the input and output, P(X, Y), under the assumption that the features are conditionally independent given the class. The joint probability factors into a prior probability multiplied by a conditional probability, so the training data are used to obtain maximum likelihood estimates of P(X|Y) and P(Y). The number of parameters needed to learn P(X|Y) directly grows exponentially with the number of features and the number of values each feature takes, and in practice this count is often far larger than the number of training samples. In other words, some feature combinations never appear in the training data, so their probabilities would be estimated as 0, even though "not observed" and "has probability 0" are different concepts. The conditional independence assumption reduces the number of parameters to linear growth, which makes the classification model learnable.
Prediction: given an input x, the model applies Bayes' theorem to find the class y with the maximum a posteriori probability P(Y|X) and outputs it. This is equivalent to minimizing the expected risk under the 0-1 loss function.
2. Principles and methods
The training data set is
$$T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$$
Naive Bayes learns the joint probability distribution $P(X,Y)$ from the training data set.
By the multiplication rule of probability: $$P(X,Y)=P(Y)\,P(X\mid Y)$$
Learning the joint probability can therefore be decomposed into learning the prior probability distribution and the conditional probability distribution.
Prior probability distribution:
$$P(Y=c_k),\qquad k=1,2,\dots,K$$
By the law of large numbers, when there is enough data the observed frequency can be used directly as an estimate of the probability, so the prior probability can be computed when the data are sufficient.
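As a concrete form of this frequency estimate (the standard maximum likelihood estimate of the prior, where $I(\cdot)$ is the indicator function and $N$ is the number of training samples):

$$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N},\qquad k=1,2,\dots,K$$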
Conditional probability distribution:
$$P(X=x\mid Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}\mid Y=c_k),\qquad k=1,2,\dots,K$$
Note: the superscripts index the $n$ attributes (features) of a single sample.
However, the conditional probability distribution $P(X=x\mid Y=c_k)$ has exponentially many parameters, so estimating it directly is infeasible. In fact, suppose $x^{(j)}$ can take $S_j$ values, $j=1,2,\dots,n$, and $Y$ can take $K$ values; then the number of parameters is $K\prod_{j=1}^{n}S_j$. In applications this value is much larger than the number of training samples, which means some value combinations never appear in the training set, and estimating their probabilities by frequency is clearly not workable: "not observed" and "has probability zero" are different concepts.
Naive Bayes assumes that the features are conditionally independent given the class. This is a strong assumption, and it is what makes the method "naive". The assumption is:
$$P(X=x\mid Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}\mid Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)$$
With this assumption, the number of parameters is reduced to $K\sum_{j=1}^{n}S_j$.
An example: suppose attribute 1 takes the values 0 and 1, attribute 2 takes the values A, B, and C, and the label $Y$ takes only one value. Without the conditional independence assumption, the parameters to estimate are P(0,A), P(0,B), P(0,C), P(1,A), P(1,B), P(1,C), i.e. 6 of them; with the conditional independence assumption there are only 5: P(0), P(1), P(A), P(B), P(C). The more values the attributes take and the more attributes there are, the larger this gap becomes, as the short check below illustrates.
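A minimal sketch that verifies the two parameter counts from this example (the variable names are illustrative):

```python
import math

# Example from the text: attribute 1 takes 2 values (0, 1),
# attribute 2 takes 3 values (A, B, C), and the label has K = 1 value.
S = [2, 3]   # S_j: number of possible values of each attribute
K = 1        # number of label values

joint_params = K * math.prod(S)  # K * prod(S_j): full joint model -> 6
naive_params = K * sum(S)        # K * sum(S_j): with conditional independence -> 5

print(joint_params, naive_params)  # 6 5
```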
Clearly, naive Bayes first models the joint probability and then obtains the maximum posterior probability, so it is a typical generative model.
The posterior probability follows from Bayes' theorem:
$$P(Y=c_k\mid X=x)=\frac{P(X=x\mid Y=c_k)\,P(Y=c_k)}{\sum_{k=1}^{K}P(X=x\mid Y=c_k)\,P(Y=c_k)}$$
Introducing the conditional independence assumption:
$$P(Y=c_k\mid X=x)=\frac{P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)}{\sum_{k=1}^{K}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)}$$
The naive Bayes classifier can then be expressed as:
$$y=f(x)=\arg\max_{c_k}\frac{P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)}{\sum_{k=1}^{K}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)}$$
The denominator is the same for every $c_k$, and since we only care about the ordering of the scores rather than their exact values, it can be dropped:
$$y=f(x)=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}\mid Y=c_k)$$
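A minimal sketch of this decision rule for categorical features, using plain frequency estimates and no smoothing (the class and variable names are illustrative, not taken from any particular library):

```python
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Decision rule: y = argmax_k P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k)."""

    def fit(self, X, y):
        n_samples = len(y)
        self.classes = sorted(set(y))
        class_counts = Counter(y)
        # Prior P(Y=c_k) estimated by class frequency.
        self.prior = {c: class_counts[c] / n_samples for c in self.classes}
        # Conditional P(X^(j)=v | Y=c_k) estimated by frequency within each class.
        value_counts = defaultdict(Counter)  # (class, feature index) -> Counter of values
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                value_counts[(yi, j)][v] += 1
        self.cond = {
            (c, j): {v: cnt / class_counts[c] for v, cnt in counter.items()}
            for (c, j), counter in value_counts.items()
        }
        return self

    def predict(self, x):
        best_class, best_score = None, -1.0
        for c in self.classes:
            # Score is proportional to the posterior; the shared denominator is dropped.
            score = self.prior[c]
            for j, v in enumerate(x):
                score *= self.cond.get((c, j), {}).get(v, 0.0)  # unseen value -> estimate 0
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Usage: two categorical features, binary labels.
X = [("sunny", "hot"), ("rain", "cool"), ("sunny", "cool"), ("rain", "hot")]
y = ["no", "yes", "yes", "no"]
clf = CategoricalNaiveBayes().fit(X, y)
print(clf.predict(("sunny", "cool")))  # -> "yes"
```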
3. The meaning of maximizing the posterior probability
To be updated
4. Algorithm steps
To be updated
5. Extensions and improvements of naive Bayes
Bayesian estimation: with maximum likelihood estimation, a probability that needs to be estimated may turn out to be 0, which distorts the posterior probabilities and biases the classification. Bayesian estimation adds a positive constant λ to the frequency count of each value of the random variable; with λ = 1 this is commonly called Laplace smoothing. A standard form is given below.
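As a sketch of the standard textbook formulation: $\lambda \ge 0$ is the smoothing constant, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, $S_j$ is the number of values that feature can take, and $I(\cdot)$ is the indicator function:

$$P_\lambda(X^{(j)}=a_{jl}\mid Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\,y_i=c_k)+\lambda}{\sum_{i=1}^{N} I(y_i=c_k)+S_j\lambda},\qquad
P_\lambda(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)+\lambda}{N+K\lambda}$$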
Semi-naive Bayes: the attribute conditional independence assumption is relaxed to some extent, which leads to a family of methods called "semi-naive Bayes classifiers". These take into account the dependence between some attributes, so that a full joint-probability computation is not required, while relatively strong attribute dependencies are not completely ignored.
Bayesian networks: a directed acyclic graph is used to describe the dependencies among attributes, and conditional probability tables are used to describe the joint probability distribution of the attributes.
Why naive Bayes often performs well in practice:
(1) For classification, as long as the conditional probabilities of the classes are ranked correctly, exact probability values are not needed to reach the correct classification result.
(2) If the dependencies among attributes affect all classes in the same way, or if the dependencies cancel each other out, then the attribute conditional independence assumption reduces the computational cost without hurting performance.