Naive Bayes in machine learning
2022-07-03 06:10:00 【Master core technology】
1. Basic concepts and ideas
Naive Bayes is a generative classification method based on Bayes' theorem and the assumption that the features are conditionally independent given the class.
Building the model: for a training data set, the joint probability distribution of input and output, $P(X,Y)$, is learned under the feature conditional independence assumption. The joint probability factors into a prior multiplied by a conditional probability, so the training data are used to compute maximum likelihood estimates of $P(X|Y)$ and $P(Y)$. Because the number of parameters of $P(X|Y)$ grows exponentially with the number of features and the number of values each feature can take, in practice it often exceeds the number of training samples; in other words, some value combinations never appear in the training data and would be assigned an estimated probability of 0, even though "not observed" and "has probability 0" are different concepts. The conditional independence assumption reduces the number of parameters to linear growth, so a classification model can actually be learned.
Predicting a class: given the learned model and an input $x$, Bayes' theorem is used to find the output $y$ with the maximum a posteriori probability $P(Y|X)$. This is equivalent to minimizing the expected risk under the 0-1 loss function.
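As a quick illustration of this build-then-predict workflow, here is a minimal sketch using scikit-learn's CategoricalNB on a tiny hand-made data set; the toy data, the integer encoding, and the choice of estimator are assumptions made for illustration, not part of the original discussion.

```python
# Minimal sketch of the naive Bayes workflow: fit P(Y) and P(X|Y) on training
# data, then predict the class with the maximum posterior probability.
# Toy data and the CategoricalNB estimator are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Two integer-encoded categorical features per sample.
X_train = np.array([[0, 1], [1, 0], [0, 0], [1, 1], [0, 2]])
y_train = np.array([0, 1, 0, 1, 0])

clf = CategoricalNB()            # estimates P(Y) and P(X^(j)|Y) with smoothing
clf.fit(X_train, y_train)

x_new = np.array([[0, 1]])
print(clf.predict(x_new))        # class with the largest posterior probability
print(clf.predict_proba(x_new))  # the posterior P(Y|X=x_new) itself
```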
2. Principles and methods
The training data set is
$$T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$$
Naive Bayes learns the joint probability distribution $P(X,Y)$ from the training data set.
By the multiplication rule of probability: $P(X,Y)=P(Y)P(X|Y)$
Learning the joint probability can therefore be decomposed into learning the prior probability distribution and the conditional probability distribution.
Prior probability distribution:
$$P(Y=c_k),\quad k=1,2,\dots,K$$
By the law of large numbers, when there is enough data the relative frequency can be used directly as an estimate of the probability, so the prior probability can be estimated reliably whenever the data are sufficient.
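As a tiny illustration (hypothetical labels), this frequency estimate of the prior is just a normalized count:

```python
# Illustrative only: the maximum likelihood estimate of the prior P(Y=c_k)
# is the relative frequency of class c_k among the training labels.
from collections import Counter

y_train = [0, 1, 0, 1, 0, 0]                 # hypothetical labels
N = len(y_train)
priors = {c: n / N for c, n in Counter(y_train).items()}
print(priors)                                 # {0: 0.666..., 1: 0.333...}
```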
Conditional probability distribution:
$$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k),\quad k=1,2,\dots,K$$
Note: the superscript index denotes one of the $n$ attributes (features) of a sample.
However, the conditional probability distribution $P(X=x|Y=c_k)$ has an exponential number of parameters, so estimating it directly is infeasible. Indeed, suppose $x^{(j)}$ can take $S_j$ distinct values, $j=1,2,\dots,n$, and $Y$ can take $K$ values; then the number of parameters is $K\prod_{j=1}^{n}S_j$. In practice this value is far larger than the number of training samples, which means some value combinations never appear in the training set, and estimating their probabilities by frequencies is clearly infeasible: "not observed" and "has probability zero" are different concepts.
Naive Bayes assumes that the conditional probability distribution factorizes, i.e. the features are conditionally independent given the class. This is a strong assumption, hence the name "naive" Bayes. Formally:
$$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
With this assumption the number of parameters is reduced to $K\sum_{j=1}^{n}S_j$, which grows only linearly.
An example: suppose attribute 1 takes the values 0 and 1, attribute 2 takes the values A, B, and C, and the label $Y$ takes a single value. Without the conditional independence assumption, the parameters to estimate are P(0,A), P(0,B), P(0,C), P(1,A), P(1,B), P(1,C), i.e. 6 in total; with the assumption, only P(0), P(1), P(A), P(B), P(C) are needed, i.e. 5. The gap grows much larger as the attributes take more values and as the number of attributes increases.
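The counting in this example can be checked mechanically; the snippet below (illustrative only) enumerates the joint value combinations and compares $K\prod_j S_j$ with $K\sum_j S_j$.

```python
# Check of the counting argument: with feature values {0,1} and {A,B,C} and a
# single class, the joint model needs 2*3 = 6 parameters, while the
# conditionally independent model needs only 2 + 3 = 5.
from itertools import product
from math import prod

value_sets = [[0, 1], ["A", "B", "C"]]
K = 1  # number of classes in the example

joint_params = K * prod(len(v) for v in value_sets)        # K * prod_j S_j
independent_params = K * sum(len(v) for v in value_sets)   # K * sum_j S_j

print(list(product(*value_sets)))            # the 6 joint combinations
print(joint_params, independent_params)      # 6 5
```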
Clearly, naive Bayes first models the joint probability and then derives the class with the maximum posterior probability, so it is a typical generative model.
The posterior probability is obtained from Bayes' theorem:
$$P(Y=c_k|X=x)=\frac{P(X=x|Y=c_k)P(Y=c_k)}{\sum_{k=1}^{K}P(X=x|Y=c_k)P(Y=c_k)}$$
Substituting the conditional independence assumption:
$$P(Y=c_k|X=x)=\frac{P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)}{\sum_{k=1}^{K}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)}$$
The naive Bayes classifier can therefore be written as:
$$y=f(x)=\arg\max_{c_k}\frac{P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)}{\sum_{k=1}^{K}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)}$$
Since the denominator is the same for every $c_k$, and only the relative order of the values matters rather than their exact magnitude, the denominator can be dropped:
$$y=f(x)=\arg\max_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
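To make the decision rule concrete, here is a from-scratch sketch that estimates $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)}|Y=c_k)$ by frequency counts and applies the argmax above in log space. The function names and toy data are assumptions made for illustration; no smoothing is applied, so an unseen feature value gives probability 0 (see the Bayesian estimation discussion in Section 5 below).

```python
# From-scratch sketch of the decision rule
#   y = argmax_k  P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k)
# for categorical features, using frequency counts (MLE) and log-probabilities
# to avoid numerical underflow. Names and toy data are illustrative.
from collections import Counter, defaultdict
import math

def fit(X, y):
    class_counts = Counter(y)
    priors = {c: n / len(y) for c, n in class_counts.items()}
    # cond[(j, v, c)] = number of class-c samples whose feature j equals v
    cond = defaultdict(int)
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cond[(j, v, yi)] += 1
    return priors, cond, class_counts

def predict(x, priors, cond, class_counts):
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for j, v in enumerate(x):
            p = cond[(j, v, c)] / class_counts[c]  # MLE; can be 0 without smoothing
            score += math.log(p) if p > 0 else -math.inf
        if score > best_score:
            best_class, best_score = c, score
    return best_class

X_train = [[0, "A"], [1, "B"], [0, "A"], [1, "C"]]
y_train = [0, 1, 0, 1]
model = fit(X_train, y_train)
print(predict([0, "A"], *model))   # expected output: 0
```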
3. The meaning of maximizing the posterior probability
To be updated
4. Algorithm steps
To be updated
5. Extensions and improvements of naive Bayes
Bayesian estimation: with maximum likelihood estimation, a probability to be estimated may turn out to be 0, which distorts the posterior probability and the resulting classification. Bayesian estimation adds a positive constant to the count of every value of the random variable; with the constant set to 1 this is called Laplace smoothing. The smoothed estimates are given below.
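For reference, the Bayesian (smoothed) estimates in their standard form, where $a_{jl}$ denotes the $l$-th possible value of feature $j$, $I(\cdot)$ is the indicator function, and $\lambda\ge 0$ is the smoothing parameter (taking $\lambda=1$ gives Laplace smoothing):

$$P_\lambda(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\,y_i=c_k)+\lambda}{\sum_{i=1}^{N} I(y_i=c_k)+S_j\lambda}$$

$$P_\lambda(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)+\lambda}{N+K\lambda}$$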
Semi-naive Bayes: the attribute conditional independence assumption is relaxed to some extent, giving rise to a family of so-called "semi-naive Bayesian classifiers". They take a moderate amount of inter-attribute dependence into account, so that a full joint probability computation is not required, while relatively strong attribute dependencies are not completely ignored.
Bayesian networks: a directed acyclic graph is used to describe the dependencies between attributes, and conditional probability tables are used to describe the joint probability distribution of the attributes.
Why naive Bayes often performs well despite its strong assumption:
(1) For classification it is enough that the posterior probabilities of the classes are ranked correctly; exact probability values are not required to obtain the correct classification result.
(2) If the dependencies between attributes affect all classes in the same way, or if the dependencies cancel each other out, then the attribute conditional independence assumption reduces the computational cost without hurting performance.