
[Statistical Learning Methods] Learning Notes - Chapter 4: The Naive Bayes Method



The naive Bayes method (Naive Bayes) is a classification method based on Bayes' theorem and the assumption of conditional independence among features. For a given training data set, it first learns the joint probability distribution of input and output under the conditional independence assumption; then, based on this model, for a given input $x$ it uses Bayes' theorem to find the output $y$ with the maximum posterior probability.

1. Learning and classification of naive Bayes

1.1 The basic method

Let the input space $\mathcal{X} \subseteq \mathbf{R}^n$ be a set of $n$-dimensional vectors, and let the output space be the set of class labels $\mathcal{Y}=\{c_1,c_2,\dots,c_K\}$. The input is a feature vector $x \in \mathcal{X}$ and the output is a class label $y \in \mathcal{Y}$. $X$ is a random vector defined on the input space $\mathcal{X}$, $Y$ is a random variable defined on the output space $\mathcal{Y}$, and $P(X,Y)$ is the joint probability distribution of $X$ and $Y$. The training data set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$ is generated independently and identically distributed according to $P(X,Y)$.

The naive Bayes method learns the joint probability distribution $P(X,Y)$ from the training data set. Specifically, it learns the following prior probability distribution and conditional probability distribution.

  • Prior probability distribution: $P(Y=c_k),\ k=1,2,\dots,K$
  • Conditional probability distribution: $P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k),\ k=1,2,\dots,K$

From these two, the joint probability distribution $P(X,Y)$ is obtained.
The conditional probability distribution $P(X=x|Y=c_k)$ has an exponential number of parameters, so estimating it directly is infeasible. In fact, if $x^{(j)}$ can take $S_j$ distinct values, $j=1,2,\dots,n$, and $Y$ can take $K$ values, then the number of parameters is $K\prod_{j=1}^{n}S_j$.

To simplify the computation, the naive Bayes method makes the conditional independence assumption: once the class is given, the features used for classification are conditionally independent. Formally:
$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$
The naive Bayes method actually learns the mechanism that generates the data, so it is a generative model. The conditional independence assumption greatly reduces the number of parameters to estimate, but it can sacrifice some classification accuracy.
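To see concretely how much the assumption helps, here is a small worked count (the values $K=2$, $n=10$, $S_j=2$ are purely illustrative):
$\text{without the assumption: } K\prod_{j=1}^{n}S_j = 2 \times 2^{10} = 2048 \text{ parameters}$
$\text{with the assumption: } K\sum_{j=1}^{n}S_j = 2 \times (10 \times 2) = 40 \text{ parameters}$
since only the per-feature conditionals $P(X^{(j)}=x^{(j)}|Y=c_k)$ need to be estimated.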

When classifying, naive Bayes uses the learned model to compute the posterior probability distribution $P(Y=c_k|X=x)$ for the given input $x$, and assigns $x$ to the class with the largest posterior probability. The posterior probability is computed by Bayes' theorem:
$P(Y=c_k|X=x)=\dfrac{P(X=x|Y=c_k)P(Y=c_k)}{\sum_{k}P(X=x|Y=c_k)P(Y=c_k)}$
Therefore, the naive Bayes classifier can be expressed as:
$y=f(x)=\arg\max_{c_k}\dfrac{P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)}{\sum_k P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)}$
Note that the denominator is the same for every $c_k$, so this simplifies to:
$y=f(x)=\arg\max_{c_k}P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)$

1.2 The meaning of maximizing the posterior probability

Naive Bayes assigns an instance to the class with the largest posterior probability, which is equivalent to minimizing the expected risk under the 0-1 loss function. The expected-risk-minimization criterion thus yields the posterior-probability-maximization criterion:
$f(x)=\arg\max_{c_k}P(c_k|X=x)$
This is the principle that the naive Bayes method follows; the derivation is sketched below.
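Briefly, the reasoning: take the 0-1 loss function $L(Y,f(X))=I(Y\neq f(X))$. Writing the expected risk by conditioning on $X$ and minimizing it pointwise for each $x$ gives
$R_{\exp}(f)=E_X\sum_{k=1}^{K}L(c_k,f(X))\,P(c_k|X)$
$f(x)=\arg\min_{y\in\mathcal{Y}}\sum_{k=1}^{K}I(y\neq c_k)\,P(c_k|X=x)=\arg\min_{y\in\mathcal{Y}}\bigl(1-P(y|X=x)\bigr)=\arg\max_{c_k}P(c_k|X=x)$
which is exactly the posterior probability maximization criterion above.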

2. Parameter estimation of naive Bayesian method

2.1 Maximum likelihood estimation

In naive Bayes, learning means estimating $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)}|Y=c_k)$. Maximum likelihood estimation can be used for both. The maximum likelihood estimate of the prior probability $P(Y=c_k)$ is:
$P(Y=c_k)=\dfrac{\sum_{i=1}^N I(y_i=c_k)}{N},\quad k=1,2,\dots,K$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1},a_{j2},\dots,a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl}|Y=c_k)$ is:
$P(X^{(j)}=a_{jl}|Y=c_k)=\dfrac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\,y_i=c_k)}{\sum_{i=1}^N I(y_i=c_k)}$
where $j=1,2,\dots,n$; $l=1,2,\dots,S_j$; $k=1,2,\dots,K$; $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample; $a_{jl}$ is the $l$-th possible value of the $j$-th feature; and $I$ is the indicator function.

2.2 Learning and classification algorithms

Algorithm: the naive Bayes algorithm (naive Bayes algorithm)
Input: training data $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)}\in \{a_{j1},a_{j2},\dots,a_{jS_j}\}$, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, $j=1,2,\dots,n$, $l=1,2,\dots,S_j$, $y_i\in \{c_1,c_2,\dots,c_K\}$; an instance $x$;
Output: the class of instance $x$.
(1) Compute the prior and conditional probabilities:
$P(Y=c_k)=\dfrac{\sum_{i=1}^N I(y_i=c_k)}{N},\quad k=1,2,\dots,K$
$P(X^{(j)}=a_{jl}|Y=c_k)=\dfrac{\sum_{i=1}^N I(x_{i}^{(j)}=a_{jl},\,y_i=c_k)}{\sum_{i=1}^N I(y_i=c_k)},\quad j=1,2,\dots,n;\ l=1,2,\dots,S_j;\ k=1,2,\dots,K$
(2) For the given instance $x=(x^{(1)}, x^{(2)},\dots, x^{(n)})^T$, compute
$P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)}|Y=c_k),\quad k=1,2,\dots,K$
(3) Determine the class of instance $x$:
$y=\arg\max_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$
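Below is a minimal Python sketch of this algorithm for categorical features, using the maximum likelihood estimates of step (1); the class name NaiveBayes and the toy data set are illustrative and not taken from the book:

from collections import Counter, defaultdict

class NaiveBayes:
    """Naive Bayes for categorical features, with maximum likelihood estimates."""

    def fit(self, X, y):
        N = len(y)
        self.classes = sorted(set(y))
        class_counts = Counter(y)
        # Step (1a): prior P(Y=c_k) = count(y_i = c_k) / N
        self.prior = {c: class_counts[c] / N for c in self.classes}
        # Step (1b): conditional P(X^(j)=a_jl | Y=c_k)
        #            = count(x_i^(j)=a_jl and y_i=c_k) / count(y_i=c_k)
        counts = defaultdict(int)
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                counts[(j, v, yi)] += 1
        self.cond = {key: n / class_counts[key[2]] for key, n in counts.items()}
        return self

    def predict(self, x):
        # Steps (2) and (3): argmax_k  P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k)
        best_class, best_score = None, -1.0
        for c in self.classes:
            score = self.prior[c]
            for j, v in enumerate(x):
                score *= self.cond.get((j, v, c), 0.0)  # unseen (value, class) pair -> 0
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Illustrative toy data: first feature in {1, 2, 3}, second feature in {'S', 'M', 'L'}
X = [(1, 'S'), (1, 'M'), (1, 'M'), (1, 'S'), (1, 'S'),
     (2, 'S'), (2, 'M'), (2, 'M'), (2, 'L'), (2, 'L'),
     (3, 'L'), (3, 'M'), (3, 'M'), (3, 'L'), (3, 'L')]
y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
model = NaiveBayes().fit(X, y)
print(model.predict((2, 'S')))  # prints -1, the class with the larger product of probabilities

Note that predict multiplies raw probabilities; with many features it is numerically safer to sum logarithms instead, and the .get(..., 0.0) fallback is precisely the zero-probability problem that Bayesian estimation in section 2.3 addresses.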

2.3 Bayesian estimation

Maximum likelihood estimation may yield conditional probability estimates equal to 0, which distorts the computation of the posterior probabilities and biases the classification. The remedy is Bayesian estimation. Specifically, the Bayesian estimate of the conditional probability is:
$P_\lambda (X^{(j)}=a_{jl}|Y=c_k)=\dfrac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\, y_i=c_k)+\lambda}{\sum_{i=1}^N I(y_i=c_k)+S_j\lambda}$
where $\lambda \ge 0$. This is equivalent to adding a positive number $\lambda > 0$ to the count of each possible value of the random variable. When $\lambda = 0$ it reduces to maximum likelihood estimation. A common choice is $\lambda = 1$, which is called Laplace smoothing. Clearly, for any $l=1,2,\dots,S_j$ and $k=1,2,\dots,K$,
$P_\lambda (X^{(j)}=a_{jl}|Y=c_k)>0$
$\sum_{l=1}^{S_j}P_\lambda(X^{(j)}=a_{jl}|Y=c_k)=1$
Similarly, the Bayesian estimate of the prior probability is:
$P_\lambda (Y=c_k)=\dfrac{\sum_{i=1}^N I(y_i=c_k)+\lambda}{N+K\lambda}$
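Continuing the sketch from section 2.2, the estimation step could be replaced with the smoothed formulas; this is a minimal illustration (the function name bayesian_estimates and the parameter lam are made up for the example), assuming categorical features:

from collections import Counter, defaultdict

def bayesian_estimates(X, y, lam=1.0):
    """Smoothed estimates: lam=1.0 is Laplace smoothing, lam=0.0 recovers maximum likelihood."""
    N = len(y)
    classes = sorted(set(y))
    K = len(classes)
    class_counts = Counter(y)
    # P_lambda(Y=c_k) = (count(y_i=c_k) + lam) / (N + K*lam)
    prior = {c: (class_counts[c] + lam) / (N + K * lam) for c in classes}

    n_features = len(X[0])
    # The observed value set of each feature j; S_j is its size
    values = [sorted({xi[j] for xi in X}) for j in range(n_features)]
    counts = defaultdict(int)
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            counts[(j, v, yi)] += 1
    # P_lambda(X^(j)=a_jl | Y=c_k) = (count + lam) / (count(y_i=c_k) + S_j*lam)
    cond = {(j, v, c): (counts[(j, v, c)] + lam) / (class_counts[c] + len(values[j]) * lam)
            for c in classes for j in range(n_features) for v in values[j]}
    return prior, cond

With $\lambda = 1$ every estimated conditional probability is strictly positive, so a feature value that never co-occurs with some class in the training data no longer forces the whole product in step (2) to zero.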

3. Summary

  1. The naive Bayes method is a typical generative learning method. The generative approach learns the joint probability distribution $P(X,Y)$ from the training data and then obtains the posterior probability distribution $P(Y|X)$. Concretely, the training data are used to estimate $P(X|Y)$ and $P(Y)$, giving the joint probability distribution:
    $P(X,Y)=P(Y)P(X|Y)$
    The probabilities can be estimated by maximum likelihood estimation or by Bayesian estimation.
  2. The basic assumption of the naive Bayes method is conditional independence:
    $P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$
    This makes the naive Bayes method efficient, but its classification performance is not necessarily very high.
  3. The naive Bayes method uses Bayes' theorem and the learned joint probability model for classification and prediction:
    $P(Y|X)=\dfrac{P(X, Y)}{P(X)}=\dfrac{P(Y)P(X|Y)}{\sum_Y P(Y)P(X|Y)}$
    The input $x$ is assigned to the class $y$ with the largest posterior probability:
    $y=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^n P(X^{(j)}=x^{(j)}|Y=c_k)$
    Maximizing the posterior probability is equivalent to minimizing the expected risk under the 0-1 loss function.