[Statistical Learning Methods] Learning Notes - Chapter 4: The Naive Bayes Method
2022-07-07 12:34:00 【Sickle leek】
The naive Bayes method (naive Bayes) is a classification method based on Bayes' theorem and the assumption of conditional independence of features. For a given training data set, it first learns the joint probability distribution of the input and output under the conditional independence assumption; then, based on this model, for a given input $x$ it uses Bayes' theorem to find the output $y$ with the largest posterior probability.

1. Learning and classification of naive Bayes
1.1 The basic method
Let the input space $\mathcal{X} \subseteq \mathbf{R}^n$ be a set of $n$-dimensional vectors, and let the output space be the set of class labels $\mathcal{Y}=\{c_1,c_2,\dots,c_K\}$. The input is a feature vector $x \in \mathcal{X}$ and the output is a class label $y \in \mathcal{Y}$. $X$ is a random vector defined on the input space $\mathcal{X}$, $Y$ is a random variable defined on the output space $\mathcal{Y}$, and $P(X,Y)$ is the joint probability distribution of $X$ and $Y$. The training data set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$ is drawn independently and identically distributed from $P(X,Y)$.
The naive Bayes method learns the joint probability distribution $P(X,Y)$ from the training data set. Specifically, it learns the following prior probability distribution and conditional probability distribution:

- Prior probability distribution: $P(Y=c_k),\ k=1,2,\dots,K$
- Conditional probability distribution: $P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k),\ k=1,2,\dots,K$

Together these give the joint probability distribution $P(X,Y)$.
The conditional probability distribution $P(X=x \mid Y=c_k)$ has exponentially many parameters, so estimating it directly is infeasible. Indeed, suppose $x^{(j)}$ can take $S_j$ values, $j=1,2,\dots,n$, and $Y$ can take $K$ values; then the number of parameters is $K\prod_{j=1}^{n}S_j$. For example, with $K=2$ classes and $n=30$ binary features ($S_j=2$), that is already $2 \times 2^{30} \approx 2\times 10^9$ parameters.
To simplify the computation, naive Bayes makes the conditional independence assumption: once the class is fixed, the features used for classification are conditionally independent. Formally:
$$P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
Naive Bayes actually learns the mechanism by which the data are generated, so it is a generative model. The conditional independence assumption greatly reduces the computational cost, but it can sacrifice some classification accuracy.
To classify a given input $x$, naive Bayes uses the learned model to compute the posterior probability distribution $P(Y=c_k \mid X=x)$ and assigns $x$ to the class with the largest posterior probability. The posterior probability is computed by Bayes' theorem:
$$P(Y=c_k \mid X=x)=\frac{P(X=x \mid Y=c_k)P(Y=c_k)}{\sum_{k}P(X=x \mid Y=c_k)P(Y=c_k)}$$
Therefore, the naive Bayes classifier can be expressed as:
$$y=f(x)=\arg\max_{c_k}\frac{P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_k P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)} \mid Y=c_k)}$$
Note that the denominator in the formula above is the same for all $c_k$, so:
$$y=f(x)=\arg\max_{c_k}P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
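A practical note beyond the book's treatment: the product of many small probabilities can underflow floating-point arithmetic, so implementations usually maximize the logarithm of this expression instead; the argmax is unchanged because the logarithm is strictly increasing:
$$y=\arg\max_{c_k}\left[\log P(Y=c_k)+\sum_{j}\log P(X^{(j)}=x^{(j)} \mid Y=c_k)\right]$$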
1.2 The meaning of maximizing the posterior probability
Naive Bayes assigns an instance to the class with the largest posterior probability, which is equivalent to minimizing the expected risk. The expected risk minimization criterion yields the posterior probability maximization criterion:
$$f(x)=\arg\max_{c_k}P(c_k \mid X=x)$$
This is the principle that the naive Bayes method follows.
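To spell out the intermediate step of this equivalence: under the 0-1 loss function, where $L(Y,f(X))=1$ if $Y \neq f(X)$ and $0$ otherwise, minimizing the expected risk pointwise at each $x$ gives
$$f(x)=\arg\min_{y\in\mathcal{Y}}\sum_{k=1}^{K}L(c_k,y)\,P(c_k \mid X=x)=\arg\min_{y\in\mathcal{Y}}\bigl(1-P(y \mid X=x)\bigr)=\arg\max_{y\in\mathcal{Y}}P(y \mid X=x)$$
which is exactly the posterior probability maximization criterion.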
2. Parameter estimation of naive Bayesian method
2.1 Maximum likelihood estimation
In naive Bayes, learning means estimating $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)} \mid Y=c_k)$. These probabilities can be estimated by maximum likelihood. The maximum likelihood estimate of the prior probability $P(Y=c_k)$ is:
$$P(Y=c_k)=\frac{\sum_{i=1}^N I(y_i=c_k)}{N},\quad k=1,2,\dots,K$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1},a_{j2},\dots,a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl} \mid Y=c_k)$ is:
$$P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\ y_i=c_k)}{\sum_{i=1}^N I(y_i=c_k)}$$
for $j=1,2,\dots,n$; $l=1,2,\dots,S_j$; $k=1,2,\dots,K$, where $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, and $I$ is the indicator function.
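As a quick hypothetical illustration of these estimators (the numbers are made up): if $9$ of $N=15$ training samples have $y_i=1$, then $P(Y=1)=9/15=3/5$; and if $3$ of those $9$ samples also have $x^{(1)}=1$, then $P(X^{(1)}=1 \mid Y=1)=3/9=1/3$.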
2.2 Learning and classification algorithms
Algorithm: the naive Bayes algorithm
Input: training data $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)} \in \{a_{j1},a_{j2},\dots,a_{jS_j}\}$, $a_{jl}$ is the $l$-th possible value of the $j$-th feature, $j=1,2,\dots,n$, $l=1,2,\dots,S_j$, $y_i \in \{c_1,c_2,\dots,c_K\}$; and an instance $x$.
Output: the class of instance $x$.
(1) Compute the prior and conditional probabilities:
$$P(Y=c_k)=\frac{\sum_{i=1}^N I(y_i=c_k)}{N},\quad k=1,2,\dots,K$$
$$P(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\ y_i=c_k)}{\sum_{i=1}^N I(y_i=c_k)},\quad j=1,2,\dots,n;\ l=1,2,\dots,S_j;\ k=1,2,\dots,K$$
(2) For the given instance $x=(x^{(1)},x^{(2)},\dots,x^{(n)})^T$, compute
$$P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k),\quad k=1,2,\dots,K$$
(3) Determine the class of instance $x$:
$$y=\arg\max_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
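The following is a minimal sketch of this algorithm in Python, with the probabilities estimated by maximum likelihood as in section 2.1. The class name `NaiveBayes` and all other identifiers are my own choices for illustration, not from the book:

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Naive Bayes for discrete features, with probabilities estimated by MLE."""

    def fit(self, X, y):
        N = len(y)
        # Step (1): prior P(Y = c_k) = count(y_i = c_k) / N
        self.class_counts = Counter(y)
        self.prior = {c: cnt / N for c, cnt in self.class_counts.items()}
        # Step (1): conditional P(X^(j) = a_jl | Y = c_k)
        #   = count(x_i^(j) = a_jl and y_i = c_k) / count(y_i = c_k)
        counts = defaultdict(Counter)  # (class, feature index) -> value counts
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                counts[(yi, j)][v] += 1
        self.cond = {
            cj: {v: n / self.class_counts[cj[0]] for v, n in counter.items()}
            for cj, counter in counts.items()
        }
        return self

    def predict(self, x):
        # Steps (2) and (3): score each class, return the argmax.
        best_c, best_score = None, -1.0
        for c in self.prior:
            score = self.prior[c]
            for j, v in enumerate(x):
                # Under MLE, a value never seen with this class gets probability 0.
                score *= self.cond[(c, j)].get(v, 0.0)
            if score > best_score:
                best_c, best_score = c, score
        return best_c

# Hypothetical usage on a tiny made-up data set:
X = [(1, "S"), (1, "M"), (2, "M"), (2, "S"), (3, "L")]
y = [-1, -1, 1, -1, 1]
print(NaiveBayes().fit(X, y).predict((2, "S")))  # -1 on this toy data
```

Note that under pure MLE an unseen (feature value, class) pair contributes a factor of 0 to the score, which is exactly the problem Bayesian estimation in section 2.3 addresses.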
2.3 Bayesian estimation
Maximum likelihood estimation may produce probability estimates equal to 0, which distorts the computation of the posterior probabilities and biases the classification. The remedy is to use Bayesian estimation. Specifically, the Bayesian estimate of the conditional probability is:
$$P_\lambda(X^{(j)}=a_{jl} \mid Y=c_k)=\frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\ y_i=c_k)+\lambda}{\sum_{i=1}^N I(y_i=c_k)+S_j\lambda}$$
where $\lambda \ge 0$. This is equivalent to adding a positive number $\lambda > 0$ to the frequency count of each possible value of the random variable. When $\lambda = 0$, it reduces to maximum likelihood estimation. A common choice is $\lambda = 1$, which is called Laplace smoothing. Clearly, for any $l=1,2,\dots,S_j$ and $k=1,2,\dots,K$:
$$P_\lambda(X^{(j)}=a_{jl} \mid Y=c_k)>0$$
$$\sum_{l=1}^{S_j}P_\lambda(X^{(j)}=a_{jl} \mid Y=c_k)=1$$
Similarly, the Bayesian estimate of the prior probability is:
$$P_\lambda(Y=c_k)=\frac{\sum_{i=1}^N I(y_i=c_k)+\lambda}{N+K\lambda}$$
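As a sketch of how these smoothed estimates look in code (again my own illustration, continuing the counting scheme of the MLE sketch above; `lam` plays the role of $\lambda$):

```python
def smoothed_conditional(count_jl_k, count_k, S_j, lam=1.0):
    # Bayesian estimate of P(X^(j) = a_jl | Y = c_k); lam = 1 is Laplace smoothing.
    return (count_jl_k + lam) / (count_k + S_j * lam)

def smoothed_prior(count_k, N, K, lam=1.0):
    # Bayesian estimate of P(Y = c_k).
    return (count_k + lam) / (N + K * lam)

# With lam = 1, a (feature value, class) pair never seen in training still gets
# a small positive probability instead of 0, e.g.:
print(smoothed_conditional(0, 9, 3, lam=1.0))  # 1/12 rather than 0
```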
3. Summary
- The naive Bayes method is a typical generative learning method. The generative approach learns the joint probability distribution $P(X,Y)$ from the training data and then obtains the posterior probability distribution $P(Y \mid X)$. Concretely, it uses the training data to estimate $P(X \mid Y)$ and $P(Y)$, giving the joint probability distribution:
$$P(X,Y)=P(Y)P(X \mid Y)$$
The probabilities can be estimated by maximum likelihood estimation or Bayesian estimation.
- The basic assumption of naive Bayes is conditional independence:
$$P(X=x \mid Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)} \mid Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
Naive Bayes is therefore efficient, but its classification performance is not necessarily high.
- Naive Bayes uses Bayes' theorem and the learned joint probability model for classification and prediction:
$$P(Y \mid X)=\frac{P(X,Y)}{P(X)}=\frac{P(Y)P(X \mid Y)}{\sum_Y P(Y)P(X \mid Y)}$$
The input $x$ is assigned to the class $y$ with the largest posterior probability:
$$y=\arg\max_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)} \mid Y=c_k)$$
Maximizing the posterior probability is equivalent to minimizing the expected risk under the 0-1 loss function.