[Statistical Learning Methods] Learning Notes - Chapter 4: The Naive Bayes Method
2022-07-07 12:34:00 【Sickle leek】
Statistical Learning Methods learning notes: the naive Bayes method

Naive Bayes method (naive Bayes)

The naive Bayes method is a classification method based on Bayes' theorem and the assumption of conditional independence among features. For a given training data set, it first learns the joint probability distribution of the input and output under the conditional independence assumption; then, based on this model, for a given input $x$, it uses Bayes' theorem to find the output $y$ with the largest posterior probability.

1. Learning and classification of naive Bayes
1.1 The basic method
Let the input space $\mathcal{X} \subseteq \mathbf{R}^n$ be a set of $n$-dimensional vectors and the output space be the set of class labels $\mathcal{Y}=\{c_1,c_2,\dots,c_K\}$. The input is a feature vector $x \in \mathcal{X}$ and the output is a class label $y \in \mathcal{Y}$. $X$ is a random vector defined on the input space $\mathcal{X}$, $Y$ is a random variable defined on the output space $\mathcal{Y}$, and $P(X,Y)$ is the joint probability distribution of $X$ and $Y$. The training data set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$ is generated independently and identically distributed according to $P(X,Y)$.
The naive Bayes method learns the joint probability distribution $P(X,Y)$ from the training data set. Specifically, it learns the following prior probability distribution and conditional probability distribution:

- Prior probability distribution: $P(Y=c_k),\ k=1,2,\dots,K$
- Conditional probability distribution: $P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k),\ k=1,2,\dots,K$

From these two distributions, the joint probability distribution $P(X,Y)$ is obtained.
The conditional probability distribution $P(X=x|Y=c_k)$ has exponentially many parameters, so estimating it directly is infeasible in practice. Indeed, suppose $x^{(j)}$ can take $S_j$ distinct values, $j=1,2,\dots,n$, and $Y$ can take $K$ values; then the number of parameters is $K\prod_{j=1}^{n}S_j$. For example, with $K=2$ classes and $n=10$ binary features this is already $2\times 2^{10}=2048$ parameters.
To simplify the computation, the naive Bayes method makes the conditional independence assumption: given the class, the features used for classification are conditionally independent. Formally,
$$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
The naive Bayes method in fact learns the mechanism that generates the data, so it belongs to the class of generative models. The conditional independence assumption reduces the computational complexity, but it can cost some classification accuracy.
For classification, given an input $x$, the learned model is used to compute the posterior probability distribution $P(Y=c_k|X=x)$, and the class with the largest posterior probability is taken as the class of $x$. The posterior probability is computed by Bayes' theorem:
$$P(Y=c_k|X=x)=\frac{P(X=x|Y=c_k)P(Y=c_k)}{\sum_{k}P(X=x|Y=c_k)P(Y=c_k)}$$
Therefore, the naive Bayes classifier can be expressed as
$$y=f(x)=\arg\max_{c_k}\frac{P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)}{\sum_k P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)}$$
Note that the denominator above is the same for every $c_k$, so
$$y=f(x)=\arg\max_{c_k}P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)$$
1.2 The meaning of maximizing the posterior probability
Naive Bayes assigns an instance to the class with the largest posterior probability, which is equivalent to minimizing the expected risk. From the criterion of minimizing the expected risk, the criterion of maximizing the posterior probability is obtained:
$$f(x)=\arg\max_{c_k}P(c_k|X=x)$$
This is the principle that the naive Bayes method follows; a short derivation is given below.
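To see the equivalence, here is a brief derivation under the 0-1 loss function $L(Y,f(X))$, which equals 1 when $f(X)\neq Y$ and 0 otherwise (a minimal sketch of the standard argument, written out here for completeness):
$$
\begin{aligned}
R_{\mathrm{exp}}(f) &= E\big[L(Y,f(X))\big] = E_X\sum_{k=1}^{K} L\big(c_k,f(X)\big)\,P(c_k\mid X)\\
f(x) &= \arg\min_{y\in\mathcal{Y}}\sum_{k=1}^{K} L(c_k,y)\,P(c_k\mid X=x)
      = \arg\min_{y\in\mathcal{Y}}\sum_{c_k\neq y} P(c_k\mid X=x)\\
     &= \arg\min_{y\in\mathcal{Y}}\big(1-P(Y=y\mid X=x)\big)
      = \arg\max_{y\in\mathcal{Y}} P(Y=y\mid X=x)
\end{aligned}
$$
Minimizing the expected risk pointwise at each $x$ thus yields exactly the posterior maximization rule stated above.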
2. Parameter estimation of naive Bayesian method
2.1 Maximum likelihood estimation
In naive Bayes, learning means estimating $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)}|Y=c_k)$. These probabilities can be estimated by maximum likelihood. The maximum likelihood estimate of the prior probability $P(Y=c_k)$ is
$$P(Y=c_k)=\frac{\sum_{i=1}^N I(y_i=c_k)}{N},\quad k=1,2,\dots,K$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1},a_{j2},\dots,a_{jS_j}\}$. The maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl}|Y=c_k)$ is
$$P(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\ y_i=c_k)}{\sum_{i=1}^N I(y_i=c_k)}$$
where $j=1,2,\dots,n$; $l=1,2,\dots,S_j$; $k=1,2,\dots,K$; $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample; $a_{jl}$ is the $l$-th value the $j$-th feature can take; and $I$ is the indicator function.
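These estimates are just frequency counts. Below is a minimal Python sketch of the counting on a small hypothetical data set (the data, variable names, and dictionary layout are illustrative, not from the book):

```python
from collections import Counter, defaultdict

# Hypothetical training set: n = 2 discrete features per sample, labels in {1, -1}.
# The values are made up purely for illustration.
X = [("S", "m"), ("M", "m"), ("M", "l"), ("S", "l"), ("L", "l"), ("L", "m")]
y = [-1, -1, 1, 1, 1, -1]
N = len(y)

# Prior P(Y = c_k): relative frequency of each class label.
class_count = Counter(y)
prior = {c_k: class_count[c_k] / N for c_k in class_count}

# Conditional P(X^(j) = a_jl | Y = c_k): joint count divided by the class count.
joint_count = defaultdict(int)                 # key: (j, a_jl, c_k)
for x_i, y_i in zip(X, y):
    for j, a_jl in enumerate(x_i):
        joint_count[(j, a_jl, y_i)] += 1

cond = {
    (j, a_jl, c_k): cnt / class_count[c_k]
    for (j, a_jl, c_k), cnt in joint_count.items()
}

print(prior)                  # {-1: 0.5, 1: 0.5} on this toy data
print(cond[(0, "S", -1)])     # P(X^(1) = "S" | Y = -1) = 1/3 on this toy data
```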
2.2 Learning and classification algorithms
Algorithm: the naive Bayes algorithm (naive Bayes algorithm)

Input: training data $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)}\in \{a_{j1},a_{j2},\dots,a_{jS_j}\}$, $a_{jl}$ is the $l$-th value the $j$-th feature can take, $j=1,2,\dots,n$, $l=1,2,\dots,S_j$, $y_i\in \{c_1,c_2,\dots,c_K\}$; an instance $x$;

Output: the class of instance $x$.
(1) Compute the prior and conditional probabilities:
$$P(Y=c_k)=\frac{\sum_{i=1}^N I(y_i=c_k)}{N},\quad k=1,2,\dots,K$$
$$P(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^N I(x_{i}^{(j)}=a_{jl},\ y_i=c_k)}{\sum_{i=1}^N I(y_i=c_k)},\quad j=1,2,\dots,n;\ l=1,2,\dots,S_j;\ k=1,2,\dots,K$$
(2) For the given instance $x=(x^{(1)}, x^{(2)}, \dots, x^{(n)})^T$, compute
$$P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)}|Y=c_k),\quad k=1,2,\dots,K$$
(3) Determine the class of instance $x$:
$$y=\arg\max_{c_k}P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
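Steps (2) and (3) can be sketched in a few lines of Python. The probability tables below follow the same layout as the counting sketch in Section 2.1 but are filled with small hypothetical numbers; the names and values are illustrative only:

```python
# Hypothetical probability tables:
#   prior[c_k]           ~ P(Y = c_k)
#   cond[(j, a_jl, c_k)] ~ P(X^(j) = a_jl | Y = c_k)
prior = {1: 0.4, -1: 0.6}
cond = {
    (0, "S", 1): 0.2, (0, "S", -1): 0.5,
    (1, "m", 1): 0.3, (1, "m", -1): 0.4,
}

def classify(x, prior, cond):
    """Return the class c_k maximizing P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k)."""
    best_class, best_score = None, -1.0
    for c_k, p_ck in prior.items():
        score = p_ck
        for j, x_j in enumerate(x):
            # Under MLE an unseen (feature value, class) pair gets probability 0;
            # Bayesian estimation (Section 2.3) avoids this.
            score *= cond.get((j, x_j, c_k), 0.0)
        if score > best_score:
            best_class, best_score = c_k, score
    return best_class

print(classify(("S", "m"), prior, cond))   # -1 on these hypothetical tables
```

In practice the product is usually computed as a sum of log probabilities to avoid numerical underflow when $n$ is large.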
2.3 Bayesian estimation
With maximum likelihood estimation, a probability to be estimated may turn out to be 0; this distorts the computation of the posterior probabilities and biases the classification. The remedy is to use Bayesian estimation. Specifically, the Bayesian estimate of the conditional probability is
$$P_\lambda (X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^N I(x_i^{(j)}=a_{jl},\ y_i=c_k)+\lambda }{\sum_{i=1}^N I(y_i=c_k)+S_j\lambda }$$
where $\lambda \ge 0$. This is equivalent to adding a positive number $\lambda > 0$ to the count of each value of the random variable. When $\lambda = 0$ it reduces to maximum likelihood estimation. A common choice is $\lambda = 1$, which is called Laplace smoothing. Clearly, for any $l=1,2,\dots,S_j$ and $k=1,2,\dots,K$,
$$P_\lambda (X^{(j)}=a_{jl}|Y=c_k)>0$$
$$\sum_{l=1}^{S_j}P_\lambda(X^{(j)}=a_{jl}|Y=c_k)=1$$
Similarly, the Bayesian estimate of the prior probability is
$$P_\lambda (Y=c_k)=\frac{\sum_{i=1}^N I(y_i=c_k)+\lambda}{N+K\lambda}$$
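The following is a minimal sketch of these smoothed estimates with $\lambda=1$ (Laplace smoothing) on a single hypothetical feature; the data and names are illustrative, and the final checks mirror the two properties stated above (every estimate is positive and each conditional distribution sums to 1):

```python
from collections import Counter

# Hypothetical discrete data for one feature X^(j); class labels in {1, -1}.
xj = ["S", "M", "M", "S", "L", "L"]
y  = [-1,  -1,   1,   1,   1,  -1]
feature_values = ["S", "M", "L"]      # the S_j possible values a_j1, ..., a_jS_j
classes = [1, -1]                     # the K classes c_1, ..., c_K
lam = 1.0                             # lambda = 1 gives Laplace smoothing

N = len(y)
class_count = Counter(y)
S_j, K = len(feature_values), len(classes)

# Smoothed prior: (count(y_i = c_k) + lambda) / (N + K * lambda)
prior = {c: (class_count[c] + lam) / (N + K * lam) for c in classes}

# Smoothed conditional:
# (count(x_i^(j) = a_jl, y_i = c_k) + lambda) / (count(y_i = c_k) + S_j * lambda)
cond = {
    (a, c): (sum(1 for xv, yv in zip(xj, y) if xv == a and yv == c) + lam)
            / (class_count[c] + S_j * lam)
    for a in feature_values for c in classes
}

# Every smoothed estimate is strictly positive, and for each class the
# conditional probabilities sum to 1 over the S_j feature values.
assert all(p > 0 for p in cond.values())
for c in classes:
    assert abs(sum(cond[(a, c)] for a in feature_values) - 1.0) < 1e-9
```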
3. Summary

- The naive Bayes method is a typical generative learning method. The generative approach learns the joint probability distribution $P(X,Y)$ from the training data and then obtains the posterior probability distribution $P(Y|X)$. Concretely, the training data are used to estimate $P(X|Y)$ and $P(Y)$, which gives the joint probability distribution
$$P(X,Y)=P(Y)P(X|Y)$$
The probabilities can be estimated by maximum likelihood estimation or by Bayesian estimation.
- The basic assumption of the naive Bayes method is conditional independence:
$$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},\dots,X^{(n)}=x^{(n)}|Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
This makes the naive Bayes method efficient, but its classification performance is not necessarily high.
- The naive Bayes method uses Bayes' theorem together with the learned joint probability model for classification and prediction:
$$P(Y|X)=\frac{P(X, Y)}{P(X)}=\frac{P(Y)P(X|Y)}{\sum_Y P(Y)P(X|Y)}$$
The input $x$ is assigned to the class $y$ with the largest posterior probability:
$$y=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^n P(X^{(j)}=x^{(j)}|Y=c_k)$$
Maximizing the posterior probability is equivalent to minimizing the expected risk under the 0-1 loss function.