[statistical learning methods] learning notes - Chapter 4: naive Bayesian method
2022-07-07 12:34:00 【Sickle leek】
Statistical Learning Methods learning notes: the naive Bayes method
The naive Bayes method is a classification method based on Bayes' theorem and the conditional independence assumption on the features. For a given training data set, it first learns the joint probability distribution of input and output under the conditional independence assumption; then, based on this model, for a given input $x$ it uses Bayes' theorem to find the output $y$ with the largest posterior probability.

1. Learning and classification of naive Bayes
1.1 The basic method
Let the input space $\mathcal{X} \subseteq R^n$ be a set of $n$-dimensional vectors and the output space be the set of class labels $\mathcal{Y}=\{c_1,c_2,...,c_K\}$. The input is a feature vector $x\in \mathcal{X}$ and the output is a class label $y \in \mathcal{Y}$. $X$ is a random vector defined on the input space $\mathcal{X}$, $Y$ is a random variable defined on the output space $\mathcal{Y}$, and $P(X,Y)$ is the joint probability distribution of $X$ and $Y$. The training data set $T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}$ is drawn independently and identically distributed from $P(X,Y)$.
The naive Bayes method learns the joint probability distribution $P(X,Y)$ from the training data set. Specifically, it learns the following prior probability distribution and conditional probability distribution:
- Prior probability distribution: $P(Y=c_k),\ k=1,2,...,K$
- Conditional probability distribution: $P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},...,X^{(n)}=x^{(n)}|Y=c_k),\ k=1,2,...,K$
Together, these determine the joint probability distribution $P(X,Y)$.
The conditional probability distribution $P(X=x|Y=c_k)$ has exponentially many parameters, so estimating it directly is infeasible in practice. In fact, suppose $x^{(j)}$ can take $S_j$ distinct values, $j=1,2,...,n$, and $Y$ can take $K$ values; then the number of parameters is $K\prod_{j=1}^{n}S_j$. For example, with $K=2$ classes and $n=30$ binary features this is already $2\times 2^{30}$, on the order of two billion parameters.
To make learning tractable, the naive Bayes method makes the conditional independence assumption: given the class, the features used for classification are conditionally independent. Formally,
$$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},...,X^{(n)}=x^{(n)}|Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
Naive Bayes effectively learns the mechanism by which the data are generated, so it is a generative model. The conditional independence assumption greatly simplifies computation, but it can cost some classification accuracy.
To classify a given input $x$, the learned model is used to compute the posterior probability distribution $P(Y=c_k|X=x)$, and the class with the largest posterior probability is taken as the class of $x$. The posterior probability is computed by Bayes' theorem:
$$P(Y=c_k|X=x)=\frac{P(X=x|Y=c_k)P(Y=c_k)}{\sum_{k}P(X=x|Y=c_k)P(Y=c_k)}$$
Substituting the conditional independence assumption, the naive Bayes classifier can therefore be expressed as:
$$y=f(x)=\arg\max_{c_k}\frac{P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)}{\sum_k P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)}$$
Note that the denominator above is the same for every $c_k$, so this simplifies to:
$$y=f(x)=\arg\max_{c_k}P(Y=c_k)\prod_{j}P(X^{(j)}=x^{(j)}|Y=c_k)$$
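This decision rule is simple to implement. Below is a minimal sketch in Python, assuming the prior $P(Y=c_k)$ and the per-feature conditional probabilities $P(X^{(j)}=x^{(j)}|Y=c_k)$ have already been estimated and stored in plain dictionaries; the function name `nb_classify` and the dictionary layout are illustrative, not from the book.

```python
def nb_classify(x, prior, cond):
    """Return the class c_k maximizing P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k).

    x:     tuple of feature values (x^(1), ..., x^(n))
    prior: dict mapping class label c_k -> P(Y=c_k)
    cond:  dict mapping (feature index j, value a, class label c_k) -> P(X^(j)=a | Y=c_k)
    """
    best_class, best_score = None, -1.0
    for c_k, p_y in prior.items():
        score = p_y
        for j, x_j in enumerate(x):
            # Under plain MLE, an unseen (feature value, class) pair has probability 0.
            score *= cond.get((j, x_j, c_k), 0.0)
        if score > best_score:
            best_class, best_score = c_k, score
    return best_class
```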
1.2 The meaning of maximizing the posterior probability
Naive Bayes assigns an instance to the class with the largest posterior probability, which is equivalent to minimizing the expected risk under the 0-1 loss. From the expected risk minimization criterion one obtains the posterior probability maximization criterion:
$$f(x)=\arg\max_{c_k}P(c_k|X=x)$$
This is exactly the principle that the naive Bayes method follows; the short derivation below makes the equivalence explicit.
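As a reminder of why the two criteria coincide, here is a brief sketch of the standard derivation under the 0-1 loss $L(Y,f(X))$, minimizing the expected risk pointwise at each $x$:

```latex
% Expected risk under the 0-1 loss:
%   R_exp(f) = E_X \sum_{k=1}^{K} L(c_k, f(X)) P(c_k | X).
% Minimizing it pointwise at each x gives
\begin{align*}
f(x) &= \arg\min_{y \in \mathcal{Y}} \sum_{k=1}^{K} L(c_k, y)\, P(c_k \mid X = x) \\
     &= \arg\min_{y \in \mathcal{Y}} \sum_{k:\, c_k \neq y} P(c_k \mid X = x) \\
     &= \arg\min_{y \in \mathcal{Y}} \bigl(1 - P(Y = y \mid X = x)\bigr) \\
     &= \arg\max_{y \in \mathcal{Y}} P(Y = y \mid X = x)
\end{align*}
```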
2. Parameter estimation of the naive Bayes method
2.1 Maximum likelihood estimation
In naive Bayes, learning means estimating $P(Y=c_k)$ and $P(X^{(j)}=x^{(j)}|Y=c_k)$. Maximum likelihood estimation can be used for both. The maximum likelihood estimate of the prior probability $P(Y=c_k)$ is:
$$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N},\quad k=1,2,...,K$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\{a_{j1},a_{j2},...,a_{jS_j}\}$, $j=1,2,...,n$. The maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl}|Y=c_k)$ is:
$$P(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\ y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)}$$
where $j=1,2,...,n$; $l=1,2,...,S_j$; $k=1,2,...,K$; $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{jl}$ is the $l$-th value that the $j$-th feature can take, and $I$ is the indicator function.
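Both estimates are just frequency counts over the training set. A small sketch under the same assumptions as before (the helper name `estimate_mle` and the (x, y) sample layout are illustrative; the output matches the dictionary format used by `nb_classify` above):

```python
from collections import Counter, defaultdict

def estimate_mle(samples):
    """Maximum likelihood estimates of P(Y=c_k) and P(X^(j)=a_jl | Y=c_k) by counting.

    samples: list of (x, y) pairs, where x is a tuple of discrete feature values.
    Returns (prior, cond) dictionaries as used by nb_classify.
    """
    N = len(samples)
    class_counts = Counter(y for _, y in samples)        # N_k = #{i : y_i = c_k}
    joint_counts = defaultdict(int)                      # (j, a_jl, c_k) -> count
    for x, y in samples:
        for j, x_j in enumerate(x):
            joint_counts[(j, x_j, y)] += 1

    prior = {c_k: n_k / N for c_k, n_k in class_counts.items()}
    cond = {key: cnt / class_counts[key[2]] for key, cnt in joint_counts.items()}
    return prior, cond
```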
2.2 Learning and classification algorithms
Algorithm: the naive Bayes algorithm
Input: training data $T=\{(x_1,y_1),(x_2,y_2),...,(x_N,y_N)\}$, where $x_i=(x_i^{(1)},x_i^{(2)},...,x_i^{(n)})^T$, $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_i^{(j)}\in \{a_{j1},a_{j2},...,a_{jS_j}\}$, $a_{jl}$ is the $l$-th value that the $j$-th feature can take, $j=1,2,...,n$, $l=1,2,...,S_j$, $y_i\in \{c_1,c_2,...,c_K\}$; an instance $x$;
Output: the class of instance $x$.
(1) Compute the prior and conditional probabilities:
$$P(Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)}{N},\quad k=1,2,...,K$$
$$P(X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^{N} I(x_{i}^{(j)}=a_{jl},\ y_i=c_k)}{\sum_{i=1}^{N} I(y_i=c_k)},\quad j=1,2,...,n;\ l=1,2,...,S_j;\ k=1,2,...,K$$
(2) For the given instance $x=(x^{(1)}, x^{(2)}, ..., x^{(n)})^T$, compute
$$P(Y=c_k)\prod_{j=1}^{n} P(X^{(j)}=x^{(j)}|Y=c_k),\quad k=1,2,...,K$$
(3) Determine the class of instance $x$:
$$y=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
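Putting the three steps together, a short usage sketch on a made-up toy data set, reusing the hypothetical `estimate_mle` and `nb_classify` helpers sketched earlier (the data, labels, and query point are all illustrative):

```python
# Hypothetical toy data: each x is a pair (x^(1), x^(2)); labels are +1 / -1.
train = [((1, "S"), -1), ((1, "M"), -1), ((1, "M"), 1), ((1, "S"), 1),
         ((2, "S"), -1), ((2, "M"), -1), ((2, "M"), 1), ((2, "L"), 1),
         ((3, "L"), 1),  ((3, "M"), 1)]

prior, cond = estimate_mle(train)            # step (1): estimate the probabilities
print(nb_classify((2, "S"), prior, cond))    # steps (2)-(3): score each class, take the argmax
```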
2.3 Bayesian estimation
Maximum likelihood estimation may yield probability estimates equal to 0 (when a feature value never co-occurs with a class in the training data). Such zeros distort the computed posterior probabilities and bias the classification. The remedy is Bayesian estimation. Specifically, the Bayesian estimate of the conditional probability is:
$$P_\lambda (X^{(j)}=a_{jl}|Y=c_k)=\frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\ y_i=c_k)+\lambda}{\sum_{i=1}^{N} I(y_i=c_k)+S_j\lambda}$$
where $\lambda \ge 0$. This is equivalent to adding a positive number $\lambda>0$ to the count of each possible value of the random variable. When $\lambda=0$ this reduces to maximum likelihood estimation. A common choice is $\lambda=1$, known as Laplace smoothing. Clearly, for any $l=1,2,...,S_j$ and $k=1,2,...,K$, we have
$$P_\lambda (X^{(j)}=a_{jl}|Y=c_k)>0$$
$$\sum_{l=1}^{S_j}P_\lambda(X^{(j)}=a_{jl}|Y=c_k)=1$$
Similarly, the Bayesian estimate of the prior probability is:
$$P_\lambda (Y=c_k)=\frac{\sum_{i=1}^{N} I(y_i=c_k)+\lambda}{N+K\lambda}$$
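A smoothed variant of the earlier counting sketch, following these two formulas. It assumes that $S_j$ is taken to be the number of distinct values of feature $j$ observed in the training data (in general $S_j$ is part of the problem specification); the function name `estimate_bayes` is again illustrative.

```python
from collections import Counter, defaultdict

def estimate_bayes(samples, lam=1.0):
    """Bayesian (smoothed) probability estimates; lam=1.0 corresponds to Laplace smoothing."""
    N = len(samples)
    class_counts = Counter(y for _, y in samples)
    classes = list(class_counts)
    K = len(classes)
    n = len(samples[0][0])
    # Possible values of each feature, as observed in the training data (defines S_j here).
    values = [sorted({x[j] for x, _ in samples}) for j in range(n)]

    joint_counts = defaultdict(int)                      # (j, a_jl, c_k) -> count
    for x, y in samples:
        for j, x_j in enumerate(x):
            joint_counts[(j, x_j, y)] += 1

    prior = {c: (class_counts[c] + lam) / (N + K * lam) for c in classes}
    cond = {}
    for j in range(n):
        S_j = len(values[j])
        for a in values[j]:
            for c in classes:
                cond[(j, a, c)] = (joint_counts[(j, a, c)] + lam) / (class_counts[c] + S_j * lam)
    return prior, cond
```

With `lam=1.0`, every conditional probability is strictly positive, so the classifier no longer rules out a class just because one feature value was never observed with it.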
3. Summary
- The naive Bayes method is a typical generative learning method. A generative method learns the joint probability distribution $P(X,Y)$ from the training data and then obtains the posterior probability distribution $P(Y|X)$. Concretely, it uses the training data to estimate $P(X|Y)$ and $P(Y)$, which together give the joint probability distribution:
$$P(X,Y)=P(Y)P(X|Y)$$
  The probabilities can be estimated by maximum likelihood estimation or by Bayesian estimation.
- The basic assumption of the naive Bayes method is conditional independence:
$$P(X=x|Y=c_k)=P(X^{(1)}=x^{(1)},...,X^{(n)}=x^{(n)}|Y=c_k)=\prod_{j=1}^{n}P(X^{(j)}=x^{(j)}|Y=c_k)$$
  This makes naive Bayes efficient to learn and apply, though its classification performance is not necessarily high.
- The naive Bayes method uses Bayes' theorem together with the learned joint probability model for classification and prediction:
$$P(Y|X)=\frac{P(X,Y)}{P(X)}=\frac{P(Y)P(X|Y)}{\sum_Y P(Y)P(X|Y)}$$
  The input $x$ is assigned to the class $y$ with the largest posterior probability:
$$y=\arg\max_{c_k} P(Y=c_k)\prod_{j=1}^n P(X^{(j)}=x^{(j)}|Y=c_k)$$
  Maximizing the posterior probability is equivalent to minimizing the expected risk under the 0-1 loss function.
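For practical use, an equivalent categorical naive Bayes model with Laplace smoothing is available off the shelf. A short sketch assuming scikit-learn is installed and that the discrete features have already been encoded as non-negative integer category indices (the toy arrays below are made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical toy data: two discrete features, already encoded as integer category indices.
X = np.array([[0, 0], [0, 1], [1, 1], [1, 2], [2, 2], [2, 0]])
y = np.array([-1, -1, 1, 1, 1, -1])

clf = CategoricalNB(alpha=1.0)   # alpha=1.0 plays the role of lambda=1 (Laplace smoothing)
clf.fit(X, y)
print(clf.predict(np.array([[1, 0]])))
```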