
[Bayesian classification 3] Semi-naive Bayes classifier

2022-06-26 20:37:00 NoBug

1. Review of the naive Bayes classifier

1.1 Categories and features

   According to Bayesian decision theory (or, put differently, the Bayesian classification principle), the first thing we obtain is an expected loss: 【$R(c_i\mid x)=\sum_{j=1}^N\lambda_{ij}P(c_j\mid x)$】. The Bayes criterion is to minimize the overall risk, which reduces to minimizing the conditional risk for each sample: 【$h^*(x)=\arg\min_{c\in Y}R(c\mid x)$】.
   You can regard $c$ as the "category" and $x$ as the "features". In $R(c\mid x)$, a category comes with its own features, e.g. 【good melon: color = dark green, texture = clear, …, density $\geq$ 0.697】, so $c=\{\text{good melon},\ \text{bad melon}\}$ and $x=\{\text{dark green},\ \text{clear},\ \ldots,\ \text{density}\}$. From the features we compute the risk that a category assignment is wrong; the smaller the risk, the more likely that category is the correct one.

1.2 Risk and probability

   Using the misclassification (0/1) loss, we further derive $R(c\mid x)=1-P(c\mid x)$. Here $P(c\mid x)=P(\text{category}\mid\text{features})$ can be understood as the probability that this category holds given these features. For example, for 【good melon: color = dark green, texture = clear, …, density $\geq$ 0.697】 we say: 【color = dark green, texture = clear, …, density $\geq$ 0.697】 is a good melon (with high probability). In other words, we keep adjusting the probability of each category according to the features; this is the posterior probability.
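   For concreteness, here is the one-line derivation that the text skips over, using the standard 0/1 misclassification loss ($\lambda_{ij}=0$ if $i=j$ and $\lambda_{ij}=1$ otherwise):

$R(c_i\mid x)=\sum_{j=1}^{N}\lambda_{ij}P(c_j\mid x)=\sum_{j\neq i}P(c_j\mid x)=1-P(c_i\mid x)$

so minimizing the conditional risk is the same as maximizing the posterior probability.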
   Naturally, we want the probability of being a good melon given these features to be as large as possible, so we have 【$h^*(x)=\arg\max_{c\in Y}P(c\mid x)$: if the dataset has several categories, each sample has one $P(c\mid x)$ per category; we compare these values, pick the largest, and the sample is classified into that category】. To compute $P(c\mid x)$ we have Bayes' rule: 【$P(c\mid x)=\frac{P(c)P(x\mid c)}{P(x)}$】. Here $P(x\mid c)$ is a joint probability over all attributes (in practice we run into combinatorial explosion, missing values, and so on), so the difficulty lies in estimating the class-conditional probability $P(x\mid c)$. Naive Bayes tackles this difficulty of computing the posterior $P(c\mid x)$ by introducing the 【attribute conditional independence assumption】 (attributes are also called features).

1.3 Class conditional probability

   The naive Bayes classifier has the expression 【$h_{nb}(x)=\arg\max_{c\in Y}\ P(c)\prod_{i=1}^{d}P(x_i\mid c)$】. The focus is on the class-conditional probabilities: on an actual dataset we basically estimate them with the "reduced sample space" method, i.e. by counting, rather than by applying conditional-probability formulas from probability theory. Simply put: count the samples, then take proportions.
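   As a minimal sketch of this counting procedure (my own illustration, not code from the original post; Laplace smoothing is omitted for brevity), training is just grouping the samples by class and taking proportions:

```python
from collections import Counter, defaultdict

def nb_train(X, y):
    """Estimate P(c) and P(x_i | c) by counting in the reduced sample space:
    restrict attention to the samples of class c, then take proportions."""
    n, d = len(y), len(X[0])
    priors = {c: cnt / n for c, cnt in Counter(y).items()}
    cond = defaultdict(dict)                      # cond[c][(i, v)] = P(x_i = v | c)
    for c in priors:
        rows = [x for x, label in zip(X, y) if label == c]
        for i in range(d):
            for v, cnt in Counter(x[i] for x in rows).items():
                cond[c][(i, v)] = cnt / len(rows)
    return priors, cond

def nb_predict(x, priors, cond):
    """h_nb(x) = argmax_c  P(c) * prod_i P(x_i | c); unseen values get probability 0."""
    def score(c):
        s = priors[c]
        for i, v in enumerate(x):
            s *= cond[c].get((i, v), 0.0)
        return s
    return max(priors, key=score)
```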

2. Semi-naive Bayes classifier: learning notes

2.1 Introduction

   The naive Bayes classifier relies on the "attribute conditional independence assumption", which rarely holds in real tasks. Therefore, people try to relax this assumption to some extent, and this leads to a class of learning methods called "semi-naive Bayes classifiers" (semi-naïve Bayes classifiers). The basic idea: 【properly take into account the interdependence information among a subset of attributes】.

2.2 Knowledge cards

	1.  Semi-naive Bayes classifier: semi-naïve Bayes classifiers
	2.  Also called: the SNB algorithm

2.3 Semi-naive Bayes classifier

  • Expression of the semi-naive Bayes classifier

$pa_i$ denotes the attribute that $x_i$ depends on, called the parent attribute of $x_i$.
$P(c\mid x)\propto P(c)\prod_{i=1}^{d}P(x_i\mid c,pa_i)$

  • One-dependent classifiers

   The key question is how to determine the parent attribute of each attribute; different choices produce different one-dependent classifiers. A minimal scoring sketch is given below.
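   The sketch below (my own naming; plain frequency estimates, no smoothing) shows that a one-dependent classifier only needs a map from each attribute to its parent; the scoring then mirrors naive Bayes with one extra conditioning variable. For a three-attribute dataset, SPODE with $x_1$ as super-parent corresponds to `parents = [None, 0, 0]`.

```python
def ode_score(query, label, parents, X, y):
    """Unnormalized one-dependent score  P(c) * prod_j P(x_j | c, x_{pa(j)}).

    parents[j] is the index of x_j's parent attribute, or None if x_j has no
    parent (its factor then degenerates to P(x_j | c), as in naive Bayes).
    Plain frequency estimates; real use would add Laplace smoothing.
    """
    rows_c = [x for x, lab in zip(X, y) if lab == label]
    score = len(rows_c) / len(y)                              # P(c)
    for j, pa in enumerate(parents):
        # restrict to the class, and additionally to the parent's value if any
        pool = rows_c if pa is None else [x for x in rows_c if x[pa] == query[pa]]
        if not pool:
            return 0.0
        score *= sum(1 for x in pool if x[j] == query[j]) / len(pool)
    return score
```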

2.4 One-dependent estimators (ODE)

2.4.1 Brief introduction

- " english :"
	One-Dependent Estimator, abbreviation :ODE

- " purpose :"
	 It is a semi naive Bayesian classifier , Due consideration should be given to the interdependent information among some attributes , One of the most common strategies .

- "ODE principle :"
	 Suppose that each attribute depends on at most one other attribute outside the category 

- " Due consideration should be given to the interdependent information among some attributes :"
	 Consider different strategies , The independent dependent classifiers formed are also different . representative :
	1)SPODE(Super-Parent ODE)
	2)TAN(Tree Augmented naive Bayes3)AODE(Average ODE)

2.4.2 SPODE (Super-Parent One-Dependent Estimator)

  • Principle

   All attributes depend on one and the same attribute, which is called the super-parent (super-parent). The super-parent is then identified through cross-validation or other model selection methods (assume in turn that each attribute is the super-parent, and keep the one that works best).

  • Figure 1

(figure not reproduced here)

  • Example

【Dataset】

| $x_1$ | $x_2$ | $x_3$ | $y$ |
| --- | --- | --- | --- |
| 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 0 |
| 0 | 0 | 0 | 0 |

【Prediction】

| $x_1$ | $x_2$ | $x_3$ | $y$ |
| --- | --- | --- | --- |
| 1 | 1 | 0 | ? |

【SPODE formula】

$P(c\mid x)\propto P(c)\prod_{j=1}^{d}P(x_j\mid c,pa_j)$, and the predicted class is the $c\in Y$ that maximizes this score. With super-parent $x_{sp}$ (here $x_1$), every parent $pa_j=x_{sp}$, and the super-parent's own factor reduces to $P(x_{sp}\mid c)$.

【Prior probability $P(c)$】
$P(y=1)=\frac{4}{10}=0.4$

$P(y=0)=\frac{6}{10}=0.6$

【Class-conditional probability $P(x_i\mid c,pa_i)$ (assume first that $x_1$ is the super-parent, i.e. $pa_i=x_1$)】
$P(x_1=1\mid y=1)=\frac{3}{4}$

$P(x_2=1\mid y=1,x_1=1)=\frac{2}{3}$

$P(x_3=0\mid y=1,x_1=1)=\frac{1}{3}$

$P(x_1=1\mid y=0)=\frac{2}{6}=\frac{1}{3}$

$P(x_2=1\mid y=0,x_1=1)=\frac{1}{2}$

$P(x_3=0\mid y=0,x_1=1)=\frac{1}{2}$

【Semi-naive Bayes (SPODE) prediction】
$P(y=1\mid x)\propto 0.4\times 0.75\times\frac{2}{3}\times\frac{1}{3}\approx 0.067$

$P(y=0\mid x)\propto 0.6\times\frac{1}{3}\times\frac{1}{2}\times\frac{1}{2}=0.050$

Since $0.067>0.050$, the predicted class of the sample is $y=1$.
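   To double-check the arithmetic, here is a small self-contained script (my own sketch; plain counting with no smoothing, exactly as in the hand calculation) that reproduces the two scores on the toy dataset:

```python
# Toy dataset from the table above: rows are (x1, x2, x3); y is the label column.
X = [(1, 1, 1), (1, 0, 0), (1, 1, 1), (1, 0, 0), (1, 1, 1),
     (0, 0, 0), (0, 1, 1), (0, 1, 0), (0, 1, 1), (0, 0, 0)]
y = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]

def spode_score(query, label, sp, X, y):
    """P(c) * P(x_sp | c) * prod_{j != sp} P(x_j | c, x_sp), by plain counting."""
    rows_c = [x for x, lab in zip(X, y) if lab == label]          # samples of class c
    rows_c_sp = [x for x in rows_c if x[sp] == query[sp]]         # ... with x_sp = query value
    score = len(rows_c) / len(y)                                  # P(c)
    score *= len(rows_c_sp) / len(rows_c)                         # P(x_sp | c)
    for j in range(len(query)):
        if j != sp:
            score *= sum(1 for x in rows_c_sp if x[j] == query[j]) / len(rows_c_sp)
    return score

query = (1, 1, 0)
scores = {c: spode_score(query, c, sp=0, X=X, y=y) for c in (0, 1)}
print(scores)                          # {0: 0.05, 1: 0.0666...}
print(max(scores, key=scores.get))     # predicted class: 1
```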

2.4.3 AODE (Averaged One-Dependent Estimator)

  • Principle

  SPODE selects a single model to make predictions, whereas AODE is a one-dependent classifier built on an ensemble learning mechanism: each model makes a prediction, and the results are then averaged (summed) to obtain the final prediction. (AODE tries to build a SPODE with each attribute as the super-parent, and then combines those SPODEs that have enough training data to support them into the final result.)

  • Understanding the formula

$P(c\mid x)\propto \frac{1}{d}\sum_{i=1}^{d}P(c,x_i)\prod_{j=1}^{d}P(x_j\mid c,x_i)$

  • Threshold $m'$

   For example, in the example above, $P(x_2=1\mid y=0,x_1=1)=\frac{1}{2}$ is estimated from only two samples; in general the denominator may be very small (even 1), so too few samples support the estimate. Therefore, AODE adds a threshold: a SPODE is used only when the number of samples whose super-parent attribute takes the given value is at least $m'$.

  • AODE formula

$P(c\mid x)\propto \sum_{i=1}^{d}P(c,x_i)\prod_{j=1}^{d}P(x_j\mid c,x_i),\quad |D_{x_i}|\geq m'$

  • Parameter estimation

Estimate of $P(c,x_i)$ (with Laplace correction; $N$ is the number of classes, $N_i$ is the number of possible values of the $i$-th attribute, and $D_{c,x_i}$ is the set of training samples of class $c$ whose $i$-th attribute takes value $x_i$):

$\hat P(c,x_i)=\frac{|D_{c,x_i}|+1}{|D|+N\times N_i}$

Estimate of $P(x_j\mid c,x_i)$ (where $D_{c,x_i,x_j}$ additionally requires the $j$-th attribute to take value $x_j$):

$\hat P(x_j\mid c,x_i)=\frac{|D_{c,x_i,x_j}|+1}{|D_{c,x_i}|+N_j}$
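   Putting the two estimates and the threshold together, a minimal AODE scoring sketch might look as follows (my own code, not from the original post; it implements the displayed formula literally, summing over every usable super-parent):

```python
def aode_score(query, label, X, y, m_prime=1):
    """Unnormalized AODE score: sum over super-parents i with |D_{x_i}| >= m' of
    P(c, x_i) * prod_j P(x_j | c, x_i), using the Laplace-corrected estimates."""
    n, d = len(X), len(X[0])
    n_classes = len(set(y))
    n_values = [len({x[i] for x in X}) for i in range(d)]      # N_i for each attribute
    total = 0.0
    for i in range(d):
        D_xi = [k for k in range(n) if X[k][i] == query[i]]
        if len(D_xi) < m_prime:                                # skip weakly supported SPODEs
            continue
        D_c_xi = [k for k in D_xi if y[k] == label]
        term = (len(D_c_xi) + 1) / (n + n_classes * n_values[i])       # P(c, x_i)
        for j in range(d):
            match = sum(1 for k in D_c_xi if X[k][j] == query[j])
            term *= (match + 1) / (len(D_c_xi) + n_values[j])          # P(x_j | c, x_i)
        total += term
    return total
```

With the toy dataset and query from the SPODE example, comparing `aode_score(query, 1, X, y)` and `aode_score(query, 0, X, y)` gives the AODE prediction in the same way.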

2.4.4 TAN (Tree Augmented Naive Bayes)

  • Introduction

   Whether in SPODE or in AODE, although the super-parent varies across models, within each single model every attribute depends on the same super-parent attribute. In reality, it is unlikely that all attributes depend on one common attribute; rather, each attribute may have a different dependency. TAN addresses this by finding, for each attribute, the other attribute it most strongly depends on.

  • Principle

   TAN builds the dependencies on the basis of a maximum weighted spanning tree algorithm. It computes the conditional mutual information between every pair of attributes and, when choosing the attribute that each attribute depends on, picks the one with the maximum conditional mutual information (the value of the conditional mutual information reflects how strongly two attributes depend on each other given the class).

  • Algorithm

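   The algorithm figure is not reproduced here, so the sketch below shows the structure-learning step as I understand it: estimate the conditional mutual information $I(x_i;x_j\mid y)$ for every attribute pair (standard definition, weighted by the joint $P(x_i,x_j,y)$), build a maximum weighted spanning tree with a Prim-style greedy pass, and orient the edges away from an arbitrarily chosen root attribute. Function names and the choice of root are mine.

```python
import math
from collections import Counter

def cond_mutual_info(X, y, a, b):
    """I(x_a; x_b | y), estimated from counts."""
    n = len(y)
    c_y = Counter(y)
    c_ay = Counter((x[a], lab) for x, lab in zip(X, y))
    c_by = Counter((x[b], lab) for x, lab in zip(X, y))
    c_aby = Counter((x[a], x[b], lab) for x, lab in zip(X, y))
    mi = 0.0
    for (va, vb, lab), cnt in c_aby.items():
        p_joint = cnt / n                                  # P(x_a, x_b, y)
        p_ab_y = cnt / c_y[lab]                            # P(x_a, x_b | y)
        p_a_y = c_ay[(va, lab)] / c_y[lab]                 # P(x_a | y)
        p_b_y = c_by[(vb, lab)] / c_y[lab]                 # P(x_b | y)
        mi += p_joint * math.log(p_ab_y / (p_a_y * p_b_y))
    return mi

def tan_parents(X, y, root=0):
    """Return parents[j] = parent attribute of x_j in the TAN tree (None for root)."""
    d = len(X[0])
    weight = {(i, j): cond_mutual_info(X, y, i, j)
              for i in range(d) for j in range(i + 1, d)}
    w = lambda i, j: weight[(min(i, j), max(i, j))]
    parents, in_tree = [None] * d, {root}
    while len(in_tree) < d:                                # Prim-style max spanning tree
        i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                   key=lambda e: w(*e))
        parents[j] = i                                     # orient the edge away from root
        in_tree.add(j)
    return parents
```

The returned `parents` list is exactly the kind of parent map that the `ode_score` sketch in Section 2.3 expects, so the learned tree can be plugged straight into that scoring function.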

3. Extensions of the semi-naive Bayes classifier

3.1 kDE (k-dependent estimator)

   Since relaxing the attribute conditional independence assumption into a one-dependence assumption can improve generalization, can we consider higher-order dependencies among attributes (dependence on several attributes) to improve generalization further? This amounts to extending ODE into kDE. It is worth considering only if the training data are very plentiful: as $k$ grows, the number of conditional probabilities to estimate explodes, and with limited data the estimates become unreliable.

Copyright notice: this article was created by NoBug; when reprinting, please include a link to the original: https://yzsam.com/2022/177/202206262021251503.html