Naive Bayes classifier
1、 Classification concept
Classification is the task of finding a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class labels of objects whose labels are unknown.
Classification generally proceeds in two stages:
Learning stage:
- Build a classifier that describes a predefined set of data classes or concepts.
- The training set supplies the class label of each training tuple, which is why the learning step of classification is also called supervised learning.
Classification stage: the trained classifier is used to assign class labels to new data.
Classification and prediction are different concepts: classification predicts categorical (discrete, unordered) labels, whereas numerical prediction builds a continuous-valued function model. Classification and clustering are also different concepts: classification is supervised learning and is given the class labels of the training tuples, while clustering is unsupervised learning and does not rely on training instances with class labels.
2、 Naive Bayesian classification
2.1 Bayes' theorem
Bayes' theorem states:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

where D is the observed data and h is a candidate hypothesis: P(D|h) is the likelihood of h, P(h) is the prior probability of h, P(h|D) is the posterior probability of h, and P(D) is the prior probability of D.
Let's start with an example: in a school, 60% of the students are boys (boy) and 40% are girls (girl). Boys always wear trousers (pants); half of the girls wear trousers and half wear skirts. A trouser-wearing student is picked at random. What is the probability that this student is a girl?
The description above can be formalized as:
Given P(Boy)=60%, P(Girl)=40%, P(Pants|Girl)=50%, P(Pants|Boy)=100%, find P(Girl|Pants).
Answer:
$$P(Girl \mid Pants) = \frac{P(Girl)\,P(Pants \mid Girl)}{P(Boy)\,P(Pants \mid Boy) + P(Girl)\,P(Pants \mid Girl)} = \frac{P(Girl)\,P(Pants \mid Girl)}{P(Pants)}$$
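Plugging in the numbers gives the concrete answer:

$$P(Girl \mid Pants) = \frac{0.4 \times 0.5}{0.6 \times 1.0 + 0.4 \times 0.5} = \frac{0.2}{0.8} = 0.25$$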
Intuitive understanding: first work out what fraction of the school wears trousers, then work out what fraction of those people are girls.
For this problem the observable knowledge is: 60% of the students are boys and 40% are girls; boys always wear trousers, and girls wear trousers half the time. What we cannot observe directly is whether a randomly chosen trouser-wearing student is a boy or a girl.
For what cannot be observed directly, we make hypotheses, and for an uncertain event there are usually several competing hypotheses.

Bayes' theorem provides a way to compute the posterior probability P(h|D) of a hypothesis: the posterior probability is proportional to the product of the prior probability and the likelihood.
2.2 Maximum a posteriori hypothesis
A maximum a posteriori (MAP) learner searches the candidate hypothesis set H for the hypothesis h that is most probable given the data D; this h is called the maximum a posteriori hypothesis. To determine the MAP hypothesis, Bayes' formula is used to compute the posterior probability of each candidate hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

In the last step P(D) is dropped because it is a constant that does not depend on h; equivalently, the prior probability of the data is the same for every candidate hypothesis.
2.3 Joint probability of multidimensional attributes
Suppose the object D to be classified is a vector of attribute values (a_1, a_2, …, a_n). Combining this with the maximum a posteriori hypothesis above, our goal can be written as:

$$h_{MAP} = \arg\max_{h \in H} P(h \mid \langle a_1, a_2, \dots, a_n \rangle) = \arg\max_{h \in H} P(\langle a_1, a_2, \dots, a_n \rangle \mid h)\,P(h)$$

But this raises a problem: when estimating P(⟨a_1, a_2, …, a_n⟩ | h), the available data become sparse as the dimensionality grows, and a reliable estimate is hard to obtain.
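A rough parameter count, added here as a back-of-the-envelope illustration, shows how quickly the joint estimate becomes infeasible: with n binary attributes, the joint distribution needs about $2^n - 1$ probability estimates per class, while the factored form introduced in the next section needs only n.

$$n = 30: \quad 2^{30} - 1 \approx 1.07 \times 10^9 \ \text{joint parameters per class, versus only } 30 \text{ under independence}$$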
2.4 Independence assumption
The data-sparsity problem just described can be addressed by the independence assumption: assume that the attributes a_i of D are conditionally independent of one another given the class. Then the formula above becomes:

$$P(\langle a_1, a_2, \dots, a_n \rangle \mid h) = \prod_i P(a_i \mid h)$$

$$\begin{aligned} h_{MAP} &= \arg\max_{h \in H} P(h \mid \langle a_1, a_2, \dots, a_n \rangle)\\ &= \arg\max_{h \in H} P(\langle a_1, a_2, \dots, a_n \rangle \mid h)\,P(h)\\ &= \arg\max_{h \in H} \prod_i P(a_i \mid h)\,P(h) \end{aligned}$$

Under the independence assumption, estimating P(a_i|h) is far easier than estimating P(⟨a_1, a_2, …, a_n⟩|h). If the attributes of D are not in fact mutually independent, the result of naive Bayesian classification is an approximation of Bayesian classification.
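To make the formula concrete, here is a minimal sketch of a discrete naive Bayes classifier built directly from counts. This is an illustration, not code from the original article; the helper names train_nb and predict are made up:

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    """Estimate the prior P(h) and the conditionals P(a_i | h) from counts."""
    class_counts = Counter(labels)
    cond_counts = defaultdict(int)  # (class, attribute index, value) -> count
    for attrs, h in zip(records, labels):
        for i, a in enumerate(attrs):
            cond_counts[(h, i, a)] += 1
    prior = {h: c / len(labels) for h, c in class_counts.items()}
    def likelihood(h, i, a):
        return cond_counts[(h, i, a)] / class_counts[h]
    return prior, likelihood

def predict(prior, likelihood, attrs):
    """Return h_MAP = argmax_h P(h) * prod_i P(a_i | h)."""
    best_h, best_score = None, -1.0
    for h, p in prior.items():
        score = p
        for i, a in enumerate(attrs):
            score *= likelihood(h, i, a)
        if score > best_score:
            best_h, best_score = h, score
    return best_h

# The trousers example from section 2.1 as a one-attribute dataset:
records = [("pants",)] * 6 + [("pants",), ("pants",), ("skirt",), ("skirt",)]
labels = ["boy"] * 6 + ["girl"] * 4
prior, likelihood = train_nb(records, labels)
print(predict(prior, likelihood, ("pants",)))  # "boy": 0.6*1.0 beats 0.4*0.5
```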
3、 A Bayesian classification example
The following training set records statistics on computer purchases. Its attributes are age, income, hobby (whether the customer is a game lover), credit rating, and whether a computer was bought.

| id | age | income | hobby | credit | buys |
|---|---|---|---|---|---|
| 1 | youth | high | no | fair | no |
| 2 | youth | high | no | excellent | no |
| 3 | middle-aged | high | no | fair | yes |
| 4 | senior | middle | no | fair | yes |
| 5 | senior | low | yes | fair | yes |
| 6 | senior | low | yes | excellent | no |
| 7 | middle-aged | low | yes | excellent | yes |
| 8 | youth | middle | no | fair | no |
| 9 | youth | low | yes | fair | yes |
| 10 | senior | middle | yes | fair | yes |
| 11 | youth | middle | yes | excellent | yes |
| 12 | middle-aged | middle | no | excellent | yes |
| 13 | middle-aged | high | yes | fair | yes |
| 14 | senior | middle | no | excellent | no |
Test case: will a young, middle-income game lover with a fair credit rating buy a computer? That is, X = ⟨youth, middle income, hobby = yes, credit = fair⟩.
From the table above we can extract the subset of training tuples in which a computer was bought, and use it together with the rest of the data to judge whether this customer will buy one.
| id | age | income | hobby | credit | buys |
|---|---|---|---|---|---|
| 3 | middle-aged | high | no | fair | yes |
| 4 | senior | middle | no | fair | yes |
| 5 | senior | low | yes | fair | yes |
| 7 | middle-aged | low | yes | excellent | yes |
| 9 | youth | low | yes | fair | yes |
| 10 | senior | middle | yes | fair | yes |
| 11 | youth | middle | yes | excellent | yes |
| 12 | middle-aged | middle | no | excellent | yes |
| 13 | middle-aged | high | yes | fair | yes |
First, compute the probability of each test-case attribute value among the customers who bought a computer:

$$\begin{aligned} P(\text{youth} \mid \text{buy}) &= 2/9 = 0.222\\ P(\text{middle income} \mid \text{buy}) &= 4/9 = 0.444\\ P(\text{hobby = yes} \mid \text{buy}) &= 6/9 = 0.667\\ P(\text{credit = fair} \mid \text{buy}) &= 6/9 = 0.667 \end{aligned}$$
Then, by the independence assumption, the likelihood of the test case X given a class C_i is the product of the per-attribute conditional probabilities:

$$P(X \mid C_i) = \prod_k P(x_k \mid C_i)$$

$$P(X \mid \text{buy}) = 0.222 \times 0.444 \times 0.667 \times 0.667 \approx 0.044$$
Similarly, we can extract the training tuples in which no computer was bought.
| id | age | income | hobby | credit | buys |
|---|---|---|---|---|---|
| 1 | youth | high | no | fair | no |
| 2 | youth | high | no | excellent | no |
| 6 | senior | low | yes | excellent | no |
| 8 | youth | middle | no | fair | no |
| 14 | senior | middle | no | excellent | no |
The probabilities of the test-case attribute values among the customers who did not buy a computer are:

$$\begin{aligned} P(\text{youth} \mid \text{not buy}) &= 3/5 = 0.6\\ P(\text{middle income} \mid \text{not buy}) &= 2/5 = 0.4\\ P(\text{hobby = yes} \mid \text{not buy}) &= 1/5 = 0.2\\ P(\text{credit = fair} \mid \text{not buy}) &= 2/5 = 0.4 \end{aligned}$$
Again, the same formula gives the likelihood of the test case given "not buy":

$$P(X \mid \text{not buy}) = 0.6 \times 0.4 \times 0.2 \times 0.4 \approx 0.019$$
Finally, applying P(X|C_i)P(C_i):

$$\begin{aligned} P(C_{\text{buy}}) &= 9/14 = 0.643\\ P(C_{\text{not buy}}) &= 5/14 = 0.357\\ P(X \mid \text{buy})\,P(C_{\text{buy}}) &= 0.044 \times 0.643 = 0.028\\ P(X \mid \text{not buy})\,P(C_{\text{not buy}}) &= 0.019 \times 0.357 = 0.007 \end{aligned}$$

Since 0.028 > 0.007, the naive Bayes classifier predicts that this customer will buy a computer.
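The hand calculation above can be checked with a short script. This is a sketch added for verification, not code from the original article:

```python
# The 14 training tuples from the table: (age, income, hobby, credit, buys)
data = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"),
    ("senior", "middle", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"),
    ("youth", "middle", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "middle", "yes", "fair", "yes"),
    ("youth", "middle", "yes", "excellent", "yes"),
    ("middle-aged", "middle", "no", "excellent", "yes"),
    ("middle-aged", "high", "yes", "fair", "yes"),
    ("senior", "middle", "no", "excellent", "no"),
]
x = ("youth", "middle", "yes", "fair")  # the test case

for c in ("yes", "no"):
    rows = [r for r in data if r[4] == c]
    prior = len(rows) / len(data)
    likelihood = 1.0
    for i, v in enumerate(x):
        likelihood *= sum(r[i] == v for r in rows) / len(rows)
    print(c, round(likelihood * prior, 3))  # yes: 0.028, no: 0.007
```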
4、 Computing probabilities for continuous data
The following table records whether a computer was bought at different income levels. Can these data be used to predict whether a middle-aged customer with an income of 121, no game hobby, and good credit will buy a computer?
| id | income | buys |
|---|---|---|
| 1 | 125 | no |
| 2 | 100 | no |
| 3 | 70 | no |
| 4 | 120 | no |
| 5 | 95 | yes |
| 6 | 60 | no |
| 7 | 220 | no |
| 8 | 85 | yes |
| 9 | 75 | no |
| 10 | 90 | yes |
Income here is represented by continuous data, so the probability-estimation method used above for discrete data no longer applies. For continuous data, we assume the incomes of each class follow their own normal distribution, estimate the mean and variance of the two class-conditional distributions from the samples, and then evaluate the density at income 121:

$$P(x \mid C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\!\left(-\frac{(x - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right)$$

where μ_{C_i} and σ_{C_i} are the sample mean and standard deviation of the incomes in class C_i.
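A minimal sketch of this computation on the table above, added here for illustration (the original does not show the numbers, and it does not say whether sample or population variance is used; sample variance is assumed):

```python
import math

# Incomes from the table, grouped by class (buys = yes / no)
yes_incomes = [95, 85, 90]
no_incomes = [125, 100, 70, 120, 60, 220, 75]

def gaussian_density(x, values):
    """Density of x under a normal distribution fitted to the sample values."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)  # sample variance
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

for label, vals in [("yes", yes_incomes), ("no", no_incomes)]:
    # likelihood of income 121 times the class prior
    score = gaussian_density(121, vals) * len(vals) / 10
    print(label, score)

# Income 121 lies far above the "yes" class (mean 90, std 5), so the
# posterior overwhelmingly favors "no buy".
```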
5、 Characteristics of the naive Bayesian classifier
- Attributes may be discrete or continuous
- Solid mathematical foundation and stable classification efficiency
- Relatively insensitive to missing and noisy data
- Classification performance is very good when the attributes are conditionally independent
6、 Iris classification with the Bayesian algorithm
6.1 Introduction to the iris dataset

Iris (Latin name: Iris L.) is a genus of perennial herbs of the family Iridaceae among the monocotyledons, with large, beautiful, highly ornamental flowers. There are about 300 species of iris. The Iris dataset contains three of them: Iris setosa, Iris versicolour, and Iris virginica, with 50 samples of each, 150 samples in total. Each sample has four attributes: sepal length, sepal width, petal length, and petal width, which can be used to predict which of the three species an iris flower belongs to.
(Figure omitted: a few sample rows of the dataset.)
6.2 Classification code
import numpy as np
import pandas as pd
# Gaussian naive Bayes classifier from scikit-learn
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Load the dataset (assumes an Iris.csv whose first column is an id,
# columns 2-5 are the four attributes, and column 6 is the species)
data_url = "Iris.csv"
df = pd.read_csv(data_url)
X = df.iloc[:, 1:5]
y = df.iloc[:, 5]

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the Gaussian naive Bayes model
clf = GaussianNB()
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)
acc = np.sum(y_test == y_pred) / X_test.shape[0]
print("Test Acc:%.3f" % acc)
