
Naive Bayes classifier

2022-06-09 07:10:00 Don't wait for brother shy to develop

1、 Classification concept

   Classification finds a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class label of objects whose label is unknown.

   Classification generally proceeds in two stages:

  • Learning stage:

    • Build a classifier that describes a predefined set of data classes or concepts.
    • The training set provides the class label of each training tuple, so the learning process of classification is also called supervised learning.
  • Classification stage: the process of using the built classifier to classify new data.

   Classification and numerical prediction are different concepts: classification predicts categorical (discrete, unordered) labels, while numerical prediction builds a continuous-valued function model. Classification and clustering are also different: classification is supervised learning and is given the class labels of the training tuples, whereas clustering is unsupervised learning and does not rely on training instances with class labels.

2、 Naive Bayesian classification

2.1 Bayes' theorem

   Bayes' theorem is:

$$P(h\mid D)=\frac{P(D\mid h)\,P(h)}{P(D)}$$

   Here D is the observed data to be classified and h is a candidate hypothesis (class): $P(D\mid h)$ is the likelihood of h, $P(h)$ is the prior probability of h, $P(h\mid D)$ is the posterior probability of h, and $P(D)$ is the prior probability of D.

   Let's start with an example: in a school, 60% of the students are boys and 40% are girls. Boys always wear trousers (pants); half of the girls wear trousers and half wear skirts. If a student wearing trousers is selected at random, what is the probability that the student is a girl?

   The description above can be formalized as:

   Known: P(Boy) = 60%, P(Girl) = 40%, P(Pants|Girl) = 50%, P(Pants|Boy) = 100%. Find: P(Girl|Pants).

   Answer:

$$P(Girl\mid Pants)=\frac{P(Girl)\,P(Pants\mid Girl)}{P(Boy)\,P(Pants\mid Boy)+P(Girl)\,P(Pants\mid Girl)}=\frac{P(Girl)\,P(Pants\mid Girl)}{P(Pants)}=\frac{0.4\times 0.5}{0.6\times 1+0.4\times 0.5}=0.25$$

Intuitive understanding: first figure out how many students in the school wear trousers, then figure out how many of them are girls.
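As a quick sanity check, the arithmetic above can be reproduced in a few lines of Python (a minimal sketch of our own; the variable names are not from the original post):

# Trousers example: P(Girl | Pants) via Bayes' theorem.
p_boy, p_girl = 0.6, 0.4
p_pants_given_boy, p_pants_given_girl = 1.0, 0.5

# Total probability of observing trousers.
p_pants = p_boy * p_pants_given_boy + p_girl * p_pants_given_girl

# Posterior probability that a trouser-wearing student is a girl.
p_girl_given_pants = p_girl * p_pants_given_girl / p_pants
print(p_girl_given_pants)  # 0.25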

   The problem above gives us this observable knowledge: in the school, 60% of students are boys and 40% are girls; boys always wear trousers, while half of the girls wear trousers and half wear skirts. What we cannot observe directly is whether a randomly selected trouser-wearing student is a boy or a girl.

   For what cannot be observed directly, we make assumptions (hypotheses), and for an uncertain quantity there are usually several competing hypotheses.


   Bayes' theorem provides a way to compute the posterior probability $P(h\mid D)$ of a hypothesis: the posterior probability is proportional to the product of the prior probability and the likelihood.

2.2 Maximum a posteriori hypothesis

   The maximum a posteriori learner searches the candidate hypothesis set H for the hypothesis h that is most probable given the data D; this h is called the maximum a posteriori hypothesis (Maximum a posteriori: MAP). To determine the MAP hypothesis, Bayes' theorem is used to compute the posterior probability of each candidate hypothesis, as follows:

$$h_{MAP}=\arg\max_{h\in H}P(h\mid D)=\arg\max_{h\in H}\frac{P(D\mid h)\,P(h)}{P(D)}=\arg\max_{h\in H}P(D\mid h)\,P(h)$$

In the last step $P(D)$ is dropped, because it is a constant that does not depend on h (equivalently, the prior probability of the data is the same for every hypothesis).

2.3 Joint probability of multidimensional attributes

   Known: the object D is a vector of several attributes, $D=\langle a_1,a_2,\dots,a_n\rangle$. Combining this with the maximum a posteriori hypothesis above, our goal can be written as:

$$h_{MAP}=\arg\max_{h\in H}P(h\mid\langle a_1,a_2,\dots,a_n\rangle)=\arg\max_{h\in H}\frac{P(\langle a_1,a_2,\dots,a_n\rangle\mid h)\,P(h)}{P(\langle a_1,a_2,\dots,a_n\rangle)}=\arg\max_{h\in H}P(\langle a_1,a_2,\dots,a_n\rangle\mid h)\,P(h)$$

   But this raises a problem: when estimating $P(\langle a_1,a_2,\dots,a_n\rangle\mid h)$, the available data become sparse as the dimensionality grows, and reliable estimates are hard to obtain. For example, with n binary attributes there are $2^n$ possible attribute combinations, most of which never occur in the training set.

2.4 Independence assumption

   The data-sparsity problem just mentioned can be addressed with an independence assumption: assume the attributes $a_i$ of D are conditionally independent of one another given the hypothesis. The formula above can then be written as:

$$P(\langle a_1,a_2,\dots,a_n\rangle\mid h)=\prod_i P(a_i\mid h)$$

$$h_{MAP}=\arg\max_{h\in H}P(h\mid\langle a_1,a_2,\dots,a_n\rangle)=\arg\max_{h\in H}P(\langle a_1,a_2,\dots,a_n\rangle\mid h)\,P(h)=\arg\max_{h\in H}\prod_i P(a_i\mid h)\,P(h)$$

   With the independence assumption, estimating $P(a_i\mid h)$ is much easier than estimating $P(\langle a_1,a_2,\dots,a_n\rangle\mid h)$. If the attributes of D are not in fact conditionally independent, the result of naive Bayes classification is an approximation of Bayes classification.
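To make the decision rule concrete, here is a minimal sketch of choosing the MAP hypothesis as $\arg\max_h P(h)\prod_i P(a_i\mid h)$; the priors and conditional probabilities below are made-up illustrative numbers, not estimates from any table in this post:

from math import prod

# Made-up priors P(h) and per-attribute conditional probabilities P(a_i | h).
priors = {"h1": 0.6, "h2": 0.4}
likelihoods = {
    "h1": {"a1": 0.2, "a2": 0.7},
    "h2": {"a1": 0.5, "a2": 0.3},
}

def h_map(attrs, priors, likelihoods):
    # Score each hypothesis by P(h) * prod_i P(a_i | h) and return the largest.
    scores = {h: priors[h] * prod(likelihoods[h][a] for a in attrs) for h in priors}
    return max(scores, key=scores.get)

print(h_map(["a1", "a2"], priors, likelihoods))  # "h1": 0.6*0.2*0.7 = 0.084 > 0.4*0.5*0.3 = 0.060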

3、 A Bayesian classification example

   The following training set records statistics about computer purchases. Its attributes are age, income, game hobby, credit rating, and whether a computer was bought.

id   Age           Income   Game hobby   Credit      Buys computer
1    young         high     no           fair        no
2    young         high     no           excellent   no
3    middle-aged   high     no           fair        yes
4    senior        medium   no           fair        yes
5    senior        low      yes          fair        yes
6    senior        low      yes          excellent   no
7    middle-aged   low      yes          excellent   yes
8    young         medium   no           fair        no
9    young         low      yes          fair        yes
10   senior        medium   yes          fair        yes
11   young         medium   yes          excellent   yes
12   middle-aged   medium   no           excellent   yes
13   middle-aged   high     yes          fair        yes
14   senior        medium   no           excellent   no

   Test case X: will a young, medium-income game lover with a fair credit rating buy a computer?

   From the training table above we can extract the tuples of the customers who bought a computer, shown below. They are used to judge whether this young, medium-income, game-loving customer with a fair credit rating will buy a computer.

id   Age           Income   Game hobby   Credit      Buys computer
3    middle-aged   high     no           fair        yes
4    senior        medium   no           fair        yes
5    senior        low      yes          fair        yes
7    middle-aged   low      yes          excellent   yes
9    young         low      yes          fair        yes
10   senior        medium   yes          fair        yes
11   young         medium   yes          excellent   yes
12   middle-aged   medium   no           excellent   yes
13   middle-aged   high     yes          fair        yes

   First, compute the conditional probability of each attribute value of the test case among the customers who bought a computer:

P(young | buy) = 2/9 = 0.222
P(medium income | buy) = 4/9 = 0.444
P(game lover | buy) = 6/9 = 0.667
P(fair credit | buy) = 6/9 = 0.667
   Then, according to the following formula, compute the likelihood of the test case X given that a computer is bought:

$$P(X\mid C_i)=\prod_{k=1}^{n}P(x_k\mid C_i)$$
P(X | buy) = 0.222 × 0.444 × 0.667 × 0.667 ≈ 0.044
   Similarly, we can extract the training tuples of the customers who did not buy a computer:

id   Age           Income   Game hobby   Credit      Buys computer
1    young         high     no           fair        no
2    young         high     no           excellent   no
6    senior        low      yes          excellent   no
8    young         medium   no           fair        no
14   senior        medium   no           excellent   no

   The conditional probabilities of the test-case attribute values among the customers who did not buy a computer are:

P(young | not buy) = 3/5 = 0.6
P(medium income | not buy) = 2/5 = 0.4
P(game lover | not buy) = 1/5 = 0.2
P(fair credit | not buy) = 2/5 = 0.4
   Similarly, use the formula above to compute the likelihood of the test case given that no computer is bought:

P(X | not buy) = 0.6 × 0.4 × 0.2 × 0.4 ≈ 0.019
   Finally, applying $P(X\mid C_i)P(C_i)$ to both classes gives:

P(C_buy) = 9/14 = 0.643
P(C_not buy) = 5/14 = 0.357
P(X | buy) P(C_buy) = 0.044 × 0.643 = 0.028
P(X | not buy) P(C_not buy) = 0.019 × 0.357 = 0.007

Since 0.028 > 0.007, the naive Bayes classifier predicts that this customer will buy a computer.
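The whole calculation can also be reproduced programmatically. The sketch below is our own code: it re-enters the training table as a pandas DataFrame, counts the conditional frequencies, and compares the two unnormalized posteriors:

import pandas as pd

# Training table from the example above (age, income, game hobby, credit, buy).
rows = [
    ("young", "high", "no", "fair", "no"), ("young", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"), ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"), ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"), ("young", "medium", "no", "fair", "no"),
    ("young", "low", "yes", "fair", "yes"), ("senior", "medium", "yes", "fair", "yes"),
    ("young", "medium", "yes", "excellent", "yes"), ("middle-aged", "medium", "no", "excellent", "yes"),
    ("middle-aged", "high", "yes", "fair", "yes"), ("senior", "medium", "no", "excellent", "no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "games", "credit", "buy"])

# Test case: young, medium income, game lover, fair credit.
x = {"age": "young", "income": "medium", "games": "yes", "credit": "fair"}

for label in ["yes", "no"]:
    sub = df[df["buy"] == label]
    prior = len(sub) / len(df)                   # P(C_i)
    likelihood = 1.0
    for attr, val in x.items():                  # P(X | C_i) = prod_k P(x_k | C_i)
        likelihood *= (sub[attr] == val).mean()
    print(label, round(prior, 3), round(likelihood, 3), round(prior * likelihood, 3))
# Prints roughly 0.643, 0.044, 0.028 for "yes" and 0.357, 0.019, 0.007 for "no".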

4、 How to compute probabilities for continuous data

   The following table records whether a computer was bought at various income levels. Can these data be used to predict whether a middle-aged customer with an income of 121, no game hobby, and a fair credit rating will buy a computer?

id   Income   Buys computer
1    125      no
2    100      no
3    70       no
4    120      no
5    95       yes
6    60       no
7    220      no
8    85       yes
9    75       no
10   90       yes

   Income here is a continuous attribute, so the probability-estimation method used above for discrete data does not apply. For continuous data we assume that the incomes of each class follow their own normal distribution, estimate the mean and variance of the two distributions from the data, and then evaluate the density at an income of 121, for example to obtain the probability of not buying a computer, as shown below:

For a continuous attribute $x_k$ and class $C_i$, the class-conditional probability is modeled with a normal density:

$$P(x_k\mid C_i)=\frac{1}{\sqrt{2\pi}\,\sigma_{C_i}}\exp\!\left(-\frac{(x_k-\mu_{C_i})^2}{2\sigma_{C_i}^2}\right)$$

where $\mu_{C_i}$ and $\sigma_{C_i}$ are the mean and standard deviation of the attribute estimated from the training samples of class $C_i$. Evaluating this density at income = 121 for each class gives the required probabilities.
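A small sketch of that calculation (our own code; it uses the sample mean and sample standard deviation, so the exact figures shown in the original post may differ slightly depending on which variance estimate was used):

import numpy as np

# Incomes from the table, split by class.
income_no = np.array([125, 100, 70, 120, 60, 220, 75], dtype=float)
income_yes = np.array([95, 85, 90], dtype=float)

def gaussian_density(x, samples):
    # Normal density with mean and standard deviation estimated from the class samples.
    mu, sigma = samples.mean(), samples.std(ddof=1)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = 121.0
print("P(income=121 | not buy):", gaussian_density(x, income_no))   # roughly 7e-3
print("P(income=121 | buy):    ", gaussian_density(x, income_yes))  # roughly 4e-10

With these estimates the density at income = 121 is far higher for the "not buy" class, so the income evidence alone favours not buying.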

5、 Characteristics of the naive Bayes classifier

  • Attributes can be discrete or continuous
  • Solid mathematical foundation and stable classification efficiency
  • Relatively insensitive to missing and noisy data
  • Classification works very well when the attributes are conditionally independent

6、 Iris classification with a Bayesian classifier

6.1 Introduction to iris


   Iris (Latin name: Iris L.) is a genus of perennial herbs in the family Iridaceae (monocotyledons); the flowers are large and showy and have high ornamental value. There are about 300 species of iris, and the Iris dataset contains three of them: Iris setosa, Iris versicolour, and Iris virginica, with 50 samples of each, 150 samples in total. Each sample has four attributes: sepal length, sepal width, petal length, and petal width, which are used to predict which of the three species a flower belongs to.

   A few rows of the dataset are shown below:

[Figure: sample rows of the Iris dataset]

6.2 Classification code

# Import the Gaussian naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Load the dataset: column 0 is an Id, columns 1-4 are the four features, column 5 is the species label
data_url = "Iris.csv"
df = pd.read_csv(data_url)
X = df.iloc[:, 1:5]
y = df.iloc[:, 5]

# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Use Gaussian naive Bayes
######## Begin ########
clf = GaussianNB()
######## End ########
clf.fit(X_train, y_train)

# Evaluate: fraction of correctly classified test samples
y_pred = clf.predict(X_test)
acc = np.sum(y_test == y_pred) / X_test.shape[0]
print("Test Acc:%.3f" % acc)

[Figure: program output showing the test accuracy]
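The script assumes a local Iris.csv whose first column is an Id and whose sixth column is the species label. If that file is not at hand, the same experiment can be run against the copy of the dataset bundled with scikit-learn (a sketch, not part of the original exercise):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Built-in Iris data: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
# score() returns the mean accuracy on the held-out test set, equivalent to the manual calculation above.
print("Test Acc:%.3f" % clf.score(X_test, y_test))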


Copyright notice
This article was written by [Don't wait for brother shy to develop]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/160/202206090700558781.html