
Naive Bayes classification of scikit learn

2022-06-12 03:27:00 Watermelon

This tutorial will show you how to build and evaluate naive Bayes classifiers with Python's Scikit-learn package.

Suppose you are a product manager and want to classify customer reviews as positive or negative. Or, as a loan manager, you want to identify which loan applicants are safe and which are risky. As a healthcare analyst, you want to predict which patients are likely to suffer from diabetes. All of these examples share the same problem: classifying reviews, loan applicants, and patients.

Naive Bayes is the most straightforward and fastest classification algorithm, and it is well suited to large volumes of data. Naive Bayes classifiers have been used successfully in a variety of applications, such as spam filtering, text classification, sentiment analysis, and recommender systems. They use Bayes' theorem of probability to predict the class of unknown examples.

In this tutorial, you will learn all of the following:


  • Classification workflow
  • What is a naive Bayes classifier?
  • How do naive Bayes classifiers work?
  • Building classifiers in Scikit-learn
  • The zero-probability problem
  • Advantages and disadvantages

1 Classification workflow

Whenever you perform classification, the first step is to understand the problem and identify the potential features and the label. Features are the attributes or properties that influence the value of the label. For example, in the case of loan distribution, the bank manager considers the customer's occupation, income, age, location, previous loan history, transaction history, and credit score. These attributes are known as features that help the model classify customers.

Classification has two phases: a learning phase and an evaluation phase. In the learning phase, the classifier trains its model on a given dataset; in the evaluation phase, the performance of the classifier is tested. Performance is evaluated on the basis of various parameters such as accuracy, error, precision, and recall.


2 What is a naive Bayes classifier?

Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the simplest supervised learning algorithms. The naive Bayes classifier is a fast, accurate, and reliable algorithm, with high accuracy and speed on large datasets.

The naive Bayes classifier assumes that the effect of a particular feature in a class is independent of the other features. For example, whether a loan applicant is desirable depends on his/her income, previous loan and transaction history, age, and location. Even if these features are interdependent, they are still considered independently. This assumption simplifies computation, and that is why it is considered "naive". It is known as class-conditional independence.

Bayes' theorem provides a way of calculating the posterior probability P(h|D) from P(h), P(D), and P(D|h):

P(h|D) = P(D|h) · P(h) / P(D)

Where:


  • P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
  • P(D): the probability of the data (regardless of the hypothesis). This is known as the prior probability of the data.
  • P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
  • P(D|h): the probability of data D given that hypothesis h is true. This is known as the likelihood.

3 How do naive Bayes classifiers work?

This section involves a fair amount of linear algebra and probability. Readers with a strong theoretical interest can consult more specialized books; readers who only care about how to apply the method can skip it. I only care about the application, so let's skip this section. The heading stays just for structural completeness. Ha ha ha ~~~ I hope my math teacher will forgive me ~~~

4 Building classifiers in Scikit-learn

4.1 Naive Bayes classifier

1) Define the dataset

In this example, you can use a dummy dataset with three columns: weather, temperature, and play (whether to go out and play). The first two are features (weather and temperature) and the third is the label.

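A minimal sketch of such a dataset, modeled on the classic weather/play toy data (the exact rows below are an assumption):

```python
# Features (weather, temperature) and label (play), as plain string lists.
weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild',
        'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes',
        'Yes', 'Yes', 'Yes', 'No']
```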

2) Encode the features

First, you need to convert these string labels into numbers, for example: 'Overcast', 'Rainy', 'Sunny' as 0, 1, 2. This is known as label encoding. Scikit-learn provides the LabelEncoder library for encoding labels with values between 0 and one less than the number of discrete classes.

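A sketch of this step with LabelEncoder (the variable names are assumptions):

```python
from sklearn import preprocessing

# Create a label encoder and convert the weather strings into integers.
le = preprocessing.LabelEncoder()
weather_encoded = le.fit_transform(weather)
print(weather_encoded)
# With the dummy data above: [2 2 0 1 1 1 0 2 2 1 2 0 0 1]
# (Overcast=0, Rainy=1, Sunny=2 -- classes are numbered alphabetically)
```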

Similarly, you can encode the temp and play columns.

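A sketch for the remaining two columns:

```python
# Encode the temp and play columns in the same way.
temp_encoded = le.fit_transform(temp)
label = le.fit_transform(play)
print("Temp:", temp_encoded)   # Cool=0, Hot=1, Mild=2
print("Play:", label)          # No=0, Yes=1
```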

Now combine the two features (weather and temperature) into a single variable (a list of tuples).

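A sketch using Python's built-in zip:

```python
# Pair each encoded weather value with its encoded temperature.
features = list(zip(weather_encoded, temp_encoded))
print(features)
```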

3) Generate the model

Generate the model with a naive Bayes classifier in the following steps:


  • Create a naive Bayes classifier
  • Fit the dataset to the classifier
  • Perform the prediction (see the sketch after this list)
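A sketch of these steps, assuming the Gaussian variant (GaussianNB) on the encoded toy data above:

```python
from sklearn.naive_bayes import GaussianNB

# 1. Create a Gaussian naive Bayes classifier.
model = GaussianNB()

# 2. Fit the classifier to the encoded features and labels.
model.fit(features, label)

# 3. Predict the outcome for Overcast (0) weather and Mild (2) temperature.
predicted = model.predict([[0, 2]])
print("Predicted value:", predicted)  # -> [1]
```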

Here, 1 indicates that players can "play".

4.2 Naive Bayes with multiple labels

Until now, you have learned about naive Bayes classification with binary labels. Now you will learn about multi-class classification in naive Bayes, known as multinomial naive Bayes classification. For example, you may want to classify news articles about technology, entertainment, politics, or sports.

In the model-building part, you can use the wine dataset, a very famous multi-class classification problem. "This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars."

The dataset contains 13 features (alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, od280/od315_of_diluted_wines, and proline) and the type of wine as the target. The data has three classes of wine: class_0, class_1, and class_2. Here, you can build a model to classify the type of wine.

The dataset is available in the scikit-learn library.

1) Load data

Let's first load the wine dataset from scikit-learn's datasets module.

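A sketch of the loading step:

```python
from sklearn import datasets

# Load the built-in wine dataset.
wine = datasets.load_wine()
```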

2) Explore data

You can print the target and feature names to make sure you have the right dataset, as shown below:

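Along these lines:

```python
# The names of the 13 features.
print("Features:", wine.feature_names)

# The names of the 3 wine classes.
print("Labels:", wine.target_names)  # ['class_0' 'class_1' 'class_2']
```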

It's never a bad idea to explore your data a little, so that you know what you're working with. Here, you can print the first five rows of the dataset along with the target variable for the whole dataset.

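For example:

```python
# 178 samples, 13 features.
print(wine.data.shape)

# First five rows of the feature matrix.
print(wine.data[0:5])

# Class labels (0, 1, or 2) for every sample.
print(wine.target)
```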

3) Split data

First, you separate the columns into dependent and independent variables (or features and labels). Then you split those variables into a training set and a test set.

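A sketch with train_test_split; the 70/30 split and the random_state value are assumptions:

```python
from sklearn.model_selection import train_test_split

# 70% of the data for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=109)
```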

4) Generate the model

After splitting, you will train a naive Bayes model on the training set and make predictions on the test set features.

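A sketch, assuming GaussianNB since the wine features are continuous:

```python
from sklearn.naive_bayes import GaussianNB

# Create a Gaussian naive Bayes classifier and train it.
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict the classes of the test set.
y_pred = gnb.predict(X_test)
```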

5) Evaluate the model

After generating the model, check its accuracy using the actual and predicted values.

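A sketch using the metrics module:

```python
from sklearn import metrics

# Accuracy: the fraction of test samples classified correctly.
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
```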

5 Zero probability problem

Suppose the dataset contains no tuples for a risky loan. In this scenario, the posterior probability will be zero, and the model is unable to make a prediction. This problem is known as zero probability, because the occurrence of that particular class is zero.

The solution for such an issue is Laplacian correction, also known as Laplace smoothing (the Laplace estimator). Laplacian correction is one of the smoothing techniques. Here, you assume that the dataset is large enough that adding one row per class will not make a difference in the estimated probabilities. This overcomes the issue of zero probability values.

For example: suppose there are 1000 training tuples for the risky-loan class in the database. In this database, the income column has 0 tuples for low income, 990 tuples for medium income, and 10 tuples for high income. Without Laplacian correction, the probabilities of these events are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000).

Now, apply Laplacian correction to the given dataset: add 1 more tuple for each income-value pair. The probabilities of these events become:

  • Low income: 1/1003 ≈ 0.001
  • Medium income: 991/1003 ≈ 0.988
  • High income: 11/1003 ≈ 0.011

Each count was increased by 1 and the total by 3 (1000 + 3 = 1003), so none of the probabilities is zero.
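In scikit-learn, this kind of additive smoothing is controlled by the alpha parameter of the discrete naive Bayes estimators; a minimal sketch:

```python
from sklearn.naive_bayes import MultinomialNB

# alpha=1.0 applies Laplace (add-one) smoothing to the count estimates;
# values between 0 and 1 give Lidstone smoothing instead.
clf = MultinomialNB(alpha=1.0)
```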

6 Advantages


  • It is not only a simple approach but also a fast and accurate method for prediction.
  • Naive Bayes has a very low computation cost.
  • It can work efficiently on large datasets.
  • It performs well with discrete response variables compared to continuous ones.
  • It can be used with multi-class prediction problems.
  • It also performs well on text-analytics problems.
  • When the assumption of independence holds, a naive Bayes classifier performs better than other models such as logistic regression.

7 Disadvantages


  • The assumption of independent features. In practice, it is almost impossible for the model to get a set of predictors that are completely independent.
  • If there is no training tuple for a particular class, the posterior probability will be zero, and the model is unable to make predictions. This problem is known as the zero-probability / zero-frequency problem.

8 Conclusion

In this tutorial, you learned about the naive Bayes algorithm: how it works, its independence assumption, its issues, its implementation, and its pros and cons. Along the way, you also learned model building and evaluation in scikit-learn for binary and multinomial classes.

Naive Bayes is the most straightforward and effective of algorithms. Despite the significant advances machine learning has made in the past few years, it has proven its worth, and it has been successfully deployed in many applications, from text analytics to recommendation engines.

Original article

Copyright notice
This article was written by [Watermelon]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/03/202203011046132419.html