Data type case of machine learning -- using data to distinguish men and women based on Naive Bayesian method
2022-07-02 09:07:00 【Qigui】
Author's introduction: The most important part of any building is its foundation; if the foundation is unstable, the whole structure shakes. The same goes for learning technology: lay a solid foundation first. Follow me and I will help you firm up the fundamentals of every topic.
Blog home page: Qigui's blog
Column: 《Statistical Learning Methods》, 2nd edition, personal notes
Whether you are coming from the south or heading north, don't miss this article!
Triple combo: comment, like, and favorite, then follow!
Write it at the front
In Chapter 4 of Statistical Learning Methods, the naive Bayes method is covered only briefly, and the underlying probability theory can be hard to digest. That's fine: if you don't fully understand it the first time, or the second, or even the third, keep going. My suggestion is to memorize the key formulas first and let the intuition come from working through examples. Which formulas, and how to remember so many? Enough talk; let's see them in action!
Naive Bayes
Naive Bayes is a simple but powerful predictive modeling algorithm. The "naivety" lies in the assumption that the features are conditionally independent given the class.
A naive Bayes model is built from two types of probability:
- 1. The prior probability of each class: $P(C_j)$
- 2. The conditional probability of each attribute given the class: $P(A_i \mid C_j)$
Bayes' formula:
- $P(C_j \mid A_i) = \frac{P(A_i \mid C_j)\,P(C_j)}{P(A_i)}$
The naive Bayes classifier:
- $y = f(x) = \arg\max_{C_j} \frac{P(C_j)\prod_i P(A_i \mid C_j)}{\sum_j P(C_j)\prod_i P(A_i \mid C_j)}$
- where the denominator comes from the law of total probability:
- $P(A_i) = \sum_j P(C_j)\prod_i P(A_i \mid C_j)$
Since the denominator is the same for every class $C_j$, it can be dropped:
- $y = \arg\max_{C_j} P(C_j)\prod_i P(A_i \mid C_j)$
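To see why dropping the shared denominator is safe, here is a tiny numeric sketch; the unnormalized scores below are made up purely for illustration:

```python
# Hypothetical unnormalized scores P(Cj) * prod_i P(Ai|Cj) for two classes
scores = {"C1": 0.3 * 0.5 * 0.25, "C2": 0.7 * 0.1 * 0.5}

# Normalizing by the total (the denominator) rescales every score equally...
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}

# ...so the argmax is unchanged
best_unnormalized = max(scores, key=scores.get)
best_posterior = max(posteriors, key=posteriors.get)
print(best_unnormalized == best_posterior)  # True
```

Dividing every score by the same positive constant cannot change which score is largest, which is why the classifier only needs the numerator.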
How naive Bayesian classification works
Discrete data cases
- The data are as follows (shown as an image in the original post):
Next, take the following features:
- height
- weight
- shoe size
And set the target as follows:
- Gender
- C1: male
- C2: female
- Cj: unknown
Task:
- Determine whether a person with the following feature values is male or female
- A1: height = tall
- A2: weight = medium
- A3: shoe size = medium
Solution:
- Put simply, we only need to compute the probability of being male given these features and the probability of being female given the same features, then compare the two and pick the larger.
$P(C_j \mid A_1 A_2 A_3) = \frac{P(A_1 A_2 A_3 \mid C_j)\,P(C_j)}{P(A_1 A_2 A_3)}$
- Because the features $A_i$ are assumed conditionally independent given the class, the likelihood factorizes:
$P(A_1 A_2 A_3 \mid C_j) = P(A_1 \mid C_j)\,P(A_2 \mid C_j)\,P(A_3 \mid C_j)$
- Attribute probabilities under class C1 (male):
- P(A1|C1) = 2/4 = 1/2
- P(A2|C1) = 1/2
- P(A3|C1) = 1/4
- Attribute probabilities under class C2 (female):
- P(A1|C2) = 0
- P(A2|C2) = 1/2
- P(A3|C2) = 1/2
- P(A1A2A3|C1) = 1/2 × 1/2 × 1/4 = 1/16
- P(A1A2A3|C2) = 0 × 1/2 × 1/2 = 0
Therefore, since P(A1A2A3|C1) > P(A1A2A3|C2) (and the priors P(C1) and P(C2) are equal here, so comparing the likelihoods suffices), the sample belongs to class C1: male.
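The hand calculation above can be reproduced in a few lines of Python, with the conditional probabilities read straight off the worked example:

```python
from functools import reduce

# P(Ai|Cj) from the worked example: A1 = tall, A2 = medium weight, A3 = medium shoe size
cond_probs = {
    "male":   [1/2, 1/2, 1/4],
    "female": [0,   1/2, 1/2],
}

# Multiply the per-feature conditionals to get P(A1A2A3|Cj)
likelihoods = {c: reduce(lambda a, b: a * b, ps) for c, ps in cond_probs.items()}
print(likelihoods)                             # {'male': 0.0625, 'female': 0.0}
print(max(likelihoods, key=likelihoods.get))   # male
```

Note that 0.0625 is exactly the 1/16 computed by hand, and the female likelihood is zero because P(A1|C2) = 0.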
Continuous data cases
- The data are as follows (shown as an image in the original post):
Task:
Determine whether a person with the following feature values is male or female:
- height: 180
- weight: 120
- shoe size: 41
The formula is still the one above. The difficulty here is that height, weight, and shoe size are continuous variables, so the counting method used for discrete features does not apply, and the sample is too small to bin the values into intervals.
What can we do? We can assume that within each gender, height, weight, and shoe size are each normally distributed, and estimate the mean and standard deviation from the sample.
That gives us the density function of each normal distribution. With the density function in hand, we can substitute a value in and evaluate the density at that point.
For example, suppose male height is normally distributed with mean 179.5 and standard deviation 3.697. Then the density of that distribution at height 180 is 0.1069 (strictly a density value, not a probability). How is it computed?
from scipy import stats
# Note: norm.pdf takes the standard deviation (scale) as its third argument, not the variance
male_high = stats.norm.pdf(180, male_high_mean, male_high_std)
male_weight = stats.norm.pdf(120, male_weight_mean, male_weight_std)
male_code = stats.norm.pdf(41, male_code_mean, male_code_std)
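We can sanity-check the quoted 0.1069 with a hand-rolled normal density, using the mean 179.5 and standard deviation 3.697 given in the text:

```python
import math

def norm_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x; sigma is the standard deviation
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Male height: mean 179.5, standard deviation 3.697
density = norm_pdf(180, 179.5, 3.697)
print(round(density, 4))  # 0.1069
```

This matches the value in the text, confirming that 3.697 is being used as a standard deviation.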
Implementation:
- Assume that male and female height, weight, and shoe size are each normally distributed
- Estimate the mean and standard deviation from the sample, which determines each normal density function
- Substitute the query values into the density functions and evaluate them at those points
import numpy as np
import pandas as pd
from scipy import stats

# Load the data
df = pd.read_excel('table_data.xlsx', sheet_name="Sheet3", index_col=0)
# Per-gender mean and standard deviation of every feature
# (stats.norm.pdf expects the standard deviation as its scale, not the variance)
df2 = df.groupby("Gender").agg(["mean", "std"])

# Mean and standard deviation of each male feature
male_high_mean = df2.loc["male", "height"]["mean"]
male_high_std = df2.loc["male", "height"]["std"]
male_weight_mean = df2.loc["male", "weight"]["mean"]
male_weight_std = df2.loc["male", "weight"]["std"]
male_code_mean = df2.loc["male", "shoe size"]["mean"]
male_code_std = df2.loc["male", "shoe size"]["std"]

male_high_p = stats.norm.pdf(180, male_high_mean, male_high_std)
male_weight_p = stats.norm.pdf(120, male_weight_mean, male_weight_std)
male_code_p = stats.norm.pdf(41, male_code_mean, male_code_std)
print('male:', male_high_p * male_weight_p * male_code_p)

# Mean and standard deviation of each female feature
female_high_mean = df2.loc["female", "height"]["mean"]
female_high_std = df2.loc["female", "height"]["std"]
female_weight_mean = df2.loc["female", "weight"]["mean"]
female_weight_std = df2.loc["female", "weight"]["std"]
female_code_mean = df2.loc["female", "shoe size"]["mean"]
female_code_std = df2.loc["female", "shoe size"]["std"]

female_high_p = stats.norm.pdf(180, female_high_mean, female_high_std)
female_weight_p = stats.norm.pdf(120, female_weight_mean, female_weight_std)
female_code_p = stats.norm.pdf(41, female_code_mean, female_code_std)
print('female:', female_high_p * female_weight_p * female_code_p)

# True if the male likelihood is larger, i.e. the sample is classified as male
print(male_high_p * male_weight_p * male_code_p > female_high_p * female_weight_p * female_code_p)
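To make the whole procedure concrete without the spreadsheet, here is a self-contained sketch of the same Gaussian naive Bayes pipeline in plain Python. The training numbers are invented for illustration (they are not the article's table), so treat this as a template rather than a reproduction of the result above:

```python
import math

def norm_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x; sigma is the standard deviation
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical training samples: (height, weight, shoe size)
train = {
    "male":   [(180, 130, 42), (178, 125, 41), (182, 135, 43), (176, 120, 40)],
    "female": [(160, 90, 36), (158, 85, 35), (165, 100, 37), (162, 95, 36)],
}

def fit(samples):
    # Estimate a (mean, std) pair for each feature column
    params = []
    for feature in zip(*samples):
        mu = sum(feature) / len(feature)
        var = sum((v - mu) ** 2 for v in feature) / len(feature)
        params.append((mu, math.sqrt(var)))
    return params

params = {label: fit(samples) for label, samples in train.items()}
total = sum(len(s) for s in train.values())
priors = {label: len(s) / total for label, s in train.items()}

def predict(x):
    # argmax over classes of P(Cj) * prod_i p(Ai | Cj)
    scores = {}
    for label, feats in params.items():
        score = priors[label]
        for value, (mu, sigma) in zip(x, feats):
            score *= norm_pdf(value, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict((180, 120, 41)))  # male
```

The structure mirrors the article's three steps: fit a normal density per feature per class, evaluate the densities at the query point, and pick the class with the larger prior-times-likelihood product.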
Closing remarks
This was a relatively simple case. The key point to take away is that naive Bayes is a generative model, and learning it is a process of continually refining the estimated probabilities. (A generative model outputs probabilities rather than just a class label.) A deeper grasp of Bayes' formula will help you better understand the naive Bayes model.