当前位置:网站首页>Data type case of machine learning -- using data to distinguish men and women based on Naive Bayesian method
Data type case of machine learning -- using data to distinguish men and women based on Naive Bayesian method
2022-07-02 09:07:00 【Qigui】
Author's brief introduction : The most important part of the whole building is the foundation , The foundation is unstable , The earth trembled and the mountains swayed . And to learn technology, we should lay a solid foundation , Pay attention to me , Take you to firm the foundation of the neighborhood of each plate .
Blog home page : Qigui's blog
Included column :《 Statistical learning method 》 The second edition —— Personal notes
From the south to the North , Don't miss it , Miss this article ,“ wonderful ” May miss you la
Triple attack( Three strikes in a row ):Comment,Like and Collect—>Attention
List of articles
Write it at the front
stay 《 Statistical learning method 》 In the fourth chapter of , The author only describes the simple Bayes method roughly , Because the theory is too difficult to understand , With probability ! But it doesn't matter , Theory , I think it's ok if I don't understand it for the first time , I still don't understand it the second time , It's okay. , The third time, I still don't understand , Nothing more ... Maybe you'll think I'm teasing you ? exactly ,, Of course not . The author suggests remembering the formula . When the , You must be asking again : So many formulas , How to remember ? Which ones should I remember ? Then I can only tell you : The Buddha is . Don't talk nonsense , Let's see the actual battle !
Naive Bayes
Naive Bayes is a simple but powerful predictive modeling algorithm , Simplicity lies in the assumption that each feature is independent .
Naive Bayesian model consists of two types of probability :
- 1、 The probability of each category —— P ( C j ) P(C_{j}) P(Cj)
- 2、 Conditional probability of each attribute —— P ( A i ∣ C j ) P(A_{i}|C_{j}) P(Ai∣Cj)
Bayes' formula :
- P ( C j ∣ A i ) = P ( A i ∣ C j ) P ( C j ) P ( A i ) P(C_{j}|A_{i})=\frac {P(A_{i}|C_{j})P(C_{j})}{P(A_{i})} P(Cj∣Ai)=P(Ai)P(Ai∣Cj)P(Cj)
Naive Bayes classifier :
- y = f ( x ) = a r g m a x P ( C j ) ∏ i P ( A i ∣ C j ) ∑ j P ( C j ) ∏ i P ( A i ∣ C j ) y=f(x)=argmax\frac{P(C_{j})\prod_{i}^{}P(A_{i}|C_{j})}{\sum_{j}^{}P(C_{j})\prod_{i}^{}P(A_{i}|C_{j})} y=f(x)=argmax∑jP(Cj)∏iP(Ai∣Cj)P(Cj)∏iP(Ai∣Cj)
- among , The full probability formula is :
- P ( A i ) = ∑ j P ( C j ) ∏ i P ( A i ∣ C j ) P(A_{i})={\sum_{j}^{}P(C_{j})\prod_{i}^{}P(A_{i}|C_{j})} P(Ai)=∑jP(Cj)∏iP(Ai∣Cj)
Denominator to all C j C_{j} Cj All the same :
- y = a r g m a x P ( C j ) ∏ i P ( A i ∣ C j ) y=arg maxP(C_{j})\prod_{i}^{}P(A_{i}|C_{j}) y=argmaxP(Cj)∏iP(Ai∣Cj)
How naive Bayesian classification works
Discrete data cases
- The data are as follows :

next , Determine the following characteristics :
- height
- weight
- shoe size
also , Set goals as follows :
- Gender
- C1: male
- C2: Woman
- Cj: Unknown
demand :
- Solve whether the gender of the following characteristic data is male or female
- A1: height = high
- A2: weight = in
- A3: shoe size = in
Realization :
- Simply speaking , We only need the probability that the solution is male under this feature , Then solve the probability of being female under this feature , And then compare them , Choose the most likely as the result
P ( C j ∣ A 1 A 2 A 3 ) = P ( A 1 A 2 A 3 ∣ C j ) P ( C j ) P ( A 1 A 2 A 3 ) P(C_{j}|A_{1}A_{2}A_{3})=\frac {P(A_{1}A_{2}A_{3}|C_{j})P(C_{j})}{P(A_{1}A_{2}A_{3})} P(Cj∣A1A2A3)=P(A1A2A3)P(A1A2A3∣Cj)P(Cj)
- Again because A i A_{i} Ai Independent of each other , So it can be converted as follows :
P ( A 1 A 2 A 3 ∣ C j ) = P ( A 1 ∣ C j ) P ( A 2 ∣ C j ) P ( A 3 ∣ C j ) P(A_{1}A_{2}A_{3}|C_{j})=P(A_{1}|C_{j})P(A_{2}|C_{j})P(A_{3}|C_{j}) P(A1A2A3∣Cj)=P(A1∣Cj)P(A2∣Cj)P(A3∣Cj)
- P(Cj|A1A2A3) = (P(A1A2A3|Cj) * P(Cj)) / P(A1A2A3)
- P(A1A2A3|Cj) = P(A1|Cj) * P(A2|Cj) *P(A3|Cj)
- C1 Attribute probability under category condition :
- P(A1|C1) = 2/4 = 1/2
- P(A2|C1) = 1/2
- P(A3|C1) = 1/4
- C2 Attribute probability under category condition :
- P(A1|C2) = 0
- P(A2|C2) =1/2
- P(A3|C2) = 1/2
- P(A1A2A3|C1) = 1/16
- P(A1A2A3|C2) = 0
therefore , regard as P(A1A2A3|C1) > P(A1A2A3|C2), Should be C1 Category , For men .
Continuous data cases
- The data are as follows :

demand :
Solve whether the gender of the following characteristic data is male or female
- height :180
- weight :120
- shoe size :41
The formula is still the above formula , The difficulty here is , Because of height 、 weight 、 Shoe sizes are continuous variables , The method of discrete variables cannot be used to calculate probability . And because there are too few samples , Therefore, it cannot be divided into intervals .
What shall I do? ? At this time , Sure Suppose the height of men and women 、 weight 、 Shoe sizes are normally distributed , Calculate the mean and variance from the sample
That is to say, we get the density function of normal distribution . With the density function , You can substitute the value into , Calculate the value of the density function at a point
such as , The height of men is the average 179.5、 The standard deviation is 3.697 Is a normal distribution . So the height of men is 180 The probability of is 0.1069. How to calculate it ?
from scipy import stats
male_high = stats.norm.pdf(180,male_high_mean,male_high _var)
male_weight = stats.norm.pdf(120, male_weight_mean, male_weight_var)
male_code = stats.norm.pdf(41, male_code_mean, male_code_var)
Realization :
- Suppose the height of men and women 、 weight 、 Shoe sizes are normally distributed
- Calculate the mean and variance from the sample , That is to say, we get the density function of normal distribution
- With the density function , You can substitute the value into , Calculate the value of the density function at a point
import numpy as np
import pandas as pd
# Import data
df = pd.read_excel('table_data.xlsx', sheet_name="Sheet3", index_col=0)
df2 = df.groupby(" Gender ").agg([np.mean, np.var])
# men Of all characteristics mean value And variance
male_high_mean = df2.loc[" male ", " height "]["mean"]
male_high_var = df2.loc[" male ", " height "]["var"]
male_weight_mean = df2.loc[" male ", " weight "]["mean"]
male_weight_var = df2.loc[" male ", " weight "]["var"]
male_code_mean = df2.loc[" male ", " shoe size "]["mean"]
male_code_var = df2.loc[" male ", " shoe size "]["var"]
from scipy import stats
male_high_p = stats.norm.pdf(180, male_high_mean, male_high_var)
male_weight_p = stats.norm.pdf(120, male_weight_mean, male_weight_var)
male_code_p = stats.norm.pdf(41, male_code_mean, male_code_var)
print(' men :', male_high_p * male_weight_p * male_code_p)
# women Of all characteristics mean value And variance
female_high_mean = df2.loc[" Woman ", " height "]["mean"]
female_high_var = df2.loc[" Woman ", " height "]["var"]
female_weight_mean = df2.loc[" Woman ", " weight "]["mean"]
female_weight_var = df2.loc[" Woman ", " weight "]["var"]
female_code_mean = df2.loc[" Woman ", " shoe size "]["mean"]
female_code_var = df2.loc[" Woman ", " shoe size "]["var"]
from scipy import stats
female_high_p = stats.norm.pdf(180, female_high_mean, female_high_var)
female_weight_p = stats.norm.pdf(120, female_weight_mean, female_weight_var)
female_code_p = stats.norm.pdf(41, female_code_mean, female_code_var)
print(' women :', female_high_p * female_weight_p * female_code_p)
print(male_high_p*male_weight_p*male_code_p > female_high_p*female_weight_p*female_code_p)
Written in the back
This is a relatively simple case , Understand that naive Bayes is a generative model , It is a learning process , Constantly adjust cognitive probability .( The generation model is : The result given is not of type , It's probability ) Further understand Bayesian formula , It is helpful to better understand naive Bayesian model .
边栏推荐
- Tensorflow2 keras 分类模型
- QT drag event
- 十年开发经验的程序员告诉你,你还缺少哪些核心竞争力?
- Sqli labs level 12
- Installing Oracle database 19C for Linux
- Solution and analysis of Hanoi Tower problem
- Loadbalancer dynamically refreshes Nacos server
- kubeadm部署kubernetes v1.23.5集群
- Illegal use of crawlers, an Internet company was terminated, the police came to the door, and 23 people were taken away
- Gocv image reading and display
猜你喜欢

Driving test Baodian and its spokesperson Huang Bo appeared together to call for safe and civilized travel

Finishing the interview essentials of secsha system!!!

Hengyuan cloud_ Can aiphacode replace programmers?

Minecraft module service opening

Qunhui NAS configuring iSCSI storage

Minecraft group service opening

十年开发经验的程序员告诉你,你还缺少哪些核心竞争力?

C language implementation of mine sweeping game

History of Web Technology

队列的基本概念介绍以及典型应用示例
随机推荐
Kubedm deploys kubernetes v1.23.5 cluster
Pyspark de duplication dropduplicates, distinct; withColumn、lit、col; unionByName、groupBy
commands out of sync. did you run multiple statements at once
How to realize asynchronous programming in a synchronous way?
Move a string of numbers backward in sequence
C# 高德地图 根据经纬度获取地址
Webflux responsive programming
QT -- how to set shadow effect in QWidget
OpenShift 容器平台社区版 OKD 4.10.0部署
Minecraft group service opening
Flink-使用流批一体API统计单词数量
Judge whether it is Sudoku
Qt的拖动事件
Minecraft plug-in service opening
1、 QT's core class QObject
Openshift deployment application
C call system sound beep~
kubernetes部署loki日志系统
选择排序和插入排序
Solution of Xiaomi TV's inability to access computer shared files