当前位置:网站首页>Data type case of machine learning -- using data to distinguish men and women based on Naive Bayesian method
Data type case of machine learning -- using data to distinguish men and women based on Naive Bayesian method
2022-07-02 09:07:00 【Qigui】
Author's brief introduction : The most important part of the whole building is the foundation , The foundation is unstable , The earth trembled and the mountains swayed . And to learn technology, we should lay a solid foundation , Pay attention to me , Take you to firm the foundation of the neighborhood of each plate .
Blog home page : Qigui's blog
Included column :《 Statistical learning method 》 The second edition —— Personal notes
From the south to the North , Don't miss it , Miss this article ,“ wonderful ” May miss you la
Triple attack( Three strikes in a row ):Comment,Like and Collect—>Attention
List of articles
Write it at the front
stay 《 Statistical learning method 》 In the fourth chapter of , The author only describes the simple Bayes method roughly , Because the theory is too difficult to understand , With probability ! But it doesn't matter , Theory , I think it's ok if I don't understand it for the first time , I still don't understand it the second time , It's okay. , The third time, I still don't understand , Nothing more ... Maybe you'll think I'm teasing you ? exactly ,, Of course not . The author suggests remembering the formula . When the , You must be asking again : So many formulas , How to remember ? Which ones should I remember ? Then I can only tell you : The Buddha is . Don't talk nonsense , Let's see the actual battle !
Naive Bayes
Naive Bayes is a simple but powerful predictive modeling algorithm , Simplicity lies in the assumption that each feature is independent .
Naive Bayesian model consists of two types of probability :
- 1、 The probability of each category —— P ( C j ) P(C_{j}) P(Cj)
- 2、 Conditional probability of each attribute —— P ( A i ∣ C j ) P(A_{i}|C_{j}) P(Ai∣Cj)
Bayes' formula :
- P ( C j ∣ A i ) = P ( A i ∣ C j ) P ( C j ) P ( A i ) P(C_{j}|A_{i})=\frac {P(A_{i}|C_{j})P(C_{j})}{P(A_{i})} P(Cj∣Ai)=P(Ai)P(Ai∣Cj)P(Cj)
Naive Bayes classifier :
- y = f ( x ) = a r g m a x P ( C j ) ∏ i P ( A i ∣ C j ) ∑ j P ( C j ) ∏ i P ( A i ∣ C j ) y=f(x)=argmax\frac{P(C_{j})\prod_{i}^{}P(A_{i}|C_{j})}{\sum_{j}^{}P(C_{j})\prod_{i}^{}P(A_{i}|C_{j})} y=f(x)=argmax∑jP(Cj)∏iP(Ai∣Cj)P(Cj)∏iP(Ai∣Cj)
- among , The full probability formula is :
- P ( A i ) = ∑ j P ( C j ) ∏ i P ( A i ∣ C j ) P(A_{i})={\sum_{j}^{}P(C_{j})\prod_{i}^{}P(A_{i}|C_{j})} P(Ai)=∑jP(Cj)∏iP(Ai∣Cj)
Denominator to all C j C_{j} Cj All the same :
- y = a r g m a x P ( C j ) ∏ i P ( A i ∣ C j ) y=arg maxP(C_{j})\prod_{i}^{}P(A_{i}|C_{j}) y=argmaxP(Cj)∏iP(Ai∣Cj)
How naive Bayesian classification works
Discrete data cases
- The data are as follows :

next , Determine the following characteristics :
- height
- weight
- shoe size
also , Set goals as follows :
- Gender
- C1: male
- C2: Woman
- Cj: Unknown
demand :
- Solve whether the gender of the following characteristic data is male or female
- A1: height = high
- A2: weight = in
- A3: shoe size = in
Realization :
- Simply speaking , We only need the probability that the solution is male under this feature , Then solve the probability of being female under this feature , And then compare them , Choose the most likely as the result
P ( C j ∣ A 1 A 2 A 3 ) = P ( A 1 A 2 A 3 ∣ C j ) P ( C j ) P ( A 1 A 2 A 3 ) P(C_{j}|A_{1}A_{2}A_{3})=\frac {P(A_{1}A_{2}A_{3}|C_{j})P(C_{j})}{P(A_{1}A_{2}A_{3})} P(Cj∣A1A2A3)=P(A1A2A3)P(A1A2A3∣Cj)P(Cj)
- Again because A i A_{i} Ai Independent of each other , So it can be converted as follows :
P ( A 1 A 2 A 3 ∣ C j ) = P ( A 1 ∣ C j ) P ( A 2 ∣ C j ) P ( A 3 ∣ C j ) P(A_{1}A_{2}A_{3}|C_{j})=P(A_{1}|C_{j})P(A_{2}|C_{j})P(A_{3}|C_{j}) P(A1A2A3∣Cj)=P(A1∣Cj)P(A2∣Cj)P(A3∣Cj)
- P(Cj|A1A2A3) = (P(A1A2A3|Cj) * P(Cj)) / P(A1A2A3)
- P(A1A2A3|Cj) = P(A1|Cj) * P(A2|Cj) *P(A3|Cj)
- C1 Attribute probability under category condition :
- P(A1|C1) = 2/4 = 1/2
- P(A2|C1) = 1/2
- P(A3|C1) = 1/4
- C2 Attribute probability under category condition :
- P(A1|C2) = 0
- P(A2|C2) =1/2
- P(A3|C2) = 1/2
- P(A1A2A3|C1) = 1/16
- P(A1A2A3|C2) = 0
therefore , regard as P(A1A2A3|C1) > P(A1A2A3|C2), Should be C1 Category , For men .
Continuous data cases
- The data are as follows :

demand :
Solve whether the gender of the following characteristic data is male or female
- height :180
- weight :120
- shoe size :41
The formula is still the above formula , The difficulty here is , Because of height 、 weight 、 Shoe sizes are continuous variables , The method of discrete variables cannot be used to calculate probability . And because there are too few samples , Therefore, it cannot be divided into intervals .
What shall I do? ? At this time , Sure Suppose the height of men and women 、 weight 、 Shoe sizes are normally distributed , Calculate the mean and variance from the sample
That is to say, we get the density function of normal distribution . With the density function , You can substitute the value into , Calculate the value of the density function at a point
such as , The height of men is the average 179.5、 The standard deviation is 3.697 Is a normal distribution . So the height of men is 180 The probability of is 0.1069. How to calculate it ?
from scipy import stats
male_high = stats.norm.pdf(180,male_high_mean,male_high _var)
male_weight = stats.norm.pdf(120, male_weight_mean, male_weight_var)
male_code = stats.norm.pdf(41, male_code_mean, male_code_var)
Realization :
- Suppose the height of men and women 、 weight 、 Shoe sizes are normally distributed
- Calculate the mean and variance from the sample , That is to say, we get the density function of normal distribution
- With the density function , You can substitute the value into , Calculate the value of the density function at a point
import numpy as np
import pandas as pd
# Import data
df = pd.read_excel('table_data.xlsx', sheet_name="Sheet3", index_col=0)
df2 = df.groupby(" Gender ").agg([np.mean, np.var])
# men Of all characteristics mean value And variance
male_high_mean = df2.loc[" male ", " height "]["mean"]
male_high_var = df2.loc[" male ", " height "]["var"]
male_weight_mean = df2.loc[" male ", " weight "]["mean"]
male_weight_var = df2.loc[" male ", " weight "]["var"]
male_code_mean = df2.loc[" male ", " shoe size "]["mean"]
male_code_var = df2.loc[" male ", " shoe size "]["var"]
from scipy import stats
male_high_p = stats.norm.pdf(180, male_high_mean, male_high_var)
male_weight_p = stats.norm.pdf(120, male_weight_mean, male_weight_var)
male_code_p = stats.norm.pdf(41, male_code_mean, male_code_var)
print(' men :', male_high_p * male_weight_p * male_code_p)
# women Of all characteristics mean value And variance
female_high_mean = df2.loc[" Woman ", " height "]["mean"]
female_high_var = df2.loc[" Woman ", " height "]["var"]
female_weight_mean = df2.loc[" Woman ", " weight "]["mean"]
female_weight_var = df2.loc[" Woman ", " weight "]["var"]
female_code_mean = df2.loc[" Woman ", " shoe size "]["mean"]
female_code_var = df2.loc[" Woman ", " shoe size "]["var"]
from scipy import stats
female_high_p = stats.norm.pdf(180, female_high_mean, female_high_var)
female_weight_p = stats.norm.pdf(120, female_weight_mean, female_weight_var)
female_code_p = stats.norm.pdf(41, female_code_mean, female_code_var)
print(' women :', female_high_p * female_weight_p * female_code_p)
print(male_high_p*male_weight_p*male_code_p > female_high_p*female_weight_p*female_code_p)
Written in the back
This is a relatively simple case , Understand that naive Bayes is a generative model , It is a learning process , Constantly adjust cognitive probability .( The generation model is : The result given is not of type , It's probability ) Further understand Bayesian formula , It is helpful to better understand naive Bayesian model .
边栏推荐
- Select sort and insert sort
- Solution of Xiaomi TV's inability to access computer shared files
- 【Go实战基础】gin 如何验证请求参数
- 队列的基本概念介绍以及典型应用示例
- Servlet全解:继承关系、生命周期、容器和请求转发与重定向等
- Use of libusb
- Synchronize files using unison
- Count the number of various characters in the string
- CSDN Q & A_ Evaluation
- How to realize asynchronous programming in a synchronous way?
猜你喜欢
![[blackmail virus data recovery] suffix Crylock blackmail virus](/img/b2/8e3a65dd250b9194cfc175138c740c.jpg)
[blackmail virus data recovery] suffix Crylock blackmail virus

Mysql安装时mysqld.exe报`应用程序无法正常启动(0xc000007b)`

OpenShift构建镜像

Nacos 下载启动、配置 MySQL 数据库

Openshift container platform community okd 4.10.0 deployment

Servlet全解:继承关系、生命周期、容器和请求转发与重定向等

CSDN Q & A_ Evaluation

Synchronize files using unison

【Go实战基础】gin 如何绑定与使用 url 参数

Service de groupe minecraft
随机推荐
以字节跳动内部 Data Catalog 架构升级为例聊业务系统的性能优化
Application of kotlin - higher order function
1、 QT's core class QObject
What is the future value of fluorite mine of karaqin Xinbao Mining Co., Ltd. under zhongang mining?
PCL calculates the intersection of three mutually nonparallel planes
Analysis and solution of a classical Joseph problem
【Go实战基础】gin 如何绑定与使用 url 参数
Multi version concurrency control mvcc of MySQL
Leetcode sword finger offer brush questions - day 23
整理秒杀系统的面试必备!!!
win10使用docker拉取redis镜像报错read-only file system: unknown
QT qtimer class
Minecraft模组服开服
Cloudreve自建云盘实践,我说了没人能限制得了我的容量和速度
Luogu greedy part of the backpack line segment covers the queue to receive water
[blackmail virus data recovery] suffix Crylock blackmail virus
C nail development: obtain all employee address books and send work notices
我服了,MySQL表500W行,居然有人不做分区?
2022/2/14 summary
CSDN Q & A_ Evaluation