Naive Bayes -- continuous data
2022-07-29 03:24:00 【Order anything】
For the principles of naive Bayes and the discrete-data case, see the previous post: https://blog.csdn.net/gongfuxiongmao_/article/details/116062023?spm=1001.2014.3001.5502
For continuous data, under the assumption that the data follow a normal distribution, each feature in the training data can be fitted with a Gaussian. The resulting Gaussian curves are then used to estimate the probability that a new sample belongs to each class.
For example, in the case below the data has four features: x1, x2, x3, x4, and three possible outcomes: giving birth to a boy, giving birth to a girl, and not pregnant.
This corresponds to 12 Gaussian curves (number of classes × number of features):
1. For the "boy" class, four Gaussian curves, one for each of the features x1, x2, x3, x4
2. For the "girl" class, four Gaussian curves, one for each of the features x1, x2, x3, x4
3. For the "not pregnant" class, four Gaussian curves, one for each of the features x1, x2, x3, x4
For a new sample to be predicted, substituting its x1, x2, x3, x4 into each class's Gaussian curves gives P(B|A), the probability of the sample under that class.
With continuous data, naive Bayes never runs into P(B|A)=0 (in text classification, a word that never appears gives P(B|A)=0, but a value computed from a Gaussian distribution cannot be exactly zero),
so continuous naive Bayes can skip Laplace smoothing.
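In formula form (the notation is added here, but this is exactly what the code below computes): writing $\mu_{c,i}$ and $\sigma_{c,i}$ for the mean and standard deviation of feature $i$ within class $c$, and $P(c)$ for the prior probability of class $c$,

$$P(c \mid x_1,\dots,x_4) \propto P(c)\prod_{i=1}^{4}\frac{1}{\sqrt{2\pi}\,\sigma_{c,i}}\exp\!\left(-\frac{(x_i-\mu_{c,i})^2}{2\sigma_{c,i}^2}\right)$$

and the predicted class is the one for which this value is largest.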
Let's first review the Gaussian distribution; the principle and code are as follows:
import numpy
import matplotlib.pyplot as plot
import math
# Mean
mean = 2
# Standard deviation (used as the standard deviation in the density formula below)
std = 32
# Create an array of 50 evenly spaced points covering three standard deviations on each side of the mean
x = numpy.linspace(mean - 3 * std, mean + 3 * std, 50)
# Compute y from the normal distribution density formula
y = numpy.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)
# Plot with x and y as the horizontal and vertical coordinates; 'r-' means a red line, 'ro-' means a red line with circle markers, linewidth is the line width
plot.plot(x, y, 'ro-', linewidth=2)
# Alternatively, 'go' draws green circle markers and markersize sets the marker size:
#plot.plot(x, y, 'r-', x, y, 'go', linewidth=2, markersize=8)
# Show a grid
plot.grid(True)
# Show
plot.show()
# Generate an array of samples from the standard normal distribution
'''
loc : float
    Mean of the distribution (the center of the curve)
scale : float
    Standard deviation of the distribution (its width: the larger the scale, the shorter and wider the curve; the smaller, the taller and narrower)
size : int or tuple of ints
    Output shape; the default None returns a single value
'''
samples = numpy.random.normal(loc=0.0, scale=1.0, size=10)
# Equivalent to samples = numpy.random.randn(10)
# Compute the sample mean
mean = numpy.mean(samples)
# Compute the sample standard deviation (the density formula below needs the standard deviation, not the variance)
std = numpy.std(samples)
# Points at which to evaluate the density, covering three standard deviations on each side of the mean
x = numpy.linspace(mean - 3 * std, mean + 3 * std, 10)
# Compute the density
y = numpy.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)
# Compute the probability density; x can be a number or a numpy array
def getGaussianProbability(samples, x):
    # Compute the sample mean
    mean = numpy.mean(samples)
    # Compute the sample standard deviation (numpy.std, since the formula below uses the standard deviation)
    std = numpy.std(samples)
    # Compute the density
    y = numpy.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)
    return y
# test
samples = numpy.array([0.3,0.4,0.6,0.9])
x=numpy.array([0.5,0])
print(getGaussianProbability(samples,x))
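As a quick sanity check (not part of the original post, and assuming scipy is installed), the same densities can be obtained from scipy.stats.norm using the samples and x defined just above:

from scipy import stats
# norm.pdf evaluates the same Gaussian density as getGaussianProbability
print(stats.norm.pdf(x, loc=numpy.mean(samples), scale=numpy.std(samples)))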
The key lines that we will reuse later are:
# Compute the sample mean
mean = numpy.mean(samples)
# Compute the sample standard deviation
std = numpy.std(samples)
# Compute the density, where x can be a number or a numpy array
y = numpy.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)
Packages used in this example: pandas, numpy, collections.Counter, math
# Import related toolkits
import pandas
import numpy as np
from collections import Counter
import math
# Read the feature part of the training data (the look, smell, pulse, temperature and other characteristics of 112 customers collected by the robot's sensors)
"""
pandas reads the data as a DataFrame; to keep working with it as a 2-D numpy array, either:
1. append .values after reading, or
2. convert with y = np.array(y)
"""
# With .values the result is already a numpy array
x = pandas.read_csv('train_X.csv').values
# print(f' I am a test_x:{x}')
# print(np.mean(x),np.std(x))
# Read the label part of the training data (the actual pregnancy status of the 112 customers above: 0 means a girl, 1 means a boy, 2 means not pregnant)
y = pandas.read_csv('train_y.csv')
y = np.array(y)
"""
This is used here tolist() hold numpy The matrix of is transformed into list
Then use _flatten Put the two-dimensional list Flatten
"""
class byes_qingdaifu(object):
    def fit(self, x, y):
        self.x = x
        self.y = y
        self.rMean, self.rStd, self.P = self.getP()
    def getP(self):
        """
        From the training data, fit the 12 per-feature Gaussian curves (mean and standard deviation)
        and the prior probability of each of the 3 classes
        :return: the rMean and rStd matrices and the list of priors P
        """
        # Convert x into a 2-D list
        list_x = self.x.tolist()
        # Flatten the 2-D matrix y into a 1-D list
        list_y = self.y.flatten().tolist()
        # Use Counter to count each label and its number of occurrences
        dict_y = Counter(list_y)
        # dict_y: Counter({2: 41, 0: 37, 1: 34})
        # Total number of training samples
        sumNum = sum(dict_y.values())
        # Sort the counts by label
        t = sorted(dict_y.items(), key=lambda item: item[0])
        # [(0, 37), (1, 34), (2, 41)]
        # Convert the sorted pairs into a 2-D list (kept in label order)
        rl = list(list(items) for items in list(t))
        # Prior probability of each class, collected in the list P
        P = [(rl[i][-1] / sumNum) for i in range(len(rl))]
        # Create an empty list rMean
        rMean = []
        # Build rMean as an n*4 matrix: the mean vector of the 4 features for each of the n classes
        [rMean.append((np.mean(self.getClass(i), axis=0))) for i in range(len(P))]
        # Convert the list to an array for the calculations below
        rMean = np.array(rMean)
        # Create an empty list rStd
        rStd = []
        # Build rStd as an n*4 matrix: the standard deviation of the 4 features for each of the n classes
        [rStd.append((np.std(self.getClass(i), axis=0))) for i in range(len(P))]
        # Convert the list to an array for the calculations below
        rStd = np.array(rStd)
        return rMean, rStd, P
    def predict(self, x_test):
        pResult = []
        # List comprehension: build an n*38 matrix; row i holds, for each of the 38 test samples,
        # the prior of class i times the product over the 4 features of the Gaussian densities under class i
        [pResult.append(np.prod(np.exp(-(x_test - self.rMean[i]) ** 2 / (2 * self.rStd[i] ** 2))
                                / (math.sqrt(2 * math.pi) * self.rStd[i]), axis=1) * self.P[i])
         for i in range(len(self.P))]
        # Convert the list to an array for the calculation below
        pResult = np.array(pResult)
        # print(f'pResult: {pResult}')
        # For each column of the n*38 matrix, take the row index of the maximum value: the predicted class
        result = pResult.argmax(axis=0)
        return result
    # Collect the rows of x whose label is c, forming the data matrix of class c
    def getClass(self, c):
        x = []
        for i in range(len(self.y)):
            if self.y[i][-1] == c:
                x.append(self.x[i])
        # print(f'x: {x}')
        return x
# Create the robot doctor
doctor = byes_qingdaifu()
# Train the robot
doctor.fit(x, y)
# Use the data of 38 customers to test the robot's diagnosis
# Read the look, smell, pulse, temperature and other feature data of the 38 customers
test_x = pandas.read_csv('test_X.csv').values
# print(f'test_x: {test_x}')
# Make the diagnosis! The results are stored in the result array
result = doctor.predict(test_x)
# print(f'result: {result}')
# Print the diagnostic results and compare them with the actual results
# Read the actual pregnancy status of the 38 customers (0 means a girl, 1 means a boy, 2 means not pregnant)
test_y = pandas.read_csv('test_y.csv')
# print(f'test_y:\n{test_y}')
labels = ['girl', 'boy', 'not pregnant']
i = 0
# Number of correct diagnoses
predictOKNum = 0
print("No., diagnosis, actual, result")
while i < test_y.shape[0]:
    # Compare the i-th diagnosis with the i-th actual value; equal means the diagnosis is correct
    if result[i] == (test_y.values[i, 0]):
        predictOKNum = predictOKNum + 1
        okOrNo = 'correct'
    else:
        okOrNo = 'wrong'
    print("%s,%s,%s,%s" % (i + 1, labels[result[i]], labels[test_y.values[i, 0]], okOrNo))
    i = i + 1
print("Diagnostic accuracy: %s" % (predictOKNum / i))
After testing your own code, you can call sklearn's built-in naive Bayes API, GaussianNB, to double-check the accuracy of your implementation.
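For example, a minimal sketch of that cross-check (assuming scikit-learn is installed and the variables x, y, test_x and test_y loaded above are still available) could look like this:

from sklearn.naive_bayes import GaussianNB

# Fit sklearn's Gaussian naive Bayes on the same training data
clf = GaussianNB()
clf.fit(x, y.ravel())
# Compare its accuracy on the 38 test customers with the accuracy printed above
print("sklearn GaussianNB accuracy:", clf.score(test_x, test_y.values.ravel()))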