Naive Bayes -- continuous data
2022-07-29 03:24:00 【Order anything】
For the principles of naive Bayes and the discrete-data case, see the previous post: https://blog.csdn.net/gongfuxiongmao_/article/details/116062023?spm=1001.2014.3001.5502
For continuous data, under the assumption that the data follow a normal distribution, each feature in the training data can be fitted with a Gaussian. The resulting Gaussian curves are then used to estimate the probability that a new sample belongs to each class.
For example, in the case below the data has four features: x1, x2, x3, x4, and three possible outcomes: giving birth to a boy, giving birth to a girl, and not pregnant.
This corresponds to 12 Gaussian curves (number of classes × number of features):
1. For the "boy" class, four Gaussian curves, one for each of the features x1, x2, x3, x4
2. For the "girl" class, four Gaussian curves, one for each of the features x1, x2, x3, x4
3. For the "not pregnant" class, four Gaussian curves, one for each of the features x1, x2, x3, x4
For a new sample to be predicted, substituting its x1, x2, x3, x4 into each class's Gaussian curves gives P(B|A), the probability of the sample under that class.
With continuous data, naive Bayes never runs into P(B|A)=0 (in text classification, a word that never appears gives P(B|A)=0, but a value computed from a Gaussian distribution cannot be exactly zero),
so continuous naive Bayes can skip Laplace smoothing.
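In formula form (the notation is added here, but this is exactly what the code below computes): writing $\mu_{c,i}$ and $\sigma_{c,i}$ for the mean and standard deviation of feature $i$ within class $c$, and $P(c)$ for the prior probability of class $c$,

$$P(c \mid x_1,\dots,x_4) \propto P(c)\prod_{i=1}^{4}\frac{1}{\sqrt{2\pi}\,\sigma_{c,i}}\exp\!\left(-\frac{(x_i-\mu_{c,i})^2}{2\sigma_{c,i}^2}\right)$$

and the predicted class is the one for which this value is largest.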
Let's first review the Gaussian distribution; the principle and code are as follows:
import numpy
import matplotlib.pyplot as plot
import math
# Mean
mean = 2
# Standard deviation (used as the standard deviation in the density formula below)
std = 32
# Create an array of 50 evenly spaced points covering three standard deviations on each side of the mean
x = numpy.linspace(mean - 3 * std, mean + 3 * std, 50)
# Compute y from the normal distribution density formula
y = numpy.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)
# Plot with x and y as the horizontal and vertical coordinates; 'r-' means a red line, 'ro-' means a red line with circle markers, linewidth is the line width
plot.plot(x, y, 'ro-', linewidth=2)
# Alternatively, 'go' draws green circle markers and markersize sets the marker size:
#plot.plot(x, y, 'r-', x, y, 'go', linewidth=2, markersize=8)
# Show a grid
plot.grid(True)
# Show
plot.show()
# Generate an array of samples from the standard normal distribution
'''
loc : float
    Mean of the distribution (the center of the curve)
scale : float
    Standard deviation of the distribution (its width: the larger the scale, the shorter and wider the curve; the smaller, the taller and narrower)
size : int or tuple of ints
    Output shape; the default None returns a single value
'''
samples = numpy.random.normal(loc=0.0, scale=1.0, size=10)
# Equivalent to samples = numpy.random.randn(10)
# Compute the sample mean
mean = numpy.mean(samples)
# Compute the sample standard deviation (the density formula below needs the standard deviation, not the variance)
std = numpy.std(samples)
# Points at which to evaluate the density, covering three standard deviations on each side of the mean
x = numpy.linspace(mean - 3 * std, mean + 3 * std, 10)
# Compute the density
y = numpy.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)
# Compute the probability density; x can be a number or a numpy array
def getGaussianProbability(samples, x):
    # Compute the sample mean
    mean = numpy.mean(samples)
    # Compute the sample standard deviation (numpy.std, since the formula below uses the standard deviation)
    std = numpy.std(samples)
    # Compute the density
    y = numpy.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)
    return y
# test
samples = numpy.array([0.3,0.4,0.6,0.9])
x=numpy.array([0.5,0])
print(getGaussianProbability(samples,x))
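As a quick sanity check (not part of the original post, and assuming scipy is installed), the same densities can be obtained from scipy.stats.norm using the samples and x defined just above:

from scipy import stats
# norm.pdf evaluates the same Gaussian density as getGaussianProbability
print(stats.norm.pdf(x, loc=numpy.mean(samples), scale=numpy.std(samples)))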
The key lines that we will reuse later are:
# Compute the sample mean
mean = numpy.mean(samples)
# Compute the sample standard deviation
std = numpy.std(samples)
# Compute the density, where x can be a number or a numpy array
y = numpy.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)
Packages used in this example: pandas, numpy, collections.Counter, math
# Import related toolkits
import pandas
import numpy as np
from collections import Counter
import math
# Read the feature part of the training data (the look, smell, pulse, temperature and other characteristics of 112 customers collected by the robot's sensors)
"""
pandas reads the data as a DataFrame; to keep working with it as a 2-D numpy array, either:
1. append .values after reading, or
2. convert with y = np.array(y)
"""
# With .values the result is already a numpy array
x = pandas.read_csv('train_X.csv').values
# print(f' I am a test_x:{x}')
# print(np.mean(x),np.std(x))
# Read the label part of the training data (the actual pregnancy status of the 112 customers above: 0 means a girl, 1 means a boy, 2 means not pregnant)
y = pandas.read_csv('train_y.csv')
y = np.array(y)
"""
This is used here tolist() hold numpy The matrix of is transformed into list
Then use _flatten Put the two-dimensional list Flatten
"""
class byes_qingdaifu(object):
    def fit(self, x, y):
        self.x = x
        self.y = y
        self.rMean, self.rStd, self.P = self.getP()
    def getP(self):
        """
        From the training data, fit the 12 per-feature Gaussian curves (mean and standard deviation)
        and the prior probability of each of the 3 classes
        :return: the rMean and rStd matrices and the list of priors P
        """
        # Convert x into a 2-D list
        list_x = self.x.tolist()
        # Flatten the 2-D matrix y into a 1-D list
        list_y = self.y.flatten().tolist()
        # Use Counter to count each label and its number of occurrences
        dict_y = Counter(list_y)
        # dict_y: Counter({2: 41, 0: 37, 1: 34})
        # Total number of training samples
        sumNum = sum(dict_y.values())
        # Sort the counts by label
        t = sorted(dict_y.items(), key=lambda item: item[0])
        # [(0, 37), (1, 34), (2, 41)]
        # Convert the sorted pairs into a 2-D list (kept in label order)
        rl = list(list(items) for items in list(t))
        # Prior probability of each class, collected in the list P
        P = [(rl[i][-1] / sumNum) for i in range(len(rl))]
        # Create an empty list rMean
        rMean = []
        # Build rMean as an n*4 matrix: the mean vector of the 4 features for each of the n classes
        [rMean.append((np.mean(self.getClass(i), axis=0))) for i in range(len(P))]
        # Convert the list to an array for the calculations below
        rMean = np.array(rMean)
        # Create an empty list rStd
        rStd = []
        # Build rStd as an n*4 matrix: the standard deviation of the 4 features for each of the n classes
        [rStd.append((np.std(self.getClass(i), axis=0))) for i in range(len(P))]
        # Convert the list to an array for the calculations below
        rStd = np.array(rStd)
        return rMean, rStd, P
    def predict(self, x_test):
        pResult = []
        # List comprehension: build an n*38 matrix; row i holds, for each of the 38 test samples,
        # the prior of class i times the product over the 4 features of the Gaussian densities under class i
        [pResult.append(np.prod(np.exp(-(x_test - self.rMean[i]) ** 2 / (2 * self.rStd[i] ** 2))
                                / (math.sqrt(2 * math.pi) * self.rStd[i]), axis=1) * self.P[i])
         for i in range(len(self.P))]
        # Convert the list to an array for the calculation below
        pResult = np.array(pResult)
        # print(f'pResult: {pResult}')
        # For each column of the n*38 matrix, take the row index of the maximum value: the predicted class
        result = pResult.argmax(axis=0)
        return result
    # Collect the rows of x whose label is c, forming the data matrix of class c
    def getClass(self, c):
        x = []
        for i in range(len(self.y)):
            if self.y[i][-1] == c:
                x.append(self.x[i])
        # print(f'x: {x}')
        return x
# Create the robot doctor
doctor = byes_qingdaifu()
# Train the robot
doctor.fit(x, y)
# Use the data of 38 customers to test the robot's diagnosis
# Read the look, smell, pulse, temperature and other feature data of the 38 customers
test_x = pandas.read_csv('test_X.csv').values
# print(f'test_x: {test_x}')
# Make the diagnosis! The results are stored in the result array
result = doctor.predict(test_x)
# print(f'result: {result}')
# Print the diagnostic results and compare them with the actual results
# Read the actual pregnancy status of the 38 customers (0 means a girl, 1 means a boy, 2 means not pregnant)
test_y = pandas.read_csv('test_y.csv')
# print(f'test_y:\n{test_y}')
labels = ['girl', 'boy', 'not pregnant']
i = 0
# Number of correct diagnoses
predictOKNum = 0
print("No., diagnosis, actual, result")
while i < test_y.shape[0]:
    # Compare the i-th diagnosis with the i-th actual value; equal means the diagnosis is correct
    if result[i] == (test_y.values[i, 0]):
        predictOKNum = predictOKNum + 1
        okOrNo = 'correct'
    else:
        okOrNo = 'wrong'
    print("%s,%s,%s,%s" % (i + 1, labels[result[i]], labels[test_y.values[i, 0]], okOrNo))
    i = i + 1
print("Diagnostic accuracy: %s" % (predictOKNum / i))
After testing your own code, you can call sklearn's built-in naive Bayes API, GaussianNB, to double-check the accuracy of your implementation.
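For example, a minimal sketch of that cross-check (assuming scikit-learn is installed and the variables x, y, test_x and test_y loaded above are still available) could look like this:

from sklearn.naive_bayes import GaussianNB

# Fit sklearn's Gaussian naive Bayes on the same training data
clf = GaussianNB()
clf.fit(x, y.ravel())
# Compare its accuracy on the 38 test customers with the accuracy printed above
print("sklearn GaussianNB accuracy:", clf.score(test_x, test_y.values.ravel()))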