当前位置：网站首页>Statistical learning method (3/22) k-nearest neighbor method

Statistical learning method (3/22) k-nearest neighbor method

2022-06-29 01:10:00 【Xiaoshuai acridine】

k Nearest neighbor method （k-nearest neighbor, k-NN） It's a basic classification and regression method . This chapter only deals with the problem of classification k Nearest neighbor method .k The input of the nearest neighbor method is the eigenvector of the instance , Points corresponding to feature space ; The output is the category of the instance , You can take multiple classes .k The nearest neighbor method assumes that given a training data set , The instance category has been determined . When classifying , For new examples , According to its k Categories of nearest neighbor training instances , To predict by a majority vote, etc . therefore ,k The nearest neighbor method has no explicit training process , Instead, the training data set is actually used to partition the feature vector space , And as a model for its classification .k It's worth choosing 、 Distance measurement and classification decision rules are the three basic elements of this method .

Eye of depth course link ：https://ai.deepshare.net/detail/p_619b93d0e4b07ededa9fcca0/5
Code link ：
https://github.com/zxs-000202/Statistical-Learning-Methods

Insert picture description here

K Euclidean distance is widely used in the nearest neighbor method

Insert picture description here

import numpy as np
import time

def loadData(fileName):
    '''  Load the file  :param fileName: File path to load  :return:  Data sets and tag sets  '''
    print('start read file')
    # Store data and mark 
    dataArr = []; labelArr = []
    # Read the file 
    fr = open(fileName)
    # Traverse every line in the file 
    for line in fr.readlines():
        # Get current row , And press “,” Cut into fields and put them in the list 
        #strip： Remove the characters specified at the beginning and end of each line of string （ Default space or newline ）
        #split： Cut the string into each field according to the specified characters , Return to list form 
        curLine = line.strip().split(',')
        # Put the data in each row except the tag into the data set （curLine[0] Tag information for ）
        # At the same time, the data in the original string form is converted to integer 
        dataArr.append([int(num) for num in curLine[1:]])
        # Put tag information into the tag set 
        # Convert tag to integer while placing 
        labelArr.append(int(curLine[0]))
    # Return data sets and tags 
    return dataArr, labelArr

def calcDist(x1, x2):
    '''  Calculate the distance between two sample point vectors   The Euclidean distance is used , namely   The square of each element subtracted from the sample point   And then sum up   We'll start again   European example formula is not convenient to write here , Can baidu or Google European distance （ Also known as Euclidean distance ） :param x1: vector 1 :param x2: vector 2 :return: Euclidean distance between vectors  '''
    return np.sqrt(np.sum(np.square(x1 - x2)))

    # Maharton distance formula 
    # return np.sum(x1 - x2)




def getClosest(trainDataMat, trainLabelMat, x, topK):
    '''  Prediction samples x The tag .  The method of obtaining is to find and sample x Current topK A little bit , And view their labels .  Find the type of tag that accounts for the most of a certain type of tag  （ In the book 3.1 3.2 section ） :param trainDataMat: Training set data set  :param trainLabelMat: Training set tag set  :param x: Sample to predict x :param topK: Select the number of reference nearest samples （ The selection of the number of samples is related to the accuracy , See in detail 3.2.3 K Choice of value ） :return: Predicted Tags  '''
    # Create a storage vector x A list of distances from samples in each training set 
    # The length of the list is the length of the training set ,distList[i] Express x And training set No 
    ## i The distance between samples 
    distList = [0] * len(trainLabelMat)
    # Traverse all the sample points in the training set , Calculation and x Distance of 
    for i in range(len(trainDataMat)):
        # Get the vector of the current sample in the training set 
        x1 = trainDataMat[i]
        # Calculate the vector x And training set samples x Distance of 
        curDist = calcDist(x1, x)
        # Put the distance in the corresponding list position 
        distList[i] = curDist

    # Sort the distance list 
    #argsort： Function sorts the values of an array from small to large , And output according to its corresponding index value 
    # for example ：
    # >>> x = np.array([3, 1, 2])
    # >>> np.argsort(x)
    # array([1, 2, 0])
    # Returns the index value of the elements in the list from small to large , It is suitable for us to find the minimum distance 
    #array The whole index value list is returned , We go through [:topK] Take the first... In the list topL Put in list in .
    #---------------- Optimization point -------------------
    # Because we only take topK Small element index values , So there is no need to sort the whole list , and argsort It's about the whole 
    # List to sort , There is a waste of time . Dictionaries have a ready-made way to sort only top Big or top Small , You can look it up 
    # Just make a few changes to the code 
    # The main reason why it is not optimized here is KNN It takes a lot of time to calculate the distance between vectors , Because the vector is high dimensional 
    # So it takes a long time to calculate , So if you want to improve the time , Optimization here is of little significance .（ Of course, it doesn't mean that you can't optimize ,
    # Mainly because I'm too lazy ）
    topKList = np.argsort(np.array(distList))[:topK]        # Ascending sort 
    # Create a list of length , Used to select the largest number of tags 
    #3.2.4 It is mentioned that classification decision-making uses voting ,topK Each person has one ticket , Put... In the position represented by each tag in the array 
    # The place corresponding to oneself , Then carry out a ticket call to select the highest ticket 
    labelList = [0] * 10
    # Yes topK Index to traverse 
    for index in topKList:
        #trainLabelMat[index]： Look for... In the training set tab topK The tag corresponding to the element index 
        #int(trainLabelMat[index])： Convert markup to int（ In fact, it's already int 了 , But no int Words , Report errors ）
        #labelList[int(trainLabelMat[index])]： Find the tag at labelList The corresponding position in 
        # Finally add 1, I have cast one vote 
        labelList[int(trainLabelMat[index])] += 1
    #max(labelList)： Find the number of votes with the most votes in the ballot box 
    #labelList.index(max(labelList))： Then find the index corresponding to the maximum value in the list , A marker equivalent to a forecast 
    return labelList.index(max(labelList))


def model_test(trainDataArr, trainLabelArr, testDataArr, testLabelArr, topK):
    '''  Test accuracy  :param trainDataArr: Training set data set  :param trainLabelArr:  Training set tags  :param testDataArr:  Test set data set  :param testLabelArr:  Test set tag  :param topK:  Select how many adjacent points to reference  :return:  Accuracy rate  '''
    print('start test')
    # Convert all lists to matrix form , Easy to calculate 
    trainDataMat = np.mat(trainDataArr); trainLabelMat = np.mat(trainLabelArr).T
    testDataMat = np.mat(testDataArr); testLabelMat = np.mat(testLabelArr).T

    # Error value technique 
    errorCnt = 0
    # Traverse the test set , Test each test set sample 
    # Because it takes too much time to calculate vectors , The test set has 6000 Samples , So it's artificially changed to 
    # test 200 A sample points , If you want to run all , Uncomment the line , Next line for Just comment , At the same time, the following print
    # and return Change the comment line accordingly 
    # for i in range(len(testDataMat)):
    for i in range(200):
        # print('test %d:%d'%(i, len(trainDataArr)))
        print('test %d:%d' % (i, 200))
        # Read the vector of the current test sample of the test set 
        x = testDataMat[i]
        # Get the predicted tag 
        y = getClosest(trainDataMat, trainLabelMat, x, topK)
        # If the forecast mark does not match the actual mark , Error value count plus 1
        if y != testLabelMat[i]: errorCnt += 1

    # Return the correct rate 
    # return 1 - (errorCnt / len(testDataMat))
    return 1 - (errorCnt / 200)



if __name__ == "__main__":
    start = time.time()

    # Get the training set 
    trainDataArr, trainLabelArr = loadData('../Mnist/mnist_train.csv')
    # Get the test set 
    testDataArr, testLabelArr = loadData('../Mnist/mnist_test.csv')
    # Calculate the test set accuracy 
    accur = model_test(trainDataArr, trainLabelArr, testDataArr, testLabelArr, 25)
    # Print accuracy 
    print('accur is:%d'%(accur * 100), '%')

    end = time.time()
    # Show time spent 
    print('time span:', end - start)

原网站

版权声明
本文为[Xiaoshuai acridine]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/180/202206290105326115.html

当前位置：网站首页>Statistical learning method (3/22) k-nearest neighbor method

Statistical learning method (3/22) k-nearest neighbor method

边栏推荐

猜你喜欢

随机推荐