当前位置:网站首页>Statistical learning method (3/22) k-nearest neighbor method
Statistical learning method (3/22) k-nearest neighbor method
2022-06-29 01:10:00 【Xiaoshuai acridine】
k Nearest neighbor method (k-nearest neighbor, k-NN) It's a basic classification and regression method . This chapter only deals with the problem of classification k Nearest neighbor method .k The input of the nearest neighbor method is the eigenvector of the instance , Points corresponding to feature space ; The output is the category of the instance , You can take multiple classes .k The nearest neighbor method assumes that given a training data set , The instance category has been determined . When classifying , For new examples , According to its k Categories of nearest neighbor training instances , To predict by a majority vote, etc . therefore ,k The nearest neighbor method has no explicit training process , Instead, the training data set is actually used to partition the feature vector space , And as a model for its classification .k It's worth choosing 、 Distance measurement and classification decision rules are the three basic elements of this method .
Eye of depth course link :https://ai.deepshare.net/detail/p_619b93d0e4b07ededa9fcca0/5
Code link :
https://github.com/zxs-000202/Statistical-Learning-Methods





K Euclidean distance is widely used in the nearest neighbor method 











import numpy as np
import time
def loadData(fileName):
''' Load the file :param fileName: File path to load :return: Data sets and tag sets '''
print('start read file')
# Store data and mark
dataArr = []; labelArr = []
# Read the file
fr = open(fileName)
# Traverse every line in the file
for line in fr.readlines():
# Get current row , And press “,” Cut into fields and put them in the list
#strip: Remove the characters specified at the beginning and end of each line of string ( Default space or newline )
#split: Cut the string into each field according to the specified characters , Return to list form
curLine = line.strip().split(',')
# Put the data in each row except the tag into the data set (curLine[0] Tag information for )
# At the same time, the data in the original string form is converted to integer
dataArr.append([int(num) for num in curLine[1:]])
# Put tag information into the tag set
# Convert tag to integer while placing
labelArr.append(int(curLine[0]))
# Return data sets and tags
return dataArr, labelArr
def calcDist(x1, x2):
''' Calculate the distance between two sample point vectors The Euclidean distance is used , namely The square of each element subtracted from the sample point And then sum up We'll start again European example formula is not convenient to write here , Can baidu or Google European distance ( Also known as Euclidean distance ) :param x1: vector 1 :param x2: vector 2 :return: Euclidean distance between vectors '''
return np.sqrt(np.sum(np.square(x1 - x2)))
# Maharton distance formula
# return np.sum(x1 - x2)
def getClosest(trainDataMat, trainLabelMat, x, topK):
''' Prediction samples x The tag . The method of obtaining is to find and sample x Current topK A little bit , And view their labels . Find the type of tag that accounts for the most of a certain type of tag ( In the book 3.1 3.2 section ) :param trainDataMat: Training set data set :param trainLabelMat: Training set tag set :param x: Sample to predict x :param topK: Select the number of reference nearest samples ( The selection of the number of samples is related to the accuracy , See in detail 3.2.3 K Choice of value ) :return: Predicted Tags '''
# Create a storage vector x A list of distances from samples in each training set
# The length of the list is the length of the training set ,distList[i] Express x And training set No
## i The distance between samples
distList = [0] * len(trainLabelMat)
# Traverse all the sample points in the training set , Calculation and x Distance of
for i in range(len(trainDataMat)):
# Get the vector of the current sample in the training set
x1 = trainDataMat[i]
# Calculate the vector x And training set samples x Distance of
curDist = calcDist(x1, x)
# Put the distance in the corresponding list position
distList[i] = curDist
# Sort the distance list
#argsort: Function sorts the values of an array from small to large , And output according to its corresponding index value
# for example :
# >>> x = np.array([3, 1, 2])
# >>> np.argsort(x)
# array([1, 2, 0])
# Returns the index value of the elements in the list from small to large , It is suitable for us to find the minimum distance
#array The whole index value list is returned , We go through [:topK] Take the first... In the list topL Put in list in .
#---------------- Optimization point -------------------
# Because we only take topK Small element index values , So there is no need to sort the whole list , and argsort It's about the whole
# List to sort , There is a waste of time . Dictionaries have a ready-made way to sort only top Big or top Small , You can look it up
# Just make a few changes to the code
# The main reason why it is not optimized here is KNN It takes a lot of time to calculate the distance between vectors , Because the vector is high dimensional
# So it takes a long time to calculate , So if you want to improve the time , Optimization here is of little significance .( Of course, it doesn't mean that you can't optimize ,
# Mainly because I'm too lazy )
topKList = np.argsort(np.array(distList))[:topK] # Ascending sort
# Create a list of length , Used to select the largest number of tags
#3.2.4 It is mentioned that classification decision-making uses voting ,topK Each person has one ticket , Put... In the position represented by each tag in the array
# The place corresponding to oneself , Then carry out a ticket call to select the highest ticket
labelList = [0] * 10
# Yes topK Index to traverse
for index in topKList:
#trainLabelMat[index]: Look for... In the training set tab topK The tag corresponding to the element index
#int(trainLabelMat[index]): Convert markup to int( In fact, it's already int 了 , But no int Words , Report errors )
#labelList[int(trainLabelMat[index])]: Find the tag at labelList The corresponding position in
# Finally add 1, I have cast one vote
labelList[int(trainLabelMat[index])] += 1
#max(labelList): Find the number of votes with the most votes in the ballot box
#labelList.index(max(labelList)): Then find the index corresponding to the maximum value in the list , A marker equivalent to a forecast
return labelList.index(max(labelList))
def model_test(trainDataArr, trainLabelArr, testDataArr, testLabelArr, topK):
''' Test accuracy :param trainDataArr: Training set data set :param trainLabelArr: Training set tags :param testDataArr: Test set data set :param testLabelArr: Test set tag :param topK: Select how many adjacent points to reference :return: Accuracy rate '''
print('start test')
# Convert all lists to matrix form , Easy to calculate
trainDataMat = np.mat(trainDataArr); trainLabelMat = np.mat(trainLabelArr).T
testDataMat = np.mat(testDataArr); testLabelMat = np.mat(testLabelArr).T
# Error value technique
errorCnt = 0
# Traverse the test set , Test each test set sample
# Because it takes too much time to calculate vectors , The test set has 6000 Samples , So it's artificially changed to
# test 200 A sample points , If you want to run all , Uncomment the line , Next line for Just comment , At the same time, the following print
# and return Change the comment line accordingly
# for i in range(len(testDataMat)):
for i in range(200):
# print('test %d:%d'%(i, len(trainDataArr)))
print('test %d:%d' % (i, 200))
# Read the vector of the current test sample of the test set
x = testDataMat[i]
# Get the predicted tag
y = getClosest(trainDataMat, trainLabelMat, x, topK)
# If the forecast mark does not match the actual mark , Error value count plus 1
if y != testLabelMat[i]: errorCnt += 1
# Return the correct rate
# return 1 - (errorCnt / len(testDataMat))
return 1 - (errorCnt / 200)
if __name__ == "__main__":
start = time.time()
# Get the training set
trainDataArr, trainLabelArr = loadData('../Mnist/mnist_train.csv')
# Get the test set
testDataArr, testLabelArr = loadData('../Mnist/mnist_test.csv')
# Calculate the test set accuracy
accur = model_test(trainDataArr, trainLabelArr, testDataArr, testLabelArr, 25)
# Print accuracy
print('accur is:%d'%(accur * 100), '%')
end = time.time()
# Show time spent
print('time span:', end - start)
边栏推荐
- 最大路径和问题(摘樱桃问题)
- [image enhancement] manual multiple exposure fusion amef image defogging based on MATLAB [including Matlab source code 1916]
- 【图像处理】基于matlab实现图像曲线调整系统
- What is the reason for the service crash caused by replacing the easycvr cluster version with the old database?
- Breadth first search to catch cattle
- 接雨水系列问题
- 最新版CorelDRAW Technical Suite2022
- 流媒体集群应用与配置:如何在一台服务器部署多个EasyCVR?
- Comparison between winding process and lamination process
- Daily English articles, reading accumulation
猜你喜欢

PR 2021 quick start tutorial, how to use audio editing in PR?

What is contemporaneous group analysis? Teach you to use SQL to handle

【图像处理】基于matlab实现图像曲线调整系统

What is the reason why easycvr can't watch the device video when it is connected to the home protocol?

【leetcode】17. Letter combination of telephone number
![[MCU club] design of classroom number detection based on MCU [physical design]](/img/cf/65c1bdbb45afcc6db265a79c25a3ae.jpg)
[MCU club] design of classroom number detection based on MCU [physical design]

Day 8 script and audio

测试只能干到35岁?35岁+的测试就会失业?

Breadth first search to catch cattle

同期群分析是什么?教你用 SQL 来搞定
随机推荐
[MCU club] design of GSM version of range hood based on MCU [physical design]
[Fire Detection] forest fire detection system based on Matlab GUI (with panel) [including Matlab source code phase 1921]
Ensemble de données sur les visages masqués et méthode de génération des visages masqués
Blazor University (34)表单 —— 获得表单状态
【RRT三维路径规划】基于matlab快速扩展随机树无人机三维路径规划【含Matlab源码 1914期】
深度优先搜索实现抓牛问题
Interviewer: with the for loop, why do you need foreach??
Blazor University (34) forms - get form status
What is the difference between the history and Western blotting
Is it safe to open an account on great wisdom
Is it safe to open a securities account at qiniu business school in 2022?
致我们曾经遇到过的接口问题
Count the number of different palindrome subsequences in the string
分析框架——用户体验度量数据体系搭建
Different subsequence problems I
[MCU club] design of blind water cup based on MCU [physical design]
Is Huatai Securities safe
How to calculate the income tax of foreign-funded enterprises
What is contemporaneous group analysis? Teach you to use SQL to handle
How to solve the problem of Caton screen when easycvr plays video?