ML from scratch: KNN classification (unweighted)
2022-07-08 01:58:00 【xcrj】
Brief introduction
KNN (K Nearest Neighbors)
- It can be used for both classification and regression problems.
- Both the classification and the regression versions come in weighted and unweighted variants; this post implements the unweighted classification case.
Example
Introduction
- The plane already contains red triangles and blue squares.
- A new green dot arrives: which class should it belong to (red triangle or blue square)?
- When K=3, suppose we find 1 blue square and 2 red triangles among the 3 nearest neighbors. By majority vote, the new green dot is classified as a red triangle.
- In the extreme case K=1, the new green dot simply takes the shape of the single figure closest to it.
- In the other extreme case, K = the number of training samples, the new green dot is always assigned to whichever shape is most frequent in the training set, regardless of where the dot lies.
Principle
KNN (K Nearest Neighbors)
- The K-nearest-neighbor algorithm.
- First find the K nearest neighbors, then let the majority decide (see the voting sketch after this list).
- Given a new input instance, find the K instances in the training set most similar (closest) to it; the class held by the majority of these K instances becomes the class assigned to the input instance.
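To make the majority-vote step concrete, here is a minimal sketch (my addition, not from the original post) using Python's standard library:

    from collections import Counter

    def majority_vote(neighbor_labels):
        # Return the label that occurs most often among the K neighbors
        return Counter(neighbor_labels).most_common(1)[0][0]

    print(majority_vote(['triangle', 'square', 'triangle']))  # -> triangle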
Similarity measure
Similarity
- Similarity is measured by distance.
- The more similar two instances are, the smaller the distance between the input instance and the training instance.
Distance definition
Let the feature space $X$ be the $m$-dimensional real vector space $R^m$, and let $x_i, x_j \in X$ with $x_i=(x_i^{(1)},x_i^{(2)},...,x_i^{(m)})^T$ and $x_j=(x_j^{(1)},x_j^{(2)},...,x_j^{(m)})^T$.
- The $L_p$ distance between $x_i$ and $x_j$: $L_p(x_i,x_j)=\Big(\sum\limits_{l=1}^m|x_i^{(l)}-x_j^{(l)}|^p\Big)^{\frac{1}{p}}$
- For $p=2$ this is the Euclidean distance: $L_2(x_i,x_j)=\Big(\sum\limits_{l=1}^m|x_i^{(l)}-x_j^{(l)}|^2\Big)^{\frac{1}{2}}$
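As a quick numeric check of these formulas, here is a small NumPy sketch (an illustration added here, not part of the original post):

    import numpy as np

    def lp_distance(x_i, x_j, p=2):
        # L_p distance: (sum over l of |x_i^(l) - x_j^(l)|^p)^(1/p); p=2 gives Euclidean
        return np.sum(np.abs(x_i - x_j) ** p) ** (1 / p)

    a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
    print(lp_distance(a, b, p=2))  # Euclidean distance: 5.0
    print(lp_distance(a, b, p=1))  # Manhattan distance: 7.0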
Choosing K
- As the example shows, neither K=1 nor K = the number of training samples is an appropriate choice, so how do we determine a suitable value of K?
- In practice, K is usually a fairly small value, and cross-validation is commonly used to select the best K (a sketch follows this list).
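As one common way to do this, the sketch below (an illustration, not the original author's code) uses sklearn's KNeighborsClassifier with cross_val_score to compare candidate K values on the digits dataset used later in this post:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    for k in [1, 3, 5, 7, 9]:
        # Mean 5-fold cross-validation accuracy for each candidate K
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print("k=%d, mean CV accuracy=%.4f" % (k, scores.mean()))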
Code
Introduction
- We use the digits dataset provided by sklearn. It consists of 1797 handwritten digit images, each represented as a vector of 8x8 pixel values.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

np.random.seed(1)


def get_data_set():
    # Digits dataset: each digit is an 8x8 pixel image
    digits = load_digits()
    # data: 8x8 pixel images flattened to 64 values; target: the digit each image represents
    X, y = digits.data, digits.target
    # Show sample images of the digits 0~9
    # figure width=10, height=8; add 10 axes to the figure in a 2x5 grid
    # pyplot usage: figure -> axes on the figure (several are possible) -> draw on the axes
    fig = plt.figure(figsize=(10, 8))
    for i in range(10):
        # Add axes to fig: rows=2, columns=5, index=i+1 numbers the axes left to right, top to bottom
        ax = fig.add_subplot(2, 5, i + 1)
        # Show the image on the current axes; imshow = image show, cmap = Colormap
        plt.imshow(X[i].reshape((8, 8)), cmap='gray')
    plt.show()
    # Split the dataset into a training set and a test set
    # e.g. X_train.shape=(1347,64); y_train.shape=(1347,); X_test.shape=(450,64); y_test.shape=(450,)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    print('X_train.shape=', X_train.shape)
    print('y_train.shape=', y_train.shape)
    print('X_test.shape=', X_test.shape)
    print('y_test.shape=', y_test.shape)
    data_set = DataSet(X_train, y_train, X_test, y_test)
    return data_set
class DataSet(object):
    """
    X_train: training set samples
    y_train: training set labels
    X_test: test set samples
    y_test: test set labels
    """
    def __init__(self, X_train, y_train, X_test, y_test):
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
class K_NN():
    """ k-nearest-neighbor class """
    def __init__(self, X, y):
        """
        :param X: X_train
        :param y: y_train
        """
        self.X = X
        self.y = y

    def euclidean_distance(self, X):
        """ Euclidean distances between X and X_train """
        # X.shape is (n, 64), i.e. n_samples = n
        print("X.shape=", X.shape)
        m, _ = X.shape
        # axis=1 sums over each row; axis=0 sums over each column
        # L2 is an (n, n_train) matrix: row i holds the distances from X[i] to every training sample
        L2 = [np.sqrt(np.sum((self.X - X[i]) ** 2, axis=1)) for i in range(m)]
        # Convert the list of arrays to an ndarray
        return np.array(L2)

    def hypothesis(self, X, k=1):
        """
        X: data to predict, a matrix
        k: the number of nearest neighbors of X to consider
        """
        # Step 1: compute the Euclidean distances
        dists = self.euclidean_distance(X)
        # Step 2: find the k nearest neighbors and their classes
        # Sort each row in ascending order, then take the indices of the first k elements
        idxk = np.argsort(dists)[:, :k]
        print("idxk.shape=", idxk.shape)
        # y_idxk is an (n, k) matrix
        y_idxk = self.y[idxk]
        print("y_idxk.shape=", y_idxk.shape)
        if k == 1:
            # Transpose to a row vector for easier display
            return y_idxk.T
        else:
            m, _ = X.shape
            # y_idxk is an (n, k) array; max_votes is a list of n predicted labels
            # The key of max is a function counting how often each element occurs in y_idxk[i],
            # so max returns the most frequent element (majority vote)
            max_votes = [max(y_idxk[i], key=list(y_idxk[i]).count) for i in range(m)]
            return max_votes
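# (Illustration added here, not in the original post) A tiny toy usage of K_NN in isolation:
#   toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
#   toy_y = np.array([0, 0, 1, 1])
#   K_NN(toy_X, toy_y).hypothesis(np.array([[5.5, 5.0]]), k=3)
# The 3 nearest neighbors of [5.5, 5.0] carry labels [1, 1, 0], so the majority vote returns [1].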
def evaluate_model(knn, X_test, y_test):
    y_p_test1 = knn.hypothesis(X_test, k=1)
    test_acc1 = np.sum(y_p_test1[0] == y_test) / len(y_p_test1[0]) * 100
    print("k=1, test accuracy:", test_acc1)
    print("---------------------")
    y_p_test3 = knn.hypothesis(X_test, k=3)
    test_acc3 = np.sum(y_p_test3 == y_test) / len(y_p_test3) * 100
    print("k=3, test accuracy:", test_acc3)
    print("---------------------")
    y_p_test5 = knn.hypothesis(X_test, k=5)
    test_acc5 = np.sum(y_p_test5 == y_test) / len(y_p_test5) * 100
    print("k=5, test accuracy:", test_acc5)
    print("---------------------")
def show_result(knn, data_set):
    """ Show the prediction results """
    print("k=1, 1 nearest neighbor")
    # data_set.X_test[0] is a (64,) ndarray
    n = data_set.X_test[0].shape[0]
    # data_set.X_test[0].reshape(-1, n) turns (64,) into a (1, 64) matrix
    print("Predicted class:", knn.hypothesis(data_set.X_test[0].reshape(-1, n), k=1))
    print("True class:", data_set.y_test[0])
    print("---------------------")
    print("k=5, 5 nearest neighbors")
    n = data_set.X_test[20].shape[0]
    print("Predicted class:", knn.hypothesis(data_set.X_test[20].reshape(-1, n), k=5))
    print("True class:", data_set.y_test[20])
    print("---------------------")
    print("Test 10 rows of data x5~x14; k=1, 1 nearest neighbor")
    print("Predicted classes:", knn.hypothesis(data_set.X_test[5:15], k=1))
    print("True classes:", data_set.y_test[5:15])
    print("---------------------")
    print("Test 10 rows of data x5~x14; k=4, 4 nearest neighbors")
    print("Predicted classes:", knn.hypothesis(data_set.X_test[5:15], k=4))
    print("True classes:", data_set.y_test[5:15])
    print("---------------------")
def main():
    # Get the dataset -> split it -> show sample images
    data_set = get_data_set()
    # Build the KNN model
    knn = K_NN(data_set.X_train, data_set.y_train)
    # Evaluate the model on the test set
    evaluate_model(knn, data_set.X_test, data_set.y_test)
    # Show individual prediction results
    show_result(knn, data_set)


if __name__ == "__main__":
    main()
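As a quick sanity check (my addition, not part of the original post), the from-scratch results can be compared against sklearn's built-in KNeighborsClassifier on the same split:

    from sklearn.neighbors import KNeighborsClassifier

    # Assumes data_set was produced by get_data_set() above
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(data_set.X_train, data_set.y_train)
    print("sklearn k=5 test accuracy:", clf.score(data_set.X_test, data_set.y_test) * 100)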