Machine learning in practice: is The Mermaid a romance or an action movie? KNN reveals the answer
2022-07-02 09:12:00 【Qigui】
About the author: the most important part of any building is its foundation; if the foundation is shaky, the whole structure trembles. Learning technology is the same: lay the foundation well first. Follow me and we will firm up the fundamentals of each area together.
Blog home page: Qigui's blog
Column: Statistical Learning Methods (2nd edition), personal notes
Wherever you are reading from, don't pass this one by; miss this article and the "wonderful" part may miss you.
Triple combo (three in a row): comment, like, and favorite, then follow.
Problem:
It is known that Wolf Warrior, Operation Red Sea, and Mission: Impossible 6 are action movies, while The Ex-File 3, Love Off the Cuff, and Titanic are romance movies. If a new movie, The Mermaid, now comes out, is there a way for a machine to learn the classification rule and classify the new movie automatically?
The data, which also appear in the code below, are:

| Movie | Number of fights | Number of kisses | Film type |
| --- | --- | --- | --- |
| Wolf Warrior | 100 | 5 | action |
| Operation Red Sea | 95 | 3 | action |
| Mission: Impossible 6 | 105 | 31 | action |
| The Ex-File 3 | 2 | 59 | romance |
| Love Off the Cuff | 3 | 60 | romance |
| Titanic | 10 | 80 | romance |
| The Mermaid | 5 | 29 | ? |
Visualize the data
Because the data in the table above is not very intuitive to look at, we visualize it.
The code is as follows:
# Data visualization: get an intuitive view of how the movies separate
import matplotlib.pyplot as plt
# matplotlib does not display Chinese by default; configure it via rcParams
plt.rcParams['font.sans-serif'] = ['SimHei']  # step 1: substitute a sans-serif font that has Chinese glyphs
plt.rcParams['axes.unicode_minus'] = False    # step 2: fix the rendering of minus signs on the axes
# 1. Get the data
# 2. Basic data processing
# 3. Data visualization
x = [5, 3, 31, 59, 60, 80]    # x axis: number of kisses
y = [100, 95, 105, 2, 3, 10]  # y axis: number of fights
labels = ["Wolf Warrior", "Operation Red Sea", "Mission: Impossible 6",
          "The Ex-File 3", "Love Off the Cuff", "Titanic"]
# Draw the scatter plot and set the ticks and axis labels
plt.scatter(x, y, s=120)
plt.xlabel("Number of kisses")
plt.ylabel("Number of fights")
plt.xticks(range(0, 150, 10))
plt.yticks(range(0, 150, 10))
# Annotate each point with its movie title
count = 0
for x_i, y_i in zip(x, y):  # zip pairs each x value with its y value
    plt.annotate(f"{labels[count]}", xy=(x_i, y_i), xytext=(x_i, y_i))
    count += 1
plt.show()
The resulting scatter plot (figure omitted here) labels each movie with its kiss and fight counts: the three action movies cluster at high fight counts and low kiss counts, and the three romance movies cluster at the opposite corner. With that, the data in the table is clear at a glance.
But we still do not know which of these points The Mermaid is closest to. How do we compute the distance between The Mermaid and each of the other points? With the distance between two points.
The distance between two points is also called the Euclidean distance: for two points $x=(x_{1}, x_{2})$ and $y=(y_{1}, y_{2})$,
$d=\sqrt{\left ( x_{1}-y_{1} \right )^{2}+\left ( x_{2}-y_{2} \right )^{2}}$
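As a quick sanity check of the formula, here is a minimal sketch in plain Python (not from the original post) that computes the distance between Wolf Warrior (5 kisses, 100 fights) and Titanic (80 kisses, 10 fights), using the values from the scatter-plot data above:

```python
import math

# (kisses, fights) for two of the movies plotted above
wolf_warrior = (5, 100)
titanic = (80, 10)

# Euclidean distance between the two points
d = math.sqrt((wolf_warrior[0] - titanic[0]) ** 2
              + (wolf_warrior[1] - titanic[1]) ** 2)
print(round(d, 1))  # 117.2
```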
For the full details, see the earlier article:
From Concept to Method, the Great Statistical Learning Methods: Chapter 3, the k-Nearest Neighbor Method
The current task is to compute the distance from The Mermaid to every other movie. It is tempting to think that finding the single nearest neighbor is enough to decide The Mermaid's type. In practice this approach is not reliable.
Here is an example:
Suppose student A is equally distant from students B and D, who live in different districts; then A's district cannot be decided from one neighbor alone. In other words, we should look at more neighbors to classify more reliably. For this movie classification task we choose 3 nearest neighbors, that is, k = 3.
The previous article (From Concept to Method, the Great Statistical Learning Methods: Chapter 3, the k-Nearest Neighbor Method) described the KNN workflow in detail; here is a brief recap.
A summary of the KNN workflow:
- 1. Compute the distance between the object to be classified and every other object;
- 2. Take the K nearest neighbors;
- 3. Among those K neighbors, find the category that appears most often; the object to be classified belongs to that category.
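Before the two full implementations below, here is a minimal, self-contained sketch of these three steps (not from the original post). The features are hard-coded from the data above, and the feature vector assumed for The Mermaid, 5 fights and 29 kisses, is the one passed to the sklearn prediction call later in the post:

```python
import numpy as np
from collections import Counter

# Features: (number of fights, number of kisses); labels: film type
features = np.array([[100, 5], [95, 3], [105, 31],   # action movies
                     [2, 59], [3, 60], [10, 80]])    # romance movies
labels = ["action", "action", "action", "romance", "romance", "romance"]

mermaid = np.array([5, 29])  # the movie to classify
k = 3

# 1. Euclidean distance from The Mermaid to every known movie
distances = np.sqrt(((features - mermaid) ** 2).sum(axis=1))

# 2. Indices of the k nearest neighbors
nearest = np.argsort(distances)[:k]

# 3. Majority vote among the k neighbors
votes = Counter(labels[i] for i in nearest)
print(votes.most_common(1)[0][0])  # romance
```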
There are two implementations:
1. NumPy implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # step 1: substitute a sans-serif font that has Chinese glyphs
plt.rcParams['axes.unicode_minus'] = False    # step 2: fix the rendering of minus signs on the axes
# 1. Prepare the data:
#    training data (features and target) and test data (features only)
# 2. Compute the Euclidean distance between the unknown sample and every training sample
# 3. Take the K nearest neighbors
# 4. Count the categories among them

class Myknn(object):
    def __init__(self, train_df, k):
        """
        :param train_df: training data
        :param k: number of nearest neighbors
        """
        self.train_df = train_df
        self.k = k

    # Compute the distances between the sample to predict and the training data,
    # select the k smallest distances, then count the categories of those k neighbors
    def predict(self, test_df):
        """Prediction function"""
        # Euclidean distance, computed with NumPy
        self.train_df['distance'] = np.sqrt(
            (test_df['Number of fights'] - self.train_df['Number of fights']) ** 2
            + (test_df['Number of kisses'] - self.train_df['Number of kisses']) ** 2)
        # Sort by distance and keep the film types of the K closest rows
        my_types = self.train_df.sort_values(by='distance').iloc[:self.k]['Film type']
        print(my_types)
        # Count the categories of the k neighbors; the majority category is the prediction
        new_my_type = my_types.value_counts().index[0]
        print(new_my_type)
        return new_my_type

# 1. Read the data
my_df = pd.read_excel('Movie data.xlsx', sheet_name=0)
# print(my_df)
# (basic data processing, feature engineering and visualization are omitted here)
# 2. Training data: features are the fight and kiss counts, the target (category) is the film type
train_df = my_df.loc[:5, ["Number of fights", "Number of kisses", "Film type"]]
# 3. Data to predict: The Mermaid (row 6)
test_df = my_df.loc[6, ["Number of fights", "Number of kisses"]]
# 4. Run the algorithm: instantiate the class with k=3
mk = Myknn(train_df, 3)
mk.predict(test_df)
Output:
3    love
4    love
5    love
Name: Film type, dtype: object
love
2. Sklearn implementation
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto')
- n_neighbors: the number of neighbors used by kneighbors queries by default, i.e. the k in k-NN: the k nearest points are selected (default 5).
- weights: the default 'uniform' assigns every nearest neighbor the same weight; 'distance' assigns each neighbor a weight inversely proportional to its distance from the query point; a custom weight function can also be supplied.
- algorithm: the nearest-neighbor search algorithm, chosen from {'auto', 'ball_tree', 'kd_tree', 'brute'}. The default 'auto' lets the estimator decide on a suitable search algorithm by itself; 'ball_tree' and 'kd_tree' use tree indexes, and 'brute' is brute-force search, i.e. a linear scan.
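As an illustration of these parameters (this snippet is not from the original post, and it uses the hard-coded feature array from the sketch above rather than the Excel file), the same classifier could be configured with distance weighting and an explicit KD-tree search:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[100, 5], [95, 3], [105, 31], [2, 59], [3, 60], [10, 80]]  # [fights, kisses]
y = ["action", "action", "action", "romance", "romance", "romance"]

# k=3 neighbors, closer neighbors get more weight, KD-tree search
knn = KNeighborsClassifier(n_neighbors=3, weights='distance', algorithm='kd_tree')
knn.fit(X, y)
print(knn.predict([[5, 29]]))  # ['romance']
```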
Steps:
- 1. Build the feature data and the target data;
- 2. Construct a k-nearest-neighbor classifier;
- 3. Train it with fit;
- 4. Predict the new data.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
# 1. Read the data
mv_df = pd.read_excel("Movie data.xlsx", sheet_name=0)
# 2. Build the feature data of the training set; note: a two-dimensional array
x = mv_df.loc[:5, "Number of fights":"Number of kisses"].values
# 3. Build the target data of the training set; note: a one-dimensional array
y = mv_df.loc[:5, "Film type"].values
# 4. Instantiate the KNN algorithm (classifier)
#    note: k=4 here; with k=3, as chosen earlier, the prediction on this data is the same
knn = KNeighborsClassifier(n_neighbors=4)
# 5. Train on the training-set features and targets
knn.fit(x, y)
# 6. Predict the new data: The Mermaid, with 5 fights and 29 kisses
knn.predict([[5, 29]])
Output:
['love']
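To see which training movies drove this prediction, the fitted model can also be asked for the nearest neighbors themselves. This is a small follow-up sketch, not part of the original post, assuming the mv_df and knn objects defined above:

```python
# Distances to, and row indices of, the nearest neighbors of The Mermaid
distances, indices = knn.kneighbors([[5, 29]])
print(distances)  # roughly [[30.1, 31.1, 51.2, 93.7]]
print(mv_df.iloc[indices[0]]["Film type"].values)
```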
Therefore, by training the KNN model, we conclude that The Mermaid is a romance movie.
Takeaways:
The blogger is determined to truly understand every knowledge point. Why? I once read an article in which an experienced engineer shared his learning experience, and one point was very well put: basic knowledge is like the foundation of a building; it determines how high our technology can go, and the precondition for doing anything quickly is strong fundamentals, the "inner strength" being in place. Since then I have focused on building a solid foundation, and I know that technology cannot be mastered overnight, nor in half a year, a year, or even two years, perhaps longer; but I believe that persistence is never wrong. Finally, I want to say: focus on the basics and build your own knowledge system.
OK, that is the end of this learning share. One casual question: how many of the seven movies above have you seen, and which one do you like best? Tell me in the comment section! Hmm, I cannot help wanting to go and watch one before getting back to studying.
Knowledge review:
From Concept to Method, the Great Statistical Learning Methods: Chapter 3, the k-Nearest Neighbor Method