CV learning notes - clustering
2022-07-03 10:07:00 【Moresweet cat】
Image clustering
1. Overview
1. Classification and clustering
Classification: classification solves the problem of mining patterns from a specific data set and then using those patterns to make judgments.
The main steps of classification learning are:
(1) Given a training data set in which every sample carries a label, the label determines whether a sample belongs to the positive data set (samples of the target class) or the negative data set (samples to be rejected). For example, if we want to classify whether a fruit is a grape, then the data set of grape samples is the positive data set and the data set of non-grape samples is the negative data set.
(2) Build a model and train it on the training data set.
(3) Use the trained model to make predictions on the test data set and evaluate the performance of the results.
Clustering: a cluster is a collection of data instances in which the elements of the same cluster are similar to each other, while elements of different clusters differ from each other. As the name suggests, clustering gathers data together and groups members that are similar in some respect. The attributes of the samples being clustered fall into two kinds: ordered attributes (for example, concentration values such as 0.45 and 0.89) and unordered attributes (for example, "good melon" vs. "bad melon").
Because clustering uses no category or grouping information, that is, the data are unlabeled, clustering is usually classified as unsupervised learning (Unsupervised Learning).
Differences between classification and clustering tasks:
- The result of clustering is that similar data are gathered into one region according to the similarity of the data; what the members of a particular cluster actually have in common must be judged by a person.
- Classification learning builds a model from a data set with known categories and then uses that model to predict the categories of new data.
- The data set used for classification is labeled, while the data set used for clustering is unlabeled; clustering only separates the data, and the specific similarity within each group still needs to be judged.
Common clustering algorithms :
- Prototype clustering (a typical example: the K-Means clustering algorithm)
- Hierarchical clustering
- Density clustering
2. Prototype Clustering Algorithm
1. Overview
Prototype clustering algorithms assume that the clustering structure can be characterized by a set of prototypes (representative points in the sample space). Algorithms of this kind are very common in real tasks.
2.K-Means clustering
Brief introduction: K-Means clustering is the most commonly used clustering algorithm. It originated in signal processing, and its goal is to divide the data points into K clusters. The biggest advantages of the algorithm are that it is simple, easy to understand, and fast; its disadvantage is that the number of clusters must be specified before clustering.
The k-means algorithm is a prototype clustering algorithm.
K-Means Clustering algorithm flow
- Determine K (the number of clusters).
- Randomly select K data points from the data set as the centroids (Centroid), i.e. the data centers.
- Compute the distance from every point to each centroid, and assign each point to the group of the centroid it is closest to.
- Once every centroid has gathered some points, recompute the centroids: for each cluster, compute the mean of its points, which gives K new centroid points.
- Repeat steps 3 and 4 iteratively until the termination condition is met (the clustering result no longer changes, i.e. the result has converged).
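In practice these steps are usually not coded by hand. Below is a minimal sketch using scikit-learn's `KMeans` (scikit-learn is an assumed dependency and the random data is only a placeholder); a step-by-step hand-worked example follows.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 2)              # placeholder data: 100 points in 2-D
kmeans = KMeans(n_clusters=2, n_init=10)   # K must be chosen before clustering
labels = kmeans.fit_predict(data)          # cluster index assigned to every sample
centers = kmeans.cluster_centers_          # the K centroids after convergence
```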
A worked example of the algorithm:
Data set: the six points used in this example are P1(0,0), P2(1,2), P3(3,1), P4(8,8), P5(9,10), P6(10,7).
The first round: set K to 2, i.e. the data set is to be divided into two classes, and randomly select two points as centroids; here P1 and P2 are chosen. Then compute the distance from every point to each centroid. For example, the distance from P3 to P1 is $\sqrt{10} = 3.16$ and the distance from P3 to P2 is $\sqrt{(3-1)^2+(1-2)^2}=2.24$; P3 is closer to P2, so P3 joins the cluster of P2. The other points are handled in the same way.
The result of the first round is:
Group 1: P1    Group 2: P2, P3, P4, P5, P6
The second round: since Group 1 contains only one point, no processing is needed and P1 is taken directly as its centroid. For Group 2, take the means of the x and y coordinates of all points in Group 2 to construct a new point, which does not exist in the data set, as the new centroid (it is not assigned to any group, it only serves as a centroid): $Q(\frac{1+3+8+9+10}{5}, \frac{2+1+8+10+7}{5})$, that is Q(6.2, 5.6). With Q(6.2, 5.6) and P1(0,0) as the new centroids, recompute the distance from every point to each centroid and redo the assignment.
The result of this round is:
Group 1: P1, P2, P3    Group 2: P4, P5, P6
The third round: compute in the same way as above (process omitted); the new centroids are R(1.33, 1) and T(9, 8.33).
After the third round the result is:
Group 1: P1, P2, P3    Group 2: P4, P5, P6
The result is the same as that of the previous round, which shows that the result has converged, so the clustering algorithm ends.
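The hand calculation above can be reproduced with a short from-scratch sketch. The coordinates below are the six points of the example, and the first two points serve as the initial centroids, as in the first round.

```python
import numpy as np

# P1..P6 from the worked example; P1 and P2 are the initial centroids
points = np.array([[0, 0], [1, 2], [3, 1], [8, 8], [9, 10], [10, 7]], dtype=float)
centroids = points[:2].copy()

for _ in range(100):
    # distance from every point to every centroid, shape (6, 2)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                     # nearest-centroid assignment
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):         # result no longer changes
        break
    centroids = new_centroids

print(labels)     # [0 0 0 1 1 1] -> {P1, P2, P3} and {P4, P5, P6}
print(centroids)  # approximately [[1.33 1.  ] [9.   8.33]], matching R and T above
```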
Advantages of the K-Means clustering algorithm:
- It is a classic algorithm for solving clustering problems, and it is simple and fast
- It remains efficient when processing large data sets
- It works well when the resulting clusters are dense
Disadvantages of the K-Means clustering algorithm:
- k (the number of clusters to generate) must be given in advance
- It is sensitive to noise and outlier data
Application examples of the K-Means algorithm:
- Image segmentation
- Image clustering
- Image recognition
Through K-Means these pixels can be clustered into K clusters, and then the centroid of each cluster is used to replace all the pixels in that cluster. In this way the colors of the image are quantized and compressed without changing the resolution, achieving color-level segmentation of the image.
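A sketch of this idea with OpenCV's `cv2.kmeans` is shown below (the file name `input.jpg` and the choice of K = 8 colors are assumptions for illustration):

```python
import cv2
import numpy as np

img = cv2.imread('input.jpg')                          # BGR image
pixels = img.reshape(-1, 3).astype(np.float32)         # every pixel is a 3-D sample

K = 8                                                  # number of colors to keep
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
_, labels, centers = cv2.kmeans(pixels, K, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)

# replace every pixel by the centroid of its cluster: same resolution, only K colors
quantized = centers[labels.flatten()].astype(np.uint8).reshape(img.shape)
cv2.imwrite('quantized.jpg', quantized)
```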
3. Hierarchical clustering
1. Method & Category
Method: first compute the distances between all samples, and at each step merge the closest points into the same class. The process is similar to the construction of a Huffman tree: after classes are formed, the distances between classes are computed and the nearest classes are merged into a larger class. This repeats until everything has been merged into one large class.
Ways to compute the distance between classes: the shortest-distance method, the longest-distance method, the middle-distance method, the class-average method, etc.
Types: according to the order of hierarchical decomposition, hierarchical clustering algorithms are divided into bottom-up and top-down; they are also called agglomerative hierarchical clustering and divisive hierarchical clustering, respectively.
2. The process of agglomerative hierarchical clustering algorithm
The strategy of agglomerative hierarchical clustering is to treat each object as its own cluster at the start, and then merge these clusters into larger and larger clusters until all objects are in a single cluster or some termination condition is met. Most hierarchical clustering algorithms are agglomerative; they differ only in how similarity between clusters is defined.
- Treat each object as a class, and compute the minimum distances between objects;
- Merge the two classes with the smallest distance into a new class;
- Recompute the distance between the new class and all other classes;
- Repeat steps 2 and 3 until all classes are merged into one large class.
The merge process shown in the figure above, written as a matrix, is:
$$
\left[ \begin{matrix}
\text{merged class A} & \text{merged class B} & \text{minimum distance between classes} & \text{number of elements in the new class} \\
0 & 3 & 0 & 2 \\
4 & 5 & 1.15470054 & 3 \\
1 & 2 & 2.23606798 & 2 \\
6 & 7 & 4.00832467 & 5
\end{matrix} \right]
$$
Note: the new class formed by merging 0 and 3 is numbered 5; the original classes are numbered 0-4, and later new classes follow the same rule.
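A merge matrix of exactly this form is produced by SciPy's `linkage` function, so the process can be sketched as follows (the small data array `X` is illustrative; `method='single'` corresponds to the shortest-distance rule):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]], dtype=float)
Z = linkage(X, method='single')   # 'single' = merge by the shortest distance
print(Z)  # each row: [class A, class B, distance, size of the new class];
          # new classes are numbered from n upward, as noted above
```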
Tree view (dendrogram) classification judgment: in the clustering process above, the merged categories form different combinations at different stages, so how is the final set of categories decided? In real tasks the data set often needs to be divided into a given number of classes.
To divide the data into two classes, cut the tree at a height where the cut crosses two vertical lines from top to bottom; everything connected below each of those vertical lines forms one class.
To divide the data into three classes, cut where the cut crosses three vertical lines; everything connected below each of those vertical lines forms one class.
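SciPy's `fcluster` performs this cut for a chosen number of classes, and `dendrogram` draws the tree view; a small sketch (matplotlib is an assumed dependency, and the data is the same illustrative array as above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]], dtype=float)
Z = linkage(X, method='single')                    # same kind of merge matrix as above

labels_2 = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 classes
labels_3 = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 classes

dendrogram(Z)   # the tree view: a cut crossing k vertical lines yields k classes
plt.show()
```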
3. Characteristics of agglomerative hierarchical clustering
- Unlike K-Means, there is no need to choose initial points and there is no local-minimum problem (K-Means may converge to a local minimum); on the other hand, agglomerative hierarchical clustering has no global objective function of the kind K-Means optimizes.
- A merge operation cannot be undone: once two clusters are merged, they stay merged.
- The computation and storage costs are high.
4. Advantages and disadvantages of hierarchical clustering
Advantages:
- The distance and similarity rules are easy to define, and there are few restrictions;
- There is no need to set the number of clusters in advance;
- The hierarchical relationships between classes can be discovered;
- Clusters of other shapes can be found
Disadvantages:
- The computational complexity is too high;
- Singular (outlier) values can have a large impact;
- The algorithm is likely to cluster into chains
4. Density clustering
1. Algorithm description
Density clustering (DBSCAN)
Two parameters are required: ε (eps) and the minimum number of points required to form a high-density region (minPts).
- The algorithm starts from an arbitrary point that has not yet been visited and explores the ε-neighborhood of that point. If the ε-neighborhood contains enough points, a new cluster is created; otherwise the point is labeled as noise.
- Note that such a noise point may later fall inside the ε-neighborhood of another point, and that neighborhood may contain enough points; in that case the point is added to that cluster.
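A minimal sketch with scikit-learn's `DBSCAN` follows; `eps` and `min_samples` correspond to the ε and minPts parameters above, and the values chosen here are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                  # placeholder data
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
labels = db.labels_                         # cluster index per sample; -1 marks noise
```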
2. Advantages and disadvantages of density clustering
Advantages:
- Insensitive to noise;
- Clusters of arbitrary shape can be found
Disadvantages:
- The clustering result depends strongly on the parameters;
- Clusters are identified with a fixed set of parameters, but when clusters differ in density, the same criteria may destroy the natural cluster structure; that is, sparser clusters may be split into several clusters, while denser clusters that lie close to each other may be merged into one cluster.
5. Spectral clustering
1. Algorithm flow description
- Build a graph structure (Graph) from the data. Each node of the Graph corresponds to one data point, similar points are connected by edges, and the weight of an edge represents the similarity between the data points. Store this Graph in the form of an adjacency matrix, denoted W.
- Sum the elements of each column of W to obtain N numbers, and place them on the diagonal of an otherwise zero N x N matrix, denoted D. Let L = D - W.
- Compute the first k eigenvalues of L and their corresponding eigenvectors.
- Arrange these k (column) eigenvectors side by side to form an N x k matrix, and regard each row as a vector in k-dimensional space; cluster these rows with the K-means algorithm. The class assigned to each row in the clustering result is the class of the corresponding node of the original Graph, i.e. of the corresponding one of the original N data points.
A simple abstraction of the spectral clustering process has two main steps:
- Graph construction: build the sampled data points into a network graph.
- Graph cutting: cut the graph built in the first step into different subgraphs according to a certain cutting criterion; these subgraphs correspond to the final clustering result.
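A from-scratch sketch of this flow is shown below. It follows the steps above with the unnormalized Laplacian L = D - W; the Gaussian similarity used to build W (with an assumed parameter `sigma`) is one common choice, not the only one:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Step 1: similarity (adjacency) matrix W from pairwise squared distances
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)

    # Step 2: degree matrix D (column sums of W on the diagonal) and L = D - W
    D = np.diag(W.sum(axis=0))
    L = D - W

    # Step 3: the k eigenvectors belonging to the smallest eigenvalues of L
    _, eigvecs = np.linalg.eigh(L)      # eigh returns eigenvalues in ascending order
    U = eigvecs[:, :k]                  # N x k matrix, one row per data point

    # Step 4: K-Means on the rows of U; the row labels are the labels of the points
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

labels = spectral_clustering(np.random.rand(50, 2), k=2)
```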
Personal study notes, for learning and exchange only. Please indicate the source when reprinting!