
Summary of clustering methods

2022-06-11 17:31:00  Mark_Aussie

Clustering: dividing a data set into different classes or clusters according to some criterion (e.g., distance), so that data objects within the same cluster are as similar as possible and data objects in different clusters are as dissimilar as possible. After clustering, data of the same class should be grouped together as much as possible, and different classes should be kept apart.

The general process of clustering (a minimal end-to-end sketch follows the list):

  1. Data preparation: feature standardization and dimensionality reduction
  2. Feature selection: select the most effective features from the initial set and store them in a vector
  3. Feature extraction: transform the selected features to form new, more salient features
  4. Clustering: measure similarity with some distance function and obtain the clusters
  5. Evaluation of the clustering result: analyze the result, e.g., with the sum of squared errors (SSE)
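
A minimal end-to-end sketch of this pipeline, assuming scikit-learn is available; the Iris dataset, the PCA to two components, and k = 3 are illustrative assumptions, not part of the original post:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
X = StandardScaler().fit_transform(X)       # step 1: standardize features
X = PCA(n_components=2).fit_transform(X)    # steps 2-3: reduce to salient features
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # step 4: cluster
print(km.inertia_)                          # step 5: evaluate via SSE
```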

Similarity measures for numerical data

The Minkowski distance is the Lp norm (p ≥ 1); the Manhattan, Euclidean, and Chebyshev distances correspond to p = 1, p = 2, and p = ∞, respectively.
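
As a quick illustration, here is a minimal NumPy sketch; the sample vectors are made up:

```python
import numpy as np

def minkowski(x, y, p):
    """L_p (Minkowski) distance between two vectors, p >= 1."""
    if np.isinf(p):
        return float(np.max(np.abs(x - y)))           # Chebyshev: the limit p -> infinity
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))       # Manhattan distance: 7.0
print(minkowski(x, y, 2))       # Euclidean distance: 5.0
print(minkowski(x, y, np.inf))  # Chebyshev distance: 4.0
```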

Distance measures between clusters, where Ci and Cj denote two clusters (a small sketch follows the list):

  • Single-link: the distance between two clusters is the distance between their two closest points. It can produce a chain effect during clustering and may yield very large clusters.
  • Complete-link: the distance between two clusters is the distance between their two farthest points. This avoids the chain effect, but it is very sensitive to outliers and can easily produce unreasonable clusterings.
  • UPGMA: a compromise between single-link and complete-link; the distance between two clusters is the mean distance over all pairs of points between the two clusters.
  • WPGMA: a weighted average of the distances between two clusters. The weighting puts both clusters' influence on the distance computation at the same level, so the result is not dominated by cluster size; the exact formula depends on the weighting scheme adopted.
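
A minimal sketch of the first three measures, assuming NumPy and SciPy; the helper name cluster_distance and the sample clusters are hypothetical:

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distance(Ci, Cj, method="single"):
    """Distance between two clusters given as (n, d) point arrays."""
    D = cdist(Ci, Cj)            # all pairwise point-to-point distances
    if method == "single":       # nearest pair of points
        return D.min()
    if method == "complete":     # farthest pair of points
        return D.max()
    if method == "upgma":        # unweighted mean over all pairs
        return D.mean()
    raise ValueError(method)

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [6.0, 0.0]])
print(cluster_distance(Ci, Cj, "single"))    # 3.0
print(cluster_distance(Ci, Cj, "complete"))  # 6.0
print(cluster_distance(Ci, Cj, "upgma"))     # 4.5
```

In SciPy's hierarchical clustering module these correspond to the linkage methods "single", "complete", "average" (UPGMA), and "weighted" (WPGMA).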

Clustering methods

Partition clustering: the number of clusters or the cluster centers must be specified in advance, and the algorithm iterates repeatedly toward the goal that points within a cluster are close enough while points in different clusters are far enough apart. Examples: k-means, k-means++, bi-kmeans, kernel k-means. It works well on convex data.

Shortcomings of k-means: the value of k must be determined in advance, and the algorithm is sensitive to the initial centroids and to outliers.

k-means++: improves the selection of the initial centroids. The first centroid is chosen at random; each subsequent centroid is chosen with probability proportional to the squared distance from a point to its nearest already-chosen centroid, which spreads the initial centroids apart.
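
A minimal usage sketch, assuming scikit-learn, where init="k-means++" applies the seeding strategy just described (the random data and n_clusters=4 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 2)
# init="k-means++" is the default seeding; init="random" gives plain k-means seeding
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
```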

bi-kmeans (bisecting k-means): addresses k-means's tendency to fall into local optima. Based on the principle of minimizing SSE, it first treats all data points as one cluster, splits that cluster in two, and then repeatedly selects an existing cluster to split again; the cluster chosen is the one whose split reduces the SSE the most. SSE (Sum of Squared Errors) is a measure of clustering quality: the sum of squared distances from each point to its cluster center after clustering. The smaller the SSE, the better the clustering.

bi-kmeans is an approach that tends toward the global optimum, so the SSE values it computes across runs are basically the same.
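
A minimal sketch of the SSE measure defined above; the helper sse and the toy data are hypothetical:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(float(np.sum((X[labels == k] - c) ** 2))
               for k, c in enumerate(centroids))

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[1.0, 0.0], [10.0, 0.0]])
print(sse(X, labels, centroids))  # 1 + 1 + 0 = 2.0
```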

Density clustering: two parameters must be defined, the neighborhood radius (eps) and the neighborhood density threshold (MinPts).

The neighborhood queries used to build each point's eps-neighborhood can be accelerated with a kd-tree.
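
A minimal sketch of such a radius query, assuming scikit-learn's KDTree; the random data and the eps value are illustrative:

```python
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.RandomState(0).rand(1000, 2)
tree = KDTree(X)
eps = 0.05
# indices of all points within eps of X[0] -- the eps-neighborhood query
# that density clustering performs for every point
neighbors = tree.query_radius(X[:1], r=eps)[0]
print(len(neighbors))
```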

Characteristics of DBSCAN density clustering (a usage sketch follows the list):

  1. The neighborhood radius and neighborhood density threshold must be determined in advance
  2. The number of clusters does not need to be set in advance
  3. Sensitive to the choice of initial values, but insensitive to noise
  4. Clusters data of uneven density poorly
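
A minimal usage sketch, assuming scikit-learn; eps and min_samples are illustrative values, and the two-moons data is chosen only because it suits density clustering:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # label -1 marks noise points
print(set(labels))
```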

OPTICS clustering: the DBSCAN algorithm uses a single, uniform neighborhood radius. When the data density is uneven, setting a smaller radius means the point density inside sparse clusters falls below the neighborhood density threshold, so those points are treated as boundary points and are not used for further expansion; setting a larger radius means denser clusters that lie close together are easily merged into the same cluster. OPTICS addresses this problem.

Core distance: given the neighborhood radius and neighborhood density threshold, the minimum radius within which a point still reaches the required neighborhood density (i.e., the distance to its MinPts-th nearest neighbor; it is defined only for core points).

Reachability distance: given the neighborhood radius and density threshold, the reachability distance of a point o with respect to a core point p is the larger of p's core distance and the actual distance d(p, o); for points beyond the core distance but within the neighborhood radius it is simply d(p, o).
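
A minimal sketch of these two definitions; the helper names are hypothetical, and the convention for counting the min_pts-th neighbor can vary between implementations:

```python
import numpy as np

def core_distance(dists_from_p, eps, min_pts):
    """Core distance of p: the distance to its min_pts-th nearest neighbor,
    defined only if p has at least min_pts neighbors within eps."""
    within = np.sort(dists_from_p[dists_from_p <= eps])
    if len(within) < min_pts:
        return None                       # p is not a core point
    return float(within[min_pts - 1])

def reachability_distance(dist_p_to_o, core_dist_p):
    """Reachability of o w.r.t. core point p: max(core distance, d(p, o))."""
    return None if core_dist_p is None else max(core_dist_p, dist_p_to_o)
```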

insert_list() algorithm flow: (the pseudocode figure from the original post is omitted here)

What the algorithm ultimately produces is an output ordering: a sequence in which points of similar density are grouped together according to their density, rather than an explicit cluster label for each point. To obtain each point's cluster, the parameters must be set once more and the concrete clusters extracted from the ordering (a sketch follows).
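
A minimal sketch using scikit-learn's OPTICS, whose attributes expose exactly this ordering-plus-reachability output; the blob data and the eps=0.8 cut in the extraction step are illustrative assumptions:

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
opt = OPTICS(min_samples=5).fit(X)
# The primary output is an ordering of the points plus their reachability
# distances, not cluster labels:
order, reach = opt.ordering_, opt.reachability_
# Flat clusters are extracted afterwards, here by cutting at a chosen eps:
labels = cluster_optics_dbscan(reachability=opt.reachability_,
                               core_distances=opt.core_distances_,
                               ordering=opt.ordering_, eps=0.8)
```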

 

Hierarchical clustering: divides the data set into clusters layer by layer, each layer building on the results of the previous one.

The algorithms above can achieve good results at low complexity, but they suffer from the chain effect: if A is similar to B and B is similar to C, clustering will group A, B, and C together; yet if A and C are dissimilar, this produces a clustering error, and the error can propagate. Hierarchical clustering can mitigate the chain-effect problem.

  • Agglomerative hierarchical clustering: bottom-up hierarchical clustering. Every object starts as its own cluster; at each step the two closest clusters (by some criterion) are merged into a new cluster, until all objects belong to a single cluster.
  • Divisive hierarchical clustering: top-down hierarchical clustering. Initially all objects belong to one cluster; at each step a cluster is split into several clusters according to some criterion, until every object is its own cluster.

Hierarchical clustering is a greedy algorithm: every merge or split is based on a locally optimal choice.
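
A minimal agglomerative (bottom-up) sketch, assuming SciPy; the two-blob data and the average (UPGMA) linkage are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 5])  # two separated blobs
Z = linkage(X, method="average")                 # bottom-up merging, UPGMA criterion
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the merge tree into 2 clusters
print(labels)
```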

Comparison of clustering methods

 

References:

Common clustering algorithms - Zhihu

Copyright notice: this article was written by Mark_Aussie. Please include a link to the original when reposting: https://yzsam.com/2022/162/202206111724555863.html