
Summary of clustering methods

2022-06-11 17:31:00  Mark_Aussie

Clustering: dividing a data set into different classes or clusters according to some criterion (e.g., distance), so that data objects within the same cluster are as similar as possible and data objects in different clusters are as dissimilar as possible. After clustering, data of the same class should be grouped together as much as possible, and different classes should be kept apart.

The general process of clustering (a minimal end-to-end sketch follows the list):

  1. Data preparation: feature standardization and dimensionality reduction
  2. Feature selection: select the most effective features from the initial set and store them in a vector
  3. Feature extraction: transform the selected features to form new, more salient features
  4. Clustering: measure similarity with some distance function and obtain the clusters
  5. Evaluation of the clustering result: analyze the result, e.g., with the sum of squared errors (SSE)
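
A minimal end-to-end sketch of this pipeline, assuming scikit-learn is available; the Iris dataset, the PCA to two components, and k = 3 are illustrative assumptions, not part of the original post:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
X = StandardScaler().fit_transform(X)       # step 1: standardize features
X = PCA(n_components=2).fit_transform(X)    # steps 2-3: reduce to salient features
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # step 4: cluster
print(km.inertia_)                          # step 5: evaluate via SSE
```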

Similarity measures for numerical data

The Minkowski distance is the Lp norm (p ≥ 1); the Manhattan, Euclidean, and Chebyshev distances correspond to p = 1, p = 2, and p = ∞, respectively.
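
As a quick illustration, here is a minimal NumPy sketch; the sample vectors are made up:

```python
import numpy as np

def minkowski(x, y, p):
    """L_p (Minkowski) distance between two vectors, p >= 1."""
    if np.isinf(p):
        return float(np.max(np.abs(x - y)))           # Chebyshev: the limit p -> infinity
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 1))       # Manhattan distance: 7.0
print(minkowski(x, y, 2))       # Euclidean distance: 5.0
print(minkowski(x, y, np.inf))  # Chebyshev distance: 4.0
```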

Distance measures between clusters, where Ci and Cj denote two clusters (a small sketch follows the list):

  • Single-link: the distance between two clusters is the distance between their two closest points. It can produce a chain effect during clustering and may yield very large clusters.
  • Complete-link: the distance between two clusters is the distance between their two farthest points. This avoids the chain effect, but it is very sensitive to outliers and can easily produce unreasonable clusterings.
  • UPGMA: a compromise between single-link and complete-link; the distance between two clusters is the mean distance over all pairs of points between the two clusters.
  • WPGMA: a weighted average of the distances between two clusters. The weighting puts both clusters' influence on the distance computation at the same level, so the result is not dominated by cluster size; the exact formula depends on the weighting scheme adopted.
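
A minimal sketch of the first three measures, assuming NumPy and SciPy; the helper name cluster_distance and the sample clusters are hypothetical:

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distance(Ci, Cj, method="single"):
    """Distance between two clusters given as (n, d) point arrays."""
    D = cdist(Ci, Cj)            # all pairwise point-to-point distances
    if method == "single":       # nearest pair of points
        return D.min()
    if method == "complete":     # farthest pair of points
        return D.max()
    if method == "upgma":        # unweighted mean over all pairs
        return D.mean()
    raise ValueError(method)

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [6.0, 0.0]])
print(cluster_distance(Ci, Cj, "single"))    # 3.0
print(cluster_distance(Ci, Cj, "complete"))  # 6.0
print(cluster_distance(Ci, Cj, "upgma"))     # 4.5
```

In SciPy's hierarchical clustering module these correspond to the linkage methods "single", "complete", "average" (UPGMA), and "weighted" (WPGMA).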

Clustering methods

Partition clustering: the number of clusters or the cluster centers must be specified in advance, and the algorithm iterates repeatedly toward the goal that points within a cluster are close enough while points in different clusters are far enough apart. Examples: k-means, k-means++, bi-kmeans, kernel k-means. It works well on convex data.

Shortcomings of k-means: the value of k must be determined in advance, and the algorithm is sensitive to the initial centroids and to outliers.

k-means++: improves the selection of the initial centroids. The first centroid is chosen at random; each subsequent centroid is chosen with probability proportional to the squared distance from a point to its nearest already-chosen centroid, which spreads the initial centroids apart.
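
A minimal usage sketch, assuming scikit-learn, where init="k-means++" applies the seeding strategy just described (the random data and n_clusters=4 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(200, 2)
# init="k-means++" is the default seeding; init="random" gives plain k-means seeding
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
```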

bi-kmeans (bisecting k-means): addresses k-means's tendency to fall into local optima. Based on the principle of minimizing SSE, it first treats all data points as one cluster, splits that cluster in two, and then repeatedly selects an existing cluster to split again; the cluster chosen is the one whose split reduces the SSE the most. SSE (Sum of Squared Errors) is a measure of clustering quality: the sum of squared distances from each point to its cluster center after clustering. The smaller the SSE, the better the clustering.

bi-kmeans is an approach that tends toward the global optimum, so the SSE values it computes across runs are basically the same.
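
A minimal sketch of the SSE measure defined above; the helper sse and the toy data are hypothetical:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(float(np.sum((X[labels == k] - c) ** 2))
               for k, c in enumerate(centroids))

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
centroids = np.array([[1.0, 0.0], [10.0, 0.0]])
print(sse(X, labels, centroids))  # 1 + 1 + 0 = 2.0
```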

Density clustering: two parameters must be defined, the neighborhood radius (eps) and the neighborhood density threshold (MinPts).

The neighborhood queries used to build each point's eps-neighborhood can be accelerated with a kd-tree.
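
A minimal sketch of such a radius query, assuming scikit-learn's KDTree; the random data and the eps value are illustrative:

```python
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.RandomState(0).rand(1000, 2)
tree = KDTree(X)
eps = 0.05
# indices of all points within eps of X[0] -- the eps-neighborhood query
# that density clustering performs for every point
neighbors = tree.query_radius(X[:1], r=eps)[0]
print(len(neighbors))
```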

Characteristics of DBSCAN density clustering (a usage sketch follows the list):

  1. The neighborhood radius and neighborhood density threshold must be determined in advance
  2. The number of clusters does not need to be set in advance
  3. Sensitive to the choice of initial values, but insensitive to noise
  4. Clusters data of uneven density poorly
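
A minimal usage sketch, assuming scikit-learn; eps and min_samples are illustrative values, and the two-moons data is chosen only because it suits density clustering:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # label -1 marks noise points
print(set(labels))
```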

OPTICS clustering: the DBSCAN algorithm uses a single, uniform neighborhood radius. When the data density is uneven, setting a smaller radius means the point density inside sparse clusters falls below the neighborhood density threshold, so those points are treated as boundary points and are not used for further expansion; setting a larger radius means denser clusters that lie close together are easily merged into the same cluster. OPTICS addresses this problem.

Core distance: given the neighborhood radius and neighborhood density threshold, the minimum radius within which a point still reaches the required neighborhood density (i.e., the distance to its MinPts-th nearest neighbor; it is defined only for core points).

Reachability distance: given the neighborhood radius and density threshold, the reachability distance of a point o with respect to a core point p is the larger of p's core distance and the actual distance d(p, o); for points beyond the core distance but within the neighborhood radius it is simply d(p, o).
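
A minimal sketch of these two definitions; the helper names are hypothetical, and the convention for counting the min_pts-th neighbor can vary between implementations:

```python
import numpy as np

def core_distance(dists_from_p, eps, min_pts):
    """Core distance of p: the distance to its min_pts-th nearest neighbor,
    defined only if p has at least min_pts neighbors within eps."""
    within = np.sort(dists_from_p[dists_from_p <= eps])
    if len(within) < min_pts:
        return None                       # p is not a core point
    return float(within[min_pts - 1])

def reachability_distance(dist_p_to_o, core_dist_p):
    """Reachability of o w.r.t. core point p: max(core distance, d(p, o))."""
    return None if core_dist_p is None else max(core_dist_p, dist_p_to_o)
```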

insert_list() algorithm flow: (the pseudocode figure from the original post is omitted here)

What the algorithm ultimately produces is an output ordering: a sequence in which points of similar density are grouped together according to their density, rather than an explicit cluster label for each point. To obtain each point's cluster, the parameters must be set once more and the concrete clusters extracted from the ordering (a sketch follows).
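
A minimal sketch using scikit-learn's OPTICS, whose attributes expose exactly this ordering-plus-reachability output; the blob data and the eps=0.8 cut in the extraction step are illustrative assumptions:

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
opt = OPTICS(min_samples=5).fit(X)
# The primary output is an ordering of the points plus their reachability
# distances, not cluster labels:
order, reach = opt.ordering_, opt.reachability_
# Flat clusters are extracted afterwards, here by cutting at a chosen eps:
labels = cluster_optics_dbscan(reachability=opt.reachability_,
                               core_distances=opt.core_distances_,
                               ordering=opt.ordering_, eps=0.8)
```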

 

Hierarchical clustering: divides the data set into clusters layer by layer, each layer building on the results of the previous one.

The algorithms above can achieve good results at low complexity, but they suffer from the chain effect: if A is similar to B and B is similar to C, clustering will group A, B, and C together; yet if A and C are dissimilar, this produces a clustering error, and the error can propagate. Hierarchical clustering can mitigate the chain-effect problem.

  • Agglomerative hierarchical clustering: bottom-up hierarchical clustering. Every object starts as its own cluster; at each step the two closest clusters (by some criterion) are merged into a new cluster, until all objects belong to a single cluster.
  • Divisive hierarchical clustering: top-down hierarchical clustering. Initially all objects belong to one cluster; at each step a cluster is split into several clusters according to some criterion, until every object is its own cluster.

Hierarchical clustering is a greedy algorithm: every merge or split is based on a locally optimal choice.
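
A minimal agglomerative (bottom-up) sketch, assuming SciPy; the two-blob data and the average (UPGMA) linkage are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 5])  # two separated blobs
Z = linkage(X, method="average")                 # bottom-up merging, UPGMA criterion
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the merge tree into 2 clusters
print(labels)
```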

Comparison of clustering methods

 

References:

Common clustering algorithms - Zhihu

Copyright notice: this article was written by Mark_Aussie. Please include a link to the original when reposting: https://yzsam.com/2022/162/202206111724555863.html