[Data Clustering] Section 3 of Chapter 4: DBSCAN Performance Analysis, Advantages and Disadvantages, and Parameter Selection Methods
2022-07-04 12:35:00 【Happy Jianghu】
Seven: Performance Analysis
The DBSCAN algorithm retrieves all points in the neighborhood of every point in the data set, so its time complexity is $O(n^{2})$. In low-dimensional space, however, index structures such as $k$-d trees and R-trees can efficiently retrieve all points within a given distance of a specific point, which reduces the time complexity to $O(n \log n)$.
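As an illustration of this speed-up (a minimal sketch with assumed random data, not from the original article), scikit-learn's NearestNeighbors can back the eps-neighborhood queries that DBSCAN needs with a k-d tree; scikit-learn's DBSCAN itself exposes the same choice through its `algorithm` parameter.

```python
# Sketch: radius (eps-neighborhood) queries backed by a k-d tree.
import numpy as np
from sklearn.neighbors import NearestNeighbors

data = np.random.rand(1000, 2)                     # assumed low-dimensional sample data
nn = NearestNeighbors(radius=0.05, algorithm='kd_tree').fit(data)

# All points within radius 0.05 of every query point, without scanning
# all n points per query as the brute-force O(n^2) approach would.
neighborhoods = nn.radius_neighbors(data, return_distance=False)
print(len(neighborhoods[0]))                       # neighborhood size of point 0
```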
Eight: Advantages and Disadvantages
(1) Advantages
①: It can cluster dense data sets of arbitrary shape
②: It can find outliers while clustering and is insensitive to the outliers in the data set
③: There is no need to specify the number of clusters in advance, and repeated runs usually give the same result
(2) Disadvantages
①: When the density of the sample set is not uniform and the distances between clusters differ greatly, the clustering quality is poor, so DBSCAN is generally not suitable in this case
②: Parameter tuning is more complicated than for K-Means, and different parameter combinations have a great impact on the clustering result
③: The traditional Euclidean distance cannot handle high-dimensional data well, so it is difficult to give a good definition of distance
④: The DBSCAN algorithm is suited to cases where the densities of the different clusters are relatively uniform; when the densities of different clusters vary greatly, problems appear (see the sketch after this list)
- As the clustering plot shown in the previous section suggests, this algorithm easily marks some data points as noise
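To make disadvantage ④ concrete, here is a small sketch with synthetic data invented for illustration (not from the original article): two Gaussian blobs with very different densities are clustered with a single eps tuned to the dense blob, so much of the sparse blob tends to end up as noise or gets fragmented.

```python
# Sketch: a single eps cannot serve clusters of very different densities.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(200, 2))   # tight cluster
sparse = rng.normal(loc=[5.0, 5.0], scale=1.5, size=(200, 2))  # loose cluster
data = np.vstack([dense, sparse])

# eps chosen for the dense cluster; the sparse cluster is largely lost
labels = DBSCAN(eps=0.3, min_samples=4).fit_predict(data)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "| noise points:", int(np.sum(labels == -1)))
```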
Nine: Parameter Selection
(1) Basic principles for adjusting parameters
Once the visualization is done, we can fine-tune the parameters according to the clustering results:
- min_pts: generally take dimension + 1 or dimension × 2
- eps: if the clustering produces too few clusters, decrease eps; if it produces too many clusters, increase eps (see the sketch below)
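As a rough illustration of this rule of thumb (a sketch on assumed toy data, not part of the original text), one can run DBSCAN over a small grid of eps values and watch how the number of clusters reacts:

```python
# Sketch: the number of clusters shrinks as eps grows.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

data, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
min_pts = data.shape[1] + 1                     # dimension + 1 = 3

for eps in (0.05, 0.1, 0.2, 0.4):
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(data)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```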
(2) Adjusting parameters with the K-dist plot
Parameter adjustment method: the parameters can be determined by observing the $k$-th nearest-neighbor distance (the $k$-distance). For a point inside a cluster, as long as $k$ does not exceed the number of samples in the cluster, its $k$-distance will be small; for a noise point, the $k$-distance will be large. Therefore: for a given positive integer $k$, compute the $k$-distance of every data point, sort the values in ascending order, and draw the $k$-distance plot; a suitable eps is found at the inflection point, because boundary points and noise points tend to appear around the inflection point.
- $k$-distance: for a given data set $P=\{p_{1},p_{2},\dots,p_{n}\}$, compute the distances from a point $p_{i}$ to all points in a subset $S$ of the data set and sort them in ascending order; the $k$-th of these distances, $d(k)$, is called the $k$-distance of $p_{i}$
The specific steps are as follows:

- Determine the value of K, usually 2 × dimension − 1
- Draw the K-dist plot
- The position of the inflection point (where the curve changes most sharply) gives eps (several candidate inflection points can be tried)
- min_pts can be taken as K + 1
import numpy as np
import matplotlib.pyplot as plt

def select_minpts(data_set, k):
    # k-distance of every point: distance to its k-th nearest neighbor
    k_dist = []
    for i in range(data_set.shape[0]):
        # Euclidean distances from point i to every point in the data set
        dist = (((data_set[i] - data_set) ** 2).sum(axis=1) ** 0.5)
        dist.sort()                # ascending; dist[0] is the distance to itself (0)
        k_dist.append(dist[k])     # k-th nearest-neighbor distance
    return np.array(k_dist)

k = 3                                    # 2 * dimension - 1 for the 2-D data
k_dist = select_minpts(train_data, k)    # train_data: the sample set from the previous section
k_dist.sort()                            # sort ascending
plt.plot(np.arange(k_dist.shape[0]), k_dist[::-1])   # plot in descending order
plt.show()
As shown in the resulting K-dist plot, the suggested value of eps is 0.0581.
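To close the loop, the values read off the K-dist plot can be fed back into the clustering step. The following is a minimal sketch that reuses `train_data` and `k` from the code above and uses scikit-learn's DBSCAN as a stand-in (the original article may use its own implementation):

```python
# Sketch: cluster with the parameters suggested by the K-dist plot.
from sklearn.cluster import DBSCAN

eps = 0.0581                 # read off the inflection point of the K-dist plot
min_pts = k + 1              # k = 3 above, so min_pts = 4

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(train_data)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```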
Ten: Comparison of the DBSCAN and K-Means Algorithms
| | DBSCAN | K-Means |
|---|---|---|
| Algorithmic idea | Density-based | Partition-based |
| Required parameters | Eps and MinPts | Number of clusters $k$ |
| Data assumptions | Makes no assumptions about the data distribution | Assumes all clusters come from spherical Gaussian distributions |
| High-dimensional data | Not suitable | Suitable (sparse) |
| Formalization | Cannot be formalized | Can be formalized as an optimization problem that minimizes the sum of squared errors from each point to its centroid |
| Stability | Stable | Unstable |
| Time complexity | $O(n^{2})$ | $O(n)$ |