当前位置:网站首页>【数据聚类】第四章第一节3:DBSCAN性能分析、优缺点和参数选择方法
【数据聚类】第四章第一节3:DBSCAN性能分析、优缺点和参数选择方法
2022-07-04 12:32:00 【快乐江湖】
七:性能分析
DBSCAN算法对数据集中的每个点都要检索其邻域内的所有点,时间复杂度为 O ( n 2 ) O(n^{2}) O(n2)。但在低纬空间中,若采用 k k k- d d d树、 R R R树等结构,可以有效地检索特定点给定距离内的所有点,其时间复杂度可以降低到 O ( n l o g n ) O(nlogn) O(nlogn)
八:优缺点
(1)优点
①:可以对任意形状的稠密数据集进行聚类
②:可以在聚类的同时发现异常点,对数据集中的异常点不敏感
③:不需要指定簇数,并且多次实验结果往往是相同的
(2)缺点
①:如果样本集的密度不均匀、聚类间距差相差很大时,聚类质量较差,这时用DBSCAN聚类一般不适合
②:调参相较于K-Means复杂一点,不同参数组合对聚类效果有很大影响
③:由于传统的欧式距离不能很好处理高纬数据,所以对于距离定义较难给出好的解决方案
④:DBSCAN算法适合处理不同簇的密度相对比较均匀的情况,当不同簇的密度变化很大时会出现一些问题
- 从上一节所展示的聚类效果图中,大家可能会发现该算法很容易把一些数据点标记为噪声
九:参数选择
(1)修改参数的基本原则
当可视化完成后,我们可以根据其聚类的结果来微调参数
min_pts
:一般选择维度+1或者维度×2即可eps
:如果所分的类少了,那就把eps调小一点;如果所分的类多了,那就把eps调大一点
(2)根据K-dist
图调参
调参方法:可以用观察第 k k k个最近邻距离( k k k距离)的方法来确定参数。对于簇中的某个点,如果 k k k不大于簇的样本个数,则 k k k距离将会很小,然后对于噪声, k k k距离将会很大。因此:对于任意一个正整数 k k k,计算所有数据点的 k k k距离,然后以递增方式排序,然后绘制 k k k-距离图,在拐点位置对应的就是合适的eps
, 因为在拐点位置往往遇到了边界点或者噪声点
- k k k距离:对于给定的数据集 P P P= { p 1 , p 2 , . . . , p n } \{p_{1}, p_{2} , ... , p_{n}\} { p1,p2,...,pn},计算点 p ( i ) p(i) p(i)到数据集 D D D的子集 S S S中所有点之间的距离,距离按从小到大的顺序排序,则 d ( k ) d(k) d(k)就称之为 k k k距离
具体步骤如下
- 确定K值,一般取2*维度-1
- 绘制
K-dist
图 - 拐点位置(波动最大的地方)为
eps
(可以取多个拐点尝试) min_pts
可以取K+1
k = 3 # 取2*2-1
k_dist = select_minpts(train_data, k) # K距离图
k_dist.sort() # 排序(默认升序)
plt.plot(np.arange(k_dist.shape[0]), k_dist[::-1]) # 绘图时从大到小
plt.show()
def select_minpts(data_set, k):
k_dist = []
for i in range(data_set.shape[0]):
dist = (((data_set[i] - data_set)**2).sum(axis=1)**0.5)
dist.sort()
k_dist.append(dist[k])
return np.array(k_dist)
如下eps
建议取0.0581
十:DBSCAN算法和K-Means算法比较
DBSCAN | K-Means | |
---|---|---|
算法思想 | 基于密度 | 基于划分 |
所需参数 | Eps 和MinPts | 簇个数 k k k |
数据要求 | 对数据分布不做任何假设 | 假设所有簇来自球形的高斯分布 |
高维数据 | 不适合 | 适合(稀疏的) |
形式化 | 不能 | 能够形式化为最小化每个点到质心的误差平方和的优化问题 |
稳定性 | 稳定 | 不稳定 |
时间复杂度 | O ( n 2 ) O(n^{2}) O(n2) | O ( n ) O(n) O(n) |
边栏推荐
- Btrace tells you how to debug online without restarting the JVM
- 2021-10-20
- LxC shared directory permission configuration
- template<typename MAP, typename LIST, typename First, typename ... Keytypes > recursive call with indefinite parameters - beauty of Pan China
- Entitas learning [iv] other common knowledge points
- Haproxy cluster
- Guava ImmutableSet. Builder source code analysis, shift original code, complement code, reverse code review
- Source code analysis of the implementation mechanism of multisets in guava class library
- C language compilation process
- LVS load balancing cluster deployment - Dr direct routing mode
猜你喜欢
Games101 Lesson 8 shading 2 Notes
Practical dry goods: deploy mini version message queue based on redis6.0
R语言--readr包读写数据
CSDN documentation specification
[solve the error of this pointing in the applet] SetData of undefined
French Data Protection Agency: using Google Analytics or violating gdpr
2021 annual summary - it seems that I have done everything except studying hard
Realize cross tenant Vnet connection through azure virtual Wan
DVC use case (VI): Data Registry
The database connection code determines whether the account password is correct, but the correct account password always jumps to the failure page with wrong account password
随机推荐
Possible to restore a backup of SQL Server 2014 on SQL Server 2012?
. Does net 4 have a built-in JSON serializer / deserializer- Does . NET 4 have a built-in JSON serializer/deserializer?
Function parameters (positional parameters, default value parameters, variable parameters, named keyword parameters, keyword parameters)
(August 9, 2021) example exercise of air quality index calculation (I)
nn. Exploration and experiment of batchnorm2d principle
Some summaries of the 21st postgraduate entrance examination 823 of network security major of Shanghai Jiaotong University and ideas on how to prepare for the 22nd postgraduate entrance examination pr
Detailed explanation of classic process synchronization problems
Common tips
Azure solution: how can third-party tools call azure blob storage to store data?
Bottom Logic -- Mind Map
Source code analysis of the implementation mechanism of multisets in guava class library
Process communication and thread explanation
[the way of programmer training] - 2 Perfect number calculation
Entitas learning [iv] other common knowledge points
Serialization oriented - pickle library, JSON Library
The detailed installation process of Ninja security penetration system (Ninjitsu OS V3). Both old and new VM versions can be installed through personal testing, with download sources
The frost peel off the purple dragon scale, and the xiariba people will talk about database SQL optimization and the principle of indexing (primary / secondary / clustered / non clustered)
Global and Chinese markets of NOx analyzers 2022-2028: Research Report on technology, participants, trends, market size and share
OSI seven layer model & unit
Clion configuration of opencv