[Data Mining] Task 5: K-means/DBSCAN clustering: two nested squares
2022-07-03 01:09:00 【zstar-_】
Requirements
Write a program that clusters the following data set: two nested squares.
Imports and global settings
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
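# Use the SimHei font so that Chinese characters in plot text render correctly
plt.rcParams['font.sans-serif'] = ["SimHei"]
# Keep the minus sign rendering correctly when a non-default font is active
plt.rcParams["axes.unicode_minus"] = False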
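Generating the two-nested-squares data
a = np.arange(1, 10, 0.01)  # 900 coordinates along one side of the outer square
b = np.arange(3, 8, 0.01)   # 500 coordinates along one side of the inner square
# Columns: x, y, ground-truth label (0 = outer square, 1 = inner square)
w = np.zeros((5600, 3))
# Points on the outer square
w[:900, 0] = a
w[:900, 1] = 1
w[900:1800, 0] = 1
w[900:1800, 1] = a
w[1800:2700, 0] = a
w[1800:2700, 1] = 10
w[2700:3600, 0] = 10
w[2700:3600, 1] = a
# Points on the inner square
w[3600:4100, 0] = b
w[3600:4100, 1] = 3
w[4100:4600, 0] = 3
w[4100:4600, 1] = b
w[4600:5100, 0] = b
w[4600:5100, 1] = 8
w[5100:, 0] = 8
w[5100:, 1] = b
# Mark the inner-square points with label 1
w[3600:, 2] = 1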
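The same construction can be written more compactly with a small helper. The sketch below is an added illustration, not the author's code: `square_perimeter` and `make_nested_squares` are hypothetical names, and the helper is meant to reproduce the point order used above (bottom, left, top, right side).

```python
import numpy as np


def square_perimeter(low, high, step=0.01):
    """Points on the four sides of an axis-aligned square (hypothetical helper)."""
    t = np.arange(low, high, step)
    lo = np.full_like(t, low)
    hi = np.full_like(t, high)
    # Side order matches the code above: bottom, left, top, right
    return np.concatenate([
        np.column_stack([t, lo]),
        np.column_stack([lo, t]),
        np.column_stack([t, hi]),
        np.column_stack([hi, t]),
    ])


def make_nested_squares():
    """Return an (n, 3) array: x, y, label (0 = outer square, 1 = inner square)."""
    outer = square_perimeter(1, 10)
    inner = square_perimeter(3, 8)
    labels = np.concatenate([np.zeros(len(outer)), np.ones(len(inner))])
    return np.column_stack([np.vstack([outer, inner]), labels])


w_alt = make_nested_squares()
print(w_alt.shape)
```

Building each square from one function keeps the slice bounds out of the code, so changing the side length or the step does not require recomputing any indices by hand.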
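K-Means clustering
Parameters
n_clusters: the number of clusters
random_state: the seed for the random centroid initialization; fixing it makes the result reproducible
# Cluster on the x and y coordinates only; column 2 holds the
# ground-truth label and must not be used as a feature
cluster = KMeans(n_clusters=2, random_state=0)
y = cluster.fit_predict(w[:, :2])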
colors = ['black', 'red']
plt.figure(figsize=(15, 15))
# Left panel: points colored by their ground-truth label
plt.subplot(2, 2, 1)
plt.scatter(w[:, 0], w[:, 1], c=[colors[int(t)] for t in w[:, 2]])
plt.title("Original data")
# Right panel: points colored by their predicted cluster
plt.subplot(2, 2, 2)
plt.scatter(w[:, 0], w[:, 1], c=[colors[k] for k in y])
plt.title("Data after clustering")
plt.show()
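The plot can be backed by a quick count. The sketch below is an added illustration under a few assumptions: it rebuilds the data with a hypothetical helper `make_nested_squares`, fits K-Means on the x and y coordinates only, sets `n_init=10` explicitly, and tabulates how many points of each true square land in each K-Means cluster.

```python
import numpy as np
from sklearn.cluster import KMeans


def make_nested_squares():
    """Hypothetical helper: columns are x, y, label (0 = outer, 1 = inner)."""
    parts = []
    for label, (low, high) in enumerate([(1, 10), (3, 8)]):
        t = np.arange(low, high, 0.01)
        lo, hi = np.full_like(t, low), np.full_like(t, high)
        xy = np.concatenate([
            np.column_stack([t, lo]), np.column_stack([lo, t]),
            np.column_stack([t, hi]), np.column_stack([hi, t]),
        ])
        parts.append(np.column_stack([xy, np.full(len(xy), float(label))]))
    return np.vstack(parts)


data = make_nested_squares()
true_labels = data[:, 2].astype(int)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data[:, :2])

# table[i, j] = number of points from true square i assigned to K-Means cluster j
table = np.zeros((2, 2), dtype=int)
np.add.at(table, (true_labels, kmeans_labels), 1)
print(table)
```

If every entry of `table` is well above zero, each K-Means cluster mixes points from both squares, which is what the scatter plot suggests.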
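DBSCAN clustering
Parameters
eps: the distance threshold of the ϵ-neighborhood; samples farther than ϵ from a point lie outside its ϵ-neighborhood. The default is 0.5.
min_samples: the minimum number of points needed to form a dense region. For a point to be a core point, its neighborhood (the circle of radius eps centered on it, boundary included) must contain at least this many samples, counting the point itself.
A label of y = -1 marks an outlier (noise point).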
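A toy example makes the roles of `eps`, `min_samples`, and the -1 label concrete. This is a minimal added sketch; the four points and the parameter values are chosen purely for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Three points spaced 0.1 apart on a line, plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0]])

# eps=0.15 reaches the nearest neighbor on the line (0.1 away)
# but not the isolated point at (5, 5)
labels = DBSCAN(eps=0.15, min_samples=2).fit_predict(X)
print(labels)  # labels: 0, 0, 0, -1
```

The three points on the line each have at least two samples (themselves included) within `eps`, so they form cluster 0; the isolated point has only itself in its neighborhood and is labeled -1.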
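Because the number of clusters DBSCAN produces is not known in advance, we define a function that searches for the parameters best suited to a requested number of clusters.
"Best" here means the parameter combination that leaves the fewest outliers.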
# Parameter search
def search_best_parameter(N_clusters, X):
    """Return the (eps, min_samples) pair that gives N_clusters clusters with the fewest outliers."""
    min_outliers = np.inf
    best_eps = None
    best_min_samples = None
    # Try different eps values
    for eps in np.arange(0.001, 1, 0.05):
        # Try different min_samples values
        for min_samples in range(2, 10):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            # Fit the model and get one label per sample
            y = dbscan.fit_predict(X)
            # Number of clusters for this combination (-1 marks outliers, not a cluster)
            n_clusters = len(np.unique(y[y != -1]))
            # Number of outliers
            outliers = int(np.sum(y == -1))
            if outliers < min_outliers and n_clusters == N_clusters:
                min_outliers = outliers
                best_eps = eps
                best_min_samples = min_samples
    if best_eps is None:
        raise ValueError("no parameter combination produced the requested number of clusters")
    return best_eps, best_min_samples
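# Search and cluster on the x and y coordinates only (column 2 is the ground-truth label)
eps, min_samples = search_best_parameter(2, w[:, :2])
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(w[:, :2])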
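colors = ['black', 'red']
plt.figure(figsize=(15, 15))
# Left panel: points colored by their ground-truth label
plt.subplot(2, 2, 1)
plt.scatter(w[:, 0], w[:, 1], c=[colors[int(t)] for t in w[:, 2]])
plt.title("Original data")
# Right panel: points colored by their predicted cluster.
# Outliers (label -1) are drawn in gray instead of silently wrapping around to colors[-1].
plt.subplot(2, 2, 2)
plt.scatter(w[:, 0], w[:, 1], c=[colors[k] if k >= 0 else 'gray' for k in y])
plt.title("Data after clustering")
plt.show()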
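Summary
K-Means is a poor fit for the two-nested-squares data: the two squares share the same center, while K-Means assigns each point to its nearest centroid and therefore cuts the plane into two halves. DBSCAN groups points by density connectivity, so it follows the outline of each square and separates the two well.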
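The comparison can also be put in numbers. The sketch below is an added illustration, not part of the original experiment: it scores both methods with the adjusted Rand index (ARI) from `sklearn.metrics`, where 1.0 means perfect agreement with the true labels and values near 0 mean chance-level agreement. The helper `square_perimeter` is hypothetical, and `eps=0.05`, `min_samples=3`, and `n_init=10` are assumed values, not the ones returned by the parameter search.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score


def square_perimeter(low, high, step=0.01):
    """Points on the four sides of an axis-aligned square (hypothetical helper)."""
    t = np.arange(low, high, step)
    lo, hi = np.full_like(t, low), np.full_like(t, high)
    return np.concatenate([
        np.column_stack([t, lo]), np.column_stack([lo, t]),
        np.column_stack([t, hi]), np.column_stack([hi, t]),
    ])


outer = square_perimeter(1, 10)
inner = square_perimeter(3, 8)
X = np.vstack([outer, inner])
truth = np.concatenate([np.zeros(len(outer), dtype=int), np.ones(len(inner), dtype=int)])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Assumed parameters for illustration; points are 0.01 apart along each side
dbscan_labels = DBSCAN(eps=0.05, min_samples=3).fit_predict(X)

ari_kmeans = adjusted_rand_score(truth, kmeans_labels)
ari_dbscan = adjusted_rand_score(truth, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI:  {ari_dbscan:.3f}")
```

ARI is used here because it ignores how the cluster labels are numbered, so it does not matter which square K-Means or DBSCAN happens to call cluster 0.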