[Data Mining] Task 5: K-Means/DBSCAN Clustering: Double Square
2022-07-03 01:34:00 【zstar-_】
Requirement
Write a program that clusters the following data: a double square (one square nested inside another).
Import libraries and global settings
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
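# Use the SimHei font so that Chinese characters in plot text render correctly
plt.rcParams['font.sans-serif'] = ["SimHei"]
# Keep the minus sign rendering correctly with this font
plt.rcParams["axes.unicode_minus"] = False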
Generate double-layer square data
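a = np.arange(1, 10, 0.01)  # 900 positions along each outer side
b = np.arange(3, 8, 0.01)   # 500 positions along each inner side
# Columns: x, y, true label (0 = outer square, 1 = inner square)
w = np.zeros((5600, 3))
# Points on the outer square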
w[:900, 0] = a
w[:900, 1] = 1
w[900:1800, 0] = 1
w[900:1800, 1] = a
w[1800:2700, 0] = a
w[1800:2700, 1] = 10
w[2700:3600, 0] = 10
w[2700:3600, 1] = a
# Points on the inner square
w[3600:4100, 0] = b
w[3600:4100, 1] = 3
w[4100:4600, 0] = 3
w[4100:4600, 1] = b
w[4600:5100, 0] = b
w[4600:5100, 1] = 8
w[5100:, 0] = 8
w[5100:, 1] = b
# Mark the inner-square points with label 1
w[3600:, 2] = 1
K-Means clustering
Parameter description
n_clusters: the number of clusters
random_state: the seed for the random centroid initialization, fixed so that results are reproducible
# Cluster on the x/y coordinates only; column 2 holds the true labels
cluster = KMeans(n_clusters=2, random_state=0)
y = cluster.fit_predict(w[:, :2])
colors = ['black', 'red']
plt.figure(figsize=(15, 15))
plt.subplot(2, 2, 1)
# Color each point by its true label
plt.scatter(w[:, 0], w[:, 1], c=[colors[int(k)] for k in w[:, 2]])
plt.title("Raw data")
plt.subplot(2, 2, 2)
# Color each point by its K-Means cluster
plt.scatter(w[:, 0], w[:, 1], c=[colors[k] for k in y])
plt.title("Data after clustering")
plt.show()
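One way to put numbers on what the right-hand plot shows is to count, for each K-Means cluster, how many points come from each square. The following is a minimal sketch and not part of the original post: the helper `square_points` is a hypothetical name that rebuilds the same double-square coordinates, `n_init=10` is set explicitly as an assumption, and clustering uses the x/y coordinates only.

```python
import numpy as np
from sklearn.cluster import KMeans


def square_points(lo, hi, step=0.01):
    """Hypothetical helper: points along the four sides of a square [lo, hi)."""
    t = np.arange(lo, hi, step)
    lo_col, hi_col = np.full_like(t, lo), np.full_like(t, hi)
    sides = [(t, lo_col), (lo_col, t), (t, hi_col), (hi_col, t)]
    return np.vstack([np.column_stack(s) for s in sides])


outer, inner = square_points(1, 10), square_points(3, 8)
X = np.vstack([outer, inner])
# True membership: 0 = outer square, 1 = inner square
truth = np.r_[np.zeros(len(outer), dtype=int), np.ones(len(inner), dtype=int)]

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# counts[i, j] = number of points from square i assigned to cluster j
counts = np.zeros((2, 2), dtype=int)
np.add.at(counts, (truth, pred), 1)
print(counts)
```

If every entry of `counts` is non-zero, each K-Means cluster contains points from both squares. That is what one would expect, since K-Means assigns each point to its nearest centroid and therefore separates two clusters with a straight boundary.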

DBSCAN clustering
Parameter description
eps: the distance threshold of the ϵ-neighborhood. A sample whose distance from a given sample exceeds ϵ is not in that sample's ϵ-neighborhood. The default value is 0.5.
min_samples: the minimum number of samples (including the point itself) that must lie in a point's neighborhood for that point to count as a core point, i.e. to form a high-density region. The neighborhood is the circle of radius eps centered on the point, boundary included.
A label of y = -1 marks an outlier.
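To make the label convention concrete, here is a small sketch on toy data. The points are invented for illustration and do not come from the original post: two tight groups of three points each, plus one isolated point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X_toy = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],   # tight group A
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],   # tight group B
    [10.0, 0.0],                          # isolated point
])

# Every point in a tight group has 3 samples (itself included) within eps
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X_toy)

n_outliers = int(np.sum(labels == -1))            # points labeled -1
n_clusters = len(set(labels.tolist()) - {-1})     # -1 is not a cluster
print(labels, n_clusters, n_outliers)
```

The two groups receive the labels 0 and 1, while the isolated point has too few neighbors to join either cluster and is labeled -1. Note that -1 must be excluded when counting clusters.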
Because the number of clusters DBSCAN produces is not known in advance, we define a function that searches for the parameters that yield the specified number of clusters.
Among the parameter combinations that do, the best one is the one that minimizes the number of outliers.
# Search for the best parameters
def search_best_parameter(N_clusters, X):
    # Only combinations with fewer than 999 outliers are considered
    min_outliers = 999
    best_eps = 0
    best_min_samples = 0
    # Try different eps values
    for eps in np.arange(0.001, 1, 0.05):
        # Try different min_samples values
        for min_samples in range(2, 10):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            # Fit the model
            y = dbscan.fit_predict(X)
            # Count the clusters for this parameter combination (-1 marks an outlier, not a cluster)
            if len(np.argwhere(y == -1)) == 0:
                n_clusters = len(np.unique(y))
            else:
                n_clusters = len(np.unique(y)) - 1
            # Number of outliers
            outliers = int(np.sum(y == -1))
            if outliers < min_outliers and n_clusters == N_clusters:
                min_outliers = outliers
                best_eps = eps
                best_min_samples = min_samples
    return best_eps, best_min_samples
# Search and cluster on the x/y coordinates only; column 2 holds the true labels
eps, min_samples = search_best_parameter(2, w[:, :2])
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(w[:, :2])
colors = ['black', 'red']
plt.figure(figsize=(15, 15))
plt.subplot(2, 2, 1)
# Color each point by its true label
plt.scatter(w[:, 0], w[:, 1], c=[colors[int(k)] for k in w[:, 2]])
plt.title("Raw data")
plt.subplot(2, 2, 2)
# Color each point by its DBSCAN cluster (an outlier label of -1 would map to colors[-1], i.e. 'red')
plt.scatter(w[:, 0], w[:, 1], c=[colors[k] for k in y])
plt.title("Data after clustering")
plt.show()

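Summary
For the double-square data, K-Means is not a suitable clustering method: it assigns each point to its nearest centroid, so it cannot separate one square nested inside the other. DBSCAN, which groups points by density, achieves much better results.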
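The comparison can also be expressed as a single number. The sketch below is an addition, not the author's evaluation: it scores both algorithms against the known square membership with scikit-learn's `adjusted_rand_score`, where 1.0 means the clustering matches the true grouping exactly and values near 0 mean agreement no better than chance. The helper `square_points` is a hypothetical name, and the DBSCAN values `eps=0.05, min_samples=2` as well as `n_init=10` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score


def square_points(lo, hi, step=0.01):
    """Hypothetical helper: points along the four sides of a square [lo, hi)."""
    t = np.arange(lo, hi, step)
    lo_col, hi_col = np.full_like(t, lo), np.full_like(t, hi)
    sides = [(t, lo_col), (lo_col, t), (t, hi_col), (hi_col, t)]
    return np.vstack([np.column_stack(s) for s in sides])


outer, inner = square_points(1, 10), square_points(3, 8)
X = np.vstack([outer, inner])
# True membership: 0 = outer square, 1 = inner square
truth = np.r_[np.zeros(len(outer), dtype=int), np.ones(len(inner), dtype=int)]

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Neighboring points are about 0.01 apart and the squares are 2 apart,
# so eps=0.05 links each square internally without bridging the gap
dbscan_labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(X)

ari_kmeans = adjusted_rand_score(truth, kmeans_labels)
ari_dbscan = adjusted_rand_score(truth, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN  ARI: {ari_dbscan:.3f}")
```

The adjusted Rand index ignores how the cluster labels are numbered, so it can be compared across the two algorithms directly.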