[Data Mining] Task 5: K-Means/DBSCAN Clustering: Double-Layer Square
2022-07-03 01:34:00 【zstar-_】
Requirement
Write a program that clusters the following data set: a double-layer square (a small square nested inside a larger one).
Import libraries and global settings
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
plt.rcParams['font.sans-serif'] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False
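Generate the double-layer square data
# Side coordinates: 900 values for the outer square, 500 for the inner one
a = np.arange(1, 10, 0.01)
b = np.arange(3, 8, 0.01)
# Columns: x, y, ground-truth label (0 = outer square, 1 = inner square)
w = np.zeros((5600, 3))
# Points on the outer square
w[:900, 0] = a  # bottom side (y = 1)
w[:900, 1] = 1
w[900:1800, 0] = 1  # left side (x = 1)
w[900:1800, 1] = a
w[1800:2700, 0] = a  # top side (y = 10)
w[1800:2700, 1] = 10
w[2700:3600, 0] = 10  # right side (x = 10)
w[2700:3600, 1] = a
# Points on the inner square
w[3600:4100, 0] = b  # bottom side (y = 3)
w[3600:4100, 1] = 3
w[4100:4600, 0] = 3  # left side (x = 3)
w[4100:4600, 1] = b
w[4600:5100, 0] = b  # top side (y = 8)
w[4600:5100, 1] = 8
w[5100:, 0] = 8  # right side (x = 8)
w[5100:, 1] = b
# Label the inner-square points as class 1
w[3600:, 2] = 1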
K-Means clustering
Parameter description
n_clusters: the number of clusters to form
random_state: the seed for the random centroid initialization, which makes the result reproducible
cluster = KMeans(n_clusters=2, random_state=0)
# Cluster on the x and y columns only; column 2 holds the ground-truth label
y = cluster.fit_predict(w[:, :2])
colors = ['black', 'red']
plt.figure(figsize=(15, 15))
plt.subplot(2, 2, 1)
# Color each point by its ground-truth label
plt.scatter(w[:, 0], w[:, 1], c=[colors[int(k)] for k in w[:, 2]])
plt.title("Raw data")
plt.subplot(2, 2, 2)
# Color each point by its K-Means cluster
plt.scatter(w[:, 0], w[:, 1], c=[colors[k] for k in y])
plt.title("Data after clustering")
plt.show()
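To see how K-Means divides the points without relying on the plot, one can count how many points of each square end up in each cluster. The sketch below is an illustration rather than part of the original task: the helper `square_outline` and the explicit `n_init=10` are assumptions made here, and the data is rebuilt so that the block runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans


def square_outline(low, high, step=0.01):
    """Hypothetical helper: points on the four sides of an axis-aligned square."""
    t = np.arange(low, high, step)
    lo, hi = np.full_like(t, low), np.full_like(t, high)
    sides = ((t, lo), (lo, t), (t, hi), (hi, t))  # bottom, left, top, right
    return np.vstack([np.column_stack(side) for side in sides])


outer = square_outline(1, 10)
inner = square_outline(3, 8)
points = np.vstack([outer, inner])
# Ground truth: 0 = outer square, 1 = inner square
truth = np.repeat([0, 1], [len(outer), len(inner)])

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# table[t, c] = number of points from square t assigned to cluster c
table = np.array([[np.sum((truth == t) & (pred == c)) for c in (0, 1)]
                  for t in (0, 1)])
print(table)
```

If every entry of the table is nonzero, each K-Means cluster mixes points from both squares, which is what the right-hand plot shows.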
DBSCAN clustering
Parameter description
eps: the distance threshold ϵ of the ϵ-neighborhood. A sample point farther than ϵ from a given sample is not in that sample's ϵ-neighborhood. The default value is 0.5.
min_samples: the minimum number of samples needed to form a high-density region. A point is a core point if its neighborhood (the circle of radius eps centered on the point, including points on the circle) contains at least this many samples, counting the point itself.
A label of y = -1 marks an outlier.
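The effect of these two parameters can be seen on a tiny data set. The sketch below uses made-up points (two tight groups of three points and one isolated point) and illustrative values `eps=0.5`, `min_samples=3`. None of these come from the task data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups and one isolated point
X = np.array([
    [0.0, 0.0], [0.0, 0.3], [0.3, 0.0],   # group A
    [5.0, 5.0], [5.0, 5.2], [5.2, 5.0],   # group B
    [10.0, 10.0],                         # isolated point
])

model = DBSCAN(eps=0.5, min_samples=3)
labels = model.fit_predict(X)

# Clusters are the distinct labels other than -1; -1 marks outliers
n_clusters = len(np.unique(labels[labels != -1]))
n_outliers = int(np.sum(labels == -1))
print(labels)
print(n_clusters, n_outliers)
```

Every point in a group has three samples (itself included) within distance 0.5, so each group becomes one cluster, while the isolated point has only itself in its neighborhood and is labeled -1.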
Because the number of clusters that DBSCAN produces is not fixed in advance, we define a function that searches for the parameters that yield the specified number of clusters.
Among those parameter combinations, the best one is the one with the fewest outliers.
# Search for suitable parameters
def search_best_parameter(N_clusters, X):
    # Start from infinity so that any valid combination can be selected
    min_outliers = float("inf")
    best_eps = 0
    best_min_samples = 0
    # Try different eps values
    for eps in np.arange(0.001, 1, 0.05):
        # Try different min_samples values
        for min_samples in range(2, 10):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            # Fit the model and get the cluster labels
            y = dbscan.fit_predict(X)
            # Number of clusters for this parameter combination (-1 marks outliers, not a cluster)
            n_clusters = len(np.unique(y[y != -1]))
            # Number of outliers
            outliers = int(np.sum(y == -1))
            if outliers < min_outliers and n_clusters == N_clusters:
                min_outliers = outliers
                best_eps = eps
                best_min_samples = min_samples
    return best_eps, best_min_samples
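# Search and cluster on the x and y columns only; column 2 holds the ground-truth label
eps, min_samples = search_best_parameter(2, w[:, :2])
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(w[:, :2])
colors = ['black', 'red']
plt.figure(figsize=(15, 15))
plt.subplot(2, 2, 1)
# Color each point by its ground-truth label
plt.scatter(w[:, 0], w[:, 1], c=[colors[int(k)] for k in w[:, 2]])
plt.title("Raw data")
plt.subplot(2, 2, 2)
# Outliers (label -1) are drawn in gray so they are not mistaken for a cluster
plt.scatter(w[:, 0], w[:, 1], c=[colors[k] if k >= 0 else 'gray' for k in y])
plt.title("Data after clustering")
plt.show()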
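Summary
For the double-layer square data, K-Means is not a suitable clustering method, while DBSCAN achieves much better results. K-Means assigns each point to its nearest centroid, so with two clusters the boundary is always a straight line, and no straight line can separate a square from one nested inside it. DBSCAN groups points by density and connectivity, so it can recover each square as one cluster.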
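The comparison can also be expressed as a number. The sketch below scores both methods against the ground-truth labels with the adjusted Rand index (`adjusted_rand_score` from scikit-learn), where 1 means perfect agreement and values near 0 mean chance-level agreement. The helper `square_outline`, the explicit `n_init=10`, and the DBSCAN values `eps=0.1`, `min_samples=5` are assumptions chosen for illustration. They are not the output of the parameter search above.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score


def square_outline(low, high, step=0.01):
    """Hypothetical helper: points on the four sides of an axis-aligned square."""
    t = np.arange(low, high, step)
    lo, hi = np.full_like(t, low), np.full_like(t, high)
    sides = ((t, lo), (lo, t), (t, hi), (hi, t))  # bottom, left, top, right
    return np.vstack([np.column_stack(side) for side in sides])


outer = square_outline(1, 10)
inner = square_outline(3, 8)
points = np.vstack([outer, inner])
# Ground truth: 0 = outer square, 1 = inner square
truth = np.repeat([0, 1], [len(outer), len(inner)])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
# Assumed DBSCAN parameters, chosen by hand for this sketch
dbscan_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(points)

ari_kmeans = adjusted_rand_score(truth, kmeans_labels)
ari_dbscan = adjusted_rand_score(truth, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN  ARI: {ari_dbscan:.3f}")
```

The adjusted Rand index ignores how the cluster labels are numbered, so it does not matter which cluster DBSCAN or K-Means calls 0 and which it calls 1.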