[Data Mining] Task 6: DBSCAN Clustering
2022-07-03 01:34:00 【zstar-_】
Requirement

Implement DBSCAN clustering on the datasets below.
Data download: https://download.csdn.net/download/qq1198768105/85865302
Import libraries and global settings

```python
from scipy.io import loadmat
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN

# Render Chinese glyphs and minus signs correctly in matplotlib figures
plt.rcParams['font.sans-serif'] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False
```
DBSCAN clustering parameters

eps: the ϵ-neighborhood distance threshold. Samples farther than ϵ from a point lie outside that point's ϵ-neighborhood. The default is 0.5.
min_samples: the minimum number of samples (including the point itself) that must lie within a point's eps-neighborhood for it to count as a core point.
Points assigned the label y = -1 are outliers (noise).
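As a quick illustration of the -1 convention (a minimal sketch on synthetic points, not part of the assignment data), any point that is neither a core point nor density-reachable from one ends up labeled -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one far-away point that should become noise
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # the isolated point at (20, 20) gets label -1
```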
Because DBSCAN does not take the number of clusters as an input, define a function that searches for the parameter combination producing the requested number of clusters. Among all combinations that hit the target count, the one with the fewest outliers is taken as best.
```python
def search_best_parameter(N_clusters, X):
    """Grid-search (eps, min_samples) for the combination that yields
    exactly N_clusters clusters with the fewest outliers."""
    min_outliers = float('inf')
    best_eps = 0
    best_min_samples = 0
    # Iterate over candidate eps values
    for eps in np.arange(0.001, 1, 0.05):
        # Iterate over candidate min_samples values
        for min_samples in range(2, 10):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            # Fit the model
            y = dbscan.fit_predict(X)
            # Count the clusters for this combination (-1 marks outliers)
            if len(np.argwhere(y == -1)) == 0:
                n_clusters = len(np.unique(y))
            else:
                n_clusters = len(np.unique(y)) - 1
            # Number of outliers
            outliers = int(np.sum(y == -1))
            if outliers < min_outliers and n_clusters == N_clusters:
                min_outliers = outliers
                best_eps = eps
                best_min_samples = min_samples
    return best_eps, best_min_samples
```
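The search can be exercised on a synthetic two-cluster dataset (make_moons is an assumption here, not one of the assignment's .mat files). The snippet below carries a compact copy of the function so it runs on its own, using the same parameter grid:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

def search_best_parameter(N_clusters, X):
    # Compact copy of the grid search above: fewest outliers among
    # fits that produce exactly N_clusters clusters
    min_outliers, best_eps, best_min_samples = float('inf'), 0, 0
    for eps in np.arange(0.001, 1, 0.05):
        for min_samples in range(2, 10):
            y = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            n_clusters = len(np.unique(y)) - (1 if -1 in y else 0)
            outliers = int(np.sum(y == -1))
            if outliers < min_outliers and n_clusters == N_clusters:
                min_outliers, best_eps, best_min_samples = outliers, eps, min_samples
    return best_eps, best_min_samples

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
eps, min_samples = search_best_parameter(2, X)
print(eps, min_samples)
```

By construction, refitting DBSCAN with the returned pair reproduces exactly two clusters on this data.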
smile data

```python
# Load data: columns are x, y, and the ground-truth label
colors = ['green', 'red', 'blue']
smile = loadmat('data- Density clustering /smile.mat')
X = smile['smile']
eps, min_samples = search_best_parameter(3, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)

# Visualize the clustering result
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
```
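Calling plt.scatter once per point creates one artist per sample, which gets slow on larger arrays. An equivalent form passes the whole array in a single call and colors by label. A sketch, with a small hypothetical stand-in array in place of the loaded smile data:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt

# Hypothetical stand-in for the loaded smile array: columns x, y, label
X = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0],
              [0.5, 1.0, 2.0], [0.6, 1.1, 2.0]])
colors = np.array(['green', 'red', 'blue'])

fig, ax = plt.subplots()
# One scatter call: fancy-index the color array with the integer labels
pc = ax.scatter(X[:, 0], X[:, 1], c=colors[X[:, 2].astype(int)])
ax.set_title("Raw data")
```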

sizes5 data

```python
# Load data
colors = ['blue', 'green', 'red', 'black', 'yellow']
sizes5 = loadmat('data- Density clustering /sizes5.mat')
X = sizes5['sizes5']
eps, min_samples = search_best_parameter(4, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)

# Visualize the clustering result
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    if y[i] != -1:  # skip outliers
        plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
```

square1 data

```python
# Load data
colors = ['green', 'red', 'blue', 'black']
square1 = loadmat('data- Density clustering /square1.mat')
X = square1['square1']
eps, min_samples = search_best_parameter(4, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)

# Visualize the clustering result
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
```

square4 data

```python
# Load data
colors = ['blue', 'green', 'red', 'black',
          'yellow', 'brown', 'orange', 'purple']
square4 = loadmat('data- Density clustering /square4.mat')
X = square4['b']
eps, min_samples = search_best_parameter(5, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)

# Visualize the clustering result
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
```

spiral data

```python
# Load data
colors = ['green', 'red']
spiral = loadmat('data- Density clustering /spiral.mat')
X = spiral['spiral']
eps, min_samples = search_best_parameter(2, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)

# Visualize the clustering result
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
```

moon data

```python
# Load data
colors = ['green', 'red']
moon = loadmat('data- Density clustering /moon.mat')
X = moon['a']
eps, min_samples = search_best_parameter(2, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)

# Visualize the clustering result
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
```

long data

```python
# Load data
colors = ['green', 'red']
long = loadmat('data- Density clustering /long.mat')
X = long['long1']
eps, min_samples = search_best_parameter(2, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)

# Visualize the clustering result
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
```

2d4c data

```python
# Load data
colors = ['green', 'red', 'blue', 'black']
d4c = loadmat('data- Density clustering /2d4c.mat')
X = d4c['a']
eps, min_samples = search_best_parameter(4, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)

# Visualize the clustering result
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
```

Summary

These experiments show that DBSCAN's results hinge on the density connectivity of the data points: it handles non-convex, well-separated shapes such as the smile and spiral datasets particularly well, where centroid-based methods would split the clusters.
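That observation can be checked directly on a two-moons dataset (a stand-in for the smile/spiral shapes, assumed here rather than taken from the assignment's .mat files): k-means splits each crescent across two centroids, while DBSCAN follows the density-connected chains and recovers both:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

y_db = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
y_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(adjusted_rand_score(y_true, y_db))  # near 1.0: crescents recovered
print(adjusted_rand_score(y_true, y_km))  # much lower: crescents split
```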