[Data Mining] Task 6: DBSCAN Clustering
2022-07-03 01:34:00 【zstar-_】
Requirement
Write a program to perform DBSCAN clustering on the datasets below.
Data download: https://download.csdn.net/download/qq1198768105/85865302
Imports and global settings
from scipy.io import loadmat
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import datasets
import pandas as pd
plt.rcParams['font.sans-serif'] = ["SimHei"]  # font that can render Chinese labels
plt.rcParams["axes.unicode_minus"] = False    # render minus signs correctly with this font
DBSCAN clustering parameters
eps: the ϵ-neighborhood distance threshold. Samples farther than ϵ from a given point are not in that point's ϵ-neighborhood. The default value is 0.5.
min_samples: the minimum number of samples (including the point itself) that must lie within a point's ϵ-neighborhood (the disk of radius eps centered on the point, boundary included) for it to count as a core point of a high-density region.
A predicted label of y = -1 marks an outlier (noise point).
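As a quick illustration of this label convention (the toy points below are my own example, not taken from the assignment data): fit_predict returns one cluster index per sample, with -1 marking noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one far-away point that should be labeled as noise
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
              [20.0, 20.0]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # [ 0  0  0  1  1  1 -1]
```

The isolated point at (20, 20) has no neighbor within eps, so it is neither a core point nor density-reachable from one, and gets the label -1.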
Because DBSCAN does not fix the number of clusters in advance, we define a function that searches for the parameter pair producing the specified number of clusters.
Among the candidates, the best combination is the one with the fewest outliers.
def search_best_parameter(N_clusters, X):
    min_outliers = 999
    best_eps = 0
    best_min_samples = 0
    # Try different eps values
    for eps in np.arange(0.001, 1, 0.05):
        # Try different min_samples values
        for min_samples in range(2, 10):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            # Fit the model
            y = dbscan.fit_predict(X)
            # Count the clusters for this parameter combination (-1 marks outliers)
            if len(np.argwhere(y == -1)) == 0:
                n_clusters = len(np.unique(y))
            else:
                n_clusters = len(np.unique(y)) - 1
            # Number of outliers
            outliers = len([i for i in y if i == -1])
            if outliers < min_outliers and n_clusters == N_clusters:
                min_outliers = outliers
                best_eps = eps
                best_min_samples = min_samples
    return best_eps, best_min_samples
smile data
# Import data
colors = ['green', 'red', 'blue']
smile = loadmat('data- Density clustering /smile.mat')
X = smile['smile']
eps, min_samples = search_best_parameter(3, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(smile['smile'])):
    plt.scatter(smile['smile'][i][0], smile['smile'][i][1],
                color=colors[int(smile['smile'][i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(smile['smile'][i][0], smile['smile'][i][1], color=colors[y[i]])
plt.title("Clustered data")

sizes5 data
# Import data
colors = ['blue', 'green', 'red', 'black', 'yellow']
sizes5 = loadmat('data- Density clustering /sizes5.mat')
X = sizes5['sizes5']
eps, min_samples = search_best_parameter(4, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(sizes5['sizes5'])):
    plt.scatter(sizes5['sizes5'][i][0], sizes5['sizes5'][i][1],
                color=colors[int(sizes5['sizes5'][i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    if y[i] != -1:
        plt.scatter(sizes5['sizes5'][i][0], sizes5['sizes5'][i][1],
                    color=colors[y[i]])
plt.title("Clustered data")

square1 data
# Import data
colors = ['green', 'red', 'blue', 'black']
square1 = loadmat('data- Density clustering /square1.mat')
X = square1['square1']
eps, min_samples = search_best_parameter(4, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(square1['square1'])):
    plt.scatter(square1['square1'][i][0], square1['square1'][i][1],
                color=colors[int(square1['square1'][i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(square1['square1'][i][0], square1['square1'][i][1],
                color=colors[y[i]])
plt.title("Clustered data")

square4 data
# Import data
colors = ['blue', 'green', 'red', 'black',
'yellow', 'brown', 'orange', 'purple']
square4 = loadmat('data- Density clustering /square4.mat')
X = square4['b']
eps, min_samples = search_best_parameter(5, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(square4['b'])):
    plt.scatter(square4['b'][i][0], square4['b'][i][1],
                color=colors[int(square4['b'][i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(square4['b'][i][0], square4['b'][i][1],
                color=colors[y[i]])
plt.title("Clustered data")

spiral data
# Import data
colors = ['green', 'red']
spiral = loadmat('data- Density clustering /spiral.mat')
X = spiral['spiral']
eps, min_samples = search_best_parameter(2, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(spiral['spiral'])):
    plt.scatter(spiral['spiral'][i][0], spiral['spiral'][i][1],
                color=colors[int(spiral['spiral'][i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(spiral['spiral'][i][0], spiral['spiral'][i][1],
                color=colors[y[i]])
plt.title("Clustered data")

moon data
# Import data
colors = ['green', 'red']
moon = loadmat('data- Density clustering /moon.mat')
X = moon['a']
eps, min_samples = search_best_parameter(2, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(moon['a'])):
    plt.scatter(moon['a'][i][0], moon['a'][i][1],
                color=colors[int(moon['a'][i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(moon['a'][i][0], moon['a'][i][1], color=colors[y[i]])
plt.title("Clustered data")

long data
# Import data
colors = ['green', 'red']
long = loadmat('data- Density clustering /long.mat')
X = long['long1']
eps, min_samples = search_best_parameter(2, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(long['long1'])):
    plt.scatter(long['long1'][i][0], long['long1'][i][1],
                color=colors[int(long['long1'][i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(long['long1'][i][0], long['long1'][i][1], color=colors[y[i]])
plt.title("Clustered data")

2d4c data
# Import data
colors = ['green', 'red', 'blue', 'black']
d4c = loadmat('data- Density clustering /2d4c.mat')
X = d4c['a']
eps, min_samples = search_best_parameter(4, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(d4c['a'])):
    plt.scatter(d4c['a'][i][0], d4c['a'][i][1],
                color=colors[int(d4c['a'][i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(d4c['a'][i][0], d4c['a'][i][1], color=colors[y[i]])
plt.title("Clustered data")

Summary
The experiments above show that DBSCAN's results depend on the density connectivity of the data points: it performs especially well on uniformly dense, non-convex shapes such as the smile and spiral datasets.
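This conclusion can be sanity-checked on synthetic data. The sketch below uses sklearn's make_moons instead of the assignment's .mat files, and the eps and min_samples values are hand-picked for this particular sample size and noise level, not taken from the experiments above:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN

# Two interleaved half-moons: non-convex clusters that centroid-based
# methods struggle with, but that density-based DBSCAN separates cleanly.
X, _ = datasets.make_moons(n_samples=200, noise=0.03, random_state=0)
y = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
# Number of clusters found, excluding the noise label -1
n_clusters = len(np.unique(y)) - (1 if -1 in y else 0)
print(n_clusters)  # expected: 2
```

Because the points along each moon are much closer to each other (spacing well under eps) than the two moons are to one another, the density-reachability chains stay within a moon and each moon emerges as one cluster.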