[Data Mining] Task 6: DBSCAN Clustering
2022-07-03 01:34:00 【zstar-_】
Requirement
Write a program that performs DBSCAN clustering on each of the following datasets.
Data acquisition: https://download.csdn.net/download/qq1198768105/85865302
Import libraries and global settings
from scipy.io import loadmat
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import datasets
import pandas as pd
# Use the SimHei font so any Chinese labels render correctly,
# and keep the minus sign displayable with a non-ASCII font
plt.rcParams['font.sans-serif'] = ["SimHei"]
plt.rcParams["axes.unicode_minus"] = False
DBSCAN clustering parameters
eps: the ϵ-neighborhood distance threshold. Samples farther than eps from a point fall outside that point's ϵ-neighborhood. The default value is 0.5.
min_samples: the minimum number of samples (including the point itself) that must lie inside a point's eps-neighborhood (the circle of radius eps centered on it, boundary included) for that point to qualify as a core point of a high-density region.
A predicted label of y = -1 marks an outlier (noise point).
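To make the noise label concrete, here is a minimal sketch on a made-up toy array (my own illustration, not part of the assignment data), reusing the imports above: the two dense groups receive cluster labels 0 and 1, while the isolated point comes back as -1.

# Toy example (made-up data): DBSCAN finds two dense groups and one noise point
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # dense group A
                  [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # dense group B
                  [10.0, 0.0]])                          # isolated point
print(DBSCAN(eps=0.5, min_samples=3).fit_predict(X_toy))
# -> [ 0  0  0  1  1  1 -1]: two clusters plus one noise point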
Because DBSCAN does not take the number of clusters as an input, define a function that searches for the parameter combination yielding the specified number of clusters. Among the combinations that do, the best one is taken to be the one producing the fewest outliers.
def search_best_parameter(N_clusters, X):
    min_outliers = float('inf')
    best_eps = 0
    best_min_samples = 0
    # Iterate over candidate eps values
    for eps in np.arange(0.001, 1, 0.05):
        # Iterate over candidate min_samples values
        for min_samples in range(2, 10):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            # Fit the model; fit_predict returns one label per sample
            y = dbscan.fit_predict(X)
            # Count the clusters for this parameter combination
            # (-1 marks outliers and is not a cluster)
            if len(np.argwhere(y == -1)) == 0:
                n_clusters = len(np.unique(y))
            else:
                n_clusters = len(np.unique(y)) - 1
            # Number of outliers
            outliers = np.sum(y == -1)
            if outliers < min_outliers and n_clusters == N_clusters:
                min_outliers = outliers
                best_eps = eps
                best_min_samples = min_samples
    # Note: if no combination yields N_clusters clusters, this returns (0, 0)
    return best_eps, best_min_samples
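Before running it on the downloaded .mat files, the function can be sanity-checked on a synthetic two-moons dataset (this demo is my own addition, generated with the sklearn datasets module imported above):

# Sanity-check sketch on synthetic data, not one of the task's .mat files
X_demo, _ = datasets.make_moons(n_samples=300, noise=0.05, random_state=0)
demo_eps, demo_min_samples = search_best_parameter(2, X_demo)
print(demo_eps, demo_min_samples)
# Caveat: if no combination yields exactly 2 clusters, the function falls
# back to (0, 0), and DBSCAN(eps=0) would raise a ValueError downstream.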
smile data
# Import data
colors = ['green', 'red', 'blue']
smile = loadmat('data- Density clustering /smile.mat')
X = smile['smile']
eps, min_samples = search_best_parameter(3, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    # Note: a noise label (y[i] == -1) would silently pick colors[-1] here
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
sizes5 data
# Import data
colors = ['blue', 'green', 'red', 'black', 'yellow']
sizes5 = loadmat('data- Density clustering /sizes5.mat')
X = sizes5['sizes5']
eps, min_samples = search_best_parameter(4, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    if y[i] != -1:
        plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
square1 data
# Import data
colors = ['green', 'red', 'blue', 'black']
square1 = loadmat('data- Density clustering /square1.mat')
X = square1['square1']
eps, min_samples = search_best_parameter(4, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
square4 data
# Import data
colors = ['blue', 'green', 'red', 'black',
          'yellow', 'brown', 'orange', 'purple']
square4 = loadmat('data- Density clustering /square4.mat')
X = square4['b']
eps, min_samples = search_best_parameter(5, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
spiral data
# Import data
colors = ['green', 'red']
spiral = loadmat('data- Density clustering /spiral.mat')
X = spiral['spiral']
eps, min_samples = search_best_parameter(2, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
moon data
# Import data
colors = ['green', 'red']
moon = loadmat('data- Density clustering /moon.mat')
X = moon['a']
eps, min_samples = search_best_parameter(2, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
long data
# Import data
colors = ['green', 'red']
long = loadmat('data- Density clustering /long.mat')
X = long['long1']
eps, min_samples = search_best_parameter(2, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
2d4c data
# Import data
colors = ['green', 'red', 'blue', 'black']
d4c = loadmat('data- Density clustering /2d4c.mat')
X = d4c['a']
eps, min_samples = search_best_parameter(4, X)
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(X)
# Visualization of clustering results
plt.figure(figsize=(20, 15))
plt.subplot(2, 2, 1)
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[int(X[i][2])])
plt.title("Raw data")
plt.subplot(2, 2, 2)
for i in range(len(y)):
    plt.scatter(X[i][0], X[i][1], color=colors[y[i]])
plt.title("Clustered data")
Summary
The experiments above show that DBSCAN clustering depends strongly on the density relationships among neighboring points; it performs especially well on datasets such as smile and spiral, where the points within each cluster are evenly distributed.