[Data mining] Task 5: K-Means/DBSCAN clustering: double square

2022-07-03 01:34:00 【zstar-_】

Requirement

Write a program that clusters the following data set: the double square.

Import libraries and global settings

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

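# Use the SimHei font so Chinese text in plots renders correctly
plt.rcParams['font.sans-serif'] = ["SimHei"]
# Keep the minus sign displayable with that font
plt.rcParams["axes.unicode_minus"] = False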

Generate the double-square data

a = np.arange(1, 10, 0.01)
b = np.arange(3, 8, 0.01)
w = np.zeros((5600, 3))
# Points on the outer square (columns 0 and 1 are the x and y coordinates)
w[:900, 0] = a
w[:900, 1] = 1
w[900:1800, 0] = 1
w[900:1800, 1] = a
w[1800:2700, 0] = a
w[1800:2700, 1] = 10
w[2700:3600, 0] = 10
w[2700:3600, 1] = a
# Points on the inner square
w[3600:4100, 0] = b
w[3600:4100, 1] = 3
w[4100:4600, 0] = 3
w[4100:4600, 1] = b
w[4600:5100, 0] = b
w[4600:5100, 1] = 8
w[5100:, 0] = 8
w[5100:, 1] = b
# Column 2 holds the true label: 0 = outer square, 1 = inner square
w[3600:, 2] = 1
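The same construction can be sketched as a small reusable helper. This is only an illustration: the name `square_outline` is hypothetical, and it uses `np.linspace(..., endpoint=False)` in place of `np.arange` so that the number of points per side is stated explicitly and does not depend on floating-point rounding.

```python
import numpy as np

def square_outline(lo, hi, n):
    """Return 4*n points on the outline of the square [lo, hi] x [lo, hi]."""
    t = np.linspace(lo, hi, n, endpoint=False)  # n evenly spaced values in [lo, hi)
    lo_col = np.full(n, float(lo))
    hi_col = np.full(n, float(hi))
    return np.vstack([
        np.column_stack([t, lo_col]),   # bottom side
        np.column_stack([lo_col, t]),   # left side
        np.column_stack([t, hi_col]),   # top side
        np.column_stack([hi_col, t]),   # right side
    ])

outer = square_outline(1, 10, 900)   # 3600 points, label 0
inner = square_outline(3, 8, 500)    # 2000 points, label 1
labels = np.concatenate([np.zeros(len(outer)), np.ones(len(inner))])
data = np.column_stack([np.vstack([outer, inner]), labels])
print(data.shape)
```

The rows come out in the same order as in the array `w` above: bottom, left, top, and right side of the outer square, then the same for the inner square.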

K-Means clustering

Parameter description

n_clusters: the number of clusters to form

random_state: the seed that controls the random initialization of the centroids, so that results are reproducible
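A minimal sketch of what the two parameters do, on hypothetical toy data (two well-separated Gaussian blobs that are not part of the original task). Passing `n_init=10` explicitly is an assumption made here to get the same behavior across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs of 50 points each
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0, 0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# The same random_state gives the same initialization, hence the same labels
labels_1 = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
labels_2 = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

print(np.array_equal(labels_1, labels_2))  # True
print(np.unique(labels_1))                 # [0 1]
```

With `n_clusters=2` the labels take exactly two values, and fixing `random_state` makes repeated runs agree.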

# Cluster on the x and y coordinates only; column 2 holds the true label
cluster = KMeans(n_clusters=2, random_state=0)
y = cluster.fit_predict(w[:, :2])

colors = np.array(['black', 'red'])
plt.figure(figsize=(15, 15))
plt.subplot(2, 2, 1)
# One vectorized scatter call per panel instead of one call per point
plt.scatter(w[:, 0], w[:, 1], color=colors[w[:, 2].astype(int)])
plt.title("Raw data")
plt.subplot(2, 2, 2)
plt.scatter(w[:, 0], w[:, 1], color=colors[y])
plt.title("Data after clustering")

[Figure: raw data (left) and K-Means clustering result (right)]

DBSCAN clustering

Parameter description

eps: the distance threshold of the ϵ-neighborhood. Sample points farther than ϵ from a given sample are not in its ϵ-neighborhood. The default value is 0.5.

min_samples: the minimum number of points needed to form a high-density region. For a point to be a core point, its neighborhood (the circle of radius eps centered on the point, including points on the circle) must contain at least this many samples, counting the point itself.

A label of y = -1 marks an outlier.
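The effect of these parameters can be seen on a tiny example. The data below are hypothetical (two tight groups of three points plus one isolated point) and the values `eps=0.5`, `min_samples=3` are chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups of three points and one isolated point
X = np.array([
    [0.0, 0.0], [0.0, 0.3], [0.3, 0.0],
    [5.0, 5.0], [5.0, 5.3], [5.3, 5.0],
    [10.0, 0.0],
])

model = DBSCAN(eps=0.5, min_samples=3)
labels = model.fit_predict(X)

print(labels)                      # [ 0  0  0  1  1  1 -1]
print(model.core_sample_indices_)  # [0 1 2 3 4 5]
print(int(np.sum(labels == -1)))   # 1
```

Every point in a group has all three group members (itself included) within distance 0.5, so all six are core points. The isolated point has only itself in its neighborhood and receives the label -1.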

Because the number of clusters that DBSCAN produces is not known in advance, define a function that searches for the parameters that yield the specified number of clusters.

Among the parameter combinations that do, the best one is the one with the fewest outliers.

# Search for the best parameters
def search_best_parameter(N_clusters, X):
    # Start above any possible outlier count so the first valid combination is accepted
    min_outliers = len(X) + 1
    best_eps = 0
    best_min_samples = 0
    # Try different eps values
    for eps in np.arange(0.001, 1, 0.05):
        # Try different min_samples values
        for min_samples in range(2, 10):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            # Fit the model
            y = dbscan.fit_predict(X)
            # Number of clusters for this parameter combination (-1 marks outliers, not a cluster)
            n_clusters = len(np.unique(y[y != -1]))
            # Number of outliers
            outliers = int(np.sum(y == -1))
            if outliers < min_outliers and n_clusters == N_clusters:
                min_outliers = outliers
                best_eps = eps
                best_min_samples = min_samples
    return best_eps, best_min_samples

# Cluster on the x and y coordinates only; column 2 holds the true label
eps, min_samples = search_best_parameter(2, w[:, :2])
dbscan = DBSCAN(eps=eps, min_samples=min_samples)
y = dbscan.fit_predict(w[:, :2])

# The extra last entry colors outliers: the label -1 indexes the last element
colors = np.array(['black', 'red', 'gray'])
plt.figure(figsize=(15, 15))
plt.subplot(2, 2, 1)
# One vectorized scatter call per panel instead of one call per point
plt.scatter(w[:, 0], w[:, 1], color=colors[w[:, 2].astype(int)])
plt.title("Raw data")
plt.subplot(2, 2, 2)
plt.scatter(w[:, 0], w[:, 1], color=colors[y])
plt.title("Data after clustering")
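Besides inspecting the plots, the two results can be compared numerically against the true labels. The sketch below is not part of the original task: it uses `adjusted_rand_score` from scikit-learn as the agreement measure, rebuilds the double-square points with a hypothetical helper `side_points`, and fixes `eps=0.1`, `min_samples=5` by hand instead of running the parameter search.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score

def side_points(lo, hi, n):
    """Hypothetical helper: 4*n points on the outline of the square [lo, hi]^2."""
    t = np.linspace(lo, hi, n, endpoint=False)
    lo_col, hi_col = np.full(n, float(lo)), np.full(n, float(hi))
    sides = [(t, lo_col), (lo_col, t), (t, hi_col), (hi_col, t)]
    return np.vstack([np.column_stack(side) for side in sides])

# Outer square (label 0) and inner square (label 1), spaced 0.01 apart along each side
points = np.vstack([side_points(1, 10, 900), side_points(3, 8, 500)])
truth = np.array([0] * 3600 + [1] * 2000)

# n_init is set explicitly so the call behaves the same across scikit-learn versions
kmeans_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(points)
# Assumed parameters: eps well above the 0.01 point spacing, well below the gap of 2
dbscan_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(points)

ari_kmeans = adjusted_rand_score(truth, kmeans_labels)
ari_dbscan = adjusted_rand_score(truth, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN  ARI: {ari_dbscan:.3f}")
```

An adjusted Rand index of 1 means the clustering matches the true labels exactly, while a value near 0 means agreement no better than chance. K-Means separates its two clusters with a straight boundary, so it cannot isolate one square nested inside the other, whereas DBSCAN follows the connected outline of each square.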