
Clustering, Dimensionality Reduction and Metric Learning in Machine Learning

2022-06-21 19:44:00 WihauShe

Clustering

Clustering tasks

In “unsupervised learning”, the goal is to reveal the intrinsic properties and regularities of the data by learning from unlabeled training samples, providing a basis for further data analysis.

  Clustering attempts to partition the samples in a dataset into several (usually disjoint) subsets, each called a “cluster”. Through such a partition, each cluster may correspond to some underlying concept (category); these concepts are unknown to the clustering algorithm in advance. Clustering can only form the cluster structure automatically; the conceptual semantics of each cluster must be grasped and named by the user.

Performance metrics

  Clustering performance measures are also called clustering “validity indices”.

  Two categories: comparing the clustering result against a “reference model” gives “external indices”; evaluating the clustering result directly, without any reference model, gives “internal indices”.

  Commonly used external indices (a small sketch computing them follows the list):

  • Jaccard coefficient (JC)
  • Fowlkes–Mallows index (FMI)
  • Rand index (RI)
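
All three indices can be computed by counting how sample pairs are grouped in the clustering result versus in the reference model. The sketch below is a minimal illustration, assuming the reference model is given as a second label array; the toy labels are made up.

```python
import numpy as np

def external_indices(labels_pred, labels_ref):
    """Pair-counting external clustering indices: JC, FMI, RI."""
    labels_pred = np.asarray(labels_pred)
    labels_ref = np.asarray(labels_ref)
    m = len(labels_pred)
    a = b = c = d = 0
    for i in range(m):
        for j in range(i + 1, m):
            same_pred = labels_pred[i] == labels_pred[j]   # same cluster in result
            same_ref = labels_ref[i] == labels_ref[j]      # same cluster in reference
            if same_pred and same_ref:
                a += 1
            elif same_pred:
                b += 1
            elif same_ref:
                c += 1
            else:
                d += 1
    jc = a / (a + b + c)
    fmi = a / np.sqrt((a + b) * (a + c))
    ri = 2 * (a + d) / (m * (m - 1))
    return jc, fmi, ri

# Hypothetical clustering result vs. reference labels.
print(external_indices([0, 0, 1, 1, 1], [0, 0, 1, 1, 0]))
```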

  Commonly used internal indices:

  • Davies–Bouldin index (DBI)
  • Dunn index (DI)
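
The Davies–Bouldin index (lower is better) is available in scikit-learn; the Dunn index is not included there. A minimal sketch, assuming scikit-learn is installed and using an illustrative synthetic dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Two well-separated synthetic blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("DBI:", davies_bouldin_score(X, labels))  # lower is better
```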

Distance calculation

  A distance measure needs to satisfy some basic properties: non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. The main examples are the Minkowski distance and its special cases, the Euclidean distance (p = 2) and the Manhattan distance (p = 1).

  Attribute types:

  • Continuous attributes: infinitely many possible values in the domain
  • Discrete attributes: a finite number of possible values in the domain
  • Ordered attributes: distances can be computed directly on the attribute values
  • Unordered attributes: distances cannot be computed directly on the attribute values

Ordered attributes can use the Minkowski distance; unordered attributes use the VDM (Value Difference Metric), as sketched below.
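
A minimal sketch of both computations. The attribute values and cluster assignments in the example are hypothetical.

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance for ordered (numeric) attributes;
    p=1 gives the Manhattan distance, p=2 the Euclidean distance."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

def vdm(values, clusters, a, b, p=2):
    """VDM_p between two values a, b of one unordered attribute.
    `values` holds the attribute value of every sample, `clusters` its cluster id."""
    values, clusters = np.asarray(values), np.asarray(clusters)
    total = 0.0
    for k in np.unique(clusters):
        # Fraction of samples with value a (resp. b) that fall into cluster k.
        frac_a = np.mean(clusters[values == a] == k) if np.any(values == a) else 0.0
        frac_b = np.mean(clusters[values == b] == k) if np.any(values == b) else 0.0
        total += abs(frac_a - frac_b) ** p
    return total

print(minkowski([0, 0], [3, 4], p=2))            # 5.0 (Euclidean)
colors = ["red", "red", "blue", "blue", "red"]   # hypothetical unordered attribute
cids = [0, 0, 1, 1, 1]                           # cluster each sample falls into
print(vdm(colors, cids, "red", "blue"))
```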

Prototype clustering

  Prototype clustering is also called “prototype-based clustering”. Such algorithms assume that the cluster structure can be characterized by a set of prototypes. Typically, the algorithm first initializes the prototypes and then updates them iteratively (a minimal k-means sketch follows the list below).

  • k-means algorithm
  • Learning vector quantization (LVQ)
  • Gaussian mixture clustering
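
As an illustration of the “construct prototypes, then update iteratively” scheme, here is a minimal k-means sketch; the toy data and k = 2 are illustrative choices, not part of any particular library API.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: initialize prototypes (cluster means), then alternate
    assignment and prototype update until the prototypes stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial prototypes
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each sample to its nearest prototype (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each prototype to the mean of the samples assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
centers, labels = kmeans(X, k=2)
print(centers)
```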

Density clustering

  Density clustering is also known as “density-based clustering”. Such algorithms assume that the cluster structure can be determined by how tightly the samples are distributed. Typically, a density-based algorithm examines the connectivity between samples from the perspective of sample density, and expands clusters from density-connected samples to obtain the final clustering result. [DBSCAN]
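
A short sketch using scikit-learn's DBSCAN on a non-convex toy dataset; the eps and min_samples values are illustrative, not tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-convex shape where density-based
# clustering works well. eps is the neighborhood radius, min_samples
# the density threshold; label -1 marks noise points.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}),
      "noise points:", int(np.sum(labels == -1)))
```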

Hierarchical clustering

  Hierarchical clustering tries to partition the dataset at different levels, forming a tree-like cluster structure. The dataset can be partitioned with a “bottom-up” agglomerative strategy or a “top-down” divisive strategy. [AGNES algorithm]
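
A bottom-up (AGNES-style) sketch using SciPy's agglomerative linkage; average linkage and the cut into two clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Agglomerative strategy: repeatedly merge the two closest clusters,
# building the tree-like structure bottom-up.
Z = linkage(X, method="average")                  # the merge tree
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```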

Dimensionality reduction and metric learning

k-nearest neighbor learning

 Working mechanism of k-nearest neighbor (kNN) learning: given a test sample, find the k training samples closest to it under some distance measure, then make predictions based on the information from these k “neighbors”.

  • “Voting method”: take the most frequent class label among the k samples as the prediction (classification)
  • “Averaging method”: take the mean of the real-valued outputs of the k samples as the prediction (regression)
  • “Lazy learning”: in the training phase the samples are simply stored (the training-time cost is zero); processing is deferred until a test sample arrives
  • “Eager learning”: methods that learn a model from the training samples during the training phase

  The nearest-neighbor classifier is simple, yet its generalization error rate does not exceed twice that of the Bayes optimal classifier. A short sketch follows.
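
A short kNN sketch with scikit-learn; k = 5 and the iris dataset are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# kNN classification: predictions come from majority voting among the
# k closest training samples.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)   # "lazy learning": fit essentially just stores the data
print("test accuracy:", knn.score(X_te, y_te))
```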

Low-dimensional embedding

  Curse of dimensionality: in high-dimensional settings, data samples are sparse and distance computation is difficult.

  Dimensionality reduction: transform the original high-dimensional attribute space into a low-dimensional “subspace” through some mathematical transformation; in this subspace the sample density is greatly increased and distance computation becomes much easier.
  Low-dimensional embedding: in many cases, although the observed or collected data samples are high-dimensional, only a low-dimensional distribution may be closely relevant to the learning task.

Principal component analysis

  For sample points in an orthogonal attribute space, we look for a hyperplane (the high-dimensional generalization of a line) that appropriately expresses all samples:

  • Minimum reconstruction error: the sample points are sufficiently close to the hyperplane
  • Maximum separability: the projections of the sample points onto the hyperplane are spread out as far as possible

 PCA only needs to keep the projection matrix W and the sample mean vector; a new sample can then be projected into the low-dimensional space by a simple vector subtraction followed by a matrix-vector multiplication. Although some information is discarded, this increases the sampling density of the samples and, to a certain extent, also has a denoising effect.
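
A minimal from-scratch PCA sketch showing that only W and the sample mean are needed to project a new sample; the random data and target dimension d = 2 are illustrative.

```python
import numpy as np

def pca_fit(X, d):
    """Center the data, eigendecompose the covariance matrix, and keep the
    top-d eigenvectors as the projection matrix W."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :d]              # top-d principal directions
    return W, mean

def pca_project(x_new, W, mean):
    # Only W and the mean are needed: subtract the mean, multiply by W.
    return (x_new - mean) @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W, mean = pca_fit(X, d=2)
print(pca_project(rng.normal(size=5), W, mean))
```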

Kernelized linear dimensionality reduction

The “intrinsic” low-dimensional space: the low-dimensional space from which the data were originally sampled.
Nonlinear dimensionality reduction: “kernelize” linear dimensionality reduction methods via the kernel trick.
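
Kernelized PCA (KPCA) is a standard instance of this idea. The sketch below contrasts it with plain PCA on a nonlinear toy dataset; the RBF kernel and gamma = 10 are illustrative choices.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Concentric circles cannot be separated by a linear projection; KPCA with an
# RBF kernel can "unfold" them where plain PCA cannot.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X)
print(X_kpca[:3], X_pca[:3], sep="\n")
```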

Manifold learning

 A “manifold” is a space that is locally homeomorphic to Euclidean space; it has the properties of Euclidean space locally, so Euclidean distance can be used for local distance computation.

  If a low-dimensional manifold is embedded in a high-dimensional space, the distribution of the data samples in the high-dimensional space looks very complicated, yet it still has the properties of Euclidean space locally. It is therefore easy to establish a dimensionality-reduction mapping locally, and then try to extend this local mapping to a global one.

  Isometric mapping (Isomap)
   Basic starting point: after a low-dimensional manifold is embedded in a high-dimensional space, it is misleading to compute straight-line distances directly in the high-dimensional space, because a straight-line path in the high-dimensional space may not be traversable on the low-dimensional embedded manifold.
   We can exploit the fact that the manifold is locally homeomorphic to Euclidean space: for each point, find its neighbors based on Euclidean distance and build a nearest-neighbor graph in which neighboring points are connected and non-neighboring points are not. The problem of computing the geodesic distance between two points is thus turned into computing the shortest path between them on the nearest-neighbor graph.
   Constructing the nearest-neighbor graph: either specify the number of nearest neighbors, or specify a distance threshold (points closer than the threshold are treated as neighbors). A short Isomap sketch follows.
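
A minimal Isomap sketch with scikit-learn; n_neighbors = 10 and the Swiss-roll dataset are illustrative choices.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Isomap: build a k-nearest-neighbor graph, approximate geodesic distances by
# shortest paths on that graph, then embed the points in low dimension.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(X_iso.shape)   # (1000, 2)
```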

  Locally linear embedding (LLE)
   LLE attempts to preserve the linear reconstruction relationships among samples within each neighborhood.
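
A corresponding LLE sketch, again with illustrative parameter choices.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# LLE: express each sample as a linear combination of its neighbors, then find
# low-dimensional coordinates that preserve those reconstruction weights.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)
X_lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2).fit_transform(X)
print(X_lle.shape)   # (1000, 2)
```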

Metric learning

  Basic motivation: “learn” a distance measure appropriate for the task at hand.
  Mahalanobis distance:

  dist^2_mah(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j)

where M is called the “metric matrix”, and metric learning is precisely the learning of M. To keep distances non-negative and symmetric, M must be a (semi-)positive definite symmetric matrix, i.e. there must exist an orthogonal basis P such that M can be written as M = PP^T.
Metric learning can take not only the error rate as its optimization objective; domain knowledge can also be incorporated into it. A small numeric sketch follows.
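
A minimal sketch of the Mahalanobis distance under a metric matrix M, also checking the M = PP^T factorization numerically; the random P here is purely for illustration (in practice M would be learned).

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance under a metric matrix M."""
    diff = np.asarray(x_i) - np.asarray(x_j)
    return diff @ M @ diff

# A (semi-)positive definite M can always be written as M = P P^T, so the
# metric distance equals the Euclidean distance after projecting with P^T.
rng = np.random.default_rng(0)
P = rng.normal(size=(3, 3))
M = P @ P.T
x_i, x_j = rng.normal(size=3), rng.normal(size=3)
print(mahalanobis_sq(x_i, x_j, M))
print(np.sum((P.T @ (x_i - x_j)) ** 2))   # same value, via the P^T projection
```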
