Clustering, dimensionality reduction and metric learning for machine learning
2022-06-21 19:44:00 【WihauShe】
Clustering
Clustering tasks
The goal of "unsupervised learning" is to reveal the intrinsic properties and regularities of data by learning from unlabeled training samples, providing a basis for further data analysis.
Clustering attempts to divide the samples in a dataset into several usually disjoint subsets, each of which is called a "cluster". Through such a division, each cluster may correspond to some underlying concept (category). These concepts are unknown to the clustering algorithm in advance; clustering can only form the cluster structure automatically, and the semantics of each cluster must be interpreted and named by the user.
Performance metrics
Clustering performance measures are also called clustering "validity indices".
Categories: comparing the clustering result against a "reference model" yields "external indices"; evaluating the clustering result directly without any reference model yields "internal indices".
Commonly used external indices:
- Jaccard Coefficient (JC)
- Fowlkes and Mallows Index (FMI)
- Rand Index (RI)
Commonly used internal indices:
- Davies-Bouldin Index (DBI)
- Dunn Index (DI)
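For illustration, here is a minimal, self-contained Python sketch of how the three external indices can be computed from a clustering result and a reference model, using the usual pair-counting definitions; the helper name `pair_counts` and the toy labels are made up for this example.

```python
from itertools import combinations

def pair_counts(pred, ref):
    """Count sample pairs by agreement between a clustering and a reference model."""
    a = b = c = d = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_ref = ref[i] == ref[j]
        if same_pred and same_ref:
            a += 1      # same cluster in both
        elif same_pred:
            b += 1      # same cluster in result, different in reference
        elif same_ref:
            c += 1      # different in result, same in reference
        else:
            d += 1      # different in both
    return a, b, c, d

pred = [0, 0, 1, 1, 2, 2]   # clustering result (cluster labels)
ref  = [0, 0, 1, 1, 1, 2]   # reference model (e.g. ground-truth classes)
a, b, c, d = pair_counts(pred, ref)
m = len(pred)

jc  = a / (a + b + c)                        # Jaccard Coefficient
fmi = (a / (a + b) * a / (a + c)) ** 0.5     # Fowlkes-Mallows Index
ri  = 2 * (a + d) / (m * (m - 1))            # Rand Index
print(jc, fmi, ri)
```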
Distance calculation
A distance measure must satisfy several basic properties: non-negativity, identity, symmetry, and the triangle inequality. Common choices include the Minkowski distance, of which the Euclidean distance (p = 2) and the Manhattan distance (p = 1) are special cases.
Attribute types:
- Continuous attributes: infinitely many possible values on the domain
- Discrete attributes: a finite number of possible values on the domain
- Ordinal attributes: distances can be computed directly on the attribute values
- Non-ordinal attributes: distances cannot be computed directly on the attribute values
Ordinal attributes can use the Minkowski distance; non-ordinal attributes can use the VDM (Value Difference Metric).
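As a small illustration, a sketch of the Minkowski distance in Python (assuming NumPy is available); p = 2 gives the Euclidean distance and p = 1 the Manhattan distance:

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Minkowski distance: (sum_u |x_u - y_u|^p)^(1/p).
    p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski_distance(x, y, p=2))  # Euclidean
print(minkowski_distance(x, y, p=1))  # Manhattan
```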
Prototype clustering
Prototype clustering is also called "prototype-based clustering". These algorithms assume that the cluster structure can be characterized by a set of prototypes. Typically, the algorithm first initializes the prototypes and then solves by updating them iteratively. Common examples are listed below (a minimal k-means sketch follows the list).
- k-means algorithm
- Learning vector quantization
- Gaussian mixture clustering
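As a rough illustration of prototype clustering, here is a minimal k-means sketch in NumPy (the helper name `k_means` and the toy data are made up for this example); it alternates between assigning samples to the nearest prototype and recomputing each prototype as the mean of its assigned samples:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternately assign samples to the nearest mean
    vector (prototype) and recompute each mean from its assigned samples."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each sample to its closest prototype (Euclidean distance)
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None, :], axis=2), axis=1)
        # recompute each prototype; keep the old one if a cluster went empty
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = k_means(X, k=2)
```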
Density clustering
Density clustering is also called "density-based clustering". These algorithms assume that the cluster structure can be determined by how tightly the samples are distributed. Typically, a density-based algorithm examines the connectivity between samples from the perspective of sample density, and expands clusters from density-connected samples to obtain the final result. 【DBSCAN】
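A brief density-clustering sketch, assuming scikit-learn is available; the `eps` and `min_samples` values here are arbitrary illustration choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

# eps: neighbourhood radius; min_samples: density threshold for a core point.
# Clusters grow outward from core points through density-connected samples;
# label -1 marks samples regarded as noise.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.unique(labels))
```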
Hierarchical clustering
Hierarchical clustering attempts to divide the dataset at different levels, forming a tree-like cluster structure. The dataset can be divided with a "bottom-up" agglomerative strategy or a "top-down" divisive strategy. 【AGNES algorithm】
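A small bottom-up (AGNES-style) sketch using scikit-learn's agglomerative clustering, assuming scikit-learn is available; the linkage choice and toy data are only for illustration:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               rng.normal(3, 0.3, (30, 2)),
               rng.normal(6, 0.3, (30, 2))])

# Bottom-up aggregation: start with every sample as its own cluster and
# repeatedly merge the two closest clusters ("average" linkage uses the
# mean inter-cluster distance) until n_clusters remain.
labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)
print(np.bincount(labels))
```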
Dimensionality reduction and metric learning
k-nearest neighbor learning
Working mechanism of k-nearest neighbor (kNN) learning: given a test sample, find the k training samples closest to it under some distance measure, then predict based on the information from these k "neighbors".
- "Voting method": take the class label that appears most frequently among the k samples as the prediction
- "Averaging method": take the average of the real-valued outputs of the k samples as the prediction
- "Lazy learning": simply store the samples during the training phase (training-time cost is zero) and process them only after a test sample is received
- "Eager learning": learn a model from the samples during the training phase
Although the nearest neighbor classifier is simple, its generalization error rate is no more than twice that of the Bayes optimal classifier.
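A minimal kNN sketch in NumPy using the voting method; the function `knn_predict` and the toy data are hypothetical, not a library API:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a vote among its k nearest training samples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```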
Low-dimensional embedding
Curse of dimensionality: in high-dimensional settings, data samples are sparse and distance computation is difficult.
Dimensionality reduction: transform the original high-dimensional attribute space into a low-dimensional "subspace" through some mathematical transformation; in this subspace the sample density increases greatly and distance computation becomes much easier.
Low-dimensional embedding: in many cases, although the observed or collected data samples are high-dimensional, the distribution that matters for the learning task may be only low-dimensional.
Principal component analysis
For sample points in an orthogonal attribute space, use a hyperplane (the high-dimensional generalization of a line) to appropriately represent all samples:
- Minimal reconstruction error: the sample points are close enough to the hyperplane
- Maximum separability: the projections of the sample points onto the hyperplane are as spread out as possible
PCA only needs to keep the projection matrix W and the mean vector of the samples; a new sample can then be projected into the low-dimensional space with a simple vector subtraction followed by a matrix-vector multiplication. Discarding part of the information increases the sampling density of the samples and can also have a certain denoising effect.
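A rough PCA sketch in NumPy based on an eigendecomposition of the covariance matrix (one of several equivalent ways to obtain W); it shows that projecting a new sample only needs the stored W and mean:

```python
import numpy as np

def pca(X, d):
    """Project onto the d directions of largest variance.
    Only the projection matrix W and the sample mean need to be kept."""
    mean = X.mean(axis=0)
    Xc = X - mean                              # centre the samples
    cov = Xc.T @ Xc / len(X)                   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :d]                # top-d eigenvectors as columns
    return W, mean

X = np.random.randn(200, 5) @ np.random.randn(5, 5)
W, mean = pca(X, d=2)

# Projecting a new sample: one vector subtraction and one matrix-vector product.
x_new = np.random.randn(5)
z = W.T @ (x_new - mean)
```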
Kernelized linear dimensionality reduction
"Intrinsic" low-dimensional space: the low-dimensional space from which the data were "originally sampled"
Nonlinear dimensionality reduction: "kernelize" linear dimensionality reduction methods using the kernel trick
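A short kernelized-PCA (KPCA) sketch, assuming scikit-learn is available; the kernel, gamma, and toy data are illustrative choices:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Samples on a nonlinear (circular) structure that linear PCA cannot unfold.
theta = np.random.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * np.random.randn(200, 2)

# "Kernelize" linear dimensionality reduction: implicitly map the samples into
# a feature space via an RBF kernel, then perform PCA there.
Z = KernelPCA(n_components=1, kernel="rbf", gamma=2.0).fit_transform(X)
```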
Manifold learning
A "manifold" is a space that is locally homeomorphic to a Euclidean space: locally it has the properties of Euclidean space, so the Euclidean distance can be used to compute distances there.
If a low-dimensional manifold is embedded in a high-dimensional space, the distribution of the data samples looks very complex in the high-dimensional space, but locally it still has the properties of Euclidean space. It is therefore easy to establish a dimensionality reduction mapping locally, and then try to extend the local mappings to a global one.
Isometric mapping (Isomap)
Basic starting point: after a low-dimensional manifold is embedded in a high-dimensional space, directly computing straight-line distances in the high-dimensional space is misleading, because a straight line in the high-dimensional space may not be traversable on the low-dimensional embedded manifold.
We can exploit the fact that the manifold is locally homeomorphic to a Euclidean space: for each point, find its neighbors based on Euclidean distance, and build a nearest-neighbor connection graph in which neighboring points are connected and non-neighboring points are not. The problem of computing the geodesic distance between two points is thus transformed into computing the shortest path between them on the nearest-neighbor graph.
Constructing the nearest-neighbor graph: either specify the number of nearest neighbors, or specify a distance threshold (points within the threshold are regarded as neighbors).
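A sketch of the geodesic-distance idea, assuming scikit-learn and SciPy are available: build a k-nearest-neighbor graph and approximate geodesic distances by shortest paths on it (Isomap would then apply MDS to this distance matrix). The spiral data are an illustrative choice:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3 * np.pi, 100))
X = np.c_[t * np.cos(t), t * np.sin(t)]        # a 1-D manifold (spiral) in 2-D

# k-nearest-neighbour graph: edges only between neighbouring points.
G = kneighbors_graph(X, n_neighbors=6, mode="distance")

# Geodesic distance is approximated by the shortest-path distance on the graph.
D_geo = shortest_path(G, directed=False)
print(D_geo[0, -1], np.linalg.norm(X[0] - X[-1]))  # geodesic vs straight-line
```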
Local linear embedding
Locally linear embedding (LLE) attempts to preserve the linear reconstruction relationships among samples within a neighborhood.
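A brief LLE sketch, assuming scikit-learn is available; the neighborhood size and swiss-roll-like data are illustrative choices:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
t = rng.uniform(0, 3 * np.pi, 300)
X = np.c_[t * np.cos(t), rng.uniform(0, 5, 300), t * np.sin(t)]

# Each sample is reconstructed as a linear combination of its neighbours,
# and the low-dimensional embedding preserves those reconstruction weights.
Z = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
```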
Metric learning
Basic motivation: "learn" a distance measure that suits the task.
Mahalanobis distance: dist_mah^2(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)
where M is called the "metric matrix", and metric learning is the learning of M. To keep the distance nonnegative and symmetric, M must be a (semi-)positive definite symmetric matrix, i.e., there must exist an orthogonal basis P such that M can be written as M = PP^T.
Not only can the error rate be used as the optimization objective of metric learning; domain knowledge can also be incorporated into it.
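A small sketch of the (squared) Mahalanobis distance under a metric matrix M; the matrix P here is arbitrary, chosen only to construct a positive definite M for illustration:

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)

# Writing M = P P^T keeps M (semi-)positive definite and symmetric,
# so the resulting distance is nonnegative and symmetric.
P = np.array([[2.0, 0.0],
              [0.5, 1.0]])
M = P @ P.T

x_i, x_j = np.array([1.0, 2.0]), np.array([0.0, 0.0])
print(mahalanobis_sq(x_i, x_j, M))
```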