The curse of dimensionality
2022-07-26 13:10:00 【FakeOccupational】
The distance measurement problem
Distance-based models such as KNN and K-means are hit hardest. They need either effective dimensionality reduction or a large amount of training data in order to discover the low-dimensional manifold on which the data lies.
Theorem [Beyer et al. 99]: Fix $\epsilon > 0$ and $N$. If the data is "truly high-dimensional", then under fairly weak assumptions on the distribution of the data:
$$\lim_{D\rightarrow \infty} \Pr[d_{max}(N,D)\leq (1+\epsilon)\,d_{min}(N,D)] = 1$$
Here $d_{max}$ and $d_{min}$ are the largest and smallest distances from a query point to the $N$ data points. As the dimension D increases, the relative gap between the maximum and minimum distances vanishes, so distance can no longer effectively distinguish data points.
Differences in Euclidean distance become negligible
As the dimension d rises, the relative difference between the maximum and minimum Euclidean distances approaches 0, which is why KNN converges very slowly in high dimensions:
$$\lim_{d\rightarrow \infty} E\left(\frac{dist_{max}(d)-dist_{min}(d)}{dist_{min}(d)}\right) = 0$$
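The limit above can be checked empirically. The following is a minimal sketch, not taken from the original post: it assumes points drawn uniformly from the unit cube and a single random query point, and the function names `relative_contrast` and `mean_relative_contrast` are hypothetical. It estimates the relative contrast $(dist_{max} - dist_{min}) / dist_{min}$ for several dimensions.

```python
import math
import random


def relative_contrast(n_points, dim, rng):
    """(d_max - d_min) / d_min for distances from one random query point
    to n_points random points, all uniform in the unit cube [0, 1]^dim."""
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


def mean_relative_contrast(n_points, dim, trials=10, seed=0):
    """Average the relative contrast over several independent trials."""
    rng = random.Random(seed)
    total = sum(relative_contrast(n_points, dim, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # The uniform distribution and the sample sizes are assumptions
    # made for this illustration only.
    for dim in (2, 10, 100, 1000):
        print(dim, round(mean_relative_contrast(200, dim), 3))
```

Under these assumptions the printed contrast should shrink steadily as the dimension grows, which is the behaviour the theorem describes. Real data that lies near a low-dimensional manifold may concentrate much less.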
Related papers
In high dimension, there is no such thing as interpolation. In high dimension, everything is extrapolation.
The problem of data demand
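For a linear classifier, we keep adding dimensions (features) until the classes become linearly separable.
However, every added dimension enlarges the feature space, so more training data is needed to cover it.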

Space size problem
Volume of unit hypersphere
$$V_n = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+1\right)}$$
As the dimension n increases, the hypersphere's share of the volume of its enclosing hypercube tends to 0, so more and more of the volume lies in the corners of the cube, outside the hypersphere.
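The formula is easy to evaluate directly. The sketch below assumes the enclosing hypercube is $[-1, 1]^n$ with volume $2^n$ (the original text does not spell out which cube is meant), and the helper names `unit_ball_volume` and `ball_to_cube_ratio` are hypothetical.

```python
import math


def unit_ball_volume(n):
    """Volume of the unit hypersphere in n dimensions:
    V_n = pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def ball_to_cube_ratio(n):
    """Share of the enclosing hypercube [-1, 1]^n (volume 2^n, an
    assumption of this sketch) that the unit hypersphere occupies."""
    return unit_ball_volume(n) / 2.0 ** n


for n in (1, 2, 3, 5, 10, 20):
    print(n, unit_ball_volume(n), ball_to_cube_ratio(n))
```

For n = 2 and n = 3 the function reproduces the familiar values $\pi$ and $4\pi/3$. The ratio to the cube falls quickly, and the volume $V_n$ itself also tends to 0 once n grows past 5.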
Computer Vision for Dummies: https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/ (Chinese translation)
References and further reading
Zhihu: How to understand the curse of dimensionality?
The curse of dimensionality
Curse_of_dimensionality
Hypersphere volume calculation
Why does the curse of dimensionality occur, and how can it be addressed?
Concentration of spherical Gaussians
Why is the Gaussian distribution in high-dimensional space like a soap bubble?

Comparing the same distribution in low and high dimensions shows that the center of high-dimensional space empties out.
As the dimension increases, the share of the space near the center shrinks to zero. Even a Gaussian distribution then looks like a soap bubble, with almost all of its probability mass in a thin shell away from the center, rather than a solid ball with most of its mass near the center.
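The soap-bubble picture can be checked by sampling. This is a minimal sketch under the assumption of a standard Gaussian (zero mean, identity covariance); the function names `gaussian_norms` and `shell_summary` are hypothetical. It measures how far samples lie from the origin and how tightly those radii cluster.

```python
import math
import random
import statistics


def gaussian_norms(dim, n_samples=500, seed=0):
    """Euclidean norms of n_samples draws from a standard Gaussian in R^dim."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(n_samples)
    ]


def shell_summary(dim, n_samples=500, seed=0):
    """Return (mean radius / sqrt(dim), relative spread of the radius)."""
    norms = gaussian_norms(dim, n_samples, seed)
    mean = statistics.fmean(norms)
    return mean / math.sqrt(dim), statistics.stdev(norms) / mean


for dim in (2, 20, 200):
    ratio, spread = shell_summary(dim)
    print(dim, round(ratio, 3), round(spread, 3))
```

In high dimensions the mean radius should sit close to $\sqrt{d}$ while the relative spread shrinks, so the samples form a thin shell even though the density function itself still peaks at the origin.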
k-nearest neighbor classification
Another effect of high dimensionality on distance functions concerns k-nearest neighbor (k-NN) graphs built from a dataset using a distance function. As the dimension increases, the indegree distribution of the k-NN digraph becomes skewed, with a peak on the right, because a disproportionate number of hubs emerge, that is, data points that appear in many more k-NN lists of other data points than the average. This phenomenon can have a considerable impact on various classification techniques (including the k-NN classifier), semi-supervised learning, and clustering,[19] and it also affects information retrieval.[20]
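Hubness can be observed by counting how often each point appears in other points' k-NN lists. The sketch below is an illustration only: it assumes i.i.d. standard Gaussian data and brute-force neighbor search, and the names `knn_indegrees`, `skewness`, and `indegree_skewness` are hypothetical.

```python
import math
import random


def knn_indegrees(points, k):
    """For each point, count how many other points list it among
    their k nearest neighbors (the indegree in the k-NN digraph)."""
    n = len(points)
    indegree = [0] * n
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            indegree[j] += 1
    return indegree


def skewness(values):
    """Third standardized moment of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def indegree_skewness(dim, n_points=400, k=5, seed=0):
    """Skewness of the k-NN indegree distribution for Gaussian data
    (the data distribution is an assumption of this sketch)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    return skewness(knn_indegrees(points, k))


for dim in (2, 100):
    print(dim, round(indegree_skewness(dim), 3))
```

The mean indegree is always exactly k, so a larger positive skewness means a few points (the hubs) absorb far more than their share of neighbor-list slots. With these settings the skewness should be noticeably larger at 100 dimensions than at 2.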
Anomaly detection
In a 2012 survey, Zimek et al. identified the following problems when searching for anomalies in high-dimensional data:[13]
Concentration of scores and distances: derived values such as distances become numerically similar
Irrelevant attributes: in high-dimensional data, a significant number of attributes may be irrelevant
Definition of reference sets: for local methods, reference sets are often nearest-neighbor based
Incomparable scores for different dimensionalities: different subspaces produce incomparable scores
Interpretability of scores: the scores often no longer convey a semantic meaning
Exponential search space: the search space can no longer be systematically scanned
Data snooping bias: given the large search space, a hypothesis can be found for every desired significance
Hubness: certain objects occur more frequently in neighbor lists than others
Many of the specialized methods analyzed tackle one or another of these problems, but many open research questions remain.
The blessing of dimensionality
Surprisingly, and despite the expected "curse of dimensionality" difficulties, common-sense heuristics based on the most straightforward methods "can yield results which are almost surely optimal" for high-dimensional problems.[21] The term "blessing of dimensionality" was introduced in the late 1990s.[21] Donoho, in his "Millennium manifesto", clearly explained why the "blessing of dimensionality" will form a basis of future data mining.[22] The effects of the blessing of dimensionality have been discovered in many applications and find their foundation in concentration of measure phenomena.[23] One example of the blessing of dimensionality is the linear separability of a random point from a large finite random set, which holds with high probability even if the set is exponentially large: the number of elements in this random set can grow exponentially with the dimension. Moreover, the separating linear functional can be chosen in the form of the simplest linear Fisher discriminant. This separability theorem has been proven for a wide class of probability distributions: general uniformly log-concave distributions, product distributions in a cube, and many other families (reviewed recently in [23]).
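The separability claim can be illustrated numerically. The sketch below makes several assumptions that are not in the original text: the data is i.i.d. standard Gaussian (so it is already centered and whitened in expectation), and a point x counts as Fisher-separable from the rest when $\langle x, y\rangle < \alpha \langle x, x\rangle$ for every other point y, with a threshold $\alpha = 0.8$ chosen only for illustration. The function names are hypothetical.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fisher_separable_fraction(dim, n_points=200, alpha=0.8, seed=0):
    """Fraction of points x that satisfy <x, y> < alpha * <x, x> for
    every other point y, i.e. that a single linear functional defined
    by x itself separates from all remaining points."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    separable = 0
    for i, x in enumerate(points):
        threshold = alpha * dot(x, x)
        if all(dot(x, y) < threshold for j, y in enumerate(points) if j != i):
            separable += 1
    return separable / n_points


for dim in (2, 10, 100):
    print(dim, fisher_separable_fraction(dim))
```

Under these assumptions almost no point is separable in 2 dimensions, while in 100 dimensions nearly every point should be. This is a simplified check of the criterion, not a reproduction of the theorem or its proof.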
"The blessing of dimensionality and the curse of dimensionality are two sides of the same coin."[24] For example, a typical property of essentially high-dimensional probability distributions in a high-dimensional space is that the squared distance from random points to a selected point is, with high probability, close to the average (or median) squared distance. This property significantly simplifies the expected geometry of the data and the indexing of high-dimensional data (blessing),[25] but at the same time it makes similarity search in high dimensions difficult or even useless (curse).[26]
Zimek et al.[13] noted that while the typical formalizations of the curse of dimensionality affect i.i.d. data, data that is separated in each attribute becomes easier to handle even in high dimensions, and argued that the signal-to-noise ratio matters: data becomes easier with each attribute that adds signal, and harder with attributes that only add noise (irrelevant error). In unsupervised data analysis in particular, this effect is known as swamping.