The curse of dimensionality
2022-07-26 13:10:00 【FakeOccupational】
Distance measurement problem
This matters for distance-based models such as KNN and K-means. They need either effective dimensionality reduction or a large amount of training data in order to discover the low-dimensional manifold on which the data lies.
Theorem [Beyer et al. 99]: Fix \epsilon > 0 and N. If the data is "truly high-dimensional", then under fairly weak assumptions on the distribution of the data:
\lim_{D\rightarrow \infty} Pr[d_{max}(N,D)\leq (1+\epsilon)d_{min}(N,D)] = 1
As the dimension D increases, the maximum distance between data points comes within an arbitrarily small factor of the minimum distance, so distance can no longer distinguish the data effectively.
Differences in Euclidean distance become negligible
As the dimension d rises, the relative difference between the maximum and minimum Euclidean distances approaches 0, so KNN converges very slowly:
\lim_{d\rightarrow \infty} E\left(\frac{dist_{max}(d)-dist_{min}(d)}{dist_{min}(d)}\right) = 0
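The limit above can be checked numerically. The following is a minimal sketch, not taken from the original post: it assumes points drawn uniformly from the unit cube [0, 1]^d and a single random query point, and the helper names relative_contrast and mean_relative_contrast are hypothetical. It estimates (dist_max - dist_min) / dist_min for a few dimensions.

```python
import math
import random


def relative_contrast(num_points, dim, rng):
    """Return (d_max - d_min) / d_min for the distances from one random
    query to num_points random points, all uniform in [0, 1]^dim."""
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


def mean_relative_contrast(num_points, dim, trials=10, seed=0):
    """Average the relative contrast over several independent trials."""
    rng = random.Random(seed)
    total = sum(relative_contrast(num_points, dim, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # Assumed settings for illustration: 200 points, uniform data.
    for dim in (2, 10, 100, 1000):
        print(dim, round(mean_relative_contrast(200, dim), 3))
```

Under these assumptions the printed contrast should fall steadily as the dimension grows, from well above 1 in two dimensions to a small fraction of 1 in a thousand dimensions.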
Related papers
In high dimension, there is no such thing as interpolation. In high dimension, everything is extrapolation.
The problem of data demand
For a linear classifier to separate the data, we may need to keep adding dimensions.
Each added dimension, however, means that more training data is needed.

Space size problem
Volume of unit hypersphere
\begin{equation}V_n = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+1\right)}\end{equation}
Most of the volume therefore lies in the space around the hypersphere: as the dimension n increases, the fraction of the space occupied by the hypersphere tends to 0.
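The formula for V_n is easy to evaluate directly. The sketch below is only an illustration: it takes the cube [-1, 1]^n that encloses the unit hypersphere as the reference space, which is an assumption (the text does not name the surrounding space), and the function names are hypothetical.

```python
import math


def unit_ball_volume(n):
    """Volume of the unit hypersphere in n dimensions:
    V_n = pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def ball_to_cube_ratio(n):
    """Fraction of the enclosing cube [-1, 1]^n (volume 2^n, an assumed
    reference space) that the unit hypersphere occupies."""
    return unit_ball_volume(n) / 2 ** n


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10, 20):
        print(n, unit_ball_volume(n), ball_to_cube_ratio(n))
```

For n = 2 the ratio is pi / 4, the familiar area of a disc inside its bounding square. By n = 20 the ratio is below one in a million, so almost all of the cube lies in its corners.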
Computer Vision for Dummies: https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/ (Chinese translation)
References and further reading
Zhihu: How to understand the curse of dimensionality?
Curse of dimensionality
Curse_of_dimensionality
Hypersphere volume calculation
Why does the curse of dimensionality occur, and how can it be addressed?
Concentration of spherical Gaussians
Why is the Gaussian distribution in high-dimensional space like a soap bubble

As the comparison between low- and high-dimensional distributions shows, the center of a high-dimensional space empties out.
In high-dimensional space, as the dimension increases, the share of the space near the center shrinks to zero, and even a Gaussian distribution looks like a soap bubble (little probability mass at the center, most of it in a thin shell near the edge) rather than a ball of mold that is densest at the center.
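The soap-bubble picture can be illustrated by sampling. The sketch below is not from the original post and makes two assumptions: the distribution is a standard Gaussian (zero mean, identity covariance), and the helper names gaussian_norms and summarize are hypothetical. It measures how far samples lie from the origin, relative to sqrt(d), and how much that distance varies.

```python
import math
import random


def gaussian_norms(dim, num_samples, seed=0):
    """Euclidean norms of num_samples draws from a standard Gaussian in R^dim."""
    rng = random.Random(seed)
    norms = []
    for _ in range(num_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norms.append(math.sqrt(sum(x * x for x in sample)))
    return norms


def summarize(dim, num_samples=500):
    """Return (mean norm / sqrt(dim), std of norm / mean norm)."""
    norms = gaussian_norms(dim, num_samples)
    mean = sum(norms) / len(norms)
    var = sum((r - mean) ** 2 for r in norms) / len(norms)
    return mean / math.sqrt(dim), math.sqrt(var) / mean


if __name__ == "__main__":
    for dim in (1, 10, 100, 1000):
        ratio, spread = summarize(dim)
        print(dim, round(ratio, 3), round(spread, 3))
```

In this setup the first number should approach 1 and the second should shrink toward 0 as the dimension grows: the samples sit in a thin shell of radius about sqrt(d), even though the density itself is still highest at the origin.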
k-nearest neighbor classification
Another effect of high dimensionality on distance functions concerns k-nearest neighbor (k-NN) graphs built from a data set using a distance function. As the dimension increases, the in-degree distribution of the k-NN digraph becomes skewed, with a peak on the right, because a disproportionate number of hubs emerge, that is, data points that appear in many more k-NN lists of other data points than the average. This phenomenon affects various classification techniques (including the k-NN classifier), semi-supervised learning, and clustering,[19] and it also affects information retrieval.[20]
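The in-degree distribution described here can be computed by brute force on a small sample. The following is a rough sketch under stated assumptions: the data are i.i.d. standard Gaussian points, distances are Euclidean, and the names knn_indegrees, skewness and gaussian_points are hypothetical helpers rather than any library's API.

```python
import math
import random


def knn_indegrees(points, k):
    """In-degree of every node in the directed k-NN graph of `points`.

    Node j gets one incoming edge for each point i that lists j among
    its k nearest neighbors (Euclidean distance, self excluded).
    """
    n = len(points)
    indegree = [0] * n
    for i in range(n):
        others = [(math.dist(points[i], points[j]), j) for j in range(n) if j != i]
        others.sort()
        for _, j in others[:k]:
            indegree[j] += 1
    return indegree


def skewness(values):
    """Skewness (third standardized moment) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def gaussian_points(n, dim, seed=0):
    """n i.i.d. standard Gaussian points in R^dim (assumed data model)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    k = 5
    for dim in (3, 100):
        deg = knn_indegrees(gaussian_points(300, dim), k)
        print(dim, "max in-degree:", max(deg), "skewness:", round(skewness(deg), 2))
```

The average in-degree is always k, so hubs show up as a long right tail: a larger maximum in-degree and a more positive skewness are what one would expect to see for the higher-dimensional sample.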
Anomaly detection
In a 2012 survey, Zimek et al. identified the following problems when searching for anomalies in high-dimensional data:[13]
Concentration of scores and distances: derived values such as distances become numerically similar
Irrelevant attributes: in high-dimensional data, a large number of attributes may be irrelevant
Definition of reference sets: for local methods, reference sets are usually based on nearest neighbors
Incomparable scores for different dimensionalities: different subspaces produce incomparable scores
Interpretability of scores: the scores often no longer convey a semantic meaning
Exponential search space: the search space can no longer be scanned systematically
Data snooping bias: given the large search space, a hypothesis can be found for every desired significance
Hubness: some objects appear more frequently in neighbor lists than others.
Many of the specialized methods analyzed tackle one or another of these problems, but many open research questions remain.
Blessing of dimensionality
Surprisingly, and despite the expected "curse of dimensionality" difficulties, common-sense heuristics based on the most straightforward methods "can yield results which are almost surely optimal" for high-dimensional problems.[21] The term "blessing of dimensionality" was introduced in the late 1990s.[21] Donoho, in his "Millennium manifesto", clearly explained why the "blessing of dimensionality" will form a basis of future data mining.[22] The effects of the blessing of dimensionality have been discovered in many applications and have their foundation in the concentration of measure phenomena.[23] One example of the blessing of dimensionality is the linear separability of a random point from a large finite random set, which holds with high probability even if this set is exponentially large: the number of elements in the random set can grow exponentially with the dimension. Moreover, the separating linear functional can be chosen in the form of the simplest linear Fisher discriminant. This separability theorem has been proven for a wide class of probability distributions: general uniformly log-concave distributions, product distributions in a cube, and many other families (reviewed recently in [23]).
"The blessing of dimensionality and the curse of dimensionality are two sides of the same coin."[24] For example, the typical property of essentially high-dimensional probability distributions in a high-dimensional space is that the squared distance from random points to a selected point is, with high probability, close to the average (or median) squared distance. This property significantly simplifies the expected geometry of the data and the indexing of high-dimensional data (blessing),[25] but at the same time it makes similarity search in high dimensions difficult or even useless (curse).[26]
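The concentration of squared distances around their average can be seen in a small experiment. This is a hedged sketch, not the construction used in the cited works: it assumes points uniform in [0, 1]^d, takes the origin as the selected point, and uses a hypothetical helper squared_distance_cv that reports the standard deviation of the squared distances divided by their mean.

```python
import random


def squared_distance_cv(dim, num_points=500, seed=0):
    """Coefficient of variation of the squared distances from uniform random
    points in [0, 1]^dim to one fixed reference point (the origin here)."""
    rng = random.Random(seed)
    sq = [sum(rng.random() ** 2 for _ in range(dim)) for _ in range(num_points)]
    mean = sum(sq) / len(sq)
    var = sum((s - mean) ** 2 for s in sq) / len(sq)
    return var ** 0.5 / mean


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(squared_distance_cv(dim), 3))
```

For independent coordinates this ratio decays roughly like 1 / sqrt(d), which is the sense in which nearly every point ends up about the same distance from the selected point.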
Zimek et al.[13] noted that while the typical formalizations of the curse of dimensionality affect i.i.d. data, data that is separated in each attribute becomes easier to handle even in high dimensions, and they argued that the signal-to-noise ratio matters: data becomes easier with each attribute that adds signal, and harder with each attribute that only adds noise (irrelevant error). For unsupervised data analysis in particular, this effect is known as swamping.