t-SNE dimensionality reduction

2022-07-29 01:34:00 【51CTO】

1. SNE principle

The basic idea: SNE maps data points to probability distributions through an affine transformation, in two steps:

First, SNE constructs a probability distribution over pairs of high-dimensional objects, so that similar objects have a high probability of being selected and dissimilar objects a low one.
Second, SNE constructs a probability distribution over the corresponding points in the low-dimensional space and makes the two distributions as similar as possible.

t-SNE is an unsupervised dimensionality reduction method. Unlike k-means and similar algorithms, it does not learn a model that can then be applied to other data. (k-means learns k centroids through training, and those can be used on other datasets; t-SNE can only be run on each dataset separately.)

Derivation: SNE first converts Euclidean distances into conditional probabilities that express the similarity between points. Concretely, given N high-dimensional data points $x_1, \dots, x_N$ (N is the number of points, not the dimension), SNE first computes the probability $p_{j|i}$, which is proportional to the similarity between $x_i$ and $x_j$:

$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / (2\sigma_i^2))}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / (2\sigma_i^2))}$

The parameter $\sigma_i$ takes a different value for each $x_i$; its value is tied to the perplexity Perp that appears in section 2.4. In addition, $p_{i|i} = 0$, because only the similarity between pairs of distinct points matters. For the low-dimensional points $y_i$, the Gaussian's $\sigma$ can be fixed at $1/\sqrt{2}$, which gives the similarity

$q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$

Again, $q_{i|i} = 0$.
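The conditional probabilities described above (a Gaussian kernel centred on $x_i$, normalized over all other points, with the diagonal forced to 0) can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `conditional_probabilities` and the choice of passing every $\sigma_i$ in explicitly are assumptions made for the example.

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Return P with P[i, j] = p_{j|i} for a Gaussian kernel centred on x_i."""
    X = np.asarray(X, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    # Squared Euclidean distance between every pair of rows.
    diff = X[:, None, :] - X[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    # Row i uses its own bandwidth sigma_i.
    logits = -sq_dists / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)            # enforces p_{i|i} = 0
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability only
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                      # 5 toy points in 3 dimensions
P = conditional_probabilities(X, sigmas=np.ones(5))
```

Each row of `P` sums to 1 and its diagonal is 0. Calling the same function on low-dimensional points with every sigma set to `1 / np.sqrt(2)` makes the exponent `-||y_i - y_j||^2`, which is the form of $q_{j|i}$.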

If the dimensionality reduction works well and local structure is preserved, then $p_{j|i} = q_{j|i}$. SNE therefore minimizes the KL divergence between the two distributions. The objective function is:

$C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$

Here $P_i$ denotes the conditional probability distribution over all other data points given the point $x_i$, and $Q_i$ is its low-dimensional counterpart. KL divergence is asymmetric, so different kinds of distance error in the low-dimensional map are penalized differently. Specifically, using two distant map points to represent two nearby data points incurs a large cost, while using two nearby map points to represent two distant data points incurs only a small cost. For example, modeling a large $p_{j|i}$ with a small $q_{j|i}$ contributes $p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$, which is large and positive, whereas modeling a small $p_{j|i}$ with an equally large $q_{j|i}$ contributes a term of much smaller magnitude. SNE therefore tends to preserve the local structure of the data.
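To make the asymmetry concrete, the per-pair cost term $p \log(p/q)$ can be evaluated for the two cases. The values 0.8 and 0.2 are illustrative numbers chosen for this sketch, not values stated in the text, and the natural logarithm is assumed.

```python
import math

def kl_term(p, q):
    """Contribution p * log(p / q) of one pair to the SNE cost (natural log)."""
    return p * math.log(p / q)

# Nearby data points (large p) mapped far apart (small q): heavily penalized.
near_mapped_far = kl_term(0.8, 0.2)
# Distant data points (small p) mapped close together (large q): small term.
far_mapped_near = kl_term(0.2, 0.8)

print(round(near_mapped_far, 3), round(far_mapped_near, 3))  # 1.109 -0.277
```

The first term is about four times larger in magnitude than the second, which is why the optimization cares more about keeping neighbours together than about keeping distant points apart.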

2 t-SNE

SNE is hard to optimize and suffers from the crowding problem. t-SNE differs in two ways: it uses symmetric SNE, which simplifies the gradient formula, and in the low-dimensional space it uses the heavier-tailed t distribution instead of a Gaussian to express the similarity between two points, which alleviates crowding.

2.1 Symmetric SNE

Instead of minimizing the KL divergence between the conditional distributions $p_{j|i}$ and $q_{j|i}$, one can use joint probability distributions: P is the joint distribution over pairs of points in the high-dimensional space and Q is the one in the low-dimensional space. The objective function is

$C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$

Here $p_{ii}$ and $q_{ii}$ are both 0. This variant is called symmetric SNE because for any i and j, $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$. The probabilities can therefore be written as:

$p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / (2\sigma^2))}{\sum_{k \neq l} \exp(-\|x_k - x_l\|^2 / (2\sigma^2))}, \quad q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_k - y_l\|^2)}$

This definition runs into trouble with outliers. If $x_i$ is an outlier, $\|x_i - x_j\|^2$ is large for every j, so every $p_{ij}$ is very small and the position of the map point $y_i$ has very little influence on the cost. To solve this, the joint probability is redefined as $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$, the form used in the algorithm in section 2.4, where n is the number of data points.
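The effect of the symmetrized definition $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$ on an outlier can be checked numerically. The sketch below is only an illustration: the helper names, the shared bandwidth `sigma=1.0`, and the toy data with one shifted point are all assumptions. Because each row of the conditional matrix sums to 1, every point ends up with total mass of at least $1/(2n)$, so even an outlier contributes to the cost.

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """P[i, j] = p_{j|i} with one shared bandwidth (a simplification)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def symmetrize(P_cond):
    """p_ij = (p_{j|i} + p_{i|j}) / 2n."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X[0] += 50.0                       # turn x_0 into an outlier

P = symmetrize(conditional_p(X))
row_mass = P.sum(axis=1)           # sum_j p_ij for every point i
print(row_mass[0] >= 1.0 / (2 * 20))
```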

2.2 Crowding problem

Clusters get squeezed together and become indistinguishable. For example, high-dimensional data may be represented well after being reduced to 10 dimensions, yet a faithful map may be impossible after reduction to 2 dimensions.

One remedy: the slight repulsion method.

2.3 t-SNE

On top of symmetric SNE, there is another way to alleviate crowding: in the high-dimensional space a Gaussian distribution converts distances into probabilities, while in the low-dimensional space a t distribution does so, so that small and moderate distances in the high-dimensional space map to larger distances in the low-dimensional space.

The t distribution is less affected by outliers, gives a more reasonable fit, and captures the overall characteristics of the data better.

The t-SNE gradient update has two advantages:

Dissimilar points that are placed a small distance apart in the map produce a large gradient that pushes them apart.

This repulsion does not grow without bound (because of the denominator in the gradient), which keeps dissimilar points from being pushed too far apart.

2.4 The algorithm process

Data: $X = \{x_1, \dots, x_n\}$

Cost function parameter: the perplexity Perp

Optimization parameters: number of iterations T, learning rate $\eta$, momentum $\alpha(t)$

Result: a low-dimensional data representation $Y^{(T)} = \{y_1, \dots, y_n\}$

Begin optimization:

Compute the conditional probabilities $p_{j|i}$ for the given Perp

Set $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$

Initialize Y randomly from $N(0, 10^{-4} I)$

Iterate from t = 1 to T:

Compute the low-dimensional $q_{ij}$, compute the gradient, and update $Y^{(t)}$

end
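The procedure above can be sketched end to end in NumPy. This is a simplified illustration under stated assumptions, not the reference algorithm: a single fixed `sigma` replaces the perplexity-based choice of each $\sigma_i$, the momentum is held constant at 0.5, the update is a standard gradient-descent-with-momentum step (the text does not spell the update out), and the function names and default hyperparameters are invented for the example.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distance between every pair of rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def joint_p(X, sigma=1.0):
    """p_ij = (p_{j|i} + p_{i|j}) / 2n, using one shared sigma (assumption)."""
    n = X.shape[0]
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)
    logits -= logits.max(axis=1, keepdims=True)
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)      # cond[i, j] = p_{j|i}
    return (cond + cond.T) / (2.0 * n)

def joint_q(Y):
    """Student-t similarities q_ij and the unnormalized weights (1 + d^2)^-1."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl_cost(P, Q):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def tsne(X, dim=2, T=300, eta=2.0, alpha=0.5, seed=0):
    rng = np.random.default_rng(seed)
    P = joint_p(X)
    # N(0, 1e-4 I) means a standard deviation of 1e-2 per coordinate.
    Y = rng.normal(scale=1e-2, size=(X.shape[0], dim))
    Y_prev = Y.copy()
    costs = []
    for t in range(1, T + 1):
        Q, W = joint_q(Y)
        costs.append(kl_cost(P, Q))
        # grad_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1
        G = (P - Q) * W
        grad = 4.0 * (G.sum(axis=1)[:, None] * Y - G @ Y)
        Y_new = Y - eta * grad + alpha * (Y - Y_prev)
        Y_prev, Y = Y, Y_new
    return Y, costs

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(15, 5)),     # toy cluster A
               rng.normal(10.0, 0.5, size=(15, 5))])   # toy cluster B
Y, costs = tsne(X)
print(Y.shape, costs[0] > costs[-1])
```

Practical implementations add refinements that are left out here, such as searching for each $\sigma_i$ from the perplexity and varying the momentum over time.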

Let's compare the Gaussian distribution with the t distribution (see the figure above; for the code, see probability/distribution.md). The t distribution is less affected by outliers, gives a more reasonable fit, and captures the overall characteristics of the data better.

With the t distribution, q becomes:

$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$

In addition, the t distribution is a superposition of infinitely many Gaussians, and evaluating it involves no exponential, which makes computation much more convenient. The gradient used for optimization is:

$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}$
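One way to gain confidence in the gradient expression is to compare it with a finite-difference estimate of the cost $C = \sum_{i \neq j} p_{ij} \log(p_{ij}/q_{ij})$. The sketch below does this on a small random problem; the function names, the random symmetric `P`, and the step size `eps` are assumptions made for the check.

```python
import numpy as np

def q_and_weights(Y):
    """Student-t similarities q_ij, normalized over all pairs k != l."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def cost(P, Y):
    Q, _ = q_and_weights(Y)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def analytic_grad(P, Y):
    """4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1."""
    Q, W = q_and_weights(Y)
    G = (P - Q) * W
    return 4.0 * (G.sum(axis=1)[:, None] * Y - G @ Y)

def numeric_grad(P, Y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        step = np.zeros_like(Y)
        step[idx] = eps
        grad[idx] = (cost(P, Y + step) - cost(P, Y - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(1)
n = 6
A = rng.random((n, n))
P = A + A.T                        # symmetric, like p_ij
np.fill_diagonal(P, 0.0)
P /= P.sum()                       # joint distribution sums to 1
Y = rng.normal(size=(n, 2))

max_err = np.max(np.abs(analytic_grad(P, Y) - numeric_grad(P, Y)))
print(max_err < 1e-6)
```

The check relies on `P` being symmetric and summing to 1, which is what the symmetrized $p_{ij}$ guarantees.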

[Figure: similarity (vertical axis) as a function of distance (horizontal axis) for the Gaussian and t distributions]

The figure above also shows why t-SNE is effective. The horizontal axis is distance and the vertical axis is similarity. For highly similar points, the t distribution calls for a slightly smaller distance in the low-dimensional space; for points with low similarity, it calls for a larger distance. This is exactly what we want: points within the same cluster (close together) are packed more tightly, and points in different clusters (far apart) are pushed further apart.
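The crossover described here can be reproduced numerically. The sketch assumes the figure compares the standard normal density with the Student-t density with one degree of freedom, which the text does not state explicitly; the similarity levels 0.30 and 0.05 are illustrative values. For each level, the code solves for the distance at which each curve takes that value.

```python
import math

GAUSS_PEAK = 1.0 / math.sqrt(2.0 * math.pi)   # standard normal pdf at 0
T_PEAK = 1.0 / math.pi                        # Student-t (1 d.o.f.) pdf at 0

def gaussian_distance(s):
    """Distance d at which the standard normal pdf equals s (needs s <= peak)."""
    return math.sqrt(-2.0 * math.log(s / GAUSS_PEAK))

def student_t_distance(s):
    """Distance d at which the Student-t pdf with 1 d.o.f. equals s."""
    return math.sqrt(T_PEAK / s - 1.0)

for s in (0.30, 0.05):                        # high and low similarity
    print(s, gaussian_distance(s), student_t_distance(s))
```

At the high similarity level the t curve gives the smaller distance, and at the low level it gives the larger one, matching the behaviour described above.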

2.5 Limitations

1 It is mainly used for visualization.

2 It tends to preserve local features.

3 There is no unique optimal solution.

4 Training is slow.

An introduction to KL divergence follows.

1 KL divergence, JS divergence, and cross entropy

All three measure the difference between two probability distributions. They differ in their mathematical expressions.

Consider two probability distributions P(x) and Q(x).

1) KL divergence (Kullback–Leibler divergence)

Also called KL distance or relative entropy.

$D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$

The more similar P(x) and Q(x) are, the smaller the KL divergence.
       
KL divergence has two main properties:

(1) Asymmetry

Although KL divergence is intuitively a kind of metric or distance function, it is not a true metric, because it is not symmetric: in general D(P||Q) != D(Q||P).

(2) Non-negativity

Relative entropy is non-negative: D(P||Q) >= 0, with equality exactly when P = Q.
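Both properties are easy to check on a small discrete example. The distributions `P` and `Q` below are arbitrary values chosen for illustration, and the helper treats terms with P(x) = 0 as contributing 0, which is the usual convention.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x) / q(x)), natural log."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                     # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.1, 0.3, 0.6])

forward = kl_divergence(P, Q)        # D(P || Q)
backward = kl_divergence(Q, P)       # D(Q || P)
same = kl_divergence(P, P)           # identical distributions

print(forward, backward, same)
```

The two directions give different positive values, and the divergence of a distribution from itself is 0.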
       

       
2) JS divergence (Jensen-Shannon divergence)

JS divergence, also known as JS distance, is a variant of KL divergence.

$JS(P \| Q) = \frac{1}{2} D(P \| M) + \frac{1}{2} D(Q \| M), \quad M = \frac{1}{2}(P + Q)$

It differs from KL divergence in two main respects:

(1) Range of values

With base-2 logarithms, JS divergence lies in [0, 1]: it is 0 when the two distributions are identical and 1 when they do not overlap at all. Compared with KL divergence, this gives a more precise reading of similarity.

(2) Symmetry

JS(P||Q) = JS(Q||P), as can be seen from the mathematical expression.
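The range and symmetry of JS divergence can be verified directly from its usual definition, the average of the KL divergences of P and Q from their mixture M = (P + Q) / 2. The sketch assumes base-2 logarithms, which is what makes the upper bound equal to 1; the example distributions are arbitrary.

```python
import numpy as np

def kl_bits(p, q):
    """D(p || q) in bits; terms with p(x) = 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    """JS(p || q) = 0.5 * D(p || m) + 0.5 * D(q || m) with m = (p + q) / 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

same = js_divergence([0.5, 0.5], [0.5, 0.5])        # identical distributions
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])    # no overlap at all
forward = js_divergence([0.7, 0.2, 0.1], [0.1, 0.3, 0.6])
backward = js_divergence([0.1, 0.3, 0.6], [0.7, 0.2, 0.1])

print(same, disjoint, forward, backward)
```

Unlike KL divergence, the mixture M is positive wherever P or Q is, so the value stays finite even when the two distributions have different supports.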
       
3) Cross entropy

In neural networks, cross entropy can be used as a loss function, because it measures how similar P and Q are.

$H(P, Q) = -\sum_x P(x) \log Q(x)$

The relationship between cross entropy and relative entropy:

$H(P, Q) = H(P) + D(P \| Q)$, where $H(P) = -\sum_x P(x) \log P(x)$

All of the above applies to discrete probability distributions. For continuous data, probability density estimation is needed to determine the distribution of the data, and the sums become integrals.
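For the discrete case, the identity H(P, Q) = H(P) + D(P||Q) can be confirmed numerically. In the sketch below the distribution `P` and the two candidate estimates `Q1` and `Q2` are made-up values; all entries are strictly positive so no special handling of zeros is needed.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return float(np.sum(p * np.log(p / q)))

P = np.array([0.7, 0.2, 0.1])        # "true" distribution
Q1 = np.array([0.6, 0.3, 0.1])       # a close estimate
Q2 = np.array([0.1, 0.3, 0.6])       # a poor estimate

gap1 = cross_entropy(P, Q1) - kl_divergence(P, Q1)
gap2 = cross_entropy(P, Q2) - kl_divergence(P, Q2)
print(gap1, gap2, entropy(P))
```

Both gaps equal H(P), which does not depend on the estimate, so ranking estimates by cross entropy gives the same order as ranking them by KL divergence.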
       

       
Personal understanding:

1. KL divergence is essentially a mathematical calculation that measures the difference between two probability distributions. Because it is based on a ratio (a division), it is not symmetric.

2. Why is cross entropy used instead of KL divergence when training neural networks? Mathematically, the only difference between the two is that KL divergence subtracts an extra H(P). Here P is the true distribution and Q is the estimated distribution. Since H(P) does not depend on Q, minimizing cross entropy over Q is equivalent to minimizing KL divergence.
       

       
Author: Your Rego

The copyright of this article belongs to the author. Reprinting is welcome, but unless the author agrees otherwise, a link to the original must be given on the article page; the author reserves the right to pursue legal liability.

Original site

Copyright notice
This article was created by [51CTO]. Please include a link to the original when reprinting. Thank you.
https://yzsam.com/2022/210/202207290032357978.html