
Optimal transport Series 1

2022-07-01 02:33:00 Daft shiner

Since Optimal Transport is widely used in machine learning, I wanted to study it carefully. There is a lot of material online, but it took me a long time to find resources that are easy to understand. This blog post aims to describe Optimal Transport as clearly and simply as possible. My mathematical ability is limited, so if there are any mistakes, please point them out.

Prior Knowledge

(Figure from reference [1]: three one-dimensional distributions $\rho_1$, $\rho_2$, $\rho_3$.)
L1 Distance: $d_{L_1}(\rho_1,\rho_2)=\int_{-\infty}^{+\infty}|\rho_1(x)-\rho_2(x)|\,dx$
KL divergence: $d_{KL}(\rho_1\|\rho_2)=\int_{-\infty}^{+\infty}\rho_1(x)\log\frac{\rho_1(x)}{\rho_2(x)}\,dx$
If we use these to measure, in the figure above (from reference [1]), the distance from $\rho_1$ to $\rho_2$ and the distance from $\rho_1$ to $\rho_3$, we find that both the L1 distance and the KL divergence fail: the two results are the same.

Note: for the L1 distance it is easy to see why the two distances are the same, because it simply computes the area between the two distributions. As for the KL divergence, first note that it is not a distance metric, because it is not symmetric: the KL from $\rho_1(x)$ to $\rho_2(x)$ is generally not the same as the KL from $\rho_2(x)$ to $\rho_1(x)$. Furthermore, it need not satisfy the triangle inequality. Rewriting the Kullback-Leibler divergence gives
$$d_{KL}(\rho_1\|\rho_2)=\int_{-\infty}^{+\infty}\rho_1(x)\log(\rho_1(x))-\rho_1(x)\log(\rho_2(x))\,dx.$$
If we treat $\rho_1(x)\log(\rho_1(x))$ and $\rho_1(x)\log(\rho_2(x))$ as two new curves, the divergence again reduces to an area relationship. The first term is the same in both comparisons, so the result mainly depends on the difference in the second term. For a better understanding, I wrote a simple Python demo. Careful: here I found a big problem!!!

import math
import numpy as np
import matplotlib.pyplot as plt


def gaussian(u, sig):
    # mean u, standard deviation sig
    x = np.linspace(-30, 30, 1000000)   # sample points covering the domain
    y = np.exp(-(x - u) ** 2 / (2 * sig ** 2)) / (math.sqrt(2 * math.pi) * sig)  # Gaussian density
    return x, y


x1, y1 = gaussian(u=0, sig=math.sqrt(1))
x2, y2 = gaussian(u=1, sig=math.sqrt(1))
x3, y3 = gaussian(u=2, sig=math.sqrt(1))
plt.plot(x1, y1)
plt.plot(x2, y2)
plt.plot(x3, y3)

dx = x1[1] - x1[0]                          # grid spacing for the Riemann-sum approximation
z1 = y1 * np.log2(y1) - y1 * np.log2(y2)    # integrand of KL(rho1 || rho2), in bits (log base 2)
z2 = y1 * np.log2(y1) - y1 * np.log2(y3)    # integrand of KL(rho1 || rho3)
# plt.plot(x1, z1)
# plt.plot(x1, z2)
print(np.sum(z1) * dx)   # approx. KL(rho1 || rho2)
print(np.sum(z2) * dx)   # approx. KL(rho1 || rho3)

The experiment gave results quite different from what I expected: computing with the KL divergence, the distance from $\rho_1$ to $\rho_2$ and the distance from $\rho_1$ to $\rho_3$ are not the same at all; the difference is very large!!!
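This behaviour actually matches the known closed form for the KL divergence between two Gaussians with equal variance (a standard result, not derived in the original post):
$$d_{KL}\big(\mathcal{N}(\mu_1,\sigma^2)\,\|\,\mathcal{N}(\mu_2,\sigma^2)\big)=\frac{(\mu_1-\mu_2)^2}{2\sigma^2},$$
so with $\sigma=1$ the two cases above give $0.5$ and $2$ nats (about $0.72$ and $2.89$ in $\log_2$), which is exactly the large difference the demo prints.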
(Figure: the second-term integrand curves plotted for the two KL computations.)
I visualized the curves of the second term and found them completely different, and the values after integration are also completely different. (I considered the effect of turning a continuous distribution into a discrete one, so I adjusted the $x$ range and the number of sample points, but it made no difference.) Later I realized that the curves drawn in the article are apparently not Gaussian curves, which felt a little misleading, so I switched to a different example. Although it uses a discrete distribution, it still reflects the problem directly. (PS: I'm actually curious why a Gaussian distribution can't be used here; does anyone know why?)
(Figure: a discrete example on which the L1 distance and the KL divergence both fail.)
It can be seen that for distributions like the ones above, both the L1 distance and the KL divergence fail.
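To make this concrete, here is a minimal sketch with a hypothetical discrete example (these are not the exact distributions from the figure): three point masses on a one-dimensional grid. It also previews the 1-Wasserstein distance from the next section, computed with the standard one-dimensional closed form (the L1 distance between the cumulative distribution functions).

import numpy as np

# Hypothetical example: three point masses on a 5-bin grid with unit spacing.
rho1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
rho2 = np.array([0.0, 1.0, 0.0, 0.0, 0.0])   # one bin to the right of rho1
rho3 = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # four bins to the right of rho1

def l1(p, q):
    return np.abs(p - q).sum()

def w1(p, q):
    # 1-Wasserstein distance on a unit-spaced grid: L1 distance between the CDFs.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

print(l1(rho1, rho2), l1(rho1, rho3))   # 2.0 2.0 -> the L1 distance cannot tell them apart
# KL(rho1 || rho2) and KL(rho1 || rho3) are both +infinity (the supports are disjoint),
# so the KL divergence fails here as well.
print(w1(rho1, rho2), w1(rho1, rho3))   # 1.0 4.0 -> the Wasserstein distance distinguishes them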

Optimal Transport

The previous section showed the problems with the L1 distance and the KL divergence. This section describes Optimal Transport in detail, starting by defining the problem. The core of Optimal Transport is how to find an optimal transformation that turns one distribution into another while minimizing the cost of the transformation. (Note: the distributions here can be continuous or discrete.) We use $\rho_1$ and $\rho_2$ as an example:
To make this easier to picture, imagine $\rho_1$ and $\rho_2$ as piles of sand. The Optimal Transport question then becomes: how do we pile the sand of $\rho_1$ into the shape of $\rho_2$ while doing the least work? Let $\pi(x,y)$ denote how much mass of sand is moved from location $x$ to location $y$; clearly $\rho_1(x)$ denotes how much mass of sand is at location $x$ (in plain words, the height of the curve at $x$). The problem can then be expressed by the following formula:
(Figure: the transport problem written as a minimization with constraints.)
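The formula image is not reproduced here, but written out, the standard (Kantorovich) form it presumably shows is:
$$\min_{\pi}\ \iint |x-y|\,\pi(x,y)\,dx\,dy \qquad \text{s.t.}\quad \pi(x,y)\ge 0,\quad \int \pi(x,y)\,dy=\rho_1(x),\quad \int \pi(x,y)\,dx=\rho_2(y).$$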
The objective function minimizes the work done by the transport. Since the mass of sand moved must be greater than or equal to 0, the first constraint is easy to understand; constraints 2 and 3 say that the plan must match the initial and final distributions. This amount of work is known as the 1-Wasserstein distance in optimal transport. Next, we generalize it to the p-Wasserstein distance:
(Figure: the p-Wasserstein formulation.)
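Again reconstructing the missing formula, the p-Wasserstein distance is presumably
$$W_p(\rho_1,\rho_2)=\Big(\min_{\pi}\iint |x-y|^p\,\pi(x,y)\,dx\,dy\Big)^{1/p},$$
with the same constraints on $\pi$ as before.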
This is also easy to understand: only the cost of transporting mass over a given distance has changed (the distance is raised to the $p$-th power). Now that the problem is defined, how do we actually find this $\pi(x,y)$?

Discrete Problems in One Dimension

(Figure: the two one-dimensional cases, discrete-to-discrete (D2D) and discrete-to-continuous (D2C).)
For one-dimensional problems there are the two cases shown above: D2D (discrete to discrete) and D2C (discrete to continuous). For D2D data, we first relax the original distribution function $\rho(x)$ to measures $\mu_0,\mu_1 \in \mathrm{Prob}(\mathbb{R})$, and define the Dirac $\delta$-measure centered at $x \in \mathbb{R}$ via (":=" means "is defined as"):
$$\delta_x(S):=\begin{cases}1, & \text{if } x \in S\\ 0, & \text{otherwise.}\end{cases}$$
Here $\mu_0$ and $\mu_1$ are, respectively:
(Figure: the definitions of $\mu_0$ and $\mu_1$.)
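Reconstructing from reference [1], the two measures are presumably weighted sums of Dirac measures:
$$\mu_0=\sum_{i=1}^{k_0} a_{0i}\,\delta_{x_{0i}},\qquad \mu_1=\sum_{i=1}^{k_1} a_{1i}\,\delta_{x_{1i}}.$$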
where $\sum_i a_{0i}=\sum_i a_{1i}=1$ and $a_{0i}, a_{1i} \ge 0$. I think $S$ here is a subset of $[x_{01}, x_{02}, \cdots, x_{0k_0}]$. The problem is then computed as:
(Figure: the discrete linear-programming formulation.)
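The missing formula is presumably the finite linear program
$$\min_{T}\ \sum_{i,j} T_{ij}\,|x_{0i}-x_{1j}|^p \qquad \text{s.t.}\quad T_{ij}\ge 0,\quad \sum_j T_{ij}=a_{0i},\quad \sum_i T_{ij}=a_{1j}.$$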
In the same way as explained above, $T_{ij}$ is the mass transported from $x_{0i}$ to $x_{1j}$, and $|x_{0i}-x_{1j}|^p$ is the cost of moving it from $x_{0i}$ to $x_{1j}$ (the distance raised to the $p$-th power). The transported mass $T_{ij}$ must be greater than or equal to 0, and the masses at the points $x_{0i}$ and $x_{1j}$ must satisfy the conservation conditions. Solving for $T_{ij}$ is a finite linear program, which many classical algorithms can handle, for example the simplex or interior-point methods (a small solver sketch follows after the figure below). As for the D2C case, each Dirac mass of the discrete $\mu_0$ is mapped onto a continuous interval of $\mu_1$, as shown in the figure below:
(Figure: the D2C case, where each Dirac mass of $\mu_0$ is matched to an interval of the continuous target.)
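Returning to the D2D case: below is a minimal sketch of solving the finite linear program above with scipy.optimize.linprog. The point locations and masses are made-up toy data for illustration, not taken from any figure in the post.

import numpy as np
from scipy.optimize import linprog

# Toy D2D data (hypothetical).
x0 = np.array([0.0, 1.0, 3.0])   # support of mu_0
a0 = np.array([0.4, 0.4, 0.2])   # masses of mu_0
x1 = np.array([0.5, 2.0])        # support of mu_1
a1 = np.array([0.5, 0.5])        # masses of mu_1
p = 1                            # exponent of the p-Wasserstein cost

k0, k1 = len(x0), len(x1)
# Cost c_ij = |x_0i - x_1j|^p, with T flattened so that T_ij becomes variable i*k1 + j.
C = np.abs(x0[:, None] - x1[None, :]) ** p
c = C.ravel()

# Conservation constraints: sum_j T_ij = a_0i (rows) and sum_i T_ij = a_1j (columns).
A_eq = np.zeros((k0 + k1, k0 * k1))
for i in range(k0):
    A_eq[i, i * k1:(i + 1) * k1] = 1.0
for j in range(k1):
    A_eq[k0 + j, j::k1] = 1.0
b_eq = np.concatenate([a0, a1])

# bounds=(0, None) enforces T_ij >= 0; the "highs" backend uses simplex / interior-point solvers.
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
T = res.x.reshape(k0, k1)
print("optimal plan T:\n", T)
print("W_p^p =", res.fun)   # 0.75 for this toy data with p = 1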

Conclusion

This post focused on two forms of Optimal Transport (D2C and D2D); the next post will cover more general forms.

References

[1] Optimal Transport on Discrete Domains
[2] Kullback-Leibler Divergence
[3] Optimal Transport and Wasserstein Distance


Copyright notice
This article was written by [Daft shiner]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/182/202207010225316394.html