Optimal transport Series 1
2022-07-01 02:33:00 【Daft shiner】
Since Optimal Transport is widely used in machine learning, I wanted to study it carefully. There is a lot of material online, but (because I am slow) it took me a long time to find anything easy to understand. This blog post aims to describe Optimal Transport as clearly and simply as possible. My mathematical ability is limited, so please point out any mistakes.
Prior Knowledge

L1 Distance: $d_{L_1}(\rho_1,\rho_2)=\int_{-\infty}^{+\infty}|\rho_1(x)-\rho_2(x)|\,dx$
KL divergence: $d_{KL}(\rho_1\|\rho_2)=\int_{-\infty}^{+\infty}\rho_1(x)\log\frac{\rho_1(x)}{\rho_2(x)}\,dx$
If you try to measure, in the figure above (from reference [1]), the distance from $\rho_1$ to $\rho_2$ and the distance from $\rho_1$ to $\rho_3$, you will find that both the L1 distance and the KL divergence fail: all the results come out the same.
Note: for the L1 distance the identical results are easy to understand, because it only measures the area between the two curves. As for the KL divergence, first note that it is not a distance metric, because it is not symmetric: the KL from $\rho_1$ to $\rho_2$ is generally not the same as the KL from $\rho_2$ to $\rho_1$. Furthermore, it need not satisfy the triangle inequality. Rewriting the Kullback-Leibler divergence gives
$$d_{KL}(\rho_1\|\rho_2)=\int_{-\infty}^{+\infty}\rho_1(x)\log(\rho_1(x))-\rho_1(x)\log(\rho_2(x))\,dx.$$
If we regard $\rho_1(x)\log(\rho_1(x))$ and $\rho_1(x)\log(\rho_2(x))$ as two new curves, the KL divergence also reduces to a relationship between areas; since the first term is the same in both comparisons, the difference depends mainly on the second term. For a better understanding, I wrote a simple demo in Python. Note: here I found a big problem!!!
import math
import numpy as np
import matplotlib.pyplot as plt

def gaussian(u, sig):
    # mean u, standard deviation sig
    x = np.linspace(-30, 30, 1000000)  # domain of definition
    y = np.exp(-(x - u) ** 2 / (2 * sig ** 2)) / (math.sqrt(2 * math.pi) * sig)  # Gaussian density
    return x, y

x1, y1 = gaussian(u=0, sig=math.sqrt(1))
x2, y2 = gaussian(u=1, sig=math.sqrt(1))
x3, y3 = gaussian(u=2, sig=math.sqrt(1))
plt.plot(x1, y1)
plt.plot(x2, y2)
plt.plot(x3, y3)

dx = x1[1] - x1[0]  # grid spacing, so the sums below approximate the integrals
# natural log, to match the KL formula above
z1 = y1 * np.log(y1) - y1 * np.log(y2)  # integrand of KL(rho1 || rho2)
z2 = y1 * np.log(y1) - y1 * np.log(y3)  # integrand of KL(rho1 || rho3)
# plt.plot(x1, z1)
# plt.plot(x1, z2)
print(np.sum(z1) * dx)  # approx. 0.5
print(np.sum(z2) * dx)  # approx. 2.0
The experiment shows a result quite different from what I expected: using the KL divergence, the distance from $\rho_1$ to $\rho_2$ and the distance from $\rho_1$ to $\rho_3$ are not the same at all; the difference is very big!!!
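One way to sanity-check this number (a worked equation, not from the original post): for two Gaussians with the same variance the KL divergence has a well-known closed form,
$$d_{KL}\big(\mathcal{N}(\mu_1,\sigma^2)\,\|\,\mathcal{N}(\mu_2,\sigma^2)\big)=\frac{(\mu_1-\mu_2)^2}{2\sigma^2},$$
so with $\sigma=1$ the divergence from $\rho_1$ (mean 0) to $\rho_2$ (mean 1) is $0.5$, while the divergence from $\rho_1$ to $\rho_3$ (mean 2) is $2$. The two values really are different, which matches what the demo prints.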
I visualized the curve of the second term and found that the two curves are completely different, and the values after integration are completely different as well. (Here I took into account the effect of turning a continuous distribution into a discrete one, so I adjusted the x range and the number of sample points; it made no difference.) Later I realized that the curves given in the article do not seem to be Gaussian curves, which I found a little misleading, so I switched to a different example for the calculation. Although it is a discrete distribution, it still directly exposes the problem. (PS: I am actually curious why Gaussian distributions cannot be used here; does anyone know why?)
It can be seen that for the distributions above, both the L1 distance and the KL divergence fail.
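To make this concrete, here is a tiny sketch with three point masses on disjoint supports (my own toy numbers, not the exact distributions from the figure); both measures give the same answer for the two pairs:

```python
import numpy as np

# Three point masses on positions {0, 1, 2}: rho1 sits at 0, rho2 at 1, rho3 at 2.
rho1 = np.array([1.0, 0.0, 0.0])
rho2 = np.array([0.0, 1.0, 0.0])
rho3 = np.array([0.0, 0.0, 1.0])

def l1_distance(p, q):
    return np.sum(np.abs(p - q))

def kl_divergence(p, q):
    # Convention: 0 * log(0 / q) = 0, and p * log(p / 0) = +inf for p > 0.
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return np.sum(p[support] * np.log(p[support] / q[support]))

print(l1_distance(rho1, rho2), l1_distance(rho1, rho3))      # 2.0 2.0
print(kl_divergence(rho1, rho2), kl_divergence(rho1, rho3))  # inf inf
```

Both measures return identical values for the two pairs, even though $\rho_3$ is intuitively "farther" from $\rho_1$ than $\rho_2$ is; this is exactly the failure that Optimal Transport addresses.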
Optimal Transport
The previous section showed the problems with the L1 distance and the KL divergence; this section describes Optimal Transport in detail. First, define the problem. The core of Optimal Transport is to find an optimal transformation that turns one distribution into another while minimizing the cost of the transformation. (Note: the distributions here can be continuous or discrete.) Here we take $\rho_1$ and $\rho_2$ as an example:
For ease of understanding, we can imagine $\rho_1$ and $\rho_2$ as piles of sand. The Optimal Transport question then becomes: how do we reshape the pile $\rho_1$ into the shape of $\rho_2$ while doing the least amount of work? Let $\pi(x,y)$ denote how much mass of sand is moved from location $x$ to location $y$, and let $\rho_1(x)$ denote how much mass of sand sits at location $x$ (in plain terms, the height of the curve at $x$). The problem can then be expressed by the following formula:
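Written out in the standard Kantorovich form (the same formulation the constraints below describe), the optimization is:

$$\min_{\pi}\ \iint |x-y|\,\pi(x,y)\,dx\,dy \quad \text{s.t.}\quad \pi(x,y)\ge 0,\quad \int \pi(x,y)\,dy=\rho_1(x),\quad \int \pi(x,y)\,dx=\rho_2(y).$$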
The objective function is to minimize the work done by the transport. Since the mass of sand being moved must be greater than or equal to 0, the first constraint is easy to understand; constraints 2 and 3 require $\pi$ to match the initial and final distributions. This amount of work is known as the 1-Wasserstein distance in optimal transport. Next, we generalize it to the p-Wasserstein distance:
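In the same notation (written here in its standard form, including the $p$-th root of the optimal cost):

$$W_p(\rho_1,\rho_2)=\Big(\min_{\pi\ge 0}\iint |x-y|^p\,\pi(x,y)\,dx\,dy\Big)^{1/p},\qquad \int \pi(x,y)\,dy=\rho_1(x),\quad \int \pi(x,y)\,dx=\rho_2(y).$$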
This is also easy to understand: only the exponent on the transport distance has changed. Now that the problem is defined, how do we actually find this $\pi(x,y)$?
Discrete Problems in One Dimension

For one-dimensional problems there are the two cases shown above: D2D (discrete to discrete) and D2C (discrete to continuous). For D2D data, we first relax the original distribution functions $\rho(x)$ to measures $\mu_0,\mu_1 \in \mathrm{Prob}(\mathbb{R})$. Define the Dirac $\delta$-measure centered at $x \in \mathbb{R}$ via ($:=$ means "is defined as"):
$$\delta_x(S):=\begin{cases}1, & \text{if } x \in S,\\ 0, & \text{otherwise.}\end{cases}$$
Here $\mu_0$ and $\mu_1$ are, respectively:
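In the notation of reference [1], these are finite sums of weighted Dirac measures:

$$\mu_0=\sum_{i=1}^{k_0} a_{0i}\,\delta_{x_{0i}}, \qquad \mu_1=\sum_{i=1}^{k_1} a_{1i}\,\delta_{x_{1i}},$$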
where $\sum_i a_{0i}=\sum_i a_{1i}=1$ and $a_{0i},a_{1i}\ge 0$. (I think $S$ here is a subset of $[x_{01}, x_{02}, \cdots, x_{0k_0}]$.) With this setup, the problem to compute becomes:
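Written out in standard form (this matches the description that follows), it is the linear program

$$\min_{T}\;\sum_{i=1}^{k_0}\sum_{j=1}^{k_1} T_{ij}\,|x_{0i}-x_{1j}|^p \quad \text{s.t.}\quad T_{ij}\ge 0,\quad \sum_j T_{ij}=a_{0i},\quad \sum_i T_{ij}=a_{1j},$$

whose optimal value equals $W_p(\mu_0,\mu_1)^p$.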
Following the same reasoning as the explanation above: $T_{ij}$ is the mass transported from $x_{0i}$ to $x_{1j}$, and $|x_{0i}-x_{1j}|^p$ is the cost of the distance from $x_{0i}$ to $x_{1j}$. The transported mass $T_{ij}$ must be greater than or equal to 0, and the mass at the points $x_{0i}$ and $x_{1j}$ must satisfy the conservation conditions. Solving for $T_{ij}$ is a finite-dimensional linear program, and many classical algorithms can be used, for example simplex or interior point methods (see the small numerical sketch below). As for the D2C case, each discrete point of $\mu_0$ is mapped to a continuous interval of $\mu_1$, as shown in the figure below:
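Returning to the D2D linear program: below is a minimal sketch that solves a tiny instance with scipy.optimize.linprog. The point positions and weights are made-up numbers for illustration only, not taken from the post.

```python
import numpy as np
from scipy.optimize import linprog

# Toy D2D instance: 3 source points, 2 target points (made-up numbers).
x0 = np.array([0.0, 1.0, 2.0])   # support of mu_0
a0 = np.array([0.5, 0.3, 0.2])   # weights of mu_0 (sum to 1)
x1 = np.array([0.5, 2.5])        # support of mu_1
a1 = np.array([0.6, 0.4])        # weights of mu_1 (sum to 1)
p = 1                            # 1-Wasserstein cost

# Cost matrix C[i, j] = |x0_i - x1_j|^p; T is flattened row-major into a vector.
C = np.abs(x0[:, None] - x1[None, :]) ** p
k0, k1 = len(x0), len(x1)

# Equality constraints: row sums of T equal a0, column sums equal a1.
A_eq = np.zeros((k0 + k1, k0 * k1))
for i in range(k0):
    A_eq[i, i * k1:(i + 1) * k1] = 1.0   # sum_j T_ij = a0_i
for j in range(k1):
    A_eq[k0 + j, j::k1] = 1.0            # sum_i T_ij = a1_j
b_eq = np.concatenate([a0, a1])

# T_ij >= 0 is linprog's default bound.
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, method="highs")
T = res.x.reshape(k0, k1)
print("optimal plan T:\n", T)
print("1-Wasserstein distance:", res.fun)
```

Printing $T$ shows how each source mass $a_{0i}$ splits across the target points; with $p>1$, `res.fun` would be $W_p^p$ rather than the distance itself.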
Conclusion
This post focused on two forms of Optimal Transport (D2C and D2D); the next post will cover the more general form.
References
[1] Optimal Transport on Discrete Domains
[2] Kullback-Leibler Divergence
[3] Optimal Transport and Wasserstein Distance