Entropy - conditional entropy - joint entropy - mutual information - cross entropy
2022-06-30 19:15:00 【Ancient road】
0. Introduction
These are all basic concepts from information theory.
1. Information entropy
How to understand information entropy: the video referenced in the original post explains it really well.
The amount of information: $x = \log_2 N$, where $N$ is the number of equally probable outcomes. For example, if the amount of information is 3 bits, then the number of equally probable outcomes is $2^3 = 8$.



The amount of information is a special case of information entropy: the case where all events are equally probable.
- Suppose a biased coin: the probability of heads is 0.8 and the probability of tails is 0.2.
- Convert each outcome into an equivalent equally probable event ($N = 1/p$):
- Heads → imagine it as one outcome out of $1/0.8 = 1.25$ equally probable events.
- Tails → imagine it as one outcome out of $1/0.2 = 5$ equally probable events.
- The naive total amount of information would be $\log 1.25 + \log 5$, but because the two outcomes occur with different probabilities, the true amount of information is the probability-weighted average: $0.8\log 1.25 + 0.2\log 5 = 0.8\log\frac{1}{0.8} + 0.2\log\frac{1}{0.2}$.
- This leads to the famous information entropy formula: $\sum_i p_i\log\frac{1}{p_i} = -\sum_i p_i\log p_i$
- $H(X) = -\sum_{i=1}^{n} p(x_i)\log p(x_i)$
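As a numeric check of the coin example, here is a minimal Python sketch (not part of the original article) that computes the entropy of the 0.8/0.2 coin; it uses base-2 logarithms, so the result is in bits:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum(p * log(p)), skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Biased coin: P(heads) = 0.8, P(tails) = 0.2
coin = [0.8, 0.2]
print(entropy(coin))          # ≈ 0.7219 bits, i.e. 0.8*log2(1/0.8) + 0.2*log2(1/0.2)
print(entropy([0.5, 0.5]))    # 1.0 bit for a fair coin (the maximum for two outcomes)
```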
The referenced article gives this definition:
Definition: entropy measures the uncertainty of information.
Explanation: the greater the uncertainty, the greater the entropy and the more information is needed to resolve it; the less uncertainty, the smaller the entropy. For example, the sentence "Tomorrow the sun will rise in the East" has entropy 0: it carries no information, because it describes a certain event.
Its example is also very intuitive:
Example: suppose a random variable X represents tomorrow's weather. X has three possible states: (1) sunny, (2) rainy, (3) overcast, and each state occurs with probability P(i) = 1/3. According to the entropy formula $H(X)=-\sum_{i=1}^{n} p(x_i)\log p(x_i)$:
H(X) = −1/3·log(1/3) − 1/3·log(1/3) − 1/3·log(1/3) = log 3 ≈ 0.47712 (using base-10 logarithms).
If instead the three states have probabilities (0.1, 0.1, 0.8): H(X) = −2 × 0.1·log(0.1) − 0.8·log(0.8) ≈ 0.277528.
Under the first distribution X is highly uncertain (high entropy): every state is equally likely. Under the second distribution X is much less uncertain (low entropy): the third state occurs with high probability.
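The two weather distributions can be checked the same way. The sketch below is an illustration added here (not from the original article); it uses base-10 logarithms so the output matches the figures 0.47712 and 0.277528 quoted above:

```python
import math

def entropy10(probs):
    """Entropy with base-10 logarithm, matching the figures in the text."""
    return -sum(p * math.log10(p) for p in probs if p > 0)

uniform = [1/3, 1/3, 1/3]
skewed  = [0.1, 0.1, 0.8]
print(entropy10(uniform))   # ≈ 0.47712 (= log10(3)) -> high uncertainty
print(entropy10(skewed))    # ≈ 0.277528 -> much lower uncertainty
```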
2. Conditional entropy
Definition: the uncertainty that remains in a random variable once another random variable is known.
- The joint distribution of two random variables X, Y defines the joint entropy (Joint Entropy), written H(X, Y): $H(X, Y) = -\sum_{x, y} p(x, y)\log p(x, y)$
- $H(X|Y) = H(X, Y) - H(Y)$: the entropy of (X, Y) occurring together, minus the entropy of Y occurring alone, i.e. the additional uncertainty contributed by X once Y is known.
The same identity can be derived directly (written here for $H(Y|X) = H(X, Y) - H(X)$; the case $H(X|Y) = H(X, Y) - H(Y)$ is symmetric):

$$
\begin{aligned}
H(Y|X) &= H(X, Y) - H(X) \\
&= -\sum_{x, y} p(x, y) \log p(x, y) + \sum_{x} p(x) \log p(x) \\
&= -\sum_{x, y} p(x, y) \log p(x, y) + \sum_{x}\Big(\sum_{y} p(x, y)\Big) \log p(x) \\
&= -\sum_{x, y} p(x, y) \log p(x, y) + \sum_{x, y} p(x, y) \log p(x) \\
&= -\sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)} \\
&= -\sum_{x, y} p(x, y) \log p(y \mid x)
\end{aligned}
$$
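As a numeric illustration of this identity, the following sketch computes $H(X,Y)$, $H(X)$, and $H(Y|X)$ from a small made-up joint distribution (the table is hypothetical, not from the article) and checks that $H(Y|X) = H(X,Y) - H(X)$:

```python
import math

# Hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {0, 1, 2}
joint = {
    (0, 0): 0.20, (0, 1): 0.15, (0, 2): 0.05,
    (1, 0): 0.10, (1, 1): 0.25, (1, 2): 0.25,
}

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal distribution p(x)
px = {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p

H_xy = H(joint.values())   # joint entropy H(X, Y)
H_x = H(px.values())       # marginal entropy H(X)

# Conditional entropy from the definition: -sum_{x,y} p(x, y) * log2 p(y|x)
H_y_given_x = -sum(p * math.log2(p / px[x]) for (x, y), p in joint.items())

print(round(H_xy - H_x, 4), round(H_y_given_x, 4))  # both ≈ 1.4523 bits
print(math.isclose(H_xy - H_x, H_y_given_x))        # True: H(Y|X) = H(X,Y) - H(X)
```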
3. Joint entropy
The joint entropy of two discrete random variables X, Y (in bits) is defined as:

$$H(X, Y) = -\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P(x, y)\log_2 P(x, y)$$

For more than two random variables $X_1, \dots, X_n$, this extends to:

$$H(X_1, \dots, X_n) = -\sum_{x_1\in\mathcal{X}_1}\cdots\sum_{x_n\in\mathcal{X}_n} P(x_1, \dots, x_n)\log_2 P(x_1, \dots, x_n)$$
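As an illustration of the multi-variable formula, the sketch below (with made-up joint distributions, not from the article) computes the joint entropy of three binary variables directly from an n-dimensional probability array:

```python
import numpy as np

def joint_entropy(p):
    """Joint entropy in bits of an n-dimensional joint probability array."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # skip zero-probability cells
    return float(-(nz * np.log2(nz)).sum())

# Hypothetical joint distribution over three binary variables X1, X2, X3
p = np.full((2, 2, 2), 1 / 8)          # independent fair bits
print(joint_entropy(p))                # 3.0 bits = H(X1) + H(X2) + H(X3)

p2 = np.zeros((2, 2, 2))
p2[0, 0, 0] = p2[1, 1, 1] = 0.5        # perfectly correlated bits
print(joint_entropy(p2))               # 1.0 bit: knowing one bit determines the rest
```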
4. Mutual information
Definition: mutual information measures the degree of dependence between two random variables.
Understanding: it is the amount by which the uncertainty of one random variable Y decreases once the value of the other random variable X is determined. Its minimum is 0, meaning that knowing one variable says nothing about the other; its maximum is the entropy of the variable itself, meaning that knowing one variable completely removes the uncertainty of the other. This concept is the counterpart of conditional entropy.
$$
\begin{aligned}
I(X;Y) &\equiv H(X) - H(X\mid Y) \\
&\equiv H(Y) - H(Y\mid X) \\
&\equiv H(X) + H(Y) - H(X, Y) \\
&\equiv H(X, Y) - H(X\mid Y) - H(Y\mid X)
\end{aligned}
$$
$$
\begin{aligned}
I(X;Y) &= \sum_{x\in\mathcal{X}, y\in\mathcal{Y}} p_{(X,Y)}(x, y)\log\frac{p_{(X,Y)}(x, y)}{p_X(x)\,p_Y(y)} \\
&= \sum_{x\in\mathcal{X}, y\in\mathcal{Y}} p_{(X,Y)}(x, y)\log\frac{p_{(X,Y)}(x, y)}{p_X(x)} - \sum_{x\in\mathcal{X}, y\in\mathcal{Y}} p_{(X,Y)}(x, y)\log p_Y(y) \\
&= \sum_{x\in\mathcal{X}, y\in\mathcal{Y}} p_X(x)\,p_{Y\mid X=x}(y)\log p_{Y\mid X=x}(y) - \sum_{x\in\mathcal{X}, y\in\mathcal{Y}} p_{(X,Y)}(x, y)\log p_Y(y) \\
&= \sum_{x\in\mathcal{X}} p_X(x)\left(\sum_{y\in\mathcal{Y}} p_{Y\mid X=x}(y)\log p_{Y\mid X=x}(y)\right) - \sum_{y\in\mathcal{Y}}\left(\sum_{x\in\mathcal{X}} p_{(X,Y)}(x, y)\right)\log p_Y(y) \\
&= -\sum_{x\in\mathcal{X}} p_X(x)\, H(Y\mid X=x) - \sum_{y\in\mathcal{Y}} p_Y(y)\log p_Y(y) \\
&= -H(Y\mid X) + H(Y) \\
&= H(Y) - H(Y\mid X).
\end{aligned}
$$
- The mutual information of two random variables $X, Y$ can also be defined as the relative entropy between their joint distribution and the product of their marginal (independent) distributions:
- $I(X, Y) = D(P(X, Y) \,\|\, P(X)P(Y))$, i.e. $I(X, Y)=\sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
Mutual information and information gain (as used in decision trees) are in fact the same quantity: information gain = entropy − conditional entropy, $g(D, A) = H(D) - H(D|A)$.
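The following sketch (using a hypothetical joint distribution, not taken from the article) computes $I(X;Y)$ both from the identity $H(X)+H(Y)-H(X,Y)$ and from the KL form $\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$, and checks that the two agree:

```python
import math

# Hypothetical joint distribution p(x, y)
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal distributions p(x) and p(y)
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# I(X;Y) via the entropy identity
mi_entropy = H(px.values()) + H(py.values()) - H(joint.values())

# I(X;Y) as the KL divergence between p(x, y) and p(x)p(y)
mi_kl = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

print(round(mi_entropy, 4), round(mi_kl, 4))   # both ≈ 0.1245 bits
print(mi_entropy >= 0)                          # mutual information is non-negative
```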

5. Relative entropy
Relative entropy is also known as discrimination information, Kullback entropy, or Kullback-Leibler (KL) divergence. It is closely related to, but not the same as, cross entropy: cross entropy satisfies $H(P, Q) = H(P) + D_{\text{KL}}(P \parallel Q)$.
Let $p(x)$ and $q(x)$ be two probability distributions over the values of $X$. Then the relative entropy of $p$ with respect to $q$ is:
$$D_{\text{KL}}(P \parallel Q) = \sum_{x\in\mathcal{X}} P(x)\log\left(\frac{P(x)}{Q(x)}\right) = -\sum_{x\in\mathcal{X}} P(x)\log\left(\frac{Q(x)}{P(x)}\right)$$
Explanation:
- Relative entropy can be used to measure the "distance" between two probability distributions (although it is not a true metric).
- In general, $D(p \parallel q) \neq D(q \parallel p)$: relative entropy is not symmetric.
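A minimal sketch illustrating both points, with distributions chosen arbitrarily for illustration:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)), in bits.
    Assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.4, 0.5]
q = [0.8, 0.1, 0.1]

print(kl_divergence(p, q))   # ≈ 1.661 bits
print(kl_divergence(q, p))   # ≈ 1.968 bits -- a different value: KL is not symmetric
print(kl_divergence(p, p))   # 0.0: the "distance" from a distribution to itself is zero
```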