Prior, Posterior, and Likelihood
2022-07-28 18:51:00 【Adenialzz】
Prior Distribution, Posterior Distribution, and the Likelihood Function
This section is adapted from: What do the concepts of prior distribution, posterior distribution, and likelihood estimation mean, and how are they related?
An intuitive explanation
Prior distribution: the distribution you believe a random variable should follow, based on general experience and before seeing any data. In other words, the prior is your initial guess about what distribution the parameter obeys.
Posterior distribution: the distribution of the variable after conditioning on the observed data; it fits the current data better than the prior does. In other words, the posterior is your updated guess about the parameter's distribution after learning from experience.
Likelihood estimation: given the training data and a model, a method for estimating the model parameters by maximizing the likelihood. In other words, likelihood estimation looks for the parameter values that best explain the observed outcomes.
An example
These concepts can be understood in terms of the "probability of the cause" versus the "probability of the outcome", and of which one is fixed and which one is conditioned on.
Consider the following example: Lao Wang next door is going to a place 10 kilometers away. He can choose to walk, ride a bike, or drive, and it takes him some amount of time to reach the destination. In this setup, treat the mode of transportation (walking, cycling, or driving) as the cause, and the time spent as the outcome.
Posterior probability
If Lao Wang took an hour to cover the 10 kilometers, he most likely rode a bike; of course, he might also be a fitness enthusiast who ran the whole way, or he might have driven through heavy traffic. If it took him two hours, he most likely walked. If it took only 20 minutes, he most likely drove. Knowing the outcome first and then estimating the probability distribution of the cause from it, p(mode | time), is the posterior probability.
Prior probability
Perhaps Lao Wang woke up feeling energetic, wanted some exercise, and decided to run; perhaps he wanted to play the hipster, try the recently popular shared bikes, and decided to ride; or perhaps he wanted to show off his wealth and decided to drive. His choice has nothing to do with the arrival time. Determining the probability distribution of the cause before observing any outcome, p(mode), is the prior probability.
Likelihood function
If Lao Wang decides to walk, the 10 kilometers will most likely take about two hours; if he exercises regularly, he might run it in an hour; if he is a real beast, maybe 40 minutes. If he decides to ride a bike, it will probably take about an hour; if he is in great spirits that day and traffic is smooth, maybe 40 minutes; if he is unlucky and a couple of shared bikes break down on him, it might take an hour and a half. If he decides to drive, it will most likely take about 20 minutes; if traffic is terrible that day, it might take half an hour. Fixing the cause first and estimating the probability distribution of the outcome from it, p(time | mode), is the likelihood.
Evidence
Lao Wang has gone to that place many times. Regardless of the mode of transportation, we end up with a probability distribution over the time taken. Ignoring the cause and looking only at the distribution of the outcome, p(time), gives us one more term: the evidence (I am not sure what the proper Chinese name for it is).
Finally, here is the famous Bayes formula (a small numerical sketch follows the symbol definitions below):
$$p(\theta|x)=\frac{p(x|\theta)\,p(\theta)}{p(x)}$$
- $x$: the observed data (the outcome)
- $\theta$: the parameters that determine the data distribution (the cause)
- $p(\theta|x)$: the posterior
- $p(\theta)$: the prior
- $p(x|\theta)$: the likelihood
- $p(x)$: the evidence
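To make the formula concrete, here is a minimal Python sketch based on the transportation example above. All of the prior and likelihood numbers are made up purely for illustration.

```python
# A minimal numerical sketch of Bayes' rule for the transportation example.
# All probabilities below are invented for illustration only.

priors = {"walk": 0.3, "bike": 0.4, "drive": 0.3}          # p(mode)
# p(time = "about one hour" | mode), hypothetical values
likelihoods = {"walk": 0.10, "bike": 0.70, "drive": 0.05}

# evidence: p(time) = sum over modes of p(time | mode) * p(mode)
evidence = sum(likelihoods[m] * priors[m] for m in priors)

# posterior: p(mode | time) = p(time | mode) * p(mode) / p(time)
posterior = {m: likelihoods[m] * priors[m] / evidence for m in priors}

print(posterior)
# -> roughly {'walk': 0.09, 'bike': 0.86, 'drive': 0.05}
# "about one hour" makes cycling by far the most probable cause.
```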
Maximum Likelihood Estimation (MLE) and Maximum A Posteriori Estimation (MAP)
This section is adapted from: https://zhuanlan.zhihu.com/p/32480810
- Frequentist school: Maximum Likelihood Estimation (MLE)
- Bayesian school: Maximum A Posteriori estimation (MAP)
Overview
Sometimes when chatting with someone who claims extensive machine learning experience, a deeper conversation reveals that they have only a superficial understanding of MLE and MAP. In my view at least, such a person's machine learning foundations are not solid. In this age of deep learning, do many students really only care about tuning hyperparameters?
In modern machine learning, the problem is ultimately turned into optimizing an objective function, and MLE and MAP are the most basic ideas for deriving that function, so understanding both is very important. In this post I will walk carefully through these two kinds of estimators.
The debate between the two schools
Speaking abstractly, the frequentist and Bayesian schools see the world differently. The frequentist school holds that the world is deterministic: there is an underlying truth whose value is fixed, and our goal is to find that value or its range. The Bayesian school holds that the world is uncertain: we start with a prior belief about the world, then adjust that belief using observed data, and our goal is to find the probability distribution that best describes the world.
When modeling a problem, we use $\theta$ to denote the model parameters. **Note that the essence of solving the problem is to estimate $\theta$.** Then:
(1) Frequentist school: there is only one true value of $\theta$. Take a simple, intuitive example: flipping a coin. We use $P(head)$ to denote the coin's bias. Flip the coin 100 times; if 20 of them land heads up, we estimate the bias $P(head)=\theta$. From the frequentist viewpoint, $\theta = 20/100 = 0.2$, which is intuitive. As the amount of data goes to infinity, this method gives an accurate estimate; with little data, however, it can go badly wrong. For example, take a fair coin, i.e. $\theta = 0.5$, toss it 5 times, and suppose all 5 land heads (the probability of this happening is $1/2^5 = 3.125\%$). The frequentist school would directly estimate $\theta = 1$ for this coin, a serious error.
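As a quick sketch of the frequentist estimate (the numbers simply mirror the two cases in the paragraph above):

```python
# Frequentist / MLE point estimate of the coin bias: the empirical frequency.

def mle_bias(num_heads: int, num_flips: int) -> float:
    """Estimate P(head) as the observed fraction of heads."""
    return num_heads / num_flips

print(mle_bias(20, 100))  # 0.2 -- plenty of data, a reasonable estimate
print(mle_bias(5, 5))     # 1.0 -- 5 heads in 5 flips of a fair coin gives a badly biased estimate
```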
(2) Bayesian school: $\theta$ is a random variable that follows some probability distribution. In the Bayesian view there are two inputs and one output: the inputs are the prior and the likelihood, and the output is the posterior. The prior, $P(\theta)$, is your judgment about $\theta$ before seeing any data; for example, given a coin, one reasonable prior is that the coin is fair with high probability and biased with only a small probability. The likelihood, $P(X|\theta)$, describes what data we would expect to observe assuming $\theta$ is known. The posterior, $P(\theta|X)$, is the final distribution over the parameter. Bayesian estimation is built on Bayes' formula, as follows:
$$P(\theta|X)=\frac{P(X|\theta)\,P(\theta)}{P(X)}$$
Take the coin example again: toss a fair coin 5 times and get 5 heads. If the prior says the coin is probably fair (for example, a Beta distribution with its maximum at 0.5), then $P(head)$, i.e. $P(\theta|X)$, is a distribution whose maximum lies somewhere between 0.5 and 1, rather than the extreme $\theta = 1$.
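Here is a minimal sketch of that update, assuming a Beta prior; Beta is the conjugate prior of the Bernoulli, so the posterior is again a Beta, and the specific Beta(5, 5) prior is just one illustrative "probably fair" choice.

```python
# Conjugate Beta-Bernoulli update for the coin example.
# Beta(5, 5) is one possible "probably fair" prior: its density peaks at 0.5.
a, b = 5.0, 5.0          # prior Beta(a, b)
heads, tails = 5, 0      # observed data: 5 heads in 5 tosses

# Conjugate update: the posterior is Beta(a + heads, b + tails)
post_a, post_b = a + heads, b + tails

posterior_mode = (post_a - 1) / (post_a + post_b - 2)
print(posterior_mode)    # ~0.69: pulled toward 1 by the data, but far from theta = 1
```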
Two points are worth noting here (a small numerical check follows this list):
- As the amount of data increases, the parameter distribution gets pulled closer and closer to what the data says, and the influence of the prior shrinks.
- If the prior is a uniform distribution, the Bayesian method is equivalent to the frequentist one, because intuitively a uniform prior means you have no prior expectation about the outcome at all.
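A small numerical check of both points, reusing the conjugate Beta-Bernoulli update from the sketch above; the particular prior strengths and data sizes are made up for illustration.

```python
# Check the two notes above with the conjugate Beta-Bernoulli posterior mode.

def beta_posterior_mode(prior_a: float, prior_b: float, heads: int, tails: int) -> float:
    a, b = prior_a + heads, prior_b + tails
    return (a - 1) / (a + b - 2)

# 1) Uniform prior Beta(1, 1): the posterior mode equals the frequentist estimate heads / n.
print(beta_posterior_mode(1, 1, 20, 80))        # 0.2, same as 20 / 100

# 2) With a strong "fair coin" prior Beta(50, 50), more data overwhelms the prior.
print(beta_posterior_mode(50, 50, 20, 80))      # ~0.35: the prior still matters with 100 flips
print(beta_posterior_mode(50, 50, 2000, 8000))  # ~0.20: with 10000 flips the prior barely matters
```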
MLE - Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is the frequentist school's most common estimation method.
Suppose the data $x_1, x_2, \dots, x_n$ are a set of i.i.d. samples, and write $X = (x_1, x_2, \dots, x_n)$, where i.i.d. stands for independent and identically distributed. Then the MLE of $\theta$ is derived as follows:
$$
\begin{aligned}
\hat{\theta}_{MLE} &= \arg\max\; P(X;\theta)\\
&= \arg\max\; P(x_1;\theta)P(x_2;\theta)\dots P(x_n;\theta)\\
&= \arg\max\; \log\prod_{i=1}^n P(x_i;\theta)\\
&= \arg\max\; \sum_{i=1}^n \log P(x_i;\theta)\\
&= \arg\min\; -\sum_{i=1}^n \log P(x_i;\theta)
\end{aligned}
$$
The function minimized in the last line is called the Negative Log Likelihood (NLL). This concept, and the derivation above, are very important!
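As a sketch of "MLE = minimize the NLL" in code, here is the coin example solved numerically rather than in closed form (numpy and scipy are assumed to be available):

```python
# Numerically minimize the Bernoulli NLL and recover the closed-form MLE.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1] * 20 + [0] * 80)   # 20 heads, 80 tails (1 = head, 0 = tail)

def nll(theta: float) -> float:
    # Negative log-likelihood of i.i.d. Bernoulli data: -sum_i log P(x_i; theta)
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

res = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # ~0.2, matching the closed-form frequentist estimate heads / n
```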
We often use MLE without even noticing it, for example (a small sketch follows this list):
- The frequentist coin-bias calculation in the example above is, in essence, obtained by minimizing the NLL; the detailed derivation is given in the appendix at the end of the original article.
- Given some data, when fitting a Gaussian distribution to it, we usually compute the sample mean and variance of the data points and plug them into the Gaussian formula; the theoretical justification for this is exactly minimizing the NLL.
- The cross entropy loss used for classification tasks in deep learning is, in essence, also MLE.
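For the second bullet, here is a minimal sketch (again assuming numpy/scipy) that fits a Gaussian by numerically minimizing the NLL and then compares the result with the closed-form sample statistics:

```python
# The Gaussian that minimizes the NLL has the sample mean and the (biased, 1/n)
# sample variance as its parameters.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

def nll(params):
    mu, log_sigma = params                      # optimize log(sigma) to keep sigma positive
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

res = minimize(nll, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma_hat)                        # numerical MLE
print(data.mean(), data.std(ddof=0))            # closed form: sample mean and 1/n std; they match
```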
MAP - Maximum A Posteriori Estimation
Maximum A Posteriori estimation (MAP) is the Bayesian school's most commonly used estimation method.
Likewise, suppose the data $x_1, x_2, \dots, x_n$ are a set of i.i.d. samples and write $X = (x_1, x_2, \dots, x_n)$. Then the MAP estimate of $\theta$ is derived as follows:
$$
\begin{aligned}
\hat{\theta}_{MAP} &= \arg\max\; P(\theta|X)\\
&= \arg\min\; -\log P(\theta|X)\\
&= \arg\min\; -\log\frac{P(X|\theta)P(\theta)}{P(X)}\\
&= \arg\min\; -\log P(X|\theta)-\log P(\theta)+\log P(X)\\
&= \arg\min\; -\log P(X|\theta)-\log P(\theta)
\end{aligned}
$$
Here, going from the second line to the third uses Bayes' theorem, and going from the fourth line to the fifth, $P(X)$ can be dropped because it does not depend on $\theta$. Notice that $-\log P(X|\theta)$ is exactly the NLL, so the difference between what MLE and MAP optimize is the prior term $-\log P(\theta)$. Now let's look at this prior term more closely. Suppose the prior is a Gaussian distribution, i.e.
$$P(\theta)=\text{constant}\times e^{-\frac{\theta^2}{2\sigma^2}}$$
Then $-\log P(\theta)=\text{constant}+\frac{\theta^2}{2\sigma^2}$. So something remarkable happens: using a Gaussian prior in MAP is equivalent to adding L2 regularization in MLE!
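To make the equivalence concrete, here is a minimal sketch for linear regression; the sizes, the true weights, and the noise/prior scales are arbitrary choices for illustration. With Gaussian observation noise of variance $\sigma^2$ and an $N(0, \tau^2 I)$ prior on the weights, the MAP estimate minimizes $\|y - Xw\|^2/(2\sigma^2) + \|w\|^2/(2\tau^2)$, which is exactly ridge regression with $\lambda = \sigma^2/\tau^2$.

```python
# "Gaussian prior on the weights = L2 regularization" for linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.5, 1.0                       # noise std and prior std (illustrative values)
y = X @ w_true + rng.normal(scale=sigma, size=50)

lam = sigma**2 / tau**2                     # the L2 strength implied by the Gaussian prior

# Ridge / MAP closed form: (X^T X + lam * I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Plain MLE (ordinary least squares), for comparison: no prior, no penalty
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_map)   # shrunk slightly toward zero relative to the MLE
print(w_mle)
```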
A few more remarks:
- Many of us mainly learned frequentist ideas in the university probability course, but the Bayesian school is also very popular and very practical.
- Many professors at CMU like to attack problems with Bayesian thinking; Professor Zhu Jun at THU also works on Bayesian deep learning, which is worth following if you are interested.
Postscript
Some students say: "Knowing all this is useless; nobody needs it nowadays." That is not true. This is knowledge we use all the time: it is at the core of deriving the optimization objective, and the optimization objective is one of the cores of machine learning (including deep learning). Holding such a view suggests an insufficient understanding of the nature of machine learning, and what surprised me was how many other students upvoted it. That left me a little sad, and it is also what motivated me to write this post. I hope it helps some of you.
Ref
- What do the concepts of prior distribution, posterior distribution, and likelihood estimation mean, and how are they related? (Zhihu)
- Answer by Agenter
- The Mathematics of Machine Learning, Lei Ming
- Let's talk about MLE and MAP in machine learning: maximum likelihood estimation and maximum a posteriori estimation (https://zhuanlan.zhihu.com/p/32480810)
- Bayesian Method Lecture, UT Dallas.
- MLE, MAP, Bayes classification Lecture, CMU.