Prior, Posterior, and Likelihood
2022-07-28 18:51:00 【Adenialzz】
Prior Distribution, Posterior Distribution, and the Likelihood Function
This section is adapted from: What do the concepts of prior distribution, posterior distribution, and likelihood estimation mean, and how are they related?
An intuitive explanation
Prior distribution: the distribution you believe a random variable should follow, based on general experience and before seeing any data. In other words, the prior is your initial guess about what distribution the parameter obeys.
Posterior distribution: the distribution of the variable after conditioning on the observed data; it fits the current data better than the prior does. In other words, the posterior is your updated guess about the parameter's distribution after learning from experience.
Likelihood estimation: given the training data and a model, a method for estimating the model parameters by maximizing the likelihood. In other words, likelihood estimation looks for the parameter values that best explain the observed outcomes.
An example
These concepts can be understood in terms of the "probability of the cause" versus the "probability of the outcome", and of which one is fixed and which one is conditioned on.
Consider the following example: Lao Wang next door is going to a place 10 kilometers away. He can choose to walk, ride a bike, or drive, and it takes him some amount of time to reach the destination. In this setup, treat the mode of transportation (walking, cycling, or driving) as the cause, and the time spent as the outcome.
Posterior probability
If Lao Wang took an hour to cover the 10 kilometers, he most likely rode a bike; of course, he might also be a fitness enthusiast who ran the whole way, or he might have driven through heavy traffic. If it took him two hours, he most likely walked. If it took only 20 minutes, he most likely drove. Knowing the outcome first and then estimating the probability distribution of the cause from it, p(mode | time), is the posterior probability.
Prior probability
Perhaps Lao Wang woke up feeling energetic, wanted some exercise, and decided to run; perhaps he wanted to play the hipster, try the recently popular shared bikes, and decided to ride; or perhaps he wanted to show off his wealth and decided to drive. His choice has nothing to do with the arrival time. Determining the probability distribution of the cause before observing any outcome, p(mode), is the prior probability.
Likelihood function
If Lao Wang decides to walk, the 10 kilometers will most likely take about two hours; if he exercises regularly, he might run it in an hour; if he is a real beast, maybe 40 minutes. If he decides to ride a bike, it will probably take about an hour; if he is in great spirits that day and traffic is smooth, maybe 40 minutes; if he is unlucky and a couple of shared bikes break down on him, it might take an hour and a half. If he decides to drive, it will most likely take about 20 minutes; if traffic is terrible that day, it might take half an hour. Fixing the cause first and estimating the probability distribution of the outcome from it, p(time | mode), is the likelihood.
Evidence
Lao Wang has gone to that place many times. Regardless of the mode of transportation, we end up with a probability distribution over the time taken. Ignoring the cause and looking only at the distribution of the outcome, p(time), gives us one more term: the evidence (I am not sure what the proper Chinese name for it is).
Finally, here is the famous Bayes formula (a small numerical sketch follows the symbol definitions below):
$$p(\theta|x)=\frac{p(x|\theta)\,p(\theta)}{p(x)}$$
- $x$: the observed data (the outcome)
- $\theta$: the parameters that determine the data distribution (the cause)
- $p(\theta|x)$: the posterior
- $p(\theta)$: the prior
- $p(x|\theta)$: the likelihood
- $p(x)$: the evidence
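To make the formula concrete, here is a minimal Python sketch based on the transportation example above. All of the prior and likelihood numbers are made up purely for illustration.

```python
# A minimal numerical sketch of Bayes' rule for the transportation example.
# All probabilities below are invented for illustration only.

priors = {"walk": 0.3, "bike": 0.4, "drive": 0.3}          # p(mode)
# p(time = "about one hour" | mode), hypothetical values
likelihoods = {"walk": 0.10, "bike": 0.70, "drive": 0.05}

# evidence: p(time) = sum over modes of p(time | mode) * p(mode)
evidence = sum(likelihoods[m] * priors[m] for m in priors)

# posterior: p(mode | time) = p(time | mode) * p(mode) / p(time)
posterior = {m: likelihoods[m] * priors[m] / evidence for m in priors}

print(posterior)
# -> roughly {'walk': 0.09, 'bike': 0.86, 'drive': 0.05}
# "about one hour" makes cycling by far the most probable cause.
```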
Maximum Likelihood Estimation (MLE) and Maximum A Posteriori Estimation (MAP)
This section is adapted from: https://zhuanlan.zhihu.com/p/32480810
- Frequentist school: Maximum Likelihood Estimation (MLE)
- Bayesian school: Maximum A Posteriori estimation (MAP)
Overview
Sometimes when chatting with someone who claims extensive machine learning experience, a deeper conversation reveals that they have only a superficial understanding of MLE and MAP. In my view at least, such a person's machine learning foundations are not solid. In this age of deep learning, do many students really only care about tuning hyperparameters?
In modern machine learning, the problem is ultimately turned into optimizing an objective function, and MLE and MAP are the most basic ideas for deriving that function, so understanding both is very important. In this post I will walk carefully through these two kinds of estimators.
The debate between the two schools
Speaking abstractly, the frequentist and Bayesian schools see the world differently. The frequentist school holds that the world is deterministic: there is an underlying truth whose value is fixed, and our goal is to find that value or its range. The Bayesian school holds that the world is uncertain: we start with a prior belief about the world, then adjust that belief using observed data, and our goal is to find the probability distribution that best describes the world.
When modeling a problem, we use $\theta$ to denote the model parameters. **Note that the essence of solving the problem is to estimate $\theta$.** Then:
(1) Frequentist school: there is only one true value of $\theta$. Take a simple, intuitive example: flipping a coin. We use $P(head)$ to denote the coin's bias. Flip the coin 100 times; if 20 of them land heads up, we estimate the bias $P(head)=\theta$. From the frequentist viewpoint, $\theta = 20/100 = 0.2$, which is intuitive. As the amount of data goes to infinity, this method gives an accurate estimate; with little data, however, it can go badly wrong. For example, take a fair coin, i.e. $\theta = 0.5$, toss it 5 times, and suppose all 5 land heads (the probability of this happening is $1/2^5 = 3.125\%$). The frequentist school would directly estimate $\theta = 1$ for this coin, a serious error.
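As a quick sketch of the frequentist estimate (the numbers simply mirror the two cases in the paragraph above):

```python
# Frequentist / MLE point estimate of the coin bias: the empirical frequency.

def mle_bias(num_heads: int, num_flips: int) -> float:
    """Estimate P(head) as the observed fraction of heads."""
    return num_heads / num_flips

print(mle_bias(20, 100))  # 0.2 -- plenty of data, a reasonable estimate
print(mle_bias(5, 5))     # 1.0 -- 5 heads in 5 flips of a fair coin gives a badly biased estimate
```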
(2) Bayesian school: $\theta$ is a random variable that follows some probability distribution. In the Bayesian view there are two inputs and one output: the inputs are the prior and the likelihood, and the output is the posterior. The prior, $P(\theta)$, is your judgment about $\theta$ before seeing any data; for example, given a coin, one reasonable prior is that the coin is fair with high probability and biased with only a small probability. The likelihood, $P(X|\theta)$, describes what data we would expect to observe assuming $\theta$ is known. The posterior, $P(\theta|X)$, is the final distribution over the parameter. Bayesian estimation is built on Bayes' formula, as follows:
$$P(\theta|X)=\frac{P(X|\theta)\,P(\theta)}{P(X)}$$
Take the coin example again: toss a fair coin 5 times and get 5 heads. If the prior says the coin is probably fair (for example, a Beta distribution with its maximum at 0.5), then $P(head)$, i.e. $P(\theta|X)$, is a distribution whose maximum lies somewhere between 0.5 and 1, rather than the extreme $\theta = 1$.
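Here is a minimal sketch of that update, assuming a Beta prior; Beta is the conjugate prior of the Bernoulli, so the posterior is again a Beta, and the specific Beta(5, 5) prior is just one illustrative "probably fair" choice.

```python
# Conjugate Beta-Bernoulli update for the coin example.
# Beta(5, 5) is one possible "probably fair" prior: its density peaks at 0.5.
a, b = 5.0, 5.0          # prior Beta(a, b)
heads, tails = 5, 0      # observed data: 5 heads in 5 tosses

# Conjugate update: the posterior is Beta(a + heads, b + tails)
post_a, post_b = a + heads, b + tails

posterior_mode = (post_a - 1) / (post_a + post_b - 2)
print(posterior_mode)    # ~0.69: pulled toward 1 by the data, but far from theta = 1
```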
Two points are worth noting here (a small numerical check follows this list):
- As the amount of data increases, the parameter distribution gets pulled closer and closer to what the data says, and the influence of the prior shrinks.
- If the prior is a uniform distribution, the Bayesian method is equivalent to the frequentist one, because intuitively a uniform prior means you have no prior expectation about the outcome at all.
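A small numerical check of both points, reusing the conjugate Beta-Bernoulli update from the sketch above; the particular prior strengths and data sizes are made up for illustration.

```python
# Check the two notes above with the conjugate Beta-Bernoulli posterior mode.

def beta_posterior_mode(prior_a: float, prior_b: float, heads: int, tails: int) -> float:
    a, b = prior_a + heads, prior_b + tails
    return (a - 1) / (a + b - 2)

# 1) Uniform prior Beta(1, 1): the posterior mode equals the frequentist estimate heads / n.
print(beta_posterior_mode(1, 1, 20, 80))        # 0.2, same as 20 / 100

# 2) With a strong "fair coin" prior Beta(50, 50), more data overwhelms the prior.
print(beta_posterior_mode(50, 50, 20, 80))      # ~0.35: the prior still matters with 100 flips
print(beta_posterior_mode(50, 50, 2000, 8000))  # ~0.20: with 10000 flips the prior barely matters
```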
MLE - Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) is the frequentist school's most common estimation method.
Suppose the data $x_1, x_2, \dots, x_n$ are a set of i.i.d. samples, and write $X = (x_1, x_2, \dots, x_n)$, where i.i.d. stands for independent and identically distributed. Then the MLE of $\theta$ is derived as follows:
$$
\begin{aligned}
\hat{\theta}_{MLE} &= \arg\max\; P(X;\theta)\\
&= \arg\max\; P(x_1;\theta)P(x_2;\theta)\dots P(x_n;\theta)\\
&= \arg\max\; \log\prod_{i=1}^n P(x_i;\theta)\\
&= \arg\max\; \sum_{i=1}^n \log P(x_i;\theta)\\
&= \arg\min\; -\sum_{i=1}^n \log P(x_i;\theta)
\end{aligned}
$$
The function minimized in the last line is called the Negative Log Likelihood (NLL). This concept, and the derivation above, are very important!
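As a sketch of "MLE = minimize the NLL" in code, here is the coin example solved numerically rather than in closed form (numpy and scipy are assumed to be available):

```python
# Numerically minimize the Bernoulli NLL and recover the closed-form MLE.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1] * 20 + [0] * 80)   # 20 heads, 80 tails (1 = head, 0 = tail)

def nll(theta: float) -> float:
    # Negative log-likelihood of i.i.d. Bernoulli data: -sum_i log P(x_i; theta)
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

res = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # ~0.2, matching the closed-form frequentist estimate heads / n
```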
We often use MLE without even noticing it, for example (a small sketch follows this list):
- The frequentist coin-bias calculation in the example above is, in essence, obtained by minimizing the NLL; the detailed derivation is given in the appendix at the end of the original article.
- Given some data, when fitting a Gaussian distribution to it, we usually compute the sample mean and variance of the data points and plug them into the Gaussian formula; the theoretical justification for this is exactly minimizing the NLL.
- The cross entropy loss used for classification tasks in deep learning is, in essence, also MLE.
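For the second bullet, here is a minimal sketch (again assuming numpy/scipy) that fits a Gaussian by numerically minimizing the NLL and then compares the result with the closed-form sample statistics:

```python
# The Gaussian that minimizes the NLL has the sample mean and the (biased, 1/n)
# sample variance as its parameters.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

def nll(params):
    mu, log_sigma = params                      # optimize log(sigma) to keep sigma positive
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

res = minimize(nll, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma_hat)                        # numerical MLE
print(data.mean(), data.std(ddof=0))            # closed form: sample mean and 1/n std; they match
```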
MAP - Maximum A Posteriori Estimation
Maximum A Posteriori estimation (MAP) is the Bayesian school's most commonly used estimation method.
Likewise, suppose the data $x_1, x_2, \dots, x_n$ are a set of i.i.d. samples and write $X = (x_1, x_2, \dots, x_n)$. Then the MAP estimate of $\theta$ is derived as follows:
$$
\begin{aligned}
\hat{\theta}_{MAP} &= \arg\max\; P(\theta|X)\\
&= \arg\min\; -\log P(\theta|X)\\
&= \arg\min\; -\log\frac{P(X|\theta)P(\theta)}{P(X)}\\
&= \arg\min\; -\log P(X|\theta)-\log P(\theta)+\log P(X)\\
&= \arg\min\; -\log P(X|\theta)-\log P(\theta)
\end{aligned}
$$
Here, going from the second line to the third uses Bayes' theorem, and going from the fourth line to the fifth, $P(X)$ can be dropped because it does not depend on $\theta$. Notice that $-\log P(X|\theta)$ is exactly the NLL, so the difference between what MLE and MAP optimize is the prior term $-\log P(\theta)$. Now let's look at this prior term more closely. Suppose the prior is a Gaussian distribution, i.e.
$$P(\theta)=\text{constant}\times e^{-\frac{\theta^2}{2\sigma^2}}$$
Then $-\log P(\theta)=\text{constant}+\frac{\theta^2}{2\sigma^2}$. So something remarkable happens: using a Gaussian prior in MAP is equivalent to adding L2 regularization in MLE!
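To make the equivalence concrete, here is a minimal sketch for linear regression; the sizes, the true weights, and the noise/prior scales are arbitrary choices for illustration. With Gaussian observation noise of variance $\sigma^2$ and an $N(0, \tau^2 I)$ prior on the weights, the MAP estimate minimizes $\|y - Xw\|^2/(2\sigma^2) + \|w\|^2/(2\tau^2)$, which is exactly ridge regression with $\lambda = \sigma^2/\tau^2$.

```python
# "Gaussian prior on the weights = L2 regularization" for linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.5, 1.0                       # noise std and prior std (illustrative values)
y = X @ w_true + rng.normal(scale=sigma, size=50)

lam = sigma**2 / tau**2                     # the L2 strength implied by the Gaussian prior

# Ridge / MAP closed form: (X^T X + lam * I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Plain MLE (ordinary least squares), for comparison: no prior, no penalty
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_map)   # shrunk slightly toward zero relative to the MLE
print(w_mle)
```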
A few more remarks:
- Many of us mainly learned frequentist ideas in the university probability course, but the Bayesian school is also very popular and very practical.
- Many professors at CMU like to attack problems with Bayesian thinking; Professor Zhu Jun at THU also works on Bayesian deep learning, which is worth following if you are interested.
Postscript
Some students say: "Knowing all this is useless; nobody needs it nowadays." That is not true. This is knowledge we use all the time: it is at the core of deriving the optimization objective, and the optimization objective is one of the cores of machine learning (including deep learning). Holding such a view suggests an insufficient understanding of the nature of machine learning, and what surprised me was how many other students upvoted it. That left me a little sad, and it is also what motivated me to write this post. I hope it helps some of you.
Ref
- What do the concepts of prior distribution, posterior distribution, and likelihood estimation mean, and how are they related? (Zhihu)
- Answer by Agenter
- The Mathematics of Machine Learning, Lei Ming
- Let's talk about MLE and MAP in machine learning: maximum likelihood estimation and maximum a posteriori estimation (https://zhuanlan.zhihu.com/p/32480810)
- Bayesian Method Lecture, UT Dallas.
- MLE, MAP, Bayes classification Lecture, CMU.