[Video] Markov chain Monte Carlo (MCMC): principles and implementation in R | data sharing
2022-07-02 01:42:00 【Extension Research Office】
Link to the original text: http://tecdat.cn/?p=2687
Source of the original text: the Tecdat official account
Among Bayesian methods, Markov chain Monte Carlo methods are particularly mysterious. They certainly involve heavy mathematics and computation, but the basic reasoning behind them, like so much else in data science, can be made intuitive. That is my goal here.
Related video: Markov chain Monte Carlo (MCMC) principles and R implementation (duration 08:47)
So, what are Markov chain Monte Carlo (MCMC) methods? The short answer is:
MCMC methods approximate the posterior distribution of a parameter of interest by random sampling in a probabilistic space.
In this article, I will unpack that short answer.
First, some terminology. A parameter of interest is just a number that summarizes a phenomenon we care about. Usually we use statistics to estimate parameters. For example, if we want to learn about the height of adults, our parameter of interest might be the average height. A distribution is a mathematical representation of every possible value of our parameter and how likely we are to observe each one. The most famous example is the bell curve:
In Bayesian statistics, distributions have an additional interpretation. Rather than only representing the values of a parameter and how likely each one is to be the true value, a Bayesian thinks of a distribution as describing our beliefs about the parameter. So the bell curve above shows that we are fairly sure the value of the parameter is close to zero, but we think the true value is equally likely to be above or below that value, up to a point.
As it happens, human heights do follow a normal curve, so suppose we believe that the true value of the average human height follows a bell curve like this:
Clearly, the person whose beliefs this graph represents has been living among giants for years, because as far as they know the most likely average adult height is 1.8 m (though they are not particularly confident).
Now imagine this person collects some data and observes a range of people between 1.6 m and 1.8 m tall. We can represent that data below, along with another normal curve showing which values of the average human height best explain the data:
In Bayesian statistics, the distribution representing our beliefs about a parameter is called the prior distribution, because it captures our beliefs before seeing any data. The likelihood distribution summarizes what the observed data tell us, by representing a range of parameter values together with how likely each value is to explain the data we observed. Estimating the parameter value that maximizes the likelihood answers only this question: which parameter value makes the data we observed most probable? In the absence of prior beliefs, we might stop there.
The key to Bayesian analysis, however, is to combine the prior and likelihood distributions to determine the posterior distribution. This tells us which parameter values maximize the chance of observing the particular data we did, taking our prior beliefs into account. In our case, the posterior distribution looks like this:
In the figure above, the red line represents the posterior distribution. You can think of it as a kind of average of the prior and likelihood distributions. Because the prior distribution is shorter and wider, it represents a set of beliefs that is “less sure” about the true value of the average human height. The likelihood, meanwhile, summarizes the data within a relatively narrow range, so it represents a “more sure” guess.
When the prior and the likelihood are combined, the data (represented by the likelihood) dominate the weak prior beliefs of the hypothetical individual who grew up among giants. Although that person still believes the average human height is slightly higher than the data alone suggest, they are mostly convinced by the data.
In the case of two bell curves, solving for the posterior distribution is very easy: there is a simple formula for combining the two. But what if our prior and likelihood distributions are not so well behaved? Sometimes it is most accurate to model our data or our prior beliefs using distributions that do not have convenient shapes. What if our likelihood is best represented by a distribution with two peaks, and for some reason we want to account for a very strange prior distribution? I have visualized that scenario below by hand-drawing an ugly prior distribution:
As before, there exists some posterior distribution that gives the likelihood of each parameter value. But it is a little hard to see what it might look like, and it is impossible to solve for analytically.
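For the easy case of two bell curves, the “simple formula” can be sketched in R as a precision-weighted average of the two means. This is only an illustration: the function name combine_normals and the numbers (a prior centered on 1.8 m with standard deviation 0.2, a likelihood centered on 1.7 m with standard deviation 0.05) are assumptions chosen to mimic the giants example, not values read off the original figures.

```r
# Combine a normal prior and a normal likelihood into a normal posterior.
# Each curve is weighted by its precision (1 / variance).
combine_normals <- function(prior_mean, prior_sd, lik_mean, lik_sd) {
  w_prior <- 1 / prior_sd^2
  w_lik <- 1 / lik_sd^2
  post_var <- 1 / (w_prior + w_lik)
  post_mean <- post_var * (w_prior * prior_mean + w_lik * lik_mean)
  c(mean = post_mean, sd = sqrt(post_var))
}

# Assumed values: a wide prior around 1.8 m, a narrow likelihood around 1.7 m.
post <- combine_normals(prior_mean = 1.8, prior_sd = 0.2,
                        lik_mean = 1.7, lik_sd = 0.05)
print(post)
```

With these assumed numbers the posterior mean lands just above 1.7 m and the posterior is narrower than either input, which matches the story: the narrow likelihood dominates the wide prior.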
MCMC Methods
MCMC methods allow us to estimate the shape of a posterior distribution when we cannot compute it directly. Recall that MCMC stands for Markov chain Monte Carlo. To understand how these methods work, I will first introduce Monte Carlo simulation.
Monte Carlo simulation is simply a way of estimating a fixed parameter by repeatedly generating random numbers. By taking the generated random numbers and performing some computation on them, Monte Carlo simulation provides an approximation of the parameter.
Suppose we want to estimate the area of a circle:
Because the circle is inscribed in a square with sides of length 1, its area can easily be calculated as 0.785. Instead, however, we can drop 20 points at random inside the square. We then count the proportion of points that fall inside the circle and multiply it by the area of the square. That number is a good approximation of the area of the circle.
Since 15 of the 20 points lie inside the circle, the area of the circle looks to be about 0.75. Not bad for a Monte Carlo simulation with only 20 random points.
Monte Carlo simulations are not only used to estimate the areas of difficult shapes. By generating a large number of random numbers, they can be used to model very complicated processes.
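The circle estimate can be reproduced with a few lines of R. This is a minimal sketch: the function name estimate_circle_area and the seed are assumptions made for illustration, and a run with 20 points will not necessarily give exactly 15 hits as in the figure.

```r
set.seed(42)  # assumed seed, only so the run is repeatable

# Estimate the area of a circle of diameter 1 inscribed in the unit square.
estimate_circle_area <- function(n_points) {
  x <- runif(n_points)
  y <- runif(n_points)
  inside <- (x - 0.5)^2 + (y - 0.5)^2 <= 0.25  # radius 0.5, so radius^2 = 0.25
  square_area <- 1
  mean(inside) * square_area
}

area_20 <- estimate_circle_area(20)         # rough, like the 20-point example
area_large <- estimate_circle_area(100000)  # much closer to pi / 4
print(c(area_20 = area_20, area_large = area_large, exact = pi / 4))
```

The estimate with 20 points is noisy, while the estimate with many points settles near the exact value of pi / 4, roughly 0.785.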
With some knowledge of Monte Carlo simulation and Markov chains, I hope the math-free explanation of how MCMC methods work is now fairly intuitive.
Recall that we are trying to estimate the posterior distribution of the parameter of interest, the average human height:
I am not a visualization expert, and apparently I am also not good at keeping my examples within the bounds of common sense: my example posterior distribution seriously overestimates the average human height.
We know that the posterior distribution lies somewhere within the range of our prior distribution and our likelihood distribution, but for whatever reason we cannot compute it directly. Using MCMC methods, we effectively draw samples from the posterior distribution and then compute statistics on them, such as the average of the samples.
To begin, MCMC methods pick a random parameter value to consider. The simulation continues to generate random values (this is the Monte Carlo part), subject to a rule for deciding what makes a good parameter value. The trick is that, for a pair of parameter values, we can work out which is the better one by computing how likely each value is to explain the data, given our prior beliefs. If a randomly generated parameter value is better than the previous one, it is added to the chain of parameter values with a certain probability that depends on how much better it is (this is the Markov chain part).
To explain this visually, recall that the height of a distribution at a given value represents the probability of observing that value. We can therefore think of our parameter values (the x-axis) as having regions of high and low probability, shown on the y-axis. For a single parameter, MCMC methods begin by sampling randomly along the x-axis:
The red dots are random parameter samples.
Because the random samples are subject to fixed probabilities, they tend to converge after a while on the region of highest probability for the parameter of interest:
The blue dots simply represent random samples after some arbitrary point in time, by which convergence is expected to have occurred. Note: the points are stacked vertically purely for illustrative purposes.
After convergence, MCMC sampling yields a set of points that are samples from the posterior distribution. Draw a histogram of those points and compute whatever statistics you like:
Any statistic calculated on the set of samples generated by an MCMC simulation is our best guess of that statistic on the true posterior distribution.
MCMC methods can also be used to estimate the posterior distribution of more than one parameter (for example, human height and weight). For n parameters, there are regions of high probability in n-dimensional space where certain sets of parameter values better explain the observed data. I therefore think of MCMC methods as randomly sampling inside a probabilistic space to approximate the posterior distribution.
What is MCMC, and when would you use it?
MCMC is simply an algorithm for sampling from a distribution.
It is only one of many such algorithms. The term stands for “Markov chain Monte Carlo”, because it is a “Monte Carlo” (that is, random) method that uses “Markov chains” (discussed later). MCMC is just one type of Monte Carlo method, although many other commonly used methods can be viewed as simple special cases of MCMC.
Why would I want to sample from a distribution?
Sampling from a distribution is the easiest way to solve some problems.
Perhaps the most common use of MCMC is to draw samples from the posterior probability distribution of some model in Bayesian inference. With these samples, you can then ask questions such as “What are the mean and credible interval of a parameter?”.
If these samples (see the end of the article for how to obtain the data) are independent samples from the distribution, then the estimated mean will converge on the true mean.
Suppose our target distribution is a normal distribution with mean m and standard deviation s.
As an example, consider estimating the mean of a normal distribution with mean m and standard deviation s (here I will use the parameters of the standard normal distribution):
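We can easily sample from this distribution using the rnorm function:
m <- 0   # mean of the standard normal
s <- 1   # standard deviation of the standard normal
set.seed(1)
samples <- rnorm(10000, m, s)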
The mean of the samples is very close to the true mean (zero):
mean(samples)
## [1] -0.006537
In fact, in this case the expected variance of the estimate from $n$ samples is $1/n$, so we expect most estimates to lie within $\pm 2/\sqrt{n} = 0.02$ of the true mean for $n = 10000$.
summary(replicate(1000, mean(rnorm(10000, m, s))))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.03250 -0.00580  0.00046  0.00042  0.00673  0.03550
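The claim about the spread of the estimate can be checked directly. The sketch below is an illustration under assumed settings (seed 1 and 1,000 repetitions); the variable names sd_observed, sd_expected and share_within are introduced here and do not come from the original code.

```r
set.seed(1)  # assumed seed
n <- 10000

# Repeat the experiment: each repetition estimates the mean from n draws.
estimates <- replicate(1000, mean(rnorm(n, 0, 1)))

sd_observed <- sd(estimates)   # empirical spread of the estimates
sd_expected <- 1 / sqrt(n)     # theoretical value, 0.01 for n = 10000

# Share of estimates within +/- 2 / sqrt(n) of the true mean (zero).
share_within <- mean(abs(estimates) < 2 * sd_expected)
print(c(sd_observed = sd_observed, sd_expected = sd_expected,
        share_within = share_within))
```

The observed spread should sit close to 0.01, and roughly 95% of the estimates should fall within the band of 0.02 on either side of zero.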
This function computes the cumulative mean (the cumulative sum divided by the number of elements so far):
cummean <- function(x) cumsum(x) / seq_along(x)
plot(cummean(samples), type="l", xlab="Sample", ylab="Cumulative mean", panel.first=abline(h=0, col="red"), las=1)
Transforming the x-axis to a log scale and showing another 30 random runs:
You can also extract sample quantiles from your series of sampled points.
This is the analytically computed point with 2.5% of the probability density below it:
p <- 0.025
a.true <- qnorm(p, m, s)
a.true
## [1] -1.96
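In this case we can also estimate it by direct numerical integration:
f <- function(x) dnorm(x, m, s)
g <- function(a) integrate(f, -Inf, a)$value
# Find the point a where the integral up to a equals p
a.int <- uniroot(function(x) g(x) - p, c(-10, 0))$root
a.int
## [1] -1.96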
And estimate the point by Monte Carlo integration:
a.mc <- unname(quantile(samples, p))
a.mc
## [1] -2.023
a.true - a.mc
## [1] 0.06329
However, in the limit as the sample size goes to infinity, this will converge. Furthermore, it is possible to make statements about the nature of the error; if we repeat the sampling process 100 times, we get a series of estimates whose error is of about the same magnitude as the error around the mean:
a.mc <- replicate(100, quantile(rnorm(10000, m, s), p))
summary(a.true - a.mc)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.05840 -0.01640 -0.00572 -0.00024  0.01400  0.07880
This kind of thing is really common. In most Bayesian inference, the posterior distribution is a function of some (possibly large) vector of parameters, and you want to make inferences about a subset of those parameters.
In a hierarchical model, you might have a large number of random-effect terms being fitted, but you mostly want to make inferences about one parameter. In a Bayesian framework, you would compute the marginal distribution of the parameter you are interested in over all the other parameters (this is what we were trying to do above).
Why doesn't “traditional statistics” use Monte Carlo methods?
For many problems in traditionally taught statistics, rather than sampling from a distribution you maximize or minimize a function. So we take some function that describes the likelihood and maximize it (maximum likelihood inference), or some function that computes the sum of squares and minimize it.
The role of Monte Carlo methods in Bayesian statistics, however, is the same as the role of optimization routines in frequentist statistics: it is simply an algorithm for carrying out the inference. So once you basically know what MCMC is doing, you can treat it as a black box, just as most people treat their optimization routines.
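For contrast with sampling, here is a minimal sketch of the optimization route using R's optimize function. Everything specific in it is an assumption for illustration: the simulated heights data, the known standard deviation of 0.1, and the search interval.

```r
set.seed(1)  # assumed seed
# Hypothetical data: 200 heights in meters with a known spread of 0.1.
heights <- rnorm(200, mean = 1.7, sd = 0.1)

# Negative log-likelihood of the mean, with the standard deviation held fixed.
neg_log_lik <- function(mu) {
  -sum(dnorm(heights, mean = mu, sd = 0.1, log = TRUE))
}

# Minimizing the negative log-likelihood maximizes the likelihood.
fit <- optimize(neg_log_lik, interval = c(1, 2.5))
mle <- fit$minimum
print(c(mle = mle, sample_mean = mean(heights)))
```

For a normal model with known spread, the maximum likelihood estimate of the mean coincides with the sample mean, so the two printed numbers should agree up to the tolerance of the optimizer.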
Markov chain Monte Carlo
Suppose we want to sample from some target distribution, but we cannot draw independent samples as we did before. There is a solution for doing this using Markov chain Monte Carlo (MCMC). First, we need to define a few things so that the next sentence makes sense: what we are going to do is construct a Markov chain that has the target distribution as its stationary distribution.
Suppose we have a three-state Markov process. Let P be the transition probability matrix of the chain:
P <- rbind(c(.5, .25, .25), c(.2, .1, .7), c(.25, .25, .5))
P
##      [,1] [,2] [,3]
## [1,] 0.50 0.25 0.25
## [2,] 0.20 0.10 0.70
## [3,] 0.25 0.25 0.50
rowSums(P)
## [1] 1 1 1
The entry P[i,j] gives the probability of moving from state i to state j.
Note that, unlike the rows, the columns do not necessarily sum to 1:
colSums(P)
## [1] 0.95 0.60 1.45
This function takes a state vector x (where x[i] is the probability of being in state i) and iterates it by multiplying by the transition matrix P, advancing the system n steps.
iterate.P <- function(x, P, n) {
    res <- matrix(NA, n+1, length(x))  # row 1 holds the starting vector
    res[1,] <- x
    for (i in seq_len(n))
        res[i+1,] <- x <- x %*% P
    res
}
Start with the system in state 1 (so x is the vector [1,0,0], meaning a 100% probability of being in state 1 and no chance of being in any other state):
And likewise for the other two possible starting states:
This shows convergence to the stationary distribution.
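The runs from the three starting states can be sketched as follows. The block repeats the definitions of P and iterate.P so that it is self-contained; the number of steps (n <- 10) and the names y1, y2, y3 and final are assumptions, since that part of the code is not shown here.

```r
P <- rbind(c(.5, .25, .25), c(.2, .1, .7), c(.25, .25, .5))

iterate.P <- function(x, P, n) {
  res <- matrix(NA, n + 1, length(x))
  res[1, ] <- x
  for (i in seq_len(n))
    res[i + 1, ] <- x <- x %*% P
  res
}

n <- 10  # assumed number of steps
y1 <- iterate.P(c(1, 0, 0), P, n)  # start in state 1
y2 <- iterate.P(c(0, 1, 0), P, n)  # start in state 2
y3 <- iterate.P(c(0, 0, 1), P, n)  # start in state 3

# State probabilities after n steps, one row per starting state.
final <- rbind(y1[n + 1, ], y2[n + 1, ], y3[n + 1, ])
print(round(final, 4))
```

All three rows of the printed matrix should be nearly identical: whichever state the system starts in, the probabilities settle on the same stationary distribution.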
We can use R's eigen function to extract the leading eigenvector of the system (the t() here transposes the matrix so that we get the left eigenvector).
v <- eigen(t(P))$vectors[,1]
v <- v/sum(v)  # normalize the eigenvector so it sums to 1
Then add points to the previous figure, showing how close we are to convergence:
The process above iterates the overall probabilities of the different states, not the actual transitions of the system. So let us iterate the system itself rather than the probability vector.
run <- function(i, P, n) {
    res <- integer(n)
    for (t in seq_len(n))
        # draw the next state using row i of the transition matrix
        res[[t]] <- i <- sample(nrow(P), 1, prob=P[i,])
    res
}
Here is the chain running for 100 steps:
Rather than plotting the state itself, plot the fraction of time spent in each state over time:
Run it again for longer (5,000 steps):
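n <- 5000
set.seed(1)
samples <- run(1, P, n)
plot(cummean(samples == 1), type="l", ylim=c(0, 1), xlab="Step", ylab="Fraction of time in state", las=1)
lines(cummean(samples == 2), col=2)
lines(cummean(samples == 3), col=3)
abline(h=v, lty=2, col=1:3)  # dashed lines mark the stationary distribution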
So the key point here is this: Markov chains have some nice properties. A Markov chain has a stationary distribution, and if we run the chain long enough we can simply look at where it spends its time and get a reasonable estimate of that stationary distribution.
Metropolis Algorithm
This is the simplest MCMC algorithm.
MCMC sampling for a 1D (single-parameter) problem
The target here is a weighted sum of two normal distributions. This kind of distribution is fairly easy to sample from directly, but we will draw samples from it with MCMC.
Here are the definitions of some parameters and the target density.
Plot of the probability density:
Let us define a very simple proposal: it samples from a normal distribution centered on the current point with a standard deviation of 4.
All that remains is to run the MCMC for a number of steps. Starting from the point x, the function returns a matrix with nsteps rows and as many columns as x has elements. If run on a scalar x, it returns a vector.
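run <- function(x, f, q, nsteps) {
    res <- matrix(NA, nsteps, length(x))
    for (i in seq_len(nsteps))
        res[i,] <- x <- step(x, f, q)
    drop(res)  # a one-column matrix becomes a plain vector
}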
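The pieces described in this section can be put together in one self-contained sketch. The mixture weight, means and standard deviations (p, mu, sds) are assumed values chosen for illustration, as is the seed; the step function below is the standard Metropolis accept/reject rule for a symmetric proposal, written out because its definition is not shown here. Note that defining step in the workspace masks stats::step.

```r
# Target density: weighted sum of two normals (all parameter values assumed).
p <- 0.4
mu <- c(-1, 2)
sds <- c(0.5, 2)
f <- function(x) {
  p * dnorm(x, mu[1], sds[1]) + (1 - p) * dnorm(x, mu[2], sds[2])
}

# Proposal: normal centered on the current point with standard deviation 4.
q <- function(x) rnorm(1, x, 4)

# One Metropolis step: propose a point, then accept it with
# probability min(1, f(proposed) / f(current)); otherwise stay put.
step <- function(x, f, q) {
  xp <- q(x)
  alpha <- min(1, f(xp) / f(x))
  if (runif(1) < alpha) x <- xp
  x
}

run <- function(x, f, q, nsteps) {
  res <- matrix(NA, nsteps, length(x))
  for (i in seq_len(nsteps))
    res[i, ] <- x <- step(x, f, q)
  drop(res)
}

set.seed(1)  # assumed seed
res <- run(-10, f, q, 20000)

# Compare the chain mean (after discarding early steps) with the
# exact mean of the mixture, p * mu[1] + (1 - p) * mu[2].
chain_mean <- mean(res[-(1:1000)])
print(c(chain_mean = chain_mean, exact = p * mu[1] + (1 - p) * mu[2]))
```

Because the proposal is symmetric, only the ratio of target densities enters the acceptance probability, so f does not need to be normalized. The chain starts far out in the tail at -10, which is why the first 1,000 steps are dropped before averaging.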
Here are the first 1,000 steps of the Markov chain, with the target density on the right:
res <- run(-10, f, q, 1000)
layout(matrix(c(1, 2), 1, 2), widths=c(4, 1))
plot(res, type="s", xpd=NA, ylab="Parameter", xlab="Sample", las=1)
usr <- par("usr")
xx <- seq(usr[3], usr[4], length.out=301)  # span the y-axis range of the trace
plot(f(xx), xx, type="l", yaxs="i", axes=FALSE, xlab="")
hist(res, 50, freq=FALSE, main="", ylim=c(0, .4), las=1, xlab="x", ylab="Probability density")
z <- integrate(f, -Inf, Inf)$value
curve(f(x)/z, add=TRUE, col="red", n=200)
Run for longer, and the results start to look better:
Now run with different proposal mechanisms: one with a very large standard deviation (33) and one with a very small standard deviation (3).
Note the different ways the three traces move around.
By contrast, the red trace (large proposal steps) proposes moves into poor regions of probability space and rejects most of them.
The blue trace proposes small moves that tend to be accepted, but it moves in a random walk for most of the trajectory. It takes hundreds of iterations even to reach the bulk of the probability density.
You can see the effect of the different proposal steps in the autocorrelation among subsequent parameter values. These plots show the decay of the autocorrelation coefficient between steps at different lags, with the blue lines indicating statistical independence.
From this we can compute the effective number of independent samples:
coda::effectiveSize(res)
## var1
##  187
coda::effectiveSize(res.fast)
##  var1
## 33.19
coda::effectiveSize(res.slow)
##  var1
## 5.378
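This shows more clearly what happens as the chain runs for longer:
n <- 10^(2:5)
samples <- lapply(n, function(n) run(-10, f, q, n))
xlim <- range(sapply(samples, range))
br <- seq(xlim[1], xlim[2], length.out=100)  # common histogram breaks
hh <- lapply(samples, function(x) hist(x, br, plot=FALSE))
ylim <- c(0, max(f(xx)))
Showing 100, 1,000, 10,000 and 100,000 steps: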
MCMC in two dimensions
We define a multivariate normal density, given a mean vector (the center of the distribution) and a variance-covariance matrix.
make.mvn <- function(mean, vcv) {
    logdet <- as.numeric(determinant(vcv, TRUE)$modulus)
    tmp <- length(mean) * log(2 * pi) + logdet
    vcv.i <- solve(vcv)  # inverse of the variance-covariance matrix
    function(x) {
        dx <- x - mean
        exp(-(tmp + rowSums((dx %*% vcv.i) * dx))/2)
    }
}
As above, define the target density to be the sum of two mvns (this time unweighted):
Sampling from a multivariate normal distribution is also fairly straightforward, but we will draw samples from this target using MCMC.
There are a few different strategies here: we can propose moves in both dimensions at once, or we can sample along each axis independently. Both strategies work, although they will differ in how quickly they mix.
Assume that we do not actually know how to sample from an mvn. Let us make a proposal distribution that is uniform in two dimensions, sampling from the square with sides of width “d” centered on the current point.
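A self-contained sketch of the two-dimensional case follows. The means and variance-covariance matrices of the two mvns, the default width d = 8, the starting point and the seed are all assumed values for illustration; step and run are the same Metropolis helpers used for the one-dimensional problem, repeated so the block runs on its own.

```r
make.mvn <- function(mean, vcv) {
  logdet <- as.numeric(determinant(vcv, TRUE)$modulus)
  tmp <- length(mean) * log(2 * pi) + logdet
  vcv.i <- solve(vcv)
  function(x) {
    dx <- x - mean
    exp(-(tmp + rowSums((dx %*% vcv.i) * dx)) / 2)
  }
}

# Target: unweighted sum of two bivariate normals (parameters assumed).
f1 <- make.mvn(c(-1, 1), matrix(c(1, .25, .25, 1.5), 2, 2))
f2 <- make.mvn(c(2, -2), matrix(c(2, -.5, -.5, 2), 2, 2))
f <- function(x) f1(x) + f2(x)

# Proposal: uniform on a square of side d centered on the current point.
q <- function(x, d = 8) x + runif(length(x), -d / 2, d / 2)

# Metropolis accept/reject step for a symmetric proposal.
step <- function(x, f, q) {
  xp <- q(x)
  alpha <- min(1, f(xp) / f(x))
  if (runif(1) < alpha) x <- xp
  x
}

run <- function(x, f, q, nsteps) {
  res <- matrix(NA, nsteps, length(x))
  for (i in seq_len(nsteps))
    res[i, ] <- x <- step(x, f, q)
  drop(res)
}

set.seed(1)  # assumed seed
samples <- run(c(-4, -4), f, q, 20000)
print(colMeans(samples))
```

With two equally weighted components, the exact mean of the target is the midpoint of the two assumed centers, (0.5, -0.5), so the printed column means should land near those values. A wider or narrower square changes how often proposals are accepted and therefore how quickly the chain mixes.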
Compare the sampled distribution with the known distribution:
For example, what is the marginal distribution of parameter 1?
hist(samples[,1], freq=FALSE, main="", xlab="x", ylab="Probability density")
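To get the true marginal, we need to integrate over all possible values of the second parameter for each value of the first. Then, because the target function is not itself normalized, we have to divide that by the value of the one-dimensional integral over the first parameter.
m <- function(x1) {
    # integrate the target over the second parameter at a fixed x1
    g <- Vectorize(function(x2) f(c(x1, x2)))
    integrate(g, -Inf, Inf)$value
}
xx <- seq(min(samples[,1]), max(samples[,1]), length.out=201)
yy <- sapply(xx, m)
z <- integrate(splinefun(xx, yy), min(xx), max(xx))$value  # normalizing constant
hist(samples[,1], freq=FALSE, main="", xlab="x", ylab="Probability density", ylim=c(0, 0.25))
lines(xx, yy/z, col="red")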
Data acquisition
Reply “MCMC data” to the official account to obtain the complete data.
This article is excerpted from “Implementing Markov chain Monte Carlo (MCMC) models in R”.
Click on a title to read earlier articles
Implementing the Metropolis–Hastings algorithm and Gibbs sampling for MCMC in R
Implementing the Metropolis-Hastings (M-H) MCMC sampling algorithm for Bayesian inference in Python
Metropolis-Hastings sampling and Bayesian Poisson regression models
Metropolis-Hastings sampling and Bayesian Poisson regression models in R
Bayesian multiple linear regression with block Gibbs sampling in R
Bayesian regression analysis of a housing affordability data set in Python
Implementing a Bayesian linear regression model with PyMC3 in Python
Simulation analysis of Bayesian simple linear regression with Gibbs sampling in R
Diagnostic accuracy study with a copula-based Bayesian hierarchical mixed model in R
Bayesian inference and MCMC in R: an example implementation of the Metropolis-Hastings sampling algorithm
Regression models based on Bayesian inference with Stan in R
Example analysis of Bayesian hierarchical models with RStan in R
WinBUGS for multivariate stochastic volatility models: Bayesian estimation and model comparison
Video: Bayesian models with Stan probabilistic programming and MCMC sampling in R
MCMC in R: Bayesian estimation of regression with Metropolis-Hastings sampling