Language model
2022-07-05 12:39:00 【NLP journey】
Whenever the word "model" comes up, it can feel abstract, but it becomes more intuitive if a model is understood as a function or mapping. For a language model, the input is a sentence and the output is the probability that this sentence occurs. This mapping from input to output can be regarded as the model. (This is a loose, personal understanding.)
This post covers the following topics (the content is based on Michael Collins' handouts for his Coursera online course):
- Introduction
- Markov models
- Trigram language models
- Smoothing
- Other topics
Introduction
We are first given a corpus containing a number of sentences. Define a vocabulary V containing all the words that appear in the corpus. For example, V might look like this:
V = {the, dog, laughs, saw, barks, cat, …}
In practice V can be very large, but we treat it as a finite set. A sentence can then be written as
x1 x2 x3 … xn
where xn is the stop symbol STOP and x1 … xn-1 belong to V. For example:
the dog barks STOP
the cat saw the dog STOP
the STOP
cat cat cat STOP
STOP
…
Let V+ denote the set of all sentences that can be built from words in V. Because sentences can be arbitrarily long, V+ is an infinite set.
Definition (language model): A language model consists of a finite set V and a function p(x1, x2, …, xn) satisfying:
1. For any sequence <x1, x2, ..., xn> ∈ V+, p(x1, x2, ..., xn) ≥ 0
2. The probabilities sum to one: ∑ over <x1, ..., xn> ∈ V+ of p(x1, x2, ..., xn) = 1
Therefore p(x1, x2, …, xn) is a probability distribution over V+.
How can the probability of each sentence under this distribution be estimated? Define c(x1, x2, …, xn) as the number of times the sentence x1 x2 … xn occurs in the corpus, and N as the total number of sentences in the corpus. An obvious estimate is p(x1, x2, …, xn) = c(x1, …, xn)/N. But this simple method fails for any sentence that does not appear in the corpus: its probability is estimated as 0, even though every word in the sentence belongs to V.
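As a concrete illustration, here is a minimal sketch of this naive whole-sentence estimator on an invented toy corpus (the sentences and counts below are assumptions for illustration only):

```python
from collections import Counter

# Hypothetical toy corpus; each sentence ends with the STOP symbol.
corpus = [
    ("the", "dog", "barks", "STOP"),
    ("the", "cat", "saw", "the", "dog", "STOP"),
    ("the", "dog", "barks", "STOP"),
]

counts = Counter(corpus)   # c(x1...xn): how often each full sentence occurs
N = len(corpus)            # N: total number of sentences in the corpus

def naive_sentence_prob(sentence):
    """p(x1..xn) = c(x1..xn) / N -- zero for any sentence never seen verbatim."""
    return counts[tuple(sentence)] / N

print(naive_sentence_prob(["the", "dog", "barks", "STOP"]))  # 2/3: seen twice
print(naive_sentence_prob(["the", "cat", "barks", "STOP"]))  # 0.0, although every word is in V
```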
Markov model
Markov models for fixed-length sequences
A Markov model rests on a conditional-independence assumption. By the chain rule, p(x1, x2, …, xn) factorizes exactly as:
p(x1, x2, …, xn) = p(x1) * p(x2|x1) * p(x3|x1, x2) * p(x4|x1, x2, x3) * … * p(xn|x1, x2, …, xn-1)
The first-order Markov assumption is that each conditional probability depends only on the immediately preceding word, e.g. p(x3|x1, x2) = p(x3|x2). Then:
p(x1, x2, …, xn) = p(x1) * p(x2|x1) * p(x3|x2) * … * p(xn|xn-1)
The second-order Markov assumption is that each conditional probability depends only on the previous two words.
That is:
p(x1, x2, …, xn) = p(x1) * p(x2|x1) * p(x3|x1, x2) * p(x4|x2, x3) * … * p(xn|xn-2, xn-1)
And so on for higher orders.
Markov models for variable-length sequences
A variable-length Markov model describes a process for generating sequences:
1. Initialize i = 1 and set x0 = x-1 = * (x0 and x-1 can be regarded as two virtual start symbols).
2. Generate xi according to the probability p(xi|xi-2, xi-1).
3. If xi = STOP, return the sequence x1 … xi. Otherwise, set i = i + 1 and repeat step 2.
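The three steps above can be sketched as follows, assuming a hypothetical probability table q that maps each context (u, v) to a distribution over next words (the values are invented for illustration):

```python
import random

# Hypothetical trigram distribution q[(u, v)] -> {word: probability}.
q = {
    ("*", "*"):       {"the": 1.0},
    ("*", "the"):     {"dog": 0.6, "cat": 0.4},
    ("the", "dog"):   {"barks": 0.5, "STOP": 0.5},
    ("the", "cat"):   {"STOP": 1.0},
    ("dog", "barks"): {"STOP": 1.0},
}

def generate(q, max_len=100):
    u, v = "*", "*"               # step 1: i = 1, two virtual start symbols
    sentence = []
    for _ in range(max_len):
        words = list(q[(u, v)])
        weights = [q[(u, v)][w] for w in words]
        w = random.choices(words, weights=weights)[0]  # step 2: draw xi ~ p(.|xi-2, xi-1)
        if w == "STOP":           # step 3: stop symbol ends the sequence
            return sentence
        sentence.append(w)
        u, v = v, w               # shift the context window
    return sentence

print(generate(q))
```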
Trigram Language Models
Basic concepts
p(x1, x2, x3, …, xn) = ∏ (i = 1 to n) p(xi|xi-2, xi-1)
where x0 = x-1 = * and xn = STOP.
Definition (trigram model): A trigram model consists of a finite set V and parameters p(w|u, v) for every triple u, v, w with w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}. Under the trigram model, the probability of a sentence is p(x1, x2, …, xn) = ∏ (i = 1 to n) p(xi|xi-2, xi-1), where x0 and x-1 are both the virtual start symbol *.
Given this model, the remaining problem is how to estimate the parameters p(w|u, v) from the corpus.
Maximum likelihood estimation
For parameter estimation, the simplest approach is maximum likelihood estimation. Specifically, for p(w|u, v):
count the number of occurrences of the trigram u v w and of the bigram u v in the corpus, then divide to obtain p(w|u, v).
For example, p(barks|the, dog) = c(the, dog, barks)/c(the, dog).
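A minimal sketch of collecting these counts on an invented toy corpus, padding each sentence with the two virtual start symbols and STOP as defined above:

```python
from collections import Counter

# Hypothetical toy corpus (assumed data for illustration).
sentences = [
    ["the", "dog", "barks"],
    ["the", "dog", "barks"],
    ["the", "dog", "runs"],
]

tri, bi = Counter(), Counter()
for s in sentences:
    padded = ["*", "*"] + s + ["STOP"]
    for i in range(2, len(padded)):
        tri[tuple(padded[i-2:i+1])] += 1   # c(u, v, w)
        bi[tuple(padded[i-2:i])] += 1      # c(u, v)

def q_ml(w, u, v):
    """Maximum likelihood estimate: p(w|u, v) = c(u, v, w) / c(u, v)."""
    return tri[(u, v, w)] / bi[(u, v)]

print(q_ml("barks", "the", "dog"))   # c(the,dog,barks)/c(the,dog) = 2/3
```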
But this simple estimate has two problems:
- If the numerator c(u, v, w) is zero, the probability is estimated as 0, i.e. the trigram never appears in the corpus. This is often because the corpus is not large enough and the data is sparse.
- If the denominator c(u, v) is zero, the estimate is undefined.
Evaluation of the model
How can the quality of a model be evaluated?
One method is to take a test set of m sentences s1, s2, …, sm. For each sentence si, compute the probability the current model assigns to it, and take the product over all sentences:
∏ (i = 1 to m) p(si)
The larger this value, the better the model.
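In practice this product over many sentences underflows floating point, so it is normally computed as a sum of logarithms. A sketch, where sentence_prob is a hypothetical stand-in for whatever probability any model assigns to a sentence:

```python
import math

def sentence_prob(sentence):
    # Hypothetical stand-in model: assigns 0.25 per word. A real trigram
    # model would use the product of p(xi|xi-2, xi-1) terms instead.
    return 0.25 ** len(sentence)

# Assumed held-out test set.
test_set = [["the", "dog", "barks"], ["the", "cat"]]

# Product over sentences, computed in log space to avoid underflow.
log_likelihood = sum(math.log(sentence_prob(s)) for s in test_set)
print(log_likelihood)   # higher (closer to 0) is better
```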
Smoothing of parameter estimation
Earlier we mentioned the parameter-estimation problems caused by sparse data. Two remedies are discussed here: the first is linear interpolation, the second is discounting methods.
Linear interpolation
Define the trigram, bigram, and unigram maximum likelihood estimates as:
p(w|u, v) = c(u, v, w)/c(u, v)
p(w|v) = c(v, w)/c(v)
p(w) = c(w)/c(), where c() is the total number of word tokens in the corpus.
Linear interpolation combines the three estimates by defining
p_interp(w|u, v) = λ1 * p(w|u, v) + λ2 * p(w|v) + λ3 * p(w)
where λ1 ≥ 0, λ2 ≥ 0, λ3 ≥ 0 and
λ1 + λ2 + λ3 = 1.
How should λ1, λ2, λ3 be chosen? One way is to hold out a validation set and pick the λ1, λ2, λ3 that maximize the probability of the validation set. Another method sets λ1 = c(u, v)/(c(u, v) + t),
λ2 = (1 - λ1) * c(v)/(c(v) + t), λ3 = 1 - λ1 - λ2, where t is a free parameter. The value of t can likewise be chosen to maximize the probability of the validation set.
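A minimal sketch of the interpolated estimate, with invented toy values for the three ML estimates and hand-picked weights (both are assumptions, not values from the text):

```python
# Hypothetical ML estimates (sparse dicts; missing keys estimate to 0).
q_tri = {("the", "dog", "barks"): 0.6}   # p(w|u, v)
q_bi  = {("dog", "barks"): 0.4}          # p(w|v)
q_uni = {"barks": 0.1}                   # p(w)

# Assumed interpolation weights: non-negative and summing to 1.
l1, l2, l3 = 0.5, 0.3, 0.2
assert abs(l1 + l2 + l3 - 1.0) < 1e-9

def q_interp(w, u, v):
    """p_interp(w|u, v) = l1*p(w|u, v) + l2*p(w|v) + l3*p(w)."""
    return (l1 * q_tri.get((u, v, w), 0.0)
            + l2 * q_bi.get((v, w), 0.0)
            + l3 * q_uni.get(w, 0.0))

print(q_interp("barks", "the", "dog"))   # 0.5*0.6 + 0.3*0.4 + 0.2*0.1 = 0.44
```

Even when the trigram count is zero, the bigram and unigram terms keep the interpolated estimate nonzero, which is exactly the point of the method.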
Discounting methods
Consider a bigram model, i.e. estimating the parameters p(w|v).
First define discounted counts: for any bigram with c(v, w) > 0, define
c*(v, w) = c(v, w) - r
where r is a number between 0 and 1.
Then for seen bigrams, p(w|v) = c*(v, w)/c(v). In effect, an amount r is subtracted from every nonzero count c(v, w), and the probability mass freed in this way can be assigned to the bigrams with c(v, w) = 0, preventing zero probabilities.
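A sketch of the discounted bigram estimate on invented counts, with r = 0.5 chosen arbitrarily; it also computes the probability mass freed for the unseen bigrams:

```python
from collections import Counter

# Hypothetical counts (assumed data for illustration).
bigram_counts = Counter({("the", "dog"): 3, ("the", "cat"): 1})
unigram_counts = Counter({"the": 4})
r = 0.5   # discount, an assumed value in (0, 1)

def q_discounted(w, v):
    """p(w|v) = c*(v, w)/c(v) for seen bigrams, with c*(v, w) = c(v, w) - r."""
    c = bigram_counts.get((v, w), 0)
    if c > 0:
        return (c - r) / unigram_counts[v]
    return None   # the freed mass is redistributed over unseen words

# Probability mass freed by discounting, available for unseen bigrams after "the":
alpha = 1.0 - sum(q_discounted(w, "the") for (v, w) in bigram_counts if v == "the")
print(q_discounted("dog", "the"))   # (3 - 0.5)/4 = 0.625
print(alpha)                        # 2 * 0.5 / 4 = 0.25
```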