Language model
2022-07-05 12:39:00 【NLP journey】
Whenever the word "model" comes up it tends to feel abstract, but thinking of a model as a function or mapping makes it much more intuitive. For a language model, the input is a sentence and the output is the probability of that sentence. This mapping from input to output can be regarded as the model. That is my own loose, personal understanding.
This post covers the following topics (the content comes from Collins' handouts for his Coursera online course):
- Introduction
- Markov model
- Trigram language models
- Smoothing
- Other topics
Introduction
First, suppose we are given a training corpus consisting of a number of sentences. Define a vocabulary V containing all the words that appear in the corpus. For example, V might look like this:
V={the,dog, laughs, saw, barks, cat,…}
In practice V can be very large, but we treat it as a finite set. A sentence can then be written as
x1 x2 x3 … xn
where xn is a special stop symbol STOP and x1, …, xn-1 belong to V. For example:
the dog barks STOP
the cat saw the dog STOP
the STOP
cat cat cat STOP
STOP
…
Let V+ denote the set of all sentences built from words in V (each ending with STOP). Because sentences can have arbitrary length, V+ is an infinite set.
Definition (language model): a language model consists of a finite set V and a function p(x1, x2, …, xn) satisfying:
1. For any <x1, x2, ..., xn> ∈ V+, p(x1, x2, ..., xn) ≥ 0
2. Σ_{<x1, ..., xn> ∈ V+} p(x1, x2, ..., xn) = 1
Hence p(x1, x2, …, xn) is a probability distribution over V+.
So how do we assign a probability value to each sentence under this distribution? Define c(x1, x2, …, xn) as the number of times the sentence x1 x2 … xn appears in the corpus, and N as the total number of sentences in the corpus. An obvious estimate is p(x1, x2, …, xn) = c(x1, …, xn) / N. But this simple method assigns probability 0 to any sentence that does not appear in the corpus, even when every word in that sentence belongs to V.
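A minimal sketch of this count-based estimate and its zero-probability problem (the toy corpus and the function names are my own, for illustration only):

```python
from collections import Counter

# Toy corpus in the format used above: each sentence is a tuple of words ending in STOP.
corpus = [
    ("the", "dog", "barks", "STOP"),
    ("the", "cat", "saw", "the", "dog", "STOP"),
    ("the", "STOP"),
]

sentence_counts = Counter(corpus)   # c(x1, ..., xn): how often each full sentence occurs
N = len(corpus)                     # number of sentences in the corpus

def naive_sentence_prob(sentence):
    """p(x1, ..., xn) = c(x1, ..., xn) / N."""
    return sentence_counts[tuple(sentence)] / N

print(naive_sentence_prob(("the", "dog", "barks", "STOP")))   # 1/3: seen once
print(naive_sentence_prob(("the", "dog", "laughs", "STOP")))  # 0.0: never seen, although every word is in V
```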
Markov model
Markov models for fixed-length sequences
A Markov model is essentially a conditional independence assumption. By the chain rule, p(x1, x2, …, xn) can be written as
p(x1, x2, …, xn) = p(x1) · p(x2|x1) · p(x3|x1,x2) · p(x4|x1,x2,x3) · … · p(xn|x1,x2,…,xn-1)
The first-order Markov assumption says that each word depends only on the immediately preceding word, e.g. p(x3|x1,x2) = p(x3|x2). Then:
p(x1, x2, …, xn) = p(x1) · p(x2|x1) · p(x3|x2) · … · p(xn|xn-1)
The second-order Markov assumption says that each word depends only on the two preceding words. That is:
p(x1, x2, …, xn) = p(x1) · p(x2|x1) · p(x3|x1,x2) · p(x4|x2,x3) · … · p(xn|xn-2,xn-1)
And so on
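For example, under the second-order assumption the example sentence "the dog barks STOP" from above factorizes as
p(the dog barks STOP) = p(the) · p(dog | the) · p(barks | the, dog) · p(STOP | dog, barks).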
Markov models for variable-length sequences
The variable-length Markov model describes how a sequence is generated (a code sketch follows the steps below):
1. Initialize i = 1, and set x0 = x-1 = * (x0 and x-1 can be regarded as two dummy symbols).
2. Generate xi from the distribution p(xi | xi-2, xi-1).
3. If xi = STOP, return the sequence x1 … xi. Otherwise set i = i + 1 and repeat step 2.
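A small sketch of this generation procedure; the helper sample_next is a hypothetical stand-in for drawing from p(xi | xi-2, xi-1) and simply samples uniformly from a toy vocabulary here:

```python
import random

def sample_next(u, v):
    """Hypothetical stand-in for drawing a word from p(. | u, v).
    Here it ignores the context and picks uniformly from a toy vocabulary."""
    return random.choice(["the", "dog", "cat", "barks", "STOP"])

def generate_sentence():
    # Step 1: i = 1, with the two dummy symbols x0 = x-1 = '*'
    u, v = "*", "*"
    sentence = []
    while True:
        # Step 2: generate xi from p(xi | xi-2, xi-1)
        w = sample_next(u, v)
        sentence.append(w)
        # Step 3: if xi = STOP return x1 ... xi, otherwise shift the context and repeat step 2
        if w == "STOP":
            return sentence
        u, v = v, w

print(generate_sentence())
```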
Trigram language models
Basic concepts
p(x1, x2, x3, …, xn) = ∏_{i=1}^{n} p(xi | xi-2, xi-1)
where x0 = x-1 = * and xn = STOP.
Definition (trigram model): a trigram model consists of a finite set V and a parameter p(w | u, v) for each trigram (u, v, w) such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}. Under the trigram model, the probability of a sentence is p(x1, x2, …, xn) = ∏_{i=1}^{n} p(xi | xi-2, xi-1), where x0 and x-1 are the dummy symbols * and xn = STOP.
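A sketch of scoring a sentence under this definition; p_trigram stands for whatever parameter table p(w | u, v) the model uses and is only an assumption here:

```python
def sentence_prob(sentence, p_trigram):
    """p(x1, ..., xn) = product over i of p(xi | xi-2, xi-1), with x0 = x-1 = '*'.
    `sentence` is a list of words ending with 'STOP'; p_trigram(w, u, v) returns p(w | u, v)."""
    prob = 1.0
    u, v = "*", "*"
    for w in sentence:
        prob *= p_trigram(w, u, v)
        u, v = v, w
    return prob

# Example with a hypothetical uniform parameter table over 5 words plus STOP:
uniform = lambda w, u, v: 1.0 / 6
print(sentence_prob(["the", "dog", "barks", "STOP"], uniform))  # (1/6) ** 4
```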
Given this model, the problem is now how to estimate its parameters p(w | u, v) from the corpus.
Maximum likelihood estimation
The simplest approach to parameter estimation is maximum likelihood. Specifically, for p(w | u, v): count the number of times the trigram (u, v, w) and the bigram (u, v) occur in the corpus, then divide, giving p(w | u, v) = c(u, v, w) / c(u, v).
For example, p(barks | the, dog) = c(the, dog, barks) / c(the, dog).
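A minimal sketch of this maximum-likelihood estimate from trigram and bigram counts (the toy corpus and function names are assumptions of mine):

```python
from collections import Counter

corpus = [
    ["the", "dog", "barks", "STOP"],
    ["the", "cat", "saw", "the", "dog", "STOP"],
]

trigram_counts = Counter()  # c(u, v, w)
bigram_counts = Counter()   # c(u, v)
for sentence in corpus:
    padded = ["*", "*"] + sentence
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        trigram_counts[(u, v, w)] += 1
        bigram_counts[(u, v)] += 1

def p_ml(w, u, v):
    """p(w | u, v) = c(u, v, w) / c(u, v); undefined (None here) when c(u, v) = 0."""
    if bigram_counts[(u, v)] == 0:
        return None
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

print(p_ml("barks", "the", "dog"))  # c(the, dog, barks) / c(the, dog) = 1/2
```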
But this simple estimator has two problems:
- If the numerator c(u, v, w) is zero, the estimated probability is 0, i.e. the corresponding trigram never appears in the corpus. This may simply mean that the corpus is not large enough and the data is sparse.
- If the denominator c(u, v) is zero, the estimate is undefined (division by zero).
Evaluation of the model
How do we evaluate how good a model is?
One method is to take a test data set of m sentences s1, s2, …, sm. For each sentence si, compute the probability the current model assigns to it, and take the product over all sentences:
∏_{i=1}^{m} p(si)
The larger this value, the better the model.
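A sketch of this evaluation. In practice the product of many small probabilities underflows floating point, so it is usually accumulated as a sum of logarithms; sentence_prob is the scoring function sketched earlier:

```python
import math

def corpus_log_prob(test_sentences, p_trigram):
    """Log of the product over i of p(si): the larger, the better the model fits the test set."""
    total = 0.0
    for sentence in test_sentences:
        prob = sentence_prob(sentence, p_trigram)  # sketched above
        if prob == 0.0:
            return float("-inf")  # a single zero-probability sentence drives the whole product to zero
        total += math.log(prob)
    return total
```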
Smoothing of parameter estimation
We mentioned above the problems that sparse data causes for parameter estimation. Two remedies are discussed here: the first is linear interpolation, the second is discounting methods.
Linear interpolation
Define the maximum likelihood estimates for the trigram, bigram, and unigram models as:
p_ML(w|u,v) = c(u,v,w) / c(u,v)
p_ML(w|v) = c(v,w) / c(v)
p_ML(w) = c(w) / c()
where c() is the total number of word tokens in the corpus.
Linear interpolation combines the three estimates by defining
p(w|u,v) = λ1 · p_ML(w|u,v) + λ2 · p_ML(w|v) + λ3 · p_ML(w)
where λ1 ≥ 0, λ2 ≥ 0, λ3 ≥ 0 and
λ1 + λ2 + λ3 = 1.
How should λ1, λ2, λ3 be chosen? One way is to use a validation data set and pick the λ1, λ2, λ3 that maximize the probability of the validation set. Another method is to set λ1 = c(u,v) / (c(u,v) + t),
λ2 = (1 - λ1) · c(v) / (c(v) + t), λ3 = 1 - λ1 - λ2, where t is a parameter to be tuned; the value of t can again be chosen to maximize the probability of the validation set.
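A sketch of both choices, assuming the maximum-likelihood estimators p_tri, p_bi, p_uni defined above are passed in as functions (these names are mine):

```python
def p_interpolated(w, u, v, p_tri, p_bi, p_uni, lam=(0.5, 0.3, 0.2)):
    """Linearly interpolated estimate:
    lam1 * p_ML(w | u, v) + lam2 * p_ML(w | v) + lam3 * p_ML(w),
    with lam1 + lam2 + lam3 = 1 and every weight nonnegative."""
    lam1, lam2, lam3 = lam
    return lam1 * p_tri(w, u, v) + lam2 * p_bi(w, v) + lam3 * p_uni(w)

def count_dependent_weights(c_uv, c_v, t):
    """The second choice of weights: they grow with the context counts,
    and t is the parameter tuned on the validation set."""
    lam1 = c_uv / (c_uv + t)
    lam2 = (1 - lam1) * c_v / (c_v + t)
    lam3 = 1 - lam1 - lam2
    return lam1, lam2, lam3
```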
Discounting methods
Consider a bigram model, i.e. estimating the parameters p(w|v).
First define discounted counts: for any bigram with c(v, w) > 0, let
c*(v, w) = c(v, w) - r,
where r is a number between 0 and 1.
Then define p(w|v) = c*(v, w) / c(v). This subtracts r from every nonzero count; the probability mass freed up in this way can then be assigned to the words w with c(v, w) = 0, so that their probabilities are no longer zero.
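A sketch of the discounted bigram estimate; the choice r = 0.5 and the dictionary layout are assumptions for illustration:

```python
from collections import Counter

def discounted_bigram_probs(bigram_counts, unigram_counts, r=0.5):
    """For every seen bigram (v, w): c*(v, w) = c(v, w) - r and p(w | v) = c*(v, w) / c(v).
    Returns the discounted probabilities plus, for each v, the probability mass
    freed up by the discounting, which can be redistributed to the unseen words after v."""
    probs = {}
    freed_mass = Counter()
    for (v, w), count in bigram_counts.items():
        if count > 0:
            probs[(v, w)] = (count - r) / unigram_counts[v]
            freed_mass[v] += r / unigram_counts[v]
    return probs, freed_mass
```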