Language model
2022-07-05 12:39:00 【NLP journey】
Whenever the word "model" comes up, it can feel very abstract, but if a model is understood as a function or mapping, it becomes much more intuitive. For a language model, the input is a sentence and the output is the probability of that sentence. This mapping from input to output can be regarded as the model. This is a loose personal understanding.
The content of this post is as follows (the material comes from Collins' handouts for his online Coursera course):
- Introduction
- Markov model
- Trigram language models
- Smoothing
- Other topics
Introduction
First, we are given a corpus containing a number of sentences. Define a vocabulary V containing all the words that appear in the corpus. For example, V might look like this:
V = {the, dog, laughs, saw, barks, cat, …}
In practice V can be very large, but we treat it as a finite set. A sentence can then be written as
x1 x2 x3 … xn
where xn is the stop symbol STOP and x1 … xn−1 belong to V. For example:
the dog barks STOP
the cat saw the dog STOP
the STOP
cat cat cat STOP
STOP
…
Let V+ denote the set of all sentences that can be built from the words in V. Since sentences can be of any length, V+ is an infinite set.
Definition (language model): A language model consists of a finite set V and a function p(x1, x2, …, xn) satisfying:
1. For any ⟨x1, x2, …, xn⟩ ∈ V+, p(x1, x2, …, xn) ≥ 0
2. ∑ p(x1, x2, …, xn) = 1, where the sum ranges over all ⟨x1, x2, …, xn⟩ ∈ V+
Hence p(x1, x2, …, xn) is a probability distribution over V+.
So how do we compute the probability that this distribution assigns to each sentence? Define c(x1, x2, …, xn) as the number of times the sentence x1 x2 … xn occurs in the corpus, and N as the total number of sentences in the corpus. An obvious estimate is p(x1, x2, …, xn) = c(x1, …, xn)/N. But this simple method assigns probability 0 to any sentence that never appears in the corpus, even when every word in that sentence belongs to V.
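Below is a minimal sketch of this naive estimate and its zero-probability problem, using a tiny hypothetical corpus (the sentences are illustrative only):

```python
from collections import Counter

# Toy corpus of whole sentences; each string stands for one sentence.
corpus = [
    "the dog barks STOP",
    "the cat saw the dog STOP",
    "the dog barks STOP",
]
N = len(corpus)           # total number of sentences in the corpus
counts = Counter(corpus)  # c(x1...xn): occurrences of each whole sentence

def naive_prob(sentence):
    # p(x1...xn) = c(x1...xn) / N
    return counts[sentence] / N

print(naive_prob("the dog barks STOP"))  # 2/3: seen in the corpus
print(naive_prob("the cat barks STOP"))  # 0.0: unseen, though every word is in V
```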
Markov model
Markov models for fixed-length sequences
A Markov model rests on a conditional independence assumption. When computing p(x1, x2, …, xn), the chain rule gives
p(x1, x2, …, xn) = p(x1) · p(x2|x1) · p(x3|x1, x2) · p(x4|x1, x2, x3) ⋯ p(xn|x1, x2, …, xn−1)
The first-order Markov assumption is that p(x3|x1, x2) = p(x3|x2), and so on: each conditional probability above depends only on the immediately preceding element. Then:
p(x1, x2, …, xn) = p(x1) · p(x2|x1) · p(x3|x2) ⋯ p(xn|xn−1)
The second-order Markov assumption is that each conditional probability depends only on the two preceding elements, i.e.
p(x1, x2, …, xn) = p(x1) · p(x2|x1) · p(x3|x1, x2) · p(x4|x2, x3) ⋯ p(xn|xn−2, xn−1)
and so on for higher orders.
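As a small illustration, here is a sketch of the second-order factorization; `cond_prob` is a hypothetical lookup, not part of the original notes, returning p(w | context) for a context tuple of zero, one, or two preceding words:

```python
# A minimal sketch of the second-order Markov factorization above: each word
# is conditioned on at most the two preceding words.
def second_order_prob(words, cond_prob):
    prob = 1.0
    for i, w in enumerate(words):
        # Context is (), (x1,), or (x_{i-2}, x_{i-1}), matching the terms
        # p(x1), p(x2|x1), p(x3|x1,x2), p(x4|x2,x3), ...
        context = tuple(words[max(0, i - 2):i])
        prob *= cond_prob(w, context)
    return prob
```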
Markov models for variable-length sequences
A variable-length Markov model describes a generative process for sequences (see the sketch after these steps):
1. Initialize i = 1 and set x0 = x−1 = * (x0 and x−1 can be regarded as two virtual start symbols).
2. Generate xi according to the probability p(xi | xi−2, xi−1).
3. If xi = STOP, return the sequence x1 … xi. Otherwise, set i = i + 1 and repeat step 2.
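A minimal sketch of this generation loop, assuming a hypothetical `sample_next` function that draws xi from p(· | xi−2, xi−1):

```python
# Sketch of the variable-length generation process above.
def generate(sample_next):
    x_prev2, x_prev1 = "*", "*"  # the two virtual start symbols x0 = x-1 = *
    sequence = []
    while True:
        x = sample_next(x_prev2, x_prev1)  # step 2: draw x_i given the context
        if x == "STOP":                    # step 3: STOP ends the sequence
            return sequence
        sequence.append(x)
        x_prev2, x_prev1 = x_prev1, x      # shift the context and repeat step 2
```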
Trigram language models
Basic concepts
p(x1, x2, x3, …, xn) = ∏_{i=1}^{n} p(xi | xi−2, xi−1)
where x0 = x−1 = * and xn = STOP.
Definition (trigram model): A trigram model consists of a finite set V and parameters p(w|u,v) for each trigram (u, v, w) with w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}. Under the trigram model, the probability of a sentence is p(x1, x2, …, xn) = ∏_{i=1}^{n} p(xi | xi−2, xi−1), where x0 and x−1 are the pseudo start symbols *.
Given this model, the problem is now how to estimate its parameters p(w|u,v) from the corpus.
Maximum likelihood estimation
For parameter estimation, the simplest approach is of course maximum likelihood estimation. Concretely, for p(w|u,v): count the occurrences of u v w and of u v in the corpus, then divide to obtain p(w|u,v) = c(u,v,w)/c(u,v).
For example, p(barks|the, dog) = c(the, dog, barks)/c(the, dog).
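A minimal sketch of this estimate from counts, on a toy two-sentence corpus (the sentences are illustrative only):

```python
from collections import Counter

sentences = [["the", "dog", "barks"], ["the", "cat", "saw", "the", "dog"]]

# Collect trigram counts c(u,v,w) and context counts c(u,v), padding each
# sentence with the two start symbols * and the STOP symbol.
trigram_counts, bigram_counts = Counter(), Counter()
for s in sentences:
    words = ["*", "*"] + s + ["STOP"]
    for i in range(2, len(words)):
        trigram_counts[(words[i - 2], words[i - 1], words[i])] += 1
        bigram_counts[(words[i - 2], words[i - 1])] += 1

def p_ml(w, u, v):
    if bigram_counts[(u, v)] == 0:
        return None  # denominator zero: the estimate is undefined
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

print(p_ml("barks", "the", "dog"))  # c(the,dog,barks) / c(the,dog) = 1/2
```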
But this simple parameter estimation method has the following two problems:
- If the numerator c(u,v,w) is zero, the probability is estimated as 0, meaning the trigram never appears in the corpus. This may simply be because the corpus is not large enough and the data is sparse.
- If the denominator c(u,v) is zero, the probability cannot be computed at all.
Evaluation of the model
Given a model, how do we evaluate how good it is?
One method is to take a test data set containing m sentences s1, s2, …, sm. For each sentence si, compute the probability the current model assigns to it, then take the product over all sentences:
∏_{i=1}^{m} p(si)
The larger this value, the better the model.
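In floating point this product quickly underflows, so the equivalent sum of log probabilities is typically computed instead; a minimal sketch, assuming a hypothetical `model_prob` function that returns p(si) under the model being evaluated:

```python
import math

# Sum of log probabilities over the test set; maximizing this is equivalent
# to maximizing the product of the sentence probabilities.
def test_set_log_prob(test_sentences, model_prob):
    return sum(math.log(model_prob(s)) for s in test_sentences)
```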
Smoothing of parameter estimation
Previously we mentioned the parameter estimation problems caused by sparse data. Two methods for addressing them are discussed here: the first is linear interpolation, the second is discounting.
Linear interpolation
Define the trigram, bigram, and unigram maximum likelihood estimates as:
p(w|u,v) = c(u,v,w)/c(u,v)
p(w|v) = c(v,w)/c(v)
p(w) = c(w)/c()
where c() denotes the total number of word tokens in the corpus.
Linear interpolation uses all three estimates, defining the smoothed estimate
p'(w|u,v) = λ1 · p(w|u,v) + λ2 · p(w|v) + λ3 · p(w)
where λ1, λ2, λ3 ≥ 0 and λ1 + λ2 + λ3 = 1.
How should λ1, λ2, λ3 be chosen? One way is to use a validation data set: choose the λ1, λ2, λ3 that maximize the probability of the validation set. Another way is to set λ1 = c(u,v)/(c(u,v) + t), λ2 = (1 − λ1) · c(v)/(c(v) + t), λ3 = 1 − λ1 − λ2, where t is a parameter to be determined; the value of t can likewise be chosen to maximize the probability of the validation set.
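A minimal sketch of the interpolated estimate with fixed weights; the three maximum likelihood tables are stubbed with illustrative values rather than computed from a corpus:

```python
# Stubbed ML estimates; in practice these come from corpus counts as in the
# earlier sketch. All values here are illustrative only.
trigram_ml = {("the", "dog", "barks"): 0.5}
bigram_ml = {("dog", "barks"): 0.4}
unigram_ml = {"barks": 0.1}

def p_interp(w, u, v, lam=(0.6, 0.3, 0.1)):  # weights sum to 1, all >= 0
    lam1, lam2, lam3 = lam
    return (lam1 * trigram_ml.get((u, v, w), 0.0)
            + lam2 * bigram_ml.get((v, w), 0.0)
            + lam3 * unigram_ml.get(w, 0.0))

print(p_interp("barks", "the", "dog"))  # 0.6*0.5 + 0.3*0.4 + 0.1*0.1 = 0.43
```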
Discounting methods
Consider a bigram model, i.e. estimating the parameters p(w|v).
First define discounted counts: for any c(v,w) with c(v,w) > 0, define
c∗(v,w) = c(v,w) − r
where r is a number between 0 and 1.
Then for any seen bigram, p(w|v) = c∗(v,w)/c(v). This subtracts r from every nonzero count c, freeing up some probability mass; that mass can then be assigned to the words w with c(v,w) equal to 0, preventing zero probabilities.
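A minimal sketch with toy counts; here the freed-up mass is shared uniformly over the unseen words for simplicity (Katz back-off, by contrast, redistributes it in proportion to a lower-order estimate):

```python
from collections import Counter

r = 0.5  # discount taken from every nonzero count
bigram_counts = Counter({("the", "dog"): 3, ("the", "cat"): 1})  # toy c(v,w)
context_counts = Counter({"the": 4})                             # toy c(v)
vocabulary = {"dog", "cat", "saw", "barks"}

def p_discounted(w, v):
    seen = {u for (ctx, u) in bigram_counts if ctx == v}
    if w in seen:
        # Seen bigram: use the discounted count c*(v,w) = c(v,w) - r
        return (bigram_counts[(v, w)] - r) / context_counts[v]
    # Unseen bigram: share the missing mass (r per seen word) uniformly
    missing_mass = r * len(seen) / context_counts[v]
    return missing_mass / (len(vocabulary) - len(seen))

print(p_discounted("dog", "the"))  # (3 - 0.5) / 4 = 0.625
print(p_discounted("saw", "the"))  # 0.25 / 2 = 0.125
```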