Introduction to ELMo, BERT and GPT
Contextualized Word Embedding#
The same word can have different meanings. For example, the sentences below all contain "bank", but it means something different in each of them. Yet the vector for "bank" obtained by training Word2Vec is always the same, as if the word always carried the same meaning, which is not the case. This is a defect of Word2Vec.
In the sentences below, each occurrence of "bank" is a different token, but they all belong to the same type.
We would like every word token to have its own embedding, and each token's embedding to depend on its context. This approach is called Contextualized Word Embedding.
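To make the defect concrete, here is a minimal sketch (with made-up toy vectors, not real Word2Vec weights) showing that a static embedding table returns one and the same vector for "bank" no matter which sentence it appears in:

```python
# Toy illustration: a static (non-contextual) embedding is a plain lookup table,
# so "bank" maps to the same vector regardless of its context.
import numpy as np

static_table = {"bank": np.array([0.2, -0.7, 0.5])}   # one vector per word type

sent_a = "I deposited money in the bank".split()
sent_b = "We sat on the river bank".split()

vec_a = static_table["bank"]   # the lookup ignores the surrounding words,
vec_b = static_table["bank"]   # so both occurrences get identical vectors
print(np.array_equal(vec_a, vec_b))   # True
```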
ELMo#
ELMo is short for Embeddings from Language Models. It is an RNN-based model, and all it needs for training is a large collection of sentences.
We train the RNN and take the hidden-layer outputs: the vector a word produces after passing through the hidden layer is used as that word's embedding. Because an RNN is contextual, the same word gets different vectors in different contexts. What is described above is a forward RNN; if it does not consider enough information, we can train a bidirectional RNN and likewise use its hidden-layer outputs as embeddings.
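A minimal sketch of this idea in PyTorch (the dimensions and random token ids are made up; ELMo itself uses character-CNN inputs and specific LSTM sizes):

```python
# Contextual embeddings from a bidirectional RNN: each token's hidden state
# depends on the words around it, so identical word ids in different contexts
# produce different vectors.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
birnn = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                bidirectional=True, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 5))   # one sentence, 5 tokens
hidden_states, _ = birnn(embed(token_ids))         # shape (1, 5, 2 * hidden_dim)
# hidden_states[0, t] is the contextual embedding of token t.
```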
If our RNN has many layers, the output of which hidden layer should we take as the embedding?
In ELMo, the vectors from every layer are taken out, and the embedding of each word is computed from all of them.
For example, suppose as in the figure above that we have 2 layers; every word then gets 2 vectors. The simplest approach is just to add the two vectors together as the word's embedding.
ELMo takes out both vectors, multiplies them by different weights α1 and α2, and uses the resulting embedding for downstream tasks.
The weights α1 and α2 are also learned by the model; they are trained together with the downstream task, so the α values used by different tasks are different.
So our embedding can come from 3 sources, as shown in the figure above, namely:
- the embedding that has not been contextualized, i.e. the token itself
- the first embedding, taken after the token passes through the first layer
- the second embedding, taken after the token passes through the second layer
The depth of the color represents the size of the weight; you can see that different tasks (SRL, Coref, etc.) assign different weights to the layers.
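A minimal sketch of this layer mixing (the dimensions are made up; the softmax over the α weights and the scale γ follow the formulation in the ELMo paper):

```python
# ELMo-style combination: the final embedding of every token is a task-specific
# weighted sum of its per-layer vectors; the weights are learned jointly with
# the downstream task.
import torch

num_layers, seq_len, dim = 3, 5, 128              # token layer + 2 RNN layers
layer_vectors = torch.randn(num_layers, seq_len, dim)

alpha = torch.nn.Parameter(torch.zeros(num_layers))   # one weight per layer
gamma = torch.nn.Parameter(torch.ones(1))              # overall task-specific scale

weights = torch.softmax(alpha, dim=0)
elmo_embedding = gamma * (weights[:, None, None] * layer_vectors).sum(dim=0)
# elmo_embedding has shape (seq_len, dim) and is fed to the downstream model.
```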
BERT#
BERT is short for Bidirectional Encoder Representations from Transformers. BERT is the Encoder of the Transformer; it is built by stacking many Encoder layers.
Paper: https://arxiv.org/pdf/1810.04805.pdf
Recommended article: https://cloud.tencent.com/developer/article/1389555
In BERT, the text does not need labels; collecting a large pile of sentences is enough to train it.
BERT is an Encoder, so it can be viewed as taking a sentence as input and outputting embeddings, one embedding per word.
In the example above the unit is the word; sometimes using the character as the unit works better. For Chinese, for instance, the number of words is huge, while the set of commonly used characters is limited.
In BERT there are two training approaches: one is Masked LM and the other is Next Sentence Prediction. The two are used at the same time, which gives better results. The figure below shows BERT's overall structure; the BERT model in the pre-training phase and in the fine-tuning phase differ only in the output layer, and every other part is exactly the same.
Masked LM#
In Masked LM, we randomly select 15% of the tokens and replace them with a special token called [MASK].
In fact, not all of the selected tokens are replaced with [MASK]; Section 3.1 (Pre-training BERT) of the BERT paper describes 3 strategies:
(1) 80% are replaced with [MASK]
(2) 10% are replaced with a random token
(3) 10% are left unchanged
BERT's task is then to guess what the replaced words originally were.
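A rough sketch of this corruption step (the [MASK] id, vocabulary size and the -100 ignore label are illustrative conventions, not taken from the original post):

```python
# Masked LM corruption: pick ~15% of positions; of those, 80% become [MASK],
# 10% become a random token, 10% stay unchanged. The model must recover the
# original token at every picked position.
import random

MASK_ID, VOCAB_SIZE = 103, 30522   # made-up ids for illustration

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = not predicted
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                               # target to recover
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                       # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the original token
    return inputs, labels
```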
It is like a cloze game: a word is removed from a sentence and you have to fill in the right word yourself.
After passing through BERT we obtain an embedding for each position; the embedding output at the position that was replaced with [MASK] goes through a linear classifier, which predicts what the word is.
Because this classifier is linear, its capacity is very, very weak, so BERT is forced to output a very good embedding in order for the classifier to predict what the replaced word was.
If two different words can both fill the same slot in a sentence, they will end up with similar embeddings, because their semantics are similar.
Next Sentence Prediction#
In Next Sentence Prediction, we give BERT two sentences and let it predict whether the two sentences follow each other.
[SEP]: a special token marking the boundary between the two sentences.
[CLS]: a special token whose output is used for classification.
We pass the output vector at the [CLS] position through a linear classifier and let it judge whether the two sentences should follow each other.
BERT is the Transformer's Encoder and uses self-attention, so it can read the information of the whole sentence; that is why [CLS] can be placed at the very beginning.
The BERT paper describes the input as A-B sentence pairs: with 50% probability B is the real next sentence (IsNext), and with 50% probability it is replaced by some other sentence (NotNext).
We can also feed this vector directly into a classifier to determine the category of a text; below is an example of spam detection.
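A minimal sketch of the [CLS] idea using the Hugging Face transformers library (assumed to be installed; the linear head here is untrained and purely illustrative):

```python
# Sentence-pair input is packed as "[CLS] A [SEP] B [SEP]"; the output vector
# at the [CLS] position is fed to a small linear classifier.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("He walked into the bank.", "He withdrew some cash.",
                return_tensors="pt")
cls_vector = bert(**enc).last_hidden_state[:, 0]           # vector at [CLS]

classifier = torch.nn.Linear(bert.config.hidden_size, 2)   # e.g. IsNext / NotNext
logits = classifier(cls_vector)
```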
The Input of BERT#
As shown in the figure above, BERT's input embedding is the sum of token embeddings, segment embeddings and position embeddings:
token embedding: the word vector
segment embedding: indicates whether the token belongs to the first or the second sentence
position embedding: indicates the position of the token in the sentence
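A minimal sketch of how the three are combined (the vocabulary size, hidden size and token ids are made up for illustration):

```python
# BERT's input representation: element-wise sum of three learned embeddings.
import torch
import torch.nn as nn

vocab_size, hidden, max_len = 30522, 768, 512
token_emb    = nn.Embedding(vocab_size, hidden)
segment_emb  = nn.Embedding(2, hidden)         # sentence A = 0, sentence B = 1
position_emb = nn.Embedding(max_len, hidden)   # learned position vectors

token_ids   = torch.tensor([[101, 2023, 2003, 102, 2008, 102]])  # made-up ids
segment_ids = torch.tensor([[0,   0,    0,    0,   1,    1]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

input_embedding = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(input_embedding.shape)   # torch.Size([1, 6, 768])
```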
ERNIE#
ERNIE is short for Enhanced Representation through Knowledge Integration.
ERNIE is designed specifically for Chinese. BERT's input for Chinese is individual characters, and a character that has been masked at random is actually easy to guess from the characters around it, as the figure above shows. Masking out a whole word at a time is therefore more appropriate.
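A rough sketch of whole-word masking (the sentence and the hand-made word boundaries are just an example, not ERNIE's actual segmentation pipeline):

```python
# Whole-word masking: instead of masking a single character, mask every
# character belonging to one word, so the word cannot be guessed trivially.
import random

sentence = list("哈尔滨是黑龙江的省会")                  # character-level tokens
word_spans = [(0, 3), (3, 4), (4, 7), (7, 8), (8, 10)]  # hand-made word boundaries

start, end = random.choice(word_spans)                   # pick one whole word
masked = ["[MASK]" if start <= i < end else ch
          for i, ch in enumerate(sentence)]
print("".join(masked))   # e.g. 哈尔滨是[MASK][MASK][MASK]的省会
```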
GPT#
GPT is short for Generative Pre-Training. Its parameter count is particularly large: as the figure below shows, it has roughly 4.5 times as many parameters as BERT.
BERT is the Transformer's Encoder, while GPT is the Transformer's Decoder. GPT takes some words as input and predicts the next word; the computation proceeds as shown in the figure below.
We input the word "潮水" (the tide); after passing through many self-attention layers we get the output "退了" (has gone out). Then "退了" is fed back in as input to predict the next output.
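A minimal sketch of that autoregressive loop (`model` is a stand-in for a trained Transformer decoder such as GPT; greedy decoding is used here only for simplicity):

```python
# Autoregressive generation: feed the tokens produced so far, take the most
# likely next token, append it, and repeat.
import torch

def generate(model, token_ids, steps=10):
    for _ in range(steps):
        logits = model(torch.tensor([token_ids]))   # shape (1, len, vocab_size)
        next_id = int(logits[0, -1].argmax())       # greedy choice of the next token
        token_ids.append(next_id)                   # it becomes part of the input
    return token_ids
```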
GPT can handle NLP tasks such as reading comprehension, sentence or paragraph generation, and translation.
The website below lets you experience the trained GPT.
For example, you can let it write code on its own.
You can also let it write articles, scripts and so on; the address is shown above, so you can experience it for yourself.