Introduction to ELMo, BERT and GPT
Contextualized Word Embedding#
The same word can have different meanings. For example, the sentences below all contain "bank", but it means something different in each of them. Yet the vector for "bank" obtained by training Word2Vec is always the same, as if the word always carried the same meaning, which is not the case. This is a defect of Word2Vec.
In the sentences below, each occurrence of "bank" is a different token, but they all belong to the same type.
We would like every word token to have its own embedding, and each token's embedding to depend on its context. This approach is called Contextualized Word Embedding.
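To make the defect concrete, here is a minimal sketch (with made-up toy vectors, not real Word2Vec weights) showing that a static embedding table returns one and the same vector for "bank" no matter which sentence it appears in:

```python
# Toy illustration: a static (non-contextual) embedding is a plain lookup table,
# so "bank" maps to the same vector regardless of its context.
import numpy as np

static_table = {"bank": np.array([0.2, -0.7, 0.5])}   # one vector per word type

sent_a = "I deposited money in the bank".split()
sent_b = "We sat on the river bank".split()

vec_a = static_table["bank"]   # the lookup ignores the surrounding words,
vec_b = static_table["bank"]   # so both occurrences get identical vectors
print(np.array_equal(vec_a, vec_b))   # True
```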
ELMo#
ELMo is short for Embeddings from Language Models. It is an RNN-based model, and all it needs for training is a large collection of sentences.
We train the RNN and take the hidden-layer outputs: the vector a word produces after passing through the hidden layer is used as that word's embedding. Because an RNN is contextual, the same word gets different vectors in different contexts. What is described above is a forward RNN; if it does not consider enough information, we can train a bidirectional RNN and likewise use its hidden-layer outputs as embeddings.
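A minimal sketch of this idea in PyTorch (the dimensions and random token ids are made up; ELMo itself uses character-CNN inputs and specific LSTM sizes):

```python
# Contextual embeddings from a bidirectional RNN: each token's hidden state
# depends on the words around it, so identical word ids in different contexts
# produce different vectors.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
birnn = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                bidirectional=True, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 5))   # one sentence, 5 tokens
hidden_states, _ = birnn(embed(token_ids))         # shape (1, 5, 2 * hidden_dim)
# hidden_states[0, t] is the contextual embedding of token t.
```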
If our RNN has many layers, the output of which hidden layer should we take as the embedding?
In ELMo, the vectors from every layer are taken out, and the embedding of each word is computed from all of them.
For example, suppose as in the figure above that we have 2 layers; every word then gets 2 vectors. The simplest approach is just to add the two vectors together as the word's embedding.
ELMo takes out both vectors, multiplies them by different weights α1 and α2, and uses the resulting embedding for downstream tasks.
The weights α1 and α2 are also learned by the model; they are trained together with the downstream task, so the α values used by different tasks are different.
So our embedding can come from 3 sources, as shown in the figure above, namely:
- the embedding that has not been contextualized, i.e. the token itself
- the first embedding, taken after the token passes through the first layer
- the second embedding, taken after the token passes through the second layer
The depth of the color represents the size of the weight; you can see that different tasks (SRL, Coref, etc.) assign different weights to the layers.
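A minimal sketch of this layer mixing (the dimensions are made up; the softmax over the α weights and the scale γ follow the formulation in the ELMo paper):

```python
# ELMo-style combination: the final embedding of every token is a task-specific
# weighted sum of its per-layer vectors; the weights are learned jointly with
# the downstream task.
import torch

num_layers, seq_len, dim = 3, 5, 128              # token layer + 2 RNN layers
layer_vectors = torch.randn(num_layers, seq_len, dim)

alpha = torch.nn.Parameter(torch.zeros(num_layers))   # one weight per layer
gamma = torch.nn.Parameter(torch.ones(1))              # overall task-specific scale

weights = torch.softmax(alpha, dim=0)
elmo_embedding = gamma * (weights[:, None, None] * layer_vectors).sum(dim=0)
# elmo_embedding has shape (seq_len, dim) and is fed to the downstream model.
```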
BERT#
BERT is short for Bidirectional Encoder Representations from Transformers. BERT is the Encoder of the Transformer; it is built by stacking many Encoder layers.
Paper: https://arxiv.org/pdf/1810.04805.pdf
Recommended article: https://cloud.tencent.com/developer/article/1389555
In BERT, the text does not need labels; collecting a large pile of sentences is enough to train it.
BERT is an Encoder, so it can be viewed as taking a sentence as input and outputting embeddings, one embedding per word.
In the example above the unit is the word; sometimes using the character as the unit works better. For Chinese, for instance, the number of words is huge, while the set of commonly used characters is limited.
In BERT there are two training approaches: one is Masked LM and the other is Next Sentence Prediction. The two are used at the same time, which gives better results. The figure below shows BERT's overall structure; the BERT model in the pre-training phase and in the fine-tuning phase differ only in the output layer, and every other part is exactly the same.
Masked LM#
In Masked LM, we randomly select 15% of the tokens and replace them with a special token called [MASK].
In fact, not all of the selected tokens are replaced with [MASK]; Section 3.1 (Pre-training BERT) of the BERT paper describes 3 strategies:
(1) 80% are replaced with [MASK]
(2) 10% are replaced with a random token
(3) 10% are left unchanged
BERT's task is then to guess what the replaced words originally were.
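A rough sketch of this corruption step (the [MASK] id, vocabulary size and the -100 ignore label are illustrative conventions, not taken from the original post):

```python
# Masked LM corruption: pick ~15% of positions; of those, 80% become [MASK],
# 10% become a random token, 10% stay unchanged. The model must recover the
# original token at every picked position.
import random

MASK_ID, VOCAB_SIZE = 103, 30522   # made-up ids for illustration

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = not predicted
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                               # target to recover
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                       # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the original token
    return inputs, labels
```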
It is like a cloze game: a word is removed from a sentence and you have to fill in the right word yourself.
After passing through BERT we obtain an embedding for each position; the embedding output at the position that was replaced with [MASK] goes through a linear classifier, which predicts what the word is.
Because this classifier is linear, its capacity is very, very weak, so BERT is forced to output a very good embedding in order for the classifier to predict what the replaced word was.
If two different words can both fill the same slot in a sentence, they will end up with similar embeddings, because their semantics are similar.
Next Sentence Prediction#
In Next Sentence Prediction, we give BERT two sentences and let it predict whether the two sentences follow each other.
[SEP]: a special token marking the boundary between the two sentences.
[CLS]: a special token whose output is used for classification.
We pass the output vector at the [CLS] position through a linear classifier and let it judge whether the two sentences should follow each other.
BERT is the Transformer's Encoder and uses self-attention, so it can read the information of the whole sentence; that is why [CLS] can be placed at the very beginning.
The BERT paper describes the input as A-B sentence pairs: with 50% probability B is the real next sentence (IsNext), and with 50% probability it is replaced by some other sentence (NotNext).
We can also feed this vector directly into a classifier to determine the category of a text; below is an example of spam detection.
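A minimal sketch of the [CLS] idea using the Hugging Face transformers library (assumed to be installed; the linear head here is untrained and purely illustrative):

```python
# Sentence-pair input is packed as "[CLS] A [SEP] B [SEP]"; the output vector
# at the [CLS] position is fed to a small linear classifier.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("He walked into the bank.", "He withdrew some cash.",
                return_tensors="pt")
cls_vector = bert(**enc).last_hidden_state[:, 0]           # vector at [CLS]

classifier = torch.nn.Linear(bert.config.hidden_size, 2)   # e.g. IsNext / NotNext
logits = classifier(cls_vector)
```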
The Input of BERT#
As shown in the figure above, BERT's input embedding is the sum of token embeddings, segment embeddings and position embeddings:
token embedding: the word vector
segment embedding: indicates whether the token belongs to the first or the second sentence
position embedding: indicates the position of the token in the sentence
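A minimal sketch of how the three are combined (the vocabulary size, hidden size and token ids are made up for illustration):

```python
# BERT's input representation: element-wise sum of three learned embeddings.
import torch
import torch.nn as nn

vocab_size, hidden, max_len = 30522, 768, 512
token_emb    = nn.Embedding(vocab_size, hidden)
segment_emb  = nn.Embedding(2, hidden)         # sentence A = 0, sentence B = 1
position_emb = nn.Embedding(max_len, hidden)   # learned position vectors

token_ids   = torch.tensor([[101, 2023, 2003, 102, 2008, 102]])  # made-up ids
segment_ids = torch.tensor([[0,   0,    0,    0,   1,    1]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

input_embedding = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
print(input_embedding.shape)   # torch.Size([1, 6, 768])
```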
ERNIE#
ERNIE is short for Enhanced Representation through Knowledge Integration.
ERNIE is designed specifically for Chinese. BERT's input for Chinese is individual characters, and a character that has been masked at random is actually easy to guess from the characters around it, as the figure above shows. Masking out a whole word at a time is therefore more appropriate.
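A rough sketch of whole-word masking (the sentence and the hand-made word boundaries are just an example, not ERNIE's actual segmentation pipeline):

```python
# Whole-word masking: instead of masking a single character, mask every
# character belonging to one word, so the word cannot be guessed trivially.
import random

sentence = list("哈尔滨是黑龙江的省会")                  # character-level tokens
word_spans = [(0, 3), (3, 4), (4, 7), (7, 8), (8, 10)]  # hand-made word boundaries

start, end = random.choice(word_spans)                   # pick one whole word
masked = ["[MASK]" if start <= i < end else ch
          for i, ch in enumerate(sentence)]
print("".join(masked))   # e.g. 哈尔滨是[MASK][MASK][MASK]的省会
```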
GPT#
GPT is short for Generative Pre-Training. Its parameter count is particularly large: as the figure below shows, it has roughly 4.5 times as many parameters as BERT.
BERT is the Transformer's Encoder, while GPT is the Transformer's Decoder. GPT takes some words as input and predicts the next word; the computation proceeds as shown in the figure below.
We input the word "潮水" (the tide); after passing through many self-attention layers we get the output "退了" (has gone out). Then "退了" is fed back in as input to predict the next output.
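A minimal sketch of that autoregressive loop (`model` is a stand-in for a trained Transformer decoder such as GPT; greedy decoding is used here only for simplicity):

```python
# Autoregressive generation: feed the tokens produced so far, take the most
# likely next token, append it, and repeat.
import torch

def generate(model, token_ids, steps=10):
    for _ in range(steps):
        logits = model(torch.tensor([token_ids]))   # shape (1, len, vocab_size)
        next_id = int(logits[0, -1].argmax())       # greedy choice of the next token
        token_ids.append(next_id)                   # it becomes part of the input
    return token_ids
```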
GPT can handle NLP tasks such as reading comprehension, sentence or paragraph generation, and translation.
The website below lets you experience the trained GPT.
For example, you can let it write code on its own.
You can also let it write articles, scripts and so on; the address is shown above, so you can experience it for yourself.