This article sorts out the development of the main models of NLP
2022-08-04 13:02:00 【JMXGODLZ】
Welcome to visit my personal blog: https://jmxgodlz.xyz
Preface
Based on what the author has learned so far, this article sorts out the development of the main NLP models. The goal is to understand the past and present of today's mainstream techniques; if anything here is misunderstood, corrections are welcome.
The following sections introduce, in turn, the motivations behind the design of RNN, LSTM, GRU, Encoder-Decoder, Transformer, and BERT; the model structures themselves are not described in detail.
RNN
Most data in natural language processing is text, and the contextual relationships in text give it strong sequential characteristics. The RNN model has the property that the output (hidden state) of the previous time step is fed in as the input of the next time step, which lets it handle sequence data naturally. Compared with other models, the RNN is therefore well suited to natural language processing tasks.
When the sequence to be processed is long, backpropagation through the RNN is governed by the chain rule: if the factors in the chain of derivatives are consistently too small or too large, the gradient vanishes or explodes.
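As a minimal sketch (the names and sizes here are illustrative, not from the original article), the PyTorch cell below makes the recurrence explicit: the same weight matrix is reused at every step, which is exactly what produces the repeated multiplications behind vanishing and exploding gradients.

```python
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    """Minimal Elman RNN cell: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_x = nn.Linear(input_size, hidden_size)
        self.W_h = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x_t, h_prev):
        # The same W_h is reused at every time step, so backpropagation
        # through time multiplies by W_h^T over and over -- the source of
        # vanishing / exploding gradients on long sequences.
        return torch.tanh(self.W_x(x_t) + self.W_h(h_prev))

cell = VanillaRNNCell(input_size=8, hidden_size=16)
h = torch.zeros(1, 16)
for x_t in torch.randn(20, 1, 8):   # a toy sequence of length 20
    h = cell(x_t, h)
```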
LSTM
The weight matrix of an RNN is shared across the time dimension. Compared with the RNN, the LSTM introduces a gating mechanism to alleviate vanishing gradients. So how does the LSTM manage to do this?
Several key conclusions are given here; a detailed analysis will follow in a later post.
Because the RNN shares its parameter matrix across time steps, the total gradient of the RNN is the sum of the gradients at each time step: $g = \sum_t g_t$.
The overall RNN gradient does not vanish; it is the long-distance components that vanish. The gradient is dominated by the short-range terms, so the model cannot capture distant features.
The essence of the vanishing gradient: because the RNN shares its parameter matrix across time steps, differentiating the hidden state $h$ involves a loop of matrix multiplications, so the final gradient contains a cumulative product of the parameter matrix.
How the LSTM alleviates vanishing gradients: it introduces gates that replace the matrix multiplication with an element-wise Hadamard product: $c_{t}=f_{t} \odot c_{t-1}+i_{t} \odot \tanh\left(W_{c}\left[h_{t-1}, x_{t}\right]+b_{c}\right)$
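The following toy implementation (a sketch, not production code) writes out that cell-state update so the element-wise products are visible:

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """Sketch of the LSTM cell-state update c_t = f_t * c_{t-1} + i_t * g_t."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear layer produces the forget gate, input gate,
        # output gate and candidate state in a single shot.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        z = self.gates(torch.cat([h_prev, x_t], dim=-1))
        f, i, o, g = z.chunk(4, dim=-1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        g = torch.tanh(g)
        # Element-wise (Hadamard) products: the gradient flowing through
        # c_t is scaled by the gate values rather than by a repeated
        # matrix multiplication, which alleviates vanishing gradients.
        c_t = f * c_prev + i * g
        h_t = o * torch.tanh(c_t)
        return h_t, c_t
```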
GRU
Like the LSTM, the GRU introduces a gating mechanism to avoid vanishing gradients. The difference is that the GRU uses only two gates, a reset gate and an update gate, so it has fewer parameters than the LSTM and trains faster.
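One quick way to see the parameter difference is to count the weights of PyTorch's built-in recurrent layers (the sizes below are arbitrary choices for illustration):

```python
import torch.nn as nn

def num_params(m):
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256)   # four gate blocks
gru = nn.GRU(input_size=128, hidden_size=256)     # three gate blocks

print("LSTM parameters:", num_params(lstm))   # roughly 4/3 of the GRU's count
print("GRU  parameters:", num_params(gru))
```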
Encoder-Decoder
The RNN property of feeding the previous step's output into the next step also ties the model's input and output lengths together. The Encoder-Decoder model consists of two parts, an encoder and a decoder, which removes the fixed input/output-length constraint: the encoder is responsible for obtaining a feature representation of the text sequence, and the decoder decodes the output sequence from that feature vector.
However, the Encoder-Decoder model still has the following problems:
how to choose the feature representation vector of the text sequence;
the limited amount of information a single feature vector can hold;
the OOV (out-of-vocabulary) problem.
The first and second problems are addressed by the attention mechanism, which selectively focuses on the important features of the text sequence; a minimal sketch follows below.
The third problem is addressed by the copy mechanism and subword encoding.
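Here is a toy encoder-decoder with dot-product attention (a sketch only; the module names, sizes, and the use of GRUs are assumptions for illustration). Instead of squeezing the whole source into one fixed vector, each decoder step attends over all encoder states:

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Toy encoder-decoder with dot-product attention (illustration only)."""
    def __init__(self, vocab_size, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(2 * hidden_size, vocab_size)

    def forward(self, src_ids, tgt_ids):
        enc_states, enc_last = self.encoder(self.embed(src_ids))
        dec_states, _ = self.decoder(self.embed(tgt_ids), enc_last)
        # Attention: every decoder state attends over all encoder states
        # instead of relying on a single fixed-length summary vector.
        scores = torch.bmm(dec_states, enc_states.transpose(1, 2))
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, enc_states)
        return self.out(torch.cat([dec_states, context], dim=-1))

model = TinySeq2Seq(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
print(logits.shape)   # (2, 5, 1000)
```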
Transformer
The Transformer model mainly consists of a multi-head self-attention module, a feed-forward network, residual connections, and dropout, with multi-head self-attention as the core module. The roles of the components are as follows (a minimal attention sketch is given after this list):
Self-attention: lets the encoder selectively focus on the important features of the text sequence, addressing both the choice of the feature representation vector and the limited information a single vector can hold.
Multi-head mechanism: each head maps the input into a different subspace, yielding feature representations with different emphases and making the overall representation richer.
Residual connections: effectively mitigate vanishing gradients.
Dropout: effectively reduces overfitting.
Feed-forward network: maps the hidden representation to the output space.
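A bare-bones multi-head self-attention module might look like the following (a sketch with assumed dimensions; masking and positional encoding are omitted):

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_head)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)   # attention over all positions
    return weights @ v

class MultiHeadSelfAttention(nn.Module):
    """Bare-bones multi-head self-attention (no masking, for illustration)."""
    def __init__(self, d_model=64, num_heads=4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads so each head attends in its own subspace.
        split = lambda t: t.view(b, n, self.h, self.d_head).transpose(1, 2)
        out = scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

attn = MultiHeadSelfAttention()
print(attn(torch.randn(2, 10, 64)).shape)   # (2, 10, 64)
```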
The following are the main advantages of the Transformer model.
1. The Transformer captures long-distance dependencies
In self-attention, every token computes an attention score with every other token. This computation does not depend on the sequential order of the input, so long-range dependencies can be captured.
The drawback is that computing the attention scores has $O(n^2)$ time complexity, which becomes prohibitive for long input sequences, so the vanilla Transformer is not well suited to very long inputs.
2. The Transformer can be parallelized
Suppose the input sequence is (a, b, c, d).
A traditional RNN first computes the embedding of a to obtain $e_a$, extracts the feature $h_a$, and only then processes b, c, and d in the same way.
With self-attention, every token interacts with the entire sequence, so the model processes the whole sequence at once: it obtains $e_a, e_b, e_c, e_d$ and then computes $h_a, h_b, h_c, h_d$ together.
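The contrast can be made concrete with a small sketch (illustrative sizes): the RNN must loop over time steps one by one, while self-attention handles all positions in a single batched operation.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 4, 32)        # a toy sequence (a, b, c, d), d_model = 32

# RNN: the time steps must be processed one after another.
rnn_cell = nn.RNNCell(32, 32)
h = torch.zeros(1, 32)
rnn_states = []
for t in range(x.size(1)):       # sequential loop over a, b, c, d
    h = rnn_cell(x[:, t], h)
    rnn_states.append(h)

# Self-attention: every position is computed in one batched matmul,
# so the whole sequence is processed at once.
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
attn_states, _ = attn(x, x, x)
print(attn_states.shape)         # (1, 4, 32)
```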
3. The Transformer is well suited to pre-training
The RNN builds in the assumption that the input data is sequential, feeding the output of the previous step into the next step.
The CNN builds in the assumption that the input data is image-like, adding structural properties (such as convolutions that generate features) that make forward propagation more efficient and reduce the number of parameters.
Unlike CNNs and RNNs, the Transformer is a flexible architecture that places no restrictions on the structure of the input data, which makes it suitable for pre-training on large-scale data. The same flexibility, however, means the Transformer generalizes poorly on small datasets. Remedies include introducing structural biases or regularization, and pre-training on large amounts of unlabeled data.
BERT
Model structure
First, following GPT, BERT adopts the two-stage pretraining and fine-tuning approach. However, unlike GPT, which uses the Transformer decoder, BERT uses the Transformer encoder as its backbone in order to make full use of bidirectional context.
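As a minimal sketch, assuming the Hugging Face transformers library and the publicly released bert-base-chinese checkpoint are available, the encoder can be loaded and run like this:

```python
# Assumes the Hugging Face `transformers` library and the public
# `bert-base-chinese` checkpoint are installed / downloadable.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")   # Transformer encoder stack

inputs = tokenizer("欢迎访问个人博客", return_tensors="pt")
outputs = model(**inputs)
# Each token gets a contextual representation that attends to both
# left and right context -- the bidirectional encoder view used by BERT.
print(outputs.last_hidden_state.shape)   # (1, seq_len, 768)
```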
Training task
If BERT used the same language-model objective as GPT, it would suffer from label leakage (the context of one word contains the prediction target of another word). To exploit bidirectional context, BERT therefore proposes the masked language model (MLM) task, which predicts masked words from their context. For an introduction to MLM, see 不要停止预训练实战(二)-一日看尽MLM.
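A toy version of the MLM corruption step might look like this (a sketch following the commonly described 15% / 80-10-10 scheme; the helper name and the -100 ignore label are assumptions for illustration):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Toy MLM masking in the spirit of BERT: pick ~15% of positions,
    replace 80% of them with [MASK], 10% with a random token, keep 10%."""
    labels = input_ids.clone()
    picked = torch.rand(input_ids.shape) < mlm_prob
    labels[~picked] = -100                     # ignore unpicked positions in the loss

    input_ids = input_ids.clone()
    masked = picked & (torch.rand(input_ids.shape) < 0.8)
    input_ids[masked] = mask_token_id          # 80% of picked -> [MASK]

    random = picked & ~masked & (torch.rand(input_ids.shape) < 0.5)
    input_ids[random] = torch.randint(vocab_size, (int(random.sum()),))
    return input_ids, labels                   # the rest of picked stay unchanged

ids = torch.randint(5, 100, (1, 12))
corrupted, labels = mask_tokens(ids, mask_token_id=103, vocab_size=100)
```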
Improvements
With respect to the Transformer structure and the pre-training method, the BERT model still has the following directions for improvement:
Training method: improve the masking scheme and multi-task training, adjusting the NSP task and the way tokens are masked.
Model structure: address the $O(n^2)$ time complexity of the Transformer and its lack of structural assumptions about the input by adjusting the model structure.
Architecture: lightweight structures, stronger cross-block connections, adaptive computation time, and divide-and-conquer Transformer variants.
Pre-training: use the complete encoder-decoder Transformer, as in the T5 and BART models.
Applications to multimodal and other downstream tasks.
The main future directions for BERT-style models include larger and deeper models, multimodality, cross-lingual transfer, few-shot learning, and model distillation; see the discussion in 2022预训练的下一步是什么.
References
https://arxiv.org/pdf/2106.04554.pdf
https://www.zhihu.com/question/34878706