Transformer-XL Translation
2022-07-07 23:04:00 【Anny Linlin】
1. Introduction
Language modeling is among the important problems that require modeling long-term dependency, with successful applications such as unsupervised pretraining (Peters et al., 2018; Devlin et al., 2018). However, it has remained a challenge to equip neural networks with the capability to model long-term dependency in sequential data. Recurrent neural networks (RNNs), in particular LSTMs (Hochreiter & Schmidhuber, 1997), have been a standard solution to language modeling and obtained strong results on multiple benchmarks. Despite the wide adoption, RNNs are difficult to optimize due to vanishing and exploding gradients (Hochreiter et al., 2001), and the introduction of gating in LSTMs together with gradient clipping techniques (Graves, 2013; Pascanu et al., 2012) might not be sufficient to fully address this issue. Empirically, previous work has found that LSTM language models use 200 context words on average (Khandelwal et al., 2018), indicating room for further improvement.
On the other hand, the direct connections between long-distance word pairs built into the attention mechanism can ease optimization and enable the learning of long-term dependency (Bahdanau et al., 2014; Vaswani et al., 2017). Recently, Al-Rfou et al. (2018) designed a set of auxiliary losses to train deep Transformer networks for character-level language modeling, which outperform LSTMs by a large margin. Despite the success, the training in Al-Rfou et al. (2018) is performed on separated fixed-length segments of a few hundred characters, with no information flow across segments. As a consequence of the fixed context length, the model cannot capture any longer-term dependency beyond the predefined context length. In addition, the fixed-length segments are created by selecting consecutive chunks of symbols without respecting sentence or any other semantic boundaries. Hence, the model lacks the necessary contextual information to predict the first few symbols of each segment, leading to inefficient optimization and inferior performance. We refer to this problem as context fragmentation.
To address the aforementioned limitations of fixed-length contexts, we propose a new architecture, Transformer-XL (meaning extra long). We introduce the notion of recurrence into our deep self-attention network. In particular, instead of computing the hidden states from scratch for each new segment, we reuse the hidden states obtained in previous segments. The reused hidden states serve as memory for the current segment, which builds up a recurrent connection between the segments. As a result, modeling very long-term dependency becomes possible, because information can be propagated through the recurrent connections. Meanwhile, passing information from the previous segment can also resolve the problem of context fragmentation. More importantly, we show the necessity of using relative positional encodings rather than absolute ones, in order to enable state reuse without causing temporal confusion. Hence, as an additional technical contribution, we introduce a simple but more effective relative positional encoding formulation that generalizes to attention lengths longer than those observed during training.
2. Related work
3. Model
Given a corpus of tokens x = (x_1, ..., x_T), the task of language modeling is to estimate the joint probability P(x), which is often auto-regressively factorized as

P(x) = ∏_t P(x_t | x_{<t})
With this factorization, the problem reduces to estimating each conditional factor. In this work, we stick to the standard neural approach to modeling the conditional probability. Specifically, a trainable neural network encodes the context x_{<t} into a fixed-size hidden state, which is multiplied with the word embeddings to obtain the logits. The logits are then fed into the softmax function, yielding a categorical probability distribution over the next token.
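As a toy illustration of this factorization (not from the paper), the sketch below computes the joint log-probability of a sequence by summing log conditional probabilities from a hypothetical next-token model. The function `toy_conditional` is a made-up stand-in for the neural model described above.

```python
import math

def toy_conditional(token, context):
    """Hypothetical conditional model over a 4-token vocabulary that
    slightly favors repeating the most recent context token."""
    vocab = ["a", "b", "c", "d"]
    base = 1.0 / len(vocab)
    if not context:
        return base  # uniform with no context
    if token == context[-1]:
        return base + 0.1
    # remaining probability mass split over the other tokens
    return (1.0 - (base + 0.1)) / (len(vocab) - 1)

def joint_log_prob(sequence):
    """log P(x) = sum_t log P(x_t | x_{<t}) -- the autoregressive factorization."""
    total = 0.0
    for t, token in enumerate(sequence):
        total += math.log(toy_conditional(token, sequence[:t]))
    return total
```

Any model that outputs a valid conditional distribution at each step (e.g. via a softmax) can be plugged in place of `toy_conditional` without changing the factorization.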
3.1 VANILLA TRANSFORMER LANGUAGE MODELS
In order to apply Transformer or self-attention to language modeling, the central problem is how to train a Transformer to effectively encode an arbitrarily long context into a fixed-size representation. Given infinite memory and computation, a simple solution would be to process the entire context sequence using an unconditional Transformer decoder, similar to a feed-forward neural network. However, this is usually infeasible with limited resources in practice.
One feasible but crude approximation is to split the entire corpus into shorter segments of manageable sizes, and only train the model within each segment, ignoring all contextual information from previous segments. This is the idea adopted by Al-Rfou et al. (2018). We call it the vanilla model and visualize it in Fig. 1a. Under this training paradigm, information never flows across segments in either the forward or the backward pass. There are two critical limitations of using a fixed-length context. First, the largest possible dependency length is upper bounded by the segment length, which is a few hundred on character-level benchmarks. Therefore, although the self-attention mechanism is less affected by the vanishing gradient problem compared to RNNs, the vanilla model is not able to fully exploit this optimization advantage. Second, though it is possible to use padding to respect sentence or other semantic boundaries, in practice it has been standard to simply chunk long text into fixed-length segments for improved efficiency (Peters et al., 2018; Devlin et al., 2018; Al-Rfou et al., 2018). However, simply chunking a sequence into fixed-length segments leads to the context fragmentation problem discussed in Section 1.
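The vanilla chunking scheme can be sketched in a few lines (an illustration, not the paper's code). Note how the first token of every segment is predicted with no context at all, which is exactly the context fragmentation problem:

```python
def split_into_segments(corpus, seg_len):
    """Vanilla training: chop the corpus into disjoint fixed-length segments.
    No information flows between segments, so a prediction near the start of
    a segment sees almost no context regardless of how much text precedes it."""
    return [corpus[i:i + seg_len] for i in range(0, len(corpus), seg_len)]

segments = split_into_segments(list("abcdefgh"), 3)
# 'd' begins a new segment, so it is predicted without seeing 'abc'.
```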
During evaluation, at each step the vanilla model consumes a segment of the same length as in training, but only makes one prediction at the last position. Then, at the next step, the segment is shifted to the right by only one position, and the new segment has to be processed all from scratch. As shown in Fig. 1b, this procedure ensures that each prediction utilizes the longest possible context exposed during training, and also relieves the context fragmentation issue encountered in training. However, this evaluation procedure is extremely expensive. We will show that our proposed architecture is able to substantially improve the evaluation speed.
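A back-of-the-envelope cost model (my illustration, with hypothetical helper names) makes the expense concrete: sliding-window evaluation re-processes an entire segment per prediction, while a scheme that reuses cached representations touches each position once.

```python
def vanilla_eval_positions(corpus_len, seg_len):
    """Sliding-window evaluation: the window shifts right by one position per
    step, and the whole segment is recomputed to emit a single prediction.
    Returns the total number of positions processed (a proxy for cost)."""
    steps = corpus_len - seg_len + 1  # one prediction per window placement
    return steps * seg_len

def cached_eval_positions(corpus_len, seg_len):
    """With reused segment representations, each position is processed once."""
    return corpus_len
```

For a corpus much longer than the segment length, the vanilla cost is roughly `seg_len` times the cached cost, which is consistent with large evaluation speedups when the segment is hundreds of positions long.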
3.2 SEGMENT-LEVEL RECURRENCE WITH STATE REUSE
To address the limitations of using a fixed-length context, we propose to introduce a recurrence mechanism into the Transformer architecture. During training, the hidden state sequence computed for the previous segment is fixed and cached, to be reused as an extended context when the model processes the next new segment, as shown in Fig. 2a. Although the gradient still remains within a segment, this additional input allows the network to exploit information in the history, leading to the ability to model longer-term dependency and avoiding context fragmentation. Formally, let the two consecutive segments of length L be s_τ = [x_{τ,1}, ..., x_{τ,L}] and s_{τ+1} = [x_{τ+1,1}, ..., x_{τ+1,L}]. Denote the n-th layer hidden state sequence produced for the τ-th segment s_τ by h_τ^n ∈ R^{L×d}, where d is the hidden dimension.
Then, the n-th layer hidden state for segment s_{τ+1} is produced as follows:

h̃_{τ+1}^{n-1} = [SG(h_τ^{n-1}) ∘ h_{τ+1}^{n-1}]
q_{τ+1}^n, k_{τ+1}^n, v_{τ+1}^n = h_{τ+1}^{n-1} W_q^T, h̃_{τ+1}^{n-1} W_k^T, h̃_{τ+1}^{n-1} W_v^T
h_{τ+1}^n = Transformer-Layer(q_{τ+1}^n, k_{τ+1}^n, v_{τ+1}^n)

where the function SG(·) stands for stop-gradient, the notation [h_u ∘ h_v] indicates the concatenation of two hidden sequences along the length dimension, and W denotes model parameters. Compared to the standard Transformer, the critical difference is that the keys k_{τ+1}^n and values v_{τ+1}^n are conditioned on the extended context h̃_{τ+1}^{n-1}, and hence on h_τ^{n-1} cached from the previous segment. We emphasize this particular design by the green paths in Fig. 2a. With this recurrence mechanism applied to every two consecutive segments of a corpus, it essentially creates a segment-level recurrence in the hidden states. As a result, the effective context being utilized can go way beyond just two segments. However, notice that the recurrent dependency between h_{τ+1}^n and h_τ^{n-1} shifts one layer downwards per segment, which differs from the same-layer recurrence in conventional RNN-LMs. Consequently, the largest possible dependency length grows linearly w.r.t. the number of layers as well as the segment length, i.e., O(N × L), as visualized in Fig. 2b. This is analogous to truncated BPTT (Mikolov et al., 2010), a technique developed for training RNN-LMs. However, different from truncated BPTT, our method caches a sequence of hidden states instead of only the last one, and should be applied together with the relative positional encoding technique elaborated in Section 3.3.
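The concatenation of cached and current hidden states can be sketched numerically (a minimal NumPy sketch under simplifying assumptions: a single head, a single shared projection `W`, and no residual/FFN sublayers; `toy_layer` is a made-up stand-in for a real Transformer layer). The key structural point is that queries come only from the current segment, while keys and values see the extended context:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 8  # segment length and hidden dimension (illustrative values)

def toy_layer(q_input, kv_input, W):
    """Stand-in for one layer: attention of the current segment's L queries
    over the extended (cached + current) context of kv_input."""
    kv = kv_input @ W                                  # (L_ext, d)
    scores = q_input @ kv.T                            # (L, L_ext)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return weights @ kv                                # (L, d)

W = rng.normal(size=(d, d)) * 0.1
h_prev = rng.normal(size=(L, d))  # cached states of segment tau; SG(.) = no gradient
h_curr = rng.normal(size=(L, d))  # states of segment tau+1

# [SG(h_tau) o h_{tau+1}]: concatenate along the length dimension
h_ext = np.concatenate([h_prev, h_curr], axis=0)       # (2L, d)
out = toy_layer(h_curr, h_ext, W)                      # (L, d): one state per current position
```

In a framework with autodiff, the stop-gradient SG(·) would correspond to detaching `h_prev` from the computation graph so that gradients never flow into the previous segment.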
Besides achieving extra-long context and resolving fragmentation, another benefit that comes with the recurrence scheme is significantly faster evaluation. Specifically, during evaluation, the representations from previous segments can be reused instead of being computed from scratch as in the vanilla model. In our experiments on enwik8, Transformer-XL is up to 1,800+ times faster than the vanilla model during evaluation.
Finally, notice that the recurrence scheme does not need to be restricted to only the previous segment. In theory, we can cache as many previous segments as the GPU memory allows, and reuse all of them as the extra context when processing the current segment. Thus, we can cache a predefined length-M span of old hidden states spanning (possibly) multiple segments, and refer to them as the memory, due to the clear connection to memory-augmented neural networks (Graves et al., 2014; Weston et al., 2014).
In our experiments, we set M equal to the segment length during training, and increase it by multiple times during evaluation.
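The memory update described above reduces to a fixed-size FIFO buffer over hidden states. A minimal sketch (my illustration, using plain lists in place of hidden-state tensors):

```python
def update_memory(memory, new_hidden, mem_len):
    """Append the newest segment's hidden states and keep only the most
    recent mem_len entries, which may span multiple old segments."""
    combined = memory + new_hidden
    return combined[-mem_len:]

mem = []
for segment_states in [[1, 2], [3, 4], [5, 6]]:  # three segments of length 2
    mem = update_memory(mem, segment_states, mem_len=3)
# mem now holds the last 3 states: [4, 5, 6], spanning two old segments
```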
3.3 RELATIVE POSITIONAL ENCODINGS
While we found the idea presented in the previous subsection very appealing, there is a crucial technical challenge we have not solved in order to reuse the hidden states. That is, how can we keep the positional information coherent when we reuse the states? Recall that, in the standard Transformer, the information of sequence order is provided by a set of positional encodings, denoted as U ∈ R^{L_max×d}, where the i-th row U_i corresponds to the i-th absolute position within a segment and L_max prescribes the maximum possible length to be modeled. Then, the actual input to the Transformer is the element-wise addition of the word embeddings and the positional encodings. If we simply adapt this positional encoding to our recurrence mechanism introduced above, the hidden state sequence would be computed as follows:

h_{τ+1} = f(h_τ, E_{s_{τ+1}} + U_{1:L})
h_τ = f(h_{τ-1}, E_{s_τ} + U_{1:L})

where E_{s_τ} ∈ R^{L×d} is the word embedding sequence of s_τ, and f represents a transformation function. Notice that both E_{s_τ} and E_{s_{τ+1}} are associated with the same positional encoding U_{1:L}, so the model has no information to distinguish the positional difference between x_{τ,j} and x_{τ+1,j} for any j = 1, ..., L.
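The inconsistency can be demonstrated with the standard sinusoidal encoding (an illustration of the problem, not the paper's relative-encoding solution): two consecutive segments of length L both receive rows U[0:L], so position j in segment τ and position j in segment τ+1 are indistinguishable by position alone.

```python
import math

def sinusoidal_encoding(pos, d):
    """One row of the standard absolute positional encoding U:
    sin on even dimensions, cos on odd dimensions."""
    return [math.sin(pos / 10000 ** (2 * (i // 2) / d)) if i % 2 == 0
            else math.cos(pos / 10000 ** (2 * (i // 2) / d))
            for i in range(d)]

L, d = 4, 8
U_segment_tau = [sinusoidal_encoding(p, d) for p in range(L)]
U_segment_tau_plus_1 = [sinusoidal_encoding(p, d) for p in range(L)]

# Both segments are fed the SAME positional rows U[0:L], so once hidden
# states are reused across segments, tokens at the same within-segment
# offset carry identical positional information.
same = U_segment_tau == U_segment_tau_plus_1
```

This is why, with absolute encodings, reusing cached states causes temporal confusion, motivating the relative positional encoding described next in the paper.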