Tech Deep Dive | Writing BERT in a Hundred Lines of Code: A Tour of MindSpore's Capabilities
2022-07-03 07:34:00 【MindSpore】
How would you have rated Huawei MindSpore before version 1.5? Its ease of use was often questioned, and so was whether you could write a BERT in a hundred lines of code with it; this article makes up for that. BERT, the 2018 milestone model of NLP, has been hyped by countless people and dissected countless times. While trying to explain the model a little more clearly, I also want readers to get a sense of what MindSpore is capable of today. It may smack a bit of advertising, but please hear me out.
01
A BERT Implementation of Nearly 1000 Lines
The idea for this article came from having written quite a few models with MindSpore myself, in particular reproducing several pre-trained language models. During that work I kept referring to the model implementations in Model Zoo, and a doubt arose: does writing a BERT really take close to 1000 lines? Is MindSpore really that complicated and hard to use?
Here is the official implementation (gitee.com/mindspore/models/blob/master/official/nlp/bert/src/bert_model.py) for interested readers. Even with the comments removed, it is still complex and lengthy, and it does not reflect the "simple development experience" that MindSpore advertises. Later, when I wanted to migrate a Hugging Face checkpoint, I wrote my own version and found that the lengthy official implementation is entirely compressible; such a complex implementation only confuses newcomers, hence this hundred-odd-line version.
02
BERT Model
BERT is short for "Bidirectional Encoder Representations from Transformers"; it is also the name of a Sesame Street character (Google loves a good in-joke).
Sesame Street BERT
The title already names the core of the model: a bidirectional Transformer encoder. Building on GPT and ELMo, BERT absorbs the strengths of both and makes full use of the Transformer's feature-extraction ability; in its year it swept the evaluation datasets and became the SOTA model that seemed impossible to surpass. Below I will build a very lightweight BERT with MindSpore from the most basic components, matching formula, figure and code step by step. For reasons of length, I will not explain the Transformer or the basics of pre-trained language models in detail, only the main line needed to implement BERT.
03
Multi-head Attention
Unlike a paper walkthrough, I will not open with how BERT differs from the Transformer at the embedding level or with its pre-training task design; instead I start from the most basic module, Multi-head Attention.
Since BERT's backbone consists entirely of Transformer encoder layers, here is a brief recap of Self-Attention and Multi-head Attention in the Transformer. First, Self-Attention, called Scaled Dot-Product Attention in the paper:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Self-Attention takes three inputs, Q (query matrix), K (key matrix) and V (value matrix), each obtained from the same input via a linear transformation (a fully connected layer). The implementation can follow the formula directly; the code is as follows:
# Imports assumed throughout this article (MindSpore 1.x style)
import mindspore
import mindspore.nn as nn
import mindspore.numpy as mnp
from mindspore import Tensor
from mindspore.nn import Cell, Dense, Embedding
from mindspore.ops import operations as P

class ScaledDotProductAttention(Cell):
    def __init__(self, d_k, dropout):
        super().__init__()
        self.scale = Tensor(d_k, mindspore.float32)
        self.matmul = nn.MatMul()
        self.transpose = P.Transpose()
        self.softmax = nn.Softmax(axis=-1)
        self.sqrt = P.Sqrt()
        self.masked_fill = MaskedFill(-1e9)
        if dropout > 0.0:
            self.dropout = nn.Dropout(1 - dropout)  # MindSpore 1.x Dropout takes keep_prob
        else:
            self.dropout = None

    def construct(self, Q, K, V, attn_mask):
        K = self.transpose(K, (0, 1, 3, 2))
        scores = self.matmul(Q, K) / self.sqrt(self.scale)  # scores: [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        scores = self.masked_fill(scores, attn_mask)  # fill positions where mask is one with a large negative value
        attn = self.softmax(scores)
        context = self.matmul(attn, V)
        if self.dropout is not None:
            context = self.dropout(context)
        return context, attn
After Q·K^T and the scaling, a masked_fill step is applied, following the PyTorch-style implementation: the positions that correspond to padding (value 0) in the original input sequence are replaced with a large negative number, -1e9 in the code above, so that softmax assigns them weights close to 0. A Dropout is also applied to improve the robustness of the model.
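The MaskedFill helper used above is not a stock MindSpore 1.x layer and is not shown in this article. A minimal sketch of what it might look like, assuming a mask in which 1 marks padding positions, is:

class MaskedFill(Cell):
    # Minimal sketch (assumption): replace entries where mask == 1 with `value`,
    # a large negative number, so softmax gives those positions near-zero weight.
    def __init__(self, value):
        super().__init__()
        self.value = value

    def construct(self, inputs, mask):
        # mask: broadcastable to inputs, 1.0 at padding positions, 0.0 elsewhere
        mask = mask.astype(inputs.dtype)
        return inputs * (1 - mask) + mask * self.value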
Multi-head Attention
With the basic Scaled Dot-Product Attention complete, let's look at the multi-head attention mechanism. "Multi-head" simply means projecting the original single Q, K, V into h sets Q', K', V'. Without changing the overall amount of computation, this improves the model's generalization ability. You can view it as an ensemble of multiple heads inside the model, or as the multi-channel idea in convolution; Multi-head Attention does have a bit of a CNN flavor (I heard Liu Tieyan mention this at a workshop many years ago). Now the implementation:
class MultiHeadAttention(Cell):
    def __init__(self, d_model, n_heads, dropout):
        super().__init__()
        self.n_heads = n_heads
        self.W_Q = Dense(d_model, d_model)
        self.W_K = Dense(d_model, d_model)
        self.W_V = Dense(d_model, d_model)
        self.linear = Dense(d_model, d_model)
        self.head_dim = d_model // n_heads
        assert self.head_dim * n_heads == d_model, "embed_dim must be divisible by num_heads"
        self.layer_norm = nn.LayerNorm((d_model,), epsilon=1e-12)
        self.attention = ScaledDotProductAttention(self.head_dim, dropout)
        # ops
        self.transpose = P.Transpose()
        self.expanddims = P.ExpandDims()
        self.tile = P.Tile()

    def construct(self, Q, K, V, attn_mask):
        # q: [batch_size x len_q x d_model], k: [batch_size x len_k x d_model], v: [batch_size x len_k x d_model]
        residual, batch_size = Q, Q.shape[0]
        q_s = self.W_Q(Q).view((batch_size, -1, self.n_heads, self.head_dim))
        k_s = self.W_K(K).view((batch_size, -1, self.n_heads, self.head_dim))
        v_s = self.W_V(V).view((batch_size, -1, self.n_heads, self.head_dim))
        # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
        q_s = self.transpose(q_s, (0, 2, 1, 3))  # q_s: [batch_size x n_heads x len_q x d_k]
        k_s = self.transpose(k_s, (0, 2, 1, 3))  # k_s: [batch_size x n_heads x len_k x d_k]
        v_s = self.transpose(v_s, (0, 2, 1, 3))  # v_s: [batch_size x n_heads x len_k x d_v]
        attn_mask = self.expanddims(attn_mask, 1)
        attn_mask = self.tile(attn_mask, (1, self.n_heads, 1, 1))  # attn_mask: [batch_size x n_heads x len_q x len_k]
        # context: [batch_size x n_heads x len_q x d_v], attn: [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        context, attn = self.attention(q_s, k_s, v_s, attn_mask)
        context = self.transpose(context, (0, 2, 1, 3)).view((batch_size, -1, self.n_heads * self.head_dim))  # context: [batch_size x len_q x n_heads * d_v]
        output = self.linear(context)
        return self.layer_norm(output + residual), attn  # output: [batch_size x len_q x d_model]
Q, K and V first go through fully connected layers (Dense) for the linear transformation, are reshaped (view) into multiple heads, and are then transposed into the layout required by ScaledDotProductAttention. Finally the head outputs are concatenated; note that there is no explicit Concat op here: a view directly restores the last dimension of context to n_heads * head_dim. The final return also performs the Add & Norm step, i.e. the residual connection and LayerNorm of the encoder block; details follow in the next section.
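A quick shape check can be done with dummy inputs (a hypothetical smoke test, assuming the classes and imports above; the mask follows the 1-marks-padding convention of the MaskedFill sketch):

mha = MultiHeadAttention(d_model=768, n_heads=12, dropout=0.1)
x = mnp.ones((2, 16, 768), mindspore.float32)          # [batch, seq_len, d_model]
pad_mask = mnp.zeros((2, 16, 16), mindspore.float32)   # [batch, len_q, len_k], 1 marks padding
out, attn = mha(x, x, x, pad_mask)
print(out.shape, attn.shape)   # (2, 16, 768) and (2, 12, 16, 16)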
04
Transformer Encoder
With the basic Multi-head Attention module in place, we can finish the rest and build a single encoder layer. First its structure: a Transformer encoder layer consists of a Position-wise Feed-Forward layer and a Multi-head Attention layer. Each sub-layer wraps its input and output in a residual connection (y = f(x) + x), so that stacking more layers does not cause degradation, together with a LayerNorm that keeps the deep network trainable (alleviating vanishing and exploding gradients). Why LayerNorm rather than BatchNorm is used here is worth looking up; it is an interesting trick in the Transformer's design.
Transformer Encoder
Having described the encoder structure, we implement the missing Position-wise Feed-Forward layer, and, as with the Multi-head Attention layer, fold the residual connection and LayerNorm into it:
class PoswiseFeedForwardNet(Cell):
    def __init__(self, d_model, d_ff, activation: str = 'gelu'):
        super().__init__()
        self.fc1 = Dense(d_model, d_ff)
        self.fc2 = Dense(d_ff, d_model)
        self.activation = activation_map.get(activation, nn.GELU())
        self.layer_norm = nn.LayerNorm((d_model,), epsilon=1e-12)

    def construct(self, inputs):
        residual = inputs
        outputs = self.fc1(inputs)
        outputs = self.activation(outputs)
        outputs = self.fc2(outputs)
        return self.layer_norm(outputs + residual)
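The activation_map lookup used here (and again in the MLM head later) is not shown in the article either; it is assumed to be a small dictionary from activation names to MindSpore cells, for example:

activation_map = {
    'relu': nn.ReLU(),
    'gelu': nn.GELU(),
    'tanh': nn.Tanh(),
}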
Connecting the Multi-head Attention layer and the Position-wise Feed-Forward layer gives one encoder layer:
class BertEncoderLayer(Cell):
    def __init__(self, d_model, n_heads, d_ff, activation, dropout):
        super().__init__()
        self.enc_self_attn = MultiHeadAttention(d_model, n_heads, dropout)
        self.pos_ffn = PoswiseFeedForwardNet(d_model, d_ff, activation)

    def construct(self, enc_inputs, enc_self_attn_mask):
        enc_outputs, attn = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask)
        enc_outputs = self.pos_ffn(enc_outputs)
        return enc_outputs, attn
Then, according to the configured number of layers, hidden_size, number of heads and other parameters, chain n encoder layers together to obtain BERT's encoder; an nn.CellList container is used here:
class BertEncoder(Cell):
    def __init__(self, config):
        super().__init__()
        self.layers = nn.CellList([
            BertEncoderLayer(config.hidden_size,
                             config.num_attention_heads,
                             config.intermediate_size,
                             config.hidden_act,
                             config.hidden_dropout_prob)
            for _ in range(config.num_hidden_layers)
        ])

    def construct(self, inputs, enc_self_attn_mask):
        outputs = inputs
        for layer in self.layers:
            outputs, enc_self_attn = layer(outputs, enc_self_attn_mask)
        return outputs
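The config object passed around (hidden_size, num_attention_heads and so on) is never defined in this article. A minimal sketch matching the attributes used in the code could look like the following; the default values follow the standard bert-base setting and are an assumption here:

class BertConfig:
    # Hypothetical minimal config holder; attribute names match the code in this article.
    def __init__(self,
                 vocab_size=30522,
                 hidden_size=768,
                 num_hidden_layers=12,
                 num_attention_heads=12,
                 intermediate_size=3072,
                 hidden_act='gelu',
                 hidden_dropout_prob=0.1,
                 max_position_embeddings=512,
                 type_vocab_size=2):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size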
05
Building BERT
With the encoder complete, we can assemble the full BERT model. The previous sections implemented the Transformer encoder structure; BERT's core innovations, or differences, mostly sit outside the Transformer backbone. The first is the handling of embeddings.
The text fed into BERT is mapped to the embedding-layer hidden representation by summing three different embeddings:
Token Embeddings: the ordinary word vectors. The first placeholder is [CLS], whose encoded representation stands for the whole input text and is used for classification tasks (hence CLS, for classifier). There are also [SEP] placeholders separating the two sentences of one input, and [PAD] for padding.
Segment Embeddings: distinguish the two sentences within the same input; this embedding exists mainly to serve the Next Sentence Prediction task.
Position Embeddings: like the Transformer, BERT cannot retain positional information naturally the way an LSTM does, so positions must be encoded explicitly. The difference is that the Transformer uses trigonometric functions, whereas here the position index is fed directly into an embedding layer (there is no essential difference; the latter is simply more direct).
Having analyzed the three embeddings, this part can be completed directly with nn.Embedding; the corresponding code is as follows:
class BertEmbeddings(Cell):
    def __init__(self, config):
        super().__init__()
        self.tok_embed = Embedding(config.vocab_size, config.hidden_size)
        self.pos_embed = Embedding(config.max_position_embeddings, config.hidden_size)
        self.seg_embed = Embedding(config.type_vocab_size, config.hidden_size)
        self.norm = nn.LayerNorm((config.hidden_size,), epsilon=1e-12)

    def construct(self, x, seg):
        seq_len = x.shape[1]
        pos = mnp.arange(seq_len)  # mindspore.numpy
        pos = P.BroadcastTo(x.shape)(P.ExpandDims()(pos, 0))
        seg_embedding = self.seg_embed(seg)
        tok_embedding = self.tok_embed(x)
        embedding = tok_embedding + self.pos_embed(pos) + seg_embedding
        return self.norm(embedding)
mindspore.numpy.arange is used to generate the position indices; the rest is plain calls and element-wise addition.
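The broadcast of the position indices is easy to follow with concrete shapes (illustrative values only):

# batch_size = 2, seq_len = 4
pos = mnp.arange(4)                   # shape (4,): [0, 1, 2, 3]
pos = P.ExpandDims()(pos, 0)          # shape (1, 4)
pos = P.BroadcastTo((2, 4))(pos)      # shape (2, 4): one row of position indices per sample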
With the embedding layer done, combining it with the encoder and a pooler over the encoder output gives the complete BERT model:
class BertModel(Cell):
    def __init__(self, config):
        super().__init__()
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = Dense(config.hidden_size, config.hidden_size, activation='tanh')

    def construct(self, input_ids, segment_ids):
        outputs = self.embeddings(input_ids, segment_ids)
        enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids)
        outputs = self.encoder(outputs, enc_self_attn_mask)
        h_pooled = self.pooler(outputs[:, 0])
        return outputs, h_pooled
A fully connected layer pools the output at position 0, which corresponds to the [CLS] placeholder and represents the whole input text, for use in downstream classification tasks.
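The get_attn_pad_mask helper called in construct is also not shown in this article. Under the conventions used earlier (token id 0 is [PAD], and 1 marks padding in the mask), a minimal sketch might be:

def get_attn_pad_mask(seq_q, seq_k):
    # Hypothetical sketch: build a [batch, len_q, len_k] mask where 1.0 marks
    # positions whose key token is [PAD] (token id 0).
    batch_size, len_q = seq_q.shape
    len_k = seq_k.shape[1]
    pad_mask = (seq_k == 0).astype(mindspore.float32)                # [batch, len_k]
    pad_mask = P.ExpandDims()(pad_mask, 1)                           # [batch, 1, len_k]
    pad_mask = P.BroadcastTo((batch_size, len_q, len_k))(pad_mask)   # [batch, len_q, len_k]
    return pad_mask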
06
BERT Pre-training Tasks
BERT's essence lies in its task design rather than in its model structure; that is the consensus of people who have read the paper. BERT designs two pre-training tasks to train the language model without labeled data (strictly speaking, it is not unsupervised).
1. Next Sentence Prediction
Start with the simpler NSP task. NSP is aimed mainly at downstream tasks such as QA or NLI whose input consists of two sentences, to strengthen the model on this kind of task. As the name suggests, sentences A and B are concatenated as the input; for half of the examples B really is the sentence that follows A, and for the other half B is randomly sampled text that is not the next sentence. The prediction is a binary classification: is B the next sentence of A? The implementation:
class BertNextSentencePredict(Cell):
    def __init__(self, config):
        super().__init__()
        self.classifier = Dense(config.hidden_size, 2)

    def construct(self, h_pooled):
        logits_clsf = self.classifier(h_pooled)
        return logits_clsf
2. Masked Language Model
A proportion of the input tokens are randomly masked. Unlike a traditional (or GPT-style) language model, this task is bidirectional: the objective is P(w_i | w_1, ..., w_{i-1}, w_{i+1}, ..., w_n) rather than the left-to-right P(w_i | w_1, ..., w_{i-1}). With this as the objective function, the masked tokens are predicted from their context, which is naturally a cloze task.
Data preprocessing is not covered here, so the masking and replacement ratios are skipped. The implementation itself is simple, essentially Dense + activation + LayerNorm + Dense:
class BertMaskedLanguageModel(Cell):
    def __init__(self, config, tok_embed_table):
        super().__init__()
        self.transform = Dense(config.hidden_size, config.hidden_size)
        self.activation = activation_map.get(config.hidden_act, nn.GELU())
        self.norm = nn.LayerNorm((config.hidden_size,), epsilon=1e-12)
        # the decoder weights are tied to the token embedding table
        self.decoder = Dense(tok_embed_table.shape[1], tok_embed_table.shape[0], weight_init=tok_embed_table)

    def construct(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.activation(hidden_states)
        hidden_states = self.norm(hidden_states)
        hidden_states = self.decoder(hidden_states)
        return hidden_states
Combining the two tasks completes the pre-training BERT model:
class BertForPretraining(Cell):
    def __init__(self, config):
        super().__init__()
        self.bert = BertModel(config)
        self.nsp = BertNextSentencePredict(config)
        self.mlm = BertMaskedLanguageModel(config, self.bert.embeddings.tok_embed.embedding_table)

    def construct(self, input_ids, segment_ids):
        outputs, h_pooled = self.bert(input_ids, segment_ids)
        nsp_logits = self.nsp(h_pooled)
        mlm_logits = self.mlm(outputs)
        return mlm_logits, nsp_logits
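As a final sanity check, a forward pass on dummy data with the usual losses might look like this (a sketch under the assumptions above, including the hypothetical BertConfig; in real pre-training the MLM loss is computed only over the masked positions):

import numpy as np

config = BertConfig()                      # hypothetical config sketched earlier
model = BertForPretraining(config)

input_ids = Tensor(np.random.randint(1, config.vocab_size, (2, 16)), mindspore.int32)
segment_ids = Tensor(np.zeros((2, 16)), mindspore.int32)

mlm_logits, nsp_logits = model(input_ids, segment_ids)
print(mlm_logits.shape, nsp_logits.shape)  # (2, 16, vocab_size) and (2, 2)

# Cross-entropy over the vocabulary for MLM and over 2 classes for NSP
loss_fn = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
mlm_labels = Tensor(np.random.randint(0, config.vocab_size, (2 * 16,)), mindspore.int32)
nsp_labels = Tensor(np.random.randint(0, 2, (2,)), mindspore.int32)
mlm_loss = loss_fn(mlm_logits.view((-1, config.vocab_size)), mlm_labels)
nsp_loss = loss_fn(nsp_logits, nsp_labels)
total_loss = mlm_loss + nsp_loss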
At this point the entire BERT model has been implemented in MindSpore. As you can see, every module corresponds directly to a formula or a figure, each module is around 10-20 lines, and the whole implementation sits between 150 and 200 lines; compared with the 800+ lines in Model Zoo, it really is simple.
07
Before-and-After Comparison
Because the official implementation is really lengthy, only a screenshot of part of the code is shown here for comparison.
On the left is the official BertModel; on the right is the integrated version of the implementation above. The same BERT model can be written simply in a hundred-odd lines of code. After several version iterations, MindSpore's operator coverage and the ease of use of its front-end expression have clearly improved. A hundred-line BERT may once have been something only PyTorch could do; now MindSpore can do it too.
Of course, the official implementation has been maintained since the early versions, and presumably no one considered rewriting it more concisely as the framework evolved, which creates the illusion that MindSpore is hard to use and needs far more code. Since release 1.2 it has been able to implement models of the same class with the same amount of code as PyTorch; I hope this article can serve as a small example of that.
08
Summary
Finally, a summary. First, from my personal experience, MindSpore has made a qualitative leap toward the goal of a "simple development experience": from becoming usable at 0.7, through the basic improvements of 1.0, to the usability gains of 1.5. Model Zoo, however, holds a great many models that few people go back to refactor and optimize, which has probably caused more than a few misunderstandings. That is why I take BERT, a milestone model, as the example, to give readers a bit of direct experience.
In addition, a few more words for NLPers: BERT is a model of roughly 100 lines of code, and the Transformer structure is what-you-see-is-what-you-get. Don't be intimidated by large models; reproduce them yourself, and whether you are running experiments for a paper or answering interview questions, you will be far more at ease. So pick up MindSpore and write one!
MindSpore Official information
Official QQ group: 486831414
Official website: https://www.mindspore.cn/
Gitee: https://gitee.com/mindspore/mindspore
GitHub: https://github.com/mindspore-ai/mindspore
Forum: https://bbs.huaweicloud.com/forum/forum-1076-1.html