Technical Deep Dive | Migrating NLP Models to MindSpore: RoBERTa for Sentiment Analysis
2022-07-03 07:43:00 【Shengsi MindSpore】
If you are familiar with the BERT model, RoBERTa will look familiar too. RoBERTa makes several improvements on top of BERT. The main ones are:
1. Training corpus: BERT was trained only on the 16 GB BookCorpus dataset and English Wikipedia. RoBERTa adds CC-News, OpenWebText, and Stories, for a total of 160 GB of plain text.
2. Batch size: RoBERTa uses a much larger batch size, ranging from 256 to 8000.
3. Training time: RoBERTa was trained on 1024 V100 GPUs for about one full day, with more model parameters and a longer training time.
RoBERTa also changes the training procedure itself:
1. Dynamic masking
2. Removal of the NSP (next sentence prediction) task
3. The tokenizer is replaced with a BPE algorithm, so tokenization is very similar to GPT-2
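To make the dynamic masking idea concrete, here is a minimal Python sketch. The helper `dynamic_mask` is hypothetical, the 15% masking rate is the one commonly used for BERT-style pre-training, and the sketch simplifies by always substituting the mask token (real pre-training also keeps or randomizes some of the selected tokens). The only point it illustrates is that the mask is re-sampled each time a sentence is served, rather than fixed once during preprocessing.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa's mask token string


def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return a copy of `tokens` with a freshly sampled subset replaced by MASK_TOKEN."""
    rng = rng or random.Random()
    # Mask at least one token, but never more than the sentence contains.
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]


if __name__ == "__main__":
    sentence = "the movie was surprisingly good and well acted".split()
    rng = random.Random(0)
    for epoch in range(3):
        # Each call samples new positions, so epochs see different masked versions.
        print(epoch, dynamic_mask(sentence, rng=rng))
```

With static masking, by contrast, the masked positions would be chosen once when the dataset is built and reused in every epoch.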
RoBERTa source code (Hugging Face):
RoBERTa paper:
Here we use MindSpore, the framework developed by Huawei, and migrate the PyTorch version of the RoBERTa model to it. You are welcome to take part in the development of the MindSpore open source community!
Environment used in this article:
System: Ubuntu 18
GPU: RTX 3090
MindSpore version: 1.3
Dataset: SST-2 (sentiment analysis task)
SST-2 dataset definition:
This is a binary classification dataset. Every sentence in the training and validation sets is labeled 0 or 1.
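As an illustration of what such data looks like, the sketch below parses SST-2 style records. It assumes the GLUE distribution layout (a tab-separated file with a `sentence` and a `label` column, where 1 is positive and 0 is negative); the two sample rows and the `read_sst2` helper are invented for this example.

```python
import csv
import io

# Two invented rows in the assumed GLUE SST-2 layout: a header line,
# then one "sentence<TAB>label" pair per line.
SAMPLE_TSV = (
    "sentence\tlabel\n"
    "a charming and often affecting journey\t1\n"
    "the plot is a tedious mess\t0\n"
)


def read_sst2(stream):
    """Yield (sentence, label) pairs, where label is 0 (negative) or 1 (positive)."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        yield row["sentence"].strip(), label


if __name__ == "__main__":
    for sentence, label in read_sst2(io.StringIO(SAMPLE_TSV)):
        print(label, sentence)
```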
Model weight conversion
We need to convert the weights of the PyTorch version of RoBERTa into weights that MindSpore can load. One approach is shown below. The main reference for rewriting is the API mapping document on the official website.
Official website link: API mapping
from mindspore.train.serialization import save_checkpoint


def torch_to_ms(model, torch_model):
    """
    Updates the MindSpore model's parameter data from the PyTorch parameter data.
    Args:
        model: mindspore model
        torch_model: torch model
    """
    print("start load")
    # load torch parameter and mindspore parameter
    torch_param_dict = torch_model
    ms_param_dict = model.parameters_dict()
    count = 0
    for ms_key in ms_param_dict.keys():
        ms_key_tmp = ms_key.split('.')
        if ms_key_tmp[0] == 'roberta_embedding_lookup':
            count += 1
            update_torch_to_ms(torch_param_dict, ms_param_dict, 'embeddings.word_embeddings.weight', ms_key)
        elif ms_key_tmp[0] == 'roberta_embedding_postprocessor':
            if ms_key_tmp[1] == "token_type_embedding":
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict, 'embeddings.token_type_embeddings.weight', ms_key)
            elif ms_key_tmp[1] == "full_position_embedding":
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict, 'embeddings.position_embeddings.weight',
                                   ms_key)
            elif ms_key_tmp[1] == "layernorm":
                if ms_key_tmp[2] == "gamma":
                    count += 1
                    update_torch_to_ms(torch_param_dict, ms_param_dict, 'embeddings.LayerNorm.weight',
                                       ms_key)
                else:
                    count += 1
                    update_torch_to_ms(torch_param_dict, ms_param_dict, 'embeddings.LayerNorm.bias',
                                       ms_key)
        elif ms_key_tmp[0] == "roberta_encoder":
            if ms_key_tmp[3] == 'attention':
                par = ms_key_tmp[4].split('_')[0]
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict, 'encoder.layer.'+ms_key_tmp[2]+'.'+ms_key_tmp[3]+'.'
            elif ms_key_tmp[3] == 'attention_output':
                if ms_key_tmp[4] == 'dense':
                    count += 1
                    update_torch_to_ms(torch_param_dict, ms_param_dict,
                                       'encoder.layer.' + ms_key_tmp[2] + '.attention.output.'+ms_key_tmp[4]+'.'+ms_key_tmp[5],
                                       ms_key)
                elif ms_key_tmp[4] == 'layernorm':
                    if ms_key_tmp[5] == 'gamma':
                        count += 1
                        update_torch_to_ms(torch_param_dict, ms_param_dict,
                                           'encoder.layer.' + ms_key_tmp[2] + '.attention.output.LayerNorm.weight',
                                           ms_key)
                    else:
                        count += 1
                        update_torch_to_ms(torch_param_dict, ms_param_dict,
                                           'encoder.layer.' + ms_key_tmp[2] + '.attention.output.LayerNorm.bias',
                                           ms_key)
            elif ms_key_tmp[3] == 'intermediate':
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict,
                                   'encoder.layer.' + ms_key_tmp[2] + '.intermediate.dense.'+ms_key_tmp[4],
                                   ms_key)
            elif ms_key_tmp[3] == 'output':
                if ms_key_tmp[4] == 'dense':
                    count += 1
                    update_torch_to_ms(torch_param_dict, ms_param_dict,
                                       'encoder.layer.' + ms_key_tmp[2] + '.output.dense.'+ms_key_tmp[5],
                                       ms_key)
                elif ms_key_tmp[4] == 'layernorm':
                    if ms_key_tmp[5] == 'gamma':
                        count += 1
                        update_torch_to_ms(torch_param_dict, ms_param_dict,
                                           'encoder.layer.' + ms_key_tmp[2] + '.output.LayerNorm.weight',
                                           ms_key)
                    else:
                        count += 1
                        update_torch_to_ms(torch_param_dict, ms_param_dict,
                                           'encoder.layer.' + ms_key_tmp[2] + '.output.LayerNorm.bias',
                                           ms_key)
        if ms_key_tmp[0] == 'dense':
            if ms_key_tmp[1] == 'weight':
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict,
            else:
                count += 1
                update_torch_to_ms(torch_param_dict, ms_param_dict,
    save_checkpoint(model, "./model/roberta-base.ckpt")
    print("finish load")
Note that the converted parameters must correspond one-to-one with the original ones. Otherwise, loading the weights will fail later, or the loss will not decrease during training! After conversion, print the parameter keys and values on both sides to catch mistakes and omissions.
This gives us the converted roberta.ckpt file for loading weights. The weight file here must be consistent with the TensorFlow weight file!
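The advice to check parameter keys after conversion can be automated with a small coverage check. The sketch below is framework-free and purely illustrative: `check_conversion` is a hypothetical helper, and the key names in the demo are shortened examples in the style of the conversion function, not a full parameter list.

```python
def check_conversion(ms_keys, torch_keys, mapping):
    """Compare a MindSpore-to-PyTorch name mapping against both parameter lists.

    ms_keys:    iterable of MindSpore parameter names
    torch_keys: iterable of PyTorch parameter names
    mapping:    dict {ms_name: torch_name} recorded during conversion

    Returns (unmapped_ms, missing_torch): MindSpore names that have no entry
    in the mapping, and mapped PyTorch names that do not exist in the
    PyTorch weights.
    """
    torch_keys = set(torch_keys)
    unmapped_ms = sorted(k for k in ms_keys if k not in mapping)
    missing_torch = sorted(t for t in mapping.values() if t not in torch_keys)
    return unmapped_ms, missing_torch


if __name__ == "__main__":
    ms_keys = [
        "roberta_embedding_lookup.embedding_table",
        "roberta_embedding_postprocessor.layernorm.gamma",
        "roberta_embedding_postprocessor.layernorm.beta",
    ]
    torch_keys = [
        "embeddings.word_embeddings.weight",
        "embeddings.LayerNorm.weight",
        "embeddings.LayerNorm.bias",
    ]
    mapping = {
        "roberta_embedding_lookup.embedding_table": "embeddings.word_embeddings.weight",
        "roberta_embedding_postprocessor.layernorm.gamma": "embeddings.LayerNorm.weight",
    }
    unmapped, missing = check_conversion(ms_keys, torch_keys, mapping)
    print("MindSpore parameters never converted:", unmapped)
    print("PyTorch names referenced but absent:", missing)
```

Both returned lists should be empty before the checkpoint is used for training. A non-empty first list means some MindSpore parameter would keep its random initialization.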
Data processing
MindSpore and PyTorch also differ in how data is fed to the model. Here we directly use our own dataset module, which covers the data processing for many kinds of NLP tasks. A dataset can be converted into MindRecord format and then used for model training and validation. Let's take a closer look!
SST-2 dataset
from typing import Union, Dict, List

import mindspore.dataset as ds

from ..base_dataset import CLSBaseDataset


class SST2Dataset(CLSBaseDataset):
    """
    SST2 dataset.

    Args:
        paths (Union[str, Dict[str, str]], Optional): Dataset file path or Dataset directory path, default None.
        tokenizer (Union[str]): Tokenizer function, default 'spacy'.
        lang (str): Tokenizer language, default 'en'.
        max_size (int, Optional): Vocab max size, default None.
        min_freq (int, Optional): Min word frequency, default None.
        padding (str): Padding token, default `<pad>`.
        unknown (str): Unknown token, default `<unk>`.
        buckets (List[int], Optional): Padding row to the length of buckets, default None.

    Examples:
        >>> sst2 = SST2Dataset(tokenizer='spacy', lang='en')
        # sst2 = SST2Dataset(tokenizer='spacy', lang='en', buckets=[16,32,64])
        >>> ds = sst2()
    """

    def __init__(self, paths: Union[str, Dict[str, str]] = None,
                 tokenizer: Union[str] = 'spacy', lang: str = 'en', max_size: int = None, min_freq: int = None,
                 padding: str = '<pad>', unknown: str = '<unk>',
                 buckets: List[int] = None):
        super(SST2Dataset, self).__init__(sep='\t', name='SST-2')
        self._paths = paths
        self._tokenize = tokenizer
        self._lang = lang
        self._vocab_max_size = max_size
        self._vocab_min_freq = min_freq
        self._padding = padding
        self._unknown = unknown
        self._buckets = buckets

    def __call__(self) -> Dict[str, ds.MindDataset]:
        self.process(tokenizer=self._tokenize, lang=self._lang, max_size=self._vocab_max_size,
                     min_freq=self._vocab_min_freq, padding=self._padding,
                     unknown=self._unknown, buckets=self._buckets)
        return self.mind_datasets
from mindtext.dataset.classification import SST2Dataset

# Process the SST-2 sentiment analysis dataset. If a cache exists, you only need to read the cache directly.
dataset = SST2Dataset(paths='./mindtext/dataset/SST-2',
                      columns_list=['input_ids', 'attention_mask', 'label'],
                      test_columns_list=['input_ids', 'attention_mask'],
                      batch_size=64)
ds = dataset()  # Generate the corresponding train and dev mindrecord files
ds = dataset.from_cache( columns_list=['input_ids', 'attention_mask','label'],
                         test_columns_list=['input_ids', 'attention_mask'],
dev_dataset = ds['dev']  # Take the validation set, already converted to mindrecord, for evaluation
The generated MindRecord data consists of two files: a .mindrecord file and a .mindrecord.db file. Do not rename them arbitrarily: the two files are linked by a mapping, and if you force a rename, reading the MindRecord files will fail!
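Since a missing or renamed companion file only shows up as a read error later, a quick check before training can save time. This is a minimal sketch using only the standard library; `missing_mindrecord_files` is a hypothetical helper, and the file name `dev.mindrecord` in the demo is made up.

```python
import tempfile
from pathlib import Path


def missing_mindrecord_files(mindrecord_path):
    """Return the files of a MindRecord pair that are absent on disk.

    The data file `<name>.mindrecord` and its companion `<name>.mindrecord.db`
    must both exist and keep matching names.
    """
    data_file = Path(mindrecord_path)
    index_file = data_file.with_name(data_file.name + ".db")
    return [str(p) for p in (data_file, index_file) if not p.is_file()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "dev.mindrecord"  # made-up file name
        data.write_bytes(b"")
        # Only the data file exists, so the .db companion is reported.
        print(missing_mindrecord_files(data))
```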
In addition, the columns produced by data processing must match the inputs expected by the model. For example, the input columns for our RoBERTa model are ['input_ids', 'attention_mask', 'label'].
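To show how `input_ids` and `attention_mask` relate, here is a small sketch that pads one tokenized sentence to a fixed length. The defaults `seq_length=128` and `pad_token_id=1` are taken from the YAML configuration later in this article; the helper `pad_to_length` and the token ids in the demo are made up, and the truncation is deliberately naive (it may cut off the end-of-sequence token).

```python
def pad_to_length(token_ids, seq_length=128, pad_token_id=1):
    """Truncate or right-pad `token_ids` and build the matching attention mask."""
    ids = list(token_ids)[:seq_length]
    attention_mask = [1] * len(ids)          # 1 marks a real token
    padding = seq_length - len(ids)
    ids = ids + [pad_token_id] * padding
    attention_mask = attention_mask + [0] * padding  # 0 marks padding
    return ids, attention_mask


if __name__ == "__main__":
    # Made-up ids: 0 and 2 stand for the bos/eos ids from the configuration.
    ids, mask = pad_to_length([0, 11, 12, 13, 2], seq_length=8)
    print(ids)   # [0, 11, 12, 13, 2, 1, 1, 1]
    print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```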
Main structure of the model
Project architecture:
For the architecture we mainly follow the way fastNLP divides things. The model is split into three parts: Encoder, Embedding, and Tokenizer. The architecture will be further optimized later.
"""Roberta Embedding."""
import logging
from typing import Tuple

import mindspore.nn as nn
from mindspore import Tensor
from mindspore.train.serialization import load_checkpoint, load_param_into_net

from mindtext.modules.encoder.roberta import RobertaModel, RobertaConfig

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


class RobertaEmbedding(nn.Cell):
    """
    This is a class that loads pre-trained weight files into the model.
    """

    def __init__(self, roberta_config: RobertaConfig, is_training: bool = False):
        super(RobertaEmbedding, self).__init__()
        self.roberta = RobertaModel(roberta_config, is_training)

    def init_robertamodel(self, roberta):
        """
        Manual initialization BertModel
        """

    def from_pretrain(self, ckpt_file):
        """
        Load the model parameters from checkpoint
        """
        param_dict = load_checkpoint(ckpt_file)
        load_param_into_net(self.roberta, param_dict)

    def construct(self, input_ids: Tensor, input_mask: Tensor) -> Tuple[Tensor, Tensor]:
        """
        Returns the result of the model after loading the pre-training weights

        Args:
            input_ids: A vector containing the transformation of characters into corresponding ids.
            input_mask: the mask for input_ids.

        Returns:
            sequence_output: the sequence output.
            pooled_output: the pooled output of first token: cls.
        """
        sequence_output, pooled_output, _ = self.roberta(input_ids, input_mask)
        return sequence_output, pooled_output
This part is mainly used to load the pre-trained weights, that is, the converted ckpt weight file we obtained earlier.
class RobertaModel(nn.Cell):
    """
    Used from mindtext.modules.encoder.roberta with Roberta

    Args:
        config (Class): Configuration for RobertaModel.
        is_training (bool): True for training mode. False for eval mode.
        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form.
            Default: False.
    """

    def __init__(self,
                 config: RobertaConfig,
                 is_training: bool,
                 use_one_hot_embeddings: bool = False):
        super(RobertaModel, self).__init__()
        config = copy.deepcopy(config)
        # Disable dropout in evaluation mode.
        if not is_training:
            config.hidden_dropout_prob = 0.0
            config.attention_probs_dropout_prob = 0.0
        self.seq_length = config.seq_length
        self.hidden_size = config.hidden_size
        self.num_hidden_layers = config.num_hidden_layers
        self.embedding_size = config.hidden_size
        self.token_type_ids = None
        self.compute_type = numbtpye2mstype(config.compute_type)
        self.last_idx = self.num_hidden_layers - 1
        output_embedding_shape = [-1, self.seq_length, self.embedding_size]
        self.roberta_embedding_lookup = nn.Embedding(
        self.roberta_embedding_postprocessor = EmbeddingPostprocessor(
        self.roberta_encoder = RobertaTransformer(
        self.cast = P.Cast()
        self.dtype = numbtpye2mstype(config.dtype)
        self.cast_compute_type = SecurityCast()
        self.slice = P.StridedSlice()
        self.squeeze_1 = P.Squeeze(axis=1)
        self.dense = nn.Dense(self.hidden_size, self.hidden_size,
        self._create_attention_mask_from_input_mask = CreateAttentionMaskFromInputMask(config)

    def construct(self, input_ids: Tensor, input_mask: Tensor) -> Tuple[Tensor, Tensor, Tensor]:
        """Bidirectional Encoder Representations from Transformers.

        Args:
            input_ids: A vector containing the transformation of characters into corresponding ids.
            input_mask: the mask for input_ids.

        Returns:
            sequence_output: the sequence output.
            pooled_output: the pooled output of first token: cls.
            embedding_table: fixed embedding table.
        """
        # embedding
        embedding_tables = self.roberta_embedding_lookup.embedding_table
        word_embeddings = self.roberta_embedding_lookup(input_ids)
        embedding_output = self.roberta_embedding_postprocessor(input_ids,
        # attention mask [batch_size, seq_length, seq_length]
        attention_mask = self._create_attention_mask_from_input_mask(input_mask)
        # roberta encoder
        encoder_output = self.roberta_encoder(self.cast_compute_type(embedding_output),
        sequence_output = self.cast(encoder_output[self.last_idx], self.dtype)
        # pooler: take the hidden state of the first token of every sequence
        batch_size = P.Shape()(input_ids)[0]
        sequence_slice = self.slice(sequence_output,
                                    (0, 0, 0),
                                    (batch_size, 1, self.hidden_size),
                                    (1, 1, 1))
        first_token = self.squeeze_1(sequence_slice)
        pooled_output = self.dense(first_token)
        pooled_output = self.cast(pooled_output, self.dtype)
        return sequence_output, pooled_output, embedding_tables
This is the main body of our model. The whole model lives in roberta.py under the encoder directory. As you can see, the encoder part mainly produces two things: encoder_output and sequence_output. Part of this implementation follows the BERT model in the MindSpore ModelZoo project (you can consult the MindSpore official website when migrating your own models), so you only need to assemble a few important modules and you are done. These modules are:
EncoderOutput: the module applied after every sub-layer, with a residual connection and normalization.
RobertaAttention: a single multi-head self-attention layer.
RobertaEncoderCell: a single RobertaEncoder layer.
RobertaTransformer: stacks multiple RobertaEncoderCell layers to form the complete RoBERTa module.
This greatly reduces the time spent building the model architecture and is also a good way to learn how to use the MindSpore framework.
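As a rough picture of what an EncoderOutput-style module computes, the sketch below applies a residual connection followed by layer normalization to a single hidden vector in plain Python. It is an illustration under simplifying assumptions, not the MindText implementation: the dense projection and dropout are omitted, and the function names are invented. The names `gamma` and `beta` follow MindSpore's LayerNorm parameters, which correspond to `weight` and `bias` in PyTorch, the same renaming handled during weight conversion.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-12):
    """Normalize `x` to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def encoder_output(sublayer_out, residual_in, gamma, beta):
    """Residual connection followed by layer normalization."""
    summed = [a + b for a, b in zip(sublayer_out, residual_in)]
    return layer_norm(summed, gamma, beta)


if __name__ == "__main__":
    hidden = [0.5, 0.5, 0.5, 0.5]        # input to the sub-layer
    attended = [1.0, 2.0, 3.0, 4.0]      # output of the sub-layer
    print(encoder_output(attended, hidden, gamma=[1.0] * 4, beta=[0.0] * 4))
```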
Tokenization is encapsulated in the dataset module. You can specify the name of a pre-trained model available in the Hugging Face library to load the vocabulary and other files directly, which is very convenient. For details, refer back to the code in the data processing section. If you want to implement another downstream task that the dataset module does not yet cover, you can also use the tokenizer to build MindRecord data in your own way.
def get_tokenizer(tokenize_method: str, lang: str = 'en'):
    """
    Get a tokenizer.

    Args:
        tokenize_method (str): Select a tokenizer method.
        lang (str): Tokenizer language, default English.

    Returns:
        function: A tokenizer function.
    """
    tokenizer_dict = {
        'spacy': None,
        'raw': _split,
        'cn-char': _cn_split,
    }
    if tokenize_method == 'spacy':
        import spacy
        if lang != 'en':
            raise RuntimeError("Spacy only supports english")
        # spaCy 3.x loads the English pipeline by its full package name.
        if parse_version(spacy.__version__) >= parse_version('3.0'):
            en = spacy.load('en_core_web_sm')
        else:
            en = spacy.load(lang)

        def _spacy_tokenizer(text):
            return [w.text for w in en.tokenizer(text)]

        tokenizer = _spacy_tokenizer
    elif tokenize_method in tokenizer_dict:
        tokenizer = tokenizer_dict[tokenize_method]
    else:
        raise RuntimeError(f"Only support {tokenizer_dict.keys()} tokenizer.")
    return tokenizer
Model parameter loading
In the model's main file, roberta.py, the RobertaConfig module loads the parameters from a YAML file and passes them to RobertaModel. The YAML parameter configuration is as follows:
seq_length: 128
vocab_size: 50265
hidden_size: 768
bos_token_id: 0
eos_token_id: 2
num_hidden_layers: 12
num_attention_heads: 12
intermediate_size: 3072
hidden_act: "gelu"
hidden_dropout_prob: 0.1
attention_probs_dropout_prob: 0.1
max_position_embeddings: 514
pad_token_id: 1
type_vocab_size: 1
initializer_range: 0.02
use_relative_positions: False
dtype: mstype.float32
compute_type: mstype.float32
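A few of these values are tied to each other, which a quick sanity check makes visible. The sketch below uses a hypothetical `TinyRobertaConfig` dataclass (not the article's RobertaConfig) holding a subset of the fields above. It derives the per-head size and counts the parameters in the three embedding tables.

```python
from dataclasses import dataclass


@dataclass
class TinyRobertaConfig:
    """Hypothetical stand-in holding a few of the YAML fields above."""
    seq_length: int = 128
    vocab_size: int = 50265
    hidden_size: int = 768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    intermediate_size: int = 3072
    max_position_embeddings: int = 514
    type_vocab_size: int = 1

    def head_size(self):
        """Size of each attention head; hidden_size must split evenly across heads."""
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    def embedding_params(self):
        """Parameters in the word, position and token-type embedding tables."""
        rows = self.vocab_size + self.max_position_embeddings + self.type_vocab_size
        return rows * self.hidden_size


if __name__ == "__main__":
    cfg = TinyRobertaConfig()
    print(cfg.head_size())         # 64
    print(cfg.embedding_params())  # 38999040
```

These numbers are also useful when checking converted weights: for instance, the word embedding table should have shape (50265, 768).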
Model training
Next comes the part you have been waiting for: model training! First, we load the weights converted for the MindSpore framework into the RoBERTa model and initialize it. As you can see, the hyperparameter configuration written in YAML is passed to RobertaModel to instantiate it, and the weights are then loaded into RobertaModel with MindSpore's built-in load_checkpoint and load_param_into_net functions. We do not load the weights into the task model directly; instead, a RobertaEmbedding layer is nested in between. Also note that is_training is set to True, and num_labels is set according to the requirements of the specific downstream task.
roberta_config_file = "./mindtext/config/test.yaml"
roberta_config = RobertaConfig.from_yaml_file(roberta_config_file)
rbm = RobertaModel(roberta_config, True)
param_dict = load_checkpoint('./mindtext/pretrain/roberta-base-ms.ckpt')
p = load_param_into_net(rbm, param_dict)
em = RobertaEmbedding(roberta_config, True)
roberta = RobertaforSequenceClassification(roberta_config, is_training=True, num_labels=2)
After the model weights are loaded, set the learning rate lr, the number of training epochs, the loss function, the optimizer, and so on, and training can begin! Here we use the learning_rate suggested in the paper: 3e-5. We also use warm-up for the learning rate: training starts with a small learning rate and switches to the preset one once the model has stabilized, which makes the model converge faster and perform better. The number of epochs is set to 3.
epoch_num = 3
save_path = "./mindtext/pretrain/output/roberta-base_sst.ckpt"
lr_schedule = RobertaLearningRate(learning_rate=3e-5,
                                  warmup_steps=int(train_dataset.get_dataset_size() * epoch_num * 0.1),
                                  decay_steps=train_dataset.get_dataset_size() * epoch_num,
params = roberta.trainable_params()
optimizer = AdamWeightDecay(params, lr_schedule, eps=1e-8)
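The shape of a warm-up schedule is easier to see with numbers. The internals of RobertaLearningRate are not shown in this article, so the sketch below only assumes a common pattern: linear warm-up over the first 10% of steps, then polynomial decay over the total number of steps. The function name, `end_lr`, `power`, and the figure of 1000 steps per epoch are all assumptions made for illustration.

```python
def warmup_polynomial_lr(step, base_lr, warmup_steps, decay_steps, end_lr=0.0, power=1.0):
    """Linear warm-up to `base_lr`, then polynomial decay towards `end_lr`."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = min(step, decay_steps) / decay_steps
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr


if __name__ == "__main__":
    steps_per_epoch = 1000                    # made-up dataset size
    epoch_num = 3
    total_steps = steps_per_epoch * epoch_num
    warmup_steps = int(total_steps * 0.1)     # same 10% ratio as in the code above
    for step in (0, 150, 300, 1500, 3000):
        lr = warmup_polynomial_lr(step, 3e-5, warmup_steps, total_steps)
        print(f"step {step:4d}: lr = {lr:.2e}")
```

In this sketch the decay is computed from the global step, so the rate at the end of warm-up is already slightly below 3e-5. Other implementations start the decay from the full rate after warm-up.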
When everything is ready, start training. Finally, save the fine-tuned model parameters to a ckpt file at the specified path.
def train(train_data, roberta, optimizer, save_path, epoch_num):
    update_cell = DynamicLossScaleUpdateCell(loss_scale_value=2 ** 32, scale_factor=2, scale_window=1000)
    netwithgrads = RobertaFinetuneCell(roberta, optimizer=optimizer, scale_update_cell=update_cell)
    callbacks = [TimeMonitor(train_data.get_dataset_size()), LossCallBack(train_data.get_dataset_size())]
    model = Model(netwithgrads)
    model.train(epoch_num, train_data, callbacks=callbacks, dataset_sink_mode=False)
    save_checkpoint(model.train_network, save_path)
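The training function relies on dynamic loss scaling, configured through `loss_scale_value`, `scale_factor`, and `scale_window`. The toy class below sketches the usual logic behind these three parameters in plain Python: shrink the scale when gradients overflow, and grow it again after a window of clean steps. It is an illustration of the general technique under that assumption, not MindSpore's implementation, and the class name is invented.

```python
class DynamicLossScale:
    """Toy version of dynamic loss scaling for mixed-precision training."""

    def __init__(self, loss_scale_value=2 ** 32, scale_factor=2, scale_window=1000):
        self.scale = float(loss_scale_value)
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, overflow):
        """Adjust the scale after one step; return True if the step should be skipped."""
        if overflow:
            # Gradients overflowed: shrink the scale and skip this update.
            self.scale = max(self.scale / self.scale_factor, 1.0)
            self.good_steps = 0
            return True
        self.good_steps += 1
        if self.good_steps >= self.scale_window:
            # A full window without overflow: try a larger scale again.
            self.scale *= self.scale_factor
            self.good_steps = 0
        return False


if __name__ == "__main__":
    scaler = DynamicLossScale(loss_scale_value=8, scale_factor=2, scale_window=3)
    for overflow in (True, False, False, False):
        skipped = scaler.update(overflow)
        print(f"overflow={overflow} skipped={skipped} scale={scaler.scale}")
```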
Model evaluation
Evaluation follows roughly the same process as training. Read the validation set that was converted to MindRecord format and use it to evaluate model performance.
dataset = SST2Dataset(paths='./mindtext/dataset/SST-2',
                      columns_list=['input_ids', 'attention_mask', 'label'],
                      test_columns_list=['input_ids', 'attention_mask'],
                      batch_size=64)
ds = dataset.from_cache( columns_list=['input_ids', 'attention_mask','label'],
                         test_columns_list=['input_ids', 'attention_mask'],
dev_dataset = ds['dev']
Then load the hyperparameter configuration and the model weight file, and load the weights into the RoBERTa model with the from_pretrain function. Note that the downstream task here also needs the is_training and num_labels parameters, set according to the specific task.
roberta_config_file = "./mindtext/conf/test.yaml"
roberta_config = RobertaConfig.from_yaml_file(roberta_config_file)
roberta = RobertaforSequenceClassification(roberta_config, is_training=False, num_labels=2, dropout_prob=0.0)
model_path = "./mindtext/pretrain/output/roberta_trainsst2.ckpt"
Finally, you can start the evaluation! This is a text classification task: the labels predicted by the model are compared with the true labels, and the resulting accuracy is a decimal in the range [0, 1].
def eval(eval_data, model):
    metric = Accuracy('classification')
    squeeze = mindspore.ops.Squeeze(1)
    for batch in tqdm(eval_data.create_dict_iterator(num_epochs=1), total=eval_data.get_dataset_size()):
        input_ids = batch['input_ids']
        input_mask = batch['attention_mask']
        label_ids = batch['label']
        inputs = {"input_ids": input_ids,
                  "input_mask": input_mask
                  }
        output = model(**inputs)
        # Turn the logits into class probabilities before updating the metric.
        sm = mindspore.nn.Softmax(axis=-1)
        output = sm(output)
        metric.update(output, squeeze(label_ids))
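What the accuracy metric computes can be written out in a few lines of plain Python. The sketch below is not the metric class used in the evaluation function; `softmax` and `accuracy` are stand-alone helpers, and the logits and labels are made up. It picks the most probable class for every example and reports the fraction that matches the true label.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def accuracy(batch_logits, labels):
    """Fraction of examples whose most probable class equals the true label."""
    correct = 0
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        predicted = probs.index(max(probs))
        correct += int(predicted == label)
    return correct / len(labels)


if __name__ == "__main__":
    # Made-up logits for three sentences, two classes (0 = negative, 1 = positive).
    logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2]]
    labels = [0, 1, 1]
    print(accuracy(logits, labels))  # two of three predictions are correct
```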