Long-term study notes on image-text pre-training
2022-06-10 11:24:00 【Liangzi plum】
Everybody knows there are many ways to pre-train on text and images. The pre-training methods described in papers all look reasonable, but when you actually sit down to implement them, it is easy to feel directionless and confused. So how are the common pre-training tasks actually done? In this article I record what I have learned about them.
Task 1: MLM

Masked language modeling.

The most common pre-training task: mask a token in a sentence and use the context to predict it.
Source code: GitHub - zr2021/2021_QQ_AIAC_Tack1_1st — 1st place solution of the QQ Browser 2021 AI Algorithm Competition, Track 1.

This task masks a percentage of the tokens (15% in the original BERT paper; 20% in this code) and has the model predict them.
Let's see how the code does it.
if 'mlm' in sample_task:
    input_ids, lm_label = self.lm.torch_mask_tokens(text_input_ids.cpu())
    text_input_ids = input_ids.to(text_input_ids.device)
    lm_label = lm_label[:, :].to(text_input_ids.device)  # [SEP] card MASK The master [SEP]
    return_mlm = True

First look at the first line: self.lm is the MaskLM class below. Its constructor takes two parameters. The first is the masking probability; the second is the tokenizer path, usually used to load a BERT tokenizer such as bert-base-chinese.
from typing import Any, Optional, Tuple

import torch
from transformers import AutoTokenizer


class MaskLM(object):
    def __init__(self, tokenizer_path='bert-base-chinese', mlm_probability=0.2):
        self.mlm_probability = mlm_probability
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    def torch_mask_tokens(self, inputs: Any, special_tokens_mask: Optional[Any] = None) -> Tuple[Any, Any]:
        """
        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
        """
        labels = inputs.clone()
        # We sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        if special_tokens_mask is None:
            special_tokens_mask = [
                self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
            ]
            special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
        else:
            special_tokens_mask = special_tokens_mask.bool()
        probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # We only compute loss on masked tokens

        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

        # 10% of the time, we replace masked input tokens with random word
        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

        # The rest of the time (10% of the time) we keep the masked input tokens unchanged
        return inputs, labels

Let's walk through the code line by line. You can see this function produces the masked input ids and the corresponding labels.
labels = inputs.clone()
# We sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
probability_matrix = torch.full(labels.shape, self.mlm_probability)

labels starts as a clone of the input: the label should be the input itself, because the prediction target after masking is the original token.
probability_matrix is a matrix with the same shape as labels, where every element is the masking probability.
if special_tokens_mask is None:
    special_tokens_mask = [
        self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    ]
    special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)

This calls the tokenizer's get_special_tokens_mask function, which marks the special-token positions: the returned mask is 1 wherever there is a special token. It is then turned into a tensor and converted to True/False.
For example, the positions holding 102 ([SEP]) and 0 ([PAD]) get mask value 1, while 108 is an ordinary character, such as a colon, rather than one of BERT's special tokens.

probability_matrix.masked_fill_(special_tokens_mask, value=0.0)

This sets the positions of BERT's special tokens in the probability matrix to 0; every other position keeps the masking probability.
masked_indices = torch.bernoulli(probability_matrix).bool()

torch.bernoulli draws binary random numbers from a Bernoulli distribution: feed in 0.2 and you get 1 with probability 0.2.
So in masked_indices, roughly 20% of the text positions are True.
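To make this concrete, here's a tiny self-contained sketch (toy shapes and a made-up special-token layout, not from the repo):

import torch

probability_matrix = torch.full((2, 6), 0.2)                  # (bat, len), all 0.2
special = torch.tensor([[1, 0, 0, 0, 0, 1],                   # pretend [CLS] ... [SEP]/[PAD]
                        [1, 0, 0, 0, 1, 1]], dtype=torch.bool)
probability_matrix.masked_fill_(special, 0.0)                 # never mask special tokens
masked_indices = torch.bernoulli(probability_matrix).bool()
print(masked_indices)  # True with probability 0.2, never at special positions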
labels[~masked_indices] = -100

Positions that are not masked have their label set to -100 (the loss will ignore them); masked positions keep the original token id as the label.
indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

Of the masked positions, 80% are replaced with the [MASK] token.
indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
inputs[indices_random] = random_words[indices_random]

Of the masked positions not already replaced with [MASK] (the remaining 20%), half get a random token in place of the original: 0.2 × 0.5 = 0.1, i.e. 10% of the masked positions.
The remaining 10% are left untouched.
# The rest of the time (10% of the time) we keep the masked input tokens unchanged
return inputs, labels

The function returns the masked input and the labels: in the labels, masked positions hold the original token, and unmasked positions hold -100.
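A quick empirical check of the resulting 80/10/10 split over the masked positions (a sketch with toy sizes; the numbers are approximate because of sampling):

import torch

n = 100_000
masked = torch.ones(n, dtype=torch.bool)                           # pretend all positions are masked
replaced = torch.bernoulli(torch.full((n,), 0.8)).bool() & masked
rand = torch.bernoulli(torch.full((n,), 0.5)).bool() & masked & ~replaced
kept = masked & ~replaced & ~rand
print(replaced.float().mean(), rand.float().mean(), kept.float().mean())
# ~0.8 [MASK], ~0.1 random token, ~0.1 unchanged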
text_input_ids = input_ids.to(text_input_ids.device)
lm_label = lm_label[:, :].to(text_input_ids.device)  # [SEP] card MASK The master [SEP]
return_mlm = True

Some simple device bookkeeping; later the mlm term gets added to the loss. At this point we have the masked tokens and their labels.
Now let's see how the loss is computed. This is a video-plus-text task, but it differs little from a pure-text one.
encoder_outputs = self.bert(video_feature, video_mask, text_input_ids, text_mask)
if return_mlm:
    return encoder_outputs, self.cls(encoder_outputs)[:, 1 + video_feature.size()[1]:, :]

encoder_outputs is the last-layer BERT output for the masked input, of shape (bat, len, dim).
Then look at self.cls: it is the class below, the official BertLMPredictionHead from transformers.
class BertLMPredictionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transform = BertPredictionHeadTransform(config)
        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states)
        return hidden_states
Inside, transform is a transitional (linear + activation + layernorm) block. Then decoder is a linear classifier: from 768 dimensions to 21128 classes (the vocabulary size). The slice at the end exists because this BERT's input is [CLS] + video + text, so [CLS] and the video span are cut off, leaving only the text positions.
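To make the slice concrete, a toy sketch (assuming 1 [CLS] token, 8 video frames and 20 text tokens; the numbers are made up):

import torch

bat, n_video, n_text, dim = 2, 8, 20, 768
encoder_outputs = torch.randn(bat, 1 + n_video + n_text, dim)  # [CLS] + video + text
text_part = encoder_outputs[:, 1 + n_video:, :]                # drop [CLS] and video
print(text_part.shape)  # torch.Size([2, 20, 768])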
if 'mlm' in sample_task:
    pred = lm_prediction_scores.contiguous().view(-1, self.vocab_size)
    masked_lm_loss = nn.CrossEntropyLoss()(pred, lm_label.contiguous().view(-1))
    loss += masked_lm_loss / 1.25 / len(sample_task)
Calculation loss.
first Get the predicted value and Flattening . Flattening here means (bat,length,vocab_size)-》(bat*length,vocab_size) The advantage of this is that you can calculate the whole batch Of loss Instead of adding up after calculation .
loss It is commonly used for classification cross. Because it is equivalent to classifying every word , Category says yes 2W many . In this mission , mlm Weight importance divided by 1.25.
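Note that nn.CrossEntropyLoss ignores the label -100 by default (ignore_index=-100), which is exactly why unmasked positions were labeled -100 above. A minimal check with made-up numbers:

import torch
import torch.nn as nn

logits = torch.randn(4, 21128)                  # 4 flattened tokens, vocab of 21128
labels = torch.tensor([-100, 532, -100, 102])   # only 2 tokens were masked
loss = nn.CrossEntropyLoss()(logits, labels)    # averages over the 2 non-ignored tokens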
Once we have the loss, the MLM task is done!
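As a wrap-up, a minimal end-to-end sanity check of the MaskLM class above (a sketch: it assumes transformers is installed and bert-base-chinese can be downloaded, and the sample sentence is made up):

import torch

lm = MaskLM(tokenizer_path='bert-base-chinese', mlm_probability=0.2)
text_input_ids = lm.tokenizer(['今天天气很好'], return_tensors='pt')['input_ids']
masked_ids, labels = lm.torch_mask_tokens(text_input_ids.clone())  # clone: masking mutates inputs in place
print(masked_ids)  # some positions become [MASK] (id 103) or a random id
print(labels)      # -100 everywhere except the masked positions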
Task 2: MFM

Short for masked frame modeling.

MLM predicts masked tokens; MFM predicts masked frames instead, i.e. the visual side. Based on the visible frame features across the whole temporal sequence, the model predicts the hidden frame features; the masked frames are replaced with all zeros.
The code comes from the same repository as above.
Here's how it is done; back to the familiar starting point.
if 'mfm' in sample_task:
    vm_input = video_feature
    input_feature, video_label = self.vm.torch_mask_frames(video_feature.cpu(), video_mask.cpu())
    video_feature = input_feature.to(video_feature.device)
    video_label = video_label.to(video_feature.device)

vm_input presumably just records the original frame features; they will be useful later. The main thing is the masking function below.
class MaskVideo(object):
    def __init__(self, mlm_probability=0.15):
        self.mlm_probability = 0.15  # note: the constructor argument is ignored

    def torch_mask_frames(self, video_feature, video_mask):
        probability_matrix = torch.full(video_mask.shape, 0.9 * self.mlm_probability)
        probability_matrix = probability_matrix * video_mask
        masked_indices = torch.bernoulli(probability_matrix).bool()

        video_labels_index = torch.arange(video_feature.size(0) * video_feature.size(1)).view(-1, video_feature.size(1))
        video_labels_index = -100 * ~masked_indices + video_labels_index * masked_indices

        # 90% mask video fill all 0.0
        masked_indices_unsqueeze = masked_indices.unsqueeze(-1).expand_as(video_feature)
        inputs = video_feature.data.masked_fill(masked_indices_unsqueeze, 0.0)
        labels = video_feature[masked_indices_unsqueeze].contiguous().view(-1, video_feature.size(2))

        return inputs, video_labels_index

First step: generate the probability matrix. Why it is multiplied by 0.9 I don't quite understand — couldn't you just set a lower probability directly? Next, while meaningless positions in text can be found via the tokenizer, frames have no tokenizer, so the padding mask video_mask has to be passed in and multiplied in. That way the probability at meaningless (padded) positions becomes 0.
probability_matrix = torch.full(video_mask.shape, 0.9 * self.mlm_probability)
probability_matrix = probability_matrix * video_mask
masked_indices = torch.bernoulli(probability_matrix).bool()

As in MLM, the probability matrix is fed through the Bernoulli distribution to get 0s and 1s. In short, an entry of 0.135 (= 0.9 × 0.15) becomes 1 with probability 0.135 and 0 with probability 0.865. You end up with some scattered True values; those are the positions that will be masked.
video_labels_index = torch.arange(video_feature.size(0) * video_feature.size(1)).view(-1, video_feature.size(1))

This generates position indices running from 0 to bat*frame_len - 1, reshaped to (bat, frame_len).
video_labels_index = -100 * ~masked_indices + video_labels_index * masked_indices

In this step, the term before the plus sign is -100 at unmasked positions and 0 at masked positions; the term after it is the position index at masked positions and 0 at unmasked positions. Summed: masked positions get their position index, unmasked positions get -100.
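A toy example of the arithmetic (a hypothetical 1 × 4 batch):

import torch

masked_indices = torch.tensor([[False, True, False, True]])
idx = torch.arange(4).view(1, 4)                                   # [[0, 1, 2, 3]]
video_labels_index = -100 * ~masked_indices + idx * masked_indices
print(video_labels_index)  # tensor([[-100, 1, -100, 3]])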
masked_indices_unsqueeze = masked_indices.unsqueeze(-1).expand_as(video_feature)

Expand the mask along the 768-dimensional feature axis.

inputs = video_feature.data.masked_fill(masked_indices_unsqueeze, 0.0)

All values at the masked frame positions are set to 0.
labels = video_feature[masked_indices_unsqueeze].contiguous().view(-1, video_feature.size(2))

This line is actually unused, but let's see what it does. .contiguous() fixes the memory layout so that the subsequent view is possible.
You can see that labels gathers the feature values at the masked positions and reshapes them to (num_masked, dim) — effectively a direct label tensor. Yet it is never returned.


Only the position labels are returned. The idea, presumably: if we know where the masks are, we can fetch the original features by position later.
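A toy end-to-end run of torch_mask_frames, showing how the original features could be recovered from the positions (shapes made up; MaskVideo is the class above):

import torch

mv = MaskVideo()
video_feature = torch.randn(2, 8, 768)                  # (bat, frames, dim)
video_mask = torch.ones(2, 8)                           # no padded frames in this toy batch
inputs, labels_idx = mv.torch_mask_frames(video_feature, video_mask)
pos = (labels_idx != -100).view(-1)
targets = video_feature.view(-1, 768)[pos]              # original features of the masked frames
assert (inputs.view(-1, 768)[pos] == 0).all()           # those positions were zeroed in the input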
video_feature = input_feature.to(video_feature.device)
video_label = video_label.to(video_feature.device)

The usual device moves. At this point we have the masked frame features and the masked positions.
encoder_outputs = self.bert(video_feature, video_mask, text_input_ids, text_mask)
if return_mlm:
    return encoder_outputs, self.cls(encoder_outputs)[:, 1 + video_feature.size()[1]:, :]

The inputs are the (masked) frame features plus the text. encoder_outputs is the last-layer BERT output; the sliced second return value is what mlm needs. For mfm we need the features themselves.
features, lm_prediction_scores = self.roberta(video_feature, video_mask, text_input_ids, text_mask, return_mlm=return_mlm)
if 'mfm' in sample_task:
    vm_output = self.roberta_mvm_lm_header(features[:, 1:video_feature.size()[1] + 1, :])
    masked_vm_loss = self.calculate_mfm_loss(vm_output, vm_input,
                                             video_mask, video_label, normalize=False)
    loss += masked_vm_loss / 3 / len(sample_task)

features is the BERT output, of shape (bat, len, dim). Now look at the lm header function.
self.roberta_mvm_lm_header = VisualOnlyMLMHead(uni_bert_cfg)

class VisualLMPredictionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transform = VisualPredictionHeadTransform(config)
        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = nn.Linear(config.hidden_size, 768, bias=False)
        self.bias = nn.Parameter(torch.zeros(768))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states)
        return hidden_states

The source is the same repository as above.
Inside, transform is again a transitional (linear + activation + layernorm) block. Then decoder is a linear layer that generates features rather than class scores: from 768 dimensions to 768 (the hidden frame feature dimension).
class VisualPredictionHeadTransform(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        if isinstance(config.hidden_act, str):
            self.transform_act_fn = ACT2FN[config.hidden_act]
        else:
            self.transform_act_fn = config.hidden_act
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.transform_act_fn(hidden_states)
        hidden_states = self.LayerNorm(hidden_states)
        return hidden_states
The slice earlier exists because BERT's input is [CLS] + video + text; this time [CLS] and the text span are cut off, keeping only the video positions.
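Analogous to the earlier text slice, but keeping the video span this time (again with made-up sizes):

import torch

bat, n_video, n_text, dim = 2, 8, 20, 768
features = torch.randn(bat, 1 + n_video + n_text, dim)  # [CLS] + video + text
video_part = features[:, 1:n_video + 1, :]              # drop [CLS] and text
print(video_part.shape)  # torch.Size([2, 8, 768])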
masked_vm_loss = self.calculate_mfm_loss(vm_output, vm_input,
                                         video_mask, video_label, normalize=False)

This computes the reconstruction loss. Its inputs are the model's predictions, the original frame features, plus the padding mask and the position labels.
This function is quite long.
def calculate_mfm_loss(self, video_feature_output, video_feature_input,
                       video_mask, video_labels_index, normalize=False, temp=0.1):
    if normalize:
        video_feature_output = torch.nn.functional.normalize(video_feature_output, p=2, dim=2)
        video_feature_input = torch.nn.functional.normalize(video_feature_input, p=2, dim=2)

    afm_scores_tr = video_feature_output.view(-1, video_feature_output.shape[-1])

    video_tr = video_feature_input.permute(2, 0, 1)
    video_tr = video_tr.view(video_tr.shape[0], -1)

    logits_matrix = torch.mm(afm_scores_tr, video_tr)
    if normalize:
        logits_matrix = logits_matrix / temp

    video_mask_float = video_mask.to(dtype=torch.float)
    mask_matrix = torch.mm(video_mask_float.view(-1, 1), video_mask_float.view(1, -1))
    masked_logits = logits_matrix + (1. - mask_matrix) * -1e8

    logpt = F.log_softmax(masked_logits, dim=-1)
    logpt = torch.diag(logpt)
    nce_loss = -logpt

    video_labels_index_mask = (video_labels_index != -100)
    nce_loss = nce_loss.masked_select(video_labels_index_mask.view(-1))
    nce_loss = nce_loss.mean()
    return nce_loss

The normalization branch is skipped here (normalize=False); it probably makes little difference either way.
afm_scores_tr = video_feature_output.view(-1, video_feature_output.shape[-1])

This flattens the output features from (bat, length, dim) to (bat*length, dim), which makes it easy to compute the loss over the whole batch.
video_tr = video_feature_input.permute(2, 0, 1)
video_tr = video_tr.view(video_tr.shape[0], -1)

Here the original frame features are transposed: from (bat, length, dim) to (dim, bat, length), and then reshaped to (dim, bat*length) — presumably to make the multiplication below convenient.
logits_matrix = torch.mm(afm_scores_tr, video_tr)

Sure enough: predictions and originals are multiplied, giving a (bat*length, bat*length) similarity matrix.
video_mask_float = video_mask.to(dtype=torch.float)
mask_matrix = torch.mm(video_mask_float.view(-1, 1), video_mask_float.view(1, -1))
masked_logits = logits_matrix + (1. - mask_matrix) * -1e8

The padding mask is multiplied with itself as an outer product, meaning only positions where the mask is 1 are considered; where it is 0 there is no real frame. Since the product of predictions and originals above is x × x (with x = bat*length), each entry being the inner product of the features at that row and column, the mask is made x × x as well. Adding the two, every entry touching a masked-out (padding) position becomes essentially negative infinity.
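A toy view of this outer-product mask (4 flattened positions, one of them padding):

import torch

video_mask_float = torch.tensor([1., 1., 0., 1.])  # flattened mask; position 2 is padding
mask_matrix = torch.mm(video_mask_float.view(-1, 1), video_mask_float.view(1, -1))
print(1. - mask_matrix)
# every row/column touching position 2 is 1, so (1 - mask_matrix) * -1e8
# drives those logits to -1e8 and softmax assigns them ~0 probability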
logpt = F.log_softmax(masked_logits, dim=-1)
nce_loss = -torch.diag(logpt)

First a log-softmax over each row, then the diagonal is taken — i.e. the score of each position's predicted feature matched against its own original feature.
video_labels_index_mask = (video_labels_index != -100)

This recovers the masked positions. The loss values at those positions are selected out of nce_loss and averaged, giving the NCE loss (look up InfoNCE if it's unfamiliar). Once the loss value is returned, that's the whole MFM task.
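Putting the pieces together, a minimal sketch of what this loss computes (toy shapes, no padding; each masked position's own original feature is the positive, every other position is a negative):

import torch
import torch.nn.functional as F

out = torch.randn(6, 768)   # predicted features, flattened to (bat*length, dim)
inp = torch.randn(6, 768)   # original features, same layout
logits = out @ inp.t()                             # (6, 6) similarity matrix
logpt = torch.diag(F.log_softmax(logits, dim=-1))  # score of each position with itself
pos = torch.tensor([0, 1, 0, 1, 1, 0]).bool()      # the masked positions
nce_loss = (-logpt).masked_select(pos).mean()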
Task 3: ITM

Image-Text Matching: the task of judging whether the input image and text match.
The code comes from the same repository as above.
Intuitively, the original text and image in the dataset must match, so you have to find a way to create mismatched pairs. How do you shuffle them? I was curious, because I imagined picking random replacements one by one, which sounds like a lot of work. Let's see what the code does. Back to the familiar starting point.
if 'itm' in sample_task:
    input_feature, video_text_match_label = self.sv.torch_shuf_video(video_feature.cpu())
    video_feature = input_feature.to(video_feature.device)
    video_text_match_label = video_text_match_label.to(video_feature.device)

class ShuffleVideo(object):
    def __init__(self):
        pass

    def torch_shuf_video(self, video_feature):
        bs = video_feature.size()[0]
        # the first half of the videos in the batch keep their order; the second half is reversed
        shuf_index = torch.tensor(list(range(bs // 2)) + list(range(bs // 2, bs))[::-1])
        # labels after shuffling
        label = (torch.tensor(list(range(bs))) == shuf_index).float()
        video_feature = video_feature[shuf_index]
        return video_feature, label

This is the shuffling function, and seeing it things become clearer. The shuffling happens not when the data is loaded but inside the model's forward pass, simply by re-indexing the data matrix — no need to agonize over it in the collate step. Clever... though it didn't account for my batch size of 2!
shuf_index = torch.tensor(list(range(bs // 2)) + list(range(bs // 2, bs))[::-1])
label = (torch.tensor(list(range(bs))) == shuf_index).float()

This builds an index tensor. One difference between me and the experts, I think, is that they habitually manipulate index tensors like this, while my instinct is to operate on the objects directly.
If an index equals its original position, that pair was not disturbed and its label is 1; otherwise it is 0.
video_feature = video_feature[shuf_index]

The features are then re-indexed accordingly.
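A quick check of the shuffle with toy batch sizes — including the batch-size-2 edge case just mentioned:

import torch

def shuf_index(bs):
    return torch.tensor(list(range(bs // 2)) + list(range(bs // 2, bs))[::-1])

print(shuf_index(2))  # tensor([0, 1]): identity, nothing gets shuffled!
print(shuf_index(6))  # tensor([0, 1, 2, 5, 4, 3]): second half reversed
label = (torch.arange(6) == shuf_index(6)).float()
print(label)          # tensor([1., 1., 1., 0., 1., 0.]) — the middle of an odd-length reversed half also stays matched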
input_feature, video_text_match_label = self.sv.torch_shuf_video(video_feature.cpu())
video_feature = input_feature.to(video_feature.device)
video_text_match_label = video_text_match_label.to(video_feature.device)

Device moves again.
features, lm_prediction_scores = self.roberta(video_feature, video_mask, text_input_ids, text_mask, return_mlm=return_mlm)
if 'itm' in sample_task:
    pred = self.newfc_itm(features[:, 0, :])
    itm_loss = nn.BCEWithLogitsLoss()(pred.view(-1), video_text_match_label.view(-1))
    loss += itm_loss / 100 / len(sample_task)

features is the last-layer BERT output.
self.newfc_itm(features[:, 0, :])

This newfc_itm is just a Linear(768, 1), followed by a BCE loss. Notice that he turned ITM into a regression-style task on a single score rather than a two-class prediction task — perhaps because the loss from a hard prediction isn't informative enough: a prediction of 0.9 counts as 1 and so does 0.6, and the difference between them would be invisible.
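A minimal sketch of this head (made-up shapes; it assumes the [CLS] position is used as the fused video-text representation, as in the slice above):

import torch
import torch.nn as nn

newfc_itm = nn.Linear(768, 1)
features = torch.randn(4, 30, 768)                  # BERT output (bat, len, dim)
pred = newfc_itm(features[:, 0, :])                 # one score from the [CLS] position
label = torch.tensor([1., 1., 0., 1.])              # 1 = matched video-text pair
itm_loss = nn.BCEWithLogitsLoss()(pred.view(-1), label)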
And that's all three tasks. More will be added as I go.