Long-term study notes on image-text pre-training
2022-06-10 11:24:00 【Liangzi plum】
Everybody knows there are many ways to pre-train on text and images. The pre-training methods described in papers all look reasonable, but when you actually sit down to implement them, it is easy to feel directionless and confused. So how are the common pre-training tasks actually done? In this article I record what I have learned about them.
Task 1: MLM

Masked language modeling.

The most common pre-training task: mask a token in a sentence and use the context to predict it.
Source code: GitHub - zr2021/2021_QQ_AIAC_Tack1_1st — 1st place solution of the QQ Browser 2021 AI Algorithm Competition, Track 1.

This task masks a percentage of the tokens (15% in the original BERT paper; 20% in this code) and has the model predict them.
Let's see how the code does it.
if 'mlm' in sample_task:
    input_ids, lm_label = self.lm.torch_mask_tokens(text_input_ids.cpu())
    text_input_ids = input_ids.to(text_input_ids.device)
    lm_label = lm_label[:, :].to(text_input_ids.device)  # [SEP] card MASK The master [SEP]
    return_mlm = True

First look at the first line: self.lm is the MaskLM class below. Its constructor takes two parameters. The first is the masking probability; the second is the tokenizer path, usually used to load a BERT tokenizer such as bert-base-chinese.
from typing import Any, Optional, Tuple

import torch
from transformers import AutoTokenizer


class MaskLM(object):
    def __init__(self, tokenizer_path='bert-base-chinese', mlm_probability=0.2):
        self.mlm_probability = mlm_probability
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    def torch_mask_tokens(self, inputs: Any, special_tokens_mask: Optional[Any] = None) -> Tuple[Any, Any]:
        """
        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
        """
        labels = inputs.clone()
        # We sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        if special_tokens_mask is None:
            special_tokens_mask = [
                self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
            ]
            special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)
        else:
            special_tokens_mask = special_tokens_mask.bool()
        probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # We only compute loss on masked tokens

        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

        # 10% of the time, we replace masked input tokens with random word
        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

        # The rest of the time (10% of the time) we keep the masked input tokens unchanged
        return inputs, labels

Let's walk through the code line by line. You can see this function produces the masked input ids and the corresponding labels.
labels = inputs.clone()
# We sample a few tokens in each sequence for MLM training (with probability `self.mlm_probability`)
probability_matrix = torch.full(labels.shape, self.mlm_probability)

labels starts as a clone of the input: the label should be the input itself, because the prediction target after masking is the original token.
probability_matrix is a matrix with the same shape as labels, where every element is the masking probability.
if special_tokens_mask is None:
    special_tokens_mask = [
        self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    ]
    special_tokens_mask = torch.tensor(special_tokens_mask, dtype=torch.bool)

This calls the tokenizer's get_special_tokens_mask function, which marks the special-token positions: the returned mask is 1 wherever there is a special token. It is then turned into a tensor and converted to True/False.
For example, the positions holding 102 ([SEP]) and 0 ([PAD]) get mask value 1, while 108 is an ordinary character, such as a colon, rather than one of BERT's special tokens.

probability_matrix.masked_fill_(special_tokens_mask, value=0.0)

This sets the positions of BERT's special tokens in the probability matrix to 0; every other position keeps the masking probability.
masked_indices = torch.bernoulli(probability_matrix).bool()

torch.bernoulli draws binary random numbers from a Bernoulli distribution: feed in 0.2 and you get 1 with probability 0.2.
So in masked_indices, roughly 20% of the text positions are True.
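To make this concrete, here's a tiny self-contained sketch (toy shapes and a made-up special-token layout, not from the repo):

import torch

probability_matrix = torch.full((2, 6), 0.2)                  # (bat, len), all 0.2
special = torch.tensor([[1, 0, 0, 0, 0, 1],                   # pretend [CLS] ... [SEP]/[PAD]
                        [1, 0, 0, 0, 1, 1]], dtype=torch.bool)
probability_matrix.masked_fill_(special, 0.0)                 # never mask special tokens
masked_indices = torch.bernoulli(probability_matrix).bool()
print(masked_indices)  # True with probability 0.2, never at special positions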
labels[~masked_indices] = -100

Positions that are not masked have their label set to -100 (the loss will ignore them); masked positions keep the original token id as the label.
indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

Of the masked positions, 80% are replaced with the [MASK] token.
indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
inputs[indices_random] = random_words[indices_random]

Of the masked positions not already replaced with [MASK] (the remaining 20%), half get a random token in place of the original: 0.2 × 0.5 = 0.1, i.e. 10% of the masked positions.
The remaining 10% are left untouched.
# The rest of the time (10% of the time) we keep the masked input tokens unchanged
return inputs, labels

The function returns the masked input and the labels: in the labels, masked positions hold the original token, and unmasked positions hold -100.
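A quick empirical check of the resulting 80/10/10 split over the masked positions (a sketch with toy sizes; the numbers are approximate because of sampling):

import torch

n = 100_000
masked = torch.ones(n, dtype=torch.bool)                           # pretend all positions are masked
replaced = torch.bernoulli(torch.full((n,), 0.8)).bool() & masked
rand = torch.bernoulli(torch.full((n,), 0.5)).bool() & masked & ~replaced
kept = masked & ~replaced & ~rand
print(replaced.float().mean(), rand.float().mean(), kept.float().mean())
# ~0.8 [MASK], ~0.1 random token, ~0.1 unchanged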
text_input_ids = input_ids.to(text_input_ids.device)
lm_label = lm_label[:, :].to(text_input_ids.device)  # [SEP] card MASK The master [SEP]
return_mlm = True

Some simple device bookkeeping; later the mlm term gets added to the loss. At this point we have the masked tokens and their labels.
Now let's see how the loss is computed. This is a video-plus-text task, but it differs little from a pure-text one.
encoder_outputs = self.bert(video_feature, video_mask, text_input_ids, text_mask)
if return_mlm:
    return encoder_outputs, self.cls(encoder_outputs)[:, 1 + video_feature.size()[1]:, :]

encoder_outputs is the last-layer BERT output for the masked input, of shape (bat, len, dim).
Then look at self.cls: it is the class below, the official BertLMPredictionHead from transformers.
class BertLMPredictionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transform = BertPredictionHeadTransform(config)
        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states)
        return hidden_states
Inside, transform is a transitional (linear + activation + layernorm) block. Then decoder is a linear classifier: from 768 dimensions to 21128 classes (the vocabulary size). The slice at the end exists because this BERT's input is [CLS] + video + text, so [CLS] and the video span are cut off, leaving only the text positions.
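To make the slice concrete, a toy sketch (assuming 1 [CLS] token, 8 video frames and 20 text tokens; the numbers are made up):

import torch

bat, n_video, n_text, dim = 2, 8, 20, 768
encoder_outputs = torch.randn(bat, 1 + n_video + n_text, dim)  # [CLS] + video + text
text_part = encoder_outputs[:, 1 + n_video:, :]                # drop [CLS] and video
print(text_part.shape)  # torch.Size([2, 20, 768])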
if 'mlm' in sample_task:
    pred = lm_prediction_scores.contiguous().view(-1, self.vocab_size)
    masked_lm_loss = nn.CrossEntropyLoss()(pred, lm_label.contiguous().view(-1))
    loss += masked_lm_loss / 1.25 / len(sample_task)
Calculation loss.
first Get the predicted value and Flattening . Flattening here means (bat,length,vocab_size)-》(bat*length,vocab_size) The advantage of this is that you can calculate the whole batch Of loss Instead of adding up after calculation .
loss It is commonly used for classification cross. Because it is equivalent to classifying every word , Category says yes 2W many . In this mission , mlm Weight importance divided by 1.25.
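Note that nn.CrossEntropyLoss ignores the label -100 by default (ignore_index=-100), which is exactly why unmasked positions were labeled -100 above. A minimal check with made-up numbers:

import torch
import torch.nn as nn

logits = torch.randn(4, 21128)                  # 4 flattened tokens, vocab of 21128
labels = torch.tensor([-100, 532, -100, 102])   # only 2 tokens were masked
loss = nn.CrossEntropyLoss()(logits, labels)    # averages over the 2 non-ignored tokens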
Once we have the loss, the MLM task is done!
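As a wrap-up, a minimal end-to-end sanity check of the MaskLM class above (a sketch: it assumes transformers is installed and bert-base-chinese can be downloaded, and the sample sentence is made up):

import torch

lm = MaskLM(tokenizer_path='bert-base-chinese', mlm_probability=0.2)
text_input_ids = lm.tokenizer(['今天天气很好'], return_tensors='pt')['input_ids']
masked_ids, labels = lm.torch_mask_tokens(text_input_ids.clone())  # clone: masking mutates inputs in place
print(masked_ids)  # some positions become [MASK] (id 103) or a random id
print(labels)      # -100 everywhere except the masked positions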
Task 2: MFM

Short for masked frame modeling.

MLM predicts masked tokens; MFM predicts masked frames instead, i.e. the visual side. Based on the visible frame features across the whole temporal sequence, the model predicts the hidden frame features; the masked frames are replaced with all zeros.
The code comes from the same repository as above.
Here's how it is done; back to the familiar starting point.
if 'mfm' in sample_task:
    vm_input = video_feature
    input_feature, video_label = self.vm.torch_mask_frames(video_feature.cpu(), video_mask.cpu())
    video_feature = input_feature.to(video_feature.device)
    video_label = video_label.to(video_feature.device)

vm_input presumably just records the original frame features; they will be useful later. The main thing is the masking function below.
class MaskVideo(object):
    def __init__(self, mlm_probability=0.15):
        self.mlm_probability = 0.15  # note: the constructor argument is ignored

    def torch_mask_frames(self, video_feature, video_mask):
        probability_matrix = torch.full(video_mask.shape, 0.9 * self.mlm_probability)
        probability_matrix = probability_matrix * video_mask
        masked_indices = torch.bernoulli(probability_matrix).bool()

        video_labels_index = torch.arange(video_feature.size(0) * video_feature.size(1)).view(-1, video_feature.size(1))
        video_labels_index = -100 * ~masked_indices + video_labels_index * masked_indices

        # 90% mask video fill all 0.0
        masked_indices_unsqueeze = masked_indices.unsqueeze(-1).expand_as(video_feature)
        inputs = video_feature.data.masked_fill(masked_indices_unsqueeze, 0.0)
        labels = video_feature[masked_indices_unsqueeze].contiguous().view(-1, video_feature.size(2))

        return inputs, video_labels_index

First step: generate the probability matrix. Why it is multiplied by 0.9 I don't quite understand — couldn't you just set a lower probability directly? Next, while meaningless positions in text can be found via the tokenizer, frames have no tokenizer, so the padding mask video_mask has to be passed in and multiplied in. That way the probability at meaningless (padded) positions becomes 0.
probability_matrix = torch.full(video_mask.shape, 0.9 * self.mlm_probability)
probability_matrix = probability_matrix * video_mask
masked_indices = torch.bernoulli(probability_matrix).bool()

As in MLM, the probability matrix is fed through the Bernoulli distribution to get 0s and 1s. In short, an entry of 0.135 (= 0.9 × 0.15) becomes 1 with probability 0.135 and 0 with probability 0.865. You end up with some scattered True values; those are the positions that will be masked.
video_labels_index = torch.arange(video_feature.size(0) * video_feature.size(1)).view(-1, video_feature.size(1))

This generates position indices running from 0 to bat*frame_len - 1, reshaped to (bat, frame_len).
video_labels_index = -100 * ~masked_indices + video_labels_index * masked_indices

In this step, the term before the plus sign is -100 at unmasked positions and 0 at masked positions; the term after it is the position index at masked positions and 0 at unmasked positions. Summed: masked positions get their position index, unmasked positions get -100.
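A toy example of the arithmetic (a hypothetical 1 × 4 batch):

import torch

masked_indices = torch.tensor([[False, True, False, True]])
idx = torch.arange(4).view(1, 4)                                   # [[0, 1, 2, 3]]
video_labels_index = -100 * ~masked_indices + idx * masked_indices
print(video_labels_index)  # tensor([[-100, 1, -100, 3]])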
masked_indices_unsqueeze = masked_indices.unsqueeze(-1).expand_as(video_feature)

Expand the mask along the 768-dimensional feature axis.

inputs = video_feature.data.masked_fill(masked_indices_unsqueeze, 0.0)

All values at the masked frame positions are set to 0.
labels = video_feature[masked_indices_unsqueeze].contiguous().view(-1, video_feature.size(2))

This line is actually unused, but let's see what it does. .contiguous() fixes the memory layout so that the subsequent view is possible.
You can see that labels gathers the feature values at the masked positions and reshapes them to (num_masked, dim) — effectively a direct label tensor. Yet it is never returned.


Only the position labels are returned. The idea, presumably: if we know where the masks are, we can fetch the original features by position later.
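A toy end-to-end run of torch_mask_frames, showing how the original features could be recovered from the positions (shapes made up; MaskVideo is the class above):

import torch

mv = MaskVideo()
video_feature = torch.randn(2, 8, 768)                  # (bat, frames, dim)
video_mask = torch.ones(2, 8)                           # no padded frames in this toy batch
inputs, labels_idx = mv.torch_mask_frames(video_feature, video_mask)
pos = (labels_idx != -100).view(-1)
targets = video_feature.view(-1, 768)[pos]              # original features of the masked frames
assert (inputs.view(-1, 768)[pos] == 0).all()           # those positions were zeroed in the input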
video_feature = input_feature.to(video_feature.device)
video_label = video_label.to(video_feature.device)

The usual device moves. At this point we have the masked frame features and the masked positions.
encoder_outputs = self.bert(video_feature, video_mask, text_input_ids, text_mask)
if return_mlm:
    return encoder_outputs, self.cls(encoder_outputs)[:, 1 + video_feature.size()[1]:, :]

The inputs are the (masked) frame features plus the text. encoder_outputs is the last-layer BERT output; the sliced second return value is what mlm needs. For mfm we need the features themselves.
features, lm_prediction_scores = self.roberta(video_feature, video_mask, text_input_ids, text_mask, return_mlm=return_mlm)
if 'mfm' in sample_task:
    vm_output = self.roberta_mvm_lm_header(features[:, 1:video_feature.size()[1] + 1, :])
    masked_vm_loss = self.calculate_mfm_loss(vm_output, vm_input,
                                             video_mask, video_label, normalize=False)
    loss += masked_vm_loss / 3 / len(sample_task)

features is the BERT output, of shape (bat, len, dim). Now look at the lm header function.
self.roberta_mvm_lm_header = VisualOnlyMLMHead(uni_bert_cfg)

class VisualLMPredictionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transform = VisualPredictionHeadTransform(config)
        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = nn.Linear(config.hidden_size, 768, bias=False)
        self.bias = nn.Parameter(torch.zeros(768))
        # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states)
        return hidden_states

The source is the same repository as above.
Inside, transform is again a transitional (linear + activation + layernorm) block. Then decoder is a linear layer that generates features rather than class scores: from 768 dimensions to 768 (the hidden frame feature dimension).
class VisualPredictionHeadTransform(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        if isinstance(config.hidden_act, str):
            self.transform_act_fn = ACT2FN[config.hidden_act]
        else:
            self.transform_act_fn = config.hidden_act
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.transform_act_fn(hidden_states)
        hidden_states = self.LayerNorm(hidden_states)
        return hidden_states
The slice earlier exists because BERT's input is [CLS] + video + text; this time [CLS] and the text span are cut off, keeping only the video positions.
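Analogous to the earlier text slice, but keeping the video span this time (again with made-up sizes):

import torch

bat, n_video, n_text, dim = 2, 8, 20, 768
features = torch.randn(bat, 1 + n_video + n_text, dim)  # [CLS] + video + text
video_part = features[:, 1:n_video + 1, :]              # drop [CLS] and text
print(video_part.shape)  # torch.Size([2, 8, 768])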
masked_vm_loss = self.calculate_mfm_loss(vm_output, vm_input,
                                         video_mask, video_label, normalize=False)

This computes the reconstruction loss. Its inputs are the model's predictions, the original frame features, plus the padding mask and the position labels.
This function is quite long.
def calculate_mfm_loss(self, video_feature_output, video_feature_input,
                       video_mask, video_labels_index, normalize=False, temp=0.1):
    if normalize:
        video_feature_output = torch.nn.functional.normalize(video_feature_output, p=2, dim=2)
        video_feature_input = torch.nn.functional.normalize(video_feature_input, p=2, dim=2)

    afm_scores_tr = video_feature_output.view(-1, video_feature_output.shape[-1])

    video_tr = video_feature_input.permute(2, 0, 1)
    video_tr = video_tr.view(video_tr.shape[0], -1)

    logits_matrix = torch.mm(afm_scores_tr, video_tr)
    if normalize:
        logits_matrix = logits_matrix / temp

    video_mask_float = video_mask.to(dtype=torch.float)
    mask_matrix = torch.mm(video_mask_float.view(-1, 1), video_mask_float.view(1, -1))
    masked_logits = logits_matrix + (1. - mask_matrix) * -1e8

    logpt = F.log_softmax(masked_logits, dim=-1)
    logpt = torch.diag(logpt)
    nce_loss = -logpt

    video_labels_index_mask = (video_labels_index != -100)
    nce_loss = nce_loss.masked_select(video_labels_index_mask.view(-1))
    nce_loss = nce_loss.mean()
    return nce_loss

The normalization branch is skipped here (normalize=False); it probably makes little difference either way.
afm_scores_tr = video_feature_output.view(-1, video_feature_output.shape[-1])

This flattens the output features from (bat, length, dim) to (bat*length, dim), which makes it easy to compute the loss over the whole batch.
video_tr = video_feature_input.permute(2, 0, 1)
video_tr = video_tr.view(video_tr.shape[0], -1)

Here the original frame features are transposed: from (bat, length, dim) to (dim, bat, length), and then reshaped to (dim, bat*length) — presumably to make the multiplication below convenient.
logits_matrix = torch.mm(afm_scores_tr, video_tr)

Sure enough: predictions and originals are multiplied, giving a (bat*length, bat*length) similarity matrix.
video_mask_float = video_mask.to(dtype=torch.float)
mask_matrix = torch.mm(video_mask_float.view(-1, 1), video_mask_float.view(1, -1))
masked_logits = logits_matrix + (1. - mask_matrix) * -1e8

The padding mask is multiplied with itself as an outer product, meaning only positions where the mask is 1 are considered; where it is 0 there is no real frame. Since the product of predictions and originals above is x × x (with x = bat*length), each entry being the inner product of the features at that row and column, the mask is made x × x as well. Adding the two, every entry touching a masked-out (padding) position becomes essentially negative infinity.
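A toy view of this outer-product mask (4 flattened positions, one of them padding):

import torch

video_mask_float = torch.tensor([1., 1., 0., 1.])  # flattened mask; position 2 is padding
mask_matrix = torch.mm(video_mask_float.view(-1, 1), video_mask_float.view(1, -1))
print(1. - mask_matrix)
# every row/column touching position 2 is 1, so (1 - mask_matrix) * -1e8
# drives those logits to -1e8 and softmax assigns them ~0 probability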
logpt = F.log_softmax(masked_logits, dim=-1)
nce_loss = -torch.diag(logpt)

First a log-softmax over each row, then the diagonal is taken — i.e. the score of each position's predicted feature matched against its own original feature.
video_labels_index_mask = (video_labels_index != -100)

This recovers the masked positions. The loss values at those positions are selected out of nce_loss and averaged, giving the NCE loss (look up InfoNCE if it's unfamiliar). Once the loss value is returned, that's the whole MFM task.
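Putting the pieces together, a minimal sketch of what this loss computes (toy shapes, no padding; each masked position's own original feature is the positive, every other position is a negative):

import torch
import torch.nn.functional as F

out = torch.randn(6, 768)   # predicted features, flattened to (bat*length, dim)
inp = torch.randn(6, 768)   # original features, same layout
logits = out @ inp.t()                             # (6, 6) similarity matrix
logpt = torch.diag(F.log_softmax(logits, dim=-1))  # score of each position with itself
pos = torch.tensor([0, 1, 0, 1, 1, 0]).bool()      # the masked positions
nce_loss = (-logpt).masked_select(pos).mean()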
Task 3: ITM

Image-Text Matching: the task of judging whether the input image and text match.
The code comes from the same repository as above.
Intuitively, the original text and image in the dataset must match, so you have to find a way to create mismatched pairs. How do you shuffle them? I was curious, because I imagined picking random replacements one by one, which sounds like a lot of work. Let's see what the code does. Back to the familiar starting point.
if 'itm' in sample_task:
    input_feature, video_text_match_label = self.sv.torch_shuf_video(video_feature.cpu())
    video_feature = input_feature.to(video_feature.device)
    video_text_match_label = video_text_match_label.to(video_feature.device)

class ShuffleVideo(object):
    def __init__(self):
        pass

    def torch_shuf_video(self, video_feature):
        bs = video_feature.size()[0]
        # the first half of the videos in the batch keep their order; the second half is reversed
        shuf_index = torch.tensor(list(range(bs // 2)) + list(range(bs // 2, bs))[::-1])
        # labels after shuffling
        label = (torch.tensor(list(range(bs))) == shuf_index).float()
        video_feature = video_feature[shuf_index]
        return video_feature, label

This is the shuffling function, and seeing it things become clearer. The shuffling happens not when the data is loaded but inside the model's forward pass, simply by re-indexing the data matrix — no need to agonize over it in the collate step. Clever... though it didn't account for my batch size of 2!
shuf_index = torch.tensor(list(range(bs // 2)) + list(range(bs // 2, bs))[::-1])
label = (torch.tensor(list(range(bs))) == shuf_index).float()

This builds an index tensor. One difference between me and the experts, I think, is that they habitually manipulate index tensors like this, while my instinct is to operate on the objects directly.
If an index equals its original position, that pair was not disturbed and its label is 1; otherwise it is 0.
video_feature = video_feature[shuf_index]

The features are then re-indexed accordingly.
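A quick check of the shuffle with toy batch sizes — including the batch-size-2 edge case just mentioned:

import torch

def shuf_index(bs):
    return torch.tensor(list(range(bs // 2)) + list(range(bs // 2, bs))[::-1])

print(shuf_index(2))  # tensor([0, 1]): identity, nothing gets shuffled!
print(shuf_index(6))  # tensor([0, 1, 2, 5, 4, 3]): second half reversed
label = (torch.arange(6) == shuf_index(6)).float()
print(label)          # tensor([1., 1., 1., 0., 1., 0.]) — the middle of an odd-length reversed half also stays matched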
input_feature, video_text_match_label = self.sv.torch_shuf_video(video_feature.cpu())
video_feature = input_feature.to(video_feature.device)
video_text_match_label = video_text_match_label.to(video_feature.device)

Device moves again.
features, lm_prediction_scores = self.roberta(video_feature, video_mask, text_input_ids, text_mask, return_mlm=return_mlm)
if 'itm' in sample_task:
    pred = self.newfc_itm(features[:, 0, :])
    itm_loss = nn.BCEWithLogitsLoss()(pred.view(-1), video_text_match_label.view(-1))
    loss += itm_loss / 100 / len(sample_task)

features is the last-layer BERT output.
self.newfc_itm(features[:, 0, :])

This newfc_itm is just a Linear(768, 1), followed by a BCE loss. Notice that he turned ITM into a regression-style task on a single score rather than a two-class prediction task — perhaps because the loss from a hard prediction isn't informative enough: a prediction of 0.9 counts as 1 and so does 0.6, and the difference between them would be invisible.
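A minimal sketch of this head (made-up shapes; it assumes the [CLS] position is used as the fused video-text representation, as in the slice above):

import torch
import torch.nn as nn

newfc_itm = nn.Linear(768, 1)
features = torch.randn(4, 30, 768)                  # BERT output (bat, len, dim)
pred = newfc_itm(features[:, 0, :])                 # one score from the [CLS] position
label = torch.tensor([1., 1., 0., 1.])              # 1 = matched video-text pair
itm_loss = nn.BCEWithLogitsLoss()(pred.view(-1), label)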
And that's all three tasks. More will be added as I go.