PaddleMM: a multimodal learning toolkit based on PaddlePaddle
2022-06-11 19:26:00 [PaddlePaddle]


With the rapid development of computer vision, natural language processing, speech recognition, and other technologies, existing AI systems have achieved remarkable results on single-modality data. In real life, however, data comes in many forms: the text we read, the sounds we hear, the videos we watch, and so on. Such multi-source, heterogeneous information is called multimodal data, and in machine learning the algorithms that mine and analyze it are classified as multimodal learning methods. To help artificial intelligence better understand real-world environments, multimodal learning has attracted wide attention from researchers in recent years and has made great progress in its application areas.
By task objective, multimodal learning can be divided into modal joint learning, cross-modal learning, and multi-task pre-training frameworks:
Modal joint learning: focuses on combining information from different modalities to learn a shared objective, e.g., multimodal classification and regression;
Cross-modal learning: focuses on the correlations between modalities, e.g., cross-modal retrieval and image caption generation;
Multi-task pre-training frameworks: perform unsupervised pre-training based on Transformer, then fine-tune on a variety of downstream tasks.
Project introduction

Project address:
https://github.com/njustkmg/PaddleMM
The multimodal learning toolkit PaddleMM is built on Baidu's PaddlePaddle framework. It provides a model library of modal joint learning and cross-modal learning algorithms, offering efficient solutions for processing multimodal data such as images and text and helping multimodal learning applications land in practice:
Task scenarios: the toolkit provides algorithm model libraries for multimodal learning tasks such as multimodal fusion, cross-modal retrieval, and image caption generation;
Application cases: algorithms from the toolkit have been deployed in real applications, such as sneaker authenticity verification, image caption generation, and public opinion monitoring.
Multimodal model library
At present, PaddleMM provides a multimodal algorithm model library for the image and text modalities (under continuous update), comprising four modules: multimodal classification, cross-modal retrieval, image caption generation, and a pre-training framework.
Multimodal classification
Multimodal classification algorithms fall into two groups, model-agnostic methods (early/late fusion) and model-based methods (such as CMML, LMF, and TMC), as follows (a minimal sketch of the model-agnostic strategies appears after the list):
Early fusion / late fusion: fuse the information of different modalities at the feature level or the prediction level, respectively, before performing the classification task;
CMML [1]: achieves robust multimodal classification through adaptive weighted fusion, targeting scenarios with inconsistent modalities;
LMF [2]: low-rank multimodal fusion classification based on modality-specific factors;
TMC [3]: a multimodal fusion strategy for trusted prediction.
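To make the two model-agnostic strategies concrete, here is a minimal sketch; the feature tensors, dimensions, and classifier layers are all hypothetical:

import paddle
import paddle.nn as nn

img_feat = paddle.randn([8, 512])   # hypothetical image features
txt_feat = paddle.randn([8, 300])   # hypothetical bag-of-words text features

# early fusion: concatenate modality features, then classify once
early_clf = nn.Linear(512 + 300, 10)
logits_early = early_clf(paddle.concat([img_feat, txt_feat], axis=-1))

# late fusion: classify each modality separately, then combine the predictions
img_clf, txt_clf = nn.Linear(512, 10), nn.Linear(300, 10)
logits_late = (img_clf(img_feat) + txt_clf(txt_feat)) / 2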
Cross-modal retrieval
The image-text retrieval task extracts features from the two modalities and maps them into a unified space, learning to match and score images and texts in this shared semantic space. During training, the model scores both matched positive samples and constructed mismatched negative samples, and learns the difference between positive and negative image-text pairs. The toolkit implements typical cross-modal retrieval algorithms at different granularities, listed below; a sketch of the ranking objective used over positive and negative pairs follows the list:
VSE++ [4]: extracts global features for image and text and computes matching scores from these global features;
SCAN [5]: learns matching and scoring from image regions and text words;
IMRAM [6]: introduces iteratively accumulated similarity to refine image-text alignment;
SGRAF [7]: adds graph-based relational similarity to measure image-text matching.
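The positive/negative scoring described above is typically trained with a hinge-based triplet ranking loss. The sketch below sums over all in-batch negatives and is illustrative only; VSE++, for example, keeps just the hardest negative per positive:

import paddle

def triplet_ranking_loss(scores, margin=0.2):
    # scores[i, j]: match score between image i and caption j;
    # the diagonal holds the matched (positive) pairs
    batch_size = scores.shape[0]
    diagonal = paddle.diag(scores).reshape([batch_size, 1])
    # penalize negatives that come within `margin` of the positive score
    cost_caption = paddle.clip(margin + scores - diagonal, min=0)
    cost_image = paddle.clip(margin + scores - paddle.t(diagonal), min=0)
    mask = 1 - paddle.eye(batch_size)  # exclude the positive pairs themselves
    return ((cost_caption + cost_image) * mask).sum()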
Image caption generation
The image caption generation task produces a textual description for an image. It is generally based on an encoder-decoder architecture: an encoder extracts features from the image, and a decoder generates words one by one conditioned on the image information. The toolkit implements two different algorithms:
ShowAttendTell [8]: uses VGG to encode the image into a feature map and an LSTM to decode; during decoding, an attention mechanism selects the relevant image context to guide word generation at the current step (a sketch of this soft-attention step follows the list);
AoANet [9]: builds the encoder and decoder with a Transformer-like structure and extends the original self-attention with an attention-on-attention (AoA) module.
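As an illustration of the soft-attention step in ShowAttendTell-style decoding, here is a sketch; the projection layers, shapes, and names are assumptions, not the toolkit's actual API:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

def attend(feature_map, decoder_state, w_feat, w_state, w_score):
    # feature_map: [batch, regions, dim]; decoder_state: [batch, hidden_dim]
    # score each image region against the current decoder state
    scores = w_score(paddle.tanh(w_feat(feature_map) + w_state(decoder_state).unsqueeze(1)))
    alpha = F.softmax(scores, axis=1)            # attention weights over regions
    context = (alpha * feature_map).sum(axis=1)  # attended image context, [batch, dim]
    return context, alpha

# hypothetical projections for 512-d region features and a 512-d decoder state
w_feat, w_state, w_score = nn.Linear(512, 256), nn.Linear(512, 256), nn.Linear(256, 1)
context, alpha = attend(paddle.randn([4, 49, 512]), paddle.randn([4, 512]),
                        w_feat, w_state, w_score)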
Pre-training framework
Multimodal pre-training frameworks are Transformer-based models: they are first pre-trained with self-supervised tasks on large-scale unlabeled multimodal data, and then fine-tuned with supervision on task-specific labels downstream. The toolkit currently provides the ViLBert [10] algorithm, which builds separate Transformers for image and text and exchanges modal information during the attention computation; a co-attention sketch follows.
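A minimal co-attention sketch in the spirit of that two-stream design; the dimensions and dummy tensors are illustrative, not ViLBert's actual configuration:

import paddle
import paddle.nn as nn

img_tokens = paddle.randn([2, 36, 768])  # dummy image-region features
txt_tokens = paddle.randn([2, 16, 768])  # dummy text-token embeddings

img_from_txt = nn.MultiHeadAttention(embed_dim=768, num_heads=12)
txt_from_img = nn.MultiHeadAttention(embed_dim=768, num_heads=12)

# each stream queries the other stream's keys and values (co-attention)
img_out = img_from_txt(query=img_tokens, key=txt_tokens, value=txt_tokens)
txt_out = txt_from_img(query=txt_tokens, key=img_tokens, value=img_tokens)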
Toolkit description
PaddleMM is divided into three modules: data processing, model library, and trainer. The data module includes different processing pipelines for images and text, such as global-feature and local-feature processing for images, and word counting, tokenization, and BERT processing for text. The model library covers tasks such as multimodal classification, cross-modal retrieval, and caption generation. The trainer provides different training procedures and strategies for the different tasks, along with a variety of task-specific evaluation metrics.
When calling PaddleMM, three variables need to be defined: the model configuration file, the data directory, and the experiment log directory. The hyperparameters of all models are stored in the configs folder, and a model's configuration file contains two kinds of hyperparameters. The first kind are toolkit-level hyperparameters, such as the task type and the required text and image processing formats; from these, the toolkit selects the corresponding data readers, trainer, and evaluation metrics. The second kind are model-level hyperparameters, i.e., the recommended optimal settings for the model in use. A hypothetical sketch of such a configuration follows.
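The sketch below illustrates, as a Python dict, the two kinds of hyperparameters such a configuration might hold; every key name here is a guess at the layout, not the repository's actual schema:

# hypothetical configuration contents; the actual keys in configs/ may differ
config = {
    # toolkit-level: drives the choice of data reader, trainer, and metrics
    "task": "multimodal_classification",
    "text_mode": "bow",    # bag-of-words text processing
    "image_mode": "raw",   # raw image input
    # model-level: the recommended hyperparameters for the chosen model
    "model": {"name": "CMML", "hidden_dim": 512, "learning_rate": 1e-4},
}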

In actual use, the toolkit generally involves the following steps: set hyperparameters -> initialize the toolkit -> train the model -> test the metrics. For example:
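A minimal sketch of that calling flow, assuming hypothetical parameter names for the three variables described above:

from paddlemm import PaddleMM

# parameter names are assumptions based on the description above
runner = PaddleMM(config='configs/cmml.yml',   # model configuration file
                  data_root='data/COCO',       # data directory
                  out_root='experiment/cmml')  # experiment log directory

runner.train()  # model training
runner.test()   # metric testing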

Model development
We take the CMML [1] algorithm as an example to see how to develop a multimodal learning model on the PaddlePaddle platform. CMML mainly targets classification tasks: for scenarios where modal information is inconsistent, it adaptively fuses the predictions of different modalities with learned weights.

Specifically, CMML takes labeled and unlabeled multimodal data as input, computes single-modality predictions through a text prediction network and an image prediction network respectively, and then optimizes the following objectives:
Sufficiency measure: a modal attention network computes weights for the single-modality predictions on labeled data and fuses them adaptively; a cross-entropy loss is computed against the ground-truth labels;
Divergence measure: computes the difference between the predictions of different modalities on labeled data; the goal is to enlarge this difference and highlight each modality's strengths, improving the model's generalization ability;
Robust consistency: computes a Huber loss between the predictions of different modalities on unlabeled data; the goal is to suppress the interference of inconsistent samples and impose a robust consistency constraint on the model.
Finally, CMML combines the three objectives into the model loss, which is optimized by gradient descent.
AI Studio project address:
https://aistudio.baidu.com/aistudio/projectdetail/2423256
Data loading
CMML extracts bag-of-words features from the text data and trains on raw (global) image data; it is also a semi-supervised learning model. Therefore, in the data module we need to prepare bag-of-words features for both the supervised and the unsupervised texts, the raw pixel matrices of the images, and labels for the supervised data.
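A minimal sketch of how such data could be organized for loading; the class and field names are hypothetical, not PaddleMM's actual interface:

from paddle.io import Dataset

class CmmlDataset(Dataset):
    # pairs bag-of-words text features with raw images;
    # labels is None for the unsupervised portion of the data
    def __init__(self, txt_feats, imgs, labels=None):
        self.txt_feats = txt_feats  # [N, vocab_size] bag-of-words matrix
        self.imgs = imgs            # [N, 3, H, W] image tensors
        self.labels = labels        # [N, num_labels] targets, or None

    def __getitem__(self, idx):
        if self.labels is None:
            return self.txt_feats[idx], self.imgs[idx]
        return self.txt_feats[idx], self.imgs[idx], self.labels[idx]

    def __len__(self):
        return len(self.imgs)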
Model definition
The main CMML model consists of three parts:
Text prediction network
self.txt_hidden = nn.Sequential(
    nn.Linear(input_dim, hidden_dim * 2),
    nn.ReLU(),
    nn.Linear(hidden_dim * 2, hidden_dim),
    nn.ReLU()
)  # two-layer MLP mapping bag-of-words input to text hidden features
self.txt_predict = nn.Linear(hidden_dim, num_labels)  # per-label text predictions
Image prediction network
# assumes: from paddle.vision import models
self.resnet = models.resnet34(pretrained=True)
self.resnet = nn.Sequential(*list(self.resnet.children())[:-1])  # drop the final fc layer
self.img_hidden = nn.Linear(512, hidden_dim)          # project pooled ResNet-34 features
self.img_predict = nn.Linear(hidden_dim, num_labels)  # per-label image predictions
Modal attention learning network
self.attn_mlp = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim // 2),
    nn.ReLU(),
    nn.Linear(hidden_dim // 2, 1)
)  # scores a modality's hidden features to produce its attention weight
In training, CMML computes the following three optimization objectives:
Sufficiency measure
# single-modality predictions on the labeled (supervised) batch
supervised_txt_hidden = self.txt_hidden(supervised_txt)
supervised_txt_predict = self.sigmoid(self.txt_predict(supervised_txt_hidden))

supervised_img_hidden = self.resnet(supervised_img)
supervised_img_hidden = paddle.reshape(supervised_img_hidden,
                                       shape=[supervised_img_hidden.shape[0], 512])
supervised_img_hidden = self.img_hidden(supervised_img_hidden)
supervised_img_predict = self.sigmoid(self.img_predict(supervised_img_hidden))

# per-sample modality weights from the attention network
attn_txt = self.attn_mlp(supervised_txt_hidden)
attn_img = self.attn_mlp(supervised_img_hidden)
attn_modality = self.softmax(paddle.concat([attn_txt, attn_img], axis=1))
# column 0 is the text weight and column 1 the image weight, matching the concat order
attn_txt = attn_modality[:, 0:1]
attn_img = attn_modality[:, 1:2]

# adaptively fused prediction, with cross-entropy against the true labels
supervised_hidden = attn_txt * supervised_txt_hidden + attn_img * supervised_img_hidden
supervised_predict = self.sigmoid(self.modality_predict(supervised_hidden))
mm_loss = self.criterion(supervised_predict, label)
Divergence measure
# cosine similarity between the two modalities' predictions on labeled data;
# minimizing it pushes the modal predictions apart (divergence)
similar = paddle.bmm(supervised_img_predict.unsqueeze(1), supervised_txt_predict.unsqueeze(2))
similar = paddle.reshape(similar, shape=[supervised_img_predict.shape[0]])
norm_matrix_img = paddle.norm(supervised_img_predict, p=2, axis=1)
norm_matrix_text = paddle.norm(supervised_txt_predict, p=2, axis=1)
div = paddle.mean(similar / (norm_matrix_img * norm_matrix_text))
Robust consistency
# cosine distance between the modalities' predictions on unlabeled data
unsimilar = paddle.bmm(unsupervised_img_predict.unsqueeze(1), unsupervised_txt_predict.unsqueeze(2))
unsimilar = paddle.reshape(unsimilar, shape=[unsupervised_img_predict.shape[0]])
unnorm_matrix_img = paddle.norm(unsupervised_img_predict, p=2, axis=1)
unnorm_matrix_text = paddle.norm(unsupervised_txt_predict, p=2, axis=1)
dis = 2 - unsimilar / (unnorm_matrix_img * unnorm_matrix_text)

# Huber loss: quadratic within the threshold cita, linear beyond it,
# which limits the influence of inconsistent (outlier) samples
mask_1 = paddle.abs(dis) < self.cita
tensor1 = paddle.masked_select(dis, mask_1)
mask_2 = paddle.abs(dis) >= self.cita
tensor2 = paddle.masked_select(dis, mask_2)
tensor1loss = paddle.sum(tensor1 * tensor1 / 2)
tensor2loss = paddle.sum(self.cita * (paddle.abs(tensor2) - 1 / 2 * self.cita))
unsupervised_loss = (tensor1loss + tensor2loss) / unsupervised_img.shape[0]
Model training
During CMML training, the model reads the image and text features from the data loader, feeds them into the model, computes the three optimization objectives above, and combines them into the loss used for optimization, sketched below.
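A sketch of that combination; lambda_div and lambda_cons are hypothetical weighting hyperparameters, not names taken from PaddleMM:

# combine the three objectives computed above into one loss
loss = mm_loss + self.lambda_div * div + self.lambda_cons * unsupervised_loss

loss.backward()          # gradient descent on the combined objective
optimizer.step()
optimizer.clear_grad()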
Summary
PaddleMM provides multimodal learning models and application tools. This article introduced the toolkit's algorithms and usage, and finally demonstrated the development process of the multimodal classification algorithm CMML on the PaddlePaddle platform. We hope it contributes to real-world applications of multimodal learning on PaddlePaddle; exchanges and stars are welcome!
Related links
Multimodal learning materials:
https://github.com/njustkmg/Multi-Modal-Learning
CMML reproduction AI Studio project address:
https://aistudio.baidu.com/aistudio/projectdetail/2423256
Thanks to the Baidu Talent Think Tank (TIC) department and the Baidu PaddlePaddle platform for supporting the development of this toolkit:
NJUST-KMG team:
http://www.njustkmg.cn/
Baidu Talent Think Tank (TIC) department:
https://ai.baidu.com/solution/recruitment
Baidu PaddlePaddle platform:
https://www.paddlepaddle.org.cn/
References
[1] Yang Y, Wang K T, Zhan D C, et al. Comprehensive semi-supervised multi-modal learning. IJCAI. 2019.
[2] Liu Z, Shen Y, Lakshminarasimhan V B, et al. Efficient low-rank multimodal fusion with modality-specific factors. ACL. 2018.
[3] Han Z, Zhang C, Fu H, et al. Trusted multi-view classification. ICLR. 2021.
[4] Faghri F, Fleet D J, Kiros J R, et al. VSE++: Improving visual-semantic embeddings with hard negatives. BMVC. 2018.
[5] Lee K H, Chen X, Hua G, et al. Stacked cross attention for image-text matching. ECCV. 2018.
[6] Chen H, Ding G, et al. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. CVPR. 2020.
[7] Diao H, Zhang Y, Ma L, et al. Similarity reasoning and filtration for image-text matching. AAAI. 2021.
[8] Xu K, Ba J, Kiros R, et al. Show, attend and tell: Neural image caption generation with visual attention. ICML. 2015.
[9] Huang L, Wang W, Chen J, et al. Attention on attention for image captioning. CVPR. 2019.
[10] Lu J, Batra D, et al. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS. 2019.

Follow the [PaddlePaddle] official account for more technical content.