Code representation learning: an introduction to CodeBERT and other related models
2022-07-25 11:07:00 【deephub】
What is CodeBERT?
CodeBERT is an extension of the BERT model developed by Microsoft in 2020. It is a bimodal pre-trained model for programming language (PL) and natural language (NL) that can be used for downstream NL-PL tasks. The model is trained on NL-PL pairs in six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
This article gives a brief overview of the paper and walks through a usage example. For more details on the math and the exact architecture behind the model, please refer to the original paper. At the end, besides CodeBERT, it also collects several recent derivative models based on it.
Before diving into the paper, let's first look at the downstream use cases that CodeBERT supports. Some of these use cases are already implemented in Microsoft tools, for example Visual Studio IntelliCode.
CodeBERT use cases
Code conversion or code translation: for example, when a developer wants to write Java code that does the same thing as existing Python code, code-to-code translation can help translate that code block.
Automatic code summarization: it can help developers summarize code. When a developer encounters unfamiliar code, the model can translate the code into natural language and summarize it.
Text to code: similar to code search, this helps users retrieve relevant code based on a natural language query. It can also generate the corresponding code from a comment.
Text to text: it can help translate code-domain text into different languages.

The BERT architecture
BERT (Bidirectional Encoder Representations from Transformers) is a self-supervised model proposed by Google in 2018.

BERT is essentially a stack of Transformer encoder layers composed of multiple self-attention "heads" (Vaswani et al., 2017). For each input token in the sequence, each head computes key, value and query vectors, which are used to create a weighted representation/embedding. The outputs of all heads in the same layer are combined and passed through a fully connected layer. Each layer is wrapped with a skip connection followed by layer normalization (LN). BERT's traditional workflow consists of two stages: pre-training and fine-tuning. Pre-training uses two self-supervised tasks: masked language modeling (MLM, predicting randomly masked input tokens) and next sentence prediction (NSP, predicting whether two input sentences are adjacent to each other). Fine-tuning targets downstream applications, usually by adding one or more fully connected layers on top of the final encoder layer.
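To make the attention computation above concrete, here is a minimal single-head sketch in PyTorch. It is a simplification for illustration only: real BERT uses multiple heads whose outputs are concatenated, plus dropout, masking and learned biases.

import torch
import torch.nn.functional as F

def single_attention_head(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_*: (d_model, d_head)
    q = x @ w_q                               # query vectors
    k = x @ w_k                               # key vectors
    v = x @ w_v                               # value vectors
    scores = q @ k.T / q.size(-1) ** 0.5      # scaled dot-product scores
    weights = F.softmax(scores, dim=-1)       # attention weights per token
    return weights @ v                        # weighted representation

# Toy example: 5 tokens, model width 8, head width 4
x = torch.randn(5, 8)
w_q, w_k, w_v = (torch.randn(8, 4) for _ in range(3))
print(single_attention_head(x, w_q, w_k, w_v).shape)  # torch.Size([5, 4])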
The CodeBERT architecture
BERT is easy to extend to multiple modalities, that is, to train with different types of datasets. CodeBERT is a bimodal extension of BERT: it takes both natural language and source code as input (unlike traditional BERT and RoBERTa, which focus mainly on natural language).

Bimodal NL-PL pairs: the typical training input for CodeBERT is a combination of code and a well-defined text comment.
CodeBERT describes two pre-training objectives: masked language modeling (MLM) and replaced token detection (RTD).
Training CodeBERT with masked language modeling: a random set of positions is selected in the NL and PL sequences and replaced with the special [MASK] token. The goal of MLM is to predict the original tokens that were masked out.
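As a hedged illustration of the MLM objective (this is not the original training code), the snippet below masks one token in a tiny code fragment and asks a masked-language-modeling checkpoint to recover it. It assumes the microsoft/codebert-base-mlm checkpoint released alongside CodeBERT and sticks to the older tuple-style transformers API.

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")

# Mask the comparison operator and let the model propose fillers
code = "if a %s b: return a" % tokenizer.mask_token
input_ids = tokenizer.encode(code, return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids)[0]              # (1, seq_len, vocab_size)

mask_id = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
mask_pos = (input_ids[0] == mask_id).nonzero()[0].item()
top5 = torch.topk(logits[0, mask_pos], 5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # candidate tokens for the mask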
Training CodeBERT with replaced token detection: in the original NL and PL sequences, a few tokens are randomly masked out. A generator model, an n-gram-like probabilistic model, is trained to propose replacements for the masked tokens. A discriminator model is then trained to decide whether each token is the original one (a binary classification problem).
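The toy sketch below (not the actual CodeBERT training code) shows how RTD training data could be constructed: a few positions are picked, a naive stand-in for the generator proposes replacement tokens, and binary labels record which positions were replaced, which is exactly what the discriminator must predict.

import random

def make_rtd_example(tokens, vocab, replace_prob=0.15, seed=0):
    # Returns corrupted tokens and 0/1 labels (1 = token was replaced)
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Naive "generator": pick any other token from the vocabulary
            corrupted.append(rng.choice([t for t in vocab if t != tok]))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

code_tokens = "def max ( a , b ) : return a if a > b else b".split()
vocab = list(set(code_tokens)) + ["min", "+", "-"]
corrupted, labels = make_rtd_example(code_tokens, vocab)
print(list(zip(corrupted, labels)))  # input/target pairs for the discriminator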

CodeBERT uses a 12-layer Transformer with 125M parameters in total and was trained for 250 hours on an NVIDIA DGX-2 in FP16 precision. The results show that initializing CodeBERT with the pre-trained RoBERTa representation, and then training it on code from CodeSearchNet, performs better than training from scratch.

Fine-tuning CodeBERT
Please refer to the CodeBERT paper for more details. Below is a brief walkthrough of how to use CodeBERT, taking code documentation generation as an example.
Install the required packages:
pip3 install torch==1.4.0
pip3 install transformers==2.5.0
pip3 install filelock
Data preprocessing
The data preprocessing for this task is as follows (a rough sketch of these filters is shown after the list):
- Remove comments from the code.
- Remove examples where the code cannot be parsed into an abstract syntax tree.
- Remove examples where the documentation has fewer than 3 or more than 256 tokens.
- Remove examples where the documentation contains special tags (e.g., <img ...> or https:...).
- Remove examples where the documentation is not written in English.
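Below is a rough, hedged sketch of what these filters might look like for Python code; the isascii check is only a crude stand-in for the English-language filter, and the real preprocessing also strips comments from the code before these checks.

import ast
import re

def keep_example(code, docstring):
    doc_tokens = docstring.split()
    if len(doc_tokens) < 3 or len(doc_tokens) > 256:   # documentation length filter
        return False
    if re.search(r"<img\s|https?:", docstring):        # special tags / URLs
        return False
    try:
        ast.parse(code)                                # must parse into an AST
    except SyntaxError:
        return False
    if not docstring.isascii():                        # crude "is it English" proxy
        return False
    return True

print(keep_example("def add(a, b):\n    return a + b",
                   "Add two numbers and return the result."))  # True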
pip3 install gdown
mkdir -p data/code2nl
cd data/code2nl
gdown https://drive.google.com/uc?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h
unzip Cleaned_CodeSearchNet.zip
rm Cleaned_CodeSearchNet.zip
cd ../..
Running the tree command shows the following directory structure:
tree data/code2nl/CodeSearchNet/
data/code2nl/CodeSearchNet/
├── go
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── java
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── javascript
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── php
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── python
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
└── ruby
├── test.jsonl
├── train.jsonl
└── valid.jsonl
6 directories, 18 files
Each language has its own training, validation, and test data files.
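Each *.jsonl file stores one JSON object per line. The snippet below is a small, assumed helper for peeking at a split; the field names code and docstring follow the CodeSearchNet convention and may differ slightly in the cleaned dataset.

import json

path = "data/code2nl/CodeSearchNet/php/valid.jsonl"
with open(path, encoding="utf-8") as f:
    first = json.loads(next(f))            # first example of the split

print(sorted(first.keys()))                # available fields
print(first.get("docstring", "")[:80])     # natural-language side
print(first.get("code", "")[:80])          # programming-language side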
Run the program
Visit https://github.com/microsoft/CodeBERT/tree/master/CodeBERT/code2nl, download the run.py, bleu.py, and model.py files, and place them in the data/code2nl folder.
Then run the following commands. Here batch_size=128 is changed to batch_size=4 because the GPU does not have enough memory; a batch size that is too large will cause an out-of-memory (OOM) error.
lang=php
beam_size=10
batch_size=4
source_length=256
target_length=128
output_dir=model/$lang
data_dir=../data/code2nl/CodeSearchNet
dev_file=$data_dir/$lang/valid.jsonl
test_file=$data_dir/$lang/test.jsonl
test_model=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test
python run.py --do_test --model_type roberta --model_name_or_path microsoft/codebert-base --load_model_path $test_model --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size
This kicks off the run. So how do we call CodeBERT after training?
If you only want to use the feature representations produced by CodeBERT, you can use the following example code:
from transformers import AutoTokenizer, AutoModel
import torch
# Init
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
# Tokenization
nl_tokens = tokenizer.tokenize("return maximum value")
code_tokens = tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
tokens = [tokenizer.cls_token] + nl_tokens + [tokenizer.sep_token] + code_tokens + [tokenizer.sep_token]
# Convert tokens to ids
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
context_embeddings = model(torch.tensor(tokens_ids)[None, :])[0]
# Print
print(context_embeddings)
The output is as follows :
tensor([[-0.1423, 0.3766, 0.0443, ..., -0.2513, -0.3099, 0.3183],
[-0.5739, 0.1333, 0.2314, ..., -0.1240, -0.1219, 0.2033],
[-0.1579, 0.1335, 0.0291, ..., 0.2340, -0.8801, 0.6216],
...,
[-0.4042, 0.2284, 0.5241, ..., -0.2046, -0.2419, 0.7031],
[-0.3894, 0.4603, 0.4797, ..., -0.3335, -0.6049, 0.4730],
[-0.1433, 0.3785, 0.0450, ..., -0.2527, -0.3121, 0.3207]],
grad_fn=<SelectBackward>)
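As a hedged illustration of the code-search use case mentioned earlier (it reuses the tokenizer and model loaded above and is not part of the original example), the first token's hidden state can be treated as a rough sequence-level embedding and compared with cosine similarity:

def embed(text):
    # Rough sequence embedding: hidden state of the first ([CLS]-position) token
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(ids)[0]             # (1, seq_len, hidden_size)
    return hidden[0, 0]

query = embed("return maximum value")
code_a = embed("def max(a, b): return a if a > b else b")
code_b = embed("def read_file(path): return open(path).read()")

cos = torch.nn.functional.cosine_similarity
print(cos(query, code_a, dim=0).item())    # usually the higher score
print(cos(query, code_b, dim=0).item())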
As the code above shows, Hugging Face also hosts the CodeBERT model, so we can load it directly:
import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

# Pick the GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)
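A short, assumed follow-up showing how the loaded tokenizer and model can be used on the selected device (continuing from the block above; the NL/PL pair encoding mirrors the earlier example):

# Encode an NL query together with a code snippet and run it on the device
text = "return maximum value"
code = "def max(a,b): if a>b: return a else return b"
input_ids = tokenizer.encode(text, code, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(input_ids)

print(outputs[0].shape)  # last hidden states: (1, seq_len, hidden_size)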
CodeBERT repository:
https://github.com/microsoft/CodeBERT
Other models based on CodeBERT
GraphCodeBERT: a pre-trained code representation model based on data flow
https://arxiv.org/abs/2009.08366
GraphCodeBERT is a pre-trained model that uses the semantic structure of code to learn code representations, built on the BERT pre-training approach. In addition to the traditional MLM task, the paper proposes two new pre-training tasks (data-flow edge prediction and aligning variables between source code and data flow), learns vector representations of source code based on data flow, and achieves significant improvements on four downstream tasks.

UniXcoder: a unified cross-modal pre-trained model
https://arxiv.org/abs/2203.03850
UniXcoder is a unified cross-modal pre-trained model for programming languages. The model uses mask attention matrices with prefix adapters to control its behavior, and leverages ASTs and code comments to enhance the code representation. To encode the tree-structured AST in parallel with the code sequence, the paper proposes a one-to-one mapping that flattens the AST into a sequence while preserving all of its structural information. The model also uses multimodal content to learn representations of code fragments via contrastive learning, and then uses a cross-modal generation task to align representations across programming languages.

CodeReviewer: Automated code review
https://arxiv.org/abs/2203.09095
Building on the research above, CodeReviewer is a pre-trained model that uses four pre-training tasks tailored specifically to code review scenarios. The model focuses on three key tasks in code review activities: code change quality estimation, review comment generation, and code refinement. Experiments show that, thanks to the pre-training tasks and a multilingual training dataset, the model can automate code change and review workflows.

https://avoid.overfit.cn/post/29087fd920d847fb88671bc5e1cdad27