Code representation learning: an introduction to CodeBERT and other related models
2022-07-25 11:07:00 【deephub】
What is CodeBERT?
CodeBERT is an extension of the BERT model developed by Microsoft in 2020. It is a bimodal pre-trained model for programming language (PL) and natural language (NL) that can be used for downstream NL-PL tasks. The model is trained on NL-PL pairs in six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
This article gives a brief overview of the paper and walks through a usage example. For more details on the math and the exact architecture behind the model, please refer to the original paper. At the end, besides CodeBERT, it also collects several recent derivative models based on it.
Before diving into the paper, let's first look at the downstream use cases that CodeBERT supports. Some of these use cases are already implemented in Microsoft tools, for example Visual Studio IntelliCode.
CodeBERT use cases
Code conversion or code translation: for example, when a developer wants to write Java code that does the same thing as existing Python code, code-to-code translation can help translate that code block.
Automatic code summarization: it can help developers summarize code. When a developer encounters unfamiliar code, the model can translate the code into natural language and summarize it.
Text to code: similar to code search, this helps users retrieve relevant code based on a natural language query. It can also generate the corresponding code from a comment.
Text to text: it can help translate code-domain text into different languages.

The BERT architecture
BERT (Bidirectional Encoder Representations from Transformers) is a self-supervised model proposed by Google in 2018.

BERT is essentially a stack of Transformer encoder layers composed of multiple self-attention "heads" (Vaswani et al., 2017). For each input token in the sequence, each head computes key, value and query vectors, which are used to create a weighted representation/embedding. The outputs of all heads in the same layer are combined and passed through a fully connected layer. Each layer is wrapped with a skip connection followed by layer normalization (LN). BERT's traditional workflow consists of two stages: pre-training and fine-tuning. Pre-training uses two self-supervised tasks: masked language modeling (MLM, predicting randomly masked input tokens) and next sentence prediction (NSP, predicting whether two input sentences are adjacent to each other). Fine-tuning targets downstream applications, usually by adding one or more fully connected layers on top of the final encoder layer.
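To make the attention computation above concrete, here is a minimal single-head sketch in PyTorch. It is a simplification for illustration only: real BERT uses multiple heads whose outputs are concatenated, plus dropout, masking and learned biases.

import torch
import torch.nn.functional as F

def single_attention_head(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_*: (d_model, d_head)
    q = x @ w_q                               # query vectors
    k = x @ w_k                               # key vectors
    v = x @ w_v                               # value vectors
    scores = q @ k.T / q.size(-1) ** 0.5      # scaled dot-product scores
    weights = F.softmax(scores, dim=-1)       # attention weights per token
    return weights @ v                        # weighted representation

# Toy example: 5 tokens, model width 8, head width 4
x = torch.randn(5, 8)
w_q, w_k, w_v = (torch.randn(8, 4) for _ in range(3))
print(single_attention_head(x, w_q, w_k, w_v).shape)  # torch.Size([5, 4])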
The CodeBERT architecture
BERT is easy to extend to multiple modalities, that is, to train with different types of datasets. CodeBERT is a bimodal extension of BERT: it takes both natural language and source code as input (unlike traditional BERT and RoBERTa, which focus mainly on natural language).

Bimodal NL-PL pairs: the typical training input for CodeBERT is a combination of code and a well-defined text comment.
CodeBERT describes two pre-training objectives: masked language modeling (MLM) and replaced token detection (RTD).
Training CodeBERT with masked language modeling: a random set of positions is selected in the NL and PL sequences and replaced with the special [MASK] token. The goal of MLM is to predict the original tokens that were masked out.
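As a hedged illustration of the MLM objective (this is not the original training code), the snippet below masks one token in a tiny code fragment and asks a masked-language-modeling checkpoint to recover it. It assumes the microsoft/codebert-base-mlm checkpoint released alongside CodeBERT and sticks to the older tuple-style transformers API.

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")

# Mask the comparison operator and let the model propose fillers
code = "if a %s b: return a" % tokenizer.mask_token
input_ids = tokenizer.encode(code, return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids)[0]              # (1, seq_len, vocab_size)

mask_id = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
mask_pos = (input_ids[0] == mask_id).nonzero()[0].item()
top5 = torch.topk(logits[0, mask_pos], 5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # candidate tokens for the mask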
Training CodeBERT with replaced token detection: in the original NL and PL sequences, a few tokens are randomly masked out. A generator model, an n-gram-like probabilistic model, is trained to propose replacements for the masked tokens. A discriminator model is then trained to decide whether each token is the original one (a binary classification problem).
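The toy sketch below (not the actual CodeBERT training code) shows how RTD training data could be constructed: a few positions are picked, a naive stand-in for the generator proposes replacement tokens, and binary labels record which positions were replaced, which is exactly what the discriminator must predict.

import random

def make_rtd_example(tokens, vocab, replace_prob=0.15, seed=0):
    # Returns corrupted tokens and 0/1 labels (1 = token was replaced)
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Naive "generator": pick any other token from the vocabulary
            corrupted.append(rng.choice([t for t in vocab if t != tok]))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

code_tokens = "def max ( a , b ) : return a if a > b else b".split()
vocab = list(set(code_tokens)) + ["min", "+", "-"]
corrupted, labels = make_rtd_example(code_tokens, vocab)
print(list(zip(corrupted, labels)))  # input/target pairs for the discriminator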

CodeBERT uses a 12-layer Transformer with 125M parameters in total and was trained for 250 hours on an NVIDIA DGX-2 in FP16 precision. The results show that initializing CodeBERT with the pre-trained RoBERTa representation, and then training it on code from CodeSearchNet, performs better than training from scratch.

Fine-tuning CodeBERT
Please refer to the CodeBERT paper for more details. Below is a brief walkthrough of how to use CodeBERT, taking code documentation generation as an example.
Install the required packages:
pip3 install torch==1.4.0
pip3 install transformers==2.5.0
pip3 install filelock
Data preprocessing
The data preprocessing for this task is as follows (a rough sketch of these filters is shown after the list):
- Remove comments from the code.
- Remove examples where the code cannot be parsed into an abstract syntax tree.
- Remove examples where the documentation has fewer than 3 or more than 256 tokens.
- Remove examples where the documentation contains special tags (e.g., <img ...> or https:...).
- Remove examples where the documentation is not written in English.
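Below is a rough, hedged sketch of what these filters might look like for Python code; the isascii check is only a crude stand-in for the English-language filter, and the real preprocessing also strips comments from the code before these checks.

import ast
import re

def keep_example(code, docstring):
    doc_tokens = docstring.split()
    if len(doc_tokens) < 3 or len(doc_tokens) > 256:   # documentation length filter
        return False
    if re.search(r"<img\s|https?:", docstring):        # special tags / URLs
        return False
    try:
        ast.parse(code)                                # must parse into an AST
    except SyntaxError:
        return False
    if not docstring.isascii():                        # crude "is it English" proxy
        return False
    return True

print(keep_example("def add(a, b):\n    return a + b",
                   "Add two numbers and return the result."))  # True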
pip3 install gdown
mkdir -p data/code2nl
cd data/code2nl
gdown https://drive.google.com/uc?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h
unzip Cleaned_CodeSearchNet.zip
rm Cleaned_CodeSearchNet.zip
cd ../..
Running the tree command shows the following directory structure:
tree data/code2nl/CodeSearchNet/
data/code2nl/CodeSearchNet/
├── go
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── java
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── javascript
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── php
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
├── python
│ ├── test.jsonl
│ ├── train.jsonl
│ └── valid.jsonl
└── ruby
├── test.jsonl
├── train.jsonl
└── valid.jsonl
6 directories, 18 files
Each language has its own training, validation, and test data files.
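Each *.jsonl file stores one JSON object per line. The snippet below is a small, assumed helper for peeking at a split; the field names code and docstring follow the CodeSearchNet convention and may differ slightly in the cleaned dataset.

import json

path = "data/code2nl/CodeSearchNet/php/valid.jsonl"
with open(path, encoding="utf-8") as f:
    first = json.loads(next(f))            # first example of the split

print(sorted(first.keys()))                # available fields
print(first.get("docstring", "")[:80])     # natural-language side
print(first.get("code", "")[:80])          # programming-language side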
Run the program
Visit https://github.com/microsoft/CodeBERT/tree/master/CodeBERT/code2nl, download the run.py, bleu.py, and model.py files, and place them in the data/code2nl folder.
Then run the following commands. Here batch_size=128 is changed to batch_size=4 because the GPU does not have enough memory; a batch size that is too large will cause an out-of-memory (OOM) error.
lang=php
beam_size=10
batch_size=4
source_length=256
target_length=128
output_dir=model/$lang
data_dir=../data/code2nl/CodeSearchNet
dev_file=$data_dir/$lang/valid.jsonl
test_file=$data_dir/$lang/test.jsonl
test_model=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test
python run.py --do_test --model_type roberta --model_name_or_path microsoft/codebert-base --load_model_path $test_model --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size
This kicks off the run. So how do we call CodeBERT after training?
If you only want to use the feature representations produced by CodeBERT, you can use the following example code:
from transformers import AutoTokenizer, AutoModel
import torch
# Init
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
# Tokenization
nl_tokens = tokenizer.tokenize("return maximum value")
code_tokens = tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
tokens = [tokenizer.cls_token] + nl_tokens + [tokenizer.sep_token] + code_tokens + [tokenizer.sep_token]
# Convert tokens to ids
tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
context_embeddings = model(torch.tensor(tokens_ids)[None, :])[0]
# Print
print(context_embeddings)
The output is as follows :
tensor([[-0.1423, 0.3766, 0.0443, ..., -0.2513, -0.3099, 0.3183],
[-0.5739, 0.1333, 0.2314, ..., -0.1240, -0.1219, 0.2033],
[-0.1579, 0.1335, 0.0291, ..., 0.2340, -0.8801, 0.6216],
...,
[-0.4042, 0.2284, 0.5241, ..., -0.2046, -0.2419, 0.7031],
[-0.3894, 0.4603, 0.4797, ..., -0.3335, -0.6049, 0.4730],
[-0.1433, 0.3785, 0.0450, ..., -0.2527, -0.3121, 0.3207]],
grad_fn=<SelectBackward>)
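As a hedged illustration of the code-search use case mentioned earlier (it reuses the tokenizer and model loaded above and is not part of the original example), the first token's hidden state can be treated as a rough sequence-level embedding and compared with cosine similarity:

def embed(text):
    # Rough sequence embedding: hidden state of the first ([CLS]-position) token
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(ids)[0]             # (1, seq_len, hidden_size)
    return hidden[0, 0]

query = embed("return maximum value")
code_a = embed("def max(a, b): return a if a > b else b")
code_b = embed("def read_file(path): return open(path).read()")

cos = torch.nn.functional.cosine_similarity
print(cos(query, code_a, dim=0).item())    # usually the higher score
print(cos(query, code_b, dim=0).item())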
As the code above shows, Hugging Face also hosts the CodeBERT model, so we can load it directly:
import torch
from transformers import RobertaTokenizer, RobertaConfig, RobertaModel

# Pick the GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model.to(device)
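A short, assumed follow-up showing how the loaded tokenizer and model can be used on the selected device (continuing from the block above; the NL/PL pair encoding mirrors the earlier example):

# Encode an NL query together with a code snippet and run it on the device
text = "return maximum value"
code = "def max(a,b): if a>b: return a else return b"
input_ids = tokenizer.encode(text, code, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(input_ids)

print(outputs[0].shape)  # last hidden states: (1, seq_len, hidden_size)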
CodeBERT repository:
https://github.com/microsoft/CodeBERT
Other models based on CodeBERT
GraphCodeBERT: a pre-trained code representation model based on data flow
https://arxiv.org/abs/2009.08366
GraphCodeBERT is a pre-trained model that uses the semantic structure of code to learn code representations, built on the BERT pre-training approach. In addition to the traditional MLM task, the paper proposes two new pre-training tasks (data-flow edge prediction and aligning variables between source code and data flow), learns vector representations of source code based on data flow, and achieves significant improvements on four downstream tasks.

UniXcoder: a unified cross-modal pre-trained model
https://arxiv.org/abs/2203.03850
UniXcoder is a unified cross-modal pre-trained model for programming languages. The model uses mask attention matrices with prefix adapters to control its behavior, and leverages ASTs and code comments to enhance the code representation. To encode the tree-structured AST in parallel with the code sequence, the paper proposes a one-to-one mapping that flattens the AST into a sequence while preserving all of its structural information. The model also uses multimodal content to learn representations of code fragments via contrastive learning, and then uses a cross-modal generation task to align representations across programming languages.

CodeReviewer: Automated code review
https://arxiv.org/abs/2203.09095
Building on the research above, CodeReviewer is a pre-trained model that uses four pre-training tasks tailored specifically to code review scenarios. The model focuses on three key tasks in code review activities: code change quality estimation, review comment generation, and code refinement. Experiments show that, thanks to the pre-training tasks and a multilingual training dataset, the model can automate code change and review workflows.

https://avoid.overfit.cn/post/29087fd920d847fb88671bc5e1cdad27