【AI4Code】《CodeBERT: A Pre-Trained Model for Programming and Natural Languages》 EMNLP 2020
2022-07-25 12:40:00 【chad_lee】
Applies BERT to bimodal data: programming language (PL) and natural language (NL). After pre-training, CodeBERT yields general-purpose representations that support various downstream tasks such as natural-language code search and code documentation generation. The authors also contribute an NL-PL dataset.
Method
Model architecture
The model is BERT; its architecture is almost identical to RoBERTa-base: 12 layers, each with 12 self-attention heads, and each head has dimension 64. The hidden dimension is 768 and the feed-forward (FF) layer dimension is 3072. The model has about 125M parameters in total.
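These numbers can be sanity-checked against the public checkpoint. A minimal sketch, assuming the HuggingFace release microsoft/codebert-base (an assumption; the paper itself predates this packaging):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("microsoft/codebert-base")
print(config.num_hidden_layers)    # 12 layers
print(config.num_attention_heads)  # 12 self-attention heads per layer
print(config.hidden_size)          # 768 hidden dimension
print(config.intermediate_size)    # 3072 FF dimension
```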
Input and output
Input: the pre-training input is a concatenated sequence of natural language text and programming language code: [CLS], w1, w2, …, wn, [SEP], c1, c2, …, cm, [EOS], where the w's are text tokens and the c's are code tokens.
Output: every token gets an output from CodeBERT. For text and code tokens, the output is their contextual semantic vector representation; the vector for [CLS] serves as the aggregated sequence representation. The outputs for the separator [SEP] and the terminator [EOS] carry no particular meaning.
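A minimal sketch of encoding an NL-PL pair and reading off these outputs, assuming the HuggingFace checkpoint microsoft/codebert-base (whose RoBERTa-style tokenizer writes `<s>`/`</s>` where the paper writes [CLS]/[SEP]/[EOS]); the NL and code strings are made up for illustration:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

nl = "return the larger of two numbers"         # illustrative NL text
code = "def f(a, b): return a if a > b else b"  # illustrative PL text

# Encoding the pair yields one sequence containing both segments.
inputs = tokenizer(nl, code, return_tensors="pt")
outputs = model(**inputs)

token_vectors = outputs.last_hidden_state  # one contextual vector per token
cls_vector = token_vectors[:, 0]           # aggregated sequence representation
print(token_vectors.shape, cls_vector.shape)
```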
Pre-training data
There are two types of training data: bimodal NL-PL pairs, and unimodal data, i.e., code without parallel natural language text and natural language text without corresponding code.

An example NL-PL pair is shown below, where NL is the first paragraph (red box) of the function's documentation (black dashed box).

Pre-training tasks
MLM (Masked Language Modeling)
There are two objective functions. On bimodal NL-PL data, the MLM objective is used: mask positions are sampled at random in both the NL and PL segments (the two samplings are independent) and replaced with the [MASK] token:
$$
\begin{aligned}
m_{i}^{w} & \sim \operatorname{unif}\{1,|\boldsymbol{w}|\} \text { for } i=1 \text { to }|\boldsymbol{w}| \\
m_{i}^{c} & \sim \operatorname{unif}\{1,|\boldsymbol{c}|\} \text { for } i=1 \text { to }|\boldsymbol{c}| \\
\boldsymbol{w}^{\text{masked}} &=\operatorname{REPLACE}\left(\boldsymbol{w}, \boldsymbol{m}^{w},[MASK]\right) \\
\boldsymbol{c}^{\text{masked}} &=\operatorname{REPLACE}\left(\boldsymbol{c}, \boldsymbol{m}^{c},[MASK]\right) \\
\boldsymbol{x} &=\boldsymbol{w}+\boldsymbol{c}
\end{aligned}
$$
The MLM objective is to predict the masked tokens. The discriminator $p^{D_1}$ predicts the probability of the original token at each masked position $i$:
$$
\mathcal{L}_{\mathrm{MLM}}(\theta)=\sum_{i \in \boldsymbol{m}^{w} \cup \boldsymbol{m}^{c}}-\log p^{D_{1}}\left(x_{i} \mid \boldsymbol{w}^{\text{masked}}, \boldsymbol{c}^{\text{masked}}\right)
$$
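A minimal PyTorch sketch of this objective; unlike the paper, it samples mask positions over the whole concatenated sequence rather than within the NL and PL segments separately, and the masking rate is an assumed illustrative value:

```python
import torch
import torch.nn.functional as F

def mlm_loss(input_ids, mask_token_id, logits_fn, mask_prob=0.15):
    """Mask random positions, then score the model's predictions there.

    logits_fn maps (batch, seq_len) token ids to (batch, seq_len, vocab) logits.
    """
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(labels.shape, mask_prob)).bool()
    labels[~masked] = -100                            # loss only on masked positions
    corrupted = input_ids.masked_fill(masked, mask_token_id)
    logits = logits_fn(corrupted)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
```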
RTD (Replaced Token Detection)
MLM uses only the bimodal NL-PL data, while RTD also makes use of the unimodal data.

Here CodeBERT plays the role of the NL-Code discriminator in Fig. 2. Concretely, several positions in the text/code sequences are first selected at random and masked; a generator then fills each masked position with a plausible token. The generator can loosely be understood as something like Word2Vec (it is not, but the analogy is convenient): it predicts a token for the mask from the surrounding context, and the prediction may be the original token (e.g., w5) or a wrong but still plausible one.
The resulting corrupted sequence is fed into CodeBERT, and a binary classifier over each token's output embedding decides whether that token has been replaced.
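A minimal sketch of that detection head: a per-token binary classifier on top of CodeBERT's output embeddings (the class name and wiring are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class RTDHead(nn.Module):
    """Per-token binary classifier: was this token replaced by the generator?"""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, replaced_labels):
        # hidden_states: (batch, seq_len, hidden); replaced_labels: 1 = replaced
        logits = self.classifier(hidden_states).squeeze(-1)
        return nn.functional.binary_cross_entropy_with_logits(
            logits, replaced_labels.float()
        )
```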
Fine-tuning
For the natural-language code search task, the output representation of [CLS] is used to measure the similarity between the query and the code, as sketched below.
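A minimal sketch of that setup: encode the (query, code) pair jointly and score relevance from the [CLS] vector (the class name and scoring head are illustrative, assuming the microsoft/codebert-base checkpoint):

```python
import torch.nn as nn
from transformers import AutoModel

class PairScorer(nn.Module):
    """Illustrative cross-encoder: relevance score for one (NL, code) pair."""

    def __init__(self, checkpoint="microsoft/codebert-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.head(hidden[:, 0]).squeeze(-1)  # score from [CLS]
```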
For the code-to-text generation task, CodeBERT is used to initialize the encoder of an encoder-decoder model.
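One possible realization with HuggingFace transformers is sketched below; for brevity the decoder is also warm-started from the same checkpoint (its cross-attention weights are newly initialized), whereas a freshly initialized decoder matches the paper's setup more closely:

```python
from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
# Encoder initialized from CodeBERT, as described above.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/codebert-base", "microsoft/codebert-base"
)
# Generation needs these ids set on the shared config.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```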
Experiments
The paper's experimental results are not reproduced here; it evaluates on code search, NL-PL probing, and code-to-documentation generation.
VS Code already has a CodeBERT-based docstring plugin:
https://marketplace.visualstudio.com/items?itemName=graykode.ai-docstring&ssr=false