【AI4Code】《CoSQA: 20,000+ Web Queries for Code Search and Question Answering》 ACL 2021
2022-07-25 12:40:00 【chad_ lee】
In the spirit of CLIP, the paper builds an NL-PL dataset of query-code pairs with binary labels and then trains a two-modality alignment model much as CLIP does. On top of this it adds contrastive learning and designs two data augmentation methods. Both towers of the two-tower encoder are CodeBERT.
CoSQA dataset
The goal is code search that works the way image search does on the web: the user enters a query describing a need and gets back a code implementation that satisfies it (today, such searches usually return blog posts). The authors put considerable effort into constructing such a dataset.
The dataset construction also involves many details, for example how to label code that fully satisfies the query, code that partially satisfies it, code that satisfies less than 50% of it, and code that is merely related to the query.

Model

The model input is a sequence of the form [CLS] xxxxxxxx [SEP]. The model is a Siamese network: the query and the code are both encoded with the same CodeBERT. The output of the encoder is the representation of [CLS].
$$\mathbf{q}_{i}=\mathbf{CodeBERT}\left(q_{i}\right), \quad \mathbf{c}_{i}=\mathbf{CodeBERT}\left(c_{i}\right)$$
The model does not simply take the inner product of q and c as the similarity. Instead, an MLP computes the matching relation between the two. The output of this MLP is a vector, not a similarity score:
$$\mathbf{r}^{(i, i)}=\tanh \left(\mathbf{W}_{1} \cdot\left[\mathbf{q}_{i}, \mathbf{c}_{i}, \mathbf{q}_{i}-\mathbf{c}_{i}, \mathbf{q}_{i} \odot \mathbf{c}_{i}\right]\right)$$
A further single-layer network then computes the similarity score from this vector:
$$s^{(i, i)}=\operatorname{sigmoid}\left(\mathbf{W}_{2} \cdot \mathbf{r}^{(i, i)}\right)$$
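The two formulas above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `match_score`, the toy dimensions, and the random weights are hypothetical, the vectors `q` and `c` stand in for the [CLS] representations that CodeBERT would produce, and no bias terms are used because the formulas show none.

```python
import math
import random


def match_score(q, c, W1, W2):
    """Return (r, s): the relation vector r and the similarity score s."""
    # Feature vector [q, c, q - c, q * c], where q * c is element-wise.
    features = (
        list(q)
        + list(c)
        + [qi - ci for qi, ci in zip(q, c)]
        + [qi * ci for qi, ci in zip(q, c)]
    )
    # r = tanh(W1 . features); W1 holds one weight row per output dimension.
    r = [math.tanh(sum(w * f for w, f in zip(row, features))) for row in W1]
    # s = sigmoid(W2 . r); W2 is a single weight vector.
    z = sum(w * ri for w, ri in zip(W2, r))
    s = 1.0 / (1.0 + math.exp(-z))
    return r, s


# Toy usage with random weights (hypothetical sizes: d = 4, hidden = 3).
rng = random.Random(0)
d, hidden = 4, 3
q = [rng.uniform(-1, 1) for _ in range(d)]
c = [rng.uniform(-1, 1) for _ in range(d)]
W1 = [[rng.uniform(-1, 1) for _ in range(4 * d)] for _ in range(hidden)]
W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
r, s = match_score(q, c, W1, W2)
print(len(r), s)
```

Note that the score is not symmetric in q and c because of the difference term, which is one way this head differs from a plain inner product.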
The model is then trained with a binary cross-entropy (BCE) loss:
$$\mathcal{L}_{b}=-\left[y_{i} \cdot \log s^{(i, i)}+\left(1-y_{i}\right) \log \left(1-s^{(i, i)}\right)\right]$$
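As a small worked example of the BCE formula for a single pair, the sketch below evaluates the loss for a given score and label. The function name `bce_loss` and the clamping constant `eps` are assumptions added for numerical safety and are not part of the formula.

```python
import math


def bce_loss(s, y, eps=1e-12):
    """Binary cross-entropy for one pair: score s in (0, 1), label y in {0, 1}."""
    # Clamp the score so log never receives exactly 0 (eps is an assumption).
    s = min(max(s, eps), 1.0 - eps)
    return -(y * math.log(s) + (1 - y) * math.log(1.0 - s))


print(bce_loss(0.9, 1))  # low loss: high score on a positive pair
print(bce_loss(0.9, 0))  # high loss: high score on a negative pair
```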
Contrastive learning
Besides the BCE loss, the paper adds two augmentation-based losses: In-Batch Augmentation (IBA) and Query-Rewritten Augmentation (QRA).
IBA Loss
For each query, the other code snippets in the current batch are also used as negative samples, so each query gets multiple negative code samples:
$$\mathcal{L}_{ib}=-\frac{1}{n-1} \sum_{\substack{j=1 \\ j \neq i}}^{n} \log \left(1-s^{(i, j)}\right)$$
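A minimal sketch of the IBA loss for one query, assuming the pairwise scores for the batch have already been computed and stored in a nested list `scores`, where `scores[i][j]` plays the role of s^(i,j). The function name `iba_loss`, the toy score matrix, and the clamping constant `eps` are hypothetical.

```python
import math


def iba_loss(scores, i, eps=1e-12):
    """In-batch augmentation loss for query i.

    scores[i][j] is the similarity between query i and code j.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two pairs in the batch")
    total = 0.0
    for j in range(n):
        if j == i:
            continue  # skip the query's own paired code
        # Clamp so log(1 - s) stays finite (eps is an assumption).
        s = min(scores[i][j], 1.0 - eps)
        total += math.log(1.0 - s)
    return -total / (n - 1)


# Toy batch of 3 pairs: diagonal entries are the original pairs.
scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.5, 0.5, 0.7],
]
print([iba_loss(scores, i) for i in range(3)])
```

The loss for a query shrinks as its scores against the other codes in the batch approach 0, which is what treating them as negatives means.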
QRA Loss
Web queries are usually very short and are not guaranteed to be grammatical. So for each query-code pair labeled 1, the query is rewritten in one of three ways: randomly deleting a word, randomly swapping the positions of two words, or randomly duplicating a word.
This gives each code snippet multiple positive queries. The IBA loss is also applied on top of QRA:
$$\mathcal{L}_{qr}=\mathcal{L}_{b}^{\prime}+\mathcal{L}_{ib}^{\prime}$$
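The three query rewrites used by QRA (delete a word, swap two words, duplicate a word) can be sketched as follows. This is an illustration rather than the authors' implementation: the function names, the whitespace tokenization, the choice to place the duplicated word right after the original, and the example query are all assumptions.

```python
import random


def delete_word(words, rng):
    """Drop one randomly chosen word (no-op for fewer than two words)."""
    if len(words) < 2:
        return list(words)
    k = rng.randrange(len(words))
    return words[:k] + words[k + 1:]


def swap_words(words, rng):
    """Swap the positions of two distinct, randomly chosen words."""
    if len(words) < 2:
        return list(words)
    a, b = rng.sample(range(len(words)), 2)
    out = list(words)
    out[a], out[b] = out[b], out[a]
    return out


def copy_word(words, rng):
    """Duplicate one randomly chosen word, placing the copy right after it."""
    if not words:
        return []
    k = rng.randrange(len(words))
    return words[:k + 1] + words[k:]


def rewrite_query(query, rng=None):
    """Apply one randomly chosen rewrite to a whitespace-tokenized query."""
    if rng is None:
        rng = random.Random()
    op = rng.choice([delete_word, swap_words, copy_word])
    return " ".join(op(query.split(), rng))


rng = random.Random(0)
for _ in range(3):
    print(rewrite_query("python read json file", rng))
```

Each rewritten query keeps the label of the original pair, so the loss terms in the formula above are computed on the rewritten queries in the same way as on the originals.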
Experiments
The paper's code contrastive learning method (CoCLR) is a training method; the experiments continue training from the pre-trained CodeBERT.
There are two tasks. One is code question answering, where the test set is split off directly from the training set. The other is code search.
CodeBERT+CoSQA explicitly aligns the two languages on top of the pre-trained model, which brings some improvement, but the largest gain comes from the data augmentation used for contrastive learning. The most useful component is the in-batch negatives. Among the query rewrites, swapping the order of words helps the most, which is fairly intuitive, because swapping two words usually does not affect reading comprehension.