【AI4Code】《Unified Pre-training for Program Understanding and Generation》 NAACL 2021
2022-07-25 13:08:00 【chad_ lee】
The paper proposes PLBART, a sequence-to-sequence model that can perform a broad range of program and language understanding and generation tasks. PLBART is pre-trained with denoising autoencoding on a large collection of Java and Python functions and the associated natural-language (NL) text. Experiments on code summarization, code generation, and code translation across seven programming languages show that PLBART outperforms or matches state-of-the-art models. Experiments on discriminative tasks such as program repair, clone detection, and vulnerable code detection further demonstrate its effectiveness at program understanding. The analysis also shows that PLBART learns program syntax, style (for example, identifier naming conventions), and logical flow (for example, an if block inside an else block is equivalent to an else if block), all of which are crucial to program semantics, so it performs well even with limited annotations.
Denoising pre-training
PLBART uses the BART-base architecture and applies seq2seq denoising pre-training to exploit unlabeled programming-language (PL) and NL data. It uses three noising strategies: token masking, token deletion, and token infilling. The noisy sequence is fed to the encoder, the original sequence shifted by one position is fed to the decoder, and the objective is to remove the noise and reconstruct the original sequence.

Token infilling replaces a span of 0 to k tokens with a single [MASK]; when the span length is 0, it simply inserts a mask.
During pre-training the ratio of NL to PL data is 1:14, so the data is up-sampled and down-sampled to reduce the bias toward PL.
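The three noising strategies can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-token probability `p`, the explicit `start` and `span_len` arguments, and the `<s>` and `</s>` boundary symbols are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # mask symbol as written in the text; the real vocabulary entry may differ


def token_masking(tokens, p, rng):
    """Replace each token with MASK independently with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]


def token_deletion(tokens, p, rng):
    """Drop each token independently with probability p."""
    return [t for t in tokens if rng.random() >= p]


def token_infilling(tokens, start, span_len):
    """Replace tokens[start:start + span_len] with a single MASK.

    With span_len == 0 nothing is removed, so a MASK is simply inserted.
    """
    return tokens[:start] + [MASK] + tokens[start + span_len:]


def make_training_example(tokens, noisy_tokens):
    """Pair a noisy encoder input with the shifted decoder input and the target.

    "<s>" and "</s>" are hypothetical boundary symbols used only to show the shift.
    """
    return {
        "encoder_input": list(noisy_tokens),
        "decoder_input": ["<s>"] + list(tokens),   # original, shifted right by one
        "target": list(tokens) + ["</s>"],         # what the decoder must reconstruct
    }


if __name__ == "__main__":
    rng = random.Random(0)
    code = "def add ( a , b ) : return a + b".split()
    print(token_masking(code, 0.3, rng))
    print(token_deletion(code, 0.3, rng))
    print(token_infilling(code, 2, 3))
    print(make_training_example(code, token_infilling(code, 2, 0)))
```

In every case the training target is the clean sequence, so the model is only ever asked to undo the corruption.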
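One common way to rebalance such a mixture is to sample each modality with probability proportional to its data share raised to a smoothing exponent. The sketch below assumes that scheme; the function name and the value `alpha = 0.3` are illustrative assumptions, not settings confirmed by this post.

```python
def smoothed_sampling_probs(sizes, alpha):
    """Return sampling probabilities proportional to (share of data) ** alpha.

    alpha = 1 keeps the raw proportions; alpha < 1 up-samples small
    modalities and down-samples large ones.
    """
    total = sum(sizes.values())
    scaled = {name: (size / total) ** alpha for name, size in sizes.items()}
    norm = sum(scaled.values())
    return {name: value / norm for name, value in scaled.items()}


if __name__ == "__main__":
    sizes = {"NL": 1, "PL": 14}  # the 1:14 ratio stated above
    print(smoothed_sampling_probs(sizes, 1.0))  # raw proportions
    print(smoothed_sampling_probs(sizes, 0.3))  # assumed smoothing exponent
```

With the 1:14 ratio and an exponent of 0.3, the NL share of sampled batches rises from about 0.07 to about 0.31, so PL still dominates but far less strongly.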
Downstream tasks

The generation tasks are code summarization (generating a description from code), code generation (generating code from a description), and code translation; all of them are seq2seq tasks.
There are also two classification tasks: clone detection and vulnerable code detection. For pair inputs, the two pieces of code are concatenated with a </s> token between them. The last output of the decoder is fed to a linear classifier to produce the prediction.
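The classification setup can be sketched in plain Python as follows. The helper names, the toy hidden states, and the weight matrix are hypothetical; a real model would produce the decoder states and learn the classifier weights during fine-tuning.

```python
SEP = "</s>"  # separator token named in the text


def build_pair_input(tokens_a, tokens_b):
    """Join two code snippets into one sequence with a single separator between them."""
    return list(tokens_a) + [SEP] + list(tokens_b)


def classify_from_last_state(decoder_states, weights, bias):
    """Apply a linear layer to the last decoder state and return (label, logits).

    decoder_states: list of hidden vectors, one per decoder position.
    weights: one row of coefficients per class; bias: one value per class.
    """
    last = decoder_states[-1]
    logits = [
        sum(w * h for w, h in zip(row, last)) + b
        for row, b in zip(weights, bias)
    ]
    label = max(range(len(logits)), key=logits.__getitem__)
    return label, logits


if __name__ == "__main__":
    pair = build_pair_input(["return", "a", "+", "b"], ["return", "b", "+", "a"])
    print(pair)
    # Toy 2-dimensional decoder states and a 2-class linear classifier.
    states = [[0.0, 0.0], [1.0, 2.0]]
    print(classify_from_last_state(states, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
```

Only the final decoder position is used here, which matches the description above; vulnerable code detection takes a single snippet, so it would skip the pairing step.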