【AI4Code】Codex: "Evaluating Large Language Models Trained on Code" (OpenAI)
2022-07-25 13:09:00 【chad_ lee】
Code generation: Codex (OpenAI)
This is the model behind OpenAI's Copilot, which has been very popular recently. The paper runs to 35 pages, with 58 authors from six collaborating organizations.

The idea is intuitive: make the model bigger, make the training set larger, use more compute, and you can generate longer code. What this paper does is apply the GPT model to code generation. Concretely, the input (the prompt) is a function's signature and comment, telling the model what the function should do, and the model outputs the implementation.
Here are three examples: the prompt is on the white background, and the code completed by the model is on the yellow background.
Actually pulling this off is hard and takes a great deal of work, which is why the paper has 58 authors from six collaborating organizations.
Evaluating the model

First of all, there must be a method to evaluate the ability of the model . Although here is a comment “ translate ” The task of coding , But it doesn't work BLEU Indicators to measure quality , This is because the method based on matching even if the code semantics are similar , But we can't evaluate the correctness of code function . Therefore, the paper takes the correctness of code function as the evaluation index , Specifically, use Unit testing methods To evaluate the code , The evaluation index is [email protected], That is, for every programming problem , Model output k Code answer , As long as one code can pass the unit test , It is considered that the problem is solved .
An evaluation dataset is also needed. The paper constructs HumanEval, a dataset of 164 programming problems of the kind shown in the figure above, each with about 8 hand-written unit tests. These problems must be original and hand-designed, because the model is trained on GitHub data, and the training set very likely contains solutions to many existing programming problems.
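For illustration, a HumanEval-style problem looks roughly like this: the prompt is a function signature plus a docstring, the model fills in the body, and hand-written assertions serve as the unit tests. The concrete problem below is only an illustrative sketch:

```python
# Prompt handed to the model: signature + docstring describing the task
def incr_list(l: list) -> list:
    """Return the list with all elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    """
    # --- everything below is the model's completion ---
    return [x + 1 for x in l]

# Hand-written unit tests (HumanEval averages roughly 8 per problem)
def check(candidate):
    assert candidate([]) == []
    assert candidate([1, 2, 3]) == [2, 3, 4]
    assert candidate([-1, 0, 5]) == [0, 1, 6]

check(incr_list)
```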
The code generated by the model is run and tested inside a sandbox. Because Codex's output is uncontrolled and could pose security risks, evaluation has to happen in an isolated environment. This is a little sci-fi.
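A minimal sketch of such an isolated test run: execute the sample plus its unit tests in a child process with a timeout. A real harness, like the one in the paper, would additionally drop privileges, disable networking, and cap memory, CPU, and disk:

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout_s: float = 3.0) -> bool:
    """Run untrusted generated code plus its unit tests in a child
    process; exit code 0 means all tests passed."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as a failed test
    finally:
        os.unlink(path)
```

A child process with a timeout is only the first layer; it alone does not stop malicious code from touching the network or filesystem, which is why the paper uses full isolation.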
Training the model
Datasets
The training set was collected by OpenAI from 54 million GitHub projects: 179 GB of Python files, each under 1 MB. After filtering out some files, 159 GB remains. Codex is simply GPT trained on this 159 GB of code text.
Besides the training set, the paper designs three other datasets: the HumanEval dataset for evaluation; a Supervised Fine-Tuning dataset in a format similar to HumanEval, used for fine-tuning; and a docstring dataset used to train Codex-D.
Model
Codex
Codex has exactly the same architecture as GPT-3; only some parameter settings and training details differ. For example, extra tokens for whitespace runs of different lengths were added, which shrinks the token sequence needed to represent code by roughly 30%. When generating code, Codex stops as soon as it emits one of the tokens '\nclass', '\ndef', '\n#', '\nif', or '\nprint'. For pass@100, each token is sampled from the softmax distribution; for pass@1, the highest-probability token is picked at each step, i.e. greedy decoding of the local optimum.
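As a sketch, the stop-token handling can be thought of as truncating the sampled text at the first of those sequences, so each completion stays within a single function body (assuming completions are handled as plain strings; this helper is illustrative, not the paper's implementation):

```python
# Stop sequences that mark the end of a generated function body
STOP_SEQUENCES = ['\nclass', '\ndef', '\n#', '\nif', '\nprint']

def truncate_at_stop(completion: str) -> str:
    """Cut the sampled text at the earliest stop sequence, keeping
    only the single function body the prompt asked for."""
    cut = len(completion)
    for stop in STOP_SEQUENCES:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```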
The paper tried both training Codex from scratch and fine-tuning from GPT-3's parameters. Fine-tuning from GPT-3 brought no improvement in final quality, but it did converge faster, so that is the training strategy the paper adopts.
So Codex and GPT-3 are essentially the same model; they differ in training data and in the resulting parameter weights. The set of weights trained on code data is what is called Codex.
Codex-S
Codex is trained without supervision on the 159 GB of GitHub code text. The paper additionally creates a Supervised Fine-Tuning dataset whose format matches HumanEval; the model fine-tuned on it performs better.
Codex-D
Changing the Supervised Fine-Tuning format from <function header><docstring><function body> to <function header><function body><docstring> and fine-tuning on it yields Codex-D, a model that writes the docstring from the code.
Experimental results

Judging from the evaluation results, even a model as powerful as GPT-3 cannot generate usable code. The yellow line is Codex restricted to a single answer per problem; the green line is Codex-S, supervised fine-tuned on data in the HumanEval format; the red line lets Codex-S generate 100 answers per task and then picks the one with the highest mean log-probability; the purple line lets an oracle choose the correct answer among Codex-S's 100 samples. Evidently, if Codex-S is allowed 100 answers, it can solve almost all of the programming problems.
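The red line's ranking requires no unit tests at selection time: among the 100 samples, pick the one whose mean per-token log-probability is highest. A minimal sketch, assuming each sample comes paired with its token log-probs:

```python
import numpy as np

def best_by_mean_logprob(samples):
    """samples: list of (code, per_token_logprobs) pairs.
    Return the candidate with the highest mean per-token log-probability.
    The mean avoids the length bias of summed log-prob, which would
    favor short completions."""
    return max(samples, key=lambda s: np.mean(s[1]))[0]
```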
Model limitations
1. Codex has seen an astonishing amount of code, yet it can only write relatively simple code; it largely recites memorized code and recombines it.
2. When docstrings get long, the quality of the generated code drops (this is what AlphaCode tackles).
3. Code involving mathematics is written poorly.
So Codex's understanding of coding tasks is actually not very deep; it behaves more like code translation (comments → code).
Security discussion
The paper and the project put a lot of effort into carefully discussing safety, privacy, and other sensitive issues.
Over-reliance
People may become over-reliant on model-generated code, which may contain bugs.
Misalignment
With very large models and complex tasks, the model may misunderstand the docstring and return code that looks roughly correct but differs from the intent in its details, which is hard to spot.
Bias and representation
Because the model is trained on GitHub code, and GitHub users are mostly male, it may carry gender bias; for example, comments may contain a lot of profanity.
Economic and labor market impacts
It may affect market share. For example, if Codex-generated code tends to favor certain packages, those tools get used more, and wide adoption of the model would shift package usage rates. If OpenAI built a new framework and its models always generated code for that framework, heavy use could eventually crowd out PyTorch and TensorFlow.
Security implications
Codex could be used to write viruses or malware, and to produce many different variants at once, making it hard for anti-virus vendors and security teams to mount an effective defense.
Environmental impacts
Mainly the power consumed by training and serving such large models.
Legal implications
The model is trained on open GitHub code, which raises fair-use questions. If it serves the public good, foreign law generally sees no problem, but if the model is commercialized, fair use is less clearly applicable.
Codex often copies other people's code verbatim; in practice Copilot reproduces existing code with non-trivial probability. But the user cannot tell whether the code was copied, which may implicate code patent protection, so great care is needed.
On this point OpenAI emphasizes that Copilot is just a "pen", a "compiler": the user is responsible for the generated code.
Further security discussion
Generating malicious code: experiments in the appendix show that Codex on its own is not good at generating complete malware, but it can still be used to generate some components of malicious code.
The paper investigates whether Codex recommends vulnerable, malicious, or typosquatted software dependencies; for example, a specific version of a Python package may contain vulnerabilities that make downstream applications attackable. Experiments show that whether Codex emits a typosquatted package depends mainly on the human-written prompt: if the prompt names a typosquatted package, Codex will not correct it, so it could become a vector for typosquatting attacks.
There are also security issues shared with text-generation models, such as extraction of training-data secrets. This is discussed further on the Copilot website: Copilot may output private data such as personal emails or ID numbers. Copilot applies filtering, but makes no guarantee, "if you try hard enough."
Data poisoning attack
Because Codex is pre-trained and fine-tuned on public data, an attacker can insert adversarial outputs that cause the model to produce brittle, malicious, or misaligned code, and this risk may grow as model capability and attacker interest increase. Since Codex trains on large amounts of untrusted data, it may generate unsafe code.
The authors study two cases of whether Codex generates unsafe code: whether it produces code that calls key-generation functions to create RSA keys shorter than 2048 bits, and whether it recommends ECB mode for AES encryption. Across Codex models of different sizes, they constructed the relevant prompts (generating RSA keys, and using AES), generated 30,000 code samples, filtered out the unusable ones, and measured the proportion of insecure code:
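For a rough idea of how such samples might be screened, here is a heuristic sketch that flags the two patterns. The regexes and the `key_size` parameter name are illustrative assumptions, not the paper's actual checker:

```python
import re

# Illustrative patterns only; real analysis would parse the code properly
RSA_KEY_SIZE_RE = re.compile(r"key_size\s*=\s*(\d+)")
AES_ECB_RE = re.compile(r"MODE_ECB")

def looks_insecure(code: str) -> bool:
    """Flag a generated sample that creates an RSA key shorter than
    2048 bits or selects ECB mode for AES."""
    m = RSA_KEY_SIZE_RE.search(code)
    if m and int(m.group(1)) < 2048:
        return True
    return AES_ECB_RE.search(code) is not None
```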

The result: the model's robustness on this problem shows no clear relationship to model size.
On this point, the discussion on the Copilot website says you should judge for yourself ("as well as your own judgment").
By comparison, AlphaCode's scenario is less practical, so its security discussion is sparse.