【AI4Code】Codex: "Evaluating Large Language Models Trained on Code" (OpenAI)
2022-07-25 13:09:00 【chad_ lee】
Code generation: Codex (OpenAI)
This is the model behind OpenAI's Copilot, which has been very popular recently. The paper runs to 35 pages, with 58 authors from six collaborating organizations.

The idea is intuitive: make the model bigger, make the training set larger, use more compute, and you can generate longer code. What this paper does is apply the GPT model to code generation. Concretely, the input (the prompt) is a function's signature and comment, telling the model what the function should do, and the model outputs the implementation.
Here are three examples: the prompt is on the white background, and the code completed by the model is on the yellow background.
Actually pulling this off is hard and takes a great deal of work, which is why the paper has 58 authors from six collaborating organizations.
Evaluating the model

First of all, there must be a method to evaluate the ability of the model . Although here is a comment “ translate ” The task of coding , But it doesn't work BLEU Indicators to measure quality , This is because the method based on matching even if the code semantics are similar , But we can't evaluate the correctness of code function . Therefore, the paper takes the correctness of code function as the evaluation index , Specifically, use Unit testing methods To evaluate the code , The evaluation index is [email protected], That is, for every programming problem , Model output k Code answer , As long as one code can pass the unit test , It is considered that the problem is solved .
An evaluation dataset is also needed. The paper constructs HumanEval, a dataset of 164 programming problems of the kind shown in the figure above, each with about 8 hand-written unit tests. These problems must be original and hand-designed, because the model is trained on GitHub data, and the training set very likely contains solutions to many existing programming problems.
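For illustration, a HumanEval-style problem looks roughly like this: the prompt is a function signature plus a docstring, the model fills in the body, and hand-written assertions serve as the unit tests. The concrete problem below is only an illustrative sketch:

```python
# Prompt handed to the model: signature + docstring describing the task
def incr_list(l: list) -> list:
    """Return the list with all elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    """
    # --- everything below is the model's completion ---
    return [x + 1 for x in l]

# Hand-written unit tests (HumanEval averages roughly 8 per problem)
def check(candidate):
    assert candidate([]) == []
    assert candidate([1, 2, 3]) == [2, 3, 4]
    assert candidate([-1, 0, 5]) == [0, 1, 6]

check(incr_list)
```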
The code generated by the model is run and tested inside a sandbox. Because Codex's output is uncontrolled and could pose security risks, evaluation has to happen in an isolated environment. This is a little sci-fi.
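A minimal sketch of such an isolated test run: execute the sample plus its unit tests in a child process with a timeout. A real harness, like the one in the paper, would additionally drop privileges, disable networking, and cap memory, CPU, and disk:

```python
import os
import subprocess
import sys
import tempfile

def run_in_sandbox(code: str, timeout_s: float = 3.0) -> bool:
    """Run untrusted generated code plus its unit tests in a child
    process; exit code 0 means all tests passed."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as a failed test
    finally:
        os.unlink(path)
```

A child process with a timeout is only the first layer; it alone does not stop malicious code from touching the network or filesystem, which is why the paper uses full isolation.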
Training the model
Datasets
The training set was collected by OpenAI from 54 million GitHub projects: 179 GB of Python files, each under 1 MB. After filtering out some files, 159 GB remains. Codex is simply GPT trained on this 159 GB of code text.
Besides the training set, the paper designs three other datasets: the HumanEval dataset for evaluation; a Supervised Fine-Tuning dataset in a format similar to HumanEval, used for fine-tuning; and a docstring dataset used to train Codex-D.
Model
Codex
Codex has exactly the same architecture as GPT-3; only some parameter settings and training details differ. For example, extra tokens for whitespace runs of different lengths were added, which shrinks the token sequence needed to represent code by roughly 30%. When generating code, Codex stops as soon as it emits one of the tokens '\nclass', '\ndef', '\n#', '\nif', or '\nprint'. For pass@100, each token is sampled from the softmax distribution; for pass@1, the highest-probability token is picked at each step, i.e. greedy decoding of the local optimum.
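As a sketch, the stop-token handling can be thought of as truncating the sampled text at the first of those sequences, so each completion stays within a single function body (assuming completions are handled as plain strings; this helper is illustrative, not the paper's implementation):

```python
# Stop sequences that mark the end of a generated function body
STOP_SEQUENCES = ['\nclass', '\ndef', '\n#', '\nif', '\nprint']

def truncate_at_stop(completion: str) -> str:
    """Cut the sampled text at the earliest stop sequence, keeping
    only the single function body the prompt asked for."""
    cut = len(completion)
    for stop in STOP_SEQUENCES:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]
```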
The paper tried both training Codex from scratch and fine-tuning from GPT-3's parameters. Fine-tuning from GPT-3 brought no improvement in final quality, but it did converge faster, so that is the training strategy the paper adopts.
So Codex and GPT-3 are essentially the same model; they differ in training data and in the resulting parameter weights. The set of weights trained on code data is what is called Codex.
Codex-S
Codex is trained without supervision on the 159 GB of GitHub code text. The paper additionally creates a Supervised Fine-Tuning dataset whose format matches HumanEval; the model fine-tuned on it performs better.
Codex-D
Changing the Supervised Fine-Tuning format from <function header><docstring><function body> to <function header><function body><docstring> and fine-tuning on it yields Codex-D, a model that writes the docstring from the code.
Experimental results

Judging from the evaluation results, even a model as powerful as GPT-3 cannot generate usable code. The yellow line is Codex restricted to a single answer per problem; the green line is Codex-S, supervised fine-tuned on data in the HumanEval format; the red line lets Codex-S generate 100 answers per task and then picks the one with the highest mean log-probability; the purple line lets an oracle choose the correct answer among Codex-S's 100 samples. Evidently, if Codex-S is allowed 100 answers, it can solve almost all of the programming problems.
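The red line's ranking requires no unit tests at selection time: among the 100 samples, pick the one whose mean per-token log-probability is highest. A minimal sketch, assuming each sample comes paired with its token log-probs:

```python
import numpy as np

def best_by_mean_logprob(samples):
    """samples: list of (code, per_token_logprobs) pairs.
    Return the candidate with the highest mean per-token log-probability.
    The mean avoids the length bias of summed log-prob, which would
    favor short completions."""
    return max(samples, key=lambda s: np.mean(s[1]))[0]
```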
Model limitations
1. Codex has seen an astonishing amount of code, yet it can only write relatively simple code; it largely recites memorized code and recombines it.
2. When docstrings get long, the quality of the generated code drops (this is what AlphaCode tackles).
3. Code involving mathematics is written poorly.
So Codex's understanding of coding tasks is actually not very deep; it behaves more like code translation (comments → code).
Security discussion
The paper and the project put a lot of effort into carefully discussing safety, privacy, and other sensitive issues.
Over-reliance
People may become over-reliant on model-generated code, which may contain bugs.
Misalignment
With very large models and complex tasks, the model may misunderstand the docstring and return code that looks roughly correct but differs from the intent in its details, which is hard to spot.
Bias and representation
Because the model is trained on GitHub code, and GitHub users are mostly male, it may carry gender bias; for example, comments may contain a lot of profanity.
Economic and labor market impacts
It may affect market share. For example, if Codex-generated code tends to favor certain packages, those tools get used more, and wide adoption of the model would shift package usage rates. If OpenAI built a new framework and its models always generated code for that framework, heavy use could eventually crowd out PyTorch and TensorFlow.
Security implications
Codex could be used to write viruses or malware, and to produce many different variants at once, making it hard for anti-virus vendors and security teams to mount an effective defense.
Environmental impacts
Mainly the power consumed by training and serving such large models.
Legal implications
The model is trained on open GitHub code, which raises fair-use questions. If it serves the public good, foreign law generally sees no problem, but if the model is commercialized, fair use is less clearly applicable.
Codex often copies other people's code verbatim; in practice Copilot reproduces existing code with non-trivial probability. But the user cannot tell whether the code was copied, which may implicate code patent protection, so great care is needed.
On this point OpenAI emphasizes that Copilot is just a "pen", a "compiler": the user is responsible for the generated code.
Further security discussion
Generating malicious code: experiments in the appendix show that Codex on its own is not good at generating complete malware, but it can still be used to generate some components of malicious code.
The paper investigates whether Codex recommends vulnerable, malicious, or typosquatted software dependencies; for example, a specific version of a Python package may contain vulnerabilities that make downstream applications attackable. Experiments show that whether Codex emits a typosquatted package depends mainly on the human-written prompt: if the prompt names a typosquatted package, Codex will not correct it, so it could become a vector for typosquatting attacks.
There are also security issues shared with text-generation models, such as extraction of training-data secrets. This is discussed further on the Copilot website: Copilot may output private data such as personal emails or ID numbers. Copilot applies filtering, but makes no guarantee, "if you try hard enough."
Data poisoning attack
Because Codex is pre-trained and fine-tuned on public data, an attacker can insert adversarial outputs that cause the model to produce brittle, malicious, or misaligned code, and this risk may grow as model capability and attacker interest increase. Since Codex trains on large amounts of untrusted data, it may generate unsafe code.
The authors study two cases of whether Codex generates unsafe code: whether it produces code that calls key-generation functions to create RSA keys shorter than 2048 bits, and whether it recommends ECB mode for AES encryption. Across Codex models of different sizes, they constructed the relevant prompts (generating RSA keys, and using AES), generated 30,000 code samples, filtered out the unusable ones, and measured the proportion of insecure code:
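For a rough idea of how such samples might be screened, here is a heuristic sketch that flags the two patterns. The regexes and the `key_size` parameter name are illustrative assumptions, not the paper's actual checker:

```python
import re

# Illustrative patterns only; real analysis would parse the code properly
RSA_KEY_SIZE_RE = re.compile(r"key_size\s*=\s*(\d+)")
AES_ECB_RE = re.compile(r"MODE_ECB")

def looks_insecure(code: str) -> bool:
    """Flag a generated sample that creates an RSA key shorter than
    2048 bits or selects ECB mode for AES."""
    m = RSA_KEY_SIZE_RE.search(code)
    if m and int(m.group(1)) < 2048:
        return True
    return AES_ECB_RE.search(code) is not None
```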

The result: the model's robustness on this problem shows no clear relationship to model size.
On this point, the discussion on the Copilot website says you should judge for yourself ("as well as your own judgment").
By comparison, AlphaCode's scenario is less practical, so its security discussion is sparse.