How to convert an AST into a text vector before the word-vector conversion
2022-07-28 07:06:00 【ithicker】
Implementation idea
Environment:
In the article https://blog.csdn.net/lockhou/article/details/113883940 we already implemented, on Windows, converting a batch of .c files into their corresponding AST files and then generating a text vector from each AST file by node matching. So for every .c file there is a txt file storing its AST and a txt file storing its text vector, and the three files share the same name, because we decide whether a file contains a vulnerability from its file name.
Idea:
The original pipeline classifies the files into Train, Test and Validation, then reads the .c files directly, strips blank lines, removes stop words, and saves the processed data as pickle files, which feed the later word-vector conversion, model training, model testing and model validation. Since we want text vectors that carry structural information, we can no longer read the .c files directly; instead we read the text-vector txt file corresponding to each .c file and apply the same processing, so that the data used for word-vector conversion, training, testing and validation carries the structural information.
Porting the process from Windows to Linux
step1: install the Java JDK
See this article for the detailed steps: https://blog.csdn.net/lockhou/article/details/113904085
step2: modify movefiles.py
I finally decided to create, alongside each directory that holds the .c files, one folder for the AST files extracted from the .c files (stored under a Preprocessed folder) and one folder for the converted text-vector files (stored under a processed folder). So while building the Train, Test and Validation folders and their sub-folders, we also create a Preprocessed and a processed folder under every Non_vulnerable_functions and Vulnerable_functions folder. Add the following code around lines 46-55 of movefiles.py:
saveDir = tempDir
tempDir = saveDir + '/' + "Preprocessed"
if not os.path.exists(tempDir):
    # Non_vulnerable_functions/Non_vulnerable_functions/Preprocessed
    os.mkdir(tempDir)
tempDir = saveDir + '/' + "processed"
if not os.path.exists(tempDir):
    # Non_vulnerable_functions/Non_vulnerable_functions/processed
    os.mkdir(tempDir)
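As a side note (this is not what the original script uses, just an equivalent alternative): each existence check plus os.mkdir pair above can be collapsed into a single os.makedirs call, which does nothing if the folder already exists.

import os

# saveDir is the same variable as in the snippet above.
# exist_ok=True makes the call a no-op when the folder already exists.
os.makedirs(os.path.join(saveDir, "Preprocessed"), exist_ok=True)
os.makedirs(os.path.join(saveDir, "processed"), exist_ok=True)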
step3: modify ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py
We now have the locations for the AST files and the text-vector files, so we only need to call ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py repeatedly. For convenience we reorganize the two files into functions and pass the required paths in as parameters.
ProcessCFilesWithCodeSensor.py parameters:
1) CodeSensor_OUTPUT_PATH: the directory where the AST extracted from each .c file is saved as a .txt file, e.g.
"G:\The paper\The paper\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions\Preprocessed\"
2) CodeSensor_PATH: the location of CodeSensor.jar, e.g.
"D:\codesensor\CodeSensor.jar" (a fixed location, so it does not need to be passed as a parameter)
3) PATH: the directory where the .c files are stored, e.g.
"G:\The paper\The paper\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions"
ProcessRawASTs_DFT.py parameters:
1) FILE_PATH: the directory of the txt files that store the ASTs, e.g.
"G:\The paper\The paper\ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Preprocessed\"
2) Processed_FILE: the directory where the text-vector txt files are stored, e.g.
"G:\The paper\The paper\ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Processed\"
With these parameters, we reorganize the contents of the two files into functions as follows:
# ProcessCFilesWithCodeSensor.py
import os
from subprocess import Popen, STDOUT

def codesensor(CodeSensor_OUTPUT_PATH, PATH):
    CodeSensor_PATH = "./Code/codesensor-codeSensor-0.2/CodeSensor.jar"
    Full_path = ""
    for fpathe, dirs, fs in os.walk(PATH):
        for f in fs:
            if (os.path.splitext(f)[1] == '.c'): # Get the .c files only
                file_path = os.path.join(fpathe, f) # f is the .c file, which will be processed by CodeSensor
                # With each .c file open, CodeSensor will process the opened file and output all the processed files to a specified directory.
                # Full_path = CodeSensor_OUTPUT_PATH + "_" + f + ".txt"
                Full_path = CodeSensor_OUTPUT_PATH + os.path.splitext(f)[0] + ".txt"
                with open(Full_path, "w+") as output_file:
                    Popen(['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH, file_path], stdout=output_file, stderr=STDOUT)
                    output_file.close()
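One caveat: Popen launches CodeSensor asynchronously and codesensor() does not wait for the java processes to finish, so if text_vector is called right afterwards (as in step4 below) some AST files may still be incomplete. Below is a hedged variant (codesensor_blocking is my name, not part of the original scripts; the jar and java paths are the same placeholders as above) that keeps the process handles and waits for all of them.

import os
from subprocess import Popen, STDOUT

def codesensor_blocking(CodeSensor_OUTPUT_PATH, PATH,
                        CodeSensor_PATH="./Code/codesensor-codeSensor-0.2/CodeSensor.jar",
                        java_bin="/home/jdk1.8.0_65/bin/java"):
    # Same traversal as codesensor(), but every Popen handle is collected and
    # waited on, so each AST .txt file is fully written before the next stage reads it.
    procs = []
    for fpathe, dirs, fs in os.walk(PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.c':
                file_path = os.path.join(fpathe, f)
                out_path = CodeSensor_OUTPUT_PATH + os.path.splitext(f)[0] + ".txt"
                output_file = open(out_path, "w+")
                procs.append((Popen([java_bin, '-jar', CodeSensor_PATH, file_path],
                                    stdout=output_file, stderr=STDOUT), output_file))
    for p, fh in procs:
        p.wait()   # block until this CodeSensor process has finished
        fh.close()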
# ProcessRawASTs_DFT.py
import os
def DepthFirstExtractASTs(file_to_process, file_name):
    lines = []
    subLines = ''
    f = open(file_to_process)
    try:
        original_lines = f.readlines()
        print(original_lines)
        #lines.append(file_name) # The first element is the file name.
        for line in original_lines:
            if not line.isspace(): # Remove the empty line.
                line = line.strip('\n')
                str_lines = line.split('\t')
                #print (str_lines)
                if str_lines[0] != "water": # Remove lines starting with water.
                    #print (str_lines)
                    if str_lines[0] == "func":
                        # Add the return type of the function
                        subElement = str_lines[4].split() # Dealing with "static int" or "static void" or ...
                        if len(subElement) == 1:
                            lines.append(str_lines[4])
                        if subElement.count("*") == 0: # The element does not contain a pointer type. If it contains a pointer like (int *), it will be divided into 'int' and '*'.
                            if len(subElement) == 2:
                                lines.append(subElement[0])
                                lines.append(subElement[1])
                            if len(subElement) == 3:
                                lines.append(subElement[0])
                                lines.append(subElement[1])
                                lines.append(subElement[2])
                        else:
                            lines.append(str_lines[4])
                        #lines.append(str_lines[5]) # Add the name of the function
                        lines.append("func_name") # Add the name of the function
                    if str_lines[0] == "params":
                        lines.append("params")
                    if str_lines[0] == "param":
                        subParamElement = str_lines[4].split() # Add the possible type of the parameter
                        if len(subParamElement) == 1:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                        if subParamElement.count("*") == 0:
                            if len(subParamElement) == 2:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])
                            if len(subParamElement) == 3:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])
                                lines.append(subParamElement[2])
                        else:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                    if str_lines[0] == "stmnts":
                        lines.append("stmnts")
                    if str_lines[0] == "decl":
                        subDeclElement = str_lines[4].split() # Add the possible type of the declared variable
                        #print (len(subDeclElement))
                        if len(subDeclElement) == 1:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                        if subDeclElement.count("*") == 0:
                            if len(subDeclElement) == 2:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])
                            if len(subDeclElement) == 3:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])
                                lines.append(subDeclElement[2])
                        else:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                    if str_lines[0] == "op":
                        lines.append(str_lines[4])
                    if str_lines[0] == "call":
                        lines.append("call")
                        lines.append(str_lines[4])
                    if str_lines[0] == "arg":
                        lines.append("arg")
                    if str_lines[0] == "if":
                        lines.append("if")
                    if str_lines[0] == "cond":
                        lines.append("cond")
                    if str_lines[0] == "else":
                        lines.append("else")
                    if str_lines[0] == "stmts":
                        lines.append("stmts")
                    if str_lines[0] == "for":
                        lines.append("for")
                    if str_lines[0] == "forinit":
                        lines.append("forinit")
                    if str_lines[0] == "while":
                        lines.append("while")
                    if str_lines[0] == "return":
                        lines.append("return")
                    if str_lines[0] == "continue":
                        lines.append("continue")
                    if str_lines[0] == "break":
                        lines.append("break")
                    if str_lines[0] == "goto":
                        lines.append("goto")
                    if str_lines[0] == "forexpr":
                        lines.append("forexpr")
                    if str_lines[0] == "sizeof":
                        lines.append("sizeof")
                    if str_lines[0] == "do":
                        lines.append("do")
                    if str_lines[0] == "switch":
                        lines.append("switch")
                    if str_lines[0] == "typedef":
                        lines.append("typedef")
                    if str_lines[0] == "default":
                        lines.append("default")
                    if str_lines[0] == "register":
                        lines.append("register")
                    if str_lines[0] == "enum":
                        lines.append("enum")
                    if str_lines[0] == "union":
                        lines.append("union")
        print(lines)
        subLines = ','.join(lines)
        subLines = subLines + "," + "\n"
    finally:
        f.close()
    return subLines
def text_vector(FILE_PATH, Processed_FILE):
    big_line = []
    total_processed = 0
    for fpathe, dirs, fs in os.walk(FILE_PATH):
        for f in fs:
            if (os.path.splitext(f)[1] == '.txt'): # Get the AST .txt files only
                file_path = os.path.join(fpathe, f) # f is the AST .txt file produced by CodeSensor
                temp = DepthFirstExtractASTs(FILE_PATH + f, f)
                print(temp)
                f1 = open(Processed_FILE + os.path.splitext(f)[0] + ".txt", "w")
                f1.write(temp)
                f1.close()
                # big_line.append(temp)
                #time.sleep(0.001)
                total_processed = total_processed + 1
    print ("Totally, there are " + str(total_processed) + " files.")
step4: modify movefiles.py
We have now created the locations for the AST files and the text-vector files. We just need to run every .c file in each directory that contains .c files through the codesensor function, so that the generated AST files land in that directory's Preprocessed folder, and then call text_vector to convert the AST files in the Preprocessed directory into text-vector files stored in the corresponding processed folder. So at the end of movefiles.py, i.e. after all the folders have been created, add the following code:
from ProcessCFilesWithCodeSensor import *
from ProcessRawASTs_DFT import *
for i in range(len(FirstDir)):
    codesensor(Non_vul_func_trainDir[i] + "/Preprocessed/", Non_vul_func_trainDir[i])
    codesensor(Non_vul_func_testDir[i] + "/Preprocessed/", Non_vul_func_testDir[i])
    codesensor(Non_vul_func_validationDir[i] + "/Preprocessed/", Non_vul_func_validationDir[i])
    codesensor(Vul_func_trainDir[i] + "/Preprocessed/", Vul_func_trainDir[i])
    codesensor(Vul_func_testDir[i] + "/Preprocessed/", Vul_func_testDir[i])
    codesensor(Vul_func_validationDir[i] + "/Preprocessed/", Vul_func_validationDir[i])
    text_vector(Non_vul_func_trainDir[i] + "/Preprocessed/", Non_vul_func_trainDir[i] + "/processed/")
    text_vector(Non_vul_func_testDir[i] + "/Preprocessed/", Non_vul_func_testDir[i] + "/processed/")
    text_vector(Non_vul_func_validationDir[i] + "/Preprocessed/", Non_vul_func_validationDir[i] + "/processed/")
    text_vector(Vul_func_trainDir[i] + "/Preprocessed/", Vul_func_trainDir[i] + "/processed/")
    text_vector(Vul_func_testDir[i] + "/Preprocessed/", Vul_func_testDir[i] + "/processed/")
    text_vector(Vul_func_validationDir[i] + "/Preprocessed/", Vul_func_validationDir[i] + "/processed/")
step5: modify removeComments_Blanks.py and LoadCFilesAsText.py
We want to use the generated text-vector files in place of the .c files, but removeComments_Blanks.py and LoadCFilesAsText.py read the .c files directly, strip blanks from their contents, remove stop words, and then save the result as pickle files. So these two scripts should now read the text-vector files under the processed folder instead, while the other operations on the txt files stay unchanged, so that the later word-vector conversion, model training, testing and validation work as before.
Concretely, everywhere the two scripts build a path to a directory of .c files, extend the path down to the processed folder that sits alongside the .c files and read the text-vector files there; and when checking the file type, look for .txt files instead of .c files. A sketch of the intended change is given below.
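The post does not reproduce the modified scripts, so here is a minimal sketch of the change, with hypothetical names (load_text_vectors, the example directory and the pickle file name are illustrative, not from the original code): walk the processed folder instead of the .c directories, read each text-vector .txt file, and save the token lists as a pickle file for the later steps.

import os
import pickle

def load_text_vectors(processed_dir):
    # Read every text-vector .txt file under a 'processed' folder
    # (instead of the original .c files) and return the token lists and file names.
    samples, names = [], []
    for root, dirs, files in os.walk(processed_dir):
        for f in files:
            if os.path.splitext(f)[1] == '.txt':   # was '.c' in the original scripts
                with open(os.path.join(root, f)) as fp:
                    tokens = [t for t in fp.read().strip().split(',') if t]
                samples.append(tokens)
                names.append(os.path.splitext(f)[0])  # the file name still carries the vulnerable / non-vulnerable label
    return samples, names

# Save the result as a pickle file for the word-vector, training, testing and validation stages.
samples, names = load_text_vectors("FFmpeg/Vulnerable_functions/processed/")
with open("vulnerable_text_vectors.pkl", "wb") as out:
    pickle.dump((samples, names), out)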
step6: next, run the word-vector script and the training, testing and validation scripts as usual
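For completeness, a minimal sketch of the word-vector step on top of the pickled token lists, assuming gensim 4.x is installed; the parameter values and file names are illustrative, not taken from the original training scripts.

import pickle
from gensim.models import Word2Vec

# Load the token sequences produced in the previous step (hypothetical file name).
with open("vulnerable_text_vectors.pkl", "rb") as f:
    samples, names = pickle.load(f)

# Train word vectors on the flattened AST token sequences.
model = Word2Vec(sentences=samples, vector_size=100, window=5, min_count=1, workers=4)
model.save("ast_token_word2vec.model")

# Each token that occurs in the data (e.g. "func_name", "if", "int") now maps to a 100-dimensional vector.
print(model.wv["func_name"])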