Converting ASTs into text vectors before word-vector conversion
2022-07-28 07:06:00 【ithicker】
Implementation idea
Environment:
In the article https://blog.csdn.net/lockhou/article/details/113883940 we already implemented, on Windows, the conversion of a set of .c files into their corresponding AST files, and the generation of a text vector from each AST file through node matching. As a result, every .c file has one txt file that stores its AST and one txt file that stores its text vector. All three files share the same name, because we determine from the file name whether a file contains a vulnerability.
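Because everything downstream relies on the three files sharing one base name, it can help to check that pairing before training. The following is only a minimal sketch: the helper name find_missing_outputs is hypothetical, and it assumes the layout used later in this post, where the AST files sit in a Preprocessed folder and the text-vector files in a processed folder next to the .c files.

```python
import os

def find_missing_outputs(c_dir):
    """Return base names of .c files that lack an AST or a text-vector .txt file."""
    missing = []
    for name in sorted(os.listdir(c_dir)):
        base, ext = os.path.splitext(name)
        if ext != ".c":
            continue  # skip sub-folders and non-C files
        # Assumed layout: <c_dir>/Preprocessed/<base>.txt and <c_dir>/processed/<base>.txt
        ast_file = os.path.join(c_dir, "Preprocessed", base + ".txt")
        vec_file = os.path.join(c_dir, "processed", base + ".txt")
        if not (os.path.isfile(ast_file) and os.path.isfile(vec_file)):
            missing.append(base)
    return missing
```

An empty list means every .c file in that directory has both companion files.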
Approach:
The original pipeline splits the files into Train, Test and Validation sets, then reads the .c files directly, removes blanks and stop words, and saves the converted data as a pickle file. That pickle file supplies the data for the later word-vector conversion, model training, model testing and model validation. Since we want text vectors to carry the structural information, we can no longer read the .c files directly. Instead, we read the text-vector txt file that corresponds to each .c file and apply the same processing, so that the data used for word-vector conversion, model training, model testing and model validation contains structural information.
Migrating from Windows to Linux
step1 Install the JDK for Java
See this article for the specific steps: https://blog.csdn.net/lockhou/article/details/113904085
step2 Modify movefiles.py
I decided to create two folders at the same level as the .c files in each directory: one for the AST files extracted from the .c files (the Preprocessed folder) and one for the converted text-vector files (the processed folder). So when we build the Train, Test and Validation folders and their subfolders, we also create a Preprocessed and a processed folder under each Non_vulnerable_functions and Vulnerable_functions folder. To do this, add the following code at lines 46-55 of movefiles.py:
saveDir = tempDir
tempDir = saveDir + '/'+ "Preprocessed"
if not os.path.exists(tempDir):
    #Non_vulnerable_functions/Non_vulnerable_functions/Preprocessed
    os.mkdir(tempDir)
tempDir = saveDir + '/'+ "processed"
if not os.path.exists(tempDir):
    #Non_vulnerable_functions/Non_vulnerable_functions/processed
    os.mkdir(tempDir)
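The same folder creation can also be written more compactly. This is a sketch rather than the code in movefiles.py: the helper name ensure_output_dirs is hypothetical, and it uses os.makedirs with exist_ok=True so that no separate existence check is needed.

```python
import os

def ensure_output_dirs(function_dir):
    """Create Preprocessed/ and processed/ under function_dir if they are missing."""
    created = []
    for sub in ("Preprocessed", "processed"):
        path = os.path.join(function_dir, sub)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created
```

Calling it twice on the same directory is harmless, which is convenient when movefiles.py is re-run.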
step2 Modify ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py
We now have locations for storing the ASTs and the text vectors, so all that remains is to call ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py repeatedly. To make them easy to call, we wrap the contents of the two files in functions and pass the required values as parameters.
ProcessCFilesWithCodeSensor.py parameters:
1) CodeSensor_OUTPUT_PATH: the directory where the AST extracted from each .c file is saved as a .txt file
"G:\ The paper \ The paper \ast\function_representation_learning-master\FFmpeg\Vulnerable_functions\Preprocessed\"
2) CodeSensor_PATH: the location of the CodeSensor jar
"D:\codesensor\CodeSensor.jar" (a fixed location, so it does not need to be passed as a parameter)
3) PATH: the directory where the .c files are stored
"G:\ The paper \ The paper \ast\function_representation_learning-master\FFmpeg\Vulnerable_functions"
ProcessRawASTs_DFT.py parameters:
1) FILE_PATH: the directory of the txt files that store the ASTs
"G:\ The paper \ The paper \ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Preprocessed\"
2) Processed_FILE: the directory of the txt files that store the text vectors
"G:\ The paper \ The paper \ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Processed\"
Based on these parameters, we wrap the contents of the two files in functions as follows:
#ProcessCFilesWithCodeSensor.py
import os
from subprocess import Popen, STDOUT

def codesensor(CodeSensor_OUTPUT_PATH, PATH):
    CodeSensor_PATH = "./Code/codesensor-codeSensor-0.2/CodeSensor.jar"
    Full_path = ""
    for fpathe, dirs, fs in os.walk(PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.c': # Get the .c files only
                file_path = os.path.join(fpathe, f) # f is the .c file, which will be processed by CodeSensor
                # CodeSensor processes each .c file and its output is written to the specified directory.
                Full_path = CodeSensor_OUTPUT_PATH + os.path.splitext(f)[0] + ".txt"
                with open(Full_path, "w+") as output_file:
                    # Wait for CodeSensor to finish so the AST file is complete before text_vector reads it.
                    # The with block closes the file, so no explicit close() is needed.
                    Popen(['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH, file_path], stdout=output_file, stderr=STDOUT).wait()
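The central step in codesensor is running an external command with its output redirected to a file. Below is a minimal sketch of that step using subprocess.run, which waits for the command to finish before returning. The helper name run_to_file is hypothetical; for CodeSensor the command list would be the java, -jar, jar path and .c file arguments used in the listing above.

```python
import subprocess

def run_to_file(cmd, out_path):
    """Run cmd, send stdout and stderr to out_path, and return the exit code."""
    with open(out_path, "w") as out:
        # subprocess.run blocks until the command has finished,
        # so the output file is complete when this function returns.
        result = subprocess.run(cmd, stdout=out, stderr=subprocess.STDOUT)
    return result.returncode

# Example call with a stand-in command (any executable on the machine works):
#   run_to_file(["echo", "func"], "demo_ast.txt")
```

Returning the exit code makes it possible to log or skip the .c files for which the tool failed, instead of silently producing an empty AST file.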
# ProcessRawASTs_DFT.py
import os

def DepthFirstExtractASTs(file_to_process, file_name):
    lines = []
    subLines = ''
    f = open(file_to_process)
    try:
        original_lines = f.readlines()
        print(original_lines)
        #lines.append(file_name) # The first element is the file name.
        for line in original_lines:
            if not line.isspace(): # Remove the empty line.
                line = line.strip('\n')
                str_lines = line.split('\t')
                #print (str_lines)
                if str_lines[0] != "water": # Remove lines starting with water.
                    #print (str_lines)
                    if str_lines[0] == "func":
                        # Add the return type of the function
                        subElement = str_lines[4].split() # Dealing with "static int" or "static void" or ...
                        if len(subElement) == 1:
                            lines.append(str_lines[4])
                        if subElement.count("*") == 0: # The element does not contain pointer type. If it contains pointer like (int *), it will be divided to 'int' and '*'.
                            if len(subElement) == 2:
                                lines.append(subElement[0])
                                lines.append(subElement[1])
                            if len(subElement) == 3:
                                lines.append(subElement[0])
                                lines.append(subElement[1])
                                lines.append(subElement[2])
                        else:
                            lines.append(str_lines[4])
                        #lines.append(str_lines[5]) # Add the name of the function
                        lines.append("func_name") # Add a placeholder for the name of the function
                    if str_lines[0] == "params":
                        lines.append("params")
                    if str_lines[0] == "param":
                        subParamElement = str_lines[4].split() # Add the possible type of the parameter
                        if len(subParamElement) == 1:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                        if subParamElement.count("*") == 0:
                            if len(subParamElement) == 2:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])
                            if len(subParamElement) == 3:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])
                                lines.append(subParamElement[2])
                        else:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                    if str_lines[0] == "stmnts":
                        lines.append("stmnts")
                    if str_lines[0] == "decl":
                        subDeclElement = str_lines[4].split() # Add the possible type of the declared variable
                        #print (len(subDeclElement))
                        if len(subDeclElement) == 1:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                        if subDeclElement.count("*") == 0:
                            if len(subDeclElement) == 2:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])
                            if len(subDeclElement) == 3:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])
                                lines.append(subDeclElement[2])
                        else:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                    if str_lines[0] == "op":
                        lines.append(str_lines[4])
                    if str_lines[0] == "call":
                        lines.append("call")
                        lines.append(str_lines[4])
                    if str_lines[0] == "arg":
                        lines.append("arg")
                    if str_lines[0] == "if":
                        lines.append("if")
                    if str_lines[0] == "cond":
                        lines.append("cond")
                    if str_lines[0] == "else":
                        lines.append("else")
                    if str_lines[0] == "stmts":
                        lines.append("stmts")
                    if str_lines[0] == "for":
                        lines.append("for")
                    if str_lines[0] == "forinit":
                        lines.append("forinit")
                    if str_lines[0] == "while":
                        lines.append("while")
                    if str_lines[0] == "return":
                        lines.append("return")
                    if str_lines[0] == "continue":
                        lines.append("continue")
                    if str_lines[0] == "break":
                        lines.append("break")
                    if str_lines[0] == "goto":
                        lines.append("goto")
                    if str_lines[0] == "forexpr":
                        lines.append("forexpr")
                    if str_lines[0] == "sizeof":
                        lines.append("sizeof")
                    if str_lines[0] == "do":
                        lines.append("do")
                    if str_lines[0] == "switch":
                        lines.append("switch")
                    if str_lines[0] == "typedef":
                        lines.append("typedef")
                    if str_lines[0] == "default":
                        lines.append("default")
                    if str_lines[0] == "register":
                        lines.append("register")
                    if str_lines[0] == "enum":
                        lines.append("enum")
                    if str_lines[0] == "union":
                        lines.append("union")
        print(lines)
        subLines = ','.join(lines)
        subLines = subLines + "," + "\n"
    finally:
        f.close()
    return subLines
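The func, param and decl branches above all apply the same rule to the type text: qualifiers such as "static int" are split into separate tokens, while a type containing a standalone "*" is kept as one token. The sketch below isolates that rule in a hypothetical helper named split_type. It is a simplification rather than the author's code (for example, it returns the stripped token for single-word types), and the sample type strings in the comments are made up for illustration.

```python
def split_type(type_text):
    """Split a C type string into tokens, roughly following the branches above."""
    parts = type_text.split()
    if "*" in parts:
        return [type_text]  # pointer types such as "int *" stay as one token
    if len(parts) <= 3:
        return parts        # "static int" becomes ["static", "int"]
    return []               # longer type strings add no type tokens, as in the listing

def tokens_for(node_kind, type_text):
    """Prefix the type tokens with the node kind, as done for param and decl."""
    return [node_kind] + split_type(type_text)
```

With such a helper, the three near-identical branches could each be reduced to a single call, which would make the extraction rule easier to change in one place.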
def text_vector(FILE_PATH, Processed_FILE):
    total_processed = 0
    for fpathe, dirs, fs in os.walk(FILE_PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.txt': # Get the AST .txt files only
                file_path = os.path.join(fpathe, f) # f is the AST file produced by CodeSensor
                temp = DepthFirstExtractASTs(file_path, f)
                print(temp)
                # Write the text vector to a file with the same base name.
                with open(Processed_FILE + os.path.splitext(f)[0] + ".txt", "w") as f1:
                    f1.write(temp)
                total_processed = total_processed + 1
    print("Totally, there are " + str(total_processed) + " files.")
step3 Modify movefiles.py
We have now created the locations for the AST files and the text-vector files. For each directory that contains .c files, we call the codesensor function so that the AST file generated for every .c file is stored in that directory's Preprocessed folder, and then call text_vector so that the AST files in Preprocessed are converted into text-vector files stored in that directory's processed folder. So at the end of movefiles.py, that is, after all folders have been created, add the following code:
from ProcessCFilesWithCodeSensor import *
from ProcessRawASTs_DFT import *
for i in range(len(FirstDir)):
    codesensor(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i])
    codesensor(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i])
    codesensor(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i])
    codesensor(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i])
    codesensor(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i])
    codesensor(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i])
    text_vector(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i]+"/processed/")
    text_vector(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i]+"/processed/")
    text_vector(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i]+"/processed/")
    text_vector(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i]+"/processed/")
    text_vector(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i]+"/processed/")
    text_vector(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i]+"/processed/")
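The twelve calls follow one pattern: for every directory, extract the ASTs into Preprocessed and then convert them into processed. A possible way to express that pattern once is sketched below. The function name run_pipeline is hypothetical, and the two processing functions are passed in as arguments so the sketch does not depend on CodeSensor being installed.

```python
def run_pipeline(dir_groups, extract_ast, to_text_vector):
    """Run AST extraction and text-vector conversion for every directory."""
    processed = []
    for group in dir_groups:      # e.g. one list per Train/Test/Validation split
        for d in group:           # each d is a directory that holds .c files
            ast_dir = d + "/Preprocessed/"
            extract_ast(ast_dir, d)                      # .c files -> AST txt files
            to_text_vector(ast_dir, d + "/processed/")   # AST txt -> text-vector txt
            processed.append(d)
    return processed

# Possible call inside movefiles.py, using the names from the listing above:
#   run_pipeline([Non_vul_func_trainDir, Non_vul_func_testDir,
#                 Non_vul_func_validationDir, Vul_func_trainDir,
#                 Vul_func_testDir, Vul_func_validationDir],
#                codesensor, text_vector)
```

The order of the calls differs from the listing (each directory is finished before the next one starts), but every directory still has its ASTs extracted before they are converted.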
step4 Modify removeComments_Blanks.py and LoadCFilesAsText.py
removeComments_Blanks.py and LoadCFilesAsText.py read the .c files directly, remove blanks and stop words from their contents, and save the result as a pickle file. Since we now want to use the generated text-vector files in place of the .c files, we need to read the text-vector files under the processed folder instead. The remaining operations on the txt contents stay unchanged.
So in removeComments_Blanks.py and LoadCFilesAsText.py, every directory from which .c files are read must be extended to the processed folder at the same level as the .c files, so that the text-vector files are read. When checking the file type, look for txt files instead of .c files.
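The change described here amounts to two things: pointing the reader at the processed subfolder and filtering on .txt instead of .c. The sketch below illustrates this under stated assumptions. The function name load_text_vectors is hypothetical and is not taken from LoadCFilesAsText.py, and it assumes each text-vector file holds one comma-separated line, as written by text_vector.

```python
import os

def load_text_vectors(c_dir):
    """Read every .txt under c_dir/processed and return (names, token_lists)."""
    vec_dir = os.path.join(c_dir, "processed")   # sibling folder of the .c files
    names, token_lists = [], []
    for name in sorted(os.listdir(vec_dir)):
        base, ext = os.path.splitext(name)
        if ext != ".txt":                        # previously this check was for '.c'
            continue
        with open(os.path.join(vec_dir, name)) as fh:
            text = fh.read()
        # Drop the empty token left by the trailing comma.
        tokens = [t for t in text.strip().split(",") if t]
        names.append(base)   # the base name still identifies the source file
        token_lists.append(tokens)
    return names, token_lists
```

The returned lists can then be passed through the existing blank and stop-word handling and saved with pickle.dump as before.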
step5 Run the word-vector script and the training and testing scripts as usual