Converting C files to ASTs before converting them to word vectors: an implementation
2022-07-28 05:29:00 【ithicker】
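Implementation approach
Environment:
In the article https://blog.csdn.net/lockhou/article/details/113883940 we already converted a set of .c files into AST files on Windows and then turned each AST file into a text vector by node matching. Each .c file therefore has one txt file storing its AST and one txt file storing its text vector. All three files share the same name, because whether a file is vulnerable is determined from its file name.
Approach:
Originally, the files are first split into Train, Test and Validation sets. The .c files are then read directly, blank content and stop words are removed, and the result is saved as pickle files that supply the data for word-vector conversion, model training, model testing and model validation. Since we want the text vectors to carry structural information, we no longer read the .c files. Instead we read the text-vector txt file that corresponds to each .c file and apply the same operations, so the data passed to word-vector conversion, training, testing and validation contains structural information.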
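Moving from Windows to Linux
step1 Install the JDK
For the detailed steps, see: https://blog.csdn.net/lockhou/article/details/113904085
step2 Modify movefiles.py
I decided to create two folders next to the moved .c files in every directory that contains them: Preprocessed stores the AST files extracted from the .c files, and processed stores the text-vector files derived from those ASTs. So when the Train, Test and Validation folders and their subfolders are created, a Preprocessed and a processed folder must also be created under the Non_vulnerable_functions and Vulnerable_functions folders of every combination. The following code is added at lines 46-55: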
saveDir = tempDir
tempDir = saveDir + '/' + "Preprocessed"
if not os.path.exists(tempDir):
    # Non_vulnerable_functions/Non_vulnerable_functions/Preprocessed
    os.mkdir(tempDir)
tempDir = saveDir + '/' + "processed"
if not os.path.exists(tempDir):
    # Non_vulnerable_functions/Non_vulnerable_functions/processed
    os.mkdir(tempDir)
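The same folder layout can be created more compactly. The following is a minimal sketch rather than the code in movefiles.py: the helper name `make_output_dirs` and the temporary directory in the usage lines are assumptions made for illustration. It relies on `os.makedirs(..., exist_ok=True)`, which makes the explicit `os.path.exists` checks unnecessary.

```python
import os
import tempfile


def make_output_dirs(save_dir):
    """Create the Preprocessed and processed folders inside save_dir.

    Returns both paths so a caller can hand them to the later steps.
    """
    created = []
    for name in ("Preprocessed", "processed"):
        path = os.path.join(save_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created


# Usage: a throwaway directory stands in for e.g. .../Vulnerable_functions
demo_root = tempfile.mkdtemp()
ast_dir, vector_dir = make_output_dirs(demo_root)
```

Calling the helper a second time on the same directory is harmless, which is convenient when the script is rerun.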
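step2 Modify ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py
We now have places to store the ASTs and the text vectors, so all that remains is to call ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py repeatedly. To make them easy to call, we wrap each file in a function and pass the required values as arguments.
Parameters of ProcessCFilesWithCodeSensor.py:
1) CodeSensor_OUTPUT_PATH: the directory where the AST extracted from each .c file is saved as a .txt file
“G:\论文\论文\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions\Preprocessed\”
2) CodeSensor_PATH: the location of CodeSensor.jar
“D:\codesensor\CodeSensor.jar” (the location is fixed, so it does not need to be passed in as a parameter)
3) PATH: the directory where the .c files are stored
“G:\论文\论文\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions”
Parameters of ProcessRawASTs_DFT.py:
1) FILE_PATH: the directory containing the AST txt files
“G:\论文\论文\ast\function_representation_learning-master\” + Project_Name + “\Vulnerable_functions\Preprocessed\”
2) Processed_FILE: the directory where the text-vector txt files are stored
“G:\论文\论文\ast\function_representation_learning-master\” + Project_Name + “\Vulnerable_functions\Processed\”
Based on these parameters, we organize the file contents into the following functions: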
# ProcessCFilesWithCodeSensor.py
import os
from subprocess import Popen, STDOUT

def codesensor(CodeSensor_OUTPUT_PATH, PATH):
    CodeSensor_PATH = "./Code/codesensor-codeSensor-0.2/CodeSensor.jar"
    for fpathe, dirs, fs in os.walk(PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.c':  # Process .c files only
                file_path = os.path.join(fpathe, f)  # The .c file that CodeSensor will parse
                # CodeSensor's output for each .c file goes to a same-named .txt file in the output directory.
                Full_path = CodeSensor_OUTPUT_PATH + os.path.splitext(f)[0] + ".txt"
                with open(Full_path, "w+") as output_file:
                    # Wait for CodeSensor to finish so the AST file is complete before the next step reads it.
                    Popen(['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH, file_path], stdout=output_file, stderr=STDOUT).wait()
# ProcessRawASTs_DFT.py
import os

def DepthFirstExtractASTs(file_to_process, file_name):
    lines = []
    subLines = ''
    f = open(file_to_process)
    try:
        original_lines = f.readlines()
        print(original_lines)
        #lines.append(file_name) # The first element is the file name.
        for line in original_lines:
            if not line.isspace(): # Remove the empty line.
                line = line.strip('\n')
                str_lines = line.split('\t')
                #print (str_lines)
                if str_lines[0] != "water": # Remove lines starting with water.
                    #print (str_lines)
                    if str_lines[0] == "func":
                        # Add the return type of the function
                        subElement = str_lines[4].split() # Dealing with "static int" or "static void" or ...
                        if len(subElement) == 1:
                            lines.append(str_lines[4])
                        if subElement.count("*") == 0: # The element does not contain pointer type. If it contains pointer like (int *), it will be divided to 'int' and '*'.
                            if len(subElement) == 2:
                                lines.append(subElement[0])
                                lines.append(subElement[1])
                            if len(subElement) == 3:
                                lines.append(subElement[0])
                                lines.append(subElement[1])
                                lines.append(subElement[2])
                        else:
                            lines.append(str_lines[4])
                        #lines.append(str_lines[5]) # Add the name of the function
                        lines.append("func_name") # Placeholder used instead of the real function name
                    if str_lines[0] == "params":
                        lines.append("params")
                    if str_lines[0] == "param":
                        subParamElement = str_lines[4].split() # Add the possible type of the parameter
                        if len(subParamElement) == 1:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                        if subParamElement.count("*") == 0:
                            if len(subParamElement) == 2:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])
                            if len(subParamElement) == 3:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])
                                lines.append(subParamElement[2])
                        else:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                    if str_lines[0] == "stmnts":
                        lines.append("stmnts")
                    if str_lines[0] == "decl":
                        subDeclElement = str_lines[4].split() # Add the possible type of the declared variable
                        #print (len(subDeclElement))
                        if len(subDeclElement) == 1:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                        if subDeclElement.count("*") == 0:
                            if len(subDeclElement) == 2:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])
                            if len(subDeclElement) == 3:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])
                                lines.append(subDeclElement[2])
                        else:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                    if str_lines[0] == "op":
                        lines.append(str_lines[4])
                    if str_lines[0] == "call":
                        lines.append("call")
                        lines.append(str_lines[4])
                    if str_lines[0] == "arg":
                        lines.append("arg")
                    if str_lines[0] == "if":
                        lines.append("if")
                    if str_lines[0] == "cond":
                        lines.append("cond")
                    if str_lines[0] == "else":
                        lines.append("else")
                    if str_lines[0] == "stmts":
                        lines.append("stmts")
                    if str_lines[0] == "for":
                        lines.append("for")
                    if str_lines[0] == "forinit":
                        lines.append("forinit")
                    if str_lines[0] == "while":
                        lines.append("while")
                    if str_lines[0] == "return":
                        lines.append("return")
                    if str_lines[0] == "continue":
                        lines.append("continue")
                    if str_lines[0] == "break":
                        lines.append("break")
                    if str_lines[0] == "goto":
                        lines.append("goto")
                    if str_lines[0] == "forexpr":
                        lines.append("forexpr")
                    if str_lines[0] == "sizeof":
                        lines.append("sizeof")
                    if str_lines[0] == "do":
                        lines.append("do")
                    if str_lines[0] == "switch":
                        lines.append("switch")
                    if str_lines[0] == "typedef":
                        lines.append("typedef")
                    if str_lines[0] == "default":
                        lines.append("default")
                    if str_lines[0] == "register":
                        lines.append("register")
                    if str_lines[0] == "enum":
                        lines.append("enum")
                    if str_lines[0] == "union":
                        lines.append("union")
        print(lines)
        subLines = ','.join(lines)
        subLines = subLines + "," + "\n"
    finally:
        f.close()
    return subLines
def text_vector(FILE_PATH, Processed_FILE):
    total_processed = 0
    for fpathe, dirs, fs in os.walk(FILE_PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.txt':  # Process the AST .txt files only
                file_path = os.path.join(fpathe, f)  # The AST file produced by CodeSensor
                temp = DepthFirstExtractASTs(file_path, f)
                print(temp)
                # Write the text vector to a same-named .txt file in the output directory.
                with open(Processed_FILE + os.path.splitext(f)[0] + ".txt", "w") as f1:
                    f1.write(temp)
                total_processed = total_processed + 1
    print("Totally, there are " + str(total_processed) + " files.")
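The func, param and decl branches of DepthFirstExtractASTs all apply the same rule to a type string. The sketch below isolates that rule so it is easier to see; the helper names `type_tokens` and `node_tokens` are assumptions and do not appear in the original script. It is a simplification: a type string consisting of a lone "*" would be appended twice by the original code, and that edge case is not reproduced here.

```python
def type_tokens(type_str):
    """Split a C type string into tokens the way the branches above do.

    - If a standalone "*" is present, the whole string is kept as one token.
    - Otherwise one to three words are emitted as separate tokens.
    - Four or more words without "*" produce nothing, as in the original.
    """
    parts = type_str.split()
    if "*" in parts:
        return [type_str]
    if 1 <= len(parts) <= 3:
        return parts
    return []


def node_tokens(node_type, type_str):
    """Tokens for a param or decl node: the node type, then its type tokens."""
    tokens = type_tokens(type_str)
    return [node_type] + tokens if tokens else []


# Usage: qualifiers become separate tokens, pointer types stay whole.
example_plain = node_tokens("decl", "unsigned int")
example_pointer = node_tokens("param", "const char *")
```

Note that only a "*" separated by whitespace counts as a pointer marker here, because the check is made on the whitespace-split words, exactly as `subElement.count("*")` does.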
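step3 Modify movefiles.py
The folders for the AST files and the text-vector files now exist. For every directory that contains .c files, we call codesensor to generate the AST files and store them in that directory's Preprocessed folder, then call text_vector to convert the AST files in each Preprocessed folder into text-vector files stored in that directory's processed folder. So at the end of movefiles.py, after all folders have been created, we add the following code: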
from ProcessCFilesWithCodeSensor import *
from ProcessRawASTs_DFT import *

for i in range(len(FirstDir)):
    codesensor(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i])
    codesensor(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i])
    codesensor(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i])
    codesensor(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i])
    codesensor(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i])
    codesensor(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i])
    text_vector(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i]+"/processed/")
    text_vector(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i]+"/processed/")
    text_vector(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i]+"/processed/")
    text_vector(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i]+"/processed/")
    text_vector(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i]+"/processed/")
    text_vector(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i]+"/processed/")
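The twelve calls above follow one pattern, so they can also be written as a loop. This is a sketch under assumptions: `run_pipeline` is a hypothetical helper, and the two stages are passed in as callables so the example runs without CodeSensor installed. In movefiles.py they would be codesensor and text_vector, and the groups would be the six directory lists.

```python
def run_pipeline(dir_groups, extract_asts, build_vectors):
    """Run both stages for every directory in every group.

    dir_groups    -- iterable of directory lists (train/test/validation,
                     vulnerable and non-vulnerable)
    extract_asts  -- callable(output_dir, source_dir), e.g. codesensor
    build_vectors -- callable(ast_dir, vector_dir), e.g. text_vector
    """
    for group in dir_groups:
        for source_dir in group:
            ast_dir = source_dir + "/Preprocessed/"
            vector_dir = source_dir + "/processed/"
            extract_asts(ast_dir, source_dir)
            build_vectors(ast_dir, vector_dir)


# Usage with stand-in callables that only record their arguments.
calls = []
run_pipeline(
    [["Train/A", "Train/B"], ["Test/A"]],
    lambda out_dir, src_dir: calls.append(("ast", out_dir, src_dir)),
    lambda ast_dir, vec_dir: calls.append(("vec", ast_dir, vec_dir)),
)
```

Unlike the listing above, this version finishes both stages for one directory before moving to the next. That ordering only works if the AST extraction step blocks until its output files are fully written.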
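step4 Modify removeComments_Blanks.py and LoadCFilesAsText.py
We want the generated text-vector files to replace the .c files. removeComments_Blanks.py and LoadCFilesAsText.py currently read the .c files directly, remove whitespace and stop words from their contents, and save the result as pickle files. To use the text-vector files instead, these scripts must read the txt files in the processed folders; everything else stays the same.
So in removeComments_Blanks.py and LoadCFilesAsText.py, every directory from which .c files are read is extended to the processed folder located in the same directory as the .c files, so that the text-vector files are read instead, and the file-type check looks for txt files rather than .c files.
step5 Then run the word-vector script and the training and testing scripts as usual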
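As an illustration of the change described above, the sketch below reads the text-vector files from a processed folder and pickles the result. It is not the code of removeComments_Blanks.py or LoadCFilesAsText.py, which this article does not show. The function name `load_vector_files`, the demo file names and the pickle file name are all assumptions, and the whitespace and stop-word handling of the real scripts is not reproduced.

```python
import os
import pickle
import tempfile


def load_vector_files(processed_dir):
    """Read every .txt file in processed_dir into (names, token_lists)."""
    names, token_lists = [], []
    for file_name in sorted(os.listdir(processed_dir)):
        if os.path.splitext(file_name)[1] != ".txt":
            continue  # skip anything that is not a text-vector file
        with open(os.path.join(processed_dir, file_name)) as handle:
            text = handle.read()
        # Files are comma-separated with a trailing comma; drop empty tokens.
        tokens = [tok for tok in text.strip().split(",") if tok]
        names.append(file_name)
        token_lists.append(tokens)
    return names, token_lists


# Usage: write one small demo file, load it, and round-trip through pickle.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "demo_func.txt"), "w") as handle:
    handle.write("int,func_name,params,param,int,stmnts,return,\n")
with open(os.path.join(demo_dir, "notes.md"), "w") as handle:
    handle.write("ignored")

names, token_lists = load_vector_files(demo_dir)
pickle_path = os.path.join(demo_dir, "vectors.pickle")
with open(pickle_path, "wb") as handle:
    pickle.dump((names, token_lists), handle)
with open(pickle_path, "rb") as handle:
    restored = pickle.load(handle)
```

The file names are kept alongside the token lists because the label (vulnerable or not) is derived from the file name.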