How to convert source code to an AST first, and then to word vectors

2022-07-28 05:29:00 【ithicker】

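Implementation approach

Environment:

In https://blog.csdn.net/lockhou/article/details/113883940 we already implemented, on Windows, the conversion of a set of .c files into their AST files, followed by node matching on each AST file to produce a text vector. The result is that every .c file has one .txt file storing its AST and one .txt file storing its text vector. All three files share the same name, because whether a file contains a vulnerability is encoded in its file name.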

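Idea:

The original pipeline first splits the files into Train, Test, and Validation, then reads the .c files directly, removes blanks and stop words, and saves the result as pickle files. Those pickle files supply the data for word-vector conversion, model training, model testing, and model validation. Since we want the text vectors to carry structural information, we no longer read the .c files directly. Instead we read the text-vector .txt file that corresponds to each .c file and apply the same operations, so that the data passed to word-vector conversion, training, testing, and validation carries structural information.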

Moving from Windows to Linux

step1 Install the JDK on Linux

For the detailed steps, see: https://blog.csdn.net/lockhou/article/details/113904085

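step2 Modify movefiles.py

I decided to create, in every directory that holds the moved .c files, one folder for the AST files extracted from the .c files (the Preprocessed folder) and one folder for the resulting text-vector files (the processed folder). So when we create the Train, Test, and Validation folders and their subfolders, we also create Preprocessed and processed folders under the Non_vulnerable_functions and Vulnerable_functions folders of every combination. The following code is added at lines 46-55: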

        saveDir = tempDir
        tempDir = saveDir + '/' + "Preprocessed"
        if not os.path.exists(tempDir):
            # Non_vulnerable_functions/Non_vulnerable_functions/Preprocessed
            os.mkdir(tempDir)
        tempDir = saveDir + '/' + "processed"
        if not os.path.exists(tempDir):
            # Non_vulnerable_functions/Non_vulnerable_functions/processed
            os.mkdir(tempDir)
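The same folder creation can be written more compactly. The following is a minimal sketch, not the code in movefiles.py: the helper name make_output_dirs is made up for illustration, and it assumes the class directory (for example a Vulnerable_functions folder) is passed in as a string.

```python
import os
import tempfile


def make_output_dirs(class_dir):
    """Create Preprocessed/ and processed/ under class_dir and return both paths.

    make_output_dirs is a hypothetical helper, not part of movefiles.py.
    """
    created = []
    for name in ("Preprocessed", "processed"):
        path = os.path.join(class_dir, name)
        # exist_ok=True makes the call safe to repeat; missing parents are created too.
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for p in make_output_dirs(os.path.join(root, "Vulnerable_functions")):
            print(p)
```

Using os.makedirs with exist_ok=True removes the need for the explicit os.path.exists checks, at the cost of also creating any missing parent directories.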

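step2 Modify ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py

We now have locations for storing the ASTs and the text vectors, so all we need to do is call ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py repeatedly. To make that convenient, we wrap the contents of both files in functions and pass the required values as arguments at call time.

Parameters of ProcessCFilesWithCodeSensor.py:
1) CodeSensor_OUTPUT_PATH: the directory where the AST extracted from each .c file is saved as a .txt file
“G:\论文\论文\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions\Preprocessed\”
2) CodeSensor_PATH: the location of CodeSensor.jar
“D:\codesensor\CodeSensor.jar” (the location is fixed, so it does not need to be passed in as an argument)
3) PATH: the directory where the .c files are stored
“G:\论文\论文\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions”
Parameters of ProcessRawASTs_DFT.py:
1) FILE_PATH: the directory containing the AST .txt files
“G:\论文\论文\ast\function_representation_learning-master\” + Project_Name + “\Vulnerable_functions\Preprocessed\”
2) Processed_FILE: the directory where the text-vector .txt files are stored
“G:\论文\论文\ast\function_representation_learning-master\” + Project_Name + “\Vulnerable_functions\Processed\”

Based on these parameters, we reorganize the contents of the two files into the following functions: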

# ProcessCFilesWithCodeSensor.py

import os
from subprocess import STDOUT, call

def codesensor(CodeSensor_OUTPUT_PATH, PATH):
    # The location of CodeSensor.jar is fixed, so it is not passed in as a parameter.
    CodeSensor_PATH = "./Code/codesensor-codeSensor-0.2/CodeSensor.jar"

    for fpathe, dirs, fs in os.walk(PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.c':  # Get the .c files only
                file_path = os.path.join(fpathe, f)  # The .c file to be processed by CodeSensor

                # CodeSensor's output for each .c file goes to a .txt file with the same base name.
                Full_path = CodeSensor_OUTPUT_PATH + os.path.splitext(f)[0] + ".txt"
                with open(Full_path, "w+") as output_file:
                    # call() waits for CodeSensor to finish, so the AST file is complete
                    # before the next file is started; the with block closes the file.
                    call(['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH, file_path], stdout=output_file, stderr=STDOUT)



# ProcessRawASTs_DFT.py

import os

def DepthFirstExtractASTs(file_to_process, file_name):
    
    lines = []
    subLines = ''
    
    f = open(file_to_process)
    try:
        original_lines = f.readlines()
        print(original_lines)
        #lines.append(file_name) # The first element is the file name.
        for line in original_lines:
            if not line.isspace(): # Remove the empty line.
                line = line.strip('\n')
                str_lines = line.split('\t')   
                #print (str_lines)
                if str_lines[0] != "water": # Remove lines starting with water.
                    #print (str_lines)
                    if str_lines[0] == "func":
                        # Add the return type of the function
                        subElement = str_lines[4].split() # Dealing with "static int" or "static void" or ...
                        if len(subElement) == 1:
                            lines.append(str_lines[4])
                        if subElement.count("*") == 0: # The element does not contain pointer type. If it contains pointer like (int *), it will be divided to 'int' and '*'.
                            if len(subElement) == 2:
                                lines.append(subElement[0])
                                lines.append(subElement[1]) 
                            if len(subElement) == 3:
                                lines.append(subElement[0])
                                lines.append(subElement[1])    
                                lines.append(subElement[2])
                        else:
                            lines.append(str_lines[4])
                        #lines.append(str_lines[5]) # Add the name of the function
                        lines.append("func_name") # Add the name of the function
                    if str_lines[0] == "params":
                        lines.append("params")                    
                    if str_lines[0] == "param":
                        subParamElement = str_lines[4].split() # Add the possible type of the parameter
                        if len(subParamElement) == 1:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                        if subParamElement.count("*") == 0:
                            if len(subParamElement) == 2:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1]) 
                            if len(subParamElement) == 3:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])    
                                lines.append(subParamElement[2])
                        else:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type                           
                    if str_lines[0] == "stmnts":
                        lines.append("stmnts")                    
                    if str_lines[0] == "decl":
                        subDeclElement = str_lines[4].split() # Add the possible type of the declared variable
                        #print (len(subDeclElement))
                        if len(subDeclElement) == 1:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                        if subDeclElement.count("*") == 0:
                            if len(subDeclElement) == 2:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1]) 
                            if len(subDeclElement) == 3:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])    
                                lines.append(subDeclElement[2])
                        else:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                    if str_lines[0] == "op":
                        lines.append(str_lines[4])
                    if str_lines[0] == "call":
                        lines.append("call")
                        lines.append(str_lines[4])
                    if str_lines[0] == "arg":
                        lines.append("arg")
                    if str_lines[0] == "if":
                        lines.append("if")
                    if str_lines[0] == "cond":
                        lines.append("cond")
                    if str_lines[0] == "else":
                        lines.append("else")
                    if str_lines[0] == "stmts":
                        lines.append("stmts")
                    if str_lines[0] == "for":
                        lines.append("for") 	
                    if str_lines[0] == "forinit":
                        lines.append("forinit")
                    if str_lines[0] == "while":
                        lines.append("while")
                    if str_lines[0] == "return":
                        lines.append("return")
                    if str_lines[0] == "continue":
                        lines.append("continue")
                    if str_lines[0] == "break":
                        lines.append("break")
                    if str_lines[0] == "goto":
                        lines.append("goto")
                    if str_lines[0] == "forexpr":
                        lines.append("forexpr")
                    if str_lines[0] == "sizeof":
                        lines.append("sizeof")
                    if str_lines[0] == "do":
                        lines.append("do")   
                    if str_lines[0] == "switch":
                        lines.append("switch")   
                    if str_lines[0] == "typedef":
                        lines.append("typedef")
                    if str_lines[0] == "default":
                        lines.append("default")
                    if str_lines[0] == "register":
                        lines.append("register")
                    if str_lines[0] == "enum":
                        lines.append("enum")
                    if str_lines[0] == "union":
                        lines.append("union")
                    
        print(lines)
        subLines = ','.join(lines)
        subLines = subLines + "," + "\n"
    finally:
        f.close()
    return subLines  # Returned outside finally so that exceptions are not silently swallowed
 
def text_vector(FILE_PATH, Processed_FILE):
    total_processed = 0

    for fpathe, dirs, fs in os.walk(FILE_PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.txt':  # Get the AST .txt files only
                file_path = os.path.join(fpathe, f)  # The AST file produced by CodeSensor
                temp = DepthFirstExtractASTs(file_path, f)
                print(temp)

                # Write the text vector to a .txt file with the same base name.
                with open(Processed_FILE + os.path.splitext(f)[0] + ".txt", "w") as f1:
                    f1.write(temp)

                total_processed = total_processed + 1

    print("Totally, there are " + str(total_processed) + " files.")
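The node matching in DepthFirstExtractASTs reads each tab-separated line, uses column 0 as the node type and column 4 as the value, and drops lines that start with "water". The sketch below condenses that idea into a lookup; it is an illustration, not a drop-in replacement. The name tokens_for_line and the sample lines are made up (the placeholder columns marked "-" do not reflect real CodeSensor output), only a subset of node types is listed, and type strings are always split on whitespace, which differs from the original function for pointer types.

```python
# Node types copied to the output unchanged (a subset of the full list).
PASS_THROUGH = {"params", "stmnts", "arg", "if", "cond", "else",
                "for", "while", "return"}
# Node types that are followed by a type string in column 4.
TYPED = {"param", "decl"}


def tokens_for_line(line):
    """Return the tokens contributed by one tab-separated AST line."""
    fields = line.rstrip("\n").split("\t")
    kind = fields[0]
    if kind in PASS_THROUGH:
        return [kind]
    if kind in TYPED:
        return [kind] + fields[4].split()
    if kind == "op":
        return [fields[4]]
    if kind == "call":
        return ["call", fields[4]]
    if kind == "func":
        # Return type first, then a fixed placeholder instead of the real name.
        return fields[4].split() + ["func_name"]
    return []  # "water" and any unlisted node type are dropped


# Hypothetical input: only columns 0 and 4 are meaningful here.
sample = [
    "func\t-\t-\t-\tstatic int\tfoo",
    "params\t-\t-\t-",
    "param\t-\t-\t-\tchar *\tbuf",
    "water\t-\t-\t-\tx",
    "call\t-\t-\t-\tmemcpy",
]
tokens = [t for line in sample for t in tokens_for_line(line)]
print(",".join(tokens) + ",")
```

Replacing the function name with the fixed token func_name, as the original code does, keeps project-specific identifiers out of the vocabulary.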

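step3 Modify movefiles.py

The locations for the AST files and the text-vector files now exist. For every directory that contains .c files, we call the codesensor function to generate the AST files and store them in that directory's Preprocessed folder, and then call the text_vector function to convert the AST files in each Preprocessed folder into text-vector files stored in the processed folder of the same directory. So at the end of movefiles.py, after all the folders have been created, we add the following code: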

from  ProcessCFilesWithCodeSensor import *
from  ProcessRawASTs_DFT import *

for i in range(len(FirstDir)):
    codesensor(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i])
    codesensor(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i])
    codesensor(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i])
    codesensor(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i])
    codesensor(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i])
    codesensor(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i])
    text_vector(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i]+"/processed/")
    text_vector(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i]+"/processed/")
    text_vector(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i]+"/processed/")
    text_vector(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i]+"/processed/")
    text_vector(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i]+"/processed/")
    text_vector(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i]+"/processed/")
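The twelve calls follow one pattern: for each class directory, extract ASTs into Preprocessed/ and then turn them into vectors under processed/. A minimal sketch of that pattern as a loop is shown here. The helper run_pipeline is hypothetical, the two stages are passed in as functions so the example runs without CodeSensor, and it assumes the AST stage has finished writing its files before the vector stage starts.

```python
def run_pipeline(class_dirs, extract_ast, to_vector):
    """Run both stages for every directory and return the directories handled.

    run_pipeline is a hypothetical helper; extract_ast and to_vector stand in
    for codesensor(output_dir, source_dir) and text_vector(ast_dir, vector_dir).
    """
    done = []
    for d in class_dirs:
        ast_dir = d + "/Preprocessed/"
        vec_dir = d + "/processed/"
        extract_ast(ast_dir, d)      # same argument order as codesensor(...)
        to_vector(ast_dir, vec_dir)  # same argument order as text_vector(...)
        done.append(d)
    return done


if __name__ == "__main__":
    calls = []
    dirs = ["Train/Vulnerable_functions", "Test/Vulnerable_functions"]
    run_pipeline(
        dirs,
        lambda out, src: calls.append(("ast", out, src)),
        lambda src, out: calls.append(("vec", src, out)),
    )
    for c in calls:
        print(c)
```

With the real functions, each of the six lists (Non_vul_func_trainDir, Vul_func_testDir, and so on) could be passed as class_dirs together with codesensor and text_vector.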

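step4 Modify removeComments_Blanks.py and LoadCFilesAsText.py

We want the generated text-vector files to replace the .c files. removeComments_Blanks.py and LoadCFilesAsText.py read the .c files directly, remove blanks and stop words from their contents, and save the result as pickle files. To use the text-vector files instead, we need to read the text-vector files in the processed folder; apart from reading .txt files, every other operation stays the same.
So in removeComments_Blanks.py and LoadCFilesAsText.py, every directory that is used for reading .c files must be extended to the processed folder located in the same directory as the .c files, so that the text-vector files are read, and the file-type check must look for .txt files instead of .c files.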
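As an illustration of the change described in this step, the sketch below reads the comma-separated text-vector files from a processed folder and pickles them. It is not the code of removeComments_Blanks.py or LoadCFilesAsText.py, which are not shown here; the function names load_vector_files and save_pickle and the output file name are assumptions.

```python
import os
import pickle
import tempfile


def load_vector_files(class_dir):
    """Read every .txt file under class_dir/processed.

    Returns (file_names, token_lists). Hypothetical helper for illustration.
    """
    vec_dir = os.path.join(class_dir, "processed")
    names, token_lists = [], []
    for root, _dirs, files in os.walk(vec_dir):
        for fname in sorted(files):
            if os.path.splitext(fname)[1] != ".txt":  # look for .txt, not .c
                continue
            with open(os.path.join(root, fname)) as fh:
                # Each file holds one comma-separated line with a trailing comma.
                tokens = [t for t in fh.read().strip().split(",") if t]
            names.append(fname)
            token_lists.append(tokens)
    return names, token_lists


def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as class_dir:
        os.makedirs(os.path.join(class_dir, "processed"))
        with open(os.path.join(class_dir, "processed", "f1.txt"), "w") as fh:
            fh.write("static,int,func_name,params,\n")
        names, token_lists = load_vector_files(class_dir)
        save_pickle((names, token_lists), os.path.join(class_dir, "data.pickle"))
        print(names, token_lists)
```

Keeping the file names next to the token lists preserves the link to the label, since the vulnerability information is carried by the file name.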