
How to convert ASTs into text vectors before converting them into word vectors

2022-07-28 07:06:00 ithicker

Implementation idea

Environment:

In the article https://blog.csdn.net/lockhou/article/details/113883940 we already implemented, on Windows, converting a batch of .c files into the corresponding AST files and then generating a text vector from each AST file through node matching, so that every .c file has a same-named .txt file storing its AST and a same-named .txt file storing its text vector. The three files share the same name because we determine whether a file contains a vulnerability from its file name.

Idea:

The original approach groups the files into Train, Test and Validation, reads the .c files directly, removes blanks and stop words, and saves the converted data as pickle files, which then feed the word-vector conversion, model training, model testing and model validation. Since we want the text vectors to carry structural information, we can no longer read the .c files directly; instead, for every .c file we read its corresponding text-vector .txt file and apply the same processing, so that the data used for word-vector conversion, model training, model testing and model validation carries structural information.
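
Because the three files share the same base name, the AST file and the text-vector file that belong to a .c file can be located by name alone. Below is a minimal sketch of that mapping, assuming the Preprocessed/processed folder layout introduced in step2; the helper name companion_files and the paths are only illustrative:

import os

def companion_files(c_file_path):
    # Given a .c file, derive its same-named AST .txt and text-vector .txt paths
    # from the Preprocessed and processed folders that sit next to it.
    directory = os.path.dirname(c_file_path)
    base = os.path.splitext(os.path.basename(c_file_path))[0]
    ast_txt = os.path.join(directory, "Preprocessed", base + ".txt")
    vector_txt = os.path.join(directory, "processed", base + ".txt")
    return ast_txt, vector_txt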

The process of moving from Windows to Linux

step1 Install the Java JDK

See this article for the detailed steps: https://blog.csdn.net/lockhou/article/details/113904085
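
Because the codesensor function below calls Java through a hard-coded path (/home/jdk1.8.0_65/bin/java), it is worth checking that the JVM actually runs from that path before processing thousands of files. This is only an optional sanity-check sketch, assuming that same install location:

import subprocess

JAVA_PATH = '/home/jdk1.8.0_65/bin/java'   # the path used later in codesensor()

# 'java -version' prints its output to stderr, so merge it into stdout.
result = subprocess.run([JAVA_PATH, '-version'],
                        stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
print(result.stdout.decode())
print("JDK OK" if result.returncode == 0 else "JDK not found at " + JAVA_PATH)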

step2 modify movefiles.py

I finally decided to create, under each directory that holds .c files, two sibling folders: one for the AST files extracted from the .c files (the Preprocessed folder) and one for the converted text-vector files (the processed folder). So while building the Train, Test and Validation folders and their sub-folders, we also need to create Preprocessed and processed folders under each Non_vulnerable_functions and Vulnerable_functions folder of every split. Therefore, add the following code at lines 46-55:

        saveDir = tempDir
        tempDir = saveDir + '/' + "Preprocessed"
        if not os.path.exists(tempDir):
            #Non_vulnerable_functions/Non_vulnerable_functions/Preprocessed
            os.mkdir(tempDir)
        tempDir = saveDir + '/' + "processed"
        if not os.path.exists(tempDir):
            #Non_vulnerable_functions/Non_vulnerable_functions/processed
            os.mkdir(tempDir)
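
As an aside, the same two folders can also be created with os.makedirs(..., exist_ok=True), which tolerates re-running movefiles.py. This is just an equivalent sketch of the snippet above, with saveDir taken from the surrounding code:

        # Equivalent to the snippet above: create both output folders under saveDir.
        for sub in ("Preprocessed", "processed"):
            os.makedirs(os.path.join(saveDir, sub), exist_ok=True)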

step3 modify ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py

We now have the locations for storing the AST files and the text-vector files, so all that is left is to call ProcessCFilesWithCodeSensor.py and ProcessRawASTs_DFT.py repeatedly. To make them easy to call, we reorganize the two files into functions and pass the required values in as parameters when calling them.

Parameters of ProcessCFilesWithCodeSensor.py:

1) CodeSensor_OUTPUT_PATH: the directory where the AST extracted from each .c file is saved as a .txt file, e.g.
"G:\The paper\The paper\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions\Preprocessed\"
2) CodeSensor_PATH: the location of CodeSensor.jar, e.g.
"D:\codesensor\CodeSensor.jar" (a fixed location, so it does not need to be passed in as a parameter)
3) PATH: the directory where the .c files are stored, e.g.
"G:\The paper\The paper\ast\function_representation_learning-master\FFmpeg\Vulnerable_functions"

Parameters of ProcessRawASTs_DFT.py:

1) FILE_PATH: the directory that stores the AST .txt files, e.g.
"G:\The paper\The paper\ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Preprocessed\"
2) Processed_FILE: the directory where the text-vector .txt files are stored, e.g.
"G:\The paper\The paper\ast\function_representation_learning-master\" + Project_Name + "\Vulnerable_functions\Processed\"

Based on these parameter requirements, we reorganize the contents of the two files into functions as follows:

# ProcessCFilesWithCodeSensor.py

import os
from subprocess import Popen, STDOUT

def codesensor(CodeSensor_OUTPUT_PATH, PATH):
    CodeSensor_PATH = "./Code/codesensor-codeSensor-0.2/CodeSensor.jar"

    for fpathe, dirs, fs in os.walk(PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.c':  # Get the .c files only
                file_path = os.path.join(fpathe, f)  # f is the .c file, which will be processed by CodeSensor
                # For each .c file, CodeSensor writes the extracted AST to a same-named
                # .txt file in the specified output directory.
                # Full_path = CodeSensor_OUTPUT_PATH + "_" + f + ".txt"
                Full_path = CodeSensor_OUTPUT_PATH + os.path.splitext(f)[0] + ".txt"
                with open(Full_path, "w+") as output_file:
                    Popen(['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH, file_path],
                          stdout=output_file, stderr=STDOUT)
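
One caveat about the code above: Popen starts the Java process and returns immediately, so when codesensor returns, some AST .txt files may still be half-written. If the text_vector step below runs right afterwards in the same script, it may be safer to wait for each CodeSensor run to finish. A minimal variant of the loop body, under that assumption:

with open(Full_path, "w+") as output_file:
    proc = Popen(['/home/jdk1.8.0_65/bin/java', '-jar', CodeSensor_PATH, file_path],
                 stdout=output_file, stderr=STDOUT)
    proc.wait()  # block until the AST .txt file is fully written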



# ProcessRawASTs_DFT.py

import os

def DepthFirstExtractASTs(file_to_process, file_name):
    
    lines = []
    subLines = ''
    
    f = open(file_to_process)
    try:
        original_lines = f.readlines()
        print(original_lines)
        #lines.append(file_name) # The first element is the file name.
        for line in original_lines:
            if not line.isspace(): # Remove the empty line.
                line = line.strip('\n')
                str_lines = line.split('\t')   
                #print (str_lines)
                if str_lines[0] != "water": # Remove lines starting with water.
                    #print (str_lines)
                    if str_lines[0] == "func":
                        # Add the return type of the function
                        subElement = str_lines[4].split() # Dealing with "static int" or "static void" or ...
                        if len(subElement) == 1:
                            lines.append(str_lines[4])
                        if subElement.count("*") == 0: # The element does not contain a pointer type; a pointer like "int *" would be split into 'int' and '*'.
                            if len(subElement) == 2:
                                lines.append(subElement[0])
                                lines.append(subElement[1]) 
                            if len(subElement) == 3:
                                lines.append(subElement[0])
                                lines.append(subElement[1])    
                                lines.append(subElement[2])
                        else:
                            lines.append(str_lines[4])
                        #lines.append(str_lines[5]) # Add the name of the function
                        lines.append("func_name") # Add the name of the function
                    if str_lines[0] == "params":
                        lines.append("params")                    
                    if str_lines[0] == "param":
                        subParamElement = str_lines[4].split() # Add the possible type of the parameter
                        if len(subParamElement) == 1:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type
                        if subParamElement.count("*") == 0:
                            if len(subParamElement) == 2:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1]) 
                            if len(subParamElement) == 3:
                                lines.append("param")
                                lines.append(subParamElement[0])
                                lines.append(subParamElement[1])    
                                lines.append(subParamElement[2])
                        else:
                            lines.append("param")
                            lines.append(str_lines[4]) # Add the parameter type                           
                    if str_lines[0] == "stmnts":
                        lines.append("stmnts")                    
                    if str_lines[0] == "decl":
                        subDeclElement = str_lines[4].split() # Add the possible type of the declared variable
                        #print (len(subDeclElement))
                        if len(subDeclElement) == 1:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                        if subDeclElement.count("*") == 0:
                            if len(subDeclElement) == 2:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1]) 
                            if len(subDeclElement) == 3:
                                lines.append("decl")
                                lines.append(subDeclElement[0])
                                lines.append(subDeclElement[1])    
                                lines.append(subDeclElement[2])
                        else:
                            lines.append("decl")
                            lines.append(str_lines[4]) # Add the type of the declared variable
                    if str_lines[0] == "op":
                        lines.append(str_lines[4])
                    if str_lines[0] == "call":
                        lines.append("call")
                        lines.append(str_lines[4])
                    if str_lines[0] == "arg":
                        lines.append("arg")
                    if str_lines[0] == "if":
                        lines.append("if")
                    if str_lines[0] == "cond":
                        lines.append("cond")
                    if str_lines[0] == "else":
                        lines.append("else")
                    if str_lines[0] == "stmts":
                        lines.append("stmts")
                    if str_lines[0] == "for":
                        lines.append("for") 	
                    if str_lines[0] == "forinit":
                        lines.append("forinit")
                    if str_lines[0] == "while":
                        lines.append("while")
                    if str_lines[0] == "return":
                        lines.append("return")
                    if str_lines[0] == "continue":
                        lines.append("continue")
                    if str_lines[0] == "break":
                        lines.append("break")
                    if str_lines[0] == "goto":
                        lines.append("goto")
                    if str_lines[0] == "forexpr":
                        lines.append("forexpr")
                    if str_lines[0] == "sizeof":
                        lines.append("sizeof")
                    if str_lines[0] == "do":
                        lines.append("do")   
                    if str_lines[0] == "switch":
                        lines.append("switch")   
                    if str_lines[0] == "typedef":
                        lines.append("typedef")
                    if str_lines[0] == "default":
                        lines.append("default")
                    if str_lines[0] == "register":
                        lines.append("register")
                    if str_lines[0] == "enum":
                        lines.append("enum")
                    if str_lines[0] == "union":
                        lines.append("union")
                    
        print(lines)
        subLines = ','.join(lines)
        subLines = subLines + "," + "\n"
    finally:
        f.close()
        return subLines
 
def text_vector(FILE_PATH, Processed_FILE):
    big_line = []
    total_processed = 0

    for fpathe, dirs, fs in os.walk(FILE_PATH):
        for f in fs:
            if os.path.splitext(f)[1] == '.txt':  # Get the AST .txt files only
                file_path = os.path.join(fpathe, f)  # f is an AST file produced by CodeSensor
                temp = DepthFirstExtractASTs(FILE_PATH + f, f)
                print(temp)

                # Write the comma-separated text vector to a same-named .txt file.
                f1 = open(Processed_FILE + os.path.splitext(f)[0] + ".txt", "w")
                f1.write(temp)
                f1.close()
                # big_line.append(temp)

                # time.sleep(0.001)
                total_processed = total_processed + 1

    print("Totally, there are " + str(total_processed) + " files.")

step4 modify movefiles.py

We have now created the locations for the AST files and the text-vector files. All that remains is to run every .c file in each .c directory through the codesensor function so that its AST file is stored in that directory's Preprocessed folder, and then to call text_vector to convert the AST files in each Preprocessed folder into text-vector files stored in the corresponding processed folder. So at the end of movefiles.py, i.e. after all folders have been created, we add the following code:

from  ProcessCFilesWithCodeSensor import *
from  ProcessRawASTs_DFT import *

for i in range(len(FirstDir)):
    codesensor(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i])
    codesensor(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i])
    codesensor(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i])
    codesensor(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i])
    codesensor(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i])
    codesensor(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i])
    text_vector(Non_vul_func_trainDir[i]+"/Preprocessed/",Non_vul_func_trainDir[i]+"/processed/")
    text_vector(Non_vul_func_testDir[i]+"/Preprocessed/",Non_vul_func_testDir[i]+"/processed/")
    text_vector(Non_vul_func_validationDir[i]+"/Preprocessed/",Non_vul_func_validationDir[i]+"/processed/")
    text_vector(Vul_func_trainDir[i]+"/Preprocessed/",Vul_func_trainDir[i]+"/processed/")
    text_vector(Vul_func_testDir[i]+"/Preprocessed/",Vul_func_testDir[i]+"/processed/")
    text_vector(Vul_func_validationDir[i]+"/Preprocessed/",Vul_func_validationDir[i]+"/processed/")
    

step5 modify removeComments_Blanks.py and LoadCFilesAsText.py

Because we want to use the generated text-vector files instead of the .c files, while removeComments_Blanks.py and LoadCFilesAsText.py currently read the .c files directly, strip blanks and stop words from their contents and then save the result as pickle files, we now have to read the text-vector files under the processed folders instead; the remaining operations on the .txt contents stay unchanged.
So in removeComments_Blanks.py and LoadCFilesAsText.py, every directory path that pointed at the .c files must be extended to the processed folder that sits alongside those .c files so that the text-vector files are read, and the file-type check must look for .txt files instead of .c files.
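
I do not reproduce the full scripts here, but the change boils down to the following pattern. This is a minimal, hypothetical sketch of the reading step only; the function and variable names are illustrative, not the ones actually used in removeComments_Blanks.py / LoadCFilesAsText.py:

import os
import pickle

def load_text_vectors(code_dir):
    # Hypothetical helper: read every text-vector .txt under <code_dir>/processed
    # instead of reading the .c files in code_dir itself.
    samples, names = [], []
    processed_dir = os.path.join(code_dir, "processed")   # extend the .c directory to its processed folder
    for root, dirs, files in os.walk(processed_dir):
        for f in files:
            if os.path.splitext(f)[1] == '.txt':           # was: '.c'
                with open(os.path.join(root, f)) as fp:
                    samples.append(fp.read())
                names.append(os.path.splitext(f)[0])        # the file name still marks vulnerable / non-vulnerable
    return samples, names

# The later steps (stop-word removal, saving the result as a pickle file) stay the same, e.g.:
# with open("train_data.pkl", "wb") as out:
#     pickle.dump((samples, names), out)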

step6 After that, run the word-vector script and the training and testing scripts as usual


Copyright notice
This article was written by [ithicker]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/209/202207280520380800.html