EKLAVYA: Inferring the Parameters of Functions in Binary Files Using Neural Networks

2022-07-02 07:52:00 【MezereonXP】

This post introduces a paper titled Neural Nets Can Learn Function Type Signatures From Binaries.

It comes from Zhenkai Liang's team at the National University of Singapore and was published at USENIX Security 2017.

Problem introduction and formal definition

This work focuses on function parameter inference, which has two parts:

- The number of parameters
- The type of each parameter, such as int, float, etc.

Traditional methods usually rely on prior knowledge, such as instruction semantics, ABI (Application Binary Interface) conventions, and compiler idioms.

Once the compiler or the instruction set changes, that prior knowledge has to be worked out again.

If we can get rid of this prior knowledge, or at least reduce our reliance on it, that restriction goes away.

Using neural networks to learn and infer automatically is one way to do this.

Assumptions

- The boundaries of each function are known.
- Within a function, the instruction boundaries are known.
- The instructions that represent a function call (function dispatch), such as call, are known.

A disassembler is enough to satisfy these assumptions.

It is worth mentioning that function boundaries can also be recovered with neural networks. Interested readers can refer to the paper from Dawn Song's group at USENIX Security 2015, Recognizing Functions in Binaries with Neural Networks.

First, some notation:

- We define our model as $M(\cdot)$.
- Let $T_a$ denote the disassembled code of function $a$, where $T_a[i]$ is the $i$-th byte of function $a$.
- The $k$-th instruction of function $a$ can be written as $I_a[k] := \langle T_a[m], T_a[m+1], \ldots, T_a[m+l] \rangle$
  - where $m$ is the index of the first byte of that instruction
  - and $l$ is the number of bytes in the instruction
- A function $a$ containing $p$ instructions can be expressed as $T_a := \langle I_a[1], I_a[2], \ldots, I_a[p] \rangle$.
- If function $b$ contains a direct call to function $a$, all instructions of $b$ before that call instruction form what is called a caller snippet, defined as $C_{b,a}[j] := \langle I_b[0], \ldots, I_b[j-1] \rangle$
  - where $I_b[j]$ is the instruction that calls function $a$
  - If $I_b[j]$ is an indirect call, we set $C_{b,a}[j] := \emptyset$
- We collect the body of function $a$ together with the caller snippets of all of its callers, written as $\mathcal{D}_a := T_a \cup \left(\bigcup_{b \in S_a} \left(\bigcup_{0 \leq j \leq |T_b|} C_{b,a}[j]\right)\right)$
  - where $S_a$ is the set of all functions that call $a$

Because a caller snippet can be very long, the paper caps its length at 500 instructions.
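
The definitions above can be sketched in a few lines of Python. This is only an illustration under several assumptions that are not in the post: instructions are represented as hypothetical `(mnemonic, operand)` pairs, a direct call is recognized by its operand being the callee's name, and when a snippet is longer than 500 instructions the ones closest to the call site are kept (the post does not say which end is truncated).

```python
from typing import Dict, List, Tuple

MAX_SNIPPET_LEN = 500  # cap on caller-snippet length mentioned in the post

# Hypothetical representation: an instruction is a (mnemonic, operand) pair.
Instruction = Tuple[str, str]


def caller_snippets(caller: List[Instruction], callee: str) -> List[List[Instruction]]:
    """Return one snippet per direct `call <callee>` found in `caller`."""
    snippets = []
    for j, (mnemonic, operand) in enumerate(caller):
        # Indirect calls (e.g. `call rax`) never match a callee name,
        # so they contribute no snippet, mirroring the empty-set case.
        if mnemonic == "call" and operand == callee:
            # Assumption: keep the instructions closest to the call site.
            snippets.append(caller[max(0, j - MAX_SNIPPET_LEN):j])
    return snippets


def build_input(functions: Dict[str, List[Instruction]], callee: str) -> dict:
    """Collect the callee body and the snippets of all its callers (the set D_a)."""
    snippets: List[List[Instruction]] = []
    for body in functions.values():
        snippets.extend(caller_snippets(body, callee))
    return {"callee_body": functions[callee], "caller_snippets": snippets}


functions = {
    "main": [
        ("mov", "edi, 1"),
        ("mov", "esi, 2"),
        ("call", "add"),   # direct call: yields a snippet
        ("mov", "edi, eax"),
        ("call", "rax"),   # indirect call: ignored
    ],
    "add": [("lea", "eax, [rdi+rsi]"), ("ret", "")],
}
d = build_input(functions, "add")
print(d["caller_snippets"])
```

Here the only snippet collected for `add` is the two `mov` instructions that precede the direct call in `main`.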

Our model $M(\cdot)$ takes $\mathcal{D}_a$ as input and outputs two things:

- The number of parameters of function $a$
- The type of each parameter of function $a$
  - The C-style parameter types considered are int, char, float, void*, enum, union, and struct

Method design

The overall flow of the method (shown as a flow chart in the original post) can be divided into two modules:

- Instruction encoding module
  - First, the input binary is processed: functions are extracted, instructions are delimited, and call sites are extracted.
  - Then the instructions are encoded with word embeddings to obtain vector representations. This part is analogous to Word2Vec in NLP.
- Parameter recovery module
  - The data is split into a training set and a test set, and 4 recurrent neural networks (RNNs) are used to infer the number and types of function parameters from two perspectives (caller and callee). This corresponds to 4 tasks: Task1 and Task3 use the caller, Task2 and Task4 use the callee.
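
To make the instruction encoding step more concrete, here is a minimal sketch of how training data for a skip-gram style embedding could be prepared. It assumes each whole instruction string is treated as one "word" and uses a hypothetical window size; it only builds the vocabulary and the (center, context) pairs, and does not train the embedding itself.

```python
from typing import Dict, List, Tuple


def build_vocab(functions: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct instruction string."""
    vocab: Dict[str, int] = {}
    for body in functions:
        for ins in body:
            vocab.setdefault(ins, len(vocab))
    return vocab


def skipgram_pairs(ids: List[int], window: int = 2) -> List[Tuple[int, int]]:
    """Generate (center, context) id pairs within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(ids):
        lo, hi = max(0, i - window), min(len(ids), i + window + 1)
        pairs.extend((center, ids[k]) for k in range(lo, hi) if k != i)
    return pairs


# Illustrative function body (made up for this example).
body = ["push rbp", "mov rbp, rsp", "mov eax, edi", "pop rbp", "ret"]
vocab = build_vocab([body])
ids = [vocab[ins] for ins in body]
pairs = skipgram_pairs(ids, window=1)
print(pairs)
```

A skip-gram model would then learn one vector per instruction id by predicting the context id from the center id; instructions that appear in similar contexts end up with similar vectors.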

How are multiple parameter types inferred?

A single RNN takes one sequence as input and can infer only one type.
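
The "one sequence in, one label out" behaviour can be illustrated with a tiny many-to-one RNN forward pass. This is a sketch only: it uses a plain tanh cell without bias terms, random untrained weights, and made-up dimensions, none of which are taken from the post.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix `m` by vector `v`."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_classify(seq: List[Vector], w_in: Matrix, w_h: Matrix, w_out: Matrix) -> Vector:
    """Run a tanh RNN over `seq`; return class probabilities from the final state."""
    h = [0.0] * len(w_h)
    for x in seq:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_h, h))]
        h = [math.tanh(p) for p in pre]
    logits = matvec(w_out, h)
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
embed_dim, hidden_dim, num_classes = 4, 8, 7  # 7 = number of parameter types
w_in = random_matrix(hidden_dim, embed_dim, rng)
w_h = random_matrix(hidden_dim, hidden_dim, rng)
w_out = random_matrix(num_classes, hidden_dim, rng)

# Five embedded instructions (random stand-ins for real embeddings).
sequence = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(5)]
probs = rnn_classify(sequence, w_in, w_h, w_out)
print(probs)
```

Whatever the sequence length, the output is a single probability distribution over the classes, which is why one such network can only answer one question (for example, the type of one parameter).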

So the paper trains multiple RNNs, each of which independently infers the parameter type at one fixed position.

One RNN first infers the number of parameters, and then several RNNs infer the types of the parameters at the different positions.
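
The two-stage scheme can be sketched as follows. The models here are hypothetical stand-in callables rather than trained networks, and clamping the predicted count to the number of available per-position models is an assumption made for this example.

```python
from typing import Callable, List, Sequence

# Type labels listed earlier in the post.
PARAM_TYPES = ["int", "char", "float", "void*", "enum", "union", "struct"]

CountModel = Callable[[Sequence[str]], int]
TypeModel = Callable[[Sequence[str]], str]


def infer_signature(
    instructions: Sequence[str],
    count_model: CountModel,
    type_models: List[TypeModel],
) -> List[str]:
    """Predict the parameter count first, then one type per position."""
    n = count_model(instructions)
    # Assumption: we can only type as many positions as we have models for.
    n = max(0, min(n, len(type_models)))
    return [type_models[i](instructions) for i in range(n)]


# Stand-in models that return fixed answers, for illustration only.
count_model = lambda ins: 2
type_models = [lambda ins: "int", lambda ins: "void*", lambda ins: "float"]

signature = infer_signature(["mov edi, 1", "call foo"], count_model, type_models)
print(signature)
```

With these stand-ins the third type model is never consulted, because the count model reports only two parameters.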

Data preparation

The paper takes a number of Linux packages and compiles them with clang and gcc. With debug information enabled, the function boundaries and the number and types of parameters can be read directly from the DWARF sections of each binary and used as ground truth.

Two datasets are built:

- Dataset 1: contains 3 popular Linux packages (binutils, coreutils, and findutils), compiled at optimization levels O0 to O3
- Dataset 2: extends dataset 1 with 5 more packages (sg3utils, utillinux, inetutils, diffutils, and usbutils), also compiled at all 4 optimization levels

The data is split into a training set and a test set at a ratio of 8:2.
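
As a rough illustration of the setup above, the sketch below enumerates the compiler and optimization-level combinations and performs an 8:2 split. The shuffle seed and the unit being split (here plain sample ids) are assumptions; the post does not say whether the split is done per function or per binary.

```python
import itertools
import random
from typing import List, Sequence, Tuple

COMPILERS = ["gcc", "clang"]
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

# Every package is built once per (compiler, optimization level) pair.
build_configs = list(itertools.product(COMPILERS, OPT_LEVELS))


def split_dataset(samples: Sequence, train_ratio: float = 0.8, seed: int = 0) -> Tuple[List, List]:
    """Shuffle the samples deterministically and split them train/test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


train, test = split_dataset(range(10))
print(len(build_configs), len(train), len(test))
```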

Imbalanced data

The resulting datasets are heavily imbalanced across classes. For example, pointer-typed parameters are hundreds of times more common than union-typed ones, and most functions have fewer than 3 parameters. The paper does not address this problem.
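
One simple way to quantify this kind of imbalance is to look at the label distribution and at the ratio between the most and least frequent class. The label counts below are made up for illustration and are not figures from the paper.

```python
from collections import Counter
from typing import Dict, Iterable


def class_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of samples belonging to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels: Iterable[str]) -> float:
    """Ratio between the most frequent and the least frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Made-up label counts, chosen only to show a pointer-heavy distribution.
labels = ["void*"] * 300 + ["int"] * 150 + ["union"] * 3
dist = class_distribution(labels)
print(dist)
print(imbalance_ratio(labels))
```

A large ratio means a classifier can reach high overall accuracy while almost never predicting the rare classes, so per-class results are worth checking alongside the aggregate numbers.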

Experimental results

The results for Task1 and Task2, i.e. inferring the number of parameters from the caller and from the callee, lead to the following observations:

- The higher the optimization level, the harder the inference, although the relationship is not strictly monotonic
- The more parameters a function has, the harder the inference, which is also related to the amount of training data available
- The number of parameters is easier to infer from the caller

The results for parameter type inference (Task3 and Task4) lead to the following observations:

- The optimization level seems to have less impact here; in fact, the higher the optimization level, the more accurate the type inference
- The later a parameter's position, the harder it is to infer its type
- There is little difference between inferring from the caller and from the callee

Copyright notice
This article was written by [MezereonXP]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/183/202207020623040310.html