Codee: A Tensor Embedding Scheme for Binary Code Search

ABSTRACT :

Given a target binary function, binary code search retrieves the top-K most similar functions from a repository, where two functions are considered similar if they are compiled from the same source code. Searching binary code is particularly challenging due to large variations in compiler tool-chains, compilation options, and CPU architectures, as well as the sheer number of binaries in a repository. Furthermore, current binary code search schemes suffer from several pivotal issues, including inaccurate text-based or token-based analysis, slow graph matching, and complex deep learning pipelines. In this paper, we present an unsupervised tensor embedding scheme, Codee, to carry out code search efficiently and accurately at the binary function level. First, we use an NLP-based neural network to generate semantic-aware token embeddings. Second, we propose an efficient basic-block embedding generation algorithm based on a network representation learning model: we learn both the semantic information of instructions and the structural information of the control flow to generate basic-block embeddings, and then combine all basic-block embeddings in a function into a variable-length function feature vector. Third, we build a tensor and generate function embeddings based on the tensor singular value decomposition, which compresses the variable-length vectors into short fixed-length vectors to facilitate efficient search afterward. We further propose a dynamic tensor compression algorithm to incrementally update the function embedding database. Finally, we use locality-sensitive hashing to find the top-K most similar functions in the repository. On four datasets (OpenSSL, Coreutils, libgmp, and libcurl), Codee achieves higher average search accuracy, shorter feature vectors, and faster feature generation than state-of-the-art cross-optimization-level code search schemes such as Asm2Vec and DeepBinDiff. Compared with cross-platform and cross-optimization-level code search schemes such as Gemini and SAFE, our method also achieves higher average recall.
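To make the pipeline concrete, the following is a minimal NumPy sketch of the last two stages only: compressing a variable-length matrix of basic-block embeddings into a short fixed-length function vector, and retrieving the top-K most similar functions. It is illustrative rather than faithful to the paper: the tensor singular value decomposition over the whole repository is approximated by an ordinary truncated SVD per function, and exact cosine ranking stands in for locality-sensitive hashing; all function names, dimensions, and the value of k are hypothetical.

import numpy as np

def compress_function(block_embeddings, k=4):
    # block_embeddings: (num_blocks, dim) matrix of basic-block embeddings.
    # A truncated SVD keeps the k strongest directions of the block set and
    # flattens them into one fixed-length vector of length k * dim.
    num_blocks, dim = block_embeddings.shape
    _, s, vt = np.linalg.svd(block_embeddings, full_matrices=False)
    top = min(k, s.shape[0])
    summary = s[:top, None] * vt[:top]              # (top, dim), weighted directions
    if top < k:                                     # pad short functions with zero rows
        summary = np.vstack([summary, np.zeros((k - top, dim))])
    vec = summary.ravel()
    return vec / (np.linalg.norm(vec) + 1e-12)      # unit-normalize for cosine search

def top_k_similar(query_vec, repo_vecs, topk=5):
    # Exact cosine ranking; the paper uses locality-sensitive hashing instead.
    scores = repo_vecs @ query_vec
    order = np.argsort(-scores)[:topk]
    return order, scores[order]

# Toy usage: 100 repository functions, each with 3-19 blocks of 64-dim embeddings.
rng = np.random.default_rng(0)
repo = np.stack([compress_function(rng.normal(size=(rng.integers(3, 20), 64)))
                 for _ in range(100)])
query = compress_function(rng.normal(size=(10, 64)))
indices, sims = top_k_similar(query, repo)
print(indices, np.round(sims, 3))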

EXISTING SYSTEM :

• We hypothesize that there exists an optimal ordering of tokens that better exploits the particular structure of the TT–embedding and leads to a boost in performance and/or compression ratio.
• Additionally, it is important to understand how the order of tokens in the vocabulary affects the properties of networks that use the TT–embedding.
• We substitute the embedding layers with TT–embedding layers. Besides that, we leave the overall structure of the neural network unchanged, with the same parameters as in the baseline approach (a lookup sketch follows this list).
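For illustration, the sketch below shows how a TT–embedding lookup can replace an ordinary embedding-table lookup: the flat token id is factored into a multi-index, and the embedding row is reconstructed by contracting small TT cores, so the full vocabulary-by-dimension table is never materialized. The shapes, ranks, and function names are assumptions for this sketch, not values from the paper.

import numpy as np

def tt_embedding_lookup(cores, vocab_shape, token_id):
    # cores[k] has shape (r_{k-1}, vocab_shape[k], emb_shape[k], r_k), with
    # r_0 = r_d = 1.  The dense table would be
    # prod(vocab_shape) x prod(emb_shape); it is never built explicitly.
    digits = []                                   # factor the flat id (mixed radix)
    for v in reversed(vocab_shape):
        digits.append(token_id % v)
        token_id //= v
    digits.reverse()
    row = np.ones((1, 1))                         # accumulated (emb_so_far, r_k)
    for core, i in zip(cores, digits):
        piece = core[:, i, :, :]                  # (r_prev, d_k, r_k) slice for digit i
        r_prev, d_k, r_k = piece.shape
        row = row.reshape(-1, r_prev) @ piece.reshape(r_prev, d_k * r_k)
        row = row.reshape(-1, r_k)                # fold the new embedding factor in
    return row.ravel()                            # length = prod(emb_shape)

# Toy usage: an 8*10*10 = 800-token vocabulary, 4*4*4 = 64-dim embeddings,
# TT-ranks (1, 8, 8, 1).
rng = np.random.default_rng(0)
vocab_shape, emb_shape, ranks = (8, 10, 10), (4, 4, 4), (1, 8, 8, 1)
cores = [rng.normal(scale=0.1,
                    size=(ranks[k], vocab_shape[k], emb_shape[k], ranks[k + 1]))
         for k in range(3)]
print(tt_embedding_lookup(cores, vocab_shape, token_id=123).shape)   # (64,)

In this toy configuration the dense table would hold 800 x 64 = 51,200 parameters, while the three cores hold only 256 + 2,560 + 320 = 3,136.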

DISADVANTAGE :

• We focus on the cross-platform, cross-optimization-level variant of the binary similarity problem, where two binary functions are defined as similar if they are compiled from the same source code. Inspired by prior work, we look for solutions that solve the binary similarity problem using embeddings.
• The problem of binary code search is then converted into the problem of searching for similar vectors, which can be done in approximately constant time using locality-sensitive hashing (a sketch of this step follows the list).
• To achieve high search accuracy, it is essential to collect a substantial amount of high-quality training data, which can be challenging in practice. Even with good training data, these deep neural network models may still suffer from overfitting.
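The sketch below illustrates the vector-search step with a simple random-hyperplane (SimHash-style) LSH index in NumPy: vectors pointing in similar directions tend to share a hash key and therefore land in the same bucket. The number of bits, the single hash table, and all names are illustrative assumptions, not the configuration used by Codee.

import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    def __init__(self, dim, num_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(num_bits, dim))   # one random hyperplane per bit
        self.table = defaultdict(list)

    def _key(self, vec):
        # Each bit records which side of a hyperplane the vector falls on.
        return ((self.planes @ vec) > 0).tobytes()

    def add(self, vec, label):
        self.table[self._key(vec)].append((label, vec))

    def query(self, vec, topk=5):
        # Candidates share the exact hash key; rank them by cosine similarity.
        candidates = self.table.get(self._key(vec), [])
        def cos(item):
            _, v = item
            return np.dot(v, vec) / (np.linalg.norm(v) * np.linalg.norm(vec) + 1e-12)
        return [lbl for lbl, _ in sorted(candidates, key=cos, reverse=True)[:topk]]

# Toy usage: index 1,000 random 256-dim "function embeddings", then query
# with a slightly perturbed copy of one of them.
rng = np.random.default_rng(1)
index = HyperplaneLSH(dim=256)
vectors = rng.normal(size=(1000, 256))
for i, v in enumerate(vectors):
    index.add(v, label=i)
print(index.query(vectors[42] + 0.01 * rng.normal(size=256)))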

PROPOSED SYSTEM :

• We propose a novel embedding layer, the TT–embedding, for compressing the huge lookup tables used to encode categorical features of high cardinality, such as token indices in natural language processing tasks.
• The proposed approach, based on the TT–decomposition, proves effective in experiments: it drastically reduces the number of trainable parameters at the cost of a small deterioration in performance (a rough parameter-count comparison follows this list).
• In addition, our method can be easily integrated into any deep learning framework and trained via backpropagation, while benefiting from reduced memory requirements and larger training batch sizes.
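The parameter saving offered by the TT–decomposition can be illustrated with a back-of-the-envelope comparison between a dense lookup table and its TT-factored counterpart; the vocabulary factorization, embedding factorization, and ranks below are assumed for illustration only.

def dense_params(vocab_size, emb_dim):
    # Parameters of an ordinary lookup table.
    return vocab_size * emb_dim

def tt_params(vocab_shape, emb_shape, ranks):
    # One TT core of shape (r_{k-1}, v_k, d_k, r_k) per factor.
    return sum(ranks[k] * vocab_shape[k] * emb_shape[k] * ranks[k + 1]
               for k in range(len(vocab_shape)))

vocab_shape, emb_shape, ranks = (50, 40, 25), (8, 8, 8), (1, 16, 16, 1)
dense = dense_params(50 * 40 * 25, 8 * 8 * 8)     # a 50,000-token, 512-dim table
tt = tt_params(vocab_shape, emb_shape, ranks)
print(dense, tt, round(dense / tt, 1))            # 25600000 91520 279.7

With these illustrative shapes, a 50,000 x 512 table (25.6 M parameters) is replaced by cores holding about 91.5 K parameters, roughly a 280x reduction.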

ADVANTAGE :

• In current implementations, every basic-block similarity computation has to pass through a complex LSTM-RNN with a large number of parameters, which significantly hurts performance.
• While these are intriguing ideas, directly applying NLP techniques to binary code is not ideal.
• BERT has to run through a complex neural network with many parameters, which significantly affects performance and requires strong computational resources.
• Order Matters is a very large model, larger than the other methods by several orders of magnitude.
