BinDiffNN: Learning Distributed Representation of Assembly for Robust Binary Diffing Against Semantic Differences
- Authors
- Ullah, Sami; Oh, Heekuck
- Issue Date
- Sep-2022
- Publisher
- Institute of Electrical and Electronics Engineers
- Keywords
- Asm2Vec; attention network; binary diffing; exact match; Inst2vec; partial match; siamese neural network
- Citation
- IEEE Transactions on Software Engineering, v.48, no.9, pp 3442 - 3466
- Pages
- 25
- Indexed
- SCIE; SCOPUS
- Journal Title
- IEEE Transactions on Software Engineering
- Volume
- 48
- Number
- 9
- Start Page
- 3442
- End Page
- 3466
- URI
- https://scholarworks.bwise.kr/erica/handle/2021.sw.erica/112784
- DOI
- 10.1109/TSE.2021.3093926
- ISSN
- 0098-5589; 1939-3520
- Abstract
- Binary diffing is the process of discovering the differences and similarities in functionality between two binary programs. Previous research treats binary diffing as a function matching problem: an initial 1:1 mapping between functions is formulated, and a sequence matching ratio is then computed to classify two functions as an exact match, a partial match, or no match. Existing techniques are most accurate when detecting exact matches and are not efficient at detecting partially changed functions, especially those with minor patches. These drawbacks stem from two major challenges: (i) in the 1:1 mapping phase, a strict policy is used to match function features; (ii) in the classification phase, an assembly snippet is treated as normal text and sequence matching is used for similarity comparison. An instruction has a unique structure, i.e., mnemonics and registers occupy specific positions in the instruction and are semantically related, which makes assembly code different from general text. Sequence matching performs best on general text but fails to detect structural and semantic changes at the instruction level; thus, its use for classification produces many false results. In this research, we address these challenges with a two-fold solution. For the 1:1 mapping phase, we propose computationally inexpensive features, which are compared using distance-based selection criteria to map similar functions and filter out unmatched functions. For the classification phase, we propose a Siamese binary-classification neural network in which each branch is an attention-based distributed learning embedding neural network that learns the semantic similarity among assembly instructions and learns to highlight changes at the instruction level, while a final fully connected layer learns to accurately classify two 1:1 mapped functions as either an exact or a partial match. We used x86 kernel binaries for training and achieved approximately 99% classification accuracy, which is higher than that of existing binary diffing techniques and tools.
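The sequence-matching baseline that the abstract criticizes can be illustrated with a minimal sketch. This is not the authors' method or any tool's implementation: it assumes Python's standard `difflib.SequenceMatcher` as the matcher, and the names `match_ratio` and `classify`, the 0.5 threshold, and the sample instructions are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def match_ratio(func_a, func_b):
    """Return the sequence matching ratio (0.0 to 1.0) of two instruction lists."""
    # Each instruction is compared as an opaque string, i.e. as "normal text".
    return SequenceMatcher(None, func_a, func_b, autojunk=False).ratio()


def classify(func_a, func_b, partial_threshold=0.5):
    """Label a 1:1 mapped function pair; the threshold is an assumed value."""
    ratio = match_ratio(func_a, func_b)
    if ratio == 1.0:
        return "exact"
    if ratio >= partial_threshold:
        return "partial"
    return "no-match"


# Hypothetical x86 snippets: the patch changes a single mnemonic (add -> sub).
original = ["push ebp", "mov ebp, esp", "mov eax, [ebp+8]",
            "add eax, 1", "pop ebp", "ret"]
patched = ["push ebp", "mov ebp, esp", "mov eax, [ebp+8]",
           "sub eax, 1", "pop ebp", "ret"]

print(classify(original, original))
print(classify(original, patched))
print(match_ratio(original, patched))
```

In this sketch the ratio only counts how many instruction strings line up. It carries no notion of which operand changed or whether the change alters semantics, which is the limitation the abstract attributes to treating assembly as general text.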
- Appears in Collections
- COLLEGE OF COMPUTING > ERICA Division of Computer Science > 1. Journal Articles