
Cross-platform binary code similarity detection based on NMT and graph embedding
Author(s) -
Xiaodong Zhu,
Liehui Jiang,
Zeng Chen
Publication year - 2021
Publication title -
Mathematical Biosciences and Engineering
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.451
H-Index - 45
eISSN - 1551-0018
pISSN - 1547-1063
DOI - 10.3934/mbe.2021230
Subject(s) - computer science , embedding , binary number , similarity (geometry) , binary code , scalability , graph embedding , code (set theory) , graph , theoretical computer science , support vector machine , artificial intelligence , pattern recognition (psychology) , mathematics , database , programming language , arithmetic , set (abstract data type) , image (mathematics)
Cross-platform binary code similarity detection determines whether a pair of binary functions from different platforms are similar, and it plays an important role in many areas. Traditional methods compute similarity by intersecting platform-independent characteristic strands or by matching control flow graphs (CFGs), and they fall short in efficiency and scalability. Existing deep-learning-based methods improve efficiency but have low accuracy and still rely on manually constructed features. To address these problems, this manuscript proposes a cross-platform binary code similarity detection method based on neural machine translation (NMT) and graph embedding. We train an NMT model and a graph embedding model to automatically extract two parts of the semantics of the binary code and represent it as a high-dimensional vector, called an embedding. The similarity of two binary functions can then be measured by the distance between their corresponding embeddings. We implement a prototype named SimInspector. Our comparative experiments show that SimInspector outperforms the state-of-the-art approach, Gemini, by about 6% in similarity detection accuracy while maintaining good efficiency.
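
The final step of the abstract, comparing two functions through their embeddings, can be illustrated with a minimal sketch. The abstract does not state which distance measure SimInspector uses, so cosine similarity is an assumption here; the function names, the 0.8 threshold, and the three-dimensional toy vectors are likewise hypothetical and stand in for the output of trained models.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def is_similar(u, v, threshold=0.8):
    """Decide similarity by thresholding; the threshold is an assumed value."""
    return cosine_similarity(u, v) >= threshold


# Toy embeddings (hypothetical values, not produced by any real model).
emb_x86 = [0.9, 0.1, 0.4]     # a function compiled for one platform
emb_arm = [0.8, 0.2, 0.5]     # the same function on another platform
emb_other = [-0.7, 0.9, -0.1]  # an unrelated function

print(is_similar(emb_x86, emb_arm))
print(is_similar(emb_x86, emb_other))
```

Because each function is reduced to a fixed-length vector once, comparing a pair costs only a dot product and two norms, which is consistent with the efficiency argument the abstract makes against pairwise CFG matching.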