On the Impact of Sample Duplication in Machine-Learning-Based Android Malware Detection
Author(s) -
Yanjie Zhao,
Li Li,
Haoyu Wang,
Haipeng Cai,
Tegawendé F. Bissyandé,
Jacques Klein,
John G. Grundy
Publication year - 2021
Publication title -
ACM Transactions on Software Engineering and Methodology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.597
H-Index - 78
eISSN - 1557-7392
pISSN - 1049-331X
DOI - 10.1145/3446905
Subject(s) - malware, computer science, machine learning, artificial intelligence, Android malware, Android (operating system), data mining, computer security, operating system
Malware detection at scale in the Android realm is often carried out using machine learning techniques. State-of-the-art approaches such as DREBIN and MaMaDroid are reported to yield high detection rates when assessed against well-known datasets. Unfortunately, such datasets may include a large portion of duplicated samples, which may bias recorded experimental results and insights. In this article, we perform extensive experiments to measure the performance gap that occurs when datasets are de-duplicated. Our experimental results reveal that duplication in published datasets has a limited impact on supervised malware classification models. This observation contrasts with the findings of Allamanis on the adverse effects of code duplication in machine learning models of big code. Our experiments, however, show that sample duplication more substantially affects unsupervised learning models (e.g., malware family clustering). Nevertheless, we argue that our fellow researchers and practitioners should always take sample duplication into consideration when performing machine-learning-based (whether supervised or unsupervised) Android malware detection, no matter how significant the impact might be.
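The abstract's core methodological step is comparing model performance before and after dataset de-duplication. The paper itself does not prescribe an implementation, so the following is only a minimal sketch of that idea: duplicates are identified by hashing each sample's feature vector, and a simple classifier (here a hypothetical stand-in, not DREBIN or MaMaDroid) is evaluated on the raw and de-duplicated data. The synthetic dataset and all names are illustrative assumptions.

```python
# Minimal sketch (not the authors' pipeline): measure the performance gap
# between a dataset with duplicated samples and its de-duplicated version.
import hashlib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def deduplicate(X, y):
    """Keep only the first occurrence of each identical feature vector."""
    seen, keep = set(), []
    for i, row in enumerate(X):
        digest = hashlib.sha256(row.tobytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            keep.append(i)
    return X[keep], y[keep]

# Synthetic stand-in for a partially duplicated malware dataset:
# binary features (e.g., requested permissions), labels 0=benign, 1=malware.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 64)).astype(np.uint8)
y = rng.integers(0, 2, size=1000)
X = np.vstack([X, X[:300]])          # inject 300 exact-duplicate samples
y = np.concatenate([y, y[:300]])

for name, (Xd, yd) in {"raw": (X, y), "dedup": deduplicate(X, y)}.items():
    Xtr, Xte, ytr, yte = train_test_split(Xd, yd, random_state=0, stratify=yd)
    clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
    print(name, round(f1_score(yte, clf.predict(Xte)), 3))
```

Note that exact-hash matching only catches byte-identical samples; near-duplicate detection (e.g., repackaged APKs with minor changes) would need a similarity measure rather than a hash.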