Comparing techniques for authorship attribution of source code
Author(s) - Steven Burrows, Alexandra L. Uitdenbogerd, Andrew Turpin
Publication year - 2014
Publication title - Software: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.437
H-Index - 70
eISSN - 1097-024X
pISSN - 0038-0644
DOI - 10.1002/spe.2146
Subject(s) - computer science, artificial intelligence, source code, machine learning, classifier (UML), natural language processing, support vector machine, byte, authorship attribution, exploit, ranking (information retrieval), natural language, n-gram, information retrieval, language model, programming language, computer security
SUMMARY Attributing authorship of documents with unknown creators has been studied extensively for natural language text such as essays and literature, but less so for non-natural languages such as computer source code. Previous attempts at attributing authorship of source code can be categorised by two attributes: the software features used for the classification, either strings of n tokens/bytes (n-grams) or software metrics; and the classification technique that exploits those features, either information retrieval ranking or machine learning. The results of existing studies, however, are not directly comparable as all use different test beds and evaluation methodologies, making it difficult to assess which approach is superior. This paper summarises all previous techniques for source code authorship attribution, implements feature sets that are motivated by the literature, and applies information retrieval ranking methods or machine classifiers for each approach. Importantly, all approaches are tested on identical collections from varying programming languages and author types. Our conclusions are as follows: (i) ranking and machine classifier approaches are around 90% and 85% accurate, respectively, for a one-in-10 classification problem; (ii) the byte-level n-gram approach is best used with different parameters to those previously published; (iii) neural networks and support vector machines were found to be the most accurate machine classifiers of the eight evaluated; (iv) use of n-gram features in combination with machine classifiers shows promise, but there are scalability problems that still must be overcome; and (v) approaches based on information retrieval techniques are currently more accurate than approaches based on machine learning. Copyright © 2012 John Wiley & Sons, Ltd.
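As a rough illustration of the byte-level n-gram approach summarised above, the sketch below builds frequency profiles of overlapping byte n-grams from source code and ranks candidate authors by the similarity of their known code to a query document, in the spirit of the information retrieval ranking approach. This is a minimal sketch only: the n-gram length (6), the cosine similarity measure, and the single-sample-per-author setup are assumptions made for this example, not the parameters or similarity function evaluated in the paper.

from collections import Counter
from math import sqrt

def byte_ngram_profile(source: str, n: int = 6) -> Counter:
    """Count overlapping byte-level n-grams in a piece of source code.
    The length n = 6 is an illustrative choice, not a tuned parameter."""
    data = source.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram frequency profiles."""
    shared = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_authors(query_code: str, author_samples: dict[str, str]) -> list[tuple[str, float]]:
    """Rank candidate authors by similarity of their known code to the
    query document (an information-retrieval-style ranking)."""
    query = byte_ngram_profile(query_code)
    scores = {
        author: cosine_similarity(query, byte_ngram_profile(code))
        for author, code in author_samples.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical author samples for illustration only.
    known = {
        "alice": "for (int i = 0; i < n; i++) { total += values[i]; }",
        "bob":   "int idx=0; while(idx<n){ total=total+values[idx]; idx+=1; }",
    }
    unknown = "for (int j = 0; j < m; j++) { sum += data[j]; }"
    print(rank_authors(unknown, known))

In practice, an attribution system would build profiles from many samples per author and tune the n-gram length on held-out data, which is where the paper's finding that previously published parameter settings are not optimal becomes relevant.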