
Compression for population genetic data through finite-state entropy
Author(s) - Winfield Chen, Lloyd T. Elliott
Publication year - 2021
Publication title - Journal of Bioinformatics and Computational Biology (print)
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.339
H-Index - 43
eISSN - 1757-6334
pISSN - 0219-7200
DOI - 10.1142/s0219720021500268
Subject(s) - computer science , data compression , computation , population , entropy (arrow of time) , data mining , theoretical computer science , artificial intelligence , algorithm , demography , quantum mechanics , sociology , physics
We improve the efficiency of population genetic file formats and GWAS computation by leveraging the distribution of samples in population-level genetic data. We identify conditional exchangeability in these data and recommend finite-state entropy algorithms as an arithmetic code naturally suited to compressing population genetic data. We show between [Formula: see text] and [Formula: see text] speed and size improvements over modern dictionary compression methods often used for population genetic data, such as Zstd and Zlib, in computation and decompression tasks. We provide open-source prototype software for multi-phenotype GWAS with finite-state entropy compression, demonstrating significant space savings and speed comparable to the state of the art.
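
To illustrate why an entropy coder can suit genotype data, the following minimal sketch compares the empirical Shannon entropy of a single genotype column with a fixed-width encoding. It is not the authors' software or method: the genotype counts are hypothetical, the 2-bits-per-genotype baseline is an assumption chosen for comparison, and the function name `empirical_entropy_bits` is introduced here only for illustration. The entropy computed is a simple order-0 figure that ignores any structure between variants or samples.

```python
import math
from collections import Counter


def empirical_entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical genotype column for one variant with a rare minor allele:
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate.
genotypes = [0] * 90 + [1] * 9 + [2] * 1

# Order-0 entropy: roughly what an ideal entropy coder would spend per genotype
# if it modelled only the symbol frequencies of this column.
bits_per_genotype = empirical_entropy_bits(genotypes)

# Assumed baseline: a fixed-width encoding that spends 2 bits on every genotype.
fixed_bits = 2.0

print(f"entropy: {bits_per_genotype:.3f} bits/genotype vs fixed {fixed_bits} bits")
```

When the genotype distribution is skewed, as with rare variants, the entropy falls well below the fixed width, which is the gap an entropy coder such as finite-state entropy is designed to exploit.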