
Full-privacy secured search engine empowered by efficient genome-mapping algorithms
Author(s) -
Yuan-Yu Chang,
Sheng-Tang Wong,
Emmanuel O Salawu,
Ming-Hsuan Liao,
Jui-Hung Hung,
Lee-Wei Yang
Publication year - 2023
Publication title -
IEEE Journal of Biomedical and Health Informatics
Language(s) - English
Resource type - Journals
eISSN - 2168-2208
pISSN - 2168-2194
DOI - 10.1109/jbhi.2023.3300885
Subject(s) - bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, signal processing and analysis
Since the 1990s, keyword-based search engines have been the only option for locating relevant web content through a simple query of one to a few keywords. These free or paid services store users' search queries and preferences for personal profiling and targeted ad delivery, and articles that users upload for plagiarism detection can be retained in service providers’ expanding databases for profit. In short, users have never been able to search the web without revealing their queries, some of which can be sensitive, to search engine providers. Here we demonstrate that an internet search that takes an entire article as the query can be carried out correctly without revealing the query content, by combining an irreversible encoding scheme with an efficient FM-index search routine of the kind generally used in next-generation sequencing (NGS) of human genomes. In our solution, Sapiens Aperio Veritas Engine (S.A.V.E.), every word in the query is encoded on the user's local machine into one of 12 “amino acids”, and the encoded words together constitute a pseudo-biological sequence (PBS). For PBS-mediated plagiarism detection, users submit locally encoded PBSs through our cloud service to locate identical duplicates in the collected web content, which currently includes all English and Chinese Wikipedia pages and Open Access journal articles as of April 2021 and has been encoded in the same way as the query. We find that PBSs with a length longer than 12, comprising a combination of more than 12 “amino acids”, return correct results with a false positive rate <0.8%. S.A.V.E. runs at a genome-mapping speed similar to that of Bowtie and is >5 orders of magnitude faster than BLAST. Functioning in both regular and in-private search modes, S.A.V.E. provides a new option for efficient internet search and plagiarism detection in a compressed search space where users' confidential content can never be revealed.
We hope that the reported algorithm and implementation will introduce a new paradigm for future privacy-aware search engines. S.A.V.E. is currently running at https://dyn.life.nthu.edu.tw/SAVE/
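
The two ideas named in the abstract, a many-to-one word encoding into a 12-letter alphabet and an FM-index exact-match search, can be illustrated with a minimal Python sketch. It is not the S.A.V.E. implementation: the abstract does not specify the encoding function, so the use of SHA-256 modulo 12, the letters in `ALPHABET`, whitespace tokenization, and the names `encode_word`, `encode_text`, and `FMIndex` are all assumptions made for illustration. The index is built from a naively sorted suffix array, which is adequate for short texts only.

```python
import hashlib
from typing import List

# Hypothetical 12-letter "amino acid" alphabet; the letters actually used
# by S.A.V.E. are not given in the abstract.
ALPHABET = "ACDEFGHIKLMN"


def encode_word(word: str) -> str:
    """Map a word to one of 12 letters (assumed scheme: SHA-256 mod 12).

    Many words share each letter, so a letter cannot be mapped back
    to a unique word.
    """
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return ALPHABET[int.from_bytes(digest[:8], "big") % len(ALPHABET)]


def encode_text(text: str) -> str:
    """Encode whitespace-separated words into a pseudo-biological sequence."""
    return "".join(encode_word(w) for w in text.split())


class FMIndex:
    """Small FM-index supporting exact-match lookup by backward search."""

    def __init__(self, text: str):
        if "$" in text:
            raise ValueError("text must not contain the sentinel '$'")
        self.text = text + "$"  # '$' sorts before every alphabet letter
        n = len(self.text)
        # Naive suffix array: sort suffix start positions lexicographically.
        self.sa = sorted(range(n), key=lambda i: self.text[i:])
        # Burrows-Wheeler transform: the character preceding each suffix.
        self.bwt = "".join(self.text[i - 1] for i in self.sa)
        chars = sorted(set(self.text))
        # C[c] = number of characters in the text strictly smaller than c.
        self.C = {}
        total = 0
        for c in chars:
            self.C[c] = total
            total += self.text.count(c)
        # occ[c][i] = number of occurrences of c in bwt[:i].
        self.occ = {c: [0] * (n + 1) for c in chars}
        for i, b in enumerate(self.bwt):
            for c in chars:
                self.occ[c][i + 1] = self.occ[c][i] + (1 if b == c else 0)

    def locate(self, pattern: str) -> List[int]:
        """Return sorted start positions of every exact match of pattern."""
        if not pattern:
            return []
        lo, hi = 0, len(self.bwt)  # half-open suffix-array interval
        for c in reversed(pattern):
            if c not in self.C:
                return []
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return []
        return sorted(self.sa[lo:hi])


if __name__ == "__main__":
    document = "the quick brown fox jumps over the lazy dog near the old farm"
    index = FMIndex(encode_text(document))       # server side: encoded corpus
    query = "fox jumps over the lazy dog"        # client side: raw text stays local
    print(index.locate(encode_text(query)))      # word offsets of candidate matches
```

Because each position carries one of only 12 letters, a short PBS can match unrelated text by chance, which is consistent with the abstract's observation that longer PBSs are needed for a low false positive rate.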