A Game Theoretic Framework for Analyzing Re-Identification Risk

PLoS ONE

Wan, Zhiyu; Vorobeychik, Yevgeniy; Xia, Weiyi; Clayton, Ellen Wright; Kantarcioglu, Murat; Ganta, Ranjit; Heatherly, Raymond; Malin, Bradley A.;(2015)

The Given the potential wealth of insights in personal data the big databases can provide, many organizations aim to share data while protecting privacy by sharing de-identified data, but are concerned because various demonstrations show such data can be re-identified. Yet these investigations focus on how attacks can be perpetrated, not the likelihood they will be realized. This paper introduces a game theoretic framework that enables a publisher to balance re-identification risk with the value of sharing data, leveraging a natural assumption that a recipient only
attempts re-identification if its potential gains outweigh the costs. We apply the framework to a real case study, where the value of the data to the publisher is the actual grant funding dollar amounts from a national sponsor and the re-identification gain of the recipient is the fine paid to a regulator for violation of federal privacy rules. There are three notable findings: 1) it is possible to achieve zero risk, in that the recipient never gains from re-identification, while sharing almost as much data as the optimal solution that allows for a small amount of risk; 2) the zero-risk solution enables sharing much more data than a commonly invoked de-identification policy of the U.S. Health Insurance Portability and Accountability Act (HIPAA); and 3) a sensitivity analysis demonstrates these findings are robust to order-of-magnitude changes in player losses and gains. In combination, these findings provide support that such a framework can enable pragmatic policy decisions about de-identified data sharing.

Addressing the concerns of the lacks family: quantification of kin genomic privacy

CCS ’13

Humbert, Mathias; Ayday, Erman; Hubaux, Jean-Pierre; Telenti, Amalio; (2013)

The rapid progress in human-genome sequencing is leading to a high availability of genomic data. This data is notoriously very sensitive and stable in time. It is also highly correlated among relatives. A growing number of genomes are becoming accessible online (e.g., because of leakage, or after their posting on genome-sharing websites). What are then the implications for kin genomic privacy? We formalize the problem and detail an efficient reconstruction attack based on graphical models and belief propagation. With this approach, an attacker can infer the genomes of the relatives of an individual whose genome is observed, relying notably on Mendel’s Laws and statistical relationships between the nucleotides (on the DNA sequence). Then, to quantify the level of genomic privacy as a result of the proposed inference attack, we discuss possible definitions of genomic privacy metrics. Genomic data reveals Mendelian diseases and the likelihood of developing degenerative diseases such as Alzheimer’s. We also introduce the quantification of health privacy, specifically the measure of how well the predisposition to a disease is concealed from an attacker. We evaluate our approach on actual genomic data from a pedigree and show the threat extent by combining data gathered from a genome-sharing website and from an online social network.

genomic privacy, Inference algorithms, kinship, metrics

A cryptographic approach to securely share and query genomic sequences

IEEE transactions on information technology in biomedicine: a publication of the IEEE Engineering in Medicine and Biology Society

Kantarcioglu, Murat; Jiang, Wei; Liu, Ying; Malin, Bradley; (2008)

To support large-scale biomedical research projects, organizations need to share person-specific genomic sequences without violating the privacy of their data subjects. In the past, organizations protected subjects’ identities by removing identifiers, such as name and social security number; however, recent investigations illustrate that deidentified genomic data can be “reidentified” to named individuals using simple automated methods. In this paper, we present a novel cryptographic framework that enables organizations to support genomic data mining without disclosing the raw genomic sequences. Organizations contribute encrypted genomic sequence records into a centralized repository, where the administrator can perform queries, such as frequency counts, without decrypting the data. We evaluate the efficiency of our framework with existing databases of single nucleotide polymorphism (SNP) sequences and demonstrate that the time needed to complete count queries is feasible for real world applications. For example, our experiments indicate that a count query over 40 SNPs in a database of 5000 records can be completed in approximately 30 min with off-the-shelf technology. We further show that approximation strategies can be applied to significantly speed up query execution times with minimal loss in accuracy. The framework can be implemented on top of existing information and network technologies in biomedical environments.

Base Sequence, biology computing, biomedical engineering, Chromosome Mapping, Computer Security, cryptography, Databases, data mining, encrypted genomic sequence records