Quick links

A Study of Privacy and Fairness in Sensitive Data Analysis

Report ID:
September 2011
Download Formats:


In this thesis we consider the challenges arising in the design of
algorithms that interact with sensitive personal data---such as
medical records, online tracking data, or financial records.
One important goal is to protect the privacy of those individuals
whose personal information contributed to the data set. We consider
algorithms that satisfy the strong privacy guarantee known as
differential privacy. A wide range of computational tasks reduces to
the setting in which a trusted database curator responds to a number
of statistical queries posed by an untrusted data analyst. The basic
question is how accurately and efficiently the curator can release
approximate answers to the given queries while satisfying differential
privacy. We make the following main contributions to differentially
private data analysis:

We expose a connection between differential privacy and certain
problems in convex geometry revolving around a deep conjecture known
as the Hyperplane conjecture. Assuming the truth of this conjecture we
give differentially private mechanisms with nearly optimal accuracy in
the case where the queries are given all at once (non-interactively)
and the number of queries does not exceed the database size.

Multiplicative weights mechanisms are a powerful tool in
algorithms, machine learning and optimization. We introduce a
privacy-preserving multiplicative weights framework suitable for
answering a huge number of queries even in the interactive setting.
The accuracy of our algorithm in terms of database size and number of
queries matches the statistical sampling error that already arises in
the absence of any privacy concerns. Our algorithm can also be used to
produce a differentially private synthetic data set encoding the
curators answers. For this task the runtime of our algorithm, which
is linear in the universe size, is essentially optimal due to a prior
cryptographic hardness result.

We then consider avenues for obtaining differentially private
algorithms with a runtime polynomial in the size of the data set or at
least subexponential in the universe size. Based on a new learning
algorithm for submodular functions, we present the first
polynomial-time algorithm for answering a large number of Boolean
conjunction queries (or contingency tables) with non-trivial accuracy
guarantees. Conjunction queries are a widely used and important class
of statistical queries.

Furthermore, we exhibit an explicit and efficient reduction from
the problem of privately releasing a class of queries to the problem
of non-privately learning a related class of concepts. Instantiating
this general reduction with new and existing learning algorithms
yields several new results for privately releasing conjunctions and
other queries.

Not all problems arising in the presence of sensitive data are a
matter of privacy. In the final part of this thesis, we isolate
fairness in classification as a formidable concern and thus initiate
its formal study. The goal of fairness is to prevent discrimination
against protected subgroups of the population in a classification
system. We argue that fairness cannot be achieved by blindness to the
attribute we would like to protect. Our main conceptual contribution
is in asserting that fairness is achieved when similar individuals are
treated similarly. Based on the goal of treating similar individuals
similarly, we formalize and show how to achieve fairness in
classification, given a similarity metric. We also observe that our
notion of fairness can be seen as a generalization of differential

Follow us: Facebook Twitter Linkedin