Abstract
Increasing volumes of data are archived by government agencies, health networks,
search engines, social networking websites, and other organizations. The potential
for scientific discovery and social benefit from analyzing these databases is significant.
At the same time, releasing information from such repositories can cause devastating
damage to the privacy of individuals or organizations whose information is stored there.
The challenge, in particular for statistical agencies, is how to provide high-quality data
products without compromising the privacy of the individuals whose data they contain.
The field of Statistical Disclosure Control (SDC) aims at developing methodology that
balances the objectives of providing data for valid statistical inference and safeguarding
confidential information. In the first part of the talk, I will give a general overview of
the data privacy problem and some of the SDC methodologies. In the second part of
the talk, I will present my work on the Post Randomization Method (PRAM). PRAM
is a disclosure control method in which values of categorical variables are perturbed via
a known probability mechanism and only the perturbed data are released, which
raises questions about disclosure risk and data utility. To address these issues, and
in particular that of data utility, I propose an EM-type algorithm to obtain unbiased
parameter estimates for generalized linear models after accounting for the effect of PRAM.
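
As a rough illustration of the perturbation step described above, the following Python sketch applies PRAM to one integer-coded categorical variable. The function names (apply_pram, corrected_counts), the NumPy dependency, and the example transition matrix are assumptions made for illustration and do not come from the talk. The correction shown is the simple moment-based adjustment of category counts (inverting the known transition matrix), not the EM-type algorithm for generalized linear models proposed here.

```python
import numpy as np


def apply_pram(values, transition, rng):
    """Perturb integer-coded categories with a known transition matrix.

    transition[k, j] is the probability that true category k is released
    as category j, so every row must sum to 1. (Hypothetical helper.)
    """
    values = np.asarray(values)
    n_categories = transition.shape[0]
    perturbed = np.empty_like(values)
    for k in range(n_categories):
        mask = values == k
        # Draw a released category for every record whose true value is k.
        perturbed[mask] = rng.choice(n_categories, size=mask.sum(), p=transition[k])
    return perturbed


def corrected_counts(perturbed, transition):
    """Moment-based estimate of the true category counts.

    Since E[observed] = transition.T @ true_counts, solving that linear
    system gives an unbiased estimate when the matrix is invertible.
    Individual estimates can be negative in small samples.
    """
    n_categories = transition.shape[0]
    observed = np.bincount(perturbed, minlength=n_categories)
    return np.linalg.solve(transition.T, observed)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative transition matrix: most values are kept, a few are swapped.
    P = np.array([
        [0.90, 0.05, 0.05],
        [0.10, 0.80, 0.10],
        [0.05, 0.05, 0.90],
    ])
    true_values = rng.integers(0, 3, size=1000)
    released = apply_pram(true_values, P, rng)
    print("true counts:     ", np.bincount(true_values, minlength=3))
    print("released counts: ", np.bincount(released, minlength=3))
    print("corrected counts:", corrected_counts(released, P))
```

Because the transition matrix is public, an analyst can undo its average effect on simple summaries such as counts, while individual released records remain uncertain. Regression-type analyses need more than this count correction, which is the gap the EM-type approach in the abstract is meant to address.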