data security through small cell suppression

Posted in Uncategorized by mcgyver5 on December 22, 2009

It seems like the worlds of statistics and Java don’t talk to one another enough.

Small cell Suppression is a statistical term for not allowing users to be able to infer what should be private information from public sets of data. For example, consider a survey on athletes with staph infections that was queryable by age, county, sport and race. If there were statistically small number of hispanic wrestlers in Otter Tail County, you could probably guess who had a staph infection. So, if a population is identified as statistically vulnerable to this inference, then that data is suppressed.
The Washington State Dept. of Health page has a pretty good explanation:

Why are small numbers a concern in public health assessment?

Public health policy decisions are fuelled by information. Often, this information is in the form of statistical data. Questions concerning health outcomes and related health behaviors and environmental factors often are studied within small subgroups of a population. Continuing improvements in the performance and availability of computing resources, including geographic information systems, and the need to better understand the relationships between environment, behavior, and consequent health effects have led to increased demand for data on small populations. These demands are often at odds with the need to preserve privacy and data confidentiality. Small numbers also raise statistical issues concerning the accuracy, and thus usefulness, of the data.

In general, problems with confidentiality arise when there are small denominators (population size represented in a specific cell in a table); and, problems with data reliability arise when there are small numerators (cases in a specific cell in a table).

Definitions
The broader term for these controls is “Statistical Disclosure Control”. The challenge is to use optimal levels since too little control leaks public data and too much control makes published survey data useless.
“Imputation” is the practice of substituting values for missing data items. If we are leaving out data to protect confidentiality, then substitute data must be imputed so as to not skew the overall results.
“Inference”: The practice of finding secret data in published survey results. By measuring inference, we can find out if disclosure control is an issue.
Spearman’s Rank Correlation: a statistical tool for inference. It can find out how closely two variables are tied. This web page will perform this correlation for you (if you are ready to hand type your data into a web form).

I could only find one tool related to this in the java world. I’m surprised it isn’t more of a booming field since it touches on survey data, health and financial data, and security and privacy. Is that too small of a niche? I doubt it. Inattention to the dangers of leaking information in this way could potentially cause a lot of harm and cost a lot of money.

The stats package SAS has small cell suppression features. This document (Word Doc) discusses how to deal with the holes in the data that result from suppression.

So, how to have this feature in my java app?
R = the open source statistical package
CRAN = a list of packages for use with the R language
sdcTable: statistical disclosure control for tabular data
lPSolve: an R package that sdcTable depends on
rJava = an R package that allows R to create java objects and, through the JRI package that is now part of rJava, allows java run R in a single thread and make calls to it.
JGR = java GUI tool that makes use of rJava for a java GUI interface to R. R binaries must be installed and the JGR jar then allows java to call it. The source of JGR has good, production quality examples of how to call R from java.
Using all that, one should be able to create an ad-hoc query front end for survey data, run submitted queries through small cell suppression rules in R, and
return safe data.
There, I solved your small cell suppression problems. I’ll leave the details to the reader. What could be easier than integrating a stack of open source C and Java projects into your web app? or, rather, tune in for part II: implementing this stack O’ fun.

captain holly java blog