captain holly java blog

data security through small cell suppression

Posted in Uncategorized by mcgyver5 on December 22, 2009

It seems like the worlds of statistics and Java don’t talk to one another enough.

Small cell suppression is a statistical term for withholding parts of a public data set so that users cannot infer private information from it. For example, consider a survey on athletes with staph infections that is queryable by age, county, sport and race. If there were a statistically small number of Hispanic wrestlers in Otter Tail County, you could probably guess who had a staph infection. So, if a population is identified as statistically vulnerable to this kind of inference, its data is suppressed.
The Washington State Dept. of Health page has a pretty good explanation:

Why are small numbers a concern in public health assessment?

Public health policy decisions are fuelled by information. Often, this information is in the form of statistical data. Questions concerning health outcomes and related health behaviors and environmental factors often are studied within small subgroups of a population. Continuing improvements in the performance and availability of computing resources, including geographic information systems, and the need to better understand the relationships between environment, behavior, and consequent health effects have led to increased demand for data on small populations. These demands are often at odds with the need to preserve privacy and data confidentiality. Small numbers also raise statistical issues concerning the accuracy, and thus usefulness, of the data.

In general, problems with confidentiality arise when there are small denominators (population size represented in a specific cell in a table); and, problems with data reliability arise when there are small numerators (cases in a specific cell in a table).
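
To make the idea concrete, here is a minimal sketch of the simplest form of suppression: any cell whose count falls below a threshold is withheld. The class name, the method name, and the threshold of 5 are my own assumptions for illustration and are not taken from the Washington State guidelines or from any library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SmallCellSuppression {

    // Hypothetical threshold; real agencies publish their own minimum cell sizes.
    static final int MIN_CELL_SIZE = 5;

    /**
     * Returns a copy of the table in which every count below minCellSize
     * is replaced by null, meaning "suppressed". The input is not modified.
     */
    static Map<String, Integer> suppress(Map<String, Integer> counts, int minCellSize) {
        Map<String, Integer> safe = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> cell : counts.entrySet()) {
            int n = cell.getValue();
            safe.put(cell.getKey(), n < minCellSize ? null : Integer.valueOf(n));
        }
        return safe;
    }

    public static void main(String[] args) {
        // Made-up counts, keyed by a label describing the cell.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("wrestling / county A", 3);
        counts.put("football / county A", 41);
        counts.put("swimming / county A", 17);

        for (Map.Entry<String, Integer> cell : suppress(counts, MIN_CELL_SIZE).entrySet()) {
            String shown = cell.getValue() == null ? "suppressed" : cell.getValue().toString();
            System.out.println(cell.getKey() + ": " + shown);
        }
    }
}
```

This only covers primary suppression. If row or column totals are published alongside the table, a suppressed cell can often be recovered by subtraction, which is why real tools also suppress additional cells.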

Definitions
The broader term for these controls is “Statistical Disclosure Control”. The challenge is to find the optimal level, since too little control leaks private data and too much control makes published survey data useless.
“Imputation” is the practice of substituting values for missing data items. If we are leaving out data to protect confidentiality, then substitute data must be imputed so as to not skew the overall results.
“Inference”: The practice of finding secret data in published survey results. By measuring inference, we can find out if disclosure control is an issue.
Spearman’s Rank Correlation: a statistical tool for inference. It measures how closely the rankings of two variables are tied. This web page will perform this correlation for you (if you are ready to hand type your data into a web form).
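
If hand typing data into a web form doesn't appeal, the coefficient is easy enough to compute yourself. Below is a minimal sketch in plain Java that ranks both variables (tied values share an average rank) and then takes the Pearson correlation of the ranks. The class and method names are mine, not from any library.

```java
import java.util.Arrays;
import java.util.Comparator;

public class SpearmanRank {

    /** Assigns 1-based ranks; tied values share the average of their positions. */
    static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int k = 0; k < n; k++) {
            order[k] = k;
        }
        // Sort the indices by the value they point at.
        Arrays.sort(order, Comparator.comparingDouble(idx -> values[idx]));

        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            // Extend the run while the next value ties with the current one.
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double averageRank = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = averageRank;
            }
            start = end + 1;
        }
        return ranks;
    }

    /** Plain Pearson correlation of two equal-length arrays. */
    static double pearson(double[] x, double[] y) {
        double meanX = Arrays.stream(x).average().orElse(0.0);
        double meanY = Arrays.stream(y).average().orElse(0.0);
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int k = 0; k < x.length; k++) {
            double dx = x[k] - meanX;
            double dy = y[k] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Spearman's rho: the Pearson correlation of the ranked data. */
    static double spearman(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        // Made-up data: y rises whenever x rises, but not linearly.
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {1, 4, 9, 16, 25};
        System.out.println("rho = " + spearman(x, y));
    }
}
```

A value near 1 or -1 means one variable's ordering largely predicts the other's. Note that the result is undefined (NaN) if either variable is constant.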

I could only find one tool related to this in the Java world. I’m surprised it isn’t more of a booming field, since it touches on survey data, health and financial data, and security and privacy. Is that too small of a niche? I doubt it. Inattention to the dangers of leaking information in this way could cause a lot of harm and cost a lot of money.

The stats package SAS has small cell suppression features. This document (Word Doc) discusses how to deal with the holes in the data that result from suppression.

So, how do I get this feature into my Java app?
R = the open source statistical package
CRAN = a list of packages for use with the R language
sdcTable: statistical disclosure control for tabular data
lpSolve: an R package that sdcTable depends on
rJava = an R package that allows R to create Java objects and, through the JRI package that is now part of rJava, allows Java to run R in a single thread and make calls to it.
JGR = a Java GUI for R that makes use of rJava. R binaries must be installed, and the JGR jar then allows Java to call them. The JGR source has good, production-quality examples of how to call R from Java.
Using all that, one should be able to create an ad-hoc query front end for survey data, run submitted queries through small cell suppression rules in R, and return safe data.
There, I solved your small cell suppression problems. I’ll leave the details to the reader. What could be easier than integrating a stack of open source C and Java projects into your web app? Or, rather, tune in for part II: implementing this stack O’ fun.
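
As a first taste of calling R from Java, here is a minimal sketch that skips rJava/JRI entirely and just shells out to the `Rscript` command-line tool that ships with R. This is a deliberately simpler, swapped-in technique and not the JRI approach described above. It assumes `Rscript` is on the PATH; the class and method names are hypothetical, and the R expression in `main` is only a placeholder, not a real sdcTable call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class RscriptRunner {

    /** Builds the command line: Rscript --vanilla -e <expression>. */
    static List<String> buildCommand(String rExpression) {
        return Arrays.asList("Rscript", "--vanilla", "-e", rExpression);
    }

    /** Runs the expression in a fresh R process and returns everything it printed. */
    static String run(String rExpression) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(rExpression));
        builder.redirectErrorStream(true); // fold stderr into stdout so errors are visible
        Process process = builder.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Rscript exited with status " + exitCode + ": " + output);
        }
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder expression: just asks R to identify itself.
        System.out.print(run("cat(R.version.string)"));
    }
}
```

Starting a new R process per query is slow compared to an embedded engine, so this is only good for experimenting. And since this is a security post: never paste user-submitted query text into the R expression. Pass validated parameters only.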

IntelliJ IDEA: Notes on switching

Posted in Uncategorized by mcgyver5 on December 19, 2009

I recently switched over to IntelliJ IDEA from working primarily with Eclipse/MyEclipse. These are some large and small obstacles I hit and how to overcome them.

  1. I want to ignore persistence framework errors. Go to Project Structure –> JPA facet and delete Data Sources Mappings (but not JPA Configuration Descriptor!).
  2. Web application doesn’t reflect changes to html, xhtml, jsp, etc. Go to Project Structure –> Java EE build settings. Make sure the Exploded Directory Project compile output path is the same one the server is using (i.e. where your project lives on disk). Also make sure the compile output path is the same as where your project lives and not some crazy IntelliJ-invented directory.
  3. I want the editor to be linked with the menu, like in Eclipse. This is Autoscroll from Source, a button in the top row of the project pane.
  4. I used Ctrl-Shift-R (for resource) all the time in Eclipse. In IntelliJ IDEA, the same function is Ctrl-Shift-N (for name).
  5. Autocomplete does not work! In my case, this was due to the La Clojure plugin (0.2.172). When I disabled this plugin and restarted, autocomplete (and several other features) came back. A web search on this turned up nothing. Maybe now it will.
  6. How to integrate CVS
    • If CVS is not connected, go to Version Control –> CVS –> Configure CVS Roots –> Test Connection. This appeared to reset the connection for me.
    • To set up a CVS repo: Version Control –> CVS –> Configure CVS Roots –> click the “plus” button to make a new root. Enter your CVS info.
    • To import an existing project into IntelliJ IDEA: File –> Open Project –> browse to find the .pom file.
  7. How to get vim keyboard mappings in IntelliJ. Go to Settings –> Plugins –> Available –> right-click on IDEAVIM to install. The step I skipped screwed me up big time: you must copy the keymap file according to these directions.
  8. I hacked the authentication mechanism on an app so I wouldn’t have to log in every time during testing, and I was afraid I might accidentally commit it to CVS. So I had to ensure this file never got mixed in with the rest of our code. This is a CVS question rather than an IntelliJ IDEA question, but the answer is to create a new branch. Right-click on file –>CVS–>create branch (name it “DEAD_BRANCH” or something) and check the “Switch to this branch” box. The next time you go to commit that file or the directory it is in, that file will show up as [switched to tag DEAD_BRANCH] and if committed, will only be committed to that branch, so that your co-workers, when they update, will not get your screwed up file.
  9. Keystroke goodness. The following keystrokes are indispensable. For a complete keystroke chart, go to help –> keystroke reference
    • Move lines or blocks of code. This comes in handy on almost a daily basis and for some reason isn’t in the keystroke chart. Ctrl-Shift-Up moves a line or selected block up. Ctrl-Shift-Down moves it down. If it does not work, try hitting Escape.
    • IntelliJ has a history of clipboard (buffer) contents. To paste from it, use Ctrl-Shift-V.
    • Rename: Shift-F6
    • Generate Getters and Setters: Alt-Insert
    • Find usages: Alt-F7
    • Duplicate Line or selection: Ctrl-D.