It seems like the worlds of statistics and Java don’t talk to one another enough.
Small cell suppression is a statistical term for preventing users from inferring private information from public sets of data. For example, consider a survey of athletes with staph infections that is queryable by age, county, sport and race. If there were a statistically small number of Hispanic wrestlers in Otter Tail County, you could probably guess who had a staph infection. So, if a population is identified as statistically vulnerable to this inference, then that data is suppressed.
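To make the idea concrete, here is a minimal sketch of the simplest form of suppression: hide any cell whose count falls below a threshold. The class name, the cell labels and the threshold of 5 are all assumptions for illustration, not a rule taken from any agency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SmallCellSuppression {
    // Hypothetical threshold: cells with fewer people than this are hidden.
    static final int MIN_CELL_SIZE = 5;

    /** Returns a copy of the table in which small counts are replaced by null (suppressed). */
    public static Map<String, Integer> suppress(Map<String, Integer> counts, int minCellSize) {
        Map<String, Integer> safe = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> cell : counts.entrySet()) {
            int n = cell.getValue();
            safe.put(cell.getKey(), n < minCellSize ? null : Integer.valueOf(n));
        }
        return safe;
    }

    public static void main(String[] args) {
        // Made-up counts, keyed by a made-up "county/sport" label.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("Otter Tail/wrestling", 2);
        counts.put("Hennepin/wrestling", 41);

        for (Map.Entry<String, Integer> cell : suppress(counts, MIN_CELL_SIZE).entrySet()) {
            String shown = cell.getValue() == null ? "suppressed" : cell.getValue().toString();
            System.out.println(cell.getKey() + ": " + shown);
        }
    }
}
```

This only covers primary suppression. If row or column totals are also published, a hidden cell can often be recovered by subtraction, so real tools suppress additional cells as well.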
The Washington State Dept. of Health page has a pretty good explanation:
Why are small numbers a concern in public health assessment?
Public health policy decisions are fuelled by information. Often, this information is in the form of statistical data. Questions concerning health outcomes and related health behaviors and environmental factors often are studied within small subgroups of a population. Continuing improvements in the performance and availability of computing resources, including geographic information systems, and the need to better understand the relationships between environment, behavior, and consequent health effects have led to increased demand for data on small populations. These demands are often at odds with the need to preserve privacy and data confidentiality. Small numbers also raise statistical issues concerning the accuracy, and thus usefulness, of the data.
In general, problems with confidentiality arise when there are small denominators (population size represented in a specific cell in a table); and, problems with data reliability arise when there are small numerators (cases in a specific cell in a table).
Definitions
The broader term for these controls is “Statistical Disclosure Control”. The challenge is to use optimal levels, since too little control leaks private data and too much control makes published survey data useless.
“Imputation” is the practice of substituting values for missing data items. If we are leaving out data to protect confidentiality, then substitute data must be imputed so as to not skew the overall results.
“Inference”: The practice of finding secret data in published survey results. By measuring inference, we can find out if disclosure control is an issue.
Spearman’s Rank Correlation: a statistical tool for inference. It can find out how closely two variables are tied. This web page will perform this correlation for you (if you are ready to hand type your data into a web form).
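If hand typing data into a web form does not appeal, the calculation is small enough to sketch in Java. This is a minimal version under the usual definition (Pearson correlation computed on the ranks, with tied values sharing an average rank); the class and method names are my own, and the sample numbers are made up.

```java
import java.util.Arrays;

public class Spearman {
    /** Assigns 1-based ranks; tied values share the average of their positions. */
    public static double[] ranks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        double[] result = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            // Extend the run while the next sorted value equals the current one.
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) end++;
            double averageRank = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) result[order[k]] = averageRank;
            start = end + 1;
        }
        return result;
    }

    /** Plain Pearson correlation of two equal-length arrays. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Spearman's rho: Pearson correlation of the rank-transformed data. */
    public static double spearman(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] age = {21, 25, 30, 34, 40};
        double[] score = {3, 5, 4, 8, 9};
        System.out.println("rho = " + spearman(age, score));
    }
}
```

A value near 1 or -1 means the two variables move together closely, which is the situation where one published variable may give away another.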
I could only find one tool related to this in the Java world. I’m surprised it isn’t more of a booming field, since it touches on survey data, health and financial data, and security and privacy. Is that too small of a niche? I doubt it. Inattention to the dangers of leaking information in this way could potentially cause a lot of harm and cost a lot of money.
The stats package SAS has small cell suppression features. This document (Word Doc) discusses how to deal with the holes in the data that result from suppression.
So, how do I get this feature into my Java app?
R = the open source statistical package
CRAN = a list of packages for use with the R language
sdcTable = an R package for statistical disclosure control of tabular data
lpSolve = an R package that sdcTable depends on
rJava = an R package that allows R to create Java objects and, through the JRI package that is now part of rJava, allows Java to run R in a single thread and make calls to it.
JGR = a Java GUI for R that makes use of rJava. R binaries must be installed, and the JGR jar then allows Java to call R. The source of JGR has good, production-quality examples of how to call R from Java.
Using all that, one should be able to create an ad-hoc query front end for survey data, run submitted queries through small cell suppression rules in R, and return safe data.
There, I solved your small cell suppression problems. I’ll leave the details to the reader. What could be easier than integrating a stack of open source C and Java projects into your web app? Or, rather, tune in for part II: implementing this stack O’ fun.
I recently switched over from working primarily with Eclipse/MyEclipse, and these are some large and small obstacles I ran into and how to overcome them.
I’m trying to summarize several discussions about alternate languages on the JVM that I absorbed at the No Fluff Just Stuff conference. Can I become a language evangelist based on a weekend at a conference? I suppose not, but there were a lot of compelling arguments for why we should be looking at some of these new functional languages on the JVM. It was put forward that most of the reasons we like Java have to do with the JVM and not with the Java language:
These will hold true with any language that compiles to the JVM.
Why are they even considering new languages? Multiple reasons bubbled up from the conference as a whole.
Extensibility.
One talk discussed the example of Hadoop, an open source framework that handles huge amounts of data in a distributed way and is inspired by Google’s MapReduce papers. Its authors evidently found some of the core Java classes insufficient for their needs. If you look at the docs for org.apache.hadoop.io.Text, it says, “It provides methods to serialize, deserialize, and compare texts at byte level…. In addition, it provides methods for string traversal without converting the byte array to a string.” Does this point to an extensibility problem in Java? If not, why couldn’t they reuse any code from String? Someone at the conference asked: why can’t I make Object define toXmlString() so that every one of my classes that descends from Object automatically has a toXmlString()? This is extensibility, and Java doesn’t do it as completely as some other languages might.
A language shouldn’t limit what you can do. Certain language constructs not available in Java (closures, switch statements, folding) enable developers to be far more efficient.
OO might be failing us. We try to think of objects as changing in place. Rich Hickey, the creator of Clojure, rejects this: “The future is a function of the past, it doesn’t change it.” The idea is to stop thinking of data as persisting and changing over time, and instead recognize that a thing is immutable and that when it changes it becomes a different immutable thing. Think of a date, or an account balance. The state of an account at a point in time is immutable. Adding money to it does not change it; it creates a new state. This 55 minute video of Rich Hickey explaining some of these ideas was recommended at the conference and is amazing. As he explains, all of our concurrency problems come from the notion of objects changing in place.
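The account example translates into plain Java fairly directly. This is a rough sketch of the idea, not anything shown at the conference; the AccountState class and its method names are made up for illustration.

```java
import java.math.BigDecimal;

public final class AccountState {
    // The balance can never change after construction.
    private final BigDecimal balance;

    public AccountState(BigDecimal balance) {
        this.balance = balance;
    }

    public BigDecimal balance() {
        return balance;
    }

    /** Depositing does not modify this state; it returns the next one. */
    public AccountState deposit(BigDecimal amount) {
        return new AccountState(balance.add(amount));
    }

    public static void main(String[] args) {
        AccountState monday = new AccountState(new BigDecimal("100.00"));
        AccountState tuesday = monday.deposit(new BigDecimal("25.50"));

        // Monday's state is still available, untouched, alongside Tuesday's.
        System.out.println("monday:  " + monday.balance());
        System.out.println("tuesday: " + tuesday.balance());
    }
}
```

Because no state is ever modified, any number of threads can read a given AccountState without locking. The remaining question is how to swap in the newest state safely, which is the part Clojure's reference types are designed to handle.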
I learned a bunch of neat stuff over the weekend at NFJS. It was a wonderful combination of filling in the gaps for tools I use all the time and trying to show us what is coming in the future. The future, everyone agreed, was in alternate, functional languages on the JVM. I’ll talk about why in a separate post. The non-tech talks were all about agile development. At the end my brain was all stretched out and floppy. Today I want to go in a million directions at once.
We all understand that when a checkbox is not checked on a form, it is not present in the request object. This is the basis for many headaches in web application programming, especially with multi-page forms. When using multiple form pages, as in a wizard, the Struts way around this is to have a reset() method that sets the value to false, so that it stays false when the checkbox is missing from the request. Again, this applies to situations with a session-scoped form.
The documentation for the html:checkbox tag says:
WARNING: In order to correctly recognize unchecked checkboxes, the ActionForm bean associated with this form must include a statement setting the corresponding boolean property to false in the reset() method.
In practice, the only properties that need to be reset are those which represent checkboxes on a session-scoped form. Otherwise, properties can be given initial values where the field is declared.
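// Struts 1 calls this before it populates the form from the request.
public void reset(ActionMapping mapping, HttpServletRequest request) {
    this.citizen = false;
}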
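To see why the reset matters, here is a small simulation in plain Java with no Struts classes involved. ApplicantForm below is a stand-in I made up, and populate() only imitates the way a framework copies request parameters onto a form bean.

```java
import java.util.Collections;
import java.util.Map;

public class CheckboxResetDemo {
    /** Stand-in for a session-scoped form bean; not a real Struts class. */
    static class ApplicantForm {
        boolean citizen;

        void reset() {
            citizen = false;
        }

        /** Imitates the framework copying request parameters onto the form. */
        void populate(Map<String, String> params) {
            if (params.containsKey("citizen")) {
                citizen = Boolean.parseBoolean(params.get("citizen"));
            }
            // An unchecked checkbox sends no parameter, so nothing happens here.
        }
    }

    /** Submits a form with the checkbox unchecked and returns the resulting value. */
    public static boolean submitUnchecked(boolean previouslyChecked, boolean callReset) {
        ApplicantForm form = new ApplicantForm();
        form.citizen = previouslyChecked; // value left over from an earlier request
        if (callReset) {
            form.reset();
        }
        form.populate(Collections.emptyMap());
        return form.citizen;
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + submitUnchecked(true, false));
        System.out.println("with reset:    " + submitUnchecked(true, true));
    }
}
```

Without the reset, the old value survives the submission because nothing in the request ever tells the form that the box was unchecked.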
There are several confusing posts out there in forums about how to populate checkboxes when viewing forms with existing data. One says to have a hidden form field with the same name as the checkbox. Another has us jumping out of struts and using regular JSP tags with logic. Both of these are unnecessary and have potentially bad repercussions later.
The real solution is to use an html:checkbox with a name equal to that of a bean and the property equal to the name of the boolean variable in that bean that the checkbox captures. The following will check or uncheck the checkbox depending on the value of “citizen” in the applicantBean:
<html:checkbox name="applicantBean" property="citizen" value="true" />
For this to work, your code must create an empty applicant bean before loading the blank form, or Struts will whine that there is no such thing as “applicantBean” in any scope.
ab is a tight and effective tool for load testing web applications. It comes with every install of Apache httpd.
If a page is behind a login screen, you can use the -p flag to define a file that contains post variables for login and password:
C:\Apache2.2\bin>ab -p C:\posts\post.txt -T application/x-www-form-urlencoded -n 1000 -c 22 http://myServer:8008/myapplication/CentralCashier/userLogin.do
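The post file is just a URL-encoded request body on a single line. If the password contains characters like @ or &, it is easy to get the encoding wrong by hand, so here is a small sketch that builds the body. The field names and values are made up, so substitute whatever your login form actually posts; the Charset overload of URLEncoder.encode needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class PostBody {
    /** Joins the fields as name=value pairs separated by '&', URL-encoding both sides. */
    public static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Hypothetical form fields; use the names from your own login form.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("login", "tim");
        fields.put("password", "p@ss word&1");

        // print, not println: a trailing newline would become part of the body.
        System.out.print(encode(fields));
    }
}
```

Redirect the output into the file you pass to -p.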
If a page is only accessible by a logged in user, not directly accessible from the login page, then you can use the -C flag to define a cookie. You have to get the value of the session identifier cookie from a valid session. Use a proxy like Webscarab or Paros to capture a request and copy the JSESSIONID=xxxxx from the request and use it with ab:
C:\Apache2.2\bin>ab -C JSESSIONID=36D5AE14223E1D4ED0B2BBC5C7F411EA -n 1000 -c 22 http://myServer:8008/myapplication/CentralCashier/userSearch.do?method=search
Alternatively, you can just turn off the authentication filter for the purposes of your test.
I was asked to do a security assessment on a co-worker’s ColdFusion application. It is protected on every page by a NOT findnocase(cgi.http_host,cgi.http_referer) check to ensure the request came from the same domain. This is a good way to prevent forced browsing and most URL injection attacks because if you mess with the URL, this check knows it and stops all the shenanigans.
This is where a proxy comes in. I’ve worked a bunch with Paros and some with Burp, but my employer does not allow me to download these without some extra paperwork. Webscarab, for some reason, is allowed. Webscarab is written entirely in Java, has a zippy UI and has widening adoption.
Webscarab allowed me to do forced browsing on the application and learn that the application relied solely on that domain check to make sure the user was authenticated (That is, they could only get to the site through the login form). Webscarab also allowed me to find many XSS bugs.
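The reason a proxy gets past this kind of check is that the Referer header is sent by the client, so anyone who can edit the request controls it. Here is a rough Java imitation of the check, written as a case-insensitive substring test. It is my approximation of what findnocase does, not the application's code, and the host names are made up.

```java
import java.util.Locale;

public class RefererCheck {
    /** Passes when the host name appears anywhere in the Referer header, ignoring case. */
    public static boolean passes(String host, String referer) {
        if (host == null || host.isEmpty() || referer == null) {
            return false;
        }
        return referer.toLowerCase(Locale.ROOT).contains(host.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String host = "myserver"; // hypothetical host name

        // A normal click from one page of the site to another.
        System.out.println(passes(host, "http://myServer/app/menu.cfm"));

        // A URL typed straight into the browser sends no Referer at all.
        System.out.println(passes(host, null));

        // A proxy user simply types the expected value into the header.
        System.out.println(passes(host, "http://myserver/anything"));

        // A substring test is also satisfied by a foreign page that mentions the host.
        System.out.println(passes(host, "http://evil.example/?from=myserver"));
    }
}
```

So the check stops casual URL tampering in a browser, but it cannot stand in for a real authentication check on each page.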
Webscarab is infinitely scriptable (with beanshell).
Webscarab has a tool that evaluates session identifiers for their strength. I would guess that most web frameworks these days have very strong session identifiers. In fact, I challenge anyone to find an example of a weak session identifier on any web app that shouldn’t be replaced anyway for one hundred other reasons.
Startup Options
Webscarab starts in Lite mode, which is just the web proxy, by default. To get the full meal, you have to start with java -DWebscarab.lite=false -jar webscarab.jar
Default memory is 64MB and this can get used up quickly. Online examples show Webscarab having ~510 MB available. This is achieved by adding -Xms32m -Xmx510m to the Java startup args. Just like with some other Java desktop apps (like IntelliJ IDEA) you can click on the Green|Yellow|Red bar along the bottom of the window to force garbage collection and free up some memory.
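If you want to confirm that the heap flags took effect, a few lines of Java run with the same flags will report what the JVM actually got. HeapCheck is just a name I picked for this sketch.

```java
public class HeapCheck {
    /** Converts a byte count to whole megabytes. */
    public static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx setting, give or take some JVM overhead.
        System.out.println("max heap (MB):   " + toMegabytes(runtime.maxMemory()));
        System.out.println("total heap (MB): " + toMegabytes(runtime.totalMemory()));
        System.out.println("free heap (MB):  " + toMegabytes(runtime.freeMemory()));
    }
}
```

Running it as java -Xms32m -Xmx510m HeapCheck should show a maximum close to 510, though the exact figure varies by JVM.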
Things That Could Be Improved:
To address user experience as well as other issues, Webscarab is undergoing a total rewrite. This is currently known as Webscarab NG. They will be using the Spring Rich Client Platform. The new product also has database integration. This is a work in progress and needs lots of testing. So, if you are looking for an open source project to help, this would be an excellent choice. According to the email list, the Webscarab NG project leader has been directing his work at the OWASP Proxy lately. Even though Webscarab NG is in development, development also continues on the current Webscarab.
The Denied and Restricted Parties List (DRPL) is kind of a No-Fly list for export restrictions and since Sun has some encryption related technology, it is a national security concern that someone might take the SCJP exam.
After initially being informed that my request to take the exam was denied, today I got an email from Sun saying that I’m not, after all, someone who might do bad stuff.
Here is some background that was attached to my email:
The following individual, as a result of screening, has been identified as being as a potentially non-compliant export customer:
Search Key: US2121923
First and Last Name: Tim McGuire
City: St. Paul
Country: US
Result of DRPL check: Detected
Date and time of denial: Mon Jul 27 10:10:15 MDT 2009
Course Order Numbers: No Numbers Generated.
Reason for denial: Not Available from service.
From here and here I see that a bunch of people have been inconvenienced and aggravated to varying degrees because some mouth breather hasn’t figured out that no-fly lists are a fake idea. Imagine if Macy’s did a surprise sniff test on every 100th customer in their underwear department? This is just like that.
RSnake has been thinking about a denial of service attack against web servers that involves sending partial HTTP packets to use up the number of allowed clients. Sending a carefully crafted partial packet causes the server to take A LONG TIME to work on the response to your request, using up its resources and becoming temporarily unavailable to other visitors. Apache HTTPD is mentioned as a server that is vulnerable. IIS is mentioned as one that is not. RSnake, being a realist and not an anti-Microsoft evangelist, often says things that make the open source advocates uncomfortable. (“PHP is the bane of my existence” and “Whenever I assess a dot net application I know right off the bat that I’m going to find half the number of vulnerabilities”).
A few notes about Slowloris: it can’t effectively DoS a box from Windows because it works by creating hundreds of sockets and Windows only allows a max of 130. It doesn’t crash anything, so it is a gentle tool (haha). It just happens to make web applications unavailable for as long as the attacker wishes. It does, by the way, send out hundreds of packets, so it is detectable by the administrator.
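For anyone wondering what “partial” means here: an HTTP request's header block is only finished when the server sees an empty line, so a request that never sends one keeps its connection slot occupied while the server waits. The sketch below just shows that rule as a string check; it is my illustration of the protocol detail, not code from Slowloris or from any server.

```java
public class PartialRequest {
    /** True once the header block has been terminated by an empty line (CRLF CRLF). */
    public static boolean headersComplete(String received) {
        return received.contains("\r\n\r\n");
    }

    public static void main(String[] args) {
        String complete = "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n";
        String partial = "GET / HTTP/1.1\r\nHost: localhost\r\nX-a: b\r\n";

        // The server can start answering the first request immediately.
        System.out.println("complete: " + headersComplete(complete));

        // For the second it has to keep waiting for more headers or a timeout.
        System.out.println("partial:  " + headersComplete(partial));
    }
}
```

A server that dedicates a worker to each waiting connection runs out of workers quickly, which fits the description above.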
To use Slowloris, first establish a timeout for the web server you are attacking:
./slowloris.pl -dns localhost -port 8080 -test
This should return some numbers to use for a timeout.
They don’t mention Tomcat, so I spent most of the afternoon setting up a machine to see if this tool can DoS Tomcat.
drum roll please….
It’s clear that we’ll be using a 5 second timeout for TCP and a 30000 millisecond timeout for HTTP.
then,
./slowloris.pl -dns localhost -port 8080 -timeout 30000 -num 500 -tcpto 5
the above opens 500 sockets and uses a tcp timeout of 5 seconds and looks like this:
Now, try to connect to the benighted Tomcat server.
Hmmm. Works fine. What gives? I suspect that at this number of connections (500), I am still able to get a connection. The first visit takes a really long time, but once I get through, I can use the site normally. This matches the statement in the documentation that “ “. If I raise the number of connections, it still takes a very long time to load the first page, but thereafter it is just as easy to access the application.
When I run slowloris on the same server, however, Tomcat is completely DOS-ED. I’m impressed with the absolute unavailability of Tomcat in relation to the low level of traffic that slowloris generates.
Ooooh. I thought I was supposed to convert 30 seconds into milliseconds. Wrong! A timeout of 30,000 seconds is clearly too high. When I set it down to 30, slowloris CRUSHED Tomcat, remotely or locally. As you can see below, setting the timeout correctly allowed many more packets to be sent.
I installed jadclipse to see if it was a way to view source of all the libraries I use but don’t have the source code for.
Installation is simple, with the small extra step of downloading jad and telling Eclipse where to find the jad.exe. (Window –> Preferences –> Java –> Decompilers –> Jad)
All the jad command line options are represented in the jad dialog.
Inner classes and static blocks were decompiled just fine. I did notice that jad left out any braces in if-then blocks that weren’t needed. Nitpicking a bit here, but I can’t read code without braces. The braces were put in when I checked the “show redundant braces” option, but not until I restarted Eclipse.
Finally, some classes seem to have a stuck setting causing them to open in the class viewer. For instance, Struts Action gets decompiled just fine, but Struts DispatchAction just shows itself in that awful class viewer. Restarting Eclipse made this problem go away too.
Big thumbs up for this very useful Eclipse plugin.