Scoring and reporting for detectors and encoders is pretty much done. The graphs aren't quite ready yet, for aesthetic reasons, but I have some nice tables to report:
Length detector: 16% accuracy
Timing detector: 89% accuracy
Entropy detector: 94% accuracy
SSL was correctly identified 25% of the time, 70% of the time other protocols were misidentified as being SSL, and 5% of the time SSL was misidentified as another protocol.
obfsproxy was correctly identified 55% of the time, 10% of the time other
protocols were misidentified as being obfsproxy, and 35% of the time obfsproxy was misidentified
as another protocol.
Dust was correctly identified 48% of the time, 4% of the time other
protocols were misidentified as being Dust, and 48% of the time Dust was misidentified
as another protocol.
obfsproxy is distinguishable from SSL 96% of the time.
obfsproxy is distinguishable from Dust 98% of the time.
Dust is distinguishable from SSL 56% of the time.
I'm sure these numbers are confusing as several different sorts of things are being measured, all with similar numbers. It is probably quite unclear how all of these percentages add up. I think this is where the graphs will help. In particular I think it's hard to visualize accuracy when there are both false positives and false negatives to take into account.
I think the most interesting result here is the entropy detector. It's the simplest and also the most effective. Not only that, but the entropy detector only looks at the entropy of the first packet, so it's inexpensive. The original reason that I implemented this detector was that there was an argument against Dust that it could be trivially filtered by setting an entropy threshold above which all traffic is filtered. Surprisingly, Dust has lower entropy than SSL, so this attack will not work to specifically target Dust. This is unexpected because Dust uses totally random bytes, whereas the SSL handshake does not. This is a result of a general issue with how Shannon entropy is defined. Shannon entropy is normally defined across a large statistical sampling, in which case all three protocols tested will converge to maximum since they are all encrypted and therefore apparently random. However, the attacker does not in this case get a large statistical sample. Since they don't know a priori which connections are Dust connections, they only get however many packets are in each trace to use as samples. The current version of Dust that I'm testing has most of the advanced obfuscation stripped out as I haven't had time to add it back in after completely reimplementing the protocol to use TCP instead of UDP and all of the modified assumptions that come with that change. So this version of Dust is basically just the ntor protocol with no packet length or timing hiding in use. The first packet is therefore 32 random bytes consisting of the curve25519 ephemeral ECDH key, as compared to the 1k first packet of SSL. This is the reason for the low entropy calculation, because given only 32 bytes to sample almost all possible bytes are going to have a probability of 0 or 1. There is an upper bound on entropy when sample sizes are small. Achieving maximum first-order entropy here would require at least 512 bytes so that each value 0 to 255 could occur exactly twice.
There are of course many ways to measure entropy and you could say that I'm just doing it wrong. Some alternative measurements have already been suggested that I might implement as well. I am not suggesting that Dust is in any way immune to entropy-based detection. In fact, Dust is currently detected very well by the current entropy detector (it passed 100% of the packet length and timing tests, so all detection of Dust was done by the entropy detector) precisely because it has strangely low entropy because of it's small first packets. I bring this up only as an interesting example of how intuition can fail when trying to use Shannon entropy to measure individual messages instead of channels.
As far as the state of the project goes, I think the project can be considered done. Looking at the project documentation, everything didn't work out exactly as originally planned, I think it all worked out for the best and the project is where I wanted it to be by the completion of Summer of Code. I have generation of HTTP and HTTPS traffic, both over Tor and plain. I have two encoders: obfs2 and Dust. I have three detectors: length, timing, and entropy. The string detector didn't work out, but the entropy detector worked out better than planned, so on balance I think it was a win. I didn't implement all of the detector modes because upon further reflection it doesn't really make sense to just try all of these different modes. You want the mode that works best. So for packet lengths and timing that is full conversations and for entropy that's the first packet (could be first N packets, but I tried N=1 and it worked great). Random sampling of packets just seems like a way to limit your success at this point when non-random sampling of just the first packet seems to be so effective. All of the utilities have been written, and more. I have the specified utilities for separating the pcap files into streams, and extracting the strings, lengths, and timings. Additionally, I have some utilities for extracting the entropy of streams, tagging the streams with what protocol they are using (by port), and extracting dictionary and corpus models for latent semantic analysis. I also have some utilities for graphing distributions of properties of different protocols which I hope to finish up next week during the wrap-up period.
Looking to the future, I have changed my goals somewhat from what I had in mind at the beginning of the project. After having attended the TorDev meetup in Waterloo and the PETS conference, I am much more interested in integrating my work with the Tor project instead of just testing the various protocols that have been proposed in order to see how my own protocol stacks up against the competition. As part of this transition, I have rewritten Dust from scratch, both in terms of implementation and in terms of the protocol, in order to be better suited for use as a pluggable transport for Tor. My goal for Dust is to have it shipped with both Tor clients and bridges so that it can be used in the field to contact bridges despite DPI filtering of Tor connections. My goal now for the blocking-test project is to focus on blocking resistance for Tor. So instead of adding all of the protocols under the sun, I'm going to concentrate on protocols that have been implemented or are under consideration to become pluggable transports. Additionally, the next protocol I'm going to add after GSoC is over is just plain Tor. There is at least one bug filed about a suspicion that Tor can currently be fingerprinted based on some packet characteristics and I'd like to be able to close that bug because we now have the capability to easily generate the necessary information.
Finally, I would like to offer some insight into what I think is the future of blocking-resistant transports now that I have done some work trying to break them. Fundamentally, the only thing an attacker can do is limit the bitrate of the connection. Given any sort of channel, we can encode arbitrary information over that channel. However, the smaller the channel the slower the bitrate. The goal of encodings should be to maximize the bitrate given the constraints of the censor. It's easy to develop very slow encodings such as natural language encodings over HTTP. However, in order to maximize the throughput you need to have a good definition of the attack model. All encoding for blocking-resistant purposes lowers the efficiency above just transmitting the normal content (assumed it has already been compressed). Therefore all encoded conversations are going to be longer than unencoded conversations. I think this is the future of writing detectors because it's a fundamental constraint on encoding. I think this should even be able to defeat something like Telex (although let's not get into the details of attacks on Telex specifically, I'm talking about a general idea here). Given a (static, for the sake of argument simplicity) website, I can download every page on that website. Then I can compare the length of intercepted downloads and detect when they don't match up. For dynamic websites you can do the same thing with a statistical model. I think, therefore, that the future of blocking-resisting protocols is to encode single logical connections over an ensemble of multiple actual connections. Ideally these connections would mimic normal traffic in terms of number of connections, duration, directional flow rates, etc.. It's something to think about anyway. It could be an interesting problem.