This is a summary of some discussion among developers of pluggable transports about steganographic channels and deriving them from protocol grammars. Two things prompted the discussion:
* A program called appid, http://code.google.com/p/appid/ , which compiles simple protocol specifications into big regular expressions and uses them to classify network streams. Supported protocols are https://code.google.com/p/appid/source/browse/trunk/apps ; a sample is https://code.google.com/p/appid/source/browse/trunk/apps/afs . Roger suggested to make a program that takes one of these specifications and embeds data within a stream that appid will classify as the chosen protocol.
* This paper: "Provably Secure Steganography" http://www-users.cs.umn.edu/~hopper/tc-stego.pdf The authors define a "channel" as a probability distribution on sequences. You can think of a channel as something that answers "given this prior history of packets, what are the possible next packets and their probabilities?" or even as an oracle that merely allows sampling this distribution. They show a method of steganographically encoding a bit within a channel such that that a passive adversary cannot distinguish it from "normal" traffic (Sec. 4.1 and Fig. 1). For the topics we were talking about, the most important sections to read are 2 and 4.1.
You can embed data in a string that will match a given regular expression by constructing a DFA (http://swtch.com/~rsc/regexp/regexp1.html) and walking it state by state, choosing state transition edges so as to encode something. For example, if there are two outgoing edges from a state, which one you choose can encode a bit. Many of the appid signatures contain any* blocks that can straightforwardly encode bytes. An example of this idea is the regular expression /^(ab*[cd])+/. The b* lets us encode one bit (say "" = 0 and "b" = 1), and the choice of c or d allows another bit ("c" = 0 and "d = 1). We could send the message 00110001 as "acabdacad".
This idea can really be thought of an application of the paper. A regular expression defines a probabilistic channel if we assign probabilties to edge transitions, which we may do uniformly. So, for example, with the regular expression given above, the first byte is 'a' with probability 1. The second byte is 'b' with probability 0.5, 'c' with probability 0.25, and 'c' with probability 0.25. But we see how a channel is sensitive to history: if the string so far is "ab", then the next byte may be 'c' or 'd'; but if the string so far is "ac", then the next byte must be 'a' (or EOF).
We can use appid-like signatures to make steganographic channels, if we assume that the signatures are a realistic reflection of actual use of the protocols. But: this relies critically on the accuracy of the model. (Specifically, does it match the censor's model? If he uses simple regular expressions for blocking, then we win; if not, then we probably lose.) Robert Ransom gave the example of a model of HTTP using just the syntax; i.e. using RFC 2616 as the definition of a channel. Such a thing would not only not look like real-world HTTP, but would be broken easily by proxies that rearrange headers, for example.
rransom also suggested a way to use arithmetic encoding to obfuscate a stream using a probabilistic model of a protocol; I'll quote him in part: * send it through an arithmetic-code encoder, using a model which contains a (low-probability) 'pad' symbol; * encrypt it with a stream cipher; * send it through an arithmetic-code decoder, using a model which matches the 'cover protocol' (the protocol you are trying to mimic).
David (yours truly) wants to write or help write a simple pluggable transport derived from regular-expression signatures, even with the limitations shown above. Client and relay would need matching signature models. For the same of simplicity, it will only seek to match the given signature, and not be indistinguishable in the strong sense mentioned above. It won't do symmetric encryption of the underlying TLS (or if it does, will use a fixed key). It won't use the constructions from the Provably Secure Steganography paper, rather it will just encode its stream directly in DFA edge transitions. I think it will be interesting to see 1) how far a simple system can get us, and 2) what additional changes we would have to make to be provably secure against censors using more sophisticated computational models than regex.
David Fifield