[tor-bugs] #5023 [Pluggable transport]: Morpher pluggable transport: Select algorithm for packet size morphing

Tor Bug Tracker & Wiki torproject-admin at torproject.org
Sat Feb 4 20:39:45 UTC 2012


#5023: Morpher pluggable transport: Select algorithm for packet size morphing
---------------------------------+------------------------------------------
 Reporter:  asn                  |          Owner:  asn  
     Type:  task                 |         Status:  new  
 Priority:  normal               |      Milestone:       
Component:  Pluggable transport  |        Version:       
 Keywords:                       |         Parent:  #4680
   Points:                       |   Actualpoints:       
---------------------------------+------------------------------------------

Comment(by asn):

 (Let's rename 'random sampling' to 'direct sampling' from now on:)

 The attachments in #4680 raise at least two questions:

 a) It seems that in the Server->Client case
 (attachment:500000_sc.png:ticket:4680) 'traffic morphing' actually
 delivers: it results in about a quarter of the overhead of the 'direct
 sampling' method.
 The (not-so-)funny thing is that in the Client->Server case
 (attachment:500000_cs.png:ticket:4680), 'traffic morphing' actually causes
 more overhead than 'direct sampling' does!

 My guess is that this happens because in the C->S case, HTTPS assigns high
 probability to small packet sizes (attachment:https_cs.png:ticket:4680),
 so the morphing matrix keeps splitting off small packets, and whatever is
 left over at the end gets padded up to 1460. Example from the logs
 (attachment:cs_log.txt:ticket:4680):
 {{{
 sampling: Got packet size 586. We must morph it to 5. Splitting to 581 and
 sending the first part...
 sampling: Got packet size 581. We must morph it to 1459. Padding with 878
 and sending.
 morphing: Got packet size 586. We must morph it to 127. Splitting to 459
 and sending the first part...
 morphing: Got packet size 459. We must morph it to 123. Splitting to 336
 and sending the first part...
 morphing: Got packet size 336. We must morph it to 1459. Padding with 1123
 and sending.
 264: OVERHEAD ROUND SUMMARY: Sampling: 928 : Morphing: 1223
 264: OVERHEAD ROUND SUMMARY: Morpher lost (295)
 }}}

 Morpher gets a packet of 586 bytes to morph and splits it into 127+459. It
 sends the 127 bytes, splits the 459-byte part into 123+336, sends the 123
 bytes, and pads the 336-byte part up to 1459. This results in an overhead
 of 1223 bytes. Direct sampling did better here simply because its random
 draw happened to be luckier.
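
 For reference, here is a minimal sketch (mine, not the actual morpher
 code) of the split-or-pad loop that the log above describes. The
 pick_target_size() callback is a placeholder for whatever draws the next
 target length: either a direct sample from the HTTPS distribution, or a
 sample from the morphing matrix column for the current packet size.
 {{{
 def morph_round(packet_len, pick_target_size):
     """Simulate one morphing round: keep splitting the packet until the
     remainder fits the sampled target size, then pad it up. Returns the
     padding bytes added (per-packet header costs not counted)."""
     overhead = 0
     while True:
         target = pick_target_size(packet_len)
         if packet_len > target:
             # Send the first `target` bytes, keep morphing the rest.
             packet_len -= target
         else:
             # Pad the remainder up to the target size and send it.
             overhead += target - packet_len
             return overhead
 }}}
 Running the 586-byte example above through this loop with targets 127,
 123, 1459 gives 1459 - 336 = 1123 padding bytes, matching the log (modulo
 whatever per-packet cost the round summary adds on top).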

 On the other hand, in the S->C case, Tor seems to output small packets
 with good-enough probability (attachment:tor_sc.png:ticket:4680) [0],
 which fits nicely with HTTPS's own good-enough probability of outputting
 small packets (attachment:https_sc.png:ticket:4680), and the results are:

 {{{
 sampling: Got packet size 65. We must morph it to 1413. Padding with 1348
 and sending.
 morphing: Got packet size 65. We must morph it to 79. Padding with 14 and
 sending.
 5: OVERHEAD ROUND SUMMARY: Sampling: 1348 : Morphing: 14
 5: OVERHEAD ROUND SUMMARY: Morpher won (1334)
 }}}
 {{{
 sampling: Got packet size 11. We must morph it to 1379. Padding with 1368
 and sending.
 morphing: Got packet size 11. We must morph it to 53. Padding with 42 and
 sending.
 6: OVERHEAD ROUND SUMMARY: Sampling: 1368 : Morphing: 42
 6: OVERHEAD ROUND SUMMARY: Morpher won (1326)
 }}}

 So in such cases, 'traffic morphing' beats stupid 'direct sampling'
 easily.
 And even when the original packet size is large, something like this
 happens:
 {{{
 sampling: Got packet size 1460. We must morph it to 1379. Splitting to 81
 and sending the first part...
 sampling: Got packet size 81. We must morph it to 894. Padding with 813
 and sending.
 morphing: Got packet size 1460. We must morph it to 1414. Splitting to 46
 and sending the first part...
 morphing: Got packet size 46. We must morph it to 81. Padding with 35 and
 sending.
 7: OVERHEAD ROUND SUMMARY: Sampling: 863 : Morphing: 85
 7: OVERHEAD ROUND SUMMARY: Morpher won (778)
 }}}

 So the question is: what do we do? Should we use morphing matrices in the
 S->C direction and direct sampling in the C->S direction? Or use direct
 sampling in both directions?
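
 To make the two candidate strategies explicit, here is a simplified sketch
 (the names are mine, not the morpher API): 'direct sampling' draws the
 next target size from the HTTPS packet-size distribution unconditionally,
 while 'traffic morphing' draws it from the column of the morphing matrix
 indexed by the current source packet size. A per-direction deployment
 would just plug one or the other into the morph_round() loop sketched
 above, per direction.
 {{{
 import random

 def sample_from(dist):
     """Draw a packet size from a {size: probability} distribution."""
     r = random.random()
     acc = 0.0
     for size, prob in sorted(dist.items()):
         acc += prob
         if r <= acc:
             return size
     return max(dist)  # guard against floating-point rounding

 def direct_sample(https_dist, source_len):
     # Ignores the source packet size entirely.
     return sample_from(https_dist)

 def morphing_sample(morphing_matrix, source_len):
     # Conditions on the current source packet size.
     return sample_from(morphing_matrix[source_len])
 }}}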

 ----

 b) The other question is: even if we pick the minimum-overhead method for
 each direction, are we happy with this kind of overhead? Looking at the
 plot of 500 packets (attachment:500_cs.png:ticket:4680) in the C->S case,
 we get approx. 70k bytes of overhead in 100 packets [1]. In the S->C case
 (attachment:500_sc.png:ticket:4680) we see approx. 100k bytes of overhead
 using direct sampling, and 70k bytes of overhead using morphing matrices,
 in 100 packets.
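
 As a rough back-of-the-envelope check of what those numbers mean per
 packet (my own arithmetic, using the approximate figures above):
 {{{
 >>> 70000 / 100.0    # C->S, overhead per original packet
 700.0
 >>> 100000 / 100.0   # S->C with direct sampling, per original packet
 1000.0
 >>> 70000 / 100.0    # S->C with morphing matrices, per original packet
 700.0
 }}}
 If the original packets average somewhere near the 1460-byte MSS, that is
 on the order of 50% padding overhead or more; for smaller packets the
 relative overhead is even worse.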

 Are we happy with this?

 Some things to note:
 a) The morphing matrices might be wrong. I'm the only one who has reviewed
 morpher or the matrices.
 b) The probability distributions might be biased. We know that they were
 captured in Chicago, but there may be lots of non-Tor traffic mixed in
 (the capture was filtered by TCP port only). Also see [0].
 c) The tests/graphs might be wrong. I wrote the tests, and I sometimes
 introduce bugs into my own code.

 So apart from the two questions, we should try to resolve these last three
 points as well.

 ----

 [0]: Hmm, why is this? Why do Tor relays send small packets to clients,
 when the opposite doesn't happen nearly as much? Do Tor relays use
 variable-sized cells more often than clients do (note that the v3 link
 handshake didn't exist when these packets were captured)? Or is it just a
 coincidence that happens when big TLS records get split, and it simply
 happened more often on the relay side during the packet capture?

 [1]: 100 actual packets; packets split during morphing don't count.

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/5023#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

