tor callgrinds

Watson Ladd watsonbladd at
Sat Feb 17 21:41:25 UTC 2007

Nick Mathewson wrote:
> On Fri, Feb 16, 2007 at 05:35:50PM -0800, Christopher Layne wrote:
>> On Fri, Feb 16, 2007 at 02:00:00PM -0800, Christopher Layne wrote:
>>> Thought you guys might find this interesting. I did a couple of callgrind
>>> runs on 2 different tor builds, 1 using -Os and the other using -O3. The
>> So did a bit more research on spec'ing which cost models are default in
>> callgrind and now have it logging jumps, asm instructions, and l1/l2/dram
>> performance counters in the simulator.  If anyone is interested in the
>> machine specifically, it's a 2.1 GHz Celeron-D (Prescott) running under
>> Linux 2.6.20. I've rebuilt openssl, libz, and libevent with cranked up
>> optimization/debug on, so more interesting things to look at.
> Hi, Chris!  This is pretty neat stuff!  If you can do more of this, it
> could help the development team know how to improve speed.
> (Sorry about the delay in answering; compiling kcachegrind took me way
> longer than it should have.)
> A few questions.
>     1. What version of Tor is this?  Performance data on
>        or on svn trunk would help a lot more than data for 0.1.1.x,
>        which I think this is. (I think this is the 0.1.1.x series
>        because all the compression seems to be happening in
>        tor_gzip_compress, whereas 0.1.2.x does compression
>        incrementally in tor_zlib_process.)  There are already a lot of
>        performance improvements (I think) in, but there
>        might be possible regressions too, and I'd like to catch them
>        before we release... whereas it is not likely that we'll do
>        anything besides security and stability to 0.1.1.x, since it's
>        supposed to be a stable series.
>     2. How is this server configured?  A complete torrc would help.
>     3. To what extent does -O3 help over -O2?  Most users seem to
>        compile with -O2, so we should probably change our flags if the
>        difference is nontrivial.
>     4. Supposedly, KCachegrind can also visualize oprofile output.  If
>        this is true, and you could get it working, it might give more
>        accurate information as to actual timing patterns, with fewer
>        Heisenberg effects.  (Even raw oprofile output
>        would help, actually.)
> Now, some notes on the actual data.  Again, I'm guessing this is for
> Tor 0.1.1.x, so some of the results could be quite different for the
> development series, especially if we fixed some stuff (which I think
> we did) and especially if we introduced some stupid stuff (which
> happens more than I'd like).
>     * It looks like most of our time is being spent, as an OR and
>       directory server, in compression, AES, and RSA.  To improve
>       speed, our options are basically "make it faster" or "do it
>       less" for each of these.
>     * AES isn't going to get used much less: A relay server still
>       needs to AES-ctr-crypt each cell it gets three times: once for
>       TLS for link secrecy on the inbound link, once with a circuit
>       key for long-range secrecy, and once for TLS for link security
>       on the outbound link.  This explains the pretty even breakdown
>       between rijndaelEncrypt, _X86_AES_decrypt, and _X86_AES_encrypt
>       in the results.  (If you're not following me, read the design
>       paper, or just trust me. ;) )
>       [We could _maybe_ save the middle
>       encryption in some cases by a trick similar to what we use for
>       CREATE_FAST cells, but it would only get rid of 1/8 of the AES
>       done by servers in toto, thus reducing the average server's A]
>     * Making AES faster would be pretty neat; the right way to go
>       about it is probably to look hard at how OpenSSL is doing it,
>       and see whether it can't be improved.  Then again, the OpenSSL
>       team is pretty clever, and it's not likely that there is a lot
>       of low-hanging fruit to exploit here.
>     * So here's how RSA is getting used on my server right now:
>           0 directory objects signed,
>        1643 directory objects verified,
>           8 routerdescs signed,
>       20554 routerdescs verified,
>          38 onionskins encrypted,
>       37631 onionskins decrypted,
>       35148 client-side TLS handshakes,
>       29866 server-side TLS handshakes,
>           0 rendezvous client operations,
>          70 rendezvous middle operations,
>           0 rendezvous server operations.
>       So it looks like verifying routers, decrypting onionskins, and
>       doing TLS handshakes are the big offenders for RSA.  We've
>       already cut down onionskin decryption as much as we can except
>       by having clients build circuits less often.  To cut down on
>       routerdesc verification, we need to have routers upload their
>       descriptors and have authorities replace descriptors less often,
>       and there's already a lot of work in that direction, but I don't
>       know if I've seen any numbers recently.  We could cut down on
>       TLS handshakes by using sessions, but that could hurt forward
>       secrecy badly if we did it in a naive way.  (We could be smarter
>       and use sessions with a very short expiration window, but it's
>       not clear whether that would actually help: somebody would need
>       to find out how frequent TLS disconnect/reconnects are in
>       comparison to  ).
We could also eliminate the indirection in the TLS handshakes. Currently
the ORs make a temporary cert, which they sign with a long-term one.
Verifying this is a pain, but ORs don't notice. We could also use a
more efficient algorithm than we do now for the authentication of the
client to the OP.
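
To put the RSA counts quoted above in proportion, here is a small sketch
that ranks them by share of the total. The numbers are copied from Nick's
list; the dictionary and variable names are mine. Note that this counts
operations, not CPU time: private-key operations (signing, decrypting)
cost far more than public-key ones (verifying, encrypting), so the
ranking by actual cost may differ.

```python
# RSA operation counts as quoted from Nick's server (labels are his).
rsa_ops = {
    "directory objects signed": 0,
    "directory objects verified": 1643,
    "routerdescs signed": 8,
    "routerdescs verified": 20554,
    "onionskins encrypted": 38,
    "onionskins decrypted": 37631,
    "client-side TLS handshakes": 35148,
    "server-side TLS handshakes": 29866,
    "rendezvous client operations": 0,
    "rendezvous middle operations": 70,
    "rendezvous server operations": 0,
}

total = sum(rsa_ops.values())

# Sort by count, largest first, and print each nonzero entry with its share.
ranked = sorted(rsa_ops.items(), key=lambda item: item[1], reverse=True)
for name, count in ranked:
    if count:
        print(f"{count:6d}  {100 * count / total:5.1f}%  {name}")
```

By count alone, onionskin decryption, the two kinds of TLS handshake, and
routerdesc verification make up nearly all of the operations, which
matches the "big offenders" named above.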
>     * Making RSA faster could also be fun for somebody.  The core
>       multiplication functions in openssl (bn_mul_add_words and
>       bn_sq_comba8) are already in assembly, but it's conceivable that
>       somebody could squeeze a little more out of them, especially on
>       newer platforms.  (Again, though, this is an area that smart
>       people have already spent a lot of time in.)
>     * Finally, compression.  Zlib is pretty tunable in how it makes
>       the CPU/compression tradeoff, so it wouldn't be so hard to
>       fine-tune the compression algorithm more thoroughly.  Every
>       admin I've asked, though, has said that they'd rather spend CPU
>       to save bandwidth than vice versa.  Another way to do less
>       compression would be to make directory objects smaller and have
>       them get fetched less often: there are some design proposals to
>       do that in the next series, and I hope that people help beat
>       them into some semblance of workability.
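
For anyone who wants to get a feel for the CPU/compression tradeoff
mentioned above, here is a minimal sketch using Python's zlib module.
Tor itself is C and calls zlib directly, so this only illustrates the
level parameter, not Tor's code. The sample data is made up (it is not a
real directory format), and the timings will vary by machine.

```python
import time
import zlib

# Made-up, directory-like text: repetitive lines with small variations.
sample = "".join(
    f"router node{i} 10.0.{i // 256}.{i % 256} 9001 0 9030\n"
    for i in range(2000)
).encode("ascii")

sizes = {}
for level in (1, 6, 9):  # 1 = fastest, 6 = zlib's default, 9 = smallest output
    start = time.perf_counter()
    packed = zlib.compress(sample, level)
    elapsed = time.perf_counter() - start

    # Sanity check: the data must survive a round trip.
    assert zlib.decompress(packed) == sample

    sizes[level] = len(packed)
    print(f"level {level}: {len(sample)} -> {len(packed)} bytes "
          f"in {elapsed * 1000:.2f} ms")
```

Running this against real directory objects rather than synthetic lines
would give a better idea of how much bandwidth each extra bit of CPU
actually buys.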
> Again, many thanks for this information; I hope we'll see more like it
> in the future!
> peace,

