nickm at freehaven.net
Sat Feb 17 21:01:32 UTC 2007
On Fri, Feb 16, 2007 at 05:35:50PM -0800, Christopher Layne wrote:
> On Fri, Feb 16, 2007 at 02:00:00PM -0800, Christopher Layne wrote:
> > Thought you guys might find this interesting. I did a couple of callgrind
> > runs on 2 different tor builds, 1 using -Os and the other using -O3. The
> So did a bit more research on spec'ing which cost models are default in
> callgrind and now have it logging jumps, asm instructions, and l1/l2/dram
> performance counters in the simulator. If anyone is interested on the
> machine specifically it's a 2.1 ghz Celeron-D (Prescott) running under
> Linux 2.6.20. I've rebuilt openssl, libz, and libevent with cranked up
> optimization/debug on, so more interesting things to look at.
Hi, Chris! This is pretty neat stuff! If you can do more of this, it
could help the development team know how to improve speed.
(Sorry about the delay in answering; compiling kcachegrind took me way
longer than it should have.)
A few questions.
1. What version of Tor is this? Performance data on 0.1.2.7-alpha
or on svn trunk would help a lot more than data for 0.1.1.x,
which I think this is. (I think this is the 0.1.1.x series
because all the compression seems to be happening in
tor_gzip_compress, whereas 0.1.2.x does compression
incrementally in tor_zlib_process.) There are already a lot of
performance improvements (I think) in 0.1.2.7-alpha, but there
might be possible regressions too, and I'd like to catch them
before we release... whereas it is not likely that we'll do
anything besides security and stability fixes to 0.1.1.x, since
it's supposed to be a stable series.
2. How is this server configured? A complete torrc would help.
3. To what extent does -O3 help over -O2? Most users seem to
compile with -O2, so we should probably change our flags if the
difference is nontrivial.
4. Supposedly, KCachegrind can also visualize oprofile output. If
this is true, and you could get it working, it might give more
accurate information as to actual timing patterns, with fewer
Heisenberg effects. (Even raw oprofile output
would help, actually.)
Now, some notes on the actual data. Again, I'm guessing this is for
Tor 0.1.1.x, so some of the results could be quite different for the
development series, especially if we fixed some stuff (which I think
we did) and especially if we introduced some stupid stuff (which
happens more than I'd like).
* It looks like most of our time is being spent, as an OR and
directory server, in compression, AES, and RSA. To improve
speed, our options are basically "make it faster" or "do it
less" for each of these.
* AES isn't going to get used much less: A relay server still
needs to AES-ctr-crypt each cell it gets three times: once for
TLS for link secrecy on the inbound link, once with a circuit
key for long-range secrecy, and once for TLS for link security
on the outbound link. This explains the pretty even breakdown
between rijndaelEncrypt, _X86_AES_decrypt, and _X86_AES_encrypt
in the results. (If you're not following me, read the design
paper, or just trust me. ;) )
[We could _maybe_ save the middle
encryption in some cases by a trick similar to what we use for
CREATE_FAST cells, but it would only get rid of 1/8 of the AES
done by servers in toto, thus reducing the average server's A]
* Making AES faster would be pretty neat; the right way to go
about it is probably to look hard at how OpenSSL is doing it,
and see whether it can't be improved. Then again, the OpenSSL
team is pretty clever, and it's not likely that there is a lot
of low-hanging fruit to exploit here.
* So here's how RSA is getting used on my server right now:
0 directory objects signed,
1643 directory objects verified,
8 routerdescs signed,
20554 routerdescs verified,
38 onionskins encrypted,
37631 onionskins decrypted,
35148 client-side TLS handshakes,
29866 server-side TLS handshakes,
0 rendezvous client operations,
70 rendezvous middle operations,
0 rendezvous server operations.
So it looks like verifying routers, decrypting onionskins, and
doing TLS handshakes are the big offenders for RSA. We've
already cut down onionskin decryption as much as we can except
by having clients build circuits less often. To cut down on
routerdesc verification, we need to have routers upload their
descriptors and have authorities replace descriptors less often,
and there's already a lot of work in that direction, but I don't
know if I've seen any numbers recently. We could cut down on
TLS handshakes by using sessions, but that could hurt forward
secrecy badly if we did it in a naive way. (We could be smarter
and use sessions with a very short expiration window, but it's
not clear whether that would actually help: somebody would need
to find out how frequent TLS disconnect/reconnects are in
comparison to ).
* Making RSA faster could also be fun for somebody. The core
multiplication functions in openssl (bn_mul_add_words and
bn_sq_comba8) are already in assembly, but it's conceivable that
somebody could squeeze a little more out of them, especially on
newer platforms. (Again, though, this is an area that smart
people have already spent a lot of time in.)
* Finally, compression. Zlib is pretty tunable in how it makes
the CPU/compression tradeoff, so it wouldn't be so hard to
fine-tune the compression algorithm more thoroughly. Every
admin I've asked, though, has said that they'd rather spend CPU
to save bandwidth than vice versa. Another way to do less
compression would be to make directory objects smaller and have
them get fetched less often: there are some design proposals to
do that in the next series, and I hope that people help beat
them into some semblance of workability.
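To make the zlib CPU/compression tradeoff above a bit more concrete,
here is a minimal sketch in Python (not Tor's C code, and not taken from
the Tor source). The compare_levels helper and the sample data are
made up for illustration; the sample is only a rough stand-in for a
repetitive, text-like directory object, so treat any numbers it prints
as indicative at best.

```python
import time
import zlib


def compare_levels(payload, levels=(1, 6, 9)):
    """Compress payload at each zlib level.

    Returns a list of (level, compressed_size, seconds) tuples.
    Hypothetical helper for illustration only.
    """
    results = []
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(payload, level)
        elapsed = time.perf_counter() - start
        # Sanity check: every level must round-trip to the same bytes.
        assert zlib.decompress(compressed) == payload
        results.append((level, len(compressed), elapsed))
    return results


# Assumption: synthetic, repetitive text standing in for a directory object.
sample = b"".join(
    b"router example%d 10.0.%d.%d 9001 0 9030\n" % (i, i // 256, i % 256)
    for i in range(2000)
)

for level, size, seconds in compare_levels(sample):
    print("level %d: %d -> %d bytes in %.4f s"
          % (level, len(sample), size, seconds))
```

Running something like this against real directory objects, rather than
the synthetic sample, would be the way to see how much bandwidth each
extra bit of CPU actually buys.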
Again, many thanks for this information; I hope we'll see more like it
in the future!