clayne at anodized.com
Sun Feb 18 10:34:55 UTC 2007
On Sat, Feb 17, 2007 at 04:01:32PM -0500, Nick Mathewson wrote:
> Hi, Chris! This is pretty neat stuff! If you can do more of this, it
> could help the development team know how to improve speed.
Absolutely. One thing to remember with callgrind in general is that
while IRs may show as high, it's really only an indication of a high
instruction fetch count and not necessarily one of high time usage.
In some cases they may correlate - but what one is really looking for
is too much dependance/weight on routines you wouldn't expect, and
more importantly cache misses. The latter is required to use
--simulate-cache=yes with valgrind. I've recently done some grinds with
this, but still going through things. I'll take some shots and put
up the data file.
> (Sorry about the delay in answering; compiling kcachegrind took me way
> longer than it should have.)
Did KDE go along with that? :P
> A few questions.
> 1. What version of Tor is this? Performance data on 0.1.2.7-alpha
> or on svn trunk would help a lot more than data for 0.1.1.x,
> which I think this is. (I think this is the 0.1.1.x series
> because all the compression seems to be happening in
> tor_gzip_compress, whereas 0.1.2.x does compression
> incrementally in tor_zlib_process.) There's already a lot of
> performance improvements (I think) in 0.1.2.7-alpha, but there
> might be possible regressions too, and I'd like to catch them
> before we release... whereas it is not likely that we'll do
> anything besides security and stability to 0.1.1.x, since it's
> supposed to be a stable series.
It's 0.1.1.26. The only internal change I've made to is it pushing up
the clock jump limit to 1000 seconds vs 100, as callgrind simulation
runs much slower and it was causing issues with tor assuming the clock
> 2. How is this server configured? A complete torrc would help.
ContactInfo clayne <clayne AT anodized dot com>
Log notice file /var/log/tor/notices.log
ExitPolicy reject *:* # middleman only -- no exits allowed
#BandwidthRate 32768 bytes
##BandwidthBurst 32768 bytes
BandwidthRate 128000 bytes
BandwidthBurst 256000 bytes
#BandwidthRate 256000 bytes
#BandwidthBurst 512000 bytes
> 3. To what extent does -O3 help over -O2? Most users seem to
> compile with -O2, so we should probably change our flags if the
> difference is nontrivial.
I've found O3 to always benefit over O2. Primarily for:
Integrate all simple functions into their callers. The compiler
heuristically decides which functions are simple enough to be
worth integrating in this way.
If all calls to a given function are integrated, and the function
is declared "static", then the function is normally not output as
assembler code in its own right.
Enabled at level -O3.
> 4. Supposedly, KCachegrind can also visualize oprofile output. If
> this is true, and you could get it working, it might give more
> accurate information as to actual timing patterns, with fewer
> Heisenberg effects. (Even raw oprofile output
> would help, actually.)
I am just not looking into OProfile usage of things. Will have more on this
> Now, some notes on the actual data. Again, I'm guessing this is for
> Tor 0.1.1.x, so some of the results could be quite different for the
> development series, especially if we fixed some stuff (which I think
> we did) and especially if we introduced some stupid stuff (which
> happens more than I'd like).
> * It looks like most of our time is being spent, as an OR and
> directory server, in compression, AES, and RSA. To improve
> speed, our options are basically "make it faster" or "do it
> less" for each of these.
Also remember the comment about IR vs time. It may be more instructions
but not necessarily more time. How much cache burning is done there is
important. The simulation done via either cachegrind or callgrind with
simulate-cache could never predict this perfectly, it's always an
approximation - otherwise they would be implementing a cpu entirely
in software and it would take ages to run. But these approximations
can help find outliers.
> * Making AES faster would be pretty neat; the right way to go
> about it is probably to look hard at how OpenSSL is doing it,
> and see whether it can't be improved. Then again, the OpenSSL
> team is pretty clever, and it's not likely that there is a lot
> of low-hanging fruit to exploit here.
I'd have to agree on the clever factor. They've probably profiled their
own code thousands of times by now.
> So it looks like verifying routers, decrypting onionskins, and
> doing TLS handshakes are the big offenders for RSA. We've
> already cut down onionskin decryption as much as we can except
> by having clients build circuits less often. To cut down on
> routerdesc verification, we need to have routers upload their
> descriptors and have authorities replace descriptors less often,
> and there's already a lot of work in that direction, but I don't
> know if I've seen any numbers recently. We could cut down on
> TLS handshakes by using sessions, but that could hurt forward
> secrecy badly if we did it in a naive way. (We could be smarter
> and use sessions with a very short expiration window, but it's
> not clear whether that would actually help: somebody would need
> to find out how frequent TLS disconnect/reconnects are in
> comparison to ).
It is a hard balancing act to mix light-footing with security. :( That's
a major issue. In order to achieve a suitable paradigm one has to be
> * Making RSA faster could also be fun for somebody. The core
> multiplication functions in openssl (bn_mul_add_words and
> bn_sq_comba8) are already in assembly, but it's conceivable that
> somebody could squeeze a little more out of them, especially on
> newer platforms. (Again, though, this is an area that smart
> people have already spent a lot of time in.)
Yea. Unless we have someone willing to do AMD-specific optimizations or
perhaps make use of MMX, SIMD, (I Believe they're already using sse2
tricks) I don't think there's much to be gained here. Usually the asm
implementors seem to be very vigilant about making hardcore optimizations
> * Finally, compression. Zlib is pretty tunable in how it makes
> the CPU/compression tradeoff, so it wouldn't be so hard to
> fine-tune the compression algorithm more thoroughly. Every
> admin I've asked, though, has said that they'd rather spend CPU
> to save bandwidth than vice versa. Another way to do less
> compression would be to make directory objects smaller and have
> them get fetched less often: there are some design proposals to
> do that in the next series, and I hope that people help beat
> them into some semblance of workability.
In all of my experience with dealing with the sharing of network data,
updates, packetizing, etc. is that having a smaller footprint overall
tends to win out. This also tends to go with general optimization in
the cpu/cache case as well.
Other things I can think of to throw out there are deltas, different
compression algorithms (bz2, lzo) for different usages, and just plain
More information about the tor-dev