Hello everyone,
I was doing some profiling on my two relays running on FreeBSD 13.1
and noticed that they were spending a lot of time in clock_gettime()
which prompted me to have a look at the implementation.
Time implementation
===================
The time implementation is abstracted in src/lib/time/compat_time.c
where different mechanisms are used for different operating systems.
On Linux, CLOCK_MONOTONIC_COARSE is a clock that gives worse precision
than CLOCK_MONOTONIC but is faster. The abstraction layer checks for
its presence and provides faster, less precise time where
applicable.
On FreeBSD, there is also a fast monotonic time source available
called CLOCK_MONOTONIC_FAST. In the header file
src/lib/time/compat_time.h, a comment references this clock, but it is
not used. I thought it might be worth seeing what difference it would
make to enable CLOCK_MONOTONIC_FAST on FreeBSD. On the VM where I run
my two FreeBSD relays, the difference was stunning.
I made a quick patch that simply replaces CLOCK_MONOTONIC_COARSE with
CLOCK_MONOTONIC_FAST (see the attached patches), then compiled and
tested it. I traced the system calls to make sure the correct call was
being used, which it was.
Results
=======
This reduced the CPU usage of the patched relay by about 50 % compared
to the unpatched relay. I was a bit shocked, so I wrote a small
benchmark program and ran it on my VM, giving the following results:
CLOCK_MONOTONIC: 4.776675 s
CLOCK_MONOTONIC_FAST: 0.260002 s
This shows that on my VM, CLOCK_MONOTONIC_FAST is about 18 times
faster than CLOCK_MONOTONIC.
I have tested on a few different systems and I think that the
performance increase of CLOCK_MONOTONIC_FAST is thanks to commit
60b0ad10dd0fc7ff6892ecc7ba3458482fcc064c - "vdso: lower precision of
vdso implementation of CLOCK_MONOTONIC_FAST and CLOCK_UPTIME_FAST"
that was cherry-picked to 13.1.
Try it yourself and report your results
=======================================
If you want to benchmark your server to see whether switching clocks
could benefit you, you can compile and run my attached test program by
running
user>clang -o bench bench.c
user>./bench
In case the program terminates too quickly or slowly for your liking, adjust
const unsigned long iterations = 1000000;
up or down to change the execution time.
My supplied patches appear to work fine on my system, but aren't
really appropriate for upstream since a solution that works for both
FreeBSD and Linux is needed. If you want to test them and you're
building Tor from the ports tree, drop them in
/usr/ports/security/tor/files and build and install.
I'm very interested in seeing some performance data from other people
to decide whether it is worth either pestering some Tor devs to have
a look at this or putting in some effort myself to write an
upstreamable patch.
Thank you for reading!
Cordially,
Andreas Kempe