Hello everyone,
I was doing some profiling on my two relays running on FreeBSD 13.1 and noticed that they were spending a lot of time in clock_gettime(), which prompted me to have a look at the implementation.
Time implementation
===================
The time implementation is abstracted in src/lib/time/compat_time.c, where different mechanisms are used for different operating systems. On Linux, CLOCK_MONOTONIC_COARSE is a clock that gives worse precision than CLOCK_MONOTONIC but is faster. The abstraction layer checks for its presence and provides the faster, less precise time where applicable.
On FreeBSD, there is also a fast monotonic time source available called CLOCK_MONOTONIC_FAST. A comment in the header file src/lib/time/compat_time.h references this clock, but it is not used. I thought it might be worth seeing what difference it would make to enable CLOCK_MONOTONIC_FAST on FreeBSD. On the VM where I run my two FreeBSD relays, the difference was stunning.
I made a quick patch that simply replaces CLOCK_MONOTONIC_COARSE with CLOCK_MONOTONIC_FAST (see the attached patches), then compiled and tested it. I traced the system calls to make sure the correct call was being used, which it was.
Results
=======
This led to a reduction in CPU usage of about 50 % on the patched relay compared to the unpatched relay. I was a bit shocked, so I wrote a small benchmark program and ran it on my VM, with the following results:
CLOCK_MONOTONIC:      4.776675 s
CLOCK_MONOTONIC_FAST: 0.260002 s
This shows that, on my VM, CLOCK_MONOTONIC_FAST is roughly 18 times faster than CLOCK_MONOTONIC.
I have tested on a few different systems and I think that the performance increase of CLOCK_MONOTONIC_FAST is thanks to commit 60b0ad10dd0fc7ff6892ecc7ba3458482fcc064c - "vdso: lower precision of vdso implementation of CLOCK_MONOTONIC_FAST and CLOCK_UPTIME_FAST" that was cherry-picked to 13.1.
Try it yourself and report your results
=======================================
If you want to benchmark your server to see whether switching clocks could benefit you, you can compile and run my attached test program as follows:
user>clang bench.c -o bench
user>./bench
In case the program terminates too quickly or slowly for your liking, adjust
const unsigned long iterations = 1000000;
up or down to change the execution time.
My supplied patches appear to work fine on my system, but they are not really suitable for upstream, since a solution that works for both FreeBSD and Linux is needed. If you want to test them and you're building Tor from the ports tree, drop them in /usr/ports/security/tor/files, then build and install.
I'm very interested in seeing some performance data from other people, to judge whether it is worth either pestering some Tor devs to have a look at this or putting in some effort myself to write an upstreamable patch.
Thank you for reading!

Cordially,
Andreas Kempe