Hello everyone,
I was doing some profiling on my two relays running on FreeBSD 13.1 and noticed that they were spending a lot of time in clock_gettime() which prompted me to have a look at the implementation.
Time implementation
===================
The time implementation is abstracted in src/lib/time/compat_time.c where different mechanisms are used for different operating systems. On Linux, CLOCK_MONOTONIC_COARSE is a clock that gives worse precision than CLOCK_MONOTONIC but is faster. The abstraction layer checks for its presence and provides the faster, less precise time where applicable.
On FreeBSD, there is also a fast monotonic time source available called CLOCK_MONOTONIC_FAST. In the header file src/lib/time/compat_time.h, a comment references this clock, but it is not used. I thought it might be worth a shot to see what difference enabling CLOCK_MONOTONIC_FAST on FreeBSD would make. On the VM where I run my two FreeBSD relays, the difference was stunning.
I made a quick patch that simply replaces CLOCK_MONOTONIC_COARSE with CLOCK_MONOTONIC_FAST (see patches attached), then compiled and tested it. I traced system calls to make sure the correct call was being used, which it was.
Results
=======
This led to a reduction in CPU usage of about 50 % for the patched relay compared to the unpatched relay. I was a bit shocked, so I wrote a small benchmark program and ran it on my VM, giving the following results:
CLOCK_MONOTONIC:      4.776675 s
CLOCK_MONOTONIC_FAST: 0.260002 s
This shows that on my VM, CLOCK_MONOTONIC_FAST is about 20 times faster than CLOCK_MONOTONIC.
I have tested on a few different systems and I think that the performance increase of CLOCK_MONOTONIC_FAST is thanks to commit 60b0ad10dd0fc7ff6892ecc7ba3458482fcc064c - "vdso: lower precision of vdso implementation of CLOCK_MONOTONIC_FAST and CLOCK_UPTIME_FAST" that was cherry-picked to 13.1.
Try it yourself and report your results
=======================================
If you want to benchmark your server to see whether switching clock could benefit you, you can compile and run my attached test program by doing
user>clang bench.c -o bench
user>./bench
In case the program terminates too quickly or slowly for your liking, adjust
const unsigned long iterations = 1000000;
up or down to change the execution time.
My supplied patches appear to work fine on my system, but aren't really upstream appropriate since a solution that works for both FreeBSD and Linux is needed. If you want to test them and you're building Tor from the ports tree, drop them in /usr/ports/security/tor/files and build and install.
I'm very interested in seeing some performance data from other people to decide whether it is worth either pestering some Tor devs to have a look at this or putting in some effort myself to write an upstreamable patch.
Thank you for reading! Cordially, Andreas Kempe
Excerpts from Andreas Kempe's message of June 21, 2022 11:50 am:
According to https://www.freebsd.org/cgi/man.cgi?query=clock_gettime, FreeBSD 13.1 has CLOCK_MONOTONIC_COARSE, which it says is an alias of CLOCK_MONOTONIC_FAST for compatibility with other systems. I suppose Tor could add

    #if !defined(CLOCK_MONOTONIC_COARSE) && defined(CLOCK_MONOTONIC_FAST)
    #define CLOCK_MONOTONIC_COARSE CLOCK_MONOTONIC_FAST
    #endif

but I'm not sure how useful that would be. OpenBSD and NetBSD don't seem to define either. Perhaps something like that would be appropriate for a FreeBSD ports patch.
Cheers, Alex.
On Tue, Jun 21, 2022 at 12:31:08PM -0400, Alex Xu (Hello71) via tor-relays wrote:
> According to https://www.freebsd.org/cgi/man.cgi?query=clock_gettime, FreeBSD 13.1 has CLOCK_MONOTONIC_COARSE, which it says is an alias of CLOCK_MONOTONIC_FAST for compatibility with other systems.
Good catch! I happened to read the man page for clock_gettime() on a FreeBSD 13.0 system (which I was convinced was a 13.1 system), but checked the header file on a 13.1 system, where I couldn't find CLOCK_MONOTONIC_COARSE. A grep through /usr/include shows it is actually hidden in another include.
With this being the case, this solves itself for FreeBSD 13.1. The system I was patching Tor on was a 13.0 system; I was convinced I had upgraded my VMs and never actually checked the version. 13.0 does not have the optimisation commit I dug out, but FAST was still 20x faster. I don't know if this is 13.0 specific, but since 13.0 is EoL soon, it might not matter that much.
On the other systems I benchmarked, 12.3 did not show any noticeable difference between the two clocks. I could only see it for 13.1, but since they do not have identical hardware, I don't know if that could come into play somehow.
> I suppose Tor could add
>
>     #if !defined(CLOCK_MONOTONIC_COARSE) && defined(CLOCK_MONOTONIC_FAST)
>     #define CLOCK_MONOTONIC_COARSE CLOCK_MONOTONIC_FAST
>     #endif
>
> but I'm not sure how useful that would be. OpenBSD and NetBSD don't seem to define either. Perhaps something like that would be appropriate for a FreeBSD ports patch.
I was contemplating a solution similar to this one, but thought redefining a define was ugly, so I used sed for my PoC to get a proper overview of where the actual changes ended up in the code.
I unfortunately don't have any other BSD flavours running where I could bench performance. If users of other BSD flavours have time to run the benchmark, it would be interesting to see the results for sure.
Cordially, Andreas Kempe
On Tue, Jun 21, 2022 at 07:05:35PM +0200, Andreas Kempe wrote:
For completeness' sake, I upgraded my VM to 13.1 and ran my benchmark again. The slowdown of CLOCK_MONOTONIC compared to CLOCK_MONOTONIC_FAST is now only a factor of about 3.
Compiling Tor unpatched now also works right out of the box and I'm not seeing a storm of system calls, leading me to wonder whether this was some weird vDSO issue.
Cordially, Andreas Kempe
Andreas Kempe <kempe@lysator.liu.se> wrote on 2022-06-21 at 19:56:45:
> For completeness' sake, I upgraded my VM to 13.1 and ran my benchmark again. The slowdown of CLOCK_MONOTONIC compared to CLOCK_MONOTONIC_FAST is now only a factor of about 3.
Thanks for looking into Tor performance on FreeBSD.
I'm seeing similar results on a physical ElectroBSD system based on FreeBSD 13.1.
Some munin graphs and dmesg are available at: https://www.fabiankeil.de/blog-surrogat/2022/06/22/clock_gettime-patch-fuer-tor-auf-electrobsd-getestet.html
Fabian
tor-relays@lists.torproject.org