Greetings tor-dev!
This email opens a discussion on adding tracing to little-t tor. Tracing can
be a fairly abstract notion, so I'll start with a quick overview of what it
is, what we can achieve with it, and its use cases within tor. Then I'll go
over one last point, which is safety.
This email doesn't go into the technical details of userspace tracing, nor
into how it would be added to tor. That is for another discussion.
1. Overview
Long story short, you can see tracing as a specific type of logging: it
records information about the application at runtime using tracepoints
(similar to logging statements) so it can be used later. It differs from
logging in two main ways: performance and API stability.
Usually, tracing implies high performance: it adds very little overhead to
the application, so that it disrupts the application's normal behavior as
little as possible. This is extremely useful when you want to catch race
conditions or performance bottlenecks.
Userspace tracers usually have an "in-process library" which, in short,
records raw data from the application and moves it to an outside buffer. That
buffer is then emptied, either to disk or over the network, by the outside
component of the tracer, and the data can be analyzed after collection.
So all a tracer does, when a tracepoint is hit within the application, is copy
some data into a buffer and yield back to the application.
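
As a rough illustration of that "copy and yield back" behavior, here is a
minimal single-threaded sketch in C. Every name in it (trace_event_t,
trace_record, trace_drain, the buffer size) is hypothetical and made up for
this example; it is not taken from tor or from any existing tracer, and real
tracers are considerably more elaborate (concurrency, memory shared with an
outside process, and so on).

```c
#define _POSIX_C_SOURCE 200809L  /* For clock_gettime(). */
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical fixed-size event record: raw values only, no formatting. */
typedef struct trace_event_t {
  uint64_t timestamp_ns;  /* Monotonic time at which the tracepoint fired. */
  uint32_t event_id;      /* Which tracepoint was hit. */
  uint32_t payload;       /* One raw argument, e.g. some identifier. */
} trace_event_t;

#define TRACE_BUF_EVENTS 1024  /* Must be a power of two. */

static trace_event_t trace_buf[TRACE_BUF_EVENTS];
static uint64_t trace_head = 0;  /* Total number of events ever recorded. */
static uint64_t trace_tail = 0;  /* Index of the next event to drain. */

/* In-process side: read the clock, copy the raw data, return. */
static void
trace_record(uint32_t event_id, uint32_t payload)
{
  struct timespec ts;
  trace_event_t ev;

  clock_gettime(CLOCK_MONOTONIC, &ts);
  ev.timestamp_ns = (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
  ev.event_id = event_id;
  ev.payload = payload;

  /* Once the buffer is full, the oldest slot gets overwritten. */
  memcpy(&trace_buf[trace_head & (TRACE_BUF_EVENTS - 1)], &ev, sizeof(ev));
  trace_head++;
}

/* Stand-in for the tracer's outside component: copy out up to max events,
 * oldest first, and return how many were copied. */
static size_t
trace_drain(trace_event_t *out, size_t max)
{
  size_t n = 0;
  assert(out != NULL);

  /* Skip events that were overwritten before they could be read. */
  if (trace_head - trace_tail > TRACE_BUF_EVENTS)
    trace_tail = trace_head - TRACE_BUF_EVENTS;

  while (trace_tail < trace_head && n < max) {
    out[n++] = trace_buf[trace_tail & (TRACE_BUF_EVENTS - 1)];
    trace_tail++;
  }
  return n;
}
```

Note that trace_record() does no string formatting and no I/O; turning the
raw records into something readable is left to whatever drains the buffer.
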
The other part is API stability. Very often, logs (let's say at DEBUG level)
don't have strict stability requirements between released versions. But
tracing events (tracepoints) are exposed to the outside for tracers to hook
onto, and for people to run analysis tools on the recorded data. Thus,
stability is usually strongly encouraged. In other words, what a tracepoint
exposes, once released as stable, should really not change much over time.
With a proper abstraction in the application, we can offer stable tracepoints
that a variety of tracers can hook onto at runtime. It is all about providing
an interface to the outside world.
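
To make the idea of such an abstraction a bit more concrete, here is one
possible shape for it, sketched in C. This is only an assumption about what
an interface of this kind could look like, not a proposal for tor's actual
design: the event names, tracepoint(), trace_set_hook() and the counting
tracer are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stable list of tracepoints exposed by the application. */
typedef enum {
  EV_CELL_QUEUED,
  EV_CELL_FLUSHED,
  EV_COUNT  /* Number of tracepoints; not an event itself. */
} tracepoint_id_t;

/* Signature every tracer backend has to implement. */
typedef void (*trace_hook_fn)(tracepoint_id_t id, uint64_t arg);

/* Currently attached tracer, or NULL if none is attached. */
static trace_hook_fn trace_hook = NULL;

/* Attach a tracer at runtime (or detach it by passing NULL). */
static void
trace_set_hook(trace_hook_fn fn)
{
  trace_hook = fn;
}

/* What the application code calls. It knows nothing about the tracer. */
static inline void
tracepoint(tracepoint_id_t id, uint64_t arg)
{
  assert(id < EV_COUNT);
  if (trace_hook)
    trace_hook(id, arg);
}

/* Example backend: only counts hits and remembers the last argument. */
static uint64_t hit_count[EV_COUNT];
static uint64_t last_arg = 0;

static void
counting_tracer(tracepoint_id_t id, uint64_t arg)
{
  hit_count[id]++;
  last_arg = arg;
}
```

The application only ever calls tracepoint(EV_CELL_QUEUED, ...), so the set
of events and their arguments is the part that would need to stay stable,
while backends such as counting_tracer() can be swapped freely.
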
2. Why Tracing in Tor
The tor software is a very complex beast. It has dozens of subsystems with
various interactions between them. One of tor's main jobs is to relay data as
fast as possible in order to keep latency low. This means that some code
paths are considered "fast path", implying that they must remain light and
fast. One example is the crypto code that is hit for every cell.
Tracing comes in extremely handy to hunt down race conditions, performance
issues, or even multithreading problems. On a fast relay, let's say 25MB/s,
if we wanted to record cell timing in order to hunt down such issues, it
simply can _not_ be done with logging at debug level: it slows tor down
considerably and also fills the disk in a matter of minutes.
Using the control port is not a good solution either, for two main reasons:
it requires string formatting at each event, and the control port is part of
the mainloop. So anything you ask to go through the control port adds
overhead to the overall behavior of tor, which is not good when you hunt down
races.
One concrete example where tracing was used in the past in tor is the rewrite
of the cell scheduler (KIST). In order to measure cell timings within tor so
that bottlenecks could be found, tracing had to be added so that millions of
events could be recorded within a few minutes on a fast relay in production.
It is in such high-pressure situations that tracing comes in handy. Tracing
was also used recently to find onion service v3 reachability issues. In order
to correlate connection, cell and circuit level problems with the higher
level HS subsystem, we were able to record events in all those subsystems,
match them using their precise timing (offered by tracing) and analyze the
results later on, after recording the data.
3. Safety Discussion
Onto the last part I wanted to raise. Since tracing allows recording very
low-level data from tor, there is an obvious safety question that must be
addressed.
Over the years, I've talked about tracing with many people in Tor and the
consensus was always that it should never be enabled in production. As in, the
packages shipped by Tor or by distros should _never_ build the tracepoints.
In other words, it should be considered a development option only. And not
merely an option: the tracepoints would be compiled _out_ in production, and
one has to explicitly build them into tor.
For example (nothing final, just to show the idea):
$ ./configure --enable-tracing
I personally think that should be enough: the presence of the code upstream
won't stop people from using it (for bad purposes or not), but we can prevent
it from being in any legitimate Tor package out there. See it a bit like the
obsolete Tor2web option, which was never enabled in any package published by
the Tor Project or distros; one had to explicitly enable it at configure time.
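
To illustrate what "compiled out" could mean at the code level, here is a
minimal C sketch. It assumes that the configure switch ends up defining a
preprocessor symbol; the symbol name ENABLE_TRACING, the TRACEPOINT() macro
and the functions below are hypothetical and only serve the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts recorded events so the effect of the build switch is observable. */
static uint64_t events_recorded = 0;

#ifdef ENABLE_TRACING
/* Development build: the tracepoint calls into the (stub) tracer. */
static void
trace_record(uint32_t event_id, uint64_t arg)
{
  (void)event_id;
  (void)arg;
  events_recorded++;
}
#define TRACEPOINT(event_id, arg) trace_record((event_id), (arg))
#else
/* Default build: the tracepoint expands to nothing, so no tracing code and
 * no argument evaluation end up in the binary. */
#define TRACEPOINT(event_id, arg) ((void)0)
#endif

/* Stand-in for some fast-path function; it just sums the bytes it gets. */
static uint32_t
process_cell(const uint8_t *cell, size_t len)
{
  uint32_t sum = 0;
  assert(cell != NULL);

  TRACEPOINT(1, (uint64_t)len);

  for (size_t i = 0; i < len; i++)
    sum += cell[i];
  return sum;
}
```

Built without -DENABLE_TRACING, process_cell() contains no trace of the
tracepoint at all, which is the property wanted for any shipped package.
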
The ControlPort is allowed in production and if a malicious actor gets access
to it, then it is game over. I see tracing the same way, but at least we can
control its availability as a feature, which we can't do for the ControlPort
as of today.
Any feedback is very welcome! Concerns, questions, thoughts.
Cheers!
David
--
AOrq46damX3clZogjR9FlXTru90GV9IT5Rq/J0EzVSA=