[tor-dev] Connection, Channel and Scheduler - An Intense Trek

David Goulet dgoulet at ev0ke.net
Mon Oct 30 19:57:04 UTC 2017

Hello everyone!

DISCLAIMER: The following is enormous and tries to describe in some level of
detail the situation in tor with connection<->channel<->scheduler. After we
merged the KIST scheduler, we realized that many things weren't what they were
supposed to be or meant for. In the end, I'm asking questions so we can move
forward with development and fixing things.

Last thing before you start your journey into the depths of Tor: the 3
subsystems I'm going to talk about, and how they interact, are very
complicated, so it is quite possible that I have gotten things wrong or missed
some details. Please point them out so we can document better, be better
informed and make good decisions. I plan to document as much as I can from
this process for a new file in the torguts.git repository.

This is part of the work from ticket #23993.


== Part One - The Distant Past ==

Once upon a time, Tor was at version 0.2.3. It was before the "big change",
and it was using connections and circuits in a simplistic but clever way. I'll
briefly describe here how it worked and why we wanted to change it.

Roughly, once a cell comes in, it is put in the inbound buffer, called
"inbuf", of a connection_t and is processed. Then, if it needs to be relayed,
that is, sent along the circuit it came in on, it is put in that circuit's
queue. Each connection_t had a priority queue of "active circuits",
prioritized using the EWMA policy. The first circuit would be flushed of a
certain number of cells onto the connection_t outbound buffer, or "outbuf",
where the libevent main loop takes over to actually write the bytes from the
outbuf onto the network. Then the circuit pqueue is updated and we move on.
The following is a high level diagram of a relayed cell:

      +--------------+    +-----------+    +----------------+
      |connection_t A|    | circuit_t |    | connection_t  B|
      |  (inbuf)     |--->|  (pqueue) |--->|   (outbuf)     |---> Network
      +--------------+    +-----------+    +----------------+

That had at least one big problem: not considering the entire set of circuits
could let one connection (with some circuits on it) bloat the network link of
the relay. It would actually be worse than that if the relay had a bandwidth
limitation: you want to pick the connection that carries the highest priority
circuit and flush that.

All in all, tor needed to move to a logic where _all_ circuits are considered
using the EWMA policy, flushing the cell(s) of the picked circuit onto the
connection outbuf so they are prioritized on the network.

Then in 0.2.4, everything changed!

== Part Two - The Big Change ==

Two new abstractions were added to Tor 0.2.4 (#6465): channel and circuitmux
(short for circuit multiplexer or multiplexing; history is unclear ;). To this
day, this is what tor is using.

The channel_t object was intended to provide a "transport" abstraction layer
for relay to relay connections, which are represented by connection_t objects.
Each channel has a circuitmux_t object that encapsulates the queuing logic and
circuit selection policy of the channel. In other words, the circuits that go
through that connection are listed in that object so we can do cell
prioritization.
To summarize, this roughly looks like this now:

      +--------------+ 1    1 +-----------+ 1    1 +--------------+
      | connection_t |<------>| channel_t |<------>| circuitmux_t |
      +--------------+        +-----------+        +--------------+
                                                          ^ 1
                                                          |
                                                          v 0..n
                                                    +-----------+
                                                    | circuit_t |
                                                    +-----------+
Channels are supposed to work as a protocol abstraction on top of that
connection. The original intent was to make it easier to implement different
protocols, for instance helping researchers experiment with other transports
more easily. TLS was decoupled from connection_t and put in a subclass of
channel_t, and thus channel_tls_t was born. It implements the channel_t
interface and is in theory designed to handle the TLS layer of a relay to
relay connection (handshake, cells, etc). So far, tor only has that one
channel class.

Now that we had this layer and a way to multiplex circuits on a single
connection through the use of channel, we needed one last piece, a way to
prioritize cells on the network considering all circuits. Along came the "cell
scheduler" (scheduler.c) in Tor 0.2.6 and now called the Vanilla scheduler.

In a nutshell, the libevent main loop calls the scheduler at a constant rate
and the subsystem considers all channels against each other, comparing circuit
priorities using the EWMA policy (an interface was created to implement more
policies, but I won't get into that in this email).

So now, tor finally had a good infrastructure to prioritize cells on the
network by looking at all circuits at once. TCP connections would be
established between relays, the channel layer would be handling the TLS
business and the scheduler would be flushing cells on the network by
considering all circuits on all channels. It is a really nicely modularized
design and should improve tor's performance.

But, that's not entirely the reality of things right now.

== Part Three - The Truth ==

In 2014, a paper came out titled "Never Been KIST: Tor's Congestion Management
Blossoms with Kernel-Informed Socket Transport"
(http://www.robgjansen.com/publications/kist-sec2014.pdf). It describes an
improved way for tor to schedule data transmission properly by asking the
kernel for information about the state of the TCP buffers, so tor can flush
data on the network without bloating the local link.

Then, from 2016 up to now, Matthew Traudt implemented KIST in tor, which
required an important refactoring of the scheduler subsystem in order to make
it work with multiple scheduling types. It is today the Scheduler option in
torrc, and all 0.3.2 tor versions use KIST by default. Many things were
discovered during that piece of work by Matt. For instance, the very
important bug #20459, where we were badly comparing cmux policies and thus
ultimately prioritizing our circuits wrongly.

But fundamental issues in the design of channels and the scheduler were
found. I'll list some here; they ultimately lead me to Part Four of this
email about our next steps moving forward.

1. Channels implemented cell queues that weren't really queues.

If you look at the channel_t object, you'll notice "incoming_queue" and
"outgoing_queue", which basically allow a cell to transition from a
connection_t inbuf to the circuit queue and then to the connection_t outbuf.
It looks like this for a relayed cell:

    conn inbuf -> chan in_queue -> circ queue -> chan out_queue -> conn outbuf

The idea behind this is to quickly empty the connection inbuf into an
intermediary object (the channel), leaving the kernel inbound TCP queue as
much room as possible. In other words, offload data from the kernel buffer to
user space, since we need to run some processing on that data and there is no
need to leave it in the kernel buffer during that time.

Then, from the channel, a cell would be put on the circuit queue and in turn
be processed by the scheduler. The scheduler would select a circuit (EWMA
policy) and flush cells from its queue onto the channel outgoing queue, from
which they would be flushed to the connection outbuf, which libevent takes
care of writing to the socket (to the network).

One could argue here that in terms of design we have too many queues, but
that would only be true if the channel queues were actually acting like
queues. It turns out that the incoming and outgoing queues of a channel are
always empty (#23709), because the code queues a cell and then processes it
immediately, meaning it is put either on the circuit queue or the connection
outbuf.

Each cell is memcpy()ed onto a channel queue and then processed right away,
just to be copied again into the circuit queue or outbuf. This has a
performance cost over time, as a relay sees thousands of cells a second. It
also makes the channel subsystem much more complicated in terms of code,
which has to handle those queues at every step of the way in and out of the
channel subsystem. But the worst part is that the scheduler subsystem based
some decisions on the outgoing queue size, and when that is always 0, issues
arise, like #23676.

2. DESTROY cells handling

Within a circuitmux object, there is a "destroy cell queue" on which a
DESTROY cell is put for one of the circuits on the cmux. An important thing
for tor is that when it needs to send a DESTROY, it needs to _stop_ sending
any queued cells on that circuit, dump them and only send the DESTROY cell.

Many things are problematic currently. For starters, sending a DESTROY cell
bypasses the scheduler (#23712). Tor will immediately try to flush the cell
onto the network *if* the channel doesn't have queued *writes*, which means
if the connection outbuf is empty. If it isn't, then once it is sufficiently
drained, the channel will be scheduled in and the DESTROY cell finally sent.
I've observed this process taking from 0.5 seconds up to 10 seconds on a
normal relay if the outbuf has data in it.

Second, because of this concept of a separate queue for DESTROY cells, tor
alternates between that queue and the normal queue of a circuit. Remember
that the destroy queue is *not* per circuit but per circuitmux. This
alternation is driven by "cmux->last_cell_was_destroy", which is set to 1 if
the last cell the cmux handled was a DESTROY, or 0 if not.

This has a side effect. Before sending a DESTROY cell, the circuit queue is
cleared out without changing the circuit's EWMA priority, in order to make
sure we don't send cells from that point on. However, because of this
alternation, the EWMA policy will pick the right channel based on the overall
state of the circuits on it (see scheduler_compare_channels()) but will not
use that policy to prioritize the circuit the DESTROY cell is on.
Furthermore, the destroy cell queue is a FIFO, so if 10 DESTROY cells were
queued but the 10th one is actually for the circuit with the highest
priority, we won't process it until the 9 that came in before it are flushed.

3. Connection and Channel decoupling is kind of a lie

The connection_t object is bound to be a TCP connection, except for DNSPort,
which uses UDP. That object also has a "TLS connection state" (tor_tls_t) and
a channel_tls_t object (chan).

First, it turns out that the connection subsystem establishes the TLS
session, not the channel tls layer. When opening a circuit, it looks like
this:
   |- channel_connect()
     |- channel_tls_connect()
       |- connection_or_connect()
         |- connection_tls_start_handshake()

Second, when processing a cell from the connection inbuf in
connection_or_process_cells_from_inbuf(), we directly call
channel_tls_handle_cell(). The channel_tls_t layer handles the handshake
cells (PADDING, VERSIONS, etc.), and the rest will be queued on the
corresponding circuit queue.

Third, once the TLS session has been established, a VERSIONS cell can be
sent, which is done by the connection subsystem with
connection_or_send_versions(). But this bypasses the channel_tls_t layer,
which processes those cells and should write them using
channel_tls_write_cell_method(). Furthermore, the channel_tls_t layer itself
uses the connection layer to send such cells, with
connection_or_send_certs_cell() and friends.

My point here is that the channel abstraction is violated in many ways, by
either directly calling the TLS channel layer or not using it to send the
specific cells that belong to the TLS handshake. On top of that, the TLS
state is baked into the connection subsystem. This means that anyone wanting
to implement a new channel will need to do a major refactoring first, which
unfortunately leaves this abstraction kind of pointless. And do not make the
mistake of labeling the channel subsystem a "transport layer"; it is at best
a protocol layer. connection_t handles transport, not channels.

4. Vanilla scheduler is not really a scheduler.

It turns out that the Vanilla scheduler is not really scheduling things but
rather shoving as much as it can onto the connection outbuf. This isn't
ideal, because tor selects a circuit based on a policy (EWMA), and once the
circuit is selected, we push onto the connection outbuf by either writing as
much as we can or stopping at 1000 cells (see MAX_FLUSH_CELLS), which is
512 * 1000 bytes == 0.5MB. One can argue: is writing only 0.5MB per scheduler
tick too little, or actually too much?

So imagine tor flushing 1000 cells every time it picks a circuit and then
moving on to the next circuit. I'm not sure this plays well with a relay with
a bandwidth limitation, for instance. If you have 100 circuits and 100KB to
spend, you might not want to spend it all on one single circuit.

Also, because this scheduler runs as often as possible, only one channel is
considered each time. Matt's experiments showed this in practice, and a
scheduler that only ever considers one channel can't make good priority
decisions, because there is no choice to make in the first place.

Finally, by bloating the outbuf as much as possible (and I'm not sure exactly
how libevent operates here), the scheduler leaves the decision of when and
what to flush onto the network to the code that handles the outbuf. In other
words, the scheduler doesn't actually control when the data is written to the
socket, so the scheduling decisions aren't necessarily followed through at
the transport layer. For this to work, the scheduler and connection
subsystems should be tightly connected.

== Part Four - The Conclusion ==

Through this epic journey, we've discovered some issues as well as design
problems. Now the question is: what should and can we do about it?

In a nutshell, there are a couple of questions we should ask ourselves and
try to answer so we can move forward:

* I believe now that we should seriously discuss the relevance of channels.
  Originally, the idea was good: providing an abstraction layer for the relay
  to relay handshake and for sending/processing cells related to the
  protocol. But as of now, they are only half doing it.

  There is an important cost in code and maintenance for something that is
  not properly implemented/finished (the channel abstraction) and that is
  also largely unused. An abstraction implemented for only one thing is not
  really useful, except maybe to offer an example for others? But we aren't
  providing a good example right now, imo...

  That being said, we can spend time fixing the channel subsystem, trying to
  turn it into a nicer interface, fixing all the issues I've described above
  (and I suspect there might be more) so the cell scheduler can play nicely
  with channels. Or, we could rip them out, eliminating lots of code and
  reducing our technical debt. I would like us to think seriously about what
  we want, because that channel subsystem is _complicated_ and very few of us
  fully understand it, afaict.

  Which would bring us back to the following (which is btw basically what we
  have now, considering the channel queues are useless):

    conn inbuf -> circ queue -> conn outbuf

  If we don't want to get rid of channels, the fixes are non trivial. For
  starters, we have to decide whether we want to keep the channel queues or
  not, and if yes, we need to almost start from square one in terms of
  testing, because we would basically be introducing a new layer of cell
  queuing.

* Dealing with the DESTROY cell design issue will require somewhat trickier
  work, I think. We must not starve a circuit with a DESTROY cell pending to
  be sent, or else the other side keeps sending data. But we should also not
  starve all the other circuits, because if we ever need to send a gazillion
  DESTROY cells with priority, we'll make the relay useless (a DoS vector).

  The question is: do we trust our EWMA policy to be wise enough to pick the
  circuit in a reasonable amount of time so we can flush the DESTROY cell
  from the circuit queue? Or do we really need to bypass or somehow
  prioritize that cell in order to send it asap, to avoid load on the network
  because the other side of the circuit is still sending?

* In the short term, we should get rid of the Vanilla scheduler, because it
  complexifies the scheduler code a lot by adding unneeded things to
  channel_t, and it also bloated the scheduler interface with pointless
  function pointers, for instance. And in my opinion, it is not helping
  performance the way it is done right now.


