[tor-dev] Connection, Channel and Scheduler - An Intense Trek

Wed Nov 15 18:49:54 UTC 2017

On Mon, Oct 30, 2017 at 3:57 PM, David Goulet <dgoulet at ev0ke.net> wrote:

> Hello everyone!
>
> DISCLAIMER: The following is enormous and tries to describe in some level
> of
> details the situation in tor with connection<->channel<->scheduler. This
> comes
> after we've merged the KIST scheduler, we've realized many things we'ren't
> what
> they were suppose to be or meant for. In the end, I'm asking questions so
> we
> can move forward with development and fixing things.
>
> Last thing before you start your journey in the depth of Tor, the 3
> subsystems
> I'm going to talk about and how they interact are kind of very complicated
> so
> it is very possible that I might have gotten things wrong or miss some
> details.
> Please, point them out so we can better document, better be informed and
> make
> good decisions. I plan to document as much as I can from this process for
> a new
> file in torguts.git repository.
>
>
Snipping the analysis, and going straight to the conclusions.  I'll leave
one sentence in the analysis because it's such a great summary:

> Many things are problematic currently

They sure are. :)

== Part Four - The Conclusion ==
>
> Through this epic journey, we've discovered some issues as well as design
> problems. Now the question is what should and can do about it?
>
> In a nutshell, there are a couple of questions we should ask our selfves
> and
> try to answer so we can move forward:
>
> * I believe now that we should seriously discuss the relevance of channels.
>   Originally, the idea was good that is providing an abstraction layer for
> the
>   relay to relay handshake and send/process cells related to the protocol.
> But,
>   as of now, they are half doing it.
>
>   There is an important cost in code and maintanance of something that is
> not
>   properly implemented/finished (channel abstraction) and also something
> that
>   is unused. An abstraction implemented only for one thing is not really
> useful
>   except maybe to offer an example for others? But we aren't providing a
> good
>   example right now imo...
>
>   That being said, we can spend time fixing the channel subsystem, trying
> to
>   turn it in a nicer interface, fixing all the issues I've described above
> (and
>   I suspect there might be more) so the cell scheduler can play nicely with
>   channels. Or, we could rip them off eliminating lots of code and
> reducing our
>   technical debt. I would like us to think about what we want seriously
> because
>   that channel subsystem is _complicated_ and very few of us fully
> understands
>   it afaict.
>
>   Which would bring us back to (which is btw basically what we have now
>   considering the channel queues are useless):
>
>     conn inbuf -> circ queue -> conn outbuf
>
>   If we don't want to get rid of channel, the fixes are non trivial. For
>   starter, we have to decide if we want to keep the channel queue or not
> and if
>   yes, we need to almost start from square 1 in terms of testing because we
>   would basically introduce a new layer of queuing cells.
>

So, this is the question I'm least sure about. Please take the following as
tentative.

I think that the two choices ("refactor channels" and "rip out channels")
may be less different than we think. Neither one is going to be trivial to
do, and we shouldn't assume that sticking everything together into one big
type will actually make the code _simpler_.

The way I think about the code right now, "channel" is an interface which
"connection_or" implements, and there is no meaningful barrier between
connection_or and channeltls.  I _do_ like the idea of keeping some kind of
abstraction barrier, though: a "channel" is "whatever we can send and
receive cells from", whereas an "or_connection" has a lot of other baggage
that comes with it.

>From my POV, we *should* definitely abolish the channels' queues, and
minimize the amount of logic that channels do on their own. I'm not sure if
we should rip them out entirely, or just simplify them a lot. I don't think
either necessarily simpler or less bug-prone than the other.

Perhaps we should sketch out what the new interface would look like?  Or
maybe do an hour or two worth of exploratory hacking on each approach?

(This reminds me of another change I want someday, which is splitting
edge_connection_t into an "edge_connection" type that implements a "stream"
interface: right now, we have quite a few streams that aren't actually edge
connections, but which use the type anyway.)

* Dealing with the DESTROY cell design issue will require a bit more tricky
>   work I think. We must not starve circuit with a DESTROY cell pending to
> be
>   sent else the other side keeps sending data. But we should also not
> starve
>   all the circuits because if we ever need to send a gazillion DESTROY
> cell in
>   priority, we'll make the relay useless (DoS vector).
>
>   The question is, do we trust our EWMA policy to be wise enough to pick
> the
>   circuit in a reasonable amount of time so we can flush the DESTROY cell
> from
>   the circuit queue? Or we really need to bypass or prioritize somehow that
>   cell in order to send them asap in order to avoid load on the network
> because
>   the other side of the circuit is still sending?
>

So, elsewhere in the thread, folks have been discussing whether a circuit
that's going to send a DESTROY cell should flush its pending cells first.

The answer is sadly, "it depends".

Case 1: Suppose Alice is downloading a webpage.  Suppose we are the middle
relay and we lose our connection to the exit.  It would be nice to keep
flushing the data we have towards Alice -- maybe.  If she can use partial
data.  But any data that Alice sends to us would be lost, so it would be
good if we had some way to tell Alice "stop sending please".

Case 2: Suppose Alice is uploading something to a webserver. Suppose we are
the middle relay and we lose our connection from Alice. In this case,
there's no point in sending any more data towards the webserver before we
send it a DESTROY cell.  (Even if Alice was in the middle of a big upload,
she'll need to repeat any part of it that wasn't ACKed, since she won't
know what was received and what wasn't.)

Case 3: Suppose we hit our OOM killer.  In this case, we had better discard
all the data on the circuit we're killing, or we're vulnerable to "sniper
attacks" again.

So it's clear that sometimes we should dump the data, and sometimes we
shouldn't.  I think this is an independent question from what we're asking
here.  (My own take is that solving case 1 right requires "RELAY_TRUNCATED"
cells, which I believe we don't implement today.)

What we're asking here is: how can we reintegrate DESTROY cells with the
rest of the scheduler logic?

I think that, from a priority POV, DESTROY cells are in a certain sense the
_opposite_ of traffic, and we might actually want to treat them differently
from data cells.  Consider that if we have a choice between DESTROYing a
busy circuit or a quiet one, we will save more bandwidth by destroying the
busy circuit first, so that no more data is sent to us over it.

On the other hand, this doesn't mean that the FIFO structure we have today
is a good idea at all.  It probably makes sense to use the same priority
queue-based scheduler thing that we use everywhere else, but possibly with
a different (inverted??) priority parameter for destroyed circuits.

One more thing to be aware of: the destroy_cell_queue exists in part
because we tear down circuits at the same time that we queue their destroy
cells.  If we changed Tor so that "destroyed" circuits were kept around
somehow until their cells could be sent, then we'd be introducing a new
state to our state machine, to represent circuits that were schedulable but
not actually usable for traffic.  We'd need to be careful to handle that
correctly: this kind of "unusable object that still exists" has caused us
problems before.   (The solution I like best for avoiding this confusion is
to make it so the scheduler can schedule two types of "schedule-able"
things: circuits, and "pending destroy cells".)

> * In the short term, we should get rid of Vanilla scheduler because it
>   complefixies a lot the scheduler code by adding uneeded things to
> channel_t
>   but also bloated the scheduler interface with pointless function
> pointers for
>   instance. And in my opinion, it is not helping performance the way it is
> done
>   right now.
>

I agree with Roger here: it's fine to throw away the vanilla scheduler, but
we should wait until KIST has been running unproblematically in a stable
release for a while.  0.3.4 seems like a good time for this.

-- 
Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20171115/4f67c918/attachment-0001.html>