As some of you may be aware, I've been working on new scheduler code for Tor recently. Currently, Tor calls channel_flush_some_cells() whenever a channel becomes able to accept more writes, or channel_write_*_cell() whenever a new cell arrives for a channel that can accept writes and has no other cells queued on its circuits. Thus, we only ever see one channel at a time when making scheduling decisions, and the existing circuitmux code is only concerned with providing an API for prioritizing the circuits attached to a particular channel.
The new scheduler aims to remove this limitation. To do that, we need a global view of all the possible choices about which circuit to take a cell from next, so instead we keep lists of channels which can currently accept writes and which currently have cells available to send, and trigger the scheduler with event_active() whenever a channel meets both conditions. The scheduler then runs from a libevent callback and loops over the channels in the intersection of the two sets. At present it just invokes the existing circuitmux code, so the behavior differs little from the old code, but it provides a framework in which we can potentially see multiple channels at once when making scheduling decisions.
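Very roughly, the trigger-and-callback structure looks something like the sketch below. This is only an illustration: sched_channel_t, sched_channel_changed(), sched_flush_channel(), and sched_init() are hypothetical stand-ins rather than the actual Tor code, and only event_new()/event_active() are real libevent calls.

    #include <event2/event.h>

    typedef struct sched_channel_t {
      struct sched_channel_t *next;
      int can_accept_writes;   /* the lower layer can take more bytes */
      int has_waiting_cells;   /* circuitmux has cells queued for this channel */
    } sched_channel_t;

    static struct event *run_sched_ev = NULL;
    static sched_channel_t *pending_channels = NULL; /* channels meeting both conditions */

    /* Hypothetical stand-in for handing one channel to the existing circuitmux
     * code, which picks the next circuit and flushes some of its cells. */
    static void
    sched_flush_channel(sched_channel_t *chan)
    {
      (void)chan;
    }

    /* Called whenever either condition changes on a channel; when both hold,
     * the channel joins the pending list and the scheduler event is activated. */
    static void
    sched_channel_changed(sched_channel_t *chan)
    {
      if (chan->can_accept_writes && chan->has_waiting_cells) {
        chan->next = pending_channels;   /* duplicate checks omitted for brevity */
        pending_channels = chan;
        event_active(run_sched_ev, EV_TIMEOUT, 1);
      }
    }

    /* Libevent callback: loop over every channel that is both writeable and
     * has cells available, invoking the per-channel flush for each. */
    static void
    sched_run_cb(evutil_socket_t fd, short what, void *arg)
    {
      (void)fd; (void)what; (void)arg;
      sched_channel_t *chan = pending_channels;
      pending_channels = NULL;
      while (chan) {
        sched_channel_t *next = chan->next;
        sched_flush_channel(chan);
        chan = next;
      }
    }

    void
    sched_init(struct event_base *base)
    {
      run_sched_ev = event_new(base, -1, 0, sched_run_cb, NULL);
    }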
Notice, though, that as described the scheduler will rarely actually see more than one pending channel at a time: running it and looping over everything the moment a channel becomes pending is too greedy. We want to put off making those choices long enough to see all our options, but without delaying so long that the channel send queues downstream of the scheduler start to empty and the delayed choice actually increases latency.
At first, NickM and I discussed this and he suggested a global high/low-water mark system parallel to the per-connection ones; I added code to the channel layer to let lower-layer implementations provide queue size and overhead estimates in a generic way, so the channel code now has an up-to-date estimate of the total number of bytes queued for sending on all channels, in both the lower-layer queues and the channel_t queues.
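To make the water-mark idea concrete, here is a minimal sketch under assumed names and made-up thresholds; the channel layer already supplies the per-channel estimates, and the gating logic would look roughly like this, not like any settled implementation.

    #include <stdint.h>

    /* Illustrative thresholds only; real values would need tuning. */
    static uint64_t sched_high_water = 32 * 1024;
    static uint64_t sched_low_water  = 16 * 1024;

    static uint64_t total_queued_bytes = 0;  /* all channels: lower-layer + channel_t queues */
    static int above_high_water = 0;

    /* Channels report estimated changes in queued bytes (including the
     * per-cell overhead estimate the lower layer provides). */
    void
    sched_note_queue_change(int64_t delta)
    {
      if (delta < 0 && (uint64_t)(-delta) > total_queued_bytes)
        total_queued_bytes = 0;
      else
        total_queued_bytes += (uint64_t)delta;

      if (total_queued_bytes >= sched_high_water)
        above_high_water = 1;
      else if (total_queued_bytes <= sched_low_water)
        above_high_water = 0;   /* hysteresis: resume only below the low mark */
    }

    /* The scheduler keeps queueing more cells only while this returns nonzero. */
    int
    sched_can_queue_more(void)
    {
      return !above_high_water;
    }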
I believe this design should not be used, because making the flow rate on one channel affect the scheduling of others in that way is a potential DoS. If an attacker controls one node, he can build a circuit through a target node to the node he controls, send traffic down it, and force the target to drain that traffic only very slowly, keeping the target's queue size above a minimum level at all times. Setting the low-water mark for the global queue higher than the high-water mark for each connection can prevent this, but the design remains vulnerable to a similar attack using multiple channels to different nodes.
The simplest proposed fixes involve modifying the queue size used to control the scheduler so that old contributions from non-draining or slow-draining channels age out: replace the actual queue size with an effective queue size that either drains linearly at some fraction of the node's expected bandwidth or drops off exponentially. These are straightforward to implement, but I'm unsure which option is best, so comments are invited.
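For concreteness, the two aging rules would look something like the following; the names and constants are illustrative only, and the effective size, rather than the actual queue size, would then be compared against the water marks.

    #include <math.h>

    /* Option 1: the effective queue size drains linearly at a fixed fraction
     * of the node's expected bandwidth, regardless of what actually drained. */
    static double
    effective_queue_linear(double effective, double expected_bw_bytes_per_sec,
                           double elapsed_sec)
    {
      const double fraction = 0.25;  /* illustrative; the right fraction is open */
      double drained = fraction * expected_bw_bytes_per_sec * elapsed_sec;
      return (effective > drained) ? effective - drained : 0.0;
    }

    /* Option 2: the effective queue size decays exponentially, so old
     * contributions from stuck channels fade out on their own. */
    static double
    effective_queue_exponential(double effective, double elapsed_sec)
    {
      const double tau = 2.0;  /* illustrative time constant, in seconds */
      return effective * exp(-elapsed_sec / tau);
    }

    /* In both schemes, newly queued bytes are simply added to the effective
     * size; only the decay rule differs. */
    static double
    effective_queue_add(double effective, double new_bytes)
    {
      return effective + new_bytes;
    }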