[tor-scaling] Inprocess Cell Time Tracing

David Goulet dgoulet at torproject.org
Thu Aug 8 12:10:17 UTC 2019


On 07 Aug (16:23:34), Rob Jansen wrote:
> Hello David,
> 
> Wow, mega-post! Lots of great info :)
> 
> I still haven't properly digested all of this yet. But I'll pretend like I
> know what I am talking about anyway ;)
> 
> > On Aug 7, 2019, at 2:51 PM, David Goulet <dgoulet at torproject.org> wrote:
> > 
> > So yeah, would love to hear anyone's opinion on all this.
> > 
> 
> I believe understanding kernel buffering and TCP effects will be fundamental
> to understanding Tor's behavior.
> 
> One possible thing holding Tor back from writing to the kernel is an
> overloaded kernel outbuf (buffer bloat), but KIST should help to reduce
> this.

I will add a tracepoint collecting the TCP buffer sizes that KIST gets from
the kernel. That way, during the outbuf->flush timespan (sometimes it is many
seconds), we'll be able to correlate it with the congestion situation on that
connection.

And thus confirm that KIST is doing what it should.
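
Roughly the kind of data I have in mind, as a sketch (this is *not* Tor's
actual KIST code and the trace helper is made up, but TCP_INFO and SIOCOUTQ
are what the kernel exposes):

    /* Sketch only: query the kernel TCP state the way a KIST-style scheduler
     * would, then hand it to a (hypothetical) tracepoint helper. */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <linux/sockios.h>   /* SIOCOUTQ */

    /* Hypothetical hook; in tor this could be a tor_trace()-style call. */
    void trace_kist_tcp_state(int fd, uint32_t cwnd, uint32_t mss,
                              uint32_t unacked, int outq_bytes);

    int
    emit_kist_tcp_trace(int fd)
    {
      struct tcp_info ti;
      socklen_t len = sizeof(ti);
      int outq = 0;

      if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0)
        return -1;
      /* Bytes currently sitting in the kernel send queue for this socket. */
      if (ioctl(fd, SIOCOUTQ, &outq) < 0)
        return -1;

      /* KIST's notion of "space" is roughly cwnd * mss minus what is already
       * queued; recording the raw values lets us line up long outbuf->flush
       * times with the congestion state of the connection. */
      trace_kist_tcp_state(fd, ti.tcpi_snd_cwnd, ti.tcpi_snd_mss,
                           ti.tcpi_unacked, outq);
      return 0;
    }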

The other thing I will do is add two tracepoints to record the EWMA parameters
for a circuit: one when the cell count changes (and thus the weight changes),
and one when we recompute all the weights with the current parameters, which I
believe happens every 10 seconds.
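
To make it concrete, here is a simplified model of where those two tracepoints
would sit (the struct and function names are illustrative, not the real
circuitmux_ewma.c API):

    #include <math.h>

    /* Simplified circuit EWMA state; field names are illustrative only. */
    struct circ_ewma {
      double cell_count;   /* decayed cell count, i.e. the circuit's weight */
      unsigned last_tick;  /* tick at which cell_count was last rescaled */
    };

    /* Hypothetical tracepoint hooks. */
    void trace_ewma_cell_added(const struct circ_ewma *e);
    void trace_ewma_rescaled(const struct circ_ewma *e, unsigned tick);

    /* Tracepoint #1: whenever the cell count (and thus the weight) changes. */
    void
    ewma_note_cell_sent(struct circ_ewma *e, double scale_factor)
    {
      e->cell_count += scale_factor;
      trace_ewma_cell_added(e);
    }

    /* Tracepoint #2: the periodic rescale (every 10 seconds or so), when the
     * weights are decayed using the current parameters. */
    void
    ewma_rescale(struct circ_ewma *e, unsigned now_tick, double decay_per_tick)
    {
      unsigned elapsed = now_tick - e->last_tick;
      e->cell_count *= pow(decay_per_tick, (double)elapsed);
      e->last_tick = now_tick;
      trace_ewma_rescaled(e, now_tick);
    }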

Why quiet circuits have this chaotic distribution of inbuf->flush timings comes
down to either a problem with EWMA or a bug in our scheduler. I lean more
towards an EWMA problem than a scheduler bug, because both Vanilla and KIST
show the same pattern (see the q1-q3 graphs), and they use wildly different
code :).

> 
> Another possible issue is that the quiet circuits are also on quiet sockets
> (no other circuit is sending on that socket). Then when some data comes
> in, Tor sends and fills up the kernel socket out buffer. But because the
> socket was previously quiet, the TCP congestion window will be reset to 10
> packets, and it will take a few round trips with the other end of the socket
> for the congestion window to ramp up. When the circuit becomes quiet again,
> the congestion window resets to 10 packets again and we start over. In order
> for this to be a real issue, a quiet circuit would need to get a burst of
> traffic greater than the socket outbuf size so that it would cause the
> outbuf to fill and Tor to queue it. We are talking more than 200 cells ~
> 100kb to fill the kernel out buffer. The circuit window allows up to 1000
> cells, so this phenomenon is possible, though I don't know how likely.

> 
> > One of the next thing I want to learn is _where_ is that time all spent. As in
> > from inbuf->queue or queue->outbuf or outbuf->flush. My instincts tell me
> > outbuf->flush especially with KIST considering TCP congestion.
> > 
> > Food for thoughts here! Please let me know if you would like to see specific
> > things, I now have a set of scripts to analyze the tracing data and plot it.
> 
> I think it would be useful to gather kernel in/out buffer times in order to
> get a more complete picture of queuing times through the entire machine
> (rather than just in Tor). Relatively high kernel out buffer time was the
> primary motivation for KIST, and clearly the kernel will affect Tor's
> ability to write. Can your tools be made to measure kernel in/out buffer
> times?

We can use kernel tracepoints, but they are limited to the network layer. The
problem is that correlating the data enqueued in the kernel with the data in
userspace, and following it all the way to the network device, is ... tough :).

That said, there are a few interesting kernel tracepoints we could add to the
data collection:

    - sock_rcvqueue_full
      When the receiving _kernel buffer_ queue is full.
    - sock_exceed_buf_limit
      When the kernel can't grow the socket buffers any further.

    Those two would indicate buffer bloat on a relay caused by inbound traffic.
    Weirdly, the kernel doesn't have a "sock_sendqueue_full" tracepoint as far
    as I can tell.

    After that, we can trace many events on the network device side, but I'm
    not sure how much they would help us :S:

        net_dev_queue
        net_dev_start_xmit
        net_dev_xmit
        netif_receive_skb
        netif_receive_skb_entry
        netif_rx
        netif_rx_entry
        netif_rx_ni_entry
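
For completeness, a minimal sketch of hooking the two sock_* tracepoints above
straight from tracefs (assuming it is mounted at /sys/kernel/tracing, older
kernels use /sys/kernel/debug/tracing, and that we run as root; perf or lttng
with lttng-modules would do the same job). The net_dev_*/netif_* events can be
enabled the same way:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int
    write_str(const char *path, const char *val)
    {
      int fd = open(path, O_WRONLY);
      if (fd < 0)
        return -1;
      ssize_t n = write(fd, val, strlen(val));
      close(fd);
      return (n < 0) ? -1 : 0;
    }

    int
    main(void)
    {
      const char *base = "/sys/kernel/tracing";
      char path[256], line[4096];

      /* Enable the two socket-level tracepoints of interest. */
      snprintf(path, sizeof(path),
               "%s/events/sock/sock_rcvqueue_full/enable", base);
      write_str(path, "1");
      snprintf(path, sizeof(path),
               "%s/events/sock/sock_exceed_buf_limit/enable", base);
      write_str(path, "1");

      /* Stream events as they fire; every hit hints at inbound buffer bloat. */
      snprintf(path, sizeof(path), "%s/trace_pipe", base);
      FILE *tp = fopen(path, "r");
      if (!tp)
        return 1;
      while (fgets(line, sizeof(line), tp))
        fputs(line, stdout);
      fclose(tp);
      return 0;
    }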

> 
> If not, and in case it's useful, I developed a multi-threaded library to
> measure kernel in/out buffer times [0]. It uses libpcap to measure kernel
> queuing times of packets.

Does this have good performance? The thing with libpcap is that it does not
guarantee that every packet will be seen. Under pressure, you can miss some.
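
That said, we could at least quantify the loss with pcap_stats(), something
like this (the interface name and snaplen are placeholders):

    #include <pcap/pcap.h>
    #include <stdio.h>

    int
    main(void)
    {
      char errbuf[PCAP_ERRBUF_SIZE];
      /* "eth0" is a placeholder; use the relay's actual interface. */
      pcap_t *p = pcap_open_live("eth0", 96 /* snaplen */, 0 /* promisc */,
                                 1000 /* read timeout, ms */, errbuf);
      if (!p) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
      }

      /* Capture for a while... */
      struct pcap_pkthdr *hdr;
      const u_char *data;
      for (int i = 0; i < 100000; i++) {
        if (pcap_next_ex(p, &hdr, &data) < 0)
          break;  /* error */
      }

      /* ps_drop: packets the kernel dropped because the capture buffer was
       * full; ps_ifdrop: drops at the interface. Non-zero values mean the
       * capture is incomplete under pressure. */
      struct pcap_stat st;
      if (pcap_stats(p, &st) == 0)
        printf("recv=%u drop=%u ifdrop=%u\n",
               st.ps_recv, st.ps_drop, st.ps_ifdrop);
      pcap_close(p);
      return 0;
    }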

But also, on a 25 MB/s relay that is already capping the CPU, I worry about
adding an intrusive library that could slow things down and drastically change
the behavior of the application, leading to very different results.

That is the primary reason I'm not doing this whole thing with log_*() in tor,
for instance: the performance hit and the behavior change.

Thanks Rob!
David

-- 
ZLqb2cqRJBGCBCxZzV/qgv4UpvrLOzvWIZHnu/NtwCw=