[metrics-team] How to interpret written/read bytes per second in Relay Search?

David Fifield david at bamsoftware.com
Wed Mar 25 19:01:37 UTC 2020


On Wed, Mar 25, 2020 at 09:42:48AM +0100, Karsten Loesing wrote:
> You're right that this data is not described on the Reproducible Metrics
> page. That page only explains where the data in the main Tor Metrics
> website graphs comes from.
> 
> The data in Relay Search comes from Onionoo, whose protocol
> specification is here (though it is not as detailed as the
> Reproducible Metrics page):
> 
> https://metrics.torproject.org/onionoo.html#bandwidth

Thanks for the explanation. That makes sense.
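
In case it helps anyone poking at the same data, here is a minimal
sketch (in Go, using the public Onionoo instance) of fetching the
bandwidth history that Relay Search graphs, for one of the default
bridges mentioned below:

	package main

	import (
		"fmt"
		"io"
		"net/http"
	)

	func main() {
		// Onionoo looks bridges up by hashed fingerprint.
		fp := "5F161D2E5713C93F16FEEDD63178E37208AA78DF"
		resp, err := http.Get(
			"https://onionoo.torproject.org/bandwidth?lookup=" + fp)
		if err != nil {
			panic(err)
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			panic(err)
		}
		// The JSON response contains write_history and read_history
		// objects of averaged bytes per second.
		fmt.Println(string(body))
	}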

> I'm not 100% certain why 2020-03-24 is not displayed yet, but
> it's probably due to data being too recent, even though I don't find
> this in the code.

It is probably because I rebooted the bridge on 2020-03-24 for #33644.

> > 2. When I look at the graphs of some default bridges, I see the written
> >    and read number being almost equal always.
> >    https://metrics.torproject.org/rs.html#details/5F161D2E5713C93F16FEEDD63178E37208AA78DF
> >    https://metrics.torproject.org/rs.html#details/8F4541EEE3F2306B7B9FEF1795EC302F6B84DAE8
> >    When I look at moria1, a directory authority, I see written being
> >    much greater than read.
> >    https://metrics.torproject.org/rs.html#details/9695DFC35FFEB861329B9F1AB04C46397020CE31
> >    What accounts for the equality in some cases and the inequality in
> >    others? What could explain the divergence in the case of the
> >    Snowflake bridge?
> 
> The inequality in the case of directory authorities is very likely due to
> directory requests. Requesting a consensus takes just a few dozen bytes,
> but responding with a consensus takes about 2.4 MiB or something like
> 0.5 MiB when compressed.
> 
> I can only speculate about the Snowflake bridge. When looking at the 5
> years graph on Relay Search it seems like the increase in read bytes is
> not that unusual. It's the divergence from written bytes that hasn't
> happened for a while. But if you look at late 2017, there was a time
> when read bytes outnumbered written bytes.

Here's a hypothesis. For typical guards/bridges,
	written = written to client + written to middle node
	read    = read from client  + read from middle node
And because input/output is conserved in the absence of other transfers,
	written to client = read from middle node
	read from client  = written to middle node
We have equality in the usual case:
	written = read from middle node + written to middle node
	        = read from client + written to client
	        = read
Another way to put it is in terms of the client's upload and download:
	written = download + upload = upload + download = read

That explains why the two graph lines are almost equal for most bridges.
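
As a sanity check, here is the same accounting as a toy calculation
(the traffic numbers are made up; this is a model, not Tor code):

	package main

	import "fmt"

	func main() {
		// Made-up true client totals, in bytes.
		upload, download := 100.0, 400.0

		// Conservation: bytes written to the client were read from
		// the middle node, and bytes read from the client are
		// written to the middle node.
		writtenToClient, writtenToMiddle := download, upload
		readFromClient, readFromMiddle := upload, download

		written := writtenToClient + writtenToMiddle
		read := readFromClient + readFromMiddle
		fmt.Println(written, read) // 500 500: the lines coincide
	}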

But because of a bug in Snowflake proxies that causes them not to
extract the correct client IP address, the Snowflake bridge currently
counts bytes for only about 25% of client connections
(https://bugs.torproject.org/33157#comment:10).
A large fraction of bytes to/from the client are not being counted, but
bytes to/from the middle node are being counted as usual. Ignoring any
possible correlation between which connections have a bogus client IP
address and the number of bytes transferred per connection, we have
something more like
	written = 0.25 * written to client + written to middle node
	read    = 0.25 * read from client  + read from middle node
and, because conservation still applies to the actual bytes,
	written to client = read from middle node
	read from client  = written to middle node
so
	written = 0.25 * read from middle node + written to middle node
	read    = 0.25 * written to middle node + read from middle node

Now the ratio written/read depends on how much the client uploads
versus how much it downloads, and there's no reason why those should
be equal. In terms of the client's upload and download,
	written to client = download = read from middle node
	read from client  = upload   = written to middle node
	written = 0.25 * download + upload
	read    = 0.25 * upload + download

Still ignoring correlation, we could recover the quantity upload +
download by adding written and read and dividing by a number that
depends on the fraction of bogus IP addresses:
	written + read
	= 0.25 * download + upload + 0.25 * upload + download
	= 1.25 * (upload + download)
	upload + download = (written + read) / 1.25
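
Running the same toy calculation with the counting bug applied shows
both the divergence and the recovery (again, the fraction 0.25 and
the traffic numbers are made up):

	package main

	import "fmt"

	func main() {
		// Fraction of client connections whose bytes are counted.
		const f = 0.25
		upload, download := 100.0, 400.0

		// Client-side bytes are only fractionally counted; bytes
		// to/from the middle node are counted in full.
		written := f*download + upload // 200
		read := f*upload + download    // 425

		fmt.Println(written, read) // the two lines diverge

		// Their sum, divided by (1 + f), recovers the true total.
		fmt.Println((written + read) / (1 + f)) // 500
	}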

To imagine what the graph would look like if we were actually
accounting for all client bytes, we approximately just have to add the
two lines together and divide by 1.25.

If my guess is correct, then it accounts for the divergence in the
written and read graphs, as long as we additionally assume that before
2020-02-19 either no clients were affected by the IP address reporting
bug, or the number of Snowflake clients was negligible and the graphs
only reflect inter-relay traffic, which also happens to conserve
input/output.

> > 3. Roger found a case where traffic tagged with a 0.0.0.0/8 address was
> >    being ignored by some part of tor's internal bandwidth accounting
> >    (https://bugs.torproject.org/33693). Until recently, the Snowflake
> >    bridge had a bug where, for certain clients, it reported a client
> >    address of 0.0.0.0 to the tor bridge's ExtORPort (https://bugs.torproject.org/33157).
> >    The bug is only partially fixed--we now report no address at all for
> >    the affected clients. The fix was not deployed until 2020-02-22, so
> >    it doesn't explain the divergence of read/written on its own. Do you
> >    know offhand whether an apparent client address of 0.0.0.0, or no
> >    address at all, would cause problems with measuring usage?
> 
> I'm afraid I don't know. I wonder if teor knows more about this, as he
> spent some time on bandwidth statistics for IPv6 traffic recently.

I did a quick test just now: in the absence of an Extended ORPort
USERADDR command, the tor bridge defaults to the remote address of the
socket, which is 127.0.0.1:XXXX because the connection comes from the
pluggable transport server on the same host. As far as I can tell,
127.0.0.1 is ignored for bandwidth accounting just as 0.0.0.0 is.
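
For reference, here is roughly what reporting a client address looks
like, following my reading of tor's ext-orport-spec (each command is a
2-byte command number, a 2-byte body length, and the body; USERADDR is
command 0x0001 with an "IP:PORT" string). The authentication handshake
that precedes commands is omitted, and the address is an example:

	package main

	import (
		"bytes"
		"encoding/binary"
		"fmt"
	)

	// extOrPortCommand frames one Extended ORPort command.
	func extOrPortCommand(cmd uint16, body []byte) []byte {
		var buf bytes.Buffer
		binary.Write(&buf, binary.BigEndian, cmd)
		binary.Write(&buf, binary.BigEndian, uint16(len(body)))
		buf.Write(body)
		return buf.Bytes()
	}

	func main() {
		// USERADDR (0x0001) reports the client's address. If it is
		// never sent, tor falls back to the socket's remote address,
		// i.e. 127.0.0.1 for a PT server on the same host.
		useraddr := extOrPortCommand(0x0001,
			[]byte("203.0.113.5:54321"))
		fmt.Printf("% x\n", useraddr)
	}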

Obviously the correct solution in this case is to fix #33157 and have
the Snowflake proxies report the correct client IP address in 100% of
cases. But it makes me think that it would be nice to have a token or
designated address that means, "I don't know what the client IP
address is, but please count its bytes anyway." In meek we can rely on
the CDN setting an X-Forwarded-For header, and in Snowflake we have
the proxy attempt to extract the remote IP address of the peer-to-peer
connection and attach it to the WebSocket request as a URL parameter.
But I'm thinking about a DNS-based transport, and there's no way to
cause a recursive DNS resolver to forward the client's IP address to
the authoritative server. (There is RFC 7871, but some DNS servers do
not support it by design, and it carries only an address prefix
anyway.) It would be nice to have clients counted even if they cannot
be geolocated.
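
To illustrate the Snowflake mechanism just mentioned, a sketch of a
proxy attaching the client address to its WebSocket request (the URL
and the name of the query parameter are illustrative, not necessarily
what the real proxy sends):

	package main

	import (
		"fmt"
		"net/url"
	)

	func main() {
		// Address the proxy extracted from the peer-to-peer
		// connection.
		clientIP := "203.0.113.5"

		// Attach it as a URL parameter on the WebSocket request to
		// the bridge.
		u, _ := url.Parse("wss://snowflake.example/")
		q := u.Query()
		q.Set("client_ip", clientIP)
		u.RawQuery = q.Encode()
		fmt.Println(u)
	}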

