[metrics-team] Are bandwidth charts double counting?

Tom Ritter tom at ritter.vg
Tue Oct 17 14:48:03 UTC 2017


On 17 October 2017 at 03:17, Karsten Loesing <karsten at torproject.org> wrote:
> Hi Tom,
>
> On 2017-10-16 22:33, Tom Ritter wrote:
>> I was looking at https://metrics.torproject.org/bandwidth.html and
>> https://metrics.torproject.org/bandwidth-flags.html and was confused a
>> little by the graph.  Is it double (or triple) counting?
>>
>> The definitions page says "bandwidth history: the volume of incoming
>> and/or outgoing traffic that a relay claims to have handled on behalf
>> of clients."
>>
>> It's the "and/or" that throws me.
>>
>> If it's 'and' then the Exit bandwidth history is double counting:
>> divide by two to get the bandwidth that exits the tor network.
>
> I see the confusion there.
>
> The purpose of the glossary is to define what a term means in the
> context of Metrics. But it does that on a very high level that most
> users understand. It's not supposed to be sufficiently precise to
> understand the computations behind a given graph.
>
> Stated differently, I think it's okay that "bandwidth history" can mean
> either incoming traffic only or outgoing traffic only or even both. It's
> the fact that it's a number that is reported by relays that is
> important, at least to me. I can be convinced otherwise though.
>
> But I also understand how you still want to know where the numbers in a
> graph come from. The good news is that that's exactly what we just
> received funding for!
>
> https://trac.torproject.org/projects/tor/wiki/org/sponsors/Sponsor13
>
> In particular, there's:
>
> Activity 2.3: Write specification for assessing how much traffic the Tor
> network can handle and how much traffic there is.
>
> Once such a specification exists, it will tell you that the "Bandwidth
> history" line in https://metrics.torproject.org/bandwidth.html is the
> sum of incoming and outgoing traffic, divided by two.

Aha, so then the traffic by Exit flag is the amount of traffic
exiting the network; no need to divide.

> But why did you ask about triple counting? How would we accidentally
> triple count something here?


"the "Bandwidth history" line in
https://metrics.torproject.org/bandwidth.html is the sum of incoming
and outgoing traffic, divided by two."

It divides by two, but it sums over all relays in the network, right?
Most circuits are three hops, so I think one could equally state:

"The Tor network sends ~100 Gbit/s of traffic in one direction through
the network"

or

"The Tor network pushes ~33 Gbit/s of traffic from an entry node to
its destination, not correctly accounting for hidden services or
one/two hop circuits."


The latter is supported, I think, by:
- our exit traffic being just over 25 Gbit/s
- our bandwidth for directory requests being around 2.1 Gbit/s
- our HS traffic being a little over 1 Gbit/s
Add them together and you're at least in the ballpark of 33 Gbit/s.
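
To make the ballpark comparison concrete, here is a minimal Python
sketch of the arithmetic. The figures are the approximate ones quoted in
this message, the variable names are only illustrative, and the
three-hop divisor is the simplifying assumption stated above (it ignores
hidden services and one/two hop circuits).

```python
# Approximate figures quoted in this thread, all in Gbit/s.
total_one_direction = 100.0  # "Bandwidth history" line, summed over all relays
hops_per_circuit = 3         # assumption: every circuit has exactly three hops

# If each byte is counted once per relay it passes through, dividing by the
# hop count gives a rough end-to-end figure.
end_to_end_estimate = total_one_direction / hops_per_circuit

exit_traffic = 25.0       # "just over 25"
directory_traffic = 2.1   # "around 2.1"
hs_traffic = 1.0          # "a little over 1"
component_sum = exit_traffic + directory_traffic + hs_traffic

print(f"estimate from hop count: {end_to_end_estimate:.1f} Gbit/s")
print(f"sum of components:       {component_sum:.1f} Gbit/s")
```

The two numbers are not expected to match exactly, since each input is
only a rough reading from the graphs.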




> Feel free to ask similar questions, and I'll make sure that the various
> specification documents will answer them.
>
>> The other question I had, that I don't think we are able to calculate,
>> is "How many connections does the Tor Network produce".  Obviously it
>> handwaves over 'connection', but for the browser scenario I'd say
>> 'connections to first party domains'.
>>
>> exit_streams_opened might be the best measurement to accomplish
>> something very similar though right? It'd be unique connections to
>> third party domains (for ports 443 and 80) instead of first party
>> domains, but that's pretty close.
>>
>> How would I go about calculating it? Is it as simple as summing this
>> field across all the extra info descriptors for a given time period?
>
> Fine question. We don't have such statistics yet. But let's see what we
> could do with existing data.
>
> First, I'm not sure what exactly you mean by first party domains and
> third party domains.

example.com includes resources from a.com and b.com

When we load this in Tor Browser we will produce 1 circuit and 3 streams.

I guess ideally I'd like to know "How many circuits are opened" (over
time) as this would tell us something about capacity. If we
anticipate adding a 'thing' to the Tor network that would generate
an additional 10 circuits/second, and we currently handle 5
circuits/second, then even if we didn't put much bandwidth on these
circuits, we would be tripling <something> in the network that could
cause bottlenecks or problems.

But 'streams' is, at least, an upper bound for circuits: you can't
have more circuits than streams.

> Regarding your idea to use "exit-streams-opened", yes, that might tell
> us something. For example, I found this line in an extra-info descriptor:
>
> exit-streams-opened
> 80=1761692,182=104,443=1343240,4070=1080,5000=540,5002=36,8999=2004,9696=12,51000=28,51413=2952,other=320388
>
> I think if I were to calculate a network total here, I wouldn't try to
> sum up all such lines for a given time frame. There are only few relays
> reporting these numbers, so that number would likely be much too low.

Why are only a few relays reporting this metric? I thought all relays
publish extra info descriptors?

> A better approach might be to extrapolate this line to a network total
> for that given day by computing the probability of picking this exit out
> of all others. Say, if the exit probability is 0.5% (random guess),
> there would be (1,761,692 + 1,343,240) / 0.5%  opened streams in the
> network to ports 80 and 443 on that day. I didn't look up the 0.5%, so
> I'll not write the result here, but you see what I mean.
>
> That approach would produce a few dozen extrapolated network totals per
> day, depending on how many usable reports we have. The median of those
> extrapolations could then be a first approximation of the number you're
> looking for.

That makes sense, thanks!
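
For my own notes, a minimal Python sketch of the extrapolation as I
understand it. The function names are hypothetical, the
exit-streams-opened value is the example line quoted above, and the
0.5% exit probability is the "random guess" from the quoted text, not a
looked-up value, so the resulting number is only a placeholder.

```python
from statistics import median


def parse_exit_streams_opened(value):
    """Parse 'port=count,port=count,...' into a dict keyed by port string."""
    counts = {}
    for item in value.split(","):
        port, _, count = item.partition("=")
        counts[port] = int(count)
    return counts


def extrapolate_web_streams(value, exit_probability):
    """Scale one relay's port 80 + 443 stream count to a network total."""
    counts = parse_exit_streams_opened(value)
    web_streams = counts.get("80", 0) + counts.get("443", 0)
    return web_streams / exit_probability


example_value = (
    "80=1761692,182=104,443=1343240,4070=1080,5000=540,5002=36,"
    "8999=2004,9696=12,51000=28,51413=2952,other=320388"
)

# One (value, exit probability) pair per reporting relay for the day.
# Only the single quoted example is used here; 0.005 is a placeholder guess.
reports = [(example_value, 0.005)]

estimates = [extrapolate_web_streams(v, p) for v, p in reports]
network_estimate = median(estimates)
print(f"extrapolated streams to ports 80/443: {network_estimate:,.0f}")
```

With a few dozen real reports per day in the list, the median should be
less sensitive to a single odd relay than a mean would be.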

-tom

