Hi George,
I’m glad you’re putting serious thought into these stats. I’ll give you my perspective on some of the issues you raise.
I will now enumerate the stats that Aaron considers interesting and low-hanging-fruit:
I should mention that all of these come from a list that Roger gave verbally, so you might try to get further thoughts from him.
This time I'm going to put extra focus on how to use these statistics and _what questions they help us answer_. If these stats don't help us answer any interesting questions, they are not that useful.
I think that overall many statistics are useful just to check for abuse, misconfiguration, or bugs. If a statistic is way out of line with what we would expect, especially when compared to other statistics, then that would reveal an unexpected and potentially problematic behavior.
Also, this time we should have an *exact strategy* on how to use specific stats to derive the results we want, so that we don't spend 2 months after we write the code figuring out how to do extrapolations.
I agree that it is important to be confident that we can use the data that we collect. Paul and I actually went through many of the desired statistics early on (during the kickoff meeting in mid-September) sketching out how extrapolation would work. I had attached that document to Trac ticket #13509, although it may be hard to understand.
(1) Number of descriptor updates (total count and distribution) (Sec. 4.2.4)
...
I'm not yet convinced this is a useful stat. What is its use and which *questions* would it help us answer?
In addition to revealing if somebody is sending way too many updates, it would help us understand the general level of churn of hidden services. Are there lots of short-lived services?
I'm assuming that we would publish the total count here, since revealing the exact distribution could leak information about specific hidden services.
I believe that the distribution can be revealed to some extent safely. You choose a small number of bins that partition the possible numbers of updates, and then publish the count for each bin in the same way that you would publish a single overall count. The details are in the stats tech report.
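To make the binning idea concrete, here is a minimal sketch. The bin edges, the function names, and the `round_up` stand-in for the obfuscation step are all illustrative assumptions, not what the tech report specifies. A real deployment would apply the full rounding-plus-noise treatment to each bin.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def bin_update_counts(updates_per_service: Sequence[int],
                      edges: Sequence[int]) -> list[int]:
    """Count services per bin.

    Bin i covers [edges[i], edges[i+1]); the last bin is open-ended.
    `edges` must be sorted, and edges[0] is the smallest allowed value.
    """
    counts = [0] * len(edges)
    for n in updates_per_service:
        if n < edges[0]:
            raise ValueError(f"{n} is below the first bin edge {edges[0]}")
        counts[bisect_right(edges, n) - 1] += 1
    return counts


def publish_histogram(counts: Sequence[int],
                      obfuscate: Callable[[int], int]) -> list[int]:
    """Apply the same obfuscation to every bin as to a single overall count."""
    return [obfuscate(c) for c in counts]


def round_up(count: int, granularity: int = 8) -> int:
    """Hypothetical stand-in for obfuscation: round up to a multiple."""
    return -(-count // granularity) * granularity


if __name__ == "__main__":
    # Hypothetical per-service update counts seen by one HSDir.
    observed = [1, 2, 3, 7, 30, 250]
    edges = [0, 5, 20, 100]  # assumed bins: 0-4, 5-19, 20-99, 100+
    raw = bin_update_counts(observed, edges)
    print(raw, publish_histogram(raw, round_up))
```

Keeping the number of bins small matters here: each bin is published as its own obfuscated count, so more bins means more published values.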
Also, this is related to the "Number of unique HSes per HSDir" statistic that we are already doing. This means that we can do the division and arrive at "Average number of descriptor updates per HS". I'm not sure if I like this, since there are *specific* HSes corresponding to each HSDir. Are we sure that there are no edge cases where this can be exploited to learn their uptime? I'm not.
I do think that if you know of a specific HS, then you can watch the descriptor update stats from its HSDir over time and gradually learn about how many times that HS updates its descriptors. But if you know of a specific HS, you can do that anyway simply by fetching the descriptors. Thus this doesn’t seem like a problem to me.
(2) Number of RPs established on relays
...
OK, I can see how this stat would give us the number of "connection attempts there are by clients to services that are running". Is this a number we are interested in? I guess so maybe.
I think this is very interesting. How much traffic tends to flow over a typical HS circuit? Are there a huge number of established RPs relative to the amount of traffic (this could indicate either DoS or botnet clients)? Do clients make lots of little connections or fewer large ones?
Number of circuits using TAP and nTor
...
This statistic can reveal other information too since it's basically a circuit count. For example, if you count and publish the number of circuits containing ESTABLISH_INTRO, you get the "Number of IPs established on the network" statistic. If you count and publish the number of circuits containing ESTABLISH_RENDEZVOUS, you get the "Number of RPs established on relays" statistic I discussed in the previous section.
Agreed.
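As a rough sketch of how one circuit counter could feed both statistics: the representation of a circuit as the set of cell types seen on it, and the statistic names, are assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable

# Cell type -> statistic it contributes to (statistic names are hypothetical).
STAT_FOR_CELL = {
    "ESTABLISH_INTRO": "ips_established",
    "ESTABLISH_RENDEZVOUS": "rps_established",
}


def count_circuits(circuits: Iterable[set[str]]) -> Counter:
    """Count circuits per statistic.

    Each circuit is modeled as the set of cell types observed on it,
    so a circuit is counted at most once per statistic.
    """
    totals: Counter = Counter()
    for cells_seen in circuits:
        for cell, stat in STAT_FOR_CELL.items():
            if cell in cells_seen:
                totals[stat] += 1
    return totals


if __name__ == "__main__":
    # Hypothetical circuits observed by one relay.
    sample = [
        {"ESTABLISH_INTRO"},
        {"ESTABLISH_RENDEZVOUS"},
        {"ESTABLISH_RENDEZVOUS", "RENDEZVOUS1"},
        {"RELAY_DATA"},
    ]
    print(count_circuits(sample))
```

The raw totals would still need the usual rounding and noise before being published.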
Also, why do we care how many hidden services are using older versions of Tor? And why do we care how many clients are using older versions of Tor? Is this to specifically detect botnet activity?
Roger has mentioned this a couple of times, both times in the context of identifying botnet activity. I think more generally, it would be helpful to Tor to understand the distribution of software versions in active use among clients and HSes. This would help them better target upgrading if necessary to improve user security, and it could reveal when older versions are out of use and can be safely end-of-lifed.
Also, why do this just for hidden services?
For HSes in particular, it is interesting because it would show how much HS activity is from botnets. I agree that it is interesting more generally as well.
Number of descriptors with encrypted introduction points
...
This seems like a stat that would answer a very concrete question: "How many hidden services are using authorization currently?"
Answering this question seems useful for evaluating the user base and popularity of this feature.
Yes, agreed. Among other things, this could help direct Tor to improve the usability of such a feature.
However, I'm not sure if I want to learn this information at all. People who use hidden service authorization are cautious users, and it seems weird to count them like this. It might be okay if there are 10000 of these hidden services, but if there are only 100, I wouldn't want to out them like this. More thinking required.
I agree that no individual service should be revealed. That is why we would round and add noise as usual. That would hide the existence of any small number of services (we have used 8 for similar purposes).
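A minimal sketch of the round-and-add-noise step follows. It assumes Laplace noise, and it treats the 8 mentioned above as both the rounding granularity and the number of services whose presence the noise should mask. The `epsilon` value, the order of the two steps, and the function names are illustrative assumptions, not the deployed parameters.

```python
import math
import random
from typing import Optional


def round_up_to_bin(count: int, bin_size: int = 8) -> int:
    """Round a non-negative count up to the next multiple of bin_size."""
    return math.ceil(count / bin_size) * bin_size


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def obfuscate_count(count: int,
                    bin_size: int = 8,
                    sensitivity: int = 8,   # assumed: services to mask
                    epsilon: float = 0.3,   # assumed privacy parameter
                    rng: Optional[random.Random] = None) -> int:
    """Round the count up, add Laplace noise, and return an integer."""
    # SystemRandom draws from the OS entropy source instead of a seeded PRNG.
    rng = rng or random.SystemRandom()
    noisy = round_up_to_bin(count, bin_size) + laplace_noise(
        sensitivity / epsilon, rng)
    return math.floor(noisy + 0.5)


if __name__ == "__main__":
    # With only a handful of services, the published value is dominated
    # by rounding and noise, and it may even come out negative.
    print([obfuscate_count(3) for _ in range(5)])
```

With these assumed parameters the noise scale is about 27, so a true count of 3 and a true count of 0 produce published values that are hard to tell apart, while a count in the thousands is still estimated reasonably well.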
Cheers, Aaron