A response to George’s comment: "The timeline here is that we are hoping the proposal _and_ the implementation to be ready by mid-December… I'm currently OK with the two statistics in: https://people.torproject.org/~karsten/volatile/238-hs-relay-stats.txt… I feel that any other statistics will need to be carefully analyzed.”
I believe Roger created a branch implementing these two statistics as well as the number of HS descriptor requests at an HSDir, and I believe that he ran those on some relays (at least Moritz's). Are you just recreating this work? Did those relays stop collecting those statistics? What happened to that data? It won't be terribly interesting if all we do is report *fewer* statistics collected at a later date than at the kickoff meeting.
I also think that we should identify some questions we hope to investigate by January, such as:
1. How much HS traffic is there? Already semi-answered by Roger this summer, as I just mentioned.
2. How many descriptors are there? Ditto.
3. How many descriptors are never requested? Now we're getting somewhere.
4. What is the median or maximum number of requests per descriptor? This would be incredibly informative about the skew of HS popularity.
5. How many failures/anomalies do we observe? This would help us figure out how well HSes are working and how broken/abusive client behavior is.
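To make questions 3 and 4 concrete, here is a minimal sketch (Python for readability; a real implementation would live in Tor's C code) of the per-descriptor bookkeeping an HSDir could keep during one measurement period. The names and structure are mine, not from any existing branch:

    from statistics import median

    # Hypothetical per-period bookkeeping at an HSDir; illustrative only.
    stored = set()   # descriptor IDs published to this HSDir in the period
    fetches = {}     # descriptor ID -> number of fetch requests seen

    def note_publish(desc_id):
        stored.add(desc_id)

    def note_fetch(desc_id):
        fetches[desc_id] = fetches.get(desc_id, 0) + 1

    def report():
        counts = [fetches.get(d, 0) for d in stored] or [0]
        return {
            "stored-descriptors": len(stored),
            "never-requested": sum(1 for c in counts if c == 0),  # question 3
            "median-fetches": median(counts),                     # question 4
            "max-fetches": max(counts),                           # question 4
        }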
And is there a reason this process has to be so slow? Is it the security review? Roger managed to pump out a branch for stats collection and get it generating data within a week. It’s pretty pathetic if we can’t do better ;-)
Cheers, Aaron
On Nov 20, 2014, at 9:49 PM, Karsten Loesing karsten@torproject.org wrote:
On 20/11/14 13:42, George Kadianakis wrote:
"A. Johnson" aaron.m.johnson@nrl.navy.mil writes:
George and I have been working on a small proposal to add two hidden-service related statistics: number of hidden services and total hidden-service traffic.
Great, I’m starting to focus more on this project now. Well, actually I’m going on a trip for a week today, but *then* I’m focusing more on this project :-)
Sounds great! We're meeting every Tuesday at 16:00 UTC in #tor-dev. Feel free to drop by.
Excellent. I won’t be there this coming Tuesday, but I’ll be there the next Tuesday.
Replicas mean that each descriptor is stored under two identifiers, so that's two places. Further, descriptor identifiers change once per day, so during a 24-hour period, there are up to four descriptor identifiers for a hidden service.
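For reference, the "up to four" count falls out of the v2 descriptor-ID construction in rend-spec.txt: two replicas, times a time-period boundary that every service crosses once per day. Here is a rough Python sketch of my reading of that construction, simplified and not a drop-in reimplementation:

    import hashlib
    import struct

    # Rough sketch of the v2 descriptor-ID construction from rend-spec.txt;
    # byte widths and truncation details are simplified for illustration.
    def time_period(now, permanent_id):
        # The rotation point is offset by the first byte of the permanent
        # ID, so different services rotate at different times of day.
        return (now + permanent_id[0] * 86400 // 256) // 86400

    def descriptor_id(permanent_id, now, replica, descriptor_cookie=b""):
        secret_id_part = hashlib.sha1(
            struct.pack(">I", time_period(now, permanent_id))
            + descriptor_cookie
            + bytes([replica])
        ).digest()
        return hashlib.sha1(permanent_id + secret_id_part).digest()

    # In any 24-hour window a service uses replicas 0 and 1 and crosses one
    # time-period boundary, giving up to four distinct descriptor IDs.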
That makes sense. It would be nice if the statistics allowed you to identify how long (i.e. how many hourly periods) each descriptor was observed being published. That would allow us to figure out if there are lots of short-lived services or fewer long-lived services. Publishing statistics every hour would pretty much take care of this. If you are really set on 24 hours, then perhaps you could add the total number of published descriptors in addition to the number of *unique* published descriptors.
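As a sketch of what hourly bookkeeping could capture (illustrative Python with invented names, not an actual patch): per descriptor ID, record the hours in which it was published; total publishes and unique descriptors then come for free, and the number of distinct hours approximates how long each descriptor was being maintained.

    # Hypothetical hourly publish bookkeeping at an HSDir; names invented.
    publish_hours = {}   # descriptor ID -> set of hour indices it was seen in
    total_publishes = 0

    def note_publish(desc_id, now):
        global total_publishes
        total_publishes += 1
        publish_hours.setdefault(desc_id, set()).add(int(now) // 3600)

    def report():
        hours_seen = sorted(len(h) for h in publish_hours.values())
        return {
            "total-publishes": total_publishes,
            "unique-descriptors": len(publish_hours),
            # Distribution of hourly periods per descriptor: distinguishes
            # many short-lived services from fewer long-lived ones.
            "hours-seen": hours_seen,
        }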
Also, my suggestion about using additive noise applies equally well to the descriptor statistics. And multiplicative noise is a *bad idea* if you don’t have some adjustment for small values (e.g. 10% noise of a 0 value is 0, and 10% of 1 is only 0.1).
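To make that concrete, a small Python sketch (parameter names are mine, not the proposal's notation):

    import math
    import random

    def laplace_noise(scale):
        # Zero-mean Laplace sample via the inverse-CDF method.
        u = random.uniform(-0.5, 0.5)
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    def additive_report(true_count, scale):
        # Additive noise: its magnitude does not depend on the value, so a
        # count of 0 or 1 is obscured as well as a count of 10,000.
        return true_count + laplace_noise(scale)

    def multiplicative_report(true_count, fraction=0.10):
        # Multiplicative noise: 10% of 0 is 0 and 10% of 1 is only 0.1, so
        # small counts are barely perturbed, which is the problem above.
        return true_count * (1.0 + random.uniform(-fraction, fraction))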
We have been thinking about many more hidden-service related statistics in a separate document. We're currently discussing whether we should turn it into a tech report, because we'll probably not want to implement most of those statistics. If you have remarks or more ideas, please feel free to edit the document. We're going to have a public review round for this, too, but that might not happen in the next week or two.
Great! I think we should go for at least a little more data in the current proposal (what is the timeline for this, btw?). I think we should come up with a list of statistics we might imagine gathering and identify the subset of those that we're comfortable gathering at this point. For example, I think failure statistics are much more innocuous than other data, and they would be very useful. They would help us understand where the protocol is failing and how to improve it, and they might help us identify misuse of hidden services (e.g. by botnet clients stupidly looking for non-existent descriptors or by malicious crawlers attempting to brute-force descriptors). So here are some ideas (sketched as counters after the list):
- Number of fetch requests for descriptors that don’t exist (number of fetch requests that do succeed would of course be very useful as well)
- Number of descriptor publishes to the wrong HSDir (actually I suspect that the HSDir doesn’t check this and wants to be accepting of any publish)
- Number of rendezvous circuits that never connect (from the RP perspective)
- Number of rendezvous circuits on which no data cells are ever sent
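A rough sketch of those counters (Python for discussion; a real patch would increment counters in Tor's C code at the corresponding HSDir and rendezvous-point code paths, and the names below are invented):

    from collections import Counter

    failures = Counter()

    def on_descriptor_fetch(found):
        failures["desc-fetch-total"] += 1
        if not found:
            failures["desc-fetch-not-found"] += 1      # idea 1

    def on_descriptor_publish(we_are_responsible_hsdir):
        if not we_are_responsible_hsdir:
            failures["desc-publish-wrong-hsdir"] += 1  # idea 2

    def on_rend_circuit_closed(joined, data_cells_relayed):
        # Called at the rendezvous point when a circuit that established a
        # rendezvous cookie is torn down.
        if not joined:
            failures["rend-never-connected"] += 1      # idea 3
        elif data_cells_relayed == 0:
            failures["rend-joined-no-data"] += 1       # idea 4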
(CC'ed [tor-dev])
Thanks, George, for moving the discussion here.
Here's the latest proposal draft where I incorporated Aaron's suggestions:
https://gitweb.torproject.org/user/karsten/torspec.git/blob/refs/heads/hs_st...
If people on this list have more feedback, please reply here. Thanks!
All the best, Karsten
Thanks for the input Aaron!
The timeline here is that we are hoping for the proposal _and_ the implementation to be ready by mid-December. Then we are hoping that we can deploy the code to a few relays so that we have some data by January.
So, time is tight.
I'm currently OK with the two statistics in: https://people.torproject.org/~karsten/volatile/238-hs-relay-stats.txt
I feel that any other statistics will need to be carefully analyzed. We should add the ideas you mentioned in the etherpad, and get them included in the tech report (which we are also hoping to have ready in some form by mid-January).
The tech report is supposed to contain and analyze most of the HS statistics we can think of. It will likely contain many stats that we will never implement, but also some that might be a good idea. The good ones we should eventually integrate into the Tor proposal and write code for.
Thanks for the very valuable input! Let me know if the following draft looks okay, and I'll start another thread on tor-dev@.
https://people.torproject.org/~karsten/volatile/238-hs-relay-stats-2014-11-2...
"Lab(\epsilon/C)” -> "Lap(\epsilon/C)” (that was my mistake. I think having the added noise both parameterized and included in the reported statistics is an idea worth thinking about. Making it a parameter allows you to easily change it without upgrading. Including it in the statistics would allow us to correct better for noise if different relays might be adding different amounts of noise due to inconsistent opinions of the noise parameter (if this should never happen, then I guess this wouldn’t be necessary).
So again, sorry that I’m not going to be very responsive on this for the next week. I’m really happy that you’re working on it!
Best, Aaron