# [tor-dev] Feedback on obfuscating hidden-service statistics

Karsten Loesing karsten at torproject.org
Thu Nov 20 12:49:35 UTC 2014

On 20/11/14 13:42, George Kadianakis wrote:
> "A. Johnson" <aaron.m.johnson at nrl.navy.mil> writes:
>
>>>>> George and I have been working on a small proposal to add two
>>>>> hidden-service related statistics: number of hidden services and
>>>>> total hidden-service traffic.
>>>>
>>>> Great, I’m starting to focus more on this project now. Well,
>>>> actually I’m going on a trip for a week today, but *then* I’m
>>>> focusing more on this project :-)
>>>
>>> Sounds great!  We're meeting every Tuesday at 16:00 UTC in #tor-dev.
>>> Feel free to drop by.
>>
>> Excellent. I won’t be there this coming Tuesday, but I’ll be there the next Tuesday.
>>
>>> Replicas mean that each descriptor is stored under two identifiers, so
>>> that's two places.  Further, descriptor identifiers change once per
>>> day, so during a 24-hour period, there are up to four descriptor
>>> identifiers for a hidden service.
>>
>> That makes sense. It would be nice if the statistics would allow you
>> to identify how long (i.e. how many hour periods) each descriptor was
>> observed being published. That would allow us to figure out if there
>> are lots of short-lived services or fewer long-lived
>> services. Publishing statistics every hour would pretty much take care
>> of this. If you are really set on 24 hours, then perhaps you could add
>> the total number of published descriptors in addition to the number of
>> *unique* published descriptors.
>>
>> Also, my suggestion about using additive noise applies equally well to
>> the descriptor statistics. And multiplicative noise is a *bad idea* if
>> you don’t have some adjustment for small values (e.g. 10% noise of a 0
>> value is 0, and 10% of 1 is only 0.1).
>>
>>> We have been thinking about many more hidden-service related
>>> statistics in a separate document.  We're currently discussing whether
>>> we should turn it into a tech report, because we'll probably not want
>>> to implement most of those statistics.  If you have remarks or more
>>> ideas, please feel free to edit the document.  We're going to have a
>>> public review round for this, too, but that might not happen in the
>>> next week or two.
>>>
>>
>> Great! I think we should go for at least a little more data in the
>> current proposal (what is the timeline for this, btw?). I think we
>> should come up with a list of statistics we might imagine gathering
>> and identify the subset of those that we’re comfortable gathering at
>> this point. For example, I think failure statistics are much more
>> innocuous than other data, and those would be very useful. For
>> instance, they would help us understand how the protocol is
>> failing, and they might help us identify misuse of hidden services
>> (e.g. by botnet clients stupidly looking for non-existent descriptors
>> or by malicious crawlers attempting to brute force descriptors). So
>> here are some ideas:
>>   1. Number of fetch requests for descriptors that don’t exist (number of fetch requests that do succeed would of course be very useful as well)
>>   2. Number of descriptor publishes to the wrong HSDir (actually I suspect that the HSDir doesn’t check this and wants to be accepting of any publish)
>>   3. Number of rendezvous circuits that never connect (from the RP perspective)
>>   4. Number of rendezvous circuits on which no data cells are ever sent
>>
>
> (CC'ed [tor-dev])

Thanks, George, for moving the discussion here.

Here's the latest proposal draft where I incorporated Aaron's suggestions:

If people on this list have more feedback, please reply here.  Thanks!

All the best,
Karsten

> Thanks for the input Aaron!
>
> The timeline here is that we are hoping for the proposal _and_ the
> implementation to be ready by mid-December. Then we are hoping that we
> can deploy the code to a few relays so that we have some data by January.
>
> So, time is tight.
>
> I'm currently OK with the two statistics in:
> https://people.torproject.org/~karsten/volatile/238-hs-relay-stats.txt
>
> I feel that any other statistics will need to be carefully analyzed.
> We should add the ideas you mentioned in the etherpad, and get them
> included in the tech report (which we are also hoping to have ready in
> some form by mid-January).
>
> The tech report is supposed to contain and analyze most of the HS
> statistics we can think of. It will likely contain many stats that we
> will never do, but also some stats that might be a good idea. The good
> ones we should eventually integrate into the Tor proposal and write
> code for.
>
>>> Thanks for the very valuable input!  Let me know if the following
>>> draft looks okay, and I'll start another thread on tor-dev.
>>>
>>> https://people.torproject.org/~karsten/volatile/238-hs-relay-stats-2014-11-20.txt
>>
>> "Lab(\epsilon/C)" -> "Lap(\epsilon/C)" (that was my mistake). I think
>> having the added noise both parameterized and included in the reported
>> statistics is an idea worth thinking about. Making it a parameter
>> allows you to easily change it without upgrading. Including it in the
>> statistics would allow us to correct better for noise if different
>> relays might be adding different amounts of noise due to inconsistent
>> opinions of the noise parameter (if this should never happen, then I
>> guess this wouldn’t be necessary).
>>
>> So again, sorry that I’m not going to be very responsive on this for the next week. I’m really happy that you’re working on it!
>>
>> Best,
>> Aaron
>
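
The point made in the thread about additive versus multiplicative noise, and the idea of reporting the noise parameter alongside the statistic, can be illustrated with a rough sketch. Everything named here is an assumption for illustration only: `laplace_noise`, `obfuscate_additive`, `obfuscate_multiplicative`, the `noise_scale` field, and the example scale of 8.0 are hypothetical and are not taken from the proposal draft or from the Tor source code.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-centred Laplace distribution.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def obfuscate_additive(value, scale, rng):
    """Add Laplace noise and report the scale next to the noisy value.

    Reporting `noise_scale` (a hypothetical field name) would let a
    consumer correct for relays that used different noise parameters.
    """
    return {"value": value + laplace_noise(scale, rng), "noise_scale": scale}


def obfuscate_multiplicative(value, fraction, rng):
    """Scale the value by a random factor within +/- `fraction`."""
    return value * (1.0 + rng.uniform(-fraction, fraction))


if __name__ == "__main__":
    rng = random.Random(2014)  # fixed seed only so that runs are repeatable
    for true_count in (0, 1, 1000):
        add = obfuscate_additive(true_count, scale=8.0, rng=rng)
        mul = obfuscate_multiplicative(true_count, fraction=0.10, rng=rng)
        print(true_count, add, mul)
```

With multiplicative noise a true count of 0 is always reported as exactly 0, and a true count of 1 never moves by more than 0.1, so small values are effectively published unprotected. The additive variant perturbs small and large counts by the same distribution, which is the behaviour argued for above.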