[tor-dev] Feedback on obfuscating hidden-service statistics

George Kadianakis desnacked at riseup.net
Fri Nov 21 14:39:32 UTC 2014

"A. Johnson" <aaron.m.johnson at nrl.navy.mil> writes:

> A response to George’s comment: "The timeline here is that we are
> hoping the proposal _and_ the implementation to be ready by
> mid-December… I'm currently OK with the two statistics in:
> <https://people.torproject.org/~karsten/volatile/238-hs-relay-stats.txt>…
> I feel that any other statistics will need to be carefully analyzed."
> I believe Roger created a branch implementing these two statistics as
> well as the number of HS descriptor requests at an HSDir, and I
> believe that he ran those on some relays (at least Moritz’s). Are you
> just recreating this work? Did those relays stop collecting those
> statistics? What happened to that data? It won’t be terribly
> interesting if all we do is report *fewer* statistics collected at a
> later date than at the kickoff meeting.


Roger's branch was a PoC that wrote stats to the log file. I don't
think we have newer data than what is in #13192. It's unclear whether
the relays stopped collecting statistics, or they just haven't updated
the trac ticket.

Instead, we are planning to write the stats to the extra-info
descriptors, so that relays publish these stats every day.

Also, Roger's stats were counting cells from both RP and IP
circuits. It's unclear whether we will take the same approach; atm I
find it more reasonable to only count RP cells/circuits.

BTW, did Roger do the "How many HSes are there?" HSDir stats? Is there
a ticket for that?

In any case, Roger told us that answering the questions "Approx. how
many HSes are there?" and "How much bw is HS bw?" are the important
parts of what we need to have by January. Our plan was to have that,
plus a document with various future statistics we might or might not
do. Do you think that's not sufficient?

> I also think that we should identify some question we hope to
> investigate for January, such as: 1. How much HS traffic is there?
> Already semi-answered by Roger this summer, as I just mentioned.
> 2. How many descriptors are there? Ditto.  3. How many descriptors
> are never requested? Now we’re getting somewhere.  4. What is
> the median or maximum number of requests? This would be incredibly
> informative about the skew of HS popularity.  5. How many
> failures/anomalies do we observe? This would help us figure out how
> well HSes are working and how broken/abusive client behavior is.
> And is there a reason this process has to be so slow? Is it the
> security review? Roger managed to pump out a branch for stats
> collection and get it generating data within a week. It’s pretty
> pathetic if we can’t do better ;-)


Security review is indeed a big part. I'm not persuaded that just
collecting all kinds of statistics from the Tor network is always good
or helpful [0]. I personally prefer to do this methodically and with
sufficient time for thinking and feedback, instead of starting to
collect various statistics in a short time. I feel that getting
pressured about *moar statistics* is a slippery slope that leads to badness.

I also believe that some of these extra stats (e.g. "How many
failures/anomalies do we observe?") should first be done on a privnet
instead of the real network. That can give us some preliminary
results, and then we can consider doing them on the real
network. Maybe we can also have some privnet stats by January.

Also, I'm the main person who will be doing stats on the real network,
both the proposal and the implementation, and I'm not full-time on
this. Other SponsorR people are doing different things, like HS
privnet setup and collecting statistics/benchmarks on
privnets. Karsten recently started helping with the proposal, which is
a huge help!

Also also, we hope to finish everything by mid-December so that we can
also have time to deploy the stats on a few relays. This is a month
from now, not too far away.

But to be a bit more constructive, if you want more stats to happen
faster, I invite you to help with the security analysis. If you can
show that the stats you want to see don't reveal information about
specific HSes or their clients and that they are useful to have, maybe
we will have time to integrate them before January. No promises here.

And to be a bit more technical, from a first glance I don't think we
should do "4. What is the median or maximum number of requests?".
This would allow an attacker to learn the popularity of *specific*
HSes if they have their onion address. Why do we want that?
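
To illustrate the concern with a toy example: the sketch below is not
taken from any Tor code, and the descriptor names and request counts
are made up. It only shows that a sum mixes all descriptors together,
while a maximum is by definition the exact count of one descriptor.

```python
from statistics import median

# Hypothetical per-descriptor request counts seen by one HSDir in a day.
# Names and numbers are invented for illustration only.
requests_per_descriptor = {
    "desc-a": 3,
    "desc-b": 0,
    "desc-c": 1200,
    "desc-d": 7,
    "desc-e": 0,
}


def aggregate_stats(counts):
    """Return the candidate aggregate statistics for one reporting period."""
    values = list(counts.values())
    return {
        "total": sum(values),      # mixes every descriptor together
        "median": median(values),  # the middle value of the sorted counts
        "max": max(values),        # the exact count of a single descriptor
    }


stats = aggregate_stats(requests_per_descriptor)

# The descriptor whose count would be published verbatim as the maximum.
busiest = max(requests_per_descriptor, key=requests_per_descriptor.get)

print(stats)
print("max equals the count of:", busiest)
```

If someone with the onion address can tell that the busiest descriptor
on that HSDir belongs to the HS they care about, the published maximum
hands them its request count directly.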

Finally, if you know that the funder will be unhappy with just those
two stats and we should *definitely* do more, then please tell us and
we can think of something. Roger didn't give me this impression.

> Cheers,
> Aaron

[0]: Did you know that relays (and bridges) report bandwidth
     statistics every *15* minutes? I have no idea if this is a good
     idea to do, especially for relays that see very few clients.

