[tor-dev] Feedback on obfuscating hidden-service statistics

Fri Nov 21 15:38:32 UTC 2014

> Roger's branch was a PoC that wrote stats on the log file. I don't
> think we have newer data than what is in #13192. It's unclear whether
> the relays stopped collecting statistics, or they just haven't updated
> the trac ticket.

If we could check on that and get that data, that would be really helpful. Then we could do analysis in parallel with the better extra-info implementation.

> Also, Roger's stats were counting cells from both RP and IP
> circuits. It's unclear whether we will take the same approach; atm I
> find it more reasonable to only count RP cells/circuits.

IP stats are also interesting, but I agree less so than RP stats alone.

> BTW, did Roger do the "How many HSes are there?" HSDir stats? Is there
> a ticket for that?

I am fairly sure he at least counted descriptor updates at an HSDir. I have a slide bullet from the kickoff meeting saying “We estimate about 30 to 50K hidden services are updating their descriptors each day”, and I recall Roger talking about that. The question then is what the “dark matter” of hidden services consists of, that is, the 30-50K HSes less the ~1500 that are publicly available and were responding at that time.

> In any case, Roger told us that answering the questions "Approx. how
> many HSes are there?" and "How much bw is HS bw?" are the important
> parts of what we need to have by January. Our plan was to have that,
> plus a document with various future statistics we might or might not
> do. Do you think that's not sufficient?

I’m not sure about “not sufficient”, but as I said, Roger already reported estimates for those last time. But I’d go with his opinion on this - it is Tor’s part of the project.

> Security review is indeed a big part. I'm not persuaded that just
> collecting all kinds of statistics from the Tor network is always good
> or helpful [0]. I personally prefer to do this methodically and with
> sufficient time for thinking and feedback, instead of starting to
> collect various statistics in a short time. I feel that getting
> pressured about *moar statistics* is a slippery slope that leads to badness.

OK, makes sense. So let’s start tackling the hard question: what exactly do hidden services want to protect? Some questions:
  1. Should HSes be able to hide that they even exist at all in the system? If so, counting the number of hidden services weakens this somewhat (up to the added noise/inaccuracy). And by the way, random noise doesn’t necessarily hide this: if you choose fresh noise every measurement period and the number of HSes is constant, then the average of the reports will eventually reveal the exact number. Ideas to handle this: reuse randomness (except that this reveals exactly when HSes are added or removed), or round to the nearest multiple of some bucket size (although what about the one HS that puts you into the next bucket?). Doing this against an active adversary (not one who just looks at your reported stats) is much harder, of course, because you need to prevent HSDirs from knowing how many real descriptors they have.
  2. Should HSes be able to hide their (pseudonymous) popularity (i.e. number of users, connections per user)? If so, collecting RP cell counts already leaks averages and puts a lower bound on the max.
  3. Should client HS lookups be hidden so that nobody knows what’s being queried or how often? If so, collecting descriptor requests could reveal a very active set of clients.
These are hard questions because HSes are only designed to hide location, but there also appears to be a strong desire to make it hard to learn anything else about them. But there are good reasons (e.g. designing protocol improvements, troubleshooting problems, watching for malicious behavior) to learn *something* about HSes.
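
To make the averaging concern in question 1 concrete, here is a rough Python sketch. Everything in it is an assumption for illustration only: the true count, the Laplace noise scale, the bucket size, and the function names are made up and are not taken from any Tor code or proposal. It shows that a single noisy report is far from the truth, that the mean of many independently noised reports converges on it, and that with bucketing a single extra HS can flip the reported value.

```python
import random
import statistics

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_reports(true_count, scale, periods, rng):
    # Fresh, independent noise is drawn for every measurement period.
    return [true_count + laplace_noise(scale, rng) for _ in range(periods)]

def round_to_bucket(count, bucket_size):
    # Round an integer count to the nearest multiple of bucket_size
    # (exact halves round up).
    return (count + bucket_size // 2) // bucket_size * bucket_size

rng = random.Random(2014)   # fixed seed so the sketch is repeatable
TRUE_COUNT = 1234           # hypothetical constant number of HSes
SCALE = 50.0                # hypothetical noise scale
reports = noisy_reports(TRUE_COUNT, SCALE, 10000, rng)

# Averaging many reports washes the noise out.
estimate = statistics.mean(reports)
print("first report:", round(reports[0]), "mean of all:", round(estimate, 1))

# Bucketing: one additional HS can move the report to the next bucket.
print(round_to_bucket(1249, 100), round_to_bucket(1250, 100))
```

The averaging only works because each period draws new noise. Reusing the same noise value stops the convergence but, as noted above, then any change in the reported number corresponds exactly to a change in the real count.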

> I also believe that some of these extra stats (e.g. "How many
> failures/anomalies do we observe?") should first be done on a privnet
> instead of the real network. That can give us some preliminary
> results, and then we can consider doing them on the real
> network. Maybe we can also have some privnet stats by January.

Testing any code changes makes sense to be confident they work as intended. And I agree that failure stats might give us useful information just from the test network. Getting those for January seems like a great idea.

> Also, I'm the main person who will be doing stats on the real network,
> both the proposal and the implementation, and I'm not full-time on
> this. Other SponsorR people are doing different things, like HS
> privnet setup and collecting statistics/benchmarks on
> privnets. Karsten recently started helping with the proposal which is
> a huge help!

Understood :-)

> But to be a bit more constructive, if you want more stats to happen
> faster, I invite you to help with the security analysis. If you can
> show that the stats you want to see don't reveal information about
> specific HSes or their clients and that they are useful to have, maybe
> we will have time to integrate them before January. No promises here.

Sounds like a plan. Let me know what you think about the security issues I brought up earlier (Karsten put them in the security section of the proposal), as well as the questions I raise above.

> And to be a bit more technical, from a first glance I don't think we
> should do "4. How What is the median or maximum number of requests?".
> This would allow an attacker to learn the popularity of *specific*
> HSes if they have their onion address. Why do we want that?

How would knowing the median reveal the popularity of a specific HS? And as I said earlier, there is an issue with learning an HS's popularity over time if the service goes up and down: you can correlate that with changes in the count of total connections.
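
Here is a minimal sketch of the up/down correlation I mean, under made-up assumptions: the background load, its variance, the target's contribution, and the alternating schedule are all hypothetical numbers, and the adversary is assumed to know when the service was reachable (e.g. by probing it). Comparing the mean network-wide total in "up" periods with the mean in "down" periods recovers roughly the target's own contribution.

```python
import random
import statistics

def simulate_totals(schedule, background_mean, background_sd, target_rate, rng):
    # Network-wide total per period: background load from all other HSes,
    # plus the target's contribution in the periods where it is up.
    totals = []
    for is_up in schedule:
        total = rng.gauss(background_mean, background_sd)
        if is_up:
            total += target_rate
        totals.append(total)
    return totals

def estimate_popularity(schedule, totals):
    # Difference between the mean total while up and the mean total while down.
    up = [t for s, t in zip(schedule, totals) if s]
    down = [t for s, t in zip(schedule, totals) if not s]
    return statistics.mean(up) - statistics.mean(down)

rng = random.Random(13192)                        # fixed seed for repeatability
schedule = [i % 2 == 0 for i in range(4000)]      # hypothetical: up every other period
TARGET_RATE = 500.0                               # hypothetical connections per period
totals = simulate_totals(schedule, 100000.0, 1000.0, TARGET_RATE, rng)

estimate = estimate_popularity(schedule, totals)
print("estimated contribution of the target:", round(estimate))
```

Even though the target is a small fraction of the total in this toy setup, enough periods make its contribution stand out. How many periods a real adversary would need depends on the real variance, which I have not measured.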

> Finally, if you know that the funder will be unhappy with just those
> two stats and we should *definitely* do more, then please tell us and
> we can think of something. Roger didn't give me this impression.

Just repeating what I said above for clarity: go with Roger’s opinion on this. That is just my opinion.

> [0]: Did you know that relays (and bridges) report bandwidth
>     statistics every *15* minutes? I have no idea if this is a good
>     idea to do, especially for relays that see very few clients.

I did know this. It does seem potentially revealing of, say, the guard used by a hidden service, because you can easily modulate the HS's traffic in 15-minute intervals. Somebody should think about what statistics gathering might reveal and if that’s cool.
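
As a toy illustration of the modulation concern, assuming invented numbers throughout: ten hypothetical relays publish one bandwidth value per 15-minute interval, one of them (here "relay3") carries the target HS's traffic, and the adversary drives traffic to the HS in a random on/off pattern. Correlating each relay's published series with the pattern singles out the relay carrying the traffic. The relay names, loads, and noise levels are not real data.

```python
import math
import random

def pearson(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length series.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

rng = random.Random(15)     # fixed seed for repeatability
INTERVALS = 96              # one day of 15-minute reporting intervals

# On/off pattern the adversary imposes on the HS's traffic.
pattern = [rng.choice([0, 1]) for _ in range(INTERVALS)]

def relay_series(carries_target):
    # Hypothetical per-interval bandwidth report: noisy background load,
    # plus extra load in "on" intervals if this relay carries the target.
    series = []
    for bit in pattern:
        load = rng.gauss(1000.0, 50.0)
        if carries_target and bit:
            load += 200.0
        series.append(load)
    return series

relays = {f"relay{i}": relay_series(i == 3) for i in range(10)}
scores = {name: pearson(pattern, series) for name, series in relays.items()}
suspect = max(scores, key=scores.get)
print("most correlated relay:", suspect)
```

A relay that sees very few clients has little background load to hide the pattern in, which is the case the footnote worries about.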

Cheers,
Aaron