Hello,
Tor hidden services are meant to primarily provide server anonymity but they also provide various other properties. For example, their addresses are self-authenticated and their connections punch NAT. This post is about another property, which is that Tor does not reveal the popularity of a hidden service by default. That is, you can't easily get the user count of a specific hidden service.
This is not that surprising to hidden service operators, since that's also how the normal Internet works. In the normal Internet, someone cannot learn the user count of an IP or website, except if they are the operator or they control DNS or the site publishes analytics.
Over the past years, people have suggested various features that would provide us with interesting information and optimizations but would have the side effect of revealing the user count of hidden services (or only of popular hidden services) to the public.
Some examples of such popularity-leaking features:
- Hidden Services dynamically set their number of Introduction Points depending on their popularity. Basically, they self evaluate their popularity, and use a formula to decide the number of Introduction Points between 3 and 10. For a while, we thought that this formula does not work properly (#4862, #8950), but we recently discovered that it seems to be working in some manner.
While an interesting and useful feature on its own, it has the side effect that it leaks how popular your hidden service is. Since a hidden service publishes a descriptor every hour, you can monitor hourly usage patterns of hidden services. Of course, you can't get the exact user count, but you might be able to get rough approximate numbers (we still haven't analyzed the formula enough to know *exactly* how much).
- During our recent work on hidden service statistics [0], people have suggested gathering statistics that would get us closer to learning the number of hidden service users [1]. The suggested way to do so is to have HSDirs or introduction points count the total number of introductions or descriptor fetches and publish that number in their extra-info descriptor. Since given a hidden service address you can easily learn its HSDirs or IPs, it should be possible to map those statistics to specific hidden services, which would leak their popularity (more on this later).
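To make the first example above (the dynamic Introduction Point count) a bit more concrete, here is a toy sketch in Python. The real formula is not reproduced in this post, so `intro_points_for` and the `STEP` constant are made-up stand-ins. The only property taken from the text is that the count lies between 3 and 10 and grows with popularity. The point is just that anyone reading the hourly descriptor can invert such a formula into a popularity range.

```python
MIN_IPS, MAX_IPS = 3, 10
STEP = 50  # hypothetical: one extra intro point per 50 introductions/hour


def intro_points_for(intros_per_hour):
    """Hypothetical stand-in for the service-side formula (NOT Tor's real one)."""
    return max(MIN_IPS, min(MAX_IPS, MIN_IPS + intros_per_hour // STEP))


def popularity_bucket(observed_ips):
    """What an observer can infer from the intro point count in a descriptor.

    Returns (low, high) introductions/hour consistent with the observed
    count; high is None when the count is saturated at MAX_IPS.
    """
    if not MIN_IPS <= observed_ips <= MAX_IPS:
        raise ValueError("intro point count outside the 3..10 range")
    low = (observed_ips - MIN_IPS) * STEP
    high = None if observed_ips == MAX_IPS else low + STEP - 1
    return low, high


# One made-up day fragment: introductions per hour seen by the service.
hourly_intros = [20, 120, 400, 60]
# What leaks to anyone who fetches the descriptor every hour.
leaked = [popularity_bucket(intro_points_for(n)) for n in hourly_intros]
print(leaked)
```

The observer never sees `hourly_intros`, only the intro point count, yet the buckets in `leaked` follow the hourly usage pattern. How coarse the buckets are in practice depends entirely on the real formula.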
There might be more examples that I'm missing, but this should be enough to demonstrate the leaks. For the rest of this post, I will be presenting various arguments for and against leaking popularity.
Disclaimer: I am still not 100% decided here, but I lean heavily towards the "popularity is private information and we should not reveal it if we can help it" camp, or maybe towards the "there need to be very concrete positive outcomes before even considering leaking popularity" camp. Hence, my arguments will be obviously biased towards the negatives of leaking popularity. I invite someone from the opposite camp to articulate better arguments for why popularity-hiding is something worth sacrificing.
== Arguments for leaking popularity and reaping its benefits ==
Here are a few arguments that people use to shrug off popularity-hiding. I can relate to some of them, but I find the reasoning of some others funny or even dangerous.
- "If we don't care about leaking popularity we can get useful statistics"
Indeed, we've dismissed various statistics that we could collect because we were afraid that they would leak the popularity of hidden services. If we didn't have this fear, we would have a better idea of how much usage hidden services see, or whether people are conducting DoS attacks on hidden services.
- "If we don't care about leaking popularity we can get nice optimizations"
As an example, the dynamic IP calculation is one of those optimizations. I'm not aware of other optimizations, but I bet that we can think of a few more if we completely remove popularity-hiding from our threat model. Also, people have claimed that more statistics would reveal more optimizations that we could do.
- "Popularity-hiding is just a side-effect of the Tor protocol, and not a stated security goal"
People have claimed that popularity-hiding is not a stated security goal of hidden services, or that the name "hidden services" does not imply popularity-hiding in any way.
- "There are no realistic attacks that could happen from leaking popularity"
People have claimed that popularity is just a curiosity, and nothing bad can come from leaking it. They say that protecting popularity does not offer security against realistic or dangerous attacks.
Other people claim that popularity-revealing attack vectors are too noisy and contain too much random data, hence it's hard to get targeted popularity values out of them. They say that it might only be possible for very popular hidden services, or for unlikely edge cases.
- "There are probably other ways to reveal popularity. You can't fix them all"
That's actually a big fear of mine. That we are nitpicking about 2-3 popularity revealing vectors, while there are hundreds more currently open. See #8742 for example, but I bet there are more vectors that we need to think about.
== Arguments for protecting popularity ==
And here are arguments that make me believe that popularity is something that should be protected.
- Popularity attracts attention
Anonymity likes uniformity, but popularity attracts attention. There are literally infinite possible use cases where a hidden service wants to be public and still not attract attention.
However, since the above argument has not been particularly successful and only attack demonstrations will persuade a true skeptic's mind, here is an attack scenario:
Try hard to imagine a dystopian future where authorities are tracking down and hacking activist websites. They just received a big list of hidden services, the result of a messy interrogation, but they are all locked. Their hackers can hack some of them but not all. Not much time before revolution, end of dystopian future and happiness for all humanity. The dictator needs to decide which hidden services to hack to stop the revolution. Which??
With popularity being public, they can look up which of those hidden services are the most popular and target those first.
- Popularity can be used to find patterns in group movements now and in the past.
Even though you can't track specific users using popularity, you can still track groups of users. Also, these statistics are forever: even if you didn't care about a group of users in the past, but you start caring about them now, you can still look back and see their development over time.
Here is an attack scenario:
Imagine a community that practices very dangerous urban climbing [2]. Imagine thousands of friends climbing away in happiness from all over the world.
Imagine now that this community splinters into other smaller communities. If you monitor their popularity, it will be possible for you to observe the movement of that subculture.
As a further point, imagine now that the dystopian future comes and very dangerous urban climbing gets outlawed. The police catch an urban climber in New London and get a list of hidden services from her. They can then check _historically_ how many users those hidden services had. They can basically notice all the trends of the urban climbing scene in the past years. Creepy, no?
As you probably well know, anonymity is not a binary option. It's not like you are either super anonymous or not. It's more of a fuzzy variable that depends on many things. OPSEC is a big part of anonymity, and it seems to me that popularity has OPSEC consequences.
- Statistics noise will get reduced. Attacks only get better.
In the statistics we were talking about, each HSDir would reveal the number of descriptor fetches it received over the past day. We know that each HSDir serves about 150 hidden services, which means that the final value in the end will contain the popularity of 150 hidden services in one number. This is expected to be extremely noisy, and I think that's one of the main hopes of people who don't care about popularity hiding. That allows them to claim that popularity will only be leaked for very popular hidden services.
While this indeed seems reasonable, my main intuition is that attacks can only get better. Here are some ways that noise can be reduced. I will focus on the HSDir case, but same arguments apply to other suggested statistics like number of introductions per IP.
-- It's still early in the hidden services scene, so not many services get lots of traffic. I imagine that many of those 150 hidden services are going to be very inactive, and not provide much noise.
-- Hidden services publish hidden service descriptors to 6 HSDirs. This means that every day you will learn 6 noisy values for your target hidden service, not just 1. It's easier to remove noise that way.
-- Also, those 150 irrelevant hidden services that provide the noise will each publish themselves to 6 HSDirs. Applying the same logic as above, you might be able to learn information about the noise, which makes it easier to remove. In a way, you can put all the statistics measurements in a big system of equations, and start solving it to reduce noise in the equation you are interested in.
-- Think of crazy edge cases. Maybe an introduction point is very weak and unlikely to be picked and only got 10 HSes for a day. If one of them is the hidden service you are interested in, there is going to be much less noise than usual.
-- There might be other techniques for reducing noise, by combining other statistics (like the number of hidden services per HSDir which is already a stat), or by influencing the statistics yourself (like Aaron's attack on the stats aggregation protocol [3]).
What I'm trying to say here, is that if you thought that the urban climbing example was ridiculous because such a community cannot be big enough to be visible in noisy statistics, maybe by reducing noise you can actually make it distinguishable.
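As a toy illustration of the "system of equations" idea above, here is a minimal Python sketch. Everything in it is an assumption made for the example: three services and three HSDirs instead of thousands, two replicas per service instead of 6, fetches split exactly evenly between a service's HSDirs, no measurement noise, and an observer who knows which service sits on which HSDir. The names `published_counts` and `recover_popularity` are hypothetical and not part of any proposed statistics code.

```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Solve the square system a*x = b by Gauss-Jordan elimination (exact)."""
    n = len(a)
    if any(len(row) != n for row in a) or len(b) != n:
        raise ValueError("expected a square system")
    # Augmented matrix [a | b] with exact rational arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system is underdetermined")
        m[col], m[pivot] = m[pivot], m[col]
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


def published_counts(placement, fetches, num_hsdirs, replicas):
    """Per-HSDir totals, assuming fetches split evenly over the replicas."""
    counts = [Fraction(0)] * num_hsdirs
    for name, hsdirs in placement.items():
        for h in hsdirs:
            counts[h] += Fraction(fetches[name], replicas)
    return counts


def recover_popularity(placement, reports, replicas):
    """Observer side: one equation per HSDir, one unknown per service."""
    names = sorted(placement)
    rows = [
        [Fraction(1, replicas) if h in placement[n] else Fraction(0) for n in names]
        for h in range(len(reports))
    ]
    return dict(zip(names, solve_linear_system(rows, reports)))


# Toy network (all numbers made up): service -> HSDirs holding its descriptor.
REPLICAS = 2
PLACEMENT = {"A": [0, 1], "B": [1, 2], "C": [0, 2]}
TRUE_FETCHES = {"A": 600, "B": 40, "C": 10}  # hidden from the observer

reports = published_counts(PLACEMENT, TRUE_FETCHES, num_hsdirs=3, replicas=REPLICAS)
recovered = recover_popularity(PLACEMENT, reports, REPLICAS)
print({name: int(value) for name, value in recovered.items()})
```

In this toy setup each published number mixes two services, yet the per-service values come out exactly because there are as many equations as unknowns. On the real network a single day gives far fewer equations than unknowns and the split between HSDirs is not exact, so this only shows the direction the noise-reduction argument points in, not a working attack.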
- The benefits of ditching popularity-hiding are not that amazing.
To be honest, I have not heard convincing enough arguments that would make me ditch popularity hiding. Some extra statistics or some small optimizations do not seem exciting enough to me. Please try harder. This could be a nice thread to demonstrate all the positive things that could happen if we ditch popularity-hiding.
Also, there is a small difference here between the stats and the introduction point formula. The dynamic introduction point formula is something that we could disable by default, but also leave it as a configurable option for people who want to use it. That is, it will then be *the choice of the hidden service operator* whether he cares about popularity being hidden or not. With the statistics that have been proposed, you don't give any choice. You just do it for all hidden services forever.
- Principle of least surprise
Hidden service operators expect that hidden services are at least as secure as the normal Internet, plus more. On the normal Internet, popularity is private by default. Having this assumption violated on hidden services might not be polite.
- Popularity-hiding is crucial to maintain the deep sea security model of hidden services
As I have mentioned in the past, some people think of the onion land as a very deep ocean. In some places of the ocean, you might be able to see some buoys (some more visible than others). To visit them, you need to wear your goggles and your snorkel, dive in and enter from underwater.
This might not seem like a very concrete security model, but in any case popularity is not revealed at any point. The sea is opaque and you can't see the divers entering the hidden services.
Anyway this post has grown to immense size, and I was really hoping it would be shorter.
On a more practical note, over the next few weeks, we should decide what we want to do with the dynamic introduction point formula and whether we should keep it or not (#4862). My current intuition is that it should be disabled but also kept there as an option for people who want to enable it. In any case, I hope that this thread can stimulate discussion.
Also, if you are a hidden service operator I'm curious to hear about whether you believe that popularity hiding is a security property that should be preserved if that's even possible.
Cheers!
[0]: https://blog.torproject.org/blog/some-statistics-about-onions
[1]: https://lists.torproject.org/pipermail/tor-dev/2015-February/008247.html
[2]: https://www.youtube.com/watch?v=kpS7vhvkIQM
[3]: https://lists.torproject.org/pipermail/tor-dev/2015-March/008404.html