[tor-dev] Should popularity-hiding be a security property of hidden services?

Fri Apr 3 14:57:33 UTC 2015

Hello,

Tor hidden services are meant to primarily provide server anonymity
but they also provide various other properties. For example, their
addresses are self-authenticated and their connections punch NAT. This
post is about another property, which is that Tor does not reveal the
popularity of a hidden service by default. That is, you can't easily
get the user count of a specific hidden service.

This is not that surprising to hidden service operators, since that's
also how the normal Internet works. In the normal Internet, someone
cannot learn the user count of an IP or website, except if they are
the operator or they control DNS or the site publishes analytics.

Over the past years, people have suggested various features that would
provide us with interesting information and optimizations but would
have the side effect of revealing the user count of hidden services
(or only of popular hidden services) to the public.

Some examples of such popularity-leaking features:

- Hidden Services dynamically set their number of Introduction Points
  depending on their popularity. Basically, they self-evaluate their
  popularity, and use a formula to decide the number of Introduction
  Points between 3 and 10. For a while, we thought that this formula
  does not work properly (#4862, #8950), but we recently discovered
  that it seems to be working in some manner.

  While an interesting and useful feature on its own, it has the side
  effect that it leaks how popular your hidden service is. Since a
  hidden service publishes a descriptor every hour, you can monitor
  hourly usage patterns of hidden services. Of course, you can't get
  the exact user count, but you might be able to get rough
  approximations (we still haven't analyzed the formula enough to know
  *exactly* how much it leaks).

- During our recent work on hidden service statistics [0] people have
  suggested gathering statistics that would get us closer to learning
  the number of hidden service users [1]. The suggested way to do so
  is to have HSDirs or introduction points count the total number of
  introductions or descriptor fetches and publish that number in their
  extra-info descriptor. Since given a hidden service address you can
  easily learn its HSDirs or IPs, it should be possible to map those
  statistics to specific hidden services, which would leak their
  popularity (more on this later).
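
To make the first example concrete, here is a minimal Python sketch of
how a public introduction point count could be inverted into a
popularity range. The formula, the REQUESTS_PER_EXTRA_IP constant and
both function names are made up for illustration; only the 3-to-10
range and the hourly descriptor come from the description above. Tor's
real formula is different, so treat this as a toy model of the leak,
not of the implementation.

```python
# Hypothetical model, NOT Tor's actual formula: the service maps its own
# hourly introduction count to a number of introduction points in [3, 10].
MIN_IPS, MAX_IPS = 3, 10
REQUESTS_PER_EXTRA_IP = 100  # assumed scaling constant, illustration only


def intro_point_count(requests_per_hour: int) -> int:
    """Service side: pick how many introduction points to advertise."""
    extra = requests_per_hour // REQUESTS_PER_EXTRA_IP
    return max(MIN_IPS, min(MAX_IPS, MIN_IPS + extra))


def popularity_bucket(observed_ips: int) -> tuple:
    """Observer side: invert the public count into a range of request rates."""
    low = (observed_ips - MIN_IPS) * REQUESTS_PER_EXTRA_IP
    if observed_ips >= MAX_IPS:
        # The clamp hides everything above the top bucket.
        return low, float("inf")
    return low, low + REQUESTS_PER_EXTRA_IP - 1


# One descriptor per hour gives the observer an hourly usage pattern.
hourly_requests = [20, 150, 480, 1200, 310]                  # private
published = [intro_point_count(r) for r in hourly_requests]  # public
inferred = [popularity_bucket(n) for n in published]

for hour, (n_ips, (low, high)) in enumerate(zip(published, inferred)):
    print(f"hour {hour}: {n_ips} intro points -> {low}..{high} requests/hour")
```

Under these assumptions the observer never sees an exact count, but
each hourly descriptor narrows the service's load to one bucket, which
is enough to follow usage patterns over time.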

There might be more examples that I'm missing, but this should be
enough to demonstrate the leaks. For the rest of this post, I will be
presenting various arguments for and against leaking popularity.

Disclaimer: I still am not 100% decided here, but I lean heavily
towards the "popularity is private information and we should not
reveal it if we can help it" camp, or maybe towards the "there need to
be very concrete positive outcomes before even considering leaking
popularity" camp. Hence, my arguments will be obviously biased towards the
negatives of leaking popularity. I invite someone from the opposite
camp to articulate better arguments for why popularity-hiding is
something worth sacrificing.

== Arguments for leaking popularity and reaping its benefits ==

Here are a few arguments that people use to shrug off popularity-hiding.
I can relate to some of them, but I find the reasoning of some others
funny or even dangerous.

- "If we don't care about leaking popularity we can get useful statistics"

  Indeed, we've dismissed various statistics that we could collect
  because we were afraid that they would leak the popularity of hidden
  services. If we didn't have this fear, we would have a better idea
  on how much usage hidden services see, or whether people are
  conducting DoS attacks on hidden services.

- "If we don't care about leaking popularity we can get nice optimizations"

  As an example, the dynamic IP calculation is one of those optimizations.
  I'm not aware of other optimizations, but I bet that we can think of
  a few more if we completely remove popularity-hiding from our threat
  model. Also, people have claimed that more statistics would reveal
  more optimizations that we could do. 

- "Popularity-hiding is just a side-effect of the Tor protocol, and
   not a stated security goal"

  People have claimed that popularity-hiding is not a stated security
  goal of hidden services, or that the name "hidden services" does not
  imply popularity-hiding in any way.

- "There are no realistic attacks that could happen from leaking popularity"

  People have claimed that popularity is just a curiosity, and nothing
  bad can come from leaking it. They say that protecting popularity
  does not offer security against realistic or dangerous attacks.

  Other people claim that popularity-revealing attack vectors are too
  noisy and contain too much random data, hence it's hard to get
  targeted popularity values out of them. They say that it might only
  be possible for very popular hidden services, or for unlikely edge
  cases.

- "There are probably other ways to reveal popularity. You can't fix them all"

  That's actually a big fear of mine. That we are nitpicking about 2-3
  popularity revealing vectors, while there are hundreds more
  currently open. See #8742 for example, but I bet there are more
  vectors that we need to think about.

== Arguments for protecting popularity ==

And here are arguments that make me believe that popularity is
something that should be protected.

- Popularity attracts attention

  Anonymity likes uniformity, but popularity attracts attention.
  There are literally infinite possible use cases where a hidden
  service wants to be public and still not attract attention.

  However, since the above argument has not been particularly
  successful and only attack demonstrations will persuade a true
  skeptic's mind, here is an attack scenario:

      Try hard to imagine a dystopian future where authorities are
      tracking down and hacking activist websites.  They just received a
      big list of hidden services, the result of a messy interrogation,
      but they are all locked. Their hackers can hack some of them but not
      all. Not much time before revolution, end of dystopian future and
      happiness for all humanity. The dictator needs to decide which
      hidden services to hack to stop the revolution. Which??

      With popularity being public, they can get the popularity of
      the biggest ones and target those first.

- Popularity can be used to find patterns in group movements now and in the past.  

  Even though you can't track specific users using popularity, you can
  still track groups of users. Also, these statistics are forever: even
  if you didn't care about a group of users in the past, but you start
  caring about them now, you can still look back and see their
  development over time.

  Here is an attack scenario:

      Imagine a community that practices very dangerous urban climbing [2].
      Imagine thousands of friends climbing away in happiness from all
      over the world.

      Imagine now that this community splinters into other smaller
      communities. If you monitor their popularity, it will be
      possible for you to observe the movement of that subculture.

      As a further point, imagine now that dystopian future comes and
      very dangerous urban climbing gets outlawed. The police catch an
      urban climber in New London and get a list of hidden services from
      her. They can then check _historically_ how many users those hidden
      services had. They can basically notice all the trends of the urban
      climbing scene in the past years. Creepy, no?

  As you probably well know, anonymity is not a binary option. It's
  not like you are either super anonymous or not. It's more of a
  fuzzy variable that depends on many things. OPSEC is a big part of
  anonymity, and it seems to me that popularity has OPSEC consequences.

- Statistics noise will get reduced. Attacks only get better.

  In the statistics we were talking about, each HSDir would reveal the
  number of descriptor fetches it received over the past day. We know
  that each HSDir serves about 150 hidden services, which means that
  the final value in the end will contain the popularity of 150 hidden
  services in one number. This is expected to be extremely noisy, and
  I think that's one of the main hopes of people who don't care about
  popularity hiding. That allows them to claim that popularity will
  only be leaked for very popular hidden services.

  While this indeed seems reasonable, my main intuition is that
  attacks can only get better. Here are some ways that noise can be
  reduced. I will focus on the HSDir case, but the same arguments apply
  to other suggested statistics like number of introductions per IP.

  -- It's still early in the hidden services scene, so not many
     services get lots of traffic. I imagine that many of those 150
     hidden services are going to be very inactive, and not provide much
     noise.

  -- Hidden services publish hidden service descriptors to 6 HSDirs.
     This means that every day you will learn 6 noisy values for
     your target hidden service, not just 1. It's easier to remove noise
     that way.

  -- Also, those 150 irrelevant hidden services that provide the noise
     will publish themselves to 6 HSDirs. Applying the same logic as
     above, you might be able to learn information about the noise, which
     makes it easier to remove. In a way, you can put all the statistics
     measurements in a big system of equations, and start solving it to
     reduce noise in the equation you are interested in.

  -- Think of crazy edge cases. Maybe an introduction point is very
     weak and unlikely to be picked and only got 10 HSes for a day. If
     one of them is the hidden service you are interested in, there is
     going to be much less noise than usual.

  -- There might be other techniques for reducing noise, by combining
     other statistics (like the number of hidden services per HSDir which
     is already a stat), or by influencing the statistics yourself (like
     Aaron's attack on the stats aggregation protocol [3]).

  What I'm trying to say here is that if you thought that the urban
  climbing example was ridiculous because such a community cannot be
  big enough to be visible in noisy statistics, maybe by reducing
  noise you can actually make it distinguishable.

- The benefits of ditching popularity-hiding are not that amazing.

  To be honest, I have not heard convincing enough arguments that
  would make me ditch popularity hiding. Some extra statistics or some
  small optimizations do not seem exciting enough to me. Please try
  harder. This could be a nice thread to demonstrate all the positive
  things that could happen if we ditch popularity-hiding.

  Also, there is a small difference here between the stats and the
  introduction point formula. The dynamic introduction point formula
  is something that we could disable by default, but also leave it as
  a configurable option for people who want to use it. That is, it
  will then be *the choice of the hidden service operator* whether they
  care about popularity being hidden or not. With the statistics that
  have been proposed, you don't give any choice. You just do it for
  all hidden services forever.

- Principle of least surprise

  Hidden service operators expect that hidden services are at least as
  secure as the normal Internet plus more. On the normal Internet,
  popularity is private by default. Having this assumption violated on
  hidden services might not be polite.

- Popularity-hiding is crucial to maintain the deep sea security model of hidden services

  As I have mentioned in the past, some people think of the onion land
  as a very deep ocean. In some places of the ocean, you might be able
  to see some buoys (some more visible than others). To visit them,
  you need to wear your goggles and your snorkel, dive in and enter
  from underwater.

  This might not seem like a very concrete security model, but in any
  case popularity is not revealed at any point. The sea is opaque and
  you can't see the divers entering the hidden services.
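
Going back to the noise-reduction argument above, the "big system of
equations" idea can be sketched in a few lines of Python. This is a toy
model under loud assumptions: three services and three HSDirs instead
of thousands, two replicas per service instead of six, fetches split
evenly between the responsible HSDirs, and exact (not obfuscated)
reports. All names and numbers are invented. On the real network there
are far more services than HSDirs and the published values would be
noisy, so an attacker would need more observations and an approximate
solver; the sketch only shows why overlapping reports carry more
information than each report alone.

```python
from fractions import Fraction

REPLICAS = 2  # toy value; the post says real services use 6 HSDirs

# Which services each HSDir is responsible for (derivable from addresses).
hosting = {
    "hsdir_a": ["target", "quiet"],
    "hsdir_b": ["quiet", "medium"],
    "hsdir_c": ["target", "medium"],
}
true_fetches = {"target": 600, "quiet": 40, "medium": 100}  # private

# What each HSDir would publish: one aggregate number per day.
reports = {
    d: sum(Fraction(true_fetches[s], REPLICAS) for s in svcs)
    for d, svcs in hosting.items()
}


def solve(matrix, rhs):
    """Gauss-Jordan elimination with exact fractions.

    Assumes a square, non-singular system (true for this toy example).
    """
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(b)]
            for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


# Observer side: uses only the public assignment and the public reports.
services = sorted({s for svcs in hosting.values() for s in svcs})
matrix = [[Fraction(1, REPLICAS) if s in hosting[d] else 0 for s in services]
          for d in hosting]
rhs = [reports[d] for d in hosting]
recovered = dict(zip(services, solve(matrix, rhs)))
print({name: int(count) for name, count in recovered.items()})
```

No single report reveals the target's count (each one mixes two
services), but the three overlapping reports together determine every
service's count in this idealized setting.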

Anyway, this post has grown to immense size, and I was really hoping it
would be shorter.

On a more practical note, over the next few weeks, we should decide
what we want to do with the dynamic introduction point formula and
whether we should keep it or not (#4862). My current intuition is that
it should be disabled but also kept there as an option for people who
want to enable it. In any case, I hope that this thread can stimulate
discussion.

Also, if you are a hidden service operator I'm curious to hear about
whether you believe that popularity hiding is a security property that
should be preserved if that's even possible.

Cheers!

[0]: https://blog.torproject.org/blog/some-statistics-about-onions 
[1]: https://lists.torproject.org/pipermail/tor-dev/2015-February/008247.html
[2]: https://www.youtube.com/watch?v=kpS7vhvkIQM
[3]: https://lists.torproject.org/pipermail/tor-dev/2015-March/008404.html