[tor-bugs] #33666 [Circumvention/Snowflake]: Investigate Snowflake proxy failures

Tue Apr 14 17:10:32 UTC 2020

#33666: Investigate Snowflake proxy failures
-------------------------------------+--------------------------
 Reporter:  cohosh                   |          Owner:  cohosh
     Type:  defect                   |         Status:  assigned
 Priority:  High                     |      Milestone:
Component:  Circumvention/Snowflake  |        Version:
 Severity:  Normal                   |     Resolution:
 Keywords:                           |  Actual Points:
Parent ID:  #19001                   |         Points:
 Reviewer:                           |        Sponsor:
-------------------------------------+--------------------------

Comment (by cohosh):

 Okay I'm reassessing the ideas presented in comment:5 and I think now that
 we know NAT topologies are likely a large source of the issues here, there
 are some different options I'd like to consider. The main techniques are:

 == Option 1: Disable less useful proxies or have them poll less often ==

  This is essentially what was discussed above, where we decided that
 keeping track of how often a datachannel times out without opening is a
 good metric for figuring out how useful a proxy is, and that disabling it
 after a few consecutive failed attempts is a good way to go.
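
 As a rough illustration of the metric described above, here is a minimal
 sketch of a counter for consecutive datachannel timeouts. The type name,
 method names, and the threshold are hypothetical and not taken from the
 Snowflake code; it only assumes that the proxy learns whether each
 datachannel opened before its timeout.

```go
package main

// FailureTracker counts consecutive datachannel timeouts for one proxy.
// Hypothetical sketch: the name and threshold are not part of Snowflake.
type FailureTracker struct {
	maxConsecutive int  // disable after this many timeouts in a row
	consecutive    int  // timeouts seen since the last success
	disabled       bool // set once the threshold has been reached
}

// NewFailureTracker returns a tracker that disables the proxy after
// maxConsecutive timeouts in a row.
func NewFailureTracker(maxConsecutive int) *FailureTracker {
	return &FailureTracker{maxConsecutive: maxConsecutive}
}

// Record notes the outcome of one attempt. opened is true if the
// datachannel opened before the timeout. The return value is true once
// the proxy should consider itself disabled.
func (t *FailureTracker) Record(opened bool) bool {
	if opened {
		// A success resets the streak but does not re-enable the proxy.
		t.consecutive = 0
		return t.disabled
	}
	t.consecutive++
	if t.consecutive >= t.maxConsecutive {
		t.disabled = true
	}
	return t.disabled
}
```

 With a threshold of 3, two timeouts followed by a success and then three
 more timeouts would disable the proxy only on the last of those attempts.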

  To map out the design space here, we can separate this into two parts:
 how we measure and report the usefulness of a proxy, and what we do with
 this information.

 ==== Measuring a proxy's usefulness ====

  I see three main options here:

 A. Have proxies self-report a metric like the number of datachannel
 timeouts mentioned above.

  (+) This is very easy to implement and gives us a good idea of how many
 clients a proxy works with.
  (-) This is prone to denial-of-service attacks. A proxy can self-report
 as good while not functioning properly, or an adversarial client can
 purposefully fail to open a datachannel, causing an honest proxy to believe
 it isn't useful.

 B. Give proxies long-term identifiers and have clients report to the
 broker the IDs of failed proxies the next time they poll

  (+) We've already put a little bit of thought into this. It would require
 an implementation of #29260 and a modification of the client-broker
 protocol, which shouldn't be too difficult.
  (+) Here we could restrict the denial of service by an adversarial client
 based on IP address. A single client IP could be rate limited on reporting
 bad proxies and could only report on each proxy once.
  (+) Proxies don't have to be trusted here.
  (-) This adds complexity to the system.
  (-) There are still some denial of service attacks possible if we're not
 careful. We should take into account client successes as well as failures
 to ensure that proxies aren't rejoining with different IDs, and make sure
 honest client successes aren't drowned out by adversarial failure reports.
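
 To make the bookkeeping for option B more concrete, here is a minimal
 sketch of how a broker might tally client reports. Everything in it is an
 assumption for illustration: the type and method names are hypothetical,
 proxy IDs stand in for the long-term identifiers of #29260, and a fixed
 per-IP cap stands in for a real time-windowed rate limit. It counts each
 (client IP, proxy ID) pair at most once and records successes as well as
 failures.

```go
package main

// reportKey identifies one (client IP, proxy ID) pair so that each client
// address is counted at most once per proxy.
type reportKey struct {
	clientIP string
	proxyID  string
}

// tally holds distinct-client success and failure counts for one proxy.
type tally struct {
	successes int
	failures  int
}

// ReportLog is a hypothetical broker-side store of client reports.
type ReportLog struct {
	maxPerClient int                // cap on reports accepted per client IP
	perClient    map[string]int     // reports accepted so far, by client IP
	seen         map[reportKey]bool // pairs that have already reported
	proxies      map[string]*tally  // counts by proxy ID
}

// NewReportLog returns an empty log accepting at most maxPerClient
// reports from any single client IP.
func NewReportLog(maxPerClient int) *ReportLog {
	return &ReportLog{
		maxPerClient: maxPerClient,
		perClient:    make(map[string]int),
		seen:         make(map[reportKey]bool),
		proxies:      make(map[string]*tally),
	}
}

// Report records one client's outcome with one proxy. It returns false if
// the report was dropped (duplicate pair, or client over its cap).
func (l *ReportLog) Report(clientIP, proxyID string, ok bool) bool {
	k := reportKey{clientIP: clientIP, proxyID: proxyID}
	if l.seen[k] || l.perClient[clientIP] >= l.maxPerClient {
		return false
	}
	l.seen[k] = true
	l.perClient[clientIP]++
	t := l.proxies[proxyID]
	if t == nil {
		t = &tally{}
		l.proxies[proxyID] = t
	}
	if ok {
		t.successes++
	} else {
		t.failures++
	}
	return true
}

// FailureRate returns the fraction of accepted reports for proxyID that
// were failures. The second value is false if there are no reports yet.
func (l *ReportLog) FailureRate(proxyID string) (float64, bool) {
	t := l.proxies[proxyID]
	if t == nil {
		return 0, false
	}
	return float64(t.failures) / float64(t.successes+t.failures), true
}
```

 This sketch does nothing about proxies rejoining under new IDs or about
 an adversary with many client IPs, so those concerns would still need
 separate handling.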

 C. Have an external probe behind different NATs determine how useful a
 proxy is

  This is essentially a modification of #32938.

  (+) Denial of service attacks are harder
  (-) Still requires honest self-reporting or the implementation of long-
 term identifiers (#29260)
  (-) Adds a lot more moving parts and single points of failure. What if
 this probe service goes down? How will we make sure we have a variety of
 NATs? Who is responsible for it?

 ==== What to do with less useful proxies ====

  The drawback to completely disabling a proxy just because it's behind a
 more restrictive NAT is that we'll be throwing out proxies that could
 still be useful for other clients and discouraging people from
 participating. It would be frustrating to find that your proxy isn't
 useful even though you are able to use other WebRTC tools (although these
 usually aren't P2P).

  However, telling proxies to poll less frequently doesn't actually make
 them more useful. It just makes it more likely that other fixes, like
 multiplexing (#25723), end up with at least one more permissive/robust
 proxy.
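
 One way to have a less useful proxy poll less often is to stretch its
 polling interval as failures accumulate. The sketch below assumes a
 simple doubling schedule with a cap; the function name, the doubling
 rule, and the parameters are hypothetical choices for illustration, not
 the behavior of any existing Snowflake proxy.

```go
package main

import "time"

// PollInterval returns how long a proxy should wait before polling the
// broker again. Hypothetical sketch: the interval starts at base (assumed
// positive), doubles for each consecutive failure, and never exceeds max.
func PollInterval(base, max time.Duration, consecutiveFailures int) time.Duration {
	interval := base
	// Stop doubling as soon as the cap is reached to avoid overflow.
	for i := 0; i < consecutiveFailures && interval < max; i++ {
		interval *= 2
	}
	if interval > max {
		interval = max
	}
	return interval
}
```

 A proxy that starts succeeding again would pass a failure count of zero
 and return to the base interval, so it is slowed down rather than removed.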

 == Option 2: Distribute proxies to clients based on their compatibility
 with each other ==

  I suggested this in comment:14 and while I like it in theory, it's
 difficult to do in practice, and we'd likely end up relying on heuristics
 similar to the datachannel timeouts in Option 1. It's possible that we
 could modify the STUN library to notice which candidates are chosen or
 what IP:port we're talking to in order to infer over multiple connections
 what kind of NAT topology we have, but I suspect this is more trouble
 than it's worth. Datachannel timeouts will likely give us a pretty good
 idea of what kind of NAT we have.

  So, this option would be to take whatever measurement technique is best
 from Option 1 and also have clients measure their own success rate. These
 two measurements are then used together when the client polls the broker
 to get a proxy that's compatible with the client. If a client finds that
 most of their connections succeed, the broker can give them a proxy that
 works a lower percentage of the time. If a client typically has
 difficulty, the broker can give them a more permissive (i.e. higher
 success rate) proxy.
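
 The matching rule in the paragraph above could be sketched roughly as
 follows. This is only one crude reading of it: the Proxy type, the
 PickProxy function, and the single threshold that separates clients who
 usually succeed from those who don't are all hypothetical, and a real
 broker would also have to balance load across proxies.

```go
package main

// Proxy is a hypothetical broker-side record of an available proxy.
type Proxy struct {
	ID          string
	SuccessRate float64 // measured fraction of successful connections, 0..1
}

// PickProxy chooses a proxy for a client based on the client's own
// measured success rate. Clients at or above threshold get the proxy with
// the lowest success rate, saving permissive proxies for clients that
// need them; other clients get the proxy with the highest success rate.
// The second return value is false if no proxies are available.
func PickProxy(proxies []Proxy, clientSuccessRate, threshold float64) (Proxy, bool) {
	if len(proxies) == 0 {
		return Proxy{}, false
	}
	wantLow := clientSuccessRate >= threshold
	best := proxies[0]
	for _, p := range proxies[1:] {
		if wantLow && p.SuccessRate < best.SuccessRate {
			best = p
		}
		if !wantLow && p.SuccessRate > best.SuccessRate {
			best = p
		}
	}
	return best, true
}
```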

  This requires more complex logic at the broker, an implementation of
 reliability measurements at the proxy and client, and a change in the
 protocol between the broker and these pieces. It doesn't seem too
 difficult though.

 == Option 3: Configure a TURN server to fall back on (#25596) ==

  Maybe we want to do this anyway as a short-term fix, but as mentioned
 above, I have my doubts that this can be a longer-term solution.

 Personally, I think we should go with Option 1 first and then decide if we
 want to layer Option 2 on top of it to make less permissive proxies more
 useful again. For measuring a proxy's usefulness, I'd suggest starting
 with option A since it's the easiest, and then seriously considering
 option B, since I think that will protect us more against denial of
 service attacks in the long run.

 I'd prefer to have the less reliable proxies poll less often at the moment
 instead of completely disabling them, since disabling them will cause
 people to get frustrated and drop out of participating even though they
 still provide some value. That means moving on #25598.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33666#comment:16>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online