[tor-bugs] #29206 [Circumvention/Snowflake]: New design for client -- server protocol for Snowflake

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Oct 23 22:33:45 UTC 2019


#29206: New design for client -- server protocol for Snowflake
-----------------------------------------------+---------------------------
 Reporter:  cohosh                             |          Owner:  cohosh
     Type:  task                               |         Status:
                                               |  needs_review
 Priority:  Medium                             |      Milestone:
Component:  Circumvention/Snowflake            |        Version:
 Severity:  Normal                             |     Resolution:
 Keywords:  anti-censorship-roadmap-september  |  Actual Points:
Parent ID:                                     |         Points:  6
 Reviewer:  dcf                                |        Sponsor:
                                               |  Sponsor28-must
-----------------------------------------------+---------------------------
Changes (by cohosh):

 * status:  needs_revision => needs_review


Comment:

 Replying to [comment:28 dcf]:

 Thanks! This feedback was very helpful. I made several changes to the
 implementation. [https://github.com/cohosh/snowflake/tree/sequencing2
 Here] is a series of commits on top of the old branch, and
 [https://github.com/cohosh/snowflake/tree/sequencing2_squashed here] is a
 newly squashed version. The main changes are:
 - I restructured the protocol a bit to only send the session ID at the
 start. This works as long as we're using the WebRTC datachannel with
 full reliability, and I think it's worth doing for this very simple
 precursor to Turbo Tunnel. The reason for the change was to simplify the
 calls to `NewSnowflake` so we don't have to pass in the header we read
 in order to determine which `SnowflakeConn` the snowflake belongs to.
 The drawback is that the server has to remember to call `ReadSessionID`
 before calling `Read` (which should be straightforward, because `Read`
 should only be called on a `SnowflakeConn`, and at the start the WebRTC
 connection doesn't belong to a `SnowflakeConn` yet); there's a rough
 sketch of this flow below the list.

  Basically, throughout writing this, I've tried to keep the client and
 server as symmetric as possible so that we share most of the code. This is
 a bit different from the approach taken in Turbo Tunnel.

 - I fixed the race conditions and found a bug that was causing a
 segfault. The server now seems to run consistently without crashing.

 To address specific comments:
 > Could `Flurry.conn` be a plain `net.Conn`, and `Flurry.or` also? Or do
 they need the specialization to `*proto.SnowflakeConn` and
 `*net.TCPConn`?
 Not the way I've written it right now, because of the calls to
 `Flurry.conn.NewSnowflake`
 [https://github.com/cohosh/snowflake/blob/sequencing_squashed/server/server.go#L186
 here]. I don't think generalizing this at the expense of complexity at the
 client is desirable, since Flurries are specific to Snowflake connections.
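
 To illustrate, here's a minimal sketch of that point (the `Flurry` field
 types match the branch, but the `NewSnowflake` signature and the
 `replaceSnowflake` helper are assumptions):

 {{{#!go
 type Flurry struct {
     conn *proto.SnowflakeConn // concrete type: NewSnowflake needed below
     or   *net.TCPConn
 }

 // replaceSnowflake is a hypothetical helper showing why the concrete
 // type matters: NewSnowflake isn't part of net.Conn, so a plain
 // net.Conn field would force a type assertion here, e.g.
 //     f.conn.(*proto.SnowflakeConn).NewSnowflake(newConn)
 func (f *Flurry) replaceSnowflake(newConn net.Conn) error {
     return f.conn.NewSnowflake(newConn)
 }
 }}}
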
 > > I see some motivation for another feature that allows us to set some
 kind of FIN or RST flag to notify the client that the OR port has been
 closed and the server that the client has closed the connection.
 >
 > Yes, but in the Tor model, it essentially never happens that a
 relay/bridge closes an ORPort connection, right? It's usually the client
 that decides when to disconnect.
 The exception is if the client is giving the bridge bad data. Of course
 that '''shouldn't''' happen unless there is a bug in the client, so it's
 probably ok to just let that case time out. I think you're right here
 that this feature is unnecessary.
 >
 > > Perhaps 10s is too long a timeout?
 >
 > I don't understand this. Do you mean too ''short'' a timeout? As a
 retransmission timer, 10 s doesn't seem short, but as a timer that
 terminates the whole end-to-end connection, it does seem short. Since in
 this design, there's no retransmission except when kicked off by a
 `NewSnowflake` transition, it might be worth increasing the timeout.

 You're right that 10s is short for a network connection timeout. This
 brings us to what remains a tricky engineering challenge here. The goal
 of the sequencing layer is to allow the client to recover from a faulty
 snowflake. However, it takes 30s for the connection to go stale in the
 client's `checkForStaleness` function, so it takes 30s for a client to
 request a new snowflake and start sending data through it. In all of my
 tests, the SOCKS connection timed out well before the client connected
 to a new snowflake. Since a new SOCKS connection means a new snowflake
 session and a new OR connection anyway, the client never actually
 recovers and the browser reports a connection error.

 So my thought was that if we lowered it to 10s, we would have a chance
 to recover before the SOCKS connection goes stale. However, in practice
 this is still a bit too long and I'm still seeing SOCKS timeouts.

 So I'm wondering if it's ok to make this even shorter. All it means is
 that the client will abandon a stalled proxy connection for one with
 less latency, so the threshold for that can be low (maybe 2-5 seconds).
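
 Concretely, the staleness check could take the threshold as a parameter
 so it's easy to tune downward (a simplified sketch; `LastReceive` is an
 assumed accessor, not the actual client code):

 {{{#!go
 // Simplified version of the client's staleness check, with the
 // threshold pulled out so it can be lowered from 30s toward 2-5s.
 func checkForStaleness(c *WebRTCPeer, staleTimeout time.Duration) {
     for {
         // LastReceive is assumed to be updated on every read.
         if time.Since(c.LastReceive()) > staleTimeout {
             log.Printf("no data for %v -- closing stale snowflake",
                 staleTimeout)
             c.Close() // the client then collects a fresh proxy
             return
         }
         time.Sleep(time.Second)
     }
 }
 }}}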

 > Should it be an EOF error to receive future sequence numbers, as is
 tested
 [https://github.com/cohosh/snowflake/blob/1707fcc125180b73509688a510bfbe8fa697a99d/common
 /snowflake-proto/proto_test.go#L207 here]?
 The EOF was actually a result of the timed-out connection, not of a
 future packet causing `Read` to return an EOF. I rewrote that test in a
 way that should avoid confusion here.
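
 For reference, the rewritten test is shaped roughly like this (a sketch
 only; `buildPacket` is a hypothetical stand-in for the test's
 header-building helper):

 {{{#!go
 func TestFutureSequenceNumber(t *testing.T) {
     client, server := net.Pipe()
     s := proto.NewSnowflakeConn()
     if err := s.NewSnowflake(server); err != nil {
         t.Fatal(err)
     }
     // Send a packet whose sequence number is far ahead of what the
     // receiver expects; buildPacket (hypothetical) prepends a header
     // with the given sequence number.
     go client.Write(buildPacket(1000, []byte("from the future")))

     done := make(chan error, 1)
     go func() {
         buf := make([]byte, 64)
         _, err := s.Read(buf)
         done <- err
     }()
     select {
     case err := <-done:
         if err == io.EOF {
             t.Fatal("future sequence number must not read as EOF")
         }
     case <-time.After(200 * time.Millisecond):
         // Read is still blocked waiting for the expected sequence
         // number -- the desired behavior.
     }
 }
 }}}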

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29206#comment:31>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

