Hello,

I am digging this thread out to make some kind of announcement.

I am actually working on these problems, and I am developing a new approach to solve them (still at an early research stage, but with really encouraging results). Basically, I've been developing a software architecture that will let you deploy major features, hotfixes or rollbacks in a snap of the fingers (to clients and relays), without needing any relay to restart and without disturbing their ongoing connections. The good news is that, so far, the measured performance impact of this technology is fairly light.

I guess it might sound a bit mystical. So if you're interested in finding out more, I will be presenting my ongoing research at HotPETs in Stockholm (https://petsymposium.org/2019/program.php#hotpets_program); the talk is "Flexible Anonymous Network".

Oh, and there is even more promise in this technology than just addressing the 'forward compatibility in the design and backward compatibility in the implementation' headache :) But enough teasing, see you in Stockholm!

Best regards,

Florentin

On 5/23/19 8:49 PM, David Goulet wrote:
On 22 May (18:57:06), Mike Perry wrote:
Nick Mathewson:
Filename: 303-protover-removal-policy.txt
Title: When and how to remove support for protocol versions
Author: Nick Mathewson
Created: 21 May 2019
Status: Draft

1. Background

   With proposal 264, we added support for "subprotocol versions" -- a
   means to declare which features are required for participation in the
   Tor network.  We also created a mechanism (refined later in proposal
   297) for telling Tor clients and relays that they cannot participate
   effectively in the Tor network, and they need to shut down.

   In this document, we describe a policy according to which these
   decisions should be made in practice.

2. Recommending features (for clients and relays)

   A subprotocol version SHOULD become recommended soon after all
   release series that did not provide it become unsupported (within a
   month or so).

   For example, the current oldest LTS release series is 0.2.9; when it
   becomes unsupported in 2020, the oldest supported release series will
   be 0.3.5.  Suppose that 0.2.9 supports a subprotocol Cupcake=1, and
   that all stable 0.3.5.x versions support Cupcake=1-3.  Around one
   month after the end of 0.2.9 support, Cupcake=3 should become a
   _recommended_ protocol for clients and relays.

   Additionally, a feature can become _recommended_ because of security
   reasons.  If we believe that it is a terrible idea to run an old
   protocol, we can make it _recommended_ for relays or clients or both.
   We should not do this lightly, since it will be annoying.
To be clear, "_recommended_" and "_required_" terms here are from
Proposal #264, Section 4, right? Aka the consensus lines?

These only affect WARN-vs-exit behavior by clients and relays that lack
support, right? Clients and relays will still *negotiate* and use
protocol versions that they both have, even if they are not listed as
either recommended or required?
Afaiu, you can only negotiate what you know, which is the protover list you
support (the one advertised by relays).

For instance, if "Cupcake=1-3" is what you support as a client but the
recommended is "Cupcake=2-3", you can still do "1" but you will be warned.

If _required_ is, let's say, "Cupcake=3" but the client is "Cupcake=1-2", then
the client does _not_ join the network. If _required_ is "Cupcake=1-3" for both
the relay and client, then yes, they can use version "1" instead of "3" if I'm
not mistaken; otherwise "Cupcake=3" should be used.
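
The warn-versus-exit distinction discussed here can be sketched roughly as follows. This is only an illustration of one reading of proposal 264 (a node exits if it lacks any _required_ version and warns if it lacks any _recommended_ one); the function names and the single-subprotocol string format are assumptions made for the example, not Tor's actual code.

```python
def parse_versions(entry):
    """Turn "Cupcake=1-3" or "Cupcake=1,3" into a set of integer versions."""
    _, _, spec = entry.partition("=")
    versions = set()
    for part in spec.split(","):
        if not part:
            continue  # tolerate an empty list such as "Cupcake="
        low, _, high = part.partition("-")
        versions.update(range(int(low), int(high or low) + 1))
    return versions


def check_support(supported, recommended, required):
    """Return "exit", "warn" or "ok" for one subprotocol (hypothetical helper)."""
    have = parse_versions(supported)
    if parse_versions(required) - have:
        return "exit"  # a required version is missing: do not join the network
    if parse_versions(recommended) - have:
        return "warn"  # a recommended version is missing: log a warning
    return "ok"


def negotiable(client, relay):
    """Versions both sides know, i.e. the only ones they could negotiate."""
    return sorted(parse_versions(client) & parse_versions(relay))
```

With these definitions, `check_support("Cupcake=1-2", "Cupcake=3", "Cupcake=3")` returns `"exit"`, which matches the _required_ example above.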

Are there cases where they don't/won't negotiate to use a new protover
field, such as for anonymity fragmentation reasons? How do we handle
those?
As an example, for prop289 (authenticated SENDMEs), we handle that with a
consensus parameter that flips the knob for everyone at once to avoid
partitioning problems as much as possible. _And_ then the protover is moved
into the _recommended_ or _required_ field depending on where we are.

(I am trying to gauge the impact of this proposal on our ability to roll
out new features that clients can use right away vs ensure that old
clients and relays can still work. It seems to focus on the latter,
and I want to get a handle on at what expense).
 
3. Requiring features (for relays)

   We regularly update the directory authorities to require relays to
   run certain versions of Tor or later.  We generally do this after a
   short outreach campaign to get as many relays as possible to upgrade.

   We MAY make a feature required for relays one month after every
   version without it is obsolete and unsupported, though it is better
   to wait three months if possible.

   We SHOULD make a feature required for relays within 12 months after
   every version without it is obsolete and unsupported.
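
The MAY/SHOULD windows in this section can be read as a small decision rule. The sketch below is only one interpretation of the text above; the function name and the returned labels are made up for illustration.

```python
def relay_requirement_status(months_since_unsupported):
    """Classify when a feature may become required for relays.

    months_since_unsupported: months since every version lacking the
    feature became obsolete and unsupported.
    """
    if months_since_unsupported < 1:
        return "too_early"               # before the one-month MAY threshold
    if months_since_unsupported < 3:
        return "may_but_prefer_waiting"  # allowed, but three months is better
    if months_since_unsupported <= 12:
        return "may"                     # inside the 12-month SHOULD window
    return "overdue"                     # SHOULD already have been required
```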
As a cultural signaling thing, I think it is better to say to relay
operators, "keep your relay's operating system and its Tor up to date,
or please don't run it anymore (aka we'll shut it down for you)."

I think it's bad culturally if we signal to people that we need relays so
badly that it doesn't matter if they are unpatched, or if the OS is
unpatched, or if they accidentally publish their relay and ssh keys to a
public github repo. (Relays running on a system that hasn't received any
patches or security updates in 12 months is the administrator diligence
equivalent of publishing admin keys to public github, IMO, if not its
actual functional equivalent).

Not only does it encourage a sloppy mindset about paying attention to
relay systems, it also slows down our development of new protocols, and
impedes major network upgrades.
I very much agree with this. We do "force-ask" the directory authorities
to follow the latest stable, up to at worst 2 stables behind. There are reasons
for that: maintenance, but also security.

Relays have to be sharp at upgrading... A relay that is not, and that we end up
excluding from the consensus because its version is too dangerous (remember
Heartbleed), can be considered, in my opinion, more a liability than a useful
piece.

Having capacity is, in my opinion, as important as having relays that are up to
date. In every release we roll out very important features that, if not
deployed network wide, we don't get the benefit of for years to come (basically
until the previous LTS is EOL...).

And that forces us into a position of sometimes backporting big blocks of code
(the DoS subsystem is one example).

Even today, there are still 1000+ relays (on 0.2.9) that can't be used for
Onion Service v3... That is 1/6th of the network, and we released relay
support 2 years ago... And we have _specific_ code to avoid picking those
relays, so all these edge cases also accumulate in the code over time.

(As an aside, I would like to take a hard look at the LTS series, and
brainstorm how much it would cost us to provide official, reproducibly
built repos for every distribution whose LTS policies we find expensive
and cumbersome to support.. Or at least do some analysis of which changes
have been or will be extremely expensive or impossible to roll out due
to being blocked on needing to maintain the LTS).
If we could convince Debian to consider an EOL version a "security issue" and
thus agree to pull the next supported stable into their stable
package...... that would be grand, because then even Debian LTS relay operators
could still benefit from getting newer versions, improving the network and
thus the security of everyone on Tor.

I know I know, challenges and sometimes a bad idea but with this proposal, it
might be a good time to also take a hard look at how things are and change
paradigm even if it means a painful transition.

 
4. Requiring features (for clients)

   Clients take the longest time to update, and are often the least
   able to fetch upgrades. Because of this, we should be very careful
   about making subprotocol versions required on clients, and should
   only do so for fairly compelling reasons.
Is this true? From our Tor Browser metrics (which could use some kind of
totaling), it looks like most Tor Browser users upgrade pretty quickly:
https://metrics.torproject.org/webstats-tb.html

What kinds of clients don't upgrade? I got the impression that it was
mostly things like old botnet cruft that didn't..
My gut feeling is that relays actually take longer...

 
   We SHOULD NOT make a feature required for clients until it has been
   _recommended_ for clients for at least 9 months.

   We SHOULD make a feature required for clients if it has been
   _recommended_ for clients for at least 18 months.
I guess since we're talking about causing clients to exit() in both
these cases, it might be OK to be conservative here...
Honestly, a client exit()ing is indeed a pain point, but we get to that
situation because it is not safe anymore for the client to join the network.
I find that less worrying than relays starting to exit() all of a sudden
because we've pushed a required protover and we end up with 3000 dead relays...

I would be for reducing those values much more. As an example, again with
prop289 (authenticated SENDMEs), we are talking about a deployment plan that
spans almost 5 years...

We can't publish FlowCtrl=1 protover until 035 is EOL which is in 3 years and
then once we have that, we have another 9 months to go for _recommended_ and
then 18 months before we can force it in required.

This means 4.5 years of deployment for a _security_ feature that is overall
helping the network against specific attacks... I think we can do much better
and we should.
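
As a rough check of that figure, here is the arithmetic as a small sketch. It assumes that 0.3.5 reaches EOL in 36 months ("in 3 years"), that the 18-month client window is counted from the moment the protover becomes _recommended_ (so it includes the 9-month minimum), and it ignores the "month or so" between EOL and recommendation. The function name is made up for the example.

```python
MONTHS_PER_YEAR = 12


def years_until_required(months_until_eol, months_recommended):
    """Years from today until a protover can be required for clients."""
    return (months_until_eol + months_recommended) / MONTHS_PER_YEAR


# Assumed inputs: 36 months until 0.3.5 is EOL, then the 9- and 18-month windows.
earliest = years_until_required(36, 9)   # SHOULD NOT require before this point
latest = years_until_required(36, 18)    # SHOULD require by this point

print(f"earliest: {earliest} years, latest: {latest} years")
# prints "earliest: 3.75 years, latest: 4.5 years"
```

Under a different reading, where the 18 months only start after the 9 months have passed, the total would be 36 + 9 + 18 = 63 months, or 5.25 years.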

And maybe that comes with relaxing our backport policy or rethinking our LTS?
I'm not entirely sure...

Historically, we do have a quick transition when a version is EOL and the
packages follow; see the drops here:

https://metrics.torproject.org/versions.html?start=2018-02-22&end=2019-05-23

As long as packages follow, usually the majority of relays do upgrade to them
in a matter of months, which reinforces my point about Debian + packaging
;).

But again, I am really worried about future network scalability and
performance upgrades getting stalled because we don't want to change
things that fragment client anonymity.. Does that mean that for some
kinds of new features, we can't flip a switch because we're trying to
give clients another 1.5 years *past the EOL of the last LTS* to
upgrade?
As more examples here, we were forced to backport the DoS subsystem down many
versions, that was some work!... Testing all relay versions, I had to
sometimes wait weeks before my relay could get the Guard flag again... The
pain was real at that time.

If we talk in terms of tor scaling, we'll start making pretty big, drastic
changes to the protocol or even just to how the tor binary operates. If every
step takes _years_, we'll fail this "tor game" over time, in my opinion.

I know pastly has some results about how different cell scheduling (KIST,
KISTLite, Vanilla) between relays is actually badly affecting the network...
and there is NO way to change that quickly until all our EOL versions die out
and the network transitions...

I'm almost at the point of proposing "remote relay upgrades" like Tor Browser
does ... :P. I know, hard, but at least we would be extremely agile at going
forward *but* also at rolling back anything that f*** the network (and it
has happened before; we had to roll out parameters). Tor Browser did insane
work there, so we could cherry-pick from that imo.

I would enjoy a session in Stockholm that walked through how we would
use this proposal and proposal 264 to roll out a handful of involved
changes, such as walking onions, onion service DoS protections, conflux,
explicit congestion control signaling, full datagram Tor, etc. 
+1

It would be awesome if such a session could result in a proposal like
this one, but the flip side: explaining how to use protovers to roll out
involved features so that clients adopt them quickly and safely (and
what sorts of changes can be done quickly, and what sorts of changes
require waiting 4 years for LTS to EOL + 1.5 more years for clients to
update so as not to fragment anonymity).
Yes, something concrete, something that after Stockholm we can be happy with
and apply it. Not just brainstorm and then this whole thing dies off.

But yes, overall, I'm in favor of thinking in terms of reducing the waiting
time for anything to be rolled out on the network instead of waiting years for
one single feature to be fully deployed. Some of that comes from us changing
our policy a bit, but a lot also comes from our relay operators being good
operators and upgrading to our stables much faster.

Cheers!
David

