tor-dev
Hi there,
I have a proposal regarding DNS name resolution.
Ticket: https://trac.torproject.org/projects/tor/ticket/34004
Proposal:
https://trac.torproject.org/projects/tor/attachment/ticket/34004/317-secure…
Implementation: https://github.com/torproject/tor/pull/1869
All functionality is behind the DNSResolver feature flag, so don't
forget to activate it before you start testing.
Please let me know what you think.
BR
Christian
On a grant from the Zcash Foundation, I've been working on a
full specification for the Walking Onions design. This is the last update
for the spec project.
My previous updates are linked below:
Week 1:
formats, preliminaries, git repositories, binary diffs,
metaformat decisions, and Merkle Tree trickery.
https://lists.torproject.org/pipermail/tor-dev/2020-March/014178.html
Week 2:
Specifying details of SNIP and ENDIVE formats, in CDDL.
https://lists.torproject.org/pipermail/tor-dev/2020-March/014181.html
Week 3:
Expanding ENDIVEs into SNIPs.
https://lists.torproject.org/pipermail/tor-dev/2020-March/014194.html
Week 4:
Voting (part 1) and extending circuits.
https://lists.torproject.org/pipermail/tor-dev/2020-March/014207.html
Week 5:
Voting (part 2)
https://lists.torproject.org/pipermail/tor-dev/2020-April/014218.html
Week 6:
Editing, voting rules, and client behavior (part 1)
https://lists.torproject.org/pipermail/tor-dev/2020-April/014222.html
Week 7:
Exit policies: How do they work?
https://lists.torproject.org/pipermail/tor-dev/2020-April/014232.html
Week 8:
Onion services, relay honesty, migration and families
https://lists.torproject.org/pipermail/tor-dev/2020-April/014255.html
Week 9:
(There was no week 9)
Week 10:
Pushing towards completion
https://lists.torproject.org/pipermail/tor-dev/2020-May/014281.html
== Since the last update
The last couple weeks of the walking onions project were mostly the
"fiddly bits" left over from previous work. I had to edit for
consistency and clarity, move text around, add specifications for
missing details, and so on. There were parts of the voting design
that didn't actually work as written, and needed to be tweaked.
Index voting was tricky for a few reasons: notably, index weights
are a complex function of bandwidths, which themselves are voted on.
I had hoped that we could move the algorithms we use for weighting
out of the consensus algorithm and into the authorities, but for
now, I think we're stuck with having them as yet another thing
authorities need to implement identically. At least the new consensus
system itself is extensible, so we have a clear path forward to a
better consensus approach for indices (if we ever figure one out).
I also needed to figure out handling for VoterCerts: I had done the
first version of the spec for how they should appear in parameters
documents before I actually figured out how voting would work, and
the two approaches weren't compatible. This took some revision, but
I think it should work out now.
Similarly, I had known that we'd need to use the "detached
signatures" mechanism from the existing consensus protocol, but the
Walking Onions spec didn't actually allow for this. Fortunately,
this was an easy change to make.
And overall, I went through a pretty large number of places in the
document where there were "XXXX" comments indicating things I had to
come back to later.
With that, the first version of the Walking Onions proposal was ready to go
out as [PROP323]. You can see a rendered version of it over at
[PROP323-RENDERED], though there will be a better URL for that in
the future.
I also put out the remaining related side proposals ([PROP321] for
families, and [PROP322] for link specifiers for directory ports).
== Project recap
When I first described Walking Onions [PROP300], I had only the
general outlines of a design for a more efficient Tor network: a
couple of tricks to allow clients to continue building secure paths
through the network without having to know a list of relays on the
network. But I had known there would be remaining problems to
solve, and listed them in that proposal.
Thanks to this grant, I've been able to flesh out Walking Onions to
a specified design that I think we could actually build. In doing
so, I ran into the typical specification curse, where each change
led to a need for other changes. Notably:
* The need for a new kind of directory-like structure led to the
need for a more flexible and parsable metadocument format, and a
more generic voting algorithm for authorities.
* When designing the new meta-format, I found a better approach
for handling expiration and clock synchronization between
clients and relays that should allow us to be a little more lax
with skew, while better resisting attacks based on serving
obsolete information.
* The search for a reasonable solution to exit policies led to
a hybrid extracted/designed approach: clustering exit ports by how
correlated their usage is, and then designing exit policies around
semantic similarity in port types.
* When enumerating the various ways that clients need to be able
to request SNIPs and circuits, I found a need for a span-based
SNIP query request, which I hadn't previously realized we'd
need. This simplifies some of our logic.
* (and more; please see previous updates.)
The resulting documents (proposals 318 through 323) are not final: I
believe that as we continue to discuss them, and eventually implement
them, we'll find further opportunities for improvement--and for fixing
things that I missed when I was designing and writing all of this. But
I think we've got a much firmer grasp of this project now.
I think there are parts that we could in principle start building
now: the preliminary proposals 318 (limiting protocol versions), 319
(cell fragmentation) and 321 (improved family handling) are high on
my list for now, if we get time.
And now the design iteration begins!
[PROP300] https://gitweb.torproject.org/torspec.git/plain/proposals/300-walking-onion…
[PROP321] https://gitweb.torproject.org/torspec.git/plain/proposals/321-happy-familie…
[PROP322] https://gitweb.torproject.org/torspec.git/plain/proposals/322-dirport-links…
[PROP323] https://gitweb.torproject.org/torspec.git/plain/proposals/323-walking-onion…
[PROP323-RENDERED] https://people.torproject.org/~nickm/volatile/rendered.html

onion_client_auth_add Flags=Permanent fails with 553 Unable to store creds for
by Patrick Schleizer 07 Jun '20
Hi!
sudo -u debian-tor socat - UNIX-CONNECT:/var/run/tor/control
AUTHENTICATE "test"
250 OK
onion_client_auth_add
m5bmcnsk64naezc26scz2xb3l3n2nd5xobsljljrpvf77tclmykn7wid
x25519:uBKh6DGrkcFxB1adYuyKQltUDDUT9IZrOsne3nfHbHI=
252 Registered client and decrypted desc
onion_client_auth_add
m5bmcnsk64naezc26scz2xb3l3n2nd5xobsljljrpvf77tclmykn7wid
x25519:uBKh6DGrkcFxB1adYuyKQltUDDUT9IZrOsne3nfHbHI= Flags=Permanent
553 Unable to store creds for
"m5bmcnsk64naezc26scz2xb3l3n2nd5xobsljljrpvf77tclmykn7wid"
tor --version
Tor version 0.4.3.5.
Should be implemented according to:
https://trac.torproject.org/projects/tor/ticket/32562
Cheers,
Patrick

onionbalance useful on same server / for high-spec non-location hidden servers?
by Patrick Schleizer 03 Jun '20
Would it be useful to run multiple Tor instances and onionbalance on the
very same server? Or does that totally defeat the purpose of onionbalance?
In my case, it's not a location hidden service. Just an alternative way
to connect to a server which is available over clearnet anyhow.
I guess the presumption of onionbalance is that a location hidden server
shouldn't produce too much Tor traffic as that would be suspicious?
Is a single Tor client a bottleneck? I.e. are multiple Tor clients more
performant than only one Tor client? Does onionbalance "only" work
around limitations of individual servers in CPU / IO / bandwidth?
In other words, assume CPU / IO / bandwidth is "unlimited" on one
server. (And ignore failover.) Does it make sense to run multiple Tor
instances and onionbalance or would a single Tor instance without
onionbalance be sufficient?
Cheers,
Patrick
```
Filename: 323-walking-onions-full.md
Title: Specification for Walking Onions
Author: Nick Mathewson
Created: 3 June 2020
Status: Draft
```
<!-- Section 1 --> <a id='S1'></a>
# Introduction: A Specification for Walking Onions
In Proposal 300, I introduced Walking Onions, a design for scaling
Tor and simplifying clients, by removing the requirement that every
client know about every relay on the network.
This proposal will elaborate on the original Walking Onions idea,
and should provide enough detail to allow multiple compatible
implementations. In this introduction, I'll start by summarizing the
key ideas of Walking Onions, and then outline how the rest of this
proposal will be structured.
<!-- Section 1.1 --> <a id='S1.1'></a>
## Remind me about Walking Onions again?
With Tor's current design, every client downloads and refreshes a
set of directory documents that describe the directory authorities'
views about every single relay on the Tor network. This requirement
makes directory bandwidth usage grow quadratically, since the
directory size grows linearly with the number of relays, and it is
downloaded a number of times that grows linearly with the number of
clients. Additionally, low-bandwidth clients and bootstrapping
clients spend a disproportionate amount of their bandwidth loading
directory information.
With these drawbacks, why does Tor still require clients to
download a directory? It does so in order to prevent attacks that
would be possible if clients let somebody else choose their
paths through the network, or if each client chose its paths from a
different subset of relays.
Walking Onions is a design that resists these attacks without
requiring clients ever to have a complete view of the network.
You can think of the Walking Onions design like this: Imagine that with
the current Tor design, the client covers a wall with little pieces
of paper, each representing a relay, and then throws a dart at the wall
to pick a relay. Low-bandwidth relays get small pieces of paper;
high-bandwidth relays get large pieces of paper. With the Walking
Onions design, however, the client throws its dart at a _blank
wall_, notes the position of the dart, and asks for the relay whose
paper _would be_ at that position on a "standard wall". These
"standard walls" are mapped out by directory authorities in advance,
and are authenticated in such a way that the client can receive a
proof of a relay's position on the wall without actually having to
know the whole wall.
Because the client itself picks the position on the wall, and
because the authorities must vote together to build a set of
"standard walls", nobody else controls the client's path through the
network, and all clients can choose their paths in the same way.
But since clients only probe one position on the wall at a time,
they don't need to download a complete directory.
(Note that there has to be more than one wall at a time: the client
throws darts at one wall to pick guards, another wall to pick
middle relays, and so on.)
In Walking Onions, we call a collection of standard walls an
"ENDIVE" (Efficient Network Directory with Individually Verifiable
Entries). We call each of the individual walls a "routing index",
and we call each of the little pieces of paper describing a relay and
its position within the routing index a "SNIP" (Separable Network
Index Proof).
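To make the "blank wall" picture concrete, here is a small, non-normative Python sketch. The pre-built routing index, its range boundaries, and the relay names are all invented for illustration; in the real protocol the client sends its chosen position to a relay, which replies with the SNIP proving which range covers that position.
```
# Toy illustration of index-based path selection (not a wire format).
# A "routing index" maps contiguous, bandwidth-weighted ranges of a
# 32-bit space onto relays; the client only has to pick a random position.
import random

# Hypothetical pre-built index: (lo, hi, relay_name), ranges inclusive.
ROUTING_INDEX = [
    (0x00000000, 0x3FFFFFFF, "relay-A"),   # lower-bandwidth relay
    (0x40000000, 0xBFFFFFFF, "relay-B"),   # higher-bandwidth relay
    (0xC0000000, 0xFFFFFFFF, "relay-C"),
]

def pick_relay(index):
    """Throw a dart at the 'blank wall': choose a random 32-bit position,
    then see which relay's range covers it."""
    pos = random.getrandbits(32)
    for lo, hi, relay in index:
        if lo <= pos <= hi:
            return pos, relay
    raise AssertionError("index does not cover the whole space")

if __name__ == "__main__":
    position, relay = pick_relay(ROUTING_INDEX)
    print(f"dart landed at {position:#010x}, served by {relay}")
```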
For more details about the key ideas behind Walking Onions, see
proposal 300. For more detailed analysis and discussion, see
"Walking Onions: Scaling Anonymity Networks while Protecting Users"
by Komlo, Mathewson, and Goldberg.
<!-- Section 1.2 --> <a id='S1.2'></a>
## The rest of this document
This proposal is unusually long, since Walking Onions touches on many
aspects of Tor's functionality. It requires changes to voting,
directory formats, directory operations, circuit building, path
selection, client operations, and more. These changes are described in the
sections listed below.
Here in section 1, we briefly reintroduce Walking Onions, and talk
about the rest of this proposal.
Section 2 will describe the formats for ENDIVEs, SNIPs, and related
documents.
Section 3 will describe new behavior for directory authorities as
they vote on and produce ENDIVEs.
Section 4 describes how relays fetch and reconstruct ENDIVEs from
the directory authorities.
Section 5 has the necessary changes to Tor's circuit extension
protocol so that clients can extend to relays by index position.
Section 6 describes new behaviors for clients as they use Walking
Onions, to retain existing Tor functionality for circuit construction.
Section 7 explains how to implement onion services using Walking
Onions.
Section 8 describes small alterations in client and relay behavior
to strengthen clients against some kinds of attacks based on relays
picking among multiple ENDIVEs, while still making the voting
system robust against transient authority failures.
Section 9 closes with a discussion of how to migrate from the
existing Tor design to the new system proposed here.
<!-- Section 1.2.1 --> <a id='S1.2.1'></a>
### Appendices
Additionally, this proposal has several appendices:
Appendix A defines commonly used terms.
Appendix B provides definitions for CDDL grammar productions that
are used elsewhere in the documents.
Appendix C lists the new elements in the protocol that will require
assigned values.
Appendix D lists new network parameters that authorities must vote
on.
Appendix E gives a sorting algorithm for a subset of the CBOR object
representation.
Appendix F gives an example set of possible "voting rules" that
authorities could use to produce an ENDIVE.
Appendix G lists the different routing indices that will be required
in a Walking Onions deployment.
Appendix H discusses partitioning TCP ports into a small number of
subsets, so that relays' exit policies can be represented only as
the group of ports that they support.
Appendix Z closes with acknowledgments.
<!-- Section 1.2.2 --> <a id='S1.2.2'></a>
### Related proposals
The following proposals are not part of the Walking Onions proposal,
but they were written at the same time, and are either helpful or
necessary for its implementation.
318-limit-protovers.md restricts the allowed version numbers for
each subprotocol to the range 0..63.
319-wide-everything.md gives a general mechanism for splitting relay
commands across more than one cell.
320-tap-out-again.md attempts to remove the need for TAP keys in
the HSv2 protocol.
321-happy-families.md lets families be represented with a single
identifier, rather than a long list of keys.
322-dirport-linkspec.md allows a directory port to be represented
with a link specifier.
<!-- Section 2 --> <a id='S2'></a>
# Document Formats: ENDIVEs and SNIPs
Here we specify a pair of related document formats that we will
use for specifying SNIPs and ENDIVEs.
Recall from proposal 300 that a SNIP is a set of information about
a single relay, plus proof from the directory authorities that the
given relay occupies a given range in a certain routing index.
For example, we can imagine that a SNIP might say:
* Relay X has the following IP, port, and onion key.
* In the routing index Y, it occupies index positions 0x20002
through 0x23000.
* This SNIP is valid on 2020-12-09 00:00:00, for one hour.
* Here is a signature of all the above text, using a threshold
signature algorithm.
You can think of a SNIP as a signed combination of a routerstatus and
a microdescriptor... together with a little bit of the randomized
routing table from Tor's current path selection code, all wrapped
in a signature.
Every relay keeps a set of SNIPs, and serves them to clients when
the client is extending by a routing index position.
An ENDIVE is a complete set of SNIPs. Relays download ENDIVEs, or
diffs between ENDIVEs, once every voting period. We'll accept some
complexity in order to make these diffs small, even though some of the
information in them (particularly SNIP signatures and index
ranges) will tend to change with every period.
<!-- Section 2.1 --> <a id='S2.1'></a>
## Preliminaries and scope
<!-- Section 2.1.1 --> <a id='S2.1.1'></a>
### Goals for our formats
We want SNIPs to be small, since they need to be sent on the wire
one at a time, and won't get much benefit from compression. (To
avoid a side-channel, we want CREATED cells to all be the same
size, which means we need to pad up to the largest size possible
for a SNIP.)
We want to place as few requirements on clients as possible, and we
want to preserve forward compatibility.
We want ENDIVEs to be compressible, and small. We want successive
ENDIVEs to be textually similar, so that we can use diffs to
transmit only the parts that change.
We should preserve our policy of requiring only loose time
synchronization between clients and relays, and allow even looser
synchronization when possible. Where possible, we'll make the
permitted skew explicit in the protocol: for example, rather than
saying "you can accept a document 10 minutes before it is valid", we
will just make the validity interval start 10 minutes earlier.
<!-- Section 2.1.2 --> <a id='S2.1.2'></a>
### Notes on Metaformat
In the format descriptions below, we will describe a set of
documents in the CBOR metaformat, as specified in RFC 7049. If
you're not familiar with CBOR, you can think of it as a simple
binary version of JSON, optimized first for simplicity of
implementation and second for space.
I've chosen CBOR because it's schema-free (you can parse it
without knowing what it is), terse, dumpable as text, extensible,
standardized, and very easy to parse and encode.
We will choose to represent many size-critical types as maps whose
keys are short integers: this is slightly shorter in its encoding
than string-based dictionaries. In some cases, we make types even
shorter by using arrays rather than maps, but only when we are
confident we will not have to make changes to the number of elements
in the future.
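As a rough illustration of why integer keys help, the following sketch (assuming the third-party `cbor2` Python library, which is not otherwise part of this proposal) compares the encoded sizes of an integer-keyed and a string-keyed map holding the same values:
```
# Size comparison of integer-keyed vs. string-keyed CBOR maps.
# Requires the third-party 'cbor2' library (pip install cbor2).
import cbor2

int_keyed = {0: b"\x01" * 32, 1: b"\x02" * 32}
str_keyed = {"identity": b"\x01" * 32, "onion-key": b"\x02" * 32}

print(len(cbor2.dumps(int_keyed)))   # each small-int key costs one byte
print(len(cbor2.dumps(str_keyed)))   # each short string key costs len(key)+1 bytes
```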
We'll use CDDL (defined in RFC 8610) to describe the data in a way
that can be validated -- and hopefully, in a way that will make it
comprehensible. (The state of CDDL tooling is a bit lacking at the
moment, so my CDDL validation will likely be imperfect.)
We make the following restrictions to CBOR documents that Tor
implementations will _generate_:
* No floating-point values are permitted.
* No tags are allowed unless otherwise specified.
* All items must follow the rules of RFC 7049 section 3.9 for
canonical encoding, unless otherwise specified.
Implementations SHOULD accept and parse documents that are not
generated according to these rules, for future extensibility.
However, implementations SHOULD reject documents that are not
"well-formed" and "valid" by the definitions of RFC 7049.
<!-- Section 2.1.3 --> <a id='S2.1.3'></a>
### Design overview: signing documents
We try to use a single document-signing approach here, using a hash
function parameterized to accommodate lifespan information and an
optional nonce.
All the signed CBOR data used in this format is represented as a
binary string, so that CBOR-processing tools are less likely to
re-encode or transform it. We denote this below with the CDDL syntax
`bstr .cbor Object`, which means "a binary string that must hold a valid
encoding of a CBOR object whose type is `Object`".
<!-- Section 2.1.4 --> <a id='S2.1.4'></a>
### Design overview: SNIP Authentication
I'm going to specify a flexible authentication format for SNIPs that
can handle threshold signatures, multisignatures, and Merkle trees.
This will give us flexibility in our choice of authentication
mechanism over time.
* If we use Merkle trees, we can make ENDIVE diffs much much smaller,
and save a bunch of authority CPU -- at the expense of requiring
slightly larger SNIPs.
* If Merkle tree root signatures are in SNIPs, SNIPs get a
bit larger, but they can be used by clients that do not have the
latest signed Merkle tree root.
* If we use threshold signatures, we need to depend on
not-yet-quite-standardized algorithms. If we use multisignatures,
then either SNIPs get bigger, or we need to put the signed Merkle
tree roots into a consensus document.
Of course, flexibility in signature formats is risky, since the more
code paths there are, the more opportunities there are for nasty bugs.
With this in mind, I'm structuring our authentication so that there
should (to the extent possible) be only a single validation path for
different uses.
With this in mind, our format is structured so that "not using a
Merkle tree" is considered, from the client's point of view, the same as
"using a Merkle tree of depth 1".
The authentication on a single SNIP is structured, in the abstract, as:
- ITEM: The item to be authenticated.
- PATH: A string of N bits, representing a path through a Merkle tree from
its root, where 0 indicates a left branch and 1 indicates a right
branch. (Note that in a left-leaning tree, the 0th leaf will have
path 000..0, the 1st leaf will have path 000..1, and so on.)
- BRANCH: A list of N digests, representing the digests for the
branches in the Merkle tree that we are _not_ taking.
- SIG: A generalized signature (either a threshold signature or a
multisignature) of a top-level digest.
- NONCE: an optional nonce for use with the hash functions.
Note that PATH here is a bitstring, not an integer! "0001" and "01" are
different paths, and "" is a valid path, indicating the root of the tree.
We assume two hash functions here: `H_leaf()` to be used with leaf
items, and `H_node()` to be used with intermediate nodes. These functions
are parameterized with a path through the tree, with
the lifespan of the object to be signed, and with a nonce.
To validate the authentication on a SNIP, the client proceeds as follows:
Algorithm: Validating SNIP authentication
Let N = the length of PATH, in bits.
Let H = H_leaf(PATH, LIFESPAN, NONCE, ITEM).
While N > 0:
Remove the last bit of PATH; call it P.
Remove the last digest of BRANCH; call it B.
If P is zero:
Let H = H_node(PATH, LIFESPAN, NONCE, H, B)
else:
Let H = H_node(PATH, LIFESPAN, NONCE, B, H)
Let N = N - 1
Check whether SIG is a correct (multi)signature over H with the
correct key(s).
Parameterization on this structure is up to the authorities. If N is
zero, then we are not using a Merkle tree. The generalized signature
SIG can either be given as part of the SNIP, or as part of a consensus
document. I expect that in practice, we will converge on a single set of
parameters here quickly (I'm favoring BLS signatures and a Merkle
tree), but using this format will give clients the flexibility to handle
other variations in the future.
For our definition of `H_leaf()` and `H_node()`, see "Digests and
parameters" below.
<!-- Section 2.1.5 --> <a id='S2.1.5'></a>
### Design overview: timestamps and validity.
For future-proofing, SNIPs and ENDIVEs have separate time ranges
indicating when they are valid. Unlike with current designs, these
validity ranges should take clock skew into account, and should not
require clients or relays to deliberately add extra tolerance to their
processing. (For example, instead of saying that a document is "fresh"
for three hours and then telling clients to accept documents for 24
hours before they are valid and 24 hours after they are expired, we will
simply make the documents valid for 51 hours.)
We give each lifespan as a (PUBLISHED, PRE, POST) triple, such that
objects are valid from (PUBLISHED - PRE) through (PUBLISHED + POST).
(The "PUBLISHED" time is provided so that we can more reliably tell
which of two objects is more recent.)
Later (see section 08), we'll explain measures to ensure that
hostile relays do not take advantage of multiple overlapping SNIP
lifetimes to attack clients.
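For illustration, here is a minimal sketch of how a client might apply such a lifespan, with no extra tolerance added on top of PRE and POST:
```
import time

def lifespan_is_valid(published, pre_valid, post_valid, now=None):
    # Valid from (PUBLISHED - PRE) through (PUBLISHED + POST), inclusive.
    now = int(time.time()) if now is None else now
    return (published - pre_valid) <= now <= (published + post_valid)

def newer(lifespan_a, lifespan_b):
    # The object with the larger PUBLISHED value is the more recent one.
    return lifespan_a if lifespan_a[0] >= lifespan_b[0] else lifespan_b
```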
<!-- Section 2.1.6 --> <a id='S2.1.6'></a>
### Design overview: how the formats work together
Authorities, as part of their current voting process, will produce an
ENDIVE.
Relays will download this ENDIVE (either directly or as a diff),
validate it, and extract SNIPs from it. Extracting these SNIPs may be
trivial (if they are signed individually), or more complex (if they are
signed via a Merkle tree, and the Merkle tree needs to be
reconstructed). This complexity is acceptable only to the extent that
it reduces compressed diff size.
Once the SNIPs are reconstructed, relays will hold them and serve them
to clients.
<!-- Section 2.1.7 --> <a id='S2.1.7'></a>
### What isn't in this section
This section doesn't tell you what the different routing indices
are or mean. For now, we can imagine there being one routing index for
guards, one for middles, and one for exits, and one for each hidden
service directory ring. (See section 06 for more on regular indices,
and section 07 for more on onion services.)
This section doesn't give an algorithm for computing ENDIVEs from
votes, and doesn't give an algorithm for extracting SNIPs from an ENDIVE.
Those come later. (See sections 03 and 04 respectively.)
<!-- Section 2.2 --> <a id='S2.2'></a>
## SNIPs
Each SNIP has three pieces: the part of the SNIP that describes the
router, the part that describes the SNIP's place within an ENDIVE, and
the part that authenticates the whole SNIP.
Why two _separate_ authenticated pieces? Because one (the router
description) is taken verbatim from the ENDIVE, and the other
(the location within the ENDIVE) is computed from the ENDIVE by the
relays. Separating them like this helps ensure that the part
generated by the relay and the part generated by the authorities
can't interfere with each other.
; A SNIP, as it is sent from the relay to the client. Note that
; this is represented as a three-element array.
SNIP = [
; First comes the signature. This is computed over
; the concatenation of the two bstr objects below.
auth: SNIPSignature,
; Next comes the location of the SNIP within the ENDIVE.
index: bstr .cbor SNIPLocation,
; Finally comes the information about the router.
router: bstr .cbor SNIPRouterData,
]
(Computing the signature over a concatenation of objects is safe, since
the objects' content is self-describing CBOR, and isn't vulnerable to
framing issues.)
<!-- Section 2.2.1 --> <a id='S2.2.1'></a>
### SNIPRouterData: information about a single router.
Here we talk about the type that tells a client about a single
router. For cases where we are just storing information about a
router (for example, when using it as a guard), we can remember
this part, and discard the other pieces.
The only required parts here are those that identify the router
and tell the client how to build a circuit through it. The others
are all optional. In practice, I expect they will be encoded in most
cases, but clients MUST behave properly if they are absent.
More than one SNIPRouterData may exist in the same ENDIVE for a
single router. For example, there might be a longer version to
represent a router to be used as a guard, and another to represent
the same router when used as a hidden service directory. (This is
not possible in the voting mechanism that I'm working on, but relays
and clients MUST NOT treat this as an error.)
This representation is based on the routerstatus and
microdescriptor entries of today, but tries to omit a number of
obsolete fields, including RSA identity fingerprint, TAP key,
published time, etc.
; A SNIPRouterData is a map from integer keys to values for
; those keys.
SNIPRouterData = {
; identity key.
? 0 => Ed25519PublicKey,
; ntor onion key.
? 1 => Curve25519PublicKey,
; list of link specifiers other than the identity key.
; If a client wants to extend to the same router later on,
; they SHOULD include all of these link specifiers verbatim,
; whether they recognize them or not.
? 2 => [ LinkSpecifier ],
; The software that this relay says it is running.
? 3 => SoftwareDescription,
; protovers.
? 4 => ProtoVersions,
; Family. See below for notes on dual encoding.
? 5 => [ * FamilyId ],
; Country Code
? 6 => Country,
; Exit policies describing supported port _classes_. Absent exit
; policies are treated as "deny all".
? 7 => ExitPolicy,
; NOTE: Properly speaking, there should be a CDDL 'cut'
; here, to indicate that the rules below should only match
; if one of the previous rules hasn't matched.
; Unfortunately, my CDDL tool doesn't seem to support cuts.
; For future tor extensions.
* int => any,
; For unofficial and experimental extensions.
* tstr => any,
}
; For future-proofing, we are allowing multiple ways to encode
; families. One is as a list of other relays that are in your
; family. One is as a list of authority-generated family
; identifiers. And one is as a master key for a family (as in
; Tor proposal 242).
;
; A client should consider two routers to be in the same
; family if they have at least one FamilyId in common.
; Authorities will canonicalize these lists.
FamilyId = bstr
; A country. These should ordinarily be 2-character strings,
; but I don't want to enforce that.
Country = tstr;
; SoftwareDescription replaces our old "version".
SoftwareDescription = [
software: tstr,
version: tstr,
extra: tstr
]
; Protocol versions: after a bit of experimentation, I think
; the most reasonable representation to use is a map from protocol
; ID to a bitmask of the supported versions.
ProtoVersions = { ProtoId => ProtoBitmask }
; Integer protocols are reserved for future version of Tor. tstr ids
; are reserved for experimental and non-tor extensions.
ProtoId = ProtoIdEnum / int / tstr
ProtoIdEnum = &(
Link : 0,
LinkAuth : 1,
Relay : 2,
DirCache : 3,
HSDir : 4,
HSIntro : 5,
HSRend : 6,
Desc : 7,
MicroDesc: 8,
Cons : 9,
Padding : 10,
FlowCtrl : 11,
)
; This type is limited to 64 bits, and that's fine. If we ever
; need a protocol version higher than 63, we should allocate a
; new protoid.
ProtoBitmask = uint
; An exit policy may exist in up to two variants. When port classes
; have not changed in a while, only one policy is needed. If port
; classes have changed recently, however, then SNIPs need to include
; each relay's position according to both the older and the newer policy
; until older network parameter documents become invalid.
ExitPolicy = SinglePolicy / [ SinglePolicy, SinglePolicy ]
; Each single exit policy is a tagged bit array, whose bits
; correspond to the members of the list of port classes in the
; network parameter document with a corresponding tag.
SinglePolicy = [
; Identifies which group of port classes we're talking about
tag: unsigned,
; Bit-array of which port classes this relay supports.
policy: bstr
]
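As a non-normative illustration of how a client might test a SinglePolicy bit array, here is a small Python sketch. The bit ordering used here (class 0 as the most significant bit of the first byte) is an assumption of the example, not something this section specifies.
```
# Sketch: does a SinglePolicy's bit array claim support for port class i?
# Bit-order assumption: bit 0 is the most significant bit of the first byte.
def supports_port_class(policy_bits: bytes, class_index: int) -> bool:
    byte_index, bit_index = divmod(class_index, 8)
    if byte_index >= len(policy_bits):
        return False                      # out-of-range classes are "deny"
    return bool(policy_bits[byte_index] & (0x80 >> bit_index))

# Example: a relay supporting classes 0 and 2 out of 8 port classes.
assert supports_port_class(b"\xa0", 0) is True
assert supports_port_class(b"\xa0", 1) is False
assert supports_port_class(b"\xa0", 2) is True
```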
<!-- Section 2.2.2 --> <a id='S2.2.2'></a>
### SNIPLocation: Locating a SNIP within a routing index.
The SNIPLocation type can encode where a SNIP is located with
respect to one or more routing indices. Note that a SNIPLocation
does not need to be exhaustive: If a given IndexId is not listed for
a given relay in one SNIP, it might exist in another SNIP. Clients
should not infer that the absence of an IndexId in one SNIPLocation
for a relay means that no SNIPLocation with that IndexId exists for
the relay.
; SNIPLocation: we're using a map here because it's natural
; to look up indices in maps.
SNIPLocation = {
; The keys of this mapping represent the routing indices in
; which a SNIP appears. The values represent the index ranges
; that it occupies in those indices.
* IndexId => IndexRange / ExtensionIndex,
}
; We'll define the different index ranges as we go on with
; these specifications.
;
; IndexId values over 65535 are reserved for extensions and
; experimentation.
IndexId = uint32
; An index range extends from a minimum to a maximum value.
; These ranges are _inclusive_ on both sides. If 'hi' is less
; than 'lo', then this index "wraps around" the end of the ring.
; A "nil" value indicates an empty range, which would not
; ordinarily be included.
IndexRange = [ lo: IndexPos,
hi: IndexPos ] / nil
; An ExtensionIndex is reserved for future use; current clients
; will not understand it and current ENDIVEs will not contain it.
ExtensionIndex = any
; For most routing indices, the ranges are encoded as 4-byte integers.
; But for hsdir rings, they are binary strings. (Clients and
; relays SHOULD NOT require this.)
IndexPos = uint / bstr
A bit more on IndexRanges: Every IndexRange actually describes a set of
_prefixes_ for possible index positions. For example, the IndexRange
`[ h'AB12', h'AB24' ]` includes all the binary strings that start with (hex)
`AB12`, `AB13`, and so on, up through all strings that start with `AB24`.
Alternatively, you can think of a `bstr`-based IndexRange *(lo, hi)* as
covering *lo*`00000...` through *hi*`ff...`.
IndexRanges based on the uint type work the same, except that they always
specify the first 32 bits of a prefix.
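Here is a small, non-normative Python sketch of prefix-based membership testing for a `bstr` IndexRange, including the wrap-around case; the byte padding mirrors the lo`00...` / hi`ff...` description above.
```
# Prefix-based IndexRange membership for bstr index positions:
# (lo, hi) covers lo||00... through hi||ff..., and may wrap around
# the end of the ring when hi < lo.
def in_index_range(pos: bytes, lo: bytes, hi: bytes) -> bool:
    width = max(len(lo), len(hi), len(pos))
    low  = lo.ljust(width, b"\x00")   # lo 0000...
    high = hi.ljust(width, b"\xff")   # hi ffff...
    p    = pos.ljust(width, b"\x00")
    if low <= high:
        return low <= p <= high
    return p >= low or p <= high      # wrap-around range

# Positions beginning with AB12..AB24 fall inside [ h'AB12', h'AB24' ].
assert in_index_range(bytes.fromhex("ab1390"), bytes.fromhex("ab12"), bytes.fromhex("ab24"))
assert not in_index_range(bytes.fromhex("ab25"), bytes.fromhex("ab12"), bytes.fromhex("ab24"))
```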
<!-- Section 2.2.3 --> <a id='S2.2.3'></a>
### SNIPSignature: How to prove a SNIP is in the ENDIVE.
Here we describe the types for implementing SNIP signatures, to be
validated as described in "Design overview: Authentication" above.
; Most elements in a SNIPSignature are positional and fixed
SNIPSignature = [
; The actual signature or signatures. If this is a single signature,
; it's probably a threshold signature. Otherwise, it's probably
; a list containing one signature from each directory authority.
SingleSig / MultiSig,
; algorithm to use for the path through the merkle tree.
d_alg: DigestAlgorithm,
; Path through merkle tree, possibly empty.
merkle_path: MerklePath,
; Lifespan information. This is included as part of the input
; to the hash algorithm for the signature.
LifespanInfo,
; optional nonce for hash algorithm.
? nonce: bstr,
; extensions for later use. These are not signed.
? extensions: { * any => any },
]
; We use this group to indicate when an object originated, and when
; it should be accepted.
;
; When we are using it as an input to a hash algorithm for computing
; signatures, we encode it as an 8-byte number for "published",
; followed by two 4-byte numbers for pre-valid and post-valid.
LifespanInfo = (
; Official publication time in seconds since the epoch. These
; MUST be monotonically increasing over time for a given set of
; authorities on all SNIPs or ENDIVEs that they generate: a
; document with a greater `published` time is always more recent
; than one with an earlier `published` time.
;
; Seeing a publication time "in the future" on a correctly
; authenticated document is a reliable sign that your
; clock is set too far in the past.
published: uint,
; Value to subtract from "published" in order to find the first second
; at which this object should be accepted.
pre-valid: uint32,
; Value to add to "published" in order to find the last
; second at which this object should be accepted. The
; lifetime of an object is therefore equal to "(post-valid +
; pre-valid)".
post-valid: uint32,
)
; A Lifespan is just the fields of LifespanInfo, encoded as a list.
Lifespan = [ LifespanInfo ]
; One signature on a SNIP or ENDIVE. If the signature is a threshold
; signature, or a reference to a signature in another
; document, there will probably be just one of these per SNIP. But if
; we're sticking a full multisignature in the document, this
; is just one of the signatures on it.
SingleSig = [
s_alg: SigningAlgorithm,
; One of signature and sig_reference must be present.
?signature: bstr,
; sig_reference is an identifier for a signature that appears
; elsewhere, and can be fetched on request. It should only be
; used with signature types too large to attach to SNIPs on their
; own.
?sig_reference: bstr,
; A prefix of the key or the key's digest, depending on the
; algorithm.
?keyid: bstr,
]
MultiSig = [ + SingleSig ]
; A Merkle path is represented as a sequence of bits to
; indicate whether we're going left or right, and a list of
; hashes for the parts of the tree that we aren't including.
;
; (It's safe to use a uint for the number of bits, since it will
; never overflow 64 bits -- that would mean a Merkle tree with
; too many leaves to actually calculate on.)
MerklePath = [ uint, *bstr ]
<!-- Section 2.3 --> <a id='S2.3'></a>
## ENDIVEs: sending a bunch of SNIPs efficiently.
ENDIVEs are delivered by the authorities in a compressed format, optimized
for diffs.
Note that if we are using Merkle trees for SNIP authentication, ENDIVEs do
not include the trees at all, since those can be inferred from the leaves of
the tree. Similarly, the ENDIVEs do not include raw routing indices, but
instead include a set of bandwidths that can be combined into the routing
indices -- these bandwidths change less frequently, and therefore are more
diff-friendly.
Note also that this format has more "wasted bytes" than SNIPs
do. Unlike SNIPs, ENDIVEs are large enough to benefit from
compression with gzip, lzma2, and so on.
This section does not fully specify how to construct SNIPs from an ENDIVE;
for the full algorithm, see section 04.
; ENDIVEs are also sent as CBOR.
ENDIVE = [
; Signature for the ENDIVE, using a simpler format than for
; a SNIP. Since ENDIVEs are more like a consensus, we don't need
; to use threshold signatures or Merkle paths here.
sig: ENDIVESignature,
; Contents, as a binary string.
body: encoded-cbor .cbor ENDIVEContent,
]
; The set of signatures across an ENDIVE.
;
; This type doubles as the "detached signature" document used when
; collecting signatures for a consensus.
ENDIVESignature = {
; The actual signatures on the endive. A multisignature is the
; likeliest format here.
endive_sig: [ + SingleSig ],
; Lifespan information. As with SNIPs, this is included as part
; of the input to the hash algorithm for the signature.
; Note that the lifespan of an ENDIVE is likely to be a subset
; of the lifespan of its SNIPs.
endive_lifespan: Lifespan,
; Signatures across SNIPs, at some level of the Merkle tree. Note
; that these signatures are not themselves signed -- having them
; signed would take another step in the voting algorithm.
snip_sigs: DetachedSNIPSignatures,
; Signatures across the ParamDoc pieces. Note that as with the
; DetachedSNIPSignatures, these signatures are not themselves signed.
param_doc: ParamDocSignature,
; extensions for later use. These are not signed.
* tstr => any,
}
; A list of single signatures or a list of multisignatures. This
; list must have 2^signature-depth elements.
DetachedSNIPSignatures =
[ *SingleSig ] / [ *MultiSig ]
ENDIVEContent = {
; Describes how to interpret the signatures over the SNIPs in this
; ENDIVE. See section 04 for the full algorithm.
sig_params: {
; When should we say that the signatures are valid?
lifespan: Lifespan,
; Nonce to be used with the signing algorithm for the signatures.
? signature-nonce: bstr,
; At what depth of a Merkle tree do the signatures apply?
; (If this value is 0, then only the root of the tree is signed.
; If this value is >= ceil(log2(n_leaves)), then every leaf is
; signed.).
signature-depth: uint,
; What digest algorithm is used for calculating the signatures?
signature-digest-alg: DigestAlgorithm,
; reserved for future extensions.
* tstr => any,
},
; Documents for clients/relays to learn about current network
; parameters.
client-param-doc: encoded-cbor .cbor ClientParamDoc,
relay-param-doc: encoded-cbor .cbor RelayParamDoc,
; Definitions for index groups. Each "index group" is a set of
; indices that all apply to the same SNIPs. (If there is one index group,
; then every relay is in at most one SNIP, and likely has several
; indices. If there are multiple index groups, then relays
; can appear in more than one SNIP.)
indexgroups: [ *IndexGroup ],
; Information on particular relays.
;
; (The total number of SNIPs identified by an ENDIVE is at most
; len(indexgroups) * len(relays).)
relays: [ * ENDIVERouterData ],
; for future extensions
* tstr => any,
}
; An "index group" lists a bunch of routing indices that apply to the same
; SNIPs. There may be multiple index groups when a relay needs to appear
; in different SNIPs with different routing indices for some reason.
IndexGroup = {
; A list of all the indices that are built for this index group.
; An IndexId may appear in at most one group per ENDIVE.
indices: [ + IndexId ],
; A list of keys to delete from SNIPs to build this index group.
omit_from_snips: [ *(int/tstr) ],
; A list of keys to forward from SNIPs to the next relay in an EXTEND
; cell. This can help the next relay know which keys to use in its
; handshake.
forward_with_extend: [ *(int/tstr) ],
; A number of "gaps" to place in the Merkle tree after the SNIPs
; in this group. This can be used together with signature-depth
; to give different index-groups independent signatures.
? n_padding_entries: uint,
; A detailed description of how to build the index.
+ IndexId => IndexSpec,
; For experimental and extension use.
* tstr => any,
}
; Enumeration to identify how to generate an index.
Indextype_Raw = 0
Indextype_Weighted = 1
Indextype_RSAId = 2
Indextype_Ed25519Id = 3
Indextype_RawNumeric = 4
; An indexspec may be given as a raw set of index ranges. This is a
; fallback for cases where we simply can't construct an index any other
; way.
IndexSpec_Raw = {
type: Indextype_Raw,
; This index is constructed by taking relays by their position in the
; list of ENDIVERouterData, and placing them at a given
; location in the routing index. Each index range extends up to
; right before the next index position.
index_ranges: [ * [ uint, IndexPos ] ],
}
; An indexspec given as a list of numeric spans on the index.
IndexSpec_RawNumeric = {
type: Indextype_RawNumeric,
first_index_pos: uint,
; This index is constructed by taking relays by index from the list
; of ENDIVERouterData, and giving them a certain amount of "weight"
; in the index.
index_ranges: [ * [ idx: uint, span: uint ] ],
}
; This index is computed from the weighted bandwidths of all the SNIPs.
;
; Note that when a single bandwidth changes, it can change _all_ of
; the indices in a bandwidth-weighted index, even if no other
; bandwidth changes. That's why we only pack the bandwidths
; here, and scale them as part of the reconstruction algorithm.
IndexSpec_Weighted = {
type: Indextype_Weighted,
; This index is constructed by assigning a weight to each relay,
; and then normalizing those weights. See algorithm below in section
; 04.
; Limiting bandwidth weights to uint32 makes reconstruction algorithms
; much easier.
index_weights: [ * uint32 ],
}
; This index is computed from the RSA identity key digests of all of the
; SNIPs. It is used in the HSv2 directory ring.
IndexSpec_RSAId = {
type: Indextype_RSAId,
; How many bytes of RSA identity data go into each indexpos entry?
n_bytes: uint,
; Bitmap of which routers should be included.
members: bstr,
}
; This index is computed from the Ed25519 identity keys of all of the
; SNIPs. It is used in the HSv3 directory ring.
IndexSpec_Ed25519Id = {
type: Indextype_Ed25519Id,
; How many bytes of digest go into each indexpos entry?
n_bytes: uint,
; What digest do we use for building this ring?
d_alg: DigestAlgorithm,
; What bytes do we give to the hash before the ed25519?
prefix: bstr,
; What bytes do we give to the hash after the ed25519?
suffix: bstr,
; Bitmap of which routers should be included.
members: bstr,
}
IndexSpec = IndexSpec_Raw /
IndexSpec_RawNumeric /
IndexSpec_Weighted /
IndexSpec_RSAId /
IndexSpec_Ed25519Id
; Information about a single router in an ENDIVE.
ENDIVERouterData = {
; The authority-generated SNIPRouterData for this router.
1 => encoded-cbor .cbor SNIPRouterData,
; The RSA identity, or a prefix of it, to use for HSv2 indices.
? 2 => RSAIdentityFingerprint,
* int => any,
* tstr => any,
}
; encoded-cbor is defined in the CDDL postlude as a bstr that is
; tagged as holding verbatim CBOR:
;
; encoded-cbor = #6.24(bstr)
;
; Using a tag like this helps tools that validate the string as
; valid CBOR; using a bstr helps indicate that the signed data
; should not be interpreted until after the signature is checked.
; It also helps diff tools know that they should look inside these
; objects.
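As a purely illustrative sketch (the authoritative reconstruction algorithm lives in section 04 and is not reproduced in this section), one plausible way to turn the `index_weights` of an IndexSpec_Weighted into contiguous 32-bit index ranges looks like this:
```
# Illustrative-only normalization of IndexSpec_Weighted weights into
# contiguous 32-bit index ranges; not the section-04 algorithm.
TOTAL = 1 << 32

def weights_to_ranges(index_weights):
    total_weight = sum(index_weights)
    ranges, pos = [], 0
    for w in index_weights:
        span = (w * TOTAL) // total_weight if total_weight else 0
        if span == 0:
            ranges.append(None)               # empty range, like IndexRange nil
            continue
        lo = pos
        hi = min(pos + span - 1, TOTAL - 1)
        ranges.append((lo, hi))
        pos = hi + 1
    # Any positions left over from integer truncation go to the last
    # non-empty range, so the whole space stays covered.
    for i in range(len(ranges) - 1, -1, -1):
        if ranges[i] is not None:
            ranges[i] = (ranges[i][0], TOTAL - 1)
            break
    return ranges
```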
<!-- Section 2.4 --> <a id='S2.4'></a>
## Network parameter documents
Network parameter documents ("ParamDocs" for short) take the place of the
current consensus and certificates as a small document that clients and
relays need to download periodically and keep up-to-date. They are generated
as part of the voting process, and contain fields like network parameters,
recommended versions, authority certificates, and so on.
; A "parameter document" is like a tiny consensus that relays and clients
; can use to get network parameters.
ParamDoc = [
sig: ParamDocSignature,
; Client-relevant portion of the parameter document. Everybody fetches
; this.
cbody: encoded-cbor .cbor ClientParamDoc,
; Relay-relevant portion of the parameter document. Only relays need to
; fetch this; the document can be validated without it.
? sbody: encoded-cbor .cbor RelayParamDoc,
]
ParamDocSignature = [
; Multisignature or threshold signature of the concatenation
; of the two digests below.
SingleSig / MultiSig,
; Lifespan information. As with SNIPs, this is included as part
; of the input to the hash algorithm for the signature.
; Note that the lifespan of a parameter document is likely to be
; very long.
LifespanInfo,
; how are c_digest and s_digest computed?
d_alg: DigestAlgorithm,
; Digest over the cbody field
c_digest: bstr,
; Digest over the sbody field
s_digest: bstr,
]
ClientParamDoc = {
params: NetParams,
; List of certificates for all the voters. These
; authenticate the keys used to sign SNIPs and ENDIVEs and votes,
; using the authorities' longest-term identity keys.
voters: [ + bstr .cbor VoterCert ],
; A division of exit ports into "classes" of ports.
port-classes: PortClasses,
; As in client-versions from dir-spec.txt
? recommend-versions: [ * tstr ],
; As in recommended-client-protocols in dir-spec.txt
? recommend-protos: ProtoVersions,
; As in required-client-protocols in dir-spec.txt
? require-protos: ProtoVersions,
; For future extensions.
* tstr => any,
}
RelayParamDoc = {
params: NetParams,
; As in server-versions from dir-spec.txt
? recommend-versions: [ * tstr ],
; As in recommended-relay-protocols in dir-spec.txt
? recommend-protos: ProtoVersions,
; As in required-relay-protocols in dir-spec.txt
? require-versions: ProtoVersions,
* tstr => any,
}
; A NetParams encodes information about the Tor network that
; clients and relays need in order to participate in it. The
; current list of parameters is described in the "params" field
; as specified in dir-spec.txt.
;
; Note that there are separate client and relay NetParams now.
; Relays are expected to first check for a definition in the
; RelayParamDoc, and then in the ClientParamDoc.
NetParams = { *tstr => int }
PortClasses = {
; identifies which port class grouping this is. Used to migrate
; from one group of port classes to another.
tag: uint,
; list of the port classes.
classes: { * IndexId => PortList },
}
PortList = [ *PortOrRange ]
; Either a single port or a low-high pair
PortOrRange = Port / [ Port, Port ]
Port = 1...65535
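For illustration, here is a small Python sketch of testing whether a TCP port falls into a port class expressed as a PortList of single ports and [low, high] pairs; the "web" class used below is invented.
```
# Sketch: testing whether a TCP port belongs to a port class (a PortList).
def port_in_class(port: int, port_list) -> bool:
    for entry in port_list:
        if isinstance(entry, int):
            if port == entry:
                return True
        else:
            low, high = entry
            if low <= port <= high:
                return True
    return False

web_class = [80, 443, [8000, 8999]]       # hypothetical "web" port class
assert port_in_class(8080, web_class)
assert not port_in_class(25, web_class)
```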
<!-- Section 2.5 --> <a id='S2.5'></a>
## Certificates
Voting certificates are used to bind authorities' long-term
identities to shorter-term signing keys. These have a similar
purpose to the authority certs made for the existing voting
algorithm, but support more key types.
; A 'voter certificate' is a statement by an authority binding keys to
; each other.
VoterCert = [
; One or more signatures over `content` using the provided lifetime.
; Each signature should be treated independently.
[ + SingleSig ],
; A lifetime value, used (as usual) as an input to the
; signature algorithm.
LifespanInfo,
; The keys and other data to be certified.
content: encoded-cbor .cbor CertContent,
]
; The contents of the certificate that get signed.
CertContent = {
; What kind of a certificate is this?
type: CertType,
; A list of keys that are being certified in this document
keys: [ + CertifiedKey ],
; A list of other keys that you might need to know about, which
; are NOT certified in this document.
? extra: [ + CertifiedKey ],
* tstr => any,
}
CertifiedKey = {
; What is the intended usage of this key?
usage: KeyUsage,
; What cryptographic algorithm is this key used for?
alg: PKAlgorithm,
; The actual key being certified.
data: bstr,
; A human readable string.
? remarks: tstr,
* tstr => any,
}
<!-- Section 2.6 --> <a id='S2.6'></a>
## ENDIVE diffs
Here is a binary format to be used with ENDIVEs, ParamDocs, and any
other similar binary formats. Authorities and directory caches need to
be able to generate it; clients and non-cache relays only need to be
able to parse and apply it.
; Binary diff specification.
BinaryDiff = {
; This is version 1.
v: 1,
; Optionally, a diff can say what different digests
; of the document should be before and after it is applied.
; If there is more than one entry, parties MAY check one or
; all of them.
? digest: { * DigestAlgorithm =>
[ pre: Digest,
post: Digest ]},
; Optionally, a diff can give some information to identify
; which document it applies to, and what document you get
; from applying it. These might be a tuple of a document type
; and a publication type.
? ident: [ pre: any, post: any ],
; list of commands to apply in order to the original document in
; order to get the transformed document
cmds: [ *DiffCommand ],
; for future extension.
* tstr => any,
}
; There are currently only two diff commands.
; One is to copy some bytes from the original.
CopyDiffCommand = [
OrigBytesCmdId,
; Range of bytes to copy from the original document.
; Ranges include their starting byte. The "offset" is relative to
; the end of the _last_ range that was copied.
offset: int,
length: uint,
]
; The other diff command is to insert some bytes from the diff.
InsertDiffCommand = [
InsertBytesCmdId,
data: bstr,
]
DiffCommand = CopyDiffCommand / InsertDiffCommand
OrigBytesCmdId = 0
InsertBytesCmdId = 1
Applying a binary diff is simple:
Algorithm: applying a binary diff.
(Given an input bytestring INP and a diff D, produces an output OUT.)
Initialize OUT to an empty bytestring.
Set OFFSET to 0.
For each command C in D.commands, in order:
If C begins with OrigBytesCmdId:
Increase "OFFSET" by C.offset
If OFFSET..OFFSET+C.length is not a valid range in
INP, abort.
Append INP[OFFSET .. OFFSET+C.length] to OUT.
Increase "OFFSET" by C.length
else: # C begins with InsertBytesCmdId:
Append C.data to OUT.
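The following Python sketch is a direct transcription of the algorithm above, using already-decoded CBOR commands represented as Python lists:
```
# Applying a binary diff: [0, offset, length] copies bytes from the
# original document; [1, data] inserts literal bytes from the diff.
ORIG_BYTES_CMD = 0
INSERT_BYTES_CMD = 1

def apply_binary_diff(inp: bytes, cmds) -> bytes:
    out = bytearray()
    offset = 0
    for cmd in cmds:
        if cmd[0] == ORIG_BYTES_CMD:
            _, delta, length = cmd
            offset += delta                   # offset is relative to the end
            if offset < 0 or offset + length > len(inp):
                raise ValueError("copy range outside the original document")
            out += inp[offset:offset + length]
            offset += length
        elif cmd[0] == INSERT_BYTES_CMD:
            out += cmd[1]
        else:
            raise ValueError("unknown diff command")
    return bytes(out)

# "hello world" -> "hello there world": copy 6 bytes, insert, copy the rest.
assert apply_binary_diff(b"hello world",
                         [[0, 0, 6], [1, b"there "], [0, 0, 5]]) == b"hello there world"
```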
Generating a binary diff can be trickier, and is not specified here.
There are several generic algorithms out there for making binary diffs
between arbitrary byte sequences. Since these are complex, I recommend a
chunk-based CBOR-aware algorithm, using each CBOR item much as our
current line-oriented code uses lines. When
encountering a bstr tagged with "encoded-cbor", the diff algorithm
should look inside it to find more cbor chunks. (See
example-code/cbor_diff.py for an example of doing this with Python's
difflib.)
The diff format above should work equally well no matter what
diff algorithm is used, so we have room to move to other algorithms
in the future if needed.
To indicate support for the above diff format in directory requests,
implementations should use an `X-Support-Diff-Formats` header. The
above format is designated "cbor-bindiff"; our existing format is
called "ed".
<!-- Section 2.7 --> <a id='S2.7'></a>
## Digests and parameters
Here we give definitions for `H_leaf()` and `H_node()`, based on an
underlying digest function H() with a preferred input block size of B.
(B should be chosen as the natural input size of the hash function, to
aid in precomputation.)
We also define `H_sign()`, to be used outside of SNIP authentication
where we aren't using a Merkle tree at all.
PATH must be no more than 64 bits long. NONCE must be no more than B-33
bytes long.
H_sign(LIFESPAN, NONCE, ITEM) =
H( PREFIX(OTHER_C, LIFESPAN, NONCE) || ITEM)
H_leaf(PATH, LIFESPAN, NONCE, ITEM) =
H( PREFIX(LEAF_C, LIFESPAN, NONCE) ||
U64(PATH) ||
U64(bits(PATH)) ||
ITEM )
H_node(PATH, LIFESPAN, NONCE, ITEM) =
H( PREFIX(NODE_C, LIFESPAN, NONCE) ||
U64(PATH) ||
U64(bits(PATH)) ||
ITEM )
PREFIX(leafcode, lifespan, nonce) =
U64(leafcode) ||
U64(lifespan.published) ||
U32(lifespan.pre-valid) ||
U32(lifespan.post-valid) ||
U8(len(nonce)) ||
nonce ||
Z(B - 33 - len(nonce))
LEAF_C = 0x8BFF0F687F4DC6A1 ^ NETCONST
NODE_C = 0xA6F7933D3E6B60DB ^ NETCONST
OTHER_C = 0x7365706172617465 ^ NETCONST
# For the live Tor network only.
NETCONST = 0x0746f72202020202
# For testing networks, by default.
NETCONST = 0x74657374696e6720
U64(n) -- N encoded as a big-endian 64-bit number.
Z(n) -- N bytes with value zero.
len(b) -- the number of bytes in a byte-string b.
bits(b) -- the number of bits in a bit-string b.
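As a non-normative example, here is a Python sketch of `PREFIX()` and `H_leaf()` instantiated with SHA-256 (input block size B = 64) and the live-network NETCONST. How the PATH bitstring is packed into `U64(PATH)` is not pinned down in this section, so reading it as a big-endian integer below is an assumption of the sketch.
```
import hashlib, struct

B = 64                                   # SHA-256 input block size, in bytes
NETCONST = 0x0746F72202020202            # live-network constant from above
LEAF_C = 0x8BFF0F687F4DC6A1 ^ NETCONST

def u64(n): return struct.pack(">Q", n)
def u32(n): return struct.pack(">I", n)
def u8(n):  return struct.pack(">B", n)

def prefix(code, published, pre_valid, post_valid, nonce):
    # PREFIX(leafcode, lifespan, nonce), padded with zero bytes as above.
    assert len(nonce) <= B - 33
    return (u64(code) + u64(published) + u32(pre_valid) + u32(post_valid) +
            u8(len(nonce)) + nonce + b"\x00" * (B - 33 - len(nonce)))

def h_leaf(path_bits, lifespan, nonce, item):
    # path_bits is a string of '0'/'1'; interpreting it as a big-endian
    # integer for U64(PATH) is an assumption of this sketch.
    published, pre_valid, post_valid = lifespan
    path_int = int(path_bits, 2) if path_bits else 0
    data = (prefix(LEAF_C, published, pre_valid, post_valid, nonce) +
            u64(path_int) + u64(len(path_bits)) + item)
    return hashlib.sha256(data).digest()
```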
<!-- Section 3 --> <a id='S3'></a>
# Directory authority operations
For Walking Onions to work, authorities must begin to generate
ENDIVEs as a new kind of "consensus document". Since this format is
incompatible with the previous consensus document formats, and is
CBOR-based, a text-based voting protocol is no longer appropriate
for generating it.
We cannot immediately abandon the text-based consensus and
microdescriptor formats, but instead will need to keep
generating them for legacy relays and clients. Ideally, the process
that produces the ENDIVE should also produce a legacy consensus,
to limit the amount of divergence in their contents.
Further, it would be good for the purposes of this proposal if we
can "inherit" as much as possible of our existing voting mechanism
for legacy purposes.
This section of the proposal will try to meet these goals by defining a
new binary-based voting format, a new set of voting rules for it, and a
series of migration steps.
<!-- Section 3.1 --> <a id='S3.1'></a>
## Overview
Except as described below, we retain from Tor's existing voting
mechanism all notions of how votes are transferred and processed.
Other changes are likely desirable, but they are out of scope for
this proposal.
Notably, we are not changing how the voting schedule works. Nor are
we changing the property that all authorities must agree on the list
of authorities; the property that a consensus is computed as a
deterministic function of a set of votes; or the property that if
authorities believe in different sets of votes, they will not reach
the same consensus.
The principal changes in the voting that are relevant for legacy
consensus computation are:
* The uploading process for votes now supports negotiation, so
that the receiving authority can tell the uploading authority
what kind of formats, diffs, and compression it supports.
* We specify a CBOR-based binary format for votes, with a simple
embedding method for the legacy text format. This embedding is
meant for transitional use only; once all authorities support
the binary format, the transitional format and its support
structures can be abandoned.
* To reduce complexity, the new vote format also includes
_verbatim_ microdescriptors, whereas previously microdescriptors
would have been referenced by hash. (The use of diffs and
compression should make the bandwidth impact of this addition
negligible.)
For computing ENDIVEs, the principal changes in voting are:
* The consensus outputs for most voteable objects are specified in a
way that does not require the authorities to understand their
semantics when computing a consensus. This should make it
easier to change fields without requiring new consensus methods.
<!-- Section 3.2 --> <a id='S3.2'></a>
## Negotiating vote uploads
Authorities supporting Walking Onions are required to support a new
resource "/tor/auth-vote-opts". This resource is a text document
containing a list of HTTP-style headers. Recognized headers are
described below; unrecognized headers MUST be ignored.
The *Accept-Encoding* header follows the same format as the HTTP
header of the same name; it indicates a list of Content-Encodings
that the authority will accept for uploads. All authorities SHOULD
support the gzip and identity encodings. The identity encoding is
mandatory. (Default: "identity")
The *Accept-Vote-Diffs-From* header is a list of digests of previous
votes held by this authority; any new uploaded votes that are given
as diffs from one of these old votes SHOULD be accepted. The format
is a space-separated list of "digestname:Hexdigest". (Default: "".)
The *Accept-Vote-Formats* header is a space-separated list of the
vote formats that this authority accepts. The recognized vote formats
are "legacy-3" (Tor's current vote format) and "endive-1" (the vote
format described here). Unrecognized vote formats MUST be ignored.
(Default: "legacy-3".)
If requesting "/tor/auth-vote-opts" gives an error, or if one or
more headers are missing, the default values SHOULD be used. These
documents (or their absence) MAY be cached for up to 2 voting
periods.
Authorities supporting Walking Onions SHOULD also support the
"Connection: keep-alive" and "Keep-Alive" HTTP headers, to avoid
needless reconnections in response to these requests.
Implementors SHOULD be aware of potential denial-of-service
attacks based on open HTTP connections, and mitigate them as
appropriate.
> Note: I thought about using OPTIONS here, but OPTIONS isn't quite
> right for this, since Accept-Vote-Diffs-From does not fit with its
> semantics.
> Note: It might be desirable to support this negotiation for legacy
> votes as well, even before walking onions is implemented. Doing so
> would allow us to reduce authority bandwidth a little, and possibly
> include microdescriptors in votes for more convenient processing.
<!-- Section 3.3 --> <a id='S3.3'></a>
## A generalized algorithm for voting
Unlike with previous versions of our voting specification, here I'm
going to try to describe pieces of the voting algorithm in terms of
simpler voting operations. Each voting operation will be named and
possibly parameterized, and data will frequently self-describe what
voting operation is to be used on it.
Voting operations may operate over different CBOR types, and are
themselves specified as CBOR objects.
A voting operation takes place over a given "voteable field". Each
authority that specifies a value for a voteable field MUST specify
which voting operation to use for that field. Specifying a voteable
field without a voting operation MUST be taken as specifying the
voting operation "None" -- that is, voting against a consensus.
On the other hand, an authority MAY specify a voting operation for
a field without casting any vote for it. This means that the
authority has an opinion on how to reach a consensus about the
field, without having any preferred value for the field itself.
<!-- Section 3.3.1 --> <a id='S3.3.1'></a>
### Constants used with voting operations
Many voting operations may be parameterized by an unsigned integer.
In some cases the integers are constant, but in others, they depend
on the number of authorities, the number of votes cast, or the
number of votes cast for a particular field.
When we encode these values, we encode them as short strings
rather than as integers.
> I had thought of using negative integers here to encode these
> special constants, but that seems too error-prone.
The following constants are defined:
`N_AUTH` -- the total number of authorities, including those whose
votes are absent.
`N_PRESENT` -- the total number of authorities whose votes are
present for this vote.
`N_FIELD` -- the total number of authorities whose votes for a given
field are present.
Necessarily, `N_FIELD` <= `N_PRESENT` <= `N_AUTH` -- you can't vote
on a field unless you've cast a vote, and you can't cast a vote
unless you're an authority.
In the definitions below, `//` denotes the truncating integer division
operation, as implemented with `/` in C.
`QUORUM_AUTH` -- The lowest integer that is greater than half of
`N_AUTH`. Equivalent to `N_AUTH // 2 + 1`.
`QUORUM_PRESENT` -- The lowest integer that is greater than half of
`N_PRESENT`. Equivalent to `N_PRESENT // 2 + 1`.
`QUORUM_FIELD` -- The lowest integer that is greater than half of
`N_FIELD`. Equivalent to `N_FIELD // 2 + 1`.
We define `SUPERQUORUM_`..., variants of these fields as well, based
on the lowest integer that is greater than 2/3 majority of the
underlying field. `SUPERQUORUM_x` is thus equivalent to
`(N_x * 2) // 3 + 1`.
; We need to encode these arguments; we do so as short strings.
IntOpArgument = uint / "auth" / "present" / "field" /
"qauth" / "qpresent" / "qfield" /
"sqauth" / "sqpresent" / "sqfield"
No IntOpArgument may be greater than `N_AUTH`. If an IntOpArgument is
given as an integer, and that integer is greater than `N_AUTH`, then it
is treated as if it were `N_AUTH`.
> This rule lets us say things like "at least 3 authorities must
> vote on x...if there are 3 authorities."
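As a non-normative illustration, these constants and the clamping rule for integer IntOpArguments might be computed as follows in Python (the function and parameter names are this sketch's own):

    # Illustrative sketch of the quorum constants and IntOpArgument resolution.
    def quorum(n):
        # Lowest integer greater than half of n.
        return n // 2 + 1

    def superquorum(n):
        # Lowest integer greater than a 2/3 majority of n.
        return (n * 2) // 3 + 1

    def resolve_int_op_argument(arg, n_auth, n_present, n_field):
        named = {
            "auth": n_auth, "present": n_present, "field": n_field,
            "qauth": quorum(n_auth), "qpresent": quorum(n_present),
            "qfield": quorum(n_field),
            "sqauth": superquorum(n_auth), "sqpresent": superquorum(n_present),
            "sqfield": superquorum(n_field),
        }
        if isinstance(arg, int):
            return min(arg, n_auth)   # integer arguments are clamped to N_AUTH
        return named[arg]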
<!-- Section 3.3.2 --> <a id='S3.3.2'></a>
### Producing consensus on a field
Each voting operation will either produce a CBOR output, or produce
no consensus. Unless otherwise stated, all CBOR outputs are to be
given in canonical form.
Below we specify a number of operations, and the parameters that
they take. We begin with operations that apply to "simple" values
(integers and binary strings), then show how to compose them to
larger values.
All of the descriptions below show how to apply a _single_ voting
operation to a set of votes. We will later describe how to behave when
the authorities do not agree on which voting operation to use, in our
discussion of the StructJoinOp operation.
Note that while some voting operations take other operations as
parameters, we are _not_ supporting full recursion here: there is a
strict hierarchy of operations, and more complex operations can only
have simpler operations in their parameters.
All voting operations follow this metaformat:
; All a generic voting operation has to do is say what kind it is.
GenericVotingOp = {
op: tstr,
* tstr => any,
}
Note that some voting operations require a sort or comparison
operation over CBOR values. This operation is defined later in
appendix E; it works only on homogeneous inputs.
<!-- Section 3.3.3 --> <a id='S3.3.3'></a>
### Generic voting operations
<!-- Section 3.3.3.1 --> <a id='S3.3.3.1'></a>
#### None
This voting operation takes no parameters, and always produces "no
consensus". It is encoded as:
; "Don't produce a consensus".
NoneOp = { op: "None" }
When encountering an unrecognized or nonconforming voting operation,
_or one which is not recognized by the consensus-method in use_, the
authorities proceed as if the operation had been "None".
<!-- Section 3.3.4 --> <a id='S3.3.4'></a>
### Voting operations for simple values
We define a "simple value" according to these cddl rules:
; Simple values are primitive types, and tuples of primitive types.
SimpleVal = BasicVal / SimpleTupleVal
BasicVal = bool / int / bstr / tstr
SimpleTupleVal = [ *BasicVal ]
We also need the ability to encode the types for these values:
; Encoding a simple type.
SimpleType = BasicType / SimpleTupleType
BasicType = "bool" / "uint" / "sint" / "bstr" / "tstr"
SimpleTupleType = [ "tuple", *BasicType ]
In other words, a SimpleVal is either a non-compound base value, or is
a tuple of such values.
; We encode these operations as:
SimpleOp = MedianOp / ModeOp / ThresholdOp /
BitThresholdOp / CborSimpleOp / NoneOp
We define each of these operations in the sections below.
<!-- Section 3.3.4.1 --> <a id='S3.3.4.1'></a>
#### Median
_Parameters_: `MIN_VOTES` (an integer), `BREAK_EVEN_LOW` (a boolean),
`TYPE` (a SimpleType)
; Encoding:
MedianOp = { op: "Median",
? min_vote: IntOpArgument, ; Default is 1.
? even_low: bool, ; Default is true.
type: SimpleType }
Discard all votes that are not of the specified `TYPE`. If there are
fewer than `MIN_VOTES` votes remaining, return "no consensus".
Put the votes in ascending sorted order. If the number of votes N
is odd, take the center vote (the one at position (N+1)/2). If N is
even, take the lower of the two center votes (the one at position
N/2) if `BREAK_EVEN_LOW` is true. Otherwise, take the higher of the
two center votes (the one at position N/2 + 1).
For example, the Median(…, even_low: True, type: "uint") of the votes
["String", 2, 111, 6] is 6. The Median(…, even_low: True, type: "uint")
of the votes ["String", 77, 9, 22, "String", 3] is 9.
<!-- Section 3.3.4.2 --> <a id='S3.3.4.2'></a>
#### Mode
_Parameters_: `MIN_COUNT` (an integer), `BREAK_TIES_LOW` (a boolean),
`TYPE` (a SimpleType)
; Encoding:
ModeOp = { op: "Mode",
? min_count: IntOpArgument, ; Default 1.
? tie_low: bool, ; Default true.
type: SimpleType
}
Discard all votes that are not of the specified `TYPE`. Of the
remaining votes, look for the value that has received the most
votes. If no value has received at least `MIN_COUNT` votes, then
return "no consensus".
If there is a single value that has received the most votes, return
it. Break ties in favor of lower values if `BREAK_TIES_LOW` is true,
and in favor of higher values if `BREAK_TIES_LOW` is false.
(Perform comparisons in canonical cbor order.)
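A matching sketch of Mode, with Python's native integer ordering standing in for canonical CBOR order:

    from collections import Counter

    def mode_op(votes, min_count=1, tie_low=True):
        kept = [v for v in votes if isinstance(v, int) and not isinstance(v, bool)]
        counts = Counter(kept)
        if not counts or max(counts.values()) < min_count:
            return None                              # "no consensus"
        best = max(counts.values())
        tied = sorted(v for v, c in counts.items() if c == best)
        return tied[0] if tie_low else tied[-1]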
<!-- Section 3.3.4.3 --> <a id='S3.3.4.3'></a>
#### Threshold
_Parameters_: `MIN_COUNT` (an integer), `BREAK_MULTI_LOW` (a boolean),
`TYPE` (a SimpleType)
; Encoding
ThresholdOp = { op: "Threshold",
min_count: IntOpArgument, ; No default.
? multi_low: bool, ; Default true.
type: SimpleType
}
Discard all votes that are not of the specified `TYPE`. Sort in
canonical cbor order. If `BREAK_MULTI_LOW` is false, reverse the
order of the list.
Return the first element that received at least `MIN_COUNT` votes.
If no value has received at least `MIN_COUNT` votes, then return
"no consensus".
<!-- Section 3.3.4.4 --> <a id='S3.3.4.4'></a>
#### BitThreshold
Parameters: `MIN_COUNT` (an integer >= 1)
; Encoding
BitThresholdOp = { op: "BitThreshold",
min_count: IntOpArgument, ; No default.
}
These are usually not needed, but are quite useful for
building some ProtoVer operations.
Discard all votes that are not of type uint or bstr; construe bstr
inputs as having type "biguint".
The output is a uint or biguint in which the b'th bit is set iff the
b'th bit is set in at least `MIN_COUNT` of the votes.
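A sketch of BitThreshold, treating bstr votes as big-endian "biguint" values:

    def bit_threshold_op(votes, min_count):
        ints = []
        for v in votes:
            if isinstance(v, int) and not isinstance(v, bool):
                ints.append(v)
            elif isinstance(v, (bytes, bytearray)):   # bstr construed as a biguint
                ints.append(int.from_bytes(v, "big"))
        result, bit = 0, 0
        while any(v >> bit for v in ints):
            if sum((v >> bit) & 1 for v in ints) >= min_count:
                result |= 1 << bit
            bit += 1
        return result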
<!-- Section 3.3.5 --> <a id='S3.3.5'></a>
### Voting operations for lists
These operations work on lists of SimpleVal:
; List type definitions
ListVal = [ * SimpleVal ]
ListType = [ "list",
[ *SimpleType ] / nil ]
They are encoded as:
; Only one list operation exists right now.
ListOp = SetJoinOp
<!-- Section 3.3.5.1 --> <a id='S3.3.5.1'></a>
#### SetJoin
Parameters: `MIN_COUNT` (an integer >= 1).
Optional parameters: `TYPE` (a SimpleType.)
; Encoding:
SetJoinOp = {
op: "SetJoin",
min_count: IntOpArgument,
? type: SimpleType
}
Discard all votes that are not lists. From each vote,
discard all members that are not of type 'TYPE'.
For the consensus, construct a new list containing exactly those
elements that appear in at least `MIN_COUNT` votes.
(Note that the input votes may contain duplicate elements. These
must be treated as if there were no duplicates: the vote
[1, 1, 1, 1] is the same as the vote [1]. Implementations may want
to preprocess votes by discarding all but one instance of each
member.)
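A sketch of SetJoin over lists of uints; an empty output list simply means that no element reached `MIN_COUNT`:

    from collections import Counter

    def set_join_op(votes, min_count, member_type=int):
        counts = Counter()
        for vote in votes:
            if not isinstance(vote, list):
                continue
            members = {m for m in vote if isinstance(m, member_type)}
            counts.update(members)            # duplicates within one vote count once
        return sorted(v for v, c in counts.items() if c >= min_count)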
<!-- Section 3.3.6 --> <a id='S3.3.6'></a>
### Voting operations for maps
Map voting operations work over maps from key types to other non-map
types.
; Map type definitions.
MapVal = { * SimpleVal => ItemVal }
ItemVal = ListVal / SimpleVal
MapType = [ "map", [ *SimpleType ] / nil, [ *ItemType ] / nil ]
ItemType = ListType / SimpleType
They are encoded as:
; MapOp encodings
MapOp = MapJoinOp / StructJoinOp
<!-- Section 3.3.6.1 --> <a id='S3.3.6.1'></a>
#### MapJoin
The MapJoin operation combines homogeneous maps (that is, maps from
a single key type to a single value type.)
Parameters:
`KEY_MIN_COUNT` (an integer >= 1)
`KEY_TYPE` (a SimpleType type)
`ITEM_OP` (A non-MapJoin voting operation)
Encoding:
; MapJoin operation encoding
MapJoinOp = {
op: "MapJoin"
? key_min_count: IntOpArgument, ; Default 1.
key_type: SimpleType,
item_op: ListOp / SimpleOp
}
First, discard all votes that are not maps. Then consider the set
of keys from each vote as if they were a list, and apply
`SetJoin[KEY_MIN_COUNT,KEY_TYPE]` to those lists. The resulting list
is a set of keys to consider including in the output map.
> We have a separate `key_min_count` field, even if `item_op` has
> its own `min_count` field, because some min_count values (like
> `qfield`) depend on the overall number of votes for the field.
> Having `key_min_count` lets us specify rules like "the FOO of all
> votes on this field, if there are at least 2 such votes."
For each key in the output list, run the sub-voting operation
`ITEM_OP` on the values it received in the votes. Discard all
keys for which the outcome was "no consensus".
The final vote result is a map from the remaining keys to the values
produced by the voting operation.
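A sketch of MapJoin; here `item_op` is any sub-operation sketch (for example, the Median sketch above) that maps a list of votes to a value, or to `None` for "no consensus":

    from collections import Counter

    def map_join_op(votes, key_min_count, item_op):
        maps = [v for v in votes if isinstance(v, dict)]
        key_counts = Counter(k for m in maps for k in m.keys())
        result = {}
        for key, count in key_counts.items():
            if count < key_min_count:
                continue
            outcome = item_op([m[key] for m in maps if key in m])
            if outcome is not None:
                result[key] = outcome
        return result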
<!-- Section 3.3.6.2 --> <a id='S3.3.6.2'></a>
#### StructJoin
A StructJoinOp operation describes a way to vote on maps that encode a
structure-like object.
Parameters:
`KEY_RULES` (a map from int or string to StructItemOp)
`UNKNOWN_RULE` (An operation to apply to unrecognized keys.)
; Encoding
StructItemOp = ListOp / SimpleOp / MapJoinOp / DerivedItemOp /
CborDerivedItemOp
VoteableStructKey = int / tstr
StructJoinOp = {
op: "StructJoin",
key_rules: {
* VoteableStructKey => StructItemOp,
}
? unknown_rule: StructItemOp
}
To apply a StructJoinOp to a set of votes, first discard every vote that is
not a map. Then consider the set of keys from all the votes as a single
list, with duplicates removed. Also remove all entries that are not integers
or strings from the list of keys.
For each key, look for that key in the `KEY_RULES` map. If there is an
entry, then apply the StructItemOp for that entry to the values for that key
in every vote. Otherwise, if an `UNKNOWN_RULE` operation is present, apply it
to the values for that key in every vote. Otherwise (if there is neither an
entry nor an `UNKNOWN_RULE`), there is no consensus for the values of this
key. If there _is_ a consensus for the values, then the key should map to
that consensus in the result.
This operation always reaches a consensus, even if it is an empty map.
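A sketch of StructJoin under the same conventions; `key_rules` maps keys to sub-operation callables, and `unknown_rule` is optional:

    def struct_join_op(votes, key_rules, unknown_rule=None):
        maps = [v for v in votes if isinstance(v, dict)]
        keys = {k for m in maps for k in m.keys() if isinstance(k, (int, str))}
        result = {}
        for key in keys:
            rule = key_rules.get(key, unknown_rule)
            if rule is None:
                continue                      # no rule at all: no consensus for this key
            outcome = rule([m[key] for m in maps if key in m])
            if outcome is not None:
                result[key] = outcome
        return result                         # always a consensus, possibly the empty map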
<!-- Section 3.3.6.3 --> <a id='S3.3.6.3'></a>
#### CborData
A CborData operation wraps another operation, and tells the authorities
that after the operation is completed, its result should be decoded as a
CBOR bytestring and interpolated directly into the document.
Parameters: `ITEM_OP` (Any SingleOp that can take a bstr input.)
; Encoding
CborSimpleOp = {
op: "CborSimple",
item-op: MedianOp / ModeOp / ThresholdOp / NoneOp
}
CborDerivedItemOp = {
op: "CborDerived",
item-op: DerivedItemOp,
}
To apply either of these operations to a set of votes, first apply
`ITEM_OP` to those votes. After that's done, check whether the
consensus from that operation is a bstr that encodes a single item of
"well-formed" "valid" cbor. If it is not, this operation gives no
consensus. Otherwise, the consensus value for this operation is the
decoding of that bstr value.
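A sketch of the CborSimple variant; `cbor2` is an assumed third-party CBOR library used only for illustration, and the strict "well-formed"/"valid"/single-item checks are elided:

    import cbor2   # assumed: a third-party CBOR library, for illustration only

    def cbor_simple_op(votes, item_op):
        outcome = item_op(votes)
        if not isinstance(outcome, (bytes, bytearray)):
            return None                       # "no consensus"
        try:
            return cbor2.loads(outcome)       # strict single-item/validity checks elided
        except Exception:
            return None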
<!-- Section 3.3.6.4 --> <a id='S3.3.6.4'></a>
#### DerivedFromField
This operation can only occur within a StructJoinOp operation (or a
semantically similar SectionRules). It indicates that one field
should have been derived from another. It can be used, for example,
to say that a relay's version is "derived from" a relay's descriptor
digest.
Unlike other operations, this one depends on the entire consensus (as
computed so far), and on the entirety of the set of votes.
> This operation might be a mistake, but we need it to continue lots of
> our current behavior.
Parameters:
`FIELDS` (one or more other locations in the vote)
`RULE` (the rule used to combine values)
Encoding
; This item is "derived from" some other field.
DerivedItemOp = {
op: "DerivedFrom",
fields: [ +SourceField ],
rule: SimpleOp
}
; A field in the vote.
SourceField = [ FieldSource, VoteableStructKey ]
; A location in the vote. Each location here can only
; be referenced from later locations, or from itself.
FieldSource = "M" ; Meta.
/ "CP" ; ClientParam.
/ "SP" ; ServerParam.
/ "RM" ; Relay-meta
/ "RS" ; Relay-SNIP
/ "RL" ; Relay-legacy
To compute a consensus with this operation, first locate each field described
in the SourceField entry in each VoteDocument (if present), and in the
consensus computed so far. If there is no such field in the
consensus or if it has not been computed yet, then
this operation produces "no consensus". Otherwise, discard the VoteDocuments
that do not have the same value for the field as the consensus, and their
corresponding votes for this field. Do this for every listed field.
At this point, we have a set of votes for this field's value that all come
from VoteDocuments that describe the same value for the source field(s). Apply
the `RULE` operation to those votes in order to give the result for this
voting operation.
The DerivedFromField members in a SectionRules or a StructJoinOp
should be computed _after_ the other members, so that they can refer
to those members themselves.
<!-- Section 3.3.7 --> <a id='S3.3.7'></a>
### Voting on document sections
Voting on a section of the document is similar to the StructJoin
operation, with some exceptions. When we vote on a section of the
document, we do *not* apply a single voting rule immediately.
Instead, we first "_merge_" a set of SectionRules together, and then
apply the merged rule to the votes. This is the only place where we
merge rules like this.
A SectionRules is _not_ a voting operation, so its format is not
tagged with an "op":
; Format for section rules.
SectionRules = {
* VoteableStructKey => SectionItemOp,
? nil => SectionItemOp
}
SectionItemOp = StructJoinOp / StructItemOp
To merge a set of SectionRules together, proceed as follows. For each
key, consider whether at least QUORUM_AUTH authorities have voted the
same StructItemOp for that key. If so, that StructItemOp is the
resulting operation for this key. Otherwise, there is no entry for this key.
Do the same for the "nil" StructItemOp; use the result as the
`UNKNOWN_RULE`.
Note that this merging operation is *not* recursive.
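A sketch of the merging step; each authority's SectionRules is modeled as a plain dict of key to op (with `None` usable as the nil key), and JSON with sorted keys stands in for canonical CBOR when comparing ops for equality:

    import json
    from collections import Counter

    def merge_section_rules(rule_votes, quorum_auth):
        # rule_votes: one SectionRules dict per authority; each op is a plain dict.
        per_key = {}
        for rules in rule_votes:
            for key, op in rules.items():
                encoded = json.dumps(op, sort_keys=True)   # stand-in for canonical CBOR
                per_key.setdefault(key, Counter())[encoded] += 1
        merged = {}
        for key, counts in per_key.items():
            encoded_op, n = counts.most_common(1)[0]
            if n >= quorum_auth:               # same op voted by at least QUORUM_AUTH
                merged[key] = json.loads(encoded_op)
        return merged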
<!-- Section 3.4 --> <a id='S3.4'></a>
## A CBOR-based metaformat for votes.
A vote is a signed document containing a number of sections; each
section corresponds roughly to a section of another document, a
description of how the vote is to be conducted, or both.
; VoteDocument is a top-level signed vote.
VoteDocument = [
; Each signature may be produced by a different key, if they
; are all held by the same authority.
sig: [ + SingleSig ],
lifetime: Lifespan,
digest-alg: DigestAlgorithm,
body: bstr .cbor VoteContent
]
VoteContent = {
; List of supported consensus methods.
consensus-methods: [ + uint ],
; Text-based legacy vote to be used if the negotiated
; consensus method is too old. It should itself be signed.
; It's encoded as a series of text chunks, to help with
; cbor-based binary diffs.
? legacy-vote: [ * tstr ],
; How should the votes within the individual sections be
; computed?
voting-rules: VotingRules,
; Information that the authority wants to share about this
; vote, which is not itself voted upon.
notes: NoteSection,
; Meta-information that the authorities vote on, which does
; not actually appear in the ENDIVE or consensus directory.
meta: MetaSection .within VoteableSection,
; Fields that appear in the client network parameter document.
client-params: ParamSection .within VoteableSection,
; Fields that appear in the server network parameter document.
server-params: ParamSection .within VoteableSection,
; Information about each relay.
relays: RelaySection,
; Information about indices.
indices: IndexSection,
* tstr => any
}
; Self-description of a voter.
VoterSection = {
; human-memorable name
name: tstr,
; List of link specifiers to use when uploading to this
; authority. (See proposal for dirport link specifier)
? ul: [ *LinkSpecifier ],
; List of link specifiers to use when downloading from this authority.
? dl: [ *LinkSpecifier ],
; contact information for this authority.
? contact: tstr,
; legacy certificate in format given by dir-spec.txt.
? legacy-cert: tstr,
; for extensions
* tstr => any,
}
; An indexsection says how we think routing indices should be built.
IndexSection = {
* IndexId => bstr .cbor [ IndexGroupId, GenericIndexRule ],
}
IndexGroupId = uint
; A mechanism for building a single routing index. Actual values need to
; be within RecognizedIndexRule or the authority can't complete the
; consensus.
GenericIndexRule = {
type: tstr,
* tstr => any
}
RecognizedIndexRule = EdIndex / RSAIndex / BWIndex / WeightedIndex
; The values in an EdIndex are derived from digests of Ed25519 keys.
EdIndex = {
type: "ed-id",
alg: DigestAlgorithm,
prefix: bstr,
suffix: bstr
}
; The values in an RSAIndex are derived from RSA keys.
RSAIndex = {
type: "rsa-id"
}
; A BWIndex is built by taking some uint-valued field referred to by
; SourceField from all the relays that have all of required_flags set.
BWIndex = {
type: "bw",
bwfield: SourceField,
require_flags: FlagSet,
}
; A flag can be prefixed with "!" to indicate negation. A flag
; with a name like P@X indicates support for port class 'X' in its
; exit policy.
;
; FUTURE WORK: perhaps we should add more structure here and it
; should be a matching pattern?
FlagSet = [ *tstr ]
; A WeightedIndex applies a set of weights to a BWIndex based on which
; flags the various routers have. Relays that match a set of flags have
; their weights multiplied by the corresponding WeightVal.
WeightedIndex = {
type: "weighted",
source: BWIndex,
weight: { * FlagSet => WeightVal }
}
; A WeightVal is either an integer to multiply bandwidths by, or a
; string from the Wgg, Weg, Wbm, ... set as documented in dir-spec.txt,
; or a reference to an earlier field.
WeightVal = uint / tstr / SourceField
VoteableValue = MapVal / ListVal / SimpleVal
; A "VoteableSection" is something that we apply part of the
; voting rules to. When we apply voting rules to these sections,
; we do so without regards to their semantics. When we are done,
; we use these consensus values to make the final consensus.
VoteableSection = {
VoteableStructKey => VoteableValue,
}
; A NoteSection is used to convey information about the voter and
; its vote that is not actually voted on.
NoteSection = {
; Information about the voter itself
voter: VoterSection,
; Information that the voter used when assigning flags.
? flag-thresholds: { tstr => any },
; Headers from the bandwidth file to be reported as part of
; the vote.
? bw-file-headers: {tstr => any },
? shared-rand-commit: SRCommit,
* VoteableStructKey => VoteableValue,
}
; Shared random commitment; fields are as for the current
; shared-random-commit fields.
SRCommit = {
ver: uint,
alg: DigestAlgorithm,
ident: bstr,
commit: bstr,
? reveal: bstr
}
; the meta-section is voted on, but does not appear in the ENDIVE.
MetaSection = {
; Seconds to allocate for voting and distributing signatures
; Analogous to the "voting-delay" field in the legacy algorithm.
voting-delay: [ vote_seconds: uint, dist_seconds: uint ],
; Proposed time till next vote.
voting-interval: uint,
; proposed lifetime for the SNIPs and ENDIVEs
snip-lifetime: Lifespan,
; proposed lifetime for client params document
c-param-lifetime: Lifespan,
; proposed lifetime for server params document
s-param-lifetime: Lifespan,
; signature depth for ENDIVE
signature-depth: uint,
; digest algorithm to use with ENDIVE.
signature-digest-alg: DigestAlgorithm,
; Current and previous shared-random values
? cur-shared-rand: [ reveals: uint, rand: bstr ],
? prev-shared-rand: [ reveals: uint, rand: bstr ],
; extensions.
* VoteableStructKey => VoteableValue,
};
; A ParamSection will be made into a ParamDoc after voting;
; the fields are analogous.
ParamSection = {
? certs: [ 1*2 bstr .cbor VoterCert ],
? recommend-versions: [ * tstr ],
? require-protos: ProtoVersions,
? recommend-protos: ProtoVersions,
? params: NetParams,
* VoteableStructKey => VoteableValue,
}
RelaySection = {
; Mapping from relay identity key (or digest) to relay information.
* bstr => RelayInfo,
}
; A RelayInfo is a vote about a single relay.
RelayInfo = {
meta: RelayMetaInfo .within VoteableSection,
snip: RelaySNIPInfo .within VoteableSection,
legacy: RelayLegacyInfo .within VoteableSection,
}
; Information about a relay that doesn't go into a SNIP.
RelayMetaInfo = {
; Tuple of published-time and descriptor digest.
? desc: [ uint , bstr ],
; What flags are assigned to this relay? We use a
; string->value encoding here so that only the authorities
; who have an opinion on the status of a flag for a relay need
; to vote yes or no on it.
? flags: { *tstr=>bool },
; The relay's self-declared bandwidth.
? bw: uint,
; The relay's measured bandwidth.
? mbw: uint,
; The fingerprint of the relay's RSA identity key.
? rsa-id: RSAIdentityFingerprint
}
; SNIP information can just be voted on directly; the formats
; are the same.
RelaySNIPInfo = SNIPRouterData
; Legacy information is used to build legacy consensuses, but
; not actually required by walking onions clients.
RelayLegacyInfo = {
; Mapping from consensus version to microdescriptor digests
; and microdescriptors.
? mds: [ *Microdesc ],
}
; Microdescriptor votes now include the digest AND the
; microdescriptor-- see note.
Microdesc = [
low: uint,
high: uint,
digest: bstr .size 32,
; This is encoded in this way so that cbor-based diff tools
; can see inside it. Because of compression and diffs,
; including microdesc text verbatim should be comparatively cheap.
content: encoded-cbor .cbor [ *tstr ],
]
; ==========
; The VotingRules field explains how to vote on the members of
; each section
VotingRules = {
meta: SectionRules,
params: SectionRules,
relay: RelayRules,
indices: SectionRules,
}
; The RelayRules object explains the rules that apply to each
; part of a RelayInfo. A key will appear in the consensus if it
; has been listed by at least key_min_count authorities.
RelayRules = {
key_min_count: IntOpArgument,
meta: SectionRules,
snip: SectionRules,
legacy: SectionRules,
}
<!-- Section 3.5 --> <a id='S3.5'></a>
## Computing a consensus.
To compute a consensus, the authorities first verify that all the votes are
timely and correctly signed by real authorities. This includes
validating all invariants stated here, and all internal documents.
If they have two votes from the same authority, authorities SHOULD issue a
warning, and they should take the one that was published more
recently.
> TODO: Teor suggests that maybe we shouldn't warn about two votes
> from an authority for the same period, and we could instead have a
> more resilient process here, where authorities can update their
> votes at various times over the voting period, up to some point.
>
> I'm not sure whether this helps reliability more or less than it risks it,
> but it is worth investigating.
Next, the authorities determine the consensus method as they do today,
using the field "consensus-method". This can also be expressed as
the voting operation `Threshold[SUPERQUORUM_PRESENT, false, uint]`.
If there is no consensus for the consensus-method, then voting stops
without having produced a consensus.
Note that in contrast with its behavior in the current voting algorithm, the
consensus method does not determine the way to vote on every
individual field: that aspect of voting is controlled by the
voting-rules. Instead, the consensus-method changes other aspects
of this voting, such as:
* Adding, removing, or changing the semantics of voting
operations.
* Changing the set of documents to which voting operations apply.
* Otherwise changing the rules that are set out in this
document.
Once a consensus-method is decided, the next step is to compute the
consensus for other sections in this order: `meta`, `client-params`,
`server-params`, and `indices`. The consensus for each is calculated
according to the operations given in the corresponding section of
VotingRules.
Next the authorities compute a consensus on the `relays` section,
which is done slightly differently, according to the rules of the
RelayRules element of VotingRules.
Finally, the authorities transform the resulting sections into an
ENDIVE and a legacy consensus, as in "Computing an ENDIVE" and
"Computing a legacy consensus" below.
To vote on a single VotingSection, find the corresponding
SectionRules objects in the VotingRules of these votes, and apply them
as described above in "Voting on document sections".
<!-- Section 3.6 --> <a id='S3.6'></a>
## If an older consensus method is negotiated (Transitional)
The `legacy-vote` field in the vote document contains an older (v3,
text-style) consensus vote, and is used when an older consensus
method is negotiated. The legacy-vote is encoded by splitting it
into pieces, to help with CBOR diff calculation. Authorities MAY split at
line boundaries, space boundaries, or anywhere that will help with
diffs. To reconstruct the legacy vote, concatenate the members of
`legacy-vote` in order. The resulting string MUST validate
according to the rules of the legacy voting algorithm.
If a legacy vote is present, then authorities SHOULD include the
same view of the network in the legacy vote as they included in their
real vote.
If a legacy vote is present, then authorities MUST
give the same list of consensus-methods and the same voting
schedule in both votes. Authorities MUST reject noncompliant votes.
<!-- Section 3.7 --> <a id='S3.7'></a>
## Computing an ENDIVE.
If a consensus-method is negotiated that is high enough to support
ENDIVEs, then the authorities proceed as follows to transform the consensus
sections above into an ENDIVE.
The ParamSections from the consensus are used verbatim as the bodies of the
`client-params` and `relay-params` fields.
The fields that appear in each RelaySNIPInfo determine what goes
into the SNIPRouterData for each relay. To build the relay section,
first decide which relays appear according to the `key_min_count`
field in the RelayRules. Then collate relays across all the votes
by their keys, and see which ones are listed. For each key that
appears in at least `key_min_count` votes, apply the RelayRules to
each section of the RelayInfos for that key.
The sig_params section is derived from fields in the meta section.
Fields with identical names are simply copied; Lifespan values are
copied to the corresponding documents (snip-lifetime as the lifespan
for SNIPs and ENDIVEs, and c and s-param-lifetime as the lifespan
for ParamDocs).
To compute the signature nonce, use the signature digest algorithm
to compute the digest of each input vote body, sort those digests
lexicographically, and concatenate and hash those digests again.
Routing indices are built according to named IndexRules, and grouped
according to fields in the meta section. See "Constructing Indices" below.
> (At this point extra fields may be copied from the Meta section of
> each RelayInfo into the ENDIVERouterData depending on the meta
> document; we do not, however, currently specify any case where this
> is done.)
<!-- Section 3.7.1 --> <a id='S3.7.1'></a>
### Constructing indices
After having built the list of relays, the authorities construct and
encode the indices that appear in the ENDIVEs. The voted-upon
GenericIndexRule values in the IndexSection of the consensus say how
to build the indices in the ENDIVE, as follows.
An `EdIndex` is built using the IndexType_Ed25519Id value, with the
provided prefix and suffix values. Authorities don't need to expand
this index in the ENDIVE, since the relays can compute it
deterministically.
An `RSAIndex` is built using the IndexType_RSAId type. Authorities
don't need to expand this index in the ENDIVE, since the relays can
compute it deterministically.
A `BwIndex` is built using the IndexType_Weighted type. Each relay has a
weight equal to some specified bandwidth field in its consensus
RelayInfo. If a relay is missing any of the `require_flags` in
its meta section, or if it does not have the specified bandwidth
field, that relay's weight becomes 0.
A `WeightedIndex` is built by computing a BwIndex, and then
transforming each relay in the list according to the flags that it
has set. Relays that match any set of flags in the WeightedIndex
rule get their bandwidths multiplied by _all_ WeightVals that
apply. Some WeightVals are computed according to special rules,
such as "Wgg", "Weg", and so on. These are taken from the current
dir-spec.txt.
For both BwIndex and WeightedIndex values, authorities MUST scale
the computed outputs so that no value is greater than UINT32_MAX;
they MUST do so by shifting all values right by the lowest number of
bits that achieves this.
> We could specify a more precise algorithm, but this is simpler.
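A sketch of that scaling step over a list of computed weights:

    UINT32_MAX = 2**32 - 1

    def scale_index_weights(weights):
        # Shift all values right by the lowest number of bits that brings the
        # largest value within UINT32_MAX.
        shift = 0
        while max(weights) >> shift > UINT32_MAX:
            shift += 1
        return [w >> shift for w in weights]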
Indices with the same IndexGroupId are placed in the same index
group; index groups are ordered numerically.
<!-- Section 3.8 --> <a id='S3.8'></a>
## Computing a legacy consensus.
When using a consensus method that supports Walking Onions, the
legacy consensus is computed from the same data as the ENDIVE.
Because the legacy consensus format will be frozen once Walking
Onions is finalized, we specify this transformation directly, rather
than in a more extensible way.
The published time and descriptor digest are used directly.
Microdescriptor negotiation proceeds as before. Bandwidths,
measured bandwidths, descriptor digests, published times, flags, and
rsa-id values are taken from the RelayMetaInfo section. Addresses,
protovers, versions, and so on are taken from the RelaySNIPInfo. Header
fields are all taken from the corresponding header fields in the
MetaSection or the ClientParamsSection. All parameters are copied
into the net-params field.
<!-- Section 3.9 --> <a id='S3.9'></a>
## Managing indices over time.
> The present voting mechanism does not do a great job of handling
> the authorities
The semantic meaning of most IndexId values, as understood by
clients, should remain unchanging; if a client uses index 6 for
middle nodes, 6 should _always_ mean "middle nodes".
If an IndexId is going to change its meaning over time, it should
_not_ be hardcoded by clients; it should instead be listed in the
NetParams document, as the exit indices are in the `port-classes`
field. (See also section 6 and appendix AH.) If such a field needs
to change, it also needs a migration method that allows clients with
older and newer parameters documents to exist at the same time.
<!-- Section 4 --> <a id='S4'></a>
# Relay operations: Receiving and expanding ENDIVEs
Previously, we introduced a format for ENDIVEs to be transmitted
from authorities to relays. To save on bandwidth, the relays
download diffs rather than entire ENDIVEs. The ENDIVE format makes
several choices in order to make these diffs small: the Merkle tree
is omitted, and routing indices are not included directly.
To address those issues, this document describes the steps that a
relay needs to perform, upon receiving an ENDIVE document, to derive
all the SNIPs for that ENDIVE.
Here are the steps to be followed. We'll describe them in order,
though in practice they could be pipelined somewhat. We'll expand
further on each step later on.
1. Compute routing indices positions.
2. Compute truncated SNIPRouterData variations.
3. Build signed SNIP data.
4. Compute Merkle tree.
5. Build authenticated SNIPs.
Below we'll specify specific algorithms for these steps. Note that
relays do not need to follow the steps of these algorithms exactly,
but they MUST produce the same outputs as if they had followed them.
<!-- Section 4.1 --> <a id='S4.1'></a>
## Computing index positions.
For every IndexId in every Index Group, the relay will compute the
full routing index. Every routing index is a mapping from
index position ranges (represented as 2-tuples) to relays, where the
relays are represented as ENDIVERouterData members of the ENDIVE. The
routing index must map every possible value of the index to exactly one
relay.
An IndexSpec field describes how the index is to be constructed. There
are five types of IndexSpec: Raw, RawNumeric, Weighted, RSAId, and
Ed25519Id. We'll describe how to build the indices for each.
Every index may either have an integer key, or a binary-string
key. We define the "successor" of an integer index as the succeeding
integer. We define the "successor" of a binary string as the next
binary string of the same length in lexicographical (memcmp) order. We
define "predecessor" as the inverse of "successor". Both these
operations "wrap around" the index.
The algorithms here describe a set of invariants that are
"verified". Relays SHOULD check each of these invariants;
authorities MUST NOT generate any ENDIVEs that violate them. If a
relay encounters an ENDIVE that cannot be verified, then the ENDIVE
cannot be expanded.
> NOTE: conceivably should there be some way to define an index as a
> subset of another index, with elements weighted in different ways? In
> other words, "Index a is index b, except multiply these relays by 0 and
> these relays by 1.2". We can keep this idea sitting around in case there
> turns out to be a use for it.
<!-- Section 4.1.1 --> <a id='S4.1.1'></a>
### Raw indices
When the IndexType is Indextype_Raw, then its members are listed
directly in the IndexSpec.
Algorithm: Expanding a "Raw" indexspec.
Let result_idx = {} (an empty mapping).
Let previous_pos = indexspec.first_index
For each element [i, pos2] of indexspec.index_ranges:
Verify that i is a valid index into the list of ENDIVERouterData.
Set pos1 = the successor of previous_pos.
Verify that pos1 and pos2 have the same type.
Append the mapping (pos1, pos2) => i to result_idx
Set previous_pos to pos2.
Verify that previous_pos = the predecessor of indexspec.first_index.
Return result_idx.
<!-- Section 4.1.2 --> <a id='S4.1.2'></a>
### Raw numeric indices
If the IndexType is Indextype_RawNumeric, it is described by a set of
spans on a 32-bit index range.
Algorithm: Expanding a RawNumeric index.
Let result_idx = {} (an empty mapping).
Let prev_pos = 0
For each element [i, span] of indexspec.index_ranges:
Verify that i is a valid index into the list of ENDIVERouterData.
Verify that prev_pos <= UINT32_MAX - span.
Let pos2 = prev_pos + span.
Append the mapping (prev_pos, pos2) => i to result_idx.
Let prev_pos = successor(pos2)
Verify that prev_pos = UINT32_MAX.
Return result_idx.
<!-- Section 4.1.3 --> <a id='S4.1.3'></a>
### Weighted indices
If the IndexSpec type is Indextype_Weighted, then the index is
described by assigning a probability weight to each of a number of relays.
From these, we compute a series of 32-bit index positions.
This algorithm uses 64-bit math, and 64-by-32-bit integer division.
It requires that the sum of weights is no more than UINT32_MAX.
Algorithm: Expanding a "Weighted" indexspec.
Let total_weight = SUM(indexspec.index_weights)
Verify total_weight <= UINT32_MAX.
Let total_so_far = 0.
Let result_idx = {} (an empty mapping).
Define POS(b) = FLOOR( (b << 32) / total_weight).
For 0 <= i < LEN(indexspec.indexweights):
Let w = indexspec.indexweights[i].
Let lo = POS(total_so_far).
Let total_so_far = total_so_far + w.
Let hi = POS(total_so_far) - 1.
Append (lo, hi) => i to result_idx.
Verify that total_so_far = total_weight.
Verify that the last value of "hi" was UINT32_MAX.
Return result_idx.
This algorithm is a bit finicky in its use of division, but it
results in a mapping onto 32 bit integers that completely covers the
space of available indices.
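A Python rendering of the same expansion, following the pseudocode above; it assumes the weights are available as a list of non-negative integers:

    UINT32_MAX = 2**32 - 1

    def expand_weighted_index(weights):
        # weights: indexspec.index_weights, as a list of non-negative integers.
        total_weight = sum(weights)
        assert 0 < total_weight <= UINT32_MAX
        def pos(b):
            return (b << 32) // total_weight
        result_idx = {}
        total_so_far = 0
        hi = None
        for i, w in enumerate(weights):
            lo = pos(total_so_far)
            total_so_far += w
            hi = pos(total_so_far) - 1
            result_idx[(lo, hi)] = i
        assert total_so_far == total_weight
        assert hi == UINT32_MAX
        return result_idx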
<!-- Section 4.1.4 --> <a id='S4.1.4'></a>
### RSAId indices
If the IndexSpec type is Indextype_RSAId then the index is a set of
binary strings describing the routers' legacy RSA identities, for
use in the HSv2 hash ring.
These identities are truncated to a fixed length. Though the SNIP
format allows _variable_-length binary prefixes, we do not use this
feature.
Algorithm: Expanding an "RSAId" indexspec.
Let R = [ ] (an empty list).
Take the value n_bytes from the IndexSpec.
For 0 <= b_idx < MIN( LEN(indexspec.members) * 8,
LEN(list of ENDIVERouterData) ):
Let b = the b_idx'th bit of indexspec.members.
If b is 1:
Let m = the b_idx'th member of the ENDIVERouterData list.
Verify that m has its RSAIdentityFingerprint set.
Let pos = m.RSAIdentityFingerprint, truncated to n_bytes.
Add (pos, b_idx) to the list R.
Return INDEX_FROM_RING_KEYS(R).
Sub-Algorithm: INDEX_FROM_RING_KEYS(R)
First, sort R according to its 'pos' field.
Let result_idx = {} (an empty mapping).
For each member (pos, idx) of the list R:
If this is the first member of the list R:
Let key_low = pos for the last member of R.
else:
Let key_low = pos for the previous member of R.
Let key_high = predecessor(pos)
Add (key_low, key_high) => idx to result_idx.
Return result_idx.
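A Python rendering of INDEX_FROM_RING_KEYS, with a wrap-around `predecessor` helper for fixed-length byte strings (both names follow the pseudocode above):

    def predecessor(pos):
        # Previous byte string of the same length, wrapping around.
        n = int.from_bytes(pos, "big")
        n = (n - 1) % (1 << (8 * len(pos)))
        return n.to_bytes(len(pos), "big")

    def index_from_ring_keys(R):
        # R is a list of (pos, idx) pairs, where pos is a fixed-length bstr.
        R = sorted(R, key=lambda entry: entry[0])
        result_idx = {}
        for n, (pos, idx) in enumerate(R):
            key_low = R[n - 1][0]         # wraps to the last member when n == 0
            key_high = predecessor(pos)
            result_idx[(key_low, key_high)] = idx
        return result_idx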
<!-- Section 4.1.5 --> <a id='S4.1.5'></a>
### Ed25519 indices
If the IndexSpec type is Indextype_Ed25519, then the index is a set of
binary strings describing the routers' positions in a hash ring,
derived from their Ed25519 identity keys.
This algorithm is a generalization of the one used for hsv3 rings,
to be used to compute the hsv3 ring and other possible future
derivatives.
Algorithm: Expanding an "Ed25519Id" indexspec.
Let R = [ ] (an empty list).
Take the values prefix, suffix, and n_bytes from the IndexSpec.
Let H() be the digest algorithm specified by d_alg from the
IndexSpec.
For 0 <= b_idx < MIN( LEN(indexspec.members) * 8,
LEN(list of ENDIVERouterData) ):
Let b = the b_idx'th bit of indexspec.members.
If b is 1:
Let m = the b_idx'th member of the ENDIVERouterData list.
Let key = m's ed25519 identity key, as a 32-byte value.
Compute pos = H(prefix || key || suffix)
Truncate pos to n_bytes.
Add (pos, b_idx) to the list R.
Return INDEX_FROM_RING_KEYS(R).
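For a single relay, the ring position might be computed as in the sketch below; the digest defaults to SHA3-256 here purely for illustration, since the real algorithm is taken from `d_alg` in the IndexSpec:

    import hashlib

    def ed25519_index_position(prefix, suffix, n_bytes, ed25519_id,
                               digest=hashlib.sha3_256):
        # pos = H(prefix || key || suffix), truncated to n_bytes.
        return digest(prefix + ed25519_id + suffix).digest()[:n_bytes]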
<!-- Section 4.1.6 --> <a id='S4.1.6'></a>
### Building a SNIPLocation
After computing all the indices in an IndexGroup, relays combine
them into a series of SNIPLocation objects. Each SNIPLocation
MUST contain all the IndexId => IndexRange entries that point to a
given ENDIVERouterData, for the IndexIds listed in an IndexGroup.
Algorithm: Build a list of SNIPLocation objects from a set of
routing indices.
Initialize R as [ { } ] * LEN(relays) (A list of empty maps)
For each IndexId "ID" in the IndexGroup:
Let router_idx be the index map calculated for ID.
(This is what we computed previously.)
For each entry ( (LO, HI) => idx) in router_idx:
Let R[idx][ID] = (LO, HI).
SNIPLocation objects are thus organized in the order in which they will
appear in the Merkle tree: that is, sorted by the position of their
corresponding ENDIVERouterData.
Because SNIPLocation objects are signed, they must be encoded as "canonical"
cbor, according to section 3.9 of RFC 7049.
If R[idx] is {} (the empty map) for any given idx, then no SNIP will be
generated for the SNIPRouterData at that routing index for this index group.
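A sketch of this collation step; `index_maps` maps each IndexId in the group to the `(LO, HI) => idx` mapping computed earlier:

    def build_snip_location_maps(n_relays, index_maps):
        # index_maps: IndexId -> { (LO, HI): relay_index }, as computed above.
        R = [dict() for _ in range(n_relays)]
        for index_id, router_idx in index_maps.items():
            for (lo, hi), idx in router_idx.items():
                R[idx][index_id] = (lo, hi)
        return R   # R[idx] == {} means no SNIP for that relay in this index group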
<!-- Section 4.2 --> <a id='S4.2'></a>
## Computing truncated SNIPRouterData.
An index group can include an `omit_from_snips` field to indicate that
certain fields from a SNIPRouterData should not be included in the
SNIPs for that index group.
Since a SNIPRouterData needs to be signed, this process has to be
deterministic. Thus, the truncated SNIPRouterData should be computed by
removing the keys and values for EXACTLY the keys listed and no more. The
remaining keys MUST be left in the same order that they appeared in the
original SNIPRouterData, and they MUST NOT be re-encoded.
(Two keys are "the same" if and only if they are integers encoding the same
value, or text strings with the same UTF-8 content.)
There is no need to compute a SNIPRouterData when no SNIP is going to be
generated for a given router.
<!-- Section 4.3 --> <a id='S4.3'></a>
## Building the Merkle tree.
After computing a list of (SNIPLocation, SNIPRouterData) for every entry
in an index group, the relay needs to expand a Merkle tree to
authenticate every SNIP.
There are two steps here: First the relay generates the leaves, and then
it generates the intermediate hashes.
To generate the list of leaves for an index group, the relay first
removes all entries from the (SNIPLocation, SNIPRouterData) list that
have an empty index map. The relay then puts `n_padding_entries` "nil"
entries at the end of the list.
To generate the list of leaves for the whole Merkle tree, the relay
concatenates these index group lists in the order in which they appear
in the ENDIVE, and pads the resulting list with "nil" entries until the
length of the list is a power of two: 2^`tree-depth` for some integer
`tree-depth`. Let LEAF(IDX) denote the entry at position IDX in this
list, where IDX is a D-bit bitstring. LEAF(IDX) is either a byte string
or nil.
The relay then recursively computes the hashes in the Merkle tree as
follows. (Recall that `H_node()` and `H_leaf()` are hashes taking
a bit-string PATH, a LIFESPAN and NONCE from the signature information,
and a variable-length string ITEM.)
Recursive definition: HM(PATH)
Given PATH a bitstring of length no more than tree-depth.
Define S:
S(nil) = an all-0 string of the same length as the hash output.
S(x) = x, for all other x.
If LEN(PATH) = tree-depth: (Leaf case.)
If LEAF(PATH) = nil:
HM(PATH) = nil.
Else:
HM(PATH) = H_leaf(PATH, LIFESPAN, NONCE, LEAF(PATH)).
Else:
Let LEFT = HM(PATH || 0)
Let RIGHT = HM(PATH || 1)
If LEFT = nil and RIGHT = nil:
HM(PATH) = nil
else:
HM(PATH) = H_node(PATH, LIFESPAN, NONCE, S(LEFT) || S(RIGHT))
Note that entries aren't computed for "nil" leaves, or any node all of
whose children are "nil". The "nil" entries only exist to place all
leaves at a constant depth, and to enable spacing out different sections
of the tree.
If `signature-depth` for the ENDIVE is N, the relay does not need to
compute any Merkle tree entries for PATHs of length shorter than N bits.
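A recursive Python rendering of HM(), reusing the `H_leaf`/`H_node` helpers from the earlier hash sketch; a path is represented as an integer plus its bit length, and `leaf(path, depth)` returns the encoded leaf bytes or `None`:

    HASH_LEN = 32   # length of H()'s output in this sketch

    def HM(path, depth, tree_depth, leaf, lifespan, nonce):
        # H_leaf and H_node are the functions sketched with the hash definitions above.
        if depth == tree_depth:                       # leaf case
            item = leaf(path, depth)
            if item is None:
                return None
            return H_leaf(path, depth, lifespan, nonce, item)
        left = HM(path << 1, depth + 1, tree_depth, leaf, lifespan, nonce)
        right = HM((path << 1) | 1, depth + 1, tree_depth, leaf, lifespan, nonce)
        if left is None and right is None:
            return None
        S = lambda x: x if x is not None else b"\x00" * HASH_LEN
        return H_node(path, depth, lifespan, nonce, S(left) + S(right))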
<!-- Section 4.4 --> <a id='S4.4'></a>
## Assembling the SNIPs
Finally, the relay has computed a list of encoded (SNIPLocation,
RouterData) values, and a Merkle tree to authenticate them. At this
point, the relay builds them into SNIPs, using the `sig_params` and
`signatures` from the ENDIVE.
Algorithm: Building a SNIPSignature for a SNIP.
Given a non-nil (SNIPLocation, RouterData) at leaf position PATH.
Let SIG_IDX = PATH, truncated to signature-depth bits.
Consider SIG_IDX as an integer.
Let Sig = signatures[SIG_IDX] -- either the SingleSig or the MultiSig
for this snip.
Let HashPath = [] (an empty list).
For bitlen = signature-depth+1 ... tree-depth-1:
Let X = PATH, truncated to bitlen bits.
Invert the final bit of X.
Append HM(X) to HashPath.
The SnipSignature's signature value is Sig, and its merkle_path is
HashPath.
<!-- Section 4.5 --> <a id='S4.5'></a>
## Implementation considerations
A relay only needs to hold one set of SNIPs at a time: once one
ENDIVE's SNIPs have been extracted, then the SNIPs from the previous
ENDIVE can be discarded.
To save memory, a relay MAY store SNIPs to disk, and mmap them as
needed.
<!-- Section 5 --> <a id='S5'></a>
# Extending circuits with Walking Onions
When a client wants to extend a circuit, there are several
possibilities. It might need to extend to an unknown relay with
specific properties. It might need to extend to a particular relay
from which it has received a SNIP before. In both cases, there are
changes to be made in the circuit extension process.
Further, there are changes we need to make for the handshake between
the extending relay and the target relay. The target relay is no
longer told by the client which of its onion keys it should use... so
the extending relay needs to tell the target relay which keys are in
the SNIP that the client is using.
<!-- Section 5.1 --> <a id='S5.1'></a>
## Modifying the EXTEND/CREATE handshake
First, we will require that proposal 249 (or some similar proposal
for wide CREATE and EXTEND cells) is in place, so that we can have
EXTEND cells larger than can fit in a single cell. (See
319-wide-everything.md for an example proposal to supersede 249.)
We add new fields to the CREATE2 cell so that relays can send each
other more information without interfering with the client's part of
the handshake.
The CREATE2, CREATED2, and EXTENDED2 cells change as follows:
struct create2_body {
// old fields
u16 htype; // client handshake type
u16 hlen; // client handshake length
u8 hdata[hlen]; // client handshake data.
// new fields
u8 n_extensions;
struct extension extension[n_extensions];
}
struct created2_body {
// old fields
u16 hlen;
u8 hdata[hlen];
// new fields
u8 n_extensions;
struct extension extension[n_extensions];
}
struct truncated_body {
// old fields
u8 errcode;
// new fields
u8 n_extensions;
struct extension extension[n_extensions];
}
// EXTENDED2 cells can now use the same new fields as in the
// created2 cell.
struct extension {
u16 type;
u16 len;
u8 body[len];
}
These extensions are defined by this proposal:
[01] -- `Partial_SNIPRouterData` -- Sent from an extending relay
to a target relay. This extension holds one or more fields
from the SNIPRouterData that the extending relay is using,
so that the target relay knows (for example) what keys to
use. (These fields are determined by the
"forward_with_extend" field in the ENDIVE.)
[02] -- Full_SNIP -- an entire SNIP that was used in an attempt to
extend the circuit. This must match the client's provided
index position.
[03] -- Extra_SNIP -- an entire SNIP that was not used to extend
the circuit, but which the client requested anyway. This
can be sent back from the extending relay when the client
specifies multiple index positions, or uses a nonzero "nth" value
in their `snip_index_pos` link specifier.
[04] -- SNIP_Request -- a 32-bit index position, or a single zero
byte, sent away from the client. If the byte is 0, the
originator does not want a SNIP. Otherwise, the
originator does want a SNIP containing the router and the
specified index. Other values are unspecified.
By default, EXTENDED2 cells are sent with a SNIP iff the corresponding
EXTEND2 cell used a `snip_index_pos` link specifier, and CREATED2 cells
are not sent with a SNIP.
<!-- Section 5.1.1 --> <a id='S5.1.1'></a>
### New link specifiers
We add a new link specifier type for a router index, using the
following coding for its contents:
/* Using trunnel syntax here. */
struct snip_index_pos {
u32 index_id; // which index is it?
u8 nth; // how many SNIPs should be skipped/included?
u8 index_pos[]; // extends to the end of the link specifier.
}
The `index_pos` field can be longer or shorter than the actual width of
the router index. If it is too long, it is truncated. If it is too
short, it is extended with zero-valued bytes.
Any number of these link specifiers may appear in an EXTEND cell.
If there is more than one, then they should appear in order of
client preference; the extending relay may extend to any of the
listed routers.
This link specifier SHOULD NOT be used along with IPv4, IPv6, RSA
ID, or Ed25519 ID link specifiers. Relays receiving such a link
specifier along with a `snip_index_pos` link specifier SHOULD reject
the entire EXTEND request.
If `nth` is nonzero, then the link specifier means "the n'th SNIP after
the one defined by the SNIP index position." A relay MAY reject
this request if `nth` is greater than 4. If the relay does not
reject this request, then it MUST include all snips between
`index_pos` and the one that was actually used in an Extra_Snip
extension. (Otherwise, the client would not be able to verify that
it had gotten the correct SNIP.)
> I've avoided use of CBOR for these types, under the assumption that we'd
> like to use CBOR for directory stuff, but no more. We already have
> trunnel-like objects for this purpose.
<!-- Section 5.2 --> <a id='S5.2'></a>
## Modified ntor handshake
We adapt the ntor handshake from tor-spec.txt for this use, with the
following main changes.
* The NODEID and KEYID fields are omitted from the input.
Instead, these fields _may_ appear in a PartialSNIPData extension.
* The NODEID and KEYID fields appear in the reply.
* The NODEID field is extended to 32 bytes, and now holds the
relay's ed25519 identity.
So the client's message is now:
CLIENT_PK [32 bytes]
And the relay's reply is now:
NODEID [32 bytes]
KEYID [32 bytes]
SERVER_PK [32 bytes]
AUTH [32 bytes]
Otherwise, all fields are computed as described in tor-spec.
When this handshake is in use, the hash function is SHA3-256 and keys
are derived using SHAKE-256, as in rend-spec-v3.txt.
> Future work: We may wish to update this choice of functions
> between now and the implementation date, since SHA3 is a bit
> pricey. Perhaps one of the BLAKEs would be a better choice. If
> so, we should use it more generally. On the other hand, the
> presence of public-key operations in the handshake _probably_
> outweighs the use of SHA3.
We will have to give this version of the handshake a new handshake
type.
<!-- Section 5.3 --> <a id='S5.3'></a>
## New relay behavior on EXTEND and CREATE failure.
If an EXTEND2 cell based on a routing index fails, the relay should
not close the circuit, but should instead send back a TRUNCATED cell
containing the SNIP in an extension.
If a CREATE2 cell fails and a SNIP was requested, then instead of
sending a DESTROY cell, the relay SHOULD respond with a CREATED2
cell containing 0 bytes of handshake data, and the SNIP in an
extension. Clients MAY re-extend or close the circuit, but should
not leave it dangling.
<!-- Section 5.4 --> <a id='S5.4'></a>
## NIL handshake type
We introduce a new handshake type, "NIL". The NIL handshake always
fails. A client's part of the NIL handshake is an empty bytestring;
there is no server response that indicates success.
The NIL handshake can be used by the client when it wants to fetch a
SNIP without creating a circuit.
Upon receiving a request to extend with the NIL circuit type, a
relay SHOULD NOT actually open any connection or send any data to
the target relay. Instead, it should respond with a TRUNCATED cell
with the SNIP(s) that the client requested in one or more Extra_SNIP
extensions.
<!-- Section 5.5 --> <a id='S5.5'></a>
## Padding handshake cells to a uniform size
To avoid leaking information, all CREATE/CREATED/EXTEND/EXTENDED
cells SHOULD be padded to the same sizes. In all cases, the amount
of padding is controlled by a set of network parameters:
"create-pad-len", "created-pad-len", "extend-pad-len" and
"extended-pad-len". These parameters determine the minimum length
that the cell body or relay cell bodies should be.
If a cell would be sent whose body is less than the corresponding
parameter value, then the sender SHOULD pad the body by adding
zero-valued bytes to the cell body. As usual, receivers MUST ignore
extra bytes at the end of cells.
> ALTERNATIVE: We could specify a more complicated padding
> mechanism, eg. 32 bytes of zeros then random bytes.
<!-- Section 6 --> <a id='S6'></a>
# Client behavior with walking onions
Today's Tor clients have several behaviors that become somewhat
more difficult to implement with Walking Onions. Some of these
behaviors are essential and achievable. Others can be achieved with
some effort, and still others appear to be incompatible with the
Walking Onions design.
<!-- Section 6.1 --> <a id='S6.1'></a>
## Bootstrapping and guard selection
When a client first starts running, it has no guards on the Tor
network, and therefore can't start building circuits immediately.
To produce a list of possible guards, the client begins connecting
to one or more fallback directories on their ORPorts, and building
circuits through them. These are 3-hop circuits. The first hop of
each circuit is the fallback directory; the second and third hops
are chosen from the Middle routing index. At the third hop, the
client then sends an informational request for a guard's SNIP. This
informational request is an EXTEND2 cell with handshake type NIL,
using a random spot on the Guard routing index.
Each such request yields a single SNIP that the client will store.
These SNIPs, in the order in which they were _requested_, will form the
client's list of "Sampled" guards as described in guard-spec.txt.
Clients SHOULD ensure that their sampled guards are not
linkable to one another. In particular, clients SHOULD NOT add more
than one guard retrieved from the same third hop on the same
circuit. (If it added more than one, that third hop could learn that some
client using guard A was also using guard B.)
> Future work: Is this threat real? It seems to me that knowing one or two
> guards at a time in this way is not a big deal, though knowing the whole
> set would sure be bad. However, we shouldn't optimize this kind of
> defense away until we know that it's actually needless.
If a client's network connection or choice of entry nodes is heavily
restricted, the client MAY request more than one guard at a time, but if
it does so, it SHOULD discard all but one guard retrieved from each set.
After choosing guards, clients will continue to use them even after
their SNIPs expire. On the first circuit through each guard after
opening a channel, clients should ask that guard for a fresh SNIP for
itself, to ensure that the guard is still listed in the consensus, and
to keep the client's information up-to-date.
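To make that flow concrete, here is a rough Python sketch. Every helper in it (`build_3hop_circuit`, `extend_nil`, and the circuit object's methods) is a hypothetical placeholder rather than an existing API; only the overall shape -- a 3-hop circuit through a fallback directory, a NIL EXTEND2 at a random position on the Guard index, and at most one guard per third hop -- comes from this section.

    import os
    import random

    # Rough sketch of the guard-sampling loop described above.  All helpers are
    # hypothetical placeholders for whatever circuit API an implementation has.

    def sample_guards(fallback_dirs, n_guards):
        sampled = []             # guard SNIPs, in the order they were requested
        used_third_hops = set()  # never take two guards via the same third hop
        while len(sampled) < n_guards:
            # fallback directory + two hops chosen from the Middle index
            circ = build_3hop_circuit(random.choice(fallback_dirs))
            third_hop = circ.hops[-1].identity
            if third_hop in used_third_hops:
                circ.close()
                continue
            # EXTEND2 with handshake type NIL at a random Guard-index position
            snip = extend_nil(circ, index="Guard", position=os.urandom(4))
            sampled.append(snip)
            used_third_hops.add(third_hop)
            circ.close()
        return sampled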
<!-- Section 6.2 --> <a id='S6.2'></a>
## Using bridges
As now, clients are configured to use a bridge by using an address and a
public key for the bridge. Bridges behave like guards, except that they
are not listed in any directory or ENDIVE, and so cannot prove
membership when the client connects to them.
On the first circuit through each channel to a bridge, the client
asks that bridge for a SNIP listing itself in the `Self` routing
index. The bridge responds with a self-created unsigned SNIP:
    ; This is only valid when received on an authenticated connection
    ; to a bridge.
    UnsignedSNIP = [
        ; There is no signature on this SNIP.
        auth : nil,

        ; Next comes the location of the SNIP within the ENDIVE.  This
        ; SNIPLocation will list only the Self index.
        index : bstr .cbor SNIPLocation,

        ; Finally comes the information about the router.
        router : bstr .cbor SNIPRouterData,
    ]
*Security note*: Clients MUST take care to keep UnsignedSNIPs separated
from signed ones. These are not part of any ENDIVE, and so should not be
used for any purpose other than connecting through the bridge that the
client has received them from. They should be kept associated with that
bridge, and not used for any other, even if they contain other link
specifiers or keys. The client MAY use link specifiers from the
UnsignedSNIP on future attempts to connect to the bridge.
<!-- Section 6.3 --> <a id='S6.3'></a>
## Finding relays by exit policy
To find a relay by exit policy, clients might choose the exit
routing index corresponding to the exit port they want to use. This
has negative privacy implications, however, since the middle node
discovers what kind of exit traffic the client wants to use.
Instead, we support two other options.
First, clients may build anonymous three-hop circuits and then use those
circuits to request the SNIPs that they will use for their exits. This
may, however, be inefficient.
Second, clients may build anonymous three-hop circuits and then use a
BEGIN cell to try to open the connection when they want. When they do
so, they may include a new flag in the begin cell, "DVS" to enable
Delegated Verifiable Selection. As described in the Walking Onions
paper, DVS allows a relay that doesn't support the requested port to
instead send the client the SNIP of a relay that does. (In the paper,
the relay uses a digest of previous messages to decide which routing
index to use. Instead, we have the client send an index field.)
This requires changes to the BEGIN and END cell formats. After the
"flags" field in BEGIN cells, we add an extension mechanism:
    struct begin_cell {
        nulterm addr_port;
        u32 flags;
        u8 n_extensions;
        struct extension exts[n_extensions];
    }
We allow the `snip_index_pos` link specifier type to appear as a begin
extension.
END cells will need to have a new format that supports including policy and
SNIP information. This format is enabled whenever a new `EXTENDED_END_CELL`
flag appears in the begin cell.
    struct end_cell {
        u8 tag IN [ 0xff ]; // indicates that this isn't an old-style end cell.
        u8 reason;
        u8 n_extensions;
        struct extension exts[n_extensions];
    }
We define three END cell extensions. Two are for addresses; they
indicate which address was resolved and its associated TTL:
    struct end_ext_ipv4 {
        u32 addr;
        u32 ttl;
    }
    struct end_ext_ipv6 {
        u8 addr[16];
        u32 ttl;
    }
One new END cell extension is used for delegated verifiable selection:
    struct end_ext_alt_snip {
        u16 index_id;
        u8 snip[..];
    }
This design may require END cells to become wider; see
319-wide-everything.md for an example proposal to
supersede proposal 249 and allow more wide cell types.
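For illustration, the client-side flow might look roughly like the Python sketch below; `send_begin`, `extend_by_snip`, the reply object, and the retry limit are all hypothetical, while the DVS flag, the `snip_index_pos` begin extension, and the `end_ext_alt_snip` END extension are the pieces defined above.

    # Rough sketch of a client using Delegated Verifiable Selection.  The cell
    # encoding is given by the trunnel structs above; the helpers here are
    # hypothetical.

    def open_exit_stream(circ, addr_port, index_id, index_pos, max_retries=2):
        for _ in range(max_retries + 1):
            reply = circ.send_begin(addr_port,
                                    flags=["DVS", "EXTENDED_END_CELL"],
                                    exts=[("snip_index_pos", index_id, index_pos)])
            if reply.ok:
                return reply.stream              # stream opened; no END came back
            alt = reply.end_cell.get_extension("end_ext_alt_snip")
            if alt is None:
                break                            # ordinary failure; give up
            # The exit doesn't support this port; re-extend to the relay whose
            # SNIP it handed back, and try again there.
            circ.truncate_last_hop()
            circ.extend_by_snip(alt.snip)
        raise ConnectionError("could not open stream via DVS")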
<!-- Section 6.4 --> <a id='S6.4'></a>
## Universal path restrictions
There are some restrictions on Tor paths that all clients should obey,
unless they are configured not to do so. Some of these restrictions
(like "start paths with a Guard node" or "don't use an Exit as a middle
when Exit bandwidth is scarce") are captured by the index system. Some
other restrictions are not. Here we describe how to implement those.
The general approach taken here is "build and discard". Since most
possible paths will not violate these universal restrictions, we
accept that a fraction of the paths built will not be usable.
Clients tear them down a short time after they are built.
Clients SHOULD discard a circuit if, after it has been built, they
find that it contains the same relay twice, or it contains more than
one relay from the same family or from the same subnet.
Clients MAY remember the SNIPs they have received, and use those
SNIPs to avoid index ranges that they would automatically reject.
Clients SHOULD NOT store any SNIP for longer than it remains maximally
recent.
> NOTE: We should continue to monitor the fraction of paths that are
> rejected in this way. If it grows too high, we either need to amend
> the path selection rules, or change authorities to e.g. forbid more
> than a certain fraction of relay weight in the same family or subnet.
> FUTURE WORK: It might be a good idea, if these restrictions truly are
> 'universal', for relays to have a way to say "You wouldn't want that
> SNIP; I am giving you the next one in sequence" and send back both
> SNIPs. This would need some signaling in the EXTEND/EXTENDED cells.
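A sketch of that discard check in Python follows; the SNIP accessors (`identity`, `family`, `address`) are placeholders for whatever the implementation exposes, and the /16 granularity is shown only as an example of a subnet rule.

    import ipaddress

    # Sketch of the "build and discard" path check described above.

    def path_is_acceptable(snips):
        seen_ids, seen_family_members, seen_subnets = set(), set(), set()
        for snip in snips:
            subnet = ipaddress.ip_network(snip.address).supernet(new_prefix=16)
            if snip.identity in seen_ids:
                return False      # same relay appears twice
            if snip.identity in seen_family_members or seen_ids & set(snip.family):
                return False      # two relays claim membership in one family
            if subnet in seen_subnets:
                return False      # two relays in the same /16 (example rule)
            seen_ids.add(snip.identity)
            seen_family_members |= set(snip.family)
            seen_subnets.add(subnet)
        return True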
<!-- Section 6.5 --> <a id='S6.5'></a>
## Client-configured path restrictions
Sometimes users configure their clients with path restrictions beyond
those that are in ordinary use. For example, a user might want to enter
only from US relays, but never exit from US. Or they might be
configured with a short list of vanguards to use in their second
position.
<!-- Section 6.5.1 --> <a id='S6.5.1'></a>
### Handling "light" restrictions
If a restriction only excludes a small number of relays, then clients
can continue to use the "build and discard" methodology described above.
<!-- Section 6.5.2 --> <a id='S6.5.2'></a>
### Handling some "heavy" restrictions
Some restrictions can exclude most relays, and still be reasonably easy
to implement if they only _include_ a small fraction of relays. For
example, if the user has an EntryNodes restriction that contains only a
small group of relays by exact IP address, the client can connect or
extend to one of those addresses specifically.
If we decide IP ranges are important, that IP addresses without
ports are important, or that key specifications are important, we
can add routing indices that list relays by IP, by RSAId, or by
Ed25519 Id. Clients could then use those indices to remotely
retrieve SNIPs, and then use those SNIPs to connect to their
selected relays.
> Future work: we need to decide how many of the above functions to actually
> support.
<!-- Section 6.5.3 --> <a id='S6.5.3'></a>
### Recognizing too-heavy restrictions
The above approaches do not handle all possible sets of restrictions. In
particular, they do a bad job for restrictions that ban a large fraction
of paths in a way that is not encodeable in the routing index system.
If there is substantial demand for such a path restriction, implementors
and authority operators should figure out how to implement it in the
index system if possible.
Implementations SHOULD track what fraction of otherwise valid circuits
they are closing because of the user's configuration. If this fraction
is above a certain threshold, they SHOULD issue a warning; if it is
above some other threshold, they SHOULD refuse to build circuits
entirely.
> Future work: determine which fraction appears in practice, and use that to
> set the appropriate thresholds above.
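As a sketch of what that tracking could look like (the threshold values here are placeholders, since the note above leaves them to future measurement):

    # Sketch of tracking how often the user's configuration forces us to close
    # otherwise-valid circuits.  Both thresholds are hypothetical placeholders.

    class RestrictionTracker:
        WARN_FRACTION = 0.50     # placeholder
        REFUSE_FRACTION = 0.90   # placeholder
        MIN_SAMPLE = 20          # wait for a minimal sample before judging

        def __init__(self):
            self.built = 0
            self.closed_by_config = 0

        def note_circuit(self, closed_by_config: bool) -> str:
            self.built += 1
            if closed_by_config:
                self.closed_by_config += 1
            if self.built < self.MIN_SAMPLE:
                return "ok"
            frac = self.closed_by_config / self.built
            if frac >= self.REFUSE_FRACTION:
                return "refuse"   # stop building circuits entirely
            if frac >= self.WARN_FRACTION:
                return "warn"     # log a warning to the user
            return "ok"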
<!-- Section 7 --> <a id='S7'></a>
# Using and providing onion services with Walking Onions
Both live versions of the onion service design rely on a ring of
hidden service directories for use in uploading and downloading
hidden service descriptors. With Walking Onions, we can use routing
indices based on Ed25519 or RSA identity keys to retrieve this data.
(The RSA identity ring is unchanging, whereas the Ed25519 ring
changes daily based on the shared random value: for this reason, we
have to compute two simultaneous indices for Ed25519 rings: one for
the earlier date that is potentially valid, and one for the later
date that is potentially valid. We call these `hsv3-early` and
`hsv3-late`.)
Beyond the use of these indices, however, there are other steps that
clients and services need to take in order to maintain their privacy.
<!-- Section 7.1 --> <a id='S7.1'></a>
## Finding HSDirs
When a client or service wants to contact an HSDir, it SHOULD do so
anonymously, by building a three-hop anonymous circuit, and then
extending it a further hop using the snip_span link specifier to
reach any of the first 3 replicas on the ring. Clients SHOULD
choose one of these replicas at random; services SHOULD upload to each replica.
Using a full 80-bit or 256-bit index position in the link specifier
would leak the chosen service to somebody other than the directory.
Instead, the client or service SHOULD truncate the identifier to a
number of bytes equal to the network parameter `hsv2-index-bytes` or
`hsv3-index-bytes` respectively. (See Appendix D.)
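For example, a quick sketch of the truncation (using the hsv3 default of 4 bytes from Appendix D; the function name is illustrative):

    # Only the first `hsv3-index-bytes` bytes of the ring position go into the
    # link specifier, so only a neighborhood of the ring is revealed.

    def truncated_index_position(full_position: bytes, index_bytes: int) -> bytes:
        return full_position[:index_bytes]

    # A 32-byte hsv3 ring position reduced to the default of 4 bytes:
    pos = bytes.fromhex("ab" * 32)
    assert truncated_index_position(pos, 4) == bytes.fromhex("abababab")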
<!-- Section 7.2 --> <a id='S7.2'></a>
## SNIPs for introduction points
When services select an introduction point, they should include the
SNIP for the introduction point in their hidden service directory
entry, along with the introduction-point fields. The format for
this entry is:
"snip" NL snip NL
[at most once per introduction points]
Clients SHOULD begin treating the link specifier and onion-key
fields of each introduction point as optional when the "snip" field
is present, and when the `hsv3-tolerate-no-legacy` network parameter
is set to 1. If either of these fields _is_ present, and the SNIP is
too, then these fields MUST match those listed in the SNIPs.
Clients SHOULD reject descriptors with mismatched fields, and alert
the user that the service may be trying a partitioning attack.
The "legacy-key" and "legacy-key-cert" fields, if present, should be
checked similarly.
> Using the SNIPs in these ways allows services to prove that their
> introduction points have actually been listed in the consensus
> recently. It also lets clients use introduction point features
> that the relay might not understand.
Services should include these fields based on a set of network
parameters: `hsv3-intro-snip` and `hsv3-intro-legacy-fields`.
(See Appendix D.)
Clients should use these fields only when Walking Onions support is
enabled; see section 9.
<!-- Section 7.3 --> <a id='S7.3'></a>
## SNIPs for rendezvous points
When a client chooses a rendezvous point for a v3 onion service, it
similarly has the opportunity to include the SNIP of its rendezvous
point in the encrypted part of its INTRODUCE cell. (This may cause
INTRODUCE cells to become fragmented; see proposal about fragmenting
relay cells.)
> Using the SNIP in this way allows the client to prove that its chosen
> rendezvous point has actually been listed in the consensus
> recently. It also lets the service use rendezvous point features
> that it might not otherwise understand.
To include the SNIP, the client places it in an extension in the
INTRODUCE cell. The onion key can now be omitted[*], along with
the link specifiers.
> [*] Technically, we use a zero-length onion key, with a new type
> "implicit in SNIP".
To know whether the service can recognize this kind of cell, the
client should look for the presence of a "snips-allowed 1" field in
the encrypted part of the hidden service descriptor.
In order to prevent partitioning, services SHOULD NOT advertise
"snips-allowed 1" unless the network parameter
"hsv3-rend-service-snip" is set to 1. Clients SHOULD NOT use this
field unless "hsv3-rend-client-snip" is set to 1.
<!-- Section 7.4 --> <a id='S7.4'></a>
## TAP keys and where to find them
If v2 hidden services are still supported when Walking Onions arrives
on the network, we have two choices: We could migrate them to use
ntor keys instead of TAP, or we could provide a way for TAP keys to
be advertised with Walking Onions.
The first option would appear to be far simpler. See
proposal draft 320-tap-out-again.md.
The latter option would require us to put RSA-1024 keys in SNIPs, or
put a digest of them in SNIPs and give some way to retrieve them
independently.
(Of course, it's possible that we will have v2 onion services
deprecated by the time Walking Onions is implemented. If so, that
will simplify matters a great deal too.)
<!-- Section 8 --> <a id='S8'></a>
# Tracking Relay honesty
Our design introduces an opportunity for dishonest relay behavior:
since multiple ENDIVEs are valid at the same time, a malicious relay
might choose any of several possible SNIPs in response to a client's
routing index value.
Here we discuss several ways to mitigate this kind of attack.
<!-- Section 8.1 --> <a id='S8.1'></a>
## Defense: index stability
First, the voting process should be designed such that relays do not
needlessly move around the routing index. For example, it would
_not_ be appropriate to add an index type whose value is computed by
first putting the relays into a pseudorandom order. Instead, index
voting should be deterministic and tend to give similar outputs for
similar inputs.
This proposal tries to achieve this property in its index voting
algorithms. We should measure the degree to which we succeed over
time, by looking at all of the ENDIVEs that are valid at any
particular time, and sampling several points for each index to see
how many distinct relays are listed at each point, across all valid
ENDIVEs.
We do not need this stability property for routing indices whose
purpose is nonrandomized relay selection, such as those indices used
for onion service directories.
<!-- Section 8.2 --> <a id='S8.2'></a>
## Defense: enforced monotonicity
Once an honest relay has received an ENDIVE, it has no reason to
keep any previous ENDIVEs or serve SNIPs from them. Because of
this, relay implementations SHOULD ensure that no data is served
from a new ENDIVE until all the data from an old ENDIVE is
thoroughly discarded.
Clients and relays can use this monotonicity property to keep relays
honest: once a relay has served a SNIP with some timestamp `T`, that
relay should never serve any other SNIP with a timestamp earlier than
`T`. Clients SHOULD track the most recent SNIP timestamp that they
have received from each of their guards, and MAY track the most
recent SNIP timestamps that they have received from other relays as
well.
<!-- Section 8.3 --> <a id='S8.3'></a>
## Defense: limiting ENDIVE variance within the network.
The primary motivation for allowing long (de facto) lifespans on
today's consensus documents is to keep the network from grinding to
a halt if the authorities fail to reach consensus for a few hours.
But in practice, _if_ there is a consensus, then relays should have
it within an hour or two, so they should not be falling a full day out
of date.
Therefore we can potentially add a client behavior that, within N
minutes after the client has seen any SNIP with timestamp `T`,
the client should not accept any SNIP with timestamp earlier than
`T - Delta`.
Values for N and Delta are controlled by network parameters
(`enforce-endive-dl-delay-after` and `allow-endive-dl-delay`
respectively in Appendix D). N should be about as long as we expect
it to take for a single ENDIVE to propagate to all the relays on the
network; Delta should be about as long as we would like relays to go
between updating ENDIVEs under ideal circumstances.
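For illustration, here is a Python sketch combining the client-side checks from sections 8.2 and 8.3, following the Appendix D reading of the parameters (older SNIPs are rejected only once N seconds have passed since a newer timestamp was first seen); the class and its bookkeeping are hypothetical.

    import time

    # Sketch of the client-side checks from 8.2 and 8.3: a relay may never serve
    # a SNIP older than one it already served us, and N seconds after we first
    # see a SNIP with timestamp T, we reject anything older than T - Delta.

    class SnipFreshnessChecker:
        def __init__(self, enforce_after_n: int, allow_delta: int):
            self.n = enforce_after_n      # enforce-endive-dl-delay-after
            self.delta = allow_delta      # allow-endive-dl-delay
            self.per_relay_ts = {}        # relay identity -> newest timestamp served
            self.seen = []                # (SNIP timestamp, local time first seen)

        def accept(self, relay_id, snip_ts, now=None) -> bool:
            now = time.time() if now is None else now
            # 8.2: enforced monotonicity, per relay.
            if snip_ts < self.per_relay_ts.get(relay_id, 0):
                return False
            # 8.3: network-wide freshness floor.  (A real implementation would
            # prune self.seen instead of letting it grow without bound.)
            floor = max((t - self.delta for t, first in self.seen
                         if now - first >= self.n), default=None)
            if floor is not None and snip_ts < floor:
                return False
            self.per_relay_ts[relay_id] = snip_ts
            self.seen.append((snip_ts, now))
            return True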
<!-- Section 9 --> <a id='S9'></a>
# Migrating to Walking Onions
This proposal is a major change in the Tor network that will
eventually require the participation of all relays [*], and will make
clients who support it distinguishable from clients that don't.
> [*] Technically, the last relay in the path doesn't need support.
To keep the compatibility issues under control, here is the order in which it
should be deployed on the network.
1. First, authorities should add support for voting on ENDIVEs.
2. Relays may immediately begin trying to download and reconstruct
ENDIVEs. (Relay versions are public, so they leak nothing by
doing this.)
3. Once a sufficient number of authorities are voting on ENDIVEs and
unlikely to downgrade, relays should begin serving parameter documents
and responding to walking-onion EXTEND and CREATE cells. (Again,
relay versions are public, so this doesn't leak.)
4. In parallel with relay support, Tor should also add client
support for Walking Onions. This should be disabled by default,
however, since it will only be usable with the subset of relays
that support Walking Onions, and since it would make clients
distinguishable.
5. Once enough of the relays (possibly, all) support Walking Onions,
the client support can be turned on. Clients will not be able to
use old relays that do not support Walking Onions.
6. Eventually, relays that do not support Walking Onions should not
be listed in the consensus.
Client support for Walking Onions should be enabled or disabled, at
first, with a configuration option. Once it seems stable, the
option should have an "auto" setting that looks at a network
parameter. This parameter should NOT be a simple "on" or "off",
however: it should be the minimum client version whose support for
Walking Onions is believed to be correct.
<!-- Section 9.1 --> <a id='S9.1'></a>
## Future work: migrating away from sedentary onions
Once all clients are using Walking Onions, we can take a pass
through the Tor specifications and source code to remove
no-longer-needed code.
Clients should be the first to lose support for old directories,
since nobody but the clients depends on the clients having them.
Only after obsolete clients represent a very small fraction of the
network should relay or authority support be disabled.
Some fields in router descriptors become obsolete with Walking
Onions, and possibly router descriptors themselves should be
replaced with cbor objects of some kind. This can only happen,
however, after no descriptor users remain.
<!-- Section A --> <a id='SA'></a>
# Appendices
<!-- Section A.1 --> <a id='SA.1'></a>
## Appendix A: Glossary
I'm going to put a glossary here so I can try to use these terms
consistently.
*SNIP* -- A "Separable Network Index Proof". Each SNIP contains the
information necessary to use a single Tor relay, and associates the relay
with one or more index ranges. SNIPs are authenticated by the directory
authorities.
*ENDIVE* -- An "Efficient Network Directory with Individually Verifiable
Entries". An ENDIVE is a collection of SNIPS downloaded by relays,
authenticated by the directory authorities.
*Routing index* -- A routing index is a map from binary strings to relays,
with some given property. Each relay that is in the routing index is
associated with a single *index range*.
*Index range* -- A range of positions within a routing index. Each range
contains many positions.
*Index position* -- A single value within a routing index. Every position in
a routing index corresponds to a single relay.
*ParamDoc* -- A network parameters document, describing settings for the
whole network. Clients download this infrequently.
*Index group* -- A collection of routing indices that are encoded in the same
SNIPs.
<!-- Section A.2 --> <a id='SA.2'></a>
## Appendix B: More cddl definitions
; These definitions are used throughout the rest of the
; proposal
; Ed25519 keys are 32 bytes, and that isn't changing.
Ed25519PublicKey = bstr .size 32
; Curve25519 keys are 32 bytes, and that isn't changing.
Curve25519PublicKey = bstr .size 32
; 20 bytes or fewer: legacy RSA SHA1 identity fingerprint.
RSAIdentityFingerprint = bstr
; A 4-byte integer -- or to be cddl-pedantic, one that is
; between 0 and UINT32_MAX.
uint32 = uint .size 4
; Enumeration to define integer equivalents for all the digest algorithms
; that Tor uses anywhere. Note that some of these are not used in
; this spec, but are included so that we can use this production
; whenever we need to refer to a hash function.
DigestAlgorithm = &(
NoDigest: 0,
SHA1 : 1, ; deprecated.
SHA2-256: 2,
SHA2-512: 3,
SHA3-256: 4,
SHA3-512: 5,
Kangaroo12-256: 6,
Kangaroo12-512: 7,
)
; A digest is represented as a binary blob.
Digest = bstr
; Enumeration for different signing algorithms.
SigningAlgorithm = &(
RSA-OAEP-SHA1 : 1, ; deprecated.
RSA-OAEP-SHA256: 2, ; deprecated.
Ed25519 : 3,
Ed448 : 4,
BLS : 5, ; Not yet standardized.
)
PKAlgorithm = &(
SigningAlgorithm,
Curve25519: 100,
Curve448 : 101
)
KeyUsage = &(
; A master unchangeable identity key for this authority. May be
; any signing key type. Distinct from the authority's identity as a
; relay.
AuthorityIdentity: 0x10,
; A medium-term key used for signing SNIPs, votes, and ENDIVEs.
SNIPSigning: 0x11,
; These are designed not to collide with the "list of certificate
; types" or "list of key types" in cert-spec.txt
)
CertType = &(
VotingCert: 0x12,
; These are designed not to collide with the "list of certificate
; types" in cert-spec.txt.
)
LinkSpecifier = bstr
<!-- Section A.3 --> <a id='SA.3'></a>
## Appendix C: new numbers to assign.
Relay commands:
* We need a new relay command for "FRAGMENT" per proposal 319.
CREATE handshake types:
* We need a type for the NIL handshake.
* We need a handshake type for the new ntor handshake variant.
Link specifiers:
* We need a link specifier for extend-by-index.
* We need a link specifier for dirport URL.
Certificate Types and Key Types:
* We need to add the new entries from CertType and KeyUsage to
cert-spec.txt, and possibly merge the two lists.
Begin cells:
* We need a flag for Delegated Verifiable Selection.
* We need an extension type for extra data, and a value for indices.
End cells:
* We need an extension type for extra data, a value for indices, a
value for IPv4 addresses, and a value for IPv6 addresses.
Extensions for decrypted INTRODUCE2 cells:
* A SNIP for the rendezvous point.
Onion key types for decrypted INTRODUCE2 cells:
* An "onion key" to indicate that the onion key for the rendezvous point is
implicit in the SNIP.
New URLs:
* A URL for fetching ENDIVEs.
* A URL for fetching client / relay parameter documents
* A URL for fetching detached SNIP signatures.
Protocol versions:
(In theory we could omit many new protovers here, since being listed
in an ENDIVE implies support for the new protocol variants. We're
going to use new protovers anyway, however, since doing so keeps our
numbering consistent.)
We need new versions for these subprotocols:
* _Relay_ to denote support for new handshake elements.
* _DirCache_ to denote support for ENDIVEs, paramdocs, binary diffs, etc.
* _Cons_ to denote support for ENDIVEs
<!-- Section A.4 --> <a id='SA.4'></a>
## Appendix D: New network parameters.
We introduce these network parameters:
From section 5:
* `create-pad-len` -- Clients SHOULD pad their CREATE cell bodies
to this size.
* `created-pad-len` -- Relays SHOULD pad their CREATED cell bodies to this
size.
* `extend-pad-len` -- Clients SHOULD pad their EXTEND cell bodies to this
size.
* `extended-pad-len` -- Relays SHOULD pad their EXTENDED cell bodies to this
size.
From section 7:
* `hsv2-index-bytes` -- how many bytes to use when sending an hsv2 index
position to look up a hidden service directory. Min: 1,
Max: 40. Default: 4.
* `hsv3-index-bytes` -- how many bytes to use when sending an hsv3 index
position to look up a hidden service directory. Min: 1,
Max: 128. Default: 4.
* `hsv3-intro-legacy-fields` -- include legacy fields in service descriptors.
Min: 0. Max: 1. Default: 1.
* `hsv3-intro-snip` -- include intro point SNIPs in service descriptors.
Min: 0. Max: 1. Default: 0.
* `hsv3-rend-service-snip` -- Should services advertise and accept rendezvous
point SNIPs in INTRODUCE2 cells? Min: 0. Max: 1. Default: 0.
* `hsv3-rend-client-snip` -- Should clients place rendezvous point SNIPS in
INTRODUCE2 cells when the service supports it?
Min: 0. Max: 1. Default: 0.
* `hsv3-tolerate-no-legacy` -- Should clients tolerate v3 service descriptors
that don't have legacy fields? Min: 0. Max: 1. Default: 0.
From section 8:
* `enforce-endive-dl-delay-after` -- How many seconds after receiving a
SNIP with some timestamp T does a client wait for rejecting older SNIPs?
Equivalent to "N" in "limiting ENDIVE variance within the network."
Min: 0. Max: INT32_MAX. Default: 3600 (1 hour).
* `allow-endive-dl-delay` -- Once a client has received a SNIP with
timestamp T, it will not accept any SNIP with timestamp earlier than
"allow-endive-dl-delay" seconds before T.
Equivalent to "Delta" in "limiting ENDIVE variance within the network."
Min: 0. Max: 2592000 (30 days). Default: 10800 (3 hours).
<!-- Section A.5 --> <a id='SA.5'></a>
## Appendix E: Semantic sorting for CBOR values.
Some voting operations assume a partial ordering on CBOR values. We define
such an ordering as follows:
* bstr and tstr items are sorted lexicographically, as if they were
compared with a version of strcmp() that accepts internal NULs.
* uint and int items are sorted by integer values.
* arrays are sorted lexicographically by elements.
* Tagged items are sorted as if they were not tagged.
* Maps do not have any sorting order.
* False precedes true.
* Otherwise, the ordering between two items is not defined.
More specifically:
    Algorithm: compare two cbor items A and B.

    Returns LT, EQ, GT, or NIL.

    While A is tagged, remove the tag from A.
    While B is tagged, remove the tag from B.

    If A is any integer type, and B is any integer type:
        return A cmp B

    If the type of A is not the same as the type of B:
        return NIL.

    If A and B are both booleans:
        return int(A) cmp int(B), where int(false)=0 and int(true)=1.

    If A and B are both tstr or both bstr:
        while len(A)>0 and len(B)>0:
            if A[0] != B[0]:
                return A[0] cmp B[0]
            Discard A[0] and B[0]
        If len(A) == len(B) == 0:
            return EQ.
        else if len(A) == 0:
            return LT. (B is longer)
        else:
            return GT. (A is longer)

    If A and B are both arrays:
        while len(A)>0 and len(B)>0:
            Run this algorithm recursively on A[0] and B[0].
            If the result is not EQ:
                Return that result.
            Discard A[0] and B[0]
        If len(A) == len(B) == 0:
            return EQ.
        else if len(A) == 0:
            return LT. (B is longer)
        else:
            return GT. (A is longer)

    Otherwise, A and B are a type for which we do not define an ordering,
    so return NIL.
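The pseudocode above translates fairly directly into Python. In the sketch below, bytes/str stand in for bstr/tstr and lists for arrays, and CBOR tags are assumed to have already been unwrapped by the decoder before comparison.

    # Direct rendering of the comparison algorithm above (tags assumed removed).

    LT, EQ, GT, NIL = -1, 0, 1, None

    def cbor_cmp(a, b):
        if isinstance(a, bool) and isinstance(b, bool):
            return cbor_cmp(int(a), int(b))       # false precedes true
        is_int = lambda x: isinstance(x, int) and not isinstance(x, bool)
        if is_int(a) and is_int(b):
            return LT if a < b else GT if a > b else EQ
        if type(a) is not type(b):
            return NIL
        if isinstance(a, (bytes, str)):           # bstr / tstr: lexicographic
            return LT if a < b else GT if a > b else EQ
        if isinstance(a, list):                   # arrays: element-wise, then length
            for x, y in zip(a, b):
                r = cbor_cmp(x, y)
                if r != EQ:
                    return r
            return cbor_cmp(len(a), len(b))
        return NIL                                # maps and other types: no ordering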
<!-- Section A.6 --> <a id='SA.6'></a>
## Appendix F: Example voting rules
Here we give a set of voting rules for the fields described in our initial
VoteDocuments.
{
meta: {
voting-delay: { op: "Mode", tie_low:false,
type:["tuple","uint","uint"] },
voting-interval: { op: "Median", type:"uint" },
snip-lifespan: {op: "Mode", type:["tuple","uint","uint","uint"] },
c-param-lifetime: {op: "Mode", type:["tuple","uint","uint","uint"] },
s-param-lifetime: {op: "Mode", type:["tuple","uint","uint","uint"] },
cur-shared-rand: {op: "Mode", min_count: "qfield",
type:["tuple","uint","bstr"]},
prev-shared-rand: {op: "Mode", min_count: "qfield",
type:["tuple","uint","bstr"]},
client-params: {
recommend-versions: {op:"SetJoin", min_count:"qfield",type:"tstr"},
require-protos: {op:"BitThreshold", min_count:"sqauth"},
recommend-protos: {op:"BitThreshold", min_count:"qauth"},
params: {op:"MapJoin",key_min_count:"qauth",
keytype:"tstr",
item_op:{op:"Median",min_vote:"qauth",type:"uint"},
},
certs: {op:"SetJoin",min_count:1, type: 'bstr'},
},
; Use same value for server-params.
relay: {
meta: {
desc: {op:"Mode", min_count:"qauth",tie_low:false,
type:["uint","bstr"] },
flags: {op:"MapJoin", key_type:"tstr",
item_op:{op:"Mode",type:"bool"}},
bw: {op:"Median", type:"uint" },
mbw :{op:"Median", type:"uint" },
rsa-id: {op:"Mode", type:"bstr"},
},
snip: {
; ed25519 key is handled as any other value.
0: { op:"DerivedFrom", fields:[["RM","desc"]],
rule:{op:"Mode",type="bstr"} },
; ntor onion key.
1: { op:"DerivedFrom", fields:[["RM","desc"]],
rule:{op:"Mode",type="bstr"} },
; link specifiers.
2: { op: "CborDerived",
item-op: { op:"DerivedFrom", fields:[["RM","desc"]],
rule:{op:"Mode",type="bstr" } } },
; software description.
3: { op:"DerivedFrom", fields:[["RM","desc"]],
rule:{op:"Mode",type=["tuple", "tstr", "tstr"] } },
; protovers.
4: { op: "CborDerived",
item-op: { op:"DerivedFrom", fields:[["RM","desc"]],
rule:{op:"Mode",type="bstr" } } },
; families.
5: { op:"SetJoin", min_count:"qfield", type:"bstr" },
; countrycode
6: { op:"Mode", type="tstr" } ,
; 7: exitpolicy.
7: { op: "CborDerived",
item-op: { op: "DerivedFrom",
fields:[["RM","desc"],["CP","port-classes"]],
rule:{op:"Mode",type="bstr" } } },
},
legacy: {
"sha1-desc": { op:"DerivedFrom",
fields:[["RM","desc"]],
rule:{op:"Mode",type="bstr"} },
"mds": { op:"DerivedFrom",
fields:[["RM":"desc"]],
rule: { op:"ThresholdOp", min_count: "qauth",
multi_low:false,
type:["tuple", "uint", "uint",
"bstr", "bstr" ] }},
}
}
indices: {
; See appendix G.
}
}
<!-- Section A.7 --> <a id='SA.7'></a>
## Appendix G: A list of routing indices
Middle -- general purpose index for use when picking middle hops in
circuits. Bandwidth-weighted for use as middle relays. May exclude
guards and/or exits depending on overall balance of resources on the
network.
Formula:
type: 'weighted',
source: {
type:'bw', require_flags: ['Valid'], 'bwfield' : ["RM", "mbw"]
},
weight: {
[ "!Exit", "!Guard" ] => "Wmm",
[ "Exit", "Guard" ] => "Wbm",
[ "Exit", "!Guard" ] => "Wem",
[ "!Exit", "Guard" ] => "Wgm",
}
Guard -- index for choosing guard relays. This index is not used
directly when extending, but instead only for _picking_ guard relays
that the client will later connect to directly. Bandwidth-weighted
for use as guard relays. May exclude guard+exit relays depending on
resource balance.
type: 'weighted',
source: {
type:'bw',
require_flags: ['Valid', "Guard"],
bwfield : ["RM", "mbw"]
},
weight: {
[ "Exit", ] => "Weg",
}
HSDirV2 -- index for finding spots on the hsv2 directory ring.
Formula:
type: 'rsa-id',
HSDirV3-early -- index for finding spots on the hsv3 directory ring
for the earlier of the two "active" days. (The active days are
today, and whichever other day is closest to the time at which the
ENDIVE becomes active.)
Formula:
type: 'ed-id'
alg: SHA3-256,
prefix: b"node-idx",
suffix: (depends on shared-random and time period)
HSDirV3-late -- index for finding spots on the hsv3 directory ring
for the later of the two "active" days.
Formula: as HSDirV3-early, but with a different suffix.
Self -- A virtual index that never appears in an ENDIVE. SNIPs with
this index are unsigned, and occupy the entire index range. This
index is used with bridges to represent each bridge's uniqueness.
Formula: none.
Exit0..ExitNNN -- Exits that can connect to all ports within a given
PortClass 0 through NNN.
Formula:
type: 'weighted',
source: {
type:'bw',
; The second flag here depends on which portclass this is.
require_flags: [ 'Valid', "P@3" ],
bwfield : ["RM", "mbw"]
},
weight: {
[ "Guard", ] => "Wge",
}
<!-- Section A.8 --> <a id='SA.8'></a>
## Appendix H: Choosing good clusters of exit policies
With Walking Onions, we cannot easily support all the port
combinations [*] that we currently allow in the "policy summaries"
that we support in microdescriptors.
> [*] How many "short policy summaries" are there? The number would be
> 2^65535, except for the fact today's Tor doesn't permit exit policies to
> get maximally long.
In the Walking Onions whitepaper
(https://crysp.uwaterloo.ca/software/walkingonions/) we noted in
section 6 that we can group exit policies by class, and get down to
around 220 "classes" of port, such that each class was either
completely supported or completely unsupported by every relay. But
that number is still impractically large: if we need ~11 bytes to
represent a SNIP index range, we would need an extra 2320 bytes per
SNIP, which seems like more overhead than we really want.
We can reduce the number of port classes further, at the cost of
some fidelity. For example, suppose that the set {https,http} is
supported by relays {A,B,C,D}, and that the set {ssh,irc} is
supported by relays {B,C,D,E}. We could combine them into a new
port class {https,http,ssh,irc}, supported by relays {B,C,D} -- at
the expense of no longer being able to say that relay A supported
{https,http}, or that relay E supported {ssh,irc}.
This loss would not necessarily be permanent: the operator of relay
A might be willing to add support for {ssh,irc}, and the operator of
relay E might be willing to add support for {https,http}, in order
to become useful as an exit again.
(We might also choose to add a configuration option for relays to
take their exit policies directly from the port classes in the
consensus.)
How might we select our port classes? Three general categories of
approach seem possible: top-down, bottom-up, and hybrid.
In a top-down approach, we would collaborate with authority and exit
operators to identify _a priori_ reasonable classes of ports, such
as "Web", "Chat", "Miscellaneous internet", "SMTP", and "Everything
else". Authorities would then base exit indices on these classes.
In a bottom-up approach, we would find an algorithm to run on the
current exit policies in order to find the "best" set of port
classes to capture the policies as they stand with minimal loss.
(Quantifying this loss is nontrivial: do we weight by bandwidth? Do
we weight every port equally, or do we call some more "important"
than others?)
> See exit-analysis for an example tool that runs a greedy algorithm
> to find a "good" partition using an unweighted,
> all-ports-are-equal cost function. See the files
> "greedy-set-cov-{4,8,16}" for examples of port classes produced
> by this algorithm.
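To make the bottom-up idea concrete, here is a toy Python sketch that repeatedly merges the two port classes with the most similar supporter sets; it is not the exit-analysis tool mentioned above, and its Jaccard cost function is just one possible unweighted, all-ports-are-equal choice.

    from itertools import combinations

    # Toy bottom-up merging: `classes` maps frozenset-of-ports -> set-of-relays
    # supporting every port in the class.  Merge until `target` classes remain.

    def merge_port_classes(classes, target):
        classes = dict(classes)
        while len(classes) > target:
            def similarity(pair):
                ra, rb = classes[pair[0]], classes[pair[1]]
                return len(ra & rb) / max(1, len(ra | rb))   # Jaccard similarity
            a, b = max(combinations(classes, 2), key=similarity)
            merged_relays = classes[a] & classes[b]  # only relays supporting both
            del classes[a], classes[b]
            classes[a | b] = merged_relays           # union of the two port sets
        return classes

    # e.g. {https,http} on {A,B,C,D} and {ssh,irc} on {B,C,D,E} merge into
    # {https,http,ssh,irc} on {B,C,D}, as in the example above.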
In a hybrid approach, we'd use top-down and bottom-up techniques
together. For example, we could start with an automated bottom-up
approach, and then evaluate it based on feedback from operators. Or we
could start with a handcrafted top-down approach, and then use
bottom-up cost metrics to look for ways to split or combine those
port classes in order to represent existing policies with better
fidelity.
<!-- Section A.9 --> <a id='SA.9'></a>
## Appendix I: Non-clique topologies with Walking Onions
For future work, we can expand the Walking Onions design to
accommodate network topologies where relays are divided into groups,
and not every group connects to every other. To do so requires
additional design work, but here I'll provide what I hope will be a
workable sketch.
First, each SNIP needs to contain an ID saying which relay group it
belongs to, and an ID saying which relay group(s) may serve it.
When downloading an ENDIVE, each relay should report its own
identity, and receive an ENDIVE for that identity's group. It
should contain both the identities of relays in the group, and the
SNIPs that should be served for different indices by members of that
group.
The easy part would be to add an optional group identity field to
SNIPs, defaulting to 0, indicating that the relay belongs to that
group, and an optional served-by field to each SNIP, indicating
groups that may serve the SNIP. You'd only accept SNIPs if they
were served by a relay in a group that was allowed to serve them.
Would guards work? Sure: we'd need to have guard SNIPs served by
middle relays.
For hsdirs, we'd need to have either multiple shards of the hsdir
ring (which seems like a bad idea?) or have all middle nodes able to
reach the hsdir ring.
Things would get tricky with making onion services work: if you need
to use an introduction point or a rendezvous point in group X, then
you need to get there from a relay that allows connections to group
X. Does this imply indices meaning "Can reach group X" or
"two-degrees of group X"?
The question becomes: "how much work on alternative topologies does
it make sense to deploy in advance?" It seems like there are
unknowns affecting both client and relay operations here, which
suggests that advance deployment for either case is premature: we
can't necessarily make either clients or relays "do the right thing"
in advance given what we now know of the right thing.
<!-- Section A.10 --> <a id='SA.10'></a>
## Appendix Z: acknowledgments
Thanks to Peter Palfrader for his original design in proposal 141,
and to the designers of PIR-Tor, both of which inspired aspects of
this Walking Onions design.
Thanks to Chelsea Komlo, Sajin Sasy, and Ian Goldberg for feedback on
an earlier version of this design.
Thanks to David Goulet, Teor, and George Kadianakis for commentary on
earlier versions of proposal 300.
Thanks to Chelsea Komlo and Ian Goldberg for their help fleshing out
so many ideas related to Walking Onions in their work on the design
paper.
Thanks to Teor for improvements to diff format, ideas about grouping
exit ports, and numerous ideas about getting topology and
distribution right.
These specifications were supported by a grant from the Zcash
Foundation.
Hi,
tl;dr How to move key material out of tor?
## The idea of a vault component
ahf and others in the network team have been discussing the
possibility of a "vault" component in tor, for moving private keys out
of the tor process. Separating secret key material from the code
handling data from the network seems valuable and providing a
component making different implementations "pluggable" would allow for
anyone to use their favourite technology without touching the tor
code base. Examples are local trusted execution environments like Intel
SGX and Arm TrustZone, and various HSMs and security keys/tokens.
One way of implementing this would be to define a protocol, for a
vault component to talk to a daemon running on the same host as tor,
over some IPC mechanism. This protocol would allow tor to request a
signature over a hash, or a document, in a certain key. Whether the
daemon has access to the key material or has to forward the request to
a separate device or hardware component is irrelevant to the protocol
and the vault component.
Even if the design focuses on signatures it should probably take
encryption and decryption into account, to be added later.
## The vault as a possible match for TorHSM
Such a vault component would fit well with a project where Peter Stuge
and myself are building an HSM for Tor directory authority signing
keys [0] based on the CrypTech design [1].
[0] https://trac.cryptech.is/wiki/ExternalProjectsTorHSM
[1] https://cryptech.is/
One of the options for tor to delegate the signing of votes and
consensuses to such a device would be to use a vault component as
described above. We would then have a signing daemon responsible for
the USB communication with the TorHSM device. Another option would be
to build support directly into tor for communicating with the device,
through a library capable of a USB vendor specific protocol defined by
TorHSM.
There are pros and cons with both options. I'd like to hear what the
Tor developer community thinks would be best.
## Asynchronous signing API
Vault or not, tor needs to change the way signing is done to be able
to move key material out of the main process. First, external devices
may need more time to perform a signature operation than what tor can
accept to spend blocking in its main thread. Second, an external
device might disappear or malfunction, requiring the possibility for
an operation to time out.
This can be implemented either with callbacks or with a state machine
with more well defined state transitions. Maybe there are more
options that I didn't think of.
## Protocol choice
If we decide to go for the vault component with a separate daemon we
should consider using the ssh-agent protocol for tor to talk to the
daemon. Apart from being already defined, there are multiple well-tested
implementations of this protocol. It also has a couple of
features we might want, such as forgetting a certain key and
locking/unlocking of the vault. Finally, it's extensible and should
accommodate potential additions we might want to make later.
This novelette-length mail is meant to provide the background
information necessary to understand choices in congestion control
deployment for Tor. Much of this background information won't be
included in my planned Tor Proposal, which will hopefully be much
shorter, but the choices made there will be informed by this material.
This material has been enhanced through review and conversations with
Toke Høiland-Jørgensen, g burdell, Rob Jansen, and George Kadianakis.
Immense credit goes to Toke in particular, for detailed and thoughtful
review, and input on queue management.
Like my last mail on this topic, this mail comes with a playlist. Fair
warning: it is not exactly easy listening. Still, hopefully it gives you
strength and inspiration in these trying times:
https://people.torproject.org/~mikeperry/borders/everybody-loves-them/TakeA…
Section -I: Recap [RECAP]
https://wizardzines.com/zines/networking
I strongly believe that congestion control will provide the single
largest performance win for the Tor network. Specifically, it will
considerably improve both latency and throughput metrics in the best,
average, and worst case. In addition to performance, congestion control
will also provide the second-largest scalability win on our research
horizon (behind Walking Onions), by making the queue length memory
requirements of Tor relays independent of the number of circuits carried
on that relay.
I also believe we can achieve the vast majority of the benefits by
deploying an RTT-based system that only clients and Exits need to
support, specifically BOOTLEG_RTT_TOR. After this, we might consider
more advanced options based on ECN, but their benefits are far less
clear, and fraught with peril.
This mail assumes that you have read "Options for Congestion Control in
Tor-like Networks":
https://lists.torproject.org/pipermail/tor-dev/2020-January/014140.html
Even if you have already read the Options mail, I recommend reviewing
the SENDME section and the sections marked with TCP_HISTORY,
BOOTLEG_RTT_TOR, BACKWARD_ECN_TOR, and perhaps also FORWARD_ECN_TOR, as
understanding this mail requires an understanding of those sections. The
summary of the other sections from the Options mail is basically this:
adding ECN signals to Tor seems straight-forward, but is actually very
tricky.
Section @: Introduction and summary [INTRO_SUMMARY]
Deep understanding of the congestion control problem space is essential
because shitty congestion control will not be a huge win, and will
reduce the benefit of other performance improvements. Worse still, it
may introduce side channels and other anonymity attacks.
In particular, our level of congestion latency (and associated relay
queue lengths) will be very sensitive to congestion signal response
time, to our congestion window update rules, and also to our choice of
queue management algorithm (i.e., which circuits send cells and/or
congestion signals).
Our average circuit bandwidth on the network will similarly be very
sensitive to the ability of the congestion window to quickly update and
converge on the available Bandwidth-Delay Product (BDP) of its path.
In other words, we need to make good design choices in the following
three different areas:
I. Which congestion signal to use [CONGESTION_SIGNALS]
II. How to update congestion window [CONTROL_ALGORITHMS]
III. What circuits to send the signal on [QUEUE_MANAGEMENT]
The next three sections of this mail correspond to those three areas.
As before, I have tagged the specific sub-sections that will be used in
the Tor Proposal with [PROPOSAL_CANDIDATE]. I have also given tags to
sections in this document. References to things in this document are
enclosed in brackets. References to the Options mail are simply all
caps, without brackets.
The high-level plan is that we use RTT as a primary congestion signal,
as per BOOTLEG_RTT_TOR from the Options mail, and try it out with a few
different congestion algorithms, such as [TCP_VEGAS_TOR] and
[TCP_WESTWOOD_TOR].
It turns out that when RTT is used as a congestion signal, our current
Circuit-EWMA is sufficient to serve as [QUEUE_MANAGEMENT]. It also turns
out that Circuit-EWMA actually should protect us from fairness issues in
the control algorithms, and from cheaters who try to run different
congestion control algorithms. Circuit-EWMA will also deliver RTT
congestion signals more heavily to bulk circuits (because it prioritizes
their cells below light circuits), causing them to back off in favor of
lighter traffic, even as we relay fewer of their cells. Thanks goes to
Toke Høiland-Jørgensen for pointing this out.
The combination of RTT-based congestion signaling, a decent but simple
control algorithm, and Circuit-EWMA will get us most, if not all, of
the benefits we seek, and only requires clients and Exits to upgrade to
use it. Once it is deployed, circuit bandwidth will no longer be
capped at ~500kb/sec by the fixed window sizes of SENDME; queue latency
will fall significantly; memory requirements at relays should plummet;
and bottlenecks in the network should dissipate.
However, we can still improve upon this further, after this initial
deployment. Because the "slow start" phase of TCP can take a long time,
we will also want to play around with various initial window sizes,
window growth rate parameters, and experiment with using a single
BACKWARD_ECN_TOR cell_t.command ECN-style signal. This signal will be
sent on circuits as determined by [CODEL_TOR]. An instance of
[CODEL_TOR] can be added to each TLS connection, after Circuit-EWMA is
applied.
While BACKWARD_ECN_TOR is complex and slow to roll out (all relays must
support this new cell_t.command), this optimization will likely be worth
it because BACKWARD_ECN_TOR can send a signal on a circuit to stop the
exponential growth of slow start in *less than* RTT/2 time, which is
significantly faster than implicitly measuring an RTT-based signal
(which takes RTT plus some measurement lag to act upon). This faster
signal means faster termination of slow start, which means less buffer
bloat.
For stream flow control, on the client side, we should just ack all
stream cells with a circuit-level SENDME even though they were not
delivered to their blocked stream. The local Tor client should keep
queuing them until the application reads them. Thus, a blocked client
application stream will not block other streams, but will cause Tor
client stream-level queues to build up. For exit-side streams, we should
*not* send SENDMEs until the cell has been sent on its exiting upstream
connection. This allows TCP pushback on one stream to block the entire
circuit, but this is desirable to avoid queuing, and it will only happen
in the upload direction.
Section I: Selecting Congestion Signals [CONGESTION_SIGNALS]
The "Options for Congestion Control in Tor-like Networks" mail probably
should have been titled "Options for Congestion Signals in Tor-like
Networks", as its primary focus was on the ways Tor endpoints can
receive a signal about the congestion status of a circuit on bottleneck
relays. In that post, I postponed discussion of queue management, and
deliberately under-specified how to react to a congestion signal (going
no further than saying "use SlowStart and AIMD").
I chose that scoping because the most difficult barrier we have faced in
the past decade of work on this topic is the side channels enabled by a
congestion signal (and information leaks in general).
However, I should have made our constraints more explicit. Our choice of
acceptable congestion signal mechanism is restricted by these design
constraints:
1. No side channels between Guard and Exit, or between circuits
2. Incremental upgrade path
3. Fast response time
4. Cheater prevention
These design constraints restrict our choice of congestion signal, but
they are not our only design constraints in this problem space. When we
get to the later sections on window control update algorithm, and queue
management algorithm, I will explicitly enumerate additional design
constraints there. Hopefully this will provide more clarity than the
uncategorized list of "causes of death" in the Options post.
For your convenience, here's the summary table of all the options for
congestion signal again:
______________________________________________________________________
| MIX | CHEATING | RESPONSE| SIDE | UPGRADE |
| TRACK | DETECTION | TIME | CHANNELS | PATH? |
|~~~~~~~~~~~~~~~~~~~~~~|~~~~~~~~~~~|~~~~~~~~~|~~~~~~~~~~~~|~~~~~~~~~~|
|BOOTLEG_RTT_TOR | Exit RTT |100c+RTT | RTT | Exits |
|FORWARD_ECN_TOR | Circ OOM | RTT | None? | Full Net |
|FORWARD_ECN_RTT_TOR | Exit RTT | RTT | RTT |Exits;Full|
|BACKWARD_ECN_TOR |Middles;OOM| RTT/2 | Guard->Exit| Full Net |
|BACKWARD_ECN_RTT_TOR |Middles;RTT|RTT/2;RTT| Guard->Exit|Exits;Full|
|BACKWARD_ECN_TRAP_TOR | Middles | RTT/2 |Guard<->Exit| Full Net |
|EDGE_FORWARD_ECN_TOR |Middles;OOM| 2*RTT/3 | None? | Full Net |
|START_BACKWARD_ECN_TOR|Middles;OOM|RTT/2;RTT| Low/None? | Full Net |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My short-term favorite is BOOTLEG_RTT_TOR, because it only requires
Exit relay support to use.
Recall that BOOTLEG_RTT_TOR defines an RTT latency threshold as its
congestion signal, and was first described in section 4.1 of the N23
Defenestrator paper. Even though the N23 paper didn't bother to name
this system, RTT use as a congestion signal in this way also resembles
TCP Vegas, as we shall see.
https://www.cypherpunks.ca/~iang/pubs/defenestrator.pdf
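To sketch what such a signal could look like in practice: the Python fragment below derives a binary congestion signal from per-circuit RTT samples. The EWMA gain and the position of the threshold between the observed minimum and maximum RTT are illustrative knobs, not values taken from the N23 paper or from any Tor proposal.

    # One plausible way to turn per-circuit RTT samples into a binary
    # congestion signal, in the spirit of BOOTLEG_RTT_TOR.  All constants
    # here are illustrative.

    class RttCongestionSignal:
        def __init__(self, ewma_gain=0.1, threshold=0.25):
            self.gain = ewma_gain
            self.threshold = threshold     # fraction of the way from min to max RTT
            self.ewma_rtt = None
            self.min_rtt = float("inf")
            self.max_rtt = 0.0

        def on_rtt_sample(self, rtt: float) -> bool:
            """Record one RTT sample; return True if the circuit looks congested."""
            self.min_rtt = min(self.min_rtt, rtt)
            self.max_rtt = max(self.max_rtt, rtt)
            self.ewma_rtt = rtt if self.ewma_rtt is None else (
                self.gain * rtt + (1 - self.gain) * self.ewma_rtt)
            cutoff = self.min_rtt + self.threshold * (self.max_rtt - self.min_rtt)
            return self.ewma_rtt > cutoff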
Note that because Circuit-EWMA will add additional delay to loud
circuits, "cheaters" who use alternate congestion control algorithms
should end up with more delay than those who do not. This means that if
we use RTT as a congestion signal, Circuit-EWMA should provide *more*
RTT congestion signal to cheaters, and cheaters would only be punishing
themselves with more Circuit-EWMA delay. So cheater resilience for an
RTT signal is actually better handled than I previously thought.
Again, thanks goes to Toke Høiland-Jørgensen for pointing this out.
For completeness, we will still explore some examples of alternate
"cheating" control algorithms in Section II, though.
Unfortunately, RTT is itself a side channel, as per:
https://www.freehaven.net/anonbib/cache/ccs07-latency-leak.pdf
https://www.robgjansen.com/publications/howlow-pets2013.pdf
There is also literature on shaping circuit bandwidth to create a side
channel. This can be done regardless of the use of congestion control,
and is not an argument against using congestion control. In fact, the
Backlit defense may be an argument in favor of endpoints monitoring
circuit bandwidth and latency more closely, as a defense:
https://www.freehaven.net/anonbib/cache/ndss09-rainbow.pdf
https://www.freehaven.net/anonbib/cache/ndss11-swirl.pdf
https://www.freehaven.net/anonbib/cache/acsac11-backlit.pdf
Recall that BACKWARD_ECN_TOR used circuit-level cell_t.command to signal
congestion. This allows all relays in the path to signal congestion in
under RTT/2 in either direction, and it can be flipped on existing relay
cells already in transit, without introducing any overhead. However,
because cell_t.command is visible and malleable to all relays, it can
also be used as a side channel. So we must limit its use to a couple of
cells per circuit, at most.
Recall that FORWARD_ECN_TOR adds encrypted end-to-end relay cells that
explicitly signal congestion to either the client or the Exit, but
requires the client to mediate and echo this cell to protect against
side channels. Because congestion signals must be relayed off the
client, it still takes the Exit a full RTT to learn about congestion at
earlier nodes in the circuit. Additionally, because this cell itself
must travel the full path in each direction, it is further delayed by
congestion on the path, and it also adds yet more congestion to
congested paths. Congestion in the Exit-to-client direction is the most
common, as this is the download direction. It will be particularly
expensive and slow to bounce the congestion signal off the client to the
Exit.
For this reason, I think all forms of FORWARD_ECN_TOR are not likely to
be as efficient as RTT signaling, despite the FORWARD_ECN_RTT_TOR
variant initially being flagged as a PROPOSAL_CANDIDATE in the Options
mail. However, we should still keep it in mind, since RTT may turn out
to be an unreliable signal by itself.
Therefore, since RTT is already measurable with our current SENDME flow
control, I propose that we use it via BOOTLEG_RTT_TOR (or TCP Vegas),
and then add START_BACKWARD_ECN_TOR only exactly once (to exit slow
start). If RTT proves unreliable, then we can experiment with the more
heavyweight FORWARD_ECN_TOR full-cell signal.
As we shall see in the following control algorithm section, RTT-based
systems like TCP Vegas largely avoid the need to send any congestion
signal (drop or ECN). In other words: if we can rely on RTT, we should
not need any explicit signals at all (though they would improve
responsiveness).
Section II: Congestion Window Control Algorithms [CONTROL_ALGORITHMS]
Recall that the congestion window is the number of packets TCP will
allow to be sent on a connection without any acknowledgment. The
congestion window is meant to match the bandwidth-delay product of the
connection as close as possible. The bandwidth-delay product can be
thought of as the total amount of data that can be in-transit on a link
at one time without using any queue memory (bytes per second of
bandwidth multiplied by seconds of transit time delay = total bytes in
transit on the wire at one time).
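As a quick worked example of that arithmetic (using the classic 512-byte cell size purely for illustration):

    # Illustrative arithmetic only: the congestion window should track the
    # circuit's bandwidth-delay product, expressed here in 512-byte cells.

    CELL_SIZE = 512   # bytes per cell, used for illustration

    def bdp_in_cells(bandwidth_bytes_per_sec: float, rtt_sec: float) -> int:
        return int(bandwidth_bytes_per_sec * rtt_sec / CELL_SIZE)

    # A 1 MB/s bottleneck with a 300 ms circuit RTT can keep roughly
    # 1_000_000 * 0.3 / 512 ~= 585 cells in flight without queuing.
    print(bdp_in_cells(1_000_000, 0.3))   # -> 585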
On the Internet, if the congestion window exceeds the bandwidth-delay
product of a transmission path, then packets must build up in router
queues, and eventually get dropped. If the congestion window is below
the bandwidth-delay product of the path, then all routers on the path
spend time unused and idle, and thus utilization suffers, even if users
are trying to use full bandwidth.
TCP variants basically compete on their ability to collectively track
the bandwidth-delay product of a path under a wider variety of
network conditions and other competing traffic.
In the interest of brevity, my "Options for Congestion Control in
Tor-like Networks" mail suggested that we update the congestion window
according to Slow Start with AIMD, in response to a congestion signal
(or lack thereof).
However, that is not the only option, and if we retain the ability to
measure RTT, it may not even be the most efficient one. So to help us
make this decision, I did a review of historical TCP and TCP-like
systems in the research literature, and will present the highlights
here.
But first, let's define the control requirements that we want to satisfy:
1. Full bandwidth utilization of slowest TLS link in every circuit
2. Queue length minimization at the relay sending on this TLS link
3. Fairness between circuits (esp partially overlapping ones)
4. Rapid convergence to full circuit capacity
These requirements mirror the requirements used in TCP evaluation. TCP
implementations compete over metrics in each of these areas, and
algorithm design is far more art than formal proof of optimality or even
correctness.
That does not mean that formalism is impossible, however. For a great
formal breakdown of this design space without all of TCP's baggage, see
Section 3 of the XCP paper. Note that we can't use XCP as-is because its
congestion signal fields provide way too much information for use in
potential side-channels, but it is worth a read because it clarifies the
control theory behind congestion control:
http://conferences.sigcomm.org/sigcomm/2002/papers/xcp.pdf
In terms of how this design space is approached in TCP: utilization and
queue minimization are typically handled simultaneously - sometimes
explicitly and sometimes implicitly. Similarly, fairness is sometimes
explicitly designed for, but more often fairness is an emergent property
that is analyzed when competing against a benchmark (historically: TCP
Reno). Fairness abuse is also an area where cheating can occur that
enables cheaters to get more than their fair share of bandwidth, and
such cheating can be very hard to detect and prevent on the Internet.
Interestingly, many of the design goals of TCP are *also* improved via
queue management algorithms at routers (Section III of this mail). In
fact, our Circuit-EWMA circuit scheduler will effectively hand out RTT
congestion signals (delay) to loud circuits before quiet ones, and will
*also* enforce fairness and impede cheating, by delaying a cheater's
cells even more.
In this sense, we do not need to worry quite so much about fairness and
cheating as security properties, but a lack of fairness at the TCP
algorithm will *still* increase memory use in relays to queue these
unfair/loud circuits. So we should still be mindful of these properties
in selecting our congestion algorithm, to minimize relay memory use, if
nothing else.
For a good summary of several TCP congestion control algorithms, see:
http://intronetworks.cs.luc.edu/1/html/reno.html ("Slow Start+AIMD")
http://intronetworks.cs.luc.edu/1/html/newtcps.html (modern TCPs)
As you can see, the TCP design space is huge.
A good way to break down the TCP space is to separate the drop-based
algorithms from the delay-based algorithms. TCP drop-based algorithms
require either packet drops or ECN packet marking to signal congestion.
Delay based algorithms infer congestion from RTT measurements only.
But as we shall see, this distinction is artificial for us.
BOOTLEG_RTT_TOR effectively converts delay *into* an explicit congestion
signal, as far as congestion window control algorithms are concerned.
Similarly, either of our ECN signals can be used in place of the drop
signal in the algorithms below.
I'm going to review four options that strike me as most relevant to our
needs. Two are drop-based: TCP Reno, and TCP CUBIC. Two are
delay-based: TCP Vegas, TCP Westwood.
It should be noted that there are also hybrid algorithms that use
drops/ECN in addition to delay as congestion signals, but since we are
focusing on primarily using RTT only as our signal, I won't be covering
them. For a good overview of those options, see:
https://arxiv.org/pdf/1903.03852.pdf
Toke Høiland-Jørgensen suggested that we also investigate BBR, which is
a hybrid algorithm currently being evaluated by DropBox and Google.
However, BBR involves packet send time scheduling, which will require
replacement of Circuit-EWMA and other extensive changes. Furthermore,
BBR is still under active development, experimentation, and tuning. In
fact, while v2 has been tested by Dropbox, no IETF draft specification
exists yet. Still, for completeness:
https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congestion-control-00
https://queue.acm.org/detail.cfm?id=3022184
https://datatracker.ietf.org/doc/slides-106-iccrg-update-on-bbrv2/
https://dropbox.tech/infrastructure/evaluating-bbrv2-on-the-dropbox-edge-ne…
I am also going to simplify and omit treatment of packet loss and
duplicate acks, since we will not be dropping any packets. Additionally,
I have normalized all formulas below to be in terms of Tor cells, rather
than bytes.
The hope is that these simplifications help us get to the essence of
what these algorithms would look like when deployed on Tor, without
getting distracted by irrelevant drop and ack synchronization management
details from TCP/IP.
However, as we move closer to specifying an actual design proposal, we
should double-check my simplifications below, for the algorithm(s) that
we actually specify.
Section II.A: TCP Reno [TCP_RENO_TOR]
https://tools.ietf.org/html/rfc5681
http://intronetworks.cs.luc.edu/1/html/reno.html
Until recently, TCP Reno was the most popular TCP on the Internet, and
it is still used as a fairness benchmark. While we won't be using it
directly, it is thus important to understand.
Like all TCPs, TCP Reno satisfies its utilization and queue minimization
design goals by trying to make its congestion window closely match the
bandwidth-delay product of the link. More importantly, when multiple TCP
connections are present competing for the bandwidth of a link, TCP Reno
fairly distributes the bandwidth-delay product of the link equally among
all competing TCP Reno connections.
http://pbg.cs.illinois.edu/courses/cs598fa09/slides/05-Ashish.pdf
https://www.cse.wustl.edu/~jain/papers/ftp/cong_av.pdf
TCP Reno probes the bandwidth-delay product by increasing its congestion
window until it overflows the queue capacity of the slowest router
on a path, causing a packet drop or ECN signal, upon which TCP Reno
reduces its congestion window.
The initial phase of TCP Reno is called "Slow Start". For our purposes,
Slow Start in TCP Reno means that on every SENDME ack, the congestion
window cwnd becomes:
cwnd = cwnd + SENDME_RATE (recall Tor's SENDME_RATE is 100 cells)
Because increasing the congestion window by 100 every SENDME allows the
sender to send 100 *more* 512 byte cells every 100 cells, this is an
exponential function that causes cwnd to double every cwnd cells. TCP
Reno keeps on doubling until it experiences a congestion signal.
Once a congestion signal is experienced, Slow Start is exited, and the
Additive-Increase-Multiplicative-Decrease (AIMD) steady-state phase
begins. AIMD algorithms can be generalized with two parameters: the add
quantity, and the multiply quantity. This is written as AIMD(a,m). TCP
Reno uses AIMD(1, 0.5). It should be noted that most later TCPs
include an AIMD component of some kind, but tend to use 0.7 for m
instead of 0.5. This allows them to recover faster from congestion.
After slow start exits, in steady-state, after every ack (SENDME)
response without a congestion signal, the window is updated as:
cwnd = cwnd + SENDME_RATE/cwnd
This comes out to increasing cwnd by 1, every time cwnd cells are
successfully sent without a congestion signal occurring. Thus this is
additive linear growth, not exponential growth.
If there was a congestion signal, cwnd is updated as:
cwnd = cwnd * m (m=0.5 or 0.7)
This is called "Fast Recovery". If you dig into the RFCs, actual TCP
Reno has some additional details for responding to multiple packet
losses that in some cases can fall back into slow-start. However, for
our purposes, the congestion window should only be updated once per cwnd
cells, in either the congestion case, or the non-congestion case.
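Putting those pieces together, here's a minimal Python sketch of the
simplified [TCP_RENO_TOR] update described above. This is an
illustration of the mail's simplifications, not tor code:

    SENDME_RATE = 100   # cells acked per SENDME

    def reno_update(cwnd, in_slow_start, congestion_signal, m=0.5):
        """One window update per SENDME ack. Per the text above, the
        congestion branch should be applied at most once per cwnd cells."""
        if congestion_signal:
            return cwnd * m, False                # multiplicative decrease
        if in_slow_start:
            return cwnd + SENDME_RATE, True       # exponential growth
        return cwnd + SENDME_RATE / cwnd, False   # additive increase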
TCP Reno (and all TCPs) also revert to slow start when the connection
has been idle, in case bottleneck capacity has changed in this time. We
may or may not want to do this, as slow start has a dramatic impact on
page load times.
TCP Reno is still used as the benchmark to determine fairness. If an
algorithm is "TCP Friendly", this basically means that when competing
with TCP Reno to use spare path capacity, it does not get its ass
kicked, and it does not kick Reno's ass. Several algorithms contain
hacks or special cases to preserve this property. If this fairness
property holds, a bunch of math can prove convergence rates,
utilization, congestion window sizes, and even queue sizes. Modern
queue management also relies on this math to auto-tune queue parameters,
as we will see in Section III.
http://pbg.cs.illinois.edu/courses/cs598fa09/slides/05-Ashish.pdf
https://www.icir.org/tfrc/aimd.pdf
Section II.B: TCP Cubic [TCP_CUBIC_TOR]
http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-cubic
https://pandorafms.com/blog/tcp-congestion-control/
https://tools.ietf.org/html/rfc8312
TCP Reno unfortunately does not perform optimally for networks where the
natural packet loss rate starts to become frequent, which happens for
both high-bandwidth and high-latency networks. This shortcoming has led
to a huge explosion of competition among replacements and improvements
for this use case, and TCP Cubic has won out (for now).
TCP Cubic basically takes TCP Reno and tweaks the additive portion of
AIMD behavior to be a cubic function (y=x^3) instead of linearly
additive. The cubic function was chosen because it is concave, then
flat, and then convex. The concave piece allows for quickly recovering
after congestion (or non-congestion-related packet loss), and the convex
piece allows for gently probing for fresh spare link capacity.
Because of adoption by both the Linux kernel and the Windows kernel as
the default TCP, TCP Cubic is now the most popular TCP on the Internet.
Unfortunately, TCP Cubic is complicated. It has multiple magic numbers,
and is defined piecewise depending on how its congestion window would
compare to the equivalent TCP Reno window (to ensure the "TCP Friendly"
property -- without this hack, TCP Cubic is "too polite" and TCP Reno
eats its lunch). It additionally has time-based components that must be
clocked relative to current measured RTT (though it does not use RTT as
a congestion signal).
The cubic congestion window is defined as:
W_max = previous window size before last congestion signal
W_cubic(t) = C*(t-K)^3 + W_max
With magic number constants:
C = 0.4
beta_cubic = 0.7
K = cubic_root(W_max*(1-beta_cubic)/C)
To maintain competitiveness with TCP Reno, Cubic keeps a running
estimate of an equivalent TCP Reno's window size, using the AIMD math
cited for Reno above:
W_est(t) = W_max*beta_cubic +
           [3*(1-beta_cubic)/(1+beta_cubic)] * (t/RTT)
if W_est(t) > W_cubic(t):
  cwnd = W_est(t)
else:
  cwnd = cwnd + (W_cubic(t+RTT) - cwnd)/cwnd
Upon congestion signal, Cubic uses its beta_cubic parameter (0.7), and
thus does not back off quite as far as Reno:
W_max = cwnd
cwnd = cwnd * beta_cubic
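To make the piecewise definition a little more concrete, here is the
same math transcribed into a Python sketch (t is the time since the last
congestion event; this is just the formulas above, not a full RFC 8312
implementation):

    C = 0.4
    BETA_CUBIC = 0.7

    def cubic_update(cwnd, W_max, t, rtt):
        """One TCP Cubic window update, per the formulas above."""
        K = (W_max * (1 - BETA_CUBIC) / C) ** (1.0 / 3)
        W_cubic = lambda time: C * (time - K) ** 3 + W_max
        # Running estimate of an equivalent Reno window ("TCP Friendly").
        W_est = W_max * BETA_CUBIC + \
            (3 * (1 - BETA_CUBIC) / (1 + BETA_CUBIC)) * (t / rtt)
        if W_est > W_cubic(t):
            return W_est
        return cwnd + (W_cubic(t + rtt) - cwnd) / cwnd

    def cubic_on_congestion(cwnd):
        """Backoff on a congestion signal; returns (new_W_max, new_cwnd)."""
        return cwnd, cwnd * BETA_CUBIC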
It is pretty easy to convince yourself that Cubic will do at least as
well as TCP Reno (thanks to the AIMD Reno window comparison hack), but
it is very hard to conceptualize the above and convince yourself that it
is the best we could possibly do for our use case. Especially since Tor
is *not* affected by an increasing inherent drop rate as bandwidth
increases, unlike the Internet.
For these reasons, the complexity of TCP Cubic does not strike me as
worthwhile, as compared to other, simpler candidates.
Section II.C: TCP Vegas [TCP_VEGAS_TOR] [PROPOSAL_CANDIDATE]
http://pages.cs.wisc.edu/~akella/CS740/F08/740-Papers/BOP94.pdf
ftp://ftp.cs.princeton.edu/techreports/2000/628.pdf (lol ftp, sorry)
Similar to BOOTLEG_RTT_TOR from my Options mail, TCP Vegas was designed
to make use of RTT and bandwidth estimation to avoid queue congestion,
and thus avoid any dropped packets or ECN signals. However, TCP Vegas
actually directly estimates queue length of the bottleneck router, and
aims to minimize the packets in that queue.
To accomplish this, TCP Vegas maintains two RTT measurements:
RTT_current and RTT_min. It also maintains a bandwidth estimate.
The bandwidth estimate is the current congestion window size divided by
the RTT estimate:
BWE = cwnd/RTT_current
(Note: In later TCP Vegas relatives like TCP Westwood, this bandwidth
estimate was improved to use an explicit count of the unacked packets in
transit, instead of cwnd).
Recall that in order to utilize the full capacity of the link, all TCPs
attempt to keep their congestion windows as close to the
bandwidth-delay product (i.e. the "memory capacity") of the transmission
path as possible.
The extra queue use along a path can thus be estimated by first
estimating the path's bandwidth-delay product:
BDP = BWE*RTT_min
TCP Vegas then estimates the queue use caused by congestion as:
queue_use = cwnd - BDP
          = cwnd - cwnd*RTT_min/RTT_current
          = cwnd * (1 - RTT_min/RTT_current)
If queue_use drops below a threshold alpha (typically 2-3 packets for
TCP, but perhaps double or triple that for our smaller cells), then the
congestion window is increased. If queue_use exceeds a threshold beta
(typically 4-6 packets, but again we should probably double or triple
this), then the congestion window is decreased. This update happens only
once per cwnd packets:
if queue_use < alpha:
  cwnd = cwnd + 1
elif queue_use > beta:
  cwnd = cwnd - 1
As a backstop, TCP Vegas will still halve its congestion window in the
event that a congestion signal (packet drop or ECN) still occurs.
However, such packet drops should not actually even happen, unless Vegas
is fighting with a less fair control algorithm that is causing drops
instead of backing off based on queue delay.
Vegas also uses this queue_use estimate to govern its initial slow start
behavior. During slow start, the Vegas congestion window grows
exponentially, but only half as fast as Reno (to avoid excessive
congestion). For us, this means that cwnd is updated every SENDME ack
by:
if queue_use < gamma:
  cwnd = cwnd + SENDME_RATE/2
This exponential increase continues until the queue_use estimate exceeds
"gamma" (which is determined experimentally and sadly not listed in the
TCP Vegas paper). If I had to guess, gamma is probably close to beta
(4-6 packets).
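Pulling the Vegas pieces together, here's a minimal Python sketch. The
alpha/beta/gamma values are placeholder guesses based on the
doubling/tripling discussion above, not tuned numbers:

    SENDME_RATE = 100
    ALPHA, BETA, GAMMA = 6, 12, 12   # placeholder guesses, in cells

    def vegas_queue_use(cwnd, rtt_min, rtt_current):
        # Estimated number of our cells sitting in the bottleneck queue.
        return cwnd * (1 - rtt_min / rtt_current)

    def vegas_update(cwnd, in_slow_start, rtt_min, rtt_current):
        """One [TCP_VEGAS_TOR] window update. Per the text above, the
        slow start branch runs every SENDME ack; the steady-state branch
        runs once per cwnd cells."""
        q = vegas_queue_use(cwnd, rtt_min, rtt_current)
        if in_slow_start:
            if q < GAMMA:
                return cwnd + SENDME_RATE / 2, True   # half-rate slow start
            return cwnd, False                        # exit slow start
        if q < ALPHA:
            return cwnd + 1, False
        elif q > BETA:
            return cwnd - 1, False
        return cwnd, False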
Because Vegas tries much harder to avoid congestion than TCP Reno, Vegas
is vulnerable to cheating through fairness abuse. When TCP Vegas
competes against a TCP Reno-style TCP that responds only to packet drops
(or ECN signals in our case) and ignores RTT, TCP Vegas detects
congestion earlier than TCP Reno would, and ends up yielding bandwidth
to TCP Reno. This means that Vegas does *not* have the "TCP Friendly"
property that modern AQM algorithms depend on for their
parameterization.
Note that we should expect the same thing to happen if TCP Vegas
competed with the BOOTLEG_RTT_TOR algorithm run by cheaters. However, since
RTT is being used as the congestion signal in both cases, Circuit-EWMA
should still balance this out, but we should experiment with this to
confirm. There may still be additional queuing memory costs in the
presence of cheaters, compared to either pure [TCP_VEGAS_TOR] or pure
BOOTLEG_RTT_TOR. Even so: this form of cheating is also pretty easy for
honest Exit nodes to detect, and cheating clients only get benefit in
the upload direction, even if they did cheat.
To see this a bit more concretely, if we sub in TCP Reno in
BOOTLEG_RTT_TOR, during slow start we have:
T = (1-a)*rtt_min + a*rtt_max
If rtt < T:
  cwnd = cwnd + SENDME_RATE
else:
  exit slow start
And then for steady state:
If rtt < T:
  cwnd = cwnd + SENDME_RATE/cwnd
else:
  cwnd = cwnd * 0.5
The 'a' parameter from BOOTLEG_RTT_TOR is operating similarly to the
queue_use criterion of [TCP_VEGAS_TOR], but the window update rules are
[TCP_RENO_TOR], which are more aggressive than TCP Vegas.
So effectively, this is still [TCP_VEGAS_TOR] with a different target
condition for queue_use. This means we could expect BOOTLEG_RTT_TOR to
drive the congestion to some higher rate of queue use than
[TCP_VEGAS_TOR], and senders who use it may out-compete Vegas much like
TCP Reno out-competes Vegas, until Circuit-EWMA starts punishing them
(at the expense of more queue memory waste).
This also shows us a crucial testing consideration: [TCP_VEGAS_TOR] may
not compete well with our current SENDME congestion control, since its
ability to set queue_target may cause it to de-prioritize itself below
competing SENDME flows. BOOTLEG_RTT_TOR, on the other hand, will allow
us to set its 'a' parameter at whatever point makes it competitive with
SENDME. For this reason, BOOTLEG_RTT_TOR may yield better performance
characteristics until all clients upgrade.
Section II.D: TCP Westwood [TCP_WESTWOOD_TOR] [PROPOSAL_CANDIDATE]
https://tools.ietf.org/html/rfc8290#section-4.1
http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-westwood
TCP Westwood is essentially TCP Reno, but if the current BDP estimate is
greater than half the current congestion window, TCP Westwood uses that
instead of halving the congestion window, during a congestion signal.
In other words, upon a congestion signal, TCP Westwood does:
cwnd = max(cwnd * 0.5, BDP_estimate)
This has the effect of recovering from congestion events much quicker
than TCP Reno, and thus utilizing more of the path capacity.
Westwood+ also has some improvements on TCP Vegas's BDP_estimate in
order to smooth it in the presence of delayed and/or dropped acks:
BWE_k = a*BWE_(k-1) + (1-a)*BWM_k
We may not need this smoothing as much, because we do not have any
dropped acks, and dropped acks were far more damaging to TCP Vegas than
slightly delayed acks.
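A minimal sketch of the Westwood response, with the Westwood+-style
smoothing applied to per-SENDME bandwidth samples (the smoothing
constant a=0.9 is a placeholder):

    def smoothed_bwe(prev_bwe, bw_sample, a=0.9):
        # Westwood+-style EWMA over bandwidth samples (cells/second).
        return a * prev_bwe + (1 - a) * bw_sample

    def westwood_on_congestion(cwnd, bwe, rtt_min):
        """[TCP_WESTWOOD_TOR]: never back off below the BDP estimate."""
        bdp_estimate = bwe * rtt_min   # cells in flight at full utilization
        return max(cwnd * 0.5, bdp_estimate)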
Section III: Queue Management Algorithms [QUEUE_MANAGEMENT]
On the Internet, Queue Management Algorithms are used by routers to
decide how long their queues should be, which packets they should drop
and when, and which connections to send ECN signals on and when. Queue
management is almost as large of a research area as congestion control.
Thanks to bittorrent and video streaming, queue management also grew to
consider fairness or QoS between flows. This is done by separating
queues or imposing a scheduling algorithm on packet ordering. For an
excellent overview of fairness vs queue management, see RFC 7806:
https://tools.ietf.org/html/rfc7806
Since Tor relays cannot drop packets, our design goals are slightly
different than Internet queue management. We care about the following
design goals:
1. Prioritize well-behaved/light circuits over loud/heavy ones
2. Decide when to send ECN signals
3. Decide which circuits to send ECN signals on
4. Detect which circuits are queuing too much (or otherwise cheating)
Recall that Tor has already deployed Circuit-EWMA in order to meet goal
#1. Circuit-EWMA tries to give priority to less loud circuits in the
circuit scheduler, so that interactive applications experience less
latency than bulk transfers.
https://www.cypherpunks.ca/~iang/pubs/ewma-ccs.pdf
Circuit-EWMA can be viewed as a fairness algorithm as per RFC 7806
Section 2, since it schedules flows independently with the target
fairness property of prioritizing quiet circuits' cells over loud ones.
When Circuit-EWMA is used in combination with RTT as a congestion
signal, it is effectively *also* a marking algorithm. This is because it
*also* distributes RTT congestion signals more frequently to loud
circuits:
https://tools.ietf.org/html/rfc7806#section-2
https://tools.ietf.org/html/rfc7806#section-3
Recall that Tor also had to deploy KIST, to move queues from the Linux
kernel into Tor, for Circuit-EWMA to properly prioritize them. See
Section 4 and 5 of the full-length KIST paper for details on this
interaction, and see page 5 for a good graphical depiction of all the
queuing points in Tor and the Kernel:
https://matt.traudt.xyz/static/papers/kist-tops2018.pdf
It is likely the case that Circuit-EWMA and KIST are all we need if we
only use RTT signaling. But if we add in ECN, at minimum we will need to
update Circuit-EWMA to inform us which circuit(s) to send ECN signals
on. In fact, thanks to RFC7806's distinction between fairness and
management, modern Internet queue management algorithms compose well
with Circuit-EWMA, allowing us to still use Circuit-EWMA for the
fairness portion, but use an existing Internet algorithm for the ECN
marking. So it is useful to review Internet history here, as well as
current state-of-the-art.
On the Internet, queue management algorithms arose fairly early for one
primary reason: early routers performed what is called "tail dropping".
Tail dropping means that routers allowed their queues to get full, and
then started dropping all packets that arrived while the queue remained
full. This has all kinds of negative consequences, but the easiest to
intuitively understand is that if you wait until your queue is full
before you start dropping packets, and then drop all of the packets that
arrive from then on, competing TCP connections will all back off at the
same time, leading to lulls in network utilization.
The key innovation to address tail dropping was to randomly drop (or ECN
mark) packets early, before the queue was full, with a per-packet drop
probability. This probability was computed based on many factors of the
link, and sometimes based on individual TCP connection tracking.
Until very recently, both queue management and fair queuing algorithms
tended to have many knobs that must be tuned for specific kinds of links
and load levels. The Internet seems to be converging on three main
contenders for "knobless" queue management: PIE, ARED, and CoDel. Here's
some recent review material:
http://content.ikr.uni-stuttgart.de/Content/Publications/Archive/Wa_EUNICE_…
https://datatracker.ietf.org/meeting/88/materials/slides-88-iccrg-4/
However, be aware that these knobless systems make assumptions about the
congestion control algorithms and the latency of the connections that
may not apply to us. The assumptions are most clearly specified in
Section 3.2 of the CoDel RFC, and are closely related to the "TCP
Fairness vs TCP Reno" assumption from our congestion control review:
https://tools.ietf.org/html/rfc8289#section-3.2
Ok, so let's review the top "knobless" contenders.
Section III.A: PIE and ARED [ARED_PIE_TOR]
https://tools.ietf.org/html/rfc8033
http://www.icir.org/floyd/papers/adaptiveRed.pdf
As a quick summary, both PIE and ARED manage their queues by dropping
packets probabilistically. In both systems, the per-packet drop
probability is updated in proportion to average queue transit time --
longer queues mean the drop probability is increased.
Both systems decide to drop individual packets from the queue using
this probability. This has the effect that louder flows (with more
packets) are also more likely to experience drops (or ECN congestion
marking). But otherwise they have no explicit notion of fairness.
This makes them marking algorithms as per RFC7806's classification. If
explicit QoS or fairness is desired, they have to be deployed in
combination with a fairness algorithm to prioritize between separate
queues, again as per RFC 7806 Section 3.3:
https://tools.ietf.org/html/rfc7806#section-3.3
Recall that Tor has multiple levels of queues: Circuit, TLS connections,
and kernel buffers, as per Figure 1 on page 5 in the KIST paper:
https://matt.traudt.xyz/static/papers/kist-tops2018.pdf
Even if we keep Circuit-EWMA for scheduling (and thus fairness), we can
still add in PIE or ARED per each TLS connection. If the average total
queue length for all circuits on a TLS connection grows beyond the
marking threshold for PIE/ARED, we could then begin sending congestion
signals on circuits, as they send cells, as per the cell drop
probability of these algorithms.
However, because we only want to send one BACKWARD_ECN_TOR signal per
circuit, we will have to save these signals for circuits that have not
yet gotten one. It is an open question how we should manage the mark
probability if we have to pass over a circuit because we already sent a
backward ECN signal on it.
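To illustrate how this might look per TLS connection, here is a toy
sketch. It is *not* the real PIE or ARED update rule (RFC 8033 uses a
proportional-integral controller with more state); it only shows the
shape of the idea, including skipping circuits that already got their
one backward signal. The already_marked flag and the probability step
are made up:

    import random

    def update_mark_probability(p, avg_queue_delay, target_delay, step=0.01):
        # Toy controller: raise the per-cell marking probability when the
        # average queue delay on this TLS connection exceeds the target,
        # lower it otherwise. Real PIE/ARED use more state than this.
        if avg_queue_delay > target_delay:
            return min(1.0, p + step)
        return max(0.0, p - step)

    def maybe_mark(circuit, p):
        # Mark (send an ECN signal on) this circuit with probability p,
        # skipping circuits that already received their one signal.
        # already_marked is a hypothetical per-circuit flag.
        if not circuit.already_marked and random.random() < p:
            circuit.already_marked = True
            return True
        return False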
Section III.B: CoDel and FQ_CoDel [CODEL_TOR] [PROPOSAL_CANDIDATE]
https://queue.acm.org/detail.cfm?id=2209336
https://tools.ietf.org/html/rfc8289
https://tools.ietf.org/html/rfc8290
CoDel is a relatively recent algorithm that is gaining traction because
of its simple elegance. Like PIE and ARED, CoDel itself is a marking
algorithm, and can be combined with any fairness/QoS algorithm. It also
comes with its own fairness algorithm, called FQ_CoDel.
https://tools.ietf.org/html/rfc7806#section-3.2
Despite being touted as parameterless, CoDel does have two parameters:
inspection_interval <- set above max RTT (but not too high)
min_queue_target <- set to 5% of min RTT
CoDel tags packets with timestamps as soon as they enter the queue.
Every inspection_interval, it examines the queue to see if any packets
have waited longer than min_queue_target. If they have, it begins dropping
packets (sending ECN signals in our case).
Because CoDel is accurately tracking the time *each* packet spends in
the queue, it is able to differentiate between "good" queues (not
congested), vs "bad" queues (congested), on a packet-by-packet basis.
This good vs bad distinction is determined by comparing the packet
transit time to min_queue_target. If a packet exceeds this transit time,
then the queue is bad, and the packet must be dropped or ECN marked.
It is this per-packet distinction that significantly separates CoDel
from PIE and ARED, which both use average queue times. This was the
source of some controversy, as per-packet timestamping was initially
argued to be too expensive. But on modern commodity hardware, this
per-packet timestamping is actually incredibly efficient. It also allows
CoDel to converge on the min_queue_target very rapidly, with high link
utilization.
However, it would seem that because we can't drop packets, sojourn times
will *not* decrease down to min_queue_target immediately upon congestion
for Tor. We will have to wait for the actual backoff signal to
propagate (1 RTT) before the queue size changes due to backoff.
According to Toke Høiland-Jørgensen (one of the FQ_CoDel spec authors),
this should not be that significant, but it should be tested.
FQ_CoDel extends CoDel to provide fairness in a very similar way as
Circuit-EWMA does, by separating out TCP connections into "generated a
queue recently" and "did not generate a queue recently". This allows
packets from flows that are not causing congestion to be sent more
quickly than flows that are. This also has the side effect that when
connections build long FQ queues, their packets spend more time in
queues, and have an increased probability of being ECN marked or
dropped.
https://ieeexplore.ieee.org/document/8469111
Like CoDel, FQ_CoDel relies on the assumption that it can drop a lot of
packets at once to correct queue sizes instantly:
https://tools.ietf.org/html/rfc8290#section-4.1
This makes me think we should not rush to replace Circuit-EWMA with
FQ_CoDel, but we may get benefit from doing CoDel on a per-TLS
connection basis, just as is described for PIE and ARED above. In this
case, it is more clear which circuits to ECN mark. Whenever a cell has a
Circuit-EWMA circuitmux queue transit time longer than the target queue
time threshold (min_queue_target), that circuit gets an ECN mark. If it
has already been ECN marked, no biggie. In other words: The decision to
ECN mark circuits is completely dependent only on how long their cells
traverse our queues, and does not require any capacity estimates or
other measurements. Very clean.
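A sketch of that per-cell check, as it might look in the circuitmux
scheduler (send_backward_ecn and already_marked are hypothetical names,
and min_queue_target would come from the CoDel parameters above):

    def check_cell_sojourn(circuit, enqueue_time, now, min_queue_target):
        # If this cell waited in our circuitmux queue longer than the
        # target, and the circuit has not yet received its single
        # BACKWARD_ECN_TOR signal, mark it now.
        sojourn = now - enqueue_time
        if sojourn > min_queue_target and not circuit.already_marked:
            circuit.already_marked = True
            send_backward_ecn(circuit)   # hypothetical helper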
So if we decide we want to try ECN, I think Circuit-EWMA with CoDel is a
smart first choice. We would then use CoDel to deliver exactly one
BACKWARD_ECN_TOR signal when it detects a circuit's queues are too long.
This RTT/2 signal should cause the circuit to exit slow start more
quickly than RTT measurement by endpoints will.
Section IV: Next Steps [NEXT_STEPS]
After this extended review, I am convinced that an RTT-based approach
can provide a comprehensive solution to all of our design requirements,
and can be enhanced based on experimentation and further additions. Such
an approach also only requires clients and Exits (or just clients and
onion services) to upgrade to initially deploy.
However, we can still improve on this with a few enhancements. In
particular, we can compare [TCP_VEGAS_TOR] to [TCP_WESTWOOD_TOR] to
BOOTLEG_RTT_TOR, and see how they perform in competition with each
other, and in competition with Tor's currently fixed-size SENDME
windows. We can also implement CoDel in circuitmux to track transit
times and determine which circuits to send a single BACKWARD_ECN_TOR
signal, to exit slow start on circuits more quickly, and we can play
around with higher initial window sizes. We can do this experimentation
first in Shadow, and also on the live network via consensus parameters.
The immediate next steps here are to create a proper Tor Proposal from
the sections tagged [PROPOSAL_CANDIDATE] above.
--
Mike Perry

02 Jun '20
Filename: xxx-flashflow.txt
Title: FlashFlow: A Secure Speed Test for Tor (Parent Proposal)
Author: Matthew Traudt, Aaron Johnson, Rob Jansen, Mike Perry
Created: 23 April 2020
Status: Draft
1. Introduction
FlashFlow is a new distributed bandwidth measurement system for Tor that
consists of a single authority node ("coordinator") instructing one or
more measurement nodes ("measurers") when and how to measure Tor relays.
A measurement consists of the following steps:
1. The measurement nodes demonstrate to the target relay permission to
perform measurements.
2. The measurement nodes open many TCP connections to the target relay
and create a one-hop circuit to the target relay on each one.
3. For 30 seconds the measurement nodes send measurement cells to the
target relay and verify that the cells echoed back match the ones
sent. During this time the relay caps the amount of background
traffic it transfers. Background and measurement traffic are
handled separately at the relay. Measurement traffic counts towards
all the standard existing relay statistics.
4. For every second during the measurement, the measurement nodes
report to the authority node how much traffic was echoed back. The
target relay also reports the amount of per-second background
(non-measurement) traffic.
5. The authority node sums the per-second reported throughputs into 30
sums (one for each second) and calculates the median. This is the
estimated capacity of the relay.
FlashFlow performs a measurement of every relay according to a schedule
described later in this document. Periodically it produces relay
capacity estimates in the form of a v3bw file, which is suitable for
direct consumption by a Tor directory authority. Alternatively an
existing load balancing system such as Simple Bandwidth Scanner could be
modified to use FlashFlow's v3bw file as input.
It is envisioned that each directory authority that wants to use
FlashFlow will run their own FlashFlow deployment consisting of a
coordinator that they run and one or more measurers that they trust
(e.g. because they run them themselves), similar to how each runs their
own Torflow/sbws. Section 5.2 of this proposal describes long term plans
involving multiple FlashFlow deployments.
FlashFlow is more performant than Torflow: FlashFlow takes 5 hours to
measure the entire existing Tor network from scratch (with 3 Gbit/s
measurer capacity) while Torflow takes 2 days; FlashFlow measures relays
it hasn't seen recently as soon as it learns about them (i.e. every new
consensus) while Torflow can take a day or more; and FlashFlow
accurately measures new high-capacity relays the first time and every
time while Torflow takes days/weeks to assign them their full fair share
of bandwidth (especially for non-exits). FlashFlow is more secure than
Torflow: FlashFlow allows a relay to inflate its measured capacity by up
to 1.33x (configured by a parameter) while Torflow allows weight
inflation by a factor of 89x [0] or even 177x [1].
After an overview in section 2 of the planned deployment stages, sections
3, 4, and 5 discuss the short, medium, and long term deployment plans in
more detail.
2. Deployment Stages
FlashFlow's deployment shall be broken up into three stages.
In the short term we will implement a working FlashFlow measurement
system. This requires code changes in little-t tor and an external
FlashFlow codebase. The majority of the implementation work will be
done in the short term, and the product is a complete FlashFlow
measurement system. Remaining pieces (e.g. better authentication) are
added later for enhanced security and network performance.
In the medium term we will begin collecting data with a FlashFlow
deployment. The intermediate results and v3bw files produced will be
made available (semi?) publicly for study.
In the long term experiments will be performed to study ways of using FF
v3bw files to improve load balancing. Two examples: (1) using FF v3bw
files instead of sbws's (and eventually phasing out torflow/sbws), and
(2) continuing to run sbws but use FF's results as a better estimate of
relay capacity than observed bandwidth. Authentication and other
FlashFlow features necessary to make it completely ready for full
production deployment will be worked on during this long term phase.
3. FlashFlow measurement system: Short term
The core measurement mechanics will be implemented in little-t tor, but
a separate codebase for the FlashFlow side of the measurement system
will also be created. This section is divided into three parts: first a
discussion of changes/additions that logically reside entirely within
tor (essentially: relay-side modifications), second a discussion of the
separate FlashFlow code that also requires some amount of tor changes
(essentially: measurer-side and coordinator-side modifications), and
third a security discussion.
3.1 Little-T Tor Components
The primary additions/changes that entirely reside within tor on the
relay side:
- New torrc options/consensus parameters.
- New cell commands.
- Pre-measurement handshaking (with a simplified authentication
scheme).
- Measurement mode, during which the relay will echo traffic with
measurers, set a cap on the amount of background traffic it
transfers, and report the amount of transferred background traffic.
3.1.1 Parameters
FlashFlow will require some consensus parameters/torrc options. Each has
some default value if nothing is specified; the consensus parameter
overrides this default value; the torrc option overrides both.
FFMeasurementsAllowed: A global toggle on whether or not to allow
measurements. Even if all other settings would allow a measurement, if
this is turned off, then no measurement is allowed. Possible values: 0,
1. Default: 0 (disallowed).
FFAllowedCoordinators: The list of coordinator TLS certificate
fingerprints that are allowed to start measurements. Relays check their
torrc when they receive a connection from a FlashFlow coordinator to see
if it's on the list. If they have no list, they check the consensus
parameter. If neither exists, then no FlashFlow deployment is allowed to
measure this relay. Default: empty list.
FFMeasurementPeriod: A relay should expect, on average, to be measured by
each FlashFlow deployment once each measurement period. A relay will not
allow itself to be measured more than twice by a FlashFlow deployment in
any time window of this length. Relays should not change this option
unless they really know what they're doing. Changing it at the relay
will not change how often FlashFlow will attempt to measure the relay.
Possible values are in the range [1 hour, 1 month] inclusive. Default: 1
day.
FFBackgroundTrafficPercent: The maximum amount of regular
non-measurement traffic a relay should handle while being measured, as a
percent of total traffic (measurement + non-measurement). This
parameter is a trade-off between having to limit background traffic and
limiting how much a relay can inflate its result by handling no
background traffic but reporting that it has done so. Possible values
are in the range [0, 99] inclusive. Default: 25 (a maximum inflation
factor of 1.33).
FFMaxMeasurementDuration: The maximum amount of time, in seconds, that
is allowed to pass from the moment the relay is notified that a
measurement will begin soon and the end of the measurement. If this
amount of time passes, the relay shall close all measurement connections
and exit its measurement mode. Note this duration includes handshake
time, thus it is necessarily larger than the expected actual measurement
duration. Possible values are in the range [10, 120] inclusive.
Default: 45.
3.1.2 New Cell Types
FlashFlow will introduce a new cell command MEASURE.
The payload of each MEASURE cell consists of:
Measure command [1 byte]
Length [2 bytes]
Data [Length-3 bytes]
The measure commands are:
0 -- MSM_PARAMS [forward]
1 -- MSM_PARAMS_OK [backward]
2 -- MSM_ECHO [forward and backward]
3 -- MSM_BG [backward]
4 -- MSM_ERR [forward and backward]
Forward cells are sent from the measurer/coordinator to the relay.
Backward cells are sent from the relay to the measurer/coordinator.
MSM_PARAMS and MSM_PARAMS_OK are used during the pre-measurement stage
to tell the target what to expect and for the relay to positively
acknowledge the message. MSM_ECHO cells are the measurement traffic;
the measurer generates them, sends them to the target, and the target
echoes them back. The target sends a MSM_BG cell once per second to report
the amount of background traffic it is handling. MSM_ERR cells are used
to signal to the other party that there has been some sort of problem
and that the measurement should be aborted. These measure commands are
described in more detail in the next section.
The only cell that sometimes undergoes cell encryption is MSM_ECHO; no
other cell ever gets cell encrypted. (All cells are transmitted on a
regular TLS-wrapped OR connection; that encryption still exists.)
The relay "decrypts" MSM_ECHO cells before sending them back to the
measurer; this mirrors the way relays decrypt/encrypt RELAY_DATA cells
in order to induce realistic cryptographic CPU load. The measurer
usually skips encrypting MSM_ECHO cells to reduce its own CPU load;
however, to verify the relay is actually correctly decrypting all cells,
the measurer will choose random outgoing cells, encrypt them, remember
the ciphertext, and verify the corresponding incoming cell matches.
3.1.3 Pre-Measurement Handshaking/Starting a Measurement
The coordinator connects to the target relay and sends it a MSM_PARAMS
cell. If the target is unwilling to be measured at this time or if the
coordinator didn't use a TLS certificate that the target trusts, it
responds with an error cell and closes the connection. Otherwise it
checks that the parameters of the measurement are acceptable (e.g. the
version is acceptable, the duration isn't too long, etc.). If the
target is happy, it sends a MSM_PARAMS_OK, otherwise it sends a MSM_ERR
and closes the connection.
Upon learning the IP addresses of the measurers from the coordinator in
the MSM_PARAMS cell, the target whitelists their IPs in its DoS
detection subsystem until the measurement ends (successfully or
otherwise), at which point the whitelist is cleared.
Upon receiving a MSM_PARAMS_OK from the target, the coordinator will
instruct the measurers to open their TCP connections with the target. If
the coordinator or any measurer receives a MSM_ERR, it reports the error
to the coordinator and considers the measurement a failure. It is also a
failure if any measurer is unable to open at least half of its TCP
connections with the target.
The payload of MSM_PARAMS cells [XXX more may need to be added]:
- version [1 byte]
- msm_duration [1 byte]
- num_measurers [1 byte]
- measurer_info [num_measurers times]
- ipv4_addr [4 bytes]
- num_conns [2 bytes]
version dictates how this MSM_PARAMS cell shall be parsed. msm_duration
is the duration, in seconds, that the actual measurement will last.
num_measurers is how many measurer_info structs follow. For each
measurer, the ipv4_addr it will use when connecting to the target is
provided, as is num_conns, the number of TCP connections that measurer
will open with the target. Future versions of FlashFlow and MSM_PARAMS
will use TLS certificates instead of IP addresses.
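As an illustration of the layout above, here is a sketch of encoding a
MSM_PARAMS payload in Python. Network byte order is an assumption (the
proposal does not state it), and this is not FlashFlow code:

    import socket
    import struct

    MSM_PARAMS = 0   # measure command value from the table above

    def pack_msm_params(version, msm_duration, measurers):
        """measurers is a list of (ipv4_addr_string, num_conns) pairs."""
        body = struct.pack("!BBB", version, msm_duration, len(measurers))
        for addr, num_conns in measurers:
            body += socket.inet_aton(addr) + struct.pack("!H", num_conns)
        # MEASURE cell payload: command [1], length [2], data [length-3].
        return struct.pack("!BH", MSM_PARAMS, len(body) + 3) + body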
MSM_PARAMS_OK has no payload: it's just padding bytes to make the cell
514 bytes long.
The payload of MSM_ECHO cells:
- arbitrary bytes [max to fill up 514 byte cell]
The payload of MSM_BG cells:
- second [1 byte]
- sent_bg_bytes [4 bytes]
- recv_bg_bytes [4 bytes]
second is the number of seconds since the measurement began. MSM_BG
cells are sent once per second from the relay to the FlashFlow
coordinator. The first cell will have this set to 1, and each
subsequent cell will increment it by one. sent_bg_bytes is the number of
background traffic bytes sent in the last second (since the last MSM_BG
cell). recv_bg_bytes is the same but for received bytes.
The payload of MSM_ERR cells:
- err_code [1 byte]
- err_str [possibly zero-len null-terminated string]
The error code is one of:
[... XXX TODO ...]
255 -- OTHER
The error string is optional in all cases. It isn't present if the first
byte of err_str is null, otherwise it is present. It ends at the first
null byte or the end of the cell, whichever comes first.
3.1.4 Measurement Mode
The relay considers the measurement to have started the moment it
receives the first MSM_ECHO cell from any measurer. At this point, the
relay
- Starts a repeating 1s timer on which it will report the amount of
background traffic to the coordinator over the coordinator's
connection.
- Enters "measurement mode" and limits the amount of background
traffic it handles according to the torrc option/consensus
parameter.
The relay decrypts and echoes back all MSM_ECHO cells it receives on
measurement connections until it has reported its amount of background
traffic the same number of times as there are seconds in the measurement
(e.g. 30 per-second reports for a 30 second measurement). After sending
the last MSM_BG cell, the relay drops all buffered MSM_ECHO cells,
closes all measurement connections, and exits measurement mode.
During the measurement the relay targets a ratio of background traffic
to measurement traffic as specified by a consensus parameter/torrc
option. For a given ratio r, if the relay has handled x cells of
measurement traffic recently, Tor then limits itself to y = xr/(1-r)
cells of non-measurement traffic this scheduling round. The target will
enforce that a minimum of 10 Mbit/s of measurement traffic is recorded
since the last background traffic scheduling round to ensure it always
allows some minimum amount of background traffic.
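As a quick illustration of that formula (with r expressed as the
background share of total traffic, matching FFBackgroundTrafficPercent):

    def background_cell_allowance(measurement_cells, r):
        # y = x*r/(1-r). With r=0.25 and x=300 measurement cells handled
        # recently, the relay may handle up to 100 background cells this
        # scheduling round (100/400 = 25% of total traffic).
        return measurement_cells * r / (1 - r)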
3.2 FlashFlow Components
The FF coordinator and measurer code will reside in a FlashFlow
repository separate from little-t tor.
There are three notable parameters for which a FF deployment must choose
values. They are:
- The number of sockets, s, the measurers should open, in aggregate,
with the target relay. We suggest s=160 based on the FF paper.
- The bandwidth multiplier, m. Given an existing capacity estimate for
a relay, z, the coordinator will instruct the measurers to, in
aggregate, send m*z Mbit/s to the target relay. We recommend m=2.25.
- The measurement duration, d. Based on the FF paper, we recommend
d=30 seconds.
The rest of this section first discusses notable functions of the
FlashFlow coordinator, then goes on to discuss FF measurer code that
will require supporting tor code.
3.2.1 FlashFlow Coordinator
The coordinator is responsible for scheduling measurements, aggregating
results, and producing v3bw files. It needs continuous access to new
consensus files, which it can obtain by running an accompanying Tor
process in client mode.
The coordinator has the following functions, which will be described in
this section:
- result aggregation.
- schedule measurements.
- v3bw file generation.
3.2.1.1 Aggregating Results
Every second during a measurement, the measurers send the amount of
verified measurement traffic they have received back from the relay.
Additionally, the relay sends a MSM_BG cell each second to the
coordinator with amount of non-measurement background traffic it is
sending and receiving.
For each second's reports, the coordinator sums the measurers' reports.
The coordinator takes the minimum of the relay's reported sent and
received background traffic. If, when compared to the measurers' reports
for this second, the relay's claimed background traffic is more than
what's allowed by the background/measurement traffic ratio, then the
coordinator further clamps the relay's report down. The coordinator adds
this final adjusted amount of background traffic to the sum of the
measurers' reports.
Once the coordinator has done the above for each second in the
measurement (e.g. 30 times for a 30 second measurement), the coordinator
takes the median of the 30 per-second throughputs and records it as the
estimated capacity of the target relay.
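A rough sketch of that aggregation (r is the background fraction of
total traffic, e.g. 0.25; this is an illustration, not FlashFlow code):

    import statistics

    def estimate_capacity(per_second_reports, r):
        """per_second_reports: one (measurer_sum, sent_bg, recv_bg) tuple
        per second of the measurement, where measurer_sum is the summed
        verified measurement throughput reported by the measurers."""
        totals = []
        for measurer_sum, sent_bg, recv_bg in per_second_reports:
            bg = min(sent_bg, recv_bg)
            # Never credit more background traffic than the ratio allows.
            bg = min(bg, measurer_sum * r / (1 - r))
            totals.append(measurer_sum + bg)
        # The median of the per-second totals is the capacity estimate.
        return statistics.median(totals)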
3.2.1.2 Measurement Schedule
The short term implementation of measurement scheduling will be simpler
than the long term one due to (1) there only being one FlashFlow
deployment, and (2) there being very few relays that support being
measured by FlashFlow. In fact the FF coordinator will maintain a list
of the relays that have updated to support being measured and have opted
in to being measured, and it will only measure them.
The coordinator divides time into a series of 24 hour periods, commonly
referred to as days. Each period has measurement slots that are longer
than a measurement lasts (30s), say 60s, to account for pre- and
post-measurement work. Thus with 60s slots there's 1,440 slots in a
day.
At the start of each day the coordinator considers the list of relays
that have opted in to being measured. From this list of relays, it
repeatedly takes the relay with the largest existing capacity estimate.
It selects a random slot. If the slot has existing relays assigned to
it, the coordinator makes sure there is enough additional measurer
capacity to handle this relay. If so, it assigns this relay to this
slot. If not, it keeps picking new random slots until one has sufficient
additional measurer capacity.
Relays without existing capacity estimates are assumed to have the 75th
percentile capacity of the current network.
If a relay is not online when it's scheduled to be measured, it doesn't
get measured that day.
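A sketch of that greedy random assignment (m is the bandwidth
multiplier; relays without estimates are assumed to have already been
assigned the 75th-percentile value; this sketch assumes every relay fits
in some slot):

    import random

    def schedule_day(relay_estimates, num_slots, measurer_capacity, m):
        """relay_estimates: dict of relay -> capacity estimate (Mbit/s).
        Returns a dict of slot index -> list of relays."""
        reserved = [0.0] * num_slots
        schedule = {i: [] for i in range(num_slots)}
        # Largest existing estimate first.
        for relay, cap in sorted(relay_estimates.items(),
                                 key=lambda kv: -kv[1]):
            needed = cap * m
            while True:
                slot = random.randrange(num_slots)
                if reserved[slot] + needed <= measurer_capacity:
                    reserved[slot] += needed
                    schedule[slot].append(relay)
                    break
        return schedule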
3.2.1.2.1 Example
Assume the FF deployment has 1 Gbit/s of measurer capacity. Assume the
chosen multiplier m=2. Assume there are only 5 slots in a measurement
period.
Consider a set of relays with the following existing capacity estimates
and that have opted in to being measured by FlashFlow.
- 500 Mbit/s
- 300 Mbit/s
- 250 Mbit/s
- 200 Mbit/s
- 100 Mbit/s
- 50 Mbit/s
The coordinator takes the largest relay, 500 Mbit/s, and picks a random
slot for it. It picks slot 3. The coordinator takes the next largest,
300, and randomly picks slot 2. The slots are now:
0 | 1 | 2 | 3 | 4
-------|-------|-------|-------|-------
| | 300 | 500 |
| | | |
The coordinator takes the next largest, 250, and randomly picks slot 2.
Slot 2 already has 600 Mbit/s of measurer capacity reserved (300*m);
given just 1000 Mbit/s of total measurer capacity, there is just 400
Mbit/s of spare capacity while this relay requires 500 Mbit/s. There is
not enough room in slot 2 for this relay. The coordinator picks a new
random slot, 0.
0 | 1 | 2 | 3 | 4
-------|-------|-------|-------|-------
250 | | 300 | 500 |
| | | |
The next largest is 200 and the coordinator randomly picks slot 2 again
(wow!). As there is just enough spare capacity, the coordinator assigns
this relay to slot 2.
0 | 1 | 2 | 3 | 4
-------|-------|-------|-------|-------
250 | | 300 | 500 |
| | 200 | |
The coordinator randomly picks slot 4 for the last remaining relays, in
that order.
0 | 1 | 2 | 3 | 4
-------|-------|-------|-------|-------
250 | | 300 | 500 | 100
| | 200 | | 50
3.2.1.3 Generating V3BW files
Every hour the FF coordinator produces a v3bw file in which it stores
the latest capacity estimate for every relay it has measured in the last
week. The coordinator will create this file on the host's local file
system. Previously-generated v3bw files will not be deleted by the
coordinator. A symbolic link at a static path will always point to the
latest v3bw file.
$ ls -l
v3bw -> v3bw.2020-03-01-05-00-00
v3bw.2020-03-01-00-00-00
v3bw.2020-03-01-01-00-00
v3bw.2020-03-01-02-00-00
v3bw.2020-03-01-03-00-00
v3bw.2020-03-01-04-00-00
v3bw.2020-03-01-05-00-00
3.2.2 FlashFlow Measurer
The measurers take commands from the coordinator, connect to target
relays with many sockets, send them traffic, and verify the received
traffic is the same as what was sent. Measurers need access to a lot of
internal tor functionality. One strategy is to house as much logic as
possible inside a compile-time-optional control port module that calls
into other parts of tor. Alternatively FlashFlow could link against tor
and call internal tor functions directly.
[XXX for now I'll assume that an optional little-t tor control port
module housing a lot of this code is the best idea.]
Notable new things that internal tor code will need to do on the
measurer (client) side:
1. Open many TLS+TCP connections to the same relay on purpose.
2. Verify echo cells.
3.2.2.1 Open many connections
FlashFlow prototypes needed to "hack in" a flag in the
open-a-connection-with-this-relay function call chain that indicated
whether or not we wanted to force a new connection to be created. Most
of Tor doesn't care if it reuses an existing connection, but FF does
want to create many different connections. The cleanest way to
accomplish this will be investigated.
On the relay side, these measurer connections do not count towards DoS
detection algorithms.
3.2.2.2 Verify echo cells
A parameter will exist to tell the measurers with what frequency they
shall verify that cells echoed back to them match what was sent. This
parameter does not need to exist outside of the FF deployment (e.g. it
doesn't need to be a consensus parameter).
The parameter instructs the measurers to check 1 out of every N cells.
The measurer keeps a count of how many measurement cells it has sent. It
also logically splits its output stream of cells into buckets of size N.
At the start of each bucket (when num_sent % N == 0), the measurer
chooses a random index in the bucket. Upon sending the cell at that
index (num_sent % N == chosen_index), the measurer records the cell.
The measurer also counts cells that it receives. When it receives a cell
at an index that was recorded, it verifies that the received cell
matches the recorded sent cell. If they match, no special action is
taken. If they don't match, the measurer indicates failure to the
coordinator and target relay and closes all connections, ending the
measurement.
3.2.2.2.1 Example
Consider bucket_size is 1000. For the moment ignore cell encryption.
We start at idx=0 and pick an idx in [0, 1000) to record, say 640. At
idx=640 we record the cell. At idx=1000 we choose a new idx in [1000,
2000) to record, say 1236. At idx=1236 we record the cell. At idx=2000
we choose a new idx in [2000, 3000). Etc.
There's 2000+ cells in flight and the measurer has recorded two items:
- (640, contents_of_cellA)
- (1236, contents_of_cellB)
Consider the receive side now. It counts the cells it receives. At
receive idx=640, it checks the received cell matches the saved cell from
before. At receive idx=1236, it again checks the received cell matches.
Etc.
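A sketch of the sampling and checking logic described above, ignoring
cell encryption as the example above does (the class and method names
are made up):

    import random

    class EchoVerifier:
        """1-in-bucket_size echo cell checks, per the scheme above."""
        def __init__(self, bucket_size):
            self.bucket_size = bucket_size
            self.num_sent = 0
            self.num_recv = 0
            self.recorded = {}       # send index -> recorded cell bytes
            self.check_idx = None    # index chosen in the current bucket

        def on_send(self, cell):
            if self.num_sent % self.bucket_size == 0:
                # New bucket: pick one random index in it to record.
                self.check_idx = (self.num_sent +
                                  random.randrange(self.bucket_size))
            if self.num_sent == self.check_idx:
                self.recorded[self.num_sent] = cell
            self.num_sent += 1

        def on_recv(self, cell):
            idx = self.num_recv
            self.num_recv += 1
            if idx in self.recorded:
                # False means the echoed cell did not match: abort.
                return self.recorded.pop(idx) == cell
            return True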
3.2.2.2.2 Motivation
A malicious relay may want to skip decryption of measurement cells to
save CPU cycles and obtain a higher capacity estimate. More generally,
it could generate fake measurement cells locally, ignore the measurement
traffic it is receiving, and flood the measurer with more traffic than
it (the measurer) is even sending.
The security of echo cell verification is discussed in section 3.3.1.
3.3 Security
In this section we discuss the security of various aspects of FlashFlow
and the tor changes it requires.
3.3.1 Echo Cell Verification: Bucket Size
A smaller bucket size means more cells are checked and FF is more likely
to detect a malicious target. It also means more bookkeeping overhead
(CPU/RAM).
An adversary that knows bucket_size and cheats on one item out of every
bucket_size items will have a 1/bucket_size chance of getting caught in
the first bucket. This is the worst case adversary. While cheating on
just a single item per bucket yields very little advantage, cheating on
more items per bucket increases the likelihood the adversary gets
caught. Thus only the worst case is considered here.
In general, the odds the adversary can successfully cheat in a single
bucket are
(bucket_size-1)/bucket_size
Thus the odds the adversary can cheat in X consecutive buckets are
[(bucket_size-1)/bucket_size]^X
In our case, X will be highly varied: Slow relays won't see very many
buckets, but fast relays will. The damage to the network a very slow
relay can do by faking being only slightly faster is limited.
Nonetheless, for now we motivate the selection of bucket_size with a
slow relay:
- Assume a very slow relay of 1 Mbit/s capacity that will cheat 1 cell
in each bucket. Assume a 30 second measurement.
- The relay will handle 1*30 = 30 Mbit of traffic during the
measurement, or 3.75 MB, or 3.75 million bytes.
- Cells are 514 bytes. Approximately (e.g. ignoring TLS) 7300 cells
will be sent/recv over the course of the measurement.
- A bucket_size of 50 results in about 146 buckets over the course of
the 30s measurement.
- Therefore, the odds of the adversary cheating successfully are
(49/50)^(146), or about 5.2%.
This sounds high, but a relay capable of double the bandwidth (2 Mbit/s)
will have (49/50)^(2*146) or 0.2% odds of success, which is quite low.
Wanting a <1% chance that a 10 Mbit/s relay can successfully cheat
results in a bucket size of approximately 125:
- 10*30 = 300 Mbit of traffic during 30s measurement. 37.5 million
bytes.
- 37,500,000 bytes / 514 bytes/cell = ~73,000 cells
- bucket_size of 125 cells means 73,000 / 125 = 584 buckets
- (124/125)^(584) = 0.918% chance of successfully cheating
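The arithmetic behind those numbers, for checking other parameter
choices:

    def cheat_success_odds(relay_mbit, msm_seconds, bucket_size,
                           cell_bytes=514):
        # Probability that a relay cheating on one cell per bucket is
        # never caught over the whole measurement.
        total_bytes = relay_mbit * msm_seconds * 1000 * 1000 / 8
        cells = total_bytes / cell_bytes
        buckets = cells / bucket_size
        return ((bucket_size - 1) / bucket_size) ** buckets

    # cheat_success_odds(1, 30, 50)   ~= 0.052  (the ~5.2% case above)
    # cheat_success_odds(10, 30, 125) ~= 0.009  (the ~0.9% case above)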
Slower relays can cheat more easily but the amount of extra weight they
can obtain is insignificant in absolute terms. Faster relays are
essentially unable to cheat.
3.3.2 Weight Inflation
Target relays are an active part of the measurement process; they know
they are getting measured. While a relay cannot fake the measurement
traffic, it can trivially stop transferring client background traffic
for the duration of the measurement yet claim it carried some. More
generally, there is no verification of the claimed amount of background
traffic during the measurement. The relay can claim whatever it wants,
but it will not be trusted above the ratio the FlashFlow deployment is
configured to know. This places an easy to understand, firm, and (if set
as we suggest) low cap on how much a relay can inflate its measured
capacity.
Consider a background/measurement ratio of 1/4, or 25%. Assume the relay
in question has a hard limit on capacity (e.g. from its NIC) of 100
Mbit/s. The relay is supposed to use up to 25% of its capacity for
background traffic and the remaining 75%+ capacity for measurement
traffic. Instead the relay ceases carrying background traffic, uses all
100 Mbit/s of capacity to handle measurement traffic, and reports ~33
Mbit/s of background traffic (33/133 = ~25%). FlashFlow would trust this
and consider the relay capable of 133 Mbit/s. (If the relay were to
report more than ~33 Mbit/s, FlashFlow limits it to just ~33 Mbit/s.)
With r=25%, FlashFlow only allows 1.33x weight inflation.
Prior work shows that Torflow allows weight inflation by a factor of 89x
[0] or even 177x [1].
The ratio chosen is a trade-off between impact on background traffic and
security: r=50% allows a relay to double its weight but won't impact
client traffic for relays with steady state throughput below 50%, while
r=10% allows a very low inflation factor but will cause throttling of
client traffic at far more relays. We suggest r=25% (and thus
1/(1-0.25)=1.33x inflation) for a reasonable trade-off between
performance and security.
It may be possible to catch relays performing this attack, especially if
they literally drop all background traffic during the measurement: have
the measurer (or some party on its behalf) create a regular stream
through the relay and measure the throughput on the stream
before/during/after the measurement. This can be explored longer term.
3.3.3 Incomplete Authentication
The short term FlashFlow implementation has the relay set two torrc
options if they would like to allow themselves to be measured: a flag
allowing measurement, and the list of coordinator TLS certificates that
are allowed to start a measurement.
The relay drops MSM_PARAMS cells from coordinators it does not trust,
and immediately closes the connection after that. A FF coordinator
cannot convince a relay to enter measurement mode unless the relay
trusts its TLS certificate.
A trusted coordinator specifies in the MSM_PARAMS cell the IP addresses
of the measurers the relay shall expect to connect to it shortly. The
target adds the measurer IP addresses to a whitelist in the DoS
connection limit system, exempting them from any configured connection
limit. If a measurer is behind a NAT, an adversary behind the same NAT
can DoS the relay's available sockets until the end of the measurement.
The adversary could also pretend to be the measurer. Such an adversary
could induce measurement failures and inaccuracies. (Note: the whitelist
is cleared after the measurement is over.)
4. FlashFlow measurement system: Medium term
The medium term deployment stage begins after FlashFlow has been
implemented and relays are starting to update to a version of Tor that
supports it.
We plan to host a FlashFlow deployment consisting of a FF coordinator
and a single FF measurer on a single 1 Gbit/s machine. Data produced by
this deployment will be made available (semi?) publicly, including both
v3bw files and intermediate results.
Any development changes needed during this time would go through
separate proposals.
5. FlashFlow measurement system: Long term
In the long term, finishing-touch development work will be done,
including adding better authentication and measurement scheduling, and
experiments will be run to determine the best way to integrate FlashFlow
into the Tor ecosystem.
Any development changes needed during this time would go through
separate proposals.
5.1 Authentication to Target Relay
The short term deployment already has FlashFlow coordinators using TLS
certificates when connecting to relays; in the long term, directory
authorities will additionally vote on a consensus parameter listing
which coordinators are allowed to perform measurements. The voting is
done in the same way they currently vote on recommended Tor versions.
FlashFlow measurers will be updated to use TLS certificates when
connecting to relays too. FlashFlow coordinators will update the
contents of MSM_PARAMS cells to contain measurer TLS certificates
instead of IP addresses, and relays will update to expect this change.
5.2 Measurement Scheduling
The short term stage has only one FF deployment running. In the long
term this may no longer be the case because, for example, more than one
directory authority may decide to adopt FlashFlow and each may want to
run its own deployment. FF deployments will need to coordinate among
themselves so that they do not measure the same relay at the same time,
and to handle new relays that join in the middle of a measurement
period (i.e., during the day).
The following is quoted from Section 4.3 of the FlashFlow paper.
To measure all relays in the network, the BWAuths periodically
determine the measurement schedule. The schedule determines when and
by whom a relay should be measured. We assume that the BWAuths have
sufficiently synchronized clocks to facilitate coordinating their
schedules. A measurement schedule is created for each measurement
period, the length p of which determines how often a relay is
measured. We use a measurement period of p = 24 hours.
To help avoid active denial-of-service attacks on targeted relays,
the measurement schedule is randomized and known only to the
BWAuths. Before the next measurement period starts, the BWAuths
collectively generate a random seed (e.g. using Tor’s
secure-randomness protocol). Each BWAuth can then locally determine
the shared schedule using pseudorandom bits extracted from that
seed. The algorithm to create the schedule considers each
measurement period to be divided into a sequence of t-second
measurement slots. For each old relay, slots for each BWAuth to
measure it are selected uniformly at random without replacement
from all slots in the period that have sufficient unallocated
measurement capacity to accommodate the measurement. When a new
relay appears, it is measured separately by each BWAuth in the first
slots with sufficient unallocated capacity. Note that this design
ensures that old relays will continue to be measured, with new
relays given secondary priority in the order they arrive.
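A minimal sketch of that scheduling idea (illustrative only; it
simplifies the capacity model and uses a hash of the shared random
value to seed an ordinary PRNG in place of Tor's secure-randomness
protocol):
```
import hashlib, random

def build_schedule(shared_seed_bytes, relays, num_slots, slot_capacity):
    # Every BWAuth derives the same schedule from the same shared seed.
    rng = random.Random(hashlib.sha256(shared_seed_bytes).digest())
    load = [0] * num_slots                 # capacity units allocated per slot
    schedule = {}
    for relay_id, needed in relays:        # (relay id, capacity units needed)
        candidates = [s for s in range(num_slots)
                      if load[s] + needed <= slot_capacity]
        slot = rng.choice(candidates)      # uniform over slots with room;
                                           # assumes some slot always has room
        load[slot] += needed
        schedule[relay_id] = slot
    return schedule
```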
5.3 Experiments
[XXX todo]
5.4 Other Changes/Investigations/Ideas
- How can FlashFlow data be used in a way that doesn't lead to poor load
balancing, given the following sources of non-uniform client behavior:
- Guards that high-traffic HSs choose (for 3 months at a time)
- Guard vs middle flag allocation issues
- New Guard nodes (Guardfraction)
- Exit policies other than default/all
- Directory activity
- Total onion service activity
- Super long-lived circuits
- Add a cell that the target relay sends to the coordinator indicating
its CPU and memory usage, whether it has a shortage of sockets, how
much bandwidth load it has been experiencing lately, etc. Use this
information only to lower a relay's weight, never to increase it.
- If FlashFlow and sbws work together (as opposed to FlashFlow replacing
sbws), consider logic for how much sbws can increase/decrease FF
results
- Coordination of multiple FlashFlow deployments: scheduling of
measurements, seeding schedule with shared random value.
- Other background/measurement traffic ratios. Dynamic? (known slow
relay => more allowed bg traffic?)
- Catching relays inflating their measured capacity by dropping
background traffic.
- What to do about co-located relays. Can they be detected reliably?
Should we just add a torrc option a la MyFamily for co-located relays?
- What is the explanation for dennis.jackson's scary graphs in this [2]
ticket? Was it because of the speed test? Why? Will FlashFlow produce
the same behavior?
6. Citations
[0] F. Thill. Hidden Service Tracking Detection and Bandwidth Cheating
in Tor Anonymity Network. Master’s thesis, Univ. Luxembourg, 2014.
[1] A. Johnson, R. Jansen, N. Hopper, A. Segal, and P. Syverson.
PeerFlow: Secure Load Balancing in Tor. Proceedings on Privacy
Enhancing Technologies (PoPETs), 2017(2), April 2017.
[2] Mike Perry: Graph onionperf and consensus information from Rob's
experiments https://trac.torproject.org/projects/tor/ticket/33076
```
Filename: 320-tap-out-again.md
Title: Removing TAP usage from v2 onion services
Author: Nick Mathewson
Created: 11 May 2020
Status: Open
```
(This proposal is part of the Walking Onions spec project. It updates
proposal 245.)
# Removing TAP from v2 onion services
As we implement walking onions, we're faced with a problem: what to do
with TAP keys? They are bulky and insecure, and not used for anything
besides v2 onion services. Keeping them in SNIPs would consume
bandwidth, and referencing them indirectly would add complexity. It
would be nicer to remove TAP keys entirely.
But although v2 onion services are obsolescent and their
cryptographic parameters are disturbing, we do not want to drop
support for them as part of the Walking Onions migration. If we did
so, then we would force some users to choose between Walking Onions
and v2 onion services, which we do not want to do.
Instead, we describe here a phased plan to replace TAP in v2 onion
services with ntor. This change improves the forward secrecy of
_some_ of the session keys used with v2 onion services, but does not
improve their authentication, which is strongly tied to truncated
SHA1 hashes of RSA1024 keys.
Implementing this change is more complex than similar changes
elsewhere in the Tor protocol, since we do not want clients or
services to leak whether they have support for this proposal, until
support is widespread enough that revealing it is no longer a
privacy risk.
## Ntor keys, link specifiers, and SNIPs in v2 descriptors.
We define the following entries, which may appear in v2 onion service
descriptors once per introduction point.
"identity-ed25519"
"ntor-onion-key"
[at most once each per intro point.]
These values have the same format and follow the same rules as
their equivalents in router descriptors.
"link-specifiers"
[at most once per introduction point]
This value is the same as the link specifiers in a v3 onion
service descriptor, and follows the same rules.
Services should not include any of these fields unless a new network
parameter, "hsv2-intro-updated" is set to 1. Clients should not parse
these fields or use them unless "hsv2-use-intro-updated" is set to 1.
We define a new field that can be used for hsv2 descriptors with
walking onions:
"snip"
[at most once]
This value is the same as the snip field introduced to a v3
onion service descriptor by proposal (XXX) and follows the
same rules.
Services should not include this field unless a new network parameter,
"hsv2-intro-snip" is set to 1. Clients should not parse this field or use it
unless the parameter "hsv2-use-intro-snip" is set to 1.
Additionally, relays SHOULD omit the following legacy intro point
parameters when a new network parameter, "hsv2-intro-legacy" is set
to 0: "ip-address", "onion-port", and "onion-key". Clients should
treat them as optional when "hsv2-tolerate-no-legacy" is set to 1.
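A sketch of how these parameters gate behavior (illustrative only, not
tor's descriptor code; `params` stands for the consensus parameters,
and the default of 1 for "hsv2-intro-legacy" is an assumption):
```
def intro_point_fields_to_publish(params):
    fields = []
    if params.get("hsv2-intro-legacy", 1) != 0:
        fields += ["ip-address", "onion-port", "onion-key"]   # legacy entries
    if params.get("hsv2-intro-updated", 0) == 1:
        fields += ["identity-ed25519", "ntor-onion-key", "link-specifiers"]
    if params.get("hsv2-intro-snip", 0) == 1:
        fields += ["snip"]
    return fields

def client_may_parse_updated_fields(params):
    return params.get("hsv2-use-intro-updated", 0) == 1
```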
## INTRODUCE cells, RENDEZVOUS cells, and ntor.
We allow clients to specify the rendezvous point's ntor key in the
INTRODUCE2 cell instead of the TAP key. To do this, the client
simply sets KLEN to 32, and includes the ntor key for the relay.
Clients should only use ntor keys in this way if the network parameter
"hsv2-client-rend-ntor" is set to 1, and if the entry "allow-rend-ntor"
is present in the onion service descriptor.
Services should only advertise "allow-rend-ntor" in this way if the
network parameter "hsv2-service-rend-ntor" is set to 1.
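A sketch of the resulting client-side choice (illustrative only; the
argument names are hypothetical):
```
def choose_rend_key(params, descriptor_allows_rend_ntor,
                    ntor_onion_key, tap_onion_key_der):
    # Use the rendezvous point's 32-byte ntor key (so KLEN = 32) only
    # when both the consensus parameter and the service descriptor
    # allow it; otherwise fall back to the legacy TAP onion key.
    use_ntor = (params.get("hsv2-client-rend-ntor", 0) == 1
                and descriptor_allows_rend_ntor)
    key = ntor_onion_key if use_ntor else tap_onion_key_der
    return len(key), key       # (KLEN, KEY) for the INTRODUCE2 body
```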
## Migration steps
First, we implement all of the above, but leave it disabled by
default. We use torrc options to selectively enable the new behavior
for testing purposes, to make sure it works.
Once all non-LTS versions of Tor without support for this proposal are
obsolete, we can safely enable "hsv2-client-rend-ntor",
"hsv2-service-rend-ntor", "hsv2-intro-updated", and
"hsv2-use-intro-updated".
Once all non-LTS versions of Tor without support for walking onions
are obsolete, we can safely enable "hsv2-intro-snip",
"hsv2-use-intro-snip", and "hsv2-tolerate-no-legacy".
Once all non-LTS versions of Tor without support for _both_ of the
above implementations are finally obsolete, we can finally set
"hsv2-intro-legacy" to 0.
## Future work
There is a final TAP-like protocol used for v2 hidden services: the
client uses RSA1024 and DH1024 to send information about the
rendezvous point and to start negotiating the session key to be used
for end-to-end encryption.
In theory we could get a forward-secrecy benefit by using ntor
instead of TAP here, but we would get no corresponding benefit for
authentication, since authentication is still ultimately tied to
HSv2's scary RSA1024-plus-truncated-SHA1 combination.
Given that, it might be just as good to allow the client to insert
a curve25519 key in place of their DH1024 key, and use that for the
DH handshake instead. That would be a separate proposal, though:
this proposal is enough to allow all relays to drop TAP support.
```
Filename: 322-dirport-linkspec.md
Title: Extending link specifiers to include the directory port
Author: Nick Mathewson
Created: 27 May 2020
Status: Open
```
## Motivation
Directory ports remain the only way to contact a (non-bridge) Tor
relay that isn't expressible as a Link Specifier. We haven't
specified a link specifier of this kind so far, since it isn't a way
to contact a relay to create a channel.
But authorities still expose directory ports, and encourage relays
to use them preferentially for uploading and downloading. And with
Walking Onions, it would be convenient to try to make every kind of
"address" a link specifier -- we'd like want authorities to be able
to specify a list of link specifiers that can be used to contact
them for uploads and downloads.
> It is possible that after revision, Walking Onions won't need a way
> to specify this information. If so, this proposal should be moved
> to "Reserve" status as generally unuseful.
## Proposal
We reserve a new link specifier type "dir-url", for use with the
directory system. This is a variable-length link specifier, containing
a URL prefix. The only currently supported URL scheme is "http://".
Implementations SHOULD ignore unrecognized schemes. IPv4 and IPv6
addresses MAY be used directly; hostnames are also allowed.
Implementations MAY ignore hostnames and only use raw addresses.
The URL prefix includes everything through the string "tor" in the
directory hierarchy.
Dir-url link specifiers SHOULD NOT appear in EXTEND cells;
implementations SHOULD reject them if they do appear.
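As a rough illustration of what such a link specifier could look like
on the wire (a sketch under the usual type/length/value link-specifier
encoding; the type code below is a placeholder, since none is assigned
in this proposal):
```
DIR_URL_LS_TYPE = 0xFF   # placeholder type code, not a registered value

def encode_dir_url_ls(url_prefix: str) -> bytes:
    # One-byte type, one-byte length, then the URL prefix itself,
    # e.g. "http://203.0.113.5:9030/tor".
    body = url_prefix.encode("ascii")
    assert len(body) <= 255              # length must fit in one byte
    return bytes([DIR_URL_LS_TYPE, len(body)]) + body
```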