Hi Matt,
Thanks for the quick response!
I've trimmed the conversation to the comments that need further discussion.
On 25 Apr 2020, at 06:46, Matt Traudt <pastly@torproject.org> wrote:
On 4/23/20 21:05, teor wrote:
...
- msm_duration [1 byte]
What are the minimum and maximum valid values for this field? 1..255 ?
Do we want to limit measurements to 4 minutes at a protocol level?
In general, protocols should make invalid states impossible to represent. But do we want a 4 minute hard limit here?
This document suggests a measurement duration of 30 seconds. We see no reason to ever go above 1 minute. If there's a byte to spare, then sure let's make this a uint16.
I've thought about this a bit more, and from a user experience perspective, we also want a 30 second limit. (Most users will give up on a slow connection after 30 seconds.)
So as long as there is a documented limit in the protocol, we should be fine with 2 bytes.
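As a rough illustration of a documented limit on a 2-byte field, here is a minimal Python sketch. The big-endian encoding, the 60 second cap, and the name parse_msm_duration are assumptions made for this example; the proposal does not fix any of them.

```python
import struct

# Assumed protocol-level cap in seconds. The thread suggests 30 s as the
# usual duration and sees no reason to go above 1 minute.
MAX_MSM_DURATION = 60


def parse_msm_duration(field: bytes) -> int:
    """Decode a 2-byte big-endian duration and reject out-of-range values."""
    if len(field) != 2:
        raise ValueError("msm_duration must be exactly 2 bytes")
    (duration,) = struct.unpack("!H", field)
    if not 1 <= duration <= MAX_MSM_DURATION:
        raise ValueError(
            f"msm_duration {duration} outside 1..{MAX_MSM_DURATION}"
        )
    return duration
```

With a check like this, the wire format can represent larger values, but a relay never acts on them.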
second is the number of seconds since the measurement began. MSM_BG cells are sent once per second from the relay to the FlashFlow coordinator. The first cell will have this set to 1, and each subsequent cell will increment it by one. sent_bg_bytes is the number of background traffic bytes sent in the last second (since the last MSM_BG cell). recv_bg_bytes is the same but for received bytes.
The payload of MSM_ERR cells:
- err_code [1 byte]
- err_str [possibly zero-len null-terminated string]
We don't have strings in any other tor protocol cells.
If you need extensible error information, can I suggest using ext-type-length-value fields:
  N_EXTENSIONS      [1 byte]
  N_EXTENSIONS times:
    EXT_FIELD_TYPE  [1 byte]
    EXT_FIELD_LEN   [1 byte]
    EXT_FIELD       [EXT_FIELD_LEN bytes]
https://gitweb.torproject.org/torspec.git/tree/rend-spec-v3.txt#n1518
If strings are necessary, please specify a character encoding (ASCII or UTF-8), and an allowed set of characters.
If we don't whitelist characters, we risk logging terminal escape sequences, or other arbitrary data.
I seem to remember strings being used between directory authority/directory mirror relays and clients to communicate certain errors (clock skew?), but the reality is probably that a *code* is communicated, and what I'm thinking of is merely the Tor client interpreting the code for logging purposes.
There are two sources for clock skew warnings:
* A binary time field in NETINFO cells
* A HTTP header on directory documents
The header is text, but it's very structured, and at a different protocol layer.
In another part of the directory protocol, when authorities reject a relay descriptor upload, they send a rejection reason to the relay. That's unstructured text in a HTTP response. (But we do escape it before logging.)
Regardless, we probably don't really need a string. It occurs to me we might want *something* that carries more information than a code; for example, a MSM_ERR cell with a code stating "I'm refusing to be measured because I've been measured too recently" would benefit from a field stating either time till measurement allowed again or time since last measurement.
Yes, I think a code and ext-type-length-value fields for any additional info would work here.
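To make the code-plus-extensions idea concrete, here is a minimal Python sketch of an MSM_ERR payload built from a 1-byte error code followed by the ext-type-length-value layout above. The function names and the example values (error code 1, extension type 1 carrying a 4-byte "seconds until measurement is allowed" value) are hypothetical; nothing here is defined by the proposal.

```python
import struct


def encode_msm_err(err_code, extensions):
    """Build: err_code [1], N_EXTENSIONS [1], then (type, len, value) triples."""
    if len(extensions) > 255:
        raise ValueError("too many extensions")
    out = bytearray([err_code, len(extensions)])
    for ext_type, value in extensions:
        if len(value) > 255:
            raise ValueError("extension value too long")
        out += bytes([ext_type, len(value)]) + value
    return bytes(out)


def decode_msm_err(payload):
    """Parse the payload, rejecting any truncated field."""
    if len(payload) < 2:
        raise ValueError("truncated header")
    err_code, n_ext = payload[0], payload[1]
    pos = 2
    extensions = []
    for _ in range(n_ext):
        if pos + 2 > len(payload):
            raise ValueError("truncated extension header")
        ext_type, ext_len = payload[pos], payload[pos + 1]
        pos += 2
        if pos + ext_len > len(payload):
            raise ValueError("truncated extension value")
        extensions.append((ext_type, payload[pos:pos + ext_len]))
        pos += ext_len
    return err_code, extensions


# Hypothetical example: "measured too recently", retry in 3600 seconds.
example = encode_msm_err(1, [(1, struct.pack("!I", 3600))])
```

Because every field is length-prefixed binary, nothing from the peer needs to be logged as free-form text.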
At this point, the relay
- Starts a repeating 1s timer on which it will report the amount of background traffic to the coordinator over the coordinator's connection.
- Enters "measurement mode" and limits the amount of background traffic it handles according to the torrc option/consensus parameter.
The relay decrypts and echoes back all MSM_ECHO cells it receives on measurement connections
Are MSM_ECHO cells relay cells? How much of the relay protocol does the measurer implement?
The references to decrypting cells suggest that MSM_ECHO cells are relay (circuit-level) cells. But earlier sections suggest that they are link cells.
If they are link cells, what key material is used for decryption? How do the measurer and relay agree on this key material?
If they are relay cells, do they use the ntor handshake? https://gitweb.torproject.org/torspec.git/tree/tor-spec.txt#n1132
MEASURE cells are cells like CREATE, CREATED, RELAY, etc.
In the Tor protocol specification, we call these "commands", and the cells are sent at the link level.
MSM_ECHO, MSM_PARAMS, etc. cells are MEASURE cells in the same way RELAY_BEGIN, RELAY_DATA, RELAY_SENDME, etc. are RELAY cells.
For relay cells, we call these "relay commands", and the cells are sent at the circuit level.
So it might be helpful to say "measure commands" here. It might also be helpful to distinguish "control" and "data" cells, in a similar way to the relay cell spec:
https://github.com/torproject/torspec/blob/master/tor-spec.txt#L1572
The relay needs to do AES on the MSM_ECHO cells (or for simplicity, the MSM_ECHO cell payload) like it does AES on relay cells. As for key material necessary to do that, that's an oversight. We've neglected to specify how it's derived.
Suggestions? Not a cryptographer, but off the top of my head, the measurer could simply tell the relay to use $key_i for MSM_ECHO cells on $connection_i. We just want the CPU load on the relay; we're not after security properties here (other than verifying the relay is actually doing the crypto, as discussed elsewhere).
I suggest that the client opens a real single-hop circuit, and sends RELAY_ECHO cells (a new relay command) on that circuit.
As part of this design:
* RELAY_ECHO cells are only allowed from valid measurers
* flow control is disabled on circuits from valid measurers (I think that's what you want here, but best to be explicit)
This design has a few advantages:
* The design and coding is much simpler
* FlashFlow automatically uses the latest relay crypto
* The key material is automatically derived for you
* The decryption and some verification is automatically performed for you
* You can verify the cell contents using a simple memcmp()
There's a slight disadvantage:
* When you skip decrypting a cell, the digest gets out of sync, so future cells have less validation. But I don't think that matters for single-hop circuits.
It's likely that your measurers will be network-bound, rather than CPU-bound. So you may be able to just use unmodified circuit crypto.
There are also security advantages to using unmodified relay crypto. If tor adds extra modes that skip decryption or verification, then it's easier to accidentally trigger those modes. (Via bugs or exploits.)
If we use unmodified relay crypto, then it's much harder to get tor into an insecure mode.
Here's what the cells would look like in detail:
  16 -- RELAY_ECHO   [forward]  [control]
  17 -- RELAY_ECHOED [backward] [control]
I think these should be control cells (circuit-level cells) rather than stream-level cells, because they are like RELAY_DROP:
10 -- RELAY_DROP [forward or backward] [control]
I don't have a strong opinion about the rest of the measure commands. They can stay as link-level cells. But if it turns out that it's easier to code them as circuit-level cells, we could add a new RELAY_MEASURE command.
until it has reported its amount of background traffic the same number of times as there are seconds in the measurement (e.g. 30 per-second reports for a 30 second measurement). After sending the last MSM_BG cell, the relay drops all buffered MSM_ECHO cells, closes all measurement connections, and exits measurement mode.
To be more precise here, can we say:
"the relay drops all inbound and outbound MSM_ECHO cells from measurers associated with the completed measurement"
Can we avoid assuming that there is always only one measurement happening at one time?
I think it's safe/smart/necessary to assume that, for a given relay, there is always either zero or one measurement happening.
- Measurements are scheduled s.t. coordinators won't try to measure a relay at the same time.
- A coordinator trying to start a measurement while another one is ongoing can simply be sent a MSM_ERR cell saying so.
You're right, the relay can resolve clashes. It's important that we make that explicit.
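One way to make the clash handling explicit is a tiny piece of relay-side state. This is only a sketch: the class name, the method names, and the use of a string coordinator identifier are all invented for illustration.

```python
class MeasurementState:
    """Tracks the single measurement a relay will accept at any one time."""

    def __init__(self):
        self.active_coordinator = None

    def try_start(self, coordinator_id):
        """Return True if the measurement may start.

        False means another measurement is ongoing, and the caller should
        answer with an MSM_ERR cell instead of starting a second one.
        """
        if self.active_coordinator is not None:
            return False
        self.active_coordinator = coordinator_id
        return True

    def finish(self, coordinator_id):
        """Leave measurement mode, but only for the coordinator that owns it."""
        if self.active_coordinator == coordinator_id:
            self.active_coordinator = None
```

Keying finish() on the coordinator means a refused coordinator cannot end someone else's measurement.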
During the measurement the relay targets a ratio of background traffic to measurement traffic as specified by a consensus parameter/torrc option. For a given ratio r, if the relay has handled x cells of measurement traffic recently, Tor then limits itself to y = xr/(1-r) cells of non-measurement traffic this scheduling round. The target will enforce that a minimum of 10 Mbit/s of measurement traffic is recorded since the last background traffic scheduling round to ensure it always allows some minimum amount of background traffic.
Do you mean "a maximum of 10 Mbit/s of measurement traffic" ?
No. When getting ready to handle background traffic, if there has been less than 10 Mbit/s of measurement traffic recently, Tor will limit background traffic as if there was indeed 10 Mbit/s of measurement traffic.
This way the relay can always send at least some background traffic, and a malfunctioning/malicious FlashFlow deployment cannot stop all background client traffic going through a relay for 30 seconds by not sending it (very much) measurement traffic.
I'm still a bit confused here.
When you say: "The target will enforce that a minimum of 10 Mbit/s of measurement traffic is recorded"
I think you mean: "... regardless of the actual traffic sent by the measurer."
But that raises another concern:
What about relays with very low bandwidths? Will they reserve all their traffic for users, and none for the measurer?
Using the suggested r=25%, the maximum non-measurement traffic is:
y = (10 Mbit/s)(0.25)/(1-0.25) = 3.3 Mbit/s
So a relay with an actual capacity of 3.3 Mbit/s, which is fully loaded with user traffic, will send no measurement traffic.
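The arithmetic can be checked with a short Python sketch of the limit as I understand it from the thread: y = xr/(1-r), where x is never treated as less than the 10 Mbit/s floor. The function name and the use of Mbit/s as the unit for both x and y are my assumptions.

```python
# Floor on the measurement traffic assumed when limiting background traffic.
MIN_MEASUREMENT_MBITS = 10.0


def background_limit(measured_mbits, ratio):
    """Return the allowed background traffic y = x*r/(1-r), in Mbit/s.

    x is the recent measurement traffic, raised to the floor if it is lower,
    so some background traffic is always allowed.
    """
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in [0, 1)")
    x = max(measured_mbits, MIN_MEASUREMENT_MBITS)
    return x * ratio / (1 - ratio)
```

With r = 0.25, a relay that has seen no measurement traffic at all still allows about 3.33 Mbit/s of background traffic, which is the low-bandwidth case that worries me.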
That seems... unexpected.
At the moment, tor directory authorities default to:
  AuthDirFastGuarantee 100 Kbytes
  AuthDirGuardBWGuarantee 2 Mbytes
So maybe we should derive the limit based on these values?
3.2.1 FlashFlow Coordinator
The coordinator is responsible for scheduling measurements, aggregating results, and producing v3bw files. It needs continuous access to new consensus files, which it can obtain by running an accompanying Tor process in client mode.
Recent tor versions go dormant when they haven't built circuits for a while. There are options that prevent dormancy, but they are only designed for interactive applications.
Is the FlashFlow coordinator going to use tor to implement the tor link protocol?
If the coordinator uses tor, then it can use the same tor client instance that's downloading its consensuses.
Otherwise, you might be better off using a small stem script and a download timer.
If you use a timer, you can download each new consensus, shortly after it is created. (Clients often have consensuses that are 1-2 hours old, unless specifically configured to fetch from directory authorities. Even then, they can take up to an hour to download a new consensus.)
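For the timer itself, a minimal Python sketch of the scheduling logic might look like this. It assumes a new consensus becomes available at the top of each hour and that fetching 2 minutes later is soon enough; both the delay and the function name are placeholders, and the actual download step is left out.

```python
from datetime import datetime, timedelta, timezone

# Assumed delay after the top of the hour before fetching the new consensus.
FETCH_DELAY = timedelta(minutes=2)


def next_consensus_fetch(now):
    """Return the next fetch time strictly after `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    candidate = top_of_hour + FETCH_DELAY
    if candidate <= now:
        # This hour's fetch time has already passed: wait for the next hour.
        candidate += timedelta(hours=1)
    return candidate


if __name__ == "__main__":
    print(next_consensus_fetch(datetime.now(timezone.utc)))
```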
As described elsewhere, the coordinator uses a Tor client in order to avoid implementing the tor link protocol itself. If there is not already a way to make a Tor client download every new consensus (e.g. a torrc option or an hourly control port command), we'll want to add that.
If the coordinator is constantly sending network traffic to relays, then it shouldn't go dormant.
Here are the torrc options you might want to set on the coordinator:
  # Set this to your maximum expected gap between relay measurements,
  # including network downtime and other emergencies.
  # Particularly important during the initial deployment.
  DormantClientTimeout 1 week

  # You may also need to set
  FetchUselessDescriptors 1

  # Get new relays as fast as possible.
  FetchDirInfoEarly 1
  FetchDirInfoExtraEarly 1
This is starting to look like the sbws config. You probably want most of these options on controllers and measurers: https://github.com/torproject/sbws/blob/master/sbws/globals.py#L20
What if the capacity is limited at some other point on the internet?
For example:
- an intermediate transit provider between the measurer and all the chosen relays
- the chosen relays are all on the same local network
Ideally a single FlashFlow deployment's measurers are diverse to help mitigate the first point.
For the second, I don't have a good idea at this time. That shouldn't happen regularly. It will happen sometimes though, so perhaps this motivates a modification in how the coordinator chooses the weight for a relay. Instead of the result of the latest measurement, perhaps the highest result from the last X measurements.
That seems like a good idea.
It might also help to measure relays in each family in separate slots. You might also want to do the same thing with relays that are in:
* the same IPv4 /24
* the same IPv6 /48
Or at the very least, relays on the same IP address.
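The network checks are easy to sketch with Python's standard ipaddress module. The function names here are made up, and family membership would need a separate check against relay descriptors, which this sketch does not attempt.

```python
import ipaddress


def network_key(address):
    """Map an address to its enclosing IPv4 /24 or IPv6 /48 network."""
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    # strict=False masks off the host bits instead of raising an error.
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


def should_use_separate_slots(addr_a, addr_b):
    """True if two relay addresses share a /24 (IPv4) or /48 (IPv6)."""
    return network_key(addr_a) == network_key(addr_b)
```

Relays on the same IP address are covered automatically, since they always share the enclosing network.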
Relays without existing capacity estimates are assumed to have the 75th percentile capacity of the current network.
If a relay is not online when it's scheduled to be measured, it doesn't get measured that day.
Online in the consensus, or listening via its ORPort? (There's a delay of up to 3 hours here, whenever the relay goes up or down.)
What bandwidth weight does an offline relay get? sbws has had issues because it drops offline relays.
Online as in both, I think.
I'm not up to speed or have forgotten why continuing to give weight to offline relays is important (and this may not be the place to enlighten me). Naively I'd say zero. Assuming that's stupid, I **think** whatever weight FlashFlow would give it were it online is smarter than some minimum weight value. Suggestions?
You'll need relays to be in the consensus to do connection crypto, and listening on their ORPort to actually connect. Then you can measure.
Relays sometimes drop out of the consensus between their measurement, and the creation of the v3bw file. So don't check if they are online when you create that file.
Using the median of the past few measurements is a good idea anyway: tor has a daily user bandwidth cycle. And it helps deal with missing measurements.
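Both aggregation ideas from this thread (the highest of the last X measurements, and the median of the past few) fit in a few lines of Python. The class name, the method names, and the default window of 5 are arbitrary choices for illustration.

```python
from collections import deque
from statistics import median


class RelayHistory:
    """Keeps the most recent capacity measurements for one relay."""

    def __init__(self, window=5):
        # A deque with maxlen silently discards the oldest entry when full.
        self.results = deque(maxlen=window)

    def add(self, capacity):
        self.results.append(capacity)

    def highest_recent(self):
        """Highest result in the window; raises ValueError if empty."""
        return max(self.results)

    def median_recent(self):
        """Median result in the window; raises StatisticsError if empty."""
        return median(self.results)
```

The median smooths out the daily cycle and a single bad measurement, while the maximum is more forgiving of measurements limited by something other than the relay.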
3.2.1.2.1 Example
...
3.2.1.3 Generating V3BW files
Every hour the FF coordinator produces a v3bw file in which it stores the latest capacity estimate for every relay it has measured in the last week. The coordinator will create this file on the host's local file system. Previously-generated v3bw files will not be deleted by the coordinator.
This seems risky: we've seen Torflow fail in the past because it filled up the disk with bandwidth files.
What's the required disk capacity for a few years of bandwidth files?
We can ship a script or provide a parameter to keep the last X v3bw files if that would be preferable to relying on bwauths using logrotate themselves or otherwise finding an archival/deletion strategy that fits their needs.
If you provide a default maximum age, then you can document the disk capacity that's required to keep that many files.
If operators have more or less disk, they can change the defaults.
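A default maximum age makes both the pruning rule and the disk estimate simple. The sketch below assumes one file per hour, as in the proposal text; the function names are hypothetical and the per-file size is a caller-supplied guess, since I don't know how large FlashFlow's v3bw files will be.

```python
def files_to_prune(files, now, max_age_seconds):
    """Return the names of files older than the maximum age.

    `files` is an iterable of (name, mtime) pairs, with mtime and `now`
    both given as seconds since the epoch.
    """
    return [name for name, mtime in files if now - mtime > max_age_seconds]


def disk_needed(file_size_bytes, max_age_seconds, interval_seconds=3600):
    """Estimate the bytes needed to keep every file up to the maximum age."""
    # One extra file covers the newest file written at the interval boundary.
    file_count = max_age_seconds // interval_seconds + 1
    return file_size_bytes * file_count
```

For example, keeping one week of hourly files means 169 files, so the documented requirement is just 169 times whatever the typical file size turns out to be.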
3.2.2 FlashFlow Measurer
The measurers take commands from the coordinator
The command protocol is not specified in this proposal.
For example, does the coordinator send the IPv4 and IPv6 addresses of the relay to the measurers?
Which deployment parameters are sent via the protocol, and which are hard-coded in configurations?
A Tor proposal did not seem the place for some of these protocols, options, etc. existing entirely outside little-t tor. We can certainly elaborate better if that's wrong.
I'm not sure. It's helpful to have a design overview somewhere, and to reference it from the proposal.
I don't have strong opinions about exactly where it is located. Most similar documentation is external, but the BridgeDB spec is part of the Tor specifications: https://github.com/torproject/torspec/blob/master/bridgedb-spec.txt
One helpful question is:
Who will maintain this software over the long term?
Ask them how they want it to be specified.
T