Hey,
So - thinking through your comments and thoughts, I realized that the better way to proceed was to describe the algorithm, and then put it into a proposal format. The following is updated with most of your comments and should be a bit clearer. I decided to remove #241 again, since it's unclear how it actually fits together with #259. The idea is that the algo is divided into three pieces. If failure is reported from the algorithm, the caller can wait for a while or just restart it from scratch immediately:
I feel that this algorithm is a step in the right direction! It's interface logic is quite similar to how Tor picks its guards, and it's more specific than the previous proposals. It's a good first step so I'll CC tor-dev so that Nick, Isis and other people can take a look if they want.
After we pin this down, we also need to collect all its security properties in one place. For example, we should state clearly what security properties the thresholds GUARDS_TRY_THRESHOLD_TIME and GUARDS_FAILOVER_THRESHOLD provide.
I inline various questions below. I can take a deeper look tomorrow as well.
- Start of algorithm
- Ensure USED_GUARDS is populated (from state file etc)
- Create a list of PRIMARY_GUARDS that contain N_PRIMARY_GUARDS that are not bad by:
- Taking the next entry from USED_GUARDS
- If USED_GUARDS is empty:
- randomly select an entry from UTOPIC_GUARDS, weighted by bandwidth
Hmm. PRIMARY_GUARDS stays like this for ever?
It probably needs to be updated when a primary guard get removed from the consensus (become bad), and put the next entry of USED_GUARDS in its place.
Maybe we need another event (and algorithm) for "receive new consensus"? Or is this considered out of scope?
- Set TRIED_GUARDS to be an empty set
Hm. What is the point of TRIED_GUARDS? It's like a list of guards that we found unreachable at some point in time. It's never cleaned, so it will eventually accumulate many nodes. IIUC, we only use that list to enforce our thresholds by looking at timestamps.
Can we do without this list, just by checking the timestamps of unreachable USED_GUARDS? Or the logic will be much more complicated?
Set TRIED_DYSTOPIC_GUARDS to be an empty set
Set state = STATE_PRIMARY_GUARDS
Each iteration of algorithm
If it was at least 3 minutes since we tried the primary guards and we are not in STATE_PRIMARY_GUARDS:
- save previous state
- set state = STATE_PRIMARY_GUARDS
STATE_PRIMARY_GUARDS:
- return each entry in PRIMARY_GUARDS in turn
- mark each entry as "unreachable" if algorithm doesn't terminate
What is the "algorithm" here? Is it the guard selection algorithm, or the circuit construction algorithm? I guess the latter?
If it's indeed the circuit construction algorithm, then the steps here are not asynchronous anymore, which diverges from how Tor networking is designed. Maybe this could be fixed by introducing some new states for when we are waiting for a circuit to connect and we are now learning whether it was succesful or not. Basically what entry_guard_register_connect_status() does in Tor right now.
For example we could have STATE_WAITING_FOR_PRIMARY_GUARD_CIRCUIT for when we are waiting for a primary guard circuit to connect. If it fails, then we need to mark the entry as unreachable. Maybe we also need STATE_WAITING_FOR_UTOPIC_GUARD_CIRCUIT etc.
- restore previous state (or STATE_TRY_UTOPIC if no previous)
- STATE_TRY_UTOPIC:
- for each entry in TRIED_GUARDS that was marked as unreachable more than 20 minutes ago
- add to the front of remaining UTOPIC_GUARDS
Isn't it already in UTOPIC_GUARDS? I don't see us ever removing stuff from UTOPIC_GUARDS.
BTW, what is this step supposed to do? Setup previously TRIED_GUARDS to be retried?
- return each remaining entry from USED_GUARDS in turn - for each entry, if algorithm doesn't terminate - mark entry as "unreachable" - add entry to TRIED_GUARDS
Add it there even if it it's already there (TRIED_GUARDS is never cleaned)? Or maybe update a timestamp if it's already there? How to specify this nicely?
- if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from the algorithm - if the number of entries in TRIED_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD proportion of UTOPIC_GUARDS, set state = STATE_TRY_DYSTOPIC - return each remaining entry from UTOPIC_GUARDS randomly selected, weighted by bandwidth - for each entry, if algorithm doesn't terminate - mark entry as "unreachable" - add entry to TRIED_GUARDS - if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from the algorithm - if the number of entries in TRIED_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD proportion of UTOPIC_GUARDS, set state = STATE_TRY_DYSTOPIC
TRIED_GUARDS is never cleaned, so this will be almost always true after a while.
- STATE_TRY_DYSTOPIC:
- for each entry in TRIED_DYSTOPIC_GUARDS that was marked as unreachable more than 20 minutes ago
- add to the front of remaining UTOPIC_GUARDS
Does it matter if we add it to the front or the back of UTOPIC_GUARDS?
From what I understand, we always sample from UTOPIC_GUARDS based on relay
bandwidth so the list order should not matter, right?
[This comment and also the comments below also apply for the STATE_TRY_UTOPIC case above]
- return each remaining DYSTOPIC entry from USED_GUARDS in turn
Do we even try previously-found unreachable guards here?
- for each entry, if algorithm doesn't terminate - mark entry as "unreachable" - add entry to TRIED_DYSTOPIC_GUARDS
Order matters here right? We probably want FIFO.
- if the number of entries in TRIED_GUARDS+TRIED_DYSTOPIC_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from the algorithm - if the number of entries in TRIED_DYSTOPIC_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD proportion of DYSTOPIC_GUARDS: - mark all guards in PRIMARY_GUARDS, TRIED_GUARDS and TRIED_DYSTOPIC_GUARDS as not "unreachable" - return failure from the algorithm - return each remaining entry from DYSTOPIC_GUARDS randomly selected, weighted by bandwidth - for each entry, if algorithm doesn't terminate - mark entry as "unreachable" - add entry to TRIED_DYSTOPIC_GUARDS - if the number of entries in TRIED_GUARDS+TRIED_DYSTOPIC_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from the algorithm - if the number of entries in TRIED_DYSTOPIC_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD proportion of DYSTOPIC_GUARDS: - mark all guards in PRIMARY_GUARDS, TRIED_GUARDS and TRIED_DYSTOPIC_GUARDS as not "unreachable"
Hm, not sure what this is supposed to do.
Is it trying to setup the algorithm to retry all guards from the beginning? Shouldn't we edit USED_+GUARDS in that case?
- return failure from the algorithm
Instead of failing in this case, shouldn't we just retry from the top without adding any new guards if we hit these thresholds? How do we do this? Or is this logic considered out of scope? :x
- End of algorithm
- If circuit is set up correctly, let algorithm know
- Algorithm marks the guard chosen as used and makes sure it is in USED_GUARDS
- Otherwise do another run of the algorithm
When does this "End of algorithm" happen?