Hi there,
<snip>
ALGO_CHOOSE_ENTRY_GUARD keeps track of unreachable status for guards in state private to the algorithm - this is initialized every time ALGO_CHOOSE_ENTRY_GUARD_START is called.
Interesting. That seems like both a bug and a feature in some ways.
It's a security feature because we will try our guard list from the beginning more frequently.
It's a performance "bug" because we have to cycle through all the unreachable nodes every time we restart the algorithm, because we forgot they were unreachable. If the first several guards in your USED_GUARDS are actually unreachable, then this will delay bootstrap by some time. Consider the case where you need to make three circuits to connect to a hidden service as a client (HSDir/IP/RP), so you have to call the algorithm three times in a row.
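To make the cost concrete, here is a rough sketch (not prop259 itself) that counts connection attempts over three back-to-back runs, once with the unreachable set forgotten on every run and once with it shared between runs. The function name `choose_guard`, the guard names, and the idea of counting one "probe" per connection attempt are all assumptions made up for illustration.

```python
def choose_guard(used_guards, reachable, known_unreachable=None):
    """Walk the guard list in order and return (guard, probes).

    known_unreachable is the hypothetical state that survives between runs.
    Passing None models the current behaviour: the state starts empty
    every time the algorithm is started.
    """
    unreachable = set() if known_unreachable is None else known_unreachable
    probes = 0
    for guard in used_guards:
        if guard in unreachable:
            continue  # remembered as unreachable, skip without probing
        probes += 1  # one connection attempt
        if guard in reachable:
            return guard, probes
        unreachable.add(guard)
    return None, probes


# Hypothetical guard list: the first three guards are offline.
used_guards = ["g1", "g2", "g3", "g4"]
reachable = {"g4"}

# Three circuits (HSDir/IP/RP), state forgotten between runs.
forgetful = sum(choose_guard(used_guards, reachable)[1] for _ in range(3))

# Same three circuits, unreachable state shared between runs.
shared = set()
remembering = sum(
    choose_guard(used_guards, reachable, shared)[1] for _ in range(3)
)

print(forgetful, remembering)
```

In this toy setup the forgetful variant re-probes the three dead guards on every run, while the remembering variant only pays for them once. It says nothing about how long a real probe takes or about the security side of the tradeoff.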
Of course, if a guard is really unreachable it _should_ be marked as bad within an hour because it won't be listed in the next consensus. While this makes sense, I wonder why the guard list on my laptop (in the state file) has a total of 24 guards, where 18 of them are marked as unreachable and only 6 of them are marked as bad. Maybe they were all marked unreachable when the internet was down. I wonder if this influences the performance of the algorithm. It would be nice to know if the security/performance tradeoff here is acceptable. Simulations might help, or we will have to learn it the hard way when we implement the algorithm and try it out in various types of networks.
Yes, interesting. Hmm. I'll try to come up with an unreachable measure that sits outside of the algorithm, and see if we can simulate both alternatives.
Returning to this for a bit. I think it would be good to decide whether we should keep the unreachable status of guards in permanent disk state or not. The very latest prop259 basically forgets the unreachable guard status as soon as the algorithm terminates. I wonder if we actually want this. Hopefully guardsim has a simulation scenario that will illustrate whether that's a good idea or not.
As an example of a troublesome edge case, consider Alice who operates a busy hidden service that gets dozens of client requests per second. If the first few guards on Alice's guardlist are actually offline, Tor will have to spend a few seconds probing them for _every_ client request (to make the corresponding rendezvous circuit). That seems like it will definitely influence performance.
---
I'd also like to point out another security consideration on how STATE_PRIMARY_GUARDS works. I currently like how the 3 minute retry trigger works; I think it can enforce correct guard usage in various unhandleable edge cases. I wonder if this time-based trigger should be the only way to go back to our primary guards.
For example, consider Bob, a travelling laptop user whose Internet is constantly up and down. While Bob has no Internet, Tor will keep on cycling through guards. When Bob finally manages to connect to a guard, chances are it's going to be a low priority guard, or Tor will already be in STATE_RETRY_ONLY. In that case, Bob will connect to this shitty guard, and only after 3 minutes (max) will Tor start retrying its primary guards. This way, Bob is going to expose himself to lots of guards on the network over time. Maybe to reduce this exposure, we should try to go back to STATE_PRIMARY_GUARDS in those cases? Tor does a similar trick right now which has been very helpful: https://gitweb.torproject.org/tor.git/tree/src/or/entrynodes.c?id=tor-0.2.7....
Maybe an equivalent heuristic would be that if we are in STATE_RETRY_ONLY and we manage to connect to a non-primary guard, we hang up the connection, and go back into STATE_PRIMARY_GUARDS.
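Roughly, the heuristic could look like the sketch below. Only the two state names come from the proposal; the `State` enum, the `on_guard_connected` hook, and the guard names are hypothetical and just there to show the transition.

```python
from enum import Enum


class State(Enum):
    PRIMARY_GUARDS = "STATE_PRIMARY_GUARDS"
    RETRY_ONLY = "STATE_RETRY_ONLY"


def on_guard_connected(state, guard, primary_guards):
    """Hypothetical hook run when a connection to `guard` succeeds.

    Returns (new_state, keep_connection).
    """
    if state is State.RETRY_ONLY and guard not in primary_guards:
        # The network is evidently back up: hang up on the non-primary
        # guard and give the primary guards another chance.
        return State.PRIMARY_GUARDS, False
    # In every other case keep the connection and stay in the same state.
    return state, True


primary = {"g1", "g2", "g3"}
print(on_guard_connected(State.RETRY_ONLY, "g17", primary))
print(on_guard_connected(State.RETRY_ONLY, "g2", primary))
```

The first call drops the connection and goes back to the primary guards; the second keeps it because the guard is already a primary one. What this leaves open is how often we allow the fallback to fire, so that a flaky network doesn't make us loop between the two states.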
Can this heuristic be improved? I think it should be considered for the algorithm.
Thanks!