Hi,
Here is the next version of the algorithm - put it in a gist to make it easier to look at: https://gist.github.com/olabini/343da01de8e01491bf5c
Cheers
Ola Bini obini@thoughtworks.com writes:
Hi,
Here is the next version of the algorithm - put it in a gist to make it easier to look at: https://gist.github.com/olabini/343da01de8e01491bf5c
Cheers
Thanks for this.
Here we go with another review!
The full algorithm is referred to as ALGO_CHOOSE_ENTRY_GUARD. It is divided into three components, such that the full algorithm is implemented by first invoking ALGO_CHOOSE_ENTRY_GUARD_START, then repeatedly calling ALGO_CHOOSE_ENTRY_GUARD_NEXT and finally calling ALGO_CHOOSE_ENTRY_GUARD_END.
That's helpful. Maybe we can also put the pseudocode that you posted in a previous mail in the appendix to provide more context?
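To make the three-phase split concrete, here is a rough Python sketch of how a caller might drive it. Only the START/NEXT/END calling convention comes from the text above; the function bodies are toy stand-ins, not the prop259 logic:

```python
# Hedged sketch of the three-phase calling convention. The stand-in
# bodies below are illustrative only; the real START/NEXT/END logic is
# in the proposal.

def algo_start(used_guards, exclude_nodes, n_primary_guards):
    # START: build private algorithm state from persistent USED_GUARDS.
    candidates = [g for g in used_guards if g not in exclude_nodes]
    return {"queue": candidates, "used": used_guards}

def algo_next(state):
    # NEXT: hand back the next guard to try, or None when exhausted.
    return state["queue"].pop(0) if state["queue"] else None

def algo_end(state, guard):
    # END: record the guard that actually worked in USED_GUARDS.
    if guard not in state["used"]:
        state["used"].append(guard)

def choose_entry_guard(used_guards, exclude_nodes, n_primary, try_connect):
    state = algo_start(used_guards, exclude_nodes, n_primary)
    while True:
        guard = algo_next(state)
        if guard is None:
            return None          # ran out of candidates
        if try_connect(guard):   # caller attempts the actual circuit
            algo_end(state, guard)
            return guard
```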
ALGO_CHOOSE_ENTRY_GUARD keeps track of unreachable status for guards in state private to the algorithm - this is initialized every time ALGO_CHOOSE_ENTRY_GUARD_START is called.
Interesting. That seems like both a bug and a feature in some ways.
It's a security feature because we will try our guard list from the beginning more frequently.
It's a performance "bug" because we have to cycle through all the unreachable nodes every time we restart the algorithm, because we forgot they were unreachable. If the first several guards in your USED_GUARDS are actually unreachable, then this will delay bootstrap by some time. Consider the case where you need to make three circuits to connect to a hidden service as a client (HSDir/IP/RP), so you have to call the algorithm three times in a row.
Of course, if a guard is really unreachable it _should_ be marked as bad within an hour because it won't be listed in the next consensus. While this makes sense, I wonder why my laptop guard list (in the state file) has a total of 24 guards, where 18 of them are marked as unreachable and only 6 of them are marked as bad. Maybe they were all marked unreachable when the internet was down. I wonder if this influences the performance of the algorithm.
It would be nice to know if the security/performance tradeoff here is acceptable. Simulations might help, or we will have to learn it the hard way when we implement the algorithm and try it out in various types of networks.
ALGO_CHOOSE_ENTRY_GUARD_START: (arguments: USED_GUARDS, EXCLUDE_NODES, N_PRIMARY_GUARDS)

  If selecting directory guards, GUARDS is all guards from the consensus with the V2Flag, otherwise GUARDS is all guards from the consensus
  Set UTOPIC_GUARDS to be all guards to use under utopic conditions from GUARDS
  Set DYSTOPIC_GUARDS to be all guards to use under dystopic conditions from GUARDS
  Set REMAINING_UTOPIC_GUARDS to be UTOPIC_GUARDS without EXCLUDE_NODES and USED_GUARDS
  Set REMAINING_DYSTOPIC_GUARDS to be DYSTOPIC_GUARDS without EXCLUDE_NODES and USED_GUARDS
  Create a list of PRIMARY_GUARDS that contains N_PRIMARY_GUARDS that are not bad by:
    Taking the next entry from USED_GUARDS
    If USED_GUARDS is empty: randomly select an entry from UTOPIC_GUARDS, weighted by bandwidth
  Set TRIED_GUARDS to be an empty set
  Set TRIED_DYSTOPIC_GUARDS to be an empty set
  Set state = STATE_PRIMARY_GUARDS

ALGO_CHOOSE_ENTRY_GUARD_NEXT:

  If a new consensus has arrived:
    Update all guard profiles with new bad/non-bad information
    If any PRIMARY_GUARDS have become bad:
      re-add to the list of PRIMARY_GUARDS using the same procedure
And also remove the primary guard that has become bad, right?
If any USED_GUARDS have become non-bad:
  add them back to PRIMARY_GUARDS at the place they would have been if they had been non-bad when running ALGO_CHOOSE_ENTRY_GUARD_START.
  If this results in PRIMARY_GUARDS being larger than N_PRIMARY_GUARDS, remove entries from the end of the list until the list is N_PRIMARY_GUARDS long
How do we know that a guard in USED_GUARDS that just became non-bad deserves to be a primary guard? Even if it's the first guard in the USED_GUARDS list, how do we know whether it has priority over the last guard of the PRIMARY_GUARDS list? Do we need to keep state of previous PRIMARY_GUARDS sets?
Ensure that UTOPIC_GUARDS and DYSTOPIC_GUARDS are updated with the changes from the consensus
Isn't this implicit in "Update all guard profiles with new bad/non-bad information"? Maybe these can be merged into a single step that explicitly mentions the guardlists that need to be updated.
  If it has been at least 3 minutes since we tried the primary guards and we are not in STATE_PRIMARY_GUARDS:
    save previous state
    set state = STATE_PRIMARY_GUARDS

STATE_PRIMARY_GUARDS:

  return each entry in PRIMARY_GUARDS in turn
  for each entry, if algorithm doesn't terminate:
    mark entry as "unreachable"
    add entry to TRIED_GUARDS
    if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from ALGO_CHOOSE_ENTRY_GUARD
  restore previous state (or STATE_TRY_UTOPIC if no previous)

STATE_TRY_UTOPIC:

  for each entry in TRIED_GUARDS that was marked as unreachable more than 20 minutes ago:
    add it back to REMAINING_UTOPIC_GUARDS
  return each remaining entry from USED_GUARDS in turn
What does "remaining" mean here? Is it the next USED_GUARDS entry that has not been marked as unreachable?
for each entry, if algorithm doesn't terminate:
  mark entry as "unreachable"
  add entry to TRIED_GUARDS
  if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from ALGO_CHOOSE_ENTRY_GUARD
  if the number of entries in TRIED_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD proportion of UTOPIC_GUARDS, set state = STATE_TRY_DYSTOPIC
I wonder if returning failure and exiting the algorithm is the right thing to do in this case.
Basically, we just connected to too many different guards and we want to stop exposing ourselves to more of the network. This is something that can happen during an attack, but also something that happens naturally all the time when the network is down. So we can't just have Tor exit().
Instead, we probably want to keep our guardlist and continue connecting to the guards we already have till the network comes up again.
If we return failure here and exit the algorithm, then when we restart the algorithm we will have forgotten the TRIED_GUARDS list and hence all the guards that we tried in our previous run. Then we will attempt to connect to GUARDS_TRY_THRESHOLD new guards again. Which is probably exactly what the adversary wants.
So maybe failure is not an option here, and instead we should massage our current guardlist and start from the beginning without adding more guards?
What do you think?
return each entry from REMAINING_UTOPIC_GUARDS randomly selected, weighted by bandwidth
remove the returned entry from REMAINING_UTOPIC_GUARDS
for each entry, if algorithm doesn't terminate:
  mark entry as "unreachable"
  add entry to TRIED_GUARDS
  if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from ALGO_CHOOSE_ENTRY_GUARD
  if the number of entries in TRIED_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD proportion of UTOPIC_GUARDS, set state = STATE_TRY_DYSTOPIC
Also, I'm not sure I understand the precise security properties that the above two checks provide us with. Are both checks needed? Could you explain them to me please?
STATE_TRY_DYSTOPIC:

  for each entry in TRIED_DYSTOPIC_GUARDS that was marked as unreachable more than 20 minutes ago:
    add it back to REMAINING_DYSTOPIC_GUARDS
  return each remaining DYSTOPIC entry from USED_GUARDS in turn
  for each entry, if algorithm doesn't terminate:
    mark entry as "unreachable"
    add entry to TRIED_DYSTOPIC_GUARDS
    if the number of entries in TRIED_GUARDS + TRIED_DYSTOPIC_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from ALGO_CHOOSE_ENTRY_GUARD
    if the number of entries in TRIED_DYSTOPIC_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD proportion of DYSTOPIC_GUARDS:
      mark all guards in PRIMARY_GUARDS, TRIED_GUARDS and TRIED_DYSTOPIC_GUARDS as not "unreachable"
      return failure from ALGO_CHOOSE_ENTRY_GUARD
  return each entry from REMAINING_DYSTOPIC_GUARDS randomly selected, weighted by bandwidth
  remove the returned entry from REMAINING_DYSTOPIC_GUARDS
  for each entry, if algorithm doesn't terminate:
    mark entry as "unreachable"
    add entry to TRIED_DYSTOPIC_GUARDS
    if the number of entries in TRIED_GUARDS + TRIED_DYSTOPIC_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from ALGO_CHOOSE_ENTRY_GUARD
    if the number of entries in TRIED_DYSTOPIC_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD proportion of DYSTOPIC_GUARDS:
      return failure from ALGO_CHOOSE_ENTRY_GUARD

ALGO_CHOOSE_ENTRY_GUARD_END:

  If circuit is set up correctly, let algorithm know
  Algorithm marks the guard chosen as used and makes sure it is in USED_GUARDS
Maybe instead of "makes sure it is in" we should say "adds it in"? It has not been added in a previous step from what I see.
Hi there,
Thanks again! Really appreciate you spending so much time with us fine tuning this.
The full algorithm is referred to as ALGO_CHOOSE_ENTRY_GUARD. It is divided into three components, such that the full algorithm is implemented by first invoking ALGO_CHOOSE_ENTRY_GUARD_START, then repeatedly calling ALGO_CHOOSE_ENTRY_GUARD_NEXT and finally calling ALGO_CHOOSE_ENTRY_GUARD_END.
That's helpful. Maybe we can also put the pseudocode that you posted in a previous mail in the appendix to provide more context?
Yep, absolutely.
It would be nice to know if the security/performance tradeoff here is acceptable. Simulations might help, or we will have to learn it the hard way when we implement the algorithm and try it out in various types of networks.
Yes, interesting. Hmm. I'll try to come up with an unreachable measure that sits outside of the algorithm, and see if we can simulate both alternatives.
If a new consensus has arrived:
  Update all guard profiles with new bad/non-bad information
  If any PRIMARY_GUARDS have become bad:
    re-add to the list of PRIMARY_GUARDS using the same procedure
And also remove the primary guard that has become bad, right?
Yes, corrected.
If any USED_GUARDS have become non-bad:
  add them back to PRIMARY_GUARDS at the place they would have been if they had been non-bad when running ALGO_CHOOSE_ENTRY_GUARD_START.
  If this results in PRIMARY_GUARDS being larger than N_PRIMARY_GUARDS, remove entries from the end of the list until the list is N_PRIMARY_GUARDS long
How do we know that a guard in USED_GUARDS that just became non-bad deserves to be a primary guard? Even if it's the first guard in the USED_GUARDS list, how do we know whether it has priority over the last guard of the PRIMARY_GUARDS list? Do we need to keep state of previous PRIMARY_GUARDS sets?
Well, my understanding is this - under ideal circumstances, the first guards in USED_GUARDS will always be the same as PRIMARY_GUARDS - since PRIMARY_GUARDS is a temporary abstraction only used inside of the algorithm. So based on that, USED_GUARDS is all the persistent state we have.
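If that invariant holds, PRIMARY_GUARDS is just the first N non-bad entries of USED_GUARDS, so "add it back at the place it would have been" reduces to recomputing that prefix. A sketch (the guard representation is an assumption, and the bandwidth-weighted fallback to UTOPIC_GUARDS when USED_GUARDS runs short is omitted):

```python
def compute_primary_guards(used_guards, n_primary, is_bad):
    # PRIMARY_GUARDS as a view over the head of USED_GUARDS: the first
    # n_primary entries that are not currently bad, in list order.
    # (Sketch only; the random selection from UTOPIC_GUARDS for a short
    # or empty USED_GUARDS list is left out.)
    return [g for g in used_guards if not is_bad(g)][:n_primary]
```

A guard that flips back to non-bad then lands back in its original position simply by recomputation, with no extra state about previous PRIMARY_GUARDS sets.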
Isn't this implicit in "Update all guard profiles with new bad/non-bad information"? Maybe these can be merged into a single step that explicitly mentions the guardlists that need to be updated.
Yes, removed that.
STATE_TRY_UTOPIC:

  for each entry in TRIED_GUARDS that was marked as unreachable more than 20 minutes ago:
    add it back to REMAINING_UTOPIC_GUARDS
  return each remaining entry from USED_GUARDS in turn
What does "remaining" mean here? Is it the next USED_GUARDS entry that has not been marked as unreachable?
Changed it to read like this: "return each entry from USED_GUARDS not in PRIMARY_GUARDS in turn" hope that is clearer.
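In Python terms, the clarified step is roughly this (a sketch of just this one step, not the full state machine):

```python
def used_guards_not_primary(used_guards, primary_guards):
    # "return each entry from USED_GUARDS not in PRIMARY_GUARDS in
    # turn" - yields guards in USED_GUARDS order, skipping the ones
    # already covered by STATE_PRIMARY_GUARDS.
    primary = set(primary_guards)
    for guard in used_guards:
        if guard not in primary:
            yield guard
```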
for each entry, if algorithm doesn't terminate:
  mark entry as "unreachable"
  add entry to TRIED_GUARDS
  if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from ALGO_CHOOSE_ENTRY_GUARD
  if the number of entries in TRIED_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD proportion of UTOPIC_GUARDS, set state = STATE_TRY_DYSTOPIC
I wonder if returning failure and exiting the algorithm is the right thing to do in this case.
Basically, we just connected to too many different guards and we want to stop exposing ourselves to more of the network. This is something that can happen during an attack, but also something that happens naturally all the time when the network is down. So we can't just have Tor exit().
Instead, we probably want to keep our guardlist and continue connecting to the guards we already have till the network comes up again.
If we return failure here and exit the algorithm, then when we restart the algorithm we will have forgotten the TRIED_GUARDS list and hence all the guards that we tried in our previous run. Then we will attempt to connect to GUARDS_TRY_THRESHOLD new guards again. Which is probably exactly what the adversary wants.
So maybe failure is not an option here, and instead we should massage our current guardlist and start from the beginning without adding more guards?
What do you think?
Ugh. You're completely right. Not a good idea. OK - so what if, instead of returning failure, we enter a new state where we only ever retry guards in TRIED_GUARDS?
if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME is larger than GUARDS_TRY_THRESHOLD, return failure from ALGO_CHOOSE_ENTRY_GUARD.
if the number of entries in TRIED_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD proportion of UTOPIC_GUARDS, set state = STATE_TRY_DYSTOPIC
Also, I'm not sure I understand the precise security properties that the above two checks provide us with. Are both checks needed? Could you explain them to me please?
The first one is to make it possible to constrain how many guards we ever try to connect to, in order to minimize our exposure in a shorter amount of time.
The second one is to switch over from utopic conditions to dystopic conditions. If we are in dystopic conditions, this heuristic makes it more likely that we will find a usable guard faster.
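Sketched as code, the two checks look like this (the constant values are placeholders, not numbers from the proposal):

```python
import time

GUARDS_TRY_THRESHOLD = 40          # placeholder values only; the real
GUARDS_TRY_THRESHOLD_TIME = 3600   # constants would come from the
GUARDS_FAILOVER_THRESHOLD = 0.05   # proposal or the consensus

def too_many_guards_tried(tried_times, now=None):
    # Check 1: bound total exposure - how many distinct guards have we
    # tried within the time window? tried_times maps guard -> last try.
    now = time.time() if now is None else now
    recent = [t for t in tried_times.values()
              if now - t <= GUARDS_TRY_THRESHOLD_TIME]
    return len(recent) > GUARDS_TRY_THRESHOLD

def should_try_dystopic(tried_guards, utopic_guards):
    # Check 2: if a noticeable fraction of utopic guards has failed,
    # assume a filtered network and fail over to dystopic (e.g.
    # port 80/443) guards.
    return len(tried_guards) > GUARDS_FAILOVER_THRESHOLD * len(utopic_guards)
```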
ALGO_CHOOSE_ENTRY_GUARD_END:

  If circuit is set up correctly, let algorithm know
  Algorithm marks the guard chosen as used and makes sure it is in USED_GUARDS
Maybe instead of "makes sure it is in" we should say "adds it in"? It has not been added in a previous step from what I see.
Well, since USED_GUARDS is persistent, the guard could have been added in a previous run of the Tor client - that's why I said "makes sure it is in". If you disagree, I can change the language. =)
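The "makes sure it is in" semantics is then just an idempotent append; a sketch:

```python
def mark_guard_as_used(used_guards, chosen_guard):
    # USED_GUARDS is persistent across runs, so the chosen guard may
    # already be present from a previous Tor session; only append when
    # absent. This is why "makes sure it is in" is more accurate than
    # "adds it".
    if chosen_guard not in used_guards:
        used_guards.append(chosen_guard)
    return used_guards
```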
Cheers
Hi again,
Here is the newest version of the algorithm: https://gist.github.com/olabini/343da01de8e01491bf5c
The biggest change is the addition of the state STATE_TRY_ONLY_TRIED. Once it enters this state, it will never exit it again.
Cheers
Ola Bini obini@thoughtworks.com writes:
Hi again,
Here is the newest version of the algorithm: https://gist.github.com/olabini/343da01de8e01491bf5c
Thanks! Looks better!
I think we are reaching the point that we need good simulations and actually running and stepping through the algorithm with actual data to find issues and suboptimalities. And maybe we also need to get a third person familiar with guards to review it.
BTW, I noticed you removed the "we received a new consensus" part. That's fine for now, but I think it should be added back at some point maybe as a separate event. I like the work done by rjunior and chelsea here: https://gist.github.com/chelseakomlo/2acbe15314b5a809c6f4 and I think the events described are quite close to the Tor networking API. So it might be worth having this algorithm mirror that event structure.
Hi George,
We were having a similar discussion about what to include in the "receive new consensus" action.
We currently have three actions to remove dead/unused guards. They are:
1. Marking guards that are not listed in the latest consensus as "bad"
2. Remove guards that have been dead for 30 days
3. Remove guards that were added more than 30 days ago
Specifically, our question was whether #2 and #3 should be part of the algorithm for new guard selection, as this seems to be more state file maintenance, or if these will always be a part of the "receive new consensus" action.
If #2 and #3 can be separated, we were wondering where these would go- if there are other similar events for state file maintenance.
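For discussion, the three actions might look like a standalone state-file maintenance pass, roughly like this (the field names, and using bad_since as the "dead" marker, are assumptions for illustration):

```python
import time

DAY = 24 * 3600
GUARD_LIFETIME = 30 * DAY  # stand-in for both 30-day windows above

def maintain_guard_state(guards, listed_fingerprints, now=None):
    # Action 1: guards missing from the latest consensus become "bad";
    # guards that reappear are cleared. Actions 2 and 3 then drop
    # guards that have stayed bad too long or are simply too old.
    now = time.time() if now is None else now
    kept = []
    for g in guards:
        if g["fp"] in listed_fingerprints:
            g["bad_since"] = None                      # action 1 (clear)
        elif g.get("bad_since") is None:
            g["bad_since"] = now                       # action 1 (mark)
        if g["bad_since"] is not None and now - g["bad_since"] > GUARD_LIFETIME:
            continue                                   # action 2 (dead > 30d)
        if now - g["added_at"] > GUARD_LIFETIME:
            continue                                   # action 3 (added > 30d ago)
        kept.append(g)
    return kept
```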
We have an updated document here- https://github.com/twstrike/tor_guardsim/blob/develop/original_algorithm.md
Looking forward to seeing what you think! Chelsea
On Mon, Feb 15, 2016 at 10:15 AM, George Kadianakis desnacked@riseup.net wrote:
Ola Bini obini@thoughtworks.com writes:
Hi again,
Here is the newest version of the algorithm: https://gist.github.com/olabini/343da01de8e01491bf5c
Thanks! Looks better!
I think we are reaching the point that we need good simulations and actually running and stepping through the algorithm with actual data to find issues and suboptimalities. And maybe we also need to get a third person familiar with guards to review it.
BTW, I noticed you removed the "we received a new consensus" part. That's fine for now, but I think it should be added back at some point maybe as a separate event. I like the work done by rjunior and chelsea here: https://gist.github.com/chelseakomlo/2acbe15314b5a809c6f4 and I think the events described are quite close to the Tor networking API. So it might be worth having this algorithm mirror that event structure.
Chelsea Komlo ckomlo@thoughtworks.com writes:
Hi George,
We were having a similar discussion about what to include in the "receive new consensus" action.
We currently have three actions to remove dead/unused guards. They are:
- Marking guards that are not listed in the latest consensus as "bad"
- Remove guards that have been dead for 30 days
- Remove guards that were added more than 30 days ago
Specifically, our question was whether #2 and #3 should be part of the algorithm for new guard selection, as this seems to be more state file maintenance, or if these will always be a part of the "receive new consensus" action.
If #2 and #3 can be separated, we were wondering where these would go- if there are other similar events for state file maintenance.
Hello Chelsea,
I agree that event actions #2 and #3 are not really connected to the "received new consensus" event.
Currently in Tor these two actions are also performed by remove_obsolete_entry_guards() which is called by entry_guards_parse_state(). The entry_guards_parse_state() function is called when Tor starts up, to load any guard information from the state file.
So in the new algorithm maybe this can fit under a new LOAD_STATE event? Looking at the new prop259, this could be part of the START of the algorithm, called when initializing USED_GUARDS.
We have an updated document here- https://github.com/twstrike/tor_guardsim/blob/develop/original_algorithm.md
Looking forward to seeing what you think! Chelsea
On Mon, Feb 15, 2016 at 10:15 AM, George Kadianakis desnacked@riseup.net wrote:
Ola Bini obini@thoughtworks.com writes:
Hi again,
Here is the newest version of the algorithm: https://gist.github.com/olabini/343da01de8e01491bf5c
Thanks! Looks better!
I think we are reaching the point that we need good simulations and actually running and stepping through the algorithm with actual data to find issues and suboptimalities. And maybe we also need to get a third person familiar with guards to review it.
BTW, I noticed you removed the "we received a new consensus" part. That's fine for now, but I think it should be added back at some point maybe as a separate event. I like the work done by rjunior and chelsea here: https://gist.github.com/chelseakomlo/2acbe15314b5a809c6f4 and I think the events described are quite close to the Tor networking API. So it might be worth having this algorithm mirror that event structure.
Hi George,
Sorry I've been slow to answer and send more info - was spending all day Saturday and Sunday, and this week I'll be in all day meetings.
Anyway, just to avoid confusion - After getting the algorithm to the latest stage you saw below, we started work on implementing it to simulate it. Iván and Tania have been working on that.
I also wrote it in draft form, so this: https://github.com/twstrike/torspec/blob/review/proposals/259-guard-selectio... is completely rewritten more or less from scratch.
When it comes to receive new consensus, it is in the latest version of the written proposal above.
The link Chelsea sent is a link to the original protocol, not the new one.
So to summarize where we are:
- We have fixed a bunch of issues in the existing guardsim code
- We have added a lot of simulation cases to the existing guardsim code
- We have implemented most of the original implementation in the guardsim code
- We are implementing the new algorithm, described in the spec above, in the guardsim code, and expect that to be done today or tomorrow
- The spec above is polished enough that, if you agree, we could send it to a wider distribution and get thoughts
- We might be able to start implementing this in tor proper tomorrow or Wednesday hopefully
From now, the team will take over most communication.
Cheers
Ola Bini obini@thoughtworks.com writes:
Hi George,
Sorry I've been slow to answer and send more info - was spending all day Saturday and Sunday, and this week I'll be in all day meetings.
Anyway, just to avoid confusion - After getting the algorithm to the latest stage you saw below, we started work on implementing it to simulate it. Iván and Tania have been working on that.
I also wrote it in draft form, so this: https://github.com/twstrike/torspec/blob/review/proposals/259-guard-selectio... is completely rewritten more or less from scratch.
When it comes to receive new consensus, it is in the latest version of the written proposal above.
The link Chelsea sent is a link to the original protocol, not the new one.
So to summarize where we are:
- We have fixed a bunch of issues in the existing guardsim code
- We have added a lot of simulation cases to the existing guardsim code
- We have implemented most of the original implementation in the guardsim code
- We are implementing the new algorithm, described in the spec above, in the guardsim code, and expect that to be done today or tomorrow
- The spec above is polished enough that, if you agree, we could send it to a wider distribution and get thoughts
- We might be able to start implementing this in tor proper tomorrow or Wednesday hopefully
Sounds great.
I would be interested in learning the results of those simulations!
In any case, starting the tor implementation will be useful anyhow. Even if the very final algorithm is not 100% the same as the new prop259, it will definitely be closer to that than the old mess.
Maybe we can have another IRC meetup this week to speak about the simulation results and discuss how the little-t-tor changes could work? I will be available tomorrow and on Wednesday from 14:00 to 18:00 UTC. Let me know if that works for you.
Hi there,
<snip>
ALGO_CHOOSE_ENTRY_GUARD keeps track of unreachable status for guards in state private to the algorithm - this is initialized every time ALGO_CHOOSE_ENTRY_GUARD_START is called.
Interesting. That seems like both a bug and a feature in some ways.
It's a security feature because we will try our guard list from the beginning more frequently.
It's a performance "bug" because we have to cycle through all the unreachable nodes every time we restart the algorithm, because we forgot they were unreachable. If the first several guards in your USED_GUARDS are actually unreachable, then this will delay bootstrap by some time. Consider the case where you need to make three circuits to connect to a hidden service as a client (HSDir/IP/RP), so you have to call the algorithm three times in a row.
Of course, if a guard is really unreachable it _should_ be marked as bad within an hour because it won't be listed in the next consensus. While this makes sense, I wonder why my laptop guard list (in the state file) has a total of 24 guards, where 18 of them are marked as unreachable and only 6 of them are marked as bad. Maybe they were all marked unreachable when the internet was down. I wonder if this influences the performance of the algorithm. It would be nice to know if the security/performance tradeoff here is acceptable. Simulations might help, or we will have to learn it the hard way when we implement the algorithm and try it out in various types of networks.
Yes, interesting. Hmm. I'll try to come up with an unreachable measure that sits outside of the algorithm, and see if we can simulate both alternatives.
Returning to this for a bit. I think it would be good to decide whether we should keep the unreachable status of guards on permanent disk state or not. The very latest prop259 basically forgets the unreachable guard status as soon as the algorithm terminates. I wonder if we actually want this. Hopefully guardsim has a simulation scenario that will illustrate whether that's a good idea or not.
As an example of a troublesome edge case, consider Alice who operates a busy hidden service that gets dozens of client requests per second. If the first few guards on Alice's guardlist are actually offline, Tor will have to spend a few seconds probing them for _every_ client request (to make the corresponding rendezvous circuit). That seems like it will definitely influence performance.
---
I'd also like to point out another security consideration on how STATE_PRIMARY_GUARDS works. I currently like how the 3 minute retry trigger works; I think it can enforce correct guard usage in various unhandleable edge cases. I wonder if this time-based trigger should be the only way to go back to our primary guards.
For example, consider Bob, a travelling laptop user whose Internet is constantly up and down. While Bob has no Internet, Tor will keep on cycling through guards. When Bob finally manages to connect to a guard, chances are it's going to be a low-priority guard, or Tor will already be in STATE_RETRY_ONLY. In that case, Bob will connect to this shitty guard, and only after at most 3 minutes will Tor start retrying its primary guards. This way, Bob is going to expose himself to lots of guards on the network over time. Maybe to reduce this exposure, we should try to go back to STATE_PRIMARY_GUARDS in those cases? Tor does a similar trick right now which has been very helpful: https://gitweb.torproject.org/tor.git/tree/src/or/entrynodes.c?id=tor-0.2.7....
Maybe an equivalent heuristic would be that if we are in STATE_RETRY_ONLY and we manage to connect to a non-primary guard, we hang up the connection, and go back into STATE_PRIMARY_GUARDS.
Can this heuristic be improved? I think it should be considered for the algorithm.
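As a concrete strawman, the heuristic could be expressed like this (the state constants and the connection callback are assumptions for illustration):

```python
STATE_PRIMARY_GUARDS = "STATE_PRIMARY_GUARDS"
STATE_RETRY_ONLY = "STATE_RETRY_ONLY"

def on_guard_connected(state, guard, primary_guards, close_connection):
    # If we are in the retry-only fallback and a non-primary guard
    # answers, treat that as evidence the network is back up: hang up
    # and go retry the primary guards instead of settling for this one.
    if state == STATE_RETRY_ONLY and guard not in primary_guards:
        close_connection(guard)
        return STATE_PRIMARY_GUARDS
    return state
```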
Thanks!
Hey,
Returning to this for a bit. I think it would be good to decide whether we should keep the unreachable status of guards on permanent disk state or not. The very latest prop259 basically forgets the unreachable guard status as soon as the algorithm terminates. I wonder if we actually want this. Hopefully guardsim has a simulation scenario that will illustrate whether that's a good idea or not.
Agreed. I don't know if you saw it, but in the new proposal I have an XXX for exactly that.
Maybe an equivalent heuristic would be that if we are in STATE_RETRY_ONLY and we manage to connect to a non-primary guard, we hang up the connection, and go back into STATE_PRIMARY_GUARDS.
Maybe. Should we do this only in STATE_RETRY_ONLY or for the UTOPIC and DYSTOPIC states as well?
Cheers
Ola Bini obini@thoughtworks.com writes:
Hey,
Returning to this for a bit. I think it would be good to decide whether we should keep the unreachable status of guards on permanent disk state or not. The very latest prop259 basically forgets the unreachable guard status as soon as the algorithm terminates. I wonder if we actually want this. Hopefully guardsim has a simulation scenario that will illustrate whether that's a good idea or not.
Agreed. I don't know if you saw it, but in the new proposal I have an XXX for exactly that.
Hello,
yes I noticed. I'm looking forward to the simulation results to see if that XXX should actually be implemented.
Note that if we decide to go with keeping the unreachable state, there might be multiple parts of the algorithm that will need to change.
Maybe an equivalent heuristic would be that if we are in STATE_RETRY_ONLY and we manage to connect to a non-primary guard, we hang up the connection, and go back into STATE_PRIMARY_GUARDS.
Maybe. Should we do this only in STATE_RETRY_ONLY or for the UTOPIC and DYSTOPIC states as well?
Hmmm, I could see how it could be useful in the DYSTOPIC state, as well as maybe in the UTOPIC case as well.
Ideally, the simulator would tell us which design is best here.
On 2/16/16 12:20, George Kadianakis wrote:
The very latest prop259 basically forgets the unreachable guard status as soon as the algorithm terminates. I wonder if we actually want this. Hopefully guardsim has a simulation scenario that will illustrate whether that's a good idea or not.
We are assuming the unreachable state is persisted (by setting unreachable_since in the guard) in the current simulation. We are also ignoring entries known to be unreachable when choosing guards from either USED_GUARDS or REMAINING_UTOPIC_GUARDS to be our PRIMARY_GUARDS.
Maybe to reduce this exposure, we should try to go back to STATE_PRIMARY_GUARDS in those cases? Tor does a similar trick right now which has been very helpful: https://gitweb.torproject.org/tor.git/tree/src/or/entrynodes.c?id=tor-0.2.7....
Maybe an equivalent heuristic would be that if we are in STATE_RETRY_ONLY and we manage to connect to a non-primary guard, we hang up the connection, and go back into STATE_PRIMARY_GUARDS.
Having this in mind, and revisiting entry_guard_register_connect_status(), it seems the appropriate approach would be marking all USED_GUARDS for retry, discarding the current circuit and trying to build a new circuit (restart the whole algorithm) rather than marking the PRIMARY_GUARDS and going back to STATE_PRIMARY_GUARDS.
This is because by marking all the used guards for retry we can expect the next circuit to use the first available USED_GUARD as an entry guard.
In addition to that, my understanding of this behavior in the current guard selection algorithm is:
When you succeed in building a circuit with a new guard, you mark for retry all previously used guards (the global entry_guards var in tor) which are:
- listed in the latest consensus
- not bad_since some time
- unreachable_since some time and it is "time to retry"*
- (and it does not have other restrictions which are not important for our simulation, like being a bridge when required or fast/stable)
If you find any retryable guard in the condition above, you discard the current circuit and try to build a new one:
- best case: you have a new circuit with one of the previously unreachable used guards.
- worst case: all previously unreachable used guards marked for retry fail again, and you have a new circuit with the same guard you discarded before (unless it is also unreachable now :S).
It's not really "every time you connect to a new guard, retry the previously used" but "every time you connect to a new guard, check if it's time to retry an unreachable previously used guard".
* time to retry is progressive depending on how long the node has been unreachable, for example hourly in the first 6 hours, every 4 hours until 3 days, etc. (see: entry_is_time_to_retry).
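The progressive schedule in the footnote might be sketched like this (the first two break-points follow the footnote; the long-tail interval is an illustrative assumption, not the exact entry_is_time_to_retry table):

```python
HOUR = 3600

def is_time_to_retry(unreachable_since, last_attempted, now):
    # The longer a guard has been unreachable, the less often we retry:
    # hourly for the first 6 hours, every 4 hours until 3 days (per the
    # footnote above), then an assumed long-tail interval after that.
    down_for = now - unreachable_since
    if down_for < 6 * HOUR:
        interval = 1 * HOUR
    elif down_for < 72 * HOUR:
        interval = 4 * HOUR
    else:
        interval = 36 * HOUR       # illustrative, not the real table
    return now - last_attempted >= interval
```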
Can this heuristic be improved? I think it should be considered for the algorithm.
We have reviewed this strategy in the simulation for the original algorithm and we are generating new data. We are also updating the proposal to include the described strategy.
Cheers,
Reinaldo and Iván
Reinaldo de Souza Jr rjunior@thoughtworks.com writes:
On 2/16/16 12:20, George Kadianakis wrote:
The very latest prop259 basically forgets the unreachable guard status as soon as the algorithm terminates. I wonder if we actually want this. Hopefully guardsim has a simulation scenario that will illustrate whether that's a good idea or not.
We are assuming the unreachable state is persisted (by setting unreachable_since in the guard) in the current simulation. We are also ignoring entries known to be unreachable when choosing guards from either USED_GUARDS or REMAINING_UTOPIC_GUARDS to be our PRIMARY_GUARDS.
Thanks for the work on the proposal! I haven't had time to read the new version yet apart from the stuff you point out in your email.
Making the unreachable state persistent seems like a reasonable idea.
However, I'm not so sure about rejecting previously unreachable guards from being primary guards. The concept of primary guards is to _always_ try to use the top N guards in your guardlist (as long as they are in the latest consensus). So the list of primary guards should stay the same as long as none of those nodes disappear from the consensus.
Consider the following case: The network was offline for a few seconds and we ended up marking our top 8 guards as unreachable. Then the network comes back up, and we want to build a new circuit: when it's time to choose the primary guards set we end up choosing the 9th, 10th and 11th node instead of the top 3 guards, even though the top 3 guards are actually online. Ideally Tor will always connect to the top 3 guards, and this is not the case here.
Maybe the above issue is not possible with the new prop259 algorithm? If that's the case, why?
I'm also curious to see how your simulation reacts to this new change.
Maybe to reduce this exposure, we should try to go back to STATE_PRIMARY_GUARDS in those cases? Tor does a similar trick right now which has been very helpful: https://gitweb.torproject.org/tor.git/tree/src/or/entrynodes.c?id=tor-0.2.7....
Maybe an equivalent heuristic would be that if we are in STATE_RETRY_ONLY and we manage to connect to a non-primary guard, we hang up the connection, and go back into STATE_PRIMARY_GUARDS.
Having this in mind, and revisiting entry_guard_register_connect_status(), it seems the appropriate approach would be marking all USED_GUARDS for retry, discarding the current circuit and trying to build a new circuit (restart the whole algorithm) rather than marking the PRIMARY_GUARDS and going back to STATE_PRIMARY_GUARDS.
This is because by marking all the used guards for retry we can expect the next circuit to use the first available USED_GUARD as an entry guard.
In addition to that, my understanding of this behavior in the current guard selection algorithm is:
When you succeed to build a circuit with a new guard, you mark all previously used guards (the global entry_guards var in tor) for retry which are:
- listed in the latest consensus
- not bad_since some time
- unreachable_since some time and is "time to retry"*
- (and it does not have other restrictions which are not important for
our simulation, like being a bridge when required or fast/stable)
If you find any retryable guard in the condition above, you discard the current circuit and try to build a new one:
- best case: you have a new circuit with one of the previously
unreachable used guards.
- worst case: all previously unreachable used guards marked for retry
fail again, and you have a new circuit with the same guard you discarded before (unless it is also unreachable now :S).
I think your evaluation is correct here.
It's not really "every time you connect to a new guard, retry the previously used" but "every time you connect to a new guard, check if it's time to retry an unreachable previously used guard".
* time to retry is progressive depending on how long the node has been unreachable, for example hourly in the first 6 hours, every 4 hours until 3 days, etc. (see: entry_is_time_to_retry).
Can this heuristic be improved? I think it should be considered for the algorithm.
We have reviewed this strategy in the simulation for the original algorithm and we are generating new data. We are also updating the proposal to include the described strategy.
Great!
BTW, I noticed that the graphs here: https://github.com/twstrike/tor_guardsim/issues/1 are not really explained, and also the x and y axes are undocumented some times. I have not had time to read the new tor_guardsim to understand these graphs. Some documentation would be really helpful here; otherwise, you can explain these things to me in Valencia.
Thanks!
Ola Bini obini@thoughtworks.com writes:
Hi,
Here are some more comments to the latest version of the proposal, as seen here: https://github.com/twstrike/torspec
Filename: 259-guard-selection.txt
Title: New Guard Selection Behaviour
Author: Isis Lovecruft, George Kadianakis, [Ola Bini]
Created: 2015-10-28
Status: Draft
§1. Overview
Tor uses entry guards to prevent an attacker who controls some fraction of the network from observing a fraction of every user's traffic. If users chose their entries and exits uniformly at random from the list of servers every time they build a circuit, then an adversary who had (k/N) of the network would deanonymize F=(k/N)^2 of all circuits... and after a given user had built C circuits, the attacker would see them at least once with probability 1-(1-F)^C. With large C, the attacker would get a sample of every user's traffic with probability 1.
To prevent this from happening, Tor clients choose a small number of guard nodes (currently 3). These guard nodes are the only nodes that the client will connect to directly. If they are not compromised, the user's paths are not compromised.
But attacks remain. Consider an attacker who can run a firewall between a target user and the Tor network, and make many of the guards they don't control appear to be unreachable. Or consider an attacker who can identify a user's guards, and mount denial-of-service attacks on them until the user picks a guard that the attacker controls.
In the presence of these attacks, we can't continue to connect to the Tor network unconditionally. Doing so would eventually result in the user choosing a hostile node as their guard, and losing anonymity.
This proposal outlines a new entry guard selection algorithm, which addresses the following concerns:
- Heuristics and algorithms for determining how and which guard(s) is(/are) chosen should be kept as simple and easy to understand as possible.
- Clients in censored regions or who are behind a fascist firewall who connect to the Tor network should not experience any significant disadvantage in terms of reachability or usability.
- Tor should make a best attempt at discovering the most appropriate behaviour, with as little user input and configuration as possible.
§2. Design
Alice, an OP attempting to connect to the Tor network, should undertake the following steps to determine information about the local network and to select (some) appropriate entry guards. In the following scenario, it is assumed that Alice has already obtained a recent, valid, and verifiable consensus document.
The algorithm is divided into four components such that the full algorithm is implemented by first invoking START, then repeatedly calling NEXT for as long as SHOULD_CONTINUE advises, and finally calling END. For an example usage see §A. Appendix.
Several components of NEXT can be invoked asynchronously. SHOULD_CONTINUE is how the algorithm tells the caller whether it considers the work done or not - this can be used to retry primary guards when we finally are able to connect to a guard after a long network outage, for example.
This algorithm keeps track of the unreachability status for guards in state private to the algorithm - this is re-initialized every time START is called.
Hmm, didn't we decide to persist the unreachability status over runs, right? Or not?
The argument for doing so is "What happens if I'm a hidden service that needs to make 5 circuits per second, and my first 2 guards happen to be offline? Do I have to spend time probing them every time I make a new circuit?".
The algorithm expects several arguments to guide its behavior. These will be defined in §2.1.
The goal of this algorithm is to strongly prefer connecting to the same guards we have connected to before, while also trying to detect conditions such as a network outage or a network environment that blocks most ports. The way it does this is by keeping track of how many guards we have exposed ourselves to, and if we have connected to too many we will fall back to only retrying the ones we have already tried. The algorithm also decides on a sample set that should be persisted, in order to minimize the risk of an attacker forcing enumeration of the whole network by triggering the rebuilding of circuits.
§2.1. The START algorithm
In order to start choosing an entry guard, use the START algorithm. This takes four arguments that can be used to fine tune the workings:
USED_GUARDS This is a list that contains all the guards that have been used before by this client. We will prioritize using guards from this list in order to minimize our exposure. The list is expected to be sorted based on priority, where the first entry will have the highest priority.
SAMPLED_UTOPIC_GUARDS This is a set that contains all guards that should be considered for connection under utopic conditions. This set should be persisted between runs. It will be filled in by the algorithm if it's empty, or if it contains less than SAMPLE_SET_THRESHOLD guards after winnowing out older guards. It should be filled by using NEXT_BY_BANDWIDTH with UTOPIC_GUARDS as an argument.
Should we use UTOPIC_GUARDS or REMAINING_UTOPIC_GUARDS as the argument?
SAMPLED_DYSTOPIC_GUARDS This is a set that contains all guards that should be considered for connection under dystopic conditions. This set should be persisted between runs. It will be filled in by the algorithm if it's empty, or if it contains less than SAMPLE_SET_THRESHOLD guards after winnowing out older guards. It should be filled by using NEXT_BY_BANDWIDTH with DYSTOPIC_GUARDS as an argument.
EXCLUDE_NODES A set of nodes that we should not consider using as a guard.
N_PRIMARY_GUARDS The number of guards we should consider our primary guards. These guards will be retried more frequently and will take precedence in most situations. By default the primary guards will be the first N_PRIMARY_GUARDS guards from USED_GUARDS.
DIR If this argument is set, we should only consider guards that can be directory guards. If not set, we will consider all guards.
The primary work of START is to initialize the state machine depicted in §2.2 The NEXT algorithm. The initial state of the machine is defined by:
GUARDS This is a set of all guards from the consensus, without EXCLUDE_NODES and potentially filtered if DIR is set.
UTOPIC_GUARDS This is a set of all guards to use under utopic conditions. It will primarily be used to fill in SAMPLED_UTOPIC_GUARDS. This set will be initialized to be the same as GUARDS.
DYSTOPIC_GUARDS This is a set of all guards to use under dystopic conditions (usually when we are subject to a firewall that restricts the ports we can connect to). It will primarily be used to fill in SAMPLED_UTOPIC_GUARDS. This set will be initialized to be the
I guess you mean SAMPLED_DYSTOPIC_GUARDS.
subset of GUARDS that listen to ports that are allowed by dystopic conditions.
REMAINING_UTOPIC_GUARDS This is a running set of the utopic guards we have not yet tried to connect to. It should be initialized to be SAMPLED_UTOPIC_GUARDS without USED_GUARDS.
Maybe here we should also mention that we will reinsert guards that we have not tried in a long time (GUARDS_RETRY_TIME) as specified by 2.2.2?
REMAINING_DYSTOPIC_GUARDS This is a running set of the dystopic guards we have not yet tried to connect to. It should be initialized to be SAMPLED_DYSTOPIC_GUARDS without USED_GUARDS.
TRIED_GUARDS This set keeps track of all utopic guards we have tried connecting to. This should be initialized to the empty set.
TRIED_DYSTOPIC_GUARDS This set keeps track of all dystopic guards we have tried connecting to. This should be initialized to the empty set.
STATE A variable that keeps track of which state in the state machine we are currently in. It should be initialized to STATE_PRIMARY_GUARDS.
PRIMARY_GUARDS This list keeps track of our primary guards. These are guards that we will prioritize when trying to connect, and will also retry more often in case of failure with other guards. It should be initialized by calling algorithm NEXT_PRIMARY_GUARD repeatedly until PRIMARY_GUARDS contains N_PRIMARY_GUARDS elements.
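Putting the PRIMARY_GUARDS initialization together with the §2.5 helpers, a rough sketch could look like this. All names are hypothetical, the bandwidth-weighted fallback is simplified, and the fallback is also taken when USED_GUARDS is merely exhausted (the text only says "if USED_GUARDS is empty"):

```python
import random

def next_by_bandwidth(guards, bandwidth):
    """Stand-in for §2.5 NEXT_BY_BANDWIDTH: bandwidth-weighted random pick."""
    pool = sorted(guards)
    if not pool:
        return None
    return random.choices(pool, weights=[bandwidth[g] for g in pool])[0]

def next_primary_guard(used_guards, primary_guards, consensus,
                       remaining_utopic, bandwidth):
    """§2.5 NEXT_PRIMARY_GUARD: first USED_GUARDS entry that is not yet
    primary and is in the most recent consensus; otherwise fall back to
    NEXT_BY_BANDWIDTH over REMAINING_UTOPIC_GUARDS."""
    for guard in used_guards:
        if guard not in primary_guards and guard in consensus:
            return guard
    return next_by_bandwidth(remaining_utopic, bandwidth)

def populate_primary_guards(used_guards, consensus, remaining_utopic,
                            bandwidth, n_primary):
    """Initialize PRIMARY_GUARDS by calling NEXT_PRIMARY_GUARD repeatedly
    until it holds N_PRIMARY_GUARDS elements (or no candidates remain)."""
    primary = []
    while len(primary) < n_primary:
        guard = next_primary_guard(used_guards, primary, consensus,
                                   remaining_utopic, bandwidth)
        if guard is None:
            break
        primary.append(guard)
    return primary
```

Note that because USED_GUARDS is priority-ordered, this makes PRIMARY_GUARDS simply "the first N used guards still in the consensus" in the common case.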
§2.2. The NEXT algorithm
The NEXT algorithm is composed of several different possible flows. The first one is a simple state machine that can transition between four different states. Every time NEXT is invoked, it will resume at the state where it left off previously. In the course of selecting an entry guard, a new consensus can arrive. When that happens we need to update the data structures used, but nothing else should change.
Before jumping into the state machine, we should first check whether at least PRIMARY_GUARDS_RETRY_INTERVAL minutes have passed since we last tried any of the PRIMARY_GUARDS. If this is the case, and we are not in STATE_PRIMARY_GUARDS, we should save the previous state and set the state to STATE_PRIMARY_GUARDS.
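As a sketch, with the state names from §2.2 and the interval from §3 (the ctx dictionary and field names are assumptions for illustration):

```python
PRIMARY_GUARDS_RETRY_INTERVAL = 3 * 60  # seconds; proposed value is 3 minutes

def maybe_force_primary_retry(ctx, now):
    """Before running the state machine, jump back to STATE_PRIMARY_GUARDS
    if no primary guard has been tried for too long, saving the current
    state so it can be restored afterwards (see 2.2.1)."""
    overdue = now - ctx["last_primary_attempt"] >= PRIMARY_GUARDS_RETRY_INTERVAL
    if overdue and ctx["state"] != "STATE_PRIMARY_GUARDS":
        ctx["previous_state"] = ctx["state"]
        ctx["state"] = "STATE_PRIMARY_GUARDS"
```
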
§2.2.1. The STATE_PRIMARY_GUARDS state
Return each entry in PRIMARY_GUARDS in turn. For each entry, if it was not possible to connect to it, mark the entry as unreachable, and add it to TRIED_GUARDS. [XXX defining "was not possible to connect" as "entry is not live" according to current definition of "live entry guard" in tor source code, seems to improve success rate on the flaky network scenario. See: https://github.com/twstrike/tor_guardsim/issues/1#issuecomment-187374942]
Hmm, I'm not sure what this XXX means exactly. I believe we should actually try to _connect_ to those primary guards and not just check if we think they are live.
If all entries have been tried, restore the previous state and go there. If there is no previous state, transition to STATE_TRY_UTOPIC.
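Collapsing the one-guard-per-NEXT-call structure into a single loop for readability, §2.2.1 could be sketched as follows. All names are hypothetical, and try_connect stands in for whatever the caller actually does with a returned guard:

```python
def state_primary_guards(ctx, try_connect):
    """Sketch of 2.2.1: return each primary guard in turn; on failure mark
    it unreachable (state private to the algorithm) and record it in
    TRIED_GUARDS.  When all have been tried, restore the previous state if
    any, otherwise transition to STATE_TRY_UTOPIC."""
    for guard in ctx["primary_guards"]:
        if try_connect(guard):
            return guard
        ctx["unreachable"].add(guard)
        ctx["tried_guards"].add(guard)
    ctx["state"] = ctx.get("previous_state") or "STATE_TRY_UTOPIC"
    ctx["previous_state"] = None
    return None
```
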
§2.2.2. The STATE_TRY_UTOPIC state
In order to give guards that have been marked as unreachable a chance to come back, add all entries in TRIED_GUARDS that were marked as unreachable more than GUARDS_RETRY_TIME minutes ago back to REMAINING_UTOPIC_GUARDS.
I'm a bit puzzled by this mechanism. Maybe its benefits can be explained a bit more clearly?
When we add guards back to REMAINING_UTOPIC_GUARDS, do we also remove them from TRIED_GUARDS?
Also, the value here is currently 20 minutes. So this will trigger only when this algorithm takes over 20 minutes to terminate. I guess this should only happen when the network is down.
Now that we have persistent SAMPLED_UTOPIC_GUARDS is this still useful? Won't we have fully populated our SAMPLED_*_GUARDS structures by the point this rule triggers?
Sorry for the confusion :)
Return each entry in USED_GUARDS that is not in PRIMARY_GUARDS in turn. For each entry, if it was not possible to connect to it, mark the entry as unreachable and add it to TRIED_GUARDS.
Return each entry from REMAINING_UTOPIC_GUARDS using NEXT_BY_BANDWIDTH. For each entry, if it was not possible to connect to it, remove the entry from REMAINING_UTOPIC_GUARDS, mark it as unreachable and add it to TRIED_GUARDS.
If no entries remain in REMAINING_UTOPIC_GUARDS, transition to STATE_TRY_DYSTOPIC.
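The whole STATE_TRY_UTOPIC flow could be sketched like this, again collapsed into one loop. The ctx names are hypothetical, sorted() stands in for the NEXT_BY_BANDWIDTH ordering of the sampled remainder, and re-added guards are assumed to carry an unreachable_since timestamp:

```python
GUARDS_RETRY_TIME = 20 * 60  # seconds; proposed value is 20 minutes (see 3)

def state_try_utopic(ctx, try_connect, now):
    """Sketch of 2.2.2.  Returns a working guard, or None after possibly
    transitioning to STATE_TRY_DYSTOPIC."""
    # Give guards that have been unreachable for a while another chance.
    for guard in list(ctx["tried_guards"]):
        if now - ctx["unreachable_since"][guard] > GUARDS_RETRY_TIME:
            ctx["remaining_utopic"].add(guard)

    used_not_primary = [g for g in ctx["used_guards"]
                        if g not in ctx["primary_guards"]]
    for guard in used_not_primary + sorted(ctx["remaining_utopic"]):
        if try_connect(guard):
            return guard
        ctx["remaining_utopic"].discard(guard)
        ctx["unreachable_since"][guard] = now
        ctx["tried_guards"].add(guard)

    if not ctx["remaining_utopic"]:
        ctx["state"] = "STATE_TRY_DYSTOPIC"
    return None
```
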
§2.2.3. The STATE_TRY_DYSTOPIC state
In order to give guards that have been marked as unreachable a chance to come back, add all entries in TRIED_DYSTOPIC_GUARDS that were marked as unreachable more than GUARDS_RETRY_TIME minutes ago back to REMAINING_DYSTOPIC_GUARDS.
Return each dystopic entry in USED_GUARDS that is not in PRIMARY_GUARDS in turn. For each entry, if it was not possible to connect to it, mark the entry as unreachable and add it to TRIED_DYSTOPIC_GUARDS.
Return each entry from REMAINING_DYSTOPIC_GUARDS using NEXT_BY_BANDWIDTH. For each entry, if it was not possible to connect to it, remove the entry from REMAINING_DYSTOPIC_GUARDS, mark it as unreachable and add it to TRIED_DYSTOPIC_GUARDS.
If no entries remain in REMAINING_DYSTOPIC_GUARDS, transition to STATE_PRIMARY_GUARDS.
§2.2.5. ON_NEW_CONSENSUS
First, ensure that all guard profiles are updated with information about whether they were in the newest consensus or not. If not, the guard is considered bad.
Maybe instead of "If not" we could say "If a guard is not included in the newest consensus" to make it a bit clearer.
If any PRIMARY_GUARDS have become bad, remove them from PRIMARY_GUARDS. Then ensure that PRIMARY_GUARDS contains N_PRIMARY_GUARDS entries by repeatedly calling NEXT_PRIMARY_GUARD.
If any guard in USED_GUARDS has switched from being bad to being non-bad, add it back in the place it should have been in PRIMARY_GUARDS if it had been non-bad when populating PRIMARY_GUARDS. If this results in PRIMARY_GUARDS being larger than N_PRIMARY_GUARDS, truncate the list to be N_PRIMARY_GUARDS entries long. [XXX Does "add it back in the place it should have been in PRIMARY_GUARDS if it had been non-bad" imply keeping the original order?]
If I understand correctly, I think the answer to this XXX is "Ideally, yes.".
I'm curious to see how this mechanism will be implemented because it's important and it would be nice if it's done cleanly.
Also, we should be careful about when we count 'bad' guards. After a few weeks of operation, the USED_GUARDS list can accumulate multiple bad guards, and we should make sure we don't count them when we do our threshold checks.
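To make the re-insertion rule concrete, one way §2.2.5 could be sketched (names hypothetical): this simplification assumes all primary guards came from USED_GUARDS, and simply rebuilds the list as "the first N non-bad used guards", which both drops newly-bad primaries and re-inserts recovered guards in their original order:

```python
def on_new_consensus(ctx, consensus, n_primary):
    """Sketch of 2.2.5: refresh bad status from the new consensus, then
    rebuild PRIMARY_GUARDS preserving USED_GUARDS priority order."""
    for guard in ctx["used_guards"]:
        ctx["bad"][guard] = guard not in consensus

    non_bad = [g for g in ctx["used_guards"] if not ctx["bad"][g]]
    ctx["primary_guards"] = non_bad[:n_primary]
```

This also illustrates the threshold-counting point above: any check over USED_GUARDS should look at non_bad, not the raw list.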
§2.3. The SHOULD_CONTINUE algorithm
This algorithm takes as an argument a boolean indicating whether the circuit was successfully built or not.
After the caller has tried to build a circuit with a returned guard, they should invoke SHOULD_CONTINUE to understand whether the algorithm is finished or not. SHOULD_CONTINUE will always return true if the circuit failed. If the circuit succeeded, SHOULD_CONTINUE will always return false, unless the guard that succeeded was the first guard to succeed after INTERNET_LIKELY_DOWN_INTERVAL minutes - in that case it will set the state to STATE_PRIMARY_GUARDS and return true.
ACK.
Just a reminder that we also discussed adding the "Retry primary guards if we have looped over the whole guardlist" heuristic somewhere here. Because in many cases the network can go down and then back up in less than a minute.
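One hedged reading of §2.3, with a one-shot flag so that only the first success after a likely outage triggers the primary retry. The flag and the ctx field names are assumptions, not part of the proposal:

```python
INTERNET_LIKELY_DOWN_INTERVAL = 20 * 60  # seconds; proposed 20 minutes (see 3)

def should_continue(ctx, circuit_ok, now):
    """Sketch of 2.3: True means the caller should keep invoking NEXT."""
    if not circuit_ok:
        return True
    likely_outage = (not ctx["retried_after_outage"] and
                     now - ctx["started_at"] >= INTERNET_LIKELY_DOWN_INTERVAL)
    if likely_outage:
        # First success after a likely outage: go back and retry the
        # primary guards instead of settling for this guard.
        ctx["retried_after_outage"] = True
        ctx["state"] = "STATE_PRIMARY_GUARDS"
        return True
    return False
```
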
§2.4. The END algorithm
The goal of this algorithm is simply to make sure that we keep track of successful connections made. This algorithm should be invoked with the guard that was used to correctly set up a circuit.
Once invoked, this algorithm will mark the guard as used, and make sure it is in USED_GUARDS. [XXX if the guard is not in USED_GUARDS should be added *first*? Important because it will affect on the building of PRIMARY_GUARDS.]
IIUC, if the guard is not in USED_GUARDS it should be added *last* (that is, with lowest priority).
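Following that reading (a new guard is appended with lowest priority), §2.4 could be sketched as (names hypothetical):

```python
def algo_end(ctx, guard):
    """Sketch of 2.4: record the successful guard as used, appending it
    to USED_GUARDS *last* (lowest priority) if it is new."""
    ctx["marked_used"].add(guard)
    if guard not in ctx["used_guards"]:
        ctx["used_guards"].append(guard)
```
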
§2.5. Helper algorithms
These algorithms are used in the above algorithms, but have been separated out here in order to make the flow clearer.
NEXT_PRIMARY_GUARD
- Return the first entry from USED_GUARDS that is not in PRIMARY_GUARDS and that is in the most recent consensus.
- If USED_GUARDS is empty, use NEXT_BY_BANDWIDTH with REMAINING_UTOPIC_GUARDS as the argument.
NEXT_BY_BANDWIDTH
- Takes G as an argument, which should be a set of guards to choose from.
- Return a randomly selected element from G, weighted by bandwidth.
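As a sketch, NEXT_BY_BANDWIDTH is just a bandwidth-weighted random choice over G (the rng parameter is an addition for testability, not part of the proposal):

```python
import random

def next_by_bandwidth(guards, bandwidth, rng=random):
    """Return a random element of the guard set, weighted by each
    guard's bandwidth; None if the set is empty."""
    pool = sorted(guards)  # fix an iteration order so weights line up
    if not pool:
        return None
    return rng.choices(pool, weights=[bandwidth[g] for g in pool], k=1)[0]
```
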
§3. Consensus Parameters, & Configurable Variables
This proposal introduces several new parameters that ideally should be set in the consensus but that should also be possible to set or override in the client configuration file. Some of these have proposed values, but for others more simulation and trial needs to happen.
PRIMARY_GUARDS_RETRY_INTERVAL In order to make it more likely we connect to a primary guard, we would like to retry the primary guards more often than other types of guards. This parameter controls how many minutes should pass before we consider retrying primary guards again. The proposed value is 3.
GUARDS_RETRY_TIME In the normal course of selecting guards, we want to continue retrying unreachable guards in case they have become reachable. This parameter controls the minimum amount of minutes before we should start to consider guards for connecting again. Proposed value is 20.
SAMPLE_SET_THRESHOLD In order to allow us to recognize dystopic situations or a completely unreachable network, we would like to avoid connecting to too many guards before switching modes. We also want to avoid exposing ourselves to too many nodes in a potentially hostile situation. This parameter, expressed as a fraction, determines the number of guards we should keep as the sampled set of the only guards we will consider connecting to. It will be used as a fraction for both the utopic and the dystopic sampled set. If we assume there are 1900 utopic guards and of them there are 500 dystopic guards, a setting of 0.02 means we will have a sample set of 38 utopic guards and 10 dystopic guards. This limits our total exposure. Proposed value is 0.02.
We should decide if we want to actually use a dynamic percentage here, or just set the threshold to a constant value.
A dynamic percentage might give us better security and reachability as the network evolves, but might also cause unpredictable behaviors if we suddenly get too many guards or too many of them disappear.
I don't have a strong opinion here.
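The sample-set arithmetic in §3 is easy to check (function name hypothetical):

```python
SAMPLE_SET_THRESHOLD = 0.02  # proposed value

def sample_set_sizes(n_utopic, n_dystopic, fraction=SAMPLE_SET_THRESHOLD):
    """Reproduce the example in the text: 1900 utopic and 500 dystopic
    guards at 2% give sample sets of 38 and 10 guards."""
    return int(n_utopic * fraction), int(n_dystopic * fraction)
```
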
INTERNET_LIKELY_DOWN_INTERVAL The number of minutes we spend trying to find an entry guard before we consider the network to be down; once a guard finally succeeds after this interval, we retry the primary guards before using the functioning guard we found. Proposed value is 20.
It seems to me that the value 20 here could get reduced to something like 5 or even less. Of course 5 is also an arbitrary value and to actually find out the "best" number here we should test the algorithm ourselves in various network types.
Let's say we are restricting ourselves to SAMPLE_SET_THRESHOLD guards. If the network is down, we will start cycling through guards and we will have walked through all of them in a matter of minutes. After we have exhausted our guard list once and we did not manage to connect, our network is effectively down. This might give extra reasons to do the "Retry guards if we have exhausted our guard list once" heuristic.
§4. Security properties and behavior under various conditions
Under normal conditions, this algorithm will allow us to quickly connect and use guards we have used before with high likelihood of working. Assuming the first primary guard is reachable and in the consensus, this algorithm will deterministically always return that guard.
Under dystopic conditions (when a firewall is in place that blocks all ports except for potentially port 80 and 443), this algorithm will try to connect to 2% of all guards before switching modes to try dystopic guards. Currently, that means trying to connect to circa 40 guards before getting a successful connection. If we assume a connection try will take a maximum of 10 seconds, that means it will take up to roughly 7 minutes (circa 40 tries at 10 seconds each) to get a working connection.
When the network is completely down, we will try to connect to 2% of all guards plus 2% of all dystopic guards before realizing we are down. This means circa 50 guards tried assuming there are 1900 guards in the network.
In terms of exposure, we will connect to a maximum of 2% of all guards plus 2% of all dystopic guards, or 3% of all guards, whichever is lower. If N is the number of guards, and k is the number of guards an attacker controls, that means an attacker would have a probability of 1-(1-(k/N)^2)^(N * 0.03) to have one of their guards selected before we fall back. In real terms, this means an attacker would need to control over 10% of all guards in order to have a larger than 50% chance of controlling a guard for any given client.
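The exposure formula in §4 can be checked numerically (function name hypothetical):

```python
def prob_attacker_guard(n_guards, k_malicious, exposure=0.03):
    """Probability that at least one attacker guard is selected before we
    fall back, per 4: 1 - (1 - (k/N)^2)^(N * exposure)."""
    f = (k_malicious / n_guards) ** 2
    return 1 - (1 - f) ** (n_guards * exposure)
```

At k/N = 0.10 with N = 1900 this gives roughly 0.44, just under the 50% mark, which is consistent with the claim that an attacker needs somewhat more than 10% of all guards.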
In addition, since the sampled set changes slowly (the suggestion here is that guards in it expire every month) it is not possible for an attacker to force a connection to an entry guard that isn't already in the user's sampled set.
§A. Appendix: An example usage
In order to clarify how this algorithm is supposed to be used, this pseudo code illustrates the building of a circuit:
OPEN_CIRCUIT:
  context = ALGO_CHOOSE_ENTRY_GUARD_START(used_guards, [], 3, false)
  while True:
    entryGuard = ALGO_CHOOSE_ENTRY_GUARD_NEXT(context)
    circuit = composeCircuitAndConnect(entryGuard)
    if not SHOULD_CONTINUE(isSuccessful(circuit)):
      ALGO_CHOOSE_ENTRY_GUARD_END(context, entryGuard)
      return circuit
Great, lots of good comments. I'll respond in depth once the fever has left my brain! =D
On Thu, Mar 03, 2016 at 03:15:26PM +0100, George Kadianakis wrote:
Ola Bini obini@thoughtworks.com writes:
Hi,
Here are some more comments to the latest version of the proposal, as seen here: https://github.com/twstrike/torspec
Filename: 259-guard-selection.txt Title: New Guard Selection Behaviour Author: Isis Lovecruft, George Kadianakis, [Ola Bini] Created: 2015-10-28 Status: Draft
§1. Overview
Tor uses entry guards to prevent an attacker who controls some fraction of the network from observing a fraction of every user's traffic. If users chose their entries and exits uniformly at random from the list of servers every time they build a circuit, then an adversary who had (k/N) of the network would deanonymize F=(k/N)^2 of all circuits... and after a given user had built C circuits, the attacker would see them at least once with probability 1-(1-F)^C. With large C, the attacker would get a sample of every user's traffic with probability 1.
To prevent this from happening, Tor clients choose a small number of guard nodes (currently 3). These guard nodes are the only nodes that the client will connect to directly. If they are not compromised, the user's paths are not compromised.
But attacks remain. Consider an attacker who can run a firewall between a target user and the Tor network, and make many of the guards they don't control appear to be unreachable. Or consider an attacker who can identify a user's guards, and mount denial-of-service attacks on them until the user picks a guard that the attacker controls.
In the presence of these attacks, we can't continue to connect to the Tor network unconditionally. Doing so would eventually result in the user choosing a hostile node as their guard, and losing anonymity.
This proposal outlines a new entry guard selection algorithm, which addresses the following concerns:
- Heuristics and algorithms for determining how and which guard(s) is(/are) chosen should be kept as simple and easy to understand as possible. - Clients in censored regions or who are behind a fascist firewall who connect to the Tor network should not experience any significant disadvantage in terms of reachability or usability. - Tor should make a best attempt at discovering the most appropriate behaviour, with as little user input and configuration as possible.
§2. Design
Alice, an OP attempting to connect to the Tor network, should undertake the following steps to determine information about the local network and to select (some) appropriate entry guards. In the following scenario, it is assumed that Alice has already obtained a recent, valid, and verifiable consensus document.
The algorithm is divided into four components such that the full algorithm is implemented by first invoking START, then repeatedly calling NEXT while adviced it SHOULD_CONTINUE and finally calling END. For an example usage see §A. Appendix.
Several components of NEXT can be invoked asynchronously. SHOULD_CONTINUE is used for the algorithm to be able to tell the caller whether we consider the work done or not - this can be used to retry primary guards when we finally are able to connect to a guard after a long network outage, for example.
This algorithm keeps track of the unreachability status for guards in state private to the algorithm - this is re-initialized every time START is called.
Hmm, didn't we decide to persist the unreachability status over runs, right? Or not?
The argument for doing so is "What happens if I'm a hidden service that needs to make 5 circuits per second, and my first 2 guards happen to be offline? Do I have to spend time probing them everytime I make a new circuit?".
The algorithm expects several arguments to guide its behavior. These will be defined in §2.1.
The goal of this algorithm is to strongly prefer connecting to the same guards we have connected to before, while also trying to detect conditions such as a network outage or a network environment that blocks most ports. The way it does this is by keeping track of how many guards we have exposed ourselves to, and if we have connected to too many we will fall back to only retrying the ones we have already tried. The algorithm also decides on sample set that should be persisted - in order to minimize the risk of an attacker forcing enumeration of the whole network by triggering rebuilding of circuits.
§2.1. The START algorithm
In order to start choosing an entry guard, use the START algorithm. This takes four arguments that can be used to fine tune the workings:
USED_GUARDS This is a list that contains all the guards that have been used before by this client. We will prioritize using guards from this list in order to minimize our exposure. The list is expected to be sorted based on priority, where the first entry will have the highest priority.
SAMPLED_UTOPIC_GUARDS This is a set that contains all guards that should be considered for connection under utopic conditions. This set should be persisted between runs. It will be filled in by the algorithm if it's empty, or if it contains less than SAMPLE_SET_THRESHOLD guards after winnowing out older guards. It should be filled by using NEXT_BY_BANDWIDTH with UTOPIC_GUARDS as an argument.
Should we use UTOPIC_GUARDS or REMAINING_UTOPIC_GUARDS as the argument?
SAMPLED_DYSTOPIC_GUARDS This is a set that contains all guards that should be considered for connection under dystopic conditions. This set should be persisted between runs. It will be filled in by the algorithm if it's empty, or if it contains less than SAMPLE_SET_THRESHOLD guards after winnowing out older guards. It should be filled by using NEXT_BY_BANDWIDTH with DYSTOPIC_GUARDS as an argument.
EXCLUDE_NODES A set of nodes that we should not consider using as a guard.
N_PRIMARY_GUARDS The number of guards we should consider our primary guards. These guards will be retried more frequently and will take precedence in most situations. By default the primary guards will be the first N_PRIMARY_GUARDS guards from USED_GUARDS.
DIR If this argument is set, we should only consider guards that can be directory guards. If not set, we will consider all guards.
The primary work of START is to initialize the state machine depicted in §2.2 The NEXT algorithm. The initial state of the machine is defined by:
GUARDS This is a set of all guards from the consensus, without EXCLUDE_NODES and potentially filtered if DIR is set.
UTOPIC_GUARDS This is a set of all guards to use under utopic conditions. It will primarily be used to fill in SAMPLED_UTOPIC_GUARDS. This set will be initialized to be the same as GUARDS.
DYSTOPIC_GUARDS This is a set of all guards to use under dystopic conditions (usually when we are subject to a firewall that restricts the ports we can connect to). It will primarily be used to fill in SAMPLED_UTOPIC_GUARDS. This set will be initialized to be the
I guess you mean SAMPLED_DYSTOPIC_GUARDS.
subset of GUARDS that listen to ports that are allowed by dystopic conditions.
REMAINING_UTOPIC_GUARDS This is a running set of the utopic guards we have not yet tried to connect to. It should be initialized to be SAMPLED_UTOPIC_GUARDS without USED_GUARDS.
Maybe here we should also mention that we will reinsert guards that we have not tried in a long time (GUARDS_RETRY_TIME) as specified by 2.2.2?
REMAINING_DYSTOPIC_GUARDS This is a running set of the dystopic guards we have not yet tried to connect to. It should be initialized to be SAMPLED_DYSTOPIC_GUARDS without USED_GUARDS.
TRIED_GUARDS This set keeps track of all utopic guards we have tried connecting to. This should be initialized to the empty set.
TRIED_DYSTOPIC_GUARDS This set keeps track of all dystopic guards we have tried connecting to. This should be initialized to the empty set.
STATE A variable that keeps track of which state in the state machine we are currently in. It should be initialized to STATE_PRIMARY_GUARDS.
PRIMARY_GUARDS This list keeps track of our primary guards. These are guards that we will prioritize when trying to connect, and will also retry more often in case of failure with other guards. It should be initialized by calling algorithm NEXT_PRIMARY_GUARD repeatedly until PRIMARY_GUARDS contains N_PRIMARY_GUARDS elements.
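To make the initialization above concrete, here is a minimal Python sketch of the START state setup. Guards are plain identifiers, the DIR filter and the port-based dystopic filtering are elided, and all field names beyond those in the spec are assumptions:

```python
def start(used_guards, sampled_utopic, sampled_dystopic,
          exclude_nodes, n_primary_guards, consensus_guards):
    """Sketch of ALGO_CHOOSE_ENTRY_GUARD_START (guards as plain ids;
    the DIR filter and port-based DYSTOPIC_GUARDS filter are elided)."""
    # GUARDS: everything in the consensus except EXCLUDE_NODES.
    guards = [g for g in consensus_guards if g not in exclude_nodes]
    return {
        "GUARDS": guards,
        "UTOPIC_GUARDS": list(guards),
        "USED_GUARDS": list(used_guards),
        # Sampled guards we have not yet tried, minus the ones already used.
        "REMAINING_UTOPIC_GUARDS": [g for g in sampled_utopic
                                    if g not in used_guards],
        "REMAINING_DYSTOPIC_GUARDS": [g for g in sampled_dystopic
                                      if g not in used_guards],
        "TRIED_GUARDS": set(),
        "TRIED_DYSTOPIC_GUARDS": set(),
        "STATE": "STATE_PRIMARY_GUARDS",
        # NEXT_PRIMARY_GUARD simplified: walk USED_GUARDS in order.
        "PRIMARY_GUARDS": list(used_guards)[:n_primary_guards],
    }
```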
§2.2. The NEXT algorithm
The NEXT algorithm is composed of several different possible flows. The first one is a simple state machine that can transition between four different states. Every time NEXT is invoked, it will resume at the state where it left off previously. In the course of selecting an entry guard, a new consensus can arrive. When that happens we need to update the data structures used, but nothing else should change.
Before jumping into the state machine, we should first check whether it has been at least PRIMARY_GUARDS_RETRY_INTERVAL minutes since we last tried any of the PRIMARY_GUARDS. If this is the case, and we are not in STATE_PRIMARY_GUARDS, we should save the current state and set the state to STATE_PRIMARY_GUARDS.
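The primary-guard retry check described above could be sketched as follows; the context dict and its LAST_PRIMARY_TRY and PREVIOUS_STATE fields are assumptions, and times are in seconds:

```python
PRIMARY_GUARDS_RETRY_INTERVAL = 3 * 60  # seconds; proposed value is 3 minutes

def maybe_retry_primary(ctx, now):
    """If it has been at least PRIMARY_GUARDS_RETRY_INTERVAL since the
    primary guards were last tried, save the current state and jump back
    to STATE_PRIMARY_GUARDS."""
    if (ctx["STATE"] != "STATE_PRIMARY_GUARDS"
            and now - ctx["LAST_PRIMARY_TRY"] >= PRIMARY_GUARDS_RETRY_INTERVAL):
        ctx["PREVIOUS_STATE"] = ctx["STATE"]  # restored when primaries are exhausted
        ctx["STATE"] = "STATE_PRIMARY_GUARDS"
    return ctx["STATE"]
```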
§2.2.1. The STATE_PRIMARY_GUARDS state
Return each entry in PRIMARY_GUARDS in turn. For each entry, if it was not possible to connect to it, mark the entry as unreachable, and add it to TRIED_GUARDS. [XXX defining "was not possible to connect" as "entry is not live" according to current definition of "live entry guard" in tor source code, seems to improve success rate on the flaky network scenario. See: https://github.com/twstrike/tor_guardsim/issues/1#issuecomment-187374942]
Hmm, I'm not sure what this XXX means exactly. I believe we should actually try to _connect_ to those primary guards and not just check if we think they are live.
If all entries have been tried, restore the previous state and go there. If there is no previous state, transition to STATE_TRY_UTOPIC.
§2.2.2. The STATE_TRY_UTOPIC state
In order to give guards that have been marked as unreachable a chance to come back, add all entries in TRIED_GUARDS that were marked as unreachable more than GUARDS_RETRY_TIME minutes ago back to REMAINING_UTOPIC_GUARDS.
I'm a bit puzzled by this mechanism. Maybe its benefits can be explained a bit more clearly?
When we add guards back to REMAINING_UTOPIC_GUARDS, do we also remove them from TRIED_GUARDS?
Also, the value here is currently 20 minutes. So this will trigger only when this algorithm takes over 20 minutes to terminate. I guess this should only happen when the network is down.
Now that we have persistent SAMPLED_UTOPIC_GUARDS is this still useful? Won't we have fully populated our SAMPLED_*_GUARDS structures by the point this rule triggers?
Sorry for the confusion :)
Return each entry in USED_GUARDS that is not in PRIMARY_GUARDS in turn. For each entry, if it was not possible to connect to it, mark the entry as unreachable and add it to TRIED_GUARDS.
Return each entry from REMAINING_UTOPIC_GUARDS using NEXT_BY_BANDWIDTH. For each entry, if it was not possible to connect to it, remove the entry from REMAINING_UTOPIC_GUARDS, mark it as unreachable and add it to TRIED_GUARDS.
If no entries remain in REMAINING_UTOPIC_GUARDS, transition to STATE_TRY_DYSTOPIC.
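A rough sketch of this state as a Python generator. The can_connect predicate stands in for the caller's actual connection attempt (in the real NEXT flow, candidates are returned to the caller one at a time), and NEXT_BY_BANDWIDTH is simplified to list order:

```python
def try_utopic(ctx, can_connect):
    """Sketch of STATE_TRY_UTOPIC: yield connectable candidates in the
    spec's order, marking failures as unreachable (TRIED_GUARDS)."""
    # First, used guards that are not primary guards.
    for g in ctx["USED_GUARDS"]:
        if g in ctx["PRIMARY_GUARDS"]:
            continue
        if can_connect(g):
            yield g
        else:
            ctx["TRIED_GUARDS"].add(g)  # mark as unreachable
    # Then the sampled remaining guards (bandwidth weighting elided).
    for g in list(ctx["REMAINING_UTOPIC_GUARDS"]):
        if can_connect(g):
            yield g
        else:
            ctx["REMAINING_UTOPIC_GUARDS"].remove(g)
            ctx["TRIED_GUARDS"].add(g)
    # Exhausted the utopic candidates: fall through to dystopic mode.
    if not ctx["REMAINING_UTOPIC_GUARDS"]:
        ctx["STATE"] = "STATE_TRY_DYSTOPIC"
```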
§2.2.3. The STATE_TRY_DYSTOPIC state
In order to give guards that have been marked as unreachable a chance to come back, add all entries in TRIED_DYSTOPIC_GUARDS that were marked as unreachable more than GUARDS_RETRY_TIME minutes ago back to REMAINING_DYSTOPIC_GUARDS.
Return each dystopic entry in USED_GUARDS that is not in PRIMARY_GUARDS in turn. For each entry, if it was not possible to connect to it, mark the entry as unreachable and add it to TRIED_DYSTOPIC_GUARDS.
Return each entry from REMAINING_DYSTOPIC_GUARDS using NEXT_BY_BANDWIDTH. For each entry, if it was not possible to connect to it, remove the entry from REMAINING_DYSTOPIC_GUARDS, mark it as unreachable and add it to TRIED_DYSTOPIC_GUARDS.
If no entries remain in REMAINING_DYSTOPIC_GUARDS, transition to STATE_PRIMARY_GUARDS.
§2.2.5. ON_NEW_CONSENSUS
First, ensure that all guard profiles are updated with information about whether they were in the newest consensus or not. If not, the guard is considered bad.
Maybe instead of "If not" we could say "If a guard is not included in the newest consensus" to make it a bit clearer.
If any guard in PRIMARY_GUARDS has become bad, remove it from PRIMARY_GUARDS. Then ensure that PRIMARY_GUARDS contains N_PRIMARY_GUARDS entries by repeatedly calling NEXT_PRIMARY_GUARD.
If any guard in USED_GUARDS has switched from being bad to being non-bad, add it back in the place it should have been in PRIMARY_GUARDS if it had been non-bad when populating PRIMARY_GUARDS. If this results in PRIMARY_GUARDS being larger than N_PRIMARY_GUARDS, truncate the list to be N_PRIMARY_GUARDS entries long. [XXX Does "add it back in the place it should have been in PRIMARY_GUARDS if it had been non-bad" imply keeping the original order?]
If I understand correctly, I think the answer to this XXX is "Ideally, yes.".
I'm curious to see how this mechanism will be implemented because it's important and it would be nice if it's done cleanly.
Also, we should be careful about when we count 'bad' guards. After a few weeks of operation, the USED_GUARDS list can accumulate multiple bad guards, and we should make sure we don't count them when we do our threshold checks.
§2.3. The SHOULD_CONTINUE algorithm
This algorithm takes as an argument a boolean indicating whether the circuit was successfully built or not.
After the caller has tried to build a circuit with a returned guard, they should invoke SHOULD_CONTINUE to find out whether the algorithm is finished or not. SHOULD_CONTINUE will always return true if the circuit failed. If the circuit succeeded, SHOULD_CONTINUE will always return false, unless the guard that succeeded was the first guard to succeed after INTERNET_LIKELY_DOWN_INTERVAL minutes - in that case it will set the state to STATE_PRIMARY_GUARDS and return true.
ACK.
Just a reminder that we also discussed adding the "Retry primary guards if we have looped over the whole guardlist" heuristic somewhere here. Because in many cases the network can go down and then back up in less than a minute.
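A minimal sketch of SHOULD_CONTINUE under these rules; STARTED_AT and HAD_SUCCESS are assumed bookkeeping fields, and times are in seconds:

```python
INTERNET_LIKELY_DOWN_INTERVAL = 20 * 60  # seconds; proposed value is 20 minutes

def should_continue(ctx, circuit_ok, now):
    """Sketch of SHOULD_CONTINUE. STARTED_AT marks when guard selection
    began; HAD_SUCCESS records whether any guard has succeeded yet."""
    if not circuit_ok:
        return True  # failed circuit: keep trying guards
    first_success = not ctx.get("HAD_SUCCESS", False)
    ctx["HAD_SUCCESS"] = True
    if first_success and now - ctx["STARTED_AT"] >= INTERNET_LIKELY_DOWN_INTERVAL:
        # First success after a likely outage: go back and retry the primaries.
        ctx["STATE"] = "STATE_PRIMARY_GUARDS"
        return True
    return False  # success: the algorithm is done
```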
§2.4. The END algorithm
The goal of this algorithm is simply to make sure that we keep track of successful connections made. This algorithm should be invoked with the guard that was used to correctly set up a circuit.
Once invoked, this algorithm will mark the guard as used, and make sure it is in USED_GUARDS. [XXX if the guard is not in USED_GUARDS, should it be added *first*? Important because it will affect the building of PRIMARY_GUARDS.]
IIUC, if the guard is not in USED_GUARDS it should be added *last* (that is, with lowest priority).
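Assuming unknown guards are indeed appended last, as the comment above suggests, END reduces to something like:

```python
def end(ctx, guard):
    """Sketch of ALGO_CHOOSE_ENTRY_GUARD_END: record the guard that
    successfully built a circuit, appending unknown guards *last*
    (lowest priority) to USED_GUARDS."""
    if guard not in ctx["USED_GUARDS"]:
        ctx["USED_GUARDS"].append(guard)
```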
§2.5. Helper algorithms
These algorithms are used in the above algorithms, but have been separated out here in order to make the flow clearer.
NEXT_PRIMARY_GUARD - Return the first entry from USED_GUARDS that is not in PRIMARY_GUARDS and that is in the most recent consensus. - If USED_GUARDS is empty, use NEXT_BY_BANDWIDTH with REMAINING_UTOPIC_GUARDS as the argument.
NEXT_BY_BANDWIDTH - Takes G as an argument, which should be a set of guards to choose from. - Return a randomly selected element from G, weighted by bandwidth.
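A sketch of NEXT_BY_BANDWIDTH using Python's standard library; the bandwidths mapping (guard -> weight) is an assumption about how the weights are supplied:

```python
import random

def next_by_bandwidth(guards, bandwidths, rng=random):
    """Return one guard chosen at random, weighted by its bandwidth."""
    weights = [bandwidths[g] for g in guards]
    return rng.choices(guards, weights=weights, k=1)[0]
```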
§3. Consensus Parameters & Configurable Variables
This proposal introduces several new parameters that ideally should be set in the consensus but that should also be possible to set or override in the client configuration file. Some of these have proposed values, but for others more simulation and trials are needed.
PRIMARY_GUARDS_RETRY_INTERVAL In order to make it more likely we connect to a primary guard, we would like to retry the primary guards more often than other types of guards. This parameter controls how many minutes should pass before we consider retrying primary guards again. The proposed value is 3.
GUARDS_RETRY_TIME In the normal course of selecting guards, we want to continue retrying unreachable guards in case they have become reachable. This parameter controls the minimum number of minutes before we should start to consider guards for connecting again. Proposed value is 20.
SAMPLE_SET_THRESHOLD In order to allow us to recognize dystopic situations or a completely unreachable network, we would like to avoid connecting to too many guards before switching modes. We also want to avoid exposing ourselves to too many nodes in a potentially hostile situation. This parameter, expressed as a fraction, determines the size of the sampled sets that contain the only guards we will consider connecting to. It will be used as a fraction for both the utopic and the dystopic sampled set. If we assume there are 1900 utopic guards, of which 500 are dystopic guards, a setting of 0.02 means we will have a sample set of 38 utopic guards and 10 dystopic guards. This limits our total exposure. Proposed value is 0.02.
We should decide if we want to actually use a dynamic percentage here, or just set the threshold to a constant value.
A dynamic percentage might give us better security and reachability as the network evolves, but might also cause unpredictable behavior if we suddenly get too many guards or too many of them disappear.
I don't have a strong opinion here.
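For reference, the worked numbers from the SAMPLE_SET_THRESHOLD paragraph can be reproduced with a tiny helper:

```python
SAMPLE_SET_THRESHOLD = 0.02

def sample_set_sizes(n_utopic, n_dystopic, threshold=SAMPLE_SET_THRESHOLD):
    """Sampled-set sizes as a fixed fraction of each guard population,
    matching the 1900/500 example in the text."""
    return round(n_utopic * threshold), round(n_dystopic * threshold)
```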
INTERNET_LIKELY_DOWN_INTERVAL The number of minutes since we started trying to find an entry guard after which we should consider the network down, and retry the primary guards before using a functioning guard we have found. Proposed value 20.
It seems to me that the value 20 here could get reduced to something like 5 or even less. Of course 5 is also an arbitrary value and to actually find out the "best" number here we should test the algorithm ourselves in various network types.
Let's say we are restricting ourselves to SAMPLE_SET_THRESHOLD guards. If the network is down, we will start cycling through guards and we will have walked through all of them in a matter of minutes. After we have exhausted our guard list once and we did not manage to connect, our network is effectively down. This might give extra reasons to do the "Retry guards if we have exhausted our guard list once" heuristic.
§4. Security properties and behavior under various conditions
Under normal conditions, this algorithm will allow us to quickly connect and use guards we have used before with high likelihood of working. Assuming the first primary guard is reachable and in the consensus, this algorithm will deterministically always return that guard.
Under dystopic conditions (when a firewall is in place that blocks all ports except for potentially port 80 and 443), this algorithm will try to connect to 2% of all guards before switching modes to try dystopic guards. Currently, that means trying to connect to circa 40 guards before getting a successful connection. If we assume a connection attempt takes at most 10 seconds, that means it will take up to 6 minutes to get a working connection.
When the network is completely down, we will try to connect to 2% of all guards plus 2% of all dystopic guards before realizing we are down. This means circa 50 guards tried assuming there are 1900 guards in the network.
In terms of exposure, we will connect to a maximum of 2% of all guards plus 2% of all dystopic guards, or 3% of all guards, whichever is lower. If N is the number of guards, and k is the number of guards an attacker controls, that means an attacker would have a probability of 1-(1-(k/N)^2)^(N * 0.03) to have one of their guards selected before we fall back. In real terms, this means an attacker would need to control over 10% of all guards in order to have a larger than 50% chance of controlling a guard for any given client.
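The exposure claim can be checked numerically. This sketch evaluates the expression exactly as written in the text; at exactly 10% compromised guards the probability is still below 50%, crossing it at around 11%, which is consistent with "over 10%":

```python
def attacker_success_probability(n_guards, k_compromised, exposure=0.03):
    """Evaluate 1 - (1 - (k/N)^2)^(N * exposure) from the text, the
    probability that an attacker controlling k of N guards gets one of
    theirs selected before we fall back."""
    p = k_compromised / n_guards
    return 1 - (1 - p ** 2) ** (n_guards * exposure)
```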
In addition, since the sampled set changes slowly (the suggestion here is that guards in it expire every month), it is not possible for an attacker to force a connection to an entry guard that isn't already in the user's sampled set.
§A. Appendix: An example usage
In order to clarify how this algorithm is supposed to be used, this pseudo code illustrates the building of a circuit:
OPEN_CIRCUIT:
    context = ALGO_CHOOSE_ENTRY_GUARD_START(used_guards, [], 3, false)
    while True:
        entryGuard = ALGO_CHOOSE_ENTRY_GUARD_NEXT(context)
        circuit = composeCircuitAndConnect(entryGuard)
        if not SHOULD_CONTINUE(isSuccessful(circuit)):
            ALGO_CHOOSE_ENTRY_GUARD_END(context, entryGuard)
            return circuit
Hey,
This algorithm keeps track of the unreachability status for guards in state private to the algorithm - this is re-initialized every time START is called.
Hmm, didn't we decide to persist the unreachability status across runs? Or not?
Yeah, I think we did decide to persist it between runs, but not more permanently. I've changed it now.
SAMPLED_UTOPIC_GUARDS This is a set that contains all guards that should be considered for connection under utopic conditions. This set should be persisted between runs. It will be filled in by the algorithm if it's empty, or if it contains less than SAMPLE_SET_THRESHOLD guards after winnowing out older guards. It should be filled by using NEXT_BY_BANDWIDTH with UTOPIC_GUARDS as an argument.
Should we use UTOPIC_GUARDS or REMAINING_UTOPIC_GUARDS as the argument?
It should be UTOPIC_GUARDS, since REMAINING_UTOPIC_GUARDS will always be a subset of SAMPLED_UTOPIC_GUARDS.
I guess you mean SAMPLED_DYSTOPIC_GUARDS.
Yep, thanks. Fixed.
REMAINING_UTOPIC_GUARDS This is a running set of the utopic guards we have not yet tried to connect to. It should be initialized to be SAMPLED_UTOPIC_GUARDS without USED_GUARDS.
Maybe here we should also mention that we will reinsert guards that we have not tried in a long time (GUARDS_RETRY_TIME) as specified by 2.2.2?
Yep, good clarification. I've added that.
[XXX defining "was not possible to connect" as "entry is not live" according to current definition of "live entry guard" in tor source code, seems to improve success rate on the flaky network scenario. See: https://github.com/twstrike/tor_guardsim/issues/1#issuecomment-187374942]
Hmm, I'm not sure what this XXX means exactly. I believe we should actually try to _connect_ to those primary guards and not just check if we think they are live.
Yeah, I don't know where it comes from either - @rjunior, care to expand on it?
§2.2.2. The STATE_TRY_UTOPIC state
In order to give guards that have been marked as unreachable a chance to come back, add all entries in TRIED_GUARDS that were marked as unreachable more than GUARDS_RETRY_TIME minutes ago back to REMAINING_UTOPIC_GUARDS.
I'm a bit puzzled by this mechanism. Maybe its benefits can be explained a bit more clearly?
When we add guards back to REMAINING_UTOPIC_GUARDS, do we also remove them from TRIED_GUARDS?
Well, TRIED_GUARDS doesn't really do much at the moment. In fact, it might be easier to just remove it. I've done that and it simplifies things as well.
Now that we have persistent SAMPLED_UTOPIC_GUARDS is this still useful? Won't we have fully populated our SAMPLED_*_GUARDS structures by the point this rule triggers?
Agree, I've removed it. Much nicer and neater now! =D
§2.2.5. ON_NEW_CONSENSUS
First, ensure that all guard profiles are updated with information about whether they were in the newest consensus or not. If not, the guard is considered bad.
Maybe instead of "If not" we could say "If a guard is not included in the newest consensus" to make it a bit clearer.
Good clarification, done.
[XXX Does "add it back in the place it should have been in PRIMARY_GUARDS if it had been non-bad" imply keeping the original order?]
If I understand correctly, I think the answer to this XXX is "Ideally, yes.".
Yes, that is definitely the answer.
I'm curious to see how this mechanism will be implemented because it's important and it would be nice if it's done cleanly.
I can see a few different ways to do it easily. One of them would be to just rerun the original primary guard selection algorithm until we find the guard we want to insert.
Also, we should be careful about when we count 'bad' guards. After a few weeks of operation, the USED_GUARDS list can accumulate multiple bad guards, and we should make sure we don't count them when we do our threshold checks.
Absolutely.
Just a reminder that we also discussed adding the "Retry primary guards if we have looped over the whole guardlist" heuristic somewhere here. Because in many cases the network can go down and then back up in less than a minute.
Actually, that retry heuristic is there. Or maybe I misunderstand the point.
IIUC, if the guard is not in USED_GUARDS it should be added *last* (that is, with lowest priority).
Yep, added that.
We should decide if we want to actually use a dynamic percentage here, or just set the threshold to a constant value.
A dynamic percentage might give us better security and reachability as the network evolves, but might also cause unpredictable behavior if we suddenly get too many guards or too many of them disappear.
I don't have a strong opinion here.
Me neither. I think a percentage is a good starting point - it feels easier to tweak in different ways.
It seems to me that the value 20 here could get reduced to something like 5 or even less. Of course 5 is also an arbitrary value and to actually find out the "best" number here we should test the algorithm ourselves in various network types.
Arbitrarily changed to 5. =)
Cheers