[tor-bugs] #23817 [Core Tor/Tor]: Tor re-tries directory mirrors that it knows are missing microdescriptors

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Nov 1 16:54:46 UTC 2017


#23817: Tor re-tries directory mirrors that it knows are missing microdescriptors
-------------------------------------------------+-------------------------
 Reporter:  teor                                 |          Owner:  (none)
     Type:  defect                               |         Status:  new
 Priority:  Medium                               |      Milestone:  Tor:
                                                 |  0.3.3.x-final
Component:  Core Tor/Tor                         |        Version:
 Severity:  Normal                               |     Resolution:
 Keywords:  tor-guard, tor-hs, prop224,          |  Actual Points:
  032-backport? 031-backport?                    |
Parent ID:  #21969                               |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------

Comment (by asn):

 Replying to [comment:10 teor]:
 > Replying to [comment:9 asn]:
 > > Replying to [comment:7 teor]:
 > > > Replying to [comment:6 asn]:
 > > > > Here is an implementation plan of the failure cache idea from
 comment:4 .
 > > > >
 > > > > First of all, the interface of the failure cache:
 > > > >
 > > > >   We introduce a `digest256map_t *md_fetch_fail_cache` which maps
 the 256-bit md hash to a smartlist of dirguards through which we failed to
 fetch the md.
 > > > >
 > > > > Now the code logic:
 > > > >
 > > > > 1) We populate `md_fetch_fail_cache` with dirguards in
 `dir_microdesc_download_failed()`.  We remove them from the failure cache
 in `microdescs_add_to_cache()` when we successfully add an md to the cache.
 > > >
 > > > Successfully add *that* md to the cache?
 > > > Or any md from that dirguard?
 > > >
 > >
 > > I meant, we remove *that* md from the `md_fetch_fail_cache` if we
 manage to fetch *that* md from any dir.
 > >
 > > > I think this is ok, as long as we ask for mds in large enough
 batches.
 > > >
 > > > > 2) We add another `entry_guard_restriction_t` restriction type in
 `guards_choose_dirguard()`. We currently have one restriction type which
 is designed to restrict guard nodes based on the exit node choice and its
 family. We want another type which uses a smartlist and restricts
 dirguards based on whether we have failed to fetch an md from that
 dirguard. We are gonna use this in step 3.
 > > > >
 > > > > 3) In `directory_get_from_dirserver()` we query the md failure
 cache and pass any results to `directory_pick_generic_dirserver()` and
 then to `guards_choose_dirguard()` which uses the new restriction type to
 block previously failed dirguards from being selected.
 > > >
 > > > Do we block dirguards that have failed to deliver an md from
 downloads of that md?
 > > > Or do we block dirguards that have failed to deliver any mds from
 downloads of any md?
 > > >
 > >
 > > Yes, that's a good question that I forgot to address in this proposal.
 > >
 > > I guess my design above was suggesting that we block dirguards "that
 have failed to deliver any mds from downloads of any md", until those mds
 get fetched from another dirserver and get removed from the failure cache.
 >
 > I think this is the behaviour we want: trying each dir server for each
 specific md will mean that we time out, because there are more mds than
 there are dir servers.
 >
 > There are two cases when this will cause us to fall back to the
 authorities:
 > * we download a new consensus from the authorities, and they are the
 only ones with some of the mds in it
 > * for some reason, an md referenced in the consensus is not mirrored
 correctly by relays or authorities (this would be a serious bug in tor)
 >
 > To avoid this happening when it isn't necessary, we should expire
 failure cache entries after a random time. Maybe it should be the time
 when we expect dir servers to fetch a new consensus and new mds. I think
 this is 1-2 hours, but check dir-spec for the exact details. Or we could
 expire md failure caches each time we get a new consensus. That would be
 easier.

 ACK. Based on our discussion above, I think the implementation can be
 simplified since we decided to go with the "block dirguards that have
 failed to deliver any mds from downloads of any md" approach:

 Hence, instead of keeping track of which mds we failed to download from
 which relays, we can just keep track of the relays that have outdated md
 information, without caring about which mds they failed to serve us. This
 simplifies things a lot from an engineering perspective, since we don't
 need to map mds to relays. We can just keep an `outdated_dirserver_list`
 smartlist of outdated dirservers and skip those when we try to fetch
 mds.
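
 To make that concrete, here is a rough sketch in C of what the flat list
 and its helpers could look like. The static list and the two helper
 functions are hypothetical names for illustration (only `smartlist_t`,
 `smartlist_new()`, `smartlist_add()`, `smartlist_contains_digest()` and
 `tor_memdup()` are existing Tor APIs); the actual branch may well
 structure this differently:

 {{{
 /* Hypothetical sketch: a flat list of identity digests of dirservers
  * that recently failed to serve us mds we asked for. */
 static smartlist_t *outdated_dirserver_list = NULL;

 /* Remember that the dirserver with identity <b>digest</b> (DIGEST_LEN
  * bytes) failed to serve us some mds; per the plan above, this would be
  * called from dir_microdesc_download_failed(). */
 static void
 note_outdated_dirserver(const char *digest)
 {
   if (!outdated_dirserver_list)
     outdated_dirserver_list = smartlist_new();
   if (!smartlist_contains_digest(outdated_dirserver_list, digest))
     smartlist_add(outdated_dirserver_list, tor_memdup(digest, DIGEST_LEN));
 }

 /* Return true iff <b>digest</b> is marked as outdated, so that the new
  * restriction type used by guards_choose_dirguard() can skip it. */
 static int
 dirserver_is_outdated(const char *digest)
 {
   return outdated_dirserver_list &&
     smartlist_contains_digest(outdated_dirserver_list, digest);
 }
 }}}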

 This is also consistent with my empirical experience: most dirservers
 will usually provide us with all the mds we asked them for, except for a
 few dirservers that would refuse to give us 10 mds or so for a while. So
 we can just ban those dirservers for a while using the above logic.

 And I agree with you: we should wipe the `outdated_dirserver_list` every
 time we fetch a new consensus. Should we also clear it at any other
 point? Maybe we should not let it grow to a huge size (cap it?).
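
 For instance, a minimal sketch of the reset behaviour, assuming a
 hypothetical cap value and assuming we hook the wipe into wherever we
 install a fresh consensus:

 {{{
 /* Hypothetical cap on how many outdated dirservers we remember. */
 #define MAX_OUTDATED_DIRSERVERS 30

 /* Forget all the outdated dirservers we have recorded; this could be
  * called every time we install a new consensus, per the discussion
  * above. */
 static void
 clear_outdated_dirserver_list(void)
 {
   if (!outdated_dirserver_list)
     return;
   SMARTLIST_FOREACH(outdated_dirserver_list, char *, d, tor_free(d));
   smartlist_clear(outdated_dirserver_list);
 }
 }}}

 With a cap in place, note_outdated_dirserver() from the sketch above
 would simply stop adding entries (or wipe the list) once smartlist_len()
 reaches MAX_OUTDATED_DIRSERVERS.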

 I started implementing the original plan today, before I realized the
 above simplification, so I will have to change some code over to the new
 design. I estimate that I will have an initial branch here, including
 unit tests, early next week.

 Onwards!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23817#comment:11>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

