[tor-bugs] #28424 [Core Tor/Tor]: Refactor hs_service_callback() to no longer need to run once per second?

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Nov 27 18:24:02 UTC 2018


#28424: Refactor hs_service_callback() to no longer need to run once per second?
--------------------------+------------------------------------
 Reporter:  nickm         |          Owner:  (none)
     Type:  defect        |         Status:  new
 Priority:  Medium        |      Milestone:  Tor: 0.4.0.x-final
Component:  Core Tor/Tor  |        Version:
 Severity:  Normal        |     Resolution:
 Keywords:                |  Actual Points:
Parent ID:                |         Points:
 Reviewer:                |        Sponsor:  Sponsor8-can
--------------------------+------------------------------------

Comment (by akwizgran):

 If I'm reading the spec right, there are `hsdir_n_replicas = 2` replicas.
 For each replica, the HS uploads the descriptor to `hsdir_spread_store =
 4` HSDirs at consecutive positions in the hashring. Each client tries to
 fetch the descriptor from one of the first `hsdir_spread_fetch = 3`
 positions, chosen at random.
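
 To make the upload/fetch rule concrete, here is a small sketch in Python.
 It is only an illustration of the consecutive-position rule described
 above, not Tor's actual implementation: the function names, the use of
 floats in [0, 1) as ring positions, and the `index` parameter (standing
 in for the replica's hashed index on the ring) are all my inventions.

```python
import bisect
import random

HSDIR_SPREAD_STORE = 4   # HSDirs per replica that receive the descriptor
HSDIR_SPREAD_FETCH = 3   # positions a client chooses from, at random

def upload_set(ring, index):
    """HSDirs (by ring position) that receive one replica's descriptor:
    the hsdir_spread_store HSDirs at consecutive positions at or after
    `index`, wrapping around the ring.  `ring` is a sorted list."""
    start = bisect.bisect_left(ring, index)
    return [ring[(start + i) % len(ring)] for i in range(HSDIR_SPREAD_STORE)]

def client_choice(ring, index, rng):
    """A client fetches from one of the first hsdir_spread_fetch
    positions, chosen uniformly at random."""
    candidates = upload_set(ring, index)[:HSDIR_SPREAD_FETCH]
    return rng.choice(candidates)
```

 A lookup on this replica can only fail if the client's random choice
 lands on a position whose HSDir no longer holds the descriptor.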

 A lookup fails when, for both replicas, the position chosen by the client
 is occupied by an HSDir that didn't receive the descriptor. So failure
 becomes possible as soon as, for both replicas, ''any'' of the first 3
 positions is occupied by an HSDir that didn't receive the descriptor.
 Churn can bring this about in two ways: by removing HSDirs that received
 the descriptor, and by adding new HSDirs that push the HSDirs that
 received the descriptor out of the first 3 positions.

 How long do we expect it to take before churn makes a lookup failure
 possible? We could measure this with historical consensus data, but let's
 try a quick simulation first.

 Figure 9 of
 [https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_winter.pdf
 this paper] shows the fraction of relays with the HSDir flag that join or
 leave between consecutive consensuses. I'd estimate 0.01 by eye, so let's
 conservatively call it 0.02. The churn rate counts both joins and leaves,
 so a churn rate of 0.02 means each HSDir from the previous consensus has
 left with probability 0.01, and new HSDirs have joined at the same rate.

 There are
 [https://metrics.torproject.org/relayflags.html?start=2018-08-29&end=2018-11-27&flag=HSDir
 about 3,000] relays with the HSDir flag.

 My code (attached) simulates each replica by creating 3,000 HSDirs, each
 at a random position on the hashring, and remembering the first 4 HSDirs
 on the hashring - these are the ones that receive copies of the
 descriptor. Churn is simulated an hour at a time. In each hour, each HSDir
 is removed with probability 0.01 and replaced with a new HSDir at a random
 position. Then the code checks whether the first 3 HSDirs on the hashring
 are all ones that received copies of the descriptor. If not, a lookup on
 this replica could fail.

 For simplicity I've simulated the two replicas independently - in reality
 they'd be based on different permutations of the same HSDirs, but
 independence seems like a reasonable approximation. The simulation runs
 until lookups on both replicas could fail.
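
 Since the attachment may not survive the archive, here is a
 self-contained sketch of the simulation as described above. The
 structure and names are my reconstruction, not the attached code: each
 HSDir is a uniform position on [0, 1), the first 4 receive the
 descriptor, and both replicas are churned in lockstep until a lookup on
 each could fail in the same hour.

```python
import random

N_HSDIRS = 3000    # relays with the HSDir flag (per Tor Metrics)
SPREAD_STORE = 4   # hsdir_spread_store: HSDirs that receive the descriptor
SPREAD_FETCH = 3   # hsdir_spread_fetch: positions clients choose from
LEAVE_PROB = 0.01  # hourly leave probability (half the 0.02 churn rate)

def new_replica(rng):
    """One replica: a sorted hashring of HSDir positions, plus the set
    of the first SPREAD_STORE HSDirs, which receive the descriptor."""
    ring = sorted(rng.random() for _ in range(N_HSDIRS))
    return ring, set(ring[:SPREAD_STORE])

def churn(ring, rng):
    """One hour of churn: each HSDir leaves with probability LEAVE_PROB
    and is replaced by a new HSDir at a uniformly random position."""
    return sorted(rng.random() if rng.random() < LEAVE_PROB else p
                  for p in ring)

def can_fail(ring, holders):
    """A lookup on this replica can fail if any of the first
    SPREAD_FETCH HSDirs on the ring did not receive the descriptor."""
    return any(p not in holders for p in ring[:SPREAD_FETCH])

def hours_until_failure_possible(rng):
    """Hours of churn until lookups on both replicas could fail in the
    same hour (churn can also repair a replica, so we re-check both)."""
    replicas = [new_replica(rng) for _ in range(2)]
    hours = 0
    while True:
        hours += 1
        replicas = [(churn(ring, rng), holders)
                    for ring, holders in replicas]
        if all(can_fail(ring, holders) for ring, holders in replicas):
            return hours

rng = random.Random(0)
runs = [hours_until_failure_possible(rng) for _ in range(50)]
# The ticket reports a mean of about 37 hours over 10,000 runs; a small
# sample like this will be noisy around that figure.
print(sum(runs) / len(runs))
```

 Note that surviving HSDirs keep their exact positions, so membership in
 the `holders` set identifies them across hours.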

 The mean time until both replicas could fail is 37 hours, averaged over
 10,000 runs.

 If this is roughly accurate then we should be able to keep the HS
 reachable by waking Tor from its dormant state every few hours to fetch a
 fresh consensus and upload new copies of the descriptor if necessary.

 Perhaps I should extend the simulation to consider the probability of
 lookup failure as a function of time, rather than the mean time until
 failure becomes possible.
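
 Such an extension might look like the following sketch, which reuses the
 same churn model but records, for each hour, the fraction of trials in
 which a lookup could fail at that hour. Again the names and parameter
 defaults are mine, chosen to match the figures above.

```python
import random

def failable_by_hour(max_hours, trials, n_hsdirs=3000, store=4, fetch=3,
                     leave_prob=0.01, seed=0):
    """For each hour t < max_hours, the fraction of trials in which a
    lookup could fail at hour t: both replicas have a non-holder among
    the first `fetch` positions on their hashring."""
    rng = random.Random(seed)
    counts = [0] * max_hours
    for _ in range(trials):
        replicas = []
        for _ in range(2):
            ring = sorted(rng.random() for _ in range(n_hsdirs))
            replicas.append((ring, set(ring[:store])))
        for t in range(max_hours):
            # One hour of churn on each replica.
            replicas = [(sorted(rng.random() if rng.random() < leave_prob
                                else p for p in ring), holders)
                        for ring, holders in replicas]
            if all(any(p not in holders for p in ring[:fetch])
                   for ring, holders in replicas):
                counts[t] += 1
    return [c / trials for c in counts]

# Small demo run; larger trial counts would smooth the curve.
print(failable_by_hour(6, 20))
```

 Plotting this curve against the dormant-wakeup interval would show
 directly how often Tor needs to wake up to keep the failure probability
 acceptably low.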

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/28424#comment:4>