[tor-bugs] #28868 [Core Tor/sbws]: best_priority() can starve the worker threads of good relays

Sun Jan 13 02:27:24 UTC 2019

#28868: best_priority() can starve the worker threads of good relays
---------------------------+-----------------------------------
 Reporter:  teor           |          Owner:  juga
     Type:  defect         |         Status:  needs_review
 Priority:  Medium         |      Milestone:  sbws: 1.0.x-final
Component:  Core Tor/sbws  |        Version:  sbws: 1.0.2
 Severity:  Normal         |     Resolution:
 Keywords:                 |  Actual Points:
Parent ID:  #28663         |         Points:
 Reviewer:                 |        Sponsor:
---------------------------+-----------------------------------

Comment (by teor):

 Replying to [comment:3 juga]:
 > Replying to [ticket:28868 teor]:
 > > `best_priority()` tries to measure unmeasured and failing relays
 first.
 > >
 > > But if `fraction_relays` or `min_relays` always fail, those relays
 will always end up first in the priority queue. (More precisely, those
 relays will end up first in the priority queue, until the results of the
 good relays ~~time out~~ are discarded for being too old.)
 > >
 > > Thinking about starvation is complicated, because of the
 `freshness_reduction_factor` on some errors.
 > >
 > > Here's a very simple algorithm that avoids starving good relays for
 failed relays:
 > > 1. Count the number of times that sbws has attempted to get a result
 from each relay.
 >
 > This is already done when writing the results: ResultError and
 ResultSuccess keep that.

 But some failures do not write a ResultError.

 > > 2. Test the relays with the lowest number of attempts first. (Don't
 check if the attempt succeeded or failed.)
 >
 > This's what i was proposing by commenting the part where it prioritizes
 ResultError measurements.

 I don't understand what you mean here.
 Can you link to the comment, or quote it?

 > >
 > > For this priority rule to work, every time a relay is queued, it must
 get a result. Here's how we can make that happen"
 > > * Modify `result_putter_error()` to store an error result to the
 queue.
 >
 > result_putter already writes ResultError.

 But result_putter_error() is called when there is an exception in
 apply_async(), and it does not write ResultError.

 > Here there're two other bugs, result_putter_error, only happens when:
 > 1. The relay being measured, doesn't have a descriptor (#28870)
 > 2. The operator hits KeyboardInterrupt (#28869)
 >
 > AFAIK, there're no other cases where the error callback is called.

 The code is complicated, so it could throw other exceptions that you
 haven't seen yet. Future code changes could also add more exceptions.

 > > * Make sure timeouts store an error result to the queue.
 > > * Add a unit test and integration test that makes sure every queued
 relay has a result.
 >
 > Testing this is hard, but i'll see.

 Replying to [comment:5 juga]:
 > > Add a unit test and integration test that makes sure every queued
 relay has a result.
 > Maybe this could be done as part of #28566 instead?.
 >
 > #28933 runs the actual scanner. It is not counting that all the relays
 get measured, though in the test network this is the case.
 >
 > Created PR without the tests:
 https://github.com/torproject/sbws/pull/328

 You'll also need to update the documentation:
 https://github.com/torproject/sbws/blob/master/docs/source/specification.rst
 #simple-relay-prioritization

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/28868#comment:6>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online