[tor-bugs] #13718 [Tor]: Reachability Tests aren't conducted if there are no exit nodes

Mon Nov 10 04:03:29 UTC 2014

#13718: Reachability Tests aren't conducted if there are no exit nodes
--------------------+---------------------
 Reporter:  tom     |          Owner:
     Type:  defect  |         Status:  new
 Priority:  normal  |      Milestone:
Component:  Tor     |        Version:
 Keywords:          |  Actual Points:
Parent ID:          |         Points:
--------------------+---------------------
 Context:
 * https://lists.torproject.org/pipermail/tor-dev/2014-October/007613.html
 * https://lists.torproject.org/pipermail/tor-dev/2014-October/007654.html

 On 22 October 2014 05:48, Roger Dingledine <arma at mit.edu> wrote:
 >> What I had to do was make one of my Directory Authorities an exit -
 >> this let the other nodes start building circuits through the
 >> authorities and upload descriptors.
 >
 > This part seems surprising to me -- directory authorities always publish
 > their dirport whether they've found it reachable or not, and relays
 > publish their descriptors directly to the dirport of each directory
 > authority (not through the Tor network).
 >
 > So maybe there's a bug that you aren't describing, or maybe you are
 > misunderstanding what you saw?
 >
 > See also https://trac.torproject.org/projects/tor/ticket/11973
 >
 >> Another problem I ran into was that nodes couldn't conduct
 >> reachability tests when I had exits that were only using the Reduced
 >> Exit Policy - because it doesn't list the ORPort/DirPort!  (I was
 >> using nonstandard ports actually, but indeed the reduced exit policy
 >> does not include 9001 or 9030.)  Looking at the current consensus,
 >> there are 40 exits that exit to all ports, and 400-something exits
 >> that use the ReducedExitPolicy.  It seems like 9001 and 9030 should
 >> probably be added to that for reachability tests?
 >
 > The reachability tests for the ORPort involve extending the circuit to
 > the ORPort -- which doesn't use an exit stream. So your relays should
 > have been able to find themselves reachable, and published a descriptor,
 > even with no exit relays in the network.

 Okay, so the behavior I saw, and reproduced, is that reachability tests
 didn't succeed (and therefore descriptors weren't uploaded) when there
 were no exits.  I think I may have figured out why, but there are some
 internals I haven't completely figured out.  I'm going to lay out what I
 think and then the parts I'm not completely sure about.

 First off, you're (obviously) correct that I misunderstood: extending the
 circuit does not require an Exit stream.  But I still think the lack of
 Exits stopped the reachability tests from succeeding.

 == too long; didn't read ==

 I think reachability tests never happen when there are no Exit nodes,
 because of a quirk in the bootstrapping process: we never conclude that we
 have the minimum directory information.

 == target function: consider_testing_reachability ==

 A reachability test is conducted from `consider_testing_reachability` (I
 think it's conducted only from here, although there may be other
 situations where it could happen).  `consider_testing_reachability` is
 called from `circuit_send_next_onion_skin`, `circuit_testing_opened`,
 `run_scheduled_events`, and `directory_info_has_arrived`.

 == call site #1: directory_info_has_arrived ==

 This is called very frequently on router startup.  But
 `consider_testing_reachability` will not be called if
 `router_have_minimum_dir_info` returns false:
 {{{
 void directory_info_has_arrived(time_t now, int from_cache)
 { //...
   if (!router_have_minimum_dir_info()) {
     //...
     return;
   } else { /* ... */ }

   if (server_mode(options) && !net_is_disabled() && !from_cache &&
       (can_complete_circuit || !any_predicted_circuits(now)))
     consider_testing_reachability(1, 1);
 }
 }}}

 `router_have_minimum_dir_info` returns the static variable
 `have_min_dir_info`.  This variable is only set to 1 in
 `update_router_have_minimum_dir_info` and then only if there are Exits!
 Specifically, we will trigger `paths <
 get_frac_paths_needed_for_circs(options,consensus)` because we have 0% of
 the Exit Bandwidth, as shown by this error message:
 {{{
 Nov 09 22:10:26.000 [notice] I learned some more directory information,
 but not enough to build a circuit: We need more descriptors: we have 5/5,
 and can only build 0% of likely paths. (We have 100% of guards bw, 100% of
 midpoint bw, and 0% of exit bw.)
 }}}

 {{{
 update_router_have_minimum_dir_info(void)
 {       //...
     char *status = NULL;
     int num_present=0, num_usable=0;
     double paths = compute_frac_paths_available(consensus, options, now,
                                                 &num_present, &num_usable,
                                                 &status);

     if (paths < get_frac_paths_needed_for_circs(options,consensus)) {
       tor_snprintf(dir_info_status, sizeof(dir_info_status),
                    "We need more %sdescriptors: we have %d/%d, and "
                    "can only build %d%% of likely paths. (We have %s.)",
                    using_md?"micro":"", num_present, num_usable,
                    (int)(paths*100), status);
       //...
       res = 0;
       goto done;
     }
    res = 1;
   }

  done:
   if (res && !have_min_dir_info) { /* ... */ }
   if (!res && have_min_dir_info) {
     int quiet = directory_too_idle_to_fetch_descriptors(options, now);
     tor_log(quiet ? LOG_INFO : LOG_NOTICE, LD_DIR,
         "Our directory information is no longer up-to-date "
         "enough to build circuits: %s", dir_info_status);

     /* a) make us log when we next complete a circuit, so we know when Tor
      * is back up and usable, and b) disable some activities that Tor
      * should only do while circuits are working, like reachability tests
      * and fetching bridge descriptors only over circuits. */
     can_complete_circuit = 0;

     control_event_client_status(LOG_NOTICE, "NOT_ENOUGH_DIR_INFO");
   }
   have_min_dir_info = res;
 }
 }}}

 (The exact source line is in `frac_nodes_with_descriptors`, called by
 `compute_frac_paths_available`:)

 {{{
 /** For all nodes in <b>sl</b>, return the fraction of those nodes, weighted
  * by their weighted bandwidths with rule <b>rule</b>, for which we have
  * descriptors. */
 double
 frac_nodes_with_descriptors(const smartlist_t *sl,
                             bandwidth_weight_rule_t rule)
 {
   //...
   if (smartlist_len(sl) == 0)
     return 0.0;
 }}}

 This prevents a reachability test from being started from
 `directory_info_has_arrived`.
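
 To make the arithmetic concrete, here is a minimal toy model (not Tor's
 actual code).  It assumes the usable-path fraction is the product of the
 guard, middle, and exit fractions, which is consistent with the log message
 above (100% x 100% x 0% = 0%) but is an assumption about
 `compute_frac_paths_available`, not something the excerpts show.  The names
 `toy_frac_with_descriptors` and `toy_frac_paths` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for frac_nodes_with_descriptors(): the bandwidth-weighted
 * fraction of nodes for which we hold a descriptor.  As in the excerpt
 * above, an empty candidate list yields 0.0. */
static double
toy_frac_with_descriptors(const double *bw, const int *have_desc, size_t n)
{
  double total = 0.0, present = 0.0;
  size_t i;

  if (n == 0)
    return 0.0; /* no candidates: same early return as the excerpt */

  for (i = 0; i < n; ++i) {
    assert(bw[i] >= 0.0); /* bandwidth weights are never negative */
    total += bw[i];
    if (have_desc[i])
      present += bw[i];
  }
  return total > 0.0 ? present / total : 0.0;
}

/* ASSUMPTION: the fraction of buildable paths is the product of the
 * per-position fractions.  A single zero factor zeroes the result. */
static double
toy_frac_paths(double f_guard, double f_mid, double f_exit)
{
  return f_guard * f_mid * f_exit;
}
```

 With five relays that all have descriptors but none of which is an Exit,
 the guard and middle fractions are 1.0, the exit fraction is 0.0, and the
 product is 0.0, which is below any positive threshold.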

 == call site #2: run_scheduled_events (and call site #3) ==

 There's a litany of conditions that must hold before `run_scheduled_events`
 calls `consider_testing_reachability`.  In particular, there's
 `can_complete_circuit`:

 {{{
 if (time_to_check_descriptor < now && !options->DisableNetwork) {
     //...
     /* also, check religiously for reachability, if it's within the first
      * 20 minutes of our uptime. */
     if (is_server &&
         (can_complete_circuit || !any_predicted_circuits(now)) &&
         !we_are_hibernating()) {
       if (stats_n_seconds_working < TIMEOUT_UNTIL_UNREACHABILITY_COMPLAINT) {
         consider_testing_reachability(1, dirport_reachability_count==0);
 }}}

 `can_complete_circuit` is only set to 1 in `circuit_send_next_onion_skin`,
 and then only if a circuit has been built and
 `circ->build_state->onehop_tunnel` is not set.  I _think_ this means the
 circuit is a full circuit, complete with Exit.  Right?

 {{{
 int circuit_send_next_onion_skin(origin_circuit_t *circ)
 { //...
   if (circ->cpath->state == CPATH_STATE_CLOSED) {
     // ...
   } else {
     //...
     hop = onion_next_hop_in_cpath(circ->cpath);
     if (!hop) {
       //...
       if (!can_complete_circuit && !circ->build_state->onehop_tunnel) {
         can_complete_circuit=1;
         /* FFFF Log a count of known routers here */
         log_notice(LD_GENERAL,
             "Tor has successfully opened a circuit. "
             "Looks like client functionality is working.");
         //...
         if (server_mode(options) && !check_whether_orport_reachable()) {
           inform_testing_reachability();
           consider_testing_reachability(1, 1);
 }}}

 This is also the third place `consider_testing_reachability` is called -
 there is only one left:

 == call site #4: circuit_testing_opened ==

 {{{
 /** A testing circuit has completed. Take whatever stats we want.
  * Noticing reachability is taken care of in onionskin_answer(),
  * so there's no need to record anything here. But if we still want
  * to do the bandwidth test, and we now have enough testing circuits
  * open, do it.
  */
 static void
 circuit_testing_opened(origin_circuit_t *circ)
 {
   if (have_performed_bandwidth_test ||
       !check_whether_orport_reachable()) {
     /* either we've already done everything we want with testing circuits,
      * or this testing circuit became open due to a fluke, e.g. we picked
      * a last hop where we already had the connection open due to an
      * outgoing local circuit. */
     circuit_mark_for_close(TO_CIRCUIT(circ), END_CIRC_AT_ORIGIN);
   } else if (circuit_enough_testing_circs()) {
     router_perform_bandwidth_test(NUM_PARALLEL_TESTING_CIRCS, time(NULL));
     have_performed_bandwidth_test = 1;
   } else
     consider_testing_reachability(1, 0);
 }
 }}}

 But as far as I can tell, a testing circuit is only used for two things:
 conducting a reachability test and conducting a bandwidth self-test.  The
 only place a bandwidth self-test is started is inside
 `circuit_testing_opened`.  So this call to `consider_testing_reachability`
 is a chicken-and-egg problem: it only runs after a testing circuit has
 opened, which requires a test to have been started already.
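
 Putting the four call sites together, the sketch below models only the
 gating conditions visible in the excerpts above.  It is a simplification,
 not Tor's control flow: `struct toy_state` and every `toy_*` function are
 hypothetical, conditions such as `server_mode` and `we_are_hibernating` are
 assumed to be satisfied, and `any_predicted_circuits` is left as an input
 because I haven't analyzed it here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the flags consulted in the excerpts above. */
struct toy_state {
  bool have_min_dir_info;       /* result of router_have_minimum_dir_info() */
  bool can_complete_circuit;    /* set once a non-one-hop circuit opens */
  bool any_predicted_circuits;  /* result of any_predicted_circuits(now) */
  bool multihop_circuit_opened; /* a non-one-hop circuit has just completed */
  bool testing_circuit_opened;  /* a testing circuit has just completed */
};

/* Call site #1: returns early without minimum directory info. */
static bool
toy_site_directory_info_has_arrived(const struct toy_state *s)
{
  if (!s->have_min_dir_info)
    return false;
  return s->can_complete_circuit || !s->any_predicted_circuits;
}

/* Call site #2: gated on can_complete_circuit or no predicted circuits. */
static bool
toy_site_run_scheduled_events(const struct toy_state *s)
{
  return s->can_complete_circuit || !s->any_predicted_circuits;
}

/* Call site #3: fires the first time a non-one-hop circuit completes. */
static bool
toy_site_circuit_send_next_onion_skin(const struct toy_state *s)
{
  return !s->can_complete_circuit && s->multihop_circuit_opened;
}

/* Call site #4: only reachable once a testing circuit has opened. */
static bool
toy_site_circuit_testing_opened(const struct toy_state *s)
{
  return s->testing_circuit_opened;
}

/* Count how many call sites could reach consider_testing_reachability(). */
static int
toy_sites_that_fire(const struct toy_state *s)
{
  assert(s);
  return (int)toy_site_directory_info_has_arrived(s) +
         (int)toy_site_run_scheduled_events(s) +
         (int)toy_site_circuit_send_next_onion_skin(s) +
         (int)toy_site_circuit_testing_opened(s);
}
```

 In the state I believe a no-Exit network is stuck in (no minimum directory
 info, no completed multi-hop circuit, no testing circuit, and some
 predicted circuits), `toy_sites_that_fire` returns 0.  Flipping
 `multihop_circuit_opened` to true makes call site #3 fire, which matches
 what I saw once one authority became an Exit.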

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/13718>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online