On Tue, Mar 13, 2018 at 02:55:12AM +0000, dawuud wrote:
> Out of 9900 possible two-hop tor circuits among the top 100 tor relays, only 935 circuit builds have succeeded. This is way worse than the last time I sent a report, 6 months ago during the Montreal tor dev meeting.
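For scale, here is a quick check of those numbers. It assumes the 9900 figure counts ordered pairs of distinct relays (100 * 99), which is my reading of the report and not something it states:

```python
# Assumption: "9900 possible circuits" means ordered (first hop, second hop)
# pairs of distinct relays drawn from the top 100.
relays = 100
possible = relays * (relays - 1)   # ordered pairs, no relay paired with itself
succeeded = 935                    # figure quoted in the report

success_rate = succeeded / possible
print(f"{possible} possible, {succeeded} built, "
      f"{success_rate:.1%} success, {1 - success_rate:.1%} not built")
```

So under that reading, fewer than one in ten of the attempted pairs produced a working circuit.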
The next step here would be to try to debug your results, to understand if it's actually an issue with the Tor network (in which case, what exactly is the issue), or if it's a bug in your scripts.
Teor asked some good questions.
Other questions I'd want to investigate:
(A) Are the failures consistent, or intermittent? That is, does a failed link always fail, or only sometimes?
(B) Are you really sure that it failed? I would guess that 'failed' differs from 'timeout' in that the circuit got an explicit DESTROY cell back? If so, don't DESTROY cells carry a 'reason' field? Which reasons come up most often?
(C) We should find a link that is failing between two relays that we both control, and look at each one more closely to see if there are any hints. For example, is there anything in the logs? If we turn up the logging, do we get any hints then?
(D) ...which leads to: we should run this same tool on the test network that teor and dgoulet et al run, and look for failures there. Assuming we find some, since there are no users on the test network, we can investigate much more thoroughly.
(E) I wonder if there's a correlation between the failed links and whether a TLS connection is already established on that link. That is, when there is no connection already, there are many more steps that need to be taken to extend the circuit, and those steps could lead to increased failure rates, either due to the extra time that is needed, or because part of tor's link handshake (NETINFO, etc) is going wrong.
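To make (A) and (B) concrete, here is a rough sketch of the kind of tally I have in mind. It assumes the scanner can emit one record per build attempt as (first_hop, second_hop, outcome, reason). That record format, the outcome labels, the relay names, and the sample data are all made up for illustration; I have not checked what the tool actually logs.

```python
from collections import Counter, defaultdict

def summarize(attempts):
    """Classify each link by how its attempts went, and count failure reasons.

    attempts: iterable of (first_hop, second_hop, outcome, reason), where
    outcome is "built", "failed", or "timeout" (hypothetical labels) and
    reason is the reported failure reason, or None if there was none.
    """
    per_link = defaultdict(list)
    reasons = Counter()
    for first, second, outcome, reason in attempts:
        per_link[(first, second)].append(outcome)
        if outcome == "failed":
            reasons[reason or "unknown"] += 1

    classes = {}
    for link, outcomes in per_link.items():
        built = sum(1 for o in outcomes if o == "built")
        if built == len(outcomes):
            classes[link] = "always_ok"
        elif built == 0:
            classes[link] = "always_fails"
        else:
            classes[link] = "intermittent"
    return classes, reasons

# Made-up sample data; "A", "B", "C" stand in for relay fingerprints.
sample = [
    ("A", "B", "built", None),
    ("A", "B", "failed", "CONNECTFAILED"),
    ("A", "C", "failed", "RESOURCELIMIT"),
    ("A", "C", "failed", "RESOURCELIMIT"),
    ("B", "C", "built", None),
    ("B", "A", "timeout", None),
]

classes, reasons = summarize(sample)
for link, label in sorted(classes.items()):
    print(link, label)
print(reasons.most_common())
```

Each link needs several attempts before "always_fails" means anything, so the scan would have to be repeated a few times. Adding a per-attempt flag for whether a connection to the second hop already existed would let the same tally speak to (E).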
And a last point: this tool, and these investigations, are exactly in scope for the "network health" topic that the network team has been discussing as one of the key open areas that need more attention.
--Roger