Re: [tor-project] Analyzing bwauth disagreements

13 Oct 2017

      Hi Nusenu, Tor Project, Tor community,

tl;dr
The Tor network is *highly* partitioned.

The scanner code is located here:
https://github.com/david415/tor_partition_scanner

HOWEVER, the scanner needs a redesign... I have been using the WRONG
methodology for scanning the Tor network for partitions: I select
relays which I think are interesting and then see how much
connectivity there is between them. Instead, I would like to scan ALL
the Tor relays (repeatedly, at different times of day) and then later
perform queries against the consensus to see which relays show up in
interesting combinations of failures.

The other failure in this approach to scanning is that it uses a fixed
set of relays... the Tor consensus file used will therefore become
stale, and will list relays that have dropped out of the current
consensus, well before the scan is complete.

And lastly, don't get gamed! Scanning from a single IP is a mistake and
an obvious way to get gamed. For all these reasons my colleague
Katharina and I plan to write a new, better scanner of Tor network partitions
that is distributed and has a central dispatcher, so that new
consensus documents are used to tell the "worker machines" which circuits
to try and build.

All that having been said, let me show you what my naive scan looked like
a few weeks ago:

1. set up a machine running Tor and expose its control port as either a
TCP port or Unix domain socket with no authentication

   *edit* /etc/tor/torrc
   set ControlPort (or ControlSocket); easy, see the tor manual

2. install tor_partition_scanner

   virtualenv virtenv-orscanner
   . ./virtenv-orscanner/bin/activate
   mkdir -p code; cd code
   git clone https://github.com/david415/tor_partition_scanner.git
   cd tor_partition_scanner
   pip install -e .

3. get a recent consensus file

   Use consensus files from CollecTor if you want others to be able to reproduce your results:
      https://collector.torproject.org/recent/relay-descriptors/consensuses/

   wget https://collector.torproject.org/recent/relay-descriptors/consensuses/2017-0...

4. choose which relays you want in your scan

   Here I am intentionally NOT scanning 50 million Tor circuits using the entire consensus.
   Instead I am using a simple Python program written using the Stem library to parse the consensus file
   and give us all the relays with the Stable and Fast flags; among those we choose the top 100 in terms of
   consensus bandwidth.

   ./helpers/query_fingerprints_from_consensus_file.py 2017-09-21-23-00-00-consensus > top100.relays
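
For anyone curious what that selection roughly involves, here is a minimal
Python sketch. It is NOT the code of query_fingerprints_from_consensus_file.py:
the names top_relays and load_entries are made up for illustration, and the
exact relay list format the scanner expects is not shown. It assumes Stem's
parse_file yields router status entries with fingerprint, flags and bandwidth
attributes.

```python
from collections import namedtuple


def top_relays(entries, n=100, required_flags=("Stable", "Fast")):
    """Fingerprints of the n highest-bandwidth entries carrying every required flag."""
    eligible = [
        e for e in entries
        if e.bandwidth is not None and all(flag in e.flags for flag in required_flags)
    ]
    # Highest consensus bandwidth first.
    eligible.sort(key=lambda e: e.bandwidth, reverse=True)
    return [e.fingerprint for e in eligible[:n]]


def load_entries(consensus_path):
    """Hypothetical helper: parse a consensus file with Stem (imported lazily)."""
    from stem.descriptor import parse_file
    # The explicit type covers files that lack an "@type" annotation line.
    return list(parse_file(consensus_path,
                           descriptor_type="network-status-consensus-3 1.0"))


# Tiny stand-in entries so the selection logic can be tried without Stem.
Entry = namedtuple("Entry", "fingerprint flags bandwidth")
sample = [
    Entry("AAAA", ["Fast", "Stable"], 10),
    Entry("BBBB", ["Fast"], 99),                    # no Stable flag, skipped
    Entry("CCCC", ["Fast", "Guard", "Stable"], 50),
]
print(top_relays(sample, n=2))   # ['CCCC', 'AAAA']
```

With a real file one would call top_relays(load_entries("2017-09-21-23-00-00-consensus")).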

5. perform scan of top 100 relays

   detect_partitions.py --tor-control tcp:127.0.0.1:9051 --log-dir ./ --status-log ./status_log \
   --relay-list top100.relays --secret secretTorEmpireOfRelays --partitions 1 --this-partition 0 \
   --build-duration .25 --circuit-timeout 60 --log-chunk-size 1000 --max-concurrency 100

   9,900 two-hop Tor circuits are being built (100 first hops x 99 possible second hops).
   As the scan runs you can tail -f the status_log to make sure it's working.
   Only circuit build failures, if any, are logged in the JSON log file.

   When the scan completes the status_log should display something like this:

   2017-09-22T00:05:44+0000 [-] $BD4C647508162F59CB44E4DFC1C2B2B8A9387CCA -> $DD808ECE4F2E24F377CBE11E335ECDA196FE3B78
   2017-09-22T00:05:44+0000 [-] $0966A24977A0B0DB62546C6F18F9578D97FE86F0 -> $AD00FB62A133F91009AD5F6503E5F21F594BC4C6
   2017-09-22T00:05:50+0000 [orscanner#info] Finished writing measurement values to ./2017-09-22T00:05:50.492698-scan.json.
   2017-09-22T00:05:50+0000 [-] Main loop terminated.
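
To make the 9,900 figure concrete, here is a small Python sketch that
enumerates every ordered pair of distinct relays. The second function shows
one way a worker could pick its share of the circuits from a shared secret
and a partition count. That part is purely an assumption on how such a split
could be done; it is not taken from the scanner's code, and shard_of is a
made-up name.

```python
import hashlib
import hmac
from itertools import permutations


def two_hop_circuits(relays):
    """Every ordered (first_hop, second_hop) pair of distinct relays."""
    return list(permutations(relays, 2))


def shard_of(circuit, secret, partitions):
    """Map a circuit to a partition index in [0, partitions) via a keyed hash."""
    message = "->".join(circuit).encode()
    digest = hmac.new(secret.encode(), message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % partitions


relays = ["relay%03d" % i for i in range(100)]   # placeholder names
circuits = two_hop_circuits(relays)
print(len(circuits))   # 9900

# A hypothetical worker 0 out of 4 would only build these circuits:
mine = [c for c in circuits if shard_of(c, "example-secret", 4) == 0]
```

With a single partition every circuit lands in partition 0, which matches
running the whole scan from one machine.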

6. Load circuit build failures into sqlite db file

   ./bin/load.py --dbfile scan1.db -p 2017-09-22T00:03:31.610096-scan.json \
   -p 2017-09-22T00:05:42.886622-scan.json \
   -p 2017-09-22T00:05:50.492698-scan.json

7. Count the results

   echo "select first_hop, second_hop from scan_log;" | sqlite3 scan1.db  | wc -l
   2014

8. Attempt to eliminate false positives by retesting the failed circuits

   mkdir scan1
   mv *.json scan1
   echo "select first_hop, second_hop from scan_log;" | sqlite3 scan1.db > scan2.circuits

   detect_partitions.py --tor-control tcp:127.0.0.1:9051 --log-dir ./ --status-log ./status_log \
   --relay-list relays_for_scan1 --build-duration .25 --circuit-timeout 60 --log-chunk-size 1000 \
   --max-concurrency 100 --circuit-file scan2.circuits

   ./bin/load.py --dbfile scan2.db -p 2017-09-22T00:59:31.017246-scan.json -p 2017-09-22T01:04:35.491908-scan.json

   echo "select first_hop, second_hop from scan_log;" | sqlite3 scan2.db | wc -l
   1947

   Still 1947 circuit build failures! That is 1947 of the 2014 retested circuits, roughly 97%.
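
Counting rows only says how many circuits failed in each scan, not whether
they are the same circuits. A minimal Python sketch of that comparison,
assuming both db files have the scan_log table with the first_hop and
second_hop columns used above (compare_scans is a made-up name):

```python
import sqlite3
from contextlib import closing


def failed_circuits(dbfile):
    """Set of (first_hop, second_hop) pairs stored in the scan_log table."""
    with closing(sqlite3.connect(dbfile)) as conn:
        return set(conn.execute("select first_hop, second_hop from scan_log"))


def compare_scans(first_db, retest_db):
    """Return (circuits failing in both scans, fraction of first-scan failures repeated)."""
    first = failed_circuits(first_db)
    retest = failed_circuits(retest_db)
    repeated = first & retest
    fraction = len(repeated) / len(first) if first else 0.0
    return repeated, fraction


# Example use against the files from the steps above:
#   repeated, fraction = compare_scans("scan1.db", "scan2.db")
#   print(len(repeated), "circuits failed both times (%.1f%%)" % (100 * fraction))
```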

Now tell the database to show us the relays involved in a circuit build timeout AND count the timeouts. Note that the query groups by second_hop, so the same first hop can appear on several rows:

sqlite> select first_hop,count(first_hop) from scan_log where status = 'timeout' group by second_hop;
$4198BD138E5E11B15B05C826B427148CED7D99FE|2
$578E007E5E4535FBFEF7758D8587B07B4C8C5D06|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$7C0AA4E3B73E407E9F5FEB1912F8BE26D8AA124D|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$A6B0521C4C1FB91FB66398AAD523AD773E82E77E|18
$578E007E5E4535FBFEF7758D8587B07B4C8C5D06|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|2
$578E007E5E4535FBFEF7758D8587B07B4C8C5D06|1
$DAA3D3F6FDA962885072537E3F315086B003A6E3|3
$B5212DB685A2A0FCFBAE425738E478D12361710D|1
$0593F5255316748247EBA76353A3A61F62224903|4
$7C0AA4E3B73E407E9F5FEB1912F8BE26D8AA124D|2
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$578E007E5E4535FBFEF7758D8587B07B4C8C5D06|1
$C818C0EA0BAD90F5432DBA4EE662BCBEC39D2668|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$B5212DB685A2A0FCFBAE425738E478D12361710D|1
$578E007E5E4535FBFEF7758D8587B07B4C8C5D06|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
sqlite>
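
Because the query above groups by second_hop, the counts are per second hop
and the first_hop column can repeat. Here is a minimal Python sketch that
tallies per relay in one chosen hop position instead. It assumes only the
scan_log columns seen in the queries above (first_hop, second_hop, status);
tally_by_hop is a made-up name and the in-memory rows are invented sample data.

```python
import sqlite3


def tally_by_hop(conn, status, hop="first_hop"):
    """Count scan_log rows with the given status per relay in the chosen hop position."""
    if hop not in ("first_hop", "second_hop"):
        raise ValueError("hop must be 'first_hop' or 'second_hop'")
    query = (
        "select {hop}, count(*) from scan_log "
        "where status = ? group by {hop} order by count(*) desc, {hop}"
    ).format(hop=hop)
    return conn.execute(query, (status,)).fetchall()


# Invented sample rows, just to show the shape of the result.
conn = sqlite3.connect(":memory:")
conn.execute("create table scan_log (first_hop text, second_hop text, status text)")
conn.executemany("insert into scan_log values (?, ?, ?)", [
    ("$A", "$B", "timeout"),
    ("$A", "$C", "timeout"),
    ("$B", "$C", "timeout"),
    ("$C", "$A", "failure"),
])
print(tally_by_hop(conn, "timeout"))                # [('$A', 2), ('$B', 1)]
print(tally_by_hop(conn, "timeout", "second_hop"))  # [('$C', 2), ('$B', 1)]
```

Against the real data one would pass sqlite3.connect("scan2.db") instead of
the in-memory connection.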

A6B0521C4C1FB91FB66398AAD523AD773E82E77E is certainly an outlier!  It
looks like this relay hits its bandwidth limit several times per
day... and Roger and I suspect it is failing circuit builds at certain
times of day because it is overloaded by traffic spikes. The fix of
course would be to adjust the Tor consensus and perhaps cap this
relay's bandwidth capacity in the consensus:

https://atlas.torproject.org/#details/A6B0521C4C1FB91FB66398AAD523AD773E82E7...

Circuit build failures also have outliers where a few relays fail more than others:

sqlite> select first_hop,count(first_hop) from scan_log where status = 'failure' group by second_hop;
$1AF72E8906E6C49481A791A6F8F84F8DFEBBB2BA|10
$1AF72E8906E6C49481A791A6F8F84F8DFEBBB2BA|10
$A571351082A9E04F14A0A3DF27E0637231D57B84|10
$B204DE75B37064EF6A4C6BAF955C5724578D0B32|11
$F6740DEABFD5F62612FA025A5079EA72846B1F67|10
$BF0FB582E37F738CD33C3651125F2772705BB8E8|10
$A571351082A9E04F14A0A3DF27E0637231D57B84|10
$1D3F937E2053E58C18E18D43FA5153E2A9F4DC77|99

This outlier relay 1D3F937E2053E58C18E18D43FA5153E2A9F4DC77 is also
quite interesting. It too seems to hit its bandwidth limit several times
per day, and it's worth noting that it's located in the UK:

https://atlas.torproject.org/#details/1D3F937E2053E58C18E18D43FA5153E2A9F4DC...

Show me all the Tor relays that failed ALL of their circuit builds:

echo "select first_hop,count(first_hop) from scan_log where status = 'failure' group by second_hop;" | sqlite3 scan2.db | grep '|99$'
$1D3F937E2053E58C18E18D43FA5153E2A9F4DC77|99
$C793AB88565DDD3C9E4C6F15CCB9D8C7EF964CE9|99
$EF6591754F9079DD122EFC2C4B52917F625A8E5B|99
$55C7554AFCEC1062DCBAC93E67B2E03C6F330EFC|99
$87C08DDFD32C62F3C56D371F9774D27BFDBB807B|99
$BD4C647508162F59CB44E4DFC1C2B2B8A9387CCA|99
$EF6591754F9079DD122EFC2C4B52917F625A8E5B|99
$1D3174338A1131A53E098443E76E1103CDED00DC|99
$734EDDC2C04B1C0184178167ABD23AE85413212F|99
$7C0AA4E3B73E407E9F5FEB1912F8BE26D8AA124D|99

99 is used because here we scanned the top 100: in a given hop
position each relay is paired with each of the other 99 relays, since
relay A cannot build a circuit to itself.
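
A plain grep for 99 matches any line containing "99", fingerprints
included, so here is a stricter Python sketch that asks sqlite directly for
relays whose circuits failed with every possible partner. It assumes the
same scan_log columns as above; failing_everything is a made-up name and the
in-memory rows are invented sample data.

```python
import sqlite3


def failing_everything(conn, relay_count, status="failure", hop="second_hop"):
    """Relays in the given hop position that failed with all relay_count - 1 partners."""
    if hop not in ("first_hop", "second_hop"):
        raise ValueError("hop must be 'first_hop' or 'second_hop'")
    other = "first_hop" if hop == "second_hop" else "second_hop"
    # count(distinct ...) keeps duplicate log rows from inflating the count.
    query = (
        "select {hop} from scan_log where status = ? "
        "group by {hop} having count(distinct {other}) = ? order by {hop}"
    ).format(hop=hop, other=other)
    rows = conn.execute(query, (status, relay_count - 1))
    return [row[0] for row in rows]


# Invented sample: three relays, so a relay has two possible partners.
conn = sqlite3.connect(":memory:")
conn.execute("create table scan_log (first_hop text, second_hop text, status text)")
conn.executemany("insert into scan_log values (?, ?, ?)", [
    ("$A", "$C", "failure"),
    ("$B", "$C", "failure"),
    ("$A", "$B", "failure"),
    ("$C", "$B", "timeout"),
])
print(failing_everything(conn, 3))                   # ['$C']
print(failing_everything(conn, 3, hop="first_hop"))  # ['$A']
```

For the scan above the call would be failing_everything(conn, 100) on a
connection to scan2.db.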

I hope it rings loud and clear that this is a *huge* problem for the
health of the Tor network that many of the top 100 Tor relays with the
Fast and Stable flags cannot even build a single Tor circuit to any of
the other top 100 relays!

In conclusion, many of these circuit build failures are very likely
NOT the fault of the relay operators; instead they point to the
failure of the current Tor Bandwidth Authority system. Not only is it
old and broken, but even if it were to work "properly" it would still be
broken by design if:

a. it's not performing circuit build tests
b. it's not distributed and thus more easily gameable

Katharina and I plan to conduct a more formal research project in this
area and scan the entire Tor network (50 million 2-hop Tor circuits)
several times at some point in the near future using a new, better
methodology.  However, even before we do that, I can tell you right now
you will not like the results. It's bad news. (But fret not, Tor
Project is committed to fixing the Bandwidth Authority system.)

And yes, I am well aware that correlating these scan results with BGP ASNs
is interesting. There are many other interesting queries we can do after we've
collected the data to try and find malicious/intentional network partitions.

One possible positive outcome of our research could perhaps be the addition
of a partition scanning component to the Bandwidth Authority system. However,
as it stands right now Tor Project is prioritizing bandwidth measurements, NOT
partition scanning, since bandwidth measurement is currently a maintenance pain
point for the Directory Authority operators who must deal with the torflow tool.

Sincerely,
David Stainton

On Wed, Oct 11, 2017 at 09:55:00PM +0000, nusenu wrote:
> Hi David,
>
> dawuud:
> > ...
> > Also I have recently done a few small scans for 2-hop circuit
> > connectivity and shared some of the results with Roger on
> > #tor-dev. One theory is that many of the circuit failures are due to
> > traffic spikes at certain hours of the day.
>
> where can I find more about that if I haven't been on #tor-dev at the time?
>
> thanks,
> nusenu