Hi Nusenu, Tor Project, Tor community,
tl;dr The Tor network is *highly* partitioned.
The scanner code is located here: https://github.com/david415/tor_partition_scanner
HOWEVER, the scanner needs a redesign... I have been using the WRONG methodology for scanning the Tor network for partitions in that I select relays which I think are interesting and then see how much connectivity there is between them. I instead would like to scan ALL the Tor relays (repeatedly at different times of day) and then later perform queries against the consensus to see which relays show up in interesting combinations of failures.
The other failure in this approach to scanning is that it uses a fixed set of relays... and therefore the Tor consensus file used will become old and contain relays no longer in the consensus well before the scan is complete.
And lastly, don't get gamed! Scanning from a single IP is a mistake and an obvious way to get gamed. For all these reasons my colleague Katharina and I plan to write a new, better scanner of Tor network partitions that is distributed and has a central dispatcher, so that new consensus documents can be used to tell the "worker machines" which circuits to try to build.
All that having been said, let me show you what my naive scan looked like a few weeks ago:
1. set up a machine running Tor and expose its control port as either a TCP port or a unix domain socket with no authentication
*edit* /etc/tor/torrc blah blah easy rtfm
2. install tor_partition_scanner
virtualenv virtenv-orscanner
. ./virtenv-orscanner/bin/activate
mkdir -p code; cd code
git clone https://github.com/david415/tor_partition_scanner.git
cd tor_partition_scanner
pip install -e .
3. get a recent consensus file
Use consensus files from collector if you want others to be able to reproduce your results. here --> https://collector.torproject.org/recent/relay-descriptors/consensuses/
wget https://collector.torproject.org/recent/relay-descriptors/consensuses/2017-0...
4. choose which relays you want in your scan
Here I am intentionally NOT scanning 50 million Tor circuits using the entire consensus. Instead I am using a simple Python program written using the Stem library to parse the consensus file and give us all the relays with the Stable and Fast flags; among those we choose the top 100 in terms of consensus bandwidth.
./helpers/query_fingerprints_from_consensus_file.py 2017-09-21-23-00-00-consensus > top100.relays
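For readers who want to see the selection logic spelled out, here is a minimal sketch. It is not the helper script from the repository. The names top_relays, read_consensus and Relay are made up for illustration, and the only Stem pieces relied on are stem.descriptor.parse_file and the fingerprint, flags and bandwidth attributes of the router status entries it yields.

```python
from collections import namedtuple


def top_relays(entries, limit=100, required_flags=("Stable", "Fast")):
    """Return fingerprints of the `limit` highest-bandwidth relays
    that carry every flag in `required_flags`."""
    eligible = [
        entry for entry in entries
        if all(flag in entry.flags for flag in required_flags)
    ]
    # Highest consensus bandwidth first; a missing bandwidth counts as 0.
    eligible.sort(key=lambda entry: entry.bandwidth or 0, reverse=True)
    return [entry.fingerprint for entry in eligible[:limit]]


def read_consensus(path):
    """Yield router status entries from a consensus file.

    Needs the third-party Stem library, imported lazily so the
    selection logic above can be tried without it.
    """
    from stem.descriptor import parse_file
    return parse_file(path, descriptor_type="network-status-consensus-3 1.0")


# Toy stand-in for Stem's router status entries (hypothetical data).
Relay = namedtuple("Relay", "fingerprint flags bandwidth")

if __name__ == "__main__":
    toy = [
        Relay("AAAA", ["Fast", "Stable"], 50),
        Relay("BBBB", ["Fast"], 900),
        Relay("CCCC", ["Fast", "Guard", "Stable"], 70),
    ]
    for fingerprint in top_relays(toy, limit=2):
        print(fingerprint)
```

With a real consensus file, top_relays(read_consensus("2017-09-21-23-00-00-consensus")) would give the same kind of list. The exact format that the scanner expects in top100.relays is not shown here.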
5. perform scan of top 100 relays
detect_partitions.py --tor-control tcp:127.0.0.1:9051 --log-dir ./ --status-log ./status_log \
    --relay-list top100.relays --secret secretTorEmpireOfRelays --partitions 1 --this-partition 0 \
    --build-duration .25 --circuit-timeout 60 --log-chunk-size 1000 --max-concurrency 100
9,900 two-hop Tor circuits are being built. As the scan runs you can tail -f the status_log to make sure it's working. Only circuit build failures, if any, will be logged in the JSON log file.
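The 9,900 figure is just the number of ordered pairs of distinct relays: 100 choices for the first hop times 99 for the second. A small sketch of that enumeration, with placeholder relay names instead of real fingerprints (two_hop_circuits is a hypothetical name, not a function from the scanner):

```python
from itertools import permutations


def two_hop_circuits(relays):
    """Return every ordered (first_hop, second_hop) pair of distinct relays."""
    return list(permutations(relays, 2))


# Placeholder names standing in for the 100 fingerprints in top100.relays.
relays = [f"relay{i}" for i in range(100)]
circuits = two_hop_circuits(relays)

print(len(circuits))  # 100 * 99 = 9900
```

The same n * (n - 1) growth is why scanning the whole consensus gets expensive so quickly.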
When the scan completes the status_log should display something like this:
2017-09-22T00:05:44+0000 [-] $BD4C647508162F59CB44E4DFC1C2B2B8A9387CCA -> $DD808ECE4F2E24F377CBE11E335ECDA196FE3B78
2017-09-22T00:05:44+0000 [-] $0966A24977A0B0DB62546C6F18F9578D97FE86F0 -> $AD00FB62A133F91009AD5F6503E5F21F594BC4C6
2017-09-22T00:05:50+0000 [orscanner#info] Finished writing measurement values to ./2017-09-22T00:05:50.492698-scan.json.
2017-09-22T00:05:50+0000 [-] Main loop terminated.
6. Load circuit build failures into sqlite db file
./bin/load.py --dbfile scan1.db -p 2017-09-22T00:03:31.610096-scan.json \
    -p 2017-09-22T00:05:42.886622-scan.json \
    -p 2017-09-22T00:05:50.492698-scan.json
7. Count the results
echo "select first_hop, second_hop from scan_log;" | sqlite3 scan1.db | wc -l
2014
8. Attempt to eliminate false positives by retesting the failed circuits
mkdir scan1
mv *.json scan1
echo "select first_hop, second_hop from scan_log;" | sqlite3 scan1.db > scan2.circuits
detect_partitions.py --tor-control tcp:127.0.0.1:9051 --log-dir ./ --status-log ./status_log \
    --relay-list relays_for_scan1 --build-duration .25 --circuit-timeout 60 --log-chunk-size 1000 \
    --max-concurrency 100 --circuit-file scan2.circuits
./bin/load.py --dbfile scan2.db -p 2017-09-22T00:59:31.017246-scan.json -p 2017-09-22T01:04:35.491908-scan.json
echo "select first_hop, second_hop from scan_log;" | sqlite3 scan2.db | wc -l
1947
still 1947 circuit build failures!
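If you want to see which circuits failed in both scans rather than only compare the counts, a set intersection over the two databases is enough. The sketch below uses in-memory toy tables and assumes only what the queries in this mail already use: a scan_log table with first_hop, second_hop and status columns. The function names are hypothetical.

```python
import sqlite3


def make_toy_db(rows):
    """Build an in-memory database shaped like the scan_log table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scan_log (first_hop TEXT, second_hop TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO scan_log VALUES (?, ?, ?)", rows)
    return conn


def failed_circuits(conn):
    """Return the set of (first_hop, second_hop) pairs logged as failed."""
    rows = conn.execute("SELECT first_hop, second_hop FROM scan_log")
    return {(first, second) for first, second in rows}


# Toy data: three failures in the first scan, two in the retest.
scan1 = make_toy_db([
    ("$A", "$B", "timeout"),
    ("$A", "$C", "failure"),
    ("$B", "$C", "failure"),
])
scan2 = make_toy_db([
    ("$A", "$C", "failure"),
    ("$B", "$C", "timeout"),
])

persistent = failed_circuits(scan1) & failed_circuits(scan2)  # failed twice
recovered = failed_circuits(scan1) - failed_circuits(scan2)   # worked on retest

print(sorted(persistent))
print(sorted(recovered))
```

Against the real files you would open scan1.db and scan2.db with sqlite3.connect() instead of building toy tables.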
Now tell the database to show us the relays involved in a circuit build timeout AND count the number of failures by first hop:
sqlite> select first_hop,count(first_hop) from scan_log where status = 'timeout' group by second_hop;
$4198BD138E5E11B15B05C826B427148CED7D99FE|2
$578E007E5E4535FBFEF7758D8587B07B4C8C5D06|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$7C0AA4E3B73E407E9F5FEB1912F8BE26D8AA124D|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$A6B0521C4C1FB91FB66398AAD523AD773E82E77E|18
$578E007E5E4535FBFEF7758D8587B07B4C8C5D06|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|2
$578E007E5E4535FBFEF7758D8587B07B4C8C5D06|1
$DAA3D3F6FDA962885072537E3F315086B003A6E3|3
$B5212DB685A2A0FCFBAE425738E478D12361710D|1
$0593F5255316748247EBA76353A3A61F62224903|4
$7C0AA4E3B73E407E9F5FEB1912F8BE26D8AA124D|2
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$578E007E5E4535FBFEF7758D8587B07B4C8C5D06|1
$C818C0EA0BAD90F5432DBA4EE662BCBEC39D2668|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$B5212DB685A2A0FCFBAE425738E478D12361710D|1
$578E007E5E4535FBFEF7758D8587B07B4C8C5D06|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
$4198BD138E5E11B15B05C826B427148CED7D99FE|1
sqlite>
A6B0521C4C1FB91FB66398AAD523AD773E82E77E is certainly an outlier! It looks like this relay hits its bandwidth limit several times per day... and Roger and I suspect it is failing circuit builds because it is overloaded at specific times of day due to traffic spikes. The fix of course would be to adjust the Tor consensus and perhaps cap this relay's bandwidth capacity in the consensus:
https://atlas.torproject.org/#details/A6B0521C4C1FB91FB66398AAD523AD773E82E7...
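One thing to keep in mind when reading the timeout query above: it groups by second_hop, so SQLite emits one row per second hop, and that is why the same first hop shows up on several lines. If a single total per first hop is wanted, grouping by first_hop gives it directly. A minimal sketch on toy data, assuming only the scan_log columns used in the queries in this mail (failures_per_first_hop is a hypothetical name):

```python
import sqlite3


def failures_per_first_hop(conn, status):
    """Count logged circuits with the given status, one row per first hop."""
    query = (
        "SELECT first_hop, COUNT(*) AS n FROM scan_log "
        "WHERE status = ? GROUP BY first_hop ORDER BY n DESC, first_hop"
    )
    return conn.execute(query, (status,)).fetchall()


# Toy table with the assumed scan_log schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scan_log (first_hop TEXT, second_hop TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO scan_log VALUES (?, ?, ?)",
    [
        ("$A", "$B", "timeout"),
        ("$A", "$C", "timeout"),
        ("$B", "$C", "timeout"),
        ("$C", "$A", "failure"),
    ],
)

for first_hop, count in failures_per_first_hop(conn, "timeout"):
    print(f"{first_hop}|{count}")
```

Swapping first_hop for second_hop in both places gives the per-second-hop totals instead.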
Circuit build failures also have outliers where a few relays fail more than others:
sqlite> select first_hop,count(first_hop) from scan_log where status = 'failure' group by second_hop;
$1AF72E8906E6C49481A791A6F8F84F8DFEBBB2BA|10
$1AF72E8906E6C49481A791A6F8F84F8DFEBBB2BA|10
$A571351082A9E04F14A0A3DF27E0637231D57B84|10
$B204DE75B37064EF6A4C6BAF955C5724578D0B32|11
$F6740DEABFD5F62612FA025A5079EA72846B1F67|10
$BF0FB582E37F738CD33C3651125F2772705BB8E8|10
$A571351082A9E04F14A0A3DF27E0637231D57B84|10
$1D3F937E2053E58C18E18D43FA5153E2A9F4DC77|99
This outlier relay 1D3F937E2053E58C18E18D43FA5153E2A9F4DC77 is also quite interesting. It also seems to hit its bandwidth limit several times per day and it's also interesting to note that it's located in the UK:
https://atlas.torproject.org/#details/1D3F937E2053E58C18E18D43FA5153E2A9F4DC...
Show me all the tor relays that failed ALL of their circuit builds:
echo "select first_hop,count(first_hop) from scan_log where status = 'failure' group by second_hop;" | sqlite3 scan2.db | grep 99
$1D3F937E2053E58C18E18D43FA5153E2A9F4DC77|99
$C793AB88565DDD3C9E4C6F15CCB9D8C7EF964CE9|99
$EF6591754F9079DD122EFC2C4B52917F625A8E5B|99
$55C7554AFCEC1062DCBAC93E67B2E03C6F330EFC|99
$87C08DDFD32C62F3C56D371F9774D27BFDBB807B|99
$BD4C647508162F59CB44E4DFC1C2B2B8A9387CCA|99
$EF6591754F9079DD122EFC2C4B52917F625A8E5B|99
$1D3174338A1131A53E098443E76E1103CDED00DC|99
$734EDDC2C04B1C0184178167ABD23AE85413212F|99
$7C0AA4E3B73E407E9F5FEB1912F8BE26D8AA124D|99
99 is used because here we scanned the top 100 and therefore each relay is involved in 99 circuit builds since relay A cannot build a circuit to itself.
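Filtering with grep 99 works here, but it would also match any line whose fingerprint happens to contain "99". The same "failed every build" check can be done inside SQL with a HAVING clause. The sketch below is one way to write it, counting per first hop, which is an assumption about the intent. The toy table has four relays, so "all builds" means three. relays_failing_every_build is a hypothetical name, and the schema is the one assumed from the queries in this mail.

```python
import sqlite3


def relays_failing_every_build(conn, relay_count):
    """Return first hops that failed a build to every other scanned relay."""
    query = """
        SELECT first_hop FROM scan_log
        WHERE status = 'failure'
        GROUP BY first_hop
        HAVING COUNT(DISTINCT second_hop) = ?
        ORDER BY first_hop
    """
    # A relay cannot build a circuit to itself, hence relay_count - 1.
    return [row[0] for row in conn.execute(query, (relay_count - 1,))]


# Toy table: $A fails towards all three other relays, $B towards only one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scan_log (first_hop TEXT, second_hop TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO scan_log VALUES (?, ?, ?)",
    [
        ("$A", "$B", "failure"),
        ("$A", "$C", "failure"),
        ("$A", "$D", "failure"),
        ("$B", "$A", "failure"),
        ("$C", "$D", "timeout"),
    ],
)

print(relays_failing_every_build(conn, relay_count=4))
```

For the top 100 scan the call would use relay_count=100.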
I hope it rings loud and clear that this is a *huge* problem for the health of the Tor network that many of the top 100 Tor relays with the Fast and Stable flags cannot even build a single Tor circuit to any of the other top 100 relays!
In conclusion, many of these circuit build failures are very likely NOT the fault of the relay operators; instead they point to the failure of the current Tor Bandwidth Authority system. Not only is it old and broken, even if it were to work "properly" it would still be broken by design if:
a. it's not performing circuit build tests
b. it's not distributed and thus more easily gameable
Katharina and I plan to conduct a more formal research project in this area and scan the entire Tor network (50 million 2-hop Tor circuits) several times at some point in the near future using a new better methodology. However even before we do that, I can tell you right now you will not like the results. It's bad news. (but fret not, Tor Project is committed to fixing the Bandwidth Authority system)
And yes, I am well aware that correlating these scan results with BGP ASNs is interesting. There are many other interesting queries we can do after we've collected the data to try and find malicious/intentional network partitions.
One possible positive outcome of our research could perhaps be the addition of a partition scanning component to the Bandwidth Authority System. However, as it stands right now Tor Project is prioritizing bandwidth measurements NOT partition scanning, since that is currently a maintenance pain point for the Directory Authority operators that must deal with the torflow tool.
Sincerely, David Stainton
On Wed, Oct 11, 2017 at 09:55:00PM +0000, nusenu wrote:
Hi David,
dawuud:
Also I have recently done a few small scans for 2-hop circuit connectivity and shared some of the results with Roger on #tor-dev. One theory is that many of the circuit failures are due to traffic spikes at certain hours of the day.
where can I find more about that if I haven't been on #tor-dev at the time?
thanks,
nusenu