In which I pretend to know how to do statistics or data analysis; but really hope I spur people to point out what I'm doing wrong =)
Methodology:
1) Run two bwauths on the same box
2) Collect votes (and raw bwauth votes where available) for a month from all bwauths (including the duplicate one)
3) Stick the tuple (vote timestamp, relayid, bwauthid, bw value) into a database
4) Measure the percent difference in bandwidth for the same relay at the same vote time between two bwauths as abs(r1.bw - r2.bw) / ((r1.bw + r2.bw) / 2)
5) Average that value for all relays for a given vote time to get the overall percent difference for that vote time
6) Graph it for multiple bwauth pairs
7) Get a slightly different flavor of the above value by limiting it to relays with a bandwidth value of at least 100, to see if there are any differences
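To make steps 4, 5 and 7 concrete, here is a minimal Python sketch of the calculation. It is not the code from the repository: the function names, the tuple layout and the `min_bw` parameter are made up for illustration. It also assumes that the threshold in step 7 applies to both bwauths' values and that the ratio is multiplied by 100 to read as a percentage.

```python
from collections import defaultdict


def percent_difference(bw1, bw2):
    """Symmetric percent difference: |bw1 - bw2| / mean(bw1, bw2) * 100."""
    mean = (bw1 + bw2) / 2
    if mean == 0:
        # Assumption: two zero measurements count as full agreement.
        return 0.0
    return abs(bw1 - bw2) / mean * 100


def average_disagreement(rows, auth_a, auth_b, min_bw=0):
    """Average the per-relay percent difference for each vote time.

    rows is an iterable of (vote_time, relay_id, bwauth_id, bw) tuples.
    Only relays measured by both bwauths at the same vote time are counted.
    Returns a dict mapping vote_time to the mean percent difference.
    """
    # Group the bandwidth values by (vote time, relay).
    by_key = defaultdict(dict)
    for vote_time, relay_id, bwauth_id, bw in rows:
        by_key[(vote_time, relay_id)][bwauth_id] = bw

    per_vote = defaultdict(list)
    for (vote_time, _relay_id), bws in by_key.items():
        if auth_a in bws and auth_b in bws:
            a, b = bws[auth_a], bws[auth_b]
            # Assumption: the threshold must hold for both measurements.
            if a >= min_bw and b >= min_bw:
                per_vote[vote_time].append(percent_difference(a, b))

    return {t: sum(v) / len(v) for t, v in per_vote.items()}
```

Calling `average_disagreement(rows, "maatuska", "moria", min_bw=100)` would give the '>100' flavor from step 7, while leaving `min_bw` at 0 gives the 'all relays' flavor.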
Code:
Download and archive votes and raw bwauth votes: https://github.com/tomrittervg/bwauth-tools/blob/master/download_files.py
Cron it up to run every hour at 15 minutes after the hour.
MySQL Database schema: https://github.com/tomrittervg/bwauth-tools/blob/master/schema.sql
Process the files and put them in the database: https://github.com/tomrittervg/bwauth-tools/blob/master/files_to_database.py
Run the last query to get the data: https://github.com/tomrittervg/bwauth-tools/blob/master/queries.sql
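For a rough idea of what such a query looks like, here is a self-contained sketch that uses SQLite in memory instead of MySQL. The table and column names (`vote`, `vote_time`, `relay_id`, `bwauth_id`, `bw`) and the sample rows are invented for illustration; they are not taken from schema.sql or queries.sql. The query self-joins the table on vote time and relay, then averages the per-relay difference for each vote time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the real layout is defined in schema.sql.
conn.execute(
    "CREATE TABLE vote ("
    "vote_time INTEGER, relay_id TEXT, bwauth_id TEXT, bw INTEGER)"
)
conn.executemany(
    "INSERT INTO vote VALUES (?, ?, ?, ?)",
    [
        (1507300000, "relayA", "maatuska", 100),
        (1507300000, "relayA", "moria", 300),
        (1507300000, "relayB", "maatuska", 50),
        (1507300000, "relayB", "moria", 50),
        (1507303600, "relayA", "maatuska", 200),
        (1507303600, "relayA", "moria", 200),
    ],
)

# Pair up the two bwauths' rows for the same relay and vote time,
# then average |r1 - r2| / mean(r1, r2) over all relays per vote time.
QUERY = """
SELECT r1.vote_time,
       AVG(ABS(r1.bw - r2.bw) / ((r1.bw + r2.bw) / 2.0)) AS avg_diff
FROM vote r1
JOIN vote r2
  ON r1.vote_time = r2.vote_time
 AND r1.relay_id = r2.relay_id
WHERE r1.bwauth_id = ?
  AND r2.bwauth_id = ?
  AND r1.bw + r2.bw > 0
GROUP BY r1.vote_time
ORDER BY r1.vote_time
"""

result = conn.execute(QUERY, ("maatuska", "moria")).fetchall()
for vote_time, avg_diff in result:
    print(vote_time, avg_diff)
```

Dividing by `2.0` rather than `2` forces floating-point division, and the `r1.bw + r2.bw > 0` condition skips pairs where both values are zero so the division is always defined.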
Plot it: https://github.com/tomrittervg/bwauth-tools/blob/master/plot_data.py
Optionally run something like grep -v "\\N,\\N,\\N,\\N,\\N,\\N,\\N,\\N" query.csv > non-blank-lines.csv first to drop the blank rows (or point the plotting script at query.csv instead of non-blank-lines.csv).
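If you would rather do that filtering in Python, a minimal sketch follows. It assumes the CSV came from a MySQL export where NULL is written as \N, and that a "blank" row is one containing eight consecutive NULL fields, which is what the grep pattern above matches. The function name is made up.

```python
# Eight consecutive NULL fields, as written by a MySQL CSV export.
NULL_RUN = ",".join([r"\N"] * 8)


def drop_blank_rows(lines):
    """Return only the lines that do not contain eight NULLs in a row."""
    return [line for line in lines if NULL_RUN not in line]


def filter_file(src_path, dst_path):
    """Copy src_path to dst_path, leaving out the blank rows."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(drop_blank_rows(src))
```

Calling `filter_file("query.csv", "non-blank-lines.csv")` would then stand in for the grep step.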
~ One month of data is 29,194,717 rows or ~4GB of data, and you want to run the query and the plotting overnight.
Conclusions:
The data!! https://raw.githubusercontent.com/tomrittervg/bwauth-tools/master/data.png
x axis: unix epoch of the vote time
y axis: percent disagreement between maatuska's bwauth and the indicated bwauth
The two bwauths run from maatuska agree quite consistently. However they have the only instance of DISagreement between the 'all' set of relays and the '>100' set.
maatuska and moria agree eerily closely.
Any two bwauths agree with each other by the same amount consistently over time, although how much they agree depends on which two bwauths are being compared.
The 'base' agreement, how much two identical bwauths will agree, is around 35%.
There is not much difference between 'all relays' and 'relays > 100'.
Follow Up:
Is my data correct? maatuska and moria being that close together is suspicious.
Compare moria<->longclaw, moria<->faravahar, and longclaw<->faravahar
Change the bandwidth limit to >1000 and see if that produces more or less disagreement
I can start applying bwauth patches, and run the new bwauth code, from the same vantage point and use this methodology to determine if new code has altered the results. IF it has altered the results: are they better results or worse? Don't know!
I should measure the disagreements between bwauths when grouping relays by country.
Because I have raw vote data from maatuska and moria, I can test the hypothesis that relays fall between gaps in scanners. Maybe. I could probably get positive proof this occurs but maybe not proof it _doesn't_ occur.
-tom
On 6 October 2017 at 15:40, Tom Ritter tom@ritter.vg wrote:
Is my data correct? maatuska and moria being that close together is suspicious.
moria was completely missing, heh. Unfortunately trying to graph all the data now gives me an out of memory error!
Here is a graph showing maatuska<->maatuska2 and maatuska<->moria. They agree quite closely, much more than any other two bwauths, but not identically.
https://raw.githubusercontent.com/tomrittervg/bwauth-tools/master/moria.png
It appears there's a single vote file in there that has incorrect data and is throwing the scale off, but a) I like a larger scale and b) trying to correct it would take me another day before notifying everyone of the mistake.
-tom
Hi Tom et al,
I hope I'm not too far off topic here:
I remember from the last tor-dev meeting there was some discussion about replacing the old torflow code with bwscanner, a project that Aaron Gibson started and Donncha subsequently worked on. But since then I haven't heard of any progress on that.
Also I have recently done a few small scans for 2-hop circuit connectivity and shared some of the results with Roger on #tor-dev. One theory is that many of the circuit failures are due to traffic spikes at certain hours of the day.
I was thinking that many of the relay to relay connectivity problems could be fixed by adjusting the consensus weights or capping their consensus bandwidth... and that this scanning of "network partitions" shouldn't only be an interesting research project but should probably be a part of the bandwidth authority system.
Meejah and I were planning to facilitate a workshop/discussion about these and other issues related to scanning the tor network. Are you coming to the meeting in Montreal?
Cheers, David
On 9 October 2017 at 10:34, dawuud dawuud@riseup.net wrote:
Hi Tom et al,
I hope I'm not too far off topic here:
Nope!
I remember from the last tor-dev meeting there was some discussion about replacing the old torflow code with bwscanner, a project that Aaron Gibson started and Donncha subsequently worked on. But since then I haven't heard of any progress on that.
No. AFAIK the code is still at https://github.com/TheTorProject/bwscanner.git and waiting to be tested. I intend to test it 'soon' (for a value of 'soon' that might stretch into 4-5 months) and see how its results compare.
Also I have recently done a few small scans for 2-hop circuit connectivity and shared some of the results with Roger on #tor-dev. One theory is that many of the circuit failures are due to traffic spikes at certain hours of the day.
I was thinking that many of the relay to relay connectivity problems could be fixed by adjusting the consensus weights or capping their consensus bandwidth... and that this scanning of "network partitions" shouldn't only be an interesting research project but should probably be a part of the bandwidth authority system.
I saw that! Or at least part of it! Awesome work!
Meejah and I were planning to facilitate a workshop/discussion about these and other issues related to scanning the tor network. Are you coming to the meeting in Montreal?
No. Sorry. =/
Your proposal makes sense to me superficially for sure. I don't own torflow really, I just try and help it keep limping along. I'm not qualified enough to say it's a good idea or a bad idea, or how we could measure its results or what. =)
-tom
tor-project mailing list tor-project@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-project
On 9 Oct 2017, at 11:34, dawuud dawuud@riseup.net wrote:
Hi Tom et al,
I hope I'm not too far off topic here:
I remember from the last tor-dev meeting there was some discussion about replacing the old torflow code with bwscanner, a project that Aaron Gibson started and Donncha subsequently worked on. But since then I haven't heard of any progress on that.
There has been a small amount of volunteer progress on maintaining the current bandwidth authority code.
But we haven't made any progress on bwscanner. Perhaps because we're not sure who is working on it or where it is at.
To make progress, we need a list of priorities for bandwidth authorities, and people with time to make them happen. This needs people from the network team, and directory authority operators.
...
Meejah and I were planning to facilitate a workshop/discussion about these and other issues related to scanning the tor network. Are you coming to the meeting in Montreal?
There's also a session on bandwidth authorities - do you think we should merge the two?
T
-- Tim / teor
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
There has been a small amount of volunteer progress on maintaining the current bandwidth authority code.
But we haven't made any progress on bwscanner. Perhaps because we're not sure who is working on it or where it is at.
OK. I'll send Donncha a quick e-mail asking him about the current status.
To make progress, we need a list of priorities for bandwidth authorities, and people with time to make them happen. This needs people from the network team, and directory authority operators.
...
Meejah and I were planning to facilitate a workshop/discussion about these and other issues related to scanning the tor network. Are you coming to the meeting in Montreal?
There's also a session on bandwidth authorities - do you think we should merge the two?
I'm not sure. Was anyone actually planning to attend the session about scanning the tor network besides meejah and me? If not then merge them.
I definitely would attend the bandwidth authorities session. I think these network health issues are important and should get more attention since they very much affect the security properties of tor.
On 9 Oct 2017, at 12:50, dawuud dawuud@riseup.net wrote:
...
To make progress, we need a list of priorities for bandwidth authorities, and people with time to make them happen. This needs people from the network team, and directory authority operators.
...
Meejah and I were planning to facilitate a workshop/discussion about these and other issues related to scanning the tor network. Are you coming to the meeting in Montreal?
There's also a session on bandwidth authorities - do you think we should merge the two?
I'm not sure. Was anyone actually planning to attend the session about scanning the tor network besides meejah and me? If not then merge them.
I definitely would attend the bandwidth authorities session. I think these network health issues are important and should get more attention since they very much affect the security properties of tor.
Let's merge them, and get everyone there to set priorities. I've CC'd Alison, who is keeping the list.
T
Respect to all people keeping torflow alive.
teor:
There has been a small amount of volunteer progress on maintaining the current bandwidth authority code.
But we haven't made any progress on bwscanner. Perhaps because we're not sure who is working on it or where it is at.
To make progress, we need a list of priorities for bandwidth authorities, and people with time to make them happen. This needs people from the network team, and directory authority operators.
I haven't spent any time working on the bwscanner code in the past year, and so my memory of the outstanding issues is a bit hazy.
IIRC the scanning component should be basically functional. It builds circuits, fetches files and outputs the results into a directory as JSON.
There is probably some work needed on the code to turn these JSON files into measurements that are readable by the bwauths. I ported the torflow measurement aggregation code to Stem and it **should** work to generate bwauth-compatible measurement files (https://github.com/TheTorProject/bwscanner/blob/develop/scripts/aggregate.py). That script should be tested to make sure the results look sensible and actually match the torflow spec.
I think the best thing would be for someone to take the existing bwscanner code and just get it running somewhere and start generating measurements on the real network. It will be the quickest way to find any issues with the measurement code or aggregation tools.
There are some short instructions in the README about how to collect measurements and then aggregate those measurements into bwauth files.
Anyone is welcome to contact me with questions. I'm happy to assist in any way I can.
Regards, Donncha