Talked with Mike on IRC:
12:12 < tjr:#tor-dev> mikeperry: If you have a moment today, we'd appreciate it if you could peek at the tor-dev thread 'stale entries in bwscan.20151029-1145'
12:14 < mikeperry:#tor-dev> that seems to be one of the mails I lost
12:14 < mikeperry:#tor-dev> (had a mail failure a couple weeks back)
12:14 < mikeperry:#tor-dev> oh, nm, it just hasn't arrived yet
12:16 < mikeperry:#tor-dev> tjr: torflow does indeed fetch a new consensus for each slice now. they could be falling in between them :/
12:16 < mikeperry:#tor-dev> but the unmeasured scanner didn't pick them up even?
12:17 < tjr:#tor-dev> They are measured
12:17 < tjr:#tor-dev> they're big fast relays
12:18 < tjr:#tor-dev> Hm. Conceptually, do you see a problem with locking a single consensus at the startup of a scanner?
12:24 < mikeperry:#tor-dev> tjr: it made them finish much faster, since they didn't have to keep churning and spending CPU and making additional measurements as relays come in and out, but I was wondering if it would make the gap problem worse
12:26 < mikeperry:#tor-dev> it seems odd that three relays would be missed by all scanners though. I wonder what is special about them that is causing them to fall through the cracks for everyone for so long
12:26 < tjr:#tor-dev> Wait, I'm confused. When you say "it" you mean fetching a new consensus every slice, right? Why would fetching a new consensus every slice use _less_ CPU and do less churning? It seems that _would_ cause new relays to come in and out and make the gap problem worse
12:27 < mikeperry:#tor-dev> tjr: because what the code used to do was listen for new consensus events, and dynamically update the slice and the relays as the consensus came in
12:27 < tjr:#tor-dev> So these 3 should be covered by scanner1. They were skipped, and I'm theorizing because they fell through gaps in the slices inside scanner1
12:27 < mikeperry:#tor-dev> that would mean that every new consensus period, the scanning machine would crawl to a stop, and also that relays would shift around in the slice during that time
12:28 < tjr:#tor-dev> Okay, yeah, dynamically updating the slice in the middle of the slice definitely sounds bad.
12:28 < tjr:#tor-dev> I'm proposing pushing it back even further - instead of a new consensus each slice, lock the consensus at the beginning of a scanner for all slices
12:28 < mikeperry:#tor-dev> that is harder architecturally because of the process model
12:29 < mikeperry:#tor-dev> though maybe we could have the subprocesses continue on for multiple slices
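To make sure we're talking about the same thing, here is roughly how I understand the two scheduling options. This is pseudocode only, not torflow's real code; fetch_consensus, build_slices, and measure_slice are placeholders for the real bwauthority machinery:

# Sketch of the two scheduling options discussed above (not torflow's
# actual code; the helper names are placeholders).

def scan_consensus_per_slice(num_slices, fetch_consensus, build_slices, measure_slice):
    # What torflow does now (per Mike): each slice is built from a fresh
    # consensus, so a relay can shift across slice boundaries from one
    # fetch to the next and never land inside the slice being measured.
    for slice_num in range(num_slices):
        consensus = fetch_consensus()
        measure_slice(build_slices(consensus)[slice_num])

def scan_locked_consensus(fetch_consensus, build_slices, measure_slice):
    # What I'm proposing: lock one consensus at scanner startup and use it
    # for every slice, so slice membership cannot change mid-scan.
    consensus = fetch_consensus()
    for relays in build_slices(consensus):
        measure_slice(relays)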
So my best guess is that they are falling between the slices. The tedious way to confirm it would be to look at the consensus at the time each slice began (in bws-data), match up the slice ordering, and confirm that, for every N, when slicenum=N began, Onyx was expected to be in some slice other than N. A rough sketch of that check is below.
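Something like this would do the tedious part. It assumes the concatenated bws-data marks each slice with a "slicenum=N" line and lists relays as "nick=..." fields; that layout is an assumption, so adjust the parsing to whatever the real bws- files contain:

# Sketch of the slice cross-check; the bws-data layout it assumes
# ("slicenum=N" headers followed by per-relay "nick=..." lines) is a guess.
import re
import sys

def slices_containing(path, nick):
    """Return the slice numbers whose relay lists mention the given nick."""
    hits = set()
    current_slice = None
    with open(path) as f:
        for line in f:
            m = re.search(r"slicenum=(\d+)", line)
            if m:
                current_slice = int(m.group(1))
            # exact field match, e.g. "nick=Onyx"
            if current_slice is not None and ("nick=" + nick) in line.split():
                hits.add(current_slice)
    return sorted(hits)

if __name__ == "__main__":
    # e.g.: python check_slices.py bws-data Onyx
    print(slices_containing(sys.argv[1], sys.argv[2]))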
-tom
On 5 November 2015 at 11:11, Tom Ritter tom@ritter.vg wrote:
[+tor-dev]
So... weird. I dug into Onyx primarily. No, in scanner.1/scan-data I cannot find any evidence of Onyx being present. I'm not super familiar with the files torflow produces, but I believe the bws- files list what slice each relay is assigned to. I've put those files (concatted) here: https://bwauth.ritter.vg/bwauth/bws-data
Those relays are indeed missing.
Mike: is it possible that relays are falling in between _slices_ as well as _scanners_? I thought the 'stop listening for consensus' commit would mean that a single scanner would use the same consensus for all the slices in the scanner...
-tom
[0] https://gitweb.torproject.org/torflow.git/commit/NetworkScanners/BwAuthority...
On 5 November 2015 at 10:48, starlight.2015q4@binnacle.cx wrote:
Hi Tom,
Scanner 1 finally finished the first pass.
Of the list of big relays not checked below, three are still not checked:
*Onyx 10/14
atomicbox1 10/21
*naiveTorer 10/15
Most interestingly, ZERO evidence of any attempt to use the two starred entries appears in the scanner log. 'atomicbox1' was used to test other relays but was not tested itself.
Can you look in the database files to see if any obvious reason for this exists? These relays are very fast, Stable-flagged relays that rank near the top of the Blutmagie list.
Date: Thu, 29 Oct 2015 19:57:52 -0500
To: Tom Ritter tom@ritter.vg
From: starlight.2015q4@binnacle.cx
Subject: Re: stale entries in bwscan.20151029-1145
Tom,
Looked even more closely.
I filtered out all relays that are not currently active, ending up with a list of 6303 live relays.
1065, or 17%, of them have not been updated for five or more days; 292, or 4%, for ten or more days; and 102, or 1%, for 15 or more days.
In particular, I know of a very fast, high-quality relay in a CDN-grade network that has not been measured in 13 days. My relay Binnacle is a well-run relay in the high-quality Verizon FiOS network and has not been measured for 10 days.
This does not seem correct.
P.S. Here is a quick list of some top-30 relays that have been seriously neglected:
redjohn1 10/9
becks 10/15
aurora 10/20
Onyx 10/14
IPredator 10/15
atomicbox1 10/21
sofia 10/14
naiveTorer 10/15
quadhead 10/12
3cce3a91f6a625 10/13
apx2 10/14
At 13:35 10/29/2015 -0400, you wrote:
The system is definitely active. ...the most recent file has ten-day-old entries?
Just looked more closely. About 2500 of 8144 lines (30%) have an "updated_at=" timestamp more than five days old, i.e. before 2015/10/24 00:00 UTC.
Seems like something that should have an alarm check/monitor.
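Something along these lines could serve as a first cut at that alarm. It is only a sketch: it assumes each line of the bwscan file carries an "updated_at=<unix timestamp>" field, and the 20% threshold and five-day cutoff are arbitrary choices:

# Rough staleness alarm; assumes one relay per line with "updated_at=<unix ts>".
import re
import sys
import time

STALE_DAYS = 5
ALARM_FRACTION = 0.20  # complain if more than 20% of entries are stale

def check(path):
    now = time.time()
    total = stale = 0
    for line in open(path):
        m = re.search(r"updated_at=(\d+)", line)
        if not m:
            continue
        total += 1
        if now - int(m.group(1)) > STALE_DAYS * 86400:
            stale += 1
    frac = stale / float(total) if total else 0.0
    print("%d of %d entries (%.0f%%) older than %d days" %
          (stale, total, 100 * frac, STALE_DAYS))
    return frac > ALARM_FRACTION

if __name__ == "__main__":
    # exit non-zero when the stale fraction crosses the threshold,
    # so this can sit in cron and trigger a mail
    sys.exit(1 if check(sys.argv[1]) else 0)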