Talked with Mike on IRC:
12:12 < tjr:#tor-dev> mikeperry: If you have a moment today, we'd appreciate it if you could peek at the tor-dev thread 'stale entries in bwscan.20151029-1145'
12:14 < mikeperry:#tor-dev> that seems to be one of the mails I lost
12:14 < mikeperry:#tor-dev> (had a mail failure a couple weeks back)
12:14 < mikeperry:#tor-dev> oh, nm, it just hasn't arrived yet
12:16 < mikeperry:#tor-dev> tjr: torflow does indeed fetch a new consensus for each slice now. they could be falling in between them :/
12:16 < mikeperry:#tor-dev> but the unmeasured scanner didn't pick them up even?
12:17 < tjr:#tor-dev> They are measured
12:17 < tjr:#tor-dev> they're big fast relays
12:18 < tjr:#tor-dev> Hm. Conceptually, do you see a problem with locking a single consensus at the startup of a scanner?
12:24 < mikeperry:#tor-dev> tjr: it made them finish much faster, since they didn't have to keep churning and spending CPU and making additional measurements as relays come in and out, but I was wondering if it would make the gap problem worse
12:26 < mikeperry:#tor-dev> it seems odd that three relays would be missed by all scanners though. I wonder what is special about them that is causing them to fall through the cracks for everyone for so long
12:26 < tjr:#tor-dev> Wait, I'm confused. When you say "it" you mean fetching a new consensus every slice, right? Why would fetching a new consensus every slice use _less_ CPU and do less churning? It seems that _would_ cause new relays to come in and out and make the gap problem worse
12:27 < mikeperry:#tor-dev> tjr: because what the code used to do was listen for new consensus events, and dynamically update the slice and the relays as the consensus came in
12:27 < tjr:#tor-dev> So these 3 should be covered by scanner1. They were skipped, and I'm theorizing because they fell through gaps in the slices inside scanner1
12:27 < mikeperry:#tor-dev> that would mean that every new consensus period, the scanning machine would crawl to a stop, and also that relays would shift around in the slice during that time
12:28 < tjr:#tor-dev> Okay, yeah, dynamically updating the slice in the middle of the slice definitely sounds bad.
12:28 < tjr:#tor-dev> I'm proposing pushing it back even further - instead of a new consensus each slice, lock the consensus at the beginning of a scanner for all slices
12:28 < mikeperry:#tor-dev> that is harder architecturally because of the process model
12:29 < mikeperry:#tor-dev> though maybe we could have the subprocesses continue on for multiple slices
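To make sure we're talking about the same thing, here is roughly how I understand the two scheduling options. This is pseudocode only, not torflow's real code; fetch_consensus, build_slices, and measure_slice are placeholders for the real bwauthority machinery:

# Sketch of the two scheduling options discussed above (not torflow's
# actual code; the helper names are placeholders).

def scan_consensus_per_slice(num_slices, fetch_consensus, build_slices, measure_slice):
    # What torflow does now (per Mike): each slice is built from a fresh
    # consensus, so a relay can shift across slice boundaries from one
    # fetch to the next and never land inside the slice being measured.
    for slice_num in range(num_slices):
        consensus = fetch_consensus()
        measure_slice(build_slices(consensus)[slice_num])

def scan_locked_consensus(fetch_consensus, build_slices, measure_slice):
    # What I'm proposing: lock one consensus at scanner startup and use it
    # for every slice, so slice membership cannot change mid-scan.
    consensus = fetch_consensus()
    for relays in build_slices(consensus):
        measure_slice(relays)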
So my best guess is that they are falling between the slices. The tedious way to confirm it would be to look at the consensus at the time each slice began (in bws-data), match up the slice ordering, and confirm that, for every N, when slicenum=N began, Onyx was expected to be in some slice other than N. A rough sketch of that check is below.
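Something like this would do the tedious part. It assumes the concatenated bws-data marks each slice with a "slicenum=N" line and lists relays as "nick=..." fields; that layout is an assumption, so adjust the parsing to whatever the real bws- files contain:

# Sketch of the slice cross-check; the bws-data layout it assumes
# ("slicenum=N" headers followed by per-relay "nick=..." lines) is a guess.
import re
import sys

def slices_containing(path, nick):
    """Return the slice numbers whose relay lists mention the given nick."""
    hits = set()
    current_slice = None
    with open(path) as f:
        for line in f:
            m = re.search(r"slicenum=(\d+)", line)
            if m:
                current_slice = int(m.group(1))
            # exact field match, e.g. "nick=Onyx"
            if current_slice is not None and ("nick=" + nick) in line.split():
                hits.add(current_slice)
    return sorted(hits)

if __name__ == "__main__":
    # e.g.: python check_slices.py bws-data Onyx
    print(slices_containing(sys.argv[1], sys.argv[2]))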
-tom
On 5 November 2015 at 11:11, Tom Ritter tom@ritter.vg wrote:
[+tor-dev]
So... weird. I dug into Onyx primarily. No, in scanner.1/scan-data I cannot find any evidence of Onyx being present. I'm not super familiar with the files torflow produces, but I believe the bws- files list what slice each relay is assigned to. I've put those files (concatted) here: https://bwauth.ritter.vg/bwauth/bws-data
Those relays are indeed missing.
Mike: is it possible that relays are falling in between _slices_ as well as _scanners_? I thought the 'stop listening for consensus' commit would mean that a single scanner would use the same consensus for all the slices in the scanner...
-tom
[0] https://gitweb.torproject.org/torflow.git/commit/NetworkScanners/BwAuthority...
On 5 November 2015 at 10:48, starlight.2015q4@binnacle.cx wrote:
Hi Tom,
Scanner 1 finally finished the first pass.
Of the list of big relays not checked below, three are still not checked:
*Onyx 10/14
atomicbox1 10/21
*naiveTorer 10/15
Most interestingly, ZERO evidence of any attempt to use the two starred entries appears in the scanner log. 'atomicbox1' was used to test other relays but was not tested itself.
Can you look in the database files to see if any obvious reason for this exists? These relays are very fast, Stable-flagged relays that rank near the top of the Blutmagie list.
Date: Thu, 29 Oct 2015 19:57:52 -0500
To: Tom Ritter tom@ritter.vg
From: starlight.2015q4@binnacle.cx
Subject: Re: stale entries in bwscan.20151029-1145
Tom,
Looked even more closely.
I filtered out all relays that are not currently active, ending up with a list of 6303 live relays.
1065, or 17%, of them have not been updated for five or more days; 292, or 4%, for ten or more days; and 102, or 1%, for 15 or more days.
In particular, I know of a very fast, high-quality relay in a CDN-grade network that has not been measured in 13 days. My relay Binnacle is a well-run relay in the high-quality Verizon FiOS network and has not been measured for 10 days.
This does not seem correct.
P.S. Here is a quick list of some top-30 relays that have been seriously neglected:
redjohn1 10/9
becks 10/15
aurora 10/20
Onyx 10/14
IPredator 10/15
atomicbox1 10/21
sofia 10/14
naiveTorer 10/15
quadhead 10/12
3cce3a91f6a625 10/13
apx2 10/14
At 13:35 10/29/2015 -0400, you wrote:
The system is definitely active. ...the most recent file has ten-day-old entries?
Just looked more closely. About 2500 of 8144 lines (30%) have an "updated_at=" timestamp more than five days old, i.e. before 2015/10/24 00:00 UTC.
Seems like something that should have an alarm check/monitor.
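Something along these lines could serve as a first cut at that alarm. It is only a sketch: it assumes each line of the bwscan file carries an "updated_at=<unix timestamp>" field, and the 20% threshold and five-day cutoff are arbitrary choices:

# Rough staleness alarm; assumes one relay per line with "updated_at=<unix ts>".
import re
import sys
import time

STALE_DAYS = 5
ALARM_FRACTION = 0.20  # complain if more than 20% of entries are stale

def check(path):
    now = time.time()
    total = stale = 0
    for line in open(path):
        m = re.search(r"updated_at=(\d+)", line)
        if not m:
            continue
        total += 1
        if now - int(m.group(1)) > STALE_DAYS * 86400:
            stale += 1
    frac = stale / float(total) if total else 0.0
    print("%d of %d entries (%.0f%%) older than %d days" %
          (stale, total, 100 * frac, STALE_DAYS))
    return frac > ALARM_FRACTION

if __name__ == "__main__":
    # exit non-zero when the stale fraction crosses the threshold,
    # so this can sit in cron and trigger a mail
    sys.exit(1 if check(sys.argv[1]) else 0)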