[tor-dev] Using Stem's descriptor fetching module to replace the Java consensus-health checker

Karsten Loesing karsten at torproject.org
Mon Aug 19 10:27:31 UTC 2013


Hi Damian,

On 8/19/13 5:26 AM, Damian Johnson wrote:
> Hi Karsten, I've finished a replacement for the DocTor consensus monitors...
> 
> https://gitweb.torproject.org/atagar/tor-utils.git/blob/HEAD:/consensus_health_checker.py
> https://gitweb.torproject.org/atagar/tor-utils.git/blob/HEAD:/data/consensus_health.cfg

This is awesome!  Thanks!  Replying to this mail first, then adding a
first code review.

> Like the other checkers this is presently running hourly and sending
> results my way. My vote would be to have them start sending results to
> tor-consensus-health@ [1] instead. This will double the amount of
> noise on the list but it should help us flush out any issues with the
> scripts. Once we have confidence in it we can shut down DocTor's
> checks.

Yes, I agree that we should have your script start sending results to
the mailing list.  Want to set that up?  Not sure if your mails will
bounce until somebody approves your sender address, but I guess we'll
find out.

> A code review would also be much appreciated. If there's any portions
> that you find confusing then let me know.

See below.

> As for the DocTor website,
> I'm a little surprised Peter and Sebastian didn't reply. Not sure how
> we'd like to proceed there...

I just asked Peter on IRC.  He didn't reply before, because of tl;dr.

So, Peter is fine with all consensus-health info being in a mail and not
on the website.  When he looks at the website, he's mostly interested in
two things which aren't yet contained in status emails:

 1. Do the authorities voting on BadExit agree about relays they assign
this flag to?  The warning might contain the diff of relay fingerprints
they voted or didn't vote BadExit on.  We'll probably want to rate-limit
this warning to every 6 or 12 hours.

 2. Do the bandwidth authorities report roughly the same number of
Measured lines, or do these numbers diverge beyond a given threshold
(say, 20%)?  This warning should probably be rate-limited to every 24
hours, because new bandwidth authorities take some time to measure the
network.

What do you think about adding these two warnings to your script?  It
seems we'd make the website obsolete, at least for Peter, by doing so.

How about we start a new thread on this list discussing only the part
where we're planning to kill the website and move everything to the
status emails?  This thread is probably as long by now that nobody
besides us reads it and we could as well discuss private stuff without
anybody noticing like that I still disagree with you about Pepper Jack
being tasty cheese and that I can highly recommend Muenster cheese.

> Thoughts? -Damian

And here's the code review.  As usual, feel free to ignore comments you
don't agree with:

 - Can you add a license to your script?

 - Your rate-limiting logic seems to work only with full hours (and
days).  This may lead to unexpected results if two script executions
don't start at the exact minute and second of an hour.  In one case a
message might be suppressed, in another case it might not.  That's why I
defined all rate-limiting intervals as X:30 hours.  Maybe simply add or
subtract 30 minutes from all intervals just to be sure, or add a minutes
parameter with a default of 30.

 - DocTor produces two output files containing warnings, one with new
warnings and one with all warnings (which is empty if there are zero new
warnings).  New warnings are sent to the IRC bot and all warnings are
sent to the mailing list.  I agree that this approach is somewhat
complicated, so maybe it's sufficient to just send the new warnings to
both the IRC bot and the mailing list.  Unless you didn't realize there
was such a distinction and think it's worth adding to your script.  Up
to you.

 - Are old warnings ever removed from your last_notified.cfg?  Not sure
if it matters though.

 - Is the %is in `log.info("Suppressing %s, time remaining is %is"`
supposed to be a %s?

 - Your rule about not downloading a vote from an authority that didn't
provide a consensus before seems quite strict.  In theory, we could ask
another authority for that first authority's vote.  Maybe the first
authority just didn't want to talk to us but was happy to talk to all
other authorities.  We should probably learn about problems with that
authority's vote (or the absence of problems) even if it doesn't talk to
us.  I'm aware that this will make the download logic somewhat more complex.

 - Typo in `unknown_consensus_parameteres`.

 - Typo in `incompatable_authorities`.

 - In `certificate_expiration`, is your check that a vote has exactly
one authority really necessary?  Wouldn't stem complain if this were not
the case?

 - Speaking of stem complaining about invalid votes or consensuses: what
does your script do in such a case?

 - I wonder if your check in `has_expected_fingerprints` would complain
if somebody set up a fake "moria1".  It probably shouldn't.  That's why
I added IP address and port to checkAuthorityRelayIdentityKeys.

 - Maybe I missed them while reading your code, but did you add checks
for checkContainedVotes and checkConsensusSignatures?  Both checks are
quite important, because even if an authority claims to add a vote, it
may not have become part of the consensus, or the authority may not have
signed the consensus that it contributed a vote to before.

That's all.  Thanks for working on a DocTor replacement!

All the best,
Karsten



More information about the tor-dev mailing list