At 11:47 11/5/2015 -0600, Tom Ritter wrote:
. . . So them falling between the slices would be my best guess. . .
It immediately comes to mind that the changing consensus during scanning might be handled in a different but nonetheless straightforward manner.
Why not create a snapshot of the consensus at the time scanning commences, then, without disturbing the scanners, pull each new update with an asynchronous thread or process? The consensus thread would diff against the previous snapshot and produce an updated snapshot plus deltas for each scanner and/or slice, as the implementation requires. It would then briefly lock the working list for each active scanner and apply the delta to it.
With a single thread handling consensus retrieval and sub-division, issues of "lost" relays should go away entirely, and there is no need to hold locks for extended periods.
The consensus allocation thread would run continuously, so individual slices and scanners can complete and restart asynchronously to each other without glitches or delays.
The consensus allocation worker could also manage migrating relays from one scanner to another, again preventing lost relays.
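A minimal sketch of that flow, assuming the consensus reduces to a dict of fingerprint -> relay entries and each scanner exposes a lock plus a working-list dict; fetch_consensus, assign_scanner and the scanner attributes are hypothetical names invented here, not anything in the existing bwauth code:

    import time

    def diff_consensus(old, new):
        """Diff two consensus snapshots (dicts mapping fingerprint -> relay info)."""
        added = {fp: new[fp] for fp in new.keys() - old.keys()}
        removed = old.keys() - new.keys()
        return added, removed

    def consensus_allocation_loop(fetch_consensus, scanners, assign_scanner, interval=3600):
        """Single thread that owns consensus retrieval and sub-division.

        'scanners' is assumed to be a list of objects exposing .lock (a threading.Lock)
        and .working_list (a dict); assign_scanner(relay) decides which scanner a new
        relay belongs to.
        """
        snapshot = fetch_consensus()              # snapshot taken when scanning commences
        while True:
            time.sleep(interval)
            new_snapshot = fetch_consensus()      # pulled asynchronously, scanners undisturbed
            added, removed = diff_consensus(snapshot, new_snapshot)
            for scanner in scanners:
                delta = {fp: relay for fp, relay in added.items()
                         if assign_scanner(relay) is scanner}
                with scanner.lock:                # brief lock: only the delta is applied
                    scanner.working_list.update(delta)
                    for fp in removed:
                        scanner.working_list.pop(fp, None)
            snapshot = new_snapshot               # updated snapshot becomes the new baseline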
At 17:37 11/5/2015 -0500, you wrote:
. . . Consensus allocation worker. . .
The consensus list manager could run as an independent Python process and "message" changes to the scanner processes, avoiding the complexities of trying to share data (I know very little about Python and whether sharing data is difficult or not).
The consensus manager would keep a snapshot of the list and send delta transactions to the scanners via IPC, by updating complete disk-file lists, or by creating disk-file deltas for the scanners to consume. Whatever is easiest and most appropriate.
Perhaps each scanner would resync somehow when starting a new pass, as insurance against lost list-delta transactions.
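A rough sketch of the disk-file variant, where the manager publishes a complete list plus numbered delta files and a scanner resyncs from the full list at the start of each pass; the directory layout, file names and JSON format below are assumptions made up for illustration:

    import glob
    import json
    import os

    STATE_DIR = "bwauth-consensus"   # hypothetical directory shared by manager and scanners

    def manager_publish(snapshot, delta, seq):
        """Manager side: write the complete relay list plus a numbered delta file."""
        with open(os.path.join(STATE_DIR, "full.json"), "w") as f:
            json.dump(snapshot, f)
        with open(os.path.join(STATE_DIR, "delta.%06d.json" % seq), "w") as f:
            json.dump(delta, f)

    def scanner_resync():
        """Scanner side: reload the complete list when starting a new pass,
        as insurance against lost delta transactions."""
        with open(os.path.join(STATE_DIR, "full.json")) as f:
            return json.load(f)

    def scanner_apply_deltas(working_list, last_seq):
        """Scanner side: consume any delta files newer than the last one seen."""
        for path in sorted(glob.glob(os.path.join(STATE_DIR, "delta.*.json"))):
            seq = int(path.rsplit(".", 2)[-2])
            if seq <= last_seq:
                continue
            with open(path) as f:
                delta = json.load(f)
            working_list.update(delta.get("added", {}))
            for fp in delta.get("removed", []):
                working_list.pop(fp, None)
            last_seq = seq
        return last_seq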
On 5 November 2015 at 16:37, starlight.2015q3@binnacle.cx wrote:
. . . Why not create a snapshot of the consensus at the time scanning commences. . . Consensus allocation worker could also manage migrating relays from one scanner to another, again preventing lost relays. . .
So I'm coming around to this idea, after spending an hour trying to explain why it was bad. I thought "No no, let's do this other thing..." and then I basically designed what you said. The main problem, as I see it, is that it's easy to move relays between slices that haven't been measured yet - but how do you do this when some slices are completed and some aren't?
Relay1 is currently in scanner2 slice2, but the new consensus came in and says it should be in scanner1 slice14. Except scanner2 slice2 was already measured and scanner1 slice14 has not been. What do you do?
Or the inverse: Relay2 is currently in scanner1 slice14, but the new consensus says it should be in scanner2 slice2. But scanner2 slice2 was already measured, and scanner1 slice14 has not been. You can only move a relay between two slices that have yet to be measured. But everything is 'yet to be measured' unless you're going to halt the whole bwauth after one entire cycle and then start over again.
Which... if you used a work queue instead of a scanner, might actually work...?
We could make a shared work queue of slices, and do away with the idea of 'separate scanners for different-speed relays'... When there's no more work, we would get a new consensus and make a new work queue from it. We would assign work items in a scanner-like pattern, and as we get new consensuses with new relays that weren't in any slices, just insert them into existing queued work items. (Could also go through and remove missing relays too.)
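A small sketch of what that queue might look like, assuming a slice is just a batch of relays of roughly similar advertised bandwidth; the Slice class, the slice size and the closest-match heuristic are invented for illustration:

    from collections import deque

    class Slice:
        """Hypothetical work item: a batch of relays of roughly similar speed."""
        def __init__(self, relays):
            self.relays = relays          # dict: fingerprint -> advertised bandwidth
            self.measured = False

    def build_queue(consensus, slice_size=50):
        """Cut one consensus (fingerprint -> bandwidth) into bandwidth-ordered slices."""
        ordered = sorted(consensus.items(), key=lambda kv: kv[1], reverse=True)
        return deque(Slice(dict(ordered[i:i + slice_size]))
                     for i in range(0, len(ordered), slice_size))

    def insert_new_relay(queue, fp, bandwidth):
        """Put a relay from a newer consensus into the closest-matching
        slice that has not been measured yet."""
        pending = [s for s in queue if not s.measured]
        if not pending:
            return False                  # cycle finished; the relay waits for the next queue
        def distance(s):
            speeds = list(s.relays.values())
            return abs(bandwidth - sum(speeds) / len(speeds))
        best = min(pending, key=distance)
        best.relays[fp] = bandwidth
        return True

    def remove_missing_relays(queue, consensus):
        """Drop relays that disappeared from the consensus out of unmeasured slices."""
        for s in queue:
            if not s.measured:
                for fp in list(s.relays):
                    if fp not in consensus:
                        del s.relays[fp]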
Moving new relays into the closest-matching slice isn't hard, and swapping relays between yet-to-be-measured slices isn't that hard either. The pattern used to select work items is now the main source of complexity - it needs to estimate how long a work item takes to complete, and give out work items so that it always keeps some gaps open for inserting new relays that aren't _too_ far away from that relay's speed. (Which is basically what the scanner separation was set up for.) It could also fly off the rails by going "Man, these fast relays take forever to measure, let's just give out 7 work items of those" - although I'm not sure how bad that would be. Needs a simulator, maybe.
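One possible shape for that selection pattern: track observed completion times per speed band and refuse to hand out more work from a band once its estimated outstanding time passes a cap, so each band keeps some un-dispatched slices as gaps for late-arriving relays and no single band monopolises the measurers. The band labels, the default estimate and the 24-hour cap are assumptions, not measured values:

    def estimate_hours(history, band, default_hours=4.0):
        """Estimate how long one work item from 'band' takes, using the mean of
        previously observed completion times (in hours)."""
        seen = history.get(band, [])
        return sum(seen) / len(seen) if seen else default_hours

    def choose_band(pending_by_band, in_flight_by_band, history, max_outstanding_hours=24.0):
        """Pick which speed band to dispatch the next work item from.

        A band whose estimated outstanding work already exceeds the cap is skipped,
        so its remaining slices stay in reserve for newly appearing relays.
        """
        best_band, best_load = None, None
        for band, pending in pending_by_band.items():
            if not pending:
                continue
            load = len(in_flight_by_band.get(band, ())) * estimate_hours(history, band)
            if load >= max_outstanding_hours:
                continue
            if best_load is None or load < best_load:
                best_band, best_load = band, load
        return best_band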
FWIW, looking at https://bwauth.ritter.vg/bwauth/AA_scanner_loop_times.txt , it seems like (for whatever weird reason) scanner1 took way longer than the others. (Scanner 9 is very different, so ignore that one.)
Scanner 1   5 days, 11:07:27
Scanner 2   3 days, 19:00:03
Scanner 3   2 days, 19:48:15   2 days, 9:36:13
Scanner 4   2 days, 18:42:21   2 days, 19:41:16
Scanner 5   2 days, 13:21:20   2 days, 11:20:53
Scanner 6   2 days, 20:19:48   2 days, 13:46:30
Scanner 7   2 days, 9:04:49    2 days, 12:50:34
Scanner 8   2 days, 14:31:50   2 days, 15:05:28
Scanner 9   20:29:42  20:52:32  14:42:08  13:59:27  10:25:06  9:27:27  9:52:41  9:52:36  14:52:38  15:09:23