On 2015-06-29 17:51, Speak Freely wrote:
Hello,
First of all, I love Tor. I love Tor Browser, and I love running relays.
When the problems are solved, I will most likely spin up more relays.
I'm leaving my fastest relay running, as a method of checking the status for myself. The rest have already started to expire, and within the next week or so most of the other ones will have expired as well.
I'm going to try the tor 0.2.7.1-alpha build and change fingerprints, per a suggestion from s7r, since I have nothing to lose.
I just wish the bwauths could scan relays based on previous relative consensus weights... If this particular relay was at 27000, it should be higher on the list to check than another one I have at 487. My one relay was blazing fast with thousands of connections, while my
Well, relays are ranked by capacity and split over several scanner processes, so they are measured against their relative peers. But it seems that when they fall out of the measurement process ('Unmeasured') they must start again at the beginning. This is expected behavior: all relays start Unmeasured and gradually climb in the consensus (per relative capacity), which dampens sudden changes and limits sybil attacks by requiring relays to stick around for a while, increasing the cost to an adversary. However, historically long-running relays arguably should not start back at the bottom after being unmeasured for only a short period of time.
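A minimal sketch of the ranking-and-splitting idea (illustrative Python; `split_into_slices`, the fingerprints, and the weights are invented for illustration, not the actual scanner code):

```python
# Hypothetical: partition relays, sorted by consensus weight, into
# contiguous slices, one per scanner process. A relay that becomes
# Unmeasured drops to the bottom and climbs back up over time.

def split_into_slices(relays, num_scanners):
    """relays: list of (fingerprint, consensus_weight) pairs."""
    ranked = sorted(relays, key=lambda r: r[1], reverse=True)
    slice_size = -(-len(ranked) // num_scanners)  # ceiling division
    return [ranked[i:i + slice_size]
            for i in range(0, len(ranked), slice_size)]

relays = [("A", 27000), ("B", 9000), ("C", 487), ("D", 20)]
slices = split_into_slices(relays, 2)
# the fastest relays land in the first slice; a relay reset to the
# bottom (weight near 20) is measured alongside the slowest peers
```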
We are in the process of testing an increase in the number of scanner (and accompanying tor) instances from 4 to 9: double the current count, plus one dedicated to currently unmeasured relays. The goal is to decrease the time each fraction of the network takes to measure and to ensure that new or unmeasured relays are measured often. There are additional patches that introduce extra exits into a slice of relays when there are no suitable exits to measure with. This likely won't address the behavior above, but we hope it will reduce the number of relays that go missing. So far we have mixed results: one Bandwidth Authority operator reports minimal (50) unmeasured relays, and another reports ~600. These numbers are not directly comparable because they were not sampled at the same time, and they may not be representative of typical behavior; it's a little too soon to tell.
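As back-of-the-envelope arithmetic for why more scanners should help, assuming an even split and a fixed per-scanner throughput (both numbers below are assumptions, not measured values):

```python
# Illustrative only: relay count and per-scanner throughput are assumed.
total_relays = 7000
relays_per_hour = 50  # assumed measurement throughput of one scanner

def hours_per_pass(num_scanners):
    """Time for one scanner to cover its slice, given an even split."""
    return total_relays / num_scanners / relays_per_hour

four = hours_per_pass(4)   # 35.0 hours per full pass of a slice
nine = hours_per_pass(9)   # roughly 15.6 hours per full pass
```

Under these assumptions, going from 4 to 9 scanners a little more than halves the time an unmeasured relay waits for its next measurement.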
It's a bit tricky to test these changes on the live tor network, demonstrate that they produce sane results, and convince the directory authority operators and partner bandwidth authority operators to upgrade, and we don't want to do all of that at once; gradual change is better. So the goal is to produce results that will convince operators to update, improve the situation for relay operators, and then start looking at longer-term solutions to the measurement problem that are more maintainable and scalable in the long run.
other is painfully useless with dozens, and my fastest one lost its consensus while the slowest one kept its consensus. It just seems silly. That said, I don't know whether the bwauths scan in any particular order or just willy-nilly (that's not entirely true; I know it's segmented to some degree, as I recall reading a blog post about how it's chopped up), but I'd be much less upset if my best relays worked and my worst relays didn't. More complaining... bleh.
I hope to have a testable hypothesis as to why your faster relays suffer(ed) more than the slower ones: it could be that the fraction of the network (by capacity) allocated to a particular scanner is not well balanced, and that fraction takes significantly longer to measure. To evaluate that, I need to understand the common characteristics of relays that become unmeasured or lose rank, see whether they come from a similar segment of the network, and check whether that segment takes longer to measure than other segments.
Another hypothesis is that your relays are on the boundary between two segments, and that a transition between scanner instances causes enough missed measurements to drop your relays. It would be helpful to know the rank of the last good measurement your relays (or others) had before becoming unmeasured.
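The boundary hypothesis could be checked with something like the following sketch (`near_boundary`, the slice size, and the margin are all assumptions made for illustration):

```python
# Hypothetical: a relay whose rank falls within `margin` positions of a
# slice boundary could bounce between scanner instances as ranks shift,
# missing measurements during each transition.
def near_boundary(rank, slice_size, margin=5):
    pos = rank % slice_size
    return pos < margin or pos >= slice_size - margin

# with 500-relay slices, ranks 498 and 502 straddle a boundary
flags = [near_boundary(r, 500) for r in (250, 498, 502)]
```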
It will require some cooperation with the existing deployed Bandwidth Authorities to learn what their current scan times are. I will be writing some simple scripts to scrape these results so that we can collect and publish some useful heuristics about the scanner processes and better debug this problem.
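The scraper side could start as small as a line parser, assuming the bandwidth-authority output style of one relay per line with space-separated key=value fields (the exact field names below are an assumption):

```python
# Sketch: parse one relay entry from a bandwidth-authority results file,
# assuming space-separated key=value fields per line, e.g.
#   node_id=$AAAA bw=27000 measured_at=1435600000
def parse_bw_line(line):
    return dict(field.split("=", 1) for field in line.split())

entry = parse_bw_line("node_id=$AAAA bw=27000 measured_at=1435600000")
# entry["measured_at"] timestamps would let us compute per-slice
# scan times across authorities
```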
One thing I would like to point out, though... it appears these problems have at least a casual relationship with MyFamily.
One MyFamily group is completely done for: all of them stuck at a consensus weight of 20. Another MyFamily group is working happily. I've been running tests over the past few months trying to understand why I keep having problems, and one thing has consistently popped up: MyFamily.
That is very interesting, because MyFamily should have nothing to do with the scanner process at all - I'll need to think about this some more.
As one MyFamily group lost consensus, another family gained it back at around the same time.
Yes, especially nusenu, I know I'm supposed to have it all configured under one MyFamily... But in a way I'm glad I didn't, as the casual relationship I see could only have been noticed by doing what I did.
I say casual because I have no proof of causation. But... it is interesting. If no one else has experienced similar problems, then I'd chalk it up to a completely unexpected, unrelated set of mysterious circumstances that should not have happened and for which there is no explanation.
Aaron, if there is anything I can do to help you please let me know.
If anything that I said above sparks a thought, please let me know :)
So in conclusion, I'm not done, I'm just not happy.
This was supposed to be a short email, oops.
Matt Speak Freely