On 29 Oct 2017, at 01:19, George Kadianakis <desnacked@riseup.net> wrote:
> Hey Tim,
> just wanted to ask a clarifying question wrt #21969.
> First of all, there are various forms of #21969 (aka the "missing descriptors for some of our primary entry guards" issue). Sometimes it occurs for 10 minutes and then goes away, whereas for other people it disables their service permanently (until restart). I call this the hardcore case of #21969. It has happened to me and disabled my service for days, and I've also seen it happen to other people (e.g. dgoulet).
> So. We have found various md-related bugs and put them as children of #21969. Do you think we have found the bugs that can cause the hardcore case of #21969? That is, are any of these bugs (or a bug combo) capable of permanently disabling an onion service?
Yes, this bug is disabling:
#23862, where we don't update guard state unless we have enough directory info.
When tor gets into a state where it doesn't have enough directory info due to another bug, this bug makes sure it never gets out of that state: it will never mark its directory guards as up when it gets a new consensus, and therefore it will never fetch microdescs, discover that it has enough directory info, and build circuits.
That's why I made sure we fixed it as soon as possible. I'm glad it's in the latest alpha.
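To make the circular dependency concrete, here's a toy model in Python. It is only a sketch: the class and field names are invented for illustration, and this is not how the tor codebase actually structures the logic.

    # Toy model of the #23862 deadlock. All names here are invented for
    # illustration; this is not tor's actual code.
    class ToyClient:
        def __init__(self):
            self.have_enough_dir_info = False
            self.dir_guards_marked_up = False

        def on_new_consensus(self):
            # The bug: guard state only gets refreshed once we already
            # have enough directory info...
            if self.have_enough_dir_info:
                self.dir_guards_marked_up = True

        def try_fetch_microdescs(self):
            # ...but reaching "enough directory info" requires fetching
            # microdescs through a directory guard that is marked up.
            if self.dir_guards_marked_up:
                self.have_enough_dir_info = True

    client = ToyClient()
    for _ in range(100):                 # no number of new consensuses helps
        client.on_new_consensus()
        client.try_fetch_microdescs()
    print(client.have_enough_dir_info)   # False: stuck until restart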
And this bug (along with a few of the other #21969 children) triggers it:
#23817, where we keep trying directory guards even though they don't have the microdescriptors we want, on an exponential backoff.
It causes tor to check for new microdescriptors only after a very long time (days or weeks), which means the microdescs can expire before they are refreshed.
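To give a rough sense of the timescale, here's a quick back-of-the-envelope calculation; the 1-minute base delay and the doubling factor are made-up illustrative numbers, not tor's actual download schedule.

    # Illustrative exponential backoff: 1-minute base delay, doubling on
    # every failed attempt. These numbers are invented for the example.
    delay_s, elapsed_s = 60, 0
    for attempt in range(1, 15):
        elapsed_s += delay_s
        print("attempt %2d after waiting %7.1f h (total %5.2f days)"
              % (attempt, delay_s / 3600.0, elapsed_s / 86400.0))
        delay_s *= 2    # double the delay after each failure

    # Attempt 14 comes ~136 hours after attempt 13, about 11 days after
    # the first failure -- the "days or weeks" timescale mentioned above.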
> It seems to me that all the bugs identified so far can only cause #21969 to occur for a few hours before it heals itself. IIUC, even the most fundamental bugs like #23862 and #23863 are only temporary, since eventually one of the dirguards will fetch the missing mds and give them to the client. Do you think that's the case?
No, the current set of bugs can block microdesc fetches forever. And even if the fetches do happen eventually, "eventually" on an exponential backoff is indistinguishable from "forever" over short time frames. (This is by design; it's part of the definition of an exponential backoff.)
> I'm asking you because I plan to spend some serious time next week on #21969-related issues, and I'd like to prioritize between bug hunting and bug fixing. That is, if the root cause of the hardcore case of #21969 is still out there, I'd like to continue bug hunting until I find it.
> Let me know what you think! Perhaps you have other ideas about how we should approach this issue.
Fix #23817 by implementing a failure cache and going to a fallback if all primary guards fail. I think that would be a solution for #23863 as well.
And if a few fallbacks don't have the guard's microdesc, mark the guard as down. It's likely that its microdesc just isn't on the network for some reason.
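Here's a rough sketch of how that failure cache might look. All names and thresholds are invented for illustration; it's not a patch against tor, just the shape of the idea.

    # Sketch of a microdesc failure cache: track which directory guards
    # failed to serve the microdescs we need, go to a fallback once every
    # primary guard has failed, and mark a guard down when the fallbacks
    # don't have its microdesc either. Purely illustrative.
    from collections import defaultdict

    md_fetch_failures = defaultdict(int)   # dir guard fingerprint -> failures

    def note_md_fetch_failure(dir_guard_fp):
        md_fetch_failures[dir_guard_fp] += 1

    def should_try_fallback(primary_guard_fps):
        # Go to a fallback once all primary guards have failed at least once.
        return all(md_fetch_failures[fp] > 0 for fp in primary_guard_fps)

    def should_mark_guard_down(fallbacks_missing_guard_md):
        # If a few fallbacks also lack this guard's microdesc, the md is
        # probably just not on the network; stop trying to use the guard.
        return fallbacks_missing_guard_md >= 2   # "a few" -- arbitrary choice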
Fix #23985, the 10-minute wait when we have fewer than 15 microdescs, by changing it to an exponential backoff. Otherwise, if we handle it specially when it includes our primary guards, clients will leak that their primary guards are in this small set. (And if we're using an exponential backoff, the failure cache from #23817 will kick in, so we'll check fallbacks, then mark the primary guard down.)
After that, I'd put these fixes out in an alpha, and wait and see if the issue happens again.
T