[tor-dev] DirAuth usage and 503 try again later

Sebastian Hahn hahn.seb at web.de
Sat Jan 16 01:06:42 UTC 2021


Hi James,

thanks for already working on patches for these issues! I will reply
inline some more.

> On 15. Jan 2021, at 23:56, James <jbrown299 at yandex.com> wrote:
> 
> First of all, sorry if torpy hurt the Tor network in some way. It was unintentional.

I believe you :)

> In any case, it seems to me that a high-level description of the official tor client's logic would be very useful.

Indeed. The more people work on alternative clients etc, the more we can
learn here. Perhaps you can help point out places where documentation
could help or something was not easy to understand.

> >First, I found this string in the code: "Hardcoded into each Tor client
> >is the information about 10 beefy Tor nodes run by trusted volunteers".
> >The word beefy is definitely wrong here. The nodes are not particularly
> >powerful, which is why we have the fallback dir design for
> >bootstrapping.
> At first glance, it seemed that the AuthDirs were the most trusted and reliable place for obtaining the consensus. Now I understand more.

The consensus is signed, so all the places to get it from are equally
trusted. That's the beauty of the consensus system :) The dirauths
are just trusted to create it; it doesn't matter who distributes it.
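
To make that concrete: validity is a property of the document itself,
never of the server that delivered it. A rough Python sketch (the
objects, attribute names and the majority threshold below are all
hypothetical, not torpy's API; the real rules are in dir-spec):

def consensus_is_valid(consensus, authority_keys):
    """Accept a consensus if enough known directory authorities signed it.

    Sketch only: 'consensus' and 'authority_keys' are hypothetical
    objects, and the majority threshold is just for illustration.
    """
    good_signatures = 0
    for sig in consensus.signatures:
        key = authority_keys.get(sig.identity_fingerprint)
        if key is not None and key.verify(consensus.signed_digest, sig.signature):
            good_signatures += 1
    # It makes no difference whether the document came from a dirauth,
    # a fallback dir, a guard, or a local cache: only the signatures count.
    return good_signatures > len(authority_keys) // 2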

> >Once this
> >happens, torpy goes into a deathly loop of "consensus invalid,
> >trying again". There are no timeouts, backoffs, or failures noted.
> Not really, because torpy only retries getting the consensus 3 times. But you are probably right, because user code can call torpy in a loop, which would keep trying to download the network_status... If you have some sort of statistics about the increased traffic, we could compare them with the times when the consensus was signed by only 4 signers, which is enough for tor but not enough for torpy.

Interesting, when I ran torpy it seemed to try more often on the
console. Perhaps it made some progress and then failed on a different
step, which it then retried.

To your second point, something like this can probably be done using
https://metrics.torproject.org. But I am not doing the analysis here
at the moment for personal reasons, sorry. Maybe someone else wants
to look at it.

> >The code frequently throws exceptions, but when an exception occurs
> >it just continues doing what it was doing before. It has absolutely
> >no regard for constraining its resources when using the Tor network.
> What kind of constraints would you advise?

I think instead of throwing an exception and continuing, you should
give clear error messages and consider whether you need to stop
execution. For example, if you downloaded a consensus and it is
invalid, you're likely not going to get a valid one by trying again
immediately. Instead, it would be better to report which server gave
you the invalid one and log a sensible error.

In addition, properly using already downloaded directory information
would be a much more considerate use of resources.
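
As a sketch of what I mean (reusing the consensus_is_valid sketch from
above; the source object, its attributes and the logger name are again
hypothetical):

import logging

log = logging.getLogger("torpy.directory")  # hypothetical logger name

class InvalidConsensusError(Exception):
    """A downloaded consensus failed signature validation."""

def fetch_consensus_once(source, authority_keys):
    """Download a consensus from one source; fail loudly instead of looping.

    Retrying the same request immediately is very unlikely to return a
    different document, so the caller should stop (or fall back to
    already-cached directory information) rather than retry.
    """
    consensus = source.download_consensus()   # hypothetical helper
    if not consensus_is_valid(consensus, authority_keys):
        log.error("Invalid consensus from %s, not retrying", source.nickname)
        raise InvalidConsensusError(source.nickname)
    return consensus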

> >The logic that if a network_status document was already downloaded that
> >is used rather than trying to download a new one does not work.
> It works, but probably not in an optimal way. It only caches the network_status.

I may have confused it with asking for the diff. But that should not
be necessary at all if you already have the latest one, so don't ask
for a diff in this case.
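
The decision could look something like this (sketch only; I'm assuming
the cached object exposes the valid-until time and the SHA3-256 digest
of the signed document, and I'm skipping fresh-until and the randomized
refresh delays real clients use):

from datetime import datetime, timezone

def plan_consensus_fetch(cached_consensus):
    """Return what, if anything, to request from a directory server."""
    now = datetime.now(timezone.utc)
    if cached_consensus is None:
        return ("fetch-full", None)
    if now < cached_consensus.valid_until:
        # We already hold a live consensus: no diff, no full download.
        return ("use-cache", None)
    # Our copy expired; ask for a diff against it instead of the full
    # document (the directory protocol takes the digest of what we have).
    return ("fetch-diff", cached_consensus.sha3_digest)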

> >I have
> >a network_status document, but the dirauths are contacted anyway.
> >Perhaps descriptors are not cached to disk and downloaded on every new
> >start of the application?
> 
> Exactly. The descriptors and the hourly network_status diff were always requested from the AuthDirs.

Please cache descriptors.
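
Even a very small digest-keyed cache on disk would avoid re-downloading
the same descriptors on every start. Sketch (the cache path is made up):

import os

class DescriptorDiskCache:
    """Minimal on-disk descriptor cache, keyed by descriptor digest."""

    def __init__(self, cache_dir="~/.cache/torpy/descriptors"):  # hypothetical path
        self.cache_dir = os.path.expanduser(cache_dir)
        os.makedirs(self.cache_dir, exist_ok=True)

    def _path(self, digest_hex):
        return os.path.join(self.cache_dir, digest_hex)

    def get(self, digest_hex):
        """Return the cached descriptor body, or None if we never stored it."""
        try:
            with open(self._path(digest_hex)) as f:
                return f.read()
        except FileNotFoundError:
            return None

    def put(self, digest_hex, descriptor_body):
        with open(self._path(digest_hex), "w") as f:
            f.write(descriptor_body)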

> >New consensuses never seem to be downloaded from guards, only from
> >dirauths.
> Thanks for pointing that out. I looked more deeply into the tor client sources. Basically, if we already have a network_status, we can ask guard nodes for the new network_status and descriptors; otherwise we use the fallback dirs to download the network_status. I've implemented this logic in the last commit.

Cool!
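
For reference, the source selection you describe boils down to
something like this (all names hypothetical):

import random

def pick_directory_source(cached_consensus, guards, fallback_dirs):
    """Choose where to fetch directory documents from.

    Sketch of the logic above: once any usable consensus (and therefore
    a set of guards) is known, routine fetches go to a guard; only the
    very first bootstrap uses the hard-coded fallback directory mirrors.
    Directory authorities are not contacted at all on this path.
    """
    if cached_consensus is not None and guards:
        return random.choice(guards)
    return random.choice(fallback_dirs)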

> >- Stop automatically retrying on failure, without backoff
> I've added delays and backoff between retries.
> 
> >- Cache failures to disk to ensure a newly started torpy_cli does not
> >  request the same resources again that the previous instance failed to
> >  get.
> That will go on the list. But even if there is a loop at a level above and this feature is missing, with backoff the delays would be something like 3 sec, 5, 7, 9; then 3, 5, 7, 9 again. Does that seem OK?

Well, the problem is that if I run torpy_cli 100 times in parallel, we
will still send many requests per second, and from the dirauth access
patterns we can see that some people indeed run it like that. So I
think the backoff is a great start (the tor client uses exponential
backoff, I think), but it definitely is not enough. If you couldn't
get something this hour and you already tried a few times, you need to
stop trying again for this hour.
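
A rough sketch of combining both ideas: exponential backoff with jitter
inside a run, plus a small file remembering that we already gave up
this hour, so freshly started instances don't begin hammering again.
The file location and helper names are made up, and a real version
would need locking for parallel instances:

import json
import os
import random
import time
from datetime import datetime, timezone

FAILURE_FILE = os.path.expanduser("~/.cache/torpy/dir_failures.json")  # hypothetical path
MAX_ATTEMPTS = 3

def _current_hour():
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H")

def _load_failures():
    try:
        with open(FAILURE_FILE) as f:
            return json.load(f)
    except (FileNotFoundError, ValueError):
        return {}

def gave_up_this_hour(resource):
    """True if a previous run already exhausted its attempts for this resource."""
    return _load_failures().get(resource) == _current_hour()

def record_give_up(resource):
    failures = _load_failures()
    failures[resource] = _current_hour()
    os.makedirs(os.path.dirname(FAILURE_FILE), exist_ok=True)
    with open(FAILURE_FILE, "w") as f:
        json.dump(failures, f)

def fetch_with_backoff(resource, fetch_fn):
    """Try a few times with exponential backoff, then give up until next hour."""
    if gave_up_this_hour(resource):
        raise RuntimeError("already failed to fetch %r this hour, not retrying" % resource)
    delay = 3.0
    for attempt in range(MAX_ATTEMPTS):
        try:
            return fetch_fn()
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                record_give_up(resource)
                raise
            # Jitter keeps many parallel clients from retrying in lockstep.
            time.sleep(delay + random.uniform(0, delay))
            delay *= 2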

> >Defenses are probably necessary to implement even if
> >torpy can be fixed very quickly, because the older versions of torpy
> >are out there and I assume will continue to be used. Hopefully that
> >point is wrong?
> I believe the old versions don't work any more because they cannot connect to the auth dirs. Users are getting 503 many times, so they will update the client. I hope.


Would be nice. We'll see!

Thanks
Sebastian

