[tor-dev] Remote descriptor fetching
karsten at torproject.org
Mon Jun 10 06:29:51 UTC 2013
On 6/9/13 5:07 AM, Damian Johnson wrote:
>> Indeed, this would be pretty bad. I'm not convinced that moria1
>> provides truncated responses though. It could also be that it
>> compresses results for every new request and that compressed responses
>> randomly differ in size, but are still valid compressions of the same
>> input. Kostas, do you want to look more into this and open a ticket if
>> this really turns out to be a bug?
> Tor clients use the ORPort to fetch descriptors. As I understand it
> the DirPort has been pretty well unused for years, in which case a
> regression there doesn't seem that surprising. Guess we'll see.
> If Kostas wants to lead this investigation then that would be fantastic. :)
>> So, this isn't the super smart downloader that I had in mind, but maybe
>> there should still be some logic left in the application using this API.
>> I can imagine how both DocTor and metrics-db-R could use this API with
>> some modifications. A few comments/suggestions:
> What kind of additional smartness were you hoping for the downloader to have?
I had the idea of configuring the downloader to tell it what downloads
I'm interested in, let it start downloading, and parse returned
descriptors as they come in. But never mind, I think the current API is
a fine abstraction that leaves application-specific logic where it belongs.
>> - There could be two methods get/set_compression(compression) that
>> define whether to use compression. Assuming we get it working.
> Good idea. Added.
>> - If possible, the downloader should support parallel downloads, with at
>> most one parallel download per directory. But it's possible to ask
>> multiple directories at the same time. There could be two methods
>> get/set_max_parallel_downloads(max) with a default of 1.
> Usually I'd be all for parallelizing our requests to both improve
> performance and distribute load. However, tor's present interface
> doesn't really encourage it. There's no way of saying "get half of the
> server descriptors from location X and the other half from location
> Y". You can only request specific descriptors or all of them.
> Are you thinking that the get_server_descriptors() and friends should
> only try to parallelize when given a set of fingerprints? If so then
> that sounds like a fine idea.
I was only thinking of parallelizing requests for a given set of
fingerprints. We can only request at most 96 descriptors at a time, so
it's easy to make requests in parallel.
I agree that this doesn't apply to requests for all descriptors.
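For what it's worth, the batching side of this is simple in Python. Here's a sketch; the function name is made up for illustration and isn't part of the proposed API:

```python
def batch_fingerprints(fingerprints, batch_size=96):
    """Yield successive batches of at most batch_size fingerprints,
    96 being the per-request limit mentioned above. Each batch could
    then be handed to its own worker, one in-flight request per
    directory endpoint."""
    for i in range(0, len(fingerprints), batch_size):
        yield fingerprints[i:i + batch_size]
```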
>> - I'd want to set a global timeout for all things requested from the
>> directories, so a get/set_global_timeout(seconds) would be nice. The
>> downloader could throw an exception when the global download timeout
>> elapses. I need such a timeout for hourly running cronjobs to prevent
>> them from overlapping when things are really, really slow.
> How does the global timeout differ from our present set_timeout()?
AIUI, the current timeout is for a single HTTP request, whereas the
global timeout is for all HTTP requests made for a single API method.
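A sketch of how such a global budget could be shared across several requests (the class and method names here are made up for illustration, not part of the API):

```python
import time

class GlobalTimeout(Exception):
    """Raised when the overall deadline for an API method has passed."""

class Deadline(object):
    """Tracks a global time budget shared by all HTTP requests made
    for a single method call (illustrative sketch)."""

    def __init__(self, seconds):
        self.expires_at = time.time() + seconds

    def remaining(self):
        """Seconds left in the budget; raises once it is exhausted."""
        left = self.expires_at - time.time()
        if left <= 0:
            raise GlobalTimeout('global download timeout elapsed')
        return left

# Each individual request would then use
# min(per_request_timeout, deadline.remaining()) as its socket
# timeout, so the method as a whole never exceeds the global budget.
```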
>> - Just to be sure, get/set_retries(tries) is meant for each endpoint, right?
> Yup, clarified.
>> - I don't like get_directory_mirrors() as much, because it does two
>> things: make a network request and parse it. I'd prefer a method
>> use_v2dirs_as_endpoints(consensus) that takes a consensus document and
>> uses the contained v2dirs as endpoints for future downloads. The
>> documentation could suggest to use this approach to move some load off
>> the directory authorities and to directory mirrors.
> Very good point. Changed to a use_directory_mirrors() method, callers
> can then call get_endpoints() if they're really curious what the
> present directory mirrors are (which I doubt they often will).
>> - Related note: I always look if the Dir port is non-zero to decide
>> whether a relay is a directory. Not sure if there's a difference to
>> looking at the V2Dir flag.
> Sounds good. We'll go for that instead.
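Picking mirrors out of a consensus by non-zero DirPort could look roughly like this; a minimal line-based sketch, not a real network status parser:

```python
def directory_mirrors(consensus_text):
    """Extract (address, dirport) pairs for relays advertising a
    non-zero DirPort from a network status document. Relies on the
    'r' line layout, which ends with: address, ORPort, DirPort."""
    mirrors = []
    for line in consensus_text.splitlines():
        if line.startswith('r '):
            fields = line.split(' ')
            address, dirport = fields[-3], int(fields[-1])
            if dirport != 0:
                mirrors.append((address, dirport))
    return mirrors
```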
>> - All methods starting at get_consensus() should be renamed to fetch_*
>> or query_* to make it clear that these are no getters but perform actual
>> network requests.
> Going with fetch_*.
>> - All methods starting at get_consensus() could have an additional
>> parameter for the number of copies (from different directories) to
>> download. The default would be 1. But in some cases people might be
>> interested in having 2 or 3 copies of a descriptor to compare if there
>> are any differences, or to compare download times (more on this below).
>> Also, a special value of -1 could mean to download every requested
>> descriptor from every available directory. That's what I'd do in DocTor
>> to download the consensus from all directory authorities.
>> - As for download times, is there a way to include download meta data in
>> the result of get_consensus() and friends? I'd be interested in the
>> directory that a descriptor was downloaded from and in the download time
>> in millis. This is similar to how I'm interested in file meta data in
>> the descriptor reader, like file name or last modified time of the file
>> containing a descriptor.
> This sounds really specialized. If callers cared about the download
> times then that seems best done via something like...
> endpoints = ['location1', 'location2'... etc]
> for endpoint in endpoints:
>   try:
>     start_time = time.time()
>     # ... download from the endpoint here ...
>     print "endpoint %s took: %0.2f" % (endpoint, time.time() - start_time)
>   except IOError, exc:
>     print "failed to use %s: %s" % (endpoint, exc)
The downside of that approach is that it doesn't make requests in
parallel, unless the application parallelizes requests, which I hear
isn't trivial in Python. If we can, we should help the application make
requests in parallel. Some directories are really slow and can block
the application too long if it's only doing one request at a time. (I
agree that timeouts can solve that problem to some extent, but a timeout
that's chosen too low can also be problematic.)
So, I guess we should either have the API do parallel requests, or
describe in a tutorial how to write an application that uses the API to
make parallel requests.
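For the tutorial route, a thread-per-endpoint sketch like the following might already be enough. It assumes some fetch function that takes an endpoint and returns a document; none of these names come from the actual API:

```python
import threading
import time

def fetch_timed(fetch_func, endpoint, results):
    """Run one download and record (seconds, result-or-error)."""
    start = time.time()
    try:
        result = fetch_func(endpoint)
        results[endpoint] = (time.time() - start, result)
    except IOError as exc:
        results[endpoint] = (time.time() - start, exc)

def fetch_parallel(fetch_func, endpoints):
    """Issue one request per endpoint concurrently, so a slow
    directory only delays its own result, not the whole run."""
    results = {}
    threads = [threading.Thread(target=fetch_timed,
                                args=(fetch_func, ep, results))
               for ep in endpoints]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```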
By the way, here's an idea how the API could add meta data to
descriptors: it could add annotations like "@downloaded-from" and
"@downloaded-millis" to descriptors. (Speaking of, is there an easy way
to extract the descriptor string without annotations?)
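To illustrate, adding and stripping such metadata could look like this; the @downloaded-* names are just the ones suggested above:

```python
def add_annotations(descriptor_text, source, millis):
    """Prepend download metadata as @-annotation lines."""
    header = '@downloaded-from %s\n@downloaded-millis %d\n' % (source, millis)
    return header + descriptor_text

def strip_annotations(annotated_text):
    """Return the descriptor string without leading @-annotations."""
    lines = annotated_text.splitlines(True)
    while lines and lines[0].startswith('@'):
        lines.pop(0)
    return ''.join(lines)
```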
>> - Can you add a fetch|query_votes(fingerprints) method to request vote documents?
> Added a fetch_vote(authority) to provide an authority's
> NetworkStatusDocument vote by querying
> 'http://<hostname>/tor/status-vote/next/authority.z'. However, I'm not
> clear from the spec how you can query for specific relays (unless you
> mean fingerprints to be the authority fingerprints).
I meant fingerprints of authorities.
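For reference, building the vote URL and inflating the .z (zlib-compressed) response could look like this; the actual HTTP fetch is left out:

```python
import zlib

def vote_url(hostname):
    """Build the URL for an authority's next vote; the .z suffix asks
    the directory for a zlib-compressed response."""
    return 'http://%s/tor/status-vote/next/authority.z' % hostname

def decompress_response(body):
    """Inflate a .z response body into the vote document text."""
    return zlib.decompress(body).decode('utf-8')
```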