[tor-dev] Remote descriptor fetching

Tue May 28 07:13:06 UTC 2013

On 5/28/13 1:50 AM, Damian Johnson wrote:
> Hi Karsten. I'm starting to look into remote descriptor fetching, a
> capability of metrics-lib that stem presently lacks [1][2]. The spec
> says that mirrors provide zlib compressed data [3], and the
> DirectoryDownloader handles this via an InflaterInputStream [4].
> 
> So far, so good. By my read of the man pages this means that gzip or
> python's zlib module should be able to handle the decompression.
> However, I must be missing something...
> 
> % wget http://128.31.0.34:9131/tor/server/all.z
> 
> % file all.z
> all.z: data
> 
> % gzip -d all.z
> gzip: all.z: not in gzip format
> 
> % zcat all.z
> gzip: all.z: not in gzip format
> 
> % python
> >>> import zlib
> >>> with open('all.z') as desc_file:
> ...   print zlib.decompress(desc_file.read())
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
> zlib.error: Error -5 while decompressing data: incomplete or truncated stream
> 
> Maybe a fresh set of eyes will spot something I'm obviously missing.
> Spotting anything?

Hmmm, that's a fine question.  I remember this was tricky in Java and
took me a while to figure out.  I did a quick Google search, but I
didn't find a way to decompress tor's .z files using shell commands or
Python. :/
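
One thing that might be worth ruling out, and this is only a guess:
Error -5 is what zlib.decompress() reports when a stream ends early, so
the download itself may have been cut short rather than the format being
unusual. Here is a minimal sketch, assuming the payload is an ordinary
zlib stream as the spec says, that uses a decompression object to
recover whatever did arrive. The sample data is made up, and
inflate_partial() is a hypothetical helper name, not part of stem or
metrics-lib.

```python
import zlib


def inflate_partial(data):
    """Decompress as much of a zlib stream as is available.

    Returns (text, complete); complete is False if the stream ended early.
    """
    inflater = zlib.decompressobj()
    text = inflater.decompress(data)
    return text, inflater.eof


# Made-up stand-in for a descriptor download, cut short on purpose.
original = b"".join(
    b"router relay%d 10.0.0.%d 9001 0 9030\n" % (i, i % 256)
    for i in range(500)
)
truncated = zlib.compress(original)[:-10]

# zlib.decompress(truncated) would raise zlib.error here, whereas the
# decompression object hands back the part it could inflate.
text, complete = inflate_partial(truncated)
```

If complete comes back False for a real all.z, that would point at the
transfer rather than at the compression format.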

How about we focus on the API first and ignore the fact that compressed
responses exist?

> Speaking of remote descriptor fetching, any thought on the API? I'm
> thinking of a 'stem/descriptor/remote.py' module with...
> 
> * get_directory_authorities()
> 
> List of hardcoded (IP, DirPorts) tuples for tor's authorities. Ideally
> we'd have an integ test to notify us when our listing falls out of
> date. However, it looks like the controller interface doesn't surface
> this. Is there a nice method of determining the present authorities
> besides polling the authorities array of 'src/or/config.c' [5]?
> 
> * fetch_directory_mirrors()
> 
> Polls an authority for the present consensus and filters it down to
> relays with the V2Dir flag. It then uses this to populate a global
> directory mirror cache that's used when querying directory data. This
> can optionally be provided with a Controller instance or cached
> consensus file to use that instead of polling an authority.

(Minor note: if possible, let's separate methods like this into one
method that makes a network request and another method that works only
locally.)
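
To illustrate the split, here is a rough sketch with one function that
touches the network and one that only works on text it is handed. The
function names are hypothetical, and the parsing assumes the "r" and "s"
line layout of a network status consensus as I read dir-spec (IP in the
seventh field of the "r" line, DirPort in the ninth). The sample
consensus lines are invented.

```python
from urllib.request import urlopen


def fetch_consensus(address, dirport, timeout=30):
    """Network step: download the current consensus from one directory."""
    url = "http://%s:%d/tor/status-vote/current/consensus" % (address, dirport)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", "replace")


def parse_directory_mirrors(consensus_text):
    """Local step: (IP, DirPort) tuples for relays with the V2Dir flag."""
    mirrors = []
    endpoint = None
    for line in consensus_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            # r nickname identity digest date time IP ORPort DirPort
            if len(fields) >= 9:
                endpoint = (fields[6], int(fields[8]))
            else:
                endpoint = None
        elif fields[0] == "s" and endpoint is not None:
            # Skip relays without the flag or without an open DirPort.
            if "V2Dir" in fields[1:] and endpoint[1] != 0:
                mirrors.append(endpoint)
            endpoint = None
    return mirrors


# Invented consensus excerpt; only the local step runs here.
sample = (
    "r relayA AAAA BBBB 2013-05-28 06:00:00 10.0.0.1 9001 9030\n"
    "s Fast Running V2Dir Valid\n"
    "r relayB CCCC DDDD 2013-05-28 06:00:00 10.0.0.2 9001 0\n"
    "s Fast Running Valid\n"
)
mirrors = parse_directory_mirrors(sample)
```

With this shape the local function can be fed a cached consensus file
or controller output, and tests never need a network connection.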

> * get_directory_cache()
> 
> Provides a list of our present directory mirrors. This is a list of
> (IP, DirPort) tuples. If fetch_directory_mirrors() hasn't yet been
> called this is the directory authorities.
> 
> * query(descriptor_type, fingerprint = None, retries = 5)
> 
> Picks a random relay from our directory mirror cache, and attempts to
> retrieve the given type of descriptor data. Arguments behave as
> follows...
> 
> descriptor_type (str): Type of descriptor to be fetched. This is the
> same as our @type annotations [6]. This raises a ValueError if the
> descriptor type isn't available from directory mirrors.
> 
> fingerprint (str, list): Optional argument for the relay or list of
> relays to fetch the descriptors for. This retrieves all relays if
> omitted.
> 
> retries (int): Maximum number of times we'll attempt to retrieve the
> descriptors. We fail over to another randomly selected directory mirror
> when unsuccessful. Our last attempt is always via a directory
> authority. If all attempts are unsuccessful we raise an IOError.
> 
> ========================================
> 
> I'd imagine this would make use of the module something like the following...
> 
> # Simple script to print all of the exits.
> 
> from stem.descriptor import remote
> 
> # Populates our directory mirror cache. This does more harm
> # here than good since we're only making a single request.
> # However, if this was a longer living script doing this would
> # relieve load from the authorities.
> 
> remote.fetch_directory_mirrors()
> 
> try:
>   for desc in remote.query('server-descriptor 1.0'):
>     if desc.exit_policy.is_exiting_allowed():
>       print "%s (%s)" % (desc.nickname, desc.fingerprint)
> except IOError, exc:
>   print "Unable to query the server descriptors: %s" % exc
> 
> ========================================
> 
> Thoughts? Does this cover all of the use cases we'll need this module for?
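
For reference, the retry behavior described for query() above could be
sketched roughly like this. It is only an illustration under
assumptions: query_with_retries() and its fetch callback are
hypothetical names, the addresses are placeholders, and the actual
download is left to the caller so the selection logic can be tested on
its own.

```python
import random


def query_with_retries(fetch, mirrors, authorities, retries=5, rng=random):
    """Try up to `retries` directories; the final attempt uses an authority.

    `fetch` is a caller-supplied function that takes an (IP, DirPort)
    tuple and returns descriptor data, raising IOError on failure.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        is_last = attempt == retries - 1
        # Fall back to the authorities if no mirrors are known yet.
        pool = authorities if (is_last or not mirrors) else mirrors
        endpoint = rng.choice(pool)
        try:
            return fetch(endpoint)
        except IOError as exc:
            last_error = exc
    raise IOError("all %d attempts failed, last error: %s"
                  % (retries, last_error))


# Placeholder endpoints: every mirror is down, the authority answers.
MIRRORS = [("192.0.2.1", 9030), ("192.0.2.2", 9030)]
AUTHORITIES = [("192.0.2.100", 80)]
calls = []


def flaky_fetch(endpoint):
    calls.append(endpoint)
    if endpoint in MIRRORS:
        raise IOError("mirror %s:%d is unreachable" % endpoint)
    return b"descriptor data"


result = query_with_retries(flaky_fetch, MIRRORS, AUTHORITIES, retries=3)
```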

This API looks like a fine way to manually download descriptors, but I
wonder if we can make the downloader smarter than that.

The two main use cases I have in mind are:

1. Download and archive relay descriptors: metrics-db uses different
sources to archive relay descriptors including gabelmoo's cached-*
files.  But there's always the chance to miss a descriptor that is
referenced from another descriptor.  metrics-db (or the Python
equivalent) would initialize the downloader by telling it which
descriptors it's missing, and the downloader would go fetch them.

2. Monitor consensus process for any issues: DocTor downloads the
current consensus from all directory authorities and all votes from any
directory authority.  It doesn't care about server or extra-info
descriptors, but in contrast to metrics-db it cares about having the
consensus from all directory authorities.  Its Python equivalent would
tell the downloader which descriptors it's interested in, let it fetch
those descriptors, and then evaluate the result.

So, the question is: should we generalize these two use cases and make
the downloader smart enough to handle them and maybe future use cases,
or should we leave the specifics in metrics-db and DocTor and keep the
API simple?

Here's what a generalized downloader API might look like:

Phase 1: configure the downloader by telling it:
 - what descriptor types we're interested in;
 - whether we only care about the descriptor content or about
downloading descriptors from specific directory authorities or mirrors;
 - whether we're only interested in descriptors that we didn't know
before, either by asking the downloader to use an internal download
history file or by passing identifiers of descriptors we already know;
 - to prefer directory mirrors over directory authorities as soon as it
has learned about them, and to memorize directory mirrors for future runs;
 - to use directory mirrors from the soon-to-be-added fallback directory
list (#8374);
 - parameters like timeouts and maximum retries; and
 - parameters to the descriptor parser that will handle downloaded contents.

Phase 2: run downloads and pass retrieved descriptors (including
information about the directory it downloaded from, the download time,
and maybe other metadata) in an iterator similar to what the descriptor
reader does.

Phase 3: when all downloads are done and downloaded descriptors are
processed by the application:
- query the download history or
- ask the downloader to store its download history.
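
To make the three phases a bit more concrete, here is a very rough
sketch. All class, field, and method names are invented for
illustration, the fetch callback stands in for the real network code,
and it asks every configured directory (the DocTor-style case) while
skipping most of the options listed under phase 1.

```python
import time
from dataclasses import dataclass, field


@dataclass
class DownloaderConfig:
    """Phase 1: what the application tells the downloader up front."""
    descriptor_types: list
    directories: list                             # (IP, DirPort) tuples
    known_ids: set = field(default_factory=set)   # already archived
    timeout: float = 30.0


@dataclass
class DownloadResult:
    descriptor_type: str
    descriptor_id: str
    content: bytes
    directory: tuple       # which directory served it
    downloaded_at: float   # Unix timestamp of the download


class Downloader:
    def __init__(self, config, fetch):
        # fetch(directory, descriptor_type, timeout) is supplied by the
        # caller and returns an iterable of (descriptor_id, content).
        self.config = config
        self.fetch = fetch
        self.history = set(config.known_ids)

    def run(self):
        """Phase 2: yield descriptors we didn't know, with metadata."""
        for descriptor_type in self.config.descriptor_types:
            for directory in self.config.directories:
                found = self.fetch(directory, descriptor_type,
                                   self.config.timeout)
                for desc_id, content in found:
                    if desc_id in self.history:
                        continue
                    self.history.add(desc_id)
                    yield DownloadResult(descriptor_type, desc_id, content,
                                         directory, time.time())

    def download_history(self):
        """Phase 3: identifiers seen so far, for the application to store."""
        return frozenset(self.history)


def fake_fetch(directory, descriptor_type, timeout):
    # Stand-in for the network: every directory serves the same two items.
    return [("id-1", b"first"), ("id-2", b"second")]


config = DownloaderConfig(
    descriptor_types=["server-descriptor 1.0"],
    directories=[("192.0.2.1", 9030), ("192.0.2.2", 9030)],
    known_ids={"id-1"},
)
downloader = Downloader(config, fake_fetch)
results = list(downloader.run())
```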

Note that the downloader could do all kinds of smart things in phase 2,
like concatenating up to 96 descriptors in a single request, switching
to all.z if there are many more descriptors to download, round-robining
between directories, making requests in parallel, etc.
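
As a small illustration of two of those ideas, the sketch below groups
fingerprints into requests of at most 96 and hands the requests out to
directories in turn. The function names are made up, the
"/tor/server/fp/<fp1>+<fp2>" path form is my reading of dir-spec, and
the fingerprints and addresses are placeholders.

```python
import itertools

MAX_PER_REQUEST = 96  # the per-request limit mentioned above


def batch_resource_paths(fingerprints, batch_size=MAX_PER_REQUEST):
    """Group fingerprints into '/tor/server/fp/<fp1>+<fp2>...' paths."""
    fingerprints = list(fingerprints)
    return [
        "/tor/server/fp/" + "+".join(fingerprints[i:i + batch_size])
        for i in range(0, len(fingerprints), batch_size)
    ]


def assign_round_robin(paths, directories):
    """Pair each request path with a directory, cycling through them."""
    return list(zip(itertools.cycle(directories), paths))


# Placeholder fingerprints (40 hex characters each) and directories.
fingerprints = ["%040X" % i for i in range(200)]
directories = [("192.0.2.1", 9030), ("192.0.2.2", 9030)]

paths = batch_resource_paths(fingerprints)
plan = assign_round_robin(paths, directories)
```

Each entry of plan is a (directory, path) pair that a download loop
could work through, sequentially or in parallel.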

If we go for the simple API you suggest above, the application would
have to implement this smart stuff itself.

All the best,
Karsten

> [1] https://trac.torproject.org/8257
> [2] https://gitweb.torproject.org/metrics-lib.git/blob/HEAD:/src/org/torproject/descriptor/impl/DirectoryDownloader.java
> [3] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt#l2626
> [4] http://docs.oracle.com/javase/6/docs/api/java/util/zip/InflaterInputStream.html
> [5] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/or/config.c#l780
> [6] https://metrics.torproject.org/formats.html#descriptortypes
>