Hi Karsten. I'm starting to look into remote descriptor fetching, a capability of metrics-lib that stem presently lacks [1][2]. The spec says that mirrors provide zlib compressed data [3], and the DirectoryDownloader handles this via an InflaterInputStream [4].
So far, so good. By my read of the man pages this means that gzip or python's zlib module should be able to handle the decompression. However, I must be missing something...
% wget http://128.31.0.34:9131/tor/server/all.z

% file all.z
all.z: data

% gzip -d all.z
gzip: all.z: not in gzip format

% zcat all.z
gzip: all.z: not in gzip format

% python
>>> import zlib
>>> with open('all.z') as desc_file:
...   print zlib.decompress(desc_file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
zlib.error: Error -5 while decompressing data: incomplete or truncated stream
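One aside: gzip and zcat expect gzip framing (RFC 1952), so they'll reject a bare zlib stream (RFC 1950) regardless of whether the payload is intact. And if the real problem turns out to be a stream that's cut off mid-download, python's zlib can at least recover the leading portion: a decompressobj returns whatever it managed to inflate rather than raising the way zlib.decompress() does. A small sketch (the truncation here is simulated, since I don't yet know if that's actually what's wrong with all.z)...

```python
import zlib

def decompress_partial(data):
    # Unlike zlib.decompress(), a decompressobj does not insist on
    # seeing the end of the stream: it returns whatever it could
    # inflate so far, which helps diagnose a truncated download.
    return zlib.decompressobj().decompress(data)

# simulate a download that got cut off mid-stream
compressed = zlib.compress(b'router caerSidi 71.35.133.197 9001 0 0\n' * 50)
truncated = compressed[:len(compressed) // 2]

partial = decompress_partial(truncated)  # partial output, no exception
```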
Maybe a fresh set of eyes will help. Spotting anything I'm obviously missing?
Speaking of remote descriptor fetching, any thought on the API? I'm thinking of a 'stem/descriptor/remote.py' module with...
* get_directory_authorities()
List of hardcoded (IP, DirPort) tuples for tor's authorities. Ideally we'd have an integ test to notify us when our listing falls out of date. However, it looks like the controller interface doesn't surface this. Is there a nicer method of determining the present authorities than polling the authorities array of 'src/or/config.c' [5]?
* fetch_directory_mirrors()
Polls an authority for the present consensus and filters it down to relays with the V2Dir flag. It then uses this to populate a global directory mirror cache that's used when querying directory data. This can optionally be provided with a Controller instance or cached consensus file to use that instead of polling an authority.
* get_directory_cache()
Provides a list of our present directory mirrors. This is a list of (IP, DirPort) tuples. If fetch_directory_mirrors() hasn't yet been called this is the directory authorities.
* query(descriptor_type, fingerprint = None, retries = 5)
Picks a random relay from our directory mirror cache, and attempts to retrieve the given type of descriptor data. Arguments behave as follows...
descriptor_type (str): Type of descriptor to be fetched. This is the same as our @type annotations [6]. This raises a ValueError if the descriptor type isn't available from directory mirrors.
fingerprint (str, list): Optional argument for the relay or list of relays to fetch the descriptors for. This retrieves all relays if omitted.
retries (int): Maximum number of times we'll attempt to retrieve the descriptors. We fail over to another randomly selected directory mirror when unsuccessful. Our last attempt is always via a directory authority. If all attempts are unsuccessful we raise an IOError.
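To make the retry behavior concrete, here's a rough, hypothetical skeleton of what I have in mind. None of these names exist in stem yet, and the download itself is stubbed out as a _fetch callback so the fallback logic can stand on its own...

```python
# Hypothetical sketch of the proposed stem/descriptor/remote.py
# interface -- names and signatures are just the proposal above,
# not an existing stem API.

import random

# placeholder; the real module would hardcode all of tor's
# directory authorities here
DIRECTORY_AUTHORITIES = [('128.31.0.34', 9131)]

_directory_mirrors = []  # populated by fetch_directory_mirrors()

def get_directory_authorities():
    return list(DIRECTORY_AUTHORITIES)

def get_directory_cache():
    # falls back to the authorities until we have a mirror listing
    return list(_directory_mirrors) or get_directory_authorities()

def query(descriptor_type, fingerprint=None, retries=5, _fetch=None):
    last_error = None

    for attempt in range(retries):
        if attempt == retries - 1:
            # our last attempt is always via a directory authority
            address, dirport = random.choice(get_directory_authorities())
        else:
            address, dirport = random.choice(get_directory_cache())

        try:
            return _fetch((address, dirport), descriptor_type, fingerprint)
        except IOError as exc:
            last_error = exc

    raise IOError('All %i attempts failed: %s' % (retries, last_error))
```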
========================================
I'd imagine use of the module would look something like the following...
# Simple script to print all of the exits.
from stem.descriptor import remote
# Populates our directory mirror cache. This does more harm
# here than good since we're only making a single request.
# However, if this was a longer living script doing this would
# relieve load from the authorities.
remote.fetch_directory_mirrors()
try:
  for desc in remote.query('server-descriptor 1.0'):
    if desc.exit_policy.is_exiting_allowed():
      print "%s (%s)" % (desc.nickname, desc.fingerprint)
except IOError, exc:
  print "Unable to query the server descriptors: %s" % exc
========================================
Thoughts? Does this cover all of the use cases we'll use this module for?
Cheers! -Damian
[1] https://trac.torproject.org/8257
[2] https://gitweb.torproject.org/metrics-lib.git/blob/HEAD:/src/org/torproject/...
[3] https://gitweb.torproject.org/torspec.git/blob/HEAD:/dir-spec.txt#l2626
[4] http://docs.oracle.com/javase/6/docs/api/java/util/zip/InflaterInputStream.h...
[5] https://gitweb.torproject.org/tor.git/blob/HEAD:/src/or/config.c#l780
[6] https://metrics.torproject.org/formats.html#descriptortypes