Hi Moritz, sorry about the delay. I just took a quick peek at your exit-funding script (https://github.com/torservers/exit-funding). Looks good!
Class and function documentation would've been nice, but the script's reasonably straightforward. As I'm sure you know, answering the question 'how fast is relay X' or 'how much is relay X used' is surprisingly thorny. Tor descriptors have two values...
* Measurement provided by relays themselves of how much they're used. These are completely gameable since they're self-published.
* Heuristic provided by the bandwidth authorities. This is what Tor actually uses for relay selection. However, while this is trickier to game, it's also pretty useless for answering either question since it's purely a unit-less heuristic.
Roger discussed this space a bit on...
https://blog.torproject.org/blog/lifecycle-of-a-new-relay
Iirc the value you're using (the router status entry's 'bandwidth' value) is the latter. I suspect both Stem and Tor's control-spec docs about this being kb/s are wrong. I left a question on irc asking for clarification...
15:53 < atagar> No armadev *or* nickm? I think that might be a first...
15:57 < atagar> karsten: Actually, you'd know this offhand...
15:57 < atagar> The dir-spec's description of the 'Bandwidth=' value of w lines in router status entries is confusing me a bit. It's described as...
15:57 < atagar> "An estimate of the bandwidth of this relay, in an arbitrary unit (currently kilobytes per second). Used to weight router selection."
15:57 < atagar> This is the heuristic generated by the bandwidth authorities, right? If it is then it shouldn't have units at all (that is purely a heuristic for relay selection - it has no bearing on a relay's actual capacity or usage).
15:57 < atagar> ... or is this the old heuristic that's based on relay's self-published usage?
15:57 < atagar> I'm guessing the former, and that we should replace "in an arbitrary unit (currently kilobytes per second)" with "as an unit-less heuristic purely used for relay selection. This has no direct relation to a relay's capacity or usage."
15:57 < atagar> (this is a common point of confusion since 'how do I determine the fastest relays?' is a common question)
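For reference, here's a rough sketch of where that number comes from. It assumes the dir-spec's 'w' line layout of space-separated key=value pairs (for instance "w Bandwidth=5120 Measured=4800"). The parse_w_line() helper is made up for illustration and isn't part of Stem, which already parses this for you. Note that it returns plain integers without attaching any unit to them, which I suspect is the honest way to treat these values.

```python
def parse_w_line(line):
  """
  Parses a router status entry 'w' line into a dict of its key=value pairs.
  Hypothetical helper for illustration, not part of Stem.
  """

  if not line.startswith('w '):
    raise ValueError("not a 'w' line: %r" % line)

  values = {}

  for entry in line[2:].split():
    if '=' not in entry:
      continue  # skip anything that isn't a key=value pair

    key, _, value = entry.partition('=')
    values[key] = int(value)  # plain integer, no unit is implied

  return values

example = parse_w_line('w Bandwidth=5120 Measured=4800')
print(example['Bandwidth'])  # 5120
```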
Added your script to Stem's examples page. I don't have much time on my hands so I only took a close peek at the Stem-related bits...
class VerboseDescriptorReader(object):
You are parsing metrics.torproject.org archives, right? I recall some discussion about Stem performance with archives (... though I don't recall if the trouble was compression or Python's tar module). If you decompress them then I suspect you get one descriptor per file. If that's the case then you can do the same by registering a read listener instead of wrapping the class...
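import sys
from stem.descriptor.reader import DescriptorReader

entries_seen = 0

def read_file(path):
  # called by the reader with the path of each file before it's processed
  global entries_seen
  entries_seen += 1

  if entries_seen % 25 == 0:
    sys.stderr.write("%i documents parsed...\n" % entries_seen)

reader = DescriptorReader([self.descriptors_path])
reader.register_read_listener(read_file)

with reader:
  for relay_desc in reader:
    ... etc...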
If you decide to keep the VerboseDescriptorReader then you can drop self._targets (it's unused).
self.controller = Controller.from_port(port=9151)
This isn't the usual control port, so you should probably mention it in your readme.
return self.controller.get_info('ip-to-country/%s' % address)
This can potentially raise exceptions. You might want to provide a default value...
return self.controller.get_info('ip-to-country/%s' % address, 'unknown')
In general I suspect we might be able to simplify your script by abstracting away the descriptor handling from the rest. If I'm understanding this correctly the question you're trying to answer is...
"I have a set of relays I'm interested in, defined by their contact information. What is the sum of their router status entry 'bandwidth' measurements?"
Is that right?
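If so, here's a minimal sketch of what I mean by separating the summing from the descriptor handling. It's only an illustration, not how your script works: ServerDesc and StatusEntry are made-up stand-ins that mimic the 'fingerprint', 'contact', and 'bandwidth' attributes of Stem's server descriptors and router status entries, and bandwidth_by_contact() is a hypothetical name. I'm also assuming contacts can be matched by substring. I believe Stem provides 'contact' as bytes under Python 3, so the real thing would likely need a decode.

```python
import collections

# Stand-ins for Stem's descriptor classes, only carrying the attributes used
# below. These are assumptions for the sake of a self-contained example.
ServerDesc = collections.namedtuple('ServerDesc', ['fingerprint', 'contact'])
StatusEntry = collections.namedtuple('StatusEntry', ['fingerprint', 'bandwidth'])

def bandwidth_by_contact(server_descs, status_entries, contacts):
  """
  Sums the router status entry 'bandwidth' of relays whose contact line
  contains one of the given strings. Provides a {contact: total} dict.
  """

  # map each relay of interest to the first contact string it matches
  fingerprint_to_contact = {}

  for desc in server_descs:
    for contact in contacts:
      if desc.contact and contact in desc.contact:
        fingerprint_to_contact[desc.fingerprint] = contact
        break

  totals = dict((contact, 0) for contact in contacts)

  for entry in status_entries:
    contact = fingerprint_to_contact.get(entry.fingerprint)

    if contact is not None and entry.bandwidth is not None:
      totals[contact] += entry.bandwidth

  return totals
```

With something like this the DescriptorReader loop only needs to hand over the descriptors, and the totals can be checked against hand-built data without any archives on disk.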
Cheers! -Damian