[tor-dev] Exit Funding Script
atagar at torproject.org
Sat Mar 29 23:43:31 UTC 2014
Hi Moritz, sorry about the delay. I just took a quick peek at your
exit-funding script (https://github.com/torservers/exit-funding).
Class and function documentation would've been nice, but the script's
reasonably straightforward. As I'm sure you know, answering the
question 'how fast is relay X' or 'how much is relay X used' is
surprisingly thorny. Tor descriptors have two values...
* Measurement provided by relays themselves of how much they're used.
These are completely gameable since they're self-published.
* Heuristic provided by the bandwidth authorities. This is what Tor
actually uses for relay selection. However, while this is trickier to
game, it's also pretty useless for answering either question since
it's purely a unit-less heuristic.
Roger discussed this space a bit on...
Iirc the value you're using (the router status entry's 'bandwidth'
value) is the latter. I suspect both Stem's and Tor's control-spec docs
about this being kb/s are wrong. I left a question on irc asking for
15:53 < atagar> No armadev *or* nickm? I think that might be a first...
15:57 < atagar> karsten: Actually, you'd know this offhand...
15:57 < atagar> The dir-spec's description of the 'Bandwidth=' value
of w lines in router status entries is confusing me a bit. It's
15:57 < atagar> "An estimate of the bandwidth of this relay, in an
arbitrary unit (currently kilobytes per second). Used to weight
15:57 < atagar> This is the heuristic generated by the bandwidth
authorities, right? If it is then it shouldn't have units at all (that
is purely a heuristic for relay selection - it has no bearing on a
relay's actual capacity or usage).
15:57 < atagar> ... or is this the old heuristic that's based on
relay's self-published usage?
15:57 < atagar> I'm guessing the former, and that we should replace
"in an arbitrary unit (currently kilobytes per second)" with "as an
unit-less heuristic purely used for relay selection. This has no
direct relation to a relay's capacity or usage."
15:57 < atagar> (this is a common point of confusion since 'how do I
determine the fastest relays?' is a common question)
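To make the 'unit-less heuristic' point concrete, here is a minimal sketch of how the 'w' line value can be read: only as a share of the total consensus weight, not as a speed. It uses only the standard library rather than Stem, the helper names (parse_w_line, selection_shares) are made up for illustration, and it deliberately ignores relay flags and the position-dependent weights Tor applies during path selection.

```python
def parse_w_line(line):
    """Parse a router status entry 'w' line into a dict of integer values."""
    if not line.startswith("w "):
        raise ValueError("not a 'w' line: %r" % line)
    values = {}
    for item in line[2:].split():
        key, _, value = item.partition("=")
        if value.isdigit():
            values[key] = int(value)
    return values


def selection_shares(weights):
    """Map each fingerprint to its fraction of the total consensus weight.

    Simplification: real path selection also depends on flags and on the
    consensus's position-dependent weights, which are ignored here.
    """
    total = sum(weights.values())
    if total == 0:
        return {fp: 0.0 for fp in weights}
    return {fp: float(w) / total for fp, w in weights.items()}
```

A relay with Bandwidth=300 in a consensus whose weights sum to 400 gets a share of 0.75. That describes how often it is likely to be picked, not how many kb/s it can push.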
Added your script to Stem's examples page. I don't have much time on
my hands so I only took a close peek at the Stem related bits...
> class VerboseDescriptorReader(object):
You are parsing metrics.torproject.org archives, right? I recall some
discussion about Stem performance with archives (... though I don't
recall if the trouble was compression or python's tar module). If you
decompress them then I suspect that it provides one descriptor per
file. If that's the case then you can do the same by registering a
read listener instead of wrapping the class...
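entries_seen = 0
def read_listener(path):
  # called by the reader before each file is processed
  global entries_seen
  entries_seen += 1
  if entries_seen % 25 == 0:
    sys.stderr.write("%i documents parsed...\n" % entries_seen)
reader = DescriptorReader([self.descriptors_path])
reader.register_read_listener(read_listener)
with reader:
  for relay_desc in reader:
    pass  # handle relay_desc as before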
If you decide to keep the VerboseDescriptorReader then you can drop
self._targets (it's unused).
> self.controller = Controller.from_port(port=9151)
This isn't the usual control port, so you should probably mention it
in your readme.
> return self.controller.get_info('ip-to-country/%s' % address)
This can potentially raise exceptions. You might want to provide a default...
return self.controller.get_info('ip-to-country/%s' % address, 'unknown')
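As a rough illustration of what the second argument buys you, the sketch below uses a stand-in controller so it runs without Tor or Stem. FakeController and lookup_country are hypothetical names; the only behaviour assumed from Stem is that get_info() returns the default instead of raising when one is supplied.

```python
class FakeController(object):
    """Hypothetical stand-in that mimics get_info(param, default)."""

    _MISSING = object()

    def __init__(self, table):
        self._table = table

    def get_info(self, param, default=_MISSING):
        try:
            return self._table[param]
        except KeyError:
            # Without a default the failure propagates to the caller.
            if default is self._MISSING:
                raise
            return default


def lookup_country(controller, address):
    """Return the country code for address, or 'unknown' if the lookup fails."""
    return controller.get_info('ip-to-country/%s' % address, 'unknown')
```

With this, a relay whose address can't be resolved just ends up in an 'unknown' bucket rather than aborting the whole run.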
In general I suspect we might be able to simplify your script by
abstracting away the descriptor handling from the rest. If I'm
understanding this correctly the question you're trying to answer is...
"I have a set of relays I'm interested in, defined by their contact
information. What is the sum of their router status entry 'bandwidth'
values?"
Is that right?
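If so, the core of it could be sketched roughly like this, independent of where the descriptors come from. The namedtuples are placeholders standing in for parsed descriptors (the field names are chosen to mirror the 'fingerprint', 'contact' and 'bandwidth' attributes), and total_bandwidth is a made-up helper name rather than anything in your script.

```python
from collections import namedtuple

# Placeholder records; in the real script these would be parsed descriptors.
ServerDesc = namedtuple("ServerDesc", ["fingerprint", "contact"])
StatusEntry = namedtuple("StatusEntry", ["fingerprint", "bandwidth"])


def total_bandwidth(server_descs, status_entries, contact_substring):
    """Sum the 'bandwidth' of status entries whose relay's contact matches."""
    needle = contact_substring.lower()

    # Step 1: collect the fingerprints of relays with a matching contact line.
    wanted = set(
        desc.fingerprint for desc in server_descs
        if desc.contact and needle in desc.contact.lower()
    )

    # Step 2: sum the consensus value for just those relays.
    return sum(
        entry.bandwidth for entry in status_entries
        if entry.fingerprint in wanted and entry.bandwidth is not None
    )
```

Keeping the two steps separate means the descriptor reading (archives, cached files, or a control port) can change without touching the summing logic.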