[tor-dev] Exit Funding Script

29 Mar 2014

Hi Moritz, sorry about the delay. I just took a quick peek at your
exit-funding script (https://github.com/torservers/exit-funding).
Looks good!

Class and function documentation would've been nice, but the script's
reasonably straightforward. As I'm sure you know, answering the
question 'how fast is relay X' or 'how much is relay X used' is
surprisingly thorny. Tor descriptors have two values...

* Measurement provided by relays themselves of how much they're used.
These are completely gameable since they're self-published.

* Heuristic provided by the bandwidth authorities. This is what Tor
actually uses for relay selection. However, while this is trickier to
game, it's also pretty useless for answering either question since
it's purely a unit-less heuristic.
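
To make the first of these concrete: as far as I understand the
dir-spec, relays publish it as a 'bandwidth' line in their server
descriptor with three values in bytes per second (average, burst, and
observed). Here's a minimal sketch of reading that line. The
parse_bandwidth_line helper, the namedtuple, and the sample numbers are
all made up for illustration, they're not part of Stem or tor...

```python
from collections import namedtuple

# Hypothetical container for illustration, not part of Stem or tor.
SelfPublishedBandwidth = namedtuple(
  'SelfPublishedBandwidth', ['average', 'burst', 'observed'])

def parse_bandwidth_line(line):
  """
  Parses a server descriptor 'bandwidth' line. All three values are
  reported by the relay itself, which is why they're gameable.
  """
  fields = line.split()

  if len(fields) != 4 or fields[0] != 'bandwidth':
    raise ValueError("not a bandwidth line: %r" % line)

  return SelfPublishedBandwidth(*[int(value) for value in fields[1:]])

# made-up sample values
example = parse_bandwidth_line('bandwidth 5242880 10485760 3145728')
```

Nothing stops a relay from publishing whatever numbers it likes here,
so treat these as claims rather than measurements.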

Roger discussed this space a bit on...

https://blog.torproject.org/blog/lifecycle-of-a-new-relay

Iirc the value you're using (the router status entry's 'bandwidth'
value) is the latter. I suspect that Stem's docs and Tor's control-spec
are both wrong to describe this as kb/s. I left a question on IRC
asking for clarification...

15:53 < atagar> No armadev *or* nickm? I think that might be a first...
15:57 < atagar> karsten: Actually, you'd know this offhand...
15:57 < atagar> The dir-spec's description of the 'Bandwidth=' value
of w lines in router status entries is confusing me a bit. It's
described as...
15:57 < atagar> "An estimate of the bandwidth of this relay, in an
arbitrary unit (currently kilobytes per second).  Used to weight
router selection."
15:57 < atagar> This is the heuristic generated by the bandwidth
authorities, right? If it is then it shouldn't have units at all (that
is purely a heuristic for relay selection - it has no bearing on a
relay's actual capacity or usage).
15:57 < atagar> ... or is this the old heuristic that's based on
relay's self-published usage?
15:57 < atagar> I'm guessing the former, and that we should replace
"in an arbitrary unit (currently kilobytes per second)" with "as an
unit-less heuristic purely used for relay selection. This has no
direct relation to a relay's capacity or usage."
15:57 < atagar> (this is a common point of confusion since 'how do I
determine the fastest relays?' is a common question)
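
For reference, here's a rough sketch of pulling the value out of a 'w'
line. It assumes the line is 'w' followed by space separated key=value
pairs with integer values, which is my reading of the dir-spec.
parse_w_line and the sample numbers are made up for illustration...

```python
def parse_w_line(line):
  """
  Parses a router status entry 'w' line into a dict of its key=value
  pairs. The 'Bandwidth' entry is the selection weight discussed above.
  """
  fields = line.split()

  if not fields or fields[0] != 'w':
    raise ValueError("not a w line: %r" % line)

  values = {}

  for entry in fields[1:]:
    key, _, value = entry.partition('=')
    values[key] = int(value)  # raises ValueError if there's no integer

  return values

# made-up sample lines
weighted = parse_w_line('w Bandwidth=5300')
unmeasured = parse_w_line('w Bandwidth=20 Unmeasured=1')
```

Whatever comes out as 'Bandwidth' is only meaningful relative to other
relays in the same consensus.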

Added your script to Stem's examples page. I don't have much time on
my hands so I only took a close peek at the Stem-related bits...
...
class VerboseDescriptorReader(object):
You are parsing metrics.torproject.org archives, right? I recall some
discussion about Stem performance with archives (... though I don't
recall if the trouble was compression or Python's tarfile module). If
you decompress them then I suspect you get one descriptor per file. If
that's the case then you can do the same by registering a read
listener instead of wrapping the class...

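import sys

from stem.descriptor.reader import DescriptorReader

entries_seen = 0

def read_file(path):
  # called by the reader with the path of each file it reads
  global entries_seen  # without this the += raises an UnboundLocalError
  entries_seen += 1

  if entries_seen % 25 == 0:
    sys.stderr.write("%i documents parsed...\n" % entries_seen)

reader = DescriptorReader([self.descriptors_path])
reader.register_read_listener(read_file)

with reader:
  for relay_desc in reader:
    ... etc...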

If you decide to keep the VerboseDescriptorReader then you can drop
self._targets (it's unused).
...
self.controller = Controller.from_port(port=9151)
This isn't the usual control port (tor's default is 9051, and 9151 is
what Tor Browser uses), so you should probably mention it in your
readme.
...
return self.controller.get_info('ip-to-country/%s' % address)
This can potentially raise exceptions. You might want to provide a
default value...

return self.controller.get_info('ip-to-country/%s' % address, 'unknown')

In general I suspect we might be able to simplify your script by
abstracting away the descriptor handling from the rest. If I'm
understanding this correctly the question you're trying to answer
is...

"I have a set of relays I'm interested in, defined by their contact
information. What is the sum of their router status entry 'bandwidth'
measurements?"

Is that right?
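
If so, then with the descriptor handling pulled out the core of it
could be roughly the following. This is only a sketch. The Relay tuple
and total_bandwidth are hypothetical names, the relays are made up, and
it assumes you've already joined each relay's contact (from its server
descriptor) with its bandwidth (from its router status entry)...

```python
from collections import namedtuple

# Hypothetical stand-in for the joined descriptor data.
Relay = namedtuple('Relay', ['fingerprint', 'contact', 'bandwidth'])

def total_bandwidth(relays, contact_substring):
  """
  Sums the router status entry 'bandwidth' of relays whose contact
  information contains the given text (case insensitive). Relays
  without contact information never match.
  """
  needle = contact_substring.lower()

  return sum(
    relay.bandwidth for relay in relays
    if relay.contact and needle in relay.contact.lower()
  )

# made-up relays for illustration
relays = [
  Relay('A' * 40, 'abuse@example.org', 5300),
  Relay('B' * 40, 'Abuse@Example.org', 1200),
  Relay('C' * 40, 'someone@else.example', 9000),
  Relay('D' * 40, None, 20),
]

total = total_bandwidth(relays, 'example.org')
```

Keep in mind that since the value is a unit-less weight the sum is only
useful for comparing sets of relays against each other.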

Cheers! -Damian

Damian Johnson