[tor-dev] Tor Metrics project - past, present, future

Sat Dec 14 18:56:09 UTC 2013

Dear devs,

you probably know the Tor Metrics project with its most visible part,
the metrics website [0].  The metrics website and the less visible parts
around it have grown organically since 2009.  I'm maintaining the code
behind most of these parts, which is why I'm periodically trying to
bring order into the chaos.  Sometimes this means removing parts which
are mostly unused or require too many resources, and removing working
parts typically leads to sad faces.  But sometimes I'm also adding new
parts, which have the potential to turn sad faces into happy faces again.

When I talked to Tom on Thursday, he suggested that I should take a step
back and explain here what metrics is for and what it's not for, and
list the things that are great, the things that can go, and the things
we might like in the future.  Great idea, let me do that!

Before I start, may I call your attention to exhibit A:

https://metrics.torproject.org/tools.html

This is a diagram showing all the different metrics parts and how they
are connected.  Really, please take a quick look at it, because I'm
going to refer to the diagram below.

So, *what* is metrics for?  Perhaps it's easier to answer this question
by saying *who* metrics is for:

1. For the researchers!  The metrics project archives Tor network data
that is safe to archive and provides it to anyone on a public
website [1, 2].  This is a valuable resource for researchers who are
trying to back their research with actual data about the deployed Tor
network.  (Green parts in the diagram.)

2. For the community!  The metrics website contains all kinds of
statistics showing how the Tor network grew over time [3, 4], including
details about relays, bridges, and clients as long as user privacy is not
at risk.  Focus is on the Tor network as a whole, from 2007 until today.
(Purple parts in the diagram.)

3. For the operators!  Onionoo [5] processes Tor network data and
provides it in a convenient format for web applications [6, 7, 8]
designed for relay or bridge operators, people helping to debug the Tor
network, and more generally Tor users.  Focus is on single relays or
bridges right now or in the very recent past.  (Orange part in the diagram.)

And what is metrics *not* for?  If you take another look at exhibit A,
I'll explain three parts which I think are either not used anymore or
require too many resources.  If people think we really need to keep any
of these parts, I'd like to know:

1. We added statistics on uni-/bidirectional connection usage a few
years back.  I think these statistics were useful for researchers at the
time, but I haven't heard from anybody using them recently.  We should
remove them from little-t-tor and from metrics [9].  (Part of
performance.html in the diagram.)

2. We introduced the notion of a fast exit for sponsor J, and we added
graphs showing how many of these fast exits or almost fast exits are
running in the Tor network.  The sponsor contract ended, and the notion
is too specific to be useful for anything else.  We should remove the
graphs.  (fast-exits.html in the diagram.)

3. The relay-search service can find any relay that has been running at
any time since 2007.  This is certainly useful for debugging the Tor
network, but it requires us to keep a full database of relays, which
doesn't scale anymore [10].  (relay-search.html in the diagram.)

Last, but certainly not least, I have a couple of things in mind that we
might like in the *future*.  It would be good to know which of these
things sound interesting to others, so I can spend my time on the most
popular features first:

1. Start archiving microdescriptors and microdescriptor consensuses
(#2785).  While both descriptor types can be derived from server
descriptors and unflavored consensuses, archiving the originals would be
more convenient for researchers and developers.  (Green parts in the
diagram.)

2. Resume adding country codes and maybe AS numbers to sanitized bridge
descriptors.  We stopped resolving bridge IP addresses to country codes,
because it's difficult to do that in a reproducible way with monthly
changing GeoIP databases.  But having country and maybe AS information
would for sure be useful.  (Green parts in the diagram.)

3. Keep contact lines in sanitized bridge descriptors (as discussed in
#9854).  Having contact lines of bridge operators would allow more
people to contact them in case of problems, and it would allow bridge
operators to find their bridge more easily in services like Atlas [6] or
Globe [8].  (Green parts in the diagram.)

4. Make a graph of the number of running obfsbridges (#9187).  Bridges
report which pluggable transports they support in their extra-info
descriptors, and it's not trivial to extract this information and count
how many bridges per transport are running at a given time.  But it's
also not terribly difficult.  (Purple parts in the diagram.)

5. Re-enable the relays-by-country graph on the metrics website (#8127).
This is difficult for reasons similar to 2.  (Purple parts in the diagram.)

6. Make a bridges-by-country graph for the metrics website.  Similar to 5,
depends on 2.  (Purple parts in the diagram.)

7. Include votes in Onionoo (#9778).  Votes contain some interesting
information about relays, like missing or extra relay flags, or measured
bandwidth.  The main problem would be potential performance issues.
(Orange part in the diagram.)

8. Provide per-bridge usage statistics in Onionoo (#10331).  We could
tell bridge operators some estimates on the number of clients connecting
to their bridge by country, transport, or IP version.  Maybe this
encourages bridge operators to run more bridges, in particular if they
can show their social circle how their bridge helps actual people in the
world.  (Orange part in the diagram.)

9. Provide relay comparison metrics in Onionoo.  We could define some
simple metrics on the usefulness of a relay, like provided bandwidth or
uptime, in comparison to other relays.  A possible statement from these
metrics could be: "your relay provides more bandwidth than 95% of relays
in the network."  Similar to 8.  If Atlas [6] or Globe [8] or a
yet-to-be-written Facebook application or an also-yet-to-be-written
Twitter integration into Tor Weather (#10372) tells the world how
successfully someone is running Tor relays, maybe that encourages others to
run relays, too.  We could even invent a points system for running
relays, with additional points for running exits, if that makes the Tor
network better.  Probably needs input from a community coordinator
person.  (Orange part in the diagram.)
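
To make items 4 and 9 a bit more concrete, here is a minimal sketch in
Python.  It is not existing metrics or Onionoo code.  The function names
and the sample data are made up for illustration, and the sketch assumes
that a bridge's extra-info descriptor lists each supported transport on
a line starting with "transport <name>", and that the set of descriptors
of bridges running at a given time has already been selected.

```python
from collections import Counter


def count_bridges_per_transport(extra_infos):
    """Count bridges per pluggable transport (item 4).

    `extra_infos` is an iterable of extra-info descriptor texts, one per
    running bridge.  Assumption: each supported transport shows up as a
    line starting with "transport <name>".
    """
    counts = Counter()
    for text in extra_infos:
        names = set()
        for line in text.splitlines():
            parts = line.split()
            if len(parts) >= 2 and parts[0] == "transport":
                names.add(parts[1])
        # A bridge counts once per transport, even if a line repeats.
        counts.update(names)
    return counts


def percent_of_relays_below(own_bandwidth, all_bandwidths):
    """Percentage of relays providing strictly less bandwidth (item 9)."""
    if not all_bandwidths:
        raise ValueError("need at least one relay to compare against")
    below = sum(1 for bw in all_bandwidths if bw < own_bandwidth)
    return 100.0 * below / len(all_bandwidths)


# Made-up sample data, for illustration only.
sample_extra_infos = [
    "extra-info Unnamed AAAA\ntransport obfs2\ntransport obfs3\n",
    "extra-info Unnamed BBBB\ntransport obfs3\n",
    "extra-info Unnamed CCCC\n",
]
transport_counts = count_bridges_per_transport(sample_extra_infos)

sample_bandwidths = [100, 200, 300, 400, 500]
share = percent_of_relays_below(450, sample_bandwidths)
print(dict(transport_counts), share)
```

The hard part of item 4 is not this counting step but deciding which
bridges were running at a given time and matching them to their
extra-info descriptors, which the sketch leaves out.  For item 9, the
comparison could just as well use uptime or another measure instead of
bandwidth.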

Whee, that was a comprehensive discourse on the past, present, and
future of the metrics project.  Thanks for reading!  Feedback much
appreciated!

All the best,
Karsten

[0] https://metrics.torproject.org/
[1] https://metrics.torproject.org/data.html
[2] https://metrics.torproject.org/formats.html
[3] https://metrics.torproject.org/network.html
[4] https://metrics.torproject.org/users.html
[5] https://onionoo.torproject.org/
[6] https://atlas.torproject.org/
[7] https://compass.torproject.org/
[8] https://globe.torproject.org/
[9] https://lists.torproject.org/pipermail/tor-dev/2013-December/005919.html
[10] https://lists.torproject.org/pipermail/tor-talk/2013-December/031310.html