[tor-dev] Scaling Tor Metrics, Round 2

Sun Dec 6 15:52:45 UTC 2015

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi everyone,

I posted some thoughts on Scaling Tor Metrics [0] almost two weeks ago
and received very useful feedback from George, David, Thomas, and
Letty.  Thanks for that!  In fact, this is such a big topic and this
was so much feedback that I decided not to respond to every idea
individually but instead start over and suggest a new plan that
incorporates all feedback I saw.  Otherwise, I'd be worried that we'd
lose ourselves in the details and miss the big picture.  Maybe this
also makes this topic somewhat easier to follow and respond to, which
I hope many people will do.

The problem to solve is still that the Tor Metrics website has a huge
product backlog and that we want it to remain "the primary place to
learn interesting facts about the Tor network", either by making it
better and bigger internally, or by adding more ways to let others
contribute to it externally.

From the feedback in round 1, I observe three major areas where we
need to improve Tor Metrics:

 1 Metrics frontend
 2 Metrics backend
 3 External contributions

There are low-hanging fruit in each area, but there are many fruit
overall and some are hanging higher than we might think.  I'll go
through them area by area and assign numbers to the tasks.

 1 Metrics frontend

The frontend is the part of the Tor Metrics website that takes
pre-aggregated data in the .csv format as input and produces somewhat
interactive visualizations.  The current frontend uses Java
servlets/JSP as web framework and R/ggplot2 for graphs.  This code is
hard to extend, even for me, and the result isn't pretty, but in
return the website doesn't require any JavaScript.
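
To make the frontend's input a bit more concrete, here is a minimal
Python sketch of reading one pre-aggregated .csv file and selecting
the rows for a requested date range, which is roughly what a graph
needs before plotting.  The column names (date, relays, bridges) and
the sample values are made up for illustration; the real files may
well look different.

```python
import csv
import io
from datetime import date

# Made-up sample standing in for one pre-aggregated .csv file.
SAMPLE_CSV = """date,relays,bridges
2015-12-01,100,30
2015-12-02,110,32
2015-12-03,105,31
"""

def select_rows(csv_text, start, end):
    """Return rows whose date lies in [start, end], with counts as ints."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = date.fromisoformat(row["date"])
        if start <= day <= end:
            rows.append({"date": day,
                         "relays": int(row["relays"]),
                         "bridges": int(row["bridges"])})
    return rows

# Select the last two days of the sample, as a graph with user-chosen
# start and end dates might do.
selected = select_rows(SAMPLE_CSV, date(2015, 12, 2), date(2015, 12, 3))
for row in selected:
    print(row["date"].isoformat(), row["relays"], row["bridges"])
```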

So, one task would be: #1 decide whether we can still ignore
JavaScript and what it has to offer.  I agree that D3.js is cool, I
even used it myself in the past, though I know very little about it.
This decision would mean that we develop new visualizations in D3.js
and phase out the existing R/ggplot2 visualizations one by one.  This
is a tough decision, but one with a lot of potential.  I understand
how we're excited about this as developers, but I'd want to ask
Metrics users about this first.

Another task would be: #2 website redesign.  In fact, what you see on
the website right now is a redesign half-way through.  Believe me, the
website was even less readable a year or two ago, and this statement
alone tells you how slow things are moving.  But the remaining step is
just to replace the start page with a gallery, well, and to apply a
Bootstrap design to everything, because why not.  One challenge here
is that the current graphs all look quite similar, making them hard to
distinguish in a gallery, but that's probably still more useful than
putting text there.  It sounds like Letty and Thomas might be willing
to help out with this, which would be great.

Yet another task would be: #3 replace the website framework with
something more recent.  This can be something simple, as long as it
supports some basic filtering and maybe searching on the start page.
I'd say let's pick something Python-based here.  However, maybe we
should first replace the existing graphs that are deeply tied into the
current website framework.  If we switch to D3.js and have replaced
all existing graphs, this switch to a new website framework will hurt
a lot less.

Another high-hanging fruit would be: #4 build something like Thomas'
Visionion, where users can create generic visualizations on the fly.
This is an exciting idea, really, but I think we have to accept that
it's out of reach for now.

 2 Metrics backend

The backend of Tor Metrics consists of a bunch of programs that run
once per day to fetch new data from CollecTor and produce a set of
.csv files for the frontend.  There are no strict requirements on
languages and databases, as long as tools are available in Debian
stable.  Some programs use PostgreSQL, but most of them just use
files.  Ironically, it's the database-based tools that have major
performance problems, whereas the file-based ones work just fine.
Most programs are written in Java, very few in Python.
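
As a rough illustration of what such a backend program does, the
following Python sketch aggregates per-relay records into one row per
day and renders them as .csv.  The input records, field names, and
output columns are hypothetical; a real program would read descriptors
fetched from CollecTor rather than a hard-coded list.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: (date, relay fingerprint, advertised bandwidth in B/s).
RECORDS = [
    ("2015-12-01", "AAAA", 1000),
    ("2015-12-01", "BBBB", 3000),
    ("2015-12-02", "AAAA", 2000),
]

def aggregate_by_day(records):
    """Count distinct relays and sum bandwidth per day, sorted by date."""
    relays = defaultdict(set)
    bandwidth = defaultdict(int)
    for day, fingerprint, bw in records:
        relays[day].add(fingerprint)
        bandwidth[day] += bw
    return [(day, len(relays[day]), bandwidth[day]) for day in sorted(relays)]

def to_csv(rows):
    """Render aggregated rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "relays", "advbw"])
    writer.writerows(rows)
    return out.getvalue()

daily_rows = aggregate_by_day(RECORDS)
print(to_csv(daily_rows), end="")
```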

One rather low-hanging fruit would be: #5 document backend programs
and say what's required to add one more to the bunch.  The biggest
challenge in writing such a program is that it needs to stay
reasonably fast even over the next couple of years and even if the
network doubles or triples in size.  I started documenting things a
while ago, but got distracted by other things.  I wouldn't mind help.

Another rather low-hanging fruit would be: #6 use Big Data to produce
pre-aggregated data for the frontend.  As said before, it doesn't
matter whether a backend program uses files or PostgreSQL or another
database engine.  What matters is that it reads data from CollecTor
and produces data that the frontend can use.  This could be a CSV or
JSON file.  We should probably pick the next visualization project as
test case for applying Big Data tools, or we could rewrite one that
needs to be rewritten for performance reasons anyway.
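
Since the output could be either CSV or JSON, a backend program can
keep its aggregation step independent of the output format.  Here is a
minimal Python sketch of that separation; the rows and column names
are made up, and render() is a hypothetical helper, not existing
Metrics code.

```python
import csv
import io
import json

# Made-up aggregated rows; the column names are assumptions for illustration.
COLUMNS = ["date", "users"]
ROWS = [["2015-12-01", 10], ["2015-12-02", 12]]

def render(columns, rows, fmt):
    """Serialize aggregated rows as 'csv' or 'json' text."""
    if fmt == "csv":
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(columns)
        writer.writerows(rows)
        return out.getvalue()
    if fmt == "json":
        # One JSON object per row, keyed by column name.
        return json.dumps([dict(zip(columns, row)) for row in rows], indent=1)
    raise ValueError("unsupported format: %s" % fmt)

csv_text = render(COLUMNS, ROWS, "csv")
json_text = render(COLUMNS, ROWS, "json")
print(csv_text, end="")
print(json_text)
```

Whatever engine produces the rows, be it files, PostgreSQL, or a Big
Data tool, only the final serialization step needs to know which
format the frontend expects.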

Here's a not-so-low-hanging fruit for the backend: #7 have the backend
provide an API to the frontend.  This is potentially more difficult.
The part that I'm very much afraid of is performance.  It's just too
easy to build a tool that performs reasonably during testing but that
can't handle the load of 10 or 100 people looking at a frontend
visualization at once.  In particular, it's easy to build an API that
works just fine and then add another feature that looks harmless,
which later turns out to hurt performance a lot.  I'd say postpone.

 3 External contributions

Most of the discussion in round 1 revolved around how external
contributions are great, but that they need to be handled with care.
I understand the risks here, and I think I'll postpone the part where
we're including externally developed websites after "redressing" them.
Let's instead try to either keep those contributions as external
links or properly integrate them into Tor Metrics.

The lowest-hanging fruit here is: #8 keep adding links to external
websites as we already do, assuming that we clearly mark these
contributions as external.

Another low-hanging fruit is: #9 add static data or static graphs.  I
heard a single +1, but I guess once we have concrete suggestions for
what data or graphs to add, we'll have more discussion.  That's fine.

One important and still somewhat low-hanging fruit is: #10 give
external developers more support when developing visualizations that
could later be added to Metrics.  This requires better documentation,
but it also requires making it easier to install Tor Metrics locally
and test new additions before submitting them.  The latter is a good
goal, but we're not there yet.  The documentation part doesn't seem
crazy though.  David, if you don't mind being the guinea pig once
more, I'd want to try this out with your latest visualizations.  This
depends on the JavaScript decision though.

Now to the higher-hanging fruit: #11 build or adapt a tool like Munin
to prototype new visualizations quickly.  I didn't fully understand
how Munin makes it easier to prototype a visualization than just
writing a Python/Stem script for the data pre-processing and using any
graphing engine for the visualization.  But maybe there's something to
learn here.  Still, this seems like a quite high goal at the moment.

Another one: #12 provide a public API like Onionoo but for Metrics
data.  This seems somewhat out of reach for the moment.  It's a cool
idea, but doing it right is not at all trivial.  I'd want to put this
at the lower end of the list.

Another high-hanging fruit: #13 adapt the Gist idea where external
contributors write some code that magically turns into new
visualizations.  It's a neat idea, but I think that most new
visualizations require writing backend code to aggregate different
parts of the available data or to aggregate the same data differently.
So I think it's a long way until we're there.

 4 Summary

Here's the list of low-hanging fruit:

#1 decide whether we can still ignore JavaScript
#2 website redesign
#5 document backend programs
#6 use Big Data to produce pre-aggregated data
#8 keep adding links to external websites
#9 add static data or static graphs
#10 give external developers more support

And here are the tasks that I think we should postpone a bit longer:

#3 replace the website framework
#4 build something like Thomas' Visionion
#7 have the backend provide an API to the frontend
#11 build or adapt a tool like Munin
#12 provide a public API like Onionoo but for Metrics data
#13 adapt the Gist idea

How does this sound?  What did I miss (sorry!), on a high level
without going into all the details just yet?  Who wants to help?

All the best,
Karsten

[0]
https://lists.torproject.org/pipermail/tor-dev/2015-November/009983.html
-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJWZFnNAAoJEJD5dJfVqbCrowMIAIcx5kd7BfHN9HsgEj+JZ6e6
YsIUfrrCcTeMlZyHHG9amWodac+spE11fZCAkZ2B2gYf8I6KT23pex4w23WCVlxo
0qq5pZSADedAYBsocEhZLh2bZyzIhEPueGJW2/HeyW2Smd9T+YJ8hA1kkoebX/Xv
xb/2051bc48sQW767gst3ClTv4va8+24pKytJxvfRYDrs3xv3VxPlyCvLsrgz9ji
pnxd2C+Syo4sT3vHFb9JNhvd1ZaEmqjXBzywQ2SNnOqZNQE77zfvB766FFSJEg6a
MMgwt0RcjElTnZRZR2utA2WLYzJDiof4ULfw51xoYYcT6kmGQw9IOqOuzgBvSr4=
=Yo+I
-----END PGP SIGNATURE-----