[metrics-team] Integrating Consensus Health graphs onto metrics?

Tom Ritter tom at ritter.vg
Tue May 2 17:19:42 UTC 2017


On 2 May 2017 at 03:30, Karsten Loesing <karsten at torproject.org> wrote:
> I understand that you're not too keen on rewriting this code for Tor
> Metrics.  But I'm not sure if I want us to take processed Tor network
> data from an external source, so I think I'd rather want to add a static
> image and link instead.  However, maybe we can find a compromise. :)
>
> How about we discuss what your data would have to look like in order to
> be included on Tor Metrics, and even if we later decide not to take that
> last step, you'll have benefited from our experience with making Tor
> network data accessible to users?  (And of course, if we do decide to
> take that last step, it'll be a smaller step.)

Sure!

> Some quick thoughts:
>
> Do you have a data format for the graph data behind your graphs that
> scales for months or even a decade?

Kinda? https://consensus-health.torproject.org/historical.db is less
than 5 MB right now and contains several years' worth of data. It has
one row per consensus in two tables. I wouldn't blink at a sqlite
database 10 times the size, and since it's SQL, porting it to MySQL or
Postgres wouldn't be a big deal IMO.

So I feel it is acceptable to store fine-grained data for all time at this point.

> It looks like all your graphs end
> at "past 90 days", though I'm not clear whether that's because you
> didn't want to make the page even longer or whether the data file would
> become too big.

Neither. Because I don't do any smoothing of the data, any graph over
90 days gets painful spikes that correspond to a single hour's odd
datapoint (as you note later for the 90-day graphs as well).

> In the latter case, we should discuss how to keep the
> file small enough even if it contains all the data back to 2007.  Or
> we'd have to come up with a good reason why this data is only relevant
> for the relatively recent past, unlike all other data.

The main scaling pain point I anticipate is doing javascript-based
processing on a CSV that contains years' worth of data. I intend to do
this as part of #21883 in the next few weeks and see how painful it
actually is, though.

> What you could do is pick a higher data resolution for quite recent data
> and a lower data resolution for data in the distant past.  For example,
> you could keep one data point per hour for the past week and reduce that
> step by step to 4 hours, 12 hours, 24 hours, 240 hours, etc. for the
> past years.

I think this is a good point, and I think it ought to be possible to
do with just a slightly more complicated SQL SELECT. What I'm thinking
is generating CSV files at different granularities: day-level
granularity for all data, 12-hour granularity for the past year, and
so on. The clientside would look at the date range you selected and
choose the correct datasource/granularity. (Again, the SQLite database
has all the raw data, but the clientside javascript uses CSVs generated
from that database, which happens hourly.)

The question I have is: how do I reduce the granularity / smooth the
data? If I'm looking at consensus values for a four-hour span, do I:
a) mean
b) median
c) omit missing datapoints (0's) and then mean?
d) omit missing datapoints (0's) and then median?
e) other?

I have no idea what gives the most accurate representation of the
underlying data; I'm hoping you've thought about this before and can
just say "Do exactly <this>." =)
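To make the difference concrete with some made-up numbers, here is what
each option produces for a four-hour window where one consensus is
missing:

    from statistics import mean, median

    # A made-up four-hour window where one consensus is missing (recorded as 0).
    window = [6500, 0, 6600, 6700]
    nonzero = [v for v in window if v != 0]

    print(mean(window))     # (a) 4950 -- dragged way down by the gap
    print(median(window))   # (b) 6550
    print(mean(nonzero))    # (c) 6600
    print(median(nonzero))  # (d) 6600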

> This is similar to what Onionoo does, though I could see
> how we adapt that approach for CSV files containing all data for a
> graph.  Graphing this data requires some tricks: if we plot data from
> two different data resolutions, we'll have to process all data to have
> the lower of the two data resolutions; and if we plot a very short time
> interval from the distant path, we'll have to interpolate from data
> points possibly outside of the plotted time interval.  But I'd be happy
> to help with this, unless you'd want to do this adventure on your own.

I think my idea of "choose the highest-resolution dataset that covers
the entire selected timespan" would be sufficient...
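In python-ish pseudocode (the real logic would live in the clientside
javascript; the filenames and cutoffs here are invented):

    # Pick the finest-grained CSV whose history reaches back far enough to
    # cover the start of the selected range. Filenames and cutoffs are
    # placeholders for illustration.
    DATASETS = [
        (90,    'hourly.csv'),   # 1-hour buckets,  covers the last 90 days
        (365,   '12hour.csv'),   # 12-hour buckets, covers the last year
        (36500, 'daily.csv'),    # 1-day buckets,   covers everything
    ]

    def pick_dataset(days_back_to_start):
        # days_back_to_start: how far in the past the selected range begins
        for covers_days, filename in DATASETS:
            if days_back_to_start <= covers_days:
                return filename
        return DATASETS[-1][1]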

> Note that you should probably apply some smoothing to your 90-days
> graphs anyway.  Your "Voted About Relays (Not Running)" graphs contain a
> lot of volatility from relays joining and leaving over the day that
> makes it hard to see trends or even differences between authorities.

Yes.

> Unless it's the outliers that you really care about to see.

No, unless someone objects.

> By the way, if you expose your data in CSV files, you could quite easily
> use some JavaScript graphing thing to make your graphs more interactive
> and avoid having several graphs for different intervals of the same
> data.  D3.js comes to mind.

Yup, that's the plan. (I use d3 now.)

> Another requirement for adding data to Tor Metrics is that it needs to
> be documented in a similar way as the other data files:
>
> https://metrics.torproject.org/stats.html

That should be possible, with the exception that the column set is
dynamic: we add columns (automatically) whenever a new authority or
bwauth pops online. So the table format is 'date',
'dirauthalice_running', 'dirauthalice_measured', 'dirauthalice_voted',
'dirauthbob_running', and so on.
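In other words, the header is assembled from whatever authorities exist
in the database at generation time, something like this hypothetical
sketch (the real code differs):

    # Build the CSV header: a fixed 'date' column plus three columns per
    # authority currently present in the database. Sketch only.
    METRICS = ['running', 'measured', 'voted']

    def csv_header(authorities):
        cols = ['date']
        for auth in sorted(authorities):
            cols.extend('%s_%s' % (auth, metric) for metric in METRICS)
        return ','.join(cols)

    # csv_header(['dirauthalice', 'dirauthbob']) ->
    # date,dirauthalice_running,dirauthalice_measured,dirauthalice_voted,
    # dirauthbob_running,dirauthbob_measured,dirauthbob_voted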

> And the graphs would have to be documented in a way that the average Tor
> Metrics user can understand what's going on, at least to the point where
> they can decide that they don't care about this aspect of the Tor network.

That probably isn't a problem either, assuming someone can read what I
wrote and say "Yes I understand" or "No, try again."

> Yet another requirement for moving code to Tor Metrics is that it should
> use PostgreSQL rather than SQLite and Java rather than whatever else
> it's written in right now.

Switching to Postgres is easy from a code point of view and difficult
from a sysadmin point of view. consensus-health is static HTML
generated in python on henryi and then gets deployed to... well, some
frontend servers (I never investigated).

henryi would need to talk to your postgres database to insert new
values every hour. (Unless this is the part that you want cut over to
Java.)
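Concretely, the hourly job on henryi would be doing something like this
(host, credentials, table, and column names are all placeholders, and
I'm assuming psycopg2 since the existing code is python):

    import psycopg2

    # Hourly: after parsing the latest consensus and votes, push one row per
    # authority into the Metrics postgres database. Host, credentials, table,
    # and columns below are placeholders, not real values.
    def insert_hourly(rows):
        conn = psycopg2.connect(host='db.example.org',
                                dbname='consensus_health', user='chc')
        with conn:
            with conn.cursor() as cur:
                cur.executemany(
                    """INSERT INTO vote_data
                           (consensus_time, authority, running, measured, voted)
                       VALUES (%s, %s, %s, %s, %s)""",
                    rows)
        conn.close()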

The graphs are, and will remain, completely clientside javascript that
generates SVG. I don't really want to change that, nor do I have the
time to. The HTML is generated with python, but there's very little
generation going on.

I think I could do this sometime over the next several months, if it
is acceptable and desirable:
a) Switch the sqlite database to postgres (assuming henryi gets
permission to talk to your database)
b) Generate CSV files out of the postgres database in the Metrics
codebase using Java
c) 'Generate' the HTML/javascript for the graphs in the Metrics
codebase using Java
d) Metrics gets the graphs as a first-class citizen
e) consensus-health will keep some form of non-interactive graphs
(maybe 7- and 30-day only), mostly to serve as a pointer into Metrics

If you want the data processing and update-the-database portion to
be in Java, I don't think I have the time to do that, though...

-tom

