[metrics-team] Integrating Consensus Health graphs onto metrics?

Karsten Loesing karsten at torproject.org
Thu May 4 10:39:25 UTC 2017


Hi Tom,

On 02.05.17 19:19, Tom Ritter wrote:
> On 2 May 2017 at 03:30, Karsten Loesing <karsten at torproject.org> wrote:
>> I understand that you're not too keen on rewriting this code for Tor
>> Metrics.  But I'm not sure if I want us to take processed Tor network
>> data from an external source, so I think I'd rather add a static
>> image and a link instead.  However, maybe we can find a compromise. :)
>>
>> How about we discuss what your data would have to look like in order to
>> be included on Tor Metrics, and even if we later decide not to take that
>> last step, you'll have benefited from our experience with making Tor
>> network data accessible to users?  (And of course, if we do decide to
>> take that last step, it'll be a smaller step.)
> 
> Sure!
> 
>> Some quick thoughts:
>>
>> Do you have a data format for the graph data behind your graphs that
>> scales for months or even a decade?
> 
> Kinda? https://consensus-health.torproject.org/historical.db is less
> than 5 MB right now and contains several years' worth of data. It has
> one row per consensus in two tables. I wouldn't blink at a sqlite
> database 10 times the size; and since it's SQL, porting it to MySQL or
> Postgres wouldn't be a big deal IMO.
> 
> So I feel it is acceptable to store fine-grained data for all time at this point.

Oh, I didn't mean the database size, though it's great that the database
is still tiny.  But even at 1000 times the size it would still be
reasonably small.  So, not an issue there.

What I meant is the CSV file size.  More on that below where you mention
dynamically added column sets (uh-oh).

>> It looks like all your graphs end
>> at "past 90 days", though I'm not clear whether that's because you
>> didn't want to make the page even longer or whether the data file would
>> become too big.
> 
> Neither. Because I don't do any smoothing of data, any graph over 90
> days gets painful spikes that correspond to a single hour's odd
> datapoint (as you note later for the 90-day graphs also).

Okay.

>> In the latter case, we should discuss how to keep the
>> file small enough even if it contains all the data back to 2007.  Or
>> we'd have to come up with a good reason why this data is only relevant
>> for the relatively recent past, unlike all other data.
> 
> The main pain point in scaling I anticipate is trying to do
> JavaScript-based processing on a CSV that contains years' worth of
> data. I intend to do this as part of #21883 in the next few weeks
> and see how painful it is, though.
> 
>> What you could do is pick a higher data resolution for quite recent data
>> and a lower data resolution for data in the distant past.  For example,
>> you could keep one data point per hour for the past week and reduce that
>> step by step to 4 hours, 12 hours, 24 hours, 240 hours, etc. for the
>> past years.
> 
> I think this is a good point, and I think it ought to be possible to
> do with just a slightly-more-complicated SQL SELECT. What I'm thinking
> is generating CSV files at different granularities: day-level
> granularity for all data, 12-hour granularity for the past year, etc. The
> client side would look at the date range you selected and choose the
> correct data source/granularity. (Again, the SQLite database has all
> the raw data, but the client-side JavaScript uses CSVs generated from
> that database every hour.)

Hmm.  I see how producing several CSV files would solve this.  But that
approach also has disadvantages: there will be more files, which will be
confusing to users downloading these files and using them with their own
tools.  It would also require new code if we wanted to add graphs to Tor
Metrics, because they expect a single data file right now, not half a
dozen.  Just thinking aloud here, I don't know what the best solution
is.  Hmm.
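
For what it's worth, the per-granularity export itself should be
cheap.  Here's a minimal sketch against SQLite, assuming a
hypothetical table vote_stats(consensus_time, authority, running) --
the real historical.db schema may well differ:

    import sqlite3

    def export_granularity(db_path, bucket_hours):
        # One averaged datapoint per authority per bucket.  The median
        # rule discussed below would need a bit of Python instead,
        # since SQLite has no built-in MEDIAN().
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            """SELECT (CAST(strftime('%s', consensus_time) AS INTEGER)
                       / (? * 3600)) * (? * 3600) AS bucket,
                      authority, AVG(running)
                 FROM vote_stats
                GROUP BY bucket, authority
                ORDER BY bucket""",
            (bucket_hours, bucket_hours)).fetchall()
        conn.close()
        return rows

One call per granularity -- hourly for the past week, 12-hourly for
the past year, daily for everything -- would keep each CSV bounded.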

> The question I have is - how do I reduce the granularity/smooth the
> data? If I'm looking at consensus values for a four hour span do I:
> a) mean
> b) median
> c) omit missing datapoints (0's) and then mean?
> d) omit missing datapoints (0's) and then median?
> e) other?
> 
> I have no idea what gives the most accurate representation of
> the underlying data; I'm hoping you've thought about this before and
> can just say "Do exactly <this>." =)

Onionoo does c), though you'd have to distinguish 0 from null and only
omit nulls, not 0's.  But d) might be even better, because it's more
robust against outliers.  I'd say pick d) if you can.

Oh, and Onionoo makes sure that it has at least 20% non-null values in a
given time period; if not, it produces null as the aggregate for that
time period.
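
In code, that rule could look roughly like this (a sketch, not the
actual Onionoo implementation):

    import statistics

    def aggregate(values, threshold=0.2):
        # One datapoint per consensus in the period; None where the
        # authority cast no vote.  Omit nulls, but keep 0's.
        present = [v for v in values if v is not None]
        if not present or len(present) < threshold * len(values):
            return None  # too sparse; report null for this period
        return statistics.median(present)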

>> This is similar to what Onionoo does, though I could see
>> how we adapt that approach for CSV files containing all data for a
>> graph.  Graphing this data requires some tricks: if we plot data from
>> two different data resolutions, we'll have to process all data to have
>> the lower of the two data resolutions; and if we plot a very short time
>> interval from the distant past, we'll have to interpolate from data
>> points possibly outside of the plotted time interval.  But I'd be happy
>> to help with this, unless you'd want to do this adventure on your own.
> 
> I think my idea about "Choose the highest-resolution dataset that
> covers the entire selected timespan" would be sufficient...
> 
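
Agreed, that rule should work and is simple to state.  A minimal
sketch in Python for brevity -- the real thing would live in your
client-side JavaScript, and the filenames and coverage windows here
are made up:

    from datetime import datetime, timedelta

    # (CSV file, bucket size in hours, how far back it reaches)
    DATASETS = [
        ("1h.csv", 1, timedelta(days=7)),
        ("12h.csv", 12, timedelta(days=365)),
        ("24h.csv", 24, None),  # daily data covers all time
    ]

    def pick_dataset(start, now=None):
        # Highest resolution first; take the first dataset whose
        # coverage window contains the whole requested range.
        now = now or datetime.utcnow()
        for name, hours, reach in DATASETS:
            if reach is None or now - start <= reach:
                return name, hours
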
>> Note that you should probably apply some smoothing to your 90-days
>> graphs anyway.  Your "Voted About Relays (Not Running)" graphs contain a
>> lot of volatility from relays joining and leaving over the day that
>> makes it hard to see trends or even differences between authorities.
> 
> Yes.
> 
>> Unless it's the outliers that you really care about to see.
> 
> No, unless someone objects.

No.  And if somebody does, you could include min/max or p1/p99 to give a
sense of outliers.  But most people wouldn't care.

>> By the way, if you expose your data in CSV files, you could quite easily
>> use some JavaScript graphing thing to make your graphs more interactive
>> and avoid having several graphs for different intervals of the same
>> data.  D3.js comes to mind.
> 
> Yup, that's the plan. (I use d3 now.)
> 
>> Another requirement for adding data to Tor Metrics is that it needs to
>> be documented in a similar way as the other data files:
>>
>> https://metrics.torproject.org/stats.html
> 
> That should be possible with the exception that the column set is
> dynamic. We add columns (automatically) whenever a new authority or
> bwauth pops online. So the table format is 'date',
> 'dirauthalice_running', 'dirauthalice_measured', 'dirauthalice_voted',
> 'dirauthbob_running' and so on

Ugh.  Dynamic column sets sound pretty evil to me.  What if you have a
CSV file covering 10 years?  The number of columns could be pretty big,
depending on how much churn there was among directory authorities.  It's
also non-trivial to handle dynamic column names.

What do you think about making a column for the authority name and
having only one column each for running, measured, voted, etc., with
$num_auths times as many rows?  Would that grow the CSV file too much?
It would certainly be easier to handle.
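
To illustrate with made-up numbers, the long format would be:

    date,authority,running,measured,voted
    2017-05-04 10:00,dirauthalice,7012,3402,7100
    2017-05-04 10:00,dirauthbob,6987,,7093

New authorities would then only add rows, never columns, so the
column set stays fixed no matter how much churn there is.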

>> And the graphs would have to be documented in a way that the average Tor
>> Metrics user can understand what's going on, at least to the point where
>> they can decide that they don't care about this aspect of the Tor network.
> 
> That probably isn't a problem either, assuming someone can read what I
> wrote and say "Yes I understand" or "No, try again."
> 
>> Yet another requirement for moving code to Tor Metrics is that it should
>> use PostgreSQL rather than SQLite and Java rather than whatever else
>> it's written in right now.
> 
> Switching to Postgres is easy from a code point of view and difficult
> from a sysadmin point of view. consensus-health is static HTML
> generated in Python on henryi and then gets deployed to... well, some
> frontend servers (I never investigated.)
> 
> henryi would need to talk to your postgres database to insert new
> values every hour. (Unless this is the part that you want cut over to
> Java.)

Or would you be able to run a PostgreSQL database on henryi?  I care
most about having table definitions, indexes, and view definitions in
PostgreSQL.  Getting the data on the metrics host would be a separate step.

> The graphs are, and will remain, completely client-side JavaScript
> that generates SVG. I don't really want to change that, nor do I have
> the time to. The HTML is generated with Python, but there's very
> little generation going on.
> 
> I think I could do this sometime over the next several months, if it
> is acceptable and desirable:
> a) Switch the sqlite database to postgres (assuming henryi gets
> permission to talk to your database)
> b) Generate CSV files out of the postgres database in the Metrics
> codebase using Java
> c) 'Generate' the HTML/javascript for the graphs in the Metrics
> codebase using Java
> d) Metrics gets the graphs as a first-class citizen
> e) consensus-health will keep some form of non-interactive graphs
> (maybe 7 and 30 day only) mostly to serve as a pointer into Metrics
> 
> If you want the data processing and update-the-database portion to
> be in Java, I don't think I have the time to do that though...

Let me give you an example of newly added data to Tor Metrics:

https://gitweb.torproject.org/metrics-web.git/commit/?id=917cc649b2012ea409fea1b73a7b5715e5ecb78a

Note that if we were to add new data related to directory authorities,
we wouldn't ask you to submit a patch like that.  But we'd appreciate
help with the following files:

 - database importer and CSV file exporter like Main.java, which can
be written in Python or another language and will be rewritten by the
metrics team,

 - database schema like init-onionperf.sql, ideally complete (a rough
sketch follows below),

 - graphing code like graphs.R, which can be written in D3.js and will
be rewritten by the metrics team,

 - graph description like in metrics.json (though this particular commit
didn't add a new graph, so it will require a bit more effort), and

 - data format specification like on stats.jsp, ideally complete.
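
To make the schema item concrete, here's a very rough sketch of what
such a file might contain for this data -- table and column names are
made up, not a proposal:

    CREATE TABLE vote_stats (
      consensus_time TIMESTAMP NOT NULL,
      authority TEXT NOT NULL,
      running INTEGER,
      measured INTEGER,
      voted INTEGER,
      PRIMARY KEY (consensus_time, authority)
    );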

But again, I'm happy to discuss these graphs more with you without
making binding plans to move them over to Tor Metrics later.  Good
graphs on Consensus Health are still worth the effort. :)

Thanks!

> -tom

All the best,
Karsten

