[metrics-team] Integrating Consensus Health graphs onto metrics?

Tom Ritter tom at ritter.vg
Thu May 4 15:49:56 UTC 2017


Closing the loop on this after the meeting this morning.

On 4 May 2017 at 05:39, Karsten Loesing <karsten at torproject.org> wrote:
>>> In the latter case, we should discuss how to keep the
>>> file small enough even if it contains all the data back to 2007.  Or
>>> we'd have to come up with a good reason why this data is only relevant
>>> for the relatively recent past, unlike all other data.
>>
>> The main pain point in scaling I anticipate is trying to do
>> javascript-based processing on a CSV that contains years' worth of
>> data. I intend to do this as part of #21883 in the next few weeks and
>> to see how painful it is though.
>>
>>> What you could do is pick a higher data resolution for quite recent data
>>> and a lower data resolution for data in the distant past.  For example,
>>> you could keep one data point per hour for the past week and reduce that
>>> step by step to 4 hours, 12 hours, 24 hours, 240 hours, etc. for the
>>> past years.
>>
>> I think this is a good point, and I think it ought to be possible to
>> do with just a slightly-more-complicated SQL SELECT. What I'm thinking
>> is generating CSV files at different granularities: day-level
>> granularity for all data, 12-hour granularity for past year; etc. The
>> clientside would look at the date range you selected and choose the
>> correct datasource/granularity.  (Again, the SQLite database has all
>> the raw data, but the clientside javascript uses CSVs generated by
>> that database (which happens hourly).)
>
> Hmm.  I see how producing several CSV files would solve this.  But that
> approach also has disadvantages: there will be more files, which will be
> confusing to users downloading these files and using them with their own
> tools.  It would also require new code if we wanted to add graphs to Tor
> Metrics, because they expect a single data file right now, not half a
> dozen.  Just thinking aloud here, I don't know what the best solution
> is.  Hmm.

Metrics expects a CSV file with day-level granularity.

My current (still-developing) plan is to create two graphs for
consensus-health: one that is non-interactive and has hour-level
granularity for the past week (which we have currently and which seems
to handle spikes just fine) and an interactive one that uses day-level
granularity. So two CSV files, only one of which you would use if/when
you adopt these.

>> The question I have is - how do I reduce the granularity/smooth the
>> data? If I'm looking at consensus values for a four hour span do I:
>> a) mean
>> b) median
>> c) omit missing datapoints (0's) and then mean?
>> d) omit missing datapoints (0's) and then median?
>> e) other?
>>
>> I have no idea about what gives the most accurate representation of
>> the underlying data; I'm hoping you've thought about this before and
>> can just say "Do exactly <this>." =)
>
> Onionoo does c), though you'd have to distinguish 0 from null and only
> omit nulls, not 0's.  But d) might be even better, because it's more
> robust against outliers.  I'd say pick d) if you can.
>
> Oh, and Onionoo makes sure that it has at least 20% non-null values in a
> given time period, and if not, it produces null as aggregate for that
> time period.

Thanks! I am so happy to not have to think about what I should do and
just implement it =)
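For the record, a minimal sketch of option d) combined with Onionoo's
20% rule, as I understand it (the function name and threshold parameter
are mine, not Onionoo's):

```python
# Hypothetical sketch of option (d): drop nulls (not 0's), require at
# least 20% non-null values in the bucket, then take the median.
import statistics

def aggregate_bucket(values, min_frac=0.2):
    """Aggregate one time bucket. `values` may contain None for missing
    datapoints; returns None if fewer than min_frac are present."""
    present = [v for v in values if v is not None]
    if not values or len(present) / len(values) < min_frac:
        return None
    return statistics.median(present)
```

Note that 0 is a legitimate value here and is kept; only None is
treated as missing.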

>>> By the way, if you expose your data in CSV files, you could quite easily
>>> use some JavaScript graphing thing to make your graphs more interactive
>>> and avoid having several graphs for different intervals of the same
>>> data.  D3.js comes to mind.
>>
>> Yup, that's the plan. (I use d3 now.)
>>
>>> Another requirement for adding data to Tor Metrics is that it needs to
>>> be documented in a similar way as the other data files:
>>>
>>> https://metrics.torproject.org/stats.html
>>
>> That should be possible with the exception that the column set is
>> dynamic. We add columns (automatically) whenever a new authority or
>> bwauth pops online. So the table format is 'date'
>> 'dirauthalice_running', 'dirauthalice_measured', 'dirauthalice_voted',
>> 'dirauthbob_running' and so on
>
> Ugh.  Dynamic column sets sound pretty evil to me.  What if you have a
> CSV file covering 10 years?  The number of columns could be pretty big,
> depending on how much churn there was among directory authorities.  It's
> also non-trivial to handle dynamic column names.

Well... I agree that dynamic columns make things much more difficult
to handle; but the number of columns doesn't seem that painful.
Dynamic columns mean dynamic column handling, at which point the
difference between 20 and 200 isn't that big.

> What do you think about making a column for the authority name and have
> only one column for running, measured, voted, etc. and $num_auths times
> as many rows?  Would that grow the CSV file too much?  It would
> certainly be easier to handle.

Nonetheless, I can convert the code over, for the good of Metrics,
with the hope that these graphs graduate up =)
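To make sure I understand the suggested format: a rough sketch of
converting one of my current "wide" rows into fixed-schema rows with an
explicit authority column (the row contents here are invented for
illustration):

```python
# Hypothetical sketch: turn a wide row with dynamic per-authority
# columns into fixed-schema "long" rows, one per (date, authority).
def to_long_rows(wide_row):
    """wide_row: dict like {'date': ..., 'moria1_running': ..., ...}.
    Yields dicts with a fixed column set and an 'authority' column."""
    date = wide_row["date"]
    auths = {}
    for key, value in wide_row.items():
        if key == "date":
            continue
        # Split 'moria1_running' into authority and field name.
        auth, _, field = key.rpartition("_")
        auths.setdefault(auth, {})[field] = value
    for auth, fields in sorted(auths.items()):
        yield {"date": date, "authority": auth, **fields}
```

The upside is that the column set never changes when an authority comes
or goes; the cost is roughly $num_auths times as many rows, as Karsten
says.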

>>> And the graphs would have to be documented in a way that the average Tor
>>> Metrics user can understand what's going on, at least to the point where
>>> they can decide that they don't care about this aspect of the Tor network.
>>
>> That probably isn't a problem either, assuming someone can read what I
>> wrote and say "Yes I understand" or "No, try again."
>>
>>> Yet another requirement for moving code to Tor Metrics is that it should
>>> use PostgreSQL rather than SQLite and Java rather than whatever else
>>> it's written in right now.
>>
>> Switching to Postgres is easy from a code point of view and difficult
>> from a sysadmin point of view. consensus-health is static HTML
>> generated in python on henryi and then gets deployed to... well, some
>> frontend servers (I never investigated.)
>>
>> henryi would need to talk to your postgres database to insert new
>> values every hour. (Unless this is the part that you want cut over to
>> Java.)
>
> Or would you be able to run a PostgreSQL database on henryi?  I care
> most about having table definitions, indexes, and view definitions in
> PostgreSQL.  Getting the data on the metrics host would be a separate step.

That is a question for the sysadmins. =)  For now I'm not going to
push on it, because I think converting from SQLite -> Postgres would
not be a terribly difficult thing to do.

>> The graphs are, and will remain, completely clientside javascript that
>> generate SVG. I don't really want to change that nor do I have the
>> time to. The HTML is generated with python, but there's very little
>> generation going on.
>>
>> I think I could do this sometime over the next several months, if it
>> is acceptable and desirable:
>> a) Switch the sqlite database to postgres (assuming henryi gets
>> permission to talk to your database)
>> b) Generate CSV files out of the postgres database in the Metrics
>> codebase using Java
>> c) 'Generate' the HTML/javascript for the graphs in the Metrics
>> codebase using Java
>> d) Metrics gets the graphs as a first-class citizen
>> e) consensus-health will keep some form of non-interactive graphs
>> (maybe 7 and 30 day only) mostly to serve as a pointer into Metrics
>>
>> If you want the data processing and update-the-database portion to
>> be in Java, I don't think I have the time to do that though...
>
> Let me give you an example of newly added data to Tor Metrics:
>
> https://gitweb.torproject.org/metrics-web.git/commit/?id=917cc649b2012ea409fea1b73a7b5715e5ecb78a
>
> Note that if we were to add new data related to directory authorities,
> we wouldn't ask you to submit a patch like that.  But we'd appreciate
> help with the following files:
>
>  - database importer and CSV file exporter like Main.java but can be
> written in Python or another language and will be rewritten by metrics team,
>
>  - database schema like init-onionperf.sql, ideally complete,
>
>  - graphing code like graphs.R but can be written in D3.js and will be
> rewritten by metrics team,
>
>  - graph description like in metrics.json (though this particular commit
> didn't add a new graph, so it will require a bit more effort), and
>
>  - data format specification like on stats.jsp, ideally complete.
>
> But again, I'm happy to discuss these graphs more with you without
> making binding plans to move them over to Tor Metrics later.  Good
> graphs on Consensus Health are still worth the effort. :)

So my plan is:
- Refactor the database schema to not have dynamic columns
- Create two CSV files: hourly data for a week; daily data for all time
- Create interactive graphs for consensus-health using the latter CSV file
- Refactor the current consensus-health graph page... into something
- Show you my python-based data creation code, the python-based CSV
generation code, and the d3 javascript for graph generation

-tom
