[ooni-dev] Feedback on OONI data collection, aggregation, and visualization

Arturo Filastò art at torproject.org
Mon Dec 1 14:42:08 UTC 2014


Hi Karsten,

Thanks for these thoughts and sorry for not replying sooner.

On 10/19/14, 1:52 PM, Karsten Loesing wrote:
>  - For Tor network data it has turned out to be quite useful to strictly
> separate data collection from data aggregation from data visualization.
>  That is, don't worry too much about visualizing the right thing, but
> start with something, and if you don't like it, throw it away and do it
> differently.  And if you're aggregating the wrong thing, then aggregate
> the previously collected data in a different way.  Of course, if you
> figure out you collected the wrong thing, then you won't be able to go
> back in time and fix that.
> 

This is indeed the approach that we are now using in the case of the
ooni-pipeline: put all the data we collect into a NoSQL database, on
which we can then run queries and present the results in various ways.
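To make the query side of that concrete, here is a minimal sketch of building a MongoDB aggregation pipeline that counts collected reports per country. The collection and field names (`reports`, `test_name`, `probe_cc`) are illustrative assumptions, not the actual OONI schema.

```python
def reports_per_country_pipeline(test_name):
    """Build a MongoDB aggregation pipeline grouping reports by country.

    Returns a plain list of pipeline stages; in real use it would be
    handed to pymongo, e.g.:
        db.reports.aggregate(reports_per_country_pipeline("bridge_reachability"))
    """
    return [
        # Keep only reports for the given test.
        {"$match": {"test_name": test_name}},
        # Count reports per probe country code.
        {"$group": {"_id": "$probe_cc", "count": {"$sum": 1}}},
        # Highest counts first.
        {"$sort": {"count": -1}},
    ]
```

Because the pipeline is just data, it can be inspected and tested without a running database.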

There are some ideas on how to present this data; you can learn more
about them here:
https://trac.torproject.org/projects/tor/ticket/13731

>  - I saw some discussion of "The pool from where the bridge has been
> extracted (private, tbb, BridgeDB https, BridgeDB email)".  Note that
> isis and I are currently talking about removing sanitized bridge pool
> assignments from CollecTor.  We're thinking about adding a new config
> line to tor that states the preferred bridge pool, which could be used
> here instead.  Just as a heads-up, six months or so in advance.  I can
> probably provide more details if this is relevant to you.
> 

This is probably something that should be mentioned in this ticket:

https://trac.torproject.org/projects/tor/ticket/13570

I like the idea that the interaction with BridgeDB is opaque to us. All
we care about is that they give us a JSON dictionary that has some keys
we expect.
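As a rough sketch of that "opaque dictionary" idea: decode the JSON and only verify that the expected keys are present, ignoring everything else. The key names below (`fingerprint`, `distributor`) are made up for illustration, not BridgeDB's actual schema.

```python
import json

# Hypothetical keys we expect in a bridge entry; everything else in the
# dictionary is treated as opaque.
EXPECTED_KEYS = {"fingerprint", "distributor"}

def parse_bridge_entry(raw):
    """Decode a JSON object and verify it carries the expected keys."""
    entry = json.loads(raw)
    missing = EXPECTED_KEYS - entry.keys()
    if missing:
        raise ValueError("missing keys: %s" % ", ".join(sorted(missing)))
    return entry
```

The point of the design is that BridgeDB can add or rename internal fields freely, as long as the agreed-upon keys survive.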

>> Another area that perhaps overlaps with the needs of the metrics is data
>> storage. Currently we have around 16 GB of uncompressed raw report data
>> that needs to be archived (currently it's being stored and published on
>> staticiforme, but I have a feeling that is not ideal especially when the
>> data will become much bigger) and indexed in some sort of database.
>> Once we put the data (or a subset of it) in a database producing
>> visualizations and exposing the data to end users will be much simpler.
>> The question is if this is a need also for
>> Metrics/BwAuth/ExitScanner/DocTor and if we can perhaps work out some
>> shared infrastructure to fit both of our goals.
>> Currently we have placed the data inside of MongoDB, but some concerns
>> with it have been raised [2].
> 
> Again, some random thoughts:
> 
>  - For Metrics, the choice of database is entirely an internal decision,
> and no user would ever see that.  It's part of the aggregation part.  If
> we ever decide to pick something else (than PostgreSQL in this case),
> we'd have to rewrite the aggregation scripts, which would then produce
> the same or similar output (which is an .csv file in our case).  That
> being said, trying out MongoDB or another NoSQL variant might be
> worthwhile, but don't rely on it too much.
> 

At this point we have been using MongoDB for a couple of months and,
apart from a few initial issues (which had to do with me not being
familiar with NoSQL document-oriented databases), it works quite well.

I also realized that doing JOINs across different collections (i.e.,
tables) is not something you want to do in NoSQL. If it introduces no
(or minimal) duplication, it's best to just put everything inside one
big fat document.

Doing this requires re-processing all the data, but it is the path we
are going to follow going forward.
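To illustrate the embedding idea with plain data (field names are hypothetical, not the real OONI schema): instead of keeping measurements in a separate collection keyed by report ID, they live inside the report document itself.

```python
# One "big fat document": measurements are embedded in the report
# rather than stored in a second collection that would require a JOIN.
report = {
    "report_id": "2014-11-01-example",
    "probe_cc": "NL",
    "measurements": [
        {"input": "torproject.org", "success": True},
        {"input": "example.com", "success": False},
    ],
}

# A per-report query now touches a single document; no cross-collection
# lookup is needed to find the failed measurements.
failures = [m for m in report["measurements"] if not m["success"]]
```

The trade-off is the one mentioned above: embedding only pays off when it does not duplicate the same sub-documents across many parents.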

>  - Would you want to add bridge reachability statistics to Tor Metrics?
>  I'm currently working on opening it up and making it easier for people
> to contribute metrics.  Maybe take a look at the website prototype that
> I posted to tor-dev@ a week ago [3] (and if you want, comment there).  I
> could very well imagine adding a new section "Reachability" right next
> to "Diversity" with one or more graphs/tables provided by you.  Please
> see the new "Contributing to Tor Metrics" section on the About page for
> the various options for contributing data or metrics.
> 

Yes this would be awesome!

Our timeline for shipping these visualizations is to have something
ready by the end of this year (at this point, one month away).

I think we should be able to get there, also with the help of the Choke
Point Project.

I will keep you posted and reply to that thread once we have something
ready to post publicly.

>  - Please ask weasel for a VM to host those 16 GB of report data; having
> it on staticiforme is probably a bad idea.  Also, do you have any plans
> to synchronize reports between hosts?  I'm planning such a thing for
> CollecTor where two or more instances fetch relay descriptors from
> directory authorities and automatically exchange missing descriptors.

I ended up getting one box donated by GreenHost and renting another one,
since this gives us more freedom to operate.

We do have in mind a multi-host sync protocol that follows a pub-sub
paradigm, but for the moment it's implemented with simple rsync-based
polling.
A cronjob runs an rsync task on every host that collects reports. Some
hosts receive reports (for archival purposes), while others just collect
them from clients and then want them archived. For the latter, the
cronjob copies the reports that have not yet been archived to all the
hosts that should archive them, and then deletes the copies on the
collector.

For how it is implemented, see this code:
https://github.com/TheTorProject/ooni-pipeline/blob/master/ooni/pipeline/task/sync.py
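The copy step of that polling loop can be sketched roughly as below. The host name and paths are made up for illustration, and the real logic lives in the ooni-pipeline sync task linked above; this only shows how the rsync invocation might be assembled.

```python
def build_rsync_command(report_dir, archive_host, archive_dir):
    """Assemble the rsync invocation that ships reports to an archive host."""
    return [
        "rsync", "-az",
        "--ignore-existing",  # skip reports the archive already has
        report_dir + "/",
        "%s:%s/" % (archive_host, archive_dir),
    ]

# A cronjob on the collector would then run something like (not
# executed here), removing the local copies after a successful sync:
#   subprocess.check_call(build_rsync_command(
#       "/data/reports", "archive.example.net", "/srv/reports"))
```

Building the command as a list keeps it testable without touching the network.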


>  - I could imagine extending CollecTor to also collect and archive OONI
> reports, as a long-term thing.  Right now CollecTor does that for Tor
> relay and bridge descriptors, TORDNSEL exit lists, BridgeDB pool
> assignment files, and Torperf performance measurement results.  But note
> that it's written in Java and that I hardly have development time to
> keep it afloat; so somebody else would have to extend it towards
> supporting OONI reports.  I'd be willing to review and merge things.  We
> should also keep CollecTor pure Java, because I want to make it easier
> for others to run their own mirror and help us make data more redundant.
>  Anyway, I can also imagine keeping the OONI report collector distinct
> from CollecTor and only exchange design ideas and experiences if that's
> easier.
> 

That would be awesome!

Could you point me to the relevant portions of the CollecTor code that
would be helpful for implementing this?

It would also be great if you could write a ticket under the OONI
component of Trac with some pointers for whoever may be interested in
implementing this.

> Lots of ideas.  What do you think?
> 

Thanks for taking the time to compose this.

~ Arturo
