
Hi everyone,

I just subscribed to this list, because Arturo asked me to comment on two postings here. As a very quick introduction, and because I don't know how distinct the Tor community and the OONI community are: I'm the developer behind the Tor network data collector CollecTor [-1] and the Tor Metrics website that aggregates and visualizes Tor network data [0].

Here's what Arturo asked on another thread:

With OONI what we are currently focusing on is Bridge Reachability measurements. We have at this time 1 meter in China, 1 in Iran (a second one is going to be set up soon), 1 in Russia and 1 in Ukraine. We have some ideas of the sorts of information we would like to extract from this data, but it would also be very good to have some more feedback from you on what would be useful [1].

Long mail is long. Some random thoughts:

- For Tor network data it has turned out to be quite useful to strictly separate data collection from data aggregation from data visualization. That is, don't worry too much about visualizing the right thing, but start with something, and if you don't like it, throw it away and do it differently. And if you're aggregating the wrong thing, then aggregate the previously collected data in a different way. Of course, if you figure out you collected the wrong thing, then you won't be able to go back in time and fix that.

- I saw some discussion of "The pool from where the bridge has been extracted (private, tbb, BridgeDB https, BridgeDB email)". Note that isis and I are currently talking about removing sanitized bridge pool assignments from CollecTor. We're thinking about adding a new config line to tor that states the preferred bridge pool, which could be used here instead. Just as a heads-up, six months or so in advance. I can probably provide more details if this is relevant to you.

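To make the first point a bit more concrete, here is a minimal sketch in Python of what that separation could look like. Everything in it is an assumption for illustration: the record fields (`date`, `country`, `bridge`, `reachable`), the sample values, and the function names are hypothetical and not taken from OONI reports or from CollecTor. The idea is only that the raw records stay untouched, while the aggregation can be redone along a different key at any time and hands a .csv to whatever does the visualization.

```python
import csv
import io
from collections import defaultdict

# Collection: raw measurements are archived unchanged (hypothetical sample data).
RAW_REPORTS = [
    {"date": "2014-10-01", "country": "cn", "bridge": "A", "reachable": False},
    {"date": "2014-10-01", "country": "ir", "bridge": "A", "reachable": True},
    {"date": "2014-10-01", "country": "cn", "bridge": "B", "reachable": True},
    {"date": "2014-10-02", "country": "cn", "bridge": "A", "reachable": False},
]


def aggregate(reports, key):
    """Aggregation: fraction of reachable measurements, grouped by `key`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [reachable, total]
    for report in reports:
        group = counts[report[key]]
        group[1] += 1
        if report["reachable"]:
            group[0] += 1
    return {name: ok / total for name, (ok, total) in sorted(counts.items())}


def to_csv(fractions, key):
    """Hand-off to visualization: a plain .csv string, nothing else."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([key, "reachable_fraction"])
    for name, fraction in fractions.items():
        writer.writerow([name, f"{fraction:.2f}"])
    return buf.getvalue()


# If the per-country view turns out to be the wrong one, re-aggregate the
# same raw data per bridge without collecting anything again.
by_country_csv = to_csv(aggregate(RAW_REPORTS, "country"), "country")
by_bridge_csv = to_csv(aggregate(RAW_REPORTS, "bridge"), "bridge")
```

Switching from the per-country to the per-bridge view only changes the `key` argument. What cannot be fixed afterwards is a field that was never collected in the first place.
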
Another area that perhaps overlaps with the needs of the metrics is data storage. Currently we have around 16 GB of uncompressed raw report data that needs to be archived (currently it's being stored and published on staticiforme, but I have a feeling that is not ideal especially when the data will become much bigger) and indexed in some sort of database. Once we put the data (or a subset of it) in a database producing visualizations and exposing the data to end users will be much simpler. The question is if this is a need also for Metrics/BwAuth/ExitScanner/DocTor and if we can perhaps work out some shared infrastructure to fit both of our goals. Currently we have placed the data inside of MongoDB, but some concerns with it have been raised [2].

Again, some random thoughts:

- For Metrics, the choice of database is entirely an internal decision, and no user would ever see it. It belongs to the aggregation step. If we ever decide to pick something other than PostgreSQL, we'd have to rewrite the aggregation scripts, which would then produce the same or similar output (which is a .csv file in our case). That being said, trying out MongoDB or another NoSQL variant might be worthwhile, but don't rely on it too much.

- Would you want to add bridge reachability statistics to Tor Metrics? I'm currently working on opening it up and making it easier for people to contribute metrics. Maybe take a look at the website prototype that I posted to tor-dev@ a week ago [3] (and if you want, comment there). I could very well imagine adding a new section "Reachability" right next to "Diversity" with one or more graphs/tables provided by you. Please see the new "Contributing to Tor Metrics" section on the About page for the various options for contributing data or metrics.

- Please ask weasel for a VM to host those 16 GB of report data; having it on staticiforme is probably a bad idea. Also, do you have any plans to synchronize reports between hosts? I'm planning such a thing for CollecTor where two or more instances fetch relay descriptors from directory authorities and automatically exchange missing descriptors.

- I could imagine extending CollecTor to also collect and archive OONI reports, as a long-term thing. Right now CollecTor does that for Tor relay and bridge descriptors, TorDNSEL exit lists, BridgeDB pool assignment files, and Torperf performance measurement results. But note that it's written in Java and that I hardly have development time to keep it afloat; so somebody else would have to extend it towards supporting OONI reports. I'd be willing to review and merge things.
We should also keep CollecTor pure Java, because I want to make it easier for others to run their own mirror and help us make data more redundant. Anyway, I can also imagine keeping the OONI report collector distinct from CollecTor and only exchanging design ideas and experiences if that's easier.

Lots of ideas. What do you think?

All the best,
Karsten

[-1] https://collector.torproject.org/
[0] https://metrics.torproject.org/
[1] https://lists.torproject.org/pipermail/ooni-dev/2014-October/000176.html
[2] https://lists.torproject.org/pipermail/ooni-dev/2014-October/000178.html
[3] https://kloesing.github.io/metrics-2.0/