[ooni-dev] hadoop?

Tyler Fisher apt.get.apps at gmail.com
Mon Oct 5 14:45:54 UTC 2015


It might be worthwhile to take a step back and discuss why
ooni-pipeline-ng and related tooling use Apache Spark + Hadoop instead of
an RDBMS/NoSQL database before diving into the Hadoop ecosystem.  In my
experience, early adoption of Hadoop typically incurs significant technical
debt, associated with both the maintenance of the cluster and increased
development

If I am correct in my assumption, ooni-pipeline-ng adopted Apache Spark
largely due to its flexibility and (relatively) gentle learning curve.

It is worth mentioning that I am currently working on a service in parallel
that makes a subset of the OONI metrics available to developers, and other
users (specifically the YAML reports associated with different ooni-probe
measurements). I've been pretty busy lately, but I should have a service up
shortly that I can open up to peer review. After working with the MongoDB
aggregation framework for a few weeks, I've opted to lean more towards
using PostgreSQL given that it supports conventional aggregate queries that
many developers (including myself) are familiar with, which will make it
easier for other developers to contribute.

PostgreSQL may be preferable to using a NoSQL database or adopting Hadoop,
given that, for the most part, ooni-probe reports can be modeled in
relational form. The only sparse element of a given ooni-probe report is
the result associated with a given test, and even that follows a
relatively predictable schema which can be targeted using an aggregate
One of the neat features of PostgreSQL is that it supports aggregate
queries over nested JSON documents, meaning that you can aggregate on
nested JSON fields in tandem with aggregates on scalar fields. This is
pretty useful, not to mention performant when proper indices are used. This
may be sufficient for what the metrics team is doing, but of course, I'd
have to hear more about which hurdles you're trying to overcome.
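
To make the nested-JSON point a bit more concrete, here is a rough sketch
of the kind of query I have in mind. The table name "reports", the scalar
column "probe_cc", the JSON column "result" and its nested keys are all
hypothetical names picked for illustration, not the real OONI schema. The
SQL string is PostgreSQL syntax and is not executed here; the Python
function next to it only computes the same grouping in memory so the
semantics are easy to check without a database.

```python
import json
from collections import Counter

# PostgreSQL form of the query, shown for reference only (not executed here).
# "reports", "probe_cc", "result" and the nested keys are hypothetical names.
POSTGRES_QUERY = """
SELECT probe_cc,
       result -> 'http' ->> 'status' AS status,
       count(*) AS n
FROM reports
GROUP BY probe_cc, result -> 'http' ->> 'status';
"""


def count_by_country_and_status(rows):
    """In-memory equivalent of POSTGRES_QUERY, to make its semantics concrete.

    Each row is a dict with a scalar "probe_cc" and a JSON string "result".
    Returns {(probe_cc, status): count}. A missing nested key yields a status
    of None, mirroring the NULL that PostgreSQL would return.
    """
    counts = Counter()
    for row in rows:
        result = json.loads(row["result"])
        # Walk the nested document: result -> 'http' ->> 'status'.
        status = result.get("http", {}).get("status")
        counts[(row["probe_cc"], status)] += 1
    return dict(counts)


# Made-up sample rows, only to exercise the function.
SAMPLE_ROWS = [
    {"probe_cc": "DE", "result": '{"http": {"status": "ok"}}'},
    {"probe_cc": "DE", "result": '{"http": {"status": "ok"}}'},
    {"probe_cc": "IR", "result": '{"http": {"status": "blocked"}}'},
    {"probe_cc": "IR", "result": '{"dns": {"status": "ok"}}'},
]

if __name__ == "__main__":
    for key, n in count_by_country_and_status(SAMPLE_ROWS).items():
        print(key, n)
```

The grouping mixes a scalar column (probe_cc) with a value pulled out of
the nested JSON, which is the combination described above. On the
PostgreSQL side, an expression index on the same JSON path is one way to
keep such a query fast, though whether it is needed depends on data volume.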

Before tackling scalability, we should verify that the services in question
fit your use case. If I am correct in my assumption, the ooni-probe metrics
are less than 1GB overall, as opposed to being several terabytes.

[1] ooni-base format:


GPG fingerprint: 8931 45DF 609B EE2E BC32  5E71 631E 6FC3 4686 F0EB

PS: Hopefully I am replying to this e-mail thread properly - I am not too
familiar with mailing lists.

Message: 1
Date: Thu, 1 Oct 2015 15:50:48 +0200
From: thomas lörtsch <tl at rat.io>
To: ooni-dev at lists.torproject.org
Subject: [ooni-dev] hadoop?
Message-ID: <BC4AEC9F-CEEF-4FB7-B313-B87D960A1B66 at rat.io>
Content-Type: text/plain; charset=windows-1252


measurement team thinks about setting up a server with metrics data and an
environment that allows everybody (everybody with a login, that is) to
analyze metrics data and do crazy research with it.
Hadoop as a well established Big Data solution seems like a good choice to
base that environment on, enhanced by R and probably more stuff. The
problem is that Hadoop is not in Debian stable (and doesn't seem to get in
anytime soon [1]). The only alternatives we could find are PostgreSQL and
MongoDB, but MongoDB is too shoddy and PostgreSQL will likely struggle with
the kind of data we intend to throw at it and won't be fun to work with.

Ooni does use Hadoop and we'd like to know why and how. Didn't you, like
us, find any viable alternative to Hadoop that is available in Debian
stable? How did you get around Hadoop not being in stable? Can you advise
us to do the same or look somewhere else? (Where?)


[1] https://wiki.debian.org/Hadoop