[metrics-team] analytics project status report

tl tl at rat.io
Thu Feb 11 12:16:28 UTC 2016


Karsten asked me last week to give an update on the analytics project and here it is.

To recap: 
The plan in short is to convert all the data that CollecTor collects into one or more serializations that are easy to use in state-of-the-art, broadly accessible data analytics software like Tableau, MongoDB and the whole Hadoop ecosystem, especially Drill and Spark. You might not know these names, but don’t worry: they work with SQL and MapReduce and have some other tricks up their sleeves that you can employ when you’re ready. They are easy to deploy on a laptop and there are lots of tutorials. A second step would be to have some of these tools and the data readily set up on a server run by Tor, with standard use cases preaggregated. The serialization formats would be JSON (for the more desktop-ish tools like MongoDB) and Parquet (more performant, Hadoop-oriented).
So, converting the data to JSON and Parquet is the first and most tedious step. Setting up the software is quite trivial. Developing useful aggregations and analytics strategies will be the real work.

Achievements so far:
The current status is that I finished a converter tool to convert CollecTor data to JSON a few weeks ago. The conversion has some small bugs, as I found out in the meantime, but it’s nonetheless useful and usable. The tool is written in Java but can be operated from the command line, no programming required. It’s in the 'mteam' repository on GitHub [0], and a compiled standalone version is ready to download and use [1].
In a second step I want to upgrade this tool to convert CollecTor data not only to JSON but also to Avro and Parquet. These formats are much more performant and space-efficient than JSON and much better suited to working seriously with the data. For this development I cloned 'mteam' into a new repo, 'converTor' [2] - because I wanted to be able to break things again without losing the working JSON converter, and because I figured that such a cool tool would need a cool name too ;-)
I read what I could find, sorted the options and came up with a plan that involved writing data schemata in a special Avro schema language, autogenerating Java classes from these schemata and then developing a statically type-checked conversion tool in Java. The schemata are finished (that’s when I found those bugs in the first version of the converter), the classes are autogenerated and a first descriptor converter is written. But now I’m stuck on figuring out how to write the converted serialization to disk - which is different for all 3 formats and not exactly well documented.
The problem here is that to develop with Avro and Parquet you’d better be a somewhat seasoned Java developer. Documentation is sparse, options are complex, good examples on the internets are hard to find (and then often take a different route than the one I have chosen), and the source code itself is challenging. The mere fact that I have to dig into the sources is … a challenge for me.

The status:
You can see the state of things in the repo. There is one central file, Convert.java [3], that contains the backbone of the logic. The other interesting part is the schemata [4] (and there especially the *.avdl versions), but you don’t need to understand those to follow along. The helper classes autogenerated from the schemata sit in appropriately named directories beside Convert.java. Some background and links to introductions about Avro can be found in avro.md [5].
Convert.java begins with some configuration, mainly setting up the 9 descriptor types and the many attributes they need to be guided successfully through the Avro|Parquet|JSON switch. The main method then starts with more configuration, this time defining and honoring the command-line parameters (a few…), followed by the descriptor reading logic, which hasn’t changed much since Karsten’s very first version of this script. After main comes the "COMMONS" section, with "file writing machinery" as its first subsection, and this is where the trouble begins. And ends, so far: the prototype converter for Torperf (in several versions again, as Avro can really be overwhelming in its flexibility) is fairly trivial and should work okay. The other, more complex descriptor types will pose some challenges to a Java rookie like me, but I’ll tackle those only after the writer troubles are resolved.
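
For readers who want a picture of what such a format switch can look like, here is a minimal sketch. It is not taken from Convert.java: the names FormatSwitchSketch, RecordSink, JsonSink and open, as well as the record fields, are made up for illustration. It assumes that all format-specific writing can be hidden behind one small interface, so the decision is made in a single place. Only the JSON branch is filled in, because the Avro and Parquet writers need their respective libraries.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormatSwitchSketch {

    /** The three target serializations named in this mail. */
    enum Format { JSON, AVRO, PARQUET }

    /** Hypothetical common interface that hides the per-format writer. */
    interface RecordSink extends AutoCloseable {
        void append(Map<String, Object> record) throws IOException;
        @Override void close() throws IOException;
    }

    /** JSON sink: one JSON object per line; handles only strings and numbers. */
    static final class JsonSink implements RecordSink {
        private final Writer out;

        JsonSink(Writer out) { this.out = out; }

        @Override
        public void append(Map<String, Object> record) throws IOException {
            StringBuilder sb = new StringBuilder("{");
            String sep = "";
            for (Map.Entry<String, Object> e : record.entrySet()) {
                Object v = e.getValue();
                sb.append(sep).append(quote(e.getKey())).append(':');
                sb.append(v instanceof Number ? v.toString() : quote(String.valueOf(v)));
                sep = ",";
            }
            out.write(sb.append("}\n").toString());
        }

        @Override
        public void close() throws IOException { out.close(); }

        private static String quote(String s) {
            return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        }
    }

    /** The single place where the format decision is made. */
    static RecordSink open(Format format, Writer out) {
        switch (format) {
            case JSON:
                return new JsonSink(out);
            case AVRO:
            case PARQUET:
                // Placeholder: the real writers need the Avro and Parquet libraries.
                throw new UnsupportedOperationException(format + " is not sketched here");
            default:
                throw new IllegalArgumentException("unknown format: " + format);
        }
    }

    /** Writes one made-up record as JSON into memory and returns the text. */
    static String demo() {
        StringWriter buffer = new StringWriter();
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("source", "example");   // illustrative values only
        record.put("filesize", 51200);
        try (RecordSink sink = open(Format.JSON, buffer)) {
            sink.append(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

The point of the design is that the descriptor reading loop only ever sees RecordSink, so adding a working Avro or Parquet branch later would not touch the conversion logic.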

The plan (A and B):
I won’t go into detail about what the trouble with those writers is, as this mail has gotten much too long already (although I’m of course happy to elaborate if someone is interested!). There is no Parquet user mailing list, so I posted a question on Stack Overflow [6] yesterday but haven’t gotten any response so far (maybe it’s too long too). I’ll wait another day (and in the meantime wise up on Java generics and stuff) but then switch to plan B, which is to take a less performant route through this jungle: without autogenerated classes and without static type checking, with constructors marked as deprecated, but with at least some trustworthy-looking examples on the internets and less complex switching and conversion logic. That’s the plan. Help and comments are of course very welcome.
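
To make the trade-off between plan A and plan B a bit more concrete, here is a small sketch in plain Java. It does not use the Avro API and does not claim to mirror how Avro's generated or generic records behave in detail; TypedTorperf, GenericRecordLike and the two fields are hypothetical stand-ins. It only shows the general difference: with a generated-style class a wrongly typed value is rejected by the compiler, while with a name-based record the same mistake surfaces at run time at the earliest.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TypedVersusGenericSketch {

    /** Stand-in for a schema-generated class: field types are fixed at compile time. */
    static final class TypedTorperf {
        private final String source;
        private final int filesize;

        TypedTorperf(String source, int filesize) {
            this.source = source;
            this.filesize = filesize;
        }

        String getSource() { return source; }
        int getFilesize() { return filesize; }
    }

    /** Stand-in for a generic record: fields are looked up by name at run time. */
    static final class GenericRecordLike {
        private final Map<String, Class<?>> schema;
        private final Map<String, Object> values = new LinkedHashMap<>();

        GenericRecordLike(Map<String, Class<?>> schema) { this.schema = schema; }

        void put(String field, Object value) {
            Class<?> expected = schema.get(field);
            if (expected == null) {
                throw new IllegalArgumentException("no such field: " + field);
            }
            if (!expected.isInstance(value)) {
                throw new IllegalArgumentException(
                        field + " expects " + expected.getSimpleName());
            }
            values.put(field, value);
        }

        Object get(String field) { return values.get(field); }
    }

    /** A made-up two-field schema, used only for this illustration. */
    static Map<String, Class<?>> torperfSchema() {
        Map<String, Class<?>> schema = new LinkedHashMap<>();
        schema.put("source", String.class);
        schema.put("filesize", Integer.class);
        return schema;
    }

    /** Returns true if the generic record rejects a wrongly typed value at run time. */
    static boolean rejectsWrongType() {
        GenericRecordLike record = new GenericRecordLike(torperfSchema());
        try {
            record.put("filesize", "51200"); // a String where an Integer is expected
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        // new TypedTorperf("example", "51200") would not even compile.
        TypedTorperf typed = new TypedTorperf("example", 51200);
        System.out.println(typed.getSource() + " " + typed.getFilesize());
        System.out.println(rejectsWrongType());
    }
}
```

The generic variant needs less generated code and less switching logic, which is what makes plan B attractive, at the price of moving type errors from compile time to run time.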

I do feel the pressure to get this ready within the next week because I’d really like to start working on Hadoop aggregations before the dev meeting. On the other hand, I hate not doing things right, so fingers crossed that Stack Overflow will come to the rescue. If someone happens to know Tom White from the Hadoop project...


[0] https://github.com/tomlurge/mteam/
[1] https://github.com/tomlurge/mteam/blob/master/build/convert2json.jar
[2] https://github.com/tomlurge/converTor
[3] https://github.com/tomlurge/converTor/blob/master/src/converTor/Convert.java
[4] https://github.com/tomlurge/converTor/tree/master/schema
[5] https://github.com/tomlurge/converTor/blob/master/docs/avro.md
[6] https://stackoverflow.com/questions/35315992/parquet-mr-avroparquetwriter-how-to-convert-data-to-parquet-with-specific-map

< he not busy being born is busy dying >
