[metrics-team] analytics project status report

Karsten Loesing karsten at torproject.org
Mon Feb 15 19:37:15 UTC 2016


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Thomas,

thanks for the update!

I took a quick look at your Convert.java and figured it's probably
easier to solve your Java issues interactively.  How's tomorrow
(Tuesday) for a quick, possibly Etherpad-assisted IRC chat?  Want to
ping me when you're around?

All the best,
Karsten


On 11/02/16 13:16, tl wrote:
> Hi,
> 
> Karsten asked me last week to give an update on the analytics
> project and here it is.
> 
> To recap: The plan, in short, is to convert all the data that
> CollecTor collects into serializations (one or more) that are
> easily usable in state-of-the-art and broadly accessible data
> analytics software like Tableau, MongoDB and the whole Hadoop
> ecosystem, especially Drill and Spark. You might not know these
> names, but don’t worry: they work with SQL and MapReduce and have
> some other tricks up their sleeves that you can employ when you’re
> ready. They are easily deployable on a laptop and there’s lots of
> tutorials. A second step would be having some of these tools and
> the data readily set up on a server run by Tor and having standard
> use cases pre-aggregated. The serialization formats would be JSON
> (for the more desktop-ish tools like MongoDB) and Parquet (more
> performant, Hadoop-oriented). So, converting the data to JSON and
> Parquet is the first and most tedious step. Setting up the
> software is quite trivial. Developing useful aggregations and
> analytics strategies will be the real work.
> 
> Achievements so far: The current status is that I finished a
> converter tool to convert CollecTor data to JSON a few weeks ago.
> The conversion has some small bugs, as I found out in the meantime,
> but it’s nonetheless useful and usable. That tool is written in
> Java, but can be operated from the command line, no programming
> required. It’s in the 'mteam' repository on GitHub [0], with a
> compiled standalone version ready to download and use [1]. In a second step
> I want to upgrade this tool to convert CollecTor data not only to
> JSON but also to Avro and Parquet. These formats are much more
> performant and space efficient than JSON and much better suited to
> seriously work with the data. For this development I cloned 'mteam'
> into a new repo, 'converTor' [2], because I wanted to be able to
> break things again without losing the working JSON converter, and
> because I figured that such a cool tool would need a cool name too
> ;-) I read what I could find, sorted the options and came up with a
> plan that involved writing data schemata in a special Avro schema
> language, autogenerating Java classes from these schemata and then
> developing a statically type-checked conversion tool in Java. The
> schemata are finished (that’s when I found those bugs in the first
> version of the converter), the classes autogenerated and a first
> descriptor converter is written. But now I’m stuck in figuring out
> how to write the converted serialization to disk - which is
> different for all 3 formats and not exactly well documented. The
> problem here is that to develop with Avro and Parquet you’d better
> be a somewhat seasoned Java developer. Documentation is sparse,
> options are complex, good examples on the internets are hard to
> find (and then often take another route than I have chosen), the
> source code itself is challenging. The mere fact that I have to
> dig into the sources is … a challenge for me.
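> For readers who haven't seen Avro IDL: a minimal, purely illustrative
> sketch of what a descriptor record in an *.avdl file looks like. The
> field names here are hypothetical; the real schemata live in the repo's
> schema directory [4].

```avdl
// Hypothetical sketch of a Torperf descriptor schema in Avro IDL.
// The actual schemata are in https://github.com/tomlurge/converTor/tree/master/schema
@namespace("converTor")
protocol TorperfProtocol {
  record Torperf {
    string source;                        // measuring host (hypothetical field)
    long filesize;                        // requested file size in bytes (hypothetical field)
    union { null, double } start = null;  // optional values become nullable unions
  }
}
```

> The avro-tools jar compiles such IDL files into *.avsc JSON schemata and
> from there into the autogenerated Java classes mentioned above.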
> 
> The status: You can see the state of things in the repo. There
> is one central file, Convert.java [3], that contains the backbone
> of the logic. The other interesting parts are the schemata [4]
> (especially the *.avdl versions), but you don’t need to
> understand those to follow along. The helper classes autogenerated
> from the schemata sit in appropriately named directories beside
> Convert.java. Some background and links to introductions about Avro
> can be found in avro.md [5]. Convert.java begins with some
> configuration, mainly the setting up of the 9 descriptor types and
> the many attributes I need them to have to successfully guide them
> through the Avro|Parquet|JSON switch. Then the main method starts
> with a lot of configuration again, this time defining and honoring
> the commandline parameters (a few…) and is followed by the
> descriptor reading logic which hasn’t much changed since Karsten’s
> very first version of this script. Then, after main, follows the
> "COMMONS" section, with "file writing machinery" being the first
> subsection, and here is where the trouble begins. And ends, so far,
> since the prototype converter for Torperf (in different versions
> again, as Avro can really be overwhelming in its flexibility) is
> fairly trivial and should work okay, while the other, more complex
> descriptor types will pose some challenges to a Java rookie like
> me, but I’ll tackle those only after the writer troubles are
> resolved.
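> For context, a minimal sketch of what that file writing machinery
> amounts to, one writer per target format. This assumes Avro's
> DataFileWriter/JsonEncoder and parquet-avro 1.8's builder API (the old
> AvroParquetWriter constructors are deprecated); the file names, the
> "schema" and the "record" are placeholders, not the actual Convert.java
> code.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriterSketch {

  /** Writes one converted descriptor to all three target formats. */
  static void writeAll(Schema schema, GenericRecord record) throws IOException {
    // 1. Avro container file: a DatumWriter wrapped in a DataFileWriter.
    try (DataFileWriter<GenericRecord> avro =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      avro.create(schema, new File("torperf.avro"));
      avro.append(record);
    }

    // 2. Parquet: parquet-avro's builder derives the Parquet schema
    //    from the Avro schema.
    try (ParquetWriter<GenericRecord> parquet =
        AvroParquetWriter.<GenericRecord>builder(new Path("torperf.parquet"))
            .withSchema(schema)
            .build()) {
      parquet.write(record);
    }

    // 3. JSON: Avro's JsonEncoder streams records as JSON text.
    try (OutputStream out = new FileOutputStream("torperf.json")) {
      JsonEncoder json = EncoderFactory.get().jsonEncoder(schema, out);
      new GenericDatumWriter<GenericRecord>(schema).write(record, json);
      json.flush();
    }
  }
}
```

> The asymmetry between the three APIs (create/append vs. builder/write
> vs. encoder/flush) is exactly the kind of thing that makes one writer
> abstraction for all formats awkward.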
> 
> The plan (A and B): I won’t go into detail about what the trouble
> with those writers is as this mail has gotten much too long already
> (although I’m of course happy to elaborate if someone is
> interested!). There is no Parquet user mailing list so I posted a
> question on Stack Overflow [6] yesterday but haven’t gotten any
> response so far (maybe it’s too long too). I’ll wait another day
> (and in the meantime wise up on Java Generics and stuff) but then
> switch to plan B which is to take a less performant route through
> this jungle, without autogenerated classes and without static type
> checking, with constructors marked as deprecated - and with at
> least some trustworthy looking examples on the internets and less
> complex switching and conversion logic. That’s the plan. Help and
> comments are of course very welcome.
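> To make plan B concrete, a rough sketch of that route, using Avro's
> GenericRecord driven by a schema parsed at runtime instead of
> autogenerated classes. The field names and the torperf.avsc file are
> hypothetical stand-ins, not the actual schemata from the repo.

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class GenericSketch {
  public static void main(String[] args) throws IOException {
    // Parse the schema at runtime instead of compiling it into classes.
    Schema schema = new Schema.Parser().parse(new File("torperf.avsc"));

    // Build a record dynamically; with no compile-time type checking,
    // a typo in a field name only surfaces at runtime.
    GenericRecord rec = new GenericData.Record(schema);
    rec.put("source", "torperf");
    rec.put("filesize", 51200L);

    try (DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("torperf.avro"));
      writer.append(rec);
    }
  }
}
```

> The same GenericRecord also feeds AvroParquetWriter, which is why this
> route needs only one conversion path for all formats, at the cost of
> static type safety.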
> 
> I do feel the pressure to get this ready within the next week
> because I’d really like to start working on Hadoop aggregations
> before the dev meeting. But on the other hand I hate to not do
> things right, so fingers crossed that Stackoverflow will come to
> the rescue. If someone happens to know Tom White from the Hadoop
> project...
> 
> Cheers, Thomas
> 
> 
> 
> 
> [0] https://github.com/tomlurge/mteam/
> [1] https://github.com/tomlurge/mteam/blob/master/build/convert2json.jar
> [2] https://github.com/tomlurge/converTor
> [3] https://github.com/tomlurge/converTor/blob/master/src/converTor/Convert.java
> [4] https://github.com/tomlurge/converTor/tree/master/schema
> [5] https://github.com/tomlurge/converTor/blob/master/docs/avro.md
> [6] https://stackoverflow.com/questions/35315992/parquet-mr-avroparquetwriter-how-to-convert-data-to-parquet-with-specific-map
> 
> 
> < he not busy being born is busy dying >
> 
> 
> 
> 
> 
> _______________________________________________ metrics-team
> mailing list metrics-team at lists.torproject.org 
> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team
> 

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJWwijrAAoJEC3ESO/4X7XBpQ0H/AxCBAOqZu/aMVMkYZoJTWEb
T/WE5S3Bu0ATclIDE2W6+Nu0ZvgdMf+u4jMJbPU69TK2Vd/DmOEY24OsnlFblDPo
AFWhJhV7rE9NiJPT7gxDyHorb/wQCZouA1qP8QG7vskfMHSy4ZIAjp+Kw56kgW4s
aXlIcocHmJrMX0Jax8Oll2tBRhN4XrwL9mmd+78s99zHRoHa4rMzitqQye1n9pLw
MxUKNfl4uPCuoJl8nsoEh0mjqhg8robcVFtO617K80HPJW/gvgxMGQojoYb2sB5X
yoBNitMNiibKpbvbr963Ybl9C+9WNhvHrm36Y2Xb9OplzFK2DSwZYXDlo3aodfY=
=qmIu
-----END PGP SIGNATURE-----
