[metrics-team] How to handle double entries in ConverTor

tl tl at rat.io
Thu Jun 16 21:42:18 UTC 2016


Hi,

During the metrics team chat this afternoon we briefly discussed how ConverTor would or should handle double entries of descriptors. I thought about it a little more and think the following aspects are important:

* ConverTor only converts CollecTor tarballs into other formats: JSON, Parquet, and Avro. It changes the contents of these archives as little as possible. If they contain double entries, the converted archives will contain them too.

* Therefore, the CollecTor tarballs had better be correct ;-)

* Doing otherwise wouldn’t be easy: ConverTor would have to keep all descriptors in memory before it could write them out in one flush, or it would have to know how to modify existing JSON/Parquet/Avro files, and either way it would have to keep an index of the entries (the first sketch after this list shows what even a minimal version of such an index looks like). This feels a bit like writing a database. I quite possibly don’t fully understand what I’m talking about here, but I still highly doubt that this would be the right way to go.

* Analytics will work on the converted JSON/Parquet/Avro files first. Importing the descriptors from JSON/Parquet/Avro into a database is something that can be done to facilitate certain kinds of analytics, but it’s not absolutely necessary. A lot of aggregation and other tasks can be accomplished from the converted files alone, and as far as I know this is how it’s usually done: load data from files into memory, compute new data from there, and write it to files again (the second sketch after this list shows the general idea).

* When importing descriptors into a database (HBase is planned) it should be easy to detect double entries and handle them appropriately (the last sketch after this list shows one way to do that).

* It might also be possible to detect double entries during aggregation from the archive files.

* I wonder, though, what the practical relevance of this question is. I wouldn’t expect there to be a lot of double entries in CollecTor archives. Am I wrong?

* So far I haven’t spent much time thinking about how a system would have to be built that is updated with new descriptors on an hourly or daily basis. For a start I would just regenerate monthly tarballs and then re-convert them, since that doesn’t take very long. Of course a database should be beneficial here, but I’ll have to play with this whole machinery a little more before I can say something that I’m confident about.
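
To make the "index of the entries" point above a bit more concrete, here is a rough Java sketch of the minimal thing ConverTor would need: a set of descriptor digests that is consulted before each descriptor is written out. The class and method names are made up for illustration, this is not actual ConverTor code.

    import java.util.HashSet;
    import java.util.Set;

    /* Minimal dedup index: remembers every descriptor digest it has seen.
     * Illustration only, not actual ConverTor code. */
    public class DedupIndex {

      private final Set<String> seenDigests = new HashSet<>();

      /* Returns true only the first time a digest shows up, so a caller
       * could skip writing the duplicate to the JSON/Parquet/Avro output. */
      public boolean firstOccurrence(String descriptorDigest) {
        return seenDigests.add(descriptorDigest);
      }
    }

Even this minimal version has to keep one entry per descriptor in memory for the whole conversion run, which is exactly the kind of thing I’d rather avoid in ConverTor itself.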
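
And here is the file-based aggregation idea in sketch form: stream the converted descriptors from a JSON file, compute something from them in memory, and write the result out. The file name, the one-descriptor-per-line layout and the "flags" field are assumptions for the sake of the example, not the actual ConverTor output schema; the parsing uses Jackson.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    /* Sketch of aggregation straight from a converted JSON file:
     * count how many entries carry the Fast flag. File name, layout
     * and field name are assumptions, not the real ConverTor schema. */
    public class AggregationSketch {

      public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        Path input = Paths.get("statuses-2016-05.json");  // placeholder name
        long fastRelays;
        try (Stream<String> lines = Files.lines(input)) {
          fastRelays = lines
              .map(line -> {
                try {
                  return mapper.readTree(line);
                } catch (IOException e) {
                  throw new UncheckedIOException(e);
                }
              })
              .filter(node -> node.path("flags").toString().contains("Fast"))
              .count();
        }
        System.out.println("Relays with Fast flag: " + fastRelays);
      }
    }

The point is just that no database is involved: files in, a number (or new files) out.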
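
Finally, the HBase side of it: if the row key is derived from the descriptor digest, importing the same descriptor twice simply overwrites the same row instead of creating a second entry, so double entries disappear on import. Table name, column family and qualifier ("descriptors", "d", "raw") are placeholders here; nothing about the schema is decided yet.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    /* Sketch of an idempotent HBase import: the descriptor digest is the
     * row key, so a duplicate import overwrites rather than duplicates.
     * Table/column names are placeholders, not a decided schema. */
    public class HBaseImportSketch {

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("descriptors"))) {
          String digest = "0123456789abcdef...";  // digest of one descriptor
          String json = "{ ... }";                 // its converted JSON form
          Put put = new Put(Bytes.toBytes(digest));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("raw"), Bytes.toBytes(json));
          table.put(put);
        }
      }
    }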


Hope this helps to clear things up. Of course patches are always welcome :-) My focus, though, will be on getting the conversion bug-free (rather than adding new features) and on starting to do some meaningful aggregation and analytics.

Cheers
oma

