[metrics-team] How to handle double entries in ConverTor

Fri Jun 17 09:39:06 UTC 2016

Hi,

is there such a README for CollecTor data anywhere? I would just cite or refer to it.

Cheers,
Thomas

> On 17.06.2016, at 08:58, Karsten Loesing <karsten at torproject.org> wrote:
> 
> Signed PGP part
> Hi Thomas,
> 
> it sounds like a plausible design decision that your converter does
> not deduplicate entries.  However, the consequence of that cannot be
> that you require the input data to be free of duplicates.
> 
> The consequence is that any consumer of your converted data must take
> into account that there could be duplicates.  And that's a reasonable
> requirement.  Just state it in the README that there might be
> duplicate entries in the output, and you're done.  Though even if you
> didn't state that, consumers of your data shouldn't make such an
> assumption anyway.
> 
> Stated differently, "be conservative in what you do, be liberal in
> what you accept from others." (Robustness principle,
> https://en.wikipedia.org/wiki/Robustness_principle)
> 
> Note that I'm still planning to repackage those older tarballs to
> minimize confusions like yours when you found out that tarballs
> contain different files than you expected.  You shouldn't have to wait
> for that though.
> 
> All the best,
> Karsten
> 
> 
> On 16/06/16 23:42, tl wrote:
> > Hi,
> >
> > during the metrics team chat this afternoon we discussed briefly
> > how ConverTor would or should handle double entries of descriptors.
> > I thought about it a little more and think the following aspects
> > are important:
> >
> > * ConverTor only converts CollecTor tarballs into other formats:
> > JSON, Parquet and Avro. It changes the contents of these archives
> > as little as possible. If they contain double entries, the
> > converted archives will do so too.
> >
> > * Therefor the CollecTor tarballs better be correct ;-)
> >
> > * Doing otherwise wouldn’t be easy: ConverTor would have to keep
> > all descriptors in memory before he can write them out in one
> > flush. Or ConverTor would have to know how to change
> > JSON/Parquet/Avro files. And it would have to keep an index of the
> > entries. This feels a bit like writing a database. I quite possibly
> > don’t fully understand what I’m talking about here but still I
> > highly doubt that this would be the right way to go.
> >
> > * Analytics will work on the converted JSON/Parquet/Avro files
> > first. Importing the descriptors from JSON/Parquet/Avro into a
> > database is something that can be done to facilitate certain kinds
> > of analytics but it’s not absolutely necessary. A lot of
> > aggregation and other tasks can be accomplished from the converted
> > files alone and as far as I know this is how it’s usually done:
> > load data from files into memory and compute new data from there
> > (and write it to files again).
> >
> > * When importing descriptors in a database (HBase is planned) it
> > should be easy to detect double entries and handle them
> > appropriatly.
> >
> > * During aggregation from archive files it might be possible as
> > well to detect double entries.
> >
> > * I wonder though what the practical relevance of this question is.
> > It wouldn’t expect there to be a lot of double entries in CollecTor
> > archives. Am I wrong?
> >
> > * So far I haven’t spend much time on thinking about how a system
> > must be build that is updated with new descriptors in an hourly or
> > daily fashion. For a start I would just regenerate monthly tarballs
> > and then re-convert them since that doesn’t take very long. Of
> > course a database should be beneficial here but I’ll have to play
> > with this whole machinery a little more before I can say something
> > that I’m confident about.
> >
> >
> > Hope this helps to clear things up. Of course patches are always
> > welcome :-) My focus though will be on getting the conversion bug
> > free (but not adding new features) and starting to do some
> > meaningful aggregation and analytics.
> >
> > Cheers oma _______________________________________________
> > metrics-team mailing list metrics-team at lists.torproject.org
> > https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team
> >
> 
> _______________________________________________
> metrics-team mailing list
> metrics-team at lists.torproject.org
> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team

< Der Siegeszug der Populisten - http://www.stern.de/6880250.html >
< Diskurs und Wutbürger - http://www.faz.net/aktuell/politik/inland/politik-braucht-eine-sprache-der-maessigung-14281846.html

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 842 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <http://lists.torproject.org/pipermail/metrics-team/attachments/20160617/8469898b/attachment.sig>