[metrics-team] duplicates in collector tarballs?

tl tl at rat.io
Tue Jun 14 09:33:32 UTC 2016


> On 14.06.2016, at 11:27, tl <tl at rat.io> wrote:
> 
>> 
>> On 14.06.2016, at 10:05, Karsten Loesing <karsten at torproject.org> wrote:
>> 
>> Hi Thomas,
>> 
>> can you give one or more examples?
> 
> Unfortunately I didn’t keep a note of them. When I couldn’t convert all descriptors of one type in one run (because I ran into memory limits), I converted them year by year. In maybe 20% of these cases I got results like this:
> 
> -rwxrwxrwx 1 t t    1978191 Jun 14 03:23 RelayVote_2015-12.parquet.snappy
> -rwxrwxrwx 1 t t 2473316989 Jun 14 03:23 RelayVote_2016-01.parquet.snappy
> -rwxrwxrwx 1 t t 2384211448 Jun 14 03:23 RelayVote_2016-02.parquet.snappy
> -rwxrwxrwx 1 t t 2265386311 Jun 14 03:23 RelayVote_2016-03.parquet.snappy
> -rwxrwxrwx 1 t t 2339112076 Jun 14 03:23 RelayVote_2016-04.parquet.snappy
> -rwxrwxrwx 1 t t 2062026086 Jun 14 03:23 RelayVote_2016-05.parquet.snappy
> 
> although I had only converted tarballs from 2016 (note the small 2015-12 file).
> 
> 
> I had similar issues when I converted tarballs from another year, but I don’t remember for sure which type and which year. I think (!) it was relays for 2012-08 and 2012-09, so it’s not only an issue with year ends.
> 
> It seems like my JSON converter handles this issue differently than my Parquet converter. The JSON converter didn’t run into memory issues and seems to be happy to append to data already written to disk. The Parquet converter, on the other hand, often (but not always :-/) keeps everything in memory and only in the very last step writes everything to disk in one flush. Then sometimes the results for one or two months remain completely empty. My current guess is that in those cases there was an overlap of descriptors in tarballs from different months and the converter couldn’t decide which one to write out.
> 
> The two months mentioned above were such a case, and when I then converted them separately I also got results for the months 2012-07 and 2012-10. But again: I’m sure about neither the year nor the type of descriptor. I would have to rerun the conversions and search for them. Should I?

Ha, found them in the bash history: relays 2007-08 and 2007-09 (so 2007, not 2012 as I guessed above).
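
In case it helps to answer the original question (are the stray descriptors duplicates or not?), here is a minimal Python sketch of how two monthly tarballs could be compared. It is only an illustration, not what my converter does: it assumes that every regular file in a tarball holds one descriptor and that duplicates are byte-identical, and the function names digest_members and shared_members are made up for this example.

```python
import hashlib
import tarfile
from collections import defaultdict


def digest_members(tarball_path):
    """Map the SHA-256 digest of each regular file's content to its member names."""
    digests = defaultdict(list)
    # "r:*" lets tarfile detect the compression (gz, bz2, xz) on its own.
    with tarfile.open(tarball_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other non-file entries
            data = tar.extractfile(member).read()
            digests[hashlib.sha256(data).hexdigest()].append(member.name)
    return digests


def shared_members(tarball_a, tarball_b):
    """Return sorted (name_in_a, name_in_b) pairs whose contents are byte-identical."""
    a = digest_members(tarball_a)
    b = digest_members(tarball_b)
    return sorted(
        (name_a, name_b)
        for digest in a.keys() & b.keys()
        for name_a in a[digest]
        for name_b in b[digest]
    )
```

Calling shared_members() with the paths of the 2007-08 and 2007-09 relay tarballs (whatever they are called locally) would list every file that exists in both. An empty list would suggest that the stray descriptors are not duplicates, at least under the assumptions above.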

c’t


> Ciao
> Thomas
> 
> 
>> All the best,
>> Karsten
>> 
>> 
>> On 13/06/16 22:19, tl wrote:
>>> Hi,
>>> 
>>> when testing some descriptor converter I stumbled across the fact
>>> that descriptor tarballs for a given month sometimes contain a few
>>> descriptors from the month before or after. That introduces a
>>> problem that I might be able to overcome by poking at the code, but
>>> before I try that I’d like to know:
>>> 
>>> - if a descriptor tarball for, say, 2012-10 also contains descriptors
>>>   from 2012-09, does that mean that the 2012-09 descriptors contained
>>>   in the 2012-10 tarball are not contained in the 2012-09 tarball? Or
>>>   are they duplicates?
>>> - and if they are not duplicates: would it be hard to repackage the
>>>   tarballs? Tedious for sure, but hard? Or not good for other
>>>   reasons?
>>> 
>>> Cheers Thomas
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________ metrics-team
>>> mailing list metrics-team at lists.torproject.org
>>> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team
>>> 
>> 
> 
> 
> 
> 
> 
> 
> < Der Siegeszug der Populisten - http://www.stern.de/6880250.html >
> 






