[metrics-team] duplicates in collector tarballs?

Karsten Loesing karsten at torproject.org
Wed Jun 15 13:07:47 UTC 2016


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Thomas,

my first guess is that you're looking at a different timestamp than
CollecTor for deciding which tarball a descriptor belongs in.
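
To illustrate what I mean, here is a minimal sketch (not CollecTor's
actual code) of how the choice of timestamp line changes the month a
document is sorted into. The helper name month_of and the two-line
sample are made up for illustration; "published" and "valid-after" are
the keyword lines as they appear in vote documents.

```python
from datetime import datetime

# Hypothetical helper: return the "YYYY-MM" bucket of a descriptor,
# based on the first line that starts with the given keyword.
def month_of(descriptor_text, keyword):
    for line in descriptor_text.splitlines():
        if line.startswith(keyword + " "):
            stamp = line[len(keyword) + 1:].strip()
            parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            return parsed.strftime("%Y-%m")
    return None  # keyword line not present

# Invented sample: a document written shortly before a month boundary.
sample = "\n".join([
    "published 2015-12-31 23:50:00",
    "valid-after 2016-01-01 00:00:00",
])

print(month_of(sample, "published"))    # 2015-12
print(month_of(sample, "valid-after"))  # 2016-01
```

If a converter buckets by one of these lines and the tarball was
assembled by the other, a handful of documents around a month boundary
would show up in the "wrong" month without being duplicates.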

Unfortunately, "relays 2007-08 and 2007-09" is rather vague, because
relays published all kinds of descriptors in those two months, and I
can't really look at all those tarballs right now.

Can you list a tarball and a file contained in that tarball which you
think doesn't belong there?
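
Something along these lines might help to produce such a list. It is
only a sketch under assumptions: the function name misplaced_members is
made up, and it assumes each file in the tarball holds a descriptor
with a "published YYYY-MM-DD HH:MM:SS" line, which may not be the
timestamp that matters for every descriptor type.

```python
import re
import tarfile

# Matches the month of the first "published" line in a descriptor file.
PUBLISHED = re.compile(
    rb"^published (\d{4}-\d{2})-\d{2} \d{2}:\d{2}:\d{2}", re.MULTILINE)

def misplaced_members(tarball_path, expected_month):
    """Yield (member name, month) for files whose first 'published'
    line falls in a month other than expected_month ("YYYY-MM")."""
    # The default mode "r" lets tarfile detect the compression itself.
    with tarfile.open(tarball_path) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            data = tar.extractfile(member).read()
            match = PUBLISHED.search(data)
            if match is None:
                continue  # no "published" line found in this file
            month = match.group(1).decode("ascii")
            if month != expected_month:
                yield member.name, month
```

Calling it with the path of a downloaded tarball and the month from its
file name, for example misplaced_members(path, "2007-08"), and printing
each result would give tarball/file pairs to look at.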

All the best,
Karsten


On 14/06/16 11:33, tl wrote:
> 
>> On 14.06.2016, at 11:27, tl <tl at rat.io> wrote:
>> 
>>> 
>>> On 14.06.2016, at 10:05, Karsten Loesing
>>> <karsten at torproject.org> wrote:
>>> 
>>> Hi Thomas,
>>> 
>>> can you give one or more examples?
>> 
>> Unfortunately I didn’t keep note of them. When I couldn’t convert
>> all descriptors of one type in one run (because I ran into memory
>> limits) I converted descriptors per year. Maybe in 20% of these
>> cases I got results like this:
>> 
>> -rwxrwxrwx 1 t t    1978191 Jun 14 03:23 RelayVote_2015-12.parquet.snappy
>> -rwxrwxrwx 1 t t 2473316989 Jun 14 03:23 RelayVote_2016-01.parquet.snappy
>> -rwxrwxrwx 1 t t 2384211448 Jun 14 03:23 RelayVote_2016-02.parquet.snappy
>> -rwxrwxrwx 1 t t 2265386311 Jun 14 03:23 RelayVote_2016-03.parquet.snappy
>> -rwxrwxrwx 1 t t 2339112076 Jun 14 03:23 RelayVote_2016-04.parquet.snappy
>> -rwxrwxrwx 1 t t 2062026086 Jun 14 03:23 RelayVote_2016-05.parquet.snappy
>> 
>> where I had only converted tarballs of 2016.
>> 
>> 
>> I had similar issues when I converted tarballs from another year
>> but I don’t remember for sure which type and which year. I think
>> (!) it was relays for 2012-08 and 2012-09, so it’s not only an
>> issue at year ends. It seems like my JSON converter handles this
>> issue differently than my Parquet converter. The JSON converter
>> didn’t run into memory issues and seems to be happy to append to
>> data already written to disk. The Parquet converter otoh often
>> (but not always :-/) keeps everything in memory and only in the
>> very last step writes everything to disk in one flush. Then
>> sometimes the results for one or two months remain completely
>> empty, and my current guess would be that in those cases there was
>> an overlap of descriptors in tarballs from different months and
>> the converter couldn’t decide which one to write out. The two
>> months mentioned above were such a case, and when I then
>> converted them separately I also got results for the months
>> 2012-07 and 2012-10. But again: I’m neither sure about the year
>> nor the type of descriptor. I would have to rerun conversions and
>> search for them. Should I?
> 
> Ha, found them in the bash-history: relays 2007-08 and 2007-09
> 
> c’t
> 
> 
>> Ciao Thomas
>> 
>> 
>>> All the best, Karsten
>>> 
>>> 
>>> On 13/06/16 22:19, tl wrote:
>>>> Hi,
>>>> 
>>>> when testing some descriptor converter I stumbled across the
>>>> fact that descriptor tarballs for a given month sometimes
>>>> contain a few descriptors from the month before or after.
>>>> That introduces a problem that I might be able to overcome by
>>>> poking at the code, but before I try that I’d like to know:
>>>>
>>>> - if a descriptor tarball for, say, 2012-10 also contains
>>>>   descriptors from 2012-09, does that mean that the 2012-09
>>>>   descriptors contained in the 2012-10 tarball are not
>>>>   contained in the 2012-09 tarball? Or are they duplicates?
>>>> - and if they are not duplicates: would it be hard to repackage
>>>>   the tarballs? Tedious for sure, but hard? Or not good for
>>>>   other reasons?
>>>> 
>>>> Cheers Thomas
>>>> 
>>>> _______________________________________________ metrics-team 
>>>> mailing list metrics-team at lists.torproject.org 
>>>> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJXYVMjAAoJEC3ESO/4X7XBDhQH/iAXIuhf8ghdcvLP8Uk/Nuv/
Xt53IRqvI+wV9zqonmPReXGDmOKDZA3v0n0+1d58+XXakUU+WFZq2yG3x0BebzVx
WcdpySxoT7jepKxz/Q0eo7nNyNnlrQv80lr2mh7URmkq83CZdlW+4/ZbXx5A6DBY
cLojNTUs30LHWDpv3+nk1qyT6DISStNw8bwK/FP/fDFiTmQDMqo+8wTEZF4a7k8v
1K10yOa9O0DMbYdI0Czb6DBiI3MqfCZP/6oPyi3gJR6IiDCWPijb5TDjuLTH9/5K
+BIblV7Ayx1bSsuCrpryJ/vt+pmIbWf4IDhQ+Z9m7JiUVMV59JwjxKrRvAcDiOc=
=DIRN
-----END PGP SIGNATURE-----

