[metrics-team] duplicates in collector tarballs?

tl tl at rat.io
Wed Jun 15 21:43:13 UTC 2016


> 
> On 15.06.2016, at 15:07, Karsten Loesing <karsten at torproject.org> wrote:
> 
> Hi Thomas,
> 
> my first guess is that you're looking at a different timestamp than
> CollecTor for deciding which tarball a descriptor belongs in.

I’m using getPublishedMillis() in most cases, except
	Consensus - getValidAfterMillis()
	Torperf - getStartMillis()
	Tordnsel - getDownloadedMillis()
What is CollecTor using?
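
For illustration, turning one of these millisecond timestamps into a month bucket could look like the sketch below. It is not the converter's actual code: MonthKey and fromMillis are hypothetical names, and reading the timestamp in UTC is an assumption. It only uses java.time, so it runs without metrics-lib.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class MonthKey {
    // Assumption: month boundaries are evaluated in UTC.
    private static final DateTimeFormatter YEAR_MONTH =
        DateTimeFormatter.ofPattern("yyyy-MM").withZone(ZoneOffset.UTC);

    // Maps a timestamp in milliseconds since the epoch (for example the
    // value returned by getPublishedMillis()) to a "yyyy-MM" bucket.
    static String fromMillis(long millis) {
        return YEAR_MONTH.format(Instant.ofEpochMilli(millis));
    }

    public static void main(String[] args) {
        long lateSeptember = Instant.parse("2007-09-30T23:59:59Z").toEpochMilli();
        long earlyOctober = Instant.parse("2007-10-01T00:30:00Z").toEpochMilli();
        System.out.println(fromMillis(lateSeptember)); // prints 2007-09
        System.out.println(fromMillis(earlyOctober));  // prints 2007-10
    }
}
```

With a rule like this, a descriptor stamped shortly after midnight on the first of a month lands in the next month's bucket, whichever tarball it was read from.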


> Unfortunately, "relays 2007-08 and 2007-09" is rather vague, because
> relays published all kinds of descriptors in those two months, and I
> can't really look at all those tarballs right now.

Sorry, my bad. That’s the short name I used internally for
	server-descriptors-2007-09.tar.xz
	server-descriptors-2007-08.tar.xz


> Can you list a tarball and a file contained in that tarball which you
> think doesn't belong there?

Converting server-descriptors-2007-09.tar.xz I get 3 results: Relay_2007-08.json, Relay_2007-09.json and Relay_2007-10.json. I’m attaching the latter:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Relay_2007-10.json
Type: application/json
Size: 9178 bytes
Desc: not available
URL: <http://lists.torproject.org/pipermail/metrics-team/attachments/20160615/73a0084f/attachment-0001.json>
-------------- next part --------------


Both descriptors are from early in the morning of October 1.


I’m also wondering whether I should just use the date of the tarball that contains the descriptors. I hadn’t expected any problems here, so I went for the (easily reachable) dates in the descriptors, but it seems safest to reproduce the CollecTor tarballs as faithfully as possible, no matter how the descriptors were allocated, especially since the situation gets even more complex with Consensus, Torperf and Tordnsel.
I just don’t know how I could get hold of the name of the tarball that a descriptor was extracted from. It seems that metrics-lib’s DescriptorReader doesn’t provide the name of the tarball it’s reading. Can you do something about that?
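
If the tarball name were available, taking the month from it would be simple. The sketch below assumes the converter knows the file name on its own (for example because it opens the tarball itself), which is exactly the part that is missing today. TarballName and month are hypothetical names, and the pattern is inferred only from the two file names mentioned above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TarballName {
    // Assumed naming scheme: <type>-<yyyy>-<MM>.tar or <type>-<yyyy>-<MM>.tar.xz,
    // inferred from names like server-descriptors-2007-09.tar.xz.
    private static final Pattern NAME =
        Pattern.compile("^(.+)-(\\d{4}-\\d{2})\\.tar(?:\\.xz)?$");

    // Returns the "yyyy-MM" part of the file name, or empty if the
    // name does not follow the assumed scheme.
    static Optional<String> month(String fileName) {
        Matcher m = NAME.matcher(fileName);
        return m.matches() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        // prints Optional[2007-09]
        System.out.println(month("server-descriptors-2007-09.tar.xz"));
        // prints Optional.empty
        System.out.println(month("notes.txt"));
    }
}
```

Every descriptor read from a given tarball would then get that tarball's month, regardless of the timestamps inside the descriptor.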


Ciao
Thomas









> All the best,
> Karsten
> 
> 
> On 14/06/16 11:33, tl wrote:
>> 
>>> On 14.06.2016, at 11:27, tl <tl at rat.io> wrote:
>>> 
>>>> 
>>>> On 14.06.2016, at 10:05, Karsten Loesing
>>>> <karsten at torproject.org> wrote:
>>>> 
>>>> Hi Thomas,
>>>> 
>>>> can you give one or more examples?
>>> 
>>> Unfortunately I didn’t keep note of them. When I couldn’t convert
>>> all descriptors of one type in one run (because I ran into memory
>>> limits) I converted descriptors per year. Maybe in 20% of these
>>> cases I got results like this:
>>> 
>>> -rwxrwxrwx 1 t t    1978191 Jun 14 03:23 RelayVote_2015-12.parquet.snappy
>>> -rwxrwxrwx 1 t t 2473316989 Jun 14 03:23 RelayVote_2016-01.parquet.snappy
>>> -rwxrwxrwx 1 t t 2384211448 Jun 14 03:23 RelayVote_2016-02.parquet.snappy
>>> -rwxrwxrwx 1 t t 2265386311 Jun 14 03:23 RelayVote_2016-03.parquet.snappy
>>> -rwxrwxrwx 1 t t 2339112076 Jun 14 03:23 RelayVote_2016-04.parquet.snappy
>>> -rwxrwxrwx 1 t t 2062026086 Jun 14 03:23 RelayVote_2016-05.parquet.snappy
>>> 
>>> where I had only converted tarballs of 2016.
>>> 
>>> 
>>> I had similar issues when I converted tarballs from another year
>>> but I don’t remember for sure which type and which year. I think
>>> (!) it was relays for 2012-08 and 2012-09, so it’s not only an issue
>>> with year ends. It seems like my JSON converter handles this
>>> issue differently than my Parquet converter. The JSON converter
>>> didn’t run into memory issues and seems to be happy to append to
>>> data already written to disk. The Parquet converter otoh often
>>> (but not always :-/) keeps everything in memory and only in the
>>> very last step writes everything to disk in one flush. Then
>>> sometimes the results for one or two months remain completely
>>> empty and my current guess would be that in those cases there was
>>> an overlap of descriptors in tarballs from different months and
>>> the converter couldn’t decide which one to write out. The two
>>> months mentioned above were such a case, and when I then
>>> converted them separately I also got results for the months 2012-07
>>> and 2012-10. But again: I’m neither sure about the year nor the
>>> type of descriptor. I would have to rerun conversions and search
>>> for them. Should I?
>> 
>> Ha, found them in the bash-history: relays 2007-08 and 2007-09
>> 
>> c’t
>> 
>> 
>>> Ciao Thomas
>>> 
>>> 
>>>> All the best, Karsten
>>>> 
>>>> 
>>>> On 13/06/16 22:19, tl wrote:
>>>>> Hi,
>>>>> 
>>>>> when testing some descriptor converter I stumbled across the
>>>>> fact that descriptor tarballs for a given month sometimes
>>>>> contain a few descriptors from the month before or after.
>>>>> That introduces a problem that I might be able to overcome by
>>>>> poking at the code, but before I try that I’d like to know:
>>>>> - if a descriptor tarball for, say, 2012-10 also contains
>>>>>   descriptors from 2012-09, does that mean that the 2012-09
>>>>>   descriptors contained in the 2012-10 tarball are not
>>>>>   contained in the 2012-09 tarball? Or are they duplicates?
>>>>> - and if they are not duplicates: would it be hard to repackage
>>>>>   the tarballs? Tedious for sure, but hard? Or not good for
>>>>>   other reasons?
>>>>> 
>>>>> Cheers Thomas
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________ metrics-team
>>>>> mailing list metrics-team at lists.torproject.org
>>>>> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team






< Der Siegeszug der Populisten - http://www.stern.de/6880250.html >
< Diskurs und Wutbürger - http://www.faz.net/aktuell/politik/inland/politik-braucht-eine-sprache-der-maessigung-14281846.html >

