[tor-bugs] #2763 [Metrics]: Do we collect descriptors that don't get into the consensus?

Tor Bug Tracker & Wiki torproject-admin at torproject.org
Sun Apr 3 18:59:42 UTC 2011


#2763: Do we collect descriptors that don't get into the consensus?
---------------------+------------------------------------------------------
 Reporter:  arma     |          Owner:  karsten 
     Type:  task     |         Status:  assigned
 Priority:  normal   |      Milestone:          
Component:  Metrics  |        Version:          
 Keywords:           |         Parent:          
   Points:           |   Actualpoints:          
---------------------+------------------------------------------------------

Comment(by karsten):

 Replying to [comment:5 karsten]:
 > Replying to [comment:4 nickm]:
 > > I wonder if this approach might be insufficient for your
 > > requirements.  It will tell you about descriptors that the
 > > authorities have accepted and have decided to keep.  It ''won't''
 > > tell us about descriptors that the authorities immediately rejected,
 > > or ones that they decided (for whatever reason) to drop or replace.
 > >
 > > Do we care about those factors?
 > That's a fine question.  I can't say.  I guess Sebastian or arma have
 > an answer.  From a metrics POV, we're only interested in the
 > descriptors that are referenced from consensuses and maybe votes.  But
 > I understand the need to collect unreferenced descriptors for
 > debugging purposes.
 >
 > What reasons are there for an authority to reject or drop a descriptor?
 Two reasons come to mind: a) the descriptor cannot be parsed, and b) the
 changes are merely cosmetic.  I'm somewhat concerned about a) here.  If
 we want to include descriptors that the directory authorities cannot
 parse, I'll have to improve the metrics code for parsing descriptors.
 I'd prefer not to include descriptors from case a), though.  Descriptors
 from case b) should be fine to archive.  Are there other reasons for the
 authorities to drop or reject descriptors?

 Without more information about which descriptors people want to collect,
 I'll assume that whatever we learn by downloading /tor/server/all.z and
 /tor/extra/all once per day is sufficient.  Please let me know if it's
 not.
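
 For concreteness, the two daily fetches look roughly like this (an
 untested sketch; the authority address is made up, and the real code
 would loop over all authorities):

{{{
import java.io.InputStream;
import java.net.URL;
import java.util.zip.InflaterInputStream;

public class DailyFetch {
  public static void main(String[] args) throws Exception {
    /* Server descriptors, zlib-compressed by the authority ("all.z"). */
    InputStream serverDescs = new InflaterInputStream(new URL(
        "http://dirauth.example.net/tor/server/all.z").openStream());
    /* Extra-info descriptors, uncompressed (see the ".z" issue below). */
    InputStream extraInfos = new URL(
        "http://dirauth.example.net/tor/extra/all").openStream();
    /* ... parse and archive both streams ... */
    serverDescs.close();
    extraInfos.close();
  }
}
}}}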

 > > As for the information about download size: you can make it much
 > > smaller.  First, instead of downloading "all", download "all.z".
 > Right.  We should do that for all downloads, I guess.

 I added ".z" to all URLs except for extra-info descriptors.  It seems that
 directory authorities first compress extra-info descriptors and then
 concatenate the results.  I know that this is permitted in the
 specification.  Unfortunately, I cannot handle that easily in Java.  After
 spending two hours on this problem, I decided that developer time is more
 valuable than bandwidth and removed the ".z" for extra-info descriptors.
 Everything else works fine with ".z".  I'm happy to accept a patch if
 someone wants to look closer at the Java problem.
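
 For anyone who wants to pick this up: java.util.zip.Inflater stops at
 the end of the first zlib stream, so one would have to reset it and
 re-feed the leftover input, roughly like this (an untested sketch, not
 the actual metrics code):

{{{
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class ConcatenatedZlib {

  /* Inflate input that may consist of several concatenated zlib
   * streams.  Whenever the Inflater reports that one stream has
   * finished, reset it and continue with the unconsumed bytes. */
  public static byte[] inflateAll(byte[] compressed)
      throws DataFormatException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    byte[] buf = new byte[4096];
    while (true) {
      int n = inflater.inflate(buf);
      if (n > 0) {
        out.write(buf, 0, n);
      } else if (inflater.finished()) {
        int remaining = inflater.getRemaining();
        if (remaining == 0) {
          break;  /* all input consumed */
        }
        /* Another compressed stream follows. */
        inflater.reset();
        inflater.setInput(compressed,
            compressed.length - remaining, remaining);
      } else {
        break;  /* truncated or otherwise incomplete input */
      }
    }
    inflater.end();
    return out.toByteArray();
  }
}
}}}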

 > > Second, instead of downloading all extra-info descriptors, read
 > > through the descriptors in /tor/server/all.z to see which ones you
 > > are missing, and download only those.  I'd bet these approaches
 > > combined would save 60-80% of the expected download size.
 > Okay, that should work.  Is once per day enough?

 I tried downloading /tor/server/all.z and all the extra-info descriptors
 referenced from there, and then downloaded /tor/extra/all.  The latter
 gave me new descriptors that were not referenced from the server
 descriptors I had.  We're trying to collect all descriptors in the
 network, so I enabled downloading both /tor/server/all.z and
 /tor/extra/all once per day.
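
 For reference, finding the referenced extra-info descriptors comes down
 to collecting the extra-info-digest lines from the server descriptors;
 a minimal sketch (class and method names are made up):

{{{
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

public class ExtraInfoRefs {

  /* Collect the hex digests of extra-info descriptors referenced from
   * a blob of concatenated server descriptors. */
  public static Set<String> referencedDigests(String serverDescriptors)
      throws IOException {
    Set<String> digests = new HashSet<String>();
    BufferedReader br = new BufferedReader(
        new StringReader(serverDescriptors));
    String line;
    while ((line = br.readLine()) != null) {
      if (line.startsWith("opt ")) {
        /* Some descriptors prefix newer keywords with "opt". */
        line = line.substring(4);
      }
      if (line.startsWith("extra-info-digest ")) {
        digests.add(line.split(" ")[1].toLowerCase());
      }
    }
    return digests;
  }
}
}}}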

 As next steps, I'm going to check whether we still need to import
 gabelmoo's cached-* files, and how we can add a per-authority timeout so
 that a single extremely slow authority doesn't delay the whole run.
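
 The per-authority timeout could be as simple as the connect and read
 timeouts on HttpURLConnection; a sketch with made-up limits:

{{{
import java.net.HttpURLConnection;
import java.net.URL;

public class TimedRequest {

  /* Open a connection that gives up instead of hanging indefinitely
   * on a slow directory authority; the limits are placeholders. */
  public static HttpURLConnection open(String urlString)
      throws Exception {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(urlString).openConnection();
    conn.setConnectTimeout(5 * 1000);   /* five seconds to connect */
    conn.setReadTimeout(60 * 1000);     /* one minute per read */
    return conn;
  }
}
}}}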

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2763#comment:6>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

