[tor-bugs] #18910 [Metrics/CollecTor]: distributing descriptors across CollecTor instances

Tor Bug Tracker & Wiki blackhole at torproject.org
Sun Jun 26 20:35:57 UTC 2016


#18910: distributing descriptors across CollecTor instances
-------------------------------+-----------------------------------
 Reporter:  iwakeh             |          Owner:  iwakeh
     Type:  enhancement        |         Status:  needs_information
 Priority:  Medium             |      Milestone:
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:  ctip               |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+-----------------------------------

Comment (by karsten):

 Replying to [comment:7 iwakeh]:
 > Some thoughts:
 >
 > === The CollecTor side
 > Maybe CollecTor (or the Metrics Team) needs a data collection and
 > handling policy?
 > (Or, is there anything like that I didn't find yet other than the
 > license and of course the Tor-wide privacy goals?)

 There is no explicit policy like that, but it would be useful to document
 that in the medium term.

 I guess a CollecTor policy would make more sense than one that applies to
 all metrics-related products, because then we'd have to either enforce
 that policy for all metrics-related tools or manually confirm that a tool
 conforms to the policy.  Other tools could have their own policies.

 > In general, CollecTor shouldn't attempt to make received data better
 > than it is by dropping unwanted things.

 Agreed, and a nice way to phrase this. :)

 > At least not without some defined process.
 > And collected data should only be changed when there is a reason for
 > obfuscation or when it is enhanced (e.g. by adding the @source tag).

 Look, that's the beginning of a policy!  I like that.

 > === Handling of //unwanted// data
 > Incomplete unreferenced server descs could be stored differently:
 > * referenced server descs can be stored the way they are now and
 > * unreferenced ones can be kept, but stored separately.
 >
 > The sync process could first concentrate on the referenced descriptors.

 I'm not sold on this part with respect to the process.  I can see how we
 could switch from a model where we trust everyone (all relays and
 bridges, all directory authorities, all other CollecTor instances) to one
 where we trust just a small set of nodes (for example, the set of
 directory authorities listed in tor.git at a certain point in time).  But
 doing so is a major engineering effort, whereas continuing to trust
 everyone and risking getting spammed is easy.  Also, once we limit trust
 we can always go through the tarballs and rip out everything we shouldn't
 have accepted.  Hence, I'd say let's handle all data, wanted or unwanted,
 the same for now.

 But in the future, yes, let's consider doing this.  Once we do, we should
 talk to ln5 about his plans to apply certificate transparency concepts to
 create a Tor network data archive, where spam descriptors turned out to be
 a major issue, too.

 > === Regarding the repeated uploads:
 > What is the reason for all these server descriptors gabelmoo received?
 > Is there some benign explanation for the uploads?

 Probably not.  But even if we find the reason and fix this, we cannot undo
 that it happened in the past, we cannot guarantee that there will be no
 future bugs like this one, and we cannot prevent malicious relays from
 flooding the directory authorities with random descriptors without there
 being a bug.  Or did you mean that directory authorities shouldn't accept
 as many descriptors from a single source?  I'm not sure how that would
 work, and for the directory authorities it's not that much of a problem to
 get spammed temporarily.  So, I think we might not be able to fix our
 issue with spam descriptors in the tor daemon.

 > Maybe we should actually search the old data for more upload frenzies
 > like the one triggering this discussion?

 We could, but what would we do once we find similar events?  When does a
 malicious descriptor flood begin and what's still expected behavior?  I
 think if we want to solve the descriptor spam problem we'll have to limit
 ourselves to descriptors published by trusted entities and descriptors
 referenced from such descriptors directly or indirectly.
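
 To make that selection rule concrete, here is a minimal sketch.  It is
 not CollecTor code: the function name, the descriptor IDs, and the
 `references` mapping are all made up for illustration, and it assumes we
 have already extracted which descriptor references which (for example, a
 vote referencing server descriptors, and a server descriptor referencing
 its extra-info descriptor).  It simply keeps everything reachable from
 the trusted descriptors.

```python
from collections import deque


def select_trusted(trusted_roots, references):
    """Return IDs of all descriptors reachable from the trusted roots.

    trusted_roots: IDs of descriptors published by trusted entities
                   (hypothetical input, e.g. directory authority votes).
    references:    dict mapping a descriptor ID to the IDs it references
                   (hypothetical, assumed to be extracted beforehand).
    """
    accepted = set(trusted_roots)
    queue = deque(accepted)
    # Breadth-first walk: follow references directly and indirectly.
    while queue:
        current = queue.popleft()
        for ref in references.get(current, ()):
            if ref not in accepted:
                accepted.add(ref)
                queue.append(ref)
    return accepted


# Made-up example data: one trusted vote and one unreferenced spam descriptor.
references = {
    "vote-1": ["server-A", "server-B"],
    "server-A": ["extra-info-A"],
    "server-spam": ["extra-info-spam"],
}
kept = select_trusted(["vote-1"], references)
```

 In this example `kept` contains the vote, both referenced server
 descriptors, and the referenced extra-info descriptor, while
 "server-spam" and "extra-info-spam" are left out because nothing trusted
 points at them.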

 Sorry for the long response.  It's a difficult problem, it seems.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/18910#comment:8>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

