[tor-bugs] #2680 [Metrics]: present bridge usage data so researchers can focus on the math

Tue Mar 15 13:56:33 UTC 2011

#2680: present bridge usage data so researchers can focus on the math
---------------------+------------------------------------------------------
 Reporter:  arma     |          Owner:  karsten 
     Type:  task     |         Status:  assigned
 Priority:  normal   |      Milestone:          
Component:  Metrics  |        Version:          
 Keywords:           |         Parent:          
   Points:           |   Actualpoints:          
---------------------+------------------------------------------------------

Comment(by karsten):

 Replying to [comment:3 arma]:
 > The "fingerprint" and "descriptor" in statuses.csv are always the same.
 I think you're printing "fingerprint" for both of them?

 Oops, fixed.

 > I think the next step is to write a short overview of how to reconstruct
 these files to answer some research question.

 See the new Section 3 of the README and the new R file analysis.R in
 task-2680.

 > For example, say I want to get a list of all the countries that a given
 bridge has seen over time. I guess I want to iterate over all bridge
 fingerprints -- should I use the list of all fingerprints I find in
 statuses.csv or in descriptors.csv -- should they be the same?

 If you want to learn about usage by country, you should only look at
 descriptors.csv, not at statuses.csv.  The data in bridge network statuses
 and the data in extra-info descriptors are not tightly connected (even
 though one can link them via the bridge's descriptor identifier).  A
 bridge is free to write anything in its extra-info descriptor, including
 bridge statistics that are a few days old.  Those statistics are in no way
 related to the bridge authority considering the bridge to be running at a
 later time.

 I added a note to the README.

 > So step zero, given a fingerprint, is to look it up in relays.csv and
 make sure it's not there. If it is, either ignore it or if we want to get
 fancier, ignore data from it close to the time it's in the relay list.

 Correct.  For the metrics graphs, we're removing all bridges that have
 been seen as relays.  Even when we only ignored data within 1 week of a
 bridge showing up in the relay list, we had unrealistic usage numbers that
 I couldn't explain otherwise.  If someone wants to investigate this
 further, I'd be happy to learn if we can do something smarter.
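
 A minimal sketch of this "step zero" filter in Python, for illustration
 only.  It assumes relays.csv has a header row with a "fingerprint" column;
 the function name and the sample fingerprints are made up, and the real
 file layout is described in the README.

```python
import csv
import io


def bridges_never_seen_as_relays(bridge_fingerprints, relays_csv):
    """Return the bridge fingerprints that do not appear in the relay list.

    relays_csv is any open text file (or file-like object) with a header
    row containing a "fingerprint" column (assumed column name).
    """
    relay_fingerprints = {row["fingerprint"] for row in csv.DictReader(relays_csv)}
    return [fp for fp in bridge_fingerprints if fp not in relay_fingerprints]


# Made-up sample data standing in for relays.csv.
sample_relays = io.StringIO("fingerprint\nAAAA\nCCCC\n")
clean = bridges_never_seen_as_relays(["AAAA", "BBBB", "DDDD"], sample_relays)
print(clean)
```

 This drops a bridge entirely once it has been seen as a relay, which is
 the simple approach described above, not the fancier time-window variant.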

 > Step one is to look it up in statuses.csv, get a set of descriptor
 hashes, discard all the ones whose third-to-last value is not TRUE, and
 skip duplicate hashes.

 See above.  Removing descriptors of non-running bridges is not meaningful
 here.

 > Then step two is to take those remaining descriptor hashes and look them
 up in descriptors.csv, at which point I can learn which countries they saw
 unless the countries are all NA in which case we don't have data?

 NA means no data, right.
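
 A sketch of the country lookup in Python, working on descriptors.csv
 alone.  The layout is an assumption: a "fingerprint" column, a
 "descriptor" column, and one column per country code holding a user count
 or NA.  The function name and sample rows are hypothetical; check the
 README for the actual columns.

```python
import csv
import io

# Columns that are not per-country counts (assumed names).
NON_COUNTRY_COLUMNS = {"fingerprint", "descriptor"}


def countries_seen(descriptors_csv, fingerprint):
    """Return the sorted country codes with a non-zero count for a bridge.

    Returns None if every country value for the bridge is NA, or if the
    bridge has no rows at all, i.e. there is no data.
    """
    seen = set()
    has_data = False
    for row in csv.DictReader(descriptors_csv):
        if row["fingerprint"] != fingerprint:
            continue
        for column, value in row.items():
            if column in NON_COUNTRY_COLUMNS or value == "NA":
                continue  # NA means no data, which is different from zero
            has_data = True
            if float(value) > 0:
                seen.add(column)
    return sorted(seen) if has_data else None


# Made-up sample data standing in for descriptors.csv.
sample = (
    "fingerprint,descriptor,de,us,cn\n"
    "AAAA,d1,8,0,NA\n"
    "AAAA,d2,NA,16,NA\n"
    "BBBB,d3,NA,NA,NA\n"
)
seen_a = countries_seen(io.StringIO(sample), "AAAA")
seen_b = countries_seen(io.StringIO(sample), "BBBB")
print(seen_a, seen_b)
```

 Returning None rather than an empty list keeps "no data" apart from "data
 present, but no users from any country".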

 > And the optional step three is to take the timestamp from the status
 file and look up the fingerprint in assignments.csv to decide if it's
 http, email, or unassigned?

 The timestamps of the assignments and the timestamps of the bridge network
 statuses do not necessarily match precisely.  But BridgeDB does not
 reassign bridges between distributors (yet), so there's no need to compare
 timestamps here.
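
 Since bridges are not reassigned, the optional step three can be sketched
 as a plain lookup by fingerprint that ignores timestamps.  The column
 names "fingerprint" and "type", the function name, and the sample rows
 are assumptions made for illustration.

```python
import csv
import io


def distributor_for(assignments_csv, fingerprint):
    """Return the distributor of a bridge, or None if it is not listed.

    The first matching row is enough, because a bridge keeps its
    distributor over time.
    """
    for row in csv.DictReader(assignments_csv):
        if row["fingerprint"] == fingerprint:
            return row["type"]
    return None


# Made-up sample data standing in for assignments.csv.
sample = (
    "assignment,fingerprint,type\n"
    "2011-03-01 00:00:00,AAAA,http\n"
    "2011-03-01 00:00:00,BBBB,email\n"
    "2011-03-02 00:00:00,AAAA,http\n"
    "2011-03-02 00:00:00,CCCC,unassigned\n"
)
dist_a = distributor_for(io.StringIO(sample), "AAAA")
dist_c = distributor_for(io.StringIO(sample), "CCCC")
dist_missing = distributor_for(io.StringIO(sample), "ZZZZ")
print(dist_a, dist_c, dist_missing)
```

 If BridgeDB starts reassigning bridges between distributors, this lookup
 would have to pick the assignment closest in time to the status instead.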

 I think that the example in analysis.R helps clarify things a bit.

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2680#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online