I'm trying to use CollecTor data to find out how much bandwidth is offered by different pluggable transports over time. I.e., I want to be able to say something like, "On July 1, bridges with obfs3 offered X MB/s, bridges with obfs4 offered Y MB/s," etc.

To do this, I'm mapping through three types of CollecTor documents:
    bridge-network-status (where the bandwidth is and which links to router digests)
    bridge-server-descriptor (which links to extra-info digests)
    bridge-extra-info (where the transports are)

I'm having trouble because sometimes a router digest listed in a bridge-network-status document is not found in the same tarball.
https://collector.torproject.org/archive/bridge-descriptors/bridge-descripto...

Here is an example of what I'm doing, using the above tarball.

bridge-descriptors-2015-07/statuses/04/20150704-000350-4A0CCD2DDC7995083D73F5D667100C8A5831F16D
This is a bridge-network-status document. One of its entries is:
    r starman qgM+62FgGytzEtibYqqiPcPtijQ mdOOBxVOTpw8loBezhSDZxLIcXs 2015-07-03 21:39:31 10.174.163.60 9002 0
    s Fast Guard Running Stable Valid
    w Bandwidth=2646
    p reject 1-65535
The second base64-encoded string is the router digest (shown here in hex):
    base64decode("mdOOBxVOTpw8loBezhSDZxLIcXs") = 99D38E07154E4E9C3C96805ECE14836712C8717B

bridge-descriptors-2015-07/server-descriptors/9/9/99d38e07154e4e9c3c96805ece14836712c8717b
Now we go looking for a bridge-server-descriptor with router digest 99D38E07154E4E9C3C96805ECE14836712C8717B, which is in the above file. It has a line:
    extra-info-digest D69106C8BAF5C0044F7331F24DF77E85BBF84027

bridge-descriptors-2015-07/extra-infos/d/6/d69106c8baf5c0044f7331f24df77e85bbf84027
Now we find a bridge-extra-info with digest D69106C8BAF5C0044F7331F24DF77E85BBF84027 in the above file. It tells us what transports the bridge supports (there are two, one for IPv4 and one for IPv6):
    transport meek
    transport meek
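The digest-to-filename step can be sketched in Python roughly as follows. This is only an illustration: the function names are made up for this message, and the directory layout (first hex digit, second hex digit, then the full lowercase digest) is inferred from the paths quoted above rather than taken from any CollecTor specification.

```python
import base64


def digest_b64_to_hex(b64_digest: str) -> str:
    """Decode an unpadded base64 digest from an "r" line into uppercase hex."""
    # Status entries omit the trailing "=" padding, so restore it first.
    padded = b64_digest + "=" * (-len(b64_digest) % 4)
    return base64.b64decode(padded).hex().upper()


def descriptor_path(tarball_dir: str, kind: str, hex_digest: str) -> str:
    """Build the path of a descriptor inside an extracted tarball.

    `kind` is "server-descriptors" or "extra-infos". The x/y/digest
    layout is an assumption based on the example paths in this message.
    """
    h = hex_digest.lower()
    return f"{tarball_dir}/{kind}/{h[0]}/{h[1]}/{h}"


if __name__ == "__main__":
    digest = digest_b64_to_hex("mdOOBxVOTpw8loBezhSDZxLIcXs")
    print(digest)
    print(descriptor_path("bridge-descriptors-2015-07", "server-descriptors", digest))
```

The same helper works for the extra-info step by passing "extra-infos" and the value from the extra-info-digest line.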
Here's an example of where it goes wrong.

bridge-descriptors-2015-07/statuses/01/20150701-060138-4A0CCD2DDC7995083D73F5D667100C8A5831F16D
    r Unnamed ABk0wg4j6BLCdZKleVtmNrfzJGI eGIOW1mGM/Dbw+t5bXnR8jdnsoY 2015-07-01 05:56:14 10.123.124.91 443 0
    s Fast Running Stable Valid
    w Bandwidth=156
    p reject 1-65535
We are looking for router digest 78620E5B598633F0DBC3EB796D79D1F23767B286:
    base64decode("eGIOW1mGM/Dbw+t5bXnR8jdnsoY") = 78620E5B598633F0DBC3EB796D79D1F23767B286
But there is no file bridge-descriptors-2015-07/server-descriptors/7/8/78620e5b598633f0dbc3eb796d79d1f23767b286.

However, I did find it in the previous month's tarball,
https://collector.torproject.org/archive/bridge-descriptors/bridge-descripto...
    bridge-descriptors-2015-06/server-descriptors/7/8/78620e5b598633f0dbc3eb796d79d1f23767b286
It seems rare that the bridge-server-descriptor is missing. In the 2015-07 tarball, it happened for 5891/477496 status entries (1.2%). An additional 4/477496 (less than 0.001%) had a bridge-server-descriptor but were missing bridge-extra-info.
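To make the join and the two kinds of misses concrete, here is a minimal sketch. It assumes the three document types have already been parsed into plain Python structures; the function name and the input shapes are hypothetical, not part of any existing tool. Bandwidth is summed in whatever unit the "w Bandwidth=" line uses.

```python
from collections import Counter


def bandwidth_by_transport(status_entries, server_descs, extra_infos):
    """Sum advertised bandwidth per transport and count failed lookups.

    status_entries: iterable of (router_digest_hex, bandwidth) pairs
    server_descs:   dict of router_digest_hex -> extra_info_digest_hex
    extra_infos:    dict of extra_info_digest_hex -> list of transport names
    """
    totals = Counter()
    missing = Counter()
    for router_digest, bandwidth in status_entries:
        extra_digest = server_descs.get(router_digest)
        if extra_digest is None:
            # Status entry points at a server descriptor we do not have.
            missing["server-descriptor"] += 1
            continue
        transports = extra_infos.get(extra_digest)
        if transports is None:
            # Server descriptor found, but its extra-info is absent.
            missing["extra-info"] += 1
            continue
        # A bridge may list the same transport twice (e.g. IPv4 and IPv6);
        # count its bandwidth once per distinct transport.
        for name in set(transports):
            totals[name] += bandwidth
    return totals, missing
```

Whether a bridge's bandwidth should be credited in full to each of its transports is a design choice; this sketch does so, which means the per-transport totals can add up to more than the total bridge bandwidth.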
How do you handle cases like this? I had a browse through the Onionoo source code, but did not quickly understand it. Should I just always include the month preceding the earliest month I want to process?
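For reference, the "include the preceding month" idea could look roughly like the sketch below. It assumes the monthly tarballs are extracted side by side under one root directory with names like bridge-descriptors-YYYY-MM, and it reuses the x/y/digest layout seen in the example paths. The function names and the `lookback` parameter are my own invention, and I don't know whether one month of lookback is always enough.

```python
from pathlib import Path
from typing import Optional


def previous_month(year_month: str) -> str:
    """Return the month before a "YYYY-MM" string, e.g. "2015-07" -> "2015-06"."""
    year, month = map(int, year_month.split("-"))
    if month == 1:
        return f"{year - 1:04d}-12"
    return f"{year:04d}-{month - 1:02d}"


def find_descriptor(root, year_month: str, kind: str, hex_digest: str,
                    lookback: int = 1) -> Optional[Path]:
    """Look for a descriptor in the given month, then in earlier months.

    `kind` is "server-descriptors" or "extra-infos". Returns the first
    matching file, or None if it is in none of the months searched.
    """
    h = hex_digest.lower()
    month = year_month
    for _ in range(lookback + 1):
        candidate = Path(root) / f"bridge-descriptors-{month}" / kind / h[0] / h[1] / h
        if candidate.is_file():
            return candidate
        month = previous_month(month)
    return None
```

With `lookback=1`, a status from the first days of a month would still resolve if its descriptor only appears in the preceding month's tarball.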