I'm trying to use CollecTor data to find out how much bandwidth is offered by different pluggable transports over time. I.e., I want to be able to say something like, "On July 1, bridges with obfs3 offered X MB/s, bridges with obfs4 offered Y MB/s," etc.

To do this, I'm mapping through three types of CollecTor documents:

    bridge-network-status (where the bandwidth is and which links to router digests)
    bridge-server-descriptor (which links to extra-info digests)
    bridge-extra-info (where the transports are)

I'm having trouble because sometimes, a router digest listed in a bridge-network-status document is not found in the same tarball.
https://collector.torproject.org/archive/bridge-descriptors/bridge-descripto...

Here is an example of what I'm doing, using the above tarball.

bridge-descriptors-2015-07/statuses/04/20150704-000350-4A0CCD2DDC7995083D73F5D667100C8A5831F16D

This is a bridge-network-status document. One of its entries is:

    r starman qgM+62FgGytzEtibYqqiPcPtijQ mdOOBxVOTpw8loBezhSDZxLIcXs 2015-07-03 21:39:31 10.174.163.60 9002 0
    s Fast Guard Running Stable Valid
    w Bandwidth=2646
    p reject 1-65535

The second base64-encoded string is the router digest.

    base64decode("mdOOBxVOTpw8loBezhSDZxLIcXs") = 99D38E07154E4E9C3C96805ECE14836712C8717B

bridge-descriptors-2015-07/server-descriptors/9/9/99d38e07154e4e9c3c96805ece14836712c8717b

Now we go looking for a bridge-server-descriptor with router digest 99D38E07154E4E9C3C96805ECE14836712C8717B, which is in the above file. It has a line:

    extra-info-digest D69106C8BAF5C0044F7331F24DF77E85BBF84027

bridge-descriptors-2015-07/extra-infos/d/6/d69106c8baf5c0044f7331f24df77e85bbf84027

Now we find a bridge-extra-info with digest D69106C8BAF5C0044F7331F24DF77E85BBF84027 in the above file. It tells us what transports the bridge supports (there are two, one for IPv4 and one for IPv6):

    transport meek
    transport meek
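The digest-to-path step above can be sketched in Python. This is only a minimal sketch, not part of any CollecTor tooling: the function names (digest_b64_to_hex, descriptor_path, parse_r_line) are hypothetical, and the two-level directory layout is an assumption inferred from the example paths in this message.

```python
import base64


def digest_b64_to_hex(b64_digest: str) -> str:
    """Decode an unpadded base64 digest from an "r" line into uppercase hex."""
    # The status entries omit the trailing "=" padding, so restore it first.
    padded = b64_digest + "=" * (-len(b64_digest) % 4)
    return base64.b64decode(padded).hex().upper()


def descriptor_path(tarball_root: str, kind: str, hex_digest: str) -> str:
    """Build <root>/<kind>/<1st hex char>/<2nd hex char>/<digest>.

    Assumption: this layout is inferred from the example paths above.
    """
    d = hex_digest.lower()
    return f"{tarball_root}/{kind}/{d[0]}/{d[1]}/{d}"


def parse_r_line(line: str) -> dict:
    """Pull the nickname and router digest out of a status "r" line."""
    fields = line.split()
    if len(fields) < 4 or fields[0] != "r":
        raise ValueError("not an r line: " + line)
    # fields[2] is the first base64 string; fields[3] is the router digest.
    return {"nickname": fields[1], "router_digest": digest_b64_to_hex(fields[3])}
```

With the "starman" entry from the example, parse_r_line gives the router digest, and descriptor_path("bridge-descriptors-2015-07", "server-descriptors", digest) gives the file to open next. The same path helper should work for the extra-infos directory, given the extra-info-digest value.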
Here's an example of where it goes wrong.

bridge-descriptors-2015-07/statuses/01/20150701-060138-4A0CCD2DDC7995083D73F5D667100C8A5831F16D

    r Unnamed ABk0wg4j6BLCdZKleVtmNrfzJGI eGIOW1mGM/Dbw+t5bXnR8jdnsoY 2015-07-01 05:56:14 10.123.124.91 443 0
    s Fast Running Stable Valid
    w Bandwidth=156
    p reject 1-65535

We are looking for router digest 78620E5B598633F0DBC3EB796D79D1F23767B286:

    base64decode("eGIOW1mGM/Dbw+t5bXnR8jdnsoY") = 78620E5B598633F0DBC3EB796D79D1F23767B286

But there is no file bridge-descriptors-2015-07/server-descriptors/7/8/78620e5b598633f0dbc3eb796d79d1f23767b286.

However, I did find it in the previous month's tarball,
https://collector.torproject.org/archive/bridge-descriptors/bridge-descripto...

bridge-descriptors-2015-06/server-descriptors/8/3/835a43ff89db9c1be8ddf7536d759875878620e7
It seems rare that the bridge-server-descriptor is missing. In the 2015-07 tarball, it happened for 5891/477496 relays (1.2%). An additional 4/477496 (0.0%) had a bridge-server-descriptor but were missing bridge-extra-info.
How do you handle cases like this? I had a browse through the Onionoo source code, but did not quickly understand it. Should I just always include the month preceding the earliest month I want to process?
On Wed, Jul 08, 2015 at 07:45:04PM -0700, David Fifield wrote:
I'm trying to use CollecTor data to find out how much bandwidth is offered by different pluggable transports over time. I.e., I want to be able to say something like, "On July 1, bridges with obfs3 offered X MB/s, bridges with obfs4 offered Y MB/s," etc.
Great!
I'm having trouble because sometimes, a router digest listed in a bridge-network-status document is not found in the same tarball.
[snip]
Here's an example of where it goes wrong. bridge-descriptors-2015-07/statuses/01/20150701-060138-4A0CCD2DDC7995083D73F5D667100C8A5831F16D
Yeah, I'm not surprised it goes wrong, since the descriptor from 0701-06:01 was likely published in the previous month.
However, I did find it in the previous month's tarball,
Yep.
It seems rare that the bridge-server-descriptor is missing. In the 2015-07 tarball, it happened for 5891/477496 relays (1.2%).
[snip]
How do you handle cases like this? I had a browse through the Onionoo source code, but did not quickly understand it. Should I just always include the month preceding the earliest month I want to process?
How many of the 5891 cases does that resolve?
--Roger
On 09/07/15 05:39, Roger Dingledine wrote:
On Wed, Jul 08, 2015 at 07:45:04PM -0700, David Fifield wrote:
I'm trying to use CollecTor data to find out how much bandwidth is offered by different pluggable transports over time. I.e., I want to be able to say something like, "On July 1, bridges with obfs3 offered X MB/s, bridges with obfs4 offered Y MB/s," etc.
Great!
I'm having trouble because sometimes, a router digest listed in a bridge-network-status document is not found in the same tarball.
[snip]
Here's an example of where it goes wrong. bridge-descriptors-2015-07/statuses/01/20150701-060138-4A0CCD2DDC7995083D73F5D667100C8A5831F16D
Yeah, I'm not surprised it goes wrong, since the descriptor from 0701-06:01 was likely published in the previous month.
However, I did find it in the previous month's tarball,
Yep.
I think you picked the wrong example for something going wrong, because that descriptor is actually included in the 2015-07 tarball.
But there are indeed cases when a status published in 2015-07 references a server descriptor that was published in 2015-06, and that server descriptor would be contained in the 2015-06 tarball. Example from the same status:
bridge-descriptors-2015-07/statuses/01/20150701-060138-4A0CCD2DDC7995083D73F5D667100C8A5831F16D
contains a line:
r Unnamed ABQ4ZADwj8WkfgApkhVTFalGweU GqjwHG/sFpFzY4sx9SWuzVTcHag 2015-06-30 12:59:03 10.135.171.161 443 0
which references the following server descriptor:
bridge-descriptors-2015-06/server-descriptors/1/a/1aa8f01c6fec169173638b31f525aecd54dc1da8
It seems rare that the bridge-server-descriptor is missing. In the 2015-07 tarball, it happened for 5891/477496 relays (1.2%).
[snip]
How do you handle cases like this? I had a browse through the Onionoo source code, but did not quickly understand it.
Onionoo typically reads descriptors from CollecTor's recent/ directory, which holds descriptors published in the past 72 hours, rather than from the tarballs in the archive/ directory, which are organized by publication month.
Should I just always include the month preceding the earliest month I want to process?
Yes, you should do that.
How many of the 5891 cases does that resolve?
If you happen to find cases which are not explained by that, please let me know.
All the best, Karsten
On Wed, Jul 08, 2015 at 11:39:54PM -0400, Roger Dingledine wrote:
It seems rare that the bridge-server-descriptor is missing. In the 2015-07 tarball, it happened for 5891/477496 relays (1.2%).
[snip]
How do you handle cases like this? I had a browse through the Onionoo source code, but did not quickly understand it. Should I just always include the month preceding the earliest month I want to process?
How many of the 5891 cases does that resolve?
Peeking into 2015-06 resolves all 5891 cases of missing bridge-server-descriptor in the 2015-07 tarball. (There are still 4 cases of missing bridge-extra-info.)
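In case it is useful to anyone doing the same thing, the "peek into the previous month" lookup can be sketched like this. It is a rough sketch under assumptions: the monthly tarballs are already extracted side by side under one base directory, the directory layout matches the example paths earlier in the thread, and the function names (previous_month, find_descriptor) are hypothetical.

```python
import os
from typing import Optional


def previous_month(month: str) -> str:
    """Return the month before a "YYYY-MM" string, wrapping across years."""
    year, mon = map(int, month.split("-"))
    if mon == 1:
        return f"{year - 1:04d}-12"
    return f"{year:04d}-{mon - 1:02d}"


def find_descriptor(base_dir: str, month: str, kind: str,
                    hex_digest: str) -> Optional[str]:
    """Look for a descriptor in the given month, then in the month before.

    Assumption: tarballs are extracted as <base_dir>/bridge-descriptors-YYYY-MM/.
    kind is "server-descriptors" or "extra-infos".
    Returns the path of the first match, or None if neither month has it.
    """
    d = hex_digest.lower()
    for m in (month, previous_month(month)):
        path = os.path.join(base_dir, f"bridge-descriptors-{m}", kind,
                            d[0], d[1], d)
        if os.path.isfile(path):
            return path
    return None
```

A None result after both months have been checked marks the entries that are still unexplained, like the 4 remaining cases of missing bridge-extra-info.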
On Thu, Jul 09, 2015 at 12:04:52PM -0700, David Fifield wrote:
On Wed, Jul 08, 2015 at 11:39:54PM -0400, Roger Dingledine wrote:
It seems rare that the bridge-server-descriptor is missing. In the 2015-07 tarball, it happened for 5891/477496 relays (1.2%).
[snip]
How do you handle cases like this? I had a browse through the Onionoo source code, but did not quickly understand it. Should I just always include the month preceding the earliest month I want to process?
How many of the 5891 cases does that resolve?
Peeking into 2015-06 resolves all 5891 cases of missing bridge-server-descriptor in the 2015-07 tarball. (There are still 4 cases of missing bridge-extra-info.)
Awesome. Sounds like a great reason to peek into the previous month. :)
--Roger