[tor-dev] CollecTor data: mapping bridge-network-status to bridge-server-descriptor to bridge-extra-info

David Fifield david at bamsoftware.com
Thu Jul 9 02:45:04 UTC 2015

I'm trying to use CollecTor data to find out how much bandwidth is
offered by different pluggable transports over time. I.e., I want to be
able to say something like, "On July 1, bridges with obfs3 offered X MB/s,
bridges with obfs4 offered Y MB/s," etc. To do this, I'm mapping through
three types of CollecTor documents:
	bridge-network-status (which has the bandwidth and links to router digests)
	bridge-server-descriptor (which links to extra-info digests)
	bridge-extra-info (where the transports are)
I'm having trouble because sometimes, a router digest listed in a
bridge-network-status document is not found in the same tarball.

Here is an example of what I'm doing, using the above tarball.
	This is a bridge-network-status document. One of its entries is:
		r starman qgM+62FgGytzEtibYqqiPcPtijQ mdOOBxVOTpw8loBezhSDZxLIcXs 2015-07-03 21:39:31 9002 0
		s Fast Guard Running Stable Valid
		w Bandwidth=2646
		p reject 1-65535
	The second base64-encoded string is the router digest.
		base64decode("mdOOBxVOTpw8loBezhSDZxLIcXs") = 99D38E07154E4E9C3C96805ECE14836712C8717B
	Now we go looking for a bridge-server-descriptor with router
	digest 99D38E07154E4E9C3C96805ECE14836712C8717B, which is in the
	above file. It has a line:
		extra-info-digest D69106C8BAF5C0044F7331F24DF77E85BBF84027
	Now we find a bridge-extra-info with digest
	D69106C8BAF5C0044F7331F24DF77E85BBF84027 in the above file. It
	tells us what transports the bridge supports (there are two, one
	for IPv4 and one for IPv6):
		transport meek
		transport meek
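
The lookup chain in this example can be sketched in Python. This is a minimal sketch, not how CollecTor or Onionoo does it: the function names and the two dictionaries (router digest to extra-info digest, and extra-info digest to transport list) are hypothetical stand-ins for whatever index is built from the tarball. The digest in the "r" line is base64 without "=" padding, so the padding is restored before decoding.

```python
import base64


def digest_b64_to_hex(b64_digest):
    """Decode an unpadded base64 digest from an "r" line into uppercase hex."""
    # Status entries omit the trailing "=" padding, so restore it first.
    padded = b64_digest + "=" * (-len(b64_digest) % 4)
    return base64.b64decode(padded).hex().upper()


def transports_for_entry(r_line, server_descriptors, extra_infos):
    """Follow status entry -> server descriptor -> extra-info.

    server_descriptors: router digest (hex) -> extra-info digest (hex)
    extra_infos:        extra-info digest (hex) -> list of transport names
    Both dictionaries are assumed to have been built beforehand.
    Returns None if either link in the chain is missing.
    """
    fields = r_line.split()
    router_digest = digest_b64_to_hex(fields[3])  # second base64 string
    extra_info_digest = server_descriptors.get(router_digest)
    if extra_info_digest is None:
        return None
    return extra_infos.get(extra_info_digest)


# The "starman" entry from the example above.
r_line = ("r starman qgM+62FgGytzEtibYqqiPcPtijQ mdOOBxVOTpw8loBezhSDZxLIcXs "
          "2015-07-03 21:39:31 9002 0")
server_descriptors = {
    "99D38E07154E4E9C3C96805ECE14836712C8717B":
        "D69106C8BAF5C0044F7331F24DF77E85BBF84027",
}
extra_infos = {
    "D69106C8BAF5C0044F7331F24DF77E85BBF84027": ["meek", "meek"],
}
transports = transports_for_entry(r_line, server_descriptors, extra_infos)
```

A None result marks exactly the kind of entry described next, where one of the two links cannot be followed.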

Here's an example of where it goes wrong.
		r Unnamed ABk0wg4j6BLCdZKleVtmNrfzJGI eGIOW1mGM/Dbw+t5bXnR8jdnsoY 2015-07-01 05:56:14 443 0
		s Fast Running Stable Valid
		w Bandwidth=156
		p reject 1-65535
	We are looking for router digest 78620E5B598633F0DBC3EB796D79D1F23767B286:
		base64decode("eGIOW1mGM/Dbw+t5bXnR8jdnsoY") = 78620E5B598633F0DBC3EB796D79D1F23767B286
	But there is no file bridge-descriptors-2015-07/server-descriptors/7/8/78620e5b598633f0dbc3eb796d79d1f23767b286.
	However, I did find it in the previous month's tarball.
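
The file name checked here follows a pattern that can be written as a small helper. This is only a sketch: the layout (lowercase hex digest, filed under directories named after its first two characters) is inferred from the single path quoted above, and server_descriptor_path is a hypothetical name.

```python
def server_descriptor_path(tarball_root, hex_digest):
    """Build the expected path of a server descriptor in an extracted tarball.

    Assumes the layout <root>/server-descriptors/<c1>/<c2>/<digest>, where the
    digest is lowercase hex and c1, c2 are its first two characters.
    """
    name = hex_digest.lower()
    return "/".join([tarball_root, "server-descriptors", name[0], name[1], name])


path = server_descriptor_path(
    "bridge-descriptors-2015-07",
    "78620E5B598633F0DBC3EB796D79D1F23767B286",
)
```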

It seems rare that the bridge-server-descriptor is missing. In the
2015-07 tarball, it happened for 5891/477496 relays (1.2%). An
additional 4/477496 (less than 0.001%) had a bridge-server-descriptor but
were missing bridge-extra-info.
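
Counts like these can be produced by classifying each router digest by where the chain breaks. The following is a rough sketch with hypothetical names; it assumes two dictionaries built from the tarball, one mapping router digest to extra-info digest and one keyed by extra-info digest.

```python
from collections import Counter


def tally_links(router_digests, server_descriptors, extra_infos):
    """Count how many router digests resolve fully and where the rest fail."""
    counts = Counter()
    for router_digest in router_digests:
        if router_digest not in server_descriptors:
            counts["missing server descriptor"] += 1
        elif server_descriptors[router_digest] not in extra_infos:
            counts["missing extra-info"] += 1
        else:
            counts["resolved"] += 1
    return counts


def percent(part, total):
    """Express part/total as a percentage."""
    return 100.0 * part / total


# The two fractions quoted above, at more precision than one decimal place.
missing_descriptor_pct = percent(5891, 477496)
missing_extra_info_pct = percent(4, 477496)
```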

How do you handle cases like this? I had a browse through the Onionoo
source code, but did not quickly understand it. Should I just always
include the month preceding the earliest month I want to process?
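
The approach asked about in the last sentence could look roughly like the following: try the month being processed first, then fall back to the preceding month's extracted tarball. This is a sketch of the idea, not an answer from the CollecTor or Onionoo maintainers. find_server_descriptor is a hypothetical helper, the directory layout is inferred from the path quoted earlier, and the 2015-06 directory name is assumed by analogy with the 2015-07 one.

```python
from pathlib import Path


def find_server_descriptor(extracted_roots, hex_digest):
    """Return the first existing descriptor file among several tarball roots.

    extracted_roots is searched in order, so list the month being processed
    first and the preceding month after it. Returns None if no root has it.
    """
    name = hex_digest.lower()
    for root in extracted_roots:
        candidate = Path(root) / "server-descriptors" / name[0] / name[1] / name
        if candidate.is_file():
            return candidate
    return None


# Assumed directory names for two extracted monthly tarballs.
roots = ["bridge-descriptors-2015-07", "bridge-descriptors-2015-06"]
found = find_server_descriptor(roots, "78620E5B598633F0DBC3EB796D79D1F23767B286")
```

Searching in order keeps the common case cheap, since most descriptors are in the same month's tarball and only the rare misses touch the older one.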
