[tor-dev] Using MaxMind's GeoIP2 databases in tor, BridgeDB, metrics-*, Onionoo, etc.
karsten at torproject.org
Thu Jan 16 10:15:37 UTC 2014
You probably know that we use MaxMind's GeoIP database in several of
our products (this list may not be exhaustive):
- tor: We ship little-t-tor with a geoip and a geoip6 file for clients
to support excluding relays by country code and for relays to generate
per-country statistics on connecting clients.
- BridgeDB: I vaguely recall that the BridgeDB service uses GeoIP data
to return only bridges that are not blocked in a user's country. Or
maybe that was a feature yet to be implemented.
- Onionoo: The Onionoo service uses MaxMind's city database to provide
location information of relays. (It also uses MaxMind's ASN database to
provide information on AS number and name.)
- metrics-db: I'm planning to use GeoIP data to resolve bridge IP
addresses to country codes in the bridge descriptor sanitizing process.
- metrics-web: We have been using GeoIP data to provide statistics on
relays by country. These are currently disabled because the
implementation was eating too many resources, but I plan to bring them
back.
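For context, the geoip files that tor ships are plain text with one
"start,end,CC" line per range, IPv4 addresses encoded as unsigned
integers. A minimal lookup sketch, assuming that format; the sample
ranges and helper names are made up for illustration:

```python
import bisect
import ipaddress

def load_geoip(lines):
    """Parse tor-style geoip lines of the form 'start_num,end_num,CC'."""
    ranges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        start, end, cc = line.split(',')
        ranges.append((int(start), int(end), cc))
    ranges.sort()
    return ranges

def lookup(ranges, addr):
    """Binary-search the sorted ranges for an IPv4 address."""
    num = int(ipaddress.IPv4Address(addr))
    i = bisect.bisect_right(ranges, (num, float('inf'), '')) - 1
    if i >= 0 and ranges[i][0] <= num <= ranges[i][1]:
        return ranges[i][2]
    return '??'

sample = [
    "16777216,16777471,AU",   # 1.0.0.0 - 1.0.0.255
    "16777472,16778239,CN",   # 1.0.1.0 - 1.0.3.255
]
ranges = load_geoip(sample)
print(lookup(ranges, "1.0.0.10"))   # AU
print(lookup(ranges, "8.8.8.8"))    # ?? (not covered by the sample)
```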
However, the GeoIP database that we currently use has a big shortcoming:
it replaces valid country codes with A1 or A2 whenever MaxMind thinks
that a relay is an "anonymizing proxy" or "satellite provider".
That's why we currently repair their database by either automatically
guessing what country code an A1 entry could have had [1, 2], or by
manually looking it up in RIR delegation files [3, 4]. This is just a
workaround. Also, I think BridgeDB doesn't repair its GeoIP database.
Here's the good news: MaxMind now provides their databases in new
formats which provide the A1/A2 information in *addition* to the correct
country codes [5, 6]. We should switch!
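To illustrate the difference: in the old database an "anonymizing
proxy" entry only carries A1, while the new layout keeps the country
code and reports the proxy status in a separate field. A sketch with
simplified, made-up column names loosely modeled on their CSV layout
(the real files split network blocks and location names into separate
files):

```python
import csv
import io

# Illustrative sample in a flattened GeoIP2-style layout: the
# anonymous-proxy flag is its own column, so the real country survives.
data = """network,country_iso_code,is_anonymous_proxy,is_satellite_provider
1.0.0.0/24,AU,0,0
5.6.7.0/24,DE,1,0
"""

labeled = []
for row in csv.DictReader(io.StringIO(data)):
    suffix = " (anonymizing proxy)" if row["is_anonymous_proxy"] == "1" else ""
    labeled.append(row["network"] + " " + row["country_iso_code"] + suffix)

print("\n".join(labeled))
```

With the old format the second row would have been plain "A1" and the
DE would have been lost, which is exactly what we repair today.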
How do we switch? The first option is to ship their binary database
files and include their APIs in our products. It looks like there are
APIs for C, Java, and Python, so all the languages we need for the
tools listed above. Pros: we can kick out our parsing and lookup code.
Cons: we need to check whether their licenses are compatible, we have
to kick out our parsing and lookup code and learn their APIs, and we
add new dependencies.
Another option is to write a new tool that parses their full databases
and converts them into file formats we already support. (This would
also allow us to provide a custom format with multiple database versions
which would be pretty useful for metrics, see #6471.) Also, it looks
like their license, Creative Commons Attribution-ShareAlike 3.0
Unported, allows converting their database to a different format. If we
want to write such a tool, we have a few options:
- We use their database specification and write our own parser using a
language of our choice (read: whoever writes it pretty much decides).
We could skip the binary search tree part of their files and only
process the contents. Whenever they change their format, we'll have to
adapt.
- We use their Python API to build our parser, though it looks like
that requires pip or easy_install and compiling their C API. I don't
know enough about Python to assess what headaches that's going to cause.
- We use their Java API to build our parser, though we're probably
forced to use Maven rather than Ant. I don't have much experience with
Maven. Also, using Java probably makes me the default (and only)
maintainer, which I'd want to avoid if possible.
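Whichever parser we pick, the conversion step itself should be cheap:
the new CSV files list networks in CIDR notation, and our existing
formats want integer ranges. A sketch of that step with made-up sample
input (a real tool would probably also want to merge adjacent ranges
that share a country code):

```python
import ipaddress

def cidr_to_geoip_line(network_cidr, country_code):
    """Convert one CIDR block to a tor-style 'start,end,CC' line."""
    net = ipaddress.ip_network(network_cidr)
    return "%d,%d,%s" % (int(net.network_address),
                         int(net.broadcast_address), country_code)

lines = [cidr_to_geoip_line("1.0.0.0/24", "AU"),
         cidr_to_geoip_line("1.0.4.0/22", "CN")]
print("\n".join(lines))
```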
Thoughts? What other options did I miss, and what pros and cons
haven't I thought of?
And is this something that people on this list would want to help with,
once we agreed on one of the options? If so, please feel free to join
the discussion now and maybe influence which path we're going to take.
All the best,
Karsten