[tor-bugs] #6266 [Tor]: maxmind geoip db is starting to label Tor relays as country "A1"

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Nov 26 19:12:50 UTC 2012


#6266: maxmind geoip db is starting to label Tor relays as country "A1"
------------------------+---------------------------------------------------
 Reporter:  arma        |          Owner:                    
     Type:  defect      |         Status:  needs_review      
 Priority:  normal      |      Milestone:  Tor: 0.2.3.x-final
Component:  Tor         |        Version:                    
 Keywords:  tor-client  |         Parent:                    
   Points:              |   Actualpoints:                    
------------------------+---------------------------------------------------
Changes (by karsten):

  * status:  new => needs_review


Comment:

 Replying to [comment:5 nickm]:
 > The "look for cases where the previous and next entry are in the same
 country" rule resolves 90% of the A1 entries in the June maxmind db.

 After looking into using Software77's database or the RIR delegation files
 as a replacement, I like the approach you suggest here best.  I think we
 should resolve those 90% of A1 entries automatically and have a human fix
 the remaining 10% by using RIR delegation files as a reference and using
 common sense.  If we document what changes we made and make it easy for
 others to verify our decisions, I think we should be all set.

 I wrote a
 [https://trac.torproject.org/projects/tor/attachment/ticket/6266/deanonymind.py
 script] to fix the simple cases in MaxMind's database, and I extended
 [https://github.com/ioerror/blockfinder blockfinder] to show differences
 between GeoIP databases.  Here's what I did to fix the 90%+10% of A1
 entries.  I wrote this down as documentation that we can ship together
 with the geoip file for others to verify what we did.

 Clone blockfinder:

 {{{
 git clone https://github.com/ioerror/blockfinder
 cd blockfinder/
 }}}

 Download the MaxMind GeoLite Country database file:

 {{{
 wget http://geolite.maxmind.com/download/geoip/database/GeoIPCountryCSV.zip
 }}}

 Download and run deanonymind.py to automatically replace A1 entries with
 the country code of the previous and next entry if the two agree.

 {{{
 python deanonymind.py GeoIPCountryCSV.zip AutomaticGeoIPCountryWhois.csv
 }}}
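
 The neighbour rule can be sketched roughly as follows.  This is not the
 attached deanonymind.py, only a minimal illustration under my own
 assumptions: rows use the GeoIPCountryWhois.csv column order (start IP, end
 IP, start integer, end integer, country code, country name), the helper
 names make_row and resolve_a1 are made up, and runs of consecutive A1
 entries or gaps between ranges are not handled.

```python
import ipaddress

A1 = "A1"  # MaxMind's pseudo country code for "Anonymous Proxy"


def make_row(start, end, code, name):
    """Build one row in GeoIPCountryWhois.csv column order (hypothetical helper)."""
    return [start, end,
            str(int(ipaddress.IPv4Address(start))),
            str(int(ipaddress.IPv4Address(end))),
            code, name]


def resolve_a1(rows):
    """Return a copy of rows in which an A1 row takes over the country of its
    neighbours when the previous and next row name the same real country."""
    fixed = [list(row) for row in rows]
    for i in range(1, len(rows) - 1):
        prev_code, code, next_code = rows[i - 1][4], rows[i][4], rows[i + 1][4]
        if code == A1 and prev_code == next_code and prev_code != A1:
            fixed[i][4] = rows[i - 1][4]  # country code
            fixed[i][5] = rows[i - 1][5]  # country name
    return fixed


# Neighbours agree (US / A1 / US): the A1 entry is resolved to US.
agreeing = [
    make_row("8.10.6.244", "8.12.36.255", "US", "United States"),
    make_row("8.12.37.0", "8.12.37.255", A1, "Anonymous Proxy"),
    make_row("8.12.38.0", "8.14.223.255", "US", "United States"),
]

# Neighbours disagree (NL / A1 / IT): the A1 entry is left for a human.
disagreeing = [
    make_row("31.171.128.0", "31.171.133.255", "NL", "Netherlands"),
    make_row("31.171.134.0", "31.171.135.255", A1, "Anonymous Proxy"),
    make_row("31.171.136.0", "31.171.143.255", "IT", "Italy"),
]

print(resolve_a1(agreeing)[1][4])
print(resolve_a1(disagreeing)[1][4])
```

 The sketch compares against the original neighbours rather than already
 rewritten ones, so one guess never feeds into the next.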

 Unzip the original MaxMind file and compare it to the new file.

 {{{
 unzip GeoIPCountryCSV.zip
 diff -U1 GeoIPCountryWhois.csv AutomaticGeoIPCountryWhois.csv | less
 }}}

 Copy the new file, so that the automatic version stays available as a
 reference for the manual changes:

 {{{
 cp AutomaticGeoIPCountryWhois.csv ManualGeoIPCountryWhois.csv
 }}}

 Initialize cache with RIR delegation files, MaxMind's original file, and
 the modified file:

 {{{
 python blockfinder -i
 python blockfinder -r GeoIPCountryWhois.csv
 python blockfinder -r ManualGeoIPCountryWhois.csv
 }}}

 Run blockfinder to compare the three data sources for the A1 country code.

 {{{
 python blockfinder -p A1 | less
 }}}

 Scroll down to "Assignments in 'ManualGeoIPCountryWhois.csv'".  The blocks
 shown there are the A1 entries that could not be resolved by
 deanonymind.py, most likely because previous and subsequent country codes
 do not match.  There are 19 such entries in the November 2012 database,
 few enough for a human to fix.  Here's an example:

 {{{
   NL 31.171.128.0-31.171.133.255 GeoIPCountryWhois.csv
 > A1 31.171.134.0-31.171.135.255 GeoIPCountryWhois.csv
   IT 31.171.136.0-31.171.143.255 GeoIPCountryWhois.csv
   NL 31.171.128.0-31.171.133.255 ManualGeoIPCountryWhois.csv
 < A1 31.171.134.0-31.171.135.255 ManualGeoIPCountryWhois.csv
   IT 31.171.136.0-31.171.143.255 ManualGeoIPCountryWhois.csv
 * NL 31.171.128.0-31.171.135.255 rir
   IT 31.171.136.0-31.171.143.255 rir
 }}}

 In this case the two MaxMind files still agree that
 31.171.134.0-31.171.135.255 should be assigned to A1 whereas the RIR
 delegation files say NL.  It seems clear that NL is correct here, so we
 can manually change this line in ManualGeoIPCountryWhois.csv to NL.
 Repeat 18 times for the remaining A1 entries.
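
 When going through the remaining entries by hand, the check "which RIR
 range covers this A1 block?" can be sketched as below.  This is only an
 illustration, not part of blockfinder: the function names and the list of
 (country, range) pairs are assumptions, with the ranges copied from the
 example output above.

```python
import ipaddress


def parse_range(text):
    """Turn 'a.b.c.d-e.f.g.h' into a (start, end) pair of integers."""
    start, end = text.split("-")
    return int(ipaddress.IPv4Address(start)), int(ipaddress.IPv4Address(end))


def rir_country_for(block, rir_assignments):
    """Return the country code of the first RIR range that fully contains
    block, or None if no listed range covers it."""
    block_start, block_end = parse_range(block)
    for code, rir_range in rir_assignments:
        rir_start, rir_end = parse_range(rir_range)
        if rir_start <= block_start and block_end <= rir_end:
            return code
    return None


# RIR lines taken from the blockfinder output shown above.
rir = [
    ("NL", "31.171.128.0-31.171.135.255"),
    ("IT", "31.171.136.0-31.171.143.255"),
]

suggestion = rir_country_for("31.171.134.0-31.171.135.255", rir)
print(suggestion)  # NL
```

 The result is only a suggestion; a human still decides whether the RIR
 answer is plausible.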

 Re-import ManualGeoIPCountryWhois.csv and re-run the comparison:

 {{{
 python blockfinder -r ManualGeoIPCountryWhois.csv
 python blockfinder -p A1 | less
 }}}

 There should be no "Assignments in 'ManualGeoIPCountryWhois.csv'" section
 anymore, because all A1 entries should have been edited by now.  But there
 is an "Assignments in 'GeoIPCountryWhois.csv'" section with quite a lot of
 blocks in it.  There are two types of conflicts, and we're only interested
 in one of them: the uninteresting conflict is where GeoIPCountryWhois.csv
 has an assignment for A1 and both ManualGeoIPCountryWhois.csv and rir
 agree on another country code.  For example:

 {{{
   US 8.10.6.244-8.12.36.255 GeoIPCountryWhois.csv
 < A1 8.12.37.0-8.12.37.255 GeoIPCountryWhois.csv
   US 8.12.38.0-8.14.223.255 GeoIPCountryWhois.csv
   US 8.10.6.244-8.12.36.255 ManualGeoIPCountryWhois.csv
 * US 8.12.37.0-8.12.37.255 ManualGeoIPCountryWhois.csv
   US 8.12.38.0-8.14.223.255 ManualGeoIPCountryWhois.csv
 * US 8.0.0.0-8.255.255.255 rir
 }}}

 This conflict implies that either deanonymind.py or our manual edits were
 likely correct, so it's uninteresting.  But then there's another type of
 conflict where all three databases have a different assignment.  These
 conflicting lines are prefixed with '#' instead of '*'.  The first such
 conflict is:

 {{{
   CA 38.80.64.0-38.80.71.255 GeoIPCountryWhois.csv
 < A1 38.80.72.0-38.80.73.255 GeoIPCountryWhois.csv
   CA 38.80.74.0-38.80.75.255 GeoIPCountryWhois.csv
   CA 38.80.64.0-38.80.71.255 ManualGeoIPCountryWhois.csv
 * CA 38.80.72.0-38.80.73.255 ManualGeoIPCountryWhois.csv
   CA 38.80.74.0-38.80.75.255 ManualGeoIPCountryWhois.csv
 # US 38.0.0.0-38.255.255.255 rir
 }}}

 This conflict is interesting, but it can still be ignored after review.
 Both neighbouring blocks are CA, so our choice of CA is more likely correct
 even though it conflicts with the RIR delegation files, which say US.

 There are 11 '#' conflicts for the November database, after automatic and
 manual changes, and we'll have to look at each of them.  If we're unhappy
 with a conflict, we'll have to edit ManualGeoIPCountryWhois.csv again,
 re-import it, and look again.
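
 The two kinds of conflict described above can be told apart with a check
 like the following.  This is a sketch of the reasoning only, not
 blockfinder's code; the function name and the returned labels are made up
 for illustration and do not correspond to blockfinder's '*' and '#'
 markers one to one.

```python
def classify(original, manual, rir):
    """Classify one block by the country codes from the three data sources:
    MaxMind's original file, our edited file, and the RIR delegation files."""
    if original == manual == rir:
        return "no conflict"
    if manual == rir:
        # Our edit and the RIR files agree; only the original file differs.
        return "uninteresting"
    if len({original, manual, rir}) == 3:
        # All three sources disagree; a human should look at this block.
        return "needs review"
    return "other"


# Codes taken from the two example blocks above.
print(classify("A1", "US", "US"))  # uninteresting
print(classify("A1", "CA", "US"))  # needs review
```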

 Review the manual changes one last time:

 {{{
 diff -U1 AutomaticGeoIPCountryWhois.csv ManualGeoIPCountryWhois.csv | less
 }}}

 Convert new file to Tor's geoip file format:

 {{{
 cut -d, -f3-5 < ManualGeoIPCountryWhois.csv | sed 's/"//g' > geoip
 }}}
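
 For anyone who wants to double-check the conversion without cut and sed,
 here is a rough Python equivalent.  It assumes the six-column
 GeoIPCountryWhois.csv layout (start IP, end IP, start integer, end integer,
 country code, country name); the function name to_tor_geoip is made up.

```python
import csv
import io


def to_tor_geoip(csv_text):
    """Keep columns 3-5 (start integer, end integer, country code) of each
    CSV line and drop the quotes, like the cut/sed pipeline above."""
    lines = []
    for fields in csv.reader(io.StringIO(csv_text)):
        if not fields:
            continue  # skip empty lines
        lines.append(",".join(fields[2:5]))
    return "\n".join(lines) + "\n"


sample = ('"31.171.134.0","31.171.135.255",'
          '"531334656","531335167","NL","Netherlands"\n')
print(to_tor_geoip(sample), end="")  # 531334656,531335167,NL
```

 Unlike cut, the csv module parses the quoting instead of splitting on every
 comma, which does not change the result for these three columns.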

 Prepend a comment like the following to the geoip file:

 {{{
 # Last updated based on November 7 2012 Maxmind GeoLite Country
 # See $SOME_README_FILE_OR_TRAC_LINK for details on the conversion.
 }}}

 Commit the new geoip file to tor's `src/config/`, done.

 If you like this approach, I have an A1-less November 2012 database here
 that we can ship with the next Tor version.  I'd need to know how we'd
 want to document a) the general approach (basically what I described in
 this comment) and b) the manual changes.

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6266#comment:8>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

