[tor-commits] [onionoo/master] Use recent GeoIP database without A1 entries.

karsten at torproject.org
Mon Feb 11 08:22:43 UTC 2013


commit 95623efb0e415d1c9c9fa176a967f1a05f942b45
Author: Karsten Loesing <karsten.loesing at gmx.net>
Date:   Mon Feb 11 08:12:39 2013 +0100

    Use recent GeoIP database without A1 entries.
    
    The IP-to-city database to be deployed with Onionoo needs to have its "A1"
    ("Anonymous Proxy") entries fixed just like Tor's IP-to-country file.  See
    Tor's src/config/README.geoip for detailed information.
    
    - Ship with a variant of Tor's deanonymind.py that removes A1 entries from
      IP-to-city databases.  Also ship with a custom geoip-manual for manual
      replacements.
    - Use our own GeoIP file parser, because MaxMind's library doesn't work
      with .csv files.  On the plus side this removes a dependency and makes
      it easier to build Onionoo.  On the minus side it adds a bunch of new
      code.
    - Update index.html to say that some _name entries may be missing if
      empty.
    - Update .gitignore and INSTALL.
---
 .gitignore                                   |   29 ++-
 INSTALL                                      |   79 +++++--
 geoip/deanonymind.py                         |  175 ++++++++++++
 geoip/geoip-manual                           |  354 +++++++++++++++++++++++++
 src/org/torproject/onionoo/CurrentNodes.java |  364 +++++++++++++++++++++++---
 src/org/torproject/onionoo/Main.java         |    3 +-
 web/index.html                               |    6 +-
 7 files changed, 936 insertions(+), 74 deletions(-)
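
Note on the automatic A1 replacement: the rule described in the deanonymind.py docstring (replace a single A1 entry only if it directly touches both neighbours and both neighbours share a block number) can be sketched on its own as follows. This is a minimal illustration and not the shipped script: it is written for Python 3, whereas deanonymind.py in this commit targets Python 2, the helper name merge_single_a1 is hypothetical, and the ranges and block number 5000 in the example are made up. Only 242, the script's default for the block number to replace, comes from the commit.

```python
def parse_line(line):
    """Split a blocks line like '"100","199","5000"' into a dict of strings."""
    start_num, end_num, block_number = line.replace('"', '').strip().split(',')
    return {'start_num': start_num, 'end_num': end_num,
            'block_number': block_number}


def merge_single_a1(prev_line, a1_line, next_line):
    """Hypothetical helper: return the A1 line with the neighbours' block
    number if the merge rule applies, otherwise return it unchanged."""
    prev_entry = parse_line(prev_line)
    a1_entry = parse_line(a1_line)
    next_entry = parse_line(next_line)
    # The A1 range must start right after the previous range ends ...
    touches_prev = int(prev_entry['end_num']) + 1 == int(a1_entry['start_num'])
    # ... and end right before the next range starts.
    touches_next = int(a1_entry['end_num']) + 1 == int(next_entry['start_num'])
    # Both neighbours must point at the same block number.
    same_block = prev_entry['block_number'] == next_entry['block_number']
    if touches_prev and touches_next and same_block:
        return '"%s","%s","%s"' % (a1_entry['start_num'],
                                   a1_entry['end_num'],
                                   prev_entry['block_number'])
    return a1_line


if __name__ == '__main__':
    # Made-up ranges: the A1 entry (block number 242) sits exactly between
    # two entries with the same block number, so it gets replaced.
    print(merge_single_a1('"100","199","5000"',
                          '"200","299","242"',
                          '"300","399","5000"'))
    # Here there is a gap before the next entry, so nothing changes.
    print(merge_single_a1('"100","199","5000"',
                          '"200","299","242"',
                          '"301","399","5000"'))
```

Entries that this rule leaves untouched are the ones that geoip-manual is meant to cover.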

diff --git a/.gitignore b/.gitignore
index 40f5895..ac44c7d 100755
--- a/.gitignore
+++ b/.gitignore
@@ -1,16 +1,21 @@
-relay-search-data.csv
-in/
-status/
-lib/
+.classpath
+.project
 classes/
-out/
-onionoo.war
-etc/web.xml
 etc/context.xml
-GeoIP.dat
-GeoIPASNum.dat
-GeoLiteCity.dat
+etc/web.xml
+geoip/Automatic-GeoLiteCity-Blocks.csv
+geoip/GeoIPASNum2.csv
+geoip/GeoIPASNum2.zip
+geoip/GeoLiteCity-Blocks.csv
+geoip/GeoLiteCity-Location.csv
+geoip/GeoLiteCity-latest.zip
+geoip/Manual-GeoLiteCity-Blocks.csv
+geoip/iso3166.csv
+geoip/region.csv
+in/
+lib/
 log
-.classpath
-.project
+onionoo.war
+out/
+status/
 
diff --git a/INSTALL b/INSTALL
index 0e6269d..b3d5d0a 100644
--- a/INSTALL
+++ b/INSTALL
@@ -1,9 +1,14 @@
 Clone the Onionoo server repository
 -----------------------------------
 
-Clone the Onionoo server repository into /srv/onionoo/.
+Create working directory /srv/onionoo/, make it writable for the metrics
+user, and clone the Onionoo server repository into it.  Commands prefixed
+with # are meant to be run by root, commands with $ by user metrics:
 
-$ git clone git://github.com/kloesing/Onionoo /srv/onionoo/
+# mkdir /srv/onionoo
+# chown metrics:metrics /srv/onionoo
+$ git clone https://git.torproject.org/onionoo.git /srv/onionoo/
+$ cd /srv/onionoo
 
 
 Install Java 1.5 or higher, ant 1.8 or higher, and Tomcat 6
@@ -20,13 +25,13 @@ Provide required .jar files
 ---------------------------
 
 Download or build the following .jar files and put them in the lib/
-directory using the given filename (or update build.xml if filenames are
-different):
+directory:
 
-- Apache Commons Codec 1.4, lib/commons-codec-1.4.jar
-- Servlet API, e.g., from Tomcat 6, lib/servlet-api.jar
-- Maxmind GeoIP Java API, lib/maxmindgeoip.jar
-- Tor Metrics Descriptor Library, lib/descriptor.jar
+- Apache Commons Codec 1.4
+- Apache Commons Compress 1.4.1
+- Apache Commons Lang 2.6
+- Servlet API, e.g., from Tomcat 6
+- Tor Metrics Descriptor Library, metrics-lib
 
 Attempt to compile the Java sources to make sure that everything works
 correctly:
@@ -37,14 +42,50 @@ $ ant compile
 Download GeoIP and ASN database files
 -------------------------------------
 
-Download the GeoLite City database from Maxmind and put it in
-/srv/onionoo/GeoLiteCity.dat.  If no such file is found, relay IP
-addresses will not be resolved to country codes, latitudes, and
-longitudes.
+Onionoo uses an IP-to-city database and an IP-to-ASN database to provide
+additional information about a relay's location.
 
-Also download the GeoLite ASN database from Maxmind and put it in
-/srv/onionoo/GeoIPASNum.dat.  If no such file is found, relay IP
-addresses will not be resolved to AS numbers and names.
+The IP-to-city database to be deployed with Onionoo needs to have its "A1"
+("Anonymous Proxy") entries fixed just like Tor's IP-to-country file.  See
+Tor's src/config/README.geoip for detailed information.
+
+First, change to the geoip/ directory:
+
+$ cd geoip/
+
+Download the most recent MaxMind GeoLite City database and unzip it in the
+current directory, junking paths:
+
+$ wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCity_CSV/GeoLiteCity-latest.zip
+$ unzip -j GeoLiteCity-latest.zip
+
+Run deanonymind.py in the local directory:
+
+$ python deanonymind.py
+
+Review the output to learn about applied automatic/manual changes and
+watch out for any warnings.  Possibly edit geoip-manual to make
+more/fewer/different manual changes and re-run deanonymind.py.  To look at
+automatic and manual changes, run:
+
+$ diff -U1 GeoLiteCity-Blocks.csv Automatic-GeoLiteCity-Blocks.csv
+$ diff -U1 Automatic-GeoLiteCity-Blocks.csv Manual-GeoLiteCity-Blocks.csv
+
+Download MaxMind's country and region codes files to the current
+directory:
+
+$ wget http://dev.maxmind.com/static/csv/codes/iso3166.csv
+$ wget http://dev.maxmind.com/static/csv/codes/maxmind/region.csv
+
+Download the most recent MaxMind ASN database file and unzip it in the
+current directory:
+
+$ wget http://www.maxmind.com/download/geoip/database/asnum/GeoIPASNum2.zip
+$ unzip GeoIPASNum2.zip
+
+Change back to the root working directory:
+
+$ cd ../
 
 
 Test the rsync of descriptors from metrics.torproject.org
@@ -57,10 +98,10 @@ $ rsync -arz metrics.torproject.org::metrics-recent in
 The result should be around 1G of data in the in/ directory, as of January
 2012.
 
-(If you want to pre-populate the bandwidth data with archived data,
-download the tarballs from https://metrics.torproject.org/data.html and
-process them one after the other.  There is no requirement to process data
-in any given order.)
+(If you want to pre-populate bandwidth and weights data with archived
+data, download the tarballs from https://metrics.torproject.org/data.html
+and process them one after the other.  There is no requirement to process
+data in any given order.)
 
 
 Test the hourly data processing process
diff --git a/geoip/deanonymind.py b/geoip/deanonymind.py
new file mode 100755
index 0000000..9ac3568
--- /dev/null
+++ b/geoip/deanonymind.py
@@ -0,0 +1,175 @@
+#!/usr/bin/env python
+import optparse
+import os
+import sys
+import zipfile
+
+"""
+Take a MaxMind GeoLite City blocks file as input and replace A1 entries
+with the block number of the preceding entry iff the preceding
+(subsequent) entry ends (starts) directly before (after) the A1 entry and
+both preceding and subsequent entries contain the same block number.
+
+Then apply manual changes, either replacing A1 entries that could not be
+replaced automatically or overriding previously made automatic changes.
+"""
+
+def main():
+    options = parse_options()
+    assignments = read_file(options.in_maxmind)
+    assignments = apply_automatic_changes(assignments,
+            options.block_number)
+    write_file(options.out_automatic, assignments)
+    manual_assignments = read_file(options.in_manual, must_exist=False)
+    assignments = apply_manual_changes(assignments, manual_assignments)
+    write_file(options.out_manual, assignments)
+
+def parse_options():
+    parser = optparse.OptionParser()
+    parser.add_option('-i', action='store', dest='in_maxmind',
+            default='GeoLiteCity-Blocks.csv', metavar='FILE',
+            help='use the specified MaxMind GeoLite City blocks .csv '
+                 'file as input [default: %default]')
+    parser.add_option('-b', action='store', dest='block_number',
+            type='int', default=242, metavar='NUM',
+            help='replace entries with this block number [default: '
+                 '%default]')
+    parser.add_option('-g', action='store', dest='in_manual',
+            default='geoip-manual', metavar='FILE',
+            help='use the specified .csv file for manual changes or to '
+                 'override automatic changes [default: %default]')
+    parser.add_option('-a', action='store', dest='out_automatic',
+            default="Automatic-GeoLiteCity-Blocks.csv", metavar='FILE',
+            help='write full input file plus automatic changes to the '
+                 'specified .csv file [default: %default]')
+    parser.add_option('-m', action='store', dest='out_manual',
+            default='Manual-GeoLiteCity-Blocks.csv', metavar='FILE',
+            help='write full input file plus automatic and manual '
+                 'changes to the specified .csv file [default: %default]')
+    (options, args) = parser.parse_args()
+    return options
+
+def read_file(path, must_exist=True):
+    if not os.path.exists(path):
+        if must_exist:
+            print 'File %s does not exist.  Exiting.' % (path, )
+            sys.exit(1)
+        else:
+            return
+    csv_file = open(path)
+    csv_content = csv_file.read()
+    csv_file.close()
+    assignments = []
+    for line in csv_content.split('\n'):
+        stripped_line = line.strip()
+        if len(stripped_line) > 0 and not stripped_line.startswith('#'):
+            assignments.append(stripped_line)
+    return assignments
+
+def apply_automatic_changes(assignments, block_number):
+    print '\nApplying automatic changes...'
+    result_lines = []
+    prev_line = None
+    a1_lines = []
+    block_number_str = '"%d"' % (block_number, )
+    for line in assignments:
+        if block_number_str in line:
+            a1_lines.append(line)
+        else:
+            if len(a1_lines) > 0:
+                new_a1_lines = process_a1_lines(prev_line, a1_lines, line)
+                for new_a1_line in new_a1_lines:
+                    result_lines.append(new_a1_line)
+                a1_lines = []
+            result_lines.append(line)
+            prev_line = line
+    if len(a1_lines) > 0:
+        new_a1_lines = process_a1_lines(prev_line, a1_lines, None)
+        for new_a1_line in new_a1_lines:
+            result_lines.append(new_a1_line)
+    return result_lines
+
+def process_a1_lines(prev_line, a1_lines, next_line):
+    if not prev_line or not next_line:
+        return a1_lines   # Can't merge first or last line in file.
+    if len(a1_lines) > 1:
+        return a1_lines   # Can't merge more than 1 line at once.
+    a1_line = a1_lines[0].strip()
+    prev_entry = parse_line(prev_line)
+    a1_entry = parse_line(a1_line)
+    next_entry = parse_line(next_line)
+    touches_prev_entry = int(prev_entry['end_num']) + 1 == \
+            int(a1_entry['start_num'])
+    touches_next_entry = int(a1_entry['end_num']) + 1 == \
+            int(next_entry['start_num'])
+    same_block_number = prev_entry['block_number'] == \
+            next_entry['block_number']
+    if touches_prev_entry and touches_next_entry and same_block_number:
+        new_line = format_line_with_other_country(a1_entry, prev_entry)
+        print '-%s\n+%s' % (a1_line, new_line, )
+        return [new_line]
+    else:
+        return a1_lines
+
+def parse_line(line):
+    if not line:
+        return None
+    keys = ['start_num', 'end_num', 'block_number']
+    stripped_line = line.replace('"', '').strip()
+    parts = stripped_line.split(',')
+    entry = dict((k, v) for k, v in zip(keys, parts))
+    return entry
+
+def format_line_with_other_country(original_entry, other_entry):
+    return '"%s","%s","%s"' % (original_entry['start_num'],
+            original_entry['end_num'], other_entry['block_number'], )
+
+def apply_manual_changes(assignments, manual_assignments):
+    if not manual_assignments:
+        return assignments
+    print '\nApplying manual changes...'
+    manual_dict = {}
+    for line in manual_assignments:
+        start_num = parse_line(line)['start_num']
+        if start_num in manual_dict:
+            print ('Warning: duplicate start number in manual '
+                   'assignments:\n  %s\n  %s\nDiscarding first entry.' %
+                   (manual_dict[start_num], line, ))
+        manual_dict[start_num] = line
+    result = []
+    for line in assignments:
+        entry = parse_line(line)
+        start_num = entry['start_num']
+        if start_num in manual_dict:
+            manual_line = manual_dict[start_num]
+            manual_entry = parse_line(manual_line)
+            if entry['end_num'] == manual_entry['end_num']:
+                if len(manual_entry['block_number']) == 0:
+                    print '-%s' % (line, )  # only remove, don't replace
+                else:
+                    new_line = format_line_with_other_country(entry,
+                            manual_entry)
+                    print '-%s\n+%s' % (line, new_line, )
+                    result.append(new_line)
+                del manual_dict[start_num]
+            else:
+                print ('Warning: only partial match between '
+                       'original/automatically replaced assignment and '
+                       'manual assignment:\n  %s\n  %s\nNot applying '
+                       'manual change.' % (line, manual_line, ))
+                result.append(line)
+        else:
+            result.append(line)
+    if len(manual_dict) > 0:
+        print ('Warning: could not apply all manual assignments:  %s' %
+                ('\n  '.join(manual_dict.values()), ))
+    return result
+
+def write_file(path, assignments):
+    out_file = open(path, 'w')
+    out_file.write('\n'.join(assignments))
+    out_file.close()
+
+if __name__ == '__main__':
+    main()
+
diff --git a/geoip/geoip-manual b/geoip/geoip-manual
new file mode 100644
index 0000000..6188957
--- /dev/null
+++ b/geoip/geoip-manual
@@ -0,0 +1,354 @@
+# This file contains manual overrides of A1 entries (and possibly others)
+# in MaxMind's GeoLite City database.  Use deanonymind.py in the same
+# directory to process this file when producing a new geoip file.  See
+# INSTALL for details.
+
+# From geoip-manual (country):
+# Remove MaxMind entry 0.116.0.0-0.119.255.255 which MaxMind says is AT,
+# but which is part of reserved range 0.0.0.0/8.  -KL 2012-06-13
+"7602176","7864319",""
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"135013632","135013887","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"520493568","520494079","77"
+
+# From geoip-manual (country):
+# NL, because previous MaxMind entry 31.171.128.0-31.171.133.255 is NL,
+# and RIR delegation files say 31.171.128.0-31.171.135.255 is NL.
+# -KL 2012-11-27
+"531334656","531335167","161"
+
+# From geoip-manual (country):
+# EU, because next MaxMind entry 37.139.64.1-37.139.64.9 is EU, because
+# RIR delegation files say 37.139.64.0-37.139.71.255 is EU, and because it
+# just makes more sense for the next entry to start at .0 and not .1.
+# -KL 2012-11-27
+"629882880","629882880","3"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"644048128","644048383","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"644121856","644122111","223"
+
+# From geoip-manual (country):
+# CH, because previous MaxMind entry 46.19.141.0-46.19.142.255 is CH, and
+# RIR delegation files say 46.19.136.0-46.19.143.255 is CH.
+# -KL 2012-11-27
+"773033728","773033983","44"
+
+# From geoip-manual (country):
+# GB, because next MaxMind entry 46.166.129.0-46.166.134.255 is GB, and
+# RIR delegation files say 46.166.128.0-46.166.191.255 is GB.
+# -KL 2012-11-27
+"782663680","782663935","77"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"786817152","786817215","195"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"846537728","846537983","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"846542848","846543103","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1077383168","1077384191","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1077840384","1077840639","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1083264384","1083264447","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1083264464","1083264511","223"
+
+# From geoip-manual (country):
+# US, though could as well be CA.  Previous MaxMind entry
+# 64.237.32.52-64.237.34.127 is US, next MaxMind entry
+# 64.237.34.144-64.237.34.151 is CA, and RIR delegation files say the
+# entire block 64.237.32.0-64.237.63.255 is US.  -KL 2012-11-27
+"1089282688","1089282703","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1093730816","1093731071","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1095314944","1095314944","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1109848832","1109849087","39"
+
+# From geoip-manual (country):
+# US, though could as well be UY.  Previous MaxMind entry
+# 67.15.170.0-67.15.182.255 is US, next MaxMind entry
+# 67.15.183.128-67.15.183.159 is UY, and RIR delegation files say the
+# entire block 67.15.0.0-67.15.255.255 is US.  -KL 2012-11-27
+"1125103360","1125103487","223"
+
+# From geoip-manual (country):
+# US, because next MaxMind entry 67.43.145.0-67.43.155.255 is US, and RIR
+# delegation files say 67.43.144.0-67.43.159.255 is US.
+# -KL 2012-11-27
+"1126928384","1126928639","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1126931456","1126931711","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1138622208","1138622463","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1145334528","1145335039","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1159676928","1159677183","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1160905216","1160905471","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1170375168","1170375679","223"
+
+# From geoip-manual (country):
+# US, because previous MaxMind entry 70.159.21.51-70.232.244.255 is US,
+# because next MaxMind entry 70.232.245.58-70.232.245.59 is A2 ("Satellite
+# Provider") which is a country information about as useless as A1, and
+# because RIR delegation files say 70.224.0.0-70.239.255.255 is US.
+# -KL 2012-11-27
+"1189672192","1189672249","223"
+
+# From geoip-manual (country):
+# US, because next MaxMind entry 70.232.246.0-70.240.141.255 is US,
+# because previous MaxMind entry 70.232.245.58-70.232.245.59 is A2
+# ("Satellite Provider") which is a country information about as useless
+# as A1, and because RIR delegation files say 70.224.0.0-70.239.255.255 is
+# US.  -KL 2012-11-27
+"1189672252","1189672447","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1249050624","1249051135","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1249051904","1249052671","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1249091584","1249092607","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1286389760","1286390271","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1286390528","1286390783","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1286391296","1286391807","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1286393856","1286394623","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1286395392","1286396159","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1286398976","1286399487","223"
+
+# From geoip-manual (country):
+# GB, despite neither previous (GE) nor next (LV) MaxMind entry being GB,
+# but because RIR delegation files agree with both previous and next
+# MaxMind entry and say GB for 91.228.0.0-91.228.3.255.  -KL 2012-11-27
+"1541668864","1541669887","77"
+
+# From geoip-manual (country):
+# GB, because next MaxMind entry 91.232.125.0-91.232.125.255 is GB, and
+# RIR delegation files say 91.232.124.0-91.232.125.255 is GB.
+# -KL 2012-11-27
+"1541962752","1541963007","77"
+
+# From geoip-manual (country):
+# GB, despite neither previous (RU) nor next (PL) MaxMind entry being GB,
+# but because RIR delegation files agree with both previous and next
+# MaxMind entry and say GB for 91.238.214.0-91.238.215.255.
+# -KL 2012-11-27
+"1542379008","1542379519","77"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1632587008","1632587263","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1673576896","1673576959","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1795558656","1795558911","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"1933909760","1933910015","17"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"2360215808","2360216063","223"
+
+# From geoip-manual (country):
+# US, because next MaxMind entry 173.0.16.0-173.0.65.255 is US, and RIR
+# delegation files say 173.0.0.0-173.0.15.255 is US.  -KL 2012-11-27
+"2902458368","2902462463","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"2918536448","2918536703","223"
+
+# From geoip-manual (country):
+# US, because next MaxMind entry 176.67.84.0-176.67.84.79 is US, and RIR
+# delegation files say 176.67.80.0-176.67.87.255 is US.  -KL 2012-11-27
+"2957201408","2957202431","223"
+
+# From geoip-manual (country):
+# US, because previous MaxMind entry 176.67.84.192-176.67.85.255 is US,
+# and RIR delegation files say 176.67.80.0-176.67.87.255 is US.
+# -KL 2012-11-27
+"2957202944","2957203455","223"
+
+# From geoip-manual (country):
+# EU, despite neither previous (RU) nor next (UA) MaxMind entry being EU,
+# but because RIR delegation files agree with both previous and next
+# MaxMind entry and say EU for 193.200.150.0-193.200.150.255.
+# -KL 2012-11-27
+"3251148288","3251148543","3"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3341849376","3341853471","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3341873152","3341875199","223"
+
+# From geoip-manual (country):
+# US, because previous MaxMind entry 199.96.68.0-199.96.87.127 is US, and
+# RIR delegation files say 199.96.80.0-199.96.87.255 is US.
+# -KL 2012-11-27
+"3344979840","3344979967","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3346193920","3346194431","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3355430912","3355432959","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3450078464","3450079487","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3483239424","3483239679","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3483240704","3483240959","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3483247360","3483247871","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3485724672","3485728767","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3500664576","3500664831","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3500666752","3500666879","223"
+
+# From geoip-manual (country):
+# US, because previous MaxMind entry 209.58.176.144-209.59.31.255 is US,
+# and RIR delegation files say 209.59.32.0-209.59.63.255 is US.
+# -KL 2012-11-27
+"3510312960","3510321151","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3519352832","3519352959","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3519354048","3519354111","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3519355392","3519355519","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3520644608","3520644863","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3520656384","3520656639","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3632994048","3632994303","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3633782528","3633782783","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3633823488","3633823743","223"
+
+# Previous and next entry are same country, set to country number without
+# city information.  -KL 2013-02-10
+"3634982400","3634982655","223"
+
+# From geoip-manual (country):
+# FR, because previous MaxMind entry 217.15.166.0-217.15.166.255 is FR,
+# and RIR delegation files contain a block 217.15.160.0-217.15.175.255
+# which, however, is EU, not FR.  But merging with next MaxMind entry
+# 217.15.176.0-217.15.191.255 which is KZ and which fully matches what
+# the RIR delegation files say seems unlikely to be correct.
+# -KL 2012-11-27
+"3641681664","3641683967","75"
+
diff --git a/src/org/torproject/onionoo/CurrentNodes.java b/src/org/torproject/onionoo/CurrentNodes.java
index 9e5d0db..487cf4d 100644
--- a/src/org/torproject/onionoo/CurrentNodes.java
+++ b/src/org/torproject/onionoo/CurrentNodes.java
@@ -11,13 +11,17 @@ import java.io.IOException;
 import java.text.ParseException;
 import java.text.SimpleDateFormat;
 import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
 import java.util.Iterator;
-import java.util.Locale;
+import java.util.Map;
+import java.util.Set;
 import java.util.SortedMap;
 import java.util.SortedSet;
 import java.util.TimeZone;
 import java.util.TreeMap;
 import java.util.TreeSet;
+import java.util.regex.Pattern;
 
 import org.torproject.descriptor.BridgeNetworkStatus;
 import org.torproject.descriptor.Descriptor;
@@ -27,10 +31,6 @@ import org.torproject.descriptor.DescriptorSourceFactory;
 import org.torproject.descriptor.NetworkStatusEntry;
 import org.torproject.descriptor.RelayNetworkStatusConsensus;
 
-import com.maxmind.geoip.Location;
-import com.maxmind.geoip.LookupService;
-import com.maxmind.geoip.regionName;
-
 /* Store relays and bridges that have been running in the past seven
  * days. */
 public class CurrentNodes {
@@ -343,53 +343,341 @@ public class CurrentNodes {
     }
   }
 
-  public void lookUpCountries() {
-    File geoLiteCityDatFile = new File("GeoLiteCity.dat");
-    if (!geoLiteCityDatFile.exists()) {
-      System.err.println("No GeoLiteCity.dat file in /.");
+  public void lookUpCitiesAndASes() {
+
+    /* Make sure we have all required .csv files. */
+    File[] geoLiteCityBlocksCsvFiles = new File[] {
+        new File("geoip/Manual-GeoLiteCity-Blocks.csv"),
+        new File("geoip/Automatic-GeoLiteCity-Blocks.csv"),
+        new File("geoip/GeoLiteCity-Blocks.csv")
+    };
+    File geoLiteCityBlocksCsvFile = null;
+    for (File file : geoLiteCityBlocksCsvFiles) {
+      if (file.exists()) {
+        geoLiteCityBlocksCsvFile = file;
+        break;
+      }
+    }
+    if (geoLiteCityBlocksCsvFile == null) {
+      System.err.println("No *GeoLiteCity-Blocks.csv file in geoip/.");
+      return;
+    }
+    File geoLiteCityLocationCsvFile =
+        new File("geoip/GeoLiteCity-Location.csv");
+    if (!geoLiteCityLocationCsvFile.exists()) {
+      System.err.println("No GeoLiteCity-Location.csv file in geoip/.");
+      return;
+    }
+    File iso3166CsvFile = new File("geoip/iso3166.csv");
+    if (!iso3166CsvFile.exists()) {
+      System.err.println("No iso3166.csv file in geoip/.");
+      return;
+    }
+    File regionCsvFile = new File("geoip/region.csv");
+    if (!regionCsvFile.exists()) {
+      System.err.println("No region.csv file in geoip/.");
+      return;
+    }
+    File geoIPASNum2CsvFile = new File("geoip/GeoIPASNum2.csv");
+    if (!geoIPASNum2CsvFile.exists()) {
+      System.err.println("No GeoIPASNum2.csv file in geoip/.");
+      return;
+    }
+
+    /* Obtain a map from relay IP address strings to numbers. */
+    Map<String, Long> addressStringNumbers = new HashMap<String, Long>();
+    Pattern ipv4Pattern = Pattern.compile("^[0-9\\.]{7,15}$");
+    for (Node relay : this.currentRelays.values()) {
+      String addressString = relay.getAddress();
+      long addressNumber = -1L;
+      if (ipv4Pattern.matcher(addressString).matches()) {
+        String[] parts = addressString.split("\\.", 4);
+        if (parts.length == 4) {
+          addressNumber = 0L;
+          for (int i = 0; i < 4; i++) {
+            addressNumber *= 256L;
+            int octetValue = -1;
+            try {
+              octetValue = Integer.parseInt(parts[i]);
+            } catch (NumberFormatException e) {
+            }
+            if (octetValue < 0 || octetValue > 255) {
+              addressNumber = -1L;
+              break;
+            }
+            addressNumber += octetValue;
+          }
+        }
+      }
+      if (addressNumber >= 0L) {
+        addressStringNumbers.put(addressString, addressNumber);
+      }
+    }
+    if (addressStringNumbers.isEmpty()) {
+      System.err.println("No relay IP addresses to resolve to cities or "
+          + "ASN.");
       return;
     }
+
+    /* Obtain a map from IP address numbers to blocks. */
+    Map<Long, Long> addressNumberBlocks = new HashMap<Long, Long>();
     try {
-      LookupService ls = new LookupService(geoLiteCityDatFile,
-          LookupService.GEOIP_MEMORY_CACHE);
-      for (Node relay : currentRelays.values()) {
-        Location location = ls.getLocation(relay.getAddress());
-        if (location != null) {
-          relay.setLatitude(String.format(Locale.US, "%.6f",
-              location.latitude));
-          relay.setLongitude(String.format(Locale.US, "%.6f",
-              location.longitude));
-          relay.setCountryCode(location.countryCode.toLowerCase());
-          relay.setCountryName(location.countryName);
-          relay.setRegionName(regionName.regionNameByCode(
-              location.countryCode, location.region));
-          relay.setCityName(location.city);
+      SortedSet<Long> sortedAddressNumbers = new TreeSet<Long>(
+          addressStringNumbers.values());
+      long firstAddressNumber = sortedAddressNumbers.first();
+      BufferedReader br = new BufferedReader(new FileReader(
+          geoLiteCityBlocksCsvFile));
+      String line;
+      long previousStartIpNum = -1L;
+      while ((line = br.readLine()) != null) {
+        if (!line.startsWith("\"")) {
+          continue;
+        }
+        String[] parts = line.replaceAll("\"", "").split(",", 3);
+        if (parts.length != 3) {
+          System.err.println("Illegal line '" + line + "' in "
+              + geoLiteCityBlocksCsvFile.getAbsolutePath() + ".");
+          br.close();
+          return;
+        }
+        try {
+          long startIpNum = Long.parseLong(parts[0]);
+          if (startIpNum <= previousStartIpNum) {
+            System.err.println("Line '" + line + "' not sorted in "
+                + geoLiteCityBlocksCsvFile.getAbsolutePath() + ".");
+            br.close();
+            return;
+          }
+          previousStartIpNum = startIpNum;
+          while (firstAddressNumber < startIpNum &&
+              firstAddressNumber != -1L) {
+            sortedAddressNumbers.remove(firstAddressNumber);
+            if (sortedAddressNumbers.isEmpty()) {
+              firstAddressNumber = -1L;
+            } else {
+              firstAddressNumber = sortedAddressNumbers.first();
+            }
+          }
+          long endIpNum = Long.parseLong(parts[1]);
+          while (firstAddressNumber <= endIpNum &&
+              firstAddressNumber != -1L) {
+            long blockNumber = Long.parseLong(parts[2]);
+            addressNumberBlocks.put(firstAddressNumber, blockNumber);
+            sortedAddressNumbers.remove(firstAddressNumber);
+            if (sortedAddressNumbers.isEmpty()) {
+              firstAddressNumber = -1L;
+            } else {
+              firstAddressNumber = sortedAddressNumbers.first();
+            }
+          }
+          if (firstAddressNumber == -1L) {
+            break;
+          }
+        }
+        catch (NumberFormatException e) {
+          System.err.println("Number format exception while parsing line "
+              + "'" + line + "' in "
+              + geoLiteCityBlocksCsvFile.getAbsolutePath() + ".");
+          br.close();
+          return;
         }
       }
-      ls.close();
+      br.close();
     } catch (IOException e) {
-      System.err.println("Could not look up countries for relays.");
+      System.err.println("I/O exception while reading "
+          + geoLiteCityBlocksCsvFile.getAbsolutePath() + ".");
+      return;
+    }
+
+    /* Obtain a map from relevant blocks to location lines. */
+    Map<Long, String> blockLocations = new HashMap<Long, String>();
+    try {
+      Set<Long> blockNumbers = new HashSet<Long>(
+          addressNumberBlocks.values());
+      BufferedReader br = new BufferedReader(new FileReader(
+          geoLiteCityLocationCsvFile));
+      String line;
+      while ((line = br.readLine()) != null) {
+        if (line.startsWith("C") || line.startsWith("l")) {
+          continue;
+        }
+        String[] parts = line.replaceAll("\"", "").split(",", 9);
+        if (parts.length != 9) {
+          System.err.println("Illegal line '" + line + "' in "
+              + geoLiteCityLocationCsvFile.getAbsolutePath() + ".");
+          br.close();
+          return;
+        }
+        try {
+          long locId = Long.parseLong(parts[0]);
+          if (blockNumbers.contains(locId)) {
+            blockLocations.put(locId, line);
+          }
+        }
+        catch (NumberFormatException e) {
+          System.err.println("Number format exception while parsing line "
+              + "'" + line + "' in "
+              + geoLiteCityLocationCsvFile.getAbsolutePath() + ".");
+          br.close();
+          return;
+        }
+      }
+      br.close();
+    } catch (IOException e) {
+      System.err.println("I/O exception while reading "
+          + geoLiteCityLocationCsvFile.getAbsolutePath() + ".");
+      return;
     }
-  }
 
-  public void lookUpASes() {
-    File geoIPASNumDatFile = new File("GeoIPASNum.dat");
-    if (!geoIPASNumDatFile.exists()) {
-      System.err.println("No GeoIPASNum.dat file in /.");
+    /* Read country names to memory. */
+    Map<String, String> countryNames = new HashMap<String, String>();
+    try {
+      BufferedReader br = new BufferedReader(new FileReader(
+          iso3166CsvFile));
+      String line;
+      while ((line = br.readLine()) != null) {
+        String[] parts = line.replaceAll("\"", "").split(",", 2);
+        if (parts.length != 2) {
+          System.err.println("Illegal line '" + line + "' in "
+              + iso3166CsvFile.getAbsolutePath() + ".");
+          br.close();
+          return;
+        }
+        countryNames.put(parts[0].toLowerCase(), parts[1]);
+      }
+      br.close();
+    } catch (IOException e) {
+      System.err.println("I/O exception while reading "
+          + iso3166CsvFile.getAbsolutePath() + ".");
       return;
     }
+
+    /* Read region names to memory. */
+    Map<String, String> regionNames = new HashMap<String, String>();
+    try {
+      BufferedReader br = new BufferedReader(new FileReader(
+          regionCsvFile));
+      String line;
+      while ((line = br.readLine()) != null) {
+        String[] parts = line.replaceAll("\"", "").split(",", 3);
+        if (parts.length != 3) {
+          System.err.println("Illegal line '" + line + "' in "
+              + regionCsvFile.getAbsolutePath() + ".");
+          br.close();
+          return;
+        }
+        regionNames.put(parts[0].toLowerCase() + ","
+            + parts[1].toLowerCase(), parts[2]);
+      }
+      br.close();
+    } catch (IOException e) {
+      System.err.println("I/O exception while reading "
+          + regionCsvFile.getAbsolutePath() + ".");
+      return;
+    }
+
+    /* Obtain a map from IP address numbers to ASN. */
+    Map<Long, String> addressNumberASN = new HashMap<Long, String>();
     try {
-      LookupService ls = new LookupService(geoIPASNumDatFile);
-      for (Node relay : currentRelays.values()) {
-        String org = ls.getOrg(relay.getAddress());
-        if (org != null && org.indexOf(" ") > 0 && org.startsWith("AS")) {
-          relay.setASNumber(org.substring(0, org.indexOf(" ")));
-          relay.setASName(org.substring(org.indexOf(" ") + 1));
+      SortedSet<Long> sortedAddressNumbers = new TreeSet<Long>(
+          addressStringNumbers.values());
+      long firstAddressNumber = sortedAddressNumbers.first();
+      BufferedReader br = new BufferedReader(new FileReader(
+          geoIPASNum2CsvFile));
+      String line;
+      long previousStartIpNum = -1L;
+      while ((line = br.readLine()) != null) {
+        String[] parts = line.replaceAll("\"", "").split(",", 3);
+        if (parts.length != 3) {
+          System.err.println("Illegal line '" + line + "' in "
+              + geoIPASNum2CsvFile.getAbsolutePath() + ".");
+          br.close();
+          return;
+        }
+        try {
+          long startIpNum = Long.parseLong(parts[0]);
+          if (startIpNum <= previousStartIpNum) {
+            System.err.println("Line '" + line + "' not sorted in "
+                + geoIPASNum2CsvFile.getAbsolutePath() + ".");
+            br.close();
+            return;
+          }
+          previousStartIpNum = startIpNum;
+          while (firstAddressNumber < startIpNum &&
+              firstAddressNumber != -1L) {
+            sortedAddressNumbers.remove(firstAddressNumber);
+            if (sortedAddressNumbers.isEmpty()) {
+              firstAddressNumber = -1L;
+            } else {
+              firstAddressNumber = sortedAddressNumbers.first();
+            }
+          }
+          long endIpNum = Long.parseLong(parts[1]);
+          while (firstAddressNumber <= endIpNum &&
+              firstAddressNumber != -1L) {
+            if (parts[2].startsWith("AS") &&
+                parts[2].split(" ", 2).length == 2) {
+              addressNumberASN.put(firstAddressNumber, parts[2]);
+            }
+            sortedAddressNumbers.remove(firstAddressNumber);
+            if (sortedAddressNumbers.isEmpty()) {
+              firstAddressNumber = -1L;
+            } else {
+              firstAddressNumber = sortedAddressNumbers.first();
+            }
+          }
+          if (firstAddressNumber == -1L) {
+            break;
+          }
+        }
+        catch (NumberFormatException e) {
+          System.err.println("Number format exception while parsing line "
+              + "'" + line + "' in "
+              + geoIPASNum2CsvFile.getAbsolutePath() + ".");
+          br.close();
+          return;
         }
       }
-      ls.close();
+      br.close();
     } catch (IOException e) {
-      System.err.println("Could not look up ASes for relays.");
+      System.err.println("I/O exception while reading "
+          + geoIPASNum2CsvFile.getAbsolutePath() + ".");
+      return;
+    }
+
+    /* Finally, set relays' city and ASN information. */
+    for (Node relay : currentRelays.values()) {
+      String addressString = relay.getAddress();
+      if (addressStringNumbers.containsKey(addressString)) {
+        long addressNumber = addressStringNumbers.get(addressString);
+        if (addressNumberBlocks.containsKey(addressNumber)) {
+          long blockNumber = addressNumberBlocks.get(addressNumber);
+          if (blockLocations.containsKey(blockNumber)) {
+            String[] parts = blockLocations.get(blockNumber).
+                replaceAll("\"", "").split(",", -1);
+            String countryCode = parts[1].toLowerCase();
+            relay.setCountryCode(countryCode);
+            if (countryNames.containsKey(countryCode)) {
+              relay.setCountryName(countryNames.get(countryCode));
+            }
+            String regionCode = countryCode + ","
+                + parts[2].toLowerCase();
+            if (regionNames.containsKey(regionCode)) {
+              relay.setRegionName(regionNames.get(regionCode));
+            }
+            if (parts[3].length() > 0) {
+              relay.setCityName(parts[3]);
+            }
+            relay.setLatitude(parts[5]);
+            relay.setLongitude(parts[6]);
+          }
+        }
+        if (addressNumberASN.containsKey(addressNumber)) {
+          String[] parts = addressNumberASN.get(addressNumber).split(" ", 2);
+          relay.setASNumber(parts[0]);
+          relay.setASName(parts[1]);
+        }
+      }
     }
   }
 
diff --git a/src/org/torproject/onionoo/Main.java b/src/org/torproject/onionoo/Main.java
index 41af72c..e3e7c5b 100644
--- a/src/org/torproject/onionoo/Main.java
+++ b/src/org/torproject/onionoo/Main.java
@@ -14,8 +14,7 @@ public class Main {
     cn.readRelaySearchDataFile(new File("out/summary"));
     cn.readRelayNetworkConsensuses();
     cn.setRelayRunningBits();
-    cn.lookUpCountries();
-    cn.lookUpASes();
+    cn.lookUpCitiesAndASes();
     cn.readBridgeNetworkStatuses();
     cn.setBridgeRunningBits();
 
diff --git a/web/index.html b/web/index.html
index 5087a01..4c3491c 100755
--- a/web/index.html
+++ b/web/index.html
@@ -153,17 +153,17 @@ database.</li>
 resolving the relay's first onion-routing IP address.
 Optional field.
 Omitted if the relay IP address could not be found in the GeoIP
-database.</li>
+database, or if the GeoIP database did not contain a country name.</li>
 <li><b>"region_name":</b> Region name as found in a GeoIP database by
 resolving the relay's first onion-routing IP address.
 Optional field.
 Omitted if the relay IP address could not be found in the GeoIP
-database.</li>
+database, or if the GeoIP database did not contain a region name.</li>
 <li><b>"city_name":</b> City name as found in a
 GeoIP database by resolving the relay's first onion-routing IP address.
 Optional field.
 Omitted if the relay IP address could not be found in the GeoIP
-database.</li>
+database, or if the GeoIP database did not contain a city name.</li>
 <li><b>"latitude":</b> Latitude as found in a GeoIP database by resolving
 the relay's first onion-routing IP address.
 Optional field.
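
The new lookUpCitiesAndASes() code in CurrentNodes.java does two things that are easier to follow in isolation: it converts dotted-quad IPv4 strings to numbers, and it walks a sorted set of those numbers alongside the sorted range lines of a .csv file in a single pass. The following is only a minimal sketch of that idea, not the code from this commit. The class name GeoipLookupSketch, the method names ipv4ToLong and resolve, and the in-memory list of {start, end, value} ranges are all assumptions made for illustration; the commit itself reads the ranges line by line from the GeoIP .csv files and also verifies that they are sorted.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/* Hypothetical illustration class; not part of Onionoo. */
public class GeoipLookupSketch {

  /* Convert a dotted-quad IPv4 string to its numeric value, or return
   * -1 if the string is not a valid IPv4 address. */
  static long ipv4ToLong(String address) {
    String[] parts = address.split("\\.", -1);
    if (parts.length != 4) {
      return -1L;
    }
    long result = 0L;
    for (String part : parts) {
      /* Accept one to three decimal digits only, so that parseInt
       * cannot fail and signs like "+1" are rejected. */
      if (!part.matches("[0-9]{1,3}")) {
        return -1L;
      }
      int octet = Integer.parseInt(part);
      if (octet > 255) {
        return -1L;
      }
      result = result * 256L + octet;
    }
    return result;
  }

  /* Resolve address numbers against ranges in one pass.  Each range is
   * {start, end, value}; ranges are assumed (not checked here) to be
   * sorted by start and non-overlapping. */
  static Map<Long, Long> resolve(Collection<Long> addresses,
      List<long[]> sortedRanges) {
    TreeSet<Long> pending = new TreeSet<Long>(addresses);
    Map<Long, Long> resolved = new HashMap<Long, Long>();
    for (long[] range : sortedRanges) {
      /* Addresses below the current range start fall into a gap
       * between ranges and stay unresolved. */
      while (!pending.isEmpty() && pending.first() < range[0]) {
        pending.pollFirst();
      }
      /* Addresses up to the range end belong to this range. */
      while (!pending.isEmpty() && pending.first() <= range[1]) {
        resolved.put(pending.pollFirst(), range[2]);
      }
      if (pending.isEmpty()) {
        break;
      }
    }
    return resolved;
  }

  public static void main(String[] args) {
    /* Made-up example ranges: 1.2.3.0-1.2.3.255 and 10.0.0.0-10.255.255.255. */
    List<long[]> ranges = Arrays.asList(
        new long[] { 16909056L, 16909311L, 42L },
        new long[] { 167772160L, 184549375L, 7L });
    List<Long> addresses = Arrays.asList(ipv4ToLong("1.2.3.4"),
        ipv4ToLong("8.8.8.8"), ipv4ToLong("10.1.1.1"));
    System.out.println(resolve(addresses, ranges));
  }
}
```

Because both the addresses and the ranges are consumed in ascending order, every range is inspected once, and the walk can stop as soon as no addresses are left to resolve. That is presumably why the commit bails out when it finds an unsorted line in the blocks or ASN file.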


