[tor-bugs] #6471 [Metrics Utilities]: Design file format and Python/Java library for multiple GeoIP or AS databases

Tor Bug Tracker & Wiki torproject-admin at torproject.org
Wed Sep 12 23:20:54 UTC 2012


#6471: Design file format and Python/Java library for multiple GeoIP or AS
databases
-------------------------------+--------------------------------------------
 Reporter:  karsten            |          Owner:  karsten     
     Type:  enhancement        |         Status:  needs_review
 Priority:  normal             |      Milestone:              
Component:  Metrics Utilities  |        Version:              
 Keywords:                     |         Parent:              
   Points:                     |   Actualpoints:              
-------------------------------+--------------------------------------------

Comment(by karsten):

 Here we go.  I finished the initial version of a multi-GeoIP database in
 Java.  It's now capable of handling five or more years of monthly
 databases, which is about as much as we need for metrics.  And it's
 incredibly fast and memory-friendly!  Some facts:
  - Looking up 100,000 IP address/date combinations, each in its best
 matching database, takes about 600 milliseconds overall.
  - Combining five years of data into a single database takes 46 seconds,
 but after combining them and storing the result to disk once, loading
 everything back into memory only takes 1.3 seconds.
  - The total memory consumption for five years of GeoIP data is under 150
 MB.
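
 To make the address/date lookup from the first bullet concrete, here is a
 minimal Java sketch.  It is not the code from the repository: the class
 name `MultiGeoipSketch` and the methods `addRange`, `lookup` and
 `parseAddress` are made up for illustration.  It keeps one sorted map per
 monthly database instead of a combined range table, and it assumes that
 the "best matching" database is the latest one dated on or before the
 requested date, falling back to the oldest database for earlier dates.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; not the class from the metrics-tasks repository. */
class MultiGeoipSketch {

  /** One address range of a single database: inclusive end plus country. */
  private static final class Range {
    final long end;
    final String country;

    Range(long end, String country) {
      this.end = end;
      this.country = country;
    }
  }

  /** Database date (yyyymmdd) -> that database's ranges by start address. */
  private final TreeMap<Integer, TreeMap<Long, Range>> databases =
      new TreeMap<>();

  /** Converts a dotted-quad IPv4 address to its unsigned 32-bit value. */
  static long parseAddress(String dottedQuad) {
    String[] parts = dottedQuad.split("\\.");
    if (parts.length != 4) {
      throw new IllegalArgumentException("Not an IPv4 address: " + dottedQuad);
    }
    long result = 0L;
    for (String part : parts) {
      int octet = Integer.parseInt(part);
      if (octet < 0 || octet > 255) {
        throw new IllegalArgumentException("Bad octet in: " + dottedQuad);
      }
      result = (result << 8) | octet;
    }
    return result;
  }

  /** Adds one range to the database with the given date.  Assumes that
   * ranges within a single database do not overlap. */
  void addRange(int databaseDate, String start, String end, String country) {
    databases.computeIfAbsent(databaseDate, d -> new TreeMap<>())
        .put(parseAddress(start), new Range(parseAddress(end), country));
  }

  /** Returns the country code for the address on the given date, or null
   * if the chosen database has no range containing the address. */
  String lookup(String address, int date) {
    // Assumption: latest database on or before the date is the best match.
    Map.Entry<Integer, TreeMap<Long, Range>> database =
        databases.floorEntry(date);
    if (database == null) {
      database = databases.firstEntry();  // Date predates all databases.
    }
    if (database == null) {
      return null;  // No databases imported at all.
    }
    long numeric = parseAddress(address);
    // The only candidate is the range with the largest start <= address.
    Map.Entry<Long, Range> candidate = database.getValue().floorEntry(numeric);
    if (candidate == null || candidate.getValue().end < numeric) {
      return null;
    }
    return candidate.getValue().country;
  }
}
```

 Both steps are logarithmic `TreeMap.floorEntry` calls, one over database
 dates and one over range start addresses.  A combined table that merges
 identical ranges across months, as in the test output, would need less
 memory than this per-database layout.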

 Here's the output of a performance test that imports 60 monthly GeoIP
 databases and performs 100,000 random lookups:

 {{{
 Generating test cases... 52705 millis.
 Importing files... 45855 millis.
 Making test requests... 623 millis, 0 out of 100000 tests failed.
 Database contains 60 databases and 146629 combined address ranges.
 Performed 6042264 address range imports requiring 12886981 lookups.
 Performed 100000 address lookups requiring 114316 lookups.
 First 10 entries, in reverse order, are:
   223.255.255.0 223.255.255.255 au 20110901 20120901 12
   223.255.254.0 223.255.254.255 sg 20110501 20120901 16
   223.255.252.0 223.255.253.255 cn 20110501 20120901 16
   223.255.248.0 223.255.251.255 au 20100801 20120901 25
   223.255.244.0 223.255.247.255 in 20100901 20120901 24
   223.255.240.0 223.255.243.255 hk 20100901 20120901 24
   223.255.236.0 223.255.239.255 cn 20110401 20120901 17
   223.255.232.0 223.255.235.255 au 20100901 20120901 24
   223.255.224.0 223.255.231.255 id 20100901 20120901 24
   223.255.192.0 223.255.223.255 kr 20100901 20120901 24
 Saving combined databases to disk... 849 millis.
 Loading combined databases from disk... 1253 millis.
 Making a second round of test requests... 591 millis, 0 out of 100000
 tests failed.
 }}}

 Maybe the next step is to rewrite this code in Python?  There shouldn't be
 anything too Java-specific that couldn't be implemented in Python, I
 think.  The Java code is available
 [https://gitweb.torproject.org/metrics-tasks.git/tree/HEAD:/task-6471/java here],
 and I uploaded the 5*60 input files from the regional registries
 [https://people.torproject.org/~karsten/volatile/registry-files-2007-10-2012-09.tar.bz2 here]
 and the combined database file
 [https://people.torproject.org/~karsten/volatile/geoip-2007-10-2012-09.csv.bz2 here].

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6471#comment:11>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

