[tor-bugs] #2506 [Tor Relay]: Design and implement a more compact GeoIP file format

Tor Bug Tracker & Wiki torproject-admin at torproject.org
Mon Feb 7 16:32:03 UTC 2011


#2506: Design and implement a more compact GeoIP file format
-------------------------+--------------------------------------------------
 Reporter:  rransom      |       Owner:     
     Type:  enhancement  |      Status:  new
 Priority:  normal       |   Milestone:     
Component:  Tor Relay    |     Version:     
 Keywords:               |      Points:     
   Parent:               |  
-------------------------+--------------------------------------------------
 Our current text-based GeoIP file (as of commit
 e9803aa71003079cc00a8b3c80324581758a36be; from the January 2011 MaxMind
 GeoLite Country dataset) is 3460049 bytes long (or 955382 bytes gzipped).
 In MaxMind's binary format, the February 2011 dataset is 1126966 bytes
 long, and gzips to about half that size.  But we can do much better than
 that, and without having to use (or reverse-engineer and clone) their LGPL
 library.

 The January 2011 GeoLite database contains 138658 data lines, each of
 which specifies a sequence of consecutive IPs assigned to a single
 country.  The file contains runs of 4070 distinct lengths, and maps runs
 to 241 distinct countries.  Even doubling the number of runs to account
 for the fact that some IPs are not contained in any run (which we should
 consider as a run assigned to 'no country'), and padding each run to a
 3-byte field, we can store the mapping itself in at most 813 kiB (2 *
 138658 runs * 3 bytes = 831948 bytes), with a run-length table and
 country table totalling under 17 kiB.  We can fit an additional
 random-access index consisting of one 4-byte starting IP for each
 768-byte (256-run) block in just over 4 kiB (1084 blocks) if we want to
 keep the database itself in its packed form, whether in memory or on
 disk.
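
 The estimate and the packed layout described above can be sketched roughly
 as follows.  This is only an illustration under assumptions the ticket
 does not state: each 3-byte run record is taken to be a 16-bit index into
 the run-length table plus an 8-bit index into the country table, the
 run-length table uses 4-byte entries, and the country table uses 2-byte
 codes.  The names estimate_sizes, pack and lookup are hypothetical.

```python
import bisect
import struct

# Figures quoted in the ticket for the January 2011 GeoLite Country dataset.
DATA_LINES = 138658
DISTINCT_LENGTHS = 4070
DISTINCT_COUNTRIES = 241

# Assumed record layout (not specified in the ticket): 16-bit run-length
# table index followed by an 8-bit country table index, 3 bytes in total.
RUN_BYTES = 3
RUNS_PER_BLOCK = 256


def estimate_sizes(data_lines=DATA_LINES):
    """Return (mapping, tables, index) sizes in bytes for the worst case."""
    runs = 2 * data_lines                 # worst case: a gap after every run
    mapping = runs * RUN_BYTES
    # Assumed table entry sizes: 4 bytes per length, 2 bytes per country
    # code, with one extra country slot for 'no country'.
    tables = DISTINCT_LENGTHS * 4 + (DISTINCT_COUNTRIES + 1) * 2
    blocks = -(-runs // RUNS_PER_BLOCK)   # ceiling division
    index = blocks * 4                    # one 4-byte starting IP per block
    return mapping, tables, index


def pack(runs):
    """Pack (length, country) runs that cover the address space from IP 0.

    The empty string stands for 'no country'.
    """
    lengths = sorted({length for length, _ in runs})
    countries = sorted({cc for _, cc in runs})
    length_idx = {v: i for i, v in enumerate(lengths)}
    country_idx = {v: i for i, v in enumerate(countries)}
    body = bytearray()
    index = []
    start = 0
    for n, (length, cc) in enumerate(runs):
        if n % RUNS_PER_BLOCK == 0:
            index.append(start)           # starting IP of this block
        body += struct.pack(">HB", length_idx[length], country_idx[cc])
        start += length
    return lengths, countries, index, bytes(body)


def lookup(db, ip):
    """Find the country for an integer IP without unpacking the database."""
    if not 0 <= ip <= 0xFFFFFFFF:
        raise ValueError("not an IPv4 address")
    lengths, countries, index, body = db
    block = bisect.bisect_right(index, ip) - 1   # last block starting <= ip
    start = index[block]
    offset = block * RUNS_PER_BLOCK * RUN_BYTES
    end = min(offset + RUNS_PER_BLOCK * RUN_BYTES, len(body))
    for pos in range(offset, end, RUN_BYTES):    # scan at most 256 runs
        li, ci = struct.unpack_from(">HB", body, pos)
        start += lengths[li]
        if ip < start:
            return countries[ci]
    return None                                  # beyond the last run
```

 Under these assumptions estimate_sizes() gives 831948 bytes for the
 mapping, 16764 bytes for the two tables and 4336 bytes for the index.  A
 lookup costs one binary search over the index plus a linear scan of at
 most one 768-byte block.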

 813 kiB is probably a wild overestimate for the size of the mapping; I
 haven't checked how many 'fake runs' we would need to add, but I would
 expect there are far fewer unassigned runs than runs assigned to a country
 in the database.  I'm also not relying on any fancy encoding that would
 fit each run in less than 3 bytes.
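
 Checking how many 'fake runs' are needed could be done with something
 like the sketch below.  It assumes each data line of the text file has
 the form low,high,CC with integer addresses and that comment lines start
 with '#'; the function name count_fake_runs is hypothetical, and no
 result for the real dataset is claimed here.

```python
def count_fake_runs(lines, max_ip=0xFFFFFFFF):
    """Return (real_runs, fake_runs) for an iterable of 'low,high,CC' lines.

    A fake run is a maximal stretch of addresses in 0..max_ip that no data
    line covers, i.e. a run that would be assigned to 'no country'.
    """
    ranges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        low, high, _cc = line.split(",")
        ranges.append((int(low), int(high)))
    ranges.sort()

    fake = 0
    next_ip = 0                           # first address not yet covered
    for low, high in ranges:
        if low > next_ip:
            fake += 1                     # gap before this run
        next_ip = max(next_ip, high + 1)
    if next_ip <= max_ip:
        fake += 1                         # gap after the last run
    return len(ranges), fake
```

 For example, count_fake_runs(open("geoip")) would report both counts for
 a local copy of the file.  If the second number turns out to be small,
 the mapping shrinks towards 138658 * 3 = 415974 bytes.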

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2506>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

