[tor-bugs] #2506 [Tor Relay]: Design and implement a more compact GeoIP file format

Tor Bug Tracker & Wiki torproject-admin at torproject.org
Mon Jun 6 23:43:51 UTC 2011


#2506: Design and implement a more compact GeoIP file format
-------------------------+--------------------------------------------------
 Reporter:  rransom      |          Owner:  endian7000        
     Type:  enhancement  |         Status:  needs_review      
 Priority:  normal       |      Milestone:  Tor: 0.2.3.x-final
Component:  Tor Relay    |        Version:                    
 Keywords:               |         Parent:                    
   Points:               |   Actualpoints:                    
-------------------------+--------------------------------------------------

Comment(by rransom):

 Replying to [comment:7 nickm]:
 > I'm not so sure that having this stuff in separate run-length and cc
 files will actually be needed; endianness issues will keep us from reading
 any portable file into an array-of-country verbatim, I think.

 The country codes are two-character ASCII strings, and are thus
 endianness-independent.  The run lengths are integers, but could be
 encoded in big-endian form everywhere.
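
 A minimal sketch of that encoding, assuming 32-bit unsigned run lengths
 and one two-byte country code per run. The function names and the 32-bit
 width are illustrative assumptions, not part of any agreed format:

```python
import struct

def pack_runs(runs):
    """Pack (run_length, country_code) pairs into two byte strings.

    Run lengths are written as big-endian 32-bit unsigned integers, so the
    bytes are identical on every host. Country codes are plain ASCII.
    """
    for _, cc in runs:
        if len(cc) != 2:
            raise ValueError("country code must be two characters: %r" % cc)
    lengths = b"".join(struct.pack(">I", n) for n, _ in runs)
    codes = b"".join(cc.encode("ascii") for _, cc in runs)
    return lengths, codes

def unpack_runs(lengths, codes):
    """Inverse of pack_runs: rebuild the list of (run_length, cc) pairs."""
    count = len(lengths) // 4
    ns = struct.unpack(">%dI" % count, lengths)
    return [(ns[i], codes[2 * i:2 * i + 2].decode("ascii"))
            for i in range(count)]
```

 The ">" prefix forces big-endian byte order regardless of the host, which
 is the property that makes the run-length array portable.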

 > Let's see how tricky the read code is before we decide that
 complexifying the format is worth it in order to make the read code
 simpler.

 I think putting each separate array in a separate file would give us a
 simpler format than putting all of the arrays in a single file.

 > I'm also not clear how best to read this format quickly on the fly:
 unpacking it all into ram seems like a lose if we don't have to; a
 workable index format would be neat (and would be much easier for
 fixed-length or self-synchronizing records).

 We would need an index format even with fixed-length records -- each
 record corresponds to a wildly varying amount of IP address space.  We
 look up the country associated with an IP address, so we need to read the
 database in order from some starting point for which we know both the
 starting IP address and the offset into the packed dataset.  I suggest an
 index format consisting of (IP address, offset) pairs as fixed-length
 records; we can look up an IP address by performing a binary search
 through the index, then searching linearly through the runs in the piece
 of the packed dataset starting at the specified offset.  Specifying the
 offset has the additional advantage that (if we know how to find the end
 of the index array) we can later put the packed dataset in the same file
 as (and following) the index.
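
 The lookup described above could be sketched roughly as follows. The
 record layouts (a 4-byte run length plus a 2-byte country code per run,
 and a 4-byte starting IP plus a 4-byte byte offset per index entry), the
 `stride` parameter, and the function names are all assumptions made for
 illustration. The sketch also assumes that the first run starts at
 address 0:

```python
import struct

RUN = struct.Struct(">I2s")    # run length, two-byte country code
ENTRY = struct.Struct(">II")   # starting IP address, offset into packed data

def build(runs, stride=2):
    """Pack runs (starting at IP 0) and index every stride-th run."""
    data, index, ip = bytearray(), bytearray(), 0
    for i, (length, cc) in enumerate(runs):
        if i % stride == 0:
            index += ENTRY.pack(ip, len(data))
        data += RUN.pack(length, cc.encode("ascii"))
        ip += length
    return bytes(index), bytes(data)

def lookup(index, data, ip):
    """Return the country code for ip, or None if no run covers it."""
    if not index:
        return None
    # Binary search for the last index entry whose starting IP is <= ip.
    lo, hi = 0, len(index) // ENTRY.size
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ENTRY.unpack_from(index, mid * ENTRY.size)[0] <= ip:
            lo = mid
        else:
            hi = mid
    start, offset = ENTRY.unpack_from(index, lo * ENTRY.size)
    # Linear scan through the runs beginning at the indexed offset.
    while offset < len(data):
        length, cc = RUN.unpack_from(data, offset)
        if ip < start + length:
            return cc.decode("ascii")
        start += length
        offset += RUN.size
    return None
```

 Only the index entries touched by the binary search and the runs between
 two index points are decoded, so nothing has to be unpacked into RAM up
 front. A larger `stride` shrinks the index at the cost of a longer linear
 scan.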

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2506#comment:9>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

