[tor-bugs] #2506 [Tor Relay]: Design and implement a more compact GeoIP file format

Tor Bug Tracker & Wiki torproject-admin at torproject.org
Tue Jun 7 00:16:35 UTC 2011


#2506: Design and implement a more compact GeoIP file format
-------------------------+--------------------------------------------------
 Reporter:  rransom      |          Owner:  endian7000        
     Type:  enhancement  |         Status:  needs_review      
 Priority:  normal       |      Milestone:  Tor: 0.2.3.x-final
Component:  Tor Relay    |        Version:                    
 Keywords:               |         Parent:                    
   Points:               |   Actualpoints:                    
-------------------------+--------------------------------------------------

Comment(by rransom):

 Replying to [comment:11 nickm]:
 > Replying to [comment:9 rransom]:
 > > Replying to [comment:7 nickm]:
 > > > I'm not so sure that having this stuff in separate run-length and cc
 files will actually be needed; endianness issues will keep us from reading
 any portable file into an array-of-country verbatim, I think.
 > >
 > > The country codes are two-character ASCII strings, and are thus
 endianness-independent.  The run lengths are integers, but could be
 encoded in big-endian form everywhere.
 >
 > I thought that the whole point of endian7000's idea was that a lot of
 the savings came from variable-length run-length encoding. In the database
 I'm looking at, there are 4212 distinct run-length encodings. Lots of the
 win comes from encoding the more frequent run-lengths as a single byte and
 the less frequent ones as two bytes.
 >
 > To quantify: 136810 of the runs in my geoip file would have their
 lengths represented as one byte in the var-length encoding, whereas 11586
 would take two bytes.  Using a fixed-width two-byte encoding for run
 lengths would add another 133K to the file size.

 The mapping of run-length codes to run lengths should be stored in a
 separate file, in which the run lengths are fixed-width big-endian
 integers, and each run-length code should be an index into that array.
 The mapping of country identifiers to two-character ISO country codes
 should be stored similarly.  The list of runs should be stored in
 endian7000's variable-length-record format.
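
 A rough sketch of the three-part layout described above, for illustration
 only. The ticket does not pin down the details, so the following are all
 assumptions: run lengths fit in 32 bits, a run-length code below 0x80 takes
 one byte and a larger one takes two bytes with the top bit of the first
 byte set, and each run record ends with a one-byte country identifier.
 None of the function names come from Tor or from endian7000's patch.

```python
import struct


def pack_run_table(run_lengths):
    """Serialize the distinct run lengths as fixed-width big-endian uint32s."""
    return struct.pack(">%dI" % len(run_lengths), *run_lengths)


def unpack_run_table(data):
    """Read the table back; the result is identical on any host byte order."""
    return list(struct.unpack(">%dI" % (len(data) // 4), data))


def pack_cc_table(codes):
    """Country codes are two-character ASCII, so no byte-order issue arises."""
    return "".join(codes).encode("ascii")


def unpack_cc_table(data):
    return [data[i:i + 2].decode("ascii") for i in range(0, len(data), 2)]


def encode_code(code):
    """Hypothetical variable-length code: 1 byte if < 0x80, else 2 bytes."""
    if code < 0x80:
        return bytes([code])
    if code < 0x8000:
        return bytes([0x80 | (code >> 8), code & 0xFF])
    raise ValueError("run-length code does not fit in two bytes")


def encode_runs(runs):
    """runs is a list of (run_length_code, country_id) index pairs."""
    return b"".join(encode_code(rc) + bytes([cc]) for rc, cc in runs)


def decode_runs(data, run_table, cc_table):
    """Expand the run list into (run_length, country_code) pairs."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        i += 1
        if first & 0x80:
            # Two-byte code: low 7 bits of the first byte are the high bits.
            code = ((first & 0x7F) << 8) | data[i]
            i += 1
        else:
            code = first
        out.append((run_table[code], cc_table[data[i]]))
        i += 1
    return out
```

 Under these assumptions, a table of 4212 distinct run lengths at four
 bytes each costs 16848 bytes, while the run list keeps its one-byte codes
 for the frequent lengths. Assigning the smallest codes to the most
 frequent run lengths is what would preserve the savings nickm describes.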

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2506#comment:12>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

