[tor-bugs] #2506 [Tor Relay]: Design and implement a more compact GeoIP file format

Mon Jun 6 23:55:31 UTC 2011

#2506: Design and implement a more compact GeoIP file format
-------------------------+--------------------------------------------------
 Reporter:  rransom      |          Owner:  endian7000        
     Type:  enhancement  |         Status:  needs_review      
 Priority:  normal       |      Milestone:  Tor: 0.2.3.x-final
Component:  Tor Relay    |        Version:                    
 Keywords:               |         Parent:                    
   Points:               |   Actualpoints:                    
-------------------------+--------------------------------------------------

Comment(by rransom):

 Replying to [comment:8 nickm]:
 > Other thoughts: elsewhere we have used the country code "??" for an
 unassigned mapping.
 >
 > We don't have a consistent internal uint8->country mapping in Tor: that
 needs to get saved too.
 >
 > Right now the db has 246 country codes; it's be nice to have a good
 answer here for what we would do if we had to go over 255.

 We could use a variable-length integer (possibly encoded as one of a small
 number of escape values followed by a protobuf variable-length integer).

 > One possibility for making this format self-synchronizing is to reserve
 the run-length index 0 and the country code 0 such that neither
 corresponds to a run length or to a country.  There are probably better
 ways too.

 As I explained in my previous comment, we need an index format anyway, so
 making the run-list format self-synchronizing won't help us.

 > Another useful observation: there seem to be less than 32 run lengths
 that ever occur more than once.   Not sure what we can use that for,
 though.

 I suspect that we could get a space win by constraining run lengths to a
 small set such that every run length in the dataset can be expressed as a
 sum of encodable run lengths (possibly the powers of two, although we
 could automatically search for a better set) and Huffman-coding or
 arithmetic-coding the run records; since we have to scan each chunk of
 runs linearly anyway, this wouldn't make our speed ''much'' worse.  I
 think the added code complexity (in both the encoder and decoder) makes
 fancy coding schemes a bad idea, though.

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2506#comment:10>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online