[tor-bugs] #6471 [Metrics Utilities]: Design file format and Python library for multiple GeoIP or AS databases

Fri Aug 3 08:35:29 UTC 2012

#6471: Design file format and Python library for multiple GeoIP or AS databases
-------------------------------+--------------------------------------------
 Reporter:  karsten            |          Owner:     
     Type:  enhancement        |         Status:  new
 Priority:  normal             |      Milestone:     
Component:  Metrics Utilities  |        Version:     
 Keywords:                     |         Parent:     
   Points:                     |   Actualpoints:     
-------------------------------+--------------------------------------------

Comment(by karsten):

 Replying to [comment:1 gsathya]:
 > Possible first step would be to figure out if there are any additional
 > info that we don't need/use in maxminds db.
 > A naive solution then would be to -
 > Step 0) Remove unnecessary data
 > Step 1) Diff the old csv with the new csv
 > Step 2.1) Add a human readable(?) line to the old csv - explaining the
 > date of change, no of lines changed and possibly other details that might
 > become obvious once we actually try to diff
 > Step 2.2) Modify the diff to make more parseable since we know that we
 > are only diff-ing csv's - i bet we can optimize this a bit
 > Step 3) Append the modified diff to the old csv
 > Step 4) Write a library that can parse added human readable line and the
 > modified diff
 >
 > Another solution would be to go all out and write our own spec and a
 > parser that converts every newly generated GeoIP db into something that
 > conforms with our spec. (And write a library to parse such a file)
 >
 > The second approach would be a lot more useful in the long run but a lot
 > more time consuming to write. If we pick either approaches(or an
 > alternative one) I'd be happy to write the python code for it!

 Your first approach above already sounds like a design for a file format,
 and I admit that the second approach requires a lot of work before seeing
 any results.

 Hmm.  How about a third approach: write a library that a) reads unmodified
 database files into memory, maybe together with a mapping file containing
 the dates when these files became valid, and b) resolves IP addresses and
 dates to country codes or ASes.  We wouldn't want to keep the full file
 contents in memory, only the information relevant for looking up an IP
 address and date.  We can still think about a compact file format for
 that later on.
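
 To make the idea more concrete, here is a rough sketch of what such a
 lookup library might look like.  Everything in it is an assumption for
 illustration: the class names RangeDatabase and MultiDatabase, the
 (start, end, code) tuple layout, and the IPv4-only handling are all
 hypothetical, and parsing of the actual database files is left out.

```python
import bisect
import ipaddress
from datetime import date


class RangeDatabase:
    """One database snapshot (hypothetical name): non-overlapping ranges."""

    def __init__(self, valid_from, ranges):
        # ranges: iterable of (start_ip, end_ip, code) with IPv4 strings.
        self.valid_from = valid_from
        rows = sorted(
            (int(ipaddress.IPv4Address(start)),
             int(ipaddress.IPv4Address(end)),
             code)
            for start, end, code in ranges
        )
        # Keep only what a lookup needs: range bounds and the code.
        self.starts = [row[0] for row in rows]
        self.ends = [row[1] for row in rows]
        self.codes = [row[2] for row in rows]

    def lookup(self, address):
        number = int(ipaddress.IPv4Address(address))
        # Find the last range starting at or before the address.
        index = bisect.bisect_right(self.starts, number) - 1
        if index >= 0 and number <= self.ends[index]:
            return self.codes[index]
        return None


class MultiDatabase:
    """Several snapshots, each valid from a given date (hypothetical name)."""

    def __init__(self):
        self.snapshots = []

    def add(self, valid_from, ranges):
        self.snapshots.append(RangeDatabase(valid_from, ranges))
        self.snapshots.sort(key=lambda snapshot: snapshot.valid_from)

    def lookup(self, address, when):
        # Pick the newest snapshot that was already valid on that date.
        dates = [snapshot.valid_from for snapshot in self.snapshots]
        index = bisect.bisect_right(dates, when) - 1
        if index < 0:
            return None
        return self.snapshots[index].lookup(address)
```

 Usage would then be something like db.add(date(2012, 7, 1), ranges)
 for each database file, followed by db.lookup("1.0.0.7", date(2012, 7, 15)).
 The mapping file mentioned above would supply the valid_from dates.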

 This third approach has the disadvantage that initializing the lookup
 library may take a while (tens of seconds, maybe minutes).  But it reduces
 development time a lot at the beginning.  Also, we may learn a lot about
 compact representations of address ranges, dates, country codes, and ASes
 which we can use to design a good file format later on.
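
 As one small example of what a compact representation could mean here:
 adjacent address ranges that map to the same country code or AS can be
 merged into one range before they are kept in memory.  This is only a
 sketch under assumptions; the function name merge_adjacent and the
 (start, end, code) integer tuples are made up for illustration, and
 input ranges are assumed not to overlap.

```python
def merge_adjacent(ranges):
    """Merge neighboring (start, end, code) ranges that share a code.

    start and end are inclusive integer addresses; ranges must not overlap.
    """
    merged = []
    for start, end, code in sorted(ranges):
        previous = merged[-1] if merged else None
        # Extend the previous range if it ends right before this one
        # and resolves to the same code.
        if previous and previous[2] == code and previous[1] + 1 == start:
            merged[-1] = (previous[0], end, code)
        else:
            merged.append((start, end, code))
    return merged
```

 Whether this saves much depends on how fragmented the database files
 are, which is one of the things we would learn from the third approach.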

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6471#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online