[tor-bugs] #15844 [Onionoo]: Develop database schema to support Onionoo's search parameter efficiently

Tue Apr 28 12:26:10 UTC 2015

#15844: Develop database schema to support Onionoo's search parameter efficiently
-------------------------+---------------------
 Reporter:  karsten      |          Owner:
     Type:  enhancement  |         Status:  new
 Priority:  normal       |      Milestone:
Component:  Onionoo      |        Version:
 Keywords:               |  Actual Points:
Parent ID:               |         Points:
-------------------------+---------------------
 The current way for handling incoming client requests is to load all
 search-relevant data about relays and bridges into memory and handle
 requests from there.  This has two major downsides: it's difficult to
 extend and it requires us to limit searches to relays and bridges that
 have been running in the past seven days.  We would like to overcome both.

 After some experimenting with database schemas it seems that supporting
 the `search` parameter efficiently will be most difficult.  Here's what it
 does (from https://onionoo.torproject.org/protocol.html):

 ''Return only (1) relays with the parameter value matching (part of a)
 nickname, (possibly $-prefixed) beginning of a hex-encoded fingerprint,
 beginning of a base64-encoded fingerprint without trailing equal signs, or
 beginning of an IP address, (2) bridges with (part of a) nickname or
 (possibly $-prefixed) beginning of a hashed hex-encoded fingerprint, and
 (3) relays and/or bridges matching a given qualified search term. Searches
 by relay IP address include all known addresses used for onion routing and
 for exiting to the Internet. Searches for beginnings of IP addresses are
 performed on textual representations of canonical IP address forms, so
 that searches using CIDR notation or non-canonical forms will return empty
 results. Searches are case-insensitive, except for base64-encoded
 fingerprints. If multiple search terms are given, separated by spaces, the
 intersection of all relays and bridges matching all search terms will be
 returned. Complete hex-encoded fingerprints should always be hashed using
 SHA-1, regardless of searching for a relay or a bridge, in order to not
 accidentally leak non-hashed bridge fingerprints in the URL. [...]''

 Before providing my experimental code (which I'd have to clean up anyway),
 I'd want to keep this discussion as open as possible and only present my
 general ideas how this could be implemented:

  - In the following I'll assume that we're going to use PostgreSQL as
 database.  I already have some experience with it from other Tor-related
 projects, and our sysadmins like it more than other SQL databases.  If
 NoSQL turns out to be superior for this use case based on some actual
 performance evaluations, I'm happy to consider that.

  - We'll have to support three comparison modes for the `search` paramter:
 "starts with", "starts with ignore case", and "contains as substring".

  - PostgreSQL does not support substring searches (`LIKE '%foo%'`) out of
 the box, at least not efficiently, but there's a package called `pg_trgm`
 that can "determine the similarity of text based on trigram matching".
 It's contained in Debian's `postgresql-contrib` package, so it should be
 available to us.

  - Right now, search terms are supported starting at a minimum length of 1
 character.  I could imagine raising that to 3 characters if it has major
 benefits to search efficiency.  Though if it doesn't, let's keep
 supporting searches for 1 or 2 characters.

  - I briefly experimented with a normalized database schema with a
 `servers` table containing one row per relay or bridge, an `addresses`
 table with one or more addresses per server, and a `fingerprints` table
 with original and hashed fingerprint per server.  The performance was not
 very promising, because searches would have to happen in all three tables.
 Happy to try again if somebody has hints what I could have done wrong.

  - I also considered (but did not test) a schema with a single `servers`
 table that encodes all fields that are relevant for the `search` parameter
 in a single string with format `"lower-case-nickname#base64-fingerprint
 |lower-case-hex-fingerprint|lower-case-hashed-hex-fingerprint|lower-case-
 address1|lower-case-address2"`.  For example, Tonga would have the
 combined string
 `"tonga#SgzNLdx5lQg9c/XWZxAMilgx8W0|4a0ccd2ddc7995083d73f5d667100c8a5831f16d|e654ae16b76cf002bd26adaf060f8a9c5d333cc9|82.94.251.203"`,
 and searches for `Tonga` would use the following condition: `WHERE search
 LIKE '%tonga%#' OR search LIKE '%#Tonga%' OR search LIKE '%|tonga%'`.

  - There may be variants of these two schemas that have advantages that I
 didn't think of yet.  Suggestions very welcome.

 If we can find a good database schema for the `search` parameter,
 implementing the other parameters should be relatively easy.

 Here's Tonga's search data for a very first sample:

 {{{
 {
     "t": true,
     "f": "4A0CCD2DDC7995083D73F5D667100C8A5831F16D",
     "n": "Tonga",
     "ad": [
         "82.94.251.203"
     ],
     "cc": "nl",
     "as": "AS3265",
     "fs": "2007-10-27 12:00:00",
     "ls": "2015-04-18 13:00:00",
     "rf": [
         "Authority",
         "Fast",
         "HSDir",
         "Running",
         "Stable",
         "V2Dir",
         "Valid"
     ],
     "cw": 20,
     "r": true,
     "c": "4096/fd3428b4 lucky green <shamrock at cypherpunks.to>"
 }
 }}}

 I also uploaded more
 [https://people.torproject.org/~karsten/volatile/summary.xz sample search
 data] in case that helps the discussion.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/15844>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online