[tor-bugs] #15844 [Onionoo]: Develop database schema to support Onionoo's search parameter efficiently

Fri May 8 23:43:10 UTC 2015

#15844: Develop database schema to support Onionoo's search parameter efficiently
-----------------------------+-----------------
     Reporter:  karsten      |      Owner:
         Type:  enhancement  |     Status:  new
     Priority:  normal       |  Milestone:
    Component:  Onionoo      |    Version:
   Resolution:               |   Keywords:
Actual Points:               |  Parent ID:
       Points:               |
-----------------------------+-----------------

Comment (by leeroy):

 A database is an important transition to remove the restrictions on
 searchable period-of-time. You've mentioned you only get a week with the
 onionoo resources as provisioned. It's reasonable to expect a database to
 scale to the entirety of available history. It's the substring search on
 the entire history that'll get you. Still, it would be also be epic for
 analysis/vis. Your limit would be the storage space used by the database,
 and the i/o of requests. It might be useful to bin the data, similar to
 !CollecTor, into recent, and longer (to be decided) interval. Storage
 isn't so much an issue, and neither is transaction i/o which can scale
 with pgpool.

 About your simplifications:

 1a) If you're only interested in substring search on nickname and contact
 then an alternative to `LIKE '%foo%'` must be considered. Some real
 substring performance tests should be performed. What are the candidates?
 There's the position function of postgreSQL, as teor mentioned, user-
 defined functions, and a comparison with trigram-match indexing? The
 testing would only need to apply to substring since you defined the rest
 of the queries as prefix search capable (best case).

 1b) I can think of ways around the inherent limits imposed by an unbounded
 data set. A simplification of this thinking is:

  * spawn n-helpers

  * give each helper a range of data to work with

  * schedule helpers

  * for each helper apply substring search

  * return results until all data processed

 1c) A thought about introducing describing notation to the search pattern
 itself. A marker of sorts to encode the search type. In the absence of any
 marker perform a most general search. `@foo@`, or something like it, could
 transform into a contact-specific search. This would avoid changes to
 parametrization and keep backwards compatibility.

 2) I do agree it would simplify things to have a case insensitive search
 of base64 fingerprints. However, you've specified this type of search as
 prefix search capable. So it would be best case performance. A query that
 spans multiple tables can use a combination of substring-search and
 prefix-search to maximize performance of the query. It's even conceivable
 to perform the query in separate transactions rather than one.

 3) A general change to a minimum of 3 characters would cause problems in
 nickname searches won't it? A nickname is allowed to be a single
 character. Lets suppose for an moment that the minimum is 1. Well then
 only one character needs to match in order to return a result. Compare
 this to 3 characters where all 3 need to match. I do, however, think
 there's value in the careful consideration of a "single-unit" with respect
 to an entity-type. It influences the returned data for a particular query.
 A single character for nicknames, a single octet for an ip, a hex-value
 for fingerprint, etc. This may produce more relevant results for a given
 pattern.

 I don't think there's any doubt a clean and efficient implementation is
 doable. As long as it's understood that it's unlikely to ever be as fast
 as Java. The Java VM offers runtime optimization, postgreSQL offers a
 transactional data store. Even in a best case: postgreSQL, n-workers,
 compiled user-defined substring search, and Java, there's still an
 overhead from the transaction.

 The Java deployment is a pain to scale, postgreSQL is designed to scale
 independently from the Java powered engine behind the scenes. It's a
 compromise.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/15844#comment:8>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online