[tor-bugs] #15844 [Onionoo]: Develop database schema to support Onionoo's search parameter efficiently

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu May 7 09:49:08 UTC 2015


#15844: Develop database schema to support Onionoo's search parameter efficiently
-----------------------------+-----------------
     Reporter:  karsten      |      Owner:
         Type:  enhancement  |     Status:  new
     Priority:  normal       |  Milestone:
    Component:  Onionoo      |    Version:
   Resolution:               |   Keywords:
Actual Points:               |  Parent ID:
       Points:               |
-----------------------------+-----------------

Comment (by teor):

 I like leeroy's suggestion of a `substring(foo)` (or, perhaps more
 efficiently, a `position(foo)`) over a `LIKE '%foo%'`. Some database
 engines can recognize their equivalence in the simple case of `%text%`,
 but you don't want to rely on this. And you don't really need people to
 be able to do `%foo%bar%`, do you?

 Allowing people to insert unquoted metacharacters, even into a LIKE
 string, can open you up to DoS issues. How serious that is depends on
 how powerful postgres LIKE patterns are.
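
 For illustration, a sketch against a hypothetical `statuses` table with
 a `fingerprint` column (the names are assumptions, not Onionoo's actual
 schema): `position()` treats the search term as a literal, while input
 destined for `LIKE` has to have its `%` and `_` metacharacters escaped
 first.

     -- position() does a literal substring match; 1-based, 0 if absent.
     SELECT fingerprint
       FROM statuses
      WHERE position('f00ba4' in fingerprint) > 0;

     -- The LIKE equivalent must escape the user's term before building
     -- the pattern (backslash is LIKE's default escape character):
     SELECT fingerprint
       FROM statuses
      WHERE fingerprint LIKE
            '%' || replace(replace(replace('f_o%o', '\', '\\'),
                                   '%', '\%'), '_', '\_') || '%';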

 For comparing the starts of strings, sometimes `left(identity,
 length(search_string)) = search_string` can be faster than `identity
 LIKE search_string || '%'` for similar reasons.
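
 For example (same hypothetical table; postgres spells these functions
 `left()` and `length()`):

     -- Prefix match without building a pattern string:
     SELECT fingerprint
       FROM statuses
      WHERE left(fingerprint, length('f00ba4')) = 'f00ba4';

     -- Equivalent LIKE form; with a text_pattern_ops index on
     -- fingerprint, the planner can turn this into a range scan:
     SELECT fingerprint
       FROM statuses
      WHERE fingerprint LIKE 'f00ba4%';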

 The database may be more efficient if its collation is case-insensitive
 overall, and it will be easier to program if that is the default.
 Individual columns and comparisons can still be case-sensitive (or, even
 more efficiently, compared exactly, which is equivalent for search
 purposes here).
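
 If a case-insensitive collation isn't an option, two common postgres
 workarounds are normalising with `lower()` plus an expression index, or
 the `citext` extension; a sketch, again with assumed names:

     -- Expression index so the lower() comparison stays cheap:
     CREATE INDEX statuses_nickname_lower_idx
         ON statuses (lower(nickname));

     SELECT nickname
       FROM statuses
      WHERE lower(nickname) = lower('MyRelay');

     -- Or make the column itself compare case-insensitively:
     -- CREATE EXTENSION citext;
     -- ALTER TABLE statuses ALTER COLUMN nickname TYPE citext;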

 You might also want to consider storing the base64-encoded string in its
 decoded binary form in a `bit varying` column, and then decoding each
 search term before comparing it using `position()`. The combined
 decoding and comparison might be faster, and the storage and indexing
 will definitely be faster and smaller. This strategy might apply to some
 of the other fingerprints (store in binary form, convert before search).
 It would also remove the need for some of your case-insensitive
 matching, and give you a canonical form for storage and comparison.
 You'd have to encode them for display, or, as a speed-space tradeoff,
 store a lookup table of binary->encoded for each fingerprint in the
 database.
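
 A sketch of the byte-aligned case, assuming fingerprints arrive as 40
 hex characters (20 bytes) and using `bytea` rather than `bit varying`;
 note that a hex search term only decodes to whole bytes when it has an
 even number of digits, and odd-length terms need the `bit varying`
 treatment discussed next:

     -- Store the decoded form once, at insert time:
     CREATE TABLE statuses (
         fingerprint bytea NOT NULL,  -- decode(hex_string, 'hex')
         nickname    text
     );

     -- Decode each search term once, then compare binary to binary;
     -- position() works on bytea as well as on text:
     SELECT nickname
       FROM statuses
      WHERE position(decode('f00ba4', 'hex') in fingerprint) > 0;

     -- Re-encode only for display:
     SELECT upper(encode(fingerprint, 'hex')) FROM statuses;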

 You can even compare partial bytes, which may be a feature, or a bug if
 you're looking at the middle of strings. You might need to restrict it
 to `position() % 4 = 1` for hex and `position() % 6 = 1` for base64
 (postgres `position()` is 1-based and returns 0 on no match, so this
 test also filters out misses), so you only compare whole encoded
 character positions, and not sub-character bit positions. Otherwise
 users might complain that they put certain characters in and don't get
 any results out.
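
 A sketch of that restriction, assuming the fingerprint is stored this
 time in a hypothetical `bit varying` column `fingerprint_bits` (postgres
 accepts an `x`-prefixed hex string as bit-string input):

     -- Match 'f00' (12 bits) only at whole-hex-character boundaries,
     -- i.e. bit positions 1, 5, 9, ...:
     SELECT nickname
       FROM statuses
      WHERE position('xf00'::bit varying in fingerprint_bits) % 4 = 1;

     -- For base64-decoded bits, characters are 6 bits wide, so the
     -- test becomes ... % 6 = 1.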

 By the way, full-text search won't help, because in postgres it matches
 words, not substrings.
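
 For instance (the `simple` dictionary avoids stemming, but tokens are
 still whole words):

     -- false: 'ba4dead' is not a whole token of the document
     SELECT to_tsvector('simple', 'f00ba4deadbeef')
            @@ to_tsquery('simple', 'ba4dead');

     -- true: the full token matches
     SELECT to_tsvector('simple', 'f00ba4deadbeef')
            @@ to_tsquery('simple', 'f00ba4deadbeef');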

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/15844#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

