[tor-bugs] #15844 [Onionoo]: Develop database schema to support Onionoo's search parameter efficiently

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed May 13 23:03:10 UTC 2015


#15844: Develop database schema to support Onionoo's search parameter efficiently
-----------------------------+-----------------
     Reporter:  karsten      |      Owner:
         Type:  enhancement  |     Status:  new
     Priority:  normal       |  Milestone:
    Component:  Onionoo      |    Version:
   Resolution:               |   Keywords:
Actual Points:               |  Parent ID:
       Points:               |
-----------------------------+-----------------

Comment (by leeroy):

 There are a couple of things to look into if you decide to test the
 possibility of a search engine. Consider a search across nickname,
 fingerprint, IP, and contact.

  1. Onionoo's file-based I/O represents a virtual document view;
 rendering is usually delegated to the client. A search engine makes the
 document view more concrete on the Onionoo server. This can lead to
 duplicate copies of data across multiple files, which may be more
 difficult, and more resource-consuming, to update or modify. Compression
 may also be harder to implement efficiently on a per-document basis than
 on a per-field basis.
  1. The documents must be created/updated in a separate atomic step from
 searching/indexing/reading.
  1. It's a challenge to define a normalization of the searchable
 content. You need a normalization for nicknames, fingerprints, IPs, and
 contact information. In the case of IPs, tokenization leads to the
 problem of recognizing which octet is being searched. In the case of
 contact, you need to be careful not to eliminate stop-words, and also to
 take into account the impact of multiple languages (stemming).
  1. Performing substring search, or even the equivalent prefix search
 within words, requires a dynamically created regex (see the sketch after
 this list).
  1. Full document indexing may induce undesired tradeoffs similar to the
 week-long limit currently seen.
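
 Since the list above leans on that dynamically created regex, here's a
 minimal sketch of the idea (hypothetical class, not Onionoo code): quote
 the user's term so metacharacters in it stay literal, then anchor it at
 a word boundary to get prefix-within-words matching:

    import java.util.regex.Pattern;

    public class PrefixWithinWords {
      // Quote the user-supplied term so regex metacharacters in it stay
      // literal, then require a word boundary in front of it.
      static Pattern compile(String term) {
        return Pattern.compile("\\b" + Pattern.quote(term),
            Pattern.CASE_INSENSITIVE);
      }

      public static void main(String[] args) {
        Pattern p = compile("u");
        System.out.println(p.matcher("Unnamed").find()); // true
        System.out.println(p.matcher("default").find()); // false
      }
    }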

 ----
 > If substring searches are just too hard to implement, let me suggest
 another simplification: how about we sacrifice the `contact` parameter
 together with substring searches in the `nickname` field?
 >

 It occurs to me the `contact` field doesn't even show up in summary
 documents, so that field wouldn't apply to a request for summary data.
 Eliminating substring searches on the server would definitely help with
 those big requests, and I don't think it would hurt terribly for small
 requests either. Maybe if it were sufficiently limited in its search
 range it wouldn't be a concern? At least theoretically a regex search can
 be as efficient as Java--maybe even more so if compiled.

 ----
 > Example: a search for "u" would still return all relays called "Unnamed"
 and "u", but not "default"
 >

 That's a great point. Why would a substring search for a single
 character be useful anyway? If the search pattern is short, it seems
 reasonable to say that only prefix search is available, or that some
 other limit is imposed.
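
 A minimal sketch of such a limit (the three-character threshold is just
 an illustration, not a proposal): short patterns fall back to prefix-only
 matching, longer ones get full substring search:

    public class SearchPolicy {
      // Hypothetical policy: full substring search only for terms of
      // three or more characters; shorter terms match as prefixes only.
      static boolean matches(String field, String term) {
        String f = field.toLowerCase();
        String t = term.toLowerCase();
        return t.length() >= 3 ? f.contains(t) : f.startsWith(t);
      }

      public static void main(String[] args) {
        System.out.println(matches("Unnamed", "u"));   // true, prefix
        System.out.println(matches("default", "u"));   // false
        System.out.println(matches("default", "fau")); // true, substring
      }
    }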

 ----
 > If a request takes 100 ms on average, that's fine by me, but if some
 requests take 60 s, then something's very wrong.
 >

 Is that just the server processing the query, or does that include the
 possibility that a client has a slow connection? Absolutely, if a client
 has to wait 60 s to get any data, something's definitely not working
 properly. Instrumentation should be built in from the get-go to keep
 track of events like this.
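
 Something like this sketch, say (the threshold and logger name are
 invented for illustration): wrap each request handler and log anything
 that runs longer than expected:

    import java.util.logging.Logger;

    public class RequestTimer {
      static final Logger LOG = Logger.getLogger("onionoo.requests");
      static final long SLOW_MS = 1000; // hypothetical "slow" threshold

      static void timed(String name, Runnable request) {
        long start = System.nanoTime();
        try {
          request.run();
        } finally {
          long elapsed = (System.nanoTime() - start) / 1000000L;
          if (elapsed > SLOW_MS) {
            LOG.warning(name + " took " + elapsed + " ms");
          }
        }
      }

      public static void main(String[] args) {
        timed("summary-search", new Runnable() {
          public void run() {
            // run the actual search here
          }
        });
      }
    }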

 ----
 > I also worry about potential lack of scalability, either to thousands of
 concurrent clients or to twice or three times as many entries in the
 database.
 >

 Are there any situations like this with the currently deployed code?
 Besides the data in memory, are there any particular circumstances that
 induce worst-case performance? Does the current worst case have any other
 potential consequences? I'm just concerned that the scalability question
 lacks any consideration of the rest of the system.

 ----
 > I'd prefer a solution that has reasonable performance for most use
 cases, rather than one that is highly optimized for one use case and that
 might perform badly for use cases we didn't think of.
 >

 Agreed. Designing around a few well-known use cases leads to code that
 relies too much on the optimizations. This can make additions more
 difficult, due to code that needs rewriting and a database you're
 reluctant to change. First the database should be clean and efficient,
 and then the rest of the system should be checked for bottlenecks.
 That's Amdahl's law, isn't it?
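
 For reference: if a fraction p of the request time is spent in the part
 being sped up by a factor s, the overall speedup is bounded by
 $S = 1 / ((1 - p) + p/s)$. So if the database is, say, half the request
 time, even an infinitely fast database only doubles overall throughput.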

 ----
 > Of course it's no hard requirement that the database performs better
 than the current in-memory search.
 >

 I think it's perfectly reasonable to expect the database transition to
 result in a system that performs comparably to the currently deployed
 code. That might be a good starting point. A data set representative of
 the current in-memory search, only using a database, would make for a
 good comparison.
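
 A rough harness for that comparison might look like this (the connection
 string, table, and column names are placeholders for whatever schema
 comes out of this ticket):

    import java.sql.*;

    public class SearchBenchmark {
      public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost/onionoo", "onionoo", "");
             PreparedStatement st = db.prepareStatement(
                "SELECT fingerprint FROM statuses"
                + " WHERE nickname LIKE ?")) {
          st.setString(1, "moria%"); // prefix search on nickname
          long start = System.nanoTime();
          int rows = 0;
          try (ResultSet rs = st.executeQuery()) {
            while (rs.next()) {
              rows++;
            }
          }
          long ms = (System.nanoTime() - start) / 1000000L;
          System.out.println(rows + " rows in " + ms + " ms");
        }
      }
    }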

 ----
 > How do we proceed?
 >

 Some thoughts about that:

  * Maybe by coming up with a better normal form, one which starts with
 the details view of the data. Summary data might be better represented
 as a derivation. I noticed a couple of problems with the one I posted
 above: it includes a lot of information not seen in a summary document,
 the ASN and country fields might make more sense as part of the IP
 table, and it doesn't consider IP changes over time or other geolocation
 information. (A rough table sketch follows this list.)

  * History objects and aggregate data need a database representation
 too. PostgreSQL's NoSQL features and compression might come in handy
 here. These fields are created and read by the client for rendering, but
 aren't really searched as far as I can tell.

  * Determine which data needs to be tracked over time. For example,
 flags and IP addresses associated with the lifetime of a router
 identified by fingerprint. Are there others? What about consensus weight
 and position probabilities?

  * Functions for importing new data, or even importing the initial data
 set

  * Functions to update the last-modified response on receiving an update
 notification from the database
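
 To make the first bullet concrete, here's one rough shape the tables
 could take (names and types are guesses, not a finished design): statuses
 keyed by fingerprint, with addresses and their geolocation tracked over
 time in a separate table:

    import java.sql.*;

    public class CreateSchema {
      public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql://localhost/onionoo", "onionoo", "");
             Statement st = db.createStatement()) {
          st.executeUpdate(
              "CREATE TABLE statuses ("
              + " fingerprint CHAR(40) PRIMARY KEY,"
              + " nickname VARCHAR(19),"    // Tor nicknames: max 19 chars
              + " contact TEXT)");
          st.executeUpdate(
              "CREATE TABLE addresses ("    // ip changes over time
              + " fingerprint CHAR(40) REFERENCES statuses,"
              + " address INET,"
              + " country CHAR(2),"         // geolocation lives with the ip
              + " as_number INTEGER,"
              + " first_seen TIMESTAMP,"
              + " last_seen TIMESTAMP)");
        }
      }
    }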

 I'm doing it again, aren't I? The long-response thing :D

 ----
 > leeroy, did you want to write some database code and run some
 performance measurements?  If so,
 [https://people.torproject.org/%7Ekarsten/volatile/summary.xz here's some
 sample data] (the same that I mentioned above).
 >

 I do have that data. I guess I could get some more from CollecTor or
 Onionoo. Which Java is deployed? 7? Besides that, do you have something in
 particular in mind for code? A wishlist?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/15844#comment:11>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

