[tor-bugs] #15844 [Onionoo]: Develop database schema to support Onionoo's search parameter efficiently

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu May 14 15:10:27 UTC 2015


#15844: Develop database schema to support Onionoo's search parameter efficiently
-----------------------------+-----------------
     Reporter:  karsten      |      Owner:
         Type:  enhancement  |     Status:  new
     Priority:  normal       |  Milestone:
    Component:  Onionoo      |    Version:
   Resolution:               |   Keywords:
Actual Points:               |  Parent ID:
       Points:               |
-----------------------------+-----------------

Comment (by karsten):

 Replying to [comment:11 leeroy]:
 > There are a couple of things to look into if you decide to test the
 possibility of a search engine. Consider a search of nickname,
 fingerprint, IP, and contact. [...]

 It seems like switching to a search engine would be a pretty major change,
 even compared to switching to a database.  I'd say if somebody wants to
 look into this, that's cool, but the database switch seems like the much
 lower-hanging fruit for now.

 > > If substring searches are just too hard to implement, let me suggest
 another simplification: how about we sacrifice the `contact` parameter
 together with substring searches in the `nickname` field?
 >
 > It occurs to me the `contact` field doesn't even show up in summary
 documents. So that field wouldn't apply to a request for summary data?
 Eliminating substring searches on the server ''would'' help with those big
 requests. I don't think it would hurt terribly for small requests though.
 Maybe if it were limited in its search range it wouldn't be a concern? At
 least theoretically it can be as efficient as Java--maybe even ''more'' if
 compiled.

 You can use the `contact` parameter for any document type, even one that
 doesn't contain the `contact` field.  The way this works is that all
 requests are answered using a node index that can handle all possible
 parameters and that returns a list of fingerprints.  In the next step, all
 documents matching these fingerprints are written to the response.  For
 example, right now you can use the `contact` parameter when asking for
 `weights` documents.
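
 To illustrate that two-step lookup in database terms (table and column
 names below are made up for illustration, not part of any agreed schema),
 the same request could translate into something like:

 {{{
 -- Step 1: resolve request parameters to a list of fingerprints,
 -- independent of the requested document type.
 SELECT fingerprint
   FROM node_statuses
  WHERE contact ILIKE '%torproject%';

 -- Step 2: return the requested documents (here: weights) for exactly
 -- those fingerprints.
 SELECT document
   FROM weights_documents
  WHERE fingerprint IN (SELECT fingerprint
                          FROM node_statuses
                         WHERE contact ILIKE '%torproject%');
 }}}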

 > > Example: a search for "u" would still return all relays called
 "Unnamed" and "u", but not "default"
 >
 > That's a great point. Why would a substring search for a single
 character be useful anyway? If the search pattern is short, it should be
 reasonable to say only prefix search is available. That, or there are some
 other limits imposed.

 I'm not too worried about this simplification, because nicknames became
 less important a while ago when the Named flag was discontinued.  They are
 a convenient way to refer to relays, but there is no guarantee that you'll
 get the relay you're hoping for.

 > > If a request takes 100 ms on average, that's fine by me, but if some
 requests take 60 s, then something's very wrong.
 >
 > Would that be just the server processing the query, or does that include
 the possibility that a client has a slow connection? Absolutely, if a
 client has to wait 60 s to get any data, something's definitely not
 working properly. Instrumentation should be built in from the start--to
 keep track of events like this.

 Well, clients having slow connections should be treated separately.  We
 already have performance statistics in the deployed code:

 {{{
 Request statistics (2015-05-13 18:52:48, 3600 s):
   Total processed requests: 285066
   Most frequently requested resource: details (283763), summary (547),
 bandwidth (372)
   Most frequently requested parameter combinations: [lookup, fields]
 (278974), [flag, type, running] (2531), [lookup] (2393)
   Matching relays per request: .500<2, .900<2, .990<2048, .999<16384
   Matching bridges per request: .500<1, .900<1, .990<1, .999<8192
   Written characters per response: .500<256, .900<512, .990<2097152,
 .999<4194304
   Milliseconds to handle request: .500<8, .900<16, .990<32, .999<64
   Milliseconds to build response: .500<4, .900<4, .990<256, .999<512
 }}}

 If we switch request handling from the current in-memory approach to a
 database approach, we should watch out for the second-to-last line there,
 "Milliseconds to handle request".  That's the time from receiving a
 request to returning the list of fingerprints of documents to be returned.
 You read that line like this: 50% of requests were handled in less than 8
 milliseconds, ..., and 99.9% in less than 64 milliseconds.

 Slow clients would show up in the next line, "Milliseconds to build
 response", which includes reading documents from disk as well as writing
 them to the response.  (We could split up that line, but then we'd have to
 buffer complete responses in memory, which might be bad.)

 Happy to add new statistics there, if they help us investigate potential
 bottlenecks.

 > > I also worry about a potential lack of scalability, either to
 thousands of concurrent clients or to two or three times as many entries
 in the database.
 >
 > Are there any situations like this with the currently deployed code?
 Besides the data in memory, are there any circumstances that induce worst-
 case performance? Does the worst case have any other potential
 consequences? I'm just concerned that the scalability question lacks any
 consideration of the rest of the system.

 There are indeed other parts of the system that would be affected by an
 increase in concurrent clients or entries in the database.  As I described
 above, once the database returns a list of fingerprints, the matching
 documents have to be looked up and written to the response.  Right now,
 all documents are read from disk, but we might want to put them into the
 database as well.  And yes, we would want to make sure that those parts of
 the system scale, too.
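
 If documents did move into the database, a minimal sketch (names and types
 are assumptions, not a proposed schema) might be a single table of
 pre-built documents keyed by fingerprint and document type:

 {{{
 -- Hypothetical table holding pre-built response documents instead of
 -- files on disk; one row per relay/bridge and document type.
 CREATE TABLE documents (
   fingerprint   CHARACTER(40) NOT NULL,
   document_type TEXT NOT NULL,      -- 'summary', 'details', 'weights', ...
   document      JSONB NOT NULL,     -- the pre-built document body
   last_modified TIMESTAMP NOT NULL,
   PRIMARY KEY (fingerprint, document_type)
 );
 }}}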

 > > I'd prefer a solution that has reasonable performance for most use
 cases, rather than one that is highly optimized for one use case and that
 might perform badly for use cases we didn't think of.
 >
 > Agreed. Designing around some well-known use cases leads to code which
 relies too much on the optimizations. This can make additions more
 difficult, due to code that needs rewriting, and a database which you're
 reluctant to change. First the database should be clean and efficient, and
 then the rest of the system should be checked for bottlenecks. Amdahl's
 law, isn't it?

 Agreed.

 > > Of course it's no hard requirement that the database performs better
 than the current in-memory search.
 >
 > I think it's perfectly reasonable to expect the database transition
 should result in a system that performs comparably to the currently
 deployed code. That might be a good starting point. A data-set
 representative of the current in-memory search, only using a database,
 would make for a good comparison.

 That would be a fine comparison to start with, but even if the result is
 that the database is slower by a factor of 1.5 or so, its other advantages
 might outweigh that.

 > > How do we proceed?
 >
 > Some thoughts about that:
 >
 >  * Maybe by coming up with a better normal form, one which starts with
 the details view of the data. Summary data might be better represented as
 a derivation. I noticed a couple of problems with the one I posted above.
 It includes a lot of information not seen in a summary document. The ASN
 and country fields might make more sense as part of the IP table. It also
 doesn't consider IP changes over time or other geolocation information.

 Looking at details documents is a fine start.  But let's be careful with
 adding new parameters that would make it harder to make the database fast
 enough.  One problem at a time.
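
 To make that concrete, a rough sketch of the direction you describe could
 look like the following; names and types are placeholders, not a proposed
 final schema, and it deliberately leaves out everything we haven't
 discussed yet:

 {{{
 -- Relays identified by fingerprint, with addresses split into their own
 -- table so that ASN and country live next to the address they describe.
 CREATE TABLE relays (
   fingerprint CHARACTER(40) PRIMARY KEY,
   nickname    TEXT NOT NULL
 );

 CREATE TABLE addresses (
   fingerprint  CHARACTER(40) NOT NULL REFERENCES relays (fingerprint),
   address      INET NOT NULL,
   as_number    INTEGER,
   country_code CHARACTER(2)
 );
 }}}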

 >  * History objects and aggregate data need a database representation
 too. PostgreSQL's NoSQL features and compression might come in handy here.
 These fields are created, then read by the client for rendering, but
 aren't really searched as far as I can tell.

 History documents are not searched, and that's perfectly fine.

 >  * Determine which data needs to be tracked over time. For example,
 flags and IP addresses associated with the lifetime of a relay, identified
 by fingerprint. Are there others? What about consensus weight and position
 probabilities?

 Let's be careful with adding historical data.  There are over 7.5 years of
 data in CollecTor, and adding any substantial part of that to Onionoo
 might not scale very well.  Onionoo started out by giving out details
 about relays and bridges that are currently running.  Adding non-running
 relays and bridges that have been running in the last week was just a
 convenience.  Removing that 1-week limitation would be even more
 convenient, but we need to be very careful about how we do that.  For
 example, it might be possible to search by any IP address assigned to a
 relay in the past.  But allowing users to look up which relays had the
 Stable flag on January 1, 2009 might be a little too much.
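
 For the IP address case specifically, a hedged sketch of what that could
 look like (again, the names are made up): a compact address-history table
 keeps "which relays ever used this address?" a cheap indexed lookup,
 without opening the door to arbitrary point-in-time flag queries.

 {{{
 -- Hypothetical address-history table: one row per (relay, address) with
 -- the period in which the address was observed.
 CREATE TABLE address_history (
   fingerprint CHARACTER(40) NOT NULL,
   address     INET NOT NULL,
   first_seen  TIMESTAMP NOT NULL,
   last_seen   TIMESTAMP NOT NULL
 );
 CREATE INDEX address_history_address ON address_history (address);

 -- Which relays ever used this address?
 SELECT DISTINCT fingerprint
   FROM address_history
  WHERE address = '192.0.2.1';
 }}}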

 >  * Functions for importing new data, or even importing the initial data
 set
 >
 >  * Functions to update last-modified response on receiving an update
 notification from the database

 Yes, this code needs to be written.  I just thought it might be easier to
 start by writing some fine new SQL code for performance experiments
 without touching the Onionoo code base yet.
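
 For those experiments it might be enough to time candidate queries
 directly in psql and compare the numbers against the "Milliseconds to
 handle request" percentiles above; something along these lines, with
 placeholder table and column names:

 {{{
 -- Run in psql against a test database loaded with the sample data.
 \timing on

 -- Example: how expensive is a prefix search on nicknames?
 EXPLAIN (ANALYZE, BUFFERS)
   SELECT fingerprint
     FROM node_statuses
    WHERE nickname LIKE 'moria%';
 }}}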

 > I'm doing it again aren't I? The long response thing :D

 At least this one is easier to respond to. :)

 > > leeroy, did you want to write some database code and run some
 performance measurements?  If so,
 [https://people.torproject.org/%7Ekarsten/volatile/summary.xz here's some
 sample data] (the same that I mentioned above).
 >
 > I do have that data. I guess I could get some more from CollecTor or
 Onionoo. Which Java is deployed? 7? Besides that, do you have something in
 particular in mind for code? A wishlist?

 I'd say ignore CollecTor for the moment and only look at Onionoo.  The
 document I pasted above is the current input to the request-handling code.
 You could also fetch the [https://onionoo.torproject.org/details latest
 details documents] for additional input.  But feel free to mention any
 other data here first, and I might comment on possible scalability issues
 from including those in the search.
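
 If you go the details-documents route, one low-effort way to experiment
 (purely illustrative; the table is an assumption) is to load each relay
 object from the details document into a single JSONB column and pull
 fields out as needed, before committing to a full schema:

 {{{
 -- One JSONB row per relay object from a details document.
 CREATE TABLE details_import (relay JSONB);

 -- Example query once the data is loaded: substring search on contact.
 SELECT relay->>'fingerprint' AS fingerprint,
        relay->>'nickname'    AS nickname
   FROM details_import
  WHERE relay->>'contact' ILIKE '%torproject%';
 }}}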

 The deployed Java version is indeed 7.  I don't have anything particular
 in mind for the code, but I'm happy to comment on patches.  Release early,
 release often.

 Thanks for your contributions here!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/15844#comment:12>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

