[tor-bugs] #17939 [Onionoo]: Optimise the construction of details documents with field constraints

Tor Bug Tracker & Wiki blackhole at torproject.org
Sat Dec 26 02:52:43 UTC 2015


#17939: Optimise the construction of details documents with field constraints
-----------------------------+-----------------
     Reporter:  fmap         |      Owner:
         Type:  enhancement  |     Status:  new
     Priority:  Low          |  Milestone:
    Component:  Onionoo      |    Version:
     Severity:  Minor        |   Keywords:
Actual Points:               |  Parent ID:
       Points:               |    Sponsor:
-----------------------------+-----------------
 In a [https://lists.torproject.org/pipermail/metrics-team/2015-December/000026.html recent post to metrics-team@],
 Karsten pointed toward an expensive operation within the response builder:

 > Once per hour, the updater fetches new data and in the end produces
 > JSON-formatted strings that it writes to disk. The servlet reads a
 > (comparatively) small index to memory that it uses to handle requests,
 > and when it builds responses, it tries hard to avoid (de-)serializing
 > JSON.
 >
 > The only situation where this fails is when [a] request [to the /details
 > endpoint] contains the fields parameter. Only in that case we'll have to
 > deserialize, pick the fields we want, and serialize again. I could
 > imagine that this shows up in profiles pretty badly, and I'd love to fix
 > this, I just don't know how.

 I think we can exploit a few properties of the updater to handle this case
 in a more efficient manner.

 It seems safe to assume: (1) that the produced response is always the
 concatenation of a sequence of substrings of the written document
 ^[#fn1 1]^; (2) that the documents on disk are legal JSON and correctly
 typed (having been written by the updater, which we trust and control);
 and (3) that the documents are trivially parsed (belonging to a
 restriction of JSON with known and non-redundant keys, the grammar is at
 most context-free).

 I believe these conditions admit a relatively efficient parser/generator
 pair that avoids request-time de-serialisation. The parser would map a
 document to a sequence of index pairs marking the boundaries of each
 field; given a request, the generator would then reproduce the input,
 except for the text regions corresponding to fields excluded by the
 request.
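
 For concreteness, here's a rough Java sketch of the indexing half (purely
 illustrative; the class and method names are made up, this is not Onionoo
 code), assuming the document is the compact single-object JSON that the
 updater writes:

 {{{
 import java.util.LinkedHashMap;
 import java.util.Map;

 /** Sketch: record the character offsets of each top-level member of a
  *  compact JSON object without building a JSON tree.  Trusts the input
  *  to be well-formed (assumption 2 above). */
 public final class FieldIndexer {

   /** Inclusive [start, end] offsets of one "key":value member. */
   public record Range(int start, int end) {}

   public static Map<String, Range> index(String json) {
     Map<String, Range> ranges = new LinkedHashMap<>();
     int depth = 0;             // current {}/[] nesting depth
     boolean inString = false;  // inside a JSON string literal?
     int memberStart = -1;      // offset of the current member's key quote
     String key = null;         // name of the current member
     for (int i = 0; i < json.length(); i++) {
       char c = json.charAt(i);
       if (inString) {
         if (c == '\\') i++;              // skip the escaped character
         else if (c == '"') inString = false;
         continue;
       }
       switch (c) {
         case '"':
           inString = true;
           if (depth == 1 && memberStart < 0) {
             memberStart = i;             // this quote opens a member key
             key = json.substring(i + 1, closingQuote(json, i));
           }
           break;
         case '{': case '[': depth++; break;
         case '}': case ']':
           depth--;
           if (depth == 0 && key != null) {  // last member ends before '}'
             ranges.put(key, new Range(memberStart, i - 1));
             key = null; memberStart = -1;
           }
           break;
         case ',':
           if (depth == 1 && key != null) {  // member ends before this ','
             ranges.put(key, new Range(memberStart, i - 1));
             key = null; memberStart = -1;
           }
           break;
         default: break;
       }
     }
     return ranges;
   }

   private static int closingQuote(String json, int opening) {
     for (int i = opening + 1; i < json.length(); i++) {
       if (json.charAt(i) == '\\') i++;
       else if (json.charAt(i) == '"') return i;
     }
     throw new IllegalArgumentException("unterminated string");
   }
 }
 }}}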

 No patch yet, but I've hacked together a small (inefficient mess of a..)
 proof of concept that hopefully illustrates the basic idea. Each line of
 its output pairs a field name with three offsets: the start of the field,
 its end, and its end including any trailing separator:

   http://hack.rs/~vi/onionoo/IndexJSON.hs
   sha256: 14a09f26fadab8d989263dc76d368e41e63ba6c5279d37443878d6c1d0c87834
   http://www.webcitation.org/6e3NEOLJg

 {{{
 % jq . 96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE
 {
   "nickname": "Unnamed",
   "hashed_fingerprint": "96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE",
   "or_addresses": [
     "10.103.224.131:443"
   ],
   "last_seen": "2015-11-23 03:40:44",
   "first_seen": "2015-11-20 04:38:22",
   "running": false,
   "flags": [
     "Valid"
   ],
   "last_restarted": "2015-11-22 01:23:06",
   "advertised_bandwidth": 49168,
   "platform": "Tor 0.2.4.22 on Windows 8"
 }
 % index-json 96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE
 ("nickname",(2,21,22))
 ("hashed_fingerprint",(23,85,86))
 ("or_addresses",(87,123,124))
 ("last_seen",(125,157,158))
 ("first_seen",(159,192,193))
 ("running",(194,208,209))
 ("flags",(210,226,227))
 ("last_restarted",(228,265,266))
 ("advertised_bandwidth",(267,294,295))
 ("platform",(296,333,333))
 % cut -c1,23-158,194- 96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE | jq .
 {
   "hashed_fingerprint": "96B16C78BB54BA0F56EEA8721781C9BD01B7E9AE",
   "or_addresses": [
     "10.103.224.131:443"
   ],
   "last_seen": "2015-11-23 03:40:44",
   "running": false,
   "flags": [
     "Valid"
   ],
   "last_restarted": "2015-11-22 01:23:06",
   "advertised_bandwidth": 49168,
   "platform": "Tor 0.2.4.22 on Windows 8"
 }
 }}}
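
 The generator half then reduces to concatenation over that index, much as
 the cut invocation above does by hand (again only a sketch;
 FieldIndexer.Range is the hypothetical type from the previous snippet):

 {{{
 import java.util.Map;
 import java.util.Set;
 import java.util.StringJoiner;

 /** Sketch of the generator half: rebuild a details document from the
  *  precomputed member offsets, keeping only the requested fields and
  *  never (de-)serialising JSON at request time. */
 public final class FieldSlicer {

   public static String slice(String json,
                              Map<String, FieldIndexer.Range> index,
                              Set<String> requestedFields) {
     // The joiner re-inserts the separators, which sidesteps the
     // trailing-comma question that the triples above answer with their
     // third offset.
     StringJoiner members = new StringJoiner(",", "{", "}");
     for (Map.Entry<String, FieldIndexer.Range> e : index.entrySet()) {
       if (requestedFields.contains(e.getKey())) {
         FieldIndexer.Range r = e.getValue();
         members.add(json.substring(r.start(), r.end() + 1));
       }
     }
     return members.toString();
   }
 }
 }}}

 The index could be computed once per document by the updater (or lazily by
 the servlet) and kept alongside it, so a fields-constrained request never
 needs to touch a JSON library.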

 What do you think?

 ,,
 [=#fn1 ^1^] There's an element of surprise in the treatment of nullable
 properties, but it turns out that the existing behaviour works in our
 favour. GSON omits null-valued fields when writing documents to disk;
 e.g. note the absence of an AS number here:

 {{{
 % pwd
 /srv/onionoo.torproject.org/onionoo/out/details
 % jq . $(ls | shuf -n1)
 {
  "nickname": "Unnamed",
  "hashed_fingerprint": "CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE",
  "or_addresses": [
    "10.190.9.13:443"
  ],
  "last_seen": "2015-12-16 22:41:56",
  "first_seen": "2015-11-11 21:01:43",
  "running": true,
  "flags": [
    "Fast",
    "Valid"
  ],
  "last_restarted": "2015-12-16 02:13:40",
  "advertised_bandwidth": 59392,
  "platform": "Tor 0.2.4.23 on Windows 8"
 }
 }}}

 ,,
 But it *also* excludes them from /details responses, even when specified
 by name using the 'fields' parameter:

 {{{
 % curl -s 'http://onionoo.local/details?lookup=CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE&fields=hashed_fingerprint,as_number' \
     | jq '.bridges[]'
 {
   "hashed_fingerprint": "CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE"
 }
 }}}

 ,,So it doesn't seem necessary to emit any text beyond the persisted
 serialisation, even in this case.
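
 For reference, that is just Gson's default behaviour: null-valued fields
 are skipped on serialisation unless serializeNulls() is configured. A
 stand-alone illustration (not Onionoo code):

 {{{
 import com.google.gson.Gson;
 import com.google.gson.GsonBuilder;

 public class NullFieldDemo {
   static class Detail {
     String hashed_fingerprint = "CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE";
     String as_number = null;   // skipped by the default writer
   }

   public static void main(String[] args) {
     // Default Gson omits null-valued fields entirely:
     System.out.println(new Gson().toJson(new Detail()));
     // -> {"hashed_fingerprint":"CE0A4E1B6C545FF9F25A9CAF5926732559A2C0FE"}

     // Only an explicitly configured writer would emit them:
     System.out.println(
         new GsonBuilder().serializeNulls().create().toJson(new Detail()));
     // -> {"hashed_fingerprint":"...","as_number":null}
   }
 }
 }}}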

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/17939>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

