[tor-commits] [onionoo/master] Add a DESIGN document.

karsten at torproject.org
Mon Jun 18 17:00:15 UTC 2012


commit 77fc6e110c1c57b28ad36a0ccd00c61a96c034ff
Author: Karsten Loesing <karsten.loesing at gmx.net>
Date:   Mon Jun 18 18:57:12 2012 +0200

    Add a DESIGN document.
---
 DESIGN |  157 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 157 insertions(+), 0 deletions(-)

diff --git a/DESIGN b/DESIGN
new file mode 100644
index 0000000..6afbdb3
--- /dev/null
+++ b/DESIGN
@@ -0,0 +1,157 @@
+Onionoo design document
+=======================
+
+This short document describes Onionoo's design in a mostly informal and
+language-independent way.  The goal is to be able to discuss design
+decisions with non-Java programmers and to provide a blueprint for porting
+Onionoo to other programming languages.  This document cannot describe all
+the details, but it can provide a rough overview.
+
+There are two main building blocks of Onionoo that are described here:
+
+  1) an hourly cronjob processing newly published Tor descriptors and
+
+  2) a web service component answering client requests.
+
+The interface between the two building blocks is a directory in the local
+file system that can be read and written by component 1 and can be read by
+component 2.  In theory, the two components can be implemented in two
+entirely different programming languages.  In a possible port from Java to
+another programming language, the two components can easily be ported
+one at a time.
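+
+As an illustration, the shared directory might be laid out as follows.
+The directory and file names here are hypothetical, not prescribed by
+this document; only the kinds of files follow from the steps below:
+
+  status/                    internal status data (read in step 1.2)
+  out/
+    summary                  list of running relays and bridges (1.16)
+    details/<fingerprint>    one JSON details file per node (1.13)
+    bandwidth/<fingerprint>  one JSON bandwidth file per node (1.15)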
+
+The purpose of the hourly batch processor is to read updated Tor
+descriptors from the metrics service and transform them into the format
+read by the web service component.  Answering a client request in
+component 2 of Onionoo needs to be highly efficient, which is why any
+data aggregation needs to happen beforehand.  Parsing descriptors
+on-the-fly is not an option.
+
+The hourly batch processor runs as a cron job at :15 every hour,
+usually takes up to five minutes, and consists of the following
+substeps:
+
+  1.1)  Rsync new Tor descriptors from metrics.
+
+  1.2)  Read previously stored status data about relays and bridges that
+        have been running in the last seven days into memory.  These
+        data include for each relay or bridge: nickname, fingerprint,
+        primary OR address and port, additional OR addresses and ports,
+        exit addresses, network status publication time, directory port,
+        relay flags, consensus weight, country code, host name as
+        obtained by reverse domain name lookup, and timestamp of the
+        last reverse domain name lookup.  (A sketch of a possible data
+        structure follows this list.)
+
+  1.3)  Import any new relay network status consensuses that have been
+        published since the last run.
+
+  1.4)  Set the running bit for all relays that are contained in the last
+        known relay network status consensus.
+
+  1.5)  Look up all relay IP addresses in a local GeoIP database and in a
+        local AS number database.  Extract country codes and names, city
+        names, geo coordinates, AS name and number, etc.
+
+  1.6)  Import any new bridge network statuses that have been published
+        since the last run.
+
+  1.7)  Start reverse domain name lookups for all relay IP addresses.
+        Run them in the background, only refresh lookups for previously
+        looked up IP addresses every 12 hours, run up to five lookups in
+        parallel, and set timeouts both for single requests and for the
+        overall lookup process.  (A sketch follows this list.)  In
+        theory, this step could happen a few steps earlier, but not
+        before step 1.3.
+
+  1.8)  Import any new relay server descriptors that have been published
+        since the last run.
+
+  1.9)  Import any new exit lists that have been published since the last
+        run.
+
+  1.10) Import any new bridge server descriptors that have been published
+        since the last run.
+
+  1.11) Import any new bridge pool assignments that have been published
+        since the last run.
+
+  1.12) Make sure that reverse domain name lookups are finished or that
+        the timeout for running lookups has expired.  This step cannot
+        happen any later than step 1.13 and shouldn't happen long
+        before it.
+
+  1.13) Rewrite all details files that have changed.  Details files
+        combine information from all previously imported descriptor
+        types, database lookups, and performed reverse domain name
+        lookups.  The web service component needs to be able to retrieve
+        a details file for a given relay or bridge without grabbing
+        information from different data sources.  It's best to write the
+        details file part for a given relay or bridge to a single file
+        in the target JSON format, saved under the relay's or bridge's
+        fingerprint.  (A sketch follows this list.)  If a database is
+        used, the raw string should be saved for faster processing.
+
+  1.14) Import relays' and bridges' bandwidth histories from extra-info
+        descriptors that have been published since the last run.  There
+        must be internally stored bandwidth histories for each relay and
+        bridge, regardless of whether they have been running in the last
+        seven days.  The original bandwidth histories, which are
+        available at 15-minute granularity, can be aggregated into
+        longer time periods the farther the interval lies in the past.
+        These internal bandwidth histories are different from the
+        bandwidth files described in step 1.15, which are written to be
+        given out to clients.
+
+  1.15) Rewrite bandwidth files that have changed.  Bandwidth files
+        aggregate bandwidth history information at varying levels of
+        detail, depending on how far observations lie in the past.  (A
+        sketch of this aggregation rule follows this list.)  The
+        JSON-formatted bandwidth files for all relays and bridges must
+        be written in the hourly cronjob; any attempt to process years
+        of bandwidth data while answering a web request can only fail.
+        The previously aggregated bandwidth files are stored under the
+        relay's or bridge's fingerprint for quick lookup.
+
+  1.16) Update the summary file listing all relays and bridges that
+        have been running in the last seven days, which was previously
+        read in step 1.2.  This is the last step in the hourly process.
+        The web service component checks the modification time of this
+        file to decide whether it needs to reload its view of the
+        network.  If this step were not the last step, the web service
+        component might list relays or bridges for which no details or
+        bandwidth files are available yet.  (With the approach taken
+        here, it's conceivable that a bandwidth file of a relay or
+        bridge that hasn't been running for a week is deleted before
+        step 1.16.  This case has been found acceptable, because it's
+        highly unlikely.  If a database were used, steps 1.2 to 1.16
+        would happen in a single database transaction.)
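+
+The following sketches illustrate some of the steps above.  They are
+informal Java sketches with made-up names, not excerpts from the actual
+implementation.
+
+For step 1.2, the status data kept in memory for a single relay or
+bridge might be grouped in a class like this:
+
+  import java.util.List;
+  import java.util.SortedSet;
+
+  /* Status data kept in memory for one relay or bridge (step 1.2). */
+  public class NodeStatus {
+    String nickname;
+    String fingerprint;
+    String address;                   /* primary OR address */
+    int orPort;
+    List<String> orAddressesAndPorts; /* additional OR addresses */
+    List<String> exitAddresses;
+    long lastSeenMillis;              /* network status publication */
+    int dirPort;
+    SortedSet<String> relayFlags;
+    long consensusWeight;
+    String countryCode;
+    String hostName;                  /* from reverse DNS lookup */
+    long lastRdnsLookupMillis;        /* last reverse DNS lookup time */
+    boolean running;                  /* set in step 1.4 */
+  }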
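+
+For step 1.7, a minimal sketch of running up to five reverse domain
+name lookups in parallel with a per-lookup timeout, skipping addresses
+that were looked up less than 12 hours ago (the timeout value and all
+names are made up):
+
+  import java.net.InetAddress;
+  import java.util.concurrent.*;
+
+  public class RdnsLookups {
+    private static final long REFRESH_MILLIS = 12L * 60L * 60L * 1000L;
+    private static final long TIMEOUT_MILLIS = 10L * 1000L;
+    private final ExecutorService pool =
+        Executors.newFixedThreadPool(5);
+
+    /* Return the host name, or null if skipped or timed out. */
+    public String lookUp(final String address, long lastLookupMillis)
+        throws Exception {
+      if (System.currentTimeMillis() - lastLookupMillis
+          < REFRESH_MILLIS) {
+        return null;  /* refreshed less than 12 hours ago */
+      }
+      Future<String> future = pool.submit(new Callable<String>() {
+        public String call() throws Exception {
+          return InetAddress.getByName(address).getCanonicalHostName();
+        }
+      });
+      try {
+        return future.get(TIMEOUT_MILLIS, TimeUnit.MILLISECONDS);
+      } catch (TimeoutException e) {
+        future.cancel(true);
+        return null;  /* single-request timeout expired */
+      }
+    }
+  }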
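+
+For step 1.13, writing the details document of a single relay or bridge
+to a file named after its fingerprint could look like this (the
+directory name is illustrative):
+
+  import java.io.*;
+
+  public class DetailsFileWriter {
+    /* Write one details document under its fingerprint (step 1.13). */
+    public static void writeDetailsFile(String fingerprint,
+        String detailsJson) throws IOException {
+      File detailsFile = new File("out/details", fingerprint);
+      detailsFile.getParentFile().mkdirs();
+      Writer writer = new BufferedWriter(new FileWriter(detailsFile));
+      writer.write(detailsJson);
+      writer.close();
+    }
+  }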
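+
+For steps 1.14 and 1.15, the aggregation rule might pick a data-point
+interval depending on the age of an observation.  The thresholds below
+are made up for illustration; only the rule itself, keeping 15-minute
+detail for recent observations and coarser intervals for older ones,
+is taken from the description above:
+
+  public class BandwidthAggregation {
+    private static final long HOUR = 60L * 60L * 1000L;
+    private static final long DAY = 24L * HOUR;
+
+    /* Return the data-point interval for an observation of a given
+     * age: finer detail for recent data, coarser for older data. */
+    public static long dataPointIntervalMillis(long ageMillis) {
+      if (ageMillis <= 3L * DAY) {
+        return 15L * 60L * 1000L;  /* 15 minutes for ~3 days */
+      } else if (ageMillis <= 31L * DAY) {
+        return HOUR;               /* 1 hour for ~1 month */
+      } else {
+        return DAY;                /* 1 day beyond that */
+      }
+    }
+  }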
+
+The web service component has the purpose of answering client requests.
+It uses previously prepared data from the hourly cronjob to respond to
+requests very quickly.
+
+During initialization, or whenever the hourly cronjob has finished, the
+web service component does the following substeps:
+
+  2.1)  Read the summary file that was produced by the hourly cronjob
+        in step 1.16.  (A sketch of steps 2.1 to 2.3 follows this
+        list.)
+
+  2.2)  Keep the list of relays and bridges in memory, including all
+        information that is used for filtering or sorting results.
+
+  2.3)  Prepare summary lines for all relays and bridges.  The summary
+        resource is a JSON file with a single line per relay or bridge.
+        Such a line contains only very few fields compared to a details
+        file; a client might use these fields for further filtering of
+        results.
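+
+A sketch of steps 2.1 to 2.3, reloading the in-memory view whenever the
+summary file from step 1.16 has a newer modification time (the file
+name and class layout are illustrative):
+
+  import java.io.*;
+  import java.util.*;
+
+  public class NodeIndex {
+    private long lastModified = -1L;
+    private List<String> summaryLines = new ArrayList<String>();
+
+    /* Re-read the summary file if the cronjob has updated it. */
+    public void reloadIfUpdated() throws IOException {
+      File summaryFile = new File("out/summary");
+      if (!summaryFile.exists()
+          || summaryFile.lastModified() <= this.lastModified) {
+        return;  /* no new cronjob run has finished yet */
+      }
+      List<String> lines = new ArrayList<String>();
+      BufferedReader reader = new BufferedReader(
+          new FileReader(summaryFile));
+      String line;
+      while ((line = reader.readLine()) != null) {
+        lines.add(line);
+      }
+      reader.close();
+      this.summaryLines = lines;
+      this.lastModified = summaryFile.lastModified();
+    }
+  }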
+
+When responding to a request, the web service component performs the
+following steps, sketched in code after the list:
+
+  2.4)  Parse the request and its parameters.
+
+  2.5)  Possibly filter relays and bridges.
+
+  2.6)  Possibly re-order and limit results.
+
+  2.7)  Write the response or an error code.
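+
+A minimal sketch of steps 2.5 to 2.7, operating on the in-memory
+summary lines prepared in steps 2.1 to 2.3 (filter term, ordering, and
+limit are hypothetical request parameters):
+
+  import java.util.*;
+
+  public class RequestHandler {
+    /* Filter, re-order, and limit summary lines (steps 2.5 to 2.7). */
+    public static List<String> handleRequest(List<String> summaryLines,
+        String filterTerm, Comparator<String> order, int limit) {
+      List<String> result = new ArrayList<String>();
+      for (String line : summaryLines) {          /* 2.5: filter */
+        if (filterTerm == null || line.contains(filterTerm)) {
+          result.add(line);
+        }
+      }
+      if (order != null) {                        /* 2.6: re-order */
+        Collections.sort(result, order);
+      }
+      if (limit >= 0 && result.size() > limit) {  /* 2.6: limit */
+        result = result.subList(0, limit);
+      }
+      return result;                              /* 2.7: response */
+    }
+  }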
+
+Again (and this can hardly be overstated!), steps 2.4 to 2.7 need to
+happen *extremely* fast.  Any steps that go beyond file system reads or
+simple database lookups need to happen either in the hourly cronjob (1.1
+to 1.16) or in the web service component initialization (2.1 to 2.3).
+