commit 77fc6e110c1c57b28ad36a0ccd00c61a96c034ff
Author: Karsten Loesing <karsten.loesing(a)gmx.net>
Date: Mon Jun 18 18:57:12 2012 +0200
Add a DESIGN document.
---
DESIGN | 157 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 157 insertions(+), 0 deletions(-)
diff --git a/DESIGN b/DESIGN
new file mode 100644
index 0000000..6afbdb3
--- /dev/null
+++ b/DESIGN
@@ -0,0 +1,157 @@
+Onionoo design document
+=======================
+
+This short document describes Onionoo's design in a mostly informal and
+language-independent way. The goal is to be able to discuss design
+decisions with non-Java programmers and to provide a blueprint for porting
+Onionoo to other programming languages. This document cannot describe all
+the details, but it can provide a rough overview.
+
+There are two main building blocks of Onionoo that are described here:
+
+ 1) an hourly cronjob processing newly published Tor descriptors and
+
+ 2) a web service component answering client requests.
+
+The interface between the two building blocks is a directory in the local
+file system that can be read and written by component 1 and can be read by
+component 2. In theory, the two components can be implemented in two
+entirely different programming languages. In a possible port from Java to
+another programming language, the two components can easily be ported
+one after the other.
+
+The purpose of the hourly batch processor is to read updated Tor
+descriptors from the metrics service and transform them to be read by the
+web service component. Answering a client request in component 2 of
+Onionoo needs to be highly efficient, which is why any data aggregation
+needs to happen beforehand. Parsing descriptors on-the-fly is not an
+option.
+
+The hourly batch processor runs as a cron job at :15 every hour, usually
+takes up to five minutes, and consists of the following substeps:
+
+ 1.1) Rsync new Tor descriptors from metrics.
+
+ 1.2) Read previously stored status data about relays and bridges that
+ have been running in the last seven days into memory. These data
+ include for each relay or bridge: nickname, fingerprint, primary
+ OR address and port, additional OR addresses and ports, exit
+ addresses, network status publication time, directory port, relay
+ flags, consensus weight, country code, host name as obtained by
+ reverse domain name lookup, and timestamp of last reverse domain
+ name lookup.
+
+ 1.3) Import any new relay network status consensuses that have been
+ published since the last run.
+
+ 1.4) Set the running bit for all relays that are contained in the last
+ known relay network status consensus.
+
+ 1.5) Look up all relay IP addresses in a local GeoIP database and in a
+ local AS number database. Extract country codes and names, city
+ names, geo coordinates, AS name and number, etc.
+
+ 1.6) Import any new bridge network statuses that have been published
+ since the last run.
+
+ 1.7) Start reverse domain name lookups for all relay IP addresses. Run
+ them in the background, refresh lookups for previously looked up
+ IP addresses only every 12 hours, run up to five lookups in
+ parallel, and set timeouts for single requests and for the overall
+ lookup process. In theory, this step could happen a few steps
+ earlier, but not before step 1.3.
+
+ 1.8) Import any new relay server descriptors that have been published
+ since the last run.
+
+ 1.9) Import any new exit lists that have been published since the last
+ run.
+
+ 1.10) Import any new bridge server descriptors that have been published
+ since the last run.
+
+ 1.11) Import any new bridge pool assignments that have been published
+ since the last run.
+
+ 1.12) Make sure that reverse domain name lookups are finished or the
+ timeout for running lookups has expired. This step must happen
+ before step 1.13, but shouldn't happen long before it.
+
+ 1.13) Rewrite all details files that have changed. Details files
+ combine information from all previously imported descriptor
+ types, database lookups, and performed reverse domain name
+ lookups. The web service component needs to be able to retrieve a
+ details file for a given relay or bridge without grabbing
+ information from different data sources. It's best to write the
+ details file part for a given relay or bridge to a single file in
+ the target JSON format, saved under the relay's or bridge's
+ fingerprint. If a database is used, the raw string should be
+ saved for faster processing.
+
+ 1.14) Import relays' and bridges' bandwidth histories from extra-info
+ descriptors that have been published since the last run. There
+ must be internally stored bandwidth histories for each relay and
+ bridge, regardless of whether they have been running in the last
+ seven days. The original bandwidth histories, which are available
+ at 15-minute detail, can be aggregated to longer time periods the
+ farther the interval lies in the past. The internal bandwidth
+ histories are different from the bandwidth files described in 1.15
+ which are written to be given out to clients.
+
+ 1.15) Rewrite bandwidth files that have changed. Bandwidth files
+ aggregate bandwidth history information on varying levels of
+ detail, depending on how far observations lie in the past. Writing
+ JSON-formatted bandwidth files for all relays and bridges in the
+ hourly cronjob is unavoidable: any attempt to process years of
+ bandwidth data while answering a web request can only fail.
+ The previously aggregated bandwidth files are stored under the
+ relay's or bridge's fingerprint for quick lookup.
+
+ 1.16) Update the summary file listing all relays and bridges that have
+ been running in the last seven days, which was previously read in
+ step 1.2. This is the last step in the hourly process. The web
+ service component checks the modification time of this file to
+ decide whether it needs to reload its view of the network. If
+ this step were not the last step, the web service component might
+ list relays or bridges for which there are no details or bandwidth
+ files available yet. (With the approach taken here, it's
+ conceivable that a bandwidth file of a relay or bridge that hasn't
+ been running for a week has been deleted before step 1.16. This
+ case has been found acceptable, because it's highly unlikely. If
+ a database had been used, steps 1.2 to 1.16 would have
+ happened in a single database transaction.)
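The ordering constraint in steps 1.13 to 1.16 (per-fingerprint files first, summary file last) can be sketched roughly as follows. This is a minimal Python illustration, not Onionoo's actual code: the directory layout (`details/<fingerprint>`, `summary`), the function names, and the write-to-temp-file-then-rename approach are all assumptions of this sketch and are not prescribed by the design above.

```python
import json
import os
import tempfile
from pathlib import Path


def write_atomically(path: Path, text: str) -> None:
    """Write text to a temp file in the same directory, then rename it into place.

    The rename step is an assumption of this sketch; it keeps a concurrent
    reader from ever seeing a half-written file.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write(text)
    os.replace(tmp_name, path)


def publish(out_dir: Path, details: dict, summary_lines: list) -> None:
    """Write one details file per fingerprint, then the summary file last."""
    # Hypothetical layout: out_dir/details/<fingerprint> and out_dir/summary.
    for fingerprint, doc in details.items():
        write_atomically(out_dir / "details" / fingerprint, json.dumps(doc))
    # The summary goes last so that a reader which reloads on a changed
    # summary file never lists a relay whose details file is still missing.
    write_atomically(out_dir / "summary", "\n".join(summary_lines) + "\n")
```

Calling `publish(Path("out"), {"ABCD": {"nickname": "example"}}, ["example ABCD"])` would leave `out/details/ABCD` in place before `out/summary` is touched, which is the property step 1.16 relies on.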
+
+The web service component has the purpose of answering client requests.
+It uses previously prepared data from the hourly cronjob to respond to
+requests very quickly.
+
+During initialization, or whenever the hourly cronjob has finished, the
+web service component performs the following substeps:
+
+ 2.1) Read the summary file that was produced by the hourly cronjob in
+ step 1.16.
+
+ 2.2) Keep the list of relays and bridges in memory, including all
+ information that is used for filtering or sorting results.
+
+ 2.3) Prepare summary lines for all relays and bridges. The summary
+ resource is a JSON file with a single line per relay or bridge.
+ Compared to details files, this line contains only a few fields,
+ which a client might use to filter results further.
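One way to detect that "the hourly cronjob has finished" is the modification-time check mentioned in step 1.16. The following Python sketch shows that idea under stated assumptions: the class name `SummaryCache`, the one-entry-per-line reading, and the use of nanosecond mtimes are illustrative choices, not part of the design.

```python
import os
from pathlib import Path


class SummaryCache:
    """Keeps the summary file's lines in memory and reloads them when the file changes."""

    def __init__(self, summary_path: Path) -> None:
        self.summary_path = summary_path
        self.loaded_mtime = None  # modification time of the copy held in memory
        self.lines = []

    def refresh(self) -> bool:
        """Reload if the file's modification time changed; return True if reloaded."""
        mtime = os.stat(self.summary_path).st_mtime_ns
        if mtime == self.loaded_mtime:
            return False  # the in-memory view is still current
        with open(self.summary_path, encoding="utf-8") as f:
            self.lines = [line.rstrip("\n") for line in f if line.strip()]
        self.loaded_mtime = mtime
        return True
```

A web service could call `refresh()` at startup and again before serving requests; the check is a single `stat` call, so it stays cheap compared to re-reading the file.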
+
+When responding to a request, the web service component performs the
+following steps:
+
+ 2.4) Parse request and its parameters.
+
+ 2.5) Possibly filter relays and bridges.
+
+ 2.6) Possibly re-order and limit results.
+
+ 2.7) Write response or error code.
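Steps 2.5 and 2.6 operate only on data already held in memory, which is what keeps them fast. A minimal Python sketch of that filter, re-order, and limit sequence is shown here; the `Entry` fields and the parameter names (`running`, `search`, `order_by_weight`, `limit`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical in-memory record holding only fields used to filter or sort."""
    fingerprint: str
    nickname: str
    running: bool
    consensus_weight: int


def select(entries, running: Optional[bool] = None, search: Optional[str] = None,
           order_by_weight: bool = False, limit: Optional[int] = None):
    """Filter, optionally re-order, and limit entries already held in memory."""
    result = [
        e for e in entries
        if (running is None or e.running == running)
        and (search is None or search.lower() in e.nickname.lower())
    ]
    if order_by_weight:
        # Highest consensus weight first.
        result.sort(key=lambda e: e.consensus_weight, reverse=True)
    if limit is not None:
        result = result[:max(limit, 0)]
    return result
```

The fingerprints of the selected entries can then be used to read the prepared details or bandwidth files from disk without any further processing.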
+
+Again (and this can hardly be overstated!), steps 2.4 to 2.7 need to
+happen *extremely* fast. Any steps that go beyond file system reads or
+simple database lookups need to happen either in the hourly cronjob (1.1
+to 1.16) or in the web service component initialization (2.1 to 2.3).
+