commit 77fc6e110c1c57b28ad36a0ccd00c61a96c034ff
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Mon Jun 18 18:57:12 2012 +0200
    Add a DESIGN document.
---
 DESIGN |  157 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 157 insertions(+), 0 deletions(-)
diff --git a/DESIGN b/DESIGN
new file mode 100644
index 0000000..6afbdb3
--- /dev/null
+++ b/DESIGN
@@ -0,0 +1,157 @@

Onionoo design document
=======================

This short document describes Onionoo's design in a mostly informal and
language-independent way. The goal is to be able to discuss design
decisions with non-Java programmers and to provide a blueprint for porting
Onionoo to other programming languages. This document cannot describe all
the details, but it can provide a rough overview.

There are two main building blocks of Onionoo that are described here:

 1) an hourly cronjob processing newly published Tor descriptors and

 2) a web service component answering client requests.

The interface between the two building blocks is a directory in the local
file system that can be read and written by component 1 and can be read by
component 2. In theory, the two components can be implemented in two
entirely different programming languages. In a possible port from Java to
another programming language, the two components can easily be ported one
after the other.

The purpose of the hourly batch processor is to read updated Tor
descriptors from the metrics service and to transform them into a format
that can be read by the web service component. Answering a client request
in component 2 of Onionoo needs to be highly efficient, which is why any
data aggregation needs to happen beforehand. Parsing descriptors on the
fly is not an option.

The hourly batch processor runs in a cron job at :15 every hour, usually
takes up to five minutes, and consists of the following substeps:

 1.1) Rsync new Tor descriptors from metrics.

 1.2) Read previously stored status data about relays and bridges that
      have been running in the last seven days into memory. These data
      include for each relay or bridge: nickname, fingerprint, primary
      OR address and port, additional OR addresses and ports, exit
      addresses, network status publication time, directory port, relay
      flags, consensus weight, country code, host name as obtained by
      reverse domain name lookup, and timestamp of the last reverse
      domain name lookup.

 1.3) Import any new relay network status consensuses that have been
      published since the last run.

 1.4) Set the running bit for all relays that are contained in the last
      known relay network status consensus.

 1.5) Look up all relay IP addresses in a local GeoIP database and in a
      local AS number database. Extract country codes and names, city
      names, geo coordinates, AS name and number, etc.

 1.6) Import any new bridge network statuses that have been published
      since the last run.

 1.7) Start reverse domain name lookups for all relay IP addresses. Run
      them in the background, refresh lookups for previously looked up
      IP addresses only every 12 hours, run up to five lookups in
      parallel, and set timeouts for single requests and for the overall
      lookup process. In theory, this step could happen a few steps
      earlier, but not before step 1.3.

 1.8) Import any new relay server descriptors that have been published
      since the last run.

 1.9) Import any new exit lists that have been published since the last
      run.

 1.10) Import any new bridge server descriptors that have been published
       since the last run.

 1.11) Import any new bridge pool assignments that have been published
       since the last run.

 1.12) Make sure that reverse domain name lookups are finished or that
       the timeout for running lookups has expired. This step cannot
       happen any later than step 1.13 and shouldn't happen long before
       it.

 1.13) Rewrite all details files that have changed. Details files
       combine information from all previously imported descriptor
       types, database lookups, and performed reverse domain name
       lookups. The web service component needs to be able to retrieve
       a details file for a given relay or bridge without grabbing
       information from different data sources. It's best to write the
       details file part for a given relay or bridge to a single file in
       the target JSON format, saved under the relay's or bridge's
       fingerprint (see the sketch after this list). If a database is
       used, the raw string should be saved for faster processing.

 1.14) Import relays' and bridges' bandwidth histories from extra-info
       descriptors that have been published since the last run. There
       must be internally stored bandwidth histories for each relay and
       bridge, regardless of whether they have been running in the last
       seven days. The original bandwidth histories, which are
       available at 15-minute detail, can be aggregated to longer time
       periods the farther the interval lies in the past. The internal
       bandwidth histories are different from the bandwidth files
       described in 1.15, which are written to be given out to clients.

 1.15) Rewrite bandwidth files that have changed. Bandwidth files
       aggregate bandwidth history information at varying levels of
       detail, depending on how far observations lie in the past.
       Writing JSON-formatted bandwidth files for all relays and bridges
       in the hourly cronjob is unavoidable: any attempt to process
       years of bandwidth data while answering a web request can only
       fail. The previously aggregated bandwidth files are stored under
       the relay's or bridge's fingerprint for quick lookup.

 1.16) Update the summary file listing all relays and bridges that have
       been running in the last seven days, which was previously read in
       step 1.2. This is the last step in the hourly process. The web
       service component checks the modification time of this file to
       decide whether it needs to reload its view of the network. If
       this step were not the last step, the web service component might
       list relays or bridges for which no details or bandwidth files
       are available yet. (With the approach taken here, it's
       conceivable that the bandwidth file of a relay or bridge that
       hasn't been running for a week has been deleted before step 1.16.
       This case has been found acceptable, because it's highly
       unlikely. If a database were used, steps 1.2 to 1.16 would
       happen in a single database transaction.)
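To make step 1.13 more concrete, the following is a minimal sketch, in
Java, of writing one details document per relay and saving it under the
relay's fingerprint. The class name, the choice of fields, and the output
directory are illustrative assumptions, not the actual Onionoo file layout
or document format.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    /* Sketch of step 1.13: write one details document per relay, saved
     * under the relay's fingerprint, so that the web service can serve
     * it without touching any other data source. */
    public class DetailsFileWriter {

      private final Path detailsDir;

      public DetailsFileWriter(Path detailsDir) {
        this.detailsDir = detailsDir;
      }

      public void writeDetailsFile(String fingerprint, String nickname,
          String orAddressAndPort, String countryCode,
          long consensusWeight) throws IOException {
        /* Build the JSON document once in the hourly run; the web
         * service only ever reads the finished file. Field names are
         * assumptions here. */
        String json = String.format(
            "{\"nickname\":\"%s\",\"fingerprint\":\"%s\","
            + "\"or_addresses\":[\"%s\"],\"country\":\"%s\","
            + "\"consensus_weight\":%d}",
            nickname, fingerprint, orAddressAndPort, countryCode,
            consensusWeight);
        Files.createDirectories(this.detailsDir);
        /* Saving under the fingerprint makes looking up a single relay
         * a single file read for the web service. */
        Path detailsFile = this.detailsDir.resolve(fingerprint);
        Files.write(detailsFile, json.getBytes(StandardCharsets.UTF_8));
      }
    }

If a database is used instead, the same pre-rendered string would be
stored as a raw value, as noted in step 1.13.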
The web service component has the purpose of answering client requests.
It uses previously prepared data from the hourly cronjob to respond to
requests very quickly.

During initialization, or whenever the hourly cronjob has finished, the
web service component performs the following substeps (see the sketch
after this list):

 2.1) Read the summary file that was produced by the hourly cronjob in
      step 1.16.

 2.2) Keep the list of relays and bridges in memory, including all
      information that is used for filtering or sorting results.

 2.3) Prepare summary lines for all relays and bridges. The summary
      resource is a JSON file with a single line per relay or bridge.
      Compared to details files, this line contains only very few
      fields, which a client might use for further filtering of
      results.
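The following is a minimal sketch, in Java, of steps 2.1 to 2.3,
including the modification-time check mentioned in step 1.16. The
one-record-per-line summary file layout and the SummaryNode fields are
assumptions for illustration; the actual summary file format is not
specified in this document.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    /* Sketch of steps 2.1 to 2.3: load the summary file written in step
     * 1.16 into memory and pre-render one summary line per node. */
    public class NodeIndex {

      /* In-memory record holding only what is needed for filtering,
       * sorting, and writing summary lines (step 2.2). */
      static class SummaryNode {
        String nickname;
        String fingerprint;
        String countryCode;
        boolean running;
        String summaryLine; /* pre-rendered JSON, step 2.3 */
      }

      private final List<SummaryNode> nodes = new ArrayList<>();

      private long lastLoadedMillis = -1L;

      /* Reload only if the cronjob has rewritten the summary file since
       * the last load (the modification-time check from step 1.16). */
      public void maybeReload(Path summaryFile) throws IOException {
        long lastModified =
            Files.getLastModifiedTime(summaryFile).toMillis();
        if (lastModified <= this.lastLoadedMillis) {
          return;
        }
        this.nodes.clear();
        for (String line : Files.readAllLines(summaryFile,
            StandardCharsets.UTF_8)) {
          /* Assumed layout: nickname fingerprint country running */
          String[] parts = line.split(" ");
          SummaryNode node = new SummaryNode();
          node.nickname = parts[0];
          node.fingerprint = parts[1];
          node.countryCode = parts[2];
          node.running = Boolean.parseBoolean(parts[3]);
          /* Step 2.3: prepare the summary line once, so that request
           * handling only concatenates pre-rendered strings. */
          node.summaryLine = String.format(
              "{\"n\":\"%s\",\"f\":\"%s\",\"r\":%b}",
              node.nickname, node.fingerprint, node.running);
          this.nodes.add(node);
        }
        this.lastLoadedMillis = lastModified;
      }

      public List<SummaryNode> getNodes() {
        return this.nodes;
      }
    }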
When responding to a request, the web service component performs the
following steps:

 2.4) Parse the request and its parameters.

 2.5) Possibly filter relays and bridges.

 2.6) Possibly re-order and limit results.

 2.7) Write the response or an error code (see the sketch below).

Again (and this can hardly be overstated!), steps 2.4 to 2.7 need to
happen *extremely* fast. Any steps that go beyond file system reads or
simple database lookups need to happen either in the hourly cronjob (1.1
to 1.16) or in the web service component initialization (2.1 to 2.3).
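To illustrate why steps 2.4 to 2.7 can stay within these bounds, here is
a minimal sketch, in Java, of filtering, ordering, limiting, and writing
pre-rendered summary lines from memory. The Node fields, the parameter
names, and the response framing are assumptions for illustration and not
the actual Onionoo protocol.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    /* Sketch of steps 2.4 to 2.7: everything operates on in-memory
     * lists and pre-rendered strings; no descriptor parsing happens
     * while answering a request. */
    public class SummaryRequestHandler {

      static class Node {
        String countryCode;
        long consensusWeight;
        String summaryLine; /* prepared during initialization, 2.3 */
      }

      public String handle(Map<String, String> params,
          List<Node> allNodes) {
        /* Step 2.5: possibly filter. */
        List<Node> result = new ArrayList<>();
        String country = params.get("country");
        for (Node node : allNodes) {
          if (country == null
              || country.equalsIgnoreCase(node.countryCode)) {
            result.add(node);
          }
        }
        /* Step 2.6: possibly re-order and limit results. */
        if ("consensus_weight".equals(params.get("order"))) {
          result.sort(Comparator
              .comparingLong((Node n) -> n.consensusWeight)
              .reversed());
        }
        if (params.containsKey("limit")) {
          int limit = Integer.parseInt(params.get("limit"));
          if (limit < result.size()) {
            result = result.subList(0, limit);
          }
        }
        /* Step 2.7: write the response from pre-rendered summary
         * lines. */
        StringBuilder response = new StringBuilder("{\"relays\":[");
        for (int i = 0; i < result.size(); i++) {
          if (i > 0) {
            response.append(",");
          }
          response.append(result.get(i).summaryLine);
        }
        response.append("]}");
        return response.toString();
      }
    }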