[tor-commits] [collector/master] Expand INSTALL.md.

28 Oct 2016

commit 0ac703be63413de3d2e5315feae2280915284491
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Mon Oct 17 16:56:15 2016 +0200

    Expand INSTALL.md.
    
    Contains many suggestions by iwakeh.
    
    Implements #20380.
---
 INSTALL.md                            | 311 +++++++++++++++++++++++++++-------
 README.md                             |  62 -------
 src/main/resources/create-tarballs.sh |   9 +-
 3 files changed, 253 insertions(+), 129 deletions(-)

diff --git a/INSTALL.md b/INSTALL.md
index 1b0920e..4b0b143 100644
--- a/INSTALL.md
+++ b/INSTALL.md
@@ -1,90 +1,281 @@
-CollecTor -- Operator's Guide
-=============================
+# CollecTor Operator's Guide
 
-Welcome to the Operator's Guide of CollecTor.  This guide explains how
-to set up a new CollecTor instance to download relay descriptors from the
-Tor directory authorities.
+Welcome to CollecTor, your friendly data-collecting service in the Tor network.
+CollecTor fetches data from various nodes and services in the public Tor network
+and makes it available to the world.  This data includes relay descriptors from
+the directory authorities, sanitized bridge descriptors from the bridge
+authority, and other data about the Tor network.
 
+This document describes how to set up your very own CollecTor instance.  It was
+written with an audience in mind that has at least some experience with running
+services and is comfortable with the command line.  It's not required that you
+know how to read or even write Java code, though.
 
-Requirements
-------------
+Before we go ahead with setting up your CollecTor instance, let us pause for a
+moment and reflect on why you'd want to do that as opposed to simply using data
+from an existing CollecTor instance.
 
-You'll need a Linux host with at least 50G disk space and 2G RAM.
+CollecTor is a service, and the best reason for running a CollecTor service
+instance is to offer your collected Tor network data to others.  You could
+mirror the data from an existing instance or even aggregate data from multiple
+instances by using the synchronization feature.  Or you could fetch data from
+public sources and provide your data to users and other CollecTor instances.
+Another reason might be to collect or synchronize Tor network data and provide
+it to your working or research group.  And of course you might want to run a
+CollecTor instance for testing purposes.  In all these cases, setting up a
+CollecTor instance might make sense.
 
-In the following we'll assume that the host runs Debian stable as
-operating system, but it should work on any other Linux or possibly even
-*BSD.  Though you'll be mostly on your own with those.
+However, if you only want to use Tor network data as a client, even as input for
+another service you're developing, you don't have to and probably shouldn't run
+a CollecTor instance.  In that case it's sufficient to use a library like
+[metrics-lib](https://dist.torproject.org/descriptor/) or
+[Stem](https://stem.torproject.org/) to fetch CollecTor data and process it.
 
-As Java is available on a variety of other operating systems, these might
-work, too.  But again you'll be on your own.
 
-Prepare the system
-------------------
+## Setting up the host
 
-CollecTor is provided by The Tor Project and can be found here:
-    https://dist.torproject.org/collector/
-Download the tar.gz file with the version number listed in build.xml.
-The README inside the tar.gz file has all the information about CollecTor
-and explains how to verify the downloaded files.
+You'll need a host with at least 200G disk space and 4G RAM.
 
-You need a Java installation.  On Debian you can just run:
+In the following we'll assume that your host runs Debian stable as operating
+system.  CollecTor should run on any other Linux or possibly even *BSD, though
+you'll be mostly on your own with those.  And as Java is available on a variety
+of other operating systems, those might work, too, but, again, you'll be on your
+own.
 
-$ sudo apt-get openjdk-7-jdk
+CollecTor does not require installing many or specific dependencies on the host
+system.  All it needs are a Java Runtime Environment version 7 or higher and an
+Apache HTTP Server version 2 or higher.
 
-Configure the relay descriptor downloader
------------------------------------------
+The CollecTor service runs entirely under a non-privileged user account.  Any
+user account will do, but feel free to create a new user account just for the
+CollecTor service, if you prefer.
 
-Run
-$ java -DLOGBASE=/path/to/logs -jar collector-<version>.jar
-once in order to obtain a configuration properties file.
+The CollecTor service requires running in a working directory where it can store
+Tor network data and state files.  This working directory can be located
+anywhere in the file system as long as there is enough disk space available.
+The Apache service will later need to know where to find files to serve to web
+clients including other CollecTor instances.
 
-There are quite a few options to set in collector.properties and the comments
-explain their meaning.  So, you can set the options to the values you want.
+CollecTor does not require setting up a database.
 
-Create the paths you set in collector.properties.
+This concludes the host setup.  Later in the process you'll once more need root
+privileges to configure Apache to serve CollecTor files.  But until then you can
+do all setup steps with the non-privileged user account.
 
-Example: run the relay descriptor downloader
---------------------------------------------
 
-This is a small example about how CollecTor is used.  All the other
-settings are explained in the default collector.properties.
+## Setting up the service
 
-For running the relay descriptor downloader:
+### Obtaining the code
 
-Edit collector.properties and set at least the following value to true:
+CollecTor releases are available at:
 
-DownloadRelayDescriptors = true
+```https://dist.torproject.org/collector/```
 
-$ java -DLOGBASE=/path/to/logs -jar collector-<version>.jar </place/of/collector.properties>
+Choose the latest tarball and signature file, verify the signature on the
+tarball, and extract the tarball in a location of your choice, which will create
+a subdirectory called `collector-<version>/`.
 
-Watch out for INFO-level logs in the log directory you configured.  In
-particular, the lines following "Statistics on the completeness of written
-relay descriptors:" are quite important.
 
-In case of the unforeseen ERROR and WARN level logs should help you troubleshoot
-your installation.
+### Planning the service setup
 
-Maintenance
------------
+By default, CollecTor is configured to do nothing at all.  The reason is that
+new operators should first understand its capabilities and make a plan for
+configuring their new CollecTor instance.  Let's do that now.
 
-CollecTor is designed to keep running and attempts to re-run modules even
-when previous runs stopped because of a problem.  Thus, it is very important
-to watch out for WARNING level and especially ERROR level log statements.
+CollecTor consists of a background updater with an internal scheduler and
+several data-collecting modules that write data to local directories which are
+then served by a webserver.  Each of the modules can have one or more data
+sources, some public like relay descriptors served by the directory authorities
+and some private like bridge descriptors uploaded to the bridge directory
+authority.
 
-These often will point to problems you can do something about, e.g. a full disk
-or missing file system permissions.
+You'll have to decide which of the data-collecting modules you want to activate,
+how often to execute these modules, and which data sources to collect data from.
 
-Logging Configuration
----------------------
+The release tarball contains an executable .jar file:
 
-Some hints for those who are familiar with Logback:
+```collector-<version>/generated/dist/collector-<version>.jar```
 
-If you want to use your own logging configuration for Logback you can simply
-create your own logback.xml or logback.groovy and start CollecTor in the
-following way:
+Copy this .jar file into the working directory and run it:
 
-java -cp /folder/with/logback:collector-1.0.0.jar org.torproject.collector.Main
- </place/of/collector.properties>
+```java -jar collector-<version>.jar```
+
+CollecTor will print some text about not being able to find a configuration
+file, which is understandable since there is no such file yet.  It also writes a
+fresh configuration file called `collector.properties` to the working directory
+which contains defaults (that instruct CollecTor to do nothing).
+
+Read through that file to learn about all available configuration options.
+
+
+### Performing the initial run
+
+Once you have a plan for how to configure your CollecTor instance, edit the
+`collector.properties` file, set it to run only once, activate all relevant
+modules, check and possibly edit other options as needed, and save the file.
+Run the Java process using:
+
+```java -Xmx2g -DLOGBASE=<your-log-dir> -jar collector-<version>.jar
+<your-collector.properties>```
+
+The option `-Xmx2g` sets the maximum heap space to 2G, which is based on the
+recommended 4G total RAM size for the host.  If you have more memory to spare,
+feel free to adapt this option as needed.
+
+This may take a while, depending on which modules you activated.  Read the logs
+to learn if the run was successful.  If it wasn't, go back to editing the
+properties file and re-run the .jar file.  Change the run-once option back when
+you're done with the initial run of the Java process.
+
+Complete the initialization step by copying the shell script
+`collector-<version>/src/main/resources/create-tarballs.sh` from the release
+tarball to the working directory or another location of your choice, editing the
+contained paths, and executing it.  Note that this script will at least partly
+fail if one or more modules are deactivated.
+
+
+### Scheduling periodic runs
+
+The next step in setting up the CollecTor instance is to start the updater with
+its internal scheduler and let it run continuously in the background.  In order
+to do so, make sure the run-once property is set to `false`, possibly adapt the
+scheduling properties, and execute the .jar file using the same command as above
+but this time in the background.  Make sure that the same command will be run
+automatically after a reboot.
+
+Also make sure that the `create-tarballs.sh` script will be executed at least
+every three days, but no more than once per day.
+
+### Setting up the website
+
+The last remaining part in the setup process is to make the collected data
+available.  Copy the contents from `collector-<version>/src/main/webapp/*` in
+the release tarball to a web application subdirectory in the working directory
+or another location of your choice.
+
+Configure an Apache site that uses redirects or symbolic links to serve the
+following directories or files in your working directory (where paths in <>
+refer to settings in `collector.properties`):
+
+ * `<your-webapp-dir>/*`,
+ * `<ArchivePath>`,
+ * `<IndexPath>`, and
+ * `<RecentPath>`.
+
+Use your browser to make sure that your instance serves the web pages and data
+that you'd expect.
+
+
+## Maintaining the service
+
+### Monitoring the service
+
+The most important information about your CollecTor instance is whether it is
+alive.  If it dies and you don't notice, you might be losing data
+that is not available at the data sources anymore.  You should set up a
+notification mechanism of your choice to be informed quickly when the background
+updater dies.
+
+Apart from fatal issues, the logs are a good source for learning about issues
+with your CollecTor instance.  Be sure to read the logs every now and then,
+and look out for warnings and errors.  Maybe set up another notification to be
+informed quickly of new warnings or errors.
+
+
+### Changing logging options
+
+CollecTor uses Logback for logging and comes with a default logging
+configuration that logs on info level and that creates a common log file that
+rotates once per day and a separate log file per module.  If you want to change
+logging options, copy the default logging configuration from
+`collector-<version>/src/main/resources/logback.xml` to your working directory,
+edit your copy, and execute the .jar file as follows:
+
+```java -Xmx2g -DLOGBASE=<your-log-dir> -cp .:collector-<version>.jar
+org.torproject.collector.Main```
+
+Internally, CollecTor uses the Simple Logging Facade for Java (SLF4J) and ships
+with the Logback implementation for SLF4J.  If you prefer a different logging
+framework, you can provide and use that instead.  For more detailed information,
+or if you have different logging needs, please refer to the [Logback
+documentation](http://logback.qos.ch/), and for switching to a different
+framework to the [SLF4J website](http://www.slf4j.org/).
+
+
+### Changing configuration options
+
+If you need to reconfigure your CollecTor instance, you may be able to do that
+without stopping and restarting the Java process.  Scheduling settings are
+exempt from this, but all general and module settings may be changed at
+run-time.  Just edit the config file, and the changes will become effective in
+the next execution of a module.  Changes to the scheduler, however, require
+stopping and restarting the Java update process.
+
+
+### Stopping the service (gracefully)
+
+If you need to stop the background updater for some reason, like rebooting the
+host, there is a way to do that gracefully: kill the Java process, and a
+shutdown hook will stop the internal scheduler and wait for up to 10 minutes (or
+whatever amount of time is configured) for all currently running updates to be
+finished.  However, if you must stop the process immediately, use `kill -9`,
+though you might have to clean up manually.  You should try to avoid rebooting
+while tarballs are being created.
+
+
+### Upgrading and downgrading
+
+If you need to upgrade to a newer release or downgrade to a previous release,
+download that tarball and extract it, and copy over the executable .jar file and
+the `create-tarballs.sh` script in case it has changed.  Stop the current
+service version as described above, possibly adapt your `collector.properties`
+file as necessary, and restart the Java process using the new .jar file.  Don't
+forget to update the version number in the command that ensures that the .jar
+file gets executed automatically after a reboot.  Watch the logs to see if the
+upgrade or downgrade was successful.
+
+
+### Backing up data and settings
+
+A backup of your CollecTor instance should include the `<ArchivePath>` and your
+configuration, which would enable you to set up this instance again.  A backup
+for short term recovery would also include the more volatile data in
+`<StatsPath>`, `<RecentPath>`, and `<OutputPath>`.
+
+
+### Performing recurring tasks
+
+Most of CollecTor is designed to just run in the background forever.  However,
+some parts still require manual housekeeping every month or two: You'll need to
+clean up data from `<OutputPath>` as configured in `collector.properties` when
+you're certain that the data is contained in tarballs and contained in backups.
+Likewise, you'll have to delete old files from `<BridgeLocalOrigins>`, in case
+that is being used, where CollecTor only reads and never writes or deletes.
+
+
+### Resolving common issues
+
+Unfortunately, CollecTor still runs into issues from time to time, and some of
+these issues require a human being to decide whether they're harmless or require
+intervention by the operator.
+
+The most common issue these days is a warning about missing too many referenced
+descriptors, which may even be true but which is typically not an operations
+issue.
+
+A lot less frequently, the bridgedesc module reports unrecognized lines in
+non-sanitized bridge descriptors which, if true, requires developing and
+deploying a patch.  And sometimes the bridgedesc module complains about stale
+input data, which requires fixing the bridge authority or the sync mechanism to
+the CollecTor host.
+
+Another minor issue is that files in `<OutputPath>` may change while tarballs
+are being created, which is usually safe to ignore.
+
+There's another frequent error message where CollecTor complains about not being
+able to fetch a remote file during the sync process.  This error message is
+usually harmless and can be ignored.
+
+But let's hope that you won't run into any of these issues or at least not
+frequently.  Enjoy your new CollecTor instance!
 
-The default configuration can be found in the tar-ball you downloaded, in
-the subdirectory collector-1.0.0/src/main/resources.
\ No newline at end of file
diff --git a/README.md b/README.md
deleted file mode 100644
index b5b3e33..0000000
--- a/README.md
+++ /dev/null
@@ -1,62 +0,0 @@
-CollecTor -- The friendly data-collecting service in the Tor network
-====================================================================
-
-CollecTor fetches data from various nodes and services in the public
-Tor network and makes it available to the world.
-
-Verifying releases
-------------------
-
-Releases can be cryptographically verified to get some more confidence that
-they were put together by a Tor developer.  The following steps explain the
-verification process by example.
-
-Download the release tarball and the separate signature file:
-
-```
-wget https://dist.torproject.org/collector/1.0.0/collector-1.0.0.tar.gz
-wget https://dist.torproject.org/collector/1.0.0/collector-1.0.0.tar.gz.asc
-```
-
-Attempt to verify the signature on the tarball:
-
-```
-gpg --verify collector-1.0.0.tar.gz.asc
-```
-
-If the signature cannot be verified due to the public key of the signer
-not being locally available, download that public key from one of the key
-servers and retry:
-
-```
-gpg --keyserver pgp.mit.edu --recv-key 0x4EFD4FDC3F46D41E
-gpg --verify collector-1.0.0.tar.gz.asc
-```
-
-If the signature still cannot be verified, something is wrong!
-
-But note that even if it can be verified, you now only know that the
-signature was made by the person claiming to own this key, which could be
-anyone.  You'll need a trust path to the owner of this key in order to
-trust this signature, but that's clearly out of scope here.  In short,
-your best chance is to meet a Tor developer in real life and enter the web
-of trust.
-
-If you want to go one step further in the verification game, you can
-verify the signature on the .jar files.
-
-Print and then import the provided X.509 certificate:
-
-```
-keytool -printcert -file CERT
-keytool -importcert -alias karsten -file CERT
-```
-
-Verify the signatures on the contained .jar files using Java's jarsigner
-tool:
-
-```
-jarsigner -verify collector-1.0.0.jar
-jarsigner -verify collector-1.0.0-sources.jar
-```
-
diff --git a/src/main/resources/create-tarballs.sh b/src/main/resources/create-tarballs.sh
index de05b30..4b6aa57 100755
--- a/src/main/resources/create-tarballs.sh
+++ b/src/main/resources/create-tarballs.sh
@@ -24,9 +24,7 @@ YEARTWO=`date --date='7 days ago' +%Y`
 MONTHTWO=`date --date='7 days ago' +%m`
 CURRENTPATH=`pwd`
 
-if ! test -d $WORKDIR
-  then mkdir $WORKDIR
-fi
+mkdir -p $WORKDIR
 
 cd $WORKDIR
 
@@ -35,10 +33,7 @@ if ! test -d $OUTDIR
   exit 1
 fi
 
-if ! test -d $TARBALLTARGETDIR
-  then echo "$TARBALLTARGETDIR doesn't exist.  Exiting."
-  exit 1
-fi
+mkdir -p $TARBALLTARGETDIR
 
 TARBALLS=(
   exit-list-$YEARONE-$MONTHONE
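
Note on the `create-tarballs.sh` hunks above: the script now creates `$WORKDIR` and `$TARBALLTARGETDIR` with `mkdir -p` instead of testing for them first. For `$TARBALLTARGETDIR` this changes behavior: a missing directory used to abort the script with exit code 1 and is now created silently. The following is a minimal sketch of that pattern, not the script itself. The scratch location and the two directory values are hypothetical, since the real values are set near the top of the script and are not part of this diff.

```shell
#!/bin/bash
set -eu

# Hypothetical scratch area so the sketch never touches real data.
SCRATCH=$(mktemp -d)
trap 'rm -rf "$SCRATCH"' EXIT

# Hypothetical values; the real script defines its own.
WORKDIR="$SCRATCH/work"
TARBALLTARGETDIR="$SCRATCH/archive/tarballs"

# mkdir -p creates missing parent directories as needed and
# succeeds if the directory already exists.
mkdir -p "$WORKDIR"
mkdir -p "$TARBALLTARGETDIR"

# A second invocation is a no-op and still exits with status 0,
# which is why the explicit "test -d" guards are no longer needed.
mkdir -p "$WORKDIR" "$TARBALLTARGETDIR"
```

The variables are quoted here so that paths containing spaces would still work; the script in the diff leaves them unquoted.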

    


karsten＠torproject.org