commit 0ac703be63413de3d2e5315feae2280915284491
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Mon Oct 17 16:56:15 2016 +0200
Expand INSTALL.md.
Contains many suggestions by iwakeh.
Implements #20380.
---
 INSTALL.md                            | 311 +++++++++++++++++++++++++++-------
 README.md                             |  62 -------
 src/main/resources/create-tarballs.sh |   9 +-
 3 files changed, 253 insertions(+), 129 deletions(-)
diff --git a/INSTALL.md b/INSTALL.md index 1b0920e..4b0b143 100644 --- a/INSTALL.md +++ b/INSTALL.md @@ -1,90 +1,281 @@ -CollecTor -- Operator's Guide -============================= +# CollecTor Operator's Guide
-Welcome to the Operator's Guide of CollecTor. This guide explains how -to set up a new CollecTor instance to download relay descriptors from the -Tor directory authorities. +Welcome to CollecTor, your friendly data-collecting service in the Tor network. +CollecTor fetches data from various nodes and services in the public Tor network +and makes it available to the world. This data includes relay descriptors from +the directory authorities, sanitized bridge descriptors from the bridge +authority, and other data about the Tor network.
+This document describes how to set up your very own CollecTor instance. It was +written with an audience in mind that has at least some experience with running +services and is comfortable with the command line. It's not required that you +know how to read or even write Java code, though.
-Requirements ------------- +Before we go ahead with setting up your CollecTor instance, let us pause for a +moment and reflect why you'd want to do that as opposed to simply using data +from an existing CollecTor instance.
-You'll need a Linux host with at least 50G disk space and 2G RAM. +CollecTor is a service, and the best reason for running a CollecTor service +instance is to offer your collected Tor network data to others. You could +mirror the data from an existing instance or even aggregate data from multiple +instances by using the synchronization feature. Or you could fetch data from +public sources and provide your data to users and other CollecTor instances. +Another reason might be to collect or synchronize Tor network data and provide +it to your working or research group. And of course you might want to run a +CollecTor instance for testing purposes. In all these cases, setting up a +CollecTor instance might make sense.
-In the following we'll assume that the host runs Debian stable as -operating system, but it should work on any other Linux or possibly even -*BSD. Though you'll be mostly on your own with those. +However, if you only want to use Tor network data as a client, even as input for +another service you're developing, you don't have to and probably shouldn't run +a CollecTor instance. In that case it's sufficient to use a library like +[metrics-lib](https://dist.torproject.org/descriptor/) or +[Stem](https://stem.torproject.org/) to fetch CollecTor data and process it.
-As Java is available on a variety of other operating systems, these might -work, too. But again you'll be on your own.
-Prepare the system ------------------- +## Setting up the host
-CollecTor is provided by The Tor Project and can be found here: - https://dist.torproject.org/collector/ -Download the tar.gz file with the version number listed in build.xml. -The README inside the tar.gz file has all the information about CollecTor -and explains how to verify the downloaded files. +You'll need a host with at least 200G disk space and 4G RAM.
-You need a Java installation. On Debian you can just run: +In the following we'll assume that your host runs Debian stable as operating +system. CollecTor should run on any other Linux or possibly even *BSD, though +you'll be mostly on your own with those. And as Java is available on a variety +of other operating systems, those might work, too, but, again, you'll be on your +own.
-$ sudo apt-get openjdk-7-jdk +CollecTor does not require installing many or specific dependencies on the host +system. All it needs are a Java Runtime Environment version 7 or higher and an +Apache HTTP Server version 2 or higher.
-Configure the relay descriptor downloader ------------------------------------------ +The CollecTor service runs entirely under a non-privileged user account. Any +user account will do, but feel free to create a new user account just for the +CollecTor service, if you prefer.
-Run -$ java -DLOGBASE=/path/to/logs -jar collector-<version>.jar -once in order to obtain a configuration properties file. +The CollecTor service requires running in a working directory where it can store +Tor network data and state files. This working directory can be located +anywhere in the file system as long as there is enough disk space available. +The Apache service will later need to know where to find files to serve to web +clients including other CollecTor instances.
-There are quite a few options to set in collector.properties and the comments -explain their meaning. So, you can set the options to the values you want. +CollecTor does not require setting up a database.
-Create the paths you set in collector.properties. +This concludes the host setup. Later in the process you'll once more need root +privileges to configure Apache to serve CollecTor files. But until then you can +do all setup steps with the non-privileged user account.
-Example: run the relay descriptor downloader ---------------------------------------------
-This is a small example about how CollecTor is used. All the other -settings are explained in the default collector.properties. +## Setting up the service
-For running the relay descriptor downloader: +### Obtaining the code
-Edit collector.properties and set at least the following value to true: +CollecTor releases are available at:
-DownloadRelayDescriptors = true +```https://dist.torproject.org/collector/```
-$ java -DLOGBASE=/path/to/logs -jar collector-<version>.jar </place/of/collector.properties> +Choose the latest tarball and signature file, verify the signature on the +tarball, and extract the tarball in a location of your choice which will create +a subdirectory called `collector-<version>/`.
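The download, verification, and extraction steps described here could be scripted roughly as follows. This is a minimal sketch, assuming the release URLs follow the layout used in the README that this commit removes (`https://dist.torproject.org/collector/<version>/collector-<version>.tar.gz` plus a detached `.asc` signature). The function names `release_urls` and `fetch_and_verify` are hypothetical and not part of CollecTor.

```shell
#!/bin/sh
# Sketch: fetch, verify, and unpack a CollecTor release.
# Assumption: the URL layout matches the example in the removed README.

BASE_URL="https://dist.torproject.org/collector"

# Print the tarball URL and the signature URL for a version, one per line.
release_urls() {
  version="$1"
  echo "$BASE_URL/$version/collector-$version.tar.gz"
  echo "$BASE_URL/$version/collector-$version.tar.gz.asc"
}

# Download both files, verify the signature, and extract only on success.
# Hypothetical helper; it stops at the first failing step.
fetch_and_verify() {
  version="$1"
  for url in $(release_urls "$version"); do
    wget "$url" || return 1
  done
  gpg --verify "collector-$version.tar.gz.asc" "collector-$version.tar.gz" || return 1
  # Extraction creates the subdirectory collector-<version>/.
  tar xzf "collector-$version.tar.gz"
}
```

Calling `fetch_and_verify 1.0.0` in an empty directory would run all three steps. A successful `gpg --verify` only shows that the signature matches the key; whether to trust that key remains a separate decision.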
-Watch out for INFO-level logs in the log directory you configured. In -particular, the lines following "Statistics on the completeness of written -relay descriptors:" are quite important.
-In case of the unforeseen ERROR and WARN level logs should help you troubleshoot -your installation. +### Planning the service setup
-Maintenance ------------ +By default, CollecTor is configured to do nothing at all. The reason is that +new operators should first understand its capabilities and make a plan for +configuring their new CollecTor instance. Let's do that now.
-CollecTor is designed to keep running and attempts to re-run modules even -when previous runs stopped because of a problem. Thus, it is very important -to watch out for WARNING level and especially ERROR level log statements. +CollecTor consists of a background updater with an internal scheduler and +several data-collecting modules that write data to local directories which are +then served by a webserver. Each of the modules can have one or more data +sources, some public like relay descriptors served by the directory authorities +and some private like bridge descriptors uploaded to the bridge directory +authority.
-These often will point to problems you can do something about, e.g. a full disk -or missing file system permissions. +You'll have to decide which of the data-collecting modules you want to activate, +how often to execute these modules, and which data sources to collect data from.
-Logging Configuration ---------------------- +The release tarball contains an executable .jar file:
-Some hints for those who are familiar with Logback: +```collector-<version>/generated/dist/collector-<version>.jar```
-If you want to use your own logging configuration for Logback you can simply -create your own logback.xml or logback.groovy and start CollecTor in the -following way: +Copy this .jar file into the working directory and run it:
-java -cp /folder/with/logback:collector-1.0.0.jar org.torproject.collector.Main - </place/of/collector.properties> +```java -jar collector-<version>.jar``` + +CollecTor will print some text about not being able to find a configuration +file, which is understandable since there is no such file yet. It also writes a +fresh configuration file called `collector.properties` to the working directory +which contains defaults (that instruct CollecTor to do nothing). + +Read through that file to learn about all available configuration options. + + +### Performing the initial run + +When you have made a plan how to configure your CollecTor instance, edit the +`collector.properties` file, set it to run only once, activate all relevant +modules, check and possibly edit other options as needed, and save the file. +Run the Java process using: + +```java -Xmx2g -DLOGBASE=<your-log-dir> -jar collector-<version>.jar +<your-collector.properties>``` + +The option `-Xmx2g` sets the maximum heap space to 2G, which is based on the +recommended 4G total RAM size for the host. If you have more memory to spare, +feel free to adapt this option as needed. + +This may take a while, depending on which modules you activated. Read the logs +to learn if the run was successful. If it wasn't, go back to editing the +properties file and re-run the .jar file. Change the run-once option back when +you're done with the initial run of the Java process. + +Complete the initialization step by copying the shell script +`collector-<version>/src/main/resources/create-tarballs.sh` from the release +tarball to the working directory or another location of your choice, editing the +contained paths, and executing it. Note that this script will at least partly +fail if one or more modules are deactivated. + + +### Scheduling periodic runs + +The next step in setting up the CollecTor instance is to start the updater with +its internal scheduler and let it run continuously in the background. 
In order +to do so, make sure the run-once property is set to `false`, possibly adapt the +scheduling properties, and execute the .jar file using the same command as above +but this time in the background. Make sure that the same command will be run +automatically after a reboot. + +Also make sure that the `create-tarballs.sh` script will be executed at least +every three days, but no more than once per day. + +### Setting up the website + +The last remaining part in the setup process is to make the collected data +available. Copy the contents from `collector-<version>/src/main/webapp/*` in +the release tarball to a web application subdirectory in the working directory +or another location of your choice. + +Configure an Apache site that uses redirects or symbolic links to serve the +following directories or files in your working directory (where paths in <> +refer to settings in `collector.properties`): + + * `<your-webapp-dir>/*`, + * `<ArchivePath>`, + * `<IndexPath>`, and + * `<RecentPath>`. + +Use your browser to make sure that your instance serves the web pages and data +that you'd expect. + + +## Maintaining the service + +### Monitoring the service + +The most important information about your CollecTor instance is whether it is +alive. Otherwise, if it dies and you don't notice, you might be losing data +that is not available at the data sources anymore. You should set up a +notification mechanism of your choice to be informed quickly when the background +updater dies. + +Other than fatal issues, a good source for learning about issues with your +CollecTor instance is its logs. Be sure to read the logs every now and then, +and look out for warnings and errors. Maybe set up another notification to be +informed quickly of new warnings or errors. 
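The two monitoring checks described in this section (is the updater alive, and do the logs contain warnings or errors) could be sketched as small shell functions for a cron job or monitoring agent to call. This is only an illustration under two assumptions: log lines contain the level names `WARN` or `ERROR`, and the updater's command line contains the .jar file name. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: two checks that a cron job or monitoring agent could call.
# Assumptions: log lines carry the level names WARN or ERROR, and the
# updater's command line contains the name of the .jar file.

# Print the number of lines in the given log file that mention WARN or ERROR.
count_log_problems() {
  grep -c -E 'WARN|ERROR' "$1"
}

# Succeed if a process whose full command line matches the pattern is running.
updater_alive() {
  pgrep -f "$1" > /dev/null
}
```

A periodic job might call `updater_alive 'collector-.*\.jar'` and send a notification when it fails, and compare the output of `count_log_problems` against the value from the previous run. How the notification is delivered is left to the operator.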
+ + +### Changing logging options + +CollecTor uses Logback for logging and comes with a default logging +configuration that logs on info level and that creates a common log file that +rotates once per day and a separate log file per module. If you want to change +logging options, copy the default logging configuration from +`collector-<version>/src/main/resources/logback.xml` to your working directory, +edit your copy, and execute the .jar file as follows: + +```java -Xmx2g -DLOGBASE=<your-log-dir> -cp .:collector-<version>.jar +org.torproject.collector.Main``` + +Internally, CollecTor uses the Simple Logging Facade for Java (SLF4J) and ships +with the Logback implementation for SLF4J. If you prefer a different logging +framework, you can provide and use that instead. For more detailed information, +or if you have different logging needs, please refer to the [Logback +documentation](http://logback.qos.ch/), and for switching to a different +framework to the [SLF4J website](http://www.slf4j.org/). + + +### Changing configuration options + +If you need to reconfigure your CollecTor instance, you may be able to do that +without stopping and restarting the Java process. Scheduling settings are +exempt from this, but all general and module settings may be changed at +run-time. Just edit the config file, and the changes will become effective in +the next execution of a module. Changes to the scheduler, however, require +stopping and restarting the Java update process. + + +### Stopping the service (gracefully) + +If you need to stop the background updater for some reason, like rebooting the +host, there is a way to do that gracefully: kill the Java process, and a +shutdown hook will stop the internal scheduler and wait for up to 10 minutes (or +whatever amount of time is configured) for all currently running updates to be +finished. However, if you must stop the process immediately, use `kill -9`, +though you might have to clean up manually. 
You should try to avoid rebooting +while tarballs are being created. + + +### Upgrading and downgrading + +If you need to upgrade to a newer release or downgrade to a previous release, +download that tarball and extract it, and copy over the executable .jar file and +the `create-tarballs.sh` script in case it has changed. Stop the current +service version as described above, possibly adapt your `collector.properties` +file as necessary, and restart the Java process using the new .jar file. Don't +forget to update the version number in the command that ensures that the .jar +file gets executed automatically after a reboot. Watch the logs to see if the +upgrade or downgrade was successful. + + +### Backing up data and settings + +A backup of your CollecTor instance should include the <ArchivePath> and your +configuration, which would enable you to set up this instance again. A backup +for short term recovery would also include the more volatile data in +<StatsPath>, <RecentPath>, and <OutputPath>. + + +### Performing recurring tasks + +Most of CollecTor is designed to just run in the background forever. However, +some parts still require manual housekeeping every month or two: You'll need to +clean up data from `<OutputPath>` as configured in `collector.properties` when +you're certain that the data is contained in tarballs and contained in backups. +Likewise, you'll have to delete old files from `<BridgeLocalOrigins>`, in case +that is being used, where CollecTor only reads and never writes or deletes. + + +### Resolving common issues + +Unfortunately, CollecTor still runs into issues from time to time, and some of +these issues require a human being to decide whether they're harmless or require +intervention by the operator. + +The most common issue these days is a warning about missing too many referenced +descriptors, which may even be true but which is typically not an operations +issue. 
+ +A lot less frequently, the bridgedesc module reports unrecognized lines in +non-sanitized bridge descriptors which, if true, requires developing and +deploying a patch. And sometimes the bridgedesc module complains about stale +input data, which requires fixing the bridge authority or the sync mechanism to +the CollecTor host. + +Another minor issue is that files in `<OutputPath>` may change while tarballs +are being created, which is usually safe to ignore. + +There's another frequent error message where CollecTor complains about not being +able to fetch a remote file during the sync process. This error message is +usually harmless and can be ignored. + +But let's hope that you won't run into any of these issues or at least not +frequently. Enjoy your new CollecTor instance!
-The default configuration can be found in the tar-ball you downloaded, in -the subdirectory collector-1.0.0/src/main/resources. \ No newline at end of file diff --git a/README.md b/README.md deleted file mode 100644 index b5b3e33..0000000 --- a/README.md +++ /dev/null @@ -1,62 +0,0 @@ -CollecTor -- The friendly data-collecting service in the Tor network -==================================================================== - -CollecTor fetches data from various nodes and services in the public -Tor network and makes it available to the world. - -Verifying releases ------------------- - -Releases can be cryptographically verified to get some more confidence that -they were put together by a Tor developer. The following steps explain the -verification process by example. - -Download the release tarball and the separate signature file: - -``` -wget https://dist.torproject.org/collector/1.0.0/collector-1.0.0.tar.gz -wget https://dist.torproject.org/collector/1.0.0/collector-1.0.0.tar.gz.asc -``` - -Attempt to verify the signature on the tarball: - -``` -gpg --verify collector-1.0.0.tar.gz.asc -``` - -If the signature cannot be verified due to the public key of the signer -not being locally available, download that public key from one of the key -servers and retry: - -``` -gpg --keyserver pgp.mit.edu --recv-key 0x4EFD4FDC3F46D41E -gpg --verify collector-1.0.0.tar.gz.asc -``` - -If the signature still cannot be verified, something is wrong! - -But note that even if it can be verified, you now only know that the -signature was made by the person claiming to own this key, which could be -anyone. You'll need a trust path to the owner of this key in order to -trust this signature, but that's clearly out of scope here. In short, -your best chance is to meet a Tor developer in real life and enter the web -of trust. - -If you want to go one step further in the verification game, you can -verify the signature on the .jar files. 
- -Print and then import the provided X.509 certificate: - -``` -keytool -printcert -file CERT -keytool -importcert -alias karsten -file CERT -``` - -Verify the signatures on the contained .jar files using Java's jarsigner -tool: - -``` -jarsigner -verify collector-1.0.0.jar -jarsigner -verify collector-1.0.0-sources.jar -``` - diff --git a/src/main/resources/create-tarballs.sh b/src/main/resources/create-tarballs.sh index de05b30..4b6aa57 100755 --- a/src/main/resources/create-tarballs.sh +++ b/src/main/resources/create-tarballs.sh @@ -24,9 +24,7 @@ YEARTWO=`date --date='7 days ago' +%Y` MONTHTWO=`date --date='7 days ago' +%m` CURRENTPATH=`pwd`
-if ! test -d $WORKDIR - then mkdir $WORKDIR -fi +mkdir -p $WORKDIR
cd $WORKDIR
@@ -35,10 +33,7 @@ if ! test -d $OUTDIR exit 1 fi
-if ! test -d $TARBALLTARGETDIR - then echo "$TARBALLTARGETDIR doesn't exist. Exiting." - exit 1 -fi +mkdir -p $TARBALLTARGETDIR
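Replacing the `test -d` guards with `mkdir -p` relies on `mkdir -p` succeeding when the directory already exists. A minimal sketch of that idiom follows, with the argument quoted so that paths containing spaces also work. `ensure_dir` is a hypothetical helper for illustration; the script itself calls `mkdir -p` directly.

```shell
#!/bin/sh
# Hypothetical helper illustrating the idiom; create-tarballs.sh itself
# calls mkdir -p directly on $WORKDIR and $TARBALLTARGETDIR.

# Create the directory and any missing parents; a no-op if it already exists.
ensure_dir() {
  mkdir -p "$1" || {
    echo "Cannot create $1. Exiting." >&2
    return 1
  }
}
```

Note that with this change the script no longer exits when `$TARBALLTARGETDIR` is missing; it creates the directory instead.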
TARBALLS=( exit-list-$YEARONE-$MONTHONE