commit 94c7158834ad5068c101f9b1ec95408ed1d46cd7
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Sat May 16 11:40:25 2015 +0200

    Rewrite documentation.
---
 .gitignore     |   2 -
 INSTALL.md     |  80 +++++++++
 README         |   7 -
 doc/manual.pdf | Bin 156056 -> 0 bytes
 doc/manual.tex | 548 --------------------------------------------------------
 5 files changed, 80 insertions(+), 557 deletions(-)
diff --git a/.gitignore b/.gitignore
index 10681a4..f7d045e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,8 +5,6 @@ config
 data/
 in/
 out/
-doc/manual.aux
-doc/manual.log
 log/
 rsync/
 stats/
diff --git a/INSTALL.md b/INSTALL.md
new file mode 100644
index 0000000..05db7ba
--- /dev/null
+++ b/INSTALL.md
@@ -0,0 +1,80 @@
+CollecTor -- Operator's Guide
+=============================
+
+Welcome to the Operator's Guide of CollecTor. This guide explains how
+to set up a new CollecTor instance to download relay descriptors from the
+Tor directory authorities.
+
+
+Requirements
+------------
+
+You'll need a Linux host with at least 50G disk space and 2G RAM.
+
+In the following we'll assume that the host runs Debian stable as
+operating system, but it should work on any other Linux or possibly even
+*BSD. Though you'll be mostly on your own with those.
+
+
+Prepare the system
+------------------
+
+Create a working directory for CollecTor. In this guide, we'll assume
+that you're using `/srv/collector.torproject.org/` as working directory,
+but feel free to use another directory that better suits your needs.
+
+$ sudo mkdir -p /srv/collector.torproject.org/
+$ sudo chown vagrant:vagrant /srv/collector.torproject.org/
+
+Install a few packages:
+
+$ sudo apt-get install openjdk-6-jdk ant libcommons-codec-java \
+  libcommons-compress-java
+
+
+Clone the metrics-db repository
+-------------------------------
+
+$ cd /srv/collector.torproject.org/
+$ git clone https://git.torproject.org/metrics-db
+
+
+Clone required submodule metrics-lib
+------------------------------------
+
+$ git submodule init
+$ git submodule update
+
+
+Compile CollecTor
+-----------------
+
+$ ant compile
+
+
+Configure the relay descriptor downloader
+-----------------------------------------
+
+Edit the config file and uncomment and edit at least the following line:
+
+DownloadRelayDescriptors 1
+
+
+Run the relay descriptor downloader
+-----------------------------------
+
+$ bin/run-relaydescs
+
+
+Set up an hourly cronjob for the relay descriptor downloader
+------------------------------------------------------------
+
+Ideally, run the relay descriptor downloader once per hour by adding a
+crontab entry like the following:
+
+6 * * * * cd /srv/collector.torproject.org/db/ && bin/run-relaydescs
+
+Watch out for INFO-level logs in the `log/` directory. In particular, the
+lines following "Statistics on the completeness of written relay
+descriptors:" is quite important.
+
diff --git a/README b/README
deleted file mode 100644
index 97a7a7b..0000000
--- a/README
+++ /dev/null
@@ -1,7 +0,0 @@
-ERNIE is the Enhanced R-based tor Network Intelligence Engine
- (why ERNIE? because nobody liked BIRT; sorry for misspelling Tor)
-
----------------------------------------------------------------------------
-
-Please find documentation in doc/ .
-
diff --git a/doc/manual.pdf b/doc/manual.pdf
deleted file mode 100644
index 4c8b375..0000000
Binary files a/doc/manual.pdf and /dev/null differ
diff --git a/doc/manual.tex b/doc/manual.tex
deleted file mode 100644
index e50b1e4..0000000
--- a/doc/manual.tex
+++ /dev/null
@@ -1,548 +0,0 @@
-\documentclass{article}
-\begin{document}
-\title{ERNIE: a tool to study the Tor network\-- User's Guide --}
-\author{by Karsten Loesing \texttt{karsten@torproject.org}}
-\maketitle
-
-\section{Overview}
-
-Welcome to ERNIE!
-ERNIE is a tool to study the Tor network.
-ERNIE has been designed to process all kinds of data about the Tor network
-and visualize them or prepare them for further analysis.
-ERNIE is also the software behind the Tor Metrics Portal
-\verb+http://metrics.torproject.org/+.
-
-The acronym ERNIE stands for the \emph{Enhanced R-based tor Network
-Intelligence Engine} (sorry for misspelling Tor).
-Why ERNIE?
-Because nobody liked BIRT (Business Intelligence and Reporting Tools) that
-we used for visualizing statistics about the Tor network before writing
-our own software.
-By the way, reasons were that BIRT made certain people's browsers crash
-and requires JavaScript that most Tor user have turned off.
-
-If you want to learn more about the Tor network, regardless of whether you
-want to present your findings on a website (like ERNIE does) or include
-them in your next Tor paper, this user's guide is for you!
-
-\section{Installation instructions}
-
-ERNIE depends on various other software tools. ERNIE is developed in a
-\emph{Git} repository which is currently the only way to download it.
-ERNIE uses \emph{Java} for parsing data, \emph{R} for plotting graphs,
-and \emph{PostgreSQL} for importing data into a database.
-Which of these tools you need depends on what tasks you are planning to
-use ERNIE for.
-In most cases it is not required to install all these tools.
-For this tutorial, we assume Debian GNU/Linux 5.0 as operating system.
-Installation instructions for other platforms may vary.
-
-\subsection{Git 1.5.6.5}
-
-Currently, the only way to download ERNIE is to clone its Git branch.
-
-Install Git 1.5.6.5 (or higher) and check that it's working:
-
-\begin{verbatim}
-$ sudo apt-get install git-core
-$ git --version
-\end{verbatim}
-
-\subsection{Java 6}
-
-ERNIE requires Java to parse data from various data sources and write them
-to one or more data sinks. Java is required for most use cases of ERNIE.
-
-Add the non-free repository to the apt sources in
-\verb+/etc/apt/sources.list+ by changing the line (mirror URL may vary):
-
-\begin{verbatim}
-deb http://ftp.ca.debian.org/debian/ lenny main
-\end{verbatim}
-
-to
-
-\begin{verbatim}
-deb http://ftp.ca.debian.org/debian/ lenny main non-free
-\end{verbatim}
-
-Fetch the package list, install Sun Java 6, and set it as system default:
-
-\begin{verbatim}
-$ sudo apt-get update
-$ sudo apt-get install sun-java6-jdk
-$ sudo update-alternatives --set java \
-    /usr/lib/jvm/java-6-sun/jre/bin/java
-$ sudo update-alternatives --set javac \
-    /usr/lib/jvm/java-6-sun/bin/javac
-\end{verbatim}
-
-Check that Java 6 is installed and selected as default:
-
-\begin{verbatim}
-$ java -version
-$ javac -version
-\end{verbatim}
-
-\subsection{Ant 1.7}
-
-ERNIE comes with an Ant build file that facilitates common build tasks.
-If you want to use Ant to build and run ERNIE, install Ant and check its
-installed version (tested with 1.7):
-
-\begin{verbatim}
-$ sudo apt-get install ant
-$ ant -version
-\end{verbatim}
-
-\subsection{R 2.8 and ggplot2}
-
-ERNIE uses R and the R library \emph{ggplot2} to visualize anaylsis
-results for presentation on a website or for inclusion in publications.
-ggplot2 requires at least R version 2.8 to be installed.
-
-Add a new line to \verb+/etc/apt/sources.list+:
-
-\begin{verbatim}
-deb http://cran.cnr.berkeley.edu/bin/linux/debian lenny-cran/
-\end{verbatim}
-
-Download the package maintainer's public key (``Johannes Ranke (CRAN Debian
-archive) $<$jranke@uni-bremen.de$>$''):
-
-\begin{verbatim}
-$ gpg --keyserver pgpkeys.pca.dfn.de --recv-key 381BA480
-$ gpg --export 381BA480 | sudo apt-key add -
-\end{verbatim}
-
-Install the most recent R version:
-
-\begin{verbatim}
-$ sudo apt-get update
-$ sudo apt-get -t unstable install r-base
-\end{verbatim}
-
-Start R to check its version (must be 2.8 or higher) and install ggplot2.
-Do this as root, so that the installed package is available to all system
-users:
-
-\begin{verbatim}
-$ sudo R
-> install.packages("ggplot2")
-> q()
-\end{verbatim}
-
-Confirm that R and ggplot2 are installed:
-
-\begin{verbatim}
-$ R
-> library(ggplot2)
-> q()
-\end{verbatim}
-
-\subsection{PostgreSQL 8.3}
-\label{sec-install-postgres}
-
-ERNIE uses PostgreSQL to import data into a database for later analysis.
-This feature is not required for most use cases of ERNIE, but only for
-people who prefer having the network data in a database to execute custom
-queries.
-
-Install PostgreSQL 8.3 using apt-get:
-
-\begin{verbatim}
-$ sudo apt-get install postgresql-8.3
-\end{verbatim}
-
-Create a new database user \verb+ernie+ to insert data and run queries.
-This command is executed as unix user \verb+postgres+ and therefore as
-database superuser \verb+postgres+ via ident authentication. The
-\verb+-P+ flag issues a password prompt for the new user.
-There is no need to give the new user superuser privileges or allow it to
-create databases or new roles.
-
-\begin{verbatim}
-$ sudo -u postgres createuser -P ernie
-\end{verbatim}
-
-Create a new database schema \verb+tordir+ owned by user \verb+ernie+
-(using option \verb+-O+).
-Again, this command is executed as \verb+postgres+ system user to make use
-of ident authentication.
-
-\begin{verbatim}
-$ sudo -u postgres createdb -O ernie tordir
-\end{verbatim}
-
-Log into the database schema as user \verb+ernie+ to check that it's
-working.
-This time, ident authentication is not available, since there is no system
-user \verb+ernie+.
-Instead, we will use password authentication via a TCP connection to
-localhost (using option \verb+-h+) as database user \verb+ernie+ (using
-option \verb+-U+).
-
-\begin{verbatim}
-$ psql -h localhost -U ernie tordir
-tordir=> \q
-\end{verbatim}
-
-\subsection{ERNIE}
-
-Finally, you can install ERNIE by cloning its Git branch:
-
-\begin{verbatim}
-$ git clone git://git.torproject.org/ernie
-\end{verbatim}
-
-This command should create a directory \verb+ernie/+ which we will
-consider the working directory of ERNIE.
-
-\section{Getting started with ERNIE}
-
-The ERNIE project was started as a simple tool to parse Tor relay
-descriptors and plot graphs on Tor network usage for a website.
-Since then, ERNIE has grown to a tool that can process all kinds of Tor
-network data for various purposes, including but not limited to
-visualization.
-
-We think that the easiest way to get started with ERNIE is to walk through
-typical use cases in a tutorial style and explain what is required to set
-up ERNIE.
-These use cases have been chosen from what we think are typical
-applications of ERNIE.
-
-\subsection{Visualizing network statistics}
-
-{\it Write me.}
-
-\subsection{Importing relay descriptors into a database}
-
-As of February 2010, the relays and directories in the Tor network
-generate more than 1 GB of descriptors every month.
-There are two approaches to process these amounts of data:
-extract only the relevant data for the analysis and write them to files,
-or import all data to a database and run queries on the database.
-ERNIE currently takes the file-based approach for the Metrics Portal,
-which works great for standardized analyses.
-But the more flexible way to research the Tor network is to work with a
-database.
-
-This tutorial describes how to import relay descriptors into a database
-and run a few example queries.
-Note that the presented database schema is limited to answering basic
-questions about the Tor network.
-In order to answer more complex questions, one would have to extend the
-database schema and Java classes which is sketched at the end of this
-tutorial.
-
-\subsubsection{Preparing database for data import}
-
-The first step in importing relay descriptors into a database is to
-install a database management system.
-See Section \ref{sec-install-postgres} for installation instructions of
-PostgreSQL 8.3 on Debian GNU/Linux 5.0.
-Note that in theory, any other relational database that has a working JDBC
-4 driver should work, too, possibly with minor modifications to ERNIE.
-
-Import the database schema from file \verb+db/tordir.sql+ containing two
-tables that we need for importing relay descriptors plus two indexes to
-accelerate queries. Check that tables have been created using \verb+\dt+.
-You should see a list containing the two tables \verb+descriptor+ and
-\verb+statusentry+.
-
-\begin{verbatim}
-$ psql -h localhost -U ernie -f db/tordir.sql tordir
-$ psql -h localhost -U ernie tordir
-tordir=> \dt
-tordir=> \q
-\end{verbatim}
-
-A row in the \verb+statusentry+ table contains the information that a
-given relay (that has published the server descriptor with ID
-\verb+descriptor+) was contained in the network status consensus published
-at time \verb+validafter+.
-These two fields uniquely identify a row in the \verb+statusentry+ table.
-The other fields contain boolean values for the flags that the directory
-authorities assigned to the relay in this consensus, e.g., the Exit flag
-in \verb+isexit+.
-Note that for the 24 network status consensuses of a given day with each
-of them containing 2000 relays, there will be $24 \times 2000$ rows in the
-\verb+statusentry+ table.
-
-The \verb+descriptor+ table contains some portion of the information that
-a relay includes in its server descriptor.
-Descriptors are identified by the \verb+descriptor+ field which
-corresponds to the \verb+descriptor+ field in the \verb+statusentry+
-table.
-The other fields contain further data of the server descriptor that might
-be relevant for analyses, e.g., the platform line with the Tor software
-version and operating system of the relay.
-
-Obviously, this data schema doesn't match everyone's needs.
-See the instructions below for extending ERNIE to import other data into
-the database.
-
-\subsubsection{Downloading relay descriptors from the metrics website}
-
-In the next step you will probably want to download relay descriptors from
-the metrics website
-\verb+http://metrics.torproject.org/data.html#relaydesc+.
-Download the \verb+v3 consensuses+ and/or \verb+server descriptors+ of the
-months you want to analyze.
-The server descriptors are the documents that relays publish at least
-every 18 hours describing their capabilities, whereas the v3 consensuses
-are views of the directory authorities on the available relays at a given
-time.
-For this tutorial you need both v3 consensuses and server descriptors.
-You might want to start with a single month of data, experiment with it,
-and import more data later on.
-Extract the tarballs to a new directory \verb+archives/+ in the ERNIE
-working directory.
-
-\subsubsection{Configuring ERNIE to import relay descriptors into a
-database}
-
-ERNIE can be used to read data from one or more data sources and write
-them to one or more data sinks.
-You need to configure ERNIE so that it knows to use the downloaded relay
-descriptors as data source and the database as data sink.
-Add the following two lines to your \verb+config+ file:
-
-\begin{verbatim}
-ImportDirectoryArchives 1
-WriteRelayDescriptorDatabase 1
-\end{verbatim}
-
-You further need to provide the JDBC string that ERNIE shall use to access
-the database schema \verb+tordir+ that we created above.
-The config option with the JDBC string for a local PostgreSQL database
-might be (without line break):
-
-\begin{verbatim}
-RelayDescriptorDatabaseJDBC
-  jdbc:postgresql://localhost/tordir?user=ernie&password=password
-\end{verbatim}
-
-\subsubsection{Importing relay descriptors using ERNIE}
-
-Now you are ready to actually import relay descriptors using ERNIE.
-Create a directory for Java class files, compile the Java source files,
-and run ERNIE. All these steps are performed by the default target in the
-provided Ant task.
-
-\begin{verbatim}
-$ ant
-\end{verbatim}
-
-Note that the import process might take between a few minutes and an hour,
-depending on your hardware.
-You will notice that ERNIE doesn't write progress messages to the standard
-output, which is useful for unattended installations with only warnings
-being mailed out by cron.
-You can change this behavior and make messages on the standard output more
-verbose by setting
-\verb+java.util.logging.ConsoleHandler.level+ in
-\verb+logging.properties+ to \verb+INFO+ or \verb+FINE+.
-Alternately, you can look at the log file \verb+log.0+ that is created by
-ERNIE.
-
-If ERNIE finishes after a few seconds, you have probably put the relay
-descriptors at the wrong place.
-Make sure that you extract the relay descriptors to sub directories of
-\verb+archives/+ in the ERNIE working directory.
-
-If you interrupt ERNIE, or if ERNIE terminates uncleanly for some reason,
-you will have problems starting it the next time.
-ERNIE uses a local lock file called \verb+lock+ to make sure that only a
-single instance of ERNIE is running at a time.
-If you are sure that the last ERNIE instance isn't running anymore, you
-can delete the lock file and start ERNIE again.
-
-If all goes well, you should now have the relay descriptors of 1 month in
-your database.
-
-\subsubsection{Example queries}
-
-In this tutorial, we want to give you a few examples for using the
-database schema with the imported relay descriptors to extract some useful
-statistics about the Tor network.
-
-In the first example we want to find out how many relays have been running
-on average per day and how many of these relays were exit relays.
-We only need the \verb+statusentry+ table for this evaluation, because
-the information we are interested in is contained in the network status
-consensuses.
-
-The SQL statement that we need for this evaluation consists of two parts:
-First, we find out how many network status consensuses have been published
-on any given day.
-Second, we count all relays and those with the Exit flag and divide these
-numbers by the number of network status consensuses per day.
-
-\begin{verbatim}
-$ psql -h localhost -U ernie tordir
-tordir=> SELECT DATE(validafter),
-    COUNT(*) / relay_statuses_per_day.count AS avg_running,
-    SUM(CASE WHEN isexit IS TRUE THEN 1 ELSE 0 END) /
-      relay_statuses_per_day.count AS avg_exit
-  FROM statusentry,
-    (SELECT COUNT(*) AS count, DATE(validafter) AS date
-      FROM (SELECT DISTINCT validafter FROM statusentry)
-        distinct_consensuses
-      GROUP BY DATE(validafter)) relay_statuses_per_day
-  WHERE DATE(validafter) = relay_statuses_per_day.date
-  GROUP BY DATE(validafter), relay_statuses_per_day.count
-  ORDER BY DATE(validafter);
-tordir=> \q
-\end{verbatim}
-
-Executing this query should finish within a few seconds to one minute,
-again depending on your hardware.
-The result might start like this (truncated here):
-
-\begin{verbatim}
-    date    | avg_running | avg_exit
-------------+-------------+----------
- 2010-02-01 |        1583 |      627
- 2010-02-02 |        1596 |      638
- 2010-02-03 |        1600 |      654
-:
-\end{verbatim}
-
-In the second example we want to find out what Tor software versions the
-relays have been running.
-More precisely, we want to know how many relays have been running what Tor
-versions on micro version granularity (e.g., 0.2.2) on average per day?
-
-We need to combine network status consensuses with server descriptors to
-find out this information, because the version information is not
-contained in the consensuses (or at least, it's optional to be contained
-in there; and after all, this is just an example).
-Note that we cannot focus on server descriptors only and leave out the
-consensuses for this analysis, because we want our analysis to be limited
-to running relays as confirmed by the directory authorities and not
-include all descriptors that happened to be published at a given day.
-
-The SQL statement again determines the number of consensuses per day in a
-sub query.
-In the next step, we join the \verb+statusentry+ table with the
-\verb+descriptor+ table for all rows contained in the \verb+statusentry+
-table.
-The left join means that we include \verb+statusentry+ rows even if we do
-not have corresponding lines in the \verb+descriptor+ table.
-We determine the version by skipping the first 4 characters of the platform
-string that should contain \verb+"Tor "+ (without quotes) and cutting off
-after another 5 characters.
-Obviously, this approach is prone to errors if the platform line format
-changes, but it should be sufficient for this example.
-
-\begin{verbatim}
-$ psql -h localhost -U ernie tordir
-tordir=> SELECT DATE(validafter) AS date,
-    SUBSTRING(platform, 5, 5) AS version,
-    COUNT(*) / relay_statuses_per_day.count AS count
-  FROM
-    (SELECT COUNT(*) AS count, DATE(validafter) AS date
-      FROM (SELECT DISTINCT validafter
-        FROM statusentry) distinct_consensuses
-      GROUP BY DATE(validafter)) relay_statuses_per_day
-  JOIN statusentry
-    ON relay_statuses_per_day.date = DATE(validafter)
-  LEFT JOIN descriptor
-    ON statusentry.descriptor = descriptor.descriptor
-  GROUP BY DATE(validafter), SUBSTRING(platform, 5, 5),
-    relay_statuses_per_day.count, relay_statuses_per_day.date
-  ORDER BY DATE(validafter), SUBSTRING(platform, 5, 5);
-tordir=> \q
-\end{verbatim}
-
-Running this query takes longer than the first query, which can be a few
-minutes to half an hour.
-The main reason is that joining the two tables is an expensive database
-operation.
-If you plan to perform many evaluations like this one, you might want to
-create a third table that holds the results of joining the two tables of
-this tutorial.
-Creating such a table to speed up queries is not specific to ERNIE and
-beyond the scope of this tutorial.
-
-The (truncated) result of the query might look like this:
-
-\begin{verbatim}
-    date    | version | count
-------------+---------+-------
- 2010-02-01 | 0.1.2   |    10
- 2010-02-01 | 0.2.0   |   217
- 2010-02-01 | 0.2.1   |   774
- 2010-02-01 | 0.2.2   |    75
- 2010-02-01 |         |   505
- 2010-02-02 | 0.1.2   |    14
- 2010-02-02 | 0.2.0   |   328
- 2010-02-02 | 0.2.1   |  1143
- 2010-02-02 | 0.2.2   |   110
-:
-\end{verbatim}
-
-Note that, in the fifth line, we are missing the server descriptors of 505
-relays contained in network status consensuses published on 2010-02-01.
-If you want to avoid such missing values, you'll have to import the server
-descriptors of the previous month, too.
-
-\subsubsection{Extending ERNIE to import further data into the database}
-
-In this tutorial we have explained how to prepare a database, download
-relay descriptors, configure ERNIE, import the descriptors, and execute
-example queries.
-This description is limited to a few examples by the very nature of a
-tutorial.
-If you want to extend ERNIE to import further data into your database,
-you will have to perform at least two steps:
-extend the database schema and modify the Java classes used for parsing.
-
-The first step, extending the database schema, is not specific to ERNIE.
-Just add the fields and tables to the schema definition.
-
-The second step, modifying the Java classes used for parsing, is of course
-specific to ERNIE.
-You will have to look at two classes in particular:
-The first class, \verb+RelayDescriptorDatabaseImporter+, contains the
-prepared statements and methods used to add network status consensus
-entries and server descriptors to the database.
-The second class, \verb+RelayDescriptorParser+, contains the parsing logic
-for the relay descriptors and decides what information to add to the
-database, among other things.
-
-This ends the tutorial on importing relay descriptors into a database.
-Happy researching!
-
-\subsection{Aggregating relay and bridge descriptors}
-
-{\it Write me.}
-
-\section{Software architecture}
-
-{\it Write me. In particular, include overview of components:
-
-\begin{itemize}
-\item Data sources and data sinks
-\item Java classes with data sources and data sinks
-\item R scripts to process CSV output
-\item Website
-\end{itemize}
-}
-
-\section{Tor Metrics Portal setup}
-
-{\it
-Write me. In particular, include documentation of deployed ERNIE that
-runs the metrics website.
-This documentation has two purposes:
-First, a reference setup can help others creating their own ERNIE
-configuration that goes beyond the use cases as described above.
-Second, we need to remember how things are configured anyway, so we can
-as well document them here.}
-
-\end{document}
-