commit bf314c1aa2900eced290701c6664f09cc73a8825
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Wed Jun 8 14:43:33 2011 +0200

    Write an actually useful README.
---
 README         |  527 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 doc/rserve.pdf |  Bin 99059 -> 0 bytes
 doc/rserve.tex |  140 ---------------
 3 files changed, 523 insertions(+), 144 deletions(-)

diff --git a/README b/README
index 93315ee..0786852 100644
--- a/README
+++ b/README
@@ -1,7 +1,526 @@
-ERNIE is the Enhanced R-based tor Network Intelligence Engine
- (why ERNIE? because nobody liked BIRT; sorry for misspelling Tor)
+Tor Metrics Database and Website
+================================
----------------------------------------------------------------------------
+The metrics database stores publicly available data about the Tor network
+which are visualized by the metrics website.
-Please find documentation on ERNIE in website/ernie-howto.html and doc/ .
+This software package, metrics-web, contains (1) the code to import Tor
+network data into a database, (2) the code to generate graphs and .CSV
+output, and (3) the code for a dynamic web application.  metrics-web is
+based on Java, Ant, PostgreSQL, R, Apache HTTP Server, and Apache Tomcat.
+
+This README explains all necessary steps to install metrics-web including
+the database (Section 1), the graphing engine (Section 2), and the web
+application (Section 3).  It is possible to install only the database part
+or only the database and the graphing engine, if desired.
+
+
+1.  Installing the metrics database
+==================================
+
+The metrics database contains data about the Tor Network coming from
+different sources, including the Tor directory authorities, Torperf
+performance measurement installations, the GetTor software package
+delivery service, and others.
+
+
+1.1.  Preparing the operating system
+-----------------------------------
+
+This README describes the steps for installing metrics-web on a Debian
+GNU/Linux Squeeze server.  Instructions for other operating systems may
+vary.
+
+In the following it is assumed that root privileges are available.
+Commands requiring root privileges will be prefixed with # below.
+
+Start by adding a metrics user that will be used to execute all commands
+that do not require root privileges.  These commands will be prefixed with
+$ below.
+
+# adduser metrics
+
+The database importer and website sources will be installed in
+/srv/metrics-web/ that is created as follows:
+
+# mkdir /srv/metrics-web/
+# chmod g+ws /srv/metrics-web/
+# chown metrics:metrics /srv/metrics-web/
+
+Either extract the metrics-web source tarball...
+
+$ tar xf metrics-web-x.y.z.tar /srv/metrics-web/
+
+... or clone the metrics-web Git repository:
+
+$ git clone git://git.torproject.org/metrics-web /srv/metrics-web/
+
+Install Sun Java 6, Ant 1.8, and PostgreSQL 8.4 that are necessary for
+setting up the metrics database (be sure to include Debian's non-free
+repository in /etc/apt/sources.list).
+
+# apt-get install sun-java6-jdk ant postgresql-8.4
+
+Make Sun's Java the default.
+
+# update-java-alternatives -s java-6-sun
+
+Check the versions of the newly installed tools.
+
+$ java -version
+java version "1.6.0_24"
+Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
+Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
+
+$ ant -version
+Apache Ant version 1.8.0 compiled on March 11 2010
+
+$ psql --version
+psql (PostgreSQL) 8.4.7
+contains support for command-line editing
+
+
+1.2.  Configuring the database
+=============================
+
+The first step in setting up the metrics database is to configure the
+PostgreSQL database and import a database schema.
+
+Start by creating a new metrics database user.  There is no need to give
+the metrics user superuser privileges or allow it to create databases or
+new roles.
+
+# sudo -u postgres createuser -P metrics
+
+Create a new database tordir owned by user metrics.
+
+# sudo -u postgres createdb -O metrics tordir
+
+Import the metrics database schema.
+
+$ psql -f /srv/metrics-web/db/tordir.sql tordir
+
+Confirm that the database now contains tables to hold metrics data.  In
+the following, => will be used as the database prompt.
+
+$ psql tordir
+=> \dt+
+=> \q
+
+
+1.3.  Importing relay descriptor tarballs
+========================================
+
+In most cases it makes sense to populate the metrics database with
+archived relay descriptors from the official metrics website.
+
+Download the relay descriptor tarballs from the metrics website at
+https://metrics.torproject.org/data.html#relaydesc and extract them to
+/srv/metrics-web/archives/ .  The database importer can process v3 votes,
+v3 consensuses, server descriptors, and extra-infos.
+
+Edit the config file ~/metrics-web/config (or create it if it's not there)
+to contain the following five lines (be sure to remove the linebreak in
+the line defining the JDBC string and insert the real password there):
+
+ImportDirectoryArchives 1
+DirectoryArchivesDirectory archives/
+KeepDirectoryArchiveImportHistory 1
+WriteRelayDescriptorDatabase 1
+RelayDescriptorDatabaseJDBC
+  jdbc:postgresql://localhost/tordir?user=metrics&password=password
+
+Compile and run the Java database importer.
+
+$ cd /srv/metrics-web/
+$ ./run.sh
+
+The database import will take a while.  Once it's complete, check that the
+database tables now contain metrics data:
+
+$ psql tordir
+=> \dt+
+=> \q
+
+It's safe to delete the relay descriptor files in ~/metrics-web/archives/
+once they are imported.
+
+An alternative to importing relay descriptor tarballs directly into the
+database is to convert them into a data format that psql's \copy command
+can process.  Look for the config option WriteRelayDescriptorsRawFiles in
+/srv/metrics-web/config.template for more information on this experimental
+feature.
+
+
+1.4.  Importing relay descriptors from a local Tor data directory
+================================================================
+
+WARNING: The functions described in this section are not implemented yet!
+
+In a future version of metrics-web, the metrics database importer will be
+able to import the cached descriptors from a local Tor data directory.
+(A special case of importing descriptors from a continuously updated
+directory is when both metrics-db and metrics-web are run on the same
+machine, but this shouldn't be the general case.)
+
+Configure a local Tor client to fetch all known descriptors as early as
+possible by adding these config options to its torrc file:
+
+FetchUselessDescriptors 1
+FetchDirInfoExtraEarly 1
+
+Tell the metrics database importer where to find the cached descriptor
+files.  One way to achieve this is to add symbolic links to
+/srv/metrics-web/archives/ like this.  Tor's data directory is assumed to
+be /srv/tor/ here.
+
+$ cd /srv/metrics-web/archives/
+$ ln -s /srv/tor/cached-* .
+
+Add a crontab entry for the database importer to run once per hour:
+
+15 * * * * cd /srv/metrics-web/ && ./run.sh
+
+In a future version of metrics-web it may also be possible to update local
+relay descriptor tarballs from the official metrics server via rsync and
+import only the changes into the metrics database.  The idea is to simply
+rsync the data/ directory from the metrics server and have all information
+available.  But until this is implemented, the recommended way to keep the
+metrics website up-to-date would be the one described above in this
+section.
+
+
+1.5.  Importing GeoIP information
+================================
+
+Some of the graphs require GeoIP information to resolve IP addresses to
+country codes.  This information is provided in MaxMind's GeoLite City
+database available at http://www.maxmind.com/app/geolitecity.
+
+Download and extract the two files GeoLiteCity-Location.csv and
+GeoLiteCity-Blocks.csv to /srv/metrics-web/.
+
+Import the two files into the metrics database.
+
+$ ant geoipdb
+
+Note that there is no easy way to update the GeoIP information in the
+metrics database yet.  The only way to do so is to manually delete and
+recreate the database table and import the new GeoIP database.
+
+
+1.6.  Pre-calculating relay statistics
+=====================================
+
+The relay graphs on the metrics website rely on pre-calculated statistics
+in the metrics database.  These statistics are not calculated after every
+completed import, which would usually be once per hour.  In general it's
+sufficient to pre-calculate statistics 2 or 4 times a day.
+
+Calculate statistics manually after large imports (this may take a while):
+
+$ psql tordir -c 'SELECT * FROM refresh_all();'
+
+If the metrics database gets updated automatically, write a script and add
+a crontab entry for pre-calculating statistics every 6 or 12 hours.
+
+
+1.7.  Generating network status information
+==========================================
+
+The metrics database importer can analyze the most recently parsed network
+status consensus for irregularities indicating problems with the directory
+authorities.  There are two possible outputs: the consensus-health page
+that can be found at https://metrics.torproject.org/consensus-health.html
+and a local file that can be parsed by Nagios that will be written to
+/srv/metrics-web/website/consensus-health .
+
+Edit /srv/metrics-web/config to contain either or both of the following
+options:
+
+WriteConsensusHealth 1
+WriteNagiosStatusFile 1
+
+
+1.8.  Importing sanitized bridge descriptors
+===========================================
+
+The metrics database can store aggregate statistics about running bridges
+and bridge usage.  These statistics are added by parsing sanitized bridge
+descriptors available on the official metrics website.
+
+Download a sanitized bridge descriptor tarball from the metrics website at
+https://metrics.torproject.org/data.html#bridgedesc and extract it to,
+e.g., /srv/metrics-web/bridges/bridge-descriptors-2011-05/ .
+
+Edit /srv/metrics-web/config to contain the following options:
+
+ImportSanitizedBridges 1
+SanitizedBridgesDirectory bridges/
+KeepSanitizedBridgesImportHistory 1
+WriteBridgeStats 1
+
+Note that the bridge usage statistics require parsing relay descriptors of
+the same time period in order to filter bridges that have been running as
+relays from the results.  When parsing sanitized bridge descriptors for
+the first time it may be necessary to delete the relay descriptor import
+history in /srv/metrics-web/stats/archives-import-history and import all
+relay descriptors once again.
+
+Run the database import:
+
+$ ./run.sh
+
+
+1.9.  Importing Torperf performance data
+=======================================
+
+Torperf measures the performance of the Tor network as users experience
+it.  Torperf's measurement data are available on the metrics website and
+can be imported into the metrics database, too.
+
+Download the Torperf measurement files from the metrics website at
+https://metrics.torproject.org/data.html#performance and put them in a
+subdirectory, e.g., /srv/metrics-web/torperf/ .
+
+Edit /srv/metrics-web/config to contain the following options:
+
+ImportWriteTorperfStats 1
+TorperfDirectory torperf/
+
+Run the database import:
+
+$ ./run.sh
+
+
+1.10.  Importing GetTor statistics
+=================================
+
+WARNING: The GetTor statistics are not available for download yet, so that
+this section only applies to the official metrics website.
+
+GetTor is a software distribution service that allows users to fetch the
+Tor software via email.  GetTor produces daily statistics of requested
+packages that can be imported into the metrics database.
+
+Put the GetTor statistics file into /srv/metrics-web/gettor/ .
+
+Edit /srv/metrics-web/config to contain the following options:
+
+ProcessGetTorStats 1
+GetTorDirectory gettor/
+
+Run the database import:
+
+$ ./run.sh
+
+
+2.  Installing the graphing engine
+=================================
+
+The metrics graphing engine generates custom graphs of Tor network data
+based on user-provided parameters.  The graphing engine requires the
+metrics database to be installed as described in the previous section.
+
+The graphing engine uses R and Rserve to generate its graphs.  Rserve is a
+TCP/IP server that makes it easy for other tools to use R without spawning
+their own R process.  Rserve also pre-loads R code and R libraries which
+saves time when processing user requests.
+
+In this configuration, Rserve will run in the context of the metrics user.
+
+Setting up the graphing engine requires installing PostgreSQL's header
+files and R 2.8 or higher.  R 2.8 or higher is required for the ggplot2
+library.
+
+# apt-get install libpq-dev r-base-dev
+
+Run R as user metrics and install required packages to ~/R/.  In the
+following, R commands will be prefixed with >.
+
+$ R
+> install.packages("Rserve")
+> install.packages("ggplot2")
+> install.packages("RPostgreSQL")
+> q()
+
+Start the Rserve daemon (the exact path of Rserve-bin.so may vary), check
+that it's working by connecting via telnet, and shut it down:
+
+$ R CMD ~/R/x86_64-pc-linux-gnu-library/2.11/Rserve/libs/Rserve-bin.so
+$ telnet 127.0.0.1 6311
+$ echo "library(Rserve); RSshutdown(RSconnect())" | R --slave
+
+Also check that a database connection can be established from within R
+(using the actual password instead of "password"):
+
+$ R
+> library(RPostgreSQL)
+> drv <- dbDriver("PostgreSQL")
+> con <- dbConnect(drv, user = "metrics", password = "password",
+    dbname = "tordir")
+> dbDisconnect(con)
+> dbUnloadDriver(drv)
+> q()
+
+Insert the database password in the Rserve initialization script in
+/srv/metrics-web/rserve/rserve-init.R.
+
+Update the workdir path in /srv/metrics-web/rserve/Rserv.conf .
+
+Start Rserve, this time with the metrics-web-specific configuration that
+includes pre-loading the graph code:
+
+$ cd /srv/metrics-web/rserve/ && ./start.sh
+
+Add a crontab entry to start Rserve on reboot:
+
+@reboot cd /srv/metrics-web/rserve/ && ./start.sh
+
+Rserve will pre-load the graph code at startup.  If changes are made to
+the graph code, Rserve must be restarted:
+
+$ cd /srv/metrics-web/rserve/
+$ ./shutdown.sh && ./start.sh
+
+
+3.  Installing the metrics website
+=================================
+
+The metrics website lets web users search parts of the metrics database
+and visualizes custom graphs.  Both the metrics database and the graphing
+engine are required to set up the metrics website as described in this
+section.
+
+Note that the description here has a few specific parts that only apply to
+the official metrics website.  These parts should be changed when setting
+up a non-official metrics website.
+
+
+3.1.  Configuring Apache HTTP Server
+===================================
+
+The Apache HTTP Server is used as the front-end web server that serves
+static resources itself and forwards requests for dynamic resources to
+Apache Tomcat.
+
+Start by installing Apache 2:
+
+# apt-get install apache2
+
+Disable Apache's default site.
+
+# a2dissite default
+
+Enable mod_rewrite to tell Apache where to find static resources on disk.
+Also enable mod_proxy to forward requests to Tomcat.
+
+# a2enmod rewrite proxy_http
+
+Create a new virtual host configuration and store it in a new file
+/etc/apache2/sites-available/metrics.torproject.org with the following
+content:
+
+<VirtualHost *:80>
+  ServerName metrics.torproject.org
+  ServerAdmin torproject-admin@torproject.org
+  ErrorLog /var/log/apache2/error.log
+  CustomLog /var/log/apache2/access.log combined
+  ServerSignature On
+  <IfModule mod_rewrite.c>
+    RewriteEngine On
+    RewriteRule /(data|dist|papers)/(.*) /srv/metrics-web/$1/$2 [L]
+    RewriteRule /(consensus-health.html) /srv/metrics-web/website/$1 [L]
+  </IfModule>
+  <IfModule mod_proxy.c>
+    <Proxy *>
+      Order deny,allow
+      Allow from all
+    </Proxy>
+    ProxyPass / http://127.0.0.1:8080/ernie/ retry=15
+    ProxyPassReverse / http://127.0.0.1:8080/ernie/
+    ProxyPreserveHost on
+  </IfModule>
+</VirtualHost>
+
+Create the directories containing static resources: /srv/metrics-web/data/
+contains the tarballs and other metrics data linked from data.html.
+/srv/metrics-web/dist/ contains the software packages linked from
+tools.html.  /srv/metrics-web/papers/ contains the papers and technical
+reports linked from papers.html.  Note that there is no option not to
+serve these files other than manually removing the links from the .html
+pages.
+
+Enable the new virtual host.
+
+# a2ensite metrics.torproject.org
+
+Restart Apache just to be sure that all changes are effective.
+
+# /etc/init.d/apache2 restart
+
+
+3.2.  Configuring Apache Tomcat
+==============================
+
+Apache Tomcat will process requests for dynamic resources, including web
+pages and graphs.
+
+Install Tomcat 6:
+
+# apt-get install tomcat6
+
+Replace Tomcat's default configuration in /etc/tomcat6/server.xml with the
+following configuration:
+
+<Server port="8005" shutdown="SHUTDOWN">
+  <Service name="Catalina">
+    <Connector port="8080" maxHttpHeaderSize="8192"
+      maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
+      enableLookups="false" redirectPort="8443" acceptCount="100"
+      connectionTimeout="20000" disableUploadTimeout="true"
+      compression="off" compressionMinSize="2048"
+      noCompressionUserAgents="gozilla, traviata"
+      compressableMimeType="text/html,text/xml,text/plain" />
+    <Engine name="Catalina" defaultHost="yatei.torproject.org">
+      <Host name="metrics.torproject.org" appBase="webapps"
+        unpackWARs="true" autoDeploy="true"
+        xmlValidation="false" xmlNamespaceAware="false">
+        <Alias>yatei.torproject.org</Alias>
+        <Valve className="org.apache.catalina.valves.AccessLogValve"
+          directory="logs" prefix="metrics_access_log." suffix=".txt"
+          pattern="%l %u %t %r %s %b" resolveHosts="false"/>
+      </Host>
+    </Engine>
+  </Service>
+</Server>
+
+Be sure to replace *.torproject.org with something else, unless this is
+a re-installation of the official metrics website.
+
+Update the database password in /srv/metrics-web/etc/context.xml.
+
+Update the paths starting with /srv/metrics.torproject.org/ in
+/srv/metrics-web/etc/web.xml to the correct paths in /srv/metrics-web/.
+The default paths in that file are correct for the official metrics
+website setup which is slightly different than the one described here.
+
+Now generate the web application.
+
+$ ant make-war
+
+Create a symbolic link to the ernie.war file:
+
+# ln -s /srv/metrics-web/ernie.war /var/lib/tomcat6/webapps/
+
+Tomcat will now attempt to deploy the web application automatically.
+
+Whenever the metrics website needs to be redeployed, generate a new .war
+file and Tomcat will reload the web application automatically.
+
+Restart Tomcat to make all configuration changes effective:
+
+# /etc/init.d/tomcat6 restart
+
+The metrics website should now work.
diff --git a/doc/rserve.pdf b/doc/rserve.pdf
deleted file mode 100644
index 2adca75..0000000
Binary files a/doc/rserve.pdf and /dev/null differ
diff --git a/doc/rserve.tex b/doc/rserve.tex
deleted file mode 100644
index 317b475..0000000
--- a/doc/rserve.tex
+++ /dev/null
@@ -1,140 +0,0 @@
-\documentclass{article}
-\setlength{\parindent}{0in}
-\begin{document}
-\title{Tor Metrics - Rserve}
-\author{by Kevin Berry \texttt{kevin.berry@villanova.edu}}
-\maketitle
-\section{Overview}
-Rserve is a TCP/IP interface which allows other tools and languages to use
-the facilities of the R language. In our case, Java, Tomcat, Postgres and
-other parts of Metrics work with Rserve to generate graphs on-demand. For
-more information about the Metrics website, see the sister repository
-\emph{metrics-db}. Here, we will cover how to install R, Rserve, the R
-Postgres driver, and ggplot2. For more information about the Postgres
-setup, see manual.pdf. The database should be set up before continuing
-this.
-
-\section{Architecture}
-See the \emph{rserve} directory for the start/stop scripts and
-config. The graph code is all pre-loaded when Rserve starts, so, if any
-changes are made to the graph code, Rserve must be restarted. The database
-name, user, and password can be configured in \emph{rserve-init.R}, as well
-as the pre-loaded libraries. Rserve forks itself upon connection, so R code
-can be pre-loaded to speed things up.
-
-\section{Setup}
-\subsection{Installing R}
-Before we get started, we need to have R installed. We need to have the R
-dev package installed so we can use the add-ons.
-
-\begin{verbatim}
-$ sudo apt-get install r-base-dev
-\end{verbatim}
-
-\subsection{Installing and testing Rserve}
-There are a few different ways to install Rserve. However, the easiest and
-most direct way to install it is through R's built-in package manager and
-package network, CRAN (\emph{The Comprehensive R Archive Network -
-http://cran.r-project.org}). Unfortunately, Rserve isn't packaged currently
-for many Linux distributions, so it requires a bit manual configuration and
-administration.
-\
-
-R needs to be started as root so its build-in package manager can access
-the file system to install its own packages. Select the mirror through the
-Tcl/tk or command line interface and it should install.
-
-\begin{verbatim}
-$ sudo R
-> install.packages("Rserve")
-\end{verbatim}
-
-We want to start a bare server and see if it works correctly.
-
-\begin{verbatim}
-$ R CMD Rserve
-\end{verbatim}
-
-Now, test it with the built-in R connector.
-
-\begin{verbatim}
-$ R
-> library(Rserve)
-> c <- RSconnect()
-> RSshutdown(c)
-\end{verbatim}
-
-If this worked, the server is listening, so it installed and started
-correctly.
-
-\subsection{Installing ggplot2}
-ggplot2 is the second necessary R package that we need for Metrics.
-
-\begin{verbatim}
-$ R
-> install.packages("ggplot2")
-\end{verbatim}
-
-\subsection{Installing and testing the R PostgresSQL driver}
-The Postgres driver installs similarly to the other R packages.
-
-\begin{verbatim}
-$ R
-> install.packages("RPostgreSQL")
-\end{verbatim}
-
-First, make sure Postgres is started and configured correctly according to
-the Metrics specifications (see manual.pdf).
-\
-
-Start the R console, load the driver, and connect to the database. The
-database user, password may need to be changed.
-\begin{verbatim}
-$ R
-> library(RPostgreSQL)
-> drv <- dbDriver("PostgreSQL")
-> con <- dbConnect(drv, user="ernie", password="", dbname="tordir")
-> dbDisconnect(con)
-> dbUnloadDriver(drv)
-\end{verbatim}
-
-\section{Administrating Rserve}
-Since Rserve is not standardly packaged, a few things must be done to
-ensure it runs smoothly and securely. We need to adjust permissions, add
-users, and modify groups so it works nicely with Tomcat. Feel free to do
-this differenty according to your system's requirements.
-
-\subsection{Adding users and groups}
-We'll add the user 'rserve' without a shell and no home directory.
-
-\begin{verbatim}
-$ useradd rserve -s /bin/false -U
-\end{verbatim}
-
-Now, find the user id and group id of the rserve user, and edit them in
-rserve/Rserv.conf. This is so Rserve properly forks itself and runs as the
-correct user when it is started.
-
-\begin{verbatim}
-$ id rserve
-uid=1011(rserve) gid=1012(rserve) groups=1012(rserve)
-\end{verbatim}
-
-Next, we need to add the rserve user to the 'apache' group (The default
-user for Tomcat), so it can communicate correctly with Tomcat and have the
-necessary permissions for writing graphs. In this case, we will add apache
-to the rserve group.
-
-\begin{verbatim}
-$ usermod -a -G rserve apache
-\end{verbatim}
-
-\section{Start Rserve}
-Now we are ready to start Rserve! Run the script rserve/start.sh as ROOT
-(or else Rserve will not fork itself properly). Then, check the rserve log
-file (\emph{rserve.log}). The path to this log file can be changed by
-modifying the script. Do a "ps -ef | grep rserve" to see if it has started.
-Now, with Rserve installed and Postgres is running, Metrics is almost ready
-to start generating some graphs!
-
-\end{document}
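
Section 1.6 of the new README suggests writing a script and adding a
crontab entry for pre-calculating statistics every 6 or 12 hours, but the
patch does not include such a script.  The following is only a minimal
sketch of what one could look like, not part of metrics-web.  The file
name refresh-stats.sh, the function name refresh_stats, the lock
directory path, and the PSQL/DBNAME/LOCKDIR variables are all assumptions
made for illustration; only the psql call itself is taken from the
README.

```shell
#!/bin/sh
# refresh-stats.sh -- hypothetical helper, NOT shipped with metrics-web.
# Wraps the refresh_all() call from README section 1.6 and uses a lock
# directory so that a slow refresh is not started a second time by cron.

PSQL="${PSQL:-psql}"                            # assumption: psql is on PATH
DBNAME="${DBNAME:-tordir}"                      # database name from section 1.2
LOCKDIR="${LOCKDIR:-/tmp/metrics-refresh.lock}" # assumption: any writable path

refresh_stats() {
  # mkdir either creates the directory or fails atomically, which makes it
  # usable as a simple lock without extra tools.
  if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "refresh already running, skipping" >&2
    return 1
  fi
  # Same statement as the manual command in README section 1.6.
  "$PSQL" "$DBNAME" -c 'SELECT * FROM refresh_all();'
  status=$?
  rmdir "$LOCKDIR"
  return "$status"
}
```

Assuming the file is saved as /srv/metrics-web/refresh-stats.sh, a crontab
entry for the metrics user that runs it every 6 hours could look like
"0 */6 * * * . /srv/metrics-web/refresh-stats.sh && refresh_stats".  If
the script is killed mid-run, the lock directory has to be removed by
hand; that trade-off keeps the sketch short.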
tor-commits@lists.torproject.org