[tor-dev] Sanitizing and publishing our web server logs

25 Aug 2011

Hi everyone,

We have been discussing sanitizing and publishing our web server logs
for quite a while now.  The idea is to remove all potentially sensitive
parts from the logs, publish them in monthly tarballs on the metrics
website, and analyze them for top visited pages, top downloaded
packages, etc.  See tickets #1641 and #2489 for details.

Here's a suggested sanitizing procedure for our web logs, which are in
Apache's combined log format:

 - Ignore everything except GET requests.
 - Ignore all requests that resulted in a 404 status code.
 - Rewrite log lines so that they only contain the following fields:
   - IP address 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests
(as logged by our Apache configuration),
   - the request date (with the time part set to 00:00:00),
   - the requested URL (cut off at the first encountered "?"),
   - the HTTP version,
   - the server's HTTP status code, and
   - the size of the returned object.
 - Write all lines from a given virtual host and day to a single output
file.
 - Sort the output file alphanumerically to conceal the original order
of requests.
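
The procedure above could be sketched in Python roughly as follows.
This is only an illustration, not the script that produced the sample
file: the regular expression, the function names, the output line
layout, the "+0000" offset written after the zeroed time, and the
choice to drop lines whose address field is not 0.0.0.0 or 0.0.0.1 are
all assumptions made for the example.  It also assumes that each input
already belongs to a single virtual host.

```python
import re
from collections import defaultdict

# Assumed shape of an Apache combined-format line. Only the fields the
# procedure keeps are captured; Referer and User-Agent are not matched.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<day>[^:\]]+):[^\]]*\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>HTTP/[\d.]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

# The two placeholder addresses described in the procedure.
ALLOWED_IPS = {"0.0.0.0", "0.0.0.1"}


def sanitize_line(line):
    """Return (day, sanitized_line), or None if the line is dropped."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # unparseable lines are dropped (assumption)
    if m.group("method") != "GET" or m.group("status") == "404":
        return None
    if m.group("ip") not in ALLOWED_IPS:
        return None  # defensive: never pass a real address through
    url = m.group("url").split("?", 1)[0]  # cut at the first "?"
    day = m.group("day")
    # Output layout and the "+0000" offset are assumptions of this sketch.
    sanitized = '{} - - [{}:00:00:00 +0000] "GET {} {}" {} {}'.format(
        m.group("ip"), day, url, m.group("version"),
        m.group("status"), m.group("size"))
    return day, sanitized


def sanitize_by_day(lines):
    """Group sanitized lines by day and sort each group."""
    per_day = defaultdict(list)
    for line in lines:
        result = sanitize_line(line)
        if result is not None:
            day, sanitized = result
            per_day[day].append(sanitized)
    # Sorting by code point stands in for "alphanumerically" here.
    return {day: sorted(group) for day, group in per_day.items()}
```

Each value of the returned dictionary would then be written to one
output file per virtual host and day.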

Here's a sample sanitized log file for www.torproject.org from May 1,
2009 (462K):

http://freehaven.net/~karsten/volatile/www.torproject.org-access.log-2009050...

Is there still anything sensitive in that log file that we should
remove?  For example:
 - Do the logs reveal how many pages were cached already on the
requestor's side (e.g. as repeat accesses)?  Note that log files are
sorted before being published.
 - Are there other concerns about making these sanitized log files
publicly available?

Are the decisions to remove parts from the logs reasonable?  In particular:
 - Do we have to take out all requests with 404 status codes?  Some of
these requests for non-existing URLs contain typos which may not be safe
to make public.  Should we instead put in some placeholder for the URL
part and keep the 404 lines to know how many 404s we have per day?
 - Is there any good reason to keep the portion of a URL after a "?"?
 - Is it possible to leave some part of Referers in the logs that helps
us figure out where our traffic originates and what search terms people
use to find us?
 - Can we resolve client IP addresses to country codes and include those
in the logs instead of our 0.0.0.0/0.0.0.1 code for HTTP/HTTPS?  How
would we handle countries with only a few users per day, e.g., should
there be a threshold below which we consider requests to come from "a
country with fewer than XY users"?
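
To make the threshold question in the last item concrete, here is a
small Python sketch of one way it might work.  Everything in it is
hypothetical: the function name, the "??" placeholder, and the choice
to count distinct client addresses per country as "users".  It assumes
the country code for each request has already been looked up by some
GeoIP step that is not shown, and that the raw addresses are discarded
right after counting.

```python
from collections import defaultdict

# Hypothetical code for "a country with fewer than XY users".
PLACEHOLDER = "??"


def bucket_rare_countries(requests, threshold):
    """Map each request of one day to a country code or the placeholder.

    `requests` is an iterable of (client_ip, country_code) pairs.
    Countries seen from fewer than `threshold` distinct addresses are
    replaced by PLACEHOLDER; the addresses themselves are not returned.
    """
    requests = list(requests)  # iterated twice below
    ips_per_country = defaultdict(set)
    for ip, country in requests:
        ips_per_country[country].add(ip)
    rare = {country for country, ips in ips_per_country.items()
            if len(ips) < threshold}
    return [PLACEHOLDER if country in rare else country
            for _, country in requests]
```

With a threshold of 2, a day with three requests from two German
addresses and one request from a single Icelandic address would yield
"de" three times and the placeholder once.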

Thanks,
Karsten

Karsten Loesing