[tor-dev] Sanitizing and publishing our web server logs
karsten.loesing at gmx.net
Tue Oct 18 07:27:28 UTC 2011
On 8/25/11 10:08 AM, Karsten Loesing wrote:
> we have been discussing sanitizing and publishing our web server logs
> for quite a while now. The idea is to remove all potentially sensitive
> parts from the logs, publish them in monthly tarballs on the metrics
> website, and analyze them for top visited pages, top downloaded
> packages, etc. See the tickets #1641 and #2489 for details.
> Here's a suggested sanitizing procedure for our web logs, which are in
> Apache's combined log format:
> - Ignore everything except GET requests.
> - Ignore all requests that resulted in a 404 status code.
> - Rewrite log lines so that they only contain the following fields:
>   - IP address 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests
>     (as logged by our Apache configuration),
>   - the request date (with the time part set to 00:00:00),
>   - the requested URL (cut off at the first encountered "?"),
>   - the HTTP version,
>   - the server's HTTP status code, and
>   - the size of the returned object.
> - Write all lines from a given virtual host and day to a single output
> file.
> - Sort the output file alphanumerically to conceal the original order
> of requests.
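For illustration, the quoted procedure could be sketched roughly as follows. This is only a minimal sketch under some assumptions, not the script that produced the published logs: the names LINE_RE, sanitize_line, and sanitize_log are made up for this example, the input is assumed to be the lines of one virtual host in Apache's combined log format, and lines carrying any address other than 0.0.0.0 or 0.0.0.1 are dropped as a precaution (that last rule is not part of the procedure above).

```python
import re
from collections import defaultdict

# Hypothetical pattern for Apache's combined log format; only the fields
# that survive sanitizing are captured, Referer and User-Agent are ignored.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<date>[^:\]]+):\S+ (?P<tz>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def sanitize_line(line):
    """Return (day, sanitized_line), or None if the line must be dropped."""
    m = LINE_RE.match(line)
    if m is None:
        return None                      # unparseable line: drop it
    if m.group("method") != "GET":
        return None                      # ignore everything except GET
    if m.group("status") == "404":
        return None                      # ignore 404 responses
    if m.group("ip") not in ("0.0.0.0", "0.0.0.1"):
        return None                      # assumption: never emit a real address
    url = m.group("url").split("?", 1)[0]  # cut off at the first "?"
    day = m.group("date")
    clean = '{ip} - - [{day}:00:00:00 {tz}] "GET {url} {version}" {status} {size}'.format(
        ip=m.group("ip"), day=day, tz=m.group("tz"), url=url,
        version=m.group("version"), status=m.group("status"),
        size=m.group("size"),
    )
    return day, clean


def sanitize_log(lines):
    """Group sanitized lines of one virtual host by day and sort each day."""
    by_day = defaultdict(list)
    for line in lines:
        result = sanitize_line(line)
        if result is not None:
            day, clean = result
            by_day[day].append(clean)
    # Sorting conceals the original order of requests within a day.
    return {day: sorted(entries) for day, entries in by_day.items()}
```

Each value of the returned dictionary would then be written to one output file per virtual host and day.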
Pushing this forward. Here are the sanitized web logs that we'd like to
publish on a daily basis for all our web servers and virtual domains for
all of 2010 (155M):
The webalizer output for www.torproject.org can be viewed here:
So. Is it safe to publish these logs on a daily basis? The same
questions from my original mail apply here:
> Is there still anything sensitive in that log file that we should
> remove? For example:
> - Do the logs reveal how many pages were cached already on the
> requestor's site (e.g. as repeat accesses)? Note that log files are
> sorted before being published.
> - Are there other concerns about making these sanitized log files
> publicly available?
> Are the decisions to remove parts from the logs reasonable? In particular:
> - Do we have to take out all requests with 404 status codes? Some of
> these requests for non-existing URLs contain typos which may not be safe
> to make public. Should we instead put in some placeholder for the URL
> part and keep the 404 lines to know how many 404's we have per day?
> - Is there any good reason to keep the portion of a URL after a "?"?
> - Is it possible to leave some part of Referers in the logs that helps
> us figure out where our traffic originates and what search terms people
> use to find us?
> - Can we resolve client IP addresses to country codes and include those
> in the logs instead of our 0.0.0.0/0.0.0.1 code for HTTP/HTTPS? How
> would we handle countries with only a few users per day, e.g., should
> there be a threshold below which we consider requests to come from "a
> country with fewer than XY users"?
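To make the threshold question more concrete, here is a rough sketch of how such bucketing might look. It is purely hypothetical: the function name bucket_countries, the "??" placeholder, and the threshold value are invented for this example, and the input is assumed to be one already-resolved country code per distinct user per day (the GeoIP lookup itself is not shown).

```python
from collections import Counter


def bucket_countries(user_countries, threshold, placeholder="??"):
    """Count users per country, merging small countries into a placeholder.

    user_countries: iterable of country codes, one entry per distinct user.
    threshold: minimum number of users for a country to be listed by name.
    """
    counts = Counter(user_countries)
    result = Counter()
    for country, n in counts.items():
        # Countries below the threshold are reported under the placeholder.
        key = country if n >= threshold else placeholder
        result[key] += n
    return dict(result)
```

For example, bucket_countries(["de", "de", "de", "us", "us", "is"], threshold=2) lists "de" and "us" by name and reports the single "is" user under "??". Note that the placeholder bucket can itself end up below the threshold when only one small country shows up on a given day, so this alone would not settle the question.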
The next steps will be to make these sanitized logs available on a daily
basis and to publish the sanitized archives from 2008, 2009, and 2011.
I'm going to wait another week (probably longer) for feedback before
taking these next steps.