Hi everyone,
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See tickets #1641 and #2489 for details.
Here's a suggested sanitizing procedure for our web logs, which are in Apache's combined log format:
- Ignore everything except GET requests.
- Ignore all requests that resulted in a 404 status code.
- Rewrite log lines so that they only contain the following fields:
  - IP address 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests (as logged by our Apache configuration),
  - the request date (with the time part set to 00:00:00),
  - the requested URL (cut off at the first encountered "?"),
  - the HTTP version,
  - the server's HTTP status code, and
  - the size of the returned object.
- Write all lines from a given virtual host and day to a single output file.
- Sort the output file alphanumerically to conceal the original order of requests.
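To make the procedure above concrete, here is a rough Python sketch of how it might look. This is only an illustration, not the script that produced the sample file: the regular expression, the function name `sanitize`, and the exact layout of the output lines are my assumptions, as is the choice to keep the timezone offset as logged and to silently drop lines that do not parse.

```python
import re
from collections import defaultdict

# Assumed layout of the leading fields in Apache's combined log format.
# The trailing Referer and User-Agent fields are deliberately not captured.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<date>[^:\]]+):\d{2}:\d{2}:\d{2} (?P<tz>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) (?P<proto>HTTP/[\d.]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def sanitize(lines):
    """Return {date: sorted list of sanitized lines} for one virtual host."""
    by_day = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # assumption: unparseable lines are dropped
        if m.group("method") != "GET":
            continue  # ignore everything except GET requests
        if m.group("status") == "404":
            continue  # ignore requests that resulted in a 404
        url = m.group("url").split("?", 1)[0]  # cut off at the first "?"
        by_day[m.group("date")].append(
            '%s - - [%s:00:00:00 %s] "GET %s %s" %s %s' % (
                m.group("ip"), m.group("date"), m.group("tz"),
                url, m.group("proto"), m.group("status"), m.group("size"),
            )
        )
    # Sorting conceals the original order of requests within each day.
    return {day: sorted(entries) for day, entries in by_day.items()}
```

Each key of the returned dictionary would correspond to one output file for the given virtual host and day.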
Here's a sample sanitized log file for www.torproject.org from May 1, 2009 (462K):
http://freehaven.net/~karsten/volatile/www.torproject.org-access.log-2009050...
Is there still anything sensitive in that log file that we should remove? For example:
- Do the logs reveal how many pages were cached already on the requestor's site (e.g. as repeat accesses)? Note that log files are sorted before being published.
- Are there other concerns about making these sanitized log files publicly available?
Are the decisions to remove parts from the logs reasonable? In particular:
- Do we have to take out all requests with 404 status codes? Some of these requests for non-existing URLs contain typos which may not be safe to make public. Should we instead put in some placeholder for the URL part and keep the 404 lines to know how many 404s we have per day?
- Is there any good reason to keep the portion of a URL after a "?"?
- Is it possible to leave some part of Referers in the logs that helps us figure out where our traffic originates and what search terms people use to find us?
- Can we resolve client IP addresses to country codes and include those in the logs instead of our 0.0.0.0/0.0.0.1 code for HTTP/HTTPS? How would we handle countries with only a few users per day, e.g., should there be a threshold below which we consider requests to come from "a country with fewer than XY users"?
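To illustrate the country threshold idea from the last question, here is a minimal Python sketch of one possible approach. It assumes the IP addresses have already been resolved to country codes by some other means. The function name `bucket_small_countries`, the `"??"` placeholder, and the choice to count distinct IP addresses as "users" are all hypothetical, not something we have decided on.

```python
from collections import defaultdict


def bucket_small_countries(requests, threshold, placeholder="??"):
    """Replace rare country codes with a placeholder.

    `requests` is one day's list of (ip, country_code) pairs. A country
    keeps its code only if at least `threshold` distinct IP addresses
    were seen for it on that day.
    """
    ips_by_country = defaultdict(set)
    for ip, country in requests:
        ips_by_country[country].add(ip)
    return [
        country if len(ips_by_country[country]) >= threshold else placeholder
        for _ip, country in requests
    ]
```

With a threshold of 2, a country seen from only one distinct address on a given day would show up as `"??"` in the published log, while the raw addresses themselves would never be written out.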
Thanks,
Karsten