[tor-dev] Sanitizing and publishing our web server logs

Andrew Lewman andrew at torproject.org
Fri Sep 2 00:18:40 UTC 2011


On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
> we have been discussing sanitizing and publishing our web server logs
> for quite a while now.  The idea is to remove all potentially sensitive
> parts from the logs, publish them in monthly tarballs on the metrics
> website, and analyze them for top visited pages, top downloaded
> packages, etc.  See the tickets #1641 and #2489 for details.

My concern is that we have the data at all.  We shouldn't have any
sensitive information logged on the web servers, so sanitizing the logs
should not be necessary.  I would like to replace the current
0.0.0.0/0.0.0.1 scheme with a GeoIP lookup and just log the country code
in place of the IP address.  Apache can do this on the fly between the
request and the log entry.
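
To make the idea concrete, here is a minimal Python sketch of the
transformation itself, not of the Apache configuration.  It assumes the
client address is the first space-separated field of each log line.  The
lookup table, the country codes in it, and the function names are all
hypothetical; a real deployment would query a GeoIP database instead.

```python
import ipaddress

# Hypothetical lookup table: documentation-only networks mapped to made-up
# country codes. A real setup would query a GeoIP database instead.
COUNTRY_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "DE",
    ipaddress.ip_network("198.51.100.0/24"): "US",
}


def country_code(ip_text):
    """Return the country code for an address, or "??" if it is unknown."""
    try:
        addr = ipaddress.ip_address(ip_text)
    except ValueError:
        # Not a parseable address; never echo the raw field back.
        return "??"
    for network, code in COUNTRY_BY_NETWORK.items():
        if addr in network:
            return code
    return "??"


def replace_client_ip(log_line):
    """Swap the leading client address of a log line for a country code."""
    ip_text, sep, rest = log_line.partition(" ")
    return country_code(ip_text) + sep + rest
```

Anything that does not resolve is written as "??" rather than passed
through, so the address never reaches the log in the first place.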

> Is there still anything sensitive in that log file that we should
> remove?  For example:

Referrers and requested URLs will be a nightmare to clean up. We
literally get thousands of probes a day per site trying to exploit
Apache (or Tomcat, or CGI, or a million other things). If we were the US
military, we'd claim each probe is a hostile attack and whine about
millions of attacks on our infrastructure a year. Clearly this is
cyberwar and we need $3 billion to stop it or retaliate.

On the other hand, seeing the referrer data has been interesting because
it tells us where our traffic originates. Our top referrers are Google
and the Wikipedia pages about Tor in various languages. The search terms
are also valuable if we want to buy keywords for ads some day. We've had
two volunteers do this already through Google AdWords and the results
are surprising.
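
As a rough illustration of the "where does our traffic originate"
question, the Python sketch below counts referrers by hostname only.  The
function name and the sample input format are assumptions.  Note that
reducing a referrer to its hostname drops the path and query string, so
the search terms are lost in this form.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_referrer_hosts(referrers, n=3):
    """Count referrers by hostname only, dropping paths and query strings."""
    hosts = Counter()
    for ref in referrers:
        host = urlsplit(ref).hostname
        if host:  # skips empty referrers and the "-" placeholder
            hosts[host] += 1
    return hosts.most_common(n)
```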

-- 
Andrew
pgp 0x74ED336B

