[tor-dev] Sanitizing and publishing our web server logs

Karsten Loesing karsten.loesing at gmx.net
Fri Sep 2 12:46:36 UTC 2011

Hi Andrew,

On 9/2/11 2:18 AM, Andrew Lewman wrote:
> On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
>> we have been discussing sanitizing and publishing our web server logs
>> for quite a while now.  The idea is to remove all potentially sensitive
>> parts from the logs, publish them in monthly tarballs on the metrics
>> website, and analyze them for top visited pages, top downloaded
>> packages, etc.  See the tickets #1641 and #2489 for details.
> My concern is that we have the data at all.  We shouldn't have any
> sensitive information logged on the webservers. Therefore sanitizing the
> logs should not be necessary.

My concern is that we remove details from the logs, only to learn in a
few months that we wanted to analyze them.  I'd like to sanitize the
existing logs first, make them available for people to analyze, and only
change the Apache configuration once we're really sure we found the
level of detail that we want.  There's no rush in changing the Apache
configuration now, right?

> I would like to replace the current
> scheme with a geoip lookup and just log the country code
> in place of the IP address. Apache can do this on the fly between
> request and the log entry.

Runa and I discussed one major drawback of this approach: even though
there are no timestamps in the logs, the order of requests can reveal a
lot about user sessions.  Now, if we put in country codes, it's quite
easy to track single user sessions.  Even sorting logs before publishing
them may not help, because there may only be a handful of users from a
given country.

If we want country codes in the logs, we'll have to define a threshold
and change all requests from countries with fewer requests to some "less
than XY users" country code.  Also, we'll absolutely have to
reorder/sort requests per day.

Finally, we'll have to find a way to encode the country code in the logs
and still keep Apache's Combined Log Format.  And do we still care about
the HTTP vs. HTTPS bit?  Because if we use the IP column for the country
code, we'll have to encode the HTTP/HTTPS thing somewhere else.

So, it should be possible to implement GeoIP lookups in the future.  I'd
like to consider that a separate task from sanitizing the existing web
logs, though.

>> Is there still anything sensitive in that log file that we should
>> remove?  For example:
> Referrers and requested urls will be a nightmare to clean up. We
> literally get thousands of probes a day per site trying to exploit
> apache (or tomcat, or cgi, or a million other things). If we were the US
> military, we'd claim each probe is a hostile attack and whine about
> millions of attacks on our infrastructure a year. Clearly this is
> cyberwar and we need $3 billion to stop it or retaliate.
> On the other hand, seeing the referrer data has been interesting because
> it tells us where our traffic originates. Our top referrers are google
> and the wikipedia pages about tor in various languages. The search terms
> are also valuable if we want to buy keywords for ads some day. We've had
> two volunteers do this already through google adwords and the results
> are surprising.

I understand that removing referrers and changing URLs (removing them
for 4xx status codes, cutting off GET parameters, etc.) makes the logs
less useful.  Maybe there are ways to keep at least the top referrers in
the sanitized logs.  (I don't think this is something we can leave to
Apache, so we'll have to post-process logs for that.)
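
As a rough sketch of such a post-processing step, assuming input lines in
Apache's Combined Log Format: it blanks the referrer, cuts off GET
parameters, and removes the URL for 4xx responses.  The function name, the
"-" placeholder for removed values, and the decision to drop lines that do
not parse are assumptions, and all other fields are passed through
untouched:

```python
import re

# Combined Log Format: host ident user [time] "request" status size
# "referrer" "user-agent".  Lines with escaped quotes inside a quoted
# field do not match this simple pattern.
COMBINED = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]*)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def sanitize_line(line):
    """Return a sanitized copy of one log line, or None if it doesn't parse."""
    m = COMBINED.match(line)
    if m is None:
        return None  # assumption: unparseable lines are dropped entirely
    f = m.groupdict()

    parts = f["request"].split(" ")
    if len(parts) == 3:
        method, url, proto = parts
        if f["status"].startswith("4"):
            url = "-"  # remove the URL for 4xx status codes
        else:
            url = url.split("?", 1)[0]  # cut off GET parameters
        request = f"{method} {url} {proto}"
    else:
        request = "-"  # malformed request line, e.g. from a probe

    # The referrer is always replaced by "-".
    return (
        f'{f["host"]} {f["ident"]} {f["user"]} [{f["time"]}] '
        f'"{request}" {f["status"]} {f["size"]} "-" "{f["agent"]}"'
    )
```

Keeping the top referrers would be an additional counting pass before the
referrer column is blanked; it is not part of this sketch.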

But how about we start without GET parameters and referrers and see how
useful those logs are for analysis?  We can still add more detail later on.

