[tor-dev] Sanitizing and publishing our web server logs

Sebastian Hahn hahn.seb at web.de
Fri Sep 2 13:06:57 UTC 2011


On Sep 2, 2011, at 2:46 PM, Karsten Loesing wrote:

> Hi Andrew,
> 
> On 9/2/11 2:18 AM, Andrew Lewman wrote:
>> On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
>>> we have been discussing sanitizing and publishing our web server logs
>>> for quite a while now.  The idea is to remove all potentially sensitive
>>> parts from the logs, publish them in monthly tarballs on the metrics
>>> website, and analyze them for top visited pages, top downloaded
>>> packages, etc.  See the tickets #1641 and #2489 for details.
>> 
>> My concern is that we have the data at all.  We shouldn't have any
>> sensitive information logged on the webservers. Therefore sanitizing the
>> logs should not be necessary.
> 
> My concern is that we remove details from the logs and learn in a few
> months that we wanted to analyze them.  I'd like to sanitize the
> existing logs first, make them available for people to analyze, and only
> change the Apache configuration once we're really sure we found the
> level of detail that we want.  There's no rush in changing the Apache
> configuration now, right?

So, if we decide in a few months that we need more detail, we can
change the logging then. Sure, we won't have history, but that just
means that the graphs we make start in 2012 instead of 2007.

> Finally, we'll have to find a way to encode the country code in the logs
> and still keep Apache's Combined Log Format.  And do we still care about
> the HTTP vs. HTTPS bit?  Because if we use the IP column for the country
> code, we'll have to encode the HTTP/HTTPS thing somewhere else.

IP addresses have plenty of bits for a country code and the http/https
encoding; we could, for example, use the first bytes for the country code.
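
A minimal sketch of that idea, assuming one possible layout that is not
specified anywhere in this thread: the first two octets carry the ASCII
values of a two-letter country code, the third octet is 1 for HTTPS and 0
for HTTP, and the fourth octet is unused. The function names are
hypothetical.

```python
import ipaddress

def encode_pseudo_ip(country_code: str, https: bool) -> str:
    """Pack a two-letter country code and an HTTP/HTTPS flag into a dotted quad."""
    cc = country_code.lower().encode("ascii")
    if len(cc) != 2 or not cc.isalpha():
        raise ValueError("expected a two-letter country code")
    # Assumed layout: octets 1-2 = country code letters,
    # octet 3 = 1 for HTTPS / 0 for HTTP, octet 4 = unused.
    packed = bytes([cc[0], cc[1], 1 if https else 0, 0])
    return str(ipaddress.IPv4Address(packed))

def decode_pseudo_ip(pseudo_ip: str) -> tuple:
    """Recover (country_code, https) from an address built by encode_pseudo_ip."""
    packed = ipaddress.IPv4Address(pseudo_ip).packed
    return packed[:2].decode("ascii"), packed[2] == 1

print(encode_pseudo_ip("de", True))
print(decode_pseudo_ip(encode_pseudo_ip("de", True)))
```

Because the result is still a syntactically valid IPv4 address, the log
line keeps its shape and tools that expect an address in the first column
keep working.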

> So, it should be possible to implement GeoIP lookups in the future.  I'd
> like to consider that a separate task from sanitizing the existing web
> logs, though.

It's separate, but without the on-the-fly geoip lookups we won't have
any, because the sanitizing process doesn't get them magically.
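
To illustrate why the lookup has to happen while the client address is
still available, here is a rough sketch that swaps the leading address of
a log line for a country code. The lookup table is a hypothetical stand-in
for a real GeoIP database, and its entries use documentation-only address
ranges with made-up country assignments.

```python
import ipaddress

# Hypothetical stand-in for a GeoIP database: network -> country code.
EXAMPLE_GEOIP = {
    ipaddress.ip_network("192.0.2.0/24"): "de",
    ipaddress.ip_network("198.51.100.0/24"): "us",
}

def lookup_country(ip: str) -> str:
    """Return the country code for ip, or '??' if no network matches."""
    addr = ipaddress.ip_address(ip)
    for network, country in EXAMPLE_GEOIP.items():
        if addr in network:
            return country
    return "??"

def replace_client_field(line: str) -> str:
    """Swap the leading client address of a log line for its country code."""
    ip, sep, rest = line.partition(" ")
    return lookup_country(ip) + sep + rest

line = ('192.0.2.17 - - [02/Sep/2011:13:06:57 +0000] '
        '"GET /index.html HTTP/1.1" 200 1234 "-" "-"')
print(replace_client_field(line))
```

Once the address has been dropped from the log, a later sanitizing pass
has nothing left to feed into lookup_country().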

All the best
Sebastian

