On Sep 2, 2011, at 2:46 PM, Karsten Loesing wrote:
Hi Andrew,
On 9/2/11 2:18 AM, Andrew Lewman wrote:
On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
We have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See tickets #1641 and #2489 for details.
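To make the proposal concrete, here is a minimal Python sketch of sanitizing a Combined Log Format line and counting the top requested pages. Which fields count as sensitive is an assumption made for illustration (client IP, user name, time of day, query string, referrer, and user agent); the thread and the tickets may settle on a different set. The names sanitize_line and top_pages are hypothetical.

```python
import re
from collections import Counter

# Apache Combined Log Format: host ident user [time] "request" status size "referer" "agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def sanitize_line(line):
    """Return a sanitized copy of one log line, or None if it does not parse.

    Assumption: IP, user, time of day, query string, referrer, and user agent
    are the sensitive parts. The output is still valid Combined Log Format.
    """
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    path = m.group("path").split("?", 1)[0]   # drop the query string
    date = m.group("time").split(":", 1)[0]   # keep the date, drop the time
    return '0.0.0.0 - - [{}:00:00:00 +0000] "{} {} {}" {} {} "-" "-"'.format(
        date, m.group("method"), path, m.group("proto"),
        m.group("status"), m.group("size"),
    )


def top_pages(lines, n=10):
    """Count requested paths in (sanitized) log lines, most frequent first."""
    matches = (LOG_RE.match(line.strip()) for line in lines)
    counts = Counter(m.group("path") for m in matches if m is not None)
    return counts.most_common(n)
```

Because the sanitized lines keep the same format, existing log analysis tools should still be able to read them.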
My concern is that we have the data at all. We shouldn't have any sensitive information logged on the web servers. Therefore sanitizing the logs should not be necessary.
My concern is that we remove details from the logs and learn in a few months that we wanted to analyze them. I'd like to sanitize the existing logs first, make them available for people to analyze, and only change the Apache configuration once we're really sure we found the level of detail that we want. There's no rush in changing the Apache configuration now, right?
So, if we decide in a few months that we need more detail, we can change the logging then. Sure, we won't have history, but that just means that the graphs we make start in 2012 instead of 2007.
Finally, we'll have to find a way to encode the country code in the logs and still keep Apache's Combined Log Format. And do we still care about the HTTP vs. HTTPS bit? Because if we use the IP column for the country code, we'll have to encode the HTTP/HTTPS thing somewhere else.
IP addresses have plenty of bits for both a country code and the HTTP/HTTPS flag; we could, for example, use the first bytes for the country code.
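A rough Python sketch of that idea, keeping the IP column a valid dotted quad so the Combined Log Format stays intact. The exact byte layout is an assumption made for illustration, not something decided in this thread: the first two bytes carry the ASCII country code, the third is unused, and the fourth is 1 for HTTPS and 0 for HTTP. The function names are hypothetical.

```python
import ipaddress
from typing import Tuple


def encode_pseudo_ip(country_code: str, is_https: bool) -> str:
    """Pack a two-letter country code and an HTTPS flag into an IPv4 string.

    Assumed layout: bytes 1-2 = ASCII country code, byte 3 = 0,
    byte 4 = 1 for HTTPS, 0 for HTTP.
    """
    cc = country_code.upper()
    if len(cc) != 2 or not cc.isascii() or not cc.isalpha():
        raise ValueError("expected a two-letter ASCII country code")
    octets = bytes([ord(cc[0]), ord(cc[1]), 0, 1 if is_https else 0])
    return str(ipaddress.IPv4Address(octets))


def decode_pseudo_ip(pseudo_ip: str) -> Tuple[str, bool]:
    """Recover (country_code, is_https) from an address made by encode_pseudo_ip."""
    packed = ipaddress.IPv4Address(pseudo_ip).packed
    return chr(packed[0]) + chr(packed[1]), packed[3] == 1


print(encode_pseudo_ip("DE", True))   # 68.69.0.1
print(decode_pseudo_ip("85.83.0.0"))  # ('US', False)
```

One consequence of this layout is that the pseudo-addresses look like real IPs to anyone reading the logs, so the encoding would need to be documented alongside the published tarballs.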
So, it should be possible to implement GeoIP lookups in the future. I'd like to consider that a separate task from sanitizing the existing web logs, though.
It's separate, but without on-the-fly GeoIP lookups we won't have any country data at all, because the sanitizing process can't recover it after the fact.
All the best,
Sebastian