Hi Andrew,
On 9/2/11 2:18 AM, Andrew Lewman wrote:
On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See tickets #1641 and #2489 for details.
My concern is that we have the data at all. We shouldn't have any sensitive information logged on the webservers. Therefore sanitizing the logs should not be necessary.
My concern is that we remove details from the logs now and learn in a few months that we would have wanted to analyze them after all. I'd like to sanitize the existing logs first, make them available for people to analyze, and only change the Apache configuration once we're really sure we've found the level of detail we want. There's no rush to change the Apache configuration now, right?
I would like to replace the current 0.0.0.0/0.0.0.1 scheme with a GeoIP lookup and just log the country code in place of the IP address. Apache can do this on the fly between receiving the request and writing the log entry.
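For reference, here's roughly what that could look like with the mod_geoip module (just an untested sketch on my side; the module, database path, and log file path are assumptions):

  <IfModule mod_geoip.c>
    GeoIPEnable On
    GeoIPDBFile /usr/share/GeoIP/GeoIP.dat
    GeoIPOutput Env
  </IfModule>

  # Same layout as the Combined Log Format, but with the country code
  # (taken from the GEOIP_COUNTRY_CODE environment variable) in place
  # of the client IP address (%h).
  LogFormat "%{GEOIP_COUNTRY_CODE}e %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" geoipcombined
  CustomLog /var/log/apache2/access.log geoipcombined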
Runa and I discussed one major drawback of this approach: even though there are no timestamps in the logs, the order of requests can reveal a lot about user sessions. If we put in country codes, it becomes quite easy to track individual user sessions. Even sorting the logs before publishing them may not help, because there may be only a handful of users from a given country.
If we want country codes in the logs, we'll have to define a threshold and change all requests from countries below that threshold to some generic "fewer than XY users" country code. Also, we'll absolutely have to reorder/sort requests per day.
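Something like this is what I have in mind for that step (untested sketch in Python; the threshold of 100 and the "ZZ" placeholder are made-up examples, and it assumes the country code is the first field of every line):

  #!/usr/bin/env python
  # Replace country codes that appear fewer than THRESHOLD times in one
  # day's log with a placeholder, then sort the lines so that the
  # original request order (and thus session information) is destroyed.
  import sys
  from collections import Counter

  THRESHOLD = 100      # made-up example value
  PLACEHOLDER = 'ZZ'   # stands for "fewer than THRESHOLD requests"

  def sanitize_day(lines):
      parsed = [line.split(' ', 1) for line in lines if ' ' in line]
      counts = Counter(country for country, _ in parsed)
      sanitized = []
      for country, rest in parsed:
          if counts[country] < THRESHOLD:
              country = PLACEHOLDER
          sanitized.append(country + ' ' + rest)
      return sorted(sanitized)

  if __name__ == '__main__':
      sys.stdout.writelines(sanitize_day(sys.stdin.readlines()))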
Finally, we'll have to find a way to encode the country code in the logs while still keeping Apache's Combined Log Format. And do we still care about the HTTP vs. HTTPS bit? If we use the IP column for the country code, we'll have to encode that information somewhere else.
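Just to make that concrete, one purely hypothetical option would be to keep the country code in the host column and reuse the normally empty identd column for the scheme, which would still parse as Combined Log Format:

  DE https - [01/Jan/2011:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234 "-" "Mozilla/5.0"

(The date, size, and user agent here are just placeholder values.)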
So, it should be possible to implement GeoIP lookups in the future. I'd like to consider that a separate task from sanitizing the existing web logs, though.
Is there still anything sensitive in that log file that we should remove? For example:
Referrers and requested URLs will be a nightmare to clean up. We literally get thousands of probes a day per site trying to exploit Apache (or Tomcat, or CGI, or a million other things). If we were the US military, we'd claim each probe is a hostile attack and whine about millions of attacks on our infrastructure a year. Clearly this is cyberwar and we need $3 billion to stop it or retaliate.
On the other hand, seeing the referrer data has been interesting because it tells us where our traffic originates. Our top referrers are Google and the Wikipedia pages about Tor in various languages. The search terms are also valuable if we want to buy keywords for ads some day. We've had two volunteers do this already through Google AdWords, and the results are surprising.
I understand that removing referrers and changing URLs (removing them for 4xx status codes, cutting off GET parameters, etc.) makes the logs less useful. Maybe there are ways to keep at least the top referrers in the sanitized logs. (I don't think this is something we can leave to Apache, so we'll have to post-process logs for that.)
But how about we start without GET parameters and referrers and see how useful those logs are for analysis? We can still add more detail later on.
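Roughly what I have in mind for that post-processing step (again an untested sketch; the exact rules, like replacing the URL with "-" for 4xx responses, are just placeholders for whatever we decide):

  #!/usr/bin/env python
  # Cut off GET parameters, hide request URLs for 4xx responses, and
  # blank out referrers.  Only well-formed Combined Log Format lines
  # are kept; everything else is dropped.
  import re
  import sys

  LOG_LINE = re.compile(
      r'^(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) (?P<time>\[[^\]]+\]) '
      r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
      r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"$')

  def sanitize_line(line):
      match = LOG_LINE.match(line.strip())
      if not match:
          return None                      # drop unparseable lines
      d = match.groupdict()
      url = d['url'].split('?', 1)[0]      # cut off GET parameters
      if d['status'].startswith('4'):
          url = '-'                        # hide URLs of 4xx requests
      return '%s %s %s %s "%s %s %s" %s %s "-" "%s"' % (
          d['host'], d['logname'], d['user'], d['time'],
          d['method'], url, d['proto'], d['status'], d['size'], d['agent'])

  if __name__ == '__main__':
      for line in sys.stdin:
          sanitized = sanitize_line(line)
          if sanitized is not None:
              print(sanitized)

The 4xx rule should also take care of most of the exploit probes that Andrew mentioned, since those typically end in 404s.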
Thanks,
Karsten