Hi everyone,
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See the tickets #1641 and #2489 for details.
Here's a suggested sanitizing procedure for our web logs, which are in Apache's combined log format:
- Ignore everything except GET requests.
- Ignore all requests that resulted in a 404 status code.
- Rewrite log lines so that they only contain the following fields:
  - IP address 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests (as logged by our Apache configuration),
  - the request date (with the time part set to 00:00:00),
  - the requested URL (cut off at the first encountered "?"),
  - the HTTP version,
  - the server's HTTP status code, and
  - the size of the returned object.
- Write all lines from a given virtual host and day to a single output file.
- Sort the output file alphanumerically to conceal the original order of requests.
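To make this concrete, here is a minimal sketch in Python of how a single host/day log file could be sanitized. The regular expression, input file name, and the "+0000" timezone placeholder are illustrative assumptions, not the exact implementation:

    import re

    # Apache combined log format:
    # %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
    LINE_RE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^:]+):[^\]]+\] '
        r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]*)" '
        r'(?P<status>\d{3}) (?P<size>\S+)')

    def sanitize(lines):
        """Yield sanitized lines for one virtual host and day."""
        for line in lines:
            m = LINE_RE.match(line)
            if m is None or m.group('method') != 'GET':
                continue
            if m.group('status') == '404':
                continue
            url = m.group('url').split('?', 1)[0]  # cut off GET parameters
            # Zero the time part; keep the remaining fields.
            yield ('%s - - [%s:00:00:00 +0000] "GET %s %s" %s %s' %
                   (m.group('ip'), m.group('date'), url, m.group('proto'),
                    m.group('status'), m.group('size')))

    with open('www.torproject.org-access.log') as logfile:
        # Sorting conceals the original order of requests.
        for out in sorted(sanitize(logfile)):
            print(out)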
Here's a sample sanitized log file for www.torproject.org from May 1, 2009 (462K):
http://freehaven.net/~karsten/volatile/www.torproject.org-access.log-2009050...
Is there still anything sensitive in that log file that we should remove? For example:

- Do the logs reveal how many pages were cached already on the requestor's site (e.g. as repeat accesses)? Note that log files are sorted before being published.
- Are there other concerns about making these sanitized log files publicly available?
Are the decisions to remove parts from the logs reasonable? In particular:

- Do we have to take out all requests with 404 status codes? Some of these requests for non-existing URLs contain typos which may not be safe to make public. Should we instead put in some placeholder for the URL part and keep the 404 lines to know how many 404's we have per day?
- Is there any good reason to keep the portion of a URL after a "?"?
- Is it possible to leave some part of Referers in the logs that helps us figure out where our traffic originates and what search terms people use to find us?
- Can we resolve client IP addresses to country codes and include those in the logs instead of our 0.0.0.0/0.0.0.1 code for HTTP/HTTPS? How would we handle countries with only a few users per day, e.g., should there be a threshold below which we consider requests to come from "a country with less than XY users?"
Thanks, Karsten
On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See the tickets #1641 and #2489 for details.
My concern is that we have the data at all. We shouldn't have any sensitive information logged on the webservers. Therefore sanitizing the logs should not be necessary. I would like to replace the current 0.0.0.0/0.0.0.1 scheme with a geoip lookup and just log the country code in place of the IP address. Apache can do this on the fly, between the request and the log entry.
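For illustration, with MaxMind's mod_geoip module that could look roughly like the following; the module, database path, and format string here are assumptions, not our current setup:

    <IfModule mod_geoip.c>
        GeoIPEnable On
        GeoIPDBFile /usr/share/GeoIP/GeoIP.dat
    </IfModule>

    # Log the country code in place of the client IP (%h);
    # Apache writes "-" when the lookup fails.
    LogFormat "%{GEOIP_COUNTRY_CODE}e %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" geoip_combined
    CustomLog /var/log/apache2/access.log geoip_combined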
Is there still anything sensitive in that log file that we should remove? For example:
Referrers and requested URLs will be a nightmare to clean up. We literally get thousands of probes a day per site trying to exploit Apache (or Tomcat, or CGI, or a million other things). If we were the US military, we'd claim each probe is a hostile attack and whine about millions of attacks on our infrastructure a year. Clearly this is cyberwar and we need $3 billion to stop it or retaliate.
On the other hand, seeing the referrer data has been interesting because it tells us where our traffic originates. Our top referrers are Google and the Wikipedia pages about Tor in various languages. The search terms are also valuable if we want to buy keywords for ads some day. We've had two volunteers do this already through Google AdWords and the results are surprising.
Hi Andrew,
On 9/2/11 2:18 AM, Andrew Lewman wrote:
On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See the tickets #1641 and #2489 for details.
My concern is that we have the data at all. We shouldn't have any sensitive information logged on the webservers. Therefore sanitizing the logs should not be necessary.
My concern is that we remove details from the logs and learn in a few months that we wanted to analyze them. I'd like to sanitize the existing logs first, make them available for people to analyze, and only change the Apache configuration once we're really sure we found the level of detail that we want. There's no rush in changing the Apache configuration now, right?
I would like to replace the current 0.0.0.0/0.0.0.1 scheme with a geoip lookup and just log the country code in place of the IP address. Apache can do this on the fly, between the request and the log entry.
Runa and I discussed one major drawback of this approach: even though there are no timestamps in the logs, the order of requests can reveal a lot about user sessions. Now, if we put in country codes, it's quite easy to track single user sessions. Even sorting logs before publishing them may not help, because there may only be a handful of users from a given country.
If we want country codes in the logs, we'll have to define a threshold and change all requests from countries with fewer requests to some "less than XY users" country code. Also, we'll absolutely have to reorder/sort requests per day.
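A rough sketch of that thresholding step in Python; the threshold value and placeholder code are made up:

    from collections import Counter

    THRESHOLD = 10      # made-up value, would need tuning
    PLACEHOLDER = '??'  # stands in for "a country with less than XY users"

    def bucket_rare_countries(records):
        """records: (country_code, rest_of_log_line) pairs for one day.
        Replaces codes that occur fewer than THRESHOLD times, then sorts
        the result to conceal the original request order."""
        counts = Counter(cc for cc, _ in records)
        return sorted((cc if counts[cc] >= THRESHOLD else PLACEHOLDER, rest)
                      for cc, rest in records)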
Finally, we'll have to find a way to encode the country code in the logs and still keep Apache's Combined Log Format. And do we still care about the HTTP vs. HTTPS bit? Because if we use the IP column for the country code, we'll have to encode the HTTP/HTTPS thing somewhere else.
So, it should be possible to implement GeoIP lookups in the future. I'd like to consider that a separate task from sanitizing the existing web logs, though.
Is there still anything sensitive in that log file that we should remove? For example:
Referrers and requested URLs will be a nightmare to clean up. We literally get thousands of probes a day per site trying to exploit Apache (or Tomcat, or CGI, or a million other things). If we were the US military, we'd claim each probe is a hostile attack and whine about millions of attacks on our infrastructure a year. Clearly this is cyberwar and we need $3 billion to stop it or retaliate.
On the other hand, seeing the referrer data has been interesting because it tells us where our traffic originates. Our top referrers are Google and the Wikipedia pages about Tor in various languages. The search terms are also valuable if we want to buy keywords for ads some day. We've had two volunteers do this already through Google AdWords and the results are surprising.
I understand that removing referrers and changing URLs (removing them for 4xx status codes, cutting off GET parameters, etc.) makes the logs less useful. Maybe there are ways to keep at least the top referrers in the sanitized logs. (I don't think this is something we can leave to Apache, so we'll have to post-process logs for that.)
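One conceivable post-processing step would reduce referrers to their site part and keep only the frequent ones; this is just a sketch, and the threshold is a placeholder:

    from collections import Counter
    from urllib.parse import urlsplit

    MIN_COUNT = 100  # placeholder threshold

    def keep_top_referrers(referrers):
        """Keep only referrer sites that occur often enough to be
        non-identifying; everything else becomes "-"."""
        sites = [urlsplit(r).netloc if r != '-' else '-' for r in referrers]
        counts = Counter(s for s in sites if s != '-')
        return [s if counts[s] >= MIN_COUNT else '-' for s in sites]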
But how about we start without GET parameters and referrers and see how useful those logs are for analysis? We can still add more detail later on.
Thanks, Karsten
On Sep 2, 2011, at 2:46 PM, Karsten Loesing wrote:
Hi Andrew,
On 9/2/11 2:18 AM, Andrew Lewman wrote:
On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See the tickets #1641 and #2489 for details.
My concern is that we have the data at all. We shouldn't have any sensitive information logged on the webservers. Therefore sanitizing the logs should not be necessary.
My concern is that we remove details from the logs and learn in a few months that we wanted to analyze them. I'd like to sanitize the existing logs first, make them available for people to analyze, and only change the Apache configuration once we're really sure we found the level of detail that we want. There's no rush in changing the Apache configuration now, right?
So, if we decide in a few months that we need more detail, we can change the logging then. Sure, we won't have history, but that just means that the graphs we make start in 2012 instead of 2007.
Finally, we'll have to find a way to encode the country code in the logs and still keep Apache's Combined Log Format. And do we still care about the HTTP vs. HTTPS bit? Because if we use the IP column for the country code, we'll have to encode the HTTP/HTTPS thing somewhere else.
IP addresses have plenty of bits for a country code and HTTP/HTTPS encoding; we could, for example, use the first bytes for the country code.
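For instance, with a purely hypothetical encoding:

    def encode_fake_ip(country_code, https):
        """Pack a two-letter country code and the HTTP/HTTPS bit into a
        fake dotted quad, e.g. ("DE", True) -> "68.69.0.1"."""
        first, second = (ord(c) for c in country_code.upper())  # 'A'..'Z' -> 65..90
        return '%d.%d.0.%d' % (first, second, 1 if https else 0)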
So, it should be possible to implement GeoIP lookups in the future. I'd like to consider that a separate task from sanitizing the existing web logs, though.
It's separate, but without the on-the-fly geoip lookups we won't have any country codes, because the sanitizing process doesn't get them magically.
All the best
Sebastian
What exactly are we hoping to gain from the analysis of the (hopefully correctly) stripped logs?
On 09/02/2011 09:06 AM, Sebastian Hahn wrote:
On Sep 2, 2011, at 2:46 PM, Karsten Loesing wrote:
Hi Andrew,
On 9/2/11 2:18 AM, Andrew Lewman wrote:
On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See the tickets #1641 and #2489 for details.
My concern is that we have the data at all. We shouldn't have any sensitive information logged on the webservers. Therefore sanitizing the logs should not be necessary.
My concern is that we remove details from the logs and learn in a few months that we wanted to analyze them. I'd like to sanitize the existing logs first, make them available for people to analyze, and only change the Apache configuration once we're really sure we found the level of detail that we want. There's no rush in changing the Apache configuration now, right?
So, if we decide in a few months that we need more detail, we can change the logging then. Sure, we won't have history, but that just means that the graphs we make start in 2012 instead of 2007.
Finally, we'll have to find a way to encode the country code in the logs and still keep Apache's Combined Log Format. And do we still care about the HTTP vs. HTTPS bit? Because if we use the IP column for the country code, we'll have to encode the HTTP/HTTPS thing somewhere else.
IP addresses have plenty of bits for a country code and HTTP/HTTPS encoding; we could, for example, use the first bytes for the country code.
So, it should be possible to implement GeoIP lookups in the future. I'd like to consider that a separate task from sanitizing the existing web logs, though.
It's separate, but without the on-the-fly geoip lookups we won't have any country codes, because the sanitizing process doesn't get them magically.
All the best
Sebastian
On Friday, September 02, 2011 10:08:37 Brian Szymanski wrote:
What exactly are we hoping to gain from the analysis of the (hopefully correctly) stripped logs?
Overall, all of the data we collect can be analyzed to see if any of it can be used to discover users, sets of users, or other personally identifying info.
It will also be helpful to know more about our websites, usage, referrers, etc. If Tor is going to be transparent in its data collection practices, we should be able to publish our web server logs without issue.
On 9/2/11 3:06 PM, Sebastian Hahn wrote:
On Sep 2, 2011, at 2:46 PM, Karsten Loesing wrote:
Hi Andrew,
On 9/2/11 2:18 AM, Andrew Lewman wrote:
On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See the tickets #1641 and #2489 for details.
My concern is that we have the data at all. We shouldn't have any sensitive information logged on the webservers. Therefore sanitizing the logs should not be necessary.
My concern is that we remove details from the logs and learn in a few months that we wanted to analyze them. I'd like to sanitize the existing logs first, make them available for people to analyze, and only change the Apache configuration once we're really sure we found the level of detail that we want. There's no rush in changing the Apache configuration now, right?
So, if we decide in a few months that we need more detail, we can change the logging then. Sure, we won't have history, but that just means that the graphs we make start in 2012 instead of 2007.
You're right. Once we change the logging we'll only have graphs from then on. But there's no immediate need to change the logging now. We can still do that a few months from now, when we have more experience with the sanitizing process (which we need anyway, if only for reordering requests) and subsequent analysis.
Finally, we'll have to find a way to encode the country code in the logs and still keep Apache's Combined Log Format. And do we still care about the HTTP vs. HTTPS bit? Because if we use the IP column for the country code, we'll have to encode the HTTP/HTTPS thing somewhere else.
IP addresses have plenty of bits for a country code and HTTP/HTTPS encoding; we could, for example, use the first bytes for the country code.
Sounds like a hack to me (not that I'm too opposed to it). How do other people encode country codes in Apache logs?
So, it should be possible to implement GeoIP lookups in the future. I'd like to consider that a separate task from sanitizing the existing web logs, though.
It's separate, but without the on-the-fly geoip lookups we won't have any country codes, because the sanitizing process doesn't get them magically.
Right. I'm just trying to keep the scope of this first discussion round small to speed things up. This is something to revisit a few months from now.
Thanks for your comments!
Best, Karsten
On 08/25/2011 03:08 AM, Karsten Loesing wrote:
Hi everyone,
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See the tickets #1641 and #2489 for details.
Why?
I.e., what are the great benefits hoped to arise from such publication to outweigh the considerable risks?
- Marsh
On 9/2/11 7:32 PM, Marsh Ray wrote:
On 08/25/2011 03:08 AM, Karsten Loesing wrote:
Hi everyone,
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See the tickets #1641 and #2489 for details.
Why?
I.e., what are the great benefits hoped to arise from such publication to outweigh the considerable risks?
The benefits are, e.g., that we learn more about our website visitors and can make our websites more useful for them. And we can learn which packages users download, including their platforms, languages, etc. which may help us concentrate our efforts better. These are just two examples, but I think we agree that analyzing web logs does provide a benefit.
Our general approach with analyzing potentially sensitive data is to openly discuss the algorithm to remove any sensitive parts, make the resulting data publicly available, and only analyze those. Ideally, we don't want to collect the sensitive parts at all, but sometimes that's not feasible (IP addresses in bridge descriptors, request order in web server logs), so we need to post-process the data before publication.
I think the overall risk of our approach is considerably lower than that of trying to keep the data you're planning to analyze private, because there's always the risk of losing data.
See this paper and website for a better answer:
https://metrics.torproject.org/papers/wecsr10.pdf
https://metrics.torproject.org/formats.html
What are the considerable risks you're referring to?
Best, Karsten
On 8/25/11 10:08 AM, Karsten Loesing wrote:
we have been discussing sanitizing and publishing our web server logs for quite a while now. The idea is to remove all potentially sensitive parts from the logs, publish them in monthly tarballs on the metrics website, and analyze them for top visited pages, top downloaded packages, etc. See the tickets #1641 and #2489 for details.
Here's a suggested sanitizing procedure for our web logs, which are in Apache's combined log format:
- Ignore everything except GET requests.
- Ignore all requests that resulted in a 404 status code.
- Rewrite log lines so that they only contain the following fields:
  - IP address 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests (as logged by our Apache configuration),
  - the request date (with the time part set to 00:00:00),
  - the requested URL (cut off at the first encountered "?"),
  - the HTTP version,
  - the server's HTTP status code, and
  - the size of the returned object.
- Write all lines from a given virtual host and day to a single output file.
- Sort the output file alphanumerically to conceal the original order of requests.
Pushing this forward. Here are the sanitized web logs that we'd like to publish on a daily basis for all our web servers and virtual domains for all of 2010 (155M):
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-01.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-02.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-03.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-04.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-05.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-06.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-07.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-08.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-09.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-10.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-11.tar
http://freehaven.net/~karsten/volatile/torproject-weblogs-2010-12.tar
The webalizer output for www.torproject.org can be viewed here:
http://freehaven.net/~karsten/volatile/www.torproject.org-webalizer/
So. Is it safe to publish these logs on a daily basis? The same questions from my original mail apply here:
Is there still anything sensitive in that log file that we should remove? For example:

- Do the logs reveal how many pages were cached already on the requestor's site (e.g. as repeat accesses)? Note that log files are sorted before being published.
- Are there other concerns about making these sanitized log files publicly available?

Are the decisions to remove parts from the logs reasonable? In particular:

- Do we have to take out all requests with 404 status codes? Some of these requests for non-existing URLs contain typos which may not be safe to make public. Should we instead put in some placeholder for the URL part and keep the 404 lines to know how many 404's we have per day?
- Is there any good reason to keep the portion of a URL after a "?"?
- Is it possible to leave some part of Referers in the logs that helps us figure out where our traffic originates and what search terms people use to find us?
- Can we resolve client IP addresses to country codes and include those in the logs instead of our 0.0.0.0/0.0.0.1 code for HTTP/HTTPS? How would we handle countries with only a few users per day, e.g., should there be a threshold below which we consider requests to come from "a country with less than XY users?"
The next steps will be to make these sanitized logs available on a daily basis and to publish the sanitized archives from 2008, 2009, and 2011.
I'm going to wait another week (probably longer) for feedback before taking these next steps.
Best, Karsten
On Tue, Oct 18, 2011 at 8:27 AM, Karsten Loesing <karsten.loesing@gmx.net> wrote:
The webalizer output for www.torproject.org can be viewed here:
http://freehaven.net/~karsten/volatile/www.torproject.org-webalizer/
I have looked into four different web log analysis tools, see https://trac.torproject.org/projects/tor/ticket/4463#comment:4 for details.