[tor-bugs] #1641 [Metrics]: Make website logs available in the Metrics Portal

Tor Bug Tracker & Wiki torproject-admin at torproject.org
Wed Feb 2 10:52:19 UTC 2011


#1641: Make website logs available in the Metrics Portal
---------------------+------------------------------------------------------
 Reporter:  karsten  |       Owner:  karsten 
     Type:  task     |      Status:  assigned
 Priority:  minor    |   Milestone:          
Component:  Metrics  |     Version:          
 Keywords:           |      Points:          
   Parent:           |  
---------------------+------------------------------------------------------
Changes (by karsten):

  * status:  new => assigned
  * owner:  Karsten => karsten


Comment:

 I looked at a web log sample from January 30 from one of our three
 current www servers.  Here's a sample line:

 {{{
 0.0.0.1 - - [30/Jan/2011:00:00:00 +0000] "GET /projects/projects.html.en
 HTTP/1.1" 200 3029 "https://www.torproject.org/docs/bridges.html.en" "-"
 }}}

 The format is Apache's Combined Log Format with the following exceptions:

  - The client IP address is replaced with either 0.0.0.0 for HTTP requests
 or 0.0.0.1 for HTTPS requests.
  - The request time is set to 00:00:00 +0000.
  - The user-agent string is set to "-".

 However, I found CONNECT requests and other non-GET requests in the logs,
 which are potentially sensitive.  Also, the referer string may be
 sensitive, especially if it's a non-Tor URL.  We should remove all log
 lines except GET requests and set the referer string to "-".

 An even better approach is to define the information we want to keep:

  - We publish only GET requests with the following data fields:
    - 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests,
    - the request date,
    - the requested URL,
    - the HTTP version,
    - the server's HTTP status code, and
    - the size of the returned object.

 We retain Apache's Combined Log Format for the sanitized logs, so that we
 can use standard web log analysis tools.
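
 As a rough illustration of the whitelist approach above, here is a minimal
 Python sketch, not the actual metrics-db parser.  The names LOG_LINE and
 sanitize_line are made up for this example.  It assumes the input lines
 are already in the pre-sanitized format shown earlier, and it drops any
 line it cannot parse or whose client address is not one of the two
 placeholders, which is my assumption about what a conservative sanitizer
 should do.

```python
import re

# Combined Log Format:
#   host ident user [time] "request" status size "referer" "user-agent"
# Only the fields on the whitelist are captured; the rest is ignored.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<date>[^:\]]+):[^\]]*\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) (?P<version>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def sanitize_line(line):
    """Return a sanitized Combined Log Format line, or None to drop it."""
    m = LOG_LINE.match(line)
    if m is None:
        return None  # unparseable: drop rather than risk leaking data
    if m.group("method") != "GET":
        return None  # CONNECT, POST, etc. are not published
    if m.group("host") not in ("0.0.0.0", "0.0.0.1"):
        return None  # assumption: never pass through a real client address
    # Rebuild the line from whitelisted fields only; the time of day,
    # referer, and user-agent are replaced with fixed placeholders.
    return (
        '{host} - - [{date}:00:00:00 +0000] '
        '"GET {url} {version}" {status} {size} "-" "-"'
    ).format(**m.groupdict())
```

 Rebuilding each line from the captured fields, rather than blanking out
 known-sensitive fields, means that anything unexpected in the input is
 left out by default.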

 Runa has web server log analysis on her TODO list.  I explained this
 approach to her yesterday.  She agreed with settling on a format like the
 one above and said that she'll find a way to work with it.

 How do we proceed?  Andrew says the sanitizing process cannot take place
 on the web servers, because they are quite busy already.  Can we set up
 copying our web server logs to yatei to do the sanitizing there?  I can
 write a parser as part of metrics-db and make daily updated sanitized web
 logs available in the metrics portal.  I also want to make a graph of
 packages downloaded per day available on the metrics website.  Once Runa
 starts her web server log analysis, we can extend this setup to copy the
 web server logs, either from the web servers or from yatei, to wherever
 she does the analysis.
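
 The per-day download numbers for such a graph could be derived from the
 sanitized logs roughly as in the Python sketch below.  The function name
 downloads_per_day and the "/dist/" URL prefix are assumptions made for
 this example.  Which URLs count as package downloads still needs to be
 decided.

```python
import re
from collections import Counter

# Pull the date, requested URL, and status code out of a log line.
REQUEST = re.compile(
    r'\[(?P<date>[^:\]]+):[^\]]*\] '
    r'"GET (?P<url>\S+) \S+" (?P<status>\d{3}) '
)


def downloads_per_day(lines, prefix="/dist/"):
    """Count successful GET requests under `prefix`, keyed by date string.

    `prefix` is a hypothetical location for package files.
    """
    counts = Counter()
    for line in lines:
        m = REQUEST.search(line)
        if m is None:
            continue
        # Count only complete, successful responses (status 200).
        if m.group("status") == "200" and m.group("url").startswith(prefix):
            counts[m.group("date")] += 1
    return counts
```

 Counting only status 200 is a simplification: partial (206) responses and
 repeated requests from download managers would need separate handling
 before the numbers are published.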

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/1641#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

