[tor-commits] [metrics-web/master] Explain sorting more prominently.

karsten at torproject.org karsten at torproject.org
Tue Nov 14 13:14:32 UTC 2017


commit 76688eec38bc02a2813ae9bf9a72f1f1c2c239c3
Author: iwakeh <iwakeh at torproject.org>
Date:   Tue Nov 14 08:39:29 2017 +0000

    Explain sorting more prominently.
    
    Also make the point that normal web log analyzers can operate on sanitized logs.
    Improvements were suggested by Sebastian, cf. ticket-23243.
---
 .../src/main/resources/spec/web-server-logs.xml    | 17 +++++++------
 .../main/resources/web/WEB-INF/web-server-logs.jsp | 29 +++++++++++++---------
 2 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/website/src/main/resources/spec/web-server-logs.xml b/website/src/main/resources/spec/web-server-logs.xml
index d8efe53..13cfad7 100644
--- a/website/src/main/resources/spec/web-server-logs.xml
+++ b/website/src/main/resources/spec/web-server-logs.xml
@@ -20,7 +20,7 @@
   </front>
   <middle>
     <section title="Purpose of this document">
-      <t>BETA: As of November 8, 2017, this document is still under
+      <t>BETA: As of November 14, 2017, this document is still under
       discussion and subject to change without prior notice. Feel free
       to <eref target="/about.html#contact">contact us</eref> for questions or
       concerns regarding this document.</t>
@@ -174,6 +174,12 @@ mod_log_config module</eref>.</t>
       <section title="Re-assembling log files">
         <t>Rewritten log lines are re-assembled into sanitized log files based
         on physical host, virtual host, and request start date.</t>
+        <t>All rewritten log lines are sorted alphabetically, so that request
+        order cannot be inferred from sanitized log files.</t>
+        <t>Many of the sanitized log lines will now be identical.
+        But in order to not remove too much useful information we keep the
+        identical log lines and thus enable typical web log analyzers to
+        operate on the sanitized log files. </t>
         <t>The naming convention for sanitized log files is:
         <list>
           <t><virtual-host>_<physical-host>_access.log_YYYYMMDD[.xz]</t>
@@ -190,12 +196,9 @@ mod_log_config module</eref>.</t>
         'dist.torproject.org', are more familiar to the public and were therefore
         chosen to be the first naming component.
         </t>
-        <t>As last and certainly not least important sanitizing step, all
-        rewritten log lines are sorted alphabetically, so that request order
-        cannot be inferred from sanitized log files.</t>
-        <t>Sanitized log files are typically compressed before publication. In
-        particular the sorting step allows for highly efficient compression
-        rates. We typically use XZ for compression, which is indicated by
+        <t>Sanitized log files are typically compressed before publication.
+        The sorting step also allows for highly efficient compression rates.
+        We typically use XZ for compression, which is indicated by
         appending ".xz" to log file names, but this is subject to change.</t>
       </section>
     </section>
diff --git a/website/src/main/resources/web/WEB-INF/web-server-logs.jsp b/website/src/main/resources/web/WEB-INF/web-server-logs.jsp
index b1505df..5e9cc79 100644
--- a/website/src/main/resources/web/WEB-INF/web-server-logs.jsp
+++ b/website/src/main/resources/web/WEB-INF/web-server-logs.jsp
@@ -22,7 +22,7 @@
 "#rfc.section.1">1.</a> <a href=
 "#n-purpose-of-this-document">Purpose of this document</a></h2>
 <div id="rfc.section.1.p.1">
-<p>BETA: As of November 8, 2017, this document is still under
+<p>BETA: As of November 14, 2017, this document is still under
 discussion and subject to change without prior notice. Feel free to
 <a href="/about.html#contact">contact us</a> for questions or
 concerns regarding this document.</p>
@@ -254,6 +254,16 @@ of processing that format.</p>
 based on physical host, virtual host, and request start date.</p>
 </div>
 <div id="rfc.section.4.3.p.2">
+<p>All rewritten log lines are sorted alphabetically, so that
+request order cannot be inferred from sanitized log files.</p>
+</div>
+<div id="rfc.section.4.3.p.3">
+<p>Many of the sanitized log lines will now be identical. But in
+order to not remove too much useful information we keep the
+identical log lines and thus enable typical web log analyzers to
+operate on the sanitized log files.</p>
+</div>
+<div id="rfc.section.4.3.p.4">
 <p>The naming convention for sanitized log files is:</p>
 <ul class="empty">
 <li>
@@ -262,7 +272,7 @@ based on physical host, virtual host, and request start date.</p>
 <p>The underscore is a separator symbol between the various parts
 of the filename.</p>
 </div>
-<div id="rfc.section.4.3.p.3">
+<div id="rfc.section.4.3.p.5">
 <p>Sanitized log files may additionally be sorted into directories
 by virtual host and date as in:</p>
 <ul class="empty">
@@ -273,17 +283,12 @@ by virtual host and date as in:</p>
 'dist.torproject.org', are more familiar to the public and were
 therefore chosen to be the first naming component.</p>
 </div>
-<div id="rfc.section.4.3.p.4">
-<p>As last and certainly not least important sanitizing step, all
-rewritten log lines are sorted alphabetically, so that request
-order cannot be inferred from sanitized log files.</p>
-</div>
-<div id="rfc.section.4.3.p.5">
+<div id="rfc.section.4.3.p.6">
 <p>Sanitized log files are typically compressed before publication.
-In particular the sorting step allows for highly efficient
-compression rates. We typically use XZ for compression, which is
-indicated by appending ".xz" to log file names, but this is subject
-to change.</p>
+The sorting step also allows for highly efficient compression
+rates. We typically use XZ for compression, which is indicated by
+appending ".xz" to log file names, but this is subject to
+change.</p>
 </div>
 </section>
 </div> <!-- container -->


