[tor-commits] [webstats/master] Document the new sanitizing code, but don't implement it yet.

runa at torproject.org runa at torproject.org
Fri Dec 30 16:26:09 UTC 2011

commit 52731f5544954594a7b6e6805dd1136539e17856
Author: Karsten Loesing <karsten.loesing at gmx.net>
Date:   Fri Dec 30 12:08:58 2011 +0100

    Document the new sanitizing code, but don't implement it yet.
 src/org/torproject/webstats/Main.java |  109 +++++++++++++++++++++++++++++++--
 1 files changed, 104 insertions(+), 5 deletions(-)

diff --git a/src/org/torproject/webstats/Main.java b/src/org/torproject/webstats/Main.java
index bcf5cf1..628e4a3 100644
--- a/src/org/torproject/webstats/Main.java
+++ b/src/org/torproject/webstats/Main.java
@@ -7,11 +7,110 @@ import java.util.regex.*;
 import org.apache.commons.compress.compressors.gzip.*;
-/* Sanitize gz-compressed Apache web logs by removing all potentially
- * sensitive parts and makes sure we never sanitize a web log file
- * twice.  Only consider lines containing HTTP GET requests.  Append
- * sanitized lines to out/YYYY/MM/DD/<filename>access.log.
- * TODO Document how exactly the sanitizing process works. */
+/*
+ * Sanitize Apache web logs by removing all potentially sensitive parts.
+ *
+ * TODO Document what exactly is sanitized and how that is done.
+ *
+ * TODO Implement the following description.
+ *
+ * The main operation is to parse Apache web log files from the in/
+ * directory and write sanitized web log files to the out/ directory.
+ * Files in the in/ directory are assumed to never change and will be
+ * deleted after processing by this program.  Files in the out/ directory
+ * are guaranteed to never change and may be deleted by a subsequently
+ * running program.
+ *
+ * This program uses a couple of state files to make sure that files in
+ * in/ are not parsed more than once and that files in out/ do not need to
+ * be changed:
+ * - state/lock prevents concurrent executions of this program.
+ * - state/in-history contains file names of previously read and deleted
+ *   files in the in/ directory.
+ * - state/in-history.new is the file written in the current execution
+ *   that will replace state/in-history during the execution.
+ * - state/execution/ contains new or updated output files parsed in the
+ *   current execution.
+ * - state/out-history contains file names of previously written and
+ *   possibly deleted files in the out/ directory.
+ * - state/out-history.new is the file written in the current execution
+ *   that will replace state/out-history at the end of the execution.
+ * - state/full/ contains complete output files that may or may not be
+ *   newer than files in the out/ directory.
+ * - state/diff/ contains new parts for files in the out/ directory which
+ *   have been deleted.
+ *
+ * The steps taken by this program are as follows:
+ *  1. Check that state/lock does not exist, or exit immediately.  Add a
+ *     new state/lock file.
+ *  2. Read the contents from state/in-history and state/out-history and
+ *     the directory listings of out/, state/diff/, and state/full/ to
+ *     memory.
+ *  3. For each file in in/:
+ *     a. Append the file name to state/in-history.new.
+ *     b. Check that the file name is not contained in state/in-history.
+ *        If it is, print out a warning and skip the file.
+ *     c. Parse the file in chunks of 250,000 lines to reduce writes.
+ *     d. When writing sanitized chunks to output files, for each output
+ *        file, check in the following order if there is already such a
+ *        file in
+ *          i. state/execution/,
+ *         ii. state/full/,
+ *        iii. out/, or
+ *         iv. state/diff/.
+ *        If there's such a file, merge the newly sanitized lines with
+ *        that file and write the sorted result to state/execution/.
+ *  4. Rename state/in-history to state/in-history.old and rename
+ *     state/in-history.new to state/in-history.  Delete
+ *     state/in-history.old.
+ *  5. Delete files in in/ that have been parsed in this execution.
+ *  6. For each file in state/execution/:
+ *     a. Check if there's a corresponding line in state/out-history.  If
+ *        so, check whether there is a file in state/full/ or out/.  If
+ *        so, move the file to state/full/.  Otherwise move the file to
+ *        state/diff/, overwriting the file there if one exists.
+ *     b. If a. does not apply and the sanitized log is less than four (4)
+ *        days old, move the file to state/full/.
+ *     c. If b. does not apply, append a line to out-history.new and move
+ *        the file to out/.
+ *  7. Rename state/out-history to state/out-history.old and rename
+ *     state/out-history.new to state/out-history.  Delete
+ *     state/out-history.old.
+ *  8. Delete state/lock and exit.
+ *
+ * If the program is interrupted and leaves a lock file in state/lock, it
+ * requires an operator to fix the state/ directory and make it work
+ * again.  DO NOT CHANGE ANYTHING IN THE state/ DIRECTORY
+ * UNLESS YOU'RE CERTAIN WHAT YOU'RE DOING!  The following situations can
+ * happen.  It may make sense to try a solution in a non-production
+ * setting first:
+ *  A. The file state/in-history.new does not exist and there are no files
+ *     in state/execution/.  The process died before step 3.  Delete
+ *     state/lock and re-run the program.
+ *  B. The file state/in-history.new exists and there are files in
+ *     state/execution/.  The process died during step 3 or 4.  Delete
+ *     all files in state/execution/.  If state/in-history does not exist,
+ *     but state/in-history.old does exist, rename the latter to the
+ *     former.  Delete state/lock and re-run the program.
+ *  C. The file state/in-history.new does not exist, but there are files
+ *     in state/execution/.  The process died after step 4.  Run steps
+ *     5 to 8 manually.  Then re-run the program.
+ *
+ * Whenever logs are parsed that are 4 days old or older, there may
+ * already be output files in out/ that cannot be modified anymore.  The
+ * operator may decide to manually overwrite files in out/ with the files
+ * in state/full/ or state/diff/.  IMPORTANT: ONLY OVERWRITE FILES IN out/
+ * IF YOU'RE CERTAIN WHAT YOU'RE DOING!  There
+ * are two possible situations:
+ *  A. There is a file in state/full/.  This file is newer than the file
+ *     with the same name in out/ and contains everything from that file,
+ *     too.  It's okay to overwrite the file in out/ with the file in
+ *     state/full/ and delete the file in state/full/.
+ *  B. There is a file in state/diff/.  The file in out/ didn't exist
+ *     anymore when parsing more log lines for it.  The file that was in
+ *     out/ should be located and merged with the file in state/diff/.
+ *     Afterwards, the file in state/diff/ should be deleted.
+ */
 public class Main {
   private static File historyFile = new File("hist");
   private static File inputDirectory = new File("in");
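
The commit only documents the new behaviour, so the following is merely a sketch of how step 6 of the comment above (deciding where a file from state/execution/ ends up) might look. It is not the author's implementation. The class name RoutingSketch, the enum Target, the method route, and the idea of passing the history and directory listings as in-memory sets are all assumptions made for illustration.

```java
import java.util.Set;

/* Hypothetical sketch of step 6; none of these names exist in webstats. */
class RoutingSketch {

  /* Possible destinations for a file in state/execution/. */
  enum Target { FULL, DIFF, OUT }

  /* Decide where an output file goes.  outHistory holds the lines of
   * state/out-history, fullFiles and outFiles hold the directory
   * listings of state/full/ and out/, and ageDays is the age of the
   * sanitized log in days (how the age is computed is left open). */
  static Target route(String name, Set<String> outHistory,
      Set<String> fullFiles, Set<String> outFiles, long ageDays) {
    if (outHistory.contains(name)) {
      /* Step 6a: the file was published before.  If a copy still
       * exists we hold the complete file, otherwise only a diff. */
      return (fullFiles.contains(name) || outFiles.contains(name))
          ? Target.FULL : Target.DIFF;
    }
    if (ageDays < 4) {
      /* Step 6b: too recent to publish, keep it in state/full/. */
      return Target.FULL;
    }
    /* Step 6c: publish.  The caller would also append the file name
     * to state/out-history.new. */
    return Target.OUT;
  }

  public static void main(String[] args) {
    Set<String> history = Set.of("old-access.log");
    Set<String> none = Set.of();
    /* Previously published and still present in out/: FULL */
    System.out.println(route("old-access.log", history, none,
        Set.of("old-access.log"), 10));
    /* Previously published, but deleted from out/: DIFF */
    System.out.println(route("old-access.log", history, none, none, 10));
    /* Never published, two days old: FULL */
    System.out.println(route("new-access.log", history, none, none, 2));
    /* Never published, ten days old: OUT */
    System.out.println(route("new-access.log", history, none, none, 10));
  }
}
```

Keeping the decision free of file system access makes the three cases easy to check separately from the actual file moves.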

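Steps 1, 4, 7, and 8 of the comment (taking the lock, swapping a history file for its .new counterpart, and releasing the lock) could be sketched with java.nio.file as below. Again, this is an illustration under assumptions and not code from the commit: the class StateFilesSketch and its method names are hypothetical, and skipping the first rename when no history file exists yet is a guess about how a first run would behave.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/* Hypothetical sketch of the lock and history handling. */
class StateFilesSketch {

  /* Step 1: create state/lock, or report that it already exists.
   * Files.createFile checks and creates in one atomic operation. */
  static boolean acquireLock(Path stateDir) throws IOException {
    try {
      Files.createFile(stateDir.resolve("lock"));
      return true;
    } catch (FileAlreadyExistsException e) {
      return false;
    }
  }

  /* Steps 4 and 7: rename <name> to <name>.old, rename <name>.new to
   * <name>, then delete <name>.old.  Pass "in-history" or
   * "out-history" as name. */
  static void replaceHistory(Path stateDir, String name)
      throws IOException {
    Path current = stateDir.resolve(name);
    Path old = stateDir.resolve(name + ".old");
    Path fresh = stateDir.resolve(name + ".new");
    if (Files.exists(current)) {
      /* Assumption: on a first run there is no history file yet. */
      Files.move(current, old);
    }
    Files.move(fresh, current);
    Files.deleteIfExists(old);
  }

  /* Step 8: remove state/lock. */
  static void releaseLock(Path stateDir) throws IOException {
    Files.delete(stateDir.resolve("lock"));
  }
}
```

Because the old history file is only deleted after the new one is in place, an interruption between the two renames leaves state/in-history.old behind, which is the situation recovery case B in the comment tells the operator to look for.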