commit 52731f5544954594a7b6e6805dd1136539e17856
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Fri Dec 30 12:08:58 2011 +0100

    Document the new sanitizing code, but don't implement it yet.
---
 src/org/torproject/webstats/Main.java |  109 +++++++++++++++++++++++++++++++--
 1 files changed, 104 insertions(+), 5 deletions(-)

diff --git a/src/org/torproject/webstats/Main.java b/src/org/torproject/webstats/Main.java
index bcf5cf1..628e4a3 100644
--- a/src/org/torproject/webstats/Main.java
+++ b/src/org/torproject/webstats/Main.java
@@ -7,11 +7,110 @@ import java.util.regex.*;
 import org.apache.commons.compress.compressors.gzip.*;
-/* Sanitize gz-compressed Apache web logs by removing all potentially
- * sensitive parts and makes sure we never sanitize a web log file
- * twice. Only consider lines containing HTTP GET requests. Append
- * sanitized lines to out/YYYY/MM/DD/<filename>access.log.
- * TODO Document how exactly the sanitizing process works. */
+/*
+ * Sanitize Apache web logs by removing all potentially sensitive parts.
+ *
+ * TODO Document what exactly is sanitized and how that is done.
+ *
+ * TODO Implement the following description.
+ *
+ * The main operation is to parse Apache web log files from the in/
+ * directory and write sanitized web log files to the out/ directory.
+ * Files in the in/ directory are assumed to never change and will be
+ * deleted after processing by this program. Files in the out/ directory
+ * are guaranteed to never change and may be deleted by a subsequently
+ * running program.
+ *
+ * This program uses a couple of state files to make sure that files in
+ * in/ are not parsed more than once and that files in out/ do not need to
+ * be changed:
+ * - state/lock prevents concurrent executions of this program.
+ * - state/in-history contains file names of previously read and deleted
+ *   files in the in/ directory.
+ * - state/in-history.new is the file written in the current execution
+ *   that will replace state/in-history during the execution.
+ * - state/execution/ contains new or updated output files parsed in the
+ *   current execution.
+ * - state/out-history contains file names of previously written and
+ *   possibly deleted files in the out/ directory.
+ * - state/out-history.new is the file written in the current execution
+ *   that will replace state/out-history at the end of the execution.
+ * - state/full/ contains complete output files that may or may not be
+ *   newer than files in the out/ directory.
+ * - state/diff/ contains new parts for files in the out/ directory which
+ *   have been deleted.
+ *
+ * The steps taken by this program are as follows:
+ * 1. Check that state/lock does not exist, or exit immediately. Add a
+ *    new state/lock file.
+ * 2. Read the contents from state/in-history and state/out-history and
+ *    the directory listings of out/, state/diff/, and state/full/ to
+ *    memory.
+ * 3. For each file in in/:
+ *    a. Append the file name to state/in-history.new.
+ *    b. Check that the file name is not contained in state/in-history.
+ *       If it is, print out a warning and skip the file.
+ *    c. Parse the file in chunks of 250,000 lines to reduce writes.
+ *    d. When writing sanitized chunks to output files, for each output
+ *       file, check in the following order if there is already such a
+ *       file in
+ *       i. state/execution/,
+ *       ii. state/full/,
+ *       iii. out/, or
+ *       iv. state/diff/.
+ *       If there's such a file, merge the newly sanitized lines with
+ *       that file and write the sorted result to state/execution/.
+ * 4. Rename state/in-history to state/in-history.old and rename
+ *    state/in-history.new to state/in-history. Delete
+ *    state/in-history.old.
+ * 5. Delete files in in/ that have been parsed in this execution.
+ * 6. For each file in state/execution/:
+ *    a. Check if there's a corresponding line in state/out-history. If
+ *       so, check whether there is a file in state/full/ or out/. If
+ *       so, move the file to state/full/. Otherwise move the file to
+ *       state/diff/, overwriting the file there if one exists.
+ *    b. If a. does not apply and the sanitized log is less than four (4)
+ *       days old, move the file to state/full/.
+ *    c. If b. does not apply, append a line to out-history.new and move
+ *       the file to out/.
+ * 7. Rename state/out-history to state/out-history.old and rename
+ *    state/out-history.new to state/out-history. Delete
+ *    state/out-history.old.
+ * 8. Delete state/lock and exit.
+ *
+ * If the program is interrupted and leaves a lock file in state/lock, it
+ * requires an operator to fix the state/ directory and make it work
+ * again. IMPORTANT: DO NOT CHANGE ANYTHING IN THE state/ DIRECTORY
+ * UNLESS YOU'RE CERTAIN WHAT YOU'RE DOING! The following situations can
+ * happen. It may make sense to try a solution in a non-production
+ * setting first:
+ * A. The file state/in-history.new does not exist and there are no files
+ *    in state/execution/. The process died before step 3. Delete
+ *    state/lock and re-run the program.
+ * B. The file state/in-history.new exists and there are files in
+ *    state/execution/. The process died during step 3 or 4. Delete
+ *    all files in state/execution/. If state/in-history does not exist,
+ *    but state/in-history.old does exist, rename the latter to the
+ *    former. Delete state/lock and re-run the program.
+ * C. The file state/in-history.new does not exist, but there are files
+ *    in state/execution/. The process died after step 4. Run steps
+ *    5 to 8 manually. Then re-run the program.
+ *
+ * Whenever logs are parsed that are 4 days old or older, there may
+ * already be output files in out/ that cannot be modified anymore. The
+ * operator may decide to manually overwrite files in out/ with the files
+ * in state/full/ or state/diff/. IMPORTANT: ONLY OVERWRITE FILES IN out/
+ * IF YOU'RE CERTAIN HOW TO FIX THE PROGRAM THAT PARSES ITS FILES. There
+ * are two possible situations:
+ * A. There is a file in state/full/. This file is newer than the file
+ *    with the same name in out/ and contains everything from that file,
+ *    too. It's okay to overwrite the file in out/ with the file in
+ *    state/full/ and delete the file in state/full/.
+ * B. There is a file in state/diff/. The file in out/ didn't exist
+ *    anymore when parsing more log lines for it. The file that was in
+ *    out/ should be located and merged with the file in state/diff/.
+ *    Afterwards, the file in state/diff/ should be deleted.
+ */
 public class Main {
   private static File historyFile = new File("hist");
   private static File inputDirectory = new File("in");
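
The routing rules in step 6 of the new comment are not implemented by this commit. One possible reading of them is sketched below as a pure decision function. Everything in it is hypothetical and only for illustration: the class OutputRoutingSketch, the Destination enum, the route method, and the example file names do not appear in the commit. The sketch also assumes that the age of a sanitized log is the number of whole days between the log's date and the current date, which the comment does not spell out.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Collections;
import java.util.Set;

/* Hypothetical sketch of the step 6 decision; not part of the commit. */
final class OutputRoutingSketch {

  /* Target directory for a file found in state/execution/. */
  enum Destination { STATE_FULL, STATE_DIFF, OUT }

  /* Assumption: "less than four (4) days old" means fewer than four
   * whole days between the log date and today. */
  static final long MIN_AGE_DAYS_FOR_OUT = 4;

  /* Decide where a file from state/execution/ goes.  The three sets
   * stand for the contents of state/out-history and the directory
   * listings of state/full/ and out/ that step 2 reads into memory. */
  static Destination route(String fileName, LocalDate logDate,
      LocalDate today, Set<String> outHistory, Set<String> fullFiles,
      Set<String> outFiles) {
    if (outHistory.contains(fileName)) {
      /* Step 6a: the file was published before.  If a complete copy
       * still exists, keep a full file; otherwise only a diff. */
      boolean stillPresent = fullFiles.contains(fileName)
          || outFiles.contains(fileName);
      return stillPresent ? Destination.STATE_FULL
          : Destination.STATE_DIFF;
    }
    if (ChronoUnit.DAYS.between(logDate, today) < MIN_AGE_DAYS_FOR_OUT) {
      /* Step 6b: too recent to publish; more lines may still arrive. */
      return Destination.STATE_FULL;
    }
    /* Step 6c: publish.  The caller would also have to append the
     * file name to state/out-history.new. */
    return Destination.OUT;
  }

  public static void main(String[] args) {
    LocalDate today = LocalDate.of(2011, 12, 30);
    Set<String> none = Collections.emptySet();
    /* Hypothetical file name, ten days old, never published. */
    System.out.println(route("2011/12/20/example-access.log",
        today.minusDays(10), today, none, none, none));
  }
}
```

Keeping the decision free of file system access means the three cases can be checked in isolation before any files are moved. Under the stated assumption, a log dated exactly four days ago is already old enough to go to out/.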