That’s great! It just appeared ordered in that multiple related requests appeared in sequence, but I see that sorting can have that effect too.
Here is a concern: if the adversary can cause the size to be modified (say by adding comments to a blog page), then he can effectively mark certain requests as happening within a certain time period by setting a unique size for that time period.
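To make the concern concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration: the paths, the sizes, the time periods, and the idea that the sanitized log keeps (path, size) pairs with timestamps removed and entries sorted.

```python
# Hypothetical sketch: all names, paths, and numbers are illustrative assumptions.

def build_marker_table(base_size, periods):
    """Adversary's bookkeeping: the unique response size he set during each period."""
    return {base_size + i: period for i, period in enumerate(periods)}

def infer_periods(sanitized_entries, marker_table, path):
    """Recover the time period of each logged request to `path` from its size."""
    return [marker_table.get(size) for (p, size) in sanitized_entries if p == path]

periods = ["00:00-06:00", "06:00-12:00", "12:00-18:00", "18:00-24:00"]
table = build_marker_table(5000, periods)

# Sanitized log: no timestamps, entries sorted, but response sizes retained.
log = sorted([("/blog/post", 5002), ("/index.html", 1200), ("/blog/post", 5000)])

recovered = infer_periods(log, table, "/blog/post")
print(recovered)
```

Sorting and dropping timestamps do not help here, because the size itself carries the time information the adversary planted.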
This might be easiest to appreciate in the limit. Suppose you have a huge number of servers (relative to the number of clients) with DNS load-balancing among them. Each one basically has either no requests or only requests from a single client. Linking together multiple requests from one client allows them to collectively reveal information about that client. You might learn the language in one request, the platform in another, etc. A similar argument applies to splitting the logs across increasingly small time periods (per-day, per-hour, although at some point the time period gets below a given client’s “browsing session”). Obviously neither of these limits is realistic, but the more you separate the logs across machines and over time, the more reasonably requests might be inferred to belong to the same client. This presents a tradeoff you can make between accuracy and privacy by aggregating across more machines and over longer time periods.
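One way to get a feel for this tradeoff is to count how many of the published log files would contain requests from only one client. The sketch below is a toy model under my own assumptions: requests are (client, server, timestamp-in-seconds) tuples, and one log file is published per server and time bucket, or per time bucket only when servers are merged. The function name and the sample data are made up.

```python
from collections import defaultdict

def single_client_fraction(requests, per_server, bucket_seconds):
    """Fraction of non-empty log files whose requests all come from one client.

    requests: iterable of (client, server, timestamp_seconds) tuples (assumed format).
    per_server: True to publish one log per server, False to merge all servers.
    bucket_seconds: length of the time period covered by each published log.
    """
    groups = defaultdict(set)
    for client, server, ts in requests:
        key = (server if per_server else None, ts // bucket_seconds)
        groups[key].add(client)
    single = sum(1 for clients in groups.values() if len(clients) == 1)
    return single / len(groups)

# Made-up sample: three clients spread over three servers and two hours.
requests = [
    ("a", 0, 10), ("a", 0, 20), ("b", 1, 15),
    ("c", 2, 3700), ("b", 1, 3800),
]

fine = single_client_fraction(requests, per_server=True, bucket_seconds=3600)
coarse = single_client_fraction(requests, per_server=False, bucket_seconds=86400)
print(fine, coarse)
```

With per-server, per-hour logs every file in this sample holds a single client's requests, so they link together trivially; merging servers and using a per-day bucket mixes all three clients into one file.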
:-D
A more conservative approach would be more “pull” than “push”, so you don’t collect data until you want it, at which point you add it to the collection list. Just a thought.
I think this is the ultimate test, and it sounds like you put a lot of thought into it (as expected).
Are there any files that are only of particular interest to a particular user or user subpopulation? Examples might be an individual’s blog or Tor instructions in Kurdish. If so, revealing that they have been accessed could indicate if and when the user or subpopulation is active on the Tor site. Are there any files that might hold particular interest to some adversary? Examples might be a comparison in Mandarin between Psiphon tools and Tor. If so, revealing their access frequency could indicate to the adversary that they should pay close attention to whatever is signified by that file. A similar issue arose with the popularity of onion services, about which I believe the current consensus is that it should be hidden, the canonical example being a government that monitors the popularity of political opposition forums to determine which ones are beginning to be popular and thus need to be repressed.
I’m happy to see that you’re removing 404s! Some things that occurred to me are avoided by doing this (e.g. inadvertent sensitive client requests).