<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div><blockquote type="cite" class=""><div class=""><div class=""><fieldset style="padding-top:10px; border:0px; border: 3px solid #CCC; padding-left: 20px;" class=""><div style="padding-left:3px;" class="">Log files are sorted as part of the sanitizing procedure, so that<br class="">request order should not be preserved.<span class="Apple-converted-space"> </span> If you find a log file that is<br class="">not sorted, please let us know, because that would be a bug.<br class=""></div></fieldset></div></div></blockquote><div><br class=""></div><div>That’s great! It just appeared ordered in that multiple related requests appeared in sequence, but I see that sorting can have that effect too.</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><fieldset style="padding-top:10px; border:0px; border: 3px solid #CCC; padding-left: 20px;" class=""><div style="padding-left:3px;" class=""><br class="">> 2. The size of the response is included, which potentially allows<br class="">> an adversary observing the client side to perform a correlation<br class="">> attack (combined with #1 above). This could allow the adversary to<br class="">> learn interesting things like (i) this person is downloading arm<br class="">> and thus is probably running a relay or (ii) this person is<br class="">> creating Trac tickets with onion-service bugs and is likely running<br class="">> an onion service somewhere (or is Trac excluded from these logs?).<br class="">> The size could also be used as an time-stamping mechanism<br class="">> alternative to #1 if the size of the request can be changed by the<br class="">> adversary (e.g. 
by blog comments).<br class=""><br class="">This seems less of a problem with request order not being preserved.<br class="">And actually, the logged size is the size of the object on the server,<br class="">not the number of bytes written to the client.<span class="Apple-converted-space"> </span> Even if these sizes<br class="">were scrubbed, it would be quite easy for an attacker to find out most<br class="">of these sizes by simply requesting objects themselves.<span class="Apple-converted-space"> </span> On the other<br class="">hand, not including them would make some analyses unnecessarily hard.<br class=""> I'd say it's reasonable to keep them.<br class=""></div></fieldset></div></div></blockquote><div><br class=""></div><div>Here is a concern: if the adversary can cause the size to be modified (say by adding comments to a blog page), then he can effectively mark certain requests as happening within a certain time period by setting a unique size for that time period.</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><fieldset style="padding-top:10px; border:0px; border: 3px solid #CCC; padding-left: 20px;" class=""><div style="padding-left:3px;" class="">> 3. Even without fine-grained timing information, daily per-server<br class="">> logs might include data from few enough clients that multiple<br class="">> requests can be reasonably inferred to be from the same client,<br class="">> which can collectively reveal lots of information (e.g. 
country<br class="">> based on browser localization used, platform, blog posts<br class="">> viewed/commented on if the blog server also releases logs).<br class=""><br class="">We're removing almost all user data from request logs and only<br class="">preserving data about the requested object.<span class="Apple-converted-space"> </span> For example, we're<br class="">throwing away user agent strings and request parameters.<span class="Apple-converted-space"> </span> I don't<br class="">really see the problem you're describing here.<br class=""></div></fieldset></div></div></blockquote><div><br class=""></div><div>This might be easiest to appreciate in the limit. Suppose you have a huge number of servers (relative to the number of clients) with DNS load-balancing among them. Each one basically has either no requests or only requests from the same client. Linking together multiple client requests allows them to collectively reveal information about the client. You might learn the language in one request, the platform in another, etc. A similar argument applies to splitting the logs across increasingly small time periods (per-day, per-hour, although at some point the time period gets below a given client’s “browsing session”). Obviously neither of these examples is realistic when taken to the limit, but the more you separate the logs across machines and over time, the more that requests might reasonably be inferred to belong to the same client. 
This presents a tradeoff between accuracy and privacy, which you can adjust by aggregating across more machines and over longer time periods.</div><div><br class=""></div><blockquote type="cite" class=""><div class=""><div class=""><fieldset style="padding-top:10px; border:0px; border: 3px solid #CCC; padding-left: 20px;" class=""><div style="padding-left:3px;" class="">let's do that now:<br class=""></div></fieldset></div></div></blockquote><div><br class=""></div><div>:-D</div><div><br class=""></div><blockquote type="cite" class=""><div class=""><div class=""><fieldset style="padding-top:10px; border:0px; border: 3px solid #CCC; padding-left: 20px;" class=""><div style="padding-left:3px;" class="">> • Don't collect data you don't need (minimization).<br class=""><br class="">I can see us using sanitized web logs from all Tor web servers, not<br class="">limited to Tor Browser/Tor Messenger downloads and Tor main website<br class="">hits.<span class="Apple-converted-space"> </span> I used these logs to learn whether Atlas or Globe had more<br class="">users, and I just recently looked at Metrics logs to see which graphs<br class="">are requested most often.<br class=""></div></fieldset></div></div></blockquote><div><br class=""></div><div>A more conservative approach would be more “pull” than “push”, so you don’t collect data until you want it, at which point you add it to the collection list. 
Just a thought.</div><div><br class=""></div><blockquote type="cite" class=""><div class=""><div class=""><fieldset style="padding-top:10px; border:0px; border: 3px solid #CCC; padding-left: 20px;" class=""><div style="padding-left:3px;" class="">> • The benefits should outweigh the risks.<br class=""><br class="">I'd say this is the case.<span class="Apple-converted-space"> </span> As you say below yourself, there is value<br class="">of analyzing these logs, and I agree.<span class="Apple-converted-space"> </span> I have also been thinking a lot<br class="">about possible risks, which resulted in the sanitizing procedure that<br class="">is in place, which comes after the very restrictive logging policy at<br class="">Tor's Apache processes, which throws away client IP addresses and<br class="">other sensitive data right at the logging step.<span class="Apple-converted-space"> </span> All in all, yes,<br class="">benefits do outweigh the risks here, in my opinion.<br class=""></div></fieldset></div></div></blockquote><div><br class=""></div><div>I think this is the ultimate test, and it sounds like you put a lot of thought into it (as expected).</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><fieldset style="padding-top:10px; border:0px; border: 3px solid #CCC; padding-left: 20px;" class=""><div style="padding-left:3px;" class=""><br class="">> • Consider auxiliary data (e.g. third-party data sets) when<br class="">> assessing the risks.<br class=""><br class="">I don't see a convincing scenario where this data set would make a<br class="">third-party data set more dangerous.<br class=""></div></fieldset></div></div></blockquote><div><br class=""></div><div>Are there any files that are of interest only to a particular user or user subpopulation? Examples might be an individual’s blog or Tor instructions in Kurdish. 
If so, revealing that they have been accessed could indicate if and when the user or subpopulation is active on the Tor site. Are there any files that might hold particular interest to some adversary? Examples might be a comparison in Mandarin between Psiphon tools and Tor. If so, revealing their access frequency could indicate to the adversary that they should pay close attention to whatever is signified by that file. A similar issue arose with the popularity of onion services, about which I believe the current consensus is that it should be hidden, the canonical example being a government that monitors the popularity of political opposition forums to determine which ones are beginning to be popular and thus need to be repressed.</div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><fieldset style="padding-top:10px; border:0px; border: 3px solid #CCC; padding-left: 20px;" class=""><div style="padding-left:3px;" class="">> • Consider whether the user meant for that data to be private.<br class=""><br class="">We're removing the user's IP address, request parameters, and user<br class="">agent string, and we're throwing out requests that resulted in a 404<br class="">or that used a different method than GET or HEAD.<span class="Apple-converted-space"> </span> I can't see how a<br class="">user meant the remaining parts to be private.<br class=""></div></fieldset></div></div></blockquote><div><br class=""></div><div>I’m happy to see that you’re removing 404s! Some things that occurred to me are avoided by doing this (e.g. 
inadvertent sensitive client requests).</div><div><br class=""></div><br class=""><blockquote type="cite" class=""><div class=""><div class=""><fieldset style="padding-top:10px; border:0px; border: 3px solid #CCC; padding-left: 20px;" class=""><div style="padding-left:3px;" class="">We shall specify the sanitizing procedure in more detail as soon as<br class="">these logs are provided by CollecTor.<span class="Apple-converted-space"> </span> I could imagine that we'll<br class="">write down the process similar to the bridge descriptor sanitizing<br class="">process:<br class=""><br class=""><a href="https://collector.torproject.org/#bridge-descriptors" class="">https://collector.torproject.org/#bridge-descriptors</a><br class=""></div></fieldset></div></div></blockquote><div><br class=""></div>I look forward to the writeup!</div><div><br class=""></div><div>Best,</div><div>Aaron</div></body></html>