[tor-dev] Tor Browser downloads and updates graphs

Karsten Loesing karsten at torproject.org
Sun Oct 9 10:00:01 UTC 2016


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 22/09/16 01:48, Aaron Johnson wrote:

Oops, this thread got lost in the Seattle preparations and only
surfaced today while doing some housekeeping.  Please find my response
below.

>> Log files are sorted as part of the sanitizing procedure, so that
>>  request order should not be preserved.  If you find a log file
>> that is not sorted, please let us know, because that would be a
>> bug.
> 
> That’s great! It just appeared ordered in that multiple related 
> requests appeared in sequence, but I see that sorting can have
> that effect too.

Okay, glad you didn't find a bug there.

>>> 2. The size of the response is included, which potentially 
>>> allows an adversary observing the client side to perform a 
>>> correlation attack (combined with #1 above). This could allow
>>> the adversary to learn interesting things like (i) this person
>>> is downloading arm and thus is probably running a relay or (ii)
>>> this person is creating Trac tickets with onion-service bugs
>>> and is likely running an onion service somewhere (or is Trac
>>> excluded from these logs?). The size could also be used as a
>>> time-stamping mechanism alternative to #1 if the size of the
>>> request can be changed by the adversary (e.g. by blog
>>> comments).
>> 
>> This seems less of a problem with request order not being 
>> preserved. And actually, the logged size is the size of the
>> object on the server, not the number of bytes written to the
>> client.  Even if these sizes were scrubbed, it would be quite
>> easy for an attacker to find out most of these sizes by simply
>> requesting the objects themselves.  On the other hand, not including
>> them would make some analyses unnecessarily hard. I'd say it's
>> reasonable to keep them.
> 
> Here is a concern: if the adversary can cause the size to be
> modified (say by adding comments to a blog page), then he can
> effectively mark certain requests as happening within a certain
> time period by setting a unique size for that time period.

Alright, I see your point.  We should remove sizes of requested
objects that can be modified by users and hence adversaries.  The blog
is not affected here, because we're not including sanitized logs of
the blog yet, and even if we were, comments are manually approved by
the blog admins, which only happens a few times per day and takes away
control from an adversary.

But we do have Trac logs where users can easily add a comment or
modify a wiki page.  We should simply include 0 as requested object
size in those logs.  And we should make sure we're doing the same with
future sites where users can modify content.  Added to my list.

>>> 3. Even without fine-grained timing information, daily 
>>> per-server logs might include data from few enough clients
>>> that multiple requests can be reasonably inferred to be from
>>> the same client, which can collectively reveal lots of
>>> information (e.g. country based on browser localization used,
>>> platform, blog posts viewed/commented on if the blog server
>>> also releases logs).
>> 
>> We're removing almost all user data from request logs and only 
>> preserving data about the requested object.  For example, we're 
>> throwing away user agent strings and request parameters.  I don't
>>  really see the problem you're describing here.
> 
> This might be easiest to appreciate in the limit. Suppose you have
> a huge number of servers (relative to the number of clients) with
> DNS load-balancing among them. Each one basically has no requests
> or all those from the same client. Linking together multiple client
> requests allows them to collectively reveal information about the
> client. You might learn the language in one request, the platform
> in another, etc. A similar argument applies to splitting the logs
> across increasingly small time periods (per-day, per-hour, although
> at some point the time period gets below a given client’s
> “browsing session”). Obviously both of these examples are far from
> reality at this point, but the more you separate the logs across
> machines and over time, the more that requests might reasonably be
> inferred to belong to the same client. This presents a tradeoff
> you can make between accuracy and privacy by aggregating across
> more machines and over longer time periods.

So, I'm not sure if the following is feasible with the current
sanitizing code.  What we could do is merge all logs coming from
different servers for a given site and day, sort them, and provide
them as a single sanitized log file.  That would address your concern
here without making the logs any less useful for analysis.  If we
cannot do this right now, I'll make a note to implement it when we
re-implement this code in Java and add it to CollecTor.  Added to my
list, too.
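
To make this concrete, the merge step could be as simple as the
following sketch; paths, class name, and the plain in-memory sort are
assumptions about a possible implementation, not existing code:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch only: combine all per-server sanitized logs for one site
 *  and day into a single sorted file, so that requests can no longer
 *  be attributed to a particular server. */
public class LogMerger {

  public static void mergeAndSort(List<Path> perServerLogs,
      Path mergedLog) throws IOException {
    List<String> allLines = new ArrayList<>();
    for (Path log : perServerLogs) {
      allLines.addAll(Files.readAllLines(log));
    }
    Collections.sort(allLines);  // sorting also removes request order
    Files.write(mergedLog, allLines);
  }
}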

>> let's do that now:
> 
> :-D

Well, we did discuss benefits and risks at length a few years ago; we
just didn't follow these guidelines simply because they didn't exist
back then.

>>> • Don't collect data you don't need (minimization).
>> 
>> I can see us using sanitized web logs from all Tor web servers, 
>> not limited to Tor Browser/Tor Messenger downloads and Tor main 
>> website hits.  I used these logs to learn whether Atlas or Globe 
>> had more users, and I just recently looked at Metrics logs to
>> see which graphs are requested most often.
> 
> A more conservative approach would be more “pull” than “push”, so
> you don’t collect data until you want it, at which point you add it
> to the collection list. Just a thought.

The downside is that we'd be losing history.  I'm not in favor of that
approach.  To give a random example, it would have made the Tor
Messenger analysis a lot less useful, because most downloads happened
at the initial release a year ago.  I'd rather we ensure that
sanitized logs no longer contain sensitive parts, and publishing them
seems to me like a good way to learn whether that's the case.

>>> • The benefits should outweigh the risks.
>> 
>> I'd say this is the case.  As you say below yourself, there is
>> value in analyzing these logs, and I agree.  I have also been
>> thinking a lot about possible risks, which resulted in the
>> sanitizing procedure that is in place.  That procedure comes on
>> top of the very restrictive logging policy of Tor's Apache
>> processes, which throws away client IP addresses and other
>> sensitive data right at the logging step.  All in all, yes,
>> benefits do outweigh the risks here, in my opinion.
> 
> I think this is the ultimate test, and it sounds like you put a
> lot of thought into it (as expected).

Yep.

>>> • Consider auxiliary data (e.g. third-party data sets) when 
>>> assessing the risks.
>> 
>> I don't see a convincing scenario where this data set would make
>> a third-party data set more dangerous.
> 
> Are there any files that are only of particular interest to a
> particular user or user subpopulation? Examples might be an
> individual’s blog or Tor instructions in Kurdish. If so, revealing
> that they have been accessed could indicate if and when the user
> or subpopulation is active on the Tor site. Are there any files
> that might hold particular interest to some adversary? Examples
> might be a comparison in Mandarin between Psiphon tools and Tor. If
> so, revealing their access frequency could indicate to the
> adversary that they should pay close attention to whatever is
> signified by that file. A similar issue arose with the popularity
> of onion services, about which I believe the current consensus is
> that it should be hidden, the canonical example being a government
> that monitors the popularity of political opposition forums to
> determine which ones are beginning to be popular and thus need to
> be repressed.

I believe that we should only be using data that we're also
publishing.  And I can see how we'd want to learn ourselves whether
our outreach efforts are successful or not.  So can others.  I don't
believe in using that information while at the same time trying to
keep it secret.

>>> • Consider whether the user meant for that data to be private.
>> 
>> We're removing the user's IP address, request parameters, and
>> user agent string, and we're throwing out requests that resulted
>> in a 404 or that used a different method than GET or HEAD.  I
>> can't see how a user meant the remaining parts to be private.
> 
> I’m happy to see that you’re removing 404s! Some things that
> occurred to me are avoided by doing this (e.g. inadvertent
> sensitive client requests).

Yes, keeping 404s would have been bad.
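
For illustration, here is a rough sketch of that filtering step; the
combined-log pattern and the sanitized output format are assumptions
for this example, not the actual code:

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: keep just GET and HEAD requests that did not result
 *  in a 404, strip request parameters, and retain only the requested
 *  object, status code, and size.  Client IP address, timestamp,
 *  referer, and user agent are simply never copied to the output. */
public class RequestFilter {

  private static final Pattern LOG_LINE = Pattern.compile(
      "\\S+ \\S+ \\S+ \\[[^\\]]+\\] \"(GET|HEAD) (\\S+) [^\"]*\""
      + " (\\d{3}) (\\d+|-).*");

  /** Returns a sanitized line, or empty if the request is dropped. */
  public static Optional<String> sanitize(String rawLine) {
    Matcher m = LOG_LINE.matcher(rawLine);
    if (!m.matches()) {
      return Optional.empty();       // non-GET/HEAD or unparseable: drop
    }
    if (m.group(3).equals("404")) {
      return Optional.empty();       // drop 404s entirely
    }
    String path = m.group(2);
    int q = path.indexOf('?');
    if (q >= 0) {
      path = path.substring(0, q);   // strip request parameters
    }
    return Optional.of(m.group(1) + " " + path + " "
        + m.group(3) + " " + m.group(4));
  }
}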

>> We shall specify the sanitizing procedure in more detail as soon 
>> as these logs are provided by CollecTor.  I could imagine that 
>> we'll write down the process similarly to the bridge descriptor
>> sanitizing process:
>> 
>> https://collector.torproject.org/#bridge-descriptors
> 
> I look forward to the writeup!

You'll learn about the CollecTor re-implementation and documentation
on this list or in the monthly team reports on the tor-reports list.
Though I'm not very optimistic that it will happen in the next 9
months, given that our roadmap is already quite full:

https://trac.torproject.org/projects/tor/wiki/org/teams/MetricsTeam#RoadmapfromOctober2016toJune2017

But it's on my list.

> Best, Aaron

Thanks for your input here!

All the best,
Karsten
-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org

iQEcBAEBCAAGBQJX+hUgAAoJEC3ESO/4X7XBlQMH/jDtFP/p26w4SF9y9yujmrKo
LJ9Ps+VMMPZOwqS2InuxXR3pV7al8azmu4PEqEviZih2LChRdpCcB3R+DFmLfJc0
UwSO22wSNrHze/VBQuv7aGoQEGhRwzgy4dYLxR4W9GVRcTCvUVHnuqb/yLT8tqoV
SztyDL0qU34IF2ZmPjZTl4vek1ysw/d0WNkNOkKBsW8kD5XQCattp1vdXs75xVLs
169PuDEF+v9pXaeOX52l2c8O8R71V6fPQ0D7vU5UbwACFvh7CgiFGmw0Rm9pQeQR
8zQB9af3FgpYfifLStOnye6qfosUXWjlaVNhaPbaKW/isPKShJqsOYE6EXv6sJo=
=EjkQ
-----END PGP SIGNATURE-----

