[tor-project] Plan to double-check Tor Browser initial download numbers

Karsten Loesing karsten at torproject.org
Wed Jul 19 16:32:07 UTC 2017


On 2017-07-18 01:46, teor wrote:
> 
>> On 18 Jul 2017, at 04:05, Karsten Loesing <karsten at torproject.org> wrote:
>>
>> Hello list,
>>
>> it's been almost two years since we started collecting sanitized Apache
>> web server logs. During this time the number of Tor Browser initial
>> downloads rarely went below 70,000 per day.
>>
>> https://metrics.torproject.org/webstats-tb.html
>>
>> Either there must be a steady demand for fresh binaries, or there is a
>> non-zero number of bots downloading the Tor Browser binary several times
>> per day.
>>
>> I already double-checked our aggregation code that takes sanitized web
>> server logs as input and produces daily totals as output. It looks okay
>> to me.
>>
>> I'd also like to double-check whether there's anything unexpected
>> happening before the sanitizing step. For example, could it be that
>> there are a few IP addresses making hundreds or thousands of requests?
>>
>> Or are there lots of requests with the same referrers or common user
>> agents indicating bots?
>>
>> My plan is to ask our admins to temporarily add a second Apache log file
>> on one of the dist.torproject.org hosts with the default Apache log file
>> format without the sanitizing that is usually applied.
>>
>> A snapshot of 15 or 30 minutes would likely be sufficient as a sample.
>> I'd analyze this log file on the server, delete it, and report my
>> findings here.
>>
>> This message has two purposes:
>>
>> 1. Is this approach acceptable? If not, are there more acceptable
>> approaches yielding similar results?
> 
> Can you get similar results with a default apache log file, with the
> following changes:
> * remove timestamps

Yes, we can remove timestamps. It's potentially useful information, but
it's also potentially sensitive information. Let's leave it out for now,
and if the remaining fields are not sufficient, let's reconsider
including timestamps in some way for a possible second experiment.
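
The fields that remain would still support the checks from the original
plan: a few addresses making many requests, or shared referrers and user
agents. A minimal sketch of such a tally follows. It assumes Apache's
combined log format with the %t field left out; the pattern LINE and the
helper top_values are hypothetical names, not part of any existing
script.

```python
import re
from collections import Counter

# Assumed line layout (combined log format without the timestamp):
# host ident user "request" status bytes "referer" "user-agent"
LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ "(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def top_values(lines, field, n=10):
    """Return the n most common values of one named field, with counts."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m:  # silently skip lines that do not fit the assumed layout
            counts[m.group(field)] += 1
    return counts.most_common(n)
```

Calling top_values(lines, "host"), top_values(lines, "referer") and
top_values(lines, "agent") on the snapshot would show whether a handful
of values dominates. Fields containing escaped quotes would need a more
careful parser than this pattern.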

> * sort lines to destroy the original order

That's more difficult, because Apache doesn't have an option for that.
But the order of dist.torproject.org requests is likely less sensitive
than the order of www.torproject.org requests, where users navigate
through the site. I'd say let's leave the order unchanged to keep this
experiment simple.
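
Should the order turn out to matter after all, it could be destroyed in
a post-processing step on the snapshot rather than in Apache itself. A
minimal sketch, assuming the default bracketed %t timestamp format; the
helper name strip_and_sort is hypothetical.

```python
import re

# Matches the bracketed %t field of the default Apache log formats,
# e.g. "[19/Jul/2017:16:32:07 +0000] ", including the trailing space.
TIMESTAMP = re.compile(
    r"\[\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}\] "
)

def strip_and_sort(lines):
    """Remove the timestamp from each line, then sort the lines."""
    stripped = [TIMESTAMP.sub("", line.rstrip("\n"), count=1) for line in lines]
    # Sorting lexicographically discards the original request order.
    return sorted(stripped)
```

Note that the unsorted file would still exist on disk until this step
has run, so this only limits what leaves the server, not what is
written there.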

> Without precise timing information, the data would be a lot less
> sensitive.

Agreed.

> It might also be useful to know the distribution of requests over
> a 24 hour period, without any other details. This might help you
> work out how the activity is being triggered.

Oh, that's a good idea, too! We shouldn't run this at the same time as
the currently planned experiment, but I'll put it on the list for a
follow-up experiment.
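
A minimal sketch of what such a follow-up could compute: it keeps only
the hour of each request and discards everything else on the line. It
assumes the default bracketed %t timestamp format, and the helper name
requests_per_hour is hypothetical.

```python
import re
from collections import Counter

# Captures only the hour from a timestamp like [19/Jul/2017:16:32:07 +0000].
HOUR = re.compile(r"\[\d{2}/[A-Za-z]{3}/\d{4}:(\d{2}):\d{2}:\d{2} [+-]\d{4}\]")

def requests_per_hour(lines):
    """Return a list of 24 request counts, indexed by hour of day."""
    counts = Counter()
    for line in lines:
        m = HOUR.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return [counts.get(hour, 0) for hour in range(24)]
```

A flat distribution over the day would point towards automated
requests, while a distribution with clear peaks would be more
consistent with requests made by people.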

>> 2. Are there any theories about what might keep the numbers from
>> dropping below those 70,000 requests per day? What should I be
>> looking for?
> 
> There are 86,400 seconds in a day, which means that we're getting
> about 1 request per second. This could be a single bot caught in a
> loop.
> 
> Are you only counting GET requests?

Yes.

> Do you count incomplete downloads?

Yes. Apache logs only the size of the returned object, not the number
of bytes actually transferred, so we cannot tell complete from
incomplete downloads. What we do see, though, are HTTP range requests
(status code 206), which we do not count.
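
The counting rule described in these answers can be sketched as follows.
This is not the actual aggregation code: the helper name
count_full_downloads and the file suffixes are assumptions chosen for
illustration only.

```python
import re

# Pulls method, path and status out of the quoted request line and the
# status field that follows it, e.g. "GET /file HTTP/1.1" 200 1234
REQUEST = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

def count_full_downloads(lines, suffixes=(".exe", ".dmg", ".tar.xz")):
    """Count GET requests answered with 200 for paths ending in suffixes.

    The suffixes are hypothetical placeholders for "Tor Browser binary".
    """
    total = 0
    for line in lines:
        m = REQUEST.search(line)
        if not m:
            continue
        if (
            m.group("method") == "GET"
            and m.group("status") == "200"  # 206 range responses are skipped
            and m.group("path").endswith(suffixes)
        ):
            total += 1
    return total
```

A download that is aborted halfway still shows up as a 200 response
with the full object size, which is why it is counted here as well.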

> (A continually failing automated download process could cause this.)

Yes.

Thanks for your questions and ideas here!

All the best,
Karsten

