[tor-project] Plan to double-check Tor Browser initial download numbers

Karsten Loesing karsten at torproject.org
Wed Jul 19 16:37:25 UTC 2017


On 2017-07-18 01:54, Ian Goldberg wrote:
> On Mon, Jul 17, 2017 at 08:05:30PM +0200, Karsten Loesing wrote:
>> Hello list,
>>
>> it's been almost two years since we started collecting sanitized Apache
>> web server logs. During this time the number of Tor Browser initial
>> downloads rarely went below 70,000 per day.
>>
>> https://metrics.torproject.org/webstats-tb.html
>>
>> Either there must be a steady demand for fresh binaries, or there is a
>> non-zero number of bots downloading the Tor Browser binary several times
>> per day.
>>
>> I already double-checked our aggregation code that takes sanitized web
>> server logs as input and produces daily totals as output. It looks okay
>> to me.
>>
>> I'd also like to double-check whether there's anything unexpected
>> happening before the sanitizing step. For example, could it be that
>> there are a few IP addresses making hundreds or thousands of requests?
>>
>> Or are there lots of requests with the same referrers or common user
>> agents indicating bots?
>>
>> My plan is to ask our admins to temporarily add a second Apache log file
>> on one of the dist.torproject.org hosts with the default Apache log file
>> format without the sanitizing that is usually applied.
>>
>> A snapshot of 15 or 30 minutes would likely be sufficient as a sample.
>> I'd analyze this log file on the server, delete it, and report my
>> findings here.
>>
>> This message has two purposes:
>>
>>  1. Is this approach acceptable? If not, are there more acceptable
>> approaches yielding similar results?
>>
>>  2. Are there any theories about what might keep the numbers from
>> dropping below those 70,000 requests per day? What should I be looking for?
>>
>> Thanks!
>>
>> All the best,
>> Karsten
> 
> Any chance you (i.e. a script) could replace the IP address with
> HASH(IP||salt) for a randomly chosen salt that you don't know, and which
> is deleted when the 30 minutes are up, before you get access to the log
> file?
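
For illustration only, the HASH(IP||salt) idea could look roughly like the sketch below. It assumes SHA-256 as the hash and a salt held only in memory; the function names are hypothetical and this is not something Apache offers out of the box.

```python
import hashlib
import secrets


def make_pseudonymizer():
    """Return a function that maps an IP address string to a salted digest.

    The salt exists only inside this closure. It is never written to disk
    or returned, so it is gone once the returned function is discarded.
    """
    salt = secrets.token_bytes(32)  # random salt that nobody gets to see

    def pseudonymize(ip: str) -> str:
        # HASH(IP || salt); SHA-256 is an assumed choice of hash function
        return hashlib.sha256(ip.encode("ascii") + salt).hexdigest()

    return pseudonymize
```

Within one run, equal addresses map to equal digests, so requests per client can still be counted. A second run draws a new salt, so digests from different snapshots cannot be linked.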

Fine question. I'd like to keep this experiment simple and only use what
Apache has built in. So, let's leave out IP addresses for the moment and
see if the remaining fields, without timestamp and IP address, are
sufficient to answer the question. We can always consider adding more
fields and taking another 30-minute snapshot. And if we do that we can
see how other people have solved this problem, possibly using something
similar to what you describe above.
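
A minimal sketch of how such a snapshot could be tallied, assuming a hypothetical log format without timestamp and IP address along the lines of `"%r" %>s %b "%{Referer}i" "%{User-Agent}i"`. The actual format for the experiment has not been decided, and the function name is made up for this example.

```python
import re
from collections import Counter

# Assumed line layout: "request" status size "referrer" "user agent"
LINE_RE = re.compile(
    r'^"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def tally(lines):
    """Count how often each referrer and each user agent occurs."""
    referrers, agents = Counter(), Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        referrers[match.group("referrer")] += 1
        agents[match.group("agent")] += 1
    return referrers, agents
```

Calling `agents.most_common(10)` on the result would show whether a handful of user agents accounts for most requests. The pattern ignores escaped quotes inside fields, so a real analysis might need a more careful parser.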

Thanks!

All the best,
Karsten
