[tor-project] Ethics Guidelines; crawling .onion

Nurmi, Juha juha.nurmi at ahmia.fi
Tue Jul 19 08:21:32 UTC 2016


Hi,

Virgil raised several good points about onion search engines.

1) Anonymous vs. hidden

>>> Whereas some would say Tor users are "anonymous", others would instead
say anything and everything Tor is "private".  I believe this needs to be
clarified.

I am publishing a paper about my onion service experiment: I deployed 100
onion services and monitored the TCP traffic to them. They were accessed by
several different clients (curl, wget, browsers, scrapers, ssh), which means
that some people are harvesting addresses from the HSDirs and scanning onions.

2) Search engines can efficiently map content

>>> For what it's worth, ahmia.fi actually supports regex searching right
out of the box.  In fact, a single line of JSON spits out all known bitcoin
addresses ahmia knows about.

At the moment there is no public documentation on how to use the regex
search, but Ahmia supports the feature. Is this good or not? I know that
Google has disabled this kind of feature because of privacy issues.
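
To illustrate the concern, here is a minimal sketch of how a regex over
crawled page text surfaces bitcoin addresses. The index layout is invented
for the example; it is not Ahmia's real API.

```
# Sketch: pull bitcoin addresses out of crawled onion pages with a regex.
# The page/index structure here is invented for illustration.
import re

# Base58 pattern for legacy (P2PKH/P2SH) bitcoin addresses.
BTC_ADDRESS = re.compile(r'\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b')

def extract_btc_addresses(indexed_pages):
    """Yield (onion_url, address) pairs found in crawled page text."""
    for page in indexed_pages:
        for address in BTC_ADDRESS.findall(page["text"]):
            yield page["url"], address

pages = [{"url": "http://example.onion/donate",
          "text": "Donate: 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2"}]
for url, addr in extract_btc_addresses(pages):
    print(url, addr)
```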

3) Is a web-site a public place?

>>> Here's how I currently see this.  I put on my amateur legal hat and
say, "Well, the Internet/world-wide-web is considered a public space.
Onion-sites are like the web, but with masked speakers."

Good point! I think you are right.

Best,
Juha


On Thu, Jul 7, 2016 at 9:28 AM, Virgil Griffith <i at virgil.gr> wrote:

> > you might want to remove the client IP address (X-Forwarded-For) from
> HTTP headers
>
> Agreed!  And yes we already remove x-forwarded-for.
> https://github.com/globaleaks/Tor2web/blob/master/tor2web/t2w.py#L701
>
> I recall that at the very beginning we had a Python proxy library
> automatically adding X-Forwarded-For, but once we realized it was doing
> that, we corrected it.  FWIW, it was actually Aaron who wrote that code ;)
>
> AFAIK Tor2web hasn't leaked any privacy-invading headers for some time.  If
> any are discovered, they will be fixed ASAP.
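>
> For reference, the fix is conceptually just dropping client-identifying
> headers before the proxy forwards a request.  A minimal sketch, not the
> actual t2w.py logic (that's at the link above):
>
> ```
> # Sketch: strip client-identifying headers before forwarding a request
> # to an onion service.  Illustrative only; see t2w.py for the real code.
> PRIVACY_SENSITIVE = {"x-forwarded-for", "forwarded", "via",
>                      "x-real-ip", "cookie"}
>
> def sanitize_headers(headers):
>     """Return a copy of the request headers minus client-identifying ones."""
>     return {name: value for name, value in headers.items()
>             if name.lower() not in PRIVACY_SENSITIVE}
>
> print(sanitize_headers({"Host": "example.onion",
>                         "X-Forwarded-For": "203.0.113.7"}))
> # {'Host': 'example.onion'}
> ```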
>
>
> > Is the opt-out permanent, or does your server re-check every time it
> connects?
> > I can imagine there being issues with either model - one involves
> storing a list, the other, regular connections.
>
> I don't know.  This is Google/Bing's department.  Do we have someone on
> list familiar enough with either?  If I had to guess the Google/Bing way of
> doing this, I'd imagine they store the list, and when crawling the site
> again they do a HEAD request to see whether /robots.txt has changed; if it
> has, they overwrite their stored list.
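>
> As a rough sketch of that guess, using standard HTTP validators (this is
> speculation, not documented Google/Bing behaviour):
>
> ```
> # Sketch: cheaply re-check /robots.txt before re-crawling a site.
> # Guesswork about crawler behaviour, not a documented protocol.
> import requests
>
> def robots_changed(base_url, cached_etag, cached_last_modified):
>     """HEAD /robots.txt and compare validators against the stored copy."""
>     resp = requests.head(base_url + "/robots.txt", timeout=30)
>     etag = resp.headers.get("ETag")
>     last_modified = resp.headers.get("Last-Modified")
>     if etag is None and last_modified is None:
>         return True  # no validators, so assume it may have changed
>     return etag != cached_etag or last_modified != cached_last_modified
>
> # If it changed, GET the new /robots.txt and overwrite the stored list.
> ```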
>
>
> > I am disappointed that we have a Tor2web design where Tor2web needs to
> connect to a hidden service first, then check if it has given permission
> for Tor2web to connect to it.
>
> /robots.txt isn't permission to connect, it's permission to crawl/index.
> I'm aware of no standard, within or outside of Tor, for saying whether node
> A has permission to connect to node B.  If such a standard, or even an
> unofficial convention, exists, I'm happy to spend some weekends
> implementing it.
>
> > I am also disappointed that this only works for HTTP onions on the
> default port 80.
>
> I agree completely.  But if the issue is operator privacy, isn't it even
> *better* that Tor2web only works for port 80?  As an aside, there is
> tor2tcp at: https://cryptoparty.at/tor2tcp
>
>
> > I am also concerned about threat models where a single unwanted
> connection, or a number of unwanted connections, are security factors.
> > For example:
> > Imagine there is an (unknown) attack which can determine 1 bit of the
> 1024-bit RSA key per hidden service connection.
> > (Some known attacks on broken crypto systems are like this, as are some
> side-channels.)
> > Or imagine there is an attack which can determine 1 bit of the IPv4
> address per connection.
> > Is there an alternative to position (A) that supports threat models like
> this?
>
> I don't have a good solution to this.  As stated above, I'm aware of no
> protocol for saying "Please don't connect to me."  The security person in
> me is a little skeptical of how useful it would be: if someone wanted to
> make many connections to learn a private key, I presume she wouldn't be
> obeying said requests.  However, if someone doesn't want to be connected
> to, I would happily abide by such a standard once it exists.
>
> > there is also the possibility of exerting social pressure to prevent
> people from running servers that continually connect to Tor hidden services.
>
> The closest things I know of for social pressure are:
>
> (1) Liberal caching headers in the HTTP response:
>
> ```
> Cache-Control: max-age=604800   # cacheable by browsers and any intermediary caches for up to 1 week
> ```
>
> (2) In /robots.txt putting long crawl-delays:
>
> ```
> User-Agent: *
> Crawl-delay: 86400   #wait 1 day between each fetch.
> ```
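>
> A polite crawler can honour that with the Python standard library.  A
> sketch (crawl_delay() exists in Python 3.6+; example.onion is a
> placeholder, and the Tor SOCKS transport plumbing is omitted):
>
> ```
> # Sketch: respect Crawl-delay between fetches of an onion site.
> # example.onion is a placeholder; Tor proxy plumbing is omitted.
> import time
> import urllib.robotparser
>
> rp = urllib.robotparser.RobotFileParser()
> rp.set_url("http://example.onion/robots.txt")
> rp.read()
>
> delay = rp.crawl_delay("*") or 1  # fall back to 1 second if unset
> for url in ("http://example.onion/", "http://example.onion/about"):
>     if rp.can_fetch("*", url):
>         pass  # fetch(url) would go here
>     time.sleep(delay)
> ```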
>
> > I believe that a technical solution to this threat model is hidden
> service client authentication (and the next-generation hidden service
> protocol, when available).
>
> Agreed.
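>
> Concretely, with the current hidden service protocol that's the "basic"
> client-authorization mode in torrc; a sketch with placeholder names and
> paths:
>
> ```
> # Operator side: only clients holding an auth cookie can reach the service.
> HiddenServiceDir /var/lib/tor/hidden_service/
> HiddenServicePort 80 127.0.0.1:8080
> HiddenServiceAuthorizeClient basic alice,bob
>
> # Client side: use the cookie Tor writes to the service's hostname file.
> HidServAuth youraddress.onion <auth-cookie-from-hostname-file>
> ```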
>
> -V
>
> On Thu, Jul 7, 2016 at 1:44 PM, Tim Wilson-Brown - teor <
> teor2345 at gmail.com> wrote:
>
>>
>> > On 7 Jul 2016, at 15:24, Virgil Griffith <i at virgil.gr> wrote:
>> >
>> > > How do you make sure that Tor2web users are anonymised (as possible)
>> when accessing hidden services?
>> >
>> > I make a good-faith effort not to wantonly reveal personally
>> identifying information.  But in short, it's hard.  I urge people to think
>> of Tor2web nodes as closer to Twitter, which records what links you
>> click.  I wholly support having the "where is Tor2web in regards to user
>> privacy" discussion (hopefully we could even make some improvements to
>> it!), but it is orthogonal to the "robots.txt on .onion" discussion.
>> Let's address the robots.txt issue and then we can return to Tor2web
>> user privacy.
>>
>> Well, as a separate issue, you might want to remove the client IP address
>> (X-Forwarded-For) from the HTTP headers your caching proxies send to
>> hidden services, and work out whether any of the other headers are
>> sensitive.
>>
>> > On 7 Jul 2016, at 14:40, Virgil Griffith <i at virgil.gr> wrote:
>> >
>> > So now we have *three* different positions among respected members of
>> the Tor community.
>> >
>> > (A) isis et al: robots.txt is insufficient
>> > --- "Consent is not the absence of saying 'no' — it is explicitly
>> saying 'yes'."
>> >
>> > (B) onionlink/ahmia/notevil/grams: we respect robots.txt
>> > --- "Default is yes, but you can always opt-out."
>>
>> Is the opt-out permanent, or does your server re-check every time it
>> connects?
>> I can imagine there being issues with either model - one involves storing
>> a list, the other, regular connections.
>>
>> > (C) onionstats/memex: we ignore robots.txt
>> > --- "Don't care even if you opt-out." (see
>> https://onionscan.org/reports/may2016.html)
>> >
>> >
>> > Isis did a good job arguing for (A) by claiming that (B) and (C) are
>> "blatant and disgusting workaround[s] to the trust and expectations which
>> onion service operators place in the network."
>> https://lists.torproject.org/pipermail/tor-project/2016-May/000356.html
>> >
>> > This is me arguing for (B):
>> https://lists.torproject.org/pipermail/tor-project/2016-May/000411.html
>> >
>> > I have no link arguing for (C).
>>
>> I am disappointed that we have a Tor2web design where Tor2web needs to
>> connect to a hidden service first, then check whether it has given
>> permission for Tor2web to connect to it. I am also disappointed that this
>> only works for HTTP onions on the default port 80.
>>
>> I would like to see a much better design for this.
>>
>> I am also concerned about threat models where a single unwanted
>> connection, or a number of unwanted connections, are security factors.
>> For example:
>> Imagine there is an (unknown) attack which can determine 1 bit of the
>> 1024-bit RSA key per hidden service connection.
>> (Some known attacks on broken crypto systems are like this, as are some
>> side-channels.)
>> Or imagine there is an attack which can determine 1 bit of the IPv4
>> address per connection.
>>
>> For security, a hidden service operator decides to only allow 10
>> connections before rolling over their hidden service to a new key and
>> server.
>>
>> There are at least 10 connections to known .onion addresses every week,
>> because there are at least 10 Tor2web, memex, or onionstats instances on
>> the web. Therefore, every week, the operator must roll over their hidden
>> service and arrange to notify users of the new address in a secure
>> fashion. Alternatively, they must keep the address secret, even from the
>> HSDir hash ring, which is not possible.
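>>
>> To make the arithmetic explicit (these are the hypothetical numbers from
>> above, nothing measured):
>>
>> ```
>> # Hypothetical numbers from the threat model above.
>> KEY_BITS = 1024                # bits the attacker needs
>> LEAK_PER_CONNECTION = 1        # bits leaked per unwanted connection
>> MAX_CONNECTIONS_PER_KEY = 10   # operator's self-imposed limit
>> UNWANTED_PER_WEEK = 10         # crawler/gateway connections per week
>>
>> print(KEY_BITS // LEAK_PER_CONNECTION)   # 1024 connections to lose the key
>> print(MAX_CONNECTIONS_PER_KEY / UNWANTED_PER_WEEK)  # 1.0 -> roll over weekly
>> ```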
>>
>> Is there an alternative to position (A) that supports threat models like
>> this?
>>
>> I believe that a technical solution to this threat model is hidden
>> service client authentication (and the next-generation hidden service
>> protocol, when available).
>> However, there is also the possibility of exerting social pressure to
>> prevent people from running servers that continually connect to Tor
>> hidden services.
>>
>> Tim
>>
>> Tim Wilson-Brown (teor)
>>
>> teor2345 at gmail dot com
>> PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
>> ricochet:ekmygaiu4rzgsk6n