[tor-project] Ethics Guidelines; crawling .onion

Virgil Griffith i at virgil.gr
Thu Jul 21 08:10:27 UTC 2016


> I think you've misinterpreted the ethics guidelines here.
> "crawling" means running a HSDir to discover .onion addresses that would
> otherwise be private.
> It doesn't (necessarily) mean accessing web pages on .onion sites using
> an automated process.

If so, this is news to me, and I would be delighted to hear it.

Can we get a confirmation then that /robots.txt is a totally cool standard?

-V

On Thu, Jul 21, 2016 at 12:31 PM, Tim Wilson-Brown - teor <
teor2345 at gmail.com> wrote:

>
> > On 21 Jul 2016, at 14:23, Virgil Griffith <i at virgil.gr> wrote:
> >
> > Does anyone want to vouch for view (A) ?  Note that view (A) is
> currently enshrined in the ethics guidelines.
>
> I think you've misinterpreted the ethics guidelines here.
> "crawling" means running a HSDir to discover .onion addresses that would
> otherwise be private.
> It doesn't (necessarily) mean accessing web pages on .onion sites using an
> automated process.
>
> Tim
>
> >  The following are currently in conflict with (A):
> >
> > * the largest tor2web nodes
> > * MEMEX and other government programs
> > * beloved metrics applications like OnionStats
> >
> > -V
> >
> > On Tuesday, 19 July 2016, Nurmi, Juha <juha.nurmi at ahmia.fi> wrote:
> > Hi,
> >
> > Virgil raised several good points about onion search engines.
> >
> > 1) Anonymous vs. hidden
> >
> > >>> Whereas some would say Tor users are "anonymous", others would
> instead say any and everything Tor is "private".  I believe this needs to
> be clarified.
> >
> > I am publishing a paper about my onion service experiment: I deployed 100 onion servers and logged the TCP traffic to these services. They were accessed by several different kinds of clients (curl, wget, browsers, scrapers, ssh). This means that some people do HSDir harvesting and scan onions.
> >
> > 2) Search engines can efficiently map content
> >
> > >>> For what it's worth, ahmia.fi actually supports regex searching
> right out of the box.  In fact, a single line of JSON spits out all known
> bitcoin addresses ahmia knows about.
> >
> > At the moment I have no public documentation on how to use regex search, but Ahmia supports this feature. Is this good or not? I know that Google has disabled these kinds of features because of privacy issues.
> >
> > 3) Is a web-site a public place?
> >
> > >>> Here's how I currently see this.  I put on my amateur legal hat and
> say, "Well, the Internet/world-wide-web is considered a public space.
> Onion-sites are like the web, but with masked speakers."
> >
> > Good point! I think you are right.
> >
> > Best,
> > Juha
> >
> >
> > On Thu, Jul 7, 2016 at 9:28 AM, Virgil Griffith <i at virgil.gr> wrote:
> > > you might want to remove the client IP address (X-Forwarded-For) from
> HTTP headers
> >
> > Agreed!  And yes we already remove x-forwarded-for.
> > https://github.com/globaleaks/Tor2web/blob/master/tor2web/t2w.py#L701
> >
> > I recall that at the very, very beginning we had a Python proxy library automatically adding x-forwarded-for, but once we realized it was doing that we corrected it.  FWIW, it was actually Aaron who wrote that code ;)
> >
> > AFAIK Tor2web hasn't leaked any privacy-invading headers for some time. If any are discovered, they will be fixed ASAP.
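
A minimal Python sketch of the header scrubbing described above. This is not the code in t2w.py; the function name `strip_client_headers` is hypothetical, and every header other than X-Forwarded-For is an assumption about what else a proxy might want to drop.

```python
# Headers that can reveal the requesting client to the onion service.
# Only X-Forwarded-For is named in the thread; the others are assumed
# additions for illustration.
SENSITIVE_HEADERS = frozenset({"x-forwarded-for", "forwarded", "x-real-ip", "via"})


def strip_client_headers(headers):
    """Return a copy of `headers` without client-identifying entries.

    Header names are compared case-insensitively, as HTTP requires.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in SENSITIVE_HEADERS
    }


incoming = {
    "Host": "example.onion",
    "X-Forwarded-For": "203.0.113.7",  # documentation-range address
    "Accept": "text/html",
}
outgoing = strip_client_headers(incoming)
```

A deny-list like this only covers the headers someone thought of; an allow-list of headers known to be safe would probably be the more conservative design.
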
> >
> >
> > > Is the opt-out permanent, or does your server re-check every time it
> connects?
> > > I can imagine there being issues with either model - one involves
> storing a list, the other, regular connections.
> >
> > I don't know.  This is Google/Bing's department.  Do we have someone on the list familiar enough with either?  If I were to guess the Googley/Bingy way of doing this, I'd imagine them storing the list, and then when crawling the site again they'd do a HEAD request to see whether the /robots.txt has changed and, if it has, overwrite their stored list.
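
The guess above (store the rules, re-check cheaply, overwrite on change) could look roughly like the following Python sketch. It is not how Google or Bing are known to work. The `head` and `get` callables and the `RobotsCache` class are hypothetical: `head` is assumed to return some validator for /robots.txt (for example an ETag or Last-Modified value) and `get` its body as text.

```python
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Store parsed robots.txt rules per host and refresh them on change."""

    def __init__(self, head, get):
        self._head = head   # hypothetical: url -> validator string
        self._get = get     # hypothetical: url -> robots.txt body (str)
        self._stored = {}   # host -> (validator, RobotFileParser)

    def rules_for(self, host):
        url = f"http://{host}/robots.txt"
        validator = self._head(url)          # the cheap HEAD-style check
        cached = self._stored.get(host)
        if cached is None or cached[0] != validator:
            # First visit, or robots.txt changed: overwrite the stored rules.
            parser = RobotFileParser()
            parser.parse(self._get(url).splitlines())
            cached = (validator, parser)
            self._stored[host] = cached
        return cached[1]

    def may_crawl(self, host, path, agent="examplebot"):
        return self.rules_for(host).can_fetch(agent, f"http://{host}{path}")
```

Note that this still makes one connection per re-check, which is exactly the "regular connections" cost raised in the question; the stored list only saves re-downloading and re-parsing the file.
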
> >
> >
> > > I am disappointed that we have a Tor2web design where Tor2web needs to
> connect to a hidden service first, then check if it has given permission
> for Tor2web to connect to it.
> >
> > /robots.txt isn't a permission to "connect to", it's a permission to crawl/index.  I'm aware of no standard within or outside of Tor to say whether node A has permission to connect to node B.  If such a standard, or even an unofficial convention, exists I'm down for spending some weekends implementing it.
> >
> > > I am also disappointed that this only works for HTTP onions on the
> default port 80.
> >
> > I agree completely.  But if the issue is operator privacy, isn't it even
> *better* that tor2web only works for port 80?  As an aside, there is
> tor2tcp at: https://cryptoparty.at/tor2tcp
> >
> >
> > > I am also concerned about threat models where a single unwanted
> connection, or a number of unwanted connections, are security factors.
> > > For example:
> > > Imagine there is an (unknown) attack which can determine 1 bit of the
> 1024-bit RSA key per hidden service connection.
> > > (Some known attacks on broken crypto systems are like this, as are
> some side-channels.)
> > > Or imagine there is an attack which can determine 1 bit of the IPv4
> address per connection.
> > > Is there an alternative to position (A) that supports threat models
> like this?
> >
> > I don't have a good solution to this.  As stated above, I'm aware of no
> protocol for saying "Please don't connect to me."  The security person in
> me is a little skeptical how useful it would be---if someone wanted to make
> many connections to learn a private key, I presume she won't be obeying
> said requests.  However, if someone doesn't want to be connected to, upon
> such a standard existing I would happily abide by it.
> >
> > > there is also the possibility of exerting social pressure to prevent
> people from running servers that continually connect to tor hidden services.
> >
> > The closest things I know of for social pressure are:
> >
> > (1) Liberal caching headers in the HTTP response:
> >
> > ```
> > Cache-Control: max-age=604800   # can be cached by the browser and any intermediary caches for up to 1 week
> > ```
> >
> > (2) In /robots.txt putting long crawl-delays:
> >
> > ```
> > User-Agent: *
> > Crawl-delay: 86400   #wait 1 day between each fetch.
> > ```
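
As a rough illustration of how a polite crawler might read the Crawl-delay above, here is a sketch using Python's standard `urllib.robotparser`. The helper name `crawl_delay_seconds` and the agent string "examplebot" are made up for the example. Crawl-delay is a non-standard extension, so not every crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the example above, comment included.
ROBOTS_TXT = """\
User-Agent: *
Crawl-delay: 86400   #wait 1 day between each fetch.
"""


def crawl_delay_seconds(robots_txt, agent):
    """Return the Crawl-delay that applies to `agent`, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


delay = crawl_delay_seconds(ROBOTS_TXT, "examplebot")

# Sanity-check the two durations used in the examples.
ONE_DAY = 24 * 60 * 60    # seconds in a day
ONE_WEEK = 7 * ONE_DAY    # seconds in a week
```
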
> >
> > > I believe that a technical solution to this threat model is hidden
> service client authentication (and the next-generation hidden service
> protocol, when available).
> >
> > Agreed.
> >
> > -V
> >
> > On Thu, Jul 7, 2016 at 1:44 PM, Tim Wilson-Brown - teor <
> teor2345 at gmail.com> wrote:
> >
> > > On 7 Jul 2016, at 15:24, Virgil Griffith <i at virgil.gr> wrote:
> > >
> > > > How do you make sure that Tor2web users are anonymised (as far as possible) when accessing hidden services?
> > >
> > > I make a good faith effort not to wantonly reveal personally
> identifying information.  But in short, it's hard.  I urge people to think
> of tor2web nodes as closer to Twitter where they record what links you
> click.  I wholly support having the "where is Tor2web in regards to user
> privacy" discussion (hopefully could even make some improvements to it!),
> but it is orthogonal to the "robots.txt on .onion" discussion.  Let's
> address the robots.txt issue and then we can return to Tor2web user-privacy.
> >
> > Well, as a separate issue, you might want to remove the client IP
> address (X-Forwarded-For) from HTTP headers your caching proxies send to
> hidden services. And work out if any of the other headers are sensitive.
> >
> > > On 7 Jul 2016, at 14:40, Virgil Griffith <i at virgil.gr> wrote:
> > >
> > > So now we have *three* different positions among respected members of
> the Tor community.
> > >
> > > (A) isis et al: robots.txt is insufficient
> > > --- "Consent is not the absence of saying 'no' — it is explicitly
> saying 'yes'."
> > >
> > > (B) onionlink/ahmia/notevil/grams: we respect robots.txt
> > > --- "Default is yes, but you can always opt-out."
> >
> > Is the opt-out permanent, or does your server re-check every time it
> connects?
> > I can imagine there being issues with either model - one involves
> storing a list, the other, regular connections.
> >
> > > (C) onionstats/memex: we ignore robots.txt
> > > --- "Don't care even if you opt-out." (see
> https://onionscan.org/reports/may2016.html)
> > >
> > >
> > > Isis did a good job arguing for (A) by claiming that (B) and (C) are "blatant and disgusting workaround[s] to the trust and expectations which onion service operators place in the network."
> https://lists.torproject.org/pipermail/tor-project/2016-May/000356.html
> > >
> > > This is me arguing for (B):
> https://lists.torproject.org/pipermail/tor-project/2016-May/000411.html
> > >
> > > I have no link arguing for (C).
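
One way to make the three positions concrete is to write each as an indexing decision. The Python sketch below is only a reading of the summaries quoted above; the names `Position` and `may_index`, and the idea of an explicit opt-in flag for (A), are assumptions and not any project's actual policy code.

```python
from enum import Enum


class Position(Enum):
    A_OPT_IN = "A"   # consent means explicitly saying yes
    B_ROBOTS = "B"   # default is yes, robots.txt can opt out
    C_IGNORE = "C"   # index regardless of robots.txt


def may_index(position, robots_allows, explicit_opt_in=False):
    """Decide whether a crawler holding `position` would index a site.

    robots_allows:   True if the site's robots.txt permits crawling.
    explicit_opt_in: True if the operator has affirmatively asked to be
                     indexed (how that is signalled is left open here).
    """
    if position is Position.A_OPT_IN:
        return explicit_opt_in
    if position is Position.B_ROBOTS:
        return robots_allows
    return True  # Position.C_IGNORE
```

The difference between (A) and (B) shows up for a site that has no robots.txt at all: (B) indexes it, (A) does not.
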
> >
> > I am disappointed that we have a Tor2web design where Tor2web needs to
> connect to a hidden service first, then check if it has given permission
> for Tor2web to connect to it. I am also disappointed that this only works
> for HTTP onions on the default port 80.
> >
> > I would like to see a much better design for this.
> >
> > I am also concerned about threat models where a single unwanted
> connection, or a number of unwanted connections, are security factors.
> > For example:
> > Imagine there is an (unknown) attack which can determine 1 bit of the
> 1024-bit RSA key per hidden service connection.
> > (Some known attacks on broken crypto systems are like this, as are some
> side-channels.)
> > Or imagine there is an attack which can determine 1 bit of the IPv4
> address per connection.
> >
> > For security, a hidden service operator decides to only allow 10
> connections before rolling over their hidden service to a new key and
> server.
> >
> > There are at least 10 connections to known .onion addresses every week,
> because there are at least 10 Tor2web or memex or onionstats instances on
> the web.
> > Therefore, every week, the operator must roll over their hidden service,
> and arrange to notify users of the new address in a secure fashion.
> Alternately, they must keep the address secret, even from the HSDir hash
> ring, which is not possible.
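
The arithmetic behind this hypothetical threat model can be sketched in a few lines of Python. The attacks are imagined ones, as stated above; the function names and the assumption that each crawler instance connects once per week are illustrative only.

```python
import math


def connections_to_recover(secret_bits, bits_per_connection=1):
    """Connections an attacker needs if each one leaks a fixed number of bits."""
    return math.ceil(secret_bits / bits_per_connection)


def weeks_until_rollover(connection_budget, crawlers, per_crawler_per_week=1):
    """Weeks until automated connections alone exhaust the operator's budget."""
    return connection_budget / (crawlers * per_crawler_per_week)


rsa_connections = connections_to_recover(1024)   # 1024-bit RSA key
ipv4_connections = connections_to_recover(32)    # 32-bit IPv4 address
rollover_weeks = weeks_until_rollover(connection_budget=10, crawlers=10)
```

With a budget of 10 connections and 10 weekly crawlers, the budget is spent in a single week before any real user has connected, which is the point of the example.
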
> >
> > Is there an alternative to position (A) that supports threat models like
> this?
> >
> > I believe that a technical solution to this threat model is hidden
> service client authentication (and the next-generation hidden service
> protocol, when available).
> > However, there is also the possibility of exerting social pressure to
> prevent people from running servers that continually connect to tor hidden
> services.
> >
> > Tim
> >
> > Tim Wilson-Brown (teor)
> >
> > teor2345 at gmail dot com
> > PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
> > ricochet:ekmygaiu4rzgsk6n
> >
> > _______________________________________________
> > tor-project mailing list
> > tor-project at lists.torproject.org
> > https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-project
>
> Tim Wilson-Brown (teor)
>
> teor2345 at gmail dot com
> PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
> OTR 8F39BCAC 9C9DDF9A DF5FAE48 1D7D99D4 3B406880
> ricochet:ekmygaiu4rzgsk6n