Re: [tor-project] Ethics Guidelines; crawling .onion

7 Jul 2016

      ...
Please define "crawling of .onion".
I don't know enough about the details of what you're doing to have a
strong opinion.
I mean search engines crawling HTML pages on .onion.  Like doing:
https://www.google.com/search?q=site%3Aonion.to

ahmia.fi does do crawling.  I leave further discussion to them.
OnionLink actually does *zero* crawling.  I leave it to Google et al.

When Google crawls me, it finds .onion addresses:
* via a search engine.
* on HTML pages on other .onion sites.

None of the rest.  Nothing with HSDirs, etc.  The *only* HSDir thing that
has ever existed is caching NXDOMAIN responses from HSDirs to reduce the
load Tor2web places on the Tor network. This was solely to *be kind to the
operators*.  However, as the caching caused some uproar I've stopped
caching NXDOMAINs and have returned to unnecessarily burdening the Tor
network.
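
For readers unfamiliar with the idea, the negative caching described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not OnionLink's or Tor2web's actual code: the class name `NegativeCache`, its methods, and the 300-second lifetime are all hypothetical. The point is simply that a remembered "not found" answer lets the proxy skip a repeat lookup against the Tor network for a while.

```python
import time


class NegativeCache:
    """Remember 'not found' lookups for a limited time (hypothetical sketch)."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        # ttl_seconds is an assumed value; the original post does not state one.
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expires_at = {}

    def mark_missing(self, onion_address):
        """Record that a lookup for this address just failed."""
        self._expires_at[onion_address] = self.clock() + self.ttl_seconds

    def is_known_missing(self, onion_address):
        """Return True while a cached failure for this address is still fresh."""
        expires_at = self._expires_at.get(onion_address)
        if expires_at is None:
            return False
        if self.clock() >= expires_at:
            # The entry has expired, so the next request goes to the network again.
            del self._expires_at[onion_address]
            return False
        return True
```

A proxy would call `is_known_missing()` before a lookup and `mark_missing()` after a failed one; anything not in the cache is looked up as usual.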
...
How do you access and index the web content on those .onion sites?
The accessing is just plain Tor2web HTTP requests.  They announce
themselves with the HTTP header `x-tor2web: true` .  Google does the
indexing.
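
As a rough illustration of what that header makes possible, an onion site operator could check incoming requests for it along these lines. This is a sketch only: the function name `is_tor2web_request` and the assumption that headers arrive as a plain name-to-value mapping are mine, not part of Tor2web. HTTP header names are case-insensitive, so the comparison lowercases them.

```python
def is_tor2web_request(headers):
    """Return True if the request headers carry `x-tor2web: true`.

    `headers` is assumed to be a plain mapping of header name to value.
    """
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() == "x-tor2web":
            return value.strip().lower() == "true"
    return False
```

An operator who wants to treat Tor2web traffic differently could branch on this result in their request handler.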
...
How often do you access the site?
Looking at analytics from Googlebot accessing Onionlink, every 7-21 days.
...
How many pages deep do you go on the site?
Don't know.  I suppose Google goes as deep as possible.
...
Do you follow links to other .onion sites?
Yes, in accordance with each other .onion site's /robots.txt policy.
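
To make "respecting robots.txt" concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It shows how a crawler could decide whether a URL is allowed; it is not a description of how Googlebot works internally. The helper name `allowed_by_robots`, the sample policy, and the host `example.onion` are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical policy: everything is crawlable except /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""
```

Under this policy, a page under `/private/` is refused while the rest of the site stays crawlable, which is the "default is yes, but you can opt out" behaviour.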
...
How do you make sure that Tor2web users are anonymised (as possible) when
accessing hidden services?
I make a good faith effort not to wantonly reveal personally identifying
information.  But in short, it's hard.  I urge people to think of tor2web
nodes as closer to Twitter where they record what links you click.  I
wholly support having the "where is Tor2web in regards to user privacy"
discussion (hopefully could even make some improvements to it!), but it is
orthogonal to the "robots.txt on .onion" discussion.  Let's address the
robots.txt issue and then we can return to Tor2web user-privacy.
...
Please stop releasing logs.
It could easily be seen as a provocative act.
Yeah I understand.  This is my 3rd or 4th attempt to discuss this and I was
intentionally being a little pokey.  I have no intention or desire of
actually compromising anonymity.

-V

On Thu, Jul 7, 2016 at 12:54 PM, Tim Wilson-Brown - teor <teor2345@gmail.com> wrote:
...
...
On 7 Jul 2016, at 14:40, Virgil Griffith <i@virgil.gr> wrote:
Hello all.  Back in June Griffin asked for this conversation to be
temporarily tabled, and it's been a month!
Let us discuss robots.txt and crawling of .onion.  Right now we have
*three* camps!  They are:
Please define "crawling of .onion".
I don't know enough about the details of what you're doing to have a
strong opinion.
How do you make your list of .onion addresses to crawl?
* by running a HSDir?
* using Tor2web request logs?
* using .onion addresses found via a search engine?
* using .onion addresses found on HTML pages on other .onion sites?
* through some other method?
How do you access and index the web content on those .onion sites?
How often do you access the site?
How many pages deep do you go on the site?
Do you follow links to other .onion sites?
How do you make sure that Tor2web users are anonymised (as possible) when
accessing hidden services?
...
So now we have *three* different positions among respected members of
the Tor community.
(A) isis et al: robots.txt is insufficient
--- "Consent is not the absence of saying 'no' — it is explicitly saying
'yes'."
(B) onionlink/ahmia/notevil/grams: we respect robots.txt
--- "Default is yes, but you can always opt-out."
(C) onionstats/memex: we ignore robots.txt
--- "Don't care even if you opt-out." (see
https://onionscan.org/reports/may2016.html)
Isis did a good job arguing for (A) by claiming that (B)
and (C) are "blatant and disgusting workaround[s] to the trust and
expectations which onion service operators place in the network."
https://lists.torproject.org/pipermail/tor-project/2016-May/000356.html
This is me arguing for (B):
https://lists.torproject.org/pipermail/tor-project/2016-May/000411.html
I have no link arguing for (C).
I had tried to get this conversation moving before.  So, to poke this
discussion forward this time, I have republished the onion2bitcoin as
well as the bitcoin2onion lists, anonymizing only the final 4 characters of
the .onion address instead of the final 8.  Under (A), compiling this list is
deeply heretical.  In the view of either (B) or (C), .onion content is by
default public (presumably running regexes is fine), so compiling such data
is a perfectly fine thing to do.
-- http://virgil.gr/wp-content/uploads/2016/06/onion2btc.html
-- http://virgil.gr/wp-content/uploads/2016/06/btc2onion.html
Please stop releasing logs.
It could easily be seen as a provocative act.
And it's not a good way to encourage people to talk to you.
One possible consequence is that individuals or groups decide it's poor
behaviour, and therefore refuse to deal with you.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
_______________________________________________
tor-project mailing list
tor-project@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-project