Please define "crawling of .onion". I don't know enough about the details of what you're doing to have a
strong opinion.
I mean search engines crawling HTML pages on .onion. Like doing: https://www.google.com/search?q=site%3Aonion.to
ahmia.fi does do crawling. I leave further discussion to them. OnionLink actually does *zero* crawling. I leave it to Google et al.
When Google crawls me they use:
- .onion addresses found via a search engine.
- .onion addresses found on HTML pages on other .onion sites.
None of the rest. Nothing with HSDirs, etc. The *only* HSDir thing that has ever existed is caching NXDOMAIN responses from HSDirs to reduce the load Tor2web places on the Tor network. This was solely to *be kind to the operators*. However, as the caching caused some uproar, I've stopped caching NXDOMAINs and have returned to unnecessarily burdening the Tor network.
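For readers unfamiliar with the idea, negative caching of "not found" answers can be sketched roughly as below. This is only an illustration and not OnionLink's actual code: the `NegativeCache` class, its method names, and the TTL value are all assumptions made for the example.

```python
import time


class NegativeCache:
    """Hypothetical sketch: remember 'not found' answers for a limited time."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds      # how long a negative answer stays valid (assumed value)
        self._clock = clock          # injectable clock, which makes the cache testable
        self._expires = {}           # onion address -> expiry timestamp

    def remember_missing(self, onion):
        """Record that a lookup for this address just failed."""
        self._expires[onion] = self._clock() + self._ttl

    def is_known_missing(self, onion):
        """Return True while a previous failed lookup is still fresh."""
        expiry = self._expires.get(onion)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # The entry is stale: forget it so the next request does a real lookup.
            del self._expires[onion]
            return False
        return True
```

A gateway using something like this would check `is_known_missing()` before starting a descriptor lookup and skip the lookup while the entry is fresh, which is where the load reduction comes from.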
How do you access and index the web content on those .onion sites?
The accessing is just plain Tor2web HTTP requests. They announce themselves with the HTTP header `x-tor2web: true` . Google does the indexing.
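As a rough sketch of what that header means for an operator: an onion service could check incoming requests for it. The helper name `is_tor2web_request` and the plain-dict representation of headers are assumptions for illustration; only the header name and value come from the description above.

```python
def is_tor2web_request(headers):
    """Return True if the request headers carry the Tor2web marker header."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    normalised = {name.lower(): value.strip().lower() for name, value in headers.items()}
    return normalised.get("x-tor2web") == "true"
```

An operator who does not want Tor2web traffic could use a check like this to refuse or redirect such requests.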
How often do you access the site?
Judging by the analytics for Googlebot accessing OnionLink, every 7-21 days.
How many pages deep do you go on the site?
Don't know. I suppose Google goes as deep as possible.
Do you follow links to other .onion sites?
Yes, subject to each of those other .onion sites' /robots.txt policy.
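To make the opt-out behaviour concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not what Google or OnionLink actually run; the function name `allowed_to_crawl`, the bot name `ExampleBot`, and the `example.onion` host are placeholders invented for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_to_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A site that says nothing is crawlable by default.
NO_POLICY = ""

# A site that opts out of all crawling.
OPT_OUT = "User-agent: *\nDisallow: /\n"

print(allowed_to_crawl(NO_POLICY, "ExampleBot", "http://example.onion/page"))
print(allowed_to_crawl(OPT_OUT, "ExampleBot", "http://example.onion/page"))
```

The first call prints `True` and the second prints `False`: an empty policy means "yes" and an explicit `Disallow: /` means "no", which is the default-yes, opt-out model under discussion.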
How do you make sure that Tor2web users are anonymised (as possible) when
accessing hidden services?
I make a good faith effort not to wantonly reveal personally identifying information. But in short, it's hard. I urge people to think of Tor2web nodes as closer to Twitter, which records what links you click. I wholly support having the "where is Tor2web in regards to user privacy" discussion (hopefully we could even make some improvements to it!), but it is orthogonal to the "robots.txt on .onion" discussion. Let's address the robots.txt issue and then we can return to Tor2web user privacy.
Please stop releasing logs. It could easily be seen as a provocative act.
Yeah, I understand. This is my 3rd or 4th attempt to discuss this and I was intentionally being a little pokey. I have no intention or desire of actually compromising anonymity.
-V
On Thu, Jul 7, 2016 at 12:54 PM, Tim Wilson-Brown - teor <teor2345@gmail.com> wrote:
On 7 Jul 2016, at 14:40, Virgil Griffith <i@virgil.gr> wrote:
Hello all. Back in June Griffin asked for this conversation to be
temporarily tabled, and it's been a month!
Let us discuss robots.txt and crawling of .onion. Right now we have
*three* camps! They are:
Please define "crawling of .onion". I don't know enough about the details of what you're doing to have a strong opinion.
How do you make your list of .onion addresses to crawl?
- by running a HSDir?
- using Tor2web request logs?
- using .onion addresses found via a search engine?
- using .onion addresses found on HTML pages on other .onion sites?
- through some other method?
How do you access and index the web content on those .onion sites? How often do you access the site? How many pages deep do you go on the site? Do you follow links to other .onion sites?
How do you make sure that Tor2web users are anonymised (as possible) when accessing hidden services?
So now we have *three* different positions among respected members of
the Tor community.
(A) isis et al: robots.txt is insufficient --- "Consent is not the absence of saying 'no' — it is explicitly saying
'yes'."
(B) onionlink/ahmia/notevil/grams: we respect robots.txt --- "Default is yes, but you can always opt-out."
(C) onionstats/memex: we ignore robots.txt --- "Don't care even if you opt-out." (see
https://onionscan.org/reports/may2016.html)
Isis did a good job arguing for (A) by claiming that (B) and (C) are
"blatant and disgusting workaround[s] to the trust and expectations which onion service operators place in the network."
https://lists.torproject.org/pipermail/tor-project/2016-May/000356.html
This is me arguing for (B):
https://lists.torproject.org/pipermail/tor-project/2016-May/000411.html
I have no link arguing for (C).
I had tried to get this conversation moving before. So, to poke this
discussion forward this time, I have republished the onion2bitcoin as well as the bitcoin2onion lists, anonymizing only the final 4 characters of the .onion address instead of the final 8. Under (A), compiling this list is deeply heretical. In the view of either (B) or (C), .onion content is public by default (presumably running regexes is fine), so compiling such data is a perfectly fine thing to do.
-- http://virgil.gr/wp-content/uploads/2016/06/onion2btc.html
-- http://virgil.gr/wp-content/uploads/2016/06/btc2onion.html
Please stop releasing logs. It could easily be seen as a provocative act. And it's not a good way to encourage people to talk to you. One possible consequence is that individuals or groups decide it's poor behaviour, and therefore refuse to deal with you.
Tim
Tim Wilson-Brown (teor)
teor2345 at gmail dot com PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B ricochet:ekmygaiu4rzgsk6n
tor-project mailing list tor-project@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-project