Here's yet another data point indicating the policy on crawling .onion needs to be clarified. The new and popular OnionStats tool doesn't even respect /robots.txt, see: https://onionscan.org/reports/may2016.html
So now we have *three* different positions among respected members of the Tor community.
(1) isis et al: robots.txt is insufficient --- "Consent is not the absence of saying 'no' — it is explicitly saying 'yes'."
(2) onionlink/ahmia/notevil/grams: we respect robots.txt --- "Default is yes, but you can always opt-out."
(3) onionstats/memex: we ignore robots.txt --- "Don't care even if you opt-out."
-V
On Wed, Jun 8, 2016 at 1:34 AM, Virgil Griffith i@virgil.gr wrote:
Hello all.
I wrote on this topic earlier at:
https://lists.torproject.org/pipermail/tor-project/2016-May/000411.html
This is me again asking for clarification. I chose this issue because it is the most self-contained of the various ones raised by isis et al, and it seemed wise to clarify this before opening up a new one. If someone from Tor management writes me that social reasons prohibit search engines from being addressed at this time, I will drop it.
Given the lack of prior reaction as well as ahmia.fi getting funded for GSoC (ahmia has followed /robots.txt from day zero), I tentatively conclude that crawling .onion is non-controversial, i.e., "Per Tor community standards, search engines obeying robots.txt are a-okay. Equivalently, indexing .onion content is treated the same as indexing any other part of the web."
But, to motivate discussion as well as give any concerned parties an opportunity to be heard, I have republished the onion2bitcoin as well as the bitcoin2onion lists, anonymizing only the final 4 characters of the .onion address instead of the final 8.
-- http://virgil.gr/wp-content/uploads/2016/06/onion2btc.html
-- http://virgil.gr/wp-content/uploads/2016/06/btc2onion.html
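For readers unsure what "anonymizing the final N characters" amounts to, here is a minimal Python sketch. It is an illustration only, not the script used to produce the lists above: the function name `mask_onion`, the `x` placeholder character, and the sample address are all made up.

```python
def mask_onion(address: str, hidden: int = 4, mask_char: str = "x") -> str:
    """Replace the last `hidden` characters of the .onion label with mask_char."""
    suffix = ".onion"
    if not address.endswith(suffix):
        raise ValueError("not a .onion address: " + address)
    label = address[: -len(suffix)]
    if hidden < 0 or hidden > len(label):
        raise ValueError("cannot hide more characters than the label has")
    # Keep the leading part of the label, blank out the tail.
    kept = label[: len(label) - hidden]
    return kept + mask_char * hidden + suffix


# Hypothetical 16-character address, invented for this example.
example = "abcdefghijklmnop.onion"
print(mask_onion(example, hidden=8))  # abcdefghxxxxxxxx.onion
print(mask_onion(example, hidden=4))  # abcdefghijklxxxx.onion
```

Going from 8 hidden characters to 4 leaves far fewer candidate addresses consistent with each masked entry, which is why the change matters to anyone concerned about the lists.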
-V
On Tue, May 31, 2016 at 10:05 PM, Virgil Griffith i@virgil.gr wrote:
This seems like something people would have opinions on. Anyone?
-V
On Monday, 30 May 2016, Virgil Griffith i@virgil.gr wrote:
Hello all.
I am preparing a longer response to the issues Isis et al mentioned. Most are interrelated, but this one is not. And I wanted to get clarification on it.
Isis expressed a concern about making a list of bitcoin addresses from .onion, citing, "Consent is not the absence of saying 'no' — it is explicitly saying 'yes'."
For what it's worth, ahmia.fi actually supports regex searching right out of the box. In fact, a single line of JSON spits out all known bitcoin addresses ahmia knows about.
For example, here's an anonymized list going .onion -> BTC which I mined from Ahmia,
[6MB]
And here's the same information going BTC -> .onion
If you want to check the results you can ask Juha for the JSON query to do this.
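To make the regex point concrete, here is a rough Python sketch of the idea. It is not Ahmia's JSON query, which I am not reproducing here. The pattern is an assumption that only covers legacy Base58 addresses (starting with 1 or 3) and does no checksum validation, and the page data and .onion name are invented. The sample address is the well-known Bitcoin genesis block address.

```python
import re
from collections import defaultdict

# Assumption: legacy Base58 addresses only (leading 1 or 3, 26-35 characters,
# alphabet without 0, O, I, l). No checksum validation is attempted.
BTC_RE = re.compile(r"\b[13][1-9A-HJ-NP-Za-km-z]{25,34}\b")


def find_btc_addresses(text):
    """Return the sorted, de-duplicated address-like strings found in text."""
    return sorted(set(BTC_RE.findall(text)))


def build_indexes(pages):
    """Build both directions: .onion -> addresses and address -> .onions."""
    onion2btc = {}
    btc2onion = defaultdict(set)
    for onion, text in pages.items():
        found = find_btc_addresses(text)
        if found:
            onion2btc[onion] = found
        for addr in found:
            btc2onion[addr].add(onion)
    return onion2btc, dict(btc2onion)


# Hypothetical crawled page, keyed by a made-up .onion name.
pages = {
    "abcdefghijklmnop.onion": "Donate: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa. Not an address: 1234.",
}
onion2btc, btc2onion = build_indexes(pages)
print(onion2btc)
print(btc2onion)
```

The point is only that once page text is indexed, extracting address-like strings is a one-pattern job; no special tooling is involved.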
Let's go out on a limb and assume that regexes are okay. Is the issue then .onion search engines? I understand Isis's preference for there to always be affirmative consent, but does that mean that until such a standard exists, all search engines, from onion.link and ahmia.fi to MEMEX, NotEvil, and Grams, are violating official Tor community policy?
Here's how I currently see this. I put on my amateur legal hat and say, "Well, the Internet/world-wide-web is considered a public space. Onion-sites are like the web, but with masked speakers."
https://www.hks.harvard.edu/m-rcbg/research/j.camp_acm.computer_internet.as....
Ergo, I would argue that, by default, content on .onion is public the same way everything else on the web is. If you don't want to be "indexed", for physical spaces you go indoors, and for the web you put up a login. As an aside, the web standard is actually *kinder* than physical public spaces, because on the web one can have an unobtrusive /robots.txt saying, "please don't index me". Which is a great thing.
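For anyone who has not looked at the mechanics, the opt-out described above is only a few lines on both sides. The sketch below uses Python's standard `urllib.robotparser`. The crawler name and .onion URL are made up, and the robots.txt content is the simplest "please don't index me" a site could publish.

```python
from urllib.robotparser import RobotFileParser

# The simplest opt-out a site can publish at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_index(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical crawler name and address, for illustration only.
url = "http://abcdefghijklmnop.onion/page.html"
print(may_index(ROBOTS_TXT, "hypothetical-crawler", url))  # False: site opted out
print(may_index("", "hypothetical-crawler", url))          # True: no rules, default is yes
```

An empty or missing robots.txt yields "yes", which is exactly the default-yes, opt-out convention; an affirmative-consent scheme would need the default flipped, and no such standard exists today.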
Whereas some would say Tor users are "anonymous", others would instead say anything and everything on Tor is "private". I believe this needs to be clarified. I once proposed to Roger that he delineate the sub-types of privacy in the same way Stallman delineated his "Four Freedoms". Roger replied that he preferred using the broad catch-all term "Privacy". These confusions may be a consequence of using a broad catch-all term. Interpreting broadly, Isis is correct. However, this conclusion has a lot of unpleasant ramifications.
Comments appreciated, -V
P.S. Mildly related, I saw this today involving DARPA and Tor: http://thehackernews.com/2016/05/darpa-trace-hacker.html
""" The aim of Enhanced Attribution program is to track personas continuously and create “algorithms for developing predictive behavioral profiles.” """
I hope you all are aware this flows directly from MEMEX. Right? This, and MEMEX, seem a much more appropriate target for outrage. A lot of this work, which numerous community members have contributed to, gives even me pause.