[tor-project] Ethics Guidelines; crawling .onion

Virgil Griffith i at virgil.gr
Wed Jun 8 07:29:52 UTC 2016


Here's yet another data point indicating the policy on crawling .onion
needs to be clarified.  The new and popular OnionStats tool doesn't even
respect /robots.txt, see: https://onionscan.org/reports/may2016.html

So now we have *three* different positions among respected members of the
Tor community.

(1) isis et al: robots.txt is insufficient
--- "Consent is not the absence of saying 'no' — it is explicitly saying
'yes'."

(2) onionlink/ahmia/notevil/grams: we respect robots.txt
--- "Default is yes, but you can always opt-out."

(3) onionstats/memex: we ignore robots.txt
--- "Don't care even if you opt-out."
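
To make the difference between the three positions concrete, here is a minimal sketch using Python's standard urllib.robotparser. The function name may_index, the policy labels, and the opted_in flag are all hypothetical; in particular, no standard for expressing an explicit 'yes' exists yet, so opted_in merely stands in for whatever such a mechanism might be. This does not show how any of the named crawlers actually work.

```python
from urllib.robotparser import RobotFileParser


def may_index(policy, robots_txt, url, agent="*", opted_in=False):
    """Hypothetical helper: decide whether a crawler may index `url`.

    policy is one of:
      "opt-in"  : position (1), index only after an explicit 'yes'
      "opt-out" : position (2), index unless robots.txt says 'no'
      "ignore"  : position (3), index regardless of robots.txt
    """
    if policy == "ignore":
        return True
    if policy == "opt-in":
        # Assumption: some out-of-band signal of affirmative consent.
        return opted_in
    if policy == "opt-out":
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        return parser.can_fetch(agent, url)
    raise ValueError("unknown policy: %r" % (policy,))
```

Under this sketch, a site whose robots.txt disallows everything is skipped only by an opt-out crawler, and a site with no robots.txt at all is indexed by everyone except an opt-in crawler.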

-V


On Wed, Jun 8, 2016 at 1:34 AM, Virgil Griffith <i at virgil.gr> wrote:

> Hello all.
>
> I wrote on this topic earlier at:
>
> https://lists.torproject.org/pipermail/tor-project/2016-May/000411.html
>
> This is me again asking for clarification.  I chose this issue because it
> is the most self-contained of the various ones raised by isis et al, and it
> seemed wise to clarify this before opening up a new one.  If someone from
> Tor management writes me that social reasons prohibit search engines from
> being addressed at this time, I will drop it.
>
> Given the lack of prior reaction as well as ahmia.fi getting funded for
> GSoC (ahmia has followed /robots.txt from day zero), I tentatively conclude
> that crawling .onion is non-controversial, i.e., "Per Tor community
> standards, search engines obeying robots.txt are a-okay.  Equivalently,
> indexing .onion content is treated the same as indexing any other part of
> the web."
>
> But, to motivate as well as give any concerned parties an opportunity to
> be heard, I have republished the onion2bitcoin as well as the bitcoin2onion
> lists, anonymizing only the final 4 characters of the .onion address instead
> of the final 8.
>
> -- http://virgil.gr/wp-content/uploads/2016/06/onion2btc.html
> -- http://virgil.gr/wp-content/uploads/2016/06/btc2onion.html
>
> -V
>
> On Tue, May 31, 2016 at 10:05 PM, Virgil Griffith <i at virgil.gr> wrote:
>
>> This seems like something people would have opinions on.  Anyone?
>>
>> -V
>>
>>
>> On Monday, 30 May 2016, Virgil Griffith <i at virgil.gr> wrote:
>>
>>> Hello all.
>>>
>>> I am preparing a longer response to the issues Isis et al mentioned.
>>> Most are interrelated, but this one is not.  And I wanted to get
>>> clarification on it.
>>>
>>> Isis expressed a concern about making a list of bitcoin addresses from
>>> .onion, citing, "Consent is not the absence of saying 'no' — it is
>>> explicitly saying 'yes'."
>>>
>>> For what it's worth, ahmia.fi actually supports regex searching right
>>> out of the box.  In fact, a single line of JSON spits out all known bitcoin
>>> addresses ahmia knows about.
>>>
>>> For example, here's an anonymized list going .onion -> BTC which I mined
>>> from Ahmia,
>>> * http://virgil.gr/wp-content/uploads/2016/05/btc-on-dot-onion.html [6MB]
>>>
>>> And here's the same information going BTC -> .onion
>>> * http://virgil.gr/wp-content/uploads/2016/05/btc2domains.v2.txt [2MB]
>>>
>>> If you want to check the results you can ask Juha for the JSON query to
>>> do this.
>>>
>>> Let's go out on a limb and assume that regexes are okay.  Is the issue
>>> then .onion search engines?  I understand Isis's preference for there to
>>> always be affirmative consent, but does that mean that until such a
>>> standard exists all search engines from onion.link, ahmia.fi, MEMEX,
>>> NotEvil, and Grams are violating official Tor community policy?
>>>
>>> ----
>>> Here's how I currently see this.  I put on my amateur legal hat and say,
>>> "Well, the Internet/world-wide-web is considered a public space.
>>> Onion-sites are like the web, but with masked speakers."
>>>
>>> *
>>> https://www.hks.harvard.edu/m-rcbg/research/j.camp_acm.computer_internet.as.public.space.pdf
>>> * http://aims.muohio.edu/2011/02/01/is-the-internet-a-public-space/
>>>
>>> Ergo, I would argue that, by default, content on .onion is public the
>>> same way everything else on the web is.  If you don't want to be "indexed",
>>> for physical spaces you go indoors, or for the web you put up a login.  As
>>> an aside, the web standard is actually *kinder* than physical public spaces
>>> because on the web one can have an unobtrusive /robots.txt saying, "please
>>> don't index me".  Which is a great thing.
>>>
>>> Whereas some would say Tor users are "anonymous", others would instead
>>> say anything and everything Tor is "private".  I believe this needs to be
>>> clarified.  I once proposed to Roger that he delineate the sub-types of
>>> privacy in the same way Stallman delineated his "Four Freedoms".  Roger
>>> replied that he preferred using the broad catch-all term "Privacy".  These
>>> confusions may be a side effect of using a broad catch-all term.
>>> Interpreting broadly, Isis is correct.  However, this conclusion has a lot
>>> of unpleasant ramifications.
>>>
>>> Comments appreciated,
>>> -V
>>>
>>>
>>> P.S. Mildly related, I saw this today involving DARPA and Tor.
>>> http://thehackernews.com/2016/05/darpa-trace-hacker.html
>>>
>>> """
>>> The aim of Enhanced Attribution program is to track personas
>>> continuously and create “algorithms for developing predictive behavioral
>>> profiles.”
>>> """
>>>
>>> I hope you all are aware this flows directly from MEMEX.  Right?  This,
>>> and MEMEX, seem a much more appropriate target for outrage.  A lot of this
>>> work that numerous community members have worked on gives even me pause.
>>>
>>
>