[tor-dev] GSoC - Search Engine for Hidden services

George Kadianakis desnacked at riseup.net
Sun Mar 16 19:13:42 UTC 2014


Juha Nurmi <juha.nurmi at ahmia.fi> writes:

>> And what would you like to do over the summer so that: a) Something
>> useful and concrete comes out of only 3 months of work. b) Your
>> work will also be useful after the summer ends.
>>
>> I would be interested to see some areas that you would like to work
>> on over the summer, and how that would change the ahmia.fi user
>> experience.
>
> I have drafted a timetable for the possible new features to ahmia.fi:
>
> https://docs.google.com/document/d/1XB42HM4uESYBAnoHHRuaqKMP64VFDI91Qa-CtIuye2E/edit?usp=sharing
>

Hello Juha,

here are some comments on your proposal:

> Search development
>
> Full text search development
> Popularity tracking (catch users' clicks and tell YaCy the popular
> pages): development of a popularity tracking feature for ahmia.fi
> and integration of that feature with the YaCy API (providing stats
> for popular pages and suggestions for relevant results)
> 1-3 workdays
>

Yes, this is definitely useful.

I would also like you to check out how backlinks work, and whether
your crawler can start counting HS backlinks too. Mainly because
popularity tracking is easily gameable, whereas backlinks might be
harder to game (still definitely gameable though; SEO is crazy).
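
For the backlink idea, here is a rough sketch of what counting could
look like, assuming the crawler already emits (source page, link
target) pairs; the function name and the input format are hypothetical,
not ahmia's actual code:

```python
from collections import Counter
from urllib.parse import urlparse

def count_hs_backlinks(link_pairs):
    """Count distinct referring .onion domains per target .onion domain.

    link_pairs: iterable of (source_url, target_url) tuples, as a crawler
    might emit them. Counting distinct referring domains (rather than raw
    link occurrences) makes the metric a little harder to game from a
    single site, though still not game-proof.
    """
    referrers = {}  # target domain -> set of referring domains
    for src, dst in link_pairs:
        src_host = urlparse(src).hostname or ""
        dst_host = urlparse(dst).hostname or ""
        if dst_host.endswith(".onion") and src_host != dst_host:
            referrers.setdefault(dst_host, set()).add(src_host)
    return Counter({dom: len(srcs) for dom, srcs in referrers.items()})

links = [
    ("http://aaa.onion/index", "http://target.onion/"),
    ("http://bbb.onion/links", "http://target.onion/page"),
    ("http://aaa.onion/other", "http://target.onion/"),  # same referrer, counted once
]
print(count_hs_backlinks(links))  # target.onion -> 2 distinct referrers
```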

To make sure that this section is done properly, I would suggest
compiling a list of well-known HSes and verifying that they all appear
at or near the top of the ahmia search results by the end of
development of these features.
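
One way to make that check repeatable is a small ranking test that can
be rerun after every change. The check itself is kept pure so it is
easy to test; the example addresses and the fetching endpoint are made
up for illustration, not ahmia's real interface:

```python
import json
import urllib.request

# Hypothetical list of well-known hidden services and the queries that
# should surface them (the .onion addresses here are placeholders).
KNOWN_HS = {
    "duckduckgo": "3g2upl4pq6kufc4m.onion",
    "torproject": "expyuzz4wqqyqhjn.onion",
}

def missing_from_top(results_by_query, known=KNOWN_HS, limit=10):
    """Return the queries whose known .onion is absent from the top results.

    results_by_query: dict mapping query -> ordered list of .onion
    domains, e.g. as collected from the search front end before and
    after an indexing change.
    """
    return [q for q, onion in known.items()
            if onion not in results_by_query.get(q, [])[:limit]]

def fetch_results(query):
    """Sketch of collecting results; assumes a JSON endpoint that may
    not exist in this form -- adapt to ahmia's real search interface."""
    url = "https://ahmia.fi/search/?q=%s&format=json" % query
    with urllib.request.urlopen(url, timeout=30) as resp:
        return [hit["onion"] for hit in json.load(resp)]

print(missing_from_top({"duckduckgo": ["3g2upl4pq6kufc4m.onion"]}))
# -> ['torproject']  (its query produced no matching top result)
```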

I would suggest using more than 1-3 workdays for this.

> Use another crawler to search for .onion pages on the public Internet
> Search for new .onion domains from different online sources
> This is an excellent case to test open source crawlers like Heritrix
> and Apache Nutch
> 1 workweek
>

Yes, this is very useful.

> Public open YaCy back-end for everyone
> Let's make our YaCy network open so anyone can join it with their
> YaCy nodes
> This way we could get real P2P decentralization
> Share an installation configuration package that joins a YaCy node
> to ahmia.fi's nodes
> 1 workweek
>

I guess this is also useful.

> Better edited HS descriptions
> Design and development of a more useful and complete UI, including
> more complete and exhaustive descriptions and details (e.g., show the
> whole history of descriptions and let the users edit it better)
> 1 workweek
>

Yes, this seems like a good idea. Improving the UX is very important.

Because of ahmia's security-sensitive nature, the UX should be
security conscious too. For example, you shouldn't give your users too
much confidence in the ordering of the search results, since a
motivated adversary can probably influence it.

Maybe you could also expose some of your popularity/backlinks
information to users, in case that lets them pick results more safely.

> Comment and vote about the content (safe/unsafe)
> Ahmia.fi needs a commenting and rating system for hidden services
> It is useful to gather users' knowledge about the sites
> 1 workweek
>

I think that this needs more thinking.

The rating idea is trivially gameable. Do we assume that all users are
good citizens?

Given that there are shitloads of phishing websites registered to
ahmia, we can take it that there are bad people out there who know of
ahmia. How will the rating system interact with bad people? What about
the commenting system? Is this also an argument against popularity
tracking? How do we use these technologies usefully in the face of bad
people?

> Tor browser friendly version of ahmia.fi
> Development of a JavaScript-free version of ahmia.fi
> 1 workweek
>

TBB has JavaScript enabled these days. I would probably spend this one
week on other stuff.

> Search API
> 1 workweek
>

What do you mean by this? Do other search engines provide this sort of API?

This would need more than one week to design and deploy properly, no?

> Automated statistics and visualizations about hidden services and
> their content
> Development of an Analytics feature
> As a result of indexing the Tor network's content, ahmia.fi can
> produce authoritative and exact quantitative research data about what
> is published through the Tor network.
> 2 workweeks
>
> Automated visualizations
> It is very practical to visualize the data
> 2 workweeks
>

Both of the above items are statistics and they seem to require 1
month of development. Are there really that many stats that we
can/should produce?

What kind of stats are you thinking of, other than the "number of HSes
added per month", "number of ahmia visitors", etc.? BTW, we should be
very careful that the stats are privacy-preserving.
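
As one concrete example of what "privacy preserving" could mean here:
only publish aggregated counts, suppress small ones, and coarsen the
rest, so no individual visitor or query is recoverable. This is a
sketch of the idea with made-up numbers, not a full anonymization
scheme (a real deployment would want something stronger, e.g.
differential privacy):

```python
def coarsen_counts(raw_counts, bucket=10, threshold=20):
    """Round aggregate counts and drop small ones before publishing.

    raw_counts: dict of stat name -> exact count (e.g. visits per
    month). Counts below `threshold` are suppressed entirely; the rest
    are rounded down to a multiple of `bucket`, so the published
    numbers reveal less about individual events.
    """
    return {name: (n // bucket) * bucket
            for name, n in raw_counts.items() if n >= threshold}

stats = {"hs_added": 137, "visitors": 20412, "rare_query": 3}
print(coarsen_counts(stats))
# rare_query is suppressed; the other counts are rounded down
```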

> Show cached text versions of the pages
> 1 workweek
>

Useful. I thought you had this feature in the past though; no?

> API development
>
> In addition, ahmia.fi provides a RESTful API to integrate other
> services that use hidden service description information (see
> https://ahmia.fi/documentation/). Hidden services can integrate their
> descriptions directly into the hidden service list (see
> https://ahmia.fi/documentation/descriptionProposal/). Ahmia.fi knows
> which hidden services are online, and you can use the API to check a
> hidden service's online status. This API should be kept general and
> simple.
>
> Integration with software that uses hidden services
> Integration with Tor2web
> Thanks to our recent suggestion, Tor2web has implemented a feature
> that provides secure and anonymous statistics within a day. I want to
> implement automatic fetching and handling of this data.
> Ahmia.fi should fetch these and add each new .onion page
> Child pornography is a plague for the Tor network, and a well
> designed and authoritative entity may be useful for providing some
> filtering lists. To this aim we are currently handling manually a
> filter list already integrated with Tor2web and in use on nearly all
> the nodes of the Tor2web network (https://ahmia.fi/policy/,
> https://github.com/globaleaks/Tor2web-3.0/issues/25). In
> collaboration with Tor2web I want to develop an efficient and
> automated system to handle and share filtering information in a
> secure manner.
> 1 workweek
>

Hm, this is interesting but potentially controversial. Where is this
data?

> Development of a Content Abuse Signaling feature in order to allow
> fast handling of abuse reports; I want to implement a Callback API
> in order to publish this data to Tor2web nodes in real-time.
> 1-3 workdays
>

Ehm, so you are going to expose all the banned pages to Tor2Web?  Is
this API going to be public? Will anyone be able to see the banned
pages?

If it's not public, how are you going to protect it? Is this doable in
1-3 workdays? Is this worth doing?

> Globaleaks integration
> Currently, GlobaLeaks informs ahmia.fi to index new hidden services
> Ahmia.fi could extend the visibility of Globaleaks on the search results
> Together with GlobaLeaks: RESTful API according to Globaleaks’ needs
> 1 workweek
>

So you will make an API that allows people to submit HSes to ahmia?
Will this be usable by anyone, and can it be exploited? If not, how
will you protect it? Is this really worth doing?

> Estimated amount of work is 13 weeks.

All in all, the timetable looks good.

I'm quite excited about the changes to your crawler (that will give us
a bigger list of HSes), and the changes to your indexing (popularity
tracking/backlinks etc.). I think you should devote more time to these
so that they are done properly. You currently estimated 1.5 weeks for
those tasks, but maybe you could bump it to 3 or 4 weeks. OTOH, I
don't know much about search engines so it might be easier than I
think.

I'm also excited about the UX changes and statistics, but I'm not sure
if I would devote one month just for statistics. Maybe steal some time
from statistics and give it to the crawler/indexing and UX? Maybe not?

The API stuff and the "Integration with *" projects are probably
harder/riskier to do than they seem. Are we sure we want to do them?
Better to do fewer things properly, than many things sloppily. Or not?

I would also like to see the code base cleaned up a bit: for example,
a README file and some basic description of what each file does.
Probably also include the YaCy/crawler configs?

I would also like Ahmia to have some docs on the website. I would like
to see a doc on how ahmia works, including how its components interact
with each other.  And I would also like to see a doc that explains to
users the threat model of Ahmia; that is, what technologies ahmia has
in place to defend against phishing, how likely they are to succeed,
and how cautious users should be.




