Hi everyone,
I'm working on ahmia.fi, the hidden service search engine, and you're
reading status update #5.
During the last two weeks I worked on improving search.
It required working on a new index format and updating the crawler to
fill this new index.
Search now has these new features:
- Token search is done in addition to pure full-text search. For
instance, "foxes" and "fox" are different words but are indexed as the
same token. Same for "running" and "run", or "tor's" and "tor".
- Associated words are also stored. For instance, "I’m not happy I’m
working" has the same words as "I’m happy I’m not working", but the
meaning is different.
- Meaningful words weigh much more in the ranking than frequent words.
- The text used with a link (called anchor text) is now stored with the
page which is the target of the link. Let's say site A links to site B
like this: <a href="B">B is a great search engine</a>. Searching for
"great search engine" is going to return B instead of A.
These anchors could also replace a title in the results list if the
site has no title (.txt files are like that).
- A word present in the title, description, or anchors has way more
strength than a word present in the content of a page.
- An authority score is computed for each page thanks to the PageRank
algorithm. Important pages (like homepages) tend to appear first in
the result list. It also solves problems we have with duplicated
content.
- The result list now returns pages instead of sites/domains.
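To give an idea of how the authority score works, here is a minimal
PageRank sketch using power iteration. It is not the code running on
ahmia.fi: the toy link graph, the damping factor of 0.85, the number of
iterations and the function name `pagerank` are all assumptions made for
the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return an approximate PageRank score for each page.

    `links` maps a page to the list of pages it links to.
    """
    # Collect every page, including those that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}

    # Start with the same score for every page.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a base share from the "random jump".
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score between the pages it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling page (no outgoing link): spread its score evenly.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical toy site: both inner pages link back to the homepage.
graph = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
}
scores = pagerank(graph)
```

In this toy graph "home" ends up with the highest score because both
other pages link to it, which is the effect described above for
homepages.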
Before updating ahmia.fi, I need to finish porting the rest of the
Django site (API and statistics) to the new structure. I hope to finish
this next week so you can try it.
After this, here are the features I plan to implement in the remaining
time:
- Add a query language to the search engine (site:msydqstlz2kzerdg.onion to
specify a domain, -term to exclude a term, filetype:txt to specify a
filetype, lang:fr to specify a language, etc).
- Use the If-Modified-Since HTTP header to avoid crawling unmodified pages.
- Improve documentation
- Add unit tests
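The query language is not written yet, so the following is only a rough
sketch of how such queries could be split into filters, excluded terms
and plain terms. The function name `parse_query`, the shape of the
returned dictionary and the set of recognized filters are assumptions
for the example, not the final design.

```python
def parse_query(query):
    """Split a raw query into filters, excluded terms and plain terms."""
    # Only these prefixes are treated as filters in this sketch.
    known_filters = {"site", "filetype", "lang"}
    parsed = {"filters": {}, "excluded": [], "terms": []}

    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and value and key in known_filters:
            # e.g. "lang:fr" becomes filters["lang"] = "fr"
            parsed["filters"][key] = value
        elif token.startswith("-") and len(token) > 1:
            # e.g. "-term" excludes "term" from the results
            parsed["excluded"].append(token[1:])
        else:
            # Anything else is a normal search term.
            parsed["terms"].append(token)
    return parsed


example = parse_query(
    "site:msydqstlz2kzerdg.onion lang:fr filetype:txt search -engine"
)
```

Unknown prefixes (like "foo:bar") are kept as normal terms here, so a
query containing a colon still returns something.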
Also, I have some ideas I won't have time to implement during GSoC.
Hopefully, I will get to them later because I plan to continue working
with the Tor community :)
A small list:
- Improve search for languages other than English by making one index
per language
- Make the crawler better understand query parameters. Some are useful
(?page=3), some are useless (?printable=1).
- Better tokenize the URL to extract words from it.
- Add highlights to each result (especially for those without a
meta description) to show the context in which query terms appear.
- Make a script to make sure every indexed URL is still online.
- Make a true RESTful API.
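For the query parameters idea, one possible approach is to drop the
useless parameters before storing a URL, so that two URLs showing the
same page end up as one. The sketch below assumes a hand-written
blacklist; the names `USELESS_PARAMS` and `canonical_url` and the
example.onion address are hypothetical, and a real crawler would
probably need something smarter than a fixed list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical blacklist of parameters that do not change the content.
USELESS_PARAMS = {"printable", "sessionid"}


def canonical_url(url):
    """Return the URL without useless parameters and without fragment."""
    parts = urlsplit(url)
    # Keep only the parameters that are not in the blacklist.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in USELESS_PARAMS
    ]
    # Sort so that "?a=1&b=2" and "?b=2&a=1" give the same URL.
    kept.sort()
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


cleaned = canonical_url("http://example.onion/list?printable=1&page=3")
```

Here `cleaned` keeps ?page=3 and loses ?printable=1, matching the two
examples given in the list above.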
See you in two weeks for the next status report, and maybe sooner if I
push something online.
Cheers!
Ismael