Hi everyone,
I'm working on ahmia.fi, the hidden service search engine, and you're reading status update #5.
During the last two weeks I worked on improving search. It required working on a new index format and updating the crawler to fill this new index.
Search now has these new features:
- Token search is done in addition to pure full-text search. For instance, "foxes" and "fox" are different words but are indexed as the same token. The same goes for "running" and "run", and for "tor's" and "tor".
- Associated words are also stored. For instance, "I’m not happy I’m working" has the same words as "I’m happy I’m not working", but the meaning is different.
- Meaningful words have much more impact than frequent words.
- The text used with a link (called anchor text) is now stored with the page that is the target of the link. Let's say a site A links to a site B like this: <a href="B">B is a great search engine</a>. Searching for "great search engine" is going to return B instead of A. These anchors can also replace the title in the result list if the page has no title (.txt files are like that).
- A word present in the title, description or anchors has much more weight than a word present in the content of a page.
- An authority score is computed for each page using the PageRank algorithm. Important pages (like homepages) tend to appear first in the result list. It also solves the problems we have with duplicated content.
- The result list now returns pages instead of sites/domains.
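To give an idea of how an authority score like the one mentioned above can be computed, here is a minimal PageRank sketch in Python. It is only an illustration and not the code ahmia.fi runs: the function name, the damping factor of 0.85, the number of iterations and the toy link graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return an authority score per page using plain power iteration.

    `links` maps each page to the list of pages it links to.
    All names and default values here are illustrative assumptions.
    """
    # Collect every page, including pages that are only link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets the same small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score between the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: both inner pages link back to the homepage.
example_links = {
    "home": ["about", "contact"],
    "about": ["home"],
    "contact": ["home"],
}
scores = pagerank(example_links)
best_first = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the homepage collects links from both inner pages, so it ends up first in `best_first`, which is the behaviour described above for important pages.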
Before updating ahmia.fi, I need to finish porting the rest of the Django site (API and statistics) to the new structure. I hope to finish this next week so you can try it.
After this, here are the features I plan to implement in the remaining time:
- Add a query language to the search engine (site:msydqstlz2kzerdg.onion to specify a domain, -term to exclude a term, filetype:txt to specify a file type, lang:fr to specify a language, etc).
- Use the If-Modified-Since HTTP header to avoid crawling unmodified pages.
- Improve the documentation.
- Add unit tests.
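As a rough idea of what parsing such search operators could look like, here is a small Python sketch. Nothing here is implemented yet: the function name, the returned dictionary layout and the set of recognised operators are assumptions made for the example.

```python
# Operators taken from the list above; the set itself is an assumption.
FILTER_KEYS = {"site", "filetype", "lang"}


def parse_query(query):
    """Split a raw query into plain terms, excluded terms and filters."""
    parsed = {"terms": [], "excluded": [], "filters": {}}
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and value and key.lower() in FILTER_KEYS:
            # key:value operator such as site:... or lang:fr
            parsed["filters"][key.lower()] = value
        elif token.startswith("-") and len(token) > 1:
            # -term excludes a term from the results
            parsed["excluded"].append(token[1:])
        else:
            # anything else is a normal search term
            parsed["terms"].append(token)
    return parsed


example = parse_query(
    "great search engine site:msydqstlz2kzerdg.onion -spam filetype:txt lang:fr"
)
```

Here `example["terms"]` is `["great", "search", "engine"]`, `example["excluded"]` is `["spam"]`, and `example["filters"]` holds the `site`, `filetype` and `lang` values. A token with an unknown prefix, for example `foo:bar`, is simply kept as a normal term.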
Also, I have some ideas I won't have time to implement during GSoC. Hopefully I will get to them later, because I plan to continue working with the Tor community :) A small list:
- Improve search for languages other than English by making one index per language.
- Make the crawler better understand query parameters. Some are useful (?page=3), some are useless (?printable=1).
- Tokenize the URL better to extract words from it.
- Add highlights to each result (especially for those without a meta description) to show the context in which the query terms appear.
- Write a script to make sure every indexed URL is still online.
- Make a true RESTful API.
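For the query parameter idea, one possible approach is to normalise each URL before it is queued, so that variants of the same page collapse into one entry. The sketch below uses Python's standard urllib.parse module; the `IGNORED_PARAMS` set, the function name and the example.onion host are hypothetical, and a real crawler would need a better way of deciding which parameters are useless.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that do not change the page content.
IGNORED_PARAMS = {"printable"}


def canonicalize(url):
    """Drop ignored query parameters and sort the remaining ones."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    # Sorting makes ?a=1&b=2 and ?b=2&a=1 map to the same URL.
    kept.sort()
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


same_page = canonicalize("http://example.onion/list?printable=1&page=3")
```

With this, `same_page` is `http://example.onion/list?page=3`, so the printable version and the normal version of page 3 are crawled only once.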
See you in two weeks for the next status report, and maybe sooner if I push something online.
Cheers! Ismael