Hi everyone,
I'm working on ahmia.fi, the hidden service search engine, and you're reading status update #1.
During the last two weeks I've been working on several things:
1/ Settle on a new structure for ahmia source code.
The official repository [1] contains all the code related to ahmia. Some of this code is deprecated (solr is not used anymore) and the documentation needed to be updated, so it needed a bit of cleanup anyway. A structure with two repositories was chosen:
- ahmia-site [2] is going to contain the django website, the configuration to use it in production (apache, nginx, uwsgi) and documentation on how to get the project running.
- ahmia-crawler [3] is going to contain the scrapy bots, plus configuration and documentation (elasticsearch, polipo).
I tried to keep all past commits when creating these repositories.
2/ Update documentation
See [2] and [3].
3/ Start to refactor the django project
The django project is going to be composed of two apps:
- search is going to be the search engine frontend + future API endpoints
- trends is going to be the statistics visualization frontend + future API endpoints
Some logic is also going to move from the website source code to the indexer part of the search engine (e.g. removal of fake/banned domains). You can see this work on the ahmia-site repository [2].
Note: The trends app is not done yet, so it isn't visible online.
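To illustrate the kind of logic that would move to the indexer, here is a minimal sketch of filtering out documents from banned domains before they are indexed. The names used here (BANNED_DOMAINS, is_banned, filter_documents, the "url" key) are hypothetical and not taken from the ahmia code base, and the blocklist content is a placeholder.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; the real list would be loaded from wherever
# the fake/banned domains end up being stored.
BANNED_DOMAINS = {"bannedexampleonion.onion"}


def is_banned(url, banned=BANNED_DOMAINS):
    """Return True if the URL's host is a banned domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in banned)


def filter_documents(documents, banned=BANNED_DOMAINS):
    """Yield only the documents whose 'url' is not on a banned domain."""
    for doc in documents:
        if not is_banned(doc["url"], banned):
            yield doc
```

Doing this at indexing time means the website never has to check the blocklist when serving results.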
4/ Implement continuous integration with Travis CI
Tests are going to be run automatically on Travis CI. I am also considering displaying test code coverage with coveralls.io, but I worry that people would focus on improving the coverage percentage at all costs, which is not very good. This work is going to be pushed over the weekend.
5/ Start to write a proposal with details on how to improve search
I have yet to write a much more readable document, but here are a couple of ideas:
- Regroup all data related to domains, stats and content into elasticsearch so we can use it for search or insights
- What about a pagerank-like algorithm to estimate a webpage's popularity instead of tor2web popularity?
- Improve search with human language thanks to elasticsearch [4]
- Use static boosting with a popularity (or pagerank) field [5]
We have a meeting planned on Tuesday with all of ahmia's contributors. I hope to have a clean proposal by then to discuss with them.
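To make the pagerank and static boosting ideas a bit more concrete, here is a rough sketch, not a proposal for the final implementation. It assumes the crawler can produce a link graph as a dict mapping each domain to the domains it links to. The function names and the field names "content" and "pagerank" are hypothetical. The query body uses elasticsearch's function_score with field_value_factor.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {domain: iterable of linked domains}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = set(links.get(node, ()))
            if not targets:
                # Dangling node: spread its rank evenly over all nodes.
                targets = nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def boosted_query(text, field="content", boost_field="pagerank"):
    """Build a search body that adds a log-scaled static boost to the text score."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: text}},
                "field_value_factor": {
                    "field": boost_field,
                    "modifier": "log1p",
                    "missing": 0,
                },
                # Add the boost to the relevance score instead of multiplying.
                "boost_mode": "sum",
            }
        }
    }
```

The rank would be computed offline after a crawl and stored in a field of each document, so that the boost costs nothing extra at query time.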
During the next two weeks, I plan to continue working on the same things. I want to finish 1/ to 4/ as quickly as possible to start working on search quality.
See you in two weeks :)
Ismael
[1] https://github.com/ahmia/search
[2] https://github.com/iriahi/ahmia-site
[3] https://github.com/iriahi/ahmia-crawler
[4] https://www.elastic.co/guide/en/elasticsearch/guide/current/languages.html
[5] https://marcobonzanini.com/2015/06/22/tuning-relevance-in-elasticsearch-with...