[tor-dev] [GSOC 16] Ahmia status update #6
zma at riseup.net
Fri Aug 12 22:02:47 UTC 2016
I'm working on ahmia.fi, the hidden service search engine and you're reading
status update #6.
During the last two weeks, I finished porting the Django app to the new
structure. I'm also working on last-minute things before shipping the new site.
I will continue updating the documentation and adding unit tests to the project.
The code is not merged yet, but you're welcome to check it out on my forks.
Since this status report is short, here is a list of the goals I had in my
initial project proposal and the work that has been done on each.
Review code and infrastructure:
- Split the project into several repositories
- Improve documentation
- Automate testing (Travis.CI)
- Track code quality (Landscape.IO)
- Track requirements (Requires.IO)
- Refactor each subproject
Improve search results:
- Better use of Elasticsearch (use of stemmers, shingles, term-centric search)
- Search results are now pages instead of domains.
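
To illustrate the stemmers, shingles and term-centric search mentioned above,
here is a minimal sketch of what such Elasticsearch settings and queries can
look like, written as plain Python dicts. The filter, analyzer and field names
("english_stemmer", "word_shingles", "title", "content") are hypothetical and
are not taken from the Ahmia code. The "stemmer" and "shingle" token filters
and the "cross_fields" multi_match type are standard Elasticsearch features.

```python
# Hypothetical index settings: every name below is illustrative only.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                # Reduce words to their stem ("working" -> "work").
                "english_stemmer": {"type": "stemmer", "language": "english"},
                # Emit two-word shingles instead of single words.
                "word_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": False,
                },
            },
            "analyzer": {
                "stemmed_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stemmer"],
                },
                "shingled_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "word_shingles"],
                },
            },
        }
    }
}


def term_centric_query(text, fields):
    """Build a term-centric query: each term may match in any of the fields."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "cross_fields",
                "fields": list(fields),
                "operator": "and",
            }
        }
    }
```

For example, term_centric_query("hidden wiki", ["title", "content"]) returns a
request body that could be sent to the search endpoint of an index created
with settings like these.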
Not much work has been done on this goal. The old pages of the website were
being ported to a new design, and all pages now use the new design.
Gather more statistics:
- PageRank is now used to compute an authority score for each page
- I suggested that we could use a self-hosted statistics framework like Piwik,
  but no decision has been made.
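
As an illustration of the authority score, here is a minimal PageRank sketch
using power iteration over a small link graph. This is not the code used in
Ahmia. The function name, the damping factor of 0.85 and the sample graph are
assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a dict page -> score for a graph given as page -> list of targets."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the same small base score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Share this page's score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph between four pages.
sample_links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = pagerank(sample_links)
ranked_pages = sorted(scores, key=scores.get, reverse=True)
```

Sorting pages by score, as in the last line, is one simple way to rank
results by authority.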
Use stats to better rank search results:
- Results are ranked by authority score.
Make sense of the indexed info to understand a search meaning:
- Shingles enable us to differentiate these two queries: "i'm not happy i'm
working" and "i'm happy i'm not working".
- Synonyms could be used by the search algorithm if we provided a synonym
dictionary. No work has been done on this.
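
The shingle example above can be checked with a few lines of Python. This is
only a sketch of the idea (word-level shingles of size two, split on
whitespace), not what Elasticsearch does internally. The helper name is made
up for the example.

```python
def shingles(text, size=2):
    """Return the list of word shingles (runs of `size` adjacent words)."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


q1 = "i'm not happy i'm working"
q2 = "i'm happy i'm not working"

# Both queries contain exactly the same words...
same_words = sorted(q1.split()) == sorted(q2.split())

# ...but their shingles differ, so the two queries can be told apart.
only_in_q1 = set(shingles(q1)) - set(shingles(q2))
only_in_q2 = set(shingles(q2)) - set(shingles(q1))
```

Here only_in_q1 holds "not happy" and "i'm working", while only_in_q2 holds
"i'm happy" and "not working".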
Make a google trend-like interface to visualize searches over time:
No work has been done to reach this optional goal. Some stats features were
even dropped from the new site because they were "domain-centric" whereas a
search engine needs to be "page-centric". We could probably index searches in
Elasticsearch and use Date Histogram Aggregation to
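
As a sketch of the Date Histogram Aggregation idea, the request body below
counts indexed searches per day for one search term. It assumes each search
is stored as a document with a "search_term" and a "timestamp" field. These
names are hypothetical. The "interval" parameter is the one used by
Elasticsearch versions of this period. Later versions replace it with
"calendar_interval" or "fixed_interval".

```python
def searches_over_time_query(term, interval="day", field="timestamp"):
    """Build a request body that buckets matching searches by date."""
    return {
        # Only the aggregation buckets are needed, not the documents.
        "size": 0,
        # "search_term" is a hypothetical field holding the user's query.
        "query": {"match": {"search_term": term}},
        "aggs": {
            "searches_over_time": {
                "date_histogram": {"field": field, "interval": interval}
            }
        },
    }


body = searches_over_time_query("bitcoin", interval="week")
```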
Make stats available with the API:
No work has been done to reach this optional goal. Some API endpoints were
also dropped because they were domain-centric. It would be great to have an
API with a coherent URL scheme. I think Django REST framework can help design
that API while keeping the code simple.
That's it for this week,
Have a nice weekend.