[tor-reports] GSoC: Weekly report for ahmia, week 28

Juha Nurmi juha.nurmi at ahmia.fi
Fri Jul 11 12:44:34 UTC 2014


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

During the week:

I started building my own search engine software. After looking
through many crawlers, I decided to use Scrapy[1]. There are a few
reasons for this: Scrapy is mature and maintained by a company with
an active developer community, it is written in Python, it
integrates with Django, it is flexible, and its pipeline
architecture is simple.

So, I will attach the Scrapy crawler (onionbot) to Django +
PostgreSQL, along with the popularity data that ahmia has been
gathering.

In this model, the data stored for a website is:

1) URL
2) Keywords (HTML keywords, title, h1, h2, h3 etc.)
3) All the words from the page (word1: count_of_word1, word2:
count_of_word2...)
4) Domain
5) Public WWW backlinks to the domain
6) Popularity according to the Tor2web stats
7) Number of clicks in the search results to the domain
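
To make the list above more concrete, here is a minimal sketch of how
such a record might be built from one fetched page. It is only an
illustration under assumptions: it uses the Python standard library
instead of Scrapy or Django, and the names PageRecord, build_record,
backlinks, tor2web_popularity and clicks are hypothetical, not taken
from onionbot. The three popularity fields are left at zero because
they would come from the data ahmia already gathers, not from the page.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags whose text is treated as keywords (assumption: title and h1-h3).
KEYWORD_TAGS = {"title", "h1", "h2", "h3"}
# Tags whose text is not visible page content.
SKIPPED_TAGS = {"script", "style"}


@dataclass
class PageRecord:
    """Hypothetical record holding the seven fields from the list."""
    url: str                                            # 1) URL
    keywords: list = field(default_factory=list)        # 2) keywords
    word_counts: Counter = field(default_factory=Counter)  # 3) word: count
    domain: str = ""                                    # 4) domain
    backlinks: int = 0                                  # 5) public WWW backlinks
    tor2web_popularity: int = 0                         # 6) Tor2web stats
    clicks: int = 0                                     # 7) search result clicks


class _PageParser(HTMLParser):
    """Collects keywords and word counts while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.keyword_depth = 0
        self.skip_depth = 0
        self.keywords = []
        self.words = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in KEYWORD_TAGS:
            self.keyword_depth += 1
        elif tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "keywords":
                content = attrs.get("content") or ""
                self.keywords += [k.strip().lower()
                                  for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag in KEYWORD_TAGS and self.keyword_depth:
            self.keyword_depth -= 1
        elif tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return  # ignore script and style bodies
        words = re.findall(r"\w+", data.lower())
        self.words.update(words)
        if self.keyword_depth:
            self.keywords += words


def build_record(url, html):
    """Build a PageRecord from a URL and the HTML fetched from it."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    return PageRecord(url=url,
                      keywords=parser.keywords,
                      word_counts=parser.words,
                      domain=urlparse(url).hostname or "")
```

In a Scrapy setup, something like build_record would presumably live
in the spider or an item pipeline, and the record would be saved
through a Django model rather than kept as a dataclass.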

Hopefully, this will work. I won't know until I run the prototype
software.

[1] http://scrapy.org/

Have a nice weekend everyone!

Greetings,
Juha
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJTv9wrAAoJELGTs54GL8vATZAIALHaX+9o5Li7w9HyY7U76NKu
uilmQxmgE5+uuhx2f9cMLxYjG8z3MU2haRSpv8SuU7pzuTQghPOdLKqtdUuqfKJ2
RZQb6nOvdJNsyP7Mo2hF7DBY9ASVp4vLA5KKhKUD1q2LQV2rZ95gMYDLHfaY+ref
IpCU6rYIZSlbT7MFYW4/SXX1762AIilXfpDrGzzZQV5OeCCBkS5sG6Xe3SeF8Foa
xCJtfR0/I1WtAczACwjKB+PTTIzPg9gOXutZvDhJSmEr7GRzx38GnztcgoroiIq3
CQ8UWcyLua2UzvMUuI3sIWS7B4Y14yfsbR+4zzuIIS2G6CBUwW+tHlrcCiBZGy0=
=z/Oj
-----END PGP SIGNATURE-----