[tor-reports] GSoC: Weekly report for ahmia, week 29

Juha Nurmi juha.nurmi at ahmia.fi
Sat Jul 19 15:45:12 UTC 2014


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

I have written a Scrapy-based crawler for .onion pages. It stores its
results in a PostgreSQL database.

It should be quite straightforward to build a search algorithm on top
of the crawled data. The data stored for each website is:

1) URL
2) keywords (HTML meta keywords, title, h1, h2, h3, etc.)
3) All stemmed words from the page with their counts (word1:
count_of_word1, word2: count_of_word2, ...)
4) Domain
5) Public WWW backlinks to the domain
6) Popularity according to the Tor2web stats
7) Number of clicks in the search results to the domain
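
To illustrate field (3), here is a minimal sketch of how the stemmed
word counts could be computed. The crude_stem function below is a
deliberately simplified stand-in for a real stemming algorithm (such
as the Porter stemmer), not the actual crawler code:

```python
import re
from collections import Counter

def crude_stem(word):
    # Simplified stand-in for a real stemmer: strips a few common
    # English suffixes. A production system would use a proper
    # algorithm such as the Porter stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def stemmed_word_counts(text):
    # Field (3): map each stemmed word from the page to its count,
    # i.e. {word1: count_of_word1, word2: count_of_word2, ...}.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(w) for w in words)

counts = stemmed_word_counts("Hidden services hosting hidden pages")
```

The resulting counter can then be serialized into the per-site record
alongside the URL, domain, and popularity fields.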

Note the stemming[1]. I realized that exact matching is not enough: I
also have to find words that are close to what is searched. For
efficiency, it is useful to store the stemmed words from each page and
use the Levenshtein distance[2] to compare the search terms against
these stemmed words. I am working on this.

[1] https://en.wikipedia.org/wiki/Stemming
[2] https://en.wikipedia.org/wiki/Levenshtein_distance
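
For illustration, a minimal sketch of this matching idea: a plain
dynamic-programming Levenshtein distance used to pick out stemmed
words close to a search term. The function names and the max_distance
cutoff are my placeholders, not the actual implementation:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between strings a and b,
    # keeping only the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def close_matches(term, stemmed_words, max_distance=2):
    # Return the stemmed words within max_distance edits of the
    # (stemmed) search term.
    return [w for w in stemmed_words if levenshtein(term, w) <= max_distance]
```

A search term would be stemmed first and then compared against the
stored stemmed words of each page with close_matches.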

Greetings,
Juha
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJTypKBAAoJELGTs54GL8vAGLgH+QHc7/UlP3Qnl3v18WXFtNJs
nIUrZjKF2RKa5nMJf0kvovNGNejNiuz9lJ7J4tqh6HyadWprqgy1s3Pz/3SuFPV8
rZ6wR+FDVw5hN2Xdxogla/A2U1B0DJ+CMTkJmSSc+gYwrKV+k7ImBztJQcNo4LpX
IMHGttUfct0vDn639J5NjOuScJvkTws1rIiLLADzGQRGmsTL64f93uAaZJGjiNlX
/mL/CZze9B2Z/tochGqun6pKAyJcGLxoNvbv65gllGcnKIBbzG3nPihYGJw+QbMY
8zeLjFySKGpx7jedfnGjYOmuiV6iiiqulE/W+bNrBLuGU0DeCh0Z1fUqJ4iEQxs=
=5TZF
-----END PGP SIGNATURE-----

