GSoC: Ahmia.fi - Search Engine for Hidden Services

Hi,
I'm a student who is starting to work on the ahmia.fi search engine as a part of Google Summer of Code. :)
The proposal is online here: https://ahmia.fi/gsoc/
In practice, I now have the time and funding to develop my search engine. George is my primary mentor and Moritz is the backup mentor.
Today, I will submit all the required documents (the tax forms etc.) to Google.
After that, I think I will speed up work on the code base on GitHub :)
Cheers,
Juha

Juha Nurmi <juha.nurmi@ahmia.fi> writes:
> Hi,
> I'm a student who is starting to work on the ahmia.fi search engine as a part of Google Summer of Code. :)
> The proposal is online here: https://ahmia.fi/gsoc/
> In practice, I now have the time and funding to develop my search engine. George is my primary mentor and Moritz is the backup mentor.
> Today, I will submit all the required documents (the tax forms etc.) to Google.
> After that, I think I will speed up work on the code base on GitHub :)
Enjoy GSoC :)
BTW, looking again at your proposal, I see that you are going to do both popularity tracking and backlinks. How are these two technologies going to interact with each other? That is, how will the indexer consider the output of those two features?
Also, with your newly acquired knowledge about backlinks, how long is it going to take you to incorporate them in ahmia? Are you actually going to do it during the "Use an another crawler to search .onion pages from the public Internet" phase?

On 22.04.2014 17:35, George Kadianakis wrote:
> Enjoy GSoC :)
I will :)
> BTW, looking again at your proposal, I see that you are going to do both popularity tracking and backlinks.
Yes, another crawler gathers backlinks from the public WWW, and I will start gathering URL clicks from the users.
> How are these two technologies going to interact with each other? That is, how will the indexer consider the output of those two features?
The Django front-end re-sorts the answers from the YaCy back-end.
See https://ahmia.fi/static/gsoc/re_sort.jpg
I have this idea in mind: https://ahmia.fi/static/gsoc/sorter.py
The results are sorted according to the YaCy result index, the number of backlinks, and the number of clicks, all of which are scaled.
Note the scaling: p_info.backlinks = 1 / (float(index) + 1) etc.
sum_function = 3.0*self.yacy + 2.0*self.backlinks + 1.0*self.clicks
where 3, 2 and 1 are test coefficients. I will optimize these and make a better model if necessary. However, clicks are easily spoofed, so they need a small coefficient.
> Also, with your newly acquired knowledge about backlinks, how long is it going to take you to incorporate them in ahmia? Are you actually going to do it during the "Use an another crawler to search .onion pages from the public Internet" phase?
We can test it when the popularity tracking and the backlinks crawler are working.
-Juha

Juha Nurmi <juha.nurmi@ahmia.fi> writes:
> The Django front-end re-sorts the answers from the YaCy back-end.
> See https://ahmia.fi/static/gsoc/re_sort.jpg
> I have this idea in mind: https://ahmia.fi/static/gsoc/sorter.py
> The results are sorted according to the YaCy result index, the number of backlinks, and the number of clicks, all of which are scaled.
> Note the scaling: p_info.backlinks = 1 / (float(index) + 1) etc.
> sum_function = 3.0*self.yacy + 2.0*self.backlinks + 1.0*self.clicks
> where 3, 2 and 1 are test coefficients. I will optimize these and make a better model if necessary. However, clicks are easily spoofed, so they need a small coefficient.
That makes sense.
BTW, what is the 'yacy' score? Is it just the order that YaCy's indexer chose for each result? Or does YaCy actually expose a score for each result? How is the score derived? Or do you treat it as a black box and assume it's more accurate than backlinks and popularity?
Thanks!

On 25.04.2014 17:27, George Kadianakis wrote:
> That makes sense.
> BTW, what is the 'yacy' score? Is it just the order that YaCy's indexer chose for each result? Or does YaCy actually expose a score for each result? How is the score derived? Or do you treat it as a black box and assume it's more accurate than backlinks and popularity?
I am using only the order information.
BTW, we are migrating the YaCy servers (Mikko installed new servers) and took down the old system. There should be a working crawler and fresh full-text search results soon :)
-Juha
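
For readers following the thread, the re-sorting discussed above can be sketched roughly as follows. This is a minimal illustration, not the contents of sorter.py: the `Result` class, its field names, `rank_scale`, and `re_sort` are hypothetical, and it assumes that backlinks and clicks are scaled by rank in the same 1 / (index + 1) way as the YaCy order. The weights 3, 2 and 1 are the test coefficients quoted in the thread.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    yacy_index: int  # 0-based position in the YaCy result list (order only)
    backlinks: int   # raw backlink count (hypothetical field)
    clicks: int      # raw click count (hypothetical field)


def rank_scale(values):
    """Map raw values to 1 / (rank + 1), where rank 0 is the largest value.

    Assumption: ties are broken by input order, because sorted() is stable.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    scaled = [0.0] * len(values)
    for rank, i in enumerate(order):
        scaled[i] = 1.0 / (rank + 1)
    return scaled


def re_sort(results, w_yacy=3.0, w_backlinks=2.0, w_clicks=1.0):
    """Return results ordered by the weighted sum of the three scaled signals."""
    backlink_scores = rank_scale([r.backlinks for r in results])
    click_scores = rank_scale([r.clicks for r in results])
    scored = []
    for r, b, c in zip(results, backlink_scores, click_scores):
        yacy = 1.0 / (r.yacy_index + 1)  # only the YaCy order is used
        total = w_yacy * yacy + w_backlinks * b + w_clicks * c
        scored.append((total, r))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]


if __name__ == "__main__":
    sample = [
        Result("a.onion", 0, 0, 0),
        Result("b.onion", 1, 10, 5),
        Result("c.onion", 2, 1, 1),
    ]
    for r in re_sort(sample):
        print(r.url)
```

In this made-up sample, the second YaCy hit overtakes the first because it leads on both backlinks and clicks. Keeping the click weight smallest limits how far spoofed clicks alone can move a result.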

Hi,
Some updates.
Ahmia now has a fresh new YaCy back-end installed. Unfortunately, I messed up Solr, and eventually we might have to destroy everything and re-crawl it again. At the moment, it at least works.
Then some good news. I created a milestone on GitHub. It lists all the main features, and I will try to develop them as fast as I can :)
https://github.com/juhanurmi/ahmia/issues?milestone=1&page=1&state=open
Currently, I have written some code to gather popularity stats and new domains from tor2web nodes and save them to ahmia.fi.
Furthermore, I have built a tool that checks backlinks from the public WWW! This data is useful for the popularity measurements.
I am already pushing code to GitHub :)
Cheers,
Juha
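
The backlink check mentioned in this update could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual tool: `count_backlinks`, `ONION_RE`, and the input format (a mapping from public page URL to page text that some crawler has already fetched) are hypothetical. It only counts how many distinct public pages mention each .onion domain.

```python
import re
from collections import defaultdict

# Onion names are base32: 16 characters (v2) or 56 characters (v3).
ONION_RE = re.compile(r"\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b")


def count_backlinks(pages):
    """Count distinct public pages that mention each .onion domain.

    pages: dict mapping a public page URL to its already-fetched text.
    Returns a dict mapping onion domain -> number of distinct source pages.
    """
    sources = defaultdict(set)
    for page_url, text in pages.items():
        # Lowercase first so that differently cased links count as one domain.
        for match in ONION_RE.finditer(text.lower()):
            sources[match.group(0)].add(page_url)
    return {domain: len(urls) for domain, urls in sources.items()}


if __name__ == "__main__":
    # Placeholder addresses and pages, made up for the example.
    pages = {
        "http://example.org/a": "see http://abcdefghijklmnop.onion/ twice: abcdefghijklmnop.onion",
        "http://example.org/b": "HTTP://ABCDEFGHIJKLMNOP.ONION/x and qrstuvwxyz234567.onion",
    }
    print(count_backlinks(pages))
```

Counting distinct source pages rather than raw mentions keeps a single page that repeats a link many times from inflating the number.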
participants (2)
- George Kadianakis
- Juha Nurmi