[tor-dev] GSoC 2021 - Alexa Top Sites Captcha and Tor Block Monitoring #Update

Mon Jul 12 11:31:35 UTC 2021

Hi everyone!

It has been quite some time since the last update; the details in between
are in the DIAL blog posts for the project[1].
As of now, we have a working project with the logic described in [2]
implemented. The dotted part of that logic (the consensus module), an idea
taken from the Senser paper[3], hasn't been implemented yet, but I hope to
implement it within a week or two at most. I was personally inclined towards
comparing the similarity of the page structure, but after some discussions
with woswos and Micah Sherr[4], I plan to implement the content-based
approach too.

I'll briefly describe both methods below:
+ Structure of the website: The idea here is that we can't know in advance
what changes a website will show between fetches. This is especially useful
for dynamic websites and for websites whose language depends on geolocation
(geotargeting). However, I have to use a filter list and a statistical
method to approach the problem.

+ Content-based approach: Compares the content of the HTML data using a
tree-like structure and hashes to determine how different or similar the
structure is. Proxies in the same locations are used as vantage points to
get better results.
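To make the comparison idea more concrete, here is a minimal sketch of how
two fetches of the same page could be compared by structure. It is not the
project's analyzer: the names `tag_paths`, `structure_hash` and
`structure_similarity` are hypothetical, and using a Jaccard score over tag
paths is an assumption picked only for illustration. It ignores text
content entirely, so a translated page with the same layout still matches.

```python
import hashlib
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records the path of tag names from the root for every start tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structure_hash(html):
    """Hash of the ordered tag paths; equal hashes mean identical structure."""
    joined = "\n".join(tag_paths(html))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def structure_similarity(html_a, html_b):
    """Jaccard similarity of the two sets of tag paths (0.0 to 1.0)."""
    a, b = set(tag_paths(html_a)), set(tag_paths(html_b))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

A page fetched over Tor whose similarity to the control fetch falls below
some threshold would then be a candidate for a partial block; the threshold
itself would have to be chosen from real data.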

That said, the methods above are meant for the case where websites
partially block Tor. One good example is https://dan.me.uk/, which doesn't
block Tor exit relays completely but serves an error page (a partial block)
without an error HTTP response code. Checking the HTTP response codes is
the low-hanging fruit, so it is our first step. It performs well but might
sometimes produce false positives (for example, reporting a website like
https://cloudflare.com as completely blocked when it returns a CAPTCHA or
is only partially blocked).
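The status-code step could be sketched roughly as follows. This is only an
illustration under assumptions: the function name `classify_by_status`, the
three labels and the ">= 400 means error" cut-off are hypothetical and not
taken from the project's code. `None` stands for a request that got no HTTP
response at all.

```python
def classify_by_status(control_status, tor_status):
    """First-pass verdict from the HTTP status of a control fetch (no Tor)
    and a fetch of the same URL through a Tor exit relay."""
    # If the site fails even without Tor, the Tor result tells us nothing.
    if control_status is None or control_status >= 400:
        return "inconclusive"
    # Works without Tor but errors over Tor: flag as blocked. A CAPTCHA
    # page served with an error code also lands here, which is where the
    # false positives come from.
    if tor_status is None or tor_status >= 400:
        return "blocked"
    # Both fetches look fine by status code. A partial block that answers
    # with a success code is invisible here, so the page content or
    # structure still has to be compared.
    return "needs_content_check"
```

For instance, `classify_by_status(200, 403)` gives `"blocked"`, while a
site that answers with a success code on both fetches gives
`"needs_content_check"` and is handed to the comparison methods.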

For demo purposes, one can refer to the experimental code[5] and its
log[6] (it isn't very good code and is a bit old, but it was written to
back up the first method, structure of the website). One could also look
into `Analyzer.py`[7,8], which contains the most recent and improved
analysis logic. I hope to improve it with every passing day. I also plan to
create an FAQ[9] page with excerpts of discussions and answers explaining
why a given approach was taken.

Thanks,
Apratim
(irc: _ranchak_)

** Looking forward to suggestions and comments on how to improve this.
Materials such as research papers in this domain would also be helpful. **

References:
[1] https://hub.osc.dial.community/t/tor-project-alexa-top-sites-captcha-and-block-monitoring/2552
[2] https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2021#updated-logic
[3] http://people.cs.georgetown.edu/~wzhou/publication/senser-acsac13.pdf
[4] https://seclab.cs.georgetown.edu/msherr/
[5] https://github.com/Hackhard/Fetcher/blob/b9f2fa8d09061862cf954537cbaad7921ddb3d89/status%20code/test_run4/tr.py
[6] https://raw.githubusercontent.com/Hackhard/Fetcher/main/status%20code/test_run4/tr_bash_output
[7] Consensus_lite branch: https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/blob/consensus_lite/src/captchamonitor/core/analyzer.py
[8] Master branch: https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/blob/master/src/captchamonitor/core/analyzer.py
[9] https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/wikis/GSoC-2021/Faqs