[tor-dev] GSoC 2021 - Alexa Top Sites Captcha and Tor Block Monitoring #Update

David Fifield david at bamsoftware.com
Tue Jul 20 16:11:50 UTC 2021

On Mon, Jul 12, 2021 at 05:01:35PM +0530, Apratim Ranjan Chakrabarty wrote:
> ** Looking forward for suggestions and comments as to how to improve on it.
> Also materials like research paper in this domain would be helpful **

Section IV-C of the ICLab paper has a discussion of block page detection.
The first pass is a regex match against known block pages, but there is
also clustering by similar HTML structure and text.
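
To make the two-pass idea concrete, here is a minimal sketch in Python. It
is not the ICLab implementation: the fingerprint patterns, the function
names, the 0.8 threshold, and the greedy clustering are all assumptions
made for illustration, and it compares tag structure only, not text.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Hypothetical fingerprints, for illustration only. A real list would be
# curated from block pages actually observed in scans.
BLOCK_PATTERNS = [
    re.compile(r"access denied", re.IGNORECASE),
    re.compile(r"complete the captcha", re.IGNORECASE),
]


class TagCollector(HTMLParser):
    """Records the sequence of opening tags as a rough structural signature."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names in document order."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def matches_known_block_page(html):
    """First pass: does the page match any known fingerprint?"""
    return any(p.search(html) for p in BLOCK_PATTERNS)


def structure_similarity(a, b):
    """Similarity in [0, 1] between the tag sequences of two pages."""
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def cluster_by_structure(pages, threshold=0.8):
    """Second pass: greedy clustering (an assumed, simple strategy).

    Each page joins the first cluster whose first member is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []
    for page in pages:
        for cluster in clusters:
            if structure_similarity(cluster[0], page) >= threshold:
                cluster.append(page)
                break
        else:
            clusters.append([page])
    return clusters
```

Pages that miss every regex but land in a cluster together with known
block pages are candidates for manual review and, if confirmed, for new
fingerprints.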

The 2016 "Do You See What I See?" study seems to be in line with your
project. "The second-class treatment of anonymous users ranges from
outright rejection to ... imposing hurdles such as CAPTCHA-solving....
Our study draws upon ... scans of the home pages of top-1,000 Alexa
websites through every Tor exit..." Section V-A covers the scans of
top-ranked sites.
