On Mon, Jul 12, 2021 at 05:01:35PM +0530, Apratim Ranjan Chakrabarty wrote:
> ** Looking forward for suggestions and comments as to how to improve on it. Also materials like research paper in this domain would be helpful **
Section IV-C of the ICLab paper discusses block page detection. The first pass is a regex match against known block pages, but there is also clustering by similar HTML structure and text.
https://censorbib.nymity.ch/#Niaki2020a
https://github.com/net4people/bbs/issues/52
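To make the two-pass idea concrete, here is a rough sketch in Python. It is not ICLab's actual implementation: the regex fingerprints, the tag-bigram and word-set features, the Jaccard similarity, the greedy clustering, and the 0.7 threshold are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical fingerprints; a real list would be built from observed block pages.
KNOWN_BLOCKPAGE_PATTERNS = [
    re.compile(r"access (?:has been |is )?(?:denied|blocked)", re.I),
    re.compile(r"blocked by order of", re.I),
]


class _Extractor(HTMLParser):
    """Collects the sequence of opening tags and the lowercased words."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def matches_known_blockpage(html):
    """First pass: does the page match any known block page regex?"""
    return any(p.search(html) for p in KNOWN_BLOCKPAGE_PATTERNS)


def features(html):
    """Return (set of adjacent tag pairs, set of words) for one page."""
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    return set(zip(parser.tags, parser.tags[1:])), set(parser.words)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(f1, f2):
    """Equal-weight mix of structural and textual similarity (an assumption)."""
    return 0.5 * jaccard(f1[0], f2[0]) + 0.5 * jaccard(f1[1], f2[1])


def cluster_pages(pages, threshold=0.7):
    """Second pass: greedily group pages by similarity to a cluster's first member.

    Returns a list of clusters, each a list of indices into `pages`.
    """
    clusters = []  # (features of first member, member indices)
    for i, html in enumerate(pages):
        f = features(html)
        for representative, members in clusters:
            if similarity(representative, f) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((f, [i]))
    return [members for _, members in clusters]
```

One possible use: run matches_known_blockpage() on every response first, then pass the remaining pages to cluster_pages(). A large cluster of near-identical pages returned for unrelated URLs is a candidate for a block page that the regex list does not cover yet.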
The 2016 "Do You See What I See?" study seems to be in line with your project.
"The second-class treatment of anonymous users ranges from outright rejection to ... imposing hurdles such as CAPTCHA-solving.... Our study draws upon ... scans of the home pages of top-1,000 Alexa websites through every Tor exit..."
Section V-A has to do with scans of top-ranked sites.
https://www.ndss-symposium.org/wp-content/uploads/2017/09/do-you-see-what-i-...
https://archive.org/details/ndss16doyousee