[anti-censorship-team] Insufficiently many pregenerated BridgeDB captchas?

David Fifield david at bamsoftware.com
Thu Dec 9 22:45:06 UTC 2021


For the Moat and HTTPS distributors, BridgeDB uses a cache of
pregenerated captcha images. It does not generate a fresh captcha for
every challenge.
https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/blob/eeca27703d5c151375ad5649263a72a44fd4481a/README.rst#id10
	> ...The second method uses a local cache of pre-made CAPTCHAs,
	> created by scripting Gimp using gimp-captcha. The latter
	> cannot easily be run on headless server, unfortunately,
	> because Gimp requires an X server to be installed.
https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/blob/eeca27703d5c151375ad5649263a72a44fd4481a/bridgedb/captcha.py#L378
	imageFilename = random.SystemRandom().choice(os.listdir(self.cacheDir))
	imagePath = os.path.join(self.cacheDir, imageFilename)
	with open(imagePath, 'rb') as imageFile:
	    self.image = imageFile.read()

It may be that there are simply too few pregenerated captcha images. If
there are N total, and an adversary invests effort to solve n of them,
then the adversary will be served a captcha it has already solved in a
fraction n / N of later bridge queries, until the cache of pregenerated
images is regenerated.
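
To put numbers on this, here is a minimal sketch. The function names and
the example values are hypothetical (in particular, N = 10000 is just an
assumed cache size for illustration), and it assumes the server picks
cached images uniformly and independently at random, as the
random.SystemRandom().choice call above suggests.

```python
from fractions import Fraction


def known_captcha_fraction(n_solved, n_total):
    """Fraction of queries that return an image the adversary already solved.

    Assumes each query draws one of n_total cached images uniformly at random.
    """
    if not 0 <= n_solved <= n_total or n_total == 0:
        raise ValueError("need 0 <= n_solved <= n_total and n_total > 0")
    return Fraction(n_solved, n_total)


def expected_queries_until_known(n_solved, n_total):
    """Mean number of queries until the first already-solved image appears.

    With independent uniform draws this is a geometric distribution with
    success probability n_solved / n_total, whose mean is n_total / n_solved.
    """
    if n_solved <= 0:
        raise ValueError("the adversary must have solved at least one image")
    return 1 / known_captcha_fraction(n_solved, n_total)


# Hypothetical cache size and adversary effort levels, for illustration only.
N = 10000
for n in (100, 1000, 5000):
    frac = known_captcha_fraction(n, N)
    wait = expected_queries_until_known(n, N)
    print(f"n={n}: known fraction {float(frac):.2f}, "
          f"about {float(wait):.0f} queries per known captcha")
```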

I downloaded 1000 captcha images from the Moat API and hashed them:
	for a in $(seq 1 1000); do curl -s -x socks5h://127.0.0.1:9050/ https://bridges.torproject.org/moat/fetch -H 'Content-type: application/vnd.api+json' --data-raw '{"data": [{"version": "0.1.0", "type": "client-transports"}]}' | jq '.data[0].image' | sha256sum; done | tee bridgedb.hashes

Out of 1000 images drawn randomly with replacement,
	916 appeared 1 time
	 39 appeared 2 times
	  2 appeared 3 times
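
A table like the one above can be computed from the hash list with
collections.Counter. This is only a sketch of one way to do it;
multiplicity_table is a hypothetical helper name, and the demo uses
made-up strings in place of the real lines of bridgedb.hashes.

```python
from collections import Counter


def multiplicity_table(hashes):
    """Map k -> number of distinct hashes that occur exactly k times."""
    per_image = Counter(hashes)          # hash -> number of occurrences
    return Counter(per_image.values())   # occurrences -> number of hashes


# Made-up stand-ins for hash lines: "a" occurs 3 times, "b" twice, "c" once.
demo = ["a", "b", "a", "c", "a", "b"]
for times, images in sorted(multiplicity_table(demo).items()):
    print(f"{images} appeared {times} time(s)")

# To apply this to the real data, something like the following should work:
#   with open("bridgedb.hashes") as f:
#       table = multiplicity_table(line.strip() for line in f)
```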

We can use a capture–recapture technique to estimate the total
population size.
https://en.wikipedia.org/wiki/Mark_and_recapture#Lincoln%E2%80%93Petersen_estimator
Divide the 1000 images into 2 equal halves, and count the unique images
in each half: n = 488, k = 492. The number of images in the second half
that were already seen in the first half is K = 23. The estimate for
N = n*k/K = 488*492/23 = 10439, so I guess the captcha cache dir on the
BridgeDB server holds only about 10000 images.
	>>> pop = list(open("bridgedb.hashes"))
	>>> s1, s2 = set(pop[:len(pop)//2]), set(pop[len(pop)//2:])
	>>> len(s1)
	488
	>>> len(s2)
	492
	>>> len(s1.intersection(s2))
	23
	>>> len(s1)*len(s2)/len(s1.intersection(s2))
	10438.95652173913
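
One way to get a feel for how noisy this estimate is: simulate draws
from a cache of known size and apply the same split-in-half calculation.
The sketch below does that. The function names, the seed, and the
simulated cache size of 10000 are all assumptions made for illustration;
it says nothing about how BridgeDB itself behaves beyond assuming
uniform random selection.

```python
import random
import statistics


def split_half_estimate(samples):
    """Lincoln-Petersen style estimate from one sequence of draws.

    Splits the draws into two halves, counts unique items in each half,
    and divides the product of those counts by the size of the overlap.
    """
    half = len(samples) // 2
    first, second = set(samples[:half]), set(samples[half:])
    recaptured = len(first & second)
    if recaptured == 0:
        raise ValueError("no overlap between halves; estimate is undefined")
    return len(first) * len(second) / recaptured


def simulate(true_size, draws, trials, seed=0):
    """Repeat the experiment on a simulated cache of true_size images."""
    rng = random.Random(seed)  # seeded only so that runs are repeatable
    estimates = []
    for _ in range(trials):
        samples = [rng.randrange(true_size) for _ in range(draws)]
        try:
            estimates.append(split_half_estimate(samples))
        except ValueError:
            pass  # skip the rare trial where the halves share nothing
    return estimates


# Assumed cache size of 10000 and 1000 draws per trial, as in the experiment.
estimates = simulate(true_size=10000, draws=1000, trials=200)
print("median estimate:", round(statistics.median(estimates)))
print("min/max estimate:", round(min(estimates)), round(max(estimates)))
```

The spread between the minimum and maximum gives a rough idea of how far
a single 1000-sample estimate can land from the true cache size.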

It would be best to generate a fresh captcha image for each challenge,
but if that's not possible, we should increase the number of cached
images or regenerate the cache periodically.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bridgedb.hashes.gz
Type: application/gzip
Size: 37778 bytes
Desc: not available
URL: <http://lists.torproject.org/pipermail/anti-censorship-team/attachments/20211209/0ed40113/attachment.gz>

