[anti-censorship-team] Insufficiently many pregenerated BridgeDB captchas?

Fri Dec 10 01:58:07 UTC 2021

On Thu, Dec 09, 2021 at 03:45:06PM -0700, David Fifield wrote:
> We can use a capture–recapture technique to estimate the total
> population size.
> https://en.wikipedia.org/wiki/Mark_and_recapture#Lincoln%E2%80%93Petersen_estimator
> Divide the 1000 images into 2 equal halves, and count the unique images
> in each half: n = 488, K = 492. The number of images in the second half
> that were already seen in the first half is k = 23. The estimate is
> N = n*K/k = 488*492/23 = 10439, so I guess the captcha cache dir on the
> BridgeDB server holds only about 10000 images.
> 	>>> pop = list(open("bridgedb.hashes"))
> 	>>> s1, s2 = set(pop[:len(pop)//2]), set(pop[len(pop)//2:])
> 	>>> len(s1)
> 	488
> 	>>> len(s2)
> 	492
> 	>>> len(s1.intersection(s2))
> 	23
> 	>>> len(s1)*len(s2)/len(s1.intersection(s2))
> 	10438.95652173913

BridgeDB should have 10,000 CAPTCHAs; at least it did when I last
generated a batch, in January 2020:
https://gitlab.torproject.org/tpo/anti-censorship/bridgedb/-/issues/24607#note_2599604
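
For reference, the Lincoln-Petersen arithmetic in the quoted message can be
wrapped in a small function. This is only a sketch: the name
lincoln_petersen and the synthetic ranges below are illustrative
assumptions, chosen to reproduce the quoted counts (488, 492, and an overlap
of 23). They are not the real contents of bridgedb.hashes.

```python
def lincoln_petersen(first, second):
    """Estimate population size from two samples of item identifiers.

    Computes n * K / k, where n and K are the numbers of unique items in
    the first and second samples and k is the number of items in both.
    """
    s1, s2 = set(first), set(second)
    recaptured = len(s1 & s2)
    if recaptured == 0:
        # With no overlap the estimator divides by zero.
        raise ValueError("samples do not overlap; estimate is undefined")
    return len(s1) * len(s2) / recaptured


# Synthetic stand-ins for the two halves (assumption, not real data):
# 488 unique items, 492 unique items, 23 items shared.
first_half = range(488)
second_half = range(465, 957)

estimate = lincoln_petersen(first_half, second_half)
print(round(estimate))  # 10439
```

An estimate near 10439 from a 1000-image sample is in line with a pool of
10,000 images.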