Hi,
Please have a look at this proposal. Will replace xyz with more meaningful numbers once this is finalized. Comments, suggestions and criticism are welcome.
------------- Filename: xxx-Count-unique-IPs-in-anonymous-way.txt Title: Count Unique IP addresses in an anonymous way Author: Jaskaran Veer Singh Created: 14 March 2017 Status: Draft
§0. Introduction
Currently, guard relays and bridges maintains a list of IP addresses of the devices that connect to it for various reasons such as for use by the bridge to check which country has them blocked. This is dangerous because if any of these tor instances get compromised, clients will be de-anonymized. To solve this issue, a new data structure that keeps a track of unique IP addresses seen but does not directly keep a list of them is proposed in this document.
§1. Specification
§1.1. Notation
Let `a^b` denote the exponentiation of a to the bth power.
Let `a == b` denote the equality of a with b, and vice versa.
Let `a := b` be the assignment of the value of b to the variable a.
Let `a / b` denote the division of a by b.
Let `a <= b` denote that a is less than equal to b.
Let `a >= b` denote that a is greater than equal to b.
§2. Research
There are three ways to solve this problem. All the three ways are actually Big Data Algorithms. A few problems arises for each of these algorithms since they are made for big data but the data we would provide is not necessarily big.
§2.1. Bloom Filter[1]
A bloom filter maps the input to positions on the bitmap after passing through 2 or more hash functions. Later any new input are mapped onto this bitmap in the same way to check whether this value is already present in the set. The feature of this bitmap is that collisions could happen. And this collision creates deniability. When collisions happen, On the one hand, one of the input doesn’t count to be unique (although in reality, it is), and on the other hand, this is beneficial since this creates deniability. The person who gets hand on this data structure could never be 100% sure about the original inputs. So we get the job done successfully at some error rate.
§3.1.1. Obstacle
Suppose if the number of inputs is small. Let’s say we receive just 1 connection in a day from some small, less busy country like Estonia. In that case, there might not be any chance for collision and the adversary could determine the IP address with some brute force. Hence this algorithm isn’t suited for us.
§3.2. Rappor[2]
Randomized Aggregatable Privacy Preserving Ordinal Responses is an algorithm where the system adds some deterministic and nondeterministic noise to the data that has to be stored. This creates deniability. In our case, we don’t need to have deterministic noise added at first stage. So we’ll just stick to adding non deterministic noise and storing it in a bloom filter.
One thing to note here is that, we should not accept output of the non deterministic randomizer which can be traced back to any IP address of Group D or Group E since those IP addresses are not in use and the adversary could easily know that those have been produced after adding random noise.
§3.2.1. Obstacle
Using brute force technique, the adversary could check to see whether the IP address stored is the correct one or produced using random noise. In this technique, the adversary could compare those IP address (obtained via brute force) to the directory of what IP addresses are allotted to what country. The one that does not match, is the one that has been faked by using random noise.
§3.3. Probabilistic Counting with Stochastic Averaging[3] (PCSA)
It is based on FM sketches. The algorithm goes as follows:
| m = 2^b # with b in [4...16] | bitmaps = [[0]*32]*m # initialize m 32bit wide bitmaps to 0s | | # Construct the PCSA bitmaps | for h in hashed(data): | bitmap_index = 1 + get_bitmap_index( h,b ) # binary address of rightmost b bits | run_length = run_of_zeros( h,b ) # length of the run of zeros starting at bit b+1 | bitmaps[bitmap_index][run_length] = 1 # set the bitmap bit based on the run length observed | | # Determine the cardinality | phi = 0.77351 | DV = m / phi * 2 ^ (sum( least_sig_bit( bitmap ) ) / m) # the DV estimate
So, Error is bounded by 0.78/sqrt(m)
§3.3.1. Obstacle
This algorithms stated above is made for use on large databases. Infact, these were invented to save time and space while doing basic set operations on data with high cardinality. But the data we would provide as an input is not necessarily of high cardinality. Since we would be counting numbers for each country separately, so the expected value of the input cardinality would be :
0 <= C <= 2500
where C is the actual cardinality of the input data.
§3.3.2. Workaround
As a workaround, we could introduce a correction term for small values of cardinality. So our final formula could look something like:
DV(S1,...,Sm) := m· (2^(Z/m) −2^(−κ·Z/m))/ ρ
Where Κ is chosen to be around 1.75
§4. Implementation
The data structure present in the geoip.c needs to be removed as it stores the IP address of the client. The data structure is shown below.
struct clientmap_entry_t { HT_ENTRY(clientmap_entry_t) node; tor_addr_t addr; char *transport_name; unsigned int last_seen_in_minutes:30; unsigned int action:2; };
It then later needs to be replaced by a datastruture that contains the list of countries with the number of unique IP addrs seen for that country.
The whole system can be represented by this diagram.
------------------ | Input IP Addr | ------------------ | | --------------------- | Determine Country | --------------------- | | ---------------------- | Determine Transport| ---------------------- | | ------------------------------ | Get salted hash of IP addr | ------------------------------ | | -------------------------- | Apply the formula | | on the hash obtained | -------------------------- | | --------------------------------------- | Increment that country's counter | | if the cardinality estimation | | is greater than the estimation | | done previously | ---------------------------------------
§5. Security Considerations
We might as well go ahead and implement this, but the thing to keep in mind is that implementing this would not protect a client's identity from a adversary completely. First thing to understand is that, an adversary that has gained access to a guard node or a bridge could still get the random value generated and deanonymize the client using brute force. Or even better, why would the adversary need that random value when she simply log all network connections coming into the compromised system?
So, A thing to note is that this functionality would keep those clients anonymous who had connected to the compromised system in the past, but not those who will connect to it in the future.
§6. References
[1] https://en.wikipedia.org/wiki/Bloom_filter [2] https://www.chromium.org/developers/design-documents/rappor [3] https://research.neustar.biz/2013/04/02/sketch-of-the- day-probabilistic-counting-with-stochastic-averaging-pcsa/
-------------
Regards, Jaskaran Veer Singh