Hi all,
I was asked by George to submit my comments about the proposal and to suggest suitable RandomX parameters for this PoW scheme.
For our dynamic PoW system to work, we will need to be able to compare PoW tokens with each other. To do so, we define a function: unsigned effort(uint8_t *token) which takes a hash output token as its argument and returns the number of leading zero bits in it.
This definition makes the effort exponential, i.e. the computational resources required to reach one notch higher effort increase by a factor of 2 each time.
I suggest to use linear effort, defined as the quotient of a bitstring of all 1s (the maximum possible hash value) divided by the hash value:
== Example A:
effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101
or in decimal:
effort(6317) = 1048575 / 6317 = 165.
This definition of effort has the advantage of directly expressing the expected number of hashes that the client had to calculate to reach the effort.
With the exponential definition, the effort can only take values such as 128 (7 leading zeroes) or 256 (8 leading zeroes), with nothing in between, while the linear definition provides a smoother classification of PoW results.
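For illustration only (this is my sketch, not part of the proposal text), with the 8-byte hash output suggested later in this mail, the linear effort could be computed like this:

#include <stdint.h>

/* Linear effort = expected number of hashes needed to reach this result.
   Assumes an 8-byte hash output interpreted as a little-endian unsigned
   integer (see the RANDOMX_HASH_SIZE = 8 suggestion below). */
static uint64_t effort(const uint8_t token[8]) {
  uint64_t value = 0;
  for (int i = 7; i >= 0; i--)
    value = (value << 8) | token[i]; /* little-endian decode */
  if (value == 0)
    return UINT64_MAX; /* avoid division by zero; practically unreachable */
  return UINT64_MAX / value;
}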
The EXT_FIELD content format is:
POW_VERSION [1 byte] POW_NONCE [32 bytes]
I suggest to use a 16-byte nonce value, which is more than sufficient given the target security level and has the benefit of reducing the replay cache size to half.
Since we expect the seed to be valid for around 3 hours (as proposed), then even if the service receives 1 million proofs per second and each proof has an effort of 1 million, then the number of submitted nonces from clients will only reach about 10^10. With a 108-bit solution space (subtracting 20 bits as the search space per client), the probability that two clients accidentally submit the same nonce is roughly 10^-13 (see [REF_BIRTHDAY]).
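For reference, the usual birthday-bound approximation from [REF_BIRTHDAY] gives, for n = 10^10 nonces drawn from a d = 108 bit space:

p ≈ n^2 / 2^(d+1) = (10^10)^2 / 2^109 ≈ 1.5 * 10^-13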
Additionally, I suggest to add the client's effort for convenience:
The updated EXT_FIELD content format would be:
POW_VERSION [1 byte] POW_EFFORT [4 bytes] POW_NONCE [16 bytes]
Including the effort has 2 benefits:
1. In case the Intro Priority Queue is full, the service doesn't need to waste time verifying PoW solutions whose claimed effort is lower than that of the last element in the queue (see the sketch after this list). While an attacker can always lie about the actual effort of their nonce value, I think this field can still save some CPU cycles when the service is under heavy load.
2. The service can conveniently verify the reported effort with the following inequality:
POW_EFFORT * pow_function(POW_NONCE, SEED) <= MAX_RESULT
where MAX_RESULT is the highest possible output from the pow_function. In the case of Example A, that would be:
165 * 6317 = 1042305 <= 1048575
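To sketch the first benefit above (the queue type and helper names are hypothetical, not existing tor code), the service could cheaply decide whether a request is even worth verifying before touching RandomX:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical queue representation, only for illustration. */
typedef struct {
  uint32_t *efforts;  /* claimed efforts of the queued requests */
  size_t count;
  size_t capacity;
  size_t min_index;   /* index of the lowest-effort entry */
} pow_queue_t;

/* Returns 1 if the request could displace something in a full queue,
   i.e. if it is worth spending CPU time on RandomX verification at all. */
static int worth_verifying(const pow_queue_t *q, uint32_t pow_effort) {
  if (q->count < q->capacity)
    return 1;                       /* queue not full: always verify */
  return pow_effort > q->efforts[q->min_index];
}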
Similar to how our cell scheduler works, the onion service subsystem will poll the priority queue every 100ms tick and process the first 20 cells from the priority queue (if they exist). The service will perform the rendezvous and the rest of the onion service protocol as normal.
I suggest to use a different selection method rather than always taking the first 20 requests.
Selecting cells from the front of the queue actually minimizes the effort that an attacker needs to expend to completely block access to the service. The attacker can always submit the minimum effort required to occupy the first 20 slots in each 100 ms window. This can be done by submitting just 200 requests per second, observing how many circuits are successfully opened, and adjusting the effort until no other request can go through. This is the "Total overwhelm strategy" described in § 5.1.1.
The following examples show how selecting the first N cells from the queue is unfair. I will use N = 4 for clarity. E denotes some value of effort.
== Example B:
              ATTACKER                          LEGITIMATE CLIENTS
 __________________________________   _____________________________________
/                                   \ /                                     \
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
|   2E   |   2E   |   2E   |   2E   | E  | E  | E  | E  | E  | E  | E  | E  |
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
    ^        ^        ^        ^
 selected selected selected selected
Here both the attacker and the legitimate clients have expended a combined effort of 8E. All of the attacker's cells get selected, while none of the other clients get through.
Instead, I suggest to use Stochastic universal sampling [REF_SUS], where the probability that a cell gets selected is proportional to its effort.
== Example C:
Here the total effort in the queue is 16E, so the sampling interval is 16E / 4 = 4E: we first select a random starting value in the interval [0, 4E) and then select 4 evenly spaced cells:
              ATTACKER                          LEGITIMATE CLIENTS
 __________________________________   _____________________________________
/                                   \ /                                     \
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
|   2E   |   2E   |   2E   |   2E   | E  | E  | E  | E  | E  | E  | E  | E  |
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
    ^                 ^                    ^                   ^
 selected          selected             selected            selected
In this case, 2 cells are selected from each group, which is fairer considering that each group expended the same effort.
== Example D:
Now if the attacker wanted to block access to all legitimate clients, they would need to at least quadruple their total PoW effort (and there would still be a chance that a legitimate client gets selected from the end of the queue):
                          ATTACKER                           LEGITIMATE CLIENTS
 ___________________________________________________________   _____________
/                                                           \ /             \
+--------------+--------------+--------------+--------------+-+-+-+-+-+-+-+-+
|      8E      |      8E      |      8E      |      8E      |E|E|E|E|E|E|E|E|
+--------------+--------------+--------------+--------------+-+-+-+-+-+-+-+-+
       ^              ^              ^              ^
   selected       selected       selected       selected
Note: With linear effort, the original selection method becomes even worse because the attacker can occupy the front of the queue with an effort of just E+1 per cell rather than 2E.
In some cases, Stochastic universal sampling can select a single element multiple times, which is not a problem for genetic algorithms, but we want to avoid it. I suggest to restart the selection algorithm from the beginning of the next cell in those cases and shorten the sampling interval according to the remaining portion of the queue. See the following example.
== Example E:
+---------------+-------------+--------------+-+-+-+-+-+-+-+-+-+-+
|      16E      |      8E     |      6E      |E|E|E|E|E|E|E|E|E|E|
+---------------+-------------+--------------+-+-+-+-+-+-+-+-+-+-+
        ^              ^             ^                ^
    selected       selected      selected         selected
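To put the above together, here is a rough sketch of the modified sampling over one 100 ms tick. The pow_cell_t type is hypothetical and rand() only stands in for a proper RNG; this is an illustration, not an implementation:

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical queue entry. Requests without a PROOF_OF_WORK extension
   (or with effort < MIN_EFFORT) would be stored with effort = 1. */
typedef struct {
  uint32_t effort;
  /* ...the rest of the introduction request... */
} pow_cell_t;

/* Pick up to n_select cells with probability proportional to effort.
   Whenever a cell would be picked twice, restart from the next cell and
   shorten the sampling interval to the remaining part of the queue. */
static size_t sus_select(const pow_cell_t *cells, size_t n_cells,
                         size_t *selected, size_t n_select) {
  size_t picked = 0;
  size_t start = 0;
  while (picked < n_select && start < n_cells) {
    uint64_t total = 0;
    for (size_t j = start; j < n_cells; j++)
      total += cells[j].effort;
    uint64_t step = total / (n_select - picked);
    if (step == 0)
      break;
    uint64_t pointer = (uint64_t)rand() % step; /* random offset in [0, step) */
    uint64_t cumulative = 0;
    size_t i = start;
    int restart = 0;
    for (; i < n_cells && picked < n_select; i++) {
      cumulative += cells[i].effort;
      if (pointer < cumulative) {
        selected[picked++] = i;
        pointer += step;
        if (pointer < cumulative) {
          restart = 1; /* this cell would be selected again */
          break;
        }
      }
    }
    if (!restart)
      break;       /* all pointers placed or queue exhausted */
    start = i + 1; /* resume after the cell that would have repeated */
  }
  return picked;
}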
In particular, the service starts with a default suggested-effort value of 15.
Every time the service handles an introduction request from the priority queue in [HANDLE_QUEUE], the service compares the request's effort to the current suggested-effort value. If the new request's effort is lower than the suggested-effort, set the suggested-effort equal to the effort of the new request.
Every time the service trims the priority queue in [HANDLE_QUEUE], the service compares the request at the trim point against the current suggested-effort value. If the trimmed request's effort is higher than the suggested-effort, set the suggested-effort equal to the effort of the trimmed request.
I think the default suggested-effort is a bit too high. Assuming each attempt takes around 1 ms to calculate, the client would need, on average, 2^15 ms = 33 seconds to find a solution using 1 CPU core. I suggest to specify a minimum effort instead. Requests with effort < MIN_EFFORT and requests without the PROOF_OF_WORK extension would be treated as having effort = 1 for the purposes of the sampling algorithm.
Secondly, the proposed method of calculating the suggested-effort is susceptible to gaming by attackers. Since the service can only update the value in the directory once every 'hs-pow-desc-upload-rate-limit' seconds, they could stop the attack for a little while just before the directory gets updated, which would cause the suggested-effort value to be too low despite an ongoing attack.
I suggest to take the median effort of the selected cells during each 100 ms window. For the examples above that would be:
Example C: 1.5E
Example D: 8E
Example E: 7E
Then I would take the median of these values over the directory update period.
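As a sketch (again with hypothetical helpers, not existing tor code), the per-tick median could be computed like this; the published suggested-effort would then be the median of these per-tick values over the update period:

#include <stdint.h>
#include <stdlib.h>

/* Comparator for sorting efforts in ascending order. */
static int cmp_u32(const void *a, const void *b) {
  uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
  return (x > y) - (x < y);
}

/* Median of the efforts of the cells selected in one 100 ms tick.
   For an even count, this averages the two middle values (e.g. 1.5E).
   Note: sorts the input array in place. */
static double median_effort(uint32_t *efforts, size_t n) {
  if (n == 0)
    return 0.0;
  qsort(efforts, n, sizeof(uint32_t), cmp_u32);
  if (n % 2 == 1)
    return (double)efforts[n / 2];
  return (efforts[n / 2 - 1] + efforts[n / 2]) / 2.0;
}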
- Attacker strategies
Additional attacks:
5.1.2 PoW Spam Attack
The attacker may try to spam many requests with random values of POW_NONCE, requiring the service to waste cycles verifying the invalid proofs. No such request would make it into the Intro Queue, but it may still be a viable DoS strategy depending on proof verification time and the number of intro requests that can be practically delivered through the network.
5.1.3 Precomputed PoW attack
The attacker may precompute many valid PoW nonces and submit them all at once before the current seed expires, temporarily overwhelming the service even when using a single computer.
4.2. Seed expiration issues
As mentioned in [DESC_POW], the expiration timestamp on the PoW seed can cause issues with clock-skewed clients. Furthermore, even clients without clock skew can encounter TOCTOU-style race conditions here.
The client descriptor refetch logic of [CLIENT_TIMEOUT] should take care of such seed-expiration issues, since the client will refetch the descriptor.
I suggest to use 2 concurrent seeds, i.e. to accept PoW both from the current and the last seed epoch. We use this approach in Monero. It would however require adding the seed value into the proof of work extension field and also double the memory requirements for verification with RandomX.
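A rough sketch of what dual-seed verification could look like (names are hypothetical; exactly how the extension field identifies the seed is up for discussion):

#include <stdint.h>
#include <string.h>
#include "randomx.h"

/* Keep an initialized VM for both the current and the previous seed epoch.
   The PoW extension would have to identify which seed was used, e.g. by
   carrying the first 4 bytes of it. With the ~1 GiB dataset suggested below,
   this roughly doubles the verification memory. */
typedef struct {
  uint8_t seed_head[4]; /* identifies the seed epoch */
  randomx_vm *vm;       /* VM initialized from that seed's dataset */
} pow_seed_state_t;

static pow_seed_state_t seed_states[2]; /* [0] = current, [1] = previous */

static randomx_vm *vm_for_seed(const uint8_t *seed_head) {
  for (int i = 0; i < 2; i++) {
    if (memcmp(seed_states[i].seed_head, seed_head, 4) == 0)
      return seed_states[i].vm;
  }
  return NULL; /* neither the current nor the previous seed: reject */
}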
The proposal suggests argon2, and Mike has been looking at RandomX. However, after further consideration and speaking with some people (props to Alex Biryukov), it seems like those two functions are not well suited for this purpose, since they are memory-hard for both the client and the service. Since we are trying to minimize the verification overhead so that the service can do hundreds of verifications per second, they don't seem like good fits.
Asymmetric functions like Equihash, Cuckoo Cycle [REF_CUCKOO], and MTP have the advantage of being very fast to verify, but they run much faster on GPUs and specialized hardware, so I don't think they are particularly suitable for this purpose.
When we designed RandomX to be used as the PoW algorithm by Monero, we selected the parameters very conservatively to maximize the ASIC resistance of the algorithm. That's because it is very difficult to change the PoW algorithm once it's deployed.
Tor is not limited by this, so we can be much more aggressive when configuring RandomX. I suggest to use a configuration that gives the fastest possible verification time without completely breaking the algorithm.
In particular, the following parameters should be set differently from Monero:
RANDOMX_ARGON_SALT = "RandomX-TOR-v1"
The unique RandomX salt means we do not need to use a separate salt as PoW input as specified in § 3.2.
RANDOMX_ARGON_ITERATIONS = 1
RANDOMX_CACHE_ACCESSES = 4
RANDOMX_DATASET_BASE_SIZE = 1073741824
RANDOMX_DATASET_EXTRA_SIZE = 16777216
These 4 changes reduce the RandomX Dataset size to ~1 GiB, which allows the number of iterations to be reduced from 8 to 4. The combined effect of this is that Dataset initialization becomes 4 times faster, which is needed due to more frequent updates of the seed (Monero updates the seed only once per ~3 days).
RANDOMX_PROGRAM_COUNT = 2
RANDOMX_SCRATCHPAD_L3 = 1048576
Additionally, reducing the number of programs from 8 to 2 makes the hash calculation about 4 times faster, while still providing resistance against program filtering strategies (see [REF_RANDOMX_PROGRAMS]). Since there are 4 times fewer writes, we also have to reduce the scratchpad size. I suggest to use a 1 MiB scratchpad size as a compromise between scratchpad write density and memory hardness. Most x86 CPUs will perform roughly the same with a 512 KiB or a 1024 KiB scratchpad, while the larger size provides higher resistance against specialized hardware, at the cost of possible time-memory tradeoffs (see [REF_RANDOMX_TMTO] for details).
Lastly, we reduce the output of RandomX to just 8 bytes:
RANDOMX_HASH_SIZE = 8
64-bit preimage security is more than sufficient for proof-of-work and it allows the result to be treated as a little-endian encoded unsigned integer for easy effort calculation.
RandomX would be used as follows:
The service will select a 32-byte POW_SEED and initialize the cache and the dataset:
randomx_cache *myCache = randomx_alloc_cache(flags);
randomx_init_cache(myCache, POW_SEED, 32);
randomx_dataset *myDataset = randomx_alloc_dataset(flags);
randomx_init_dataset(myDataset, myCache, 0, randomx_dataset_item_count());
/* flags must include RANDOMX_FLAG_FULL_MEM so that the fast (dataset) mode is used */
randomx_vm *myMachine = randomx_create_vm(flags, NULL, myDataset);
Then in order to validate a PoW token, we could use something like this:
int validateProof(uint32_t pow_effort, void* pow_nonce) {
  uint64_t result;
  randomx_calculate_hash(myMachine, pow_nonce, 16, &result);
  if (mulh(pow_effort, result) == 0) {
    return 1;
  }
  return 0;
}
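Note that mulh() is not a standard C function here; one possible definition, assuming GCC or clang on a 64-bit target, is:

#include <stdint.h>

/* Upper 64 bits of the 128-bit product a * b. The high half is zero exactly
   when pow_effort * result <= 2^64 - 1, i.e. when the claimed effort does not
   exceed MAX_RESULT / result. */
static inline uint64_t mulh(uint64_t a, uint64_t b) {
  return (uint64_t)(((unsigned __int128)a * b) >> 64);
}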
I suggest to set MIN_EFFORT = 10000, which takes about 1 second on my laptop. Requests with pow_effort < MIN_EFFORT would not be validated.
I have collected some performance figures for the fast mode with the above RandomX configuration (~1 GiB of memory is required):
H/s = hashes per second
== CPUs:
Intel Core i3-3220
 - 1 thread 700 H/s
 - 3 threads 1400 H/s

Intel Xeon (dual core VPS, Sandy Bridge, unknown model)
 - 1 thread 2000 H/s
 - 2 threads 4000 H/s

Intel Core i5-2500K (stock)
 - 1 thread 2200 H/s
 - 4 threads 8200 H/s

Intel Core i7-8550U (laptop)
 - 1 thread 2700 H/s
 - 8 threads 10000 H/s

Intel Core i7-9850H (laptop)
 - 1 thread 3100 H/s
 - 12 threads 16000 H/s

AMD Ryzen 1700 @ 3300MHz
 - 1 thread 2300 H/s
 - 16 threads 23800 H/s

AMD Ryzen 3700X @ 3300MHz
 - 1 thread 2500 H/s
 - 16 threads 27500 H/s

AMD Epyc 7702P
 - 1 thread 2100 H/s
 - 128 threads 139000 H/s
== GPUs:
NVIDIA GeForce GTX 1660 Ti (credits to SChernykh, see [REF_RANDOMX_CUDA])
 - 3072 intensity 2600 H/s
According to the above results, the time to verify a single hash is around 400-500 μs. A mid-range GPU has performance similar to a single CPU core. Most CPUs made since 2011 have similar per-core performance, except for low-end CPUs without hardware AES support.
References:
[REF_BIRTHDAY]: https://en.wikipedia.org/wiki/Birthday_attack#Mathematics
[REF_SUS]: https://en.wikipedia.org/wiki/Stochastic_universal_sampling
[REF_CUCKOO]: https://github.com/tromp/cuckoo
[REF_RANDOMX_PROGRAMS]: https://github.com/tevador/RandomX/blob/master/doc/design.md#12-the-easy-pro...
[REF_RANDOMX_TMTO]: https://github.com/tevador/RandomX/issues/65
[REF_RANDOMX_CUDA]: https://github.com/SChernykh/RandomX_CUDA