Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

10 May 2020

      On 08 May, 21:53, tevador <tevador@gmail.com> wrote:
...
In particular, the following parameters should be set differently from
Monero:
    RANDOMX_ARGON_SALT = "RandomX-TOR-v1"
The unique RandomX salt means we do not need to use a separate salt as PoW
input as specified in § 3.2.
    RANDOMX_ARGON_ITERATIONS = 1
    RANDOMX_CACHE_ACCESSES = 4
    RANDOMX_DATASET_BASE_SIZE = 1073741824
    RANDOMX_DATASET_EXTRA_SIZE = 16777216
These 4 changes reduce the RandomX Dataset size to ~1 GiB, which allows
the number of iterations (RANDOMX_CACHE_ACCESSES) to be reduced from 8 to 4.
The combined effect is that Dataset initialization becomes 4 times faster,
which is needed because the seed is updated more frequently (Monero updates
it once per ~3 days).
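As a quick sanity check of the "~1 GiB" figure, the two size parameters above can simply be added up. This is only an arithmetic sketch; the helper name dataset_size_mib is made up for illustration and is not part of RandomX.

```python
MIB = 1024 * 1024

# Values proposed above for the Tor variant.
RANDOMX_DATASET_BASE_SIZE = 1073741824   # 1024 MiB
RANDOMX_DATASET_EXTRA_SIZE = 16777216    # 16 MiB


def dataset_size_mib(base=RANDOMX_DATASET_BASE_SIZE,
                     extra=RANDOMX_DATASET_EXTRA_SIZE):
    """Total Dataset size in MiB (hypothetical helper, illustration only)."""
    return (base + extra) // MIB


print(dataset_size_mib())  # 1040
```

The result agrees with the "full memory mode (1040 MiB)" line printed by the benchmark.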
    RANDOMX_PROGRAM_COUNT = 2
    RANDOMX_SCRATCHPAD_L3 = 1048576
Additionally, reducing the number of programs from 8 to 2 makes the hash
calculation about 4 times faster, while still providing resistance against
program filtering strategies (see [REF_RANDOMX_PROGRAMS]). Since there are
4 times fewer writes, we also have to reduce the scratchpad size. I suggest
using a 1 MiB scratchpad size as a compromise between scratchpad write
density and memory hardness. Most x86 CPUs will perform roughly the same
with a 512 KiB and 1024 KiB scratchpad, while the larger size provides
higher resistance against specialized hardware, at the cost of possible
time-memory tradeoffs (see [REF_RANDOMX_TMTO] for details).
Lastly, we reduce the output of RandomX to just 8 bytes:
    RANDOMX_HASH_SIZE = 8
64-bit preimage security is more than sufficient for proof-of-work and it
allows the result to be treated as a little-endian encoded unsigned integer
for easy effort calculation.
I have implemented this in the tor-pow branch of the RandomX repository:

    https://github.com/tevador/RandomX/tree/tor-pow

Namely, I have changed the API to return the hash value as a uint64_t and
made corresponding changes in the benchmark.

Benchmark example:

    ./randomx-benchmark --mine \
                        --avx2 \
                        --jit  \
                        --largePages \
                        --nonces 10000 \
                        --seed 1234 \
                        --init 1 \
                        --threads 1 \
                        --batch
    RandomX-TOR-v1 benchmark
     - Argon2 implementation: AVX2
     - full memory mode (1040 MiB)
     - JIT compiled mode
     - hardware AES mode
     - large pages mode
     - batch mode
    Initializing (1 thread) ...
    Memory initialized in 5.32855 s
    Initializing 1 virtual machine(s) ...
    Running benchmark (10000 nonces) ...
    Performance: 2535.43 hashes per second
    Best result:
      Nonce: 8bc3ded34d2dcdeed9000000f95cd20c
      Result: d947ceff08750300
      Effort: 18956
      Valid: 1

At the end, it prints out the nonce that gives the highest effort value and
validates it.
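To illustrate how the 8-byte result maps to an effort value, here is a minimal sketch. It assumes that effort is floor(2^64 / result) with the result read as a little-endian unsigned integer. That formula is my reading and is not spelled out in this message, but it reproduces the Effort line printed above. The function name is hypothetical.

```python
def effort_from_result(result_hex):
    """Effort for an 8-byte RandomX-TOR result given as a hex string.

    Assumption: effort = floor(2**64 / result), with the result decoded
    as a little-endian unsigned 64-bit integer.
    """
    value = int.from_bytes(bytes.fromhex(result_hex), "little")
    # Guard against division by zero for an all-zero result.
    return 2**64 // max(value, 1)


# "Result" line from the benchmark output above.
print(effort_from_result("d947ceff08750300"))  # 18956
```

A smaller result therefore means a higher effort, so comparing efforts amounts to comparing two 64-bit integers.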

For the actual implementation in Tor, the RandomX validator should run in
a separate thread that doesn't do anything else apart from validation and
moving valid requests into the Intro Queue. This way we can reach the maximum
performance of ~2000 processed requests per second.
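The dedicated validator thread could look roughly like the sketch below. It is only an illustration in Python, not Tor code: validator_loop, the verify callback and the request dictionaries are all hypothetical, and ordering the Intro Queue by descending effort is an assumption on my part.

```python
import itertools
import queue
import threading


def validator_loop(incoming, intro_queue, verify):
    """Validate requests and move the valid ones into the Intro Queue.

    `verify` is a hypothetical callback returning the effort of a request,
    or 0 if its PoW is invalid.
    """
    seq = itertools.count()  # tie-breaker so requests are never compared
    while True:
        request = incoming.get()
        if request is None:  # sentinel: shut the thread down
            break
        effort = verify(request)
        if effort > 0:
            # PriorityQueue pops the smallest tuple first, so negate effort.
            intro_queue.put((-effort, next(seq), request))


def run_demo():
    incoming = queue.Queue()
    intro_queue = queue.PriorityQueue()

    def fake_verify(req):  # stand-in for the real RandomX check
        return req["effort"] if req["valid"] else 0

    worker = threading.Thread(target=validator_loop,
                              args=(incoming, intro_queue, fake_verify))
    worker.start()
    for i, (effort, valid) in enumerate([(10, True), (500, False), (300, True)]):
        incoming.put({"id": i, "effort": effort, "valid": valid})
    incoming.put(None)
    worker.join()

    order = []
    while not intro_queue.empty():
        order.append(intro_queue.get()[2]["id"])
    return order


print(run_demo())  # [2, 0]
```

The request with the invalid PoW never reaches the Intro Queue, and the thread does nothing except validate and enqueue.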

Finally, here are some disadvantages of RandomX-TOR:

 1) Fast verification requires ~1 GiB of memory. If we decide to use two
    overlapping seed epochs, each service will need to allocate >2 GiB of RAM
    just to verify the PoW. Alternatively, it is possible to use the slow
    mode, which requires only 256 MiB per seed, but runs 4x slower.
 2) The fast mode needs about 5 seconds to initialize every time the seed is
    changed (can be reduced to under 1 second using multiple threads). The
    slow mode needs about 0.1 seconds to initialize.
 3) RandomX includes a JIT compiler for maximum performance. The iOS operating
    system doesn't support JIT compilation, so RandomX runs about 10x slower
    there.
 4) The JIT compiler in RandomX is currently implemented only for x86-64 and
    ARM64 CPU architectures. Other architectures will run very slowly
    (especially 32-bit systems). However, the two supported architectures
    cover the vast majority of devices, so this should not be an issue.
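To put numbers on point 1, here is a small sketch of the verifier memory budget with overlapping seed epochs. The per-seed figures are taken from this message (1040 MiB reported by the benchmark for the fast mode, 256 MiB for the slow mode); the function name is hypothetical, and per-VM scratchpads are ignored.

```python
FAST_MODE_MIB = 1040  # full Dataset, as reported by the benchmark
SLOW_MODE_MIB = 256   # slow mode, per seed


def verifier_memory_mib(per_seed_mib, seed_epochs=2):
    """Memory needed to verify PoW for `seed_epochs` overlapping seeds."""
    return per_seed_mib * seed_epochs


print(verifier_memory_mib(FAST_MODE_MIB))  # 2080 MiB, i.e. just over 2 GiB
print(verifier_memory_mib(SLOW_MODE_MIB))  # 512 MiB, at 4x lower speed
```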