Trimming to stuff that I just want to reply to; I otherwise agree.
Note: in a couple places I replied directly to asn's OP, because I noticed some more questions that I could answer.
On 4/2/20 2:30 PM, David Goulet wrote:
On 02 Apr (18:54:59), George Kadianakis wrote:
2.2.2. Dynamic PoW
DoS is a dynamic problem where the attacker's capabilities constantly change, and hence we want our proof-of-work system to be dynamic and not stuck with a static difficulty setting. So instead of forcing clients to go below a static target like in Bitcoin to be successful, we ask clients to "bid" using their PoW effort. Effectively, a client gets higher priority the more effort they put into their proof-of-work. This is similar to how proof-of-stake works, but instead of staking coins, you stake work.
So this means that desktop users will be prioritized over mobile users basically unless I make my phone use X% of battery?
Yes. We should be clear that this is not meant to be on all the time, and that yes, it is likely to sacrifice access by mobile users, depending on the current attack volume/difficulty level.
The Tor Browser Android UI could inform users that the service is under attack and direct them to try on their desktop instead.
If a client receives a descriptor with "pow-params", it should assume that the service is expecting a PoW input as part of the introduction protocol.
What happens with clients _without_ PoW support? They basically won't be able to connect, I suppose? Or be put in the prio queue at the service at the very end, with work done = 0?
Queuing them with work_done=0 will be better. For an attacker, fake work with an actual work_done=0 is just as easy to create as omitting the work entirely.
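To make the "bid" concrete, here is a minimal client-side sketch of the nonce grind. This is not from the proposal: the argon2 variant, the cost parameters, and the nonce construction are all placeholder assumptions, using the phc-winner-argon2 reference API.

  #include <stdint.h>
  #include <string.h>
  #include <argon2.h>

  #define HASH_LEN 32
  #define NONCE_LEN 16

  /* Grind nonces until argon2(seed, nonce) falls at or below the
   * target the client chose. A lower target means more expected
   * attempts, i.e. a higher "bid" and a better queue position. */
  static void
  pow_bid(const uint8_t seed[32], const uint8_t target[HASH_LEN],
          uint8_t nonce[NONCE_LEN], uint8_t hash[HASH_LEN])
  {
    uint64_t counter = 0;
    do {
      memset(nonce, 0, NONCE_LEN);
      memcpy(nonce, &counter, sizeof(counter)); /* counter nonce */
      counter++;
      /* argon2i with (t_cost=1, m_cost=1 MiB, lanes=1) is a stand-in;
       * the proposal will pin the real variant and parameters. */
      argon2i_hash_raw(1, 1 << 10, 1, nonce, NONCE_LEN,
                       seed, 32, hash, HASH_LEN);
    } while (memcmp(hash, target, HASH_LEN) > 0); /* big-endian cmp */
  }

A client that wants higher priority simply picks a lower target and loops longer; a client without PoW support (or on battery) effectively skips the loop and accepts the back of the queue.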
3.4.1. PoW verification
To verify the client's proof-of-work the service extracts (hash_output, seed, nonce) from the INTRODUCE1 cell and MUST do the following steps:
- Make sure that the client's seed is identical to the active seed.
- Check the client's nonce for replays (see [REPLAY_PROTECTION] section).
- Verify that hash_output =?= argon2(seed, nonce)
So wait, the service also has to do the PoW for each client by computing the Argon2 hash for each cell? Or am I misunderstanding?
Yes, but the service side only has to run the hash once, which should be fast.
The client/attacker is hashing many times, to search for a nonce that will satisfy the target hash value comparison.
But oops! We forgot to list that directly above, which might have caused the confusion:
0) Check that hash_output <= target_level
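Put together, the service-side check is a single hash plus a few comparisons. Here is a hedged sketch, again using the phc-winner-argon2 reference API; seen_nonce() and the cost parameters are hypothetical placeholders, not existing tor code:

  #include <stdint.h>
  #include <string.h>
  #include <argon2.h>

  #define HASH_LEN 32
  #define SEED_LEN 32
  #define NONCE_LEN 16

  /* Hypothetical replay-cache lookup; tor would back this with a
   * bloom filter or similar (see [REPLAY_PROTECTION]). */
  int seen_nonce(const uint8_t *nonce);

  int
  pow_verify(const uint8_t *active_seed, const uint8_t *target_level,
             const uint8_t *hash_output, const uint8_t *seed,
             const uint8_t *nonce)
  {
    uint8_t computed[HASH_LEN];

    /* 0) The hash must meet the target (big-endian byte compare). */
    if (memcmp(hash_output, target_level, HASH_LEN) > 0)
      return -1;
    /* 1) The seed must be the currently active one. */
    if (memcmp(seed, active_seed, SEED_LEN) != 0)
      return -1;
    /* 2) The nonce must not have been used before. */
    if (seen_nonce(nonce))
      return -1;
    /* 3) Recompute the hash once and compare; the cost parameters
     * here are placeholders, as in the client sketch above. */
    if (argon2i_hash_raw(1, 1 << 10, 1, nonce, NONCE_LEN,
                         seed, SEED_LEN, computed, HASH_LEN) != ARGON2_OK)
      return -1;
    return memcmp(computed, hash_output, HASH_LEN) == 0 ? 0 : -1;
  }

Note that steps 0-2 reject most garbage before the single hash computation at step 3 ever runs.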
3.4.2. The Introduction Queue [INTRO_QUEUE]
3.4.2.1. Adding introductions to the introduction queue
When PoW is enabled and a verified introduction comes through, the service, instead of jumping straight into rendezvous, queues it and prioritizes it based on how much effort the client devoted to its PoW. This means that introduction requests with high effort should be prioritized over those with low effort.
To do so, the service maintains an "introduction priority queue" data structure. Each element in that priority queue is an introduction request, and its priority is the effort put into its PoW:
When a verified introduction comes through, the service interprets the PoW hash as a 32-byte big-endian integer 'hash_int' and, based on that integer, inserts it into the right position of the priority_queue: the smallest 'hash_int' goes to the front of the queue. If two elements have the same value, the older one has priority over the newer one. {XXX: Is this operation with 32-byte integers expensive? How to make it cheaper?}
One option: either subtract off or AND off the target, so we're comparing difficulty relative to the target - i.e., a much smaller integer space.
In the blockchain world, the difficulty target is typically expressed as a bitmask anyway, probably for reasons like this.
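For what it's worth, the full 32-byte comparison can already be cheap: on a big-endian encoding, a plain memcmp() gives big-integer order directly. An illustrative comparator (struct and field names are mine, not tor's):

  #include <stdint.h>
  #include <string.h>

  typedef struct pending_intro_t {
    uint8_t hash_int[32];  /* PoW hash, big-endian */
    uint64_t recv_order;   /* arrival counter, for tie-breaking */
  } pending_intro_t;

  /* Returns <0 if a has priority over b. */
  static int
  intro_pqueue_cmp(const pending_intro_t *a, const pending_intro_t *b)
  {
    /* Lexicographic compare of big-endian bytes == integer compare,
     * so no bignum arithmetic is needed. Smaller hash = more work. */
    int r = memcmp(a->hash_int, b->hash_int, sizeof(a->hash_int));
    if (r != 0)
      return r;
    /* Equal hashes: the older request wins. */
    return (a->recv_order < b->recv_order) ? -1 : 1;
  }

Subtracting or masking off the target would shrink the stored keys, but it doesn't change the comparison cost much, since memcmp usually exits at the first differing byte anyway.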
{TODO: PARAM_TUNING: If the priority queue is only ordered based on the effort what attacks can happen in various scenarios? Do we want to order on time+effort? Which scenarios and attackers should we examine here?}
{TODO: PARAM_TUNING: What's the max size of the queue? How do we trim it? Can we use WRED usefully?}
I think you'll be bound by the amount of data a connection inbuf can take, which has an upper bound of 32 cells per read event.
Then tor will have to empty the inbuf at once, queue all INTRODUCE2 cells (at most 32) in that priority queue and, once done, process it until we return to handling the connection inbuf.
In other words, with tor's architecture, the queue size is bound by the maximum number of cells you can get in a single recv() pass, which is 32 cells.
Nevertheless, that limit is weirdly hardcoded in tor, so you should definitely think of a way to upper-bound the queue and just drop the rest. A good starting point would be that 32-cell number?
dgoulet and I think the following will work better than always doing 32 cells at a time. Basically, the idea is to split our INTRO2 handling into "top-half" and "bottom-half" handlers.
Top-half handling:
  1) Read 32 cells off inbuf as usual
  2) Do AES relay cell decryption as usual
  3) Parse relay headers and handle all cells as usual, except:
     a) in hs_service_receive_introduce2(), add to pqueue and return
        without further processing of those
  4) Return to rest of mainloop
Then, separately, also in mainloop, do the bottom half. (TODO: What group priority?)
Bottom-half handling:
  I) Pop a single intro2 off of pqueue (ie: max difficulty in queue)
  II) Compare this difficulty to desc difficulty; if lower, lower the
      desc difficulty
  III) Parse it and launch RP circuit (as per current bottom half of
       hs_service_receive_introduce2())
  IV) Trim pqueue elements, if queue is "too big" (TODO: how to trim?)
  V) Compare trim point difficulty to descriptor difficulty; if the
     trim point was higher than the descriptor value, raise the desc
     difficulty
  VI) Return to libevent/mainloop again
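In code, the bottom half could look roughly like the following sketch. All helper names (hs_pqueue_*, desc_difficulty_*, launch_rp_circuit()) are hypothetical stand-ins, not existing tor symbols:

  #include <stddef.h>
  #include <stdint.h>

  typedef struct pending_intro_t pending_intro_t;
  typedef uint64_t effort_t;

  /* Hypothetical helpers, mapped to steps I-V above. */
  pending_intro_t *hs_pqueue_pop_best(void);        /* I */
  effort_t pending_intro_effort(const pending_intro_t *req);
  effort_t desc_difficulty_get(void);               /* II/V */
  void desc_difficulty_set(effort_t e);
  void launch_rp_circuit(pending_intro_t *req);     /* III */
  size_t hs_pqueue_len(void);
  effort_t hs_pqueue_trim(size_t max_len);          /* IV: returns the
                                                       trim-point effort */

  #define MAX_PQUEUE_LEN 1024 /* placeholder bound; see TODO above */

  static void
  intro2_bottom_half(void)
  {
    /* I) Pop the highest-effort request. */
    pending_intro_t *req = hs_pqueue_pop_best();
    if (!req)
      return;
    /* II) If its effort is below the advertised difficulty, lower
     * the descriptor difficulty toward it. */
    if (pending_intro_effort(req) < desc_difficulty_get())
      desc_difficulty_set(pending_intro_effort(req));
    /* III) Launch the rendezvous circuit for it. */
    launch_rp_circuit(req);
    /* IV+V) Trim the queue if too big; if we trimmed away requests
     * above the advertised difficulty, raise it so clients stop
     * wasting work on doomed introductions. */
    if (hs_pqueue_len() > MAX_PQUEUE_LEN) {
      effort_t trim_point = hs_pqueue_trim(MAX_PQUEUE_LEN);
      if (trim_point > desc_difficulty_get())
        desc_difficulty_set(trim_point);
    }
    /* VI) Return; the mainloop event re-fires while the queue is
     * non-empty. */
  }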
The astute reader will note that even without PoW, the above can provide almost exactly the same functionality as the naive rate limiting currently done at intropoints - just cut the queue arbitrarily. Basically, PoW and the pqueue just give us a smarter way to decide who to reply to.
However, it still has the following potential issues:
  A) AES will bottleneck us at ~100Mbit-300Mbit at step 2 of the
     top half above
  B) Extra mainloop() iterations for INTRO2s may be expensive (or not?)
For A, onionbalance will still help by adding new back-end instances providing intropoints via separate back-end Tor daemons, either on the same box or different boxes.
But it will only help up to a point. An HSDesc maxes out at 30k, so at some point we'll run out of space to list more intropoints in a single descriptor. At that point, we can still list different intropoints at each HSDir position, but after that, we are screwed.
We can alternatively avoid the AES bottleneck by moving this whole system to the intropoint tor relay. Basically, we split hs_intro_received_introduce1() into a top and bottom half in almost exactly the same way we split hs_service_receive_introduce2(), and we use the unencrypted extension field instead of the encrypted one.
This is a very straightforward v1.5 change, but it requires intropoints on the network to upgrade for it to work.
3.4.2.2. Handling introductions from the introduction queue [HANDLE_QUEUE]
The service should handle introductions by pulling from the introduction queue.
Similar to how our cell scheduler works, the onion service subsystem will poll the priority queue every 100ms tick and process the first 20 cells from the priority queue (if they exist). The service will perform the rendezvous and the rest of the onion service protocol as normal.
With this tempo, we can process 200 introduction cells per second.
As I described above, I think for simplicity we might want to do something like that at first, which is: "empty the inbuf by priority-queuing all INTRODUCE2 cells" and, once done, process them.
Thus, it won't be like the cell scheduler, which accumulates until a certain tick (10msec) and then processes it all.
{XXX: Is this good?}
{TODO: PARAM_TUNING: STRAWMAN: This needs hella tuning. Processing 20 cells per 100ms is probably unmaintainable, since each cell is quite expensive to process: doing so involves path selection, crypto, and building circuits. We will need to profile this procedure and see how we can do this scheduling better.}
With the above, we should be within the same performance envelope as we have right now, since we are just deferring the processing of INTRODUCE2 cells until after the inbuf is emptied.
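For reference, the proposal's tick-based variant is simple to express with a periodic libevent timer. The sketch below uses the real libevent-2 API, but pending_intro_t, hs_pqueue_pop_best(), and handle_intro2() are the same hypothetical names as above:

  #include <sys/time.h>
  #include <event2/event.h>

  #define CELLS_PER_TICK 20

  typedef struct pending_intro_t pending_intro_t;
  pending_intro_t *hs_pqueue_pop_best(void);
  void handle_intro2(pending_intro_t *req); /* rendezvous etc. */

  /* Fires every 100ms; drains at most 20 requests per tick,
   * i.e. at most 200 introductions per second. */
  static void
  pqueue_tick_cb(evutil_socket_t fd, short what, void *arg)
  {
    (void)fd; (void)what; (void)arg;
    for (int i = 0; i < CELLS_PER_TICK; i++) {
      pending_intro_t *req = hs_pqueue_pop_best();
      if (!req)
        break;
      handle_intro2(req);
    }
  }

  void
  schedule_pqueue_tick(struct event_base *base)
  {
    struct timeval tick = { 0, 100 * 1000 }; /* 100 ms */
    struct event *ev = event_new(base, -1, EV_PERSIST,
                                 pqueue_tick_cb, NULL);
    event_add(ev, &tick);
  }

Either way, the PARAM_TUNING question above still applies: the per-request work dominates, not the scheduling mechanics.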
{XXX: This might be a nice place to promote multithreading. Queues and pools are nice objects for multithreading, since you can have multiple threads pull from the queue, or leave stuff on the queue. Not sure if this should be in the proposal tho.}
I would _love_ to, but it could be too early for that, considering we are still unsure whether this defense will be useful or not (per a discussion with Mike on IRC).
As described above, multithreading still provides a multiplier in the AES bottleneck case, even over onionbalance.
But there may be more bottlenecks than just AES crypto, so this is a further argument for not jumping the gun just yet: try v1 first (or even a simple prototype without PoW that just cuts the queue arbitrarily) and get some more profiling data.
Next steps (not necessarily in order):
a) pqueue plan review + more detailed description
b) Figure out pqueue trim mechanism - can we do better than O(n)?
c) Test using randomx as a hash function in various ways, esp wrt key
   usage, cache setup, and VM infos
d) Test pqueue refactoring, maybe without pow involved yet
e) Specify a v1.5 that works at intropoint (to avoid AES bottleneck)
f) Merge this thread back into a single proposal document
g) Other stuff we forgot, XXX's, TODOs, etc.