Trimming to stuff that I just want to reply to; I otherwise agree.
Note: in a couple places I replied directly to asn's OP, because I noticed some more questions that I could answer.
On 4/2/20 2:30 PM, David Goulet wrote:
On 02 Apr (18:54:59), George Kadianakis wrote:
2.2.2. Dynamic PoW
DoS is a dynamic problem where the attacker's capabilities constantly change, and hence we want our proof-of-work system to be dynamic and not stuck with a static difficulty setting. So instead of forcing clients to go below a static target like in Bitcoin to be successful, we ask clients to "bid" using their PoW effort. Effectively, a client gets higher priority the more effort they put into their proof-of-work. This is similar to how proof-of-stake works, but instead of staking coins, you stake work.
So this means that desktop users will be prioritized over mobile users basically unless I make my phone use X% of battery?
Yes. We should be clear that this is not meant to be on all the time, and that yes, it is likely to sacrifice access by mobile users, depending on the current attack volume/difficulty level.
The Tor Browser Android UI could inform users that the service is under attack and direct them to try on their desktop instead.
If a client receives a descriptor with "pow-params", it should assume that the service is expecting a PoW input as part of the introduction protocol.
What happens with clients _without_ PoW support? They basically won't be able to connect, I suppose? Or be put in the prio queue at the service at the very end, with work done = 0?
Queuing them with work_done=0 will be better. For an attacker, fake work with an actual work_done=0 is just as easy to create as omitting the work entirely.
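To make the "bid" concrete, here is a minimal client-side sketch of the nonce grind. This is not from the proposal: the argon2 variant, the cost parameters, and the nonce construction are all placeholder assumptions, using the phc-winner-argon2 reference API.

  #include <stdint.h>
  #include <string.h>
  #include <argon2.h>

  #define HASH_LEN 32
  #define NONCE_LEN 16

  /* Grind nonces until argon2(seed, nonce) falls at or below the
   * target the client chose. A lower target means more expected
   * attempts, i.e. a higher "bid" and a better queue position. */
  static void
  pow_bid(const uint8_t seed[32], const uint8_t target[HASH_LEN],
          uint8_t nonce[NONCE_LEN], uint8_t hash[HASH_LEN])
  {
    uint64_t counter = 0;
    do {
      memset(nonce, 0, NONCE_LEN);
      memcpy(nonce, &counter, sizeof(counter)); /* counter nonce */
      counter++;
      /* argon2i with (t_cost=1, m_cost=1 MiB, lanes=1) is a stand-in;
       * the proposal will pin the real variant and parameters. */
      argon2i_hash_raw(1, 1 << 10, 1, nonce, NONCE_LEN,
                       seed, 32, hash, HASH_LEN);
    } while (memcmp(hash, target, HASH_LEN) > 0); /* big-endian cmp */
  }

A client that wants higher priority simply picks a lower target and loops longer; a client without PoW support (or on battery) effectively skips the loop and accepts the back of the queue.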
3.4.1. PoW verification
To verify the client's proof-of-work the service extracts (hash_output, seed, nonce) from the INTRODUCE1 cell and MUST do the following steps:
- Make sure that the client's seed is identical to the active seed.
- Check the client's nonce for replays (see [REPLAY_PROTECTION] section).
- Verify that hash_output =?= argon2(seed, nonce)
So wait, the service also has to do the PoW for each client by computing the Argon2 hash for each cell? Or am I misunderstanding?
Yes, but the service side only has to run the hash once, which should be fast.
The client/attacker is hashing many times, to search for a nonce that will satisfy the target hash value comparison.
But oops! We forgot to list that directly above, which might have caused the confusion:
0) Check that hash_output <= target_level
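Put together, the service-side check is a single hash plus a few comparisons. Here is a hedged sketch, again using the phc-winner-argon2 reference API; seen_nonce() and the cost parameters are hypothetical placeholders, not existing tor code:

  #include <stdint.h>
  #include <string.h>
  #include <argon2.h>

  #define HASH_LEN 32
  #define SEED_LEN 32
  #define NONCE_LEN 16

  /* Hypothetical replay-cache lookup; tor would back this with a
   * bloom filter or similar (see [REPLAY_PROTECTION]). */
  int seen_nonce(const uint8_t *nonce);

  int
  pow_verify(const uint8_t *active_seed, const uint8_t *target_level,
             const uint8_t *hash_output, const uint8_t *seed,
             const uint8_t *nonce)
  {
    uint8_t computed[HASH_LEN];

    /* 0) The hash must meet the target (big-endian byte compare). */
    if (memcmp(hash_output, target_level, HASH_LEN) > 0)
      return -1;
    /* 1) The seed must be the currently active one. */
    if (memcmp(seed, active_seed, SEED_LEN) != 0)
      return -1;
    /* 2) The nonce must not have been used before. */
    if (seen_nonce(nonce))
      return -1;
    /* 3) Recompute the hash once and compare; the cost parameters
     * here are placeholders, as in the client sketch above. */
    if (argon2i_hash_raw(1, 1 << 10, 1, nonce, NONCE_LEN,
                         seed, SEED_LEN, computed, HASH_LEN) != ARGON2_OK)
      return -1;
    return memcmp(computed, hash_output, HASH_LEN) == 0 ? 0 : -1;
  }

Note that steps 0-2 reject most garbage before the single hash computation at step 3 ever runs.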
3.4.2. The Introduction Queue [INTRO_QUEUE]
3.4.2.1. Adding introductions to the introduction queue
When PoW is enabled and a verified introduction comes through, the service, instead of jumping straight into rendezvous, queues it and prioritizes it based on how much effort the client devoted to its PoW. This means that introduction requests with high effort should be prioritized over those with low effort.
To do so, the service maintains an "introduction priority queue" data structure. Each element in that priority queue is an introduction request, and its priority is the effort put into its PoW:
When a verified introduction comes through, the service interprets the PoW hash as a 32-byte big-endian integer 'hash_int' and, based on that integer, inserts it into the right position of the priority_queue: the smallest 'hash_int' goes to the front of the queue. If two elements have the same value, the older one has priority over the newer one. {XXX: Is this operation with 32-byte integers expensive? How to make it cheaper?}
One option: either subtract off or AND off the target, so we're comparing difficulty relative to the target - i.e., a much smaller integer space.
In the blockchain world, the difficulty target is typically expressed as a bitmask anyway, probably for reasons like this.
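For what it's worth, the full 32-byte comparison can already be cheap: on a big-endian encoding, a plain memcmp() gives big-integer order directly. An illustrative comparator (struct and field names are mine, not tor's):

  #include <stdint.h>
  #include <string.h>

  typedef struct pending_intro_t {
    uint8_t hash_int[32];  /* PoW hash, big-endian */
    uint64_t recv_order;   /* arrival counter, for tie-breaking */
  } pending_intro_t;

  /* Returns <0 if a has priority over b. */
  static int
  intro_pqueue_cmp(const pending_intro_t *a, const pending_intro_t *b)
  {
    /* Lexicographic compare of big-endian bytes == integer compare,
     * so no bignum arithmetic is needed. Smaller hash = more work. */
    int r = memcmp(a->hash_int, b->hash_int, sizeof(a->hash_int));
    if (r != 0)
      return r;
    /* Equal hashes: the older request wins. */
    return (a->recv_order < b->recv_order) ? -1 : 1;
  }

Subtracting or masking off the target would shrink the stored keys, but it doesn't change the comparison cost much, since memcmp usually exits at the first differing byte anyway.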
{TODO: PARAM_TUNING: If the priority queue is only ordered based on the effort what attacks can happen in various scenarios? Do we want to order on time+effort? Which scenarios and attackers should we examine here?}
{TODO: PARAM_TUNING: What's the max size of the queue? How do we trim it? Can we use WRED usefully?}
I think you'll be bound by the amount of data a connection inbuf can take, which has an upper bound of 32 cells per read event.
Then tor will have to empty the inbuf at once, queue all INTRODUCE2 cells (at most 32) in that priority queue and, once done, process it until we return to handling the connection inbuf.
In other words, with tor's architecture, the queue size is bound by the maximum number of cells you can get in a single recv() pass, which is 32 cells.
Nevertheless, that limit is weirdly hardcoded in tor, so you should definitely think of a way to upper-bound the queue and just drop the rest. A good starting point would be that 32-cell number?
dgoulet and I think the following will work better than always doing 32 cells at a time. Basically, the idea is to split our INTRO2 handling into "top-half" and "bottom-half" handlers.
Top-half handling:
  1) Read 32 cells off inbuf as usual
  2) Do AES relay cell decryption as usual
  3) Parse relay headers and handle all cells as usual, except:
     a) in hs_service_receive_introduce2(), add to pqueue and return
        without further processing of those
  4) Return to rest of mainloop
Then, separately, also in mainloop, do the bottom half. (TODO: What group priority?)
Bottom-half handling:
  I) Pop a single intro2 off of pqueue (ie: max difficulty in queue)
  II) Compare this difficulty to desc difficulty; if lower, lower the
      desc difficulty
  III) Parse it and launch RP circuit (as per current bottom half of
       hs_service_receive_introduce2())
  IV) Trim pqueue elements, if queue is "too big" (TODO: how to trim?)
  V) Compare trim point difficulty to descriptor difficulty; if the
     trim point was higher than the descriptor value, raise the desc
     difficulty
  VI) Return to libevent/mainloop again
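In code, the bottom half could look roughly like the following sketch. All helper names (hs_pqueue_*, desc_difficulty_*, launch_rp_circuit()) are hypothetical stand-ins, not existing tor symbols:

  #include <stddef.h>
  #include <stdint.h>

  typedef struct pending_intro_t pending_intro_t;
  typedef uint64_t effort_t;

  /* Hypothetical helpers, mapped to steps I-V above. */
  pending_intro_t *hs_pqueue_pop_best(void);        /* I */
  effort_t pending_intro_effort(const pending_intro_t *req);
  effort_t desc_difficulty_get(void);               /* II/V */
  void desc_difficulty_set(effort_t e);
  void launch_rp_circuit(pending_intro_t *req);     /* III */
  size_t hs_pqueue_len(void);
  effort_t hs_pqueue_trim(size_t max_len);          /* IV: returns the
                                                       trim-point effort */

  #define MAX_PQUEUE_LEN 1024 /* placeholder bound; see TODO above */

  static void
  intro2_bottom_half(void)
  {
    /* I) Pop the highest-effort request. */
    pending_intro_t *req = hs_pqueue_pop_best();
    if (!req)
      return;
    /* II) If its effort is below the advertised difficulty, lower
     * the descriptor difficulty toward it. */
    if (pending_intro_effort(req) < desc_difficulty_get())
      desc_difficulty_set(pending_intro_effort(req));
    /* III) Launch the rendezvous circuit for it. */
    launch_rp_circuit(req);
    /* IV+V) Trim the queue if too big; if we trimmed away requests
     * above the advertised difficulty, raise it so clients stop
     * wasting work on doomed introductions. */
    if (hs_pqueue_len() > MAX_PQUEUE_LEN) {
      effort_t trim_point = hs_pqueue_trim(MAX_PQUEUE_LEN);
      if (trim_point > desc_difficulty_get())
        desc_difficulty_set(trim_point);
    }
    /* VI) Return; the mainloop event re-fires while the queue is
     * non-empty. */
  }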
The astute reader will note that even without PoW, the above can provide almost exactly the same functionality as the naive rate limiting currently done at intropoints - just cut the queue arbitrarily. Basically, PoW and the pqueue just give us a smarter way to decide who to reply to.
However, it still has the following potential issues:
  A) AES will bottleneck us at ~100Mbit-300Mbit at step 2 of the
     top half above
  B) Extra mainloop() iterations for INTRO2s may be expensive (or not?)
For A, onionbalance will still help by adding new back-end instances providing intropoints via separate back-end Tor daemons, either on the same box or different boxes.
But it will only help up to a point. An HSDesc maxes out at 30k, so at some point we'll run out of space to list more intropoints in a single descriptor. At that point, we can still list different intropoints at each HSDir position, but after that, we are screwed.
We can alternatively avoid the AES bottleneck by moving this whole system to the intropoint tor relay. Basically, we split hs_intro_received_introduce1() into a top and bottom half in almost exactly the same way we split hs_service_receive_introduce2(), and we use the unencrypted extension field instead of the encrypted one.
This is a very straightforward v1.5 change, but it requires intropoints on the network to upgrade for it to work.
3.4.2.2. Handling introductions from the introduction queue [HANDLE_QUEUE]
The service should handle introductions by pulling from the introduction queue.
Similar to how our cell scheduler works, the onion service subsystem will poll the priority queue every 100ms tick and process the first 20 cells from the priority queue (if they exist). The service will perform the rendezvous and the rest of the onion service protocol as normal.
With this tempo, we can process 200 introduction cells per second.
As I described above, I think for simplicity we might want to do something like that at first, which is: "empty the inbuf by priority-queuing all INTRODUCE2 cells" and, once done, process them.
Thus, it won't be like the cell scheduler, which accumulates until a certain tick (10msec) and then processes it all.
{XXX: Is this good?}
{TODO: PARAM_TUNING: STRAWMAN: This needs hella tuning. Processing 20 cells per 100ms is probably unmaintainable, since each cell is quite expensive to process: doing so involves path selection, crypto, and building circuits. We will need to profile this procedure and see how we can do this scheduling better.}
With the above, we should be within the same performance envelope as we have right now, since we are just deferring the processing of INTRODUCE2 cells until after the inbuf is emptied.
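For reference, the proposal's tick-based variant is simple to express with a periodic libevent timer. The sketch below uses the real libevent-2 API, but pending_intro_t, hs_pqueue_pop_best(), and handle_intro2() are the same hypothetical names as above:

  #include <sys/time.h>
  #include <event2/event.h>

  #define CELLS_PER_TICK 20

  typedef struct pending_intro_t pending_intro_t;
  pending_intro_t *hs_pqueue_pop_best(void);
  void handle_intro2(pending_intro_t *req); /* rendezvous etc. */

  /* Fires every 100ms; drains at most 20 requests per tick,
   * i.e. at most 200 introductions per second. */
  static void
  pqueue_tick_cb(evutil_socket_t fd, short what, void *arg)
  {
    (void)fd; (void)what; (void)arg;
    for (int i = 0; i < CELLS_PER_TICK; i++) {
      pending_intro_t *req = hs_pqueue_pop_best();
      if (!req)
        break;
      handle_intro2(req);
    }
  }

  void
  schedule_pqueue_tick(struct event_base *base)
  {
    struct timeval tick = { 0, 100 * 1000 }; /* 100 ms */
    struct event *ev = event_new(base, -1, EV_PERSIST,
                                 pqueue_tick_cb, NULL);
    event_add(ev, &tick);
  }

Either way, the PARAM_TUNING question above still applies: the per-request work dominates, not the scheduling mechanics.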
{XXX: This might be a nice place to promote multithreading. Queues and pools are nice objects for multithreading, since you can have multiple threads pull from the queue, or leave stuff on the queue. Not sure if this should be in the proposal tho.}
I would _love_ to, but it could be too early for that, considering we are still unsure whether this defense will be useful or not (per a discussion with Mike on IRC).
As described above, multithreading still provides a multiplier in the AES bottleneck case, even over onionbalance.
But there may be more bottlenecks than just AES crypto, so this is a further argument for not jumping the gun just yet: try v1 first (or even a simple prototype without PoW that just cuts the queue arbitrarily) and get some more profiling data.
Next steps (not necessarily in order):
a) pqueue plan review + more detailed description
b) Figure out pqueue trim mechanism - can we do better than O(n)?
c) Test using randomx as a hash function in various ways, esp wrt key
   usage, cache setup, and VM infos
d) Test pqueue refactoring, maybe without pow involved yet
e) Specify a v1.5 that works at intropoint (to avoid AES bottleneck)
f) Merge this thread back into a single proposal document
g) Other stuff we forgot, XXX's, TODOs, etc.