[tor-bugs] #12464 [Tor]: When Tor 0.2.6.x is closer to done, profile relays running Tor 0.2.6.x and optimize accordingly

Tue Feb 24 15:36:34 UTC 2015

#12464: When Tor 0.2.6.x is closer to done, profile relays running Tor 0.2.6.x and
optimize accordingly
------------------------+-------------------------------------------------
     Reporter:  nickm   |      Owner:  dgoulet
         Type:  defect  |     Status:  assigned
     Priority:  normal  |  Milestone:  Tor: 0.2.6.x-final
    Component:  Tor     |    Version:
   Resolution:          |   Keywords:  tor-relay performance 026-triaged-1
Actual Points:          |  Parent ID:
       Points:          |
------------------------+-------------------------------------------------

Comment (by yawning):

 Replying to [comment:8 tmpname0901]:
 > The reference to __memcpy_sse2_unaligned() above reminds me that data
 > should always be aligned for more efficient read/write.
 >
 > There are tools (Valgrind?) that can report this.  For x86(_64), buffers
 > should always be aligned to at least mod 16.

 Note: Discussing Intel and compatible processors here.

 To be pedantic:
 > **Assembly/Compiler Coding Rule 46. //(H impact, H generality)** Align
 > data on natural operand size address boundaries. If the data will be
 > accessed with vector instruction loads and stores, align the data on
 > 16-byte boundaries.//

 The performance hit for non-vector access comes when an access straddles a
 cache line boundary (64 bytes), unless the processor is an iPotato86 that
 should have been retired a long time ago (P1, PMMX, AMD <= K8).  Vector
 access needs to be 16-byte aligned, unless you are using AVX, where it
 doesn't matter (so, looking towards the bright future, this will matter
 less and less).
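
 To make the two conditions above concrete, here is a minimal sketch of
 how one could test a pointer for them.  The helper names and the
 constants are hypothetical (nothing like this is claimed to exist in the
 Tor tree), and the 64-byte line size is an assumption taken from the
 paragraph above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumed line size, per the text above */
#define VECTOR_ALIGN    16 /* alignment wanted for SSE loads/stores */

/* Hypothetical helper: nonzero if p is aligned to `align` bytes.
 * `align` must be a nonzero power of two. */
static int
is_aligned(const void *p, size_t align)
{
  assert(align != 0 && (align & (align - 1)) == 0);
  return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: nonzero if an access of `len` bytes starting at p
 * touches more than one cache line. */
static int
straddles_cache_line(const void *p, size_t len)
{
  uintptr_t first_line, last_line;
  if (len == 0)
    return 0;
  first_line = (uintptr_t)p / CACHE_LINE_SIZE;
  last_line = ((uintptr_t)p + len - 1) / CACHE_LINE_SIZE;
  return first_line != last_line;
}
```

 An 8-byte load at offset 56 of a line stays inside that line, while the
 same load at offset 60 spills into the next one; only the latter is the
 case that costs anything for non-vector access.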

 The memcpy call that I imagine dominates memcpy's runtime is the one in
 buffers.c:`write_to_buf()`, invoked via `connection_write_to_buf()` from
 the `RELAY_COMMAND_DATA` handler logic (a quick skim shows that every
 other call is infrequent, should be reasonably aligned, or copies too
 little data to be interesting).

 Here the destination will only be nicely aligned when a new chunk is
 allocated for the buffer (or if the data in the buffer happened to end
 on a 16-byte boundary), and the source will never be aligned correctly
 (`cell->payload + RELAY_HEADER_SIZE`).
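
 As a rough illustration of the destination side, the sketch below models
 repeated appends into a single chunk whose storage is assumed to start on
 a 16-byte boundary.  It is a simplification, not the buffers.c logic:
 the function names are hypothetical, chunk boundaries are ignored, and
 the payload length is whatever the caller passes in:

```c
#include <stddef.h>

#define VECTOR_ALIGN 16 /* alignment wanted for SSE loads/stores */

/* Hypothetical helper: misalignment, in bytes, of the next write position
 * in a chunk that starts on a 16-byte boundary and already holds
 * `datalen` bytes.  Zero means the destination is aligned. */
static size_t
dest_misalignment(size_t datalen)
{
  return datalen % VECTOR_ALIGN;
}

/* Hypothetical helper: out of `n` back-to-back appends of `payload_len`
 * bytes into a fresh chunk, count how many start at an aligned
 * destination. */
static size_t
count_aligned_appends(size_t payload_len, size_t n)
{
  size_t datalen = 0, aligned = 0, i;
  for (i = 0; i < n; i++) {
    if (dest_misalignment(datalen) == 0)
      aligned++;
    datalen += payload_len; /* the next append starts where this one ends */
  }
  return aligned;
}
```

 With an illustrative payload length of 498 bytes, only the first of
 eight consecutive appends lands on a 16-byte boundary; the destination
 drifts by 2 bytes per append and realigns on the ninth.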

 I'm currently of the opinion that, before messing with this, faster crypto
 will get us more mileage for our development time.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/12464#comment:9>