More than you necessarily wanted to know about tor_strndup performance improvements from late-2004 [was Re: tor callgrinds]

Sun Feb 18 13:54:00 UTC 2007

On Sat, Feb 17, 2007 at 04:04:04PM -0500, Nick Mathewson wrote:
> On Fri, Feb 16, 2007 at 10:12:20PM -0500, Watson Ladd wrote:
>  [...]
> > 2: Won't strlcpy run in time proportional to n? strncpy can't be much
> > faster as both functions need to loop from 0..n, load the character and
> > check if its null, and then copy it. The big difference is strncpy keeps
> > on going.
> 
> You're confused about how strlcpy works when n is much smaller than
> the length of the source string.  Look at the return value of strlcpy:
> it's the length of the original string.  To find the length of the
> original string, strlcpy basically has to walk over the whole thing.
> So while the running time of strncpy(dst,src,siz) is O(siz),
> strlcpy(dst,src,siz) is O(strlen(src)).  That's a big problem for
> tor_strndup: its most performance-relevant use case is copying a small
> portion of a much longer string, as when we're parsing router
> descriptors.  In this use case, strncpy is obviously better.

Technically though shouldn't strlcpy() not have to do this? It already
knows thelength, as it is provided to it. The only anecdote to this is
that it will supercede an 'actual' length if it detects a '\0' along the
copy. This isn't typically bad as it's already walking the string
anyways.

e.g. OpenBSD copy (same as Tor's):
/*
 * Copy src to string dst of size siz.  At most siz-1 characters
 * will be copied.  Always NUL terminates (unless siz == 0).
 * Returns strlen(src); if retval >= siz, truncation occurred.
 */
size_t strlcpy(char *dst, const char *src, size_t siz)
{
	register char *d = dst;
	register const char *s = src;
	register size_t n = siz;

	if (n == 0)
		return(strlen(s));
	while (*s != '\0') {
		if (n != 1) {
			*d++ = *s;
			n--;
		}
		s++;
	}
	*d = '\0';

	return(s - src);	/* count does not include NUL */
}

Provided the string 'onion\0routing\0', with n being 13, only 6 loop
cycles will actually happen.

> This isn't one of those hypothetical armchair optimizations, by the
> way: we profiled Tor before and after, and found out that the time
> spent copying strings went way down after we made the change.  (I'm
> afraid I don't still have the numbers; this change came in the
> 0.0.9pre6 develoment, was back in November of 2004.)

Unfortunately this may be due to the fact that strlcpy() vs strncpy()s
implementation may have been one of C vs asm. :-/

I've always been on the fence about the whole strlcpy vs strncpy vs
snprintf vs sprintf("%.*s", len, str) issue. For one, I tend to not
use many str* functions in general. Secondarily, the "bonus" of strncpy
zero-filling the remaining end of the dst buffer as some kind of security
benefit that others have brought up in the past, I find to be completely
baseless as it's depending on defensive programming and the possibility of
a buffer overflow happening in the first place.

Also, when it comes to strlcpy, you're actually better off using:

/* provided we've allocated dst as s_len + 1 bytes */

l = MIN(d_len, s_len);
memcpy(dst, src, l); dst[l] = '\0';

The rationale here being that you've handled the length issues, you already
know the length (typically) of the source ahead of time (how else would you
pre-allocate for dst? if it's an auto or static buffer, then you know the
length as well).

In most asm implementations of str* functions the maximum length they will
operate on, chunk-wise, is a 4-byte integer. The reason being that they
will do the old 'is there a null in 4 bytes' trick, check it's alignment,
and if everything looks good, copy off the equivalent space of an int in
one call. But the bonus of using memcpy() is that memcpy() knows the entire
length ahead of time - and if it can make use of, provided the asm backend
code is there, it will use higher order optimizations such as mmx and/or
sse ops to copy much more than 4 bytes at once. It also doesn't have to
deal with even checking for nul's at all. It's entire goal is to copy data
as fast as possible, especially since it's use is widespread inside the
kernel and networking code.

This is from Solaris. It does not do any mmx or amd-specific tricks to
make things even faster, but it is relatively optimized and nothing out
of the ordinary.

memcpy.s:
     40 	ENTRY(memcpy)
     41 	movl	%edi,%edx	/ save register variables
     42 	pushl	%esi
     43 	movl	8(%esp),%edi	/ %edi = dest address
     44 	movl	12(%esp),%esi	/ %esi = source address
     45 	movl	16(%esp),%ecx	/ %ecx = length of string
     46 	movl	%edi,%eax	/ return value from the call
     47 
     48 	shrl	$2,%ecx		/ %ecx = number of words to move
     49 	rep ; smovl		/ move the words
     50 
     51 	movl	16(%esp),%ecx	/ %ecx = number of bytes to move
     52 	andl	$0x3,%ecx	/ %ecx = number of bytes left to move
     53 	rep ; smovb		/ move the bytes
     54 
     55 	popl	%esi		/ restore register variables
     56 	movl	%edx,%edi
     57 	ret
     58 	SET_SIZE(memcpy)

strncpy.s:
     65 	ENTRY(strncpy)
     66 	pushl	%edi			/ save register variables
     67 	pushl	%esi
     68 
     69 	movl	16(%esp), %eax		/ %eax = source string address
     70 	movl	12(%esp), %edi		/ %edi = destination string address
     71 	movl	20(%esp), %esi		/ %esi = number of bytes
     72 
     73 	testl	$3, %eax		/ if %eax not word aligned
     74 	jnz	.L1			/ goto .L1
     75 .L8:
     76 	cmpl	$4, %esi		/ if number of bytes < 4
     77 	jb	.L4			/ goto .L4
     78 	.align	4
     79 .L2:
     80 	movl	(%eax), %edx		/ move 1 word from (%eax) to %edx
     81 	movl	$0x7f7f7f7f, %ecx
     82 	andl	%edx, %ecx		/ %ecx = %edx & 0x7f7f7f7f
     83 	addl	$4, %eax		/ next word
     84 	addl	$0x7f7f7f7f, %ecx	/ %ecx += 0x7f7f7f7f
     85 	orl	%edx, %ecx		/ %ecx |= %edx
     86 	andl	$0x80808080, %ecx	/ %ecx &= 0x80808080
     87 	cmpl	$0x80808080, %ecx	/ if null byte in this word
     88 	jne	.L3			/ goto .L3
     89 	movl	%edx, (%edi)		/ copy this word to (%edi)
     90 	subl	$4, %esi		/ decrement number of bytes by 4
     91 	addl	$4, %edi		/ next word
     92 	cmpl	$4, %esi		/ if number of bytes >= 4
     93 	jae	.L2			/ goto .L2
     94 	jmp	.L4			/ goto .L4
     95 .L3:
     96 	subl	$4, %eax		/ post-incremented
     97 	.align	4
     98 .L4:
     99 	/ (number of bytes < 4) or (a null byte found in the word)
    100 	cmpl	$0, %esi		/ if number of bytes == 0
    101 	je	.L7			/ goto .L7 (finished)
    102 	movb	(%eax), %dl		/ %dl = a byte in (%eax)
    103 	decl	%esi			/ decrement number of bytes by 1
    104 	movb	%dl, (%edi)		/ copy %dl to (%edi)
    105 	incl	%eax			/ next byte
    106 	incl	%edi			/ next byte
    107 	cmpb	$0, %dl			/ compare %dl with a null byte
    108 	je	.L5			/ if %dl is a null, goto .L5
    109 	jmp	.L4			/ goto .L4
    110 	.align	4
    111 .L1:
    112 	/ %eax not aligned
    113 	cmpl	$0, %esi		/ if number of bytes == 0
    114 	je	.L7			/ goto .L7 (finished)
    115 	movb	(%eax), %dl		/ %dl = a byte in (%eax)
    116 	decl	%esi			/ decrement number of bytes by 1
    117 	movb	%dl, (%edi)		/ copy %dl to (%edi)
    118 	incl	%edi			/ next byte
    119 	incl	%eax			/ next byte
    120 	cmpb	$0, %dl			/ compare %dl with a null byte
    121 	je	.L5			/ if %dl is a null, goto .L5
    122 	testl	$3, %eax		/ if %eax word aligned
    123 	jz	.L8			/ goto .L8
    124 	jmp	.L1			/ goto .L1 (not word aligned)
    125 	.align	4
    126 .L5:
    127 	movl	%esi, %ecx		/ %ecx = length to copy null bytes
    128 	xorl	%eax, %eax		/ clear %eax
    129 	shrl	$2, %ecx		/ %ecx = words to copy null bytes
    130 	rep ; sstol			/ rep;sstol is optimal
    131 	andl	$3, %esi		/ %esi = leftover bytes
    132 .L6:
    133 	cmpl	$0, %esi		/ if number of bytes == 0
    134 	jz	.L7			/ goto .L7 (finished)
    135 	movb	$0, (%edi)		/ move a null byte to (%edi)
    136 	decl	%esi			/ decrement number of bytes by 1
    137 	incl	%edi			/ next byte
    138 	jmp	.L6			/ goto .L6
    139 	.align	4
    140 .L7:
    141 	movl	12(%esp), %eax		/ return the destination address
    142 	popl	%esi			/ restore register variables
    143 	popl	%edi
    144 	ret
    145 	SET_SIZE(strncpy)

Both functions are in optimized assembly - but you don't even have be to be
very knowledgable about asm (which I am not that much) to know that memcpy()
will be faster than strncpy(), period.

So consider this simple little test program on Linux 2.6.20 w/ gcc-4.1.1:

$ cat strn.c
#include <string.h>
#include <stdlib.h>

static char buf[] = "an example string to use for copying bytes";
/* word offsets:     0   1   2   3   4   5   6   7   8   9   0  */

static size_t strlcpy(char *dst, const char *src, size_t siz)
{
        register char *d = dst;
        register const char *s = src;
        register size_t n = siz;

        if (n == 0)
                return(strlen(s));
        while (*s != '\0') {
                if (n != 1) {
                        *d++ = *s;
                        n--;
                }
                s++;
        }
        *d = '\0';

        return(s - src);        /* count does not include NUL */
}

int main(int argc, char **argv)
{
        char *q;
        int i, l = sizeof buf - 1;

        if (argc <= 1)
                return EXIT_FAILURE;
        if ((q = malloc(l * sizeof *q + 1)) == NULL)
                return EXIT_FAILURE;

        switch (*argv[1]) {
        case 'm':
                for (i = 16777216; i--; ) {
                        /* one doesn't typically copy nul */
                        memcpy(q, buf, l); q[l] = '\0';
		}
                break;
        case 'l':
                for (i = 16777216; i--; )
                        strlcpy(q, buf, l);
                break;
        case 'n':
                for (i = 16777216; i--; )
                        strncpy(q, buf, l);
                break;
        case 's':
                for (i = 16777216; i--; )
                        strcpy(q, buf);
                break;
        default: break;
        }

        free(q);

        return 0;
}

$ gcc -static -O3 -W -Wall -pedantic -o strn strn.c
$ for i in l m s n; do TIMEFORMAT="$i r: %3lR u: %3lU s: %3lS"; for n in 0 1 2 3 4 5 6 7; do time ./strn $i; done; echo; done
l r: 0m2.593s u: 0m2.380s s: 0m0.004s
l r: 0m2.491s u: 0m2.436s s: 0m0.004s
l r: 0m2.494s u: 0m2.408s s: 0m0.000s
l r: 0m2.425s u: 0m2.364s s: 0m0.012s
l r: 0m2.494s u: 0m2.408s s: 0m0.008s
l r: 0m2.432s u: 0m2.380s s: 0m0.000s
l r: 0m2.530s u: 0m2.420s s: 0m0.000s
l r: 0m2.440s u: 0m2.388s s: 0m0.000s

m r: 0m0.546s u: 0m0.508s s: 0m0.008s
m r: 0m0.526s u: 0m0.508s s: 0m0.000s
m r: 0m0.545s u: 0m0.512s s: 0m0.000s
m r: 0m0.545s u: 0m0.512s s: 0m0.004s
m r: 0m0.545s u: 0m0.508s s: 0m0.000s
m r: 0m0.534s u: 0m0.508s s: 0m0.000s
m r: 0m0.547s u: 0m0.508s s: 0m0.004s
m r: 0m0.526s u: 0m0.508s s: 0m0.004s

s r: 0m1.738s u: 0m1.688s s: 0m0.000s
s r: 0m1.760s u: 0m1.692s s: 0m0.004s
s r: 0m1.813s u: 0m1.712s s: 0m0.004s
s r: 0m1.736s u: 0m1.700s s: 0m0.004s
s r: 0m1.764s u: 0m1.700s s: 0m0.004s
s r: 0m1.757s u: 0m1.696s s: 0m0.004s
s r: 0m1.745s u: 0m1.688s s: 0m0.008s
s r: 0m1.749s u: 0m1.692s s: 0m0.000s

n r: 0m1.913s u: 0m1.860s s: 0m0.004s
n r: 0m1.945s u: 0m1.872s s: 0m0.000s
n r: 0m1.920s u: 0m1.868s s: 0m0.000s
n r: 0m1.951s u: 0m1.876s s: 0m0.004s
n r: 0m1.973s u: 0m1.868s s: 0m0.036s
n r: 0m1.918s u: 0m1.864s s: 0m0.000s
n r: 0m1.922s u: 0m1.864s s: 0m0.004s
n r: 0m1.935s u: 0m1.864s s: 0m0.004s

Regardless of the differences between str, strn, and strl, memcpy just blows
them out of the water. I even included the terminating nul for realistic use.

-cl