Hi everyone,
I stumbled upon an issue when testing torsocks[1] with Firefox. I'm still wondering how it can be fixed, so I need more eyes on this :). The issue is that torsocks deadlocks inside the libc during its initialization phase.
Here it is. This new torsocks version hijacks the "syscall" symbol (syscall(2)) in order to intercept applications that decide to do network operations through that interface. To do that, the torsocks library constructor (executed before the application's main()) looks up the original symbol in the libc (dlopen(3)); that original symbol is then used for unhandled syscall values (for instance open(2)).
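To make the mechanism concrete, here is a minimal sketch of that kind of interposition. It is not the actual torsocks code: the names wrapper_init and libc_syscall are made up for illustration, and "libc.so.6" assumes glibc on Linux.

```c
#include <dlfcn.h>
#include <errno.h>
#include <stdarg.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Pointer to the real libc syscall(); NULL until the constructor has run.
 * (Hypothetical name, not taken from torsocks.) */
static long (*libc_syscall)(long number, ...);

/* Runs before main(). "libc.so.6" is an assumption (glibc on Linux). */
__attribute__((constructor))
static void wrapper_init(void)
{
    void *libc = dlopen("libc.so.6", RTLD_LAZY);

    if (libc != NULL)
        libc_syscall = (long (*)(long, ...))dlsym(libc, "syscall");
}

/* Interposed syscall(): a real wrapper would inspect `number` here and
 * handle the network-related values itself. */
long syscall(long number, ...)
{
    long a[6];
    va_list ap;

    /* A syscall takes at most six arguments; collect them all and forward
     * them, as variadic wrappers of this kind commonly do. */
    va_start(ap, number);
    for (int i = 0; i < 6; i++)
        a[i] = va_arg(ap, long);
    va_end(ap);

    if (libc_syscall == NULL) {
        /* The lookup failed or has not happened yet. */
        errno = ENOSYS;
        return -1;
    }

    /* Unhandled values (for instance open(2)) go straight to the libc. */
    return libc_syscall(number, a[0], a[1], a[2], a[3], a[4], a[5]);
}
```

Built as a shared object and preloaded (or simply linked into a test program, with -ldl on older glibc), a call such as syscall(SYS_getpid) goes through the wrapper and then on to the libc.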
Now, the issue was detected with Firefox, which uses a custom malloc hook, meaning that it handles its own memory allocation. This hook uses mmap(), which Firefox redefines to be a direct syscall(__NR_mmap, ...), and remember that this symbol is hijacked by torsocks.
The torsocks constructor calls dlsym() to get the original libc syscall symbol. This call takes a "loading lock" inside the libc:
dlfcn/dlsym.c +68: __rtld_lock_lock_recursive (GL(dl_load_lock));
Just after that, dlerror_run is called, which does a calloc(). That calloc() goes to the Firefox malloc hook, which calls syscall() for mmap, which torsocks hijacks. In torsocks, syscall() checks whether the original libc syscall pointer is NULL and, if it is, tries to look it up with dlsym(). And there you have the deadlock.
dlsym --> LOCK --> dlerror_run --> calloc --> syscall() --> dlsym() --> dlerror_run --> DEADLOCK.
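The chain above can be reproduced in a small single-threaded model. Everything here is a stand-in with a hypothetical name (model_dlsym, model_calloc, model_syscall, ...); nothing calls the real libc, and instead of hanging, the model just records that the lookup was re-entered. It only illustrates the control flow and is not a proposed fix.

```c
#include <stddef.h>

typedef long (*syscall_fn)(long number);

static syscall_fn model_libc_syscall; /* stays NULL while the lookup runs */
static int lookup_depth;              /* > 0 while "dlsym" holds the lock */
static int reentered;                 /* set when the lookup re-enters itself */

/* Stand-in for the real libc syscall(). */
static long model_real_syscall(long number) { return number; }

static syscall_fn model_dlsym(void);

/* Stand-in for the hijacked syscall(): lazy lookup if the pointer is NULL. */
static long model_syscall(long number)
{
    if (model_libc_syscall == NULL) {
        if (lookup_depth > 0) {
            /* In the real process, this is the second dlsym() call that
             * never returns. The model records it and bails out. */
            reentered = 1;
            return -1;
        }
        model_libc_syscall = model_dlsym();
    }
    return model_libc_syscall(number);
}

/* Stand-in for the Firefox malloc hook: memory comes from
 * syscall(__NR_mmap, ...). The value 9 is only a label here. */
static void model_calloc(void)
{
    (void)model_syscall(9);
}

/* Stand-in for dlsym(): take the lock, then allocate via dlerror_run. */
static syscall_fn model_dlsym(void)
{
    lookup_depth++;   /* LOCK */
    model_calloc();   /* dlerror_run --> calloc --> syscall() */
    lookup_depth--;   /* UNLOCK, never reached in the real deadlock */
    return model_real_syscall;
}

/* Stand-in for the torsocks constructor. */
static void model_constructor(void)
{
    model_libc_syscall = model_dlsym();
}
```

Calling model_constructor() leaves reentered set to 1: the syscall() stand-in was reached while the lookup was still in progress and the pointer was still NULL, which is exactly the situation described above.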
It's a bit of a catch-22 because torsocks is basically looking for the libc syscall symbol, but then it gets called inside that very lookup code path...
To be honest, I am not sure what the right fix is here, or whether there is any way to look up the symbol in a "special" way that would help. Any ideas or questions are VERY welcome :).
I hope this explanation is clear enough; this is a "not that trivial" issue.
Cheers! David