deadlock in __lll_lock_wait() @ /lib64/libpthread.so.0

Discussion:

Paweł Sikora

2012-11-15 17:58:37 UTC

Hi,

i'm playing with some EDA simulator which loads dynamically (via dlopen) my plugin.
during plugin initialization (global ctors) it deadlocks on the __lll_lock_wait.
i'm observing this issue on RHEL-5/CentOS-5 with glibc-2.5-58.el5_6.4.
is it a known bug on the 2.5 branch?

btw, i can workaround this issue with -Wl,-z,now linking flag to avoid lazy
symbol binding but i'd like to avoid this way if possible.

BR,
Paweł.

Thread 4 (Thread 0x43f06940 (LWP 27264)):
#0 0x0000003e6880d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003e68808e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2 0x0000003e68808cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00002aaab44ca9e8 in boost::mutex::lock () from /remote/hal/home/pawels/DVM.trunk/bin64/libHmdbApi.so
#4 0x00002aaab44caecb in boost::unique_lock<boost::mutex>::lock () from /remote/hal/home/pawels/DVM.trunk/bin64/libHmdbApi.so
#5 0x00002aaab44ddf73 in hmdb::HmdbPadlock::isLocked () from /remote/hal/home/pawels/DVM.trunk/bin64/libHmdbApi.so
#6 0x00002aaab44f4806 in hmdb::HmdbOperations::check_lock () from /remote/hal/home/pawels/DVM.trunk/bin64/libHmdbApi.so
#7 0x00002aaab44f4f8b in hmdb::HmdbOperations::findHesBoardwithHmdb () from /remote/hal/home/pawels/DVM.trunk/bin64/libHmdbApi.so
(...)
#25 0x00002aaab3a7cc68 in DpiApiInitialize () from /remote/hal/home/pawels/DVM.trunk/bin64/libScemiDpiBridgeApi.so
#26 0x00002aaab7f152bc in ?? ()
#27 0x0000000043efff00 in ?? ()
#28 0x00002aaab3a1ac4c in dc::Executor::initialize () from /remote/hal/home/pawels/DVM.trunk/bin64/libScemiDpiControllerRiviera.so
#29 0x00002aaab3a3f782 in load () from /remote/hal/home/pawels/DVM.trunk/bin64/libScemiDpiControllerRiviera.so
#30 0x00002aaab3a4d776 in __do_global_ctors_aux () from /remote/hal/home/pawels/DVM.trunk/bin64/libScemiDpiControllerRiviera.so
#31 0x00002aaab3a1955b in _init () from /remote/hal/home/pawels/DVM.trunk/bin64/libScemiDpiControllerRiviera.so
#32 0x00002aaab3eef8f8 in typeinfo for boost::detail::sp_counted_impl_pd<void const*, boost::archive::detail::shared_ptr_helper::null_deleter> () from /remote/hal/home/pawels/DVM.trunk/bin64/libboost_serialization.so.1.51.0
#33 0x0000003e6780d3fb in call_init () from /lib64/ld-linux-x86-64.so.2
#34 0x0000003e6780d505 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#35 0x0000003e67810ffe in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#36 0x0000003e6780d086 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#37 0x0000003e678107dc in _dl_open () from /lib64/ld-linux-x86-64.so.2
#38 0x0000003e68000f9a in dlopen_doit () from /lib64/libdl.so.2
#39 0x0000003e6780d086 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#40 0x0000003e6800150d in _dlerror_run () from /lib64/libdl.so.2
#41 0x0000003e68000f11 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
(...)
#53 0x000000000067c444 in ?? ()
#54 0x0000003e6880673d in start_thread () from /lib64/libpthread.so.0
#55 0x0000003e67cd44bd in clone () from /lib64/libc.so.6
#56 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x45308940 (LWP 27277)):
#0 0x0000003e6880d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003e68808e35 in _L_lock_1127 () from /lib64/libpthread.so.0
#2 0x0000003e68808d33 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003e67809dcb in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#4 0x0000003e6780cf05 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#5 0x0000003e67812982 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
#6 0x00002aaab4509af2 in au::FileLock::tryLock () from /remote/hal/home/pawels/DVM.trunk/bin64/libHmdbApi.so
#7 0x00002aaab44da483 in hmdb::HmdbLockfile::operator() () from /remote/hal/home/pawels/DVM.trunk/bin64/libHmdbApi.so
#8 0x00002abcba45a8ba in ?? () from /remote/dragon/eda/riviera-pro-2012.10.rtm.64/bin/Linux64/libboost_thread.so.1.48.0
#9 0x0000003e6880673d in start_thread () from /lib64/libpthread.so.0
#10 0x0000003e67cd44bd in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
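
Read together, threads 4 and 2 above are consistent with a lock-order inversion: thread 4 is inside dlopen(), which holds the loader's lock while it runs the plugin's global constructors, and waits for a boost::mutex; thread 2 is resolving a lazily bound symbol and waits in _dl_lookup_symbol_x() for a loader lock. A minimal C sketch of that pattern follows, assuming (the traces do not prove it) that thread 2 already owns the mutex thread 4 wants. The names loader_lock, app_lock and count_inverted_timeouts are hypothetical stand-ins, and pthread_mutex_timedlock() is used only so that the demonstration terminates instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-ins: loader_lock plays the dynamic loader's lock,
   app_lock plays the application's boost::mutex. */
static pthread_mutex_t loader_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t both_hold;   /* both threads own their first lock */
static pthread_barrier_t both_tried;  /* both threads finished their attempt */

struct order {
    pthread_mutex_t *first;
    pthread_mutex_t *second;
    int timed_out;
};

static void *take_in_order(void *arg)
{
    struct order *o = arg;
    struct timespec deadline;
    int rc;

    pthread_mutex_lock(o->first);
    pthread_barrier_wait(&both_hold);

    /* A plain pthread_mutex_lock() here would block forever, parked in
       the lowest-level wait routine; the 200 ms deadline is only there
       to let the sketch finish. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += 200L * 1000 * 1000;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    rc = pthread_mutex_timedlock(o->second, &deadline);
    o->timed_out = (rc == ETIMEDOUT);

    /* Keep the first lock until the other thread has also given up. */
    pthread_barrier_wait(&both_tried);
    if (rc == 0)
        pthread_mutex_unlock(o->second);
    pthread_mutex_unlock(o->first);
    return NULL;
}

/* Returns how many of the two threads timed out on their second lock. */
int count_inverted_timeouts(void)
{
    struct order ctor = { &loader_lock, &app_lock, 0 };   /* like thread 4 */
    struct order worker = { &app_lock, &loader_lock, 0 }; /* like thread 2 */
    pthread_t a, b;
    int rc;

    pthread_barrier_init(&both_hold, NULL, 2);
    pthread_barrier_init(&both_tried, NULL, 2);
    rc = pthread_create(&a, NULL, take_in_order, &ctor);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, take_in_order, &worker);
    assert(rc == 0);
    (void)rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_barrier_destroy(&both_hold);
    pthread_barrier_destroy(&both_tried);
    return ctor.timed_out + worker.timed_out;
}
```

Calling count_inverted_timeouts() should report that both threads gave up. If this reading of the traces is right, it would also explain why -Wl,-z,now helps: with eager binding the worker thread never has to enter the loader while it holds its own mutex.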

Carlos O'Donell

2012-11-15 22:03:14 UTC

Post by Paweł Sikora
Hi,
i'm playing with some EDA simulator which loads dynamically (via dlopen) my plugin.
during plugin initialization (global ctors) it deadlocks on the __lll_lock_wait.
i'm observing this issue on RHEL-5/CentOS-5 with glibc-2.5-58.el5_6.4.
is it a known bug on the 2.5 branch?

That was released 7 years ago. I don't remember anything from that
time period :-)

Post by Paweł Sikora
btw, i can workaround this issue with -Wl,-z,now linking flag to avoid lazy
symbol binding but i'd like to avoid this way if possible.

Why do you assume it's a glibc bug?

It will always deadlock in __lll_lock_wait for any deadlock since that's the
lowest level function for the locking implementation.

At this point it is either an application bug or a glibc bug, but I see nothing
in the gdb stack traces that points either way.

You'll need to debug more to find out.

Cheers,
Carlos.

Paweł Sikora

2012-11-16 20:02:30 UTC

Post by Carlos O'Donell

That was released 7 years ago. I don't remember anything from that
time period :-)

RHEL5 has at least 10 years of commercial support and many companies still use it ;-)

Post by Carlos O'Donell

Post by Paweł Sikora
btw, i can workaround this issue with -Wl,-z,now linking flag to avoid lazy
symbol binding but i'd like to avoid this way if possible.

Why do you assume it's a glibc bug?
(...)
It will always deadlock in __lll_lock_wait for any deadlock since that's the
lowest level function for the locking implementation.

it works fine with the newer glibc-2.12 from RHEL6 and with glibc-2.16 from another linux distro.
moreover, these traces from different threads are stuck at the same point -> _L_lock_1127,
so i assume that it is probably a glibc-2.5 bug fixed in a newer version. the main problem
is to locate the right fix in the glibc.git mirror and check the RHEL5 updates against it.
i can't force the customer to update their RHEL5 cluster without strong arguments :)

BR,
Paweł.

Thread 3 (Thread 0x44f72940 (LWP 29969)):
#0 0x0000003e6880d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003e68808e35 in _L_lock_1127 () from /lib64/libpthread.so.0
#2 0x0000003e68808d33 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003e67809dcb in _dl_lookup_symbol_x () from /lib64/ld-linux-x86-64.so.2
#4 0x0000003e6780cf05 in _dl_fixup () from /lib64/ld-linux-x86-64.so.2
#5 0x0000003e67812982 in _dl_runtime_resolve () from /lib64/ld-linux-x86-64.so.2
(...)
#9 0x0000003e6880673d in start_thread () from /lib64/libpthread.so.0
#10 0x0000003e67cd44bd in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x2b294f828c90 (LWP 29947)):
#0 0x0000003e6880d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003e68808e35 in _L_lock_1127 () from /lib64/libpthread.so.0
#2 0x0000003e68808d33 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000003e67d08f71 in _dl_addr () from /lib64/libc.so.6
#4 0x00002b294bbc9603 in backtracexx::lookupSymbol(backtracexx::Frame&) ...
#5 0x00002b294bbc9e9b in backtracexx::(anonymous namespace)::helper(_Unwind_Context*, ...
#6 0x0000003e6c808914 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#7 0x00002b294bbc95a9 in backtracexx::scan(void*) () from ...
#8 0x00002aaaafa95d95 in eh::signalFilter () from ...
#9 <signal handler called>
#10 0x0000003e6880aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#11 0x00002b294ad8149e in QWaitCondition::wait(QMutex*, unsigned long) () from ...
#12 0x00002b294ad80600 in QThread::wait(unsigned long) () from ...
#13 0x0000000000492a11 in ?? ()
#14 0x0000003e67c1d994 in __libc_start_main () from /lib64/libc.so.6

Carlos O'Donell

2012-11-16 20:46:32 UTC

Post by Paweł Sikora

Post by Carlos O'Donell

That was released 7 years ago. I don't remember anything from that
time period :-)

RHEL5 has at least 10 years of commercial support and many companies still use it ;-)

That's excellent, but *I* don't remember that far back :-)

Post by Paweł Sikora

Post by Carlos O'Donell

Post by Paweł Sikora
btw, i can workaround this issue with -Wl,-z,now linking flag to avoid lazy
symbol binding but i'd like to avoid this way if possible.

Why do you assume it's a glibc bug?
(...)
It will always deadlock in __lll_lock_wait for any deadlock since that's the
lowest level function for the locking implementation.

In the glibc 2.9 era (2008, 3 years after 2.5[1]) on x86_64 we added
tlsdesc support.

The tlsdesc support had some interesting dependencies on _dl_load_lock, which
is the most likely lock being taken here. The lock is used to serialize access
to the dynamic loader data. As such it gets touched from a number of different
places to prevent corruption.

It's possible that _dl_load_lock access is the problem here in the 2.5 codebase.

It's possible the problem still exists and the changes for tlsdesc
have covered it up.

It's also possible you have a kernel bug that misses a futex wakeup.

Good luck.

Cheers,
Carlos.

[1] http://sourceware.org/glibc/wiki/Glibc%20Timeline
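
The point above about _dl_load_lock serializing access to the loader's data can be illustrated with an ordinary recursive mutex. In glibc that lock is, as far as I know, a recursive one, so the thread that is inside dlopen() may re-enter the loader from a constructor, while any other thread has to wait. The sketch below is only an analogy under that assumption: fake_load_lock and probe_recursive_lock are hypothetical names, and nothing here touches the real loader lock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the loader's _dl_load_lock. */
static pthread_mutex_t fake_load_lock;

/* Another thread probing the lock, e.g. one doing a lazy symbol lookup. */
static void *other_thread(void *arg)
{
    int *result = arg;

    *result = pthread_mutex_trylock(&fake_load_lock);
    if (*result == 0)
        pthread_mutex_unlock(&fake_load_lock);
    return NULL;
}

/* Stores the trylock results seen by the owning thread and by a second
   thread while the owner still holds the lock. */
void probe_recursive_lock(int *same_thread_rc, int *other_thread_rc)
{
    pthread_mutexattr_t attr;
    pthread_t t;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&fake_load_lock, &attr);
    pthread_mutexattr_destroy(&attr);

    /* The "dlopen" caller takes the lock ... */
    pthread_mutex_lock(&fake_load_lock);
    /* ... and a nested loader call from the same thread still succeeds. */
    *same_thread_rc = pthread_mutex_trylock(&fake_load_lock);

    /* A different thread cannot get in while the owner holds the lock. */
    rc = pthread_create(&t, NULL, other_thread, other_thread_rc);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, NULL);

    if (*same_thread_rc == 0)
        pthread_mutex_unlock(&fake_load_lock);
    pthread_mutex_unlock(&fake_load_lock);
    pthread_mutex_destroy(&fake_load_lock);
}
```

With a real pthread_mutex_lock() in place of the trylock, the second thread would simply sit in __lll_lock_wait until the owner finished, which is harmless unless the owner is in turn waiting for something the second thread holds.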

Mike Frysinger

2012-11-17 07:54:46 UTC

Post by Paweł Sikora

Post by Carlos O'Donell

Post by Paweł Sikora
i'm playing with some EDA simulator which loads dynamically (via
dlopen) my plugin. during plugin initialization (global ctors) it
deadlocks on the __lll_lock_wait. i'm observing this issue on
RHEL-5/CentOS-5 with glibc-2.5-58.el5_6.4. is it a known bug on the
2.5 branch?

That was released 7 years ago. I don't remember anything from that
time period :-)

RHEL5 has at least 10 years of commercial support and many companies still use it ;-)

then you should have no problem getting them to locate the fix and release a
new version ;)
-mike