DB environment complete lockout: is there a way to investigate?

Discussion:

(too old to reply)

oleksandr kalinin

2007-01-30 11:35:45 UTC

Hello All,

I've run into a lockout scenario (that is when even db_stat and
db_deadlock utilities hang). DB recovery works fine. However, since
the setup is quite simple and it has been running fine for years now,
I am very surprised about the sudden lockout and would like to find
the root cause of it. I have the copy of 'broken' DB environment
files.

Is there any way to find out what exactly is causing the lockout, if
there is a lock that has not been released, what process has obtained
it, etc? Is that possible to do some 'basic' manual analysis of
BerkeleyDB environment files on my own in case db_stat is unusable,
using BerkeleyDB API?

In my setup, there is one multithreaded program with two threads of
control using BerkeleyDB (exclusive 'get' calls, i.e. read-only) and a
command line utility for updating the table at specific time once a
day. Lockout did happen two times already, at random time, in the
middle of normal functioning of the whole system, and I could not find
any common pattern or correlation on what could have caused these
errors.

Environment is opened with DB_INIT_CDB | DB_INIT_MPOOL | DB_CREATE
flags;
Daemon 'reading only' program opens tables with following flag :
DB_RDONLY
Tables update program table open flags : DB_CREATE

Many thanks in advance for your hints,

Regards,
Oleksandr

Florian Weimer

2007-01-30 13:48:22 UTC

Permalink

Post by oleksandr kalinin
I've run into a lockout scenario (that is when even db_stat and
db_deadlock utilities hang). DB recovery works fine. However, since
the setup is quite simple and it has been running fine for years now,
I am very surprised about the sudden lockout and would like to find
the root cause of it. I have the copy of 'broken' DB environment
files.

This looks as if something got hold of a region lock and then crashed,
without closing the database in an orderly manner.

Beginning with Berkeley DB 4.4, it's possible to detect this situation
with a file locking hack, and recover automatically. I'm not aware of
a general debugging strategy, though.

oleksandr kalinin

2007-01-31 11:36:55 UTC

Permalink

Hi Florian,

Post by Florian Weimer
Beginning with Berkeley DB 4.4, it's possible to detect this situation
with a file locking hack, and recover automatically. I'm not aware of
a general debugging strategy, though.

Indeed, I could find information in release notes of BDB 4.4 that it
contains new functionality supporting 'online abandoned locks
removal'. This would be indeed very handy feature in general, and
perhaps for this case. Do you know any more details about the
principle and how to use this feature? I could not find any relevant
info in documentation so far.

Regards,
Oleksandr

Florian Weimer

2007-01-31 12:03:39 UTC

Permalink

Do you know any more details about the principle and how to use this
feature?

It's related to the new environment open flag DB_REGISTER.

Kevin Duraj

2007-01-31 19:52:21 UTC

Permalink

Post by oleksandr kalinin
Hello All,
I've run into a lockout scenario (that is when even db_stat and
db_deadlock utilities hang). DB recovery works fine. However, since
the setup is quite simple and it has been running fine for years now,
I am very surprised about the sudden lockout and would like to find
the root cause of it. I have the copy of 'broken' DB environment
files.
Is there any way to find out what exactly is causing the lockout, if
there is a lock that has not been released, what process has obtained
it, etc? Is that possible to do some 'basic' manual analysis of
BerkeleyDB environment files on my own in case db_stat is unusable,
using BerkeleyDB API?
In my setup, there is one multithreaded program with two threads of
control using BerkeleyDB (exclusive 'get' calls, i.e. read-only) and a
command line utility for updating the table at specific time once a
day. Lockout did happen two times already, at random time, in the
middle of normal functioning of the whole system, and I could not find
any common pattern or correlation on what could have caused these
errors.
Environment is opened with DB_INIT_CDB | DB_INIT_MPOOL | DB_CREATE
flags;
DB_RDONLY
Tables update program table open flags : DB_CREATE
Many thanks in advance for your hints,
Regards,
Oleksandr

Oleksandr,

Did you try to use DB_THREAD tag ? ... something like this for
handles usable concurrently by multiple threads?

Example:
DB_INIT_CDB | DB_INIT_MPOOL | DB_CREATE | DB_THREAD

Let us know if that helps ...

Thanks,
-Kevin

m***@gmail.com

2007-02-01 00:10:30 UTC

Permalink

Hi Oleksandr,

As Florian said, this sort of lockout is usually caused by some
process crashing while holding a lock.

For diagnostic purposes, you can get information from the environment
by running most of the utilities (including db_stat) with the '-N'
flag. That will cause the utilities to operate without obtaining
locks. This is obviously dangerous, and should be done on a copy of
the entire environment when you are sure that no other processes are
running.

It can be difficult after the fact to work out what has happened, but
the lock statistics (including "db_stat -N -Co") are probably a good
place to start.

Regards,
Michael Cahill, Oracle.

oleksandr kalinin

2007-02-01 12:50:09 UTC

Permalink

Hi All,

I do not use DB_THREAD option, but there rather is a convention where
each thread creates his own environment and database handles. So no
multiple threads may concurrently use the same environment or database
handle. That has some overhead in number of BerkeleyDB resources
involved, but in my pogram that is sufficient and on the other hand
there is no need to dynamically allocate memory for BerkeleyDB get()
results.

I made some more investigations recently, and found out that before
lockout of the environment the database gets locked at first blocking
in 'get' call, while environment still can be accessed. Stack trace of
the apllication blocked in 'get' call :

IOT/Abort trap in __lock_promote at 0x10065314 ($t11)
0x10065314 (__lock_promote+0x80) 7e5a0214 add r18,r26,r0
(dbx) where
__lock_promote() at 0x10065314
__lock_put_internal() at 0x1006480c
__lock_put_nolock_20_3() at 0x10066fd8
__lock_put() at 0x10065944
__db_c_close() at 0x10074cbc
__db_c_cleanup() at 0x10072308
__db_c_get() at 0x10074230
__db_c_pget() at 0x10072bd4
__db_pget() at 0x1007909c
__db_pget_pp() at 0x10079284
get__5TableFPvUls(0x20004d48, 0x2c49a420, 0x8, 0x10001) at 0x100615e0

At that point, new attempt to open environment succeeded, but db_open
blocked. From lock stat I could see that each time db_open is called,
"Total number of locks not immediately available due to conflicts" is
increased by one. There were 0 deadlocks. When application is blocked
in get() call, I assume that opening BDB 4.4+ environment by other
program with DB_REGISTER | DB_RECOVER flags would not help to avoid
outage, because application is blocked while having environment
already opened. I may be wrong here, though.

Complete environment lockout seems to occur later, when application
process has to be killed due to threads blocked in get() and some more
unsuccessful attempts are done to open the database that is locked
out. When that happens, program that can not access environment have
following stack trace :

IOT/Abort trap in _global_lock_common at 0xd0051630 ($t1)
0xd0051630 (_global_lock_common+0x200) 80410014 lwz
r2,0x14(r1)
(dbx) where
_global_lock_common(??, ??, ??) at 0xd0051630
pthread_mutex_lock(??) at 0xd004feb4
__db_pthread_mutex_lock() at 0x1000f1d8
__db_e_attach() at 0x1002c4d0
__dbenv_open() at 0x1002a7e0
attach__5TableFPCciUlT3N22(0x20000a38, 0x100bffb8, 0x1, 0x8, 0x8, 0x0,
0x0) at 0x100021a0

Lock stats looks as follows (db_stat -N option specified) :

Default locking region information:
105 Last allocated locker ID
0x7fffffff Current maximum unused locker ID
5 Number of lock modes
5000 Maximum number of locks possible
5000 Maximum number of lockers possible
5000 Maximum number of lock objects possible
18 Number of current locks
21 Maximum number of locks at any one time
26 Number of current lockers
29 Maximum number of lockers at any one time
5 Number of current lock objects
6 Maximum number of lock objects at any one time
1798034 Total number of locks requested
1798015 Total number of locks released
0 Total number of lock requests failing because DB_LOCK_NOWAIT
was set
2 Total number of locks not immediately available due to
conflicts
0 Number of deadlocks
0 Lock timeout value
0 Number of locks that have timed out
0 Transaction timeout value
0 Number of transactions that have timed out
2MB 336KB The size of the lock region
60 The number of region locks that required waiting (0%)

...

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock REGINFO information:
Lock Region type
3 Region ID
__db.003 Region name
0x3004a000 Original region address
0x3004a000 Region address
0x3029df40 Region primary address
0 Region maximum allocation
0 Region allocated
REGION_JOIN_OK Region flags
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Locks grouped by object:
Locker Mode Count Status ----------------- Object
---------------
d READ 1 HELD (7 f20022 64280d65 eb0d0 0)
handle 0
2b READ 1 HELD (7 f20022 64280d65 eb0d0 0)
handle 0
31 READ 1 HELD (7 f20022 64280d65 eb0d0 0)
handle 0
37 READ 1 HELD (7 f20022 64280d65 eb0d0 0)
handle 0
3a READ 1 HELD (7 f20022 64280d65 eb0d0 0)
handle 0

f READ 1 HELD (8 f20022 64287138 103770 0)
handle 0
2d READ 1 HELD (8 f20022 64287138 103770 0)
handle 0
33 READ 1 HELD (8 f20022 64287138 103770 0)
handle 0
39 READ 1 HELD (8 f20022 64287138 103770 0)
handle 0
3c READ 1 HELD (8 f20022 64287138 103770 0)
handle 0

a READ 1 HELD (9 f20022 6a50cb81 17d0dc 0)
handle 0
28 READ 1 HELD (9 f20022 6a50cb81 17d0dc 0)
handle 0
2e READ 1 HELD (9 f20022 6a50cb81 17d0dc 0)
handle 0
34 READ 1 HELD (9 f20022 6a50cb81 17d0dc 0)
handle 0

2a READ 1 HELD (a f20022 6a517d89 19577c 0)
handle 0
30 READ 1 HELD (a f20022 6a517d89 19577c 0)
handle 0
36 READ 1 HELD (a f20022 6a517d89 19577c 0)
handle 0

2f READ 3 HELD 0x10f108 len: 20 data:
0000000x09000xf200"jP0xcb0x81000x170xd00xdc00000000

Would be interesting to find out what is wrong here, but my
understanding is, however, that at first place I need to find out why
application thread is getting blocked in get() call. That seems to be
the first symptom of things going wrong.

Again, thanks a lot for your replies.

Regards,
Oleksandr