fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
<agentzh>
it happens from time to time, especially in our stress test.
<agentzh>
it looks like a deadlock to me.
sscox has quit [Ping timeout: 256 seconds]
sscox has joined #systemtap
linus2 has quit [Ping timeout: 260 seconds]
linus2 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
khaled_ has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
khaled_ has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
orivej has joined #systemtap
<fche>
agentzh, well, wouldn't call that part a deadlock; the loop is intentional
<fche>
but something among the writers of that stp_task_work_callbacks atomic is not getting cleared - i.e., one of the related callbacks seems stuck
<fche>
a systemwide stack traceback may help explain
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
hpt has quit [Ping timeout: 258 seconds]
orivej has quit [Ping timeout: 258 seconds]
zhuizhuhaomeng has joined #systemtap
<zhuizhuhaomeng>
hello @fche, I found an error 'ERROR: probe process("/usr/local/openresty/nginx/sbin/nginx").begin registration error (rc -114)'; rc -114 is -EALREADY. I looked up the code and found it may be returned by __stp_utrace_attach when called with the __STP_TASK_FINDER_EVENTS and UTRACE_RESUME args.
<zhuizhuhaomeng>
is it a real error, or can we ignore it?
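The rc-to-errno reading above (-114 as -EALREADY) can be double-checked with a short script. This is only a minimal sketch: `describe_rc` is a hypothetical helper name, and the number-to-name mapping comes from the platform's errno table, so the result is only meaningful when run on the same kind of system (here, Linux) that produced the error.

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel-style return code to (symbolic name, message).

    Hypothetical helper for illustration; it relies on the local
    platform's errno table, which differs between operating systems.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# The rc value reported in the registration error above.
name, message = describe_rc(-114)
print(f"rc -114 -> {name}: {message}")
```

On a Linux box this is expected to agree with the -EALREADY reading given in the message; on other platforms the same number may map to a different name or to none.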
<fche>
like the error# agentzh was talking about yesterday, I suspect this relates to very short-lived processes
<zhuizhuhaomeng>
how can i debug this problem?
orivej has joined #systemtap
<fche>
zhuizhuhaomeng, not sure, sorry. there are no debugging prints in stp_utrace.c; maybe adding some is a start
zhuizhuhaomeng has quit [Quit: Leaving]
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
sscox has quit [Ping timeout: 240 seconds]
amerey has joined #systemtap
sscox has joined #systemtap
derek0883 has joined #systemtap
<agentzh>
fche: okay, something like 'foreach bt' in the crash session?
<fche>
think so
<fche>
don't remember the kgdb rune but yeah
<fche>
if you have access to /proc/$$/stack or whatever you'd used in the bz, then yeah that but for other threads in the system
<agentzh>
okay, will try.
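For a live (not crashed) system, the "/proc/$$/stack, but for other threads" idea above could be sketched roughly as follows. This assumes a Linux kernel that exposes /proc/<pid>/task/<tid>/stack and a user privileged enough to read it (usually root). The function names `collect_kernel_stacks` and `format_stacks` are made up for illustration and are not part of systemtap or crash.

```python
import glob
import os


def collect_kernel_stacks(proc_root="/proc"):
    """Return {(pid, tid): stack_text} for every thread with a readable stack file.

    Threads that exit while we scan, or whose stack file we are not
    allowed to read, are skipped silently.
    """
    stacks = {}
    pattern = os.path.join(proc_root, "[0-9]*", "task", "[0-9]*", "stack")
    for path in glob.glob(pattern):
        parts = path.split(os.sep)
        # Path layout: <proc_root>/<pid>/task/<tid>/stack
        pid, tid = int(parts[-4]), int(parts[-2])
        try:
            with open(path) as f:
                stacks[(pid, tid)] = f.read()
        except OSError:
            continue
    return stacks


def format_stacks(stacks):
    """Render the collected stacks as one text report, sorted by pid and tid."""
    chunks = []
    for (pid, tid), text in sorted(stacks.items()):
        chunks.append(f"--- pid {pid} tid {tid} ---\n{text}")
    return "".join(chunks)
```

Running `print(format_stacks(collect_kernel_stacks()))` as root would then give a rough system-wide kernel stack snapshot. It is not atomic: threads are sampled one after another, so short-lived processes under a stress test can easily be missed.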
<agentzh>
fche: found that we can reproduce a kernel panic relatively consistently in __stp_tf_clone_worker() on a 32c/64t system using kernel 4.14.
<agentzh>
fche: it's the work arg pointer itself being NULL in the kdump for the 4.14 kernel.
<agentzh>
the thread start/stop timestamps would be huge, since we are doing stress testing here and lots of processes in the system are coming and going.
<agentzh>
i've checked the current process running the fatal task_work job: it is a bash process that is still running (task->state == 0).
<agentzh>
and its task->task_works pointer is not NULL: task_works = 0xffff8a948cd77a60, and i've checked it still has 3 callback_head entries in the list whose ->func all point to utrace_resume.
<agentzh>
though those 3 callbacks do belong to other concurrent stap modules, not the current one.
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Ping timeout: 240 seconds]
sscox has quit [Ping timeout: 260 seconds]
derek0883 has joined #systemtap
orivej has joined #systemtap
orivej_ has joined #systemtap
orivej has quit [Quit: No Ping reply in 180 seconds.]