fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
khaled has quit [Remote host closed the connection]
<fche>
kerneltoast, ok, I see nothing scary in the patch or the diffs
<fche>
we're hoping to cut the release in the next day ish so let's hope this is the last bit that's kind of low level
<kerneltoast>
i was hoping to get the lockup fixed in time for the release too but alas
<fche>
yeah that one's more troubling but I don't think your fix was on the right track
<kerneltoast>
needs moar grok
derek0883 has quit [Remote host closed the connection]
sscox has joined #systemtap
derek0883 has joined #systemtap
<kerneltoast>
fche, we should also add a patch to fix the memory leak when the task work fails to get added
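A minimal userspace sketch of the leak pattern mentioned here, assuming the call that queues the work reports failure through its return value. The names `struct work`, `struct queue`, `queue_add`, `submit_work`, `drain` and the `live_works` counter are hypothetical stand-ins, not the actual task_finder2.c code. The point is only that the caller must free the allocation itself when the add fails, because nobody else will ever run or release it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical work item and queue; not the real task_finder2 structures. */
struct work {
    struct work *next;
    int id;
};

struct queue {
    struct work *head;
    struct work *tail;
    int accepting;      /* 0 models a target that refuses new work */
};

/* Counts allocated-but-not-freed work items so a leak is observable. */
static int live_works;

/* Append to the tail; returns 0 on success, -1 if the queue refuses. */
static int queue_add(struct queue *q, struct work *w)
{
    if (!q->accepting)
        return -1;
    w->next = NULL;
    if (q->tail)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
    return 0;
}

/* Allocate and queue a work item, freeing it again if queueing fails. */
static int submit_work(struct queue *q, int id)
{
    struct work *w;

    assert(q != NULL);
    w = malloc(sizeof(*w));
    if (w == NULL)
        return -1;
    live_works++;
    w->id = id;

    if (queue_add(q, w) != 0) {
        /* Without this free, the item would leak on every failed add. */
        free(w);
        live_works--;
        return -1;
    }
    return 0;
}

/* Release everything still queued. */
static void drain(struct queue *q)
{
    while (q->head) {
        struct work *w = q->head;
        q->head = w->next;
        free(w);
        live_works--;
    }
    q->tail = NULL;
}
```

In this toy model a failed `submit_work` leaves `live_works` unchanged, which is the property the proposed patch is after.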
lijunlong has quit [Ping timeout: 246 seconds]
derek0883 has quit [Remote host closed the connection]
lijunlong has joined #systemtap
derek0883 has joined #systemtap
irker835 has joined #systemtap
<irker835>
systemtap: sultan systemtap.git:master * release-4.3-128-g498aa23b6 / runtime/linux/task_finder2.c: PR26144: task_finder2: execute task workers in order
lijunlong has quit [Ping timeout: 256 seconds]
lijunlong has joined #systemtap
khaled has joined #systemtap
hpt has quit [Ping timeout: 246 seconds]
hpt has joined #systemtap
hpt has quit [Ping timeout: 240 seconds]
irker835 has quit [Quit: transmission timeout]
orivej has joined #systemtap
pviktori has joined #systemtap
pviktori has quit [Ping timeout: 256 seconds]
pviktori has joined #systemtap
mjw has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
wcohen has quit [Remote host closed the connection]
wcohen has joined #systemtap
amerey has joined #systemtap
tromey has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast>
fche, hiya
<fche>
uh oh
<kerneltoast>
would it be bad to wrap a small code style nitpick into an unrelated patch?
<fche>
could use a warning in the _put_context() function, as that c == _stp_runtime_get_context is really an assertion
<fche>
hmmmmmmmm
<fche>
ok have a hypothesis
<fche>
what if we do encounter a reentrancy event, but at that moment for whatever reason, we're in rcu-idle state
<fche>
so _stp_runtime_get_context() returns 0 in the if (c == ...) test
<fche>
then the test would be false and the context would stay busy ..... hm never mind, that's a more friendly outcome
<fche>
but I'm thinking that particular assertion could introduce heisenbugs
<fche>
kerneltoast, if you can induce that crash easily enough with make -j *check, could you try it with commenting out that if... and leaving in the atomic_dec unconditionally?
<fche>
agentzh, kerneltoast, making any sense?
<kerneltoast>
fche, commenting out which if? also, my cmpxchg patch doesn't fix the bug, it's just an optimization
<kerneltoast>
i did test something just now similar to what you're thinking of
<fche>
the if (c == _stp_runtime_get_context()) in _put_context
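A toy userspace model of the hypothesis above, using C11 atomics in place of the kernel's atomic ops. Everything here is an assumption made for illustration: `struct context`, `runtime_get_context`, `get_context`, the two `put_context_*` variants and the `lookup_fails` switch are hypothetical names, and the busy-counter scheme is a simplification rather than the actual runtime code. It shows what happens if the lookup inside the release path returns NULL (standing in for the rcu-idle case): the guarded release skips the decrement and the context stays busy, while the unconditional release always clears it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-CPU probe context. */
struct context {
    atomic_int busy;
};

static struct context the_context;  /* static storage: busy starts at 0 */

/* When nonzero, the lookup returns NULL, modelling the rcu-idle hypothesis. */
static int lookup_fails;

static struct context *runtime_get_context(void)
{
    return lookup_fails ? NULL : &the_context;
}

/* Acquire: returns the context if it was free, NULL if busy or not found. */
static struct context *get_context(void)
{
    struct context *c = runtime_get_context();

    if (c == NULL)
        return NULL;
    if (atomic_fetch_add(&c->busy, 1) != 0) {
        /* Already in use (reentrancy): undo our increment and bail out. */
        atomic_fetch_sub(&c->busy, 1);
        return NULL;
    }
    return c;
}

/* Release guarded by a second lookup, as in the "if" under discussion. */
static void put_context_checked(struct context *c)
{
    if (c == runtime_get_context())
        atomic_fetch_sub(&c->busy, 1);
}

/* Release with the check removed: always drop the busy count. */
static void put_context_unconditional(struct context *c)
{
    assert(c != NULL);
    atomic_fetch_sub(&c->busy, 1);
}
```

In this model, a failed lookup during `put_context_checked` leaves `busy` at 1, so every later `get_context` is rejected until something decrements it. That is the friendlier outcome described above, not a crash, which is why the experiment is to see whether the unconditional variant changes the observed behavior.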
<fche>
kerneltoast, not sure that's enough. You probably don't have access to the bz in question, but there was in fact some related problem
<agentzh>
kerneltoast: looking into that patch.
<fche>
basically if -any- code run by a stap probe handler invoked rcu-related functions, then WARNING: suspicious RCU usage kernel messages could result
derek0883 has joined #systemtap
<kerneltoast>
fche, b5f8a8a64b6354e5 is also irrelevant, because the code was still using RCU at that point
<fche>
kerneltoast's last gist seems to restore context allocation to just before PR20192's commit b5f8a8a64
<kerneltoast>
my gist doesn't make use of RCU though
<fche>
I understand
<fche>
that's why I said --before-- PR20192's commit
<kerneltoast>
the code just before that b5f commit had some RCU usage
<fche>
that commit introduced rcu into this path
<kerneltoast>
no it didn't, see: [02:00 PM] <kerneltoast> rcu_assign_pointer(contexts[cpu], NULL);
derek0883 has quit [Remote host closed the connection]
<kerneltoast>
it was using RCU inconsistently
<fche>
ah, there are some other older uses
<fche>
commit 2d9786c1d9
<fche>
six years ago, well there we go
<fche>
if we can get rid of the rcu code in this path a la the latest gist, I think I'm all for it, but it'd definitely need lockdep etc. fuller testing overnight etc etc etc
derek0883 has joined #systemtap
<irker042>
systemtap: sultan systemtap.git:master * release-4.3-135-g7db54199f / runtime/linux/task_finder2.c: task_finder2: fix memory leak when task workers fail to get added
* agentzh
finds fche's release date is a great way to increase kerneltoast's productivity *grin*
<fche>
and decrease mine :) thereby extending the release date :)
<agentzh>
lol
<fche>
but hey it's probably a good tradeoff
<agentzh>
we should release often.
<fche>
kerneltoast, you said it still died with your last patch variant 2c00e7be etc.
<fche>
one extra diagnostic you could put in there is in the context_put function, check if c->locked is 1
<fche>
there's no way it should be 1 (things still locked) by the time the context-put is run
<fche>
could emit a BUG at that time
<fche>
agentzh, kerneltoast, am trying to reproduce the bug here, but until I manage
<kerneltoast>
yea so the context won't protect from another cpu
<fche>
it need not
<fche>
we're dealing with apparent reentrancy situation here
<fche>
so the get_context() would've been somehow hit twice on the same cpu, or somehow returning the same context that was somehow ?!?! not marked busy properly
<fche>
c->locked is not a lock
<fche>
it's an indication whether the current context has acquired the stap global variable locks
<kerneltoast>
running that paste now
<fche>
so the theory goes, in the reentry case, this _get_context test for c->locked==1 should be another way for detecting suspected reentrancy
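A rough userspace sketch of the suggested diagnostic, under the assumption that `locked` is a plain flag set while a handler holds the script's global variable locks and cleared before release. The names `struct context`, `get_context` and `put_context` are hypothetical, and the kernel code would presumably emit a warning or BUG where this sketch just returns a flag. The check is made at both ends: at put time (nothing should still be locked) and at get time (a context that was free should never claim to hold locks).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a probe context. */
struct context {
    atomic_int busy;
    int locked;   /* not a lock: 1 while global variable locks are held */
};

/*
 * Acquire the context. Returns NULL if it is already busy. On success,
 * *suspect is set to 1 if the context still claims to hold locks, which
 * should be impossible for a context that was properly released.
 */
static struct context *get_context(struct context *c, int *suspect)
{
    *suspect = 0;
    if (atomic_fetch_add(&c->busy, 1) != 0) {
        atomic_fetch_sub(&c->busy, 1);
        return NULL;              /* ordinary busy rejection */
    }
    if (c->locked)
        *suspect = 1;             /* free context that looks mid-handler */
    return c;
}

/*
 * Release the context. Returns 1 if locks were apparently still held at
 * release time (the condition that could trigger a BUG), 0 otherwise.
 */
static int put_context(struct context *c)
{
    int suspect;

    assert(c != NULL);
    suspect = (c->locked != 0);
    atomic_fetch_sub(&c->busy, 1);
    return suspect;
}
```

A nonzero result from either function does not prove reentrancy in this model; it only flags a state that the normal get/lock/unlock/put sequence should never produce.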