fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
khaled has quit [Remote host closed the connection]
khaled has joined #systemtap
sscox has quit [Ping timeout: 264 seconds]
orivej has quit [Ping timeout: 272 seconds]
khaled has quit [Quit: Konversation terminated!]
zamba has quit [Ping timeout: 256 seconds]
fLiPr3VeRsE has quit [Ping timeout: 256 seconds]
fLiPr3VeRsE has joined #systemtap
zamba has joined #systemtap
hpt has joined #systemtap
<fche> kerneltoast, ok, I see nothing scary in the patch or the diffs
<fche> we're hoping to cut the release in the next day ish so let's hope this is the last bit that's kind of low level
<kerneltoast> i was hoping to get the lockup fixed in time for the release too but alas
<fche> yeah that one's more troubling but I don't think your fix was on the right track
<kerneltoast> needs moar grok
derek0883 has quit [Remote host closed the connection]
sscox has joined #systemtap
derek0883 has joined #systemtap
<kerneltoast> fche, we should also add a patch to fix the memory leak when the task work fails to get added
lijunlong has quit [Ping timeout: 246 seconds]
derek0883 has quit [Remote host closed the connection]
lijunlong has joined #systemtap
derek0883 has joined #systemtap
irker835 has joined #systemtap
<irker835> systemtap: sultan systemtap.git:master * release-4.3-128-g498aa23b6 / runtime/linux/task_finder2.c: PR26144: task_finder2: execute task workers in order
lijunlong has quit [Ping timeout: 256 seconds]
lijunlong has joined #systemtap
khaled has joined #systemtap
hpt has quit [Ping timeout: 246 seconds]
hpt has joined #systemtap
hpt has quit [Ping timeout: 240 seconds]
irker835 has quit [Quit: transmission timeout]
orivej has joined #systemtap
pviktori has joined #systemtap
pviktori has quit [Ping timeout: 256 seconds]
pviktori has joined #systemtap
mjw has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
wcohen has quit [Remote host closed the connection]
wcohen has joined #systemtap
amerey has joined #systemtap
tromey has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast> fche, hiya
<fche> uh oh
<kerneltoast> would it be bad to wrap a small code style nitpick into an unrelated patch?
<fche> depends
<fche> what're you thinking
orivej has joined #systemtap
<fche> well, not just a style thing, it should shrink the critical section somewhat in task_worker_f
<fche> fn
<fche> looks okay to me, testing supports?
<kerneltoast> lemme test real quick
<kerneltoast> want it to be a separate patch?
irker042 has joined #systemtap
<irker042> systemtap: amerey systemtap.git:master * release-4.3-129-g4ff87eef4 / po/cs.gmo po/cs.po po/en.po po/fr.po po/pl.po po/systemtap.pot: prerelease: update-po
<irker042> systemtap: amerey systemtap.git:master * release-4.3-130-g3345cd0df / AUTHORS: prerelease: AUTHORS bump
<irker042> systemtap: amerey systemtap.git:master * release-4.3-131-g8dc7997f7 / tapset/uconversions.stp: tapset/uconversions.stp: Fix user_string_n_nofault description.
<fche> kerneltoast, what as a separate patch? the diff you posted looks okay
<kerneltoast> the task_worker_fn hunk but if not, coolio
<kerneltoast> coolbeans
<kerneltoast> fche, our lean tester is happy. here's the polished commit: https://gist.github.com/kerneltoast/32a19af77cb401140da5ac881ad4f6d9
<fche> thanks, nice
<kerneltoast> fche, i think i see a problem with _stp_runtime_entryfn_get_context
<kerneltoast> but my brain is still processing
<irker042> systemtap: amerey systemtap.git:master * release-4.3-132-gf75ec911a / tapset/uconversions.stp: tapset/uconversions.stp: Fix format of user_string_n_nofault
<irker042> systemtap: amerey systemtap.git:master * release-4.3-133-g3df5453fe / : prerelease: update-docs
<kerneltoast> fche, i guess this is moreso an optimization: https://gist.github.com/kerneltoast/45815f8345acad1f840af3b20b0b1adb
<fche> hm, so no sign of the bug?
<fche> hm ya know
<fche> could use a warning in the _put_context() function, as that c == _stp_runtime_get_context is really an assertion
<fche> hmmmmmmmm
<fche> ok have a hypothesis
<fche> what if we do encounter a reentrancy event, but at that moment for whatever reason, we're in rcu-idle state
<fche> so _stp_runtime_get_context() returns 0 in the if (c == ...) test
<fche> then the test would be false and the context would stay busy ..... hm never mind, that's a more friendly outcome
<fche> but I'm thinking that particular assertion could introduce heisenbugs
<fche> kerneltoast, if you can induce that crash easily enough with make -j *check, could you try it with commenting out that if... and leaving in the atomic_dec unconditionally?
<fche> agentzh, kerneltoast, making any sense?
<kerneltoast> fche, commenting out which if? also, my cmpxchg patch doesn't fix the bug, it's just an optimization
<kerneltoast> i did test something just now similar to what you're thinking of
<fche> the if (c == _stp_runtime_get_context()) in _put_context
<kerneltoast> the lockup still occurred without that printk getting hit
<fche> ok
<fche> could you try my hypothetical too easily?
<kerneltoast> yeah
<kerneltoast> just comment out that if? nothing else?
<fche> yes, do the atomic_dec(&c->busy); eery time
<kerneltoast> it takes 10-20 min to reproduce usually btw
<kerneltoast> pretty fast
<kerneltoast> + atomic_dec(&c->busy);
<kerneltoast> - atomic_dec(&c->busy);
<kerneltoast> - if (c == _stp_runtime_get_context())
<fche> yup
<kerneltoast> it's chuggin away
<fche> thanks
<kerneltoast> when's the release btw?
<fche> any time now,
<fche> I'm personally holding it up with work :*
<fche> but am having a tough time concentraing :)
<kerneltoast> hehe
<kerneltoast> i should get agentzh to push the memleak fix soon then
<fche> dammit
<fche> :)
<kerneltoast> agentzh gogogogog
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast> fche, it ran to completion. testing it again now
<fche> really, no sign of the reentrancy this time?
<kerneltoast> yes, but from previous testing it doesn't happen 100% of the time, just *most* of the time
<kerneltoast> we may have gotten lucky
<kerneltoast> i could also add a print
<fche> cueing daft punk
<kerneltoast> a print to indicate when the atomic_dec would've been ignored
<kerneltoast> running the testsuite again
<kerneltoast> chug a lug lug
<fche> not sure whether that's necessarily a bug or problem so let's not print there for now
<fche> chug a lug lug <<< okay
<kerneltoast> better to have a print to see if it's even doing something
<kerneltoast> fche, it died
<kerneltoast> my print didn't get printed either
<kerneltoast> turn the daft punk off
<kerneltoast> do not pass GO
<kerneltoast> do not collect pension
<fche> stap_4f0a4a843ca425508f184f4a28f3b46b__13439 (<input>):
<fche> after a reboot, can you find the stap_4f* corresponding files under $build/testsuite/.systemtap-root/cache/* ?
<fche> with the .c file, it should be possible to figure out which <input> this comes from
<fche> then I'd run that test (only) on a super busyfied machine
<kerneltoast> fche, here's the file in its enormity: https://gist.github.com/kerneltoast/f73190f91e74bca314d53d8c9d176c76
tromey has quit [Quit: ERC (IRC client for Emacs 27.1.50)]
<fche> uprobes_onthefly.x ..... lots of tracepoints .....
<kerneltoast> there are 15 calls to stp_lock_probe and 12 stp_unlock_probe
<kerneltoast> let's see where there's a discrepancy...
<fche> that's not necessarily a problem if they're in different nested blocks
<fche> at execution time there should be a matching set run
<kerneltoast> i know but it's just a belt-and-suspenders check to do
<fche> and an unlock at probe exit should be unavoidable
<kerneltoast> and i know how much you like belts and suspenders
<kerneltoast> oh hey
<fche> hm and interesting
<kerneltoast> i think i see the problem
<kerneltoast> goto out;
<kerneltoast> else
<kerneltoast> if (!stp_lock_probe(locks, ARRAY_SIZE(locks)))
<kerneltoast> c->locked = 1;
<kerneltoast> interrupt after the lock but before c->locked = 1
<fche> the reentrancy mechanism should prevent this
<kerneltoast> with the get_context?
<fche> yes
<fche> the lock_probe() mechanism should operate between concurrent probe handlers
<kerneltoast> maybe something's janky with the get_context rcu stuff
<kerneltoast> i'm not exactly sure why rcu is used there
derek088_ has joined #systemtap
derek088_ has quit [Remote host closed the connection]
derek0883 has quit [Ping timeout: 272 seconds]
<fche> commit 2699450d
<kerneltoast> no i mean why is rcu used at all
<fche> yeah, that's plausible generally
<fche> but I believe the BZ1788662 related tests are still necessary - to prevent a certain type of reentrancy
<kerneltoast> the atomic_cmpxchg should protect from that
<irker042> systemtap: amerey systemtap.git:master * release-4.3-134-g91087d81d / man/stapprobes.3stap: man/stapprobes.3stap: Mention nd_syscall argument writing.
<fche> see also commit b5f8a8a64b6354e5
<fche> kerneltoast, not sure that's enough. You probably don't have access to the bz in question, but there was in fact some related problem
<agentzh> kerneltoast: looking into that patch.
<fche> basically if -any- code run by a stap probe handler invoked rcu-related functions, then WARNING: suspicious RCU usage kernel messages could result
derek0883 has joined #systemtap
<kerneltoast> fche, b5f8a8a64b6354e5 is also irrelevant, because the code was still using RCU at that point
<kerneltoast> rcu_assign_pointer(contexts[cpu], NULL);
<kerneltoast> if you use RCU inconsistently it'll scream of course
<kerneltoast> i'll brb in an hour
<kerneltoast> but anyway, it still died with this gist: https://gist.github.com/kerneltoast/2c00e7be1e5f9250efea463f79ff8efb
<kerneltoast> so this isn't the problem
<fche> kerneltoast's last gist seems to restore context allocation to just before PR20192's commit b5f8a8a64
<kerneltoast> my gist doesn't make use of RCU though
<fche> I understand
<fche> that's why I said --before-- PR20192's commit
<kerneltoast> the code just before that b5f commit had some RCU usage
<fche> that commit introduced rcu into this path
<kerneltoast> no it didn't, see: [02:00 PM] <ffffffkerneltoast> rcu_assign_pointer(contexts[cpu], NULL);
derek0883 has quit [Remote host closed the connection]
<kerneltoast> it was using RCU inconsistently
<fche> ah, there are some other older uses
<fche> commit 2d9786c1d9
<fche> six years ago, well there we go
<fche> if we can get rid of the rcu code in this path a la the latest gist, I think I'm all for it, but it'd definitely need lockdep etc. fuller testing overnight etc etc etc
derek0883 has joined #systemtap
<irker042> systemtap: sultan systemtap.git:master * release-4.3-135-g7db54199f / runtime/linux/task_finder2.c: task_finder2: fix memory leak when task workers fail to get added
* agentzh finds fche's release date is a great way to increase kerneltoast's productivity *grin*
<fche> and decrease mine :) thereby extending the release date :)
<agentzh> lol
<fche> but hey it's probably a good tradeoff
<agentzh> we should release often.
<fche> kerneltoast, you said it still died with your last patch variant 2c00e7be etc.
<fche> one extra diagnostic you could put in there is in the context_put function, check if c->locked is 1
<fche> there's no way it should be 1 (things still locked) by the time the context-put is run
<fche> could emit a BUG at that time
<fche> agentzh, kerneltoast, am trying to reproduce the bug here, but until I manage
<agentzh> fche: yeah, should be easy to reproduce with -j17 on a 8c16t box.
<fche> unfortunately by that time of the code, the probe_point/name are already nuked, but it'd be handy to know which one it was
* fche doesn't have a vm that big locally, but trying a biggish -j job
<kerneltoast> okay i'm back
<kerneltoast> seems like you want to fix this before the release eh
<kerneltoast> yes it still died with 2c00e7be
<kerneltoast> trying your patch now
<fche> well kernel hangs are bad m'kay
<kerneltoast> yeeeeep
<kerneltoast> they reel bad
<kerneltoast> no recovery
<kerneltoast> just deadbeef
<kerneltoast> fche, okay running with your centos paste
<fche> not sure it'd catch this case, but maybe
<kerneltoast> fche, your BUG didn't get hit
<fche> ok
<fche> bug that hang still there?
<kerneltoast> yep still hung
<fche> hm the exact same test could be fired into the _get fn
<fche> it should not be 1 upon entry to a probe handler either (from a prior context that finished running, or naturally from a Currently running one)
<kerneltoast> whaddya mean
<kerneltoast> shouldn't be locked and then acquire the context?
<fche> this is a different lock
<fche> c->locked should be 0 or 2 incoming and outgoing from a context_get and context_put fn
<kerneltoast> fche, the contexts are per-cpu, but the locks are not
<fche> I know, that's the point
<kerneltoast> yea so the context won't protect from another cpu
<fche> it need not
<fche> we're dealing with apparent reentrancy situation here
<fche> so the get_context() would've been somehow hit twice on the same cpu, or somehow returning the same context that was somehow ?!?! not marked busy properly
<fche> c->locked is not a lock
<fche> it's an indication whether the current context has acquired the stap global variable locks
<kerneltoast> running that paste now
<fche> so the theory goes, in the reentry case, this _get_context test for c->locked==1 should be another way for detecting suspected reentrancy