fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
orivej has quit [Ping timeout: 258 seconds]
khaled has quit [Remote host closed the connection]
khaled has joined #systemtap
<agentzh> i added the oneline test with the result to the PR.
khaled has quit [Quit: Konversation terminated!]
hpt has joined #systemtap
sscox has quit [Ping timeout: 264 seconds]
irker455 has quit [Quit: transmission timeout]
<kerneltoast> fche, hiya
<kerneltoast> i fixed the probe_lock soft lockups
<kerneltoast> guess you're eating dinner or, like, enjoying life or something :P
sscox has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
irker825 has joined #systemtap
<irker825> systemtap: alizhang systemtap.git:azhang/pr13838-systemtapexample * release-4.3-120-g5ca63de01 / testsuite/systemtap.examples/general/floatingpoint.meta testsuite/systemtap.examples/general/floatingpoint.stp testsuite/systemtap.examples/general/floatingpoint.txt: PR13838: add floating point to systemtap.examples
<agentzh> kerneltoast: will you paste your gist link here? so that fche can review it 5am in the morning (our time).
<agentzh> or even earlier :)
<agentzh> the last time fche replied to my patch 5am in my morning.
<kerneltoast> i guess we'll wake up to a bunch of angry messages from fche
<agentzh> i hope not.
derek0883 has quit [Remote host closed the connection]
irker825 has quit [Quit: transmission timeout]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
khaled has joined #systemtap
tonyj has quit [Remote host closed the connection]
hpt has quit [Ping timeout: 256 seconds]
orivej has joined #systemtap
_whitelogger has joined #systemtap
<fche> hi guys
<fche> I see why you're thinking in this area, but we have two related mechanisms already, and am concerned they're not working:
<fche> the -DINTERRUPTIBLE conditional in the probe prologues/epilogues which normally wraps a similar irq_save() gadget around the bulk of the code
<fche> and
<fche> the _stp_runtime_entryfn_get_context() gadget which is our primary reentrancy-prevention mechanism
<fche> the latter aims to prevent just this sort of thing, and should have rejected giving the nested probe a context* at all, without which it would never have tried the nested stp_probe_lock()
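A minimal userspace sketch of the reentrancy-prevention idea fche describes, assuming a per-CPU busy counter. The names entryfn_get_context, entryfn_put_context, struct context and NCPUS are hypothetical stand-ins for illustration, not the actual SystemTap runtime code: a nested entry on the same CPU gets no context and therefore never reaches the probe lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NCPUS 4  /* hypothetical CPU count for the simulation */

/* Hypothetical stand-in for a per-CPU probe context. */
struct context {
    atomic_int busy;  /* 0 = free, >0 = a handler is running on this CPU */
    int skipped;      /* nested entries that were refused */
};

static struct context contexts[NCPUS];

/* Return the CPU's context, or NULL if a handler is already active there. */
static struct context *entryfn_get_context(int cpu)
{
    struct context *c = &contexts[cpu];
    if (atomic_fetch_add(&c->busy, 1) != 0) {
        /* Nested entry: undo our increment and refuse. */
        atomic_fetch_sub(&c->busy, 1);
        c->skipped++;
        return NULL;
    }
    return c;
}

static void entryfn_put_context(struct context *c)
{
    if (c)
        atomic_fetch_sub(&c->busy, 1);
}

/* Simulate an outer probe interrupted by a nested probe on the same CPU.
 * Returns 1 if the outer probe got a context and the nested one did not. */
static int nested_probe_was_rejected(int cpu)
{
    struct context *outer = entryfn_get_context(cpu);
    struct context *inner = entryfn_get_context(cpu);  /* the "interrupt" */
    int rejected = (outer != NULL && inner == NULL);
    entryfn_put_context(inner);
    entryfn_put_context(outer);
    return rejected;
}
```

With a guard like this in place, a handler that never receives a context has nothing to lock, which is why a double probe lock in the backtrace suggests the guard was bypassed or fooled.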
mjw has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
tromey has joined #systemtap
pviktori has quit [Ping timeout: 260 seconds]
orivej has quit [Ping timeout: 264 seconds]
pviktori has joined #systemtap
pviktori has quit [Ping timeout: 264 seconds]
pviktori has joined #systemtap
amerey has joined #systemtap
sscox has quit [Ping timeout: 264 seconds]
pviktori has quit [Ping timeout: 256 seconds]
orivej has joined #systemtap
<fche> agentzh, morning morning
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
tonyj has joined #systemtap
sscox has joined #systemtap
amerey has quit [Ping timeout: 264 seconds]
amerey has joined #systemtap
irker782 has joined #systemtap
<irker782> systemtap: amerey systemtap.git:master * release-4.3-123-g225242ee3 / Makefile.am Makefile.in doc/Makefile.in doc/beginners/Makefile.in java/Makefile.in python/Makefile.in: Makefile.am: Install runtime/softfloat/
<irker782> systemtap: alizhang systemtap.git:azhang/pr13838-systemtapexample * release-4.3-121-g01d926f10 / testsuite/systemtap.examples/general/floatingpoint.stp testsuite/systemtap.examples/general/floatingpoint.txt: PR13838: make systemtap.examples fp tests concise
<kerneltoast> hiya fche
<fche> good morning, good evening, and good night
<kerneltoast> agentzh and i are both PST
<kerneltoast> 10am over here
<fche> time is a state of mind
<kerneltoast> dimensionality is a function of consciousness
<kerneltoast> now that we've got the greetings out of the way, down to bizniss
<fche> if you please
<kerneltoast> i see one, maybe two problems with -DINTERRUPTIBLE
<kerneltoast> firstly it's conditional, when it really should not be
<kerneltoast> (at least for the locks)
<kerneltoast> the second problem is that i'm not sure it encapsulates all of the probe locks
<fche> yeah, I mentioned -DINTERRUPTIBLE more as an aside of something similar already being there
<kerneltoast> similar but not quite the same
<fche> but the big thing is the second - the context get/put gadget that is supposed to prevent this reentrancy
<kerneltoast> reentrancy aside, there is still the issue of interrupts
<fche> interrupts are an instance of reentrancy, no?
<kerneltoast> you got me there
<kerneltoast> ok lemme look at that
<irker782> systemtap: alizhang systemtap.git:master * release-4.3-124-gd06bd093d / testsuite/systemtap.examples/general/floatingpoint.meta testsuite/systemtap.examples/general/floatingpoint.stp testsuite/systemtap.examples/general/floatingpoint.txt: PR13838: add floating point to systemtap.examples
<kerneltoast> fche, it's not clear to me that the context get/put encapsulates all the probe locking
<fche> can you think of a counterexample?
<fche> the context get/put is done in a function that calls the probe handler body, which is what contains the probe_lock blocks
<kerneltoast> i mean there are functions like visit_expr_statement() where i just can't easily tell in all the translation complexity
<kerneltoast> there are a lot of emit_lock() calls
<fche> look at stap -p3
<kerneltoast> not sure which stap script to use
<fche> whatever one you think had a problem with this, almost any one that uses global vars in a normal probe will show you
<kerneltoast> how do i build the testsuite stap scripts?
<fche> what do you mean build?
<fche> you can run the testsuite - restricted with RUNTEST="foo.exp bar.exp" if you like
<fche> you can run stap on a .stp file in the testsuite
<fche> some tests generate .stp files on the go, which are harder to replicate by hand (sorry, nope)
<kerneltoast> i mean, how to generate pass 3 code from the testsuite .stp files
<fche> stap -p3 FOO.stp
<kerneltoast> that doesn't work
<fche> specifics please
<kerneltoast> $ ./stap -p3 testsuite/systemtap.base/sdt.stp
<kerneltoast> source: probe process(@1).mark("mark_z")
<kerneltoast> at: junk '' at testsuite/systemtap.base/sdt.stp:1:15
<kerneltoast> parse error: command line argument out of range [1-0]
<fche> so that test case must need a parameter identifying a test workload file
<kerneltoast> yeah and a lot do...
<fche> see the systemtap.log file for traces of how that was built and how stap was being run upon it
<kerneltoast> executing: stap -w /home/sultan/systemtap/testsuite/systemtap.base/sdt.stp sdt.c.exe.0 -c ./sdt.c.exe.0
<kerneltoast> y it no be easy
derek0883 has quit [Remote host closed the connection]
<fche> it be easy
<fche> but you have to see the previous few lines to see how sdt.c.exe.0 was built
derek0883 has joined #systemtap
<kerneltoast> ok i got it
<agentzh> fche: hey, i'm late today. customer meetings in the morning.
<agentzh> glad kerneltoast was already talking to you :)
<agentzh> so the current theory is that some special probe handlers lack _stp_runtime_entryfn_get_context() calls?
<agentzh> or the _stp_runtime_entryfn_get_context() call itself is buggy?
<agentzh> and the next step is to scan all the C code for the .stp files in the stap test suite? it sounds like a daunting task given the complexity of the generated C code.
<agentzh> and maybe one stp script uses 3 probes but only 1 lacks _stp_runtime_entryfn_get_context().
<agentzh> it may be easier if we can know which .stp file is at fault when the soft lockup happens.
orivej has quit [Ping timeout: 240 seconds]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<fche> that get_context should be called by every possible code path
<fche> the soft lockup ... the dmesg gives a hint of the stap source file name
<irker782> systemtap: alizhang systemtap.git:master * release-4.3-124-gc80f1453e / testsuite/systemtap.examples/general/floatingpoint.meta testsuite/systemtap.examples/general/floatingpoint.stp testsuite/systemtap.examples/general/floatingpoint.txt: PR13838: add floating point to systemtap.examples
<irker782> systemtap: alizhang systemtap.git:master * release-4.3-126-g55156a5ed / : PR13838: Fix previous commit message (c80f1453eba9430921edd4dc10e93f8d993042da)
<irker782> systemtap: alizhang systemtap.git:master * release-4.3-127-gc17b7d54a / testsuite/systemtap.examples/general/floatingpoint.stp testsuite/systemtap.examples/general/floatingpoint.txt: PR13838: update fp systemtap example
mjw has quit [Quit: Leaving]
<fche> and the systemtap.log file should track the history of the testsuite run
orivej has joined #systemtap
tromey has quit [Quit: ERC (IRC client for Emacs 27.1.50)]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
<kerneltoast> fche, dmesg doesn't give a hint of the stap source file name
<kerneltoast> stap_fa607a79e39c925e67c205c9745018c0__22713 (<input>)
<kerneltoast> that's the stap module listed in the backtrace with the double probe lock
<fche> yeah, so look back to the previous dmesg entry to find out which ones were recent
<fche> dejagnu runs the .exp files in alphabetical order
<kerneltoast> the bug only happens when running in parallel :/
<fche> hmmmmmm
<fche> wonder if there is a preemption / move-to-different-cpu issue here
<fche> so the context-getter function can succeed both times (with different cpus) on the same stack/context
<fche> well, not same context, but same thread
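A toy model of the hypothesis fche raises here, under the assumption that the reentrancy guard is keyed per CPU while the probe lock is global and non-recursive. All names are hypothetical and the "migration" is just a different CPU index, so this only illustrates the reasoning and is not evidence that this is what happens in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 2  /* hypothetical CPU count for the simulation */

static bool cpu_busy[NCPUS];  /* per-CPU "context in use" flags */
static bool probe_lock_held;  /* one non-recursive lock for a global variable */

static bool get_context(int cpu)
{
    if (cpu_busy[cpu])
        return false;
    cpu_busy[cpu] = true;
    return true;
}

static void put_context(int cpu) { cpu_busy[cpu] = false; }

static bool probe_trylock(void)
{
    if (probe_lock_held)
        return false;
    probe_lock_held = true;
    return true;
}

static void probe_unlock(void) { probe_lock_held = false; }

/* One thread enters a probe on outer_cpu, takes the lock, then hits a
 * nested probe while running on inner_cpu. Returns true if the nested
 * handler would try to take a lock that the same thread already holds. */
static bool nested_probe_would_deadlock(int outer_cpu, int inner_cpu)
{
    bool deadlock = false;

    if (!get_context(outer_cpu))
        return false;
    if (!probe_trylock()) {
        put_context(outer_cpu);
        return false;
    }
    if (get_context(inner_cpu)) {
        /* The guard let the nested probe in; a real spinlock would spin here. */
        if (!probe_trylock())
            deadlock = true;
        else
            probe_unlock();
        put_context(inner_cpu);
    }
    probe_unlock();
    put_context(outer_cpu);
    return deadlock;
}
```

In this model the same-CPU case is caught by the guard, while the different-CPU case slips through and ends in a self-deadlock on the lock.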
<kerneltoast> the centos 7 kernel only has voluntary preemption enabled
<kerneltoast> fche, there's something else i have for you to look at atm
* fche will be unavailable in about 20ish mins
<fche> hm no race between the _stp_kfree() in __stp_tf_task_worker_fn and someone adding entries again?
<fche> shouldn't that be protected by the lock?
<kerneltoast> it's taken off the list
<kerneltoast> "dequeued" is the term in the comment
<fche> ah
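A small sketch of the "dequeue first, free afterwards" pattern kerneltoast refers to, assuming a singly linked list protected by a simple spinlock. The struct and function names are hypothetical and not taken from the patch: once an entry is unlinked under the lock, no other list user can reach it, so freeing it outside the lock does not race with later additions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued task work item. */
struct task_work {
    struct task_work *next;
    int id;
};

static atomic_flag list_lock = ATOMIC_FLAG_INIT;
static struct task_work *list_head;

static void list_lock_acquire(void) { while (atomic_flag_test_and_set(&list_lock)) { } }
static void list_lock_release(void) { atomic_flag_clear(&list_lock); }

/* Add a new entry at the head of the list; returns 0 on success. */
static int work_add(int id)
{
    struct task_work *w = malloc(sizeof(*w));
    if (!w)
        return -1;
    w->id = id;
    list_lock_acquire();
    w->next = list_head;
    list_head = w;
    list_lock_release();
    return 0;
}

/* Unlink the entry with the given id under the lock; NULL if not found. */
static struct task_work *work_dequeue(int id)
{
    struct task_work **pp, *w = NULL;

    list_lock_acquire();
    for (pp = &list_head; *pp; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            w = *pp;
            *pp = w->next;
            w->next = NULL;
            break;
        }
    }
    list_lock_release();
    return w;
}

/* Worker side: the entry is already off the list when it is freed. */
static int work_finish(int id)
{
    struct task_work *w = work_dequeue(id);
    if (!w)
        return 0;
    free(w);  /* no lock needed: nobody else can find w any more */
    return 1;
}

/* Count the entries currently on the list. */
static int work_count(void)
{
    int n = 0;
    list_lock_acquire();
    for (struct task_work *w = list_head; w; w = w->next)
        n++;
    list_lock_release();
    return n;
}
```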
<fche> it looks highly plausible
<fche> noting you're removing the belt-and-suspenders __stp_tf_free_all_task_work() at the bottom
<fche> worth a BUG() at the least, if you are convinced it can't happen ?
<kerneltoast> i moved the belt and suspenders
<kerneltoast> __stp_tf_free_all_task_work has been coalesced into __stp_tf_cancel_all_task_work
<fche> ok, am not seeing a call to that in this diff; it must be nearby I assume
<kerneltoast> right before utrace_exit()
<fche> hm, confident there's no race between that and utrace_exit() which could spawn more entries?
<kerneltoast> ah
<kerneltoast> i guess not
<fche> maybe call cancel one more time?
<kerneltoast> well, then the problem becomes that the stp_task_work API has just been violated
<kerneltoast> because there's a task work in flight after stp_task_work_exit() was called
<kerneltoast> and stp_task_work_exit() is supposed to wait until all task workers are finished running
<fche> yeah, I'm just wondering whether there is a gap of time AFTER this but before utrace_exit() that could result in new work
<fche> if not, no problem.
<kerneltoast> it should be covered by stp_task_work_exit()
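One way to picture the guarantee being discussed, as a hedged sketch: an exit routine that first refuses new work and then completes only once nothing is in flight. The names and the counter-plus-flag design are assumptions for illustration; the log does not show how stp_task_work_exit() is actually implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int in_flight;   /* task workers queued or currently running */
static atomic_bool stopping;   /* set once shutdown has begun */

/* Try to register a new piece of work; refused once shutdown has begun.
 * The counter is bumped before the flag is checked so that shutdown
 * cannot miss a worker that slipped in just before the flag was set. */
static bool work_try_begin(void)
{
    atomic_fetch_add(&in_flight, 1);
    if (atomic_load(&stopping)) {
        atomic_fetch_sub(&in_flight, 1);
        return false;
    }
    return true;
}

/* Called by a worker when it has finished. */
static void work_end(void)
{
    atomic_fetch_sub(&in_flight, 1);
}

/* Begin shutdown: from now on no new work is accepted. */
static void work_exit_begin(void)
{
    atomic_store(&stopping, true);
}

/* True once every in-flight worker has finished. A real exit routine
 * would sleep or spin until this holds before tearing anything down. */
static bool work_exit_done(void)
{
    return atomic_load(&in_flight) == 0;
}
```

If the exit path behaves like this, the gap fche asks about is closed as long as nothing after the exit call can register work through some other route.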
<fche> ok. had a chance to run that through the suite?
<kerneltoast> nope, guess it's time to bust out the serial testsuite run
<fche> yupper
amerey has quit [Remote host closed the connection]
irker782 has quit [Quit: transmission timeout]