fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
hpt has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
hpt has quit [Ping timeout: 256 seconds]
hpt has joined #systemtap
hpt has quit [Ping timeout: 256 seconds]
khaled_ has quit [Quit: Konversation terminated!]
orivej has joined #systemtap
hpt has joined #systemtap
orivej has quit [Ping timeout: 246 seconds]
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
hpt has quit [Ping timeout: 240 seconds]
khaled has joined #systemtap
orivej has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 264 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
orivej has joined #systemtap
derek0883 has joined #systemtap
derek0883 has quit [Read error: Connection reset by peer]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
khaled has quit [Quit: Konversation terminated!]
khaled has joined #systemtap
<kerneltoast> fche, bonjour
<kerneltoast> out fishing i guess?
khaled has quit [Remote host closed the connection]
irker718 has joined #systemtap
<irker718> systemtap: sultan systemtap.git:master * release-4.4-47-gb26b4e2c2 / runtime/linux/task_finder2.c: task_finder2: fix panics due to broken task work cancellation
khaled has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
<kerneltoast> fche, that patch i just pushed didn't fix the issue entirely. now there's a new, rarer bug when `tf_work == tmp`
<kerneltoast> because list_for_each_entry_safe doesn't expect you to free the next element of the list
<fche> maybe a while(list-not-empty) { work = .... stp_kfree (tf_work) ; } instead of that list_for_each_entry_safe ?
<fche> where node = list head every time?
<kerneltoast> can't because list head could be an active task worker that we always fail to cancel
<kerneltoast> fche, there's this option: https://paste.centos.org/view/caa71a3c
<fche> aha. is that the only case of concern for list_for_each .. ?\
<kerneltoast> yeah, if we free a node way later into the list or if we free the current node, it's no problem
<kerneltoast> but if we free the next node, everything combusts
<fche> can't go backward can it?
<kerneltoast> nope
<fche> ok, please add a comment to the code to explain why
<kerneltoast> of course
orivej has joined #systemtap
<kerneltoast> fche, i have a theory that there could be another issue in here
<kerneltoast> so, when __stp_tf_cancel_all_task_work() finishes, there can still be taskfinder task workers running
<kerneltoast> when that happens, we rely on the belt and suspenders in the _stp_task_work API to wait until all task workers are freed before unloading the module
<kerneltoast> the potential problem i see is that the instructions to notify the _stp_task_work API of a task worker finishing are not the last ones in the actual task worker function
<kerneltoast> take a look at __stp_tf_task_worker_fn()
<kerneltoast> tf_work->func(&tf_work->work); <--- this is where the _stp_task_work API will be notified of the final task worker being freed
<fche> can we not wait until all are done before we let cancel_all_task_work return?
<kerneltoast> any ideas for how to do that?
<kerneltoast> we could also move some code around so that tf_work->func() is not responsible for notifying the _stp_task_work API
<fche> how to wait until all are done? wait until we have no "fail to cancel" in __stp_tf_cancel_all_task_work ?
<kerneltoast> but we have to release the lock for that to occur
khaled has quit [Quit: Konversation terminated!]
<fche> can we put a loop outside the stp_spin_lock/unlock ?
<kerneltoast> yeah but it'd just be cpu_relaxin i guess
<fche> yeah busy wait, wahtever
khaled has joined #systemtap
<kerneltoast> hmm the race would still be there
<kerneltoast> there can still be a running task worker when the list is empty
<kerneltoast> it'll just be finishing up
<kerneltoast> fche, i think that paste i just sent is a good bet
<kerneltoast> granted, that paste is missing a return from an if-statement...
<kerneltoast> or two
<fche> ok, the approach makes sense
<kerneltoast> k i'll finalize the commits and send em to ya for review
<fche> ok
<fche> ok
<fche> ok
<kerneltoast> SHIPTIME
<fche> ahoy
<irker718> systemtap: sultan systemtap.git:master * release-4.4-48-g96470399a / runtime/linux/task_finder2.c: task_finder2: fix list corruption in __stp_tf_cancel_all_task_work()
<irker718> systemtap: sultan systemtap.git:master * release-4.4-49-g6cb54128e / runtime/linux/task_finder2.c: task_finder2: fix task worker race on module unload