fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
orivej has quit [Ping timeout: 244 seconds]
wmealing has joined #systemtap
* wmealing waves
fche has joined #systemtap
aryehw has quit [Quit: Leaving]
ema has quit [Remote host closed the connection]
CME has quit [Ping timeout: 265 seconds]
orivej has joined #systemtap
mjw has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
wmealing has quit [Remote host closed the connection]
<invano> hey fche it's me, again
<invano> I'm seeing a kernel hang on mips I've never seen before, when I was using systemtap 2.7 and an older kernel.
<invano> I have a simple script probe nd_syscall.*, probe nd_syscall.*.return
<invano> It appears there is a problem on kretprobes and I'm debugging the kernel right now
<invano> basically the kernel enters the kretprobe trampoline but the return address is not updated so the trampoline jumps over itself
<invano> global symbol "kretprobe_trampoline_holder" in kernel
<invano> This happens if I include all syscalls and, instead, it's not triggered if I probe only a bunch of them
<invano> I'm checking the code of systemtap/kernel and I'm debugging right now, I was only wondering if something like this ever happened in the past and/or you have some feelings on what could cause this
<fche> hi
<fche> doesn't sound too familiar, but the recent meltdown/spectre workarounds did impact k*probe operation at some point, causing all kinds of fun crashes/failures
<fche> so maybe worth looking over the date range of the kernel and avoid anything from ...dunno ... january-july 2018 ? (being paranoid)
<fche> if the failures are limited to kretprobes, that's more likely to localize to a piece of kernel or perhaps stap code
<invano> but I'm on mips
<fche> understood, I'm saying I wouldn't be surprised if those bugs also hit that
<invano> ahh ok
<invano> sorry
<fche> (but that's just speculation on my part)
<invano> yeah got it
brolley has joined #systemtap
orivej has joined #systemtap
tromey has joined #systemtap
<agentzh> fche: committed. thanks!
<fche> thanks
<agentzh> fche: what do you think of this small patch? https://sourceware.org/ml/systemtap/2018-q3/msg00055.html
<agentzh> is it okay to commit it too?
<fche> nope, that's not related
<fche> we need -Werror for the on-the-fly-compiled .c bits that stap itself generates
<fche> (like that comment block says)
<agentzh> it's still using -Werror by default.
<fche> a build-tree configure option is too blunt an instrument
<agentzh> i just add an option to disable it at compile time of stap.
<fche> that'll also nuke the stapconf* bits that we need
<fche> would be interested in seeing the a == a cases and the accompanying warnings/errors
<fche> maybe we can cure them with a more finely tuned -W option, or by changing our generated code
<agentzh> fche: see https://pastebin.com/468bLyLY
<agentzh> that's our test cases to cover that patch.
<fche> wouldn't mind adding '-Wno-tautological-compare' to the buildrun.cxx-generated makefiles
<fche> to suppress that warning, as opposed to suppressing warnings-as-errors
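For illustration, a minimal C sketch of the kind of "a == a" self-comparison mentioned above. The function name `self_compare` is hypothetical; it is a stand-in, not actual output of the stap translator, and the exact generated code that triggered the warning is only in the linked pastebin.

```c
/* Hypothetical stand-in for a generated "a == a" expression;
 * this is NOT real stap translator output. */
int self_compare(long a)
{
    /* GCC reports this comparison under -Wtautological-compare
     * (enabled by -Wall); with -Werror the warning is fatal. */
    return a == a;
}
```

With a recent GCC, building this with `-Wall -Werror` would be expected to fail, while adding `-Wno-tautological-compare` silences only this diagnostic and leaves every other warning fatal, which is the narrower fix suggested above.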
<agentzh> that sounds good to me.
<agentzh> oh, btw, will you be open to adding an alternative test scaffold to stap? like the one i just showed? it would be much more easier to write new tests or debug test failures in existing tests.
<agentzh> *much easier
<agentzh> it's more declarative and data driven.
<fche> that's a tough one
<agentzh> we can keep both in the official source tree.
<fche> for that particular case, you could just have added a testsuite/buildok/FOOBAR.stp file with that one-liner in it, that's all
<agentzh> right now, it requires writing several small files for the same test case.
<agentzh> it would be nice to put all the small pieces together, as in this test case for our @vma() patch: https://pastebin.com/PxSNdmFk
<agentzh> it will encourage us to write way more tests for our patches :)
<agentzh> and it supports parallel testing too, just run the command "prove -j8 t/*.t" where t/*.t are the test files.
<agentzh> multiple backends are supported too, by default both kernel and dyninst runtimes are run, and each test case can explicitly turn on or off a particular runtime by specifying "--- bpf" or "--- no_kernel" and etc.
<agentzh> the test output is also much less verbose: https://pastebin.com/iJCL8dHv
<agentzh> and test failures are much easier to see as well: https://pastebin.com/cCLefpGS
<agentzh> no need to dig up separate systemtap.log files for failure details...
<fche> there's normally a single (big) systemtap.log
<fche> anyway I see kind of what you mean - some things could be simplified -- but with a bit of dejagnu/tcl work, one could automate the multi-runtime thing too
<agentzh> the most painful bit is that we now have to write several separate small files to write a single test case, like a .stp file, a .c file and a .exp file.
<fche> would have a hard time justifying a second test framework (with new prereqs, incompatible reporting)
<fche> agentzh, let's try simplifying that further; as I mentioned we have done that for the -p4 cases, and also for syscalls
<agentzh> and even worse, to see what's going on with a test failure, we have to dig a big and separate systemtap.log file instead of simply having a quick glance at the test run output on the terminal.
<fche> if you can characterize a new family of tests that would benefit from abbreviation, please describe them
<fche> one can run a single test case with dejagnu (make installcheck RUNTESTFLAGS=foobar.exp)
<agentzh> yeah, i know that RUNTESTFLAGS thing.
<agentzh> in the new test scaffold, it's as easy as adding a --- ONLY line to the test block in question.
<agentzh> or a --- SKIP line to skip it.
<agentzh> re tests that would benefit from abbreviation: https://pastebin.com/HV2ua2He these are our test cases for the @vma(addr, module) feature we just did.
<agentzh> they are like documentation.
<fche> we also have a bunch of .exp files that carry .stp / .c parts within them
<fche> I am not a fan of that style, but that's easily done there too
<agentzh> and this is the test file for the stat-typed function parameters feature: https://pastebin.com/w7HPUJDv
<agentzh> all these tests are already passing completely on my side, btw.
<agentzh> fche: re .exp files that carry .stp / .c parts: that's still tcl coding though, but sure it could be hacked up.
<fche> yeah, much like how these .t files are just python coding :)
<fche> anyway ... I'm not going to ask that all that .t work be redone as .exp - that's not your burden :). I don't mind pulling in the .t files, but understand that we're not really in a position to run them
* fche doesn't actually recognize the testsuite framework here; is it a perl thing?
<agentzh> yeah, it's a perl thing.
<agentzh> it's based on perl's Test::Base framework.
<agentzh> but the perl code is just 2 lines at the beginning of each .t file.
<agentzh> and they are always the same :)
<agentzh> but yeah, i do understand your concerns. they are all valid points. i think we'll just make the test scaffold emit .exp/.c/.stp files targeting stap's current test scaffold. that's the beauty of being data driven and declarative. it would be much much harder the other way around, if not impossible.
* agentzh used to write a tool to convert gnu make's perl 4 testing code into the Test::Base data driven syntax.
* agentzh remembers the pain of parsing perl 4 code.
<fche> that sounds like a plan. emitting a single .exp file from your .t is also possible, if it helps
<agentzh> fche: yeah, that would be definitely easier.
<agentzh> do you have such .exp files for our reference. would be great to have some samples :)
<agentzh> *?
<fche> systemtap.base/pr18649.exp e.g.
<fche> (I don't much like this model because it tends to create temporary files, which are gone by the time one may want to hand-rerun the test)
<agentzh> thanks! as long as you would accept our patches with such tests :)
<fche> would be glad to take a look
<agentzh> okay, cool. we could always change the model of the emitted tests anyway. they are auto-generated :)
<fche> and I wouldn't be surprised if we can make the .exp system more declarative for our use cases
<agentzh> yeah, sure, that'll definitely be possible, though it would need quite some work.
* agentzh lacks the motivation to hack tcl/expect/dejagnu
<fche> hehe yeah
<fche> understood
<agentzh> i'll try to get some generated .exp files to show you soon.
<fche> ok
<agentzh> thanks for your feedback.
<agentzh> i've already got the stat-func-arg feature working fully. i'll also continue working on the array-func-arg feature today.
* agentzh likes to move fast.
<fche> is the idea there to pass aliases of the entire array-of-whatever or singleton-stat to a function?
<fche> i.e., pass by reference? that's different from the normal pass-by-value approach
<agentzh> yeah, it's passing by references.
<agentzh> and i had to change the optimizers and analyzers in elaborate.cxx to follow references.
<agentzh> like varuse collector, stat decl collector, and etc.
<agentzh> so that a function can be shared among different stat vars and different arrays.
<fche> and what if two different signature arrays are passed
<agentzh> then an arity mismatch error would be emitted at Pass 2.
<agentzh> or Pass 3?
<agentzh> the same applies to incompatible stat types.
<agentzh> like a hist log and a hist linear
<agentzh> but count/avg/min/sum stat ops would be merged and collected among all the ref graph.
<fche> hm, we probably talked about this, but are you sure that macros are not sufficient to express this?
<agentzh> nope, macros lack control flow and statement support.
<agentzh> and also lacking code sharing, it's inlining per se :)
<agentzh> oh, btw, i'm thinking about adding backtrace info to stap's runtime errors.
<fche> https://pastebin.com/w7HPUJDv <-- from here, which TEST would be the best demonstration why a macro isn't enough?
<agentzh> it would be much easier to debug huge .stp files.
<fche> that could be useful
<agentzh> so when stap unwinds the function calling stack with c->last_error, it can also append the current frame info.
<fche> or just store the function name in a context->locals[] array slot
<fche> and if an error is detected, record the nesting depth, then traverse that array to the noted depth at reporting time
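The idea in the two lines above can be sketched in plain C. This is only an illustration under stated assumptions: `struct context`, `MAX_NESTING`, and the `ctx_*` helpers are hypothetical names, not stap's real runtime context or its `locals[]` layout. Each call stores its function name in a per-depth slot, an error records the depth at which it occurred, and the report walks the slots up to that depth after the stack has unwound.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NESTING 16            /* hypothetical nesting limit */

struct context {                  /* hypothetical, not stap's struct context */
    const char *funcs[MAX_NESTING]; /* one function-name slot per depth */
    int depth;                    /* current nesting depth */
    int error_depth;              /* depth noted at the error, -1 if none */
    const char *last_error;       /* first error message, NULL if none */
};

void ctx_init(struct context *c)
{
    c->depth = 0;
    c->error_depth = -1;
    c->last_error = NULL;
}

/* Called on function entry; returns -1 if nesting is too deep. */
int ctx_enter(struct context *c, const char *func)
{
    if (c->depth >= MAX_NESTING)
        return -1;
    c->funcs[c->depth++] = func;
    return 0;
}

/* Called on function exit; slots above depth keep their old names. */
void ctx_leave(struct context *c)
{
    assert(c->depth > 0);
    c->depth--;
}

/* Record only the first error and the depth at which it happened. */
void ctx_error(struct context *c, const char *msg)
{
    if (c->last_error == NULL) {
        c->last_error = msg;
        c->error_depth = c->depth;
    }
}

/* Format the frames innermost first; returns the number of frames written. */
int ctx_backtrace(const struct context *c, char *buf, size_t len)
{
    size_t used = 0;
    int n = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';
    for (int i = c->error_depth - 1; i >= 0; i--) {
        int w = snprintf(buf + used, len - used, "%s%s",
                         n ? " <- " : "", c->funcs[i]);
        if (w < 0 || (size_t)w >= len - used) {
            buf[used] = '\0';     /* drop a truncated frame */
            break;
        }
        used += (size_t)w;
        n++;
    }
    return n;
}
```

Because `ctx_leave` only decrements the depth, the name slots survive unwinding, so the backtrace can still be produced at reporting time. Storing line numbers as well would mean widening each slot, which this sketch does not attempt.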
<agentzh> re which TEST would be the best demonstration why a macro isn't enough? not in those tests, i'll write you a more realistic one.
<fche> ok, curious
<agentzh> not exactly correct, but it shows the idea.
<agentzh> for this particular example, we could also make delete statement work on the expression level.
<agentzh> so that a macro would also work.
<agentzh> but now, it can't.
<fche> so a plain statement-expression extension would do though?
<fche> the equivalent of the gnu-c ({ }) ?
<agentzh> for this very example, yes.
<agentzh> but because i'm working on a general c-to-stap compiler, it needs something much more general.
<fche> are you sure? the compiler could emit ({}) stuff too, if that existed
<fche> IOW I wonder whether that single general facility in the scripting language would make unnecessary other more intricate & policy changes (w.r.t. passing arrays by reference)
<agentzh> there can be complicated control flow inside the functions.
<fche> macros expand to anything; ({ }) hypothetically can do loops etc. too
<agentzh> fche: the changes are not big. the patch for stat-func-arg is minimal.
<agentzh> fche: it cannot do return, can it?
<fche> ({ }) doesn't exist yet, so indeed can't ... an early exit from the stmt block seems reasonable though
<agentzh> the function may want to return something early when some condition is met.
<agentzh> sorry for the confusion, i was talking about gnu C's ({...}).
<fche> aha; we could do more than they, as long as the concept is simple
<agentzh> one of the biggest hurdles is the macro expansion's resulting code size.
<agentzh> our stap functions can be quite large and have many call sites.
<agentzh> and those functions may call other functions further.
<agentzh> and it would also be tricky to get runtime backtraces for macros in case of runtime errors.
<fche> yeah ...
<agentzh> but i do agree ({...}) has its own metrits.
<agentzh> *merits
<fche> wonder how serious the c-to-stap case should be taken ... we take a lot of algorithmic shortcuts in the optimization / etc. passes, with the presumption that stap scripts just aren't that large
<agentzh> it's very handy for code emitters for many cases where functions are not needed.
<fche> but if you think you want to generate a ton of code - where inlined code size starts to matter - then I wonder if other parts of stap will bog you down at least as badly
<agentzh> fche: not yet, we used to manually port tons of C code to stap and they work pretty well in production.
<agentzh> now we decide to stop doing that manually, it's really painful :)
<agentzh> and once we hit another hurdle, we can always try fixing it :)
<agentzh> right now, the biggest hurdle is the function arg thing.
<agentzh> it's a showstopper for us.
<fche> would've been good to hear it before a lot of work was done
<fche> just to see if all the options were explored okay
<fche> how does your c-to-stap translator need passing arrays to functions ?
<agentzh> understood. we just have a lot of pressure from the business side. so we'll try our best.
<fche> what sorts of c constructs map to that?
orivej has quit [Ping timeout: 240 seconds]
<agentzh> it is used to emulate C's output arguments.
<agentzh> for example.
<agentzh> so ideally stap could support scalar's references in function arguments too.
<agentzh> now we use single-element arrays a lot, which is wasteful.
<agentzh> but working though.
<fche> could instead use one big array, and use indexes?
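To make the two options concrete, here is a small C sketch contrasting a real output argument with the "one big array plus index" emulation suggested above. All names (`slots`, `NSLOTS`, `divmod_ptr`, `divmod_slot`, `slot_get`) are hypothetical and are not taken from the c-to-stap compiler under discussion; the global array merely plays the role a single shared script-level array would play.

```c
#include <assert.h>

#define NSLOTS 8                /* hypothetical pool size */
static long slots[NSLOTS];      /* the "one big array" shared by all callers */

/* Plain C: the remainder comes back through an output argument. */
long divmod_ptr(long a, long b, long *rem)
{
    *rem = a % b;
    return a / b;
}

/* Emulation: the caller passes a slot index instead of a pointer,
 * so no array ever has to be passed to the function itself. */
long divmod_slot(long a, long b, int rem_slot)
{
    assert(rem_slot >= 0 && rem_slot < NSLOTS);
    slots[rem_slot] = a % b;
    return a / b;
}

/* Read a slot back after the call. */
long slot_get(int rem_slot)
{
    assert(rem_slot >= 0 && rem_slot < NSLOTS);
    return slots[rem_slot];
}
```

The trade-off is that the caller now has to allocate and track slot indexes, which is the kind of bookkeeping a code generator can do but a person writing scripts by hand may find awkward.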
<agentzh> we could, but right now we expose a "builtin array" type to the language level. the source language is a superset of C11.
<agentzh> our compiler emits python arrays for such constructs in its gdb/python backend.
<agentzh> it has multiple backends, not just targeting stap.
<agentzh> so aggressive optimizations would require aggressive sematnic analyses in our compiler.
<agentzh> *semantic
<agentzh> my current stat-func-arg patch is about 400 lines in total, not much. most of the code is just small refactoring of existing code. true additions are much less.
<agentzh> i can show you what i already have.
<agentzh> a sec.
<fche> I'm concerned about the pass-by-reference change to the model too. not sure about that at all. maybe some new syntax for that?
<agentzh> i think arrays and stats should just be references.
<agentzh> that's natural.
<agentzh> it makes little sense to do C/C++'s copy-by-value by default.
mjw has quit [Quit: Leaving]
<agentzh> here we go: https://pastebin.com/sewLdTz4
<agentzh> it's still a bit messy due to the debugging code. not ready for formal review.
<fche> would have to think hard about making sure e.g. alias-detection algorithms work with this sort of thing, everywhere
<agentzh> *nod*
* agentzh has been thinking hard himself these days.
<agentzh> and that's also why i really want to get the official stap test suite running on my machines :)
<fche> yeah, definitely
tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]
orivej has joined #systemtap
<agentzh> fche: about adding a quit() tapset func instead of abort()? Because abort() in C also results in core dump and erroneous exit code of the current process, which is not what we need here.
<agentzh> *how about
<fche> sure
<agentzh> ok, thanks
<agentzh> re or just store the function name in a context->locals[] array slot: yeah, i've been thinking along the same line, though i also need the line numbers, not just the function names.
<agentzh> but we could also store all the info in an array which we can quickly look up at runtime in case of errors.
<agentzh> that would sound fun.
<agentzh> we'll definitely give it a shot at some point.
brolley has left #systemtap [#systemtap]
zodbot has quit [Disconnected by services]
agentzh has quit [*.net *.split]
zodbot has joined #systemtap
fche2 has joined #systemtap
fche has quit [Ping timeout: 240 seconds]