fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
orivej has quit [Ping timeout: 244 seconds]
wmealing has joined #systemtap
* wmealing
waves
fche has joined #systemtap
aryehw has quit [Quit: Leaving]
ema has quit [Remote host closed the connection]
CME has quit [Ping timeout: 265 seconds]
orivej has joined #systemtap
mjw has joined #systemtap
orivej has quit [Ping timeout: 256 seconds]
wmealing has quit [Remote host closed the connection]
<invano>
hey fche it's me, again
<invano>
I'm seeing a kernel hang on mips that I never saw before, back when I was using systemtap 2.7 and an older kernel.
<invano>
I have a simple script probe nd_syscall.*, probe nd_syscall.*.return
<invano>
It appears there is a problem on kretprobes and I'm debugging the kernel right now
<invano>
basically the kernel enters the kretprobe trampoline but the return address is not updated so the trampoline jumps over itself
<invano>
global symbol "kretprobe_trampoline_holder" in kernel
<invano>
This happens if I include all syscalls and, instead, it's not triggered if I probe only a bunch of them
<invano>
I'm checking the code of systemtap/kernel and I'm debugging right now, I was only wondering if something like this ever happened in the past and/or you have some feelings on what could cause this
<fche>
hi
<fche>
doesn't sound too familiar, but the recent meltdown/spectre workarounds did impact k*probe operation at some point, causing all kinds of fun crashes/failures
<fche>
so maybe worth looking over the date range of the kernel and avoid anything from ... dunno ... january-july 2018 ? (being paranoid)
<fche>
if the failures are limited to kretprobes, that's more likely to localize to a piece of kernel or perhaps stap code
<invano>
but I'm on mips
<fche>
understood, I'm saying I wouldn't be surprised if those bugs also hit that
<agentzh>
those are our test cases to cover that patch.
<fche>
wouldn't mind adding '-Wno-tautological-compare' to the buildrun.cxx-generated makefiles
<fche>
to suppress that warning, as opposed to suppressing warnings-as-errors
<agentzh>
that sounds good to me.
<agentzh>
oh, btw, will you be open to adding an alternative test scaffold to stap? like the one i just showed? it would be much easier to write new tests or debug test failures in existing tests.
<agentzh>
it's more declarative and data driven.
<fche>
that's a tough one
<agentzh>
we can keep both in the official source tree.
<fche>
for that particular case, you could just have added a testsuite/buildok/FOOBAR.stp file with that one-liner in it, that's all
<agentzh>
right now, it requires writing several small files for the same test case.
<agentzh>
it would be nice to put all the small pieces together, as in this test case for our @vma() patch: https://pastebin.com/PxSNdmFk
<agentzh>
it will encourage us to write way more tests for our patches :)
<agentzh>
and it supports parallel testing too, just run the command "prove -j8 t/*.t" where t/*.t are the test files.
<agentzh>
multiple backends are supported too, by default both kernel and dyninst runtimes are run, and each test case can explicitly turn on or off a particular runtime by specifying "--- bpf" or "--- no_kernel", etc.
<agentzh>
no need to dig up separate systemtap.log files for failure details...
<fche>
there's normally a single (big) systemtap.log
<fche>
anyway I see kind of what you mean - some things could be simplified -- but with a bit of dejagnu/tcl work, one could automate the multi-runtime thing too
<agentzh>
the most painful bit is that we now have to write several separate small files to write a single test case, like a .stp file, a .c file and a .exp file.
<fche>
would have a hard time justifying a second test framework (with new prereqs, incompatible reporting)
<fche>
agentzh, let's try simplifying that further; as I mentioned we have done that for the -p4 cases, and also for syscalls
<agentzh>
and even worse, to see what's going on with a test failure, we have to dig through a big and separate systemtap.log file instead of simply having a quick glance at the test run output on the terminal.
<fche>
if you can characterize a new family of tests that would benefit from abbreviation, please describe them
<fche>
one can run a single test case with dejagnu (make installcheck RUNTESTFLAGS=foobar.exp)
<agentzh>
yeah, i know that RUNTESTFLAGS thing.
<agentzh>
in the new test scaffold, it's as easy as adding a --- ONLY line to the test block in question.
<agentzh>
or a --- SKIP line to skip it.
<agentzh>
re tests that would benefit from abbreviation: https://pastebin.com/HV2ua2He these are our test cases for the @vma(addr, module) feature we just did.
<agentzh>
they are like documentation.
<fche>
we also have a bunch of .exp files that carry .stp / .c parts within them
<fche>
I am not a fan of that style, but that's easily done there too
<agentzh>
all these tests are already passing completely on my side, btw.
<agentzh>
fche: re .exp files that carry .stp / .c parts: that's still tcl coding though, but sure it could be hacked up.
<fche>
yeah, much like how these .t files are just python coding :)
<fche>
anyway ... I'm not going to ask that all that .t work be redone as .exp - that's not your burden :). I don't mind pulling in the .t files, but understand that we're not really in a position to run them
* fche
doesn't actually recognize the testsuite framework here; is it a perl thing?
<agentzh>
yeah, it's a perl thing.
<agentzh>
it's based on perl's Test::Base framework.
<agentzh>
but the perl code is just 2 lines at the beginning of each .t file.
<agentzh>
and they are always the same :)
<agentzh>
but yeah, i do understand your concerns. they are all valid points. i think we'll just make the test scaffold emit .exp/.c/.stp files targeting stap's current test scaffold. that's the beauty of being data driven and declarative. it would be much much harder the other way around, if not impossible.
* agentzh
once wrote a tool to convert gnu make's perl 4 testing code into the Test::Base data driven syntax.
* agentzh
remembers the pain of parsing perl 4 code.
<fche>
that sounds like a plan. emitting a single .exp file from your .t is also possible, if it helps
<agentzh>
fche: yeah, that would be definitely easier.
<agentzh>
do you have such .exp files for our reference? would be great to have some samples :)
<fche>
systemtap.base/pr18649.exp e.g.
<fche>
(I don't much like this model because it tends to create temporary files, which are gone by the time one may want to hand-rerun the test)
<agentzh>
thanks! as long as you would accept our patches with such tests :)
<fche>
would be glad to take a look
<agentzh>
okay, cool. we could always change the model of the emitted tests anyway. they are auto-generated :)
<fche>
and I wouldn't be surprised if we can make the .exp system more declarative for our use cases
<agentzh>
yeah, sure, that'll definitely be possible, though it needs quite some work.
* agentzh
lacks the motivation to hack tcl/expect/dejagnu
<fche>
hehe yeah
<fche>
understood
<agentzh>
i'll try to get some generated .exp files to show you soon.
<fche>
ok
<agentzh>
thanks for your feedback.
<agentzh>
i've already got the stat-func-arg feature working fully. i'll also continue working on the array-func-arg feature today.
* agentzh
likes to move fast.
<fche>
is the idea there to pass aliases of the entire array-of-whatever or singleton-stat to a function?
<fche>
i.e., pass by reference? that's different from the normal pass-by-value approach
<agentzh>
yeah, it's passing by references.
<agentzh>
and i had to change the optimizers and analyzers in elaborate.cxx to follow references.
<agentzh>
like the varuse collector, the stat decl collector, etc.
<agentzh>
so that a function can be shared among different stat vars and different arrays.
<fche>
and what if two arrays with different signatures are passed?
<agentzh>
then an arity mismatch error would be emitted at Pass 2.
<agentzh>
or Pass 3?
<agentzh>
the same applies to incompatible stat types.
<agentzh>
like a hist log and a hist linear
<agentzh>
but count/avg/min/sum stat ops would be merged and collected across the whole ref graph.
<fche>
hm, we probably talked about this, but are you sure that macros are not sufficient to express this?
<agentzh>
nope, macros lack control flow and statement support.
<agentzh>
and they also lack code sharing; it's inlining per se :)
<agentzh>
oh, btw, i'm thinking about adding backtrace info to stap's runtime errors.
<fche>
https://pastebin.com/w7HPUJDv <-- from here, which TEST would be the best demonstration why a macro isn't enough?
<agentzh>
it would be much easier to debug huge .stp files.
<fche>
that could be useful
<agentzh>
so when stap unwinds the function calling stack with c->last_error, it can also append the current frame info.
<fche>
or just store the function name in a context->locals[] array slot
<fche>
and if an error is detected, record the nesting depth, then traverse that array to the noted depth at reporting time
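A minimal sketch of the bookkeeping fche outlines here, written in Python rather than the C of the stap runtime. `Context`, `frames`, `error_depth`, and the two sample functions are hypothetical names chosen for illustration; none of them are taken from stap's source.

```python
class Context:
    """Hypothetical stand-in for a per-probe context; not stap's real struct."""

    def __init__(self, max_depth=8):
        self.frames = [None] * max_depth  # one slot per nesting level
        self.depth = 0                    # current nesting depth
        self.error_depth = 0              # depth noted when the error was detected
        self.last_error = None

    def enter(self, name):
        # Store the function name in the slot for the current depth.
        self.frames[self.depth] = name
        self.depth += 1

    def leave(self):
        # Slots are deliberately not cleared, so they survive unwinding.
        self.depth -= 1

    def fail(self, message):
        # Record only the first error, together with how deep we were.
        if self.last_error is None:
            self.last_error = message
            self.error_depth = self.depth

    def backtrace(self):
        # Traverse the slots up to the noted depth, innermost frame first.
        return list(reversed(self.frames[:self.error_depth]))


def inner(ctx):
    ctx.enter("inner")
    ctx.fail("division by zero")
    ctx.leave()


def outer(ctx):
    ctx.enter("outer")
    inner(ctx)
    ctx.leave()


ctx = Context()
outer(ctx)
print(ctx.last_error, ctx.backtrace())
```

The sketch assumes that no further function calls are made between the error and the report; otherwise a later call could overwrite a slot that the backtrace still needs.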
<agentzh>
re which TEST would be the best demonstration why a macro isn't enough? not in those tests, i'll write you a more realistic one.
<agentzh>
not exactly correct, but it shows the idea.
<agentzh>
for this particular example, we could also make delete statement work on the expression level.
<agentzh>
so that a macro would also work.
<agentzh>
but now, it can't.
<fche>
so a plain statement-expression extension would do though?
<fche>
the equivalent of the gnu-c ({ }) ?
<agentzh>
for this very example, yes.
<agentzh>
but because i'm working on a general c-to-stap compiler, it needs something much more general.
<fche>
are you sure? the compiler could emit ({}) stuff too, if that existed
<fche>
IOW I wonder whether that single general facility in the scripting language would make unnecessary other more intricate & policy changes (w.r.t. passing arrays by reference)
<agentzh>
there can be complicated control flow inside the functions.
<fche>
macros expand to anything; ({ }) hypothetically can do loops etc. too
<agentzh>
fche: the changes are not big. the patch for stat-func-arg is minimal.
<agentzh>
fche: it cannot do return, can it?
<fche>
({ }) doesn't exist yet, so indeed can't ... an early exit from the stmt block seems reasonable though
<agentzh>
the function may want to return something early when some condition is met.
<agentzh>
sorry for the confusion, i was talking about gnu C's ({...}).
<fche>
aha; we could do more than they, as long as the concept is simple
<agentzh>
one of the biggest hurdles is the macro expansion's resulting code size.
<agentzh>
our stap functions can be quite large and have many call sites.
<agentzh>
and those functions may call other functions further.
<agentzh>
and it would also be tricky to get runtime backtraces for macros in case of runtime errors.
<fche>
yeah ...
<agentzh>
but i do agree ({...}) has its own merits.
<fche>
wonder how seriously the c-to-stap case should be taken ... we take a lot of algorithmic shortcuts in the optimization / etc. passes, with the presumption that stap scripts just aren't that large
<agentzh>
it's very handy for code emitters for many cases where functions are not needed.
<fche>
but if you think you want to generate a ton of code - where inlined code size starts to matter - then I wonder if other parts of stap will bog you down at least as badly
<agentzh>
fche: not yet, we used to manually port tons of C code to stap and it works pretty well in production.
<agentzh>
now we've decided to stop doing that manually, it's really painful :)
<agentzh>
and once we hit another hurdle, we can always try fixing it :)
<agentzh>
right now, the biggest hurdle is the function arg thing.
<agentzh>
it's a showstopper for us.
<fche>
would've been good to hear it before a lot of work was done
<fche>
just to see if all the options were explored okay
<fche>
how does your c-to-stap translator need passing arrays to functions ?
<agentzh>
understood. we just have a lot of pressure from the business side. so we'll try our best.
<fche>
what sorts of c constructs map to that?
orivej has quit [Ping timeout: 240 seconds]
<agentzh>
it is used to emulate C's output arguments.
<agentzh>
for example.
<agentzh>
so ideally stap could support references to scalars in function arguments too.
<agentzh>
now we use single-element arrays a lot, which is wasteful.
<agentzh>
but working though.
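To illustrate the pattern agentzh describes, here is a small Python sketch (not stap code, and not output of the compiler under discussion) in which a single-element list plays the role of a C output argument. `divmod_out` and its parameter names are hypothetical.

```python
def divmod_out(a, b, remainder_out):
    """Mimics a C function like `int divmod_out(int a, int b, int *rem)`.

    remainder_out is a one-element list used purely as a mutable cell,
    so the callee can hand a second result back to the caller.
    """
    remainder_out[0] = a % b
    return a // b


rem = [0]                      # the "single-element array"
quot = divmod_out(17, 5, rem)  # the callee writes through the reference
print(quot, rem[0])
```

This only works because the container is passed by reference; with pass-by-value semantics the write to `remainder_out[0]` would be lost, which is why the function-argument change matters for this use case.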
<fche>
could instead use one big array, and use indexes?
<agentzh>
we could, but right now we expose a "builtin array" type to the language level. the source language is a superset of C11.
<agentzh>
our compiler emits python arrays for such constructs in its gdb/python backend.
<agentzh>
it has multiple backends, not just targeting stap.
<agentzh>
so aggressive optimizations would require aggressive semantic analyses in our compiler.
<agentzh>
my current stat-func-arg patch is 400 lines in total, not much. most of the code is just small refactoring of existing code. true additions are much less.
<agentzh>
i can show you what i already have.
<agentzh>
a sec.
<fche>
I'm concerned about the pass-by-reference change to the model too. not sure about that at all. maybe some new syntax for that?
<agentzh>
i think arrays and stats should just be references.
<agentzh>
that's natural.
<agentzh>
it makes little sense to do C/C++'s copy-by-value by default.
<agentzh>
it's still a bit messy due to the debugging code. not ready for formal review.
<fche>
would have to think hard about making sure e.g. alias-detection algorithms work with this sort of thing, everywhere
<agentzh>
*nod*
* agentzh
has been thinking hard himself these days.
<agentzh>
and that's also why i really want to get the official stap test suite running on my machines :)
<fche>
yeah, definitely
tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]
orivej has joined #systemtap
<agentzh>
fche: how about adding a quit() tapset func instead of abort()? Because abort() in C also results in a core dump and an erroneous exit code for the current process, which is not what we need here.
<fche>
sure
<agentzh>
ok, thanks
<agentzh>
re "or just store the function name in a context->locals[] array slot": yeah, i've been thinking along the same lines, though i also need the line numbers, not just the function names.
<agentzh>
but we could also store all the info in an array which we can quickly look up at runtime in case of errors.
<agentzh>
that sounds fun.
<agentzh>
we'll definitely give it a shot at some point.