fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
orivej has quit [Ping timeout: 240 seconds]
fche has quit [Ping timeout: 244 seconds]
<invano> fche: ok I'll try again with incredible -v combo. Yeah I'm still inspecting that @cast. Works on my x64 machine but fails when I cross-compile and I pass the external kernel tree. Also, I get @cast errors only on specific fields of the struct. Anyway I'll handle this
mjw has joined #systemtap
orivej has joined #systemtap
orivej has quit [Ping timeout: 260 seconds]
pfallenop has quit [Ping timeout: 240 seconds]
fche has joined #systemtap
orivej has joined #systemtap
<invano> hitting a similar @cast problem with func pid2task because it calls @task that calls @cast on task_struct. Problem is that every time @cast has a module declared e.g. "kernel<HEADER.h>" I get "type definition X not found in 'kernel<Y>'". The problem does not arise if I don't use the module arg in @cast. Also, this only happens when I cross-compile with a different build tree (-r). I'm going to attach a full log showing the error.
<fche> sure
<invano> here we go https://pastebin.com/zkr76dWC . error on line 1073
<fche> could you do it again with a --poison-cache added? It seems as though the explanation might hide in files left from previous runs
<invano> of course
tromey has joined #systemtap
<invano> here https://pastebin.com/4ns2097C line 1093
<fche> Copying /tmp/stap22u6zF/typequery_kmod_1/typequery_kmod_1.ko to /home/user/.systemtap/cache/3e/typequery_3e4e72f4faeacf6129c8e0c94896cc4f_1051.ko
<fche> that is the .ko file built from the @cast() expression
<fche> it's in there that stap ought look for the task_struct declaration
mjw has quit [Quit: Leaving]
<fche> does eu-readelf -w ... give nice content?
<fche> (maybe fpaste that too?)
<invano> sure. here https://pastebin.com/7ZL8Kya2
<fche> yeah, it's in there
<fche> hm actually maybe not?
<invano> trying to have a look at it here as well
<invano> I might leave soon but I'll be back in a couple of hours
<fche> righto
<fche> the eu-readelf -w of a similar typequery.*ko file on native x86-64 is much different
<fche> it's as though the arm linux/sched.h may have just a forward declaration?
<fche> e.g. here:
<fche> [ f4b] structure_type
<fche> decl_file (data1) 24
<fche> alignment (data1) 64
<fche> byte_size (data2) 11200
<fche> name (strp) "task_struct"
<fche> decl_line (data2) 524
<fche> sibling (ref4) [ 1a47]
<fche> [ f5a] member
<fche> name (strp) "thread_info"
<fche> decl_file (data1) 24
<fche> decl_line (data2) 530
<fche> type (ref4) [ 5684]
<fche> data_member_location (data1) 0
<fche> etc. etc.
<fche> I don't see any "member" tags in your dump
<agentzh> fche: do you know is there any particular reason to make the keys of session.stat_decls strings? git blame shows that that line of code dates back to 2005 :)
<agentzh> fche: i'm currently working on supporting stat values in user-defined stap function parameters. and using names in that map makes it impossible to keep track of stat-typed function parameter variables without introducing a separate map.
<agentzh> do you think it's feasible to simply using vardecl* as the keys to that stat_decls map?
<agentzh> *to simply use
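A toy Python model of the problem agentzh describes, assuming two user-defined functions that each take a stat-typed parameter named `s`. The `VarDecl` class and the stat-operator lists are hypothetical stand-ins, not the translator's C++ types. Keying by name lets the second declaration silently overwrite the first, while keying by the declaration object itself (the analogue of a `vardecl*` key) keeps both.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False keeps identity hashing, like using a pointer as a key
class VarDecl:
    """Hypothetical stand-in for a translator vardecl."""
    name: str
    function: str  # the user-defined function that owns this parameter


def index_by_name(decls):
    """Key stat declarations by variable name (the current scheme)."""
    table = {}
    for decl, stat_ops in decls:
        table[decl.name] = stat_ops  # a same-named parameter elsewhere overwrites this entry
    return table


def index_by_decl(decls):
    """Key stat declarations by the declaration object itself (the proposed scheme)."""
    return {decl: stat_ops for decl, stat_ops in decls}


f_s = VarDecl("s", "f")
g_s = VarDecl("s", "g")
decls = [(f_s, ["@count", "@avg"]), (g_s, ["@hist_log"])]

print(len(index_by_name(decls)))  # 1: f's entry was lost
print(len(index_by_decl(decls)))  # 2: both parameters tracked separately
```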
<fche> agentzh, looking
<fche> yes clearly some of that structure needs to change if the scope of these objects changes
<fche> (similar as the case of arrays)
<fche> by the way, have you considered using macros instead of functions for your purposes?
<agentzh> fche: thanks for the suggestion. i'm afraid macros are not powerful enough since i have some statements that must be used on the expression level.
* agentzh is missing gnu C's ({...}) construct in the stap language.
<fche> that one's a possibility; we added %{ %} expressions at one point
<fche> it'd be good to read more about your use case on https://sourceware.org/bugzilla/show_bug.cgi?id=6746
<agentzh> fche: oh, i didn't know about the %{ %} thing. will have a look.
<agentzh> thanks for the info.
<agentzh> and i'm not aware that there is already a ticket for it :)
<fche> those were for embedded-c only
<fche> a script level statement-expression could be helpful perhaps
<agentzh> *nod*
<agentzh> my use case is a bit bizarre.
<agentzh> i'm compiling the full C language into stap scripts :)
<agentzh> almost full, i think.
<fche> ... I'm sure you're aware of embedded-c too
<agentzh> i've already spent a few days on the stat/array arg support in user-defined functions. i think i'll submit a patch soon.
<agentzh> fche: yes, i'm aware of that, but it's kinda scary.
<agentzh> so i never actually make use of it.
<agentzh> compiling down to stap scripts (without embedded C) makes me feel safer.
<agentzh> and it would also be easier to port to other stap backends, i guess.
<agentzh> like bpf.
<fche> ok, that makes some sense -- would be curious to see the larger motivation for that work sometime
<agentzh> sure, we'll have something much bigger to show soon :)
<agentzh> it's exciting.
<agentzh> fche: are you okay with making exit() abort the currently running function or probe handler?
<agentzh> it's been bothering me a lot and i already have a patch in my send queue.
<fche> probably too much weight of tradition to change that now
<fche> you can make up a macro like exit(); next;
<agentzh> fche: that won't work in the context of functions.
<agentzh> especially for void-typed functions.
<agentzh> speaking of which, i also have a patch to support return statements without values in void functions in my send queue.
<fche> "next won't work in the context of functions" https://knowyourmeme.com/photos/1198798 :--
<fche> :-)
<agentzh> fche: i thought about using return/next statements right after exit(), but it won't work for nested function calls.
<agentzh> we want to unwind the whole call chain.
<fche> exit() error("") then?
<agentzh> and there's no way for the caller function to know there is an "exit" on the stap language level.
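A toy Python model of why exit() followed by return does not unwind nested calls: as agentzh notes, exit() only records a shutdown request, so each caller resumes after the call returns and has no way to see that exit() ran. The `Session` class and the function names are hypothetical; this is an analogy, not the code stap generates.

```python
class Session:
    """Hypothetical model of per-session state."""

    def __init__(self):
        self.exit_requested = False  # what exit() sets
        self.trace = []              # statements that actually ran


def stap_exit(s):
    """Models exit(): request shutdown, but do not unwind anything."""
    s.exit_requested = True


def inner(s):
    stap_exit(s)
    return  # models the "return right after exit()" idea


def outer(s):
    inner(s)
    s.trace.append("outer kept running")  # outer cannot tell that inner() called exit()


def probe_handler(s):
    outer(s)
    s.trace.append("handler kept running")


s = Session()
probe_handler(s)
print(s.exit_requested, s.trace)
# True ['outer kept running', 'handler kept running']
```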
<agentzh> fche: i tried error("") too but it produces an error which is a false alarm :)
<agentzh> it generates a big red ERROR to stderr ;)
<fche> error("THIS IS A NORMAL EXIT, IGNORE THIS RED THING") :-)
<agentzh> fche: oh yeah, sorry for nitpicking.
<fche> heh, yeah, I do understand. just don't think we can practically change that now
mjw has joined #systemtap
<agentzh> fche: so you won't accept a patch to abort the execution flow immediately in exit()?
<agentzh> how about a build-time option?
<fche> we could make error("") more quiet e.g.
<agentzh> but in the general case, it is really important to unwind the call chain for nested function calls.
<fche> ... especially if you start contemplating creating function-local arrays that need cleanup
<fche> but yeah, I'd suggest pursuing the error("") case - to be prettier - or a new exit() type function that does the same thing, but not changing the old exit()
<agentzh> fche: yeah, i understand the requirement of being backward compatible.
<agentzh> i thought about treating error("") specially too, just thought it was a bit too hacky for you guys to accept.
<agentzh> how about exit_now() ?
<agentzh> i haven't gotten that far in designing function-local arrays but i got your point :)
<agentzh> error("") always looks like a hack to me :)
<fche> well, at least the error mechanism is designed to unwind cleanly and set an exit-equivalent flag too
<fche> hey, you could always use error("") as a Very Clean exit, and wrap your probes in a try {} catch { exit() }
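fche's workaround can be modeled in a few lines of Python. In this hypothetical sketch an exception stands in for stap's error unwinding (an analogy only, not how the runtime implements it): error("") leaves every nested call at once, and the try/except in the handler plays the role of try { ... } catch { exit() }, turning the error into an ordinary shutdown request. All names are illustrative.

```python
class ScriptError(Exception):
    """Models a stap script-level error, which unwinds all active calls."""


class Session:
    """Hypothetical model of per-session state."""

    def __init__(self):
        self.exit_requested = False
        self.trace = []


def stap_error(msg):
    """Models error(msg): abandon the current handler immediately."""
    raise ScriptError(msg)


def stap_exit(s):
    """Models exit(): only records a shutdown request."""
    s.exit_requested = True


def inner(s):
    stap_error("")                   # the "Very Clean exit"
    s.trace.append("inner resumed")  # never reached


def outer(s):
    inner(s)
    s.trace.append("outer resumed")  # never reached either


def probe_handler(s):
    try:                 # try { outer() }
        outer(s)
    except ScriptError:  # catch { exit() }
        stap_exit(s)


s = Session()
probe_handler(s)
print(s.exit_requested, s.trace)
# True []
```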
<agentzh> yeah, i'm reusing the same mechanism for error() for my exit() patch.
<fche> see also stap --suppress-handler-errors
<agentzh> fche: yeah, that's one way to make it work, though still not very beautiful.
<agentzh> fche: i'll submit a patch to implement exit_now(), is that okay?
<fche> will be interested how it looks / works
<agentzh> okay, great, it would also be a great chance for me to learn new things from you guys :)
<fche> could call it abort() perhaps
<fche> WE DON'T LIKE TEACHING
<fche> ok well, maybe we do
<agentzh> okay, abort() then :)
<agentzh> well, giving feedback on patches is already the best teaching :)
<agentzh> it does not have to be like formal teaching at all.
<fche> and please ping regularly if we don't get to it quickly enough. I know there's a backlog :(
<agentzh> sure, thanks.
<agentzh> fche: i also have another patch to implement a new builtin, @vma(addr, module).
<agentzh> for converting relative addresses in ELF to real VM addresses in the specified module.
<agentzh> it invokes the vma tracker to do the runtime conversion if the module is PIC/PIE.
<agentzh> do you like it?
<agentzh> the motivation is to do dwarf-less probing for @var(...).
<fche> aha, neat.
<agentzh> and also for obtaining userland addrs in the target process which cannot be achieved by @var(...)
<agentzh> like C's literal string data in the .rodata section.
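The address arithmetic behind a conversion like @vma(addr, module) can be sketched as follows, assuming the run-time placement of the module's first PT_LOAD segment is already known (in stap that lookup is the vma tracker's job). `runtime_address` and its parameters are hypothetical names, not the interface of agentzh's patch. For non-PIE code the link-time address is already absolute; for PIC/PIE code the loader chooses a base, and every address shifts by the same load bias.

```python
def runtime_address(link_addr, first_load_vaddr, first_load_runtime, is_pic):
    """Map a link-time ELF address to its address in the running process.

    link_addr          -- address as seen in the ELF file (e.g. a .rodata string)
    first_load_vaddr   -- p_vaddr of the module's first PT_LOAD segment
    first_load_runtime -- where that p_vaddr actually landed at run time
    is_pic             -- True for shared objects and PIE executables
    """
    if not is_pic:
        return link_addr  # fixed-address executable: nothing to relocate
    load_bias = first_load_runtime - first_load_vaddr
    return link_addr + load_bias


# A string literal at link-time address 0x2004 in a PIE loaded at 0x55d000000000:
print(hex(runtime_address(0x2004, 0x0, 0x55D000000000, True)))    # 0x55d000002004
# The same lookup in a non-PIE binary linked at 0x400000 needs no adjustment:
print(hex(runtime_address(0x402004, 0x400000, 0x400000, False)))  # 0x402004
```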
<agentzh> fche: glad you like it, i'll submit the patch to the mailing list then.
<fche> hm, those addresses aren't easy to handle by a normal user
<agentzh> well, we already have dwarf-less probing support in the probe spec.
<agentzh> where we just specify the relative addresses.
<agentzh> like probe process.statement(0xdeadbeef) { ... }
<agentzh> so already along the same line?
<agentzh> but yeah, it's not meant for normal stap users, just advanced ones.
* agentzh is using stap as a VM anyway.
<agentzh> fche: i also have patches to add support for comma expressions (or compound expression) and allowing use of parentheses right after the unary '&' operator to the stap language. will send them soon too.
<agentzh> fche: since you are around, are you open to adding native floating-point number support to the stap language and runtime? we're also working on it.
<agentzh> do you already have some ideas or plans for it? i saw you replied to a stackoverflow question related to floating point number handling in stap.
<agentzh> i guess we'll have to do software floating-point arithmetic in the kernel module.
<fche> yes, almost certainly we'd have to emulate it
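To illustrate what emulating floating point involves: software floating point treats a double as a 64-bit integer and works on its sign, exponent and fraction fields with integer operations only. This Python sketch is purely illustrative (it is not derived from glibc or gcc code, and the function names are hypothetical). `soft_mul` handles only normal, finite operands and truncates instead of rounding, so it omits exactly the corner cases (rounding modes, zeros, subnormals, infinities, NaN, overflow) that make a real implementation large. `struct` is used only to convert between Python floats and bit patterns for the demonstration.

```python
import struct

BIAS = 1023
FRAC_BITS = 52
FRAC_MASK = (1 << FRAC_BITS) - 1


def to_bits(x):
    """Bit pattern of a Python float as an IEEE-754 binary64 integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def from_bits(b):
    """Python float for an IEEE-754 binary64 bit pattern."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]


def decode(bits):
    """Split a normal binary64 value into (sign, unbiased exponent, 53-bit significand)."""
    sign = bits >> 63
    exp = ((bits >> FRAC_BITS) & 0x7FF) - BIAS
    significand = (bits & FRAC_MASK) | (1 << FRAC_BITS)  # restore the implicit leading 1
    return sign, exp, significand


def soft_mul(a_bits, b_bits):
    """Multiply two normal, finite doubles using integer arithmetic only (truncating)."""
    sa, ea, ma = decode(a_bits)
    sb, eb, mb = decode(b_bits)
    sign = sa ^ sb
    exp = ea + eb
    prod = ma * mb                     # in [2**104, 2**106)
    if prod >> (2 * FRAC_BITS + 1):    # product >= 2.0: renormalize
        prod >>= 1
        exp += 1
    frac = (prod >> FRAC_BITS) & FRAC_MASK  # drop low bits and the implicit 1
    return (sign << 63) | ((exp + BIAS) << FRAC_BITS) | frac


print(from_bits(soft_mul(to_bits(1.5), to_bits(1.5))))   # 2.25
print(from_bits(soft_mul(to_bits(-2.5), to_bits(4.0))))  # -10.0
```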
<fche> there was a PR for that too
<agentzh> oh really?
* agentzh has to check it out.
<agentzh> fche: do you have a url for it?
<agentzh> it's in the mailing list?
<agentzh> thanks
<agentzh> it's not a patch?
<agentzh> just an issue?
<agentzh> oh, sorry, i thought PR was Pull Request.
<fche> yeah. heh, no, problem report in this case
<agentzh> gotcha.
<fche> github's terminology is YOUNG YOUNG YOUNG compared to that :)
<agentzh> got too used to github PRs :)
<agentzh> fche: is it legally fine to steal glibc/gcc code for the softfloat code?
<agentzh> glibc is LGPL though.
<fche> we're GPL2 in stap land ... for something big like that, one would tread very carefully
<fche> would much rather read a proposal first
<agentzh> fche: *nod*
tromey has quit [Ping timeout: 240 seconds]
<fche> invano, hey, if you're still here
<fche> methinks I found the problem
<fche> the eu-readelf of your typequery ko file includes the compiler cflags
<fche> producer (strp) "GNU C89 6.4.0 -mlittle-endian -mabi=aapcs-linux -mno-thumb-interwork -mfpu=vfp -marm -march=armv7-a -mfloat-abi=soft -mtls-dialect=gnu -g -O2 -std=gnu90 -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -fno-dwarf2-cfi-asm -fno-ipa-sra -funwind-tables -fno-delete-null-pointer-checks -fno-stack-protector -fomit-frame-pointer -fno-var-tracking-assignments -femit-struct-debug-baseonly -fno-var-tracking
<fche> -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fstack-check=no -fconserve-stack -fno-eliminate-unused-debug-types --param allow-store-data-races=0"
<fche> two issues:
<fche> the -fno-var-tracking-assignments -fno-var-tracking stuff should come out of your kbuild generally, it's a goofy artifact of a poor linus decision several years ago
<fche> their presence degrades debuginfo quality generally within functions / inlines
<fche> the more immediate problem is: -femit-struct-debug-baseonly
<fche> ifdef CONFIG_DEBUG_INFO_REDUCED
<fche> KBUILD_CFLAGS += $(call cc-option, -femit-struct-debug-baseonly) \
<fche> $(call cc-option,-fno-var-tracking)
<fche> endif
<fche> you don't want that either
<fche> we may be able to work around that in stap land,
<fche> invano, I pushed what seems like a likely fix
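A simplified Python model of the rule fche identified. GCC documents -femit-struct-debug-baseonly as emitting debug information for struct-like types only when the base name of the compilation source file matches the base name of the file in which the struct is defined. The typequery module's source file (assumed here to be typequery_kmod_1.c, matching the .ko name in the log) does not share a base name with include/linux/sched.h, so task_struct would be expected to appear without its members, which fits the missing "member" entries in the eu-readelf dump. The helper names are hypothetical, and the real compiler logic has more cases than this.

```python
from pathlib import PurePosixPath


def base_name(path):
    """Directory and final extension stripped: 'include/linux/sched.h' -> 'sched'."""
    return PurePosixPath(path).stem


def full_struct_debuginfo(cu_source, struct_decl_file, baseonly):
    """Rough model: does this compilation unit's DWARF describe the struct's members?"""
    if not baseonly:
        return True  # no base-name restriction without the flag
    return base_name(cu_source) == base_name(struct_decl_file)


# The stap type-query module under CONFIG_DEBUG_INFO_REDUCED: task_struct lacks members.
print(full_struct_debuginfo("typequery_kmod_1.c", "include/linux/sched.h", baseonly=True))   # False
# Built without the flag, the complete definition would be described.
print(full_struct_debuginfo("typequery_kmod_1.c", "include/linux/sched.h", baseonly=False))  # True
```

On the kbuild side, as the Makefile excerpt above shows, the flag comes in through CONFIG_DEBUG_INFO_REDUCED.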
<agentzh> running the command "make -j6 installcheck-parallel" gives the error "ERROR: Couldn't find library file site.exp." seems like running "touch testsuite/lib/site.exp" manually myself fixes it. is it the right way to work around it?
<agentzh> i'm seeing it on both fedora 26 and fedora 27.
<fche> hm, over here, the first thing installcheck-parallel does is create site.exp
<agentzh> am i missing any deps?
<agentzh> fche: that was a fresh clone of the systemtap master on a fresh fedora 26 VM.
<fche> yeah
<fche> I think I see what's up
<fche> in testsuite/Makefile.am,
<fche> installcheck-parallel: site.exp
<agentzh> great
<fche> -rmmod uprobes 2>/dev/null
<fche> -test -z $(SYSTEMTAP_TESTSUITE_RESUME) && $(MAKE) clean
<fche> so first thing it creates the site.exp
<fche> then the make clean nukes it again
<fche> so ... what the? :)
<fche> I suspect that 'make clean' part is not quite right; not sure why it's written that way
<fche> but...
<agentzh> due to make -j6 ?
<fche> changing them to $(MAKE) clean site.exp ought fix it?
* agentzh is trying.
<fche> could hand-edit the Makefile first
<agentzh> ok
<fche> if that works, then the .am / .in files
<agentzh> on it.
<fche> egg selent!
<fche> (could be that different make versions time this stuff differently; it does seem to work okay here for me, not sure why.)
<agentzh> what version of make are you using?
<agentzh> here it is make 4.2.1.
<fche> standard f27 make 4.2.1, but on a multi-cpu box; maybe your vm is too small? dunno
<agentzh> my vm is big.
<agentzh> it has 4 physical CPU cores.
<agentzh> 8 logical cores.
<agentzh> hyper-threaded.
<agentzh> fche: i think it is testsuite/lib/site.exp, not testsuite/site.exp.
<fche> hm, doesn't make sense, lib/ doesn't have a site.exp
<fche> and a build tree proper doesn't have a testsuite/lib/ directory at all; that's from the $srcdir
<fche> wonder if it's an odd phenomenon associated with building in the source tree
<agentzh> your Makefile change does not seem to be releva
<fche> alas
<agentzh> *relevant
<agentzh> strace shows that it tries to look for testsuite/lib/site.exp.
<agentzh> and there is no such file in the location, hence the error.
<fche> ok, I'm seeing that same error, in a build-in-source-tree configuration
<agentzh> good to know i'm not alone :)
<agentzh> you were using some not-build-in-source-tree config? just curious what it is.
<fche> we almost always build in a new empty directory, and configure it with .../path/to/source/tree/configure
<agentzh> oh, the cmake style. i didn't know it supports this way. thanks for the info.
<agentzh> i'll try that way.
<fche> it's the gnu style, predates cmake by a few decades :)
<agentzh> i see.
<agentzh> still it would be nice if we can support running tests in the build-in-src-tree mode :)
<fche> yeah
<fche> have a patch
<agentzh> great
<agentzh> fche: i can confirm the separate build dir way works for me now.
<fche> ok
<agentzh> fche: now i'm seeing a lot of test failures like these: https://pastebin.com/NtzYykAv
<agentzh> is that normal? using the latest master.
<agentzh> kernel is 4.16.11-100.fc26.x86_64
mjw has quit [Quit: Leaving]
<agentzh> i think 4.16 is supposed to work fine for the stap syscall stuff?
<fche> yes
<fche> does the systemtap.log file explain what's up?
<agentzh> or must i run the test suite with root?
<fche> installcheck*, yes
<agentzh> i'm currently using normal user under the stapusr/stapsys/stapdev groups.
<fche> (the README covers this IIRC)
<agentzh> README does not mention root or sudo. but maybe "privileged" implies that...
<agentzh> i thought adding stapusr/etc user groups would be sufficient.
<fche> the "#" is the hint I guess
<agentzh> no sh prompt there.
<fche> some of the tests involve sudo
<agentzh> okay
<agentzh> good to know.
<agentzh> still same errors using sudo.
<agentzh> the paste is the file testsuite/artifacts/systemtap.unprivileged/unprivileged_myproc/systemtap.log
* fche is missing irker
<fche> agentzh, I do believe you're right, that set of tests has gone a bit crazy
<agentzh> oh, good to know it's not my env going crazy :)
<fche> maybe we are both crazy, pal
<agentzh> lol
<agentzh> do we have a CI service to keep track of the test suite passing rate and etc for the master branch?
<fche> yes, but not a really public one
<agentzh> oh, pity.
<agentzh> as a contributor, i just want to know what failures are expected what are not.
<agentzh> *and what are not
<agentzh> ideally, all the tests are always passing on master.
<fche> yeah, we're actually starting to investigate a more formal way of classifying testsuite results in general
<fche> ideally, yeah ... but in practice systemtap is so environment-sensitive (gcc, kernel versions, configurations, architectures) that a lot of things are out of our hands
<agentzh> that would be nice.
<agentzh> fche: yeah, that's totally understandable.
<agentzh> i wonder how the other backends, dyninst and bpf, run the current test suite.
<agentzh> do they run at all by default?
<agentzh> are they enabled by some fancy system environments?
<fche> some of the .exp files explicitly iterate over runtimes
<agentzh> or do they do their own (sub) test suite?
<fche> foreach runtime [get_runtime_list] { ... }
<agentzh> fche: okay, i see. it needs to be explicitly turned on in the .exp files.
<fche> yeah. the runtimes are not interchangeable in power unfortunately
<agentzh> yeah, i understand.
<agentzh> the bpf runtime always hangs on my side for userland probing.
<agentzh> and the dyninst runtime lacks vma tracking. alas.
<fche> bpf hanging would be good to know of
<agentzh> fche: will dig it up with more details when i have the tuits.
<fche> haha yeah. bugzilla PR appreciated when you have time.
<fche> we should create a tuit dispenser
<agentzh> sure.
<fche> hmmmmm a stap script that monitors for idle keyboard
<agentzh> lol
<fche> and after a few minutes of idleness, gives you a tuit
<fche> to use for good, not evil
<agentzh> we're very interested in the bpf backend, so don't worry.
<agentzh> it's great to see that stap emits bpf bytecode directly without involving llvm/gcc toolchains.
<agentzh> bcc is much more heavyweight in that regard.
<fche> oh yeah
<fche> we're learning a lot from bcc/etc., a bit amused at its limitations sometimes :)
<fche> but yeah we'll try to do a good job ... and darn it's fast when it works!
<agentzh> good to know. bcc lacks debuginfo support so it's more like dwarf-less probing :)
<agentzh> yeah, stap --bpf is so fast when it works.
<agentzh> reminds me of the good old dtrace days :)
<fche> heh, and for good reason, similar tech
<agentzh> yep
<fche> still bummed out about the kernel-side limitations of bpf, but who knows, over time we may be able to nudge them toward some higher capability/power
<agentzh> fche: or maybe with a separate stap-bpf kernel module residing in the kernel space for good :)
<fche> well, if we could have gotten one in 10 years ago :)
<agentzh> that would be bpf++ for stap :D
<fche> heh