fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
White_Light has joined #systemtap
hpt has joined #systemtap
gmg has quit [Remote host closed the connection]
rth has quit [Quit: Leaving]
RustJason has joined #systemtap
RustJason has quit [Ping timeout: 260 seconds]
nkambo has joined #systemtap
irker604 has quit [Quit: transmission timeout]
hkshaw has joined #systemtap
gmg has joined #systemtap
gmg1 has joined #systemtap
gmg has quit [Ping timeout: 240 seconds]
sanoj has joined #systemtap
gmg1 has quit [Remote host closed the connection]
nkambo has quit [Ping timeout: 240 seconds]
groleo has joined #systemtap
nkambo has joined #systemtap
scox has quit [Ping timeout: 240 seconds]
scox has joined #systemtap
lorddoskias has joined #systemtap
<lorddoskias> I have a probe definition : probe kernel.statement("reserve_metadata_bytes@fs/btrfs/extent-tree.c:5087") ? { and it actually causes compilation failure: semantic error: no line records for fs/btrfs/extent-tree.c:5087 [man error::dwarf]
<lorddoskias> but shouldn't the '?' at the end prevent just that ?
mjw has joined #systemtap
orivej has quit [Ping timeout: 255 seconds]
Jackson_Xing has quit [Remote host closed the connection]
Jackson_Xing has joined #systemtap
hpt has quit [Ping timeout: 260 seconds]
scox has quit [Ping timeout: 240 seconds]
orivej has joined #systemtap
skycarl has joined #systemtap
<skycarl> How can I configure the kernel 4.9 source to support systemtap on a Gentoo system?
skycarl has left #systemtap ["Leaving"]
Jackson_Xing has quit [Quit: Jackson_Xing]
wcohen has quit [Ping timeout: 240 seconds]
hpt has joined #systemtap
gbm has joined #systemtap
<gbm> Hello everyone. I was wondering if anyone else has had problems using the proc_mem_data tapset? Getting an unknown type in dereference: operator '->' on line 293 in proc_mem.stp (mm = task->mm), currently using version 2.9
<gbm> on Ubuntu 16.04. Have checked with google and so but haven't found anything..
<fche> lorddoskias, yeah the ? should handle that -- can you fpaste the script ?
<lorddoskias> basically the thing is when i recompile the kernel with a different commit i'd expect this probe to fail but not fail my script
<lorddoskias> but the opposite is happening
<lorddoskias> i guess the function size is being changed
<fche> gbm, if your kernel is much newer than the stap 2.9 release (which was early 2016), then may need a newer version
<gbm> here is how the tapset looks like on my machine: it arrived with the package manager
scox has joined #systemtap
<lorddoskias> fche: okay i can give you the whole script now i got it to fail again
<lorddoskias> fche: for example __reserve_metadata_bytes is nonexistent in one of the kernels i'm testing with and the ? deals fine with that, however reserve_metadata_bytes has different sizes since it was refactored
<fche> can you fpaste the stap error also by any chance? (I don't have all the same version here.)
<lorddoskias> semantic error: no line records for fs/btrfs/extent-tree.c:5087 [man error::dwarf]
<lorddoskias> Systemtap translator/driver (version 3.1/0.165, non-git sources)
<fche> gbm, yes, that would be part of systemtap, but the question is what version? stap -V vs. uname -a
<gbm> moment, I'm aggregating all the useful info
<fche> lorddoskias, weird. I don't have a good explanation for why that should happen
<fche> what if you drop all the other probes ?
<gbm> Here is the full error, with newest systemtap and uname -a
<fche> can try a line number range
<fche> gbm, I assume this was a hand-built copy of systemtap
<gbm> I cloned it from git://sourceware.org/git/systemtap.git and built
<lorddoskias> fche: same thing with all other probes commented out
<fche> lorddoskias, ok, that's a good sign (not a heisenbug)
hpt has quit [Ping timeout: 268 seconds]
<fche> gbm, this construct would rely on kernel debugging symbols being available
<fche> try stap-report or stap-prep to see if you have all matching versions
<fche> we should be able to work around that requirement if we were to add some header-file-based @cast()'s into those tapset files
<gbm> I will post stap report in a bit, but for context I'm using this tool to trace a user space application (which is awesome btw)
<fche> righto
drsmith_away is now known as drsmith
<fche> gbm, so you're running 4.4.0-75-generic
<fche> but the closest debuginfo is 4.4.0-59-generic-dbgsym
<fche> so stap will ignore the latter
sanoj has quit [Quit: Leaving]
<fche> with some work, the code could do something like replace mm = task->mm with mm = task(task)->mm
<fche> since the latter does a little @cast() based on header files based on a macro in linux/task.stpm
<fche> would you mind trying that? you can su and hand-edit that line of that tapset file
<gbm> it gave me an unresolved function
carl_ has joined #systemtap
carl_ has quit [Client Quit]
hkshaw has quit [Ping timeout: 260 seconds]
skycarl has joined #systemtap
<fche> sorry @task
<fche> someday I'll learn the tool :)
__positron has joined #systemtap
<fche> so mm = @task(task)->mm
<gbm> perfect, it works. Thanks!
<fche> ooh excellent
<fche> could I talk you into possibly doing the same search/replace through that file and nearby?
<fche> and send it as a diff?
<__positron> When I try to probe some functions in net/core/sock.c, stap complains registration error (rc -84). It looks like the probe address is not at an instruction boundary, at least according to arch (x86) related kprobe code. Is there a way to fix this?
<gbm> sure, no problem. Just this file?
orivej has quit [Ping timeout: 245 seconds]
<fche> sure.
<fche> there's something in context.stp and context-envvar.stp that could benefit
<fche> (just searching for ->mm )
orivej has joined #systemtap
tromey has joined #systemtap
<__positron> For instance, I have tried these two functions __alloc_skb and sock_wmalloc. Both result in the same error. (rc -84)
<fche> __positron, would need some more details to figure out whether stap or the kernel is at fault
<fche> would want to check the addresses calculated by stap (-DDEBUG_SYMBOLS e.g.) vs. kernel actual addresses
<fche> gbm, hey cool, what email address / full name shall we credit the patch to?
<gbm> mkdubik@gmail.com, Mikael Dubik :) thanks for the assist!
irker567 has joined #systemtap
<irker567> systemtap: mkdubik systemtap.git:refs/heads/master * release-3.1-74-gb573a3f / tapset/linux/context-envvar.stp tapset/linux/context.stp tapset/linux/proc_mem.stp: Use @task() @cast-wrapper for task->mm in tapset, so that debuginfo not needed http://tinyurl.com/kp9hdgr
<fche> thanks dude
gbm has quit [Quit: Page closed]
brolley has joined #systemtap
<__positron> The warning spit out by stap is this, WARNING: probe kernel.function("sock_alloc_send_skb@net/core/sock.c:1914").call (address 0xffffffff81679fe0) registration error (rc -84). I presume it to be the address computed by stap.
<fche> yup, the final address
<__positron> vmlinux file has this function at a different address. ffffffff8167afe0 T sock_alloc_send_skb
<fche> 9fe0 vs afe0
<__positron> yup.
<__positron> I wonder what makes stap compute a different address
<fche> try sudo stap -DDEBUG_SYMBOLS ...
<__positron> This time, I am sticking to a release version. version 3.1/0.158 :)
<fche> so run that -DDEBUG_SYMBOLS variant
<fche> then see also stap -p2 -v ....
<fche> where you see how the address is represented at the end of the translation phase
<fche> kernel.function("SyS_open@fs/open.c:1066") /* pc=_stext+0x258868 */ /* <- kernel.function("SyS_open@fs/open.c:1066") */
<fche> <-- for me here e.g.
<fche> kernel.function("sock_alloc_send_skb@net/core/sock.c:1894") /* pc=_stext+0x6d9688 */ /* <- kernel.function("sock_alloc_send_skb@net/core/sock.c:1894") */
groleo has quit [Quit: Leaving.]
groleo has joined #systemtap
wcohen has joined #systemtap
skycarl has quit [Remote host closed the connection]
groleo has quit [Ping timeout: 245 seconds]
<fche> ok, so stap's math seems to check out _stext=0x810002b8 + 0x679d28 == 0x81679fe0
modem has quit [Remote host closed the connection]
<fche> so the question is how this kernel manages to shift the symbol around ... interesting
<__positron> _stext indeed matches with vmlinux. ffffffff810002b8 T _stext
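[Editor's note] The address check fche does by hand above can be reproduced with a few lines of arithmetic. This is only a sketch using the values quoted in this log; the function name `probe_address` is made up for illustration and is not part of stap.

```python
# Values quoted in the conversation above (full 64-bit forms).
STEXT = 0xffffffff810002b8         # _stext, as listed in vmlinux
OFFSET = 0x679d28                  # offset fche adds to _stext
STAP_ADDR = 0xffffffff81679fe0     # address in the stap registration warning
VMLINUX_ADDR = 0xffffffff8167afe0  # sock_alloc_send_skb in the vmlinux symbol table


def probe_address(stext: int, offset: int) -> int:
    """Hypothetical helper: final probe address = relocation base + offset."""
    return stext + offset


computed = probe_address(STEXT, OFFSET)
delta = VMLINUX_ADDR - computed

# stap's sum is internally consistent, yet the symbol table disagrees by 0x1000.
print(hex(computed), hex(delta))  # 0xffffffff81679fe0 0x1000
```

The sum matches the address in the warning, so the arithmetic itself is fine; the 0x1000 gap (4 KiB) has to come from the inputs, that is, from whichever file the offset was read out of.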
nkambo has quit [Ping timeout: 240 seconds]
<__positron> I am gonna try it on another system with the current master branch. fche: Is the current master stable?
<fche> yeah, it should be fine
<fche> but really it seems like there's something unusual about the kernel rather than stap
<fche> __positron, how does your kernel look w.r.t. version numbers & debuginfo ?
<fche> stap-report ?
<__positron> It's a custom kernel based on v4.8.4
<fche> right, so do you have anything unusual related to relocation? or a possible version mismatch?
<__positron> we do, but not in the kernel. we have some modules which perform relocation of some of their own functions.
<fche> so ... stap computes a relocation basis for kernel (and module) symbols
<fche> _stext for the kernel and some section-name pseudo-symbol for modules
<fche> and subtracts / readds to adapt to the actual run-time addresses
<fche> now if the kernel relocates deeper than the old-style 'shift everything up or down, moving _stext', then we'll have problems
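[Editor's note] The "subtracts / readds" scheme fche describes can be sketched as below. This is a simplified model based only on the description in this log, not stap's actual code; `relocate` and the runtime base value are hypothetical.

```python
def relocate(static_addr: int, static_base: int, runtime_base: int) -> int:
    """Rebase an address: subtract the build-time base, add the run-time base.

    Hypothetical helper modelling the scheme described in the chat, where
    the base is _stext for the kernel (a section pseudo-symbol for modules).
    """
    return static_addr - static_base + runtime_base


STATIC_STEXT = 0xffffffff810002b8   # _stext from vmlinux (quoted in the log)
STATIC_FUNC = 0xffffffff81679fe0    # a function address from the same image
RUNTIME_STEXT = 0xffffffff830002b8  # assumed run-time _stext, for illustration only

# If the whole kernel image moved as one block, rebasing lands on the function.
print(hex(relocate(STATIC_FUNC, STATIC_STEXT, RUNTIME_STEXT)))  # 0xffffffff83679fe0
```

The scheme relies on every symbol keeping its distance from the base. If a kernel moved individual functions independently of _stext, the rebased address would be off by however far that function moved, which is the failure mode fche is worried about here.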
<__positron> Does v4.8.4 or above do some relocation in an unusual way?
<fche> I'm not sure
<fche> we use all kinds of kernel versions all the time, including beyond 4.8.4
<__positron> have you tried tracing this particular function on one of your systems? does it result in the same error?
<fche> works here
<fche> (4.9.10-200.fc25.x86_64 e.g.)
<fche> I believe you said that you have several functions suffering similarly
<fche> is there a pattern as per the -DDEBUG_SYMBOLS content?
<__positron> yeah. all the functions are from net/core/sock.c. let me check if there is a fixed pattern.
<fche> could you check out the RELOC|RANDOMIZE bits from /boot/config* ?
<__positron> Sure. I will post you my stap-report in a short while.
gmg has joined #systemtap
<__positron> surprisingly, it works on a different system with the same kernel version (but they are not 100% equivalent). let me also check what changes have crept into the other kernel.
<irker567> systemtap: fche systemtap.git:refs/heads/master * release-3.1-75-g4e76d62 / stap-report: stap-report: also include RELOC/RANDOMIZE kconfig lines http://tinyurl.com/m6k3uvf
<__positron> fche: that was quick ;)
<__positron> I am lazy to pull and compile. but here it is what you've asked for. CONFIG_ARCH_HAS_ELF_RANDOMIZE=y, CONFIG_RELOCATABLE=y, # CONFIG_RANDOMIZE_BASE is not set
<fche> not too different from fedora, though we have CONFIG_RANDOMIZE_BASE=y on,
<fche> which sounds like it should make the situation harder rather than easier on fedora
<fche> (but am not really that familiar)
orivej has quit [Ping timeout: 264 seconds]
orivej has joined #systemtap
<fche> so ... are you 100% sure that the kernel you're running is the exact same version as the one whose build tree stap has to work from ?
<fche> 'cause the uname -r version code 4.8.4 is much less specific than usual
<fche> (I mean stap verifies with build-id , but still)
* fche must step out a little while
<__positron> let me rebuild stap once again. the difference (9fe0 vs afe0) is exactly a page size.
<fche> but might try checking out the symbol table of the /lib/modules/4.8.4/build/vmlinux file - to see whether things match
<fche> a stap rebuild is unlikely to change this, but of course go ahead & try
<__positron> if i change the kernel's code and install a new one, should this break stap or does stap automatically resolve the needed symbols by referring to the latest kernel build?
<fche> stap will need to find the new kernel's build tree
<fche> (stap -r /path/ if needed)
gmg has quit [Ping timeout: 272 seconds]
gmg has joined #systemtap
gmg has quit [Remote host closed the connection]
gmg has joined #systemtap
<__positron> Actually, I have made a few changes to net/core/skbuff.c. however, the new kernel is built and installed. I tried with stap -r, results in the same error
gmg has quit [Client Quit]
<fche> and booted-into ?
<__positron> the same modified kernel. It's just that i have not rebuilt stap after modifying the kernel.
<fche> stap does not need to be rebuilt when you change a kernel
<__positron> as long as it could find the new build tree, stap should be happy.
<__positron> can I add more debug info to stap to know where exactly this 0x1000 offset is being added?
<fche> could run stap with some more tracing to see how it comes up with the addresses
<fche> did you have a chance to look through the (proper version of) vmlinux's symbol table (with readelf etc)
<fche> stap --vp 04 say
<fche> and compare to readelf -s /path/to/vmlinux
<irker567> systemtap: dsmith systemtap.git:refs/heads/master * release-3.1-76-g585b5c3 / config.in configure configure.ac httpd/Makefile.am httpd/Makefile.in httpd/main.cxx httpd/server.cxx httpd/server.h: Add httpd server updates. http://tinyurl.com/lrbsjmv
<__positron> fche: got the problem. there was an older version of vmlinux image in the /boot folder. I am unsure of the order in which stap searches for vmlinux file. It has picked up /boot/vmlinux as its kernel image and started grabbing symbol data from it. That has caused all the problems. --vp 04 got me the details. Thanks a lot for all your inputs.
<fche> hey neat, good catch
<__positron> Shouldn't this be caught by stap automatically? running version != vmlinux image?
<fche> yes, there ought to be buildid checking e.g.
<__positron> I am not sure how pedantic those version checks are.
<fche> there's a combination of buildid, 'uname -r' (version), and some such stuff
<fche> I'm surprised that the former did not catch this
<fche> time for another tweak to the stap-report
<__positron> it would be nice if stap had warned me that the vmlinux it was referring to was old.
<fche> hm, and we don't document $SYSTEMTAP_DEBUGINFO_PATH
<fche> hmmmmmmmm I wonder if the problem was particularly nasty because you had the correct kernel-build tree (with the vmlinux.id file in there reflecting the running kernel)
<fche> so the buildid located in the actual vmlinux file was not used?
<fche> interesting if so, fixable
<fche> can you look at the stap --vp 04 report again, seeing where it extracts buildid from?
<__positron> you mean the additional hex digits from the git hash added to the kernel name?
<fche> readelf -n /path/to/vmlinux would be very enlightening, for all /vmlinux files existing on your machine
<fche> and also '% sudo hexdump -C /sys/kernel/notes'
<fche> that's not a source git hash - that's a linker-generated hash of most of the binary
<fche> good, you see those two matching
<fche> and what about the /boot copy ?
<__positron> I have deleted it. didn't realize it could give some valuable piece of information. :-/
<fche> aw shucks :)
<fche> I'd expect it to have a different buildid
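[Editor's note] The build-id fche asks about lives in an ELF note of type NT_GNU_BUILD_ID (3) with owner name "GNU"; `readelf -n` prints it, and /sys/kernel/notes exposes the running kernel's notes as raw bytes. The sketch below parses such a byte buffer. It assumes little-endian notes with 4-byte padding (as on x86), and `find_build_id` and `make_note` are illustrative names, not part of any tool mentioned here.

```python
import struct
from typing import Optional

NT_GNU_BUILD_ID = 3  # note type used for the GNU build-id


def _align4(n: int) -> int:
    """Round n up to a multiple of 4 (note fields are 4-byte padded)."""
    return (n + 3) & ~3


def find_build_id(notes: bytes) -> Optional[str]:
    """Return the GNU build-id as hex, or None if no such note is present."""
    pos = 0
    while pos + 12 <= len(notes):
        namesz, descsz, ntype = struct.unpack_from("<III", notes, pos)
        pos += 12
        name = notes[pos:pos + namesz]
        pos += _align4(namesz)
        desc = notes[pos:pos + descsz]
        pos += _align4(descsz)
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\0":
            return desc.hex()
    return None


def make_note(name: bytes, ntype: int, desc: bytes) -> bytes:
    """Build one synthetic note record, for demonstration only."""
    name_z = name + b"\0"
    pad = lambda b: b + b"\0" * (_align4(len(b)) - len(b))
    return struct.pack("<III", len(name_z), len(desc), ntype) + pad(name_z) + pad(desc)


# Synthetic buffer: an unrelated note followed by a made-up 20-byte build-id.
sample = make_note(b"Linux", 0x101, b"\x00\x01\x02\x03") \
    + make_note(b"GNU", NT_GNU_BUILD_ID, bytes.fromhex("deadbeef" * 5))
print(find_build_id(sample))
```

On a live system the same function could be fed the contents of /sys/kernel/notes (root may be needed) and the result compared with what `readelf -n` reports for each vmlinux file; a mismatch indicates that the file does not belong to the running kernel.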
brolley has left #systemtap [#systemtap]
<__positron> ok. I have got a vmlinux with a different buildid for the same kernel version. Now that file is in /boot/vmlinux-4.8.4 and my running kernel is different. This is where it gets interesting. stap doesn't even print a warning and happily runs. As expected, the run does not print any output.
gmg has joined #systemtap
<fche> yeah, I have a theory why
<irker567> systemtap: fche systemtap.git:refs/heads/master * release-3.1-77-g4116df1 / stap-report: stap-report: search out other vmlinu{x,z,id} files for buildid reporting http://tinyurl.com/lxfq4ja
<fche> __positron, would you be able to fetch the newest stap-report from git and see what it reports on your box?
<__positron> sure. give me some time. meanwhile some more details on the latest anomaly I have found. https://paste.fedoraproject.org/paste/kdZtI7utXq1hm7ZTTPaLPl5M1UNdIGYhyRLivL9gydE=
<fche> yup, that's the anomaly that the new stap-report should also discover
<fche> note on line 11 of your trace -- stap's getting the build-tree's build.id from that vmlinux.id file, rather than the /boot/vmlinux* file
<fche> but elfutils defaults search the /boot/vmlinux* file first rather than the /lib/modules/.../build copy
<fche> WHOOPS
<fche> mjw ... hey you're here ... with elfutils search algorithms for the vmlinux* file being sort of undocumented, any suggestions?
<fche> libdwfl/linux-kernel-modules.c
__positron has quit [Ping timeout: 240 seconds]
lorddoskias has left #systemtap [#systemtap]
* mjw looks up
<fche> happy friday!
<mjw> fche, what exactly is going wrong?
<fche> elfutils is finding an unexpected copy of vmlinux in /boot
<mjw> I assume there is some mismatch which is or isn't detected between some build-ids?
<fche> whereas stap was picking up the build-time kernel buildid from another file under the kernel build tree
<fche> (which did have the correct vmlinux* set of files)
<mjw> And stap is using dwfl_linux_kernel_find_elf?
<mjw> And then it is trying to find the KERNEL
<mjw> with kernel_release returning...
<mjw> that is weird
<mjw> do we happen to know what uname -r gives on that setup?
<mjw> well, that cannot be it
<mjw> but if that would return a "path" (something starting with /) instead of a release string (not starting with /) then elfutils first looks under /boot
<mjw> let's see if there is some other place this could trigger. Somewhere stap passes a kernel "path" instead of a "release" string...
<mjw> O, it is the other way around, if you pass a path (starting with /) then elfutils will look for <path>/vmlinux otherwise it will look for /boot/vmlinux-<release>
<mjw> So...
<mjw> I think it is correct that elfutils finds /boot/vmlinux-<release>
<mjw> since that should be the running kernel
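[Editor's note] The lookup rule mjw describes can be written down as a small function. This models only what is said in this conversation; the real logic in libdwfl/linux-kernel-modules.c may try additional locations, and `kernel_image_candidate` is a hypothetical name.

```python
def kernel_image_candidate(release_or_path: str) -> str:
    """First vmlinux location to try, per mjw's description in this log.

    A value starting with '/' is treated as a directory that contains
    vmlinux; anything else is treated as a release string.
    """
    if release_or_path.startswith("/"):
        return release_or_path.rstrip("/") + "/vmlinux"
    return "/boot/vmlinux-" + release_or_path


print(kernel_image_candidate("4.8.4"))                      # /boot/vmlinux-4.8.4
print(kernel_image_candidate("/lib/modules/4.8.4/build"))   # /lib/modules/4.8.4/build/vmlinux
```

With a bare release string such as 4.8.4 the first candidate is the /boot copy, which is exactly the stale file __positron found, while the build-id stap compared against came from the build tree.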
<mjw> what does stap find?
<mjw> aha, stap uses the vmlinux.id file
<mjw> that is interesting I was looking at that last week
<mjw> The fedora kernel.spec file generates it and I only know stap that uses that file.
<mjw> Why is stap using vmlinux.id?
<mjw> fche, ^ ?
<fche> ... not sure now :)
<fche> maybe seemed like the simplest thing to do at the time?
<mjw> maybe
<fche> but yeah, if we have a vmlinux handle from -lelf, it seems silly to look elsewhere
nkambo has joined #systemtap
__positron has joined #systemtap
brolley has joined #systemtap
<irker567> systemtap: scox systemtap.git:refs/heads/master * release-3.1-78-gc67d8f2 / stapdyn/stapdyn.8: Fix manpage typo http://tinyurl.com/kj9jyde
mjw has quit [Quit: Leaving]
drsmith is now known as drsmith_away
mjw has joined #systemtap
tromey has quit [Quit: ERC (IRC client for Emacs 26.0.50)]
orivej has quit [Ping timeout: 255 seconds]
scox has quit [Ping timeout: 240 seconds]
mjw has quit [Quit: Leaving]
wcohen has quit [Ping timeout: 255 seconds]
<__positron> fche: latest stap-report on my machine is here https://paste.fedoraproject.org/paste/G3rqG0HkPuo3cxXAK~WvWl5M1UNdIGYhyRLivL9gydE=
<fche> right, and you see some buildids for the /boot kernels
__positron has quit [Quit: Leaving]
__positron has joined #systemtap
<__positron> yes. I do. /boot/vmlinux-4.8.4 is an older version of the kernel.
<fche> yeah. so we don't have a great -fix- for your problem, but at least stap-report will let us identify it faster next time
<fche> I suspect mjw is right in that stap should not use that vmlinux.id file preferentially or at all
<fche> and then what would've happened was that the stap script would fail early during pass 5 execution/startup, with a mismatching-buildid error
<fche> rather than registration errors
<__positron> that would be more meaningful. isn't it?
<fche> yes
<__positron> alright. I am happy that the problem is indeed fixed and i will be more careful when I execute stap the next time. I will flush the stale kernels then and there.
<fche> right, sorry stap wasn't more helpful
<__positron> it's ok i guess. It already adds a lot of value to my debugging. can't complain :)
<__positron> thank you again for your valuable inputs and time.
brolley has left #systemtap [#systemtap]
orivej has joined #systemtap
wcohen has joined #systemtap
<__positron> that's nice. hope things will be better soon.
scox has joined #systemtap
skycarl has joined #systemtap
gmg has quit [Quit: Leaving.]