fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
khaled has quit [Quit: Konversation terminated!]
orivej has quit [Ping timeout: 265 seconds]
hpt has joined #systemtap
sscox has quit [Ping timeout: 240 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
amerey has quit [Remote host closed the connection]
orivej has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
khaled has joined #systemtap
ggherdov has quit [Ping timeout: 248 seconds]
ggherdov has joined #systemtap
fdalleau_away is now known as fdalleau
mjw has joined #systemtap
orivej has joined #systemtap
hpt has quit [Ping timeout: 246 seconds]
orivej has quit [Ping timeout: 240 seconds]
amerey has joined #systemtap
tromey has joined #systemtap
mjw has quit [Quit: Leaving]
fdalleau is now known as fdalleau_away
fdalleau_away is now known as fdalleau
tromey has quit [Quit: ERC (IRC client for Emacs 27.1)]
mjw has joined #systemtap
fdalleau is now known as fdalleau_away
orivej has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
orivej has joined #systemtap
orivej has quit [Ping timeout: 268 seconds]
<kerneltoast> fche, baddish news
<fche> oh no
<kerneltoast> the backtrace bug's cause is different on newer vs older kernels
<kerneltoast> on newer kernels where it works half the time, kernel_read_file_from_path() is not returning an error
<kerneltoast> on older kernels where it works none of the time, kernel_read_file_from_path() returns errors
<kerneltoast> exciting isn't it
<kerneltoast> the issue on newer kernels seems to be a race in stap
<kerneltoast> i replaced kernel_read_file_from_path's vmalloc with a stack allocation and the bug "went away"
<kerneltoast> (i replaced it by just passing in a pointer to a stack buffer)
<kerneltoast> fche, try this: https://paste.centos.org/view/ee283a99
<kerneltoast> it makes the bug disappear(TM)
<kerneltoast> but only on your fancy 5.11 kernel
<kerneltoast> i guess when kernel_read_file_from_path() was changed in the kernel to take an offset, it stopped failing
amerey has quit [Remote host closed the connection]
<kerneltoast> fche, yo
<kerneltoast> i have an idea to fix the 4.18 bug
<kerneltoast> (centos 8)
<fche> looking
<kerneltoast> the only danger is potentially populating our section addresses with garbage
<fche> um so kernel_read_file .... doesn't like heap pointers? neato
<kerneltoast> oh whoops i forgot i was writing stap
<fche> don't see the danger
<kerneltoast> stack arrays evil
<kerneltoast> that can be a heap alloc no problemo
<kerneltoast> > don't see the danger
<kerneltoast> well if kernel_read_file returns a legitimate error and our buffer happens to have some valid data in it, there could be a problem
<kerneltoast> what happens if we read the .eh_frame address as 0x9
<kerneltoast> and then the read failed after that
<fche> in case of error, well, don't use any of the data?
<kerneltoast> the problem is that it returns a spurious error on success
<kerneltoast> so its errors are garbage
<kerneltoast> we could parse the spurious error (-EIO) but i'm sure lots of the deeper fs machinery returns -EIO
<fche> ok I would prefer to find out the cause of this spurious error if it really is spurious
<kerneltoast> sure, i can show you
<kerneltoast> and i can show you why 5.11 dodges it
<kerneltoast> i_size must be a very large number, because it's bigger than 64. with stap master, kernel_read_file returns -EFBIG from there
<kerneltoast> we can get past that error by setting max_size to 0
<kerneltoast> but now we're left with this error: https://elixir.bootlin.com/linux/v4.18.20/source/fs/exec.c#L939
<kerneltoast> pos never equals i_size because i_size is a lie
<fche> kernel_read_file() reads like it wants to read an Entire file
<kerneltoast> yes, and on 5.11 it was changed to accommodate partial reads
<kerneltoast> err the change wasn't in 5.11
<kerneltoast> idk when it was, but you know what i mean
<kerneltoast> on 5.11, i printed out i_size and it appears to be correct. but even if it weren't correct, stap's usage of kernel_read_file dodges that pesky i_size check entirely
<kerneltoast> (for when i_size is bigger than 64)
<kerneltoast> so i_size can be 0xbologna on 5.11 and it won't cause us any issues
<kerneltoast> i have a feeling i_size is INT_MAX or something, since the maximum sizes of the sections sysfs nodes are not defined when they're created. see this in 5.11: https://elixir.bootlin.com/linux/v5.11.12/source/kernel/module.c#L1658
<kerneltoast> on 64-bit, MODULE_SECT_READ_SIZE == 19
<kerneltoast> 4.18 doesn't have that
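The 19 mentioned above can be reproduced with a little arithmetic. The sketch below is userspace C, not kernel code; it assumes the 5.11 macro counts three fixed characters ("0x" plus a trailing newline) and one hex digit per four bits of a long. SECT_READ_SIZE and format_section_addr are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumption: mirrors the shape of MODULE_SECT_READ_SIZE in 5.11,
 * i.e. 3 characters for "0x" and "\n" plus one hex digit per 4 bits. */
#define SECT_READ_SIZE(bits_per_long) (3 + (bits_per_long) / 4)

/* Hypothetical helper: renders an address the way a 64-bit sections
 * node is assumed to print it, and returns the number of characters
 * that the full rendering needs (excluding the terminating NUL). */
static int format_section_addr(char *out, size_t outsz,
                               unsigned long long addr)
{
    return snprintf(out, outsz, "0x%016llx\n", addr);
}
```

With 64-bit longs this gives 3 + 16 = 19, which matches the value quoted for MODULE_SECT_READ_SIZE, and every formatted address is exactly that long.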
<fche> how trash is i_size on your kernels?
<fche> on some random rhel7 one, i_size = 4096 for those files, which is trash but not Super Awful trash
<fche> on 5.11, i_size appears to be a nice small exactish number
<kerneltoast> i can only see what i_size is if i can get past that pos != i_size check
<kerneltoast> i'll try 4096 and see if it succeeds
<fche> stat /sys/module/FOO/sections/BAR
<kerneltoast> 4096
<kerneltoast> heh i had done that on 5.11 to find what i_size was, but it didn't occur to me to try it on centos8 for some reason
<kerneltoast> i wonder if it's just PAGE_SIZE
<kerneltoast> either way, it may as well be Super Awful trash because it breaks that pos != i_size check
<kerneltoast> if we can get the range of the module address space then we could use it to validate our read
<kerneltoast> or we could find some other way to read that data
<kerneltoast> barring a straight up read() syscall...
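One way the range-validation idea could look is sketched below in userspace C. This is an assumption-laden illustration, not what stap does: parse_section_addr and the lo/hi bounds are hypothetical, the caller is assumed to supply the module address range, and an in-kernel version would use the kernel's own string helpers instead of strtoull.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical validator: accepts only "0x<hex digits>\n" and only
 * when the parsed value falls in [lo, hi). buf must be NUL-terminated.
 * Returns 0 on success, -EINVAL for a malformed string, and -ERANGE
 * for a well-formed address outside the expected range. */
static int parse_section_addr(const char *buf,
                              unsigned long long lo,
                              unsigned long long hi,
                              unsigned long long *out)
{
    char *end;
    unsigned long long v;

    if (buf[0] != '0' || buf[1] != 'x')
        return -EINVAL;

    errno = 0;
    v = strtoull(buf, &end, 16);   /* base 16 accepts the "0x" prefix */
    if (errno != 0 || *end != '\n')
        return -EINVAL;

    if (v < lo || v >= hi)
        return -ERANGE;            /* e.g. a stray 0x9 is rejected here */

    *out = v;
    return 0;
}
```

Under these assumptions, a buffer holding "0x9\n" would be rejected by the range check rather than being recorded as a section address.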
<fche> we can read into a PAGE_SIZE buffer, and then this should work on old and new, methinks
<kerneltoast> that only fixes the -EFBIG check
<kerneltoast> there aren't 4096 bytes of data to read
<kerneltoast> `pos` will only ever go up to 19
<fche> you're thinking about the pos != i_size check?
<fche> that should be fine too
<fche> it's not a pos != maxsize
<kerneltoast> i_size == 4096
<kerneltoast> pos == 19
<kerneltoast> maybe 19 == 4096 in canada, but not in the rest of the world
<fche> good point
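The two failures discussed above can be modeled in a few lines of userspace C. This is a simplified sketch of the size checks as they read in 4.18's kernel_read_file(), not the kernel code itself; struct fake_file and model_read_file are hypothetical names, and the file is assumed to report i_size == 4096 while only ever yielding 19 bytes.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a sysfs sections node: the inode claims
 * i_size bytes, but reads only ever yield data_len bytes. */
struct fake_file {
    long long i_size;
    const char *data;
    size_t data_len;
};

/* Simplified model of the 4.18 checks. max_size == 0 means "no limit".
 * buf must hold at least data_len bytes. */
static int model_read_file(const struct fake_file *f, char *buf,
                           long long max_size)
{
    long long pos = 0;

    if (max_size > 0 && f->i_size > max_size)
        return -EFBIG;              /* 4096 > 64 trips this first */
    if (f->i_size <= 0)
        return -EINVAL;

    while (pos < f->i_size) {
        size_t avail = (size_t)pos < f->data_len
                           ? f->data_len - (size_t)pos : 0;
        size_t want = (size_t)(f->i_size - pos);
        size_t n = avail < want ? avail : want;

        if (n == 0)
            break;                  /* EOF: no more data to read */
        memcpy(buf + pos, f->data + pos, n);
        pos += (long long)n;
    }

    if (pos != f->i_size)
        return -EIO;                /* 19 != 4096, even though data arrived */
    return 0;
}
```

In this model, max_size == 64 gives -EFBIG, max_size == 0 gives -EIO with all 19 bytes already in the buffer, and the read only succeeds when i_size matches the real length.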