fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged
<kerneltoast> oh turns out the thing i did earlier fixed it
<kerneltoast> but
<kerneltoast> i was running add.stp on its own
<kerneltoast> so i didn't notice it was fixed
<kerneltoast> hah
<kerneltoast> i'm dumb
<kerneltoast> static int _stp_data_write_commit(void *entry)
<kerneltoast> {
<kerneltoast> - /* Nothing to do here. */
<kerneltoast> + struct rchan_buf *buf;
<kerneltoast> +
<kerneltoast> + buf = _stp_get_rchan_subbuf(_stp_relay_data.rchan->buf,
<kerneltoast> + smp_processor_id());
<kerneltoast> + __stp_relay_switch_subbuf(buf, 0);
<kerneltoast> return 0;
<kerneltoast> }
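[editor's note] The pasted hunk makes the commit hook switch the current CPU's sub-buffer after every write. Below is a toy model of why that could matter, assuming the reader is only notified once a sub-buffer has been handed over. Every `toy_*` name is hypothetical and nothing here is the kernel relay API or SystemTap's code; the ring also ignores overwriting and consumer tracking.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SUBBUF_SIZE 16
#define TOY_N_SUBBUFS   4

/* Hypothetical stand-in for one per-CPU relay buffer. */
struct toy_chan {
    char   data[TOY_N_SUBBUFS][TOY_SUBBUF_SIZE];
    size_t fill[TOY_N_SUBBUFS]; /* bytes written into each sub-buffer */
    size_t cur;                 /* sub-buffer the writer is filling */
    size_t produced;            /* sub-buffers handed to the reader */
};

/* Hand the current sub-buffer to the reader and move to the next one. */
static void toy_switch_subbuf(struct toy_chan *c)
{
    if (c->fill[c->cur] == 0)
        return; /* nothing to publish */
    c->produced++;
    c->cur = (c->cur + 1) % TOY_N_SUBBUFS;
    c->fill[c->cur] = 0;
}

/* Write len bytes; switch_on_commit models the patched commit hook. */
static void toy_write(struct toy_chan *c, const char *src, size_t len,
                      int switch_on_commit)
{
    assert(len <= TOY_SUBBUF_SIZE);
    if (c->fill[c->cur] + len > TOY_SUBBUF_SIZE)
        toy_switch_subbuf(c); /* no room left: the switch happens anyway */
    memcpy(&c->data[c->cur][c->fill[c->cur]], src, len);
    c->fill[c->cur] += len;
    if (switch_on_commit)
        toy_switch_subbuf(c); /* publish right away, even if partly full */
}

/* Number of sub-buffers the reader has been told about. */
static size_t toy_readable_subbufs(const struct toy_chan *c)
{
    return c->produced;
}
```

In this model a short print stays invisible until later writes fill the sub-buffer, unless the commit step forces the switch. The cost is that each published sub-buffer may be mostly empty.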
irker962 has joined #systemtap
<irker962> systemtap: sultan systemtap.git:sultan/bulkmode2 * release-4.4-26-g5c1b84a8d / runtime/print_flush.c runtime/transport/relay_v2.c runtime/transport/transport.c runtime/transport/transport.h staprun/relay.c: always use per-cpu bulkmode relayfs files to communicate with userspace
<irker962> systemtap: sultan systemtap.git:sultan/bulkmode2 * release-4.4-27-g5836a314d / tapset-timers.cxx: Revert "REVERTME: tapset-timers: work around on-the-fly deadlocks caused by mutex_trylock"
<kerneltoast> fche, check out sultan/bulkmode2
<kerneltoast> i'm gonna run it through the testsuite now since it works with add.exp
<kerneltoast> i'll let you know how it goes in about 12000 seconds
<kerneltoast> (that's how long the testsuite takes to run, i just check dmesg for the last stap log)
khaled has quit [Quit: Konversation terminated!]
jistone has quit [Quit: ZNC - http://znc.in]
jistone has joined #systemtap
mjw has quit [Quit: Leaving]
hpt has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
<agentzh> hopefully we can run the stap test suite in parallel soon.
<agentzh> right now it's really a struggle.
<fche> hey the new code might work well enough in parallel mode too
<fche> patience is not a struggle
<fche> you just need to hold the sleeping pigeon pose for 666 minutes
<fche> breathe in through the nose, out through the mouth
<fche> contract and relax all the muscles around the vagus nerve
<fche> and by the end of it all
<fche> you'll have testsuite results
<fche> hope this helps
<fche> kerneltoast, I see what you did and can see why it should work
<fche> but not sure why it wasn't there before
<agentzh> kerneltoast: maybe we could split the test suite and run each subset of files in separate kvm guests...
<agentzh> that way we can utilize all the phy cpu cores...
<kerneltoast> you mean to speed up serial mode?
<agentzh> yes
<kerneltoast> not sure if previous tests in the testsuite influence the rest of the results though
<agentzh> but seems like bulkmode is just around the corner, probably we can do parallel runs directly anyway.
<kerneltoast> seems like we're going to be free from serial mode soon anyway
<agentzh> aye
<agentzh> kerneltoast: you got fun things to do while waiting for the test results? ;)
<kerneltoast> yeah i've got a big fat merge to work on
<agentzh> oh yeah that one.
<agentzh> cool
<agentzh> fche: the thing is that our customers are very impatient. so we really need to get these bugs behind us ASAP ;)
<kerneltoast> lucky me that i've never had to face our customers :P
<agentzh> oh boy, you are protected :)
<kerneltoast> agentzh, we still might have the probe_lock lockup stopping us from parallel runs, and fche said his buildbots were still dying even with the bulkmode patch
<kerneltoast> so there's no shortage of bugs :)))
<kerneltoast> fche, yeah i have no idea why that code was not there before, nor how it was working before...
<agentzh> still dying with your latest bulkmode2 branch?
<agentzh> maybe bulk mode was never exercised much.
<agentzh> since the default was never bulk.
<agentzh> it's a relatively new territory.
<kerneltoast> agentzh, bulkmode2 just fixes a problem where prints weren't being flushed. bulkmode1 got rid of the mutex_trylock, so the buildbots should not be suffering from the mutex_trylock bug anymore
<agentzh> that's good news.
<agentzh> hopefully they can get merged soon.
<agentzh> once the test suite is green enough.
<kerneltoast> yeah bulkmode2 should be good to go once the testsuite is all green
<kerneltoast> 4000 seconds left
<agentzh> yeah, probe lock is still out there.
orivej has joined #systemtap
<fche> I'll rerun my local tests against the new branch
irker962 has quit [Quit: transmission timeout]
<kerneltoast> oh boy, the testsuite is taking longer than 12000 seconds
<kerneltoast> i hope it's not more timeouts
<fche> things looking good here so far
<agentzh> it's rare that fche is still around. haha
derek0883 has quit [Remote host closed the connection]
<kerneltoast> fche, hmm i'm seeing this, i think it's an environment issue:
<kerneltoast> +ERROR: tcl error sourcing /home/sultan/systemtap/testsuite/systemtap.printf/out1.exp.
<kerneltoast> +ERROR: error writing "stdout": I/O error
<kerneltoast> any idea what the problem is?
<kerneltoast> 1.4G of free space isn't enough?
<kerneltoast> cleared up some space and i'm gonna re-run
<kerneltoast> we'll see what happens tomorrow morning i guess
<fche> not familiar
<fche> sounds almost like -EPIPE, something not there to listen to the output
derek0883 has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
hpt has quit [Quit: Lost terminal]
hpt has joined #systemtap
derek088_ has joined #systemtap
derek088_ has quit [Remote host closed the connection]
derek088_ has joined #systemtap
derek0883 has quit [Ping timeout: 260 seconds]
derek088_ has quit [Remote host closed the connection]
khaled has joined #systemtap
hpt has quit [Ping timeout: 264 seconds]
mjw has joined #systemtap
orivej has joined #systemtap
orivej has quit [Ping timeout: 264 seconds]
ema has quit [Quit: reboot]
ema has joined #systemtap
orivej has joined #systemtap
amerey has joined #systemtap
<kerneltoast> fche, hmm still getting that wacky error, and all the 32-bit syscalls are failing
<kerneltoast> guess bulkmode needs more fixing...
<fche> that syscall stuff shouldn't relate to transport at all
derek0883 has joined #systemtap
<fche> "transport failures"
<fche> oh
<kerneltoast> that's from the transport failure atomic var?
<fche> think so
<kerneltoast> ouch
<kerneltoast> oh ok i think i see the issue
<kerneltoast> when print_flush requests a length greater than the subbuf size, the code in relay_v2 says NO and returns an error
<kerneltoast> if (unlikely(length > buf->chan->subbuf_size))
<kerneltoast> -goto toobig;
<kerneltoast> +length = buf->chan->subbuf_size;
<kerneltoast> i think that should do it
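[editor's note] The two-line change above swaps "reject an oversized request" for "grant at most one sub-buffer". A minimal sketch of the difference follows, assuming the caller loops until all its bytes have been accepted. The function names are hypothetical and the real relay_v2 code is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Old behaviour: a request larger than one sub-buffer is refused. */
static size_t reserve_strict(size_t subbuf_size, size_t length)
{
    if (length > subbuf_size)
        return 0; /* the "toobig" path: nothing reserved */
    return length;
}

/* New behaviour: an oversized request is clamped to one sub-buffer. */
static size_t reserve_clamped(size_t subbuf_size, size_t length)
{
    if (length > subbuf_size)
        length = subbuf_size;
    return length;
}

/* How many reservations a caller needs when it keeps asking for the rest. */
static size_t chunks_needed(size_t subbuf_size, size_t total)
{
    size_t chunks = 0;

    assert(subbuf_size > 0);
    while (total > 0) {
        total -= reserve_clamped(subbuf_size, total);
        chunks++;
    }
    return chunks;
}
```

With clamping, a flush larger than the sub-buffer goes out in several pieces instead of being dropped and counted as a transport failure. That only holds if the caller really does retry with the remainder.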
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
derek0883 has quit [Remote host closed the connection]
derek0883 has joined #systemtap
demon000_ has joined #systemtap
<demon000_> @fche, I submitted a patch to fix a bug @agentzh reported a while ago
<fche> hi
<fche> hm I thought we fixed that
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
amerey has quit [Remote host closed the connection]
amerey has joined #systemtap
tromey has joined #systemtap
<kerneltoast> demon000_, you'll need to rebase your patch on the newest stap master. one of my previous patches changed that bit of code
<demon000_> If we did then @agentzh didn't mention that to me
amerey has quit [Quit: Leaving]
<agentzh> demon000_: i did not know the master already fixes it. has it?
amerey has joined #systemtap
<agentzh> next time we should test the latest master first.
<agentzh> kerneltoast: which commit fixes it? shall we cherry-pick?
<kerneltoast> agentzh, it's not fixed in master
<kerneltoast> and demon000_'s fix isn't correct because the print buffer isn't null terminated
<agentzh> ah, okay
<agentzh> demon000_: next time you can post the patch into a gist for review here.
<kerneltoast> i almost fixed this bug by accident in my big print patch lol
<agentzh> and let's do internal review for open source patches first.
<kerneltoast> i was going to replace the strlcpy with memcpy
<agentzh> to save fche's cycles.
<kerneltoast> let me do that now
<demon000_> um
<agentzh> kerneltoast: that's interesting.
<demon000_> @kerneltoast, wdym the print buffer is not null terminated
<agentzh> your patches are really far reaching ;)
<demon000_> yes it is
<demon000_> strlcpy will put a null at the end
<kerneltoast> demon000_, pb->buf in your patch is not null terminated
<kerneltoast> the source buffer isn't null terminated
<demon000_> no matter if it is null terminated or not
<kerneltoast> if pb->len is less than size, you'll append garbage bytes to the output buffer
<fche> why not just use strlcpy as it is
<fche> and if you think "size" is wrong in the original, pass "size-1" instead
<demon000_> I get it now @kerneltoast
<kerneltoast> fche, because we're not using strlcpy for its ability to check for null bytes in the source buffer
<kerneltoast> we're using it for its ability to append a null byte at the end of the output
orivej has quit [Ping timeout: 240 seconds]
<demon000_> are you sure it's not null terminated tho?
<kerneltoast> so we may as well make that clear with a memcpy
<kerneltoast> demon000_, yeah i wrote that code
<fche> strlcpy puts the null at the end, that's part of its guarantee
<demon000_> I didn't get any garbage
<kerneltoast> fche, we don't need the rest of strlcpy's guarantee
<demon000_> I guess it might have been just luck
demon000_ has quit [Quit: Leaving]
<kerneltoast> and memcpy goes F A S T
demon000_ has joined #systemtap
<fche> strlcpy in this context is exactly memcpy + \0 AIUI
<fche> we use it -everywhere- in the code, and it is fine
<kerneltoast> fche, no it's still checking for \0 in the source buffer
<demon000_> yeah ^
<demon000_> that's why dest buffer size only doesn't work
<demon000_> who would store non-null-terminated char buffers tho
<fche> I am skeptical that any benchmark will see a difference
<kerneltoast> fche, it's clearer to the reader anyway
<kerneltoast> cosmin got confused by it
<demon000_> if it's not null-terminated it's not a string @fche
<kerneltoast> err, demon000_ is cosmin
<demon000_> so str* methods don't make sense
<fche> in this case we're dealing with strings
<kerneltoast> "strings"
<demon000_> is it?
<kerneltoast> fyi we use memcpy when print flushing too
<demon000_> if it's not null terminated
<demon000_> it's just some bytes
<demon000_> anyway yeah my patch was wrong sadly
<kerneltoast> demon000_, it doesn't have a null termination in order to save space. the per-cpu log buffers are limited in size
<fche> buffers at the transport layer are a different thing
<kerneltoast> i wasn't the one who originally omitted the null termination either btw
<fche> these are strings being printed/into
<kerneltoast> fche, all the print routines use memcpy
<kerneltoast> the strlcpy usage here is inconsistent
<kerneltoast> err no the print routines go byte at a time, nvm
<kerneltoast> i never made the memcpy optimization in the end heh
orivej has joined #systemtap
<kerneltoast> fche, with memcpy it is clear that the source buffer can lack termination. with strlcpy it isn't
<fche> ok, if the issue is that the per-cpu stp_log_pcpu->buf may be unterminated, so we just have its log->len to work with, then fine
<kerneltoast> yep that's exactly it
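[editor's note] The point settled above is that the source buffer carries an explicit length and may lack a terminating NUL, so a str* routine that scans the source for '\0' can run past the valid bytes. A minimal sketch of the memcpy-plus-terminator approach, using a hypothetical helper rather than the code that was actually pushed to runtime/stack.c:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most src_len bytes from a buffer that is NOT guaranteed to be
 * NUL-terminated, always terminating dst. Returns the bytes copied,
 * not counting the terminator. Hypothetical helper for illustration.
 */
static size_t copy_unterminated(char *dst, size_t dst_size,
                                const char *src, size_t src_len)
{
    size_t n;

    assert(dst_size > 0); /* need room for the terminator */

    /* Bounded by the valid source length and by the space in dst. */
    n = src_len < dst_size - 1 ? src_len : dst_size - 1;
    memcpy(dst, src, n); /* never reads beyond src[src_len - 1] */
    dst[n] = '\0';
    return n;
}
```

The copy length here depends on both the valid source length and the destination size, which is why passing only the destination size to a string-copy routine was not enough.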
<kerneltoast> shall i do a quick test and then push?
<fche> ok
<kerneltoast> agentzh, in your description for the bug (https://sourceware.org/bugzilla/show_bug.cgi?id=26844), your stap command doesn't include a.out
<kerneltoast> what should the correct command be?
derek0883 has quit [Remote host closed the connection]
<demon000_> it automatically picks a.out
derek0883 has joined #systemtap
<kerneltoast> huh, didn't for me
<kerneltoast> i needed to specify it explicitly
<kerneltoast> oh well, i tested it
irker760 has joined #systemtap
<irker760> systemtap: sultan systemtap.git:master * release-4.4-26-gfd93cf71d / runtime/stack.c: PR26844: fix off-by-one error when copying printed backtraces
<kerneltoast> fche, there's that pushed
<kerneltoast> my vm running the testsuite this morning died from the probe lock soft lockup btw
<kerneltoast> i wasn't running the testsuite in parallel mode
<kerneltoast> so now i'll need to restart the testsuite to get that bulkmode patch verified
derek0883 has quit [Remote host closed the connection]
<demon000_> if you probe oneshot it works with a.out
<demon000_> otherwise it doesn't
<demon000_> but oneshot probe doesn't do anything anyway
tromey has quit [Quit: ERC (IRC client for Emacs 27.1)]
orivej has quit [Ping timeout: 260 seconds]
orivej has joined #systemtap
derek0883 has joined #systemtap
orivej has quit [Ping timeout: 240 seconds]
demon000_ has quit [Ping timeout: 272 seconds]
<agentzh> kerneltoast: you can specify the -x PID or -c CMD options of stap.
orivej has joined #systemtap
mjw has quit [Quit: Leaving]
lickingball has joined #systemtap
amerey has quit [Quit: Leaving]