#systemtap on 2020-12-09 — irc logs at freenode.irclog.whitequark.org

2015-11-12 23:18 fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged

00:20 <kerneltoast> oh turns out the thing i did earlier fixed it

00:20 <kerneltoast> but

00:20 <kerneltoast> i was running add.stp on its own

00:20 <kerneltoast> so i didn't notice it was fixed

00:20 <kerneltoast> hah

00:21 <kerneltoast> i'm dumb

00:21 <kerneltoast> static int _stp_data_write_commit(void *entry)

00:21 <kerneltoast> {

00:21 <kerneltoast> + struct rchan_buf *buf;

00:21 <kerneltoast> - /* Nothing to do here. */

00:21 <kerneltoast> +

00:21 <kerneltoast> + buf = _stp_get_rchan_subbuf(_stp_relay_data.rchan->buf,

00:21 <kerneltoast> + smp_processor_id());

00:21 <kerneltoast> + __stp_relay_switch_subbuf(buf, 0);

00:21 <kerneltoast> return 0;

00:21 <kerneltoast> }

00:24 irker962 has joined #systemtap

00:24 <irker962> systemtap: sultan systemtap.git:sultan/bulkmode2 * release-4.4-26-g5c1b84a8d / runtime/print_flush.c runtime/transport/relay_v2.c runtime/transport/transport.c runtime/transport/transport.h staprun/relay.c: always use per-cpu bulkmode relayfs files to communicate with userspace

00:24 <irker962> systemtap: sultan systemtap.git:sultan/bulkmode2 * release-4.4-27-g5836a314d / tapset-timers.cxx: Revert "REVERTME: tapset-timers: work around on-the-fly deadlocks caused by mutex_trylock"

00:25 <kerneltoast> fche, check out sultan/bulkmode2

00:25 <kerneltoast> i'm gonna run it through the testsuite now since it works with add.exp

00:26 <kerneltoast> i'll let you know how it goes in about 12000 seconds

00:27 <kerneltoast> (that's how long the testsuite takes to run, i just check dmesg for the last stap log)

00:36 khaled has quit [Quit: Konversation terminated!]

00:52 jistone has quit [Quit: ZNC - http://znc.in]

00:52 jistone has joined #systemtap

01:10 mjw has quit [Quit: Leaving]

01:22 hpt has joined #systemtap

01:53 derek0883 has quit [Remote host closed the connection]

01:59 derek0883 has joined #systemtap

02:02 orivej has quit [Ping timeout: 240 seconds]

02:09 <agentzh> hopefully we can run the stap test suite in parallel soon.

02:09 <agentzh> right now it's really a struggle.

02:13 <fche> hey the new code might work well enough in parallel mode too

02:15 <fche> patience is not a struggle

02:15 <fche> you just need to hold the sleeping pigeon pose for 666 minutes

02:15 <fche> breathe in through the nose, out through the mouth

02:16 <fche> contract and relax all the muscles around the vagus nerve

02:16 <fche> and by the end of it all

02:16 <fche> you'll have testsuite results

02:16 <fche> hope this helps

02:17 <fche> kerneltoast, I see what you did and can see why it should work

02:17 <fche> but not sure why it wasn't there before

02:30 <agentzh> kerneltoast: maybe we could split the test suite and run each subset of files in separate kvm guests...

02:30 <agentzh> that way we can utilize all the phy cpu cores...

02:30 <kerneltoast> you mean to speed up serial mode?

02:30 <agentzh> yes

02:31 <kerneltoast> not sure if previous tests in the testsuite influence the rest of the results though

02:31 <agentzh> but seems like bulkmode is around the corner, probably we can do parallel runs directly anyway.

02:31 <kerneltoast> seems like we're going to be from serial mode soon anyway

02:31 <kerneltoast> *free from serial mode

02:31 <agentzh> aye

02:32 <agentzh> kerneltoast: you got fun things to do while waiting for the test results? ;)

02:32 <kerneltoast> yeah i've got a big fat merge to work on

02:32 <agentzh> oh yeah that one.

02:32 <agentzh> cool

02:34 <agentzh> fche: the thing is that our customers is very impatient. so we really need to get these bugs behind us ASAP ;)

02:34 <agentzh> *are

02:34 <kerneltoast> lucky me that i've never had to face our customers :P

02:34 <agentzh> oh boy, you are protected :)

02:37 <kerneltoast> agentzh, we still might have the probe_lock lockup stopping us from parallel runs, and fche said his buildbots were still dying even with the bulkmode patch

02:37 <kerneltoast> so there's no shortage of bugs :)))

02:40 <kerneltoast> fche, yeah i have no idea why that code was not there before, nor how it was working before...

02:40 <agentzh> still dying with your latest bulkmode2 branch?

02:41 <agentzh> maybe bulk mode was never exercised much.

02:41 <agentzh> since the default was never bulk.

02:42 <agentzh> it's a relatively new territory.

02:42 <kerneltoast> agentzh, bulkmode2 just fixes a problem where prints weren't being flushed. bulkmode1 got rid of the mutex_trylock, so the buildbots should not be suffering from the mutex_trylock bug anymore

02:42 <agentzh> that's good news.

02:43 <agentzh> hopefully they can get merged soon.

02:43 <agentzh> once the test suite is green enough.

02:43 <kerneltoast> yeah bulkmode2 should be good to go once the testsuite is all green

02:43 <kerneltoast> 4000 seconds left

02:43 <agentzh> yeah, probe lock is still out there.

03:05 orivej has joined #systemtap

03:15 <fche> I'll rerun my local tests against the new branch

03:43 irker962 has quit [Quit: transmission timeout]

04:03 <kerneltoast> oh boy, the testsuite is taking longer than 12000 seconds

04:03 <kerneltoast> i hope it's not more timeouts

04:03 <fche> things looking good here so far

04:13 <agentzh> it's rare that fche is still around. haha

04:41 derek0883 has quit [Remote host closed the connection]

04:48 <kerneltoast> fche, hmm i'm seeing this, i think it's an environment issue:

04:48 <kerneltoast> +ERROR: tcl error sourcing /home/sultan/systemtap/testsuite/systemtap.printf/out1.exp.

04:48 <kerneltoast> +ERROR: error writing "stdout": I/O error

04:48 <kerneltoast> any idea what the problem is?

04:48 <kerneltoast> 1.4G of free space isn't enough?

04:51 <kerneltoast> cleared up some space and i'm gonna re-run

04:51 <kerneltoast> we'll see what happens tomorrow morning i guess

05:13 <fche> not familiar

05:13 <fche> sounds almost like -EPIPE, something not there to listen to the output

05:14 derek0883 has joined #systemtap

06:17 orivej has quit [Ping timeout: 240 seconds]

06:44 hpt has quit [Quit: Lost terminal]

06:45 hpt has joined #systemtap

06:58 derek088_ has joined #systemtap

06:58 derek088_ has quit [Remote host closed the connection]

06:59 derek088_ has joined #systemtap

06:59 derek0883 has quit [Ping timeout: 260 seconds]

07:21 derek088_ has quit [Remote host closed the connection]

08:01 khaled has joined #systemtap

11:48 hpt has quit [Ping timeout: 264 seconds]

11:53 mjw has joined #systemtap

12:02 orivej has joined #systemtap

13:29 orivej has quit [Ping timeout: 264 seconds]

14:01 ema has quit [Quit: reboot]

14:03 ema has joined #systemtap

14:29 orivej has joined #systemtap

15:03 amerey has joined #systemtap

17:41 <kerneltoast> fche, hmm still getting that wacky error, and all the 32-bit syscalls are failing

17:41 <kerneltoast> guess bulkmode needs more fixing...

17:42 <fche> that syscall stuff shouldn't relate to transport at all

17:43 <kerneltoast> fche, https://paste.centos.org/view/a010a2ea

17:43 derek0883 has joined #systemtap

17:43 <fche> "transport failures"

17:43 <fche> oh

17:44 <kerneltoast> that's from the transport failure atomic var?

17:44 <fche> think so

17:45 <kerneltoast> ouch

17:48 <kerneltoast> oh ok i think i see the issue

17:49 <kerneltoast> when print_flush requests a length greater than the subbuf size, the code in relay_v2 says NO and returns an error

17:50 <kerneltoast> if (unlikely(length > buf->chan->subbuf_size))

17:50 <kerneltoast> -goto toobig;

17:50 <kerneltoast> +length = buf->chan->subbuf_size;

17:50 <kerneltoast> i think that should do it
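
The change kerneltoast pastes here swaps a hard failure for a clamp when a reservation exceeds one sub-buffer. A minimal userspace sketch of the two policies follows. The function names `reserve_reject` and `reserve_clamp` and the `granted` out-parameter are hypothetical and are not the relay_v2 code, which the log shows only as a three-line diff.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Policy described as the old behaviour: a request larger than one
 * sub-buffer is refused outright. Returns 0 on success, -1 if too big.
 */
static int reserve_reject(size_t length, size_t subbuf_size, size_t *granted)
{
    assert(granted != NULL);
    if (length > subbuf_size)
        return -1;              /* corresponds to "goto toobig" in the diff */
    *granted = length;
    return 0;
}

/*
 * Policy proposed in the diff: an oversized request is shortened to the
 * sub-buffer size instead of failing.
 */
static int reserve_clamp(size_t length, size_t subbuf_size, size_t *granted)
{
    assert(granted != NULL);
    *granted = length > subbuf_size ? subbuf_size : length;
    return 0;
}
```

With clamping, a caller that asked for more than it was granted would presumably have to write the remainder with a further reservation or accept truncation; the log does not say which the real caller does.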

17:56 derek0883 has quit [Remote host closed the connection]

17:57 derek0883 has joined #systemtap

17:58 derek0883 has quit [Remote host closed the connection]

18:04 derek0883 has joined #systemtap

18:25 demon000_ has joined #systemtap

18:25 <demon000_> @fche, I submitted a patch to fix a bug @agentzh reported a while ago

18:25 <demon000_> https://sourceware.org/bugzilla/show_bug.cgi?id=26844

18:30 <fche> hi

18:30 <fche> hm I thought we fixed that

18:45 amerey has quit [Remote host closed the connection]

18:45 amerey has joined #systemtap

18:54 amerey has quit [Remote host closed the connection]

18:54 amerey has joined #systemtap

19:11 tromey has joined #systemtap

19:31 <kerneltoast> demon000_, you'll need to rebase your patch on the newest stap master. one of my previous patches changed that bit of code

19:42 <demon000_> If we did then @agentzh didn't tell mention that

19:42 <demon000_> didn't mention me*

19:59 amerey has quit [Quit: Leaving]

19:59 <agentzh> demon000_: i did not know the master already fixes it. has it?

20:00 amerey has joined #systemtap

20:01 <agentzh> next time we should test the latest master first.

20:01 <agentzh> kerneltoast: which commit fixes it? shall we cherry-pick?

20:06 <kerneltoast> agentzh, it's not fixed in master

20:08 <kerneltoast> and demon000_'s fix isn't correct because the print buffer isn't null terminated

20:09 <agentzh> ah, okay

20:11 <agentzh> demon000_: next time you can post the patch into a gist for review here.

20:11 <kerneltoast> i almost fixed this bug by accident in my big print patch lol

20:11 <agentzh> and let's do internal review for open source patches first.

20:11 <kerneltoast> i was going to replace the strlcpy with memcpy

20:11 <agentzh> to save fche's cycles.

20:11 <kerneltoast> let me do that now

20:11 <demon000_> um

20:11 <agentzh> kerneltoast: that's interesting.

20:11 <demon000_> @kerneltoast, wdym the print buffer is not null terminated

20:12 <agentzh> your patches are really far reaching ;)

20:12 <demon000_> yes it is

20:12 <demon000_> strlcpy will put a null at the end

20:12 <kerneltoast> demon000_, pb->buf in your patch is not null terminated

20:12 <kerneltoast> the source buffer isn't null terminated

20:12 <demon000_> no matter if it is null terminated or not

20:12 <kerneltoast> if pb->len is less than size, you'll append garbage bytes to the output buffer

20:16 <kerneltoast> this should fix it: https://gist.github.com/kerneltoast/56e28bd0b11e8324d18b3b036eac01ce

20:16 <fche> why not just use strlcpy as it is

20:17 <fche> and if you think ", size" is wrong in the original, use ", size-1" instead

20:17 <demon000_> I get it now @kerneltoast

20:17 <kerneltoast> fche, because we're not using strlcpy for its ability to check for null bytes in the source buffer

20:17 <kerneltoast> we're using it for its ability to append a null byte at the end of the output

20:17 orivej has quit [Ping timeout: 240 seconds]

20:17 <demon000_> are you sure it's not null terminated tho?

20:17 <kerneltoast> so we may as well make that clear with a memcpy

20:18 <kerneltoast> demon000_, yeah i wrote that code

20:18 <fche> strlcpy puts the null at the end, that's part of its guarantee

20:18 <demon000_> I didn't get any garbage

20:18 <kerneltoast> fche, we don't need the rest of strlcpy's guarantee

20:18 <demon000_> I guess it might have been just luck

20:19 demon000_ has quit [Quit: Leaving]

20:19 <kerneltoast> and memcpy goes F A S T

20:19 demon000_ has joined #systemtap

20:19 <fche> strlcpy in this context is exactly memcpy + \0 AIUI

20:19 <fche> we use it -everywhere- in the code, and it is fine

20:19 <kerneltoast> fche, no it's still checking for \0 in the source buffer

20:19 <demon000_> yeah ^

20:20 <demon000_> that's why dest buffer size only doesn't work

20:20 <demon000_> who would store non-null-terminated char buffers tho

20:20 <fche> I am skeptical that any benchmark will see a difference

20:20 <kerneltoast> fche, it's clearer to the reader anyway

20:20 <kerneltoast> cosmin got confused by it

20:20 <demon000_> if it's not null-terminated it's not a string @fche

20:20 <kerneltoast> err, demon000_ is cosmin

20:20 <demon000_> so str* methods don't make sense

20:21 <fche> in this case we're dealing with strings

20:21 <kerneltoast> "strings"

20:21 <demon000_> is it?

20:21 <kerneltoast> fyi we use memcpy when print flushing too

20:21 <demon000_> if it's not null terminated

20:21 <demon000_> it's just some bytes

20:21 <demon000_> anyway yeah my patch was wrong sadly

20:21 <kerneltoast> demon000_, it doesn't have a null termination in order to save space. the per-cpu log buffers are limited in size

20:22 <fche> buffers at the transport layer are a different thing

20:22 <kerneltoast> i wasn't the one who originally omitted the null termination either btw

20:22 <fche> these are strings being printed/into

20:23 <kerneltoast> fche, all the print routines use memcpy

20:23 <kerneltoast> the strlcpy usage here is inconsistent

20:23 <kerneltoast> err no the print routines go byte at a time, nvm

20:24 <kerneltoast> i never made the memcpy optimization in the end heh

20:24 orivej has joined #systemtap

20:24 <kerneltoast> fche, with memcpy it is clear that the source buffer can lack termination. with strlcpy it isn't

20:27 <fche> ok, if the issue is that the per-cpu stp_log_pcpu->buf may be unterminated, so we just have its log->len to work with, then fine

20:27 <kerneltoast> yep that's exactly it
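
The point settled above is that the source buffer carries an explicit length and no terminating NUL, so the copy should be driven by that length and not by a scan for '\0'. A minimal userspace sketch of such a copy, assuming a hypothetical helper named `copy_counted` (this is not the code from the gist or from runtime/stack.c):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy from a source described by (src, src_len) that is NOT required to be
 * NUL-terminated into dst, which holds dst_size bytes. At most dst_size - 1
 * bytes are copied and dst is always terminated when dst_size > 0.
 * Returns the number of bytes copied, not counting the terminator.
 */
static size_t copy_counted(char *dst, size_t dst_size,
                           const char *src, size_t src_len)
{
    size_t n;

    assert(dst != NULL && src != NULL);
    if (dst_size == 0)
        return 0;

    /* Never read past src_len and never write past dst_size - 1. */
    n = src_len < dst_size - 1 ? src_len : dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Because the copy never looks for a terminator in the source, bytes that happen to follow the valid region of the source buffer cannot leak into the output, which is the failure kerneltoast describes for a length shorter than the destination size.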

20:28 <kerneltoast> shall i do a quick test and then push?

20:29 <fche> ok

20:37 <kerneltoast> agentzh, in your description for the bug (https://sourceware.org/bugzilla/show_bug.cgi?id=26844), your stap command doesn't include a.out

20:37 <kerneltoast> what should the correct command be?

20:39 derek0883 has quit [Remote host closed the connection]

20:47 <demon000_> it automatically picks a.out

20:48 derek0883 has joined #systemtap

20:49 <kerneltoast> huh, didn't for me

20:49 <kerneltoast> i needed to specify it explicitly

20:49 <kerneltoast> oh well, i tested it

20:55 irker760 has joined #systemtap

20:55 <irker760> systemtap: sultan systemtap.git:master * release-4.4-26-gfd93cf71d / runtime/stack.c: PR26844: fix off-by-one error when copying printed backtraces

20:56 <kerneltoast> fche, there's that pushed

20:57 <kerneltoast> my vm running the testsuite this morning died from the probe lock soft lockup btw

20:57 <kerneltoast> i wasn't running the testsuite in parallel mode

20:57 <kerneltoast> so now i'll need to restart the testsuite to get that bulkmode patch verified

21:10 derek0883 has quit [Remote host closed the connection]

21:15 <demon000_> if you probe oneshot it works with a.out

21:15 <demon000_> otherwise it doesn't

21:15 <demon000_> but oneshot probe doesn't do anything anywya

21:15 <demon000_> anyway*

21:39 tromey has quit [Quit: ERC (IRC client for Emacs 27.1)]

21:45 orivej has quit [Ping timeout: 260 seconds]

21:51 orivej has joined #systemtap

22:03 derek0883 has joined #systemtap

22:25 orivej has quit [Ping timeout: 240 seconds]

22:29 demon000_ has quit [Ping timeout: 272 seconds]

22:30 <agentzh> kerneltoast: you can specify the -x PID or -c CMD options of stap.

22:57 orivej has joined #systemtap

22:58 mjw has quit [Quit: Leaving]

22:59 lickingball has joined #systemtap

23:18 amerey has quit [Quit: Leaving]