#systemtap on 2018-08-28 — irc logs at freenode.irclog.whitequark.org

2015-11-12 23:18 fche changed the topic of #systemtap to: http://sourceware.org/systemtap; email systemtap@sourceware.org if answers here not timely, conversations may be logged

00:43 <fche> hey

00:43 <fche> agentzh, if we need grammar changes like that, I don't oppose them strongly, but they need to be noted in NEWS and made conditional on systemtap_v, so older scripts can run with --compatible=3.2

00:52 <agentzh> fche: will do that. thanks. it makes sense.

00:53 <fche> yeah, and similarly the quit/abort function is a definite user-relevant change that should get a blurb in the NEWS

00:54 <agentzh> fche: gotcha :)

00:55 <agentzh> btw, you mean --compatible=3.3, right? i see the last release is 3.3.

00:57 <agentzh> fche: a quick question: does the item order in NEWS matter? or should i just append to the end of the existing list for 4.0?

00:57 <agentzh> it does not seem to be following any apparent ordering or arrangement to me.

01:01 <fche> that part's kind of a random order

01:01 <fche> the release notes unscramble it

01:01 <agentzh> okay, i see.

01:02 <agentzh> fche: i've seen that existing code uses something like has_version("2.3") to do the conditionals. what version number should i use? 3.3 or 4.0?

01:03 <agentzh> should it be the next release or the last release number?

01:05 <agentzh> scanning the git history for existing instances of has_version(), it seems to be the former.

01:06 <agentzh> sorry, the latter, a previous release number.

01:07 <fche> you can compare for <4.0 as opposed to <=3.2

01:11 <agentzh> i'll try reusing the parser class's input.has_version("") API.

01:11 <fche> yup
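
The reason for comparing against the upcoming release (< 4.0) rather than a past one (<= 3.2) can be sketched with a small model. This is only a Python illustration of the comparison logic: parse_version and grammar_mode are hypothetical names, not part of SystemTap's parser API, and "4.0" is taken from the discussion above as the release that introduces the change.

```python
# Hypothetical model of a version gate; these names are illustrative only
# and do not correspond to SystemTap's actual has_version() implementation.

def parse_version(text):
    """Turn a dotted version string such as "3.3" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def grammar_mode(compatible, changed_in="4.0"):
    """Return "old" for scripts that target a release before the change."""
    if parse_version(compatible) < parse_version(changed_in):
        return "old"
    return "new"


# 3.3 was already released with the old grammar, so it must map to "old".
for version in ("3.2", "3.3", "4.0"):
    print(version, grammar_mode(version))
```

A gate written as <= 3.2 would wrongly hand the new grammar to scripts run with --compatible=3.3, which is why the comparison is anchored on the release that introduces the change.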

01:12 <fche> do we have any instances of this ?: ?: nesting already? I bet we do somewhere in the tapset

01:12 <agentzh> nesting ?: ?: should work as before.

01:13 <agentzh> it's just that a ? b : c = 3 now needs to be rewritten as a ? b : (c = 3)

01:13 <agentzh> but yeah, i'll add some more tests to cover nested ?:

01:13 <agentzh> and also for various --compatible xxx cases.

01:14 <fche> yeah. I think the parser has some code to detect version-sensitive constructs; might want to look for that and check that it triggers for these constructs

01:14 <agentzh> does --compatible specify a version inclusively or exclusively? in other words, for --compatible 3.3, does it mean the incompatibility is introduced *after* 3.3, excluding 3.3?

01:14 <fche> from the stap man page, it is the version of stap release that the given script is designed to run on

01:15 <agentzh> okay, thanks. i got confused by the word "beyond" in the usage text.

01:15 <fche> (by the way, in your tests ... probe begin { ... exit () } may be expressed as probe oneshot { ... })

01:15 <agentzh> ah, that's a nice trick.

01:15 <agentzh> i'll update newly written tests.

01:15 <agentzh> do you want me to update tests in existing patches too?

01:16 <fche> doesn't matter

01:16 <agentzh> okay, gotcha.

01:16 <agentzh> thanks for the reminder.

01:16 <agentzh> *tip

01:18 <fche> hm, that parser warning I was looking for is not quite what this would need (to help us find any preexisting code that needs those extra parens)

01:18 <fche> it was elaborate.cxx systemtap_v_check-related code line 1935ish

01:20 <agentzh> fche: by running the full test suite, i only found one existing test case requiring extra parens in ?:

01:20 <agentzh> and i'll fix that in my v2 patch.

01:20 <agentzh> it was under systemtap.examples/, ansi_colors2.stp

01:21 <fche> ok. would be curious how you found that only one needed changing

01:22 <fche> wonder if there is syntax that would be accepted with both variants of the grammar, but resulting in a different AST and thus different semantics

01:22 <fche> that's the scary part of changing grammar

01:22 <fche> how do you know that you're done adapting code?

01:23 <fche> it could be simply safer to leave the code as is, and note in the docs the slightly different precedence than in C, and then have your generator tool emit extra parens

01:27 <agentzh> fche: i wrote a script that compares the results automatically and singles out the truly new test failures (which is a bit tricky due to random test failures in some test files).

01:28 <fche> an 'stap -v -p1' would pretty-print parse trees

01:28 <agentzh> i had to run the full test suite against the current master and my own branch with the patch applied several times, and use my own script to analyze the results.

01:28 <fche> comparing with & without --compatible=foo could also be okay

01:29 <agentzh> fche: in the case of lacking parentheses, there will be an error. the current stap error is very similar to gcc's. so i think we're good?

01:30 <agentzh> it's not a warning.

01:30 <agentzh> so we would not end up with different ASTs anyway.

01:30 <agentzh> use of binary assignment operators directly after ?:'s 3rd operand will always lead to a compile-time error.

01:31 <agentzh> so i believe we're safe here.

01:32 <agentzh> in the case of gcc for similar code patterns, a similar error is given as well.

01:42 <fche> good night dude

01:43 <agentzh> night

02:45 orivej has quit [Ping timeout: 252 seconds]

02:47 orivej has joined #systemtap

02:59 orivej has quit [Ping timeout: 272 seconds]

03:18 orivej has joined #systemtap

04:10 wcohen has joined #systemtap

04:46 ericlee has joined #systemtap

05:06 <ericlee> how does aggregation work in systemtap? what exactly does "1" do in this: reads[execname(),pid()] <<< 1 ?

05:06 <ericlee> thanks

05:26 ericlee has quit [Remote host closed the connection]

05:28 ericlee has joined #systemtap

05:28 <ericlee> Running under fedora 27, and saw the following error:

05:28 <ericlee> Missing separate debuginfos, use: debuginfo-install kernel-core-4.17.3-100.fc27.x86_64

05:28 <ericlee> but the package exists already, any fix? Thanks

05:41 <agentzh> ericlee: version mismatch?

05:41 <agentzh> it has to be of the exact version number as `uname -r`.

05:42 <agentzh> "1" is a data item. it can be other integer values.

05:42 <agentzh> or a data point.

05:43 <agentzh> in your example, maybe that script only cares about the "count", so the exact data point value does not really matter (1 is just convenient; it could be any other number).

05:43 <agentzh> but for "sum", "avg", "hist_log", etc., the data point values definitely make a huge difference.

05:49 <ericlee> thanks agentzh but I still don't understand the usage of <<<, can you give me a small example?
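
A small model may help with the question above. The following is a Python sketch of the idea agentzh describes, not SystemTap code: each <<< pushes one data point into a per-key statistic, and extractors such as @count, @sum and @avg later read the accumulated values. The Stat class and the event data are made up for illustration.

```python
from collections import defaultdict


class Stat:
    """Toy stand-in for a SystemTap statistics aggregate (illustrative only)."""

    def __init__(self):
        self.count = 0   # number of data points pushed (cf. @count)
        self.total = 0   # sum of the data point values (cf. @sum)

    def add(self, value):
        """Model of `stat <<< value`: record one data point."""
        self.count += 1
        self.total += value

    def avg(self):
        """Integer average of the recorded data points (cf. @avg)."""
        return self.total // self.count


# Made-up read events: (execname, pid, bytes read).
events = [("cat", 100, 512), ("cat", 100, 1024), ("sshd", 200, 64)]

reads = defaultdict(Stat)   # models: reads[execname(), pid()] <<< 1
sizes = defaultdict(Stat)   # models: sizes[execname(), pid()] <<< nbytes

for execname, pid, nbytes in events:
    reads[execname, pid].add(1)        # the value is irrelevant if only count is used
    sizes[execname, pid].add(nbytes)   # the value matters for sum and avg

for key in sorted(reads):
    print(key, reads[key].count, sizes[key].total, sizes[key].avg())
```

With a constant 1 as the data point, count and sum coincide, which is why such scripts can be read as simple per-key event counters.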

05:49 orivej has quit [Ping timeout: 252 seconds]

05:51 ericlee has quit [Remote host closed the connection]

07:25 orivej has joined #systemtap

07:57 slowfranklin has joined #systemtap

08:14 pwithnall has joined #systemtap

08:17 orivej has quit [Ping timeout: 244 seconds]

10:40 orivej has joined #systemtap

11:32 orivej has quit [Ping timeout: 276 seconds]

12:24 wcohen has quit [Ping timeout: 252 seconds]

12:48 naveen1 has joined #systemtap

12:51 naveen has quit [Ping timeout: 264 seconds]

12:52 <fche> agentzh, btw [man stap] does discuss <<<, too bad ericlee's not here

13:00 orivej has joined #systemtap

14:02 orivej has quit [Ping timeout: 272 seconds]

14:05 brolley has joined #systemtap

14:08 tromey has joined #systemtap

16:37 <agentzh> fche: *nod*

16:38 <agentzh> fche: i sent the quit v2 patch last night: https://sourceware.org/ml/systemtap/2018-q3/msg00130.html

16:38 <agentzh> please let me know if i should do any further changes. thanks!

16:42 slowfranklin has quit [Quit: slowfranklin]

16:43 orivej has joined #systemtap

16:52 orivej has quit [Ping timeout: 245 seconds]

16:55 orivej has joined #systemtap

17:05 slowfranklin has joined #systemtap

17:16 slowfranklin has quit [Quit: slowfranklin]

17:24 slowfranklin has joined #systemtap

17:27 <agentzh> fche: also sent the ternary op v2 patch with changes you suggested: https://sourceware.org/ml/systemtap/2018-q3/msg00131.html

17:28 <fche> re. ternary, would suggest parse_expression() as the false branch rather than parse_assignment, to match the earlier behaviour

17:29 <fche> otherwise looks good, thanks!

17:29 slowfranklin has left #systemtap [#systemtap]

17:31 <agentzh> fche: okay, will make that change and commit :) thanks!

17:33 pwithnall has quit [Ping timeout: 268 seconds]

17:34 <agentzh> fche: btw, parse_assignment() is currently exactly the same as parse_expression() and i figured that it would be safer to be more specific in case we introduce operators with looser precedence above assignment. but anyway, i'll use parse_expression () as you wish :)

17:35 <fche> yeah, to guarantee compatibility

17:39 <agentzh> fche: changed and committed. thanks.

17:41 <fche> re. quit()

17:41 <fche> silly toolshed, but consider renaming quit -> abort throughout?

17:42 <fche> re. the bpf case, instead of printf("..."), you could just run error("aborting")

17:42 <fche> it shouldn't need both _set_exit_status() and exit()

17:43 <fche> a new test would be nice which exercises something other than boring old probe begin for abort purposes

17:43 <fche> maybe timer.profile .... which can easily run concurrently on different cpus

17:48 orivej has quit [Ping timeout: 268 seconds]

17:53 <agentzh> fche: re quit -> abort renaming, your call. i'll make that change if you want me to :)

17:54 <agentzh> re timer.profile tests, sure, i'll add some.

17:54 <fche> anyone else with opinions on that name?

17:55 <agentzh> fche: if we name it abort(), we still keep the stap exit code zero, right?

17:55 <agentzh> otherwise it won't work for our purposes.

17:56 <fche> sure. those who want a nonzero exit code can say error("something") abort()

17:56 <fche> hm, actually they can't :)

17:56 <agentzh> *nod*

17:56 <agentzh> oh, they can't. indeed.

17:56 <fche> because error() will already unroll the stack, the abort() is unreachable

17:56 <agentzh> *nod*

17:56 <agentzh> error() never returns.

17:57 <fche> so what we need here is fork()

17:57 <fche> :)

17:57 <fche> oh no got it

17:57 <fche> try{ error("")} catch {} abort()

17:57 <fche> just sort of rolls off the tongue :-)

17:57 <agentzh> yeah, this works.

17:57 <fche> the abort() documentation should point out that it cannot be try/catch caught

17:58 <agentzh> okay, noted.

17:58 <agentzh> will add that.

18:00 <agentzh> btw, for timer.profile tests, is it good enough to simply probe the current whole system? or do i need to invent a silly hot-looping userland C program with multiple threads?

18:01 <fche> shouldn't need any special workload

18:01 <agentzh> okay, got it. thanks.

18:16 <agentzh> fche: is it okay to commit this patch to fix my previously committed test files? https://sourceware.org/ml/systemtap/2018-q3/msg00133.html

18:26 <fche> will leave those to your discretion

18:30 <agentzh> fche: okay, thanks!

18:35 <agentzh> fche: i'm considering adding some bpf tests. it seems that it's common practice to copy tests over to the systemtap.bpf/ folder?

18:36 <agentzh> instead of using the get_runtime_list tcl proc?

18:38 <fche> bpf-only tests don't need the get_runtime_list

18:39 <agentzh> get_runtime_list does not include bpf yet.

18:40 <agentzh> so i wonder what's the way to test existing tests with bpf in the meantime (before bpf is mature enough to go into get_runtime_list).

18:40 <agentzh> *the recommended way

18:40 <fche> follow the pattern of the existing bpf tests

18:40 <agentzh> okay

18:42 <agentzh> the current pattern is not suitable for collaborations IMHO since it's all in a giant bpf.exp test file.

18:42 <agentzh> it's easy to get conflicts.

18:42 <agentzh> and also hard to do auto-test-generation.

18:43 <agentzh> can we allow adding new .exp test files under systemtap.bpf/ ?

18:43 <agentzh> just like the other subdirectories (eg systemtap.base)?

18:43 <fche> certainly

18:43 <agentzh> okay, cool.

18:43 <agentzh> then no problems for me :)

18:56 orivej has joined #systemtap

19:14 <agentzh> fche: the bpf runtime's error() tapset function has bugs, so it cannot be used for abort(): https://sourceware.org/bugzilla/show_bug.cgi?id=23580

19:15 <fche> that's ok

19:15 <agentzh> i'll leave it for now then.

19:17 <agentzh> fche: btw, is this timer.profile + abort() test case good enough? https://pastebin.com/6Daj4JHC

19:17 <agentzh> or do you have more tricky ones in your mind?

19:18 <fche> that one has the drawback that the global rest variable incurs mutex locks amongst the timer.profile probe handlers

19:19 <fche> I'd just abort() in the plain handler, plus add that printf("fire") message afterwards (to show that it is NOTREACHED)

19:19 <fche> that way they can run concurrently

19:20 <agentzh> fche: i'll try.

19:23 <agentzh> fche: like this? https://pastebin.com/pBHt8K2X

19:33 <fche> sure

19:33 <agentzh> great. thanks.

19:35 <agentzh> fche: btw, Zexuan Luo is also on our team. i asked him to hack on stap with me :)

19:36 <fche> great

19:36 <fche> all the work is a bit overwhelming :)

19:54 <agentzh> heh

19:54 <agentzh> we want to move fast :)

19:55 <agentzh> fche: abort v3 patch is sent: https://sourceware.org/ml/systemtap/2018-q3/msg00137.html

19:55 <agentzh> with all your suggestions implemented (hopefully)

19:56 <fche> yup. I don't know if it matters, but given that _stp_abort_flag is going to be set & read possibly by multiple cpus concurrently, maybe it should be an atomic_t like the session_state()

19:58 <agentzh> fche: i was thinking about that too, but i convinced myself that setting a flag in only one direction is always safe anyway?

19:58 <agentzh> or maybe i'm wrong here?

19:58 <fche> it's probably 99% safe

19:59 <fche> and yeah, an atomic read every place would probably suck for purposes of performance

19:59 <agentzh> your call :)

19:59 <agentzh> i don't mind rolling out a v4 patch.

19:59 <agentzh> the performance concern is indeed valid.

20:00 <fche> yeah, the reading part is done many times within typical probe handlers

20:00 <agentzh> i think the worst case is that one cpu reads a zero while the flag is being set to one.

20:00 <agentzh> this should be fine behavior, i guess?

20:00 <agentzh> checking the flag is not at real time anyway.

20:00 <fche> -if- that were the worst case, yes, but am not convinced that with zero concurrency control lower level insns, that is the case

20:02 <agentzh> cannot imagine cases worse than that.

20:03 <agentzh> setting is only to 1, from either 0 or 1.

20:03 <fche> heh, that's not quite the standard of proof for concurrency software :-)

20:03 <fche> it's the cross-probe fine-granularity aborting which is the difficulty here

20:03 <agentzh> not trying to offer a proof, but trying to learn something new :)

20:04 <agentzh> if this will blow up the cpu on the instruction level, then i would learn something new :)

20:05 <fche> I am humble enough not to be sure that nothing else can go wrong with a non-concurrency-controlled global flag

20:05 <agentzh> that's respectable :)

20:06 <agentzh> so what's the next move? waiting for more feedback?

20:07 <fche> hm, a couple of possibilities

20:07 <fche> 1) could benchmark stap with an atomic_t alternative (both in the reading & writing cases), see if the effect is measurable

20:08 <fche> 2) could reconsider whether the systemwide instant stoppage of all probe handlers is actually necessary, or whether you can let the other handlers finish peaceably

20:08 <fche> 3) could look for another mechanism, maybe lighter than atomic_t, just for simple flags

20:09 <agentzh> any candidates for 3)?

20:10 <agentzh> 2) does not work for us.

20:10 <fche> can you elaborate why?

20:10 <agentzh> 1) looks tricky to do fair benchmarks.

20:11 <agentzh> other handlers can 1) have side effects which clobber the tool's output, 2) make the exiting take much longer than desired.

20:15 <fche> much longer? how? probe handlers run fast (-DMAXACTION statements total)

20:27 <agentzh> fche: a sec. i need to think a bit more about it :)

20:27 <agentzh> so you are suggesting making abort() just abort the current probe handler instead of all the running handlers in 2)?

20:35 <agentzh> so we just turn abort() into a special error()?

20:35 <agentzh> and simply set and check c->aborted, for example?

20:35 <agentzh> is that what you are saying?

20:36 <agentzh> i'm good with this change if that's what your 2) suggestion is :)

20:37 <agentzh> you want me to update the patch?

20:40 tromey has quit [Quit: ERC (IRC client for Emacs 26.1.50)]

20:46 <agentzh> fche: i wonder whether the existing _stp_exit_flag global has the same concurrency problem.

20:50 <fche> there is a chance

20:50 <fche> the main flag though is that genuine atomic_t session_state() value

20:54 <agentzh> okay

20:54 <agentzh> fche: btw, what do you think of this patch please? https://sourceware.org/ml/systemtap/2018-q3/msg00138.html

20:55 <agentzh> for bare return statements.

20:56 <fche> the parser part is more tricky than that; we do not have compulsory ; statement terminators

20:56 <fche> so return a = 4 would be parsed differently before & after

20:57 <fche> (the fact that return is a special statement that precludes execution of subsequent statements is a separate issue)

20:58 <fche> is there something difficult about return 0 for such cases?

21:04 <agentzh> no, return a = 4 should parse like before.

21:04 <agentzh> right now the parser only checks if the next token is ';' or '}'.

21:05 <agentzh> if so, then a bare return is recognized. otherwise it parses just like before.

21:05 <fche> ok, so a strict extension
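
The one-token lookahead agentzh describes can be sketched roughly as follows. This is a Python model over a pre-split token list, not the actual parser code: parse_return is a hypothetical name, and the operand is simply collected up to the next ';' or '}' instead of being parsed as a full expression. The model assumes every statement is followed by one of those two tokens.

```python
def parse_return(tokens, pos):
    """Model of parsing a return statement that starts at tokens[pos].

    Returns (operand, next_pos), where operand is None for a bare return
    and a list of tokens otherwise. Illustrative only.
    """
    assert tokens[pos] == "return"
    pos += 1

    # Bare return: recognized only if the very next token is ';' or '}'.
    if pos < len(tokens) and tokens[pos] in (";", "}"):
        return None, pos

    # Otherwise behave as before: everything up to the terminator is the
    # operand (a stand-in for the real expression parser).
    start = pos
    while pos < len(tokens) and tokens[pos] not in (";", "}"):
        pos += 1
    return tokens[start:pos], pos


print(parse_return(["return", ";"], 0))
print(parse_return(["return", "a", "=", "4", ";"], 0))
```

Because the bare form is accepted only in positions where the old grammar would have reported an error, existing inputs such as return a = 4 keep their previous parse.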

21:05 <agentzh> i'm not sure how far we should go with omitting ';', since it is ambiguous per se. maybe we could also check for statement prefixes like `for`, `if`, `while`, etc.?

21:06 <agentzh> return 0 looks like a hack and will change the current function's return type, which is not desired.

21:06 <fche> if it were for any other statement type (one that didn't abort control flow), yeah the other statement keywords following would have to be dealt with

21:07 <fche> change current function's return type ... if it was unknown, it changes it to long ... but nothing else. after all, nothing must be reading that value. so why is it a problem?

21:08 <agentzh> because certain function calls should never be used in expressions with the pseudo values. such use cases should lead to compilation errors.

21:08 <agentzh> it's just like why C should have void functions ;)

21:08 <agentzh> it's for better code and also docs.

21:09 <agentzh> and better compile-time error messages.

21:15 <fche> ok, not sure of the value, but there isn't any serious cost, so why not

21:16 <fche> one thing though: in the pretty-printer, it better print return; instead of return

21:16 <fche> unless you can be sure the parent structure will add that semicolon

21:17 <fche> the NEWS & stap man page would have to identify that this void return is only accepted at the end of a block or with a ;

21:17 <fche> (language/grammar changes also need to be mentioned in the man page)

21:27 <agentzh> fche: sure, will do. thanks!

21:27 <agentzh> i'll add more tests to cover those.

21:31 <agentzh> fche: re it better print return; i've just added a test case and it seems that the pretty printer already adds a semicolon at the end.

21:31 <agentzh> so we're already good :)

21:31 <agentzh> i'll add support for "return if ...", etc.

21:32 <fche> no need really; just as long as docs are clear

21:32 <agentzh> ah, i already did.

21:32 <fche> and it's not that simple anyway, you may need to deal with symbols and other stuff too, not just the few statement keywords.

21:32 <agentzh> *nod* so i should remove it?

21:33 <fche> re. the pretty-printer, check in different contexts like if foo return while(1) return

21:33 <agentzh> okay

21:43 <agentzh> tested your example, still fine.

21:43 <agentzh> i'll add it to my test suite.

21:43 <agentzh> anyway

21:53 brolley has left #systemtap [#systemtap]

22:25 <agentzh> fche: just sent return stmt v3 patch according to your feedback: https://sourceware.org/ml/systemtap/2018-q3/msg00139.html

23:23 <agentzh> fche: just sent abort v4 patch: https://sourceware.org/ml/systemtap/2018-q3/msg00140.html

23:31 <agentzh> now it's using c->aborted just like the existing c->last_error flag.

23:32 <agentzh> please let me know if it's good enough now :)