jemc changed the topic of #ponylang to: Welcome! Please check out our Code of Conduct => https://github.com/ponylang/ponyc/blob/master/CODE_OF_CONDUCT.md | Public IRC logs are available => http://irclog.whitequark.org/ponylang
unbalancedparen has quit [Ping timeout: 240 seconds]
jemc has quit [Ping timeout: 252 seconds]
dynarr has joined #ponylang
unbalancedparen has joined #ponylang
c355e3b has quit [Quit: Connection closed for inactivity]
montanonic has quit [Ping timeout: 255 seconds]
montanonic has joined #ponylang
amclain has quit [Quit: Leaving]
montanonic has quit [Ping timeout: 255 seconds]
montanonic has joined #ponylang
k0nsl has quit [Ping timeout: 252 seconds]
k0nsl has joined #ponylang
TheRealMue is now known as TheMue
TheMue has left #ponylang [#ponylang]
montanonic has quit [Ping timeout: 276 seconds]
c355e3b has joined #ponylang
mytrile has joined #ponylang
_andre has joined #ponylang
gsteed has joined #ponylang
TwoNotes has joined #ponylang
skanur has joined #ponylang
skanur has quit [Client Quit]
vrand has joined #ponylang
srenatus[m] has quit [Ping timeout: 265 seconds]
srenatus[m] has joined #ponylang
<TwoNotes> Issue 1000. Something steps on a sched->q structure. When actor-stealing happens, a bad value is returned from pop(sched) in pop_global, called from steal.
<TwoNotes> What exactly is the _atomic_load function?
<TwoNotes> Should I expect _atomic_load(&next->data) to return the contents of next->data?
jemc has joined #ponylang
Praetonus has joined #ponylang
<Praetonus> TwoNotes: _atomic_load fetches a value from memory in a thread-safe fashion
<TwoNotes> So should v = _atomic_load(&foo) give the same result as v = foo?
<TwoNotes> Because I am seeing it *not* do that
<Praetonus> v = foo isn't an atomic operation. It means it doesn't know about multithreading, and if foo is modified in another thread at the same time, then you have a data race
<TwoNotes> I understand that.
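Here is a minimal C sketch of the distinction Praetonus describes, using the GCC/Clang `__atomic` builtins. It assumes the runtime's `_atomic_load` wraps something comparable, but this is not its actual definition. `run_counter_demo`, `worker`, `INCREMENTS_PER_THREAD` and `MAX_THREADS` are hypothetical names. Several threads increment a shared counter with `__atomic_fetch_add`, and the result is read back with `__atomic_load_n`. Writing `counter++` or `v = counter` instead, while other threads are touching `counter`, would be a data race, which C11 treats as undefined behaviour.

```c
#include <pthread.h>
#include <stddef.h>

#define INCREMENTS_PER_THREAD 100000
#define MAX_THREADS 8

/* Shared counter; only ever accessed through the __atomic builtins. */
static long counter;

static void* worker(void* arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS_PER_THREAD; i++) {
        /* Atomic read-modify-write: no increment can be lost.
           A plain counter++ here would be a data race. */
        __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
    }
    return NULL;
}

/* Hypothetical demo: runs nthreads workers (1..MAX_THREADS) and returns
   the final count, or -1 on bad input or thread-creation failure. */
long run_counter_demo(int nthreads)
{
    pthread_t threads[MAX_THREADS];
    int created = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    __atomic_store_n(&counter, 0, __ATOMIC_RELAXED);

    while (created < nthreads &&
           pthread_create(&threads[created], NULL, worker, NULL) == 0)
        created++;
    /* Join every thread that did start, even on partial failure. */
    for (int i = 0; i < created; i++)
        pthread_join(threads[i], NULL);
    if (created != nthreads)
        return -1;

    /* The thread-safe counterpart of "v = counter". */
    return __atomic_load_n(&counter, __ATOMIC_SEQ_CST);
}
```

`run_counter_demo(4)` should return `4 * INCREMENTS_PER_THREAD`. With a plain `counter++`, the total can come out lower because concurrent increments overwrite each other.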
<Praetonus> You're on ARM, right?
<TwoNotes> This is in libponyrt/sched/mpmcq.c around line 80
<TwoNotes> yes ARM
<TwoNotes> There is a line there, void* data = _atomic_load(&next->data)
<TwoNotes> I check the returned value in 'data' and it is 0x2.
<TwoNotes> It should be either NULL or the address of an actor_t
<Praetonus> Yeah, it looks really wrong
<TwoNotes> In the debugger stopped at that point I print next->data and it is a valid address
<Praetonus> Could you try it with --ponythreads=1?
<TwoNotes> I think there is a race condition here somewhere
<TwoNotes> I will try that. Default is 4 on this machine
<TwoNotes> That is a compile-time option, or run-time?
<Praetonus> Run-time, you can pass that to compiled Pony programs
<TwoNotes> Should it work within gdb as well?
<Praetonus> Yes. I think you have to pass the flag to the run command
<TwoNotes> I will have to run it lots of times. This problem does not always show up, reinforcing the idea that it is timing related
<TwoNotes> ok
<Praetonus> Also, what is the C compiler you used to compile the runtime, and which version?
<jemc> I use `gdb --args program arg1 arg2 ...` to pass args to the debugged program
<TwoNotes> Is that gcc?
<TwoNotes> gcc version 6.1.1 20160501
<TwoNotes> It says "Thread model posix"
<TwoNotes> --with-arch=armv7-a
<TwoNotes> uname says the hardware is armv7l
<TwoNotes> Arch Linux
<TwoNotes> Not failing so far with ponythreads=1. I will keep trying
<TwoNotes> With ponythreads=1, then actor stealing should never happen, right?
<Praetonus> You're right. We'll have to try harder. I think testing stealing on one thread would require multiple schedulers running on the same thread
<TwoNotes> atomic_load is used in 5 modules in the RT
<TwoNotes> The initial symptom is that these random values pulled off the queue eventually get used as actor_t pointers, resulting in segfaults.
<TwoNotes> I added a bunch of checks in the actor, scheduler, and mpmcq modules to validate that things that are supposed to be addresses really are
<TwoNotes> That is how I caught this
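The log does not show the checks TwoNotes added. One hypothetical heuristic of that kind rejects any non-NULL value that falls in the first page of memory or is not aligned for a pointer-sized field. That is enough to flag values like `0x1`, `0x2` or `0x19`. The 4096-byte threshold and the name `plausible_object_ptr` are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed threshold: the first page is normally left unmapped. */
#define MIN_PLAUSIBLE_ADDRESS ((uintptr_t)4096)

/* Hypothetical debugging check for a value popped from a queue that
   should be either NULL (empty queue) or a pointer to a real object. */
bool plausible_object_ptr(const void* p)
{
    uintptr_t v = (uintptr_t)p;

    if (p == NULL)
        return true;    /* an empty-queue result is legitimate */
    if (v < MIN_PLAUSIBLE_ADDRESS)
        return false;   /* e.g. 0x1, 0x2, 0x19: inside the zero page */
    if (v % _Alignof(void*) != 0)
        return false;   /* misaligned for a struct with pointer members */
    return true;
}
```

This only catches obviously bad values. A stale or corrupted pointer that happens to be aligned and large enough still passes.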
<TwoNotes> Now it could be that something else, or my own code, is somehow stomping on these data structures from another thread.
<Praetonus> I suspect the way we use atomics somehow introduces an undefined behaviour, which is only visible on ARM
<Praetonus> This would be really problematic
<Praetonus> The only thing I see right now is that we're not using the _Atomic type qualifier for atomic variables. I'll look at the C standard to see if it's allowed or not
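For comparison, here are the two styles in question. The first is a C11 `_Atomic`-qualified object accessed through `<stdatomic.h>`. The second is an ordinary object accessed only through the GCC/Clang `__atomic` builtins. The builtins are a compiler extension documented to work on plain integer and pointer objects. The C11 standard itself only defines atomic operations on atomic types. This sketch does not settle whether the runtime's usage is correct, and the slot and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Style 1: C11 _Atomic-qualified object, used via <stdatomic.h>. */
static _Atomic(void*) c11_slot;

/* Style 2: ordinary object, used only via GCC/Clang builtins.
   Every concurrent access must go through the builtins; a single
   plain read or write elsewhere would be a data race. */
static void* builtin_slot;

void publish_both(void* p)
{
    atomic_store_explicit(&c11_slot, p, memory_order_release);
    __atomic_store_n(&builtin_slot, p, __ATOMIC_RELEASE);
}

void* read_c11(void)
{
    return atomic_load_explicit(&c11_slot, memory_order_acquire);
}

void* read_builtin(void)
{
    return __atomic_load_n(&builtin_slot, __ATOMIC_ACQUIRE);
}
```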
<TwoNotes> I have been reading http://llvm.org/docs/Atomics.html
<Praetonus> The bug only happens in scheduler queues when stealing actors?
SilverKey has joined #ponylang
<TwoNotes> I have seen it in messageq, but not as often
<TwoNotes> But that module also uses atomic ops
toblux has joined #ponylang
<TwoNotes> Happened again with ponythreads=2.
Perelandric has joined #ponylang
<Perelandric> Does it ever make sense to have a generic type constraint that is a concrete type instead of an interface?
<Perelandric> The compiler lets me do this: `class Test[T:String]`
<Perelandric> ...which I thought was surprising, so I added `let x: T` `new create() => x = "foo"` out of curiosity
<Perelandric> ...and it gives >>String val is not a subtype of String #any
<Praetonus> TwoNotes: I'll try to get my hands on an ARM system to test various things
<TwoNotes> They are cheap. Under $100 gets you the complete kit with case, power supply, etc.
<TwoNotes> RPi3 has HDMI out, and I know that there is an Ubuntu-MATE download for it
<SeanTAllen> Perelandric: that probably doesn't make sense from a human perspective.
<SeanTAllen> to the compiler right now, its "just a type"
<jemc> Perelandric: SeanTAllen: it could potentially make sense to let you parameterize the rcap of T
<SeanTAllen> ah true
<jemc> for example, if we have `class Test[T: String #read]`, we could instantiate a `Test[String val]` or `Test[String ref]`
<jemc> actually, when combined with Praetonus' RFC for type param inference, it could be a cool pattern for solving a wrapper type problem I was thinking about the other day
amclain has joined #ponylang
<jemc> right now in pony-sodium, I have string-wrapper types for things like public and secret keys
<jemc> they currently wrap `String val`, so there's not a good way to have a mutable one (which someone pointed out could be useful for security-paranoid clearing of memory after use)
<jemc> if the wrapped type were parameterized, and that could be inferred as Praetonus has proposed, it would probably be able to wrap a ref or val with no significant loss in succinctness or convenience
<jemc> I'll have to look into it a bit more later
<Perelandric> ok, thanks for the info.
mytrile has quit [Quit: Connection closed for inactivity]
foopbar has joined #ponylang
toblux has left #ponylang [#ponylang]
<TwoNotes> Praetonus, the generated code for _atomic_load uses a 'dmb ish' instruction just after fetching the value.
<jemc> heh, 'dumb ish'
<TwoNotes> Data Memory Barrier, Inner Shareable Domain.
<TwoNotes> Last machine I programmed in assembler was a VAX. It did not have such things
<Praetonus> It's the memory barrier for the synchronisation
<Praetonus> ARM has a weak memory model so it needs to add barriers to synchronise things. On strongly-ordered systems like x86, an atomic load and a plain load both use the same instruction
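This minimal sketch shows the classic message-passing pattern with release/acquire ordering, using the GCC/Clang builtins. The names are hypothetical. The producer writes an ordinary `payload`, then release-stores `ready`. Once the consumer's acquire load observes `ready == true`, the earlier write to `payload` is guaranteed to be visible. On x86 both accesses typically compile to plain `mov` instructions. On ARMv7 the compiler typically adds `dmb` barriers, consistent with the `dmb ish` seen after the load. That is why a missing ordering constraint can stay hidden on x86 and only surface on ARM.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;   /* plain data, written before the flag is set */
static bool ready;    /* accessed only via the __atomic builtins */

static void* producer(void* arg)
{
    (void)arg;
    payload = 42;                                      /* ordinary store */
    __atomic_store_n(&ready, true, __ATOMIC_RELEASE);  /* publish it */
    return NULL;
}

/* Hypothetical demo: starts a producer and waits for its message.
   Returns the payload the consumer saw, or -1 on thread failure. */
int receive_message(void)
{
    pthread_t t;

    payload = 0;
    __atomic_store_n(&ready, false, __ATOMIC_RELAXED);
    if (pthread_create(&t, NULL, producer, NULL) != 0)
        return -1;

    /* Acquire load: once it returns true, the producer's earlier
       write to payload is visible to this thread too. */
    while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
        ;  /* busy-wait: acceptable in a sketch, wasteful in real code */

    int seen = payload;
    pthread_join(t, NULL);
    return seen;
}
```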
<Praetonus> Could you look what _atomic_store and _atomic_exchange look like?
<TwoNotes> I will have to look for some places in pony that do that
<Praetonus> ponyint_messageq_push does both
<foopbar> Hi, is anyone working on or aware of a WebSocket implementation in Pony?
<jemc> foopbar: I've seen various people talk about it, but I haven't seen any concrete work on websockets in Pony
<jemc> there's also been talk about redesigning the `net/http` package as well (which was only ever really a proof of concept), to resolve some usability and some performance concerns
<TwoNotes> I did websockets in Erlang. But the cowboy library takes care of all the protocol switching.
<TwoNotes> Websockets are really cool. Your server app and your JavaScript program just throw messages at each other asynchronously
<TwoNotes> Praetonus, ponyint_messageq_push as requested https://gist.github.com/pdtwonotes/5f8c372a6ff47b1d0c597e75e2059de2
graaff has joined #ponylang
<TwoNotes> When I look at next->data in the mpmcq_pop routine, it looks like a valid actor_t address.
<TwoNotes> But when I look at the void* data value obtained from the _atomic_load, it is variously things like 0x1, 0x2, 0x19, or negative numbers. *sometimes*. Most of the time it all works.
<TwoNotes> One possible clue - the valid-looking actor_t address is a value like 0x0007d880, suggesting it is one of the built-in actors, perhaps for stdout, etc. Dynamically allocated things seem to be at around 0x733ffd00
<TwoNotes> I do make lots of log.print calls.
<TwoNotes> Where 'log' is the stdout file that comes in the Env
<Praetonus> From what I read in the ARM documentation, the assembly for the atomic operations is fine
<Praetonus> Could you try putting a mutex locked during the entirety of mpmcq_push, mpmcq_push_single and mpmcq_pop and see if the bug still happens? If it doesn't then the problem comes from the atomics
<foopbar> TwoNotes: I'd only need a client for now to subscribe to a wss feed. No need for a server.
<TwoNotes> Praetonus, what would that look like?
<TwoNotes> I try to avoid mutexes in my own programming so I am not familiar with the facilities for doing that in C
<Praetonus> TwoNotes: Actually I think there is a more suspicious thing to test first. In src/common/atomics.h there are 2 occurrences of __ATOMIC_RELAXED. Could you try replacing those by __ATOMIC_ACQ_REL and run your tests?
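The GCC/Clang `__atomic` builtins accept only certain memory orders per operation:
- Loads take `__ATOMIC_RELAXED`, `__ATOMIC_CONSUME`, `__ATOMIC_ACQUIRE` or `__ATOMIC_SEQ_CST`.
- Stores take `__ATOMIC_RELAXED`, `__ATOMIC_RELEASE` or `__ATOMIC_SEQ_CST`.
- `__ATOMIC_ACQ_REL` is meaningful for read-modify-write operations such as exchange and compare-and-swap.

Whether the substitution is valid therefore depends on which operations the two `__ATOMIC_RELAXED` occurrences in `src/common/atomics.h` belong to, and the log does not show that. The hypothetical mailbox below shows a valid order for each kind of operation.

```c
#include <stddef.h>

/* Hypothetical single-slot mailbox, accessed only via the builtins. */
static void* mailbox;

/* Loads: RELAXED, CONSUME, ACQUIRE or SEQ_CST (not RELEASE/ACQ_REL). */
void* mailbox_peek(void)
{
    return __atomic_load_n(&mailbox, __ATOMIC_ACQUIRE);
}

/* Stores: RELAXED, RELEASE or SEQ_CST (not ACQUIRE/ACQ_REL). */
void mailbox_put(void* p)
{
    __atomic_store_n(&mailbox, p, __ATOMIC_RELEASE);
}

/* Read-modify-write (here: exchange) accepts every order, including
   ACQ_REL: the read half acquires and the write half releases. */
void* mailbox_swap(void* p)
{
    return __atomic_exchange_n(&mailbox, p, __ATOMIC_ACQ_REL);
}
```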
<TwoNotes> I have been building with config=debug. Would that mess up any of this?
<Praetonus> I don't think it would
<TwoNotes> It is building now. Takes about 5 minutes
<TwoNotes> Now compiling my code. But I have to go out for a while. Results in a couple hours
foopbar has quit [Quit: Page closed]
Praetonus has quit [Quit: Leaving]
SilverKey has quit [Read error: Connection reset by peer]
travisgriggs_ has joined #ponylang
travisgriggs has quit [Ping timeout: 264 seconds]
sylvanc has quit [Ping timeout: 264 seconds]
travisgriggs_ is now known as travisgriggs
srenatus[m] has quit [Ping timeout: 276 seconds]
sylvanc has joined #ponylang
srenatus[m] has joined #ponylang
graaff has quit [Quit: Leaving]
_andre has quit [Quit: leaving]
<TwoNotes> Nope, problem still there
kulibali has joined #ponylang
<TwoNotes> So need to try the mutex idea
montanonic has joined #ponylang
Praetonus has joined #ponylang
<Praetonus> TwoNotes: I've modified mpmcq.c to use a mutex around the atomic operations, you can try with that: https://gist.github.com/Praetonus/8bd8d6706589d647171e850a1cfd18ed
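The gist itself is not reproduced here. This sketch shows the same diagnostic idea, using hypothetical names rather than the runtime's `mpmcq_t`: a FIFO queue where every push and pop holds one mutex for its whole duration. If the crashes stop once all queue operations are serialized like this, the lock-free code path is implicated. The static `PTHREAD_MUTEX_INITIALIZER` covers initialization; a dynamically allocated queue would call `pthread_mutex_init` instead. The pop also shows the empty-queue early return releasing the lock. Per the follow-up below, that kind of unlock was missing in the first version of the gist.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical intrusive node; the caller owns its memory. */
typedef struct node_t {
    struct node_t* next;
    void* data;
} node_t;

typedef struct {
    node_t* head;          /* oldest element, NULL if empty */
    node_t* tail;          /* newest element, NULL if empty */
    pthread_mutex_t lock;  /* guards head and tail */
} locked_queue_t;

/* For statically allocated queues only. */
#define LOCKED_QUEUE_INIT { NULL, NULL, PTHREAD_MUTEX_INITIALIZER }

void locked_queue_push(locked_queue_t* q, node_t* n)
{
    n->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    pthread_mutex_unlock(&q->lock);
}

/* Returns the oldest node's data, or NULL if the queue is empty.
   Every exit path, including the early return, releases the lock. */
void* locked_queue_pop(locked_queue_t* q)
{
    pthread_mutex_lock(&q->lock);
    node_t* n = q->head;
    if (n == NULL) {
        pthread_mutex_unlock(&q->lock);  /* omit this and other threads block forever */
        return NULL;
    }
    q->head = n->next;
    if (q->head == NULL)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return n->data;
}
```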
travisgriggs has quit [Quit: travisgriggs]
<TwoNotes> building
<TwoNotes> Is that mutex initialized properly? All the threads appear to block
<TwoNotes> I see the problem. There is a 'return' at line 84 that does not release the mutex
<TwoNotes> Fixed that. Trying again
<TwoNotes> Praetonus. No crashes yet
<Praetonus> TwoNotes: Oops, sorry for the unlock problem, I didn't test it
<TwoNotes> It is working so far. I will continue beating on it
<doublec> Is there a Pony sqlite wrapper or similar DB wrapper? I seem to recall someone doing something there.
<TwoNotes> I think there are a few. I did one
<TwoNotes> Three different DB interfaces at https://github.com/pdtwonotes/tackroom
<TwoNotes> But the sqlite one does not work on ARM
<TwoNotes> The other two are various key/data stores
<doublec> TwoNotes: thanks!
<TwoNotes> I can't remember why the sqlite one does not work on ARM. A missing library I think
<TwoNotes> See the test.c files for examples
<TwoNotes> Praetonus, I have not seen it running this long without error before. So I think the mutex fixed it.
<TwoNotes> It looks like the code was trying to do the right thing, with that do-while loop, but something was not working as expected
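A do-while loop like that is the usual compare-and-swap retry pattern. Below is a generic sketch of the pattern using `__atomic_compare_exchange_n`. It is not the runtime's pop, and `cas_fetch_max` is a hypothetical helper. Each iteration decides based on the most recently observed value, and a failed CAS refreshes that value before retrying. A real lock-free queue must also handle problems such as ABA and safe node reclamation, which individual atomic operations do not solve on their own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: atomically raises *target to at least value.
   Returns the value that was stored there beforehand. */
uintptr_t cas_fetch_max(uintptr_t* target, uintptr_t value)
{
    uintptr_t seen = __atomic_load_n(target, __ATOMIC_RELAXED);

    do {
        if (seen >= value)
            return seen;  /* already large enough; nothing to write */
        /* On failure the builtin stores the current value into seen,
           so the next iteration re-checks against fresh data. */
    } while (!__atomic_compare_exchange_n(target, &seen, value,
                                          true, /* weak: may fail spuriously */
                                          __ATOMIC_ACQ_REL, __ATOMIC_RELAXED));
    /* On success seen still holds the old value. */
    return seen;
}
```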
<TwoNotes> In the early days of the TOPS-20 operating system, they had bad instability in the file system
<jemc> so sounds like the atomic is not-so-atomic on that platform?
<TwoNotes> The manager finally told them "If you can't fix this in two more days, I will fix it myself"
<TwoNotes> So he went in, threw a mutex around the entire file system, which fixed the problem
<TwoNotes> So at least they could take their time finding out what the real problem was. (Which I think they eventually did)
<TwoNotes> jemc, well there are individual atomic operations in that code. But it is also manipulating a queue. Perhaps the entire queue integrity was not being maintained.
<TwoNotes> ARM cache sync works differently from x86 too
<TwoNotes> I heard that TOPS20 story directly from a very senior VP of engineering at DEC. He had been the manager in question.
vrand has quit [Quit: Leaving.]
<Praetonus> TwoNotes: Could you get the assembly code for ponyint_mpmcq_pop?
<TwoNotes> Yes, but let me clean it up some. I had some extra testing in there I can take out now. That way it will match what you have. (plus the extra release)
<TwoNotes> Later tonite
<Praetonus> Thanks
<TwoNotes> Linking libpony*.tests takes forever. Especially on ARM. Is there a way to skip that?
<SeanTAllen> we ended up cross compiling for ARM because... PAIN
toblux has joined #ponylang
toblux has quit [Client Quit]
jemc has quit [Ping timeout: 244 seconds]