unbalancedparen has quit [Ping timeout: 240 seconds]
jemc has quit [Ping timeout: 252 seconds]
dynarr has joined #ponylang
unbalancedparen has joined #ponylang
c355e3b has quit [Quit: Connection closed for inactivity]
montanonic has quit [Ping timeout: 255 seconds]
montanonic has joined #ponylang
amclain has quit [Quit: Leaving]
montanonic has quit [Ping timeout: 255 seconds]
montanonic has joined #ponylang
k0nsl has quit [Ping timeout: 252 seconds]
k0nsl has joined #ponylang
TheRealMue is now known as TheMue
TheMue has left #ponylang [#ponylang]
montanonic has quit [Ping timeout: 276 seconds]
c355e3b has joined #ponylang
mytrile has joined #ponylang
_andre has joined #ponylang
gsteed has joined #ponylang
TwoNotes has joined #ponylang
skanur has joined #ponylang
skanur has quit [Client Quit]
vrand has joined #ponylang
srenatus[m] has quit [Ping timeout: 265 seconds]
srenatus[m] has joined #ponylang
<TwoNotes>
Issue 1000. Something steps on a sched->q structure. When actor-stealing happens, a bad value is returned from pop(sched) in pop_global, called from steal.
<TwoNotes>
What exactly is the _atomic_load function?
<TwoNotes>
Should I expect _atomic_load(&next->data) to return the contents of next->data?
jemc has joined #ponylang
Praetonus has joined #ponylang
<Praetonus>
TwoNotes: _atomic_load fetches a value from memory in a thread-safe fashion
<TwoNotes>
So should v = _atomic_load(&foo) give the same result as v = foo?
<TwoNotes>
Because I am seeing it *not* do that
<Praetonus>
v = foo isn't an atomic operation. It means it doesn't know about multithreading, and if foo is modified in another thread at the same time, then you have a data race
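A minimal C11 sketch of the distinction being discussed. The `node_t` type and the `node_load`/`node_store` helpers are hypothetical names for illustration, not the runtime's `_atomic_load` or its real node layout. When no other thread is writing, the atomic load should return exactly what a plain read would; with a concurrent writer it should return either the old or the new pointer, never a mixture of the two.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type for illustration; not the runtime's real layout. */
typedef struct node_t {
  _Atomic(void*) data;
} node_t;

/* Atomically read the slot. Without a concurrent writer this yields the
 * same value as a plain read of the field. */
static void* node_load(node_t* n) {
  return atomic_load_explicit(&n->data, memory_order_acquire);
}

/* Atomically write the slot so that a later acquire load can observe it. */
static void node_store(node_t* n, void* p) {
  atomic_store_explicit(&n->data, p, memory_order_release);
}
```

A value such as 0x2 coming back from a load like this, while the field itself holds a valid address, would therefore point to something other than the load being merely "different from a plain read".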
<TwoNotes>
I understand that.
<Praetonus>
You're on ARM, right?
<TwoNotes>
This is in libponyrt/sched/mpmcq.c around line 80
<TwoNotes>
yes ARM
<TwoNotes>
There is a line there, void* data = _atomic_load(&next->data)
<TwoNotes>
I check the returned value in 'data' and it is 0x2.
<TwoNotes>
It should be either NULL or the address of an actor_t
<Praetonus>
Yeah, it looks really wrong
<TwoNotes>
In the debugger stopped at that point I print next->data and it is a valid address
<Praetonus>
Could you try it with --ponythreads=1?
<TwoNotes>
I think there is a race condition here somewhere
<TwoNotes>
I will try that. Default is 4 on this machine
<TwoNotes>
That is a compile-time option, or run-time?
<Praetonus>
Run-time, you can pass that to compiled Pony programs
<TwoNotes>
Should it work within gdb as well?
<Praetonus>
Yes. I think you have to pass the flag to the run command
<TwoNotes>
I will have to run it lots of times. This problem does not always show up, reinforcing the idea that it is timing related
<TwoNotes>
ok
<Praetonus>
Also, what is the C compiler you used to compile the runtime, and which version?
<jemc>
I use `gdb --args program arg1 arg2 ...` to pass args to the debugged program
<TwoNotes>
Is that gcc?
<TwoNotes>
gcc version 6.1.1 20160501
<TwoNotes>
It says "Thread model posix"
<TwoNotes>
--with-arch=armv7-a
<TwoNotes>
uname says the hardware is armv7l
<TwoNotes>
Arch Linux
<TwoNotes>
Not failing so far with ponythreads=1. I will keep trying
<TwoNotes>
With ponythreads=1, then actor stealing should never happen, right?
<Praetonus>
You're right. We'll have to try harder. I think testing stealing on one thread would require multiple schedulers running on the same thread
<TwoNotes>
atomic_load is used in 5 modules in the RT
<TwoNotes>
The initial symptom is that these random values pulled off the queue eventually get used as actor_t pointers, resulting in segfaults.
<TwoNotes>
I added a bunch of checks in the actor, scheduler, and mpmcq modules to validate that things that are supposed to be addresses really are
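A sketch of what such a sanity check might look like. The helper names, the 4096-byte threshold, and the alignment rule are all assumptions made for illustration; the actual checks added to the runtime are not shown in the log. It is only a heuristic: it catches small or misaligned values like 0x1, 0x2, or 0x19, but cannot prove that a plausible-looking value really is an actor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Heuristic only: reject values in the first page and values that are
 * not pointer-aligned. Both rules are assumptions, not runtime facts. */
static bool looks_like_pointer(const void* p) {
  uintptr_t v = (uintptr_t)p;
  if (v < 4096) return false;
  if ((v % sizeof(void*)) != 0) return false;
  return true;
}

/* A queue slot may legitimately be NULL (empty) or a plausible address. */
static bool valid_queue_slot(const void* p) {
  return p == NULL || looks_like_pointer(p);
}
```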
<TwoNotes>
That is how I caught this
<TwoNotes>
Now it could be that something else, or my own code, is somehow stomping on these data structures from another thread.
<Praetonus>
I suspect the way we use atomics somehow introduces an undefined behaviour, which is only visible on ARM
<Praetonus>
This would be really problematic
<Praetonus>
The only thing I see right now is that we're not using the _Atomic type qualifier for atomic variables. I'll look at the C standard to see if it's allowed or not
<Praetonus>
The bug only happens in scheduler queues when stealing actors?
SilverKey has joined #ponylang
<TwoNotes>
I have seen it in messageq, but not as often
<TwoNotes>
But that module also uses atomic ops
toblux has joined #ponylang
<TwoNotes>
Happened again with ponythreads=2.
Perelandric has joined #ponylang
<Perelandric>
Does it ever make sense to have a generic type constraint that is a concrete type instead of an interface?
<Perelandric>
The compiler lets me do this: `class Test[T:String]`
<Perelandric>
...which I thought was surprising, so I added `let x: T` `new create() => x = "foo"` out of curiosity
<Perelandric>
...and it gives >>String val is not a subtype of String #any
<Praetonus>
TwoNotes: I'll try to get my hands on an ARM system to test various things
<TwoNotes>
They are cheap. Under $100 gets you the complete kit with case, power supply, etc.
<TwoNotes>
RPi3 has HDMI out, and I know that there is an Ubuntu-MATE download for it
<SeanTAllen>
Perelandric: that probably doesn't make sense from a human perspective.
<SeanTAllen>
to the compiler right now, it's "just a type"
<jemc>
Perelandric: SeanTAllen: it could potentially make sense to let you parameterize the rcap of T
<SeanTAllen>
ah true
<jemc>
for example, if we have `class Test[T: String #read]`, we could instantiate a `Test[String val]` or `Test[String ref]`
<jemc>
actually, when combined with Praetonus' RFC for type param inference, it could be a cool pattern for solving a wrapper type problem I was thinking about the other day
amclain has joined #ponylang
<jemc>
right now in pony-sodium, I have string-wrapper types for things like public and secret keys
<jemc>
they currently wrap `String val`, so there's not a good way to have a mutable one (which someone pointed out could be useful for security-paranoid clearing of memory after use)
<jemc>
if the wrapped type were parameterized, and that could be inferred as Praetonus has proposed, it would probably be able to wrap a ref or val with no significant loss in succinctness or convenience
<jemc>
I'll have to look into it a bit more later
<Perelandric>
ok, thanks for the info.
mytrile has quit [Quit: Connection closed for inactivity]
foopbar has joined #ponylang
toblux has left #ponylang [#ponylang]
<TwoNotes>
Praetonus, the generated code for _atomic_load uses a 'dmb ish' instruction just after fetching the value.
<jemc>
heh, 'dumb ish'
<TwoNotes>
Data Memory Barrier, Inner Shareable Domain.
<TwoNotes>
Last machine I programmed in assembler was a VAX. It did not have such things
<Praetonus>
It's the memory barrier for the synchronisation
<Praetonus>
ARM has a weak memory model so it needs to add barriers to synchronise things. On strongly-ordered systems like x86, an atomic load and a plain load both use the same instruction
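To illustrate why the barrier matters, here is a small C11 sketch of the usual release/acquire hand-off. The names `payload`, `ready`, `producer`, `consumer`, and `run_handoff` are hypothetical and unrelated to the Pony runtime. The release store publishes the plain write to `payload`, and the acquire load is what guarantees the consumer sees it; on a weakly ordered CPU the compiler has to emit barrier instructions to provide that guarantee.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;       /* plain data, written before publishing */
static atomic_int ready;  /* flag that publishes the payload */

static void* producer(void* arg) {
  (void)arg;
  payload = 42;
  /* Release: the write to payload becomes visible before ready == 1. */
  atomic_store_explicit(&ready, 1, memory_order_release);
  return NULL;
}

static void* consumer(void* arg) {
  int* out = arg;
  /* Acquire: once ready == 1 is observed, payload == 42 is visible too. */
  while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
  *out = payload;
  return NULL;
}

/* Runs one hand-off and returns the value the consumer observed. */
static int run_handoff(void) {
  pthread_t p, c;
  int seen = 0;
  payload = 0;
  atomic_store(&ready, 0);
  pthread_create(&c, NULL, consumer, &seen);
  pthread_create(&p, NULL, producer, NULL);
  pthread_join(p, NULL);
  pthread_join(c, NULL);
  return seen;
}
```

If both operations used relaxed ordering instead, the language would no longer guarantee that the consumer reads 42, even though the flag itself is still updated atomically.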
<Praetonus>
Could you look at what _atomic_store and _atomic_exchange look like?
<TwoNotes>
I will have to look for some places in pony that do that
<Praetonus>
ponyint_messageq_push does both
<foopbar>
Hi, is anyone working on or aware of a WebSocket implementation in Pony?
<jemc>
foopbar: I've seen various people talk about it, but I haven't seen any concrete work on websockets in Pony
<jemc>
there's also been talk about redesigning the `net/http` package as well (which was only ever really a proof of concept), to resolve some usability and some performance concerns
<TwoNotes>
I did websockets in Erlang. But the cowboy library takes care of all the protocol switching.
<TwoNotes>
Websockets are really cool. Your server app and your JavaScript program just throw messages at each other asynchronously
<TwoNotes>
When I look at next->data in the mpmcq_pop routine, it looks like a valid actor_t address.
<TwoNotes>
But when I look at the void* data value obtained from the _atomic_load, it is variously things like 0x1, 0x2, 0x19, or negative numbers. *sometimes*. Most of the time it all works.
<TwoNotes>
One possible clue - the valid-looking actor_t address is a value like 0x0007d880, suggesting it is one of the built-in actors, perhaps for stdout, etc. Dynamically allocated things seem to be at around 0x733ffd00
<TwoNotes>
I do make lots of log.print calls.
<TwoNotes>
Where 'log' is the stdout file that comes in the Env
<Praetonus>
From what I read in the ARM documentation, the assembly for the atomic operations is fine
<Praetonus>
Could you try putting a mutex locked during the entirety of mpmcq_push, mpmcq_push_single and mpmcq_pop and see if the bug still happens? If it doesn't then the problem comes from the atomics
<foopbar>
TwoNotes: I'd only need a client for now to subscribe to a wss feed. No need for a server.
<TwoNotes>
Praetonus, what would that look like?
<TwoNotes>
I try to avoid mutexes in my own programming so I am not familiar with the facilities for doing that in C
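A rough sketch of the idea using POSIX threads. The `lockedq_t` type and `lockedq_*` functions are hypothetical stand-ins, not the runtime's mpmcq code; in the real experiment the lock and unlock calls would presumably wrap the existing bodies of the push and pop functions. The point is only that every access to the queue's links happens while the same mutex is held.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct lnode_t {
  void* data;
  struct lnode_t* next;
} lnode_t;

/* Hypothetical FIFO queue guarded by one mutex. */
typedef struct lockedq_t {
  pthread_mutex_t lock;
  lnode_t* head;  /* oldest element, popped first */
  lnode_t* tail;  /* newest element */
} lockedq_t;

static void lockedq_init(lockedq_t* q) {
  pthread_mutex_init(&q->lock, NULL);
  q->head = NULL;
  q->tail = NULL;
}

/* Returns 0 on success, -1 if allocation fails. */
static int lockedq_push(lockedq_t* q, void* data) {
  lnode_t* n = malloc(sizeof(lnode_t));
  if (n == NULL) return -1;
  n->data = data;
  n->next = NULL;

  pthread_mutex_lock(&q->lock);
  if (q->tail != NULL) q->tail->next = n; else q->head = n;
  q->tail = n;
  pthread_mutex_unlock(&q->lock);
  return 0;
}

/* Returns the oldest element, or NULL if the queue is empty. */
static void* lockedq_pop(lockedq_t* q) {
  pthread_mutex_lock(&q->lock);
  lnode_t* n = q->head;
  if (n != NULL) {
    q->head = n->next;
    if (q->head == NULL) q->tail = NULL;
  }
  pthread_mutex_unlock(&q->lock);

  if (n == NULL) return NULL;
  void* data = n->data;
  free(n);
  return data;
}
```

This serialises all producers and consumers, so it is a diagnostic aid rather than a replacement for a lock-free queue: if the corruption disappears with the lock in place, the atomics or their ordering become the prime suspect.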
<Praetonus>
TwoNotes: Actually I think there is a more suspicious thing to test first. In src/common/atomics.h there are 2 occurrences of __ATOMIC_RELAXED. Could you try replacing those by __ATOMIC_ACQ_REL and run your tests?
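For context, a sketch of how the GCC `__atomic` builtins take a memory order argument. The wrapper names below are hypothetical and the actual macros in atomics.h may look quite different. One caveat worth keeping in mind when trying this: GCC's documentation lists `__ATOMIC_ACQ_REL` as valid for read-modify-write operations such as exchange, while a pure load accepts `__ATOMIC_ACQUIRE` and a pure store accepts `__ATOMIC_RELEASE`, so the stronger ordering may need to be spelled differently depending on which operation each `__ATOMIC_RELAXED` belongs to.

```c
#include <stddef.h>

/* Hypothetical wrappers around the GCC/Clang __atomic builtins. */

/* Load with acquire ordering (ACQ_REL is not a valid order for a load). */
static void* load_acquire(void** slot) {
  return __atomic_load_n(slot, __ATOMIC_ACQUIRE);
}

/* Store with release ordering (ACQ_REL is not a valid order for a store). */
static void store_release(void** slot, void* v) {
  __atomic_store_n(slot, v, __ATOMIC_RELEASE);
}

/* Exchange is a read-modify-write, so ACQ_REL applies here.
 * Returns the previous contents of the slot. */
static void* exchange_acq_rel(void** slot, void* v) {
  return __atomic_exchange_n(slot, v, __ATOMIC_ACQ_REL);
}
```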
<TwoNotes>
I have been building with config=debug. Would that mess up any of this?
<Praetonus>
I don't think it would
<TwoNotes>
It is building now. Takes about 5 minutes
<TwoNotes>
Now compiling my code. But I have to go out for a while. Results in a couple hours
foopbar has quit [Quit: Page closed]
Praetonus has quit [Quit: Leaving]
SilverKey has quit [Read error: Connection reset by peer]
<TwoNotes>
But the sqlite one does not work on ARM
<TwoNotes>
The other two are various key/data stores
<doublec>
TwoNotes: thanks!
<TwoNotes>
I can't remember why the sqlite one does not work on ARM. A missing library I think
<TwoNotes>
See the test.c files for examples
<TwoNotes>
Praetonus, I have not seen it running this long without error before. So I think the mutex fixed it.
<TwoNotes>
It looks like the code was trying to do the right thing, with that do-while loop, but something was not working as expected
<TwoNotes>
In the early days of the TOPS-20 operating system, they had bad instability in the file system
<jemc>
so sounds like the atomic is not-so-atomic on that platform?
<TwoNotes>
The manager finally told them "If you can't fix this in two more days, I will fix it myself"
<TwoNotes>
So he went in, threw a mutex around the entire file system, which fixed the problem
<TwoNotes>
So at least they could take their time finding out what the real problem was. (Which I think they eventually did)
<TwoNotes>
jemc, well there are individual atomic operations in that code. But it is also manipulating a queue. Perhaps the entire queue integrity was not being maintained.
<TwoNotes>
ARM cache sync works differently from x86 too
<TwoNotes>
I heard that TOPS20 story directly from a very senior VP of engineering at DEC. He had been the manager in question.
vrand has quit [Quit: Leaving.]
<Praetonus>
TwoNotes: Could you get the assembly code for ponyint_mpmcq_pop?
<TwoNotes>
Yes, but let me clean it up some. I had some extra testing in there I can take out now. That way it will match what you have. (plus the extra release)
<TwoNotes>
Later tonight
<Praetonus>
Thanks
<TwoNotes>
Linking libpony*.tests takes forever. Especially on ARM. Is there a way to skip that?
<SeanTAllen>
we ended up cross compiling for ARM because... PAIN