#m-labs on 2014-06-01 — irc logs at freenode.irclog.whitequark.org

2013-12-11 12:34 lekernel changed the topic of #m-labs to: Mixxeo, Migen, MiSoC & other M-Labs projects :: fka #milkymist :: Logs http://irclog.whitequark.org/m-labs

00:31 siruf has quit [Ping timeout: 252 seconds]

00:32 siruf has joined #m-labs

01:14 mumptai has quit [Ping timeout: 240 seconds]

07:47 nicksydney has quit [Remote host closed the connection]

07:49 nicksydney has joined #m-labs

08:26 sb0 has joined #m-labs

08:27 _florent_ has joined #m-labs

08:34 mumptai has joined #m-labs

08:44 _florent_ has quit [Ping timeout: 240 seconds]

09:08 stekern_ is now known as stekern

10:36 sh[4]rm4 has joined #m-labs

10:37 sh4rm4 has quit [Ping timeout: 252 seconds]

10:58 rofl__ has joined #m-labs

11:00 sh[4]rm4 has quit [Ping timeout: 252 seconds]

11:42 <ysionneau> http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon12StallFree.pdf < "A Stall-Free Real-Time Garbage Collector for FPGAs"

11:43 <ysionneau> not sure I understand how/why they use a lot of "software" terms for something in an fpga

11:43 <ysionneau> for instance they compare their garbage collector which seems to manage some kind of heap implemented on blockram

11:43 <ysionneau> with "malloc"

11:43 <ysionneau> is there some kind of "hardware malloc" somewhere? :o

11:43 <ysionneau> never heard of that

11:56 Alain__ has joined #m-labs

11:59 rjo_ has quit [Ping timeout: 252 seconds]

14:03 Alain__ has quit [Remote host closed the connection]

14:34 <sb0> "By uniform we mean that the shape of the objects (the size of the data fields and the location of pointers) is fixed."

14:34 <sb0> that's cheating

14:35 <sb0> even string manipulation won't work with that

14:37 <sb0> well, you could split long strings into a linked list of uniform objects
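The linked-list idea above can be sketched in a few lines. This is only an illustration, assuming a hypothetical fixed payload size (`CHUNK`) and a hypothetical `Node` layout; neither comes from the paper or from the discussion. Every node has the same shape (one fixed-size data field, one length field, one pointer), which is what the "uniform" restriction asks for.

```python
from dataclasses import dataclass
from typing import Optional

CHUNK = 8  # hypothetical payload size in bytes, chosen only for illustration


@dataclass
class Node:
    data: bytes              # always exactly CHUNK bytes (zero-padded)
    used: int                # number of meaningful bytes in data
    next: Optional["Node"]   # the single pointer field, at a fixed position


def to_chunks(s: bytes) -> Optional[Node]:
    """Split s into a linked list of uniformly shaped nodes."""
    head = None
    # Build from the tail so each node can point at its successor.
    for start in reversed(range(0, len(s), CHUNK)):
        piece = s[start:start + CHUNK]
        head = Node(piece.ljust(CHUNK, b"\0"), len(piece), head)
    return head


def from_chunks(node: Optional[Node]) -> bytes:
    """Reassemble the original byte string by walking the list."""
    out = bytearray()
    while node is not None:
        out += node.data[:node.used]
        node = node.next
    return bytes(out)


def count_nodes(node: Optional[Node]) -> int:
    n = 0
    while node is not None:
        n += 1
        node = node.next
    return n
```

The cost of this layout is that indexing into the string becomes a list walk, and the last node wastes up to `CHUNK - 1` bytes of padding.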

14:47 <sb0> "For the first time, garbage collection of programs synthesized to hardware is practical and realizable." meh

14:47 <sb0> the right thing to do with this is a Python machine :-) and the block RAMs should be SDRAM-backed caches.

15:18 rofl__ is now known as sh4rm4

15:26 <ysionneau> sb0: the thing is, I don't even understand the point of what they are doing

15:26 <ysionneau> what's the point of a "hardware" GC ?

15:27 <sb0> make it faster than a software GC

15:27 <ysionneau> ah so the point is to handle the garbage collection of the software, ok

15:27 <ysionneau> I was thinking it would be in order to manage dynamic allocations of hardware buffers or dynamically allocated stuff in the fpga

15:27 <ysionneau> but no, ok it's for software

15:28 <ysionneau> so that would need to be tightly coupled with the MMU I guess

15:28 <sb0> in their case they suggest using it for some hw-synthesized algo that uses dynamic memory. it sounds to me most HW accelerators don't need that, but it might make sense to have hardware GC in a CPU.

15:30 <sb0> one thing I can think about is a CPU where registers contain pointers at all times, that would address such "uniform" objects

15:30 <sb0> and you put the BRAM in the pipeline (which adds 2 stages)

15:31 <sb0> then you can implement HW GC, and accelerated duck typing

15:32 <sb0> and since those uniform objects are relatively large, you can use the extra space for doing SIMD/vector operations

15:37 <sb0> gee the official Python/LLVM backend is terrible... seems they even have debug prints still lying around

15:40 <ysionneau> sb0: ok I see

15:40 <ysionneau> thanks for the light

15:41 <ysionneau> llvm seems young and quickly evolving which can be a pain

15:41 <ysionneau> but it also seems to be the way forward

15:42 <sb0> yeah, llvmpy.org (the decent binding) doesn't work with dev/3.5

15:42 <ysionneau> if you use submodules you could freeze llvmpy to the one version that works for you :/

15:42 <ysionneau> if you find one :p

15:43 <ysionneau> then you only struggle with updates when you are ready to do so ^^

15:43 <sb0> been trying... couldn't find a version of llvm-or1k that would 1) work on its own 2) be compatible with llvmpy

15:43 <sb0> also I can't build llvm-lm32 anymore for some reason, tblgen segfaults when processing LM32.td

15:44 <ysionneau> speaking about llvm, what was the conclusion about gcc 4.9 for lm32? it works well? C? C++? or still a bit buggy on C++?

15:44 <sb0> seems -or1k is generally more up-to-date, and with more people working on it

15:44 <ysionneau> last time I tried it would not compile, and I didn't retry

15:45 <sb0> gcc-lm32 4.9 generally works fine

15:45 <sb0> the only source I've had problems with is this lnfpus.i that I posted on the list, which causes an ICE

15:45 <ysionneau> 17:44 < sb0> seems -or1k is generally more up-to-date, and with more people working on it < yeah they have a lot of dedicated people working on all toolchains aspects

15:45 <ysionneau> *and* the support from a company whose job is to do compilers

15:45 <ysionneau> ^^"

15:45 <ysionneau> sb0: ok

15:47 <sb0> or1k also doesn't have those stupid 'export control' clauses in its license

15:47 <ysionneau> what do those clauses say, basically?

15:47 <sb0> https://github.com/m-labs/lm32/blob/master/LICENSE.LATTICE#L136

15:49 <ysionneau> really?! you've got to check some crazy blacklist?

15:49 <ysionneau> whaaa

15:51 <ysionneau> but ... the fact that now it's on github ... basically violates the export rule doesn't it?

15:51 <ysionneau> on github, people in Iraq, North Korea etc. can just see the code

15:51 <ysionneau> and the blacklisted people as well

15:53 <sb0> yeah... though I know of no one respecting that clause, not even Lattice themselves :-) (you could download it freely from their FTP, up until last year or so)

15:54 <ysionneau> rha it sucks so much that they put this ridiculous license

15:55 <ysionneau> means that "if they wish", they can just ask for removal on github

15:55 <ysionneau> -_-

15:55 <sb0> up until recently, the alternative to lm32 was to use a ludicrously bloated, slow and/or buggy CPU, but mor1kx is becoming reasonable now...

15:56 <sb0> I'm going to get some hard numbers on mor1kx vs. lm32 ...

15:56 <ysionneau> I guess mor1kx is the way forward if the performance is there

15:57 <ysionneau> they have so much software/toolchain support

15:57 <ysionneau> and a clean license(?)

15:57 <sb0> regarding toolchain, their GCC isn't upstream yet

15:58 <sb0> and there hasn't been a binutils release since it was merged. so it's actually a bit more painful to build than for lm32.

15:58 <ysionneau> they have upstream linux support

15:58 <ysionneau> but for very embedded stuff you don't care about linux

15:58 <ysionneau> but it's just cool

15:59 <sb0> do you know of any good CPU benchmark tools btw?

15:59 <ysionneau> not at all, never did cpu benchmarking

15:59 <sb0> that do a good number of SDRAM/bus accesses (unlike dhrystone) but still do not use a lot of libc/OS calls

16:00 <ysionneau> I guess you could use things like sorting algorithm

16:01 <sb0> I've tried this http://www.eecs.umich.edu/mibench/ on lm32 vs. microblaze a few years ago

16:01 <sb0> (lm32 is the faster one, btw)

16:01 <sb0> but that was on linux

16:01 <sb0> haven't tried on bare-metal

16:10 <sb0> there's this http://www.eembc.org/coremark/

16:10 <sb0> they have numbers for the zynq and it seems a strict reporting methodology, which makes it interesting :)

17:12 <stekern> if you want to benchmark SDRAM/bus accesses, coremark isn't the tool

17:13 <stekern> it can be contained in about 8KB of cache iirc

17:13 <stekern> and it runs a "test" round first, so the actual test will all run from hot cache

17:14 <stekern> it's good for measuring pipeline performance of a cpu though

17:16 <ysionneau> yep that's what they say in the CoreMark FAQ

17:16 <ysionneau> the code will fit in the cache, but maybe some I/O will cache miss

17:16 <ysionneau> but yes the cache will absorb most of the job

17:29 Alain has joined #m-labs

19:00 _florent_ has joined #m-labs

19:17 _florent_ has quit [Ping timeout: 240 seconds]

19:25 rjo_ has joined #m-labs

20:08 <sb0> so... according to coremark, or1k is 3% faster than lm32

20:09 <sb0> and the UART bug also manifests itself :/

20:11 <sb0> both do 133 iterations/sec

20:11 <sb0> total ticks for lm32 is 1304826013, and 1256780316 for or1k

20:12 <sb0> that's 1.60 CoreMark/MHz
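The CoreMark/MHz figure can be cross-checked from the tick counts quoted above. This is a rough sketch under two assumptions that the log does not state at this point: that "ticks" are CPU clock cycles, and that both runs used 2000 iterations (the figure sb0 gives later for the clang run). The helper name `coremark_per_mhz` is made up for the example.

```python
ITERATIONS = 2000  # assumption: same iteration count as the later clang run

# Total ticks as reported in the channel.
TICKS = {
    "lm32": 1304826013,
    "or1k": 1256780316,
}


def coremark_per_mhz(iterations: int, ticks: int) -> float:
    """Iterations per million clock cycles.

    iterations/second divided by MHz reduces to iterations * 1e6 / cycles,
    so the clock frequency itself cancels out.
    """
    return iterations * 1e6 / ticks


def speedup(slow_ticks: int, fast_ticks: int) -> float:
    """How many times faster the run with fewer ticks is."""
    return slow_ticks / fast_ticks


if __name__ == "__main__":
    for cpu, ticks in TICKS.items():
        print(f"{cpu}: {coremark_per_mhz(ITERATIONS, ticks):.2f} CoreMark/MHz")
    ratio = speedup(TICKS["lm32"], TICKS["or1k"])
    print(f"or1k vs lm32: {100 * (ratio - 1):.1f}% fewer-tick advantage")
```

Under those assumptions the or1k run comes out just under 1.60 CoreMark/MHz and the lm32 run a little above 1.53, a gap of a few percent, which is in line with the numbers quoted in the channel.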

20:12 <sb0> http://www.eembc.org/coremark/

20:13 <sb0> it's faster than ultrasparc, hahaha

20:14 <sb0> the cpu in zynq is 5.92. okay. there's still some work to do.

20:17 <sb0> ah, but that's with threads

20:17 <sb0> it's only 3.38 without, it seems, according to another report

20:28 <sb0> on the area side, MiniSoC/LM32 is 3177 LUTs and MiniSoC/OR1K 4788 LUTs (+1611)

20:28 <sb0> and 2616 vs. 3039 registers (+423)

20:30 <sb0> so, OR1K is very slightly faster, but noticeably larger ...
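To put "noticeably larger" into relative terms, the area figures above can be turned into percentages. The numbers are the ones sb0 reports for the two MiniSoC builds; the dictionary layout and the `overhead` helper are only illustrative.

```python
# Resource usage of the two MiniSoC builds, as reported in the channel.
LM32 = {"luts": 3177, "regs": 2616}
OR1K = {"luts": 4788, "regs": 3039}


def overhead(base: dict, other: dict) -> dict:
    """Return {resource: (absolute delta, percent delta)} of other over base."""
    result = {}
    for key, base_count in base.items():
        delta = other[key] - base_count
        result[key] = (delta, 100.0 * delta / base_count)
    return result


if __name__ == "__main__":
    for resource, (delta, pct) in overhead(LM32, OR1K).items():
        print(f"{resource}: +{delta} ({pct:.1f}% more than LM32)")
```

That works out to roughly 51% more LUTs and about 16% more registers for the OR1K build, against a speed gain of a few percent.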

20:31 <sb0> stekern, do you have a precise idea where the bloat is coming from? you said SPRs, but I'm not convinced

20:31 <sb0> good job about the speed, btw :)

20:31 <stekern> that, and there is some duplicated logic in the fetch/icache and lsu/dcache that should be "low hanging fruit"

20:33 <stekern> I want to clean those up at some point, they are a bit of a mess right now

20:38 <stekern> I've got 1.80 coremark/mhz on our SoC btw

20:42 <sb0> different cache config?

20:42 <stekern> yeah, I think that must be it

20:42 <sb0> or compiler flags

20:43 <sb0> I'm using -Os -mhard-mul -mhard-div

20:43 <stekern> you get better numbers up to 8KB of cache iirc

20:43 <sb0> ah, yay, the or1k llvm integrated assembler mostly works

20:43 <sb0> it just chokes on inline asm

20:43 <sb0> ../../software/include/base/system.h:18:24: error: Invalid operand for instruction

20:43 <sb0> __asm__ __volatile__ ("l.mfspr %0,r0,%1" : "=r" (ret) : "K" (add));

20:45 <ysionneau> so which one do we rewrite in Migen ? mor1kx or lm32 ? :)

20:47 <stekern> and I have 16KB of cache in that config

20:47 <stekern> which reminds me of something unrelated, when I was poking at the then milkymist-ng, I noticed that wrapped wb bursts weren't supported. Has that changed?

20:48 <sb0> no, they are still unsupported

20:48 <sb0> it expands the instruction to "l.mfspr r3,r0,6"

20:48 <sb0> those spr instructions work correctly in the crt. are there restrictions on what registers can be used?

20:49 <stekern> no, but it might be that it doesn't grok that what you are feeding that function is indeed a constant

20:50 <stekern> ysionneau: why not both ;)

20:51 <sb0> the 6?

20:51 <stekern> umm, no the 6 should be fine.

20:52 <stekern> ok, it's actually the assembler that chokes on that?

20:52 <ysionneau> when I see this error message it's the assembler

20:54 <sb0> stekern, http://pastebin.com/tLj41atP

20:55 <sb0> the same command on setjmp-or1k.S, which contains e.g. "l.mtspr r0, r21, SPR_SR", executes correctly

20:55 <sb0> wait, no

20:56 <sb0> setjmp-or1k.S does compile correctly with clang

20:56 <sb0> but does not contain l.mtspr

20:57 <sb0> crt0-or1k.S contains l.mtspr, but fails all over the place with clang

20:57 <sb0> crt0-or1k.S:73:19: error: unknown operand

20:57 <sb0> l.mtspr r0, r21, ((0<< (11)) + 17)

20:57 <sb0> and a good hundred similar errors

20:57 <stekern> hmm, maybe the l.mtspr/l.mfspr isn't implemented in the integrated assembler

20:57 <stekern> (I didn't do that part)

20:57 <sb0> there are other things that fail

20:57 <sb0> crt0-or1k.S:74:15: error: unknown operand

20:57 <sb0> l.movhi r21, hi(_reset_handler)

20:58 <sb0> crt0-or1k.S:78:8: error: unknown operand

20:58 <sb0> l.jal _cache_init

20:58 <sb0> etc. etc.

20:58 <stekern> I only ran it against gnu as/ld

20:59 <stekern> http://lists.openrisc.net/pipermail/openrisc/2014-May/002178.html

20:59 <stekern> ^ might wanna poke those guys

20:59 Alain has quit [Quit: ChatZilla 0.9.90.1 [Firefox 29.0.1/20140506152807]]

20:59 <ysionneau> __asm__ __volatile__ ("l.mfspr %0,r0,%1" : "=r" (ret) : "K" (add)); <= is the 3rd operand of the instruction supposed to be a constant ?

20:59 <ysionneau> or a register

21:00 <stekern> constant

21:00 <ysionneau> https://github.com/m-labs/misoc/blob/master/software/include/base/system.h#L18

21:00 <ysionneau> to me it does not seem to be a constant here :o

21:01 <stekern> it is, depending how you call that function

21:01 <ysionneau> ah because it's inlined?

21:03 <ysionneau> ok nice I didn't know it could work like that :)

21:03 <ysionneau> 23:01 < ysionneau> ah because it's inlined? < by "it" I mean the function

21:05 gric has quit [Ping timeout: 240 seconds]

21:06 <stekern> yeah, you can depend on such things to do compile checks like this too: http://git.openrisc.net/cgit.cgi/stefan/linux/tree/arch/openrisc/include/asm/cmpxchg.h?h=smp#n27

21:07 <stekern> (I just copied that from what other archs do, I claim no credit for it ;)

21:08 <sb0> so, with clang, the performance is essentially the same: 133 iterations/s, 1273324392 ticks for the 2000 iterations (was 1256780316 with gcc)

21:12 <ysionneau> stekern: ah nice trick indeed :)

21:18 <GitHub186> [misoc] sbourdeauducq pushed 1 new commit to master: http://git.io/DPTM4A

21:18 <GitHub186> misoc/master 4c2a209 Sebastien Bourdeauducq: libbase: remove crt during make clean

21:38 <ysionneau> gn8

21:44 nicksydney has quit [Remote host closed the connection]

22:01 sh[4]rm4 has joined #m-labs

22:03 sh4rm4 has quit [Ping timeout: 252 seconds]

22:13 sh[4]rm4 is now known as sh4rm4

22:35 mumptai has quit [Ping timeout: 255 seconds]

23:07 sh4rm4 has quit [Remote host closed the connection]

23:07 sh4rm4 has joined #m-labs

23:18 <sb0> stekern, any idea why TargetRegistry::lookupTarget would fail with "No available targets are compatible with this triple, see -version for the available targets." when passed any CPU type?

23:18 <sb0> llc -version does list or1k

23:21 <sb0> seriously, llvm arch management code is a gnu/autocrap-level fuckup

23:22 <sb0> and they used C++ for it... only C or asm would have been worse

23:31 <sb0> ah, that's because in that particular version of llvm, that stupid function ignores whatever arch you specify and uses the default (x86)

23:39 <sb0> gn8

23:39 sb0 has quit [Quit: Leaving]