lekernel changed the topic of #m-labs to: Mixxeo, Migen, MiSoC & other M-Labs projects :: fka #milkymist :: Logs http://irclog.whitequark.org/m-labs
siruf has quit [Ping timeout: 252 seconds]
siruf has joined #m-labs
mumptai has quit [Ping timeout: 240 seconds]
nicksydney has quit [Remote host closed the connection]
nicksydney has joined #m-labs
sb0 has joined #m-labs
_florent_ has joined #m-labs
mumptai has joined #m-labs
_florent_ has quit [Ping timeout: 240 seconds]
stekern_ is now known as stekern
sh[4]rm4 has joined #m-labs
sh4rm4 has quit [Ping timeout: 252 seconds]
rofl__ has joined #m-labs
sh[4]rm4 has quit [Ping timeout: 252 seconds]
<ysionneau> http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon12StallFree.pdf < "A Stall-Free Real-Time Garbage Collector for FPGAs"
<ysionneau> not sure I understand how/why they use a lot of "software" terms for something in an fpga
<ysionneau> for instance they compare their garbage collector which seems to manage some kind of heap implemented on blockram
<ysionneau> with "malloc"
<ysionneau> is there some kind of "hardware malloc" somewhere? :o
<ysionneau> never heard of that
Alain__ has joined #m-labs
rjo_ has quit [Ping timeout: 252 seconds]
Alain__ has quit [Remote host closed the connection]
<sb0> "By uniform we mean that the shape of the objects (the size of the data fields and the location of pointers) is fixed."
<sb0> that's cheating
<sb0> even string manipulation won't work with that
<sb0> well, you could split long strings into a linked list of uniform objects
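A minimal sketch of the idea in the line above: a long string split into a linked list of fixed-shape objects, each with one data field and one pointer field. The names `Node`, `CHUNK_BYTES`, `to_chunks`, `from_chunks` and `count_nodes` are hypothetical and do not come from the paper or the discussion; the 8-byte payload is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Optional

CHUNK_BYTES = 8  # assumed payload size of one uniform object (arbitrary)


@dataclass
class Node:
    """One uniform object: a fixed-size data field plus a single pointer."""
    data: bytes              # always exactly CHUNK_BYTES long (zero padded)
    used: int                # number of valid bytes in data
    next: Optional["Node"]   # the one pointer slot, None at the end


def to_chunks(s: bytes) -> Optional[Node]:
    """Split s into a linked list of fixed-shape nodes."""
    head = None
    # Build back to front so each node can point at its successor.
    for start in reversed(range(0, len(s), CHUNK_BYTES)):
        piece = s[start:start + CHUNK_BYTES]
        head = Node(piece.ljust(CHUNK_BYTES, b"\0"), len(piece), head)
    return head


def from_chunks(node: Optional[Node]) -> bytes:
    """Walk the list and reassemble the original string."""
    out = bytearray()
    while node is not None:
        out += node.data[:node.used]
        node = node.next
    return bytes(out)


def count_nodes(node: Optional[Node]) -> int:
    """Number of uniform objects the string occupies."""
    n = 0
    while node is not None:
        n += 1
        node = node.next
    return n
```

Every node has the same shape regardless of string length, which is the property the quoted definition of "uniform" asks for; the cost is pointer chasing and padding in the last node.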
<sb0> "For the first time, garbage collection of programs synthesized to hardware is practical and realizable." meh
<sb0> the right thing to do with this is a Python machine :-) and the block RAMs should be SDRAM-backed caches.
rofl__ is now known as sh4rm4
<ysionneau> sb0: the thing is, I don't even understand the point of what they are doing
<ysionneau> what's the point of a "hardware" GC ?
<sb0> make it faster than a software GC
<ysionneau> ah so the point is to handle the garbage collection of the software, ok
<ysionneau> I was thinking it would be in order to manage dynamic allocations of hardware buffers or dynamically allocated stuff in the fpga
<ysionneau> but no, ok it's for software
<ysionneau> so that would need to be tightly coupled with the MMU I guess
<sb0> in their case they suggest using it for some hw-synthesized algo that uses dynamic memory. it sounds to me most HW accelerators don't need that, but it might make sense to have hardware GC in a CPU.
<sb0> one thing I can think of is a CPU where registers contain pointers at all times, which would address such "uniform" objects
<sb0> and you put the BRAM in the pipeline (which adds 2 stages)
<sb0> then you can implement HW GC, and accelerated duck typing
<sb0> and since those uniform objects are relatively large, you can use the extra space for doing SIMD/vector operations
<sb0> gee the official Python/LLVM backend is terrible... seems they even have debug prints still lying around
<ysionneau> sb0: ok I see
<ysionneau> thanks for the light
<ysionneau> llvm seems young and quickly evolving which can be a pain
<ysionneau> but it also seems the way forward
<ysionneau> to be*
<sb0> yeah, llvmpy.org (the decent binding) doesn't work with dev/3.5
<ysionneau> if you use submodules you could freeze llvmpy to the one version that works for you :/
<ysionneau> if you find one :p
<ysionneau> then you only struggle with updates when you are ready to do so ^^
<sb0> been trying... couldn't find a version of llvm-or1k that would 1) work on its own 2) be compatible with llvmpy
<sb0> also I can't build llvm-lm32 anymore for some reason, tblgen segfaults when processing LM32.td
<ysionneau> speaking about llvm, what was the conclusion about gcc 4.9 for lm32? it works well? C? C++? or still a bit buggy on C++?
<sb0> seems -or1k is generally more up-to-date, and with more people working on it
<ysionneau> last time I tried I would not compile it, and I didn't retry
<ysionneau> it would not compile*
<sb0> gcc-lm32 4.9 generally works fine
<sb0> the only source I've had problems with is this lnfpus.i that I posted on the list, which causes an ICE
<ysionneau> 17:44 < sb0> seems -or1k is generally more up-to-date, and with more people working on it < yeah they have a lot of dedicated people working on all toolchain aspects
<ysionneau> *and* the support from a company whose job is to do compilers
<ysionneau> ^^"
<ysionneau> sb0: ok
<sb0> or1k also doesn't have those stupid 'export control' clauses in its license
<ysionneau> what those clauses say basically?
<ysionneau> really?! you've got to check some crazy blacklist?
<ysionneau> whaaa
<ysionneau> but ... the fact that now it's on github ... basically violates the export rule doesn't it?
<ysionneau> on github, people in Iraq, North Korea etc. can just see the code
<ysionneau> and the black listed people as well
<sb0> yeah... though I know of no one respecting that clause, not even Lattice themselves :-) (you could download it freely from their FTP, up until last year or so)
<ysionneau> rha it sucks so much that they put this ridiculous license
<ysionneau> means that "if they wish", they can just ask for removal on github
<ysionneau> -_-
<sb0> up until recently, the alternative to lm32 was to use a ludicrously bloated, slow and/or buggy CPU, but mor1kx is becoming reasonable now...
<sb0> I'm going to get some hard numbers on mor1kx vs. lm32 ...
<ysionneau> I guess mor1kx is the way forward if the performance is there
<ysionneau> they have so much software/toolchain support
<ysionneau> and a clean license(?)
<sb0> regarding toolchain, their GCC isn't upstream yet
<sb0> and there hasn't been a binutils release since it was merged. so it's actually a bit more painful to build than for lm32.
<ysionneau> they have upstream linux support
<ysionneau> but for very embedded stuff you don't care about linux
<ysionneau> but it's just cool
<sb0> do you know of any good CPU benchmark tools btw?
<ysionneau> not at all, never did cpu benchmarking
<sb0> that do a good number of SDRAM/bus accesses (unlike dhrystone) but still do not use a lot of libc/OS calls
<ysionneau> I guess you could use things like sorting algorithm
<sb0> I've tried this http://www.eecs.umich.edu/mibench/ on lm32 vs. microblaze a few years ago
<sb0> (lm32 is the faster one, btw)
<sb0> but that was on linux
<sb0> haven't tried on bare-metal
<sb0> they have numbers for the zynq and it seems a strict reporting methodology, which makes it interesting :)
<stekern> if you want to benchmark SDRAM/bus accesses, coremark isn't the tool
<stekern> it can be contained in about 8KB of cache iirc
<stekern> and it runs a "test" round first, so the actual test will all run from hot cache
<stekern> it's good for measuring pipeline performance of a cpu though
<ysionneau> yep that's what they say in the CoreMark FAQ
<ysionneau> the code will fit in the cache, but maybe some I/O will cache miss
<ysionneau> but yes the cache will absorb most of the job
Alain has joined #m-labs
_florent_ has joined #m-labs
_florent_ has quit [Ping timeout: 240 seconds]
rjo_ has joined #m-labs
<sb0> so... according to coremark, or1k is 3% faster than lm32
<sb0> and the UART bug also manifests itself :/
<sb0> both do 133 iterations/sec
<sb0> total ticks for lm32 is 1304826013, and 1256780316 for or1k
<sb0> that's 1.60 CoreMark/MHz
<sb0> it's faster than ultrasparc, hahaha
<sb0> the cpu in zynq is 5.92. okay. there's still some work to do.
<sb0> ah, but that's with threads
<sb0> it's only 3.38 without, it seems, according to another report
<sb0> on the area side, MiniSoC/LM32 is 3177 LUTs and MiniSoC/OR1K 4788 LUTs (+1611)
<sb0> and 2616 vs. 3039 registers (+423)
<sb0> so, OR1K is very slightly faster, but noticeably larger ...
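The figures above can be cross-checked with a short calculation. This is only a sketch under two assumptions: that one "tick" is one CPU clock cycle, and that both runs used the 2000 iterations quoted later in the log for these tick counts. The helper names `coremark_per_mhz` and `percent_more` are made up for illustration.

```python
ITERATIONS = 2000  # assumed: iteration count quoted later in the log

# Numbers as reported in the channel.
ticks = {"lm32": 1304826013, "or1k": 1256780316}
luts = {"lm32": 3177, "or1k": 4788}
regs = {"lm32": 2616, "or1k": 3039}


def coremark_per_mhz(iterations: int, total_ticks: int) -> float:
    """Iterations per second per MHz, i.e. iterations per million ticks."""
    return iterations * 1e6 / total_ticks


def percent_more(new: float, old: float) -> float:
    """How much larger new is than old, in percent."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    for cpu, t in ticks.items():
        print(f"{cpu}: {coremark_per_mhz(ITERATIONS, t):.2f} CoreMark/MHz")
    # Fewer ticks for the same work means faster, so lm32 ticks are "new".
    print(f"or1k speed advantage: {percent_more(ticks['lm32'], ticks['or1k']):.1f}%")
    print(f"or1k extra LUTs: {percent_more(luts['or1k'], luts['lm32']):.1f}%")
    print(f"or1k extra registers: {percent_more(regs['or1k'], regs['lm32']):.1f}%")
```

Under those assumptions the scores come out at roughly 1.53 (lm32) and 1.59 (or1k) CoreMark/MHz, a speed gap of just under 4% against roughly 51% more LUTs and 16% more registers.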
<sb0> stekern, do you have a precise idea where the bloat is coming from? you said SPRs, but I'm not convinced
<sb0> good job about the speed, btw :)
<stekern> that, and there is some duplicate logic in the fetch/icache lsu/dcache that should be "low hanging fruit"
<stekern> I want to clean those up at some point, they are a bit of a mess right now
<stekern> I've got 1.80 coremark/mhz on our SoC btw
<sb0> different cache config?
<stekern> yeah, I think that must be it
<sb0> or compiler flags
<sb0> I'm using -Os -mhard-mul -mhard-div
<stekern> you get better numbers up to 8KB of cache iirc
<sb0> ah, yay, the or1k llvm integrated assembler mostly works
<sb0> it just chokes on inline asm
<sb0> ../../software/include/base/system.h:18:24: error: Invalid operand for
<sb0> instruction
<sb0> __asm__ __volatile__ ("l.mfspr %0,r0,%1" : "=r" (ret) : "K" (add));
<ysionneau> so which one do we rewrite in Migen ? mor1kx or lm32 ? :)
<stekern> and I have 16KB of cache in that config
<stekern> which reminds me of something unrelated, when I was poking at the then milkymist-ng, I noticed that wrapped wb bursts weren't supported. Has that changed?
<sb0> no, they are still unsupported
<sb0> it expands the instruction to "l.mfspr r3,r0,6"
<sb0> those spr instructions work correctly in the crt. are there restrictions on what registers can be used?
<stekern> no, but it might be that it doesn't grok that what you are feeding that function is indeed a constant
<stekern> ysionneau: why not both ;)
<sb0> the 6?
<stekern> umm, no the 6 should be fine.
<stekern> ok, it's actually the assembler that chokes on that?
<ysionneau> when I see this error message it's the assembler
<sb0> the same command on setjmp-or1k.S, which contains e.g. "l.mtspr r0, r21, SPR_SR", executes correctly
<sb0> wait, no
<sb0> setjmp-or1k.S does compile correctly with clang
<sb0> but does not contain l.mtspr
<sb0> crt0-or1k.S contains l.mtspr, but fails all over the place with clang
<sb0> crt0-or1k.S:73:19: error: unknown operand
<sb0> l.mtspr r0, r21, ((0<< (11)) + 17)
<sb0> and a good hundred similar errors
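The operand clang rejects, `((0<< (11)) + 17)`, looks like a preprocessed SPR address: a group number shifted left by 11 bits plus an index inside the group. A small sketch of that encoding follows; the function names `spr_addr` and `spr_split` are hypothetical, and the 11-bit shift is taken only from the expression in the error message, not checked against the architecture manual.

```python
SPR_GROUP_SHIFT = 11  # shift seen in the expanded operand above


def spr_addr(group: int, index: int) -> int:
    """Combine an SPR group and an index within the group into one address."""
    if not 0 <= index < (1 << SPR_GROUP_SHIFT):
        raise ValueError("index does not fit in the per-group field")
    if group < 0:
        raise ValueError("group must be non-negative")
    return (group << SPR_GROUP_SHIFT) + index


def spr_split(addr: int) -> tuple:
    """Inverse of spr_addr: return (group, index)."""
    return addr >> SPR_GROUP_SHIFT, addr & ((1 << SPR_GROUP_SHIFT) - 1)
```

With this, `spr_addr(0, 17)` gives 17, matching the rejected operand, and `spr_split(6)` gives `(0, 6)` for the constant in the expanded `l.mfspr r3,r0,6`. Both are plain small integers, which supports the suspicion voiced in the channel that the problem lies in how the integrated assembler parses operands rather than in the values.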
<stekern> hmm, maybe the l.mtspr/l.mfspr isn't implemented in the integrated assembler
<stekern> (I didn't do that part)
<sb0> there are other things that fail
<sb0> crt0-or1k.S:74:15: error: unknown operand
<sb0> l.movhi r21, hi(_reset_handler)
<sb0> crt0-or1k.S:78:8: error: unknown operand
<sb0> l.jal _cache_init
<sb0> etc. etc.
<stekern> I only ran it agains gnu as/ld
<stekern> +t
<stekern> ^ might wanna poke those guys
Alain has quit [Quit: ChatZilla 0.9.90.1 [Firefox 29.0.1/20140506152807]]
<ysionneau> __asm__ __volatile__ ("l.mfspr %0,r0,%1" : "=r" (ret) : "K" (add)); <= is the 3rd operand of the instruction supposed to be a constant ?
<ysionneau> or a register
<stekern> constant
<ysionneau> to me it does not seem to be a constant here :o
<stekern> it is, depending how you call that function
<ysionneau> ah because it's inlined?
<ysionneau> ok nice I didn't know it could work like that :)
<ysionneau> 23:01 < ysionneau> ah because it's inlined? < by "it" I mean the function
gric has quit [Ping timeout: 240 seconds]
<stekern> yeah, you can depend on such things to do compile checks like this too: http://git.openrisc.net/cgit.cgi/stefan/linux/tree/arch/openrisc/include/asm/cmpxchg.h?h=smp#n27
<stekern> (I just copied that from what other archs do, I claim no credit for it ;)
<sb0> so, with clang, the performance is essentially the same: 133 iterations/s, 1273324392 ticks for the 2000 iterations (was 1256780316 with gcc)
<ysionneau> stekern: ah nice trick indeed :)
<GitHub186> [misoc] sbourdeauducq pushed 1 new commit to master: http://git.io/DPTM4A
<GitHub186> misoc/master 4c2a209 Sebastien Bourdeauducq: libbase: remove crt during make clean
<ysionneau> gn8
nicksydney has quit [Remote host closed the connection]
sh[4]rm4 has joined #m-labs
sh4rm4 has quit [Ping timeout: 252 seconds]
sh[4]rm4 is now known as sh4rm4
mumptai has quit [Ping timeout: 255 seconds]
sh4rm4 has quit [Remote host closed the connection]
sh4rm4 has joined #m-labs
<sb0> stekern, any idea why TargetRegistry::lookupTarget would fail with "No available targets are compatible with this triple, see -version for the available targets." when passed any CPU type?
<sb0> llc -version does list or1k
<sb0> seriously, llvm arch management code is a gnu/autocrap-level fuckup
<sb0> and they used C++ for it... only C or asm would have been worse
<sb0> ah, that's because in that particular version of llvm, that stupid function ignores whatever arch you specify and uses the default (x86)
<sb0> gn8
sb0 has quit [Quit: Leaving]