#m-labs on 2015-01-17 — irc logs at freenode.irclog.whitequark.org

2013-12-11 12:34 lekernel changed the topic of #m-labs to: Mixxeo, Migen, MiSoC & other M-Labs projects :: fka #milkymist :: Logs http://irclog.whitequark.org/m-labs

00:46 sb0 has joined #m-labs

00:54 <sb0> rjo, hi

03:33 <sb0> http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html "Lots of people struggle with the complexities of getting big data systems up and running, when they possibly shouldn’t be using the systems in the first place."

03:39 <hozer> well, I'd go read that, but I just had to 'kill 666' my firefox process :P

03:40 <hozer> but links seems to work quite nice

03:47 <hozer> hrrm, I had a theory back when I did high-performance computing that FPGAs with open/programmable/reconfigurable memory controllers with *narrower* data paths would rock on graph searches

03:47 <hozer> but this requires you have some real stable FPGA toolchain you could actually *use*

03:49 <hozer> so whatever happened to github.com/wolfgang-spraul/fpgatools

03:52 <hozer> looks like opencircuitdesign.com/qflow/welcome.html might be an answer...

03:53 <sb0> there's no bitstream backend

03:54 <hozer> dammit

03:54 <hozer> well fpgatools tried to do that, right? But it looks dead

03:54 <sb0> yes

03:55 <hozer> frack

03:55 <hozer> I guess if I want a bitstream backend I'm going to have to design my own damned fpga :P

03:55 <sb0> just continue the S6 RE

03:55 <sb0> or move on to 7 series

03:55 <sb0> it's easier than spinning your own

03:56 <sb0> try this: xdl -report -pips 6slx4

03:56 <sb0> it tells you everything :)

03:57 <sb0> (well, not everything, but enough to get started)

03:58 <hozer> so my theory on graph algorithms is if you had an FPGA with a bunch of separate narrow-width (but high speed) memory busses that are all independent, you'd get a lot better graph search performance bit

03:58 mumptai has quit [Ping timeout: 246 seconds]

03:59 <hozer> because everything I'm aware of is going to 128 or 256 (or wider) bit cache lines

04:02 <sb0> yes, but that's because of DRAM, whose bandwidth has improved a lot more than its latency

04:03 <hozer> yep. But if you are doing a graph/sparse matrix, you might only need a 32 bit pointer out of an entire 256 or 512 bit cache line

04:03 <hozer> so you just wasted all that bandwidth

04:04 <sb0> it's cheap. and having a narrower DRAM won't make the latency (which is what is slowing you down) better

04:05 <hozer> but it will sure make it cooler

04:05 <sb0> you mean less power? maybe

04:06 <hozer> and if you had a cpu with say 512 thread contexts instead of 512 bit cache lines anytime you hit a memory load you switch context

04:06 <sb0> parallel programming is hard

04:07 <hozer> it would still be easier than some big-nonsense cloud data analytics whizbang :P

04:07 <sb0> and a ddr3 chip has only 8 banks

04:07 <sb0> if your thread contexts cause DRAM precharge cycles, and they probably will, you can't pipeline

04:07 <hozer> so your 'big data' FPGA search appliance has a couple fpgas, and 256 DRAM chips :P

04:08 <hozer> and the whole thing draws < 250 watts

04:09 <sb0> you can't put many DRAM chips on a FPGA. especially without shared address/command lines. and on 7-series, only certain IO banks can run ddr3 at max speed.

04:11 <hozer> This is why if I had a million dollars I'd spend half on farmland and half on rolling my own fpga :P

04:11 <hozer> well, okay, I'd probably need 100 million to actually do the fpga right

04:11 mumptai has joined #m-labs

04:30 <sb0> if you become the dumpster diving god, you can probably get your own fab and do it for much less

04:34 <sb0> which is what hackerspaces should be doing if they weren't busy with their cnc plastic extrusion stuff

04:35 <hozer> if I can dumpsterize a fab it would probably be worth about half a million to me

04:36 <hozer> cause at some point there will be some expensive tractor or combine that you can't some critical silicon for anymore

04:36 <hozer> I mean' that you can't get some critical silicon for anymore'

04:37 <hozer> and then I can go around buying up equipment for $5,000 that I could sell for $50,000 if it ran

04:39 <uhhimhere> http://www.caviumnetworks.com/OCTEON-III_CN7XXX.html

04:40 <hozer> no infiniband, bah

04:43 <hozer> oh, and Cavium pisses me off cause they made my MontaVista stock worthless :P

04:44 <hozer> ... they came in when the market was down and acquired MontaVista and only Cavium and the company execs got stock. All the employees who had stock options (or left and exercised them) got zilch

04:50 <hozer> now THIS is a switch chip... http://www.mellanox.com/page/products_dyn?product_family=190&mtag=switch_ib_ic

04:57 <hozer> The only thing I know of better than Infiniband's cut-through routing latency is Cray's interconnect, and there's a good argument you might be able to get lower latency doing a remote RDMA read over infiniband than doing a local memory read if you are doing a lot of graph search stuff

04:58 <sb0> no

04:58 <sb0> the serdes themselves have more latency than a typical closely coupled dram controller

05:00 <hozer> well assumption here is the data set doesn't all fit in your local dram

05:01 <hozer> so you can either go for 'ultra-scalable' systems that are 1000x slower than a single laptop

05:01 <hozer> or actually get some decent low-latency interconnect

05:01 <hozer> but, you know, parallel programming is hard, and it's easier to say how scalable your system is on a crappy cloud infrastructure :P

05:02 <uhhimhere> head in the cloud

05:02 <uhhimhere> meow

05:03 <hozer> sometime I want to do a single-threaded graph search on a terabyte data set with a bunch of commodity hard drives ;)

05:10 <uhhimhere> the cavium supports low latency multi-socket Soc

05:12 <uhhimhere> http://cpushack.com/space-craft-cpu.html

05:20 <hozer> no mention of http://www.gaisler.com/index.php/products/components/gr712rc ?

05:23 <hozer> I tried to figure out how to synthesize LEON/Grlib a while ago but never really finished.

06:50 uhhimhere has quit [Ping timeout: 245 seconds]

07:36 <rjo> hey sb0.

07:47 <rjo> sb0: i like sync_struct.

07:48 <rjo> but from a cursory look it seems a big fat warning about mutable members is necessary.

07:50 <rjo> by the way: https://github.com/nist-ionstorage/artiq/commit/45869f2055c0e31199aca4a7591aa58cb94e1a90

07:52 <sb0> rjo, you can put mutable list/dicts (and compatible objects) into sync_struct, and then mutate them - that will be handled correctly

07:52 <sb0> sync_struct will wrap them

07:53 <sb0> the only thing you should not mutate is the suitably named "read" property, which gives you access to the data

07:53 <rjo> but there are many more mutable objects.

07:53 <sb0> yeah, user classes are not supported

07:54 <rjo> are they prevented?

07:54 <sb0> nested lists/dicts are

07:55 <sb0> no, they're not

07:56 <sb0> I think the proper way to "prevent" them is to add a note to the documentation

07:56 <rjo> the big fat warning i suggested. yes.

07:56 <sb0> otherwise we'd have to scan any object (potentially with several levels of nesting) passed by the user for something that isn't a list or dict

07:57 <sb0> or immutable

07:57 <rjo> BTAFTP

07:57 <rjo> i agree.

07:57 <rjo> but the problematic use case i see is adding a numpy array and starting to call all its wonderful methods.

07:58 <sb0> hmm, we can actually make them work

07:58 <sb0> sync_struct can proxy them

07:58 <rjo> oh. that is not an acronym yet. let's make it one. (better to ask forgiveness than permission)

07:59 <sb0> only, the receiving side has to handle them

07:59 <rjo> rpc-like.

07:59 <sb0> yes

07:59 <rjo> too generic imho.

07:59 <sb0> sync_struct is pretty much a RPC already

08:00 <sb0> but with pubsub, index syntax support, and structure initialization on connect

08:01 <sb0> supporting numpy methods is a matter of replacing the hardcoded list methods (append, insert, pop, etc.) with a generic RPC

08:01 <sb0> ..of course, those methods must be called from the Notifier, and never via the read property
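
The wrapping behaviour described above can be sketched as follows. This is a minimal illustration and not ARTIQ's actual sync_struct code: the Notifier class shown here, its publish callback, and the dictionary format of each mod are assumptions made for the example. Mutations go through the notifier, which applies them to the backing list and forwards a description of each one; the read property only exposes the data.

```python
class Notifier:
    """Hypothetical wrapper: applies list mutations and publishes them."""

    def __init__(self, backing, publish):
        self._backing = backing      # the wrapped list
        self._publish = publish      # callback receiving one mod dict per mutation

    @property
    def read(self):
        # Read-only by convention: mutating this object bypasses publishing.
        return self._backing

    def append(self, x):
        self._backing.append(x)
        self._publish({"action": "append", "x": x})

    def pop(self, i=-1):
        value = self._backing.pop(i)
        self._publish({"action": "pop", "i": i})
        return value

    def __setitem__(self, key, value):
        self._backing[key] = value
        self._publish({"action": "setitem", "key": key, "value": value})


def apply_mod(target, mod):
    """Replay one published mod on a subscriber's local copy."""
    if mod["action"] == "append":
        target.append(mod["x"])
    elif mod["action"] == "pop":
        target.pop(mod["i"])
    elif mod["action"] == "setitem":
        target[mod["key"]] = mod["value"]
    else:
        raise ValueError("unknown action: {}".format(mod["action"]))


# A subscriber copy kept in sync by replaying every published mod.
mirror = []
queue = Notifier([], lambda mod: apply_mod(mirror, mod))
queue.append("calibrate")
queue.append("scan")
queue[0] = "load"
queue.pop()
```

Replacing the hardcoded methods with a generic RPC would amount to publishing a method name plus arguments instead of one fixed action per method; anything mutated through read would never reach the mirror.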

08:02 <rjo> i gravitate toward actually dumbing it down.

08:03 <sb0> all functionality is necessary for the GUI/master atm

08:03 <rjo> because then you will have mutable arguments to methods of numpy arrays where the method return value after execution on the subscribers might matter.

08:03 <rjo> and conflict when returned to the publisher..

08:06 <rjo> looking at the master queueing: why did the idea of letting the experiments reschedule themselves not work?

08:06 <sb0> (entry_points) ah, good call

08:07 <sb0> I'd have to think about it, and also how to support "pausable" experiments

08:08 <GitHub187> [artiq] sbourdeauducq pushed 1 new commit to master: http://git.io/yFi1kA

08:08 <GitHub187> artiq/master 6cc3a9d Robert Jordens: frontend/*: move to artiq.frontend, make entry_points...

08:08 <rjo> all in all there will be a lot of scheduling experiments by other experiments.

08:09 <rjo> so at least there will be functionality duplication where the experiments need to manage the master queue.

08:10 <sb0> so how exactly should they manage the queue?

08:10 <sb0> the simplest way to do #1 is expose a "queue_append" method to the experiments

08:11 <rjo> a stack of calibration experiments queued as a batch by another.

08:11 <rjo> if this parent experiment gets unqueued or canceled, it needs to remove its children.

08:11 <rjo> (or may want to remove them) from the queue

08:12 <sb0> can that be done by having the parent schedule its calibration experiments just before finishing?

08:12 <sb0> if it gets unqueued or canceled, then it can't schedule the children

08:12 <rjo> or something akin to an "operating mode" where a parent experiment sets up a bunch of periodic experiments and then runs a few others, unqueueing the calibrations when done.

08:13 <rjo> the problem is "ownership". if an experiment can queue other experiments, it must be able to take full responsibility of them.

08:13 <rjo> handling their failures, unqueueing them...

08:14 <rjo> yes. if the parent goes it should take its children too.

08:15 <rjo> -- or not. depending on how much responsibility the parent is able/willing to accept.

08:15 <rjo> but if you have such an "operating condition" style experiment, you need full queue management from within experiments.

08:17 <sb0> do the "operating condition experiments" do anything more than schedule a pack of periodic experiments + a batch of sequential ones?

08:18 <rjo> i would like it to be able to handle experiment failures.

08:19 <rjo> the workflow if you lose an ion (or a laser becomes unlocked) is a whole other batch of experiments (load, move, calibrate...) as a massive error handler.

08:22 <rjo> and how can i get an experiment that is periodically scheduled every hour but just on weekends?

08:25 <rjo> imho the Scheduler api should be more like an event loop. there periodic execution is also an emergent feature.

08:26 <sb0> how do you represent it in the GUI, though?

08:27 <rjo> isn't there a list of the coroutines and Tasks in the asyncio event loop?

08:29 <sb0> yes, but the tasks are parallel - there is no queue

08:29 <sb0> and I guess the GUI needs a queue

08:30 <rjo> isn't that parallelism the same as the "pausable" thingy?

08:31 <rjo> really? how does it decide which task to continue with if one yields?

08:32 <sb0> the one that has a completed IO operation...

08:34 <rjo> there is a nice heapq in the eventloop.

08:35 <sb0> your error handling scenario means: assume that the lost ion exception didn't occur, speculatively proceed to add the next steps into the queue so that the GUI displays them, and if the exception does occur, undo it

08:35 <rjo> if they are no fds, it's just a bunch of TimerHandles.

08:36 <sb0> you don't fundamentally need scheduling access for that - you can just import the other "error handling" experiments and run them in the exception handler

08:36 <rjo> yes. that would work.

08:37 <sb0> the only problem I see with #2 is less user feedback

08:37 <rjo> but then it is not apparent in the gui which experiment is actually running if they do not pass through the scheduler.

08:37 <rjo> yes

08:38 <sb0> we can also simply add a notification of the current class name in which the execution is going on

08:39 <sb0> and for the "weekend schedule"... replace periodic execution with timed execution, and let experiments re-schedule them
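
Timed execution with self-rescheduling could look roughly like this for the weekend case. The sketch assumes nothing about the real scheduler API: next_weekend_hour is a hypothetical helper that an experiment could call to compute its own next run time before re-scheduling itself.

```python
from datetime import datetime, timedelta


def next_weekend_hour(now):
    """Return the next full hour strictly after `now` that falls on a
    Saturday or Sunday (hypothetical helper, not part of any real API)."""
    # Round up to the next full hour.
    t = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    # weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    while t.weekday() < 5:
        # Still a weekday: skip to midnight of the following day.
        t = (t + timedelta(days=1)).replace(hour=0)
    return t
```

The experiment would pass the returned time to whatever timed-execution call the scheduler ends up offering, so the "hourly, weekends only" policy lives in the experiment rather than in the scheduler.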

08:39 <rjo> or give scheduler access.

08:39 <sb0> what was your collaborative code editor website again?

08:40 <sb0> something.io

08:40 <sb0> ah, kobra

08:40 <rjo> i am wondering now whether experiments could become real asyncio.Tasks

08:41 <rjo> and whether there could be something like a slave event loop to the big asyncio one that manages only the experiments.

08:41 <rjo> that would give a notion of "pause" for free.

08:41 <rjo> could just follow the same call_later(...) api etc.

08:42 <sb0> https://kobra.io/#/e/-Jfr2BLsySY9CZWRG8xm

08:46 <rjo> if you do scheduler.queue("") by name, you can return the actual experiment instance, right?

08:46 <rjo> why do you need an rid?

08:47 <sb0> hmm, creating the instance calls build(), which initializes drivers

08:47 <rjo> ah

08:48 <rjo> the error recovery pattern is ok. could maybe be streamlined a bit with contextmanager.

08:49 <sb0> I guess that we should not permit those "scheduler plugins" to access drivers, too

08:50 <sb0> otherwise, there can be conflicts if the scheduler plugin requests a driver with certain arguments, and then one of its scheduled experiments requests it again with other arguments

08:50 <sb0> only one experiment may access drivers at any given time

08:52 <sb0> and it should be permitted to have several scheduler plugins running concurrently, I guess

08:52 <sb0> so that several periodic experiments can be scheduled

08:52 <rjo> by "requesting a driver" you mean the rpc that in the end leads to the opening of the serial port?

08:53 <sb0> or even the core device driver opening the serial port

08:55 <sb0> contextlib.nested is deprecated

08:56 <rjo> i am almost convinced these backend drivers should be singletons.

08:56 <rjo> oh. even better.

08:58 <rjo> but i don't know how/whether the contextmanager jives with the coroutine that would be required.

08:58 <rjo> probably not.

08:59 <sb0> you can only yield once in @contextlib.contextmanager, no?

08:59 <rjo> yes.

09:00 <rjo> but one could just write a real class with __enter__ and __exit__(), maybe.

09:00 <sb0> can one yield from a with statement?

09:01 <sb0> in other words: can a context manager cause its calling generator to yield?

09:01 <rjo> hmm. "scheduler plugins" aka "batches" or "operating conditions" vs drivers: if you "request" a driver and get it, you do expect exclusive access.

09:01 <sb0> afaik generators don't play well with context managers
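
For reference, in plain Python a generator may yield inside a with block: the context stays entered while the generator is suspended, and __exit__ runs when the block is finally left. What a plain context manager cannot do is make its caller yield from inside __enter__ or __exit__. The sketch below combines this with the error recovery pattern using a class with __enter__ and __exit__; Recovery, LostIon, and experiment are hypothetical names invented for the example.

```python
class LostIon(Exception):
    """Hypothetical failure raised by an experiment step."""


class Recovery:
    """Hypothetical context manager that runs recovery on LostIon."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None and issubclass(exc_type, LostIon):
            # This is where the load/move/calibrate experiments would run.
            self.log.append("recover")
            return True          # suppress the exception after recovery
        self.log.append("exit")
        return False


def experiment(log, fail):
    with Recovery(log):
        log.append("step 1")
        yield                    # pause point: the context stays entered
        if fail:
            raise LostIon()
        log.append("step 2")
    log.append("done")


ok_log, failed_log = [], []
for _ in experiment(ok_log, fail=False):
    pass
for _ in experiment(failed_log, fail=True):
    pass
```

Both runs reach "done"; the failing run records "recover" in place of "step 2" and "exit".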

09:02 <rjo> so the parent (or any other parallel running, paused experiment) would have to -- at least -- temporarily relinquish control.

09:02 <sb0> and same with generators and class initialization. e.g. you can't create an asyncio connection in __init__

09:02 <rjo> oh. why is that?

09:02 <sb0> __init__ can't yield

09:03 <rjo> ha.

09:03 <sb0> well you can, but you have to pass the asyncio loop as a parameter to __init__

09:03 <sb0> and __init__ calls loop.run_until_complete (and becomes blocking)

09:03 <rjo> and then yield "downwards" into the loop instead of upwards?

09:03 <sb0> or create a task

09:04 <sb0> but then error handling becomes a pain
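
One common way around the __init__ limitation is to keep __init__ synchronous and move the waiting into a separate coroutine that builds the object. The sketch below uses async/await syntax, which is newer than this conversation, and asyncio.run from Python 3.7 or later; Connection, open, and fake_connect are hypothetical names.

```python
import asyncio


async def fake_connect(host):
    """Stand-in for a real asynchronous connection setup."""
    await asyncio.sleep(0)
    return "connected to " + host


class Connection:
    def __init__(self, host, transport):
        # __init__ stays synchronous: it only stores already-built state.
        self.host = host
        self.transport = transport

    @classmethod
    async def open(cls, host):
        # All waiting happens here, so failures reach the caller as
        # ordinary exceptions rather than ending up in a detached task.
        transport = await fake_connect(host)
        return cls(host, transport)


conn = asyncio.run(Connection.open("example"))
```

Compared with calling loop.run_until_complete inside __init__, construction does not block the loop, and compared with creating a task, the caller sees errors at the point where it waits on open.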

09:05 <rjo> is the scheduler already split from the experiment runner into different processes?

09:06 <rjo> is that thing the worker?

09:06 <sb0> if we have separate scheduler.queue and scheduler.run_at, with a run_at reaching the deadline taking priority over the queue, it's also easier to display in the GUI

09:06 <sb0> yes. the worker runs the user code. that way, if it goes into an infinite loop, leaks memory, imports a crashy library, etc. it can be killed

09:07 <rjo> call_later() call_soon() and call_at() like asyncio ;)
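
The split between scheduler.queue and scheduler.run_at, with a due run_at entry taking priority over the queue, can be sketched with a FIFO plus a heap. The Scheduler class below and its next method are assumptions made for illustration, not the ARTIQ master's code.

```python
import heapq
from collections import deque


class Scheduler:
    """Hypothetical sketch: a FIFO queue plus a heap of timed entries."""

    def __init__(self):
        self._queue = deque()
        self._timed = []      # heap of (due_time, sequence, name)
        self._seq = 0         # tie-breaker that keeps insertion order

    def queue(self, name):
        self._queue.append(name)

    def run_at(self, due_time, name):
        heapq.heappush(self._timed, (due_time, self._seq, name))
        self._seq += 1

    def next(self, now):
        """Return the next experiment to run at time `now`, or None."""
        # A timed entry whose deadline has been reached takes priority.
        if self._timed and self._timed[0][0] <= now:
            return heapq.heappop(self._timed)[2]
        if self._queue:
            return self._queue.popleft()
        return None


sched = Scheduler()
sched.queue("scan_a")
sched.queue("scan_b")
sched.run_at(10, "calibrate")
order = [sched.next(now) for now in (5, 10, 11, 12)]
```

Both containers are easy to show in a GUI as two lists: what runs next in order, and what is due at which time.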

09:08 <rjo> so to streamline terminology a bit, could the "worker" be more or less the "controller" for the coredevice?

09:08 <rjo> or does that analogy not hold?

09:09 <sb0> it's not exactly a controller as it doesn't go over the network, and it doesn't even have to use the core device

09:10 <rjo> but there is a pair of filedescriptors between the scheduler and the (worker).

09:10 <rjo> pipe.

09:10 <sb0> yes. stdin/stdout, actually

09:11 <sb0> the main reason for running on the same machine is that the filesystem where the experiments and later the results are stored becomes the same

09:11 <rjo> ah. the rpcs to the controllers originate/are relayed at the worker.

09:11 <rjo> ok.

09:12 <sb0> rpcs to the controller are done directly by the worker, yes

09:12 <sb0> I'm also imagining that the worker would write HDF5 outputs itself

09:12 <rjo> and you fire a worker per experiment?

09:12 <sb0> and maybe do the git checkouts

09:13 <sb0> no, it takes instructions, runs, and reports

09:13 <sb0> and keeps going
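
A long-lived worker that takes instructions, runs, reports, and keeps going might be sketched as below. Only the use of stdin/stdout comes from the discussion; the one-JSON-object-per-line protocol, the field names, and worker_loop itself are assumptions made for the example.

```python
import io
import json


def worker_loop(infile, outfile, experiments):
    """Read one JSON instruction per line, run it, write one JSON report."""
    for line in infile:
        line = line.strip()
        if not line:
            continue
        request = json.loads(line)
        try:
            func = experiments[request["name"]]
            result = func(**request.get("arguments", {}))
            report = {"status": "ok", "result": result}
        except Exception as exc:
            # Report the failure and keep serving further instructions.
            report = {"status": "failed", "message": repr(exc)}
        outfile.write(json.dumps(report) + "\n")
        outfile.flush()


# In-memory streams stand in for the pipe pair to the master.
experiments = {"add": lambda a, b: a + b}
requests = io.StringIO(
    '{"name": "add", "arguments": {"a": 1, "b": 2}}\n'
    '{"name": "missing"}\n'
)
reports = io.StringIO()
worker_loop(requests, reports, experiments)
```

In a real process the streams would be sys.stdin and sys.stdout, and the master could still kill the process if an experiment hangs, since the loop holds no scheduler state.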

09:13 <rjo> ok.

09:13 <sb0> we can have a collection of generator-based scheduler plugins running in the worker

09:14 <rjo> good. then i like the term worker.

09:15 <rjo> should also be ok to implement pauseability by "yield"ing from within an experiment.

09:16 <sb0> if scheduler plugins and experiments become the same thing, yes

09:16 <rjo> yep.

09:17 <sb0> hmm, we may have to move the scheduler into the worker...

09:17 <sb0> otherwise, propagating an exception worker -> scheduler -> worker will be a mess

09:18 <rjo> why does it have to go to the scheduler?

09:18 <rjo> (first)?

09:18 <sb0> see the error recovery example ...

09:18 <rjo> and who is talking to the gui, the worker or the scheduler?

09:19 <sb0> the scheduler

09:19 <sb0> if we move the scheduler into the worker, it could simply sync_struct the queue and periodic schedule from the parent process

09:19 <rjo> the results updates are proxied through the scheduler?

09:20 <sb0> the real-time results are produced by the worker, sent to the master (what you call scheduler), and subscribed to by clients

09:20 <rjo> then i should not call it scheduler.

09:21 <sb0> all results, including non-realtime ones, would then be written to the filesystem by the worker directly

09:21 <rjo> the scheduler is just a component of the master. another component is this data-hub.

09:21 <sb0> yes

09:25 <sb0> hmm. if we move the scheduler into the worker, then we cannot abort an experiment by killing the worker.

09:25 <rjo> hmm. that is nasty if an infinite loop in an experiment would block the scheduler.

09:25 <sb0> yes, that too

09:26 <rjo> OTOH there will be plenty of opportunity to DOS the scheduler if the API is accessible.

09:27 <rjo> i think i have to let that problem take a few rounds in my head.

09:27 <rjo> we didn't even get to discussing the RTData and Plot persistence stuff.

09:28 <rjo> but i am heading home now anyway.

09:29 <rjo> good night!

09:29 <sb0> I'd implement persistence by reloading from HDF5

09:29 <sb0> explicitly

09:29 <sb0> good night!

09:33 <rjo> ok. that sounds smart. if the gui can trigger that (preferably by just "selecting" the experiment) and then fiddle with the fit/plot and stuff gets saved, that would be really smooth.

09:33 <rjo> but that's for another day.

09:33 <rjo> see you.

11:46 <GitHub94> [artiq] sbourdeauducq pushed 2 new commits to master: http://git.io/uqhPcA

11:46 <GitHub94> artiq/master 3e22fe8 Sebastien Bourdeauducq: reorganize files as per discussion with Robert

11:46 <GitHub94> artiq/master 0c2e960 Sebastien Bourdeauducq: frontend: restore artiq_ prefix

18:35 sb0 has quit [Quit: Leaving]