<CrashRoX>
Hi - congrats on releasing Wallaroo. It looks really interesting. I had a question about streaming output. Are there any docs or suggestions for the best way to flush the output periodically? I just watched the "scale-independence" video and it continuously counted the Bill of Rights text. How would you go about flushing, say, every 1000 words and resetting the counts?
<SeanTAllen>
CrashRoX: First, thanks! Second... that's windowing. You can do windowing manually at this point and we are figuring out what a good windowing API would look like.
<SeanTAllen>
To do what you want right now, you could store the count of words seen in each state object (that's WordTotals in the WordCount example), and each time you increment a count, you could also increment words seen. Then you could change your CountWords state computation to return None unless the counter has reached 100; when it has, you can return a list of words and counts and reset your counter.
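A minimal sketch of the manual flush described above, in plain Python with no Wallaroo imports. The class reuses the WordTotals name from the message, but its fields, the `update` method, and the `flush_every` parameter are assumptions made for illustration, not the actual WordCount example code.

```python
class WordTotals:
    """Hypothetical state object: per-word counts plus a words-seen counter."""

    def __init__(self, flush_every=1000):
        self.flush_every = flush_every
        self.counts = {}
        self.words_seen = 0

    def update(self, word):
        """Count one word; return a sorted snapshot on the flush boundary, else None."""
        self.counts[word] = self.counts.get(word, 0) + 1
        self.words_seen += 1
        if self.words_seen < self.flush_every:
            return None
        # Flush: hand back everything counted so far, then reset the window.
        snapshot = sorted(self.counts.items())
        self.counts = {}
        self.words_seen = 0
        return snapshot


totals = WordTotals(flush_every=3)
outputs = [totals.update(w) for w in ["a", "b", "a", "c"]]
```

With `flush_every=3`, only the third call returns a snapshot; the fourth word starts a fresh window.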
<SeanTAllen>
With the way WordCount is implemented in that example, that could be rather painful and more expensive than the streaming count it does.
<SeanTAllen>
We are working on being able to dynamically add state objects which means that you could have a state object per word which would make that easier.
<SeanTAllen>
What you could do within the current system is also to check the count of a given word and if it isn't (count % 100) == 0, return None, and if it is, return the word and its count. That might be more in line with what you are asking.
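The per-word check could look roughly like this. It is a sketch in plain Python: `count_word`, the `counts` dictionary, and the `emit_every` parameter are hypothetical names, not part of the Wallaroo API.

```python
def count_word(counts, word, emit_every=100):
    """Increment a word's count; return (word, count) on every emit_every-th hit, else None."""
    counts[word] = counts.get(word, 0) + 1
    count = counts[word]
    if count % emit_every != 0:
        return None
    return (word, count)


counts = {}
results = [count_word(counts, "the", emit_every=2) for _ in range(4)]
```

Unlike the flush-and-reset variant, the counts keep growing here; output is simply throttled to every `emit_every`-th occurrence of each word.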
<SeanTAllen>
Is there a particular use case you are looking at or are you just generally curious?
<SeanTAllen>
CrashRoX: That's definitely a good question and something we should cover in our documentation as it isn't something that many people will immediately gravitate towards. I've created an issue for us to add documentation to address: https://github.com/WallarooLabs/wallaroo/issues/1636
<CrashRoX>
Thanks. I work for a company that does data collection from web events. While it's a bit early for us to put Wallaroo in production (is anyone running it in prod yet?) it seems to align closely with the way we think about collecting and processing and is much lighter weight than some of the alternatives. So it's a project I'd like to keep an eye on.
<CrashRoX>
But I'm literally in the middle of building something similar to what I asked about
aturley has joined #wallaroo
<SeanTAllen>
Ha! So, going backwards through what you said with my answers...
<SeanTAllen>
We've built an awful lot of things at previous companies that did what Wallaroo does (or parts of it). But they were all ad-hoc and usually rather poor in the end. Inevitably, we ended up with bugs or problems due to lack of time to do testing etc. We really wanted to make those types of systems easier to build by handling all the infrastructure plumbing.
<SeanTAllen>
We aren't in prod anywhere yet. We have done a number of POCs with possible clients and are currently on track with a couple to have something in prod by Q1/Q2 of next year. On the commercial side, that seems to be the timeline. Perhaps someone using the open source version or a smaller deployment of the enterprise version (which is free up to certain sizes) will beat folks there. We are very interested in working
<SeanTAllen>
with folks who find us interesting. We are actively looking for a couple more partners that we can work closely with to help guide our direction over the next 6-9 months (and beyond).
<CrashRoX>
Sounds great! I'll prob play with it a little deeper. Our pipeline isn't terribly complex, so I bet I could put together a POC relatively quickly.
<CrashRoX>
Do you guys have a planned release cadence?
<SeanTAllen>
Sort of. We want to do a release anytime we have a new feature ready to go, when we fix a critical bug or when there's a breaking change in Pony (which the underlying platform is written in) that we need to do a release to support. The last 2 shouldn't happen all that often. Through the end of the year, we are expecting to have a new feature ready to go every month or so, so I expect to do 2 or 3 releases this year.
<CrashRoX>
Is windowing on that roadmap? I just followed the docs ticket for now, although everything you said is enough for me to get started playing.
<SeanTAllen>
It should be coming as part of the batch/microbatch work. We haven't created any tickets for that yet.
<SeanTAllen>
Alan Mosca who isn't here in IRC right now would be a good one to talk to about that.
aturley has quit [Ping timeout: 255 seconds]
nitbix has joined #wallaroo
<SeanTAllen>
Ah, here's Alan, aka nitbix
<nitbix>
hello!
<nitbix>
sorry, I was just catching up with the conversation. What SeanTAllen said is correct, if you looked at doing it right now, you'd have to implement your own state-based logic to "flush" your current contents to output.
<nitbix>
however, I am working on ways to have a "windowing-based approach" in wallaroo, by which you specify the beginning and end of a window in the input with a "special" message, and you would have special logic whenever that is received at a computation.
<nitbix>
you could also, technically speaking, implement that yourself on top of wallaroo right now if you are in a rush :)
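One way to picture the marker-based approach if you built it yourself: a sketch in plain Python under stated assumptions. `WINDOW_START`, `WINDOW_END`, and `WindowedCounter` are hypothetical names invented here; they are not an existing or planned Wallaroo API.

```python
# Hypothetical sentinel messages marking window boundaries in the input.
WINDOW_START = object()
WINDOW_END = object()


class WindowedCounter:
    """Counts words between a start marker and an end marker."""

    def __init__(self):
        self.counts = None  # None means no window is currently open

    def handle(self, message):
        """Return the window's counts when it closes, otherwise None."""
        if message is WINDOW_START:
            self.counts = {}
            return None
        if message is WINDOW_END:
            result, self.counts = self.counts, None
            return result
        if self.counts is not None:
            self.counts[message] = self.counts.get(message, 0) + 1
        return None  # words outside any window are ignored in this sketch


counter = WindowedCounter()
stream = [WINDOW_START, "a", "b", "a", WINDOW_END, "c"]
emitted = [counter.handle(m) for m in stream]
```

Only the end marker produces output; the trailing "c" arrives with no window open and is dropped, which is a design choice of this sketch rather than anything stated in the conversation.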
<nitbix>
CrashRoX: I would love to talk more about your use case, and what type of API additions would be most valuable to you.
pyroscope has joined #wallaroo
<SeanTAllen>
Greetings pyroscope!
<pyroscope>
how solid are python3 support plans? 2020 is not that far off.
<SeanTAllen>
i'm not sure what you mean by "solid". it might have more meaning than i read into. i can say: we have strong interest from folks who want to work with us on a commercial basis if we support Python 3. Given that, Python 3 is on our near term roadmap for this year.
<SeanTAllen>
we have a number of folks who are looking for Python 3 or Go support.
<pyroscope>
"on our near term roadmap for this year" is solid enough ;)
<SeanTAllen>
awesome
<CrashRoX>
@nitbix Happy to chat about our use case. I'm in NYC as well, looks like we are actually only a few blocks away if you're interested in a coffee or beer sometime. Otherwise happy to jump on a call as it's easier to discuss vs type.
jonesnc has joined #wallaroo
jonesnc has quit [Quit: Leaving]
jonesnc has joined #wallaroo
jonesnc has quit [Remote host closed the connection]
jonesnc has joined #wallaroo
<jonesnc>
hello
<jonesnc>
I just discovered wallaroo, so I apologize if this is explained somewhere in the docs. I looked, but didn't really find anything. Is there a way to test data pipelines in wallaroo?
<jonesnc>
like an integration test of the pipeline
<slfritchie>
Hi, jonesnc. My apologies, I need to leave the "office" early today. If you're wondering about checking a specific pipeline for correctness, Nisan has written a blog article about how we do that, see https://blog.wallaroolabs.com/2017/10/measuring-correctness-of-state-in-a-distributed-system/. If you're asking about testing an arbitrary data pipeline, my answer (as the newest person to join Wallaroo Labs and
<slfritchie>
so don't quite know Absolutely Everything yet) is no.
<SeanTAllen>
@CrashRoX: nitbix is in London. I however am in NYC.
<jonesnc>
slfritchie: ok, thanks for the reply
<SeanTAllen>
jonesnc: We have tools that we use to test our applications. They aren't designed for external use. That said, they could be adapted and we'd welcome PRs to make them more flexible.
<SeanTAllen>
Are you looking to test data pipelines in general or wallaroo applications?
<SeanTAllen>
CrashRoX: our team is distributed. From Belgium to Vancouver at the moment.
<CrashRoX>
Ah, okay. Office map through me off. I'm on Walker off Broadway.
<CrashRoX>
threw*
<SeanTAllen>
We have a 2 person office at the WeWork at 222 Fulton. Usually at most, 1 person is there.
<SeanTAllen>
We have folks in Belgium, London, the Bronx, 4 in Brooklyn, 1 in Jersey, Minneapolis and Vancouver.
<jonesnc>
@SeanTAllen: wallaroo applications
CrashRoX has quit [Ping timeout: 260 seconds]
<SeanTAllen>
jonesnc: we have tools that run as part of our CI so we can make sure we didn't break things. that said, they aren't documented or designed to be user friendly right now. we are on version 2 or 3 of them.
<SeanTAllen>
it's one of the more important things we need.
<SeanTAllen>
the latest versions were pretty much done by Nisan. he's on vacation this week but will be back next week and is probably the best person to talk to about them.
<SeanTAllen>
they are mostly written in Python at this point.
<jonesnc>
for example, i want to create a 'mock-up' source, then validate that the expected state changes occurred on the sink
<jonesnc>
does that make sense?
<SeanTAllen>
the tools that we have right now don't involve mock up sources. we use giles-sender to feed data in over tcp for most of the tests, although we also have a kafka app test that sets up a tiny kafka install and reads from that.
<SeanTAllen>
we thought about mock up sources but decided not to go down that route as it wouldn't give us the assurance we required.
<SeanTAllen>
our big concern, and the hardest thing for us to test, is that the application gets laid out and wired together correctly.
<SeanTAllen>
the things we've found that we were most likely to break without noticing were at the end to end integration level, where we need to test from input to output.
<jonesnc>
i don't understand this problem space well, so i'm trying to learn
<SeanTAllen>
using a mock up source, we felt, wasn't as good as a real source over TCP etc
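A rough sketch of the "real source over TCP" idea using only the Python standard library. It is not giles-sender and not Wallaroo's test tooling: the 4-byte length-prefixed framing and every function name below are assumptions for illustration. In a real end-to-end test the application under test would sit between the sender and the receiving side; here the two are connected directly to keep the example self-contained.

```python
import socket
import struct
import threading


def frame(payload):
    """Prefix a payload with its length as a 4-byte big-endian integer (assumed framing)."""
    return struct.pack(">I", len(payload)) + payload


def read_exact(conn, n):
    """Read exactly n bytes from a socket or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf += chunk
    return buf


def send_and_collect(messages):
    """Send framed messages over loopback TCP and return what the receiver decoded."""
    received = []
    server = socket.create_server(("127.0.0.1", 0))  # port 0: let the OS pick
    port = server.getsockname()[1]

    def sink():
        conn, _ = server.accept()
        with conn:
            for _ in range(len(messages)):
                (length,) = struct.unpack(">I", read_exact(conn, 4))
                received.append(read_exact(conn, length))

    worker = threading.Thread(target=sink)
    worker.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        for message in messages:
            client.sendall(frame(message))
    worker.join(timeout=5)
    server.close()
    return received


sent = [b"hello", b"wallaroo", b""]
got = send_and_collect(sent)
```

A test then compares what arrived at the receiving end against the expected output for the input that was sent, exercising real sockets and framing rather than a mocked source.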
<SeanTAllen>
Nisan has spent a lot of time on this so he's a good person to talk to in general about it
<SeanTAllen>
but
<SeanTAllen>
we've all done a lot of thinking on it.
<SeanTAllen>
in addition to Nisan's blog post, I also did a couple variations on a talk about how we approach integration testing and failure injection.