SeanTAllen changed the topic of #wallaroo to: Welcome! Please check out our Code of Conduct -> https://github.com/WallarooLabs/wallaroo/blob/master/CODE_OF_CONDUCT.md | Public IRC Logs are available at -> https://irclog.whitequark.org/wallaroo
nisanharamati has quit [Quit: Connection closed for inactivity]
<sknebel> I'm thinking about trying Wallaroo for an application that basically does web crawling. Is it reasonable to get rate limiting etc. inside Wallaroo, or would I need an external fetch component that takes a task queue fed from Wallaroo and feeds results back in through an input?
<SeanTAllen> Hi sknebel
<SeanTAllen> Can you tell me a bit more about the web crawling you are looking to do?
<SeanTAllen> How do you decide what you want to crawl?
<sknebel> More or less a feed reader with some additions. Fetch a feed regularly or on an external prompt, parse it, fetch the sites referenced in the feed, and parse those for information. Wallaroo might be overkill, but it seemed at least interesting to investigate.
<SeanTAllen> You could use Wallaroo for parallelization like that. You'd treat each web scraping attempt as a stateless computation, so the more Wallaroo workers you add to the cluster, the more scraping you'd do in parallel. I think you'd get a definite benefit there in not having to manage that parallelism yourself.
<SeanTAllen> Wallaroo is event-driven, so streaming in the sites you want to scrape would make sense, or even a single incoming message that is a list of everything you want to scrape. Then the first step in your pipeline would be splitting that list up and sending each item off to the stateless scraping steps.
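(Editor's note: a minimal sketch of the pipeline shape described above, assuming the ApplicationBuilder-style Python API Wallaroo shipped around this time; treat the exact signatures as approximate. The names `split_urls`, `scrape_page`, and the JSON framing are made up for illustration, not taken from the conversation.)

```python
import json
import wallaroo


@wallaroo.decoder(header_length=4, length_fmt=">I")
def decode(bs):
    # Assumed framing: each incoming message is a JSON list of URLs to scrape.
    return json.loads(bs.decode("utf-8"))


@wallaroo.computation_multi(name="split url list")
def split_urls(urls):
    # Fan out: one incoming list becomes one message per URL.
    return urls


@wallaroo.computation(name="scrape page")
def scrape_page(url):
    # Stateless step: the more workers in the cluster, the more of these
    # run in parallel. Real fetching/parsing would go here.
    return {"url": url, "status": "fetched"}


@wallaroo.encoder
def encode(result):
    return json.dumps(result) + "\n"


def application_setup(args):
    in_host, in_port = wallaroo.tcp_parse_input_addrs(args)[0]
    out_host, out_port = wallaroo.tcp_parse_output_addrs(args)[0]

    ab = wallaroo.ApplicationBuilder("feed scraper")
    ab.new_pipeline("scrape",
                    wallaroo.TCPSourceConfig(in_host, in_port, decode))
    ab.to(split_urls)
    ab.to_parallel(scrape_page)
    ab.to_sink(wallaroo.TCPSinkConfig(out_host, out_port, encode))
    return ab.build()
```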
<SeanTAllen> There's nothing doing rate limiting in what I just described, though.
<SeanTAllen> What were you thinking on the rate limiting front?
<SeanTAllen> Let me rephrase that: there's nothing explicitly doing rate limiting in what I described.
<SeanTAllen> If your cluster was, say, 3 Python processes, then at most you'd be doing 3 scraping jobs at once, so that is a form of rate limiting.
<SeanTAllen> If you wanted to do something like limiting based on the number of requests to a website, then you could do something like the following:
<SeanTAllen> key on each website you want to scrape and use that key to feed into a stateful step
<SeanTAllen> actually just a stateless one
<SeanTAllen> then you are limited to one scrape per website at a time
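(Editor's note: not Wallaroo code, but a plain-Python sketch of the keying idea above: derive a key from each URL's host and route every URL with the same key to the same single-threaded worker, so at most one scrape per site runs at a time while different sites proceed in parallel. The worker-pool shape here is purely illustrative.)

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

# One single-threaded executor per site key: requests for the same host
# are serialized, while different hosts run in parallel.
_executors = {}


def _key(url):
    # Key on the website being scraped, e.g. "example.org".
    return urlparse(url).netloc


def _executor_for(url):
    return _executors.setdefault(_key(url), ThreadPoolExecutor(max_workers=1))


def scrape(url):
    # Placeholder for the real fetch-and-parse work.
    return "scraped %s" % url


def submit(url):
    # At most one in-flight scrape per host; others queue behind it.
    return _executor_for(url).submit(scrape, url)


if __name__ == "__main__":
    futures = [submit(u) for u in
               ["https://example.org/feed", "https://example.org/page",
                "https://example.com/feed"]]
    for f in futures:
        print(f.result())
```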
<SeanTAllen> I was going to say you could use a stateful step, but I realized there's no way in Wallaroo to say "don't run this computation right now, run it later", which would be a form of rate limiting. The ramifications of that are rather interesting, though, and potentially problematic.
<SeanTAllen> If the "don't run this computation at this time" sort of rate limiting would be interesting to you, sknebel, I'd love to hop on a call and chat about it for a bit, as there are some serious complexities lurking beneath the surface there.
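(Editor's note: for context on what "don't run this computation now, run it later" usually means in practice. As noted above, Wallaroo has no such facility; this is a plain-Python token-bucket sketch of the deferral idea, with the rate and URLs chosen purely for illustration.)

```python
import time


class TokenBucket:
    """Allows `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self):
        # Block ("run it later") until a token is available.
        while True:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)


# Example: at most 2 fetches per second against one site.
bucket = TokenBucket(rate=2, capacity=2)
for url in ["https://example.org/a", "https://example.org/b",
            "https://example.org/c"]:
    bucket.acquire()
    print("fetching", url)  # the real fetch would go here
```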
<sknebel> Yeah, growing queue sizes etc. become much more of an issue then. Thanks for the info for now, I'll report back if I look at it more closely
<sknebel> (really in the early thinking stages for now)
<SeanTAllen> You're welcome sknebel. Glad to help.
nisanharamati has joined #wallaroo
cajually has quit [Read error: Connection reset by peer]