<Papierkorb>
jots_twitter, also look at how the times are spent. the crystal program is waiting 40s just on I/O. even if you just look at the user times, the crystal program is still slower than wc, but faster already than perl
<Papierkorb>
you should get rid of the mem* functions if you get rid of the useless arrays
<FromGitter>
<jots_twitter> yes. interesting that perl somehow gets away with it though.
soveran has joined #crystal-lang
<FromGitter>
<drosehn> I'm pretty sure you posted it earlier, but where's the source for your crystal program? I might take a look at it tomorrow, if I have time. (I am not an expert at crystal, so don't expect that I'll get anywhere!)
snsei has quit [Remote host closed the connection]
snsei has joined #crystal-lang
vikaton has quit [Quit: Connection closed for inactivity]
mgarciaisaia has quit [Quit: Leaving.]
pduncan has joined #crystal-lang
snsei_ has joined #crystal-lang
snsei_ has quit [Remote host closed the connection]
snsei_ has joined #crystal-lang
unshadow has quit [Ping timeout: 256 seconds]
snsei has quit [Ping timeout: 258 seconds]
pduncan has quit [Ping timeout: 260 seconds]
bjz has joined #crystal-lang
soveran has joined #crystal-lang
soveran has quit [Ping timeout: 240 seconds]
bjz has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
snsei has joined #crystal-lang
snsei_ has quit [Ping timeout: 256 seconds]
bjz has joined #crystal-lang
bjz has quit [Client Quit]
bjz has joined #crystal-lang
soveran has joined #crystal-lang
soveran has quit [Changing host]
soveran has joined #crystal-lang
phase_ has quit [Quit: cya l8r alig8r]
pawnbox has joined #crystal-lang
unshadow has joined #crystal-lang
p0p0pr37 has quit [Quit: p0p0pr37]
p0p0pr37 has joined #crystal-lang
p0p0pr37 has joined #crystal-lang
Philpax has joined #crystal-lang
pduncan has joined #crystal-lang
pduncan has quit [Ping timeout: 258 seconds]
bjz has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
bjz has joined #crystal-lang
j2k has joined #crystal-lang
mark_66 has joined #crystal-lang
bjz_ has joined #crystal-lang
bjz has quit [Ping timeout: 256 seconds]
j2k has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
j2k has joined #crystal-lang
bjz has joined #crystal-lang
bjz_ has quit [Ping timeout: 245 seconds]
p0p0pr37 has quit [Remote host closed the connection]
p0p0pr37 has joined #crystal-lang
p0p0pr37 has joined #crystal-lang
p0p0pr37 has quit [Client Quit]
p0p0pr37 has joined #crystal-lang
p0p0pr37 has joined #crystal-lang
bjz has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
bjz has joined #crystal-lang
ome has joined #crystal-lang
pduncan has joined #crystal-lang
pduncan has quit [Ping timeout: 258 seconds]
soveran has quit [Remote host closed the connection]
soveran has joined #crystal-lang
soveran has joined #crystal-lang
soveran has quit [Changing host]
Raimondii has joined #crystal-lang
Raimondi has quit [Ping timeout: 244 seconds]
Raimondii is now known as Raimondi
<FromGitter>
<luislavena> @jots_twitter actually `wc` counts lines by counting `LF` characters. There is another implementation of `wc` in Rust that might provide an efficient approach to scanning big text files: https://github.com/uutils/coreutils/blob/master/src/wc/wc.rs
<FromGitter>
<luislavena> The ruby code might be faster because `gets` will read until `LF` is found, and return that as a string.
gloscombe has joined #crystal-lang
<FromGitter>
<luislavena> Perhaps the issue is the allocation process; removing the allocation of arrays and strings and just walking over the IO *might* be faster
unshadow has quit [Quit: Lost terminal]
pawnbox has quit [Remote host closed the connection]
matp has quit [Read error: Connection reset by peer]
Philpax has quit [Ping timeout: 260 seconds]
pduncan has joined #crystal-lang
matp has joined #crystal-lang
bjz_ has joined #crystal-lang
bjz has quit [Ping timeout: 260 seconds]
ome has quit [Quit: Connection closed for inactivity]
pawnbox has joined #crystal-lang
pduncan has quit [Ping timeout: 260 seconds]
snsei has quit [Remote host closed the connection]
bjz_ has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
<FromGitter>
<sdogruyol> @johnjansen maybe it's something with `gets`?
soveran has joined #crystal-lang
soveran has joined #crystal-lang
soveran has quit [Changing host]
pawnbox has joined #crystal-lang
<FromGitter>
<drosehn> Well, I have only had time to glance at @jots_twitter 's code, but I'll make the observation that in his code he processes the entire file once for each option. The call to `row << text.lines.size if options[:lines]` is going to start at byte #0 of the file, and go through every byte of it to count the number of lines. And then `row << text.split.size if options[:words]` is going to start back at byte #0, and process all of those same bytes, this time looking for word boundaries.
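A minimal sketch of the single-pass alternative to what drosehn describes, assuming the text is already in memory as a `String`. The helper name `count_all` is hypothetical and is not taken from `wcg.cr`; it only treats ASCII whitespace as a word separator, which is an assumption rather than a match for every `wc` implementation.

```crystal
# Hypothetical helper (not from wcg.cr): count lines, words and bytes
# in one walk over the data instead of one walk per option.
def count_all(text : String) : Tuple(Int64, Int64, Int64)
  lines = 0_i64
  words = 0_i64
  bytes = 0_i64
  in_word = false

  text.each_byte do |byte|
    bytes += 1
    lines += 1 if byte == 0x0a_u8 # LF
    # ASCII whitespace only: space, plus TAB..CR (9..13). This is an assumption.
    if byte == 0x20_u8 || (0x09_u8 <= byte && byte <= 0x0d_u8)
      in_word = false
    elsif !in_word
      in_word = true
      words += 1
    end
  end

  {lines, words, bytes}
end

lines, words, bytes = count_all("one two\nthree\n")
puts "#{lines} #{words} #{bytes}" # prints "2 3 14"
```

Every byte is inspected once, and all three counters are updated in the same loop, so asking for more options does not add more passes.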
mark_66 has quit [Remote host closed the connection]
<FromGitter>
<drosehn> And code which calls `gets()` will (at some level in the processing) do a read of probably BUFSIZ bytes, and then copy a single line up to your program. And that lower-level code will have to keep track of where it is in the larger buffer, so it needs to keep a pointer and update that pointer for each call to `gets()`.
soveran has quit [Ping timeout: 250 seconds]
mgarciaisaia has joined #crystal-lang
<crystal-gh>
[crystal] samueleaton opened pull request #3550: Implement Hash notation for examples in docs (master...fix-hash-example) https://git.io/vXMET
<FromGitter>
<asterite> @johnjansen Did you try compiling with --release ?
soveran has joined #crystal-lang
soveran has joined #crystal-lang
soveran has quit [Changing host]
maxpowa has quit [Ping timeout: 244 seconds]
pawnbox has quit [Remote host closed the connection]
pduncan has joined #crystal-lang
gloscombe has quit [Quit: Lost terminal]
soveran has quit [Remote host closed the connection]
kochev has joined #crystal-lang
pduncan has quit [Ping timeout: 256 seconds]
maxpowa has joined #crystal-lang
mgarciaisaia has left #crystal-lang [#crystal-lang]
<FromGitter>
<drosehn> I started to make another minor observation from quick-skimming, when the comments made by several other people suddenly clicked in my head.
<FromGitter>
<drosehn> When you call `text.lines`, crystal is building a full-blown Array(String). It is creating an Array object, and then going through `text` and adding each line that it finds as another element in that Array. You're then taking that array, and asking it "So, how many elements do you have?". And then you throw away that entire array. And then you do the *same* thing (building a completely different array) when you call `text.split`.
<Yxhuvud>
that does seem a bit inefficient.
<Yxhuvud>
creating lots of objects for lines, that is.
<FromGitter>
<drosehn> You can see this, btw, if you add `printf " lines=%s\n", text.lines.class`. You really are creating a full-blown array which has copied data from the original `text` object into many `String` objects that are stored in that `Array(String)`.
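A small sketch contrasting the approaches discussed above, assuming a short in-memory `String`. The variable names are illustrative only. Note that `each_line` still creates a `String` for each line as it goes, it just does not keep them all in an array, while `count('\n')` creates no line strings at all.

```crystal
text = "alpha\nbeta\ngamma\n"

# Builds an Array(String) holding every line, only to ask for its size.
via_array = text.lines.size

# Yields one line at a time; nothing is collected into an array.
via_iteration = 0
text.each_line { via_iteration += 1 }

# Counts LF characters directly, without creating any line strings.
via_count = text.count('\n')

puts text.lines.class # prints "Array(String)"
puts via_array        # prints "3"
puts via_iteration    # prints "3"
puts via_count        # prints "3"
```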
<FromGitter>
<sdogruyol> @drosehn what's your suggestion then
<FromGitter>
<drosehn> Well, I know what I'd do in C, and in fact I *have* written a program pretty similar to this in C. I'm not 100% sure how to translate all the tricks in my C program into crystal. So I need to do a few more experiments before making any claims that I'll regret later. :smiley:
soveran has joined #crystal-lang
<Yxhuvud>
does it create new strings or references to inside the original string?
<FromGitter>
<drosehn> Unfortunately I'm at work now, so I'll need to focus on work-related tasks at the moment!
<FromGitter>
<drosehn> Yeah, I wondered that. Given that crystal keeps strings as immutable, it might not need to copy any of the data-bytes into the new `String` objects. However, it does have to do *something* which will create each of those string objects, even if that's just to create a pointer to the start and end of the characters as they exist in `text`.
<FromGitter>
<johnjansen> @asterite yeah that was with release, its not for me BTW someone else is trying to duplicate `wc` in crystal, but that struck me as a little odd ;-)
<FromGitter>
<drosehn> Consider, for instance, that you could also say `puts text.lines[143]`, and the crystal code will expect that it can go to element #143 of an Array, and pull out the string which matches that context.
<FromGitter>
<drosehn> They're trying to duplicate `wc` as a simple exercise. The real goal is to understand how to process large data files as quickly as possible.
<FromGitter>
<drosehn> For instance, my "wc-like" program written in C is not counting lines. It's finding line-boundaries, checking for lines that start with "%%" (Postscript comments), and then doing things based on what it finds on those lines.
<FromGitter>
<drosehn> And given that our print servers may throw around several hundreds of gigabytes of postscript files per day, I really needed to do that as efficiently as possible.
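A rough Crystal sketch of the kind of scan drosehn describes, streaming an `IO` line by line and picking out lines that start with `%%`. His actual program is written in C and is not shown in this log, so the helper name `dsc_comments` and the sample data here are hypothetical.

```crystal
# Hypothetical helper: collect the lines that start with "%%"
# (PostScript comment lines) while streaming through an IO.
def dsc_comments(io : IO) : Array(String)
  found = [] of String
  io.each_line do |line|
    # chomp is a no-op if each_line already stripped the line ending
    found << line.chomp if line.starts_with?("%%")
  end
  found
end

# Made-up sample input; a real run would pass an opened File instead.
sample = IO::Memory.new("%!PS-Adobe-3.0\n%%Pages: 2\n0 0 moveto\n%%EOF\n")
dsc_comments(sample).each { |line| puts line }
```

Only the matching lines are kept, so memory use follows the number of comment lines rather than the size of the file.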
<FromGitter>
<sdogruyol> @drosehn that sounds interesting. What do you do? Fintech?
<FromGitter>
<johnjansen> WOW @drosehn Postscript, you are bringing back some memories / nightmares from the past
<FromGitter>
<drosehn> Oh, I guess I should not put the number-sign before numbers unless I'm talking about github issues!
<FromGitter>
<johnjansen> @sdogruyol Postscript is a printer language for want of a better word … files are routinely enormous
<FromGitter>
<johnjansen> @drosehn is in the academic world ;-)
<FromGitter>
<drosehn> No, I work at a college. Over the last 15 years we've built up a pretty popular service for printing out large-format outputs. 3-foot-wide by many-feet-long. During the last two weeks of each semester, we'll print out more than a mile of 3-foot-wide paper.
<FromGitter>
<sdogruyol> @johnjansen just learned that. Thank you :)
<FromGitter>
<sdogruyol> @drosehn that's awesome
soveran has quit [Remote host closed the connection]
<FromGitter>
<drosehn> It also means you have to be really obsessive about any processing of those postscript files!
<FromGitter>
<johnjansen> im feeling sympathy for @drosehn, debugging PS is ... well … interesting ;-)
<FromGitter>
<sdogruyol> :smile:
<FromGitter>
<johnjansen> @drosehn did you guys build a RIP ?
<FromGitter>
<drosehn> These days actual-debugging of postscript is nearly impossible. My program just helps us know exactly what the postscript file expects to do *before* we send it to the plotters. This is very valuable info, when you have a lot of plots and you need to be obsessive.
<FromGitter>
<drosehn> Wow. No!! That's way beyond my abilities!!
soveran has joined #crystal-lang
<FromGitter>
<drosehn> [That emphatic "no" is wrt building a RIP]
<FromGitter>
<johnjansen> thank god …
<FromGitter>
<drosehn> In any case, I need to get back to work...
<FromGitter>
<johnjansen> ;-)
soveran has quit [Ping timeout: 260 seconds]
j2k has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
kochev has quit [Remote host closed the connection]
<FromGitter>
<crisward> What does everyone use for mocks / spies / stubs in testing?
j2k has joined #crystal-lang
<FromGitter>
<jwoertink> What's testing? Oh, that thing you do after you push your app to production?
mgarciaisaia has joined #crystal-lang
pduncan has joined #crystal-lang
<FromGitter>
<johnjansen> anyone know the status of Crystal right now, like whens the next release?
pduncan has quit [Ping timeout: 245 seconds]
mgarciaisaia has left #crystal-lang [#crystal-lang]
<FromGitter>
<drosehn> Here's another indication of the files that my program had to deal with. I notice the "wcg" program used `row = [] of Int32`. In my program, I have to use 64-bit integers, not 32-bit...
bjz has joined #crystal-lang
<FromGitter>
<drosehn> yeah, if you send a file >2gig to the `wcg` program, it exits sideways with `negative capacity (ArgumentError)`. The call to `text = File.read(fd)` will need to create a single `String` object which needs to be more than 2-gig, and since `String#size` returns an Int32, I suspect it is impossible to create a single String which is larger than that.
<FromGitter>
<drosehn> if it's any consolation, the first version of my `scanps` program was written in perl, and it worked fairly well for my 10-meg test files. I then put it in production, and when a postscript file larger than 200 meg (*meg*, not gig) arrived, the entire machine crashed. It had run out of memory. And swap space.
<FromGitter>
<sdogruyol> @drosehn just what you'd expect from any code in production :)
<RX14>
you really don't want to be using File.read on large files ever
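One way to avoid `File.read` on large inputs is to read fixed-size chunks into a reusable buffer and keep the running total in an `Int64`. This is a sketch under assumptions: the helper name `count_lines`, the 64 KiB chunk size, and the file name in the trailing comment are all hypothetical, and only line counting is shown.

```crystal
# Hypothetical helper: count LF bytes by reading fixed-size chunks,
# so memory use stays constant regardless of file size.
def count_lines(io : IO, chunk_size = 64 * 1024) : Int64
  buffer = Bytes.new(chunk_size)
  lines = 0_i64 # 64-bit counter, so totals past the Int32 range are fine

  loop do
    read = io.read(buffer) # number of bytes read, 0 at end of input
    break if read == 0
    buffer[0, read].each do |byte|
      lines += 1 if byte == 0x0a_u8 # LF
    end
  end

  lines
end

puts count_lines(IO::Memory.new("a\nb\nc\n")) # prints "3"
# For a real file (hypothetical name):
#   File.open("big.ps") { |file| puts count_lines(file) }
```

Because the same buffer is reused for every chunk, no `String` covering the whole file is ever created.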
j2k has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
bjz has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
soveran has joined #crystal-lang
bjz has joined #crystal-lang
bjz has quit [Read error: Connection reset by peer]
bjz has joined #crystal-lang
bjz has quit [Quit: My MacBook Pro has gone to sleep. ZZZzzz…]
Philpax has joined #crystal-lang
Philpax has quit [Ping timeout: 246 seconds]
<crystal-gh>
[crystal] ysbaddaden pushed 1 new commit to master: https://git.io/vXDBT
<crystal-gh>
crystal/master e5deb09 Sam Eaton: Implement Hash notation for examples in docs (#3550)
<FromGitter>
<jots_twitter> @drosehn : postscript brings back memories of sun workstations running NeWS (display postscript) good times, good times :-)
soveran has quit [Remote host closed the connection]
am_ has joined #crystal-lang
am_ has quit [Ping timeout: 250 seconds]
<FromGitter>
<johnjansen> Oh boy the postscript club is in session now next will be the Linotype vs compugraphic discussion
<FromGitter>
<drosehn> My experience with postscript started with the first NeXTstation (not the original NeXT Cube).
<FromGitter>
<johnjansen> wow now some PasteUp is all we need
<FromGitter>
<drosehn> Getting back to rewriting `wc` in crystal, I have something which works much faster than `wcg.cr` for larger files (over 100-meg), and which isn't dangerous to run for very large files (say, over 6-gig). However it's *slower* than `wcg.cr` for files under 1-meg, it does not get word-counts correct, and if it's given arbitrary binary files (such as a disk-image file) then it can get totally wrong answers.
<RX14>
Papierkorb, i'm still making progress on the select with both channels and IO. I'm pretty sure it can be done now.
<Papierkorb>
kk
<Papierkorb>
drosehn, well, if you feed wc binary data, it won't come up with something useful either
<FromGitter>
<drosehn> well, it sometimes gets the word-count correct, depending on the file. I'm pretty sure I understand why the word-count is often wrong, but given all the other problems I'm not too concerned about fixing that.
<FromGitter>
<drosehn> it'll come up with a correct byte-count, and it comes up with a line-count that's probably correct. Mine won't even get the byte-count correct!
<FromGitter>
<drosehn> Hmm, maybe it is running into a x'0', and treating that as end-of-file.
<FromGitter>
<drosehn> nope.
<FromGitter>
<drosehn> in any case, it's clear I need to know more about crystal before I'll have something that works faster & better than the `wcg.cr` attempt. I won't have the time for that. I have learned a number of things, so this mini-project has benefitted me even though it didn't help anyone else!
<FromGitter>
<drosehn> oops. I mean that I won't have time for that anytime soon. Maybe next weekend, maybe not.