<teratorn> anyone know of a slick way to see if a certain string exists as a substring of another string, beginning at a given offset?
<karryall> String.sub and then (=) ?
<Kinners> Str.search_forward (Str.regexp_string substring) string offset
<teratorn> karryall: yeah, but i need something very efficient here, i'm hoping for an in-line comparison :)
<teratorn> "Str" you mean String?
<karryall> no he means Str, the regular expressions module
<teratorn> odd
<teratorn> it's not in the stdlib?
<karryall> no, you have to specify str.cma at link time
<teratorn> i see
<teratorn> still, compiling regexes doesn't seem to efficient :(
<mrvn> Does that build a state automat or a search with rollback?
<mrvn> teratorn: Is your substring a lot shorter than your big string?
<teratorn> mrvn: very much so
<teratorn> ideally i just want a simple byte-by-byte comparrison
<teratorn> and this operation could very well be the bottleneck of my application
<mrvn> You can preprocess your search string noting how many chars you have to rollback on the search string when a char mismatches and then do a linear search once through the big string.
<teratorn> i'm not following :(
<mrvn> example: search "abab":
<karryall> teratorn: just pick an implementation somewhere, there are zillions of them
<mrvn> If you find an 'a' but then an 'a' you advance the big string by one but not the search string. So you check if the next char is a 'b'
<karryall> ... or just try the one-line solution using Str and make sure it _is_ the bottleneck of your application
<mrvn> The regexp thing above probably does exactly that but way less typing.
<teratorn> yeah i'm trying to optimize last
<teratorn> but the reason i'm rewriting this in ocaml is for the speed
<mrvn> If you need it any faster you might have to use C and read the string in a register at a time instead of char wise.
<mrvn> Fully optimized build and schedule assembler code specifically for the search string at hand :)
<teratorn> i think ocaml will be fast enough
<teratorn> or rather, fast enough will be as fast as that :)
<mrvn> You might get a speedup of 2 or something with handmade C code.
<teratorn> *nod*
<mrvn> But your probably just looking for O(n) instead of O(n*m) speed [n = string length, m = search string length]
<mrvn> Are you searching many different (or same) strings in the big string?
<mrvn> Cause if you search for more than one string you should use a suffix tree instead.
<mrvn> Thats O(n) once setup phase, O(m + |num finds|) searching.
<teratorn> well here's the problem
<teratorn> i'm filtering profanity from email messages
<teratorn> so
<teratorn> for each word i identify in the "big string", i run a binary search against my array of bad words
<mrvn> That realy sounds like a job for a suffixtrie and a 'word' tree
<karryall> you could use a trie
<mrvn> Do you know suffixtries?
<teratorn> never heard of it
<mrvn> Suffixtree?
<teratorn> no
<mrvn> You build a tree where each suffix of a string exists as a patch in the tree.
<teratorn> if i find a match i do an in-line blit with star characters
<mrvn> abcd =>
<mrvn> / \
<mrvn> /| | \
<mrvn> a b c d
<mrvn> b c d
<mrvn> c d
<mrvn> d
* teratorn copies into something with fixed-width font =)
<mrvn> If you then need to find the substring bc you can just go down the b node and then the c node along the tree.
<mrvn> With longer and repetitive texts all the repeatet words start with the same path. You can find them all in one go.
<mrvn> So if you are looking for "fuck" you start at the root and go down to f, u, c, k and you have all places where fuck stands.
<mrvn> Alternatively you could one huge regexp for your complete dictionary of bad words and search for that.
<mrvn> +build
* teratorn ponders
<mrvn> I'm guessing your dictionary doesn't change often. You could build an optimised regexp once and marshal that to disk.
<teratorn> no not much
<teratorn> i have it all in program code as a literal array, the only thing i do with is Array.sort compare bad_words_array
<teratorn> so i'm not too worried about startup costs
<mrvn> Thinking about it I would probably build mayself a automat that goes trough a string and stops at the first word thats in the dictionary.
<teratorn> hmm
<karryall> I say: build a trie with the bad words and then iterate over the text and test each word of the text to see if it appears in the bad words set
<teratorn> i guess i should learn about trie's
<teratorn> the only thing i was wondering about right now was how to test a substring against another string without doing String.sub to get at it
<mrvn> teratorn: how strickt is that word thing?
<mrvn> do you want to block "fuck" in "motherFUCKer"?
<mrvn> or doesn't that count?
<teratorn> nah i just have motherfucker in the dict
<teratorn> and one phase where i copy the string, lowercase and substitute common l33tspeak type characters for their letter representations
<teratorn> i run the comparions against that string and make changes back to the original
<teratorn> *comparisons
<mrvn> Ever heard of a dictionary tree?
<teratorn> no
<mrvn> You start with a root and label the edges in the tree each with one char. Each path forms a word of the dictionary.
rhil is now known as rhil|brb
<teratorn> ok
<mrvn> Worst case its O(|dictionary|) but usually much smaller.
<mrvn> With that you can search for all words of the dictionary ad once.
<mrvn> You see how?
<karryall> (btw, that's what I meant when I said "a trie")
<teratorn> yeah i see how that works
<teratorn> cool stuff
<mrvn> So you start at the begining and look that word up. If its in the tree and the next char in the text is a ' ' you found a bad word.
<mrvn> If its not in the tree or if the next char is not ' ' read on to the next word.
<mrvn> s/' '/white space/
<teratorn> yeah i've got a whole list of "word breaks"
<teratorn> html crap etc :(
<mrvn> karryall: tries are usually somewhat special. Like compressed or without leafes or something.
<teratorn> this is pretty cool, now that i think about it
<mrvn> If ram is of concern you can compress the dictionary tree somewhat.
<karryall> mrvn: I use then uncompressed but maybe I was wrong in calling these "tries"
<mrvn> Instead of storing allways one char on each edge you can store a string if there is no branching.
<teratorn> usually i could eliminate a word after the first 2 or 3 letters
<teratorn> isntead of eliminating after the first 1 or 2 chars, for each comaprrison in the binary search
<mrvn> teratorn: A tree is definitly the right way.
<mrvn> That gives you a O(n) algorithm after setup phase. You only ever look at each char once.
<teratorn> right get the char, see if the path in the dict tree can continue
rhil|brb is now known as rhil
<teratorn> well thanks mrvn and karryall
* teratorn &
<mrvn> Try replacing all whitespaces by ' ' when you remove 1337 talk and the like. Maybe thats quicker.
Smerdyakov has quit ["reboot because computer is dodgy"]
<teratorn> freakin tuareg-mode indentation
<teratorn> is it know to be bugged or what?
<teratorn> *known
<gl> ictp contest was hardcore
<gl> *icfp
<Riastradh> teratorn, what's wrong with it?
<teratorn> it's not matching up if's and else's right when they are nested
<teratorn> i don't think..
Smerdyakov has joined #ocaml
<teratorn> damn this sucks. if you have several nested if's and only one trailing else, it's ambiguous which if-block the else goes with
<teratorn> has anyone had this problem?
<Smerdyakov> Problem in what sense? As a user of a programming language with this syntax?
<teratorn> sure
<Smerdyakov> It's not a very big problem.. just use parentheses :P
<teratorn> having a bunch of else () kinda sucks
<teratorn> `/o but it was only fantasy `/o
<teratorn> `/o the wall was too high / as you can see `/o
<teratorn> `/o no matter how he tried / he could not break free `/o
<teratorn> `/o and the worms ate into his brain `/o
<Smerdyakov> Well, pretty much every language has this "problem." (I put it in quotes because people don't really agree on what the "right" grouping is.)
<Smerdyakov> Well, excluding languages like Lisp, where you'll need the parens anyway :D
<teratorn> well python doesn't have this problem :) i guess i've been sheltered from languages that do :)
<Smerdyakov> teratorn, eh? You use indentation in Python, right?
<teratorn> yep
<Smerdyakov> So it's even WORSE than in OCaml.
<teratorn> no :)
<Smerdyakov> You must indicate grouping for EVERY if..then..else. It NEVER guesses.
<teratorn> the indentation level tells which if block it corresponds to
<Smerdyakov> Yes, so you have to do what you are complaining about doing in OCaml...
<teratorn> eh? no
<teratorn> the only way i can do it in ocaml (i think) is to close each if with an else, even if it's only else ()
<Smerdyakov> That's absurd.
<teratorn> i'm pretty new so i could be missing something
<Smerdyakov> You don't need to do anything but add parentheses around existing expressions to resolve ambiguities.
<teratorn> hmm
<teratorn> oh ok
<teratorn> :)
<teratorn> see i told you i'm new to ocaml
<teratorn> i should have thought of that, i've only been used to using it to resolv ambiguities in function parameters
<Riastradh> Function arguments are just expressions -- like if expressions.
<teratorn> yeah
Kinners has left #ocaml []
rhil has quit ["leaving"]
rhil has joined #ocaml
reltuk has quit [Read error: 104 (Connection reset by peer)]
rhil has quit [Remote closed the connection]
rhil has joined #ocaml
<teratorn> argh
<teratorn> with input_line channel, there's no way to know if the last line had a newline or now :/
<teratorn> s/or now/ or not
reltuk has joined #ocaml
<teratorn> fuck
<teratorn> silly input
<teratorn> "i'll read that many characters unless there aren't any more to read, or unless i just don't feel like it"
<teratorn> ok, so long as it reads more than 0, then i just keep going... :/
* teratorn bitches and moans
<teratorn> well it came out ok <sniff>
mattam has quit [Read error: 110 (Connection timed out)]
klamath has joined #ocaml
klamath has left #ocaml []
Smerdyakov has quit []
karryall has quit ["ERC vVersion 3.0 $Revision: 1.328 $ (IRC client for Emacs)"]
<mrvn> teratorn: "if a then if b then (if c then d else e)" is allways done this way and thats what tuareg indents.
pattern__ has quit [Read error: 110 (Connection timed out)]
<teratorn> yeah that works
pattern_ has joined #ocaml
<teratorn> i have a program that filters data off of stdin to stdout, the best way for me to process this data is by line. so i want to use input_line stdin...
<teratorn> however
<teratorn> input_line strips the newline, so i do not know if the last line had a newline or not, and this is very important to my program
<teratorn> if i could detect if the stdin channel had a newline before the eof i could do it
<teratorn> or if i could read everthing off stdin into a single string
Yurik__ has joined #ocaml
Yurik_ has quit [Read error: 54 (Connection reset by peer)]
<mrvn> teratorn: just read a string and split it into lins manually.
<mrvn> or use streams or the parser generator.
<teratorn> yeah i thought about doing it mysql, but i didn't feel like implementing my own line buffering
<mrvn> you could use the buffer class
<teratorn> yeah i could
<teratorn> Buffer.add_channel kind of sucks though, since it apparently won't do a partial read
<teratorn> alright screw it, ill just "input stdin" and keep adding to the buffer till EOF, then take the whole string back out of the buffer and use that
<teratorn> probably won't be too slow
<mrvn> teratorn: I have a IO modules that can read asynchronously from Unix.file_descriptor that uses buffering.
rhil is now known as rhil|zzz
<teratorn> well i don't need async, and this is working ok, but thanks
reltuk has quit ["leaving"]
lus|wazze has joined #ocaml
cDlm_ has joined #ocaml
reltuk has joined #ocaml
cDlm has quit [Read error: 110 (Connection timed out)]
cDlm_ is now known as cDlm
docelic has quit [Excess Flood]
docelic has joined #ocaml
lus|wazze has quit ["[21:57:38] <{specter}> benutze lustiges konfetti auf schwer bewaffneter clown"]
gene9 has joined #ocaml
gene9 has quit [Read error: 104 (Connection reset by peer)]
gene9 has joined #ocaml
gene9 has left #ocaml []
Smerdyakov has joined #ocaml
systems has joined #ocaml
<systems> the contest is over
<systems> so did ocaml have a team, how many teams submitted something
<systems> etc... etc...
* Riastradh was on the #scheme team.
reltuk has quit ["leaving"]
systems has quit [Operation timed out]
CybeRDukE has joined #ocaml
Smerdyakov has quit ["work"]
CybeRDukE is now known as CybeR[away]
CybeR[away] is now known as CybeRDukE
CybeRDukE is now known as CybeR[away]
mattam has joined #ocaml
mattam_ has joined #ocaml
mattam has quit [Read error: 110 (Connection timed out)]
lus|wazze has joined #ocaml
* Riastradh curses loudly at ocaml.org
mvw has joined #ocaml
mattam_ is now known as mattam
axolotl has quit [Remote closed the connection]
mattam has quit ["brb"]
mattam has joined #ocaml
lus|wazze has quit ["[21:57:38] <{specter}> benutze lustiges konfetti auf schwer bewaffneter clown"]
lus|wazze has joined #ocaml
mvw has left #ocaml []
mrvn_ has joined #ocaml
mrvn has quit [Read error: 60 (Operation timed out)]
<teratorn> Riastradh: yeah i hate people that don't have domain.tld resolve to www.domain.tld
<Riastradh> Er, no, it's that it wasn't resolving for me at all.
<teratorn> oh heh
vincenz has joined #ocaml
<vincenz> Hi
rhil|zzz has quit [leguin.freenode.net irc.freenode.net]
lam has quit [leguin.freenode.net irc.freenode.net]
mellum has quit [leguin.freenode.net irc.freenode.net]
jtra has quit [leguin.freenode.net irc.freenode.net]
rhil|zzz has joined #ocaml
lam has joined #ocaml
mellum has joined #ocaml
<teratorn> hi
<teratorn> it would be real cool to have a slice notation for strings
<teratorn> so the compiler could optimize out operations that could be perfomed in-line on the substring, instead of always creating a new string with String.sub
<emu> yer confusing syntax w/semantics
<emu> not even semantics
<emu> the compiler can do the same with String.sub
<teratorn> well hmm
<teratorn> i guess so
<teratorn> how do yall profile your apps?
<teratorn> i noticed the -p flag
<teratorn> is gprof usable for doing a live profile?
<mattam> not live, you get a profile dump iirc
<teratorn> yeah
<teratorn> but is there a way to profile with live input?
vincenz has quit ["KVIrc 3.0.0-beta1 "Eve's Avatar""]
owll has joined #ocaml
Yurik__ is now known as Yurik
Smerdyakov has joined #ocaml
owll has quit ["Client Exiting"]
docelic has quit ["reboot, seems I have usb disabled"]
docelic has joined #ocaml
CybeR[away] has quit ["Documentation is like sex: when it is good, it is very, very good. And when it is bad, it is better than nothing."]
Demitar has joined #ocaml
Demitar has quit ["There are bubbles in the air..."]
Demitar has joined #ocaml
asqui has quit [Connection reset by peer]
asqui has joined #ocaml
lus|wazze has quit ["#lus - der Atheisten-channel \o/"]
Demitar has quit ["There are bubbles in the air..."]