<teratorn>
karryall: yeah, but i need something very efficient here, i'm hoping for an in-line comparison :)
<teratorn>
"Str" you mean String?
<karryall>
no he means Str, the regular expressions module
<teratorn>
odd
<teratorn>
it's not in the stdlib?
<karryall>
no, you have to specify str.cma at link time
<teratorn>
i see
<teratorn>
still, compiling regexes doesn't seem to efficient :(
<mrvn>
Does that build a state automat or a search with rollback?
<mrvn>
teratorn: Is your substring a lot shorter than your big string?
<teratorn>
mrvn: very much so
<teratorn>
ideally i just want a simple byte-by-byte comparrison
<teratorn>
and this operation could very well be the bottleneck of my application
<mrvn>
You can preprocess your search string noting how many chars you have to rollback on the search string when a char mismatches and then do a linear search once through the big string.
<teratorn>
i'm not following :(
<mrvn>
example: search "abab":
<karryall>
teratorn: just pick an implementation somewhere, there are zillions of them
<mrvn>
If you find an 'a' but then an 'a' you advance the big string by one but not the search string. So you check if the next char is a 'b'
<karryall>
... or just try the one-line solution using Str and make sure it _is_ the bottleneck of your application
<mrvn>
The regexp thing above probably does exactly that but way less typing.
<teratorn>
yeah i'm trying to optimize last
<teratorn>
but the reason i'm rewriting this in ocaml is for the speed
<mrvn>
If you need it any faster you might have to use C and read the string in a register at a time instead of char wise.
<mrvn>
Fully optimized build and schedule assembler code specifically for the search string at hand :)
<teratorn>
i think ocaml will be fast enough
<teratorn>
or rather, fast enough will be as fast as that :)
<mrvn>
You might get a speedup of 2 or something with handmade C code.
<teratorn>
*nod*
<mrvn>
But your probably just looking for O(n) instead of O(n*m) speed [n = string length, m = search string length]
<mrvn>
Are you searching many different (or same) strings in the big string?
<mrvn>
Cause if you search for more than one string you should use a suffix tree instead.
<teratorn>
i'm filtering profanity from email messages
<teratorn>
so
<teratorn>
for each word i identify in the "big string", i run a binary search against my array of bad words
<mrvn>
That realy sounds like a job for a suffixtrie and a 'word' tree
<karryall>
you could use a trie
<mrvn>
Do you know suffixtries?
<teratorn>
never heard of it
<mrvn>
Suffixtree?
<teratorn>
no
<mrvn>
You build a tree where each suffix of a string exists as a patch in the tree.
<teratorn>
if i find a match i do an in-line blit with star characters
<mrvn>
abcd =>
<mrvn>
/ \
<mrvn>
/| | \
<mrvn>
a b c d
<mrvn>
b c d
<mrvn>
c d
<mrvn>
d
* teratorn
copies into something with fixed-width font =)
<mrvn>
If you then need to find the substring bc you can just go down the b node and then the c node along the tree.
<mrvn>
With longer and repetitive texts all the repeatet words start with the same path. You can find them all in one go.
<mrvn>
So if you are looking for "fuck" you start at the root and go down to f, u, c, k and you have all places where fuck stands.
<mrvn>
Alternatively you could one huge regexp for your complete dictionary of bad words and search for that.
<mrvn>
+build
* teratorn
ponders
<mrvn>
I'm guessing your dictionary doesn't change often. You could build an optimised regexp once and marshal that to disk.
<teratorn>
no not much
<teratorn>
i have it all in program code as a literal array, the only thing i do with is Array.sort compare bad_words_array
<teratorn>
so i'm not too worried about startup costs
<mrvn>
Thinking about it I would probably build mayself a automat that goes trough a string and stops at the first word thats in the dictionary.
<teratorn>
hmm
<karryall>
I say: build a trie with the bad words and then iterate over the text and test each word of the text to see if it appears in the bad words set
<teratorn>
i guess i should learn about trie's
<teratorn>
the only thing i was wondering about right now was how to test a substring against another string without doing String.sub to get at it
<mrvn>
teratorn: how strickt is that word thing?
<mrvn>
do you want to block "fuck" in "motherFUCKer"?
<mrvn>
or doesn't that count?
<teratorn>
nah i just have motherfucker in the dict
<teratorn>
and one phase where i copy the string, lowercase and substitute common l33tspeak type characters for their letter representations
<teratorn>
i run the comparions against that string and make changes back to the original
<teratorn>
*comparisons
<mrvn>
Ever heard of a dictionary tree?
<teratorn>
no
<mrvn>
You start with a root and label the edges in the tree each with one char. Each path forms a word of the dictionary.
rhil is now known as rhil|brb
<teratorn>
ok
<mrvn>
Worst case its O(|dictionary|) but usually much smaller.
<mrvn>
With that you can search for all words of the dictionary ad once.
<mrvn>
You see how?
<karryall>
(btw, that's what I meant when I said "a trie")
<teratorn>
yeah i see how that works
<teratorn>
cool stuff
<mrvn>
So you start at the begining and look that word up. If its in the tree and the next char in the text is a ' ' you found a bad word.
<mrvn>
If its not in the tree or if the next char is not ' ' read on to the next word.
<mrvn>
s/' '/white space/
<teratorn>
yeah i've got a whole list of "word breaks"
<teratorn>
html crap etc :(
<mrvn>
karryall: tries are usually somewhat special. Like compressed or without leafes or something.
<teratorn>
this is pretty cool, now that i think about it
<mrvn>
If ram is of concern you can compress the dictionary tree somewhat.
<karryall>
mrvn: I use then uncompressed but maybe I was wrong in calling these "tries"
<mrvn>
Instead of storing allways one char on each edge you can store a string if there is no branching.
<teratorn>
usually i could eliminate a word after the first 2 or 3 letters
<teratorn>
isntead of eliminating after the first 1 or 2 chars, for each comaprrison in the binary search
<mrvn>
teratorn: A tree is definitly the right way.
<mrvn>
That gives you a O(n) algorithm after setup phase. You only ever look at each char once.
<teratorn>
right get the char, see if the path in the dict tree can continue
rhil|brb is now known as rhil
<teratorn>
well thanks mrvn and karryall
* teratorn
&
<mrvn>
Try replacing all whitespaces by ' ' when you remove 1337 talk and the like. Maybe thats quicker.
Smerdyakov has quit ["reboot because computer is dodgy"]
<teratorn>
freakin tuareg-mode indentation
<teratorn>
is it know to be bugged or what?
<teratorn>
*known
<gl>
ictp contest was hardcore
<gl>
*icfp
<Riastradh>
teratorn, what's wrong with it?
<teratorn>
it's not matching up if's and else's right when they are nested
<teratorn>
i don't think..
Smerdyakov has joined #ocaml
<teratorn>
damn this sucks. if you have several nested if's and only one trailing else, it's ambiguous which if-block the else goes with
<teratorn>
has anyone had this problem?
<Smerdyakov>
Problem in what sense? As a user of a programming language with this syntax?
<teratorn>
sure
<Smerdyakov>
It's not a very big problem.. just use parentheses :P
<teratorn>
having a bunch of else () kinda sucks
<teratorn>
`/o but it was only fantasy `/o
<teratorn>
`/o the wall was too high / as you can see `/o
<teratorn>
`/o no matter how he tried / he could not break free `/o
<teratorn>
`/o and the worms ate into his brain `/o
<Smerdyakov>
Well, pretty much every language has this "problem." (I put it in quotes because people don't really agree on what the "right" grouping is.)
<Smerdyakov>
Well, excluding languages like Lisp, where you'll need the parens anyway :D
<teratorn>
well python doesn't have this problem :) i guess i've been sheltered from languages that do :)
<Smerdyakov>
teratorn, eh? You use indentation in Python, right?
<teratorn>
yep
<Smerdyakov>
So it's even WORSE than in OCaml.
<teratorn>
no :)
<Smerdyakov>
You must indicate grouping for EVERY if..then..else. It NEVER guesses.
<teratorn>
the indentation level tells which if block it corresponds to
<Smerdyakov>
Yes, so you have to do what you are complaining about doing in OCaml...
<teratorn>
eh? no
<teratorn>
the only way i can do it in ocaml (i think) is to close each if with an else, even if it's only else ()
<Smerdyakov>
That's absurd.
<teratorn>
i'm pretty new so i could be missing something
<Smerdyakov>
You don't need to do anything but add parentheses around existing expressions to resolve ambiguities.
<teratorn>
hmm
<teratorn>
oh ok
<teratorn>
:)
<teratorn>
see i told you i'm new to ocaml
<teratorn>
i should have thought of that, i've only been used to using it to resolv ambiguities in function parameters
<Riastradh>
Function arguments are just expressions -- like if expressions.
<teratorn>
yeah
Kinners has left #ocaml []
rhil has quit ["leaving"]
rhil has joined #ocaml
reltuk has quit [Read error: 104 (Connection reset by peer)]
rhil has quit [Remote closed the connection]
rhil has joined #ocaml
<teratorn>
argh
<teratorn>
with input_line channel, there's no way to know if the last line had a newline or now :/
<teratorn>
s/or now/ or not
reltuk has joined #ocaml
<teratorn>
fuck
<teratorn>
silly input
<teratorn>
"i'll read that many characters unless there aren't any more to read, or unless i just don't feel like it"
<teratorn>
ok, so long as it reads more than 0, then i just keep going... :/
* teratorn
bitches and moans
<teratorn>
well it came out ok <sniff>
mattam has quit [Read error: 110 (Connection timed out)]
klamath has joined #ocaml
klamath has left #ocaml []
Smerdyakov has quit []
karryall has quit ["ERC vVersion 3.0 $Revision: 1.328 $ (IRC client for Emacs)"]
<mrvn>
teratorn: "if a then if b then (if c then d else e)" is allways done this way and thats what tuareg indents.
pattern__ has quit [Read error: 110 (Connection timed out)]
<teratorn>
yeah that works
pattern_ has joined #ocaml
<teratorn>
i have a program that filters data off of stdin to stdout, the best way for me to process this data is by line. so i want to use input_line stdin...
<teratorn>
however
<teratorn>
input_line strips the newline, so i do not know if the last line had a newline or not, and this is very important to my program
<teratorn>
if i could detect if the stdin channel had a newline before the eof i could do it
<teratorn>
or if i could read everthing off stdin into a single string
Yurik__ has joined #ocaml
Yurik_ has quit [Read error: 54 (Connection reset by peer)]
<mrvn>
teratorn: just read a string and split it into lins manually.
<mrvn>
or use streams or the parser generator.
<teratorn>
yeah i thought about doing it mysql, but i didn't feel like implementing my own line buffering
<mrvn>
you could use the buffer class
<teratorn>
yeah i could
<teratorn>
Buffer.add_channel kind of sucks though, since it apparently won't do a partial read
<teratorn>
alright screw it, ill just "input stdin" and keep adding to the buffer till EOF, then take the whole string back out of the buffer and use that
<teratorn>
probably won't be too slow
<mrvn>
teratorn: I have a IO modules that can read asynchronously from Unix.file_descriptor that uses buffering.
rhil is now known as rhil|zzz
<teratorn>
well i don't need async, and this is working ok, but thanks
reltuk has quit ["leaving"]
lus|wazze has joined #ocaml
cDlm_ has joined #ocaml
reltuk has joined #ocaml
cDlm has quit [Read error: 110 (Connection timed out)]
cDlm_ is now known as cDlm
docelic has quit [Excess Flood]
docelic has joined #ocaml
lus|wazze has quit ["[21:57:38] <{specter}> benutze lustiges konfetti auf schwer bewaffneter clown"]
gene9 has joined #ocaml
gene9 has quit [Read error: 104 (Connection reset by peer)]
gene9 has joined #ocaml
gene9 has left #ocaml []
Smerdyakov has joined #ocaml
systems has joined #ocaml
<systems>
the contest is over
<systems>
so did ocaml have a team, how many teams submitted something
<systems>
etc... etc...
* Riastradh
was on the #scheme team.
reltuk has quit ["leaving"]
systems has quit [Operation timed out]
CybeRDukE has joined #ocaml
Smerdyakov has quit ["work"]
CybeRDukE is now known as CybeR[away]
CybeR[away] is now known as CybeRDukE
CybeRDukE is now known as CybeR[away]
mattam has joined #ocaml
mattam_ has joined #ocaml
mattam has quit [Read error: 110 (Connection timed out)]
lus|wazze has joined #ocaml
* Riastradh
curses loudly at ocaml.org
mvw has joined #ocaml
mattam_ is now known as mattam
axolotl has quit [Remote closed the connection]
mattam has quit ["brb"]
mattam has joined #ocaml
lus|wazze has quit ["[21:57:38] <{specter}> benutze lustiges konfetti auf schwer bewaffneter clown"]
lus|wazze has joined #ocaml
mvw has left #ocaml []
mrvn_ has joined #ocaml
mrvn has quit [Read error: 60 (Operation timed out)]
<teratorn>
Riastradh: yeah i hate people that don't have domain.tld resolve to www.domain.tld
<Riastradh>
Er, no, it's that it wasn't resolving for me at all.
<teratorn>
oh heh
vincenz has joined #ocaml
<vincenz>
Hi
rhil|zzz has quit [leguin.freenode.net irc.freenode.net]
lam has quit [leguin.freenode.net irc.freenode.net]
mellum has quit [leguin.freenode.net irc.freenode.net]
jtra has quit [leguin.freenode.net irc.freenode.net]
rhil|zzz has joined #ocaml
lam has joined #ocaml
mellum has joined #ocaml
<teratorn>
hi
<teratorn>
it would be real cool to have a slice notation for strings
<teratorn>
so the compiler could optimize out operations that could be perfomed in-line on the substring, instead of always creating a new string with String.sub
<emu>
yer confusing syntax w/semantics
<emu>
not even semantics
<emu>
the compiler can do the same with String.sub
<teratorn>
well hmm
<teratorn>
i guess so
<teratorn>
how do yall profile your apps?
<teratorn>
i noticed the -p flag
<teratorn>
is gprof usable for doing a live profile?
<mattam>
not live, you get a profile dump iirc
<teratorn>
yeah
<teratorn>
but is there a way to profile with live input?