#ocaml on 2003-07-01 — irc logs at freenode.irclog.whitequark.org

2003-06-26 09:16 karryall changed the topic of #ocaml to: ICFP contest 2003 http://www.dtek.chalmers.se/groups/icfpcontest/ | http://www.ocaml.org/ | http://caml.inria.fr/oreilly-book/ | Caml Weekly news http://pauillac.inria.fr/~aschmitt/cwn/

00:22 <teratorn> anyone know of a slick way to see if a certain string exists as a substring of another string, beginning at a given offset?

00:25 <karryall> String.sub and then (=) ?

00:26 <Kinners> Str.search_forward (Str.regexp_string substring) string offset

00:27 <teratorn> karryall: yeah, but i need something very efficient here, i'm hoping for an in-line comparison :)

00:28 <teratorn> "Str" you mean String?

00:29 <karryall> no he means Str, the regular expressions module

00:29 <teratorn> odd

00:29 <teratorn> it's not in the stdlib?

00:31 <karryall> no, you have to specify str.cma at link time

00:31 <teratorn> i see

00:31 <teratorn> still, compiling regexes doesn't seem too efficient :(

00:33 <mrvn> Does that build a state automaton or a search with rollback?

00:33 <mrvn> teratorn: Is your substring a lot shorter than your big string?

00:33 <teratorn> mrvn: very much so

00:34 <teratorn> ideally i just want a simple byte-by-byte comparison

00:34 <teratorn> and this operation could very well be the bottleneck of my application
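
A minimal sketch of the byte-by-byte comparison teratorn describes, assuming a current OCaml standard library. The name `matches_at` is hypothetical and not something mentioned in the channel. It checks whether `sub` occurs in `s` starting at offset `ofs` without allocating a copy with String.sub.

```ocaml
(* [matches_at s ofs sub] is true when [sub] occurs in [s] starting at
   byte offset [ofs]. Hypothetical helper; no intermediate string is built. *)
let matches_at s ofs sub =
  let n = String.length sub in
  if ofs < 0 || ofs > String.length s - n then false
  else begin
    (* Compare one byte at a time and stop at the first mismatch. *)
    let rec loop i = i = n || (s.[ofs + i] = sub.[i] && loop (i + 1)) in
    loop 0
  end
```

Called as `matches_at "hello world" 6 "world"`, it compares at most `String.length sub` bytes and returns false for out-of-range offsets instead of raising.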

00:35 <mrvn> You can preprocess your search string noting how many chars you have to rollback on the search string when a char mismatches and then do a linear search once through the big string.

00:36 <teratorn> i'm not following :(

00:36 <mrvn> example: search "abab":

00:36 <karryall> teratorn: just pick an implementation somewhere, there are zillions of them

00:38 <mrvn> If you match an 'a' but the next char is another 'a', you advance the big string by one but not the search string. So you check if the next char is a 'b'

00:38 <karryall> ... or just try the one-line solution using Str and make sure it _is_ the bottleneck of your application

00:39 <mrvn> The regexp thing above probably does exactly that but with way less typing.

00:39 <teratorn> yeah i'm trying to optimize last

00:39 <teratorn> but the reason i'm rewriting this in ocaml is for the speed

00:40 <mrvn> If you need it any faster you might have to use C and read the string in a register at a time instead of char wise.

00:41 <mrvn> Fully optimized build and schedule assembler code specifically for the search string at hand :)

00:42 <teratorn> i think ocaml will be fast enough

00:42 <teratorn> or rather, fast enough will be as fast as that :)

00:42 <mrvn> You might get a speedup of 2 or something with handmade C code.

00:43 <teratorn> *nod*

00:43 <mrvn> But you're probably just looking for O(n) instead of O(n*m) speed [n = string length, m = search string length]
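
The preprocessing mrvn outlines matches the Knuth-Morris-Pratt approach. What follows is a sketch under that assumption, not code from the channel. The names `failure_table` and `kmp_search` are hypothetical. The table records how far to fall back in the search string after a mismatch, so the big string is scanned once.

```ocaml
(* fail.(i) is the length of the longest proper prefix of pat.[0..i]
   that is also a suffix of it. *)
let failure_table pat =
  let m = String.length pat in
  let fail = Array.make m 0 in
  let k = ref 0 in
  for i = 1 to m - 1 do
    while !k > 0 && pat.[i] <> pat.[!k] do
      k := fail.(!k - 1)
    done;
    if pat.[i] = pat.[!k] then incr k;
    fail.(i) <- !k
  done;
  fail

(* Offset of the first occurrence of [pat] in [text], or None. *)
let kmp_search pat text =
  let m = String.length pat and n = String.length text in
  if m = 0 then Some 0
  else begin
    let fail = failure_table pat in
    let k = ref 0 in          (* number of pattern chars matched so far *)
    let result = ref None in
    let i = ref 0 in
    while !result = None && !i < n do
      (* On a mismatch, fall back in the pattern only; [i] never moves back. *)
      while !k > 0 && text.[!i] <> pat.[!k] do
        k := fail.(!k - 1)
      done;
      if text.[!i] = pat.[!k] then incr k;
      if !k = m then result := Some (!i - m + 1);
      incr i
    done;
    !result
  end
```

For the "abab" example, the table is `[|0; 0; 1; 2|]`: after matching "aba" and hitting a mismatch, the search resumes with one pattern character already matched.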

00:44 <mrvn> Are you searching many different (or same) strings in the big string?

00:45 <mrvn> Cause if you search for more than one string you should use a suffix tree instead.

00:46 <mrvn> That's a one-time O(n) setup phase, then O(m + |num finds|) per search.

00:47 <teratorn> well here's the problem

00:47 <teratorn> i'm filtering profanity from email messages

00:47 <teratorn> so

00:47 <teratorn> for each word i identify in the "big string", i run a binary search against my array of bad words

00:48 <mrvn> That really sounds like a job for a suffixtrie and a 'word' tree

00:48 <karryall> you could use a trie

00:48 <mrvn> Do you know suffixtries?

00:49 <teratorn> never heard of it

00:49 <mrvn> Suffixtree?

00:49 <teratorn> no

00:49 <mrvn> You build a tree where each suffix of a string exists as a path in the tree.

00:50 <teratorn> if i find a match i do an in-line blit with star characters

00:50 <mrvn> abcd =>

00:50 <mrvn>    / \

00:50 <mrvn>  /| | \

00:50 <mrvn> a b c d

00:50 <mrvn> b c d

00:50 <mrvn> c d

00:50 <mrvn> d

00:51 * teratorn copies into something with fixed-width font =)

00:51 <mrvn> If you then need to find the substring bc you can just go down the b node and then the c node along the tree.

00:53 <mrvn> With longer and repetitive texts all the repeated words start with the same path. You can find them all in one go.

00:54 <mrvn> So if you are looking for "fuck" you start at the root and go down to f, u, c, k and you have all places where fuck stands.

00:56 <mrvn> Alternatively you could one huge regexp for your complete dictionary of bad words and search for that.

00:56 <mrvn> +build

00:57 * teratorn ponders

00:57 <mrvn> I'm guessing your dictionary doesn't change often. You could build an optimised regexp once and marshal that to disk.

00:58 <teratorn> no not much

00:59 <teratorn> i have it all in program code as a literal array, the only thing i do with it is Array.sort compare bad_words_array

00:59 <teratorn> so i'm not too worried about startup costs

01:00 <mrvn> Thinking about it I would probably build myself an automaton that goes through a string and stops at the first word that's in the dictionary.

01:01 <teratorn> hmm

01:01 <karryall> I say: build a trie with the bad words and then iterate over the text and test each word of the text to see if it appears in the bad words set

01:02 <teratorn> i guess i should learn about trie's

01:02 <teratorn> the only thing i was wondering about right now was how to test a substring against another string without doing String.sub to get at it

01:03 <mrvn> teratorn: how strict is that word thing?

01:03 <mrvn> do you want to block "fuck" in "motherFUCKer"?

01:03 <mrvn> or doesn't that count?

01:03 <teratorn> nah i just have motherfucker in the dict

01:04 <teratorn> and one phase where i copy the string, lowercase and substitute common l33tspeak type characters for their letter representations

01:04 <teratorn> i run the comparions against that string and make changes back to the original

01:04 <teratorn> *comparisons

01:05 <mrvn> Ever heard of a dictionary tree?

01:05 <teratorn> no

01:07 <mrvn> You start with a root and label the edges in the tree each with one char. Each path forms a word of the dictionary.

01:07 rhil is now known as rhil|brb

01:07 <teratorn> ok

01:07 <mrvn> Worst case it's O(|dictionary|) but usually much smaller.

01:08 <mrvn> With that you can search for all words of the dictionary at once.

01:08 <mrvn> You see how?

01:09 <karryall> (btw, that's what I meant when I said "a trie")

01:10 <teratorn> yeah i see how that works

01:11 <teratorn> cool stuff

01:11 <mrvn> So you start at the beginning and look that word up. If it's in the tree and the next char in the text is a ' ' you found a bad word.

01:11 <mrvn> If it's not in the tree or if the next char is not ' ' read on to the next word.

01:11 <mrvn> s/' '/white space/

01:12 <teratorn> yeah i've got a whole list of "word breaks"

01:12 <teratorn> html crap etc :(

01:12 <mrvn> karryall: tries are usually somewhat special. Like compressed or without leaves or something.

01:13 <teratorn> this is pretty cool, now that i think about it

01:14 <mrvn> If ram is of concern you can compress the dictionary tree somewhat.

01:14 <karryall> mrvn: I use them uncompressed but maybe I was wrong in calling these "tries"

01:15 <mrvn> Instead of always storing one char on each edge you can store a string if there is no branching.

01:15 <teratorn> usually i could eliminate a word after the first 2 or 3 letters

01:15 <teratorn> instead of eliminating after the first 1 or 2 chars, for each comparison in the binary search

01:16 <mrvn> teratorn: A tree is definitely the right way.

01:17 <mrvn> That gives you a O(n) algorithm after setup phase. You only ever look at each char once.

01:18 <teratorn> right get the char, see if the path in the dict tree can continue
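
A rough sketch of the dictionary tree (trie) filter discussed above, assuming a current OCaml standard library. The type and function names (`trie`, `add`, `censor`) are hypothetical. Treating only ASCII letters as word characters is a simplifying assumption standing in for teratorn's own list of word breaks. Each character of the text is inspected once, and matches are overwritten with stars.

```ocaml
(* One node per prefix; [is_word] marks the end of a dictionary word. *)
type trie = { mutable is_word : bool; children : (char, trie) Hashtbl.t }

let make_node () = { is_word = false; children = Hashtbl.create 4 }

(* Insert [word] (expected in lowercase) into the trie rooted at [root]. *)
let add root word =
  let rec go node i =
    if i = String.length word then node.is_word <- true
    else begin
      let c = word.[i] in
      let child =
        match Hashtbl.find_opt node.children c with
        | Some child -> child
        | None ->
          let child = make_node () in
          Hashtbl.add node.children c child;
          child
      in
      go child (i + 1)
    end
  in
  go root 0

(* Assumption: a word is a run of ASCII letters; anything else is a break. *)
let is_word_char c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

(* Return a copy of [text] with every whole word found in the trie
   replaced by '*' characters. Matching ignores ASCII case. *)
let censor root text =
  let n = String.length text in
  let out = Bytes.of_string text in
  let rec scan i =
    if i < n then begin
      if is_word_char text.[i] then word (Some root) i i else scan (i + 1)
    end
  and word node start i =
    if i < n && is_word_char text.[i] then begin
      (* Follow the edge for this char; None means the path has died. *)
      let c = Char.lowercase_ascii text.[i] in
      let next =
        match node with
        | Some nd -> Hashtbl.find_opt nd.children c
        | None -> None
      in
      word next start (i + 1)
    end else begin
      (* Reached a word break: star out the word if it ends on a marked node. *)
      (match node with
       | Some { is_word = true; _ } -> Bytes.fill out start (i - start) '*'
       | _ -> ());
      scan i
    end
  in
  scan 0;
  Bytes.to_string out
```

Because a word only counts when the path ends on a marked node at a word break, "darned" is left alone when only "darn" is in the dictionary.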

01:19 rhil|brb is now known as rhil

01:19 <teratorn> well thanks mrvn and karryall

01:19 * teratorn &

01:20 <mrvn> Try replacing all whitespace by ' ' when you remove 1337 talk and the like. Maybe that's quicker.

02:19 Smerdyakov has quit ["reboot because computer is dodgy"]

02:24 <teratorn> freakin tuareg-mode indentation

02:24 <teratorn> is it know to be bugged or what?

02:24 <teratorn> *known

02:26 <gl> ictp contest was hardcore

02:26 <gl> *icfp

02:27 <Riastradh> teratorn, what's wrong with it?

02:27 <teratorn> it's not matching up if's and else's right when they are nested

02:27 <teratorn> i don't think..

02:39 Smerdyakov has joined #ocaml

02:41 <teratorn> damn this sucks. if you have several nested if's and only one trailing else, it's ambiguous which if-block the else goes with

02:45 <teratorn> has anyone had this problem?

02:45 <Smerdyakov> Problem in what sense? As a user of a programming language with this syntax?

02:46 <teratorn> sure

02:46 <Smerdyakov> It's not a very big problem.. just use parentheses :P

02:46 <teratorn> having a bunch of else () kinda sucks

02:47 <teratorn> `/o but it was only fantasy `/o

02:47 <teratorn> `/o the wall was too high / as you can see `/o

02:47 <teratorn> `/o no matter how he tried / he could not break free `/o

02:48 <teratorn> `/o and the worms ate into his brain `/o

02:48 <Smerdyakov> Well, pretty much every language has this "problem." (I put it in quotes because people don't really agree on what the "right" grouping is.)

02:48 <Smerdyakov> Well, excluding languages like Lisp, where you'll need the parens anyway :D

02:48 <teratorn> well python doesn't have this problem :) i guess i've been sheltered from languages that do :)

02:49 <Smerdyakov> teratorn, eh? You use indentation in Python, right?

02:49 <teratorn> yep

02:49 <Smerdyakov> So it's even WORSE than in OCaml.

02:49 <teratorn> no :)

02:49 <Smerdyakov> You must indicate grouping for EVERY if..then..else. It NEVER guesses.

02:49 <teratorn> the indentation level tells which if block it corresponds to

02:49 <Smerdyakov> Yes, so you have to do what you are complaining about doing in OCaml...

02:50 <teratorn> eh? no

02:50 <teratorn> the only way i can do it in ocaml (i think) is to close each if with an else, even if it's only else ()

02:50 <Smerdyakov> That's absurd.

02:50 <teratorn> i'm pretty new so i could be missing something

02:51 <Smerdyakov> You don't need to do anything but add parentheses around existing expressions to resolve ambiguities.

02:51 <teratorn> hmm

02:51 <teratorn> oh ok

02:51 <teratorn> :)

02:52 <teratorn> see i told you i'm new to ocaml

02:52 <teratorn> i should have thought of that, i've only been used to using it to resolve ambiguities in function parameters

02:55 <Riastradh> Function arguments are just expressions -- like if expressions.
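
A small illustration of the point about parentheses, assuming nothing beyond the standard library. The function names `nearest` and `outer` are hypothetical. An `else` pairs with the nearest `if` unless parentheses (or `begin ... end`) say otherwise, so no `else ()` is needed.

```ocaml
(* No grouping: the else belongs to the inner [if b]. *)
let nearest a b =
  let buf = Buffer.create 8 in
  if a then if b then Buffer.add_string buf "a and b"
  else Buffer.add_string buf "a, not b";
  Buffer.contents buf

(* Parentheses attach the else to the outer [if a] instead. *)
let outer a b =
  let buf = Buffer.create 8 in
  if a then (if b then Buffer.add_string buf "a and b")
  else Buffer.add_string buf "not a";
  Buffer.contents buf
```

With `a = false`, `nearest` adds nothing to the buffer while `outer` adds "not a", which shows the two groupings really differ.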

02:56 <teratorn> yeah

03:00 Kinners has left #ocaml []

03:02 rhil has quit ["leaving"]

03:26 rhil has joined #ocaml

03:32 reltuk has quit [Read error: 104 (Connection reset by peer)]

03:33 rhil has quit [Remote closed the connection]

03:33 rhil has joined #ocaml

03:47 <teratorn> argh

03:48 <teratorn> with input_line channel, there's no way to know if the last line had a newline or now :/

03:48 <teratorn> s/or now/ or not

04:11 reltuk has joined #ocaml

04:33 <teratorn> fuck

04:33 <teratorn> silly input

04:34 <teratorn> "i'll read that many characters unless there aren't any more to read, or unless i just don't feel like it"

04:36 <teratorn> ok, so long as it reads more than 0, then i just keep going... :/

04:41 * teratorn bitches and moans

04:48 <teratorn> well it came out ok <sniff>

05:09 mattam has quit [Read error: 110 (Connection timed out)]

05:27 klamath has joined #ocaml

05:28 klamath has left #ocaml []

06:01 Smerdyakov has quit []

07:32 karryall has quit ["ERC vVersion 3.0 $Revision: 1.328 $ (IRC client for Emacs)"]

08:13 <mrvn> teratorn: "if a then if b then (if c then d else e)" is always parsed this way and that's how tuareg indents it.

08:24 pattern__ has quit [Read error: 110 (Connection timed out)]

08:31 <teratorn> yeah that works

08:33 pattern_ has joined #ocaml

08:36 <teratorn> i have a program that filters data off of stdin to stdout, the best way for me to process this data is by line. so i want to use input_line stdin...

08:36 <teratorn> however

08:36 <teratorn> input_line strips the newline, so i do not know if the last line had a newline or not, and this is very important to my program

08:39 <teratorn> if i could detect if the stdin channel had a newline before the eof i could do it

08:39 <teratorn> or if i could read everything off stdin into a single string

08:49 Yurik__ has joined #ocaml

08:51 Yurik_ has quit [Read error: 54 (Connection reset by peer)]

09:09 <mrvn> teratorn: just read a string and split it into lines manually.

09:09 <mrvn> or use streams or the parser generator.

09:16 <teratorn> yeah i thought about doing it myself, but i didn't feel like implementing my own line buffering

09:17 <mrvn> you could use the Buffer module

09:19 <teratorn> yeah i could

09:20 <teratorn> Buffer.add_channel kind of sucks though, since it apparently won't do a partial read

09:21 <teratorn> alright screw it, ill just "input stdin" and keep adding to the buffer till EOF, then take the whole string back out of the buffer and use that

09:22 <teratorn> probably won't be too slow
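
A sketch of the approach teratorn settles on, assuming a current OCaml standard library (where `input` fills a `bytes` value). The names `read_all` and `ends_with_newline` are hypothetical. `input` may return fewer bytes than requested, and only a return value of 0 signals end of file, so the loop keeps going while it reads more than 0 bytes.

```ocaml
(* Read everything from [ic] into one string, chunk by chunk. *)
let read_all ic =
  let buf = Buffer.create 4096 in
  let chunk = Bytes.create 4096 in
  let rec loop () =
    (* A short read is normal; 0 means end of file. *)
    let n = input ic chunk 0 (Bytes.length chunk) in
    if n > 0 then begin
      Buffer.add_subbytes buf chunk 0 n;
      loop ()
    end
  in
  loop ();
  Buffer.contents buf

(* True when the data ends with a newline, which input_line cannot tell you. *)
let ends_with_newline s =
  let n = String.length s in
  n > 0 && s.[n - 1] = '\n'
```

For the filter described above this would be called as `read_all stdin`, after which the string can be split into lines by hand and the final newline, if any, reproduced on output.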

09:31 <mrvn> teratorn: I have an IO module that can read asynchronously from a Unix.file_descr and uses buffering.

09:32 rhil is now known as rhil|zzz

09:36 <teratorn> well i don't need async, and this is working ok, but thanks

10:02 reltuk has quit ["leaving"]

10:03 lus|wazze has joined #ocaml

10:21 cDlm_ has joined #ocaml

10:36 reltuk has joined #ocaml

10:39 cDlm has quit [Read error: 110 (Connection timed out)]

11:35 cDlm_ is now known as cDlm

11:53 docelic has quit [Excess Flood]

12:00 docelic has joined #ocaml

12:03 lus|wazze has quit ["[21:57:38] <{specter}> benutze lustiges konfetti auf schwer bewaffneter clown"]

12:24 gene9 has joined #ocaml

12:51 gene9 has quit [Read error: 104 (Connection reset by peer)]

12:51 gene9 has joined #ocaml

14:30 gene9 has left #ocaml []

14:38 Smerdyakov has joined #ocaml

14:58 systems has joined #ocaml

15:01 <systems> the contest is over

15:02 <systems> so did ocaml have a team, how many teams submitted something

15:02 <systems> etc... etc...

15:05 * Riastradh was on the #scheme team.

15:10 reltuk has quit ["leaving"]

15:13 systems has quit [Operation timed out]

15:15 CybeRDukE has joined #ocaml

15:15 Smerdyakov has quit ["work"]

15:22 CybeRDukE is now known as CybeR[away]

15:23 CybeR[away] is now known as CybeRDukE

15:28 CybeRDukE is now known as CybeR[away]

15:38 mattam has joined #ocaml

16:07 mattam_ has joined #ocaml

16:07 mattam has quit [Read error: 110 (Connection timed out)]

16:13 lus|wazze has joined #ocaml

16:28 * Riastradh curses loudly at ocaml.org

16:44 mvw has joined #ocaml

16:47 mattam_ is now known as mattam

17:05 axolotl has quit [Remote closed the connection]

17:17 mattam has quit ["brb"]

17:19 mattam has joined #ocaml

17:51 lus|wazze has quit ["[21:57:38] <{specter}> benutze lustiges konfetti auf schwer bewaffneter clown"]

17:51 lus|wazze has joined #ocaml

17:51 mvw has left #ocaml []

18:21 mrvn_ has joined #ocaml

18:25 mrvn has quit [Read error: 60 (Operation timed out)]

18:26 <teratorn> Riastradh: yeah i hate people that don't have domain.tld resolve to www.domain.tld

18:26 <Riastradh> Er, no, it's that it wasn't resolving for me at all.

18:27 <teratorn> oh heh

18:55 vincenz has joined #ocaml

18:56 <vincenz> Hi

19:20 rhil|zzz has quit [leguin.freenode.net irc.freenode.net]

19:20 lam has quit [leguin.freenode.net irc.freenode.net]

19:20 mellum has quit [leguin.freenode.net irc.freenode.net]

19:20 jtra has quit [leguin.freenode.net irc.freenode.net]

19:23 rhil|zzz has joined #ocaml

19:23 lam has joined #ocaml

19:26 mellum has joined #ocaml

19:27 <teratorn> hi

19:34 <teratorn> it would be real cool to have a slice notation for strings

19:34 <teratorn> so the compiler could optimize out operations that could be performed in-line on the substring, instead of always creating a new string with String.sub

19:41 <emu> yer confusing syntax w/semantics

19:41 <emu> not even semantics

19:42 <emu> the compiler can do the same with String.sub

19:43 <teratorn> well hmm

19:43 <teratorn> i guess so

19:44 <teratorn> how do yall profile your apps?

19:44 <teratorn> i noticed the -p flag

19:45 <teratorn> is gprof usable for doing a live profile?

19:47 <mattam> not live, you get a profile dump iirc

20:01 <teratorn> yeah

20:01 <teratorn> but is there a way to profile with live input?

20:28 <h> http://zolo.freelsd.net/~sts/ICFP03/Pictures/index.html

21:36 vincenz has quit ["KVIrc 3.0.0-beta1 "Eve's Avatar""]

21:55 owll has joined #ocaml

21:56 Yurik__ is now known as Yurik

22:05 Smerdyakov has joined #ocaml

22:06 owll has quit ["Client Exiting"]

22:15 docelic has quit ["reboot, seems I have usb disabled"]

22:21 docelic has joined #ocaml

22:25 CybeR[away] has quit ["Documentation is like sex: when it is good, it is very, very good. And when it is bad, it is better than nothing."]

22:34 Demitar has joined #ocaml

22:43 Demitar has quit ["There are bubbles in the air..."]

23:21 Demitar has joined #ocaml

23:51 asqui has quit [Connection reset by peer]

23:51 asqui has joined #ocaml

23:55 lus|wazze has quit ["#lus - der Atheisten-channel \o/"]

23:57 Demitar has quit ["There are bubbles in the air..."]