#picolisp on 2020-11-13 — irc logs at freenode.irclog.whitequark.org

2018-09-14 18:41 ChanServ changed the topic of #picolisp to: PicoLisp language | Channel Log: https://irclog.whitequark.org/picolisp/ | Check also http://www.picolisp.com for more information

00:30 Blukunfando has quit [Ping timeout: 240 seconds]

00:38 Blukunfando has joined #picolisp

03:09 <aw-> beneroth is right. I'm handling untrusted user input and i need to validate that strings are "valid" UTF-8.. sure (struct P 'S) works fine to extract it, but it doesn't tell me if the byte or byte sequence is valid utf-8.. that's why i want the list of bytes, so I can process them (as a list) and validate etc.. i may also even want to disallow a specific range of utf-8 characters within a string.. so in the end, i need the bytes not

03:09 <aw-> just the unicode value (as an integer)... at least.. that's how I see it now.

03:10 <aw-> I could be missing something too.. but when I do (struct P 'S) and receive this: "��^?�^XA�^?^?" <-- it's not good

03:10 <aw-> i can't have that

03:11 <aw-> and also vice-versa, i want to ensure that if I have a list of bytes, they convert back to **valid** UTF-8 and not just some unicode value that is technically in the range of UTF-8 (but invalid)

03:12 <aw-> I think this is a common use-case

03:13 <aw-> Regenaxer: in any case, I have some workarounds for now, but would really love it if PicoLisp provided utilities for that in the language itself

03:13 <aw-> dealing with utf-8 manually is a huge pain

04:47 orivej has quit [Ping timeout: 260 seconds]

04:55 <shoshin> is there something you could shell out to to do the conversion?

05:02 <aw-> no i'm using (native) calling into a Rust library for now

06:18 <Regenaxer> Good morning aw-, beneroth

06:18 <Regenaxer> I would not convert to a list of bytes

06:18 <Regenaxer> too much runtime overhead

06:18 <Regenaxer> I would check the string directly

06:18 <Regenaxer> UTF-8 has this format: http://ix.io/2DYL

06:19 <Regenaxer> take the string pointer, and access the bytes with (byte P)

06:19 <Regenaxer> Legality check is quite straightforward

06:19 <Regenaxer> The first byte determines the length

06:20 <Regenaxer> via the number of leading 1's

06:20 <Regenaxer> the other bytes all must be 10xxxxxx

06:20 <Regenaxer> that's all you can check

06:21 <Regenaxer> the x'es can be any value
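A minimal sketch of the structural check described above, written against a list of byte values for illustration (with a string pointer, the bytes would instead be fetched one at a time, as Regenaxer suggests). The names `utf8Len`, `utf8Cont?` and `utf8Check` are hypothetical and not part of PicoLisp. It only inspects the leading bits of the first byte and the 10xxxxxx prefix of the following bytes; it does not reject overlong encodings or surrogate code points.

```picolisp
# Sequence length implied by the leading 1-bits of the first byte,
# or NIL if the byte cannot start a sequence
(de utf8Len (B)
   (cond
      ((=0 (& B (hex "80"))) 1)                # 0xxxxxxx
      ((= (hex "C0") (& B (hex "E0"))) 2)      # 110xxxxx
      ((= (hex "E0") (& B (hex "F0"))) 3)      # 1110xxxx
      ((= (hex "F0") (& B (hex "F8"))) 4) ) )  # 11110xxx

# Continuation bytes must look like 10xxxxxx
(de utf8Cont? (B)
   (= (hex "80") (& B (hex "C0"))) )

# T if the byte list is structurally well-formed, else NIL
(de utf8Check (Lst)
   (loop
      (NIL Lst T)  # All bytes consumed: well-formed
      (NIL         # Stop with NIL at the first malformed sequence
         (let? N (utf8Len (pop 'Lst))
            (let Rest (cut (dec N) 'Lst)
               (and
                  (= (dec N) (length Rest))  # Not truncated
                  (fully 'utf8Cont? Rest) ) ) ) ) ) )

(utf8Check (65 195 164))  # "A" followed by a 2-byte sequence -> T
(utf8Check (195 65))      # Lead byte without continuation -> NIL
```

Working on a list costs the extra conversion that Regenaxer warns about; the same three steps apply unchanged when reading bytes directly from memory.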

06:25 <aw-> hi Regenaxer

06:25 <aw-> so right shift by 6 and the value should be 2 ?

06:25 <aw-> 00000010 ?

06:26 <aw-> (for the other bytes)

06:27 <Regenaxer> no

06:28 <Regenaxer> and with 0x3F then shift according to the right position

06:28 <Regenaxer> as in various places in the pil sources

06:29 <Regenaxer> only if it is 0xxxxxxx nothing is done

06:29 <Regenaxer> 7-bit ASCII

06:30 <Regenaxer> the x'es in the following bytes are big-endian

06:31 <Regenaxer> but for a check you don't care

06:32 <Regenaxer> Ah, sorry, I misunderstood

06:32 <Regenaxer> you meant 00000010 for the check?

06:32 <Regenaxer> No shift by 6

06:33 <Regenaxer> just AND with 0xC0

06:34 <aw-> right

06:34 <Regenaxer> (= (bin "10000000") (& C (hex "C0")))

06:34 <aw-> ok so i still have to do all these things manually

06:34 <Regenaxer> but shift is also good

06:34 <aw-> instead of picolisp providing that
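The two continuation-byte tests discussed above can be sketched side by side. This is only an illustration: `contMask?` and `contShift?` are hypothetical names, and C is assumed to be a single byte value in the range 0..255.

```picolisp
# Mask variant: keep the top two bits and compare with binary 10000000
(de contMask? (C)
   (= (hex "80") (& C (hex "C0"))) )

# Shift variant: drop the low six bits, the remainder must be binary 10
(de contShift? (C)
   (= 2 (>> 6 C)) )

(contMask? (hex "BF"))   # 10111111 -> T
(contShift? (hex "BF"))  # -> T
(contMask? (hex "C3"))   # 11000011 -> NIL
(contShift? (hex "C3"))  # -> NIL
```

For byte-sized input both variants give the same answer; the mask variant also ignores any bits above the low eight.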

06:35 <Regenaxer> What is the danger?

06:35 <Regenaxer> Nonsense data?

06:35 <aw-> yes, not all programs accept/allow nonsense data

06:35 <Regenaxer> Pil does not do it cause it is a lot of runtime overhead ;)

06:36 <Regenaxer> But the x'es can be anything, so you still get nonsense

06:37 <Regenaxer> Data must be checked on a higher level

06:37 <Regenaxer> library and/or application

06:37 <aw-> right

06:38 <Regenaxer> Still a utfCheck function is fun

06:38 <Regenaxer> perhaps

06:38 <Regenaxer> But I would not do it in the base system

06:46 <Regenaxer> And, generally, I think it is never needed or useful

06:47 <Regenaxer> especially in pil, where there cannot be a buffer overflow if it reads 3 bytes too much

06:47 <Regenaxer> (cause there is no buffer, symbol names can be infinite in length)

07:02 _whitelogger has joined #picolisp

07:28 mtsd has joined #picolisp

07:32 rob_w has joined #picolisp

07:48 aw- has quit [Quit: Leaving.]

07:48 aw- has joined #picolisp

08:28 <Regenaxer> Maybe I'm wrong, but at the moment I cannot see any really critical issue with malformatted utf8 data

08:28 <Regenaxer> at that low level

08:29 <Regenaxer> As I said, utf8 is just a serialized representation of unicode

08:29 aw- has quit [Quit: Leaving.]

08:43 orivej has joined #picolisp

09:09 orivej has quit [Ping timeout: 272 seconds]

09:15 <beneroth> Regenaxer, I think the use case is <untrusted system> -> picolisp -> <untrusted system>

09:15 <beneroth> untrusted system = faulty or even malicious

09:16 <beneroth> maybe picolisp in the middle will not be affected, but you want the picolisp app to guarantee that it never hands something to untrusted-system-2 which might crash it, e.g. because it cannot handle high UTF-8 characters

09:17 <Regenaxer> yes, but that's application level again

09:17 <Regenaxer> check for high unicodes

09:17 <Regenaxer> not byte-level
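An application-level check of the kind suggested here might look like the following sketch. `lowOnly?` is a hypothetical name, and the limit Max is an assumption that depends on what the receiving system can handle. It works on decoded characters via the built-in `chop` and `char`, not on raw bytes.

```picolisp
# T if every character of Str has a code point of at most Max
(de lowOnly? (Str Max)
   (fully
      '((C) (>= Max (char C)))
      (chop Str) ) )

(lowOnly? "abc" 127)                       # Plain ASCII -> T
(lowOnly? (pack "a" (char 228) "b") 127)   # Contains U+00E4 -> NIL
```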

09:17 <beneroth> granted, bad example

09:18 <beneroth> what about "number of bytes" stuff ?

09:18 <Regenaxer> Mom, telephone

09:18 <beneroth> or "read X number of bytes as UTF-8 chars"

09:18 <beneroth> kk

09:19 <beneroth> I faced the "read X number of bytes as UTF-8 chars" issue myself

09:19 <Regenaxer> ht:Read iirc

09:19 <beneroth> well no, luckily in the end I didn't have to, could just handle it as "read X number of bytes, not caring about their meaning"

09:20 <beneroth> Regenaxer, T

09:21 <beneroth> now I'm out of examples/arguments once more

09:21 <Regenaxer> anyway, afp

09:21 <beneroth> hihi

09:21 <beneroth> cu later Regenaxer :)

09:21 <Regenaxer> :)

09:22 orivej has joined #picolisp

10:42 <Regenaxer> ret

11:18 mtsd has quit [Quit: Leaving]

12:08 orivej has quit [Ping timeout: 260 seconds]

12:10 rob_w has quit [Quit: Leaving]

13:15 aw- has joined #picolisp

14:25 aw- has left #picolisp [#picolisp]

14:26 Wiin has joined #picolisp

14:26 Wiin has quit [Client Quit]

14:31 <Regenaxer> Hi Adrià! Welcome! :)

15:17 Wiin has joined #picolisp

15:19 Wiin has quit [Client Quit]

17:57 casaca has quit [Remote host closed the connection]

18:13 casaca has joined #picolisp

18:25 orivej has joined #picolisp

20:48 emacsomancer has quit [Read error: Connection reset by peer]

20:49 emacsomancer has joined #picolisp

21:09 emacsomancer has quit [Read error: Connection reset by peer]

21:10 emacsomancer has joined #picolisp

21:15 emacsomancer has quit [Read error: Connection reset by peer]

21:15 emacsomancer has joined #picolisp