ChanServ changed the topic of #picolisp to: PicoLisp language | Channel Log: https://irclog.whitequark.org/picolisp/ | Check also http://www.picolisp.com for more information
Blukunfando has quit [Ping timeout: 240 seconds]
Blukunfando has joined #picolisp
<aw-> beneroth is right. I'm handling untrusted user input and i need to validate that strings are "valid" UTF-8.. sure (struct P 'S) works fine to extract it, but it doesn't tell me if the byte or byte sequence is valid utf-8.. that's why i want the list of bytes, so I can process them (as a list) and validate etc.. i may also even want to disallow a specific range of utf-8 characters within a string.. so in the end, i need the bytes not
<aw-> just the unicode value (as an integer)... at least.. that's how I see it now.
<aw-> I could be missing something too.. but when I do (struct P 'S) and receive this: "������^?�^XA�^?^?" <-- it's not good
<aw-> i can't have that
<aw-> and also vice-versa, i want to ensure that if I have a list of bytes, they convert back to **valid** UTF-8 and not just some unicode value that is technically in the range of UTF-8 (but invalid)
<aw-> I think this is a common use-case
<aw-> Regenaxer: in any case, I have some workarounds for now, but would really love it if PicoLisp provided utilities for that in the language itself
<aw-> dealing with utf-8 manually is a huge pain
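The round trip aw- describes (a list of bytes that must convert back to valid UTF-8, or be rejected) can be sketched as follows. This is an illustration in Python, not PicoLisp, and `bytes_to_valid_utf8` is a hypothetical helper name. It relies on Python's strict decoder, which rejects truncated sequences, overlong encodings and encoded surrogates.

```python
def bytes_to_valid_utf8(byte_list):
    """Return the decoded string, or None if the bytes are not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        # bytes() raises ValueError for items outside 0..255;
        # decode() raises UnicodeDecodeError (a ValueError subclass)
        # for any malformed sequence.
        return bytes(byte_list).decode("utf-8", errors="strict")
    except ValueError:
        return None
```

For example, `bytes_to_valid_utf8([0xC3, 0xA9])` yields a one-character string, while the overlong pair `[0xC0, 0x80]` is rejected.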
orivej has quit [Ping timeout: 260 seconds]
<shoshin> is there something you could shell out to to do the conversion?
<aw-> no i'm using (native) calling into a Rust library for now
<Regenaxer> Good morning aw-, beneroth
<Regenaxer> I would not convert to a list of bytes
<Regenaxer> too much runtime overhead
<Regenaxer> I would check the string directly
<Regenaxer> UTF-8 has this format: http://ix.io/2DYL
<Regenaxer> take the string pointer, and access the bytes with (byte P)
<Regenaxer> Legality check is quite straightforward
<Regenaxer> The first byte determines the length
<Regenaxer> via the number of leading 1's
<Regenaxer> the other bytes all must be 10xxxxxx
<Regenaxer> that's all you can check
<Regenaxer> the x'es can be any value
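The legality check Regenaxer outlines (the lead byte gives the length, every following byte must be 10xxxxxx) might look roughly like this. It is a sketch in Python rather than PicoLisp, and `lead_length` and `structurally_valid` are hypothetical names. It checks structure only, so sequences with well-formed framing but nonsense payload bits still pass.

```python
def lead_length(b):
    """Sequence length implied by a lead byte, or 0 if b cannot start one."""
    if (b & 0x80) == 0x00:   # 0xxxxxxx: 7-bit ASCII
        return 1
    if (b & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (b & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (b & 0xF8) == 0xF0:   # 11110xxx
        return 4
    return 0                 # 10xxxxxx (stray continuation) or 11111xxx


def structurally_valid(data):
    """True if every sequence in data has a legal lead byte and 10xxxxxx followers."""
    i = 0
    while i < len(data):
        n = lead_length(data[i])
        if n == 0 or i + n > len(data):      # bad lead byte or truncated sequence
            return False
        if any((c & 0xC0) != 0x80 for c in data[i + 1:i + n]):
            return False
        i += n
    return True
```

Note that `structurally_valid(bytes([0xC0, 0x80]))` is true: the framing is fine even though the payload is an overlong encoding, which matches the remark that the x'es can be any value.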
<aw-> hi Regenaxer
<aw-> so right shift by 6 and the value should be 2 ?
<aw-> 00000010 ?
<aw-> (for the other bytes)
<Regenaxer> no
<Regenaxer> and with 0x3F then shift according to the right position
<Regenaxer> as in various places in the pil sources
<Regenaxer> only if it is 0xxxxxxx (up to 01111111) nothing is done
<Regenaxer> 7-bit ASCII
<Regenaxer> the x'es in the following bytes are big-endian
<Regenaxer> but for a check you don't care
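The "AND with 0x3F, then shift" step is what assembles a code point from the payload bits. A minimal sketch in Python, assuming the sequence has already passed a structural check (`decode_sequence` is a hypothetical name):

```python
def decode_sequence(seq):
    """Assemble the code point from one already-checked UTF-8 sequence."""
    n = len(seq)
    if n == 1:
        return seq[0]                   # 0xxxxxxx: nothing to do
    # Payload bits of the lead byte: 5, 4 or 3 bits for lengths 2, 3, 4
    cp = seq[0] & (0xFF >> (n + 1))
    for c in seq[1:]:
        # Append the 6 payload bits of each following byte,
        # most significant group first
        cp = (cp << 6) | (c & 0x3F)
    return cp
```

For the two bytes C3 A9 this gives `(0x03 << 6) | 0x29`, which is 0xE9.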
<Regenaxer> Ah, sorry, I misunderstood
<Regenaxer> you meant 00000010 for the check?
<Regenaxer> No shift by 6
<Regenaxer> just AND with 0xC0
<aw-> right
<Regenaxer> (= (bin "10000000") (& C (hex "C0")))
<aw-> ok so i still have to do all these things manually
<Regenaxer> but shift is also good
<aw-> instead of picolisp providing that
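Both variants of the continuation-byte test discussed here, masking with 0xC0 and shifting right by 6, accept exactly the same bytes. A small Python sketch (the function names are hypothetical) that checks this over all 256 byte values:

```python
def is_continuation_mask(c):
    # Keep the top two bits; a continuation byte leaves binary 10000000 (0x80)
    return (c & 0xC0) == 0x80


def is_continuation_shift(c):
    # Drop the six payload bits; a continuation byte leaves binary 10
    return (c >> 6) == 0b10


# The two tests agree on every possible byte value
assert all(is_continuation_mask(c) == is_continuation_shift(c) for c in range(256))
```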
<Regenaxer> What is the danger?
<Regenaxer> Nonsense data?
<aw-> yes, not all programs accept/allow nonsense data
<Regenaxer> Pil does not do it cause it is a lot of runtime overhead ;)
<Regenaxer> But the x'es can be anything, so you still get nonsense
<Regenaxer> Data must be checked on a higher level
<Regenaxer> library and/or application
<aw-> right
<Regenaxer> Still a utfCheck function is fun
<Regenaxer> perhaps
<Regenaxer> But I would not do it in the base system
<Regenaxer> And, generally, I think it is never needed or useful
<Regenaxer> especially in pil, where there cannot be a buffer overflow if it reads 3 bytes too many
<Regenaxer> (cause there is no buffer, symbol names can be infinite in length)
_whitelogger has joined #picolisp
mtsd has joined #picolisp
rob_w has joined #picolisp
aw- has quit [Quit: Leaving.]
aw- has joined #picolisp
<Regenaxer> Maybe I'm wrong, but at the moment I cannot see any really critical issue with malformed utf8 data
<Regenaxer> at that low level
<Regenaxer> As I said, utf8 is just a serialized representation of unicode
aw- has quit [Quit: Leaving.]
orivej has joined #picolisp
orivej has quit [Ping timeout: 272 seconds]
<beneroth> Regenaxer, I think the use case is <untrusted system> -> picolisp -> <untrusted system>
<beneroth> untrusted system = faulty or even malicious
<beneroth> maybe picolisp in the middle will not be affected, but you want the picolisp app to guarantee that it never hands something to untrusted-system-2 which might crash it, e.g. because it cannot handle high UTF-8 characters
<Regenaxer> yes, but that's application level again
<Regenaxer> check for high unicodes
<Regenaxer> not byte-level
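An application-level check of the kind Regenaxer means works on decoded characters, not bytes. A minimal Python sketch, where `within_limit` is a hypothetical name and the default limit 0xFFFF is an arbitrary assumption standing in for whatever the downstream system can handle:

```python
def within_limit(text, max_code_point=0xFFFF):
    """True if no character in text has a code point above max_code_point.

    The default limit is an assumption for illustration only.
    """
    return all(ord(ch) <= max_code_point for ch in text)
```

With the default limit, a string containing U+1F600 is rejected, while one containing only U+20AC passes.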
<beneroth> granted, bad example
<beneroth> what about "number of bytes" stuff ?
<Regenaxer> One moment, telephone
<beneroth> or "read X number of bytes as UTF-8 chars"
<beneroth> kk
<beneroth> I faced the "read X number of bytes as UTF-8 chars" issue myself
<Regenaxer> ht:Read iirc
<beneroth> well no, luckily in the end I didn't have to, could just handle it as "read X number of bytes, not caring about their meaning"
<beneroth> Regenaxer, T
<beneroth> now I'm out of examples/arguments once more
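One way to picture the "read X number of bytes as UTF-8 chars" issue: a byte limit can fall in the middle of a multi-byte character. The Python sketch below is a hypothetical helper, not what ht:Read does, and it assumes the buffer is already valid UTF-8. It backs up to the nearest character boundary.

```python
def safe_prefix_length(data, limit):
    """Largest k <= limit such that data[:k] does not end mid-sequence.

    Assumes data is valid UTF-8; illustration only.
    """
    k = min(limit, len(data))
    # If the byte at the cut is a continuation byte (10xxxxxx), the cut
    # splits a character, so step back to that character's lead byte.
    while 0 < k < len(data) and (data[k] & 0xC0) == 0x80:
        k -= 1
    return k
```

For the five bytes 61 E2 82 AC 62, a limit of 2 or 3 returns 1, and a limit of 4 returns 4.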
<Regenaxer> anyway, afp
<beneroth> hihi
<beneroth> cu later Regenaxer :)
<Regenaxer> :)
orivej has joined #picolisp
<Regenaxer> ret
mtsd has quit [Quit: Leaving]
orivej has quit [Ping timeout: 260 seconds]
rob_w has quit [Quit: Leaving]
aw- has joined #picolisp
aw- has left #picolisp [#picolisp]
Wiin has joined #picolisp
Wiin has quit [Client Quit]
<Regenaxer> Hi Adrià! Welcome! :)
Wiin has joined #picolisp
Wiin has quit [Client Quit]
casaca has quit [Remote host closed the connection]
casaca has joined #picolisp
orivej has joined #picolisp
emacsomancer has quit [Read error: Connection reset by peer]
emacsomancer has joined #picolisp
emacsomancer has quit [Read error: Connection reset by peer]
emacsomancer has joined #picolisp
emacsomancer has quit [Read error: Connection reset by peer]
emacsomancer has joined #picolisp