<aw->
beneroth is right. I'm handling untrusted user input and I need to validate that strings are "valid" UTF-8. Sure, (struct P 'S) works fine to extract it, but it doesn't tell me whether the byte or byte sequence is valid UTF-8. That's why I want the list of bytes, so I can process them (as a list), validate them, etc. I may even want to disallow a specific range of UTF-8 characters within a string. So in the end, I need the bytes, not just the Unicode value (as an integer). At least, that's how I see it now.
<aw->
I could be missing something too.. but when I do (struct P 'S) and receive this: "������^?�^XA�^?^?" <-- it's not good
<aw->
i can't have that
<aw->
and also vice-versa, i want to ensure that if I have a list of bytes, they convert back to **valid** UTF-8 and not just some unicode value that is technically in the range of UTF-8 (but invalid)
<aw->
I think this is a common use-case
<aw->
Regenaxer: in any case, I have some workarounds for now, but would really love it if PicoLisp provided utilities for that in the language itself
<aw->
dealing with utf-8 manually is a huge pain
orivej has quit [Ping timeout: 260 seconds]
<shoshin>
is there something you could shell out to to do the conversion?
<aw->
no i'm using (native) calling into a Rust library for now
<Regenaxer>
Good morning aw-, beneroth
<Regenaxer>
I would not convert to a list of bytes
<Regenaxer>
take the string pointer, and access the bytes with (byte P)
<Regenaxer>
Legality check is quite straightforward
<Regenaxer>
The first byte determines the length
<Regenaxer>
via the number of leading 1's
<Regenaxer>
the other bytes all must be 10xxxxxx
<Regenaxer>
that's all you can check
<Regenaxer>
the x'es can be any value
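A minimal Python sketch of the structural check described above: the lead byte gives the sequence length through its leading 1 bits, and every following byte must match 10xxxxxx. The names `sequence_length` and `is_well_formed` are hypothetical and not part of PicoLisp or any library. As noted, this is all a byte-level check can verify, so it deliberately ignores the payload bits.

```python
def sequence_length(lead: int) -> int:
    """Total sequence length implied by a lead byte, or 0 if it cannot start a sequence."""
    if (lead & 0x80) == 0x00:   # 0xxxxxxx: 7-bit ASCII
        return 1
    if (lead & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (lead & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (lead & 0xF8) == 0xF0:   # 11110xxx
        return 4
    return 0                    # 10xxxxxx (stray continuation) or 11111xxx


def is_well_formed(data: bytes) -> bool:
    """Structural check only: lead-byte lengths and 10xxxxxx continuation bytes."""
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0 or i + n > len(data):      # bad lead byte or truncated sequence
            return False
        if any((b & 0xC0) != 0x80 for b in data[i + 1:i + n]):
            return False
        i += n
    return True
```

For example, `is_well_formed(b"\xe2\x82")` is false because the three-byte sequence is truncated, while the overlong pair `b"\xc0\x80"` still passes, since its payload bits are never inspected.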
<aw->
hi Regenaxer
<aw->
so right shift by 6 and the value should be 2 ?
<aw->
00000010 ?
<aw->
(for the other bytes)
<Regenaxer>
no
<Regenaxer>
and with 0x3F then shift according to the right position
<Regenaxer>
as in various places in the pil sources
<Regenaxer>
only if it is 0xxxxxxx (up to 01111111), nothing is done
<Regenaxer>
7-bit ASCII
<Regenaxer>
the x'es in the following bytes are big-endian
<Regenaxer>
but for a check you don't care
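To illustrate the "AND with 0x3F, then shift" step, here is a hedged Python sketch that assembles a code point from one UTF-8 sequence. `decode_sequence` is a hypothetical helper. It assumes the caller passes exactly one structurally valid sequence of 1 to 4 bytes and does no validation of its own.

```python
def decode_sequence(seq: bytes) -> int:
    """Combine the payload bits of one UTF-8 sequence into a code point.

    Assumes seq is exactly one structurally valid sequence (1 to 4 bytes).
    """
    lead = seq[0]
    if lead < 0x80:                       # 0xxxxxxx: 7-bit ASCII, nothing to do
        return lead
    # Payload mask of the lead byte: 110xxxxx, 1110xxxx, 11110xxx
    lead_mask = {2: 0x1F, 3: 0x0F, 4: 0x07}[len(seq)]
    code_point = lead & lead_mask
    for b in seq[1:]:
        # Each continuation byte contributes its low 6 bits, most significant first
        code_point = (code_point << 6) | (b & 0x3F)
    return code_point
```

Shifting the accumulator left by 6 before each OR is what makes the payload bits big-endian: the first continuation byte ends up in the higher bit positions.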
<Regenaxer>
Ah, sorry, I misunderstood
<Regenaxer>
you meant 00000010 for the check?
<Regenaxer>
No shift by 6
<Regenaxer>
just AND with 0xC0
<aw->
right
<Regenaxer>
(= (hex "80") (& C (hex "C0")))
<aw->
ok so i still have to do all these things manually
<Regenaxer>
but shift is also good
<aw->
instead of picolisp providing that
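The two ways of testing a continuation byte mentioned above (mask the top two bits, or shift right by 6) can be compared in a short Python sketch. Both function names are hypothetical. The shift variant assumes the value is already a single byte in the range 0 to 255.

```python
def is_continuation_mask(b: int) -> bool:
    """Keep the top two bits and compare with binary 10000000 (0x80)."""
    return (b & 0xC0) == 0x80


def is_continuation_shift(b: int) -> bool:
    """Shift the top two bits down and compare with binary 10 (2)."""
    return (b >> 6) == 0b10


# The two tests agree for every possible byte value.
agree = all(is_continuation_mask(b) == is_continuation_shift(b) for b in range(256))
```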
<Regenaxer>
What is the danger?
<Regenaxer>
Nonsense data?
<aw->
yes, not all programs accept/allow nonsense data
<Regenaxer>
Pil does not do it cause it is a lot of runtime overhead ;)
<Regenaxer>
But the x'es can be anything, so you still get nonsense
<Regenaxer>
Data must be checked on a higher level
<Regenaxer>
library and/or application
<aw->
right
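As an illustration of checking "on a higher level", the following Python sketch relies on Python's built-in strict UTF-8 decoder rather than a hand-written bit test. `strictly_valid` is a hypothetical name. The point is only that a library-level decoder also rejects sequences whose bit pattern is fine but whose payload is not, such as overlong forms, surrogates, and values above U+10FFFF.

```python
def strictly_valid(data: bytes) -> bool:
    """True if data decodes as UTF-8 under Python's default strict error handling."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Structurally plausible byte patterns that a strict decoder still rejects:
overlong = b"\xc0\x80"             # overlong encoding of U+0000
surrogate = b"\xed\xa0\x80"        # encodes the surrogate U+D800
too_large = b"\xf4\x90\x80\x80"    # would be above U+10FFFF
```

The same decoder covers the reverse direction too: bytes that decode strictly and are re-encoded with `str.encode("utf-8")` yield valid UTF-8 again.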
<Regenaxer>
Still a utfCheck function is fun
<Regenaxer>
perhaps
<Regenaxer>
But I would not do it in the base system
<Regenaxer>
And, generally, I think it is never needed or useful
<Regenaxer>
especially in pil, where there cannot be a buffer overflow if it reads 3 bytes too much
<Regenaxer>
(cause there is no buffer, symbol names can be infinite in length)
_whitelogger has joined #picolisp
mtsd has joined #picolisp
rob_w has joined #picolisp
aw- has quit [Quit: Leaving.]
aw- has joined #picolisp
<Regenaxer>
Maybe I'm wrong, but at the moment I cannot see any really critical issue with malformed utf8 data
<Regenaxer>
at that low level
<Regenaxer>
As I said, utf8 is just a serialized representation of unicode
aw- has quit [Quit: Leaving.]
orivej has joined #picolisp
orivej has quit [Ping timeout: 272 seconds]
<beneroth>
Regenaxer, I think the use case is <untrusted system> -> picolisp -> <untrusted system>
<beneroth>
untrusted system = faulty or even malicious
<beneroth>
maybe picolisp in the middle will not be affected, but you want the picolisp app to guarantee that it never hands something to untrusted-system-2 which might crash it, e.g. because it cannot handle high UTF-8 characters
<Regenaxer>
yes, but that's application level again
<Regenaxer>
check for high unicodes
<Regenaxer>
not byte-level
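A sketch of such an application-level check in Python, working on decoded code points rather than bytes. The function name `within_policy`, the default limit of U+FFFF, and the banned Private Use Area range are all illustrative assumptions. A real application would choose its own limits based on what the downstream system can handle.

```python
def within_policy(text: str,
                  max_code_point: int = 0xFFFF,
                  banned_ranges=((0xE000, 0xF8FF),)) -> bool:
    """True if every character is at or below max_code_point and outside all banned ranges.

    The defaults (BMP only, no Private Use Area) are example values, not a recommendation.
    """
    for ch in text:
        cp = ord(ch)
        if cp > max_code_point:
            return False
        if any(lo <= cp <= hi for lo, hi in banned_ranges):
            return False
    return True
```

With these defaults, `within_policy("hello €")` is true, while a string containing U+1F600 is rejected because it lies above the Basic Multilingual Plane.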
<beneroth>
granted, bad example
<beneroth>
what about "number of bytes" stuff ?
<Regenaxer>
One moment, telephone
<beneroth>
or "read X number of bytes as UTF-8 chars"
<beneroth>
kk
<beneroth>
I faced the "read X number of bytes as UTF-8 chars" issue myself
<Regenaxer>
ht:Read iirc
<beneroth>
well no, luckily in the end I didn't have to, could just handle it as "read X number of bytes, not caring about their meaning"
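The "read X number of bytes as UTF-8 chars" problem comes down to a multi-byte character possibly straddling the byte limit. A hedged Python sketch using the standard library's incremental decoder shows one way to handle that. `read_text` is a hypothetical helper, and the in-memory `io.BytesIO` stream stands in for whatever real input source is used.

```python
import codecs
import io


def read_text(stream, nbytes: int):
    """Read up to nbytes bytes and return (decoded text, undecoded trailing bytes).

    A sequence cut off by the byte limit is held back as pending bytes instead of
    raising; invalid bytes still raise UnicodeDecodeError (strict handling).
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    chunk = stream.read(nbytes)
    text = decoder.decode(chunk, final=False)
    pending, _ = decoder.getstate()     # bytes buffered while waiting for the rest
    return text, pending


# "a€b" is 61 E2 82 AC 62; a 3-byte limit cuts the euro sign in the middle.
text, pending = read_text(io.BytesIO("a€b".encode("utf-8")), 3)
```

Here `text` is `"a"` and `pending` holds the two bytes of the incomplete euro sign, which the caller can prepend to the next read.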
<beneroth>
Regenaxer, T
<beneroth>
now I'm out of examples/arguments once more
<Regenaxer>
anyway, afp
<beneroth>
hihi
<beneroth>
cu later Regenaxer :)
<Regenaxer>
:)
orivej has joined #picolisp
<Regenaxer>
ret
mtsd has quit [Quit: Leaving]
orivej has quit [Ping timeout: 260 seconds]
rob_w has quit [Quit: Leaving]
aw- has joined #picolisp
aw- has left #picolisp [#picolisp]
Wiin has joined #picolisp
Wiin has quit [Client Quit]
<Regenaxer>
Hi Adrià! Welcome! :)
Wiin has joined #picolisp
Wiin has quit [Client Quit]
casaca has quit [Remote host closed the connection]
casaca has joined #picolisp
orivej has joined #picolisp
emacsomancer has quit [Read error: Connection reset by peer]
emacsomancer has joined #picolisp
emacsomancer has quit [Read error: Connection reset by peer]
emacsomancer has joined #picolisp
emacsomancer has quit [Read error: Connection reset by peer]