mark4 changed the topic of #forth to: Forth Programming | do drop >in | logged by clog at http://bit.ly/91toWN backup at http://forthworks.com/forth/irc-logs/ | If you have two (or more) stacks and speak RPN then you're welcome here! | https://github.com/mark4th
f-a has left #forth [#forth]
<inode> or just tell it what you want to see
<inode> set $pc = arbitrary-address
<inode> stepi
<inode> x/[CELL-COUNT-HERE]wx $sp
<inode> etc. :)
<remexre> yeah, I ended up writing a bunch of python exts to gdb
<remexre> so I can do the equiv of .S, SEE, etc from within it
Zarutian_HTC has quit [Read error: Connection reset by peer]
Zarutian_HTC has joined #forth
lispmacs[work] has quit [Ping timeout: 264 seconds]
<mark4> gdb does not like it when your code lays down code and you try to debug it
<mark4> it also wont show you whats on the stack unless it sees a stack frame
<mark4> and the interface is FUCKING HORRIBLE
<mark4> lol
<mark4> btw, did i tell you how much i loathe and despise all gnu development tools? :)
<MrMobius> weird
<MrMobius> i wonder if anyone uses qemu or similar
jimt[m] has quit [Ping timeout: 258 seconds]
lispmacs[work] has joined #forth
boru` has joined #forth
boru has quit [Disconnected by services]
boru` is now known as boru
jimt[m] has joined #forth
joe9 has quit [Read error: Connection reset by peer]
joe9 has joined #forth
<remexre> I'm using qemu, yeah
<remexre> I haven't had those problems with gdb, but I'm working on aarch64, which has enough free registers that I do leave a valid C stack in the stack pointer (though it's only a kilobyte or so)
<remexre> gdb can connect to qemu easily with `target remote', and since I don't have a JTAG, this is invaluable for debugging
<remexre> agreed on the interface being horrible though... it's like, emacs-keybind-inspired, and it's not hard to get the TUI to segfault...
<mark4> lol
tabemann has quit [Ping timeout: 276 seconds]
travisb has joined #forth
cartwright has quit [Remote host closed the connection]
cartwright has joined #forth
travisb is now known as tabemann
mark4 has quit [Ping timeout: 245 seconds]
<joe9> is there a way to do the stack trace when forth fails with an invalid write to address?
<joe9> I presume there must be a way to walk up the return stack
<joe9> but, am not sure if there is a better approach?
sts-q has quit [Ping timeout: 260 seconds]
sts-q has joined #forth
gravicappa has joined #forth
_whitelogger has joined #forth
Zarutian_HTC1 has joined #forth
Zarutian_HTC has quit [Read error: Connection reset by peer]
jess has quit [Quit: K-Lined]
<inode> joe9: which forth?
jess has joined #forth
scoofy has quit [Ping timeout: 276 seconds]
scoofy has joined #forth
<joe9> inode, felix forth
<inode> i guess you'd have to first implement a means of calling signal(2)/sigaction(2) to register a handler for SIGSEGV?
<joe9> inode, ok, thanks.
<inode> at least i didn't see anything at all for handling signals that would be generated by illegal memory access when skimming through that ff repo?
<joe9> yes, there is none. you are correct. it is a simple code base, easier to understand.
Croran has quit [Ping timeout: 240 seconds]
Croran has joined #forth
hosewiejacke has joined #forth
<joe9> ?dup dup if the top of stack is not zero?
xek has joined #forth
Zarutian_HTC1 has quit [Remote host closed the connection]
<proteusguy> correct
<joe9> thanks
f-a has joined #forth
hosewiejacke has quit [Ping timeout: 245 seconds]
hosewiejacke has joined #forth
elioat has joined #forth
hosewiejacke has quit [Remote host closed the connection]
hosewiejacke has joined #forth
mark4 has joined #forth
<veltas> gdb is very good.... at C. At Forth, a good Forth has its own debugger, but you can still use it, just with a bit of pain. I was using it to debug mark4's x64 code, better than nothing.
<f-a> uhhh does gforth have a debugger
* f-a checks
<mark4> actually traditionally a good forth needs no debugger in the traditional sense. you dont debug forth by single stepping it normally
<mark4> but....
<mark4> x4 used to have a fully working one years ago but things changed and i never kept up with it
<mark4> its also on the todo list to get working again
<crc> the only part of my debugger that I actually use is the disassembler
<f-a> well
<f-a> say you have a problem with the logic of your program
<f-a> or some function leaving crap on the stack etc
<f-a> where do you start from then, to debug?
<f-a> I usually sprinkle ." stuff" in my words but that seems not efficient
<crc> I normally spend a rather long time thinking through the logic before I start coding for anything that's not trivial
<f-a> it is a good attitude, shit happens regardless :P
<f-a> imagine you are looking at a piece of code you did *not* write yourself
<crc> That's harder
<crc> I print out and review the code, mentally walking through as much as possible
<crc> If it has issues, I can run it under the single stepper or with execution tracing, but that's the extent of my debugging tools
<crc> Most of my debugging work is done on paper or in my head
f-a has quit [Remote host closed the connection]
<mark4> depth . in various places helps too
<mark4> crc if its so complificated i cant just write it off the top of my head i like to code it on paper first
<mark4> and i find more bugs just scanning slowly through my code or taking the time to properly comment them
<mark4> if you have thought about your code long enough to explain it to someone else you have thought about it long enough to code it
<mark4> tho in practice thats an iterative process :)
<mark4> write a primitive. test THAT primitive
<mark4> write another primitive and immediately test that one too
<mark4> use your already tested primitives to create higher level definitions and then test them as soon as you write them
<mark4> : foo .... ; 1 2 3 foo
<mark4> : bar .... ; 1 2 3 bar
<mark4> : blah foo blah ; 1 2 3 blah
<mark4> : blah foo bar ; 1 2 3 blah i mean
<mark4> thats the theory of how to develop forth properly... dont ask me if thats what i always do or not lol
<mark4> shhh
<mark4> modern theory calls that TDD
<mark4> forth calls it the status quo, the norm, just what we do (tm)
Zarutian_HTC has joined #forth
<veltas> I don't know what most people consider a 'debugger' but being able to step through code or print some kind of trace would suffice, gforth can do both of those and any forth can or could with a small amount of work
<veltas> A forth is flexible enough that its interactive mode is more useful for debugging than what you get with some other langs
<veltas> My forth doesn't have any debug features other than .S so far, and I still find myself e.g. using ' to locate addresses etc easily while debugging
KipIngram has quit [Ping timeout: 276 seconds]
<mark4> anyone here know how to compile a utf8 string in C because ##c just gives you a circle jerk
<mark4> u8"blah" is wrong
<mark4> L"blah" gives utf32 not utf8
<MrMobius> mark4, I havent tried but could you use godbolt to test?
<remexre> by default (gcc, linux, utf8 locale) strings should be utf8, no?
<mark4> i know u8"blah" is wrong because when i use it there is NO string in there that i can find
<mark4> i dont want BLAH" encoded as 'B' 'L' 'A' 'H' those are the ascii characters
<remexre> ascii is a subset of utf-8?
<mark4> i need to be able to encode strings containing ANY characters
<MrMobius> wouldnt BLAH" be encoded as 'B' 'L' 'A' 'H' but then anything outside of what normally fits would get a preceding escape byte?
<MrMobius> ie isnt utf8 identical until it has to encode extended characters?
<mark4> thats how you decode the codepoints
<mark4> i need to be able to store EVERY CHARACTER as a specific width codepoint
<remexre> don't you want utf-32 then, assuming character=unicode scalar value?
<mark4> no
<mark4> i actually dont want utf-32
<joe9> I am slowly working through this code and trying to figure out what it does as it is leaving 4 values on the stack and I cannot figure out why. It is doing this before the interpret routine call. So, I figure it is reading some input and figuring out if the input should be accepted or not.
<mark4> LITERALLY all i need right now is to do u8"The cow jumped over the moon" and printf it and then look in the binary for how that string got encoded
<mark4> guess what. i can printf it
<mark4> guess what... i cant see that string anywhere in the binary
<joe9> I understand that it would be hard to figure out from this piece. I was not sure if this is a common task across forths.
<mark4> its NOT being encoded as 'T', 'h', 'e', ' ', .. . . .
<mark4> i did not port it down to x4 but if you go into x64 you can literally do $2501 emit
<joe9> mark4, could it be 16 bit runes?
<mark4> $2501 emit ━ ok
<mark4> wtf is a rune?
<mark4> NOBODY talks about runes
<mark4> $2563 emit ╣ ok
<mark4> they talk about characters
<mark4> they talk about code points
<inode> instead of whinging about it, why don't you poke around in a debugger to find the string? :)
KipIngram has joined #forth
<mark4> no
KipIngram is now known as Guest79368
<mark4> im trying to get a string "xxxxxxxx" where all of those chars are encoded as their utf-8 codepoint
<mark4> no matter WHAT characters they are
<remexre> characters don't have a single codepoint in utf-8, they're a variable-width sequence of bytes
<mark4> the fucking string is not in there! i have looked with mc i have also looked with ida-pro
<remexre> what sequence of bytes are you expecting to be there
<remexre> for the case of puts("\u2563");
<mark4> u8"The cow jumped over the moon" <-- the ones i specified in the u8 string
<mark4> what is puts ?
<mark4> does not sound like c ? :)
<inode> it is
<inode> print a string to stdout
<mark4> like printf but without formatting ok
<mark4> i never actually encountered puts ever
<remexre> can you upload your binary somewhere?
<remexre> (or the .o file, or .S file)
<mark4> actually i got it now. i dont understand why adding puts() of the string suddenly makes it visible in the code but it did
<mark4> and i was compiling with -O0 so the unused string should not have been purged from the binary
<remexre> possibly the varargs calling convention for your platform makes something funky? dunno
<mark4> i was not using printf, i was not doing anything with the string till i added the puts
<mark4> hang on
<mark4> i just added the puts
<mark4> and while i can see the string in a binary dump. ida-pro still seems to be having issues. i cant see it anywhere in there
<mark4> it trying to disassemble the string as opcodes?
<Zarutian_HTC> counted or null byte terminated?
<remexre> I suspect that's because of the array
<remexre> you're specifying that it goes into a *mutable, stack-allocated* array
<remexre> you probably want const char* foo = "asdfasdf";
<remexre> to make foo a pointer to a .rodata-allocated constant string
<remexre> so what's happening is, you're stack-allocing the array you're putting the string into
<remexre> then moving constant chunks of it in
<mark4> let me try that :)
<mark4> aTheCowJumpedOv db 'The cow jumped over the moon',0
<mark4> it literally didnt help me lol
<mark4> i need a chinese person to enter a chinese string in my foo.c :)
<mark4> and give it back to me
<remexre> asdf̈u :P
<remexre> oh wack my irc client displays that wrong...
<mark4> lol
<mark4> i saw adsfu was that supposed to be chinese for piss off? :)
<remexre> nah, asdf + combining diaeresis + u
<mark4> lol
<mark4> o the F seems to have dots over it unless thats my eyes going fuzzy
<remexre> yeah
<mark4> i REALLY REALLY do NOT want to have to include some 400gig utf8 string library in my 40k binary
<mark4> im not actually going to be puts'ing strings. i need to implement a puts/printf like function that prints them into one of my TUI windows
<mark4> and my windows do not use variable width characters
<mark4> its fine for the strings to be variable width, i just need to know how to extract each character from those strings one at a time and to place them into my windows at the current cursor location
<remexre> mmmmmmm
<remexre> so
<remexre> full-width characters
<mark4> well. they dont need to be and probably should not be in the sources
<mark4> but they need to be when they are emitted to the window
<mark4> what i should do is commit my code as it now stands and then mark the github repo as no longer private
<mark4> but i might want to sell this :)
<remexre> I think you probably need the annoying unicode tables for any sort of "figure out how many character cells wide this string is"
<mark4> just got a call from the "warranty center" lol
Guest79368 is now known as KipIngram
<mark4> i didnt hang up i just said "hey banchod does your mother know that you scam people?" lol
<mark4> he hung up
<mark4> i know how to decode the codepoints and decompose them to their character sequences
<mark4> the first byte tells me how many bytes are in the codepoint
<mark4> but. erm. i need to be able to compile strings AS CODEPOINTS!!!!
<mark4> i dont know if u8"blah" does that properly or not
<remexre> so that's outside the C spec, but what gcc will do is
<remexre> if you're referring to them by address (i.e. not as an array initializer), outside of constant folding and other optimizations, it'll put strings in the rodata section
<remexre> though not every string in the source will end up as a distinct string in rodata
<remexre> e.g. ("foo" == "foo") may or may not be true
<mark4> really what i need to be able to do is let someone using my code do
<mark4> win_puts("any string in any language");
Zarutian_HTC has quit [Remote host closed the connection]
<mark4> i.e. i need to be able to implement that function
<remexre> also regardless, the length of a unicode scalar value in code points isn't sufficient to determine its length on a terminal
<mark4> so.. that function needs to be able to parse the given string
<mark4> the string is not written to the "window" as a sequence of "characters"
<mark4> its written as an array of codepoints
<remexre> not every code point is one character cell wide
<remexre> e.g. U+0308
<mark4> i.e. given a cell of the window containing $2500 it will decompose that cell into those characters at display time
<mark4> i know that u8"abcd" will be compiled identical to "abcd"
<mark4> i can handle straight ascii.
<mark4> my win_puts() or win_printf() functions need to be able to read the next item from the string and place that one item in a given cell of the window array
<mark4> i.e. i need to be able to get string[x] for any index of x
<mark4> for ANY possible string in any language
<mark4> that part is trivial
<remexre> what is x measured in? scalar values? code points? character cells? things-a-human-considers a character?
<mark4> in codepoints
<mark4> is what i want
<remexre> you need utf-32
<mark4> nope
<remexre> or a second array of where things start
<remexre> utf-8 is inherently variable-width
<mark4> im not handling the string as an array in that way... ill be parsing through it from the beginning, extracting each 8 bit byte out of it till i have exactly one code point
<mark4> THAT i can do
<mark4> what i need to know is how i can specify a string in C that is encoded as CODEPOINTS!!!!
<mark4> yes. i understand im not looking for the X'th character that was not an exact example
<mark4> i cant do x++ to get the next character
<mark4> i need to parse forward of the current index till i get to the next index
<mark4> i understand that
<mark4> what i do NOT understand is how to specify a string in C so that it is encoded as a stream of utf8 codepoints
<mark4> not utf16 not utf32. utf8
<mark4> SPECIFICALLY a stream of utf8 codepoints
f-a has joined #forth
<mark4> and i do not know that u8"blah blah" does that
<nihilazo> it's kinda a shame that utf-8 is relatively difficult to handle because it leads to so many english-speaking developers building things that don't support international text
<mark4> yes and thats what im trying to accomplish
<nihilazo> but idk either in C, I've mostly worked in languages like go where you're lucky enough to have rune[]
<nihilazo> sorry
<remexre> I'm 99.9% sure that unless you're using a bizzaro compiler, it's doing exactly that, as long as you follow the rules I stated above
<remexre> > if you're referring to them by address (i.e. not as an array initializer), outside of constant folding and other optimizations, it'll put strings in the rodata section
<mark4> i KNOW my existing code can display any character in any language just given its CODE POINT of what ever length it is
<mark4> if a string of utf8 characters is encoded as xx xx yy zz zz zz aa aa bb cc cc then i need to be able to parse the X char, the Y char, the Z char the A char and the B and C chars
<mark4> i can do that!!!!!!!!!!!!1
<mark4> thats FUCKING TRIVIAL!!!!!!!!!!!
<mark4> theres no rocket surgery involved there
<mark4> how the FUCK do i specify that string in C
<mark4> char foo= "abc" does not do that
<remexre> char* foo = "abc";
<mark4> i dont know if char foo = u8"abc" does that
<mark4> what if my abc string is not 'a' 'b' 'c'
<mark4> what if its a chinese word
<mark4> or japanese
<mark4> or korean
<mark4> or indonesian
<mark4> or .. .. . .
<mark4> i need to be able to compile strings in ANY LANGUAGE!!!
<remexre> either this works or your compiler is ISO-incompliant (or doesn't support utf8)
<mark4> as utf8 codepoints
<mark4> oh
<remexre> when you specify char f[];, you're not requesting that the string be in .rodata though
<mark4> show me an example of compiling a string as a stream of utf8 codepoints
<remexre> that was the previous problem
<mark4> and PROVE thats what it does?
<mark4> i dont give a fuck where its compiled to
<mark4> as long as i can access it
<mark4> u8"chinese sentence here" <-- does this compile that chinese sentence as a stream of utf8 codepoints?
<mark4> thats ALL i care about right now
<mark4> the parsing of that string is MY problem and i already know how to handle that
dave0 has quit [Quit: dave's not here]
<mark4> as a stream of variable width utf-8 codepoints of course
<inode> what's the widest utf-8 codepoint?
<mark4> not as 000x 00xx 0xxx xxxx values but as xxxxxxxxx and all mashed up together as a stream of codepoints that need to be handled
<mark4> well 32 bits
<mark4> the highest utf8 codepoint is 0x10FFFF
<mark4> remexre: you didnt even specify that those were utf8 characters. you can do that?
<remexre> yeah
<mark4> no need for the bullshit u8"xxxxx" visual clutter?
<mark4> !!!!!!!!!!!!!!
<remexre> i'm using a utf-8 locale with gcc
<remexre> which is like, eminently reasonable
<remexre> if you're using some 80s POS compiler with a shift-jis locale, that's when u8"" is useful
<remexre> because maybe the compiler is dumb
<mark4> err how do you "use a utf-8 locale with c" ?
<mark4> well the compiler in this case is the most recent gcc
<remexre> like my system locale is a utf-8 locale
<mark4> compiled with the c17 standard
<mark4> oooh ok
<mark4> so. really i dont need to worry i can just implement win_puts(win, "blah");
<remexre> you have to worry about combining characters and full-width characters
<remexre> if you're doing a TUI
<remexre> but if you're not, yeah
<mark4> the first byte in your code is a 0xEn byte
<mark4> that tells me how many 8 bit bytes there are in that codepoint
<mark4> e4 bd ad
<remexre> right, but not how many spaces on a screen the character occupies
<mark4> e5 9b bd
<mark4> look i can even do it in my head :)
<remexre> e.g. both of those characters are two character cells wide
<remexre> and U+0308 is "zero"
<remexre> in that it modifies the previous character instead of occupying its own character cell
<mark4> hang on give me a sec
<mark4> ok yea my emit in my forth must be expecting 16 bit codepoints only thats a bug
<mark4> i can fix that
<mark4> no actually im not sure whats going on there.
<mark4> let me see if my C code can emit those chinese chars correctly
<mark4> ooooh i see a problem lol
<mark4> erm. ok so... how do i tell how many cells each char takes?
hosewiejacke has left #forth ["Leaving"]
<remexre> that's where you need a bloated table, sadly
<mark4> i am 99.99% sure my c code will write the correct sequence of bytes to display those chars but... while those chars take up one cell of the window array
<mark4> they take up 2 cells of the display space
<remexre> I think, 90% of the time you can tell whether a character is wider than one char-cell from its block
<mark4> ooooh! lol nope
<remexre> and afaik you can always tell whether a char is a combining char by block
<mark4> err yea no scratch that idea
<mark4> i had the idea of tracking the actual cursor location on the display to see how many cells had been used by each character
<mark4> thats kind of too late lol
<remexre> if I said you need a dozen ranges and to check if characters are within those ranges, would that be better
<mark4> do combining chars always display correctly?
<remexre> like are there any characters it's illegal to combine with?
<mark4> for example, with the same font in xterm as i use in gnome terminal my box charsets do not display correctly
<mark4> for example
<remexre> I certainly agree that lots of software does this wrong :P
<mark4> the top line of a window border displays as ━━━━━
<mark4> in gnome terminal
<mark4> but in xterm it displays as ━ ━ ━ ━ ━
<mark4> with very tiny gaps between
<mark4> same font
<mark4> just being rendered differently in different terminals
gravicappa has quit [Ping timeout: 245 seconds]
<mark4> im assuming combining means that one "character" like 'x' in some language might display as 2 physical characters on the display like 'xx'
<mark4> is that what that means?
<remexre> other way around
<mark4> your c code has two chinese characters in it
gravicappa has joined #forth
<mark4> those would be displayed in 2 cells
<remexre> combining char = two unicode scalar values form one character, that fits in one cell
<remexre> full-width = cjk characters that require 2 char cells
<mark4> so your example C does not have TWO chinese characters in it but... just one?
<mark4> is it stored as a single codepoint in the compiled string?
<remexre> no, it has two
<remexre> U+0308 is the combining-character example
<mark4> ok then it has two characters and those two characters will be displayed in adjacent cells on the display as single characters ?
<mark4> im lost lol
<mark4> $0308 emit ̈ ok
<mark4> oooh i get it
<mark4> its like you can have an A with dots over it
<remexre> yeah
<mark4> or you can display the A and then display the dots over it later!
<remexre> yep
<mark4> ok. show me a string in c that uses an A with dots over it but specified with combining characters
<mark4> and... im not sure how you can do that in a text mode anyway
f-a has left #forth [#forth]
<mark4> so i dont think its an issue for me
<mark4> in a graphical mode you can merge the two before rendering or render one then the other in the same place
<remexre> ok it's possible my term is fucked
<remexre> oh wait
<remexre> this is a bitmap font
<remexre> sec
<remexre> there we go
<mark4> hang on
<mark4> 'A' emit $0308 emit Ä ok
<mark4> it works :P
<mark4> the terminal handles it
<mark4> it KNOWS that $308 is a combining char and does it for me
<mark4> however
<mark4> :)
<mark4> drat
<mark4> i cant store [0000:000a][0000:0308] in consecutive cells of my window array
<mark4> because those are ONE character
<remexre> yeah, and there have been nasty terminal bugs about this in the past...
<mark4> i literally need to be able to take the string containing the combined chars and COMBINE them in some way and store the combined data in my window array
<remexre> that's normalization
<mark4> and then when i go to actually output those to the console i need to separate them again and output them individually
<remexre> again requires big tables
<remexre> and doesn't always remove all combining characters
<mark4> yea. maybe if i just say screw combining characters! lol
<mark4> and be broken like everyone else :/.
<remexre> yeah... I gave up on TUI instead, and am planning GUI-over-serial-line and "boring" CLI only...
<mark4> i hate quitting lol
<mark4> what i can do is make every cell 64 bits!!! lol
<mark4> cuz everyone has 287456923465 gigs of ram
<mark4> i hate this idea tho
<remexre> you can have multiple combining chars on a single char
<remexre> tho idk if any normal human languages use this
<remexre> but e.g. your browser supports it (see zalgo text)
<mark4> lol
<mark4> i could simplify and say "this supports english utf-8 only" lol
<mark4> thus obliterating the need for utf-8 in the first place lol
<joe9> I added comments to this code. I am still debugging to get it working. Just want to check if my comments make sense. http://ix.io/2RNV
<remexre> well, there are americans who have non-ascii chars in their names
<mark4> so i ran ida pro earlier and i just got an email from them saying ida tells us you are out of date, here click this link for an update :)
<mark4> joe9: not following, if the address is false jump to that address
<mark4> ?
<mark4> you are jumping to the flag not the address.. shouldnt it be ( addr f --- ) ?
<mark4> oooh nvm the address is pointed to by esi. my bad
<mark4> yea the code and comments look good
<mark4> if you kept top of stack in ebx instead of eax you could do lodsd instead
<mark4> so next would be lodsd followed by jmp eax
xek has quit [Quit: Leaving]
<joe9> mark4, thanks.
<joe9> This macro is named (if)
<joe9> I am not sure if there is a convention on when to put () for names
<mark4> usually (if) is a primitive for if
<mark4> if might be an immediate word that compiles (if)
<mark4> the parens are valid here
<joe9> ok, thanks.
Zarutian_HTC has joined #forth
Zarutian_HTC has quit [Remote host closed the connection]
inode has quit [Quit: ]
Zarutian_HTC has joined #forth
<mark4> ok so i just took a look back at your chinese utf8 strings. its compiling the characters not the codepoints
<mark4> for example € is whats displayed. e2 82 ac is what is output to display it, 20ac is the codepoint
<mark4> your "chinese chars" is being compiled AS CHARACTERS not as codepoints
<mark4> no good to me :/
<mark4> $e4 (emit) $b8 (emit) $ad (emit) 中 ok
<mark4> (emit) writes those characters directly to stdout
<mark4> so im back to the original problem
<mark4> how do i compile an array of CODEPOINTS not a stream of characters
<patrickg> e2 82 ac is the codepoint, just in utf-8 encoding while you're probably looking for some other
<mark4> no its not the codepoint, its the utf8 character
<mark4> hang on ill give you a non chinese example
<mark4> ━ is the character
<mark4> 2501 is the codepoint
<mark4> ━ is the character as displayed
<mark4> i mean
<mark4> but the bytes that are output to display that character are different. hang on i need to write code to get it lol
<patrickg> there's no "utf8 character". utf8 is an encoding to map larger numbers onto octets with a few constraints (0..127 map to themselves, it's self synchronizing, there are no 0 bytes except _actual_ NUL)
gravicappa has quit [Ping timeout: 256 seconds]
<mark4> the character in this case is 81 94 e2
<mark4> the codepoint is 2501
<mark4> what is displayed is ━
<mark4> to display it you write the 81 94 e2 to the terminal
<mark4> but those three bytes are THE CHARACTER not the codepoint
<patrickg> 81 94 e2 is backwards - it's not a valid utf8 sequence
<patrickg> nicer to calculate that stuff on a stack though :-)
<patrickg> 11100010 10010100 10000001 - first byte starts with 1110 = 3 bytes encoding. first byte gives "0010", second byte gives (stripping leading 10) "010100", third byte gives "000001" = 0010010100000001
<patrickg> 2 base ! 0010010100000001 hex . 2501 ok
<patrickg> so yes, that's precisely the code point you're looking for
elioat has quit [Quit: elioat]
<mark4> oh yea my bad
<mark4> so after another long discussion in a different channel the consensus is that i need to accept "some string in some language" is going to be compiled as utf8 characters and at run time convert that string to utf8 codepoints :/
f-a has joined #forth
f-a has quit [Remote host closed the connection]
<mark4> and i have encode and decode backwards
<mark4> encoding is going from codepoint to byte sequence and decoding is going from byte sequence to codepoint.
<mark4> that sounds horribly backwards to me
<patrickg> you can encode everything in UTF32/UCS4, which is just a list/array of 32bit values that contain a codepoint each. but to get those out to a terminal, GUI or any other target you'll have to convert it to whatever that target speaks.
<mark4> no
<patrickg> or you keep them internally as utf-8 encoded string, where you have a reasonable chance to be able to just dump them byte-by-byte - or still convert them to whatever the target wants
<mark4> utf32 is not acceptable to me. i dont want my library to be 500 megs in size like libncurses :P
<mark4> or is it 5 gigs now lol
f-a has joined #forth
<patrickg> if going from utf8 to utf32 means that utf32 is 500megs and utf8 isn't, you're predominantly using ascii characters (the high plane code points take 4 bytes in utf8 encoding, the same as in utf32)
<mark4> ill just take the utf8 encoded bytes in the strings such as win_puts("some string\n") at run time and convert them to the codepoints
<patrickg> also, your utf8 text would still be 125megs if all ascii
<mark4> lol
<mark4> my point is is that space efficiency is orders of magnitude more important to me than runtime speed efficiency here
<mark4> i mean. the executable or .so needs to be as TINY as I can make it
<mark4> run time will already be using up two buffers of 32 bits per char each for every single window and one buffer of 32 bits for each char per screen
<mark4> screens are what are written to the display. windows are written into the screen if the char at X, Y has changed
<mark4> thus the double buffering
<mark4> anyway i have to go back to wallymart. went there, got all my stuff and had to leave it there because my roomie had my card so he could get rent out of my bank lol
<mark4> my rent not his :P
<mark4> brb
elioat has joined #forth
f-a has quit [Read error: Connection reset by peer]
f-a has joined #forth
cmtptr has joined #forth
<cmtptr> omg how long have i not been in this channel!
<cmtptr> i wonder what juicy forth gossip i've been missing out on and didn't notice
<cmtptr> (rhetorical question btw, the answer is probably 68 days since that's my uptime)
elioat has quit [Quit: elioat]
<mark4> lol
<crc> cmtptr: just read the logs :)