ELLIOTTCABLE changed the topic of #elliottcable to: a _better_ cult
eligrey_ has quit [Quit: Leaving]
eligrey has joined #elliottcable
TheMathNinja has joined #elliottcable
prophile has quit [Quit: The Game]
prophile has joined #elliottcable
eligrey has quit [Quit: Leaving]
<nuck> Guys, I think I just found a programming horror in the Hummingbird codebase
<nuck> That you might enjoy
<nuck> Basically a halfassed XMLLint that tries to fix this awful other site's broken XML
<nuck> And then map it onto our database
<nuck> The other site's DB can provide like one of three different formats for the completion status
<nuck> Hence the big STATUS_MAP up top
TheMathNinja has quit [Ping timeout: 240 seconds]
<devyn> nuck: it gives you invalid UTF-8?
<nuck> I have no idea
<nuck> I didn't write this shit
<nuck> I think MAL is probably not even UTF-8 clean
<nuck> It's all PHP scripts that haven't been manned in years
<nuck> Like, one time the HB scraper got stuck in a loop and scraped 50GB in a couple hours, and nobody so much as noticed
<devyn> lol
<purr> lol
<devyn> I have a feeling that based on how this is written, the XML isn't actually broken; it just sometimes contains non-UTF-8 data
<nuck> It's all going away
<nuck> xmllint fixes it
<nuck> So the plan is to just feed the shit into xmllint and use that
<devyn> which makes sense; I come across EUC-JP and Shift-JIS encodings all the time
<nuck> mmhmm
<devyn> always annoying
<nuck> If it were that I think we could probably fix with a simple .encode!
<nuck> Since Ruby's got pretty good encoding handling
<devyn> not really, because it's going to be mixed in UTF-*
<nuck> good API around iconv
<devyn> UTF-8*
<devyn> and you have no idea *which* encoding it is, and it isn't always easy to tell
<devyn> and sometimes it's just mojibake and it's been run through a few different encoders and totally impossible to understand
<nuck> ('ASCII-8BIT', invalid: :replace, undef: :replace, replace: '')
<nuck> That should just drop all non-ASCII characters
<devyn> no, because it's ASCII-8BIT, which means that post-0x7F characters are allowed, IIRC
<devyn> but if it was read in in UTF-8, what that would do is drop any non-UTF-8 characters, I think
<devyn> because what they're doing is #encode as ASCII-8BIT
<devyn> not #force_encoding
<devyn> it's... weird
<nuck> The lead dev says "I think it results from something in the notes."
<nuck> Apparently xmllint will fix it all for us though
<devyn> the weird thing to me is that *after* #encode → ASCII-8BIT, they're doing #split('') and #select with a regex that matches multiple bytes, which makes no sense; they should have split('') before encoding if they wanted to match multiple bytes
<devyn> and really they should have used #force_encoding
<nuck> "To filter out invalid characters that MAL sometimes emits in XML, which nokogiri doesn't like."
<devyn> I don't even think this works
<nuck> He claims he was entirely shitfaced when he wrote this
<nuck> As he put it, "I can't work with MAL's stuff sober"
<nuck> And the scariest part is that MAL is the *best* site besides HB and our primary competitor AniList
<devyn> yep, this totally doesn't work; it ends up stripping anything non-ASCII
<nuck> I think that might be the goal
<devyn> irb(main):004:0> "こんにちは、世界".encode('ASCII-8BIT', invalid: :replace, undef: :replace, replace: '')
<devyn> => ""
<devyn> irb(main):005:0> "Märchen".encode('ASCII-8BIT', invalid: :replace, undef: :replace, replace: '')
<devyn> => "Mrchen"
<devyn> nuck: if that's the goal, then the #select with that regex makes NO sense
<devyn> it won't ever catch anything
<devyn> lol
<purr> lol
<devyn> so I don't think that was the intention
<devyn> I think the intention was to drop anything non-UTF-8
<devyn> and they failed miserably
<nuck> Maybe it was multiple overlapping "fixes"
<devyn> lol
<devyn> perhaps yeah
<nuck> Like the regexes were the first fix
<nuck> And then they failed so he tried the .encode() thing
<devyn> nuck: well, the proper way to do what he wanted to do:
<devyn> irb(main):006:0> "こんにちは、世界".force_encoding('ASCII-8BIT')
<devyn> => "\xE3\x81\x93\xE3\x82\x93\xE3\x81\xAB\xE3\x81\xA1\xE3\x81\xAF\xE3\x80\x81\xE4\xB8\x96\xE7\x95\x8C"
<nuck> But that's even worse
<nuck> You're lying about the encoding
<nuck> The right answer is probably to just cram it into xmllint
<devyn> ASCII-8BIT can contain any octet, so not really; any encoding is also ASCII-8BIT :p
<devyn> so
<nuck> Well yes
<devyn> he should have forced encoding, run his regexes
<devyn> and then forced back to UTF-8
<devyn> that would have worked, I think
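A minimal sketch of the approach devyn describes here (relabel as bytes, run byte-level regexes, relabel as UTF-8), assuming the goal is to drop characters Nokogiri rejects. The name `clean_for_xml` and the control-character regex are illustrative assumptions, not the Hummingbird code, and `String#scrub` (Ruby 2.1+) is swapped in to drop any bytes that are not valid UTF-8:

```ruby
# Hypothetical helper, not taken from the Hummingbird codebase.
def clean_for_xml(str)
  bytes = str.b                                     # copy, relabelled as raw bytes (ASCII-8BIT)
  bytes.gsub!(/[\x00-\x08\x0B\x0C\x0E-\x1F]/n, '')  # strip control bytes that XML 1.0 disallows
  bytes.force_encoding(Encoding::UTF_8).scrub('')   # relabel as UTF-8, drop invalid sequences
end

mixed = "M\xE4rchen\x01".b + "café".b   # a stray Latin-1 byte and a control byte next to valid UTF-8
puts clean_for_xml(mixed)               # prints "Mrchencafé"
```

Unlike `#encode` to ASCII-8BIT, this keeps valid non-ASCII text such as "こんにちは、世界" intact; it cannot recover text that arrived in EUC-JP or Shift-JIS, it only removes the bytes that do not decode.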
<nuck> I might forward this onto him
<nuck> See if it lets us avoid the pain of spawning xmllint
<devyn> in fact
<nuck> It turns out xmllint doesn't have a ruby lib
<devyn> there's another problem I mentioned above; the use of split('')
<devyn> how to do that better:
<devyn> nuck: nah, this won't really work, never mind. if spawning xmllint is painful, write an FFI shim if you can
<nuck> Yeah I was thinking of an FFI shim
<devyn> Ruby has a good FFI library that's actually relatively painless to bind to C libs
<nuck> He just sent me the broken XML
<nuck> So I guess it's time to dig in
<devyn> I remember when binding Ruby to C libs used to be horrifyingly painful haha
<nuck> I don't! :D
<nuck> I just got into Ruby like... 1 year ago ish?
<nuck> Maybe?
<nuck> I don't even remember
<nuck> I think it was December 2012
<devyn> well, you basically had to write a shared library of your own to act as the shim, and the .so would hook into Ruby internals and basically use Ruby's internal C API to expose everything
<devyn> it was terrible
<devyn> you had to do everything manually
<nuck> That's the XML file he handed me
<devyn> XML should probably never be generated with basic string-based templating (including PHP or ERB or EJS or whatever), that's why
<devyn> lol
<purr> lol
<devyn> anyway this doesn't really look broken to me, glancing at it
<devyn> I'll have to try to parse it lol
<nuck> I'm thinking that too
<nuck> But it's huge
<nuck> So who knows
<devyn> seems like 0 is used when NULL is really meant… a lot of 0000-00-00 dates around lol
<nuck> This is probably close to the db
<devyn> knowing PHP developers, it's probably literally just a DB query and a for loop
<nuck> in some implementations, a date column with null is actually 0000-00-00 iirc
<nuck> Oh that's absolutely what it is
<nuck> Like, you're not just talking PHP, but PHP from about 2008 at its most recent
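The 0000-00-00 dates mentioned above could be mapped to nil on import. A small sketch, assuming ISO-style `YYYY-MM-DD` strings; `parse_mal_date` is a hypothetical name, not something from the codebase under discussion:

```ruby
require 'date'

# Hypothetical helper: treat zero dates (and anything unparseable) as "no date".
def parse_mal_date(str)
  return nil if str.nil? || str.start_with?('0000')
  Date.iso8601(str)
rescue ArgumentError   # Date::Error is a subclass of ArgumentError
  nil
end

p parse_mal_date('0000-00-00')   # prints nil
p parse_mal_date('2013-12-01') == Date.new(2013, 12, 1)   # prints true
```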
<devyn> it's so unfortunate that PHP ever took off
<devyn> nuck: so, uh, Nokogiri parsed that file just fine
<nuck> Well then.
<nuck> I guess my regression test is pointless
<devyn> get some data from your colleague that actually fails to parse lol
<purr> lol
<devyn> because... this looks fine
<nuck> Yeah I guess I'll have to lol
<nuck> He's out right now, I think on a train
<nuck> And it's india, trains don't have wifi
<devyn> though, I guarantee that it will totally break if a string contains ]]>, because they're using CDATA
<nuck> I should go test if I can break it with that
<nuck> brb adding that as a note on MAL
<devyn> ]]> alone might not break it; try doing ]]><
<nuck> I mostly wanna see if they're smart and escape it or not
<nuck> They might
<nuck> They thought to use CDATA at all, which tells me *something*
<devyn> it's impossible to escape in CDATA; the only thing you can do is change it to something else
<devyn> lol
<devyn> CDATA doesn't have any escape sequences, which is why the terminator is relatively long
<nuck> Can't just do ]]&gt;&lt; ?
<devyn> you could but it would come out that way; the XML parser wouldn't turn the entities into ><
<devyn> that's sorta the point of CDATA; no parsing at all until ]]> is found
<nuck> Probably what they would do I'd guess
<joelteon> you have to split ]]> across two CDATA sections right
<devyn> but you don't always want to escape > and <
<devyn> if they always escaped > and < with &gt; and &lt; it would always turn out that way
<devyn> which would be … completely wrong
<nuck> This is PHP
<nuck> "completely wrong" is par for the course
<devyn> joelteon: <![CDATA[hello]]>]]&gt;<![CDATA[world]]> would be pretty much the only way to do it
<joelteon> huehue
<devyn> nuck: I would say it's far more likely that they wouldn't have even considered it, given how common SQLi vulnerabilities are in PHP scripts, and that's exactly the same kind of flaw
<joelteon> devyn: <![CDATA[<![CDATA]]>]]<![CDATA[>]]>
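The splitting trick joelteon shows can be written as a one-line helper. This is a sketch with a hypothetical name (`cdata`); it ends the section after `]]` and reopens a new one before `>`, so no single section ever contains the terminator:

```ruby
# Hypothetical helper: wrap arbitrary text in CDATA, splitting any "]]>" across two sections.
def cdata(text)
  '<![CDATA[' + text.gsub(']]>', ']]]]><![CDATA[>') + ']]>'
end

puts cdata('a]]>b')   # prints <![CDATA[a]]]]><![CDATA[>b]]>
```

A parser reads the two adjacent sections as `a]]` followed by `>b`, which concatenates back to the original text.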
<nuck> <my_comments><![CDATA[]]&gt;&lt;]]></my_comments>
<nuck> That's what I got
<devyn> interesting
<devyn> haha
<nuck> For PHP, they've actually done a pretty good job
<devyn> what if you just do like
<devyn> >:D
<devyn> if it turns into &gt;:D then they've still done a shitty job
<devyn> lol
<purr> lol
<nuck> <my_comments><![CDATA[&gt;:D]]></my_comments>
<devyn> even if it doesn't, if they're not turning & into &amp; as well, people could put whatever entities they want in
<nuck> They've done a shitty job
<devyn> ok
<devyn> not surprising
<devyn> :p
<nuck> But it's at least a shitty job that produces valid XML
<nuck> Which makes me wonder wtf we're doing all this for
<devyn> maybe it used to be worse
<nuck> hahahahahahahahahaha
<devyn> maybe they actually fixed it
<devyn> lol
<nuck> That would imply MAL actually has programmers
<nuck> They don't
<devyn> huh
<devyn> lol
<nuck> Literally, they're sitting back and collecting money
<nuck> All the people working on MAL left, it's now run by some other company and it's got zero development
<devyn> in any case, I bet you that the problem is if you get something with EUC-JP or Shift-JIS instead of UTF-8
<devyn> the XML isn't actually malformed aside from the encoding of the data being bad, but <![CDATA[]]> just contains octets anyway; no particular encoding is required IIRC
<nuck> I should paste in some japanese characters and see what happens
<nuck> Whether it's mangled or passed through safely
<nuck> I know that's true
<nuck> I've seen images embedded in CDATA
<devyn> yeah
<devyn> so really it's Nokogiri's fault for parsing CDATA as UTF-8 when really it should be treating it as ASCII-8BIT
<devyn> haha
<nuck> I hate that "ASCII-8BIT" means "bytestring"
<nuck> It's silly
<devyn> anyway nuck, throw in some actually invalid UTF-8, not just any Japanese chars
<nuck> I doubt I can do that in a browser and I'm too lazy to curl it into place lol
<purr> lol
<joelteon> what's wrong with bytestrings
<nuck> Nothing
<nuck> I just wish people didn't label them as "ASCII"
<nuck> When they're more accurately encodingless
<devyn> joelteon: it's that Ruby labels "no encoding" as "ASCII-8BIT"
<devyn> pretty weird
<devyn> lol
<joelteon> oh ok
<nuck> The same thing is in GLib iirc
<devyn> nuck: there is also, for the record, US-ASCII, which is properly 7-bit ASCII
<nuck> lol
<nuck> But nobody uses 7 bit things anymore
<nuck> Since all the machines this runs on are 8bit
<devyn> no, it's not that the bytes are interpreted in 7 bit groups; that would be... wayy too much work
<devyn> it just means that the 8th bit can only be 0
<nuck> Just zeroed
<nuck> mmmm
<joelteon> anyway, nix
<joelteon> is awesome
<devyn> any char that has bit 8 set will be considered invalid according to the encoding
<joelteon> free distributed builds
<devyn> ASCII is a 7bit-on-8bit encoding which is exactly what this describes, anyway
<devyn> it goes from 0 to 127; 128..255 is undefined
<joelteon> who needs the other 128 though
<nuck> Russia
<devyn> well, it's actually excellent, because it means other encodings could hijack that space and still maintain ASCII backward-compatibility
<devyn> including beloved UTF-8 :D
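A short Ruby illustration of the 7-bit point, using only core String and Encoding methods. The helper name `seven_bit?` is made up for this sketch:

```ruby
# Hypothetical helper: true when every byte has its high bit clear.
def seven_bit?(str)
  str.bytes.all? { |byte| byte < 0x80 }
end

word = "caf\u00E9"                       # "café": the last character needs bytes above 0x7F in UTF-8
p seven_bit?('cafe')                     # prints true
p seven_bit?(word)                       # prints false
p word.dup.force_encoding(Encoding::US_ASCII).valid_encoding?    # prints false
p 'cafe'.encode('UTF-8').bytes == 'cafe'.encode('US-ASCII').bytes  # prints true
```

The last line shows the backward compatibility devyn mentions: pure ASCII text has the same bytes in US-ASCII and in UTF-8.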
<nuck> You wanna see horrifying in a different way?
<nuck> Screen scraper wooooooo
<joelteon> oh yeah
<devyn> nuck: hah, I've done things like that; it's not that bad though if you know the HTML output is always going to be predictable because the site has no programmers :p
<nuck> Exactly
<nuck> We're not sure why it's in models/
<nuck> The guy who put it there according to git blame has no idea either
<devyn> it makes sense to me; it is technically a "model"
<devyn> models are really any data source
<nuck> Kind of, but we don't use it as a data source for any rendering, it's only used by a side thing
<nuck> Since all MAL scraping is Sidekiq'd
<nuck> I'm gonna be cleaning that sucker up today and replacing it with AnimeNewsNetwork
<nuck> Since they have an API
<nuck> It's actually a pretty good API for XML
<nuck> They have some weird things in other parts of the API though
<nuck> Like, for one part, all commands are specified as three-digit numbers
<nuck> So the command "get all anime titles in the encyclopedia" is actually "144"
<joelteon> duh
<devyn> unrelated: my air cooler is so good that my CPU's *core temperature* can be exactly ambient temperature at 1-2% load
<nuck> holy shit what
<nuck> Your air cooler is boss
<devyn> not even that expensive; cooler master hyper 212 evo with arctic silver 5 paste
<nuck> We just discovered earlier that AnimeNewsNetwork has an anime-list service
<nuck> Like, not a single one of us actually knew about this until somebody mentioned it on the forums
<nuck> It's so buried in their site
<devyn> lol, wow
<purr> lol
<devyn> does anyone actually use it?
<nuck> Well, not many I'd guess
<nuck> Since it's buried and they're listing the users by an alphabetical menu
<nuck> It's like some kind of class directory
<devyn> haha
<devyn> this is sick http://i.imgur.com/uv7s9NH.png
<devyn> original
<nuck> oh god
<nuck> I just set "こんにちは、世界" as the comments string
<nuck> <my_comments><![CDATA[&atilde;<81><93>&atilde;<82><93>&atilde;<81>&laquo;&atilde;<81>&iexcl;&atilde;<81>&macr;&atilde;<80><81>&auml;&cedil;
<nuck> <96>&ccedil;<95><8C>]]></my_comments>
<nuck> That was the output &
<nuck> From less
<devyn> oh
<devyn> oh god
<devyn> holy shit
<nuck> I *know*
<devyn> I suggest you write a parser for that, honestly, and use a XML entity <-> octets table
<devyn> then take the whole thing and force to UTF-8
<devyn> you have to use a parser though, for sure; if you go after the entities first then you'll potentially have problems with the <XX>
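A rough sketch of the entity-to-octet idea, under two assumptions that the log does not confirm: that the `<81>`-style tokens are just how `less` displays raw bytes (so the real data holds the bytes themselves), and that the entities are the Latin-1 named entities of the original UTF-8 bytes. The table covers only the entities visible in the pasted sample, and `unmangle` is a hypothetical name:

```ruby
# Hypothetical table: only the Latin-1 named entities seen in the pasted sample.
LATIN1_ENTITIES = {
  'atilde' => 0xE3, 'laquo' => 0xAB, 'iexcl' => 0xA1, 'macr' => 0xAF,
  'auml' => 0xE4, 'cedil' => 0xB8, 'ccedil' => 0xE7
}.freeze

# Turn each known entity back into its single byte, then relabel the result as UTF-8.
# Returns nil when the rebuilt bytes are not valid UTF-8.
def unmangle(raw)
  bytes = raw.b.gsub(/&([a-z]+);/) do
    name = Regexp.last_match(1)
    LATIN1_ENTITIES.key?(name) ? LATIN1_ENTITIES[name].chr : Regexp.last_match(0)
  end
  text = bytes.force_encoding(Encoding::UTF_8)
  text.valid_encoding? ? text : nil
end

sample = "&atilde;\x81\x93&atilde;\x82\x93".b   # first two characters of the pasted output
puts unmangle(sample)                            # prints "こん" under the assumptions above
```

If the `<XX>` tokens turned out to be literal text in the feed, a further step mapping them to bytes would be needed, which is where devyn's warning about parsing order applies.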
<nuck> Or we could just ignore it
<nuck> Seriously
<nuck> If MAL has been mangling this stuff
<nuck> Why should we unmangle it -- they already trained the users not to use it
<nuck> Yeah it totally mojibake'd
<devyn> have they been mangling it on their website too?
<nuck> ��������\
<devyn> ok
<nuck> That's on their website
<devyn> yeah, fuck it then
<nuck> No reason for us to bother
<nuck> Found out what the cleanup code was for
<nuck> This is what they used to get
<nuck> From the MAL API
<devyn> oh but how would that be malformed; that looks generated by an XML library instead of a template
<nuck> No clue, but it doesn't use CDATA anywhere
<devyn> an XML lib should properly escape things though... unless it's still pumping the <XX> shit out with no escaping
<devyn> that would suck a lot
<nuck> No clue
<nuck> But, I think we've figured out the answer to our woes: just stop doing anything
<nuck> Don't look a gift horse in the mouth, as they say
<prophile> don't trust them
<prophile> they like bob marley too much
* devyn puffs
<devyn> ELLIOTTCABLE: so I'm thinking now that Paws could actually be performant and none of the aforementioned blockers to parallelism really matter too much as long as we can let the reactors do more than one thing in a tick
prophile has quit [Quit: The Game]
<devyn> Paws implementations can't really do anything about bad code, but good code with larger-ticked reactors should run just fine and in parallel too
<devyn> and I'm thinking ultimately this comes down to allowing native ops to decide whether they want to produce a staging to be completed immediately by the reactor without even touching the queue (unless a mask can't be acquired)
<devyn> if they do return to the reactor with a staging to be completed immediately (and there will be a distinction made)
<devyn> then the reactor will just do that; it won't go back to the queue
<devyn> a native op can, of course, produce a staging to be completed immediately and also add things to the queue
<devyn> that's fine too
<devyn> ELLIOTTCABLE: nvm, saw your concerns in the google drive doc, still thinking
<devyn> ELLIOTTCABLE: basically my idea would be to change Paws to introduce, essentially, synchronicity-by-alien, but I think that kind of goes against the fundamental philosophy of Paws
<devyn> ELLIOTTCABLE: so then… the question is, is the fundamental philosophy of Paws, having everything always be asynchronous, flawed?
<devyn> ELLIOTTCABLE: and honestly, I have a feeling it might be; time-sharing i.e. traditional preemptive multitasking just seems like a more efficient idea
<devyn> ELLIOTTCABLE: I think we should go back to what you think the original benefits of Paws would be. what is there to be gained from this programming model?
<devyn> ELLIOTTCABLE: in any case, I think as you said, having the default receiver execute synchronously is a pretty good idea
<ELLIOTTCABLE> if I introduce synchronicity, it will be available to libside, too.
<ELLIOTTCABLE> never forget that Paws is abstractive. There are too many truly fundamental operations that will be implemented libside; and if the conclusion is come to that "fundamental operations simply must be executed unordered" (or rather, synchronously, so "simply ordered"), then that statement equally much applies to abstractive fundamental operations as to alien
<ELLIOTTCABLE> fundamental operations
<ELLIOTTCABLE> there's a lot on my mind about this, because I also have thoughts about revamping it to be **more** asynchronous, in some ways. Ones that notably don't restrict us from having fully-synchronous paths of execution.
<devyn> at the moment, the best course of action I can see is to introduce synchronicity by allowing aliens to produce combinations to be reacted immediately, jumping the queue. don't skip responsibility checks; if responsibility checks fail then push to the queue
<devyn> aliens as well as core ops, I mean
<devyn> I would like to hear about the other idea though
<devyn> in fact I don't think that it's dangerous to jump the queue, because responsibility should take care of any situations in which it would be dangerous, I think
<devyn> but of course this also means a `yield` alien would inevitably be added, which makes it basically like any other cooperative multitasking system
<devyn> (yield being, of course, stage caller on queue instead of immediately)
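A toy model of the queue-jumping idea, for illustration only. None of these names (`ToyReactor`, `Result`, `can_acquire_mask?`) come from a real Paws implementation, and the mask check is a placeholder that always succeeds. An op returns an optional staging to run immediately plus a list of stagings to enqueue:

```ruby
# Toy model only: none of these names come from a real Paws implementation.
Result = Struct.new(:immediate, :enqueue)

class ToyReactor
  attr_reader :log

  def initialize
    @queue = []
    @log = []
  end

  def stage(op)
    @queue.push(op)
  end

  # One tick: take a staging off the queue, then keep following
  # "immediate" stagings without going back to the queue.
  def tick
    op = @queue.shift
    while op
      result = op.call(self)
      result.enqueue.each { |queued| stage(queued) }
      following = result.immediate
      if following && !can_acquire_mask?(following)
        stage(following)   # check failed: fall back to the queue
        following = nil
      end
      op = following
    end
  end

  def run
    tick until @queue.empty?
  end

  # Placeholder for the responsibility/mask check; always succeeds here.
  def can_acquire_mask?(_op)
    true
  end
end

reactor = ToyReactor.new
b = ->(r) { r.log << :b; Result.new(nil, []) }
c = ->(r) { r.log << :c; Result.new(nil, []) }
a = ->(r) { r.log << :a; Result.new(b, []) }   # b jumps the queue, running right after a
reactor.stage(a)
reactor.stage(c)
reactor.run
p reactor.log   # prints [:a, :b, :c]
```

In this model a `yield` is simply an op that returns `Result.new(nil, [caller])`, which puts the caller on the queue instead of running it immediately; with that change the order above would be a, c, b.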
sharkbot has quit [Remote host closed the connection]
sharkbot has joined #elliottcable
<alexgordon> hi ELLIOTTCABLE
yorick has joined #elliottcable
prophile has joined #elliottcable
<alexgordon> lol I just googled "can't be arsed" and elliott's face came up
<purr> lol
TheMathNinja has joined #elliottcable
eligrey has joined #elliottcable
oldskirt_ has joined #elliottcable
oldskirt_ has quit [Changing host]
oldskirt_ has joined #elliottcable
oldskirt has quit [Ping timeout: 240 seconds]
prophile has quit [Quit: The Game]
<joelteon> can someone who's good at networking tell me why video streams often stop streaming video
<cloudhead> heh
<cloudhead> well it's not easy to resume in case of a connection problem
<cloudhead> and if there is not enough bandwidth left, they might have to drop a client
<cloudhead> happens with webpages too it's just that it's more noticeable with videos
<cloudhead> for ex if a packet is lost, all subsequent packets will have to wait for the lost one, the client starts buffering new packets because it's waiting for the old ones, after a while it tells the server it can't handle more packets
<cloudhead> the server drops the client
<joelteon> oh
<cloudhead> this wouldn't be a problem with say, UDP
<cloudhead> but RTMP which is used for streaming is TCP based
<cloudhead> so it can't really "skip" a frame
<cloudhead> it has to wait
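The stall cloudhead describes can be shown with a tiny simulation of in-order delivery. This is a sketch, not a model of real TCP; `deliver_in_order` and its parameters are names invented here:

```ruby
# Simulate a receiver that may only hand data to the application in sequence order.
# Returns, for each arrival, the packets that became deliverable at that moment.
def deliver_in_order(arrivals, first_seq = 1)
  buffer = {}
  next_seq = first_seq
  arrivals.map do |seq|
    buffer[seq] = true
    batch = []
    while buffer.delete(next_seq)   # release every packet that is now contiguous
      batch << next_seq
      next_seq += 1
    end
    batch
  end
end

# Packet 2 is lost and retransmitted last; 3, 4 and 5 sit in the buffer until it shows up.
p deliver_in_order([1, 3, 4, 5, 2])   # prints [[1], [], [], [], [2, 3, 4, 5]]
```

The three empty batches are the frozen picture: the data has arrived but cannot be played until the missing packet does.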
<joelteon> i wish it was UDP
oldskirt_ has quit [Ping timeout: 240 seconds]
<cloudhead> do your videos stop often?
<joelteon> yeah
<cloudhead> hm
<joelteon> i'm watching the BBC stream of argentina vs bosnia, i get about 5 seconds of gameplay at a time
<joelteon> then it hangs for 10 seconds
<cloudhead> oh wow
<devyn> should really use something like μTP
<glowcoil> hi
<purr> glowcoil: hi!
<katlogic> cloudhead: easier said than done. it is tough to encode h264 stream to have it withstand gop packet skips
<cloudhead> oh I wasn't suggesting UDP would be better
<cloudhead> just that head-of-line blocking wouldn't happen
<cloudhead> there are good reasons to use TCP
<katlogic> things like rtmfp/webrtc p2p offloading seem to be like dead end too
<katlogic> because the situation is um like, 3/4 of folks streaming from crappy comcast/verizon
<cloudhead> yea
<cloudhead> I hope webrtc picks up though
<katlogic> and naturally those two have congested aggregated last mile, so the p2p part makes it only worse
<katlogic> cloudhead: there already are webrtc swarm implementations
<cloudhead> ah cool
* katlogic does not hold many hopes for it tho
<katlogic> sopcast tried it first, then adobe with rtmfp
<katlogic> and the only place where p2p tv actually works decent is china because their monopolies dont throttle mainland traffic, only outside bw :>
<katlogic> also, there's tribler, which works fairly well because it has torrent tit for tat
<katlogic> ie if your isp is crap you dont consume network resources, however suffers from immense lag being torrent swarm based
<katlogic> like 2min lag at best :(
eligrey has quit [Quit: Leaving]
eligrey has joined #elliottcable
eligrey has quit [Read error: Connection reset by peer]
nuck has quit [Ping timeout: 264 seconds]