<stellar-slack>
<graydon> I'm not sure what you mean by history corruption being edge-case
<stellar-slack>
<graydon> it seems pretty important to me that we not lose history
<stellar-slack>
<graydon> maybe I misunderstand what you're saying
<stellar-slack>
<lab> sure, i agree
<stellar-slack>
<lab> but if we have enough redundant copies, a single point of failure is irrelevant
<stellar-slack>
<graydon> I agree that lots of redundant copies and mirroring will likely provide extra safeguards against loss
<stellar-slack>
<lab> i can't bear history missing either.
<stellar-slack>
<graydon> but I feel that it's not a tremendous amount of engineering to do what we currently do and retry puts that fail
<stellar-slack>
<graydon> networks are often transiently unavailable
<stellar-slack>
<lab> and in the regular case, i.e. s3, the SLA is high enough
<stellar-slack>
<graydon> I mean, it has to be asynchronous anyways because it's done as a background process that might take a while
<stellar-slack>
<graydon> the SLA on the storage side isn't the problem, it's the network route between the node running the server and the storage facility :)
<stellar-slack>
<lab> i know, there is GFW between us
<stellar-slack>
<graydon> are you finding that the retrying behaviour causes instability itself?
<stellar-slack>
<lab> yes. the first failure in the stellar-core production network was caused by retrying
<stellar-slack>
<graydon> interesting. what went wrong?
<stellar-slack>
<lab> one of my nodes is in Germany and tries to put archives to alibaba's oss (behind GFW)
<stellar-slack>
<graydon> ok
<stellar-slack>
<lab> there was a disconnect lasting a few minutes
<stellar-slack>
<graydon> ok. it should just keep retrying. did it eventually throw an error or something?
<stellar-slack>
<lab> 2015-10-01T12:53:35.087 fa34c0 [] [Ledger] INFO Closed ledger: [seq=14065, hash=b056da]
<lab> Socket timeout, please try again later.
<lab> 2015-10-01T12:53:36.403 fa34c0 [] [Process] WARN process 4615 exited 1: osscmd -H http://oss-cn-beijing.aliyuncs.com put tmp/snapshot-8564d3f1fdbd4536/ledger-000032bf.xdr.gz oss://stellar/xlm/ledger/00/00/32/ledger-000032bf.xdr.gz
<lab> 2015-10-01T12:53:36.404 fa3
<stellar-slack>
<graydon> oh! well that's a bug. any assertion failure is just a mistake on our side. I will fix that, just need to figure out what went wrong.
<stellar-slack>
<lab> there are already some missing files in the archive, but it's ok
<stellar-slack>
<lab> crash is not.
<stellar-slack>
<graydon> understood
<stellar-slack>
<lab> i have a solution to detect missing files in seconds
<stellar-slack>
<graydon> I understand, and I appreciate that making the code less complex might have made this crash less likely; but there are plenty of state transitions in stellar-core that could go wrong and cause assertion failures.
<stellar-slack>
<graydon> in general we have few options except to crash when there's a significant logic error like that. when you say a crash is not ok, I'm curious what the effect on your server is. It's supposed to restart from crashes reasonably well, if supervised.
TheSeven has quit [Disconnected by services]
[7] has joined #stellar-dev
<stellar-slack>
<lab> but it was one validator of the first 4 and there is no auto-restart yet.
<stellar-slack>
<graydon> ok. I'm sorry this happened.
<stellar-slack>
<graydon> I will fix the bug
<stellar-slack>
<sacarlson> I was going to write a simple bash script to reset my stellar-core each time it falls out of sync
<stellar-slack>
<lab> i didn't check the code. but it looks like the network break lasted from one publish to the next publish
<stellar-slack>
<graydon> I need to figure out why and how it happened but I don't think disabling retrying across the publishing subsystem is a way to make it better.
<stellar-slack>
<graydon> lab: what do you mean?
<stellar-slack>
<graydon> sacarlson: why?
<stellar-slack>
<sacarlson> because I lose sync every time my ISP changes my ip address
<stellar-slack>
<lab> publish every 64 ledgers? and what if the previous publish is still retrying when the next publish starts?
<stellar-slack>
<graydon> lab: checkpoints pending publication are queued
<stellar-slack>
<graydon> lab: currently in memory, but the PR I posted in #rfr a little while ago also persists them in the database and restarts them when the program restarts
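(A minimal sketch of the queue-and-persist idea graydon describes, assuming a checkpoint is identified by its ledger sequence and enqueued every 64 ledgers. The PublishQueue name and the flat-file persistence are invented for illustration; the actual PR persists the queue in the node's SQL database.)

    // Sketch only: checkpoint ledger sequences awaiting publication are queued,
    // persisted, and re-queued on startup. The flat file stands in for the
    // node's database, which is what the real change uses.
    #include <cstdint>
    #include <deque>
    #include <fstream>
    #include <string>
    #include <utility>

    class PublishQueue
    {
        std::deque<uint32_t> mPending; // checkpoints still waiting to publish
        std::string mStateFile;        // durable copy, re-read after a restart

        void save() const
        {
            std::ofstream out(mStateFile, std::ios::trunc);
            for (auto seq : mPending)
                out << seq << "\n";
        }

      public:
        explicit PublishQueue(std::string stateFile)
            : mStateFile(std::move(stateFile))
        {
            std::ifstream in(mStateFile); // resume publishes interrupted by a crash
            uint32_t seq;
            while (in >> seq)
                mPending.push_back(seq);
        }

        void enqueue(uint32_t checkpointSeq) // called at each 64-ledger checkpoint
        {
            mPending.push_back(checkpointSeq);
            save();
        }

        bool hasWork() const { return !mPending.empty(); }
        uint32_t next() const { return mPending.front(); }

        void markPublished() // only once every file of the checkpoint has uploaded
        {
            mPending.pop_front();
            save();
        }
    };

On restart the constructor reloads whatever was still pending, which is the behaviour graydon says the PR adds.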
<stellar-slack>
<graydon> sacarlson: lose sync in what sense? does your ISP break existing TCP connections?
<stellar-slack>
<lab> so maybe the crash is not caused by this logic
<stellar-slack>
<lab> another hint graydon
<stellar-slack>
<lab> i run another stellar-core for a forked network, also trying to put history back to china.
<stellar-slack>
<lab> it crashed for the same reason
<stellar-slack>
<graydon> lab: I think the crash you saw was due to a long-running command exiting after the overall ArchivePublisher had entered a different state (possibly retrying-state). I'm just unsure how it would have triggered it.
<stellar-slack>
<graydon> you say your osscmd put command was timing out, right?
sacarlson has quit [Quit: Leaving.]
<stellar-slack>
<graydon> running for a long time, then exiting?
<stellar-slack>
<lab> there was a long disconnection between the vps in germany and the cloud storage in china before the crash
<stellar-slack>
<graydon> do you happen to have a more complete log of the last little while before the crash? it'll show me a bit more about the state transitions in the ArchivePublisher
<stellar-slack>
<sacarlson> well every 24 hours with my adsl line they do a reset of my ip address. at that time it seems stellar-core loses sync and never recovers. I can duplicate the problem when I use my vpn
<stellar-slack>
<lab> yes, i'm now digging it out
<stellar-slack>
<graydon> oh I bet it entered end-state
<stellar-slack>
<graydon> if there were a lot of retries
<stellar-slack>
<graydon> it'll eventually enter end-state and give up
<stellar-slack>
<graydon> 16 retries
<stellar-slack>
<graydon> and that'll trigger the assert you're seeing
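(A sketch of the retry shape graydon describes, assuming the publisher shells out to an external put command. The 16-attempt limit is the one he mentions; runPut() and the backoff policy are invented for illustration. The point is that exhausting the limit should return a failure the caller can re-queue, rather than ending in an assert.)

    // Sketch of the retry path described above: a put command retried up to a
    // fixed limit with backoff. Hitting the limit reports failure to the caller
    // instead of asserting; runPut() and the backoff policy are illustrative.
    #include <algorithm>
    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <thread>

    static bool runPut(const std::string& cmd)
    {
        // e.g. an osscmd or aws-cli upload of one history file
        return std::system(cmd.c_str()) == 0;
    }

    static bool publishWithRetries(const std::string& cmd)
    {
        const int kRetryLimit = 16; // the limit mentioned above
        for (int attempt = 1; attempt <= kRetryLimit; ++attempt)
        {
            if (runPut(cmd))
                return true;
            std::cerr << "put failed (attempt " << attempt << "/" << kRetryLimit
                      << "), backing off\n";
            std::this_thread::sleep_for(
                std::chrono::seconds(1 << std::min(attempt, 6)));
        }
        return false; // give up cleanly; the checkpoint stays queued for later
    }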
<stellar-slack>
<sacarlson> I think for me it gets stuck in catchup
<stellar-slack>
<graydon> sacarlson: awkward! I wonder why. lemme check the bug, sec.
<stellar-slack>
<sacarlson> no big deal for me, it normally switches at 2:00am when I'm asleep anyway
<stellar-slack>
<sacarlson> another workaround might be to look like a static address with a vpn even when my ip changes. not sure whether that will work or not
<stellar-slack>
<graydon> Can you run it in debug log level, and capture a log around the event?
<stellar-slack>
<graydon> to my thinking, the worst that should happen is that other nodes might have a hard time finding you by IP address
<stellar-slack>
<graydon> you shouldn't (I think!) lose sync or have trouble re-syncing. so I'm curious what's going on
<stellar-slack>
<sacarlson> you want to borrow my vpn to look at it?
<stellar-slack>
<sacarlson> I'm not totally sure it has the same effect, but I would think it should recover from a vpn switch just like my skype and other apps do, but it doesn't
<stellar-slack>
<graydon> um, potentially! it's 9pm on a friday night after a full workday where I live now, so I'm not really up for a long debugging session; but if you prefer I can try reproducing it on my own next week or via your VPN (we can synthesize a test that does this too, it's not like changing IP addresses is that unusual a thing to happen!)
<stellar-slack>
<sacarlson> I think most of the nodes that I know are all running on static addresses
<stellar-slack>
<sacarlson> but yes it will have to be fixed at some point
<stellar-slack>
<graydon> totally. yeah, IP addresses change. that's how the world works!
<stellar-slack>
<sacarlson> well it's above my pay grade so I leave it to the pros
<stellar-slack>
<sacarlson> I just find work arounds
<stellar-slack>
<sacarlson> if you want to take a whack at it go for it. I'll help in any way I can. I got lots of time
<stellar-slack>
<graydon> lab: what's your github id?
<stellar-slack>
<graydon> sacarlson: I really appreciate you spending the time to help shake bugs like this out of the system
<stellar-slack>
<graydon> I'll absolutely try to figure it out and get a fix in. it should work. just need to wrap up for this evening and come back at it on monday.
<stellar-slack>
<sacarlson> ok have a fun weekend
<stellar-slack>
<sacarlson> don't do what I always do and party too much
<stellar-slack>
<graydon> I'm a quiet sort. likely a little walk. might help out with one of the political campaigns in town. election season.
<stellar-slack>
<lab> graydon: are you still here?
<stellar-slack>
<graydon> yes
<stellar-slack>
<graydon> just posting a patch for you
<stellar-slack>
<lab> there was a network break
<stellar-slack>
<graydon> what's your github id?
<stellar-slack>
<lab> damned GFW.
<stellar-slack>
<graydon> network break: do you mean you were cut off, or do you mean the production network failed?
<stellar-slack>
<lab> i was cut off. i'm catching up on your messages
<stellar-slack>
<graydon> np. I think I found your bug.
<stellar-slack>
<graydon> if you want to try that on your side, it should survive long-running-and-timing-out processes better
<stellar-slack>
<lab> it makes sense.
<stellar-slack>
<graydon> at this hour I think probably everyone else here is asleep and it won't be in the trunk repo until tomorrow or later but it's an easy change to test on your side if you're doing your own builds
<stellar-slack>
<lab> i will apply this patch after that validator crashes again.
<stellar-slack>
<graydon> cool. I'm really sorry for this sort of thing, I suspect there'll be a fair number of asserts tripping in the first few weeks of production use, just because of things we assumed were "impossible" when coding, that are actually just corner cases we didn't picture having to handle.
<stellar-slack>
<lab> the chinese are always making the impossible possible...
<stellar-slack>
<graydon> I tend to code in a pretty assert-heavy style; it might be helpful to get the server into an auto-restart / supervisor framework (upstart or init or systemd or such)
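(A minimal systemd unit along the lines graydon suggests. The binary path, user, working directory, and the exact ExecStart arguments are assumptions to adjust per install.)

    # /etc/systemd/system/stellar-core.service -- paths and user are assumptions
    [Unit]
    Description=stellar-core validator
    After=network-online.target

    [Service]
    User=stellar
    WorkingDirectory=/var/lib/stellar
    # adjust the binary path and arguments to match your build's invocation
    ExecStart=/usr/local/bin/stellar-core --conf /etc/stellar/stellar-core.cfg
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target

Enabling it with systemctl enable stellar-core and starting it with systemctl start stellar-core gives the restart-on-crash behaviour mentioned above.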
<stellar-slack>
<lab> i'm studying your programming style. it's nutritious. :)
<stellar-slack>
<graydon> anyway it's getting close to my bed time here so I gotta head in. if you see other stuff like that that's disrupting operation, feel free to ping me. I much prefer to turn around quick fixes for things that are bothering real users, vs. hypothetical failure cases I dream up :)
pixelbeat has quit [Ping timeout: 272 seconds]
<stellar-slack>
<lab> happy weekend and good night.
de_henne has joined #stellar-dev
pixelbeat has joined #stellar-dev
pixelbeat has quit [Ping timeout: 260 seconds]
pixelbeat has joined #stellar-dev
<stellar-slack>
<buhrmi> how to shut down the GFW?
<stellar-slack>
<buhrmi> why is that even there
<stellar-slack>
<lab> GFW is like skynet
<stellar-slack>
<lab> once it has started, it will never go down
stellar-slack has quit [Remote host closed the connection]