<stellar-slack>
<graydon> I'm not sure what you mean by history corruption being edge-case
<stellar-slack>
<graydon> it seems pretty important to me that we not lose history
<stellar-slack>
<graydon> maybe I misunderstand what you're saying
<stellar-slack>
<lab> sure, i agree
<stellar-slack>
<lab> but if we have enough redundant copies, a single point of failure is irrelevant
<stellar-slack>
<graydon> I agree that lots of redundant copies and mirroring will likely provide extra safeguards against loss
<stellar-slack>
<lab> i can't bear history missing either.
<stellar-slack>
<graydon> but I feel that it's not a tremendous amount of engineering to do what we currently do and retry puts that fail
<stellar-slack>
<graydon> networks are often transiently unavailable
<stellar-slack>
<lab> and in the regular case, i.e. s3, the SLA is high enough
<stellar-slack>
<graydon> I mean, it has to be asynchronous anyways because it's done as a background process that might take a while
<stellar-slack>
<graydon> the SLA on the storage side isn't the problem, it's the network route between the node running the server and the storage facility :)
<stellar-slack>
<lab> i know, there is GFW between us
<stellar-slack>
<graydon> are you finding that the retrying behaviour causes instability itself?
<stellar-slack>
<lab> yes. the first failure in the stellar-core production network was caused by retrying
<stellar-slack>
<graydon> interesting. what went wrong?
<stellar-slack>
<lab> one of my nodes is in Germany and tries to put archives to alibaba's oss (behind GFW)
<stellar-slack>
<graydon> ok
<stellar-slack>
<lab> there was a disconnect lasting a few minutes
<stellar-slack>
<graydon> ok. it should just keep retrying. did it eventually throw an error or something?
<stellar-slack>
<lab> 2015-10-01T12:53:35.087 fa34c0 [] [Ledger] INFO Closed ledger: [seq=14065, hash=b056da]
<lab> Socket timeout, please try again later.
<lab> 2015-10-01T12:53:36.403 fa34c0 [] [Process] WARN process 4615 exited 1: osscmd -H http://oss-cn-beijing.aliyuncs.com put tmp/snapshot-8564d3f1fdbd4536/ledger-000032bf.xdr.gz oss://stellar/xlm/ledger/00/00/32/ledger-000032bf.xdr.gz
<lab> 2015-10-01T12:53:36.404 fa3
<stellar-slack>
<graydon> oh! well that's a bug. any assertion failure is just a mistake on our side. I will fix that, just need to figure out what went wrong.
<stellar-slack>
<lab> there are already some missing files in the archive, but it's ok
<stellar-slack>
<lab> crash is not.
<stellar-slack>
<graydon> understood
<stellar-slack>
<lab> i have a solution to detect missing files in seconds
<stellar-slack>
<graydon> I understand, and I appreciate that making the code less complex might have made this crash less likely; but there are plenty of state transitions in stellar-core that could go wrong and cause assertion failures.
<stellar-slack>
<graydon> in general we have few options except to crash when there's a significant logic error like that. when you say a crash is not ok, I'm curious what the effect on your server is. It's supposed to restart from crashes reasonably well, if supervised.
TheSeven has quit [Disconnected by services]
[7] has joined #stellar-dev
<stellar-slack>
<lab> but it was one validator of the first 4 and there is no auto-restart yet.
<stellar-slack>
<graydon> ok. I'm sorry this happened.
<stellar-slack>
<graydon> I will fix the bug
<stellar-slack>
<sacarlson> I was going to write a simple bash script to reset my stellar-core each time it falls out of sync
<stellar-slack>
<lab> i didn't check the code. but it looks like the network break lasted from one publish to the next publish
<stellar-slack>
<graydon> I need to figure out why and how it happened but I don't think disabling retrying across the publishing subsystem is a way to make it better.
<stellar-slack>
<graydon> lab: what do you mean?
<stellar-slack>
<graydon> sacarlson: why?
<stellar-slack>
<sacarlson> because I lose sync every time my ISP changes my ip address
<stellar-slack>
<lab> publish every 64 ledgers? and what if the previous publish is still retrying when the next publish starts?
<stellar-slack>
<graydon> lab: checkpoints pending publication are queued
<stellar-slack>
<graydon> lab: currently in memory, but the PR I posted in #rfr a little while ago also persists them in the database and restarts them when the program restarts
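(A minimal sketch of the queue-and-persist idea graydon describes, assuming a checkpoint is identified by its ledger sequence and enqueued every 64 ledgers. The PublishQueue name and the flat-file persistence are invented for illustration; the actual PR persists the queue in the node's SQL database.)

    // Sketch only: checkpoint ledger sequences awaiting publication are queued,
    // persisted, and re-queued on startup. The flat file stands in for the
    // node's database, which is what the real change uses.
    #include <cstdint>
    #include <deque>
    #include <fstream>
    #include <string>
    #include <utility>

    class PublishQueue
    {
        std::deque<uint32_t> mPending; // checkpoints still waiting to publish
        std::string mStateFile;        // durable copy, re-read after a restart

        void save() const
        {
            std::ofstream out(mStateFile, std::ios::trunc);
            for (auto seq : mPending)
                out << seq << "\n";
        }

      public:
        explicit PublishQueue(std::string stateFile)
            : mStateFile(std::move(stateFile))
        {
            std::ifstream in(mStateFile); // resume publishes interrupted by a crash
            uint32_t seq;
            while (in >> seq)
                mPending.push_back(seq);
        }

        void enqueue(uint32_t checkpointSeq) // called at each 64-ledger checkpoint
        {
            mPending.push_back(checkpointSeq);
            save();
        }

        bool hasWork() const { return !mPending.empty(); }
        uint32_t next() const { return mPending.front(); }

        void markPublished() // only once every file of the checkpoint has uploaded
        {
            mPending.pop_front();
            save();
        }
    };

On restart the constructor reloads whatever was still pending, which is the behaviour graydon says the PR adds.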
<stellar-slack>
<graydon> sacarlson: lose sync in what sense? does your ISP break existing TCP connections?
<stellar-slack>
<lab> so maybe the crash is not caused by this logic
<stellar-slack>
<lab> another hint graydon
<stellar-slack>
<lab> i run another stellar-core for a forked network, also trying to put history back to china.
<stellar-slack>
<lab> it crashed for the same reason
<stellar-slack>
<graydon> lab: I think the crash you saw was due to a long-running command exiting after the overall ArchivePublisher had entered a different state (possibly retrying-state). I'm just unsure how it would have triggered it.
<stellar-slack>
<graydon> you say your osscmd put command was timing out, right?
sacarlson has quit [Quit: Leaving.]
<stellar-slack>
<graydon> running for a long time, then exiting?
<stellar-slack>
<lab> there was a long disconnection between the vps in germany and the cloud storage in china before the crash
<stellar-slack>
<graydon> do you happen to have a more complete log of the last little while before the crash? it'll show me a bit more about the state transitions in the ArchivePublisher
<stellar-slack>
<sacarlson> well every 24 hours with my adsl line they do a reset of my ip address. at that time it seems stellar-core loses sync and never recovers. I can duplicate the problem when I use my vpn
<stellar-slack>
<lab> yes, i'm now digging it out
<stellar-slack>
<graydon> oh I bet it entered end-state
<stellar-slack>
<graydon> if there were a lot of retries
<stellar-slack>
<graydon> it'll eventually enter end-state and give up
<stellar-slack>
<graydon> 16 retries
<stellar-slack>
<graydon> and that'll trigger the assert you're seeing
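(A sketch of the retry shape graydon describes, assuming the publisher shells out to an external put command. The 16-attempt limit is the one he mentions; runPut() and the backoff policy are invented for illustration. The point is that exhausting the limit should return a failure the caller can re-queue, rather than ending in an assert.)

    // Sketch of the retry path described above: a put command retried up to a
    // fixed limit with backoff. Hitting the limit reports failure to the caller
    // instead of asserting; runPut() and the backoff policy are illustrative.
    #include <algorithm>
    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <thread>

    static bool runPut(const std::string& cmd)
    {
        // e.g. an osscmd or aws-cli upload of one history file
        return std::system(cmd.c_str()) == 0;
    }

    static bool publishWithRetries(const std::string& cmd)
    {
        const int kRetryLimit = 16; // the limit mentioned above
        for (int attempt = 1; attempt <= kRetryLimit; ++attempt)
        {
            if (runPut(cmd))
                return true;
            std::cerr << "put failed (attempt " << attempt << "/" << kRetryLimit
                      << "), backing off\n";
            std::this_thread::sleep_for(
                std::chrono::seconds(1 << std::min(attempt, 6)));
        }
        return false; // give up cleanly; the checkpoint stays queued for later
    }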
<stellar-slack>
<sacarlson> I think for me it gets stuck in catchup
<stellar-slack>
<graydon> sacarlson: awkward! I wonder why. lemme check the bug, sec.
<stellar-slack>
<sacarlson> no big deal for me, it normally switches at 2:00am when I'm asleep anyway
<stellar-slack>
<sacarlson> another workaround might be to look like a static address with a vpn even when my ip changes. not sure whether that will work or not
<stellar-slack>
<graydon> Can you run it in debug log level, and capture a log around the event?
<stellar-slack>
<graydon> to my thinking, the worst that should happen is that other nodes might have a hard time finding you by IP address
<stellar-slack>
<graydon> you shouldn't (I think!) lose sync or have trouble re-syncing. so I'm curious what's going on
<stellar-slack>
<sacarlson> you want to borrow my vpn to look at it?
<stellar-slack>
<sacarlson> I'm not totally sure it has the same effect, but I would think it should recover from a vpn switch just like my skype and other apps do, but it doesn't
<stellar-slack>
<graydon> um, potentially! it's 9pm on a friday night after a full workday where I live now, so I'm not really up for a long debugging session; but if you prefer I can try reproducing it on my own next week or via your VPN (we can synthesize a test that does this too, it's not like changing IP addresses is that unusual a thing to happen!)
<stellar-slack>
<sacarlson> I think most of the nodes that I know are all running on static addresses
<stellar-slack>
<sacarlson> but yes it will have to be fixed at some point
<stellar-slack>
<graydon> totally. yeah, IP addresses change. that's how the world works!
<stellar-slack>
<sacarlson> well it's above my pay grade so I leave it to the pros
<stellar-slack>
<sacarlson> I just find work arounds
<stellar-slack>
<sacarlson> if you want to take a whack at it go for it. I'll help in any way I can. I got lots of time
<stellar-slack>
<graydon> lab: what's your github id?
<stellar-slack>
<graydon> sacarlson: I really appreciate you spending the time to help shake bugs like this out of the system
<stellar-slack>
<graydon> I'll absolutely try to figure it out and get a fix in. it should work. just need to wrap up for this evening and come back at it on monday.
<stellar-slack>
<sacarlson> ok have a fun weekend
<stellar-slack>
<sacarlson> don't do what I always do and party too much
<stellar-slack>
<graydon> I'm a quiet sort. likely a little walk. might help out with one of the political campaigns in town. election season.
<stellar-slack>
<lab> graydon: are you still here?
<stellar-slack>
<graydon> yes
<stellar-slack>
<graydon> just posting a patch for you
<stellar-slack>
<lab> there was a network break
<stellar-slack>
<graydon> what's your github id?
<stellar-slack>
<lab> damned GFW.
<stellar-slack>
<graydon> network break: do you mean you were cut off, or do you mean the production network failed?
<stellar-slack>
<lab> i was cut off. i'm catching up on your messages
<stellar-slack>
<graydon> np. I think I found your bug.
<stellar-slack>
<graydon> if you want to try that on your side, it should survive long-running-and-timing-out processes better
<stellar-slack>
<lab> it makes sense.
<stellar-slack>
<graydon> at this hour I think probably everyone else here is asleep and it won't be in the trunk repo until tomorrow or later but it's an easy change to test on your side if you're doing your own builds
<stellar-slack>
<lab> i will apply this patch after that validator crashes again.
<stellar-slack>
<graydon> cool. I'm really sorry for this sort of thing, I suspect there'll be a fair number of asserts tripping in the first few weeks of production use, just because of things we assumed were "impossible" when coding, that are actually just corner cases we didn't picture having to handle.
<stellar-slack>
<lab> the chinese are always making the impossible possible...
<stellar-slack>
<graydon> I tend to code in a pretty assert-heavy style; it might be helpful to get the server into an auto-restart / supervisor framework (upstart or init or systemd or such)
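(A minimal systemd unit along the lines graydon suggests. The binary path, user, working directory, and the exact ExecStart arguments are assumptions to adjust per install.)

    # /etc/systemd/system/stellar-core.service -- paths and user are assumptions
    [Unit]
    Description=stellar-core validator
    After=network-online.target

    [Service]
    User=stellar
    WorkingDirectory=/var/lib/stellar
    # adjust the binary path and arguments to match your build's invocation
    ExecStart=/usr/local/bin/stellar-core --conf /etc/stellar/stellar-core.cfg
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target

Enabling it with systemctl enable stellar-core and starting it with systemctl start stellar-core gives the restart-on-crash behaviour mentioned above.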
<stellar-slack>
<lab> i'm studying your programming style. it's nutritious. :)
<stellar-slack>
<graydon> anyway it's getting close to my bed time here so I gotta head in. if you see other stuff like that that's disrupting operation, feel free to ping me. I much prefer to turn around quick fixes for things that are bothering real users, vs. hypothetical failure cases I dream up :)
pixelbeat has quit [Ping timeout: 272 seconds]
<stellar-slack>
<lab> happy weekend and good night.
de_henne has joined #stellar-dev
pixelbeat has joined #stellar-dev
pixelbeat has quit [Ping timeout: 260 seconds]
pixelbeat has joined #stellar-dev
<stellar-slack>
<buhrmi> how to shut down the GFW?
<stellar-slack>
<buhrmi> why is that even there
<stellar-slack>
<lab> GFW is like skynet
<stellar-slack>
<lab> once it has started, it will never go down
stellar-slack has quit [Remote host closed the connection]