<TimMc>
so capped at about 10 vCPU and 46.5 GB RAM (but not *using* all of that)
<TimMc>
and adding vCPU across machine types, well...
<kentonv>
TimMc, almost all the VMs are in single-digit percent utilization of CPU. The system was designed to be a lot more scalable than it needed to be, I guess. >_>
<kentonv>
in fact self-hosted (single-machine) sandstorm on a beefy instance would probably have handled the load fine. Oops.
<simpson>
On GCE, it doesn't matter quite as much. I suppose it depends on what's on each machine.
<kentonv>
the gateway is a g1-small, the workers are n1-highmem-2, and the rest are n1-standard-1
<kentonv>
master could probably be reduced to g1-small and probably storage could too.
<kentonv>
but I worry about subtle performance loss
<kentonv>
we could also probably go to just one shell
<simpson>
Mm. Are you running full systemd? As I've containerized, I've found that that's actually one of the costs, and that there's been a modest savings from running more stuff on k8s.
<kentonv>
these are full VMs. Some of the things could maybe run in containers but the workers definitely can't since they do a lot of root-only stuff, like setting up nbd devices.
<simpson>
Mm, makes sense. It was only recently that I was able to get my Tahoe-LAFS storage servers off of VMs, and for similar reasons: Wiring up storage is non-trivial.
<mokomull>
ooh, nbd? that's kind of my life these days :)
<kentonv>
mokomull, yeah Blackrock makes heavy use of nbd in order to give each grain its own virtual volume that's actually maintained on the remote storage server.
<kentonv>
it's my favorite crazy systems hack
<mokomull>
haha you're in good company, though ... ISTR someone big was using Ceph via a userspace NBD translator too.
<kentonv>
nbd is basically fuse at the block layer
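A minimal sketch of that analogy in code, assuming the legacy ioctl interface from <linux/nbd.h> (illustrative only, not Blackrock's actual setup code; the function name, device path argument, and omitted error handling are all made up here):

    // Minimal sketch, not Blackrock's code: wire one end of a socketpair into
    // /dev/nbdN via the legacy ioctl interface in <linux/nbd.h>. A userspace
    // process holding the other end answers NBD read/write requests for the
    // block device, the same way a FUSE daemon answers filesystem calls.
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <linux/nbd.h>

    // Attaches devPath to a fresh socketpair; returns the server-side fd
    // (to be driven by whatever process serves the block data), or -1.
    int attachNbd(const char* devPath, uint64_t sizeBytes) {
      int sv[2];
      if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return -1;

      int dev = open(devPath, O_RDWR);
      if (dev < 0) { close(sv[0]); close(sv[1]); return -1; }

      ioctl(dev, NBD_SET_SIZE, sizeBytes);   // advertise the device size
      ioctl(dev, NBD_SET_SOCK, sv[0]);       // hand the kernel its end

      // NBD_DO_IT blocks for the life of the connection, so it normally runs
      // in a dedicated thread or child process:
      //   ioctl(dev, NBD_DO_IT);

      return sv[1];
    }

The process holding the returned fd plays the FUSE-daemon role: it reads NBD request headers off the socket and answers them from wherever the bytes actually live, such as a remote storage server.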
<mokomull>
Do you preallocate a gajiggaton of /dev/nbd* devices, or are you using the netlink API?
<kentonv>
gajiggaton of devices
<kentonv>
didn't know you could use netlink for this
<mokomull>
it's relatively new
<kentonv>
I think I create 4096 devices at startup and then I have some code for locking them to grains.
<kentonv>
and it's really easy for devices to get permanently stuck, so I have some logic to route around stuck ones; it's gross
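A hedged sketch of what that pool-plus-routing scheme could look like (the 4096 figure comes from the conversation above; the modprobe parameter, the flock-based claim, and the stuck-device set are assumptions about how the locking and routing might be done, not the actual Blackrock code):

    // Sketch of a device pool: the nbd module is loaded with a large pool
    // (e.g. `modprobe nbd nbds_max=4096`), and each grain claims a free
    // device by taking an exclusive advisory lock on its node. Devices that
    // have gotten permanently stuck are remembered and routed around.
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/file.h>
    #include <unistd.h>
    #include <set>

    static std::set<int> stuckDevices;  // indices we've given up on

    // Returns an open, exclusively-locked fd for a free /dev/nbdN, or -1 if
    // the pool is exhausted. The flock is released automatically when the
    // holder exits, so a crashed worker doesn't leak the device forever.
    int claimNbdDevice() {
      for (int i = 0; i < 4096; i++) {
        if (stuckDevices.count(i)) continue;  // route around known-bad devices

        char path[32];
        snprintf(path, sizeof(path), "/dev/nbd%d", i);
        int fd = open(path, O_RDWR);
        if (fd < 0) continue;

        // LOCK_NB: skip devices already claimed by another grain.
        if (flock(fd, LOCK_EX | LOCK_NB) == 0) return fd;
        close(fd);
      }
      return -1;
    }

    // Called when teardown of a device fails; future claims will skip it.
    void markStuck(int index) { stuckDevices.insert(index); }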
<kentonv>
hahaha, before I even clicked I was wondering if the patch comes from Facebook
<kentonv>
sure enough
<mokomull>
kentonv: if you still end up with devices permanently stuck, I would absolutely love to hear about it. We've hit some of that after the blk-mq migration, because the kyber and deadline schedulers somehow manage to mess with request IDs enough to confuse nbd.
<kentonv>
(I talked to some people there who seemed excited about nbd recently)
<mokomull>
I am one of those people there :)
<kentonv>
oh hah
<mokomull>
although my crazy patchset hasn't progressed beyond the "rewrite it before you publish this or you're gonna get skewered" stage
<kentonv>
it's been years since I wrote the code, I'm sure it has gotten better
<mokomull>
There was quite the onslaught of XFS fixes when we started this, that's for sure
<kentonv>
mokomull, I've been using ext4 and it's been remarkably solid. I don't think I ever saw an instance of an unrecoverable volume or data loss caused by ext4, even though we disconnect mid-stream all the time.
<mokomull>
that might say some things about our design choices :)
<kentonv>
I'm sure you push a hell of a lot more bits though
<mokomull>
I don't even know how many bits anymore. It's kind of mindblowing.