Fly.io Status – Consul cluster outage (flyio.net)
126 points by purututu on March 15, 2023 | 118 comments



This has been a rough week, and I'm sorry we broke people's apps. We had a big Nomad outage on Monday, and then a suspiciously similar Consul outage today. Both tipped over faster than we could detect and mitigate, and we ended up having to do serious surgery to build entirely new Consul/Nomad clusters.

There's nothing to brag about here, I just wanted to let y'all know we're listening (even when things aren't on the HN front page).


This stuff is hard. As someone who runs infra teams for a living, these are the worst kinds of weeks.

Hang in there. You all will learn from this and be better for it. Your architecture will improve. Customers will give you a second chance. This too shall pass.

Sending positive vibes.


Lovely response. Ah, kindness. So refreshing to see.


<3


We had mysterious consul outages (and related nomad outages) causing us to never deploy our new hashicorp stack to production.

Shame cuz we were excited about our nomad+consul+vault setup and invested a lot of money into building it. But just didn’t have the time or enough depth of expertise to babysit it.


From my experience with the Hashi stack, I don't think it's a coincidence that Fly has a lot of downtime and are a major Hashi user. Terraform makes excellent bait though.

Still love using Fly, please add static assets hosting/CDN.


How is this possible? How is Consul not self-healing? It just seems so brittle in a way even database clusters aren't.


All distributed decentralized systems are brittle. The only people who don't think this are people who haven't run them at scale.

Also, "self-healing" isn't really one thing. There are hundreds of different problems that can take out such a cluster, and every single one of them needs its own "self-healing" mechanism. These systems are literally the most complicated kinds of systems.


"Should have self healing" can be expanded into "should have systems to address the underlying system failure modes", which starts to shed some light into why distributed systems will always run into failure modes.


I haven't been responsible for babysitting Consul, but I have been responsible for etcd for years, and if Consul's problem is anything like etcd's, it's because members have identity: if one of them goes toes up, etcd will wait forever for the snowflake to come back to life, and if that's not how the underlying infra is configured, that's very, very bad. Mix mTLS into this story and it gets worse.

I stayed away from the so-called "stacked" control plane of etcd inside kubernetes because it can make a tiny fire into a sharkfirenado but recently I've heard discussions of k3s (which uses dqlite) managing the etcd members and then "formal" kubernetes managing the workloads pointed at that k3s-stacked-etcd but I haven't tried it yet in order to know how theory and practice differ
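
To make the member-identity point above concrete, here's a minimal sketch of the manual reap-and-replace dance using etcd's clientv3 API. The endpoints, member name, and peer URL are placeholders; a real cluster has its own addresses and TLS config, and you'd only do this while quorum still holds without the dead node.

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        // Hypothetical endpoints; substitute your own cluster's addresses.
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"https://etcd-0:2379", "https://etcd-1:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // etcd won't forget a dead member on its own: list the members and
        // find the one that went toes up (the name here is assumed).
        members, err := cli.MemberList(ctx)
        if err != nil {
            log.Fatal(err)
        }
        var deadID uint64
        for _, m := range members.Members {
            fmt.Printf("member %x name=%q peers=%v\n", m.ID, m.Name, m.PeerURLs)
            if m.Name == "etcd-2" {
                deadID = m.ID
            }
        }

        // Remove the failed member first (quorum must still hold without it),
        // then add a replacement so the new node joins with a fresh identity.
        if deadID != 0 {
            if _, err := cli.MemberRemove(ctx, deadID); err != nil {
                log.Fatal(err)
            }
        }
        if _, err := cli.MemberAdd(ctx, []string{"https://etcd-2-new:2380"}); err != nil {
            log.Fatal(err)
        }
    }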


Consul’s autopilot feature makes life a little easier by automatically reaping failed instances:

https://developer.hashicorp.com/consul/tutorials/datacenter-...

Paired with cloud discovery, it makes for a tolerable operational experience when instances are expected to occasionally disappear.
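
For anyone curious what that looks like programmatically, here's a hedged sketch, assuming Consul's Go API client, of flipping on autopilot's dead-server cleanup via the operator endpoint (the same setting the linked tutorial walks through):

    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        // Talks to the local agent (127.0.0.1:8500 / CONSUL_HTTP_ADDR by default).
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }
        op := client.Operator()

        // Read the current autopilot configuration...
        conf, err := op.AutopilotGetConfiguration(nil)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("cleanup_dead_servers currently: %v\n", conf.CleanupDeadServers)

        // ...and turn on automatic reaping, so a failed server is removed from
        // the raft peer set instead of being waited on indefinitely.
        conf.CleanupDeadServers = true
        if err := op.AutopilotSetConfiguration(conf, nil); err != nil {
            log.Fatal(err)
        }
    }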


I've worked with Vault and clusters, not Consul specifically, but generally there's a healthy cluster state until something happens that puts the cluster into an inconsistent state.

Generally there's a master node, or multiple nodes in agreement. If the cluster cannot agree on its current state, the entire system may run multiple versions, be completely unavailable, or provide inconsistent responses, bringing down other systems that rely on it.

Inspection itself is hampered by elections, state syncing, or other process/race-related issues, caching, or the cluster DDoSing itself or other services.


Consul is a lot more than just a database cluster, and that may be part of its problem.


The simpler explanation is that running products designed for LAN usage on a WAN is a fundamentally bad plan, as the folks over at fly acknowledge, even in this thread.

Meanwhile, hundreds of thousands of Consul, Nomad and Vault clusters used appropriately work perfectly well…


Fly is building everything in hard mode - since they are not layering on top of an existing cloud like pretty much everyone else (heroku, render, railway, ...).

It's either very smart (if they pull it off, they'll have a ginormous cost advantage) or they fail.

I'm personally of the opinion that the ux on top of aws/gcp/... is worse than a doo-doo in a shoe. However, they are as stable as can be (all complex systems go down once in a while). There are very few mature projects that do not rely on aws/gcp/... managed services anyway. Might as well put in the little bit of effort to set yourself up for the future instead of painful migrations. This obviously doesn't hold for hobby projects.

In any case, I have a lot of respect for the engineering that fly does. Kudos.


Are they really building everything in hard mode or do they just have a bad architecture?


Yes. Both. This outage was caused by a bad architectural decision. We had an incident a few weeks ago caused by "hard mode".


I really appreciate the honesty here. I’m not a fly customer (no use case for it) but your transparency is admirable. Wish y’all the best.


As a fellow hashistack operator I'd love to hear what the bad decision was.


We got ourselves into a bind because of Nomad. We outlined a bunch of it here: https://fly.io/blog/carving-the-scheduler-out-of-our-orchest...

The short version is that "using Nomad and Consul for the type of global workloads we run is not a good choice". I do not believe we'd have the same problems with Nomad + Consul in a single region. But running a single, global cluster of each of these is suboptimal.

The second problem was using some Consul features that forced us to keep it single region. What we actually need is a global view of a single service. Federated Consul doesn't quite give us that. Earlier versions of our infrastructure were using a bunch of Consul watches to update local state, so we couldn't really federate.

Some of this I'd do very differently if we rewound. But we were also building an idea with no actual users. Nomad and Consul gave us a nice platform to experiment on. We just outgrew the "prototype" as we learned what people actually wanted from us.
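
For readers unfamiliar with the Consul-watch pattern mentioned above, here's a rough sketch of what "update local state from Consul" tends to look like with the Go client's blocking queries. The service name and handling are placeholders, not Fly's actual code.

    package main

    import (
        "log"
        "time"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        var lastIndex uint64
        for {
            // Blocking query: returns when the service's index moves past
            // lastIndex, or after WaitTime with no change.
            entries, meta, err := client.Health().Service("web", "", true, &api.QueryOptions{
                WaitIndex: lastIndex,
                WaitTime:  5 * time.Minute,
            })
            if err != nil {
                log.Printf("consul query failed: %v", err)
                time.Sleep(time.Second) // crude backoff
                continue
            }
            lastIndex = meta.LastIndex

            // Rebuild whatever local state (routing table, DNS answers, ...)
            // depends on the healthy instances of this service.
            for _, e := range entries {
                log.Printf("healthy: %s:%d", e.Service.Address, e.Service.Port)
            }
        }
    }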


One of the single best design decisions that AWS makes is to isolate each region as much as possible so that truly global outages are nearly impossible.


Hopefully they'll blog about it at some point. At this point I might pay just for their blog posts, even though I'm not using fly (yet?) because it's certainly a more cost-effective way to learn about big architectural mistakes than the way they're going about it, which involves actually paying the sunk costs.


AWS invested a ton into limiting the blast radius of failures by isolating AZs, regions and using a cellular (service level isolated shards) architecture. I am surprised these ideas have not propagated to newer companies trying to build clouds: https://m.youtube.com/watch?v=swQbA4zub20

AWS isn’t perfect but these lessons were learned by fire because these sorts of global outages can seriously harm reputations.


I feel that this is one of the biggest advantages AWS has over Azure. AWS has never had truly global outages the way that Azure has had with Azure AD


Azure's global outages have all been DNS, more or less.

And AWS has had a few of those in my 13-14 years I have used them :D

Azure reliability sucks more for sure though. Especially networking.

Edit: us-east-1 going down disrupts most of global AWS pretty severely fwiw.


That actually brings up another aspect of AWS architecture, even for global services that depend on us-east-1, AWS separates the control plane (configuration of resources) from the data plane (usage of resources). These global services like IAM have regionalized data planes: https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

In a disaster scenario, the data plane operations can continue so customer workloads can still run while the control plane might experience downtime. This is another lesson where in the case of fly isolating the control plane (deployment of services) from the data plane (executing customer code) could have limited the blast radius of this fault instead of using a global cluster manager.


They build everything from scratch - on bare metal, including sourcing hardware (though I'd presume they use a data center manager for it). Arch, from their engineering blogs, is pretty sound.


Relevant: "Reliability: It's not great" from last week https://news.ycombinator.com/item?id=35044516

They even specifically call out Consul as a source of trouble.

> We propagate app instance and health information across all our regions. That’s how our proxies know where to route requests, and how our DNS servers know what names to give out.

> We started out using HashiCorp Consul for this. But we were shoehorning Consul, which has a centralized server model designed for individual data center deployments, into a global service discovery role it wasn't suited for. The result: continuously stale data, a proxy that would route to old expired interfaces, and private DNS that would routinely have stale entries.


They call out THEIR USAGE of Consul as a source of trouble. This is quite different.


Been a fan of fly and have had most, if not all, of my side and semi-side projects on there for some time now. But... the ratio of good/fun/snarky blog posts to reliable service has gotten a bit too large for me, and I'm starting to look at other providers at this point just in case they can't turn this trend around. Honestly, it's been a good object lesson for me in the importance of backing up marketing/hype/"mind-share" stuff w/ absolute rock-solid performance/reliability, or just forgoing the former for the latter.

As an aside, it's also taking down some decently-load-bearing web infra like unpkg => https://www.unpkg.com/

see also https://community.fly.io/t/app-went-dead/11397/60


Relevant response from the Fly community forums: https://community.fly.io/t/frequent-outages-is-really-demons...


Yeah, I saw; I've kept up w/ everything pretty closely. Still decently frustrating as a paying customer, but I hope they can figure it out. If they can and can show some real reliability, I'll be an even bigger fan.


Yep! More putting it out there for other folks. I’m also a somewhat frustrated paying customer, but as I’m dealing with my own growing pains, I relate to what they’re going through. I’ve personally migrated my DB to Crunchy to somewhat mitigate the risk.


That's a good point / thing that I've not thought about as much here. I would be much more frustrated if my primary datastores were hosted there. As things stand, their semi-hosted offering never really made sense to me (esp. now lol), but I do think if you get into the game of DB hosting there's almost another level of expectation even beyond basic compute (oddly enough)


Wow, part of Delaware’s tax website was hanging on unpkg today, now I know why!


(unpkg seems to be up now)


I really really wanted to like and recommend fly.io but I wouldn't risk deploying anything more than a side project to tinker with, given how many random issues I encountered in a relatively short development time. It was a simple Phoenix app which made me wonder "am I doing things totally wrong?" quite a few times, after exhausting all info sources. But when I tried the same process the next day, it would deploy just fine. Plus the outages that appear to be getting more frequent don't make me optimistic.

At least they're transparent about their issues, gotta give them that. I still kinda root for them, maybe they'll make a comeback.


Same. I’m so disappointed because I’ve been rooting for them. We were close to a major deployment/migration (well, major as is mid four figures per month, not major like Google) but they were removed from the decision set. It would not have been responsible to bet on them at this time. I hope they get this sorted - they’re really good folks!


Thank you! I'm both sorry it didn't work out (because $$$$) and also glad we didn't create any agony for you. Someday, we hope to create mild irritation for you, though, if we can.


I am feeling similarly. We’ve got a few apps in fly and have convinced devs to use it for their side projects. We’re excited about the promise of fly and were considering the HIPAA plan.

But these stability issues actually make me more nervous about the fact that I’d have to manage my own postgres cluster and have to learn how to recover it in such an event. AWS RDS has made me soft!

Wishing you guys the best. We’ll still use fly for QA until a few of these issues are sorted out. And until there’s fully managed pg (first party or third party)


I appreciate that. I think we'll be in a really good spot in ~60 days.


Thanks for replying and saying that.


I love this update:

“ We are working to build a new Consul cluster with 10x the RAM. We aren't yet sure, but believe a routine DNS change might have created a thundering herd problem causing Consul servers to immediately increase RAM usage by 500%. This is not ideal.”

_This is not ideal._


Interestingly, Roblox went down for 73 hours due to a "unique" issue with Consul as well [1].

Great read on how the issue was approached, handled, and ultimately remediated.

[1] https://blog.roblox.com/2022/01/roblox-return-to-service-10-...


Most often the issues that take down a site are with core services like network routing, DNS and service discovery. Consul gets mentioned because it’s in that business and isn’t a standard so it gets called out specifically. Zookeeper, HAProxy and various cluster managers also get slagged for this stuff and yeah, sometimes it’s their fault but that’s what it means to be in that business.


https://github.com/hashicorp/consul/pull/12080 - this should be the Consul issue that brought down Roblox


Was affected by the outage. Didn't know about it so I thought it was just another crash on Fly.io.

Tried to restart our app from the command line, only to be told they had disabled the API. And there is no restart feature on their dashboard. So all I could do was watch flyio logs telling me that our apps were down.

Sigh.

We moved from Heroku to Fly.io only this January, and are already considering moving away from it. The reliability is miserable at best. And so many basic features are missing. Yes it's much cheaper than Heroku, but we ended up paying much more time/resource/money dealing with its glitches. Defeats the purpose why we used a PaaS in the first place.


I know blocking deploys sucks, I'm sorry. We disabled them to prevent otherwise healthy apps from going down. When Consul fails, we can't boot new app processes. The ones that are already running continue running. A restart is roughly the same as a deploy, in this respect.


At this point I'm not sure why one wouldn't use something like Hetzner and slap Coolify or Dokku or something else on it.


You're right. We've been on Fly.io for 6 months[1] and it's been nothing but pain. ~10 years ago I took a start-up off an EC2 distributed set-up and moved them to a simple Dokku & Linode single VPS infra (plus separate staging env - https://github.com/glassechidna/dokku-graduate). Most content was served from S3 via a CDN, so workload was light. That simple VPS set up was super reliable and served us well for over 5 years. We eventually outgrew the infra and deployed a K8s cluster (on AWS). I left a short while after we were acquired, but I believe the K8s infra is holding strong. Unfortunately, this latest generation of PaaS really aren't living up to expectations.

[1] We're using so little infra at present that we're within their free usage tier. However, I want to clarify that this isn't because we aren't willing to pay, we specifically want to pay for reliable managed offerings. That's actually the entire point! If Fly.io can deliver on their vision, we'd gladly be billed at 100x the current usage rates.


I'm waiting for Coolify's Kubernetes support, personally, I'd love to use it as a pseudo-managed service while still having much lower costs and higher uptime.


This. It’s way more performant too, because you can host DB and other services from the same machine.

You don’t need to orchestrate a complex cluster to serve thousands or even millions of users. You can scale to hundreds of gigs of memory on a single machine nowadays.


Ops from scratch are annoying compared to the theoretical niceness of just pushing up a docker image.

Though I think a lot of this is incidental to just not really knowing the deal, and ops from scratch mean you have to make a lot of tiny decisions like "OK how do I get this package over here, how do I set it up, do I wipe the VM on OS-level updates, do I need scripts for resetting the machine..." Having pre-made decisions for a bunch of questions means you aren't spending a bunch of time on tedious stuff when starting up a project.


I can push up a Docker image or git push to deploy just fine with Coolify or Dokku. I've been using them for my projects for a while with no trouble, plus they're cheaper and more performant than paid PaaS.


Even with coolify or dokku you still have to manage the machine, those tools just simplify deployment. You still have to take care of host-level security and maintenance. Which is a hassle when all you really want is to stick your app somewhere and have it run.


Yeah I mean the host level maintenance isn't really a big issue, I can already stick my app somewhere and have it run, after the initial host setup. Maintenance afterwards is also pretty minimal.


I have seen some issues around Consul these days.

As a person with no background in distributed systems, I am wondering why people choose Consul over alternatives. Are there features that etcd doesn't offer?


We chose Nomad and adopted Consul as a result. Nomad and Consul work well together.

I don't believe etcd would have been any better for us, though. Centralized service discovery that runs through raft consensus doesn't make a lot of sense for the things we need to do. And when I've had etcd blow up on me in the past, it's been similarly painful to recover from.


Most people only use etcd at small scale. If you try to store 10 or even 100GB in etcd you are going to run into uncommon problems.

Most people don't even know that the Kubernetes control plane by default has a hard limit on etcd size. It used to be 2GB, not sure what it is now.


I would need a citation on the kubernetes control plane having any such hard limit. etcd is its own little snowflake, and I could very easily imagine it having some bad default value like that, or even kubeadm improperly configuring it

However, related to that, for big-time clusters (q.v. https://news.ycombinator.com/item?id=35174655 and https://news.ycombinator.com/item?id=25907312) one should without question move events over into their own etcd cluster: https://openai.com/research/scaling-kubernetes-to-2500-nodes...


It's the etcd default: https://etcd.io/docs/v3.5/dev-guide/limit/#storage-size-limi...

There's also a max object size of 1MB on the apiserver side, I believe
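
A small illustration of keeping an eye on that limit, assuming etcd's clientv3 maintenance API: compare each member's on-disk size against the default quota and check whether a NOSPACE alarm has already fired. The endpoint is a placeholder.

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        endpoints := []string{"127.0.0.1:2379"} // placeholder endpoint
        cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        const quota = 2 << 30 // etcd's default backend quota, 2 GiB

        for _, ep := range endpoints {
            st, err := cli.Status(ctx, ep)
            if err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%s: db size %d bytes (%.0f%% of the default quota)\n",
                ep, st.DbSize, 100*float64(st.DbSize)/float64(quota))
        }

        // Once the quota is exceeded, etcd raises a NOSPACE alarm and the
        // keyspace goes read-only until it's compacted, defragged, and disarmed.
        alarms, err := cli.AlarmList(ctx)
        if err != nil {
            log.Fatal(err)
        }
        for _, a := range alarms.Alarms {
            fmt.Printf("alarm on member %x: %v\n", a.MemberID, a.Alarm)
        }
    }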


ByteDance replaces etcd with kubebrain [1], which is backed by their own KV store (TiKV seems also supported).

The single Raft group is the hard limit.

[1]: https://github.com/kubewharf/kubebrain


That's interesting, thanks for the link. I held out high hopes for pluggable KV in kubernetes for the longest time, but since that issue was closed WONTFIX I resigned my hopes

Heh, that kubebrain TODO is some "oh, really?"

* Guarantee consistence in critical cases

but I give them huge props for calling out Jepsen


Doesn't Consul have a similar storage limit, btw?

I have seen very few strongly consistent distributed KV stores that scale beyond 10GB.


It's the underlying DB limit (boltdb), which both etcd and Consul use.


I’ve run it with over 50G under heavy load (10k+ qps) and it was fine. It’s pretty sensitive to disk latency though


Raft is amazing and totally frustrating.

I think I understand how you're using it, and I'm curious if you've considered how the AWS STS API solves its cross-region syncing.


Thanks for the answer!

AFAIK doesn't Consul also use Raft?


Etcd is really only for basic config.

If you want apps to discover each other and be able to communicate effortlessly, even across datacenters, Consul, in theory, enables this.

I say in theory because I couldn't get federated Consul actually working.
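
In Go-client terms, the "in theory" workflow is roughly this, assuming a local agent and (for the cross-datacenter part) a federated cluster; the names, addresses, ports, and datacenter are placeholders.

    package main

    import (
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // Register this instance with a health check; the local agent reports it
        // to the servers and it becomes discoverable via DNS and the HTTP API.
        err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
            ID:      "web-1",
            Name:    "web",
            Address: "10.0.0.5",
            Port:    8080,
            Check: &api.AgentServiceCheck{
                HTTP:     "http://10.0.0.5:8080/health",
                Interval: "10s",
                Timeout:  "2s",
            },
        })
        if err != nil {
            log.Fatal(err)
        }

        // Cross-datacenter discovery only works if the clusters are federated;
        // "ord" is a hypothetical remote datacenter name.
        entries, _, err := client.Health().Service("web", "", true, &api.QueryOptions{Datacenter: "ord"})
        if err != nil {
            log.Fatal(err)
        }
        for _, e := range entries {
            log.Printf("web in ord: %s:%d", e.Service.Address, e.Service.Port)
        }
    }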


Discovery isn't so hard a problem that you should cede your agency to an external party like HashiCorp.

I used Consul for a clustered service once; it was worth it for bringup. But when I had problems I just wrote one in a couple days, since I'd done so several times before, and it didn't fail in all the years that product was running.


I attempted to deploy a simple app on Fly a couple of weeks ago, but porting it from heroku became a nightmare, servers crashing, cryptic error messages, etc. Maybe I'm in the minority but in any case my experience with Fly definitely left me questioning the hype around it.


There are really only a few frameworks where our experience approaches Heroku. And even for those, it's only the newest versions. Phoenix, Rails, Laravel, and Remix are all pretty seamless to launch.

Most others require pretty decent Docker knowledge.


Respect to anybody who is an SRE at fly.io. Couldn’t pay me enough to do that job


Markdowns forged in god-steel coming out from this incident


They just hired their first if I recall correctly. I feel for their customers more than I do for their shareholders


We've scaled infra ops from 3 to 7 people in the past few weeks. Our very first VP was a VP Infra Ops, because that's the thing we have to get best at to succeed as a business.

Note that we grew the whole company from 25 to 60 over the last six months.


No offense but it's kind of wild to me that y'all had 3 infra operations people out of 60 hires.

As someone who had to do SRE-style work in a smaller company for a long time despite ostensibly being a backend dev, the institutional knowledge you get from "real" SRE people is so valuable, and makes me a bit hopeful for the future.


It was really 7 out of 60, but yes. We ran into two problems: 1) the infra ops jobs is pretty intensely difficult here and 2) actually building an infra ops _org_ instead of just hiring individual engineers and overloading them seemed important.


You might want to slow down on hiring; more people doesn't equal solving the problem faster or better. It could be better to queue new sign-ups to your service for a while, even if it's painful.


You've been around for a decade, I'm not giving you pity points. Can you provide what we want or not?


Not right now, no. We've been running this product for about 2.5 years.


So whose product have I been using?



Oh yes, I should edit. We launched this product a little over 2.5 years ago. We built a bunch of different things no one really wanted before that.


You should edit, but that's not my puppet to control, I only use this one.

So yeah, you've built a ton of shit nobody wanted as a product, been there, done that. You've convinced me fly doesn't fit business, we're done.


I’m rooting for Fly. I use them myself for a project, and love the service.

However, their transparency into outages and service rough edges is a double-edged sword: they’re building a reputation for unreliable software. It’s a shame to see this major outage happen right after last week’s post, it almost confirms the stereotype.

However, even with these flaws, I still think they’re building the best hosting out there. They’re taking bold risks and doing what others aren’t. I wish them the best.


> they’re building a reputation for unreliable software

This is a terrific way to word what might be happening unconsciously.

Fly posts about how hard things are during and after service outages -- while I also love the transparency, most people don't want to 'be a passenger on a plane that's being built while it's flying' especially when it comes to their business, myself included.


> We are working to build a new Consul cluster with 10x the RAM.

Oh boy. I wouldn’t wanna be the people doing this. Working with infrastructure is hard. Doing it under tight SLAs? Ugh. I really hope the people working on this are being well supported.


You wouldn't necessarily know this from the outside, but we have _exceptional_ internal support when things go sideways. This is relatively new, up until about two months ago most incidents were run by 1.5 people. We had 7 people working this one today.


I don't really know him, but from what I can tell, https://github.com/wjordan is at least equivalent to 2.0 people.


Accurate.


1. fly.io SLA only covers users on the Enterprise plan

2. The SLA fly.io commits to is 99.9% uptime, meaning they can "afford" ~1.5m of downtime daily, or ~43m monthly. AWS "offers" 99.99% (~4m monthly) if I recall correctly, but their scale is also wildly different, obviously.
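
The downtime-budget arithmetic, for anyone who wants to sanity-check those numbers (a 30-day month is assumed):

    package main

    import (
        "fmt"
        "time"
    )

    // budget returns the allowed downtime for a given SLA percentage over a window.
    func budget(slaPct float64, window time.Duration) time.Duration {
        return time.Duration((1 - slaPct/100) * float64(window))
    }

    func main() {
        day := 24 * time.Hour
        month := 30 * day

        fmt.Println("99.9%  daily:  ", budget(99.9, day).Round(time.Second))    // ~1m26s
        fmt.Println("99.9%  monthly:", budget(99.9, month).Round(time.Minute))  // ~43m
        fmt.Println("99.99% monthly:", budget(99.99, month).Round(time.Minute)) // ~4m
    }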


That's the issue with centralized infra... I expect it to be less and less stable the more customers they have. I still wish them good luck.

On my side I took the opposite direction, each workload is shared nothing.


They seem to have a lot of issues with Consul, is it the design of Consul or the way they use it that is the problem?


Both, though the latter is likely due to marketing/promises from HashiCorp. Consul (and the entire hashicorp stack, really) is overengineered, under-optimized, and generally terrible to use at any scale beyond "small".


And yet the comment from a member of the actual team in question underneath says the opposite…


The design of Consul is wrong for what we need to do. Consul has been pretty good when it's running, but it's a huge pain in the ass to recover when it falls over. And when it does fall over, it's usually with no notice.


Roblox had a massive 3-day outage [1] in October 2021 due to a Consul feature that didn’t work as expected.

My gut with Consul is don’t use it for high-load distributed services.

[1] https://blog.roblox.com/2022/01/roblox-return-to-service-10-...


The Roblox outage seemed like a pretty one-off instance due to a hard to catch bug. Consul still seems like a great choice and it looks like Roblox continues to use it at their scale.


The latter (they openly admit as such)


They say that, but they're also being actively supported by Hashicorp right now (one would presume), so they really need to maintain a good working relationship.

I don't have a relationship with Hashicorp, and have tried using Consul. Everything about it is amazing in theory, but you might need a few years of experience with kube, consul, go, and maybe even the hashicorp stack to even begin debugging when things don't work as advertised.

I still think my company is going to take another stab at consul in the future, because we do need service discovery. But they're advertising a solution to an incredibly hard problem with a shit ton of variations in network topology and infra that it should (theoretically) work on. I imagine if you stay on the happy path everything works out just fine with Consul (even then, maybe only most of the time). The problem is that they don't spell out what the happy path is, and that all the other knobs they expose off to the side are actually down paths beleaguered by dragons.


To add a data point we've been using Consul globally for several years now without any major outages. We do close to 50k qps with Consul at peak running on single digit cores per DC.


50k qps on consul or system wide?


Consul specifically


Out of interest why would one presume that they are being actively supported? I haven’t read everything about this saga, but I’ve never seen any mention of a commercial relationship.


If you're on a slowly-sinking ship, it'd be silly not to at least try to bring in someone with the full context of the ship's architecture to get it serviceable before going all-in on the decision to engineer and build an entirely new kind of ship while still aboard the sinking one.


That’s not really what I asked: do you have any actual evidence of a commercial relationship rather than the notion that it wouldn’t be a bad idea?


The former, ish -- they relied on Consul marketing that the hammer fit the square hole. Hashicorp has been pretty bad about marketing themselves as the right tool for any job, but they really only fit the narrowest of tasks before you find yourself needing an alternative or being compelled to buy a support contract.

It's atlassian from Arkansas, just faster


Can you cite this? Because it seems like the opposite is true: https://news.ycombinator.com/item?id=35048318


You seem to have cited exactly why I don't recommend my clients use a hashi stack, it seems like you've failed to make a point?


> I have only positive things to say about every HashiCorp product I've worked with since I got here.

You're making a claim that Hashicorp sold themselves as being a solution for problems that they can't solve. But the comment from an actual Fly.io employee suggests that isn't the case. They're stating that Fly.io pushed the product beyond its limits, and they don't seem to be projecting any of that as being the fault of Hashicorp the company or of their products.


You literally started with a quote that I didn't write.

Hashi is dishonest and diminutive, providing products that generally should've been written off as a loss

Edit: We're apparently not allowed to interact beyond three replies: I have no beef with any hashi product that is satisfactory.

Your reply quotation explanation is unsatisfactory, you tried to quote something into a thread in the most irresponsible way you could - got called out for it and tried to top post to make it work.

I have never had a client using a hashi stack that was happy about it: on price, quality, or reliability, it's a failure.

I don't begrudge their work; their work is just subpar. Quote that if you'd like. I won't interact with someone that starts with fraudulent misrepresentation.


It’s a quote from the comment I linked, from the Fly.io employee, describing their opinion of Hashicorp products based on their usage of those products at Fly.io.

It seems pretty clear you have beef with Hashicorp and with their products. It’s entirely possible you’re right. But your original claim, that I was replying to, attempted to answer a question about why Fly.io was experiencing issues with Hashicorp products. And your answer doesn’t line up with clear public statements from Fly.io staff.


Why not both?


"This impacts queries to our API, including creating and modifying apps, as well as incoming network requests for recently deployed apps."

Would be really interested to understand why it affects recently deployed apps but not apps that are already established - something to do with how the Fly Router works?


We still pipe service discovery through Consul, we just propagate it with a different, gossip based mechanism. Services are stored in local sqlite DBs on every host that runs our Proxy. They are designed to keep running, even when we can't get updates to them.

This outage prevented us from writing services to Consul, so we couldn't read them back out. Nomad will only really write service information to Consul, so we're kind of stuck with Consul in the loop until we're fully off Nomad.
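
As a toy illustration of that "local sqlite DB per proxy host" idea (not Fly's actual schema; the table layout, driver, and update source are assumptions): writes land whenever a propagation update arrives, and the read path keeps answering from the last known state even if updates stop.

    package main

    import (
        "database/sql"
        "fmt"
        "log"
        "time"

        _ "github.com/mattn/go-sqlite3" // assumed driver choice
    )

    func main() {
        db, err := sql.Open("sqlite3", "services.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS instances (
            app        TEXT,
            addr       TEXT,
            port       INTEGER,
            updated_at INTEGER,
            PRIMARY KEY (app, addr, port)
        )`); err != nil {
            log.Fatal(err)
        }

        // Write path: called whenever a propagation/gossip update arrives.
        upsert := func(app, addr string, port int) error {
            _, err := db.Exec(`INSERT INTO instances (app, addr, port, updated_at)
                VALUES (?, ?, ?, ?)
                ON CONFLICT(app, addr, port) DO UPDATE SET updated_at = excluded.updated_at`,
                app, addr, port, time.Now().Unix())
            return err
        }

        // Read path: the proxy answers from local state, so it keeps routing
        // even when the upstream source of updates is unavailable.
        lookup := func(app string) ([]string, error) {
            rows, err := db.Query(`SELECT addr, port FROM instances WHERE app = ?`, app)
            if err != nil {
                return nil, err
            }
            defer rows.Close()
            var out []string
            for rows.Next() {
                var addr string
                var port int
                if err := rows.Scan(&addr, &port); err != nil {
                    return nil, err
                }
                out = append(out, fmt.Sprintf("%s:%d", addr, port))
            }
            return out, rows.Err()
        }

        if err := upsert("my-app", "10.0.0.5", 8080); err != nil {
            log.Fatal(err)
        }
        addrs, _ := lookup("my-app")
        log.Println("known instances:", addrs)
    }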


From my experience etcd would have been a better choice for maturity if they don't need the gossip stuff.


This shit is hard. Running a cloud service at one of the Big 3 is hard, I can’t imagine doing it with such a small team with your own infra.



