Nomad by HashiCorp is an interesting alternative for workload orchestration. The project reached version 1.0 in late 2020, and after checking out its current feature set I decided to pull the trigger and migrate all of my side projects, tools, and whatnot onto new infrastructure based on Nomad, backed by Consul and Vault, and even give Terraform another try. So HashiCorp all the way in ;).
The process is still ongoing - I have realized that my infra setup is around 2 years old already, and during that time I set up a lot of different projects. Some of them are tricky to update or move - e.g. my Crystal app was created in the Crystal 0.3.x era and basically nothing works nowadays. But that’s a different story for a different post. Here are my first impressions of Nomad itself.
Deployment without tears with some pain
Nomad is a single binary, so in theory deployment should be painless - there is even a pretty nice Ansible playbook for doing so. Yet when you need more fine-grained configuration plus Consul and Vault integration, it suddenly becomes a little trickier than advertised. Dealing with unsealing Vault, configuring all the tokens, wrapping your head around the ACL concepts and whatnot - it’s a lot to take in. Once it works, it works though.
Nice UI with nice CLI
The Nomad UI is pretty slick, and the single binary also acts as the CLI - once you deal with all the authorization hassle, it’s cool to deploy new jobs straight from the terminal. The UI provides just enough insight into what’s going on across the whole cluster, individual servers, and even individual tasks. So you don’t have to throw a ton of other tools on top just to get started.
No networking, no problem
Nomad doesn’t do a lot in terms of networking - but personally, I like that. I’m running a trusted environment, so I can just leverage the internal networking that my hosting provider offers (so my nodes can talk to one another). Additionally, I have decided to use their load balancer and let them deal with the High Availability problem, while I route all traffic internally using Traefik, which runs on every node.
Simple yet powerful job declaration
Once you wrap your head around the ideas of jobs, groups, and tasks, the whole thing just clicks. When you add Consul and Vault, things become even more powerful - by using service discovery provided by Consul and secrets provided by Vault you can generate dynamic templates/env, and Nomad will restart/reload your app when it detects changes. Allocating resources seems weird at first, but then again you can have confidence that one misbehaving task won’t take down your whole cluster.
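To make the job / group / task nesting a bit more concrete, here is a minimal Python sketch that builds a payload in the shape of Nomad's JSON job format. It is not taken from my actual setup: the `build_job` helper, the `whoami` job name, the image, the `dc1` datacenter, and the resource numbers are all placeholders I made up for illustration.

```python
import json


def build_job(job_id, image, cpu_mhz=200, memory_mb=128, count=1):
    """Build a job payload shaped like Nomad's JSON job format.

    Hypothetical helper: one job containing one group containing one task.
    """
    # A task is the unit that actually runs (here: a Docker container).
    # Its resources are reserved up front, which is what keeps one
    # misbehaving task from starving the rest of the node.
    task = {
        "Name": job_id,
        "Driver": "docker",
        "Config": {"image": image},
        "Resources": {"CPU": cpu_mhz, "MemoryMB": memory_mb},
    }
    # A group is a set of tasks that are always placed on the same node.
    group = {"Name": job_id, "Count": count, "Tasks": [task]}
    # The job is the top-level unit you submit to the cluster.
    job = {
        "ID": job_id,
        "Name": job_id,
        "Type": "service",
        "Datacenters": ["dc1"],  # assumed datacenter name
        "TaskGroups": [group],
    }
    return {"Job": job}


if __name__ == "__main__":
    # Placeholder job name and image, purely for illustration.
    print(json.dumps(build_job("whoami", "traefik/whoami"), indent=2))
```

In day-to-day use you would normally write the same structure as an HCL job file instead; the point here is only the hierarchy and where the resource limits live.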
CSI Volumes in practice - a huge disappointment
Support for CSI volumes sounded like heaven - FINALLY, I can run stateful jobs with something close to an HA setup - once a node goes down, the job is reallocated to another node that picks up the volume and everything just continues to work - brilliant!
In practice it was a huge disappointment - because the whole concept is complex, it is still quite buggy. An unclean Nomad shutdown can leave volumes in a zombie state where they are never re-attached properly, which defeats the purpose of the whole thing. Such a bummer!
For the time being I have just pinned volumes to particular nodes, which again kinda defeats the purpose, but at least as long as the node is up and running it works without surprises. Hopefully, I will revisit this problem soon-ish.
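One way such pinning can be expressed is a job-level constraint on the node name. The sketch below is an assumption about how it could be done rather than a copy of my configuration: `pin_to_node` is a hypothetical helper, `node-1` is a made-up node name, and the payload follows the same JSON job shape Nomad's API accepts.

```python
import copy


def pin_to_node(payload, node_name):
    """Return a copy of a job payload constrained to a single named node.

    Hypothetical helper; the input payload is left untouched.
    """
    pinned = copy.deepcopy(payload)
    # ${node.unique.name} is a Nomad node attribute; "=" requires an exact match.
    constraint = {
        "LTarget": "${node.unique.name}",
        "RTarget": node_name,
        "Operand": "=",
    }
    pinned["Job"].setdefault("Constraints", []).append(constraint)
    return pinned


if __name__ == "__main__":
    # Placeholder job and node name, purely for illustration.
    job = {"Job": {"ID": "postgres", "TaskGroups": []}}
    print(pin_to_node(job, "node-1")["Job"]["Constraints"])
```

The trade-off is exactly the one described above: the job will only ever be placed on that node, so if the node goes away the job stays pending instead of moving.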
Lack of resources and sometimes outdated docs
The Nomad community is not that huge, the docs can sometimes be outdated, and various blog posts can be misleading. I hope this project will get more and more attention because it’s a great alternative to what’s on the market - and the market is not that huge, as k8s is eating the world. Before Nomad I even revisited Rancher (backed by k3s) and I hated what they have done to it in version 2.x. I very much preferred 1.x with their Cattle orchestration - it seemed so much simpler and, most of all, reliable! (Don’t even get me started on k3s randomly crashing on me.)
Nomad seems to hit the sweet spot - you can fairly easily start with a small setup and potentially scale it to hundreds of servers if needed.
That’s it for now when it comes to first impressions. I hope that once I manage to move all of my tooling to the new setup I will be able to provide more hands-on tips and tricks, and list potential pitfalls.