This post outlines why Nomad is an ideal orchestrator for WebAssembly and wasmCloud, and how we created Netreap to run Cilium in our Nomad clusters alongside the rest of our infrastructure. In my next post, I'll walk you through how to run Cilium on a Nomad node and show how Netreap performs in practice.
You can watch my recent HashiTalks session on this subject, Running Cilium with Nomad, here:
Cilium is perhaps the best solution for managing network policies and securing workloads running on Kubernetes clusters, which is why it is now a CNCF incubating project. As a CNI, it provides everything you'd want and more. The only problem is: what do you do if you want to run Cilium but aren't running Kubernetes? And why would you want to run Cilium outside Kubernetes in the first place?
Building for a distributed future
First, let's look at why we decided to build wasmCloud (a CNCF Sandbox project) and the Cosmonic Platform on HashiCorp Nomad. Don't get me wrong: we love Kubernetes, and we've shown WebAssembly works harmoniously with Kubernetes in large organizations. The decision to underpin wasmCloud and the Cosmonic PaaS with Nomad, however, was born of our core mission to build applications capable of running in any environment - in multiple clouds, on bare metal, and on devices at the far network edge.
One of our primary requirements when building the Cosmonic platform was running customer workloads as securely as possible, so we started with the goal of running everything in lightweight Firecracker VMs for security and portability. Another important factor was that Kubernetes doesn't work smoothly with Firecracker, requiring a separate containerd deployment. That felt like more work than it was worth.
A ton of work has gone into edge Kubernetes development - K3s, MicroK8s, and KubeEdge come to mind. Despite this, several barriers remain to making Kubernetes workable in edge scenarios, not least how tricky it is to manage edge hardware, the setup required to deploy applications on bare metal (OS configuration and so on), and the difficulty of scaling and managing fleets of edge devices with Kubernetes.
On top of all this, running Kubernetes yourself, rather than relying on one of the major cloud providers, requires a significant up-front investment in deciding how to build your cluster: which CNI to use, which ingress controller to run, how to manage secrets, and so on. With Nomad, deployment is easy: as of Nomad 1.3, all you actually need is a single binary that you deploy to your edge cluster and away you go. That said, we still use Consul and Vault with our Nomad cluster, since we rely on specific features of both.
To be sure, Kubernetes shines when you're building a platform on top of it, because it's designed to let you, as an operator, write code that integrates deeply with its APIs. In highly distributed environments, however, there's a need for a better abstraction for running arbitrary workloads, without being tied to Docker containers or specific CRIs.
This is where Nomad comes in. It eases three main friction points: it simplifies distributed deployments, brings operational efficiency at the edge, and excels at managing heterogeneous environments. Whether running on bare metal, VMs, containers, or even Raspberry Pis, applications at the edge need the lifecycle management, security rigor, fault tolerance, and scalability that Kubernetes struggles to provide without a lot of work.
Netreap: Re-imagining Cilium for Nomad
We wanted to run Cilium on Nomad but, while several decent abstractions exist, Cilium is purpose-built for Kubernetes. To make this possible, we needed to re-implement the Cilium operator, a component that works with Cilium, the Kubernetes APIs, and a set of CRDs to manage the lifecycle of endpoints and policies.
When working in Kubernetes, Cilium cleans up after itself and provides a host of useful metadata needed to build policies. Essentially, Netreap was created to bring all the benefits of Cilium to Nomad. Cilium has a built-in state abstraction backed by a key-value store: etcd on Kubernetes, or Consul outside it. Netreap takes advantage of this, joining the Cilium data stored in Consul with what is actually running in Nomad. It runs as a task on every node in the Nomad cluster and synchronizes all the network policies we put into Cilium to manage that traffic.
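To make the shape of that deployment concrete, here is a minimal sketch of what running an agent like Netreap as a Nomad system job could look like. The image reference, flags, and environment variable values are illustrative assumptions, not the project's actual configuration:

```hcl
job "netreap" {
  # A system job runs one instance on every client node,
  # roughly equivalent to a Kubernetes DaemonSet.
  type = "system"

  group "netreap" {
    task "netreap" {
      driver = "docker"

      config {
        # Hypothetical image reference for illustration.
        image        = "ghcr.io/cosmonic/netreap:latest"
        network_mode = "host"
      }

      env {
        # Point the agent at the local Consul instance that
        # backs Cilium's key-value store.
        CONSUL_HTTP_ADDR = "http://127.0.0.1:8500"
      }
    }
  }
}
```

Because the job type is `system`, Nomad schedules the task on every node that joins the cluster, which is exactly the placement an operator-style agent needs.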
What is Netreap designed to do?
Sync Cilium Network Policies
One of the most compelling features of Cilium is its ability to secure network traffic in your cluster using programmable policies. As in Kubernetes, we want to be able to apply a policy to a particular workload that dictates what it can and cannot communicate with. Effectively, Cilium provides a programmable firewall for containers. With Cilium, we can bring in the kind of functionality you might otherwise find natively in a Nomad cluster using Consul Connect, where you can restrict traffic and communication between services.
You can get fairly granular here: Cilium allows us to write policies that restrict traffic at every layer of the networking stack, including limiting traffic to particular CIDR ranges or even specific DNS names.
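As a rough illustration of that granularity, a standalone Cilium policy (the JSON form Cilium accepts outside Kubernetes) might look like the sketch below. The labels, selector, CIDR range, and domain name are all made up for this example:

```json
[
  {
    "labels": [{ "key": "name", "value": "web-egress" }],
    "endpointSelector": { "matchLabels": { "app": "web" } },
    "egress": [
      { "toCIDR": ["10.0.1.0/24"] },
      { "toFQDNs": [{ "matchName": "api.example.com" }] }
    ]
  }
]
```

This rule selects endpoints labeled `app=web` and allows their outbound traffic only to one internal CIDR range and one DNS name, dropping everything else.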
While network policies work out of the box with Cilium, the Cilium operator normally takes care of syncing them to every node in a Kubernetes cluster. Replicating that behavior in Nomad is one of Netreap's primary responsibilities: Netreap watches a Consul KV key containing the full JSON payload of all the Cilium policies you want applied to the cluster, and automatically keeps that data in sync on every node.
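The core of that sync step is a reconciliation: compare the desired policy set read from the Consul KV key against what is currently applied on a node, then import what's missing and delete what's stale. Here's a minimal sketch of that logic in Go, with a simplified `policy` type and a hypothetical `diffPolicies` helper (Netreap's real implementation differs; this only illustrates the idea):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// policy is a minimal stand-in for a Cilium policy rule; real rules
// also carry selectors and ingress/egress sections.
type policy struct {
	Labels []string `json:"labels"`
}

// diffPolicies compares the desired policy set (the JSON payload read
// from the Consul KV key) with the policy labels currently applied on
// this node, and returns which labels to import and which to delete.
func diffPolicies(desiredJSON []byte, applied []string) (toImport, toDelete []string, err error) {
	var desired []policy
	if err := json.Unmarshal(desiredJSON, &desired); err != nil {
		return nil, nil, err
	}
	want := map[string]bool{}
	for _, p := range desired {
		for _, l := range p.Labels {
			want[l] = true
		}
	}
	have := map[string]bool{}
	for _, l := range applied {
		have[l] = true
		if !want[l] {
			// Applied on the node but no longer desired: remove it.
			toDelete = append(toDelete, l)
		}
	}
	for l := range want {
		if !have[l] {
			// Desired but not yet applied: import it.
			toImport = append(toImport, l)
		}
	}
	return toImport, toDelete, nil
}

func main() {
	payload := []byte(`[{"labels":["allow-dns"]},{"labels":["allow-internal"]}]`)
	toImport, toDelete, err := diffPolicies(payload, []string{"allow-internal", "stale-rule"})
	if err != nil {
		panic(err)
	}
	fmt.Println(toImport, toDelete) // prints [allow-dns] [stale-rule]
}
```

In practice the watch side would use Consul's blocking queries, so the loop wakes up only when the KV key's modify index changes rather than polling on a timer.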
The other big thing for us is that Cilium provides transparent encryption in transit: it can secure connections between all Cilium nodes using WireGuard or IPsec. We use WireGuard, which is performant and easy to use. With Cilium, we can encrypt data in transit without needing to manage TLS ourselves.
When it comes to cluster health, Cilium is great. It health-checks the individual nodes in a cluster, constructing a topology of everything running at any given time and performing checks against every endpoint. However, the Cilium operator normally handles cleaning up stale IP allocations and removing departed nodes from the Cilium cluster, so we built Netreap to provide this functionality in Nomad. This matters because if you don't periodically remove old nodes, health checks permanently fail for nodes that no longer belong to the cluster, producing delayed responses that make the cluster look unhealthy and obscuring its overall state. Automating this cleanup is key to maintaining performance.
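The reaping side of that cleanup reduces to a set difference: any node Cilium still knows about that Nomad no longer lists is stale and should be removed. A small sketch, with a hypothetical `staleNodes` helper standing in for Netreap's actual reaper:

```go
package main

import "fmt"

// staleNodes returns the Cilium nodes that no longer appear in the
// Nomad cluster and should be removed, so health checks against them
// stop failing permanently.
func staleNodes(ciliumNodes, nomadNodes []string) []string {
	alive := make(map[string]bool, len(nomadNodes))
	for _, n := range nomadNodes {
		alive[n] = true
	}
	var stale []string
	for _, n := range ciliumNodes {
		if !alive[n] {
			stale = append(stale, n)
		}
	}
	return stale
}

func main() {
	cilium := []string{"node-a", "node-b", "node-c"} // nodes Cilium tracks
	nomad := []string{"node-a", "node-c"}            // nodes Nomad reports alive
	fmt.Println(staleNodes(cilium, nomad)) // prints [node-b]
}
```

Running this comparison periodically (or on Nomad node events) keeps the Cilium view of the cluster aligned with what Nomad actually reports.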
In the next blog, I'll dive into the practical steps you can take to run Cilium on a Nomad node, and how Netreap takes this a step further. If you don't want to wait, watch the full demo here.