Ansible has been a good choice for server configuration and application stack deployment so far. However, drawbacks have emerged in the lifecycle and deployment process:
- Lifecycle is managed with init systems such as systemd and OpenRC. Not only are they brittle to use and complicated to configure (OpenRC has been a nightmare in some cases), but they also cannot perform seamless upgrades without downtime (this is less of a concern for the homelab).
- Ansible does not scale well with regard to the number of services that are deployed. Every new service increases the deployment time linearly, as Ansible runs the tasks of a play one after another on each host by default. This is annoying during testing and makes prod releases take longer the more services there are, leading to a significant scaling problem.
- The complexity of the setup is another annoyance in production, as each service is deployed with its own individual user. This choice was made to get true separation of permissions at the Linux user level. It was overkill from the start and was mostly done because it is "free" during Ansible execution and irrelevant during testing. In prod, this makes debugging more complicated and the benefit is very slim (it only pays off if there is both a container isolation flaw and a malicious application running).
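The scaling argument above can be made concrete with a small model. The following is only a sketch: the per-service durations are made-up numbers, and `workers` is a hypothetical stand-in for how many services the orchestrator starts at the same time.

```python
import heapq


def sequential_deploy_time(durations):
    """Total time when services are deployed one after another."""
    return sum(durations)


def parallel_deploy_time(durations, workers):
    """Total time when up to `workers` services are started concurrently.

    Greedy model: each service goes to whichever worker frees up first.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if not durations:
        return 0
    # Each heap entry is the time at which one worker becomes free again.
    slots = [0] * min(workers, len(durations))
    heapq.heapify(slots)
    for duration in durations:
        earliest_free = heapq.heappop(slots)
        heapq.heappush(slots, earliest_free + duration)
    return max(slots)


# Hypothetical deployment durations in seconds, one entry per service.
example_durations = [40, 25, 60, 30, 45]
```

With these numbers the sequential model needs 200 seconds, while the parallel model with five workers is bounded by the slowest service (60 seconds). Adding a service always adds its full duration in the sequential case, but only matters in the parallel case if it becomes the new bottleneck.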
The solution:
- Ansible for node management and setup (this stays the same).
- Kubernetes for easier lifecycle management and parallel service deployment.
Benefits:
- Easier releases, as only the new diff needs to be applied. Containers are started in parallel and not all containers need to be redeployed. Restarting all of them only makes sense if all applications need to be redeployed, which only happens during version bumps.
- Simpler testing by only needing to apply a diff to a long-running k8s instance (something like k3d).
- Simpler deployments by only needing to run the Ansible setup if there has been a change in the server configuration, not the application stack.
- Easier monitoring due to all applications running in one environment.
- Easier user management as there will no longer be one user per application stack, but rather one namespace in the same cluster.
- Better community support for performing incremental backups.
- Applications become node agnostic and can be executed everywhere.
- The nodes get more tightly coupled and can communicate more easily with one another.
- Potentially (has to be fully thought through) only one reverse proxy for the cluster, not one per node.
- Simpler integration of additional nodes, such as a NAS for storing all data.
- Easier local access to the cluster without the need to ssh into the machine first.
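The "only apply the diff" idea from the list above can be sketched as follows. This is an illustration, not how k8s implements it internally: in practice `kubectl apply` and `kubectl diff` do this work. The service names and manifest contents in the usage example are hypothetical.

```python
import hashlib
import json


def manifest_digest(manifest):
    """Stable fingerprint of a manifest, independent of key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_release(deployed, desired):
    """Compare deployed and desired manifests, keyed by service name.

    Returns (to_apply, to_remove): services that are new or changed,
    and services that no longer exist in the desired state.
    Unchanged services appear in neither list and are left running.
    """
    to_apply = sorted(
        name
        for name, manifest in desired.items()
        if name not in deployed
        or manifest_digest(deployed[name]) != manifest_digest(manifest)
    )
    to_remove = sorted(name for name in deployed if name not in desired)
    return to_apply, to_remove


# Hypothetical state: "b" gets a new image, "c" is dropped, "d" is new.
deployed_state = {
    "a": {"image": "example.org/a:1"},
    "b": {"image": "example.org/b:1"},
    "c": {"image": "example.org/c:1"},
}
desired_state = {
    "a": {"image": "example.org/a:1"},
    "b": {"image": "example.org/b:2"},
    "d": {"image": "example.org/d:1"},
}
```

Here `plan_release(deployed_state, desired_state)` touches only `b`, `c` and `d`, while `a` keeps running untouched. A version bump that changes every manifest would put every service into `to_apply`, which matches the full redeploy case described above.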
Drawbacks:
- Less security due to the missing user isolation. Exploiting this would require both a flaw in k8s or the underlying container engine and a compromised application running in the cluster. This is highly unlikely.
- Higher complexity due to now relying on k8s as a container orchestrator instead of simple Podman containers.
- Single point of failure. Cascading failures are less likely in the Podman setup than in k8s. One bad cluster config could take down the entire cluster, not just one application.
- Test environment no longer equivalent to prod. When only testing the application stack without a VM, the test environment is not representative of production.
- Test environment gets more complicated. Currently there is only one setup with a VM. After the switch there will be a cluster test setup, a setup for the individual VMs and a setup for testing the entire cluster with VMs (prod representation with VMs).
- Higher resource demand, especially on the Raspberry Pi.
The benefits, especially the quicker deployment time, simpler management from the local computer and tighter coupling between the nodes, outweigh the drawbacks and are the reason this switch will be made.
The path forward:
- Finish setting up the services that are currently in progress, except for the media stack. It would require engineering that is thrown away with this switch and is thus not worth it.
- Evaluate if the switch to Debian still makes sense or if a switch to Talos Linux is an even better approach.
- Implement a PoC for one service. Ideally one with multiple containers and access to the file system.
- Implement tooling and the infra to run everything locally.
- Switch setup to k8s first.
- Switch prod to use k8s. This marks the end of the switch and the initial setup. Monitoring, backups and other tooling will be done in later steps.
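As a starting point for the PoC step, a service with multiple containers and file system access could be described roughly like this. It is a minimal sketch under assumptions: the application name, images, namespace and host path are placeholders, a `hostPath` volume is only one of several ways to get file system access, and the manifest is generated as JSON (which `kubectl apply -f` accepts) to keep the example in one language.

```python
import json


def build_deployment(name, namespace, containers, host_path):
    """Build an apps/v1 Deployment with several containers sharing one volume.

    `containers` maps container name to image. All containers mount the
    same host directory at /data (the mount path is an assumption).
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": container_name,
                            "image": image,
                            "volumeMounts": [
                                {"name": "data", "mountPath": "/data"}
                            ],
                        }
                        for container_name, image in containers.items()
                    ],
                    "volumes": [
                        {"name": "data", "hostPath": {"path": host_path}}
                    ],
                },
            },
        },
    }


# Hypothetical PoC service: one namespace per application stack.
example_manifest = build_deployment(
    name="example-app",
    namespace="example-app",
    containers={
        "web": "example.org/web:1.0",
        "worker": "example.org/worker:1.0",
    },
    host_path="/srv/example-app",
)

if __name__ == "__main__":
    print(json.dumps(example_manifest, indent=2))
```

Using one namespace per application stack mirrors the plan to replace per-application users with namespaces. The namespace itself has to exist before the manifest is applied.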