github.com/jamiehannaford/coreos-reboot-operator

Kubernetes operator for managing CoreOS node upgrades, by Jamie Hannaford. License: GPL-3.0

go get github.com/jamiehannaford/coreos-reboot-operator

CoreOS reboot operator

NOTE: This codebase has been deprecated in favour of CoreOS's official operator.

A Kubernetes operator that manages the reboot cycle for CoreOS nodes. Normally, when a node self-updates, it must be rebooted for the changes to take effect. This has traditionally been done either by manual intervention or by tools like locksmith. Although the latter works very well, it does not offer the full programmatic extensibility needed by organizations that require high availability for their Kubernetes clusters.

This project was inspired by Aaron Levy's KubeCon talk and is heavily based on his demo controller repository. Although it has been verified to work, it is still very much alpha, so it is advisable to use it in development environments only.

How it works

The operator is composed of two components: the controller, which synchronizes reboots to ensure the cluster is not negatively impacted; and the agent DaemonSet, which listens for reboot requests via systemd and performs the reboot itself.

This is the lifecycle of a reboot:

  1. The update engine detects that a new update is available, then downloads and installs it. When the self-installation completes, the engine signals this by updating its status to UPDATE_STATUS_UPDATED_NEED_REBOOT.
  2. The operator listens on a DBus interface for this state change. When it detects that a reboot is needed, it tags the Kubernetes node with a reboot-needed annotation.
  3. The controller uses an informer to fire hooks when node resources are updated. When it sees that a node is marked for reboot (i.e. it has a specific annotation), it performs a series of checks to make sure the operation is permitted - for example, it enforces a node quota, ensuring that only a specific number of nodes are rebooted at once. If these conditions pass, it approves the reboot by annotating the node accordingly.
  4. The agent also uses an informer to listen for node state changes. Once the controller gives the green light, the agent cordons the Kubernetes node, preventing further pods from being scheduled on it. It then gracefully deletes the pods running on the node. Once this is done, it sends a reboot command over DBus and the node reboots.
  5. After the reboot, the agent marks the node as schedulable again and removes any reboot annotations.
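The quota check in step 3 can be sketched in plain Go. This is a minimal, self-contained model, not the operator's actual code: the annotation keys and the `maxUnavailable` parameter are illustrative assumptions, and the real controller works against Kubernetes node objects via an informer rather than an in-memory slice.

```go
package main

import "fmt"

// Node models the subset of Kubernetes node state the controller cares
// about: its name and annotations.
type Node struct {
	Name        string
	Annotations map[string]string
}

// Hypothetical annotation keys; the operator's real keys may differ.
const (
	annRebootNeeded = "reboot-agent.alpha.coreos.com/reboot-needed"
	annRebootOK     = "reboot-agent.alpha.coreos.com/reboot-ok"
)

// approveReboots models step 3: it approves pending reboot requests only
// while the number of in-flight reboots stays under maxUnavailable, and
// returns the names of the nodes it approved.
func approveReboots(nodes []Node, maxUnavailable int) []string {
	// Count reboots already in flight (approved but not yet finished).
	inFlight := 0
	for _, n := range nodes {
		if _, ok := n.Annotations[annRebootOK]; ok {
			inFlight++
		}
	}

	var approved []string
	for i := range nodes {
		n := &nodes[i]
		_, needs := n.Annotations[annRebootNeeded]
		_, already := n.Annotations[annRebootOK]
		if !needs || already {
			continue
		}
		if inFlight >= maxUnavailable {
			break // quota reached; remaining nodes wait for the next sync
		}
		// Green-light the reboot by annotating the node.
		n.Annotations[annRebootOK] = "true"
		inFlight++
		approved = append(approved, n.Name)
	}
	return approved
}

func main() {
	nodes := []Node{
		{"node-a", map[string]string{annRebootNeeded: "true"}},
		{"node-b", map[string]string{annRebootNeeded: "true"}},
		{"node-c", map[string]string{}},
	}
	// With a quota of 1, only the first pending node is approved.
	fmt.Println(approveReboots(nodes, 1)) // → [node-a]
}
```

The quota is what keeps the cluster available: even if every node finishes downloading an update at the same time, only a bounded number are ever drained and rebooted concurrently.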

Further work

  • Allow better configuration through TPRs or ConfigMaps
  • Add some kind of E2E testing
  • Upgrade to client-go v3 when released
  • Support pod eviction if available
  • Improve pod filtering so that specific types are not force deleted

Prerequisites

  • Auto-reboots must be disabled on the nodes. You can do this by following the update strategy docs, or by disabling locksmith:
systemctl stop locksmithd
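Following the update strategy docs typically means setting the reboot strategy to `off` in the node's update configuration, so that update_engine still downloads updates but nothing reboots the node automatically:

```ini
# /etc/coreos/update.conf
REBOOT_STRATEGY=off
```

Note that `systemctl stop locksmithd` only lasts until the next boot; the config-file approach (or masking the unit) makes the change persistent.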

How to deploy

# Create reboot-operator ns
kubectl create -f manifests/namespace.yaml

# Create cluster roles and sa bindings
kubectl create -f manifests/cluster-role.yaml

# Create controller RS
kubectl create -f manifests/reboot-controller.yaml

# Create agent DS
kubectl create -f manifests/reboot-agent.yaml

Building

Build agent and controller binaries:

make clean all

Build agent and controller Docker images:

make clean images
