Restarting and Monitoring a systemd service from golang with coreos/go-systemd

I’m current working on code which lives in an agent, written in golang, which is running on a fleet of machines.

The agent needs to, when a new version of an application is deployed, restart the systemd unit and make sure it comes up healthy and running.

Here is the output of what we’re going to build:

demo output restarting service

☠️ Dead-end: Try calling systemctl and parse the stdout

This is the most obvious route. You exec.Command and look at the output.

While it’s obvious it has some downsides, you’re going to end up doing strings.Contains.

Then when you are monitoring you’ll end up wrapping this in a for {} loop.

This didn’t feel like a great option. Can we do better?

🚀 go-systemd Library: Talking to systemd via dbus

What’s dbus?

D-Bus (short for “Desktop Bus”[4]) is a message-oriented middleware mechanism that allows communication between multiple processes running concurrently on the same machine.

Behind the scenes, when you call systemctl the CLI tool is sending and receiving dbus messages.

There is lovely doc showing all the methods related to systemd you can call via dbus

Even better there is a go library coreos/go-systemd which gives a nice way to interface with that dbus interface.

The result it some nice, strongly typed, methods that let you talk to systemd

Let’s connect and list the services on the box 👇

    // What service are we looking at?
	targetSystemdUnit := "fakemill.service"

	ctx := context.Background()
	// Connect to systemd
	// Specifically this will look DBUS_SYSTEM_BUS_ADDRESS environment variable
	// For example: `unix:path=/run/dbus/system_bus_socket`
	systemdConnection, err := dbus.NewSystemConnectionContext(ctx)
	if err != nil {
		fmt.Printf("Failed to connect to systemd: %v\n", err)
		panic(err)
	}
	defer systemdConnection.Close()

	listOfUnits, err := systemdConnection.ListUnitsContext(ctx)
	if err != nil {
		fmt.Printf("Failed to list units: %v\n", err)
		return exitCodeFailedToListUnits
	}

	found := false
	targetUnit := dbus.UnitStatus{}
	for _, unit := range listOfUnits {
		if unit.Name == targetSystemdUnit {
			fmt.Printf("Found systemd unit %s\n", targetSystemdUnit)
			found = true
			targetUnit = unit
			break
		}
	}
	if !found {
		fmt.Printf("Expected systemd unit %s not found\n", targetSystemdUnit)
		return 1
	}

Now we know the service exists, let’s restart it, notice we can use channels to subscribe to results here, we’ll do some more of this later too.

    completedRestartCh := make(chan string)
	jobID, err := systemdConnection.RestartUnitContext(
		ctx,
		targetSystemdUnit,
		restartMode,
		completedRestartCh,
	)

	if err != nil {
		fmt.Printf("Failed to restart unit: %v\n", err)
		panic(err)
	}
	fmt.Printf("Restart job id: %d\n", jobID)

	// Wait for the restart to complete
	select {
	case <-completedRestartCh:
		fmt.Printf("Restart job completed for unit: %s\n", targetSystemdUnit)
	case <-time.After(30 * time.Second):
		fmt.Printf("Timed out waiting for restart job to complete for unit: %s\n", targetSystemdUnit)
	}

Because we have channels we can make use of the select statement, waiting on the restart command to compete or 30 secs to pass, at which point we can give up and timeout.

So now the service has been restarted, how do we tell if it’s healthy and running?

Well we can use a similar approach, coreos/go-systemd lets you subscribe to changes for services.

You pass 2 functions, one to filter the services that you’re interested in and another to check if the change is one you care about or not.

Here is an example

    // Wait for the service to reach a running state

    // Ignore any services that we don't care about by filtering them out
	filterUnits := func(unit string) bool {
		return unit != targetSystemdUnit
	}


	// Configure which changes we care about
	isRelevantChangeFunc := func(before *dbus.UnitStatus, after *dbus.UnitStatus) bool {
		if before.ActiveState != after.ActiveState {
			fmt.Printf("Active state changed from %s to %s\n", before.ActiveState, after.ActiveState)
			return true
		}
		if before.SubState != after.SubState {
			fmt.Printf("Sub state changed from %s to %s\n", before.SubState, after.SubState)
			return true
		}
		return false
	}

	// Subscribe to the changes
	channelBuffer := 10

	changeCh, errorCh := systemdConnection.SubscribeUnitsCustom(time.Millisecond*10, channelBuffer, isRelevantChangeFunc, filterUnits)

	// Wait for the service to be active and running or give up
	for {
		select {
		case changedUnits := <-changeCh:
			unitStatus := changedUnits[targetSystemdUnit]
			fmt.Printf("Unit %s has changed\n", targetSystemdUnit)
			fmt.Printf("UnitStatus dump: %+v \n", unitStatus)
			if unitStatus.ActiveState == desiredActiveState && unitStatus.SubState == desiredSubState {
				fmt.Printf("Unit %s is now active and running\n", targetSystemdUnit)
				return exitCodeSuccess
			}
		case <-errorCh:
			fmt.Printf("Error while waiting for unit %s to change\n", targetSystemdUnit)
			return exitCodeErrorWhileWaiting
		case <-time.After(30 * time.Second):
			fmt.Printf("Timed out waiting for restart job to complete for unit: %s\n", targetSystemdUnit)
			return exitCodeTimeoutWaitingForRestart
		}
	}

That’s it, here is the full code from above all together.

Bonus: Testing against dummy systemd

So we don’t break with your real systemd on your machine, you can setup a docker container running systemd and mount the unix socket to the host.

To make that easy I used mock-systemd-bus.

services:
 mock-systemd:
    image: ghcr.io/balena-os/mock-systemd-bus@sha256:79b92200fb6481edfea9776a75d6bf3fd8583b79b7e83868f49ba7b295f397ce
    # Necessary to run systemd in a container
    privileged: true
    volumes:
      - ./tmp/dbus:/shared/dbus
    environment:
      # Set this variable with the location of your shared bus socket
      # NOTE: do not use /run/dbus as this path is overwritten by systemd
      # Pinned for reproducibility https://github.com/balena-os/mock-systemd-bus/pkgs/container/mock-systemd-bus
      DBUS_SYSTEM_BUS_ADDRESS: unix:path=/shared/dbus/system_bus_socket
      # Optionally set-up any mock units you need. Service files for these
      # Will be created on service start
      MOCK_SYSTEMD_UNITS: fakemill.service

With that service started you can set DBUS_SYSTEM_BUS_ADDRESS environment variable to point to the address and run the above go testing locally.

Now you can setup a Makefile to run the above compose and test the code against it.

run-mock-systemd: ## Run the mock systemd service for testing.
	docker compose down mock-systemd
	docker compose up -d mock-systemd

poc:
	## Remember to run 'make run-mock-systemd' once to start a systemd instance
	DBUS_SYSTEM_BUS_ADDRESS="unix:path=$$PWD/tmp/dbus/system_bus_socket" go run ./cmd/restarter/

There you go, easy testing without messing up your local system.

Works great for integration testing in CI too.