What I learnt when a system no one knew how to maintain started failing, and I was on-call

A system is failing. People rely on it. You are on-call to fix it. You don’t know how it works, your team don’t know how it works and the last person to work on it has left the company. Fun times!

I’ll be upfront: this was an intense on-call shift. It wasn’t much fun, but it did teach me some new approaches for handling these situations. This post describes what I was doing by the end, having learnt from doing the wrong things along the way.

Panic and try to find help

As the realisation dawns that you are meant to fix a system you know nothing about, it’s natural to feel some panic. I did.

Next up, try and get help. This is a time to be humble, explain that you’ll do your best but be clear you don’t know the tech/stack/system etc and reach out widely to see if others do.

You MUST do this. Think of it this way: if you struggle with the problem for ages without asking for help, you are causing unnecessary downtime and pain for users. Imagine that, hours into the outage, someone else in the organisation pops up and says “Oh, this is an easy problem in stack xyz, just do z. Why didn’t you reach out?”. You’ve not done yourself or the organisation any favours at that point.

Be honest about what you know and don’t know. Don’t try to take it all on; reach out for help. Tell people how you are feeling. Get support.


As you go, write down what you try, what you see and what you plan to try next. This is useful because:

  1. In a situation you don’t understand, things you think are irrelevant may later become relevant.
  2. Anyone who comes to help, whether you’re awake, asleep or getting a coffee, can see what’s been done and what is on the list to try next.
  3. Afterwards, you can look back as part of a retrospective to learn from the incident.

Buy time

Engineering is hard at the best of times. Doing it with the pressure of an ongoing outage makes it even harder.

Buy yourself some time with tactical hacks. These don’t have to be good. They’re not there to last forever; they’re there to give you time. In this case I wrote a script to poke the system, attempt to detect the failure and restart it when it was failing. It worked, sort of, for a bit and gave me some head space.
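A minimal sketch of that kind of watchdog, not the actual script from the incident. The health URL and the service name in the usage comment are hypothetical, and "three failed probes in a row" is an assumed threshold you would tune for your own system.

```python
import subprocess
import time
import urllib.request
from typing import Callable, Optional


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the service answers the health URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Timeouts, refused connections and HTTP errors all count as unhealthy.
        return False


def watchdog(probe: Callable[[], bool],
             restart: Callable[[], None],
             max_failures: int = 3,
             interval: float = 10.0,
             max_checks: Optional[int] = None,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll `probe` and call `restart` after `max_failures` failures in a row.

    Runs forever unless `max_checks` is set. Returns the number of restarts,
    which is worth writing into the incident log.
    """
    failures = restarts = checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0
        sleep(interval)
    return restarts


# Hypothetical usage (URL and service name are placeholders):
# watchdog(
#     probe=lambda: http_probe("http://localhost:8080/healthz"),
#     restart=lambda: subprocess.run(["systemctl", "restart", "my-service"], check=False),
# )
```

Requiring several consecutive failures before restarting keeps a single slow probe from bouncing the system, and counting restarts gives you one more data point for later.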

Use the time you gain to do the nitty gritty detailed work of engineering, which I find hard to do with the time pressure associated with an outage.

Understand the request flow – Read the code, re-read the code

It’s impossible, in my opinion, to fix a system you don’t understand, unless you get incredibly lucky. Once you’ve bought yourself some time, spend that time wisely.

My personal preference is to look at a typical request flow through the system.

  • What does a normal request look like?
  • What dependencies are called during its processing?
  • Are caches involved? What state are they in?
  • Are some requests more substantial than others?
  • What code is on the hot path of processing?
  • Where is the hardest computational work done?
  • Where is the complexity? Note: Sometimes this might be in a library you use, not your own code.

All of these help to give you insight into where to investigate.

Another key thing to understand is the history of the system. Systems sometimes see old issues recurring, or new issues that are twists on previous ones. It’s well worth looking through the historic issues, comments and code to understand some of that history. It’s not practical to know it all, so be tactical and look at the history of areas you suspect might be contributing to the failure.

For example, I ended up reading a commit message from 2018 and finding useful details after doing git blame on a file that looked of interest.

Look at what changed but don’t obsess about it

Thing A was working for X amount of time. Now, at Y time, it’s not working any more. What changed between X and Y? There must be one thing. It’s an obvious conclusion to make.

Sometimes this gives you the perfect answer: “Oh, we deployed this change just before it broke!” You roll it back and the world is happy again.

Other times your failure state is based on a complex interplay of issues that have built up over time and combined to cause you pain. My experience is that this is common. Usually it’s not just one thing which caused it, it’s an interplay of multiple things.

Even if the failure is related to a change, it might not be a change you can control. Say the users’ usage of the system has shifted. It’s usually not practical to email them all and say “Please stop doing X”.

Get data: Metrics and Logs

This is the first change I shipped. Not an attempt at fixing the problem but a way to get more data about the problem.

If I’d started shipping code changes without data I’d have been guessing. Sometimes you get lucky, but most of the time you don’t. You need data: logs, metrics and any other useful sources.
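As an illustration of a "data only" change, here is a small sketch of a decorator that logs how long a step on the request path took and whether it failed. The logger name, the log format and the `load_user` function are all made up for the example; the real system will have its own logging conventions.

```python
import functools
import logging
import time

logger = logging.getLogger("incident")


def timed(step: str):
    """Log the duration and outcome of the wrapped function under `step`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise  # behaviour is unchanged, we only observe
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.info("step=%s outcome=%s duration_ms=%.1f",
                            step, outcome, elapsed_ms)
        return wrapper
    return decorator


@timed("load_user")  # hypothetical step on the request path
def load_user(user_id: int) -> dict:
    return {"id": user_id}
```

The wrapper re-raises any exception, so shipping it should not change how the system behaves; it only adds a log line per call.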

Get More Data: Debugger

If possible, get an instance that has the issue and attach a debugger. This was crucial to building a picture of the issue in my case. Minimize the impact on users: maybe you can use a secondary and leave a primary serving traffic, or attach to a canary instance or a lab.

However you do it, go in with a clear idea of where you want to break and what you want to know when you hit those break points.

Be Scientific: Build a theory, prove it’s right or wrong and repeat

At this point you’ve:

  1. Asked for help
  2. Bought yourself some time
  3. Understood the system
  4. Looked at what might have changed
  5. Got data from metrics, logs and debugging

Maybe some more than others.

Start this loop:

  • Create a theory about the failure (early on these will be guesses).
  • Work out how you’d prove it right or wrong.
  • Test the theory.
  • If it didn’t fix it: Go dig more and build a new theory.
  • Repeat

Then the game is simple, repeat the loop above. Add more logs if you need them. Debug more. Write down the outcomes, capture data. Build on past theories.

Start with smaller, easier to verify theories. This lets you run through more of them, and often small things can cause big problems (great addition from @seveas).

Sometimes a theory will need data that you don’t have. At this point, write out a plan for what to do the next time the issue recurs to gather the data you need.

Don’t fall into the trap of changing too much at one time. There is a danger you introduce new issues with changes. Be measured. Is the middle of an incident the right time to update all 38 dependencies the app has in one go (skipping QA because everything is on fire)? Probably not. Theories help with that: a change must have a theory to justify it, and that theory should be provable. If a change doesn’t improve things, don’t pile the next one on top. Reset and go repeat the theory loop.

When you hit a theory that appears correct, always validate that the fix did what you expected. Add metrics and state how you expect them to change after the fix is shipped. Check the system did change the way you expected.
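One way to make "state how you expect them to change" concrete is to write the expectation down as data before shipping, then check it afterwards. This is only a sketch: the `Expectation` shape, the metric name and the 50% threshold are assumptions for illustration, and comparing means is a deliberately crude check.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class Expectation:
    metric: str        # name of the metric, for the incident log
    direction: str     # "down" or "up"
    min_change: float  # minimum relative change, e.g. 0.5 means 50%


def fix_confirmed(exp: Expectation,
                  before: Sequence[float],
                  after: Sequence[float]) -> bool:
    """Compare the mean of a metric before and after the fix with the expectation."""
    baseline, current = mean(before), mean(after)
    if baseline == 0:
        return False  # nothing to compare against; gather more data first
    change = (current - baseline) / baseline
    if exp.direction == "down":
        return change <= -exp.min_change
    return change >= exp.min_change


# Written down BEFORE the fix ships (hypothetical metric and threshold):
expectation = Expectation("restarts_per_hour", "down", 0.5)
```

If `fix_confirmed` comes back false, the theory was wrong or incomplete, and that goes in the log before you start the next loop.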


That’s my brain dump on the topic. There are doubtlessly other views on this topic from better authors with more experience in SRE than myself. I’d be really interested to hear thoughts on bits I’ve missed here or articles by others that relate to this topic.


Writing OPA rules to lint Kubernetes YAML resources and outputting them as annotations on Pull Requests with GitHub Actions

Warning: This expects you already know about rego/opa and is more of a brain dump than a blog.

First up, take a look at conftest. It’s a great little CLI tool which lets you take rules you’ve written in rego/opa and run them easily.

In our case we have the following:

./rules folder containing our rego rules
./yaml folder containing yaml we want to validate

We’re going to write a rule to flag duplicate resources, i.e. when you have two yamls with the same kind and name.

The rule will be written in rego then executed by conftest and when a failure occurs it’ll be shown as an annotation on the Pull Request using GitHub Actions.

Firstly for conftest we want to use the --combine option so we get a single array of all the yaml files passed into our rule. This allows us to compare the files against one another to determine if there are any duplicates.

The data structure you get looks a bit like this (one entry per file):

        [
            {
                "contents": {
                    "apiVersion": "thing",
                    "kind": "deployment"
                },
                "path": "path/to/yaml/file"
            }
        ]

As well as validating the rule we also use the path property to output metadata about which file generated the warning.
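If Rego is hard to read at a glance, the duplicate check can be sketched in Python against the same combined structure. This is an illustration of the logic, not how conftest evaluates rules, and it assumes one resource per file with `contents.kind` and `contents.metadata.name` present. The rule compares every pair of entries, whereas this sketch groups entries by (kind, name) and reports each file once.

```python
from collections import defaultdict
from typing import Any


def find_duplicates(combined: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one finding per file whose (kind, name) also appears in another entry."""
    by_key: dict[tuple, list[str]] = defaultdict(list)
    for doc in combined:
        contents = doc.get("contents") or {}
        kind = contents.get("kind")
        name = (contents.get("metadata") or {}).get("name")
        if kind is None or name is None:
            continue  # not a resource we can compare
        by_key[(kind, name)].append(doc["path"])

    findings = []
    for (kind, name), paths in by_key.items():
        if len(paths) < 2:
            continue
        for idx, path in enumerate(paths):
            others = [p for k, p in enumerate(paths) if k != idx]
            findings.append({"file": path, "kind": kind,
                             "name": name, "duplicates": others})
    return findings
```

For two files that both declare a Deployment called `web`, this returns two findings, one pointing at each file, which mirrors the rule flagging both sides of the clash.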

We can then use jq to parse the JSON output from conftest and convert it to “Workflow Commands: Warning Messages”. These are written to the console and read by GitHub Actions, which uses the details in each message to generate an annotation on the file in the PR.

Here is a gist of this wrapped together.

    # 1-rule.rego
    # Here is the basic Rego rule
    package main

    # deny duplicate resources: two files declaring the same kind and name
    deny_duplicate_resources[{"msg": msg, "details": details}] {
        # compare every pair of distinct entries in the combined input
        i != j
        currentFilePath = input[i].path
        input[i].contents.kind == input[j].contents.kind
        input[i].contents.metadata.name == input[j].contents.metadata.name
        msg := sprintf("no duplicate resources are allowed, file: %q, name: %q, kind: %q, file with duplicate: %q", [currentFilePath, input[i].contents.metadata.name, input[i].contents.kind, input[j].path])
        # extra keys end up under `metadata` in the conftest JSON output
        details := {
            "file": currentFilePath,
            "line": 1,
            "url": "http://some.docs.link.here.something/rulex.md"
        }
    }

    # 2-run.bash
    # This runs the rule against the yaml with conftest
    # Run this inside your GitHub Action
    conftest test -p ./rules ./yaml --combine --no-fail -o json | jq -r -f ./convert.jq

    # 3-convert.jq
    # Get all the failure items from the conftest json output
    # see: https://www.conftest.dev/options/#json
    # Note as we use `--combine` with conftest we will always receive an array consisting of a single item
    # To add newlines to the message '\n' has to be urlencoded to %0A
    # We split the 'msg' returned by the rule with ','s replaced with newlines
    # and also put the doc url on a newline
    # see: https://github.com/actions/toolkit/issues/193
    try .[0].failures[]
    # pull out the file and msg that we care about based on the defined
    # test output format
    # see: ../README.md#writing-rules
    | { "file": .metadata.details.file, "msg": (.msg | gsub(", "; "%0A ")), "url": .metadata.details.url}
    # Format that into the structure actions wants
    # see: https://docs.github.com/en/actions/learn-github-actions/workflow-commands-for-github-actions#setting-a-warning-message
    | "::warning file=\(.file),line=1::\(.msg)%0A%0AAbout this rule: \(.url)"
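For anyone who would rather not read jq, the same conversion can be sketched in Python. It assumes the JSON shape the jq filter relies on: a one-item array whose `failures` entries carry `msg` plus `metadata.details.file` and `metadata.details.url`. The function name is made up for the example.

```python
import json


def to_workflow_commands(conftest_json: str) -> list[str]:
    """Turn conftest JSON output into GitHub Actions `::warning` commands."""
    results = json.loads(conftest_json)
    if not results:
        return []
    commands = []
    # With --combine there is a single result item holding all failures.
    for failure in results[0].get("failures") or []:
        details = (failure.get("metadata") or {}).get("details") or {}
        # Newlines in a workflow command must be encoded as %0A.
        msg = failure.get("msg", "").replace(", ", "%0A ")
        commands.append(
            f"::warning file={details.get('file')},line=1::"
            f"{msg}%0A%0AAbout this rule: {details.get('url')}"
        )
    return commands
```

Printing each returned string to stdout inside a workflow step has the same effect as the jq pipeline: GitHub Actions picks the lines up and attaches them to the named file.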

MedBot: Sick children + Signal Group + Bot = Graphs and Timelines

This is a brain-dump rather than a fully fleshed out blog. Most of the code was written with an unwell small human sleeping on me and Python isn’t my best language, so it’s very much a hack.

I have two kids, both have asthma and chest issues. Unfortunately, these are things you manage rather than cure. The kids are more prone to normal colds escalating quickly and need more medical interventions in general.

My oldest hasn’t started school yet but has spent more time in hospital already than I have in my entire life.

“How does this relate to coding, Lawrence?” Glad you asked. We keep track of the medication, temp, pulse ox and other key events in a Signal Group.

We’ve found that between swapping parents, sleepless nights and different hospital wards/doctors it’s easy for things to get lost.

This has worked really well in the past. Signal keeps things tracked, and it’s quick and easy. You can write down whatever you want. If you’re offline it’ll sync up later.

When you swap parents or see a new doctor you can do a quick rundown of what’s happened in the last x hours, chase up missed doses etc just by scrolling up the chat.

What was new this time round was that both of my kids were ill at the same time, both with chest infections. Both needed medication, observations on temp, pulse ox etc and the group got messy fast.

So I decided to write something to make things nicer. Partly because I thought it would help, partly because having something to focus on helped dissipate the nervous energy of seeing your kids ill and not being able to do much about it.

The aim is a bot to pick up the messages on the group, store them and build out views/graphs.

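The stack I used is:

  • Signald, to talk to Signal
  • Semaphore, a bot library that builds on Signald
  • sqlite, to store the readings
  • A Jupyter notebook with pandas and matplotlib, for the graphs
  • labella, for the timelines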

First up, massive shout out to Finn for the work on Signald and to Lazlo for the Semaphore bot library that builds on it. Both of these were awesome to work with and made this project easy.

The basic aim is for the bot to listen on the group, pick up updates, then pull out the relevant information and store it in a sqlite db.

I used the ‘reaction’ feature in Signal to show that the bot has successfully picked up an item and stored it; this shows up as an emoji such as 💾 added to the message.

Lastly, when someone sends the message ‘graphs’, the bot should build out the graphs and share them back to the group.

What does this code look like? See the Semaphore examples for a fully fledged starting point (seriously, they’re awesome). In the meantime, I’ll show my specific bits. It’s surprisingly small: I added a handler to the bot to detect messages that had a temperature in them using a regex and insert them into the temperature table in sqlite.

    # temp_handler.py
    import re
    import sqlite3

    from semaphore import Bot, ChatContext

    sql_con = sqlite3.connect('medbot.db')
    temp_regex = re.compile(r'[0-9]{2}[.][0-9]')

    async def track_temp(ctx: ChatContext) -> None:
        ## or not is_med_group(ctx):
        if ctx.message.empty():
            return

        name = get_name(ctx)  # Not included, just a regex to pull out the child's name from the msg
        temp = temp_regex.search(ctx.message.get_body().lower()).group()
        print(f'Tracking temp for {name}, temp {temp}')

        await ctx.message.typing_started()
        cursor = sql_con.cursor()
        cursor.execute("INSERT INTO temperatures (name, temperature, time) VALUES (?,?,?)",
                       (name, temp, ctx.message.timestamp_iso))
        sql_con.commit()  # without a commit the insert is never persisted
        await ctx.message.typing_stopped()
        await ctx.message.reply(body="🤒", reaction=True)

    async def main():
        """Start the bot."""
        # Connect the bot to number.
        async with Bot("YOUR_NUMBER_HERE", socket_path="/signald/signald.sock") as bot:
            # Track temps in DB
            bot.register_handler(temp_regex, track_temp)
            # Run the bot until it is stopped (as in the Semaphore examples)
            await bot.start()
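The parsing and storage part of that handler can be tried without Signal at all. Below is a small standalone sketch using the same regex as the handler above. The table schema and the `open_db`, `extract_temp` and `store_temp` helpers are assumptions for the example, since the real schema isn’t shown.

```python
import re
import sqlite3
from typing import Optional

# Same pattern as the handler: two digits, a dot, one digit (e.g. 38.2).
TEMP_REGEX = re.compile(r"[0-9]{2}[.][0-9]")


def open_db(path: str) -> sqlite3.Connection:
    """Open the database and create the (assumed) temperatures table if needed."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS temperatures "
        "(name TEXT, temperature REAL, time TEXT)"
    )
    return con


def extract_temp(body: str) -> Optional[float]:
    """Return the first temperature found in a message, or None if there isn't one."""
    match = TEMP_REGEX.search(body)
    return float(match.group()) if match else None


def store_temp(con: sqlite3.Connection, name: str, temp: float, time_iso: str) -> None:
    """Insert one reading and commit so it is visible to other readers."""
    con.execute(
        "INSERT INTO temperatures (name, temperature, time) VALUES (?, ?, ?)",
        (name, temp, time_iso),
    )
    con.commit()
```

Storing the temperature as a number rather than the raw matched string should make it simpler to plot later, and returning `None` for messages without a reading avoids an exception if the function is ever called on arbitrary text.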

Then for graphing I tried out something a bit different. I used a Jupyter notebook to author and play with the code, then I used jupyter nbconvert graphs.ipynb --to python to output the notebook’s code as a python file.

This was a nice mix for a side/hack project, I could iterate quickly in the notebook but still have that code callable from the bot easily.

The handler and graph rendering look like this. I was seriously impressed with the pandas DataFrame; I’ve not used it much in the past and being able to easily read in from sqlite was a big win.

    # draw.py
    import sqlite3

    import matplotlib.pyplot as plt
    import pandas as pd
    from matplotlib import dates

    sql_con = sqlite3.connect('medbot.db')
    df_source = pd.read_sql_query("SELECT * FROM temperatures WHERE time > date('now', '-72 hours')", sql_con)
    # convert time to datetime type
    df_source['time'] = pd.to_datetime(df_source['time'])
    # one frame per child ('1' and '2' stand in for the names)
    df_1 = df_source.loc[df_source['name'] == '1']
    df_2 = df_source.loc[df_source['name'] == '2']

    fig, ax = plt.subplots()
    ax.xaxis.set_major_formatter(dates.DateFormatter("%dth %H:%M"))
    plt.title('Temps last 3 days')
    plt.ylabel('Temp c')
    ax.plot(df_1.time, df_1.temperature, marker='o', label='1')
    ax.plot(df_2.time, df_2.temperature, marker='o', label='2')
    # dotted reference lines at 38 and 36.4
    plt.axhline(38, color='red', ls='dotted')
    plt.axhline(36.4, color='green', ls='dotted')
    ax.legend()
    # assumed: save to the path the graphs handler attaches
    fig.savefig('/signald/temps.jpg')

    # graphs.py
    async def graphs(ctx: ChatContext) -> None:
        # str(Path(__file__).parent.absolute() / 'temps.jpg')
        # The filenames are paths inside the signald container, because that's the path that matters here
        attachmentTemps = {"filename": '/signald/temps.jpg',
                           "width": "250",
                           "height": "250"}
        attachmentTimeline1 = {"filename": '/signald/timeline1.png',
                               "width": "250",
                               "height": "250"}
        attachmentTimeline2 = {"filename": '/signald/timeline2.png',
                               "width": "250",
                               "height": "250"}
        await ctx.message.reply(body="Temp graphs for the last 3 days, last 12 hours timeline",
                                attachments=[attachmentTemps, attachmentTimeline1, attachmentTimeline2])

Last was drawing the timelines. labella was awesome here; I had to hack a bit but it does awesome stuff like letting you pick a colour for an item based on its content. With this I could label different types of medication with different colours on the timeline.

    # timeline.py
    def timeline(name, data):
        from labella.timeline import TimelineSVG, TimelineTex
        from labella.utils import COLOR_10
        from labella.scale import TimeScale, LinearScale
        import pyvips
        import os

        # pick a colour for each item based on its type
        def color_selector(data):
            colors = {
                "meds": "#FEA443",
                "meds_amoxicillin": "#705E78",
                "meds_paracetamol": "#A5AAA3",
                "meds_ibrufen": "#812F33",
                "sleep": "#EB722A",
                "inhaler_blue": "#1E24E3",
                "inhaler_brown": "#B61D28",
                "note": "#BCBF50",
            }
            return colors[data['type']]

        options = {
            "scale": LinearScale(),
            "initialWidth": 350,
            "initialHeight": 580,
            "direction": 'right',
            "dotColor": color_selector,
            "labelBgColor": color_selector,
            "linkColor": color_selector,
            "textFn": lambda x: f'{x["timeobj"].strftime("%H:%M")}{x["message"]}',
            "labelPadding": {"left": 0, "right": 0, "top": 1, "bottom": 1},
            "margin": {"left": 20, "right": 20, "top": 30, "bottom": 20},
            "layerGap": 40,
            "labella": {
                "maxPos": 500,
            },
            "latex": {"reproducible": True},
        }

        items = data.to_dict('records')
        tl = TimelineSVG(items, options=options)
        svg_filename = f'timeline{name}.svg'
        # Not shown: exporting the SVG and converting it to the PNG the bot sends

What does this look like when drawn? (Granted I’ve picked rubbish colors).

It gives a chronologically accurate timeline with each medicine or item type easily distinguishable. This is useful to take in how things are going over 24 hours and also spot issues with missed doses.

So that’s it really. I haven’t published the full set of code as it’s got stuff specific to my kids in there, but hopefully this is a useful overview. Drop a comment if you’d find this interesting/useful. If there is enough interest I can clean stuff up to make this sharable.