Ryzen Ubuntu Server: Throttle CPU Frequency for power saving

This is a super quick one. I have a Ryzen server running a bunch of VMs, and I noticed it’s running quite hot and pulling a fair bit of power.

As none of the VMs running on it are particularly performance-sensitive, I wanted to force the CPU to use a more conservative power setting.

First up, how do I see what frequencies the CPU is currently running at? For that we’ll use cpufreq-info, which will tell us the frequency of each core:

cpufreq-info | grep current
  current policy: frequency should be within 2.20 GHz and 3.60 GHz.
  current CPU frequency is 4.23 GHz.
  current policy: frequency should be within 2.20 GHz and 3.60 GHz.
  current CPU frequency is 2.70 GHz.
.... more

Now, how do I see what power draw this causes? That’s more complicated. In my case I have an APC UPS connected to the system to keep it up during power outages. It comes with a tool called apcaccess which reports the UPS’s load. Knowing the size of the UPS, you can work back from the load percentage to a rough wattage.

What I want to do here is use this reading to prove that throttling the CPU has worked. Before making any changes, it reports 19% load.

apcaccess | grep LOAD
LOADPCT  : 19.0 Percent
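Working back from the load percentage to watts is just a multiplication against the UPS’s rated capacity. A minimal Ruby sketch of that arithmetic, assuming a 900 W unit (substitute your model’s rating):

```ruby
# Convert a UPS load percentage to a rough wattage.
# UPS_CAPACITY_WATTS is an assumed figure -- check your unit's rating.
UPS_CAPACITY_WATTS = 900

def load_pct_to_watts(load_pct, capacity = UPS_CAPACITY_WATTS)
  (load_pct / 100.0) * capacity
end

puts load_pct_to_watts(19.0) # the 19% reading above => roughly 171 W
```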

To throttle the CPU we can use a governor. Let’s see what governors we have available:

cpufreq-info | grep governors
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance, schedutil
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance, schedutil
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance, schedutil
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance, schedutil

Cool, powersave looks like a good one to try. Let’s give it a go by running sudo cpupower frequency-set --governor powersave and then looking at the load and frequencies again.

lawrencegripper@libvirt:~$ sudo cpupower frequency-set --governor powersave
Setting cpu: 0
Setting cpu: 1
Setting cpu: 2
.... more
Setting cpu: 15
lawrencegripper@libvirt:~$ cpufreq-info | grep current
  current policy: frequency should be within 2.20 GHz and 2.20 GHz.
  current CPU frequency is 2.20 GHz.
  current policy: frequency should be within 2.20 GHz and 3.60 GHz.
  current CPU frequency is 2.20 GHz.
  current policy: frequency should be within 2.20 GHz and 3.60 GHz.
  current CPU frequency is 2.20 GHz.
lawrencegripper@libvirt:~$ apcaccess | grep LOAD
LOADPCT  : 15.0 Percent

That did the trick: apcaccess reports a drop to 15% load, and the CPU frequency is down to 2.2 GHz.

I’ve got a smart meter for my home and can also see the drop in usage roughly reflected there too.

Done. Server is now being more eco-friendly.


Ruby + Sorbet: Autogen sig method annotations

I’ve been gradually adding Sorbet and type checking to a legacy Ruby codebase dating back to the 2010s.

First up was getting all, or most, of the files marked as (a topic for another day):

# typed: true

Now I’m gradually adding sig annotations to tell Sorbet what types a method accepts and returns, like so:

sig { params(x: SomeType, y: SomeOtherType).returns(MyReturnType) }
def foo(x, y)
  ...
end


This is fairly time consuming so being lazy I wanted some help to get this done quicker.

I say “help” here as it’s never going to be perfect: with the metaprogramming and cruft of an old Ruby codebase, it’s always going to need human validation and tweaking.

Luckily the codebase has a pretty extensive test suite so we can use that to validate the generated types match reality.

Let’s get Autogenerating

I’m lucky to work with great engineers; George is one of them. He showed me a useful approach to generating sigs which saves a bunch of time.

As Sorbet already “knows” some return types, it can infer sigs for some methods. The trick George showed me was to get it to add those inferred sigs for you automagically.

  • Set # typed: strict on the file; this means any methods without sigs are considered errors
  • Run srb tc --autocorrect --isolate-error-code=7017; this tells Sorbet to auto-create any signatures it can work out
  • Reset # typed: true (unless you’ve solved all errors under strict – I’m aiming for that, but gradually; right now I want good sigs)
  • Review the auto-generated sigs and make sure they’re sensible, fixing up T.untyped and other issues
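The steps above can be scripted per file. Here’s a throwaway Ruby sketch: the srb command line comes from the steps themselves, but the sigil-swapping helper and everything around it is my own illustration, not part of the original workflow:

```ruby
# Swap the `# typed:` sigil at the top of a Ruby source string.
def set_sigil(source, level)
  source.sub(/^# typed: \w+/, "# typed: #{level}")
end

def autogen_sigs(path)
  File.write(path, set_sigil(File.read(path), "strict"))
  # Ask Sorbet to autocorrect only "method does not have a sig" errors (7017).
  system("bundle exec srb tc --autocorrect --isolate-error-code=7017")
  # Drop back to `typed: true`, keeping any sigs Sorbet inserted.
  File.write(path, set_sigil(File.read(path), "true"))
end
```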

The cool thing here is the more sigs you have the better sorbet gets at generating the missing ones.

So what about those that can’t be created with this technique?

Well, you have to write them, but there is help here too. I’ve been using GitHub Copilot to infer or suggest sigs.

This is much more hit and miss than the first technique. It’s still (mostly) a time saver, but you do have to tweak the suggestions regularly.

Playing it safe with new sigs

Now I’ve got a set of new sigs, you’d think the next step would be to ship them – but hold fire.

Sig annotations are statically checked at dev time but, by default, they are also enforced at runtime.

The danger here is that the shiny new sigs you added aren’t right, and your production application will start failing when they’re shipped.

To work around this we need to tweak some Sorbet configuration. In this case I configured the app to raise runtime errors when a signature wasn’t correct, but only in test or staging environments – not in production. To do this you implement a T::Configuration.call_validation_error_handler.


This allows you to control how sorbet reacts to a method receiving or returning a type which doesn’t match the sig annotation.

Here is the initializer I ended up with:

# typed: strict

require "sorbet-runtime"

# Register the call_validation_error_handler callback.
# This runs every time a method with a sig fails to type check at runtime.

# In all environments, report a sig violation to Sentry.
# In any non-production environment, raise an error if a sig is violated.
T::Configuration.call_validation_error_handler = lambda do |signature, opts|
  failure_message = opts[:pretty_message]
  Scrolls.log(at: :sorbet_runtime_sig_checking, msg: failure_message.squish)
  # TypeError here is an assumption -- use whatever error class your reporting expects.
  error = TypeError.new(failure_message)
  Sentry.capture_exception(error)
  raise error unless Rails.env.production?
end

There you go: ship the new code with its sigs, and keep an eye on Sentry to see if any need tweaking based on real production usage, without those violations causing production errors.


What I learnt when a system no one knew how to maintain started failing, and I was on-call

A system is failing. People rely on it. You are on-call to fix it. You don’t know how it works, your team don’t know how it works and the last person to work on it has left the company. Fun times!

I’ll be upfront: this was an intense on-call shift. It wasn’t much fun, but it did teach me some new approaches for handling these situations. This post describes what I was doing by the end, having learnt from doing the wrong things in places.

Panic and try to find help

As the realisation dawns that you are meant to fix a system you know nothing about, it’s natural to feel some panic. I did.

Next up, try to get help. This is a time to be humble: explain that you’ll do your best, but be clear that you don’t know the tech/stack/system, and reach out widely to see if others do.

You MUST do this. Think of it this way: if you struggle with the problem for ages without asking for help, you are causing unnecessary downtime and pain for users. Imagine, hours into the outage, someone else in the organisation pops up and says “Oh, this is an easy problem in stack xyz, just do z – why didn’t you reach out?”. You’ve not done yourself or the organisation any favours at that point.

Be honest about what you know and don’t know. Don’t try to take it all on; reach out for help. Tell people how you are feeling. Get support.


Write everything down

Keep a written log of what you observe and what you try as you go. This is useful because:

  1. In a situation you don’t understand, things you think are irrelevant may later become relevant.
  2. Anyone who comes to help, whether you’re awake, asleep or getting a coffee, can see what’s been done and what is on the list to try next.
  3. Afterwards, you can look back, as part of a retrospective, to learn from the incident.

Buy time

Engineering is hard at the best of times. Doing it with the pressure of an ongoing outage makes it even harder.

Buy yourself some time with tactical hacks. These don’t have to be good; they’re not there to last forever, they’re there to give you time. In this case I wrote a script to poke the system, attempt to detect the failure, and restart it when it was failing. It worked, sort of, for a bit, and gave some head space.
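For the curious, such a hack can be very small. This is a reconstruction in Ruby, not the real script: the health-check URL, the restart command and the 30-second interval are all assumptions for illustration:

```ruby
require "net/http"

HEALTH_URL = URI("http://localhost:8080/health") # assumed endpoint

# True if the service answers its health check.
def healthy?(uri = HEALTH_URL)
  Net::HTTP.get_response(uri).is_a?(Net::HTTPSuccess)
rescue StandardError
  false
end

# Poke the system once; restart it if the check fails.
def check_and_restart(check: method(:healthy?),
                      restart: -> { system("systemctl restart the-service") })
  restart.call unless check.call
end

# Crude watchdog: poke every 30 seconds.
# loop { check_and_restart; sleep 30 }
```

Not good engineering, but it doesn’t need to be: it only has to hold the system together long enough to buy thinking time.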

Use the time you gain to do the nitty gritty detailed work of engineering, which I find hard to do with the time pressure associated with an outage.

Understand the request flow – Read the code, re-read the code

It’s impossible, in my opinion, to fix a system you don’t understand – unless you get incredibly lucky. Once you’ve bought yourself some time, spend it wisely.

My personal preference is to look at a typical request flow through the system.

  • What does a normal request look like?
  • What dependencies are called during its processing?
  • Are caches involved? What state are they in?
  • Are some requests more substantial than others?
  • What code is on the hot path of processing?
  • Where is the hardest computational work done?
  • Where is the complexity? Note: Sometimes this might be in a library you use not your own code.

All of these help to give you insight into where to investigate.

Another key thing to understand is the history of the system. Systems sometimes see issues reoccur, or new issues appear that are twists on previous ones. It’s well worth looking through historic issues, comments and code to understand some of that history. It’s not practical to know it all, so be tactical: look at the history of areas you suspect might be contributing to the failure.

For example, I ended up reading a commit message from 2018 and finding useful details after doing git blame on a file that looked of interest.

Look at what changed but don’t obsess about it

Thing A was working for X amount of time. Now, at Y time, it’s not working any more. What changed between X and Y? There must be one thing. It’s an obvious conclusion to make.

Sometimes this gives you the perfect answer: “Oh, we deployed this change just before it broke!” You roll it back and the world is happy again.

Other times your failure state is built on a complex interplay of issues that have accumulated over time and combined to cause you pain. In my experience this is common: usually it’s not just one thing, it’s several things interacting.

Even if the failure is related to a change, it might not be a change you can control. Say users’ usage of the system has shifted; it’s usually not practical to email them all and say “Please stop doing X”.

Get data: Metrics and Logs

This is the first change I shipped. Not an attempt at fixing the problem but a way to get more data about the problem.

If I’d started shipping code changes without data I’d be guessing; sometimes you get lucky, but most of the time you don’t. You need data: logs, metrics and any other useful sources.

Get More Data: Debugger

If possible, get an instance that exhibits the issue and attach a debugger. This was crucial to building a picture of the issue in my case. Minimise the impact on users: maybe you can use a secondary and leave a primary serving traffic, or attach to a canary instance or a lab.

However you do it, go in with a clear idea of where you want to break and what you want to know when you hit those break points.

Be Scientific: Build a theory, prove it’s right or wrong and repeat

At this point you’ve:

  1. Asked for help
  2. Bought yourself some time
  3. Understood the system
  4. Looked at what might have changed
  5. Got data from metrics, logs and debugging

Maybe some more thoroughly than others.

Start this loop:

  • Create a theory about the failure (early on these will be guesses).
  • Work out how you’d prove it right or wrong.
  • Test the theory.
  • If it didn’t fix it: go dig more and build a new theory.
  • Repeat

Then the game is simple: repeat the loop above. Add more logs if you need them. Debug more. Write down the outcomes and capture data. Build on past theories.

Start with smaller, easier to verify theories. This lets you run through more of them, and often small things can cause big problems (great addition from @seveas).

Sometimes a theory will need data you don’t have; at that point, write out a plan for what to do the next time the issue re-occurs so you can gather what you need.

Don’t fall into the trap of changing too much at one time; there is a danger you introduce new issues with your changes. Be measured. Is mid-incident the right time to update all 38 dependencies the app has in one go (skipping QA because everything is on fire)? Probably not. Theories help with this: a change must have a theory to justify it, and that theory should be provable. If a change doesn’t improve things, don’t pile the next one on top. Reset and repeat the theory loop.

When you hit a theory that appears correct, always validate that the fix did what you expected. Add metrics, state how you expect them to change after the fix ships, and check that the system changed the way you expected.


That’s my brain dump on the topic. There are doubtless other views on this from authors with more SRE experience than me. I’d be really interested to hear thoughts on bits I’ve missed, or articles by others related to this topic.