Keep it light weight
A while back, as part of my ever so slightly over-engineered noise reduction effort on the basement cluster, I developed a simple fan manager daemon. The emphasis was on simplicity and a small footprint.
Now, some weeks down the road, looking at the systemctl status fanman
output, I cannot help but notice that the fan controller software has
burned quite some CPU cycles in total: a full 38 minutes and change,
from just around two and a half weeks of running.
This is not terrible of course. It never shows up on top. It’s not
like the cluster can’t do real work because it spends time managing
its fans. In fact, this is not really an issue in practical terms. But
it bugs me. 38 minutes. On the CPU. For just a few weeks of managing
the fans. That is not good enough.
I want to take you through how I approach a performance problem like this: going through the basic process of establishing the capability to measure, getting a baseline, looking with a critical eye at the work that is done, and iterating towards an acceptable solution.
Baseline
From looking at the code, the vast majority of the work happens in the PID logic. Outside of the PID logic, we read an integer from one file and we sometimes write an integer to another file (and both files are kept open, so those operations are very low overhead).
Let’s write a simple benchmark application to exercise the PID
logic. This will allow us to make incremental changes and document
that they are actually effective. The following perf_pidloop logic
is taken verbatim from the fanman main loop, so it reflects how the
PID logic is actually used.
The set_dc function is an empty function implemented in another
compilation unit to prevent inlining (and thereby the elimination of
the new_dc calculation, which would otherwise be dead code). It’s
basically a trick to stop the compiler from realising that our code
doesn’t actually produce results.
void perf_pidloop()
{
    // This is our most basic PID logic where we run our PID
    // calculations
    ProcessHistory<int32_t> errorhist(10, 10*15);
    int32_t ideal_temp = 48000;
    double old_dc = 0;
    double pid_kp = 0.1,
           pid_ki = 0.2,
           pid_kd = 0.05,
           fan_min_dc = 0.4,
           fan_start_dc = 0.8;
    // Run some iterations
    for (size_t s = 0; s != 10000000; ++s) {
        int32_t temp = 49950;
        errorhist.add(temp - ideal_temp);
        double new_dc = pid_kp * errorhist.get()
            + pid_ki * errorhist.getint()
            + pid_kd * errorhist.getdiffSG9();
        if (new_dc < 0) new_dc = 0;
        if (new_dc > 1) new_dc = 1;
        new_dc = .6 * new_dc + .4 * old_dc;
        if (new_dc < fan_min_dc) {
            new_dc = new_dc < fan_min_dc/2
                ? 0.
                : fan_min_dc;
        }
        if (old_dc == 0 && new_dc > 0)
            new_dc = fan_start_dc;
        if (old_dc != new_dc) {
            set_dc(new_dc);
            old_dc = new_dc;
        }
    }
}

Running the performance test software on our baseline PID code yields the following result:
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m5.059s
user 0m4.982s
sys 0m0.004s
joe@chip:~/fanman$ gprof targets/perf
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
29.98 1.22 1.22 ProcessHistory<int>::getdiffSG9()
24.57 2.22 1.00 ProcessHistory<int>::get()
22.11 3.12 0.90 perf_pidloop()
8.85 3.48 0.36 void std::__final_insertion_sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter)
4.91 3.68 0.20 ProcessHistory<int>::add(int)
3.69 3.83 0.15 std::deque<int, std::allocator<int> >::push_front(int const&)
3.44 3.97 0.14 void std::__introsort_loop<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter)
joe@chip:~/fanman$

We have our baseline. Five seconds. Game on.
The profile reveals that, not shockingly, the Savitzky-Golay filter
(getdiffSG9) and the half-second median (get) are two expensive
routines.
Before going into the routines themselves, one thing struck me: I have been using double precision floating point arithmetic wherever floating point arithmetic was needed. This is out of old habit - it used to be that on “big” machines, double precision and single precision arithmetic cost basically the same; the only saving was in memory (an FP32 takes up 32 bits and an FP64 takes up 64, unsurprisingly). That has changed a lot, however. GPUs today generally double their throughput when you halve the operand size. On the JH7110 CPU that I am targeting here it is not quite as dramatic, but we do see (reading from The SiFive U74-MC Core Complex Manual - or this local copy) that single precision operations are significantly faster than double precision:
| Instruction | Cycles on FP32 | Cycles on FP64 |
|---|---|---|
| fmul | 5 | 7 |
| fmadd | 5 | 7 |
| fdiv | 9-36 | 9-58 |
For what we are doing here, single precision is easily accurate
enough. Just by switching from double to float we can shave 29% of
the cycles off of the FP arithmetic (and 38% off of the worst case
division performance).
Interestingly, the JH7110 provides a fused multiply-add instruction that takes the same number of cycles as just the multiply - hugely useful for SAXPY-style operations (Single-precision A times X Plus Y), the building blocks of dot products and matrix multiplication.
Another point on the processor that should be brought up: As I touched upon in an earlier post, the JH7110 is not a speed demon (and it was not intended to be either). It does not have out-of-order execution, but it does have a dual-issue pipeline on each core, meaning each core can - ideally - retire two instructions per clock cycle. A number of practical limitations will make the real number lower, of course, but it is worth considering, when looking at the code, what it costs to - for example - tie up a pipeline with a 58-cycle division operation versus loading a constant from memory and multiplying by the inverse instead. More on this later.
Free improvement: double to float
Extending our ProcessHistory class to accept either double or
float as its floating-point type of choice is easy:
//
// Store a history of process outputs in a signed integer format. All
// getters return floating point of type F
//
template <typename B = int32_t, typename F = double>
class ProcessHistory {
    ...
    // Get most recent half second median, in an effort to filter out
    // the worst of the noise
    F get() {
        m_window.clear();
        ...
    // Get filtered differential; we use a Savitzky-Golay filter to get
    // the first derivative
    F getdiffSG5() {
        return 1. / (12. / m_sps)
        ...

In the performance test code from before, we instantiate the ProcessHistory as:

ProcessHistory<int32_t,float> errorhist(10, 10*15);

This completely free change gives us:
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m4.899s
user 0m4.828s
sys 0m0.005s
joe@chip:~/fanman$

Ok, that is not impressive. Where is the time spent?
joe@chip:~/fanman$ gprof targets/perf --flat-profile
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
29.71 1.12 1.12 ProcessHistory<int, float>::getdiffSG9()
25.73 2.09 0.97 ProcessHistory<int, float>::get()
18.83 2.80 0.71 perf_pidloop()
10.34 3.19 0.39 void std::__final_insertion_sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter)
3.98 3.34 0.15 ProcessHistory<int, float>::add(int)
3.45 3.47 0.13 void std::__introsort_loop<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter)
3.18 3.59 0.12 std::deque<int, std::allocator<int> >::push_front(int const&)

Let's take a look inside the getdiffSG9 routine to get a feel for
what's going on. First of all, the implementation looks like:
F getdiffSG9() {
    return 1. / (1188. / m_sps)
        * F(86 * m_history[8]
            - 142 * m_history[7]
            - 193 * m_history[6]
            - 126 * m_history[5]
            + 126 * m_history[3]
            + 193 * m_history[2]
            + 142 * m_history[1]
            - 86 * m_history[0]);
}

So a straightforward function, very simple: some memory loads, a bit of integer arithmetic, finally a float conversion and a bit of floating point arithmetic. The disassembly is rather large (which we shall return to), but I noticed at the very end of the function that we had:
perf[0x1726] <+462>: fld fa5, -0x652(a0)
perf[0x172a] <+466>: auipc a0, 0x4
perf[0x172e] <+470>: fld fa4, -0x652(a0)
** 93 return 1. / (1188. / m_sps)
perf[0x1732] <+474>: fcvt.d.lu fa3, a2
perf[0x1736] <+478>: fdiv.d fa5, fa5, fa3
perf[0x173a] <+482>: fdiv.d fa5, fa4, fa5
** 101 - 86 * m_history[0]);
102 }
What are fdiv.d double precision divisions doing here? F is
float… The answer, of course, is staring me right in the face - it is
the literals. 1. is a double precision literal in C++. If I use
F(1) instead, it will be a 1 converted to whatever the F type
is set to, something the compiler can do at compile time.
Another point: since floating point divisions are incredibly expensive, we really should rewrite the function to read:
F getdiffSG9() {
    return m_sps / F(1188)
        * F(86 * m_history[8]
            - 142 * m_history[7]
            - 193 * m_history[6]
            - 126 * m_history[5]
            + 126 * m_history[3]
            + 193 * m_history[2]
            + 142 * m_history[1]
            - 86 * m_history[0]);
}

This gets rid of the double precision literal and it cuts away a
division. (I also updated three literals in the perf.cc program.)
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m4.186s
user 0m4.103s
sys 0m0.013s

Now we're cooking! That's a 17% reduction just from not using excessive precision and not dividing unnecessarily.
Stop branching for simple things
Looking at the disassembly for the getdiffSG9 routine, we still see
that it is hugely complex. What looks, in C++, as a relatively simple
function, ends up being absolutely humongous:
(lldb) disas -n 'ProcessHistory<int, float>::getdiffSG9()'
perf`ProcessHistory<int, float>::getdiffSG9:
perf[0x1540] <+0>: addi sp, sp, -0x20
perf[0x1542] <+2>: sd ra, 0x18(sp)
perf[0x1544] <+4>: sd s0, 0x10(sp)
perf[0x1546] <+6>: sd s1, 0x8(sp)
perf[0x1548] <+8>: sd s2, 0x0(sp)
perf[0x154a] <+10>: addi s0, sp, 0x20
perf[0x154c] <+12>: mv s2, a0
perf[0x154e] <+14>: jal 0xeb0
perf[0x1552] <+18>: ld t3, 0x20(s2)
perf[0x1556] <+22>: ld a0, 0x28(s2)
perf[0x155a] <+26>: ld t4, 0x38(s2)
perf[0x155e] <+30>: sub a7, t3, a0
perf[0x1562] <+34>: srai s1, a7, 0x2
perf[0x1566] <+38>: li a2, -0x8
perf[0x1568] <+40>: addi a1, s1, 0x8
perf[0x156c] <+44>: blt s1, a2, 0x158a
perf[0x1570] <+48>: srli a2, a1, 0x7
perf[0x1574] <+52>: bnez a2, 0x158e
perf[0x1576] <+54>: addi a6, t3, 0x20
perf[0x157a] <+58>: li a2, -0x7
perf[0x157c] <+60>: addi a1, s1, 0x7
perf[0x1580] <+64>: bge s1, a2, 0x15aa
perf[0x1584] <+68>: srai a2, a1, 0x7
perf[0x1588] <+72>: j 0x15b0
perf[0x158a] <+74>: srai a2, a1, 0x7
perf[0x158e] <+78>: slli a3, a2, 0x3
perf[0x1592] <+82>: add a3, a3, t4
perf[0x1594] <+84>: ld a6, 0x0(a3)
perf[0x1598] <+88>: slli a2, a2, 0x7
perf[0x159a] <+90>: sub a1, a1, a2
perf[0x159c] <+92>: slli a1, a1, 0x2
perf[0x159e] <+94>: add a6, a6, a1
perf[0x15a0] <+96>: li a2, -0x7
perf[0x15a2] <+98>: addi a1, s1, 0x7
perf[0x15a6] <+102>: blt s1, a2, 0x1584
perf[0x15aa] <+106>: srli a2, a1, 0x7
perf[0x15ae] <+110>: beqz a2, 0x15e6
perf[0x15b0] <+112>: slli a3, a2, 0x3
perf[0x15b4] <+116>: add a3, a3, t4
perf[0x15b6] <+118>: ld t0, 0x0(a3)
perf`ProcessHistory<int, float>::getdiffSG9 [inlined] std::_Deque_iterator<int, int&, int*>::operator+=(long):
perf[0x15ba] <+122>: slli a2, a2, 0x7
perf[0x15bc] <+124>: sub a1, a1, a2
perf[0x15be] <+126>: slli a1, a1, 0x2
perf[0x15c0] <+128>: add t0, t0, a1
perf[0x15c2] <+130>: li a2, -0x6
perf[0x15c4] <+132>: addi a1, s1, 0x6
perf[0x15c8] <+136>: blt s1, a2, 0x15f4
perf[0x15cc] <+140>: srli a2, a1, 0x7
perf[0x15d0] <+144>: bnez a2, 0x15f8
perf[0x15d2] <+146>: addi t1, t3, 0x18
perf[0x15d6] <+150>: li a2, -0x5
perf[0x15d8] <+152>: addi a1, s1, 0x5
perf[0x15dc] <+156>: bge s1, a2, 0x1614
perf[0x15e0] <+160>: srai a2, a1, 0x7
perf[0x15e4] <+164>: j 0x161a
perf[0x15e6] <+166>: addi t0, t3, 0x1c
perf[0x15ea] <+170>: li a2, -0x6
perf[0x15ec] <+172>: addi a1, s1, 0x6
perf[0x15f0] <+176>: bge s1, a2, 0x15cc
perf[0x15f4] <+180>: srai a2, a1, 0x7
perf[0x15f8] <+184>: slli a3, a2, 0x3
perf[0x15fc] <+188>: add a3, a3, t4
perf[0x15fe] <+190>: ld t1, 0x0(a3)
perf[0x1602] <+194>: slli a2, a2, 0x7
perf[0x1604] <+196>: sub a1, a1, a2
perf[0x1606] <+198>: slli a1, a1, 0x2
perf[0x1608] <+200>: add t1, t1, a1
perf[0x160a] <+202>: li a2, -0x5
perf[0x160c] <+204>: addi a1, s1, 0x5
perf[0x1610] <+208>: blt s1, a2, 0x15e0
perf[0x1614] <+212>: srli a2, a1, 0x7
perf[0x1618] <+216>: beqz a2, 0x1650
perf[0x161a] <+218>: slli a3, a2, 0x3
perf[0x161e] <+222>: add a3, a3, t4
perf[0x1620] <+224>: ld t2, 0x0(a3)
perf[0x1624] <+228>: slli a2, a2, 0x7
perf[0x1626] <+230>: sub a1, a1, a2
perf[0x1628] <+232>: slli a1, a1, 0x2
perf[0x162a] <+234>: add t2, t2, a1
perf[0x162c] <+236>: li a2, -0x3
perf[0x162e] <+238>: addi a1, s1, 0x3
perf[0x1632] <+242>: blt s1, a2, 0x165e
perf[0x1636] <+246>: srli a2, a1, 0x7
perf[0x163a] <+250>: bnez a2, 0x1662
perf[0x163c] <+252>: addi a4, t3, 0xc
perf[0x1640] <+256>: li a1, -0x2
perf[0x1642] <+258>: addi a2, s1, 0x2
perf[0x1646] <+262>: bge s1, a1, 0x167c
perf[0x164a] <+266>: srai a1, a2, 0x7
perf[0x164e] <+270>: j 0x1682
perf[0x1650] <+272>: addi t2, t3, 0x14
perf[0x1654] <+276>: li a2, -0x3
perf[0x1656] <+278>: addi a1, s1, 0x3
perf[0x165a] <+282>: bge s1, a2, 0x1636
perf[0x165e] <+286>: srai a2, a1, 0x7
perf[0x1662] <+290>: slli a3, a2, 0x3
perf[0x1666] <+294>: add a3, a3, t4
perf[0x1668] <+296>: ld a4, 0x0(a3)
perf[0x166a] <+298>: slli a2, a2, 0x7
perf[0x166c] <+300>: sub a1, a1, a2
perf[0x166e] <+302>: slli a1, a1, 0x2
perf[0x1670] <+304>: add a4, a4, a1
perf[0x1672] <+306>: li a1, -0x2
perf[0x1674] <+308>: addi a2, s1, 0x2
perf[0x1678] <+312>: blt s1, a1, 0x164a
perf[0x167c] <+316>: srli a1, a2, 0x7
perf[0x1680] <+320>: beqz a1, 0x16a8
perf[0x1682] <+322>: slli a3, a1, 0x3
perf[0x1686] <+326>: add a3, a3, t4
perf[0x1688] <+328>: ld a3, 0x0(a3)
perf[0x168a] <+330>: slli a1, a1, 0x7
perf[0x168c] <+332>: sub a2, a2, a1
perf[0x168e] <+334>: slli a2, a2, 0x2
perf[0x1690] <+336>: add a3, a3, a2
perf[0x1692] <+338>: li a1, -0x1
perf[0x1694] <+340>: addi a2, s1, 0x1
perf[0x1698] <+344>: blt s1, a1, 0x16b6
perf[0x169c] <+348>: srli a1, a2, 0x7
perf[0x16a0] <+352>: bnez a1, 0x16ba
perf[0x16a2] <+354>: addi a1, t3, 0x4
perf[0x16a6] <+358>: j 0x16cc
perf[0x16a8] <+360>: addi a3, t3, 0x8
perf[0x16ac] <+364>: li a1, -0x1
perf[0x16ae] <+366>: addi a2, s1, 0x1
perf[0x16b2] <+370>: bge s1, a1, 0x169c
perf[0x16b6] <+374>: srai a1, a2, 0x7
perf[0x16ba] <+378>: slli a5, a1, 0x3
perf[0x16be] <+382>: add a5, a5, t4
perf[0x16c0] <+384>: ld a5, 0x0(a5)
perf[0x16c2] <+386>: slli a1, a1, 0x7
perf[0x16c4] <+388>: sub a2, a2, a1
perf[0x16c6] <+390>: slli a1, a2, 0x2
perf[0x16ca] <+394>: add a1, a1, a5
perf[0x16cc] <+396>: ld a2, 0x0(s2)
perf[0x16d0] <+400>: lw a6, 0x0(a6)
perf[0x16d4] <+404>: lw t0, 0x0(t0)
perf[0x16d8] <+408>: lw t1, 0x0(t1)
perf[0x16dc] <+412>: lw a5, 0x0(t2)
perf[0x16e0] <+416>: lw a4, 0x0(a4)
perf[0x16e2] <+418>: lw a3, 0x0(a3)
perf[0x16e4] <+420>: lw t2, 0x0(a1)
perf[0x16e8] <+424>: bltz s1, 0x16f4
perf[0x16ec] <+428>: srli a0, s1, 0x7
perf[0x16f0] <+432>: bnez a0, 0x16f8
perf[0x16f2] <+434>: j 0x170a
perf[0x16f4] <+436>: srai a0, a7, 0x9
perf[0x16f8] <+440>: slli a1, a0, 0x3
perf[0x16fc] <+444>: add a1, a1, t4
perf[0x16fe] <+446>: ld t3, 0x0(a1)
perf[0x1702] <+450>: slli a0, a0, 0x7
perf[0x1704] <+452>: sub s1, s1, a0
perf[0x1706] <+454>: slli s1, s1, 0x2
perf[0x1708] <+456>: add t3, t3, s1
perf[0x170a] <+458>: fcvt.s.lu fa5, a2
perf[0x170e] <+462>: lui a0, 0x44948
perf[0x1712] <+466>: fmv.w.x fa4, a0
perf[0x1716] <+470>: fdiv.s fa5, fa5, fa4
perf[0x171a] <+474>: lw a0, 0x0(t3)
perf[0x171e] <+478>: subw a4, a4, a5
perf[0x1720] <+480>: slli a1, a4, 0x1
perf[0x1724] <+484>: slli a4, a4, 0x7
perf[0x1726] <+486>: subw a4, a4, a1
perf[0x1728] <+488>: subw a1, a3, t1
perf[0x172c] <+492>: li a2, 0xc1
perf[0x1730] <+496>: mul a1, a1, a2
perf[0x1734] <+500>: subw a2, t2, t0
perf[0x1738] <+504>: li a3, 0x8e
perf[0x173c] <+508>: mul a2, a2, a3
perf[0x1740] <+512>: subw a0, a6, a0
perf[0x1744] <+516>: li a3, 0x56
perf[0x1748] <+520>: mul a0, a0, a3
perf[0x174c] <+524>: add a1, a1, a2
perf[0x174e] <+526>: add a1, a1, a4
perf[0x1750] <+528>: add a0, a0, a1
perf[0x1752] <+530>: fcvt.s.w fa4, a0
perf[0x1756] <+534>: fmul.s fa0, fa5, fa4
perf[0x175a] <+538>: ld ra, 0x18(sp)
perf[0x175c] <+540>: ld s0, 0x10(sp)
perf[0x175e] <+542>: ld s1, 0x8(sp)
perf[0x1760] <+544>: ld s2, 0x0(sp)
perf[0x1762] <+546>: addi sp, sp, 0x20
perf[0x1764] <+548>: ret
The explanation, of course, is the use of std::deque as the cyclic
buffer for our process variable history. The great thing about
std::deque, the double-ended queue, is that both insertion (pushing
on the front) and removal (popping from the back) have O(1)
complexity. The not-so-great thing is that whenever we want to access
an element by index (as the getdiffSG9 function does 8 times in a
row), deque::operator[] has to check whether the given index causes
a wrap-around on the underlying buffer.
Conceptually simple, but it means that what looks like a simple access to a relatively addressed element actually becomes some offset calculations, then a comparison and a branch, and then the actual load of the data. And this needs to happen 8 times. That's a lot of branching for something that looked simple in C++.
Now, is that worth changing? Is the std::deque insufficient? Do I
think that I can do better?
The answer, of course, is a resounding YES to all of the above.
In all seriousness though… This type of decision has to weigh in the
use case. There are so many benefits to using the standard containers,
and they are extremely efficient at what they do. There is no way that
I would want to try to implement a better std::deque - I would
likely fail if I tried. But in my use case I don’t need
std::deque. I need a cyclic buffer, and just that. Something much
less than std::deque. But something more than std::vector. Let’s
try to consider what we really want:
- Fixed size cyclic buffer; number of elements can be constant
- Random access addressing should be relative to the most recently
inserted element
- element 0 is the most recently inserted element
- element 1 is the previously inserted element
- etc.
- Branch-less (O(1) and super efficient) random access to a limited number of past elements
- Efficient insertion (O(1) and pretty fast)
In general, when building cyclic buffers, one tends to end up with a trade-off where:
- Efficient insert means we need to be careful on retrieval
- Efficient retrieval means we need to move data on insert
A simple replacement of std::deque with std::vector where I then
simply displaced the contents by one on insert yielded a net-negative
performance impact. Something more fancy is called for. The vector
experiment did show that the getdiffSG9 routine suddenly became
inlined in the main loop rather than being called as a function - so
the experiment did confirm that, given efficient random access
retrieval, great savings are enabled elsewhere.
The circular buffer
If we do not want to displace the element on underlying storage when
we insert into our cyclic buffer, we need a head pointer which we
move on every insert.
What follows is that, on read, we would implement access something like:

const E &operator[](size_t i) const {
    return m_buf[(m_head + i) % m_buf.size()];
}

This would either use a costly division, or one could implement it with a branch instead. Either way, single-element access becomes expensive. This is what we see in the disassembly too.
Now, if the number of elements we want to be able to access by index is limited - which it is in our use case (at most up to index 8) - one could imagine an alternative data structure:
What if we maintain a front index into an N-element cyclic buffer,
which we decrement whenever we insert an element; when front
reaches 0, the next insert wraps it to N-1.
Then, contiguously at the end of our buffer, we add a P-element
space, where P is the number of elements we wish to be able to access
by index (relative to front).
What this means is that, no matter the value of front, it is always
safe to directly access up to P elements relative to front (so up
to absolute element front+P). No branching. No modulus.
A sequence of data insertions will update the buffer (with N=30 and
P=3) in the following pattern:
[d0 d0 d0 ... d0 d0 d0] [d0 d0 d0]
^front
[d0 d0 d0 ... d0 d1 d0] [d0 d0 d0]
^front
[d0 d0 d0 ... d2 d1 d0] [d0 d0 d0]
^front
...
[d0 d0 d27 ... d2 d1 d0] [d0 d0 d27]
^front
[d0 d28 d27 ... d2 d1 d0] [d0 d28 d27]
^front
[d29 d28 d27 ... d2 d1 d0] [d29 d28 d27]
^front
[d29 d28 d27 ... d2 d1 d30] [d29 d28 d27]
^front
[d29 d28 d27 ... d2 d31 d30] [d29 d28 d27]
^front
The burden is on insertion, of course, where inserts into the lowest
P slots of the main buffer need to be mirrored into the “overflow”
buffer as well.
So at the cost of one branch on insertion, we save a branch on every random access. This is a very good trade-off.
The core parts of the implementation follow here:
template <typename E>
class Cyclic {
public:
    Cyclic(const size_t n_elements,
           const size_t n_overflow)
        : m_n_elements(n_elements)
        , m_n_overflow(n_overflow)
    {
    }
    // Insert new value in cyclic buffer
    void push(const E &v) {
        if (m_buf.empty()) {
            m_buf.resize(m_n_elements + m_n_overflow, v);
        } else {
            // Reduce head position
            if (m_head)
                --m_head;
            else
                m_head = m_n_elements-1;
            // Set new head element
            m_buf[m_head] = v;
            // Maintain overflow buffer
            if (m_head < m_n_overflow) {
                m_buf[m_head + m_n_elements] = v;
            }
        }
    }
    // Simple buffer access
    // i must be smaller than or equal to n_overflow
    const E &operator[](size_t i) const {
        return m_buf[m_head + i];
    }
    E &operator[](size_t i) {
        return m_buf[m_head + i];
    }
private:
    // Our setup
    const size_t m_n_elements;
    const size_t m_n_overflow;
    // Underlying buffer
    std::vector<E> m_buf;
    // Head position
    size_t m_head = 0;
};

So was it worth getting rid of std::deque? Let's find out:
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m3.179s
user 0m3.106s
sys 0m0.005s

I believe that's a yes, especially considering we are not done taking
out expensive pieces of logic. Let’s look at the getdiffSG9
disassembly. Actually, what we find is that this function no longer
exists on its own - it is now inlined in the perf_pidloop routine
and therefore the best we can get is an excerpt from the pidloop
disassembly (meaning, a few bits may be missing in this excerpt and a
few instructions from surrounding code may be weaved in):
99 F getdiffSG9() {
** 100 return F(m_sps) / F(1188)
perf[0xfea] <+252>: fdiv.s fa4, fa3, fs4
perf[0xfee] <+256>: slli a0, a0, 0x2
perf[0xff0] <+258>: add a0, a0, a1
** 101 * F(86 * m_history[8]
perf[0xff2] <+260>: lw a6, 0x20(a0)
** 102 - 142 * m_history[7]
perf[0xff6] <+264>: lw a2, 0x1c(a0)
** 103 - 193 * m_history[6]
perf[0xff8] <+266>: lw a3, 0x18(a0)
** 104 - 126 * m_history[5]
perf[0xffa] <+268>: lw a4, 0x14(a0)
** 105 + 126 * m_history[3]
perf[0xffc] <+270>: lw a5, 0xc(a0)
** 106 + 193 * m_history[2]
perf[0xffe] <+272>: lw s1, 0x8(a0)
** 107 + 142 * m_history[1]
perf[0x1000] <+274>: lw a1, 0x4(a0)
** 108 - 86 * m_history[0]);
109 }
110
perf[0x1002] <+276>: lw a0, 0x0(a0)
perf[0x1004] <+278>: subw a5, a5, a4
perf[0x1006] <+280>: slli a4, a5, 0x1
perf[0x100a] <+284>: slli a5, a5, 0x7
perf[0x100c] <+286>: subw a5, a5, a4
perf[0x100e] <+288>: subw s1, s1, a3
perf[0x1010] <+290>: mul a3, s1, s2
perf[0x1014] <+294>: subw a1, a1, a2
perf[0x1016] <+296>: mul a1, a1, s3
perf[0x101a] <+300>: subw a0, a6, a0
perf[0x101e] <+304>: mul a0, a0, s4
** 106 + 193 * m_history[2]
perf[0x1022] <+308>: add a1, a1, a3
** 107 + 142 * m_history[1]
perf[0x1024] <+310>: add a1, a1, a5

Notice the difference in the instructions for the calculations?
Instead of endless calculations for where to load the data elements
from, all that is gone, replaced with simple lw dst, ofs(a0). One
instruction for one data load. That’s it.
From the processor manual we know that for a core to retire two
instructions per cycle, only one of the two instructions can be a
load, so clearly the sequence of 8 loads is not going to interleave
with itself. Looking back further though, it is clear that an
fdiv.s is issued before this - that can take from 9 to 36 cycles to
complete, meaning all the loads probably interleave with that one.
Looking at where time is spent now, a flat profile shows us:
joe@chip:~/fanman$ gprof targets/perf --flat-profile
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
37.57 0.95 0.95 ProcessHistory<int, float>::get()
19.77 1.45 0.50 perf_pidloop()
16.61 1.87 0.42 void std::__final_insertion_sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter)
13.84 2.22 0.35 ProcessHistory<int, float>::add(int)
7.91 2.42 0.20 void std::__introsort_loop<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter)

It is time to give get a little tender loving care.
Faster get
The get routine calculates the median of the last half second of
samples, in an effort to make it basically return “the most recent
value, kind of, but without the noise”.
And medians are a great way to go for getting rid of noise. But realistically, given that we are interested in performance and medians require sorting things (and sorting requires branching and branching wastes processor cycles) - maybe a mean would be a serviceable substitute for the median? It certainly is faster to calculate.
...
// Add new element
void add(B e) {
    ...
    } else {
        // Remove the 'leaving' elements from the sums
        m_sum -= m_history.back();
        m_getsum -= m_history[m_sps/2-1];
        // Insert new
        m_history.push(e);
        // Add the new element to the sums
        m_sum += e;
        m_getsum += e;
    }
}
...
// Get most recent half second average, in an effort to filter out
// the worst of the noise
F get() {
    // The half second sum divided by samples per half second
    return m_getsum / F(m_sps/2);
}
...

So we modify the add routine so that we maintain not just the total
sum of all entries in the history (for easy buffer integral
calculation), but also a sum of all the entries in the past half
second of the history. This allows us to trivially calculate the mean
of the last half second of entries (we simply divide the sum of the
entries by the number of entries).
The F(m_sps/2) looks funny perhaps, but it is written this way very
deliberately. We want to calculate the number of samples per half
second by integer division, because the number of samples is going to
be an integer. That result is then converted to whatever floating
point type we chose to work with (the F template type).
How does getting rid of the single top time consumer help us?
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m1.522s
user 0m1.484s
sys 0m0.001s
joe@chip:~/fanman$

Wham! That worked quite well, I would say!
Our profile now clearly shows that pretty much everything is so simple
it is inlined - with the exception of add:
joe@chip:~/fanman$ gprof targets/perf --flat-profile
BFD: DWARF error: mangled line number section (bad file number)
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
65.05 0.92 0.92 perf_pidloop()
32.35 1.37 0.46 ProcessHistory<int, float>::add(int)
0.36 1.38 0.01 __clang_call_terminate

Now, add could be simplified. Personally, I don't like that a
routine like add, which is called often, checks on every call
whether it is the first call. This seems wasteful and silly.
However… If add does not do this, I would have to have an init
routine instead. That init routine would have to be called by the
outer logic, before entering the main loop of the application that
uses the ProcessHistory class. This could be done. But… I don’t
know. It just doesn’t seem very clean to have a container that has all
sorts of requirements for special members one must call before it can
really be used. That’s what the constructor is for. Which is called on
initialisation. After construction, containers should be good to
go. Generally…
Maybe it’s worth it. I’ll leave this one for now.
Getting rid of costly floating divisions
As already stated, a floating point division takes between 9 and 36
cycles, whereas a load can complete in a couple of cycles (on cache
hit) and a floating point multiplication in another 5 cycles. With
this logic, rewriting x / F(m_sps) into x * m_inv_sps and having
pre-calculated and stored m_inv_sps = 1/F(m_sps) would make a lot of
sense.
We do this in a couple of places and get:
template <typename B = int32_t, typename F = double>
class ProcessHistory {
public:
ProcessHistory(size_t samples_per_second,
size_t samples_in_history)
: m_sps(samples_per_second)
, m_inv_sps(1 / F(m_sps))
, m_inv_sphs(1 / F(m_sps/2))
, m_sih(samples_in_history)
, m_SG5_factor(F(m_sps) / F(12))
, m_SG7_factor(F(m_sps) / F(252))
, m_SG9_factor(F(m_sps) / F(1188))
...
// Get most recent half second average, in an effort to filter out
// the worst of the noise
F get() {
// The half second sum divided by samples per half second
return m_getsum * m_inv_sphs;
}
// Get integral (sum divided by samples per second)
F getint() {
return m_sum * m_inv_sps;
}
...
F getdiffSG9() {
return m_SG9_factor
* F(86 * m_history[8]
- 142 * m_history[7]
- 193 * m_history[6]
- 126 * m_history[5]
+ 126 * m_history[3]
+ 193 * m_history[2]
+ 142 * m_history[1]
- 86 * m_history[0]);
}
...
private:
// Our constants from construction
const size_t m_sps;
// Two floating point constants allow us to avoid costly floating
// point division during sliding window mean and integral
// calculation
const F m_inv_sps; // 1 / samples per second
const F m_inv_sphs; // 1 / samples per half second (integer division)
// Samples in history
const size_t m_sih;
// Another three floating point constants allow us to avoid costly
// floating point division during differential calculation
const F m_SG5_factor;
const F m_SG7_factor;
const F m_SG9_factor;
...
};
How do these last optimisations stack up?
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m1.138s
user 0m1.084s
sys 0m0.013s
joe@chip:~/fanman$
We started at around five seconds for the test and landed at around one second. That’s an 80% reduction, or a 5X speedup. Not too bad if I do say so myself.
Now… how about that add? What does the profile say?
joe@chip:~/fanman$ gprof targets/perf --flat-profile
BFD: DWARF error: mangled line number section (bad file number)
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
51.15 0.50 0.50 perf_pidloop()
43.40 0.92 0.42 ProcessHistory<int, float>::add(int)
1.03 0.93 0.01 __clang_call_terminate
43%??? Unacceptable!
Adding requirement for special initialisation
I don’t like it, but let’s see what happens if we have an init
routine that is called like add, but has to be called once before
anyone calls add on the container.
template <typename B = int32_t, typename F = double>
class ProcessHistory {
public:
...
// First addition of an element to the container - initialises
// everything
void init(B e) {
// Adding the first element will fill the cyclic buffer with
// this element value - initialise the sums accordingly
m_sum = e * m_sih;
m_getsum = e * (m_sps/2);
m_history.init(e);
}
// Add new element
void add(B e) {
// Remove the 'leaving' elements from the sums
m_sum -= m_history.back();
m_getsum -= m_history[m_sps/2-1];
// Insert new
m_history.push(e);
// Add the new element to the sums
m_sum += e;
m_getsum += e;
}
...
template <typename E>
class Cyclic {
public:
...
// Initialise cyclic buffer
void init(const E &v) {
m_buf.resize(m_n_elements + m_n_overflow, v);
}
// Insert new value in cyclic buffer
void push(const E &v) {
// Reduce head position
if (m_head)
--m_head;
else
m_head = m_n_elements-1;
// Set new head element
m_buf[m_head] = v;
// Maintain overflow buffer
if (m_head < m_n_overflow) {
m_buf[m_head + m_n_elements] = v;
}
}
...
But is it worth it? Let’s see:
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m0.920s
user 0m0.870s
sys 0m0.013s
joe@chip:~/fanman$
That’s another 13% off or so. Not too bad. Better still, none of the
members of ProcessHistory now exist as functions in the resulting
binary - everything is inlined. This shows on the profile too:
joe@chip:~/fanman$ gprof targets/perf --flat-profile
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
100.23 0.87 0.87 perf_pidloop()
An 82.5% reduction, or almost a 6X speedup, on the core PID logic. So rather than spending 38 minutes on the CPU every two and a half weeks, it should spend less than 10 (leaving a little room for the file I/O, which I have not touched).
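With the init requirement, the call site gains a one-time setup step before the main loop. Sketched with assumed names (read_temp is a hypothetical stand-in for the sysfs read; this is not the fanman main loop itself):

```cpp
#include <cstdint>
#include <vector>

// Simplified ProcessHistory-style container with an init() requirement
struct History {
    std::vector<int32_t> buf;
    int64_t sum = 0;
    void init(int32_t e) {           // must be called exactly once
        buf.assign(10, e);
        sum = int64_t(e) * 10;
    }
    void add(int32_t e) {            // hot path: no first-call branch
        sum += e - buf.back();
        buf.pop_back();
        buf.insert(buf.begin(), e);
    }
};

int32_t read_temp() { return 42; }   // stub for the real sensor read

void run(int iterations) {
    History h;
    h.init(read_temp());             // one-time initialisation
    for (int i = 0; i < iterations; ++i) {
        h.add(read_temp());          // branch-free steady state
        // ... PID calculation and duty cycle update would go here ...
    }
}
```

The ugliness is confined to one extra line before the loop, and in exchange the per-sample path carries no first-call test at all.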
A second take on the fanman software
While integrating these optimisations, I am also making history size and sample rate configurable. Because maybe, just maybe, we don’t need 10 samples per second to run a fan (I know, shocking). On the other hand, a longer history will allow the integral component of the PID control logic to better correct for steady-state error.
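To recap why history length matters: the integral term accumulates past error, so a longer window gives it more memory with which to cancel a persistent offset. A minimal PID step, with assumed gains and names (not the fanman implementation):

```cpp
// Minimal PID update; gains and names are assumed for illustration.
struct Pid {
    double kp = 1.0, ki = 0.1, kd = 0.0;
    double integral = 0.0;
    double prev_err = 0.0;

    double step(double err, double dt) {
        integral += err * dt;                  // steady-state memory
        const double deriv = (err - prev_err) / dt;
        prev_err = err;
        return kp * err + ki * integral + kd * deriv;
    }
};
```

Under a constant error the integral contribution keeps growing until the output has shifted enough to remove the offset - exactly the steady-state correction a longer history supports.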
Running the fan manager software with 4 samples per second versus running it with 10 samples per second is obviously going to result in direct savings on CPU cycles. But how does the resulting temperature control look? The following graph shows the actual temperature when running at these two sample rates - the node had a bit of network/filesystem load at the time but nothing heavy:

Note that the graphs are from two separate runs - the horizontal axis is “seconds since start”. For this reason the graphs have their maxima and minima at different times - this is perfectly fine. Also, the two graphs are not directly comparable since, again, they ran at different absolute times, so system load may have changed slightly. Still, the graph at least indicates that the 10sps temperature (the green curve) is slightly more tightly regulated than the 4sps temperature (the purple curve). This is to be expected; sampling less often leads to slower reaction time.

Looking at the duty cycle, we see that the 10sps (green curve) has many really tiny spikes whereas the 4sps (purple curve) compensates less often and appears more smooth. The tiny spikes on the 10sps curve are in part caused by the minimum duty cycle constraint as well as the minimum duty cycle for spinning up the fan from stand-still. These constraints will cause regulation to be unable to find an equilibrium if, say, the “perfect” duty cycle would be 10% but the minimum allowed duty cycle is 40%.
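The constraint logic amounts to a clamp. A sketch with assumed limits (fanman’s actual minimums depend on the fan):

```cpp
#include <algorithm>

// Hypothetical duty cycle limits - real values depend on the fan
constexpr double kMinDC = 40.0;   // below this the fan stalls or won't spin up
constexpr double kMaxDC = 100.0;

// Either fully off, or clamped into the fan's usable range
double apply_constraints(double wanted_dc) {
    if (wanted_dc <= 0.0)
        return 0.0;               // fan fully off
    return std::clamp(wanted_dc, kMinDC, kMaxDC);
}
```

A “perfect” 10% request becomes 40%, which overcools; the controller then backs off towards 0%, the temperature climbs again, and the cycle repeats - producing the tiny spikes seen on the 10sps curve.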
Anyway, from the looks of it, it doesn’t seem like going to 4 samples per second is going to do anything terrible for the fan regulation.
Getting the software
As before, I provide you with the bits that constitute a Debian source package for you to build on your own:
joe@pluto:~/build$ dpkg-source -x fanman_2-1.dsc
dpkg-source: warning: extracting unsigned source package (fanman_2-1.dsc)
dpkg-source: info: extracting fanman in fanman-2
dpkg-source: info: unpacking fanman_2.orig.tar.xz
dpkg-source: info: unpacking fanman_2-1.debian.tar.xz
joe@pluto:~/build$ cd fanman-2/
joe@pluto:~/build/fanman-2$ dpkg-buildpackage -uc -us -b
dpkg-buildpackage: info: source package fanman
dpkg-buildpackage: info: source version 2-1
dpkg-buildpackage: info: source distribution unstable
dpkg-buildpackage: info: source changed by Jakob Oestergaard <joe@unthought.net>
dpkg-buildpackage: info: host architecture riscv64
dpkg-source --before-build .
debian/rules clean
...
dpkg-deb: building package 'fanman' in '../fanman_2-1_riscv64.deb'.
dpkg-deb: building package 'fanman-dbgsym' in '../fanman-dbgsym_2-1_riscv64.deb'.
dpkg-genbuildinfo --build=binary -O../fanman_2-1_riscv64.buildinfo
dpkg-genchanges --build=binary -O../fanman_2-1_riscv64.changes
dpkg-genchanges: info: binary-only upload (no source code included)
dpkg-source --after-build .
dpkg-buildpackage: info: binary-only upload (no source included)
joe@pluto:~/build/fanman-2$ cd ..
joe@pluto:~/build$ ls fanman_2*deb
fanman_2-1_riscv64.deb
joe@pluto:~/build$ sudo dpkg -i fanman_2-1_riscv64.deb
...
Enjoy!