Keep it light weight
A while back, as part of my ever so slightly over-engineered noise reduction effort on the basement cluster, I developed a simple fan manager daemon. The emphasis was on simplicity and a small footprint.
Now, some weeks down the road, looking at the systemctl status fanman
output, I cannot help but notice that the fan controller software has
burned quite some CPU cycles in total: a full 38 minutes and change,
from just around two and a half weeks of running.
This is not terrible of course. It never shows up on top. It’s not
like the cluster can’t do real work because it spends time managing
its fans. In fact, this is not really an issue in practical terms. But
it bugs me. 38 minutes. On the CPU. For just a few weeks of managing
the fans. That is not good enough.
I want to take you through how I approach a performance problem like this: going through the basic process of establishing the capability to measure, getting a baseline, looking with a critical eye at the work that is done, and iterating towards an acceptable solution.
Baseline
From looking at the code, the vast majority of the work happens in the PID logic. Outside of the PID logic, we read an integer from one file and we sometimes write an integer to another file (and both files are kept open, so those operations are very low overhead).
Let’s write a simple benchmark application to exercise the PID
logic. This will allow us to make incremental changes and document
that they are actually effective. The following perf_pidloop logic
is taken verbatim from the fanman main loop, so it reflects how the
PID logic is actually used.
The set_dc function is an empty function implemented in another
compilation unit to prevent inlining (and thereby the elimination of
the new_dc calculation, which would otherwise be dead code). It’s
basically a trick to stop the compiler from realising that our code
doesn’t actually produce results.
void perf_pidloop()
{
    // This is our most basic PID logic where we run our PID
    // calculations
    ProcessHistory<int32_t> errorhist(10, 10*15);
    int32_t ideal_temp = 48000;
    double old_dc = 0;
    double pid_kp = 0.1,
           pid_ki = 0.2,
           pid_kd = 0.05,
           fan_min_dc = 0.4,
           fan_start_dc = 0.8;
    // Run some iterations
    for (size_t s = 0; s != 10000000; ++s) {
        int32_t temp = 49950;
        errorhist.add(temp - ideal_temp);
        double new_dc = pid_kp * errorhist.get()
            + pid_ki * errorhist.getint()
            + pid_kd * errorhist.getdiffSG9();
        if (new_dc < 0) new_dc = 0;
        if (new_dc > 1) new_dc = 1;
        new_dc = .6 * new_dc + .4 * old_dc;
        if (new_dc < fan_min_dc) {
            new_dc = new_dc < fan_min_dc/2
                ? 0.
                : fan_min_dc;
        }
        if (old_dc == 0 && new_dc > 0)
            new_dc = fan_start_dc;
        if (old_dc != new_dc) {
            set_dc(new_dc);
            old_dc = new_dc;
        }
    }
}

Running the performance test software on our baseline PID code yields the following result:
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m5.059s
user 0m4.982s
sys 0m0.004s
joe@chip:~/fanman$ gprof targets/perf
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
29.98 1.22 1.22 ProcessHistory<int>::getdiffSG9()
24.57 2.22 1.00 ProcessHistory<int>::get()
22.11 3.12 0.90 perf_pidloop()
8.85 3.48 0.36 void std::__final_insertion_sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter)
4.91 3.68 0.20 ProcessHistory<int>::add(int)
3.69 3.83 0.15 std::deque<int, std::allocator<int> >::push_front(int const&)
3.44 3.97 0.14 void std::__introsort_loop<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter)
joe@chip:~/fanman$

We have our baseline. Five seconds. Game on.
The profile reveals that, not shockingly, the Savitzky-Golay filter
(getdiffSG9) and the half-second median (get) are two expensive
routines.
Before going into the routines themselves, one thing struck me: I have been using double precision floating point arithmetic wherever floating point arithmetic was needed. This is out of old habit - it used to be that on “big” machines, double precision and single precision arithmetic cost basically the same; the only saving was in memory (an FP32 takes up 32 bits and an FP64 takes up 64, unsurprisingly). That has changed a lot, however. GPUs today generally double their throughput when you halve the operand size. On the JH7110 CPU that I am targeting here it is not quite as dramatic, but we do see (reading from The SiFive U74-MC Core Complex Manual - or this local copy) that single precision operations are significantly faster than double precision:
| Instruction | Cycles on FP32 | Cycles on FP64 |
|---|---|---|
| fmul | 5 | 7 |
| fmadd | 5 | 7 |
| fdiv | 9-36 | 9-58 |
For what we are doing here, single precision is easily accurate
enough. Just by switching from double to float we can shave 29% of
the cycles off of the FP arithmetic (and 38% off of the worst case
division performance).
Interestingly, the JH7110 provides a fused multiply-add instruction that takes the same number of cycles as just the multiply - hugely useful for SAXPY-style operations (Single-precision A times X Plus Y), the building blocks of dot products and matrix multiplication.
Another point on the processor that should be brought up: As I touched upon in an earlier post, the JH7110 is not a speed demon (and it was not intended to be either). It does not have out-of-order execution, but it does have a dual-issue pipeline on each core, meaning each core can - ideally - retire two instructions per clock cycle. A number of practical limitations will make the real number lower, of course, but it is worth considering, when looking at the code, what it costs to - for example - tie up a pipeline with a 58-cycle division operation versus loading a constant from memory and multiplying by the inverse instead. More on this later.
Free improvement: double to float
Extending our ProcessHistory class to accept either double or
float as its floating-point type of choice is easy:
//
// Store a history of process outputs in a signed integer format. All
// getters return floating point of type F
//
template <typename B = int32_t, typename F = double>
class ProcessHistory {
    ...
    // Get most recent half second median, in an effort to filter out
    // the worst of the noise
    F get() {
        m_window.clear();
        ...
    // Get filtered differential; we use a Savitzky-Golay filter to get
    // the first derivative
    F getdiffSG5() {
        return 1. / (12. / m_sps)
        ...

In the performance test code from before, we instantiate the ProcessHistory as:

ProcessHistory<int32_t,float> errorhist(10, 10*15);

This completely free change gives us:
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m4.899s
user 0m4.828s
sys 0m0.005s
joe@chip:~/fanman$

Ok, that is not impressive. Where is the time spent?
joe@chip:~/fanman$ gprof targets/perf --flat-profile
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
29.71 1.12 1.12 ProcessHistory<int, float>::getdiffSG9()
25.73 2.09 0.97 ProcessHistory<int, float>::get()
18.83 2.80 0.71 perf_pidloop()
10.34 3.19 0.39 void std::__final_insertion_sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter)
3.98 3.34 0.15 ProcessHistory<int, float>::add(int)
3.45 3.47 0.13 void std::__introsort_loop<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter)
3.18 3.59 0.12 std::deque<int, std::allocator<int> >::push_front(int const&)

Let's take a look inside the getdiffSG9 routine to get a feel for
what's going on. First of all, the implementation looks like:
F getdiffSG9() {
    return 1. / (1188. / m_sps)
        * F(86 * m_history[8]
            - 142 * m_history[7]
            - 193 * m_history[6]
            - 126 * m_history[5]
            + 126 * m_history[3]
            + 193 * m_history[2]
            + 142 * m_history[1]
            - 86 * m_history[0]);
}

So a straightforward function, very simple: some memory loads, a bit of integer arithmetic, finally a float conversion and a bit of floating point arithmetic. The disassembly is rather large (which we shall return to), but I noticed at the very end of the function that we had:
perf[0x1726] <+462>: fld fa5, -0x652(a0)
perf[0x172a] <+466>: auipc a0, 0x4
perf[0x172e] <+470>: fld fa4, -0x652(a0)
** 93 return 1. / (1188. / m_sps)
perf[0x1732] <+474>: fcvt.d.lu fa3, a2
perf[0x1736] <+478>: fdiv.d fa5, fa5, fa3
perf[0x173a] <+482>: fdiv.d fa5, fa4, fa5
** 101 - 86 * m_history[0]);
102 }
What are fdiv.d double precision divisions doing here? F is
float… The answer, of course, is staring me right in the face - it is
the literals. 1. is a double precision literal in C++. If I use
F(1) instead, it will be a 1 converted to whatever the F type
is set to, something the compiler can do at compile time.
Another point: since floating point divisions are incredibly expensive, we really should rewrite the function to read:
F getdiffSG9() {
    return m_sps / F(1188)
        * F(86 * m_history[8]
            - 142 * m_history[7]
            - 193 * m_history[6]
            - 126 * m_history[5]
            + 126 * m_history[3]
            + 193 * m_history[2]
            + 142 * m_history[1]
            - 86 * m_history[0]);
}

This gets rid of the double precision literal and it cuts away a
division. (I also updated three literals in the perf.cc program.)
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m4.186s
user 0m4.103s
sys 0m0.013s

Now we're cooking! That's a 17% reduction just from not using excessive precision and not dividing unnecessarily.
Stop branching for simple things
Looking at the disassembly for the getdiffSG9 routine, we still see
that it is hugely complex. What looks, in C++, as a relatively simple
function, ends up being absolutely humongous:
(lldb) disas -n 'ProcessHistory<int, float>::getdiffSG9()'
perf`ProcessHistory<int, float>::getdiffSG9:
perf[0x1540] <+0>: addi sp, sp, -0x20
perf[0x1542] <+2>: sd ra, 0x18(sp)
perf[0x1544] <+4>: sd s0, 0x10(sp)
perf[0x1546] <+6>: sd s1, 0x8(sp)
perf[0x1548] <+8>: sd s2, 0x0(sp)
perf[0x154a] <+10>: addi s0, sp, 0x20
perf[0x154c] <+12>: mv s2, a0
perf[0x154e] <+14>: jal 0xeb0
perf[0x1552] <+18>: ld t3, 0x20(s2)
perf[0x1556] <+22>: ld a0, 0x28(s2)
perf[0x155a] <+26>: ld t4, 0x38(s2)
perf[0x155e] <+30>: sub a7, t3, a0
perf[0x1562] <+34>: srai s1, a7, 0x2
perf[0x1566] <+38>: li a2, -0x8
perf[0x1568] <+40>: addi a1, s1, 0x8
perf[0x156c] <+44>: blt s1, a2, 0x158a
perf[0x1570] <+48>: srli a2, a1, 0x7
perf[0x1574] <+52>: bnez a2, 0x158e
perf[0x1576] <+54>: addi a6, t3, 0x20
perf[0x157a] <+58>: li a2, -0x7
perf[0x157c] <+60>: addi a1, s1, 0x7
perf[0x1580] <+64>: bge s1, a2, 0x15aa
perf[0x1584] <+68>: srai a2, a1, 0x7
perf[0x1588] <+72>: j 0x15b0
perf[0x158a] <+74>: srai a2, a1, 0x7
perf[0x158e] <+78>: slli a3, a2, 0x3
perf[0x1592] <+82>: add a3, a3, t4
perf[0x1594] <+84>: ld a6, 0x0(a3)
perf[0x1598] <+88>: slli a2, a2, 0x7
perf[0x159a] <+90>: sub a1, a1, a2
perf[0x159c] <+92>: slli a1, a1, 0x2
perf[0x159e] <+94>: add a6, a6, a1
perf[0x15a0] <+96>: li a2, -0x7
perf[0x15a2] <+98>: addi a1, s1, 0x7
perf[0x15a6] <+102>: blt s1, a2, 0x1584
perf[0x15aa] <+106>: srli a2, a1, 0x7
perf[0x15ae] <+110>: beqz a2, 0x15e6
perf[0x15b0] <+112>: slli a3, a2, 0x3
perf[0x15b4] <+116>: add a3, a3, t4
perf[0x15b6] <+118>: ld t0, 0x0(a3)
perf`ProcessHistory<int, float>::getdiffSG9 [inlined] std::_Deque_iterator<int, int&, int*>::operator+=(long):
perf[0x15ba] <+122>: slli a2, a2, 0x7
perf[0x15bc] <+124>: sub a1, a1, a2
perf[0x15be] <+126>: slli a1, a1, 0x2
perf[0x15c0] <+128>: add t0, t0, a1
perf[0x15c2] <+130>: li a2, -0x6
perf[0x15c4] <+132>: addi a1, s1, 0x6
perf[0x15c8] <+136>: blt s1, a2, 0x15f4
perf[0x15cc] <+140>: srli a2, a1, 0x7
perf[0x15d0] <+144>: bnez a2, 0x15f8
perf[0x15d2] <+146>: addi t1, t3, 0x18
perf[0x15d6] <+150>: li a2, -0x5
perf[0x15d8] <+152>: addi a1, s1, 0x5
perf[0x15dc] <+156>: bge s1, a2, 0x1614
perf[0x15e0] <+160>: srai a2, a1, 0x7
perf[0x15e4] <+164>: j 0x161a
perf[0x15e6] <+166>: addi t0, t3, 0x1c
perf[0x15ea] <+170>: li a2, -0x6
perf[0x15ec] <+172>: addi a1, s1, 0x6
perf[0x15f0] <+176>: bge s1, a2, 0x15cc
perf[0x15f4] <+180>: srai a2, a1, 0x7
perf[0x15f8] <+184>: slli a3, a2, 0x3
perf[0x15fc] <+188>: add a3, a3, t4
perf[0x15fe] <+190>: ld t1, 0x0(a3)
perf[0x1602] <+194>: slli a2, a2, 0x7
perf[0x1604] <+196>: sub a1, a1, a2
perf[0x1606] <+198>: slli a1, a1, 0x2
perf[0x1608] <+200>: add t1, t1, a1
perf[0x160a] <+202>: li a2, -0x5
perf[0x160c] <+204>: addi a1, s1, 0x5
perf[0x1610] <+208>: blt s1, a2, 0x15e0
perf[0x1614] <+212>: srli a2, a1, 0x7
perf[0x1618] <+216>: beqz a2, 0x1650
perf[0x161a] <+218>: slli a3, a2, 0x3
perf[0x161e] <+222>: add a3, a3, t4
perf[0x1620] <+224>: ld t2, 0x0(a3)
perf[0x1624] <+228>: slli a2, a2, 0x7
perf[0x1626] <+230>: sub a1, a1, a2
perf[0x1628] <+232>: slli a1, a1, 0x2
perf[0x162a] <+234>: add t2, t2, a1
perf[0x162c] <+236>: li a2, -0x3
perf[0x162e] <+238>: addi a1, s1, 0x3
perf[0x1632] <+242>: blt s1, a2, 0x165e
perf[0x1636] <+246>: srli a2, a1, 0x7
perf[0x163a] <+250>: bnez a2, 0x1662
perf[0x163c] <+252>: addi a4, t3, 0xc
perf[0x1640] <+256>: li a1, -0x2
perf[0x1642] <+258>: addi a2, s1, 0x2
perf[0x1646] <+262>: bge s1, a1, 0x167c
perf[0x164a] <+266>: srai a1, a2, 0x7
perf[0x164e] <+270>: j 0x1682
perf[0x1650] <+272>: addi t2, t3, 0x14
perf[0x1654] <+276>: li a2, -0x3
perf[0x1656] <+278>: addi a1, s1, 0x3
perf[0x165a] <+282>: bge s1, a2, 0x1636
perf[0x165e] <+286>: srai a2, a1, 0x7
perf[0x1662] <+290>: slli a3, a2, 0x3
perf[0x1666] <+294>: add a3, a3, t4
perf[0x1668] <+296>: ld a4, 0x0(a3)
perf[0x166a] <+298>: slli a2, a2, 0x7
perf[0x166c] <+300>: sub a1, a1, a2
perf[0x166e] <+302>: slli a1, a1, 0x2
perf[0x1670] <+304>: add a4, a4, a1
perf[0x1672] <+306>: li a1, -0x2
perf[0x1674] <+308>: addi a2, s1, 0x2
perf[0x1678] <+312>: blt s1, a1, 0x164a
perf[0x167c] <+316>: srli a1, a2, 0x7
perf[0x1680] <+320>: beqz a1, 0x16a8
perf[0x1682] <+322>: slli a3, a1, 0x3
perf[0x1686] <+326>: add a3, a3, t4
perf[0x1688] <+328>: ld a3, 0x0(a3)
perf[0x168a] <+330>: slli a1, a1, 0x7
perf[0x168c] <+332>: sub a2, a2, a1
perf[0x168e] <+334>: slli a2, a2, 0x2
perf[0x1690] <+336>: add a3, a3, a2
perf[0x1692] <+338>: li a1, -0x1
perf[0x1694] <+340>: addi a2, s1, 0x1
perf[0x1698] <+344>: blt s1, a1, 0x16b6
perf[0x169c] <+348>: srli a1, a2, 0x7
perf[0x16a0] <+352>: bnez a1, 0x16ba
perf[0x16a2] <+354>: addi a1, t3, 0x4
perf[0x16a6] <+358>: j 0x16cc
perf[0x16a8] <+360>: addi a3, t3, 0x8
perf[0x16ac] <+364>: li a1, -0x1
perf[0x16ae] <+366>: addi a2, s1, 0x1
perf[0x16b2] <+370>: bge s1, a1, 0x169c
perf[0x16b6] <+374>: srai a1, a2, 0x7
perf[0x16ba] <+378>: slli a5, a1, 0x3
perf[0x16be] <+382>: add a5, a5, t4
perf[0x16c0] <+384>: ld a5, 0x0(a5)
perf[0x16c2] <+386>: slli a1, a1, 0x7
perf[0x16c4] <+388>: sub a2, a2, a1
perf[0x16c6] <+390>: slli a1, a2, 0x2
perf[0x16ca] <+394>: add a1, a1, a5
perf[0x16cc] <+396>: ld a2, 0x0(s2)
perf[0x16d0] <+400>: lw a6, 0x0(a6)
perf[0x16d4] <+404>: lw t0, 0x0(t0)
perf[0x16d8] <+408>: lw t1, 0x0(t1)
perf[0x16dc] <+412>: lw a5, 0x0(t2)
perf[0x16e0] <+416>: lw a4, 0x0(a4)
perf[0x16e2] <+418>: lw a3, 0x0(a3)
perf[0x16e4] <+420>: lw t2, 0x0(a1)
perf[0x16e8] <+424>: bltz s1, 0x16f4
perf[0x16ec] <+428>: srli a0, s1, 0x7
perf[0x16f0] <+432>: bnez a0, 0x16f8
perf[0x16f2] <+434>: j 0x170a
perf[0x16f4] <+436>: srai a0, a7, 0x9
perf[0x16f8] <+440>: slli a1, a0, 0x3
perf[0x16fc] <+444>: add a1, a1, t4
perf[0x16fe] <+446>: ld t3, 0x0(a1)
perf[0x1702] <+450>: slli a0, a0, 0x7
perf[0x1704] <+452>: sub s1, s1, a0
perf[0x1706] <+454>: slli s1, s1, 0x2
perf[0x1708] <+456>: add t3, t3, s1
perf[0x170a] <+458>: fcvt.s.lu fa5, a2
perf[0x170e] <+462>: lui a0, 0x44948
perf[0x1712] <+466>: fmv.w.x fa4, a0
perf[0x1716] <+470>: fdiv.s fa5, fa5, fa4
perf[0x171a] <+474>: lw a0, 0x0(t3)
perf[0x171e] <+478>: subw a4, a4, a5
perf[0x1720] <+480>: slli a1, a4, 0x1
perf[0x1724] <+484>: slli a4, a4, 0x7
perf[0x1726] <+486>: subw a4, a4, a1
perf[0x1728] <+488>: subw a1, a3, t1
perf[0x172c] <+492>: li a2, 0xc1
perf[0x1730] <+496>: mul a1, a1, a2
perf[0x1734] <+500>: subw a2, t2, t0
perf[0x1738] <+504>: li a3, 0x8e
perf[0x173c] <+508>: mul a2, a2, a3
perf[0x1740] <+512>: subw a0, a6, a0
perf[0x1744] <+516>: li a3, 0x56
perf[0x1748] <+520>: mul a0, a0, a3
perf[0x174c] <+524>: add a1, a1, a2
perf[0x174e] <+526>: add a1, a1, a4
perf[0x1750] <+528>: add a0, a0, a1
perf[0x1752] <+530>: fcvt.s.w fa4, a0
perf[0x1756] <+534>: fmul.s fa0, fa5, fa4
perf[0x175a] <+538>: ld ra, 0x18(sp)
perf[0x175c] <+540>: ld s0, 0x10(sp)
perf[0x175e] <+542>: ld s1, 0x8(sp)
perf[0x1760] <+544>: ld s2, 0x0(sp)
perf[0x1762] <+546>: addi sp, sp, 0x20
perf[0x1764] <+548>: ret
The explanation, of course, is the use of std::deque as the cyclic
buffer for our process variable history. The great thing about
std::deque, the double-ended queue, is that both insertion (pushing
on the front) and removal (popping from the back) have O(1)
complexity. The not-so-great thing is that whenever we want to access
an element by index (as the getdiffSG9 function does 8 times in a
row), deque::operator[] has to check whether the given index causes
a wrap-around on the underlying buffer.
Conceptually simple, but it means that what looks like a simple access to a relatively addressed element actually becomes some offset calculations, then a comparison and a branch, and then the actual load of the data. And this needs to happen 8 times. That's a lot of branching for something that looked simple in C++.
Now, is that worth changing? Is the std::deque insufficient? Do I
think that I can do better?
The answer, of course, is a resounding YES to all of the above.
In all seriousness though… This type of decision has to weigh in the
use case. There are so many benefits to using the standard containers,
and they are extremely efficient at what they do. There is no way that
I would want to try to implement a better std::deque - I would
likely fail if I tried. But in my use case I don’t need
std::deque. I need a cyclic buffer, and just that. Something much
less than std::deque. But something more than std::vector. Let’s
try to consider what we really want:
- Fixed size cyclic buffer; number of elements can be constant
- Random access addressing should be relative to the most recently
inserted element
- element 0 is the most recently inserted element
- element 1 is the previously inserted element
- etc.
- Branch-less (O(1) and super efficient) random access to a limited number of past elements
- Efficient insertion (O(1) and pretty fast)
In general, when building cyclic buffers, one tends to end up with a trade-off where:
- Efficient insert means we need to be careful on retrieval
- Efficient retrieval means we need to move data on insert
A simple replacement of std::deque with std::vector where I then
simply displaced the contents by one on insert yielded a net-negative
performance impact. Something more fancy is called for. The vector
experiment did show that the getdiffSG9 routine suddenly became
inlined in the main loop rather than being called as a function - so
the experiment did confirm that, given efficient random access
retrieval, great savings are enabled elsewhere.
The circular buffer
If we do not want to displace the element on underlying storage when
we insert into our cyclic buffer, we need a head pointer which we
move on every insert.
What follows is that, on read, we would implement access something like:

const E &operator[](size_t i) const {
    return m_buf[(m_head + i) % m_buf.size()];
}

This would either use a costly division, or one could implement it with a branch instead. Either way, single-element access becomes expensive. This is what we see in the disassembly too.
Now, if the number of elements we want to be able to access by index is limited - which it is in our use case (at most up to index 8) - one could imagine an alternative data structure:
What if we maintain a front index into an N-element cyclic buffer,
which we decrement whenever we insert an element; when front
reaches 0, the next insert wraps it to N-1.
Then, contiguously at the end of our buffer, we add a P-element
space, where P is the number of elements we wish to be able to access
by index (relative to front).
What this means is that, no matter the value of front, it is always
safe to directly access up to P elements relative to front (so up
to absolute element front+P). No branching. No modulus.
A sequence of data insertions will update the buffer (with N=30 and
P=3) in the following pattern:
[d0 d0 d0 ... d0 d0 d0] [d0 d0 d0]
^front
[d0 d0 d0 ... d0 d1 d0] [d0 d0 d0]
^front
[d0 d0 d0 ... d2 d1 d0] [d0 d0 d0]
^front
...
[d0 d0 d27 ... d2 d1 d0] [d0 d0 d27]
^front
[d0 d28 d27 ... d2 d1 d0] [d0 d28 d27]
^front
[d29 d28 d27 ... d2 d1 d0] [d29 d28 d27]
^front
[d29 d28 d27 ... d2 d1 d30] [d29 d28 d27]
^front
[d29 d28 d27 ... d2 d31 d30] [d29 d28 d27]
^front
The burden is on insertion, of course, where inserts into the lowest
P slots of the main buffer need to be mirrored into the “overflow”
buffer as well.
So at the cost of one branch on insertion, we save a branch on every random access. This is a very good trade-off.
The core parts of the implementation follow here:
template <typename E>
class Cyclic {
public:
    Cyclic(const size_t n_elements,
           const size_t n_overflow)
        : m_n_elements(n_elements)
        , m_n_overflow(n_overflow)
    {
    }
    // Insert new value in cyclic buffer
    void push(const E &v) {
        if (m_buf.empty()) {
            m_buf.resize(m_n_elements + m_n_overflow, v);
        } else {
            // Reduce head position
            if (m_head)
                --m_head;
            else
                m_head = m_n_elements-1;
            // Set new head element
            m_buf[m_head] = v;
            // Maintain overflow buffer
            if (m_head < m_n_overflow) {
                m_buf[m_head + m_n_elements] = v;
            }
        }
    }
    // Simple buffer access
    // i must be smaller than or equal to n_overflow
    const E &operator[](size_t i) const {
        return m_buf[m_head + i];
    }
    E &operator[](size_t i) {
        return m_buf[m_head + i];
    }
private:
    // Our setup
    const size_t m_n_elements;
    const size_t m_n_overflow;
    // Underlying buffer
    std::vector<E> m_buf;
    // Head position
    size_t m_head = 0;
};

So was it worth getting rid of std::deque? Let's find out:
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m3.179s
user 0m3.106s
sys 0m0.005s

I believe that's a yes, especially considering we are not done taking
out expensive pieces of logic. Let’s look at the getdiffSG9
disassembly. Actually, what we find is that this function no longer
exists on its own - it is now inlined in the perf_pidloop routine
and therefore the best we can get is an excerpt from the pidloop
disassembly (meaning, a few bits may be missing in this excerpt and a
few instructions from surrounding code may be weaved in):
99 F getdiffSG9() {
** 100 return F(m_sps) / F(1188)
perf[0xfea] <+252>: fdiv.s fa4, fa3, fs4
perf[0xfee] <+256>: slli a0, a0, 0x2
perf[0xff0] <+258>: add a0, a0, a1
** 101 * F(86 * m_history[8]
perf[0xff2] <+260>: lw a6, 0x20(a0)
** 102 - 142 * m_history[7]
perf[0xff6] <+264>: lw a2, 0x1c(a0)
** 103 - 193 * m_history[6]
perf[0xff8] <+266>: lw a3, 0x18(a0)
** 104 - 126 * m_history[5]
perf[0xffa] <+268>: lw a4, 0x14(a0)
** 105 + 126 * m_history[3]
perf[0xffc] <+270>: lw a5, 0xc(a0)
** 106 + 193 * m_history[2]
perf[0xffe] <+272>: lw s1, 0x8(a0)
** 107 + 142 * m_history[1]
perf[0x1000] <+274>: lw a1, 0x4(a0)
** 108 - 86 * m_history[0]);
109 }
110
perf[0x1002] <+276>: lw a0, 0x0(a0)
perf[0x1004] <+278>: subw a5, a5, a4
perf[0x1006] <+280>: slli a4, a5, 0x1
perf[0x100a] <+284>: slli a5, a5, 0x7
perf[0x100c] <+286>: subw a5, a5, a4
perf[0x100e] <+288>: subw s1, s1, a3
perf[0x1010] <+290>: mul a3, s1, s2
perf[0x1014] <+294>: subw a1, a1, a2
perf[0x1016] <+296>: mul a1, a1, s3
perf[0x101a] <+300>: subw a0, a6, a0
perf[0x101e] <+304>: mul a0, a0, s4
** 106 + 193 * m_history[2]
perf[0x1022] <+308>: add a1, a1, a3
** 107 + 142 * m_history[1]
perf[0x1024] <+310>: add a1, a1, a5

Notice the difference in the instructions for the calculations?
Instead of endless calculations for where to load the data elements
from, all that is gone, replaced with simple lw dst, ofs(a0). One
instruction for one data load. That’s it.
From the processor manual we know that for a core to retire two
instructions per cycle, only one of the two instructions can be a
load, so clearly the sequence of 8 loads is not going to interleave
with itself. Looking back further though, it is clear that an
fdiv.s is issued before this - that can take from 9 to 36 cycles to
complete, meaning all the loads probably interleave with that one.
Looking at where time is spent now, a flat profile shows us:
joe@chip:~/fanman$ gprof targets/perf --flat-profile
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
37.57 0.95 0.95 ProcessHistory<int, float>::get()
19.77 1.45 0.50 perf_pidloop()
16.61 1.87 0.42 void std::__final_insertion_sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter)
13.84 2.22 0.35 ProcessHistory<int, float>::add(int)
7.91 2.42 0.20 void std::__introsort_loop<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, long, __gnu_cxx::__ops::_Iter_less_iter)

It is time to give get a little tender loving care.
Faster get
The get routine calculates the median of the last half second of
samples, in an effort to make it basically return “the most recent
value, kind of, but without the noise”.
And medians are a great way to go for getting rid of noise. But realistically, given that we are interested in performance and medians require sorting things (and sorting requires branching and branching wastes processor cycles) - maybe a mean would be a serviceable substitute for the median? It certainly is faster to calculate.
...
// Add new element
void add(B e) {
    ...
    } else {
        // Remove the 'leaving' elements from the sums
        m_sum -= m_history.back();
        m_getsum -= m_history[m_sps/2-1];
        // Insert new
        m_history.push(e);
        // Add the new element to the sums
        m_sum += e;
        m_getsum += e;
    }
}
...
// Get most recent half second average, in an effort to filter out
// the worst of the noise
F get() {
    // The half second sum divided by samples per half second
    return m_getsum / F(m_sps/2);
}
...

So we modify the add routine so that we maintain not just the total
sum of all entries in the history (for easy buffer integral
calculation), but also a sum of all the entries in the past half
second of the history. This allows us to trivially calculate the mean
of the last half second of entries (we simply divide the sum of the
entries by the number of entries).
The F(m_sps/2) looks funny perhaps, but it is written this way very
deliberately. We want to calculate the number of samples per half
second by integer division, because the number of samples is going to
be an integer. That result is then converted to whatever floating
point type we chose to work with (the F template type).
How does getting rid of the single top time consumer help us?
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m1.522s
user 0m1.484s
sys 0m0.001s
joe@chip:~/fanman$

Wham! That worked quite well, I would say!
Our profile now clearly shows that pretty much everything is so simple
it is inlined - with the exception of add:
joe@chip:~/fanman$ gprof targets/perf --flat-profile
BFD: DWARF error: mangled line number section (bad file number)
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
65.05 0.92 0.92 perf_pidloop()
32.35 1.37 0.46 ProcessHistory<int, float>::add(int)
0.36 1.38 0.01 __clang_call_terminate

Now, add could be simplified. Personally, I don't like that a
routine like add, which is called often, checks on every call
whether it is the first call. This seems wasteful and silly.
However… If add does not do this, I would have to have an init
routine instead. That init routine would have to be called by the
outer logic, before entering the main loop of the application that
uses the ProcessHistory class. This could be done. But… I don’t
know. It just doesn’t seem very clean to have a container that has all
sorts of requirements for special members one must call before it can
really be used. That’s what the constructor is for. Which is called on
initialisation. After construction, containers should be good to
go. Generally…
Maybe it’s worth it. I’ll leave this one for now.
Getting rid of costly floating divisions
As already stated, a floating point division takes between 9 and 36
cycles, whereas a load can complete in a couple of cycles (on cache
hit) and a floating point multiplication in another 5 cycles. With
this logic, rewriting x / F(m_sps) into x * m_inv_sps and having
pre-calculated and stored m_inv_sps = 1/F(m_sps) would make a lot of
sense.
We do this in a couple of places and get:
template <typename B = int32_t, typename F = double>
class ProcessHistory {
public:
ProcessHistory(size_t samples_per_second,
size_t samples_in_history)
: m_sps(samples_per_second)
, m_inv_sps(1 / F(m_sps))
, m_inv_sphs(1 / F(m_sps/2))
, m_sih(samples_in_history)
, m_SG5_factor(F(m_sps) / F(12))
, m_SG7_factor(F(m_sps) / F(252))
, m_SG9_factor(F(m_sps) / F(1188))
...
// Get most recent half second average, in an effort to filter out
// the worst of the noise
F get() {
// The half second sum divided by samples per half second
return m_getsum * m_inv_sphs;
}
// Get integral (sum divided by samples per second)
F getint() {
return m_sum * m_inv_sps;
}
...
F getdiffSG9() {
return m_SG9_factor
* F(86 * m_history[8]
- 142 * m_history[7]
- 193 * m_history[6]
- 126 * m_history[5]
+ 126 * m_history[3]
+ 193 * m_history[2]
+ 142 * m_history[1]
- 86 * m_history[0]);
}
...
private:
// Our constants from construction
const size_t m_sps;
// Two floating point constants allow us to avoid costly floating
// point division during sliding window mean and integral
// calculation
const F m_inv_sps; // 1 / samples per second
const F m_inv_sphs; // 1 / samples per half second (integer division)
// Samples in history
const size_t m_sih;
// Another three floating point constants allow us to avoid costly
// floating point division during differential calculation
const F m_SG5_factor;
const F m_SG7_factor;
const F m_SG9_factor;
...
};
How do these last optimisations stack up?
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m1.138s
user 0m1.084s
sys 0m0.013s
joe@chip:~/fanman$
We started at around five seconds for the test and landed at around one second. That’s an 80% reduction, or a 5X speedup. Not too bad if I do say so myself.
Now… how about that add? What does the profile say?
joe@chip:~/fanman$ gprof targets/perf --flat-profile
BFD: DWARF error: mangled line number section (bad file number)
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
51.15 0.50 0.50 perf_pidloop()
43.40 0.92 0.42 ProcessHistory<int, float>::add(int)
1.03 0.93 0.01 __clang_call_terminate
43%??? Unacceptable!
Adding requirement for special initialisation
I don’t like it, but let’s see what happens if we have an init
routine that is called like add, but has to be called once before
anyone calls add on the container.
template <typename B = int32_t, typename F = double>
class ProcessHistory {
public:
...
// First addition of an element to the container - initialises
// everything
void init(B e) {
// Adding the first element will fill the cyclic buffer with
// this element value - initialise the sums accordingly
m_sum = e * m_sih;
m_getsum = e * (m_sps/2);
m_history.init(e);
}
// Add new element
void add(B e) {
// Remove the 'leaving' elements from the sums
m_sum -= m_history.back();
m_getsum -= m_history[m_sps/2-1];
// Insert new
m_history.push(e);
// Add the new element to the sums
m_sum += e;
m_getsum += e;
}
...
template <typename E>
class Cyclic {
public:
...
// Initialise cyclic buffer
void init(const E &v) {
m_buf.resize(m_n_elements + m_n_overflow, v);
}
// Insert new value in cyclic buffer
void push(const E &v) {
// Reduce head position
if (m_head)
--m_head;
else
m_head = m_n_elements-1;
// Set new head element
m_buf[m_head] = v;
// Maintain overflow buffer
if (m_head < m_n_overflow) {
m_buf[m_head + m_n_elements] = v;
}
}
...
But is it worth it? Let’s see:
joe@chip:~/fanman$ make runperf -j3
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perf.po src/perf.cc -pg -g
clang++ -Werror -Wall -O3 --std=c++23 -c -o targets/perfh.po src/perfh.cc -pg -g
clang++ -o targets/perf targets/perf.po targets/perfh.po -lsystemd -g -pg
bash -c "time ./targets/perf"
real 0m0.920s
user 0m0.870s
sys 0m0.013s
joe@chip:~/fanman$
That’s another 13% off or so. Not too bad. Better still, none of the
members of ProcessHistory now exist as functions in the resulting
binary - everything is inlined. This shows on the profile too:
joe@chip:~/fanman$ gprof targets/perf --flat-profile
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
100.23 0.87 0.87 perf_pidloop()
An 82.5% reduction, or almost a 6X speedup, on the core PID logic. So rather than spending 38 minutes on the CPU every two and a half weeks, it should spend less than 10 (leaving a little room for the file I/O, which I have not touched).
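With the init requirement, the call site gains a one-time setup step before the main loop. Sketched with assumed names (read_temp is a hypothetical stand-in for the sysfs read; this is not the fanman main loop itself):

```cpp
#include <cstdint>
#include <vector>

// Simplified ProcessHistory-style container with an init() requirement
struct History {
    std::vector<int32_t> buf;
    int64_t sum = 0;
    void init(int32_t e) {           // must be called exactly once
        buf.assign(10, e);
        sum = int64_t(e) * 10;
    }
    void add(int32_t e) {            // hot path: no first-call branch
        sum += e - buf.back();
        buf.pop_back();
        buf.insert(buf.begin(), e);
    }
};

int32_t read_temp() { return 42; }   // stub for the real sensor read

void run(int iterations) {
    History h;
    h.init(read_temp());             // one-time initialisation
    for (int i = 0; i < iterations; ++i) {
        h.add(read_temp());          // branch-free steady state
        // ... PID calculation and duty cycle update would go here ...
    }
}
```

The ugliness is confined to one extra line before the loop, and in exchange the per-sample path carries no first-call test at all.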
A second take on the fanman software
While integrating these optimisations, I am also making history size and sample rate configurable. Because maybe, just maybe, we don’t need 10 samples per second to run a fan (I know, shocking). On the other hand, a longer history will allow the integral component of the PID control logic to better correct for steady-state error.
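To recap why history length matters: the integral term accumulates past error, so a longer window gives it more memory with which to cancel a persistent offset. A minimal PID step, with assumed gains and names (not the fanman implementation):

```cpp
// Minimal PID update; gains and names are assumed for illustration.
struct Pid {
    double kp = 1.0, ki = 0.1, kd = 0.0;
    double integral = 0.0;
    double prev_err = 0.0;

    double step(double err, double dt) {
        integral += err * dt;                  // steady-state memory
        const double deriv = (err - prev_err) / dt;
        prev_err = err;
        return kp * err + ki * integral + kd * deriv;
    }
};
```

Under a constant error the integral contribution keeps growing until the output has shifted enough to remove the offset - exactly the steady-state correction a longer history supports.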
Running the fan manager software with 4 samples per second versus running it with 10 samples per second is obviously going to result in direct savings on CPU cycles. But how does the resulting temperature control look? The following graph shows the actual temperature when running at these two sample rates - the node had a bit of network/filesystem load at the time but nothing heavy:

Note that the graphs are from two separate runs - the horizontal axis is “seconds since start”. For this reason the graphs have their maxima and minima at different times - this is perfectly fine. Also, the two graphs are not directly comparable since, again, they ran at different absolute times, so system load may have changed slightly. Still, the graph at least indicates that the 10sps temperature (the green curve) is slightly more tightly regulated than the 4sps temperature (the purple curve). This is to be expected; sampling less often leads to slower reaction time.

Looking at the duty cycle, we see that the 10sps (green curve) has many really tiny spikes whereas the 4sps (purple curve) compensates less often and appears more smooth. The tiny spikes on the 10sps curve are in part caused by the minimum duty cycle constraint as well as the minimum duty cycle for spinning up the fan from stand-still. These constraints will cause regulation to be unable to find an equilibrium if, say, the “perfect” duty cycle would be 10% but the minimum allowed duty cycle is 40%.
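The constraint logic amounts to a clamp. A sketch with assumed limits (fanman’s actual minimums depend on the fan):

```cpp
#include <algorithm>

// Hypothetical duty cycle limits - real values depend on the fan
constexpr double kMinDC = 40.0;   // below this the fan stalls or won't spin up
constexpr double kMaxDC = 100.0;

// Either fully off, or clamped into the fan's usable range
double apply_constraints(double wanted_dc) {
    if (wanted_dc <= 0.0)
        return 0.0;               // fan fully off
    return std::clamp(wanted_dc, kMinDC, kMaxDC);
}
```

A “perfect” 10% request becomes 40%, which overcools; the controller then backs off towards 0%, the temperature climbs again, and the cycle repeats - producing the tiny spikes seen on the 10sps curve.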
Anyway, from the looks of it, it doesn’t seem like going to 4 samples per second is going to do anything terrible for the fan regulation.
Getting the software
As before, I provide you with the bits that constitute a Debian source package for you to build on your own:
joe@pluto:~/build$ dpkg-source -x fanman_2-1.dsc
dpkg-source: warning: extracting unsigned source package (fanman_2-1.dsc)
dpkg-source: info: extracting fanman in fanman-2
dpkg-source: info: unpacking fanman_2.orig.tar.xz
dpkg-source: info: unpacking fanman_2-1.debian.tar.xz
joe@pluto:~/build$ cd fanman-2/
joe@pluto:~/build/fanman-2$ dpkg-buildpackage -uc -us -b
dpkg-buildpackage: info: source package fanman
dpkg-buildpackage: info: source version 2-1
dpkg-buildpackage: info: source distribution unstable
dpkg-buildpackage: info: source changed by Jakob Oestergaard <joe@unthought.net>
dpkg-buildpackage: info: host architecture riscv64
dpkg-source --before-build .
debian/rules clean
...
dpkg-deb: building package 'fanman' in '../fanman_2-1_riscv64.deb'.
dpkg-deb: building package 'fanman-dbgsym' in '../fanman-dbgsym_2-1_riscv64.deb'.
dpkg-genbuildinfo --build=binary -O../fanman_2-1_riscv64.buildinfo
dpkg-genchanges --build=binary -O../fanman_2-1_riscv64.changes
dpkg-genchanges: info: binary-only upload (no source code included)
dpkg-source --after-build .
dpkg-buildpackage: info: binary-only upload (no source included)
joe@pluto:~/build/fanman-2$ cd ..
joe@pluto:~/build$ ls fanman_2*deb
fanman_2-1_riscv64.deb
joe@pluto:~/build$ sudo dpkg -i fanman_2-1_riscv64.deb
...
Enjoy!