Re: [PATCH v2 0/4] kernel.h further split

13 Oct 2021

      On Fri, Oct 08, 2021 at 11:37:58AM +0200, Thorsten Leemhuis wrote:
...
On Thu, 7 Oct 2021 14:51:15 +0300
Andy Shevchenko andy.shevchenko@gmail.com wrote:
...
On Thu, Oct 7, 2021 at 1:34 PM Greg Kroah-Hartman
gregkh@linuxfoundation.org wrote:
...
On Thu, Oct 07, 2021 at 12:51:25PM +0300, Andy Shevchenko wrote:
...
The kernel.h is a set of something which is not related to each
other and often used in non-crossed compilation units, especially
when drivers need only one or two macro definitions from it.
Here is the split of container_of(). The goals are the following:

untwist the dependency hell a bit
drop kernel.h inclusion where it's only used for container_of()
speed up C preprocessing.

People, like Greg KH and Miguel Ojeda, were asking about the
latter. Read below the methodology and test setup with outcome
numbers.
The methodology
The question here is how to measure in the more or less clean way
the C preprocessing time when building a project like Linux
kernel. To answer it, let's look around and see what tools do we
have that may help. Aha, here is ccache tool that seems quite
plausible to be used. Its core idea is to preprocess C file,
count hash (MD4) and compare to ones that are in the cache. If
found, return the object file, avoiding compilation stage.
Taking into account the property of the ccache, configure and use
it in the below steps:

Configure kernel with allyesconfig

Make it with `make` to be sure that the cache is filled with
the latest data. I.o.w. warm up the cache.

Run `make -s` (silent mode to reduce the influence of
the unrelated things, like console output) 10 times and
measure 'real' time spent.

Repeat 1-3 for each patch or patch set to get data sets before
and after.

When we get the raw data, calculating median will show us the
number. Comparing them before and after we will see the
difference.
The setup
I have used the Intel x86_64 server platform (see partial output
of `lscpu` below):
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  88
  On-line CPU(s) list:   0-87
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  22
    Socket(s):           2
    Stepping:            1
    CPU max MHz:         3600.0000
    CPU min MHz:         1200.0000
...
Caches (sum of all):
  L1d:                   1.4 MiB (44 instances)
  L1i:                   1.4 MiB (44 instances)
  L2:                    11 MiB (44 instances)
  L3:                    110 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-21,44-65
  NUMA node1 CPU(s):     22-43,66-87
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: Split huge pages
  L1tf:                  Mitigation; PTE Inversion; VMX
conditional cache flushes, SMT vulnerable Mds:
Mitigation; Clear CPU buffers; SMT vulnerable Meltdown:
   Mitigation; PTI Spec store bypass:     Mitigation; Speculative
Store Bypass disabled via prctl and seccomp Spectre v1:
 Mitigation; usercopy/swapgs barriers and __user pointer
sanitization Spectre v2:            Mitigation; Full generic
retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB
filling Tsx async abort:       Mitigation; Clear CPU buffers; SMT
vulnerable
With the following GCC:
$ gcc --version
gcc (Debian 10.3.0-11) 10.3.0
The commands I have run during the measurement were:
  rm -rf $O
  make O=$O allyesconfig
  time make O=$O -s -j64  # this step has been measured

BTW, what kcbench does in the end is not that different, but it only
builds the config once and that uses it for all further testing.
Since I measure the third operation only this shouldn't affect recreation
of the configuration file.
...
...
...
...
The raw data and median
Before patch 2 (yes, I have measured the only patch 2 effect) in
the series (the data is sorted by time):
real    2m8.794s
real    2m11.183s
real    2m11.235s
real    2m11.639s
real    2m11.960s
real    2m12.014s
real    2m12.609s
real    2m13.177s
real    2m13.462s
real    2m19.132s
After patch 2 has been applied:
real    2m8.536s
real    2m8.776s
real    2m9.071s
real    2m9.459s
real    2m9.531s
real    2m9.610s
real    2m10.356s
real    2m10.430s
real    2m11.117s
real    2m11.885s
Median values are:
      131.987s before
      129.571s after
We see the steady speedup as of 1.83%.
You do know about kcbench:
        https://gitlab.com/knurd42/kcbench.git
Try running that to make it such that we know how it was tested :)
I'll try it.
Meanwhile, Thorsten, can you have a look at my approach and tell if it
makes sense?
I'm not the right person to ask here, I don't know enough about the
inner working of ccache and C preprocessing. Reminder: I'm not a real
kernel/C developer, but more kind of a parasite that lives on the
fringes of kernel development. ;-) Kcbench in fact originated as a
benchmark magazine for the computer magazine I used to work for – where
I also did quite a few benchmarks. But that knowledge might be helpful
here:
The measurements before and after patch 2 was applied get slower over
time. That is a hint that something is interfering. Is the disk filling
up and making the fs do more work? Or is the machine getting to hot? It
IMHO would be worth investigating and ruling out, as the differences
you are looking out are likely quite small
I tried to explain why my methodology is closer to what we need to measure
in the above and replies. TL;DR: mathematically the O() shadows o() and as
we know the CPU and disk usage during compilation is a huge in comparison
to the C preprocessing. I'm not sure what you are referring by "slower
over time" since I explicitly said that I have _sorted_ the data. Nothing
should be done here, I believe.
...
Also: the last run of the first measurement cycle is off by quite a
bit, so I wouldn't even include the result, as there like was something
that disturbed the benchmark.
I believe you missed the very same remark, i.e. that the data is sorted.
...
And I might be missing something, but why were you using "-j 64" on a
machine with 44 cores/88 threads?
Because that machine has more processes being run. And I would like to
minimize fluctuation of the CPU scheduling when some process requires
a resource to perform little work.
...
I wonder if that might lead do
interesting effects due to SMT (some core will run two threads, other
only one). Using either "-j 44" or "-j 88" might be better.
How -j64 can be better? Nothing will guarantee that any of the core will
be half-loaded. But -j88 is worse because any process that wakes up and
requires for a resource may affect the measurements.
...
But I
suggest you run kcbench once without specifying "-j", as that will
check which setting is the fastest on this system – and then use that
for all further tests.
Next time I will try this approach, thanks for your reply and insights!
-- 
With Best Regards,
Andy Shevchenko

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v2 0/4] kernel.h further split

The methodology

The setup

The raw data and median