On Mon, Aug 30, 2021 at 02:12:49PM +0200, Petr Mladek wrote:
On Thu 2021-08-26 14:09:55, Yury Norov wrote:
On Thu, Aug 26, 2021 at 03:57:13PM +0200, Petr Mladek wrote:
On Sat 2021-08-14 14:17:07, Yury Norov wrote:
The macros iterate through all set/clear bits in a bitmap. They search for the first bit using find_first_bit(), and for the remaining bits using find_next_bit().
Since find_next_bit() is called shortly after find_first_bit(), we can save a few lines of I-cache by not using find_first_bit().
Is this only a speculation or does it fix a real performance problem?
The macro is used like:
for_each_set_bit(bit, addr, size) { fn(bit); }
IMHO, the micro-optimization does not help when fn() is non-trivial.
The effect is measurable:
Start testing for_each_bit()
for_each_set_bit:      15296 ns, 1000 iterations
for_each_set_bit_from: 15225 ns, 1000 iterations
Start testing for_each_bit() with cache flushing
for_each_set_bit:      547626 ns, 1000 iterations
for_each_set_bit_from: 497899 ns, 1000 iterations
Refer this:
https://www.mail-archive.com/dri-devel@lists.freedesktop.org/msg356151.html
I see. The results look convincing on the first look.
But I am still not sure. This patch is basically contradicting many other patches from this patchset:
The 5th patch optimizes find_first_and_bit() and proves that it is much faster:
Before (#define find_first_and_bit(...) find_next_and_bit(..., 0)):
Start testing find_bit() with random-filled bitmap
[  140.291468] find_first_and_bit: 46890919 ns, 32671 iterations
Start testing find_bit() with sparse bitmap
[  140.295028] find_first_and_bit:     7103 ns,     1 iterations
After:
Start testing find_bit() with random-filled bitmap
[  162.574907] find_first_and_bit: 25045813 ns, 32846 iterations
Start testing find_bit() with sparse bitmap
[  162.578458] find_first_and_bit:     4900 ns,     1 iterations
=> saves 46% in random bitmap
   saves 31% in sparse bitmap
The 6th, 7th, and 9th patches make the code use find_first_bit() because it is faster than find_next_bit(mask, size, 0).
Now, the 11th (this) patch replaces find_first_bit() with find_next_bit(mask, size, 0) because find_first_bit() makes things slower. This is suspicious, at a minimum.
In other words: the I-cache effect could save 10% in one case, but find_first_bit() might save 46% in the random case.
Those are different cases. find_first_bit() is approximately twice as fast as find_next_bit(), and much smaller. The conclusion is simple: use the 'first' version whenever possible, if there are no other considerations.
In the case of the for_each_bit() macros, however, we do have such a consideration. In contrast to the regular pattern, where the user calls either the 'first' or the 'next' version N times, here we call find_first_bit() once and then find_next_bit() N-1 times.
Because we know for sure that we'll call find_next_bit() shortly, we can benefit from locality under heavy I-cache pressure if we replace 'first' with 'next'. Consider it a prefetch mechanism for the following calls to find_next_bit().
Does I-cache cost more than the faster code?
In this case the cache miss is more expensive.
Or was for_each_set_bit() tested only with a bitmap where find_first_bit() optimization did not help much?
I tried to ensure that the I-cache effect is real, and in this case more important than raw code performance, so in the test I called 'first' once and 'next' twice.
How would for_each_set_bit() work with random bitmap?
It would work for all bitmaps.
How does it work with larger bitmaps?
The percentage gain (though not the absolute one) will decrease proportionally to the number of find_next_bit() calls for large N.
Thanks, Yury
Best Regards, Petr