From: David Gow
Sent: 04 July 2023 09:32
The checksum_32 code was originally written to only handle 2-byte aligned buffers, but was later extended to support arbitrary alignment. However, the non-PPro variant doesn't apply the carry before jumping to the 2- or 4-byte aligned versions, which clear CF.
....
I also tested it on a real 486DX2, with the same results.
Which cpu does anyone really care about?
The unrolled 'adcl' loop is horrid on Intel CPUs between (about) 'Core' and 'Haswell' because each u-op can only have two inputs and adc needs three (both operands plus the carry flag), so it is 2 u-ops. The first fix was to sum to alternate registers.
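As a rough illustration of the "alternate registers" idea, here is a minimal sketch in portable C rather than the kernel's assembler. It assumes two independent 64-bit accumulators stand in for the two registers, so there is no single carry chain linking every add. The names fold64 and csum_two_acc are made up for this example, and whether a compiler turns it into anything like the hand-written adc loop is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold a 64-bit sum down to a 16-bit
 * ones-complement sum, adding the carries back in at each step. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Hypothetical helper: sum 32-bit words into two accumulators so that
 * consecutive adds do not depend on each other. Assumes nwords is well
 * below 2^32 so the 64-bit accumulators cannot overflow. */
static uint16_t csum_two_acc(const uint32_t *p, size_t nwords)
{
    uint64_t a = 0, b = 0;
    size_t i;

    for (i = 0; i + 1 < nwords; i += 2) {
        a += p[i];      /* even words go to accumulator a */
        b += p[i + 1];  /* odd words go to accumulator b */
    }
    if (i < nwords)
        a += p[i];      /* odd word count: one word left over */

    /* Fold each accumulator to 33 bits or fewer before combining. */
    a = (a & 0xffffffffu) + (a >> 32);
    b = (b & 0xffffffffu) + (b >> 32);
    return fold64(a + b);
}
```

Because ones-complement addition is commutative and associative, splitting the words between two accumulators gives the same folded result as a single running sum.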
On anything modern (well, I've not checked some Atom-based servers) misaligned accesses are pretty near zero cost, so the tests that align the data really aren't worth it.
(I suspect it all got better a long time ago, except for transfers that cross cache-line boundaries; with adc taking two cycles, even that might be free.)
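To show what dropping the alignment tests might look like, here is a small sketch in portable C (again, not the kernel code). It assumes the buffer can simply be read four bytes at a time from whatever address it starts at, using memcpy so the C itself stays well defined; on x86 a compiler would normally emit plain, possibly misaligned, loads for this. The name csum_any_align is invented for the example, and the result is in native byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 16-bit ones-complement sum of a buffer with no
 * alignment prologue. Assumes len is small enough that the 64-bit
 * accumulator cannot overflow. */
static uint16_t csum_any_align(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t sum = 0;
    uint32_t w;

    while (len >= 4) {
        memcpy(&w, p, 4);   /* possibly misaligned 32-bit load */
        sum += w;
        p += 4;
        len -= 4;
    }
    if (len) {
        w = 0;              /* zero-pad the 1 to 3 trailing bytes */
        memcpy(&w, p, len);
        sum += w;
    }

    /* Fold 64 -> 32 -> 16 bits, adding the carries back in. */
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

The same bytes give the same sum whether the buffer starts on an even or an odd address, which is the property the alignment code in the real routine has to work to preserve.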
David