On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
This issue has been around a long time, so it's not like a regression that we just introduced. If we fixed these machines and regressed *other* machines, we'd be worse off than we are now.
Convince me otherwise if you see this differently :)
In the meantime, here's another possibility for working around this. What if we discarded remove_e820_regions() completely, but aligned the problem _CRS windows a little more? The 4dc2287c1805 case was this:
BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
where the _CRS window was of size 0x20100000, i.e., 512M + 1M. At least in this particular case, we could avoid the problem by throwing away that first 1M and aligning the window to a nice 3G boundary. Maybe it would be worth giving up a small fraction (less than 0.2% in this case) of questionable windows like this?
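The arithmetic above can be checked with a few lines of Python. This is only a sketch: the 512M alignment and the helper names (align_up, trim_window) are assumptions made for illustration, not anything taken from the kernel or from a proposed patch. The two address ranges are copied from the log lines above.

```python
# Values copied from the quoted log lines.
E820_RESERVED = (0xBFE4DC00, 0xC0000000)  # old e820 format: end is exclusive
CRS_WINDOW = (0xBFF00000, 0xDFFFFFFF)     # pci_root format: end is inclusive

MIB = 1 << 20


def window_size(start, end_inclusive):
    """Size in bytes of an inclusive [start, end] range."""
    return end_inclusive - start + 1


def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def trim_window(start, end_inclusive, alignment):
    """Hypothetical helper: throw away the unaligned head of a window."""
    return align_up(start, alignment), end_inclusive


size = window_size(*CRS_WINDOW)
# Assumption: align to 512M, the power-of-two part of the 512M + 1M window.
new_start, new_end = trim_window(*CRS_WINDOW, alignment=512 * MIB)
lost = new_start - CRS_WINDOW[0]

print(f"window size : {size:#x}")
print(f"new start   : {new_start:#x}")
print(f"bytes lost  : {lost:#x} ({100 * lost / size:.3f}% of the window)")
print(f"clear of the reserved e820 range: {new_start >= E820_RESERVED[1]}")
```

With these inputs the trimmed window starts at the 3G boundary, which is also where the reserved E820 range ends, so the two no longer overlap.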
Bjorn
Hi Bjorn,
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
It is a 10 year old BIOS defect, so hopefully anything from 2018 or later will not have it.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
Fedora and Arch do follow mainline pretty closely and a lot of users are affected by this (see the large number of BugLinks in the commit).
I completely understand why you are reluctant to push this out, but your argument about most distros not running mainline kernels also applies to chances of people where this may cause a regression running mainline kernels also being quite small.
This issue has been around a long time, so it's not like a regression that we just introduced. If we fixed these machines and regressed *other* machines, we'd be worse off than we are now.
If we break one machine model and fix a whole bunch of other machines then in my book that is a win. Ideally we would not break anything, but we can only find out if we actually break anything if we ship the change.
Convince me otherwise if you see this differently :)
See above :)
In the meantime, here's another possibility for working around this. What if we discarded remove_e820_regions() completely, but aligned the problem _CRS windows a little more? The 4dc2287c1805 case was this:
BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
where the _CRS window was of size 0x20100000, i.e., 512M + 1M. At least in this particular case, we could avoid the problem by throwing away that first 1M and aligning the window to a nice 3G boundary. Maybe it would be worth giving up a small fraction (less than 0.2% in this case) of questionable windows like this?
The PCI BAR allocation code tries to fall back to the BIOS assigned resource if the allocation fails. That BIOS assigned resource might fall outside of the host bridge window after we round the address.
My initial gut instinct here is that this has a bigger chance of breaking things than my change.
In the beginning of the thread you said that ideally we would completely stop using the E820 reservations for PCI host bridge windows. Because in hindsight messing with the windows on all machines just to work around a clear BIOS bug in some was not a good idea.
This address-rounding/-aligning you now suggest, is again messing with the windows on all machines just to work around a clear BIOS bug in some. At least that is how I see this.
I can understand that you're not entirely happy with my patch, but it does get rid of the use of E820 reservations for any current and future machines, removing any messing with the _CRS returned windows which we are doing.
I also understand that you're not entirely comfortable with my "fix" not causing regressions elsewhere. If you want to delay my fix till 5.16-rc1 that is fine (1).
Regards,
Hans
1) The stable series will likely pick it up soon after 5.16-rc1 though, so not sure how much that actually helps with getting more testing time.
On Thu, Oct 21, 2021 at 07:15:57PM +0200, Hans de Goede wrote:
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
It is a 10 year old BIOS defect, so hopefully anything from 2018 or later will not have it.
We can hope. AFAIK, Windows allocates space top-down, while Linux allocates bottom-up, so I think it's quite possible these defects would never be discovered or fixed. In any event, I don't think we have much evidence either way.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
Fedora and Arch do follow mainline pretty closely and a lot of users are affected by this (see the large number of BugLinks in the commit).
I completely understand why you are reluctant to push this out, but your argument about most distros not running mainline kernels also applies to chances of people where this may cause a regression running mainline kernels also being quite small.
True.
This issue has been around a long time, so it's not like a regression that we just introduced. If we fixed these machines and regressed *other* machines, we'd be worse off than we are now.
If we break one machine model and fix a whole bunch of other machines then in my book that is a win. Ideally we would not break anything, but we can only find out if we actually break anything if we ship the change.
I'm definitely not going to try the "fix many, break one" argument on Linus. Of course we want to fix systems, but IMO it's far better to leave a system broken than it is to break one that used to work.
In the meantime, here's another possibility for working around this. What if we discarded remove_e820_regions() completely, but aligned the problem _CRS windows a little more? The 4dc2287c1805 case was this:
BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
where the _CRS window was of size 0x20100000, i.e., 512M + 1M. At least in this particular case, we could avoid the problem by throwing away that first 1M and aligning the window to a nice 3G boundary. Maybe it would be worth giving up a small fraction (less than 0.2% in this case) of questionable windows like this?
The PCI BAR allocation code tries to fall back to the BIOS assigned resource if the allocation fails. That BIOS assigned resource might fall outside of the host bridge window after we round the address.
My initial gut instinct here is that this has a bigger chance of breaking things than my change.
In the beginning of the thread you said that ideally we would completely stop using the E820 reservations for PCI host bridge windows. Because in hindsight messing with the windows on all machines just to work around a clear BIOS bug in some was not a good idea.
This address-rounding/-aligning you now suggest, is again messing with the windows on all machines just to work around a clear BIOS bug in some. At least that is how I see this.
That's true. I assume Red Hat has a bunch of machines and hopefully an archive of dmesg logs from them. Those logs should contain good E820 and _CRS information, so with a little scripting, maybe we could get some idea of what's out there.
Bjorn
Hi Bjorn,
On 10/22/21 03:20, Bjorn Helgaas wrote:
On Thu, Oct 21, 2021 at 07:15:57PM +0200, Hans de Goede wrote:
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
It is a 10 year old BIOS defect, so hopefully anything from 2018 or later will not have it.
We can hope. AFAIK, Windows allocates space top-down, while Linux allocates bottom-up, so I think it's quite possible these defects would never be discovered or fixed. In any event, I don't think we have much evidence either way.
Ack.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
Fedora and Arch do follow mainline pretty closely and a lot of users are affected by this (see the large number of BugLinks in the commit).
I completely understand why you are reluctant to push this out, but your argument about most distros not running mainline kernels also applies to chances of people where this may cause a regression running mainline kernels also being quite small.
True.
This issue has been around a long time, so it's not like a regression that we just introduced. If we fixed these machines and regressed *other* machines, we'd be worse off than we are now.
If we break one machine model and fix a whole bunch of other machines then in my book that is a win. Ideally we would not break anything, but we can only find out if we actually break anything if we ship the change.
I'm definitely not going to try the "fix many, break one" argument on Linus. Of course we want to fix systems, but IMO it's far better to leave a system broken than it is to break one that used to work.
Right, what I meant to say with "a win" is a step in the right direction, we definitely must address any regressions coming from this change as soon as we learn about them.
In the meantime, here's another possibility for working around this. What if we discarded remove_e820_regions() completely, but aligned the problem _CRS windows a little more? The 4dc2287c1805 case was this:
BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
where the _CRS window was of size 0x20100000, i.e., 512M + 1M. At least in this particular case, we could avoid the problem by throwing away that first 1M and aligning the window to a nice 3G boundary. Maybe it would be worth giving up a small fraction (less than 0.2% in this case) of questionable windows like this?
The PCI BAR allocation code tries to fall back to the BIOS assigned resource if the allocation fails. That BIOS assigned resource might fall outside of the host bridge window after we round the address.
My initial gut instinct here is that this has a bigger chance of breaking things than my change.
In the beginning of the thread you said that ideally we would completely stop using the E820 reservations for PCI host bridge windows. Because in hindsight messing with the windows on all machines just to work around a clear BIOS bug in some was not a good idea.
This address-rounding/-aligning you now suggest, is again messing with the windows on all machines just to work around a clear BIOS bug in some. At least that is how I see this.
That's true. I assume Red Hat has a bunch of machines and hopefully an archive of dmesg logs from them. Those logs should contain good E820 and _CRS information, so with a little scripting, maybe we could get some idea of what's out there.
We do have a (large-ish) test-lab, but that contains almost exclusively servers, whereas the original problem was on Dell Precision laptops.
Also I'm not sure if I can get aggregate data from the lab's machines. I can reserve time on any model we have to debug specific problems, but that is targeting one specific model. I'll ask around about this.
Regards,
Hans
Hi Bjorn,
On 10/22/21 11:53, Hans de Goede wrote:
Hi Bjorn,
On 10/22/21 03:20, Bjorn Helgaas wrote:
On Thu, Oct 21, 2021 at 07:15:57PM +0200, Hans de Goede wrote:
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
It is a 10 year old BIOS defect, so hopefully anything from 2018 or later will not have it.
We can hope. AFAIK, Windows allocates space top-down, while Linux allocates bottom-up, so I think it's quite possible these defects would never be discovered or fixed. In any event, I don't think we have much evidence either way.
Ack.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
Fedora and Arch do follow mainline pretty closely and a lot of users are affected by this (see the large number of BugLinks in the commit).
I completely understand why you are reluctant to push this out, but your argument about most distros not running mainline kernels also applies to chances of people where this may cause a regression running mainline kernels also being quite small.
True.
This issue has been around a long time, so it's not like a regression that we just introduced. If we fixed these machines and regressed *other* machines, we'd be worse off than we are now.
If we break one machine model and fix a whole bunch of other machines then in my book that is a win. Ideally we would not break anything, but we can only find out if we actually break anything if we ship the change.
I'm definitely not going to try the "fix many, break one" argument on Linus. Of course we want to fix systems, but IMO it's far better to leave a system broken than it is to break one that used to work.
Right, what I meant to say with "a win" is a step in the right direction, we definitely must address any regressions coming from this change as soon as we learn about them.
In the meantime, here's another possibility for working around this. What if we discarded remove_e820_regions() completely, but aligned the problem _CRS windows a little more? The 4dc2287c1805 case was this:
BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]
where the _CRS window was of size 0x20100000, i.e., 512M + 1M. At least in this particular case, we could avoid the problem by throwing away that first 1M and aligning the window to a nice 3G boundary. Maybe it would be worth giving up a small fraction (less than 0.2% in this case) of questionable windows like this?
The PCI BAR allocation code tries to fall back to the BIOS assigned resource if the allocation fails. That BIOS assigned resource might fall outside of the host bridge window after we round the address.
My initial gut instinct here is that this has a bigger chance of breaking things than my change.
In the beginning of the thread you said that ideally we would completely stop using the E820 reservations for PCI host bridge windows. Because in hindsight messing with the windows on all machines just to work around a clear BIOS bug in some was not a good idea.
This address-rounding/-aligning you now suggest, is again messing with the windows on all machines just to work around a clear BIOS bug in some. At least that is how I see this.
That's true. I assume Red Hat has a bunch of machines and hopefully an archive of dmesg logs from them. Those logs should contain good E820 and _CRS information, so with a little scripting, maybe we could get some idea of what's out there.
We do have a (large-ish) test-lab, but that contains almost exclusively servers, whereas the original problem was on Dell Precision laptops.
Also I'm not sure if I can get aggregate data from the lab's machines. I can reserve time on any model we have to debug specific problems, but that is targeting one specific model. I'll ask around about this.
So I had another idea to get us a whole bunch of dmesg outputs and that is to use the database collected by linux-hardware.org. The dmesg logs were already individually accessible by selecting a specific model machine, but I asked them if they could do a dump and I just got an email that a dmesg dump is now available here:
https://github.com/linuxhw/Dmesg
Note: be careful with the size of the repository - it will take ~3 gigabytes of network traffic and ~20 gigabytes of space on the drive to check it out.
So if you want dmesg outputs to grep through for e820 / host-bridge-window info, here you go.
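As a starting point for that kind of grepping, here is a rough Python sketch that pulls reserved E820 ranges and host bridge memory windows out of one dmesg log and reports any overlap. The regular expressions are assumptions: they cover the old "start - end (reserved)" E820 format quoted earlier in this thread and the newer "[mem 0x...-0x...] reserved" format as I remember it, and they may well need adjusting for what is actually in the archive. The function names are made up for illustration.

```python
import re

# Old format, end exclusive: "BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)"
OLD_E820 = re.compile(r"BIOS-e820:\s+([0-9a-f]{16}) - ([0-9a-f]{16}) \(reserved\)")
# Newer format, end inclusive: "BIOS-e820: [mem 0x...-0x...] reserved"
NEW_E820 = re.compile(r"BIOS-e820: \[mem 0x([0-9a-f]+)-0x([0-9a-f]+)\] reserved")
# "host bridge window [mem 0xbff00000-0xdfffffff]", end inclusive
WINDOW = re.compile(r"host bridge window \[mem 0x([0-9a-f]+)-0x([0-9a-f]+)")


def parse_dmesg(text):
    """Return (reserved, windows) as lists of inclusive (start, end) tuples."""
    reserved, windows = [], []
    for line in text.splitlines():
        if m := OLD_E820.search(line):
            # Convert the exclusive end to an inclusive one.
            reserved.append((int(m[1], 16), int(m[2], 16) - 1))
        elif m := NEW_E820.search(line):
            reserved.append((int(m[1], 16), int(m[2], 16)))
        elif m := WINDOW.search(line):
            windows.append((int(m[1], 16), int(m[2], 16)))
    return reserved, windows


def find_overlaps(text):
    """List every (window, reserved) pair whose address ranges intersect."""
    reserved, windows = parse_dmesg(text)
    return [
        (w, r)
        for w in windows
        for r in reserved
        if r[0] <= w[1] and w[0] <= r[1]
    ]
```

Feeding it the text of each log in turn and counting the non-empty results would give a first impression of how common overlapping E820 reservations are. It does not tell us whether an overlap is the old BIOS defect or a harmless reservation, so the hits would still need a manual look.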
Regards,
Hans
Hi Bjorn,
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
I understand that you were reluctant to add this to 5.15 so close to the end of the 5.15 cycle, but can we please get this into 5.16 now ?
I know you ultimately want to see if there is a better fix, but this is hitting a *lot* of users right now and if we come up with a better fix we can always use that to replace this one later.
So can we please just go with this fix now, so that we can fix the issues a lot of users are seeing caused by the current *wrong* behavior of taking the e820 reservations into account ?
Regards,
Hans
On Sat, Nov 06, 2021 at 11:15:07AM +0100, Hans de Goede wrote:
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
I understand that you were reluctant to add this to 5.15 so close to the end of the 5.15 cycle, but can we please get this into 5.16 now ?
I know you ultimately want to see if there is a better fix, but this is hitting a *lot* of users right now and if we come up with a better fix we can always use that to replace this one later.
I don't know whether there's a "better" fix, but I do know that if we merge what we have right now, nobody will be looking for a better one.
We're in the middle of the merge window, so the v5.16 development cycle is over. The v5.17 cycle is just starting, so we have time to hit that. Obviously a fix can be backported to older kernels as needed.
So can we please just go with this fix now, so that we can fix the issues a lot of users are seeing caused by the current *wrong* behavior of taking the e820 reservations into account ?
I think the fix on the table is "ignore E820 for BIOS date >= 2018" plus the obvious parameters to force it both ways.
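To make that summary concrete, here is a small Python sketch of the decision as described in the sentence above. It is not the patch itself: the Override names, the function name, and the choice to keep the old behavior when the BIOS year is unknown are all assumptions for illustration.

```python
from enum import Enum


class Override(Enum):
    """Hypothetical stand-ins for the parameters that force it both ways."""
    NONE = "none"
    USE_E820 = "use"
    IGNORE_E820 = "ignore"


CUTOFF_YEAR = 2018  # from "ignore E820 for BIOS date >= 2018"


def use_e820_reservations(bios_year, override=Override.NONE):
    """Decide whether E820 reservations are excluded from host bridge windows."""
    # An explicit override always wins.
    if override is Override.USE_E820:
        return True
    if override is Override.IGNORE_E820:
        return False
    # Assumption: with no usable BIOS date, stay with the pre-existing behavior.
    if bios_year is None:
        return True
    # Older firmware keeps the 2010 workaround, newer firmware trusts _CRS.
    return bios_year < CUTOFF_YEAR
```

Written out like this, the input is only a date and an override; nothing in it looks at the E820 map or the _CRS window itself.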
The thing I don't like is that this isn't connected at all to the actual BIOS defect. We have no indication that current BIOSes have fixed the defect, and we have no assurance that future ones will not have the defect. It would be better if we had some algorithmic way of figuring out what to do.
Thank you very much for chasing down the dmesg log archive (https://github.com/linuxhw/Dmesg; see https://lore.kernel.org/r/82035130-d810-9f0b-259e-61280de1d81f@redhat.com). Unfortunately I haven't had time to look through it myself, and I haven't heard of anybody else doing it either.
Bjorn
Hi Bjorn,
On 11/9/21 23:07, Bjorn Helgaas wrote:
On Sat, Nov 06, 2021 at 11:15:07AM +0100, Hans de Goede wrote:
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
I understand that you were reluctant to add this to 5.15 so close to the end of the 5.15 cycle, but can we please get this into 5.16 now ?
I know you ultimately want to see if there is a better fix, but this is hitting a *lot* of users right now and if we come up with a better fix we can always use that to replace this one later.
I don't know whether there's a "better" fix, but I do know that if we merge what we have right now, nobody will be looking for a better one.
We're in the middle of the merge window, so the v5.16 development cycle is over. The v5.17 cycle is just starting, so we have time to hit that. Obviously a fix can be backported to older kernels as needed.
So can we please just go with this fix now, so that we can fix the issues a lot of users are seeing caused by the current *wrong* behavior of taking the e820 reservations into account ?
I think the fix on the table is "ignore E820 for BIOS date >= 2018" plus the obvious parameters to force it both ways.
Correct.
The thing I don't like is that this isn't connected at all to the actual BIOS defect. We have no indication that current BIOSes have fixed the defect,
We also have no indication that that defect from 10 years ago, from pre-UEFI firmware, is still present in modern-day UEFI firmware, which is basically an entirely different code-base.
And even 10 years ago the problem was only happening to a single family of laptop models (Dell Precision laptops) so this clearly was a bug in that specific implementation and not some generic issue which is likely to be carried forward.
and we have no assurance that future ones will not have the defect. It would be better if we had some algorithmic way of figuring out what to do.
You yourself have said that in hindsight taking E820 reservations into account for PCI bridge host windows was a mistake. So what the "ignore E820 for BIOS date >= 2018" is doing is letting the past be the past (without regressing on older models) while fixing that mistake on any hardware going forward.
In the unlikely case that we hit that BIOS bug again on 1 or 2 models, we can simply DMI quirk those models, as we do for countless other BIOS issues.
Thank you very much for chasing down the dmesg log archive (https://github.com/linuxhw/Dmesg; see https://lore.kernel.org/r/82035130-d810-9f0b-259e-61280de1d81f@redhat.com). Unfortunately I haven't had time to look through it myself, and I haven't heard of anybody else doing it either.
Right, I'm afraid that I have already spent way too much time on this myself. Note that I've been working with users on this bug on and off for over a year now.
This is hitting many users and now that we have a viable fix, this really needs to be fixed now.
I believe that the "ignore E820 for BIOS date >= 2018" fix is good enough and that you are letting perfect be the enemy of good here.
As an upstream kernel maintainer myself, I'm sorry to say this, but if we don't get some fix for this merged soon you are leaving me no choice but to add my fix to the Fedora kernels as a downstream patch (and to advise other distros to do the same).
Note that if you are still afraid of regressions going the downstream route is also an opportunity, Fedora will start testing moving users to 5.15.y soon, so I could add the patch to Fedora's 5.15.y builds and see how that goes ?
Regards,
Hans
Hi Bjorn,
On 11/10/21 09:45, Hans de Goede wrote:
Hi Bjorn,
On 11/9/21 23:07, Bjorn Helgaas wrote:
On Sat, Nov 06, 2021 at 11:15:07AM +0100, Hans de Goede wrote:
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
I understand that you were reluctant to add this to 5.15 so close to the end of the 5.15 cycle, but can we please get this into 5.16 now ?
I know you ultimately want to see if there is a better fix, but this is hitting a *lot* of users right now and if we come up with a better fix we can always use that to replace this one later.
I don't know whether there's a "better" fix, but I do know that if we merge what we have right now, nobody will be looking for a better one.
We're in the middle of the merge window, so the v5.16 development cycle is over. The v5.17 cycle is just starting, so we have time to hit that. Obviously a fix can be backported to older kernels as needed.
So can we please just go with this fix now, so that we can fix the issues a lot of users are seeing caused by the current *wrong* behavior of taking the e820 reservations into account ?
I think the fix on the table is "ignore E820 for BIOS date >= 2018" plus the obvious parameters to force it both ways.
Correct.
The thing I don't like is that this isn't connected at all to the actual BIOS defect. We have no indication that current BIOSes have fixed the defect,
We also have no indication that that defect from 10 years ago, from pre-UEFI firmware, is still present in modern-day UEFI firmware, which is basically an entirely different code-base.
And even 10 years ago the problem was only happening to a single family of laptop models (Dell Precision laptops) so this clearly was a bug in that specific implementation and not some generic issue which is likely to be carried forward.
and we have no assurance that future ones will not have the defect. It would be better if we had some algorithmic way of figuring out what to do.
You yourself have said that in hindsight taking E820 reservations into account for PCI bridge host windows was a mistake. So what the "ignore E820 for BIOS date >= 2018" is doing is letting the past be the past (without regressing on older models) while fixing that mistake on any hardware going forward.
In the unlikely case that we hit that BIOS bug again on 1 or 2 models, we can simply DMI quirk those models, as we do for countless other BIOS issues.
Thank you very much for chasing down the dmesg log archive (https://github.com/linuxhw/Dmesg; see https://lore.kernel.org/r/82035130-d810-9f0b-259e-61280de1d81f@redhat.com). Unfortunately I haven't had time to look through it myself, and I haven't heard of anybody else doing it either.
Right, I'm afraid that I have already spent way too much time on this myself. Note that I've been working with users on this bug on and off for over a year now.
This is hitting many users and now that we have a viable fix, this really needs to be fixed now.
I believe that the "ignore E820 for BIOS date >= 2018" fix is good enough and that you are letting perfect be the enemy of good here.
As an upstream kernel maintainer myself, I'm sorry to say this, but if we don't get some fix for this merged soon you are leaving me no choice but to add my fix to the Fedora kernels as a downstream patch (and to advise other distros to do the same).
Note that if you are still afraid of regressions, going the downstream route is also an opportunity: Fedora will soon start testing moving users to 5.15.y, so I could add the patch to Fedora's 5.15.y builds and see how that goes ?
So I've discussed this with the Fedora kernel maintainers and they have agreed to add the patch to the Fedora 5.15 kernels, which we will ask our users to start testing soon (we first run some voluntary testing before eventually moving all users over).
This will provide us with valuable feedback wrt this patch causing regressions as you are worried about, or not.
Assuming no regressions show up, I hope that this will give you some assurance that the patch causes no regressions and that you will then be willing to pick this up later during the 5.16 cycle, so that Fedora only deviates from upstream for 1 cycle.
Regards,
Hans
Den 2021-11-10 kl. 15:05, skrev Hans de Goede:
So I've discussed this with the Fedora kernel maintainers and they have agreed to add the patch to the Fedora 5.15 kernels, which we will ask our users to start testing soon (we first run some voluntary testing before eventually moving all users over).
This will provide us with valuable feedback wrt this patch causing regressions as you are worried about, or not.
Assuming no regressions show up, I hope that this will give you some assurance that the patch causes no regressions and that you will then be willing to pick this up later during the 5.16 cycle, so that Fedora only deviates from upstream for 1 cycle.
FWIW... As an extra data point...
I've backported this one on top of 5.14.14 in Mageia Cauldron and Mageia 8 backports where it has been in active use for ~3 weeks now, and so far no reports of systems breaking ...
-- Thomas
Hi Bjorn,
On 11/10/21 14:05, Hans de Goede wrote:
Hi Bjorn,
On 11/10/21 09:45, Hans de Goede wrote:
Hi Bjorn,
On 11/9/21 23:07, Bjorn Helgaas wrote:
On Sat, Nov 06, 2021 at 11:15:07AM +0100, Hans de Goede wrote:
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
I understand that you were reluctant to add this to 5.15 so close to the end of the 5.15 cycle, but can we please get this into 5.16 now ?
I know you ultimately want to see if there is a better fix, but this is hitting a *lot* of users right now and if we come up with a better fix we can always use that to replace this one later.
I don't know whether there's a "better" fix, but I do know that if we merge what we have right now, nobody will be looking for a better one.
We're in the middle of the merge window, so the v5.16 development cycle is over. The v5.17 cycle is just starting, so we have time to hit that. Obviously a fix can be backported to older kernels as needed.
So can we please just go with this fix now, so that we can fix the issues a lot of users are seeing caused by the current *wrong* behavior of taking the e820 reservations into account ?
I think the fix on the table is "ignore E820 for BIOS date >= 2018" plus the obvious parameters to force it both ways.
Correct.
The thing I don't like is that this isn't connected at all to the actual BIOS defect. We have no indication that current BIOSes have fixed the defect,
We also have no indication that that defect from 10 years ago, from pre-UEFI firmware, is still present in modern-day UEFI firmware, which is basically an entirely different code-base.
And even 10 years ago the problem was only happening to a single family of laptop models (Dell Precision laptops) so this clearly was a bug in that specific implementation and not some generic issue which is likely to be carried forward.
and we have no assurance that future ones will not have the defect. It would be better if we had some algorithmic way of figuring out what to do.
You yourself have said that in hindsight taking E820 reservations into account for PCI bridge host windows was a mistake. So what the "ignore E820 for BIOS date >= 2018" is doing is letting the past be the past (without regressing on older models) while fixing that mistake on any hardware going forward.
In the unlikely case that we hit that BIOS bug again on 1 or 2 models, we can simply DMI quirk those models, as we do for countless other BIOS issues.
Thank you very much for chasing down the dmesg log archive (https://github.com/linuxhw/Dmesg; see https://lore.kernel.org/r/82035130-d810-9f0b-259e-61280de1d81f@redhat.com). Unfortunately I haven't had time to look through it myself, and I haven't heard of anybody else doing it either.
Right, I'm afraid that I have already spent way too much time on this myself. Note that I've been working with users on this bug on and off for over a year now.
This is hitting many users and now that we have a viable fix, this really needs to be fixed now.
I believe that the "ignore E820 for BIOS date >= 2018" fix is good enough and that you are letting perfect be the enemy of good here.
As an upstream kernel maintainer myself, I'm sorry to say this, but if we don't get some fix for this merged soon you are leaving me no choice but to add my fix to the Fedora kernels as a downstream patch (and to advise other distros to do the same).
Note that if you are still afraid of regressions, going the downstream route is also an opportunity: Fedora will soon start testing moving users to 5.15.y, so I could add the patch to Fedora's 5.15.y builds and see how that goes ?
So I've discussed this with the Fedora kernel maintainers and they have agreed to add the patch to the Fedora 5.15 kernels, which we will ask our users to start testing soon (we first run some voluntary testing before eventually moving all users over).
This will provide us with valuable feedback wrt this patch causing regressions as you are worried about, or not.
Assuming no regressions show up, I hope that this will give you some assurance that the patch causes no regressions and that you will then be willing to pick this up later during the 5.16 cycle, so that Fedora only deviates from upstream for 1 cycle.
5.15.y kernels with this patch added have been in Fedora's stable updates repo for a while now without any reports of the regressions you feared this may cause.
Bjorn, I hope that you are willing to merge this patch now that it has seen some more widespread testing ?
Regards,
Hans
On Tue, Dec 07, 2021 at 05:52:40PM +0100, Hans de Goede wrote:
On 11/10/21 14:05, Hans de Goede wrote:
On 11/10/21 09:45, Hans de Goede wrote:
On 11/9/21 23:07, Bjorn Helgaas wrote:
On Sat, Nov 06, 2021 at 11:15:07AM +0100, Hans de Goede wrote:
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
I understand that you were reluctant to add this to 5.15 so close to the end of the 5.15 cycle, but can we please get this into 5.16 now ?
I know you ultimately want to see if there is a better fix, but this is hitting a *lot* of users right now and if we come up with a better fix we can always use that to replace this one later.
I don't know whether there's a "better" fix, but I do know that if we merge what we have right now, nobody will be looking for a better one.
We're in the middle of the merge window, so the v5.16 development cycle is over. The v5.17 cycle is just starting, so we have time to hit that. Obviously a fix can be backported to older kernels as needed.
So can we please just go with this fix now, so that we can fix the issues a lot of users are seeing caused by the current *wrong* behavior of taking the e820 reservations into account ?
I think the fix on the table is "ignore E820 for BIOS date >= 2018" plus the obvious parameters to force it both ways.
Correct.
The thing I don't like is that this isn't connected at all to the actual BIOS defect. We have no indication that current BIOSes have fixed the defect,
We also have no indication that that defect from 10 years ago, from pre-UEFI firmware, is still present in modern-day UEFI firmware, which is basically an entirely different code-base.
And even 10 years ago the problem was only happening to a single family of laptop models (Dell Precision laptops) so this clearly was a bug in that specific implementation and not some generic issue which is likely to be carried forward.
and we have no assurance that future ones will not have the defect. It would be better if we had some algorithmic way of figuring out what to do.
You yourself have said that in hindsight taking E820 reservations into account for PCI bridge host windows was a mistake. So what the "ignore E820 for BIOS date >= 2018" is doing is letting the past be the past (without regressing on older models) while fixing that mistake on any hardware going forward.
In the unlikely case that we hit that BIOS bug again on 1 or 2 models, we can simply DMI quirk those models, as we do for countless other BIOS issues.
Thank you very much for chasing down the dmesg log archive (https://github.com/linuxhw/Dmesg; see https://lore.kernel.org/r/82035130-d810-9f0b-259e-61280de1d81f@redhat.com). Unfortunately I haven't had time to look through it myself, and I haven't heard of anybody else doing it either.
Right, I'm afraid that I have already spent way too much time on this myself. Note that I've been working with users on this bug on and off for over a year now.
This is hitting many users and now that we have a viable fix, this really needs to be fixed now.
I believe that the "ignore E820 for BIOS date >= 2018" fix is good enough and that you are letting perfect be the enemy of good here.
As an upstream kernel maintainer myself, I'm sorry to say this, but if we don't get some fix for this merged soon you are leaving me no choice but to add my fix to the Fedora kernels as a downstream patch (and to advise other distros to do the same).
Note that if you are still afraid of regressions, going the downstream route is also an opportunity: Fedora will soon start testing moving users to 5.15.y, so I could add the patch to Fedora's 5.15.y builds and see how that goes ?
So I've discussed this with the Fedora kernel maintainers and they have agreed to add the patch to the Fedora 5.15 kernels, which we will ask our users to start testing soon (we first run some voluntary testing before eventually moving all users over).
This will provide us with valuable feedback wrt this patch causing regressions as you are worried about, or not.
Assuming no regressions show up, I hope that this will give you some assurance that the patch causes no regressions and that you will then be willing to pick this up later during the 5.16 cycle, so that Fedora only deviates from upstream for 1 cycle.
5.15.y kernels with this patch added have been in Fedora's stable updates repo for a while now without any reports of the regressions you feared this may cause.
Bjorn, I hope that you are willing to merge this patch now that it has seen some more widespread testing ?
I'm still not happy about the idea of basing this on BIOS dates. I did this with 7bc5e3f2be32 ("x86/PCI: use host bridge _CRS info by default on 2008 and newer machines"), and it was a mistake.
Because of that mistake, we now have the use_crs/nocrs kernel parameters, which confuse users and lead to them being passed around as "fixes" on random bulletin boards.
Adding another BIOS date check and use_e820/no_e820 kernel parameters feels like it's layering on more complexity to cover up another major mistake I made, 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
I think it would be better for the code to recognize the situation addressed by 4dc2287c1805 and deal with it directly. Is that possible? I dunno; I don't think we've really tried.
Bjorn
Hi Bjorn,
On 12/15/21 17:01, Bjorn Helgaas wrote:
On Tue, Dec 07, 2021 at 05:52:40PM +0100, Hans de Goede wrote:
On 11/10/21 14:05, Hans de Goede wrote:
On 11/10/21 09:45, Hans de Goede wrote:
On 11/9/21 23:07, Bjorn Helgaas wrote:
On Sat, Nov 06, 2021 at 11:15:07AM +0100, Hans de Goede wrote:
On 10/20/21 23:14, Bjorn Helgaas wrote:
On Wed, Oct 20, 2021 at 12:23:26PM +0200, Hans de Goede wrote:
On 10/19/21 23:52, Bjorn Helgaas wrote:
On Thu, Oct 14, 2021 at 08:39:42PM +0200, Hans de Goede wrote:
Some BIOS-es contain a bug where they add addresses which map to system RAM in the PCI host bridge window returned by the ACPI _CRS method, see commit 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
To work around this bug Linux excludes E820 reserved addresses when allocating addresses from the PCI host bridge window since 2010. ...
I haven't seen anybody else eager to merge this, so I guess I'll stick my neck out here.
I applied this to my for-linus branch for v5.15.
Thank you, and sorry about the build-errors which the lkp kernel-test-robot found.
I've just sent out a patch which fixes these build-errors (verified with both .config-s from the lkp reports). Feel free to squash this into the original patch (or keep them separate, whatever works for you).
Thanks, I squashed the fix in.
HOWEVER, I think it would be fairly risky to push this into v5.15. We would be relying on the assumption that current machines have all fixed the BIOS defect that 4dc2287c1805 addressed, and we have little evidence for that.
I'm not sure there's significant benefit to having this in v5.15. Yes, the mainline v5.15 kernel would work on the affected machines, but I suspect most people with those machines are running distro kernels, not mainline kernels.
I understand that you were reluctant to add this to 5.15 so close to the end of the 5.15 cycle, but can we please get this into 5.16 now ?
I know you ultimately want to see if there is a better fix, but this is hitting a *lot* of users right now and if we come up with a better fix we can always use that to replace this one later.
I don't know whether there's a "better" fix, but I do know that if we merge what we have right now, nobody will be looking for a better one.
We're in the middle of the merge window, so the v5.16 development cycle is over. The v5.17 cycle is just starting, so we have time to hit that. Obviously a fix can be backported to older kernels as needed.
So can we please just go with this fix now, so that we can fix the issues a lot of users are seeing caused by the current *wrong* behavior of taking the e820 reservations into account ?
I think the fix on the table is "ignore E820 for BIOS date >= 2018" plus the obvious parameters to force it both ways.
Correct.
The thing I don't like is that this isn't connected at all to the actual BIOS defect. We have no indication that current BIOSes have fixed the defect,
We also have no indication that that defect from 10 years ago, from pre-UEFI firmware, is still present in modern-day UEFI firmware, which is basically an entirely different code-base.
And even 10 years ago the problem was only happening to a single family of laptop models (Dell Precision laptops) so this clearly was a bug in that specific implementation and not some generic issue which is likely to be carried forward.
and we have no assurance that future ones will not have the defect. It would be better if we had some algorithmic way of figuring out what to do.
You yourself have said that in hindsight taking E820 reservations into account for PCI bridge host windows was a mistake. So what the "ignore E820 for BIOS date >= 2018" is doing is letting the past be the past (without regressing on older models) while fixing that mistake on any hardware going forward.
In the unlikely case that we hit that BIOS bug again on 1 or 2 models, we can simply DMI quirk those models, as we do for countless other BIOS issues.
Thank you very much for chasing down the dmesg log archive (https://github.com/linuxhw/Dmesg; see https://lore.kernel.org/r/82035130-d810-9f0b-259e-61280de1d81f@redhat.com). Unfortunately I haven't had time to look through it myself, and I haven't heard of anybody else doing it either.
Right, I'm afraid that I have already spent way too much time on this myself. Note that I've been working with users on this bug on and off for over a year now.
This is hitting many users and now that we have a viable fix, this really needs to be fixed now.
I believe that the "ignore E820 for BIOS date >= 2018" fix is good enough and that you are letting perfect be the enemy of good here.
As an upstream kernel maintainer myself, I'm sorry to say this, but if we don't get some fix for this merged soon you are leaving me no choice but to add my fix to the Fedora kernels as a downstream patch (and to advise other distros to do the same).
Note that if you are still afraid of regressions, going the downstream route is also an opportunity: Fedora will soon start testing moving users to 5.15.y, so I could add the patch to Fedora's 5.15.y builds and see how that goes ?
So I've discussed this with the Fedora kernel maintainers and they have agreed to add the patch to the Fedora 5.15 kernels, which we will ask our users to start testing soon (we first run some voluntary testing before eventually moving all users over).
This will provide us with valuable feedback wrt this patch causing regressions as you are worried about, or not.
Assuming no regressions show up, I hope that this will give you some assurance that the patch causes no regressions and that you will then be willing to pick this up later during the 5.16 cycle, so that Fedora only deviates from upstream for 1 cycle.
5.15.y kernels with this patch added have been in Fedora's stable updates repo for a while now without any reports of the regressions you feared this may cause.
Bjorn, I hope that you are willing to merge this patch now that it has seen some more widespread testing ?
I'm still not happy about the idea of basing this on BIOS dates. I did this with 7bc5e3f2be32 ("x86/PCI: use host bridge _CRS info by default on 2008 and newer machines"), and it was a mistake.
Because of that mistake, we now have the use_crs/nocrs kernel parameters, which confuse users and lead to them being passed around as "fixes" on random bulletin boards.
Adding another BIOS date check and use_e820/no_e820 kernel parameters feels like it's layering on more complexity to cover up another major mistake I made, 4dc2287c1805 ("x86: avoid E820 regions when allocating address space").
I think it would be better for the code to recognize the situation addressed by 4dc2287c1805 and deal with it directly. Is that possible? I dunno; I don't think we've really tried.
So we are just going to leave a ton of users' systems broken *for years* until someone has the time to try ? I've not seen anyone step up to try and address the issue worked around by 4dc2287c1805 (and no, I'm not volunteering).
Also how are we going to come up with another fix for that without any of the hardware which was affected by the issue back then to test on?
AFAIK we agree that:
1. In hindsight commit 4dc2287c1805 was not a good idea.
2. We cannot just revert it without causing regressions.
So given these 2 things, disabling the problematic behavior introduced by commit 4dc2287c1805 on newer machines, while keeping it on the older machines which need it so that they do not regress, really seems like the obvious fix to me ?
Especially since replacing commit 4dc2287c1805 seems impossible to me without access to the originally affected hardware to verify any fix.
AFAIK there are a number of other places in the kernel where BIOS date checks are used, to e.g. not use ACPI on really early buggy ACPI implementations, so this is not unheard of.
You seem to mainly be concerned about users cargo-culting the use_e820/no_e820 kernel parameters as a workaround for issues which have a completely different root cause.
Would my solution to disable the troublesome workaround from 4dc2287c1805 be acceptable if I drop the new commandline options?
I added those just in case, but so far no Fedora users have needed them, so I would be happy to drop them ?
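To make the proposal concrete, here is a minimal sketch of the decision without any command-line options. This is not the actual patch: the function name, the `quirked` flag, and the convention that a BIOS year of 0 means "unknown" are all assumptions made for illustration. Only the 2018 cutoff comes from the discussion above.

```c
#include <stdbool.h>

#define E820_CUTOFF_YEAR 2018	/* cutoff year named in the thread */

/*
 * bios_year: year taken from the firmware's BIOS date, or 0 if unknown
 *            (hypothetical convention for this sketch).
 * quirked:   true if this model is on a per-model quirk list
 *            (hypothetical; stands in for a DMI match).
 *
 * Returns true if E820 reservations should still be excluded from the
 * PCI host bridge windows, false if they should be ignored.
 */
static bool use_e820_for_pci_windows(int bios_year, bool quirked)
{
	/* A quirked model always keeps the old behaviour. */
	if (quirked)
		return true;

	/* Unknown date: assume an old machine to avoid regressions. */
	if (bios_year == 0)
		return true;

	/* Old firmware keeps the workaround; 2018 and newer drops it. */
	return bios_year < E820_CUTOFF_YEAR;
}
```

Treating an unknown BIOS date as "old" is a choice made here to stay on the side of not regressing existing machines; the opposite default would also be defensible.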
Regards,
Hans