A machine capable of 5-level paging can have memory above the 46-bit
boundary of the physical address space. This memory is only addressable
in the 5-level paging mode: we don't have enough virtual address space to
create a direct mapping for such memory in the 4-level paging mode.
Currently, we fail boot completely: NULL pointer dereference in
subsection_map_init().
Skip creating a memblock for such memory instead and notify the user that
some memory is not addressable.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen(a)intel.com>
Cc: stable(a)vger.kernel.org # v4.14
---
Tested with a hacked QEMU: https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
---
arch/x86/kernel/e820.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c5399e80c59c..d320d37d0f95 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1280,8 +1280,8 @@ void __init e820__memory_setup(void)
void __init e820__memblock_setup(void)
{
+ u64 size, end, not_addressable = 0;
int i;
- u64 end;
/*
* The bootstrap memblock region count maximum is 128 entries
@@ -1307,7 +1307,22 @@ void __init e820__memblock_setup(void)
if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
continue;
- memblock_add(entry->addr, entry->size);
+ if (entry->addr >= MAXMEM) {
+ not_addressable += entry->size;
+ continue;
+ }
+
+ end = min_t(u64, end, MAXMEM - 1);
+ size = end - entry->addr;
+ not_addressable += entry->size - size;
+ memblock_add(entry->addr, size);
+ }
+
+ if (not_addressable) {
+ pr_err("%lldGB of physical memory is not addressable in the paging mode\n",
+ not_addressable >> 30);
+ if (!pgtable_l5_enabled())
+ pr_err("Consider enabling 5-level paging\n");
}
/* Throw away partial pages: */
--
2.26.2
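The clipping logic described in the patch above can be modelled in plain userspace C. This is only a sketch under stated assumptions: `struct mem_range`, `clip_range()` and `clip_map()` are hypothetical names that do not exist in the kernel, the limit is passed in as a parameter instead of using MAXMEM, and the limit is treated as exclusive, which is a simplification and may differ from the end-point handling in the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an e820 RAM entry: [addr, addr + size). */
struct mem_range {
	uint64_t addr;
	uint64_t size;
};

/*
 * Clip one range against an exclusive addressing limit (the role MAXMEM
 * plays in the patch). Returns the number of usable bytes and adds the
 * clipped-off remainder to *not_addressable.
 */
static uint64_t clip_range(const struct mem_range *r, uint64_t limit,
			   uint64_t *not_addressable)
{
	uint64_t end, usable;

	assert(r->size <= UINT64_MAX - r->addr);	/* no wrap-around */

	if (r->addr >= limit) {
		/* Entirely above the limit: nothing can be mapped. */
		*not_addressable += r->size;
		return 0;
	}

	end = r->addr + r->size;
	if (end > limit)
		end = limit;

	usable = end - r->addr;
	*not_addressable += r->size - usable;
	return usable;
}

/* Sum the usable bytes of a whole map, accumulating what was skipped. */
static uint64_t clip_map(const struct mem_range *map, size_t n,
			 uint64_t limit, uint64_t *not_addressable)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += clip_range(&map[i], limit, not_addressable);
	return total;
}
```

A caller would start with `not_addressable = 0`, run `clip_map()` over its ranges with a limit such as `1ULL << 46`, and print a warning if the accumulated value is non-zero, mirroring the pr_err() in the patch.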
Hello,
Lack of proper validation that cached inodes are free during allocation can
cause a crash in fs/xfs/xfs_icache.c (see CVE-2018-13093). To address this
issue, I'm backporting upstream commit [1] to the 4.4 and 4.9 stable trees
(a backport of [1] to 4.14 already exists).
Also, commit [1] references another commit [2], which added checks only to
xfs_iget_cache_miss(). In commit [1], those checks are moved into a
dedicated checker function, and both xfs_iget_cache_miss() and
xfs_iget_cache_hit() are made to call it. This code reorganization in commit
[1] makes commit [2] redundant in the history of the 4.9 and 4.4 stable
trees, so commit [2] is not being backported.
-- Sid
[1]: afca6c5b2595 ("xfs: validate cached inodes are free when allocated")
[2]: ee457001ed6c ("xfs: catch inode allocation state mismatch corruption")
change log:
v2:
- Reword cover letter.
- Fix the wrong patch that was accidentally mailed.
--
2.7.4
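The reorganization described in the cover letter above (one checker shared by the cache-hit and cache-miss paths) can be illustrated with a small userspace model. This is a sketch, not the XFS code: every name below is hypothetical, the inode is reduced to two fields, and the exact conditions (non-zero mode or attached blocks meaning "not free", a free inode being reported as not found on ordinary lookup) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified inode: only the fields the check looks at. */
struct model_inode {
	unsigned int mode;		/* 0 is taken to mean "free" */
	unsigned long long nblocks;	/* blocks still attached */
};

enum check_result { CHECK_OK = 0, CHECK_CORRUPT, CHECK_NOT_FOUND };

/* Single place that decides whether the inode state matches the request. */
static enum check_result check_free_state(const struct model_inode *ip,
					  bool allocating)
{
	assert(ip != NULL);

	if (allocating) {
		/* About to allocate: the inode must really be free. */
		if (ip->mode != 0 || ip->nblocks != 0)
			return CHECK_CORRUPT;
		return CHECK_OK;
	}

	/* Ordinary lookup: a free inode is treated as not present. */
	if (ip->mode == 0)
		return CHECK_NOT_FOUND;
	return CHECK_OK;
}

/* Both lookup paths call the same checker, so neither can skip it. */
static enum check_result lookup_cache_hit(const struct model_inode *cached,
					  bool allocating)
{
	return check_free_state(cached, allocating);
}

static enum check_result lookup_cache_miss(const struct model_inode *from_disk,
					   bool allocating)
{
	return check_free_state(from_disk, allocating);
}
```

The point of the shape is that a stale, still-in-use inode found in the cache is rejected in the same way as one freshly read from disk.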
On 2020-05-22 01:46, Robin Murphy wrote:
> On 2020-05-21 12:30, Prakash Gupta wrote:
>> Limit the iova size while freeing based on unmapped size. In absence of
>> this even with unmap failure, invalid iova is pushed to iova rcache and
>> subsequently can cause panic while rcache magazine is freed.
>
> Can you elaborate on that panic?
>
We have seen a couple of stability issues around this.
Below is one such example:
kernel BUG at kernel/msm-4.19/drivers/iommu/iova.c:904!
iova_magazine_free_pfns
iova_rcache_insert
free_iova_fast
__iommu_unmap_page
iommu_dma_unmap_page
It turned out that an iova pfn of 0 got into the iova rcache. One
possibility I see is a client calling unmap with an invalid dma_addr. The
unmap call will fail and warn, but will still try to free the iova. This
causes an invalid pfn to be inserted into the rcache. When the magazine
holding the invalid pfn is later freed, private_find_iova() returns NULL
for the invalid iova and hits the BUG condition.
>> Signed-off-by: Prakash Gupta <guptap(a)codeaurora.org>
>>
>> :100644 100644 4959f5df21bd 098f7d377e04 M drivers/iommu/dma-iommu.c
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index 4959f5df21bd..098f7d377e04 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -472,7 +472,8 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
>> if (!cookie->fq_domain)
>> iommu_tlb_sync(domain, &iotlb_gather);
>> - iommu_dma_free_iova(cookie, dma_addr, size);
>> + if (unmapped)
>> + iommu_dma_free_iova(cookie, dma_addr, unmapped);
>
> Frankly, if any part of the unmap fails then things have gone
> catastrophically wrong already, but either way this isn't right. The
> IOVA API doesn't support partial freeing - an IOVA *must* be freed
> with its original size, or not freed at all, otherwise it will corrupt
> the state of the rcaches and risk a cascade of further misbehaviour
> for future callers.
>
I agree, we shouldn't be freeing a partial iova. Instead, checking that
the unmap was successful before freeing the iova should be sufficient.
So the change can instead be something like this:
- iommu_dma_free_iova(cookie, dma_addr, size);
+ if (unmapped)
+ iommu_dma_free_iova(cookie, dma_addr, size);
> TBH my gut feeling here is that you're really just trying to treat a
> symptom of another bug elsewhere, namely some driver calling
> dma_unmap_* or dma_free_* with the wrong address or size in the first
> place.
>
This condition would arise only if a driver calls dma_unmap/free_* with a
0 iova_pfn. This will be flagged with a warning during unmap but will
trigger a panic later on while doing an unrelated dma_map/unmap_*. If the
unmap has already failed for an invalid iova, there is no reason we should
consider this a valid iova and free it. This part should be fixed.
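To make the two constraints from this thread concrete (skip the free when the unmap failed, and never free with anything but the original size), here is a small userspace model. It is a sketch only: `struct model_iova_allocator`, `model_free_iova()` and `model_unmap_and_free()` are hypothetical names, and the unmap result is passed in as a parameter instead of coming from a real IOMMU call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allocator state: records what was handed back to it. */
struct model_iova_allocator {
	size_t frees;
	size_t last_freed_size;
};

static void model_free_iova(struct model_iova_allocator *a, size_t size)
{
	a->frees++;
	a->last_freed_size = size;
}

/*
 * `unmapped` is what the (hypothetical) unmap step reported: 0 on failure,
 * otherwise the number of bytes actually unmapped.
 * Returns true if the range was given back to the allocator.
 */
static bool model_unmap_and_free(struct model_iova_allocator *a,
				 size_t size, size_t unmapped)
{
	assert(unmapped <= size);

	if (!unmapped)
		return false;		/* unmap failed: leave the allocator alone */

	model_free_iova(a, size);	/* original size, never `unmapped` */
	return true;
}
```

Whether a partially successful unmap should free the range at all is left open here, as it is in the discussion; the model only encodes the variant proposed in this reply.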
On 2020-05-22 00:19, Andrew Morton wrote:
> I think we need a cc:stable here?
>
Added now.
Thanks,
Prakash
Before commit cfc4c189bc70 ("pwm: Read initial hardware state at request
time"), a driver's get_state callback would get called once per PWM from
pwmchip_add().
pwm-lpss' runtime-pm code was relying on this, getting a runtime-pm ref for
PWMs which are enabled at probe time from within its get_state callback,
before enabling runtime-pm.
The change to calling get_state at request time causes a number of
problems:
1. PWMs enabled at probe time may get runtime suspended before they are
requested, causing e.g. an LCD backlight controlled by the PWM to turn off.
2. When the request happens when the PWM has been runtime suspended, the
ctrl register will read all 1 / 0xffffffff, causing get_state to store
bogus values in the pwm_state.
3. get_state was using an async pm_runtime_get() call, because it assumed
that runtime-pm has not been enabled yet. If shortly after the request an
apply call is made, then the pwm_lpss_is_updating() check may trigger
because the resume triggered by the pm_runtime_get() call is not complete
yet, so the ctrl register still reads all 1 / 0xffffffff.
This commit fixes these issues by moving the initial pm_runtime_get() call
for PWMs which are enabled at probe time to the pwm_lpss_probe() function;
and by making get_state take a runtime-pm ref before reading the ctrl reg.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1828927
Fixes: cfc4c189bc70 ("pwm: Read initial hardware state at request time")
Cc: stable(a)vger.kernel.org
Signed-off-by: Hans de Goede <hdegoede(a)redhat.com>
---
drivers/pwm/pwm-lpss.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/pwm/pwm-lpss.c b/drivers/pwm/pwm-lpss.c
index 75bbfe5f3bc2..9d965ffe66d1 100644
--- a/drivers/pwm/pwm-lpss.c
+++ b/drivers/pwm/pwm-lpss.c
@@ -158,7 +158,6 @@ static int pwm_lpss_apply(struct pwm_chip *chip, struct pwm_device *pwm,
return 0;
}
-/* This function gets called once from pwmchip_add to get the initial state */
static void pwm_lpss_get_state(struct pwm_chip *chip, struct pwm_device *pwm,
struct pwm_state *state)
{
@@ -167,6 +166,8 @@ static void pwm_lpss_get_state(struct pwm_chip *chip, struct pwm_device *pwm,
unsigned long long base_unit, freq, on_time_div;
u32 ctrl;
+ pm_runtime_get_sync(chip->dev);
+
base_unit_range = BIT(lpwm->info->base_unit_bits);
ctrl = pwm_lpss_read(pwm);
@@ -187,8 +188,7 @@ static void pwm_lpss_get_state(struct pwm_chip *chip, struct pwm_device *pwm,
state->polarity = PWM_POLARITY_NORMAL;
state->enabled = !!(ctrl & PWM_ENABLE);
- if (state->enabled)
- pm_runtime_get(chip->dev);
+ pm_runtime_put(chip->dev);
}
static const struct pwm_ops pwm_lpss_ops = {
@@ -202,7 +202,8 @@ struct pwm_lpss_chip *pwm_lpss_probe(struct device *dev, struct resource *r,
{
struct pwm_lpss_chip *lpwm;
unsigned long c;
- int ret;
+ int i, ret;
+ u32 ctrl;
if (WARN_ON(info->npwm > MAX_PWMS))
return ERR_PTR(-ENODEV);
@@ -232,6 +233,12 @@ struct pwm_lpss_chip *pwm_lpss_probe(struct device *dev, struct resource *r,
return ERR_PTR(ret);
}
+ for (i = 0; i < lpwm->info->npwm; i++) {
+ ctrl = pwm_lpss_read(&lpwm->chip.pwms[i]);
+ if (ctrl & PWM_ENABLE)
+ pm_runtime_get(dev);
+ }
+
return lpwm;
}
EXPORT_SYMBOL_GPL(pwm_lpss_probe);
--
2.26.0
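The reference-counting scheme from the pwm-lpss patch above can be sketched as a userspace model. All names are hypothetical, the usage counter merely stands in for the runtime-pm reference count, and the value of `MODEL_PWM_ENABLE` is an assumption chosen for the model. A device that is runtime suspended reads back all ones, as described in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PWM_ENABLE (1u << 31)	/* assumed bit position, model only */
#define MODEL_MAX_PWMS 4

struct model_dev {
	int usage;			/* stands in for the runtime-pm usage count */
	bool pm_enabled;		/* runtime pm not yet enabled during probe */
	uint32_t ctrl[MODEL_MAX_PWMS];	/* per-PWM ctrl register contents */
	int npwm;
};

static bool model_suspended(const struct model_dev *d)
{
	return d->pm_enabled && d->usage == 0;
}

/* A suspended device returns all ones instead of the register value. */
static uint32_t model_read(const struct model_dev *d, int i)
{
	return model_suspended(d) ? 0xffffffffu : d->ctrl[i];
}

/* get_state: hold a reference across the register read, then drop it. */
static bool model_get_state_enabled(struct model_dev *d, int i)
{
	uint32_t ctrl;

	d->usage++;			/* models pm_runtime_get_sync() */
	ctrl = model_read(d, i);
	d->usage--;			/* models pm_runtime_put() */
	return (ctrl & MODEL_PWM_ENABLE) != 0;
}

/* probe: one long-lived reference per PWM that was left enabled. */
static void model_probe(struct model_dev *d)
{
	int i;

	assert(!d->pm_enabled && d->npwm <= MODEL_MAX_PWMS);
	for (i = 0; i < d->npwm; i++)
		if (model_read(d, i) & MODEL_PWM_ENABLE)
			d->usage++;
	d->pm_enabled = true;		/* the caller enables runtime pm afterwards */
}
```

In this model, get_state leaves the usage count unchanged and never observes the all-ones value, while a PWM that was enabled at probe time keeps the device from suspending.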
From: Vladimir Oltean <vladimir.oltean(a)nxp.com>
In kernel 4.19 (and probably earlier too) there are issues surrounding
the PHY_AN state.
For example, if a PHY is in PHY_AN state and AN has not finished, then
what is supposed to happen is that the state machine gets rescheduled
until it is, or until the link_timeout reaches zero which triggers an
autoneg restart process.
But actually the rescheduling never works if the PHY uses interrupts,
because the condition under which rescheduling occurs is just if
phy_polling_mode() is true. So basically, this whole rescheduling
functionality works for AN-not-yet-complete just by mistake. Let me
explain.
Most of the time the AN process manages to finish by the time the
interrupt has triggered. One might say "that should always be the case,
otherwise the PHY wouldn't raise the interrupt, right?".
Well, some PHYs implement an .aneg_done method which allows them to tell
the state machine when the AN is really complete.
The AR8031/AR8033 driver (at803x.c) is one such example. Even when
copper autoneg completes, the driver still keeps the "aneg_done"
variable unset until in-band SGMII autoneg finishes too (there is no
interrupt for that). So we have the premises of a race condition.
In practice, what really happens depends on the log level of the serial
console. If the log level is verbose enough that kernel messages related
to the Ethernet link state are printed to the console, then this gives
in-band AN enough time to complete, which means the link will come up
and everyone will be happy. But if the console is not that verbose, the
link will sometimes come up, and sometimes will be forced down by the
.aneg_done of the PHY driver (forever, since we are not rescheduling).
The conclusion is that an extra condition needs to be explicitly added,
so that the state machine can be rescheduled properly. Otherwise PHY
devices in interrupt mode will never work properly if they have an
.aneg_done callback.
In more recent kernels, the whole PHY_AN state was removed by Heiner
Kallweit in the "[net-next,0/5] net: phy: improve and simplify phylib
state machine" series here:
https://patchwork.ozlabs.org/cover/994464/
and the problem was just masked away instead of being addressed with a
punctual patch.
Fixes: 76a423a3f8f1 ("net: phy: allow driver to implement their own aneg_done")
Signed-off-by: Vladimir Oltean <vladimir.oltean(a)nxp.com>
---
I'm not sure the procedure I'm following is correct, sending this
directly to Greg. The patch doesn't apply on net.
drivers/net/phy/phy.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index cc454b8c032c..ca4fd74fd2c8 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -934,7 +934,7 @@ void phy_state_machine(struct work_struct *work)
struct delayed_work *dwork = to_delayed_work(work);
struct phy_device *phydev =
container_of(dwork, struct phy_device, state_queue);
- bool needs_aneg = false, do_suspend = false;
+ bool recheck = false, needs_aneg = false, do_suspend = false;
enum phy_state old_state;
int err = 0;
int old_link;
@@ -981,6 +981,8 @@ void phy_state_machine(struct work_struct *work)
phy_link_up(phydev);
} else if (0 == phydev->link_timeout--)
needs_aneg = true;
+ else
+ recheck = true;
break;
case PHY_NOLINK:
if (!phy_polling_mode(phydev))
@@ -1123,7 +1125,7 @@ void phy_state_machine(struct work_struct *work)
* PHY, if PHY_IGNORE_INTERRUPT is set, then we will be moving
* between states from phy_mac_interrupt()
*/
- if (phy_polling_mode(phydev))
+ if (phy_polling_mode(phydev) || recheck)
queue_delayed_work(system_power_efficient_wq, &phydev->state_queue,
PHY_STATE_TIME * HZ);
}
--
2.25.1
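The rescheduling condition changed by the patch above can be modelled in userspace to show why an interrupt-driven PHY got stuck. This is a sketch with hypothetical names (`struct model_phy`, `model_phy_an_step()`); it reduces the PHY_AN case to three outcomes and takes the driver's .aneg_done result as a plain field.

```c
#include <assert.h>
#include <stdbool.h>

struct model_phy {
	bool polling;		/* true when the PHY is polled, false with interrupts */
	bool aneg_done;		/* what the driver's .aneg_done would report */
	int link_timeout;	/* remaining passes before AN is restarted */
};

enum model_action { MODEL_LINK_UP, MODEL_RESTART_ANEG, MODEL_WAIT };

/*
 * One pass through the modelled PHY_AN case. *requeue is set to whether
 * the state machine would be scheduled again. with_fix selects between
 * the old condition (polling only) and the patched one (polling || recheck).
 */
static enum model_action model_phy_an_step(struct model_phy *phy,
					   bool with_fix, bool *requeue)
{
	enum model_action action;
	bool recheck = false;

	if (phy->aneg_done) {
		action = MODEL_LINK_UP;
	} else if (0 == phy->link_timeout--) {
		action = MODEL_RESTART_ANEG;
	} else {
		/* AN not finished and no timeout yet: look again later. */
		action = MODEL_WAIT;
		recheck = with_fix;
	}

	*requeue = phy->polling || recheck;
	return action;
}
```

With `polling == false` and `with_fix == false`, a MODEL_WAIT pass leaves `*requeue` false, so nothing ever looks at the PHY again; with the fix the same pass requests another run.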
Hi all,
The issue reported in
https://lore.kernel.org/linux-nvme/CH2PR12MB40050ACF2C0DC7439355ED3FDD270(a)CH2PR12MB4005.namprd12.prod.outlook.com/T/#r8cfc80b26f0cd1cde41879a68fd6a71186e9594c
is also seen on stable kernel 5.4.41.
Upstream, the issue is fixed by commit b716e6889c95f64b.
It doesn’t apply cleanly to the stable 5.4 kernel and needs the following
commits pulled in:
commit 2cb6963a16e9e114486decf591af7cb2d69cb154
Author: Christoph Hellwig <hch(a)lst.de>
Date: Wed Oct 23 10:35:41 2019 -0600
commit 6f86f2c9d94d55c4d3a6f1ffbc2e1115b5cb38a8
Author: Christoph Hellwig <hch(a)lst.de>
Date: Wed Oct 23 10:35:42 2019 -0600
commit 59ef0eaa7741c3543f98220cc132c61bf0230bce
Author: Christoph Hellwig <hch(a)lst.de>
Date: Wed Oct 23 10:35:43 2019 -0600
commit e9061c397839eea34207668bfedce0a6c18c5015
Author: Christoph Hellwig <hch(a)lst.de>
Date: Wed Oct 23 10:35:44 2019 -0600
commit b716e6889c95f64ba32af492461f6cc9341f3f05
Author: Sagi Grimberg <sagi(a)grimberg.me>
Date: Sun Jan 26 23:23:28 2020 -0800
I tried a patch that includes only the necessary parts of commits
e9061c397839, 59ef0eaa7741 and b716e6889c95 (please find it attached).
With the attached patch, the issue is not seen.
Please let me know how to fix this in stable: can all five changes above be
pushed cleanly, or can the attached shorter version be pushed instead?
Thanks,
Dakshaja.
Hi,
since commit
0ada120c883d ("perf: Make perf able to build with latest libbfd")
is in master, can it be backported to stable as well? I keep hitting
this build problem with recent binutils on Linux 5.4.y, and I have to keep
cherry-picking this commit to fix it.
Thanks