During tracee-ebpf regression tests, it was discovered that a CO-RE capable
eBPF program that relied on a kconfig BTF extern could not be loaded,
failing with the following error:
libbpf: prog 'tracepoint__raw_syscalls__sys_enter': failed to attach to raw
tracepoint 'sys_enter': Invalid argument
That happened because the CONFIG_ARCH_HAS_SYSCALL_WRAPPER variable had the
wrong value despite the kconfig map existing, misleading the eBPF program
execution (which would then use different pointer types, not accepted by
the verifier at load time).
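For context, such a program typically declares the kconfig value as a BTF
extern that libbpf resolves at load time; a minimal sketch assuming the
usual libbpf conventions (names and includes are illustrative, not
tracee's exact code):

  #include "vmlinux.h"             /* CO-RE type dump, assumed available */
  #include <bpf/bpf_helpers.h>

  extern bool CONFIG_ARCH_HAS_SYSCALL_WRAPPER __kconfig __weak;

  SEC("raw_tracepoint/sys_enter")
  int tracepoint__raw_syscalls__sys_enter(struct bpf_raw_tracepoint_args *ctx)
  {
          if (CONFIG_ARCH_HAS_SYSCALL_WRAPPER) {
                  /* ctx->args[0] is a struct pt_regs pointer here; with a
                   * wrong kconfig value the program takes the other branch
                   * and dereferences mismatched pointer types, which the
                   * verifier rejects. */
          }
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";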
I found the patch proposed here by bisecting the upstream tree with the
testcase just described. I kindly ask you to include this patch in the LTS
v5.4.x series so that CO-RE (Compile Once - Run Everywhere) eBPF programs
relying on kconfig settings can be correctly loaded in the v5.4 kernel
series.
Link: https://github.com/aquasecurity/tracee/issues/851#issuecomment-903074596
I have tested the latest 5.4 stable tree with this patch and it fixes the issue.
-rafaeldtinoco
A bad "Fixes" SHA in mainline 7387a72c5f8 references a non-public
SHA, instead of referencing f8dd60de1948 -- see:
https://lore.kernel.org/lkml/20210817075644.0b5123d2@canb.auug.org.au/
This matters to -stable since the broken commit is here:
stable-queue$ git grep -l f8dd60de1948
releases/5.10.56/tipc-fix-implicit-connect-for-syn.patch
releases/5.13.8/tipc-fix-implicit-connect-for-syn.patch
and hence those releases will need mainline 7387a72c5f8 applied but
I suspect automatic Fixes: parsing won't "see" it.
Thanks,
Paul.
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 4e9655763b82a91e4c341835bb504a2b1590f984 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu(a)suse.com>
Date: Wed, 25 Aug 2021 13:41:42 +0800
Subject: [PATCH] Revert "btrfs: compression: don't try to compress if we don't
have enough pages"
This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763.
[BUG]
It's no longer possible to create compressed inline extent after commit
f2165627319f ("btrfs: compression: don't try to compress if we don't
have enough pages").
[CAUSE]
For compression code, there are several possible reasons we have a range
that needs to be compressed while it's no more than one page.
- Compressed inline write
The data is always smaller than one sector and the test lacks the
condition to properly recognize a non-inline extent.
- Compressed subpage write
For the incoming subpage compressed write support, we require page
alignment of the delalloc range.
And for 64K page size, we can compress just one page into smaller
sectors.
For those reasons, the requirement for the data to be more than one page
is not correct, and is already causing regression for compressed inline
data writeback. The idea of skipping one page to avoid wasting CPU time
could be revisited in the future.
[FIX]
Fix it by reverting the offending commit.
Reported-by: Zygo Blaxell <ce3g8jdj(a)umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.n…
Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages")
CC: stable(a)vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 06f9f167222b..bd5689fa290e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -629,7 +629,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
* inode has not been flagged as nocompress. This flag can
* change at any time if we discover bad compression ratios.
*/
- if (nr_pages > 1 && inode_need_compress(BTRFS_I(inode), start, end)) {
+ if (inode_need_compress(BTRFS_I(inode), start, end)) {
WARN_ON(pages);
pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
if (!pages) {
From: Frieder Schrempf <frieder.schrempf(a)kontron.de>
The new generic NAND ECC framework stores the configuration and
requirements in separate places since commit 93ef92f6f422 ("mtd: nand: Use
the new generic ECC object"). In 5.10.x The SPI NAND layer still uses only
the requirements to track the ECC properties. This mismatch leads to
values of zero being used for ECC strength and step_size in the SPI NAND
layer wherever nanddev_get_ecc_conf() is used and therefore breaks the SPI
NAND on-die ECC support in 5.10.x.
By using nanddev_get_ecc_requirements() instead of nanddev_get_ecc_conf()
for SPI NAND, we make sure that the correct parameters for the detected
chip are used. In later versions (5.11.x) this is fixed anyway with the
implementation of the SPI NAND on-die ECC engine.
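For context, the two accessors differ only in which nand_ecc_props they
return; paraphrased from the 5.10-era include/linux/mtd/nand.h (treat the
exact field paths as a sketch):

  static inline const struct nand_ecc_props *
  nanddev_get_ecc_conf(struct nand_device *nand)
  {
          return &nand->ecc.ctx.conf;       /* set up by an ECC engine */
  }

  static inline const struct nand_ecc_props *
  nanddev_get_ecc_requirements(struct nand_device *nand)
  {
          return &nand->ecc.requirements;   /* filled during chip detection */
  }

Since the 5.10 SPI NAND layer never initializes an ECC engine, only the
requirements hold the detected strength and step_size.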
Cc: stable(a)vger.kernel.org # 5.10.x
Reported-by: voice INTER connect GmbH <developer(a)voiceinterconnect.de>
Signed-off-by: Frieder Schrempf <frieder.schrempf(a)kontron.de>
Acked-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
---
Changes in v2:
* Fix checkpatch error/warnings for commit message style
* Add Miquel's A-b tag
---
drivers/mtd/nand/spi/core.c | 6 +++---
drivers/mtd/nand/spi/macronix.c | 6 +++---
drivers/mtd/nand/spi/toshiba.c | 6 +++---
3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/mtd/nand/spi/core.c b/drivers/mtd/nand/spi/core.c
index 558d8a14810b..8794a1f6eacd 100644
--- a/drivers/mtd/nand/spi/core.c
+++ b/drivers/mtd/nand/spi/core.c
@@ -419,7 +419,7 @@ static int spinand_check_ecc_status(struct spinand_device *spinand, u8 status)
* fixed, so let's return the maximum possible value so that
* wear-leveling layers move the data immediately.
*/
- return nanddev_get_ecc_conf(nand)->strength;
+ return nanddev_get_ecc_requirements(nand)->strength;
case STATUS_ECC_UNCOR_ERROR:
return -EBADMSG;
@@ -1090,8 +1090,8 @@ static int spinand_init(struct spinand_device *spinand)
mtd->oobavail = ret;
/* Propagate ECC information to mtd_info */
- mtd->ecc_strength = nanddev_get_ecc_conf(nand)->strength;
- mtd->ecc_step_size = nanddev_get_ecc_conf(nand)->step_size;
+ mtd->ecc_strength = nanddev_get_ecc_requirements(nand)->strength;
+ mtd->ecc_step_size = nanddev_get_ecc_requirements(nand)->step_size;
return 0;
diff --git a/drivers/mtd/nand/spi/macronix.c b/drivers/mtd/nand/spi/macronix.c
index 8e801e4c3a00..cd7a9cacc3fb 100644
--- a/drivers/mtd/nand/spi/macronix.c
+++ b/drivers/mtd/nand/spi/macronix.c
@@ -84,11 +84,11 @@ static int mx35lf1ge4ab_ecc_get_status(struct spinand_device *spinand,
* data around if it's not necessary.
*/
if (mx35lf1ge4ab_get_eccsr(spinand, &eccsr))
- return nanddev_get_ecc_conf(nand)->strength;
+ return nanddev_get_ecc_requirements(nand)->strength;
- if (WARN_ON(eccsr > nanddev_get_ecc_conf(nand)->strength ||
+ if (WARN_ON(eccsr > nanddev_get_ecc_requirements(nand)->strength ||
!eccsr))
- return nanddev_get_ecc_conf(nand)->strength;
+ return nanddev_get_ecc_requirements(nand)->strength;
return eccsr;
diff --git a/drivers/mtd/nand/spi/toshiba.c b/drivers/mtd/nand/spi/toshiba.c
index 21fde2875674..6fe7bd2a94d2 100644
--- a/drivers/mtd/nand/spi/toshiba.c
+++ b/drivers/mtd/nand/spi/toshiba.c
@@ -90,12 +90,12 @@ static int tx58cxgxsxraix_ecc_get_status(struct spinand_device *spinand,
* data around if it's not necessary.
*/
if (spi_mem_exec_op(spinand->spimem, &op))
- return nanddev_get_ecc_conf(nand)->strength;
+ return nanddev_get_ecc_requirements(nand)->strength;
mbf >>= 4;
- if (WARN_ON(mbf > nanddev_get_ecc_conf(nand)->strength || !mbf))
- return nanddev_get_ecc_conf(nand)->strength;
+ if (WARN_ON(mbf > nanddev_get_ecc_requirements(nand)->strength || !mbf))
+ return nanddev_get_ecc_requirements(nand)->strength;
return mbf;
--
2.32.0
From: Filipe Manana <fdmanana(a)suse.com>
commit bc0939fcfab0d7efb2ed12896b1af3d819954a14 upstream.
We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and
btrfs_sync_log(). The following steps describe how the race happens.
1) We are at transaction N;
2) Inode I was previously fsynced in the current transaction so it has:
inode->logged_trans set to N;
3) The inode's root currently has:
root->log_transid set to 1
root->last_log_commit set to 0
Which means only one log transaction was committed so far, log
transaction 0. When a log tree is created we set ->log_transid and
->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
4) One more range of pages is dirtied in inode I;
5) Some task A starts an fsync against some other inode J (same root), and
so it joins log transaction 1.
Before task A calls btrfs_sync_log()...
6) Task B starts an fsync against inode I, which currently has the full
sync flag set, so it starts delalloc and waits for the ordered extent
to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
7) During ordered extent completion we have btrfs_update_inode() called
against inode I, which in turn calls btrfs_set_inode_last_trans(),
which does the following:
spin_lock(&inode->lock);
inode->last_trans = trans->transaction->transid;
inode->last_sub_trans = inode->root->log_transid;
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
So ->last_trans is set to N and ->last_sub_trans set to 1.
But before setting ->last_log_commit...
8) Task A is at btrfs_sync_log():
- it increments root->log_transid to 2
- starts writeback for all log tree extent buffers
- waits for the writeback to complete
- writes the super blocks
- updates root->last_log_commit to 1
It's a lot of slow steps between updating root->log_transid and
root->last_log_commit;
9) The task doing the ordered extent completion, currently at
btrfs_set_inode_last_trans(), then finally runs:
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
Which results in inode->last_log_commit being set to 1.
The ordered extent completes;
10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
true because we have all the following conditions met:
inode->logged_trans == N which matches fs_info->generation &&
inode->last_sub_trans (1) <= inode->last_log_commit (1) &&
inode->last_sub_trans (1) <= root->last_log_commit (1) &&
list inode->extent_tree.modified_extents is empty
And as a consequence we return without logging the inode, so the
existing logged version of the inode does not point to the extent
that was written after the previous fsync.
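For reference, here is a paraphrase of the btrfs_inode_in_log() test built
from the four conditions above (the exact code differs slightly between
versions, so treat this as a sketch):

  int ret = 0;

  spin_lock(&inode->lock);
  if (inode->logged_trans == generation &&
      inode->last_sub_trans <= inode->last_log_commit &&
      inode->last_sub_trans <= inode->root->last_log_commit &&
      list_empty(&inode->extent_tree.modified_extents))
          ret = 1;
  spin_unlock(&inode->lock);
  return ret;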
It should be impossible in practice for one task to be able to make so
much progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans(), right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is
enabled, we know the task at btrfs_set_inode_last_trans() can not be
preempted, because it is holding the inode's spinlock.
However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:
vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
{
(...)
BTRFS_I(inode)->last_trans = fs_info->generation;
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
(...)
So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.
So fix this in two different ways:
1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
value of ->last_sub_trans minus 1;
2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
like we do for buffered and direct writes at btrfs_file_write_iter(),
which is all we need to make sure multiple writes and fsyncs to an
inode in the same transaction never result in an fsync missing that
the inode changed and needs to be logged. Turn this into a helper
function and use it both at btrfs_page_mkwrite() and at
btrfs_file_write_iter() - this also fixes the problem that at
btrfs_page_mkwrite() we were setting those fields without the
protection of the inode's spinlock.
This is an extremely unlikely race to happen in practice.
Signed-off-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: Anand Jain <anand.jain(a)oracle.com>
---
fs/btrfs/btrfs_inode.h | 15 +++++++++++++++
fs/btrfs/file.c | 10 ++--------
fs/btrfs/inode.c | 4 +---
fs/btrfs/transaction.h | 2 +-
4 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index f853835c409c..f3ff57b93158 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -268,6 +268,21 @@ static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
mod);
}
+/*
+ * Called every time after doing a buffered, direct IO or memory mapped write.
+ *
+ * This is to ensure that if we write to a file that was previously fsynced in
+ * the current transaction, then try to fsync it again in the same transaction,
+ * we will know that there were changes in the file and that it needs to be
+ * logged.
+ */
+static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode)
+{
+ spin_lock(&inode->lock);
+ inode->last_sub_trans = inode->root->log_transid;
+ spin_unlock(&inode->lock);
+}
+
static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
{
int ret = 0;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 400b0717b9d4..1279359ed172 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2004,14 +2004,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
inode_unlock(inode);
- /*
- * We also have to set last_sub_trans to the current log transid,
- * otherwise subsequent syncs to a file that's been synced in this
- * transaction will appear to have already occurred.
- */
- spin_lock(&BTRFS_I(inode)->lock);
- BTRFS_I(inode)->last_sub_trans = root->log_transid;
- spin_unlock(&BTRFS_I(inode)->lock);
+ btrfs_set_inode_last_sub_trans(BTRFS_I(inode));
+
if (num_written > 0)
num_written = generic_write_sync(iocb, num_written);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b044b1d910de..1117335374ff 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9250,9 +9250,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
set_page_dirty(page);
SetPageUptodate(page);
- BTRFS_I(inode)->last_trans = fs_info->generation;
- BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
- BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
+ btrfs_set_inode_last_sub_trans(BTRFS_I(inode));
unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index d8a7d460e436..cbede328bda5 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -160,7 +160,7 @@ static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
spin_lock(&BTRFS_I(inode)->lock);
BTRFS_I(inode)->last_trans = trans->transaction->transid;
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
- BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
+ BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->last_sub_trans - 1;
spin_unlock(&BTRFS_I(inode)->lock);
}
--
2.31.1
Reading from the CMOS involves writing to the index register and then
reading from the data register. Therefore, access to the CMOS has to be
serialized with a spinlock. One invocation in cmos_set_alarm() was not
serialized with rtc_lock; fix this.
Use spin_lock_irq() like the rest of the function.
Nothing in the kernel modifies the RTC_DM_BINARY bit, so use a separate
pair of spin_lock_irq() / spin_unlock_irq() calls before doing the math.
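To illustrate why the lock is needed, CMOS_READ() expands to roughly the
following two-step sequence (sketch based on the x86 mc146818 definitions;
register constants are illustrative):

  outb(RTC_CONTROL, RTC_PORT(0));   /* select register via the index port */
  rtc_control = inb(RTC_PORT(1));   /* read it back from the data port */

Another CPU writing a different index between these two steps would make
the read return the wrong register, hence every access pair must happen
under rtc_lock.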
Signed-off-by: Mateusz Jończyk <mat.jonczyk(a)o2.pl>
Cc: Alessandro Zummo <a.zummo(a)towertech.it>
Cc: Alexandre Belloni <alexandre.belloni(a)bootlin.com>
Cc: stable(a)vger.kernel.org
---
drivers/rtc/rtc-cmos.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/rtc/rtc-cmos.c b/drivers/rtc/rtc-cmos.c
index 0fa66d1039b9..e6ff0fb7591b 100644
--- a/drivers/rtc/rtc-cmos.c
+++ b/drivers/rtc/rtc-cmos.c
@@ -463,7 +463,10 @@ static int cmos_set_alarm(struct device *dev, struct rtc_wkalrm *t)
min = t->time.tm_min;
sec = t->time.tm_sec;
+ spin_lock_irq(&rtc_lock);
rtc_control = CMOS_READ(RTC_CONTROL);
+ spin_unlock_irq(&rtc_lock);
+
if (!(rtc_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD) {
/* Writing 0xff means "don't care" or "match all". */
mon = (mon <= 12) ? bin2bcd(mon) : 0xff;
--
2.25.1
The patch titled
Subject: mm/hugetlb: initialize hugetlb_usage in mm_init
has been added to the -mm tree. Its filename is
mm-hugetlb-initialize-hugetlb_usage-in-mm_init.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-hugetlb-initialize-hugetlb_usa…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-initialize-hugetlb_usa…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Liu Zixian <liuzixian4(a)huawei.com>
Subject: mm/hugetlb: initialize hugetlb_usage in mm_init
After fork, the child process will get incorrect (2x) hugetlb_usage.
If a process uses 5 2MB hugetlb pages in an anonymous mapping,
HugetlbPages: 10240 kB
and then forks, the child will show,
HugetlbPages: 20480 kB
The reason for the doubled amount is that hugetlb_usage is copied from
the parent and then increased again when we copy page tables from parent
to child, so the child ends up with 2x the actual usage.
Fix this by adding hugetlb_count_init in mm_init.
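The mechanism is visible in the fork path; paraphrased from kernel/fork.c
(details vary by version, so treat this as a sketch):

  static struct mm_struct *dup_mm(struct task_struct *tsk,
                                  struct mm_struct *oldmm)
  {
          struct mm_struct *mm;

          mm = allocate_mm();
          if (!mm)
                  return NULL;

          /* The child mm starts as a byte copy of the parent's, so any
           * counter not reset in mm_init() keeps the parent's value --
           * including hugetlb_usage before this fix. */
          memcpy(mm, oldmm, sizeof(*mm));

          if (!mm_init(mm, tsk, mm->user_ns))
                  return NULL;
          ...
  }

Copying the page tables then increments hugetlb_usage again on top of the
inherited value, which yields the 2x result.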
Link: https://lkml.kernel.org/r/20210826071742.877-1-liuzixian4@huawei.com
Fixes: 5d317b2b6536 ("mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status")
Signed-off-by: Liu Zixian <liuzixian4(a)huawei.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
include/linux/hugetlb.h | 9 +++++++++
kernel/fork.c | 1 +
2 files changed, 10 insertions(+)
--- a/include/linux/hugetlb.h~mm-hugetlb-initialize-hugetlb_usage-in-mm_init
+++ a/include/linux/hugetlb.h
@@ -858,6 +858,11 @@ static inline spinlock_t *huge_pte_lockp
void hugetlb_report_usage(struct seq_file *m, struct mm_struct *mm);
+static inline void hugetlb_count_init(struct mm_struct *mm)
+{
+ atomic_long_set(&mm->hugetlb_usage, 0);
+}
+
static inline void hugetlb_count_add(long l, struct mm_struct *mm)
{
atomic_long_add(l, &mm->hugetlb_usage);
@@ -1042,6 +1047,10 @@ static inline spinlock_t *huge_pte_lockp
return &mm->page_table_lock;
}
+static inline void hugetlb_count_init(struct mm_struct *mm)
+{
+}
+
static inline void hugetlb_report_usage(struct seq_file *f, struct mm_struct *m)
{
}
--- a/kernel/fork.c~mm-hugetlb-initialize-hugetlb_usage-in-mm_init
+++ a/kernel/fork.c
@@ -1052,6 +1052,7 @@ static struct mm_struct *mm_init(struct
mm->pmd_huge_pte = NULL;
#endif
mm_init_uprobes_state(mm);
+ hugetlb_count_init(mm);
if (current->mm) {
mm->flags = current->mm->flags & MMF_INIT_MASK;
_
Patches currently in -mm which might be from liuzixian4(a)huawei.com are
mm-hugetlb-initialize-hugetlb_usage-in-mm_init.patch
The patch titled
Subject: mm/hmm: bypass devmap pte when all pfn requested flags are fulfilled
has been added to the -mm tree. Its filename is
mm-hmm-bypass-devmap-pte-when-all-pfn-requested-flags-are-fulfilled.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-hmm-bypass-devmap-pte-when-all…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-hmm-bypass-devmap-pte-when-all…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Li Zhijian <lizhijian(a)cn.fujitsu.com>
Subject: mm/hmm: bypass devmap pte when all pfn requested flags are fulfilled
Previously, we noticed that one rpma example failed [1] since commit
36f30e486d, where it uses the ODP feature to do an RDMA WRITE between
fsdax files.
After digging into the code, we found that hmm_vma_handle_pte() will still
return EFAULT even though all of its requested flags have been fulfilled.
That's because a DAX page will be marked as (_PAGE_SPECIAL | _PAGE_DEVMAP)
by pte_mkdevmap().
[1]: https://github.com/pmem/rpma/issues/1142
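For reference, the x86 helper that marks DAX ptes, paraphrased from
arch/x86/include/asm/pgtable.h:

  static inline pte_t pte_mkdevmap(pte_t pte)
  {
          /* a DAX pte is both special and devmap at once */
          return pte_set_flags(pte, _PAGE_SPECIAL | _PAGE_DEVMAP);
  }

So pte_special() alone cannot tell a DAX page apart from a truly special
mapping; the pte_devmap() check added below restores the expected
behavior.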
Link: https://lkml.kernel.org/r/20210830094232.203029-1-lizhijian@cn.fujitsu.com
Fixes: 405506274922 ("mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling")
Signed-off-by: Li Zhijian <lizhijian(a)cn.fujitsu.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
Reviewed-by: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hmm.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- a/mm/hmm.c~mm-hmm-bypass-devmap-pte-when-all-pfn-requested-flags-are-fulfilled
+++ a/mm/hmm.c
@@ -295,10 +295,13 @@ static int hmm_vma_handle_pte(struct mm_
goto fault;
/*
+ * Bypass devmap pte such as DAX page when all pfn requested
+ * flags(pfn_req_flags) are fulfilled.
* Since each architecture defines a struct page for the zero page, just
* fall through and treat it like a normal page.
*/
- if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
+ if (pte_special(pte) && !pte_devmap(pte) &&
+ !is_zero_pfn(pte_pfn(pte))) {
if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
pte_unmap(ptep);
return -EFAULT;
_
Patches currently in -mm which might be from lizhijian(a)cn.fujitsu.com are
mm-hmm-bypass-devmap-pte-when-all-pfn-requested-flags-are-fulfilled.patch
The patch titled
Subject: mm: fix panic caused by __page_handle_poison()
has been added to the -mm tree. Its filename is
mm-fix-panic-caused-by-__page_handle_poison.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/mm-fix-panic-caused-by-__page_han…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/mm-fix-panic-caused-by-__page_han…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Michael Wang <yun.wang(a)linux.alibaba.com>
Subject: mm: fix panic caused by __page_handle_poison()
In commit 510d25c92ec4 ("mm/hwpoison: disable pcp for
page_handle_poison()"), __page_handle_poison() was introduced, and if we
mark:
RET_A = dissolve_free_huge_page();
RET_B = take_page_off_buddy();
then __page_handle_poison() was supposed to return true when RET_A == 0 &&
RET_B == true.
But since it failed to take care of the case where RET_A is -EBUSY or
-ENOMEM, and simply returned ret as a bool (which then becomes true), it
broke the original logic.
The result is a huge page left in the freelist while being referenced as
poisoned, leading to the final panic:
kernel BUG at mm/internal.h:95!
invalid opcode: 0000 [#1] SMP PTI
skip...
RIP: 0010:set_page_refcounted mm/internal.h:95 [inline]
RIP: 0010:remove_hugetlb_page+0x23c/0x240 mm/hugetlb.c:1371
skip...
Call Trace:
remove_pool_huge_page+0xe4/0x110 mm/hugetlb.c:1892
return_unused_surplus_pages+0x8d/0x150 mm/hugetlb.c:2272
hugetlb_acct_memory.part.91+0x524/0x690 mm/hugetlb.c:4017
This patch replaces 'bool' with 'int' to handle RET_A correctly.
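The core of the bug is a plain C implicit conversion; a minimal
standalone sketch (not kernel code):

  #include <stdbool.h>
  #include <stdio.h>

  int main(void)
  {
          int err = -16;           /* e.g. -EBUSY from dissolve_free_huge_page() */
          bool ret = err;          /* any nonzero value converts to true */
          printf("%d\n", ret);     /* prints 1: the failure looks like success */
          return 0;
  }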
Link: https://lkml.kernel.org/r/61782ac6-1e8a-4f6f-35e6-e94fce3b37f5@linux.alibab…
Fixes: 510d25c92ec4 ("mm/hwpoison: disable pcp for page_handle_poison()")
Signed-off-by: Michael Wang <yun.wang(a)linux.alibaba.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Reported-by: Abaci <abaci(a)linux.alibaba.com>
Cc: <stable(a)vger.kernel.org> [5.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/mm/memory-failure.c~mm-fix-panic-caused-by-__page_handle_poison
+++ a/mm/memory-failure.c
@@ -68,7 +68,7 @@ atomic_long_t num_poisoned_pages __read_
static bool __page_handle_poison(struct page *page)
{
- bool ret;
+ int ret;
zone_pcp_disable(page_zone(page));
ret = dissolve_free_huge_page(page);
@@ -76,7 +76,7 @@ static bool __page_handle_poison(struct
ret = take_page_off_buddy(page);
zone_pcp_enable(page_zone(page));
- return ret;
+ return ret > 0;
}
static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
_
Patches currently in -mm which might be from yun.wang(a)linux.alibaba.com are
mm-fix-panic-caused-by-__page_handle_poison.patch