The patch titled
Subject: crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfo
has been added to the -mm tree. Its filename is
crash_core-vmcoreinfo-append-section_size_bits-to-vmcoreinfo.patch
This patch should soon appear at
https://ozlabs.org/~akpm/mmots/broken-out/crash_core-vmcoreinfo-append-sect…
and later at
https://ozlabs.org/~akpm/mmotm/broken-out/crash_core-vmcoreinfo-append-sect…
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Pingfan Liu <kernelfans(a)gmail.com>
Subject: crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfo
As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo:
Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the
formula:
#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
PAGES_PER_SECTION in makedumpfile just like kernel.
Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g.
recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce
SECTION_SIZE_BITS"). But user space wants a stable interface to get this
info. Such info is impossible to be deduced from a crashdump vmcore.
Hence append SECTION_SIZE_BITS to vmcoreinfo.
Link: https://lkml.kernel.org/r/20210608103359.84907-1-kernelfans@gmail.com
Link: http://lists.infradead.org/pipermail/kexec/2021-June/022676.html
Signed-off-by: Pingfan Liu <kernelfans(a)gmail.com>
Acked-by: Baoquan He <bhe(a)redhat.com>
Cc: Bhupesh Sharma <bhupesh.sharma(a)linaro.org>
Cc: Kazuhito Hagio <k-hagio(a)ab.jp.nec.com>
Cc: Dave Young <dyoung(a)redhat.com>
Cc: Boris Petkov <bp(a)alien8.de>
Cc: Ingo Molnar <mingo(a)kernel.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Cc: James Morse <james.morse(a)arm.com>
Cc: Mark Rutland <mark.rutland(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Paul Mackerras <paulus(a)samba.org>
Cc: Benjamin Herrenschmidt <benh(a)kernel.crashing.org>
Cc: Dave Anderson <anderson(a)redhat.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/crash_core.c | 1 +
1 file changed, 1 insertion(+)
--- a/kernel/crash_core.c~crash_core-vmcoreinfo-append-section_size_bits-to-vmcoreinfo
+++ a/kernel/crash_core.c
@@ -464,6 +464,7 @@ static int __init crash_save_vmcoreinfo_
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
VMCOREINFO_STRUCT_SIZE(mem_section);
VMCOREINFO_OFFSET(mem_section, section_mem_map);
+ VMCOREINFO_NUMBER(SECTION_SIZE_BITS);
VMCOREINFO_NUMBER(MAX_PHYSMEM_BITS);
#endif
VMCOREINFO_STRUCT_SIZE(page);
_
Patches currently in -mm which might be from kernelfans(a)gmail.com are
crash_core-vmcoreinfo-append-section_size_bits-to-vmcoreinfo.patch
Commit 545fbd0775ba ("rq-qos: fix missed wake-ups in rq_qos_throttle")
tried to fix a problem that a process could be sleeping in rq_qos_wait()
without anyone to wake it up. However the fix is not complete and the
following can still happen:
CPU1 (waiter1) CPU2 (waiter2) CPU3 (waker)
rq_qos_wait() rq_qos_wait()
acquire_inflight_cb() -> fails
acquire_inflight_cb() -> fails
completes IOs, inflight
decreased
prepare_to_wait_exclusive()
prepare_to_wait_exclusive()
has_sleeper = !wq_has_single_sleeper() -> true as there are two sleepers
has_sleeper = !wq_has_single_sleeper() -> true
io_schedule() io_schedule()
Deadlock as now there's nobody to wakeup the two waiters. The logic
automatically blocking when there are already sleepers is really subtle
and the only way to make it work reliably is that we check whether there
are some waiters in the queue when adding ourselves there. That way, we
are guaranteed that at least the first process to enter the wait queue
will recheck the waiting condition before going to sleep and thus
guarantee forward progress.
Fixes: 545fbd0775ba ("rq-qos: fix missed wake-ups in rq_qos_throttle")
CC: stable(a)vger.kernel.org
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
block/blk-rq-qos.c | 4 ++--
include/linux/wait.h | 2 +-
kernel/sched/wait.c | 9 +++++++--
3 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 656460636ad3..e83af7bc7591 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -266,8 +266,8 @@ void rq_qos_wait(struct rq_wait *rqw, void *private_data,
if (!has_sleeper && acquire_inflight_cb(rqw, private_data))
return;
- prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
- has_sleeper = !wq_has_single_sleeper(&rqw->wait);
+ has_sleeper = !prepare_to_wait_exclusive(&rqw->wait, &data.wq,
+ TASK_UNINTERRUPTIBLE);
do {
/* The memory barrier in set_task_state saves us here. */
if (data.got_token)
diff --git a/include/linux/wait.h b/include/linux/wait.h
index fe10e8570a52..6598ae35e1b5 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -1136,7 +1136,7 @@ do { \
* Waitqueues which are removed from the waitqueue_head at wakeup time
*/
void prepare_to_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
-void prepare_to_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
+bool prepare_to_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
long prepare_to_wait_event(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state);
void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry);
long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 183cc6ae68a6..76577d1642a5 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -264,17 +264,22 @@ prepare_to_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_ent
}
EXPORT_SYMBOL(prepare_to_wait);
-void
+/* Returns true if we are the first waiter in the queue, false otherwise. */
+bool
prepare_to_wait_exclusive(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_entry, int state)
{
unsigned long flags;
+ bool was_empty = false;
wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
spin_lock_irqsave(&wq_head->lock, flags);
- if (list_empty(&wq_entry->entry))
+ if (list_empty(&wq_entry->entry)) {
+ was_empty = list_empty(&wq_head->head);
__add_wait_queue_entry_tail(wq_head, wq_entry);
+ }
set_current_state(state);
spin_unlock_irqrestore(&wq_head->lock, flags);
+ return was_empty;
}
EXPORT_SYMBOL(prepare_to_wait_exclusive);
--
2.26.2
From: Long Li <longli(a)microsoft.com>
After commit 07173c3ec276 ("block: enable multipage bvecs"), a bvec can
have multiple pages. But bio_will_gap() still assumes one page bvec while
checking for merging. If the pages in the bvec go across the
seg_boundary_mask, this check for merging can potentially succeed if only
the 1st page is tested, and can fail if all the pages are tested.
Later, when SCSI builds the SG list the same check for merging is done in
__blk_segment_map_sg_merge() with all the pages in the bvec tested. This
time the check may fail if the pages in bvec go across the
seg_boundary_mask (but tested okay in bio_will_gap() earlier, so those
BIOs were merged). If this check fails, we end up with a broken SG list
for drivers assuming the SG list not having offsets in intermediate pages.
This results in incorrect pages written to the disk.
Fix this by returning the multi-page bvec when testing gaps for merging.
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: Johannes Thumshirn <johannes.thumshirn(a)wdc.com>
Cc: Pavel Begunkov <asml.silence(a)gmail.com>
Cc: Ming Lei <ming.lei(a)redhat.com>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: "Matthew Wilcox (Oracle)" <willy(a)infradead.org>
Cc: Jeffle Xu <jefflexu(a)linux.alibaba.com>
Cc: linux-kernel(a)vger.kernel.org
Cc: stable(a)vger.kernel.org
Fixes: 07173c3ec276 ("block: enable multipage bvecs")
Signed-off-by: Long Li <longli(a)microsoft.com>
Reviewed-by: Ming Lei <ming.lei(a)redhat.com>
---
Changes
v2: added commit details on how data corruption happens
v3: reorganized the code/comments in bio_get_last_bvec
include/linux/bio.h | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index a0b4cfdf62a4..d2b98efb5cc5 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -44,9 +44,6 @@ static inline unsigned int bio_max_segs(unsigned int nr_segs)
#define bio_offset(bio) bio_iter_offset((bio), (bio)->bi_iter)
#define bio_iovec(bio) bio_iter_iovec((bio), (bio)->bi_iter)
-#define bio_multiple_segments(bio) \
- ((bio)->bi_iter.bi_size != bio_iovec(bio).bv_len)
-
#define bvec_iter_sectors(iter) ((iter).bi_size >> 9)
#define bvec_iter_end_sector(iter) ((iter).bi_sector + bvec_iter_sectors((iter)))
@@ -271,7 +268,7 @@ static inline void bio_clear_flag(struct bio *bio, unsigned int bit)
static inline void bio_get_first_bvec(struct bio *bio, struct bio_vec *bv)
{
- *bv = bio_iovec(bio);
+ *bv = mp_bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
}
static inline void bio_get_last_bvec(struct bio *bio, struct bio_vec *bv)
@@ -279,10 +276,9 @@ static inline void bio_get_last_bvec(struct bio *bio, struct bio_vec *bv)
struct bvec_iter iter = bio->bi_iter;
int idx;
- if (unlikely(!bio_multiple_segments(bio))) {
- *bv = bio_iovec(bio);
- return;
- }
+ bio_get_first_bvec(bio, bv);
+ if (bv->bv_len == bio->bi_iter.bi_size)
+ return; /* this bio only has a single bvec */
bio_advance_iter(bio, &iter, iter.bi_size);
--
2.17.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
>From afd09b617db3786b6ef3dc43e28fe728cfea84df Mon Sep 17 00:00:00 2001
From: Alexey Makhalov <amakhalov(a)vmware.com>
Date: Fri, 21 May 2021 07:55:33 +0000
Subject: [PATCH] ext4: fix memory leak in ext4_fill_super
Buffer head references must be released before calling kill_bdev();
otherwise the buffer head (and its page referenced by b_data) will not
be freed by kill_bdev, and subsequently that bh will be leaked.
If blocksizes differ, sb_set_blocksize() will kill current buffers and
page cache by using kill_bdev(). And then super block will be reread
again but using correct blocksize this time. sb_set_blocksize() didn't
fully free superblock page and buffer head, and being busy, they were
not freed and instead leaked.
This can easily be reproduced by calling an infinite loop of:
systemctl start <ext4_on_lvm>.mount, and
systemctl stop <ext4_on_lvm>.mount
... since systemd creates a cgroup for each slice which it mounts, and
the bh leak get amplified by a dying memory cgroup that also never
gets freed, and memory consumption is much more easily noticed.
Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize")
Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3")
Link: https://lore.kernel.org/r/20210521075533.95732-1-amakhalov@vmware.com
Signed-off-by: Alexey Makhalov <amakhalov(a)vmware.com>
Signed-off-by: Theodore Ts'o <tytso(a)mit.edu>
Cc: stable(a)kernel.org
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 886e0d518668..f66c7301b53a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4462,14 +4462,20 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
}
if (sb->s_blocksize != blocksize) {
+ /*
+ * bh must be released before kill_bdev(), otherwise
+ * it won't be freed and its page also. kill_bdev()
+ * is called by sb_set_blocksize().
+ */
+ brelse(bh);
/* Validate the filesystem blocksize */
if (!sb_set_blocksize(sb, blocksize)) {
ext4_msg(sb, KERN_ERR, "bad block size %d",
blocksize);
+ bh = NULL;
goto failed_mount;
}
- brelse(bh);
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
offset = do_div(logical_sb_block, blocksize);
bh = ext4_sb_bread_unmovable(sb, logical_sb_block);
@@ -5202,8 +5208,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
kfree(get_qf_name(sb, sbi, i));
#endif
fscrypt_free_dummy_policy(&sbi->s_dummy_enc_policy);
- ext4_blkdev_remove(sbi);
+ /* ext4_blkdev_remove() calls kill_bdev(), release bh before it. */
brelse(bh);
+ ext4_blkdev_remove(sbi);
out_fail:
sb->s_fs_info = NULL;
kfree(sbi->s_blockgroup_lock);