This is a note to let you know that I've just added the patch titled
mm/pagewalk.c: report holes in hugetlb ranges
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: mm/pagewalk.c: report holes in hugetlb ranges
From: Jann Horn <jannh(a)google.com>
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Changed for 4.4/4.9 stable backport:
- fix up conflict in the huge_pte_offset() call
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/pagewalk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -142,8 +142,12 @@ static int walk_hugetlb_range(unsigned l
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
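To see why the hole reporting matters, here is a minimal pagewalk user
sketched against the 4.9-era walk_page_range() API (the demo_* names are
hypothetical, not code from the patch). Before the fix, a hole in a
hugetlb range invoked neither callback, so whatever the page allocator
had left in the caller's result buffer leaked out -- the mincore bug
described above:

	/* Sketch only: records whether a hugetlb entry exists.  A real
	 * user (like mincore) fills a per-page vector instead. */
	static int demo_hugetlb_entry(pte_t *pte, unsigned long hmask,
				      unsigned long addr, unsigned long next,
				      struct mm_walk *walk)
	{
		*(unsigned char *)walk->private = 1;	/* entry present */
		return 0;
	}

	static int demo_pte_hole(unsigned long addr, unsigned long next,
				 struct mm_walk *walk)
	{
		*(unsigned char *)walk->private = 0;	/* hole: only reported after this fix */
		return 0;
	}

	static int demo_walk(struct mm_struct *mm, unsigned long start,
			     unsigned long end, unsigned char *flag)
	{
		struct mm_walk walk = {
			.hugetlb_entry	= demo_hugetlb_entry,
			.pte_hole	= demo_pte_hole,
			.mm		= mm,
			.private	= flag,
		};

		/* Caller must hold mm->mmap_sem for reading. */
		return walk_page_range(start, end, &walk);
	}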
Patches currently in stable-queue which might be from jannh(a)google.com are
queue-4.9/mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Changed for 4.4/4.9 stable backport:
- fix up conflict in the huge_pte_offset() call
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
---
Please apply this patch to <=4.9 stable trees instead of the
original patch.
mm/pagewalk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 29f2f8b853ae..c2cbd2620169 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -142,8 +142,12 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
--
2.15.0.448.gf294e3d99a-goog
This is a note to let you know that I've just added the patch titled
mm/pagewalk.c: report holes in hugetlb ranges
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 373c4557d2aa362702c4c2d41288fb1e54990b7c Mon Sep 17 00:00:00 2001
From: Jann Horn <jannh(a)google.com>
Date: Tue, 14 Nov 2017 01:03:44 +0100
Subject: mm/pagewalk.c: report holes in hugetlb ranges
From: Jann Horn <jannh(a)google.com>
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.
This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.
Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.
This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.
This issue was found using an AFL-based fuzzer.
v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)
Changed for 4.4/4.9 stable backport:
- fix up conflict in the huge_pte_offset() call
Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh(a)google.com>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/pagewalk.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -142,8 +142,12 @@ static int walk_hugetlb_range(unsigned l
do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask);
- if (pte && walk->hugetlb_entry)
+
+ if (pte)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ else if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+
if (err)
break;
} while (addr = next, addr != end);
Patches currently in stable-queue which might be from jannh(a)google.com are
queue-4.4/mm-pagewalk.c-report-holes-in-hugetlb-ranges.patch
On Wed 22-11-17 09:43:46, Zi Yan wrote:
>
>
> Michal Hocko wrote:
> > On Wed 22-11-17 09:54:16, Michal Hocko wrote:
> >> On Mon 20-11-17 21:18:55, Zi Yan wrote:
> > [...]
> >>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >>> index 895ec0c4942e..a2246cf670ba 100644
> >>> --- a/include/linux/migrate.h
> >>> +++ b/include/linux/migrate.h
> >>> @@ -54,7 +54,7 @@ static inline struct page *new_page_nodemask(struct page *page,
> >>> new_page = __alloc_pages_nodemask(gfp_mask, order,
> >>> preferred_nid, nodemask);
> >>>
> >>> - if (new_page && PageTransHuge(page))
> >>> + if (new_page && PageTransHuge(new_page))
> >>> prep_transhuge_page(new_page);
> >> I would keep the two checks consistent. But that leads to a more
> >> interesting question. new_page_nodemask does
> >>
> >> if (thp_migration_supported() && PageTransHuge(page)) {
> >> order = HPAGE_PMD_ORDER;
> >> gfp_mask |= GFP_TRANSHUGE;
> >> }
> >
> > And one more question/note. Why do we need thp_migration_supported
> > in the first place? 9c670ea37947 ("mm: thp: introduce
> > CONFIG_ARCH_ENABLE_THP_MIGRATION") says
> > : Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit thp migration
> > : functionality to x86_64, which should be safer at the first step.
> >
> > but why is unsafe to enable the feature on other arches which support
> > THP? Is there any plan to do the next step and remove this config
> > option?
>
> Because different architectures have their own way of specifying a swap
> entry. This means, to support THP migration, each architecture needs to
> add its own __pmd_to_swp_entry() and __swp_entry_to_pmd(), which are
> used for arch-independent pmd_to_swp_entry() and swp_entry_to_pmd().
I understand that part. But this smells like a matter of coding, no?
I was surprised to see the note about safety, which didn't make much sense
to me.
--
Michal Hocko
SUSE Labs
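For reference, the per-architecture hooks Zi Yan mentions are small on
x86_64; paraphrased from the arch headers (treat the exact location and
spelling as approximate):

	/* x86_64: a swap pmd is just the raw pmd value reinterpreted. */
	#define __pmd_to_swp_entry(pmd)	((swp_entry_t) { pmd_val((pmd)) })
	#define __swp_entry_to_pmd(x)	((pmd_t) { .pmd = (x).val })

The coding is indeed mechanical; the non-trivial part is presumably
auditing each architecture's swap entry layout at the pmd level, which
would explain the cautious opt-in config option.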
This is a note to let you know that I've just added the patch titled
ipmi: Prefer ACPI system interfaces over SMBIOS ones
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipmi-prefer-acpi-system-interfaces-over-smbios-ones.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From 7e030d6dff713250c7dcfb543cad2addaf479b0e Mon Sep 17 00:00:00 2001
From: Corey Minyard <cminyard(a)mvista.com>
Date: Fri, 8 Sep 2017 14:05:58 -0500
Subject: ipmi: Prefer ACPI system interfaces over SMBIOS ones
From: Corey Minyard <cminyard(a)mvista.com>
commit 7e030d6dff713250c7dcfb543cad2addaf479b0e upstream.
The recent changes to add SMBIOS (DMI) IPMI interfaces as platform
devices caused DMI to be selected before ACPI, causing ACPI-type
operations to not work.
Signed-off-by: Corey Minyard <cminyard(a)mvista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/char/ipmi/ipmi_si_intf.c | 33 +++++++++++++++++++++++----------
1 file changed, 23 insertions(+), 10 deletions(-)
--- a/drivers/char/ipmi/ipmi_si_intf.c
+++ b/drivers/char/ipmi/ipmi_si_intf.c
@@ -3424,7 +3424,7 @@ static inline void wait_for_timer_and_th
del_timer_sync(&smi_info->si_timer);
}
-static int is_new_interface(struct smi_info *info)
+static struct smi_info *find_dup_si(struct smi_info *info)
{
struct smi_info *e;
@@ -3439,24 +3439,36 @@ static int is_new_interface(struct smi_i
*/
if (info->slave_addr && !e->slave_addr)
e->slave_addr = info->slave_addr;
- return 0;
+ return e;
}
}
- return 1;
+ return NULL;
}
static int add_smi(struct smi_info *new_smi)
{
int rv = 0;
+ struct smi_info *dup;
mutex_lock(&smi_infos_lock);
- if (!is_new_interface(new_smi)) {
- pr_info(PFX "%s-specified %s state machine: duplicate\n",
- ipmi_addr_src_to_str(new_smi->addr_source),
- si_to_str[new_smi->si_type]);
- rv = -EBUSY;
- goto out_err;
+ dup = find_dup_si(new_smi);
+ if (dup) {
+ if (new_smi->addr_source == SI_ACPI &&
+ dup->addr_source == SI_SMBIOS) {
+ /* We prefer ACPI over SMBIOS. */
+ dev_info(dup->dev,
+ "Removing SMBIOS-specified %s state machine in favor of ACPI\n",
+ si_to_str[new_smi->si_type]);
+ cleanup_one_si(dup);
+ } else {
+ dev_info(new_smi->dev,
+ "%s-specified %s state machine: duplicate\n",
+ ipmi_addr_src_to_str(new_smi->addr_source),
+ si_to_str[new_smi->si_type]);
+ rv = -EBUSY;
+ goto out_err;
+ }
}
pr_info(PFX "Adding %s-specified %s state machine\n",
@@ -3865,7 +3877,8 @@ static void cleanup_one_si(struct smi_in
poll(to_clean);
schedule_timeout_uninterruptible(1);
}
- disable_si_irq(to_clean, false);
+ if (to_clean->handlers)
+ disable_si_irq(to_clean, false);
while (to_clean->curr_msg || (to_clean->si_state != SI_NORMAL)) {
poll(to_clean);
schedule_timeout_uninterruptible(1);
Patches currently in stable-queue which might be from cminyard(a)mvista.com are
queue-4.14/ipmi-fix-unsigned-long-underflow.patch
queue-4.14/ipmi-prefer-acpi-system-interfaces-over-smbios-ones.patch
I'm requesting a backport of
7e030d6dff713250c7dcfb543cad2addaf479b0e ipmi: Prefer ACPI system
interfaces over SMBIOS ones
to the 4.14 stable kernel tree only. This was already staged for
Linus in my public tree before Andrew noticed an issue that this
patch fixes, where the system can oops when the ipmi_si module
is removed. Since it fixes an oops that likely can affect people, I'd
like it to be added to the stable tree.
Since this bug was added in 4.13 by:
0944d889a237b6107f9ceeee053fe7221cdd1089 ipmi: Convert DMI handling
over to a platform device
it should only require a backport to 4.14.
Thank you,
-corey
On 11/21/2017 09:14 PM, Jens Axboe wrote:
> On 11/21/2017 01:12 PM, Christian Borntraeger wrote:
>>
>>
>> On 11/21/2017 08:30 PM, Jens Axboe wrote:
>>> On 11/21/2017 12:15 PM, Christian Borntraeger wrote:
>>>>
>>>>
>>>> On 11/21/2017 07:39 PM, Jens Axboe wrote:
>>>>> On 11/21/2017 11:27 AM, Jens Axboe wrote:
>>>>>> On 11/21/2017 11:12 AM, Christian Borntraeger wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 11/21/2017 07:09 PM, Jens Axboe wrote:
>>>>>>>> On 11/21/2017 10:27 AM, Jens Axboe wrote:
>>>>>>>>> On 11/21/2017 03:14 AM, Christian Borntraeger wrote:
>>>>>>>>>> Bisect points to
>>>>>>>>>>
>>>>>>>>>> 1b5a7455d345b223d3a4658a9e5fce985b7998c1 is the first bad commit
>>>>>>>>>> commit 1b5a7455d345b223d3a4658a9e5fce985b7998c1
>>>>>>>>>> Author: Christoph Hellwig <hch(a)lst.de>
>>>>>>>>>> Date: Mon Jun 26 12:20:57 2017 +0200
>>>>>>>>>>
>>>>>>>>>> blk-mq: Create hctx for each present CPU
>>>>>>>>>>
>>>>>>>>>> commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream.
>>>>>>>>>>
>>>>>>>>>> Currently we only create hctx for online CPUs, which can lead to a lot
>>>>>>>>>> of churn due to frequent soft offline / online operations. Instead
>>>>>>>>>> allocate one for each present CPU to avoid this and dramatically simplify
>>>>>>>>>> the code.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Christoph Hellwig <hch(a)lst.de>
>>>>>>>>>> Reviewed-by: Jens Axboe <axboe(a)kernel.dk>
>>>>>>>>>> Cc: Keith Busch <keith.busch(a)intel.com>
>>>>>>>>>> Cc: linux-block(a)vger.kernel.org
>>>>>>>>>> Cc: linux-nvme(a)lists.infradead.org
>>>>>>>>>> Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
>>>>>>>>>> Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
>>>>>>>>>> Cc: Oleksandr Natalenko <oleksandr(a)natalenko.name>
>>>>>>>>>> Cc: Mike Galbraith <efault(a)gmx.de>
>>>>>>>>>> Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
>>>>>>>>>
>>>>>>>>> I wonder if we're simply not getting the masks updated correctly. I'll
>>>>>>>>> take a look.
>>>>>>>>
>>>>>>>> Can't make it trigger here. We do init for each present CPU, which means
>>>>>>>> that if I offline a few CPUs here and register a queue, those still show
>>>>>>>> up as present (just offline) and get mapped accordingly.
>>>>>>>>
>>>>>>>> From the looks of it, your setup is different. If the CPU doesn't show
>>>>>>>> up as present and it gets hotplugged, then I can see how this condition
>>>>>>>> would trigger. What environment are you running this in? We might have
>>>>>>>> to re-introduce the cpu hotplug notifier, right now we just monitor
>>>>>>>> for a dead cpu and handle that.
>>>>>>>
>>>>>>> I am not doing a hot unplug and then a replug; I use KVM and add a
>>>>>>> previously unavailable CPU.
>>>>>>>
>>>>>>> in libvirt/virsh speak:
>>>>>>> <vcpu placement='static' current='1'>4</vcpu>
>>>>>>
>>>>>> So that's why we run into problems. It's not present when we load the device,
>>>>>> but becomes present and online afterwards.
>>>>>>
>>>>>> Christoph, we used to handle this just fine, your patch broke it.
>>>>>>
>>>>>> I'll see if I can come up with an appropriate fix.
>>>>>
>>>>> Can you try the below?
>>>>
>>>>
>>>> It does prevent the crash, but it seems that the new CPU is not "used" after the hotplug for mq:
>>>>
>>>>
>>>> output with 2 cpus:
>>>> /sys/kernel/debug/block/vda
>>>> /sys/kernel/debug/block/vda/hctx0
>>>> /sys/kernel/debug/block/vda/hctx0/cpu0
>>>> /sys/kernel/debug/block/vda/hctx0/cpu0/completed
>>>> /sys/kernel/debug/block/vda/hctx0/cpu0/merged
>>>> /sys/kernel/debug/block/vda/hctx0/cpu0/dispatched
>>>> /sys/kernel/debug/block/vda/hctx0/cpu0/rq_list
>>>> /sys/kernel/debug/block/vda/hctx0/active
>>>> /sys/kernel/debug/block/vda/hctx0/run
>>>> /sys/kernel/debug/block/vda/hctx0/queued
>>>> /sys/kernel/debug/block/vda/hctx0/dispatched
>>>> /sys/kernel/debug/block/vda/hctx0/io_poll
>>>> /sys/kernel/debug/block/vda/hctx0/sched_tags_bitmap
>>>> /sys/kernel/debug/block/vda/hctx0/sched_tags
>>>> /sys/kernel/debug/block/vda/hctx0/tags_bitmap
>>>> /sys/kernel/debug/block/vda/hctx0/tags
>>>> /sys/kernel/debug/block/vda/hctx0/ctx_map
>>>> /sys/kernel/debug/block/vda/hctx0/busy
>>>> /sys/kernel/debug/block/vda/hctx0/dispatch
>>>> /sys/kernel/debug/block/vda/hctx0/flags
>>>> /sys/kernel/debug/block/vda/hctx0/state
>>>> /sys/kernel/debug/block/vda/sched
>>>> /sys/kernel/debug/block/vda/sched/dispatch
>>>> /sys/kernel/debug/block/vda/sched/starved
>>>> /sys/kernel/debug/block/vda/sched/batching
>>>> /sys/kernel/debug/block/vda/sched/write_next_rq
>>>> /sys/kernel/debug/block/vda/sched/write_fifo_list
>>>> /sys/kernel/debug/block/vda/sched/read_next_rq
>>>> /sys/kernel/debug/block/vda/sched/read_fifo_list
>>>> /sys/kernel/debug/block/vda/write_hints
>>>> /sys/kernel/debug/block/vda/state
>>>> /sys/kernel/debug/block/vda/requeue_list
>>>> /sys/kernel/debug/block/vda/poll_stat
>>>
>>> Try this, basically just a revert.
>>
>> Yes, seems to work.
>>
>> Tested-by: Christian Borntraeger <borntraeger(a)de.ibm.com>
>
> Great, thanks for testing.
>
>> Do you know why the original commit made it into 4.12 stable? After all
>> it has no Fixes tag and no Cc: stable.
>
> I was wondering the same thing when you said it was in 4.12.stable and
> not in 4.12 release. That patch should absolutely not have gone into
> stable, it's not marked as such and it's not fixing a problem that is
> stable worthy. In fact, it's causing a regression...
>
> Greg? Upstream commit is mentioned higher up, start of the email.
>
Forgot to cc Greg?
Hi all,
Since I'm going slightly off-topic, I've tweaked the subject line and
trimmed some of the conversation.
I believe everyone in the CC list might be interested in the
following, yet feel free to adjust.
Above all, I'd kindly ask everyone to skim through and draw their conclusions.
If the ideas put forward have some value - great, if not - let my email rot.
On 17 November 2017 at 13:57, Greg KH <gregkh(a)linuxfoundation.org> wrote:
>>
>> I still have no idea how this autoselect picks up patches that do *not*
>> have cc: stable nor Fixes: from us. What information do you have that we
>> don't for making that call?
>
> I'll let Sasha describe how he's doing this, but in the end, does it
> really matter _how_ it is done, vs. the fact that it seems to at least
> one human reviewer that this is a patch that _should_ be included based
> on the changelog text and the code patch?
>
> By having this review process that Sasha is providing, he's saying
> "Here's a patch that I think might be good for stable, do you object?"
>
> If you do, great, no harm done, all is fine, the patch is dropped. If
> you don't object, just ignore the email and the patch gets merged.
>
> If you don't want any of this to happen for your subsystem at all, then
> also fine, just let us know and we will ignore it entirely.
>
Let me start by saying that I'm handling the releases for Mesa 3D -
the project providing OpenGL, Vulkan and many other userspace graphics
drivers.
I've been doing that for 3 years now, which admittedly is quite a
short time relative to the kernel.
The procedure is quite similar to the kernel's, with a few
differences - see below for details.
We also autoselect patches, hence my interest in the heuristics
applied for nominating patches ;-)
That aside, here are some things I've learned from my experience.
Some of these may not be applicable - I hope you'll find them useful:
- Try to point developers to existing documentation/procedures.
I was just reminded that even long-standing developers can forget detail X or Y.
- CC developers for the important stuff - i.e., do not CC on each accepted patch.
Accepted patches are merged into the pre-release branch, and an email with
the accepted/deferred/rejected list is sent.
Patches that had merge conflicts, and ones that are rejected, have
their author in the CC list.
Rejected patches get a brief description, and developers are contacted beforehand.
- Autoselect patches are merged only with approval from the
respective developers.
IMHO this engages developers in the process without distracting
them too much.
It is by no means a perfect system - input and changes are always appreciated.
That said, here are some suggestions which should make autosel smoother:
- Document the autoselect process
Information about What, Why, and [ideally] How - analogous to
the normal stable nominations.
Insert a reference to the process in the patch notification email.
- Make the autoselect nominations _more_ distinct than the normal stable ones.
Maintainers will want to put more cognitive effort into the patches.
HTH
Emil
This is a note to let you know that I've just added the patch titled
mm/page_ext.c: check if page_ext is not prepared
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From e492080e640c2d1235ddf3441cae634cfffef7e1 Mon Sep 17 00:00:00 2001
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
Date: Wed, 15 Nov 2017 17:39:07 -0800
Subject: mm/page_ext.c: check if page_ext is not prepared
From: Jaewon Kim <jaewon31.kim(a)samsung.com>
commit e492080e640c2d1235ddf3441cae634cfffef7e1 upstream.
online_page_ext() and page_ext_init() allocate page_ext for each
section, but they do not allocate if the first PFN is !pfn_present(pfn)
or !pfn_valid(pfn). Then section->page_ext remains as NULL.
lookup_page_ext checks for NULL only if CONFIG_DEBUG_VM is enabled. For a
valid PFN, __set_page_owner will try to get page_ext through
lookup_page_ext. Without CONFIG_DEBUG_VM, lookup_page_ext will misuse
the NULL pointer as value 0. This incurs an invalid address access.
This is the panic example when PFN 0x100000 is not valid but PFN
0x13FC00 is being used for page_ext: section->page_ext is NULL, and
get_entry returned an invalid page_ext address of 0x1DFA000 for
PFN 0x13FC00.
To avoid this panic, the CONFIG_DEBUG_VM guard should be removed so that
page_ext is checked at all times.
Unable to handle kernel paging request at virtual address 01dfa014
------------[ cut here ]------------
Kernel BUG at ffffff80082371e0 [verbose debug info unavailable]
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Modules linked in:
PC is at __set_page_owner+0x48/0x78
LR is at __set_page_owner+0x44/0x78
__set_page_owner+0x48/0x78
get_page_from_freelist+0x880/0x8e8
__alloc_pages_nodemask+0x14c/0xc48
__do_page_cache_readahead+0xdc/0x264
filemap_fault+0x2ac/0x550
ext4_filemap_fault+0x3c/0x58
__do_fault+0x80/0x120
handle_mm_fault+0x704/0xbb0
do_page_fault+0x2e8/0x394
do_mem_abort+0x88/0x124
Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return
value of lookup_page_ext for all call sites").
Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com
Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging")
Signed-off-by: Jaewon Kim <jaewon31.kim(a)samsung.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Joonsoo Kim <js1304(a)gmail.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko(a)suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/page_ext.c | 4 ----
1 file changed, 4 deletions(-)
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -106,7 +106,6 @@ struct page_ext *lookup_page_ext(struct
struct page_ext *base;
base = NODE_DATA(page_to_nid(page))->node_page_ext;
-#ifdef CONFIG_DEBUG_VM
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -115,7 +114,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (unlikely(!base))
return NULL;
-#endif
offset = pfn - round_down(node_start_pfn(page_to_nid(page)),
MAX_ORDER_NR_PAGES);
return base + offset;
@@ -180,7 +178,6 @@ struct page_ext *lookup_page_ext(struct
{
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
-#ifdef CONFIG_DEBUG_VM
/*
* The sanity checks the page allocator does upon freeing a
* page can reach here before the page_ext arrays are
@@ -189,7 +186,6 @@ struct page_ext *lookup_page_ext(struct
*/
if (!section->page_ext)
return NULL;
-#endif
return section->page_ext + pfn;
}
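With the #ifdef pairs gone, lookup_page_ext() can return NULL in every
configuration, so all call sites must check for it -- the pattern that
commit f86e4271978b (referenced above) established. A sketch of the
expected caller shape:

	struct page_ext *page_ext = lookup_page_ext(page);

	if (unlikely(!page_ext))
		return;
	/* ... safe to dereference page_ext from here on ... */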
Patches currently in stable-queue which might be from jaewon31.kim(a)samsung.com are
queue-4.4/mm-page_ext.c-check-if-page_ext-is-not-prepared.patch
From: Zi Yan <zi.yan(a)cs.rutgers.edu>
In [1], Andrea reported that during memory hotplug/hot-remove,
prep_transhuge_page() is incorrectly called on non-THP pages for
migration when THP is on but THP migration is not enabled.
This leaves the target pages for migration in a bad state.
This patch fixes it by only calling prep_transhuge_page() when we are
certain that the target page is THP.
[1] https://lkml.org/lkml/2017/11/20/411
Cc: stable(a)vger.kernel.org # v4.14
Fixes: 8135d8926c08 ("mm: memory_hotplug: memory hotremove supports thp migration")
Reported-by: Andrea Reale <ar(a)linux.vnet.ibm.com>
Signed-off-by: Zi Yan <zi.yan(a)cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi(a)ah.jp.nec.com>
Cc: "Jérôme Glisse" <jglisse(a)redhat.com>
---
include/linux/migrate.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 895ec0c4942e..a2246cf670ba 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -54,7 +54,7 @@ static inline struct page *new_page_nodemask(struct page *page,
new_page = __alloc_pages_nodemask(gfp_mask, order,
preferred_nid, nodemask);
- if (new_page && PageTransHuge(page))
+ if (new_page && PageTransHuge(new_page))
prep_transhuge_page(new_page);
return new_page;
--
2.14.2
On Wed 22-11-17 07:29:38, Zi Yan wrote:
> On 22 Nov 2017, at 7:13, Zi Yan wrote:
>
> > On 22 Nov 2017, at 5:14, Michal Hocko wrote:
> >
> >> On Wed 22-11-17 10:35:10, Michal Hocko wrote:
> >> [...]
> >>> Moreover I am not really sure this is really working properly. Just look
> >>> at the split_huge_page. It moves all the tail pages to the LRU list
> >>> while migrate_pages has a list of pages to migrate. So we will migrate
> >>> the head page and all the rest will get back to the LRU list. What
> >>> guarantees that they will get migrated as well.
> >>
> >> OK, so this is as I've expected. It doesn't work! Some pfn walker based
> >> migration will just skip tail pages see madvise_inject_error.
> >> __alloc_contig_migrate_range will simply fail on THP page see
> >> isolate_migratepages_block so we even do not try to migrate it.
> >> do_move_page_to_node_array will simply migrate head and do not care
> >> about tail pages. do_mbind splits the page and then fall back to pte
> >> walk when thp migration is not supported but it doesn't handle tail
> >> pages if the THP migration path is not able to allocate a fresh THP
> >> AFAICS. Memory hotplug should be safe because it doesn't skip the whole
> >> THP when doing pfn walk.
> >>
> >> Unless I am missing something here this looks like a huge mess to me.
> >
> > +Kirill
> >
> > First, I agree with you that splitting a THP and only migrating its head page
> > is a mess. But what you describe is also the behavior of migrate_page()
> > _before_ THP migration support is added. I thought that was intended.
> >
> > Look at http://elixir.free-electrons.com/linux/v4.13.15/source/mm/migrate.c#L1091,
> > unmap_and_move() splits THPs and only migrates the head page in v4.13 before THP
> > migration is added. I think the behavior was introduced in v4.5 (I just skimmed
> > the v4.0 to v4.13 code and did not have time to use git blame); before that, THPs were
> > not migrated but were shown as successfully migrated (at least from v4.4’s code).
>
> Sorry, I misread v4.4’s code, it also does ‘splitting a THP and migrating its head page’.
> This behavior has been there for a long time, at least since v3.0.
>
> The code in unmap_and_move() is:
>
> if (unlikely(PageTransHuge(page)))
> if (unlikely(split_huge_page(page)))
> goto out;
I _think_ that this all should be handled at the migrate_pages layer: try
to migrate the THP and fall back to split_huge_page into the list when
that fails (see the sketch below). I haven't checked whether there is
something that would prevent that, though. The THP tricks in specific
paths should then be removed.
--
Michal Hocko
SUSE Labs
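A sketch of the shape Michal suggests, inside the migrate_pages() retry
loop (hypothetical code, not an existing patch; the retry label and
surrounding plumbing are assumed):

	rc = unmap_and_move(get_new_page, put_new_page, private,
			    page, pass > 2, mode, reason);
	if (rc == -ENOMEM && PageTransHuge(page)) {
		lock_page(page);
		/* Put the tail pages back on the migration list ("from")
		 * instead of the LRU, so they are retried as order-0
		 * pages rather than silently left behind. */
		rc = split_huge_page_to_list(page, from);
		unlock_page(page);
		if (!rc)
			goto retry;	/* assumed retry label */
	}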
I made a mistake when converting the hugetlb code to 5-level paging:
in huge_pte_alloc() we have to use p4d_alloc(), not p4d_offset().
Otherwise it leads to a crash -- a NULL-pointer dereference in pud_alloc()
if the p4d table is not yet allocated.
This can only happen in 5-level paging mode. In 4-level paging mode
p4d_offset() always returns the pgd, so we are fine.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Fixes: c2febafc6773 ("mm: convert generic code to 5-level paging")
Cc: <stable(a)vger.kernel.org> # v4.11+
---
mm/hugetlb.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2d2ff5e8bf2b..94a4c0b63580 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4617,7 +4617,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
pte_t *pte = NULL;
pgd = pgd_offset(mm, addr);
- p4d = p4d_offset(pgd, addr);
+ p4d = p4d_alloc(mm, pgd, addr);
+ if (!p4d)
+ return NULL;
pud = pud_alloc(mm, p4d, addr);
if (pud) {
if (sz == PUD_SIZE) {
--
2.15.0
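The distinction is easy to miss: the *_offset() helpers only walk an
already-existing table, while the *_alloc() helpers allocate the next
level on demand. An illustrative contrast (not from the patch itself):

	pgd = pgd_offset(mm, addr);
	/* Walk only: assumes the p4d table already exists. */
	p4d = p4d_offset(pgd, addr);
	/* Allocates the p4d table if missing; returns NULL on allocation
	 * failure, which the caller must handle. */
	p4d = p4d_alloc(mm, pgd, addr);

With 4-level page tables the p4d level is folded into the pgd, so both
forms end up returning the pgd entry and the bug stays invisible.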
Hi,
Is there a way to get notifications about patches queued up for the
-rc1 commits in linux-stable-rc.git? I currently rely on the push that
accompanies Greg's email to this list, but that only gives a 48-hour
window for building and testing. If there were a way to get the patches
earlier, it would allow more time for testing. Any hints are much
appreciated.
Best Regards,
Milosz
On Wed 22-11-17 04:18:35, Zi Yan wrote:
> On 22 Nov 2017, at 3:54, Michal Hocko wrote:
[...]
> > I would keep the two checks consistent. But that leads to a more
> > interesting question. new_page_nodemask does
> >
> > if (thp_migration_supported() && PageTransHuge(page)) {
> > order = HPAGE_PMD_ORDER;
> > gfp_mask |= GFP_TRANSHUGE;
> > }
> >
> > How come it is safe to allocate an order-0 page if
> > !thp_migration_supported() when we are about to migrate THP? This
> > doesn't make any sense to me. Are we working around this somewhere else?
> > Why shouldn't we simply return NULL here?
>
> If !thp_migration_supported(), we will first split a THP and migrate
> its head page. This process is done in unmap_and_move() after
> get_new_page() (the function pointer to this new_page_nodemask()) is
> called. It can happen that PageTransHuge(page) is true here, but the
> page is split in unmap_and_move(), so we want to return an order-0 page
> here.
This deserves a big fat comment in the code because this is not clear
from the code!
> I think the confusion comes from the fact that there is no guarantee of THP
> allocation when we are doing THP migration. If we can allocate a THP
> during THP migration, we are good. Otherwise, we want to fallback to
> the old way, splitting the original THP and migrating the head page,
> to preserve the original code behavior.
I understand that but that should be done explicitly rather than relying
on two functions doing the right thing because this is just too fragile.
Moreover I am not really sure this is really working properly. Just look
at the split_huge_page. It moves all the tail pages to the LRU list
while migrate_pages has a list of pages to migrate. So we will migrate
the head page and all the rest will get back to the LRU list. What
guarantees that they will get migrated as well.
This all looks like a mess!
--
Michal Hocko
SUSE Labs
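For concreteness, the "big fat comment" might look roughly like this next
to the 4.14-era check in new_page_nodemask() (the wording is a sketch,
not an actual patch):

	/*
	 * If THP migration is not supported, unmap_and_move() will split
	 * @page and migrate only its head page, so falling back to an
	 * order-0 allocation here is intentional even though @page is
	 * still a THP at this point.
	 */
	if (thp_migration_supported() && PageTransHuge(page)) {
		order = HPAGE_PMD_ORDER;
		gfp_mask |= GFP_TRANSHUGE;
	}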
On Tue 21 Nov 2017, 17:35, Zi Yan wrote:
> On 21 Nov 2017, at 17:12, Andrew Morton wrote:
>
> > On Mon, 20 Nov 2017 21:18:55 -0500 Zi Yan <zi.yan(a)sent.com> wrote:
> >
> >> This patch fixes it by only calling prep_transhuge_page() when we are
> >> certain that the target page is THP.
> >
> > What are the user-visible effects of the bug?
>
> By inspecting the code, if called on a non-THP, prep_transhuge_page() will
> 1) change the value of the mapping of (page + 2), since it is used for THP deferred list;
> 2) change the lru value of (page + 1), since it is used for THP’s dtor.
>
> Both can lead to data corruption of these two pages.
Pragmatically, and from the point of view of the memory_hotplug subsystem,
the effect is a kernel crash when pages are being migrated during a memory
hot-remove offline and the migration target pages are found in a bad state.
Best,
Andrea
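The corruption follows directly from what prep_transhuge_page() writes;
paraphrased from the 4.14 sources (field placement approximate):

	void prep_transhuge_page(struct page *page)
	{
		/*
		 * page->mapping + page->index of the second tail page are
		 * used as a list_head for the THP deferred-split list,
		 * assuming a compound page of order >= 2.
		 */
		INIT_LIST_HEAD(page_deferred_list(page));	/* writes into page[2] */
		set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR); /* writes into page[1] */
	}

If @page is not a THP, page[1] and page[2] are unrelated struct pages,
which is exactly the two-page corruption described above.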
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
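For context, venus_fsync() sizes and allocates the request roughly like
this (paraphrased from fs/coda/upcall.c; details approximate), which
shows why declaring sizeof(union inputArgs) overstated the allocation:

	union inputArgs *inp;
	union outputArgs *outp;
	int insize, outsize, error;

	insize = SIZE(fsync);	/* max of the fsync in/out sizes, small */
	UPARG(CODA_FSYNC);	/* allocates exactly insize bytes for inp */

	/* Declaring sizeof(union inputArgs) to coda_upcall() claimed more
	 * bytes than UPARG allocated, so the cache manager's read ran past
	 * the buffer and tripped the Linux-4.8+ usercopy hardening. */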
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.9/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.4/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.14-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.14 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -447,8 +447,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.14/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-4.13/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
This is a note to let you know that I've just added the patch titled
coda: fix 'kernel memory exposure attempt' in fsync
to the 3.18-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
and it can be found in the queue-3.18 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From d337b66a4c52c7b04eec661d86c2ef6e168965a2 Mon Sep 17 00:00:00 2001
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
Date: Wed, 27 Sep 2017 15:52:12 -0400
Subject: coda: fix 'kernel memory exposure attempt' in fsync
From: Jan Harkes <jaharkes(a)cs.cmu.edu>
commit d337b66a4c52c7b04eec661d86c2ef6e168965a2 upstream.
When an application called fsync on a file in Coda, a small request with
just the file identifier was allocated, but the declared length was set
to the size of the union of all possible upcall requests.
This bug has been around for a very long time and is now caught by the
extra checking in usercopy that was introduced in Linux-4.8.
The exposure happens when the Coda cache manager process reads the fsync
upcall request at which point it is killed. As a result there is nobody
servicing any further upcalls, trapping any processes that try to access
the mounted Coda filesystem.
Signed-off-by: Jan Harkes <jaharkes(a)cs.cmu.edu>
Signed-off-by: Al Viro <viro(a)zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
fs/coda/upcall.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/coda/upcall.c
+++ b/fs/coda/upcall.c
@@ -446,8 +446,7 @@ int venus_fsync(struct super_block *sb,
UPARG(CODA_FSYNC);
inp->coda_fsync.VFid = *fid;
- error = coda_upcall(coda_vcp(sb), sizeof(union inputArgs),
- &outsize, inp);
+ error = coda_upcall(coda_vcp(sb), insize, &outsize, inp);
CODA_FREE(inp, insize);
return error;
Patches currently in stable-queue which might be from jaharkes(a)cs.cmu.edu are
queue-3.18/coda-fix-kernel-memory-exposure-attempt-in-fsync.patch
The patch below does not apply to the 4.13-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: [PATCH] mm, swap: fix false error message in __swp_swapcount()
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA-based swap readahead) may read ahead several swap entries
after the faulting swap entry. The readahead algorithm calculates some of
the swap entries to read ahead by increasing the offset of the faulting
swap entry, without checking whether they are beyond the end of the swap
device; it relies on __swp_swapcount() and swapcache_prepare() to
check that. Although __swp_swapcount() checks the swap entry passed
in, it complains with the error message below for these expected
invalid swap entries, which may confuse end users.
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added in
swapin_readahead() to avoid passing out-of-bounds swap entries, and
the swap entry reserved for the swap header, to __swp_swapcount() and
swapcache_prepare().
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.11+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 326439428daf..f2face8b889e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -559,6 +559,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true, page_allocated;
@@ -572,6 +573,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
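A worked example with hypothetical numbers shows how the readahead window
can run past the device:

	/*
	 * Assume a readahead window of 8 pages (mask = 7), a fault at
	 * offset 0x200f8a0, and a swap device with si->max = 0x200f8a4:
	 *
	 *   start_offset = 0x200f8a0 & ~7 = 0x200f8a0
	 *   end_offset   = 0x200f8a0 |  7 = 0x200f8a7   (beyond si->max)
	 *
	 * Offsets 0x200f8a4..0x200f8a7 do not exist on the device; before
	 * this fix they reached __swp_swapcount(), which logged the
	 * "Bad swap offset entry" message.  Clamping end_offset to
	 * si->max - 1 keeps the readahead loop inside the device.
	 */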
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: [PATCH] mm, swap: fix false error message in __swp_swapcount()
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA-based swap readahead) may read ahead several swap entries
after the faulting swap entry. The readahead algorithm calculates some of
the swap entries to read ahead by increasing the offset of the faulting
swap entry, without checking whether they are beyond the end of the swap
device; it relies on __swp_swapcount() and swapcache_prepare() to
check that. Although __swp_swapcount() checks the swap entry passed
in, it complains with the error message below for these expected
invalid swap entries, which may confuse end users.
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added in
swapin_readahead() to avoid passing out-of-bounds swap entries, and
the swap entry reserved for the swap header, to __swp_swapcount() and
swapcache_prepare().
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: <stable(a)vger.kernel.org> [4.11+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 326439428daf..f2face8b889e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -559,6 +559,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true, page_allocated;
@@ -572,6 +573,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
This is a note to let you know that I've just added the patch titled
mm, swap: fix false error message in __swp_swapcount()
to the 4.13-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
mm-swap-fix-false-error-message-in-__swp_swapcount.patch
and it can be found in the queue-4.13 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
From e9a6effa500526e2a19d5ad042cb758b55b1ef93 Mon Sep 17 00:00:00 2001
From: Huang Ying <huang.ying.caritas(a)gmail.com>
Date: Wed, 15 Nov 2017 17:33:15 -0800
Subject: mm, swap: fix false error message in __swp_swapcount()
From: Huang Ying <huang.ying.caritas(a)gmail.com>
commit e9a6effa500526e2a19d5ad042cb758b55b1ef93 upstream.
When a page fault occurs for a swap entry, the physical swap readahead
(not the VMA-based swap readahead) may read ahead several swap entries
after the faulting swap entry. The readahead algorithm calculates some of
the swap entries to read ahead by increasing the offset of the faulting
swap entry, without checking whether they are beyond the end of the swap
device; it relies on __swp_swapcount() and swapcache_prepare() to
check that. Although __swp_swapcount() checks the swap entry passed
in, it complains with the error message below for these expected
invalid swap entries, which may confuse end users.
swap_info_get: Bad swap offset entry 0200f8a7
To fix the false error message, swap entry checking is added in
swapin_readahead() to avoid passing out-of-bounds swap entries, and
the swap entry reserved for the swap header, to __swp_swapcount() and
swapcache_prepare().
Link: http://lkml.kernel.org/r/20171102054225.22897-1-ying.huang@intel.com
Fixes: e8c26ab60598 ("mm/swap: skip readahead for unreferenced swap slots")
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reported-by: Christian Kujau <lists(a)nerdbynature.de>
Acked-by: Minchan Kim <minchan(a)kernel.org>
Suggested-by: Minchan Kim <minchan(a)kernel.org>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Michal Hocko <mhocko(a)suse.com>
Cc: Hugh Dickins <hughd(a)google.com>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
mm/swap_state.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -506,6 +506,7 @@ struct page *swapin_readahead(swp_entry_
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
+ struct swap_info_struct *si = swp_swap_info(entry);
struct blk_plug plug;
bool do_poll = true;
@@ -519,6 +520,8 @@ struct page *swapin_readahead(swp_entry_
end_offset = offset | mask;
if (!start_offset) /* First page is swap header. */
start_offset++;
+ if (end_offset >= si->max)
+ end_offset = si->max - 1;
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
Patches currently in stable-queue which might be from huang.ying.caritas(a)gmail.com are
queue-4.13/mm-swap-fix-false-error-message-in-__swp_swapcount.patch