----- Original Message -----
>
>
> On May 13, 2019 4:01 PM, Yang Shi <yang.shi(a)linux.alibaba.com> wrote:
>
>
> On 5/13/19 9:38 AM, Will Deacon wrote:
> > On Fri, May 10, 2019 at 07:26:54AM +0800, Yang Shi wrote:
> >> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> >> index 99740e1..469492d 100644
> >> --- a/mm/mmu_gather.c
> >> +++ b/mm/mmu_gather.c
> >> @@ -245,14 +245,39 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> >> {
> >> /*
> >> * If there are parallel threads are doing PTE changes on same range
> >> - * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> >> - * flush by batching, a thread has stable TLB entry can fail to flush
> >> - * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> >> - * forcefully if we detect parallel PTE batching threads.
> >> + * under non-exclusive lock (e.g., mmap_sem read-side) but defer TLB
> >> + * flush by batching, one thread may end up seeing inconsistent PTEs
> >> + * and result in having stale TLB entries. So flush TLB forcefully
> >> + * if we detect parallel PTE batching threads.
> >> + *
> >> +	 * However, some syscalls, e.g. munmap(), may free page tables; this
> >> +	 * requires a forced flush of everything in the given range. Otherwise
> >> +	 * it may leave stale TLB entries on architectures, e.g. aarch64,
> >> +	 * that can specify which level of the TLB to flush.
> >> */
> >> - if (mm_tlb_flush_nested(tlb->mm)) {
> >> - __tlb_reset_range(tlb);
> >> - __tlb_adjust_range(tlb, start, end - start);
> >> + if (mm_tlb_flush_nested(tlb->mm) && !tlb->fullmm) {
> >> + /*
> >> + * Since we can't tell what we actually should have
> >> + * flushed, flush everything in the given range.
> >> + */
> >> + tlb->freed_tables = 1;
> >> + tlb->cleared_ptes = 1;
> >> + tlb->cleared_pmds = 1;
> >> + tlb->cleared_puds = 1;
> >> + tlb->cleared_p4ds = 1;
> >> +
> >> + /*
> >> + * Some architectures, e.g. ARM, that have range invalidation
> >> +	 * and care about VM_EXEC for I-Cache invalidation, need force
> >> +	 * vma_exec set.
> >> + */
> >> + tlb->vma_exec = 1;
> >> +
> >> + /* Force vma_huge clear to guarantee safer flush */
> >> + tlb->vma_huge = 0;
> >> +
> >> + tlb->start = start;
> >> + tlb->end = end;
> >> }
> > Whilst I think this is correct, it would be interesting to see whether
> > or not it's actually faster than just nuking the whole mm, as I mentioned
> > before.
> >
> > At least in terms of getting a short-term fix, I'd prefer the diff below
> > if it's not measurably worse.
>
> I did a quick test with ebizzy (96 threads, 5 iterations) on my x86
> VM. It shows a slight slowdown in records/s but much more sys time spent
> with the fullmm flush; the data is below.
>
> nofullmm fullmm
> ops (records/s) 225606 225119
> sys (s) 0.69 1.14
>
> It looks like the slight reduction in records/s is caused by the increase
> in sys time.
>
> >
> > Will
> >
> > --->8
> >
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index 99740e1dd273..cc251422d307 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -251,8 +251,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> > * forcefully if we detect parallel PTE batching threads.
> > */
> > if (mm_tlb_flush_nested(tlb->mm)) {
> > + tlb->fullmm = 1;
> > __tlb_reset_range(tlb);
> > - __tlb_adjust_range(tlb, start, end - start);
> > + tlb->freed_tables = 1;
> > }
> >
> > tlb_flush_mmu(tlb);
>
>
> I think that this should have set need_flush_all and not fullmm.
>
Wouldn't that skip the flush?
If fullmm == 0, then __tlb_reset_range() sets tlb->end = 0, so:
  tlb_flush_mmu
    tlb_flush_mmu_tlbonly
      if (!tlb->end)
          return
Replacing fullmm with need_flush_all brings the problem back and the reproducer hangs.
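To illustrate the skipped-flush path laid out above, here is a minimal userspace sketch; the struct and helper names only mimic the mm/mmu_gather.c behaviour discussed in this thread and are not the kernel code itself:

#include <stdbool.h>
#include <stdio.h>

struct tlb_model {                      /* mimics the relevant mmu_gather fields */
        unsigned long start, end;
        bool fullmm;
};

static void reset_range(struct tlb_model *tlb)
{
        if (tlb->fullmm) {
                tlb->start = tlb->end = ~0UL;
        } else {
                tlb->start = ~0UL;      /* stands in for TASK_SIZE */
                tlb->end = 0;           /* empty range */
        }
}

static void flush_mmu_tlbonly(struct tlb_model *tlb)
{
        if (!tlb->end) {                /* mirrors: if (!tlb->end) return; */
                printf("flush skipped (end == 0)\n");
                return;
        }
        if (tlb->fullmm)
                printf("flush whole mm\n");
        else
                printf("flush [%#lx, %#lx)\n", tlb->start, tlb->end);
}

int main(void)
{
        struct tlb_model tlb = { .start = 0x1000, .end = 0x3000, .fullmm = false };

        /* need_flush_all-only variant: fullmm stays 0, so the reset empties the range */
        reset_range(&tlb);
        flush_mmu_tlbonly(&tlb);        /* skipped: stale entries would survive */

        /* fullmm = 1 variant from the diff above */
        tlb.fullmm = true;
        reset_range(&tlb);
        flush_mmu_tlbonly(&tlb);        /* flushes the whole mm */
        return 0;
}

With fullmm == 0 the reset leaves an empty range and the flush is skipped; forcing fullmm = 1 (as in the diff above) keeps the flush.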
On Tue, May 14, 2019 at 02:01:34AM +0000, Nadav Amit wrote:
> > diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> > index 99740e1dd273..cc251422d307 100644
> > --- a/mm/mmu_gather.c
> > +++ b/mm/mmu_gather.c
> > @@ -251,8 +251,9 @@ void tlb_finish_mmu(struct mmu_gather *tlb,
> > * forcefully if we detect parallel PTE batching threads.
> > */
> > if (mm_tlb_flush_nested(tlb->mm)) {
> > + tlb->fullmm = 1;
> > __tlb_reset_range(tlb);
> > - __tlb_adjust_range(tlb, start, end - start);
> > + tlb->freed_tables = 1;
> > }
> >
> > tlb_flush_mmu(tlb);
>
>
> I think that this should have set need_flush_all and not fullmm.
Difficult; mmu_gather::need_flush_all is arch-specific and not everybody
implements it.
And while mmu_gather::fullmm isn't strictly correct either, we can
(ab)use it here, because at tlb_finish_mmu() time the differences don't
matter anymore.
Christophe Leroy <christophe.leroy(a)c-s.fr> writes:
> [Backport of upstream commit b45ba4a51cde29b2939365ef0c07ad34c8321789]
>
> On powerpc32, patch_instruction() is called by apply_feature_fixups()
> which is called from early_init()
>
> There is the following note in front of early_init():
> * Note that the kernel may be running at an address which is different
> * from the address that it was linked at, so we must use RELOC/PTRRELOC
> * to access static data (including strings). -- paulus
>
> Therefore init_mem_is_free must be accessed with PTRRELOC()
>
> Fixes: 1c38a84d4586 ("powerpc: Avoid code patching freed init sections")
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=203597
> Signed-off-by: Christophe Leroy <christophe.leroy(a)c-s.fr>
>
> ---
> Can't apply the upstream commit as-is due to several other unrelated changes,
> like STRICT_KERNEL_RWX, which are missing.
> So instead, this uses the same approach as commit 252eb55816a6f69ef9464cad303cdb3326cdc61d
Yeah this looks good to me.
Though should we keep the same subject as the upstream commit this is
sort of a backport of? That might make it simpler for people who are
trying to keep track of things?
ie. "powerpc/lib: fix book3s/32 boot failure due to code patching"
cheers
> diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
> index 14535ad4cdd1..c312955977ce 100644
> --- a/arch/powerpc/lib/code-patching.c
> +++ b/arch/powerpc/lib/code-patching.c
> @@ -23,7 +23,7 @@ int patch_instruction(unsigned int *addr, unsigned int instr)
> int err;
>
> /* Make sure we aren't patching a freed init section */
> - if (init_mem_is_free && init_section_contains(addr, 4)) {
> + if (*PTRRELOC(&init_mem_is_free) && init_section_contains(addr, 4)) {
> pr_debug("Skipping init section patching addr: 0x%px\n", addr);
> return 0;
> }
> --
> 2.13.3
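To illustrate the PTRRELOC() idea from the hunk above, here is a small self-contained C model; the variable names and the reloc_offset handling are illustrative stand-ins, not the powerpc implementation:

#include <stdint.h>
#include <stdio.h>

/* Two copies of a variable: where it was "linked" and where it "runs". */
static int linked_copy;                 /* stale, pre-relocation image    */
static int runtime_copy = 1;            /* the data actually in use       */

static intptr_t reloc_offset;           /* runtime address - link address */

#define PTRRELOC(x) ((__typeof__(x))((intptr_t)(x) + reloc_offset))

int main(void)
{
        int *init_mem_is_free = &linked_copy;   /* address the symbol was linked at */

        reloc_offset = (intptr_t)&runtime_copy - (intptr_t)&linked_copy;

        printf("unrelocated read: %d\n", *init_mem_is_free);            /* 0: wrong */
        printf("PTRRELOC read:    %d\n", *PTRRELOC(init_mem_is_free));  /* 1: right */
        return 0;
}

In the real kernel the offset comes from the difference between the running and linked addresses; here the two static variables merely stand in for "image as linked" and "image as running".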
The patch below does not apply to the 4.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6b583201fa219b7b1b6aebd8966c8fd9357ef9f4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Petr=20=C5=A0tetiar?= <ynezz(a)true.cz>
Date: Thu, 11 Apr 2019 20:13:30 +0200
Subject: [PATCH] mwl8k: Fix rate_idx underflow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
It was reported on the OpenWrt bug tracking system [1] that several users
are affected by an endless reboot of their routers if they configure a
5GHz interface with channel 44 or 48.
The reboot loop is caused by the following excessive number of WARN_ON
messages:
WARNING: CPU: 0 PID: 0 at backports-4.19.23-1/net/mac80211/rx.c:4516
ieee80211_rx_napi+0x1fc/0xa54 [mac80211]
as the messages are being correctly emitted by the following guard:
case RX_ENC_LEGACY:
if (WARN_ON(status->rate_idx >= sband->n_bitrates))
as rate_idx is in this case erroneously set to 251 (0xfb). This fix
simply converts the previously used magic number to a proper constant and
guards against the subtraction that leads to the observed
underflow.
1. https://bugs.openwrt.org/index.php?do=details&task_id=2218
Fixes: 854783444bab ("mwl8k: properly set receive status rate index on 5 GHz receive")
Cc: <stable(a)vger.kernel.org>
Tested-by: Eubert Bao <bunnier(a)gmail.com>
Reported-by: Eubert Bao <bunnier(a)gmail.com>
Signed-off-by: Petr Štetiar <ynezz(a)true.cz>
Signed-off-by: Kalle Valo <kvalo(a)codeaurora.org>
diff --git a/drivers/net/wireless/marvell/mwl8k.c b/drivers/net/wireless/marvell/mwl8k.c
index e0df51b62e97..70e69305197a 100644
--- a/drivers/net/wireless/marvell/mwl8k.c
+++ b/drivers/net/wireless/marvell/mwl8k.c
@@ -441,6 +441,9 @@ static const struct ieee80211_rate mwl8k_rates_50[] = {
#define MWL8K_CMD_UPDATE_STADB 0x1123
#define MWL8K_CMD_BASTREAM 0x1125
+#define MWL8K_LEGACY_5G_RATE_OFFSET \
+ (ARRAY_SIZE(mwl8k_rates_24) - ARRAY_SIZE(mwl8k_rates_50))
+
static const char *mwl8k_cmd_name(__le16 cmd, char *buf, int bufsize)
{
u16 command = le16_to_cpu(cmd);
@@ -1016,8 +1019,9 @@ mwl8k_rxd_ap_process(void *_rxd, struct ieee80211_rx_status *status,
if (rxd->channel > 14) {
status->band = NL80211_BAND_5GHZ;
- if (!(status->encoding == RX_ENC_HT))
- status->rate_idx -= 5;
+ if (!(status->encoding == RX_ENC_HT) &&
+ status->rate_idx >= MWL8K_LEGACY_5G_RATE_OFFSET)
+ status->rate_idx -= MWL8K_LEGACY_5G_RATE_OFFSET;
} else {
status->band = NL80211_BAND_2GHZ;
}
@@ -1124,8 +1128,9 @@ mwl8k_rxd_sta_process(void *_rxd, struct ieee80211_rx_status *status,
if (rxd->channel > 14) {
status->band = NL80211_BAND_5GHZ;
- if (!(status->encoding == RX_ENC_HT))
- status->rate_idx -= 5;
+ if (!(status->encoding == RX_ENC_HT) &&
+ status->rate_idx >= MWL8K_LEGACY_5G_RATE_OFFSET)
+ status->rate_idx -= MWL8K_LEGACY_5G_RATE_OFFSET;
} else {
status->band = NL80211_BAND_2GHZ;
}
The patch below does not apply to the 3.18-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
[ Restoring the recipients after mistakenly pressing reply instead of
reply-all ]
> On May 9, 2019, at 12:11 PM, Peter Zijlstra <peterz(a)infradead.org> wrote:
>
> On Thu, May 09, 2019 at 06:50:00PM +0000, Nadav Amit wrote:
>>> On May 9, 2019, at 11:24 AM, Peter Zijlstra <peterz(a)infradead.org> wrote:
>>>
>>> On Thu, May 09, 2019 at 05:36:29PM +0000, Nadav Amit wrote:
>
>>>> As a simple optimization, I think it is possible to hold multiple nesting
>>>> counters in the mm, similar to tlb_flush_pending, for freed_tables,
>>>> cleared_ptes, etc.
>>>>
>>>> The first time you set tlb->freed_tables, you also atomically increase
>>>> mm->tlb_flush_freed_tables. Then, in tlb_flush_mmu(), you just use
>>>> mm->tlb_flush_freed_tables instead of tlb->freed_tables.
>>>
>>> That sounds fraught with races and expensive; I would much prefer to not
>>> go there for this arguably rare case.
>>>
>>> Consider such fun cases as where CPU-0 sees and clears a PTE, CPU-1
>>> races and doesn't see that PTE. Therefore CPU-0 sets and counts
>>> cleared_ptes. Then if CPU-1 flushes while CPU-0 is still in mmu_gather,
>>> it will see cleared_ptes count increased and flush that granularity,
> >>> OTOH if CPU-1 flushes after CPU-0 completes, it will not, and potentially
>>> miss an invalidate it should have had.
>>
>> CPU-0 would send a TLB shootdown request to CPU-1 when it is done, so I
>> don’t see the problem. The TLB shootdown mechanism is independent of the
> >> mmu_gather for that matter.
>
> Duh.. I still don't like those unconditional mm wide atomic counters.
>
>>> This whole concurrent mmu_gather stuff is horrible.
>>>
>>> /me ponders more....
>>>
>>> So I think the fundamental race here is this:
>>>
>>> CPU-0 CPU-1
>>>
>>> tlb_gather_mmu(.start=1, tlb_gather_mmu(.start=2,
>>> .end=3); .end=4);
>>>
>>> ptep_get_and_clear_full(2)
>>> tlb_remove_tlb_entry(2);
>>> __tlb_remove_page();
>>> if (pte_present(2)) // nope
>>>
>>> tlb_finish_mmu();
>>>
>>> // continue without TLBI(2)
>>> // whoopsie
>>>
>>> tlb_finish_mmu();
>>> tlb_flush() -> TLBI(2)
>>>
>>>
>>> And we can fix that by having tlb_finish_mmu() sync up. Never let a
> >>> concurrent tlb_finish_mmu() complete until all concurrent mmu_gathers
>>> have completed.
>>>
>>> This should not be too hard to make happen.
>>
>> This synchronization sounds much more expensive than what I proposed. But I
>> agree that cache-lines that move from one CPU to another might become an
>> issue. But I think that the scheme I suggested would minimize this overhead.
>
> Well, it would have a lot more unconditional atomic ops. My scheme only
> waits when there is actual concurrency.
Well, something has to give. I didn't think the atomic op would be too
expensive if the same core does it.
> I _think_ something like the below ought to work, but it's not even been
> near a compiler. The only problem is the unconditional wakeup; we can
> play games to avoid that if we want to continue with this.
>
> Ideally we'd only do this when there's been actual overlap, but I've not
> found a sensible way to detect that.
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4ef4bbe78a1d..b70e35792d29 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -590,7 +590,12 @@ static inline void dec_tlb_flush_pending(struct mm_struct *mm)
> *
> * Therefore we must rely on tlb_flush_*() to guarantee order.
> */
> - atomic_dec(&mm->tlb_flush_pending);
> + if (atomic_dec_and_test(&mm->tlb_flush_pending)) {
> + wake_up_var(&mm->tlb_flush_pending);
> + } else {
> + wait_event_var(&mm->tlb_flush_pending,
> + !atomic_read_acquire(&mm->tlb_flush_pending));
> + }
> }
It still seems very expensive to me, at least for certain workloads (e.g.,
Apache with multithreaded MPM).
It may be possible to avoid false-positive nesting indications (when the
flushes do not overlap) by creating a new struct mmu_gather_pending, with
something like:
struct mmu_gather_pending {
	u64 start;
	u64 end;
	struct mmu_gather_pending *next;
};
tlb_finish_mmu() would then iterate over mm->mmu_gather_pending (the head of
the linked list) and check whether there is any overlap, as in the sketch
below. This would still require synchronization (acquiring a lock when
allocating and deallocating, or something fancier).
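A minimal userspace sketch of that overlap walk, assuming the struct proposed above and leaving the locking entirely aside, could look like:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct mmu_gather_pending {
        uint64_t start;
        uint64_t end;
        struct mmu_gather_pending *next;
};

/* True if [start, end) intersects any pending gather on the list. */
static bool overlaps_pending(const struct mmu_gather_pending *head,
                             uint64_t start, uint64_t end)
{
        for (; head; head = head->next)
                if (start < head->end && head->start < end)
                        return true;
        return false;
}

int main(void)
{
        struct mmu_gather_pending b = { .start = 0x4000, .end = 0x5000 };
        struct mmu_gather_pending a = { .start = 0x1000, .end = 0x3000, .next = &b };

        printf("%d\n", overlaps_pending(&a, 0x2000, 0x2800));  /* 1: overlap */
        printf("%d\n", overlaps_pending(&a, 0x3000, 0x4000));  /* 0: falls in the gap */
        return 0;
}

Each tlb_gather_mmu() would add its range to the list under the lock, and tlb_finish_mmu() would force the wider flush only when this walk reports an overlap.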
I see multiple instances of:
arch/powerpc/kernel/exceptions-64s.S:839: Error: attempt to move .org backwards
in v4.4.y.queue (v4.4.179-143-gc4db218e9451).
This is due to commit 9b2d4e06d7f1 ("powerpc/64s: Add support for a store
forwarding barrier at kernel entry/exit"), which is part of a large patch
series and can not easily be reverted.
Guess I'll stop doing ppc:allmodconfig builds in v4.4.y?
Thanks,
Guenter