This is a note to let you know that I've just added the patch titled
x86/asm: Fix inline asm call constraints for GCC 4.4
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-asm-fix-inline-asm-call-constraints-for-gcc-4.4.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 520a13c530aeb5f63e011d668c42db1af19ed349 Mon Sep 17 00:00:00 2001
From: Josh Poimboeuf <jpoimboe(a)redhat.com>
Date: Thu, 28 Sep 2017 16:58:26 -0500
Subject: x86/asm: Fix inline asm call constraints for GCC 4.4
From: Josh Poimboeuf <jpoimboe(a)redhat.com>
commit 520a13c530aeb5f63e011d668c42db1af19ed349 upstream.
The kernel test bot (run by Xiaolong Ye) reported that the following commit:
f5caf621ee35 ("x86/asm: Fix inline asm call constraints for Clang")
is causing double faults in a kernel compiled with GCC 4.4.
Linus subsequently diagnosed the crash pattern and the buggy commit and found that
the issue is with this code:
register unsigned int __asm_call_sp asm("esp");
#define ASM_CALL_CONSTRAINT "+r" (__asm_call_sp)
Even on a 64-bit kernel, it's using ESP instead of RSP. That causes GCC
to produce the following bogus code:
ffffffff8147461d: 89 e0 mov %esp,%eax
ffffffff8147461f: 4c 89 f7 mov %r14,%rdi
ffffffff81474622: 4c 89 fe mov %r15,%rsi
ffffffff81474625: ba 20 00 00 00 mov $0x20,%edx
ffffffff8147462a: 89 c4 mov %eax,%esp
ffffffff8147462c: e8 bf 52 05 00 callq ffffffff814c98f0 <copy_user_generic_unrolled>
Despite the absurdity of it backing up and restoring the stack pointer
for no reason, the bug is actually the fact that it's only backing up
and restoring the lower 32 bits of the stack pointer. The upper 32 bits
are getting cleared out, corrupting the stack pointer.
So change the '__asm_call_sp' register variable to be associated with
the actual full-size stack pointer.
This also requires changing the __ASM_SEL() macro to be based on the
actual compiled arch size, rather than the CONFIG value, because
CONFIG_X86_64 compiles some files with '-m32' (e.g., realmode and vdso).
Otherwise Clang fails to build the kernel because it complains about the
use of a 64-bit register (RSP) in a 32-bit file.
Reported-and-Bisected-and-Tested-by: kernel test robot <xiaolong.ye(a)intel.com>
Diagnosed-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: LKP <lkp(a)01.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Matthias Kaehlcke <mka(a)chromium.org>
Cc: Miguel Bernal Marin <miguel.bernal.marin(a)linux.intel.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Fixes: f5caf621ee35 ("x86/asm: Fix inline asm call constraints for Clang")
Link: http://lkml.kernel.org/r/20170928215826.6sdpmwtkiydiytim@treble
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Cc: Matthias Kaehlcke <mka(a)chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/asm.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -11,10 +11,12 @@
# define __ASM_FORM_COMMA(x) " " #x ","
#endif
-#ifdef CONFIG_X86_32
+#ifndef __x86_64__
+/* 32 bit */
# define __ASM_SEL(a,b) __ASM_FORM(a)
# define __ASM_SEL_RAW(a,b) __ASM_FORM_RAW(a)
#else
+/* 64 bit */
# define __ASM_SEL(a,b) __ASM_FORM(b)
# define __ASM_SEL_RAW(a,b) __ASM_FORM_RAW(b)
#endif
Patches currently in stable-queue which might be from jpoimboe(a)redhat.com are
queue-4.9/x86-asm-fix-inline-asm-call-constraints-for-gcc-4.4.patch
This is a note to let you know that I've just added the patch titled
x86/asm: Fix inline asm call constraints for GCC 4.4
to the 4.4-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
x86-asm-fix-inline-asm-call-constraints-for-gcc-4.4.patch
and it can be found in the queue-4.4 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 520a13c530aeb5f63e011d668c42db1af19ed349 Mon Sep 17 00:00:00 2001
From: Josh Poimboeuf <jpoimboe(a)redhat.com>
Date: Thu, 28 Sep 2017 16:58:26 -0500
Subject: x86/asm: Fix inline asm call constraints for GCC 4.4
From: Josh Poimboeuf <jpoimboe(a)redhat.com>
commit 520a13c530aeb5f63e011d668c42db1af19ed349 upstream.
The kernel test bot (run by Xiaolong Ye) reported that the following commit:
f5caf621ee35 ("x86/asm: Fix inline asm call constraints for Clang")
is causing double faults in a kernel compiled with GCC 4.4.
Linus subsequently diagnosed the crash pattern and the buggy commit and found that
the issue is with this code:
register unsigned int __asm_call_sp asm("esp");
#define ASM_CALL_CONSTRAINT "+r" (__asm_call_sp)
Even on a 64-bit kernel, it's using ESP instead of RSP. That causes GCC
to produce the following bogus code:
ffffffff8147461d: 89 e0 mov %esp,%eax
ffffffff8147461f: 4c 89 f7 mov %r14,%rdi
ffffffff81474622: 4c 89 fe mov %r15,%rsi
ffffffff81474625: ba 20 00 00 00 mov $0x20,%edx
ffffffff8147462a: 89 c4 mov %eax,%esp
ffffffff8147462c: e8 bf 52 05 00 callq ffffffff814c98f0 <copy_user_generic_unrolled>
Despite the absurdity of it backing up and restoring the stack pointer
for no reason, the bug is actually the fact that it's only backing up
and restoring the lower 32 bits of the stack pointer. The upper 32 bits
are getting cleared out, corrupting the stack pointer.
So change the '__asm_call_sp' register variable to be associated with
the actual full-size stack pointer.
This also requires changing the __ASM_SEL() macro to be based on the
actual compiled arch size, rather than the CONFIG value, because
CONFIG_X86_64 compiles some files with '-m32' (e.g., realmode and vdso).
Otherwise Clang fails to build the kernel because it complains about the
use of a 64-bit register (RSP) in a 32-bit file.
Reported-and-Bisected-and-Tested-by: kernel test robot <xiaolong.ye(a)intel.com>
Diagnosed-by: Linus Torvalds <torvalds(a)linux-foundation.org>
Signed-off-by: Josh Poimboeuf <jpoimboe(a)redhat.com>
Cc: Alexander Potapenko <glider(a)google.com>
Cc: Andrey Ryabinin <aryabinin(a)virtuozzo.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Arnd Bergmann <arnd(a)arndb.de>
Cc: Dmitriy Vyukov <dvyukov(a)google.com>
Cc: LKP <lkp(a)01.org>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Matthias Kaehlcke <mka(a)chromium.org>
Cc: Miguel Bernal Marin <miguel.bernal.marin(a)linux.intel.com>
Cc: Peter Zijlstra <peterz(a)infradead.org>
Cc: Thomas Gleixner <tglx(a)linutronix.de>
Fixes: f5caf621ee35 ("x86/asm: Fix inline asm call constraints for Clang")
Link: http://lkml.kernel.org/r/20170928215826.6sdpmwtkiydiytim@treble
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Cc: Matthias Kaehlcke <mka(a)chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/include/asm/asm.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -11,10 +11,12 @@
# define __ASM_FORM_COMMA(x) " " #x ","
#endif
-#ifdef CONFIG_X86_32
+#ifndef __x86_64__
+/* 32 bit */
# define __ASM_SEL(a,b) __ASM_FORM(a)
# define __ASM_SEL_RAW(a,b) __ASM_FORM_RAW(a)
#else
+/* 64 bit */
# define __ASM_SEL(a,b) __ASM_FORM(b)
# define __ASM_SEL_RAW(a,b) __ASM_FORM_RAW(b)
#endif
Patches currently in stable-queue which might be from jpoimboe(a)redhat.com are
queue-4.4/x86-asm-fix-inline-asm-call-constraints-for-gcc-4.4.patch
From: Eric Biggers <ebiggers(a)google.com>
commit 794b4bc292f5d31739d89c0202c54e7dc9bc3add upstream
With the 'encrypted' key type it was possible for userspace to provide a
data blob ending with a master key description shorter than expected,
e.g. 'keyctl add encrypted desc "new x" @s'. When validating such a
master key description, validate_master_desc() could read beyond the end
of the buffer. Fix this by using strncmp() instead of memcmp(). [Also
clean up the code to deduplicate some logic.]
Cc: linux-stable <stable(a)vger.kernel.org> # 4.9.y
Cc: Mimi Zohar <zohar(a)linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: David Howells <dhowells(a)redhat.com>
Signed-off-by: James Morris <james.l.morris(a)oracle.com>
Signed-off-by: Jin Qian <jinqian(a)google.com>
Signed-off-by: Jin Qian <jinqian(a)android.com>
---
security/keys/encrypted-keys/encrypted.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/security/keys/encrypted-keys/encrypted.c b/security/keys/encrypted-keys/encrypted.c
index a871159bf03c..ead2fd60244d 100644
--- a/security/keys/encrypted-keys/encrypted.c
+++ b/security/keys/encrypted-keys/encrypted.c
@@ -141,23 +141,22 @@ static int valid_ecryptfs_desc(const char *ecryptfs_desc)
*/
static int valid_master_desc(const char *new_desc, const char *orig_desc)
{
- if (!memcmp(new_desc, KEY_TRUSTED_PREFIX, KEY_TRUSTED_PREFIX_LEN)) {
- if (strlen(new_desc) == KEY_TRUSTED_PREFIX_LEN)
- goto out;
- if (orig_desc)
- if (memcmp(new_desc, orig_desc, KEY_TRUSTED_PREFIX_LEN))
- goto out;
- } else if (!memcmp(new_desc, KEY_USER_PREFIX, KEY_USER_PREFIX_LEN)) {
- if (strlen(new_desc) == KEY_USER_PREFIX_LEN)
- goto out;
- if (orig_desc)
- if (memcmp(new_desc, orig_desc, KEY_USER_PREFIX_LEN))
- goto out;
- } else
- goto out;
+ int prefix_len;
+
+ if (!strncmp(new_desc, KEY_TRUSTED_PREFIX, KEY_TRUSTED_PREFIX_LEN))
+ prefix_len = KEY_TRUSTED_PREFIX_LEN;
+ else if (!strncmp(new_desc, KEY_USER_PREFIX, KEY_USER_PREFIX_LEN))
+ prefix_len = KEY_USER_PREFIX_LEN;
+ else
+ return -EINVAL;
+
+ if (!new_desc[prefix_len])
+ return -EINVAL;
+
+ if (orig_desc && strncmp(new_desc, orig_desc, prefix_len))
+ return -EINVAL;
+
return 0;
-out:
- return -EINVAL;
}
/*
--
2.16.0.rc1.238.g530d649a79-goog
From: Eric Biggers <ebiggers(a)google.com>
commit 794b4bc292f5d31739d89c0202c54e7dc9bc3add upstream
With the 'encrypted' key type it was possible for userspace to provide a
data blob ending with a master key description shorter than expected,
e.g. 'keyctl add encrypted desc "new x" @s'. When validating such a
master key description, validate_master_desc() could read beyond the end
of the buffer. Fix this by using strncmp() instead of memcmp(). [Also
clean up the code to deduplicate some logic.]
Cc: linux-stable <stable(a)vger.kernel.org> # 4.4.y
Cc: Mimi Zohar <zohar(a)linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: David Howells <dhowells(a)redhat.com>
Signed-off-by: James Morris <james.l.morris(a)oracle.com>
Signed-off-by: Jin Qian <jinqian(a)google.com>
Signed-off-by: Jin Qian <jinqian(a)android.com>
---
security/keys/encrypted-keys/encrypted.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/security/keys/encrypted-keys/encrypted.c b/security/keys/encrypted-keys/encrypted.c
index ce295c0c1da0..e44e844c8ec4 100644
--- a/security/keys/encrypted-keys/encrypted.c
+++ b/security/keys/encrypted-keys/encrypted.c
@@ -141,23 +141,22 @@ static int valid_ecryptfs_desc(const char *ecryptfs_desc)
*/
static int valid_master_desc(const char *new_desc, const char *orig_desc)
{
- if (!memcmp(new_desc, KEY_TRUSTED_PREFIX, KEY_TRUSTED_PREFIX_LEN)) {
- if (strlen(new_desc) == KEY_TRUSTED_PREFIX_LEN)
- goto out;
- if (orig_desc)
- if (memcmp(new_desc, orig_desc, KEY_TRUSTED_PREFIX_LEN))
- goto out;
- } else if (!memcmp(new_desc, KEY_USER_PREFIX, KEY_USER_PREFIX_LEN)) {
- if (strlen(new_desc) == KEY_USER_PREFIX_LEN)
- goto out;
- if (orig_desc)
- if (memcmp(new_desc, orig_desc, KEY_USER_PREFIX_LEN))
- goto out;
- } else
- goto out;
+ int prefix_len;
+
+ if (!strncmp(new_desc, KEY_TRUSTED_PREFIX, KEY_TRUSTED_PREFIX_LEN))
+ prefix_len = KEY_TRUSTED_PREFIX_LEN;
+ else if (!strncmp(new_desc, KEY_USER_PREFIX, KEY_USER_PREFIX_LEN))
+ prefix_len = KEY_USER_PREFIX_LEN;
+ else
+ return -EINVAL;
+
+ if (!new_desc[prefix_len])
+ return -EINVAL;
+
+ if (orig_desc && strncmp(new_desc, orig_desc, prefix_len))
+ return -EINVAL;
+
return 0;
-out:
- return -EINVAL;
}
/*
--
2.16.0.rc1.238.g530d649a79-goog
From: Eric Biggers <ebiggers(a)google.com>
commit 794b4bc292f5d31739d89c0202c54e7dc9bc3add upstream
With the 'encrypted' key type it was possible for userspace to provide a
data blob ending with a master key description shorter than expected,
e.g. 'keyctl add encrypted desc "new x" @s'. When validating such a
master key description, validate_master_desc() could read beyond the end
of the buffer. Fix this by using strncmp() instead of memcmp(). [Also
clean up the code to deduplicate some logic.]
Cc: linux-stable <stable(a)vger.kernel.org> # 3.18.y
Cc: Mimi Zohar <zohar(a)linux.vnet.ibm.com>
Signed-off-by: Eric Biggers <ebiggers(a)google.com>
Signed-off-by: David Howells <dhowells(a)redhat.com>
Signed-off-by: James Morris <james.l.morris(a)oracle.com>
Signed-off-by: Jin Qian <jinqian(a)google.com>
---
security/keys/encrypted-keys/encrypted.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)
diff --git a/security/keys/encrypted-keys/encrypted.c b/security/keys/encrypted-keys/encrypted.c
index 89d5695c51cd..20251ee5c491 100644
--- a/security/keys/encrypted-keys/encrypted.c
+++ b/security/keys/encrypted-keys/encrypted.c
@@ -141,23 +141,22 @@ static int valid_ecryptfs_desc(const char *ecryptfs_desc)
*/
static int valid_master_desc(const char *new_desc, const char *orig_desc)
{
- if (!memcmp(new_desc, KEY_TRUSTED_PREFIX, KEY_TRUSTED_PREFIX_LEN)) {
- if (strlen(new_desc) == KEY_TRUSTED_PREFIX_LEN)
- goto out;
- if (orig_desc)
- if (memcmp(new_desc, orig_desc, KEY_TRUSTED_PREFIX_LEN))
- goto out;
- } else if (!memcmp(new_desc, KEY_USER_PREFIX, KEY_USER_PREFIX_LEN)) {
- if (strlen(new_desc) == KEY_USER_PREFIX_LEN)
- goto out;
- if (orig_desc)
- if (memcmp(new_desc, orig_desc, KEY_USER_PREFIX_LEN))
- goto out;
- } else
- goto out;
+ int prefix_len;
+
+ if (!strncmp(new_desc, KEY_TRUSTED_PREFIX, KEY_TRUSTED_PREFIX_LEN))
+ prefix_len = KEY_TRUSTED_PREFIX_LEN;
+ else if (!strncmp(new_desc, KEY_USER_PREFIX, KEY_USER_PREFIX_LEN))
+ prefix_len = KEY_USER_PREFIX_LEN;
+ else
+ return -EINVAL;
+
+ if (!new_desc[prefix_len])
+ return -EINVAL;
+
+ if (orig_desc && strncmp(new_desc, orig_desc, prefix_len))
+ return -EINVAL;
+
return 0;
-out:
- return -EINVAL;
}
/*
--
2.16.0.rc1.238.g530d649a79-goog
This is a note to let you know that I've just added the patch titled
vhost_net: stop device during reset owner
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
vhost_net-stop-device-during-reset-owner.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Feb 6 12:42:23 PST 2018
From: Jason Wang <jasowang(a)redhat.com>
Date: Thu, 25 Jan 2018 22:03:52 +0800
Subject: vhost_net: stop device during reset owner
From: Jason Wang <jasowang(a)redhat.com>
[ Upstream commit 4cd879515d686849eec5f718aeac62a70b067d82 ]
We don't stop device before reset owner, this means we could try to
serve any virtqueue kick before reset dev->worker. This will result a
warn since the work was pending at llist during owner resetting. Fix
this by stopping device during owner reset.
Reported-by: syzbot+eb17c6162478cc50632c(a)syzkaller.appspotmail.com
Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server")
Signed-off-by: Jason Wang <jasowang(a)redhat.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/vhost/net.c | 1 +
1 file changed, 1 insertion(+)
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1078,6 +1078,7 @@ static long vhost_net_reset_owner(struct
}
vhost_net_stop(n, &tx_sock, &rx_sock);
vhost_net_flush(n);
+ vhost_dev_stop(&n->dev);
vhost_dev_reset_owner(&n->dev, umem);
vhost_net_vq_reset(n);
done:
Patches currently in stable-queue which might be from jasowang(a)redhat.com are
queue-4.9/vhost_net-stop-device-during-reset-owner.patch
This is a note to let you know that I've just added the patch titled
tcp_bbr: fix pacing_gain to always be unity when using lt_bw
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
tcp_bbr-fix-pacing_gain-to-always-be-unity-when-using-lt_bw.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Feb 6 12:42:23 PST 2018
From: Neal Cardwell <ncardwell(a)google.com>
Date: Wed, 31 Jan 2018 15:43:05 -0500
Subject: tcp_bbr: fix pacing_gain to always be unity when using lt_bw
From: Neal Cardwell <ncardwell(a)google.com>
[ Upstream commit 3aff3b4b986e51bcf4ab249e5d48d39596e0df6a ]
This commit fixes the pacing_gain to remain at BBR_UNIT (1.0) when
using lt_bw and returning from the PROBE_RTT state to PROBE_BW.
Previously, when using lt_bw, upon exiting PROBE_RTT and entering
PROBE_BW the bbr_reset_probe_bw_mode() code could sometimes randomly
end up with a cycle_idx of 0 and hence have bbr_advance_cycle_phase()
set a pacing gain above 1.0. In such cases this would result in a
pacing rate that is 1.25x higher than intended, potentially resulting
in a high loss rate for a little while until we stop using the lt_bw a
bit later.
This commit is a stable candidate for kernels back as far as 4.9.
Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell(a)google.com>
Signed-off-by: Yuchung Cheng <ycheng(a)google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil(a)google.com>
Reported-by: Beyers Cronje <bcronje(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv4/tcp_bbr.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -452,7 +452,8 @@ static void bbr_advance_cycle_phase(stru
bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
bbr->cycle_mstamp = tp->delivered_mstamp;
- bbr->pacing_gain = bbr_pacing_gain[bbr->cycle_idx];
+ bbr->pacing_gain = bbr->lt_use_bw ? BBR_UNIT :
+ bbr_pacing_gain[bbr->cycle_idx];
}
/* Gain cycling: cycle pacing gain to converge to fair share of available bw. */
@@ -461,8 +462,7 @@ static void bbr_update_cycle_phase(struc
{
struct bbr *bbr = inet_csk_ca(sk);
- if ((bbr->mode == BBR_PROBE_BW) && !bbr->lt_use_bw &&
- bbr_is_next_cycle_phase(sk, rs))
+ if (bbr->mode == BBR_PROBE_BW && bbr_is_next_cycle_phase(sk, rs))
bbr_advance_cycle_phase(sk);
}
Patches currently in stable-queue which might be from ncardwell(a)google.com are
queue-4.9/tcp_bbr-fix-pacing_gain-to-always-be-unity-when-using-lt_bw.patch
This is a note to let you know that I've just added the patch titled
tcp: release sk_frag.page in tcp_disconnect
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
tcp-release-sk_frag.page-in-tcp_disconnect.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Feb 6 12:42:23 PST 2018
From: Li RongQing <lirongqing(a)baidu.com>
Date: Fri, 26 Jan 2018 16:40:41 +0800
Subject: tcp: release sk_frag.page in tcp_disconnect
From: Li RongQing <lirongqing(a)baidu.com>
[ Upstream commit 9b42d55a66d388e4dd5550107df051a9637564fc ]
socket can be disconnected and gets transformed back to a listening
socket, if sk_frag.page is not released, which will be cloned into
a new socket by sk_clone_lock, but the reference count of this page
is increased, lead to a use after free or double free issue
Signed-off-by: Li RongQing <lirongqing(a)baidu.com>
Cc: Eric Dumazet <edumazet(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv4/tcp.c | 6 ++++++
1 file changed, 6 insertions(+)
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2316,6 +2316,12 @@ int tcp_disconnect(struct sock *sk, int
WARN_ON(inet->inet_num && !icsk->icsk_bind_hash);
+ if (sk->sk_frag.page) {
+ put_page(sk->sk_frag.page);
+ sk->sk_frag.page = NULL;
+ sk->sk_frag.offset = 0;
+ }
+
sk->sk_error_report(sk);
return err;
}
Patches currently in stable-queue which might be from lirongqing(a)baidu.com are
queue-4.9/tcp-release-sk_frag.page-in-tcp_disconnect.patch
This is a note to let you know that I've just added the patch titled
r8169: fix RTL8168EP take too long to complete driver initialization.
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
r8169-fix-rtl8168ep-take-too-long-to-complete-driver-initialization.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Feb 6 12:42:23 PST 2018
From: Chunhao Lin <hau(a)realtek.com>
Date: Wed, 31 Jan 2018 01:32:36 +0800
Subject: r8169: fix RTL8168EP take too long to complete driver initialization.
From: Chunhao Lin <hau(a)realtek.com>
[ Upstream commit 086ca23d03c0d2f4088f472386778d293e15c5f6 ]
Driver check the wrong register bit in rtl_ocp_tx_cond() that keep driver
waiting until timeout.
Fix this by waiting for the right register bit.
Signed-off-by: Chunhao Lin <hau(a)realtek.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/net/ethernet/realtek/r8169.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1387,7 +1387,7 @@ DECLARE_RTL_COND(rtl_ocp_tx_cond)
{
void __iomem *ioaddr = tp->mmio_addr;
- return RTL_R8(IBISR0) & 0x02;
+ return RTL_R8(IBISR0) & 0x20;
}
static void rtl8168ep_stop_cmac(struct rtl8169_private *tp)
@@ -1395,7 +1395,7 @@ static void rtl8168ep_stop_cmac(struct r
void __iomem *ioaddr = tp->mmio_addr;
RTL_W8(IBCR2, RTL_R8(IBCR2) & ~0x01);
- rtl_msleep_loop_wait_low(tp, &rtl_ocp_tx_cond, 50, 2000);
+ rtl_msleep_loop_wait_high(tp, &rtl_ocp_tx_cond, 50, 2000);
RTL_W8(IBISR0, RTL_R8(IBISR0) | 0x20);
RTL_W8(IBCR0, RTL_R8(IBCR0) & ~0x01);
}
Patches currently in stable-queue which might be from hau(a)realtek.com are
queue-4.9/r8169-fix-rtl8168ep-take-too-long-to-complete-driver-initialization.patch
This is a note to let you know that I've just added the patch titled
qmi_wwan: Add support for Quectel EP06
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
qmi_wwan-add-support-for-quectel-ep06.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Feb 6 12:42:23 PST 2018
From: Kristian Evensen <kristian.evensen(a)gmail.com>
Date: Tue, 30 Jan 2018 14:12:55 +0100
Subject: qmi_wwan: Add support for Quectel EP06
From: Kristian Evensen <kristian.evensen(a)gmail.com>
[ Upstream commit c0b91a56a2e57a5a370655b25d677ae0ebf8a2d0 ]
The Quectel EP06 is a Cat. 6 LTE modem. It uses the same interface as
the EC20/EC25 for QMI, and requires the same "set DTR"-quirk to work.
Signed-off-by: Kristian Evensen <kristian.evensen(a)gmail.com>
Acked-by: Bjørn Mork <bjorn(a)mork.no>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/net/usb/qmi_wwan.c | 1 +
1 file changed, 1 insertion(+)
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -944,6 +944,7 @@ static const struct usb_device_id produc
{QMI_QUIRK_SET_DTR(0x2c7c, 0x0125, 4)}, /* Quectel EC25, EC20 R2.0 Mini PCIe */
{QMI_QUIRK_SET_DTR(0x2c7c, 0x0121, 4)}, /* Quectel EC21 Mini PCIe */
{QMI_FIXED_INTF(0x2c7c, 0x0296, 4)}, /* Quectel BG96 */
+ {QMI_QUIRK_SET_DTR(0x2c7c, 0x0306, 4)}, /* Quectel EP06 Mini PCIe */
/* 4. Gobi 1000 devices */
{QMI_GOBI1K_DEVICE(0x05c6, 0x9212)}, /* Acer Gobi Modem Device */
Patches currently in stable-queue which might be from kristian.evensen(a)gmail.com are
queue-4.9/qmi_wwan-add-support-for-quectel-ep06.patch
This is a note to let you know that I've just added the patch titled
soreuseport: fix mem leak in reuseport_add_sock()
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
soreuseport-fix-mem-leak-in-reuseport_add_sock.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Feb 6 12:42:23 PST 2018
From: Eric Dumazet <edumazet(a)google.com>
Date: Fri, 2 Feb 2018 10:27:27 -0800
Subject: soreuseport: fix mem leak in reuseport_add_sock()
From: Eric Dumazet <edumazet(a)google.com>
[ Upstream commit 4db428a7c9ab07e08783e0fcdc4ca0f555da0567 ]
reuseport_add_sock() needs to deal with attaching a socket having
its own sk_reuseport_cb, after a prior
setsockopt(SO_ATTACH_REUSEPORT_?BPF)
Without this fix, not only a WARN_ONCE() was issued, but we were also
leaking memory.
Thanks to sysbot and Eric Biggers for providing us nice C repros.
------------[ cut here ]------------
socket already in reuseport group
WARNING: CPU: 0 PID: 3496 at net/core/sock_reuseport.c:119
reuseport_add_sock+0x742/0x9b0 net/core/sock_reuseport.c:117
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 3496 Comm: syzkaller869503 Not tainted 4.15.0-rc6+ #245
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
panic+0x1e4/0x41c kernel/panic.c:183
__warn+0x1dc/0x200 kernel/panic.c:547
report_bug+0x211/0x2d0 lib/bug.c:184
fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
fixup_bug arch/x86/kernel/traps.c:247 [inline]
do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079
Fixes: ef456144da8e ("soreuseport: define reuseport groups")
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Reported-by: syzbot+c0ea2226f77a42936bf7(a)syzkaller.appspotmail.com
Acked-by: Craig Gallek <kraig(a)google.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/core/sock_reuseport.c | 35 ++++++++++++++++++++---------------
1 file changed, 20 insertions(+), 15 deletions(-)
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -93,6 +93,16 @@ static struct sock_reuseport *reuseport_
return more_reuse;
}
+static void reuseport_free_rcu(struct rcu_head *head)
+{
+ struct sock_reuseport *reuse;
+
+ reuse = container_of(head, struct sock_reuseport, rcu);
+ if (reuse->prog)
+ bpf_prog_destroy(reuse->prog);
+ kfree(reuse);
+}
+
/**
* reuseport_add_sock - Add a socket to the reuseport group of another.
* @sk: New socket to add to the group.
@@ -101,7 +111,7 @@ static struct sock_reuseport *reuseport_
*/
int reuseport_add_sock(struct sock *sk, struct sock *sk2)
{
- struct sock_reuseport *reuse;
+ struct sock_reuseport *old_reuse, *reuse;
if (!rcu_access_pointer(sk2->sk_reuseport_cb)) {
int err = reuseport_alloc(sk2);
@@ -112,10 +122,13 @@ int reuseport_add_sock(struct sock *sk,
spin_lock_bh(&reuseport_lock);
reuse = rcu_dereference_protected(sk2->sk_reuseport_cb,
- lockdep_is_held(&reuseport_lock)),
- WARN_ONCE(rcu_dereference_protected(sk->sk_reuseport_cb,
- lockdep_is_held(&reuseport_lock)),
- "socket already in reuseport group");
+ lockdep_is_held(&reuseport_lock));
+ old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb,
+ lockdep_is_held(&reuseport_lock));
+ if (old_reuse && old_reuse->num_socks != 1) {
+ spin_unlock_bh(&reuseport_lock);
+ return -EBUSY;
+ }
if (reuse->num_socks == reuse->max_socks) {
reuse = reuseport_grow(reuse);
@@ -133,19 +146,11 @@ int reuseport_add_sock(struct sock *sk,
spin_unlock_bh(&reuseport_lock);
+ if (old_reuse)
+ call_rcu(&old_reuse->rcu, reuseport_free_rcu);
return 0;
}
-static void reuseport_free_rcu(struct rcu_head *head)
-{
- struct sock_reuseport *reuse;
-
- reuse = container_of(head, struct sock_reuseport, rcu);
- if (reuse->prog)
- bpf_prog_destroy(reuse->prog);
- kfree(reuse);
-}
-
void reuseport_detach_sock(struct sock *sk)
{
struct sock_reuseport *reuse;
Patches currently in stable-queue which might be from edumazet(a)google.com are
queue-4.9/tcp-release-sk_frag.page-in-tcp_disconnect.patch
queue-4.9/net-igmp-add-a-missing-rcu-locking-section.patch
queue-4.9/soreuseport-fix-mem-leak-in-reuseport_add_sock.patch
This is a note to let you know that I've just added the patch titled
net: igmp: add a missing rcu locking section
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
net-igmp-add-a-missing-rcu-locking-section.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Feb 6 12:42:23 PST 2018
From: Eric Dumazet <edumazet(a)google.com>
Date: Thu, 1 Feb 2018 10:26:57 -0800
Subject: net: igmp: add a missing rcu locking section
From: Eric Dumazet <edumazet(a)google.com>
[ Upstream commit e7aadb27a5415e8125834b84a74477bfbee4eff5 ]
Newly added igmpv3_get_srcaddr() needs to be called under rcu lock.
Timer callbacks do not ensure this locking.
=============================
WARNING: suspicious RCU usage
4.15.0+ #200 Not tainted
-----------------------------
./include/linux/inetdevice.h:216 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
3 locks held by syzkaller616973/4074:
#0: (&mm->mmap_sem){++++}, at: [<00000000bfce669e>] __do_page_fault+0x32d/0xc90 arch/x86/mm/fault.c:1355
#1: ((&im->timer)){+.-.}, at: [<00000000619d2f71>] lockdep_copy_map include/linux/lockdep.h:178 [inline]
#1: ((&im->timer)){+.-.}, at: [<00000000619d2f71>] call_timer_fn+0x1c6/0x820 kernel/time/timer.c:1316
#2: (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] spin_lock_bh include/linux/spinlock.h:315 [inline]
#2: (&(&im->lock)->rlock){+.-.}, at: [<000000005f833c5c>] igmpv3_send_report+0x98/0x5b0 net/ipv4/igmp.c:600
stack backtrace:
CPU: 0 PID: 4074 Comm: syzkaller616973 Not tainted 4.15.0+ #200
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
lockdep_rcu_suspicious+0x123/0x170 kernel/locking/lockdep.c:4592
__in_dev_get_rcu include/linux/inetdevice.h:216 [inline]
igmpv3_get_srcaddr net/ipv4/igmp.c:329 [inline]
igmpv3_newpack+0xeef/0x12e0 net/ipv4/igmp.c:389
add_grhead.isra.27+0x235/0x300 net/ipv4/igmp.c:432
add_grec+0xbd3/0x1170 net/ipv4/igmp.c:565
igmpv3_send_report+0xd5/0x5b0 net/ipv4/igmp.c:605
igmp_send_report+0xc43/0x1050 net/ipv4/igmp.c:722
igmp_timer_expire+0x322/0x5c0 net/ipv4/igmp.c:831
call_timer_fn+0x228/0x820 kernel/time/timer.c:1326
expire_timers kernel/time/timer.c:1363 [inline]
__run_timers+0x7ee/0xb70 kernel/time/timer.c:1666
run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
__do_softirq+0x2d7/0xb85 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1cc/0x200 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:541 [inline]
smp_apic_timer_interrupt+0x16b/0x700 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xa9/0xb0 arch/x86/entry/entry_64.S:938
Fixes: a46182b00290 ("net: igmp: Use correct source address on IGMPv3 reports")
Signed-off-by: Eric Dumazet <edumazet(a)google.com>
Reported-by: syzbot <syzkaller(a)googlegroups.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv4/igmp.c | 4 ++++
1 file changed, 4 insertions(+)
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -386,7 +386,11 @@ static struct sk_buff *igmpv3_newpack(st
pip->frag_off = htons(IP_DF);
pip->ttl = 1;
pip->daddr = fl4.daddr;
+
+ rcu_read_lock();
pip->saddr = igmpv3_get_srcaddr(dev, &fl4);
+ rcu_read_unlock();
+
pip->protocol = IPPROTO_IGMP;
pip->tot_len = 0; /* filled in later */
ip_select_ident(net, skb, NULL);
Patches currently in stable-queue which might be from edumazet(a)google.com are
queue-4.9/tcp-release-sk_frag.page-in-tcp_disconnect.patch
queue-4.9/net-igmp-add-a-missing-rcu-locking-section.patch
queue-4.9/soreuseport-fix-mem-leak-in-reuseport_add_sock.patch
This is a note to let you know that I've just added the patch titled
ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ipv6-fix-so_reuseport-udp-socket-with-implicit-sk_ipv6only.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Feb 6 12:42:23 PST 2018
From: Martin KaFai Lau <kafai(a)fb.com>
Date: Wed, 24 Jan 2018 23:15:27 -0800
Subject: ipv6: Fix SO_REUSEPORT UDP socket with implicit sk_ipv6only
From: Martin KaFai Lau <kafai(a)fb.com>
[ Upstream commit 7ece54a60ee2ba7a386308cae73c790bd580589c ]
If a sk_v6_rcv_saddr is !IPV6_ADDR_ANY and !IPV6_ADDR_MAPPED, it
implicitly implies it is an ipv6only socket. However, in inet6_bind(),
this addr_type checking and setting sk->sk_ipv6only to 1 are only done
after sk->sk_prot->get_port(sk, snum) has been completed successfully.
This inconsistency between sk_v6_rcv_saddr and sk_ipv6only confuses
the 'get_port()'.
In particular, when binding SO_REUSEPORT UDP sockets,
udp_reuseport_add_sock(sk,...) is called. udp_reuseport_add_sock()
checks "ipv6_only_sock(sk2) == ipv6_only_sock(sk)" before adding sk to
sk2->sk_reuseport_cb. In this case, ipv6_only_sock(sk2) could be
1 while ipv6_only_sock(sk) is still 0 here. The end result is,
reuseport_alloc(sk) is called instead of adding sk to the existing
sk2->sk_reuseport_cb.
It can be reproduced by binding two SO_REUSEPORT UDP sockets on an
IPv6 address (!ANY and !MAPPED). Only one of the socket will
receive packet.
The fix is to set the implicit sk_ipv6only before calling get_port().
The original sk_ipv6only has to be saved such that it can be restored
in case get_port() failed. The situation is similar to the
inet_reset_saddr(sk) after get_port() has failed.
Thanks to Calvin Owens <calvinowens(a)fb.com> who created an easy
reproduction which leads to a fix.
Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
Signed-off-by: Martin KaFai Lau <kafai(a)fb.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv6/af_inet6.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -274,6 +274,7 @@ int inet6_bind(struct socket *sock, stru
struct net *net = sock_net(sk);
__be32 v4addr = 0;
unsigned short snum;
+ bool saved_ipv6only;
int addr_type = 0;
int err = 0;
@@ -378,19 +379,21 @@ int inet6_bind(struct socket *sock, stru
if (!(addr_type & IPV6_ADDR_MULTICAST))
np->saddr = addr->sin6_addr;
+ saved_ipv6only = sk->sk_ipv6only;
+ if (addr_type != IPV6_ADDR_ANY && addr_type != IPV6_ADDR_MAPPED)
+ sk->sk_ipv6only = 1;
+
/* Make sure we are allowed to bind here. */
if ((snum || !inet->bind_address_no_port) &&
sk->sk_prot->get_port(sk, snum)) {
+ sk->sk_ipv6only = saved_ipv6only;
inet_reset_saddr(sk);
err = -EADDRINUSE;
goto out;
}
- if (addr_type != IPV6_ADDR_ANY) {
+ if (addr_type != IPV6_ADDR_ANY)
sk->sk_userlocks |= SOCK_BINDADDR_LOCK;
- if (addr_type != IPV6_ADDR_MAPPED)
- sk->sk_ipv6only = 1;
- }
if (snum)
sk->sk_userlocks |= SOCK_BINDPORT_LOCK;
inet->inet_sport = htons(inet->inet_num);
Patches currently in stable-queue which might be from kafai(a)fb.com are
queue-4.9/ipv6-fix-so_reuseport-udp-socket-with-implicit-sk_ipv6only.patch
This is a note to let you know that I've just added the patch titled
ip6mr: fix stale iterator
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
ip6mr-fix-stale-iterator.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Feb 6 12:42:23 PST 2018
From: Nikolay Aleksandrov <nikolay(a)cumulusnetworks.com>
Date: Wed, 31 Jan 2018 16:29:30 +0200
Subject: ip6mr: fix stale iterator
From: Nikolay Aleksandrov <nikolay(a)cumulusnetworks.com>
[ Upstream commit 4adfa79fc254efb7b0eb3cd58f62c2c3f805f1ba ]
When we dump the ip6mr mfc entries via proc, we initialize an iterator
with the table to dump but we don't clear the cache pointer which might
be initialized from a prior read on the same descriptor that ended. This
can result in lock imbalance (an unnecessary unlock) leading to other
crashes and hangs. Clear the cache pointer like ipmr does to fix the issue.
Thanks for the reliable reproducer.
Here's syzbot's trace:
WARNING: bad unlock balance detected!
4.15.0-rc3+ #128 Not tainted
syzkaller971460/3195 is trying to release lock (mrt_lock) at:
[<000000006898068d>] ipmr_mfc_seq_stop+0xe1/0x130 net/ipv6/ip6mr.c:553
but there are no more locks to release!
other info that might help us debug this:
1 lock held by syzkaller971460/3195:
#0: (&p->lock){+.+.}, at: [<00000000744a6565>] seq_read+0xd5/0x13d0
fs/seq_file.c:165
stack backtrace:
CPU: 1 PID: 3195 Comm: syzkaller971460 Not tainted 4.15.0-rc3+ #128
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
print_unlock_imbalance_bug+0x12f/0x140 kernel/locking/lockdep.c:3561
__lock_release kernel/locking/lockdep.c:3775 [inline]
lock_release+0x5f9/0xda0 kernel/locking/lockdep.c:4023
__raw_read_unlock include/linux/rwlock_api_smp.h:225 [inline]
_raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
ipmr_mfc_seq_stop+0xe1/0x130 net/ipv6/ip6mr.c:553
traverse+0x3bc/0xa00 fs/seq_file.c:135
seq_read+0x96a/0x13d0 fs/seq_file.c:189
proc_reg_read+0xef/0x170 fs/proc/inode.c:217
do_loop_readv_writev fs/read_write.c:673 [inline]
do_iter_read+0x3db/0x5b0 fs/read_write.c:897
compat_readv+0x1bf/0x270 fs/read_write.c:1140
do_compat_preadv64+0xdc/0x100 fs/read_write.c:1189
C_SYSC_preadv fs/read_write.c:1209 [inline]
compat_SyS_preadv+0x3b/0x50 fs/read_write.c:1203
do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
entry_SYSENTER_compat+0x51/0x60 arch/x86/entry/entry_64_compat.S:125
RIP: 0023:0xf7f73c79
RSP: 002b:00000000e574a15c EFLAGS: 00000292 ORIG_RAX: 000000000000014d
RAX: ffffffffffffffda RBX: 000000000000000f RCX: 0000000020a3afb0
RDX: 0000000000000001 RSI: 0000000000000067 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
BUG: sleeping function called from invalid context at lib/usercopy.c:25
in_atomic(): 1, irqs_disabled(): 0, pid: 3195, name: syzkaller971460
INFO: lockdep is turned off.
CPU: 1 PID: 3195 Comm: syzkaller971460 Not tainted 4.15.0-rc3+ #128
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:53
___might_sleep+0x2b2/0x470 kernel/sched/core.c:6060
__might_sleep+0x95/0x190 kernel/sched/core.c:6013
__might_fault+0xab/0x1d0 mm/memory.c:4525
_copy_to_user+0x2c/0xc0 lib/usercopy.c:25
copy_to_user include/linux/uaccess.h:155 [inline]
seq_read+0xcb4/0x13d0 fs/seq_file.c:279
proc_reg_read+0xef/0x170 fs/proc/inode.c:217
do_loop_readv_writev fs/read_write.c:673 [inline]
do_iter_read+0x3db/0x5b0 fs/read_write.c:897
compat_readv+0x1bf/0x270 fs/read_write.c:1140
do_compat_preadv64+0xdc/0x100 fs/read_write.c:1189
C_SYSC_preadv fs/read_write.c:1209 [inline]
compat_SyS_preadv+0x3b/0x50 fs/read_write.c:1203
do_syscall_32_irqs_on arch/x86/entry/common.c:327 [inline]
do_fast_syscall_32+0x3ee/0xf9d arch/x86/entry/common.c:389
entry_SYSENTER_compat+0x51/0x60 arch/x86/entry/entry_64_compat.S:125
RIP: 0023:0xf7f73c79
RSP: 002b:00000000e574a15c EFLAGS: 00000292 ORIG_RAX: 000000000000014d
RAX: ffffffffffffffda RBX: 000000000000000f RCX: 0000000020a3afb0
RDX: 0000000000000001 RSI: 0000000000000067 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
WARNING: CPU: 1 PID: 3195 at lib/usercopy.c:26 _copy_to_user+0xb5/0xc0
lib/usercopy.c:26
Reported-by: syzbot <bot+eceb3204562c41a438fa1f2335e0fe4f6886d669(a)syzkaller.appspotmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay(a)cumulusnetworks.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/ipv6/ip6mr.c | 1 +
1 file changed, 1 insertion(+)
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -495,6 +495,7 @@ static void *ipmr_mfc_seq_start(struct s
return ERR_PTR(-ENOENT);
it->mrt = mrt;
+ it->cache = NULL;
return *pos ? ipmr_mfc_seq_idx(net, seq->private, *pos - 1)
: SEQ_START_TOKEN;
}
Patches currently in stable-queue which might be from nikolay(a)cumulusnetworks.com are
queue-4.9/ip6mr-fix-stale-iterator.patch
This is a note to let you know that I've just added the patch titled
cls_u32: add missing RCU annotation.
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
cls_u32-add-missing-rcu-annotation.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From foo@baz Tue Feb 6 12:42:23 PST 2018
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Fri, 2 Feb 2018 16:02:22 +0100
Subject: cls_u32: add missing RCU annotation.
From: Paolo Abeni <pabeni(a)redhat.com>
[ Upstream commit 058a6c033488494a6b1477b05fe8e1a16e344462 ]
In a couple of points of the control path, n->ht_down is currently
accessed without the required RCU annotation. The accesses are
safe, but sparse complaints. Since we already held the
rtnl lock, let use rtnl_dereference().
Fixes: a1b7c5fd7fe9 ("net: sched: add cls_u32 offload hooks for netdevs")
Fixes: de5df63228fc ("net: sched: cls_u32 changes to knode must appear atomic to readers")
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Acked-by: Cong Wang <xiyou.wangcong(a)gmail.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
net/sched/cls_u32.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -496,6 +496,7 @@ static void u32_clear_hw_hnode(struct tc
static int u32_replace_hw_knode(struct tcf_proto *tp, struct tc_u_knode *n,
u32 flags)
{
+ struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
struct net_device *dev = tp->q->dev_queue->dev;
struct tc_cls_u32_offload u32_offload = {0};
struct tc_to_netdev offload;
@@ -520,7 +521,7 @@ static int u32_replace_hw_knode(struct t
offload.cls_u32->knode.sel = &n->sel;
offload.cls_u32->knode.exts = &n->exts;
if (n->ht_down)
- offload.cls_u32->knode.link_handle = n->ht_down->handle;
+ offload.cls_u32->knode.link_handle = ht->handle;
err = dev->netdev_ops->ndo_setup_tc(dev, tp->q->handle,
tp->protocol, &offload);
@@ -788,8 +789,9 @@ static void u32_replace_knode(struct tcf
static struct tc_u_knode *u32_init_knode(struct tcf_proto *tp,
struct tc_u_knode *n)
{
- struct tc_u_knode *new;
+ struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
struct tc_u32_sel *s = &n->sel;
+ struct tc_u_knode *new;
new = kzalloc(sizeof(*n) + s->nkeys*sizeof(struct tc_u32_key),
GFP_KERNEL);
@@ -807,11 +809,11 @@ static struct tc_u_knode *u32_init_knode
new->fshift = n->fshift;
new->res = n->res;
new->flags = n->flags;
- RCU_INIT_POINTER(new->ht_down, n->ht_down);
+ RCU_INIT_POINTER(new->ht_down, ht);
/* bump reference count as long as we hold pointer to structure */
- if (new->ht_down)
- new->ht_down->refcnt++;
+ if (ht)
+ ht->refcnt++;
#ifdef CONFIG_CLS_U32_PERF
/* Statistics may be incremented by readers during update
Patches currently in stable-queue which might be from pabeni(a)redhat.com are
queue-4.9/cls_u32-add-missing-rcu-annotation.patch
On Sun, Feb 04, 2018 at 06:17:36PM +0100, Pavel Machek wrote:
>
>> > > >> *** if brightness=0, led off
>> > > >> *** else apply brightness if next timer <--- timer is stop, and will never apply new setting
>> > > >> ** otherwise set led_set_brightness_nosleep
>> > > >>
>> > > >> To fix that, when we delete the timer, we should clear LED_BLINK_SW.
>> > > >
>> > > >Can you run the tests on the affected stable kernels? I have feeling
>> > > >that the problem described might not be present there.
>> > >
>> > > Hm, I don't seem to have HW to test that out. Maybe someone else does?
>> >
>> > Why are you submitting patches you have no way to test?
>>
>> What? This is stable tree backporting, why are you trying to make a
>> requirement for something that we have never had before?
>
>I don't think random patches should be sent to stable just because
>they appeared in mainline. Plus, I don't think I'm making new rules:
>
>submit-checklist.rst:
>
>13) Has been build- and runtime tested with and without ``CONFIG_SMP``
>and
> ``CONFIG_PREEMPT.``
>
>stable-kernel-rules.rst:
>
>Rules on what kind of patches are accepted, and which ones are not,
>into the "-stable" tree:
>
> - It must be obviously correct and tested.
> - It must fix a real bug that bothers people (not a, "This could be a
> problem..." type thing).
So you're saying that this doesn't qualify as a bug?
>> This is a backport of a patch that is already upstream. If it doesn't
>> belong in a stable tree, great, let us know that, saying why it is so.
>
>See jacek.anaszewski(a)gmail.com 's explanation.
I might be missing something, but Jacek suggested I pull another patch
before this one?
--
Thanks,
Sasha
This is a note to let you know that I've just added the patch titled
KVM/x86: Add IBPB support
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
kvm-x86-add-ibpb-support.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 15d45071523d89b3fb7372e2135fbd72f6af9506 Mon Sep 17 00:00:00 2001
From: Ashok Raj <ashok.raj(a)intel.com>
Date: Thu, 1 Feb 2018 22:59:43 +0100
Subject: KVM/x86: Add IBPB support
From: Ashok Raj <ashok.raj(a)intel.com>
commit 15d45071523d89b3fb7372e2135fbd72f6af9506 upstream.
The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.
Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated by the same CPUID
enumeration, IBPB is very different.
IBPB helps mitigate against three potential attacks:
* Mitigate guests from being attacked by other guests.
- This is addressed by issing IBPB when we do a guest switch.
* Mitigate attacks from guest/ring3->host/ring3.
These would require a IBPB during context switch in host, or after
VMEXIT. The host process has two ways to mitigate
- Either it can be compiled with retpoline
- If its going through context switch, and has set !dumpable then
there is a IBPB in that path.
(Tim's patch: https://patchwork.kernel.org/patch/10192871)
- The case where after a VMEXIT you return back to Qemu might make
Qemu attackable from guest when Qemu isn't compiled with retpoline.
There are issues reported when doing IBPB on every VMEXIT that resulted
in some tsc calibration woes in guest.
* Mitigate guest/ring0->host/ring0 attacks.
When host kernel is using retpoline it is safe against these attacks.
If host kernel isn't using retpoline we might need to do a IBPB flush on
every VMEXIT.
Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.
* IBPB is issued only for SVM during svm_free_vcpu().
VMX has a vmclear and SVM doesn't. Follow discussion here:
https://lkml.org/lkml/2018/1/15/146
Please refer to the following spec for more details on the enumeration
and control.
Refer here to get documentation about mitigations.
https://software.intel.com/en-us/side-channel-security-support
[peterz: rebase and changelog rewrite]
[karahmed: - rebase
- vmx: expose PRED_CMD if guest has it in CPUID
- svm: only pass through IBPB if guest has it in CPUID
- vmx: support !cpu_has_vmx_msr_bitmap()]
- vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
PRED_CMD is a write-only MSR]
Signed-off-by: Ashok Raj <ashok.raj(a)intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: KarimAllah Ahmed <karahmed(a)amazon.de>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Andi Kleen <ak(a)linux.intel.com>
Cc: kvm(a)vger.kernel.org
Cc: Asit Mallick <asit.k.mallick(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven(a)intel.com>
Cc: Greg KH <gregkh(a)linuxfoundation.org>
Cc: Jun Nakajima <jun.nakajima(a)intel.com>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.…
Link: https://lkml.kernel.org/r/1517522386-18410-3-git-send-email-karahmed@amazon…
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kvm/cpuid.c | 11 ++++++-
arch/x86/kvm/cpuid.h | 12 +++++++
arch/x86/kvm/svm.c | 28 ++++++++++++++++++
arch/x86/kvm/vmx.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 127 insertions(+), 3 deletions(-)
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -355,6 +355,10 @@ static inline int __do_cpuid_ent(struct
F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
+ /* cpuid 0x80000008.ebx */
+ const u32 kvm_cpuid_8000_0008_ebx_x86_features =
+ F(IBPB);
+
/* cpuid 0xC0000001.edx */
const u32 kvm_cpuid_C000_0001_edx_x86_features =
F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
@@ -607,7 +611,12 @@ static inline int __do_cpuid_ent(struct
if (!g_phys_as)
g_phys_as = phys_as;
entry->eax = g_phys_as | (virt_as << 8);
- entry->ebx = entry->edx = 0;
+ entry->edx = 0;
+ /* IBPB isn't necessarily present in hardware cpuid */
+ if (boot_cpu_has(X86_FEATURE_IBPB))
+ entry->ebx |= F(IBPB);
+ entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
+ cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
break;
}
case 0x80000019:
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -160,6 +160,18 @@ static inline bool guest_cpuid_has_rdtsc
return best && (best->edx & bit(X86_FEATURE_RDTSCP));
}
+static inline bool guest_cpuid_has_ibpb(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
+ if (best && (best->ebx & bit(X86_FEATURE_IBPB)))
+ return true;
+ best = kvm_find_cpuid_entry(vcpu, 7, 0);
+ return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL));
+}
+
+
/*
* NRIPS is provided through cpuidfn 0x8000000a.edx bit 3
*/
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -248,6 +248,7 @@ static const struct svm_direct_access_ms
{ .index = MSR_CSTAR, .always = true },
{ .index = MSR_SYSCALL_MASK, .always = true },
#endif
+ { .index = MSR_IA32_PRED_CMD, .always = false },
{ .index = MSR_IA32_LASTBRANCHFROMIP, .always = false },
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
{ .index = MSR_IA32_LASTINTFROMIP, .always = false },
@@ -510,6 +511,7 @@ struct svm_cpu_data {
struct kvm_ldttss_desc *tss_desc;
struct page *save_area;
+ struct vmcb *current_vmcb;
};
static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
@@ -1644,11 +1646,17 @@ static void svm_free_vcpu(struct kvm_vcp
__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, svm);
+ /*
+ * The vmcb page can be recycled, causing a false negative in
+ * svm_vcpu_load(). So do a full IBPB now.
+ */
+ indirect_branch_prediction_barrier();
}
static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
struct vcpu_svm *svm = to_svm(vcpu);
+ struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
int i;
if (unlikely(cpu != vcpu->cpu)) {
@@ -1677,6 +1685,10 @@ static void svm_vcpu_load(struct kvm_vcp
if (static_cpu_has(X86_FEATURE_RDTSCP))
wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
+ if (sd->current_vmcb != svm->vmcb) {
+ sd->current_vmcb = svm->vmcb;
+ indirect_branch_prediction_barrier();
+ }
avic_vcpu_load(vcpu, cpu);
}
@@ -3599,6 +3611,22 @@ static int svm_set_msr(struct kvm_vcpu *
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr);
break;
+ case MSR_IA32_PRED_CMD:
+ if (!msr->host_initiated &&
+ !guest_cpuid_has_ibpb(vcpu))
+ return 1;
+
+ if (data & ~PRED_CMD_IBPB)
+ return 1;
+
+ if (!data)
+ break;
+
+ wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+ if (is_guest_mode(vcpu))
+ break;
+ set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1);
+ break;
case MSR_STAR:
svm->vmcb->save.star = data;
break;
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -549,6 +549,7 @@ struct vcpu_vmx {
u64 msr_host_kernel_gs_base;
u64 msr_guest_kernel_gs_base;
#endif
+
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
/*
@@ -913,6 +914,8 @@ static void copy_vmcs12_to_shadow(struct
static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx);
static int alloc_identity_pagetable(struct kvm *kvm);
static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
+static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type);
static DEFINE_PER_CPU(struct vmcs *, vmxarea);
static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -1848,6 +1851,29 @@ static void update_exception_bitmap(stru
vmcs_write32(EXCEPTION_BITMAP, eb);
}
+/*
+ * Check if MSR is intercepted for L01 MSR bitmap.
+ */
+static bool msr_write_intercepted_l01(struct kvm_vcpu *vcpu, u32 msr)
+{
+ unsigned long *msr_bitmap;
+ int f = sizeof(unsigned long);
+
+ if (!cpu_has_vmx_msr_bitmap())
+ return true;
+
+ msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
+
+ if (msr <= 0x1fff) {
+ return !!test_bit(msr, msr_bitmap + 0x800 / f);
+ } else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
+ msr &= 0x1fff;
+ return !!test_bit(msr, msr_bitmap + 0xc00 / f);
+ }
+
+ return true;
+}
+
static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
unsigned long entry, unsigned long exit)
{
@@ -2257,6 +2283,7 @@ static void vmx_vcpu_load(struct kvm_vcp
if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
vmcs_load(vmx->loaded_vmcs->vmcs);
+ indirect_branch_prediction_barrier();
}
if (!already_loaded) {
@@ -3058,6 +3085,33 @@ static int vmx_set_msr(struct kvm_vcpu *
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
+ case MSR_IA32_PRED_CMD:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has_ibpb(vcpu))
+ return 1;
+
+ if (data & ~PRED_CMD_IBPB)
+ return 1;
+
+ if (!data)
+ break;
+
+ wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+
+ /*
+ * For non-nested:
+ * When it's written (to non-zero) for the first time, pass
+ * it through.
+ *
+ * For nested:
+ * The handling of the MSR bitmap for L2 guests is done in
+ * nested_vmx_merge_msr_bitmap. We should not touch the
+ * vmcs02.msr_bitmap here since it gets completely overwritten
+ * in the merging.
+ */
+ vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
+ MSR_TYPE_W);
+ break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -9437,9 +9491,23 @@ static inline bool nested_vmx_merge_msr_
struct page *page;
unsigned long *msr_bitmap_l1;
unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
+ /*
+ * pred_cmd is trying to verify two things:
+ *
+ * 1. L0 gave a permission to L1 to actually passthrough the MSR. This
+ * ensures that we do not accidentally generate an L02 MSR bitmap
+ * from the L12 MSR bitmap that is too permissive.
+ * 2. That L1 or L2s have actually used the MSR. This avoids
+ * unnecessarily merging of the bitmap if the MSR is unused. This
+ * works properly because we only update the L01 MSR bitmap lazily.
+ * So even if L0 should pass L1 these MSRs, the L01 bitmap is only
+ * updated to reflect this when L1 (or its L2s) actually write to
+ * the MSR.
+ */
+ bool pred_cmd = msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
- /* This shortcut is ok because we support only x2APIC MSRs so far. */
- if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
+ if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
+ !pred_cmd)
return false;
page = nested_get_page(vcpu, vmcs12->msr_bitmap);
@@ -9477,6 +9545,13 @@ static inline bool nested_vmx_merge_msr_
MSR_TYPE_W);
}
}
+
+ if (pred_cmd)
+ nested_vmx_disable_intercept_for_msr(
+ msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_PRED_CMD,
+ MSR_TYPE_W);
+
kunmap(page);
nested_release_page_clean(page);
Patches currently in stable-queue which might be from ashok.raj(a)intel.com are
queue-4.9/kvm-vmx-allow-direct-access-to-msr_ia32_spec_ctrl.patch
queue-4.9/kvm-x86-add-ibpb-support.patch
queue-4.9/kvm-svm-allow-direct-access-to-msr_ia32_spec_ctrl.patch
queue-4.9/kvm-vmx-emulate-msr_ia32_arch_capabilities.patch
This is a note to let you know that I've just added the patch titled
KVM: VMX: introduce alloc_loaded_vmcs
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
kvm-vmx-introduce-alloc_loaded_vmcs.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From f21f165ef922c2146cc5bdc620f542953c41714b Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Thu, 11 Jan 2018 12:16:15 +0100
Subject: KVM: VMX: introduce alloc_loaded_vmcs
From: Paolo Bonzini <pbonzini(a)redhat.com>
commit f21f165ef922c2146cc5bdc620f542953c41714b upstream.
Group together the calls to alloc_vmcs and loaded_vmcs_init. Soon we'll also
allocate an MSR bitmap there.
Cc: stable(a)vger.kernel.org # prereq for Spectre mitigation
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kvm/vmx.c | 38 +++++++++++++++++++++++---------------
1 file changed, 23 insertions(+), 15 deletions(-)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3524,11 +3524,6 @@ static struct vmcs *alloc_vmcs_cpu(int c
return vmcs;
}
-static struct vmcs *alloc_vmcs(void)
-{
- return alloc_vmcs_cpu(raw_smp_processor_id());
-}
-
static void free_vmcs(struct vmcs *vmcs)
{
free_pages((unsigned long)vmcs, vmcs_config.order);
@@ -3547,6 +3542,22 @@ static void free_loaded_vmcs(struct load
WARN_ON(loaded_vmcs->shadow_vmcs != NULL);
}
+static struct vmcs *alloc_vmcs(void)
+{
+ return alloc_vmcs_cpu(raw_smp_processor_id());
+}
+
+static int alloc_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
+{
+ loaded_vmcs->vmcs = alloc_vmcs();
+ if (!loaded_vmcs->vmcs)
+ return -ENOMEM;
+
+ loaded_vmcs->shadow_vmcs = NULL;
+ loaded_vmcs_init(loaded_vmcs);
+ return 0;
+}
+
static void free_kvm_area(void)
{
int cpu;
@@ -6949,6 +6960,7 @@ static int handle_vmon(struct kvm_vcpu *
struct vmcs *shadow_vmcs;
const u64 VMXON_NEEDED_FEATURES = FEATURE_CONTROL_LOCKED
| FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX;
+ int r;
/* The Intel VMX Instruction Reference lists a bunch of bits that
* are prerequisite to running VMXON, most notably cr4.VMXE must be
@@ -6988,11 +7000,9 @@ static int handle_vmon(struct kvm_vcpu *
return 1;
}
- vmx->nested.vmcs02.vmcs = alloc_vmcs();
- vmx->nested.vmcs02.shadow_vmcs = NULL;
- if (!vmx->nested.vmcs02.vmcs)
+ r = alloc_loaded_vmcs(&vmx->nested.vmcs02);
+ if (r < 0)
goto out_vmcs02;
- loaded_vmcs_init(&vmx->nested.vmcs02);
if (cpu_has_vmx_msr_bitmap()) {
vmx->nested.msr_bitmap =
@@ -9113,17 +9123,15 @@ static struct kvm_vcpu *vmx_create_vcpu(
if (!vmx->guest_msrs)
goto free_pml;
- vmx->loaded_vmcs = &vmx->vmcs01;
- vmx->loaded_vmcs->vmcs = alloc_vmcs();
- vmx->loaded_vmcs->shadow_vmcs = NULL;
- if (!vmx->loaded_vmcs->vmcs)
- goto free_msrs;
if (!vmm_exclusive)
kvm_cpu_vmxon(__pa(per_cpu(vmxarea, raw_smp_processor_id())));
- loaded_vmcs_init(vmx->loaded_vmcs);
+ err = alloc_loaded_vmcs(&vmx->vmcs01);
if (!vmm_exclusive)
kvm_cpu_vmxoff();
+ if (err < 0)
+ goto free_msrs;
+ vmx->loaded_vmcs = &vmx->vmcs01;
cpu = get_cpu();
vmx_vcpu_load(&vmx->vcpu, cpu);
vmx->vcpu.cpu = cpu;
Patches currently in stable-queue which might be from pbonzini(a)redhat.com are
queue-4.9/kvm-vmx-introduce-alloc_loaded_vmcs.patch
queue-4.9/kvm-nvmx-eliminate-vmcs02-pool.patch
queue-4.9/kvm-vmx-allow-direct-access-to-msr_ia32_spec_ctrl.patch
queue-4.9/kvm-x86-add-ibpb-support.patch
queue-4.9/kvm-svm-allow-direct-access-to-msr_ia32_spec_ctrl.patch
queue-4.9/kvm-nvmx-vmx_complete_nested_posted_interrupt-can-t-fail.patch
queue-4.9/x86-pti-make-unpoison-of-pgd-for-trusted-boot-work-for-real.patch
queue-4.9/kvm-vmx-make-msr-bitmaps-per-vcpu.patch
queue-4.9/kvm-nvmx-mark-vmcs12-pages-dirty-on-l2-exit.patch
queue-4.9/kvm-vmx-emulate-msr_ia32_arch_capabilities.patch
This is a note to let you know that I've just added the patch titled
KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
kvm-vmx-emulate-msr_ia32_arch_capabilities.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 28c1c9fabf48d6ad596273a11c46e0d0da3e14cd Mon Sep 17 00:00:00 2001
From: KarimAllah Ahmed <karahmed(a)amazon.de>
Date: Thu, 1 Feb 2018 22:59:44 +0100
Subject: KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
From: KarimAllah Ahmed <karahmed(a)amazon.de>
commit 28c1c9fabf48d6ad596273a11c46e0d0da3e14cd upstream.
Intel processors use MSR_IA32_ARCH_CAPABILITIES MSR to indicate RDCL_NO
(bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default the
contents will come directly from the hardware, but user-space can still
override it.
[dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]
Signed-off-by: KarimAllah Ahmed <karahmed(a)amazon.de>
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Paolo Bonzini <pbonzini(a)redhat.com>
Reviewed-by: Darren Kenny <darren.kenny(a)oracle.com>
Reviewed-by: Jim Mattson <jmattson(a)google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Andi Kleen <ak(a)linux.intel.com>
Cc: Jun Nakajima <jun.nakajima(a)intel.com>
Cc: kvm(a)vger.kernel.org
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Asit Mallick <asit.k.mallick(a)intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven(a)intel.com>
Cc: Greg KH <gregkh(a)linuxfoundation.org>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Ashok Raj <ashok.raj(a)intel.com>
Link: https://lkml.kernel.org/r/1517522386-18410-4-git-send-email-karahmed@amazon…
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kvm/cpuid.c | 8 +++++++-
arch/x86/kvm/cpuid.h | 8 ++++++++
arch/x86/kvm/vmx.c | 15 +++++++++++++++
arch/x86/kvm/x86.c | 1 +
4 files changed, 31 insertions(+), 1 deletion(-)
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -380,6 +380,10 @@ static inline int __do_cpuid_ent(struct
/* cpuid 7.0.ecx*/
const u32 kvm_cpuid_7_0_ecx_x86_features = F(PKU) | 0 /*OSPKE*/;
+ /* cpuid 7.0.edx*/
+ const u32 kvm_cpuid_7_0_edx_x86_features =
+ F(ARCH_CAPABILITIES);
+
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
@@ -462,12 +466,14 @@ static inline int __do_cpuid_ent(struct
/* PKU is not yet implemented for shadow paging. */
if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
entry->ecx &= ~F(PKU);
+ entry->edx &= kvm_cpuid_7_0_edx_x86_features;
+ cpuid_mask(&entry->edx, CPUID_7_EDX);
} else {
entry->ebx = 0;
entry->ecx = 0;
+ entry->edx = 0;
}
entry->eax = 0;
- entry->edx = 0;
break;
}
case 9:
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -171,6 +171,14 @@ static inline bool guest_cpuid_has_ibpb(
return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL));
}
+static inline bool guest_cpuid_has_arch_capabilities(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 7, 0);
+ return best && (best->edx & bit(X86_FEATURE_ARCH_CAPABILITIES));
+}
+
/*
* NRIPS is provided through cpuidfn 0x8000000a.edx bit 3
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -550,6 +550,8 @@ struct vcpu_vmx {
u64 msr_guest_kernel_gs_base;
#endif
+ u64 arch_capabilities;
+
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
/*
@@ -2981,6 +2983,12 @@ static int vmx_get_msr(struct kvm_vcpu *
case MSR_IA32_TSC:
msr_info->data = guest_read_tsc(vcpu);
break;
+ case MSR_IA32_ARCH_CAPABILITIES:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has_arch_capabilities(vcpu))
+ return 1;
+ msr_info->data = to_vmx(vcpu)->arch_capabilities;
+ break;
case MSR_IA32_SYSENTER_CS:
msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
break;
@@ -3112,6 +3120,11 @@ static int vmx_set_msr(struct kvm_vcpu *
vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
MSR_TYPE_W);
break;
+ case MSR_IA32_ARCH_CAPABILITIES:
+ if (!msr_info->host_initiated)
+ return 1;
+ vmx->arch_capabilities = data;
+ break;
case MSR_IA32_CR_PAT:
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -5202,6 +5215,8 @@ static int vmx_vcpu_setup(struct vcpu_vm
++vmx->nmsrs;
}
+ if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
+ rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -975,6 +975,7 @@ static u32 msrs_to_save[] = {
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
+ MSR_IA32_ARCH_CAPABILITIES
};
static unsigned num_msrs_to_save;
Patches currently in stable-queue which might be from karahmed(a)amazon.de are
queue-4.9/kvm-vmx-allow-direct-access-to-msr_ia32_spec_ctrl.patch
queue-4.9/kvm-x86-add-ibpb-support.patch
queue-4.9/kvm-svm-allow-direct-access-to-msr_ia32_spec_ctrl.patch
queue-4.9/kvm-vmx-emulate-msr_ia32_arch_capabilities.patch
This is a note to let you know that I've just added the patch titled
KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
kvm-vmx-allow-direct-access-to-msr_ia32_spec_ctrl.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From d28b387fb74da95d69d2615732f50cceb38e9a4d Mon Sep 17 00:00:00 2001
From: KarimAllah Ahmed <karahmed(a)amazon.de>
Date: Thu, 1 Feb 2018 22:59:45 +0100
Subject: KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL
From: KarimAllah Ahmed <karahmed(a)amazon.de>
commit d28b387fb74da95d69d2615732f50cceb38e9a4d upstream.
[ Based on a patch from Ashok Raj <ashok.raj(a)intel.com> ]
Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
be using a retpoline+IBPB based approach.
To avoid the overhead of saving and restoring the MSR_IA32_SPEC_CTRL for
guests that do not actually use the MSR, only start saving and restoring
when a non-zero is written to it.
No attempt is made to handle STIBP here, intentionally. Filtering STIBP
may be added in a future patch, which may require trapping all writes
if we don't want to pass it through directly to the guest.
[dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
Signed-off-by: KarimAllah Ahmed <karahmed(a)amazon.de>
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Darren Kenny <darren.kenny(a)oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Reviewed-by: Jim Mattson <jmattson(a)google.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Andi Kleen <ak(a)linux.intel.com>
Cc: Jun Nakajima <jun.nakajima(a)intel.com>
Cc: kvm(a)vger.kernel.org
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Asit Mallick <asit.k.mallick(a)intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven(a)intel.com>
Cc: Greg KH <gregkh(a)linuxfoundation.org>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Ashok Raj <ashok.raj(a)intel.com>
Link: https://lkml.kernel.org/r/1517522386-18410-5-git-send-email-karahmed@amazon…
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kvm/cpuid.c | 8 ++-
arch/x86/kvm/cpuid.h | 11 +++++
arch/x86/kvm/vmx.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++-
arch/x86/kvm/x86.c | 2
4 files changed, 118 insertions(+), 6 deletions(-)
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -357,7 +357,7 @@ static inline int __do_cpuid_ent(struct
/* cpuid 0x80000008.ebx */
const u32 kvm_cpuid_8000_0008_ebx_x86_features =
- F(IBPB);
+ F(IBPB) | F(IBRS);
/* cpuid 0xC0000001.edx */
const u32 kvm_cpuid_C000_0001_edx_x86_features =
@@ -382,7 +382,7 @@ static inline int __do_cpuid_ent(struct
/* cpuid 7.0.edx*/
const u32 kvm_cpuid_7_0_edx_x86_features =
- F(ARCH_CAPABILITIES);
+ F(SPEC_CTRL) | F(ARCH_CAPABILITIES);
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
@@ -618,9 +618,11 @@ static inline int __do_cpuid_ent(struct
g_phys_as = phys_as;
entry->eax = g_phys_as | (virt_as << 8);
entry->edx = 0;
- /* IBPB isn't necessarily present in hardware cpuid */
+ /* IBRS and IBPB aren't necessarily present in hardware cpuid */
if (boot_cpu_has(X86_FEATURE_IBPB))
entry->ebx |= F(IBPB);
+ if (boot_cpu_has(X86_FEATURE_IBRS))
+ entry->ebx |= F(IBRS);
entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
break;
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -171,6 +171,17 @@ static inline bool guest_cpuid_has_ibpb(
return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL));
}
+static inline bool guest_cpuid_has_ibrs(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 0x80000008, 0);
+ if (best && (best->ebx & bit(X86_FEATURE_IBRS)))
+ return true;
+ best = kvm_find_cpuid_entry(vcpu, 7, 0);
+ return best && (best->edx & bit(X86_FEATURE_SPEC_CTRL));
+}
+
static inline bool guest_cpuid_has_arch_capabilities(struct kvm_vcpu *vcpu)
{
struct kvm_cpuid_entry2 *best;
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -551,6 +551,7 @@ struct vcpu_vmx {
#endif
u64 arch_capabilities;
+ u64 spec_ctrl;
u32 vm_entry_controls_shadow;
u32 vm_exit_controls_shadow;
@@ -1854,6 +1855,29 @@ static void update_exception_bitmap(stru
}
/*
+ * Check if MSR is intercepted for currently loaded MSR bitmap.
+ */
+static bool msr_write_intercepted(struct kvm_vcpu *vcpu, u32 msr)
+{
+ unsigned long *msr_bitmap;
+ int f = sizeof(unsigned long);
+
+ if (!cpu_has_vmx_msr_bitmap())
+ return true;
+
+ msr_bitmap = to_vmx(vcpu)->loaded_vmcs->msr_bitmap;
+
+ if (msr <= 0x1fff) {
+ return !!test_bit(msr, msr_bitmap + 0x800 / f);
+ } else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
+ msr &= 0x1fff;
+ return !!test_bit(msr, msr_bitmap + 0xc00 / f);
+ }
+
+ return true;
+}
+
+/*
* Check if MSR is intercepted for L01 MSR bitmap.
*/
static bool msr_write_intercepted_l01(struct kvm_vcpu *vcpu, u32 msr)
@@ -2983,6 +3007,13 @@ static int vmx_get_msr(struct kvm_vcpu *
case MSR_IA32_TSC:
msr_info->data = guest_read_tsc(vcpu);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has_ibrs(vcpu))
+ return 1;
+
+ msr_info->data = to_vmx(vcpu)->spec_ctrl;
+ break;
case MSR_IA32_ARCH_CAPABILITIES:
if (!msr_info->host_initiated &&
!guest_cpuid_has_arch_capabilities(vcpu))
@@ -3093,6 +3124,36 @@ static int vmx_set_msr(struct kvm_vcpu *
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr_info);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has_ibrs(vcpu))
+ return 1;
+
+ /* The STIBP bit doesn't fault even if it's not advertised */
+ if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
+ return 1;
+
+ vmx->spec_ctrl = data;
+
+ if (!data)
+ break;
+
+ /*
+ * For non-nested:
+ * When it's written (to non-zero) for the first time, pass
+ * it through.
+ *
+ * For nested:
+ * The handling of the MSR bitmap for L2 guests is done in
+ * nested_vmx_merge_msr_bitmap. We should not touch the
+ * vmcs02.msr_bitmap here since it gets completely overwritten
+ * in the merging. We update the vmcs01 here for L1 as well
+ * since it will end up touching the MSR anyway now.
+ */
+ vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+ MSR_IA32_SPEC_CTRL,
+ MSR_TYPE_RW);
+ break;
case MSR_IA32_PRED_CMD:
if (!msr_info->host_initiated &&
!guest_cpuid_has_ibpb(vcpu))
@@ -5245,6 +5306,7 @@ static void vmx_vcpu_reset(struct kvm_vc
u64 cr0;
vmx->rmode.vm86_active = 0;
+ vmx->spec_ctrl = 0;
vmx->soft_vnmi_blocked = 0;
@@ -8830,6 +8892,15 @@ static void __noclone vmx_vcpu_run(struc
vmx_arm_hv_timer(vcpu);
+ /*
+ * If this vCPU has touched SPEC_CTRL, restore the guest's value if
+ * it's non-zero. Since vmentry is serialising on affected CPUs, there
+ * is no need to worry about the conditional branch over the wrmsr
+ * being speculatively taken.
+ */
+ if (vmx->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
vmx->__launched = vmx->loaded_vmcs->launched;
asm(
/* Store host registers */
@@ -8948,6 +9019,27 @@ static void __noclone vmx_vcpu_run(struc
#endif
);
+ /*
+ * We do not use IBRS in the kernel. If this vCPU has used the
+ * SPEC_CTRL MSR it may have left it on; save the value and
+ * turn it off. This is much more efficient than blindly adding
+ * it to the atomic save/restore list. Especially as the former
+ * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
+ *
+ * For non-nested case:
+ * If the L01 MSR bitmap does not intercept the MSR, then we need to
+ * save it.
+ *
+ * For nested case:
+ * If the L02 MSR bitmap does not intercept the MSR, then we need to
+ * save it.
+ */
+ if (!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))
+ rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
+ if (vmx->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();
@@ -9507,7 +9599,7 @@ static inline bool nested_vmx_merge_msr_
unsigned long *msr_bitmap_l1;
unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
/*
- * pred_cmd is trying to verify two things:
+ * pred_cmd & spec_ctrl are trying to verify two things:
*
* 1. L0 gave a permission to L1 to actually passthrough the MSR. This
* ensures that we do not accidentally generate an L02 MSR bitmap
@@ -9520,9 +9612,10 @@ static inline bool nested_vmx_merge_msr_
* the MSR.
*/
bool pred_cmd = msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
+ bool spec_ctrl = msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);
if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
- !pred_cmd)
+ !pred_cmd && !spec_ctrl)
return false;
page = nested_get_page(vcpu, vmcs12->msr_bitmap);
@@ -9561,6 +9654,12 @@ static inline bool nested_vmx_merge_msr_
}
}
+ if (spec_ctrl)
+ nested_vmx_disable_intercept_for_msr(
+ msr_bitmap_l1, msr_bitmap_l0,
+ MSR_IA32_SPEC_CTRL,
+ MSR_TYPE_R | MSR_TYPE_W);
+
if (pred_cmd)
nested_vmx_disable_intercept_for_msr(
msr_bitmap_l1, msr_bitmap_l0,
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -975,7 +975,7 @@ static u32 msrs_to_save[] = {
#endif
MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
- MSR_IA32_ARCH_CAPABILITIES
+ MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES
};
static unsigned num_msrs_to_save;
Patches currently in stable-queue which might be from karahmed(a)amazon.de are
queue-4.9/kvm-vmx-allow-direct-access-to-msr_ia32_spec_ctrl.patch
queue-4.9/kvm-x86-add-ibpb-support.patch
queue-4.9/kvm-svm-allow-direct-access-to-msr_ia32_spec_ctrl.patch
queue-4.9/kvm-vmx-emulate-msr_ia32_arch_capabilities.patch
This is a note to let you know that I've just added the patch titled
KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
kvm-svm-allow-direct-access-to-msr_ia32_spec_ctrl.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From b2ac58f90540e39324e7a29a7ad471407ae0bf48 Mon Sep 17 00:00:00 2001
From: KarimAllah Ahmed <karahmed(a)amazon.de>
Date: Sat, 3 Feb 2018 15:56:23 +0100
Subject: KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
From: KarimAllah Ahmed <karahmed(a)amazon.de>
commit b2ac58f90540e39324e7a29a7ad471407ae0bf48 upstream.
[ Based on a patch from Paolo Bonzini <pbonzini(a)redhat.com> ]
... basically doing exactly what we do for VMX:
- Passthrough SPEC_CTRL to guests (if enabled in guest CPUID)
- Save and restore SPEC_CTRL around VMExit and VMEntry only if the guest
actually used it.
Signed-off-by: KarimAllah Ahmed <karahmed(a)amazon.de>
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx(a)linutronix.de>
Reviewed-by: Darren Kenny <darren.kenny(a)oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk(a)oracle.com>
Cc: Andrea Arcangeli <aarcange(a)redhat.com>
Cc: Andi Kleen <ak(a)linux.intel.com>
Cc: Jun Nakajima <jun.nakajima(a)intel.com>
Cc: kvm(a)vger.kernel.org
Cc: Dave Hansen <dave.hansen(a)intel.com>
Cc: Tim Chen <tim.c.chen(a)linux.intel.com>
Cc: Andy Lutomirski <luto(a)kernel.org>
Cc: Asit Mallick <asit.k.mallick(a)intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven(a)intel.com>
Cc: Greg KH <gregkh(a)linuxfoundation.org>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Linus Torvalds <torvalds(a)linux-foundation.org>
Cc: Ashok Raj <ashok.raj(a)intel.com>
Link: https://lkml.kernel.org/r/1517669783-20732-1-git-send-email-karahmed@amazon…
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kvm/svm.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 88 insertions(+)
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -183,6 +183,8 @@ struct vcpu_svm {
u64 gs_base;
} host;
+ u64 spec_ctrl;
+
u32 *msrpm;
ulong nmi_iret_rip;
@@ -248,6 +250,7 @@ static const struct svm_direct_access_ms
{ .index = MSR_CSTAR, .always = true },
{ .index = MSR_SYSCALL_MASK, .always = true },
#endif
+ { .index = MSR_IA32_SPEC_CTRL, .always = false },
{ .index = MSR_IA32_PRED_CMD, .always = false },
{ .index = MSR_IA32_LASTBRANCHFROMIP, .always = false },
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
@@ -863,6 +866,25 @@ static bool valid_msr_intercept(u32 inde
return false;
}
+static bool msr_write_intercepted(struct kvm_vcpu *vcpu, unsigned msr)
+{
+ u8 bit_write;
+ unsigned long tmp;
+ u32 offset;
+ u32 *msrpm;
+
+ msrpm = is_guest_mode(vcpu) ? to_svm(vcpu)->nested.msrpm:
+ to_svm(vcpu)->msrpm;
+
+ offset = svm_msrpm_offset(msr);
+ bit_write = 2 * (msr & 0x0f) + 1;
+ tmp = msrpm[offset];
+
+ BUG_ON(offset == MSR_INVALID);
+
+ return !!test_bit(bit_write, &tmp);
+}
+
static void set_msr_interception(u32 *msrpm, unsigned msr,
int read, int write)
{
@@ -1537,6 +1559,8 @@ static void svm_vcpu_reset(struct kvm_vc
u32 dummy;
u32 eax = 1;
+ svm->spec_ctrl = 0;
+
if (!init_event) {
svm->vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE |
MSR_IA32_APICBASE_ENABLE;
@@ -3520,6 +3544,13 @@ static int svm_get_msr(struct kvm_vcpu *
case MSR_VM_CR:
msr_info->data = svm->nested.vm_cr_msr;
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has_ibrs(vcpu))
+ return 1;
+
+ msr_info->data = svm->spec_ctrl;
+ break;
case MSR_IA32_UCODE_REV:
msr_info->data = 0x01000065;
break;
@@ -3611,6 +3642,33 @@ static int svm_set_msr(struct kvm_vcpu *
case MSR_IA32_TSC:
kvm_write_tsc(vcpu, msr);
break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr->host_initiated &&
+ !guest_cpuid_has_ibrs(vcpu))
+ return 1;
+
+ /* The STIBP bit doesn't fault even if it's not advertised */
+ if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
+ return 1;
+
+ svm->spec_ctrl = data;
+
+ if (!data)
+ break;
+
+ /*
+ * For non-nested:
+ * When it's written (to non-zero) for the first time, pass
+ * it through.
+ *
+ * For nested:
+ * The handling of the MSR bitmap for L2 guests is done in
+ * nested_svm_vmrun_msrpm.
+ * We update the L1 MSR bit as well since it will end up
+ * touching the MSR anyway now.
+ */
+ set_msr_interception(svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
+ break;
case MSR_IA32_PRED_CMD:
if (!msr->host_initiated &&
!guest_cpuid_has_ibpb(vcpu))
@@ -4854,6 +4912,15 @@ static void svm_vcpu_run(struct kvm_vcpu
local_irq_enable();
+ /*
+ * If this vCPU has touched SPEC_CTRL, restore the guest's value if
+ * it's non-zero. Since vmentry is serialising on affected CPUs, there
+ * is no need to worry about the conditional branch over the wrmsr
+ * being speculatively taken.
+ */
+ if (svm->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl);
+
asm volatile (
"push %%" _ASM_BP "; \n\t"
"mov %c[rbx](%[svm]), %%" _ASM_BX " \n\t"
@@ -4946,6 +5013,27 @@ static void svm_vcpu_run(struct kvm_vcpu
#endif
);
+ /*
+ * We do not use IBRS in the kernel. If this vCPU has used the
+ * SPEC_CTRL MSR it may have left it on; save the value and
+ * turn it off. This is much more efficient than blindly adding
+ * it to the atomic save/restore list. Especially as the former
+ * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
+ *
+ * For non-nested case:
+ * If the L01 MSR bitmap does not intercept the MSR, then we need to
+ * save it.
+ *
+ * For nested case:
+ * If the L02 MSR bitmap does not intercept the MSR, then we need to
+ * save it.
+ */
+ if (!msr_write_intercepted(vcpu, MSR_IA32_SPEC_CTRL))
+ rdmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl);
+
+ if (svm->spec_ctrl)
+ wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+
/* Eliminate branch target predictions from guest mode */
vmexit_fill_RSB();
Patches currently in stable-queue which might be from karahmed(a)amazon.de are
queue-4.9/kvm-vmx-allow-direct-access-to-msr_ia32_spec_ctrl.patch
queue-4.9/kvm-x86-add-ibpb-support.patch
queue-4.9/kvm-svm-allow-direct-access-to-msr_ia32_spec_ctrl.patch
queue-4.9/kvm-vmx-emulate-msr_ia32_arch_capabilities.patch
This is a note to let you know that I've just added the patch titled
KVM: nVMX: vmx_complete_nested_posted_interrupt() can't fail
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
kvm-nvmx-vmx_complete_nested_posted_interrupt-can-t-fail.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From 6342c50ad12e8ce0736e722184a7dbdea4a3477f Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david(a)redhat.com>
Date: Wed, 25 Jan 2017 11:58:58 +0100
Subject: KVM: nVMX: vmx_complete_nested_posted_interrupt() can't fail
From: David Hildenbrand <david(a)redhat.com>
commit 6342c50ad12e8ce0736e722184a7dbdea4a3477f upstream.
vmx_complete_nested_posted_interrupt() can't fail, let's turn it into
a void function.
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kvm/vmx.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4738,7 +4738,7 @@ static bool vmx_get_enable_apicv(void)
return enable_apicv;
}
-static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
+static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
int max_irr;
@@ -4749,13 +4749,13 @@ static int vmx_complete_nested_posted_in
vmx->nested.pi_pending) {
vmx->nested.pi_pending = false;
if (!pi_test_and_clear_on(vmx->nested.pi_desc))
- return 0;
+ return;
max_irr = find_last_bit(
(unsigned long *)vmx->nested.pi_desc->pir, 256);
if (max_irr == 256)
- return 0;
+ return;
vapic_page = kmap(vmx->nested.virtual_apic_page);
if (!vapic_page) {
@@ -4772,7 +4772,6 @@ static int vmx_complete_nested_posted_in
vmcs_write16(GUEST_INTR_STATUS, status);
}
}
- return 0;
}
static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu)
@@ -10493,7 +10492,8 @@ static int vmx_check_nested_events(struc
return 0;
}
- return vmx_complete_nested_posted_interrupt(vcpu);
+ vmx_complete_nested_posted_interrupt(vcpu);
+ return 0;
}
static u32 vmx_get_preemption_timer_value(struct kvm_vcpu *vcpu)
Patches currently in stable-queue which might be from david(a)redhat.com are
queue-4.9/kvm-nvmx-eliminate-vmcs02-pool.patch
queue-4.9/kvm-nvmx-vmx_complete_nested_posted_interrupt-can-t-fail.patch
This is a note to let you know that I've just added the patch titled
KVM: nVMX: mark vmcs12 pages dirty on L2 exit
to the 4.9-stable tree which can be found at:
http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=sum…
The filename of the patch is:
kvm-nvmx-mark-vmcs12-pages-dirty-on-l2-exit.patch
and it can be found in the queue-4.9 subdirectory.
If you, or anyone else, feels it should not be added to the stable tree,
please let <stable(a)vger.kernel.org> know about it.
>From c9f04407f2e0b3fc9ff7913c65fcfcb0a4b61570 Mon Sep 17 00:00:00 2001
From: David Matlack <dmatlack(a)google.com>
Date: Tue, 1 Aug 2017 14:00:40 -0700
Subject: KVM: nVMX: mark vmcs12 pages dirty on L2 exit
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
From: David Matlack <dmatlack(a)google.com>
commit c9f04407f2e0b3fc9ff7913c65fcfcb0a4b61570 upstream.
The host physical addresses of L1's Virtual APIC Page and Posted
Interrupt descriptor are loaded into the VMCS02. The CPU may write
to these pages via their host physical address while L2 is running,
bypassing address-translation-based dirty tracking (e.g. EPT write
protection). Mark them dirty on every exit from L2 to prevent them
from getting out of sync with dirty tracking.
Also mark the virtual APIC page and the posted interrupt descriptor
dirty when KVM is virtualizing posted interrupt processing.
Signed-off-by: David Matlack <dmatlack(a)google.com>
Reviewed-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar(a)redhat.com>
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
arch/x86/kvm/vmx.c | 53 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 43 insertions(+), 10 deletions(-)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4738,6 +4738,28 @@ static bool vmx_get_enable_apicv(void)
return enable_apicv;
}
+static void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu)
+{
+ struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+ gfn_t gfn;
+
+ /*
+ * Don't need to mark the APIC access page dirty; it is never
+ * written to by the CPU during APIC virtualization.
+ */
+
+ if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
+ gfn = vmcs12->virtual_apic_page_addr >> PAGE_SHIFT;
+ kvm_vcpu_mark_page_dirty(vcpu, gfn);
+ }
+
+ if (nested_cpu_has_posted_intr(vmcs12)) {
+ gfn = vmcs12->posted_intr_desc_addr >> PAGE_SHIFT;
+ kvm_vcpu_mark_page_dirty(vcpu, gfn);
+ }
+}
+
+
static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -4745,18 +4767,15 @@ static void vmx_complete_nested_posted_i
void *vapic_page;
u16 status;
- if (vmx->nested.pi_desc &&
- vmx->nested.pi_pending) {
- vmx->nested.pi_pending = false;
- if (!pi_test_and_clear_on(vmx->nested.pi_desc))
- return;
-
- max_irr = find_last_bit(
- (unsigned long *)vmx->nested.pi_desc->pir, 256);
+ if (!vmx->nested.pi_desc || !vmx->nested.pi_pending)
+ return;
- if (max_irr == 256)
- return;
+ vmx->nested.pi_pending = false;
+ if (!pi_test_and_clear_on(vmx->nested.pi_desc))
+ return;
+ max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256);
+ if (max_irr != 256) {
vapic_page = kmap(vmx->nested.virtual_apic_page);
if (!vapic_page) {
WARN_ON(1);
@@ -4772,6 +4791,8 @@ static void vmx_complete_nested_posted_i
vmcs_write16(GUEST_INTR_STATUS, status);
}
}
+
+ nested_mark_vmcs12_pages_dirty(vcpu);
}
static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu)
@@ -8028,6 +8049,18 @@ static bool nested_vmx_exit_handled(stru
vmcs_read32(VM_EXIT_INTR_ERROR_CODE),
KVM_ISA_VMX);
+ /*
+ * The host physical addresses of some pages of guest memory
+ * are loaded into VMCS02 (e.g. L1's Virtual APIC Page). The CPU
+ * may write to these pages via their host physical address while
+ * L2 is running, bypassing any address-translation-based dirty
+ * tracking (e.g. EPT write protection).
+ *
+ * Mark them dirty on every exit from L2 to prevent them from
+ * getting out of sync with dirty tracking.
+ */
+ nested_mark_vmcs12_pages_dirty(vcpu);
+
if (vmx->nested.nested_run_pending)
return false;
Patches currently in stable-queue which might be from dmatlack(a)google.com are
queue-4.9/kvm-nvmx-mark-vmcs12-pages-dirty-on-l2-exit.patch