After reset the VFIO device state will always be put in
VFIO_DEVICE_STATE_RUNNING, but the save/restore files will only be
cleared if the previous state was VFIO_DEVICE_STATE_ERROR. This
causes the restore/save files to be leaked when the migration state
machine transitions through the states that re-allocate these files.
Fix this by always clearing the restore/save files on reset.
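For clarity, after the change the deferred-reset path in
pds_vfio_state_mutex_unlock() looks roughly like this (a sketch
consolidated from the diff below; surrounding context abbreviated):

    mutex_lock(&pds_vfio->reset_mutex);
    if (pds_vfio->deferred_reset) {
            pds_vfio->deferred_reset = false;
            /* always drop any stale migration files, not only on ERROR */
            pds_vfio_put_restore_file(pds_vfio);
            pds_vfio_put_save_file(pds_vfio);
            if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR)
                    pds_vfio_dirty_disable(pds_vfio, false);
            pds_vfio->state = pds_vfio->deferred_reset_state;
    }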
Fixes: 7dabb1bcd177 ("vfio/pds: Add support for firmware recovery")
Cc: stable@vger.kernel.org
Signed-off-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Shannon Nelson <shannon.nelson@amd.com>
---
drivers/vfio/pci/pds/vfio_dev.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/pds/vfio_dev.c b/drivers/vfio/pci/pds/vfio_dev.c
index 4c351c59d05a..a286ebcc7112 100644
--- a/drivers/vfio/pci/pds/vfio_dev.c
+++ b/drivers/vfio/pci/pds/vfio_dev.c
@@ -32,9 +32,9 @@ void pds_vfio_state_mutex_unlock(struct pds_vfio_pci_device *pds_vfio)
mutex_lock(&pds_vfio->reset_mutex);
if (pds_vfio->deferred_reset) {
pds_vfio->deferred_reset = false;
+ pds_vfio_put_restore_file(pds_vfio);
+ pds_vfio_put_save_file(pds_vfio);
if (pds_vfio->state == VFIO_DEVICE_STATE_ERROR) {
- pds_vfio_put_restore_file(pds_vfio);
- pds_vfio_put_save_file(pds_vfio);
pds_vfio_dirty_disable(pds_vfio, false);
}
pds_vfio->state = pds_vfio->deferred_reset_state;
--
2.17.1
Since kernel version 5.4.250 LTS, the kernel live patching feature has
been unavailable. When the compiled livepatch sample module is enabled,
the following message is displayed:
livepatch: klp_check_stack: kworker/u256:6:23490 has an unreliable stack
After investigation, it was found that this is caused by objtool not
supporting intra-function calls, which results in incorrect ORC entry
generation.
This patchset adds support for intra-function calls, allowing the kernel
live patching feature to work correctly.
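For reference, an intra-function call is a CALL instruction whose
destination lies inside the same function, a pattern used by some
speculation mitigations. A minimal hypothetical illustration (x86-64
inline asm; not code from this patchset):

    /*
     * "call 1f" pushes a return address and branches to a label in the
     * same function.  Without dedicated support, objtool treats it as a
     * call to another function and emits wrong ORC data, so the unwinder
     * later reports an unreliable stack.
     */
    static inline void intra_function_call_demo(void)
    {
            asm volatile("call 1f\n\t"
                         "1: addq $8, %%rsp" /* drop the pushed return address */
                         : : : "memory");
    }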
Alexandre Chartre (2):
objtool: is_fentry_call() crashes if call has no destination
objtool: Add support for intra-function calls
Rui Qi (1):
x86/speculation: Support intra-function call validation
arch/x86/include/asm/nospec-branch.h | 7 ++
include/linux/frame.h | 11 ++++
.../Documentation/stack-validation.txt | 8 +++
tools/objtool/arch/x86/decode.c | 6 ++
tools/objtool/check.c | 64 +++++++++++++++++--
5 files changed, 91 insertions(+), 5 deletions(-)
--
2.39.2 (Apple Git-143)
Supersedes: <20240130180400.1698136-1-pbonzini@redhat.com>
MKTME repurposes the high bits of the physical address as the key id for
the encryption key and, even though MAXPHYADDR in CPUID[0x80000008]
remains the same, the valid bits in the MTRR mask register are based on
the reduced number of physical address bits. This breaks boot on machines
that have TME enabled and do something to clean up MTRRs, unless
"disable_mtrr_cleanup" is passed on the command line. The fix is to move
the check to early CPU initialization, which runs before Linux sets up
MTRRs.
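For illustration, the detection amounts to something like the following
sketch (loosely following detect_tme() in arch/x86/kernel/cpu/intel.c;
simplified, error handling omitted):

    /*
     * MKTME steals the top physical-address bits for key IDs, so the
     * effective MAXPHYADDR used to build the MTRR masks must be reduced
     * before Linux sets up MTRRs.
     */
    u64 tme_activate;
    int keyid_bits;

    rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
    keyid_bits = TME_ACTIVATE_KEYID_BITS(tme_activate);
    if (keyid_bits)
            c->x86_phys_bits -= keyid_bits;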
However, as noticed by Kirill, the patch I sent as v1 actually works only
until Linux 6.6. In Linux 6.7, commit fbf6449f84bf ("x86/sev-es: Set
x86_virt_bits to the correct value straight away, instead of a two-phase
approach") reorganized the initialization of c->x86_phys_bits in a way
that broke the patch. But even in 6.7, AMD processors, which did try to
reduce it in this_cpu->c_early_init(c), had their x86_phys_bits value
overwritten by get_cpu_address_sizes(), so that early_identify_cpu()
left the wrong value in x86_phys_bits. This probably went unnoticed
because on AMD processors you need not apply the reduced MAXPHYADDR to
MTRR masks.
Therefore, this v2 prepends a fix for the issue introduced by commit
fbf6449f84bf. Apologies for the oversight.
Tested on an AMD Epyc machine (where I resorted to dumping mtrr_state) and
on the problematic Intel Emerald Rapids machine.
Thanks,
Paolo
Paolo Bonzini (2):
x86/cpu: allow reducing x86_phys_bits during early_identify_cpu()
x86/cpu/intel: Detect TME keyid bits before setting MTRR mask
registers
arch/x86/kernel/cpu/common.c | 4 +-
arch/x86/kernel/cpu/intel.c | 178 ++++++++++++++++++-----------------
2 files changed, 93 insertions(+), 89 deletions(-)
--
2.43.0
While commit 69f89168b310 ("usb: typec: tpcm: Fix issues with power being
removed during reset") fixes the boot issues for bus powered devices such
as LibreTech Renegade Elite/Firefly, it trades off the CC pins NOT being
Hi-Zed during error recovery (i.e. PORT_RESET) for devices which are NOT
bus powered (a.k.a. self powered). This change Hi-Zs the CC pins only for
self powered devices, thus preventing brownout for bus powered devices.
Adhering to the spec is gaining more importance due to the Common Charger
initiative enforced by the European Union.
Quoting from the spec:
4.5.2.2.2.1 ErrorRecovery State Requirements
The port shall not drive VBUS or VCONN, and shall present a
high-impedance to ground (above zOPEN) on its CC1 and CC2 pins.
Hi-Zing the CC pins is the intended behavior for PORT_RESET.
The CC pins are set back to their default state after tErrorRecovery in
PORT_RESET_WAIT_OFF.
4.5.2.2.2.2 Exiting From ErrorRecovery State
A Sink shall transition to Unattached.SNK after tErrorRecovery.
A Source shall transition to Unattached.SRC after tErrorRecovery.
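With the change, the PORT_RESET arm of the state machine looks roughly
like this (consolidated from the diff below):

    case PORT_RESET:
            tcpm_reset_port(port);
            if (port->self_powered)
                    /* 4.5.2.2.2.1: Hi-Z CC1/CC2 during ErrorRecovery */
                    tcpm_set_cc(port, TYPEC_CC_OPEN);
            else
                    /* keep Rd/Rp so a bus powered board is not browned out */
                    tcpm_set_cc(port, tcpm_default_state(port) == SNK_UNATTACHED ?
                                TYPEC_CC_RD : tcpm_rp_cc(port));
            tcpm_set_state(port, PORT_RESET_WAIT_OFF, PD_T_ERROR_RECOVERY);
            break;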
Fixes: 69f89168b310 ("usb: typec: tpcm: Fix issues with power being removed during reset")
Cc: stable@kernel.org
Cc: Mark Brown <broonie@kernel.org>
Signed-off-by: Badhri Jagan Sridharan <badhri@google.com>
---
drivers/usb/typec/tcpm/tcpm.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/typec/tcpm/tcpm.c b/drivers/usb/typec/tcpm/tcpm.c
index c9a78f55ca48..bbe1381232eb 100644
--- a/drivers/usb/typec/tcpm/tcpm.c
+++ b/drivers/usb/typec/tcpm/tcpm.c
@@ -5593,8 +5593,11 @@ static void run_state_machine(struct tcpm_port *port)
break;
case PORT_RESET:
tcpm_reset_port(port);
- tcpm_set_cc(port, tcpm_default_state(port) == SNK_UNATTACHED ?
- TYPEC_CC_RD : tcpm_rp_cc(port));
+ if (port->self_powered)
+ tcpm_set_cc(port, TYPEC_CC_OPEN);
+ else
+ tcpm_set_cc(port, tcpm_default_state(port) == SNK_UNATTACHED ?
+ TYPEC_CC_RD : tcpm_rp_cc(port));
tcpm_set_state(port, PORT_RESET_WAIT_OFF,
PD_T_ERROR_RECOVERY);
break;
base-commit: a560a5672826fc1e057068bda93b3d4c98d037a2
--
2.44.0.rc1.240.g4c46232300-goog
On Tue, Feb 27, 2024 at 10:14 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 12:32 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Mon, Feb 19, 2024 at 8:56 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > >
> > > Hi Barry,
> > >
> > > > it might not be a problem for throughput, but for real-time and
> > > > tail latency this hurts. For example, this might increase dropped
> > > > frames in the UI, which is an important parameter for evaluating
> > > > performance :-)
> > >
> > > That's a true issue; as Chris mentioned before, I think we need to
> > > come up with some clever data structure to solve this more naturally
> > > in the future. A similar issue exists for cached swapin as well and
> > > has been there for a while. On the other hand, maybe applications
> > > that are extremely latency sensitive should try to avoid swap on
> > > fault? A swapin could cause other issues like reclaim, throttling, or
> > > contention with many other things; these seem to have a higher chance
> > > than this race.
> >
> >
> > Yes, I do think the best long term solution is to have some clever
> > data structure to solve the synchronization issue and allow racing
> > threads to make forward progress at the same time.
> >
> > I have also explored some (failed) synchronization ideas, for example
> > having the run time swap entry refcount separate from the swap_map
> > count. BTW, zswap's entry->refcount behaves like that: it is separate
> > from the swap entry and manages the temporary run time usage count
> > held by the function. However, that idea has its own problem as well:
> > it needs an xarray to track the swap entry's run time refcount (only
> > stored in the xarray when a CPU fails to get the SWAP_HAS_CACHE bit).
> > When we are done with the page fault, we still need to look up the
> > xarray to make sure there is no racing CPU and put the refcount into
> > the xarray. That kind of defeats the purpose of avoiding the swap
> > cache in the first place. We still need to do the xarray lookup in the
> > normal path.
> >
> > I came to realize that, while this current fix is not perfect (I
> > still wish we had a better solution that does not pause the racing
> > CPU), it stands better than not fixing this data corruption issue,
> > and the patch remains relatively simple. Yes, it has latency issues,
> > but that is still better than data corruption. It also doesn't stop
> > us from coming up with better solutions later on. If we want to
> > address the synchronization in a way that does not block other CPUs,
> > it will likely require a much bigger change.
> >
> > Unless we have a better suggestion, this seems the better one among
> > the alternatives so far.
> >
>
> Hi,
>
> Thanks for the comments. I've been trying some ideas locally, and I think a simple and straightforward solution exists: we just don't skip the swap cache xarray.
Yes, I have been pondering that as well.
Notice that __read_swap_cache_async() has a similar
"schedule_timeout_uninterruptible(1)" when swapcache_prepare(entry)
fails to grab the SWAP_HAS_CACHE bit. So falling back to the swap
cache does not automatically solve the latency issue; a similar delay
exists in the swap cache case as well.
> The current reason we are skipping it is performance, but with some optimization the performance should be as good as skipping it (i.e. as good as the current behavior). Notice that even in the swap cache bypass path, we need to do one lookup and one modification (deleting the shadow). That can't be skipped, so the usage of the swap cache can be better organized and optimized.
> Since swapin makes use of the swap cache anyway, swapin can insert the folio into the swap cache xarray first, then set the swap map cache bit. I'm thinking about reusing the folio lock, or having an intermediate value in the xarray, so raced swapins can wait properly. There are some tricky parts syncing with the swap maps, though.
Inserting into the swap cache xarray first and setting the SWAP_HAS_CACHE
bit later will need more auditing of the races. I assume you take the
swap device/cluster lock before the folio is inserted into the swap cache
xarray?
Chris
>
> Currently working on a series, will send in a few weeks if it works.
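(For reference, a rough sketch of the idea under discussion; the marker
value and the waiting step are hypothetical placeholders, not code from
the series:)

    /*
     * Hypothetical sketch: publish a placeholder in the swap cache
     * xarray before setting SWAP_HAS_CACHE, so a racing swapin finds
     * either the folio or a marker it can wait on, instead of spinning
     * in schedule_timeout_uninterruptible(1).
     */
    void *old;

    old = xa_cmpxchg(&swap_address_space(entry)->i_pages,
                     swp_offset(entry), NULL,
                     SWAPIN_PENDING /* hypothetical marker entry */,
                     GFP_KERNEL);
    if (old == SWAPIN_PENDING) {
            /* another CPU is bringing the folio in; wait on it */
    }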
In raid5_cache_count():
        if (conf->max_nr_stripes < conf->min_nr_stripes)
                return 0;
        return conf->max_nr_stripes - conf->min_nr_stripes;
This check is ineffective, as the values could change immediately after
being read.
In raid5_set_cache_size():
        ...
        conf->min_nr_stripes = size;
        ...
        while (size > conf->max_nr_stripes)
                if (!grow_one_stripe(conf, GFP_KERNEL)) {
                        conf->min_nr_stripes = conf->max_nr_stripes;
                        ...
Due to the intermediate value updates in raid5_set_cache_size(),
concurrent execution of raid5_cache_count() and raid5_set_cache_size()
may lead to inconsistent reads of conf->max_nr_stripes and
conf->min_nr_stripes: the values can change between the check and the
subtraction, so conf->min_nr_stripes may exceed conf->max_nr_stripes and
the unsigned subtraction may underflow.
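For example, the following interleaving (one of several possible) shows
the atomicity violation:

    CPU0: raid5_set_cache_size()        CPU1: raid5_cache_count()
    ----------------------------        -------------------------
                                        checks max_nr_stripes <
                                        min_nr_stripes: false, so
                                        proceeds to the subtraction
    conf->min_nr_stripes = size;
    (size > conf->max_nr_stripes)
                                        re-reads both fields for the
                                        subtraction: now min > max, so
                                        max_nr_stripes - min_nr_stripes
                                        underflows the unsigned return
                                        value of the shrinker callback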
This possible bug was found by an experimental static analysis tool
developed by our team. The tool analyzes the locking APIs to extract
function pairs that can be concurrently executed, and then analyzes the
instructions in the paired functions to identify possible concurrency
bugs, including data races and atomicity violations. The above possible
bug was reported when our tool analyzed the source code of Linux 6.2.
To resolve this issue, it is suggested to introduce local variables
'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the
values remain stable throughout the check. Adding locks in
raid5_cache_count() would not resolve the atomicity violation, as
raid5_set_cache_size() may hold intermediate values of
conf->min_nr_stripes while unlocked. With this patch applied, our tool no
longer reports the bug, with the kernel configuration allyesconfig for
x86_64. Due to the lack of associated hardware, we could not test the
patch at runtime and have only verified it against the code logic.
Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han <2045gemini@gmail.com>
---
v2:
* In this patch v2, we've updated to use READ_ONCE() instead of direct
reads for accessing max_nr_stripes and min_nr_stripes, since reads and
writes can be concurrent.
Thanks to Yu Kuai for the helpful advice.
---
v3:
* In this patch v3, we've updated to use WRITE_ONCE() in
raid5_set_cache_size(), grow_one_stripe() and drop_one_stripe(), in order
to pair READ_ONCE() with WRITE_ONCE().
Thanks to Yu Kuai for the helpful advice.
---
v4:
* In this patch v4, we've addressed several code style issues.
Thanks to Yu Kuai for the helpful advice.
---
drivers/md/raid5.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8497880135ee..30e118d10c0b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2412,7 +2412,7 @@ static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
atomic_inc(&conf->active_stripes);
raid5_release_stripe(sh);
- conf->max_nr_stripes++;
+ WRITE_ONCE(conf->max_nr_stripes, conf->max_nr_stripes + 1);
return 1;
}
@@ -2707,7 +2707,7 @@ static int drop_one_stripe(struct r5conf *conf)
shrink_buffers(sh);
free_stripe(conf->slab_cache, sh);
atomic_dec(&conf->active_stripes);
- conf->max_nr_stripes--;
+ WRITE_ONCE(conf->max_nr_stripes, conf->max_nr_stripes - 1);
return 1;
}
@@ -6820,7 +6820,7 @@ raid5_set_cache_size(struct mddev *mddev, int size)
if (size <= 16 || size > 32768)
return -EINVAL;
- conf->min_nr_stripes = size;
+ WRITE_ONCE(conf->min_nr_stripes, size);
mutex_lock(&conf->cache_size_mutex);
while (size < conf->max_nr_stripes &&
drop_one_stripe(conf))
@@ -6832,7 +6832,7 @@ raid5_set_cache_size(struct mddev *mddev, int size)
mutex_lock(&conf->cache_size_mutex);
while (size > conf->max_nr_stripes)
if (!grow_one_stripe(conf, GFP_KERNEL)) {
- conf->min_nr_stripes = conf->max_nr_stripes;
+ WRITE_ONCE(conf->min_nr_stripes, conf->max_nr_stripes);
result = -ENOMEM;
break;
}
@@ -7390,11 +7390,13 @@ static unsigned long raid5_cache_count(struct shrinker *shrink,
struct shrink_control *sc)
{
struct r5conf *conf = shrink->private_data;
+ int max_stripes = READ_ONCE(conf->max_nr_stripes);
+ int min_stripes = READ_ONCE(conf->min_nr_stripes);
- if (conf->max_nr_stripes < conf->min_nr_stripes)
+ if (max_stripes < min_stripes)
/* unlikely, but not impossible */
return 0;
- return conf->max_nr_stripes - conf->min_nr_stripes;
+ return max_stripes - min_stripes;
}
static struct r5conf *setup_conf(struct mddev *mddev)
--
2.34.1
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable@vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 045e9d812868a2d80b7a57b224ce8009444b7bbc
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable@vger.kernel.org>' --in-reply-to '2024022601-footwork-fastness-bcab@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
045e9d812868 ("mptcp: fix duplicate subflow creation")
b9d69db87fb7 ("mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 045e9d812868a2d80b7a57b224ce8009444b7bbc Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 15 Feb 2024 19:25:33 +0100
Subject: [PATCH] mptcp: fix duplicate subflow creation
Fullmesh endpoints could end up unexpectedly generating duplicate
subflows - same local and remote addresses - when multiple incoming
ADD_ADDR are processed before the PM creates the subflow for the local
endpoints.
Address the issue by explicitly checking for duplicates at subflow
creation time.
To avoid quadratic computational complexity, track the unavailable
remote address ids in a temporary bitmap, and initialize that bitmap
with the remote ids of all the existing subflows matching the local
address currently processed.
This additionally allows replacing the existing code that checks for a
duplicate entry in the current set with a simple bit test
operation.
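In short, the resulting per-address pattern is (condensed from the diff
below):

    DECLARE_BITMAP(unavail_id, MPTCP_PM_MAX_ADDR_ID + 1);

    /* seed the bitmap with the remote ids already in use */
    bitmap_zero(unavail_id, MPTCP_PM_MAX_ADDR_ID + 1);
    mptcp_for_each_subflow(msk, subflow)
            if (READ_ONCE(subflow->local_id) == local->id)
                    __set_bit(subflow->remote_id, unavail_id);

    /* ... then, for each candidate remote address addrs[i] ... */
    if (test_bit(addrs[i].id, unavail_id))
            continue;       /* duplicate subflow, skip it */
    __set_bit(addrs[i].id, unavail_id);
    msk->pm.subflows++;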
Fixes: 2843ff6f36db ("mptcp: remote addresses fullmesh")
Cc: stable@vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/435
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index ed6983af1ab2..58d17d9604e7 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -396,19 +396,6 @@ void mptcp_pm_free_anno_list(struct mptcp_sock *msk)
}
}
-static bool lookup_address_in_vec(const struct mptcp_addr_info *addrs, unsigned int nr,
- const struct mptcp_addr_info *addr)
-{
- int i;
-
- for (i = 0; i < nr; i++) {
- if (addrs[i].id == addr->id)
- return true;
- }
-
- return false;
-}
-
/* Fill all the remote addresses into the array addrs[],
* and return the array size.
*/
@@ -440,6 +427,16 @@ static unsigned int fill_remote_addresses_vec(struct mptcp_sock *msk,
msk->pm.subflows++;
addrs[i++] = remote;
} else {
+ DECLARE_BITMAP(unavail_id, MPTCP_PM_MAX_ADDR_ID + 1);
+
+ /* Forbid creation of new subflows matching existing
+ * ones, possibly already created by incoming ADD_ADDR
+ */
+ bitmap_zero(unavail_id, MPTCP_PM_MAX_ADDR_ID + 1);
+ mptcp_for_each_subflow(msk, subflow)
+ if (READ_ONCE(subflow->local_id) == local->id)
+ __set_bit(subflow->remote_id, unavail_id);
+
mptcp_for_each_subflow(msk, subflow) {
ssk = mptcp_subflow_tcp_sock(subflow);
remote_address((struct sock_common *)ssk, &addrs[i]);
@@ -447,11 +444,17 @@ static unsigned int fill_remote_addresses_vec(struct mptcp_sock *msk,
if (deny_id0 && !addrs[i].id)
continue;
+ if (test_bit(addrs[i].id, unavail_id))
+ continue;
+
if (!mptcp_pm_addr_families_match(sk, local, &addrs[i]))
continue;
- if (!lookup_address_in_vec(addrs, i, &addrs[i]) &&
- msk->pm.subflows < subflows_max) {
+ if (msk->pm.subflows < subflows_max) {
+ /* forbid creating multiple address towards
+ * this id
+ */
+ __set_bit(addrs[i].id, unavail_id);
msk->pm.subflows++;
i++;
}