On Tue, Feb 27, 2024 at 10:14 AM Kairui Song <ryncsn(a)gmail.com> wrote:
>
> On Wed, Feb 21, 2024 at 12:32 AM Chris Li <chrisl(a)kernel.org> wrote:
> >
> > On Mon, Feb 19, 2024 at 8:56 PM Kairui Song <ryncsn(a)gmail.com> wrote:
> >
> > >
> > > Hi Barry,
> > >
> > > > it might not be a problem for throughput, but for real-time and tail
> > > > latency this hurts. For example, this might increase UI frame drops,
> > > > which is an important parameter for evaluating performance :-)
> > > >
> > >
> > > That's a real issue. As Chris mentioned before, I think we need to
> > > come up with some clever data structure to solve this more naturally
> > > in the future; a similar issue exists for cached swapin as well and
> > > has been there for a while. On the other hand, maybe applications
> > > that are extremely latency sensitive should try to avoid swapping on
> > > fault? A swapin could also run into reclaim, throttling, or
> > > contention with many other things, and those seem more likely than
> > > this race.
> >
> >
> > Yes, I do think the best long term solution is to have some clever
> > data structure to solve the synchronization issue and allow racing
> > threads to make forward progress at the same time.
> >
> > I have also explored some (failed) synchronization ideas, for example
> > keeping the run-time swap entry refcount separate from the swap_map
> > count. BTW, the zswap entry->refcount behaves like that: it is separate
> > from the swap entry and manages the temporary run-time usage count held
> > by the function. However, that idea has its own problem as well: it
> > needs an xarray to track the run-time refcount of the swap entry (only
> > stored in the xarray when a CPU fails to get the SWAP_HAS_CACHE bit).
> > When we are done with the page fault, we still need to look up the
> > xarray to make sure there is no racing CPU and to put the refcount
> > stored there. That kind of defeats the purpose of avoiding the swap
> > cache in the first place: we still need to do the xarray lookup in the
> > normal path.
> >
> > I came to realize that, while this current fix is not perfect (I still
> > wish we had a better solution that did not pause the racing CPU), the
> > patch is better than leaving this data corruption issue unfixed, and it
> > remains relatively simple. Yes, it has latency issues, but latency is
> > still better than data corruption. It also doesn't stop us from coming
> > up with better solutions later on. Addressing the synchronization in a
> > way that does not block other CPUs will likely require a much bigger
> > change.
> >
> > Unless we have a better suggestion, this seems the best among the
> > alternatives so far.
> >
>
> Hi,
>
> Thanks for the comments. I've been trying some ideas locally, and I think a simple and straightforward solution exists: we just don't skip the swap cache xarray.
Yes, I have been pondering that as well.
Notice that __read_swap_cache_async() has a similar
"schedule_timeout_uninterruptible(1)" when swapcache_prepare(entry)
fails to grab the SWAP_HAS_CACHE bit. So falling back to using the swap
cache does not automatically solve the latency issue; a similar delay
exists in the swap cache path as well.
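For reference, the retry in __read_swap_cache_async() looks roughly like
this (a paraphrased sketch of the loop in mm/swap_state.c, with the
allocation and readahead details trimmed; not the verbatim source):

	for (;;) {
		/* Another CPU may have added the folio already. */
		folio = filemap_get_folio(swap_address_space(entry),
					  swp_offset(entry));
		if (!IS_ERR(folio))
			return folio;		/* swap cache hit */

		/* Try to claim the entry by setting SWAP_HAS_CACHE. */
		err = swapcache_prepare(entry);
		if (!err)
			break;			/* we own this swapin */
		if (err != -EEXIST)
			return NULL;		/* entry was freed, give up */

		/*
		 * Someone else holds SWAP_HAS_CACHE but has not published
		 * its folio in the swap cache yet, so there is nothing to
		 * wait on: sleep one tick and retry. This is the same
		 * one-jiffy stall we are discussing for the bypass path.
		 */
		schedule_timeout_uninterruptible(1);
	}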
> The current reason we skip it is performance, but with some optimization the performance should be as good as skipping it (i.e., as good as the current behavior). Notice that even in the swap cache bypass path we have to do one lookup and one modification (deleting the shadow); those can't be skipped. So the usage of the swap cache can be better organized and optimized.
> After all, swapin makes use of the swap cache, so it can insert the folio into the swap cache xarray first, then set the cache bit in the swap map. I'm thinking about reusing the folio lock, or having an intermediate value in the xarray, so raced swapins can wait properly. There are some tricky parts in syncing with the swap map though.
Inserting into the swap cache xarray first and setting the SWAP_HAS_CACHE
bit later will need more auditing of the races. I assume you take the swap
device/cluster lock before the folio is inserted into the swap cache xarray?
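To make sure we are talking about the same thing, here is a rough sketch
of how I imagine the intermediate-value variant (entirely hypothetical:
XA_SWAPIN_PENDING, swapin_wait_on_entry() and the RETRY handling are all
made up, nothing like them exists in the tree today):

	struct address_space *as = swap_address_space(entry);
	pgoff_t idx = swp_offset(entry);
	void *old;

	/* Claim the slot: shadow -> PENDING marker in one xarray op. */
	old = xa_cmpxchg(&as->i_pages, idx, shadow,
			 XA_SWAPIN_PENDING, GFP_KERNEL);
	if (old != shadow) {
		/* Raced: another swapin owns the slot, wait on it. */
		swapin_wait_on_entry(old);
		return RETRY;
	}
	/* ... allocate the folio and submit the read ... */
	xa_store(&as->i_pages, idx, folio, GFP_KERNEL);
	/* Only now publish SWAP_HAS_CACHE in the swap map. */

The tricky part you mention is exactly here: if the read fails midway,
the swap map and the xarray state have to be unwound consistently.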
Chris
>
> Currently working on a series, will send in a few weeks if it works.
In raid5_cache_count():
	if (conf->max_nr_stripes < conf->min_nr_stripes)
		return 0;
	return conf->max_nr_stripes - conf->min_nr_stripes;
The current check is ineffective, as the values could change immediately
after being checked.
In raid5_set_cache_size():
	...
	conf->min_nr_stripes = size;
	...
	while (size > conf->max_nr_stripes)
		conf->min_nr_stripes = conf->max_nr_stripes;
	...
Due to intermediate value updates in raid5_set_cache_size(), concurrent
execution of raid5_cache_count() and raid5_set_cache_size() may lead to
inconsistent reads of conf->max_nr_stripes and conf->min_nr_stripes.
The current checks are ineffective, as the values could change immediately
after being checked, raising the risk of conf->min_nr_stripes exceeding
conf->max_nr_stripes, so the subtraction can go negative and be returned
as a huge unsigned long (an integer overflow).
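For illustration, one problematic interleaving (the stripe counts here
are made up):

	CPU0: raid5_cache_count()           CPU1: raid5_set_cache_size(1024)
	reads max_nr_stripes == 256
	reads min_nr_stripes == 256
	  -> check passes
	                                    conf->min_nr_stripes = 1024;
	reads min_nr_stripes == 1024
	return 256 - 1024;   /* negative, returned as unsigned long */

The shrinker core then sees a huge count of supposedly reclaimable
objects.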
This possible bug was found by an experimental static analysis tool
developed by our team. The tool analyzes the locking APIs to extract
function pairs that can be concurrently executed, and then analyzes the
instructions in the paired functions to identify possible concurrency
bugs, including data races and atomicity violations. The above possible
bug was reported when our tool analyzed the source code of Linux 6.2.
To resolve this issue, it is suggested to introduce local variables
'min_stripes' and 'max_stripes' in raid5_cache_count() to ensure the
values remain stable throughout the check. Adding locks in
raid5_cache_count() would not resolve the atomicity violation, as
raid5_set_cache_size() may hold intermediate values of
conf->min_nr_stripes while unlocked. With this patch applied, our tool no
longer reports the bug, using the kernel configuration allyesconfig for
x86_64. Due to the lack of associated hardware, we could not do runtime
testing and have verified the change only against the code logic.
Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable(a)vger.kernel.org
Signed-off-by: Gui-Dong Han <2045gemini(a)gmail.com>
---
v2:
* In this patch v2, we've updated to use READ_ONCE() instead of direct
reads for accessing max_nr_stripes and min_nr_stripes, since reads and
writes can occur concurrently.
Thanks to Yu Kuai for the helpful advice.
---
v3:
* In this patch v3, we've updated to use WRITE_ONCE() in
raid5_set_cache_size(), grow_one_stripe() and drop_one_stripe(), in order
to pair each READ_ONCE() with a WRITE_ONCE().
Thanks to Yu Kuai for the helpful advice.
---
v4:
* In this patch v4, we've addressed several code style issues.
Thanks to Yu Kuai for the helpful advice.
---
drivers/md/raid5.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8497880135ee..30e118d10c0b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2412,7 +2412,7 @@ static int grow_one_stripe(struct r5conf *conf, gfp_t gfp)
atomic_inc(&conf->active_stripes);
raid5_release_stripe(sh);
- conf->max_nr_stripes++;
+ WRITE_ONCE(conf->max_nr_stripes, conf->max_nr_stripes + 1);
return 1;
}
@@ -2707,7 +2707,7 @@ static int drop_one_stripe(struct r5conf *conf)
shrink_buffers(sh);
free_stripe(conf->slab_cache, sh);
atomic_dec(&conf->active_stripes);
- conf->max_nr_stripes--;
+ WRITE_ONCE(conf->max_nr_stripes, conf->max_nr_stripes - 1);
return 1;
}
@@ -6820,7 +6820,7 @@ raid5_set_cache_size(struct mddev *mddev, int size)
if (size <= 16 || size > 32768)
return -EINVAL;
- conf->min_nr_stripes = size;
+ WRITE_ONCE(conf->min_nr_stripes, size);
mutex_lock(&conf->cache_size_mutex);
while (size < conf->max_nr_stripes &&
drop_one_stripe(conf))
@@ -6832,7 +6832,7 @@ raid5_set_cache_size(struct mddev *mddev, int size)
mutex_lock(&conf->cache_size_mutex);
while (size > conf->max_nr_stripes)
if (!grow_one_stripe(conf, GFP_KERNEL)) {
- conf->min_nr_stripes = conf->max_nr_stripes;
+ WRITE_ONCE(conf->min_nr_stripes, conf->max_nr_stripes);
result = -ENOMEM;
break;
}
@@ -7390,11 +7390,13 @@ static unsigned long raid5_cache_count(struct shrinker *shrink,
struct shrink_control *sc)
{
struct r5conf *conf = shrink->private_data;
+ int max_stripes = READ_ONCE(conf->max_nr_stripes);
+ int min_stripes = READ_ONCE(conf->min_nr_stripes);
- if (conf->max_nr_stripes < conf->min_nr_stripes)
+ if (max_stripes < min_stripes)
/* unlikely, but not impossible */
return 0;
- return conf->max_nr_stripes - conf->min_nr_stripes;
+ return max_stripes - min_stripes;
}
static struct r5conf *setup_conf(struct mddev *mddev)
--
2.34.1
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 045e9d812868a2d80b7a57b224ce8009444b7bbc
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024022601-footwork-fastness-bcab@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
045e9d812868 ("mptcp: fix duplicate subflow creation")
b9d69db87fb7 ("mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 045e9d812868a2d80b7a57b224ce8009444b7bbc Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Thu, 15 Feb 2024 19:25:33 +0100
Subject: [PATCH] mptcp: fix duplicate subflow creation
Fullmesh endpoints could end up unexpectedly generating duplicate
subflows - same local and remote addresses - when multiple incoming
ADD_ADDR are processed before the PM creates the subflow for the local
endpoints.
Address the issue by explicitly checking for duplicates at subflow
creation time.
To avoid a quadratic computational complexity, track the unavailable
remote address ids in a temporary bitmap and initialize such a bitmap
with the remote ids of all the existing subflows matching the local
address currently being processed.
The above additionally allows replacing the existing code that checks
for a duplicate entry in the current set with a simple bit-test
operation.
Fixes: 2843ff6f36db ("mptcp: remote addresses fullmesh")
Cc: stable(a)vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/435
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/pm_netlink.c b/net/mptcp/pm_netlink.c
index ed6983af1ab2..58d17d9604e7 100644
--- a/net/mptcp/pm_netlink.c
+++ b/net/mptcp/pm_netlink.c
@@ -396,19 +396,6 @@ void mptcp_pm_free_anno_list(struct mptcp_sock *msk)
}
}
-static bool lookup_address_in_vec(const struct mptcp_addr_info *addrs, unsigned int nr,
- const struct mptcp_addr_info *addr)
-{
- int i;
-
- for (i = 0; i < nr; i++) {
- if (addrs[i].id == addr->id)
- return true;
- }
-
- return false;
-}
-
/* Fill all the remote addresses into the array addrs[],
* and return the array size.
*/
@@ -440,6 +427,16 @@ static unsigned int fill_remote_addresses_vec(struct mptcp_sock *msk,
msk->pm.subflows++;
addrs[i++] = remote;
} else {
+ DECLARE_BITMAP(unavail_id, MPTCP_PM_MAX_ADDR_ID + 1);
+
+ /* Forbid creation of new subflows matching existing
+ * ones, possibly already created by incoming ADD_ADDR
+ */
+ bitmap_zero(unavail_id, MPTCP_PM_MAX_ADDR_ID + 1);
+ mptcp_for_each_subflow(msk, subflow)
+ if (READ_ONCE(subflow->local_id) == local->id)
+ __set_bit(subflow->remote_id, unavail_id);
+
mptcp_for_each_subflow(msk, subflow) {
ssk = mptcp_subflow_tcp_sock(subflow);
remote_address((struct sock_common *)ssk, &addrs[i]);
@@ -447,11 +444,17 @@ static unsigned int fill_remote_addresses_vec(struct mptcp_sock *msk,
if (deny_id0 && !addrs[i].id)
continue;
+ if (test_bit(addrs[i].id, unavail_id))
+ continue;
+
if (!mptcp_pm_addr_families_match(sk, local, &addrs[i]))
continue;
- if (!lookup_address_in_vec(addrs, i, &addrs[i]) &&
- msk->pm.subflows < subflows_max) {
+ if (msk->pm.subflows < subflows_max) {
+ /* forbid creating multiple address towards
+ * this id
+ */
+ __set_bit(addrs[i].id, unavail_id);
msk->pm.subflows++;
i++;
}
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 3f83d8a77eeeb47011b990fd766a421ee64f1d73
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2024021911-fragment-yearly-5b45@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
3f83d8a77eee ("mptcp: fix more tx path fields initialization")
013e3179dbd2 ("mptcp: fix rcv space initialization")
c693a8516429 ("mptcp: use mptcp_set_state")
4fd19a307016 ("mptcp: fix inconsistent state on fastopen race")
d109a7767273 ("mptcp: fix possible NULL pointer dereference on close")
8005184fd1ca ("mptcp: refactor sndbuf auto-tuning")
a5efdbcece83 ("mptcp: fix delegated action races")
27e5ccc2d5a5 ("mptcp: fix dangling connection hang-up")
f6909dc1c1f4 ("mptcp: rename timer related helper to less confusing names")
9f1a98813b4b ("mptcp: process pending subflow error on close")
d5fbeff1ab81 ("mptcp: move __mptcp_error_report in protocol.c")
ebc1e08f01eb ("mptcp: drop last_snd and MPTCP_RESET_SCHEDULER")
e263691773cd ("mptcp: Remove unnecessary test for __mptcp_init_sock()")
39880bd808ad ("mptcp: get rid of msk->subflow")
3f326a821b99 ("mptcp: change the mpc check helper to return a sk")
3aa362494170 ("mptcp: avoid ssock usage in mptcp_pm_nl_create_listen_socket()")
f0bc514bd5c1 ("mptcp: avoid additional indirection in sockopt")
40f56d0c7043 ("mptcp: avoid additional indirection in mptcp_listen()")
8cf2ebdc0078 ("mptcp: mptcp: avoid additional indirection in mptcp_bind()")
ccae357c1c6a ("mptcp: avoid additional __inet_stream_connect() call")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3f83d8a77eeeb47011b990fd766a421ee64f1d73 Mon Sep 17 00:00:00 2001
From: Paolo Abeni <pabeni(a)redhat.com>
Date: Thu, 8 Feb 2024 19:03:51 +0100
Subject: [PATCH] mptcp: fix more tx path fields initialization
The 'msk->write_seq' and 'msk->snd_nxt' fields are always updated under
the msk socket lock, except at MPC handshake completion time.
This builds on the previous commit to move such init under the relevant
lock.
There are no known problems caused by the potential race; the
primary goal is consistency.
Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets")
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni(a)redhat.com>
Reviewed-by: Mat Martineau <martineau(a)kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 7632eafb683b..8cb6a873dae9 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -3478,10 +3478,8 @@ void mptcp_finish_connect(struct sock *ssk)
* accessing the field below
*/
WRITE_ONCE(msk->local_key, subflow->local_key);
- WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
- WRITE_ONCE(msk->snd_nxt, msk->write_seq);
- WRITE_ONCE(msk->snd_una, msk->write_seq);
- WRITE_ONCE(msk->wnd_end, msk->snd_nxt + tcp_sk(ssk)->snd_wnd);
+ WRITE_ONCE(msk->snd_una, subflow->idsn + 1);
+ WRITE_ONCE(msk->wnd_end, subflow->idsn + 1 + tcp_sk(ssk)->snd_wnd);
mptcp_pm_new_connection(msk, ssk, 0);
}
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 56b2ac2f2f22..c2df34ebcf28 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -421,12 +421,21 @@ static bool subflow_use_different_dport(struct mptcp_sock *msk, const struct soc
void __mptcp_sync_state(struct sock *sk, int state)
{
+ struct mptcp_subflow_context *subflow;
struct mptcp_sock *msk = mptcp_sk(sk);
+ struct sock *ssk = msk->first;
- __mptcp_propagate_sndbuf(sk, msk->first);
+ subflow = mptcp_subflow_ctx(ssk);
+ __mptcp_propagate_sndbuf(sk, ssk);
if (!msk->rcvspace_init)
- mptcp_rcv_space_init(msk, msk->first);
+ mptcp_rcv_space_init(msk, ssk);
+
if (sk->sk_state == TCP_SYN_SENT) {
+ /* subflow->idsn is always available is TCP_SYN_SENT state,
+ * even for the FASTOPEN scenarios
+ */
+ WRITE_ONCE(msk->write_seq, subflow->idsn + 1);
+ WRITE_ONCE(msk->snd_nxt, msk->write_seq);
mptcp_set_state(sk, state);
sk->sk_state_change(sk);
}
Hi Larry,
> -----Original Message-----
> From: Larry Finger <Larry.Finger(a)gmail.com>
> Sent: Tuesday, February 27, 2024 10:35 AM
> To: Kalle Valo <kvalo(a)kernel.org>
> Cc: Johannes Berg <johannes(a)sipsolutions.net>; linux-wireless(a)vger.kernel.org; Nick Morrow
> <morrownr(a)gmail.com>; Larry Finger <Larry.Finger(a)lwfinger.net>; Ping-Ke Shih <pkshih(a)realtek.com>;
> stable(a)vger.kernel.org
> Subject: [PATCHi V2] wifi: rtw88: Add missing VID/PIDs doe 8811CU and 8821CU
Not sure if "doe" is typo?
>
> From: Nick Morrow <morrownr(a)gmail.com>
>
> Purpose: Add VID/PIDs that are known to be missing for this driver.
> - removed /* 8811CU */ and /* 8821CU */ as they are redundant
> since the file is specific to those chips.
> - removed /* TOTOLINK A650UA v3 */ as the manufacturer. It has a REALTEK
> VID so it may not be specific to this adapter.
>
> Source is
> https://1EHFQ.trk.elasticemail.com/tracking/click?d=I82H0YR_W_h175Lb3Nkb0D8…
> 0SPxd1Olp3PNJEJTqsu4kyqBXayE0BVd_k7uLFvlTe65Syx2uqLUB-UQSfsKKLkuyE-frMZXSCL7q824UG3Oer614GGEeEz-DNEWHh
> 43p_e8oz7OouS6gRBEng0
> Verified and tested.
>
> Signed-off-by: Nick Morrow <morrownr(a)gmail.com>
> Signed-off-by: Larry Finger <Larry.Finger(a)lwfinger.net>
> Acked-by: Ping-Ke Shih <pkshih(a)realtek.com>
>
Did you keep this blank line intentionally?
> Cc: stable(a)vger.kernel.org