Commit d0ceea662d45 ("x86/mm: Add _PAGE_NOPTISHADOW bit to avoid
updating userspace page tables") started setting _PAGE_NOPTISHADOW (bit
58) in identity-map PTEs created by kernel_ident_mapping_init(). In
configure_5level_paging(), when transitioning from 5-level to 4-level
paging, we copy the first p4d to become the new (4-level) pgd by
dereferencing the first entry in the current pgd, but we only mask off
the low 12 bits of that entry.
When kexecing, the previous kernel sets up an identity map using
kernel_ident_mapping_init() before jumping to the new kernel, so the
new kernel inherits a pgd whose entries have bit 58 set. Masking off
only the low 12 bits then leaves bit 58 in place, yielding a
non-canonical address, and the subsequent copy from it faults.
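For illustration, a minimal userspace sketch (not kernel code; the
46-bit physical mask and the sample entry value are assumptions) of why
masking with PAGE_MASK keeps bit 58 while PTE_PFN_MASK strips it:
```c
/*
 * Minimal userspace sketch (not kernel code) of the masking difference.
 * The 46-bit physical mask and the sample pgd entry are assumptions for
 * illustration only; the point is that PAGE_MASK keeps bit 58 while
 * PTE_PFN_MASK keeps only the PFN bits.
 */
#include <stdint.h>
#include <stdio.h>

#define PAGE_MASK		(~0xfffULL)
#define PHYSICAL_MASK		((1ULL << 46) - 1)	/* assumed MAX_PHYSMEM_BITS */
#define PTE_PFN_MASK		(PHYSICAL_MASK & PAGE_MASK)
#define _PAGE_NOPTISHADOW	(1ULL << 58)

int main(void)
{
	/* pgd entry: page table at 0x1234000 plus flag bits and bit 58 */
	uint64_t pgd_entry = 0x1234000ULL | _PAGE_NOPTISHADOW | 0x67;

	printf("& PAGE_MASK    = %#llx  (bit 58 still set -> non-canonical)\n",
	       (unsigned long long)(pgd_entry & PAGE_MASK));
	printf("& PTE_PFN_MASK = %#llx  (clean physical address)\n",
	       (unsigned long long)(pgd_entry & PTE_PFN_MASK));
	return 0;
}
```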
Fixes: d0ceea662d45 ("x86/mm: Add _PAGE_NOPTISHADOW bit to avoid updating userspace page tables")
Signed-off-by: Eric Hagberg <ehagberg(a)janestreet.com>
Cc: stable(a)vger.kernel.org
---
arch/x86/boot/compressed/pgtable_64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/boot/compressed/pgtable_64.c b/arch/x86/boot/compressed/pgtable_64.c
index bdd26050dff7..c785c5d54a11 100644
--- a/arch/x86/boot/compressed/pgtable_64.c
+++ b/arch/x86/boot/compressed/pgtable_64.c
@@ -180,7 +180,7 @@ asmlinkage void configure_5level_paging(struct boot_params *bp, void *pgtable)
* We cannot just point to the page table from trampoline as it
* may be above 4G.
*/
- src = *(unsigned long *)__native_read_cr3() & PAGE_MASK;
+ src = *(unsigned long *)__native_read_cr3() & PTE_PFN_MASK;
memcpy(trampoline_32bit, (void *)src, PAGE_SIZE);
}
--
2.43.7
On the SoCFPGA/Sodia board, the MDIO bus cannot be probed, so the PHY
cannot be found and the network device does not work:
```
stmmaceth ff702000.ethernet eth0: __stmmac_open: Cannot attach to PHY (error: -19)
```
To probe the MDIO bus, add "snps,dwmac-mdio" as the compatible string
of the MDIO bus node. Also, the PHY on this board is connected at
address 4, so change the PHY address from 0 to 4.
Cc: stable(a)vger.kernel.org # 6.3+
Signed-off-by: Nobuhiro Iwamatsu <iwamatsu(a)nigauri.org>
---
v2: Update commit message from 'ID' to 'address'.
Drop Fixes tag, because that commit is not the cause.
arch/arm/boot/dts/intel/socfpga/socfpga_cyclone5_sodia.dts | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/arm/boot/dts/intel/socfpga/socfpga_cyclone5_sodia.dts b/arch/arm/boot/dts/intel/socfpga/socfpga_cyclone5_sodia.dts
index ce0d6514eeb571..e4794ccb8e413f 100644
--- a/arch/arm/boot/dts/intel/socfpga/socfpga_cyclone5_sodia.dts
+++ b/arch/arm/boot/dts/intel/socfpga/socfpga_cyclone5_sodia.dts
@@ -66,8 +66,10 @@ &gmac1 {
mdio0 {
#address-cells = <1>;
#size-cells = <0>;
- phy0: ethernet-phy@0 {
- reg = <0>;
+ compatible = "snps,dwmac-mdio";
+
+ phy0: ethernet-phy@4 {
+ reg = <4>;
rxd0-skew-ps = <0>;
rxd1-skew-ps = <0>;
rxd2-skew-ps = <0>;
--
2.45.2
From: Sumit Kumar <sumk(a)qti.qualcomm.com>
The current implementation of mhi_ep_read_channel(), in the case of
chained transactions, assumes the End of Transfer (EOT) bit is received
with the doorbell. As a result, it may incorrectly advance
mhi_chan->rd_offset beyond wr_offset during host-to-device transfers
when EOT has not yet arrived. This can lead to accesses of unmapped
host memory, causing IOMMU faults and processing of stale TREs.
Modify the loop condition to ensure the transfer ring is not empty,
allowing the function to process only valid TREs up to the current
write pointer. This prevents premature reads and ensures safe traversal
of chained TREs.
Remove buf_left from the while loop condition to avoid exiting
prematurely before the ring has been read completely.
Remove write_offset since it is always zero: a new cache buffer is
allocated on every iteration.
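As a rough model of the revised loop structure, here is a minimal
self-contained sketch (not driver code; the struct, helpers and ring
layout are illustrative) of reading until the transfer ring is empty:
```c
/*
 * Minimal self-contained model (not driver code) of the revised read
 * loop: keep consuming TREs while the host still has elements queued in
 * the transfer ring, instead of stopping on buf_left/EOT, which can
 * race with a doorbell that arrives before the EOT element.
 */
#include <stdbool.h>
#include <stddef.h>

struct ring_model {
	size_t rd_offset;	/* device read pointer */
	size_t wr_offset;	/* host write pointer */
	size_t ring_size;
};

static bool ring_is_empty(const struct ring_model *r)
{
	return r->rd_offset == r->wr_offset;
}

static int read_channel_model(struct ring_model *r)
{
	do {
		/* ...read one TRE's worth of data into a fresh buffer... */
		r->rd_offset = (r->rd_offset + 1) % r->ring_size;
		/* Read until the ring becomes empty */
	} while (!ring_is_empty(r));

	return 0;
}

int main(void)
{
	struct ring_model r = { .rd_offset = 0, .wr_offset = 3, .ring_size = 8 };

	return read_channel_model(&r);
}
```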
Fixes: 5301258899773 ("bus: mhi: ep: Add support for reading from the host")
Cc: stable(a)vger.kernel.org
Co-developed-by: Akhil Vinod <akhvin(a)qti.qualcomm.com>
Signed-off-by: Akhil Vinod <akhvin(a)qti.qualcomm.com>
Signed-off-by: Sumit Kumar <sumk(a)qti.qualcomm.com>
---
Changes in v2:
- Use mhi_ep_queue_is_empty in while loop (Mani).
- Remove do while loop in mhi_ep_process_ch_ring (Mani).
- Remove buf_left, wr_offset, tr_done.
- Haven't added Reviewed-by as there is change in logic.
- Link to v1: https://lore.kernel.org/r/20250709-chained_transfer-v1-1-2326a4605c9c@quici…
---
drivers/bus/mhi/ep/main.c | 37 ++++++++++++-------------------------
1 file changed, 12 insertions(+), 25 deletions(-)
diff --git a/drivers/bus/mhi/ep/main.c b/drivers/bus/mhi/ep/main.c
index b3eafcf2a2c50d95e3efd3afb27038ecf55552a5..cdea24e9291959ae0a92487c1b9698dc8164d2f1 100644
--- a/drivers/bus/mhi/ep/main.c
+++ b/drivers/bus/mhi/ep/main.c
@@ -403,17 +403,13 @@ static int mhi_ep_read_channel(struct mhi_ep_cntrl *mhi_cntrl,
{
struct mhi_ep_chan *mhi_chan = &mhi_cntrl->mhi_chan[ring->ch_id];
struct device *dev = &mhi_cntrl->mhi_dev->dev;
- size_t tr_len, read_offset, write_offset;
+ size_t tr_len, read_offset;
struct mhi_ep_buf_info buf_info = {};
u32 len = MHI_EP_DEFAULT_MTU;
struct mhi_ring_element *el;
- bool tr_done = false;
void *buf_addr;
- u32 buf_left;
int ret;
- buf_left = len;
-
do {
/* Don't process the transfer ring if the channel is not in RUNNING state */
if (mhi_chan->state != MHI_CH_STATE_RUNNING) {
@@ -426,24 +422,23 @@ static int mhi_ep_read_channel(struct mhi_ep_cntrl *mhi_cntrl,
/* Check if there is data pending to be read from previous read operation */
if (mhi_chan->tre_bytes_left) {
dev_dbg(dev, "TRE bytes remaining: %u\n", mhi_chan->tre_bytes_left);
- tr_len = min(buf_left, mhi_chan->tre_bytes_left);
+ tr_len = min(len, mhi_chan->tre_bytes_left);
} else {
mhi_chan->tre_loc = MHI_TRE_DATA_GET_PTR(el);
mhi_chan->tre_size = MHI_TRE_DATA_GET_LEN(el);
mhi_chan->tre_bytes_left = mhi_chan->tre_size;
- tr_len = min(buf_left, mhi_chan->tre_size);
+ tr_len = min(len, mhi_chan->tre_size);
}
read_offset = mhi_chan->tre_size - mhi_chan->tre_bytes_left;
- write_offset = len - buf_left;
buf_addr = kmem_cache_zalloc(mhi_cntrl->tre_buf_cache, GFP_KERNEL);
if (!buf_addr)
return -ENOMEM;
buf_info.host_addr = mhi_chan->tre_loc + read_offset;
- buf_info.dev_addr = buf_addr + write_offset;
+ buf_info.dev_addr = buf_addr;
buf_info.size = tr_len;
buf_info.cb = mhi_ep_read_completion;
buf_info.cb_buf = buf_addr;
@@ -459,16 +454,12 @@ static int mhi_ep_read_channel(struct mhi_ep_cntrl *mhi_cntrl,
goto err_free_buf_addr;
}
- buf_left -= tr_len;
mhi_chan->tre_bytes_left -= tr_len;
- if (!mhi_chan->tre_bytes_left) {
- if (MHI_TRE_DATA_GET_IEOT(el))
- tr_done = true;
-
+ if (!mhi_chan->tre_bytes_left)
mhi_chan->rd_offset = (mhi_chan->rd_offset + 1) % ring->ring_size;
- }
- } while (buf_left && !tr_done);
+	/* Read until the ring becomes empty */
+ } while (!mhi_ep_queue_is_empty(mhi_chan->mhi_dev, DMA_TO_DEVICE));
return 0;
@@ -502,15 +493,11 @@ static int mhi_ep_process_ch_ring(struct mhi_ep_ring *ring)
mhi_chan->xfer_cb(mhi_chan->mhi_dev, &result);
} else {
/* UL channel */
- do {
- ret = mhi_ep_read_channel(mhi_cntrl, ring);
- if (ret < 0) {
- dev_err(&mhi_chan->mhi_dev->dev, "Failed to read channel\n");
- return ret;
- }
-
- /* Read until the ring becomes empty */
- } while (!mhi_ep_queue_is_empty(mhi_chan->mhi_dev, DMA_TO_DEVICE));
+ ret = mhi_ep_read_channel(mhi_cntrl, ring);
+ if (ret < 0) {
+ dev_err(&mhi_chan->mhi_dev->dev, "Failed to read channel\n");
+ return ret;
+ }
}
return 0;
---
base-commit: 4c06e63b92038fadb566b652ec3ec04e228931e8
change-id: 20250709-chained_transfer-0b95f8afa487
Best regards,
--
Sumit Kumar <quic_sumk(a)quicinc.com>
Fix the lpaif_type configuration for the I2S interface.
The proper lpaif interface type is required to allow the DSP to vote
for the appropriate clock settings for the I2S interface. Also fix the
CPU DAI format misconfiguration encountered during I2S stream setup.
Tested on the Lemans EVK platform.
changes in [v3]:
- Added Cc: <stable(a)vger.kernel.org>, suggested by Srinivas Kandagatla.
- Added QUINARY MI2S case in patch 3, suggested by Konrad.
- Link to V2: https://lore.kernel.org/lkml/20250905104020.2463473-1-mohammad.rafi.shaik@o…
changes in [v2]:
- Used snd_soc_dai_set_fmt() API to set the current clock settings,
instead of the default WS source setting, as suggested by Srinivas Kandagatla.
- Link to V1: https://lore.kernel.org/lkml/20250822171440.2040324-1-mohammad.rafi.shaik@o…
Mohammad Rafi Shaik (3):
ASoC: qcom: audioreach: Fix lpaif_type configuration for the I2S
interface
ASoC: qcom: q6apm-lpass-dais: Fix missing set_fmt DAI op for I2S
ASoC: qcom: sc8280xp: Fix DAI format setting for MI2S interfaces
sound/soc/qcom/qdsp6/audioreach.c | 1 +
sound/soc/qcom/qdsp6/q6apm-lpass-dais.c | 1 +
sound/soc/qcom/sc8280xp.c | 4 ++++
3 files changed, 6 insertions(+)
base-commit: be5d4872e528796df9d7425f2bd9b3893eb3a42c
--
2.34.1
Commit ff90afa75573 ("KVM: x86: Evaluate latched_init in
KVM_SET_VCPU_EVENTS when vCPU not in SMM") changed the
KVM_SET_VCPU_EVENTS handler to set the pending LAPIC INIT event
regardless of whether the vCPU is in SMM mode or not.
However, latching INIT without checking the CPU state introduces a race
condition which can cause the loss of an INIT event. This is fatal
during VM startup because it can leave an AP that never switches to
non-root mode. As commit f4ef19108608 ("KVM: X86: Fix loss of
pending INIT due to race") described:
    BSP                                 AP
                             kvm_vcpu_ioctl_x86_get_vcpu_events
                               events->smi.latched_init = 0
                             kvm_vcpu_block
                               kvm_vcpu_check_block
                               schedule
send INIT to AP
                             kvm_vcpu_ioctl_x86_set_vcpu_events
                             (e.g. `info registers -a` when VM starts/reboots)
                               if (events->smi.latched_init == 0)
                                 clear INIT in pending_events
                             kvm_apic_accept_events
                               test_bit(KVM_APIC_INIT, &pe) == false
                               vcpu->arch.mp_state maintains UNINITIALIZED
send SIPI to AP
                             kvm_apic_accept_events
                               test_bit(KVM_APIC_SIPI, &pe) == false
                               vcpu->arch.mp_state never becomes RUNNABLE
                               (defying the expected UNINITIALIZED =>
                                INIT_RECEIVED => RUNNABLE transitions)
                             AP will never switch to non-root operation
As a result of this race, the VM hangs: e.g. the BSP loops in SeaBIOS's
SMPLock, the AP is never reset, and the QEMU HMP command
"info registers -a" shows:
CPU#0
EAX=00000002 EBX=00000002 ECX=00000000 EDX=00020000
ESI=00000000 EDI=00000000 EBP=00000008 ESP=00006c6c
EIP=000ef570 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
......
CPU#1
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00080660
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
......
Fix this by handling latched INITs only in specific CPU states (SMM,
VMX non-root mode, SVM with GIF=0) in KVM_SET_VCPU_EVENTS.
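As a rough illustration, here is a minimal userspace model (not KVM
code; the struct and helpers below are assumptions standing in for
kvm_apic_init_sipi_allowed() and the events ioctl) of the intended
decision:
```c
/*
 * Illustrative userspace model (not KVM code) of the intended
 * behaviour: userspace's latched_init value should only override the
 * pending INIT when the vCPU is currently blocking INIT/SIPI (SMM,
 * VMX non-root mode, or SVM with GIF=0). Otherwise an INIT sent by the
 * BSP while userspace reads/writes the vCPU events could be lost.
 */
#include <stdbool.h>
#include <stdio.h>

struct vcpu_model {
	bool in_smm;
	bool vmx_nonroot;
	bool gif_set;
	bool pending_init;
};

/* Mirrors the idea behind kvm_apic_init_sipi_allowed() */
static bool init_sipi_allowed(const struct vcpu_model *v)
{
	return !v->in_smm && !v->vmx_nonroot && v->gif_set;
}

static void set_vcpu_events_model(struct vcpu_model *v, bool latched_init)
{
	/* Only touch the latch while INIT/SIPI delivery is blocked */
	if (!init_sipi_allowed(v))
		v->pending_init = latched_init;
}

int main(void)
{
	struct vcpu_model ap = { .gif_set = true, .pending_init = true };

	/* Racing "info registers -a" style ioctl with latched_init == 0 */
	set_vcpu_events_model(&ap, false);
	printf("pending INIT preserved: %d\n", ap.pending_init);	/* prints 1 */
	return 0;
}
```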
Cc: stable(a)vger.kernel.org
Fixes: ff90afa75573 ("KVM: x86: Evaluate latched_init in KVM_SET_VCPU_EVENTS when vCPU not in SMM")
Signed-off-by: Fei Li <lifei.shirley(a)bytedance.com>
---
arch/x86/kvm/x86.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a1c49bc681c46..7001b2af00ed1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5556,7 +5556,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
return -EINVAL;
#endif
- if (lapic_in_kernel(vcpu)) {
+ if (!kvm_apic_init_sipi_allowed(vcpu) && lapic_in_kernel(vcpu)) {
if (events->smi.latched_init)
set_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
else
--
2.39.2 (Apple Git-143)
Using an OOB offset past the end of the available OOB data is invalid,
regardless of whether 'ooblen' is set in the ops or not. Move the
relevant check out of the if statement so it is always performed.
The 'oobtest' module executes four tests to verify how reading/writing
OOB data past the end of the device is handled. It expects these tests
to fail with errors, but this expectation is not met by the last two
tests on MTD devices which have no OOB bytes available.
This shows up in the test output as follows:
[ 212.059416] mtd_oobtest: attempting to write past end of device
[ 212.060379] mtd_oobtest: an error is expected...
[ 212.066353] mtd_oobtest: error: wrote past end of device
[ 212.071142] mtd_oobtest: attempting to read past end of device
[ 212.076507] mtd_oobtest: an error is expected...
[ 212.082080] mtd_oobtest: error: read past end of device
...
[ 212.330508] mtd_oobtest: finished with 2 errors
For reference, here is the corresponding code from the oobtest module:
/* Attempt to write off end of device */
ops.mode = MTD_OPS_AUTO_OOB;
ops.len = 0;
ops.retlen = 0;
ops.ooblen = mtd->oobavail;
ops.oobretlen = 0;
ops.ooboffs = 1;
ops.datbuf = NULL;
ops.oobbuf = writebuf;
pr_info("attempting to write past end of device\n");
pr_info("an error is expected...\n");
err = mtd_write_oob(mtd, mtd->size - mtd->writesize, &ops);
if (err) {
pr_info("error occurred as expected\n");
} else {
pr_err("error: wrote past end of device\n");
errcnt += 1;
}
As can be seen, the code sets 'ooboffs' to 1 and 'ooblen' to
mtd->oobavail, which is zero in our case.
Since mtd_check_oob_ops() only verifies 'ooboffs' when 'ooblen' is not
zero, the 'ooboffs' value does not get validated and the function
returns success where it should fail.
After the change, the oobtest module bails out early with an error if
there are no OOB bytes available on the MTD device under test:
# cat /sys/class/mtd/mtd0/oobavail
0
# insmod mtd_test; insmod mtd_oobtest dev=0
[ 943.606228]
[ 943.606259] =================================================
[ 943.606784] mtd_oobtest: MTD device: 0
[ 943.612660] mtd_oobtest: MTD device size 524288, eraseblock size 131072, page size 2048, count of eraseblocks 4, pages per eraseblock 64, OOB size 128
[ 943.616091] mtd_test: scanning for bad eraseblocks
[ 943.629571] mtd_test: scanned 4 eraseblocks, 0 are bad
[ 943.634313] mtd_oobtest: test 1 of 5
[ 943.653402] mtd_oobtest: writing OOBs of whole device
[ 943.653424] mtd_oobtest: error: writeoob failed at 0x0
[ 943.657419] mtd_oobtest: error: use_len 0, use_offset 0
[ 943.662493] mtd_oobtest: error -22 occurred
[ 943.667574] =================================================
This behaviour is more accurate than the current one, where most tests
indicate successful writing of OOB data even though nothing is actually
written to the device, which is quite misleading.
Cc: stable(a)vger.kernel.org
Fixes: 5cdd929da53d ("mtd: Add sanity checks in mtd_write/read_oob()")
Reviewed-by: Daniel Golle <daniel(a)makrotopia.org>
Signed-off-by: Gabor Juhos <j4g8y7(a)gmail.com>
---
Changes in v2:
- add Reviewed-by tag from Daniel
- add stable and Fixes tags
- Link to v1: https://lore.kernel.org/r/20250831-mtd-validate-ooboffs-v1-1-d3fdce7a8698@g…
---
drivers/mtd/mtdcore.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 5ba9a741f5ac3c297ae21329c2827baf5dc471f0..9a3c9f163219bcb9fde66839f228fd8d38310f2d 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -1590,12 +1590,12 @@ static int mtd_check_oob_ops(struct mtd_info *mtd, loff_t offs,
if (offs < 0 || offs + ops->len > mtd->size)
return -EINVAL;
+ if (ops->ooboffs >= mtd_oobavail(mtd, ops))
+ return -EINVAL;
+
if (ops->ooblen) {
size_t maxooblen;
- if (ops->ooboffs >= mtd_oobavail(mtd, ops))
- return -EINVAL;
-
maxooblen = ((size_t)(mtd_div_by_ws(mtd->size, mtd) -
mtd_div_by_ws(offs, mtd)) *
mtd_oobavail(mtd, ops)) - ops->ooboffs;
---
base-commit: 1b237f190eb3d36f52dffe07a40b5eb210280e00
change-id: 20250831-mtd-validate-ooboffs-e35c796540fe
Best regards,
--
Gabor Juhos <j4g8y7(a)gmail.com>
The ready event list of an epoll object is protected by a read-write
lock (ep->lock):
- The consumer (waiter) acquires the write lock and takes items.
- The producer (waker) takes the read lock and adds items.
The point of this design is to let epoll scale well with a large number
of producers, as multiple producers can hold the read lock at the same
time.
Unfortunately, this implementation may cause a scheduling priority
inversion problem. Suppose the consumer has a higher scheduling priority
than the producer. The consumer needs to acquire the write lock, but may
be blocked by the producer holding the read lock. Since the read-write
lock does not support priority boosting for the readers (even with
CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a
higher-priority consumer is blocked by a lower-priority producer. This
problem was reported in [1].
Furthermore, this could also cause a stall problem, as described in [2].
Fix this problem by replacing the rwlock with a spinlock.
This reduces the event bandwidth, as the producers now have to contend with
each other for the spinlock. According to the benchmark from
https://github.com/rouming/test-tools/blob/master/stress-epoll.c:
On 12 x86 CPUs:
  threads    Before (events/ms)    After (events/ms)    Diff
        8                  7162                 4956    -31%
       16                  8733                 5383    -38%
       32                  7968                 5572    -30%
       64                 10652                 5739    -46%
      128                 11236                 5931    -47%
On 4 riscv CPUs:
  threads    Before (events/ms)    After (events/ms)    Diff
        8                  2958                 2833     -4%
       16                  3323                 3097     -7%
       32                  3451                 3240     -6%
       64                  3554                 3178    -11%
      128                  3601                 3235    -10%
Although the numbers look bad, it should be noted that this benchmark
creates multiple threads that do nothing except constantly generate new
epoll events, so contention on the spinlock is high. For real workloads,
the event rate is likely much lower, and the performance drop is not as
bad.
Using another benchmark (perf bench epoll wait) where spinlock
contention is lower, an improvement is even observed on x86:
On 12 x86 CPUs:
Before: Averaged 110279 operations/sec (+- 1.09%), total secs = 8
After: Averaged 114577 operations/sec (+- 2.25%), total secs = 8
On 4 riscv CPUs:
Before: Averaged 175767 operations/sec (+- 0.62%), total secs = 8
After: Averaged 167396 operations/sec (+- 0.23%), total secs = 8
In conclusion, no one is likely to be upset by this change. After all,
a spinlock was used originally for years, and the commit which converted
it to an rwlock didn't mention a real workload, just that the benchmark
numbers are nice.
This patch is not an exact revert of commit a218cc491420 ("epoll: use
rwlock in order to reduce ep_poll_callback() contention"), because a git
revert conflicts in some places where the resolution is not obvious.
Since this patch is intended to be backported, go with the obvious
approach:
- Replace rwlock_t with spinlock_t one to one
- Delete list_add_tail_lockless() and chain_epi_lockless(). These were
introduced to allow producers to concurrently add items to the list.
But now that the spinlock no longer allows producers to touch the event
list concurrently, these two functions are no longer needed.
Fixes: a218cc491420 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
Signed-off-by: Nam Cao <namcao(a)linutronix.de>
Cc: stable(a)vger.kernel.org
---
fs/eventpoll.c | 139 +++++++++----------------------------------------
1 file changed, 26 insertions(+), 113 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 0fbf5dfedb24..a171f7e7dacc 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -46,10 +46,10 @@
*
* 1) epnested_mutex (mutex)
* 2) ep->mtx (mutex)
- * 3) ep->lock (rwlock)
+ * 3) ep->lock (spinlock)
*
* The acquire order is the one listed above, from 1 to 3.
- * We need a rwlock (ep->lock) because we manipulate objects
+ * We need a spinlock (ep->lock) because we manipulate objects
* from inside the poll callback, that might be triggered from
* a wake_up() that in turn might be called from IRQ context.
* So we can't sleep inside the poll callback and hence we need
@@ -195,7 +195,7 @@ struct eventpoll {
struct list_head rdllist;
/* Lock which protects rdllist and ovflist */
- rwlock_t lock;
+ spinlock_t lock;
/* RB tree root used to store monitored fd structs */
struct rb_root_cached rbr;
@@ -740,10 +740,10 @@ static void ep_start_scan(struct eventpoll *ep, struct list_head *txlist)
* in a lockless way.
*/
lockdep_assert_irqs_enabled();
- write_lock_irq(&ep->lock);
+ spin_lock_irq(&ep->lock);
list_splice_init(&ep->rdllist, txlist);
WRITE_ONCE(ep->ovflist, NULL);
- write_unlock_irq(&ep->lock);
+ spin_unlock_irq(&ep->lock);
}
static void ep_done_scan(struct eventpoll *ep,
@@ -751,7 +751,7 @@ static void ep_done_scan(struct eventpoll *ep,
{
struct epitem *epi, *nepi;
- write_lock_irq(&ep->lock);
+ spin_lock_irq(&ep->lock);
/*
* During the time we spent inside the "sproc" callback, some
* other events might have been queued by the poll callback.
@@ -792,7 +792,7 @@ static void ep_done_scan(struct eventpoll *ep,
wake_up(&ep->wq);
}
- write_unlock_irq(&ep->lock);
+ spin_unlock_irq(&ep->lock);
}
static void ep_get(struct eventpoll *ep)
@@ -867,10 +867,10 @@ static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
rb_erase_cached(&epi->rbn, &ep->rbr);
- write_lock_irq(&ep->lock);
+ spin_lock_irq(&ep->lock);
if (ep_is_linked(epi))
list_del_init(&epi->rdllink);
- write_unlock_irq(&ep->lock);
+ spin_unlock_irq(&ep->lock);
wakeup_source_unregister(ep_wakeup_source(epi));
/*
@@ -1151,7 +1151,7 @@ static int ep_alloc(struct eventpoll **pep)
return -ENOMEM;
mutex_init(&ep->mtx);
- rwlock_init(&ep->lock);
+ spin_lock_init(&ep->lock);
init_waitqueue_head(&ep->wq);
init_waitqueue_head(&ep->poll_wait);
INIT_LIST_HEAD(&ep->rdllist);
@@ -1238,100 +1238,10 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd,
}
#endif /* CONFIG_KCMP */
-/*
- * Adds a new entry to the tail of the list in a lockless way, i.e.
- * multiple CPUs are allowed to call this function concurrently.
- *
- * Beware: it is necessary to prevent any other modifications of the
- * existing list until all changes are completed, in other words
- * concurrent list_add_tail_lockless() calls should be protected
- * with a read lock, where write lock acts as a barrier which
- * makes sure all list_add_tail_lockless() calls are fully
- * completed.
- *
- * Also an element can be locklessly added to the list only in one
- * direction i.e. either to the tail or to the head, otherwise
- * concurrent access will corrupt the list.
- *
- * Return: %false if element has been already added to the list, %true
- * otherwise.
- */
-static inline bool list_add_tail_lockless(struct list_head *new,
- struct list_head *head)
-{
- struct list_head *prev;
-
- /*
- * This is simple 'new->next = head' operation, but cmpxchg()
- * is used in order to detect that same element has been just
- * added to the list from another CPU: the winner observes
- * new->next == new.
- */
- if (!try_cmpxchg(&new->next, &new, head))
- return false;
-
- /*
- * Initially ->next of a new element must be updated with the head
- * (we are inserting to the tail) and only then pointers are atomically
- * exchanged. XCHG guarantees memory ordering, thus ->next should be
- * updated before pointers are actually swapped and pointers are
- * swapped before prev->next is updated.
- */
-
- prev = xchg(&head->prev, new);
-
- /*
- * It is safe to modify prev->next and new->prev, because a new element
- * is added only to the tail and new->next is updated before XCHG.
- */
-
- prev->next = new;
- new->prev = prev;
-
- return true;
-}
-
-/*
- * Chains a new epi entry to the tail of the ep->ovflist in a lockless way,
- * i.e. multiple CPUs are allowed to call this function concurrently.
- *
- * Return: %false if epi element has been already chained, %true otherwise.
- */
-static inline bool chain_epi_lockless(struct epitem *epi)
-{
- struct eventpoll *ep = epi->ep;
-
- /* Fast preliminary check */
- if (epi->next != EP_UNACTIVE_PTR)
- return false;
-
- /* Check that the same epi has not been just chained from another CPU */
- if (cmpxchg(&epi->next, EP_UNACTIVE_PTR, NULL) != EP_UNACTIVE_PTR)
- return false;
-
- /* Atomically exchange tail */
- epi->next = xchg(&ep->ovflist, epi);
-
- return true;
-}
-
/*
* This is the callback that is passed to the wait queue wakeup
* mechanism. It is called by the stored file descriptors when they
* have events to report.
- *
- * This callback takes a read lock in order not to contend with concurrent
- * events from another file descriptor, thus all modifications to ->rdllist
- * or ->ovflist are lockless. Read lock is paired with the write lock from
- * ep_start/done_scan(), which stops all list modifications and guarantees
- * that lists state is seen correctly.
- *
- * Another thing worth to mention is that ep_poll_callback() can be called
- * concurrently for the same @epi from different CPUs if poll table was inited
- * with several wait queues entries. Plural wakeup from different CPUs of a
- * single wait queue is serialized by wq.lock, but the case when multiple wait
- * queues are used should be detected accordingly. This is detected using
- * cmpxchg() operation.
*/
static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, void *key)
{
@@ -1342,7 +1252,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
unsigned long flags;
int ewake = 0;
- read_lock_irqsave(&ep->lock, flags);
+ spin_lock_irqsave(&ep->lock, flags);
ep_set_busy_poll_napi_id(epi);
@@ -1371,12 +1281,15 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
* chained in ep->ovflist and requeued later on.
*/
if (READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR) {
- if (chain_epi_lockless(epi))
+ if (epi->next == EP_UNACTIVE_PTR) {
+ epi->next = READ_ONCE(ep->ovflist);
+ WRITE_ONCE(ep->ovflist, epi);
ep_pm_stay_awake_rcu(epi);
+ }
} else if (!ep_is_linked(epi)) {
/* In the usual case, add event to ready list. */
- if (list_add_tail_lockless(&epi->rdllink, &ep->rdllist))
- ep_pm_stay_awake_rcu(epi);
+ list_add_tail(&epi->rdllink, &ep->rdllist);
+ ep_pm_stay_awake_rcu(epi);
}
/*
@@ -1409,7 +1322,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
pwake++;
out_unlock:
- read_unlock_irqrestore(&ep->lock, flags);
+ spin_unlock_irqrestore(&ep->lock, flags);
/* We have to call this outside the lock */
if (pwake)
@@ -1744,7 +1657,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
}
/* We have to drop the new item inside our item list to keep track of it */
- write_lock_irq(&ep->lock);
+ spin_lock_irq(&ep->lock);
/* record NAPI ID of new item if present */
ep_set_busy_poll_napi_id(epi);
@@ -1761,7 +1674,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
pwake++;
}
- write_unlock_irq(&ep->lock);
+ spin_unlock_irq(&ep->lock);
/* We have to call this outside the lock */
if (pwake)
@@ -1825,7 +1738,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
* list, push it inside.
*/
if (ep_item_poll(epi, &pt, 1)) {
- write_lock_irq(&ep->lock);
+ spin_lock_irq(&ep->lock);
if (!ep_is_linked(epi)) {
list_add_tail(&epi->rdllink, &ep->rdllist);
ep_pm_stay_awake(epi);
@@ -1836,7 +1749,7 @@ static int ep_modify(struct eventpoll *ep, struct epitem *epi,
if (waitqueue_active(&ep->poll_wait))
pwake++;
}
- write_unlock_irq(&ep->lock);
+ spin_unlock_irq(&ep->lock);
}
/* We have to call this outside the lock */
@@ -2088,7 +2001,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
init_wait(&wait);
wait.func = ep_autoremove_wake_function;
- write_lock_irq(&ep->lock);
+ spin_lock_irq(&ep->lock);
/*
* Barrierless variant, waitqueue_active() is called under
* the same lock on wakeup ep_poll_callback() side, so it
@@ -2107,7 +2020,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
if (!eavail)
__add_wait_queue_exclusive(&ep->wq, &wait);
- write_unlock_irq(&ep->lock);
+ spin_unlock_irq(&ep->lock);
if (!eavail)
timed_out = !ep_schedule_timeout(to) ||
@@ -2123,7 +2036,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
eavail = 1;
if (!list_empty_careful(&wait.entry)) {
- write_lock_irq(&ep->lock);
+ spin_lock_irq(&ep->lock);
/*
* If the thread timed out and is not on the wait queue,
* it means that the thread was woken up after its
@@ -2134,7 +2047,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
if (timed_out)
eavail = list_empty(&wait.entry);
__remove_wait_queue(&ep->wq, &wait);
- write_unlock_irq(&ep->lock);
+ spin_unlock_irq(&ep->lock);
}
}
}
--
2.39.5