The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
592642e6b11e ("scsi: qedf: Populate sysfs attributes for vport")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 592642e6b11e620e4b43189f8072752429fc8dc3 Mon Sep 17 00:00:00 2001
From: Saurav Kashyap <skashyap(a)marvell.com>
Date: Mon, 19 Sep 2022 06:44:34 -0700
Subject: [PATCH] scsi: qedf: Populate sysfs attributes for vport
Few vport parameters were displayed by systool as 'Unknown' or 'NULL'.
Copy speed, supported_speed, frame_size and update port_type for NPIV port.
Link: https://lore.kernel.org/r/20220919134434.3513-1-njavali@marvell.com
Cc: stable(a)vger.kernel.org
Tested-by: Guangwu Zhang <guazhang(a)redhat.com>
Reviewed-by: John Meneghini <jmeneghi(a)redhat.com>
Signed-off-by: Saurav Kashyap <skashyap(a)marvell.com>
Signed-off-by: Nilesh Javali <njavali(a)marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/qedf/qedf_main.c b/drivers/scsi/qedf/qedf_main.c
index 3d6b137314f3..cc6d9decf62c 100644
--- a/drivers/scsi/qedf/qedf_main.c
+++ b/drivers/scsi/qedf/qedf_main.c
@@ -1921,6 +1921,27 @@ static int qedf_vport_create(struct fc_vport *vport, bool disabled)
fc_vport_setlink(vn_port);
}
+ /* Set symbolic node name */
+ if (base_qedf->pdev->device == QL45xxx)
+ snprintf(fc_host_symbolic_name(vn_port->host), 256,
+ "Marvell FastLinQ 45xxx FCoE v%s", QEDF_VERSION);
+
+ if (base_qedf->pdev->device == QL41xxx)
+ snprintf(fc_host_symbolic_name(vn_port->host), 256,
+ "Marvell FastLinQ 41xxx FCoE v%s", QEDF_VERSION);
+
+ /* Set supported speed */
+ fc_host_supported_speeds(vn_port->host) = n_port->link_supported_speeds;
+
+ /* Set speed */
+ vn_port->link_speed = n_port->link_speed;
+
+ /* Set port type */
+ fc_host_port_type(vn_port->host) = FC_PORTTYPE_NPIV;
+
+ /* Set maxframe size */
+ fc_host_maxframe_size(vn_port->host) = n_port->mfs;
+
QEDF_INFO(&(base_qedf->dbg_ctx), QEDF_LOG_NPIV, "vn_port=%p.\n",
vn_port);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
592642e6b11e ("scsi: qedf: Populate sysfs attributes for vport")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 592642e6b11e620e4b43189f8072752429fc8dc3 Mon Sep 17 00:00:00 2001
From: Saurav Kashyap <skashyap(a)marvell.com>
Date: Mon, 19 Sep 2022 06:44:34 -0700
Subject: [PATCH] scsi: qedf: Populate sysfs attributes for vport
Few vport parameters were displayed by systool as 'Unknown' or 'NULL'.
Copy speed, supported_speed, frame_size and update port_type for NPIV port.
Link: https://lore.kernel.org/r/20220919134434.3513-1-njavali@marvell.com
Cc: stable(a)vger.kernel.org
Tested-by: Guangwu Zhang <guazhang(a)redhat.com>
Reviewed-by: John Meneghini <jmeneghi(a)redhat.com>
Signed-off-by: Saurav Kashyap <skashyap(a)marvell.com>
Signed-off-by: Nilesh Javali <njavali(a)marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/qedf/qedf_main.c b/drivers/scsi/qedf/qedf_main.c
index 3d6b137314f3..cc6d9decf62c 100644
--- a/drivers/scsi/qedf/qedf_main.c
+++ b/drivers/scsi/qedf/qedf_main.c
@@ -1921,6 +1921,27 @@ static int qedf_vport_create(struct fc_vport *vport, bool disabled)
fc_vport_setlink(vn_port);
}
+ /* Set symbolic node name */
+ if (base_qedf->pdev->device == QL45xxx)
+ snprintf(fc_host_symbolic_name(vn_port->host), 256,
+ "Marvell FastLinQ 45xxx FCoE v%s", QEDF_VERSION);
+
+ if (base_qedf->pdev->device == QL41xxx)
+ snprintf(fc_host_symbolic_name(vn_port->host), 256,
+ "Marvell FastLinQ 41xxx FCoE v%s", QEDF_VERSION);
+
+ /* Set supported speed */
+ fc_host_supported_speeds(vn_port->host) = n_port->link_supported_speeds;
+
+ /* Set speed */
+ vn_port->link_speed = n_port->link_speed;
+
+ /* Set port type */
+ fc_host_port_type(vn_port->host) = FC_PORTTYPE_NPIV;
+
+ /* Set maxframe size */
+ fc_host_maxframe_size(vn_port->host) = n_port->mfs;
+
QEDF_INFO(&(base_qedf->dbg_ctx), QEDF_LOG_NPIV, "vn_port=%p.\n",
vn_port);
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
3b7610302a75 ("fs: dlm: fix possible use after free if tracing")
7a3de7324c2b ("fs: dlm: trace user space callbacks")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3b7610302a75fc1032a6c9462862bec6948f85c9 Mon Sep 17 00:00:00 2001
From: Alexander Aring <aahringo(a)redhat.com>
Date: Thu, 1 Sep 2022 12:05:32 -0400
Subject: [PATCH] fs: dlm: fix possible use after free if tracing
This patch fixes a possible use after free if tracing for the specific
event is enabled. To avoid the use after free we introduce a out_put
label like all other user lock specific requests and safe in a boolean
to do a put or not which depends on the execution path of
dlm_user_request().
Cc: stable(a)vger.kernel.org
Fixes: 7a3de7324c2b ("fs: dlm: trace user space callbacks")
Reported-by: Dan Carpenter <dan.carpenter(a)oracle.com>
Signed-off-by: Alexander Aring <aahringo(a)redhat.com>
Signed-off-by: David Teigland <teigland(a)redhat.com>
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index c830feb26384..94a72ede5764 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -5835,6 +5835,7 @@ int dlm_user_request(struct dlm_ls *ls, struct dlm_user_args *ua,
{
struct dlm_lkb *lkb;
struct dlm_args args;
+ bool do_put = true;
int error;
dlm_lock_recovery(ls);
@@ -5851,9 +5852,8 @@ int dlm_user_request(struct dlm_ls *ls, struct dlm_user_args *ua,
ua->lksb.sb_lvbptr = kzalloc(DLM_USER_LVB_LEN, GFP_NOFS);
if (!ua->lksb.sb_lvbptr) {
kfree(ua);
- __put_lkb(ls, lkb);
error = -ENOMEM;
- goto out_trace_end;
+ goto out_put;
}
}
#ifdef CONFIG_DLM_DEPRECATED_API
@@ -5867,8 +5867,7 @@ int dlm_user_request(struct dlm_ls *ls, struct dlm_user_args *ua,
kfree(ua->lksb.sb_lvbptr);
ua->lksb.sb_lvbptr = NULL;
kfree(ua);
- __put_lkb(ls, lkb);
- goto out_trace_end;
+ goto out_put;
}
/* After ua is attached to lkb it will be freed by dlm_free_lkb().
@@ -5887,8 +5886,7 @@ int dlm_user_request(struct dlm_ls *ls, struct dlm_user_args *ua,
error = 0;
fallthrough;
default:
- __put_lkb(ls, lkb);
- goto out_trace_end;
+ goto out_put;
}
/* add this new lkb to the per-process list of locks */
@@ -5896,8 +5894,11 @@ int dlm_user_request(struct dlm_ls *ls, struct dlm_user_args *ua,
hold_lkb(lkb);
list_add_tail(&lkb->lkb_ownqueue, &ua->proc->locks);
spin_unlock(&ua->proc->locks_spin);
- out_trace_end:
+ do_put = false;
+ out_put:
trace_dlm_lock_end(ls, lkb, name, namelen, mode, flags, error, false);
+ if (do_put)
+ __put_lkb(ls, lkb);
out:
dlm_unlock_recovery(ls);
return error;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7c7f9bc986e6 ("serial: Deassert Transmit Enable on probe in driver-specific way")
60f361722ad2 ("serial: fsl_lpuart: Reset prior to registration")
44b27aec9d96 ("serial: core, 8250: set RS485 termination GPIO in serial core")
ae50bb275283 ("serial: take termios_rwsem for ->rs485_config() & pass termios as param")
84f2faa7852e ("serial: 8250: Remove serial_rs485 sanitization from em485")
7195eefb38d7 ("serial: fsl_lpuart: Call core's sanitization and remove custom one")
596a9171472b ("serial: Clear rs485 struct when non-RS485 mode is set")
be2e2cb1d281 ("serial: Sanitize rs485_struct")
59c221f8e126 ("serial: 8250_exar: Fill in rs485_supported")
2dbd0c14ebe8 ("serial: Move serial_rs485 sanitization into separate function")
8322b1f52715 ("serial: Add uart_rs485_config()")
d6da35e0c6d5 ("Merge 5.18-rc7 into usb-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7c7f9bc986e698873b489c371a08f206979d06b7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Thu, 22 Sep 2022 18:27:33 +0200
Subject: [PATCH] serial: Deassert Transmit Enable on probe in driver-specific
way
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When a UART port is newly registered, uart_configure_port() seeks to
deassert RS485 Transmit Enable by setting the RTS bit in port->mctrl.
However a number of UART drivers interpret a set RTS bit as *assertion*
instead of deassertion: Affected drivers include those using
serial8250_em485_config() (except 8250_bcm2835aux.c) and some using
mctrl_gpio (e.g. imx.c).
Since the interpretation of the RTS bit is driver-specific, it is not
suitable as a means to centrally deassert Transmit Enable in the serial
core. Instead, the serial core must call on drivers to deassert it in
their driver-specific way. One way to achieve that is to call
->rs485_config(). It implicitly deasserts Transmit Enable.
So amend uart_configure_port() and uart_resume_port() to invoke
uart_rs485_config(). That allows removing calls to uart_rs485_config()
from drivers' ->probe() hooks and declaring the function static.
Skip any invocation of ->set_mctrl() if RS485 is enabled. RS485 has no
hardware flow control, so the modem control lines are irrelevant and
need not be touched. When leaving RS485 mode, reset the modem control
lines to the state stored in port->mctrl. That way, UARTs which are
muxed between RS485 and RS232 transceivers drive the lines correctly
when switched to RS232. (serial8250_do_startup() historically raises
the OUT1 modem signal because otherwise interrupts are not signaled on
ancient PC UARTs, but I believe that no longer applies to modern,
RS485-capable UARTs and is thus safe to be skipped.)
imx.c modifies port->mctrl whenever Transmit Enable is asserted and
deasserted. Stop it from doing that so port->mctrl reflects the RS232
line state.
8250_omap.c deasserts Transmit Enable on ->runtime_resume() by calling
->set_mctrl(). Because that is now a no-op in RS485 mode, amend the
function to call serial8250_em485_stop_tx().
fsl_lpuart.c retrieves and applies the RS485 device tree properties
after registering the UART port. Because applying now happens on
registration in uart_configure_port(), move retrieval of the properties
ahead of uart_add_one_port().
Link: https://lore.kernel.org/all/20220329085050.311408-1-matthias.schiffer@ew.tq…
Link: https://lore.kernel.org/all/8f538a8903795f22f9acc94a9a31b03c9c4ccacb.camel@…
Fixes: d3b3404df318 ("serial: Fix incorrect rs485 polarity on uart open")
Cc: stable(a)vger.kernel.org # v4.14+
Reported-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reported-by: Roosen Henri <Henri.Roosen(a)ginzinger.com>
Tested-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Link: https://lore.kernel.org/r/2de36eba3fbe11278d5002e4e501afe0ceaca039.16638638…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
index 871f8a050513..41b8c6b27136 100644
--- a/drivers/tty/serial/8250/8250_omap.c
+++ b/drivers/tty/serial/8250/8250_omap.c
@@ -342,6 +342,9 @@ static void omap8250_restore_regs(struct uart_8250_port *up)
omap8250_update_mdr1(up, priv);
up->port.ops->set_mctrl(&up->port, up->port.mctrl);
+
+ if (up->port.rs485.flags & SER_RS485_ENABLED)
+ serial8250_em485_stop_tx(up);
}
/*
diff --git a/drivers/tty/serial/8250/8250_pci.c b/drivers/tty/serial/8250/8250_pci.c
index 0052cf899e29..8e9f247590bd 100644
--- a/drivers/tty/serial/8250/8250_pci.c
+++ b/drivers/tty/serial/8250/8250_pci.c
@@ -1632,7 +1632,6 @@ static int pci_fintek_init(struct pci_dev *dev)
resource_size_t bar_data[3];
u8 config_base;
struct serial_private *priv = pci_get_drvdata(dev);
- struct uart_8250_port *port;
if (!(pci_resource_flags(dev, 5) & IORESOURCE_IO) ||
!(pci_resource_flags(dev, 4) & IORESOURCE_IO) ||
@@ -1679,13 +1678,7 @@ static int pci_fintek_init(struct pci_dev *dev)
pci_write_config_byte(dev, config_base + 0x06, dev->irq);
- if (priv) {
- /* re-apply RS232/485 mode when
- * pciserial_resume_ports()
- */
- port = serial8250_get_port(priv->line[i]);
- uart_rs485_config(&port->port);
- } else {
+ if (!priv) {
/* First init without port data
* force init to RS232 Mode
*/
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 6a3c2ead0f17..dfdd2db95cb3 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -600,7 +600,7 @@ EXPORT_SYMBOL_GPL(serial8250_rpm_put);
static int serial8250_em485_init(struct uart_8250_port *p)
{
if (p->em485)
- return 0;
+ goto deassert_rts;
p->em485 = kmalloc(sizeof(struct uart_8250_em485), GFP_ATOMIC);
if (!p->em485)
@@ -616,7 +616,9 @@ static int serial8250_em485_init(struct uart_8250_port *p)
p->em485->active_timer = NULL;
p->em485->tx_stopped = true;
- p->rs485_stop_tx(p);
+deassert_rts:
+ if (p->em485->tx_stopped)
+ p->rs485_stop_tx(p);
return 0;
}
@@ -2048,6 +2050,9 @@ EXPORT_SYMBOL_GPL(serial8250_do_set_mctrl);
static void serial8250_set_mctrl(struct uart_port *port, unsigned int mctrl)
{
+ if (port->rs485.flags & SER_RS485_ENABLED)
+ return;
+
if (port->set_mctrl)
port->set_mctrl(port, mctrl);
else
@@ -3192,9 +3197,6 @@ static void serial8250_config_port(struct uart_port *port, int flags)
if (flags & UART_CONFIG_TYPE)
autoconfig(up);
- if (port->rs485.flags & SER_RS485_ENABLED)
- uart_rs485_config(port);
-
/* if access method is AU, it is a 16550 with a quirk */
if (port->type == PORT_16550A && port->iotype == UPIO_AU)
up->bugs |= UART_BUG_NOMSR;
diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index fadb49086102..67fa113f77d4 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -2726,15 +2726,13 @@ static int lpuart_probe(struct platform_device *pdev)
if (ret)
goto failed_reset;
- ret = uart_add_one_port(&lpuart_reg, &sport->port);
- if (ret)
- goto failed_attach_port;
-
ret = uart_get_rs485_mode(&sport->port);
if (ret)
goto failed_get_rs485;
- uart_rs485_config(&sport->port);
+ ret = uart_add_one_port(&lpuart_reg, &sport->port);
+ if (ret)
+ goto failed_attach_port;
ret = devm_request_irq(&pdev->dev, sport->port.irq, handler, 0,
DRIVER_NAME, sport);
@@ -2744,9 +2742,9 @@ static int lpuart_probe(struct platform_device *pdev)
return 0;
failed_irq_request:
-failed_get_rs485:
uart_remove_one_port(&lpuart_reg, &sport->port);
failed_attach_port:
+failed_get_rs485:
failed_reset:
lpuart_disable_clks(sport);
return ret;
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index 5875ee66492b..05b432dc7a85 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -380,8 +380,7 @@ static void imx_uart_rts_active(struct imx_port *sport, u32 *ucr2)
{
*ucr2 &= ~(UCR2_CTSC | UCR2_CTS);
- sport->port.mctrl |= TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl | TIOCM_RTS);
}
/* called with port.lock taken and irqs caller dependent */
@@ -390,8 +389,7 @@ static void imx_uart_rts_inactive(struct imx_port *sport, u32 *ucr2)
*ucr2 &= ~UCR2_CTSC;
*ucr2 |= UCR2_CTS;
- sport->port.mctrl &= ~TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl & ~TIOCM_RTS);
}
static void start_hrtimer_ms(struct hrtimer *hrt, unsigned long msec)
@@ -2347,8 +2345,6 @@ static int imx_uart_probe(struct platform_device *pdev)
dev_err(&pdev->dev,
"low-active RTS not possible when receiver is off, enabling receiver\n");
- uart_rs485_config(&sport->port);
-
/* Disable interrupts before requesting them */
ucr1 = imx_uart_readl(sport, UCR1);
ucr1 &= ~(UCR1_ADEN | UCR1_TRDYEN | UCR1_IDEN | UCR1_RRDYEN | UCR1_RTSDEN);
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index c7113182275a..179ee199df34 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -158,15 +158,10 @@ uart_update_mctrl(struct uart_port *port, unsigned int set, unsigned int clear)
unsigned long flags;
unsigned int old;
- if (port->rs485.flags & SER_RS485_ENABLED) {
- set &= ~TIOCM_RTS;
- clear &= ~TIOCM_RTS;
- }
-
spin_lock_irqsave(&port->lock, flags);
old = port->mctrl;
port->mctrl = (old & ~clear) | set;
- if (old != port->mctrl)
+ if (old != port->mctrl && !(port->rs485.flags & SER_RS485_ENABLED))
port->ops->set_mctrl(port, port->mctrl);
spin_unlock_irqrestore(&port->lock, flags);
}
@@ -1391,7 +1386,7 @@ static void uart_set_rs485_termination(struct uart_port *port,
!!(rs485->flags & SER_RS485_TERMINATE_BUS));
}
-int uart_rs485_config(struct uart_port *port)
+static int uart_rs485_config(struct uart_port *port)
{
struct serial_rs485 *rs485 = &port->rs485;
int ret;
@@ -1405,7 +1400,6 @@ int uart_rs485_config(struct uart_port *port)
return ret;
}
-EXPORT_SYMBOL_GPL(uart_rs485_config);
static int uart_get_rs485_config(struct uart_port *port,
struct serial_rs485 __user *rs485)
@@ -1444,8 +1438,13 @@ static int uart_set_rs485_config(struct tty_struct *tty, struct uart_port *port,
spin_lock_irqsave(&port->lock, flags);
ret = port->rs485_config(port, &tty->termios, &rs485);
- if (!ret)
+ if (!ret) {
port->rs485 = rs485;
+
+ /* Reset RTS and other mctrl lines when disabling RS485 */
+ if (!(rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ }
spin_unlock_irqrestore(&port->lock, flags);
if (ret)
return ret;
@@ -2352,7 +2351,8 @@ int uart_suspend_port(struct uart_driver *drv, struct uart_port *uport)
spin_lock_irq(&uport->lock);
ops->stop_tx(uport);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
/* save mctrl so it can be restored on resume */
mctrl = uport->mctrl;
uport->mctrl = 0;
@@ -2440,7 +2440,8 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
uart_change_pm(state, UART_PM_STATE_ON);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
spin_unlock_irq(&uport->lock);
if (console_suspend_enabled || !uart_console(uport)) {
/* Protected by port mutex for now */
@@ -2451,7 +2452,10 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
if (tty)
uart_change_speed(tty, state, NULL);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, uport->mctrl);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, uport->mctrl);
+ else
+ uart_rs485_config(uport);
ops->start_tx(uport);
spin_unlock_irq(&uport->lock);
tty_port_set_initialized(port, 1);
@@ -2558,10 +2562,10 @@ uart_configure_port(struct uart_driver *drv, struct uart_state *state,
*/
spin_lock_irqsave(&port->lock, flags);
port->mctrl &= TIOCM_DTR;
- if (port->rs485.flags & SER_RS485_ENABLED &&
- !(port->rs485.flags & SER_RS485_RTS_AFTER_SEND))
- port->mctrl |= TIOCM_RTS;
- port->ops->set_mctrl(port, port->mctrl);
+ if (!(port->rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ else
+ uart_rs485_config(port);
spin_unlock_irqrestore(&port->lock, flags);
/*
diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
index b4c21f84ad79..d657f2a42a7b 100644
--- a/include/linux/serial_core.h
+++ b/include/linux/serial_core.h
@@ -951,5 +951,4 @@ static inline int uart_handle_break(struct uart_port *port)
!((cflag) & CLOCAL))
int uart_get_rs485_mode(struct uart_port *port);
-int uart_rs485_config(struct uart_port *port);
#endif /* LINUX_SERIAL_CORE_H */
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7c7f9bc986e6 ("serial: Deassert Transmit Enable on probe in driver-specific way")
60f361722ad2 ("serial: fsl_lpuart: Reset prior to registration")
44b27aec9d96 ("serial: core, 8250: set RS485 termination GPIO in serial core")
ae50bb275283 ("serial: take termios_rwsem for ->rs485_config() & pass termios as param")
84f2faa7852e ("serial: 8250: Remove serial_rs485 sanitization from em485")
7195eefb38d7 ("serial: fsl_lpuart: Call core's sanitization and remove custom one")
596a9171472b ("serial: Clear rs485 struct when non-RS485 mode is set")
be2e2cb1d281 ("serial: Sanitize rs485_struct")
59c221f8e126 ("serial: 8250_exar: Fill in rs485_supported")
2dbd0c14ebe8 ("serial: Move serial_rs485 sanitization into separate function")
8322b1f52715 ("serial: Add uart_rs485_config()")
d6da35e0c6d5 ("Merge 5.18-rc7 into usb-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7c7f9bc986e698873b489c371a08f206979d06b7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Thu, 22 Sep 2022 18:27:33 +0200
Subject: [PATCH] serial: Deassert Transmit Enable on probe in driver-specific
way
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When a UART port is newly registered, uart_configure_port() seeks to
deassert RS485 Transmit Enable by setting the RTS bit in port->mctrl.
However a number of UART drivers interpret a set RTS bit as *assertion*
instead of deassertion: Affected drivers include those using
serial8250_em485_config() (except 8250_bcm2835aux.c) and some using
mctrl_gpio (e.g. imx.c).
Since the interpretation of the RTS bit is driver-specific, it is not
suitable as a means to centrally deassert Transmit Enable in the serial
core. Instead, the serial core must call on drivers to deassert it in
their driver-specific way. One way to achieve that is to call
->rs485_config(). It implicitly deasserts Transmit Enable.
So amend uart_configure_port() and uart_resume_port() to invoke
uart_rs485_config(). That allows removing calls to uart_rs485_config()
from drivers' ->probe() hooks and declaring the function static.
Skip any invocation of ->set_mctrl() if RS485 is enabled. RS485 has no
hardware flow control, so the modem control lines are irrelevant and
need not be touched. When leaving RS485 mode, reset the modem control
lines to the state stored in port->mctrl. That way, UARTs which are
muxed between RS485 and RS232 transceivers drive the lines correctly
when switched to RS232. (serial8250_do_startup() historically raises
the OUT1 modem signal because otherwise interrupts are not signaled on
ancient PC UARTs, but I believe that no longer applies to modern,
RS485-capable UARTs and is thus safe to be skipped.)
imx.c modifies port->mctrl whenever Transmit Enable is asserted and
deasserted. Stop it from doing that so port->mctrl reflects the RS232
line state.
8250_omap.c deasserts Transmit Enable on ->runtime_resume() by calling
->set_mctrl(). Because that is now a no-op in RS485 mode, amend the
function to call serial8250_em485_stop_tx().
fsl_lpuart.c retrieves and applies the RS485 device tree properties
after registering the UART port. Because applying now happens on
registration in uart_configure_port(), move retrieval of the properties
ahead of uart_add_one_port().
Link: https://lore.kernel.org/all/20220329085050.311408-1-matthias.schiffer@ew.tq…
Link: https://lore.kernel.org/all/8f538a8903795f22f9acc94a9a31b03c9c4ccacb.camel@…
Fixes: d3b3404df318 ("serial: Fix incorrect rs485 polarity on uart open")
Cc: stable(a)vger.kernel.org # v4.14+
Reported-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reported-by: Roosen Henri <Henri.Roosen(a)ginzinger.com>
Tested-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Link: https://lore.kernel.org/r/2de36eba3fbe11278d5002e4e501afe0ceaca039.16638638…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
index 871f8a050513..41b8c6b27136 100644
--- a/drivers/tty/serial/8250/8250_omap.c
+++ b/drivers/tty/serial/8250/8250_omap.c
@@ -342,6 +342,9 @@ static void omap8250_restore_regs(struct uart_8250_port *up)
omap8250_update_mdr1(up, priv);
up->port.ops->set_mctrl(&up->port, up->port.mctrl);
+
+ if (up->port.rs485.flags & SER_RS485_ENABLED)
+ serial8250_em485_stop_tx(up);
}
/*
diff --git a/drivers/tty/serial/8250/8250_pci.c b/drivers/tty/serial/8250/8250_pci.c
index 0052cf899e29..8e9f247590bd 100644
--- a/drivers/tty/serial/8250/8250_pci.c
+++ b/drivers/tty/serial/8250/8250_pci.c
@@ -1632,7 +1632,6 @@ static int pci_fintek_init(struct pci_dev *dev)
resource_size_t bar_data[3];
u8 config_base;
struct serial_private *priv = pci_get_drvdata(dev);
- struct uart_8250_port *port;
if (!(pci_resource_flags(dev, 5) & IORESOURCE_IO) ||
!(pci_resource_flags(dev, 4) & IORESOURCE_IO) ||
@@ -1679,13 +1678,7 @@ static int pci_fintek_init(struct pci_dev *dev)
pci_write_config_byte(dev, config_base + 0x06, dev->irq);
- if (priv) {
- /* re-apply RS232/485 mode when
- * pciserial_resume_ports()
- */
- port = serial8250_get_port(priv->line[i]);
- uart_rs485_config(&port->port);
- } else {
+ if (!priv) {
/* First init without port data
* force init to RS232 Mode
*/
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 6a3c2ead0f17..dfdd2db95cb3 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -600,7 +600,7 @@ EXPORT_SYMBOL_GPL(serial8250_rpm_put);
static int serial8250_em485_init(struct uart_8250_port *p)
{
if (p->em485)
- return 0;
+ goto deassert_rts;
p->em485 = kmalloc(sizeof(struct uart_8250_em485), GFP_ATOMIC);
if (!p->em485)
@@ -616,7 +616,9 @@ static int serial8250_em485_init(struct uart_8250_port *p)
p->em485->active_timer = NULL;
p->em485->tx_stopped = true;
- p->rs485_stop_tx(p);
+deassert_rts:
+ if (p->em485->tx_stopped)
+ p->rs485_stop_tx(p);
return 0;
}
@@ -2048,6 +2050,9 @@ EXPORT_SYMBOL_GPL(serial8250_do_set_mctrl);
static void serial8250_set_mctrl(struct uart_port *port, unsigned int mctrl)
{
+ if (port->rs485.flags & SER_RS485_ENABLED)
+ return;
+
if (port->set_mctrl)
port->set_mctrl(port, mctrl);
else
@@ -3192,9 +3197,6 @@ static void serial8250_config_port(struct uart_port *port, int flags)
if (flags & UART_CONFIG_TYPE)
autoconfig(up);
- if (port->rs485.flags & SER_RS485_ENABLED)
- uart_rs485_config(port);
-
/* if access method is AU, it is a 16550 with a quirk */
if (port->type == PORT_16550A && port->iotype == UPIO_AU)
up->bugs |= UART_BUG_NOMSR;
diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index fadb49086102..67fa113f77d4 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -2726,15 +2726,13 @@ static int lpuart_probe(struct platform_device *pdev)
if (ret)
goto failed_reset;
- ret = uart_add_one_port(&lpuart_reg, &sport->port);
- if (ret)
- goto failed_attach_port;
-
ret = uart_get_rs485_mode(&sport->port);
if (ret)
goto failed_get_rs485;
- uart_rs485_config(&sport->port);
+ ret = uart_add_one_port(&lpuart_reg, &sport->port);
+ if (ret)
+ goto failed_attach_port;
ret = devm_request_irq(&pdev->dev, sport->port.irq, handler, 0,
DRIVER_NAME, sport);
@@ -2744,9 +2742,9 @@ static int lpuart_probe(struct platform_device *pdev)
return 0;
failed_irq_request:
-failed_get_rs485:
uart_remove_one_port(&lpuart_reg, &sport->port);
failed_attach_port:
+failed_get_rs485:
failed_reset:
lpuart_disable_clks(sport);
return ret;
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index 5875ee66492b..05b432dc7a85 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -380,8 +380,7 @@ static void imx_uart_rts_active(struct imx_port *sport, u32 *ucr2)
{
*ucr2 &= ~(UCR2_CTSC | UCR2_CTS);
- sport->port.mctrl |= TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl | TIOCM_RTS);
}
/* called with port.lock taken and irqs caller dependent */
@@ -390,8 +389,7 @@ static void imx_uart_rts_inactive(struct imx_port *sport, u32 *ucr2)
*ucr2 &= ~UCR2_CTSC;
*ucr2 |= UCR2_CTS;
- sport->port.mctrl &= ~TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl & ~TIOCM_RTS);
}
static void start_hrtimer_ms(struct hrtimer *hrt, unsigned long msec)
@@ -2347,8 +2345,6 @@ static int imx_uart_probe(struct platform_device *pdev)
dev_err(&pdev->dev,
"low-active RTS not possible when receiver is off, enabling receiver\n");
- uart_rs485_config(&sport->port);
-
/* Disable interrupts before requesting them */
ucr1 = imx_uart_readl(sport, UCR1);
ucr1 &= ~(UCR1_ADEN | UCR1_TRDYEN | UCR1_IDEN | UCR1_RRDYEN | UCR1_RTSDEN);
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index c7113182275a..179ee199df34 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -158,15 +158,10 @@ uart_update_mctrl(struct uart_port *port, unsigned int set, unsigned int clear)
unsigned long flags;
unsigned int old;
- if (port->rs485.flags & SER_RS485_ENABLED) {
- set &= ~TIOCM_RTS;
- clear &= ~TIOCM_RTS;
- }
-
spin_lock_irqsave(&port->lock, flags);
old = port->mctrl;
port->mctrl = (old & ~clear) | set;
- if (old != port->mctrl)
+ if (old != port->mctrl && !(port->rs485.flags & SER_RS485_ENABLED))
port->ops->set_mctrl(port, port->mctrl);
spin_unlock_irqrestore(&port->lock, flags);
}
@@ -1391,7 +1386,7 @@ static void uart_set_rs485_termination(struct uart_port *port,
!!(rs485->flags & SER_RS485_TERMINATE_BUS));
}
-int uart_rs485_config(struct uart_port *port)
+static int uart_rs485_config(struct uart_port *port)
{
struct serial_rs485 *rs485 = &port->rs485;
int ret;
@@ -1405,7 +1400,6 @@ int uart_rs485_config(struct uart_port *port)
return ret;
}
-EXPORT_SYMBOL_GPL(uart_rs485_config);
static int uart_get_rs485_config(struct uart_port *port,
struct serial_rs485 __user *rs485)
@@ -1444,8 +1438,13 @@ static int uart_set_rs485_config(struct tty_struct *tty, struct uart_port *port,
spin_lock_irqsave(&port->lock, flags);
ret = port->rs485_config(port, &tty->termios, &rs485);
- if (!ret)
+ if (!ret) {
port->rs485 = rs485;
+
+ /* Reset RTS and other mctrl lines when disabling RS485 */
+ if (!(rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ }
spin_unlock_irqrestore(&port->lock, flags);
if (ret)
return ret;
@@ -2352,7 +2351,8 @@ int uart_suspend_port(struct uart_driver *drv, struct uart_port *uport)
spin_lock_irq(&uport->lock);
ops->stop_tx(uport);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
/* save mctrl so it can be restored on resume */
mctrl = uport->mctrl;
uport->mctrl = 0;
@@ -2440,7 +2440,8 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
uart_change_pm(state, UART_PM_STATE_ON);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
spin_unlock_irq(&uport->lock);
if (console_suspend_enabled || !uart_console(uport)) {
/* Protected by port mutex for now */
@@ -2451,7 +2452,10 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
if (tty)
uart_change_speed(tty, state, NULL);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, uport->mctrl);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, uport->mctrl);
+ else
+ uart_rs485_config(uport);
ops->start_tx(uport);
spin_unlock_irq(&uport->lock);
tty_port_set_initialized(port, 1);
@@ -2558,10 +2562,10 @@ uart_configure_port(struct uart_driver *drv, struct uart_state *state,
*/
spin_lock_irqsave(&port->lock, flags);
port->mctrl &= TIOCM_DTR;
- if (port->rs485.flags & SER_RS485_ENABLED &&
- !(port->rs485.flags & SER_RS485_RTS_AFTER_SEND))
- port->mctrl |= TIOCM_RTS;
- port->ops->set_mctrl(port, port->mctrl);
+ if (!(port->rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ else
+ uart_rs485_config(port);
spin_unlock_irqrestore(&port->lock, flags);
/*
diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
index b4c21f84ad79..d657f2a42a7b 100644
--- a/include/linux/serial_core.h
+++ b/include/linux/serial_core.h
@@ -951,5 +951,4 @@ static inline int uart_handle_break(struct uart_port *port)
!((cflag) & CLOCAL))
int uart_get_rs485_mode(struct uart_port *port);
-int uart_rs485_config(struct uart_port *port);
#endif /* LINUX_SERIAL_CORE_H */
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7c7f9bc986e6 ("serial: Deassert Transmit Enable on probe in driver-specific way")
60f361722ad2 ("serial: fsl_lpuart: Reset prior to registration")
44b27aec9d96 ("serial: core, 8250: set RS485 termination GPIO in serial core")
ae50bb275283 ("serial: take termios_rwsem for ->rs485_config() & pass termios as param")
84f2faa7852e ("serial: 8250: Remove serial_rs485 sanitization from em485")
7195eefb38d7 ("serial: fsl_lpuart: Call core's sanitization and remove custom one")
596a9171472b ("serial: Clear rs485 struct when non-RS485 mode is set")
be2e2cb1d281 ("serial: Sanitize rs485_struct")
59c221f8e126 ("serial: 8250_exar: Fill in rs485_supported")
2dbd0c14ebe8 ("serial: Move serial_rs485 sanitization into separate function")
8322b1f52715 ("serial: Add uart_rs485_config()")
d6da35e0c6d5 ("Merge 5.18-rc7 into usb-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7c7f9bc986e698873b489c371a08f206979d06b7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Thu, 22 Sep 2022 18:27:33 +0200
Subject: [PATCH] serial: Deassert Transmit Enable on probe in driver-specific
way
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When a UART port is newly registered, uart_configure_port() seeks to
deassert RS485 Transmit Enable by setting the RTS bit in port->mctrl.
However a number of UART drivers interpret a set RTS bit as *assertion*
instead of deassertion: Affected drivers include those using
serial8250_em485_config() (except 8250_bcm2835aux.c) and some using
mctrl_gpio (e.g. imx.c).
Since the interpretation of the RTS bit is driver-specific, it is not
suitable as a means to centrally deassert Transmit Enable in the serial
core. Instead, the serial core must call on drivers to deassert it in
their driver-specific way. One way to achieve that is to call
->rs485_config(). It implicitly deasserts Transmit Enable.
So amend uart_configure_port() and uart_resume_port() to invoke
uart_rs485_config(). That allows removing calls to uart_rs485_config()
from drivers' ->probe() hooks and declaring the function static.
Skip any invocation of ->set_mctrl() if RS485 is enabled. RS485 has no
hardware flow control, so the modem control lines are irrelevant and
need not be touched. When leaving RS485 mode, reset the modem control
lines to the state stored in port->mctrl. That way, UARTs which are
muxed between RS485 and RS232 transceivers drive the lines correctly
when switched to RS232. (serial8250_do_startup() historically raises
the OUT1 modem signal because otherwise interrupts are not signaled on
ancient PC UARTs, but I believe that no longer applies to modern,
RS485-capable UARTs and is thus safe to be skipped.)
imx.c modifies port->mctrl whenever Transmit Enable is asserted and
deasserted. Stop it from doing that so port->mctrl reflects the RS232
line state.
8250_omap.c deasserts Transmit Enable on ->runtime_resume() by calling
->set_mctrl(). Because that is now a no-op in RS485 mode, amend the
function to call serial8250_em485_stop_tx().
fsl_lpuart.c retrieves and applies the RS485 device tree properties
after registering the UART port. Because applying now happens on
registration in uart_configure_port(), move retrieval of the properties
ahead of uart_add_one_port().
Link: https://lore.kernel.org/all/20220329085050.311408-1-matthias.schiffer@ew.tq…
Link: https://lore.kernel.org/all/8f538a8903795f22f9acc94a9a31b03c9c4ccacb.camel@…
Fixes: d3b3404df318 ("serial: Fix incorrect rs485 polarity on uart open")
Cc: stable(a)vger.kernel.org # v4.14+
Reported-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reported-by: Roosen Henri <Henri.Roosen(a)ginzinger.com>
Tested-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Link: https://lore.kernel.org/r/2de36eba3fbe11278d5002e4e501afe0ceaca039.16638638…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
index 871f8a050513..41b8c6b27136 100644
--- a/drivers/tty/serial/8250/8250_omap.c
+++ b/drivers/tty/serial/8250/8250_omap.c
@@ -342,6 +342,9 @@ static void omap8250_restore_regs(struct uart_8250_port *up)
omap8250_update_mdr1(up, priv);
up->port.ops->set_mctrl(&up->port, up->port.mctrl);
+
+ if (up->port.rs485.flags & SER_RS485_ENABLED)
+ serial8250_em485_stop_tx(up);
}
/*
diff --git a/drivers/tty/serial/8250/8250_pci.c b/drivers/tty/serial/8250/8250_pci.c
index 0052cf899e29..8e9f247590bd 100644
--- a/drivers/tty/serial/8250/8250_pci.c
+++ b/drivers/tty/serial/8250/8250_pci.c
@@ -1632,7 +1632,6 @@ static int pci_fintek_init(struct pci_dev *dev)
resource_size_t bar_data[3];
u8 config_base;
struct serial_private *priv = pci_get_drvdata(dev);
- struct uart_8250_port *port;
if (!(pci_resource_flags(dev, 5) & IORESOURCE_IO) ||
!(pci_resource_flags(dev, 4) & IORESOURCE_IO) ||
@@ -1679,13 +1678,7 @@ static int pci_fintek_init(struct pci_dev *dev)
pci_write_config_byte(dev, config_base + 0x06, dev->irq);
- if (priv) {
- /* re-apply RS232/485 mode when
- * pciserial_resume_ports()
- */
- port = serial8250_get_port(priv->line[i]);
- uart_rs485_config(&port->port);
- } else {
+ if (!priv) {
/* First init without port data
* force init to RS232 Mode
*/
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 6a3c2ead0f17..dfdd2db95cb3 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -600,7 +600,7 @@ EXPORT_SYMBOL_GPL(serial8250_rpm_put);
static int serial8250_em485_init(struct uart_8250_port *p)
{
if (p->em485)
- return 0;
+ goto deassert_rts;
p->em485 = kmalloc(sizeof(struct uart_8250_em485), GFP_ATOMIC);
if (!p->em485)
@@ -616,7 +616,9 @@ static int serial8250_em485_init(struct uart_8250_port *p)
p->em485->active_timer = NULL;
p->em485->tx_stopped = true;
- p->rs485_stop_tx(p);
+deassert_rts:
+ if (p->em485->tx_stopped)
+ p->rs485_stop_tx(p);
return 0;
}
@@ -2048,6 +2050,9 @@ EXPORT_SYMBOL_GPL(serial8250_do_set_mctrl);
static void serial8250_set_mctrl(struct uart_port *port, unsigned int mctrl)
{
+ if (port->rs485.flags & SER_RS485_ENABLED)
+ return;
+
if (port->set_mctrl)
port->set_mctrl(port, mctrl);
else
@@ -3192,9 +3197,6 @@ static void serial8250_config_port(struct uart_port *port, int flags)
if (flags & UART_CONFIG_TYPE)
autoconfig(up);
- if (port->rs485.flags & SER_RS485_ENABLED)
- uart_rs485_config(port);
-
/* if access method is AU, it is a 16550 with a quirk */
if (port->type == PORT_16550A && port->iotype == UPIO_AU)
up->bugs |= UART_BUG_NOMSR;
diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index fadb49086102..67fa113f77d4 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -2726,15 +2726,13 @@ static int lpuart_probe(struct platform_device *pdev)
if (ret)
goto failed_reset;
- ret = uart_add_one_port(&lpuart_reg, &sport->port);
- if (ret)
- goto failed_attach_port;
-
ret = uart_get_rs485_mode(&sport->port);
if (ret)
goto failed_get_rs485;
- uart_rs485_config(&sport->port);
+ ret = uart_add_one_port(&lpuart_reg, &sport->port);
+ if (ret)
+ goto failed_attach_port;
ret = devm_request_irq(&pdev->dev, sport->port.irq, handler, 0,
DRIVER_NAME, sport);
@@ -2744,9 +2742,9 @@ static int lpuart_probe(struct platform_device *pdev)
return 0;
failed_irq_request:
-failed_get_rs485:
uart_remove_one_port(&lpuart_reg, &sport->port);
failed_attach_port:
+failed_get_rs485:
failed_reset:
lpuart_disable_clks(sport);
return ret;
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index 5875ee66492b..05b432dc7a85 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -380,8 +380,7 @@ static void imx_uart_rts_active(struct imx_port *sport, u32 *ucr2)
{
*ucr2 &= ~(UCR2_CTSC | UCR2_CTS);
- sport->port.mctrl |= TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl | TIOCM_RTS);
}
/* called with port.lock taken and irqs caller dependent */
@@ -390,8 +389,7 @@ static void imx_uart_rts_inactive(struct imx_port *sport, u32 *ucr2)
*ucr2 &= ~UCR2_CTSC;
*ucr2 |= UCR2_CTS;
- sport->port.mctrl &= ~TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl & ~TIOCM_RTS);
}
static void start_hrtimer_ms(struct hrtimer *hrt, unsigned long msec)
@@ -2347,8 +2345,6 @@ static int imx_uart_probe(struct platform_device *pdev)
dev_err(&pdev->dev,
"low-active RTS not possible when receiver is off, enabling receiver\n");
- uart_rs485_config(&sport->port);
-
/* Disable interrupts before requesting them */
ucr1 = imx_uart_readl(sport, UCR1);
ucr1 &= ~(UCR1_ADEN | UCR1_TRDYEN | UCR1_IDEN | UCR1_RRDYEN | UCR1_RTSDEN);
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index c7113182275a..179ee199df34 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -158,15 +158,10 @@ uart_update_mctrl(struct uart_port *port, unsigned int set, unsigned int clear)
unsigned long flags;
unsigned int old;
- if (port->rs485.flags & SER_RS485_ENABLED) {
- set &= ~TIOCM_RTS;
- clear &= ~TIOCM_RTS;
- }
-
spin_lock_irqsave(&port->lock, flags);
old = port->mctrl;
port->mctrl = (old & ~clear) | set;
- if (old != port->mctrl)
+ if (old != port->mctrl && !(port->rs485.flags & SER_RS485_ENABLED))
port->ops->set_mctrl(port, port->mctrl);
spin_unlock_irqrestore(&port->lock, flags);
}
@@ -1391,7 +1386,7 @@ static void uart_set_rs485_termination(struct uart_port *port,
!!(rs485->flags & SER_RS485_TERMINATE_BUS));
}
-int uart_rs485_config(struct uart_port *port)
+static int uart_rs485_config(struct uart_port *port)
{
struct serial_rs485 *rs485 = &port->rs485;
int ret;
@@ -1405,7 +1400,6 @@ int uart_rs485_config(struct uart_port *port)
return ret;
}
-EXPORT_SYMBOL_GPL(uart_rs485_config);
static int uart_get_rs485_config(struct uart_port *port,
struct serial_rs485 __user *rs485)
@@ -1444,8 +1438,13 @@ static int uart_set_rs485_config(struct tty_struct *tty, struct uart_port *port,
spin_lock_irqsave(&port->lock, flags);
ret = port->rs485_config(port, &tty->termios, &rs485);
- if (!ret)
+ if (!ret) {
port->rs485 = rs485;
+
+ /* Reset RTS and other mctrl lines when disabling RS485 */
+ if (!(rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ }
spin_unlock_irqrestore(&port->lock, flags);
if (ret)
return ret;
@@ -2352,7 +2351,8 @@ int uart_suspend_port(struct uart_driver *drv, struct uart_port *uport)
spin_lock_irq(&uport->lock);
ops->stop_tx(uport);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
/* save mctrl so it can be restored on resume */
mctrl = uport->mctrl;
uport->mctrl = 0;
@@ -2440,7 +2440,8 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
uart_change_pm(state, UART_PM_STATE_ON);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
spin_unlock_irq(&uport->lock);
if (console_suspend_enabled || !uart_console(uport)) {
/* Protected by port mutex for now */
@@ -2451,7 +2452,10 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
if (tty)
uart_change_speed(tty, state, NULL);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, uport->mctrl);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, uport->mctrl);
+ else
+ uart_rs485_config(uport);
ops->start_tx(uport);
spin_unlock_irq(&uport->lock);
tty_port_set_initialized(port, 1);
@@ -2558,10 +2562,10 @@ uart_configure_port(struct uart_driver *drv, struct uart_state *state,
*/
spin_lock_irqsave(&port->lock, flags);
port->mctrl &= TIOCM_DTR;
- if (port->rs485.flags & SER_RS485_ENABLED &&
- !(port->rs485.flags & SER_RS485_RTS_AFTER_SEND))
- port->mctrl |= TIOCM_RTS;
- port->ops->set_mctrl(port, port->mctrl);
+ if (!(port->rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ else
+ uart_rs485_config(port);
spin_unlock_irqrestore(&port->lock, flags);
/*
diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
index b4c21f84ad79..d657f2a42a7b 100644
--- a/include/linux/serial_core.h
+++ b/include/linux/serial_core.h
@@ -951,5 +951,4 @@ static inline int uart_handle_break(struct uart_port *port)
!((cflag) & CLOCAL))
int uart_get_rs485_mode(struct uart_port *port);
-int uart_rs485_config(struct uart_port *port);
#endif /* LINUX_SERIAL_CORE_H */
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7c7f9bc986e6 ("serial: Deassert Transmit Enable on probe in driver-specific way")
60f361722ad2 ("serial: fsl_lpuart: Reset prior to registration")
44b27aec9d96 ("serial: core, 8250: set RS485 termination GPIO in serial core")
ae50bb275283 ("serial: take termios_rwsem for ->rs485_config() & pass termios as param")
84f2faa7852e ("serial: 8250: Remove serial_rs485 sanitization from em485")
7195eefb38d7 ("serial: fsl_lpuart: Call core's sanitization and remove custom one")
596a9171472b ("serial: Clear rs485 struct when non-RS485 mode is set")
be2e2cb1d281 ("serial: Sanitize rs485_struct")
59c221f8e126 ("serial: 8250_exar: Fill in rs485_supported")
2dbd0c14ebe8 ("serial: Move serial_rs485 sanitization into separate function")
8322b1f52715 ("serial: Add uart_rs485_config()")
d6da35e0c6d5 ("Merge 5.18-rc7 into usb-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7c7f9bc986e698873b489c371a08f206979d06b7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Thu, 22 Sep 2022 18:27:33 +0200
Subject: [PATCH] serial: Deassert Transmit Enable on probe in driver-specific
way
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When a UART port is newly registered, uart_configure_port() seeks to
deassert RS485 Transmit Enable by setting the RTS bit in port->mctrl.
However a number of UART drivers interpret a set RTS bit as *assertion*
instead of deassertion: Affected drivers include those using
serial8250_em485_config() (except 8250_bcm2835aux.c) and some using
mctrl_gpio (e.g. imx.c).
Since the interpretation of the RTS bit is driver-specific, it is not
suitable as a means to centrally deassert Transmit Enable in the serial
core. Instead, the serial core must call on drivers to deassert it in
their driver-specific way. One way to achieve that is to call
->rs485_config(). It implicitly deasserts Transmit Enable.
So amend uart_configure_port() and uart_resume_port() to invoke
uart_rs485_config(). That allows removing calls to uart_rs485_config()
from drivers' ->probe() hooks and declaring the function static.
Skip any invocation of ->set_mctrl() if RS485 is enabled. RS485 has no
hardware flow control, so the modem control lines are irrelevant and
need not be touched. When leaving RS485 mode, reset the modem control
lines to the state stored in port->mctrl. That way, UARTs which are
muxed between RS485 and RS232 transceivers drive the lines correctly
when switched to RS232. (serial8250_do_startup() historically raises
the OUT1 modem signal because otherwise interrupts are not signaled on
ancient PC UARTs, but I believe that no longer applies to modern,
RS485-capable UARTs and is thus safe to be skipped.)
imx.c modifies port->mctrl whenever Transmit Enable is asserted and
deasserted. Stop it from doing that so port->mctrl reflects the RS232
line state.
8250_omap.c deasserts Transmit Enable on ->runtime_resume() by calling
->set_mctrl(). Because that is now a no-op in RS485 mode, amend the
function to call serial8250_em485_stop_tx().
fsl_lpuart.c retrieves and applies the RS485 device tree properties
after registering the UART port. Because applying now happens on
registration in uart_configure_port(), move retrieval of the properties
ahead of uart_add_one_port().
Link: https://lore.kernel.org/all/20220329085050.311408-1-matthias.schiffer@ew.tq…
Link: https://lore.kernel.org/all/8f538a8903795f22f9acc94a9a31b03c9c4ccacb.camel@…
Fixes: d3b3404df318 ("serial: Fix incorrect rs485 polarity on uart open")
Cc: stable(a)vger.kernel.org # v4.14+
Reported-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reported-by: Roosen Henri <Henri.Roosen(a)ginzinger.com>
Tested-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Link: https://lore.kernel.org/r/2de36eba3fbe11278d5002e4e501afe0ceaca039.16638638…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
index 871f8a050513..41b8c6b27136 100644
--- a/drivers/tty/serial/8250/8250_omap.c
+++ b/drivers/tty/serial/8250/8250_omap.c
@@ -342,6 +342,9 @@ static void omap8250_restore_regs(struct uart_8250_port *up)
omap8250_update_mdr1(up, priv);
up->port.ops->set_mctrl(&up->port, up->port.mctrl);
+
+ if (up->port.rs485.flags & SER_RS485_ENABLED)
+ serial8250_em485_stop_tx(up);
}
/*
diff --git a/drivers/tty/serial/8250/8250_pci.c b/drivers/tty/serial/8250/8250_pci.c
index 0052cf899e29..8e9f247590bd 100644
--- a/drivers/tty/serial/8250/8250_pci.c
+++ b/drivers/tty/serial/8250/8250_pci.c
@@ -1632,7 +1632,6 @@ static int pci_fintek_init(struct pci_dev *dev)
resource_size_t bar_data[3];
u8 config_base;
struct serial_private *priv = pci_get_drvdata(dev);
- struct uart_8250_port *port;
if (!(pci_resource_flags(dev, 5) & IORESOURCE_IO) ||
!(pci_resource_flags(dev, 4) & IORESOURCE_IO) ||
@@ -1679,13 +1678,7 @@ static int pci_fintek_init(struct pci_dev *dev)
pci_write_config_byte(dev, config_base + 0x06, dev->irq);
- if (priv) {
- /* re-apply RS232/485 mode when
- * pciserial_resume_ports()
- */
- port = serial8250_get_port(priv->line[i]);
- uart_rs485_config(&port->port);
- } else {
+ if (!priv) {
/* First init without port data
* force init to RS232 Mode
*/
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 6a3c2ead0f17..dfdd2db95cb3 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -600,7 +600,7 @@ EXPORT_SYMBOL_GPL(serial8250_rpm_put);
static int serial8250_em485_init(struct uart_8250_port *p)
{
if (p->em485)
- return 0;
+ goto deassert_rts;
p->em485 = kmalloc(sizeof(struct uart_8250_em485), GFP_ATOMIC);
if (!p->em485)
@@ -616,7 +616,9 @@ static int serial8250_em485_init(struct uart_8250_port *p)
p->em485->active_timer = NULL;
p->em485->tx_stopped = true;
- p->rs485_stop_tx(p);
+deassert_rts:
+ if (p->em485->tx_stopped)
+ p->rs485_stop_tx(p);
return 0;
}
@@ -2048,6 +2050,9 @@ EXPORT_SYMBOL_GPL(serial8250_do_set_mctrl);
static void serial8250_set_mctrl(struct uart_port *port, unsigned int mctrl)
{
+ if (port->rs485.flags & SER_RS485_ENABLED)
+ return;
+
if (port->set_mctrl)
port->set_mctrl(port, mctrl);
else
@@ -3192,9 +3197,6 @@ static void serial8250_config_port(struct uart_port *port, int flags)
if (flags & UART_CONFIG_TYPE)
autoconfig(up);
- if (port->rs485.flags & SER_RS485_ENABLED)
- uart_rs485_config(port);
-
/* if access method is AU, it is a 16550 with a quirk */
if (port->type == PORT_16550A && port->iotype == UPIO_AU)
up->bugs |= UART_BUG_NOMSR;
diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index fadb49086102..67fa113f77d4 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -2726,15 +2726,13 @@ static int lpuart_probe(struct platform_device *pdev)
if (ret)
goto failed_reset;
- ret = uart_add_one_port(&lpuart_reg, &sport->port);
- if (ret)
- goto failed_attach_port;
-
ret = uart_get_rs485_mode(&sport->port);
if (ret)
goto failed_get_rs485;
- uart_rs485_config(&sport->port);
+ ret = uart_add_one_port(&lpuart_reg, &sport->port);
+ if (ret)
+ goto failed_attach_port;
ret = devm_request_irq(&pdev->dev, sport->port.irq, handler, 0,
DRIVER_NAME, sport);
@@ -2744,9 +2742,9 @@ static int lpuart_probe(struct platform_device *pdev)
return 0;
failed_irq_request:
-failed_get_rs485:
uart_remove_one_port(&lpuart_reg, &sport->port);
failed_attach_port:
+failed_get_rs485:
failed_reset:
lpuart_disable_clks(sport);
return ret;
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index 5875ee66492b..05b432dc7a85 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -380,8 +380,7 @@ static void imx_uart_rts_active(struct imx_port *sport, u32 *ucr2)
{
*ucr2 &= ~(UCR2_CTSC | UCR2_CTS);
- sport->port.mctrl |= TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl | TIOCM_RTS);
}
/* called with port.lock taken and irqs caller dependent */
@@ -390,8 +389,7 @@ static void imx_uart_rts_inactive(struct imx_port *sport, u32 *ucr2)
*ucr2 &= ~UCR2_CTSC;
*ucr2 |= UCR2_CTS;
- sport->port.mctrl &= ~TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl & ~TIOCM_RTS);
}
static void start_hrtimer_ms(struct hrtimer *hrt, unsigned long msec)
@@ -2347,8 +2345,6 @@ static int imx_uart_probe(struct platform_device *pdev)
dev_err(&pdev->dev,
"low-active RTS not possible when receiver is off, enabling receiver\n");
- uart_rs485_config(&sport->port);
-
/* Disable interrupts before requesting them */
ucr1 = imx_uart_readl(sport, UCR1);
ucr1 &= ~(UCR1_ADEN | UCR1_TRDYEN | UCR1_IDEN | UCR1_RRDYEN | UCR1_RTSDEN);
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index c7113182275a..179ee199df34 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -158,15 +158,10 @@ uart_update_mctrl(struct uart_port *port, unsigned int set, unsigned int clear)
unsigned long flags;
unsigned int old;
- if (port->rs485.flags & SER_RS485_ENABLED) {
- set &= ~TIOCM_RTS;
- clear &= ~TIOCM_RTS;
- }
-
spin_lock_irqsave(&port->lock, flags);
old = port->mctrl;
port->mctrl = (old & ~clear) | set;
- if (old != port->mctrl)
+ if (old != port->mctrl && !(port->rs485.flags & SER_RS485_ENABLED))
port->ops->set_mctrl(port, port->mctrl);
spin_unlock_irqrestore(&port->lock, flags);
}
@@ -1391,7 +1386,7 @@ static void uart_set_rs485_termination(struct uart_port *port,
!!(rs485->flags & SER_RS485_TERMINATE_BUS));
}
-int uart_rs485_config(struct uart_port *port)
+static int uart_rs485_config(struct uart_port *port)
{
struct serial_rs485 *rs485 = &port->rs485;
int ret;
@@ -1405,7 +1400,6 @@ int uart_rs485_config(struct uart_port *port)
return ret;
}
-EXPORT_SYMBOL_GPL(uart_rs485_config);
static int uart_get_rs485_config(struct uart_port *port,
struct serial_rs485 __user *rs485)
@@ -1444,8 +1438,13 @@ static int uart_set_rs485_config(struct tty_struct *tty, struct uart_port *port,
spin_lock_irqsave(&port->lock, flags);
ret = port->rs485_config(port, &tty->termios, &rs485);
- if (!ret)
+ if (!ret) {
port->rs485 = rs485;
+
+ /* Reset RTS and other mctrl lines when disabling RS485 */
+ if (!(rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ }
spin_unlock_irqrestore(&port->lock, flags);
if (ret)
return ret;
@@ -2352,7 +2351,8 @@ int uart_suspend_port(struct uart_driver *drv, struct uart_port *uport)
spin_lock_irq(&uport->lock);
ops->stop_tx(uport);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
/* save mctrl so it can be restored on resume */
mctrl = uport->mctrl;
uport->mctrl = 0;
@@ -2440,7 +2440,8 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
uart_change_pm(state, UART_PM_STATE_ON);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
spin_unlock_irq(&uport->lock);
if (console_suspend_enabled || !uart_console(uport)) {
/* Protected by port mutex for now */
@@ -2451,7 +2452,10 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
if (tty)
uart_change_speed(tty, state, NULL);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, uport->mctrl);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, uport->mctrl);
+ else
+ uart_rs485_config(uport);
ops->start_tx(uport);
spin_unlock_irq(&uport->lock);
tty_port_set_initialized(port, 1);
@@ -2558,10 +2562,10 @@ uart_configure_port(struct uart_driver *drv, struct uart_state *state,
*/
spin_lock_irqsave(&port->lock, flags);
port->mctrl &= TIOCM_DTR;
- if (port->rs485.flags & SER_RS485_ENABLED &&
- !(port->rs485.flags & SER_RS485_RTS_AFTER_SEND))
- port->mctrl |= TIOCM_RTS;
- port->ops->set_mctrl(port, port->mctrl);
+ if (!(port->rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ else
+ uart_rs485_config(port);
spin_unlock_irqrestore(&port->lock, flags);
/*
diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
index b4c21f84ad79..d657f2a42a7b 100644
--- a/include/linux/serial_core.h
+++ b/include/linux/serial_core.h
@@ -951,5 +951,4 @@ static inline int uart_handle_break(struct uart_port *port)
!((cflag) & CLOCAL))
int uart_get_rs485_mode(struct uart_port *port);
-int uart_rs485_config(struct uart_port *port);
#endif /* LINUX_SERIAL_CORE_H */
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7c7f9bc986e6 ("serial: Deassert Transmit Enable on probe in driver-specific way")
60f361722ad2 ("serial: fsl_lpuart: Reset prior to registration")
44b27aec9d96 ("serial: core, 8250: set RS485 termination GPIO in serial core")
ae50bb275283 ("serial: take termios_rwsem for ->rs485_config() & pass termios as param")
84f2faa7852e ("serial: 8250: Remove serial_rs485 sanitization from em485")
7195eefb38d7 ("serial: fsl_lpuart: Call core's sanitization and remove custom one")
596a9171472b ("serial: Clear rs485 struct when non-RS485 mode is set")
be2e2cb1d281 ("serial: Sanitize rs485_struct")
59c221f8e126 ("serial: 8250_exar: Fill in rs485_supported")
2dbd0c14ebe8 ("serial: Move serial_rs485 sanitization into separate function")
8322b1f52715 ("serial: Add uart_rs485_config()")
d6da35e0c6d5 ("Merge 5.18-rc7 into usb-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7c7f9bc986e698873b489c371a08f206979d06b7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Thu, 22 Sep 2022 18:27:33 +0200
Subject: [PATCH] serial: Deassert Transmit Enable on probe in driver-specific
way
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When a UART port is newly registered, uart_configure_port() seeks to
deassert RS485 Transmit Enable by setting the RTS bit in port->mctrl.
However a number of UART drivers interpret a set RTS bit as *assertion*
instead of deassertion: Affected drivers include those using
serial8250_em485_config() (except 8250_bcm2835aux.c) and some using
mctrl_gpio (e.g. imx.c).
Since the interpretation of the RTS bit is driver-specific, it is not
suitable as a means to centrally deassert Transmit Enable in the serial
core. Instead, the serial core must call on drivers to deassert it in
their driver-specific way. One way to achieve that is to call
->rs485_config(). It implicitly deasserts Transmit Enable.
So amend uart_configure_port() and uart_resume_port() to invoke
uart_rs485_config(). That allows removing calls to uart_rs485_config()
from drivers' ->probe() hooks and declaring the function static.
Skip any invocation of ->set_mctrl() if RS485 is enabled. RS485 has no
hardware flow control, so the modem control lines are irrelevant and
need not be touched. When leaving RS485 mode, reset the modem control
lines to the state stored in port->mctrl. That way, UARTs which are
muxed between RS485 and RS232 transceivers drive the lines correctly
when switched to RS232. (serial8250_do_startup() historically raises
the OUT1 modem signal because otherwise interrupts are not signaled on
ancient PC UARTs, but I believe that no longer applies to modern,
RS485-capable UARTs and is thus safe to be skipped.)
imx.c modifies port->mctrl whenever Transmit Enable is asserted and
deasserted. Stop it from doing that so port->mctrl reflects the RS232
line state.
8250_omap.c deasserts Transmit Enable on ->runtime_resume() by calling
->set_mctrl(). Because that is now a no-op in RS485 mode, amend the
function to call serial8250_em485_stop_tx().
fsl_lpuart.c retrieves and applies the RS485 device tree properties
after registering the UART port. Because applying now happens on
registration in uart_configure_port(), move retrieval of the properties
ahead of uart_add_one_port().
Link: https://lore.kernel.org/all/20220329085050.311408-1-matthias.schiffer@ew.tq…
Link: https://lore.kernel.org/all/8f538a8903795f22f9acc94a9a31b03c9c4ccacb.camel@…
Fixes: d3b3404df318 ("serial: Fix incorrect rs485 polarity on uart open")
Cc: stable(a)vger.kernel.org # v4.14+
Reported-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reported-by: Roosen Henri <Henri.Roosen(a)ginzinger.com>
Tested-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Link: https://lore.kernel.org/r/2de36eba3fbe11278d5002e4e501afe0ceaca039.16638638…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
index 871f8a050513..41b8c6b27136 100644
--- a/drivers/tty/serial/8250/8250_omap.c
+++ b/drivers/tty/serial/8250/8250_omap.c
@@ -342,6 +342,9 @@ static void omap8250_restore_regs(struct uart_8250_port *up)
omap8250_update_mdr1(up, priv);
up->port.ops->set_mctrl(&up->port, up->port.mctrl);
+
+ if (up->port.rs485.flags & SER_RS485_ENABLED)
+ serial8250_em485_stop_tx(up);
}
/*
diff --git a/drivers/tty/serial/8250/8250_pci.c b/drivers/tty/serial/8250/8250_pci.c
index 0052cf899e29..8e9f247590bd 100644
--- a/drivers/tty/serial/8250/8250_pci.c
+++ b/drivers/tty/serial/8250/8250_pci.c
@@ -1632,7 +1632,6 @@ static int pci_fintek_init(struct pci_dev *dev)
resource_size_t bar_data[3];
u8 config_base;
struct serial_private *priv = pci_get_drvdata(dev);
- struct uart_8250_port *port;
if (!(pci_resource_flags(dev, 5) & IORESOURCE_IO) ||
!(pci_resource_flags(dev, 4) & IORESOURCE_IO) ||
@@ -1679,13 +1678,7 @@ static int pci_fintek_init(struct pci_dev *dev)
pci_write_config_byte(dev, config_base + 0x06, dev->irq);
- if (priv) {
- /* re-apply RS232/485 mode when
- * pciserial_resume_ports()
- */
- port = serial8250_get_port(priv->line[i]);
- uart_rs485_config(&port->port);
- } else {
+ if (!priv) {
/* First init without port data
* force init to RS232 Mode
*/
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 6a3c2ead0f17..dfdd2db95cb3 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -600,7 +600,7 @@ EXPORT_SYMBOL_GPL(serial8250_rpm_put);
static int serial8250_em485_init(struct uart_8250_port *p)
{
if (p->em485)
- return 0;
+ goto deassert_rts;
p->em485 = kmalloc(sizeof(struct uart_8250_em485), GFP_ATOMIC);
if (!p->em485)
@@ -616,7 +616,9 @@ static int serial8250_em485_init(struct uart_8250_port *p)
p->em485->active_timer = NULL;
p->em485->tx_stopped = true;
- p->rs485_stop_tx(p);
+deassert_rts:
+ if (p->em485->tx_stopped)
+ p->rs485_stop_tx(p);
return 0;
}
@@ -2048,6 +2050,9 @@ EXPORT_SYMBOL_GPL(serial8250_do_set_mctrl);
static void serial8250_set_mctrl(struct uart_port *port, unsigned int mctrl)
{
+ if (port->rs485.flags & SER_RS485_ENABLED)
+ return;
+
if (port->set_mctrl)
port->set_mctrl(port, mctrl);
else
@@ -3192,9 +3197,6 @@ static void serial8250_config_port(struct uart_port *port, int flags)
if (flags & UART_CONFIG_TYPE)
autoconfig(up);
- if (port->rs485.flags & SER_RS485_ENABLED)
- uart_rs485_config(port);
-
/* if access method is AU, it is a 16550 with a quirk */
if (port->type == PORT_16550A && port->iotype == UPIO_AU)
up->bugs |= UART_BUG_NOMSR;
diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index fadb49086102..67fa113f77d4 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -2726,15 +2726,13 @@ static int lpuart_probe(struct platform_device *pdev)
if (ret)
goto failed_reset;
- ret = uart_add_one_port(&lpuart_reg, &sport->port);
- if (ret)
- goto failed_attach_port;
-
ret = uart_get_rs485_mode(&sport->port);
if (ret)
goto failed_get_rs485;
- uart_rs485_config(&sport->port);
+ ret = uart_add_one_port(&lpuart_reg, &sport->port);
+ if (ret)
+ goto failed_attach_port;
ret = devm_request_irq(&pdev->dev, sport->port.irq, handler, 0,
DRIVER_NAME, sport);
@@ -2744,9 +2742,9 @@ static int lpuart_probe(struct platform_device *pdev)
return 0;
failed_irq_request:
-failed_get_rs485:
uart_remove_one_port(&lpuart_reg, &sport->port);
failed_attach_port:
+failed_get_rs485:
failed_reset:
lpuart_disable_clks(sport);
return ret;
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index 5875ee66492b..05b432dc7a85 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -380,8 +380,7 @@ static void imx_uart_rts_active(struct imx_port *sport, u32 *ucr2)
{
*ucr2 &= ~(UCR2_CTSC | UCR2_CTS);
- sport->port.mctrl |= TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl | TIOCM_RTS);
}
/* called with port.lock taken and irqs caller dependent */
@@ -390,8 +389,7 @@ static void imx_uart_rts_inactive(struct imx_port *sport, u32 *ucr2)
*ucr2 &= ~UCR2_CTSC;
*ucr2 |= UCR2_CTS;
- sport->port.mctrl &= ~TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl & ~TIOCM_RTS);
}
static void start_hrtimer_ms(struct hrtimer *hrt, unsigned long msec)
@@ -2347,8 +2345,6 @@ static int imx_uart_probe(struct platform_device *pdev)
dev_err(&pdev->dev,
"low-active RTS not possible when receiver is off, enabling receiver\n");
- uart_rs485_config(&sport->port);
-
/* Disable interrupts before requesting them */
ucr1 = imx_uart_readl(sport, UCR1);
ucr1 &= ~(UCR1_ADEN | UCR1_TRDYEN | UCR1_IDEN | UCR1_RRDYEN | UCR1_RTSDEN);
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index c7113182275a..179ee199df34 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -158,15 +158,10 @@ uart_update_mctrl(struct uart_port *port, unsigned int set, unsigned int clear)
unsigned long flags;
unsigned int old;
- if (port->rs485.flags & SER_RS485_ENABLED) {
- set &= ~TIOCM_RTS;
- clear &= ~TIOCM_RTS;
- }
-
spin_lock_irqsave(&port->lock, flags);
old = port->mctrl;
port->mctrl = (old & ~clear) | set;
- if (old != port->mctrl)
+ if (old != port->mctrl && !(port->rs485.flags & SER_RS485_ENABLED))
port->ops->set_mctrl(port, port->mctrl);
spin_unlock_irqrestore(&port->lock, flags);
}
@@ -1391,7 +1386,7 @@ static void uart_set_rs485_termination(struct uart_port *port,
!!(rs485->flags & SER_RS485_TERMINATE_BUS));
}
-int uart_rs485_config(struct uart_port *port)
+static int uart_rs485_config(struct uart_port *port)
{
struct serial_rs485 *rs485 = &port->rs485;
int ret;
@@ -1405,7 +1400,6 @@ int uart_rs485_config(struct uart_port *port)
return ret;
}
-EXPORT_SYMBOL_GPL(uart_rs485_config);
static int uart_get_rs485_config(struct uart_port *port,
struct serial_rs485 __user *rs485)
@@ -1444,8 +1438,13 @@ static int uart_set_rs485_config(struct tty_struct *tty, struct uart_port *port,
spin_lock_irqsave(&port->lock, flags);
ret = port->rs485_config(port, &tty->termios, &rs485);
- if (!ret)
+ if (!ret) {
port->rs485 = rs485;
+
+ /* Reset RTS and other mctrl lines when disabling RS485 */
+ if (!(rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ }
spin_unlock_irqrestore(&port->lock, flags);
if (ret)
return ret;
@@ -2352,7 +2351,8 @@ int uart_suspend_port(struct uart_driver *drv, struct uart_port *uport)
spin_lock_irq(&uport->lock);
ops->stop_tx(uport);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
/* save mctrl so it can be restored on resume */
mctrl = uport->mctrl;
uport->mctrl = 0;
@@ -2440,7 +2440,8 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
uart_change_pm(state, UART_PM_STATE_ON);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
spin_unlock_irq(&uport->lock);
if (console_suspend_enabled || !uart_console(uport)) {
/* Protected by port mutex for now */
@@ -2451,7 +2452,10 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
if (tty)
uart_change_speed(tty, state, NULL);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, uport->mctrl);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, uport->mctrl);
+ else
+ uart_rs485_config(uport);
ops->start_tx(uport);
spin_unlock_irq(&uport->lock);
tty_port_set_initialized(port, 1);
@@ -2558,10 +2562,10 @@ uart_configure_port(struct uart_driver *drv, struct uart_state *state,
*/
spin_lock_irqsave(&port->lock, flags);
port->mctrl &= TIOCM_DTR;
- if (port->rs485.flags & SER_RS485_ENABLED &&
- !(port->rs485.flags & SER_RS485_RTS_AFTER_SEND))
- port->mctrl |= TIOCM_RTS;
- port->ops->set_mctrl(port, port->mctrl);
+ if (!(port->rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ else
+ uart_rs485_config(port);
spin_unlock_irqrestore(&port->lock, flags);
/*
diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
index b4c21f84ad79..d657f2a42a7b 100644
--- a/include/linux/serial_core.h
+++ b/include/linux/serial_core.h
@@ -951,5 +951,4 @@ static inline int uart_handle_break(struct uart_port *port)
!((cflag) & CLOCAL))
int uart_get_rs485_mode(struct uart_port *port);
-int uart_rs485_config(struct uart_port *port);
#endif /* LINUX_SERIAL_CORE_H */
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7c7f9bc986e6 ("serial: Deassert Transmit Enable on probe in driver-specific way")
60f361722ad2 ("serial: fsl_lpuart: Reset prior to registration")
44b27aec9d96 ("serial: core, 8250: set RS485 termination GPIO in serial core")
ae50bb275283 ("serial: take termios_rwsem for ->rs485_config() & pass termios as param")
84f2faa7852e ("serial: 8250: Remove serial_rs485 sanitization from em485")
7195eefb38d7 ("serial: fsl_lpuart: Call core's sanitization and remove custom one")
596a9171472b ("serial: Clear rs485 struct when non-RS485 mode is set")
be2e2cb1d281 ("serial: Sanitize rs485_struct")
59c221f8e126 ("serial: 8250_exar: Fill in rs485_supported")
2dbd0c14ebe8 ("serial: Move serial_rs485 sanitization into separate function")
8322b1f52715 ("serial: Add uart_rs485_config()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7c7f9bc986e698873b489c371a08f206979d06b7 Mon Sep 17 00:00:00 2001
From: Lukas Wunner <lukas(a)wunner.de>
Date: Thu, 22 Sep 2022 18:27:33 +0200
Subject: [PATCH] serial: Deassert Transmit Enable on probe in driver-specific
way
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When a UART port is newly registered, uart_configure_port() seeks to
deassert RS485 Transmit Enable by setting the RTS bit in port->mctrl.
However a number of UART drivers interpret a set RTS bit as *assertion*
instead of deassertion: Affected drivers include those using
serial8250_em485_config() (except 8250_bcm2835aux.c) and some using
mctrl_gpio (e.g. imx.c).
Since the interpretation of the RTS bit is driver-specific, it is not
suitable as a means to centrally deassert Transmit Enable in the serial
core. Instead, the serial core must call on drivers to deassert it in
their driver-specific way. One way to achieve that is to call
->rs485_config(). It implicitly deasserts Transmit Enable.
So amend uart_configure_port() and uart_resume_port() to invoke
uart_rs485_config(). That allows removing calls to uart_rs485_config()
from drivers' ->probe() hooks and declaring the function static.
Skip any invocation of ->set_mctrl() if RS485 is enabled. RS485 has no
hardware flow control, so the modem control lines are irrelevant and
need not be touched. When leaving RS485 mode, reset the modem control
lines to the state stored in port->mctrl. That way, UARTs which are
muxed between RS485 and RS232 transceivers drive the lines correctly
when switched to RS232. (serial8250_do_startup() historically raises
the OUT1 modem signal because otherwise interrupts are not signaled on
ancient PC UARTs, but I believe that no longer applies to modern,
RS485-capable UARTs and is thus safe to be skipped.)
imx.c modifies port->mctrl whenever Transmit Enable is asserted and
deasserted. Stop it from doing that so port->mctrl reflects the RS232
line state.
8250_omap.c deasserts Transmit Enable on ->runtime_resume() by calling
->set_mctrl(). Because that is now a no-op in RS485 mode, amend the
function to call serial8250_em485_stop_tx().
fsl_lpuart.c retrieves and applies the RS485 device tree properties
after registering the UART port. Because applying now happens on
registration in uart_configure_port(), move retrieval of the properties
ahead of uart_add_one_port().
Link: https://lore.kernel.org/all/20220329085050.311408-1-matthias.schiffer@ew.tq…
Link: https://lore.kernel.org/all/8f538a8903795f22f9acc94a9a31b03c9c4ccacb.camel@…
Fixes: d3b3404df318 ("serial: Fix incorrect rs485 polarity on uart open")
Cc: stable(a)vger.kernel.org # v4.14+
Reported-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reported-by: Roosen Henri <Henri.Roosen(a)ginzinger.com>
Tested-by: Matthias Schiffer <matthias.schiffer(a)ew.tq-group.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Signed-off-by: Lukas Wunner <lukas(a)wunner.de>
Link: https://lore.kernel.org/r/2de36eba3fbe11278d5002e4e501afe0ceaca039.16638638…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
diff --git a/drivers/tty/serial/8250/8250_omap.c b/drivers/tty/serial/8250/8250_omap.c
index 871f8a050513..41b8c6b27136 100644
--- a/drivers/tty/serial/8250/8250_omap.c
+++ b/drivers/tty/serial/8250/8250_omap.c
@@ -342,6 +342,9 @@ static void omap8250_restore_regs(struct uart_8250_port *up)
omap8250_update_mdr1(up, priv);
up->port.ops->set_mctrl(&up->port, up->port.mctrl);
+
+ if (up->port.rs485.flags & SER_RS485_ENABLED)
+ serial8250_em485_stop_tx(up);
}
/*
diff --git a/drivers/tty/serial/8250/8250_pci.c b/drivers/tty/serial/8250/8250_pci.c
index 0052cf899e29..8e9f247590bd 100644
--- a/drivers/tty/serial/8250/8250_pci.c
+++ b/drivers/tty/serial/8250/8250_pci.c
@@ -1632,7 +1632,6 @@ static int pci_fintek_init(struct pci_dev *dev)
resource_size_t bar_data[3];
u8 config_base;
struct serial_private *priv = pci_get_drvdata(dev);
- struct uart_8250_port *port;
if (!(pci_resource_flags(dev, 5) & IORESOURCE_IO) ||
!(pci_resource_flags(dev, 4) & IORESOURCE_IO) ||
@@ -1679,13 +1678,7 @@ static int pci_fintek_init(struct pci_dev *dev)
pci_write_config_byte(dev, config_base + 0x06, dev->irq);
- if (priv) {
- /* re-apply RS232/485 mode when
- * pciserial_resume_ports()
- */
- port = serial8250_get_port(priv->line[i]);
- uart_rs485_config(&port->port);
- } else {
+ if (!priv) {
/* First init without port data
* force init to RS232 Mode
*/
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index 6a3c2ead0f17..dfdd2db95cb3 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -600,7 +600,7 @@ EXPORT_SYMBOL_GPL(serial8250_rpm_put);
static int serial8250_em485_init(struct uart_8250_port *p)
{
if (p->em485)
- return 0;
+ goto deassert_rts;
p->em485 = kmalloc(sizeof(struct uart_8250_em485), GFP_ATOMIC);
if (!p->em485)
@@ -616,7 +616,9 @@ static int serial8250_em485_init(struct uart_8250_port *p)
p->em485->active_timer = NULL;
p->em485->tx_stopped = true;
- p->rs485_stop_tx(p);
+deassert_rts:
+ if (p->em485->tx_stopped)
+ p->rs485_stop_tx(p);
return 0;
}
@@ -2048,6 +2050,9 @@ EXPORT_SYMBOL_GPL(serial8250_do_set_mctrl);
static void serial8250_set_mctrl(struct uart_port *port, unsigned int mctrl)
{
+ if (port->rs485.flags & SER_RS485_ENABLED)
+ return;
+
if (port->set_mctrl)
port->set_mctrl(port, mctrl);
else
@@ -3192,9 +3197,6 @@ static void serial8250_config_port(struct uart_port *port, int flags)
if (flags & UART_CONFIG_TYPE)
autoconfig(up);
- if (port->rs485.flags & SER_RS485_ENABLED)
- uart_rs485_config(port);
-
/* if access method is AU, it is a 16550 with a quirk */
if (port->type == PORT_16550A && port->iotype == UPIO_AU)
up->bugs |= UART_BUG_NOMSR;
diff --git a/drivers/tty/serial/fsl_lpuart.c b/drivers/tty/serial/fsl_lpuart.c
index fadb49086102..67fa113f77d4 100644
--- a/drivers/tty/serial/fsl_lpuart.c
+++ b/drivers/tty/serial/fsl_lpuart.c
@@ -2726,15 +2726,13 @@ static int lpuart_probe(struct platform_device *pdev)
if (ret)
goto failed_reset;
- ret = uart_add_one_port(&lpuart_reg, &sport->port);
- if (ret)
- goto failed_attach_port;
-
ret = uart_get_rs485_mode(&sport->port);
if (ret)
goto failed_get_rs485;
- uart_rs485_config(&sport->port);
+ ret = uart_add_one_port(&lpuart_reg, &sport->port);
+ if (ret)
+ goto failed_attach_port;
ret = devm_request_irq(&pdev->dev, sport->port.irq, handler, 0,
DRIVER_NAME, sport);
@@ -2744,9 +2742,9 @@ static int lpuart_probe(struct platform_device *pdev)
return 0;
failed_irq_request:
-failed_get_rs485:
uart_remove_one_port(&lpuart_reg, &sport->port);
failed_attach_port:
+failed_get_rs485:
failed_reset:
lpuart_disable_clks(sport);
return ret;
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index 5875ee66492b..05b432dc7a85 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -380,8 +380,7 @@ static void imx_uart_rts_active(struct imx_port *sport, u32 *ucr2)
{
*ucr2 &= ~(UCR2_CTSC | UCR2_CTS);
- sport->port.mctrl |= TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl | TIOCM_RTS);
}
/* called with port.lock taken and irqs caller dependent */
@@ -390,8 +389,7 @@ static void imx_uart_rts_inactive(struct imx_port *sport, u32 *ucr2)
*ucr2 &= ~UCR2_CTSC;
*ucr2 |= UCR2_CTS;
- sport->port.mctrl &= ~TIOCM_RTS;
- mctrl_gpio_set(sport->gpios, sport->port.mctrl);
+ mctrl_gpio_set(sport->gpios, sport->port.mctrl & ~TIOCM_RTS);
}
static void start_hrtimer_ms(struct hrtimer *hrt, unsigned long msec)
@@ -2347,8 +2345,6 @@ static int imx_uart_probe(struct platform_device *pdev)
dev_err(&pdev->dev,
"low-active RTS not possible when receiver is off, enabling receiver\n");
- uart_rs485_config(&sport->port);
-
/* Disable interrupts before requesting them */
ucr1 = imx_uart_readl(sport, UCR1);
ucr1 &= ~(UCR1_ADEN | UCR1_TRDYEN | UCR1_IDEN | UCR1_RRDYEN | UCR1_RTSDEN);
diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index c7113182275a..179ee199df34 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -158,15 +158,10 @@ uart_update_mctrl(struct uart_port *port, unsigned int set, unsigned int clear)
unsigned long flags;
unsigned int old;
- if (port->rs485.flags & SER_RS485_ENABLED) {
- set &= ~TIOCM_RTS;
- clear &= ~TIOCM_RTS;
- }
-
spin_lock_irqsave(&port->lock, flags);
old = port->mctrl;
port->mctrl = (old & ~clear) | set;
- if (old != port->mctrl)
+ if (old != port->mctrl && !(port->rs485.flags & SER_RS485_ENABLED))
port->ops->set_mctrl(port, port->mctrl);
spin_unlock_irqrestore(&port->lock, flags);
}
@@ -1391,7 +1386,7 @@ static void uart_set_rs485_termination(struct uart_port *port,
!!(rs485->flags & SER_RS485_TERMINATE_BUS));
}
-int uart_rs485_config(struct uart_port *port)
+static int uart_rs485_config(struct uart_port *port)
{
struct serial_rs485 *rs485 = &port->rs485;
int ret;
@@ -1405,7 +1400,6 @@ int uart_rs485_config(struct uart_port *port)
return ret;
}
-EXPORT_SYMBOL_GPL(uart_rs485_config);
static int uart_get_rs485_config(struct uart_port *port,
struct serial_rs485 __user *rs485)
@@ -1444,8 +1438,13 @@ static int uart_set_rs485_config(struct tty_struct *tty, struct uart_port *port,
spin_lock_irqsave(&port->lock, flags);
ret = port->rs485_config(port, &tty->termios, &rs485);
- if (!ret)
+ if (!ret) {
port->rs485 = rs485;
+
+ /* Reset RTS and other mctrl lines when disabling RS485 */
+ if (!(rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ }
spin_unlock_irqrestore(&port->lock, flags);
if (ret)
return ret;
@@ -2352,7 +2351,8 @@ int uart_suspend_port(struct uart_driver *drv, struct uart_port *uport)
spin_lock_irq(&uport->lock);
ops->stop_tx(uport);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
/* save mctrl so it can be restored on resume */
mctrl = uport->mctrl;
uport->mctrl = 0;
@@ -2440,7 +2440,8 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
uart_change_pm(state, UART_PM_STATE_ON);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, 0);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, 0);
spin_unlock_irq(&uport->lock);
if (console_suspend_enabled || !uart_console(uport)) {
/* Protected by port mutex for now */
@@ -2451,7 +2452,10 @@ int uart_resume_port(struct uart_driver *drv, struct uart_port *uport)
if (tty)
uart_change_speed(tty, state, NULL);
spin_lock_irq(&uport->lock);
- ops->set_mctrl(uport, uport->mctrl);
+ if (!(uport->rs485.flags & SER_RS485_ENABLED))
+ ops->set_mctrl(uport, uport->mctrl);
+ else
+ uart_rs485_config(uport);
ops->start_tx(uport);
spin_unlock_irq(&uport->lock);
tty_port_set_initialized(port, 1);
@@ -2558,10 +2562,10 @@ uart_configure_port(struct uart_driver *drv, struct uart_state *state,
*/
spin_lock_irqsave(&port->lock, flags);
port->mctrl &= TIOCM_DTR;
- if (port->rs485.flags & SER_RS485_ENABLED &&
- !(port->rs485.flags & SER_RS485_RTS_AFTER_SEND))
- port->mctrl |= TIOCM_RTS;
- port->ops->set_mctrl(port, port->mctrl);
+ if (!(port->rs485.flags & SER_RS485_ENABLED))
+ port->ops->set_mctrl(port, port->mctrl);
+ else
+ uart_rs485_config(port);
spin_unlock_irqrestore(&port->lock, flags);
/*
diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
index b4c21f84ad79..d657f2a42a7b 100644
--- a/include/linux/serial_core.h
+++ b/include/linux/serial_core.h
@@ -951,5 +951,4 @@ static inline int uart_handle_break(struct uart_port *port)
!((cflag) & CLOCAL))
int uart_get_rs485_mode(struct uart_port *port);
-int uart_rs485_config(struct uart_port *port);
#endif /* LINUX_SERIAL_CORE_H */
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
5c13a4a0291b ("xen/gntdev: Accommodate VMA splitting")
dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
ce2f46f3531a ("xen/gntdev: fix unmap notification order")
f28347cc6639 ("Xen/gntdev: don't ignore kernel unmapping error")
30dcc56bba91 ("xen: assume XENFEAT_gnttab_map_avail_bits being set for pv guests")
970655aa9b42 ("xen/gntdev: fix gntdev_mmap() error exit path")
bce21a2b48ed ("Xen/gnttab: introduce common INVALID_GRANT_{HANDLE,REF}")
36caa3fedf06 ("Xen/gntdev: don't needlessly allocate k{,un}map_ops[]")
8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
36bf1dfb8b26 ("xen/arm: don't ignore return errors from set_phys_to_machine")
ebee0eab0859 ("Xen/gntdev: correct error checking in gntdev_map_grant_pages()")
dbe5283605b3 ("Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()")
0102e4efda76 ("xen: Use evtchn_type_t as a type for event channels")
9293724192a7 ("xen/gntdev: Do not use mm notifiers with autotranslating guests")
b3f7931f5c61 ("xen/gntdev: switch from kcalloc() to kvcalloc()")
3b06ac6707c1 ("xen/gntdev: replace global limit of mapped pages by limit per call")
d3eeb1d77c5d ("xen/gntdev: use mmu_interval_notifier_insert")
ee7f5225dc3c ("xen: Stop abusing DT of_dma_configure API")
bce5963bcb4f ("xen/events: fix binding user event channels to cpus")
8b1e0f81fb6f ("mm/pgtable: drop pgtable_t variable from pte_fn_t functions")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5c13a4a0291b30191eff9ead8d010e1ca43a4d0c Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Date: Sun, 2 Oct 2022 18:20:06 -0400
Subject: [PATCH] xen/gntdev: Accommodate VMA splitting
Prior to this commit, the gntdev driver code did not handle the
following scenario correctly with paravirtualized (PV) Xen domains:
* User process sets up a gntdev mapping composed of two grant mappings
(i.e., two pages shared by another Xen domain).
* User process munmap()s one of the pages.
* User process munmap()s the remaining page.
* User process exits.
In the scenario above, the user process would cause the kernel to log
the following messages in dmesg for the first munmap(), and the second
munmap() call would result in similar log messages:
BUG: Bad page map in process doublemap.test pte:... pmd:...
page:0000000057c97bff refcount:1 mapcount:-1 \
mapping:0000000000000000 index:0x0 pfn:...
...
page dumped because: bad pte
...
file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
...
Call Trace:
<TASK>
dump_stack_lvl+0x46/0x5e
print_bad_pte.cold+0x66/0xb6
unmap_page_range+0x7e5/0xdc0
unmap_vmas+0x78/0xf0
unmap_region+0xa8/0x110
__do_munmap+0x1ea/0x4e0
__vm_munmap+0x75/0x120
__x64_sys_munmap+0x28/0x40
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
...
For each munmap() call, the Xen hypervisor (if built with CONFIG_DEBUG)
would print out the following and trigger a general protection fault in
the affected Xen PV domain:
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
As of this writing, gntdev_grant_map structure's vma field (referred to
as map->vma below) is mainly used for checking the start and end
addresses of mappings. However, with split VMAs, these may change, and
there could be more than one VMA associated with a gntdev mapping.
Hence, remove the use of map->vma and rely on map->pages_vm_start for
the original start address and on (map->count << PAGE_SHIFT) for the
original mapping size. Let the invalidate() and find_special_page()
hooks use these.
Also, given that there can be multiple VMAs associated with a gntdev
mapping, move the "mmu_interval_notifier_remove(&map->notifier)" call to
the end of gntdev_put_map, so that the MMU notifier is only removed
after the closing of the last remaining VMA.
Finally, use an atomic to prevent inadvertent gntdev mapping re-use,
instead of using the map->live_grants atomic counter and/or the map->vma
pointer (the latter of which is now removed). This prevents the
userspace from mmap()'ing (with MAP_FIXED) a gntdev mapping over the
same address range as a previously set up gntdev mapping. This scenario
can be summarized with the following call-trace, which was valid prior
to this commit:
mmap
gntdev_mmap
mmap (repeat mmap with MAP_FIXED over the same address range)
gntdev_invalidate
unmap_grant_pages (sets 'being_removed' entries to true)
gnttab_unmap_refs_async
unmap_single_vma
gntdev_mmap (maps the shared pages again)
munmap
gntdev_invalidate
unmap_grant_pages
(no-op because 'being_removed' entries are true)
unmap_single_vma (For PV domains, Xen reports that a granted page
is being unmapped and triggers a general protection fault in the
affected domain, if Xen was built with CONFIG_DEBUG)
The fix for this last scenario could be worth its own commit, but we
opted for a single commit, because removing the gntdev_grant_map
structure's vma field requires guarding the entry to gntdev_mmap(), and
the live_grants atomic counter is not sufficient on its own to prevent
the mmap() over a pre-existing mapping.
Link: https://github.com/QubesOS/qubes-issues/issues/7631
Fixes: ab31523c2fca ("xen/gntdev: allow usermode to map granted pages")
Cc: stable(a)vger.kernel.org
Signed-off-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Link: https://lore.kernel.org/r/20221002222006.2077-3-m.v.b@runbox.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/gntdev-common.h b/drivers/xen/gntdev-common.h
index 40ef379c28ab..9c286b2a1900 100644
--- a/drivers/xen/gntdev-common.h
+++ b/drivers/xen/gntdev-common.h
@@ -44,9 +44,10 @@ struct gntdev_unmap_notify {
};
struct gntdev_grant_map {
+ atomic_t in_use;
struct mmu_interval_notifier notifier;
+ bool notifier_init;
struct list_head next;
- struct vm_area_struct *vma;
int index;
int count;
int flags;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index eb0586b9767d..4d9a3050de6a 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -286,6 +286,9 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
*/
}
+ if (use_ptemod && map->notifier_init)
+ mmu_interval_notifier_remove(&map->notifier);
+
if (map->notify.flags & UNMAP_NOTIFY_SEND_EVENT) {
notify_remote_via_evtchn(map->notify.event);
evtchn_put(map->notify.event);
@@ -298,7 +301,7 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
{
struct gntdev_grant_map *map = data;
- unsigned int pgnr = (addr - map->vma->vm_start) >> PAGE_SHIFT;
+ unsigned int pgnr = (addr - map->pages_vm_start) >> PAGE_SHIFT;
int flags = map->flags | GNTMAP_application_map | GNTMAP_contains_pte |
(1 << _GNTMAP_guest_avail0);
u64 pte_maddr;
@@ -508,11 +511,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
struct gntdev_priv *priv = file->private_data;
pr_debug("gntdev_vma_close %p\n", vma);
- if (use_ptemod) {
- WARN_ON(map->vma != vma);
- mmu_interval_notifier_remove(&map->notifier);
- map->vma = NULL;
- }
+
vma->vm_private_data = NULL;
gntdev_put_map(priv, map);
}
@@ -540,29 +539,30 @@ static bool gntdev_invalidate(struct mmu_interval_notifier *mn,
struct gntdev_grant_map *map =
container_of(mn, struct gntdev_grant_map, notifier);
unsigned long mstart, mend;
+ unsigned long map_start, map_end;
if (!mmu_notifier_range_blockable(range))
return false;
+ map_start = map->pages_vm_start;
+ map_end = map->pages_vm_start + (map->count << PAGE_SHIFT);
+
/*
* If the VMA is split or otherwise changed the notifier is not
* updated, but we don't want to process VA's outside the modified
* VMA. FIXME: It would be much more understandable to just prevent
* modifying the VMA in the first place.
*/
- if (map->vma->vm_start >= range->end ||
- map->vma->vm_end <= range->start)
+ if (map_start >= range->end || map_end <= range->start)
return true;
- mstart = max(range->start, map->vma->vm_start);
- mend = min(range->end, map->vma->vm_end);
+ mstart = max(range->start, map_start);
+ mend = min(range->end, map_end);
pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
- map->index, map->count,
- map->vma->vm_start, map->vma->vm_end,
- range->start, range->end, mstart, mend);
- unmap_grant_pages(map,
- (mstart - map->vma->vm_start) >> PAGE_SHIFT,
- (mend - mstart) >> PAGE_SHIFT);
+ map->index, map->count, map_start, map_end,
+ range->start, range->end, mstart, mend);
+ unmap_grant_pages(map, (mstart - map_start) >> PAGE_SHIFT,
+ (mend - mstart) >> PAGE_SHIFT);
return true;
}
@@ -1042,18 +1042,15 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
return -EINVAL;
pr_debug("map %d+%d at %lx (pgoff %lx)\n",
- index, count, vma->vm_start, vma->vm_pgoff);
+ index, count, vma->vm_start, vma->vm_pgoff);
mutex_lock(&priv->lock);
map = gntdev_find_map_index(priv, index, count);
if (!map)
goto unlock_out;
- if (use_ptemod && map->vma)
- goto unlock_out;
- if (atomic_read(&map->live_grants)) {
- err = -EAGAIN;
+ if (!atomic_add_unless(&map->in_use, 1, 1))
goto unlock_out;
- }
+
refcount_inc(&map->users);
vma->vm_ops = &gntdev_vmops;
@@ -1074,15 +1071,16 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
map->flags |= GNTMAP_readonly;
}
+ map->pages_vm_start = vma->vm_start;
+
if (use_ptemod) {
- map->vma = vma;
err = mmu_interval_notifier_insert_locked(
&map->notifier, vma->vm_mm, vma->vm_start,
vma->vm_end - vma->vm_start, &gntdev_mmu_ops);
- if (err) {
- map->vma = NULL;
+ if (err)
goto out_unlock_put;
- }
+
+ map->notifier_init = true;
}
mutex_unlock(&priv->lock);
@@ -1099,7 +1097,6 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
*/
mmu_interval_read_begin(&map->notifier);
- map->pages_vm_start = vma->vm_start;
err = apply_to_page_range(vma->vm_mm, vma->vm_start,
vma->vm_end - vma->vm_start,
find_grant_ptes, map);
@@ -1128,13 +1125,8 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
out_unlock_put:
mutex_unlock(&priv->lock);
out_put_map:
- if (use_ptemod) {
+ if (use_ptemod)
unmap_grant_pages(map, 0, map->count);
- if (map->vma) {
- mmu_interval_notifier_remove(&map->notifier);
- map->vma = NULL;
- }
- }
gntdev_put_map(priv, map);
return err;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
5c13a4a0291b ("xen/gntdev: Accommodate VMA splitting")
dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
ce2f46f3531a ("xen/gntdev: fix unmap notification order")
f28347cc6639 ("Xen/gntdev: don't ignore kernel unmapping error")
30dcc56bba91 ("xen: assume XENFEAT_gnttab_map_avail_bits being set for pv guests")
970655aa9b42 ("xen/gntdev: fix gntdev_mmap() error exit path")
bce21a2b48ed ("Xen/gnttab: introduce common INVALID_GRANT_{HANDLE,REF}")
36caa3fedf06 ("Xen/gntdev: don't needlessly allocate k{,un}map_ops[]")
8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
36bf1dfb8b26 ("xen/arm: don't ignore return errors from set_phys_to_machine")
ebee0eab0859 ("Xen/gntdev: correct error checking in gntdev_map_grant_pages()")
dbe5283605b3 ("Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()")
0102e4efda76 ("xen: Use evtchn_type_t as a type for event channels")
9293724192a7 ("xen/gntdev: Do not use mm notifiers with autotranslating guests")
b3f7931f5c61 ("xen/gntdev: switch from kcalloc() to kvcalloc()")
3b06ac6707c1 ("xen/gntdev: replace global limit of mapped pages by limit per call")
d3eeb1d77c5d ("xen/gntdev: use mmu_interval_notifier_insert")
ee7f5225dc3c ("xen: Stop abusing DT of_dma_configure API")
bce5963bcb4f ("xen/events: fix binding user event channels to cpus")
8b1e0f81fb6f ("mm/pgtable: drop pgtable_t variable from pte_fn_t functions")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5c13a4a0291b30191eff9ead8d010e1ca43a4d0c Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Date: Sun, 2 Oct 2022 18:20:06 -0400
Subject: [PATCH] xen/gntdev: Accommodate VMA splitting
Prior to this commit, the gntdev driver code did not handle the
following scenario correctly with paravirtualized (PV) Xen domains:
* User process sets up a gntdev mapping composed of two grant mappings
(i.e., two pages shared by another Xen domain).
* User process munmap()s one of the pages.
* User process munmap()s the remaining page.
* User process exits.
In the scenario above, the user process would cause the kernel to log
the following messages in dmesg for the first munmap(), and the second
munmap() call would result in similar log messages:
BUG: Bad page map in process doublemap.test pte:... pmd:...
page:0000000057c97bff refcount:1 mapcount:-1 \
mapping:0000000000000000 index:0x0 pfn:...
...
page dumped because: bad pte
...
file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
...
Call Trace:
<TASK>
dump_stack_lvl+0x46/0x5e
print_bad_pte.cold+0x66/0xb6
unmap_page_range+0x7e5/0xdc0
unmap_vmas+0x78/0xf0
unmap_region+0xa8/0x110
__do_munmap+0x1ea/0x4e0
__vm_munmap+0x75/0x120
__x64_sys_munmap+0x28/0x40
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
...
For each munmap() call, the Xen hypervisor (if built with CONFIG_DEBUG)
would print out the following and trigger a general protection fault in
the affected Xen PV domain:
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
As of this writing, gntdev_grant_map structure's vma field (referred to
as map->vma below) is mainly used for checking the start and end
addresses of mappings. However, with split VMAs, these may change, and
there could be more than one VMA associated with a gntdev mapping.
Hence, remove the use of map->vma and rely on map->pages_vm_start for
the original start address and on (map->count << PAGE_SHIFT) for the
original mapping size. Let the invalidate() and find_special_page()
hooks use these.
Also, given that there can be multiple VMAs associated with a gntdev
mapping, move the "mmu_interval_notifier_remove(&map->notifier)" call to
the end of gntdev_put_map, so that the MMU notifier is only removed
after the closing of the last remaining VMA.
Finally, use an atomic to prevent inadvertent gntdev mapping re-use,
instead of using the map->live_grants atomic counter and/or the map->vma
pointer (the latter of which is now removed). This prevents the
userspace from mmap()'ing (with MAP_FIXED) a gntdev mapping over the
same address range as a previously set up gntdev mapping. This scenario
can be summarized with the following call-trace, which was valid prior
to this commit:
mmap
gntdev_mmap
mmap (repeat mmap with MAP_FIXED over the same address range)
gntdev_invalidate
unmap_grant_pages (sets 'being_removed' entries to true)
gnttab_unmap_refs_async
unmap_single_vma
gntdev_mmap (maps the shared pages again)
munmap
gntdev_invalidate
unmap_grant_pages
(no-op because 'being_removed' entries are true)
unmap_single_vma (For PV domains, Xen reports that a granted page
is being unmapped and triggers a general protection fault in the
affected domain, if Xen was built with CONFIG_DEBUG)
The fix for this last scenario could be worth its own commit, but we
opted for a single commit, because removing the gntdev_grant_map
structure's vma field requires guarding the entry to gntdev_mmap(), and
the live_grants atomic counter is not sufficient on its own to prevent
the mmap() over a pre-existing mapping.
Link: https://github.com/QubesOS/qubes-issues/issues/7631
Fixes: ab31523c2fca ("xen/gntdev: allow usermode to map granted pages")
Cc: stable(a)vger.kernel.org
Signed-off-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Link: https://lore.kernel.org/r/20221002222006.2077-3-m.v.b@runbox.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/gntdev-common.h b/drivers/xen/gntdev-common.h
index 40ef379c28ab..9c286b2a1900 100644
--- a/drivers/xen/gntdev-common.h
+++ b/drivers/xen/gntdev-common.h
@@ -44,9 +44,10 @@ struct gntdev_unmap_notify {
};
struct gntdev_grant_map {
+ atomic_t in_use;
struct mmu_interval_notifier notifier;
+ bool notifier_init;
struct list_head next;
- struct vm_area_struct *vma;
int index;
int count;
int flags;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index eb0586b9767d..4d9a3050de6a 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -286,6 +286,9 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
*/
}
+ if (use_ptemod && map->notifier_init)
+ mmu_interval_notifier_remove(&map->notifier);
+
if (map->notify.flags & UNMAP_NOTIFY_SEND_EVENT) {
notify_remote_via_evtchn(map->notify.event);
evtchn_put(map->notify.event);
@@ -298,7 +301,7 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
{
struct gntdev_grant_map *map = data;
- unsigned int pgnr = (addr - map->vma->vm_start) >> PAGE_SHIFT;
+ unsigned int pgnr = (addr - map->pages_vm_start) >> PAGE_SHIFT;
int flags = map->flags | GNTMAP_application_map | GNTMAP_contains_pte |
(1 << _GNTMAP_guest_avail0);
u64 pte_maddr;
@@ -508,11 +511,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
struct gntdev_priv *priv = file->private_data;
pr_debug("gntdev_vma_close %p\n", vma);
- if (use_ptemod) {
- WARN_ON(map->vma != vma);
- mmu_interval_notifier_remove(&map->notifier);
- map->vma = NULL;
- }
+
vma->vm_private_data = NULL;
gntdev_put_map(priv, map);
}
@@ -540,29 +539,30 @@ static bool gntdev_invalidate(struct mmu_interval_notifier *mn,
struct gntdev_grant_map *map =
container_of(mn, struct gntdev_grant_map, notifier);
unsigned long mstart, mend;
+ unsigned long map_start, map_end;
if (!mmu_notifier_range_blockable(range))
return false;
+ map_start = map->pages_vm_start;
+ map_end = map->pages_vm_start + (map->count << PAGE_SHIFT);
+
/*
* If the VMA is split or otherwise changed the notifier is not
* updated, but we don't want to process VA's outside the modified
* VMA. FIXME: It would be much more understandable to just prevent
* modifying the VMA in the first place.
*/
- if (map->vma->vm_start >= range->end ||
- map->vma->vm_end <= range->start)
+ if (map_start >= range->end || map_end <= range->start)
return true;
- mstart = max(range->start, map->vma->vm_start);
- mend = min(range->end, map->vma->vm_end);
+ mstart = max(range->start, map_start);
+ mend = min(range->end, map_end);
pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
- map->index, map->count,
- map->vma->vm_start, map->vma->vm_end,
- range->start, range->end, mstart, mend);
- unmap_grant_pages(map,
- (mstart - map->vma->vm_start) >> PAGE_SHIFT,
- (mend - mstart) >> PAGE_SHIFT);
+ map->index, map->count, map_start, map_end,
+ range->start, range->end, mstart, mend);
+ unmap_grant_pages(map, (mstart - map_start) >> PAGE_SHIFT,
+ (mend - mstart) >> PAGE_SHIFT);
return true;
}
@@ -1042,18 +1042,15 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
return -EINVAL;
pr_debug("map %d+%d at %lx (pgoff %lx)\n",
- index, count, vma->vm_start, vma->vm_pgoff);
+ index, count, vma->vm_start, vma->vm_pgoff);
mutex_lock(&priv->lock);
map = gntdev_find_map_index(priv, index, count);
if (!map)
goto unlock_out;
- if (use_ptemod && map->vma)
- goto unlock_out;
- if (atomic_read(&map->live_grants)) {
- err = -EAGAIN;
+ if (!atomic_add_unless(&map->in_use, 1, 1))
goto unlock_out;
- }
+
refcount_inc(&map->users);
vma->vm_ops = &gntdev_vmops;
@@ -1074,15 +1071,16 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
map->flags |= GNTMAP_readonly;
}
+ map->pages_vm_start = vma->vm_start;
+
if (use_ptemod) {
- map->vma = vma;
err = mmu_interval_notifier_insert_locked(
&map->notifier, vma->vm_mm, vma->vm_start,
vma->vm_end - vma->vm_start, &gntdev_mmu_ops);
- if (err) {
- map->vma = NULL;
+ if (err)
goto out_unlock_put;
- }
+
+ map->notifier_init = true;
}
mutex_unlock(&priv->lock);
@@ -1099,7 +1097,6 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
*/
mmu_interval_read_begin(&map->notifier);
- map->pages_vm_start = vma->vm_start;
err = apply_to_page_range(vma->vm_mm, vma->vm_start,
vma->vm_end - vma->vm_start,
find_grant_ptes, map);
@@ -1128,13 +1125,8 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
out_unlock_put:
mutex_unlock(&priv->lock);
out_put_map:
- if (use_ptemod) {
+ if (use_ptemod)
unmap_grant_pages(map, 0, map->count);
- if (map->vma) {
- mmu_interval_notifier_remove(&map->notifier);
- map->vma = NULL;
- }
- }
gntdev_put_map(priv, map);
return err;
}
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
5c13a4a0291b ("xen/gntdev: Accommodate VMA splitting")
dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
ce2f46f3531a ("xen/gntdev: fix unmap notification order")
f28347cc6639 ("Xen/gntdev: don't ignore kernel unmapping error")
30dcc56bba91 ("xen: assume XENFEAT_gnttab_map_avail_bits being set for pv guests")
970655aa9b42 ("xen/gntdev: fix gntdev_mmap() error exit path")
bce21a2b48ed ("Xen/gnttab: introduce common INVALID_GRANT_{HANDLE,REF}")
36caa3fedf06 ("Xen/gntdev: don't needlessly allocate k{,un}map_ops[]")
8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
36bf1dfb8b26 ("xen/arm: don't ignore return errors from set_phys_to_machine")
ebee0eab0859 ("Xen/gntdev: correct error checking in gntdev_map_grant_pages()")
dbe5283605b3 ("Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()")
0102e4efda76 ("xen: Use evtchn_type_t as a type for event channels")
9293724192a7 ("xen/gntdev: Do not use mm notifiers with autotranslating guests")
b3f7931f5c61 ("xen/gntdev: switch from kcalloc() to kvcalloc()")
3b06ac6707c1 ("xen/gntdev: replace global limit of mapped pages by limit per call")
d3eeb1d77c5d ("xen/gntdev: use mmu_interval_notifier_insert")
ee7f5225dc3c ("xen: Stop abusing DT of_dma_configure API")
bce5963bcb4f ("xen/events: fix binding user event channels to cpus")
8b1e0f81fb6f ("mm/pgtable: drop pgtable_t variable from pte_fn_t functions")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5c13a4a0291b30191eff9ead8d010e1ca43a4d0c Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Date: Sun, 2 Oct 2022 18:20:06 -0400
Subject: [PATCH] xen/gntdev: Accommodate VMA splitting
Prior to this commit, the gntdev driver code did not handle the
following scenario correctly with paravirtualized (PV) Xen domains:
* User process sets up a gntdev mapping composed of two grant mappings
(i.e., two pages shared by another Xen domain).
* User process munmap()s one of the pages.
* User process munmap()s the remaining page.
* User process exits.
In the scenario above, the user process would cause the kernel to log
the following messages in dmesg for the first munmap(), and the second
munmap() call would result in similar log messages:
BUG: Bad page map in process doublemap.test pte:... pmd:...
page:0000000057c97bff refcount:1 mapcount:-1 \
mapping:0000000000000000 index:0x0 pfn:...
...
page dumped because: bad pte
...
file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
...
Call Trace:
<TASK>
dump_stack_lvl+0x46/0x5e
print_bad_pte.cold+0x66/0xb6
unmap_page_range+0x7e5/0xdc0
unmap_vmas+0x78/0xf0
unmap_region+0xa8/0x110
__do_munmap+0x1ea/0x4e0
__vm_munmap+0x75/0x120
__x64_sys_munmap+0x28/0x40
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
...
For each munmap() call, the Xen hypervisor (if built with CONFIG_DEBUG)
would print out the following and trigger a general protection fault in
the affected Xen PV domain:
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
As of this writing, gntdev_grant_map structure's vma field (referred to
as map->vma below) is mainly used for checking the start and end
addresses of mappings. However, with split VMAs, these may change, and
there could be more than one VMA associated with a gntdev mapping.
Hence, remove the use of map->vma and rely on map->pages_vm_start for
the original start address and on (map->count << PAGE_SHIFT) for the
original mapping size. Let the invalidate() and find_special_page()
hooks use these.
Also, given that there can be multiple VMAs associated with a gntdev
mapping, move the "mmu_interval_notifier_remove(&map->notifier)" call to
the end of gntdev_put_map, so that the MMU notifier is only removed
after the closing of the last remaining VMA.
Finally, use an atomic to prevent inadvertent gntdev mapping re-use,
instead of using the map->live_grants atomic counter and/or the map->vma
pointer (the latter of which is now removed). This prevents the
userspace from mmap()'ing (with MAP_FIXED) a gntdev mapping over the
same address range as a previously set up gntdev mapping. This scenario
can be summarized with the following call-trace, which was valid prior
to this commit:
mmap
gntdev_mmap
mmap (repeat mmap with MAP_FIXED over the same address range)
gntdev_invalidate
unmap_grant_pages (sets 'being_removed' entries to true)
gnttab_unmap_refs_async
unmap_single_vma
gntdev_mmap (maps the shared pages again)
munmap
gntdev_invalidate
unmap_grant_pages
(no-op because 'being_removed' entries are true)
unmap_single_vma (For PV domains, Xen reports that a granted page
is being unmapped and triggers a general protection fault in the
affected domain, if Xen was built with CONFIG_DEBUG)
The fix for this last scenario could be worth its own commit, but we
opted for a single commit, because removing the gntdev_grant_map
structure's vma field requires guarding the entry to gntdev_mmap(), and
the live_grants atomic counter is not sufficient on its own to prevent
the mmap() over a pre-existing mapping.
Link: https://github.com/QubesOS/qubes-issues/issues/7631
Fixes: ab31523c2fca ("xen/gntdev: allow usermode to map granted pages")
Cc: stable(a)vger.kernel.org
Signed-off-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Link: https://lore.kernel.org/r/20221002222006.2077-3-m.v.b@runbox.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/gntdev-common.h b/drivers/xen/gntdev-common.h
index 40ef379c28ab..9c286b2a1900 100644
--- a/drivers/xen/gntdev-common.h
+++ b/drivers/xen/gntdev-common.h
@@ -44,9 +44,10 @@ struct gntdev_unmap_notify {
};
struct gntdev_grant_map {
+ atomic_t in_use;
struct mmu_interval_notifier notifier;
+ bool notifier_init;
struct list_head next;
- struct vm_area_struct *vma;
int index;
int count;
int flags;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index eb0586b9767d..4d9a3050de6a 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -286,6 +286,9 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
*/
}
+ if (use_ptemod && map->notifier_init)
+ mmu_interval_notifier_remove(&map->notifier);
+
if (map->notify.flags & UNMAP_NOTIFY_SEND_EVENT) {
notify_remote_via_evtchn(map->notify.event);
evtchn_put(map->notify.event);
@@ -298,7 +301,7 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
{
struct gntdev_grant_map *map = data;
- unsigned int pgnr = (addr - map->vma->vm_start) >> PAGE_SHIFT;
+ unsigned int pgnr = (addr - map->pages_vm_start) >> PAGE_SHIFT;
int flags = map->flags | GNTMAP_application_map | GNTMAP_contains_pte |
(1 << _GNTMAP_guest_avail0);
u64 pte_maddr;
@@ -508,11 +511,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
struct gntdev_priv *priv = file->private_data;
pr_debug("gntdev_vma_close %p\n", vma);
- if (use_ptemod) {
- WARN_ON(map->vma != vma);
- mmu_interval_notifier_remove(&map->notifier);
- map->vma = NULL;
- }
+
vma->vm_private_data = NULL;
gntdev_put_map(priv, map);
}
@@ -540,29 +539,30 @@ static bool gntdev_invalidate(struct mmu_interval_notifier *mn,
struct gntdev_grant_map *map =
container_of(mn, struct gntdev_grant_map, notifier);
unsigned long mstart, mend;
+ unsigned long map_start, map_end;
if (!mmu_notifier_range_blockable(range))
return false;
+ map_start = map->pages_vm_start;
+ map_end = map->pages_vm_start + (map->count << PAGE_SHIFT);
+
/*
* If the VMA is split or otherwise changed the notifier is not
* updated, but we don't want to process VA's outside the modified
* VMA. FIXME: It would be much more understandable to just prevent
* modifying the VMA in the first place.
*/
- if (map->vma->vm_start >= range->end ||
- map->vma->vm_end <= range->start)
+ if (map_start >= range->end || map_end <= range->start)
return true;
- mstart = max(range->start, map->vma->vm_start);
- mend = min(range->end, map->vma->vm_end);
+ mstart = max(range->start, map_start);
+ mend = min(range->end, map_end);
pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
- map->index, map->count,
- map->vma->vm_start, map->vma->vm_end,
- range->start, range->end, mstart, mend);
- unmap_grant_pages(map,
- (mstart - map->vma->vm_start) >> PAGE_SHIFT,
- (mend - mstart) >> PAGE_SHIFT);
+ map->index, map->count, map_start, map_end,
+ range->start, range->end, mstart, mend);
+ unmap_grant_pages(map, (mstart - map_start) >> PAGE_SHIFT,
+ (mend - mstart) >> PAGE_SHIFT);
return true;
}
@@ -1042,18 +1042,15 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
return -EINVAL;
pr_debug("map %d+%d at %lx (pgoff %lx)\n",
- index, count, vma->vm_start, vma->vm_pgoff);
+ index, count, vma->vm_start, vma->vm_pgoff);
mutex_lock(&priv->lock);
map = gntdev_find_map_index(priv, index, count);
if (!map)
goto unlock_out;
- if (use_ptemod && map->vma)
- goto unlock_out;
- if (atomic_read(&map->live_grants)) {
- err = -EAGAIN;
+ if (!atomic_add_unless(&map->in_use, 1, 1))
goto unlock_out;
- }
+
refcount_inc(&map->users);
vma->vm_ops = &gntdev_vmops;
@@ -1074,15 +1071,16 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
map->flags |= GNTMAP_readonly;
}
+ map->pages_vm_start = vma->vm_start;
+
if (use_ptemod) {
- map->vma = vma;
err = mmu_interval_notifier_insert_locked(
&map->notifier, vma->vm_mm, vma->vm_start,
vma->vm_end - vma->vm_start, &gntdev_mmu_ops);
- if (err) {
- map->vma = NULL;
+ if (err)
goto out_unlock_put;
- }
+
+ map->notifier_init = true;
}
mutex_unlock(&priv->lock);
@@ -1099,7 +1097,6 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
*/
mmu_interval_read_begin(&map->notifier);
- map->pages_vm_start = vma->vm_start;
err = apply_to_page_range(vma->vm_mm, vma->vm_start,
vma->vm_end - vma->vm_start,
find_grant_ptes, map);
@@ -1128,13 +1125,8 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
out_unlock_put:
mutex_unlock(&priv->lock);
out_put_map:
- if (use_ptemod) {
+ if (use_ptemod)
unmap_grant_pages(map, 0, map->count);
- if (map->vma) {
- mmu_interval_notifier_remove(&map->notifier);
- map->vma = NULL;
- }
- }
gntdev_put_map(priv, map);
return err;
}
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
5c13a4a0291b ("xen/gntdev: Accommodate VMA splitting")
dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
ce2f46f3531a ("xen/gntdev: fix unmap notification order")
f28347cc6639 ("Xen/gntdev: don't ignore kernel unmapping error")
30dcc56bba91 ("xen: assume XENFEAT_gnttab_map_avail_bits being set for pv guests")
970655aa9b42 ("xen/gntdev: fix gntdev_mmap() error exit path")
bce21a2b48ed ("Xen/gnttab: introduce common INVALID_GRANT_{HANDLE,REF}")
36caa3fedf06 ("Xen/gntdev: don't needlessly allocate k{,un}map_ops[]")
8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
36bf1dfb8b26 ("xen/arm: don't ignore return errors from set_phys_to_machine")
ebee0eab0859 ("Xen/gntdev: correct error checking in gntdev_map_grant_pages()")
dbe5283605b3 ("Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()")
0102e4efda76 ("xen: Use evtchn_type_t as a type for event channels")
9293724192a7 ("xen/gntdev: Do not use mm notifiers with autotranslating guests")
b3f7931f5c61 ("xen/gntdev: switch from kcalloc() to kvcalloc()")
3b06ac6707c1 ("xen/gntdev: replace global limit of mapped pages by limit per call")
d3eeb1d77c5d ("xen/gntdev: use mmu_interval_notifier_insert")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5c13a4a0291b30191eff9ead8d010e1ca43a4d0c Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Date: Sun, 2 Oct 2022 18:20:06 -0400
Subject: [PATCH] xen/gntdev: Accommodate VMA splitting
Prior to this commit, the gntdev driver code did not handle the
following scenario correctly with paravirtualized (PV) Xen domains:
* User process sets up a gntdev mapping composed of two grant mappings
(i.e., two pages shared by another Xen domain).
* User process munmap()s one of the pages.
* User process munmap()s the remaining page.
* User process exits.
In the scenario above, the user process would cause the kernel to log
the following messages in dmesg for the first munmap(), and the second
munmap() call would result in similar log messages:
BUG: Bad page map in process doublemap.test pte:... pmd:...
page:0000000057c97bff refcount:1 mapcount:-1 \
mapping:0000000000000000 index:0x0 pfn:...
...
page dumped because: bad pte
...
file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
...
Call Trace:
<TASK>
dump_stack_lvl+0x46/0x5e
print_bad_pte.cold+0x66/0xb6
unmap_page_range+0x7e5/0xdc0
unmap_vmas+0x78/0xf0
unmap_region+0xa8/0x110
__do_munmap+0x1ea/0x4e0
__vm_munmap+0x75/0x120
__x64_sys_munmap+0x28/0x40
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
...
For each munmap() call, the Xen hypervisor (if built with CONFIG_DEBUG)
would print out the following and trigger a general protection fault in
the affected Xen PV domain:
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
As of this writing, gntdev_grant_map structure's vma field (referred to
as map->vma below) is mainly used for checking the start and end
addresses of mappings. However, with split VMAs, these may change, and
there could be more than one VMA associated with a gntdev mapping.
Hence, remove the use of map->vma and rely on map->pages_vm_start for
the original start address and on (map->count << PAGE_SHIFT) for the
original mapping size. Let the invalidate() and find_special_page()
hooks use these.
Also, given that there can be multiple VMAs associated with a gntdev
mapping, move the "mmu_interval_notifier_remove(&map->notifier)" call to
the end of gntdev_put_map, so that the MMU notifier is only removed
after the closing of the last remaining VMA.
Finally, use an atomic to prevent inadvertent gntdev mapping re-use,
instead of using the map->live_grants atomic counter and/or the map->vma
pointer (the latter of which is now removed). This prevents the
userspace from mmap()'ing (with MAP_FIXED) a gntdev mapping over the
same address range as a previously set up gntdev mapping. This scenario
can be summarized with the following call-trace, which was valid prior
to this commit:
mmap
gntdev_mmap
mmap (repeat mmap with MAP_FIXED over the same address range)
gntdev_invalidate
unmap_grant_pages (sets 'being_removed' entries to true)
gnttab_unmap_refs_async
unmap_single_vma
gntdev_mmap (maps the shared pages again)
munmap
gntdev_invalidate
unmap_grant_pages
(no-op because 'being_removed' entries are true)
unmap_single_vma (For PV domains, Xen reports that a granted page
is being unmapped and triggers a general protection fault in the
affected domain, if Xen was built with CONFIG_DEBUG)
The fix for this last scenario could be worth its own commit, but we
opted for a single commit, because removing the gntdev_grant_map
structure's vma field requires guarding the entry to gntdev_mmap(), and
the live_grants atomic counter is not sufficient on its own to prevent
the mmap() over a pre-existing mapping.
Link: https://github.com/QubesOS/qubes-issues/issues/7631
Fixes: ab31523c2fca ("xen/gntdev: allow usermode to map granted pages")
Cc: stable(a)vger.kernel.org
Signed-off-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Link: https://lore.kernel.org/r/20221002222006.2077-3-m.v.b@runbox.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/gntdev-common.h b/drivers/xen/gntdev-common.h
index 40ef379c28ab..9c286b2a1900 100644
--- a/drivers/xen/gntdev-common.h
+++ b/drivers/xen/gntdev-common.h
@@ -44,9 +44,10 @@ struct gntdev_unmap_notify {
};
struct gntdev_grant_map {
+ atomic_t in_use;
struct mmu_interval_notifier notifier;
+ bool notifier_init;
struct list_head next;
- struct vm_area_struct *vma;
int index;
int count;
int flags;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index eb0586b9767d..4d9a3050de6a 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -286,6 +286,9 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
*/
}
+ if (use_ptemod && map->notifier_init)
+ mmu_interval_notifier_remove(&map->notifier);
+
if (map->notify.flags & UNMAP_NOTIFY_SEND_EVENT) {
notify_remote_via_evtchn(map->notify.event);
evtchn_put(map->notify.event);
@@ -298,7 +301,7 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
{
struct gntdev_grant_map *map = data;
- unsigned int pgnr = (addr - map->vma->vm_start) >> PAGE_SHIFT;
+ unsigned int pgnr = (addr - map->pages_vm_start) >> PAGE_SHIFT;
int flags = map->flags | GNTMAP_application_map | GNTMAP_contains_pte |
(1 << _GNTMAP_guest_avail0);
u64 pte_maddr;
@@ -508,11 +511,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
struct gntdev_priv *priv = file->private_data;
pr_debug("gntdev_vma_close %p\n", vma);
- if (use_ptemod) {
- WARN_ON(map->vma != vma);
- mmu_interval_notifier_remove(&map->notifier);
- map->vma = NULL;
- }
+
vma->vm_private_data = NULL;
gntdev_put_map(priv, map);
}
@@ -540,29 +539,30 @@ static bool gntdev_invalidate(struct mmu_interval_notifier *mn,
struct gntdev_grant_map *map =
container_of(mn, struct gntdev_grant_map, notifier);
unsigned long mstart, mend;
+ unsigned long map_start, map_end;
if (!mmu_notifier_range_blockable(range))
return false;
+ map_start = map->pages_vm_start;
+ map_end = map->pages_vm_start + (map->count << PAGE_SHIFT);
+
/*
* If the VMA is split or otherwise changed the notifier is not
* updated, but we don't want to process VA's outside the modified
* VMA. FIXME: It would be much more understandable to just prevent
* modifying the VMA in the first place.
*/
- if (map->vma->vm_start >= range->end ||
- map->vma->vm_end <= range->start)
+ if (map_start >= range->end || map_end <= range->start)
return true;
- mstart = max(range->start, map->vma->vm_start);
- mend = min(range->end, map->vma->vm_end);
+ mstart = max(range->start, map_start);
+ mend = min(range->end, map_end);
pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
- map->index, map->count,
- map->vma->vm_start, map->vma->vm_end,
- range->start, range->end, mstart, mend);
- unmap_grant_pages(map,
- (mstart - map->vma->vm_start) >> PAGE_SHIFT,
- (mend - mstart) >> PAGE_SHIFT);
+ map->index, map->count, map_start, map_end,
+ range->start, range->end, mstart, mend);
+ unmap_grant_pages(map, (mstart - map_start) >> PAGE_SHIFT,
+ (mend - mstart) >> PAGE_SHIFT);
return true;
}
@@ -1042,18 +1042,15 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
return -EINVAL;
pr_debug("map %d+%d at %lx (pgoff %lx)\n",
- index, count, vma->vm_start, vma->vm_pgoff);
+ index, count, vma->vm_start, vma->vm_pgoff);
mutex_lock(&priv->lock);
map = gntdev_find_map_index(priv, index, count);
if (!map)
goto unlock_out;
- if (use_ptemod && map->vma)
- goto unlock_out;
- if (atomic_read(&map->live_grants)) {
- err = -EAGAIN;
+ if (!atomic_add_unless(&map->in_use, 1, 1))
goto unlock_out;
- }
+
refcount_inc(&map->users);
vma->vm_ops = &gntdev_vmops;
@@ -1074,15 +1071,16 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
map->flags |= GNTMAP_readonly;
}
+ map->pages_vm_start = vma->vm_start;
+
if (use_ptemod) {
- map->vma = vma;
err = mmu_interval_notifier_insert_locked(
&map->notifier, vma->vm_mm, vma->vm_start,
vma->vm_end - vma->vm_start, &gntdev_mmu_ops);
- if (err) {
- map->vma = NULL;
+ if (err)
goto out_unlock_put;
- }
+
+ map->notifier_init = true;
}
mutex_unlock(&priv->lock);
@@ -1099,7 +1097,6 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
*/
mmu_interval_read_begin(&map->notifier);
- map->pages_vm_start = vma->vm_start;
err = apply_to_page_range(vma->vm_mm, vma->vm_start,
vma->vm_end - vma->vm_start,
find_grant_ptes, map);
@@ -1128,13 +1125,8 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
out_unlock_put:
mutex_unlock(&priv->lock);
out_put_map:
- if (use_ptemod) {
+ if (use_ptemod)
unmap_grant_pages(map, 0, map->count);
- if (map->vma) {
- mmu_interval_notifier_remove(&map->notifier);
- map->vma = NULL;
- }
- }
gntdev_put_map(priv, map);
return err;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
5c13a4a0291b ("xen/gntdev: Accommodate VMA splitting")
dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
ce2f46f3531a ("xen/gntdev: fix unmap notification order")
f28347cc6639 ("Xen/gntdev: don't ignore kernel unmapping error")
30dcc56bba91 ("xen: assume XENFEAT_gnttab_map_avail_bits being set for pv guests")
970655aa9b42 ("xen/gntdev: fix gntdev_mmap() error exit path")
bce21a2b48ed ("Xen/gnttab: introduce common INVALID_GRANT_{HANDLE,REF}")
36caa3fedf06 ("Xen/gntdev: don't needlessly allocate k{,un}map_ops[]")
8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
36bf1dfb8b26 ("xen/arm: don't ignore return errors from set_phys_to_machine")
ebee0eab0859 ("Xen/gntdev: correct error checking in gntdev_map_grant_pages()")
dbe5283605b3 ("Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 5c13a4a0291b30191eff9ead8d010e1ca43a4d0c Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Date: Sun, 2 Oct 2022 18:20:06 -0400
Subject: [PATCH] xen/gntdev: Accommodate VMA splitting
Prior to this commit, the gntdev driver code did not handle the
following scenario correctly with paravirtualized (PV) Xen domains:
* User process sets up a gntdev mapping composed of two grant mappings
(i.e., two pages shared by another Xen domain).
* User process munmap()s one of the pages.
* User process munmap()s the remaining page.
* User process exits.
In the scenario above, the user process would cause the kernel to log
the following messages in dmesg for the first munmap(), and the second
munmap() call would result in similar log messages:
BUG: Bad page map in process doublemap.test pte:... pmd:...
page:0000000057c97bff refcount:1 mapcount:-1 \
mapping:0000000000000000 index:0x0 pfn:...
...
page dumped because: bad pte
...
file:gntdev fault:0x0 mmap:gntdev_mmap [xen_gntdev] readpage:0x0
...
Call Trace:
<TASK>
dump_stack_lvl+0x46/0x5e
print_bad_pte.cold+0x66/0xb6
unmap_page_range+0x7e5/0xdc0
unmap_vmas+0x78/0xf0
unmap_region+0xa8/0x110
__do_munmap+0x1ea/0x4e0
__vm_munmap+0x75/0x120
__x64_sys_munmap+0x28/0x40
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
...
For each munmap() call, the Xen hypervisor (if built with CONFIG_DEBUG)
would print out the following and trigger a general protection fault in
the affected Xen PV domain:
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
(XEN) d0v... Attempt to implicitly unmap d0's grant PTE ...
As of this writing, gntdev_grant_map structure's vma field (referred to
as map->vma below) is mainly used for checking the start and end
addresses of mappings. However, with split VMAs, these may change, and
there could be more than one VMA associated with a gntdev mapping.
Hence, remove the use of map->vma and rely on map->pages_vm_start for
the original start address and on (map->count << PAGE_SHIFT) for the
original mapping size. Let the invalidate() and find_special_page()
hooks use these.
Also, given that there can be multiple VMAs associated with a gntdev
mapping, move the "mmu_interval_notifier_remove(&map->notifier)" call to
the end of gntdev_put_map, so that the MMU notifier is only removed
after the closing of the last remaining VMA.
Finally, use an atomic to prevent inadvertent gntdev mapping re-use,
instead of using the map->live_grants atomic counter and/or the map->vma
pointer (the latter of which is now removed). This prevents the
userspace from mmap()'ing (with MAP_FIXED) a gntdev mapping over the
same address range as a previously set up gntdev mapping. This scenario
can be summarized with the following call-trace, which was valid prior
to this commit:
mmap
gntdev_mmap
mmap (repeat mmap with MAP_FIXED over the same address range)
gntdev_invalidate
unmap_grant_pages (sets 'being_removed' entries to true)
gnttab_unmap_refs_async
unmap_single_vma
gntdev_mmap (maps the shared pages again)
munmap
gntdev_invalidate
unmap_grant_pages
(no-op because 'being_removed' entries are true)
unmap_single_vma (For PV domains, Xen reports that a granted page
is being unmapped and triggers a general protection fault in the
affected domain, if Xen was built with CONFIG_DEBUG)
The fix for this last scenario could be worth its own commit, but we
opted for a single commit, because removing the gntdev_grant_map
structure's vma field requires guarding the entry to gntdev_mmap(), and
the live_grants atomic counter is not sufficient on its own to prevent
the mmap() over a pre-existing mapping.
Link: https://github.com/QubesOS/qubes-issues/issues/7631
Fixes: ab31523c2fca ("xen/gntdev: allow usermode to map granted pages")
Cc: stable(a)vger.kernel.org
Signed-off-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Link: https://lore.kernel.org/r/20221002222006.2077-3-m.v.b@runbox.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/gntdev-common.h b/drivers/xen/gntdev-common.h
index 40ef379c28ab..9c286b2a1900 100644
--- a/drivers/xen/gntdev-common.h
+++ b/drivers/xen/gntdev-common.h
@@ -44,9 +44,10 @@ struct gntdev_unmap_notify {
};
struct gntdev_grant_map {
+ atomic_t in_use;
struct mmu_interval_notifier notifier;
+ bool notifier_init;
struct list_head next;
- struct vm_area_struct *vma;
int index;
int count;
int flags;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index eb0586b9767d..4d9a3050de6a 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -286,6 +286,9 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
*/
}
+ if (use_ptemod && map->notifier_init)
+ mmu_interval_notifier_remove(&map->notifier);
+
if (map->notify.flags & UNMAP_NOTIFY_SEND_EVENT) {
notify_remote_via_evtchn(map->notify.event);
evtchn_put(map->notify.event);
@@ -298,7 +301,7 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
{
struct gntdev_grant_map *map = data;
- unsigned int pgnr = (addr - map->vma->vm_start) >> PAGE_SHIFT;
+ unsigned int pgnr = (addr - map->pages_vm_start) >> PAGE_SHIFT;
int flags = map->flags | GNTMAP_application_map | GNTMAP_contains_pte |
(1 << _GNTMAP_guest_avail0);
u64 pte_maddr;
@@ -508,11 +511,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
struct gntdev_priv *priv = file->private_data;
pr_debug("gntdev_vma_close %p\n", vma);
- if (use_ptemod) {
- WARN_ON(map->vma != vma);
- mmu_interval_notifier_remove(&map->notifier);
- map->vma = NULL;
- }
+
vma->vm_private_data = NULL;
gntdev_put_map(priv, map);
}
@@ -540,29 +539,30 @@ static bool gntdev_invalidate(struct mmu_interval_notifier *mn,
struct gntdev_grant_map *map =
container_of(mn, struct gntdev_grant_map, notifier);
unsigned long mstart, mend;
+ unsigned long map_start, map_end;
if (!mmu_notifier_range_blockable(range))
return false;
+ map_start = map->pages_vm_start;
+ map_end = map->pages_vm_start + (map->count << PAGE_SHIFT);
+
/*
* If the VMA is split or otherwise changed the notifier is not
* updated, but we don't want to process VA's outside the modified
* VMA. FIXME: It would be much more understandable to just prevent
* modifying the VMA in the first place.
*/
- if (map->vma->vm_start >= range->end ||
- map->vma->vm_end <= range->start)
+ if (map_start >= range->end || map_end <= range->start)
return true;
- mstart = max(range->start, map->vma->vm_start);
- mend = min(range->end, map->vma->vm_end);
+ mstart = max(range->start, map_start);
+ mend = min(range->end, map_end);
pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
- map->index, map->count,
- map->vma->vm_start, map->vma->vm_end,
- range->start, range->end, mstart, mend);
- unmap_grant_pages(map,
- (mstart - map->vma->vm_start) >> PAGE_SHIFT,
- (mend - mstart) >> PAGE_SHIFT);
+ map->index, map->count, map_start, map_end,
+ range->start, range->end, mstart, mend);
+ unmap_grant_pages(map, (mstart - map_start) >> PAGE_SHIFT,
+ (mend - mstart) >> PAGE_SHIFT);
return true;
}
@@ -1042,18 +1042,15 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
return -EINVAL;
pr_debug("map %d+%d at %lx (pgoff %lx)\n",
- index, count, vma->vm_start, vma->vm_pgoff);
+ index, count, vma->vm_start, vma->vm_pgoff);
mutex_lock(&priv->lock);
map = gntdev_find_map_index(priv, index, count);
if (!map)
goto unlock_out;
- if (use_ptemod && map->vma)
- goto unlock_out;
- if (atomic_read(&map->live_grants)) {
- err = -EAGAIN;
+ if (!atomic_add_unless(&map->in_use, 1, 1))
goto unlock_out;
- }
+
refcount_inc(&map->users);
vma->vm_ops = &gntdev_vmops;
@@ -1074,15 +1071,16 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
map->flags |= GNTMAP_readonly;
}
+ map->pages_vm_start = vma->vm_start;
+
if (use_ptemod) {
- map->vma = vma;
err = mmu_interval_notifier_insert_locked(
&map->notifier, vma->vm_mm, vma->vm_start,
vma->vm_end - vma->vm_start, &gntdev_mmu_ops);
- if (err) {
- map->vma = NULL;
+ if (err)
goto out_unlock_put;
- }
+
+ map->notifier_init = true;
}
mutex_unlock(&priv->lock);
@@ -1099,7 +1097,6 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
*/
mmu_interval_read_begin(&map->notifier);
- map->pages_vm_start = vma->vm_start;
err = apply_to_page_range(vma->vm_mm, vma->vm_start,
vma->vm_end - vma->vm_start,
find_grant_ptes, map);
@@ -1128,13 +1125,8 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
out_unlock_put:
mutex_unlock(&priv->lock);
out_put_map:
- if (use_ptemod) {
+ if (use_ptemod)
unmap_grant_pages(map, 0, map->count);
- if (map->vma) {
- mmu_interval_notifier_remove(&map->notifier);
- map->vma = NULL;
- }
- }
gntdev_put_map(priv, map);
return err;
}
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
0991028cd495 ("xen/gntdev: Prevent leaking grants")
166d38632316 ("xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE")
dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
ce2f46f3531a ("xen/gntdev: fix unmap notification order")
f28347cc6639 ("Xen/gntdev: don't ignore kernel unmapping error")
bce21a2b48ed ("Xen/gnttab: introduce common INVALID_GRANT_{HANDLE,REF}")
36caa3fedf06 ("Xen/gntdev: don't needlessly allocate k{,un}map_ops[]")
8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
36bf1dfb8b26 ("xen/arm: don't ignore return errors from set_phys_to_machine")
ebee0eab0859 ("Xen/gntdev: correct error checking in gntdev_map_grant_pages()")
dbe5283605b3 ("Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()")
0102e4efda76 ("xen: Use evtchn_type_t as a type for event channels")
b3f7931f5c61 ("xen/gntdev: switch from kcalloc() to kvcalloc()")
3b06ac6707c1 ("xen/gntdev: replace global limit of mapped pages by limit per call")
d3eeb1d77c5d ("xen/gntdev: use mmu_interval_notifier_insert")
ee7f5225dc3c ("xen: Stop abusing DT of_dma_configure API")
bce5963bcb4f ("xen/events: fix binding user event channels to cpus")
dfcd66604c1c ("mm/mmu_notifier: convert user range->blockable to helper function")
a3e0d41c2b1f ("mm/hmm: improve driver API to work and wait over a range")
73231612dc7c ("mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0991028cd49567d7016d1b224fe0117c35059f86 Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Date: Sun, 2 Oct 2022 18:20:05 -0400
Subject: [PATCH] xen/gntdev: Prevent leaking grants
Prior to this commit, if a grant mapping operation failed partially,
some of the entries in the map_ops array would be invalid, whereas all
of the entries in the kmap_ops array would be valid. This in turn would
cause the following logic in gntdev_map_grant_pages to become invalid:
for (i = 0; i < map->count; i++) {
if (map->map_ops[i].status == GNTST_okay) {
map->unmap_ops[i].handle = map->map_ops[i].handle;
if (!use_ptemod)
alloced++;
}
if (use_ptemod) {
if (map->kmap_ops[i].status == GNTST_okay) {
if (map->map_ops[i].status == GNTST_okay)
alloced++;
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
}
}
}
...
atomic_add(alloced, &map->live_grants);
Assume that use_ptemod is true (i.e., the domain mapping the granted
pages is a paravirtualized domain). In the code excerpt above, note that
the "alloced" variable is only incremented when both kmap_ops[i].status
and map_ops[i].status are set to GNTST_okay (i.e., both mapping
operations are successful). However, as also noted above, there are
cases where a grant mapping operation fails partially, breaking the
assumption of the code excerpt above.
The aforementioned causes map->live_grants to be incorrectly set. In
some cases, all of the map_ops mappings fail, but all of the kmap_ops
mappings succeed, meaning that live_grants may remain zero. This in turn
makes it impossible to unmap the successfully grant-mapped pages pointed
to by kmap_ops, because unmap_grant_pages has the following snippet of
code at its beginning:
if (atomic_read(&map->live_grants) == 0)
return; /* Nothing to do */
In other cases where only some of the map_ops mappings fail but all
kmap_ops mappings succeed, live_grants is made positive, but when the
user requests unmapping the grant-mapped pages, __unmap_grant_pages_done
will then make map->live_grants negative, because the latter function
does not check if all of the pages that were requested to be unmapped
were actually unmapped, and the same function unconditionally subtracts
"data->count" (i.e., a value that can be greater than map->live_grants)
from map->live_grants. The side effects of a negative live_grants value
have not been studied.
The net effect of all of this is that grant references are leaked in one
of the above conditions. In Qubes OS v4.1 (which uses Xen's grant
mechanism extensively for X11 GUI isolation), this issue manifests
itself with warning messages like the following to be printed out by the
Linux kernel in the VM that had granted pages (that contain X11 GUI
window data) to dom0: "g.e. 0x1234 still pending", especially after the
user rapidly resizes GUI VM windows (causing some grant-mapping
operations to partially or completely fail, due to the fact that the VM
unshares some of the pages as part of the window resizing, making the
pages impossible to grant-map from dom0).
The fix for this issue involves counting all successful map_ops and
kmap_ops mappings separately, and then adding the sum to live_grants.
During unmapping, only the number of successfully unmapped grants is
subtracted from live_grants. The code is also modified to check for
negative live_grants values after the subtraction and warn the user.
Link: https://github.com/QubesOS/qubes-issues/issues/7631
Fixes: dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
Cc: stable(a)vger.kernel.org
Signed-off-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Acked-by: Demi Marie Obenour <demi(a)invisiblethingslab.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Link: https://lore.kernel.org/r/20221002222006.2077-2-m.v.b@runbox.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 84b143eef395..eb0586b9767d 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -367,8 +367,7 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
for (i = 0; i < map->count; i++) {
if (map->map_ops[i].status == GNTST_okay) {
map->unmap_ops[i].handle = map->map_ops[i].handle;
- if (!use_ptemod)
- alloced++;
+ alloced++;
} else if (!err)
err = -EINVAL;
@@ -377,8 +376,7 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
if (use_ptemod) {
if (map->kmap_ops[i].status == GNTST_okay) {
- if (map->map_ops[i].status == GNTST_okay)
- alloced++;
+ alloced++;
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
} else if (!err)
err = -EINVAL;
@@ -394,8 +392,14 @@ static void __unmap_grant_pages_done(int result,
unsigned int i;
struct gntdev_grant_map *map = data->data;
unsigned int offset = data->unmap_ops - map->unmap_ops;
+ int successful_unmaps = 0;
+ int live_grants;
for (i = 0; i < data->count; i++) {
+ if (map->unmap_ops[offset + i].status == GNTST_okay &&
+ map->unmap_ops[offset + i].handle != INVALID_GRANT_HANDLE)
+ successful_unmaps++;
+
WARN_ON(map->unmap_ops[offset + i].status != GNTST_okay &&
map->unmap_ops[offset + i].handle != INVALID_GRANT_HANDLE);
pr_debug("unmap handle=%d st=%d\n",
@@ -403,6 +407,10 @@ static void __unmap_grant_pages_done(int result,
map->unmap_ops[offset+i].status);
map->unmap_ops[offset+i].handle = INVALID_GRANT_HANDLE;
if (use_ptemod) {
+ if (map->kunmap_ops[offset + i].status == GNTST_okay &&
+ map->kunmap_ops[offset + i].handle != INVALID_GRANT_HANDLE)
+ successful_unmaps++;
+
WARN_ON(map->kunmap_ops[offset + i].status != GNTST_okay &&
map->kunmap_ops[offset + i].handle != INVALID_GRANT_HANDLE);
pr_debug("kunmap handle=%u st=%d\n",
@@ -411,11 +419,15 @@ static void __unmap_grant_pages_done(int result,
map->kunmap_ops[offset+i].handle = INVALID_GRANT_HANDLE;
}
}
+
/*
* Decrease the live-grant counter. This must happen after the loop to
* prevent premature reuse of the grants by gnttab_mmap().
*/
- atomic_sub(data->count, &map->live_grants);
+ live_grants = atomic_sub_return(successful_unmaps, &map->live_grants);
+ if (WARN_ON(live_grants < 0))
+ pr_err("%s: live_grants became negative (%d) after unmapping %d pages!\n",
+ __func__, live_grants, successful_unmaps);
/* Release reference taken by __unmap_grant_pages */
gntdev_put_map(NULL, map);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
0991028cd495 ("xen/gntdev: Prevent leaking grants")
166d38632316 ("xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE")
dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
ce2f46f3531a ("xen/gntdev: fix unmap notification order")
f28347cc6639 ("Xen/gntdev: don't ignore kernel unmapping error")
bce21a2b48ed ("Xen/gnttab: introduce common INVALID_GRANT_{HANDLE,REF}")
36caa3fedf06 ("Xen/gntdev: don't needlessly allocate k{,un}map_ops[]")
8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
36bf1dfb8b26 ("xen/arm: don't ignore return errors from set_phys_to_machine")
ebee0eab0859 ("Xen/gntdev: correct error checking in gntdev_map_grant_pages()")
dbe5283605b3 ("Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()")
0102e4efda76 ("xen: Use evtchn_type_t as a type for event channels")
b3f7931f5c61 ("xen/gntdev: switch from kcalloc() to kvcalloc()")
3b06ac6707c1 ("xen/gntdev: replace global limit of mapped pages by limit per call")
d3eeb1d77c5d ("xen/gntdev: use mmu_interval_notifier_insert")
ee7f5225dc3c ("xen: Stop abusing DT of_dma_configure API")
bce5963bcb4f ("xen/events: fix binding user event channels to cpus")
dfcd66604c1c ("mm/mmu_notifier: convert user range->blockable to helper function")
a3e0d41c2b1f ("mm/hmm: improve driver API to work and wait over a range")
73231612dc7c ("mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0991028cd49567d7016d1b224fe0117c35059f86 Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Date: Sun, 2 Oct 2022 18:20:05 -0400
Subject: [PATCH] xen/gntdev: Prevent leaking grants
Prior to this commit, if a grant mapping operation failed partially,
some of the entries in the map_ops array would be invalid, whereas all
of the entries in the kmap_ops array would be valid. This in turn would
cause the following logic in gntdev_map_grant_pages to become invalid:
for (i = 0; i < map->count; i++) {
if (map->map_ops[i].status == GNTST_okay) {
map->unmap_ops[i].handle = map->map_ops[i].handle;
if (!use_ptemod)
alloced++;
}
if (use_ptemod) {
if (map->kmap_ops[i].status == GNTST_okay) {
if (map->map_ops[i].status == GNTST_okay)
alloced++;
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
}
}
}
...
atomic_add(alloced, &map->live_grants);
Assume that use_ptemod is true (i.e., the domain mapping the granted
pages is a paravirtualized domain). In the code excerpt above, note that
the "alloced" variable is only incremented when both kmap_ops[i].status
and map_ops[i].status are set to GNTST_okay (i.e., both mapping
operations are successful). However, as also noted above, there are
cases where a grant mapping operation fails partially, breaking the
assumption of the code excerpt above.
The aforementioned causes map->live_grants to be incorrectly set. In
some cases, all of the map_ops mappings fail, but all of the kmap_ops
mappings succeed, meaning that live_grants may remain zero. This in turn
makes it impossible to unmap the successfully grant-mapped pages pointed
to by kmap_ops, because unmap_grant_pages has the following snippet of
code at its beginning:
if (atomic_read(&map->live_grants) == 0)
return; /* Nothing to do */
In other cases where only some of the map_ops mappings fail but all
kmap_ops mappings succeed, live_grants is made positive, but when the
user requests unmapping the grant-mapped pages, __unmap_grant_pages_done
will then make map->live_grants negative, because the latter function
does not check if all of the pages that were requested to be unmapped
were actually unmapped, and the same function unconditionally subtracts
"data->count" (i.e., a value that can be greater than map->live_grants)
from map->live_grants. The side effects of a negative live_grants value
have not been studied.
The net effect of all of this is that grant references are leaked in one
of the above conditions. In Qubes OS v4.1 (which uses Xen's grant
mechanism extensively for X11 GUI isolation), this issue manifests
itself with warning messages like the following to be printed out by the
Linux kernel in the VM that had granted pages (that contain X11 GUI
window data) to dom0: "g.e. 0x1234 still pending", especially after the
user rapidly resizes GUI VM windows (causing some grant-mapping
operations to partially or completely fail, due to the fact that the VM
unshares some of the pages as part of the window resizing, making the
pages impossible to grant-map from dom0).
The fix for this issue involves counting all successful map_ops and
kmap_ops mappings separately, and then adding the sum to live_grants.
During unmapping, only the number of successfully unmapped grants is
subtracted from live_grants. The code is also modified to check for
negative live_grants values after the subtraction and warn the user.
Link: https://github.com/QubesOS/qubes-issues/issues/7631
Fixes: dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
Cc: stable(a)vger.kernel.org
Signed-off-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Acked-by: Demi Marie Obenour <demi(a)invisiblethingslab.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Link: https://lore.kernel.org/r/20221002222006.2077-2-m.v.b@runbox.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 84b143eef395..eb0586b9767d 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -367,8 +367,7 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
for (i = 0; i < map->count; i++) {
if (map->map_ops[i].status == GNTST_okay) {
map->unmap_ops[i].handle = map->map_ops[i].handle;
- if (!use_ptemod)
- alloced++;
+ alloced++;
} else if (!err)
err = -EINVAL;
@@ -377,8 +376,7 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
if (use_ptemod) {
if (map->kmap_ops[i].status == GNTST_okay) {
- if (map->map_ops[i].status == GNTST_okay)
- alloced++;
+ alloced++;
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
} else if (!err)
err = -EINVAL;
@@ -394,8 +392,14 @@ static void __unmap_grant_pages_done(int result,
unsigned int i;
struct gntdev_grant_map *map = data->data;
unsigned int offset = data->unmap_ops - map->unmap_ops;
+ int successful_unmaps = 0;
+ int live_grants;
for (i = 0; i < data->count; i++) {
+ if (map->unmap_ops[offset + i].status == GNTST_okay &&
+ map->unmap_ops[offset + i].handle != INVALID_GRANT_HANDLE)
+ successful_unmaps++;
+
WARN_ON(map->unmap_ops[offset + i].status != GNTST_okay &&
map->unmap_ops[offset + i].handle != INVALID_GRANT_HANDLE);
pr_debug("unmap handle=%d st=%d\n",
@@ -403,6 +407,10 @@ static void __unmap_grant_pages_done(int result,
map->unmap_ops[offset+i].status);
map->unmap_ops[offset+i].handle = INVALID_GRANT_HANDLE;
if (use_ptemod) {
+ if (map->kunmap_ops[offset + i].status == GNTST_okay &&
+ map->kunmap_ops[offset + i].handle != INVALID_GRANT_HANDLE)
+ successful_unmaps++;
+
WARN_ON(map->kunmap_ops[offset + i].status != GNTST_okay &&
map->kunmap_ops[offset + i].handle != INVALID_GRANT_HANDLE);
pr_debug("kunmap handle=%u st=%d\n",
@@ -411,11 +419,15 @@ static void __unmap_grant_pages_done(int result,
map->kunmap_ops[offset+i].handle = INVALID_GRANT_HANDLE;
}
}
+
/*
* Decrease the live-grant counter. This must happen after the loop to
* prevent premature reuse of the grants by gnttab_mmap().
*/
- atomic_sub(data->count, &map->live_grants);
+ live_grants = atomic_sub_return(successful_unmaps, &map->live_grants);
+ if (WARN_ON(live_grants < 0))
+ pr_err("%s: live_grants became negative (%d) after unmapping %d pages!\n",
+ __func__, live_grants, successful_unmaps);
/* Release reference taken by __unmap_grant_pages */
gntdev_put_map(NULL, map);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
0991028cd495 ("xen/gntdev: Prevent leaking grants")
166d38632316 ("xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE")
dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
ce2f46f3531a ("xen/gntdev: fix unmap notification order")
f28347cc6639 ("Xen/gntdev: don't ignore kernel unmapping error")
bce21a2b48ed ("Xen/gnttab: introduce common INVALID_GRANT_{HANDLE,REF}")
36caa3fedf06 ("Xen/gntdev: don't needlessly allocate k{,un}map_ops[]")
8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
36bf1dfb8b26 ("xen/arm: don't ignore return errors from set_phys_to_machine")
ebee0eab0859 ("Xen/gntdev: correct error checking in gntdev_map_grant_pages()")
dbe5283605b3 ("Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()")
0102e4efda76 ("xen: Use evtchn_type_t as a type for event channels")
b3f7931f5c61 ("xen/gntdev: switch from kcalloc() to kvcalloc()")
3b06ac6707c1 ("xen/gntdev: replace global limit of mapped pages by limit per call")
d3eeb1d77c5d ("xen/gntdev: use mmu_interval_notifier_insert")
ee7f5225dc3c ("xen: Stop abusing DT of_dma_configure API")
bce5963bcb4f ("xen/events: fix binding user event channels to cpus")
dfcd66604c1c ("mm/mmu_notifier: convert user range->blockable to helper function")
a3e0d41c2b1f ("mm/hmm: improve driver API to work and wait over a range")
73231612dc7c ("mm/hmm: improve and rename hmm_vma_fault() to hmm_range_fault()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0991028cd49567d7016d1b224fe0117c35059f86 Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Date: Sun, 2 Oct 2022 18:20:05 -0400
Subject: [PATCH] xen/gntdev: Prevent leaking grants
Prior to this commit, if a grant mapping operation failed partially,
some of the entries in the map_ops array would be invalid, whereas all
of the entries in the kmap_ops array would be valid. This in turn would
cause the following logic in gntdev_map_grant_pages to become invalid:
for (i = 0; i < map->count; i++) {
if (map->map_ops[i].status == GNTST_okay) {
map->unmap_ops[i].handle = map->map_ops[i].handle;
if (!use_ptemod)
alloced++;
}
if (use_ptemod) {
if (map->kmap_ops[i].status == GNTST_okay) {
if (map->map_ops[i].status == GNTST_okay)
alloced++;
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
}
}
}
...
atomic_add(alloced, &map->live_grants);
Assume that use_ptemod is true (i.e., the domain mapping the granted
pages is a paravirtualized domain). In the code excerpt above, note that
the "alloced" variable is only incremented when both kmap_ops[i].status
and map_ops[i].status are set to GNTST_okay (i.e., both mapping
operations are successful). However, as also noted above, there are
cases where a grant mapping operation fails partially, breaking the
assumption of the code excerpt above.
The aforementioned causes map->live_grants to be incorrectly set. In
some cases, all of the map_ops mappings fail, but all of the kmap_ops
mappings succeed, meaning that live_grants may remain zero. This in turn
makes it impossible to unmap the successfully grant-mapped pages pointed
to by kmap_ops, because unmap_grant_pages has the following snippet of
code at its beginning:
if (atomic_read(&map->live_grants) == 0)
return; /* Nothing to do */
In other cases where only some of the map_ops mappings fail but all
kmap_ops mappings succeed, live_grants is made positive, but when the
user requests unmapping the grant-mapped pages, __unmap_grant_pages_done
will then make map->live_grants negative, because the latter function
does not check if all of the pages that were requested to be unmapped
were actually unmapped, and the same function unconditionally subtracts
"data->count" (i.e., a value that can be greater than map->live_grants)
from map->live_grants. The side effects of a negative live_grants value
have not been studied.
The net effect of all of this is that grant references are leaked in one
of the above conditions. In Qubes OS v4.1 (which uses Xen's grant
mechanism extensively for X11 GUI isolation), this issue manifests
itself with warning messages like the following to be printed out by the
Linux kernel in the VM that had granted pages (that contain X11 GUI
window data) to dom0: "g.e. 0x1234 still pending", especially after the
user rapidly resizes GUI VM windows (causing some grant-mapping
operations to partially or completely fail, due to the fact that the VM
unshares some of the pages as part of the window resizing, making the
pages impossible to grant-map from dom0).
The fix for this issue involves counting all successful map_ops and
kmap_ops mappings separately, and then adding the sum to live_grants.
During unmapping, only the number of successfully unmapped grants is
subtracted from live_grants. The code is also modified to check for
negative live_grants values after the subtraction and warn the user.
Link: https://github.com/QubesOS/qubes-issues/issues/7631
Fixes: dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
Cc: stable(a)vger.kernel.org
Signed-off-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Acked-by: Demi Marie Obenour <demi(a)invisiblethingslab.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Link: https://lore.kernel.org/r/20221002222006.2077-2-m.v.b@runbox.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 84b143eef395..eb0586b9767d 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -367,8 +367,7 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
for (i = 0; i < map->count; i++) {
if (map->map_ops[i].status == GNTST_okay) {
map->unmap_ops[i].handle = map->map_ops[i].handle;
- if (!use_ptemod)
- alloced++;
+ alloced++;
} else if (!err)
err = -EINVAL;
@@ -377,8 +376,7 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
if (use_ptemod) {
if (map->kmap_ops[i].status == GNTST_okay) {
- if (map->map_ops[i].status == GNTST_okay)
- alloced++;
+ alloced++;
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
} else if (!err)
err = -EINVAL;
@@ -394,8 +392,14 @@ static void __unmap_grant_pages_done(int result,
unsigned int i;
struct gntdev_grant_map *map = data->data;
unsigned int offset = data->unmap_ops - map->unmap_ops;
+ int successful_unmaps = 0;
+ int live_grants;
for (i = 0; i < data->count; i++) {
+ if (map->unmap_ops[offset + i].status == GNTST_okay &&
+ map->unmap_ops[offset + i].handle != INVALID_GRANT_HANDLE)
+ successful_unmaps++;
+
WARN_ON(map->unmap_ops[offset + i].status != GNTST_okay &&
map->unmap_ops[offset + i].handle != INVALID_GRANT_HANDLE);
pr_debug("unmap handle=%d st=%d\n",
@@ -403,6 +407,10 @@ static void __unmap_grant_pages_done(int result,
map->unmap_ops[offset+i].status);
map->unmap_ops[offset+i].handle = INVALID_GRANT_HANDLE;
if (use_ptemod) {
+ if (map->kunmap_ops[offset + i].status == GNTST_okay &&
+ map->kunmap_ops[offset + i].handle != INVALID_GRANT_HANDLE)
+ successful_unmaps++;
+
WARN_ON(map->kunmap_ops[offset + i].status != GNTST_okay &&
map->kunmap_ops[offset + i].handle != INVALID_GRANT_HANDLE);
pr_debug("kunmap handle=%u st=%d\n",
@@ -411,11 +419,15 @@ static void __unmap_grant_pages_done(int result,
map->kunmap_ops[offset+i].handle = INVALID_GRANT_HANDLE;
}
}
+
/*
* Decrease the live-grant counter. This must happen after the loop to
* prevent premature reuse of the grants by gnttab_mmap().
*/
- atomic_sub(data->count, &map->live_grants);
+ live_grants = atomic_sub_return(successful_unmaps, &map->live_grants);
+ if (WARN_ON(live_grants < 0))
+ pr_err("%s: live_grants became negative (%d) after unmapping %d pages!\n",
+ __func__, live_grants, successful_unmaps);
/* Release reference taken by __unmap_grant_pages */
gntdev_put_map(NULL, map);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
0991028cd495 ("xen/gntdev: Prevent leaking grants")
166d38632316 ("xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE")
dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
ce2f46f3531a ("xen/gntdev: fix unmap notification order")
f28347cc6639 ("Xen/gntdev: don't ignore kernel unmapping error")
bce21a2b48ed ("Xen/gnttab: introduce common INVALID_GRANT_{HANDLE,REF}")
36caa3fedf06 ("Xen/gntdev: don't needlessly allocate k{,un}map_ops[]")
8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
36bf1dfb8b26 ("xen/arm: don't ignore return errors from set_phys_to_machine")
ebee0eab0859 ("Xen/gntdev: correct error checking in gntdev_map_grant_pages()")
dbe5283605b3 ("Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()")
0102e4efda76 ("xen: Use evtchn_type_t as a type for event channels")
b3f7931f5c61 ("xen/gntdev: switch from kcalloc() to kvcalloc()")
3b06ac6707c1 ("xen/gntdev: replace global limit of mapped pages by limit per call")
d3eeb1d77c5d ("xen/gntdev: use mmu_interval_notifier_insert")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0991028cd49567d7016d1b224fe0117c35059f86 Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Date: Sun, 2 Oct 2022 18:20:05 -0400
Subject: [PATCH] xen/gntdev: Prevent leaking grants
Prior to this commit, if a grant mapping operation failed partially,
some of the entries in the map_ops array would be invalid, whereas all
of the entries in the kmap_ops array would be valid. This in turn would
cause the following logic in gntdev_map_grant_pages to become invalid:
for (i = 0; i < map->count; i++) {
if (map->map_ops[i].status == GNTST_okay) {
map->unmap_ops[i].handle = map->map_ops[i].handle;
if (!use_ptemod)
alloced++;
}
if (use_ptemod) {
if (map->kmap_ops[i].status == GNTST_okay) {
if (map->map_ops[i].status == GNTST_okay)
alloced++;
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
}
}
}
...
atomic_add(alloced, &map->live_grants);
Assume that use_ptemod is true (i.e., the domain mapping the granted
pages is a paravirtualized domain). In the code excerpt above, note that
the "alloced" variable is only incremented when both kmap_ops[i].status
and map_ops[i].status are set to GNTST_okay (i.e., both mapping
operations are successful). However, as also noted above, there are
cases where a grant mapping operation fails partially, breaking the
assumption of the code excerpt above.
The aforementioned causes map->live_grants to be incorrectly set. In
some cases, all of the map_ops mappings fail, but all of the kmap_ops
mappings succeed, meaning that live_grants may remain zero. This in turn
makes it impossible to unmap the successfully grant-mapped pages pointed
to by kmap_ops, because unmap_grant_pages has the following snippet of
code at its beginning:
if (atomic_read(&map->live_grants) == 0)
return; /* Nothing to do */
In other cases where only some of the map_ops mappings fail but all
kmap_ops mappings succeed, live_grants is made positive, but when the
user requests unmapping the grant-mapped pages, __unmap_grant_pages_done
will then make map->live_grants negative, because the latter function
does not check if all of the pages that were requested to be unmapped
were actually unmapped, and the same function unconditionally subtracts
"data->count" (i.e., a value that can be greater than map->live_grants)
from map->live_grants. The side effects of a negative live_grants value
have not been studied.
The net effect of all of this is that grant references are leaked in one
of the above conditions. In Qubes OS v4.1 (which uses Xen's grant
mechanism extensively for X11 GUI isolation), this issue manifests
itself with warning messages like the following to be printed out by the
Linux kernel in the VM that had granted pages (that contain X11 GUI
window data) to dom0: "g.e. 0x1234 still pending", especially after the
user rapidly resizes GUI VM windows (causing some grant-mapping
operations to partially or completely fail, due to the fact that the VM
unshares some of the pages as part of the window resizing, making the
pages impossible to grant-map from dom0).
The fix for this issue involves counting all successful map_ops and
kmap_ops mappings separately, and then adding the sum to live_grants.
During unmapping, only the number of successfully unmapped grants is
subtracted from live_grants. The code is also modified to check for
negative live_grants values after the subtraction and warn the user.
Link: https://github.com/QubesOS/qubes-issues/issues/7631
Fixes: dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
Cc: stable(a)vger.kernel.org
Signed-off-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Acked-by: Demi Marie Obenour <demi(a)invisiblethingslab.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Link: https://lore.kernel.org/r/20221002222006.2077-2-m.v.b@runbox.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 84b143eef395..eb0586b9767d 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -367,8 +367,7 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
for (i = 0; i < map->count; i++) {
if (map->map_ops[i].status == GNTST_okay) {
map->unmap_ops[i].handle = map->map_ops[i].handle;
- if (!use_ptemod)
- alloced++;
+ alloced++;
} else if (!err)
err = -EINVAL;
@@ -377,8 +376,7 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
if (use_ptemod) {
if (map->kmap_ops[i].status == GNTST_okay) {
- if (map->map_ops[i].status == GNTST_okay)
- alloced++;
+ alloced++;
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
} else if (!err)
err = -EINVAL;
@@ -394,8 +392,14 @@ static void __unmap_grant_pages_done(int result,
unsigned int i;
struct gntdev_grant_map *map = data->data;
unsigned int offset = data->unmap_ops - map->unmap_ops;
+ int successful_unmaps = 0;
+ int live_grants;
for (i = 0; i < data->count; i++) {
+ if (map->unmap_ops[offset + i].status == GNTST_okay &&
+ map->unmap_ops[offset + i].handle != INVALID_GRANT_HANDLE)
+ successful_unmaps++;
+
WARN_ON(map->unmap_ops[offset + i].status != GNTST_okay &&
map->unmap_ops[offset + i].handle != INVALID_GRANT_HANDLE);
pr_debug("unmap handle=%d st=%d\n",
@@ -403,6 +407,10 @@ static void __unmap_grant_pages_done(int result,
map->unmap_ops[offset+i].status);
map->unmap_ops[offset+i].handle = INVALID_GRANT_HANDLE;
if (use_ptemod) {
+ if (map->kunmap_ops[offset + i].status == GNTST_okay &&
+ map->kunmap_ops[offset + i].handle != INVALID_GRANT_HANDLE)
+ successful_unmaps++;
+
WARN_ON(map->kunmap_ops[offset + i].status != GNTST_okay &&
map->kunmap_ops[offset + i].handle != INVALID_GRANT_HANDLE);
pr_debug("kunmap handle=%u st=%d\n",
@@ -411,11 +419,15 @@ static void __unmap_grant_pages_done(int result,
map->kunmap_ops[offset+i].handle = INVALID_GRANT_HANDLE;
}
}
+
/*
* Decrease the live-grant counter. This must happen after the loop to
* prevent premature reuse of the grants by gnttab_mmap().
*/
- atomic_sub(data->count, &map->live_grants);
+ live_grants = atomic_sub_return(successful_unmaps, &map->live_grants);
+ if (WARN_ON(live_grants < 0))
+ pr_err("%s: live_grants became negative (%d) after unmapping %d pages!\n",
+ __func__, live_grants, successful_unmaps);
/* Release reference taken by __unmap_grant_pages */
gntdev_put_map(NULL, map);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
0991028cd495 ("xen/gntdev: Prevent leaking grants")
166d38632316 ("xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE")
dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
ce2f46f3531a ("xen/gntdev: fix unmap notification order")
f28347cc6639 ("Xen/gntdev: don't ignore kernel unmapping error")
bce21a2b48ed ("Xen/gnttab: introduce common INVALID_GRANT_{HANDLE,REF}")
36caa3fedf06 ("Xen/gntdev: don't needlessly allocate k{,un}map_ops[]")
8310b77b48c5 ("Xen/gnttab: handle p2m update errors on a per-slot basis")
36bf1dfb8b26 ("xen/arm: don't ignore return errors from set_phys_to_machine")
ebee0eab0859 ("Xen/gntdev: correct error checking in gntdev_map_grant_pages()")
dbe5283605b3 ("Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0991028cd49567d7016d1b224fe0117c35059f86 Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <m.v.b(a)runbox.com>
Date: Sun, 2 Oct 2022 18:20:05 -0400
Subject: [PATCH] xen/gntdev: Prevent leaking grants
Prior to this commit, if a grant mapping operation failed partially,
some of the entries in the map_ops array would be invalid, whereas all
of the entries in the kmap_ops array would be valid. This in turn would
cause the following logic in gntdev_map_grant_pages to become invalid:
for (i = 0; i < map->count; i++) {
if (map->map_ops[i].status == GNTST_okay) {
map->unmap_ops[i].handle = map->map_ops[i].handle;
if (!use_ptemod)
alloced++;
}
if (use_ptemod) {
if (map->kmap_ops[i].status == GNTST_okay) {
if (map->map_ops[i].status == GNTST_okay)
alloced++;
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
}
}
}
...
atomic_add(alloced, &map->live_grants);
Assume that use_ptemod is true (i.e., the domain mapping the granted
pages is a paravirtualized domain). In the code excerpt above, note that
the "alloced" variable is only incremented when both kmap_ops[i].status
and map_ops[i].status are set to GNTST_okay (i.e., both mapping
operations are successful). However, as also noted above, there are
cases where a grant mapping operation fails partially, breaking the
assumption of the code excerpt above.
The aforementioned causes map->live_grants to be incorrectly set. In
some cases, all of the map_ops mappings fail, but all of the kmap_ops
mappings succeed, meaning that live_grants may remain zero. This in turn
makes it impossible to unmap the successfully grant-mapped pages pointed
to by kmap_ops, because unmap_grant_pages has the following snippet of
code at its beginning:
if (atomic_read(&map->live_grants) == 0)
return; /* Nothing to do */
In other cases where only some of the map_ops mappings fail but all
kmap_ops mappings succeed, live_grants is made positive, but when the
user requests unmapping the grant-mapped pages, __unmap_grant_pages_done
will then make map->live_grants negative, because the latter function
does not check if all of the pages that were requested to be unmapped
were actually unmapped, and the same function unconditionally subtracts
"data->count" (i.e., a value that can be greater than map->live_grants)
from map->live_grants. The side effects of a negative live_grants value
have not been studied.
The net effect of all of this is that grant references are leaked in one
of the above conditions. In Qubes OS v4.1 (which uses Xen's grant
mechanism extensively for X11 GUI isolation), this issue manifests
itself with warning messages like the following to be printed out by the
Linux kernel in the VM that had granted pages (that contain X11 GUI
window data) to dom0: "g.e. 0x1234 still pending", especially after the
user rapidly resizes GUI VM windows (causing some grant-mapping
operations to partially or completely fail, due to the fact that the VM
unshares some of the pages as part of the window resizing, making the
pages impossible to grant-map from dom0).
The fix for this issue involves counting all successful map_ops and
kmap_ops mappings separately, and then adding the sum to live_grants.
During unmapping, only the number of successfully unmapped grants is
subtracted from live_grants. The code is also modified to check for
negative live_grants values after the subtraction and warn the user.
Link: https://github.com/QubesOS/qubes-issues/issues/7631
Fixes: dbe97cff7dd9 ("xen/gntdev: Avoid blocking in unmap_grant_pages()")
Cc: stable(a)vger.kernel.org
Signed-off-by: M. Vefa Bicakci <m.v.b(a)runbox.com>
Acked-by: Demi Marie Obenour <demi(a)invisiblethingslab.com>
Reviewed-by: Juergen Gross <jgross(a)suse.com>
Link: https://lore.kernel.org/r/20221002222006.2077-2-m.v.b@runbox.com
Signed-off-by: Juergen Gross <jgross(a)suse.com>
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 84b143eef395..eb0586b9767d 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -367,8 +367,7 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
for (i = 0; i < map->count; i++) {
if (map->map_ops[i].status == GNTST_okay) {
map->unmap_ops[i].handle = map->map_ops[i].handle;
- if (!use_ptemod)
- alloced++;
+ alloced++;
} else if (!err)
err = -EINVAL;
@@ -377,8 +376,7 @@ int gntdev_map_grant_pages(struct gntdev_grant_map *map)
if (use_ptemod) {
if (map->kmap_ops[i].status == GNTST_okay) {
- if (map->map_ops[i].status == GNTST_okay)
- alloced++;
+ alloced++;
map->kunmap_ops[i].handle = map->kmap_ops[i].handle;
} else if (!err)
err = -EINVAL;
@@ -394,8 +392,14 @@ static void __unmap_grant_pages_done(int result,
unsigned int i;
struct gntdev_grant_map *map = data->data;
unsigned int offset = data->unmap_ops - map->unmap_ops;
+ int successful_unmaps = 0;
+ int live_grants;
for (i = 0; i < data->count; i++) {
+ if (map->unmap_ops[offset + i].status == GNTST_okay &&
+ map->unmap_ops[offset + i].handle != INVALID_GRANT_HANDLE)
+ successful_unmaps++;
+
WARN_ON(map->unmap_ops[offset + i].status != GNTST_okay &&
map->unmap_ops[offset + i].handle != INVALID_GRANT_HANDLE);
pr_debug("unmap handle=%d st=%d\n",
@@ -403,6 +407,10 @@ static void __unmap_grant_pages_done(int result,
map->unmap_ops[offset+i].status);
map->unmap_ops[offset+i].handle = INVALID_GRANT_HANDLE;
if (use_ptemod) {
+ if (map->kunmap_ops[offset + i].status == GNTST_okay &&
+ map->kunmap_ops[offset + i].handle != INVALID_GRANT_HANDLE)
+ successful_unmaps++;
+
WARN_ON(map->kunmap_ops[offset + i].status != GNTST_okay &&
map->kunmap_ops[offset + i].handle != INVALID_GRANT_HANDLE);
pr_debug("kunmap handle=%u st=%d\n",
@@ -411,11 +419,15 @@ static void __unmap_grant_pages_done(int result,
map->kunmap_ops[offset+i].handle = INVALID_GRANT_HANDLE;
}
}
+
/*
* Decrease the live-grant counter. This must happen after the loop to
* prevent premature reuse of the grants by gnttab_mmap().
*/
- atomic_sub(data->count, &map->live_grants);
+ live_grants = atomic_sub_return(successful_unmaps, &map->live_grants);
+ if (WARN_ON(live_grants < 0))
+ pr_err("%s: live_grants became negative (%d) after unmapping %d pages!\n",
+ __func__, live_grants, successful_unmaps);
/* Release reference taken by __unmap_grant_pages */
gntdev_put_map(NULL, map);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
958f32ce832b ("mm: hugetlb: fix UAF in hugetlb_handle_userfault")
40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
378397ccb8e5 ("hugetlb: create hugetlb_unmap_file_folio to unmap single file folio")
8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
12710fd69634 ("hugetlb: rename vma_shareable() and refactor code")
c86272287bc6 ("hugetlb: create remove_inode_single_folio to remove single file folio")
7e1813d48dd3 ("hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache")
3a47c54f09c4 ("hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization")
188a39725ad7 ("hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race")
5e6b1bf1b5c3 ("hugetlb: remove meaningless BUG_ON(huge_pte_none())")
3a5497a2dae3 ("mm/hugetlb: fix missing call to restore_reserve_on_error()")
6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 958f32ce832ba781ac20e11bb2d12a9352ea28fc Mon Sep 17 00:00:00 2001
From: Liu Shixin <liushixin2(a)huawei.com>
Date: Fri, 23 Sep 2022 12:21:13 +0800
Subject: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault
The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
and reacquire them again after handle_userfault(), but reacquire the
vma_lock could lead to UAF[1,2] due to the following race,
hugetlb_fault
hugetlb_no_page
/*unlock vma_lock */
hugetlb_handle_userfault
handle_userfault
/* unlock mm->mmap_lock*/
vm_mmap_pgoff
do_mmap
mmap_region
munmap_vma_range
/* clean old vma */
/* lock vma_lock again <--- UAF */
/* unlock vma_lock */
Since the vma_lock will unlock immediately after
hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.co…
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf(a)syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar(a)oracle.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2182134216f0..3c1316ad54b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5489,7 +5489,6 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
unsigned long addr,
unsigned long reason)
{
- vm_fault_t ret;
u32 hash;
struct vm_fault vmf = {
.vma = vma,
@@ -5507,18 +5506,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
};
/*
- * vma_lock and hugetlb_fault_mutex must be
- * dropped before handling userfault. Reacquire
- * after handling fault to make calling code simpler.
+ * vma_lock and hugetlb_fault_mutex must be dropped before handling
+ * userfault. Also mmap_lock could be dropped due to handling
+ * userfault, any vma operation should be careful from here.
*/
hugetlb_vma_unlock_read(vma);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- ret = handle_userfault(&vmf, reason);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
- hugetlb_vma_lock_read(vma);
-
- return ret;
+ return handle_userfault(&vmf, reason);
}
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
@@ -5536,6 +5531,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
/*
* Currently, we are forced to kill the process in the event the
@@ -5546,7 +5542,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
current->pid);
- return ret;
+ goto out;
}
/*
@@ -5560,12 +5556,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (idx >= size)
goto out;
/* Check for page in userfault range */
- if (userfaultfd_missing(vma)) {
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ if (userfaultfd_missing(vma))
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MISSING);
- goto out;
- }
page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
@@ -5631,10 +5625,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (userfaultfd_minor(vma)) {
unlock_page(page);
put_page(page);
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MINOR);
- goto out;
}
}
@@ -5692,6 +5685,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unlock_page(page);
out:
+ hugetlb_vma_unlock_read(vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return ret;
backout:
@@ -5789,11 +5784,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
entry = huge_ptep_get(ptep);
/* PTE markers should be handled the same way as none pte */
- if (huge_pte_none_mostly(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ if (huge_pte_none_mostly(entry))
+ /*
+ * hugetlb_no_page will drop vma lock and hugetlb fault
+ * mutex internally, which make us return immediately.
+ */
+ return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
entry, flags);
- goto out_mutex;
- }
ret = 0;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
958f32ce832b ("mm: hugetlb: fix UAF in hugetlb_handle_userfault")
40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
378397ccb8e5 ("hugetlb: create hugetlb_unmap_file_folio to unmap single file folio")
8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
12710fd69634 ("hugetlb: rename vma_shareable() and refactor code")
c86272287bc6 ("hugetlb: create remove_inode_single_folio to remove single file folio")
7e1813d48dd3 ("hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache")
3a47c54f09c4 ("hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization")
188a39725ad7 ("hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race")
5e6b1bf1b5c3 ("hugetlb: remove meaningless BUG_ON(huge_pte_none())")
3a5497a2dae3 ("mm/hugetlb: fix missing call to restore_reserve_on_error()")
6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 958f32ce832ba781ac20e11bb2d12a9352ea28fc Mon Sep 17 00:00:00 2001
From: Liu Shixin <liushixin2(a)huawei.com>
Date: Fri, 23 Sep 2022 12:21:13 +0800
Subject: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault
The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
and reacquire them again after handle_userfault(), but reacquire the
vma_lock could lead to UAF[1,2] due to the following race,
hugetlb_fault
hugetlb_no_page
/*unlock vma_lock */
hugetlb_handle_userfault
handle_userfault
/* unlock mm->mmap_lock*/
vm_mmap_pgoff
do_mmap
mmap_region
munmap_vma_range
/* clean old vma */
/* lock vma_lock again <--- UAF */
/* unlock vma_lock */
Since the vma_lock will unlock immediately after
hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.co…
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf(a)syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar(a)oracle.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2182134216f0..3c1316ad54b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5489,7 +5489,6 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
unsigned long addr,
unsigned long reason)
{
- vm_fault_t ret;
u32 hash;
struct vm_fault vmf = {
.vma = vma,
@@ -5507,18 +5506,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
};
/*
- * vma_lock and hugetlb_fault_mutex must be
- * dropped before handling userfault. Reacquire
- * after handling fault to make calling code simpler.
+ * vma_lock and hugetlb_fault_mutex must be dropped before handling
+ * userfault. Also mmap_lock could be dropped due to handling
+ * userfault, any vma operation should be careful from here.
*/
hugetlb_vma_unlock_read(vma);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- ret = handle_userfault(&vmf, reason);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
- hugetlb_vma_lock_read(vma);
-
- return ret;
+ return handle_userfault(&vmf, reason);
}
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
@@ -5536,6 +5531,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
/*
* Currently, we are forced to kill the process in the event the
@@ -5546,7 +5542,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
current->pid);
- return ret;
+ goto out;
}
/*
@@ -5560,12 +5556,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (idx >= size)
goto out;
/* Check for page in userfault range */
- if (userfaultfd_missing(vma)) {
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ if (userfaultfd_missing(vma))
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MISSING);
- goto out;
- }
page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
@@ -5631,10 +5625,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (userfaultfd_minor(vma)) {
unlock_page(page);
put_page(page);
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MINOR);
- goto out;
}
}
@@ -5692,6 +5685,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unlock_page(page);
out:
+ hugetlb_vma_unlock_read(vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return ret;
backout:
@@ -5789,11 +5784,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
entry = huge_ptep_get(ptep);
/* PTE markers should be handled the same way as none pte */
- if (huge_pte_none_mostly(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ if (huge_pte_none_mostly(entry))
+ /*
+ * hugetlb_no_page will drop vma lock and hugetlb fault
+ * mutex internally, which make us return immediately.
+ */
+ return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
entry, flags);
- goto out_mutex;
- }
ret = 0;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
958f32ce832b ("mm: hugetlb: fix UAF in hugetlb_handle_userfault")
40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
378397ccb8e5 ("hugetlb: create hugetlb_unmap_file_folio to unmap single file folio")
8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
12710fd69634 ("hugetlb: rename vma_shareable() and refactor code")
c86272287bc6 ("hugetlb: create remove_inode_single_folio to remove single file folio")
7e1813d48dd3 ("hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache")
3a47c54f09c4 ("hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization")
188a39725ad7 ("hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race")
5e6b1bf1b5c3 ("hugetlb: remove meaningless BUG_ON(huge_pte_none())")
3a5497a2dae3 ("mm/hugetlb: fix missing call to restore_reserve_on_error()")
6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 958f32ce832ba781ac20e11bb2d12a9352ea28fc Mon Sep 17 00:00:00 2001
From: Liu Shixin <liushixin2(a)huawei.com>
Date: Fri, 23 Sep 2022 12:21:13 +0800
Subject: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault
The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
and reacquire them again after handle_userfault(), but reacquire the
vma_lock could lead to UAF[1,2] due to the following race,
hugetlb_fault
hugetlb_no_page
/*unlock vma_lock */
hugetlb_handle_userfault
handle_userfault
/* unlock mm->mmap_lock*/
vm_mmap_pgoff
do_mmap
mmap_region
munmap_vma_range
/* clean old vma */
/* lock vma_lock again <--- UAF */
/* unlock vma_lock */
Since the vma_lock will unlock immediately after
hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.co…
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf(a)syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar(a)oracle.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2182134216f0..3c1316ad54b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5489,7 +5489,6 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
unsigned long addr,
unsigned long reason)
{
- vm_fault_t ret;
u32 hash;
struct vm_fault vmf = {
.vma = vma,
@@ -5507,18 +5506,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
};
/*
- * vma_lock and hugetlb_fault_mutex must be
- * dropped before handling userfault. Reacquire
- * after handling fault to make calling code simpler.
+ * vma_lock and hugetlb_fault_mutex must be dropped before handling
+ * userfault. Also mmap_lock could be dropped due to handling
+ * userfault, any vma operation should be careful from here.
*/
hugetlb_vma_unlock_read(vma);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- ret = handle_userfault(&vmf, reason);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
- hugetlb_vma_lock_read(vma);
-
- return ret;
+ return handle_userfault(&vmf, reason);
}
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
@@ -5536,6 +5531,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
/*
* Currently, we are forced to kill the process in the event the
@@ -5546,7 +5542,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
current->pid);
- return ret;
+ goto out;
}
/*
@@ -5560,12 +5556,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (idx >= size)
goto out;
/* Check for page in userfault range */
- if (userfaultfd_missing(vma)) {
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ if (userfaultfd_missing(vma))
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MISSING);
- goto out;
- }
page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
@@ -5631,10 +5625,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (userfaultfd_minor(vma)) {
unlock_page(page);
put_page(page);
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MINOR);
- goto out;
}
}
@@ -5692,6 +5685,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unlock_page(page);
out:
+ hugetlb_vma_unlock_read(vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return ret;
backout:
@@ -5789,11 +5784,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
entry = huge_ptep_get(ptep);
/* PTE markers should be handled the same way as none pte */
- if (huge_pte_none_mostly(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ if (huge_pte_none_mostly(entry))
+ /*
+ * hugetlb_no_page will drop vma lock and hugetlb fault
+ * mutex internally, which make us return immediately.
+ */
+ return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
entry, flags);
- goto out_mutex;
- }
ret = 0;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
958f32ce832b ("mm: hugetlb: fix UAF in hugetlb_handle_userfault")
40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
378397ccb8e5 ("hugetlb: create hugetlb_unmap_file_folio to unmap single file folio")
8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
12710fd69634 ("hugetlb: rename vma_shareable() and refactor code")
c86272287bc6 ("hugetlb: create remove_inode_single_folio to remove single file folio")
7e1813d48dd3 ("hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache")
3a47c54f09c4 ("hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization")
188a39725ad7 ("hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race")
5e6b1bf1b5c3 ("hugetlb: remove meaningless BUG_ON(huge_pte_none())")
3a5497a2dae3 ("mm/hugetlb: fix missing call to restore_reserve_on_error()")
6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 958f32ce832ba781ac20e11bb2d12a9352ea28fc Mon Sep 17 00:00:00 2001
From: Liu Shixin <liushixin2(a)huawei.com>
Date: Fri, 23 Sep 2022 12:21:13 +0800
Subject: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault
The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
and reacquire them again after handle_userfault(), but reacquire the
vma_lock could lead to UAF[1,2] due to the following race,
hugetlb_fault
hugetlb_no_page
/*unlock vma_lock */
hugetlb_handle_userfault
handle_userfault
/* unlock mm->mmap_lock*/
vm_mmap_pgoff
do_mmap
mmap_region
munmap_vma_range
/* clean old vma */
/* lock vma_lock again <--- UAF */
/* unlock vma_lock */
Since the vma_lock will unlock immediately after
hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.co…
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf(a)syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar(a)oracle.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2182134216f0..3c1316ad54b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5489,7 +5489,6 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
unsigned long addr,
unsigned long reason)
{
- vm_fault_t ret;
u32 hash;
struct vm_fault vmf = {
.vma = vma,
@@ -5507,18 +5506,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
};
/*
- * vma_lock and hugetlb_fault_mutex must be
- * dropped before handling userfault. Reacquire
- * after handling fault to make calling code simpler.
+ * vma_lock and hugetlb_fault_mutex must be dropped before handling
+ * userfault. Also mmap_lock could be dropped due to handling
+ * userfault, any vma operation should be careful from here.
*/
hugetlb_vma_unlock_read(vma);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- ret = handle_userfault(&vmf, reason);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
- hugetlb_vma_lock_read(vma);
-
- return ret;
+ return handle_userfault(&vmf, reason);
}
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
@@ -5536,6 +5531,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
/*
* Currently, we are forced to kill the process in the event the
@@ -5546,7 +5542,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
current->pid);
- return ret;
+ goto out;
}
/*
@@ -5560,12 +5556,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (idx >= size)
goto out;
/* Check for page in userfault range */
- if (userfaultfd_missing(vma)) {
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ if (userfaultfd_missing(vma))
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MISSING);
- goto out;
- }
page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
@@ -5631,10 +5625,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (userfaultfd_minor(vma)) {
unlock_page(page);
put_page(page);
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MINOR);
- goto out;
}
}
@@ -5692,6 +5685,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unlock_page(page);
out:
+ hugetlb_vma_unlock_read(vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return ret;
backout:
@@ -5789,11 +5784,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
entry = huge_ptep_get(ptep);
/* PTE markers should be handled the same way as none pte */
- if (huge_pte_none_mostly(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ if (huge_pte_none_mostly(entry))
+ /*
+ * hugetlb_no_page will drop vma lock and hugetlb fault
+ * mutex internally, which make us return immediately.
+ */
+ return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
entry, flags);
- goto out_mutex;
- }
ret = 0;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
958f32ce832b ("mm: hugetlb: fix UAF in hugetlb_handle_userfault")
40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
378397ccb8e5 ("hugetlb: create hugetlb_unmap_file_folio to unmap single file folio")
8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
12710fd69634 ("hugetlb: rename vma_shareable() and refactor code")
c86272287bc6 ("hugetlb: create remove_inode_single_folio to remove single file folio")
7e1813d48dd3 ("hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache")
3a47c54f09c4 ("hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization")
188a39725ad7 ("hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race")
5e6b1bf1b5c3 ("hugetlb: remove meaningless BUG_ON(huge_pte_none())")
3a5497a2dae3 ("mm/hugetlb: fix missing call to restore_reserve_on_error()")
6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 958f32ce832ba781ac20e11bb2d12a9352ea28fc Mon Sep 17 00:00:00 2001
From: Liu Shixin <liushixin2(a)huawei.com>
Date: Fri, 23 Sep 2022 12:21:13 +0800
Subject: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault
The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
and reacquire them again after handle_userfault(), but reacquire the
vma_lock could lead to UAF[1,2] due to the following race,
hugetlb_fault
hugetlb_no_page
/*unlock vma_lock */
hugetlb_handle_userfault
handle_userfault
/* unlock mm->mmap_lock*/
vm_mmap_pgoff
do_mmap
mmap_region
munmap_vma_range
/* clean old vma */
/* lock vma_lock again <--- UAF */
/* unlock vma_lock */
Since the vma_lock will unlock immediately after
hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.co…
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf(a)syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar(a)oracle.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2182134216f0..3c1316ad54b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5489,7 +5489,6 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
unsigned long addr,
unsigned long reason)
{
- vm_fault_t ret;
u32 hash;
struct vm_fault vmf = {
.vma = vma,
@@ -5507,18 +5506,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
};
/*
- * vma_lock and hugetlb_fault_mutex must be
- * dropped before handling userfault. Reacquire
- * after handling fault to make calling code simpler.
+ * vma_lock and hugetlb_fault_mutex must be dropped before handling
+ * userfault. Also mmap_lock could be dropped due to handling
+ * userfault, any vma operation should be careful from here.
*/
hugetlb_vma_unlock_read(vma);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- ret = handle_userfault(&vmf, reason);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
- hugetlb_vma_lock_read(vma);
-
- return ret;
+ return handle_userfault(&vmf, reason);
}
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
@@ -5536,6 +5531,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
/*
* Currently, we are forced to kill the process in the event the
@@ -5546,7 +5542,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
current->pid);
- return ret;
+ goto out;
}
/*
@@ -5560,12 +5556,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (idx >= size)
goto out;
/* Check for page in userfault range */
- if (userfaultfd_missing(vma)) {
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ if (userfaultfd_missing(vma))
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MISSING);
- goto out;
- }
page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
@@ -5631,10 +5625,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (userfaultfd_minor(vma)) {
unlock_page(page);
put_page(page);
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MINOR);
- goto out;
}
}
@@ -5692,6 +5685,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unlock_page(page);
out:
+ hugetlb_vma_unlock_read(vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return ret;
backout:
@@ -5789,11 +5784,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
entry = huge_ptep_get(ptep);
/* PTE markers should be handled the same way as none pte */
- if (huge_pte_none_mostly(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ if (huge_pte_none_mostly(entry))
+ /*
+ * hugetlb_no_page will drop vma lock and hugetlb fault
+ * mutex internally, which make us return immediately.
+ */
+ return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
entry, flags);
- goto out_mutex;
- }
ret = 0;
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
958f32ce832b ("mm: hugetlb: fix UAF in hugetlb_handle_userfault")
40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
378397ccb8e5 ("hugetlb: create hugetlb_unmap_file_folio to unmap single file folio")
8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
12710fd69634 ("hugetlb: rename vma_shareable() and refactor code")
c86272287bc6 ("hugetlb: create remove_inode_single_folio to remove single file folio")
7e1813d48dd3 ("hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache")
3a47c54f09c4 ("hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization")
188a39725ad7 ("hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race")
5e6b1bf1b5c3 ("hugetlb: remove meaningless BUG_ON(huge_pte_none())")
3a5497a2dae3 ("mm/hugetlb: fix missing call to restore_reserve_on_error()")
6614a3c3164a ("Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 958f32ce832ba781ac20e11bb2d12a9352ea28fc Mon Sep 17 00:00:00 2001
From: Liu Shixin <liushixin2(a)huawei.com>
Date: Fri, 23 Sep 2022 12:21:13 +0800
Subject: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault
The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
and reacquire them again after handle_userfault(), but reacquire the
vma_lock could lead to UAF[1,2] due to the following race,
hugetlb_fault
hugetlb_no_page
/*unlock vma_lock */
hugetlb_handle_userfault
handle_userfault
/* unlock mm->mmap_lock*/
vm_mmap_pgoff
do_mmap
mmap_region
munmap_vma_range
/* clean old vma */
/* lock vma_lock again <--- UAF */
/* unlock vma_lock */
Since the vma_lock will unlock immediately after
hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.co…
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf(a)syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar(a)oracle.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2182134216f0..3c1316ad54b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5489,7 +5489,6 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
unsigned long addr,
unsigned long reason)
{
- vm_fault_t ret;
u32 hash;
struct vm_fault vmf = {
.vma = vma,
@@ -5507,18 +5506,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
};
/*
- * vma_lock and hugetlb_fault_mutex must be
- * dropped before handling userfault. Reacquire
- * after handling fault to make calling code simpler.
+ * vma_lock and hugetlb_fault_mutex must be dropped before handling
+ * userfault. Also mmap_lock could be dropped due to handling
+ * userfault, any vma operation should be careful from here.
*/
hugetlb_vma_unlock_read(vma);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- ret = handle_userfault(&vmf, reason);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
- hugetlb_vma_lock_read(vma);
-
- return ret;
+ return handle_userfault(&vmf, reason);
}
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
@@ -5536,6 +5531,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
/*
* Currently, we are forced to kill the process in the event the
@@ -5546,7 +5542,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
current->pid);
- return ret;
+ goto out;
}
/*
@@ -5560,12 +5556,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (idx >= size)
goto out;
/* Check for page in userfault range */
- if (userfaultfd_missing(vma)) {
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ if (userfaultfd_missing(vma))
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MISSING);
- goto out;
- }
page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
@@ -5631,10 +5625,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (userfaultfd_minor(vma)) {
unlock_page(page);
put_page(page);
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MINOR);
- goto out;
}
}
@@ -5692,6 +5685,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unlock_page(page);
out:
+ hugetlb_vma_unlock_read(vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return ret;
backout:
@@ -5789,11 +5784,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
entry = huge_ptep_get(ptep);
/* PTE markers should be handled the same way as none pte */
- if (huge_pte_none_mostly(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ if (huge_pte_none_mostly(entry))
+ /*
+ * hugetlb_no_page will drop vma lock and hugetlb fault
+ * mutex internally, which make us return immediately.
+ */
+ return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
entry, flags);
- goto out_mutex;
- }
ret = 0;
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
958f32ce832b ("mm: hugetlb: fix UAF in hugetlb_handle_userfault")
40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
378397ccb8e5 ("hugetlb: create hugetlb_unmap_file_folio to unmap single file folio")
8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
12710fd69634 ("hugetlb: rename vma_shareable() and refactor code")
c86272287bc6 ("hugetlb: create remove_inode_single_folio to remove single file folio")
7e1813d48dd3 ("hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache")
3a47c54f09c4 ("hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization")
188a39725ad7 ("hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race")
5e6b1bf1b5c3 ("hugetlb: remove meaningless BUG_ON(huge_pte_none())")
3a5497a2dae3 ("mm/hugetlb: fix missing call to restore_reserve_on_error()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 958f32ce832ba781ac20e11bb2d12a9352ea28fc Mon Sep 17 00:00:00 2001
From: Liu Shixin <liushixin2(a)huawei.com>
Date: Fri, 23 Sep 2022 12:21:13 +0800
Subject: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault
The vma_lock and hugetlb_fault_mutex are dropped before handling userfault
and reacquire them again after handle_userfault(), but reacquire the
vma_lock could lead to UAF[1,2] due to the following race,
hugetlb_fault
hugetlb_no_page
/*unlock vma_lock */
hugetlb_handle_userfault
handle_userfault
/* unlock mm->mmap_lock*/
vm_mmap_pgoff
do_mmap
mmap_region
munmap_vma_range
/* clean old vma */
/* lock vma_lock again <--- UAF */
/* unlock vma_lock */
Since the vma_lock will unlock immediately after
hugetlb_handle_userfault(), let's drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.
[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.co…
Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2(a)huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang(a)huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf(a)syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4(a)huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar(a)oracle.com>
Cc: <stable(a)vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2182134216f0..3c1316ad54b5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5489,7 +5489,6 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
unsigned long addr,
unsigned long reason)
{
- vm_fault_t ret;
u32 hash;
struct vm_fault vmf = {
.vma = vma,
@@ -5507,18 +5506,14 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
};
/*
- * vma_lock and hugetlb_fault_mutex must be
- * dropped before handling userfault. Reacquire
- * after handling fault to make calling code simpler.
+ * vma_lock and hugetlb_fault_mutex must be dropped before handling
+ * userfault. Also mmap_lock could be dropped due to handling
+ * userfault, any vma operation should be careful from here.
*/
hugetlb_vma_unlock_read(vma);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- ret = handle_userfault(&vmf, reason);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
- hugetlb_vma_lock_read(vma);
-
- return ret;
+ return handle_userfault(&vmf, reason);
}
static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
@@ -5536,6 +5531,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
/*
* Currently, we are forced to kill the process in the event the
@@ -5546,7 +5542,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) {
pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n",
current->pid);
- return ret;
+ goto out;
}
/*
@@ -5560,12 +5556,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (idx >= size)
goto out;
/* Check for page in userfault range */
- if (userfaultfd_missing(vma)) {
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ if (userfaultfd_missing(vma))
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MISSING);
- goto out;
- }
page = alloc_huge_page(vma, haddr, 0);
if (IS_ERR(page)) {
@@ -5631,10 +5625,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
if (userfaultfd_minor(vma)) {
unlock_page(page);
put_page(page);
- ret = hugetlb_handle_userfault(vma, mapping, idx,
+ return hugetlb_handle_userfault(vma, mapping, idx,
flags, haddr, address,
VM_UFFD_MINOR);
- goto out;
}
}
@@ -5692,6 +5685,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unlock_page(page);
out:
+ hugetlb_vma_unlock_read(vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return ret;
backout:
@@ -5789,11 +5784,13 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
entry = huge_ptep_get(ptep);
/* PTE markers should be handled the same way as none pte */
- if (huge_pte_none_mostly(entry)) {
- ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+ if (huge_pte_none_mostly(entry))
+ /*
+ * hugetlb_no_page will drop vma lock and hugetlb fault
+ * mutex internally, which make us return immediately.
+ */
+ return hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
entry, flags);
- goto out_mutex;
- }
ret = 0;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fac35ba763ed ("mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page")
ad1ac596e8a8 ("mm/migration: fix potential pte_unmap on an not mapped pte")
4dd845b5a3e5 ("mm/swapops: rework swap entry manipulation code")
af5cdaf82238 ("mm: remove special swap entry functions")
b3807a91aca7 ("mm: page_vma_mapped_walk(): add a level of indentation")
e2e1d4076c77 ("mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block")
3306d3119cea ("mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd")
f003c03bd29e ("mm: page_vma_mapped_walk(): use page for pvmw->page")
494334e43c16 ("mm/thp: fix vma_address() if virtual address below file offset")
732ed55823fc ("mm/thp: try_to_unmap() use TTU_SYNC for safe splitting")
99fa8a48203d ("mm/thp: fix __split_huge_pmd_locked() on shmem migration entry")
ffc90cbb2970 ("mm, thp: use head page in __migration_entry_wait()")
a44f89dc6c5f ("mm/huge_memory.c: use helper function migration_entry_to_page()")
374437a274e2 ("mm/pgtable-generic.c: optimize the VM_BUG_ON condition in pmdp_huge_clear_flush()")
c045c72ccde3 ("mm/pgtable-generic.c: simplify the VM_BUG_ON condition in pmdp_huge_clear_flush()")
013339df116c ("mm/rmap: always do TTU_IGNORE_ACCESS")
336bf30eb765 ("hugetlbfs: fix anon huge page migration race")
df3a57d1f607 ("mm: split out the non-present case from copy_one_pte()")
ec0abae6dcdf ("mm/thp: fix __split_huge_pmd_locked() for migration PMD")
6128763fc324 ("mm/migrate: remove unnecessary is_zone_device_page() check")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fac35ba763ed07ba93154c95ffc0c4a55023707f Mon Sep 17 00:00:00 2001
From: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Date: Thu, 1 Sep 2022 18:41:31 +0800
Subject: [PATCH] mm/hugetlb: fix races when looking up a CONT-PTE/PMD size
hugetlb page
On some architectures (like ARM64), it can support CONT-PTE/PMD size
hugetlb, which means it can support not only PMD/PUD size hugetlb (2M and
1G), but also CONT-PTE/PMD size(64K and 32M) if a 4K page size specified.
So when looking up a CONT-PTE size hugetlb page by follow_page(), it will
use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size
hugetlb in follow_page_pte(). However this pte entry lock is incorrect
for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get
the correct lock, which is mm->page_table_lock.
That means the pte entry of the CONT-PTE size hugetlb under current pte
lock is unstable in follow_page_pte(), we can continue to migrate or
poison the pte entry of the CONT-PTE size hugetlb, which can cause some
potential race issues, even though they are under the 'pte lock'.
For example, suppose thread A is trying to look up a CONT-PTE size hugetlb
page by move_pages() syscall under the lock, however antoher thread B can
migrate the CONT-PTE hugetlb page at the same time, which will cause
thread A to get an incorrect page, if thread A also wants to do page
migration, then data inconsistency error occurs.
Moreover we have the same issue for CONT-PMD size hugetlb in
follow_huge_pmd().
To fix above issues, rename the follow_huge_pmd() as follow_huge_pmd_pte()
to handle PMD and PTE level size hugetlb, which uses huge_pte_lock() to
get the correct pte entry lock to make the pte entry stable.
Mike said:
Support for CONT_PMD/_PTE was added with bb9dd3df8ee9 ("arm64: hugetlb:
refactor find_num_contig()"). Patch series "Support for contiguous pte
hugepages", v4. However, I do not believe these code paths were
executed until migration support was added with 5480280d3f2d ("arm64/mm:
enable HugeTLB migration for contiguous bit HugeTLB pages") I would go
with 5480280d3f2d for the Fixes: targe.
Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.16620175…
Fixes: 5480280d3f2d ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Suggested-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3ec981a0d8b3..67c88b82fc32 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -207,8 +207,8 @@ struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
struct page *follow_huge_pd(struct vm_area_struct *vma,
unsigned long address, hugepd_t hpd,
int flags, int pdshift);
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int flags);
+struct page *follow_huge_pmd_pte(struct vm_area_struct *vma, unsigned long address,
+ int flags);
struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags);
struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
@@ -312,8 +312,8 @@ static inline struct page *follow_huge_pd(struct vm_area_struct *vma,
return NULL;
}
-static inline struct page *follow_huge_pmd(struct mm_struct *mm,
- unsigned long address, pmd_t *pmd, int flags)
+static inline struct page *follow_huge_pmd_pte(struct vm_area_struct *vma,
+ unsigned long address, int flags)
{
return NULL;
}
diff --git a/mm/gup.c b/mm/gup.c
index 00926abb4426..251cb6a10bc0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -530,6 +530,18 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
return ERR_PTR(-EINVAL);
+
+ /*
+ * Considering PTE level hugetlb, like continuous-PTE hugetlb on
+ * ARM64 architecture.
+ */
+ if (is_vm_hugetlb_page(vma)) {
+ page = follow_huge_pmd_pte(vma, address, flags);
+ if (page)
+ return page;
+ return no_page_table(vma, flags);
+ }
+
retry:
if (unlikely(pmd_bad(*pmd)))
return no_page_table(vma, flags);
@@ -662,7 +674,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
if (pmd_none(pmdval))
return no_page_table(vma, flags);
if (pmd_huge(pmdval) && is_vm_hugetlb_page(vma)) {
- page = follow_huge_pmd(mm, address, pmd, flags);
+ page = follow_huge_pmd_pte(vma, address, flags);
if (page)
return page;
return no_page_table(vma, flags);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0bdfc7e1c933..9564bf817e6a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6946,12 +6946,13 @@ follow_huge_pd(struct vm_area_struct *vma,
}
struct page * __weak
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int flags)
+follow_huge_pmd_pte(struct vm_area_struct *vma, unsigned long address, int flags)
{
+ struct hstate *h = hstate_vma(vma);
+ struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;
spinlock_t *ptl;
- pte_t pte;
+ pte_t *ptep, pte;
/*
* FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
@@ -6961,17 +6962,15 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
return NULL;
retry:
- ptl = pmd_lockptr(mm, pmd);
- spin_lock(ptl);
- /*
- * make sure that the address range covered by this pmd is not
- * unmapped from other threads.
- */
- if (!pmd_huge(*pmd))
- goto out;
- pte = huge_ptep_get((pte_t *)pmd);
+ ptep = huge_pte_offset(mm, address, huge_page_size(h));
+ if (!ptep)
+ return NULL;
+
+ ptl = huge_pte_lock(h, mm, ptep);
+ pte = huge_ptep_get(ptep);
if (pte_present(pte)) {
- page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+ page = pte_page(pte) +
+ ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
/*
* try_grab_page() should always succeed here, because: a) we
* hold the pmd (ptl) lock, and b) we've just checked that the
@@ -6987,7 +6986,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
} else {
if (is_hugetlb_entry_migration(pte)) {
spin_unlock(ptl);
- __migration_entry_wait_huge((pte_t *)pmd, ptl);
+ __migration_entry_wait_huge(ptep, ptl);
goto retry;
}
/*
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fac35ba763ed ("mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page")
ad1ac596e8a8 ("mm/migration: fix potential pte_unmap on an not mapped pte")
4dd845b5a3e5 ("mm/swapops: rework swap entry manipulation code")
af5cdaf82238 ("mm: remove special swap entry functions")
b3807a91aca7 ("mm: page_vma_mapped_walk(): add a level of indentation")
e2e1d4076c77 ("mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block")
3306d3119cea ("mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd")
f003c03bd29e ("mm: page_vma_mapped_walk(): use page for pvmw->page")
494334e43c16 ("mm/thp: fix vma_address() if virtual address below file offset")
732ed55823fc ("mm/thp: try_to_unmap() use TTU_SYNC for safe splitting")
99fa8a48203d ("mm/thp: fix __split_huge_pmd_locked() on shmem migration entry")
ffc90cbb2970 ("mm, thp: use head page in __migration_entry_wait()")
a44f89dc6c5f ("mm/huge_memory.c: use helper function migration_entry_to_page()")
374437a274e2 ("mm/pgtable-generic.c: optimize the VM_BUG_ON condition in pmdp_huge_clear_flush()")
c045c72ccde3 ("mm/pgtable-generic.c: simplify the VM_BUG_ON condition in pmdp_huge_clear_flush()")
013339df116c ("mm/rmap: always do TTU_IGNORE_ACCESS")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fac35ba763ed07ba93154c95ffc0c4a55023707f Mon Sep 17 00:00:00 2001
From: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Date: Thu, 1 Sep 2022 18:41:31 +0800
Subject: [PATCH] mm/hugetlb: fix races when looking up a CONT-PTE/PMD size
hugetlb page
On some architectures (like ARM64), it can support CONT-PTE/PMD size
hugetlb, which means it can support not only PMD/PUD size hugetlb (2M and
1G), but also CONT-PTE/PMD size(64K and 32M) if a 4K page size specified.
So when looking up a CONT-PTE size hugetlb page by follow_page(), it will
use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size
hugetlb in follow_page_pte(). However this pte entry lock is incorrect
for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get
the correct lock, which is mm->page_table_lock.
That means the pte entry of the CONT-PTE size hugetlb under current pte
lock is unstable in follow_page_pte(), we can continue to migrate or
poison the pte entry of the CONT-PTE size hugetlb, which can cause some
potential race issues, even though they are under the 'pte lock'.
For example, suppose thread A is trying to look up a CONT-PTE size hugetlb
page by move_pages() syscall under the lock, however antoher thread B can
migrate the CONT-PTE hugetlb page at the same time, which will cause
thread A to get an incorrect page, if thread A also wants to do page
migration, then data inconsistency error occurs.
Moreover we have the same issue for CONT-PMD size hugetlb in
follow_huge_pmd().
To fix above issues, rename the follow_huge_pmd() as follow_huge_pmd_pte()
to handle PMD and PTE level size hugetlb, which uses huge_pte_lock() to
get the correct pte entry lock to make the pte entry stable.
Mike said:
Support for CONT_PMD/_PTE was added with bb9dd3df8ee9 ("arm64: hugetlb:
refactor find_num_contig()"). Patch series "Support for contiguous pte
hugepages", v4. However, I do not believe these code paths were
executed until migration support was added with 5480280d3f2d ("arm64/mm:
enable HugeTLB migration for contiguous bit HugeTLB pages") I would go
with 5480280d3f2d for the Fixes: targe.
Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.16620175…
Fixes: 5480280d3f2d ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Suggested-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3ec981a0d8b3..67c88b82fc32 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -207,8 +207,8 @@ struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
struct page *follow_huge_pd(struct vm_area_struct *vma,
unsigned long address, hugepd_t hpd,
int flags, int pdshift);
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int flags);
+struct page *follow_huge_pmd_pte(struct vm_area_struct *vma, unsigned long address,
+ int flags);
struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags);
struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
@@ -312,8 +312,8 @@ static inline struct page *follow_huge_pd(struct vm_area_struct *vma,
return NULL;
}
-static inline struct page *follow_huge_pmd(struct mm_struct *mm,
- unsigned long address, pmd_t *pmd, int flags)
+static inline struct page *follow_huge_pmd_pte(struct vm_area_struct *vma,
+ unsigned long address, int flags)
{
return NULL;
}
diff --git a/mm/gup.c b/mm/gup.c
index 00926abb4426..251cb6a10bc0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -530,6 +530,18 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
return ERR_PTR(-EINVAL);
+
+ /*
+ * Considering PTE level hugetlb, like continuous-PTE hugetlb on
+ * ARM64 architecture.
+ */
+ if (is_vm_hugetlb_page(vma)) {
+ page = follow_huge_pmd_pte(vma, address, flags);
+ if (page)
+ return page;
+ return no_page_table(vma, flags);
+ }
+
retry:
if (unlikely(pmd_bad(*pmd)))
return no_page_table(vma, flags);
@@ -662,7 +674,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
if (pmd_none(pmdval))
return no_page_table(vma, flags);
if (pmd_huge(pmdval) && is_vm_hugetlb_page(vma)) {
- page = follow_huge_pmd(mm, address, pmd, flags);
+ page = follow_huge_pmd_pte(vma, address, flags);
if (page)
return page;
return no_page_table(vma, flags);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0bdfc7e1c933..9564bf817e6a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6946,12 +6946,13 @@ follow_huge_pd(struct vm_area_struct *vma,
}
struct page * __weak
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int flags)
+follow_huge_pmd_pte(struct vm_area_struct *vma, unsigned long address, int flags)
{
+ struct hstate *h = hstate_vma(vma);
+ struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;
spinlock_t *ptl;
- pte_t pte;
+ pte_t *ptep, pte;
/*
* FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
@@ -6961,17 +6962,15 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
return NULL;
retry:
- ptl = pmd_lockptr(mm, pmd);
- spin_lock(ptl);
- /*
- * make sure that the address range covered by this pmd is not
- * unmapped from other threads.
- */
- if (!pmd_huge(*pmd))
- goto out;
- pte = huge_ptep_get((pte_t *)pmd);
+ ptep = huge_pte_offset(mm, address, huge_page_size(h));
+ if (!ptep)
+ return NULL;
+
+ ptl = huge_pte_lock(h, mm, ptep);
+ pte = huge_ptep_get(ptep);
if (pte_present(pte)) {
- page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+ page = pte_page(pte) +
+ ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
/*
* try_grab_page() should always succeed here, because: a) we
* hold the pmd (ptl) lock, and b) we've just checked that the
@@ -6987,7 +6986,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
} else {
if (is_hugetlb_entry_migration(pte)) {
spin_unlock(ptl);
- __migration_entry_wait_huge((pte_t *)pmd, ptl);
+ __migration_entry_wait_huge(ptep, ptl);
goto retry;
}
/*
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fac35ba763ed ("mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page")
ad1ac596e8a8 ("mm/migration: fix potential pte_unmap on an not mapped pte")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fac35ba763ed07ba93154c95ffc0c4a55023707f Mon Sep 17 00:00:00 2001
From: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Date: Thu, 1 Sep 2022 18:41:31 +0800
Subject: [PATCH] mm/hugetlb: fix races when looking up a CONT-PTE/PMD size
hugetlb page
On some architectures (like ARM64), it can support CONT-PTE/PMD size
hugetlb, which means it can support not only PMD/PUD size hugetlb (2M and
1G), but also CONT-PTE/PMD size(64K and 32M) if a 4K page size specified.
So when looking up a CONT-PTE size hugetlb page by follow_page(), it will
use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size
hugetlb in follow_page_pte(). However this pte entry lock is incorrect
for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get
the correct lock, which is mm->page_table_lock.
That means the pte entry of the CONT-PTE size hugetlb under current pte
lock is unstable in follow_page_pte(), we can continue to migrate or
poison the pte entry of the CONT-PTE size hugetlb, which can cause some
potential race issues, even though they are under the 'pte lock'.
For example, suppose thread A is trying to look up a CONT-PTE size hugetlb
page by move_pages() syscall under the lock, however antoher thread B can
migrate the CONT-PTE hugetlb page at the same time, which will cause
thread A to get an incorrect page, if thread A also wants to do page
migration, then data inconsistency error occurs.
Moreover we have the same issue for CONT-PMD size hugetlb in
follow_huge_pmd().
To fix above issues, rename the follow_huge_pmd() as follow_huge_pmd_pte()
to handle PMD and PTE level size hugetlb, which uses huge_pte_lock() to
get the correct pte entry lock to make the pte entry stable.
Mike said:
Support for CONT_PMD/_PTE was added with bb9dd3df8ee9 ("arm64: hugetlb:
refactor find_num_contig()"). Patch series "Support for contiguous pte
hugepages", v4. However, I do not believe these code paths were
executed until migration support was added with 5480280d3f2d ("arm64/mm:
enable HugeTLB migration for contiguous bit HugeTLB pages") I would go
with 5480280d3f2d for the Fixes: targe.
Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.16620175…
Fixes: 5480280d3f2d ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages")
Signed-off-by: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Suggested-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3ec981a0d8b3..67c88b82fc32 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -207,8 +207,8 @@ struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
struct page *follow_huge_pd(struct vm_area_struct *vma,
unsigned long address, hugepd_t hpd,
int flags, int pdshift);
-struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int flags);
+struct page *follow_huge_pmd_pte(struct vm_area_struct *vma, unsigned long address,
+ int flags);
struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int flags);
struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
@@ -312,8 +312,8 @@ static inline struct page *follow_huge_pd(struct vm_area_struct *vma,
return NULL;
}
-static inline struct page *follow_huge_pmd(struct mm_struct *mm,
- unsigned long address, pmd_t *pmd, int flags)
+static inline struct page *follow_huge_pmd_pte(struct vm_area_struct *vma,
+ unsigned long address, int flags)
{
return NULL;
}
diff --git a/mm/gup.c b/mm/gup.c
index 00926abb4426..251cb6a10bc0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -530,6 +530,18 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
return ERR_PTR(-EINVAL);
+
+ /*
+ * Considering PTE level hugetlb, like continuous-PTE hugetlb on
+ * ARM64 architecture.
+ */
+ if (is_vm_hugetlb_page(vma)) {
+ page = follow_huge_pmd_pte(vma, address, flags);
+ if (page)
+ return page;
+ return no_page_table(vma, flags);
+ }
+
retry:
if (unlikely(pmd_bad(*pmd)))
return no_page_table(vma, flags);
@@ -662,7 +674,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
if (pmd_none(pmdval))
return no_page_table(vma, flags);
if (pmd_huge(pmdval) && is_vm_hugetlb_page(vma)) {
- page = follow_huge_pmd(mm, address, pmd, flags);
+ page = follow_huge_pmd_pte(vma, address, flags);
if (page)
return page;
return no_page_table(vma, flags);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0bdfc7e1c933..9564bf817e6a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6946,12 +6946,13 @@ follow_huge_pd(struct vm_area_struct *vma,
}
struct page * __weak
-follow_huge_pmd(struct mm_struct *mm, unsigned long address,
- pmd_t *pmd, int flags)
+follow_huge_pmd_pte(struct vm_area_struct *vma, unsigned long address, int flags)
{
+ struct hstate *h = hstate_vma(vma);
+ struct mm_struct *mm = vma->vm_mm;
struct page *page = NULL;
spinlock_t *ptl;
- pte_t pte;
+ pte_t *ptep, pte;
/*
* FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
@@ -6961,17 +6962,15 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
return NULL;
retry:
- ptl = pmd_lockptr(mm, pmd);
- spin_lock(ptl);
- /*
- * make sure that the address range covered by this pmd is not
- * unmapped from other threads.
- */
- if (!pmd_huge(*pmd))
- goto out;
- pte = huge_ptep_get((pte_t *)pmd);
+ ptep = huge_pte_offset(mm, address, huge_page_size(h));
+ if (!ptep)
+ return NULL;
+
+ ptl = huge_pte_lock(h, mm, ptep);
+ pte = huge_ptep_get(ptep);
if (pte_present(pte)) {
- page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+ page = pte_page(pte) +
+ ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
/*
* try_grab_page() should always succeed here, because: a) we
* hold the pmd (ptl) lock, and b) we've just checked that the
@@ -6987,7 +6986,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
} else {
if (is_hugetlb_entry_migration(pte)) {
spin_unlock(ptl);
- __migration_entry_wait_huge((pte_t *)pmd, ptl);
+ __migration_entry_wait_huge(ptep, ptl);
goto retry;
}
/*
From: Jeffle Xu <jefflexu(a)linux.alibaba.com>
[ Upstream commit b0d97557ebfc9d5ba5f2939339a9fdd267abafeb ]
The inflight of partition 0 doesn't include inflight IOs to all
sub-partitions, since currently mq calculates inflight of specific
partition by simply camparing the value of the partition pointer.
Thus the following case is possible:
$ cat /sys/block/vda/inflight
0 0
$ cat /sys/block/vda/vda1/inflight
0 128
While single queue device (on a previous version, e.g. v3.10) has no
this issue:
$cat /sys/block/sda/sda3/inflight
0 33
$cat /sys/block/sda/inflight
0 33
Partition 0 should be specially handled since it represents the whole
disk. This issue is introduced since commit bf0ddaba65dd ("blk-mq: fix
sysfs inflight counter").
Besides, this patch can also fix the inflight statistics of part 0 in
/proc/diskstats. Before this patch, the inflight statistics of part 0
doesn't include that of sub partitions. (I have marked the 'inflight'
field with asterisk.)
$cat /proc/diskstats
259 0 nvme0n1 45974469 0 367814768 6445794 1 0 1 0 *0* 111062 6445794 0 0 0 0 0 0
259 2 nvme0n1p1 45974058 0 367797952 6445727 0 0 0 0 *33* 111001 6445727 0 0 0 0 0 0
This is introduced since commit f299b7c7a9de ("blk-mq: provide internal
in-flight variant").
Fixes: bf0ddaba65dd ("blk-mq: fix sysfs inflight counter")
Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
Signed-off-by: Jeffle Xu <jefflexu(a)linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch(a)lst.de>
[axboe: adapt for 5.11 partition change]
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
[khazhy: adapt for 5.10 partition]
Signed-off-by: Khazhismel Kumykov <khazhy(a)google.com>
---
block/blk-mq.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index cfc039fabf8c..e37ba792902a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -105,7 +105,8 @@ static bool blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
{
struct mq_inflight *mi = priv;
- if (rq->part == mi->part && blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
+ if ((!mi->part->partno || rq->part == mi->part) &&
+ blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
mi->inflight[rq_data_dir(rq)]++;
return true;
--
2.38.0.rc1.362.ged0d419d3c-goog
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
171df58028bf ("arm64: errata: Add Cortex-A55 to the repeat tlbi list")
39fdb65f52e9 ("arm64: errata: Add Cortex-A510 to the repeat tlbi list")
1dd498e5e26a ("KVM: arm64: Workaround Cortex-A510's single-step and PAC trap errata")
297ae1eb23b0 ("arm64: cpufeature: List early Cortex-A510 parts as having broken dbm")
708e8af4924e ("arm64: errata: Add detection for TRBE trace data corruption")
3bd94a8759de ("arm64: errata: Add detection for TRBE invalid prohibited states")
607a9afaae09 ("arm64: errata: Add detection for TRBE ignored system register writes")
83bb2c1a01d7 ("KVM: arm64: Save PSTATE early on exit")
d7e0a795bf37 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 171df58028bf4649460fb146a56a58dcb0c8f75a Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Fri, 30 Sep 2022 14:19:59 +0100
Subject: [PATCH] arm64: errata: Add Cortex-A55 to the repeat tlbi list
Cortex-A55 is affected by an erratum where in rare circumstances the
CPUs may not handle a race between a break-before-make sequence on one
CPU, and another CPU accessing the same page. This could allow a store
to a page that has been unmapped.
Work around this by adding the affected CPUs to the list that needs
TLB sequences to be done twice.
Signed-off-by: James Morse <james.morse(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20220930131959.3082594-1-james.morse@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
index 17d9fc5d14fb..808ade4cc008 100644
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -76,6 +76,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A55 | #1530923 | ARM64_ERRATUM_1530923 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A55 | #2441007 | ARM64_ERRATUM_2441007 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #832075 | ARM64_ERRATUM_832075 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #852523 | N/A |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1675310f1791..20d082d54bd8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -634,6 +634,23 @@ config ARM64_ERRATUM_1530923
config ARM64_WORKAROUND_REPEAT_TLBI
bool
+config ARM64_ERRATUM_2441007
+ bool "Cortex-A55: Completion of affected memory accesses might not be guaranteed by completion of a TLBI"
+ default y
+ select ARM64_WORKAROUND_REPEAT_TLBI
+ help
+ This option adds a workaround for ARM Cortex-A55 erratum #2441007.
+
+ Under very rare circumstances, affected Cortex-A55 CPUs
+ may not handle a race between a break-before-make sequence on one
+ CPU, and another CPU accessing the same page. This could allow a
+ store to a page that has been unmapped.
+
+ Work around this by adding the affected CPUs to the list that needs
+ TLB sequences to be done twice.
+
+ If unsure, say Y.
+
config ARM64_ERRATUM_1286807
bool "Cortex-A76: Modification of the translation table for a virtual address might lead to read-after-read ordering violation"
default y
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 58ca4f6b25d6..89ac00084f38 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -230,6 +230,11 @@ static const struct arm64_cpu_capabilities arm64_repeat_tlbi_list[] = {
ERRATA_MIDR_RANGE(MIDR_QCOM_KRYO_4XX_GOLD, 0xc, 0xe, 0xf, 0xe),
},
#endif
+#ifdef CONFIG_ARM64_ERRATUM_2441007
+ {
+ ERRATA_MIDR_ALL_VERSIONS(MIDR_CORTEX_A55),
+ },
+#endif
#ifdef CONFIG_ARM64_ERRATUM_2441009
{
/* Cortex-A510 r0p0 -> r1p1. Fixed in r1p2 */
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
171df58028bf ("arm64: errata: Add Cortex-A55 to the repeat tlbi list")
39fdb65f52e9 ("arm64: errata: Add Cortex-A510 to the repeat tlbi list")
1dd498e5e26a ("KVM: arm64: Workaround Cortex-A510's single-step and PAC trap errata")
297ae1eb23b0 ("arm64: cpufeature: List early Cortex-A510 parts as having broken dbm")
708e8af4924e ("arm64: errata: Add detection for TRBE trace data corruption")
3bd94a8759de ("arm64: errata: Add detection for TRBE invalid prohibited states")
607a9afaae09 ("arm64: errata: Add detection for TRBE ignored system register writes")
83bb2c1a01d7 ("KVM: arm64: Save PSTATE early on exit")
d7e0a795bf37 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 171df58028bf4649460fb146a56a58dcb0c8f75a Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Fri, 30 Sep 2022 14:19:59 +0100
Subject: [PATCH] arm64: errata: Add Cortex-A55 to the repeat tlbi list
Cortex-A55 is affected by an erratum where in rare circumstances the
CPUs may not handle a race between a break-before-make sequence on one
CPU, and another CPU accessing the same page. This could allow a store
to a page that has been unmapped.
Work around this by adding the affected CPUs to the list that needs
TLB sequences to be done twice.
Signed-off-by: James Morse <james.morse(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20220930131959.3082594-1-james.morse@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
index 17d9fc5d14fb..808ade4cc008 100644
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -76,6 +76,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A55 | #1530923 | ARM64_ERRATUM_1530923 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A55 | #2441007 | ARM64_ERRATUM_2441007 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #832075 | ARM64_ERRATUM_832075 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #852523 | N/A |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1675310f1791..20d082d54bd8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -634,6 +634,23 @@ config ARM64_ERRATUM_1530923
config ARM64_WORKAROUND_REPEAT_TLBI
bool
+config ARM64_ERRATUM_2441007
+ bool "Cortex-A55: Completion of affected memory accesses might not be guaranteed by completion of a TLBI"
+ default y
+ select ARM64_WORKAROUND_REPEAT_TLBI
+ help
+ This option adds a workaround for ARM Cortex-A55 erratum #2441007.
+
+ Under very rare circumstances, affected Cortex-A55 CPUs
+ may not handle a race between a break-before-make sequence on one
+ CPU, and another CPU accessing the same page. This could allow a
+ store to a page that has been unmapped.
+
+ Work around this by adding the affected CPUs to the list that needs
+ TLB sequences to be done twice.
+
+ If unsure, say Y.
+
config ARM64_ERRATUM_1286807
bool "Cortex-A76: Modification of the translation table for a virtual address might lead to read-after-read ordering violation"
default y
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 58ca4f6b25d6..89ac00084f38 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -230,6 +230,11 @@ static const struct arm64_cpu_capabilities arm64_repeat_tlbi_list[] = {
ERRATA_MIDR_RANGE(MIDR_QCOM_KRYO_4XX_GOLD, 0xc, 0xe, 0xf, 0xe),
},
#endif
+#ifdef CONFIG_ARM64_ERRATUM_2441007
+ {
+ ERRATA_MIDR_ALL_VERSIONS(MIDR_CORTEX_A55),
+ },
+#endif
#ifdef CONFIG_ARM64_ERRATUM_2441009
{
/* Cortex-A510 r0p0 -> r1p1. Fixed in r1p2 */
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
171df58028bf ("arm64: errata: Add Cortex-A55 to the repeat tlbi list")
39fdb65f52e9 ("arm64: errata: Add Cortex-A510 to the repeat tlbi list")
1dd498e5e26a ("KVM: arm64: Workaround Cortex-A510's single-step and PAC trap errata")
297ae1eb23b0 ("arm64: cpufeature: List early Cortex-A510 parts as having broken dbm")
708e8af4924e ("arm64: errata: Add detection for TRBE trace data corruption")
3bd94a8759de ("arm64: errata: Add detection for TRBE invalid prohibited states")
607a9afaae09 ("arm64: errata: Add detection for TRBE ignored system register writes")
83bb2c1a01d7 ("KVM: arm64: Save PSTATE early on exit")
d7e0a795bf37 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 171df58028bf4649460fb146a56a58dcb0c8f75a Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Fri, 30 Sep 2022 14:19:59 +0100
Subject: [PATCH] arm64: errata: Add Cortex-A55 to the repeat tlbi list
Cortex-A55 is affected by an erratum where in rare circumstances the
CPUs may not handle a race between a break-before-make sequence on one
CPU, and another CPU accessing the same page. This could allow a store
to a page that has been unmapped.
Work around this by adding the affected CPUs to the list that needs
TLB sequences to be done twice.
Signed-off-by: James Morse <james.morse(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20220930131959.3082594-1-james.morse@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
index 17d9fc5d14fb..808ade4cc008 100644
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -76,6 +76,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A55 | #1530923 | ARM64_ERRATUM_1530923 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A55 | #2441007 | ARM64_ERRATUM_2441007 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #832075 | ARM64_ERRATUM_832075 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #852523 | N/A |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1675310f1791..20d082d54bd8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -634,6 +634,23 @@ config ARM64_ERRATUM_1530923
config ARM64_WORKAROUND_REPEAT_TLBI
bool
+config ARM64_ERRATUM_2441007
+ bool "Cortex-A55: Completion of affected memory accesses might not be guaranteed by completion of a TLBI"
+ default y
+ select ARM64_WORKAROUND_REPEAT_TLBI
+ help
+ This option adds a workaround for ARM Cortex-A55 erratum #2441007.
+
+ Under very rare circumstances, affected Cortex-A55 CPUs
+ may not handle a race between a break-before-make sequence on one
+ CPU, and another CPU accessing the same page. This could allow a
+ store to a page that has been unmapped.
+
+ Work around this by adding the affected CPUs to the list that needs
+ TLB sequences to be done twice.
+
+ If unsure, say Y.
+
config ARM64_ERRATUM_1286807
bool "Cortex-A76: Modification of the translation table for a virtual address might lead to read-after-read ordering violation"
default y
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 58ca4f6b25d6..89ac00084f38 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -230,6 +230,11 @@ static const struct arm64_cpu_capabilities arm64_repeat_tlbi_list[] = {
ERRATA_MIDR_RANGE(MIDR_QCOM_KRYO_4XX_GOLD, 0xc, 0xe, 0xf, 0xe),
},
#endif
+#ifdef CONFIG_ARM64_ERRATUM_2441007
+ {
+ ERRATA_MIDR_ALL_VERSIONS(MIDR_CORTEX_A55),
+ },
+#endif
#ifdef CONFIG_ARM64_ERRATUM_2441009
{
/* Cortex-A510 r0p0 -> r1p1. Fixed in r1p2 */
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
171df58028bf ("arm64: errata: Add Cortex-A55 to the repeat tlbi list")
39fdb65f52e9 ("arm64: errata: Add Cortex-A510 to the repeat tlbi list")
1dd498e5e26a ("KVM: arm64: Workaround Cortex-A510's single-step and PAC trap errata")
297ae1eb23b0 ("arm64: cpufeature: List early Cortex-A510 parts as having broken dbm")
708e8af4924e ("arm64: errata: Add detection for TRBE trace data corruption")
3bd94a8759de ("arm64: errata: Add detection for TRBE invalid prohibited states")
607a9afaae09 ("arm64: errata: Add detection for TRBE ignored system register writes")
83bb2c1a01d7 ("KVM: arm64: Save PSTATE early on exit")
d7e0a795bf37 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 171df58028bf4649460fb146a56a58dcb0c8f75a Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Fri, 30 Sep 2022 14:19:59 +0100
Subject: [PATCH] arm64: errata: Add Cortex-A55 to the repeat tlbi list
Cortex-A55 is affected by an erratum where in rare circumstances the
CPUs may not handle a race between a break-before-make sequence on one
CPU, and another CPU accessing the same page. This could allow a store
to a page that has been unmapped.
Work around this by adding the affected CPUs to the list that needs
TLB sequences to be done twice.
Signed-off-by: James Morse <james.morse(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20220930131959.3082594-1-james.morse@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
index 17d9fc5d14fb..808ade4cc008 100644
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -76,6 +76,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A55 | #1530923 | ARM64_ERRATUM_1530923 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A55 | #2441007 | ARM64_ERRATUM_2441007 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #832075 | ARM64_ERRATUM_832075 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #852523 | N/A |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1675310f1791..20d082d54bd8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -634,6 +634,23 @@ config ARM64_ERRATUM_1530923
config ARM64_WORKAROUND_REPEAT_TLBI
bool
+config ARM64_ERRATUM_2441007
+ bool "Cortex-A55: Completion of affected memory accesses might not be guaranteed by completion of a TLBI"
+ default y
+ select ARM64_WORKAROUND_REPEAT_TLBI
+ help
+ This option adds a workaround for ARM Cortex-A55 erratum #2441007.
+
+ Under very rare circumstances, affected Cortex-A55 CPUs
+ may not handle a race between a break-before-make sequence on one
+ CPU, and another CPU accessing the same page. This could allow a
+ store to a page that has been unmapped.
+
+ Work around this by adding the affected CPUs to the list that needs
+ TLB sequences to be done twice.
+
+ If unsure, say Y.
+
config ARM64_ERRATUM_1286807
bool "Cortex-A76: Modification of the translation table for a virtual address might lead to read-after-read ordering violation"
default y
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 58ca4f6b25d6..89ac00084f38 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -230,6 +230,11 @@ static const struct arm64_cpu_capabilities arm64_repeat_tlbi_list[] = {
ERRATA_MIDR_RANGE(MIDR_QCOM_KRYO_4XX_GOLD, 0xc, 0xe, 0xf, 0xe),
},
#endif
+#ifdef CONFIG_ARM64_ERRATUM_2441007
+ {
+ ERRATA_MIDR_ALL_VERSIONS(MIDR_CORTEX_A55),
+ },
+#endif
#ifdef CONFIG_ARM64_ERRATUM_2441009
{
/* Cortex-A510 r0p0 -> r1p1. Fixed in r1p2 */
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
171df58028bf ("arm64: errata: Add Cortex-A55 to the repeat tlbi list")
39fdb65f52e9 ("arm64: errata: Add Cortex-A510 to the repeat tlbi list")
1dd498e5e26a ("KVM: arm64: Workaround Cortex-A510's single-step and PAC trap errata")
297ae1eb23b0 ("arm64: cpufeature: List early Cortex-A510 parts as having broken dbm")
708e8af4924e ("arm64: errata: Add detection for TRBE trace data corruption")
3bd94a8759de ("arm64: errata: Add detection for TRBE invalid prohibited states")
607a9afaae09 ("arm64: errata: Add detection for TRBE ignored system register writes")
83bb2c1a01d7 ("KVM: arm64: Save PSTATE early on exit")
d7e0a795bf37 ("Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 171df58028bf4649460fb146a56a58dcb0c8f75a Mon Sep 17 00:00:00 2001
From: James Morse <james.morse(a)arm.com>
Date: Fri, 30 Sep 2022 14:19:59 +0100
Subject: [PATCH] arm64: errata: Add Cortex-A55 to the repeat tlbi list
Cortex-A55 is affected by an erratum where in rare circumstances the
CPUs may not handle a race between a break-before-make sequence on one
CPU, and another CPU accessing the same page. This could allow a store
to a page that has been unmapped.
Work around this by adding the affected CPUs to the list that needs
TLB sequences to be done twice.
Signed-off-by: James Morse <james.morse(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/20220930131959.3082594-1-james.morse@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
index 17d9fc5d14fb..808ade4cc008 100644
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -76,6 +76,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A55 | #1530923 | ARM64_ERRATUM_1530923 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A55 | #2441007 | ARM64_ERRATUM_2441007 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #832075 | ARM64_ERRATUM_832075 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #852523 | N/A |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1675310f1791..20d082d54bd8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -634,6 +634,23 @@ config ARM64_ERRATUM_1530923
config ARM64_WORKAROUND_REPEAT_TLBI
bool
+config ARM64_ERRATUM_2441007
+ bool "Cortex-A55: Completion of affected memory accesses might not be guaranteed by completion of a TLBI"
+ default y
+ select ARM64_WORKAROUND_REPEAT_TLBI
+ help
+ This option adds a workaround for ARM Cortex-A55 erratum #2441007.
+
+ Under very rare circumstances, affected Cortex-A55 CPUs
+ may not handle a race between a break-before-make sequence on one
+ CPU, and another CPU accessing the same page. This could allow a
+ store to a page that has been unmapped.
+
+ Work around this by adding the affected CPUs to the list that needs
+ TLB sequences to be done twice.
+
+ If unsure, say Y.
+
config ARM64_ERRATUM_1286807
bool "Cortex-A76: Modification of the translation table for a virtual address might lead to read-after-read ordering violation"
default y
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 58ca4f6b25d6..89ac00084f38 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -230,6 +230,11 @@ static const struct arm64_cpu_capabilities arm64_repeat_tlbi_list[] = {
ERRATA_MIDR_RANGE(MIDR_QCOM_KRYO_4XX_GOLD, 0xc, 0xe, 0xf, 0xe),
},
#endif
+#ifdef CONFIG_ARM64_ERRATUM_2441007
+ {
+ ERRATA_MIDR_ALL_VERSIONS(MIDR_CORTEX_A55),
+ },
+#endif
#ifdef CONFIG_ARM64_ERRATUM_2441009
{
/* Cortex-A510 r0p0 -> r1p1. Fixed in r1p2 */
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
a8e5e5146ad0 ("arm64: mte: Avoid setting PG_mte_tagged if no tags cleared or restored")
20794545c146 ("arm64: kasan: Revert "arm64: mte: reset the page tag in page->flags"")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a8e5e5146ad08d794c58252bab00b261045ef16d Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas(a)arm.com>
Date: Thu, 6 Oct 2022 17:33:54 +0100
Subject: [PATCH] arm64: mte: Avoid setting PG_mte_tagged if no tags cleared or
restored
Prior to commit 69e3b846d8a7 ("arm64: mte: Sync tags for pages where PTE
is untagged"), mte_sync_tags() was only called for pte_tagged() entries
(those mapped with PROT_MTE). Therefore mte_sync_tags() could safely use
test_and_set_bit(PG_mte_tagged, &page->flags) without inadvertently
setting PG_mte_tagged on an untagged page.
The above commit was required as guests may enable MTE without any
control at the stage 2 mapping, nor a PROT_MTE mapping in the VMM.
However, the side-effect was that any page with a PTE that looked like
swap (or migration) was getting PG_mte_tagged set automatically. A
subsequent page copy (e.g. migration) copied the tags to the destination
page even if the tags were owned by KASAN.
This issue was masked by the page_kasan_tag_reset() call introduced in
commit e5b8d9218951 ("arm64: mte: reset the page tag in page->flags").
When this commit was reverted (20794545c146), KASAN started reporting
access faults because the overriding tags in a page did not match the
original page->flags (with CONFIG_KASAN_HW_TAGS=y):
BUG: KASAN: invalid-access in copy_page+0x10/0xd0 arch/arm64/lib/copy_page.S:26
Read at addr f5ff000017f2e000 by task syz-executor.1/2218
Pointer tag: [f5], memory tag: [f2]
Move the PG_mte_tagged bit setting from mte_sync_tags() to the actual
place where tags are cleared (mte_sync_page_tags()) or restored
(mte_restore_tags()).
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Reported-by: syzbot+c2c79c6d6eddc5262b77(a)syzkaller.appspotmail.com
Fixes: 69e3b846d8a7 ("arm64: mte: Sync tags for pages where PTE is untagged")
Cc: <stable(a)vger.kernel.org> # 5.14.x
Cc: Steven Price <steven.price(a)arm.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Link: https://lore.kernel.org/r/0000000000004387dc05e5888ae5@google.com/
Reviewed-by: Steven Price <steven.price(a)arm.com>
Link: https://lore.kernel.org/r/20221006163354.3194102-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index aca88470fb69..7467217c1eaf 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -48,7 +48,12 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
if (!pte_is_tagged)
return;
- mte_clear_page_tags(page_address(page));
+ /*
+ * Test PG_mte_tagged again in case it was racing with another
+ * set_pte_at().
+ */
+ if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+ mte_clear_page_tags(page_address(page));
}
void mte_sync_tags(pte_t old_pte, pte_t pte)
@@ -64,7 +69,7 @@ void mte_sync_tags(pte_t old_pte, pte_t pte)
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
- if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+ if (!test_bit(PG_mte_tagged, &page->flags))
mte_sync_page_tags(page, old_pte, check_swap,
pte_is_tagged);
}
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index 4334dec93bd4..bed803d8e158 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -53,7 +53,12 @@ bool mte_restore_tags(swp_entry_t entry, struct page *page)
if (!tags)
return false;
- mte_restore_page_tags(page_address(page), tags);
+ /*
+ * Test PG_mte_tagged again in case it was racing with another
+ * set_pte_at().
+ */
+ if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+ mte_restore_page_tags(page_address(page), tags);
return true;
}
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
a8e5e5146ad0 ("arm64: mte: Avoid setting PG_mte_tagged if no tags cleared or restored")
20794545c146 ("arm64: kasan: Revert "arm64: mte: reset the page tag in page->flags"")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From a8e5e5146ad08d794c58252bab00b261045ef16d Mon Sep 17 00:00:00 2001
From: Catalin Marinas <catalin.marinas(a)arm.com>
Date: Thu, 6 Oct 2022 17:33:54 +0100
Subject: [PATCH] arm64: mte: Avoid setting PG_mte_tagged if no tags cleared or
restored
Prior to commit 69e3b846d8a7 ("arm64: mte: Sync tags for pages where PTE
is untagged"), mte_sync_tags() was only called for pte_tagged() entries
(those mapped with PROT_MTE). Therefore mte_sync_tags() could safely use
test_and_set_bit(PG_mte_tagged, &page->flags) without inadvertently
setting PG_mte_tagged on an untagged page.
The above commit was required as guests may enable MTE without any
control at the stage 2 mapping, nor a PROT_MTE mapping in the VMM.
However, the side-effect was that any page with a PTE that looked like
swap (or migration) was getting PG_mte_tagged set automatically. A
subsequent page copy (e.g. migration) copied the tags to the destination
page even if the tags were owned by KASAN.
This issue was masked by the page_kasan_tag_reset() call introduced in
commit e5b8d9218951 ("arm64: mte: reset the page tag in page->flags").
When this commit was reverted (20794545c146), KASAN started reporting
access faults because the overriding tags in a page did not match the
original page->flags (with CONFIG_KASAN_HW_TAGS=y):
BUG: KASAN: invalid-access in copy_page+0x10/0xd0 arch/arm64/lib/copy_page.S:26
Read at addr f5ff000017f2e000 by task syz-executor.1/2218
Pointer tag: [f5], memory tag: [f2]
Move the PG_mte_tagged bit setting from mte_sync_tags() to the actual
place where tags are cleared (mte_sync_page_tags()) or restored
(mte_restore_tags()).
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
Reported-by: syzbot+c2c79c6d6eddc5262b77(a)syzkaller.appspotmail.com
Fixes: 69e3b846d8a7 ("arm64: mte: Sync tags for pages where PTE is untagged")
Cc: <stable(a)vger.kernel.org> # 5.14.x
Cc: Steven Price <steven.price(a)arm.com>
Cc: Andrey Konovalov <andreyknvl(a)gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Link: https://lore.kernel.org/r/0000000000004387dc05e5888ae5@google.com/
Reviewed-by: Steven Price <steven.price(a)arm.com>
Link: https://lore.kernel.org/r/20221006163354.3194102-1-catalin.marinas@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index aca88470fb69..7467217c1eaf 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -48,7 +48,12 @@ static void mte_sync_page_tags(struct page *page, pte_t old_pte,
if (!pte_is_tagged)
return;
- mte_clear_page_tags(page_address(page));
+ /*
+ * Test PG_mte_tagged again in case it was racing with another
+ * set_pte_at().
+ */
+ if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+ mte_clear_page_tags(page_address(page));
}
void mte_sync_tags(pte_t old_pte, pte_t pte)
@@ -64,7 +69,7 @@ void mte_sync_tags(pte_t old_pte, pte_t pte)
/* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
- if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+ if (!test_bit(PG_mte_tagged, &page->flags))
mte_sync_page_tags(page, old_pte, check_swap,
pte_is_tagged);
}
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index 4334dec93bd4..bed803d8e158 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -53,7 +53,12 @@ bool mte_restore_tags(swp_entry_t entry, struct page *page)
if (!tags)
return false;
- mte_restore_page_tags(page_address(page), tags);
+ /*
+ * Test PG_mte_tagged again in case it was racing with another
+ * set_pte_at().
+ */
+ if (!test_and_set_bit(PG_mte_tagged, &page->flags))
+ mte_restore_page_tags(page_address(page), tags);
return true;
}
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
973b9e373306 ("arm64: mte: move register initialization to C")
e921da6bc7ca ("arm64/mm: Consolidate TCR_EL1 fields")
7a062ce31807 ("arm64/cpufeature: Optionally disable MTE via command-line")
82868247897b ("arm64: kasan: mte: use a constant kernel GCR_EL1 value")
d914b80a8f56 ("arm64: avoid double ISB on kernel entry")
59f44069e052 ("arm64: mte: fix restoration of GCR_EL1 from suspend")
fdceddb06a5f ("Merge branch 'for-next/mte' into for-next/core")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 973b9e37330656dec719ede508e4dc40e5c2d80c Mon Sep 17 00:00:00 2001
From: Peter Collingbourne <pcc(a)google.com>
Date: Thu, 15 Sep 2022 15:20:53 -0700
Subject: [PATCH] arm64: mte: move register initialization to C
If FEAT_MTE2 is disabled via the arm64.nomte command line argument on a
CPU that claims to support FEAT_MTE2, the kernel will use Tagged Normal
in the MAIR. If we interpret arm64.nomte to mean that the CPU does not
in fact implement FEAT_MTE2, setting the system register like this may
lead to UNSPECIFIED behavior. Fix it by arranging for MAIR to be set
in the C function cpu_enable_mte which is called based on the sanitized
version of the system register.
There is no need for the rest of the MTE-related system register
initialization to happen from assembly, with the exception of TCR_EL1,
which must be set to include at least TBI1 because the secondary CPUs
access KASan-allocated data structures early. Therefore, make the TCR_EL1
initialization unconditional and move the rest of the initialization to
cpu_enable_mte so that we no longer have a dependency on the unsanitized
ID register value.
Co-developed-by: Evgenii Stepanov <eugenis(a)google.com>
Signed-off-by: Peter Collingbourne <pcc(a)google.com>
Signed-off-by: Evgenii Stepanov <eugenis(a)google.com>
Suggested-by: Catalin Marinas <catalin.marinas(a)arm.com>
Reported-by: kernel test robot <lkp(a)intel.com>
Fixes: 3b714d24ef17 ("arm64: mte: CPU feature detection and initial sysreg configuration")
Cc: <stable(a)vger.kernel.org> # 5.10.x
Link: https://lore.kernel.org/r/20220915222053.3484231-1-eugenis@google.com
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index aa523591a44e..760c62f8e22f 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -42,7 +42,9 @@ void mte_sync_tags(pte_t old_pte, pte_t pte);
void mte_copy_page_tags(void *kto, const void *kfrom);
void mte_thread_init_user(void);
void mte_thread_switch(struct task_struct *next);
+void mte_cpu_setup(void);
void mte_suspend_enter(void);
+void mte_suspend_exit(void);
long set_mte_ctrl(struct task_struct *task, unsigned long arg);
long get_mte_ctrl(struct task_struct *task);
int mte_ptrace_copy_tags(struct task_struct *child, long request,
@@ -72,6 +74,9 @@ static inline void mte_thread_switch(struct task_struct *next)
static inline void mte_suspend_enter(void)
{
}
+static inline void mte_suspend_exit(void)
+{
+}
static inline long set_mte_ctrl(struct task_struct *task, unsigned long arg)
{
return 0;
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index af4de817d712..d7a077b5ccd1 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2034,7 +2034,8 @@ static void bti_enable(const struct arm64_cpu_capabilities *__unused)
static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap)
{
sysreg_clear_set(sctlr_el1, 0, SCTLR_ELx_ATA | SCTLR_EL1_ATA0);
- isb();
+
+ mte_cpu_setup();
/*
* Clear the tags in the zero page. This needs to be done via the
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index b2b730233274..aca88470fb69 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -285,6 +285,49 @@ void mte_thread_switch(struct task_struct *next)
mte_check_tfsr_el1();
}
+void mte_cpu_setup(void)
+{
+ u64 rgsr;
+
+ /*
+ * CnP must be enabled only after the MAIR_EL1 register has been set
+ * up. Inconsistent MAIR_EL1 between CPUs sharing the same TLB may
+ * lead to the wrong memory type being used for a brief window during
+ * CPU power-up.
+ *
+ * CnP is not a boot feature so MTE gets enabled before CnP, but let's
+ * make sure that is the case.
+ */
+ BUG_ON(read_sysreg(ttbr0_el1) & TTBR_CNP_BIT);
+ BUG_ON(read_sysreg(ttbr1_el1) & TTBR_CNP_BIT);
+
+ /* Normal Tagged memory type at the corresponding MAIR index */
+ sysreg_clear_set(mair_el1,
+ MAIR_ATTRIDX(MAIR_ATTR_MASK, MT_NORMAL_TAGGED),
+ MAIR_ATTRIDX(MAIR_ATTR_NORMAL_TAGGED,
+ MT_NORMAL_TAGGED));
+
+ write_sysreg_s(KERNEL_GCR_EL1, SYS_GCR_EL1);
+
+ /*
+ * If GCR_EL1.RRND=1 is implemented the same way as RRND=0, then
+ * RGSR_EL1.SEED must be non-zero for IRG to produce
+ * pseudorandom numbers. As RGSR_EL1 is UNKNOWN out of reset, we
+ * must initialize it.
+ */
+ rgsr = (read_sysreg(CNTVCT_EL0) & SYS_RGSR_EL1_SEED_MASK) <<
+ SYS_RGSR_EL1_SEED_SHIFT;
+ if (rgsr == 0)
+ rgsr = 1 << SYS_RGSR_EL1_SEED_SHIFT;
+ write_sysreg_s(rgsr, SYS_RGSR_EL1);
+
+ /* clear any pending tag check faults in TFSR*_EL1 */
+ write_sysreg_s(0, SYS_TFSR_EL1);
+ write_sysreg_s(0, SYS_TFSRE0_EL1);
+
+ local_flush_tlb_all();
+}
+
void mte_suspend_enter(void)
{
if (!system_supports_mte())
@@ -301,6 +344,14 @@ void mte_suspend_enter(void)
mte_check_tfsr_el1();
}
+void mte_suspend_exit(void)
+{
+ if (!system_supports_mte())
+ return;
+
+ mte_cpu_setup();
+}
+
long set_mte_ctrl(struct task_struct *task, unsigned long arg)
{
u64 mte_ctrl = (~((arg & PR_MTE_TAG_MASK) >> PR_MTE_TAG_SHIFT) &
diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
index 9135fe0f3df5..8b02d310838f 100644
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@ -43,6 +43,8 @@ void notrace __cpu_suspend_exit(void)
{
unsigned int cpu = smp_processor_id();
+ mte_suspend_exit();
+
/*
* We are resuming from reset with the idmap active in TTBR0_EL1.
* We must uninstall the idmap and restore the expected MMU
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 7837a69524c5..f38bccdd374a 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -48,17 +48,19 @@
#ifdef CONFIG_KASAN_HW_TAGS
#define TCR_MTE_FLAGS TCR_TCMA1 | TCR_TBI1 | TCR_TBID1
-#else
+#elif defined(CONFIG_ARM64_MTE)
/*
* The mte_zero_clear_page_tags() implementation uses DC GZVA, which relies on
* TBI being enabled at EL1.
*/
#define TCR_MTE_FLAGS TCR_TBI1 | TCR_TBID1
+#else
+#define TCR_MTE_FLAGS 0
#endif
/*
* Default MAIR_EL1. MT_NORMAL_TAGGED is initially mapped as Normal memory and
- * changed during __cpu_setup to Normal Tagged if the system supports MTE.
+ * changed during mte_cpu_setup to Normal Tagged if the system supports MTE.
*/
#define MAIR_EL1_SET \
(MAIR_ATTRIDX(MAIR_ATTR_DEVICE_nGnRnE, MT_DEVICE_nGnRnE) | \
@@ -426,46 +428,8 @@ SYM_FUNC_START(__cpu_setup)
mov_q mair, MAIR_EL1_SET
mov_q tcr, TCR_TxSZ(VA_BITS) | TCR_CACHE_FLAGS | TCR_SMP_FLAGS | \
TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
- TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS
-
-#ifdef CONFIG_ARM64_MTE
- /*
- * Update MAIR_EL1, GCR_EL1 and TFSR*_EL1 if MTE is supported
- * (ID_AA64PFR1_EL1[11:8] > 1).
- */
- mrs x10, ID_AA64PFR1_EL1
- ubfx x10, x10, #ID_AA64PFR1_MTE_SHIFT, #4
- cmp x10, #ID_AA64PFR1_MTE
- b.lt 1f
-
- /* Normal Tagged memory type at the corresponding MAIR index */
- mov x10, #MAIR_ATTR_NORMAL_TAGGED
- bfi mair, x10, #(8 * MT_NORMAL_TAGGED), #8
+ TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS | TCR_MTE_FLAGS
- mov x10, #KERNEL_GCR_EL1
- msr_s SYS_GCR_EL1, x10
-
- /*
- * If GCR_EL1.RRND=1 is implemented the same way as RRND=0, then
- * RGSR_EL1.SEED must be non-zero for IRG to produce
- * pseudorandom numbers. As RGSR_EL1 is UNKNOWN out of reset, we
- * must initialize it.
- */
- mrs x10, CNTVCT_EL0
- ands x10, x10, #SYS_RGSR_EL1_SEED_MASK
- csinc x10, x10, xzr, ne
- lsl x10, x10, #SYS_RGSR_EL1_SEED_SHIFT
- msr_s SYS_RGSR_EL1, x10
-
- /* clear any pending tag check faults in TFSR*_EL1 */
- msr_s SYS_TFSR_EL1, xzr
- msr_s SYS_TFSRE0_EL1, xzr
-
- /* set the TCR_EL1 bits */
- mov_q x10, TCR_MTE_FLAGS
- orr tcr, tcr, x10
-1:
-#endif
tcr_clear_errata_bits tcr, x9, x5
#ifdef CONFIG_ARM64_VA_BITS_52
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
973b9e373306 ("arm64: mte: move register initialization to C")
e921da6bc7ca ("arm64/mm: Consolidate TCR_EL1 fields")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 973b9e37330656dec719ede508e4dc40e5c2d80c Mon Sep 17 00:00:00 2001
From: Peter Collingbourne <pcc(a)google.com>
Date: Thu, 15 Sep 2022 15:20:53 -0700
Subject: [PATCH] arm64: mte: move register initialization to C
If FEAT_MTE2 is disabled via the arm64.nomte command line argument on a
CPU that claims to support FEAT_MTE2, the kernel will use Tagged Normal
in the MAIR. If we interpret arm64.nomte to mean that the CPU does not
in fact implement FEAT_MTE2, setting the system register like this may
lead to UNSPECIFIED behavior. Fix it by arranging for MAIR to be set
in the C function cpu_enable_mte which is called based on the sanitized
version of the system register.
There is no need for the rest of the MTE-related system register
initialization to happen from assembly, with the exception of TCR_EL1,
which must be set to include at least TBI1 because the secondary CPUs
access KASan-allocated data structures early. Therefore, make the TCR_EL1
initialization unconditional and move the rest of the initialization to
cpu_enable_mte so that we no longer have a dependency on the unsanitized
ID register value.
Co-developed-by: Evgenii Stepanov <eugenis(a)google.com>
Signed-off-by: Peter Collingbourne <pcc(a)google.com>
Signed-off-by: Evgenii Stepanov <eugenis(a)google.com>
Suggested-by: Catalin Marinas <catalin.marinas(a)arm.com>
Reported-by: kernel test robot <lkp(a)intel.com>
Fixes: 3b714d24ef17 ("arm64: mte: CPU feature detection and initial sysreg configuration")
Cc: <stable(a)vger.kernel.org> # 5.10.x
Link: https://lore.kernel.org/r/20220915222053.3484231-1-eugenis@google.com
Signed-off-by: Catalin Marinas <catalin.marinas(a)arm.com>
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index aa523591a44e..760c62f8e22f 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -42,7 +42,9 @@ void mte_sync_tags(pte_t old_pte, pte_t pte);
void mte_copy_page_tags(void *kto, const void *kfrom);
void mte_thread_init_user(void);
void mte_thread_switch(struct task_struct *next);
+void mte_cpu_setup(void);
void mte_suspend_enter(void);
+void mte_suspend_exit(void);
long set_mte_ctrl(struct task_struct *task, unsigned long arg);
long get_mte_ctrl(struct task_struct *task);
int mte_ptrace_copy_tags(struct task_struct *child, long request,
@@ -72,6 +74,9 @@ static inline void mte_thread_switch(struct task_struct *next)
static inline void mte_suspend_enter(void)
{
}
+static inline void mte_suspend_exit(void)
+{
+}
static inline long set_mte_ctrl(struct task_struct *task, unsigned long arg)
{
return 0;
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index af4de817d712..d7a077b5ccd1 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2034,7 +2034,8 @@ static void bti_enable(const struct arm64_cpu_capabilities *__unused)
static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap)
{
sysreg_clear_set(sctlr_el1, 0, SCTLR_ELx_ATA | SCTLR_EL1_ATA0);
- isb();
+
+ mte_cpu_setup();
/*
* Clear the tags in the zero page. This needs to be done via the
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index b2b730233274..aca88470fb69 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -285,6 +285,49 @@ void mte_thread_switch(struct task_struct *next)
mte_check_tfsr_el1();
}
+void mte_cpu_setup(void)
+{
+ u64 rgsr;
+
+ /*
+ * CnP must be enabled only after the MAIR_EL1 register has been set
+ * up. Inconsistent MAIR_EL1 between CPUs sharing the same TLB may
+ * lead to the wrong memory type being used for a brief window during
+ * CPU power-up.
+ *
+ * CnP is not a boot feature so MTE gets enabled before CnP, but let's
+ * make sure that is the case.
+ */
+ BUG_ON(read_sysreg(ttbr0_el1) & TTBR_CNP_BIT);
+ BUG_ON(read_sysreg(ttbr1_el1) & TTBR_CNP_BIT);
+
+ /* Normal Tagged memory type at the corresponding MAIR index */
+ sysreg_clear_set(mair_el1,
+ MAIR_ATTRIDX(MAIR_ATTR_MASK, MT_NORMAL_TAGGED),
+ MAIR_ATTRIDX(MAIR_ATTR_NORMAL_TAGGED,
+ MT_NORMAL_TAGGED));
+
+ write_sysreg_s(KERNEL_GCR_EL1, SYS_GCR_EL1);
+
+ /*
+ * If GCR_EL1.RRND=1 is implemented the same way as RRND=0, then
+ * RGSR_EL1.SEED must be non-zero for IRG to produce
+ * pseudorandom numbers. As RGSR_EL1 is UNKNOWN out of reset, we
+ * must initialize it.
+ */
+ rgsr = (read_sysreg(CNTVCT_EL0) & SYS_RGSR_EL1_SEED_MASK) <<
+ SYS_RGSR_EL1_SEED_SHIFT;
+ if (rgsr == 0)
+ rgsr = 1 << SYS_RGSR_EL1_SEED_SHIFT;
+ write_sysreg_s(rgsr, SYS_RGSR_EL1);
+
+ /* clear any pending tag check faults in TFSR*_EL1 */
+ write_sysreg_s(0, SYS_TFSR_EL1);
+ write_sysreg_s(0, SYS_TFSRE0_EL1);
+
+ local_flush_tlb_all();
+}
+
void mte_suspend_enter(void)
{
if (!system_supports_mte())
@@ -301,6 +344,14 @@ void mte_suspend_enter(void)
mte_check_tfsr_el1();
}
+void mte_suspend_exit(void)
+{
+ if (!system_supports_mte())
+ return;
+
+ mte_cpu_setup();
+}
+
long set_mte_ctrl(struct task_struct *task, unsigned long arg)
{
u64 mte_ctrl = (~((arg & PR_MTE_TAG_MASK) >> PR_MTE_TAG_SHIFT) &
diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
index 9135fe0f3df5..8b02d310838f 100644
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@ -43,6 +43,8 @@ void notrace __cpu_suspend_exit(void)
{
unsigned int cpu = smp_processor_id();
+ mte_suspend_exit();
+
/*
* We are resuming from reset with the idmap active in TTBR0_EL1.
* We must uninstall the idmap and restore the expected MMU
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 7837a69524c5..f38bccdd374a 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -48,17 +48,19 @@
#ifdef CONFIG_KASAN_HW_TAGS
#define TCR_MTE_FLAGS TCR_TCMA1 | TCR_TBI1 | TCR_TBID1
-#else
+#elif defined(CONFIG_ARM64_MTE)
/*
* The mte_zero_clear_page_tags() implementation uses DC GZVA, which relies on
* TBI being enabled at EL1.
*/
#define TCR_MTE_FLAGS TCR_TBI1 | TCR_TBID1
+#else
+#define TCR_MTE_FLAGS 0
#endif
/*
* Default MAIR_EL1. MT_NORMAL_TAGGED is initially mapped as Normal memory and
- * changed during __cpu_setup to Normal Tagged if the system supports MTE.
+ * changed during mte_cpu_setup to Normal Tagged if the system supports MTE.
*/
#define MAIR_EL1_SET \
(MAIR_ATTRIDX(MAIR_ATTR_DEVICE_nGnRnE, MT_DEVICE_nGnRnE) | \
@@ -426,46 +428,8 @@ SYM_FUNC_START(__cpu_setup)
mov_q mair, MAIR_EL1_SET
mov_q tcr, TCR_TxSZ(VA_BITS) | TCR_CACHE_FLAGS | TCR_SMP_FLAGS | \
TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
- TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS
-
-#ifdef CONFIG_ARM64_MTE
- /*
- * Update MAIR_EL1, GCR_EL1 and TFSR*_EL1 if MTE is supported
- * (ID_AA64PFR1_EL1[11:8] > 1).
- */
- mrs x10, ID_AA64PFR1_EL1
- ubfx x10, x10, #ID_AA64PFR1_MTE_SHIFT, #4
- cmp x10, #ID_AA64PFR1_MTE
- b.lt 1f
-
- /* Normal Tagged memory type at the corresponding MAIR index */
- mov x10, #MAIR_ATTR_NORMAL_TAGGED
- bfi mair, x10, #(8 * MT_NORMAL_TAGGED), #8
+ TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS | TCR_MTE_FLAGS
- mov x10, #KERNEL_GCR_EL1
- msr_s SYS_GCR_EL1, x10
-
- /*
- * If GCR_EL1.RRND=1 is implemented the same way as RRND=0, then
- * RGSR_EL1.SEED must be non-zero for IRG to produce
- * pseudorandom numbers. As RGSR_EL1 is UNKNOWN out of reset, we
- * must initialize it.
- */
- mrs x10, CNTVCT_EL0
- ands x10, x10, #SYS_RGSR_EL1_SEED_MASK
- csinc x10, x10, xzr, ne
- lsl x10, x10, #SYS_RGSR_EL1_SEED_SHIFT
- msr_s SYS_RGSR_EL1, x10
-
- /* clear any pending tag check faults in TFSR*_EL1 */
- msr_s SYS_TFSR_EL1, xzr
- msr_s SYS_TFSRE0_EL1, xzr
-
- /* set the TCR_EL1 bits */
- mov_q x10, TCR_MTE_FLAGS
- orr tcr, tcr, x10
-1:
-#endif
tcr_clear_errata_bits tcr, x9, x5
#ifdef CONFIG_ARM64_VA_BITS_52
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
26696d465716 ("dmaengine: mxs: use platform_driver_register")
cc2afb0d4c7c ("dmaengine: mxs-dma: Remove the unused .id_table")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 26696d4657167112a1079f86cba1739765c1360e Mon Sep 17 00:00:00 2001
From: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Date: Wed, 21 Sep 2022 19:05:56 +0200
Subject: [PATCH] dmaengine: mxs: use platform_driver_register
Driver registration fails on SOC imx8mn as its supplier, the clock
control module, is probed later than subsys initcall level. This driver
uses platform_driver_probe which is not compatible with deferred probing
and won't be probed again later if probe function fails due to clock not
being available at that time.
This patch replaces the use of platform_driver_probe with
platform_driver_register which will allow probing the driver later again
when the clock control module will be available.
The __init annotation has been dropped because it is not compatible with
deferred probing. The code is not executed once and its memory cannot be
freed.
Fixes: a580b8c5429a ("dmaengine: mxs-dma: add dma support for i.MX23/28")
Co-developed-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Acked-by: Sascha Hauer <s.hauer(a)pengutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220921170556.1055962-1-dario.binacchi@amarulaso…
Signed-off-by: Vinod Koul <vkoul(a)kernel.org>
diff --git a/drivers/dma/mxs-dma.c b/drivers/dma/mxs-dma.c
index 994fc4d2aca4..dc147cc2436e 100644
--- a/drivers/dma/mxs-dma.c
+++ b/drivers/dma/mxs-dma.c
@@ -670,7 +670,7 @@ static enum dma_status mxs_dma_tx_status(struct dma_chan *chan,
return mxs_chan->status;
}
-static int __init mxs_dma_init(struct mxs_dma_engine *mxs_dma)
+static int mxs_dma_init(struct mxs_dma_engine *mxs_dma)
{
int ret;
@@ -741,7 +741,7 @@ static struct dma_chan *mxs_dma_xlate(struct of_phandle_args *dma_spec,
ofdma->of_node);
}
-static int __init mxs_dma_probe(struct platform_device *pdev)
+static int mxs_dma_probe(struct platform_device *pdev)
{
struct device_node *np = pdev->dev.of_node;
const struct mxs_dma_type *dma_type;
@@ -839,10 +839,7 @@ static struct platform_driver mxs_dma_driver = {
.name = "mxs-dma",
.of_match_table = mxs_dma_dt_ids,
},
+ .probe = mxs_dma_probe,
};
-static int __init mxs_dma_module_init(void)
-{
- return platform_driver_probe(&mxs_dma_driver, mxs_dma_probe);
-}
-subsys_initcall(mxs_dma_module_init);
+builtin_platform_driver(mxs_dma_driver);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
26696d465716 ("dmaengine: mxs: use platform_driver_register")
cc2afb0d4c7c ("dmaengine: mxs-dma: Remove the unused .id_table")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 26696d4657167112a1079f86cba1739765c1360e Mon Sep 17 00:00:00 2001
From: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Date: Wed, 21 Sep 2022 19:05:56 +0200
Subject: [PATCH] dmaengine: mxs: use platform_driver_register
Driver registration fails on SOC imx8mn as its supplier, the clock
control module, is probed later than subsys initcall level. This driver
uses platform_driver_probe which is not compatible with deferred probing
and won't be probed again later if probe function fails due to clock not
being available at that time.
This patch replaces the use of platform_driver_probe with
platform_driver_register which will allow probing the driver later again
when the clock control module will be available.
The __init annotation has been dropped because it is not compatible with
deferred probing. The code is not executed once and its memory cannot be
freed.
Fixes: a580b8c5429a ("dmaengine: mxs-dma: add dma support for i.MX23/28")
Co-developed-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Acked-by: Sascha Hauer <s.hauer(a)pengutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220921170556.1055962-1-dario.binacchi@amarulaso…
Signed-off-by: Vinod Koul <vkoul(a)kernel.org>
diff --git a/drivers/dma/mxs-dma.c b/drivers/dma/mxs-dma.c
index 994fc4d2aca4..dc147cc2436e 100644
--- a/drivers/dma/mxs-dma.c
+++ b/drivers/dma/mxs-dma.c
@@ -670,7 +670,7 @@ static enum dma_status mxs_dma_tx_status(struct dma_chan *chan,
return mxs_chan->status;
}
-static int __init mxs_dma_init(struct mxs_dma_engine *mxs_dma)
+static int mxs_dma_init(struct mxs_dma_engine *mxs_dma)
{
int ret;
@@ -741,7 +741,7 @@ static struct dma_chan *mxs_dma_xlate(struct of_phandle_args *dma_spec,
ofdma->of_node);
}
-static int __init mxs_dma_probe(struct platform_device *pdev)
+static int mxs_dma_probe(struct platform_device *pdev)
{
struct device_node *np = pdev->dev.of_node;
const struct mxs_dma_type *dma_type;
@@ -839,10 +839,7 @@ static struct platform_driver mxs_dma_driver = {
.name = "mxs-dma",
.of_match_table = mxs_dma_dt_ids,
},
+ .probe = mxs_dma_probe,
};
-static int __init mxs_dma_module_init(void)
-{
- return platform_driver_probe(&mxs_dma_driver, mxs_dma_probe);
-}
-subsys_initcall(mxs_dma_module_init);
+builtin_platform_driver(mxs_dma_driver);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
26696d465716 ("dmaengine: mxs: use platform_driver_register")
cc2afb0d4c7c ("dmaengine: mxs-dma: Remove the unused .id_table")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 26696d4657167112a1079f86cba1739765c1360e Mon Sep 17 00:00:00 2001
From: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Date: Wed, 21 Sep 2022 19:05:56 +0200
Subject: [PATCH] dmaengine: mxs: use platform_driver_register
Driver registration fails on SOC imx8mn as its supplier, the clock
control module, is probed later than subsys initcall level. This driver
uses platform_driver_probe which is not compatible with deferred probing
and won't be probed again later if probe function fails due to clock not
being available at that time.
This patch replaces the use of platform_driver_probe with
platform_driver_register which will allow probing the driver later again
when the clock control module will be available.
The __init annotation has been dropped because it is not compatible with
deferred probing. The code is not executed once and its memory cannot be
freed.
Fixes: a580b8c5429a ("dmaengine: mxs-dma: add dma support for i.MX23/28")
Co-developed-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Acked-by: Sascha Hauer <s.hauer(a)pengutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220921170556.1055962-1-dario.binacchi@amarulaso…
Signed-off-by: Vinod Koul <vkoul(a)kernel.org>
diff --git a/drivers/dma/mxs-dma.c b/drivers/dma/mxs-dma.c
index 994fc4d2aca4..dc147cc2436e 100644
--- a/drivers/dma/mxs-dma.c
+++ b/drivers/dma/mxs-dma.c
@@ -670,7 +670,7 @@ static enum dma_status mxs_dma_tx_status(struct dma_chan *chan,
return mxs_chan->status;
}
-static int __init mxs_dma_init(struct mxs_dma_engine *mxs_dma)
+static int mxs_dma_init(struct mxs_dma_engine *mxs_dma)
{
int ret;
@@ -741,7 +741,7 @@ static struct dma_chan *mxs_dma_xlate(struct of_phandle_args *dma_spec,
ofdma->of_node);
}
-static int __init mxs_dma_probe(struct platform_device *pdev)
+static int mxs_dma_probe(struct platform_device *pdev)
{
struct device_node *np = pdev->dev.of_node;
const struct mxs_dma_type *dma_type;
@@ -839,10 +839,7 @@ static struct platform_driver mxs_dma_driver = {
.name = "mxs-dma",
.of_match_table = mxs_dma_dt_ids,
},
+ .probe = mxs_dma_probe,
};
-static int __init mxs_dma_module_init(void)
-{
- return platform_driver_probe(&mxs_dma_driver, mxs_dma_probe);
-}
-subsys_initcall(mxs_dma_module_init);
+builtin_platform_driver(mxs_dma_driver);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
26696d465716 ("dmaengine: mxs: use platform_driver_register")
cc2afb0d4c7c ("dmaengine: mxs-dma: Remove the unused .id_table")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 26696d4657167112a1079f86cba1739765c1360e Mon Sep 17 00:00:00 2001
From: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Date: Wed, 21 Sep 2022 19:05:56 +0200
Subject: [PATCH] dmaengine: mxs: use platform_driver_register
Driver registration fails on SOC imx8mn as its supplier, the clock
control module, is probed later than subsys initcall level. This driver
uses platform_driver_probe which is not compatible with deferred probing
and won't be probed again later if probe function fails due to clock not
being available at that time.
This patch replaces the use of platform_driver_probe with
platform_driver_register which will allow probing the driver later again
when the clock control module will be available.
The __init annotation has been dropped because it is not compatible with
deferred probing. The code is not executed once and its memory cannot be
freed.
Fixes: a580b8c5429a ("dmaengine: mxs-dma: add dma support for i.MX23/28")
Co-developed-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Acked-by: Sascha Hauer <s.hauer(a)pengutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220921170556.1055962-1-dario.binacchi@amarulaso…
Signed-off-by: Vinod Koul <vkoul(a)kernel.org>
diff --git a/drivers/dma/mxs-dma.c b/drivers/dma/mxs-dma.c
index 994fc4d2aca4..dc147cc2436e 100644
--- a/drivers/dma/mxs-dma.c
+++ b/drivers/dma/mxs-dma.c
@@ -670,7 +670,7 @@ static enum dma_status mxs_dma_tx_status(struct dma_chan *chan,
return mxs_chan->status;
}
-static int __init mxs_dma_init(struct mxs_dma_engine *mxs_dma)
+static int mxs_dma_init(struct mxs_dma_engine *mxs_dma)
{
int ret;
@@ -741,7 +741,7 @@ static struct dma_chan *mxs_dma_xlate(struct of_phandle_args *dma_spec,
ofdma->of_node);
}
-static int __init mxs_dma_probe(struct platform_device *pdev)
+static int mxs_dma_probe(struct platform_device *pdev)
{
struct device_node *np = pdev->dev.of_node;
const struct mxs_dma_type *dma_type;
@@ -839,10 +839,7 @@ static struct platform_driver mxs_dma_driver = {
.name = "mxs-dma",
.of_match_table = mxs_dma_dt_ids,
},
+ .probe = mxs_dma_probe,
};
-static int __init mxs_dma_module_init(void)
-{
- return platform_driver_probe(&mxs_dma_driver, mxs_dma_probe);
-}
-subsys_initcall(mxs_dma_module_init);
+builtin_platform_driver(mxs_dma_driver);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
26696d465716 ("dmaengine: mxs: use platform_driver_register")
cc2afb0d4c7c ("dmaengine: mxs-dma: Remove the unused .id_table")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 26696d4657167112a1079f86cba1739765c1360e Mon Sep 17 00:00:00 2001
From: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Date: Wed, 21 Sep 2022 19:05:56 +0200
Subject: [PATCH] dmaengine: mxs: use platform_driver_register
Driver registration fails on SOC imx8mn as its supplier, the clock
control module, is probed later than subsys initcall level. This driver
uses platform_driver_probe which is not compatible with deferred probing
and won't be probed again later if probe function fails due to clock not
being available at that time.
This patch replaces the use of platform_driver_probe with
platform_driver_register which will allow probing the driver later again
when the clock control module will be available.
The __init annotation has been dropped because it is not compatible with
deferred probing. The code is not executed once and its memory cannot be
freed.
Fixes: a580b8c5429a ("dmaengine: mxs-dma: add dma support for i.MX23/28")
Co-developed-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Michael Trimarchi <michael(a)amarulasolutions.com>
Signed-off-by: Dario Binacchi <dario.binacchi(a)amarulasolutions.com>
Acked-by: Sascha Hauer <s.hauer(a)pengutronix.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220921170556.1055962-1-dario.binacchi@amarulaso…
Signed-off-by: Vinod Koul <vkoul(a)kernel.org>
diff --git a/drivers/dma/mxs-dma.c b/drivers/dma/mxs-dma.c
index 994fc4d2aca4..dc147cc2436e 100644
--- a/drivers/dma/mxs-dma.c
+++ b/drivers/dma/mxs-dma.c
@@ -670,7 +670,7 @@ static enum dma_status mxs_dma_tx_status(struct dma_chan *chan,
return mxs_chan->status;
}
-static int __init mxs_dma_init(struct mxs_dma_engine *mxs_dma)
+static int mxs_dma_init(struct mxs_dma_engine *mxs_dma)
{
int ret;
@@ -741,7 +741,7 @@ static struct dma_chan *mxs_dma_xlate(struct of_phandle_args *dma_spec,
ofdma->of_node);
}
-static int __init mxs_dma_probe(struct platform_device *pdev)
+static int mxs_dma_probe(struct platform_device *pdev)
{
struct device_node *np = pdev->dev.of_node;
const struct mxs_dma_type *dma_type;
@@ -839,10 +839,7 @@ static struct platform_driver mxs_dma_driver = {
.name = "mxs-dma",
.of_match_table = mxs_dma_dt_ids,
},
+ .probe = mxs_dma_probe,
};
-static int __init mxs_dma_module_init(void)
-{
- return platform_driver_probe(&mxs_dma_driver, mxs_dma_probe);
-}
-subsys_initcall(mxs_dma_module_init);
+builtin_platform_driver(mxs_dma_driver);
commit 61ce339f19fabbc3e51237148a7ef6f2270e44fa upstream.
If swiotlb is force enabled dma_max_mapping_size ends up calling
swiotlb_max_mapping_size which takes into account the min align mask for
the device. Set the min align mask for nvme driver before calling
dma_max_mapping_size while calculating max hw sectors.
Signed-off-by: Rishabh Bhatnagar <risbhat(a)amazon.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
---
drivers/nvme/host/pci.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d820131d39b2..e9f3701dda3f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2732,6 +2732,8 @@ static void nvme_reset_work(struct work_struct *work)
if (result)
goto out_unlock;
+ dma_set_min_align_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1);
+
/*
* Limit the max command size to prevent iod->sg allocations going
* over a single page.
@@ -2744,7 +2746,6 @@ static void nvme_reset_work(struct work_struct *work)
* Don't limit the IOMMU merged segment size.
*/
dma_set_max_seg_size(dev->dev, 0xffffffff);
- dma_set_min_align_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1);
mutex_unlock(&dev->shutdown_lock);
--
2.37.1
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
10f6913c548b ("riscv: always honor the CONFIG_CMDLINE_FORCE when parsing dtb")
46ad48e8a28d ("riscv: Add machine name to kernel boot log and stack dump output")
8f3a2b4a96dc ("RISC-V: Move DT mapping outof fixmap")
2d2682512f0f ("riscv: Allow device trees to be built into the kernel")
335b139057ef ("riscv: Add SOC early init support")
20d2292754e7 ("riscv: make sure the cores stay looping in .Lsecondary_park")
6bd33e1ece52 ("riscv: add nommu support")
9e80635619b5 ("riscv: clear the instruction cache and all registers when booting")
accb9dbc4aff ("riscv: read the hart ID from mhartid on boot")
fcdc65375186 ("riscv: provide native clint access for M-mode")
4f9bbcefa142 ("riscv: add support for MMIO access to the timer registers")
8bf90f320d9a ("riscv: implement remote sfence.i using IPIs")
3b03ac6bbd6e ("riscv: poison SBI calls for M-mode")
a4c3733d32a7 ("riscv: abstract out CSR names for supervisor vs machine mode")
0c3ac28931d5 ("riscv: separate MMIO functions into their own header file")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 10f6913c548b32ecb73801a16b120e761c6957ea Mon Sep 17 00:00:00 2001
From: Wenting Zhang <zephray(a)outlook.com>
Date: Fri, 8 Jul 2022 16:38:22 -0400
Subject: [PATCH] riscv: always honor the CONFIG_CMDLINE_FORCE when parsing dtb
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When CONFIG_CMDLINE_FORCE is enabled, cmdline provided by
CONFIG_CMDLINE are always used. This allows CONFIG_CMDLINE to be
used regardless of the result of device tree scanning.
This especially fixes the case where a device tree without the
chosen node is supplied to the kernel. In such cases,
early_init_dt_scan would return true. But inside
early_init_dt_scan_chosen, the cmdline won't be updated as there
is no chosen node in the device tree. As a result, CONFIG_CMDLINE
is not copied into boot_command_line even if CONFIG_CMDLINE_FORCE
is enabled. This commit allows properly update boot_command_line
in this situation.
Fixes: 8fd6e05c7463 ("arch: riscv: support kernel command line forcing when no DTB passed")
Signed-off-by: Wenting Zhang <zephray(a)outlook.com>
Reviewed-by: Björn Töpel <bjorn(a)kernel.org>
Reviewed-by: Conor Dooley <conor.dooley(a)microchip.com>
Link: https://lore.kernel.org/r/PSBPR04MB399135DFC54928AB958D0638B1829@PSBPR04MB3…
Cc: stable(a)vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer(a)rivosinc.com>
diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
index 2dfc463b86bb..ad76bb59b059 100644
--- a/arch/riscv/kernel/setup.c
+++ b/arch/riscv/kernel/setup.c
@@ -252,10 +252,10 @@ static void __init parse_dtb(void)
pr_info("Machine model: %s\n", name);
dump_stack_set_arch_desc("%s (DT)", name);
}
- return;
+ } else {
+ pr_err("No DTB passed to the kernel\n");
}
- pr_err("No DTB passed to the kernel\n");
#ifdef CONFIG_CMDLINE_FORCE
strscpy(boot_command_line, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
pr_info("Forcing kernel command line to: %s\n", boot_command_line);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
10f6913c548b ("riscv: always honor the CONFIG_CMDLINE_FORCE when parsing dtb")
46ad48e8a28d ("riscv: Add machine name to kernel boot log and stack dump output")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 10f6913c548b32ecb73801a16b120e761c6957ea Mon Sep 17 00:00:00 2001
From: Wenting Zhang <zephray(a)outlook.com>
Date: Fri, 8 Jul 2022 16:38:22 -0400
Subject: [PATCH] riscv: always honor the CONFIG_CMDLINE_FORCE when parsing dtb
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When CONFIG_CMDLINE_FORCE is enabled, cmdline provided by
CONFIG_CMDLINE are always used. This allows CONFIG_CMDLINE to be
used regardless of the result of device tree scanning.
This especially fixes the case where a device tree without the
chosen node is supplied to the kernel. In such cases,
early_init_dt_scan would return true. But inside
early_init_dt_scan_chosen, the cmdline won't be updated as there
is no chosen node in the device tree. As a result, CONFIG_CMDLINE
is not copied into boot_command_line even if CONFIG_CMDLINE_FORCE
is enabled. This commit allows properly update boot_command_line
in this situation.
Fixes: 8fd6e05c7463 ("arch: riscv: support kernel command line forcing when no DTB passed")
Signed-off-by: Wenting Zhang <zephray(a)outlook.com>
Reviewed-by: Björn Töpel <bjorn(a)kernel.org>
Reviewed-by: Conor Dooley <conor.dooley(a)microchip.com>
Link: https://lore.kernel.org/r/PSBPR04MB399135DFC54928AB958D0638B1829@PSBPR04MB3…
Cc: stable(a)vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer(a)rivosinc.com>
diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
index 2dfc463b86bb..ad76bb59b059 100644
--- a/arch/riscv/kernel/setup.c
+++ b/arch/riscv/kernel/setup.c
@@ -252,10 +252,10 @@ static void __init parse_dtb(void)
pr_info("Machine model: %s\n", name);
dump_stack_set_arch_desc("%s (DT)", name);
}
- return;
+ } else {
+ pr_err("No DTB passed to the kernel\n");
}
- pr_err("No DTB passed to the kernel\n");
#ifdef CONFIG_CMDLINE_FORCE
strscpy(boot_command_line, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
pr_info("Forcing kernel command line to: %s\n", boot_command_line);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7ab72c597356 ("riscv: Make VM_WRITE imply VM_READ")
afb8c6fee8ce ("riscv/mm/fault: Move access error check to function")
6747430197ed ("riscv/mm/fault: Move FAULT_FLAG_WRITE handling in do_page_fault()")
ac416a724f11 ("riscv/mm/fault: Move vmalloc fault handling to vmalloc_fault()")
a51271d99cdd ("riscv/mm/fault: Move bad area handling to bad_area()")
cac4d1dc85be ("riscv/mm/fault: Move no context handling to no_context()")
d8ed45c5dcd4 ("mmap locking API: use coccinelle to convert mmap_sem rwsem call sites")
7ae77150d94d ("Merge tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7ab72c597356be1e7f0f3d856e54ce78527f43c8 Mon Sep 17 00:00:00 2001
From: Andrew Bresticker <abrestic(a)rivosinc.com>
Date: Thu, 15 Sep 2022 15:37:01 -0400
Subject: [PATCH] riscv: Make VM_WRITE imply VM_READ
RISC-V does not presently have write-only mappings as that PTE bit pattern
is considered reserved in the privileged spec, so allow handling of read
faults in VMAs that have VM_WRITE without VM_READ in order to be consistent
with other architectures that have similar limitations.
Fixes: 2139619bcad7 ("riscv: mmap with PROT_WRITE but no PROT_READ is invalid")
Reviewed-by: Atish Patra <atishp(a)rivosinc.com>
Signed-off-by: Andrew Bresticker <abrestic(a)rivosinc.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220915193702.2201018-2-abrestic@rivosinc.com/
Signed-off-by: Palmer Dabbelt <palmer(a)rivosinc.com>
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index f2fbd1400b7c..d86f7cebd4a7 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -184,7 +184,8 @@ static inline bool access_error(unsigned long cause, struct vm_area_struct *vma)
}
break;
case EXC_LOAD_PAGE_FAULT:
- if (!(vma->vm_flags & VM_READ)) {
+ /* Write implies read */
+ if (!(vma->vm_flags & (VM_READ | VM_WRITE))) {
return true;
}
break;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7ab72c597356 ("riscv: Make VM_WRITE imply VM_READ")
afb8c6fee8ce ("riscv/mm/fault: Move access error check to function")
6747430197ed ("riscv/mm/fault: Move FAULT_FLAG_WRITE handling in do_page_fault()")
ac416a724f11 ("riscv/mm/fault: Move vmalloc fault handling to vmalloc_fault()")
a51271d99cdd ("riscv/mm/fault: Move bad area handling to bad_area()")
cac4d1dc85be ("riscv/mm/fault: Move no context handling to no_context()")
d8ed45c5dcd4 ("mmap locking API: use coccinelle to convert mmap_sem rwsem call sites")
7ae77150d94d ("Merge tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7ab72c597356be1e7f0f3d856e54ce78527f43c8 Mon Sep 17 00:00:00 2001
From: Andrew Bresticker <abrestic(a)rivosinc.com>
Date: Thu, 15 Sep 2022 15:37:01 -0400
Subject: [PATCH] riscv: Make VM_WRITE imply VM_READ
RISC-V does not presently have write-only mappings as that PTE bit pattern
is considered reserved in the privileged spec, so allow handling of read
faults in VMAs that have VM_WRITE without VM_READ in order to be consistent
with other architectures that have similar limitations.
Fixes: 2139619bcad7 ("riscv: mmap with PROT_WRITE but no PROT_READ is invalid")
Reviewed-by: Atish Patra <atishp(a)rivosinc.com>
Signed-off-by: Andrew Bresticker <abrestic(a)rivosinc.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220915193702.2201018-2-abrestic@rivosinc.com/
Signed-off-by: Palmer Dabbelt <palmer(a)rivosinc.com>
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index f2fbd1400b7c..d86f7cebd4a7 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -184,7 +184,8 @@ static inline bool access_error(unsigned long cause, struct vm_area_struct *vma)
}
break;
case EXC_LOAD_PAGE_FAULT:
- if (!(vma->vm_flags & VM_READ)) {
+ /* Write implies read */
+ if (!(vma->vm_flags & (VM_READ | VM_WRITE))) {
return true;
}
break;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fbd92809997a ("riscv: topology: fix default topology reporting")
4f0e8eef772e ("riscv: Add numa support for riscv64 platform")
cbd34f4bb37d ("riscv: Separate memory init from paging init")
00ab027a3b82 ("RISC-V: Add kernel image sections to the resource tree")
270315b8235e ("Merge tag 'riscv-for-linus-5.10-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fbd92809997a391f28075f1c8b5ee314c225557c Mon Sep 17 00:00:00 2001
From: Conor Dooley <conor.dooley(a)microchip.com>
Date: Fri, 15 Jul 2022 18:51:56 +0100
Subject: [PATCH] riscv: topology: fix default topology reporting
RISC-V has no sane defaults to fall back on where there is no cpu-map
in the devicetree.
Without sane defaults, the package, core and thread IDs are all set to
-1. This causes user-visible inaccuracies for tools like hwloc/lstopo
which rely on the sysfs cpu topology files to detect a system's
topology.
On a PolarFire SoC, which should have 4 harts with a thread each,
lstopo currently reports:
Machine (793MB total)
Package L#0
NUMANode L#0 (P#0 793MB)
Core L#0
L1d L#0 (32KB) + L1i L#0 (32KB) + PU L#0 (P#0)
L1d L#1 (32KB) + L1i L#1 (32KB) + PU L#1 (P#1)
L1d L#2 (32KB) + L1i L#2 (32KB) + PU L#2 (P#2)
L1d L#3 (32KB) + L1i L#3 (32KB) + PU L#3 (P#3)
Adding calls to store_cpu_topology() in {boot,smp} hart bringup code
results in the correct topolgy being reported:
Machine (793MB total)
Package L#0
NUMANode L#0 (P#0 793MB)
L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
CC: stable(a)vger.kernel.org # 456797da792f: arm64: topology: move store_cpu_topology() to shared code
Fixes: 03f11f03dbfe ("RISC-V: Parse cpu topology during boot.")
Reported-by: Brice Goglin <Brice.Goglin(a)inria.fr>
Link: https://github.com/open-mpi/hwloc/issues/536
Reviewed-by: Sudeep Holla <sudeep.holla(a)arm.com>
Reviewed-by: Atish Patra <atishp(a)rivosinc.com>
Signed-off-by: Conor Dooley <conor.dooley(a)microchip.com>
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index ed66c31e4655..d557cc50295d 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -52,7 +52,7 @@ config RISCV
select COMMON_CLK
select CPU_PM if CPU_IDLE
select EDAC_SUPPORT
- select GENERIC_ARCH_TOPOLOGY if SMP
+ select GENERIC_ARCH_TOPOLOGY
select GENERIC_ATOMIC64 if !64BIT
select GENERIC_CLOCKEVENTS_BROADCAST if SMP
select GENERIC_EARLY_IOREMAP
diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
index a752c7b41683..3373df413c88 100644
--- a/arch/riscv/kernel/smpboot.c
+++ b/arch/riscv/kernel/smpboot.c
@@ -49,6 +49,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
unsigned int curr_cpuid;
curr_cpuid = smp_processor_id();
+ store_cpu_topology(curr_cpuid);
numa_store_cpu_info(curr_cpuid);
numa_add_cpu(curr_cpuid);
@@ -162,9 +163,9 @@ asmlinkage __visible void smp_callin(void)
mmgrab(mm);
current->active_mm = mm;
+ store_cpu_topology(curr_cpuid);
notify_cpu_starting(curr_cpuid);
numa_add_cpu(curr_cpuid);
- update_siblings_masks(curr_cpuid);
set_cpu_online(curr_cpuid, 1);
/*
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fbd92809997a ("riscv: topology: fix default topology reporting")
4f0e8eef772e ("riscv: Add numa support for riscv64 platform")
cbd34f4bb37d ("riscv: Separate memory init from paging init")
00ab027a3b82 ("RISC-V: Add kernel image sections to the resource tree")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fbd92809997a391f28075f1c8b5ee314c225557c Mon Sep 17 00:00:00 2001
From: Conor Dooley <conor.dooley(a)microchip.com>
Date: Fri, 15 Jul 2022 18:51:56 +0100
Subject: [PATCH] riscv: topology: fix default topology reporting
RISC-V has no sane defaults to fall back on where there is no cpu-map
in the devicetree.
Without sane defaults, the package, core and thread IDs are all set to
-1. This causes user-visible inaccuracies for tools like hwloc/lstopo
which rely on the sysfs cpu topology files to detect a system's
topology.
On a PolarFire SoC, which should have 4 harts with a thread each,
lstopo currently reports:
Machine (793MB total)
Package L#0
NUMANode L#0 (P#0 793MB)
Core L#0
L1d L#0 (32KB) + L1i L#0 (32KB) + PU L#0 (P#0)
L1d L#1 (32KB) + L1i L#1 (32KB) + PU L#1 (P#1)
L1d L#2 (32KB) + L1i L#2 (32KB) + PU L#2 (P#2)
L1d L#3 (32KB) + L1i L#3 (32KB) + PU L#3 (P#3)
Adding calls to store_cpu_topology() in {boot,smp} hart bringup code
results in the correct topolgy being reported:
Machine (793MB total)
Package L#0
NUMANode L#0 (P#0 793MB)
L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
CC: stable(a)vger.kernel.org # 456797da792f: arm64: topology: move store_cpu_topology() to shared code
Fixes: 03f11f03dbfe ("RISC-V: Parse cpu topology during boot.")
Reported-by: Brice Goglin <Brice.Goglin(a)inria.fr>
Link: https://github.com/open-mpi/hwloc/issues/536
Reviewed-by: Sudeep Holla <sudeep.holla(a)arm.com>
Reviewed-by: Atish Patra <atishp(a)rivosinc.com>
Signed-off-by: Conor Dooley <conor.dooley(a)microchip.com>
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index ed66c31e4655..d557cc50295d 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -52,7 +52,7 @@ config RISCV
select COMMON_CLK
select CPU_PM if CPU_IDLE
select EDAC_SUPPORT
- select GENERIC_ARCH_TOPOLOGY if SMP
+ select GENERIC_ARCH_TOPOLOGY
select GENERIC_ATOMIC64 if !64BIT
select GENERIC_CLOCKEVENTS_BROADCAST if SMP
select GENERIC_EARLY_IOREMAP
diff --git a/arch/riscv/kernel/smpboot.c b/arch/riscv/kernel/smpboot.c
index a752c7b41683..3373df413c88 100644
--- a/arch/riscv/kernel/smpboot.c
+++ b/arch/riscv/kernel/smpboot.c
@@ -49,6 +49,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
unsigned int curr_cpuid;
curr_cpuid = smp_processor_id();
+ store_cpu_topology(curr_cpuid);
numa_store_cpu_info(curr_cpuid);
numa_add_cpu(curr_cpuid);
@@ -162,9 +163,9 @@ asmlinkage __visible void smp_callin(void)
mmgrab(mm);
current->active_mm = mm;
+ store_cpu_topology(curr_cpuid);
notify_cpu_starting(curr_cpuid);
numa_add_cpu(curr_cpuid);
- update_siblings_masks(curr_cpuid);
set_cpu_online(curr_cpuid, 1);
/*
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
bb44c31cdcac ("cifs: destage dirty pages before re-reading them for cache=none")
6e6e2b86c29c ("CIFS: Add support for direct I/O read")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bb44c31cdcac107344dd2fcc3bd0504a53575c51 Mon Sep 17 00:00:00 2001
From: Ronnie Sahlberg <lsahlber(a)redhat.com>
Date: Tue, 20 Sep 2022 14:32:02 +1000
Subject: [PATCH] cifs: destage dirty pages before re-reading them for
cache=none
This is the opposite case of kernel bugzilla 216301.
If we mmap a file using cache=none and then proceed to update the mmapped
area these updates are not reflected in a later pread() of that part of the
file.
To fix this we must first destage any dirty pages in the range before
we allow the pread() to proceed.
Cc: stable(a)vger.kernel.org
Reviewed-by: Paulo Alcantara (SUSE) <pc(a)cjr.nz>
Reviewed-by: Enzo Matsumiya <ematsumiya(a)suse.de>
Signed-off-by: Ronnie Sahlberg <lsahlber(a)redhat.com>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 6f38b134a346..7d756721e1a6 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4271,6 +4271,15 @@ static ssize_t __cifs_readv(
len = ctx->len;
}
+ if (direct) {
+ rc = filemap_write_and_wait_range(file->f_inode->i_mapping,
+ offset, offset + len - 1);
+ if (rc) {
+ kref_put(&ctx->refcount, cifs_aio_ctx_release);
+ return -EAGAIN;
+ }
+ }
+
/* grab a lock here due to read response handlers can access ctx */
mutex_lock(&ctx->aio_mutex);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
bb44c31cdcac ("cifs: destage dirty pages before re-reading them for cache=none")
6e6e2b86c29c ("CIFS: Add support for direct I/O read")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bb44c31cdcac107344dd2fcc3bd0504a53575c51 Mon Sep 17 00:00:00 2001
From: Ronnie Sahlberg <lsahlber(a)redhat.com>
Date: Tue, 20 Sep 2022 14:32:02 +1000
Subject: [PATCH] cifs: destage dirty pages before re-reading them for
cache=none
This is the opposite case of kernel bugzilla 216301.
If we mmap a file using cache=none and then proceed to update the mmapped
area these updates are not reflected in a later pread() of that part of the
file.
To fix this we must first destage any dirty pages in the range before
we allow the pread() to proceed.
Cc: stable(a)vger.kernel.org
Reviewed-by: Paulo Alcantara (SUSE) <pc(a)cjr.nz>
Reviewed-by: Enzo Matsumiya <ematsumiya(a)suse.de>
Signed-off-by: Ronnie Sahlberg <lsahlber(a)redhat.com>
Signed-off-by: Steve French <stfrench(a)microsoft.com>
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 6f38b134a346..7d756721e1a6 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4271,6 +4271,15 @@ static ssize_t __cifs_readv(
len = ctx->len;
}
+ if (direct) {
+ rc = filemap_write_and_wait_range(file->f_inode->i_mapping,
+ offset, offset + len - 1);
+ if (rc) {
+ kref_put(&ctx->refcount, cifs_aio_ctx_release);
+ return -EAGAIN;
+ }
+ }
+
/* grab a lock here due to read response handlers can access ctx */
mutex_lock(&ctx->aio_mutex);
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
9d88bf08e46d ("ARM: dts: integrator: Tag PCI host with device_type")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 9d88bf08e46dc2d2975ef87877445a924a04a0a7 Mon Sep 17 00:00:00 2001
From: Linus Walleij <linus.walleij(a)linaro.org>
Date: Mon, 19 Sep 2022 11:26:08 +0200
Subject: [PATCH] ARM: dts: integrator: Tag PCI host with device_type
The DT parser is dependent on the PCI device being tagged as
device_type = "pci" in order to parse memory ranges properly.
Fix this up.
Signed-off-by: Linus Walleij <linus.walleij(a)linaro.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20220919092608.813511-1-linus.walleij@linaro.org'
Signed-off-by: Arnd Bergmann <arnd(a)arndb.de>
diff --git a/arch/arm/boot/dts/integratorap.dts b/arch/arm/boot/dts/integratorap.dts
index 9b652cc27b14..c983435ed492 100644
--- a/arch/arm/boot/dts/integratorap.dts
+++ b/arch/arm/boot/dts/integratorap.dts
@@ -160,6 +160,7 @@ pic: pic@14000000 {
pci: pciv3@62000000 {
compatible = "arm,integrator-ap-pci", "v3,v360epc-pci";
+ device_type = "pci";
#interrupt-cells = <1>;
#size-cells = <2>;
#address-cells = <3>;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
ff7cd07f3064 ("net: thunderbolt: Enable DMA paths only after rings are enabled")
180b0689425c ("thunderbolt: Allow multiple DMA tunnels over a single XDomain connection")
46b494f28681 ("thunderbolt: Add support for maxhopid XDomain property")
8ccbed2476f2 ("thunderbolt: Do not re-establish XDomain DMA paths automatically")
d29c59b1a4dc ("thunderbolt: Add more logging to XDomain connections")
5ca67688256a ("thunderbolt: Allow disabling XDomain protocol")
a27ea0dfc1cd ("thunderbolt: tunnel: Fix misspelling of 'receive_path'")
9490f71167fe ("thunderbolt: Add connection manager specific hooks for USB4 router operations")
83bab44ada05 ("thunderbolt: Pass TX and RX data directly to usb4_switch_op()")
fe265a06319b ("thunderbolt: Pass metadata directly to usb4_switch_op()")
661b19473bf3 ("thunderbolt: Perform USB4 router NVM upgrade in two phases")
edc0f494ed96 ("thunderbolt: Add DMA traffic test driver")
5bf722df5d37 ("thunderbolt: Make it possible to allocate one directional DMA tunnel")
5cc0df9ce10a ("thunderbolt: Add functions for enabling and disabling lane bonding on XDomain")
4210d50f0b3e ("thunderbolt: Add link_speed and link_width to XDomain")
47844ecb8cec ("thunderbolt: Create XDomain devices for loops back to the host")
d67274bacb8a ("thunderbolt: Find XDomain by route instead of UUID")
8eabfca52333 ("thunderbolt: Use "if USB4" instead of "depends on" in Kconfig")
2c6ea4e2cefe ("thunderbolt: Allow KUnit tests to be built also when CONFIG_USB4=m")
54e418106c76 ("thunderbolt: Add debugfs interface")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ff7cd07f306406493f7b78890475e85b6d0811ed Mon Sep 17 00:00:00 2001
From: Mika Westerberg <mika.westerberg(a)linux.intel.com>
Date: Tue, 30 Aug 2022 18:32:46 +0300
Subject: [PATCH] net: thunderbolt: Enable DMA paths only after rings are
enabled
If the other host starts sending packets early on it is possible that we
are still in the middle of populating the initial Rx ring packets to the
ring. This causes the tbnet_poll() to mess over the queue and causes
list corruption. This happens specifically when connected with macOS as
it seems start sending various IP discovery packets as soon as its side
of the paths are configured.
To prevent this we move the DMA path enabling to happen after we have
primed the Rx ring. This makes sure no incoming packets can arrive
before we are ready to handle them.
Fixes: e69b6c02b4c3 ("net: Add support for networking over Thunderbolt cable")
Cc: stable(a)vger.kernel.org
Signed-off-by: Mika Westerberg <mika.westerberg(a)linux.intel.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/thunderbolt.c b/drivers/net/thunderbolt.c
index ff5d0e98a088..ab3f04562980 100644
--- a/drivers/net/thunderbolt.c
+++ b/drivers/net/thunderbolt.c
@@ -612,18 +612,13 @@ static void tbnet_connected_work(struct work_struct *work)
return;
}
- /* Both logins successful so enable the high-speed DMA paths and
- * start the network device queue.
+ /* Both logins successful so enable the rings, high-speed DMA
+ * paths and start the network device queue.
+ *
+ * Note we enable the DMA paths last to make sure we have primed
+ * the Rx ring before any incoming packets are allowed to
+ * arrive.
*/
- ret = tb_xdomain_enable_paths(net->xd, net->local_transmit_path,
- net->rx_ring.ring->hop,
- net->remote_transmit_path,
- net->tx_ring.ring->hop);
- if (ret) {
- netdev_err(net->dev, "failed to enable DMA paths\n");
- return;
- }
-
tb_ring_start(net->tx_ring.ring);
tb_ring_start(net->rx_ring.ring);
@@ -635,10 +630,21 @@ static void tbnet_connected_work(struct work_struct *work)
if (ret)
goto err_free_rx_buffers;
+ ret = tb_xdomain_enable_paths(net->xd, net->local_transmit_path,
+ net->rx_ring.ring->hop,
+ net->remote_transmit_path,
+ net->tx_ring.ring->hop);
+ if (ret) {
+ netdev_err(net->dev, "failed to enable DMA paths\n");
+ goto err_free_tx_buffers;
+ }
+
netif_carrier_on(net->dev);
netif_start_queue(net->dev);
return;
+err_free_tx_buffers:
+ tbnet_free_buffers(&net->tx_ring);
err_free_rx_buffers:
tbnet_free_buffers(&net->rx_ring);
err_stop_rings:
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
ff7cd07f3064 ("net: thunderbolt: Enable DMA paths only after rings are enabled")
180b0689425c ("thunderbolt: Allow multiple DMA tunnels over a single XDomain connection")
46b494f28681 ("thunderbolt: Add support for maxhopid XDomain property")
8ccbed2476f2 ("thunderbolt: Do not re-establish XDomain DMA paths automatically")
d29c59b1a4dc ("thunderbolt: Add more logging to XDomain connections")
5ca67688256a ("thunderbolt: Allow disabling XDomain protocol")
a27ea0dfc1cd ("thunderbolt: tunnel: Fix misspelling of 'receive_path'")
9490f71167fe ("thunderbolt: Add connection manager specific hooks for USB4 router operations")
83bab44ada05 ("thunderbolt: Pass TX and RX data directly to usb4_switch_op()")
fe265a06319b ("thunderbolt: Pass metadata directly to usb4_switch_op()")
661b19473bf3 ("thunderbolt: Perform USB4 router NVM upgrade in two phases")
edc0f494ed96 ("thunderbolt: Add DMA traffic test driver")
5bf722df5d37 ("thunderbolt: Make it possible to allocate one directional DMA tunnel")
5cc0df9ce10a ("thunderbolt: Add functions for enabling and disabling lane bonding on XDomain")
4210d50f0b3e ("thunderbolt: Add link_speed and link_width to XDomain")
47844ecb8cec ("thunderbolt: Create XDomain devices for loops back to the host")
d67274bacb8a ("thunderbolt: Find XDomain by route instead of UUID")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ff7cd07f306406493f7b78890475e85b6d0811ed Mon Sep 17 00:00:00 2001
From: Mika Westerberg <mika.westerberg(a)linux.intel.com>
Date: Tue, 30 Aug 2022 18:32:46 +0300
Subject: [PATCH] net: thunderbolt: Enable DMA paths only after rings are
enabled
If the other host starts sending packets early on it is possible that we
are still in the middle of populating the initial Rx ring packets to the
ring. This causes the tbnet_poll() to mess over the queue and causes
list corruption. This happens specifically when connected with macOS as
it seems start sending various IP discovery packets as soon as its side
of the paths are configured.
To prevent this we move the DMA path enabling to happen after we have
primed the Rx ring. This makes sure no incoming packets can arrive
before we are ready to handle them.
Fixes: e69b6c02b4c3 ("net: Add support for networking over Thunderbolt cable")
Cc: stable(a)vger.kernel.org
Signed-off-by: Mika Westerberg <mika.westerberg(a)linux.intel.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/thunderbolt.c b/drivers/net/thunderbolt.c
index ff5d0e98a088..ab3f04562980 100644
--- a/drivers/net/thunderbolt.c
+++ b/drivers/net/thunderbolt.c
@@ -612,18 +612,13 @@ static void tbnet_connected_work(struct work_struct *work)
return;
}
- /* Both logins successful so enable the high-speed DMA paths and
- * start the network device queue.
+ /* Both logins successful so enable the rings, high-speed DMA
+ * paths and start the network device queue.
+ *
+ * Note we enable the DMA paths last to make sure we have primed
+ * the Rx ring before any incoming packets are allowed to
+ * arrive.
*/
- ret = tb_xdomain_enable_paths(net->xd, net->local_transmit_path,
- net->rx_ring.ring->hop,
- net->remote_transmit_path,
- net->tx_ring.ring->hop);
- if (ret) {
- netdev_err(net->dev, "failed to enable DMA paths\n");
- return;
- }
-
tb_ring_start(net->tx_ring.ring);
tb_ring_start(net->rx_ring.ring);
@@ -635,10 +630,21 @@ static void tbnet_connected_work(struct work_struct *work)
if (ret)
goto err_free_rx_buffers;
+ ret = tb_xdomain_enable_paths(net->xd, net->local_transmit_path,
+ net->rx_ring.ring->hop,
+ net->remote_transmit_path,
+ net->tx_ring.ring->hop);
+ if (ret) {
+ netdev_err(net->dev, "failed to enable DMA paths\n");
+ goto err_free_tx_buffers;
+ }
+
netif_carrier_on(net->dev);
netif_start_queue(net->dev);
return;
+err_free_tx_buffers:
+ tbnet_free_buffers(&net->tx_ring);
err_free_rx_buffers:
tbnet_free_buffers(&net->rx_ring);
err_stop_rings:
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
ff7cd07f3064 ("net: thunderbolt: Enable DMA paths only after rings are enabled")
180b0689425c ("thunderbolt: Allow multiple DMA tunnels over a single XDomain connection")
46b494f28681 ("thunderbolt: Add support for maxhopid XDomain property")
8ccbed2476f2 ("thunderbolt: Do not re-establish XDomain DMA paths automatically")
d29c59b1a4dc ("thunderbolt: Add more logging to XDomain connections")
5ca67688256a ("thunderbolt: Allow disabling XDomain protocol")
a27ea0dfc1cd ("thunderbolt: tunnel: Fix misspelling of 'receive_path'")
9490f71167fe ("thunderbolt: Add connection manager specific hooks for USB4 router operations")
83bab44ada05 ("thunderbolt: Pass TX and RX data directly to usb4_switch_op()")
fe265a06319b ("thunderbolt: Pass metadata directly to usb4_switch_op()")
661b19473bf3 ("thunderbolt: Perform USB4 router NVM upgrade in two phases")
edc0f494ed96 ("thunderbolt: Add DMA traffic test driver")
5bf722df5d37 ("thunderbolt: Make it possible to allocate one directional DMA tunnel")
5cc0df9ce10a ("thunderbolt: Add functions for enabling and disabling lane bonding on XDomain")
4210d50f0b3e ("thunderbolt: Add link_speed and link_width to XDomain")
47844ecb8cec ("thunderbolt: Create XDomain devices for loops back to the host")
d67274bacb8a ("thunderbolt: Find XDomain by route instead of UUID")
8eabfca52333 ("thunderbolt: Use "if USB4" instead of "depends on" in Kconfig")
2c6ea4e2cefe ("thunderbolt: Allow KUnit tests to be built also when CONFIG_USB4=m")
54e418106c76 ("thunderbolt: Add debugfs interface")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ff7cd07f306406493f7b78890475e85b6d0811ed Mon Sep 17 00:00:00 2001
From: Mika Westerberg <mika.westerberg(a)linux.intel.com>
Date: Tue, 30 Aug 2022 18:32:46 +0300
Subject: [PATCH] net: thunderbolt: Enable DMA paths only after rings are
enabled
If the other host starts sending packets early on it is possible that we
are still in the middle of populating the initial Rx ring packets to the
ring. This causes the tbnet_poll() to mess over the queue and causes
list corruption. This happens specifically when connected with macOS as
it seems start sending various IP discovery packets as soon as its side
of the paths are configured.
To prevent this we move the DMA path enabling to happen after we have
primed the Rx ring. This makes sure no incoming packets can arrive
before we are ready to handle them.
Fixes: e69b6c02b4c3 ("net: Add support for networking over Thunderbolt cable")
Cc: stable(a)vger.kernel.org
Signed-off-by: Mika Westerberg <mika.westerberg(a)linux.intel.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/thunderbolt.c b/drivers/net/thunderbolt.c
index ff5d0e98a088..ab3f04562980 100644
--- a/drivers/net/thunderbolt.c
+++ b/drivers/net/thunderbolt.c
@@ -612,18 +612,13 @@ static void tbnet_connected_work(struct work_struct *work)
return;
}
- /* Both logins successful so enable the high-speed DMA paths and
- * start the network device queue.
+ /* Both logins successful so enable the rings, high-speed DMA
+ * paths and start the network device queue.
+ *
+ * Note we enable the DMA paths last to make sure we have primed
+ * the Rx ring before any incoming packets are allowed to
+ * arrive.
*/
- ret = tb_xdomain_enable_paths(net->xd, net->local_transmit_path,
- net->rx_ring.ring->hop,
- net->remote_transmit_path,
- net->tx_ring.ring->hop);
- if (ret) {
- netdev_err(net->dev, "failed to enable DMA paths\n");
- return;
- }
-
tb_ring_start(net->tx_ring.ring);
tb_ring_start(net->rx_ring.ring);
@@ -635,10 +630,21 @@ static void tbnet_connected_work(struct work_struct *work)
if (ret)
goto err_free_rx_buffers;
+ ret = tb_xdomain_enable_paths(net->xd, net->local_transmit_path,
+ net->rx_ring.ring->hop,
+ net->remote_transmit_path,
+ net->tx_ring.ring->hop);
+ if (ret) {
+ netdev_err(net->dev, "failed to enable DMA paths\n");
+ goto err_free_tx_buffers;
+ }
+
netif_carrier_on(net->dev);
netif_start_queue(net->dev);
return;
+err_free_tx_buffers:
+ tbnet_free_buffers(&net->tx_ring);
err_free_rx_buffers:
tbnet_free_buffers(&net->rx_ring);
err_stop_rings:
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7175e131ebba ("fs: dlm: fix invalid derefence of sb_lvbptr")
00e99ccde757 ("dlm: use __le types for dlm messages")
2f9dbeda8dc0 ("dlm: use __le types for rcom messages")
3428785a65da ("dlm: use __le types for dlm header")
6c2e3bf68f3e ("fs: dlm: filter user dlm messages for kernel locks")
9af5b8f0ead7 ("fs: dlm: add debugfs rawmsg send functionality")
88aa023a2556 ("fs: dlm: cleanup and remove _send_rcom")
d10a0b88751a ("fs: dlm: rename socket and app buffer defines")
5b2f981fde8b ("fs: dlm: add midcomms debugfs functionality")
489d8e559c65 ("fs: dlm: add reliable connection if reconnect")
8e2e40860c7f ("fs: dlm: add union in dlm header for lockspace id")
37a247da517f ("fs: dlm: move out some hash functionality")
8f2dc78dbc20 ("fs: dlm: make buffer handling per msg")
a070a91cf140 ("fs: dlm: add more midcomms hooks")
6fb5cf9d4206 ("fs: dlm: public header in out utility")
8aa31cbf20ad ("fs: dlm: fix connection tcp EOF handling")
ba868d9deaab ("fs: dlm: reconnect if socket error report occurs")
b38bc9c2b317 ("fs: dlm: fix srcu read lock usage")
2fd8db2dd05d ("fs: dlm: fix missing unlock on error in accept_from_sock()")
9d232469bcd7 ("fs: dlm: add shutdown hook")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7175e131ebba47afef47e6ac4d5bab474d1e6e49 Mon Sep 17 00:00:00 2001
From: Alexander Aring <aahringo(a)redhat.com>
Date: Mon, 15 Aug 2022 15:43:19 -0400
Subject: [PATCH] fs: dlm: fix invalid derefence of sb_lvbptr
I experience issues when putting a lkbsb on the stack and have sb_lvbptr
field to a dangled pointer while not using DLM_LKF_VALBLK. It will crash
with the following kernel message, the dangled pointer is here
0xdeadbeef as example:
[ 102.749317] BUG: unable to handle page fault for address: 00000000deadbeef
[ 102.749320] #PF: supervisor read access in kernel mode
[ 102.749323] #PF: error_code(0x0000) - not-present page
[ 102.749325] PGD 0 P4D 0
[ 102.749332] Oops: 0000 [#1] PREEMPT SMP PTI
[ 102.749336] CPU: 0 PID: 1567 Comm: lock_torture_wr Tainted: G W 5.19.0-rc3+ #1565
[ 102.749343] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014
[ 102.749344] RIP: 0010:memcpy_erms+0x6/0x10
[ 102.749353] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 102.749355] RSP: 0018:ffff97a58145fd08 EFLAGS: 00010202
[ 102.749358] RAX: ffff901778b77070 RBX: 0000000000000000 RCX: 0000000000000040
[ 102.749360] RDX: 0000000000000040 RSI: 00000000deadbeef RDI: ffff901778b77070
[ 102.749362] RBP: ffff97a58145fd10 R08: ffff901760b67a70 R09: 0000000000000001
[ 102.749364] R10: ffff9017008e2cb8 R11: 0000000000000001 R12: ffff901760b67a70
[ 102.749366] R13: ffff901760b78f00 R14: 0000000000000003 R15: 0000000000000001
[ 102.749368] FS: 0000000000000000(0000) GS:ffff901876e00000(0000) knlGS:0000000000000000
[ 102.749372] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 102.749374] CR2: 00000000deadbeef CR3: 000000017c49a004 CR4: 0000000000770ef0
[ 102.749376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 102.749378] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 102.749379] PKRU: 55555554
[ 102.749381] Call Trace:
[ 102.749382] <TASK>
[ 102.749383] ? send_args+0xb2/0xd0
[ 102.749389] send_common+0xb7/0xd0
[ 102.749395] _unlock_lock+0x2c/0x90
[ 102.749400] unlock_lock.isra.56+0x62/0xa0
[ 102.749405] dlm_unlock+0x21e/0x330
[ 102.749411] ? lock_torture_stats+0x80/0x80 [dlm_locktorture]
[ 102.749416] torture_unlock+0x5a/0x90 [dlm_locktorture]
[ 102.749419] ? preempt_count_sub+0xba/0x100
[ 102.749427] lock_torture_writer+0xbd/0x150 [dlm_locktorture]
[ 102.786186] kthread+0x10a/0x130
[ 102.786581] ? kthread_complete_and_exit+0x20/0x20
[ 102.787156] ret_from_fork+0x22/0x30
[ 102.787588] </TASK>
[ 102.787855] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common kvm_intel iTCO_wdt iTCO_vendor_support kvm vmw_vsock_virtio_transport qxl irqbypass vmw_vsock_virtio_transport_common drm_ttm_helper crc32_pclmul joydev crc32c_intel ttm vsock virtio_scsi virtio_balloon snd_pcm drm_kms_helper virtio_console snd_timer snd drm soundcore syscopyarea i2c_i801 sysfillrect sysimgblt i2c_smbus pcspkr fb_sys_fops lpc_ich serio_raw
[ 102.792536] CR2: 00000000deadbeef
[ 102.792930] ---[ end trace 0000000000000000 ]---
This patch fixes the issue by checking also on DLM_LKF_VALBLK on exflags
is set when copying the lvbptr array instead of if it's just null which
fixes for me the issue.
I think this patch can fix other dlm users as well, depending how they
handle the init, freeing memory handling of sb_lvbptr and don't set
DLM_LKF_VALBLK for some dlm_lock() calls. It might a there could be a
hidden issue all the time. However with checking on DLM_LKF_VALBLK the
user always need to provide a sb_lvbptr non-null value. There might be
more intelligent handling between per ls lvblen, DLM_LKF_VALBLK and
non-null to report the user the way how DLM API is used is wrong but can
be added for later, this will only fix the current behaviour.
Cc: stable(a)vger.kernel.org
Signed-off-by: Alexander Aring <aahringo(a)redhat.com>
Signed-off-by: David Teigland <teigland(a)redhat.com>
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 354f79254d62..da95ba3c295e 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -3651,7 +3651,7 @@ static void send_args(struct dlm_rsb *r, struct dlm_lkb *lkb,
case cpu_to_le32(DLM_MSG_REQUEST_REPLY):
case cpu_to_le32(DLM_MSG_CONVERT_REPLY):
case cpu_to_le32(DLM_MSG_GRANT):
- if (!lkb->lkb_lvbptr)
+ if (!lkb->lkb_lvbptr || !(lkb->lkb_exflags & DLM_LKF_VALBLK))
break;
memcpy(ms->m_extra, lkb->lkb_lvbptr, r->res_ls->ls_lvblen);
break;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7175e131ebba ("fs: dlm: fix invalid derefence of sb_lvbptr")
00e99ccde757 ("dlm: use __le types for dlm messages")
2f9dbeda8dc0 ("dlm: use __le types for rcom messages")
3428785a65da ("dlm: use __le types for dlm header")
6c2e3bf68f3e ("fs: dlm: filter user dlm messages for kernel locks")
9af5b8f0ead7 ("fs: dlm: add debugfs rawmsg send functionality")
88aa023a2556 ("fs: dlm: cleanup and remove _send_rcom")
d10a0b88751a ("fs: dlm: rename socket and app buffer defines")
5b2f981fde8b ("fs: dlm: add midcomms debugfs functionality")
489d8e559c65 ("fs: dlm: add reliable connection if reconnect")
8e2e40860c7f ("fs: dlm: add union in dlm header for lockspace id")
37a247da517f ("fs: dlm: move out some hash functionality")
8f2dc78dbc20 ("fs: dlm: make buffer handling per msg")
a070a91cf140 ("fs: dlm: add more midcomms hooks")
6fb5cf9d4206 ("fs: dlm: public header in out utility")
8aa31cbf20ad ("fs: dlm: fix connection tcp EOF handling")
ba868d9deaab ("fs: dlm: reconnect if socket error report occurs")
b38bc9c2b317 ("fs: dlm: fix srcu read lock usage")
2fd8db2dd05d ("fs: dlm: fix missing unlock on error in accept_from_sock()")
9d232469bcd7 ("fs: dlm: add shutdown hook")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7175e131ebba47afef47e6ac4d5bab474d1e6e49 Mon Sep 17 00:00:00 2001
From: Alexander Aring <aahringo(a)redhat.com>
Date: Mon, 15 Aug 2022 15:43:19 -0400
Subject: [PATCH] fs: dlm: fix invalid derefence of sb_lvbptr
I experience issues when putting a lkbsb on the stack and have sb_lvbptr
field to a dangled pointer while not using DLM_LKF_VALBLK. It will crash
with the following kernel message, the dangled pointer is here
0xdeadbeef as example:
[ 102.749317] BUG: unable to handle page fault for address: 00000000deadbeef
[ 102.749320] #PF: supervisor read access in kernel mode
[ 102.749323] #PF: error_code(0x0000) - not-present page
[ 102.749325] PGD 0 P4D 0
[ 102.749332] Oops: 0000 [#1] PREEMPT SMP PTI
[ 102.749336] CPU: 0 PID: 1567 Comm: lock_torture_wr Tainted: G W 5.19.0-rc3+ #1565
[ 102.749343] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014
[ 102.749344] RIP: 0010:memcpy_erms+0x6/0x10
[ 102.749353] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 102.749355] RSP: 0018:ffff97a58145fd08 EFLAGS: 00010202
[ 102.749358] RAX: ffff901778b77070 RBX: 0000000000000000 RCX: 0000000000000040
[ 102.749360] RDX: 0000000000000040 RSI: 00000000deadbeef RDI: ffff901778b77070
[ 102.749362] RBP: ffff97a58145fd10 R08: ffff901760b67a70 R09: 0000000000000001
[ 102.749364] R10: ffff9017008e2cb8 R11: 0000000000000001 R12: ffff901760b67a70
[ 102.749366] R13: ffff901760b78f00 R14: 0000000000000003 R15: 0000000000000001
[ 102.749368] FS: 0000000000000000(0000) GS:ffff901876e00000(0000) knlGS:0000000000000000
[ 102.749372] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 102.749374] CR2: 00000000deadbeef CR3: 000000017c49a004 CR4: 0000000000770ef0
[ 102.749376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 102.749378] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 102.749379] PKRU: 55555554
[ 102.749381] Call Trace:
[ 102.749382] <TASK>
[ 102.749383] ? send_args+0xb2/0xd0
[ 102.749389] send_common+0xb7/0xd0
[ 102.749395] _unlock_lock+0x2c/0x90
[ 102.749400] unlock_lock.isra.56+0x62/0xa0
[ 102.749405] dlm_unlock+0x21e/0x330
[ 102.749411] ? lock_torture_stats+0x80/0x80 [dlm_locktorture]
[ 102.749416] torture_unlock+0x5a/0x90 [dlm_locktorture]
[ 102.749419] ? preempt_count_sub+0xba/0x100
[ 102.749427] lock_torture_writer+0xbd/0x150 [dlm_locktorture]
[ 102.786186] kthread+0x10a/0x130
[ 102.786581] ? kthread_complete_and_exit+0x20/0x20
[ 102.787156] ret_from_fork+0x22/0x30
[ 102.787588] </TASK>
[ 102.787855] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common kvm_intel iTCO_wdt iTCO_vendor_support kvm vmw_vsock_virtio_transport qxl irqbypass vmw_vsock_virtio_transport_common drm_ttm_helper crc32_pclmul joydev crc32c_intel ttm vsock virtio_scsi virtio_balloon snd_pcm drm_kms_helper virtio_console snd_timer snd drm soundcore syscopyarea i2c_i801 sysfillrect sysimgblt i2c_smbus pcspkr fb_sys_fops lpc_ich serio_raw
[ 102.792536] CR2: 00000000deadbeef
[ 102.792930] ---[ end trace 0000000000000000 ]---
This patch fixes the issue by checking also on DLM_LKF_VALBLK on exflags
is set when copying the lvbptr array instead of if it's just null which
fixes for me the issue.
I think this patch can fix other dlm users as well, depending how they
handle the init, freeing memory handling of sb_lvbptr and don't set
DLM_LKF_VALBLK for some dlm_lock() calls. It might a there could be a
hidden issue all the time. However with checking on DLM_LKF_VALBLK the
user always need to provide a sb_lvbptr non-null value. There might be
more intelligent handling between per ls lvblen, DLM_LKF_VALBLK and
non-null to report the user the way how DLM API is used is wrong but can
be added for later, this will only fix the current behaviour.
Cc: stable(a)vger.kernel.org
Signed-off-by: Alexander Aring <aahringo(a)redhat.com>
Signed-off-by: David Teigland <teigland(a)redhat.com>
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 354f79254d62..da95ba3c295e 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -3651,7 +3651,7 @@ static void send_args(struct dlm_rsb *r, struct dlm_lkb *lkb,
case cpu_to_le32(DLM_MSG_REQUEST_REPLY):
case cpu_to_le32(DLM_MSG_CONVERT_REPLY):
case cpu_to_le32(DLM_MSG_GRANT):
- if (!lkb->lkb_lvbptr)
+ if (!lkb->lkb_lvbptr || !(lkb->lkb_exflags & DLM_LKF_VALBLK))
break;
memcpy(ms->m_extra, lkb->lkb_lvbptr, r->res_ls->ls_lvblen);
break;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7175e131ebba ("fs: dlm: fix invalid derefence of sb_lvbptr")
00e99ccde757 ("dlm: use __le types for dlm messages")
2f9dbeda8dc0 ("dlm: use __le types for rcom messages")
3428785a65da ("dlm: use __le types for dlm header")
6c2e3bf68f3e ("fs: dlm: filter user dlm messages for kernel locks")
9af5b8f0ead7 ("fs: dlm: add debugfs rawmsg send functionality")
88aa023a2556 ("fs: dlm: cleanup and remove _send_rcom")
d10a0b88751a ("fs: dlm: rename socket and app buffer defines")
5b2f981fde8b ("fs: dlm: add midcomms debugfs functionality")
489d8e559c65 ("fs: dlm: add reliable connection if reconnect")
8e2e40860c7f ("fs: dlm: add union in dlm header for lockspace id")
37a247da517f ("fs: dlm: move out some hash functionality")
8f2dc78dbc20 ("fs: dlm: make buffer handling per msg")
a070a91cf140 ("fs: dlm: add more midcomms hooks")
6fb5cf9d4206 ("fs: dlm: public header in out utility")
8aa31cbf20ad ("fs: dlm: fix connection tcp EOF handling")
ba868d9deaab ("fs: dlm: reconnect if socket error report occurs")
b38bc9c2b317 ("fs: dlm: fix srcu read lock usage")
2fd8db2dd05d ("fs: dlm: fix missing unlock on error in accept_from_sock()")
9d232469bcd7 ("fs: dlm: add shutdown hook")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7175e131ebba47afef47e6ac4d5bab474d1e6e49 Mon Sep 17 00:00:00 2001
From: Alexander Aring <aahringo(a)redhat.com>
Date: Mon, 15 Aug 2022 15:43:19 -0400
Subject: [PATCH] fs: dlm: fix invalid derefence of sb_lvbptr
I experience issues when putting a lkbsb on the stack and have sb_lvbptr
field to a dangled pointer while not using DLM_LKF_VALBLK. It will crash
with the following kernel message, the dangled pointer is here
0xdeadbeef as example:
[ 102.749317] BUG: unable to handle page fault for address: 00000000deadbeef
[ 102.749320] #PF: supervisor read access in kernel mode
[ 102.749323] #PF: error_code(0x0000) - not-present page
[ 102.749325] PGD 0 P4D 0
[ 102.749332] Oops: 0000 [#1] PREEMPT SMP PTI
[ 102.749336] CPU: 0 PID: 1567 Comm: lock_torture_wr Tainted: G W 5.19.0-rc3+ #1565
[ 102.749343] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014
[ 102.749344] RIP: 0010:memcpy_erms+0x6/0x10
[ 102.749353] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 102.749355] RSP: 0018:ffff97a58145fd08 EFLAGS: 00010202
[ 102.749358] RAX: ffff901778b77070 RBX: 0000000000000000 RCX: 0000000000000040
[ 102.749360] RDX: 0000000000000040 RSI: 00000000deadbeef RDI: ffff901778b77070
[ 102.749362] RBP: ffff97a58145fd10 R08: ffff901760b67a70 R09: 0000000000000001
[ 102.749364] R10: ffff9017008e2cb8 R11: 0000000000000001 R12: ffff901760b67a70
[ 102.749366] R13: ffff901760b78f00 R14: 0000000000000003 R15: 0000000000000001
[ 102.749368] FS: 0000000000000000(0000) GS:ffff901876e00000(0000) knlGS:0000000000000000
[ 102.749372] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 102.749374] CR2: 00000000deadbeef CR3: 000000017c49a004 CR4: 0000000000770ef0
[ 102.749376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 102.749378] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 102.749379] PKRU: 55555554
[ 102.749381] Call Trace:
[ 102.749382] <TASK>
[ 102.749383] ? send_args+0xb2/0xd0
[ 102.749389] send_common+0xb7/0xd0
[ 102.749395] _unlock_lock+0x2c/0x90
[ 102.749400] unlock_lock.isra.56+0x62/0xa0
[ 102.749405] dlm_unlock+0x21e/0x330
[ 102.749411] ? lock_torture_stats+0x80/0x80 [dlm_locktorture]
[ 102.749416] torture_unlock+0x5a/0x90 [dlm_locktorture]
[ 102.749419] ? preempt_count_sub+0xba/0x100
[ 102.749427] lock_torture_writer+0xbd/0x150 [dlm_locktorture]
[ 102.786186] kthread+0x10a/0x130
[ 102.786581] ? kthread_complete_and_exit+0x20/0x20
[ 102.787156] ret_from_fork+0x22/0x30
[ 102.787588] </TASK>
[ 102.787855] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common kvm_intel iTCO_wdt iTCO_vendor_support kvm vmw_vsock_virtio_transport qxl irqbypass vmw_vsock_virtio_transport_common drm_ttm_helper crc32_pclmul joydev crc32c_intel ttm vsock virtio_scsi virtio_balloon snd_pcm drm_kms_helper virtio_console snd_timer snd drm soundcore syscopyarea i2c_i801 sysfillrect sysimgblt i2c_smbus pcspkr fb_sys_fops lpc_ich serio_raw
[ 102.792536] CR2: 00000000deadbeef
[ 102.792930] ---[ end trace 0000000000000000 ]---
This patch fixes the issue by checking also on DLM_LKF_VALBLK on exflags
is set when copying the lvbptr array instead of if it's just null which
fixes for me the issue.
I think this patch can fix other dlm users as well, depending how they
handle the init, freeing memory handling of sb_lvbptr and don't set
DLM_LKF_VALBLK for some dlm_lock() calls. It might a there could be a
hidden issue all the time. However with checking on DLM_LKF_VALBLK the
user always need to provide a sb_lvbptr non-null value. There might be
more intelligent handling between per ls lvblen, DLM_LKF_VALBLK and
non-null to report the user the way how DLM API is used is wrong but can
be added for later, this will only fix the current behaviour.
Cc: stable(a)vger.kernel.org
Signed-off-by: Alexander Aring <aahringo(a)redhat.com>
Signed-off-by: David Teigland <teigland(a)redhat.com>
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 354f79254d62..da95ba3c295e 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -3651,7 +3651,7 @@ static void send_args(struct dlm_rsb *r, struct dlm_lkb *lkb,
case cpu_to_le32(DLM_MSG_REQUEST_REPLY):
case cpu_to_le32(DLM_MSG_CONVERT_REPLY):
case cpu_to_le32(DLM_MSG_GRANT):
- if (!lkb->lkb_lvbptr)
+ if (!lkb->lkb_lvbptr || !(lkb->lkb_exflags & DLM_LKF_VALBLK))
break;
memcpy(ms->m_extra, lkb->lkb_lvbptr, r->res_ls->ls_lvblen);
break;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7175e131ebba ("fs: dlm: fix invalid derefence of sb_lvbptr")
00e99ccde757 ("dlm: use __le types for dlm messages")
2f9dbeda8dc0 ("dlm: use __le types for rcom messages")
3428785a65da ("dlm: use __le types for dlm header")
6c2e3bf68f3e ("fs: dlm: filter user dlm messages for kernel locks")
9af5b8f0ead7 ("fs: dlm: add debugfs rawmsg send functionality")
88aa023a2556 ("fs: dlm: cleanup and remove _send_rcom")
d10a0b88751a ("fs: dlm: rename socket and app buffer defines")
5b2f981fde8b ("fs: dlm: add midcomms debugfs functionality")
489d8e559c65 ("fs: dlm: add reliable connection if reconnect")
8e2e40860c7f ("fs: dlm: add union in dlm header for lockspace id")
37a247da517f ("fs: dlm: move out some hash functionality")
8f2dc78dbc20 ("fs: dlm: make buffer handling per msg")
a070a91cf140 ("fs: dlm: add more midcomms hooks")
6fb5cf9d4206 ("fs: dlm: public header in out utility")
8aa31cbf20ad ("fs: dlm: fix connection tcp EOF handling")
ba868d9deaab ("fs: dlm: reconnect if socket error report occurs")
b38bc9c2b317 ("fs: dlm: fix srcu read lock usage")
2fd8db2dd05d ("fs: dlm: fix missing unlock on error in accept_from_sock()")
9d232469bcd7 ("fs: dlm: add shutdown hook")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7175e131ebba47afef47e6ac4d5bab474d1e6e49 Mon Sep 17 00:00:00 2001
From: Alexander Aring <aahringo(a)redhat.com>
Date: Mon, 15 Aug 2022 15:43:19 -0400
Subject: [PATCH] fs: dlm: fix invalid derefence of sb_lvbptr
I experience issues when putting a lkbsb on the stack and have sb_lvbptr
field to a dangled pointer while not using DLM_LKF_VALBLK. It will crash
with the following kernel message, the dangled pointer is here
0xdeadbeef as example:
[ 102.749317] BUG: unable to handle page fault for address: 00000000deadbeef
[ 102.749320] #PF: supervisor read access in kernel mode
[ 102.749323] #PF: error_code(0x0000) - not-present page
[ 102.749325] PGD 0 P4D 0
[ 102.749332] Oops: 0000 [#1] PREEMPT SMP PTI
[ 102.749336] CPU: 0 PID: 1567 Comm: lock_torture_wr Tainted: G W 5.19.0-rc3+ #1565
[ 102.749343] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014
[ 102.749344] RIP: 0010:memcpy_erms+0x6/0x10
[ 102.749353] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 102.749355] RSP: 0018:ffff97a58145fd08 EFLAGS: 00010202
[ 102.749358] RAX: ffff901778b77070 RBX: 0000000000000000 RCX: 0000000000000040
[ 102.749360] RDX: 0000000000000040 RSI: 00000000deadbeef RDI: ffff901778b77070
[ 102.749362] RBP: ffff97a58145fd10 R08: ffff901760b67a70 R09: 0000000000000001
[ 102.749364] R10: ffff9017008e2cb8 R11: 0000000000000001 R12: ffff901760b67a70
[ 102.749366] R13: ffff901760b78f00 R14: 0000000000000003 R15: 0000000000000001
[ 102.749368] FS: 0000000000000000(0000) GS:ffff901876e00000(0000) knlGS:0000000000000000
[ 102.749372] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 102.749374] CR2: 00000000deadbeef CR3: 000000017c49a004 CR4: 0000000000770ef0
[ 102.749376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 102.749378] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 102.749379] PKRU: 55555554
[ 102.749381] Call Trace:
[ 102.749382] <TASK>
[ 102.749383] ? send_args+0xb2/0xd0
[ 102.749389] send_common+0xb7/0xd0
[ 102.749395] _unlock_lock+0x2c/0x90
[ 102.749400] unlock_lock.isra.56+0x62/0xa0
[ 102.749405] dlm_unlock+0x21e/0x330
[ 102.749411] ? lock_torture_stats+0x80/0x80 [dlm_locktorture]
[ 102.749416] torture_unlock+0x5a/0x90 [dlm_locktorture]
[ 102.749419] ? preempt_count_sub+0xba/0x100
[ 102.749427] lock_torture_writer+0xbd/0x150 [dlm_locktorture]
[ 102.786186] kthread+0x10a/0x130
[ 102.786581] ? kthread_complete_and_exit+0x20/0x20
[ 102.787156] ret_from_fork+0x22/0x30
[ 102.787588] </TASK>
[ 102.787855] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common kvm_intel iTCO_wdt iTCO_vendor_support kvm vmw_vsock_virtio_transport qxl irqbypass vmw_vsock_virtio_transport_common drm_ttm_helper crc32_pclmul joydev crc32c_intel ttm vsock virtio_scsi virtio_balloon snd_pcm drm_kms_helper virtio_console snd_timer snd drm soundcore syscopyarea i2c_i801 sysfillrect sysimgblt i2c_smbus pcspkr fb_sys_fops lpc_ich serio_raw
[ 102.792536] CR2: 00000000deadbeef
[ 102.792930] ---[ end trace 0000000000000000 ]---
This patch fixes the issue by checking also on DLM_LKF_VALBLK on exflags
is set when copying the lvbptr array instead of if it's just null which
fixes for me the issue.
I think this patch can fix other dlm users as well, depending how they
handle the init, freeing memory handling of sb_lvbptr and don't set
DLM_LKF_VALBLK for some dlm_lock() calls. It might a there could be a
hidden issue all the time. However with checking on DLM_LKF_VALBLK the
user always need to provide a sb_lvbptr non-null value. There might be
more intelligent handling between per ls lvblen, DLM_LKF_VALBLK and
non-null to report the user the way how DLM API is used is wrong but can
be added for later, this will only fix the current behaviour.
Cc: stable(a)vger.kernel.org
Signed-off-by: Alexander Aring <aahringo(a)redhat.com>
Signed-off-by: David Teigland <teigland(a)redhat.com>
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 354f79254d62..da95ba3c295e 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -3651,7 +3651,7 @@ static void send_args(struct dlm_rsb *r, struct dlm_lkb *lkb,
case cpu_to_le32(DLM_MSG_REQUEST_REPLY):
case cpu_to_le32(DLM_MSG_CONVERT_REPLY):
case cpu_to_le32(DLM_MSG_GRANT):
- if (!lkb->lkb_lvbptr)
+ if (!lkb->lkb_lvbptr || !(lkb->lkb_exflags & DLM_LKF_VALBLK))
break;
memcpy(ms->m_extra, lkb->lkb_lvbptr, r->res_ls->ls_lvblen);
break;
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7175e131ebba ("fs: dlm: fix invalid derefence of sb_lvbptr")
00e99ccde757 ("dlm: use __le types for dlm messages")
2f9dbeda8dc0 ("dlm: use __le types for rcom messages")
3428785a65da ("dlm: use __le types for dlm header")
6c2e3bf68f3e ("fs: dlm: filter user dlm messages for kernel locks")
9af5b8f0ead7 ("fs: dlm: add debugfs rawmsg send functionality")
88aa023a2556 ("fs: dlm: cleanup and remove _send_rcom")
d10a0b88751a ("fs: dlm: rename socket and app buffer defines")
5b2f981fde8b ("fs: dlm: add midcomms debugfs functionality")
489d8e559c65 ("fs: dlm: add reliable connection if reconnect")
8e2e40860c7f ("fs: dlm: add union in dlm header for lockspace id")
37a247da517f ("fs: dlm: move out some hash functionality")
8f2dc78dbc20 ("fs: dlm: make buffer handling per msg")
a070a91cf140 ("fs: dlm: add more midcomms hooks")
6fb5cf9d4206 ("fs: dlm: public header in out utility")
8aa31cbf20ad ("fs: dlm: fix connection tcp EOF handling")
ba868d9deaab ("fs: dlm: reconnect if socket error report occurs")
b38bc9c2b317 ("fs: dlm: fix srcu read lock usage")
2fd8db2dd05d ("fs: dlm: fix missing unlock on error in accept_from_sock()")
9d232469bcd7 ("fs: dlm: add shutdown hook")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7175e131ebba47afef47e6ac4d5bab474d1e6e49 Mon Sep 17 00:00:00 2001
From: Alexander Aring <aahringo(a)redhat.com>
Date: Mon, 15 Aug 2022 15:43:19 -0400
Subject: [PATCH] fs: dlm: fix invalid derefence of sb_lvbptr
I experience issues when putting a lkbsb on the stack and have sb_lvbptr
field to a dangled pointer while not using DLM_LKF_VALBLK. It will crash
with the following kernel message, the dangled pointer is here
0xdeadbeef as example:
[ 102.749317] BUG: unable to handle page fault for address: 00000000deadbeef
[ 102.749320] #PF: supervisor read access in kernel mode
[ 102.749323] #PF: error_code(0x0000) - not-present page
[ 102.749325] PGD 0 P4D 0
[ 102.749332] Oops: 0000 [#1] PREEMPT SMP PTI
[ 102.749336] CPU: 0 PID: 1567 Comm: lock_torture_wr Tainted: G W 5.19.0-rc3+ #1565
[ 102.749343] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014
[ 102.749344] RIP: 0010:memcpy_erms+0x6/0x10
[ 102.749353] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 102.749355] RSP: 0018:ffff97a58145fd08 EFLAGS: 00010202
[ 102.749358] RAX: ffff901778b77070 RBX: 0000000000000000 RCX: 0000000000000040
[ 102.749360] RDX: 0000000000000040 RSI: 00000000deadbeef RDI: ffff901778b77070
[ 102.749362] RBP: ffff97a58145fd10 R08: ffff901760b67a70 R09: 0000000000000001
[ 102.749364] R10: ffff9017008e2cb8 R11: 0000000000000001 R12: ffff901760b67a70
[ 102.749366] R13: ffff901760b78f00 R14: 0000000000000003 R15: 0000000000000001
[ 102.749368] FS: 0000000000000000(0000) GS:ffff901876e00000(0000) knlGS:0000000000000000
[ 102.749372] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 102.749374] CR2: 00000000deadbeef CR3: 000000017c49a004 CR4: 0000000000770ef0
[ 102.749376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 102.749378] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 102.749379] PKRU: 55555554
[ 102.749381] Call Trace:
[ 102.749382] <TASK>
[ 102.749383] ? send_args+0xb2/0xd0
[ 102.749389] send_common+0xb7/0xd0
[ 102.749395] _unlock_lock+0x2c/0x90
[ 102.749400] unlock_lock.isra.56+0x62/0xa0
[ 102.749405] dlm_unlock+0x21e/0x330
[ 102.749411] ? lock_torture_stats+0x80/0x80 [dlm_locktorture]
[ 102.749416] torture_unlock+0x5a/0x90 [dlm_locktorture]
[ 102.749419] ? preempt_count_sub+0xba/0x100
[ 102.749427] lock_torture_writer+0xbd/0x150 [dlm_locktorture]
[ 102.786186] kthread+0x10a/0x130
[ 102.786581] ? kthread_complete_and_exit+0x20/0x20
[ 102.787156] ret_from_fork+0x22/0x30
[ 102.787588] </TASK>
[ 102.787855] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common kvm_intel iTCO_wdt iTCO_vendor_support kvm vmw_vsock_virtio_transport qxl irqbypass vmw_vsock_virtio_transport_common drm_ttm_helper crc32_pclmul joydev crc32c_intel ttm vsock virtio_scsi virtio_balloon snd_pcm drm_kms_helper virtio_console snd_timer snd drm soundcore syscopyarea i2c_i801 sysfillrect sysimgblt i2c_smbus pcspkr fb_sys_fops lpc_ich serio_raw
[ 102.792536] CR2: 00000000deadbeef
[ 102.792930] ---[ end trace 0000000000000000 ]---
This patch fixes the issue by checking also on DLM_LKF_VALBLK on exflags
is set when copying the lvbptr array instead of if it's just null which
fixes for me the issue.
I think this patch can fix other dlm users as well, depending how they
handle the init, freeing memory handling of sb_lvbptr and don't set
DLM_LKF_VALBLK for some dlm_lock() calls. It might a there could be a
hidden issue all the time. However with checking on DLM_LKF_VALBLK the
user always need to provide a sb_lvbptr non-null value. There might be
more intelligent handling between per ls lvblen, DLM_LKF_VALBLK and
non-null to report the user the way how DLM API is used is wrong but can
be added for later, this will only fix the current behaviour.
Cc: stable(a)vger.kernel.org
Signed-off-by: Alexander Aring <aahringo(a)redhat.com>
Signed-off-by: David Teigland <teigland(a)redhat.com>
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 354f79254d62..da95ba3c295e 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -3651,7 +3651,7 @@ static void send_args(struct dlm_rsb *r, struct dlm_lkb *lkb,
case cpu_to_le32(DLM_MSG_REQUEST_REPLY):
case cpu_to_le32(DLM_MSG_CONVERT_REPLY):
case cpu_to_le32(DLM_MSG_GRANT):
- if (!lkb->lkb_lvbptr)
+ if (!lkb->lkb_lvbptr || !(lkb->lkb_exflags & DLM_LKF_VALBLK))
break;
memcpy(ms->m_extra, lkb->lkb_lvbptr, r->res_ls->ls_lvblen);
break;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
7175e131ebba ("fs: dlm: fix invalid derefence of sb_lvbptr")
00e99ccde757 ("dlm: use __le types for dlm messages")
2f9dbeda8dc0 ("dlm: use __le types for rcom messages")
3428785a65da ("dlm: use __le types for dlm header")
6c2e3bf68f3e ("fs: dlm: filter user dlm messages for kernel locks")
9af5b8f0ead7 ("fs: dlm: add debugfs rawmsg send functionality")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 7175e131ebba47afef47e6ac4d5bab474d1e6e49 Mon Sep 17 00:00:00 2001
From: Alexander Aring <aahringo(a)redhat.com>
Date: Mon, 15 Aug 2022 15:43:19 -0400
Subject: [PATCH] fs: dlm: fix invalid derefence of sb_lvbptr
I experience issues when putting a lkbsb on the stack and have sb_lvbptr
field to a dangled pointer while not using DLM_LKF_VALBLK. It will crash
with the following kernel message, the dangled pointer is here
0xdeadbeef as example:
[ 102.749317] BUG: unable to handle page fault for address: 00000000deadbeef
[ 102.749320] #PF: supervisor read access in kernel mode
[ 102.749323] #PF: error_code(0x0000) - not-present page
[ 102.749325] PGD 0 P4D 0
[ 102.749332] Oops: 0000 [#1] PREEMPT SMP PTI
[ 102.749336] CPU: 0 PID: 1567 Comm: lock_torture_wr Tainted: G W 5.19.0-rc3+ #1565
[ 102.749343] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014
[ 102.749344] RIP: 0010:memcpy_erms+0x6/0x10
[ 102.749353] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[ 102.749355] RSP: 0018:ffff97a58145fd08 EFLAGS: 00010202
[ 102.749358] RAX: ffff901778b77070 RBX: 0000000000000000 RCX: 0000000000000040
[ 102.749360] RDX: 0000000000000040 RSI: 00000000deadbeef RDI: ffff901778b77070
[ 102.749362] RBP: ffff97a58145fd10 R08: ffff901760b67a70 R09: 0000000000000001
[ 102.749364] R10: ffff9017008e2cb8 R11: 0000000000000001 R12: ffff901760b67a70
[ 102.749366] R13: ffff901760b78f00 R14: 0000000000000003 R15: 0000000000000001
[ 102.749368] FS: 0000000000000000(0000) GS:ffff901876e00000(0000) knlGS:0000000000000000
[ 102.749372] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 102.749374] CR2: 00000000deadbeef CR3: 000000017c49a004 CR4: 0000000000770ef0
[ 102.749376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 102.749378] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 102.749379] PKRU: 55555554
[ 102.749381] Call Trace:
[ 102.749382] <TASK>
[ 102.749383] ? send_args+0xb2/0xd0
[ 102.749389] send_common+0xb7/0xd0
[ 102.749395] _unlock_lock+0x2c/0x90
[ 102.749400] unlock_lock.isra.56+0x62/0xa0
[ 102.749405] dlm_unlock+0x21e/0x330
[ 102.749411] ? lock_torture_stats+0x80/0x80 [dlm_locktorture]
[ 102.749416] torture_unlock+0x5a/0x90 [dlm_locktorture]
[ 102.749419] ? preempt_count_sub+0xba/0x100
[ 102.749427] lock_torture_writer+0xbd/0x150 [dlm_locktorture]
[ 102.786186] kthread+0x10a/0x130
[ 102.786581] ? kthread_complete_and_exit+0x20/0x20
[ 102.787156] ret_from_fork+0x22/0x30
[ 102.787588] </TASK>
[ 102.787855] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common kvm_intel iTCO_wdt iTCO_vendor_support kvm vmw_vsock_virtio_transport qxl irqbypass vmw_vsock_virtio_transport_common drm_ttm_helper crc32_pclmul joydev crc32c_intel ttm vsock virtio_scsi virtio_balloon snd_pcm drm_kms_helper virtio_console snd_timer snd drm soundcore syscopyarea i2c_i801 sysfillrect sysimgblt i2c_smbus pcspkr fb_sys_fops lpc_ich serio_raw
[ 102.792536] CR2: 00000000deadbeef
[ 102.792930] ---[ end trace 0000000000000000 ]---
This patch fixes the issue by checking also on DLM_LKF_VALBLK on exflags
is set when copying the lvbptr array instead of if it's just null which
fixes for me the issue.
I think this patch can fix other dlm users as well, depending how they
handle the init, freeing memory handling of sb_lvbptr and don't set
DLM_LKF_VALBLK for some dlm_lock() calls. It might a there could be a
hidden issue all the time. However with checking on DLM_LKF_VALBLK the
user always need to provide a sb_lvbptr non-null value. There might be
more intelligent handling between per ls lvblen, DLM_LKF_VALBLK and
non-null to report the user the way how DLM API is used is wrong but can
be added for later, this will only fix the current behaviour.
Cc: stable(a)vger.kernel.org
Signed-off-by: Alexander Aring <aahringo(a)redhat.com>
Signed-off-by: David Teigland <teigland(a)redhat.com>
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 354f79254d62..da95ba3c295e 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -3651,7 +3651,7 @@ static void send_args(struct dlm_rsb *r, struct dlm_lkb *lkb,
case cpu_to_le32(DLM_MSG_REQUEST_REPLY):
case cpu_to_le32(DLM_MSG_CONVERT_REPLY):
case cpu_to_le32(DLM_MSG_GRANT):
- if (!lkb->lkb_lvbptr)
+ if (!lkb->lkb_lvbptr || !(lkb->lkb_exflags & DLM_LKF_VALBLK))
break;
memcpy(ms->m_extra, lkb->lkb_lvbptr, r->res_ls->ls_lvblen);
break;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
bf68b5b34311 ("io_uring/rw: fix unexpected link breakage")
f3b44f92e59a ("io_uring: move read/write related opcodes to its own file")
c98817e6cd44 ("io_uring: move remaining file table manipulation to filetable.c")
735729844819 ("io_uring: move rsrc related data, core, and commands")
3b77495a9723 ("io_uring: split provided buffers handling into its own file")
7aaff708a768 ("io_uring: move cancelation into its own file")
329061d3e2f9 ("io_uring: move poll handling into its own file")
cfd22e6b3319 ("io_uring: add opcode name to io_op_defs")
92ac8beaea1f ("io_uring: include and forward-declaration sanitation")
c9f06aa7de15 ("io_uring: move io_uring_task (tctx) helpers into its own file")
a4ad4f748ea9 ("io_uring: move fdinfo helpers to its own file")
e5550a1447bf ("io_uring: use io_is_uring_fops() consistently")
17437f311490 ("io_uring: move SQPOLL related handling into its own file")
59915143e89f ("io_uring: move timeout opcodes and handling into its own file")
e418bbc97bff ("io_uring: move our reference counting into a header")
36404b09aa60 ("io_uring: move msg_ring into its own file")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bf68b5b34311ee57ed40749a1257a30b46127556 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Tue, 27 Sep 2022 00:44:39 +0100
Subject: [PATCH] io_uring/rw: fix unexpected link breakage
req->cqe.res is set in io_read() to the amount of bytes left to be done,
which is used to figure out whether to fail a read or not. However,
io_read() may do another without returning, and we stash the previous
value into ->bytes_done but forget to update cqe.res. Then we ask a read
to do strictly less than cqe.res but expect the return to be exactly
cqe.res.
Fix the bug by updating cqe.res for retries.
Cc: stable(a)vger.kernel.org
Reported-and-Tested-by: Beld Zhang <beldzhang(a)gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/3a1088440c7be98e5800267af922a67da0ef9f13.16642357…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 59c92a4616b8..ed14322aadb9 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -793,6 +793,7 @@ int io_read(struct io_kiocb *req, unsigned int issue_flags)
return -EAGAIN;
}
+ req->cqe.res = iov_iter_count(&s->iter);
/*
* Now retry read with the IOCB_WAITQ parts set in the iocb. If
* we get -EIOCBQUEUED, then we'll get a notification when the
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
bf68b5b34311 ("io_uring/rw: fix unexpected link breakage")
f3b44f92e59a ("io_uring: move read/write related opcodes to its own file")
c98817e6cd44 ("io_uring: move remaining file table manipulation to filetable.c")
735729844819 ("io_uring: move rsrc related data, core, and commands")
3b77495a9723 ("io_uring: split provided buffers handling into its own file")
7aaff708a768 ("io_uring: move cancelation into its own file")
329061d3e2f9 ("io_uring: move poll handling into its own file")
cfd22e6b3319 ("io_uring: add opcode name to io_op_defs")
92ac8beaea1f ("io_uring: include and forward-declaration sanitation")
c9f06aa7de15 ("io_uring: move io_uring_task (tctx) helpers into its own file")
a4ad4f748ea9 ("io_uring: move fdinfo helpers to its own file")
e5550a1447bf ("io_uring: use io_is_uring_fops() consistently")
17437f311490 ("io_uring: move SQPOLL related handling into its own file")
59915143e89f ("io_uring: move timeout opcodes and handling into its own file")
e418bbc97bff ("io_uring: move our reference counting into a header")
36404b09aa60 ("io_uring: move msg_ring into its own file")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bf68b5b34311ee57ed40749a1257a30b46127556 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Tue, 27 Sep 2022 00:44:39 +0100
Subject: [PATCH] io_uring/rw: fix unexpected link breakage
req->cqe.res is set in io_read() to the amount of bytes left to be done,
which is used to figure out whether to fail a read or not. However,
io_read() may do another without returning, and we stash the previous
value into ->bytes_done but forget to update cqe.res. Then we ask a read
to do strictly less than cqe.res but expect the return to be exactly
cqe.res.
Fix the bug by updating cqe.res for retries.
Cc: stable(a)vger.kernel.org
Reported-and-Tested-by: Beld Zhang <beldzhang(a)gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/3a1088440c7be98e5800267af922a67da0ef9f13.16642357…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 59c92a4616b8..ed14322aadb9 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -793,6 +793,7 @@ int io_read(struct io_kiocb *req, unsigned int issue_flags)
return -EAGAIN;
}
+ req->cqe.res = iov_iter_count(&s->iter);
/*
* Now retry read with the IOCB_WAITQ parts set in the iocb. If
* we get -EIOCBQUEUED, then we'll get a notification when the
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
bf68b5b34311 ("io_uring/rw: fix unexpected link breakage")
f3b44f92e59a ("io_uring: move read/write related opcodes to its own file")
c98817e6cd44 ("io_uring: move remaining file table manipulation to filetable.c")
735729844819 ("io_uring: move rsrc related data, core, and commands")
3b77495a9723 ("io_uring: split provided buffers handling into its own file")
7aaff708a768 ("io_uring: move cancelation into its own file")
329061d3e2f9 ("io_uring: move poll handling into its own file")
cfd22e6b3319 ("io_uring: add opcode name to io_op_defs")
92ac8beaea1f ("io_uring: include and forward-declaration sanitation")
c9f06aa7de15 ("io_uring: move io_uring_task (tctx) helpers into its own file")
a4ad4f748ea9 ("io_uring: move fdinfo helpers to its own file")
e5550a1447bf ("io_uring: use io_is_uring_fops() consistently")
17437f311490 ("io_uring: move SQPOLL related handling into its own file")
59915143e89f ("io_uring: move timeout opcodes and handling into its own file")
e418bbc97bff ("io_uring: move our reference counting into a header")
36404b09aa60 ("io_uring: move msg_ring into its own file")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From bf68b5b34311ee57ed40749a1257a30b46127556 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Tue, 27 Sep 2022 00:44:39 +0100
Subject: [PATCH] io_uring/rw: fix unexpected link breakage
req->cqe.res is set in io_read() to the amount of bytes left to be done,
which is used to figure out whether to fail a read or not. However,
io_read() may do another without returning, and we stash the previous
value into ->bytes_done but forget to update cqe.res. Then we ask a read
to do strictly less than cqe.res but expect the return to be exactly
cqe.res.
Fix the bug by updating cqe.res for retries.
Cc: stable(a)vger.kernel.org
Reported-and-Tested-by: Beld Zhang <beldzhang(a)gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/3a1088440c7be98e5800267af922a67da0ef9f13.16642357…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 59c92a4616b8..ed14322aadb9 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -793,6 +793,7 @@ int io_read(struct io_kiocb *req, unsigned int issue_flags)
return -EAGAIN;
}
+ req->cqe.res = iov_iter_count(&s->iter);
/*
* Now retry read with the IOCB_WAITQ parts set in the iocb. If
* we get -EIOCBQUEUED, then we'll get a notification when the
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
99f15d8d6136 ("io_uring: move uring_cmd handling to its own file")
cd40cae29ef8 ("io_uring: split out open/close operations")
453b329be5ea ("io_uring: separate out file table handling code")
f4c163dd7d4b ("io_uring: split out fadvise/madvise operations")
0d5847274037 ("io_uring: split out fs related sync/fallocate functions")
531113bbd5bf ("io_uring: split out splice related operations")
11aeb71406dd ("io_uring: split out filesystem related operations")
e28683bdfc2f ("io_uring: move nop into its own file")
5e2a18d93fec ("io_uring: move xattr related opcodes to its own file")
97b388d70b53 ("io_uring: handle completions in the core")
de23077eda61 ("io_uring: set completion results upfront")
e27f928ee1cb ("io_uring: add io_uring_types.h")
4d4c9cff4f70 ("io_uring: define a request type cleanup handler")
890968dc0336 ("io_uring: unify struct io_symlink and io_hardlink")
9a3a11f977f9 ("io_uring: convert iouring_cmd to io_cmd_type")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3fb1bd68817288729179444caf1fd5c5c4d2d65d Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Tue, 4 Oct 2022 20:29:48 -0600
Subject: [PATCH] io_uring/net: handle -EINPROGRESS correct for
IORING_OP_CONNECT
We treat EINPROGRESS like EAGAIN, but if we're retrying post getting
EINPROGRESS, then we just need to check the socket for errors and
terminate the request.
This was exposed on a bluetooth connection request which ends up
taking a while and hitting EINPROGRESS, and yields a CQE result of
-EBADFD because we're retrying a connect on a socket that is now
connected.
Cc: stable(a)vger.kernel.org
Fixes: 87f80d623c6c ("io_uring: handle connect -EINPROGRESS like -EAGAIN")
Link: https://github.com/axboe/liburing/issues/671
Reported-by: Aidan Sun <aidansun05(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/net.c b/io_uring/net.c
index caa6a803cb72..8c7226b5bf41 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -46,6 +46,7 @@ struct io_connect {
struct file *file;
struct sockaddr __user *addr;
int addr_len;
+ bool in_progress;
};
struct io_sr_msg {
@@ -1386,6 +1387,7 @@ int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
conn->addr_len = READ_ONCE(sqe->addr2);
+ conn->in_progress = false;
return 0;
}
@@ -1397,6 +1399,16 @@ int io_connect(struct io_kiocb *req, unsigned int issue_flags)
int ret;
bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+ if (connect->in_progress) {
+ struct socket *socket;
+
+ ret = -ENOTSOCK;
+ socket = sock_from_file(req->file);
+ if (socket)
+ ret = sock_error(socket->sk);
+ goto out;
+ }
+
if (req_has_async_data(req)) {
io = req->async_data;
} else {
@@ -1413,13 +1425,17 @@ int io_connect(struct io_kiocb *req, unsigned int issue_flags)
ret = __sys_connect_file(req->file, &io->address,
connect->addr_len, file_flags);
if ((ret == -EAGAIN || ret == -EINPROGRESS) && force_nonblock) {
- if (req_has_async_data(req))
- return -EAGAIN;
- if (io_alloc_async_data(req)) {
- ret = -ENOMEM;
- goto out;
+ if (ret == -EINPROGRESS) {
+ connect->in_progress = true;
+ } else {
+ if (req_has_async_data(req))
+ return -EAGAIN;
+ if (io_alloc_async_data(req)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ memcpy(req->async_data, &__io, sizeof(__io));
}
- memcpy(req->async_data, &__io, sizeof(__io));
return -EAGAIN;
}
if (ret == -ERESTARTSYS)
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
b78870e7f415 ("mmc: sdhci-tegra: Use actual clock rate for SW tuning correction")
d618978dd4d3 ("mmc: sdhci-tegra: Add runtime PM and OPP support")
8048822bac01 ("sdhci: tegra: Add missing TMCLK for data timeout")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b78870e7f41534cc719c295d1f8809aca93aeeab Mon Sep 17 00:00:00 2001
From: Prathamesh Shete <pshete(a)nvidia.com>
Date: Thu, 6 Oct 2022 18:36:22 +0530
Subject: [PATCH] mmc: sdhci-tegra: Use actual clock rate for SW tuning
correction
Ensure tegra_host member "curr_clk_rate" holds the actual clock rate
instead of requested clock rate for proper use during tuning correction
algorithm. Actual clk rate may not be the same as the requested clk
frequency depending on the parent clock source set. Tuning correction
algorithm depends on certain parameters which are sensitive to current
clk rate. If the host clk is selected instead of the actual clock rate,
tuning correction algorithm may end up applying invalid correction,
which could result in errors
Fixes: ea8fc5953e8b ("mmc: tegra: update hw tuning process")
Signed-off-by: Aniruddha TVS Rao <anrao(a)nvidia.com>
Signed-off-by: Prathamesh Shete <pshete(a)nvidia.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Acked-by: Thierry Reding <treding(a)nvidia.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20221006130622.22900-4-pshete@nvidia.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/sdhci-tegra.c b/drivers/mmc/host/sdhci-tegra.c
index 2d2d8260c681..413925bce0ca 100644
--- a/drivers/mmc/host/sdhci-tegra.c
+++ b/drivers/mmc/host/sdhci-tegra.c
@@ -773,7 +773,7 @@ static void tegra_sdhci_set_clock(struct sdhci_host *host, unsigned int clock)
dev_err(dev, "failed to set clk rate to %luHz: %d\n",
host_clk, err);
- tegra_host->curr_clk_rate = host_clk;
+ tegra_host->curr_clk_rate = clk_get_rate(pltfm_host->clk);
if (tegra_host->ddr_signaling)
host->max_clk = host_clk;
else
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
b78870e7f415 ("mmc: sdhci-tegra: Use actual clock rate for SW tuning correction")
d618978dd4d3 ("mmc: sdhci-tegra: Add runtime PM and OPP support")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b78870e7f41534cc719c295d1f8809aca93aeeab Mon Sep 17 00:00:00 2001
From: Prathamesh Shete <pshete(a)nvidia.com>
Date: Thu, 6 Oct 2022 18:36:22 +0530
Subject: [PATCH] mmc: sdhci-tegra: Use actual clock rate for SW tuning
correction
Ensure tegra_host member "curr_clk_rate" holds the actual clock rate
instead of requested clock rate for proper use during tuning correction
algorithm. Actual clk rate may not be the same as the requested clk
frequency depending on the parent clock source set. Tuning correction
algorithm depends on certain parameters which are sensitive to current
clk rate. If the host clk is selected instead of the actual clock rate,
tuning correction algorithm may end up applying invalid correction,
which could result in errors
Fixes: ea8fc5953e8b ("mmc: tegra: update hw tuning process")
Signed-off-by: Aniruddha TVS Rao <anrao(a)nvidia.com>
Signed-off-by: Prathamesh Shete <pshete(a)nvidia.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Acked-by: Thierry Reding <treding(a)nvidia.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20221006130622.22900-4-pshete@nvidia.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/sdhci-tegra.c b/drivers/mmc/host/sdhci-tegra.c
index 2d2d8260c681..413925bce0ca 100644
--- a/drivers/mmc/host/sdhci-tegra.c
+++ b/drivers/mmc/host/sdhci-tegra.c
@@ -773,7 +773,7 @@ static void tegra_sdhci_set_clock(struct sdhci_host *host, unsigned int clock)
dev_err(dev, "failed to set clk rate to %luHz: %d\n",
host_clk, err);
- tegra_host->curr_clk_rate = host_clk;
+ tegra_host->curr_clk_rate = clk_get_rate(pltfm_host->clk);
if (tegra_host->ddr_signaling)
host->max_clk = host_clk;
else
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
b78870e7f415 ("mmc: sdhci-tegra: Use actual clock rate for SW tuning correction")
d618978dd4d3 ("mmc: sdhci-tegra: Add runtime PM and OPP support")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From b78870e7f41534cc719c295d1f8809aca93aeeab Mon Sep 17 00:00:00 2001
From: Prathamesh Shete <pshete(a)nvidia.com>
Date: Thu, 6 Oct 2022 18:36:22 +0530
Subject: [PATCH] mmc: sdhci-tegra: Use actual clock rate for SW tuning
correction
Ensure tegra_host member "curr_clk_rate" holds the actual clock rate
instead of requested clock rate for proper use during tuning correction
algorithm. Actual clk rate may not be the same as the requested clk
frequency depending on the parent clock source set. Tuning correction
algorithm depends on certain parameters which are sensitive to current
clk rate. If the host clk is selected instead of the actual clock rate,
tuning correction algorithm may end up applying invalid correction,
which could result in errors
Fixes: ea8fc5953e8b ("mmc: tegra: update hw tuning process")
Signed-off-by: Aniruddha TVS Rao <anrao(a)nvidia.com>
Signed-off-by: Prathamesh Shete <pshete(a)nvidia.com>
Acked-by: Adrian Hunter <adrian.hunter(a)intel.com>
Acked-by: Thierry Reding <treding(a)nvidia.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20221006130622.22900-4-pshete@nvidia.com
Signed-off-by: Ulf Hansson <ulf.hansson(a)linaro.org>
diff --git a/drivers/mmc/host/sdhci-tegra.c b/drivers/mmc/host/sdhci-tegra.c
index 2d2d8260c681..413925bce0ca 100644
--- a/drivers/mmc/host/sdhci-tegra.c
+++ b/drivers/mmc/host/sdhci-tegra.c
@@ -773,7 +773,7 @@ static void tegra_sdhci_set_clock(struct sdhci_host *host, unsigned int clock)
dev_err(dev, "failed to set clk rate to %luHz: %d\n",
host_clk, err);
- tegra_host->curr_clk_rate = host_clk;
+ tegra_host->curr_clk_rate = clk_get_rate(pltfm_host->clk);
if (tegra_host->ddr_signaling)
host->max_clk = host_clk;
else
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
455561fb618f ("can: kvaser_usb_leaf: Fix TX queue out of sync after restart")
7259124eac7d ("can: kvaser_usb: Split driver into kvaser_usb_core.c and kvaser_usb_leaf.c")
e0543f2479f8 ("can: kvaser_usb: Add SPDX GPL-2.0 license identifier")
2b049c150080 ("can: kvaser_usb: Fix typos")
6ba0b9294bca ("can: kvaser_usb: Improve logging messages")
7c4780146177 ("can: kvaser_usb: Refactor kvaser_usb_init_one()")
99ce1bc17462 ("can: kvaser_usb: Refactor kvaser_usb_get_endpoints()")
0e30619fd6fa ("can: kvaser_usb: Add pointer to struct usb_interface into struct kvaser_usb")
75d2b4c3e399 ("can: kvaser_usb: Replace USB timeout constants with one define")
f741f938556d ("can: kvaser_usb: Rename message/msg to command/cmd")
237572220121 ("can: kvaser_usb: Remove unused commands and defines")
deaa1c984be7 ("can: kvaser_usb: Remove unnecessary return")
ffbdd9172ee2 ("can: usb: Kconfig/Makefile: sort alphabetically")
6ee00865ffe4 ("can: kvaser_usb: Increase correct stats counter in kvaser_usb_rx_can_msg()")
6aa8d5945502 ("can: kvaser_usb: cancel urb on -EPIPE and -EPROTO")
8bd13bd522ff ("can: kvaser_usb: ratelimit errors if incomplete messages are received")
e84f44eb5523 ("can: kvaser_usb: Fix comparison bug in kvaser_usb_read_bulk_callback()")
435019b48033 ("can: kvaser_usb: free buf in error paths")
e1d2d1329a57 ("can: kvaser_usb: Ignore CMD_FLUSH_QUEUE_REPLY messages")
8f65a923e6b6 ("can: kvaser_usb: Correct return value in printout")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 455561fb618fde40558776b5b8435f9420f335db Mon Sep 17 00:00:00 2001
From: Anssi Hannula <anssi.hannula(a)bitwise.fi>
Date: Mon, 10 Oct 2022 17:08:28 +0200
Subject: [PATCH] can: kvaser_usb_leaf: Fix TX queue out of sync after restart
The TX queue seems to be implicitly flushed by the hardware during
bus-off or bus-off recovery, but the driver does not reset the TX
bookkeeping.
Despite not resetting TX bookkeeping the driver still re-enables TX
queue unconditionally, leading to "cannot find free context" /
NETDEV_TX_BUSY errors if the TX queue was full at bus-off time.
Fix that by resetting TX bookkeeping on CAN restart.
Tested with 0bfd:0124 Kvaser Mini PCI Express 2xHS FW 4.18.778.
Cc: stable(a)vger.kernel.org
Fixes: 080f40a6fa28 ("can: kvaser_usb: Add support for Kvaser CAN/USB devices")
Tested-by: Jimmy Assarsson <extja(a)kvaser.com>
Signed-off-by: Anssi Hannula <anssi.hannula(a)bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja(a)kvaser.com>
Link: https://lore.kernel.org/all/20221010150829.199676-4-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
diff --git a/drivers/net/can/usb/kvaser_usb/kvaser_usb.h b/drivers/net/can/usb/kvaser_usb/kvaser_usb.h
index 841da29cef93..f6c0938027ec 100644
--- a/drivers/net/can/usb/kvaser_usb/kvaser_usb.h
+++ b/drivers/net/can/usb/kvaser_usb/kvaser_usb.h
@@ -178,6 +178,8 @@ struct kvaser_usb_dev_cfg {
extern const struct kvaser_usb_dev_ops kvaser_usb_hydra_dev_ops;
extern const struct kvaser_usb_dev_ops kvaser_usb_leaf_dev_ops;
+void kvaser_usb_unlink_tx_urbs(struct kvaser_usb_net_priv *priv);
+
int kvaser_usb_recv_cmd(const struct kvaser_usb *dev, void *cmd, int len,
int *actual_len);
diff --git a/drivers/net/can/usb/kvaser_usb/kvaser_usb_core.c b/drivers/net/can/usb/kvaser_usb/kvaser_usb_core.c
index c2bce6773adc..e91648ed7386 100644
--- a/drivers/net/can/usb/kvaser_usb/kvaser_usb_core.c
+++ b/drivers/net/can/usb/kvaser_usb/kvaser_usb_core.c
@@ -477,7 +477,7 @@ static void kvaser_usb_reset_tx_urb_contexts(struct kvaser_usb_net_priv *priv)
/* This method might sleep. Do not call it in the atomic context
* of URB completions.
*/
-static void kvaser_usb_unlink_tx_urbs(struct kvaser_usb_net_priv *priv)
+void kvaser_usb_unlink_tx_urbs(struct kvaser_usb_net_priv *priv)
{
usb_kill_anchored_urbs(&priv->tx_submitted);
kvaser_usb_reset_tx_urb_contexts(priv);
diff --git a/drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c b/drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c
index 8e11cda85624..59c220ef3049 100644
--- a/drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c
+++ b/drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c
@@ -1426,6 +1426,8 @@ static int kvaser_usb_leaf_set_mode(struct net_device *netdev,
switch (mode) {
case CAN_MODE_START:
+ kvaser_usb_unlink_tx_urbs(priv);
+
err = kvaser_usb_leaf_simple_cmd_async(priv, CMD_START_CHIP);
if (err)
return err;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
455561fb618f ("can: kvaser_usb_leaf: Fix TX queue out of sync after restart")
7259124eac7d ("can: kvaser_usb: Split driver into kvaser_usb_core.c and kvaser_usb_leaf.c")
e0543f2479f8 ("can: kvaser_usb: Add SPDX GPL-2.0 license identifier")
2b049c150080 ("can: kvaser_usb: Fix typos")
6ba0b9294bca ("can: kvaser_usb: Improve logging messages")
7c4780146177 ("can: kvaser_usb: Refactor kvaser_usb_init_one()")
99ce1bc17462 ("can: kvaser_usb: Refactor kvaser_usb_get_endpoints()")
0e30619fd6fa ("can: kvaser_usb: Add pointer to struct usb_interface into struct kvaser_usb")
75d2b4c3e399 ("can: kvaser_usb: Replace USB timeout constants with one define")
f741f938556d ("can: kvaser_usb: Rename message/msg to command/cmd")
237572220121 ("can: kvaser_usb: Remove unused commands and defines")
deaa1c984be7 ("can: kvaser_usb: Remove unnecessary return")
ffbdd9172ee2 ("can: usb: Kconfig/Makefile: sort alphabetically")
6ee00865ffe4 ("can: kvaser_usb: Increase correct stats counter in kvaser_usb_rx_can_msg()")
6aa8d5945502 ("can: kvaser_usb: cancel urb on -EPIPE and -EPROTO")
8bd13bd522ff ("can: kvaser_usb: ratelimit errors if incomplete messages are received")
e84f44eb5523 ("can: kvaser_usb: Fix comparison bug in kvaser_usb_read_bulk_callback()")
435019b48033 ("can: kvaser_usb: free buf in error paths")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 455561fb618fde40558776b5b8435f9420f335db Mon Sep 17 00:00:00 2001
From: Anssi Hannula <anssi.hannula(a)bitwise.fi>
Date: Mon, 10 Oct 2022 17:08:28 +0200
Subject: [PATCH] can: kvaser_usb_leaf: Fix TX queue out of sync after restart
The TX queue seems to be implicitly flushed by the hardware during
bus-off or bus-off recovery, but the driver does not reset the TX
bookkeeping.
Despite not resetting TX bookkeeping the driver still re-enables TX
queue unconditionally, leading to "cannot find free context" /
NETDEV_TX_BUSY errors if the TX queue was full at bus-off time.
Fix that by resetting TX bookkeeping on CAN restart.
Tested with 0bfd:0124 Kvaser Mini PCI Express 2xHS FW 4.18.778.
Cc: stable(a)vger.kernel.org
Fixes: 080f40a6fa28 ("can: kvaser_usb: Add support for Kvaser CAN/USB devices")
Tested-by: Jimmy Assarsson <extja(a)kvaser.com>
Signed-off-by: Anssi Hannula <anssi.hannula(a)bitwise.fi>
Signed-off-by: Jimmy Assarsson <extja(a)kvaser.com>
Link: https://lore.kernel.org/all/20221010150829.199676-4-extja@kvaser.com
Signed-off-by: Marc Kleine-Budde <mkl(a)pengutronix.de>
diff --git a/drivers/net/can/usb/kvaser_usb/kvaser_usb.h b/drivers/net/can/usb/kvaser_usb/kvaser_usb.h
index 841da29cef93..f6c0938027ec 100644
--- a/drivers/net/can/usb/kvaser_usb/kvaser_usb.h
+++ b/drivers/net/can/usb/kvaser_usb/kvaser_usb.h
@@ -178,6 +178,8 @@ struct kvaser_usb_dev_cfg {
extern const struct kvaser_usb_dev_ops kvaser_usb_hydra_dev_ops;
extern const struct kvaser_usb_dev_ops kvaser_usb_leaf_dev_ops;
+void kvaser_usb_unlink_tx_urbs(struct kvaser_usb_net_priv *priv);
+
int kvaser_usb_recv_cmd(const struct kvaser_usb *dev, void *cmd, int len,
int *actual_len);
diff --git a/drivers/net/can/usb/kvaser_usb/kvaser_usb_core.c b/drivers/net/can/usb/kvaser_usb/kvaser_usb_core.c
index c2bce6773adc..e91648ed7386 100644
--- a/drivers/net/can/usb/kvaser_usb/kvaser_usb_core.c
+++ b/drivers/net/can/usb/kvaser_usb/kvaser_usb_core.c
@@ -477,7 +477,7 @@ static void kvaser_usb_reset_tx_urb_contexts(struct kvaser_usb_net_priv *priv)
/* This method might sleep. Do not call it in the atomic context
* of URB completions.
*/
-static void kvaser_usb_unlink_tx_urbs(struct kvaser_usb_net_priv *priv)
+void kvaser_usb_unlink_tx_urbs(struct kvaser_usb_net_priv *priv)
{
usb_kill_anchored_urbs(&priv->tx_submitted);
kvaser_usb_reset_tx_urb_contexts(priv);
diff --git a/drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c b/drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c
index 8e11cda85624..59c220ef3049 100644
--- a/drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c
+++ b/drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c
@@ -1426,6 +1426,8 @@ static int kvaser_usb_leaf_set_mode(struct net_device *netdev,
switch (mode) {
case CAN_MODE_START:
+ kvaser_usb_unlink_tx_urbs(priv);
+
err = kvaser_usb_leaf_simple_cmd_async(priv, CMD_START_CHIP);
if (err)
return err;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
365e1ececb29 ("hv_netvsc: Fix race between VF offering and VF association message from host")
d0922bf79817 ("hv_netvsc: Add error handling while switching data path")
34b06a2eee44 ("hv_netvsc: Process NETDEV_GOING_DOWN on VF hot remove")
8b31f8c982b7 ("hv_netvsc: Wait for completion on request SWITCH_DATA_PATH")
4d18fcc95f50 ("hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening")
e8b7db38449a ("Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus hardening")
4907a43da831 ("Merge tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 365e1ececb2905f94cc10a5817c5b644a32a3ae2 Mon Sep 17 00:00:00 2001
From: Gaurav Kohli <gauravkohli(a)linux.microsoft.com>
Date: Wed, 5 Oct 2022 22:52:59 -0700
Subject: [PATCH] hv_netvsc: Fix race between VF offering and VF association
message from host
During vm boot, there might be possibility that vf registration
call comes before the vf association from host to vm.
And this might break netvsc vf path, To prevent the same block
vf registration until vf bind message comes from host.
Cc: stable(a)vger.kernel.org
Fixes: 00d7ddba11436 ("hv_netvsc: pair VF based on serial number")
Reviewed-by: Haiyang Zhang <haiyangz(a)microsoft.com>
Signed-off-by: Gaurav Kohli <gauravkohli(a)linux.microsoft.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 25b38a374e3c..dd5919ec408b 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -1051,7 +1051,8 @@ struct net_device_context {
u32 vf_alloc;
/* Serial number of the VF to team with */
u32 vf_serial;
-
+ /* completion variable to confirm vf association */
+ struct completion vf_add;
/* Is the current data path through the VF NIC? */
bool data_path_is_vf;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index f066de0da492..9352dad58996 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1580,6 +1580,10 @@ static void netvsc_send_vf(struct net_device *ndev,
net_device_ctx->vf_alloc = nvmsg->msg.v4_msg.vf_assoc.allocated;
net_device_ctx->vf_serial = nvmsg->msg.v4_msg.vf_assoc.serial;
+
+ if (net_device_ctx->vf_alloc)
+ complete(&net_device_ctx->vf_add);
+
netdev_info(ndev, "VF slot %u %s\n",
net_device_ctx->vf_serial,
net_device_ctx->vf_alloc ? "added" : "removed");
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 5f08482065ca..89eb4f179a3c 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2313,6 +2313,18 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
}
+ /* Fallback path to check synthetic vf with
+ * help of mac addr
+ */
+ list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
+ ndev = hv_get_drvdata(ndev_ctx->device_ctx);
+ if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr)) {
+ netdev_notice(vf_netdev,
+ "falling back to mac addr based matching\n");
+ return ndev;
+ }
+ }
+
netdev_notice(vf_netdev,
"no netdev found for vf serial:%u\n", serial);
return NULL;
@@ -2409,6 +2421,11 @@ static int netvsc_vf_changed(struct net_device *vf_netdev, unsigned long event)
if (net_device_ctx->data_path_is_vf == vf_is_up)
return NOTIFY_OK;
+ if (vf_is_up && !net_device_ctx->vf_alloc) {
+ netdev_info(ndev, "Waiting for the VF association from host\n");
+ wait_for_completion(&net_device_ctx->vf_add);
+ }
+
ret = netvsc_switch_datapath(ndev, vf_is_up);
if (ret) {
@@ -2440,6 +2457,7 @@ static int netvsc_unregister_vf(struct net_device *vf_netdev)
netvsc_vf_setxdp(vf_netdev, NULL);
+ reinit_completion(&net_device_ctx->vf_add);
netdev_rx_handler_unregister(vf_netdev);
netdev_upper_dev_unlink(vf_netdev, ndev);
RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
@@ -2479,6 +2497,7 @@ static int netvsc_probe(struct hv_device *dev,
INIT_DELAYED_WORK(&net_device_ctx->dwork, netvsc_link_change);
+ init_completion(&net_device_ctx->vf_add);
spin_lock_init(&net_device_ctx->lock);
INIT_LIST_HEAD(&net_device_ctx->reconfig_events);
INIT_DELAYED_WORK(&net_device_ctx->vf_takeover, netvsc_vf_setup);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
365e1ececb29 ("hv_netvsc: Fix race between VF offering and VF association message from host")
d0922bf79817 ("hv_netvsc: Add error handling while switching data path")
34b06a2eee44 ("hv_netvsc: Process NETDEV_GOING_DOWN on VF hot remove")
8b31f8c982b7 ("hv_netvsc: Wait for completion on request SWITCH_DATA_PATH")
4d18fcc95f50 ("hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening")
e8b7db38449a ("Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus hardening")
4907a43da831 ("Merge tag 'hyperv-next-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 365e1ececb2905f94cc10a5817c5b644a32a3ae2 Mon Sep 17 00:00:00 2001
From: Gaurav Kohli <gauravkohli(a)linux.microsoft.com>
Date: Wed, 5 Oct 2022 22:52:59 -0700
Subject: [PATCH] hv_netvsc: Fix race between VF offering and VF association
message from host
During vm boot, there might be possibility that vf registration
call comes before the vf association from host to vm.
And this might break netvsc vf path, To prevent the same block
vf registration until vf bind message comes from host.
Cc: stable(a)vger.kernel.org
Fixes: 00d7ddba11436 ("hv_netvsc: pair VF based on serial number")
Reviewed-by: Haiyang Zhang <haiyangz(a)microsoft.com>
Signed-off-by: Gaurav Kohli <gauravkohli(a)linux.microsoft.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 25b38a374e3c..dd5919ec408b 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -1051,7 +1051,8 @@ struct net_device_context {
u32 vf_alloc;
/* Serial number of the VF to team with */
u32 vf_serial;
-
+ /* completion variable to confirm vf association */
+ struct completion vf_add;
/* Is the current data path through the VF NIC? */
bool data_path_is_vf;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index f066de0da492..9352dad58996 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1580,6 +1580,10 @@ static void netvsc_send_vf(struct net_device *ndev,
net_device_ctx->vf_alloc = nvmsg->msg.v4_msg.vf_assoc.allocated;
net_device_ctx->vf_serial = nvmsg->msg.v4_msg.vf_assoc.serial;
+
+ if (net_device_ctx->vf_alloc)
+ complete(&net_device_ctx->vf_add);
+
netdev_info(ndev, "VF slot %u %s\n",
net_device_ctx->vf_serial,
net_device_ctx->vf_alloc ? "added" : "removed");
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 5f08482065ca..89eb4f179a3c 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2313,6 +2313,18 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
}
+ /* Fallback path to check synthetic vf with
+ * help of mac addr
+ */
+ list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
+ ndev = hv_get_drvdata(ndev_ctx->device_ctx);
+ if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr)) {
+ netdev_notice(vf_netdev,
+ "falling back to mac addr based matching\n");
+ return ndev;
+ }
+ }
+
netdev_notice(vf_netdev,
"no netdev found for vf serial:%u\n", serial);
return NULL;
@@ -2409,6 +2421,11 @@ static int netvsc_vf_changed(struct net_device *vf_netdev, unsigned long event)
if (net_device_ctx->data_path_is_vf == vf_is_up)
return NOTIFY_OK;
+ if (vf_is_up && !net_device_ctx->vf_alloc) {
+ netdev_info(ndev, "Waiting for the VF association from host\n");
+ wait_for_completion(&net_device_ctx->vf_add);
+ }
+
ret = netvsc_switch_datapath(ndev, vf_is_up);
if (ret) {
@@ -2440,6 +2457,7 @@ static int netvsc_unregister_vf(struct net_device *vf_netdev)
netvsc_vf_setxdp(vf_netdev, NULL);
+ reinit_completion(&net_device_ctx->vf_add);
netdev_rx_handler_unregister(vf_netdev);
netdev_upper_dev_unlink(vf_netdev, ndev);
RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
@@ -2479,6 +2497,7 @@ static int netvsc_probe(struct hv_device *dev,
INIT_DELAYED_WORK(&net_device_ctx->dwork, netvsc_link_change);
+ init_completion(&net_device_ctx->vf_add);
spin_lock_init(&net_device_ctx->lock);
INIT_LIST_HEAD(&net_device_ctx->reconfig_events);
INIT_DELAYED_WORK(&net_device_ctx->vf_takeover, netvsc_vf_setup);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
365e1ececb29 ("hv_netvsc: Fix race between VF offering and VF association message from host")
d0922bf79817 ("hv_netvsc: Add error handling while switching data path")
34b06a2eee44 ("hv_netvsc: Process NETDEV_GOING_DOWN on VF hot remove")
8b31f8c982b7 ("hv_netvsc: Wait for completion on request SWITCH_DATA_PATH")
4d18fcc95f50 ("hv_netvsc: Use vmbus_requestor to generate transaction IDs for VMBus hardening")
e8b7db38449a ("Drivers: hv: vmbus: Add vmbus_requestor data structure for VMBus hardening")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 365e1ececb2905f94cc10a5817c5b644a32a3ae2 Mon Sep 17 00:00:00 2001
From: Gaurav Kohli <gauravkohli(a)linux.microsoft.com>
Date: Wed, 5 Oct 2022 22:52:59 -0700
Subject: [PATCH] hv_netvsc: Fix race between VF offering and VF association
message from host
During vm boot, there might be possibility that vf registration
call comes before the vf association from host to vm.
And this might break netvsc vf path, To prevent the same block
vf registration until vf bind message comes from host.
Cc: stable(a)vger.kernel.org
Fixes: 00d7ddba11436 ("hv_netvsc: pair VF based on serial number")
Reviewed-by: Haiyang Zhang <haiyangz(a)microsoft.com>
Signed-off-by: Gaurav Kohli <gauravkohli(a)linux.microsoft.com>
Signed-off-by: David S. Miller <davem(a)davemloft.net>
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 25b38a374e3c..dd5919ec408b 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -1051,7 +1051,8 @@ struct net_device_context {
u32 vf_alloc;
/* Serial number of the VF to team with */
u32 vf_serial;
-
+ /* completion variable to confirm vf association */
+ struct completion vf_add;
/* Is the current data path through the VF NIC? */
bool data_path_is_vf;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index f066de0da492..9352dad58996 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1580,6 +1580,10 @@ static void netvsc_send_vf(struct net_device *ndev,
net_device_ctx->vf_alloc = nvmsg->msg.v4_msg.vf_assoc.allocated;
net_device_ctx->vf_serial = nvmsg->msg.v4_msg.vf_assoc.serial;
+
+ if (net_device_ctx->vf_alloc)
+ complete(&net_device_ctx->vf_add);
+
netdev_info(ndev, "VF slot %u %s\n",
net_device_ctx->vf_serial,
net_device_ctx->vf_alloc ? "added" : "removed");
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 5f08482065ca..89eb4f179a3c 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -2313,6 +2313,18 @@ static struct net_device *get_netvsc_byslot(const struct net_device *vf_netdev)
}
+ /* Fallback path to check synthetic vf with
+ * help of mac addr
+ */
+ list_for_each_entry(ndev_ctx, &netvsc_dev_list, list) {
+ ndev = hv_get_drvdata(ndev_ctx->device_ctx);
+ if (ether_addr_equal(vf_netdev->perm_addr, ndev->perm_addr)) {
+ netdev_notice(vf_netdev,
+ "falling back to mac addr based matching\n");
+ return ndev;
+ }
+ }
+
netdev_notice(vf_netdev,
"no netdev found for vf serial:%u\n", serial);
return NULL;
@@ -2409,6 +2421,11 @@ static int netvsc_vf_changed(struct net_device *vf_netdev, unsigned long event)
if (net_device_ctx->data_path_is_vf == vf_is_up)
return NOTIFY_OK;
+ if (vf_is_up && !net_device_ctx->vf_alloc) {
+ netdev_info(ndev, "Waiting for the VF association from host\n");
+ wait_for_completion(&net_device_ctx->vf_add);
+ }
+
ret = netvsc_switch_datapath(ndev, vf_is_up);
if (ret) {
@@ -2440,6 +2457,7 @@ static int netvsc_unregister_vf(struct net_device *vf_netdev)
netvsc_vf_setxdp(vf_netdev, NULL);
+ reinit_completion(&net_device_ctx->vf_add);
netdev_rx_handler_unregister(vf_netdev);
netdev_upper_dev_unlink(vf_netdev, ndev);
RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
@@ -2479,6 +2497,7 @@ static int netvsc_probe(struct hv_device *dev,
INIT_DELAYED_WORK(&net_device_ctx->dwork, netvsc_link_change);
+ init_completion(&net_device_ctx->vf_add);
spin_lock_init(&net_device_ctx->lock);
INIT_LIST_HEAD(&net_device_ctx->reconfig_events);
INIT_DELAYED_WORK(&net_device_ctx->vf_takeover, netvsc_vf_setup);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
42b6419d0aba ("io_uring: correct pinned_vm accounting")
ed29b0b4fd83 ("io_uring: move to separate directory")
ab4094024784 ("io_uring: optimise rsrc referencing")
a46be971edb6 ("io_uring: optimise io_req_set_rsrc_node()")
d886e185a128 ("io_uring: control ->async_data with a REQ_F flag")
ef05d9ebcc92 ("io_uring: kill off ->inflight_entry field")
d4b7a5ef2b9c ("io_uring: inline completion batching helpers")
3aa83bfb6e5c ("io_uring: add a helper for batch free")
c2b6c6bc4e0d ("io_uring: replace list with stack for req caches")
3ab665b74e59 ("io_uring: remove allocation cache array")
6f33b0bc4ea4 ("io_uring: use slist for completion batching")
c450178d9be9 ("io_uring: dedup CQE flushing non-empty checks")
4b628aeb69cc ("io_uring: kill off ios_left")
4c928904ff77 ("block: move CONFIG_BLOCK guard to top Makefile")
ddf21bd8ab98 ("Merge tag 'iov_iter.3-5.15-2021-09-17' of git://git.kernel.dk/linux-block")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 42b6419d0aba47c5d8644cdc0b68502254671de5 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Tue, 4 Oct 2022 03:19:08 +0100
Subject: [PATCH] io_uring: correct pinned_vm accounting
->mm_account should be released only after we free all registered
buffers, otherwise __io_sqe_buffers_unregister() will see a NULL
->mm_account and skip locked_vm accounting.
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/6d798f65ed4ab8db3664c4d3397d4af16ca98846.16648499…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 63f6ce5e5355..ea5cee593bbd 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2585,12 +2585,6 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
{
io_sq_thread_finish(ctx);
-
- if (ctx->mm_account) {
- mmdrop(ctx->mm_account);
- ctx->mm_account = NULL;
- }
-
io_rsrc_refs_drop(ctx);
/* __io_rsrc_put_work() may need uring_lock to progress, wait w/o it */
io_wait_rsrc_data(ctx->buf_data);
@@ -2633,6 +2627,10 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
WARN_ON_ONCE(ctx->notif_slots || ctx->nr_notif_slots);
+ if (ctx->mm_account) {
+ mmdrop(ctx->mm_account);
+ ctx->mm_account = NULL;
+ }
io_mem_free(ctx->rings);
io_mem_free(ctx->sq_sqes);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
42b6419d0aba ("io_uring: correct pinned_vm accounting")
ed29b0b4fd83 ("io_uring: move to separate directory")
ab4094024784 ("io_uring: optimise rsrc referencing")
a46be971edb6 ("io_uring: optimise io_req_set_rsrc_node()")
d886e185a128 ("io_uring: control ->async_data with a REQ_F flag")
ef05d9ebcc92 ("io_uring: kill off ->inflight_entry field")
d4b7a5ef2b9c ("io_uring: inline completion batching helpers")
3aa83bfb6e5c ("io_uring: add a helper for batch free")
c2b6c6bc4e0d ("io_uring: replace list with stack for req caches")
3ab665b74e59 ("io_uring: remove allocation cache array")
6f33b0bc4ea4 ("io_uring: use slist for completion batching")
c450178d9be9 ("io_uring: dedup CQE flushing non-empty checks")
4b628aeb69cc ("io_uring: kill off ios_left")
4c928904ff77 ("block: move CONFIG_BLOCK guard to top Makefile")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 42b6419d0aba47c5d8644cdc0b68502254671de5 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Tue, 4 Oct 2022 03:19:08 +0100
Subject: [PATCH] io_uring: correct pinned_vm accounting
->mm_account should be released only after we free all registered
buffers, otherwise __io_sqe_buffers_unregister() will see a NULL
->mm_account and skip locked_vm accounting.
Cc: <Stable(a)vger.kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/6d798f65ed4ab8db3664c4d3397d4af16ca98846.16648499…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 63f6ce5e5355..ea5cee593bbd 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2585,12 +2585,6 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
{
io_sq_thread_finish(ctx);
-
- if (ctx->mm_account) {
- mmdrop(ctx->mm_account);
- ctx->mm_account = NULL;
- }
-
io_rsrc_refs_drop(ctx);
/* __io_rsrc_put_work() may need uring_lock to progress, wait w/o it */
io_wait_rsrc_data(ctx->buf_data);
@@ -2633,6 +2627,10 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
WARN_ON_ONCE(ctx->notif_slots || ctx->nr_notif_slots);
+ if (ctx->mm_account) {
+ mmdrop(ctx->mm_account);
+ ctx->mm_account = NULL;
+ }
io_mem_free(ctx->rings);
io_mem_free(ctx->sq_sqes);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
0091bfc81741 ("io_uring/af_unix: defer registered files gc to io_uring release")
735729844819 ("io_uring: move rsrc related data, core, and commands")
3b77495a9723 ("io_uring: split provided buffers handling into its own file")
7aaff708a768 ("io_uring: move cancelation into its own file")
329061d3e2f9 ("io_uring: move poll handling into its own file")
cfd22e6b3319 ("io_uring: add opcode name to io_op_defs")
92ac8beaea1f ("io_uring: include and forward-declaration sanitation")
c9f06aa7de15 ("io_uring: move io_uring_task (tctx) helpers into its own file")
a4ad4f748ea9 ("io_uring: move fdinfo helpers to its own file")
e5550a1447bf ("io_uring: use io_is_uring_fops() consistently")
17437f311490 ("io_uring: move SQPOLL related handling into its own file")
59915143e89f ("io_uring: move timeout opcodes and handling into its own file")
e418bbc97bff ("io_uring: move our reference counting into a header")
36404b09aa60 ("io_uring: move msg_ring into its own file")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
99f15d8d6136 ("io_uring: move uring_cmd handling to its own file")
cd40cae29ef8 ("io_uring: split out open/close operations")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0091bfc81741b8d3aeb3b7ab8636f911b2de6e80 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Mon, 3 Oct 2022 13:59:47 +0100
Subject: [PATCH] io_uring/af_unix: defer registered files gc to io_uring
release
Instead of putting io_uring's registered files in unix_gc() we want it
to be done by io_uring itself. The trick here is to consider io_uring
registered files for cycle detection but not actually putting them down.
Because io_uring can't register other ring instances, this will remove
all refs to the ring file triggering the ->release path and clean up
with io_ring_ctx_free().
Cc: stable(a)vger.kernel.org
Fixes: 6b06314c47e1 ("io_uring: add file set registration")
Reported-and-tested-by: David Bouman <dbouman03(a)gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)canonical.com>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9fcf534f2d92..7be5bb4c94b6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -803,6 +803,7 @@ typedef unsigned char *sk_buff_data_t;
* @csum_level: indicates the number of consecutive checksums found in
* the packet minus one that have been verified as
* CHECKSUM_UNNECESSARY (max 3)
+ * @scm_io_uring: SKB holds io_uring registered files
* @dst_pending_confirm: need to confirm neighbour
* @decrypted: Decrypted SKB
* @slow_gro: state present at GRO time, slower prepare step required
@@ -982,6 +983,7 @@ struct sk_buff {
#endif
__u8 slow_gro:1;
__u8 csum_not_inet:1;
+ __u8 scm_io_uring:1;
#ifdef CONFIG_NET_SCHED
__u16 tc_index; /* traffic control index */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 6f88ded0e7e5..012fdb04ec23 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -855,6 +855,7 @@ int __io_scm_file_account(struct io_ring_ctx *ctx, struct file *file)
UNIXCB(skb).fp = fpl;
skb->sk = sk;
+ skb->scm_io_uring = 1;
skb->destructor = unix_destruct_scm;
refcount_add(skb->truesize, &sk->sk_wmem_alloc);
}
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index d45d5366115a..dc2763540393 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -204,6 +204,7 @@ void wait_for_unix_gc(void)
/* The external entry point: unix_gc() */
void unix_gc(void)
{
+ struct sk_buff *next_skb, *skb;
struct unix_sock *u;
struct unix_sock *next;
struct sk_buff_head hitlist;
@@ -297,11 +298,30 @@ void unix_gc(void)
spin_unlock(&unix_gc_lock);
+ /* We need io_uring to clean its registered files, ignore all io_uring
+ * originated skbs. It's fine as io_uring doesn't keep references to
+ * other io_uring instances and so killing all other files in the cycle
+ * will put all io_uring references forcing it to go through normal
+ * release.path eventually putting registered files.
+ */
+ skb_queue_walk_safe(&hitlist, skb, next_skb) {
+ if (skb->scm_io_uring) {
+ __skb_unlink(skb, &hitlist);
+ skb_queue_tail(&skb->sk->sk_receive_queue, skb);
+ }
+ }
+
/* Here we are. Hitlist is filled. Die. */
__skb_queue_purge(&hitlist);
spin_lock(&unix_gc_lock);
+ /* There could be io_uring registered files, just push them back to
+ * the inflight list
+ */
+ list_for_each_entry_safe(u, next, &gc_candidates, link)
+ list_move_tail(&u->link, &gc_inflight_list);
+
/* All candidates should have been detached by now. */
BUG_ON(!list_empty(&gc_candidates));
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
0091bfc81741 ("io_uring/af_unix: defer registered files gc to io_uring release")
735729844819 ("io_uring: move rsrc related data, core, and commands")
3b77495a9723 ("io_uring: split provided buffers handling into its own file")
7aaff708a768 ("io_uring: move cancelation into its own file")
329061d3e2f9 ("io_uring: move poll handling into its own file")
cfd22e6b3319 ("io_uring: add opcode name to io_op_defs")
92ac8beaea1f ("io_uring: include and forward-declaration sanitation")
c9f06aa7de15 ("io_uring: move io_uring_task (tctx) helpers into its own file")
a4ad4f748ea9 ("io_uring: move fdinfo helpers to its own file")
e5550a1447bf ("io_uring: use io_is_uring_fops() consistently")
17437f311490 ("io_uring: move SQPOLL related handling into its own file")
59915143e89f ("io_uring: move timeout opcodes and handling into its own file")
e418bbc97bff ("io_uring: move our reference counting into a header")
36404b09aa60 ("io_uring: move msg_ring into its own file")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
99f15d8d6136 ("io_uring: move uring_cmd handling to its own file")
cd40cae29ef8 ("io_uring: split out open/close operations")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0091bfc81741b8d3aeb3b7ab8636f911b2de6e80 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Mon, 3 Oct 2022 13:59:47 +0100
Subject: [PATCH] io_uring/af_unix: defer registered files gc to io_uring
release
Instead of putting io_uring's registered files in unix_gc() we want it
to be done by io_uring itself. The trick here is to consider io_uring
registered files for cycle detection but not actually putting them down.
Because io_uring can't register other ring instances, this will remove
all refs to the ring file triggering the ->release path and clean up
with io_ring_ctx_free().
Cc: stable(a)vger.kernel.org
Fixes: 6b06314c47e1 ("io_uring: add file set registration")
Reported-and-tested-by: David Bouman <dbouman03(a)gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)canonical.com>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9fcf534f2d92..7be5bb4c94b6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -803,6 +803,7 @@ typedef unsigned char *sk_buff_data_t;
* @csum_level: indicates the number of consecutive checksums found in
* the packet minus one that have been verified as
* CHECKSUM_UNNECESSARY (max 3)
+ * @scm_io_uring: SKB holds io_uring registered files
* @dst_pending_confirm: need to confirm neighbour
* @decrypted: Decrypted SKB
* @slow_gro: state present at GRO time, slower prepare step required
@@ -982,6 +983,7 @@ struct sk_buff {
#endif
__u8 slow_gro:1;
__u8 csum_not_inet:1;
+ __u8 scm_io_uring:1;
#ifdef CONFIG_NET_SCHED
__u16 tc_index; /* traffic control index */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 6f88ded0e7e5..012fdb04ec23 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -855,6 +855,7 @@ int __io_scm_file_account(struct io_ring_ctx *ctx, struct file *file)
UNIXCB(skb).fp = fpl;
skb->sk = sk;
+ skb->scm_io_uring = 1;
skb->destructor = unix_destruct_scm;
refcount_add(skb->truesize, &sk->sk_wmem_alloc);
}
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index d45d5366115a..dc2763540393 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -204,6 +204,7 @@ void wait_for_unix_gc(void)
/* The external entry point: unix_gc() */
void unix_gc(void)
{
+ struct sk_buff *next_skb, *skb;
struct unix_sock *u;
struct unix_sock *next;
struct sk_buff_head hitlist;
@@ -297,11 +298,30 @@ void unix_gc(void)
spin_unlock(&unix_gc_lock);
+ /* We need io_uring to clean its registered files, ignore all io_uring
+ * originated skbs. It's fine as io_uring doesn't keep references to
+ * other io_uring instances and so killing all other files in the cycle
+ * will put all io_uring references forcing it to go through normal
+ * release.path eventually putting registered files.
+ */
+ skb_queue_walk_safe(&hitlist, skb, next_skb) {
+ if (skb->scm_io_uring) {
+ __skb_unlink(skb, &hitlist);
+ skb_queue_tail(&skb->sk->sk_receive_queue, skb);
+ }
+ }
+
/* Here we are. Hitlist is filled. Die. */
__skb_queue_purge(&hitlist);
spin_lock(&unix_gc_lock);
+ /* There could be io_uring registered files, just push them back to
+ * the inflight list
+ */
+ list_for_each_entry_safe(u, next, &gc_candidates, link)
+ list_move_tail(&u->link, &gc_inflight_list);
+
/* All candidates should have been detached by now. */
BUG_ON(!list_empty(&gc_candidates));
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
0091bfc81741 ("io_uring/af_unix: defer registered files gc to io_uring release")
735729844819 ("io_uring: move rsrc related data, core, and commands")
3b77495a9723 ("io_uring: split provided buffers handling into its own file")
7aaff708a768 ("io_uring: move cancelation into its own file")
329061d3e2f9 ("io_uring: move poll handling into its own file")
cfd22e6b3319 ("io_uring: add opcode name to io_op_defs")
92ac8beaea1f ("io_uring: include and forward-declaration sanitation")
c9f06aa7de15 ("io_uring: move io_uring_task (tctx) helpers into its own file")
a4ad4f748ea9 ("io_uring: move fdinfo helpers to its own file")
e5550a1447bf ("io_uring: use io_is_uring_fops() consistently")
17437f311490 ("io_uring: move SQPOLL related handling into its own file")
59915143e89f ("io_uring: move timeout opcodes and handling into its own file")
e418bbc97bff ("io_uring: move our reference counting into a header")
36404b09aa60 ("io_uring: move msg_ring into its own file")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
99f15d8d6136 ("io_uring: move uring_cmd handling to its own file")
cd40cae29ef8 ("io_uring: split out open/close operations")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 0091bfc81741b8d3aeb3b7ab8636f911b2de6e80 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Mon, 3 Oct 2022 13:59:47 +0100
Subject: [PATCH] io_uring/af_unix: defer registered files gc to io_uring
release
Instead of putting io_uring's registered files in unix_gc() we want it
to be done by io_uring itself. The trick here is to consider io_uring
registered files for cycle detection but not actually putting them down.
Because io_uring can't register other ring instances, this will remove
all refs to the ring file triggering the ->release path and clean up
with io_ring_ctx_free().
Cc: stable(a)vger.kernel.org
Fixes: 6b06314c47e1 ("io_uring: add file set registration")
Reported-and-tested-by: David Bouman <dbouman03(a)gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo(a)canonical.com>
[axboe: add kerneldoc comment to skb, fold in skb leak fix]
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9fcf534f2d92..7be5bb4c94b6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -803,6 +803,7 @@ typedef unsigned char *sk_buff_data_t;
* @csum_level: indicates the number of consecutive checksums found in
* the packet minus one that have been verified as
* CHECKSUM_UNNECESSARY (max 3)
+ * @scm_io_uring: SKB holds io_uring registered files
* @dst_pending_confirm: need to confirm neighbour
* @decrypted: Decrypted SKB
* @slow_gro: state present at GRO time, slower prepare step required
@@ -982,6 +983,7 @@ struct sk_buff {
#endif
__u8 slow_gro:1;
__u8 csum_not_inet:1;
+ __u8 scm_io_uring:1;
#ifdef CONFIG_NET_SCHED
__u16 tc_index; /* traffic control index */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 6f88ded0e7e5..012fdb04ec23 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -855,6 +855,7 @@ int __io_scm_file_account(struct io_ring_ctx *ctx, struct file *file)
UNIXCB(skb).fp = fpl;
skb->sk = sk;
+ skb->scm_io_uring = 1;
skb->destructor = unix_destruct_scm;
refcount_add(skb->truesize, &sk->sk_wmem_alloc);
}
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index d45d5366115a..dc2763540393 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -204,6 +204,7 @@ void wait_for_unix_gc(void)
/* The external entry point: unix_gc() */
void unix_gc(void)
{
+ struct sk_buff *next_skb, *skb;
struct unix_sock *u;
struct unix_sock *next;
struct sk_buff_head hitlist;
@@ -297,11 +298,30 @@ void unix_gc(void)
spin_unlock(&unix_gc_lock);
+ /* We need io_uring to clean its registered files, ignore all io_uring
+ * originated skbs. It's fine as io_uring doesn't keep references to
+ * other io_uring instances and so killing all other files in the cycle
+ * will put all io_uring references forcing it to go through normal
+ * release.path eventually putting registered files.
+ */
+ skb_queue_walk_safe(&hitlist, skb, next_skb) {
+ if (skb->scm_io_uring) {
+ __skb_unlink(skb, &hitlist);
+ skb_queue_tail(&skb->sk->sk_receive_queue, skb);
+ }
+ }
+
/* Here we are. Hitlist is filled. Die. */
__skb_queue_purge(&hitlist);
spin_lock(&unix_gc_lock);
+ /* There could be io_uring registered files, just push them back to
+ * the inflight list
+ */
+ list_for_each_entry_safe(u, next, &gc_candidates, link)
+ list_move_tail(&u->link, &gc_inflight_list);
+
/* All candidates should have been detached by now. */
BUG_ON(!list_empty(&gc_candidates));
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
99f15d8d6136 ("io_uring: move uring_cmd handling to its own file")
cd40cae29ef8 ("io_uring: split out open/close operations")
453b329be5ea ("io_uring: separate out file table handling code")
f4c163dd7d4b ("io_uring: split out fadvise/madvise operations")
0d5847274037 ("io_uring: split out fs related sync/fallocate functions")
531113bbd5bf ("io_uring: split out splice related operations")
11aeb71406dd ("io_uring: split out filesystem related operations")
e28683bdfc2f ("io_uring: move nop into its own file")
5e2a18d93fec ("io_uring: move xattr related opcodes to its own file")
97b388d70b53 ("io_uring: handle completions in the core")
de23077eda61 ("io_uring: set completion results upfront")
e27f928ee1cb ("io_uring: add io_uring_types.h")
4d4c9cff4f70 ("io_uring: define a request type cleanup handler")
890968dc0336 ("io_uring: unify struct io_symlink and io_hardlink")
9a3a11f977f9 ("io_uring: convert iouring_cmd to io_cmd_type")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3fb1bd68817288729179444caf1fd5c5c4d2d65d Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Tue, 4 Oct 2022 20:29:48 -0600
Subject: [PATCH] io_uring/net: handle -EINPROGRESS correct for
IORING_OP_CONNECT
We treat EINPROGRESS like EAGAIN, but if we're retrying post getting
EINPROGRESS, then we just need to check the socket for errors and
terminate the request.
This was exposed on a bluetooth connection request which ends up
taking a while and hitting EINPROGRESS, and yields a CQE result of
-EBADFD because we're retrying a connect on a socket that is now
connected.
Cc: stable(a)vger.kernel.org
Fixes: 87f80d623c6c ("io_uring: handle connect -EINPROGRESS like -EAGAIN")
Link: https://github.com/axboe/liburing/issues/671
Reported-by: Aidan Sun <aidansun05(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/net.c b/io_uring/net.c
index caa6a803cb72..8c7226b5bf41 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -46,6 +46,7 @@ struct io_connect {
struct file *file;
struct sockaddr __user *addr;
int addr_len;
+ bool in_progress;
};
struct io_sr_msg {
@@ -1386,6 +1387,7 @@ int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
conn->addr_len = READ_ONCE(sqe->addr2);
+ conn->in_progress = false;
return 0;
}
@@ -1397,6 +1399,16 @@ int io_connect(struct io_kiocb *req, unsigned int issue_flags)
int ret;
bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+ if (connect->in_progress) {
+ struct socket *socket;
+
+ ret = -ENOTSOCK;
+ socket = sock_from_file(req->file);
+ if (socket)
+ ret = sock_error(socket->sk);
+ goto out;
+ }
+
if (req_has_async_data(req)) {
io = req->async_data;
} else {
@@ -1413,13 +1425,17 @@ int io_connect(struct io_kiocb *req, unsigned int issue_flags)
ret = __sys_connect_file(req->file, &io->address,
connect->addr_len, file_flags);
if ((ret == -EAGAIN || ret == -EINPROGRESS) && force_nonblock) {
- if (req_has_async_data(req))
- return -EAGAIN;
- if (io_alloc_async_data(req)) {
- ret = -ENOMEM;
- goto out;
+ if (ret == -EINPROGRESS) {
+ connect->in_progress = true;
+ } else {
+ if (req_has_async_data(req))
+ return -EAGAIN;
+ if (io_alloc_async_data(req)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ memcpy(req->async_data, &__io, sizeof(__io));
}
- memcpy(req->async_data, &__io, sizeof(__io));
return -EAGAIN;
}
if (ret == -ERESTARTSYS)
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
f9ead18c1058 ("io_uring: split network related opcodes into its own file")
e0da14def1ee ("io_uring: move statx handling to its own file")
a9c210cebe13 ("io_uring: move epoll handler to its own file")
4cf90495281b ("io_uring: add a dummy -EOPNOTSUPP prep handler")
99f15d8d6136 ("io_uring: move uring_cmd handling to its own file")
cd40cae29ef8 ("io_uring: split out open/close operations")
453b329be5ea ("io_uring: separate out file table handling code")
f4c163dd7d4b ("io_uring: split out fadvise/madvise operations")
0d5847274037 ("io_uring: split out fs related sync/fallocate functions")
531113bbd5bf ("io_uring: split out splice related operations")
11aeb71406dd ("io_uring: split out filesystem related operations")
e28683bdfc2f ("io_uring: move nop into its own file")
5e2a18d93fec ("io_uring: move xattr related opcodes to its own file")
97b388d70b53 ("io_uring: handle completions in the core")
de23077eda61 ("io_uring: set completion results upfront")
e27f928ee1cb ("io_uring: add io_uring_types.h")
4d4c9cff4f70 ("io_uring: define a request type cleanup handler")
890968dc0336 ("io_uring: unify struct io_symlink and io_hardlink")
9a3a11f977f9 ("io_uring: convert iouring_cmd to io_cmd_type")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 3fb1bd68817288729179444caf1fd5c5c4d2d65d Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Tue, 4 Oct 2022 20:29:48 -0600
Subject: [PATCH] io_uring/net: handle -EINPROGRESS correct for
IORING_OP_CONNECT
We treat EINPROGRESS like EAGAIN, but if we're retrying post getting
EINPROGRESS, then we just need to check the socket for errors and
terminate the request.
This was exposed on a bluetooth connection request which ends up
taking a while and hitting EINPROGRESS, and yields a CQE result of
-EBADFD because we're retrying a connect on a socket that is now
connected.
Cc: stable(a)vger.kernel.org
Fixes: 87f80d623c6c ("io_uring: handle connect -EINPROGRESS like -EAGAIN")
Link: https://github.com/axboe/liburing/issues/671
Reported-by: Aidan Sun <aidansun05(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/net.c b/io_uring/net.c
index caa6a803cb72..8c7226b5bf41 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -46,6 +46,7 @@ struct io_connect {
struct file *file;
struct sockaddr __user *addr;
int addr_len;
+ bool in_progress;
};
struct io_sr_msg {
@@ -1386,6 +1387,7 @@ int io_connect_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
conn->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
conn->addr_len = READ_ONCE(sqe->addr2);
+ conn->in_progress = false;
return 0;
}
@@ -1397,6 +1399,16 @@ int io_connect(struct io_kiocb *req, unsigned int issue_flags)
int ret;
bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+ if (connect->in_progress) {
+ struct socket *socket;
+
+ ret = -ENOTSOCK;
+ socket = sock_from_file(req->file);
+ if (socket)
+ ret = sock_error(socket->sk);
+ goto out;
+ }
+
if (req_has_async_data(req)) {
io = req->async_data;
} else {
@@ -1413,13 +1425,17 @@ int io_connect(struct io_kiocb *req, unsigned int issue_flags)
ret = __sys_connect_file(req->file, &io->address,
connect->addr_len, file_flags);
if ((ret == -EAGAIN || ret == -EINPROGRESS) && force_nonblock) {
- if (req_has_async_data(req))
- return -EAGAIN;
- if (io_alloc_async_data(req)) {
- ret = -ENOMEM;
- goto out;
+ if (ret == -EINPROGRESS) {
+ connect->in_progress = true;
+ } else {
+ if (req_has_async_data(req))
+ return -EAGAIN;
+ if (io_alloc_async_data(req)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ memcpy(req->async_data, &__io, sizeof(__io));
}
- memcpy(req->async_data, &__io, sizeof(__io));
return -EAGAIN;
}
if (ret == -ERESTARTSYS)
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
108893ddcc4d ("io_uring/net: fix notif cqe reordering")
6ae91ac9a6aa ("io_uring/net: don't skip notifs for failed requests")
a75155faef4e ("io_uring/net: fix UAF in io_sendrecv_fail()")
493108d95f14 ("io_uring/net: zerocopy sendmsg")
c4c0009e0b56 ("io_uring/net: combine fail handlers")
b0e9b5517eb1 ("io_uring/net: rename io_sendzc()")
516e82f0e043 ("io_uring/net: support non-zerocopy sendto")
6ae61b7aa2c7 ("io_uring/net: refactor io_setup_async_addr")
5693bcce892d ("io_uring/net: don't lose partial send_zc on fail")
7e6b638ed501 ("io_uring/net: don't lose partial send/recv on fail")
ac9e5784bbe7 ("io_uring/net: use io_sr_msg for sendzc")
0b048557db76 ("io_uring/net: refactor io_sr_msg types")
6bf8ad25fcd4 ("io_uring/net: io_async_msghdr caches for sendzc")
95eafc74be5e ("io_uring/net: reshuffle error handling")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 108893ddcc4d3aa0a4a02aeb02d478e997001227 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Thu, 29 Sep 2022 22:23:19 +0100
Subject: [PATCH] io_uring/net: fix notif cqe reordering
send zc is not restricted to !IO_URING_F_UNLOCKED anymore and so
we can't use task-tw ordering trick to order notification cqes
with requests completions. In this case leave it alone and let
io_send_zc_cleanup() flush it.
Cc: stable(a)vger.kernel.org
Fixes: 53bdc88aac9a2 ("io_uring/notif: order notif vs send CQEs")
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Link: https://lore.kernel.org/r/0031f3a00d492e814a4a0935a2029a46d9c9ba06.16644865…
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/net.c b/io_uring/net.c
index 604eac5f7a34..caa6a803cb72 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1127,8 +1127,14 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
else if (zc->done_io)
ret = zc->done_io;
- io_notif_flush(zc->notif);
- req->flags &= ~REQ_F_NEED_CLEANUP;
+ /*
+ * If we're in io-wq we can't rely on tw ordering guarantees, defer
+ * flushing notif to io_send_zc_cleanup()
+ */
+ if (!(issue_flags & IO_URING_F_UNLOCKED)) {
+ io_notif_flush(zc->notif);
+ req->flags &= ~REQ_F_NEED_CLEANUP;
+ }
io_req_set_res(req, ret, IORING_CQE_F_MORE);
return IOU_OK;
}
@@ -1182,8 +1188,10 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
req_set_fail(req);
}
/* fast path, check for non-NULL to avoid function call */
- if (kmsg->free_iov)
+ if (kmsg->free_iov) {
kfree(kmsg->free_iov);
+ kmsg->free_iov = NULL;
+ }
io_netmsg_recycle(req, issue_flags);
if (ret >= 0)
@@ -1191,8 +1199,14 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
else if (sr->done_io)
ret = sr->done_io;
- io_notif_flush(sr->notif);
- req->flags &= ~REQ_F_NEED_CLEANUP;
+ /*
+ * If we're in io-wq we can't rely on tw ordering guarantees, defer
+ * flushing notif to io_send_zc_cleanup()
+ */
+ if (!(issue_flags & IO_URING_F_UNLOCKED)) {
+ io_notif_flush(sr->notif);
+ req->flags &= ~REQ_F_NEED_CLEANUP;
+ }
io_req_set_res(req, ret, IORING_CQE_F_MORE);
return IOU_OK;
}
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
46a525e199e4 ("io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL")
c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
b4c98d59a787 ("io_uring: introduce io_has_work")
78a861b94959 ("io_uring: add sync cancelation API through io_uring_register()")
c34398a8c018 ("io_uring: remove __io_req_task_work_add")
ed5ccb3beeba ("io_uring: remove priority tw list optimisation")
625d38b3fd34 ("io_uring: improve io_run_task_work()")
4a0fef62788b ("io_uring: optimize io_uring_task layout")
253993210bd8 ("io_uring: introduce locking helpers for CQE posting")
305bef988708 ("io_uring: hide eventfd assumptions in eventfd paths")
affa87db9010 ("io_uring: fix multi ctx cancellation")
d9dee4302a7c ("io_uring: remove ->flush_cqes optimisation")
a830ffd28780 ("io_uring: move io_eventfd_signal()")
9046c6415be6 ("io_uring: reshuffle io_uring/io_uring.h")
d142c3ec8d16 ("io_uring: remove extra io_commit_cqring()")
68494a65d0e2 ("io_uring: introduce io_req_cqe_overflow()")
faf88dde060f ("io_uring: don't inline __io_get_cqe()")
d245bca6375b ("io_uring: don't expose io_fill_cqe_aux()")
9ca9fb24d5fe ("io_uring: mutex locked poll hashing")
e6f89be61410 ("io_uring: introduce a struct for hash table")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 46a525e199e4037516f7e498c18f065b09df32ac Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 29 Sep 2022 15:29:13 -0600
Subject: [PATCH] io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
This isn't a reliable mechanism to tell if we have task_work pending, we
really should be looking at whether we have any items queued. This is
problematic if forward progress is gated on running said task_work. One
such example is reading from a pipe, where the write side has been closed
right before the read is started. The fput() of the file queues TWA_RESUME
task_work, and we need that task_work to be run before ->release() is
called for the pipe. If ->release() isn't called, then the read will sit
forever waiting on data that will never arise.
Fix this by io_run_task_work() so it checks if we have task_work pending
rather than rely on TIF_NOTIFY_SIGNAL for that. The latter obviously
doesn't work for task_work that is queued without TWA_SIGNAL.
Reported-by: Christiano Haesbaert <haesbaert(a)haesbaert.org>
Cc: stable(a)vger.kernel.org
Link: https://github.com/axboe/liburing/issues/665
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 177bd55357d7..48ce2348c8c1 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -231,11 +231,11 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
static inline int io_run_task_work(void)
{
- if (test_thread_flag(TIF_NOTIFY_SIGNAL)) {
+ if (task_work_pending(current)) {
+ if (test_thread_flag(TIF_NOTIFY_SIGNAL))
+ clear_notify_signal();
__set_current_state(TASK_RUNNING);
- clear_notify_signal();
- if (task_work_pending(current))
- task_work_run();
+ task_work_run();
return 1;
}
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
46a525e199e4 ("io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL")
c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
b4c98d59a787 ("io_uring: introduce io_has_work")
78a861b94959 ("io_uring: add sync cancelation API through io_uring_register()")
c34398a8c018 ("io_uring: remove __io_req_task_work_add")
ed5ccb3beeba ("io_uring: remove priority tw list optimisation")
625d38b3fd34 ("io_uring: improve io_run_task_work()")
4a0fef62788b ("io_uring: optimize io_uring_task layout")
253993210bd8 ("io_uring: introduce locking helpers for CQE posting")
305bef988708 ("io_uring: hide eventfd assumptions in eventfd paths")
affa87db9010 ("io_uring: fix multi ctx cancellation")
d9dee4302a7c ("io_uring: remove ->flush_cqes optimisation")
a830ffd28780 ("io_uring: move io_eventfd_signal()")
9046c6415be6 ("io_uring: reshuffle io_uring/io_uring.h")
d142c3ec8d16 ("io_uring: remove extra io_commit_cqring()")
68494a65d0e2 ("io_uring: introduce io_req_cqe_overflow()")
faf88dde060f ("io_uring: don't inline __io_get_cqe()")
d245bca6375b ("io_uring: don't expose io_fill_cqe_aux()")
9ca9fb24d5fe ("io_uring: mutex locked poll hashing")
e6f89be61410 ("io_uring: introduce a struct for hash table")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 46a525e199e4037516f7e498c18f065b09df32ac Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 29 Sep 2022 15:29:13 -0600
Subject: [PATCH] io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
This isn't a reliable mechanism to tell if we have task_work pending, we
really should be looking at whether we have any items queued. This is
problematic if forward progress is gated on running said task_work. One
such example is reading from a pipe, where the write side has been closed
right before the read is started. The fput() of the file queues TWA_RESUME
task_work, and we need that task_work to be run before ->release() is
called for the pipe. If ->release() isn't called, then the read will sit
forever waiting on data that will never arise.
Fix this by io_run_task_work() so it checks if we have task_work pending
rather than rely on TIF_NOTIFY_SIGNAL for that. The latter obviously
doesn't work for task_work that is queued without TWA_SIGNAL.
Reported-by: Christiano Haesbaert <haesbaert(a)haesbaert.org>
Cc: stable(a)vger.kernel.org
Link: https://github.com/axboe/liburing/issues/665
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 177bd55357d7..48ce2348c8c1 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -231,11 +231,11 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
static inline int io_run_task_work(void)
{
- if (test_thread_flag(TIF_NOTIFY_SIGNAL)) {
+ if (task_work_pending(current)) {
+ if (test_thread_flag(TIF_NOTIFY_SIGNAL))
+ clear_notify_signal();
__set_current_state(TASK_RUNNING);
- clear_notify_signal();
- if (task_work_pending(current))
- task_work_run();
+ task_work_run();
return 1;
}
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
46a525e199e4 ("io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL")
c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
b4c98d59a787 ("io_uring: introduce io_has_work")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 46a525e199e4037516f7e498c18f065b09df32ac Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Thu, 29 Sep 2022 15:29:13 -0600
Subject: [PATCH] io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
This isn't a reliable mechanism to tell if we have task_work pending, we
really should be looking at whether we have any items queued. This is
problematic if forward progress is gated on running said task_work. One
such example is reading from a pipe, where the write side has been closed
right before the read is started. The fput() of the file queues TWA_RESUME
task_work, and we need that task_work to be run before ->release() is
called for the pipe. If ->release() isn't called, then the read will sit
forever waiting on data that will never arise.
Fix this by io_run_task_work() so it checks if we have task_work pending
rather than rely on TIF_NOTIFY_SIGNAL for that. The latter obviously
doesn't work for task_work that is queued without TWA_SIGNAL.
Reported-by: Christiano Haesbaert <haesbaert(a)haesbaert.org>
Cc: stable(a)vger.kernel.org
Link: https://github.com/axboe/liburing/issues/665
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 177bd55357d7..48ce2348c8c1 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -231,11 +231,11 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
static inline int io_run_task_work(void)
{
- if (test_thread_flag(TIF_NOTIFY_SIGNAL)) {
+ if (task_work_pending(current)) {
+ if (test_thread_flag(TIF_NOTIFY_SIGNAL))
+ clear_notify_signal();
__set_current_state(TASK_RUNNING);
- clear_notify_signal();
- if (task_work_pending(current))
- task_work_run();
+ task_work_run();
return 1;
}
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
1161703c9bd6 ("mtd: rawnand: atmel: Unmap streaming DMA mappings")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 1161703c9bd664da5e3b2eb1a3bb40c210e026ea Mon Sep 17 00:00:00 2001
From: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Date: Thu, 28 Jul 2022 10:40:14 +0300
Subject: [PATCH] mtd: rawnand: atmel: Unmap streaming DMA mappings
Every dma_map_single() call should have its dma_unmap_single() counterpart,
because the DMA address space is a shared resource and one could render the
machine unusable by consuming all DMA addresses.
Link: https://lore.kernel.org/lkml/13c6c9a2-6db5-c3bf-349b-4c127ad3496a@axentia.s…
Cc: stable(a)vger.kernel.org
Fixes: f88fc122cc34 ("mtd: nand: Cleanup/rework the atmel_nand driver")
Signed-off-by: Tudor Ambarus <tudor.ambarus(a)microchip.com>
Acked-by: Alexander Dahl <ada(a)thorsis.com>
Reported-by: Peter Rosin <peda(a)axentia.se>
Tested-by: Alexander Dahl <ada(a)thorsis.com>
Reviewed-by: Boris Brezillon <boris.brezillon(a)collabora.com>
Tested-by: Peter Rosin <peda(a)axentia.se>
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Link: https://lore.kernel.org/linux-mtd/20220728074014.145406-1-tudor.ambarus@mic…
diff --git a/drivers/mtd/nand/raw/atmel/nand-controller.c b/drivers/mtd/nand/raw/atmel/nand-controller.c
index c9ac3baf68c0..41c6bd6e2d72 100644
--- a/drivers/mtd/nand/raw/atmel/nand-controller.c
+++ b/drivers/mtd/nand/raw/atmel/nand-controller.c
@@ -405,6 +405,7 @@ static int atmel_nand_dma_transfer(struct atmel_nand_controller *nc,
dma_async_issue_pending(nc->dmac);
wait_for_completion(&finished);
+ dma_unmap_single(nc->dev, buf_dma, len, dir);
return 0;
--
Schönen Tag,
Ich bin Herr Richard Wahl, Sie haben eine Spende von 700.000,00 €. Ich
habe ein gewonnen
Glück in der Power-Ball-Lotterie und ich spende einen Teil davon an Ten
Lucky People und Ten Charity Organisation. Ihre E-Mail kam heraus
siegreich, also antworte mir dringend für weitere Informationen unter:
richardwahl9035(a)gmail.com
Aufrichtig,
Herr Richard Wah7
From: David Woodhouse <dwmw2(a)infradead.org>
commit 7e2175ebd695f17860c5bd4ad7616cce12ed4591 upstream.
In commit b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is
not missed") we switched to using a gfn_to_pfn_cache for accessing the
guest steal time structure in order to allow for an atomic xchg of the
preempted field. This has a couple of problems.
Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the
atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a
guest vCPU using an IOMEM page for its steal time would never have its
preempted field set.
Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it
should have been. There are two stages to the GFN->PFN conversion;
first the GFN is converted to a userspace HVA, and then that HVA is
looked up in the process page tables to find the underlying host PFN.
Correct invalidation of the latter would require being hooked up to the
MMU notifiers, but that doesn't happen---so it just keeps mapping and
unmapping the *wrong* PFN after the userspace page tables change.
In the !IOMEM case at least the stale page *is* pinned all the time it's
cached, so it won't be freed and reused by anyone else while still
receiving the steal time updates. The map/unmap dance only takes care
of the KVM administrivia such as marking the page dirty.
Until the gfn_to_pfn cache handles the remapping automatically by
integrating with the MMU notifiers, we might as well not get a
kernel mapping of it, and use the perfectly serviceable userspace HVA
that we already have. We just need to implement the atomic xchg on
the userspace address with appropriate exception handling, which is
fairly trivial.
Cc: stable(a)vger.kernel.org
Fixes: b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed")
Signed-off-by: David Woodhouse <dwmw(a)amazon.co.uk>
Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel(a)infradead.org>
[I didn't entirely agree with David's assessment of the
usefulness of the gfn_to_pfn cache, and integrated the outcome
of the discussion in the above commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
[risbhat(a)amazon.com: Use the older mark_page_dirty_in_slot api without
kvm argument]
Signed-off-by: Rishabh Bhatnagar <risbhat(a)amazon.com>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/x86.c | 105 +++++++++++++++++++++++---------
2 files changed, 76 insertions(+), 31 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 38c63a78aba6..2b35f8139f15 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -663,7 +663,7 @@ struct kvm_vcpu_arch {
u8 preempted;
u64 msr_val;
u64 last_steal;
- struct gfn_to_pfn_cache cache;
+ struct gfn_to_hva_cache cache;
} st;
u64 l1_tsc_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3640b298c42e..cd1e6710bc33 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3013,53 +3013,92 @@ static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
static void record_steal_time(struct kvm_vcpu *vcpu)
{
- struct kvm_host_map map;
- struct kvm_steal_time *st;
+ struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
+ struct kvm_steal_time __user *st;
+ struct kvm_memslots *slots;
+ u64 steal;
+ u32 version;
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;
- /* -EAGAIN is returned in atomic context so we can just return. */
- if (kvm_map_gfn(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT,
- &map, &vcpu->arch.st.cache, false))
+ if (WARN_ON_ONCE(current->mm != vcpu->kvm->mm))
return;
- st = map.hva +
- offset_in_page(vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS);
+ slots = kvm_memslots(vcpu->kvm);
+
+ if (unlikely(slots->generation != ghc->generation ||
+ kvm_is_error_hva(ghc->hva) || !ghc->memslot)) {
+ gfn_t gfn = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
+
+ /* We rely on the fact that it fits in a single page. */
+ BUILD_BUG_ON((sizeof(*st) - 1) & KVM_STEAL_VALID_BITS);
+
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, gfn, sizeof(*st)) ||
+ kvm_is_error_hva(ghc->hva) || !ghc->memslot)
+ return;
+ }
+
+ st = (struct kvm_steal_time __user *)ghc->hva;
+ if (!user_access_begin(st, sizeof(*st)))
+ return;
/*
* Doing a TLB flush here, on the guest's behalf, can avoid
* expensive IPIs.
*/
if (guest_pv_has(vcpu, KVM_FEATURE_PV_TLB_FLUSH)) {
- u8 st_preempted = xchg(&st->preempted, 0);
+ u8 st_preempted = 0;
+ int err = -EFAULT;
+
+ asm volatile("1: xchgb %0, %2\n"
+ "xor %1, %1\n"
+ "2:\n"
+ _ASM_EXTABLE_UA(1b, 2b)
+ : "+r" (st_preempted),
+ "+&r" (err)
+ : "m" (st->preempted));
+ if (err)
+ goto out;
+
+ user_access_end();
+
+ vcpu->arch.st.preempted = 0;
trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
st_preempted & KVM_VCPU_FLUSH_TLB);
if (st_preempted & KVM_VCPU_FLUSH_TLB)
kvm_vcpu_flush_tlb_guest(vcpu);
+
+ if (!user_access_begin(st, sizeof(*st)))
+ goto dirty;
} else {
- st->preempted = 0;
+ unsafe_put_user(0, &st->preempted, out);
+ vcpu->arch.st.preempted = 0;
}
- vcpu->arch.st.preempted = 0;
-
- if (st->version & 1)
- st->version += 1; /* first time write, random junk */
+ unsafe_get_user(version, &st->version, out);
+ if (version & 1)
+ version += 1; /* first time write, random junk */
- st->version += 1;
+ version += 1;
+ unsafe_put_user(version, &st->version, out);
smp_wmb();
- st->steal += current->sched_info.run_delay -
+ unsafe_get_user(steal, &st->steal, out);
+ steal += current->sched_info.run_delay -
vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
+ unsafe_put_user(steal, &st->steal, out);
- smp_wmb();
-
- st->version += 1;
+ version += 1;
+ unsafe_put_user(version, &st->version, out);
- kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, false);
+ out:
+ user_access_end();
+ dirty:
+ mark_page_dirty_in_slot(ghc->memslot, gpa_to_gfn(ghc->gpa));
}
int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
@@ -4044,8 +4083,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
{
- struct kvm_host_map map;
- struct kvm_steal_time *st;
+ struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
+ struct kvm_steal_time __user *st;
+ struct kvm_memslots *slots;
+ static const u8 preempted = KVM_VCPU_PREEMPTED;
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;
@@ -4053,16 +4094,23 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
if (vcpu->arch.st.preempted)
return;
- if (kvm_map_gfn(vcpu, vcpu->arch.st.msr_val >> PAGE_SHIFT, &map,
- &vcpu->arch.st.cache, true))
+ /* This happens on process exit */
+ if (unlikely(current->mm != vcpu->kvm->mm))
return;
- st = map.hva +
- offset_in_page(vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS);
+ slots = kvm_memslots(vcpu->kvm);
+
+ if (unlikely(slots->generation != ghc->generation ||
+ kvm_is_error_hva(ghc->hva) || !ghc->memslot))
+ return;
- st->preempted = vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
+ st = (struct kvm_steal_time __user *)ghc->hva;
+ BUILD_BUG_ON(sizeof(st->preempted) != sizeof(preempted));
- kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, true);
+ if (!copy_to_user_nofault(&st->preempted, &preempted, sizeof(preempted)))
+ vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
+
+ mark_page_dirty_in_slot(ghc->memslot, gpa_to_gfn(ghc->gpa));
}
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -10145,11 +10193,8 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
{
- struct gfn_to_pfn_cache *cache = &vcpu->arch.st.cache;
int idx;
- kvm_release_pfn(cache->pfn, cache->dirty, cache);
-
kvmclock_reset(vcpu);
kvm_x86_ops.vcpu_free(vcpu);
--
2.37.1
Hi,
So I looked at this, and it wasn't great one way or the other...
The first patch here is obvious, let's just take it and get one
of the parsings out of the way.
The second one removes a couple of more cases where this is done,
since it only happens when the last argument is non-NULL.
The third then is to avoid the UAF, and is simpler now since only
a few places can even allocate it.
johannes
Hey Greg and Sasha - I'm starting to think through how I can monitor for
dropped stable patches that touch certain file paths in order to help
out with backporting. I've come across a few examples where Greg's
automated emails point out that the patch couldn't be applied to a
certain series but then the patch shows up in a future -rc release
without anyone submitting a fixed backport to the list. An example:
FAILED email: https://lore.kernel.org/stable/1664091338227131@kroah.com/
-rc review email: https://lore.kernel.org/stable/20221003070722.143782710@linuxfoundation.org/
Is this just a case of one of you manually doing the backports (or
dependency resolution) yourselves or are folks directly sending you
backports?
This isn't a problem for me, BTW. I just want to make sure that I
understand the process. Thanks!
Tyler
This is the start of the stable review cycle for the 5.15.74 release.
There are 33 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sun, 16 Oct 2022 08:25:00 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.74-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.15.74-rc2
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211: fix MBSSID parsing use-after-free
Johannes Berg <johannes.berg(a)intel.com>
mac80211: fix memory leaks with element parsing
Johannes Berg <johannes.berg(a)intel.com>
mac80211: always allocate struct ieee802_11_elems
Johannes Berg <johannes.berg(a)intel.com>
mac80211: mlme: find auth challenge directly
Johannes Berg <johannes.berg(a)intel.com>
mac80211: move CRC into struct ieee802_11_elems
Johannes Berg <johannes.berg(a)intel.com>
mac80211: mesh: clean up rx_bcn_presp API
Shunsuke Mie <mie(a)igel.co.jp>
misc: pci_endpoint_test: Fix pci_endpoint_test_{copy,write,read}() panic
Shunsuke Mie <mie(a)igel.co.jp>
misc: pci_endpoint_test: Aggregate params checking for xfer
Cameron Gutman <aicommander(a)gmail.com>
Input: xpad - fix wireless 360 controller breaking after suspend
Pavel Rojtberg <rojtberg(a)gmail.com>
Input: xpad - add supported devices as contributed on github
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: update hidden BSSes to avoid WARN_ON
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211: fix crash in beacon protection for P2P-device
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211_hwsim: avoid mac80211 warning on bad rate
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: avoid nontransmitted BSS list corruption
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: fix BSS refcounting bugs
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: ensure length byte is present before access
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211/mac80211: reject bad MBSSID elements
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: fix u8 overflow in cfg80211_update_notlisted_nontrans()
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: use expired timer rather than wq for mixing fast pool
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: avoid reading two cache lines on irq randomness
Giovanni Cabiddu <giovanni.cabiddu(a)intel.com>
Revert "crypto: qat - reduce size of mapped region"
Nathan Lynch <nathanl(a)linux.ibm.com>
Revert "powerpc/rtas: Implement reentrant rtas call"
Frank Wunderlich <frank-w(a)public-files.de>
USB: serial: qcserial: add new usb-id for Dell branded EM7455
Linus Torvalds <torvalds(a)linux-foundation.org>
scsi: stex: Properly zero out the passthrough command structure
Orlando Chamberlain <redecorating(a)protonmail.com>
efi: Correct Macmini DMI match in uefi cert quirk
Takashi Iwai <tiwai(a)suse.de>
ALSA: hda: Fix position reporting on Poulsbo
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: clamp credited irq bits to maximum mixed
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: restore O_NONBLOCK support
Hu Weiwen <sehuww(a)mail.scut.edu.cn>
ceph: don't truncate file in atomic_open
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix leak of nilfs_root in case of writer thread creation failure
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix use-after-free bug of struct nilfs_root
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level()
-------------
Diffstat:
Makefile | 4 +-
arch/powerpc/include/asm/paca.h | 1 -
arch/powerpc/include/asm/rtas.h | 1 -
arch/powerpc/kernel/paca.c | 32 ----
arch/powerpc/kernel/rtas.c | 54 -------
arch/powerpc/sysdev/xics/ics-rtas.c | 22 +--
drivers/char/mem.c | 4 +-
drivers/char/random.c | 25 ++-
drivers/crypto/qat/qat_common/qat_asym_algs.c | 12 +-
drivers/input/joystick/xpad.c | 20 ++-
drivers/misc/pci_endpoint_test.c | 34 +++-
drivers/net/wireless/mac80211_hwsim.c | 2 +
drivers/scsi/stex.c | 17 +-
drivers/usb/serial/qcserial.c | 1 +
fs/ceph/file.c | 10 +-
fs/nilfs2/inode.c | 19 ++-
fs/nilfs2/segment.c | 21 ++-
include/scsi/scsi_cmnd.h | 2 +-
net/mac80211/agg-rx.c | 14 +-
net/mac80211/ibss.c | 33 ++--
net/mac80211/ieee80211_i.h | 40 +++--
net/mac80211/mesh.c | 87 +++++-----
net/mac80211/mesh_hwmp.c | 44 +++---
net/mac80211/mesh_plink.c | 11 +-
net/mac80211/mesh_sync.c | 26 ++-
net/mac80211/mlme.c | 218 ++++++++++++++------------
net/mac80211/rx.c | 12 +-
net/mac80211/scan.c | 16 +-
net/mac80211/tdls.c | 63 +++++---
net/mac80211/util.c | 53 ++++---
net/wireless/scan.c | 77 +++++----
security/integrity/platform_certs/load_uefi.c | 2 +-
sound/pci/hda/hda_intel.c | 3 +-
33 files changed, 536 insertions(+), 444 deletions(-)
This is the start of the stable review cycle for the 6.0.2 release.
There are 34 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat, 15 Oct 2022 17:51:33 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.0.2-rc1.…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.0.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.0.2-rc1
Shunsuke Mie <mie(a)igel.co.jp>
misc: pci_endpoint_test: Fix pci_endpoint_test_{copy,write,read}() panic
Shunsuke Mie <mie(a)igel.co.jp>
misc: pci_endpoint_test: Aggregate params checking for xfer
Cameron Gutman <aicommander(a)gmail.com>
Input: xpad - fix wireless 360 controller breaking after suspend
Pavel Rojtberg <rojtberg(a)gmail.com>
Input: xpad - add supported devices as contributed on github
Jeremy Kerr <jk(a)codeconstruct.com.au>
mctp: prevent double key removal and unref
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: update hidden BSSes to avoid WARN_ON
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211: fix crash in beacon protection for P2P-device
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211_hwsim: avoid mac80211 warning on bad rate
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: avoid nontransmitted BSS list corruption
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: fix BSS refcounting bugs
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: ensure length byte is present before access
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211: fix MBSSID parsing use-after-free
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211/mac80211: reject bad MBSSID elements
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: fix u8 overflow in cfg80211_update_notlisted_nontrans()
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: use expired timer rather than wq for mixing fast pool
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: avoid reading two cache lines on irq randomness
Giovanni Cabiddu <giovanni.cabiddu(a)intel.com>
Revert "crypto: qat - reduce size of mapped region"
Nathan Lynch <nathanl(a)linux.ibm.com>
Revert "powerpc/rtas: Implement reentrant rtas call"
Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Revert "usb: dwc3: Don't switch OTG -> peripheral if extcon is present"
Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Revert "USB: fixup for merge issue with "usb: dwc3: Don't switch OTG -> peripheral if extcon is present""
Frank Wunderlich <frank-w(a)public-files.de>
USB: serial: qcserial: add new usb-id for Dell branded EM7455
Linus Torvalds <torvalds(a)linux-foundation.org>
scsi: stex: Properly zero out the passthrough command structure
Arun Easi <aeasi(a)marvell.com>
scsi: qla2xxx: Fix response queue handler reading stale packets
Arun Easi <aeasi(a)marvell.com>
scsi: qla2xxx: Revert "scsi: qla2xxx: Fix response queue handler reading stale packets"
Orlando Chamberlain <redecorating(a)protonmail.com>
efi: Correct Macmini DMI match in uefi cert quirk
Takashi Iwai <tiwai(a)suse.de>
ALSA: hda/realtek: Add quirk for HP Zbook Firefly 14 G9 model
Takashi Iwai <tiwai(a)suse.de>
ALSA: hda: Fix position reporting on Poulsbo
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: clamp credited irq bits to maximum mixed
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: restore O_NONBLOCK support
Rishabh Bhatnagar <risbhat(a)amazon.com>
nvme-pci: set min_align_mask before calculating max_hw_sectors
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix leak of nilfs_root in case of writer thread creation failure
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix use-after-free bug of struct nilfs_root
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level()
-------------
Diffstat:
Makefile | 4 +-
arch/powerpc/include/asm/paca.h | 1 -
arch/powerpc/include/asm/rtas.h | 1 -
arch/powerpc/kernel/paca.c | 32 -----------
arch/powerpc/kernel/rtas.c | 54 -------------------
arch/powerpc/sysdev/xics/ics-rtas.c | 22 ++++----
drivers/char/mem.c | 4 +-
drivers/char/random.c | 25 ++++++---
drivers/crypto/qat/qat_common/qat_asym_algs.c | 12 ++---
drivers/input/joystick/xpad.c | 20 ++++++-
drivers/misc/pci_endpoint_test.c | 34 +++++++++---
drivers/net/wireless/mac80211_hwsim.c | 2 +
drivers/nvme/host/pci.c | 3 +-
drivers/scsi/qla2xxx/qla_gbl.h | 2 -
drivers/scsi/qla2xxx/qla_isr.c | 22 +++-----
drivers/scsi/qla2xxx/qla_os.c | 10 ----
drivers/scsi/stex.c | 17 +++---
drivers/usb/dwc3/core.c | 50 +----------------
drivers/usb/dwc3/drd.c | 50 +++++++++++++++++
drivers/usb/serial/qcserial.c | 1 +
fs/nilfs2/inode.c | 19 ++++++-
fs/nilfs2/segment.c | 21 +++++---
include/scsi/scsi_cmnd.h | 2 +-
net/mac80211/ieee80211_i.h | 8 +++
net/mac80211/rx.c | 12 +++--
net/mac80211/util.c | 32 +++++------
net/mctp/af_mctp.c | 23 +++++---
net/mctp/route.c | 10 ++--
net/wireless/scan.c | 77 +++++++++++++++++----------
security/integrity/platform_certs/load_uefi.c | 2 +-
sound/pci/hda/hda_intel.c | 3 +-
sound/pci/hda/patch_realtek.c | 18 +++++++
32 files changed, 313 insertions(+), 280 deletions(-)
This is the start of the stable review cycle for the 5.19.16 release.
There are 33 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat, 15 Oct 2022 17:51:33 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.19.16-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.19.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.19.16-rc1
Shunsuke Mie <mie(a)igel.co.jp>
misc: pci_endpoint_test: Fix pci_endpoint_test_{copy,write,read}() panic
Shunsuke Mie <mie(a)igel.co.jp>
misc: pci_endpoint_test: Aggregate params checking for xfer
Cameron Gutman <aicommander(a)gmail.com>
Input: xpad - fix wireless 360 controller breaking after suspend
Pavel Rojtberg <rojtberg(a)gmail.com>
Input: xpad - add supported devices as contributed on github
Jeremy Kerr <jk(a)codeconstruct.com.au>
mctp: prevent double key removal and unref
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: update hidden BSSes to avoid WARN_ON
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211: fix crash in beacon protection for P2P-device
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211_hwsim: avoid mac80211 warning on bad rate
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: avoid nontransmitted BSS list corruption
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: fix BSS refcounting bugs
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: ensure length byte is present before access
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211: fix MBSSID parsing use-after-free
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211/mac80211: reject bad MBSSID elements
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: fix u8 overflow in cfg80211_update_notlisted_nontrans()
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: use expired timer rather than wq for mixing fast pool
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: avoid reading two cache lines on irq randomness
Giovanni Cabiddu <giovanni.cabiddu(a)intel.com>
Revert "crypto: qat - reduce size of mapped region"
Nathan Lynch <nathanl(a)linux.ibm.com>
Revert "powerpc/rtas: Implement reentrant rtas call"
Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Revert "usb: dwc3: Don't switch OTG -> peripheral if extcon is present"
Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
Revert "USB: fixup for merge issue with "usb: dwc3: Don't switch OTG -> peripheral if extcon is present""
Frank Wunderlich <frank-w(a)public-files.de>
USB: serial: qcserial: add new usb-id for Dell branded EM7455
Linus Torvalds <torvalds(a)linux-foundation.org>
scsi: stex: Properly zero out the passthrough command structure
Orlando Chamberlain <redecorating(a)protonmail.com>
efi: Correct Macmini DMI match in uefi cert quirk
Takashi Iwai <tiwai(a)suse.de>
ALSA: hda/realtek: Add quirk for HP Zbook Firefly 14 G9 model
Takashi Iwai <tiwai(a)suse.de>
ALSA: hda: Fix position reporting on Poulsbo
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: clamp credited irq bits to maximum mixed
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: restore O_NONBLOCK support
Rishabh Bhatnagar <risbhat(a)amazon.com>
nvme-pci: set min_align_mask before calculating max_hw_sectors
Hu Weiwen <sehuww(a)mail.scut.edu.cn>
ceph: don't truncate file in atomic_open
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix leak of nilfs_root in case of writer thread creation failure
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix use-after-free bug of struct nilfs_root
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level()
-------------
Diffstat:
Makefile | 4 +-
arch/powerpc/include/asm/paca.h | 1 -
arch/powerpc/include/asm/rtas.h | 1 -
arch/powerpc/kernel/paca.c | 32 -----------
arch/powerpc/kernel/rtas.c | 54 -------------------
arch/powerpc/sysdev/xics/ics-rtas.c | 22 ++++----
drivers/char/mem.c | 4 +-
drivers/char/random.c | 25 ++++++---
drivers/crypto/qat/qat_common/qat_asym_algs.c | 12 ++---
drivers/input/joystick/xpad.c | 20 ++++++-
drivers/misc/pci_endpoint_test.c | 34 +++++++++---
drivers/net/wireless/mac80211_hwsim.c | 2 +
drivers/nvme/host/pci.c | 3 +-
drivers/scsi/stex.c | 17 +++---
drivers/usb/dwc3/core.c | 50 +----------------
drivers/usb/dwc3/drd.c | 50 +++++++++++++++++
drivers/usb/serial/qcserial.c | 1 +
fs/ceph/file.c | 10 ++--
fs/nilfs2/inode.c | 19 ++++++-
fs/nilfs2/segment.c | 21 +++++---
include/scsi/scsi_cmnd.h | 2 +-
net/mac80211/ieee80211_i.h | 8 +++
net/mac80211/rx.c | 12 +++--
net/mac80211/util.c | 35 ++++++------
net/mctp/af_mctp.c | 23 +++++---
net/mctp/route.c | 10 ++--
net/wireless/scan.c | 77 +++++++++++++++++----------
security/integrity/platform_certs/load_uefi.c | 2 +-
sound/pci/hda/hda_intel.c | 3 +-
sound/pci/hda/patch_realtek.c | 18 +++++++
30 files changed, 315 insertions(+), 257 deletions(-)
From: Paolo Bonzini <pbonzini(a)redhat.com>
commit c3c28d24d910a746b02f496d190e0e8c6560224b upstream.
Commit 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time
/ preempted status", 2021-11-11) open coded the previous call to
kvm_map_gfn, but in doing so it dropped the comparison between the cached
guest physical address and the one in the MSR. This cause an incorrect
cache hit if the guest modifies the steal time address while the memslots
remain the same. This can happen with kexec, in which case the preempted
bit is written at the address used by the old kernel instead of
the old one.
Cc: David Woodhouse <dwmw(a)amazon.co.uk>
Cc: stable(a)vger.kernel.org
Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Rishabh Bhatnagar <risbhat(a)amazon.com>
---
arch/x86/kvm/x86.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 111aa95f3de3..9e9298c333c8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4089,6 +4089,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
struct kvm_steal_time __user *st;
struct kvm_memslots *slots;
static const u8 preempted = KVM_VCPU_PREEMPTED;
+ gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
/*
* The vCPU can be marked preempted if and only if the VM-Exit was on
@@ -4116,6 +4117,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
slots = kvm_memslots(vcpu->kvm);
if (unlikely(slots->generation != ghc->generation ||
+ gpa != ghc->gpa ||
kvm_is_error_hva(ghc->hva) || !ghc->memslot))
return;
--
2.37.1
From: Paolo Bonzini <pbonzini(a)redhat.com>
commit 901d3765fa804ce42812f1d5b1f3de2dfbb26723 upstream.
Commit 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time
/ preempted status", 2021-11-11) open coded the previous call to
kvm_map_gfn, but in doing so it dropped the comparison between the cached
guest physical address and the one in the MSR. This cause an incorrect
cache hit if the guest modifies the steal time address while the memslots
remain the same. This can happen with kexec, in which case the steal
time data is written at the address used by the old kernel instead of
the old one.
While at it, rename the variable from gfn to gpa since it is a plain
physical address and not a right-shifted one.
Reported-by: Dave Young <ruyang(a)redhat.com>
Reported-by: Xiaoying Yan <yiyan(a)redhat.com>
Analyzed-by: Dr. David Alan Gilbert <dgilbert(a)redhat.com>
Cc: David Woodhouse <dwmw(a)amazon.co.uk>
Cc: stable(a)vger.kernel.org
Fixes: 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status")
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
Signed-off-by: Rishabh Bhatnagar <risbhat(a)amazon.com>
---
arch/x86/kvm/x86.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 75494b3c2d1e..111aa95f3de3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3018,6 +3018,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
struct gfn_to_hva_cache *ghc = &vcpu->arch.st.cache;
struct kvm_steal_time __user *st;
struct kvm_memslots *slots;
+ gpa_t gpa = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
u64 steal;
u32 version;
@@ -3030,13 +3031,12 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
slots = kvm_memslots(vcpu->kvm);
if (unlikely(slots->generation != ghc->generation ||
+ gpa != ghc->gpa ||
kvm_is_error_hva(ghc->hva) || !ghc->memslot)) {
- gfn_t gfn = vcpu->arch.st.msr_val & KVM_STEAL_VALID_BITS;
-
/* We rely on the fact that it fits in a single page. */
BUILD_BUG_ON((sizeof(*st) - 1) & KVM_STEAL_VALID_BITS);
- if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, gfn, sizeof(*st)) ||
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, gpa, sizeof(*st)) ||
kvm_is_error_hva(ghc->hva) || !ghc->memslot)
return;
}
--
2.37.1
This is the start of the stable review cycle for the 5.15.74 release.
There are 27 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat, 15 Oct 2022 17:51:33 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.74-rc…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.15.74-rc1
Shunsuke Mie <mie(a)igel.co.jp>
misc: pci_endpoint_test: Fix pci_endpoint_test_{copy,write,read}() panic
Shunsuke Mie <mie(a)igel.co.jp>
misc: pci_endpoint_test: Aggregate params checking for xfer
Cameron Gutman <aicommander(a)gmail.com>
Input: xpad - fix wireless 360 controller breaking after suspend
Pavel Rojtberg <rojtberg(a)gmail.com>
Input: xpad - add supported devices as contributed on github
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: update hidden BSSes to avoid WARN_ON
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211: fix crash in beacon protection for P2P-device
Johannes Berg <johannes.berg(a)intel.com>
wifi: mac80211_hwsim: avoid mac80211 warning on bad rate
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: avoid nontransmitted BSS list corruption
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: fix BSS refcounting bugs
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: ensure length byte is present before access
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211/mac80211: reject bad MBSSID elements
Johannes Berg <johannes.berg(a)intel.com>
wifi: cfg80211: fix u8 overflow in cfg80211_update_notlisted_nontrans()
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: use expired timer rather than wq for mixing fast pool
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: avoid reading two cache lines on irq randomness
Giovanni Cabiddu <giovanni.cabiddu(a)intel.com>
Revert "crypto: qat - reduce size of mapped region"
Nathan Lynch <nathanl(a)linux.ibm.com>
Revert "powerpc/rtas: Implement reentrant rtas call"
Frank Wunderlich <frank-w(a)public-files.de>
USB: serial: qcserial: add new usb-id for Dell branded EM7455
Linus Torvalds <torvalds(a)linux-foundation.org>
scsi: stex: Properly zero out the passthrough command structure
Orlando Chamberlain <redecorating(a)protonmail.com>
efi: Correct Macmini DMI match in uefi cert quirk
Takashi Iwai <tiwai(a)suse.de>
ALSA: hda: Fix position reporting on Poulsbo
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: clamp credited irq bits to maximum mixed
Jason A. Donenfeld <Jason(a)zx2c4.com>
random: restore O_NONBLOCK support
Hu Weiwen <sehuww(a)mail.scut.edu.cn>
ceph: don't truncate file in atomic_open
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix leak of nilfs_root in case of writer thread creation failure
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix use-after-free bug of struct nilfs_root
Ryusuke Konishi <konishi.ryusuke(a)gmail.com>
nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level()
-------------
Diffstat:
Makefile | 4 +-
arch/powerpc/include/asm/paca.h | 1 -
arch/powerpc/include/asm/rtas.h | 1 -
arch/powerpc/kernel/paca.c | 32 -----------
arch/powerpc/kernel/rtas.c | 54 -------------------
arch/powerpc/sysdev/xics/ics-rtas.c | 22 ++++----
drivers/char/mem.c | 4 +-
drivers/char/random.c | 25 ++++++---
drivers/crypto/qat/qat_common/qat_asym_algs.c | 12 ++---
drivers/input/joystick/xpad.c | 20 ++++++-
drivers/misc/pci_endpoint_test.c | 34 +++++++++---
drivers/net/wireless/mac80211_hwsim.c | 2 +
drivers/scsi/stex.c | 17 +++---
drivers/usb/serial/qcserial.c | 1 +
fs/ceph/file.c | 10 ++--
fs/nilfs2/inode.c | 19 ++++++-
fs/nilfs2/segment.c | 21 +++++---
include/scsi/scsi_cmnd.h | 2 +-
net/mac80211/rx.c | 12 +++--
net/mac80211/util.c | 2 +
net/wireless/scan.c | 77 +++++++++++++++++----------
security/integrity/platform_certs/load_uefi.c | 2 +-
sound/pci/hda/hda_intel.c | 3 +-
23 files changed, 198 insertions(+), 179 deletions(-)
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
[ Upstream commit a7c01fa93aeb03ab76cd3cb2107990dd160498e6 ]
I was recently surprised to learn that msleep_interruptible(),
wait_for_completion_interruptible_timeout(), and related functions
simply hung when I called kthread_stop() on kthreads using them. The
solution to fixing the case with msleep_interruptible() was more simply
to move to schedule_timeout_interruptible(). Why?
The reason is that msleep_interruptible(), and many functions just like
it, has a loop like this:
while (timeout && !signal_pending(current))
timeout = schedule_timeout_interruptible(timeout);
The call to kthread_stop() woke up the thread, so schedule_timeout_
interruptible() returned early, but because signal_pending() returned
true, it went back into another timeout, which was never woken up.
This wait loop pattern is common to various pieces of code, and I
suspect that the subtle misuse in a kthread that caused a deadlock in
the code I looked at last week is also found elsewhere.
So this commit causes signal_pending() to return true when
kthread_stop() is called, by setting TIF_NOTIFY_SIGNAL.
The same also probably applies to the similar kthread_park()
functionality, but that can be addressed later, as its semantics are
slightly different.
Cc: Eric W. Biederman <ebiederm(a)xmission.com>
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
v1: https://lkml.kernel.org/r/20220627120020.608117-1-Jason@zx2c4.com
v2: https://lkml.kernel.org/r/20220627145716.641185-1-Jason@zx2c4.com
v3: https://lkml.kernel.org/r/20220628161441.892925-1-Jason@zx2c4.com
v4: https://lkml.kernel.org/r/20220711202136.64458-1-Jason@zx2c4.com
v5: https://lkml.kernel.org/r/20220711232123.136330-1-Jason@zx2c4.com
Signed-off-by: Eric W. Biederman <ebiederm(a)xmission.com>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
kernel/kthread.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 5b37a8567168..c8ca1007e2dd 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -644,6 +644,7 @@ int kthread_stop(struct task_struct *k)
kthread = to_kthread(k);
set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);
kthread_unpark(k);
+ set_tsk_thread_flag(k, TIF_NOTIFY_SIGNAL);
wake_up_process(k);
wait_for_completion(&kthread->exited);
ret = k->exit_code;
--
2.35.1
If an error is detected as a result of user-space process accessing a
corrupt memory location, the CPU may take an abort. Then the platform
firmware reports kernel via NMI like notifications, e.g. NOTIFY_SEA,
NOTIFY_SOFTWARE_DELEGATED, etc.
For NMI like notifications, commit 7f17b4a121d0 ("ACPI: APEI: Kick the
memory_failure() queue for synchronous errors") keep track of whether
memory_failure() work was queued, and make task_work pending to flush out
the queue so that the work is processed before return to user-space.
The code use init_mm to check whether the error occurs in user space:
if (current->mm != &init_mm)
The condition is always true, becase _nobody_ ever has "init_mm" as a real
VM any more.
In addition to abort, errors can also be signaled as asynchronous
exceptions, such as interrupt and SError. In such case, the interrupted
current process could be any kind of thread. When a kernel thread is
interrupted, the work ghes_kick_task_work deferred to task_work will never
be processed because entry_handler returns to call ret_to_kernel() instead
of ret_to_user(). Consequently, the estatus_node alloced from
ghes_estatus_pool in ghes_in_nmi_queue_one_entry() will not be freed.
After around 200 allocations in our platform, the ghes_estatus_pool will
run of memory and ghes_in_nmi_queue_one_entry() returns ENOMEM. As a
result, the event failed to be processed.
sdei: event 805 on CPU 113 failed with error: -2
Finally, a lot of unhandled events may cause platform firmware to exceed
some threshold and reboot.
The condition should generally just do
if (current->mm)
as described in active_mm.rst documentation.
Then if an asynchronous error is detected when a kernel thread is running,
(e.g. when detected by a background scrubber), do not add task_work to it
as the original patch intends to do.
Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <xueshuai(a)linux.alibaba.com>
---
changes since v1:
- add description the side effect and give more details
drivers/acpi/apei/ghes.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d91ad378c00d..80ad530583c9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -985,7 +985,7 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
ghes_estatus_cache_add(generic, estatus);
}
- if (task_work_pending && current->mm != &init_mm) {
+ if (task_work_pending && current->mm) {
estatus_node->task_work.func = ghes_kick_task_work;
estatus_node->task_work_cpu = smp_processor_id();
ret = task_work_add(current, &estatus_node->task_work,
--
2.20.1.12.g72788fdb
Dzień dobry,
czy chcieliby odroczyć płatność za towary u dostawców?
Możemy zapłacić za Państwa zamówienia, a po zapłacie wystawimy Państwu fakturę z odroczonym terminem płatności nawet do 90 dni.
Byliby Państwo zainteresowani taką możliwością?
Pozdrawiam
Dominik Chrostek
Userspace can directly access physical address through user type
Work Queue (WQ) in two scenarios: no IOMMU or IOMMU Passthrough
without Shared Virtual Addressing (SVA). In these two cases, user type WQ
allows userspace to issue DMA physical address access without virtual
to physical translation.
This is inconsistent with the security goals of a good kernel API.
Plus there is no usage for user type WQ without SVA.
So enable user type WQ only when SVA is enabled (i.e. user PASID is
enabled).
Fixes: 42d279f9137a ("dmaengine: idxd: add char driver to expose submission portal to userland")
Suggested-by: Arjan Van De Ven <arjan.van.de.ven(a)intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu(a)intel.com>
Reviewed-by: Dave Jiang <dave.jiang(a)intel.com>
---
drivers/dma/idxd/cdev.c | 14 ++++++++++++++
include/uapi/linux/idxd.h | 1 +
2 files changed, 15 insertions(+)
diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index c2808fd081d6..4cd3400c5a48 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -312,6 +312,20 @@ static int idxd_user_drv_probe(struct idxd_dev *idxd_dev)
if (idxd->state != IDXD_DEV_ENABLED)
return -ENXIO;
+ /*
+ * User type WQ is enabled only when SVA is enabled for two reasons:
+ * - If no IOMMU or IOMMU Passthrough without SVA, userspace
+ * can directly access physical address through the WQ.
+ * - There is no usage case for the WQ without SVA.
+ */
+ if (!device_user_pasid_enabled(idxd)) {
+ idxd->cmd_status = IDXD_SCMD_WQ_USER_NO_IOMMU;
+ dev_dbg(&idxd->pdev->dev,
+ "User type WQ cannot be enabled without SVA.\n");
+
+ return -EOPNOTSUPP;
+ }
+
mutex_lock(&wq->wq_lock);
wq->type = IDXD_WQT_USER;
rc = drv_enable_wq(wq);
diff --git a/include/uapi/linux/idxd.h b/include/uapi/linux/idxd.h
index 095299c75828..2b9e7feba3f3 100644
--- a/include/uapi/linux/idxd.h
+++ b/include/uapi/linux/idxd.h
@@ -29,6 +29,7 @@ enum idxd_scmd_stat {
IDXD_SCMD_WQ_NO_SIZE = 0x800e0000,
IDXD_SCMD_WQ_NO_PRIV = 0x800f0000,
IDXD_SCMD_WQ_IRQ_ERR = 0x80100000,
+ IDXD_SCMD_WQ_USER_NO_IOMMU = 0x80110000,
};
#define IDXD_SCMD_SOFTERR_MASK 0x80000000
--
2.32.0
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6df2a016c0c8a3d0933ef33dd192ea6606b115e3 Mon Sep 17 00:00:00 2001
From: Aurelien Jarno <aurelien(a)aurel32.net>
Date: Wed, 26 Jan 2022 18:14:42 +0100
Subject: [PATCH] riscv: fix build with binutils 2.38
From version 2.38, binutils default to ISA spec version 20191213. This
means that the csr read/write (csrr*/csrw*) instructions and fence.i
instruction has separated from the `I` extension, become two standalone
extensions: Zicsr and Zifencei. As the kernel uses those instruction,
this causes the following build failure:
CC arch/riscv/kernel/vdso/vgettimeofday.o
<<BUILDDIR>>/arch/riscv/include/asm/vdso/gettimeofday.h: Assembler messages:
<<BUILDDIR>>/arch/riscv/include/asm/vdso/gettimeofday.h:71: Error: unrecognized opcode `csrr a5,0xc01'
<<BUILDDIR>>/arch/riscv/include/asm/vdso/gettimeofday.h:71: Error: unrecognized opcode `csrr a5,0xc01'
<<BUILDDIR>>/arch/riscv/include/asm/vdso/gettimeofday.h:71: Error: unrecognized opcode `csrr a5,0xc01'
<<BUILDDIR>>/arch/riscv/include/asm/vdso/gettimeofday.h:71: Error: unrecognized opcode `csrr a5,0xc01'
The fix is to specify those extensions explicitely in -march. However as
older binutils version do not support this, we first need to detect
that.
Signed-off-by: Aurelien Jarno <aurelien(a)aurel32.net>
Tested-by: Alexandre Ghiti <alexandre.ghiti(a)canonical.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer(a)rivosinc.com>
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 8a107ed18b0d..7d81102cffd4 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -50,6 +50,12 @@ riscv-march-$(CONFIG_ARCH_RV32I) := rv32ima
riscv-march-$(CONFIG_ARCH_RV64I) := rv64ima
riscv-march-$(CONFIG_FPU) := $(riscv-march-y)fd
riscv-march-$(CONFIG_RISCV_ISA_C) := $(riscv-march-y)c
+
+# Newer binutils versions default to ISA spec version 20191213 which moves some
+# instructions from the I extension to the Zicsr and Zifencei extensions.
+toolchain-need-zicsr-zifencei := $(call cc-option-yn, -march=$(riscv-march-y)_zicsr_zifencei)
+riscv-march-$(toolchain-need-zicsr-zifencei) := $(riscv-march-y)_zicsr_zifencei
+
KBUILD_CFLAGS += -march=$(subst fd,,$(riscv-march-y))
KBUILD_AFLAGS += -march=$(riscv-march-y)
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6df2a016c0c8a3d0933ef33dd192ea6606b115e3 Mon Sep 17 00:00:00 2001
From: Aurelien Jarno <aurelien(a)aurel32.net>
Date: Wed, 26 Jan 2022 18:14:42 +0100
Subject: [PATCH] riscv: fix build with binutils 2.38
From version 2.38, binutils default to ISA spec version 20191213. This
means that the csr read/write (csrr*/csrw*) instructions and fence.i
instruction has separated from the `I` extension, become two standalone
extensions: Zicsr and Zifencei. As the kernel uses those instruction,
this causes the following build failure:
CC arch/riscv/kernel/vdso/vgettimeofday.o
<<BUILDDIR>>/arch/riscv/include/asm/vdso/gettimeofday.h: Assembler messages:
<<BUILDDIR>>/arch/riscv/include/asm/vdso/gettimeofday.h:71: Error: unrecognized opcode `csrr a5,0xc01'
<<BUILDDIR>>/arch/riscv/include/asm/vdso/gettimeofday.h:71: Error: unrecognized opcode `csrr a5,0xc01'
<<BUILDDIR>>/arch/riscv/include/asm/vdso/gettimeofday.h:71: Error: unrecognized opcode `csrr a5,0xc01'
<<BUILDDIR>>/arch/riscv/include/asm/vdso/gettimeofday.h:71: Error: unrecognized opcode `csrr a5,0xc01'
The fix is to specify those extensions explicitely in -march. However as
older binutils version do not support this, we first need to detect
that.
Signed-off-by: Aurelien Jarno <aurelien(a)aurel32.net>
Tested-by: Alexandre Ghiti <alexandre.ghiti(a)canonical.com>
Cc: stable(a)vger.kernel.org
Signed-off-by: Palmer Dabbelt <palmer(a)rivosinc.com>
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 8a107ed18b0d..7d81102cffd4 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -50,6 +50,12 @@ riscv-march-$(CONFIG_ARCH_RV32I) := rv32ima
riscv-march-$(CONFIG_ARCH_RV64I) := rv64ima
riscv-march-$(CONFIG_FPU) := $(riscv-march-y)fd
riscv-march-$(CONFIG_RISCV_ISA_C) := $(riscv-march-y)c
+
+# Newer binutils versions default to ISA spec version 20191213 which moves some
+# instructions from the I extension to the Zicsr and Zifencei extensions.
+toolchain-need-zicsr-zifencei := $(call cc-option-yn, -march=$(riscv-march-y)_zicsr_zifencei)
+riscv-march-$(toolchain-need-zicsr-zifencei) := $(riscv-march-y)_zicsr_zifencei
+
KBUILD_CFLAGS += -march=$(subst fd,,$(riscv-march-y))
KBUILD_AFLAGS += -march=$(riscv-march-y)
From: Johannes Berg <johannes.berg(a)intel.com>
Commit ff05d4b45dd89b922578dac497dcabf57cf771c6 upstream.
When we parse a multi-BSSID element, we might point some
element pointers into the allocated nontransmitted_profile.
However, we free this before returning, causing UAF when the
relevant pointers in the parsed elements are accessed.
Fix this by not allocating the scratch buffer separately but
as part of the returned structure instead, that way, there
are no lifetime issues with it.
The scratch buffer introduction as part of the returned data
here is taken from MLO feature work done by Ilan.
This fixes CVE-2022-42719.
Fixes: 5023b14cf4df ("mac80211: support profile split between elements")
Co-developed-by: Ilan Peer <ilan.peer(a)intel.com>
Signed-off-by: Ilan Peer <ilan.peer(a)intel.com>
Reviewed-by: Kees Cook <keescook(a)chromium.org>
Signed-off-by: Johannes Berg <johannes.berg(a)intel.com>
---
net/mac80211/ieee80211_i.h | 8 ++++++++
net/mac80211/util.c | 30 +++++++++++++++---------------
2 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/net/mac80211/ieee80211_i.h b/net/mac80211/ieee80211_i.h
index 48fbccbf2a54..44c8701af95c 100644
--- a/net/mac80211/ieee80211_i.h
+++ b/net/mac80211/ieee80211_i.h
@@ -1640,6 +1640,14 @@ struct ieee802_11_elems {
/* whether a parse error occurred while retrieving these elements */
bool parse_error;
+
+ /*
+ * scratch buffer that can be used for various element parsing related
+ * tasks, e.g., element de-fragmentation etc.
+ */
+ size_t scratch_len;
+ u8 *scratch_pos;
+ u8 scratch[];
};
static inline struct ieee80211_local *hw_to_local(
diff --git a/net/mac80211/util.c b/net/mac80211/util.c
index 504422cc683e..8f36ab8fcfb2 100644
--- a/net/mac80211/util.c
+++ b/net/mac80211/util.c
@@ -1503,25 +1503,27 @@ struct ieee802_11_elems *ieee802_11_parse_elems_crc(const u8 *start, size_t len,
const struct element *non_inherit = NULL;
u8 *nontransmitted_profile;
int nontransmitted_profile_len = 0;
+ size_t scratch_len = len;
- elems = kzalloc(sizeof(*elems), GFP_ATOMIC);
+ elems = kzalloc(sizeof(*elems) + scratch_len, GFP_ATOMIC);
if (!elems)
return NULL;
elems->ie_start = start;
elems->total_len = len;
+ elems->scratch_len = scratch_len;
+ elems->scratch_pos = elems->scratch;
- nontransmitted_profile = kmalloc(len, GFP_ATOMIC);
- if (nontransmitted_profile) {
- nontransmitted_profile_len =
- ieee802_11_find_bssid_profile(start, len, elems,
- transmitter_bssid,
- bss_bssid,
- nontransmitted_profile);
- non_inherit =
- cfg80211_find_ext_elem(WLAN_EID_EXT_NON_INHERITANCE,
- nontransmitted_profile,
- nontransmitted_profile_len);
- }
+ nontransmitted_profile = elems->scratch_pos;
+ nontransmitted_profile_len =
+ ieee802_11_find_bssid_profile(start, len, elems,
+ transmitter_bssid,
+ bss_bssid,
+ nontransmitted_profile);
+ elems->scratch_pos += nontransmitted_profile_len;
+ elems->scratch_len -= nontransmitted_profile_len;
+ non_inherit = cfg80211_find_ext_elem(WLAN_EID_EXT_NON_INHERITANCE,
+ nontransmitted_profile,
+ nontransmitted_profile_len);
crc = _ieee802_11_parse_elems_crc(start, len, action, elems, filter,
crc, non_inherit);
@@ -1550,8 +1552,6 @@ struct ieee802_11_elems *ieee802_11_parse_elems_crc(const u8 *start, size_t len,
offsetofend(struct ieee80211_bssid_index, dtim_count))
elems->dtim_count = elems->bssid_index->dtim_count;
- kfree(nontransmitted_profile);
-
elems->crc = crc;
return elems;
--
2.37.3
From: Javier Martinez Canillas <javierm(a)redhat.com>
[ Upstream commit 94dc3471d1b2b58b3728558d0e3f264e9ce6ff59 ]
The strlen() function returns a size_t which is an unsigned int on 32-bit
arches and an unsigned long on 64-bit arches. But in the drm_copy_field()
function, the strlen() return value is assigned to an 'int len' variable.
Later, the len variable is passed as copy_from_user() third argument that
is an unsigned long parameter as well.
In theory, this can lead to an integer overflow via type conversion. Since
the assignment happens to a signed int lvalue instead of a size_t lvalue.
In practice though, that's unlikely since the values copied are set by DRM
drivers and not controlled by userspace. But using a size_t for len is the
correct thing to do anyways.
Signed-off-by: Javier Martinez Canillas <javierm(a)redhat.com>
Tested-by: Peter Robinson <pbrobinson(a)gmail.com>
Reviewed-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20220705100215.572498-2-javie…
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
drivers/gpu/drm/drm_ioctl.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/drm_ioctl.c b/drivers/gpu/drm/drm_ioctl.c
index babd7ebabfef..4fea6519510c 100644
--- a/drivers/gpu/drm/drm_ioctl.c
+++ b/drivers/gpu/drm/drm_ioctl.c
@@ -458,7 +458,7 @@ EXPORT_SYMBOL(drm_invalid_op);
*/
static int drm_copy_field(char __user *buf, size_t *buf_len, const char *value)
{
- int len;
+ size_t len;
/* don't overflow userbuf */
len = strlen(value);
--
2.35.1
Hi Greg,
You just sent me an automated email about these failing, so here they
are backported.
Jason
Jason A. Donenfeld (3):
random: restore O_NONBLOCK support
random: avoid reading two cache lines on irq randomness
random: use expired timer rather than wq for mixing fast pool
drivers/char/mem.c | 4 ++--
drivers/char/random.c | 23 ++++++++++++++++-------
2 files changed, 18 insertions(+), 9 deletions(-)
--
2.37.3
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cd4f24ae9404fd31fc461066e57889be3b68641b Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 8 Sep 2022 16:14:00 +0200
Subject: [PATCH] random: restore O_NONBLOCK support
Prior to 5.6, when /dev/random was opened with O_NONBLOCK, it would
return -EAGAIN if there was no entropy. When the pools were unified in
5.6, this was lost. The post 5.6 behavior of blocking until the pool is
initialized, and ignoring O_NONBLOCK in the process, went unnoticed,
with no reports about the regression received for two and a half years.
However, eventually this indeed did break somebody's userspace.
So we restore the old behavior, by returning -EAGAIN if the pool is not
initialized. Unlike the old /dev/random, this can only occur during
early boot, after which it never blocks again.
In order to make this O_NONBLOCK behavior consistent with other
expectations, also respect users reading with preadv2(RWF_NOWAIT) and
similar.
Fixes: 30c08efec888 ("random: make /dev/random be almost like /dev/urandom")
Reported-by: Guozihua <guozihua(a)huawei.com>
Reported-by: Zhongguohua <zhongguohua1(a)huawei.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso(a)mit.edu>
Cc: Andrew Lutomirski <luto(a)kernel.org>
Cc: stable(a)vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 32a932a065a6..5611d127363e 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -712,8 +712,8 @@ static const struct memdev {
#endif
[5] = { "zero", 0666, &zero_fops, FMODE_NOWAIT },
[7] = { "full", 0666, &full_fops, 0 },
- [8] = { "random", 0666, &random_fops, 0 },
- [9] = { "urandom", 0666, &urandom_fops, 0 },
+ [8] = { "random", 0666, &random_fops, FMODE_NOWAIT },
+ [9] = { "urandom", 0666, &urandom_fops, FMODE_NOWAIT },
#ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, &kmsg_fops, 0 },
#endif
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 79d7d4e4e582..c8cc23515568 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1347,6 +1347,11 @@ static ssize_t random_read_iter(struct kiocb *kiocb, struct iov_iter *iter)
{
int ret;
+ if (!crng_ready() &&
+ ((kiocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) ||
+ (kiocb->ki_filp->f_flags & O_NONBLOCK)))
+ return -EAGAIN;
+
ret = wait_for_random_bytes();
if (ret != 0)
return ret;
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cd4f24ae9404fd31fc461066e57889be3b68641b Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 8 Sep 2022 16:14:00 +0200
Subject: [PATCH] random: restore O_NONBLOCK support
Prior to 5.6, when /dev/random was opened with O_NONBLOCK, it would
return -EAGAIN if there was no entropy. When the pools were unified in
5.6, this was lost. The post 5.6 behavior of blocking until the pool is
initialized, and ignoring O_NONBLOCK in the process, went unnoticed,
with no reports about the regression received for two and a half years.
However, eventually this indeed did break somebody's userspace.
So we restore the old behavior, by returning -EAGAIN if the pool is not
initialized. Unlike the old /dev/random, this can only occur during
early boot, after which it never blocks again.
In order to make this O_NONBLOCK behavior consistent with other
expectations, also respect users reading with preadv2(RWF_NOWAIT) and
similar.
Fixes: 30c08efec888 ("random: make /dev/random be almost like /dev/urandom")
Reported-by: Guozihua <guozihua(a)huawei.com>
Reported-by: Zhongguohua <zhongguohua1(a)huawei.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso(a)mit.edu>
Cc: Andrew Lutomirski <luto(a)kernel.org>
Cc: stable(a)vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 32a932a065a6..5611d127363e 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -712,8 +712,8 @@ static const struct memdev {
#endif
[5] = { "zero", 0666, &zero_fops, FMODE_NOWAIT },
[7] = { "full", 0666, &full_fops, 0 },
- [8] = { "random", 0666, &random_fops, 0 },
- [9] = { "urandom", 0666, &urandom_fops, 0 },
+ [8] = { "random", 0666, &random_fops, FMODE_NOWAIT },
+ [9] = { "urandom", 0666, &urandom_fops, FMODE_NOWAIT },
#ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, &kmsg_fops, 0 },
#endif
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 79d7d4e4e582..c8cc23515568 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1347,6 +1347,11 @@ static ssize_t random_read_iter(struct kiocb *kiocb, struct iov_iter *iter)
{
int ret;
+ if (!crng_ready() &&
+ ((kiocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) ||
+ (kiocb->ki_filp->f_flags & O_NONBLOCK)))
+ return -EAGAIN;
+
ret = wait_for_random_bytes();
if (ret != 0)
return ret;
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cd4f24ae9404fd31fc461066e57889be3b68641b Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 8 Sep 2022 16:14:00 +0200
Subject: [PATCH] random: restore O_NONBLOCK support
Prior to 5.6, when /dev/random was opened with O_NONBLOCK, it would
return -EAGAIN if there was no entropy. When the pools were unified in
5.6, this was lost. The post 5.6 behavior of blocking until the pool is
initialized, and ignoring O_NONBLOCK in the process, went unnoticed,
with no reports about the regression received for two and a half years.
However, eventually this indeed did break somebody's userspace.
So we restore the old behavior, by returning -EAGAIN if the pool is not
initialized. Unlike the old /dev/random, this can only occur during
early boot, after which it never blocks again.
In order to make this O_NONBLOCK behavior consistent with other
expectations, also respect users reading with preadv2(RWF_NOWAIT) and
similar.
Fixes: 30c08efec888 ("random: make /dev/random be almost like /dev/urandom")
Reported-by: Guozihua <guozihua(a)huawei.com>
Reported-by: Zhongguohua <zhongguohua1(a)huawei.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso(a)mit.edu>
Cc: Andrew Lutomirski <luto(a)kernel.org>
Cc: stable(a)vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 32a932a065a6..5611d127363e 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -712,8 +712,8 @@ static const struct memdev {
#endif
[5] = { "zero", 0666, &zero_fops, FMODE_NOWAIT },
[7] = { "full", 0666, &full_fops, 0 },
- [8] = { "random", 0666, &random_fops, 0 },
- [9] = { "urandom", 0666, &urandom_fops, 0 },
+ [8] = { "random", 0666, &random_fops, FMODE_NOWAIT },
+ [9] = { "urandom", 0666, &urandom_fops, FMODE_NOWAIT },
#ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, &kmsg_fops, 0 },
#endif
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 79d7d4e4e582..c8cc23515568 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1347,6 +1347,11 @@ static ssize_t random_read_iter(struct kiocb *kiocb, struct iov_iter *iter)
{
int ret;
+ if (!crng_ready() &&
+ ((kiocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) ||
+ (kiocb->ki_filp->f_flags & O_NONBLOCK)))
+ return -EAGAIN;
+
ret = wait_for_random_bytes();
if (ret != 0)
return ret;
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From cd4f24ae9404fd31fc461066e57889be3b68641b Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 8 Sep 2022 16:14:00 +0200
Subject: [PATCH] random: restore O_NONBLOCK support
Prior to 5.6, when /dev/random was opened with O_NONBLOCK, it would
return -EAGAIN if there was no entropy. When the pools were unified in
5.6, this was lost. The post 5.6 behavior of blocking until the pool is
initialized, and ignoring O_NONBLOCK in the process, went unnoticed,
with no reports about the regression received for two and a half years.
However, eventually this indeed did break somebody's userspace.
So we restore the old behavior, by returning -EAGAIN if the pool is not
initialized. Unlike the old /dev/random, this can only occur during
early boot, after which it never blocks again.
In order to make this O_NONBLOCK behavior consistent with other
expectations, also respect users reading with preadv2(RWF_NOWAIT) and
similar.
Fixes: 30c08efec888 ("random: make /dev/random be almost like /dev/urandom")
Reported-by: Guozihua <guozihua(a)huawei.com>
Reported-by: Zhongguohua <zhongguohua1(a)huawei.com>
Cc: Al Viro <viro(a)zeniv.linux.org.uk>
Cc: Theodore Ts'o <tytso(a)mit.edu>
Cc: Andrew Lutomirski <luto(a)kernel.org>
Cc: stable(a)vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 32a932a065a6..5611d127363e 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -712,8 +712,8 @@ static const struct memdev {
#endif
[5] = { "zero", 0666, &zero_fops, FMODE_NOWAIT },
[7] = { "full", 0666, &full_fops, 0 },
- [8] = { "random", 0666, &random_fops, 0 },
- [9] = { "urandom", 0666, &urandom_fops, 0 },
+ [8] = { "random", 0666, &random_fops, FMODE_NOWAIT },
+ [9] = { "urandom", 0666, &urandom_fops, FMODE_NOWAIT },
#ifdef CONFIG_PRINTK
[11] = { "kmsg", 0644, &kmsg_fops, 0 },
#endif
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 79d7d4e4e582..c8cc23515568 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1347,6 +1347,11 @@ static ssize_t random_read_iter(struct kiocb *kiocb, struct iov_iter *iter)
{
int ret;
+ if (!crng_ready() &&
+ ((kiocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) ||
+ (kiocb->ki_filp->f_flags & O_NONBLOCK)))
+ return -EAGAIN;
+
ret = wait_for_random_bytes();
if (ret != 0)
return ret;
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e4f8a29deb3ba30e414dfb6b09e3ae3bf6dbe74a Mon Sep 17 00:00:00 2001
From: Arun Easi <aeasi(a)marvell.com>
Date: Fri, 26 Aug 2022 03:25:54 -0700
Subject: [PATCH] scsi: qla2xxx: Fix response queue handler reading stale
packets
On some platforms, the current logic of relying on finding new packet
solely based on signature pattern can lead to driver reading stale
packets. Though this is a bug in those platforms, reduce such exposures by
limiting reading packets until the IN pointer.
Link: https://lore.kernel.org/r/20220826102559.17474-3-njavali@marvell.com
Cc: stable(a)vger.kernel.org
Reviewed-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
Signed-off-by: Arun Easi <aeasi(a)marvell.com>
Signed-off-by: Nilesh Javali <njavali(a)marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index ede76357ccb6..e19fde304e5c 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -3763,7 +3763,8 @@ void qla24xx_process_response_queue(struct scsi_qla_host *vha,
struct qla_hw_data *ha = vha->hw;
struct purex_entry_24xx *purex_entry;
struct purex_item *pure_item;
- u16 cur_ring_index;
+ u16 rsp_in = 0, cur_ring_index;
+ int is_shadow_hba;
if (!ha->flags.fw_started)
return;
@@ -3773,7 +3774,18 @@ void qla24xx_process_response_queue(struct scsi_qla_host *vha,
qla_cpu_update(rsp->qpair, smp_processor_id());
}
- while (rsp->ring_ptr->signature != RESPONSE_PROCESSED) {
+#define __update_rsp_in(_is_shadow_hba, _rsp, _rsp_in) \
+ do { \
+ _rsp_in = _is_shadow_hba ? *(_rsp)->in_ptr : \
+ rd_reg_dword_relaxed((_rsp)->rsp_q_in); \
+ } while (0)
+
+ is_shadow_hba = IS_SHADOW_REG_CAPABLE(ha);
+
+ __update_rsp_in(is_shadow_hba, rsp, rsp_in);
+
+ while (rsp->ring_index != rsp_in &&
+ rsp->ring_ptr->signature != RESPONSE_PROCESSED) {
pkt = (struct sts_entry_24xx *)rsp->ring_ptr;
cur_ring_index = rsp->ring_index;
@@ -3887,6 +3899,7 @@ void qla24xx_process_response_queue(struct scsi_qla_host *vha,
}
pure_item = qla27xx_copy_fpin_pkt(vha,
(void **)&pkt, &rsp);
+ __update_rsp_in(is_shadow_hba, rsp, rsp_in);
if (!pure_item)
break;
qla24xx_queue_purex_item(vha, pure_item,
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e4f8a29deb3ba30e414dfb6b09e3ae3bf6dbe74a Mon Sep 17 00:00:00 2001
From: Arun Easi <aeasi(a)marvell.com>
Date: Fri, 26 Aug 2022 03:25:54 -0700
Subject: [PATCH] scsi: qla2xxx: Fix response queue handler reading stale
packets
On some platforms, the current logic of relying on finding new packet
solely based on signature pattern can lead to driver reading stale
packets. Though this is a bug in those platforms, reduce such exposures by
limiting reading packets until the IN pointer.
Link: https://lore.kernel.org/r/20220826102559.17474-3-njavali@marvell.com
Cc: stable(a)vger.kernel.org
Reviewed-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
Signed-off-by: Arun Easi <aeasi(a)marvell.com>
Signed-off-by: Nilesh Javali <njavali(a)marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index ede76357ccb6..e19fde304e5c 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -3763,7 +3763,8 @@ void qla24xx_process_response_queue(struct scsi_qla_host *vha,
struct qla_hw_data *ha = vha->hw;
struct purex_entry_24xx *purex_entry;
struct purex_item *pure_item;
- u16 cur_ring_index;
+ u16 rsp_in = 0, cur_ring_index;
+ int is_shadow_hba;
if (!ha->flags.fw_started)
return;
@@ -3773,7 +3774,18 @@ void qla24xx_process_response_queue(struct scsi_qla_host *vha,
qla_cpu_update(rsp->qpair, smp_processor_id());
}
- while (rsp->ring_ptr->signature != RESPONSE_PROCESSED) {
+#define __update_rsp_in(_is_shadow_hba, _rsp, _rsp_in) \
+ do { \
+ _rsp_in = _is_shadow_hba ? *(_rsp)->in_ptr : \
+ rd_reg_dword_relaxed((_rsp)->rsp_q_in); \
+ } while (0)
+
+ is_shadow_hba = IS_SHADOW_REG_CAPABLE(ha);
+
+ __update_rsp_in(is_shadow_hba, rsp, rsp_in);
+
+ while (rsp->ring_index != rsp_in &&
+ rsp->ring_ptr->signature != RESPONSE_PROCESSED) {
pkt = (struct sts_entry_24xx *)rsp->ring_ptr;
cur_ring_index = rsp->ring_index;
@@ -3887,6 +3899,7 @@ void qla24xx_process_response_queue(struct scsi_qla_host *vha,
}
pure_item = qla27xx_copy_fpin_pkt(vha,
(void **)&pkt, &rsp);
+ __update_rsp_in(is_shadow_hba, rsp, rsp_in);
if (!pure_item)
break;
qla24xx_queue_purex_item(vha, pure_item,
The patch below was submitted to be applied to the 6.0-stable tree.
I fail to see how this patch meets the stable kernel rules as found at
Documentation/process/stable-kernel-rules.rst.
I could be totally wrong, and if so, please respond to
<stable(a)vger.kernel.org> and let me know why this patch should be
applied. Otherwise, it is now dropped from my patch queues, never to be
seen again.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2c57d0defa22b2339c06364a275bcc9048a77255 Mon Sep 17 00:00:00 2001
From: Nilesh Javali <njavali(a)marvell.com>
Date: Fri, 26 Aug 2022 03:25:58 -0700
Subject: [PATCH] scsi: qla2xxx: Define static symbols
drivers/scsi/qla2xxx/qla_os.c:40:20: warning: symbol 'qla_trc_array'
was not declared. Should it be static?
drivers/scsi/qla2xxx/qla_os.c:345:5: warning: symbol
'ql2xdelay_before_pci_error_handling' was not declared.
Should it be static?
Define qla_trc_array and ql2xdelay_before_pci_error_handling as static to
fix sparse warnings.
Link: https://lore.kernel.org/r/20220826102559.17474-7-njavali@marvell.com
Cc: stable(a)vger.kernel.org
Reported-by: kernel test robot <lkp(a)intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
Signed-off-by: Nilesh Javali <njavali(a)marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index f76ae8a64ea9..0ccaeea4e1bd 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -37,7 +37,7 @@ static int apidev_major;
*/
struct kmem_cache *srb_cachep;
-struct trace_array *qla_trc_array;
+static struct trace_array *qla_trc_array;
int ql2xfulldump_on_mpifail;
module_param(ql2xfulldump_on_mpifail, int, S_IRUGO | S_IWUSR);
@@ -342,7 +342,7 @@ MODULE_PARM_DESC(ql2xabts_wait_nvme,
"To wait for ABTS response on I/O timeouts for NVMe. (default: 1)");
-u32 ql2xdelay_before_pci_error_handling = 5;
+static u32 ql2xdelay_before_pci_error_handling = 5;
module_param(ql2xdelay_before_pci_error_handling, uint, 0644);
MODULE_PARM_DESC(ql2xdelay_before_pci_error_handling,
"Number of seconds delayed before qla begin PCI error self-handling (default: 5).\n");
The patch below was submitted to be applied to the 6.0-stable tree.
I fail to see how this patch meets the stable kernel rules as found at
Documentation/process/stable-kernel-rules.rst.
I could be totally wrong, and if so, please respond to
<stable(a)vger.kernel.org> and let me know why this patch should be
applied. Otherwise, it is now dropped from my patch queues, never to be
seen again.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 2c57d0defa22b2339c06364a275bcc9048a77255 Mon Sep 17 00:00:00 2001
From: Nilesh Javali <njavali(a)marvell.com>
Date: Fri, 26 Aug 2022 03:25:58 -0700
Subject: [PATCH] scsi: qla2xxx: Define static symbols
drivers/scsi/qla2xxx/qla_os.c:40:20: warning: symbol 'qla_trc_array'
was not declared. Should it be static?
drivers/scsi/qla2xxx/qla_os.c:345:5: warning: symbol
'ql2xdelay_before_pci_error_handling' was not declared.
Should it be static?
Define qla_trc_array and ql2xdelay_before_pci_error_handling as static to
fix sparse warnings.
Link: https://lore.kernel.org/r/20220826102559.17474-7-njavali@marvell.com
Cc: stable(a)vger.kernel.org
Reported-by: kernel test robot <lkp(a)intel.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani(a)oracle.com>
Signed-off-by: Nilesh Javali <njavali(a)marvell.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index f76ae8a64ea9..0ccaeea4e1bd 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -37,7 +37,7 @@ static int apidev_major;
*/
struct kmem_cache *srb_cachep;
-struct trace_array *qla_trc_array;
+static struct trace_array *qla_trc_array;
int ql2xfulldump_on_mpifail;
module_param(ql2xfulldump_on_mpifail, int, S_IRUGO | S_IWUSR);
@@ -342,7 +342,7 @@ MODULE_PARM_DESC(ql2xabts_wait_nvme,
"To wait for ABTS response on I/O timeouts for NVMe. (default: 1)");
-u32 ql2xdelay_before_pci_error_handling = 5;
+static u32 ql2xdelay_before_pci_error_handling = 5;
module_param(ql2xdelay_before_pci_error_handling, uint, 0644);
MODULE_PARM_DESC(ql2xdelay_before_pci_error_handling,
"Number of seconds delayed before qla begin PCI error self-handling (default: 5).\n");
The patch below does not apply to the 4.9-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 748bc4dd9e663f23448d8ad7e58c011a67ea1eca Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 22 Sep 2022 18:46:04 +0200
Subject: [PATCH] random: use expired timer rather than wq for mixing fast pool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, the fast pool was dumped into the main pool periodically in
the fast pool's hard IRQ handler. This worked fine and there weren't
problems with it, until RT came around. Since RT converts spinlocks into
sleeping locks, problems cropped up. Rather than switching to raw
spinlocks, the RT developers preferred we make the transformation from
originally doing:
do_some_stuff()
spin_lock()
do_some_other_stuff()
spin_unlock()
to doing:
do_some_stuff()
queue_work_on(some_other_stuff_worker)
This is an ordinary pattern done all over the kernel. However, Sherry
noticed a 10% performance regression in qperf TCP over a 40gbps
InfiniBand card. Quoting her message:
> MT27500 Family [ConnectX-3] cards:
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80:0000:0000:0000:0010:e000:0178:9eb1
> base lid: 0x6
> sm lid: 0x1
> state: 4: ACTIVE
> phys state: 5: LinkUp
> rate: 40 Gb/sec (4X QDR)
> link_layer: InfiniBand
>
> Cards are configured with IP addresses on private subnet for IPoIB
> performance testing.
> Regression identified in this bug is in TCP latency in this stack as reported
> by qperf tcp_lat metric:
>
> We have one system listen as a qperf server:
> [root@yourQperfServer ~]# qperf
>
> Have the other system connect to qperf server as a client (in this
> case, it’s X7 server with Mellanox card):
> [root@yourQperfClient ~]# numactl -m0 -N0 qperf 20.20.20.101 -v -uu -ub --time 60 --wait_server 20 -oo msg_size:4K:1024K:*2 tcp_lat
Rather than incur the scheduling latency from queue_work_on, we can
instead switch to running on the next timer tick, on the same core. This
also batches things a bit more -- once per jiffy -- which is okay now
that mix_interrupt_randomness() can credit multiple bits at once.
Reported-by: Sherry Yang <sherry.yang(a)oracle.com>
Tested-by: Paul Webb <paul.x.webb(a)oracle.com>
Cc: Sherry Yang <sherry.yang(a)oracle.com>
Cc: Phillip Goerl <phillip.goerl(a)oracle.com>
Cc: Jack Vogel <jack.vogel(a)oracle.com>
Cc: Nicky Veitch <nicky.veitch(a)oracle.com>
Cc: Colm Harrington <colm.harrington(a)oracle.com>
Cc: Ramanan Govindarajan <ramanan.govindarajan(a)oracle.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Sultan Alsawaf <sultan(a)kerneltoast.com>
Cc: stable(a)vger.kernel.org
Fixes: 58340f8e952b ("random: defer fast pool mixing to worker")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index a90d96f4b3bb..e591c6aadca4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -921,17 +921,20 @@ struct fast_pool {
unsigned long pool[4];
unsigned long last;
unsigned int count;
- struct work_struct mix;
+ struct timer_list mix;
};
+static void mix_interrupt_randomness(struct timer_list *work);
+
static DEFINE_PER_CPU(struct fast_pool, irq_randomness) = {
#ifdef CONFIG_64BIT
#define FASTMIX_PERM SIPHASH_PERMUTATION
- .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 }
+ .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 },
#else
#define FASTMIX_PERM HSIPHASH_PERMUTATION
- .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 }
+ .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 },
#endif
+ .mix = __TIMER_INITIALIZER(mix_interrupt_randomness, 0)
};
/*
@@ -973,7 +976,7 @@ int __cold random_online_cpu(unsigned int cpu)
}
#endif
-static void mix_interrupt_randomness(struct work_struct *work)
+static void mix_interrupt_randomness(struct timer_list *work)
{
struct fast_pool *fast_pool = container_of(work, struct fast_pool, mix);
/*
@@ -1027,10 +1030,11 @@ void add_interrupt_randomness(int irq)
if (new_count < 1024 && !time_is_before_jiffies(fast_pool->last + HZ))
return;
- if (unlikely(!fast_pool->mix.func))
- INIT_WORK(&fast_pool->mix, mix_interrupt_randomness);
fast_pool->count |= MIX_INFLIGHT;
- queue_work_on(raw_smp_processor_id(), system_highpri_wq, &fast_pool->mix);
+ if (!timer_pending(&fast_pool->mix)) {
+ fast_pool->mix.expires = jiffies;
+ add_timer_on(&fast_pool->mix, raw_smp_processor_id());
+ }
}
EXPORT_SYMBOL_GPL(add_interrupt_randomness);
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 748bc4dd9e663f23448d8ad7e58c011a67ea1eca Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 22 Sep 2022 18:46:04 +0200
Subject: [PATCH] random: use expired timer rather than wq for mixing fast pool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, the fast pool was dumped into the main pool periodically in
the fast pool's hard IRQ handler. This worked fine and there weren't
problems with it, until RT came around. Since RT converts spinlocks into
sleeping locks, problems cropped up. Rather than switching to raw
spinlocks, the RT developers preferred we make the transformation from
originally doing:
do_some_stuff()
spin_lock()
do_some_other_stuff()
spin_unlock()
to doing:
do_some_stuff()
queue_work_on(some_other_stuff_worker)
This is an ordinary pattern done all over the kernel. However, Sherry
noticed a 10% performance regression in qperf TCP over a 40gbps
InfiniBand card. Quoting her message:
> MT27500 Family [ConnectX-3] cards:
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80:0000:0000:0000:0010:e000:0178:9eb1
> base lid: 0x6
> sm lid: 0x1
> state: 4: ACTIVE
> phys state: 5: LinkUp
> rate: 40 Gb/sec (4X QDR)
> link_layer: InfiniBand
>
> Cards are configured with IP addresses on private subnet for IPoIB
> performance testing.
> Regression identified in this bug is in TCP latency in this stack as reported
> by qperf tcp_lat metric:
>
> We have one system listen as a qperf server:
> [root@yourQperfServer ~]# qperf
>
> Have the other system connect to qperf server as a client (in this
> case, it’s X7 server with Mellanox card):
> [root@yourQperfClient ~]# numactl -m0 -N0 qperf 20.20.20.101 -v -uu -ub --time 60 --wait_server 20 -oo msg_size:4K:1024K:*2 tcp_lat
Rather than incur the scheduling latency from queue_work_on, we can
instead switch to running on the next timer tick, on the same core. This
also batches things a bit more -- once per jiffy -- which is okay now
that mix_interrupt_randomness() can credit multiple bits at once.
Reported-by: Sherry Yang <sherry.yang(a)oracle.com>
Tested-by: Paul Webb <paul.x.webb(a)oracle.com>
Cc: Sherry Yang <sherry.yang(a)oracle.com>
Cc: Phillip Goerl <phillip.goerl(a)oracle.com>
Cc: Jack Vogel <jack.vogel(a)oracle.com>
Cc: Nicky Veitch <nicky.veitch(a)oracle.com>
Cc: Colm Harrington <colm.harrington(a)oracle.com>
Cc: Ramanan Govindarajan <ramanan.govindarajan(a)oracle.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Sultan Alsawaf <sultan(a)kerneltoast.com>
Cc: stable(a)vger.kernel.org
Fixes: 58340f8e952b ("random: defer fast pool mixing to worker")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index a90d96f4b3bb..e591c6aadca4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -921,17 +921,20 @@ struct fast_pool {
unsigned long pool[4];
unsigned long last;
unsigned int count;
- struct work_struct mix;
+ struct timer_list mix;
};
+static void mix_interrupt_randomness(struct timer_list *work);
+
static DEFINE_PER_CPU(struct fast_pool, irq_randomness) = {
#ifdef CONFIG_64BIT
#define FASTMIX_PERM SIPHASH_PERMUTATION
- .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 }
+ .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 },
#else
#define FASTMIX_PERM HSIPHASH_PERMUTATION
- .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 }
+ .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 },
#endif
+ .mix = __TIMER_INITIALIZER(mix_interrupt_randomness, 0)
};
/*
@@ -973,7 +976,7 @@ int __cold random_online_cpu(unsigned int cpu)
}
#endif
-static void mix_interrupt_randomness(struct work_struct *work)
+static void mix_interrupt_randomness(struct timer_list *work)
{
struct fast_pool *fast_pool = container_of(work, struct fast_pool, mix);
/*
@@ -1027,10 +1030,11 @@ void add_interrupt_randomness(int irq)
if (new_count < 1024 && !time_is_before_jiffies(fast_pool->last + HZ))
return;
- if (unlikely(!fast_pool->mix.func))
- INIT_WORK(&fast_pool->mix, mix_interrupt_randomness);
fast_pool->count |= MIX_INFLIGHT;
- queue_work_on(raw_smp_processor_id(), system_highpri_wq, &fast_pool->mix);
+ if (!timer_pending(&fast_pool->mix)) {
+ fast_pool->mix.expires = jiffies;
+ add_timer_on(&fast_pool->mix, raw_smp_processor_id());
+ }
}
EXPORT_SYMBOL_GPL(add_interrupt_randomness);
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 748bc4dd9e663f23448d8ad7e58c011a67ea1eca Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 22 Sep 2022 18:46:04 +0200
Subject: [PATCH] random: use expired timer rather than wq for mixing fast pool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, the fast pool was dumped into the main pool periodically in
the fast pool's hard IRQ handler. This worked fine and there weren't
problems with it, until RT came around. Since RT converts spinlocks into
sleeping locks, problems cropped up. Rather than switching to raw
spinlocks, the RT developers preferred we make the transformation from
originally doing:
do_some_stuff()
spin_lock()
do_some_other_stuff()
spin_unlock()
to doing:
do_some_stuff()
queue_work_on(some_other_stuff_worker)
This is an ordinary pattern done all over the kernel. However, Sherry
noticed a 10% performance regression in qperf TCP over a 40gbps
InfiniBand card. Quoting her message:
> MT27500 Family [ConnectX-3] cards:
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80:0000:0000:0000:0010:e000:0178:9eb1
> base lid: 0x6
> sm lid: 0x1
> state: 4: ACTIVE
> phys state: 5: LinkUp
> rate: 40 Gb/sec (4X QDR)
> link_layer: InfiniBand
>
> Cards are configured with IP addresses on private subnet for IPoIB
> performance testing.
> Regression identified in this bug is in TCP latency in this stack as reported
> by qperf tcp_lat metric:
>
> We have one system listen as a qperf server:
> [root@yourQperfServer ~]# qperf
>
> Have the other system connect to qperf server as a client (in this
> case, it’s X7 server with Mellanox card):
> [root@yourQperfClient ~]# numactl -m0 -N0 qperf 20.20.20.101 -v -uu -ub --time 60 --wait_server 20 -oo msg_size:4K:1024K:*2 tcp_lat
Rather than incur the scheduling latency from queue_work_on, we can
instead switch to running on the next timer tick, on the same core. This
also batches things a bit more -- once per jiffy -- which is okay now
that mix_interrupt_randomness() can credit multiple bits at once.
Reported-by: Sherry Yang <sherry.yang(a)oracle.com>
Tested-by: Paul Webb <paul.x.webb(a)oracle.com>
Cc: Sherry Yang <sherry.yang(a)oracle.com>
Cc: Phillip Goerl <phillip.goerl(a)oracle.com>
Cc: Jack Vogel <jack.vogel(a)oracle.com>
Cc: Nicky Veitch <nicky.veitch(a)oracle.com>
Cc: Colm Harrington <colm.harrington(a)oracle.com>
Cc: Ramanan Govindarajan <ramanan.govindarajan(a)oracle.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Sultan Alsawaf <sultan(a)kerneltoast.com>
Cc: stable(a)vger.kernel.org
Fixes: 58340f8e952b ("random: defer fast pool mixing to worker")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index a90d96f4b3bb..e591c6aadca4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -921,17 +921,20 @@ struct fast_pool {
unsigned long pool[4];
unsigned long last;
unsigned int count;
- struct work_struct mix;
+ struct timer_list mix;
};
+static void mix_interrupt_randomness(struct timer_list *work);
+
static DEFINE_PER_CPU(struct fast_pool, irq_randomness) = {
#ifdef CONFIG_64BIT
#define FASTMIX_PERM SIPHASH_PERMUTATION
- .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 }
+ .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 },
#else
#define FASTMIX_PERM HSIPHASH_PERMUTATION
- .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 }
+ .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 },
#endif
+ .mix = __TIMER_INITIALIZER(mix_interrupt_randomness, 0)
};
/*
@@ -973,7 +976,7 @@ int __cold random_online_cpu(unsigned int cpu)
}
#endif
-static void mix_interrupt_randomness(struct work_struct *work)
+static void mix_interrupt_randomness(struct timer_list *work)
{
struct fast_pool *fast_pool = container_of(work, struct fast_pool, mix);
/*
@@ -1027,10 +1030,11 @@ void add_interrupt_randomness(int irq)
if (new_count < 1024 && !time_is_before_jiffies(fast_pool->last + HZ))
return;
- if (unlikely(!fast_pool->mix.func))
- INIT_WORK(&fast_pool->mix, mix_interrupt_randomness);
fast_pool->count |= MIX_INFLIGHT;
- queue_work_on(raw_smp_processor_id(), system_highpri_wq, &fast_pool->mix);
+ if (!timer_pending(&fast_pool->mix)) {
+ fast_pool->mix.expires = jiffies;
+ add_timer_on(&fast_pool->mix, raw_smp_processor_id());
+ }
}
EXPORT_SYMBOL_GPL(add_interrupt_randomness);
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 748bc4dd9e663f23448d8ad7e58c011a67ea1eca Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 22 Sep 2022 18:46:04 +0200
Subject: [PATCH] random: use expired timer rather than wq for mixing fast pool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, the fast pool was dumped into the main pool periodically in
the fast pool's hard IRQ handler. This worked fine and there weren't
problems with it, until RT came around. Since RT converts spinlocks into
sleeping locks, problems cropped up. Rather than switching to raw
spinlocks, the RT developers preferred we make the transformation from
originally doing:
do_some_stuff()
spin_lock()
do_some_other_stuff()
spin_unlock()
to doing:
do_some_stuff()
queue_work_on(some_other_stuff_worker)
This is an ordinary pattern done all over the kernel. However, Sherry
noticed a 10% performance regression in qperf TCP over a 40gbps
InfiniBand card. Quoting her message:
> MT27500 Family [ConnectX-3] cards:
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80:0000:0000:0000:0010:e000:0178:9eb1
> base lid: 0x6
> sm lid: 0x1
> state: 4: ACTIVE
> phys state: 5: LinkUp
> rate: 40 Gb/sec (4X QDR)
> link_layer: InfiniBand
>
> Cards are configured with IP addresses on private subnet for IPoIB
> performance testing.
> Regression identified in this bug is in TCP latency in this stack as reported
> by qperf tcp_lat metric:
>
> We have one system listen as a qperf server:
> [root@yourQperfServer ~]# qperf
>
> Have the other system connect to qperf server as a client (in this
> case, it’s X7 server with Mellanox card):
> [root@yourQperfClient ~]# numactl -m0 -N0 qperf 20.20.20.101 -v -uu -ub --time 60 --wait_server 20 -oo msg_size:4K:1024K:*2 tcp_lat
Rather than incur the scheduling latency from queue_work_on, we can
instead switch to running on the next timer tick, on the same core. This
also batches things a bit more -- once per jiffy -- which is okay now
that mix_interrupt_randomness() can credit multiple bits at once.
Reported-by: Sherry Yang <sherry.yang(a)oracle.com>
Tested-by: Paul Webb <paul.x.webb(a)oracle.com>
Cc: Sherry Yang <sherry.yang(a)oracle.com>
Cc: Phillip Goerl <phillip.goerl(a)oracle.com>
Cc: Jack Vogel <jack.vogel(a)oracle.com>
Cc: Nicky Veitch <nicky.veitch(a)oracle.com>
Cc: Colm Harrington <colm.harrington(a)oracle.com>
Cc: Ramanan Govindarajan <ramanan.govindarajan(a)oracle.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Sultan Alsawaf <sultan(a)kerneltoast.com>
Cc: stable(a)vger.kernel.org
Fixes: 58340f8e952b ("random: defer fast pool mixing to worker")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index a90d96f4b3bb..e591c6aadca4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -921,17 +921,20 @@ struct fast_pool {
unsigned long pool[4];
unsigned long last;
unsigned int count;
- struct work_struct mix;
+ struct timer_list mix;
};
+static void mix_interrupt_randomness(struct timer_list *work);
+
static DEFINE_PER_CPU(struct fast_pool, irq_randomness) = {
#ifdef CONFIG_64BIT
#define FASTMIX_PERM SIPHASH_PERMUTATION
- .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 }
+ .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 },
#else
#define FASTMIX_PERM HSIPHASH_PERMUTATION
- .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 }
+ .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 },
#endif
+ .mix = __TIMER_INITIALIZER(mix_interrupt_randomness, 0)
};
/*
@@ -973,7 +976,7 @@ int __cold random_online_cpu(unsigned int cpu)
}
#endif
-static void mix_interrupt_randomness(struct work_struct *work)
+static void mix_interrupt_randomness(struct timer_list *work)
{
struct fast_pool *fast_pool = container_of(work, struct fast_pool, mix);
/*
@@ -1027,10 +1030,11 @@ void add_interrupt_randomness(int irq)
if (new_count < 1024 && !time_is_before_jiffies(fast_pool->last + HZ))
return;
- if (unlikely(!fast_pool->mix.func))
- INIT_WORK(&fast_pool->mix, mix_interrupt_randomness);
fast_pool->count |= MIX_INFLIGHT;
- queue_work_on(raw_smp_processor_id(), system_highpri_wq, &fast_pool->mix);
+ if (!timer_pending(&fast_pool->mix)) {
+ fast_pool->mix.expires = jiffies;
+ add_timer_on(&fast_pool->mix, raw_smp_processor_id());
+ }
}
EXPORT_SYMBOL_GPL(add_interrupt_randomness);
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 748bc4dd9e663f23448d8ad7e58c011a67ea1eca Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 22 Sep 2022 18:46:04 +0200
Subject: [PATCH] random: use expired timer rather than wq for mixing fast pool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, the fast pool was dumped into the main pool periodically in
the fast pool's hard IRQ handler. This worked fine and there weren't
problems with it, until RT came around. Since RT converts spinlocks into
sleeping locks, problems cropped up. Rather than switching to raw
spinlocks, the RT developers preferred we make the transformation from
originally doing:
do_some_stuff()
spin_lock()
do_some_other_stuff()
spin_unlock()
to doing:
do_some_stuff()
queue_work_on(some_other_stuff_worker)
This is an ordinary pattern done all over the kernel. However, Sherry
noticed a 10% performance regression in qperf TCP over a 40gbps
InfiniBand card. Quoting her message:
> MT27500 Family [ConnectX-3] cards:
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80:0000:0000:0000:0010:e000:0178:9eb1
> base lid: 0x6
> sm lid: 0x1
> state: 4: ACTIVE
> phys state: 5: LinkUp
> rate: 40 Gb/sec (4X QDR)
> link_layer: InfiniBand
>
> Cards are configured with IP addresses on private subnet for IPoIB
> performance testing.
> Regression identified in this bug is in TCP latency in this stack as reported
> by qperf tcp_lat metric:
>
> We have one system listen as a qperf server:
> [root@yourQperfServer ~]# qperf
>
> Have the other system connect to qperf server as a client (in this
> case, it’s X7 server with Mellanox card):
> [root@yourQperfClient ~]# numactl -m0 -N0 qperf 20.20.20.101 -v -uu -ub --time 60 --wait_server 20 -oo msg_size:4K:1024K:*2 tcp_lat
Rather than incur the scheduling latency from queue_work_on, we can
instead switch to running on the next timer tick, on the same core. This
also batches things a bit more -- once per jiffy -- which is okay now
that mix_interrupt_randomness() can credit multiple bits at once.
Reported-by: Sherry Yang <sherry.yang(a)oracle.com>
Tested-by: Paul Webb <paul.x.webb(a)oracle.com>
Cc: Sherry Yang <sherry.yang(a)oracle.com>
Cc: Phillip Goerl <phillip.goerl(a)oracle.com>
Cc: Jack Vogel <jack.vogel(a)oracle.com>
Cc: Nicky Veitch <nicky.veitch(a)oracle.com>
Cc: Colm Harrington <colm.harrington(a)oracle.com>
Cc: Ramanan Govindarajan <ramanan.govindarajan(a)oracle.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Sultan Alsawaf <sultan(a)kerneltoast.com>
Cc: stable(a)vger.kernel.org
Fixes: 58340f8e952b ("random: defer fast pool mixing to worker")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index a90d96f4b3bb..e591c6aadca4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -921,17 +921,20 @@ struct fast_pool {
unsigned long pool[4];
unsigned long last;
unsigned int count;
- struct work_struct mix;
+ struct timer_list mix;
};
+static void mix_interrupt_randomness(struct timer_list *work);
+
static DEFINE_PER_CPU(struct fast_pool, irq_randomness) = {
#ifdef CONFIG_64BIT
#define FASTMIX_PERM SIPHASH_PERMUTATION
- .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 }
+ .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 },
#else
#define FASTMIX_PERM HSIPHASH_PERMUTATION
- .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 }
+ .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 },
#endif
+ .mix = __TIMER_INITIALIZER(mix_interrupt_randomness, 0)
};
/*
@@ -973,7 +976,7 @@ int __cold random_online_cpu(unsigned int cpu)
}
#endif
-static void mix_interrupt_randomness(struct work_struct *work)
+static void mix_interrupt_randomness(struct timer_list *work)
{
struct fast_pool *fast_pool = container_of(work, struct fast_pool, mix);
/*
@@ -1027,10 +1030,11 @@ void add_interrupt_randomness(int irq)
if (new_count < 1024 && !time_is_before_jiffies(fast_pool->last + HZ))
return;
- if (unlikely(!fast_pool->mix.func))
- INIT_WORK(&fast_pool->mix, mix_interrupt_randomness);
fast_pool->count |= MIX_INFLIGHT;
- queue_work_on(raw_smp_processor_id(), system_highpri_wq, &fast_pool->mix);
+ if (!timer_pending(&fast_pool->mix)) {
+ fast_pool->mix.expires = jiffies;
+ add_timer_on(&fast_pool->mix, raw_smp_processor_id());
+ }
}
EXPORT_SYMBOL_GPL(add_interrupt_randomness);
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 748bc4dd9e663f23448d8ad7e58c011a67ea1eca Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 22 Sep 2022 18:46:04 +0200
Subject: [PATCH] random: use expired timer rather than wq for mixing fast pool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, the fast pool was dumped into the main pool periodically in
the fast pool's hard IRQ handler. This worked fine and there weren't
problems with it, until RT came around. Since RT converts spinlocks into
sleeping locks, problems cropped up. Rather than switching to raw
spinlocks, the RT developers preferred we make the transformation from
originally doing:
do_some_stuff()
spin_lock()
do_some_other_stuff()
spin_unlock()
to doing:
do_some_stuff()
queue_work_on(some_other_stuff_worker)
This is an ordinary pattern done all over the kernel. However, Sherry
noticed a 10% performance regression in qperf TCP over a 40gbps
InfiniBand card. Quoting her message:
> MT27500 Family [ConnectX-3] cards:
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80:0000:0000:0000:0010:e000:0178:9eb1
> base lid: 0x6
> sm lid: 0x1
> state: 4: ACTIVE
> phys state: 5: LinkUp
> rate: 40 Gb/sec (4X QDR)
> link_layer: InfiniBand
>
> Cards are configured with IP addresses on private subnet for IPoIB
> performance testing.
> Regression identified in this bug is in TCP latency in this stack as reported
> by qperf tcp_lat metric:
>
> We have one system listen as a qperf server:
> [root@yourQperfServer ~]# qperf
>
> Have the other system connect to qperf server as a client (in this
> case, it’s X7 server with Mellanox card):
> [root@yourQperfClient ~]# numactl -m0 -N0 qperf 20.20.20.101 -v -uu -ub --time 60 --wait_server 20 -oo msg_size:4K:1024K:*2 tcp_lat
Rather than incur the scheduling latency from queue_work_on, we can
instead switch to running on the next timer tick, on the same core. This
also batches things a bit more -- once per jiffy -- which is okay now
that mix_interrupt_randomness() can credit multiple bits at once.
Reported-by: Sherry Yang <sherry.yang(a)oracle.com>
Tested-by: Paul Webb <paul.x.webb(a)oracle.com>
Cc: Sherry Yang <sherry.yang(a)oracle.com>
Cc: Phillip Goerl <phillip.goerl(a)oracle.com>
Cc: Jack Vogel <jack.vogel(a)oracle.com>
Cc: Nicky Veitch <nicky.veitch(a)oracle.com>
Cc: Colm Harrington <colm.harrington(a)oracle.com>
Cc: Ramanan Govindarajan <ramanan.govindarajan(a)oracle.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Sultan Alsawaf <sultan(a)kerneltoast.com>
Cc: stable(a)vger.kernel.org
Fixes: 58340f8e952b ("random: defer fast pool mixing to worker")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index a90d96f4b3bb..e591c6aadca4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -921,17 +921,20 @@ struct fast_pool {
unsigned long pool[4];
unsigned long last;
unsigned int count;
- struct work_struct mix;
+ struct timer_list mix;
};
+static void mix_interrupt_randomness(struct timer_list *work);
+
static DEFINE_PER_CPU(struct fast_pool, irq_randomness) = {
#ifdef CONFIG_64BIT
#define FASTMIX_PERM SIPHASH_PERMUTATION
- .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 }
+ .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 },
#else
#define FASTMIX_PERM HSIPHASH_PERMUTATION
- .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 }
+ .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 },
#endif
+ .mix = __TIMER_INITIALIZER(mix_interrupt_randomness, 0)
};
/*
@@ -973,7 +976,7 @@ int __cold random_online_cpu(unsigned int cpu)
}
#endif
-static void mix_interrupt_randomness(struct work_struct *work)
+static void mix_interrupt_randomness(struct timer_list *work)
{
struct fast_pool *fast_pool = container_of(work, struct fast_pool, mix);
/*
@@ -1027,10 +1030,11 @@ void add_interrupt_randomness(int irq)
if (new_count < 1024 && !time_is_before_jiffies(fast_pool->last + HZ))
return;
- if (unlikely(!fast_pool->mix.func))
- INIT_WORK(&fast_pool->mix, mix_interrupt_randomness);
fast_pool->count |= MIX_INFLIGHT;
- queue_work_on(raw_smp_processor_id(), system_highpri_wq, &fast_pool->mix);
+ if (!timer_pending(&fast_pool->mix)) {
+ fast_pool->mix.expires = jiffies;
+ add_timer_on(&fast_pool->mix, raw_smp_processor_id());
+ }
}
EXPORT_SYMBOL_GPL(add_interrupt_randomness);
The patch below does not apply to the 5.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 748bc4dd9e663f23448d8ad7e58c011a67ea1eca Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 22 Sep 2022 18:46:04 +0200
Subject: [PATCH] random: use expired timer rather than wq for mixing fast pool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, the fast pool was dumped into the main pool periodically in
the fast pool's hard IRQ handler. This worked fine and there weren't
problems with it, until RT came around. Since RT converts spinlocks into
sleeping locks, problems cropped up. Rather than switching to raw
spinlocks, the RT developers preferred we make the transformation from
originally doing:
do_some_stuff()
spin_lock()
do_some_other_stuff()
spin_unlock()
to doing:
do_some_stuff()
queue_work_on(some_other_stuff_worker)
This is an ordinary pattern done all over the kernel. However, Sherry
noticed a 10% performance regression in qperf TCP over a 40gbps
InfiniBand card. Quoting her message:
> MT27500 Family [ConnectX-3] cards:
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80:0000:0000:0000:0010:e000:0178:9eb1
> base lid: 0x6
> sm lid: 0x1
> state: 4: ACTIVE
> phys state: 5: LinkUp
> rate: 40 Gb/sec (4X QDR)
> link_layer: InfiniBand
>
> Cards are configured with IP addresses on private subnet for IPoIB
> performance testing.
> Regression identified in this bug is in TCP latency in this stack as reported
> by qperf tcp_lat metric:
>
> We have one system listen as a qperf server:
> [root@yourQperfServer ~]# qperf
>
> Have the other system connect to qperf server as a client (in this
> case, it’s X7 server with Mellanox card):
> [root@yourQperfClient ~]# numactl -m0 -N0 qperf 20.20.20.101 -v -uu -ub --time 60 --wait_server 20 -oo msg_size:4K:1024K:*2 tcp_lat
Rather than incur the scheduling latency from queue_work_on, we can
instead switch to running on the next timer tick, on the same core. This
also batches things a bit more -- once per jiffy -- which is okay now
that mix_interrupt_randomness() can credit multiple bits at once.
Reported-by: Sherry Yang <sherry.yang(a)oracle.com>
Tested-by: Paul Webb <paul.x.webb(a)oracle.com>
Cc: Sherry Yang <sherry.yang(a)oracle.com>
Cc: Phillip Goerl <phillip.goerl(a)oracle.com>
Cc: Jack Vogel <jack.vogel(a)oracle.com>
Cc: Nicky Veitch <nicky.veitch(a)oracle.com>
Cc: Colm Harrington <colm.harrington(a)oracle.com>
Cc: Ramanan Govindarajan <ramanan.govindarajan(a)oracle.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Sultan Alsawaf <sultan(a)kerneltoast.com>
Cc: stable(a)vger.kernel.org
Fixes: 58340f8e952b ("random: defer fast pool mixing to worker")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index a90d96f4b3bb..e591c6aadca4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -921,17 +921,20 @@ struct fast_pool {
unsigned long pool[4];
unsigned long last;
unsigned int count;
- struct work_struct mix;
+ struct timer_list mix;
};
+static void mix_interrupt_randomness(struct timer_list *work);
+
static DEFINE_PER_CPU(struct fast_pool, irq_randomness) = {
#ifdef CONFIG_64BIT
#define FASTMIX_PERM SIPHASH_PERMUTATION
- .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 }
+ .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 },
#else
#define FASTMIX_PERM HSIPHASH_PERMUTATION
- .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 }
+ .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 },
#endif
+ .mix = __TIMER_INITIALIZER(mix_interrupt_randomness, 0)
};
/*
@@ -973,7 +976,7 @@ int __cold random_online_cpu(unsigned int cpu)
}
#endif
-static void mix_interrupt_randomness(struct work_struct *work)
+static void mix_interrupt_randomness(struct timer_list *work)
{
struct fast_pool *fast_pool = container_of(work, struct fast_pool, mix);
/*
@@ -1027,10 +1030,11 @@ void add_interrupt_randomness(int irq)
if (new_count < 1024 && !time_is_before_jiffies(fast_pool->last + HZ))
return;
- if (unlikely(!fast_pool->mix.func))
- INIT_WORK(&fast_pool->mix, mix_interrupt_randomness);
fast_pool->count |= MIX_INFLIGHT;
- queue_work_on(raw_smp_processor_id(), system_highpri_wq, &fast_pool->mix);
+ if (!timer_pending(&fast_pool->mix)) {
+ fast_pool->mix.expires = jiffies;
+ add_timer_on(&fast_pool->mix, raw_smp_processor_id());
+ }
}
EXPORT_SYMBOL_GPL(add_interrupt_randomness);
The patch below does not apply to the 6.0-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 748bc4dd9e663f23448d8ad7e58c011a67ea1eca Mon Sep 17 00:00:00 2001
From: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Date: Thu, 22 Sep 2022 18:46:04 +0200
Subject: [PATCH] random: use expired timer rather than wq for mixing fast pool
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Previously, the fast pool was dumped into the main pool periodically in
the fast pool's hard IRQ handler. This worked fine and there weren't
problems with it, until RT came around. Since RT converts spinlocks into
sleeping locks, problems cropped up. Rather than switching to raw
spinlocks, the RT developers preferred we make the transformation from
originally doing:
do_some_stuff()
spin_lock()
do_some_other_stuff()
spin_unlock()
to doing:
do_some_stuff()
queue_work_on(some_other_stuff_worker)
This is an ordinary pattern done all over the kernel. However, Sherry
noticed a 10% performance regression in qperf TCP over a 40gbps
InfiniBand card. Quoting her message:
> MT27500 Family [ConnectX-3] cards:
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80:0000:0000:0000:0010:e000:0178:9eb1
> base lid: 0x6
> sm lid: 0x1
> state: 4: ACTIVE
> phys state: 5: LinkUp
> rate: 40 Gb/sec (4X QDR)
> link_layer: InfiniBand
>
> Cards are configured with IP addresses on private subnet for IPoIB
> performance testing.
> Regression identified in this bug is in TCP latency in this stack as reported
> by qperf tcp_lat metric:
>
> We have one system listen as a qperf server:
> [root@yourQperfServer ~]# qperf
>
> Have the other system connect to qperf server as a client (in this
> case, it’s X7 server with Mellanox card):
> [root@yourQperfClient ~]# numactl -m0 -N0 qperf 20.20.20.101 -v -uu -ub --time 60 --wait_server 20 -oo msg_size:4K:1024K:*2 tcp_lat
Rather than incur the scheduling latency from queue_work_on, we can
instead switch to running on the next timer tick, on the same core. This
also batches things a bit more -- once per jiffy -- which is okay now
that mix_interrupt_randomness() can credit multiple bits at once.
Reported-by: Sherry Yang <sherry.yang(a)oracle.com>
Tested-by: Paul Webb <paul.x.webb(a)oracle.com>
Cc: Sherry Yang <sherry.yang(a)oracle.com>
Cc: Phillip Goerl <phillip.goerl(a)oracle.com>
Cc: Jack Vogel <jack.vogel(a)oracle.com>
Cc: Nicky Veitch <nicky.veitch(a)oracle.com>
Cc: Colm Harrington <colm.harrington(a)oracle.com>
Cc: Ramanan Govindarajan <ramanan.govindarajan(a)oracle.com>
Cc: Sebastian Andrzej Siewior <bigeasy(a)linutronix.de>
Cc: Dominik Brodowski <linux(a)dominikbrodowski.net>
Cc: Tejun Heo <tj(a)kernel.org>
Cc: Sultan Alsawaf <sultan(a)kerneltoast.com>
Cc: stable(a)vger.kernel.org
Fixes: 58340f8e952b ("random: defer fast pool mixing to worker")
Signed-off-by: Jason A. Donenfeld <Jason(a)zx2c4.com>
diff --git a/drivers/char/random.c b/drivers/char/random.c
index a90d96f4b3bb..e591c6aadca4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -921,17 +921,20 @@ struct fast_pool {
unsigned long pool[4];
unsigned long last;
unsigned int count;
- struct work_struct mix;
+ struct timer_list mix;
};
+static void mix_interrupt_randomness(struct timer_list *work);
+
static DEFINE_PER_CPU(struct fast_pool, irq_randomness) = {
#ifdef CONFIG_64BIT
#define FASTMIX_PERM SIPHASH_PERMUTATION
- .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 }
+ .pool = { SIPHASH_CONST_0, SIPHASH_CONST_1, SIPHASH_CONST_2, SIPHASH_CONST_3 },
#else
#define FASTMIX_PERM HSIPHASH_PERMUTATION
- .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 }
+ .pool = { HSIPHASH_CONST_0, HSIPHASH_CONST_1, HSIPHASH_CONST_2, HSIPHASH_CONST_3 },
#endif
+ .mix = __TIMER_INITIALIZER(mix_interrupt_randomness, 0)
};
/*
@@ -973,7 +976,7 @@ int __cold random_online_cpu(unsigned int cpu)
}
#endif
-static void mix_interrupt_randomness(struct work_struct *work)
+static void mix_interrupt_randomness(struct timer_list *work)
{
struct fast_pool *fast_pool = container_of(work, struct fast_pool, mix);
/*
@@ -1027,10 +1030,11 @@ void add_interrupt_randomness(int irq)
if (new_count < 1024 && !time_is_before_jiffies(fast_pool->last + HZ))
return;
- if (unlikely(!fast_pool->mix.func))
- INIT_WORK(&fast_pool->mix, mix_interrupt_randomness);
fast_pool->count |= MIX_INFLIGHT;
- queue_work_on(raw_smp_processor_id(), system_highpri_wq, &fast_pool->mix);
+ if (!timer_pending(&fast_pool->mix)) {
+ fast_pool->mix.expires = jiffies;
+ add_timer_on(&fast_pool->mix, raw_smp_processor_id());
+ }
}
EXPORT_SYMBOL_GPL(add_interrupt_randomness);
For G200_SE_A, PLL M setting is wrong, which leads to blank screen,
or "signal out of range" on VGA display.
previous code had "m |= 0x80" which was changed to
m |= ((pixpllcn & BIT(8)) >> 1);
Tested on G200_SE_A rev 42
This line of code was moved to another file with
commit 85397f6bc4ff ("drm/mgag200: Initialize each model in separate
function") but can be easily backported before this commit.
Fixes: 2dd040946ecf ("drm/mgag200: Store values (not bits) in struct mgag200_pll_values")
Cc: stable(a)vger.kernel.org
Signed-off-by: Jocelyn Falempe <jfalempe(a)redhat.com>
---
drivers/gpu/drm/mgag200/mgag200_g200se.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/mgag200/mgag200_g200se.c b/drivers/gpu/drm/mgag200/mgag200_g200se.c
index be389ed91cbd..4ec035029b8b 100644
--- a/drivers/gpu/drm/mgag200/mgag200_g200se.c
+++ b/drivers/gpu/drm/mgag200/mgag200_g200se.c
@@ -284,7 +284,7 @@ static void mgag200_g200se_04_pixpllc_atomic_update(struct drm_crtc *crtc,
pixpllcp = pixpllc->p - 1;
pixpllcs = pixpllc->s;
- xpixpllcm = pixpllcm | ((pixpllcn & BIT(8)) >> 1);
+ xpixpllcm = pixpllcm | BIT(7);
xpixpllcn = pixpllcn;
xpixpllcp = (pixpllcs << 3) | pixpllcp;
--
2.37.3
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
e3b49ea36802 ("f2fs: invalidate META_MAPPING before IPU/DIO write")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e3b49ea36802053f312013fd4ccb6e59920a9f76 Mon Sep 17 00:00:00 2001
From: Hyeong-Jun Kim <hj514.kim(a)samsung.com>
Date: Tue, 2 Nov 2021 16:10:02 +0900
Subject: [PATCH] f2fs: invalidate META_MAPPING before IPU/DIO write
Encrypted pages during GC are read and cached in META_MAPPING.
However, due to cached pages in META_MAPPING, there is an issue where
newly written pages are lost by IPU or DIO writes.
Thread A - f2fs_gc() Thread B
/* phase 3 */
down_write(i_gc_rwsem)
ra_data_block() ---- (a)
up_write(i_gc_rwsem)
f2fs_direct_IO() :
- down_read(i_gc_rwsem)
- __blockdev_direct_io()
- get_data_block_dio_write()
- f2fs_dio_submit_bio() ---- (b)
- up_read(i_gc_rwsem)
/* phase 4 */
down_write(i_gc_rwsem)
move_data_block() ---- (c)
up_write(i_gc_rwsem)
(a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and
cached in META_MAPPING.
(b) In thread B, writing new data by IPU or DIO write on same blkaddr as
read in (a). cached page in META_MAPPING become out-dated.
(c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to
new blkaddr. In conclusion, the newly written data in (b) is lost.
To address this issue, invalidating pages in META_MAPPING before IPU or
DIO write.
Fixes: 6aa58d8ad20a ("f2fs: readahead encrypted block during GC")
Signed-off-by: Hyeong-Jun Kim <hj514.kim(a)samsung.com>
Reviewed-by: Chao Yu <chao(a)kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk(a)kernel.org>
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 74e1a350c1d8..9f754aaef558 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1708,6 +1708,8 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map,
*/
f2fs_wait_on_block_writeback_range(inode,
map->m_pblk, map->m_len);
+ invalidate_mapping_pages(META_MAPPING(sbi),
+ map->m_pblk, map->m_pblk);
if (map->m_multidev_dio) {
block_t blk_addr = map->m_pblk;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 526423fe84ce..df9ed75f0b7a 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -3652,6 +3652,9 @@ int f2fs_inplace_write_data(struct f2fs_io_info *fio)
goto drop_bio;
}
+ invalidate_mapping_pages(META_MAPPING(sbi),
+ fio->new_blkaddr, fio->new_blkaddr);
+
stat_inc_inplace_blocks(fio->sbi);
if (fio->bio && !(SM_I(sbi)->ipu_policy & (1 << F2FS_IPU_NOCACHE)))
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
e3b49ea36802 ("f2fs: invalidate META_MAPPING before IPU/DIO write")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From e3b49ea36802053f312013fd4ccb6e59920a9f76 Mon Sep 17 00:00:00 2001
From: Hyeong-Jun Kim <hj514.kim(a)samsung.com>
Date: Tue, 2 Nov 2021 16:10:02 +0900
Subject: [PATCH] f2fs: invalidate META_MAPPING before IPU/DIO write
Encrypted pages during GC are read and cached in META_MAPPING.
However, due to cached pages in META_MAPPING, there is an issue where
newly written pages are lost by IPU or DIO writes.
Thread A - f2fs_gc() Thread B
/* phase 3 */
down_write(i_gc_rwsem)
ra_data_block() ---- (a)
up_write(i_gc_rwsem)
f2fs_direct_IO() :
- down_read(i_gc_rwsem)
- __blockdev_direct_io()
- get_data_block_dio_write()
- f2fs_dio_submit_bio() ---- (b)
- up_read(i_gc_rwsem)
/* phase 4 */
down_write(i_gc_rwsem)
move_data_block() ---- (c)
up_write(i_gc_rwsem)
(a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and
cached in META_MAPPING.
(b) In thread B, writing new data by IPU or DIO write on same blkaddr as
read in (a). cached page in META_MAPPING become out-dated.
(c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to
new blkaddr. In conclusion, the newly written data in (b) is lost.
To address this issue, invalidating pages in META_MAPPING before IPU or
DIO write.
Fixes: 6aa58d8ad20a ("f2fs: readahead encrypted block during GC")
Signed-off-by: Hyeong-Jun Kim <hj514.kim(a)samsung.com>
Reviewed-by: Chao Yu <chao(a)kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk(a)kernel.org>
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 74e1a350c1d8..9f754aaef558 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1708,6 +1708,8 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map,
*/
f2fs_wait_on_block_writeback_range(inode,
map->m_pblk, map->m_len);
+ invalidate_mapping_pages(META_MAPPING(sbi),
+ map->m_pblk, map->m_pblk);
if (map->m_multidev_dio) {
block_t blk_addr = map->m_pblk;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 526423fe84ce..df9ed75f0b7a 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -3652,6 +3652,9 @@ int f2fs_inplace_write_data(struct f2fs_io_info *fio)
goto drop_bio;
}
+ invalidate_mapping_pages(META_MAPPING(sbi),
+ fio->new_blkaddr, fio->new_blkaddr);
+
stat_inc_inplace_blocks(fio->sbi);
if (fio->bio && !(SM_I(sbi)->ipu_policy & (1 << F2FS_IPU_NOCACHE)))
Hi Maitainers
This patch is a fix in kceph module and should be backported to any
affected stable old kernels. And the original patch missed tagging
stable and got merged already months ago:
commit 7cb9994754f8a36ae9e5ec4597c5c4c2d6c03832
Author: Hu Weiwen <sehuww(a)mail.scut.edu.cn>
Date: Fri Jul 1 10:52:27 2022 +0800
ceph: don't truncate file in atomic_open
Clear O_TRUNC from the flags sent in the MDS create request.
`atomic_open' is called before permission check. We should not do any
modification to the file here. The caller will do the truncation
afterward.
Fixes: 124e68e74099 ("ceph: file operations")
Signed-off-by: Hu Weiwen <sehuww(a)mail.scut.edu.cn>
Reviewed-by: Xiubo Li <xiubli(a)redhat.com>
Signed-off-by: Ilya Dryomov <idryomov(a)gmail.com>
Just a single patch.
I am not very sure this is the correct way to do this, if anything else
I need to do to backport this to old kernels please let me know.
Thanks!
- Xiubo
On 01/07/2022 10:52, Hu Weiwen wrote:
> Clear O_TRUNC from the flags sent in the MDS create request.
>
> `atomic_open' is called before permission check. We should not do any
> modification to the file here. The caller will do the truncation
> afterward.
>
> Fixes: 124e68e74099 ("ceph: file operations")
> Signed-off-by: Hu Weiwen <sehuww(a)mail.scut.edu.cn>
> ---
> rebased onto ceph_client repo testing branch
>
> fs/ceph/file.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 296fd1c7ece8..289e66e9cbb0 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -745,6 +745,11 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> err = ceph_wait_on_conflict_unlink(dentry);
> if (err)
> return err;
> + /*
> + * Do not truncate the file, since atomic_open is called before the
> + * permission check. The caller will do the truncation afterward.
> + */
> + flags &= ~O_TRUNC;
>
> retry:
> if (flags & O_CREAT) {
> @@ -836,9 +841,7 @@ int ceph_atomic_open(struct inode *dir, struct dentry *dentry,
> set_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> req->r_new_inode = new_inode;
> new_inode = NULL;
> - err = ceph_mdsc_do_request(mdsc,
> - (flags & (O_CREAT|O_TRUNC)) ? dir : NULL,
> - req);
> + err = ceph_mdsc_do_request(mdsc, (flags & O_CREAT) ? dir : NULL, req);
> if (err == -ENOENT) {
> dentry = ceph_handle_snapdir(req, dentry);
> if (IS_ERR(dentry)) {
On Tue, Oct 11, 2022 at 02:12:27PM -0700, Bhatnagar, Rishabh wrote:
> Hi
>
> The following patch is needed in 5.10 and 5.15 stable trees.
>
> Commit: 61ce339f19fabbc3e51237148a7ef6f2270e44fa
> Subject: nvme-pci: set min_align_mask before calculating max_hw_sectors
>
> This patch fixes the swiotlb usecase for nvme driver
> where if dma_min_align is not considered the requested buffer
> size overflows the swiotlb buffer.
> Patch applies cleanly for 5.15/5.10 stable trees.
The patch does not apply cleanly at all, how did you test this? Please
provide a working backport for those trees if you wish to have it
applied there.
thanks,
greg k-h
On Tue, Oct 11, 2022 at 02:12:27PM -0700, Bhatnagar, Rishabh wrote:
> Hi
>
> The following patch is needed in 5.10 and 5.15 stable trees.
>
> Commit: 61ce339f19fabbc3e51237148a7ef6f2270e44fa
> Subject: nvme-pci: set min_align_mask before calculating max_hw_sectors
>
> This patch fixes the swiotlb usecase for nvme driver
> where if dma_min_align is not considered the requested buffer
> size overflows the swiotlb buffer.
> Patch applies cleanly for 5.15/5.10 stable trees.
What about 5.19 and 6.0? You don't want someone upgrading to a newer
kernel and having a regression, right?
thanks,
greg k-h
cherry picked from commit e3b49ea36802053f312013fd4ccb6e59920a9f76
[Please apply to 5.10-stable and 5.15-stable.]
BACKPORT: f2fs: invalidate META_MAPPING before IPU/DIO write
Encrypted pages during GC are read and cached in META_MAPPING.
However, due to cached pages in META_MAPPING, there is an issue where
newly written pages are lost by IPU or DIO writes.
Thread A - f2fs_gc() Thread B
/* phase 3 */
down_write(i_gc_rwsem)
ra_data_block() ---- (a)
up_write(i_gc_rwsem)
f2fs_direct_IO() :
- down_read(i_gc_rwsem)
- __blockdev_direct_io()
- get_data_block_dio_write()
- f2fs_dio_submit_bio() ---- (b)
- up_read(i_gc_rwsem)
/* phase 4 */
down_write(i_gc_rwsem)
move_data_block() ---- (c)
up_write(i_gc_rwsem)
(a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and
cached in META_MAPPING.
(b) In thread B, writing new data by IPU or DIO write on same blkaddr as
read in (a). cached page in META_MAPPING become out-dated.
(c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to
new blkaddr. In conclusion, the newly written data in (b) is lost.
To address this issue, invalidating pages in META_MAPPING before IPU or
DIO write.
Fixes: 6aa58d8ad20a ("f2fs: readahead encrypted block during GC")
Cc: stable(a)vger.kernel.org
Signed-off-by: Hyeong-Jun Kim <hj514.kim(a)samsung.com>
Reviewed-by: Chao Yu <chao(a)kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk(a)kernel.org>
Signed-off-by: Chao Yu <chao(a)kernel.org>
(cherry picked from commit e3b49ea36802053f312013fd4ccb6e59920a9f76)
Signed-off-by: lvgaofei <lvgaofei(a)oppo.com>
---
fs/f2fs/data.c | 5 ++++-
fs/f2fs/segment.c | 3 +++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b2016fd..994a09e 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1713,9 +1713,12 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map,
sync_out:
/* for hardware encryption, but to avoid potential issue in future */
- if (flag == F2FS_GET_BLOCK_DIO && map->m_flags & F2FS_MAP_MAPPED)
+ if (flag == F2FS_GET_BLOCK_DIO && map->m_flags & F2FS_MAP_MAPPED) {
f2fs_wait_on_block_writeback_range(inode,
map->m_pblk, map->m_len);
+ invalidate_mapping_pages(META_MAPPING(sbi),
+ map->m_pblk, map->m_pblk);
+ }
if (flag == F2FS_GET_BLOCK_PRECACHE) {
if (map->m_flags & F2FS_MAP_MAPPED) {
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 19224e7..8549332 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -3547,6 +3547,9 @@ int f2fs_inplace_write_data(struct f2fs_io_info *fio)
return -EFSCORRUPTED;
}
+ invalidate_mapping_pages(META_MAPPING(sbi),
+ fio->new_blkaddr, fio->new_blkaddr);
+
stat_inc_inplace_blocks(fio->sbi);
if (fio->bio && !(SM_I(sbi)->ipu_policy & (1 << F2FS_IPU_NOCACHE)))
--
2.7.4
________________________________
OPPO
本电子邮件及其附件含有OPPO公司的保密信息,仅限于邮件指明的收件人使用(包含个人及群组)。禁止任何人在未经授权的情况下以任何形式使用。如果您错收了本邮件,请立即以电子邮件通知发件人并删除本邮件及其附件。
This e-mail and its attachments contain confidential information from OPPO, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!
BACKPORT: f2fs: invalidate META_MAPPING before IPU/DIO write
Encrypted pages during GC are read and cached in META_MAPPING.
However, due to cached pages in META_MAPPING, there is an issue where
newly written pages are lost by IPU or DIO writes.
Thread A - f2fs_gc() Thread B
/* phase 3 */
down_write(i_gc_rwsem)
ra_data_block() ---- (a)
up_write(i_gc_rwsem)
f2fs_direct_IO() :
- down_read(i_gc_rwsem)
- __blockdev_direct_io()
- get_data_block_dio_write()
- f2fs_dio_submit_bio() ---- (b)
- up_read(i_gc_rwsem)
/* phase 4 */
down_write(i_gc_rwsem)
move_data_block() ---- (c)
up_write(i_gc_rwsem)
(a) In phase 3 of f2fs_gc(), up-to-date page is read from storage and
cached in META_MAPPING.
(b) In thread B, writing new data by IPU or DIO write on same blkaddr as
read in (a). cached page in META_MAPPING become out-dated.
(c) In phase 4 of f2fs_gc(), out-dated page in META_MAPPING is copied to
new blkaddr. In conclusion, the newly written data in (b) is lost.
To address this issue, invalidating pages in META_MAPPING before IPU or
DIO write.
Fixes: 6aa58d8ad20a ("f2fs: readahead encrypted block during GC")
Signed-off-by: Hyeong-Jun Kim <hj514.kim(a)samsung.com>
Reviewed-by: Chao Yu <chao(a)kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk(a)kernel.org>
Signed-off-by: Chao Yu <chao(a)kernel.org>
Signed-off-by: lvgaofei <lvgaofei(a)oppo.com>
(cherry picked from commit e3b49ea36802053f312013fd4ccb6e59920a9f76)
---
fs/f2fs/data.c | 5 ++++-
fs/f2fs/segment.c | 3 +++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index b2016fd..994a09e 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1713,9 +1713,12 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map,
sync_out:
/* for hardware encryption, but to avoid potential issue in future */
- if (flag == F2FS_GET_BLOCK_DIO && map->m_flags & F2FS_MAP_MAPPED)
+ if (flag == F2FS_GET_BLOCK_DIO && map->m_flags & F2FS_MAP_MAPPED) {
f2fs_wait_on_block_writeback_range(inode,
map->m_pblk, map->m_len);
+ invalidate_mapping_pages(META_MAPPING(sbi),
+ map->m_pblk, map->m_pblk);
+ }
if (flag == F2FS_GET_BLOCK_PRECACHE) {
if (map->m_flags & F2FS_MAP_MAPPED) {
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 19224e7..8549332 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -3547,6 +3547,9 @@ int f2fs_inplace_write_data(struct f2fs_io_info *fio)
return -EFSCORRUPTED;
}
+ invalidate_mapping_pages(META_MAPPING(sbi),
+ fio->new_blkaddr, fio->new_blkaddr);
+
stat_inc_inplace_blocks(fio->sbi);
if (fio->bio && !(SM_I(sbi)->ipu_policy & (1 << F2FS_IPU_NOCACHE)))
--
2.7.4
________________________________
OPPO
本电子邮件及其附件含有OPPO公司的保密信息,仅限于邮件指明的收件人使用(包含个人及群组)。禁止任何人在未经授权的情况下以任何形式使用。如果您错收了本邮件,请立即以电子邮件通知发件人并删除本邮件及其附件。
This e-mail and its attachments contain confidential information from OPPO, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!
吕高飞(Gaofei Lyu) 将撤回邮件“[PATCH] Cherry picked from commit e3b49ea36802053f312013fd4ccb6e59920a9f76 [Please apply to 5.10-stable and 5.15-stable.]”。
________________________________
OPPO
本电子邮件及其附件含有OPPO公司的保密信息,仅限于邮件指明的收件人使用(包含个人及群组)。禁止任何人在未经授权的情况下以任何形式使用。如果您错收了本邮件,请立即以电子邮件通知发件人并删除本邮件及其附件。
This e-mail and its attachments contain confidential information from OPPO, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!