From: Ard Biesheuvel <ardb(a)kernel.org>
The legacy decompressor has elaborate logic to ensure that the
randomized physical placement of the decompressed kernel image does not
conflict with any memory reservations, including ones specified on the
command line using mem=, memmap=, efi_fake_mem= or hugepages=, which are
taken into account by the kernel proper at a later stage.
When booting in EFI mode, it is the firmware's job to ensure that the
chosen range does not conflict with any memory reservations that it
knows about, and this is trivially achieved by using the firmware's
memory allocation APIs.
That leaves reservations specified on the command line, though, which
the firmware knows nothing about, as these regions have no other special
significance to the platform. Since commit
a1b87d54f4e4 ("x86/efistub: Avoid legacy decompressor when doing EFI boot")
these reservations are not taken into account when randomizing the
physical placement, which may result in conflicts where the memory
cannot be reserved by the kernel proper because its own executable image
resides there.
To avoid having to duplicate or reuse the existing complicated logic,
disable physical KASLR entirely when such overrides are specified. These
are mostly diagnostic tools or niche features, and physical KASLR (as
opposed to virtual KASLR, which is much more important as it affects the
memory addresses observed by code executing in the kernel) is something
we can live without.
Closes: https://lkml.kernel.org/r/FA5F6719-8824-4B04-803E-82990E65E627%40akamai.com
Reported-by: Ben Chaney <bchaney(a)akamai.com>
Fixes: a1b87d54f4e4 ("x86/efistub: Avoid legacy decompressor when doing EFI boot")
Cc: <stable(a)vger.kernel.org> # v6.1+
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
---
drivers/firmware/efi/libstub/x86-stub.c | 28 +++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c
index d5a8182cf2e1..1983fd3bf392 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -776,6 +776,26 @@ static void error(char *str)
efi_warn("Decompression failed: %s\n", str);
}
+static const char *cmdline_memmap_override;
+
+static efi_status_t parse_options(const char *cmdline)
+{
+ static const char opts[][14] = {
+ "mem=", "memmap=", "efi_fake_mem=", "hugepages="
+ };
+
+ for (int i = 0; i < ARRAY_SIZE(opts); i++) {
+ const char *p = strstr(cmdline, opts[i]);
+
+ if (p == cmdline || (p > cmdline && isspace(p[-1]))) {
+ cmdline_memmap_override = opts[i];
+ break;
+ }
+ }
+
+ return efi_parse_options(cmdline);
+}
+
static efi_status_t efi_decompress_kernel(unsigned long *kernel_entry)
{
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
@@ -807,6 +827,10 @@ static efi_status_t efi_decompress_kernel(unsigned long *kernel_entry)
!memcmp(efistub_fw_vendor(), ami, sizeof(ami))) {
efi_debug("AMI firmware v2.0 or older detected - disabling physical KASLR\n");
seed[0] = 0;
+ } else if (cmdline_memmap_override) {
+ efi_info("%s detected on the kernel command line - disabling physical KASLR\n",
+ cmdline_memmap_override);
+ seed[0] = 0;
}
boot_params_ptr->hdr.loadflags |= KASLR_FLAG;
@@ -883,7 +907,7 @@ void __noreturn efi_stub_entry(efi_handle_t handle,
}
#ifdef CONFIG_CMDLINE_BOOL
- status = efi_parse_options(CONFIG_CMDLINE);
+ status = parse_options(CONFIG_CMDLINE);
if (status != EFI_SUCCESS) {
efi_err("Failed to parse options\n");
goto fail;
@@ -892,7 +916,7 @@ void __noreturn efi_stub_entry(efi_handle_t handle,
if (!IS_ENABLED(CONFIG_CMDLINE_OVERRIDE)) {
unsigned long cmdline_paddr = ((u64)hdr->cmd_line_ptr |
((u64)boot_params->ext_cmd_line_ptr << 32));
- status = efi_parse_options((char *)cmdline_paddr);
+ status = parse_options((char *)cmdline_paddr);
if (status != EFI_SUCCESS) {
efi_err("Failed to parse options\n");
goto fail;
--
2.45.0.rc1.225.g2a3ae87e7f-goog
The canary in collect_domain_accesses() can be triggered when trying to
link a root mount point. This cannot work in practice because this
directory is mounted, but the VFS check is done after the call to
security_path_link().
Do not use source directory's d_parent when the source directory is the
mount point.
Add tests to check error codes when renaming or linking a mount root
directory. This previously triggered a kernel warning. The
linux/mount.h file is not sorted with other headers to ease backport to
Linux 6.1 .
Cc: Günther Noack <gnoack(a)google.com>
Cc: Paul Moore <paul(a)paul-moore.com>
Cc: stable(a)vger.kernel.org
Reported-by: syzbot+bf4903dc7e12b18ebc87(a)syzkaller.appspotmail.com
Fixes: b91c3e4ea756 ("landlock: Add support for file reparenting with LANDLOCK_ACCESS_FS_REFER")
Closes: https://lore.kernel.org/r/000000000000553d3f0618198200@google.com
Signed-off-by: Mickaël Salaün <mic(a)digikod.net>
Link: https://lore.kernel.org/r/20240516181935.1645983-2-mic@digikod.net
---
security/landlock/fs.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index 22d8b7c28074..7877a64cc6b8 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -1110,6 +1110,7 @@ static int current_check_refer_path(struct dentry *const old_dentry,
bool allow_parent1, allow_parent2;
access_mask_t access_request_parent1, access_request_parent2;
struct path mnt_dir;
+ struct dentry *old_parent;
layer_mask_t layer_masks_parent1[LANDLOCK_NUM_ACCESS_FS] = {},
layer_masks_parent2[LANDLOCK_NUM_ACCESS_FS] = {};
@@ -1157,9 +1158,17 @@ static int current_check_refer_path(struct dentry *const old_dentry,
mnt_dir.mnt = new_dir->mnt;
mnt_dir.dentry = new_dir->mnt->mnt_root;
+ /*
+ * old_dentry may be the root of the common mount point and
+ * !IS_ROOT(old_dentry) at the same time (e.g. with open_tree() and
+ * OPEN_TREE_CLONE). We do not need to call dget(old_parent) because
+ * we keep a reference to old_dentry.
+ */
+ old_parent = (old_dentry == mnt_dir.dentry) ? old_dentry :
+ old_dentry->d_parent;
+
/* new_dir->dentry is equal to new_dentry->d_parent */
- allow_parent1 = collect_domain_accesses(dom, mnt_dir.dentry,
- old_dentry->d_parent,
+ allow_parent1 = collect_domain_accesses(dom, mnt_dir.dentry, old_parent,
&layer_masks_parent1);
allow_parent2 = collect_domain_accesses(
dom, mnt_dir.dentry, new_dir->dentry, &layer_masks_parent2);
--
2.45.0
Hello,
I encountered an issue when upgrading to 6.1.89 from 6.1.77. This upgrade caused a breakage in emulated persistent memory. Significant amounts of memory are missing from a pmem device:
fdisk -l /dev/pmem*
Disk /dev/pmem0: 355.9 GiB, 382117871616 bytes, 746323968 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/pmem1: 25.38 GiB, 27246198784 bytes, 53215232 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
The memmap parameter that created these pmem devices is “memmap=364416M!28672M,367488M!419840M”, which should cause a much larger amount of memory to be allocated to /dev/pmem1. The amount of missing memory and the device it is missing from is randomized on each reboot. There is some amount of memory missing in almost all cases, but not 100% of the time. Notably, the memory that is missing from these devices is not reclaimed by the system for general use. This system in question has 768GB of memory split evenly across two NUMA nodes.
When the error occurs, there are also the following error messages showing up in dmesg:
[ 5.318317] nd_pmem namespace1.0: [mem 0x5c2042c000-0x5ff7ffffff flags 0x200] misaligned, unable to map
[ 5.335073] nd_pmem: probe of namespace1.0 failed with error -95
Bisection implicates 2dfaeac3f38e4e550d215204eedd97a061fdc118 as the patch that first caused the issue. I believe the cause of the issue is that the EFI stub is randomizing the location of the decompressed kernel without accounting for the memory map, and it is clobbering some of the memory that has been reserved for pmem.
Thank you,
Ben Chaney
Hello,
Upstream commit 11fbb1bfb5bc8c98b2d7db9da332b5e568f4aaab ("ice: use
relative VSI index for VFs VSIs") was applied to stable 6.1, 6.6 and 6.8:
6.1: 5693dd6d3d01f0eea24401f815c98b64cb315b67
6.6: c926393dc3442c38fdcab17d040837cf4acad1c3
6.8: d3da0d4d9fb472ad7dccb784f3d9de40d0c2f6a9
However, it was a part of a series submitted to net-next [1]. Applying
this one patch on its own broke the VF devices created with the ice as a PF:
# [ 307.688237] iavf: Intel(R) Ethernet Adaptive Virtual Function
Network Driver
# [ 307.688241] Copyright (c) 2013 - 2018 Intel Corporation.
# [ 307.688424] iavf 0000:af:01.0: enabling device (0000 -> 0002)
# [ 307.758860] iavf 0000:af:01.0: Invalid MAC address
00:00:00:00:00:00, using random
# [ 307.759628] iavf 0000:af:01.0: Multiqueue Enabled: Queue pair
count = 16
# [ 307.759683] iavf 0000:af:01.0: MAC address: 6a:46:83:88:c2:26
# [ 307.759688] iavf 0000:af:01.0: GRO is enabled
# [ 307.790937] iavf 0000:af:01.0 ens802f0v0: renamed from eth0
# [ 307.896041] iavf 0000:af:01.0: PF returned error -5
(IAVF_ERR_PARAM) to our request 6
# [ 307.916967] iavf 0000:af:01.0: PF returned error -5
(IAVF_ERR_PARAM) to our request 8
The VF initialization fails and the VF device is completely unusable.
This can be fixed either by:
1 - Reverting the above mentioned commit (upstream
11fbb1bfb5bc8c98b2d7db9da332b5e568f4aaab)
Or,
2 - applying the following upstream commits (part of the series):
a) a21605993dd5dfd15edfa7f06705ede17b519026 ("ice: pass VSI pointer
into ice_vc_isvalid_q_id")
b) 363f689600dd010703ce6391bcfc729a97d21840 ("ice: remove unnecessary
duplicate checks for VF VSI ID")
Thanks,
Ahmed
[1]: https://www.spinics.net/lists/netdev/msg979289.html
Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed
serdev") will cause below regression issue:
BT can't be enabled after below steps:
cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure
if property enable-gpios is not configured within DT|ACPI for QCA6390.
The commit is to fix a use-after-free issue within qca_serdev_shutdown()
by adding condition to avoid the serdev is flushed or wrote after closed
but also introduces this regression issue regarding above steps since the
VSC is not sent to reset controller during warm reboot.
Fixed by sending the VSC to reset controller within qca_serdev_shutdown()
once BT was ever enabled, and the use-after-free issue is also fixed by
this change since the serdev is still opened before it is flushed or wrote.
Verified by the reported machine Dell XPS 13 9310 laptop over below two
kernel commits:
commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump
implementation for QCA") of bluetooth-next tree.
commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump
implementation for QCA") of linus mainline tree.
Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev")
Cc: stable(a)vger.kernel.org
Reported-by: Wren Turkal <wt(a)penguintechs.org>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726
Signed-off-by: Zijun Hu <quic_zijuhu(a)quicinc.com>
Tested-by: Wren Turkal <wt(a)penguintechs.org>
---
V1 -> V2: Add comments and more commit messages
V1 discussion link:
https://lore.kernel.org/linux-bluetooth/d553edef-c1a4-4d52-a892-715549d31eb…
drivers/bluetooth/hci_qca.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
index 0c9c9ee56592..9a0bc86f9aac 100644
--- a/drivers/bluetooth/hci_qca.c
+++ b/drivers/bluetooth/hci_qca.c
@@ -2450,15 +2450,27 @@ static void qca_serdev_shutdown(struct device *dev)
struct qca_serdev *qcadev = serdev_device_get_drvdata(serdev);
struct hci_uart *hu = &qcadev->serdev_hu;
struct hci_dev *hdev = hu->hdev;
- struct qca_data *qca = hu->priv;
const u8 ibs_wake_cmd[] = { 0xFD };
const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
if (qcadev->btsoc_type == QCA_QCA6390) {
- if (test_bit(QCA_BT_OFF, &qca->flags) ||
- !test_bit(HCI_RUNNING, &hdev->flags))
+ /* The purpose of sending the VSC is to reset SOC into a initial
+ * state and the state will ensure next hdev->setup() success.
+ * if HCI_QUIRK_NON_PERSISTENT_SETUP is set, it means that
+ * hdev->setup() can do its job regardless of SoC state, so
+ * don't need to send the VSC.
+ * if HCI_SETUP is set, it means that hdev->setup() was never
+ * invoked and the SOC is already in the initial state, so
+ * don't also need to send the VSC.
+ */
+ if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP, &hdev->quirks) ||
+ hci_dev_test_flag(hdev, HCI_SETUP))
return;
+ /* The serdev must be in open state when conrol logic arrives
+ * here, so also fix the use-after-free issue caused by that
+ * the serdev is flushed or wrote after it is closed.
+ */
serdev_device_write_flush(serdev);
ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
sizeof(ibs_wake_cmd));
--
2.7.4
Early during NAND identification, mtd_info fields have not yet been
initialized (namely, writesize and oobsize) and thus cannot be used for
sanity checks yet. Of course if there is a misuse of
nand_change_read_column_op() so early we won't be warned, but there is
anyway no actual check to perform at this stage as we do not yet know
the NAND geometry.
So, if the fields are empty, especially mtd->writesize which is *always*
set quite rapidly after identification, let's skip the sanity checks.
nand_change_read_column_op() is subject to be used early for ONFI/JEDEC
identification in the very unlikely case of:
- bitflips appearing in the parameter page,
- the controller driver not supporting simple DATA_IN cycles.
As nand_change_read_column_op() uses nand_fill_column_cycles() the logic
explaind above also applies in this secondary helper.
Fixes: c27842e7e11f ("mtd: rawnand: onfi: Adapt the parameter page read to constraint controllers")
Fixes: daca31765e8b ("mtd: rawnand: jedec: Adapt the parameter page read to constraint controllers")
Cc: stable(a)vger.kernel.org
Reported-by: Alexander Dahl <ada(a)thorsis.com>
Closes: https://lore.kernel.org/linux-mtd/20240306-shaky-bunion-d28b65ea97d7@thorsi…
Reported-by: Steven Seeger <steven.seeger(a)flightsystems.net>
Closes: https://lore.kernel.org/linux-mtd/DM6PR05MB4506554457CF95191A670BDEF7062@DM…
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
---
Changes in v2:
* Dropped the double (( ))
* Fixed nand_fill_column_cycles() as well.
---
drivers/mtd/nand/raw/nand_base.c | 57 ++++++++++++++++++--------------
1 file changed, 32 insertions(+), 25 deletions(-)
diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
index 248e654ecefd..53e16d39af4b 100644
--- a/drivers/mtd/nand/raw/nand_base.c
+++ b/drivers/mtd/nand/raw/nand_base.c
@@ -1093,28 +1093,32 @@ static int nand_fill_column_cycles(struct nand_chip *chip, u8 *addrs,
unsigned int offset_in_page)
{
struct mtd_info *mtd = nand_to_mtd(chip);
+ bool ident_stage = !mtd->writesize;
- /* Make sure the offset is less than the actual page size. */
- if (offset_in_page > mtd->writesize + mtd->oobsize)
- return -EINVAL;
-
- /*
- * On small page NANDs, there's a dedicated command to access the OOB
- * area, and the column address is relative to the start of the OOB
- * area, not the start of the page. Asjust the address accordingly.
- */
- if (mtd->writesize <= 512 && offset_in_page >= mtd->writesize)
- offset_in_page -= mtd->writesize;
-
- /*
- * The offset in page is expressed in bytes, if the NAND bus is 16-bit
- * wide, then it must be divided by 2.
- */
- if (chip->options & NAND_BUSWIDTH_16) {
- if (WARN_ON(offset_in_page % 2))
+ /* Bypass all checks during NAND identification */
+ if (likely(!ident_stage)) {
+ /* Make sure the offset is less than the actual page size. */
+ if (offset_in_page > mtd->writesize + mtd->oobsize)
return -EINVAL;
- offset_in_page /= 2;
+ /*
+ * On small page NANDs, there's a dedicated command to access the OOB
+ * area, and the column address is relative to the start of the OOB
+ * area, not the start of the page. Asjust the address accordingly.
+ */
+ if (mtd->writesize <= 512 && offset_in_page >= mtd->writesize)
+ offset_in_page -= mtd->writesize;
+
+ /*
+ * The offset in page is expressed in bytes, if the NAND bus is 16-bit
+ * wide, then it must be divided by 2.
+ */
+ if (chip->options & NAND_BUSWIDTH_16) {
+ if (WARN_ON(offset_in_page % 2))
+ return -EINVAL;
+
+ offset_in_page /= 2;
+ }
}
addrs[0] = offset_in_page;
@@ -1123,7 +1127,7 @@ static int nand_fill_column_cycles(struct nand_chip *chip, u8 *addrs,
* Small page NANDs use 1 cycle for the columns, while large page NANDs
* need 2
*/
- if (mtd->writesize <= 512)
+ if (!ident_stage && mtd->writesize <= 512)
return 1;
addrs[1] = offset_in_page >> 8;
@@ -1436,16 +1440,19 @@ int nand_change_read_column_op(struct nand_chip *chip,
unsigned int len, bool force_8bit)
{
struct mtd_info *mtd = nand_to_mtd(chip);
+ bool ident_stage = !mtd->writesize;
if (len && !buf)
return -EINVAL;
- if (offset_in_page + len > mtd->writesize + mtd->oobsize)
- return -EINVAL;
+ if (!ident_stage) {
+ if (offset_in_page + len > mtd->writesize + mtd->oobsize)
+ return -EINVAL;
- /* Small page NANDs do not support column change. */
- if (mtd->writesize <= 512)
- return -ENOTSUPP;
+ /* Small page NANDs do not support column change. */
+ if (mtd->writesize <= 512)
+ return -ENOTSUPP;
+ }
if (nand_has_exec_op(chip)) {
const struct nand_interface_config *conf =
--
2.40.1
The nand_read_data_op() operation, which only consists in DATA_IN
cycles, is sadly not supported by all controllers despite being very
basic. The core, for some time, supposed all drivers would support
it. An improvement to this situation for supporting more constrained
controller added a check to verify if the operation was supported before
attempting it by running the function with the check_only boolean set
first, and then possibly falling back to another (possibly slightly less
optimized) alternative.
An even newer addition moved that check very early and probe time, in
order to perform the check only once. The content of the operation was
not so important, as long as the controller driver would tell whether
such operation on the NAND bus would be possible or not. In practice, no
buffer was provided (no fake buffer or whatever) as it is anyway not
relevant for the "check_only" condition. Unfortunately, early in the
function, there is an if statement verifying that the input parameters
are right for normal use, making the early check always unsuccessful.
Fixes: 9f820fc0651c ("mtd: rawnand: Check the data only read pattern only once")
Cc: stable(a)vger.kernel.org
Reported-by: Alexander Dahl <ada(a)thorsis.com>
Closes: https://lore.kernel.org/linux-mtd/20240306-shaky-bunion-d28b65ea97d7@thorsi…
Reported-by: Steven Seeger <steven.seeger(a)flightsystems.net>
Closes: https://lore.kernel.org/linux-mtd/DM6PR05MB4506554457CF95191A670BDEF7062@DM…
Signed-off-by: Miquel Raynal <miquel.raynal(a)bootlin.com>
Reviewed-by: Alexander Dahl <ada(a)thorsis.com>
---
drivers/mtd/nand/raw/nand_base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
index acd137dd0957..248e654ecefd 100644
--- a/drivers/mtd/nand/raw/nand_base.c
+++ b/drivers/mtd/nand/raw/nand_base.c
@@ -2173,7 +2173,7 @@ EXPORT_SYMBOL_GPL(nand_reset_op);
int nand_read_data_op(struct nand_chip *chip, void *buf, unsigned int len,
bool force_8bit, bool check_only)
{
- if (!len || !buf)
+ if (!len || (!check_only && !buf))
return -EINVAL;
if (nand_has_exec_op(chip)) {
--
2.40.1