Virtual machines (DomU) under the Xen hypervisor running in Xen PV mode use
a special, nonstandard synthesized CPU topology which "just works" under
kernels up to 6.9.x, while newer kernels wrongly assume a "crash kernel"
and disable SMP (reducing the guest to one CPU core): the newer topology
implementation emits the bogus error "[Firmware Bug]: APIC enumeration
order not specification compliant" from new topology checks which are
improper for the Xen PV platform. As a result, the kernel disables SMP and
brings up just one CPU core within the VM (DomU).
The patch disables the offending checks when running in Xen PV mode (only)
and brings back SMP / all CPUs in such DomU VMs, as before.
Signed-off-by: Niels Dettenbach <nd(a)syndicat.com>
---
The current behaviour renders all of our production Xen host platforms
(amd64 - HPE ProLiant) unusable after updating to newer Linux kernels
(just one CPU available/activated per VM), while older kernels and
other OSes (current NetBSD PV DomU) still work fully (and have been stable
for many years on this platform).
Xen PV mode is still provided by current Xen and is widely used - even
if less widely than the newer Xen PVH mode today. So a solution will
probably be required.
We therefore assume the bug affects stable(a)vger.kernel.org kernels as well.
dmesg from affected kernel:
-- snip --
[ 0.640364] CPU topo: Enumerated BSP APIC 0 is not marked in APICBASE MSR
[ 0.640367] CPU topo: Assuming crash kernel. Limiting to one CPU to prevent machine INIT
[ 0.640368] CPU topo: [Firmware Bug]: APIC enumeration order not specification compliant
[ 0.640376] CPU topo: Max. logical packages: 1
[ 0.640378] CPU topo: Max. logical dies: 1
[ 0.640379] CPU topo: Max. dies per package: 1
[ 0.640386] CPU topo: Max. threads per core: 1
[ 0.640388] CPU topo: Num. cores per package: 1
[ 0.640389] CPU topo: Num. threads per package: 1
[ 0.640390] CPU topo: Allowing 1 present CPUs plus 0 hotplug CPUs
[ 0.640402] Cannot find an available gap in the 32-bit address range
-- snap --
after patch applied:
-- snip --
[ 0.369439] CPU topo: Max. logical packages: 1
[ 0.369441] CPU topo: Max. logical dies: 1
[ 0.369442] CPU topo: Max. dies per package: 1
[ 0.369450] CPU topo: Max. threads per core: 2
[ 0.369452] CPU topo: Num. cores per package: 3
[ 0.369453] CPU topo: Num. threads per package: 6
[ 0.369453] CPU topo: Allowing 6 present CPUs plus 0 hotplug CPUs
-- snap --
We have been testing the patch intensively under production / high load for 2 days now with no issues.
references:
arch/x86/kernel/cpu/topology.c
[line 448]
-- snip --
/*
* XEN PV is special as it does not advertise the local APIC
* properly, but provides a fake topology for it so that the
* infrastructure works. So don't apply the restrictions vs. APIC
* here.
*/
-- snap --
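For orientation, with the patch applied the top of the affected check then
reads roughly as follows (a sketch assembled from the hunk below plus the
comment quoted above - illustrative, not verbatim upstream code):
-- snip --
	u64 msr;

	/*
	 * Assume Xen PV has a working (special) topology. Xen PV does not
	 * advertise the local APIC properly, so the APICBASE-MSR-based BSP
	 * detection below would misfire and limit the guest to one CPU.
	 */
	if (xen_pv_domain()) {
		topo_info.real_bsp_apic_id = topo_info.boot_cpu_apic_id;
		return false;
	}

	/* ... existing APICBASE MSR based BSP detection follows ... */
-- snap --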
--- linux/arch/x86/kernel/cpu/topology.c.orig	2024-09-11 09:53:16.194095250 +0200
+++ linux/arch/x86/kernel/cpu/topology.c	2024-09-11 17:42:42.699278317 +0200
@@ -132,6 +132,14 @@
 	u64 msr;
 
 	/*
+	 * assume Xen PV has a working (special) topology
+	 */
+	if (xen_pv_domain()) {
+		topo_info.real_bsp_apic_id = topo_info.boot_cpu_apic_id;
+		return false;
+	}
+
+	/*
 	 * There is no real good way to detect whether this a kdump()
 	 * kernel, but except on the Voyager SMP monstrosity which is not
 	 * longer supported, the real BSP APIC ID is the first one which is
From: Eliav Bar-ilan <eliavb(a)nvidia.com>
An incorrect argument order calling amd_iommu_dev_flush_pasid_pages()
causes improper flushing of the IOMMU, leaving the old value of GCR3 from
a previous process attached to the same PASID.
The function has the signature:
void amd_iommu_dev_flush_pasid_pages(struct iommu_dev_data *dev_data,
ioasid_t pasid, u64 address, size_t size)
Correct the argument order.
Cc: stable(a)vger.kernel.org
Fixes: 474bf01ed9f0 ("iommu/amd: Add support for device based TLB invalidation")
Signed-off-by: Eliav Bar-ilan <eliavb(a)nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg(a)nvidia.com>
---
drivers/iommu/amd/iommu.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
This was discovered while testing SVA, but I suppose it is probably a bigger issue.
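To make the swap easy to see, here is a self-contained sketch of the calling
convention (stand-in types and a stand-in value for
CMD_INV_IOMMU_ALL_PAGES_ADDRESS; only the argument order mirrors the real
code):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t ioasid_t;
/* stand-in value for the sketch, not the real constant */
#define CMD_INV_IOMMU_ALL_PAGES_ADDRESS ((uint64_t)-1)

struct iommu_dev_data { int dummy; };

static void amd_iommu_dev_flush_pasid_pages(struct iommu_dev_data *dev_data,
					    ioasid_t pasid, uint64_t address,
					    size_t size)
{
	printf("flush pasid=%u address=0x%llx size=%zu\n",
	       pasid, (unsigned long long)address, size);
}

int main(void)
{
	struct iommu_dev_data dev = { 0 };
	ioasid_t pasid = 5;

	/* Buggy order: 0 lands in pasid, the ALL_PAGES constant in address,
	 * and the pasid value in size. */
	amd_iommu_dev_flush_pasid_pages(&dev, 0,
					CMD_INV_IOMMU_ALL_PAGES_ADDRESS, pasid);

	/* Fixed order, matching the signature (dev_data, pasid, address, size): */
	amd_iommu_dev_flush_pasid_pages(&dev, pasid, 0,
					CMD_INV_IOMMU_ALL_PAGES_ADDRESS);
	return 0;
}

Because ioasid_t, u64 and size_t are all plain integer types, the wrong
order converts silently and the compiler never complains - which is how
the bug went unnoticed.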
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index b19e8c0f48fa25..6bc4030a6ba8ed 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1552,8 +1552,8 @@ void amd_iommu_dev_flush_pasid_pages(struct iommu_dev_data *dev_data,
void amd_iommu_dev_flush_pasid_all(struct iommu_dev_data *dev_data,
ioasid_t pasid)
{
- amd_iommu_dev_flush_pasid_pages(dev_data, 0,
- CMD_INV_IOMMU_ALL_PAGES_ADDRESS, pasid);
+ amd_iommu_dev_flush_pasid_pages(dev_data, pasid, 0,
+ CMD_INV_IOMMU_ALL_PAGES_ADDRESS);
}
void amd_iommu_domain_flush_complete(struct protection_domain *domain)
base-commit: cf2840f59119f41de3d9641a8b18a5da1b2cf6bf
--
2.46.0
One of our customers reported a crash and a corrupted ocfs2 filesystem.
The crash was due to the detection of corruption. Upon troubleshooting,
the fsck -fn output showed the corruption below:
[EXTENT_LIST_FREE] Extent list in owner 33080590 claims 230 as the next free chain record,
but fsck believes the largest valid value is 227. Clamp the next record value? n
The stat output from debugfs.ocfs2 showed the following corruption, where
"Next Free Rec:" had overshot "Count:" in the root metadata block:
Inode: 33080590 Mode: 0640 Generation: 2619713622 (0x9c25a856)
FS Generation: 904309833 (0x35e6ac49)
CRC32: 00000000 ECC: 0000
Type: Regular Attr: 0x0 Flags: Valid
Dynamic Features: (0x16) HasXattr InlineXattr Refcounted
Extended Attributes Block: 0 Extended Attributes Inline Size: 256
User: 0 (root) Group: 0 (root) Size: 281320357888
Links: 1 Clusters: 141738
ctime: 0x66911b56 0x316edcb8 -- Fri Jul 12 06:02:30.829349048 2024
atime: 0x66911d6b 0x7f7a28d -- Fri Jul 12 06:11:23.133669517 2024
mtime: 0x66911b56 0x12ed75d7 -- Fri Jul 12 06:02:30.317552087 2024
dtime: 0x0 -- Wed Dec 31 17:00:00 1969
Refcount Block: 2777346
Last Extblk: 2886943 Orphan Slot: 0
Sub Alloc Slot: 0 Sub Alloc Bit: 14
Tree Depth: 1 Count: 227 Next Free Rec: 230
## Offset Clusters Block#
0 0 2310 2776351
1 2310 2139 2777375
2 4449 1221 2778399
3 5670 731 2779423
4 6401 566 2780447
....... .... .......
....... .... .......
The issue was in the reflink workflow while reserving space for inline
xattrs. The problematic function is ocfs2_reflink_xattr_inline(). By the
time this function is called, the reflink tree has already been recreated
at the destination inode from the source inode. At this point, the function
reserves space for inline xattrs at the destination inode without even
checking whether there is space in the root metadata block. It simply
reduces l_count from 243 to 227, thereby making 256 bytes of space for the
inline xattrs, although the inode already has extents beyond this index
(in this case up to 230), thereby causing corruption.
The fix is to reserve the space for the inline metadata at the destination
inode before the reflink tree gets recreated. The customer has verified the
fix.
Fixes: ef962df057aa ("ocfs2: xattr: fix inlined xattr reflink")
Cc: stable(a)vger.kernel.org
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna(a)oracle.com>
---
fs/ocfs2/refcounttree.c | 26 ++++++++++++++++++++++++--
fs/ocfs2/xattr.c | 11 +----------
2 files changed, 25 insertions(+), 12 deletions(-)
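As a worked check of the arithmetic above (a standalone sketch; the 256-byte
inline xattr size and the l_count values come from the stat output and the
description, and 16 bytes is sizeof(struct ocfs2_extent_rec)):

#include <stdio.h>

int main(void)
{
	int l_count = 243;        /* extent record slots in the root block    */
	int next_free_rec = 230;  /* slots already handed out (from fsck)     */
	int inline_size = 256;    /* inline xattr bytes, from the stat output */
	int rec_size = 16;        /* sizeof(struct ocfs2_extent_rec)          */

	/* What ocfs2_reflink_xattr_inline() did unconditionally: */
	l_count -= inline_size / rec_size;      /* 243 - 16 = 227 */

	/* fsck's complaint: records beyond the shrunken list remain. */
	if (next_free_rec > l_count)
		printf("corrupt: next free rec %d > count %d\n",
		       next_free_rec, l_count);
	return 0;
}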
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 3f80a56d0d60..05105d271fc8 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -25,6 +25,7 @@
#include "namei.h"
#include "ocfs2_trace.h"
#include "file.h"
+#include "symlink.h"
#include <linux/bio.h>
#include <linux/blkdev.h>
@@ -4155,8 +4156,9 @@ static int __ocfs2_reflink(struct dentry *old_dentry,
int ret;
struct inode *inode = d_inode(old_dentry);
struct buffer_head *new_bh = NULL;
+ struct ocfs2_inode_info *oi = OCFS2_I(inode);
- if (OCFS2_I(inode)->ip_flags & OCFS2_INODE_SYSTEM_FILE) {
+ if (oi->ip_flags & OCFS2_INODE_SYSTEM_FILE) {
ret = -EINVAL;
mlog_errno(ret);
goto out;
@@ -4182,6 +4184,26 @@ static int __ocfs2_reflink(struct dentry *old_dentry,
goto out_unlock;
}
+ if ((oi->ip_dyn_features & OCFS2_HAS_XATTR_FL) &&
+ (oi->ip_dyn_features & OCFS2_INLINE_XATTR_FL)) {
+ /*
+ * Adjust extent record count to reserve space for extended attribute.
+ * Inline data count had been adjusted in ocfs2_duplicate_inline_data().
+ */
+ struct ocfs2_inode_info *new_oi = OCFS2_I(new_inode);
+
+ if (!(new_oi->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
+ !(ocfs2_inode_is_fast_symlink(new_inode))) {
+ struct ocfs2_dinode *new_di = (struct ocfs2_dinode *)new_bh->b_data;
+ struct ocfs2_dinode *old_di = (struct ocfs2_dinode *)old_bh->b_data;
+ struct ocfs2_extent_list *el = &new_di->id2.i_list;
+ int inline_size = le16_to_cpu(old_di->i_xattr_inline_size);
+
+ le16_add_cpu(&el->l_count, -(inline_size /
+ sizeof(struct ocfs2_extent_rec)));
+ }
+ }
+
ret = ocfs2_create_reflink_node(inode, old_bh,
new_inode, new_bh, preserve);
if (ret) {
@@ -4189,7 +4211,7 @@ static int __ocfs2_reflink(struct dentry *old_dentry,
goto inode_unlock;
}
- if (OCFS2_I(inode)->ip_dyn_features & OCFS2_HAS_XATTR_FL) {
+ if (oi->ip_dyn_features & OCFS2_HAS_XATTR_FL) {
ret = ocfs2_reflink_xattrs(inode, old_bh,
new_inode, new_bh,
preserve);
diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index 3b81213ed7b8..a9f716ec89e2 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -6511,16 +6511,7 @@ static int ocfs2_reflink_xattr_inline(struct ocfs2_xattr_reflink *args)
}
new_oi = OCFS2_I(args->new_inode);
- /*
- * Adjust extent record count to reserve space for extended attribute.
- * Inline data count had been adjusted in ocfs2_duplicate_inline_data().
- */
- if (!(new_oi->ip_dyn_features & OCFS2_INLINE_DATA_FL) &&
- !(ocfs2_inode_is_fast_symlink(args->new_inode))) {
- struct ocfs2_extent_list *el = &new_di->id2.i_list;
- le16_add_cpu(&el->l_count, -(inline_size /
- sizeof(struct ocfs2_extent_rec)));
- }
+
spin_lock(&new_oi->ip_lock);
new_oi->ip_dyn_features |= OCFS2_HAS_XATTR_FL | OCFS2_INLINE_XATTR_FL;
new_di->i_dyn_features = cpu_to_le16(new_oi->ip_dyn_features);
--
2.39.3