According to the awk manual, the -e option does not need to be specified
in front of 'program' (unless you need to mix program-file).
The redundant -e option can cause error when users use awk tools other
than gawk (for example, mawk does not support the -e option).
Error Example:
awk: not an option: -e
Cgroup v2 mount point not found!
Signed-off-by: Juntong Deng <juntong.deng(a)outlook.com>
---
tools/testing/selftests/cgroup/test_cpuset_prs.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 4afb132e4e4f..6820653e8432 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -20,7 +20,7 @@ skip_test() {
WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
# Find cgroup v2 mount point
-CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+CGROUP2=$(mount -t cgroup2 | head -1 | awk '{print $3}')
[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
CPUS=$(lscpu | grep "^CPU(s):" | sed -e "s/.*:[[:space:]]*//")
--
2.39.2
From: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
Add a test case for probing on a symbol in a module without module name.
When probing on a symbol in a module, ftrace accepts both the syntax that
<MODNAME>:<SYMBOL> and <SYMBOL>. Current test case only checks the former
syntax. This adds a test for the latter one.
Signed-off-by: Masami Hiramatsu (Google) <mhiramat(a)kernel.org>
---
.../ftrace/test.d/kprobe/kprobe_module.tc | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_module.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_module.tc
index 7e74ee11edf9..4b32e1b9a8d3 100644
--- a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_module.tc
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_module.tc
@@ -13,6 +13,12 @@ fi
MOD=trace_printk
FUNC=trace_printk_irq_work
+:;: "Add an event on a module function without module name" ;:
+
+echo "p:event0 $FUNC" > kprobe_events
+test -d events/kprobes/event0 || exit_failure
+echo "-:kprobes/event0" >> kprobe_events
+
:;: "Add an event on a module function without specifying event name" ;:
echo "p $MOD:$FUNC" > kprobe_events
This is part of an effort to improve detection of regressions impacting
device probe on all platforms. The recently merged DT kselftest [1]
detects probe issues for all devices described statically in the DT.
That leaves out devices discovered at run-time from discoverable busses.
This is where this test comes in. All of the devices that are connected
through discoverable busses (ie USB and PCI), and which are internal and
therefore always present, can be described in a per-platform file so
they can be checked for. The test will check that the device has been
instantiated and bound to a driver.
Patch 1 introduces the test. Patch 2 adds the test definitions for the
google,spherion machine (Acer Chromebook 514) as an example.
This is the sample output from the test running on Spherion:
TAP version 13
Using board file: boards/google,spherion
1..10
ok 1 usb.camera.0.device
ok 2 usb.camera.0.driver
ok 3 usb.camera.1.device
ok 4 usb.camera.1.driver
ok 5 usb.bluetooth.0.device
ok 6 usb.bluetooth.0.driver
ok 7 usb.bluetooth.1.device
ok 8 usb.bluetooth.1.driver
ok 9 pci.wifi.device
ok 10 pci.wifi.driver
Totals: pass:10 fail:0 xfail:0 xpass:0 skip:0 error:0
[1] https://lore.kernel.org/all/20230828211424.2964562-1-nfraprado@collabora.co…
Nícolas F. R. A. Prado (2):
kselftest: Add test to verify probe of devices from discoverable
busses
kselftest: devices: Add board file for google,spherion
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/devices/.gitignore | 1 +
tools/testing/selftests/devices/Makefile | 8 +
.../selftests/devices/boards/google,spherion | 3 +
.../devices/test_discoverable_devices.sh | 165 ++++++++++++++++++
5 files changed, 178 insertions(+)
create mode 100644 tools/testing/selftests/devices/.gitignore
create mode 100644 tools/testing/selftests/devices/Makefile
create mode 100644 tools/testing/selftests/devices/boards/google,spherion
create mode 100755 tools/testing/selftests/devices/test_discoverable_devices.sh
--
2.42.0
There is a spelling mistake in a printf message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king(a)gmail.com>
---
tools/testing/selftests/sched/cs_prctl_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/sched/cs_prctl_test.c b/tools/testing/selftests/sched/cs_prctl_test.c
index 3e1619b6bf2d..7b4fc02a0d05 100644
--- a/tools/testing/selftests/sched/cs_prctl_test.c
+++ b/tools/testing/selftests/sched/cs_prctl_test.c
@@ -276,7 +276,7 @@ int main(int argc, char *argv[])
if (setpgid(0, 0) != 0)
handle_error("process group");
- printf("\n## Create a thread/process/process group hiearchy\n");
+ printf("\n## Create a thread/process/process group hierarchy\n");
create_processes(num_processes, num_threads, procs);
need_cleanup = 1;
disp_processes(num_processes, procs);
--
2.39.2
Immediate is incorrectly cast to u32 before being spilled, losing sign
information. The range information is incorrect after load again. Fix
immediate spill by remove the cast. The second patch add a test case
for this.
Signed-off-by: Hao Sun <sunhao.th(a)gmail.com>
---
Hao Sun (2):
bpf: Fix check_stack_write_fixed_off() to correctly spill imm
selftests/bpf: Add test for immediate spilled to stack
kernel/bpf/verifier.c | 2 +-
tools/testing/selftests/bpf/verifier/bpf_st_mem.c | 32 +++++++++++++++++++++++
2 files changed, 33 insertions(+), 1 deletion(-)
---
base-commit: 399f6185a1c02f39bcadb8749bc2d9d48685816f
change-id: 20231026-fix-check-stack-write-c40996694dfa
Best regards,
--
Hao Sun <sunhao.th(a)gmail.com>
Kunit recently gained support to setup attributes, the first one being
the speed of a given test, then allowing to filter out slow tests.
A slow test is defined in the documentation as taking more than one
second. There's an another speed attribute called "super slow" but whose
definition is less clear.
Add support to the test runner to check the test execution time, and
report tests that should be marked as slow but aren't.
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
To: Brendan Higgins <brendan.higgins(a)linux.dev>
To: David Gow <davidgow(a)google.com>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Rae Moar <rmoar(a)google.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: kunit-dev(a)googlegroups.com
Cc: linux-kernel(a)vger.kernel.org
Changes from v2:
- Add defines and comments to make the warning reporting threshold more
obvious
- Switch the duration comparisons to timespec64_compare to be more
accurate
- Link: https://lore.kernel.org/all/20230920084903.1522728-1-mripard@kernel.org/
Changes from v1:
- Split the patch out of the series
- Change to trigger the warning only if the runtime is twice the
threshold (Jani, Rae)
- Split the speed check into a separate function (Rae)
- Link: https://lore.kernel.org/all/20230911-kms-slow-tests-v1-0-d3800a69a1a1@kerne…
---
lib/kunit/test.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 49698a168437..4b710c92340a 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -372,6 +372,36 @@ void kunit_init_test(struct kunit *test, const char *name, char *log)
}
EXPORT_SYMBOL_GPL(kunit_init_test);
+/* Only warn when a test takes more than twice the threshold */
+#define KUNIT_SPEED_WARNING_MULTIPLIER 2
+
+/* Slow tests are defined as taking more than 1s */
+#define KUNIT_SPEED_SLOW_THRESHOLD_S 1
+
+#define KUNIT_SPEED_SLOW_WARNING_THRESHOLD_S \
+ (KUNIT_SPEED_WARNING_MULTIPLIER * KUNIT_SPEED_SLOW_THRESHOLD_S)
+
+#define s_to_timespec64(s) ns_to_timespec64((s) * NSEC_PER_SEC)
+
+static void kunit_run_case_check_speed(struct kunit *test,
+ struct kunit_case *test_case,
+ struct timespec64 duration)
+{
+ struct timespec64 slow_thr =
+ s_to_timespec64(KUNIT_SPEED_SLOW_WARNING_THRESHOLD_S);
+ enum kunit_speed speed = test_case->attr.speed;
+
+ if (timespec64_compare(&duration, &slow_thr) < 0)
+ return;
+
+ if (speed == KUNIT_SPEED_VERY_SLOW || speed == KUNIT_SPEED_SLOW)
+ return;
+
+ kunit_warn(test,
+ "Test should be marked slow (runtime: %lld.%09lds)",
+ duration.tv_sec, duration.tv_nsec);
+}
+
/*
* Initializes and runs test case. Does not clean up or do post validations.
*/
@@ -379,6 +409,8 @@ static void kunit_run_case_internal(struct kunit *test,
struct kunit_suite *suite,
struct kunit_case *test_case)
{
+ struct timespec64 start, end;
+
if (suite->init) {
int ret;
@@ -390,7 +422,13 @@ static void kunit_run_case_internal(struct kunit *test,
}
}
+ ktime_get_ts64(&start);
+
test_case->run_case(test);
+
+ ktime_get_ts64(&end);
+
+ kunit_run_case_check_speed(test, test_case, timespec64_sub(end, start));
}
static void kunit_case_internal_cleanup(struct kunit *test)
--
2.41.0
This is the first part to add Intel VT-d nested translation based on IOMMUFD
nesting infrastructure. As the iommufd nesting infrastructure series[1],
iommu core supports new ops to allocate domains with user data. For nesting,
the user data is vendor-specific, IOMMU_HWPT_DATA_VTD_S1 is defined for
the Intel VT-d stage-1 page table, it will be used in the stage-1 domain
allocation path. struct iommu_hwpt_vtd_s1 is defined to pass user_data
for the Intel VT-d stage-1 domain allocation. This series does not have
the cache invalidation path, it would be added in part 2/2.
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix
in patch8 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.
In reality we don't know a usage relying on DMA reads from the BIOS
region. Hence finding a way to skip RO regions (e.g. via a discard manager)
in Qemu might be an acceptable tradeoff. The actual change needs more
discussion in Qemu community. For now we just hacked Qemu to test.
Complete code can be found in [3], corresponding QEMU could can be found
in [4].
[1] https://lore.kernel.org/linux-iommu/20231026043938.63898-1-yi.l.liu@intel.c…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v8:
- Adopt changes suggested by Jason on domain_alloc_user() op
https://lore.kernel.org/linux-iommu/20231024230319.GW3952@nvidia.com/
- Add Kevin's r-b on patch 06
- Fix description for IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 (Kevin)
v7: https://lore.kernel.org/linux-iommu/20231024151412.50046-1-yi.l.liu@intel.c…
- Rebase on top of latest iommufd nesting part 1/2
- Add the nested_parent flag in patch 07 and sanitize it for nested domain
allocation (Baolu)
- Fail the nested domain allocation if dirty tracking flag is set
v6: https://lore.kernel.org/linux-iommu/20231020093246.17015-1-yi.l.liu@intel.c…
- Add Kevin's r-b for patch 1 and 8
- Drop Kevin's r-b for patch 7
- Address comments from Kevin
- Split the VT-d nesting series into two parts 1/2 and 2/2
v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.…
- Add Kevin's r-b for patch 2, 3 ,5 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.…
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It makes sense to check every pfns used in the stage-1 page
table. But it cannot make it. So just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c…
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow read-only mappings to nest parent domain
Yi Liu (3):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 60 +++++++++---------
drivers/iommu/intel/iommu.h | 46 ++++++++++++--
drivers/iommu/intel/nested.c | 117 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 112 +++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
include/uapi/linux/iommufd.h | 42 ++++++++++++-
7 files changed, 345 insertions(+), 36 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages of address translations to get access
to the physical address. A stage-1 translation table is owned by userspace
(e.g. by a guest OS), while a stage-2 is owned by kernel. Any change to a
stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is guest I/O
page table. As the below diagram shows, the guest I/O page table pointer
in GPA (guest physical address) is passed to host and be used to perform
a stage-1 translation. Along with it, a modification to present mappings
in the guest I/O page table should be followed by an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each hwpt is backed by an iommu_domain allocated from an iommu driver.
So in this series hw_pagetable and iommu_domain means the same thing if no
special note. IOMMUFD has already supported allocating hw_pagetable linked
with an IOAS. However, a nesting case requires IOMMUFD to allow allocating
hw_pagetable with driver specific parameters and interface to sync stage-1
IOTLB as user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1] and nested
parent domain allocation [2]. It first extends domain_alloc_user to allocate
hwpt with user data by allowing the IOMMUFD internal infrastructure to accept
user_data and parent hwpt, relaying the user_data/parent to the iommu core
to allocate IOMMU_DOMAIN_NESTED. And it then extends the IOMMU_HWPT_ALLOC
ioctl to accept user data and a parent hwpt ID.
Note that this series is the part-1 set of a two-part nesting series. It
does not include the cache invalidation interface, which will be added in
the part 2.
Complete code can be found in [3], it is on top of Joao's dirty page tracking
v6 series and fix patches. QEMU could can be found in [4].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
[1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged
[2] https://lore.kernel.org/linux-iommu/20230928071528.26258-1-yi.l.liu@intel.c… - merged
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v7:
- Fix a bug from Kevin
- Add r-b from Kevin
- Adopt Jason's suggestion to plumb user_data pointer to hwpt_paging allocation
https://lore.kernel.org/linux-iommu/20231024173009.GQ3952@nvidia.com/
- Select bit 6 for __IOMMU_DOMAIN_NESTED (Jason)
- Other compiling fixes per linux-next integration (Jason/Joao)
- Move patch "iommu: Pass in parent domain with user_data to domain_alloc_user op"
right before "iommufd: Add a nested HW pagetable object" (Jason)
v6: https://lore.kernel.org/linux-iommu/20231024150609.46884-1-yi.l.liu@intel.c…
- Rebase on top of Joao's dirty tracking series:
https://lore.kernel.org/linux-iommu/20231024135109.73787-1-joao.m.martins@o…
- Rebase on top of the enforce_cache_coherency removal patch:
https://lore.kernel.org/linux-iommu/ZTcAhwYjjzqM0A5M@Asurada-Nvidia/
- Add parent and user_data check in iommu driver before the driver actually
supports the two input. This can make better bisect support, the change is
in patch 02.
v5: https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c…
- Split the iommufd nesting series into two parts of alloc_user and
invalidation (Jason)
- Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
do the same with the structures/alloc()/abort()/destroy(). Reworked the
selftest accordingly too. (Jason)
- Move hwpt/data_type into struct iommu_user_data from standalone op
arguments. (Jason)
- Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
_TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
- Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
- Add macro to the iommu_copy_struct_from_user() to calculate min_size
(Jason)
- Fix two bugs spotted by ZhaoYan
v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested
translation, this does not change the rule that resv regions should only
be added to the kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be
added in later series as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation
(Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
I/O page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to
avoid passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in
iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08
of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Jason Gunthorpe (2):
iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING
iommufd/device: Wrap IOMMUFD_OBJ_HWPT_PAGING-only configurations
Lu Baolu (1):
iommu: Add IOMMU_DOMAIN_NESTED
Nicolin Chen (6):
iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable
iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED
iommufd: Add a nested HW pagetable object
iommu: Add iommu_copy_struct_from_user helper
iommufd/selftest: Add nested domain allocation for mock domain
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
Yi Liu (1):
iommu: Pass in parent domain with user_data to domain_alloc_user op
drivers/iommu/amd/iommu.c | 9 +-
drivers/iommu/intel/iommu.c | 7 +-
drivers/iommu/iommufd/device.c | 156 ++++++++---
drivers/iommu/iommufd/hw_pagetable.c | 265 +++++++++++++-----
drivers/iommu/iommufd/iommufd_private.h | 70 +++--
drivers/iommu/iommufd/iommufd_test.h | 18 ++
drivers/iommu/iommufd/main.c | 10 +-
drivers/iommu/iommufd/selftest.c | 153 ++++++++--
drivers/iommu/iommufd/vfio_compat.c | 6 +-
include/linux/iommu.h | 71 ++++-
include/uapi/linux/iommufd.h | 31 +-
tools/testing/selftests/iommu/iommufd.c | 115 ++++++++
.../selftests/iommu/iommufd_fail_nth.c | 3 +-
tools/testing/selftests/iommu/iommufd_utils.h | 30 +-
14 files changed, 758 insertions(+), 186 deletions(-)
--
2.34.1
This series enables support for the data processing extensions in the
newly released 2023 architecture, this is mainly support for 8 bit
floating point formats. Most of the extensions only introduce new
instructions and therefore only require hwcaps but there is a new EL0
visible control register FPMR used to control the 8 bit floating point
formats, we need to manage traps for this and context switch it.
The sharing of floating point save code between the host and guest
kernels slightly complicates the introduction of KVM support, we first
introduce host support with some placeholders for KVM then replace those
with the actual KVM support.
I've not added test coverage for ptrace, I've got a not quite finished
test program which exercises all the FP ptrace interfaces and their
interactions together, my plan is to cover it there rather than add
another tiny test program that duplicates the boilerplace for tracing a
target and doesn't actually run the traced program.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (21):
arm64/sysreg: Add definition for ID_AA64PFR2_EL1
arm64/sysreg: Update ID_AA64ISAR2_EL1 defintion for DDI0601 2023-09
arm64/sysreg: Add definition for ID_AA64ISAR3_EL1
arm64/sysreg: Add definition for ID_AA64FPFR0_EL1
arm64/sysreg: Update ID_AA64SMFR0_EL1 definition for DDI0601 2023-09
arm64/sysreg: Update SCTLR_EL1 for DDI0601 2023-09
arm64/sysreg: Update HCRX_EL2 definition for DDI0601 2023-09
arm64/sysreg: Add definition for FPMR
arm64/cpufeature: Hook new identification registers up to cpufeature
arm64/fpsimd: Enable host kernel access to FPMR
arm64/fpsimd: Support FEAT_FPMR
arm64/signal: Add FPMR signal handling
arm64/ptrace: Expose FPMR via ptrace
KVM: arm64: Add newly allocated ID registers to register descriptions
KVM: arm64: Support FEAT_FPMR for guests
arm64/hwcap: Define hwcaps for 2023 DPISA features
kselftest/arm64: Handle FPMR context in generic signal frame parser
kselftest/arm64: Add basic FPMR test
kselftest/arm64: Add 2023 DPISA hwcap test coverage
KVM: arm64: selftests: Document feature registers added in 2023 extensions
KVM: arm64: selftests: Teach get-reg-list about FPMR
Documentation/arch/arm64/elf_hwcaps.rst | 49 +++++
arch/arm64/include/asm/cpu.h | 3 +
arch/arm64/include/asm/cpufeature.h | 5 +
arch/arm64/include/asm/fpsimd.h | 2 +
arch/arm64/include/asm/hwcap.h | 15 ++
arch/arm64/include/asm/kvm_arm.h | 4 +-
arch/arm64/include/asm/kvm_host.h | 3 +
arch/arm64/include/asm/processor.h | 2 +
arch/arm64/include/uapi/asm/hwcap.h | 15 ++
arch/arm64/include/uapi/asm/sigcontext.h | 8 +
arch/arm64/kernel/cpufeature.c | 72 +++++++
arch/arm64/kernel/cpuinfo.c | 18 ++
arch/arm64/kernel/fpsimd.c | 13 ++
arch/arm64/kernel/ptrace.c | 42 ++++
arch/arm64/kernel/signal.c | 59 ++++++
arch/arm64/kvm/fpsimd.c | 19 +-
arch/arm64/kvm/hyp/include/hyp/switch.h | 7 +-
arch/arm64/kvm/sys_regs.c | 17 +-
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 153 ++++++++++++++-
include/uapi/linux/elf.h | 1 +
tools/testing/selftests/arm64/abi/hwcap.c | 217 +++++++++++++++++++++
tools/testing/selftests/arm64/signal/.gitignore | 1 +
.../arm64/signal/testcases/fpmr_siginfo.c | 82 ++++++++
.../selftests/arm64/signal/testcases/testcases.c | 8 +
.../selftests/arm64/signal/testcases/testcases.h | 1 +
tools/testing/selftests/kvm/aarch64/get-reg-list.c | 11 +-
27 files changed, 810 insertions(+), 18 deletions(-)
---
base-commit: 05d3ef8bba77c1b5f98d941d8b2d4aeab8118ef1
change-id: 20231003-arm64-2023-dpisa-2f3d25746474
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hello,
kernel test robot noticed "kernel-selftests.uevent.uevent_filtering.fail" on:
commit: 5b45a753776be5d21cf395ec97e81c9187fbeaca ("selftests: uevent filtering: fix return on error in uevent_listener")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
[test failed on linux-next/master 2030579113a1b1b5bfd7ff24c0852847836d8fd1]
in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:
group: group-03
compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
we also noticed this issue does not always happen. as below, we saw 15 failures
out of 50 runs. however, parent keeps passing.
37013b557b7f39e6 5b45a753776be5d21cf395ec97e
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:50 30% 15:50 kernel-selftests.uevent.uevent_filtering.fail
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang(a)intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310261454.46082aaa-oliver.sang@intel.com
TAP version 13
1..1
# timeout set to 300
# selftests: uevent: uevent_filtering
# TAP version 13
# 1..1
# # Starting 1 tests from 1 test cases.
# # RUN global.uevent_filtering ...
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3532
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3546
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3556
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3585
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3595
# No buffer space available - Failed to receive uevent
# # uevent_filtering.c:479:uevent_filtering:Expected 0 (0) == ret (-1)
# # uevent_filtering: Test failed at step #10
# # FAIL global.uevent_filtering
# not ok 1 global.uevent_filtering
# # FAILED: 0 / 1 tests passed.
# # Totals: pass:0 fail:1 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: uevent: uevent_filtering # exit=1
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231026/202310261454.46082aaa-oliv…
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Kunit recently gained support to setup attributes, the first one being
the speed of a given test, then allowing to filter out slow tests.
A slow test is defined in the documentation as taking more than one
second. There's an another speed attribute called "super slow" but whose
definition is less clear.
Add support to the test runner to check the test execution time, and
report tests that should be marked as slow but aren't.
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
To: Brendan Higgins <brendan.higgins(a)linux.dev>
To: David Gow <davidgow(a)google.com>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Rae Moar <rmoar(a)google.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: kunit-dev(a)googlegroups.com
Cc: linux-kernel(a)vger.kernel.org
Changes from v1:
- Split the patch out of the series
- Change to trigger the warning only if the runtime is twice the
threshold (Jani, Rae)
- Split the speed check into a separate function (Rae)
- Link: https://lore.kernel.org/all/20230911-kms-slow-tests-v1-0-d3800a69a1a1@kerne…
---
lib/kunit/test.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 49698a168437..a1d5dd2bf87d 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -372,6 +372,25 @@ void kunit_init_test(struct kunit *test, const char *name, char *log)
}
EXPORT_SYMBOL_GPL(kunit_init_test);
+#define KUNIT_SPEED_SLOW_THRESHOLD_S 1
+
+static void kunit_run_case_check_speed(struct kunit *test,
+ struct kunit_case *test_case,
+ struct timespec64 duration)
+{
+ enum kunit_speed speed = test_case->attr.speed;
+
+ if (duration.tv_sec < (2 * KUNIT_SPEED_SLOW_THRESHOLD_S))
+ return;
+
+ if (speed == KUNIT_SPEED_VERY_SLOW || speed == KUNIT_SPEED_SLOW)
+ return;
+
+ kunit_warn(test,
+ "Test should be marked slow (runtime: %lld.%09lds)",
+ duration.tv_sec, duration.tv_nsec);
+}
+
/*
* Initializes and runs test case. Does not clean up or do post validations.
*/
@@ -379,6 +398,8 @@ static void kunit_run_case_internal(struct kunit *test,
struct kunit_suite *suite,
struct kunit_case *test_case)
{
+ struct timespec64 start, end;
+
if (suite->init) {
int ret;
@@ -390,7 +411,13 @@ static void kunit_run_case_internal(struct kunit *test,
}
}
+ ktime_get_ts64(&start);
+
test_case->run_case(test);
+
+ ktime_get_ts64(&end);
+
+ kunit_run_case_check_speed(test, test_case, timespec64_sub(end, start));
}
static void kunit_case_internal_cleanup(struct kunit *test)
--
2.41.0
This is the first part to add Intel VT-d nested translation based on IOMMUFD
nesting infrastructure. As the iommufd nesting infrastructure series[1],
iommu core supports new ops to allocate domains with user data. For nesting,
the user data is vendor-specific, IOMMU_HWPT_DATA_VTD_S1 is defined for
the Intel VT-d stage-1 page table, it will be used in the stage-1 domain
allocation path. struct iommu_hwpt_vtd_s1 is defined to pass user_data
for the Intel VT-d stage-1 domain allocation. This series does not have
the cache invalidation path, it would be added in part 2/2.
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix
in patch8 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.
In reality we don't know a usage relying on DMA reads from the BIOS
region. Hence finding a way to skip RO regions (e.g. via a discard manager)
in Qemu might be an acceptable tradeoff. The actual change needs more
discussion in Qemu community. For now we just hacked Qemu to test.
Complete code can be found in [3], corresponding QEMU could can be found
in [4].
[1] https://lore.kernel.org/linux-iommu/20231024150609.46884-1-yi.l.liu@intel.c…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v7:
- Rebase on top of latest iommufd nesting part 1/2
- Add the nested_parent flag in patch 07 and sanitize it for nested domain
allocation (Baolu)
- Fail the nested domain allocation if dirty tracking flag is set
v6: https://lore.kernel.org/linux-iommu/20231020093246.17015-1-yi.l.liu@intel.c…
- Add Kevin's r-b for patch 1 and 8
- Drop Kevin's r-b for patch 7
- Address comments from Kevin
- Split the VT-d nesting series into two parts 1/2 and 2/2
v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.…
- Add Kevin's r-b for patch 2, 3 ,5 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.…
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It makes sense to check every pfns used in the stage-1 page
table. But it cannot make it. So just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c…
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow read-only mappings to nest parent domain
Yi Liu (3):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 88 +++++++++++++++++----------
drivers/iommu/intel/iommu.h | 46 ++++++++++++--
drivers/iommu/intel/nested.c | 109 ++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 112 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
include/uapi/linux/iommufd.h | 42 ++++++++++++-
7 files changed, 362 insertions(+), 39 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1
Hi,
while testing a new patch on the livepatch kselftests, I was testing the gen_tar
target and I figured that we only copy the resulting binaries to the final tar
file.
Per the kselftests documentation[1], the gen_tar target is used to package the
tests to run "on different systems". But what if the different system has
different libraries/library versions? Wouldn't it be a problem?
This question came when I was working to build the livepatch modules as part of
the kselftests testing suit. The plan was to just package the test
scripts/programs/modules and then run the tests on a different system, likewise
a different SLE version. Since the kernel would be different in this case, I
expected that gen_tar would copy the module source files so they can be compiled
on the target system.
While the current approach can work when the selftests rely solely on shell scripts(cpufreq, kexec),
those who compile userspace binaries (cgroup, alsa, sched, ...) may not work.
Am I missing something? Is gen_tar only meant to copy the tests to be run on
systems with the same libraries or with the libraries with the exactly the same
version?
Thanks in advance,
Marcos
[1]: https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
Isolated cpuset partition can currently be created to contain an
exclusive set of CPUs not used in other cgroups and with load balancing
disabled to reduce interference from the scheduler.
The main purpose of this isolated partition type is to dynamically
emulate what can be done via the "isolcpus" boot command line option,
specifically the default domain flag. One effect of the "isolcpus" option
is to remove the isolated CPUs from the cpumasks of unbound workqueues
since running work functions in an isolated CPU can be a major source
of interference. Changing the unbound workqueue cpumasks can be done at
run time by writing an appropriate cpumask without the isolated CPUs to
/sys/devices/virtual/workqueue/cpumask. So one can set up an isolated
cpuset partition and then write to the cpumask sysfs file to achieve
similar level of CPU isolation. However, this manual process can be
error prone.
This patch series implements automatic exclusion of isolated CPUs from
unbound workqueue cpumasks when an isolated cpuset partition is created
and then adds those CPUs back when the isolated partition is destroyed.
There are also other places in the kernel that look at the HK_FLAG_DOMAIN
cpumask or other HK_FLAG_* cpumasks and exclude the isolated CPUs from
certain actions to further reduce interference. CPUs in an isolated
cpuset partition will not be able to avoid those interferences yet. That
may change in the future as the need arises.
Waiman Long (4):
workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs
from wq_unbound_cpumask
selftests/cgroup: Minor code cleanup and reorganization of
test_cpuset_prs.sh
cgroup/cpuset: Keep track of CPUs in isolated partitions
cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
Documentation/admin-guide/cgroup-v2.rst | 10 +-
include/linux/workqueue.h | 2 +-
kernel/cgroup/cpuset.c | 237 +++++++++++++-----
kernel/workqueue.c | 42 +++-
.../selftests/cgroup/test_cpuset_prs.sh | 209 +++++++++------
5 files changed, 350 insertions(+), 150 deletions(-)
--
2.39.3
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages of address translations to get access
to the physical address. A stage-1 translation table is owned by userspace
(e.g. by a guest OS), while a stage-2 is owned by kernel. Any change to a
stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is guest I/O
page table. As the below diagram shows, the guest I/O page table pointer
in GPA (guest physical address) is passed to host and be used to perform
a stage-1 translation. Along with it, a modification to present mappings
in the guest I/O page table should be followed by an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each hwpt is backed by an iommu_domain allocated from an iommu driver.
So in this series hw_pagetable and iommu_domain means the same thing if no
special note. IOMMUFD has already supported allocating hw_pagetable linked
with an IOAS. However, a nesting case requires IOMMUFD to allow allocating
hw_pagetable with driver specific parameters and interface to sync stage-1
IOTLB as user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1] and nested
parent domain allocation [2]. It first extends domain_alloc_user to allocate
hwpt with user data by allowing the IOMMUFD internal infrastructure to accept
user_data and parent hwpt, relaying the user_data/parent to the iommu core
to allocate IOMMU_DOMAIN_NESTED. And it then extends the IOMMU_HWPT_ALLOC
ioctl to accept user data and a parent hwpt ID.
Note that this series is the part-1 set of a two-part nesting series. It
does not include the cache invalidation interface, which will be added in
the part 2.
Complete code can be found in [3], it is on top of Joao's dirty page tracking
v6 series and fix patches. QEMU could can be found in [4].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
[1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged
[2] https://lore.kernel.org/linux-iommu/20230928071528.26258-1-yi.l.liu@intel.c… - merged
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v6:
- Rebase on top of Joao's dirty tracking series:
https://lore.kernel.org/linux-iommu/20231024135109.73787-1-joao.m.martins@o…
- Rebase on top of the enforce_cache_coherency removal patch:
https://lore.kernel.org/linux-iommu/ZTcAhwYjjzqM0A5M@Asurada-Nvidia/
- Add parent and user_data check in iommu driver before the driver actually
supports the two input. This can make better bisect support, the change is
in patch 02.
v5: https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c…
- Split the iommufd nesting series into two parts of alloc_user and
invalidation (Jason)
- Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
do the same with the structures/alloc()/abort()/destroy(). Reworked the
selftest accordingly too. (Jason)
- Move hwpt/data_type into struct iommu_user_data from standalone op
arguments. (Jason)
- Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
_TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
- Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
- Add macro to the iommu_copy_struct_from_user() to calculate min_size
(Jason)
- Fix two bugs spotted by ZhaoYan
v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested
translation, this does not change the rule that resv regions should only
be added to the kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be
added in later series as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation
(Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
I/O page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to
avoid passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in
iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08
of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Jason Gunthorpe (2):
iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING
iommufd/device: Wrap IOMMUFD_OBJ_HWPT_PAGING-only configurations
Lu Baolu (1):
iommu: Add IOMMU_DOMAIN_NESTED
Nicolin Chen (6):
iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable
iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED
iommufd: Add a nested HW pagetable object
iommu: Add iommu_copy_struct_from_user helper
iommufd/selftest: Add nested domain allocation for mock domain
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
Yi Liu (1):
iommu: Pass in parent domain with user_data to domain_alloc_user op
drivers/iommu/intel/iommu.c | 7 +-
drivers/iommu/iommufd/device.c | 157 +++++++---
drivers/iommu/iommufd/hw_pagetable.c | 271 +++++++++++++-----
drivers/iommu/iommufd/iommufd_private.h | 73 +++--
drivers/iommu/iommufd/iommufd_test.h | 18 ++
drivers/iommu/iommufd/main.c | 10 +-
drivers/iommu/iommufd/selftest.c | 151 ++++++++--
drivers/iommu/iommufd/vfio_compat.c | 6 +-
include/linux/iommu.h | 72 ++++-
include/uapi/linux/iommufd.h | 31 +-
tools/testing/selftests/iommu/iommufd.c | 120 ++++++++
.../selftests/iommu/iommufd_fail_nth.c | 3 +-
tools/testing/selftests/iommu/iommufd_utils.h | 31 +-
13 files changed, 768 insertions(+), 182 deletions(-)
--
2.34.1
Clang uses a different set of CLI args for coverage, and the output
needs to be processed by a different set of tools.
Update the Makefile and add an example of usage in kunit docs.
Michał Winiarski (2):
arch: um: Add Clang coverage support
Documentation: kunit: Add clang UML coverage example
Documentation/dev-tools/kunit/running_tips.rst | 11 +++++++++++
arch/um/Makefile-skas | 5 +++++
2 files changed, 16 insertions(+)
--
2.42.0
In some conditions, background processes in udpgro don't have enough
time to set up the sockets. When foreground processes start, this
results in the bad GRO lookup test freezing or reporting that it
received 0 gro segments.
To fix this, increase the time given to background processes to complete
the startup before foreground processes start.
This is the same issue and the same fix as posted by Adrien Therry.
Link: https://lore.kernel.org/all/20221101184809.50013-1-athierry@redhat.com/
Signed-off-by: Lucas Karpinski <lkarpins(a)redhat.com>
---
tools/testing/selftests/net/udpgro.sh | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/udpgro.sh b/tools/testing/selftests/net/udpgro.sh
index 0c743752669a..4ccbcb2390ad 100755
--- a/tools/testing/selftests/net/udpgro.sh
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -97,7 +97,8 @@ run_one_nat() {
echo "ok" || \
echo "failed"&
- sleep 0.1
+ # Hack: let bg programs complete the startup
+ sleep 0.2
./udpgso_bench_tx ${tx_args}
ret=$?
kill -INT $pid
--
2.41.0
Change ifconfig with ip command,
on a system where ifconfig is
not used this script will not
work correcly.
Test result with this patchset:
sudo make TARGETS="net" kselftest
....
TAP version 13
1..1
timeout set to 1500
selftests: net: route_localnet.sh
run arp_announce test
net.ipv4.conf.veth0.route_localnet = 1
net.ipv4.conf.veth1.route_localnet = 1
net.ipv4.conf.veth0.arp_announce = 2
net.ipv4.conf.veth1.arp_announce = 2
PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
bytes of data.
64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.068 ms
--- 127.25.3.14 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4073ms
rtt min/avg/max/mdev = 0.038/0.062/0.068/0.012 ms
ok
run arp_ignore test
net.ipv4.conf.veth0.route_localnet = 1
net.ipv4.conf.veth1.route_localnet = 1
net.ipv4.conf.veth0.arp_ignore = 3
net.ipv4.conf.veth1.arp_ignore = 3
PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
bytes of data.
64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.032 ms
64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.065 ms
64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.066 ms
64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.065 ms
64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.065 ms
--- 127.25.3.14 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4092ms
rtt min/avg/max/mdev = 0.032/0.058/0.066/0.013 ms
ok
ok 1 selftests: net: route_localnet.sh
...
Signed-off-by: Swarup Laxman Kotiaklapudi <swarupkotikalapudi(a)gmail.com>
---
tools/testing/selftests/net/route_localnet.sh | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/route_localnet.sh b/tools/testing/selftests/net/route_localnet.sh
index 116bfeab72fa..3ab9beb4462c 100755
--- a/tools/testing/selftests/net/route_localnet.sh
+++ b/tools/testing/selftests/net/route_localnet.sh
@@ -18,8 +18,10 @@ setup() {
ip route del 127.0.0.0/8 dev lo table local
ip netns exec "${PEER_NS}" ip route del 127.0.0.0/8 dev lo table local
- ifconfig veth0 127.25.3.4/24 up
- ip netns exec "${PEER_NS}" ifconfig veth1 127.25.3.14/24 up
+ ip a add 127.25.3.4/24 dev veth0
+ ip link set dev veth0 up
+ ip netns exec "${PEER_NS}" ip a add 127.25.3.14/24 dev veth1
+ ip netns exec "${PEER_NS}" ip link set dev veth1 up
ip route flush cache
ip netns exec "${PEER_NS}" ip route flush cache
--
2.34.1
On Fri, Oct 20, 2023 at 06:21:40AM +0200, Nicolas Schier wrote:
> On Thu, Oct 19, 2023 at 03:50:05PM -0300, Marcos Paulo de Souza wrote:
> > On Sat, Oct 14, 2023 at 05:35:55PM +0900, Masahiro Yamada wrote:
> > > On Tue, Oct 10, 2023 at 5:43 AM Marcos Paulo de Souza <mpdesouza(a)suse.de> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I found an issue while moving the livepatch kselftest modules to be built on the
> > > > fly, instead of building them on kernel building.
> > > >
> > > > If, for some reason, there is a recursive make invocation that starts from the
> > > > top level Makefile and in the leaf Makefile it tries to build a module (using M=
> > > > in the make invocation), it doesn't produce the module. This happens because the
> > > > toplevel Makefile checks for M= only once. This is controlled by the
> > > > sub_make_done variable, which is exported after checking the command line
> > > > options are passed to the top level Makefile. Once this variable is set it's
> > > > the M= setting is never checked again on the recursive call.
> > > >
> > > > This can be observed when cleaning the bpf kselftest dir. When calling
> > > >
> > > > $ make TARGETS="bpf" SKIP_TARGETS="" kselftest-clean
> > > >
> > > > What happens:
> > > >
> > > > 1. It checks for some command line settings (like M=) was passed (it wasn't),
> > > > set some definitions and exports sub_make_done.
> > > >
> > > > 2. Jump into tools/testing/selftests/bpf, and calls the clean target.
> > > >
> > > > 3. The clean target is overwritten to remove some files and then jump to
> > > > bpf_testmod dir and call clean there
> > > >
> > > > 4. On bpf_testmod/Makefile, the clean target will execute
> > > > $(Q)make -C $(KDIR) M=$(BPF_TESTMOD_DIR) clean
> > > >
> > > > 5. The KDIR is to toplevel dir. The top Makefile will check that sub_make_done was
> > > > already set, ignoring the M= setting.
> > > >
> > > > 6. As M= wasn't checked, KBUILD_EXTMOD isn't set, and the clean target applies
> > > > to the kernel as a whole, making it clean all generated code/objects and
> > > > everything.
> > > >
> > > > One way to avoid it is to call "unexport sub_make_done" on
> > > > tools/testing/selftests/bpf/bpf_testmod/Makefile before processing the all
> > > > target, forcing the toplevel Makefile to process the M=, producing the module
> > > > file correctly.
> > > >
> > > > If the M=dir points to /lib/modules/.../build, then it fails with "m2c: No such
> > > > file", which I already reported here[1]. At the time this problem was treated
> > > > like a problem with kselftest infrastructure.
> > > >
> > > > Important: The process works fine if the initial make invocation is targeted to a
> > > > different directory (using -C), since it doesn't goes through the toplevel
> > > > Makefile, and sub_make_done variable is not set.
> > > >
> > > > I attached a minimal reproducer, that can be used to better understand the
> > > > problem. The "make testmod" and "make testmod-clean" have the same effect that
> > > > can be seem with the bpf kselftests. There is a unexport call commented on
> > > > test-mods/Makefile, and once that is called the process works as expected.
> > > >
> > > > Is there a better way to fix this? Is this really a problem, or am I missing
> > > > something?
> > >
> > >
> > > Or, using KBUILD_EXTMOD will work too.
> >
> > Yes, that works, only if set to /lib/modules:
> >
> > $ make kselftest TARGETS=bpf SKIP_TARGETS=""
> > make[3]: Entering directory '/home/mpdesouza/git/linux/tools/testing/selftests/bpf'
> > MOD bpf_testmod.ko
> > warning: the compiler differs from the one used to build the kernel
> > The kernel was built by: gcc (SUSE Linux) 13.2.1 20230803 [revision cc279d6c64562f05019e1d12d0d825f9391b5553]
> > You are using: gcc (SUSE Linux) 13.2.1 20230912 [revision b96e66fd4ef3e36983969fb8cdd1956f551a074b]
> > CC [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.o
> > MODPOST /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/Module.symvers
> > CC [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.mod.o
> > LD [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.ko
> > BTF [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.ko
> > Skipping BTF generation for /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.ko due to unavailability of vmlinux
> > BINARY xdp_synproxy
> > ...
> >
> > But if we set the KBUILD_EXTMOD to toplevel Makefile, it fails with a different
> > strange issue:
> >
> > $ make kselftest TARGETS=bpf SKIP_TARGETS=""
> > BINARY urandom_read
> > MOD bpf_testmod.ko
> > m2c -o scripts/Makefile.build -e scripts/Makefile.build scripts/Makefile.build.mod
> > make[6]: m2c: No such file or directory
> > make[6]: *** [<builtin>: scripts/Makefile.build] Error 127
> > make[5]: *** [Makefile:1913: /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod] Error 2
> > make[4]: *** [Makefile:19: all] Error 2
> > make[3]: *** [Makefile:229: /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod.ko] Error 2
> > make[3]: Leaving directory '/home/mpdesouza/git/linux/tools/testing/selftests/bpf'
> > make[2]: *** [Makefile:175: all] Error 2
> > make[1]: *** [/home/mpdesouza/git/linux/Makefile:1362: kselftest] Error 2
> >
> > I attached a patch that can reproduce the case where it works, and the case
> > where it doesn't by changing the value of KDIR.
> >
> > I understand that KBUILD_EXTMOD, as the name implies, was designed to build
> > "external" modules, and not ones that live inside kernel, but how could this be
> > solved?
>
> It seems to me as if there is some confusion about in-tree vs.
> out-of-tree kmods.
>
> KBUILD_EXTMOD and M are almost the same and indicate that you want to
> build _external_ (=out-of-tree) kernel modules. In-tree modules are
> only those that stay in-tree _and_ are built along with the kernel.
> Thus, 'make modules KBUILD_EXTMOD=fs/ext4' could be used to build ext4
> kmod as "out-of-tree" kernel module, that even taints the kernel if it
> gets loaded.
>
> If you want bpf_testmod.ko to be an in-tree kmod, it has to be build
> during the usual kernel build, not by running 'make kselftest'.
>
> If you use 'make -C $(KDIR)' for building out-of-tree kmods, KDIR has to
> point to the kernel build directory. (Or it may point to the source
> tree if you give O=$(BUILDDIR) as well).
Thanks for the explanation Nicolas. In this, I believe that the BPF module
should be moved into lib/, like lib/livepatch, when then be built along with
other in-tree modules.
Currently there is a bug when running the kselftests-clean target with bpf:
make kselftest-clean TARGETS=bpf SKIP_TARGETS=""
As the M= argument is ignore on the toplevel Makefile, this make invocation
applies the clean to all built kernel objects/modules/everything, which is bug
IMO.
There is a statement in the BPF docs saying that the selftests should be run
inside the tools/testing/selftests/bpf directory. At the same time, kselftests
should comply with all the targets defined in the documention, like gen_tar, and
run_tests. In this case should the build process be fixed, or just make
kselftests less restrict?
(CCing kselftests and bpf ML)
>
> HTH.
>
> Kind regards,
> Nicolas
>
>
> > For the sake of my initial about livepatch kselftests, KBUILD_EXTMOD
> > will suffice, since we will target /lib/modules, but I would like to know what
> > we can do in this case. Do you have other suggestions?
> >
> > Thanks in advance,
> > Marcos
> >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Best Regards
> > > Masahiro Yamada
>
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/Makefile b/tools/testing/selftests/bpf/bpf_testmod/Makefile
> > index 15cb36c4483a..1dce76f35405 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/Makefile
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/Makefile
> > @@ -1,5 +1,6 @@
> > BPF_TESTMOD_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST)))))
> > -KDIR ?= $(abspath $(BPF_TESTMOD_DIR)/../../../../..)
> > +#KDIR ?= $(abspath $(BPF_TESTMOD_DIR)/../../../../..)
> > +KDIR ?= /lib/modules/$(shell uname -r)/build
> >
> > ifeq ($(V),1)
> > Q =
> > @@ -12,9 +13,10 @@ MODULES = bpf_testmod.ko
> > obj-m += bpf_testmod.o
> > CFLAGS_bpf_testmod.o = -I$(src)
> >
> > +export KBUILD_EXTMOD := $(BPF_TESTMOD_DIR)
> > +
> > all:
> > - +$(Q)make -C $(KDIR) M=$(BPF_TESTMOD_DIR) modules
> > + +$(Q)make -C $(KDIR) modules
> >
> > clean:
> > - +$(Q)make -C $(KDIR) M=$(BPF_TESTMOD_DIR) clean
> > -
> > + +$(Q)make -C $(KDIR) clean
>
This patch series introduces UFFDIO_MOVE feature to userfaultfd, which
has long been implemented and maintained by Andrea in his local tree [1],
but was not upstreamed due to lack of use cases where this approach would
be better than allocating a new page and copying the contents. Previous
upstraming attempts could be found at [6] and [7].
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [3].
We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads).
More details of the usecase are explained in [3].
Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
touching them within the same vma. Today, it can only be done by mremap,
however it forces splitting the vma.
Main changes since Andrea's last version [1]:
- Trivial translations from page to folio, mmap_sem to mmap_lock
- Replace pmd_trans_unstable() with pte_offset_map_nolock() and handle its
possible failure
- Move pte mapping into remap_pages_pte to allow for retries when source
page or anon_vma is contended. Since pte_offset_map_nolock() start RCU
read section, we can't block anymore after mapping a pte, so have to unmap
the ptesm do the locking and retry.
- Add and use anon_vma_trylock_write() to avoid blocking while in RCU
read section.
- Accommodate changes in mmu_notifier_range_init() API, switch to
mmu_notifier_invalidate_range_start_nonblock() to avoid blocking while in
RCU read section.
- Open-code now removed __swp_swapcount()
- Replace pmd_read_atomic() with pmdp_get_lockless()
- Add new selftest for UFFDIO_MOVE
Changes since v1 [4]:
- add mmget_not_zero in userfaultfd_remap, per Jann Horn
- removed extern from function definitions, per Matthew Wilcox
- converted to folios in remap_pages_huge_pmd, per Matthew Wilcox
- use PageAnonExclusive in remap_pages_huge_pmd, per David Hildenbrand
- handle pgtable transfers between MMs, per Jann Horn
- ignore concurrent A/D pte bit changes, per Jann Horn
- split functions into smaller units, per David Hildenbrand
- test for folio_test_large in remap_anon_pte, per Matthew Wilcox
- use pte_swp_exclusive for swapcount check, per David Hildenbrand
- eliminated use of mmu_notifier_invalidate_range_start_nonblock,
per Jann Horn
- simplified THP alignment checks, per Jann Horn
- refactored the loop inside remap_pages, per Jann Horn
- additional clarifying comments, per Jann Horn
Changes since v2 [5]:
- renamed UFFDIO_REMAP to UFFDIO_MOVE, per David Hildenbrand
- rebase over mm-unstable to use folio_move_anon_rmap(),
per David Hildenbrand
- added text for manpage explaining DONTFORK and KSM requirements for this
feature, per David Hildenbrand
- check for anon_vma changes in the fast path of folio_lock_anon_vma_read,
per Peter Xu
- updated the title and description of the first patch,
per David Hildenbrand
- updating comments in folio_lock_anon_vma_read() explaining the need for
anon_vma checks, per David Hildenbrand
- changed all mapcount checks to PageAnonExclusive, per Jann Horn and
David Hildenbrand
- changed counters in remap_swap_pte() from MM_ANONPAGES to MM_SWAPENTS,
per Jann Horn
- added a check for PTE change after folio is locked in remap_pages_pte(),
per Jann Horn
- added handling of PMD migration entries and bailout when pmd_devmap(),
per Jann Horn
- added checks to ensure both src and dst VMAs are writable, per Peter Xu
- added UFFD_FEATURE_MOVE, per Peter Xu
- removed obsolete comments, per Peter Xu
- renamed remap_anon_pte to remap_present_pte, per Peter Xu
- added a comment for folio_get_anon_vma() explaining the need for
anon_vma checks, per Peter Xu
- changed error handling in remap_pages() to make it more clear,
per Peter Xu
- changed EFAULT to EAGAIN to retry when a hugepage appears or disappears
from under us, per Peter Xu
- added links to previous upstreaming attempts, per David Hildenbrand
[1] https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc…
[2] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redha…
[3] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyj…
[4] https://lore.kernel.org/all/20230914152620.2743033-1-surenb@google.com/
[5] https://lore.kernel.org/all/20230923013148.1390521-1-surenb@google.com/
[6] https://lore.kernel.org/all/1425575884-2574-21-git-send-email-aarcange@redh…
[7] https://lore.kernel.org/all/cover.1547251023.git.blake.caldwell@colorado.ed…
The patchset applies over mm-unstable.
Andrea Arcangeli (2):
mm/rmap: support move to different root anon_vma in
folio_move_anon_rmap()
userfaultfd: UFFDIO_MOVE uABI
Suren Baghdasaryan (1):
selftests/mm: add UFFDIO_MOVE ioctl test
Documentation/admin-guide/mm/userfaultfd.rst | 3 +
fs/userfaultfd.c | 63 ++
include/linux/rmap.h | 5 +
include/linux/userfaultfd_k.h | 12 +
include/uapi/linux/userfaultfd.h | 29 +-
mm/huge_memory.c | 138 +++++
mm/khugepaged.c | 3 +
mm/rmap.c | 30 +
mm/userfaultfd.c | 602 +++++++++++++++++++
tools/testing/selftests/mm/uffd-common.c | 41 +-
tools/testing/selftests/mm/uffd-common.h | 1 +
tools/testing/selftests/mm/uffd-unit-tests.c | 62 ++
12 files changed, 986 insertions(+), 3 deletions(-)
--
2.42.0.609.gbb76f46606-goog
From: Jeff Xu <jeffxu(a)google.com>
This patchset proposes a new mseal() syscall for the Linux kernel.
Modern CPUs support memory permissions such as RW and NX bits. Linux has
supported NX since the release of kernel version 2.6.8 in August 2004 [1].
The memory permission feature improves security stance on memory
corruption bugs, i.e. the attacker can’t just write to arbitrary memory
and point the code to it, the memory has to be marked with X bit, or
else an exception will happen.
Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where
a corrupted pointer is passed to a memory management syscall. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
Also, Chrome wants to adopt this feature for their CFI work [2] and this
patchset has been designed to be compatible with the Chrome use case.
The new mseal() is an architecture independent syscall, and with
following signature:
mseal(void addr, size_t len, unsigned int types, unsigned int flags)
addr/len: memory range. Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to comments in mseal.c. Those are also fully
covered by the selftest.
types: bit mask to specify which syscall to seal, currently they are:
MM_SEAL_MSEAL 0x1
MM_SEAL_MPROTECT 0x2
MM_SEAL_MUNMAP 0x4
MM_SEAL_MMAP 0x8
MM_SEAL_MREMAP 0x10
Each bit represents sealing for one specific syscall type, e.g.
MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
is that the API is extendable, i.e. when needed, the sealing can be
extended to madvise, mlock, etc. Backward compatibility is also easy.
The kernel will remember which seal types are applied, and the application
doesn’t need to repeat all existing seal types in the next mseal(). Once
a seal type is applied, it can’t be unsealed. Call mseal() on an existing
seal type is a no-action, not a failure.
MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
Internally, vm_area_struct adds a new field vm_seals, to store the bit
masks.
For the affected syscalls, such as mprotect, a check(can_modify_mm) for
sealing is added, this usually happens at the early point of the syscall,
before any update is made to VMAs. The effect of that is: if any of the
VMAs in the given address range fails the sealing check, none of the VMA
will be updated. It might be worth noting that this is different from the
rest of mprotect(), where some updates can happen even when mprotect
returns fail. Consider can_modify_mm only checks vm_seals in
vm_area_struct, and it is not going deeper in the page table or updating
any HW, success or none behavior might fit better here. I would like to
listen to the community's feedback on this.
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5], Chrome browser in ChromeOS will be the first user of this API.
In addition, Stephen is working on glibc change to add sealing support
into the dynamic linker to seal all non-writable segments at startup. When
that work is completed, all applications can automatically benefit from
these new protections.
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
Jeff Xu (8):
Add mseal syscall
Wire up mseal syscall
mseal: add can_modify_mm and can_modify_vma
mseal: seal mprotect
mseal munmap
mseal mremap
mseal mmap
selftest mm/mseal mprotect/munmap/mremap/mmap
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/aio.c | 5 +-
include/linux/mm.h | 55 +-
include/linux/mm_types.h | 7 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/mman.h | 6 +
ipc/shm.c | 3 +-
kernel/sys_ni.c | 1 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/internal.h | 4 +-
mm/mmap.c | 49 +-
mm/mprotect.c | 6 +
mm/mremap.c | 19 +-
mm/mseal.c | 328 +++++
mm/nommu.c | 6 +-
mm/util.c | 8 +-
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++
37 files changed, 1934 insertions(+), 28 deletions(-)
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.42.0.609.gbb76f46606-goog
Dzień dobry,
dostrzegam możliwość współpracy z Państwa firmą.
Świadczymy kompleksową obsługę inwestycji w fotowoltaikę, która obniża koszty energii elektrycznej.
Czy są Państwo zainteresowani weryfikacją wstępnych propozycji?
Pozdrawiam,
Kamil Lasek
We recently encountered a bug that makes all zswap store attempt fail.
Specifically, after:
"141fdeececb3 mm/zswap: delay the initialization of zswap"
if we build a kernel with zswap disabled by default, then enabled after
the swapfile is set up, the zswap tree will not be initialized. As a
result, all zswap store calls will be short-circuited. We have to
perform another swapon to get zswap working properly again.
Fortunately, this issue has since been fixed by the patch that kills
frontswap:
"42c06a0e8ebe mm: kill frontswap"
which performs zswap_swapon() unconditionally, i.e always initializing
the zswap tree.
This test add a sanity check that ensure zswap storing works as
intended.
Signed-off-by: Nhat Pham <nphamcs(a)gmail.com>
---
tools/testing/selftests/cgroup/test_zswap.c | 48 +++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
index 49def87a909b..c99d2adaca3f 100644
--- a/tools/testing/selftests/cgroup/test_zswap.c
+++ b/tools/testing/selftests/cgroup/test_zswap.c
@@ -55,6 +55,11 @@ static int get_zswap_written_back_pages(size_t *value)
return read_int("/sys/kernel/debug/zswap/written_back_pages", value);
}
+static long get_zswpout(const char *cgroup)
+{
+ return cg_read_key_long(cgroup, "memory.stat", "zswpout ");
+}
+
static int allocate_bytes(const char *cgroup, void *arg)
{
size_t size = (size_t)arg;
@@ -68,6 +73,48 @@ static int allocate_bytes(const char *cgroup, void *arg)
return 0;
}
+/*
+ * Sanity test to check that pages are written into zswap.
+ */
+static int test_zswap_usage(const char *root)
+{
+ long zswpout_before, zswpout_after;
+ int ret = KSFT_FAIL;
+ char *test_group;
+
+ /* Set up */
+ test_group = cg_name(root, "no_shrink_test");
+ if (!test_group)
+ goto out;
+ if (cg_create(test_group))
+ goto out;
+ if (cg_write(test_group, "memory.max", "1M"))
+ goto out;
+
+ zswpout_before = get_zswpout(test_group);
+ if (zswpout_before < 0) {
+ ksft_print_msg("Failed to get zswpout\n");
+ goto out;
+ }
+
+ /* Allocate more than memory.max to push memory into zswap */
+ if (cg_run(test_group, allocate_bytes, (void *)MB(4)))
+ goto out;
+
+ /* Verify that pages come into zswap */
+ zswpout_after = get_zswpout(test_group);
+ if (zswpout_after <= zswpout_before) {
+ ksft_print_msg("zswpout does not increase after test program\n");
+ goto out;
+ }
+ ret = KSFT_PASS;
+
+out:
+ cg_destroy(test_group);
+ free(test_group);
+ return ret;
+}
+
/*
* When trying to store a memcg page in zswap, if the memcg hits its memory
* limit in zswap, writeback should not be triggered.
@@ -235,6 +282,7 @@ struct zswap_test {
int (*fn)(const char *root);
const char *name;
} tests[] = {
+ T(test_zswap_usage),
T(test_no_kmem_bypass),
T(test_no_invasive_cgroup_shrink),
};
--
2.34.1
This is the first part to add Intel VT-d nested translation based on IOMMUFD
nesting infrastructure. As the iommufd nesting infrastructure series[1],
iommu core supports new ops to allocate domains with user data. For nesting,
the user data is vendor-specific, IOMMU_HWPT_DATA_VTD_S1 is defined for
the Intel VT-d stage-1 page table, it will be used in the stage-1 domain
allocation path. struct iommu_hwpt_vtd_s1 is defined to pass user_data
for the Intel VT-d stage-1 domain allocation. This series does not have
the cache invalidation path, it would be added in part 2/2.
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix
in patch8 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.
In reality we don't know a usage relying on DMA reads from the BIOS
region. Hence finding a way to skip RO regions (e.g. via a discard manager)
in Qemu might be an acceptable tradeoff. The actual change needs more
discussion in Qemu community. For now we just hacked Qemu to test.
Complete code can be found in [3], corresponding QEMU could can be found
in [4].
[1] https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v6:
- Add Kevin's r-b for patch 1 and 8
- Drop Kevin's r-b for patch 7
- Address comments from Kevin
- Split the VT-d nesting series into two parts 1/2 and 2/2
v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.…
- Add Kevin's r-b for patch 2, 3 ,5 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.…
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It makes sense to check every pfns used in the stage-1 page
table. But it cannot make it. So just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c…
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow read-only mappings to nest parent domain
Yi Liu (3):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 63 +++++++++++++-------
drivers/iommu/intel/iommu.h | 46 ++++++++++++--
drivers/iommu/intel/nested.c | 109 ++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 112 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
include/uapi/linux/iommufd.h | 42 ++++++++++++-
7 files changed, 348 insertions(+), 28 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages (stage-1, stage-2) address translation
to get access to the physical address. stage-1 translation table is owned
by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes
to stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is I/O page
table. As the below diagram shows, guest I/O page table pointer in GPA
(guest physical address) is passed to host and be used to perform the stage-1
address translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed with an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each has an iommu_domain allocated from iommu driver. So in this series
hw_pagetable and iommu_domain means the same thing if no special note.
IOMMUFD has already supported allocating hw_pagetable that is linked with
an IOAS. However, nesting requires IOMMUFD to allow allocating hw_pagetable
with driver specific parameters and interface to sync stage-1 IOTLB as user
owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1]. It first
extends domain_alloc_user to allocate domains with user data and adds new
op for invalidate stage-1 IOTLB for user-managed domains, then extends the
IOMMUFD internal infrastructure to accept user_data and parent hwpt, relay
the user_data/parent to iommu core to allocate user-managed iommu_domain.
After it, extends the ioctl IOMMU_HWPT_ALLOC to accept user data and stage-2
hwpt ID. Along with it, ioctl IOMMU_HWPT_INVALIDATE is added to invalidate
stage-1 IOTLB. This is needed for user-managed hwpts. Selftest is added as
well to cover the new ioctls.
Complete code can be found in [2], QEMU could can be found in [3].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
[1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v4:
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation,
this does not change the rule that resv regions should only be added to the
kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be added in later series
as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O
page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to avoid
passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Lu Baolu (1):
iommu: Add nested domain support
Nicolin Chen (12):
iommufd: Unite all kernel-managed members into a struct
iommufd: Separate kernel-managed HWPT alloc/destroy/abort functions
iommufd: Add shared alloc_fn function pointer and mutex pointer
iommufd: Add user-managed hw_pagetable support
iommufd: Always setup MSI and anforce cc on kernel-managed domains
iommufd/device: Add helpers to enforce/remove device reserved regions
iommufd/selftest: Rework TEST_LENGTH to test min_size explicitly
iommufd/selftest: Add nested domain allocation for mock domain
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
iommufd/selftest: Add mock_domain_cache_invalidate_user support
iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
Yi Liu (4):
iommu: Add hwpt_type with user_data for domain_alloc_user op
iommufd: Pass in hwpt_type/user_data to iommufd_hw_pagetable_alloc()
iommufd: Support IOMMU_HWPT_ALLOC allocation with user data
iommufd: Add IOMMU_HWPT_INVALIDATE
drivers/iommu/intel/iommu.c | 5 +-
drivers/iommu/iommufd/device.c | 51 +++-
drivers/iommu/iommufd/hw_pagetable.c | 257 ++++++++++++++++--
drivers/iommu/iommufd/iommufd_private.h | 59 +++-
drivers/iommu/iommufd/iommufd_test.h | 40 +++
drivers/iommu/iommufd/main.c | 3 +
drivers/iommu/iommufd/selftest.c | 184 ++++++++++++-
include/linux/iommu.h | 110 +++++++-
include/uapi/linux/iommufd.h | 60 +++-
tools/testing/selftests/iommu/iommufd.c | 209 +++++++++++++-
.../selftests/iommu/iommufd_fail_nth.c | 3 +-
tools/testing/selftests/iommu/iommufd_utils.h | 91 ++++++-
12 files changed, 998 insertions(+), 74 deletions(-)
--
2.34.1
The sysfs code for online targets updating can result in adding more than
expected monigoring targets to the context. It can result in unexpected amount
of memory consumption and monitoring overhead. This patchset fixes the issue
(patch 1), and add a kunit test for avoiding similar bug of future (patch 2).
SeongJae Park (2):
mm/damon/sysfs: remove requested targets when online-commit inputs
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
mm/damon/Kconfig | 12 ++++++
mm/damon/sysfs-test.h | 86 +++++++++++++++++++++++++++++++++++++++++++
mm/damon/sysfs.c | 52 ++++++--------------------
3 files changed, 109 insertions(+), 41 deletions(-)
create mode 100644 mm/damon/sysfs-test.h
base-commit: 9a969da6ffb9609f5fa8d0b7fdc6859c37a10335
--
2.34.1
This is the second part to add Intel VT-d nested translation based on IOMMUFD
nesting infrastructure. As the iommufd nesting infrastructure series [1],
iommu core supports new ops to invalidate the cache after the modifictions
in stage-1 page table. So far, the cache invalidation data is vendor specific,
the data_type (IOMMU_HWPT_DATA_VTD_S1) defined for the vendor specific HWPT
allocation is reused in the cache invalidation path. User should provide the
correct data_type that suit with the type used in HWPT allocation.
IOMMU_HWPT_INVALIDATE iotcl returns an error in @out_driver_error_code. However
Intel VT-d does not define error code so far, so it's not easy to pre-define it
in iommufd neither. As a result, this field should just be ignored on VT-d platform.
Complete code can be found in [2], corresponding QEMU could can be found in [3].
[1] https://lore.kernel.org/linux-iommu/20231020092426.13907-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v6:
- Address comments from Kevin
- Split the VT-d nesting series into two parts (Jason)
v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.…
- Add Kevin's r-b for patch 2, 3 ,5 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.…
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It makes sense to check every pfns used in the stage-1 page
table. But it cannot make it. So just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c…
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Yi Liu (3):
iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
iommu/vt-d: Make iotlb flush helpers to be extern
iommu/vt-d: Add iotlb flush for nested domain
drivers/iommu/intel/iommu.c | 10 +++----
drivers/iommu/intel/iommu.h | 6 ++++
drivers/iommu/intel/nested.c | 54 ++++++++++++++++++++++++++++++++++++
include/uapi/linux/iommufd.h | 36 ++++++++++++++++++++++++
4 files changed, 101 insertions(+), 5 deletions(-)
--
2.34.1
Hi Linus,
Please pull the following Kselftest fixes update for Linux 6.6-rc7.
This Kselftest update for Linux 6.6-rc7 consists of one single fix
to assert check in user_events abi_test to properly check bit value
on Big Endian architectures. The current code treats the bit values
as Little Endian and the check fails on Big Endian.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 6f874fa021dfc7bf37f4f37da3a5aaa41fe9c39c:
selftests: Fix wrong TARGET in kselftest top level Makefile (2023-09-26 18:47:37 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest_active-fixes-6.6-rc7
for you to fetch changes up to cf5a103c98a6fb9ee3164334cb5502df6360749b:
selftests/user_events: Fix abi_test for BE archs (2023-10-17 15:07:19 -0600)
----------------------------------------------------------------
linux_kselftest_active-fixes-6.6-rc7
This Kselftest update for Linux 6.6-rc7 consists of one single fix
to assert check in user_events abi_test to properly check bit value
on Big Endian architectures. The current code treats the bit values
as Little Endian and the check fails on Big Endian.
----------------------------------------------------------------
Beau Belgrave (1):
selftests/user_events: Fix abi_test for BE archs
tools/testing/selftests/user_events/abi_test.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
----------------------------------------------------------------
Changelog:
v3:
* Add a patch to export per-cgroup zswap writeback counters
* Add a patch to update zswap's kselftest
* Separate the new list_lru functions into its own prep patch
* Do not start from the top of the hierarchy when encounter a memcg
that is not online for the global limit zswap writeback (patch 2)
(suggested by Yosry Ahmed)
* Do not remove the swap entry from list_lru in
__read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
* Removed a redundant zswap pool getting (patch 2)
(reported by Ryan Roberts)
* Use atomic for the nr_zswap_protected (instead of lruvec's lock)
(patch 5) (suggested by Yosry Ahmed)
* Remove the per-cgroup zswap shrinker knob (patch 5)
(suggested by Yosry Ahmed)
v2:
* Fix loongarch compiler errors
* Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM
There are currently several issues with zswap writeback:
1. There is only a single global LRU for zswap, making it impossible to
perform worload-specific shrinking - an memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
But this solution leaves a lot to be desired, as we still do not
have an avenue for an memcg to free up its own memory locked up in
the zswap pool.
2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).
This patch series solves these issues by separating the global zswap
LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
(i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The
new shrinker does not have any parameter that must be tuned by the
user, and can be opted in or out on a per-memcg basis.
As a proof of concept, we ran the following synthetic benchmark:
build the linux kernel in a memory-limited cgroup, and allocate some
cold data in tmpfs to see if the shrinker could write them out and
improved the overall performance. Depending on the amount of cold data
generated, we observe from 14% to 35% reduction in kernel CPU time used
in the kernel builds.
Domenico Cerasuolo (3):
zswap: make shrinking memcg-aware
mm: memcg: add per-memcg zswap writeback stat
selftests: cgroup: update per-memcg zswap writeback selftest
Nhat Pham (2):
mm: list_lru: allow external numa node and cgroup tracking
zswap: shrinks zswap pool based on memory pressure
Documentation/admin-guide/mm/zswap.rst | 7 +
include/linux/list_lru.h | 38 +++
include/linux/memcontrol.h | 7 +
include/linux/mmzone.h | 14 +
mm/list_lru.c | 43 ++-
mm/memcontrol.c | 15 +
mm/mmzone.c | 3 +
mm/swap.h | 3 +-
mm/swap_state.c | 38 ++-
mm/zswap.c | 335 ++++++++++++++++----
tools/testing/selftests/cgroup/test_zswap.c | 74 +++--
11 files changed, 485 insertions(+), 92 deletions(-)
--
2.34.1
From: Jeff Xu <jeffxu(a)google.com>
This patchset proposes a new mseal() syscall for the Linux kernel.
Modern CPUs support memory permissions such as RW and NX bits. Linux has
supported NX since the release of kernel version 2.6.8 in August 2004 [1].
The memory permission feature improves security stance on memory
corruption bugs, i.e. the attacker can’t just write to arbitrary memory
and point the code to it, the memory has to be marked with X bit, or
else an exception will happen. The protection is set by mmap(2),
mprotect(2), mremap(2).
Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where
a corrupted pointer is passed to a memory management syscall. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
Also, Chrome wants to adopt this feature for their CFI work [2] and this
patchset has been designed to be compatible with the Chrome use case.
The new mseal() is an architecture independent syscall, and with
following signature:
mseal(void addr, size_t len, unsigned long types, unsigned long flags)
addr/len: memory range. Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to comments in mseal.c. Those are also fully
covered by the selftest.
types: bit mask to specify which syscall to seal.
Five syscalls can be sealed, as specified by bitmasks:
MM_SEAL_MPROTECT: Deny mprotect(2)/pkey_mprotect(2).
MM_SEAL_MUNMAP: Deny munmap(2).
MM_SEAL_MMAP: Deny mmap(2).
MM_SEAL_MREMAP: Deny mremap(2).
MM_SEAL_MSEAL: Deny adding a new seal type.
Each bit represents sealing for one specific syscall type, e.g.
MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
is that the API is extendable, i.e. when needed, the sealing can be
extended to madvise, mlock, etc. Backward compatibility is also easy.
The kernel will remember which seal types are applied, and the application
doesn’t need to repeat all existing seal types in the next mseal(). Once
a seal type is applied, it can’t be unsealed. Call mseal() on an existing
seal type is a no-action, not a failure.
MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
Internally, vm_area_struct adds a new field vm_seals, to store the bit
masks.
For the affected syscalls, such as mprotect, a check(can_modify_mm) for
sealing is added, this usually happens at the early point of the syscall,
before any update is made to VMAs. The effect of that is: if any of the
VMAs in the given address range fails the sealing check, none of the VMA
will be updated.
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5], Chrome browser in ChromeOS will be the first user of this API.
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
PATCH history:
v1:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add detailed commit message.
v0:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
Jeff Xu (8):
mseal: Add mseal(2) syscall.
mseal: Wire up mseal syscall
mseal: add can_modify_mm and can_modify_vma
mseal: Check seal flag for mprotect(2)
mseal: Check seal flag for munmap(2)
mseal: Check seal flag for mremap(2)
mseal:Check seal flag for mmap(2)
selftest mm/mseal mprotect/munmap/mremap/mmap
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/aio.c | 5 +-
include/linux/mm.h | 44 +-
include/linux/mm_types.h | 7 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/mman.h | 6 +
ipc/shm.c | 3 +-
kernel/sys_ni.c | 1 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/internal.h | 4 +-
mm/mmap.c | 57 +-
mm/mprotect.c | 15 +
mm/mremap.c | 30 +-
mm/mseal.c | 268 ++++
mm/nommu.c | 6 +-
mm/util.c | 8 +-
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++
37 files changed, 1891 insertions(+), 28 deletions(-)
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.42.0.655.g421f12c284-goog
Use a filter function to skip the time namespace test on systems with
!CONFIG_TIME_NS. This reworks a fix originally done by Tiezhu Yang prior
to the refactoring in 34dce23f7e40 ("selftests/clone3: Report descriptive
test names"). The changelog from their fix explains the issue very clearly:
When execute the following command to test clone3 under !CONFIG_TIME_NS:
# make headers && cd tools/testing/selftests/clone3 && make && ./clone3
we can see the following error info:
# [7538] Trying clone3() with flags 0x80 (size 0)
# Invalid argument - Failed to create new process
# [7538] clone3() with flags says: -22 expected 0
not ok 18 [7538] Result (-22) is different than expected (0)
...
# Totals: pass:18 fail:1 xfail:0 xpass:0 skip:0 error:0
This is because if CONFIG_TIME_NS is not set, but the flag
CLONE_NEWTIME (0x80) is used to clone a time namespace, it
will return -EINVAL in copy_time_ns().
If kernel does not support CONFIG_TIME_NS, /proc/self/ns/time
will be not exist, and then we should skip clone3() test with
CLONE_NEWTIME.
With this patch under !CONFIG_TIME_NS:
# make headers && cd tools/testing/selftests/clone3 && make && ./clone3
...
# Time namespaces are not supported
ok 18 # SKIP Skipping clone3() with CLONE_NEWTIME
...
# Totals: pass:18 fail:0 xfail:0 xpass:0 skip:1 error:0
Fixes: 515bddf0ec41 ("selftests/clone3: test clone3 with CLONE_NEWTIME")
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Original-fix-from: Tiezhu Yang <yangtiezhu(a)loongson.cn>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/clone3/clone3.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 9429d361059e..915846bc32cc 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -138,6 +138,16 @@ static bool not_root(void)
return false;
}
+static bool timens_unsupported(void)
+{
+ if (access("/proc/self/ns/time", F_OK) == 0) {
+ ksft_print_msg("Time namespaces are not supported\n");
+ return true;
+ }
+
+ return false;
+}
+
static size_t page_size_plus_8(void)
{
return getpagesize() + 8;
@@ -282,6 +292,7 @@ static const struct test tests[] = {
.size = 0,
.expected = 0,
.test_mode = CLONE3_ARGS_NO_TEST,
+ .filter = timens_unsupported,
},
{
.name = "exit signal (SIGCHLD) in flags",
---
base-commit: 8d4099dd0727acfc8b0f644eacaf852f9d5dc649
change-id: 20231019-kselftest-clone3-time-ns-730b6f4187c7
Best regards,
--
Mark Brown <broonie(a)kernel.org>
If name_show() is non unique, this test will try to install a kprobe on this
function which should fail returning EADDRNOTAVAIL.
On kernel where name_show() is not unique, this test is skipped.
Cc: stable(a)vger.kernel.org
Signed-off-by: Francis Laniel <flaniel(a)linux.microsoft.com>
---
.../ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc | 13 +++++++++++++
1 file changed, 13 insertions(+)
create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc
new file mode 100644
index 000000000000..bc9514428dba
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc
@@ -0,0 +1,13 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test failure of registering kprobe on non unique symbol
+# requires: kprobe_events
+
+SYMBOL='name_show'
+
+# We skip this test on kernel where SYMBOL is unique or does not exist.
+if [ "$(grep -c -E "[[:alnum:]]+ t ${SYMBOL}" /proc/kallsyms)" -le '1' ]; then
+ exit_unsupported
+fi
+
+! echo "p:test_non_unique ${SYMBOL}" > kprobe_events
--
2.34.1
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages (stage-1, stage-2) address translation
to get access to the physical address. stage-1 translation table is owned
by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes
to stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is I/O page
table. As the below diagram shows, guest I/O page table pointer in GPA
(guest physical address) is passed to host and be used to perform the stage-1
address translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed with an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
This series adds the cache invalidation path for the userspace to invalidate
cache after modifying the stage-1 page table. This is based on the first part
of nesting [1]
Complete code can be found in [2], QEMU could can be found in [3].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
[1] https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v5:
- Split the iommufd nesting series into two parts of alloc_user and
invalidation (Jason)
- Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
do the same with the structures/alloc()/abort()/destroy(). Reworked the
selftest accordingly too. (Jason)
- Move hwpt/data_type into struct iommu_user_data from standalone op
arguments. (Jason)
- Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
_TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
- Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
- Add macro to the iommu_copy_struct_from_user() to calculate min_size
(Jason)
- Fix two bugs spotted by ZhaoYan
v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation,
this does not change the rule that resv regions should only be added to the
kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be added in later series
as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O
page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to avoid
passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Lu Baolu (1):
iommu: Add cache_invalidate_user op
Nicolin Chen (4):
iommu: Add iommu_copy_struct_from_user_array helper
iommufd/selftest: Add mock_domain_cache_invalidate_user support
iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
Yi Liu (1):
iommufd: Add IOMMU_HWPT_INVALIDATE
drivers/iommu/iommufd/hw_pagetable.c | 35 ++++++++
drivers/iommu/iommufd/iommufd_private.h | 9 ++
drivers/iommu/iommufd/iommufd_test.h | 22 +++++
drivers/iommu/iommufd/main.c | 3 +
drivers/iommu/iommufd/selftest.c | 69 +++++++++++++++
include/linux/iommu.h | 84 +++++++++++++++++++
include/uapi/linux/iommufd.h | 36 ++++++++
tools/testing/selftests/iommu/iommufd.c | 75 +++++++++++++++++
tools/testing/selftests/iommu/iommufd_utils.h | 63 ++++++++++++++
9 files changed, 396 insertions(+)
--
2.34.1
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages of address translations to get access
to the physical address. A stage-1 translation table is owned by userspace
(e.g. by a guest OS), while a stage-2 is owned by kernel. Any change to a
stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is guest I/O
page table. As the below diagram shows, the guest I/O page table pointer
in GPA (guest physical address) is passed to host and be used to perform
a stage-1 translation. Along with it, a modification to present mappings
in the guest I/O page table should be followed by an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each hwpt is backed by an iommu_domain allocated from an iommu driver.
So in this series hw_pagetable and iommu_domain means the same thing if no
special note. IOMMUFD has already supported allocating hw_pagetable linked
with an IOAS. However, a nesting case requires IOMMUFD to allow allocating
hw_pagetable with driver specific parameters and interface to sync stage-1
IOTLB as user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1] and nested
parent domain allocation [2]. It first extends domain_alloc_user to allocate
hwpt with user data by allowing the IOMMUFD internal infrastructure to accept
user_data and parent hwpt, relaying the user_data/parent to the iommu core
to allocate IOMMU_DOMAIN_NESTED. And it then extends the IOMMU_HWPT_ALLOC
ioctl to accept user data and a parent hwpt ID.
Note that this series is the part-1 set of a two-part nesting series. It
does not include the cache invalidation interface, which will be added in
the part 2.
Complete code can be found in [3], QEMU could can be found in [4].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
[1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged
[2] https://lore.kernel.org/linux-iommu/20230928071528.26258-1-yi.l.liu@intel.c… - merged
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v5:
- Split the iommufd nesting series into two parts of alloc_user and
invalidation (Jason)
- Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
do the same with the structures/alloc()/abort()/destroy(). Reworked the
selftest accordingly too. (Jason)
- Move hwpt/data_type into struct iommu_user_data from standalone op
arguments. (Jason)
- Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
_TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
- Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
- Add macro to the iommu_copy_struct_from_user() to calculate min_size
(Jason)
- Fix two bugs spotted by ZhaoYan
v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested
translation, this does not change the rule that resv regions should only
be added to the kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be
added in later series as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation
(Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
I/O page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to
avoid passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in
iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08
of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Jason Gunthorpe (2):
iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING
iommufd/device: Wrap IOMMUFD_OBJ_HWPT_PAGING-only configurations
Lu Baolu (1):
iommu: Add IOMMU_DOMAIN_NESTED
Nicolin Chen (6):
iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable
iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED
iommufd: Add a nested HW pagetable object
iommu: Add iommu_copy_struct_from_user helper
iommufd/selftest: Add nested domain allocation for mock domain
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
Yi Liu (1):
iommu: Pass in parent domain with user_data to domain_alloc_user op
drivers/iommu/intel/iommu.c | 4 +-
drivers/iommu/iommufd/device.c | 184 +++++++++-----
drivers/iommu/iommufd/hw_pagetable.c | 237 ++++++++++++++----
drivers/iommu/iommufd/iommufd_private.h | 67 +++--
drivers/iommu/iommufd/iommufd_test.h | 18 ++
drivers/iommu/iommufd/main.c | 10 +-
drivers/iommu/iommufd/selftest.c | 137 ++++++++--
drivers/iommu/iommufd/vfio_compat.c | 6 +-
include/linux/iommu.h | 72 +++++-
include/uapi/linux/iommufd.h | 31 ++-
tools/testing/selftests/iommu/iommufd.c | 120 +++++++++
.../selftests/iommu/iommufd_fail_nth.c | 3 +-
tools/testing/selftests/iommu/iommufd_utils.h | 31 ++-
13 files changed, 752 insertions(+), 168 deletions(-)
--
2.34.1
hi, Akihiko Odaki,
sorry for sending again, the previous one has some problem that lost most
of CC part.
Hello,
kernel test robot noticed "kernel-selftests.net.make.fail" on:
commit: c04079dfb34c2f534f013408b12218c14b286b7d ("[RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing")
url: https://github.com/intel-lab-lkp/linux/commits/Akihiko-Odaki/net-skbuff-Add…
base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next
patch link: https://lore.kernel.org/all/20231008052101.144422-7-akihiko.odaki@daynix.co…
patch subject: [RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing
in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:
group: net
test: fcnal-test.sh
atomic_test: ipv6_runtime
compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang(a)intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310192236.fde97031-oliver.sang@intel.com
KERNEL SELFTESTS: linux_headers_dir is /usr/src/linux-headers-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d
2023-10-14 17:54:16 mount --bind /lib/modules/6.6.0-rc2-00023-gc04079dfb34c/kernel/lib /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/lib
make: Entering directory '/usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/bpf/resolve_btfids'
...
gcc -Wall -Wl,--no-as-needed -O2 -g -I../../../../usr/include/ -I../../../include/ -isystem /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/usr/include -I../ txtimestamp.c -o /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/txtimestamp
reuseport_bpf.c: In function ‘attach_cbpf’:
reuseport_bpf.c:133:28: error: array type has incomplete element type ‘struct sock_filter’
133 | struct sock_filter code[] = {
| ^~~~
reuseport_bpf.c:139:29: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_H’?
139 | { BPF_RET | BPF_A, 0, 0, 0 },
| ^~~~~
| BPF_H
reuseport_bpf.c:139:29: note: each undeclared identifier is reported only once for each function it appears in
reuseport_bpf.c:141:16: error: variable ‘p’ has initializer but incomplete type
141 | struct sock_fprog p = {
| ^~~~~~~~~~
reuseport_bpf.c:142:18: error: ‘struct sock_fprog’ has no member named ‘len’
142 | .len = ARRAY_SIZE(code),
| ^~~
In file included from reuseport_bpf.c:27:
../kselftest.h:56:25: warning: excess elements in struct initializer
56 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
| ^
reuseport_bpf.c:142:24: note: in expansion of macro ‘ARRAY_SIZE’
142 | .len = ARRAY_SIZE(code),
| ^~~~~~~~~~
../kselftest.h:56:25: note: (near initialization for ‘p’)
56 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
| ^
reuseport_bpf.c:142:24: note: in expansion of macro ‘ARRAY_SIZE’
142 | .len = ARRAY_SIZE(code),
| ^~~~~~~~~~
reuseport_bpf.c:143:18: error: ‘struct sock_fprog’ has no member named ‘filter’
143 | .filter = code,
| ^~~~~~
reuseport_bpf.c:143:27: warning: excess elements in struct initializer
143 | .filter = code,
| ^~~~
reuseport_bpf.c:143:27: note: (near initialization for ‘p’)
reuseport_bpf.c:141:27: error: storage size of ‘p’ isn’t known
141 | struct sock_fprog p = {
| ^
reuseport_bpf.c:141:27: warning: unused variable ‘p’ [-Wunused-variable]
reuseport_bpf.c:133:28: warning: unused variable ‘code’ [-Wunused-variable]
133 | struct sock_filter code[] = {
| ^~~~
reuseport_bpf.c: In function ‘test_filter_no_reuseport’:
reuseport_bpf.c:346:28: error: array type has incomplete element type ‘struct sock_filter’
346 | struct sock_filter ccode[] = {{ BPF_RET | BPF_A, 0, 0, 0 }};
| ^~~~~
reuseport_bpf.c:346:51: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_H’?
346 | struct sock_filter ccode[] = {{ BPF_RET | BPF_A, 0, 0, 0 }};
| ^~~~~
| BPF_H
reuseport_bpf.c:348:27: error: storage size of ‘cprog’ isn’t known
348 | struct sock_fprog cprog;
| ^~~~~
reuseport_bpf.c:348:27: warning: unused variable ‘cprog’ [-Wunused-variable]
reuseport_bpf.c:346:28: warning: unused variable ‘ccode’ [-Wunused-variable]
346 | struct sock_filter ccode[] = {{ BPF_RET | BPF_A, 0, 0, 0 }};
| ^~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/reuseport_bpf] Error 1
make: *** Waiting for unfinished jobs....
reuseport_bpf_cpu.c: In function ‘attach_bpf’:
reuseport_bpf_cpu.c:79:28: error: array type has incomplete element type ‘struct sock_filter’
79 | struct sock_filter code[] = {
| ^~~~
reuseport_bpf_cpu.c:81:52: error: ‘SKF_AD_OFF’ undeclared (first use in this function)
81 | { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
| ^~~~~~~~~~
reuseport_bpf_cpu.c:81:52: note: each undeclared identifier is reported only once for each function it appears in
reuseport_bpf_cpu.c:81:65: error: ‘SKF_AD_CPU’ undeclared (first use in this function)
81 | { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
| ^~~~~~~~~~
reuseport_bpf_cpu.c:83:29: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_H’?
83 | { BPF_RET | BPF_A, 0, 0, 0 },
| ^~~~~
| BPF_H
reuseport_bpf_cpu.c:85:16: error: variable ‘p’ has initializer but incomplete type
85 | struct sock_fprog p = {
| ^~~~~~~~~~
reuseport_bpf_cpu.c:86:18: error: ‘struct sock_fprog’ has no member named ‘len’
86 | .len = 2,
| ^~~
reuseport_bpf_cpu.c:86:24: warning: excess elements in struct initializer
86 | .len = 2,
| ^
reuseport_bpf_cpu.c:86:24: note: (near initialization for ‘p’)
reuseport_bpf_cpu.c:87:18: error: ‘struct sock_fprog’ has no member named ‘filter’
87 | .filter = code,
| ^~~~~~
reuseport_bpf_cpu.c:87:27: warning: excess elements in struct initializer
87 | .filter = code,
| ^~~~
reuseport_bpf_cpu.c:87:27: note: (near initialization for ‘p’)
reuseport_bpf_cpu.c:85:27: error: storage size of ‘p’ isn’t known
85 | struct sock_fprog p = {
| ^
reuseport_bpf_cpu.c:85:27: warning: unused variable ‘p’ [-Wunused-variable]
reuseport_bpf_cpu.c:79:28: warning: unused variable ‘code’ [-Wunused-variable]
79 | struct sock_filter code[] = {
| ^~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/reuseport_bpf_cpu] Error 1
In file included from psock_fanout.c:55:
psock_lib.h: In function ‘pair_udp_setfilter’:
psock_lib.h:52:28: error: array type has incomplete element type ‘struct sock_filter’
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
psock_lib.h:65:27: error: storage size of ‘bpf_prog’ isn’t known
65 | struct sock_fprog bpf_prog;
| ^~~~~~~~
psock_lib.h:65:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_lib.h:52:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
psock_fanout.c: In function ‘sock_fanout_set_cbpf’:
psock_fanout.c:114:28: error: array type has incomplete element type ‘struct sock_filter’
114 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
psock_fanout.c:115:17: warning: implicit declaration of function ‘BPF_STMT’; did you mean ‘BPF_STX’? [-Wimplicit-function-declaration]
115 | BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 80), /* ldb [80] */
| ^~~~~~~~
| BPF_STX
psock_fanout.c:116:36: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_X’?
116 | BPF_STMT(BPF_RET | BPF_A, 0), /* ret A */
| ^~~~~
| BPF_X
psock_fanout.c:116:36: note: each undeclared identifier is reported only once for each function it appears in
psock_fanout.c:118:27: error: storage size of ‘bpf_prog’ isn’t known
118 | struct sock_fprog bpf_prog;
| ^~~~~~~~
psock_fanout.c:118:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_fanout.c:114:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
114 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/psock_fanout] Error 1
In file included from psock_tpacket.c:47:
psock_lib.h: In function ‘pair_udp_setfilter’:
psock_lib.h:52:28: error: array type has incomplete element type ‘struct sock_filter’
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
psock_lib.h:65:27: error: storage size of ‘bpf_prog’ isn’t known
65 | struct sock_fprog bpf_prog;
| ^~~~~~~~
psock_lib.h:65:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_lib.h:52:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/psock_tpacket] Error 1
In file included from psock_snd.c:32:
psock_lib.h: In function ‘pair_udp_setfilter’:
psock_lib.h:52:28: error: array type has incomplete element type ‘struct sock_filter’
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
psock_lib.h:65:27: error: storage size of ‘bpf_prog’ isn’t known
65 | struct sock_fprog bpf_prog;
| ^~~~~~~~
psock_lib.h:65:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_lib.h:52:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/psock_snd] Error 1
make: Leaving directory '/usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net'
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231019/202310192236.fde97031-oliv…
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.
When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations. The current GCS pointer
can not be directly written to by userspace. When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match. GCS operations may only be
performed on GCS pages, a data abort is generated if they are not.
The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.
This series implements support for use of GCS by userspace, along with
support for use of GCS within KVM guests. It does not enable use of GCS
by either EL1 or EL2, this will be implemented separately. Executables
are started without GCS and must use a prctl() to enable it, it is
expected that this will be done very early in application execution by
the dynamic linker or other startup code.
x86 has an equivalent feature called shadow stacks, this series depends
on the x86 patches for generic memory management support for the new
guarded/shadow stack page type and shares APIs as much as possible. As
there has been extensive discussion with the wider community around the
ABI for shadow stacks I have as far as practical kept implementation
decisions close to those for x86, anticipating that review would lead to
similar conclusions in the absence of strong reasoning for divergence.
The main divergence I am concious of is that x86 allows shadow stack to
be enabled and disabled repeatedly, freeing the shadow stack for the
thread whenever disabled, while this implementation keeps the GCS
allocated after disable but refuses to reenable it. This is to avoid
races with things actively walking the GCS during a disable, we do
anticipate that some systems will wish to disable GCS at runtime but are
not aware of any demand for subsequently reenabling it.
x86 uses an arch_prctl() to manage enable and disable, since only x86
and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a
patch set for the equivalent RISC-V zisslpcfi feature which I initially
adopted fairly directly but following review feedback has been revised
quite a bit.
There is an open issue with support for CRIU, on x86 this required the
ability to set the GCS mode via ptrace. This series supports
configuring mode bits other than enable/disable via ptrace but it needs
to be confirmed if this is sufficient.
There's a few bits where I'm not convinced with where I've placed
things, in particular the GCS write operation is in the GCS header not
in uaccess.h, I wasn't sure what was clearest there and am probably too
close to the code to have a clear opinion. The reporting of GCS in
/proc/PID/smaps is also a bit awkward.
The series depends on the x86 shadow stack support:
https://lore.kernel.org/lkml/20230227222957.24501-1-rick.p.edgecombe@intel.…
I've rebased this onto v6.5-rc4 but not included it in the series in
order to avoid confusion with Rick's work and cut down the size of the
series, you can see the branch at:
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs
[1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
---
Mark Brown (36):
prctl: arch-agnostic prctl for shadow stack
arm64: Document boot requirements for Guarded Control Stacks
arm64/gcs: Document the ABI for Guarded Control Stacks
arm64/sysreg: Add new system registers for GCS
arm64/sysreg: Add definitions for architected GCS caps
arm64/gcs: Add manual encodings of GCS instructions
arm64/gcs: Provide copy_to_user_gcs()
arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
arm64/mm: Allocate PIE slots for EL0 guarded control stack
mm: Define VM_SHADOW_STACK for arm64 when we support GCS
arm64/mm: Map pages for guarded control stack
KVM: arm64: Manage GCS registers for guests
arm64/gcs: Allow GCS usage at EL0 and EL1
arm64/idreg: Add overrride for GCS
arm64/hwcap: Add hwcap for GCS
arm64/traps: Handle GCS exceptions
arm64/mm: Handle GCS data aborts
arm64/gcs: Context switch GCS state for EL0
arm64/gcs: Allocate a new GCS for threads with GCS enabled
arm64/gcs: Implement shadow stack prctl() interface
arm64/mm: Implement map_shadow_stack()
arm64/signal: Set up and restore the GCS context for signal handlers
arm64/signal: Expose GCS state in signal frames
arm64/ptrace: Expose GCS via ptrace and core files
arm64: Add Kconfig for Guarded Control Stack (GCS)
kselftest/arm64: Verify the GCS hwcap
kselftest/arm64: Add GCS as a detected feature in the signal tests
kselftest/arm64: Add framework support for GCS to signal handling tests
kselftest/arm64: Allow signals tests to specify an expected si_code
kselftest/arm64: Always run signals tests with GCS enabled
kselftest/arm64: Add very basic GCS test program
kselftest/arm64: Add a GCS test program built with the system libc
kselftest/arm64: Add test coverage for GCS mode locking
selftests/arm64: Add GCS signal tests
kselftest/arm64: Add a GCS stress test
kselftest/arm64: Enable GCS for the FP stress tests
Documentation/admin-guide/kernel-parameters.txt | 3 +
Documentation/arch/arm64/booting.rst | 22 +
Documentation/arch/arm64/elf_hwcaps.rst | 3 +
Documentation/arch/arm64/gcs.rst | 228 +++++++++
Documentation/arch/arm64/index.rst | 1 +
Documentation/filesystems/proc.rst | 2 +-
arch/arm64/Kconfig | 19 +
arch/arm64/include/asm/cpufeature.h | 6 +
arch/arm64/include/asm/el2_setup.h | 17 +
arch/arm64/include/asm/esr.h | 28 +-
arch/arm64/include/asm/exception.h | 2 +
arch/arm64/include/asm/gcs.h | 106 ++++
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/asm/kvm_arm.h | 4 +-
arch/arm64/include/asm/kvm_host.h | 12 +
arch/arm64/include/asm/pgtable-prot.h | 14 +-
arch/arm64/include/asm/processor.h | 7 +
arch/arm64/include/asm/sysreg.h | 20 +
arch/arm64/include/asm/uaccess.h | 42 ++
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/ptrace.h | 8 +
arch/arm64/include/uapi/asm/sigcontext.h | 9 +
arch/arm64/kernel/cpufeature.c | 19 +
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/entry-common.c | 23 +
arch/arm64/kernel/idreg-override.c | 2 +
arch/arm64/kernel/process.c | 85 ++++
arch/arm64/kernel/ptrace.c | 59 +++
arch/arm64/kernel/signal.c | 237 ++++++++-
arch/arm64/kernel/traps.c | 11 +
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 17 +
arch/arm64/kvm/sys_regs.c | 22 +
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/fault.c | 78 ++-
arch/arm64/mm/gcs.c | 234 +++++++++
arch/arm64/mm/mmap.c | 12 +-
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 55 +++
fs/proc/task_mmu.c | 3 +
include/linux/mm.h | 16 +-
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 22 +
kernel/sys.c | 30 ++
kernel/sys_ni.c | 1 +
tools/testing/selftests/arm64/Makefile | 2 +-
tools/testing/selftests/arm64/abi/hwcap.c | 19 +
tools/testing/selftests/arm64/fp/assembler.h | 15 +
tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 +
tools/testing/selftests/arm64/fp/sve-test.S | 2 +
tools/testing/selftests/arm64/fp/za-test.S | 2 +
tools/testing/selftests/arm64/fp/zt-test.S | 2 +
tools/testing/selftests/arm64/gcs/.gitignore | 5 +
tools/testing/selftests/arm64/gcs/Makefile | 24 +
tools/testing/selftests/arm64/gcs/asm-offsets.h | 0
tools/testing/selftests/arm64/gcs/basic-gcs.c | 356 ++++++++++++++
tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++++
.../selftests/arm64/gcs/gcs-stress-thread.S | 311 ++++++++++++
tools/testing/selftests/arm64/gcs/gcs-stress.c | 532 +++++++++++++++++++++
tools/testing/selftests/arm64/gcs/gcs-util.h | 87 ++++
tools/testing/selftests/arm64/gcs/libc-gcs.c | 500 +++++++++++++++++++
tools/testing/selftests/arm64/signal/.gitignore | 1 +
.../testing/selftests/arm64/signal/test_signals.c | 17 +-
.../testing/selftests/arm64/signal/test_signals.h | 6 +
.../selftests/arm64/signal/test_signals_utils.c | 32 +-
.../selftests/arm64/signal/test_signals_utils.h | 39 ++
.../arm64/signal/testcases/gcs_exception_fault.c | 59 +++
.../selftests/arm64/signal/testcases/gcs_frame.c | 78 +++
.../arm64/signal/testcases/gcs_write_fault.c | 67 +++
.../selftests/arm64/signal/testcases/testcases.c | 7 +
.../selftests/arm64/signal/testcases/testcases.h | 1 +
72 files changed, 3823 insertions(+), 34 deletions(-)
---
base-commit: ed0e1456f04be7a93c9a186e8e13aed78b555617
change-id: 20230303-arm64-gcs-e311ab0d8729
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Commit 2810c1e99867 ("kunit: Fix wild-memory-access bug in
kunit_free_suite_set()") is causing all test suites to run (when
built as modules) while still in MODULE_STATE_COMING. In that state,
test modules are not fully initialized and lack sysfs kobjects.
This behavior can cause a crash if the test module tries to register
fake devices.
This patch restores the normal execution flow, waiting for the module
initialization to complete before running the test suites.
The issue reported in the commit mentioned above is addressed using
virt_addr_valid() to detect if the module loading has failed
and mod->kunit_suites has not been allocated using kmalloc_array().
Fixes: 2810c1e99867 ("kunit: Fix wild-memory-access bug in kunit_free_suite_set()")
Signed-off-by: Marco Pagani <marpagan(a)redhat.com>
---
lib/kunit/test.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 421f13981412..1a49569186fc 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -769,12 +769,14 @@ static void kunit_module_exit(struct module *mod)
};
const char *action = kunit_action();
+ if (!suite_set.start || !virt_addr_valid(suite_set.start))
+ return;
+
if (!action)
__kunit_test_suites_exit(mod->kunit_suites,
mod->num_kunit_suites);
- if (suite_set.start)
- kunit_free_suite_set(suite_set);
+ kunit_free_suite_set(suite_set);
}
static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
@@ -784,12 +786,12 @@ static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
switch (val) {
case MODULE_STATE_LIVE:
+ kunit_module_init(mod);
break;
case MODULE_STATE_GOING:
kunit_module_exit(mod);
break;
case MODULE_STATE_COMING:
- kunit_module_init(mod);
break;
case MODULE_STATE_UNFORMED:
break;
--
2.41.0
Kselftests are kernel tests and must be build with kernel headers from
same source version. The kernel headers are already being included
correctly in futex selftest Makefile with the help of KHDR_INCLUDE. In
this patch, only the dead code is being removed. No functional change is
intended.
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
Changes since v1:
- Make the explanation correct
---
.../selftests/futex/include/futextest.h | 22 -------------------
1 file changed, 22 deletions(-)
diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index ddbcfc9b7bac4..59f66af3a6d10 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -25,28 +25,6 @@
typedef volatile u_int32_t futex_t;
#define FUTEX_INITIALIZER 0
-/* Define the newer op codes if the system header file is not up to date. */
-#ifndef FUTEX_WAIT_BITSET
-#define FUTEX_WAIT_BITSET 9
-#endif
-#ifndef FUTEX_WAKE_BITSET
-#define FUTEX_WAKE_BITSET 10
-#endif
-#ifndef FUTEX_WAIT_REQUEUE_PI
-#define FUTEX_WAIT_REQUEUE_PI 11
-#endif
-#ifndef FUTEX_CMP_REQUEUE_PI
-#define FUTEX_CMP_REQUEUE_PI 12
-#endif
-#ifndef FUTEX_WAIT_REQUEUE_PI_PRIVATE
-#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \
- FUTEX_PRIVATE_FLAG)
-#endif
-#ifndef FUTEX_REQUEUE_PI_PRIVATE
-#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
- FUTEX_PRIVATE_FLAG)
-#endif
-
/**
* futex() - SYS_futex syscall wrapper
* @uaddr: address of first futex
--
2.40.1
Commit 20d96b25cc4c ("selftests/resctrl: Fix schemata write error
check") exposed a problem in feature detection logic in MBM selftest.
If schemata does not support MB:x=x entries, the schemata write to
initialize 100% memory bandwidth allocation in mbm_setup() will now
fail with -EINVAL due to the error handling corrected by the commit
20d96b25cc4c ("selftests/resctrl: Fix schemata write error check").
That commit just uncovers the failed write, it is not wrong itself.
If MB:x=x is not supported by schemata, it is safe to assume 100%
memory bandwidth is always set. Therefore, the previously ignored error
does not make the MBM test itself wrong.
Restore the previous behavior of MBM test by checking MB support before
attempting to write it into schemata which results in behavior
equivalent to ignoring the write error.
Fixes: 20d96b25cc4c ("selftests/resctrl: Fix schemata write error check")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre(a)intel.com>
---
tools/testing/selftests/resctrl/mbm_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
index d3c0d30c676a..741533f2b075 100644
--- a/tools/testing/selftests/resctrl/mbm_test.c
+++ b/tools/testing/selftests/resctrl/mbm_test.c
@@ -95,7 +95,7 @@ static int mbm_setup(struct resctrl_val_param *p)
return END_OF_TESTS;
/* Set up shemata with 100% allocation on the first run. */
- if (p->num_of_runs == 0)
+ if (p->num_of_runs == 0 && validate_resctrl_feature_request("MB", NULL))
ret = write_schemata(p->ctrlgrp, "100", p->cpu_no,
p->resctrl_val);
--
2.30.2
If name_show() is non unique, this test will try to install a kprobe on this
function which should fail returning EADDRNOTAVAIL.
On kernel where name_show() is not unique, this test is skipped.
Signed-off-by: Francis Laniel <flaniel(a)linux.microsoft.com>
---
.../ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc | 13 +++++++++++++
1 file changed, 13 insertions(+)
create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc
new file mode 100644
index 000000000000..bc9514428dba
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc
@@ -0,0 +1,13 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test failure of registering kprobe on non unique symbol
+# requires: kprobe_events
+
+SYMBOL='name_show'
+
+# We skip this test on kernel where SYMBOL is unique or does not exist.
+if [ "$(grep -c -E "[[:alnum:]]+ t ${SYMBOL}" /proc/kallsyms)" -le '1' ]; then
+ exit_unsupported
+fi
+
+! echo "p:test_non_unique ${SYMBOL}" > kprobe_events
--
2.34.1
Introduce a limit on the amount of learned FDB entries on a bridge,
configured by netlink with a build time default on bridge creation in
the kernel config.
For backwards compatibility the kernel config default is disabling the
limit (0).
Without any limit a malicious actor may OOM a kernel by spamming packets
with changing MAC addresses on their bridge port, so allow the bridge
creator to limit the number of entries.
Currently the manual entries are identified by the bridge flags
BR_FDB_LOCAL or BR_FDB_ADDED_BY_USER, atomically bundled under the new
flag BR_FDB_DYNAMIC_LEARNED. This means the limit also applies to
entries created with BR_FDB_ADDED_BY_EXT_LEARN but none of BR_FDB_LOCAL
or BR_FDB_ADDED_BY_USER, e.g. ones added by SWITCHDEV_FDB_ADD_TO_BRIDGE.
Link to the corresponding iproute2 changes:
https://lore.kernel.org/r/20230919-fdb_limit-v4-1-b4d2dc4df30f@avm.de
Signed-off-by: Johannes Nixdorf <jnixdorf-oss(a)avm.de>
---
Changes in v5:
- Set IFLA_BR_FDB_N_LEARNED to NLA_REJECT (from review)
- Moved the strict_start_type-commit after the netlink change, used
the new attribute. (from review)
- Dropped the new build time config option. (from review)
- Link to v4: https://lore.kernel.org/r/20230919-fdb_limit-v4-0-39f0293807b8@avm.de
Changes in v4:
- Added the new test to the Makefile. (from review)
- Removed _entries from the names. (from iproute2 review, in some places
only for consistency)
- Wrapped the lines at 80 chars, except when longer lines are consistent
with neighbouring code. (from review)
- Fixed a race in fdb_delete. (from review)
- Link to v3: https://lore.kernel.org/r/20230905-fdb_limit-v3-0-7597cd500a82@avm.de
Changes in v3:
- Fixed the flags for fdb_create in fdb_add_entry to use
BIT(...). Previously we passed garbage. (from review)
- Set strict_start_type for br_policy. (from review)
- Split out the combined accounting and limit patch, and the netlink
patch from the combined patch in v2. (from review)
- Count atomically, remove the newly introduced lock. (from review)
- Added the new attributes to br_policy. (from review)
- Added a selftest for the new feature. (from review)
- Link to v2: https://lore.kernel.org/netdev/20230619071444.14625-1-jnixdorf-oss@avm.de/
Changes in v2:
- Added BR_FDB_ADDED_BY_USER earlier in fdb_add_entry to ensure the
limit is not applied.
- Do not initialize fdb_*_entries to 0. (from review)
- Do not skip decrementing on 0. (from review)
- Moved the counters to a conditional hole in struct net_bridge to
avoid growing the struct. (from review, it still grows the struct as
there are 2 32-bit values)
- Add IFLA_BR_FDB_CUR_LEARNED_ENTRIES (from review)
- Fix br_get_size() with the added attributes.
- Only limit learned entries, rename to
*_(CUR|MAX)_LEARNED_ENTRIES. (from review)
- Added a default limit in Kconfig. (deemed acceptable in review
comments, helps with embedded use-cases where a special purpose kernel
is built anyways)
- Added an iproute2 patch for easier testing.
- Link to v1: https://lore.kernel.org/netdev/20230515085046.4457-1-jnixdorf-oss@avm.de/
Obsolete v1 review comments:
- Return better errors to users: Due to limiting the limit to
automatically created entries, netlink fdb add requests and changing
bridge ports are never rejected, so they do not yet need a more
friendly error returned.
---
Johannes Nixdorf (5):
net: bridge: Set BR_FDB_ADDED_BY_USER early in fdb_add_entry
net: bridge: Track and limit dynamically learned FDB entries
net: bridge: Add netlink knobs for number / max learned FDB entries
net: bridge: Set strict_start_type for br_policy
selftests: forwarding: bridge_fdb_learning_limit: Add a new selftest
include/uapi/linux/if_link.h | 2 +
net/bridge/br_fdb.c | 42 ++-
net/bridge/br_netlink.c | 17 +-
net/bridge/br_private.h | 4 +
tools/testing/selftests/net/forwarding/Makefile | 3 +-
.../net/forwarding/bridge_fdb_learning_limit.sh | 283 +++++++++++++++++++++
6 files changed, 344 insertions(+), 7 deletions(-)
---
base-commit: 58720809f52779dc0f08e53e54b014209d13eebb
change-id: 20230904-fdb_limit-fae5bbf16c88
Best regards,
--
Johannes Nixdorf <jnixdorf-oss(a)avm.de>
On Tue, 15 Aug 2023 16:59:10 +0800
Yu Liao <liaoyu15(a)huawei.com> wrote:
> Yu Liao (2):
> selftests/ftrace: add loongarch support for kprobe args char tests
> selftests/ftrace: Add riscv support for kprobe arg tests
>
> .../selftests/ftrace/test.d/kprobe/kprobe_args_char.tc | 6 ++++++
> .../selftests/ftrace/test.d/kprobe/kprobe_args_string.tc | 3 +++
> .../selftests/ftrace/test.d/kprobe/kprobe_args_syntax.tc | 4 ++++
> 3 files changed, 13 insertions(+)
>
I noticed that this never got picked up, but that's probably because it
didn't also Cc linux-kselftest(a)vger.kernel.org (which I did here).
Shuah,
Can you add this? You can also add:
Acked-by: Steven Rostedt (Google) <rostedt(a)goodmis.org>
-- Steve
The abi_test currently uses a long sized test value for enablement
checks. On LE this works fine, however, on BE this results in inaccurate
assert checks due to a bit being used and assuming it's value is the
same on both LE and BE.
Use int type for 32-bit values and long type for 64-bit values to ensure
appropriate behavior on both LE and BE.
Fixes: 60b1af8de8c1 ("tracing/user_events: Add ABI self-test")
Signed-off-by: Beau Belgrave <beaub(a)linux.microsoft.com>
---
V3 Changes:
Fix missing cast in clone_check().
V2 Changes:
Rebase to linux-kselftest/fixes.
tools/testing/selftests/user_events/abi_test.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/user_events/abi_test.c b/tools/testing/selftests/user_events/abi_test.c
index 8202f1327c39..f5575ef2007c 100644
--- a/tools/testing/selftests/user_events/abi_test.c
+++ b/tools/testing/selftests/user_events/abi_test.c
@@ -47,7 +47,7 @@ static int change_event(bool enable)
return ret;
}
-static int reg_enable(long *enable, int size, int bit)
+static int reg_enable(void *enable, int size, int bit)
{
struct user_reg reg = {0};
int fd = open(data_file, O_RDWR);
@@ -69,7 +69,7 @@ static int reg_enable(long *enable, int size, int bit)
return ret;
}
-static int reg_disable(long *enable, int bit)
+static int reg_disable(void *enable, int bit)
{
struct user_unreg reg = {0};
int fd = open(data_file, O_RDWR);
@@ -90,7 +90,8 @@ static int reg_disable(long *enable, int bit)
}
FIXTURE(user) {
- long check;
+ int check;
+ long check_long;
bool umount;
};
@@ -99,6 +100,7 @@ FIXTURE_SETUP(user) {
change_event(false);
self->check = 0;
+ self->check_long = 0;
}
FIXTURE_TEARDOWN(user) {
@@ -136,9 +138,9 @@ TEST_F(user, bit_sizes) {
#if BITS_PER_LONG == 8
/* Allow 0-64 bits for 64-bit */
- ASSERT_EQ(0, reg_enable(&self->check, sizeof(long), 63));
- ASSERT_NE(0, reg_enable(&self->check, sizeof(long), 64));
- ASSERT_EQ(0, reg_disable(&self->check, 63));
+ ASSERT_EQ(0, reg_enable(&self->check_long, sizeof(long), 63));
+ ASSERT_NE(0, reg_enable(&self->check_long, sizeof(long), 64));
+ ASSERT_EQ(0, reg_disable(&self->check_long, 63));
#endif
/* Disallowed sizes (everything beside 4 and 8) */
@@ -200,7 +202,7 @@ static int clone_check(void *check)
for (i = 0; i < 10; ++i) {
usleep(100000);
- if (*(long *)check)
+ if (*(int *)check)
return 0;
}
base-commit: 6f874fa021dfc7bf37f4f37da3a5aaa41fe9c39c
--
2.34.1
When we dynamically generate a name for a configuration in get-reg-list
we use strcat() to append to a buffer allocated using malloc() but we
never initialise that buffer. Since malloc() offers no guarantees
regarding the contents of the memory it returns this can lead to us
corrupting, and likely overflowing, the buffer:
vregs: PASS
vregs+pmu: PASS
sve: PASS
sve+pmu: PASS
vregs+pauth_address+pauth_generic: PASS
X�vr+gspauth_addre+spauth_generi+pmu: PASS
Initialise the buffer to an empty string to avoid this.
Fixes: 2f9ace5d4557 ("KVM: arm64: selftests: get-reg-list: Introduce vcpu configs")
Reviewed-by: Andrew Jones <ajones(a)ventanamicro.com>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Update Fixes: tag.
- Link to v1: https://lore.kernel.org/r/20231013-kvm-get-reg-list-str-init-v1-1-034f370ff…
---
tools/testing/selftests/kvm/get-reg-list.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/kvm/get-reg-list.c b/tools/testing/selftests/kvm/get-reg-list.c
index be7bf5224434..dd62a6976c0d 100644
--- a/tools/testing/selftests/kvm/get-reg-list.c
+++ b/tools/testing/selftests/kvm/get-reg-list.c
@@ -67,6 +67,7 @@ static const char *config_name(struct vcpu_reg_list *c)
c->name = malloc(len);
+ c->name[0] = '\0';
len = 0;
for_each_sublist(c, s) {
if (!strcmp(s->name, "base"))
---
base-commit: 6465e260f48790807eef06b583b38ca9789b6072
change-id: 20231012-kvm-get-reg-list-str-init-76c8ed4e19d6
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Commit 20d96b25cc4c ("selftests/resctrl: Fix schemata write error
check") exposed a problem in feature detection logic in MBM selftest.
If schemata does not support MB:x=x entries, the schemata write to
initialize 100% memory bandwidth allocation in mbm_setup() will now
fail with -EINVAL due to the error handling corrected by 20d96b25cc4c.
Commit 20d96b25cc4c just uncovers the failed write, it is not wrong
itself.
If MB:x=x is not supported by schemata, it is safe to assume 100%
memory bandwidth is always set. Therefore, the previously ignored error
does not make the MBM test itself wrong.
Restore the previous behavior of MBM test by checking MB support before
attempting to write it into schemata which results in behavior
equivalent to ignoring the write error.
Fixes: 20d96b25cc4c ("selftests/resctrl: Fix schemata write error check")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen(a)linux.intel.com>
---
tools/testing/selftests/resctrl/mbm_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/resctrl/mbm_test.c b/tools/testing/selftests/resctrl/mbm_test.c
index d3c0d30c676a..85987957e7f5 100644
--- a/tools/testing/selftests/resctrl/mbm_test.c
+++ b/tools/testing/selftests/resctrl/mbm_test.c
@@ -95,7 +95,7 @@ static int mbm_setup(struct resctrl_val_param *p)
return END_OF_TESTS;
/* Set up shemata with 100% allocation on the first run. */
- if (p->num_of_runs == 0)
+ if ((p->num_of_runs == 0) && validate_resctrl_feature_request("MB", NULL))
ret = write_schemata(p->ctrlgrp, "100", p->cpu_no,
p->resctrl_val);
--
2.30.2
IOMMU hardwares that support nested translation would have two stages
address translation (normally mentioned as stage-1 and stage-2). The page
table formats of the stage-1 and stage-2 can be different. e.g., VT-d has
different page table formats for stage-1 and stage-2.
Nested parent domain is the iommu domain used to represent the stage-2
translation. In IOMMUFD, both stage-1 and stage-2 translation are tracked
as HWPT (a.k.a. iommu domain). Stage-2 HWPT is parent of stage-1 HWPT as
stage-1 cannot work alone in nested translation. In the cases of stage-1 and
stage-2 page table format are different, the parent HWPT should use exactly
the stage-2 page table format. However, the existing kernel hides the format
selection in iommu drivers, so the domain allocated via IOMMU_HWPT_ALLOC can
use either stage-1 page table format or stage-2 page table format, there is
no guarantees for it.
To enforce the page table format of the nested parent domain, this series
introduces a new iommu op (domain_alloc_user) which can accept user flags
to allocate domain as userspace requires. It also converts IOMMUFD to use
the new domain_alloc_user op for domain allocation if supported, then extends
the IOMMU_HWPT_ALLOC ioctl to pass down a NEST_PARENT flag to allocate a HWPT
which can be used as parent. This series implements the new op in Intel iommu
driver to have a complete picture. It is a preparation for adding nesting
support in IOMMUFD/IOMMU.
Complete code can be found:
https://github.com/yiliu1765/iommufd/tree/iommufd_alloc_user_v2
Change log:
v2:
- Require domain_alloc_user op if IOMMU_HWPT_ALLOC passes non-zero flags (Kevin)
- IOMMUFD core should check kernel known flags while iommu driver needs
to check supported flags as well (Jason)
- Minor tweaks per Baolu's comment
v1: https://lore.kernel.org/linux-iommu/20230919092523.39286-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Yi Liu (6):
iommu: Add new iommu op to create domains owned by userspace
iommufd/hw_pagetable: Use domain_alloc_user op for domain allocation
iommufd/hw_pagetable: Accepts user flags for domain allocation
iommufd/hw_pagetable: Support allocating nested parent domain
iommufd/selftest: Add domain_alloc_user() support in iommu mock
iommu/vt-d: Add domain_alloc_user op
drivers/iommu/intel/iommu.c | 28 +++++++++++++++++
drivers/iommu/iommufd/device.c | 2 +-
drivers/iommu/iommufd/hw_pagetable.c | 31 ++++++++++++++-----
drivers/iommu/iommufd/iommufd_private.h | 3 +-
drivers/iommu/iommufd/selftest.c | 19 ++++++++++++
include/linux/iommu.h | 11 ++++++-
include/uapi/linux/iommufd.h | 12 ++++++-
tools/testing/selftests/iommu/iommufd.c | 24 +++++++++++---
.../selftests/iommu/iommufd_fail_nth.c | 2 +-
tools/testing/selftests/iommu/iommufd_utils.h | 11 +++++--
10 files changed, 124 insertions(+), 19 deletions(-)
--
2.34.1
Hi all:
The core frequency is subjected to the process variation in semiconductors.
Not all cores are able to reach the maximum frequency respecting the
infrastructure limits. Consequently, AMD has redefined the concept of
maximum frequency of a part. This means that a fraction of cores can reach
maximum frequency. To find the best process scheduling policy for a given
scenario, OS needs to know the core ordering informed by the platform through
highest performance capability register of the CPPC interface.
Earlier implementations of amd-pstate preferred core only support a static
core ranking and targeted performance. Now it has the ability to dynamically
change the preferred core based on the workload and platform conditions and
accounting for thermals and aging.
Amd-pstate driver utilizes the functions and data structures provided by
the ITMT architecture to enable the scheduler to favor scheduling on cores
which can be get a higher frequency with lower voltage.
We call it amd-pstate preferred core.
Here sched_set_itmt_core_prio() is called to set priorities and
sched_set_itmt_support() is called to enable ITMT feature.
Amd-pstate driver uses the highest performance value to indicate
the priority of CPU. The higher value has a higher priority.
Amd-pstate driver will provide an initial core ordering at boot time.
It relies on the CPPC interface to communicate the core ranking to the
operating system and scheduler to make sure that OS is choosing the cores
with highest performance firstly for scheduling the process. When amd-pstate
driver receives a message with the highest performance change, it will
update the core ranking.
Changes form V8->V9:
- all:
- - pick up Tested-By flag added by Oleksandr.
- cpufreq: amd-pstate:
- - pick up Review-By flag added by Wyes.
- - ignore modification of bug.
- - add a attribute of prefcore_ranking.
- - modify data type conversion from u32 to int.
- Documentation: amd-pstate:
- - pick up Review-By flag added by Wyes.
Changes form V7->V8:
- all:
- - pick up Review-By flag added by Mario and Ray.
- cpufreq: amd-pstate:
- - use hw_prefcore embeds into cpudata structure.
- - delete preferred core init from cpu online/off.
Changes form V6->V7:
- x86:
- - Modify kconfig about X86_AMD_PSTATE.
- cpufreq: amd-pstate:
- - modify incorrect comments about scheduler_work().
- - convert highest_perf data type.
- - modify preferred core init when cpu init and online.
- acpi: cppc:
- - modify link of CPPC highest performance.
- cpufreq:
- - modify link of CPPC highest performance changed.
Changes form V5->V6:
- cpufreq: amd-pstate:
- - modify the wrong tag order.
- - modify warning about hw_prefcore sysfs attribute.
- - delete duplicate comments.
- - modify the variable name cppc_highest_perf to prefcore_ranking.
- - modify judgment conditions for setting highest_perf.
- - modify sysfs attribute for CPPC highest perf to pr_debug message.
- Documentation: amd-pstate:
- - modify warning: title underline too short.
Changes form V4->V5:
- cpufreq: amd-pstate:
- - modify sysfs attribute for CPPC highest perf.
- - modify warning about comments
- - rebase linux-next
- cpufreq:
- - Moidfy warning about function declarations.
- Documentation: amd-pstate:
- - align with ``amd-pstat``
Changes form V3->V4:
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
Changes form V2->V3:
- x86:
- - Modify kconfig and description.
- cpufreq: amd-pstate:
- - Add Co-developed-by tag in commit message.
- cpufreq:
- - Modify commit message.
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
Changes form V1->V2:
- acpi: cppc:
- - Add reference link.
- cpufreq:
- - Moidfy link error.
- cpufreq: amd-pstate:
- - Init the priorities of all online CPUs
- - Use a single variable to represent the status of preferred core.
- Documentation:
- - Default enabled preferred core.
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
- - Default enabled preferred core.
- - Use a single variable to represent the status of preferred core.
Meng Li (7):
x86: Drop CPU_SUP_INTEL from SCHED_MC_PRIO for the expansion.
acpi: cppc: Add get the highest performance cppc control
cpufreq: amd-pstate: Enable amd-pstate preferred core supporting.
cpufreq: Add a notification message that the highest perf has changed
cpufreq: amd-pstate: Update amd-pstate preferred core ranking
dynamically
Documentation: amd-pstate: introduce amd-pstate preferred core
Documentation: introduce amd-pstate preferrd core mode kernel command
line options
.../admin-guide/kernel-parameters.txt | 5 +
Documentation/admin-guide/pm/amd-pstate.rst | 59 ++++-
arch/x86/Kconfig | 5 +-
drivers/acpi/cppc_acpi.c | 13 ++
drivers/acpi/processor_driver.c | 6 +
drivers/cpufreq/amd-pstate.c | 204 ++++++++++++++++--
drivers/cpufreq/cpufreq.c | 13 ++
include/acpi/cppc_acpi.h | 5 +
include/linux/amd-pstate.h | 10 +
include/linux/cpufreq.h | 5 +
10 files changed, 305 insertions(+), 20 deletions(-)
--
2.34.1
Zero out the buffer for readlink() since readlink() does not append a
terminating null byte to the buffer. Also change the buffer length
passed to readlink() to 'PATH_MAX - 1' to ensure the resulting string
is always null terminated.
Fixes: 833c12ce0f430 ("selftests/x86/lam: Add inherit test cases for linear-address masking")
Signed-off-by: Binbin Wu <binbin.wu(a)linux.intel.com>
---
v1->v2:
- Change the buffer length passed to readlink() to 'PATH_MAX - 1' to ensure the
resulting string is always null terminated. [Kirill]
tools/testing/selftests/x86/lam.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index eb0e46905bf9..8f9b06d9ce03 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -573,7 +573,7 @@ int do_uring(unsigned long lam)
char path[PATH_MAX] = {0};
/* get current process path */
- if (readlink("/proc/self/exe", path, PATH_MAX) <= 0)
+ if (readlink("/proc/self/exe", path, PATH_MAX - 1) <= 0)
return 1;
int file_fd = open(path, O_RDONLY);
@@ -680,14 +680,14 @@ static int handle_execve(struct testcases *test)
perror("Fork failed.");
ret = 1;
} else if (pid == 0) {
- char path[PATH_MAX];
+ char path[PATH_MAX] = {0};
/* Set LAM mode in parent process */
if (set_lam(lam) != 0)
return 1;
/* Get current binary's path and the binary was run by execve */
- if (readlink("/proc/self/exe", path, PATH_MAX) <= 0)
+ if (readlink("/proc/self/exe", path, PATH_MAX - 1) <= 0)
exit(-1);
/* run binary to get LAM mode and return to parent process */
base-commit: 58720809f52779dc0f08e53e54b014209d13eebb
--
2.25.1
This series fixes issues observed with selftests/amd-pstate while
running performance comparison tests with different governors. First
patch changes relative paths with absolute paths and also change it
with correct paths wherever it is broken.
The second patch adds an option to provide perf binary path to
handle the case where distro perf does not work.
Changelog v3->v4:
* Addressed review comments from v3
Swapnil Sapkal (2):
selftests/amd-pstate: Fix broken paths to run workloads in
amd-pstate-ut
selftests/amd-pstate: Added option to provide perf binary path
.../x86/amd_pstate_tracer/amd_pstate_trace.py | 3 +--
.../testing/selftests/amd-pstate/gitsource.sh | 17 +++++++++------
tools/testing/selftests/amd-pstate/run.sh | 21 +++++++++++++------
tools/testing/selftests/amd-pstate/tbench.sh | 4 ++--
4 files changed, 29 insertions(+), 16 deletions(-)
--
2.34.1
TEST_LENGTH passing ".size = sizeof(struct _struct) - 1" expects -EINVAL
from "if (ucmd.user_size < op->min_size)" check in iommufd_fops_ioctl().
This has been working when min_size is exactly the size of the structure.
However, if the size of the structure becomes larger than min_size, i.e.
the passing size above is larger than min_size, that min_size sanity no
longer works.
Since the first test in TEST_LENGTH() was to test that min_size sanity
routine, rework it to support a min_size calculation, rather than using
the full size of the structure.
Signed-off-by: Nicolin Chen <nicolinc(a)nvidia.com>
---
Hi Jason/Kevin,
This was a part of the nesting series. Its link in v4:
https://lore.kernel.org/linux-iommu/20230921075138.124099-13-yi.l.liu@intel…
I just realized that this should go in prior to the nesting series.
One of the nesting patches changes the IOMMU_HWPT_ALLOC structure,
which would break the cmd_length test without this patch.
Thanks!
Nicolin
tools/testing/selftests/iommu/iommufd.c | 29 ++++++++++++++-----------
1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index c5eca2fee42c..6323153d277b 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -86,12 +86,13 @@ TEST_F(iommufd, cmd_fail)
TEST_F(iommufd, cmd_length)
{
-#define TEST_LENGTH(_struct, _ioctl) \
+#define TEST_LENGTH(_struct, _ioctl, _last) \
{ \
+ size_t min_size = offsetofend(struct _struct, _last); \
struct { \
struct _struct cmd; \
uint8_t extra; \
- } cmd = { .cmd = { .size = sizeof(struct _struct) - 1 }, \
+ } cmd = { .cmd = { .size = min_size - 1 }, \
.extra = UINT8_MAX }; \
int old_errno; \
int rc; \
@@ -112,17 +113,19 @@ TEST_F(iommufd, cmd_length)
} \
}
- TEST_LENGTH(iommu_destroy, IOMMU_DESTROY);
- TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO);
- TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC);
- TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC);
- TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES);
- TEST_LENGTH(iommu_ioas_allow_iovas, IOMMU_IOAS_ALLOW_IOVAS);
- TEST_LENGTH(iommu_ioas_map, IOMMU_IOAS_MAP);
- TEST_LENGTH(iommu_ioas_copy, IOMMU_IOAS_COPY);
- TEST_LENGTH(iommu_ioas_unmap, IOMMU_IOAS_UNMAP);
- TEST_LENGTH(iommu_option, IOMMU_OPTION);
- TEST_LENGTH(iommu_vfio_ioas, IOMMU_VFIO_IOAS);
+ TEST_LENGTH(iommu_destroy, IOMMU_DESTROY, id);
+ TEST_LENGTH(iommu_hw_info, IOMMU_GET_HW_INFO, __reserved);
+ TEST_LENGTH(iommu_hwpt_alloc, IOMMU_HWPT_ALLOC, __reserved);
+ TEST_LENGTH(iommu_ioas_alloc, IOMMU_IOAS_ALLOC, out_ioas_id);
+ TEST_LENGTH(iommu_ioas_iova_ranges, IOMMU_IOAS_IOVA_RANGES,
+ out_iova_alignment);
+ TEST_LENGTH(iommu_ioas_allow_iovas, IOMMU_IOAS_ALLOW_IOVAS,
+ allowed_iovas);
+ TEST_LENGTH(iommu_ioas_map, IOMMU_IOAS_MAP, iova);
+ TEST_LENGTH(iommu_ioas_copy, IOMMU_IOAS_COPY, src_iova);
+ TEST_LENGTH(iommu_ioas_unmap, IOMMU_IOAS_UNMAP, length);
+ TEST_LENGTH(iommu_option, IOMMU_OPTION, val64);
+ TEST_LENGTH(iommu_vfio_ioas, IOMMU_VFIO_IOAS, __reserved);
#undef TEST_LENGTH
}
--
2.42.0
This is to add Intel VT-d nested translation based on IOMMUFD nesting
infrastructure. As the iommufd nesting infrastructure series[1], iommu
core supports new ops to report iommu hardware information, allocate
domains with user data and invalidate stage-1 IOTLB when there is mapping
changed in stage-1 page table. The data required in the three paths are
vendor-specific, so
1) IOMMU_HWPT_TYPE_VTD_S1 is defined for the Intel VT-d stage-1 page
table, it will be used in the stage-1 domain allocation and IOTLB
syncing path. struct iommu_hwpt_vtd_s1 is defined to pass user_data
for the Intel VT-d stage-1 domain allocation.
struct iommu_hwpt_vtd_s1_invalidate is defined to pass the data for
the Intel VT-d stage-1 IOTLB invalidation.
2) IOMMU_HW_INFO_TYPE_INTEL_VTD and struct iommu_hw_info_vtd are defined
to report iommu hardware information for Intel VT-d.
With above IOMMUFD extensions, the intel iommu driver implements the three
paths to support nested translation.
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix
in patch12 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.
In reality we don't know a usage relying on DMA reads from the BIOS
region. Hence finding a way to skip RO regions (e.g. via a discard manager)
in Qemu might be an acceptable tradeoff. The actual change needs more
discussion in Qemu community. For now we just hacked Qemu to test.
Complete code can be found in [3], corresponding QEMU could can be found
in [4].
[1] https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v5:
- Add Kevin's r-b for patch 2, 3 ,5 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.…
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It makes sense to check every pfns used in the stage-1 page
table. But it cannot make it. So just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c…
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow read-only mappings to nest parent domain
Yi Liu (6):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
iommu/vt-d: Make iotlb flush helpers to be extern
iommu/vt-d: Add iotlb flush for nested domain
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 60 +++++++++----
drivers/iommu/intel/iommu.h | 51 +++++++++--
drivers/iommu/intel/nested.c | 162 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 125 +++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
include/uapi/linux/iommufd.h | 76 +++++++++++++++-
7 files changed, 452 insertions(+), 26 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1
Dzień dobry,
Pozwoliłem sobie na kontakt, ponieważ jestem zainteresowany weryfikacją możliwości nawiązania współpracy.
Wspieramy firmy w pozyskiwaniu nowych klientów biznesowych.
Czy możemy porozmawiać w celu przedstawienia szczegółowych informacji?
Pozdrawiam
Andrzej Polański
kselftest.h declares many variadic functions that can print some
formatted message while also executing selftest logic. These
declarations don't have any compiler mechanism to verify if passed
arguments are valid in comparison with format specifiers used in
printf() calls.
Attribute addition can make debugging easier, the code more consistent
and prevent mismatched or missing variables.
The first patch adds __printf() macro and applies it to all functions
in kselftest.h that use printf format specifiers. After compiling all
selftests using:
make -C tools/testing/selftests
many instances of format specifier mismatching are exposed in the form
of -Wformat warnings.
Fix the mismatched format specifiers caught by __printf() attribute in
multiple tests.
Series is based on kselftests next branch.
Changelog v6:
- Add methodology notes to all patches.
- No functional changes in the patches.
Changelog v5:
- Mention in the cover letter what methodology was used to find the
mismatched format specifiers.
- No functional changes in the patches.
Changelog v4:
- Fix patch 1/8 subject typo.
- Add Reinette's reviewed-by tags.
- Rebase onto new kselftest/next patches.
Changelog v3:
- Changed git signature from Wieczor-Retman Maciej to Maciej
Wieczor-Retman.
- Added one review tag.
- Rebased onto updated kselftests next branch.
Changelog v2:
- Add review and fixes tags to patches.
- Add two patches with mismatch fixes.
- Fix missed attribute in selftests/kvm. (Andrew)
- Fix previously missed issues in selftests/mm (Ilpo)
[v5] https://lore.kernel.org/all/cover.1697012398.git.maciej.wieczor-retman@inte…
[v4] https://lore.kernel.org/all/cover.1696846568.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1695373131.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1693829810.git.maciej.wieczor-retman@inte…
[v1] https://lore.kernel.org/all/cover.1693216959.git.maciej.wieczor-retman@inte…
Maciej Wieczor-Retman (8):
selftests: Add printf attribute to kselftest prints
selftests/cachestat: Fix print_cachestat format
selftests/openat2: Fix wrong format specifier
selftests/pidfd: Fix ksft print formats
selftests/sigaltstack: Fix wrong format specifier
selftests/kvm: Replace attribute with macro
selftests/mm: Substitute attribute with a macro
selftests/resctrl: Fix wrong format specifier
.../selftests/cachestat/test_cachestat.c | 2 +-
tools/testing/selftests/kselftest.h | 18 ++++++++++--------
.../testing/selftests/kvm/include/test_util.h | 8 ++++----
tools/testing/selftests/mm/mremap_test.c | 2 +-
tools/testing/selftests/mm/pkey-helpers.h | 2 +-
tools/testing/selftests/openat2/openat2_test.c | 2 +-
.../selftests/pidfd/pidfd_fdinfo_test.c | 2 +-
tools/testing/selftests/pidfd/pidfd_test.c | 12 ++++++------
tools/testing/selftests/resctrl/cache.c | 2 +-
tools/testing/selftests/sigaltstack/sas.c | 2 +-
10 files changed, 27 insertions(+), 25 deletions(-)
base-commit: 2531f374f922e77ba51f24d1aa6fa11c7f4c36b8
--
2.42.0
A number of corner cases were caught when trying to run the selftests on
older systems. Missed skip conditions, some error cases, and outdated
python setups would all report failures but the issue would actually be
related to some other condition rather than the selftest suite.
Address these individual cases.
Aaron Conole (4):
selftests: openvswitch: Add version check for pyroute2
selftests: openvswitch: Catch cases where the tests are killed
selftests: openvswitch: Skip drop testing on older kernels
selftests: openvswitch: Fix the ct_tuple for v4
.../selftests/net/openvswitch/openvswitch.sh | 21 +++++++-
.../selftests/net/openvswitch/ovs-dpctl.py | 48 ++++++++++++++++++-
2 files changed, 66 insertions(+), 3 deletions(-)
--
2.41.0
In order to use run_kselftest.sh the list of tests must be emitted to
populate kselftest-list.txt.
The powerpc Makefile is written to use EMIT_TESTS. But support for
EMIT_TESTS was dropped in commit d4e59a536f50 ("selftests: Use runner.sh
for emit targets"). Although prior to that commit a548de0fe8e1
("selftests: lib.mk: add test execute bit check to EMIT_TESTS") had
already broken run_kselftest.sh for powerpc due to the executable check
using the wrong path.
It can be fixed by replacing the EMIT_TESTS definitions with actual
emit_tests rules in the powerpc Makefiles. This makes run_kselftest.sh
able to run powerpc tests:
$ cd linux
$ export ARCH=powerpc
$ export CROSS_COMPILE=powerpc64le-linux-gnu-
$ make headers
$ make -j -C tools/testing/selftests install
$ grep -c "^powerpc" tools/testing/selftests/kselftest_install/kselftest-list.txt
182
Fixes: d4e59a536f50 ("selftests: Use runner.sh for emit targets")
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
---
tools/testing/selftests/powerpc/Makefile | 7 +++----
tools/testing/selftests/powerpc/pmu/Makefile | 11 ++++++-----
2 files changed, 9 insertions(+), 9 deletions(-)
I'll plan to merge this via the powerpc tree.
cheers
diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
index 49f2ad1793fd..7ea42fa02eab 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -59,12 +59,11 @@ override define INSTALL_RULE
done;
endef
-override define EMIT_TESTS
+emit_tests:
+@for TARGET in $(SUB_DIRS); do \
BUILD_TARGET=$(OUTPUT)/$$TARGET; \
- $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests;\
+ $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET $@;\
done;
-endef
override define CLEAN
+@for TARGET in $(SUB_DIRS); do \
@@ -77,4 +76,4 @@ endef
tags:
find . -name '*.c' -o -name '*.h' | xargs ctags
-.PHONY: tags $(SUB_DIRS)
+.PHONY: tags $(SUB_DIRS) emit_tests
diff --git a/tools/testing/selftests/powerpc/pmu/Makefile b/tools/testing/selftests/powerpc/pmu/Makefile
index 2b95e44d20ff..a284fa874a9f 100644
--- a/tools/testing/selftests/powerpc/pmu/Makefile
+++ b/tools/testing/selftests/powerpc/pmu/Makefile
@@ -30,13 +30,14 @@ override define RUN_TESTS
+TARGET=event_code_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -C $$TARGET run_tests
endef
-DEFAULT_EMIT_TESTS := $(EMIT_TESTS)
-override define EMIT_TESTS
- $(DEFAULT_EMIT_TESTS)
+emit_tests:
+ for TEST in $(TEST_GEN_PROGS); do \
+ BASENAME_TEST=`basename $$TEST`; \
+ echo "$(COLLECTION):$$BASENAME_TEST"; \
+ done
+TARGET=ebb; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests
+TARGET=sampling_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests
+TARGET=event_code_tests; BUILD_TARGET=$$OUTPUT/$$TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -s -C $$TARGET emit_tests
-endef
DEFAULT_INSTALL_RULE := $(INSTALL_RULE)
override define INSTALL_RULE
@@ -64,4 +65,4 @@ sampling_tests:
event_code_tests:
TARGET=$@; BUILD_TARGET=$$OUTPUT/$$TARGET; mkdir -p $$BUILD_TARGET; $(MAKE) OUTPUT=$$BUILD_TARGET -k -C $$TARGET all
-.PHONY: all run_tests ebb sampling_tests event_code_tests
+.PHONY: all run_tests ebb sampling_tests event_code_tests emit_tests
--
2.41.0
kselftest.h declares many variadic functions that can print some
formatted message while also executing selftest logic. These
declarations don't have any compiler mechanism to verify if passed
arguments are valid in comparison with format specifiers used in
printf() calls.
Attribute addition can make debugging easier, the code more consistent
and prevent mismatched or missing variables.
The first patch adds __printf() macro and applies it to all functions
in kselftest.h that use printf format specifiers. After compiling all
selftests using:
make -C tools/testing/selftests
many instances of format specifier mismatching are exposed in the form
of -Wformat warnings.
Fix the mismatched format specifiers caught by __printf() attribute in
multiple tests.
Series is based on kselftests next branch.
Changelog v5:
- Mention in the cover letter what methodology was used to find the
mismatched format specifiers.
- No functional changes in the patches.
Changelog v4:
- Fix patch 1/8 subject typo.
- Add Reinette's reviewed-by tags.
- Rebase onto new kselftest/next patches.
Changelog v3:
- Changed git signature from Wieczor-Retman Maciej to Maciej
Wieczor-Retman.
- Added one review tag.
- Rebased onto updated kselftests next branch.
Changelog v2:
- Add review and fixes tags to patches.
- Add two patches with mismatch fixes.
- Fix missed attribute in selftests/kvm. (Andrew)
- Fix previously missed issues in selftests/mm (Ilpo)
[v3] https://lore.kernel.org/all/cover.1695373131.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1693829810.git.maciej.wieczor-retman@inte…
[v1] https://lore.kernel.org/all/cover.1693216959.git.maciej.wieczor-retman@inte…
Maciej Wieczor-Retman (8):
selftests: Add printf attribute to kselftest prints
selftests/cachestat: Fix print_cachestat format
selftests/openat2: Fix wrong format specifier
selftests/pidfd: Fix ksft print formats
selftests/sigaltstack: Fix wrong format specifier
selftests/kvm: Replace attribute with macro
selftests/mm: Substitute attribute with a macro
selftests/resctrl: Fix wrong format specifier
.../selftests/cachestat/test_cachestat.c | 2 +-
tools/testing/selftests/kselftest.h | 18 ++++++++++--------
.../testing/selftests/kvm/include/test_util.h | 8 ++++----
tools/testing/selftests/mm/mremap_test.c | 2 +-
tools/testing/selftests/mm/pkey-helpers.h | 2 +-
tools/testing/selftests/openat2/openat2_test.c | 2 +-
.../selftests/pidfd/pidfd_fdinfo_test.c | 2 +-
tools/testing/selftests/pidfd/pidfd_test.c | 12 ++++++------
tools/testing/selftests/resctrl/cache.c | 2 +-
tools/testing/selftests/sigaltstack/sas.c | 2 +-
10 files changed, 27 insertions(+), 25 deletions(-)
base-commit: 2531f374f922e77ba51f24d1aa6fa11c7f4c36b8
--
2.42.0
v2:
- rename c0/c1 to cli0/cli1, p0/p1 to peer0/perr1 as Daniel suggested.
Two cleanups for sockmap_listen selftests: enable a kconfig and add a
new helper.
Geliang Tang (2):
selftests/bpf: Enable CONFIG_VSOCKETS in config
selftests/bpf: Add pairs_redir_to_connected helper
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/sockmap_listen.c | 159 ++++--------------
2 files changed, 36 insertions(+), 124 deletions(-)
--
2.35.3
From: Jiri Pirko <jiri(a)nvidia.com>
The file name used in flash test was "dummy" because at the time test
was written, drivers were responsible for file request and as netdevsim
didn't do that, name was unused. However, the file load request is
now done in devlink code and therefore the file has to exist.
Use first random file from /lib/firmware for this purpose.
Signed-off-by: Jiri Pirko <jiri(a)nvidia.com>
---
.../drivers/net/netdevsim/devlink.sh | 21 ++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
index 7f7d20f22207..46e20b13473c 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
@@ -31,36 +31,43 @@ devlink_wait()
fw_flash_test()
{
+ DUMMYFILE=$(find /lib/firmware -maxdepth 1 -type f -printf '%f\n' |head -1)
RET=0
- devlink dev flash $DL_HANDLE file dummy
+ if [ -z "$DUMMYFILE" ]
+ then
+ echo "SKIP: unable to find suitable dummy firmware file"
+ return
+ fi
+
+ devlink dev flash $DL_HANDLE file $DUMMYFILE
check_err $? "Failed to flash with status updates on"
- devlink dev flash $DL_HANDLE file dummy component fw.mgmt
+ devlink dev flash $DL_HANDLE file $DUMMYFILE component fw.mgmt
check_err $? "Failed to flash with component attribute"
- devlink dev flash $DL_HANDLE file dummy overwrite settings
+ devlink dev flash $DL_HANDLE file $DUMMYFILE overwrite settings
check_fail $? "Flash with overwrite settings should be rejected"
echo "1"> $DEBUGFS_DIR/fw_update_overwrite_mask
check_err $? "Failed to change allowed overwrite mask"
- devlink dev flash $DL_HANDLE file dummy overwrite settings
+ devlink dev flash $DL_HANDLE file $DUMMYFILE overwrite settings
check_err $? "Failed to flash with settings overwrite enabled"
- devlink dev flash $DL_HANDLE file dummy overwrite identifiers
+ devlink dev flash $DL_HANDLE file $DUMMYFILE overwrite identifiers
check_fail $? "Flash with overwrite settings should be identifiers"
echo "3"> $DEBUGFS_DIR/fw_update_overwrite_mask
check_err $? "Failed to change allowed overwrite mask"
- devlink dev flash $DL_HANDLE file dummy overwrite identifiers overwrite settings
+ devlink dev flash $DL_HANDLE file $DUMMYFILE overwrite identifiers overwrite settings
check_err $? "Failed to flash with settings and identifiers overwrite enabled"
echo "n"> $DEBUGFS_DIR/fw_update_status
check_err $? "Failed to disable status updates"
- devlink dev flash $DL_HANDLE file dummy
+ devlink dev flash $DL_HANDLE file $DUMMYFILE
check_err $? "Failed to flash with status updates off"
log_test "fw flash test"
--
2.41.0
The merge commit 92716869375b ("Merge branch 'br-flush-filtering'") added
support for FDB flushing in bridge driver. Extend VXLAN driver to support
FDB flushing also. Add support for filtering by fields which are relevant
for VXLAN FDBs:
* Source VNI
* Nexthop ID
* 'router' flag
* Destination VNI
* Destination Port
* Destination IP
Without this set, flush for VXLAN device fails:
$ bridge fdb flush dev vx10
RTNETLINK answers: Operation not supported
With this set, such flush works with the relevant arguments, for example:
$ bridge fdb flush dev vx10 vni 5000 dst 193.2.2.1
< flush all vx10 entries with VNI 5000 and destination IP 193.2.2.1>
Some preparations are required, handle them before adding flushing support
in VXLAN driver. See more details in commit messages.
Patch set overview:
Patch #1 prepares flush policy to be used by VXLAN driver
Patches #2-#3 are preparations in VXLAN driver
Patch #4 adds an initial support for flushing in VXLAN driver
Patches #5-#9 add support for filtering by several attributes
Patch #10 adds a test for FDB flush with VXLAN
Patch #11 extends the test to check FDB flush with bridge
Amit Cohen (11):
net: Handle bulk delete policy in bridge driver
vxlan: vxlan_core: Make vxlan_flush() more generic for future use
vxlan: vxlan_core: Do not skip default entry in vxlan_flush() by
default
vxlan: vxlan_core: Add support for FDB flush
vxlan: vxlan_core: Support FDB flushing by source VNI
vxlan: vxlan_core: Support FDB flushing by nexthop ID
vxlan: vxlan_core: Support FDB flushing by destination VNI
vxlan: vxlan_core: Support FDB flushing by destination port
vxlan: vxlan_core: Support FDB flushing by destination IP
selftests: Add test cases for FDB flush with VXLAN device
selftests: fdb_flush: Add test cases for FDB flush with bridge device
drivers/net/vxlan/vxlan_core.c | 207 +++++-
include/linux/netdevice.h | 8 +-
net/bridge/br_fdb.c | 29 +-
net/bridge/br_private.h | 3 +-
net/core/rtnetlink.c | 27 +-
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/fdb_flush.sh | 812 +++++++++++++++++++++++
7 files changed, 1049 insertions(+), 38 deletions(-)
create mode 100755 tools/testing/selftests/net/fdb_flush.sh
--
2.40.1
When we dynamically generate a name for a configuration in get-reg-list
we use strcat() to append to a buffer allocated using malloc() but we
never initialise that buffer. Since malloc() offers no guarantees
regarding the contents of the memory it returns this can lead to us
corrupting, and likely overflowing, the buffer:
vregs: PASS
vregs+pmu: PASS
sve: PASS
sve+pmu: PASS
vregs+pauth_address+pauth_generic: PASS
X�vr+gspauth_addre+spauth_generi+pmu: PASS
Initialise the buffer to an empty string to avoid this.
Fixes: 17da79e009c37 ("KVM: arm64: selftests: Split get-reg-list test code")
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/kvm/get-reg-list.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/kvm/get-reg-list.c b/tools/testing/selftests/kvm/get-reg-list.c
index be7bf5224434..dd62a6976c0d 100644
--- a/tools/testing/selftests/kvm/get-reg-list.c
+++ b/tools/testing/selftests/kvm/get-reg-list.c
@@ -67,6 +67,7 @@ static const char *config_name(struct vcpu_reg_list *c)
c->name = malloc(len);
+ c->name[0] = '\0';
len = 0;
for_each_sublist(c, s) {
if (!strcmp(s->name, "base"))
---
base-commit: 6465e260f48790807eef06b583b38ca9789b6072
change-id: 20231012-kvm-get-reg-list-str-init-76c8ed4e19d6
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Hi all:
The core frequency is subjected to the process variation in semiconductors.
Not all cores are able to reach the maximum frequency respecting the
infrastructure limits. Consequently, AMD has redefined the concept of
maximum frequency of a part. This means that a fraction of cores can reach
maximum frequency. To find the best process scheduling policy for a given
scenario, OS needs to know the core ordering informed by the platform through
highest performance capability register of the CPPC interface.
Earlier implementations of amd-pstate preferred core only support a static
core ranking and targeted performance. Now it has the ability to dynamically
change the preferred core based on the workload and platform conditions and
accounting for thermals and aging.
Amd-pstate driver utilizes the functions and data structures provided by
the ITMT architecture to enable the scheduler to favor scheduling on cores
which can be get a higher frequency with lower voltage.
We call it amd-pstate preferred core.
Here sched_set_itmt_core_prio() is called to set priorities and
sched_set_itmt_support() is called to enable ITMT feature.
Amd-pstate driver uses the highest performance value to indicate
the priority of CPU. The higher value has a higher priority.
Amd-pstate driver will provide an initial core ordering at boot time.
It relies on the CPPC interface to communicate the core ranking to the
operating system and scheduler to make sure that OS is choosing the cores
with highest performance firstly for scheduling the process. When amd-pstate
driver receives a message with the highest performance change, it will
update the core ranking.
Changes form V8->V9:
- all:
- - pick up Tested-By flag added by Oleksandr.
- cpufreq: amd-pstate:
- - pick up Review-By flag added by Wyes.
- - ignore modification of bug.
- - add a attribute of prefcore_ranking.
- - modify data type conversion from u32 to int.
- Documentation: amd-pstate:
- - pick up Review-By flag added by Wyes.
Changes form V7->V8:
- all:
- - pick up Review-By flag added by Mario and Ray.
- cpufreq: amd-pstate:
- - use hw_prefcore embeds into cpudata structure.
- - delete preferred core init from cpu online/off.
Changes form V6->V7:
- x86:
- - Modify kconfig about X86_AMD_PSTATE.
- cpufreq: amd-pstate:
- - modify incorrect comments about scheduler_work().
- - convert highest_perf data type.
- - modify preferred core init when cpu init and online.
- acpi: cppc:
- - modify link of CPPC highest performance.
- cpufreq:
- - modify link of CPPC highest performance changed.
Changes form V5->V6:
- cpufreq: amd-pstate:
- - modify the wrong tag order.
- - modify warning about hw_prefcore sysfs attribute.
- - delete duplicate comments.
- - modify the variable name cppc_highest_perf to prefcore_ranking.
- - modify judgment conditions for setting highest_perf.
- - modify sysfs attribute for CPPC highest perf to pr_debug message.
- Documentation: amd-pstate:
- - modify warning: title underline too short.
Changes form V4->V5:
- cpufreq: amd-pstate:
- - modify sysfs attribute for CPPC highest perf.
- - modify warning about comments
- - rebase linux-next
- cpufreq:
- - Moidfy warning about function declarations.
- Documentation: amd-pstate:
- - align with ``amd-pstat``
Changes form V3->V4:
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
Changes form V2->V3:
- x86:
- - Modify kconfig and description.
- cpufreq: amd-pstate:
- - Add Co-developed-by tag in commit message.
- cpufreq:
- - Modify commit message.
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
Changes form V1->V2:
- acpi: cppc:
- - Add reference link.
- cpufreq:
- - Moidfy link error.
- cpufreq: amd-pstate:
- - Init the priorities of all online CPUs
- - Use a single variable to represent the status of preferred core.
- Documentation:
- - Default enabled preferred core.
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
- - Default enabled preferred core.
- - Use a single variable to represent the status of preferred core.
Meng Li (7):
x86: Drop CPU_SUP_INTEL from SCHED_MC_PRIO for the expansion.
acpi: cppc: Add get the highest performance cppc control
cpufreq: amd-pstate: Enable amd-pstate preferred core supporting.
cpufreq: Add a notification message that the highest perf has changed
cpufreq: amd-pstate: Update amd-pstate preferred core ranking
dynamically
Documentation: amd-pstate: introduce amd-pstate preferred core
Documentation: introduce amd-pstate preferrd core mode kernel command
line options
.../admin-guide/kernel-parameters.txt | 5 +
Documentation/admin-guide/pm/amd-pstate.rst | 59 ++++-
arch/x86/Kconfig | 5 +-
drivers/acpi/cppc_acpi.c | 13 ++
drivers/acpi/processor_driver.c | 6 +
drivers/cpufreq/amd-pstate.c | 204 ++++++++++++++++--
drivers/cpufreq/cpufreq.c | 13 ++
include/acpi/cppc_acpi.h | 5 +
include/linux/amd-pstate.h | 10 +
include/linux/cpufreq.h | 5 +
10 files changed, 305 insertions(+), 20 deletions(-)
--
2.34.1
Rework how KVM limits guest-unsupported xfeatures to effectively hide
only when saving state for userspace (KVM_GET_XSAVE), i.e. to let userspace
load all host-supported xfeatures (via KVM_SET_XSAVE) irrespective of
what features have been exposed to the guest.
The effect on KVM_SET_XSAVE was knowingly done by commit ad856280ddea
("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0"):
As a bonus, it will also fail if userspace tries to set fpu features
(with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
configuration. Such features will never be returned by KVM_GET_XSAVE
or KVM_GET_XSAVE2.
Peventing userspace from doing stupid things is usually a good idea, but in
this case restricting KVM_SET_XSAVE actually exacerbated the problem that
commit ad856280ddea was fixing. As reported by Tyler, rejecting KVM_SET_XSAVE
for guest-unsupported xfeatures breaks live migration from a kernel without
commit ad856280ddea, to a kernel with ad856280ddea. I.e. from a kernel that
saves guest-unsupported xfeatures to a kernel that doesn't allow loading
guest-unuspported xfeatures.
To make matters even worse, QEMU doesn't terminate if KVM_SET_XSAVE fails,
and so the end result is that the live migration results (possibly silent)
guest data corruption instead of a failed migration.
Patch 1 refactors the FPU code to let KVM pass in a mask of which xfeatures
to save, patch 2 fixes KVM by passing in guest_supported_xcr0 instead of
modifying user_xfeatures directly.
Patches 3-5 are regression tests.
I have no objection if anyone wants patches 1 and 2 squashed together, I
split them purely to make review easier.
Note, this doesn't fix the scenario where a guest is migrated from a "bad"
to a "good" kernel and the target host doesn't support the over-saved set
of xfeatures. I don't see a way to safely handle that in the kernel without
an opt-in, which more or less defeats the purpose of handling it in KVM.
Sean Christopherson (5):
x86/fpu: Allow caller to constrain xfeatures when copying to uabi
buffer
KVM: x86: Constrain guest-supported xfeatures only at KVM_GET_XSAVE{2}
KVM: selftests: Touch relevant XSAVE state in guest for state test
KVM: selftests: Load XSAVE state into untouched vCPU during state test
KVM: selftests: Force load all supported XSAVE state in state test
arch/x86/include/asm/fpu/api.h | 3 +-
arch/x86/kernel/fpu/core.c | 5 +-
arch/x86/kernel/fpu/xstate.c | 12 +-
arch/x86/kernel/fpu/xstate.h | 3 +-
arch/x86/kvm/cpuid.c | 8 --
arch/x86/kvm/x86.c | 37 +++---
.../selftests/kvm/include/x86_64/processor.h | 23 ++++
.../testing/selftests/kvm/x86_64/state_test.c | 110 +++++++++++++++++-
8 files changed, 168 insertions(+), 33 deletions(-)
base-commit: 5804c19b80bf625c6a9925317f845e497434d6d3
--
2.42.0.582.g8ccd20d70d-goog
This series fixes issues observed with selftests/amd-pstate while
running performance comparison tests with different governors. First
patch changes relative paths with absolute path and also change it
with correct path wherever it is broken.
The second patch adds an option to provide perf binary path to
handle the case where distro perf does not work.
Changelog v2->v3:
* Split the patch into two patches
Swapnil Sapkal (2):
selftests/amd-pstate: Fix broken paths to run workloads in amd-pstate-ut
selftests/amd-pstate: Added option to provide perf binary path
.../x86/amd_pstate_tracer/amd_pstate_trace.py | 2 +-
.../testing/selftests/amd-pstate/gitsource.sh | 14 +++++++----
tools/testing/selftests/amd-pstate/run.sh | 23 +++++++++++++------
tools/testing/selftests/amd-pstate/tbench.sh | 4 ++--
4 files changed, 28 insertions(+), 15 deletions(-)
--
2.34.1
Changelog: v2 -> v3
* Minimal code refactoring
* Rebased on v6.6-rc1
RFC v1:
https://lore.kernel.org/all/20210611124154.56427-1-psampat@linux.ibm.com/
RFC v2:
https://lore.kernel.org/all/20230828061530.126588-2-aboorvad@linux.vnet.ibm…
Other related RFC:
https://lore.kernel.org/all/20210430082804.38018-1-psampat@linux.ibm.com/
Userspace selftest:
https://lkml.org/lkml/2020/9/2/356
----
A kernel module + userspace driver to estimate the wakeup latency
caused by going into stop states. The motivation behind this program is
to find significant deviations behind advertised latency and residency
values.
The patchset measures latencies for two kinds of events. IPIs and Timers
As this is a software-only mechanism, there will be additional latencies
of the kernel-firmware-hardware interactions. To account for that, the
program also measures a baseline latency on a 100 percent loaded CPU
and the latencies achieved must be in view relative to that.
To achieve this, we introduce a kernel module and expose its control
knobs through the debugfs interface that the selftests can engage with.
The kernel module provides the following interfaces within
/sys/kernel/debug/powerpc/latency_test/ for,
IPI test:
ipi_cpu_dest = Destination CPU for the IPI
ipi_cpu_src = Origin of the IPI
ipi_latency_ns = Measured latency time in ns
Timeout test:
timeout_cpu_src = CPU on which the timer to be queued
timeout_expected_ns = Timer duration
timeout_diff_ns = Difference of actual duration vs expected timer
Sample output is as follows:
# --IPI Latency Test---
# Baseline Avg IPI latency(ns): 2720
# Observed Avg IPI latency(ns) - State snooze: 2565
# Observed Avg IPI latency(ns) - State stop0_lite: 3856
# Observed Avg IPI latency(ns) - State stop0: 3670
# Observed Avg IPI latency(ns) - State stop1: 3872
# Observed Avg IPI latency(ns) - State stop2: 17421
# Observed Avg IPI latency(ns) - State stop4: 1003922
# Observed Avg IPI latency(ns) - State stop5: 1058870
#
# --Timeout Latency Test--
# Baseline Avg timeout diff(ns): 1435
# Observed Avg timeout diff(ns) - State snooze: 1709
# Observed Avg timeout diff(ns) - State stop0_lite: 2028
# Observed Avg timeout diff(ns) - State stop0: 1954
# Observed Avg timeout diff(ns) - State stop1: 1895
# Observed Avg timeout diff(ns) - State stop2: 14556
# Observed Avg timeout diff(ns) - State stop4: 873988
# Observed Avg timeout diff(ns) - State stop5: 959137
Aboorva Devarajan (2):
powerpc/cpuidle: cpuidle wakeup latency based on IPI and timer events
powerpc/selftest: Add support for cpuidle latency measurement
arch/powerpc/Kconfig.debug | 10 +
arch/powerpc/kernel/Makefile | 1 +
arch/powerpc/kernel/test_cpuidle_latency.c | 154 ++++++
tools/testing/selftests/powerpc/Makefile | 1 +
.../powerpc/cpuidle_latency/.gitignore | 2 +
.../powerpc/cpuidle_latency/Makefile | 6 +
.../cpuidle_latency/cpuidle_latency.sh | 443 ++++++++++++++++++
.../powerpc/cpuidle_latency/settings | 1 +
8 files changed, 618 insertions(+)
create mode 100644 arch/powerpc/kernel/test_cpuidle_latency.c
create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/.gitignore
create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/Makefile
create mode 100755 tools/testing/selftests/powerpc/cpuidle_latency/cpuidle_latency.sh
create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/settings
--
2.25.1
When execute the following command to test clone3 under !CONFIG_TIME_NS:
# make headers && cd tools/testing/selftests/clone3 && make && ./clone3
we can see the following error info:
# [7538] Trying clone3() with flags 0x80 (size 0)
# Invalid argument - Failed to create new process
# [7538] clone3() with flags says: -22 expected 0
not ok 18 [7538] Result (-22) is different than expected (0)
...
# Totals: pass:18 fail:1 xfail:0 xpass:0 skip:0 error:0
This is because if CONFIG_TIME_NS is not set, but the flag
CLONE_NEWTIME (0x80) is used to clone a time namespace, it
will return -EINVAL in copy_time_ns().
If kernel does not support CONFIG_TIME_NS, /proc/self/ns/time
will be not exist, and then we should skip clone3() test with
CLONE_NEWTIME.
With this patch under !CONFIG_TIME_NS:
# make headers && cd tools/testing/selftests/clone3 && make && ./clone3
...
# Time namespaces are not supported
ok 18 # SKIP Skipping clone3() with CLONE_NEWTIME
...
# Totals: pass:18 fail:0 xfail:0 xpass:0 skip:1 error:0
Fixes: 515bddf0ec41 ("selftests/clone3: test clone3 with CLONE_NEWTIME")
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Tiezhu Yang <yangtiezhu(a)loongson.cn>
---
v6: Rebase on 6.5-rc1 and update the commit message
tools/testing/selftests/clone3/clone3.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index e60cf4d..1c61e3c 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -196,7 +196,12 @@ int main(int argc, char *argv[])
CLONE3_ARGS_NO_TEST);
/* Do a clone3() in a new time namespace */
- test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST);
+ if (access("/proc/self/ns/time", F_OK) == 0) {
+ test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST);
+ } else {
+ ksft_print_msg("Time namespaces are not supported\n");
+ ksft_test_result_skip("Skipping clone3() with CLONE_NEWTIME\n");
+ }
/* Do a clone3() with exit signal (SIGCHLD) in flags */
test_clone3(SIGCHLD, 0, -EINVAL, CLONE3_ARGS_NO_TEST);
--
2.1.0
v2: Rebase on 6.5-rc1 and update the commit message
Tiezhu Yang (2):
selftests/vDSO: Add support for LoongArch
selftests/vDSO: Get version and name for all archs
tools/testing/selftests/vDSO/vdso_config.h | 6 ++++-
tools/testing/selftests/vDSO/vdso_test_getcpu.c | 16 +++++--------
.../selftests/vDSO/vdso_test_gettimeofday.c | 26 ++++++----------------
3 files changed, 18 insertions(+), 30 deletions(-)
--
2.1.0
And this is the last(?) revision of this series which should now compile
with or without CONFIG_HID_BPF set.
I had to do changes because [1] was failing
Nick, I kept your Tested-by, even if I made small changes in 1/3. Feel
free to shout if you don't want me to keep it.
Eduard, You helped us a lot in the review of v1 but never sent your
Reviewed-by or Acked-by. Do you want me to add one?
Cheers,
Benjamin
[1] https://gitlab.freedesktop.org/bentiss/hid/-/jobs/49754306
For reference, the v2 cover letter:
| Hi, I am sending this series on behalf of myself and Benjamin Tissoires. There
| existed an initial n=3 patch series which was later expanded to n=4 and
| is now back to n=3 with some fixes added in and rebased against
| mainline.
|
| This patch series aims to ensure that the hid/bpf selftests can be built
| without errors.
|
| Here's Benjamin's initial cover letter for context:
| | These fixes have been triggered by [0]:
| | basically, if you do not recompile the kernel first, and are
| | running on an old kernel, vmlinux.h doesn't have the required
| | symbols and the compilation fails.
| |
| | The tests will fail if you run them on that very same machine,
| | of course, but the binary should compile.
| |
| | And while I was sorting out why it was failing, I realized I
| | could do a couple of improvements on the Makefile.
| |
| | [0] https://lore.kernel.org/linux-input/56ba8125-2c6f-a9c9-d498-0ca1c153dcb2@re…
Signed-off-by: Benjamin Tissoires <bentiss(a)kernel.org>
---
Changes in v3:
- Also overwrite all of the enum symbols in patch 1/3
- Link to v2: https://lore.kernel.org/r/20230908-kselftest-09-08-v2-0-0def978a4c1b@google…
Changes in v2:
- roll Justin's fix into patch 1/3
- add __attribute__((preserve_access_index)) (thanks Eduard)
- rebased onto mainline (2dde18cd1d8fac735875f2e4987f11817cc0bc2c)
- Link to v1: https://lore.kernel.org/r/20230825-wip-selftests-v1-0-c862769020a8@kernel.o…
Link: https://github.com/ClangBuiltLinux/linux/issues/1698
Link: https://github.com/ClangBuiltLinux/continuous-integration2/issues/61
---
Benjamin Tissoires (3):
selftests/hid: ensure we can compile the tests on kernels pre-6.3
selftests/hid: do not manually call headers_install
selftests/hid: force using our compiled libbpf headers
tools/testing/selftests/hid/Makefile | 10 ++-
tools/testing/selftests/hid/progs/hid.c | 3 -
.../testing/selftests/hid/progs/hid_bpf_helpers.h | 77 ++++++++++++++++++++++
3 files changed, 81 insertions(+), 9 deletions(-)
---
base-commit: 29aa98d0fe013e2ab62aae4266231b7fb05d47a2
change-id: 20230825-wip-selftests-9a7502b56542
Best regards,
--
Benjamin Tissoires <bentiss(a)kernel.org>
A number of corner cases were caught when trying to run the selftests on
older systems. Missed skip conditions, some error cases, and outdated
python setups would all report failures but the issue would actually be
related to some other condition rather than the selftest suite.
Address these individual cases.
Aaron Conole (4):
selftests: openvswitch: Add version check for pyroute2
selftests: openvswitch: Catch cases where the tests are killed
selftests: openvswitch: Skip drop testing on older kernels
selftests: openvswitch: Fix the ct_tuple for v4
.../selftests/net/openvswitch/openvswitch.sh | 21 ++++++++-
.../selftests/net/openvswitch/ovs-dpctl.py | 46 ++++++++++++++++++-
2 files changed, 65 insertions(+), 2 deletions(-)
--
2.40.1
On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
>
> virtio-net have two usage of hashes: one is RSS and another is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce the code to compute hashes to the kernel in order to overcome
> thse challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but it makes
> little sense to allow to implement different hashing algorithms with
> eBPF since the hash value reported by virtio-net is strictly defined by
> the specification.
>
> The hash value already stored in sk_buff is not used and computed
> independently since it may have been computed in a way not conformant
> with the specification.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
> ---
> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> + .max_indirection_table_length =
> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> +
> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> +};
No need to have explicit capabilities exchange like this? Tun either
supports all or none.
> case TUNSETSTEERINGEBPF:
> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> + if (IS_ERR(bpf_ret))
> + ret = PTR_ERR(bpf_ret);
> + else if (bpf_ret)
> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
Don't make one feature disable another.
TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
functions. If one is enabled the other call should fail, with EBUSY
for instance.
> + case TUNSETVNETHASH:
> + len = sizeof(vnet_hash);
> + if (copy_from_user(&vnet_hash, argp, len)) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> + !tun_is_little_endian(tun))) ||
> + vnet_hash.indirection_table_mask >=
> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + argp = (u8 __user *)argp + len;
> + len = (vnet_hash.indirection_table_mask + 1) * 2;
> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + argp = (u8 __user *)argp + len;
> + len = virtio_net_hash_key_length(vnet_hash.types);
> +
> + if (copy_from_user(vnet_hash_key, argp, len)) {
> + ret = -EFAULT;
> + break;
> + }
Probably easier and less error-prone to define a fixed size control
struct with the max indirection table size.
Btw: please trim the CC: list considerably on future patches.
qemu-system-ppc64 can handle both big and little endian kernels.
While some setups, like Debian, provide a symlink to execute
qemu-system-ppc64 as qemu-system-ppc64le, others, like ArchLinux, do not.
So always use qemu-system-ppc64 directly.
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
tools/testing/selftests/nolibc/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/nolibc/Makefile b/tools/testing/selftests/nolibc/Makefile
index 891aa396163d..af60e07d3c12 100644
--- a/tools/testing/selftests/nolibc/Makefile
+++ b/tools/testing/selftests/nolibc/Makefile
@@ -82,7 +82,7 @@ QEMU_ARCH_arm = arm
QEMU_ARCH_mips = mipsel # works with malta_defconfig
QEMU_ARCH_ppc = ppc
QEMU_ARCH_ppc64 = ppc64
-QEMU_ARCH_ppc64le = ppc64le
+QEMU_ARCH_ppc64le = ppc64
QEMU_ARCH_riscv = riscv64
QEMU_ARCH_s390 = s390x
QEMU_ARCH_loongarch = loongarch64
---
base-commit: 361fbc295e965a3c7f606d281e6107e098d33730
change-id: 20231008-nolibc-qemu-ppc64-07b4f74043a6
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
Hi all:
The core frequency is subjected to the process variation in semiconductors.
Not all cores are able to reach the maximum frequency respecting the
infrastructure limits. Consequently, AMD has redefined the concept of
maximum frequency of a part. This means that a fraction of cores can reach
maximum frequency. To find the best process scheduling policy for a given
scenario, OS needs to know the core ordering informed by the platform through
highest performance capability register of the CPPC interface.
Earlier implementations of amd-pstate preferred core only support a static
core ranking and targeted performance. Now it has the ability to dynamically
change the preferred core based on the workload and platform conditions and
accounting for thermals and aging.
Amd-pstate driver utilizes the functions and data structures provided by
the ITMT architecture to enable the scheduler to favor scheduling on cores
which can be get a higher frequency with lower voltage.
We call it amd-pstate preferred core.
Here sched_set_itmt_core_prio() is called to set priorities and
sched_set_itmt_support() is called to enable ITMT feature.
Amd-pstate driver uses the highest performance value to indicate
the priority of CPU. The higher value has a higher priority.
Amd-pstate driver will provide an initial core ordering at boot time.
It relies on the CPPC interface to communicate the core ranking to the
operating system and scheduler to make sure that OS is choosing the cores
with highest performance firstly for scheduling the process. When amd-pstate
driver receives a message with the highest performance change, it will
update the core ranking.
Changes form V7->V8:
- all:
- - pick up Review-By flag added by Mario and Ray.
- cpufreq: amd-pstate:
- - use hw_prefcore embeds into cpudata structure.
- - delete preferred core init from cpu online/off.
Changes form V6->V7:
- x86:
- - Modify kconfig about X86_AMD_PSTATE.
- cpufreq: amd-pstate:
- - modify incorrect comments about scheduler_work().
- - convert highest_perf data type.
- - modify preferred core init when cpu init and online.
- acpi: cppc:
- - modify link of CPPC highest performance.
- cpufreq:
- - modify link of CPPC highest performance changed.
Changes form V5->V6:
- cpufreq: amd-pstate:
- - modify the wrong tag order.
- - modify warning about hw_prefcore sysfs attribute.
- - delete duplicate comments.
- - modify the variable name cppc_highest_perf to prefcore_ranking.
- - modify judgment conditions for setting highest_perf.
- - modify sysfs attribute for CPPC highest perf to pr_debug message.
- Documentation: amd-pstate:
- - modify warning: title underline too short.
Changes form V4->V5:
- cpufreq: amd-pstate:
- - modify sysfs attribute for CPPC highest perf.
- - modify warning about comments
- - rebase linux-next
- cpufreq:
- - Moidfy warning about function declarations.
- Documentation: amd-pstate:
- - align with ``amd-pstat``
Changes form V3->V4:
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
Changes form V2->V3:
- x86:
- - Modify kconfig and description.
- cpufreq: amd-pstate:
- - Add Co-developed-by tag in commit message.
- cpufreq:
- - Modify commit message.
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
Changes form V1->V2:
- acpi: cppc:
- - Add reference link.
- cpufreq:
- - Moidfy link error.
- cpufreq: amd-pstate:
- - Init the priorities of all online CPUs
- - Use a single variable to represent the status of preferred core.
- Documentation:
- - Default enabled preferred core.
- Documentation: amd-pstate:
- - Modify inappropriate descriptions.
- - Default enabled preferred core.
- - Use a single variable to represent the status of preferred core.
Meng Li (7):
x86: Drop CPU_SUP_INTEL from SCHED_MC_PRIO for the expansion.
acpi: cppc: Add get the highest performance cppc control
cpufreq: amd-pstate: Enable amd-pstate preferred core supporting.
cpufreq: Add a notification message that the highest perf has changed
cpufreq: amd-pstate: Update amd-pstate preferred core ranking
dynamically
Documentation: amd-pstate: introduce amd-pstate preferred core
Documentation: introduce amd-pstate preferrd core mode kernel command
line options
.../admin-guide/kernel-parameters.txt | 5 +
Documentation/admin-guide/pm/amd-pstate.rst | 59 +++++-
arch/x86/Kconfig | 5 +-
drivers/acpi/cppc_acpi.c | 13 ++
drivers/acpi/processor_driver.c | 6 +
drivers/cpufreq/amd-pstate.c | 186 ++++++++++++++++--
drivers/cpufreq/cpufreq.c | 13 ++
include/acpi/cppc_acpi.h | 5 +
include/linux/amd-pstate.h | 10 +
include/linux/cpufreq.h | 5 +
10 files changed, 285 insertions(+), 22 deletions(-)
--
2.34.1
Hi all,
This series implements the Permission Overlay Extension introduced in 2022
VMSA enhancements [1]. It is based on v6.6-rc3.
The Permission Overlay Extension allows to constrain permissions on memory
regions. This can be used from userspace (EL0) without a system call or TLB
invalidation.
POE is used to implement the Memory Protection Keys [2] Linux syscall.
The first few patches add the basic framework, then the PKEYS interface is
implemented, and then the selftests are made to work on arm64.
There was discussion about what the 'default' protection key value should be,
I used disallow-all (apart from pkey 0), which matches what x86 does.
Patch 15 contains a call to cpus_have_const_cap(), which I couldn't avoid
until Mark's patch to re-order when the alternatives were applied [3] is
committed.
The KVM part isn't tested yet.
I have tested the modified protection_keys test on x86_64 [4], but not PPC.
Hopefully I have CC'd everyone correctly.
Thanks,
Joey
Joey Gouly (20):
arm64/sysreg: add system register POR_EL{0,1}
arm64/sysreg: update CPACR_EL1 register
arm64: cpufeature: add Permission Overlay Extension cpucap
arm64: disable trapping of POR_EL0 to EL2
arm64: context switch POR_EL0 register
KVM: arm64: Save/restore POE registers
arm64: enable the Permission Overlay Extension for EL0
arm64: add POIndex defines
arm64: define VM_PKEY_BIT* for arm64
arm64: mask out POIndex when modifying a PTE
arm64: enable ARCH_HAS_PKEYS on arm64
arm64: handle PKEY/POE faults
arm64: stop using generic mm_hooks.h
arm64: implement PKEYS support
arm64: add POE signal support
arm64: enable PKEY support for CPUs with S1POE
arm64: enable POE and PIE to coexist
kselftest/arm64: move get_header()
selftests: mm: move fpregs printing
selftests: mm: make protection_keys test work on arm64
Documentation/arch/arm64/elf_hwcaps.rst | 3 +
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/el2_setup.h | 10 +-
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/asm/kvm_host.h | 4 +
arch/arm64/include/asm/mman.h | 6 +-
arch/arm64/include/asm/mmu.h | 2 +
arch/arm64/include/asm/mmu_context.h | 51 ++++++-
arch/arm64/include/asm/pgtable-hwdef.h | 10 ++
arch/arm64/include/asm/pgtable-prot.h | 8 +-
arch/arm64/include/asm/pgtable.h | 28 +++-
arch/arm64/include/asm/pkeys.h | 110 ++++++++++++++
arch/arm64/include/asm/por.h | 33 +++++
arch/arm64/include/asm/processor.h | 1 +
arch/arm64/include/asm/sysreg.h | 16 ++
arch/arm64/include/asm/traps.h | 1 +
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/sigcontext.h | 7 +
arch/arm64/kernel/cpufeature.c | 15 ++
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/process.c | 16 ++
arch/arm64/kernel/signal.c | 51 +++++++
arch/arm64/kernel/traps.c | 12 +-
arch/arm64/kvm/sys_regs.c | 2 +
arch/arm64/mm/fault.c | 44 +++++-
arch/arm64/mm/mmap.c | 7 +
arch/arm64/mm/mmu.c | 38 +++++
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 11 +-
fs/proc/task_mmu.c | 2 +
include/linux/mm.h | 11 +-
.../arm64/signal/testcases/testcases.c | 23 ---
.../arm64/signal/testcases/testcases.h | 26 +++-
tools/testing/selftests/mm/Makefile | 2 +-
tools/testing/selftests/mm/pkey-arm64.h | 138 ++++++++++++++++++
tools/testing/selftests/mm/pkey-helpers.h | 8 +
tools/testing/selftests/mm/pkey-powerpc.h | 3 +
tools/testing/selftests/mm/pkey-x86.h | 4 +
tools/testing/selftests/mm/protection_keys.c | 29 ++--
39 files changed, 685 insertions(+), 52 deletions(-)
create mode 100644 arch/arm64/include/asm/pkeys.h
create mode 100644 arch/arm64/include/asm/por.h
create mode 100644 tools/testing/selftests/mm/pkey-arm64.h
--
2.25.1
Write_schemata() uses fprintf() to write a bitmask into a schemata file
inside resctrl FS. It checks fprintf() return value but it doesn't check
fclose() return value. Error codes from fprintf() such as write errors,
are buffered and flushed back to the user only after fclose() is executed
which means any invalid bitmask can be written into the schemata file.
Rewrite write_schemata() to use syscalls instead of stdio file
operations to avoid the buffering.
The resctrlfs.c defines functions that interact with the resctrl FS
while resctrl_val.c defines functions that perform measurements on
the cache. Run_benchmark() fits logically into the second file before
resctrl_val() that uses it.
Move run_benchmark() from resctrlfs.c to resctrl_val.c and remove
redundant part of the kernel-doc comment. Make run_benchmark() static
and remove it from the header file.
Patch series is based on [1] which is based on [2] which are based on
kselftest next branch.
Resend v7:
- Resending because I forgot to add the base commit.
Changelog v7:
- Add label for non-empty schema error case to Patch 1/2. (Reinette)
- Add Reinette's reviewed-by tag to Patch 1/2.
Changelog v6:
- Align schema_len error checking with typical snprintf format.
(Reinette)
- Initialize schema string for early return eventuality. (Reinette)
Changelog v5:
- Add Ilpo's reviewed-by tag to Patch 1/2.
- Reword patch messages slightly.
- Add error check to schema_len variable.
Changelog v4:
- Change git signature from Wieczor-Retman Maciej to Maciej
Wieczor-Retman.
- Rebase onto [1] which is based on [2]. (Reinette)
- Add fcntl.h explicitly to provide glibc backward compatibility.
(Reinette)
Changelog v3:
- Use snprintf() return value instead of strlen() in write_schemata().
(Ilpo)
- Make run_benchmark() static and remove it from the header file.
(Reinette)
- Add Ilpo's reviewed-by tag to Patch 2/2.
- Patch messages and cover letter rewording.
Changelog v2:
- Change sprintf() to snprintf() in write_schemata().
- Redo write_schemata() with syscalls instead of stdio functions.
- Fix typos and missing dots in patch messages.
- Branch printf attribute patch to a separate series.
[v1] https://lore.kernel.org/all/cover.1692880423.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1693213468.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1693575451.git.maciej.wieczor-retman@inte…
[v4] https://lore.kernel.org/all/cover.1695369120.git.maciej.wieczor-retman@inte…
[v5] https://lore.kernel.org/all/cover.1695975327.git.maciej.wieczor-retman@inte…
[v6] https://lore.kernel.org/all/cover.1696848653.git.maciej.wieczor-retman@inte…
[1] https://lore.kernel.org/all/20231002094813.6633-1-ilpo.jarvinen@linux.intel…
[2] https://lore.kernel.org/all/20230904095339.11321-1-ilpo.jarvinen@linux.inte…
Maciej Wieczor-Retman (2):
selftests/resctrl: Fix schemata write error check
selftests/resctrl: Move run_benchmark() to a more fitting file
tools/testing/selftests/resctrl/resctrl.h | 1 -
tools/testing/selftests/resctrl/resctrl_val.c | 50 ++++++++++
tools/testing/selftests/resctrl/resctrlfs.c | 93 ++++++-------------
3 files changed, 76 insertions(+), 68 deletions(-)
base-commit: f3d3a8b5cf771ed2c6692a457dbc17f389f97f53
--
2.42.0
Zero out the buffer for readlink() since readlink() does not append a
terminating null byte to the buffer.
Fixes: 833c12ce0f430 ("selftests/x86/lam: Add inherit test cases for linear-address masking")
Signed-off-by: Binbin Wu <binbin.wu(a)linux.intel.com>
---
tools/testing/selftests/x86/lam.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/x86/lam.c b/tools/testing/selftests/x86/lam.c
index eb0e46905bf9..9f06942a8e25 100644
--- a/tools/testing/selftests/x86/lam.c
+++ b/tools/testing/selftests/x86/lam.c
@@ -680,7 +680,7 @@ static int handle_execve(struct testcases *test)
perror("Fork failed.");
ret = 1;
} else if (pid == 0) {
- char path[PATH_MAX];
+ char path[PATH_MAX] = {0};
/* Set LAM mode in parent process */
if (set_lam(lam) != 0)
base-commit: ce9ecca0238b140b88f43859b211c9fdfd8e5b70
--
2.25.1
Write_schemata() uses fprintf() to write a bitmask into a schemata file
inside resctrl FS. It checks fprintf() return value but it doesn't check
fclose() return value. Error codes from fprintf() such as write errors,
are buffered and flushed back to the user only after fclose() is executed
which means any invalid bitmask can be written into the schemata file.
Rewrite write_schemata() to use syscalls instead of stdio file
operations to avoid the buffering.
The resctrlfs.c defines functions that interact with the resctrl FS
while resctrl_val.c defines functions that perform measurements on
the cache. Run_benchmark() fits logically into the second file before
resctrl_val() that uses it.
Move run_benchmark() from resctrlfs.c to resctrl_val.c and remove
redundant part of the kernel-doc comment. Make run_benchmark() static
and remove it from the header file.
Patch series is based on [1] which is based on [2] which are based on
kselftest next branch.
Changelog v7:
- Add label for non-empty schema error case to Patch 1/2. (Reinette)
- Add Reinette's reviewed-by tag to Patch 1/2.
Changelog v6:
- Align schema_len error checking with typical snprintf format.
(Reinette)
- Initialize schema string for early return eventuality. (Reinette)
Changelog v5:
- Add Ilpo's reviewed-by tag to Patch 1/2.
- Reword patch messages slightly.
- Add error check to schema_len variable.
Changelog v4:
- Change git signature from Wieczor-Retman Maciej to Maciej
Wieczor-Retman.
- Rebase onto [1] which is based on [2]. (Reinette)
- Add fcntl.h explicitly to provide glibc backward compatibility.
(Reinette)
Changelog v3:
- Use snprintf() return value instead of strlen() in write_schemata().
(Ilpo)
- Make run_benchmark() static and remove it from the header file.
(Reinette)
- Add Ilpo's reviewed-by tag to Patch 2/2.
- Patch messages and cover letter rewording.
Changelog v2:
- Change sprintf() to snprintf() in write_schemata().
- Redo write_schemata() with syscalls instead of stdio functions.
- Fix typos and missing dots in patch messages.
- Branch printf attribute patch to a separate series.
[v1] https://lore.kernel.org/all/cover.1692880423.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1693213468.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1693575451.git.maciej.wieczor-retman@inte…
[v4] https://lore.kernel.org/all/cover.1695369120.git.maciej.wieczor-retman@inte…
[v5] https://lore.kernel.org/all/cover.1695975327.git.maciej.wieczor-retman@inte…
[v6] https://lore.kernel.org/all/cover.1696848653.git.maciej.wieczor-retman@inte…
[1] https://lore.kernel.org/all/20231002094813.6633-1-ilpo.jarvinen@linux.intel…
[2] https://lore.kernel.org/all/20230904095339.11321-1-ilpo.jarvinen@linux.inte…
Maciej Wieczor-Retman (2):
selftests/resctrl: Fix schemata write error check
selftests/resctrl: Move run_benchmark() to a more fitting file
tools/testing/selftests/resctrl/resctrl.h | 1 -
tools/testing/selftests/resctrl/resctrl_val.c | 50 ++++++++++
tools/testing/selftests/resctrl/resctrlfs.c | 93 ++++++-------------
3 files changed, 76 insertions(+), 68 deletions(-)
--
2.42.0
Write_schemata() uses fprintf() to write a bitmask into a schemata file
inside resctrl FS. It checks fprintf() return value but it doesn't check
fclose() return value. Error codes from fprintf() such as write errors,
are buffered and flushed back to the user only after fclose() is executed
which means any invalid bitmask can be written into the schemata file.
Rewrite write_schemata() to use syscalls instead of stdio file
operations to avoid the buffering.
The resctrlfs.c defines functions that interact with the resctrl FS
while resctrl_val.c defines functions that perform measurements on
the cache. Run_benchmark() fits logically into the second file before
resctrl_val() that uses it.
Move run_benchmark() from resctrlfs.c to resctrl_val.c and remove
redundant part of the kernel-doc comment. Make run_benchmark() static
and remove it from the header file.
Patch series is based on [1] which is based on [2] which are based on
kselftest next branch.
Changelog v6:
- Align schema_len error checking with typical snprintf format.
(Reinette)
- Initialize schema string for early return eventuality. (Reinette)
Changelog v5:
- Add Ilpo's reviewed-by tag to Patch 1/2.
- Reword patch messages slightly.
- Add error check to schema_len variable.
Changelog v4:
- Change git signature from Wieczor-Retman Maciej to Maciej
Wieczor-Retman.
- Rebase onto [1] which is based on [2]. (Reinette)
- Add fcntl.h explicitly to provide glibc backward compatibility.
(Reinette)
Changelog v3:
- Use snprintf() return value instead of strlen() in write_schemata().
(Ilpo)
- Make run_benchmark() static and remove it from the header file.
(Reinette)
- Add Ilpo's reviewed-by tag to Patch 2/2.
- Patch messages and cover letter rewording.
Changelog v2:
- Change sprintf() to snprintf() in write_schemata().
- Redo write_schemata() with syscalls instead of stdio functions.
- Fix typos and missing dots in patch messages.
- Branch printf attribute patch to a separate series.
[v1] https://lore.kernel.org/all/cover.1692880423.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1693213468.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1693575451.git.maciej.wieczor-retman@inte…
[v4] https://lore.kernel.org/all/cover.1695369120.git.maciej.wieczor-retman@inte…
[v5] https://lore.kernel.org/all/cover.1695975327.git.maciej.wieczor-retman@inte…
[1] https://lore.kernel.org/all/20231002094813.6633-1-ilpo.jarvinen@linux.intel…
[2] https://lore.kernel.org/all/20230904095339.11321-1-ilpo.jarvinen@linux.inte…
Maciej Wieczor-Retman (2):
selftests/resctrl: Fix schemata write error check
selftests/resctrl: Move run_benchmark() to a more fitting file
tools/testing/selftests/resctrl/resctrl.h | 1 -
tools/testing/selftests/resctrl/resctrl_val.c | 50 +++++++++++
tools/testing/selftests/resctrl/resctrlfs.c | 88 +++++--------------
3 files changed, 73 insertions(+), 66 deletions(-)
--
2.42.0
Kselftest.h declares many variadic functions that can print some
formatted message while also executing selftest logic. These
declarations don't have any compiler mechanism to verify if passed
arguments are valid in comparison with format specifiers used in
printf() calls.
Attribute addition can make debugging easier, the code more consistent
and prevent mismatched or missing variables.
Add a __printf() macro that validates types of variables passed to the
format string. The macro is similarly used in other tools in the kernel.
Add __printf() attributes to function definitions inside kselftest.h that
use printing.
Adding the __printf() macro exposes some mismatches in format strings
across different selftests.
Fix the mismatched format specifiers in multiple tests.
Series is based on kselftests next branch.
Changelog v4:
- Fix patch 1/8 subject typo.
- Add Reinette's reviewed-by tags.
- Rebased onto updated kselftests next branch.
Changelog v3:
- Changed git signature from Wieczor-Retman Maciej to Maciej
Wieczor-Retman.
- Added one review tag.
- Rebased onto updated kselftests next branch.
Changelog v2:
- Add review and fixes tags to patches.
- Add two patches with mismatch fixes.
- Fix missed attribute in selftests/kvm. (Andrew)
- Fix previously missed issues in selftests/mm (Ilpo)
[v3] https://lore.kernel.org/all/cover.1695373131.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1693829810.git.maciej.wieczor-retman@inte…
[v1] https://lore.kernel.org/all/cover.1693216959.git.maciej.wieczor-retman@inte…
Maciej Wieczor-Retman (8):
selftests: Add printf attribute to kselftest prints
selftests/cachestat: Fix print_cachestat format
selftests/openat2: Fix wrong format specifier
selftests/pidfd: Fix ksft print formats
selftests/sigaltstack: Fix wrong format specifier
selftests/kvm: Replace attribute with macro
selftests/mm: Substitute attribute with a macro
selftests/resctrl: Fix wrong format specifier
.../selftests/cachestat/test_cachestat.c | 2 +-
tools/testing/selftests/kselftest.h | 18 ++++++++++--------
.../testing/selftests/kvm/include/test_util.h | 8 ++++----
tools/testing/selftests/mm/mremap_test.c | 2 +-
tools/testing/selftests/mm/pkey-helpers.h | 2 +-
tools/testing/selftests/openat2/openat2_test.c | 2 +-
.../selftests/pidfd/pidfd_fdinfo_test.c | 2 +-
tools/testing/selftests/pidfd/pidfd_test.c | 12 ++++++------
tools/testing/selftests/resctrl/cache.c | 2 +-
tools/testing/selftests/sigaltstack/sas.c | 2 +-
10 files changed, 27 insertions(+), 25 deletions(-)
base-commit: f1020c687153609f246f3314db5b74821025c185
--
2.42.0
PASID (Process Address Space ID) is a PCIe extension to tag the DMA
transactions out of a physical device, and most modern IOMMU hardware
have supported PASID granular address translation. So a PASID-capable
devices can be attached to multiple hwpts (a.k.a. domains), each attachment
is tagged with a PASID.
This series first adds a missing iommu API to replace domain for a pasid,
then adds iommufd APIs for device drivers to attach/replace/detach pasid
to/from hwpt per userspace's request, and adds selftest to validate the
iommufd APIs.
pasid attach/replace is mandatory on Intel VT-d given the PASID table
locates in the physical address space hence must be managed by the kernel,
both for supporting vSVA and coming SIOV. But it's optional on ARM/AMD
which allow configuring the PASID/CD table either in host physical address space
or nested on top of an GPA address space. This series only add VT-d support
as the minimal requirement.
Complete code can be found in below link:
https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
Regards,
Yi Liu
Kevin Tian (1):
iommufd: Support attach/replace hwpt per pasid
Lu Baolu (2):
iommu: Introduce a replace API for device pasid
iommu/vt-d: Add set_dev_pasid callback for nested domain
Yi Liu (5):
iommufd: replace attach_fn with a structure
iommufd/selftest: Add set_dev_pasid and remove_dev_pasid in mock iommu
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add coverage for iommufd pasid attach/detach
drivers/iommu/intel/nested.c | 47 +++++
drivers/iommu/iommu-priv.h | 2 +
drivers/iommu/iommu.c | 73 ++++++--
drivers/iommu/iommufd/Makefile | 1 +
drivers/iommu/iommufd/device.c | 42 +++--
drivers/iommu/iommufd/iommufd_private.h | 16 ++
drivers/iommu/iommufd/iommufd_test.h | 24 +++
drivers/iommu/iommufd/pasid.c | 152 ++++++++++++++++
drivers/iommu/iommufd/selftest.c | 158 ++++++++++++++--
include/linux/iommufd.h | 6 +
tools/testing/selftests/iommu/iommufd.c | 172 ++++++++++++++++++
.../selftests/iommu/iommufd_fail_nth.c | 28 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 78 ++++++++
13 files changed, 756 insertions(+), 43 deletions(-)
create mode 100644 drivers/iommu/iommufd/pasid.c
--
2.34.1
On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
>
> virtio-net have two usage of hashes: one is RSS and another is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce the code to compute hashes to the kernel in order to overcome
> thse challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but it makes
> little sense to allow to implement different hashing algorithms with
> eBPF since the hash value reported by virtio-net is strictly defined by
> the specification.
>
> The hash value already stored in sk_buff is not used and computed
> independently since it may have been computed in a way not conformant
> with the specification.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
> @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> }
>
> if (vnet_hdr_sz) {
> - struct virtio_net_hdr gso;
> + union {
> + struct virtio_net_hdr hdr;
> + struct virtio_net_hdr_v1_hash v1_hash_hdr;
> + } hdr;
> + int ret;
>
> if (iov_iter_count(iter) < vnet_hdr_sz)
> return -EINVAL;
>
> - if (virtio_net_hdr_from_skb(skb, &gso,
> - tun_is_little_endian(tun), true,
> - vlan_hlen)) {
> + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) &&
> + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) &&
> + skb->tun_vnet_hash) {
Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of
the set hash ioctl failing otherwise?
Such checks should be limited to control path where possible
> + vnet_hdr_content_sz = sizeof(hdr.v1_hash_hdr);
> + ret = virtio_net_hdr_v1_hash_from_skb(skb,
> + &hdr.v1_hash_hdr,
> + true,
> + vlan_hlen,
> + &vnet_hash);
> + } else {
> + vnet_hdr_content_sz = sizeof(hdr.hdr);
> + ret = virtio_net_hdr_from_skb(skb, &hdr.hdr,
> + tun_is_little_endian(tun),
> + true, vlan_hlen);
> + }
> +
Fix four issues with resctrl selftests.
The signal handling fix became necessary after the mount/umount fixes
and the uninitialized member bug was discovered during the review.
The other two came up when I ran resctrl selftests across the server
fleet in our lab to validate the upcoming CAT test rewrite (the rewrite
is not part of this series).
These are developed and should apply cleanly at least on top the
benchmark cleanup series (might apply cleanly also w/o the benchmark
series, I didn't test).
v4:
- Use func(void) for functions taking no arguments
- Correct Fixes tag formatting
v3:
- Add fix to uninitialized sa_flags
- Handle ksft_exit_fail_msg() in per test functions
- Make signal handler register fails to also exit
- Improve changelogs
v2:
- Include patch to move _GNU_SOURCE to Makefile to allow normal #include
placement
- Rework the signal register/unregister into patch to use helpers
- Fixed incorrect function parameter description
- Use return !!res to avoid confusing implicit boolean conversion
- Improve MBA/MBM success bound patch's changelog
- Tweak Cc: stable dependencies (make it a chain).
Ilpo Järvinen (7):
selftests/resctrl: Fix uninitialized .sa_flags
selftests/resctrl: Extend signal handler coverage to unmount on
receiving signal
selftests/resctrl: Remove duplicate feature check from CMT test
selftests/resctrl: Move _GNU_SOURCE define into Makefile
selftests/resctrl: Refactor feature check to use resource and feature
name
selftests/resctrl: Fix feature checks
selftests/resctrl: Reduce failures due to outliers in MBA/MBM tests
tools/testing/selftests/resctrl/Makefile | 2 +-
tools/testing/selftests/resctrl/cat_test.c | 8 --
tools/testing/selftests/resctrl/cmt_test.c | 3 -
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 2 +-
tools/testing/selftests/resctrl/resctrl.h | 7 +-
.../testing/selftests/resctrl/resctrl_tests.c | 82 ++++++++++++-------
tools/testing/selftests/resctrl/resctrl_val.c | 26 +++---
tools/testing/selftests/resctrl/resctrlfs.c | 69 ++++++----------
9 files changed, 97 insertions(+), 104 deletions(-)
--
2.30.2
Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
Deferral approach
=================
Storing each and every callback, like a secondary call_single_queue turned out
to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.
Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.
What about IPIs whose callback take a parameter, you may ask?
Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.
This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).
Kernel entry vs execution of the deferred operation
===================================================
There is a non-zero length of code that is executed upon kernel entry before the
deferred operation can be itself executed (i.e. before we start getting into
context_tracking.c proper).
This means one must take extra care to what can happen in the early entry code,
and that <bad things> cannot happen. For instance, we really don't want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core().
Patches
=======
o Patches 1-9 have been submitted separately and are included for the sake of
testing
o Patches 10-14 focus on having objtool detect problematic static key usage in
early entry
o Patch 15 adds the infrastructure for IPI deferral.
o Patches 16-17 add some RCU testing infrastructure
o Patch 18 adds text_poke() IPI deferral.
o Patches 19-20 add vunmap() flush_tlb_kernel_range() IPI deferral
These ones I'm a lot less confident about, mostly due to lacking
instrumentation/verification.
The actual deferred callback is also incomplete as it's not properly noinstr:
vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section
and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I
have *no idea* what a sane implementation would be for Xen so I haven't
touched that yet.
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v2
Testing
=======
Note: this is a different machine than used for v1, because that machine decided
to act difficult.
Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 30 minutes.
v6.5-rc1 (+ cpumask filtering patches):
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
338 callback=generic_smp_call_function_single_interrupt+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
9207 func=do_flush_tlb_all
1116 func=do_sync_core
62 func=do_kernel_range_flush
3 func=nohz_full_kick_func
v6.5-rc1 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
2 callback=generic_smp_call_function_single_interrupt+0x0
# These are the different CSD's that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
2 func=nohz_full_kick_func
The incriminating IPIs are all gone, but note that on the machine I used to test
v1 there were still some do_flush_tlb_all() IPIs caused by
pcpu_balance_workfn(), since only vmalloc is affected by the deferral
mechanism.
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for his guidance regarding objtool and hinting at the
.data..ro_after_init section.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
Revisions
=========
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
The new TREE11 case with a 2-bit dynticks counter seems to pass when ran
against this series.
o Fixed flush_tlb_kernel_range_deferrable() definition
Peter Zijlstra (1):
jump_label,module: Don't alloc static_key_mod for __ro_after_init keys
Valentin Schneider (19):
tracing/filters: Dynamically allocate filter_pred.regex
tracing/filters: Enable filtering a cpumask field by another cpumask
tracing/filters: Enable filtering a scalar field by a cpumask
tracing/filters: Enable filtering the CPU common field by a cpumask
tracing/filters: Optimise cpumask vs cpumask filtering when user mask
is a single CPU
tracing/filters: Optimise scalar vs cpumask filtering when the user
mask is a single CPU
tracing/filters: Optimise CPU vs cpumask filtering when the user mask
is a single CPU
tracing/filters: Further optimise scalar vs cpumask comparison
tracing/filters: Document cpumask filtering
objtool: Flesh out warning related to pv_ops[] calls
objtool: Warn about non __ro_after_init static key usage in .noinstr
context_tracking: Make context_tracking_key __ro_after_init
x86/kvm: Make kvm_async_pf_enabled __ro_after_init
context-tracking: Introduce work deferral infrastructure
rcu: Make RCU dynticks counter size configurable
rcutorture: Add a test config to torture test low RCU_DYNTICKS width
context_tracking,x86: Defer kernel text patching IPIs
context_tracking,x86: Add infrastructure to defer kernel TLBI
x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
CPUs
Documentation/trace/events.rst | 14 +
arch/Kconfig | 9 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/context_tracking_work.h | 20 ++
arch/x86/include/asm/text-patching.h | 1 +
arch/x86/include/asm/tlbflush.h | 2 +
arch/x86/kernel/alternative.c | 24 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/kvm.c | 2 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/mm/tlb.c | 40 ++-
include/asm-generic/sections.h | 5 +
include/linux/context_tracking.h | 26 ++
include/linux/context_tracking_state.h | 65 +++-
include/linux/context_tracking_work.h | 28 ++
include/linux/jump_label.h | 1 +
include/linux/trace_events.h | 1 +
init/main.c | 1 +
kernel/context_tracking.c | 53 ++-
kernel/jump_label.c | 49 +++
kernel/rcu/Kconfig | 33 ++
kernel/time/Kconfig | 5 +
kernel/trace/trace_events_filter.c | 302 ++++++++++++++++--
mm/vmalloc.c | 19 +-
tools/objtool/check.c | 22 +-
tools/objtool/include/objtool/check.h | 1 +
tools/objtool/include/objtool/special.h | 2 +
tools/objtool/special.c | 3 +
.../selftests/rcutorture/configs/rcu/TREE11 | 19 ++
.../rcutorture/configs/rcu/TREE11.boot | 1 +
31 files changed, 695 insertions(+), 64 deletions(-)
create mode 100644 arch/x86/include/asm/context_tracking_work.h
create mode 100644 include/linux/context_tracking_work.h
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
--
2.31.1
Currently the bpf selftests are skipped by default, so is someone would
like to run the tests one would need to run:
$ make TARGETS=bpf SKIP_TARGETS="" kselftest
To overwrite the SKIP_TARGETS that defines bpf by default. Also,
following the BPF instructions[1], to run the bpf selftests one would
need to enter in the tools/testing/selftests/bpf/ directory, and then
run make, which is not the standard way to run selftests per it's
documentation.
For the reasons above stop mentioning bpf in the kselftests as examples
of how to run a test suite.
[1]: Documentation/bpf/bpf_devel_QA.rst
Signed-off-by: Marcos Paulo de Souza <mpdesouza(a)suse.com>
---
Documentation/dev-tools/kselftest.rst | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/Documentation/dev-tools/kselftest.rst b/Documentation/dev-tools/kselftest.rst
index deede972f254..ab376b316c36 100644
--- a/Documentation/dev-tools/kselftest.rst
+++ b/Documentation/dev-tools/kselftest.rst
@@ -112,7 +112,7 @@ You can specify multiple tests to skip::
You can also specify a restricted list of tests to run together with a
dedicated skiplist::
- $ make TARGETS="bpf breakpoints size timers" SKIP_TARGETS=bpf kselftest
+ $ make TARGETS="breakpoints size timers" SKIP_TARGETS=size kselftest
See the top-level tools/testing/selftests/Makefile for the list of all
possible targets.
@@ -165,7 +165,7 @@ To see the list of available tests, the `-l` option can be used::
The `-c` option can be used to run all the tests from a test collection, or
the `-t` option for specific single tests. Either can be used multiple times::
- $ ./run_kselftest.sh -c bpf -c seccomp -t timers:posix_timers -t timer:nanosleep
+ $ ./run_kselftest.sh -c size -c seccomp -t timers:posix_timers -t timer:nanosleep
For other features see the script usage output, seen with the `-h` option.
@@ -210,7 +210,7 @@ option is supported, such as::
tests by using variables specified in `Running a subset of selftests`_
section::
- $ make -C tools/testing/selftests gen_tar TARGETS="bpf" FORMAT=.xz
+ $ make -C tools/testing/selftests gen_tar TARGETS="size" FORMAT=.xz
.. _tar's auto-compress: https://www.gnu.org/software/tar/manual/html_node/gzip.html#auto_002dcompre…
--
2.42.0
The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.
When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations. The current GCS pointer
can not be directly written to by userspace. When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match. GCS operations may only be
performed on GCS pages, a data abort is generated if they are not.
The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.
This series implements support for use of GCS by userspace, along with
support for use of GCS within KVM guests. It does not enable use of GCS
by either EL1 or EL2, this will be implemented separately. Executables
are started without GCS and must use a prctl() to enable it, it is
expected that this will be done very early in application execution by
the dynamic linker or other startup code. For dynamic linking this will
be done by checking that everything in the executable is marked as GCS
compatible.
x86 has an equivalent feature called shadow stacks, this series depends
on the x86 patches for generic memory management support for the new
guarded/shadow stack page type and shares APIs as much as possible. As
there has been extensive discussion with the wider community around the
ABI for shadow stacks I have as far as practical kept implementation
decisions close to those for x86, anticipating that review would lead to
similar conclusions in the absence of strong reasoning for divergence.
The main divergence I am concious of is that x86 allows shadow stack to
be enabled and disabled repeatedly, freeing the shadow stack for the
thread whenever disabled, while this implementation keeps the GCS
allocated after disable but refuses to reenable it. This is to avoid
races with things actively walking the GCS during a disable, we do
anticipate that some systems will wish to disable GCS at runtime but are
not aware of any demand for subsequently reenabling it.
x86 uses an arch_prctl() to manage enable and disable, since only x86
and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a
patch set for the equivalent RISC-V zisslpcfi feature which I initially
adopted fairly directly but following review feedback has been revised
quite a bit.
There is an open issue with support for CRIU, on x86 this required the
ability to set the GCS mode via ptrace. This series supports
configuring mode bits other than enable/disable via ptrace but it needs
to be confirmed if this is sufficient.
There's a few bits where I'm not convinced with where I've placed
things, in particular the GCS write operation is in the GCS header not
in uaccess.h, I wasn't sure what was clearest there and am probably too
close to the code to have a clear opinion. The reporting of GCS in
/proc/PID/smaps is also a bit awkward.
The series depends on the x86 shadow stack support:
https://lore.kernel.org/lkml/20230227222957.24501-1-rick.p.edgecombe@intel.…
I've rebased this onto v6.5-rc4 but not included it in the series in
order to avoid confusion with Rick's work and cut down the size of the
series, you can see the branch at:
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs
[1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
Pending feedback from Catalin:
- Use clone3() paramaters to size/place the GCS.
- Switch copy_to_user_gcs() to be put_user_gcs().
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated
anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org
Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors
or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org
Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
---
Mark Brown (38):
arm64/mm: Restructure arch_validate_flags() for extensibility
prctl: arch-agnostic prctl for shadow stack
mman: Add map_shadow_stack() flags
arm64: Document boot requirements for Guarded Control Stacks
arm64/gcs: Document the ABI for Guarded Control Stacks
arm64/sysreg: Add new system registers for GCS
arm64/sysreg: Add definitions for architected GCS caps
arm64/gcs: Add manual encodings of GCS instructions
arm64/gcs: Provide copy_to_user_gcs()
arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
arm64/mm: Allocate PIE slots for EL0 guarded control stack
mm: Define VM_SHADOW_STACK for arm64 when we support GCS
arm64/mm: Map pages for guarded control stack
KVM: arm64: Manage GCS registers for guests
arm64/gcs: Allow GCS usage at EL0 and EL1
arm64/idreg: Add overrride for GCS
arm64/hwcap: Add hwcap for GCS
arm64/traps: Handle GCS exceptions
arm64/mm: Handle GCS data aborts
arm64/gcs: Context switch GCS state for EL0
arm64/gcs: Allocate a new GCS for threads with GCS enabled
arm64/gcs: Implement shadow stack prctl() interface
arm64/mm: Implement map_shadow_stack()
arm64/signal: Set up and restore the GCS context for signal handlers
arm64/signal: Expose GCS state in signal frames
arm64/ptrace: Expose GCS via ptrace and core files
arm64: Add Kconfig for Guarded Control Stack (GCS)
kselftest/arm64: Verify the GCS hwcap
kselftest/arm64: Add GCS as a detected feature in the signal tests
kselftest/arm64: Add framework support for GCS to signal handling tests
kselftest/arm64: Allow signals tests to specify an expected si_code
kselftest/arm64: Always run signals tests with GCS enabled
kselftest/arm64: Add very basic GCS test program
kselftest/arm64: Add a GCS test program built with the system libc
kselftest/arm64: Add test coverage for GCS mode locking
selftests/arm64: Add GCS signal tests
kselftest/arm64: Add a GCS stress test
kselftest/arm64: Enable GCS for the FP stress tests
Documentation/admin-guide/kernel-parameters.txt | 6 +
Documentation/arch/arm64/booting.rst | 22 +
Documentation/arch/arm64/elf_hwcaps.rst | 3 +
Documentation/arch/arm64/gcs.rst | 233 +++++++
Documentation/arch/arm64/index.rst | 1 +
Documentation/filesystems/proc.rst | 2 +-
arch/arm64/Kconfig | 19 +
arch/arm64/include/asm/cpufeature.h | 6 +
arch/arm64/include/asm/el2_setup.h | 17 +
arch/arm64/include/asm/esr.h | 28 +-
arch/arm64/include/asm/exception.h | 2 +
arch/arm64/include/asm/gcs.h | 106 +++
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/asm/kvm_arm.h | 4 +-
arch/arm64/include/asm/kvm_host.h | 12 +
arch/arm64/include/asm/mman.h | 23 +-
arch/arm64/include/asm/pgtable-prot.h | 14 +-
arch/arm64/include/asm/processor.h | 7 +
arch/arm64/include/asm/sysreg.h | 20 +
arch/arm64/include/asm/uaccess.h | 42 ++
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/ptrace.h | 8 +
arch/arm64/include/uapi/asm/sigcontext.h | 9 +
arch/arm64/kernel/cpufeature.c | 19 +
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/entry-common.c | 23 +
arch/arm64/kernel/idreg-override.c | 2 +
arch/arm64/kernel/process.c | 92 +++
arch/arm64/kernel/ptrace.c | 59 ++
arch/arm64/kernel/signal.c | 237 ++++++-
arch/arm64/kernel/traps.c | 11 +
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 17 +
arch/arm64/kvm/sys_regs.c | 22 +
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/fault.c | 79 ++-
arch/arm64/mm/gcs.c | 228 +++++++
arch/arm64/mm/mmap.c | 13 +-
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 55 ++
arch/x86/include/uapi/asm/mman.h | 3 -
fs/proc/task_mmu.c | 3 +
include/linux/mm.h | 16 +-
include/uapi/asm-generic/mman.h | 4 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 22 +
kernel/sys.c | 30 +
tools/testing/selftests/arm64/Makefile | 2 +-
tools/testing/selftests/arm64/abi/hwcap.c | 19 +
tools/testing/selftests/arm64/fp/assembler.h | 15 +
tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 +
tools/testing/selftests/arm64/fp/sve-test.S | 2 +
tools/testing/selftests/arm64/fp/za-test.S | 2 +
tools/testing/selftests/arm64/fp/zt-test.S | 2 +
tools/testing/selftests/arm64/gcs/.gitignore | 5 +
tools/testing/selftests/arm64/gcs/Makefile | 24 +
tools/testing/selftests/arm64/gcs/asm-offsets.h | 0
tools/testing/selftests/arm64/gcs/basic-gcs.c | 356 ++++++++++
tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++
.../selftests/arm64/gcs/gcs-stress-thread.S | 311 +++++++++
tools/testing/selftests/arm64/gcs/gcs-stress.c | 532 +++++++++++++++
tools/testing/selftests/arm64/gcs/gcs-util.h | 100 +++
tools/testing/selftests/arm64/gcs/libc-gcs.c | 742 +++++++++++++++++++++
tools/testing/selftests/arm64/signal/.gitignore | 1 +
.../testing/selftests/arm64/signal/test_signals.c | 17 +-
.../testing/selftests/arm64/signal/test_signals.h | 6 +
.../selftests/arm64/signal/test_signals_utils.c | 32 +-
.../selftests/arm64/signal/test_signals_utils.h | 39 ++
.../arm64/signal/testcases/gcs_exception_fault.c | 59 ++
.../selftests/arm64/signal/testcases/gcs_frame.c | 78 +++
.../arm64/signal/testcases/gcs_write_fault.c | 67 ++
.../selftests/arm64/signal/testcases/testcases.c | 7 +
.../selftests/arm64/signal/testcases/testcases.h | 1 +
73 files changed, 4110 insertions(+), 41 deletions(-)
---
base-commit: 6465e260f48790807eef06b583b38ca9789b6072
change-id: 20230303-arm64-gcs-e311ab0d8729
Best regards,
--
Mark Brown <broonie(a)kernel.org>
On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
>
> tun_vnet_hash can use this flag to indicate it stored virtio-net hash
> cache to cb.
>
> Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
> ---
> include/linux/skbuff.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 4174c4b82d13..e638f157c13c 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -837,6 +837,7 @@ typedef unsigned char *sk_buff_data_t;
> * @truesize: Buffer size
> * @users: User count - see {datagram,tcp}.c
> * @extensions: allocated extensions, valid if active_extensions is nonzero
> + * @tun_vnet_hash: tun stored virtio-net hash cache to cb
> */
>
> struct sk_buff {
> @@ -989,6 +990,7 @@ struct sk_buff {
> #if IS_ENABLED(CONFIG_IP_SCTP)
> __u8 csum_not_inet:1;
> #endif
> + __u8 tun_vnet_hash:1;
sk_buff space is very limited.
No need to extend it, especially for code that stays within a single
subsystem (tun).
To a lesser extent the same point applies to the qdisc_skb_cb.
On Sun, Oct 8, 2023 at 7:21 AM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
>
> virtio-net have two usage of hashes: one is RSS and another is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce the code to compute hashes to the kernel in order to overcome
> thse challenges.
>
> An alternative solution is to extend the eBPF steering program so that it
> will be able to report to the userspace, but it makes little sense to
> allow to implement different hashing algorithms with eBPF since the hash
> value reported by virtio-net is strictly defined by the specification.
But using the existing BPF steering may have the benefit of requiring
a lot less new code.
There is ample precedence for BPF programs that work this way. The
flow dissector comes to mind.
The test will not work for systems with pagesize != 4096 like aarch64
and some others.
Other testcases are already testing the same functionality:
* auxv_AT_UID tests getauxval() in general.
* test_getpagesize() tests pagesize() which directly calls
getauxval(AT_PAGESZ).
Fixes: 48967b73f8fe ("selftests/nolibc: add testcases for startup code")
Cc: stable(a)vger.kernel.org
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Note:
This should probably also make it into 6.6.
---
tools/testing/selftests/nolibc/nolibc-test.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c
index a3ee4496bf0a..7e3936c182dc 100644
--- a/tools/testing/selftests/nolibc/nolibc-test.c
+++ b/tools/testing/selftests/nolibc/nolibc-test.c
@@ -630,7 +630,6 @@ int run_startup(int min, int max)
CASE_TEST(environ_HOME); EXPECT_PTRNZ(1, getenv("HOME")); break;
CASE_TEST(auxv_addr); EXPECT_PTRGT(test_auxv != (void *)-1, test_auxv, brk); break;
CASE_TEST(auxv_AT_UID); EXPECT_EQ(1, getauxval(AT_UID), getuid()); break;
- CASE_TEST(auxv_AT_PAGESZ); EXPECT_GE(1, getauxval(AT_PAGESZ), 4096); break;
case __LINE__:
return ret; /* must be last */
/* note: do not set any defaults so as to permit holes above */
---
base-commit: ab663cc32912914258bc8a2fbd0e753f552ee9d8
change-id: 20231007-nolibc-auxval-pagesz-05f5ff79c4c4
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
The indentation of parameterized tests messages is currently broken in kunit.
Try to fix that by introducing a test level attribute, that will be increased
during nested parameterized tests execution, and use it to generate correct
indent at the runtime when printing message or writing them to the log.
Also improve kunit by providing test plan for the parameterized tests.
Cc: David Gow <davidgow(a)google.com>
Cc: Rae Moar <rmoar(a)google.com>
Michal Wajdeczko (4):
kunit: Drop redundant text from suite init failure message
kunit: Fix indentation level of suite messages
kunit: Fix indentation of parameterized tests messages
kunit: Prepare test plan for parameterized subtests
include/kunit/test.h | 25 ++++++++++++--
lib/kunit/test.c | 79 +++++++++++++++++++++++---------------------
2 files changed, 65 insertions(+), 39 deletions(-)
--
2.25.1
Today we reset the suite counter as part of the suite cleanup,
called from the module exit callback, but it might not work that
well as one can try to collect results without unloading a previous
test (either unintentionally or due to dependencies).
For easy reproduction try to load the kunit-test.ko and then
collect and parse results from the kunit-example-test.ko load.
Parser will complain about mismatch of expected test number:
[ ] KTAP version 1
[ ] 1..1
[ ] # example: initializing suite
[ ] KTAP version 1
[ ] # Subtest: example
..
[ ] # example: pass:5 fail:0 skip:4 total:9
[ ] # Totals: pass:6 fail:0 skip:6 total:12
[ ] ok 7 example
[ ] [ERROR] Test: example: Expected test number 1 but found 7
[ ] ===================== [PASSED] example =====================
[ ] ============================================================
[ ] Testing complete. Ran 12 tests: passed: 6, skipped: 6, errors: 1
Since we are now printing suite test plan on every module load,
right before running suite tests, we should make sure that suite
counter will also start from 1. Easiest solution seems to be move
counter reset to the __kunit_test_suites_init() function.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko(a)intel.com>
Cc: David Gow <davidgow(a)google.com>
Cc: Rae Moar <rmoar(a)google.com>
---
lib/kunit/test.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index f2eb71f1a66c..9325d309ed82 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -670,6 +670,8 @@ int __kunit_test_suites_init(struct kunit_suite * const * const suites, int num_
return 0;
}
+ kunit_suite_counter = 1;
+
static_branch_inc(&kunit_running);
for (i = 0; i < num_suites; i++) {
@@ -696,8 +698,6 @@ void __kunit_test_suites_exit(struct kunit_suite **suites, int num_suites)
for (i = 0; i < num_suites; i++)
kunit_exit_suite(suites[i]);
-
- kunit_suite_counter = 1;
}
EXPORT_SYMBOL_GPL(__kunit_test_suites_exit);
--
2.25.1
With the startup code moved to C, implementing support for
constructors and deconstructors is fairly easy to implement.
Examples for code size impact:
text data bss dec hex filename
21837 104 88 22029 560d nolibc-test.before
22135 120 88 22343 5747 nolibc-test.after
21970 104 88 22162 5692 nolibc-test.after-only-crt.h-changes
The sections are defined by [0].
[0] https://refspecs.linuxfoundation.org/elf/gabi4+/ch5.dynamic.html
Signed-off-by: Thomas Weißschuh <linux(a)weissschuh.net>
---
Note:
This is only an RFC as I'm not 100% sure it belong into nolibc.
But at least the code is visible as an example.
Also it is one prerequisite for full ksefltest_harness.h support in
nolibc, should we want that.
---
tools/include/nolibc/crt.h | 23 ++++++++++++++++++++++-
tools/testing/selftests/nolibc/nolibc-test.c | 16 ++++++++++++++++
2 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/tools/include/nolibc/crt.h b/tools/include/nolibc/crt.h
index a5f33fef1672..c1176611d9a9 100644
--- a/tools/include/nolibc/crt.h
+++ b/tools/include/nolibc/crt.h
@@ -13,11 +13,22 @@ const unsigned long *_auxv __attribute__((weak));
static void __stack_chk_init(void);
static void exit(int);
+extern void (*const __preinit_array_start[])(void) __attribute__((weak));
+extern void (*const __preinit_array_end[])(void) __attribute__((weak));
+
+extern void (*const __init_array_start[])(void) __attribute__((weak));
+extern void (*const __init_array_end[])(void) __attribute__((weak));
+
+extern void (*const __fini_array_start[])(void) __attribute__((weak));
+extern void (*const __fini_array_end[])(void) __attribute__((weak));
+
void _start_c(long *sp)
{
long argc;
char **argv;
char **envp;
+ int exitcode;
+ void (* const *func)(void);
const unsigned long *auxv;
/* silence potential warning: conflicting types for 'main' */
int _nolibc_main(int, char **, char **) __asm__ ("main");
@@ -54,8 +65,18 @@ void _start_c(long *sp)
;
_auxv = auxv;
+ for (func = __preinit_array_start; func < __preinit_array_end; func++)
+ (*func)();
+ for (func = __init_array_start; func < __init_array_end; func++)
+ (*func)();
+
/* go to application */
- exit(_nolibc_main(argc, argv, envp));
+ exitcode = _nolibc_main(argc, argv, envp);
+
+ for (func = __fini_array_end - 1; func >= __fini_array_start; func--)
+ (*func)();
+
+ exit(exitcode);
}
#endif /* _NOLIBC_CRT_H */
diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c
index a3ee4496bf0a..f166b425613a 100644
--- a/tools/testing/selftests/nolibc/nolibc-test.c
+++ b/tools/testing/selftests/nolibc/nolibc-test.c
@@ -57,6 +57,9 @@ static int test_argc;
/* will be used by some test cases as readable file, please don't write it */
static const char *argv0;
+/* will be used by constructor tests */
+static int constructor_test_value;
+
/* definition of a series of tests */
struct test {
const char *name; /* test name */
@@ -594,6 +597,18 @@ int expect_strne(const char *expr, int llen, const char *cmp)
#define CASE_TEST(name) \
case __LINE__: llen += printf("%d %s", test, #name);
+__attribute__((constructor))
+static void constructor1(void)
+{
+ constructor_test_value = 1;
+}
+
+__attribute__((constructor))
+static void constructor2(void)
+{
+ constructor_test_value *= 2;
+}
+
int run_startup(int min, int max)
{
int test;
@@ -631,6 +646,7 @@ int run_startup(int min, int max)
CASE_TEST(auxv_addr); EXPECT_PTRGT(test_auxv != (void *)-1, test_auxv, brk); break;
CASE_TEST(auxv_AT_UID); EXPECT_EQ(1, getauxval(AT_UID), getuid()); break;
CASE_TEST(auxv_AT_PAGESZ); EXPECT_GE(1, getauxval(AT_PAGESZ), 4096); break;
+ CASE_TEST(constructor); EXPECT_EQ(1, constructor_test_value, 2); break;
case __LINE__:
return ret; /* must be last */
/* note: do not set any defaults so as to permit holes above */
---
base-commit: ab663cc32912914258bc8a2fbd0e753f552ee9d8
change-id: 20231005-nolibc-constructors-b2aebffe9b65
Best regards,
--
Thomas Weißschuh <linux(a)weissschuh.net>
The original order of cases in kunit_module_notify() is confusing and
misleading.
And the best practice is return the err code from
MODULE_STATE_COMING func.
And the test suits should be executed when notify MODULE_STATE_LIVE.
Jinjie Ruan (3):
kunit: Make the cases sequence more reasonable for
kunit_module_notify()
kunit: Return error from kunit_module_init()
kunit: Init and run test suites in the right state
lib/kunit/test.c | 37 +++++++++++++++++++++++--------------
1 file changed, 23 insertions(+), 14 deletions(-)
--
2.34.1
Attn: linux-kselftest(a)vger.kernel.org
Date: 03-10-2023
Subject: Letter of Intent (LOI) (03-10-2023)
This is a Letter of Intent concerning my Board of investors intent to fund some of your available investment projects.
If this is of any interest to you then kindly advise.
Yours Faithfully,
Ahmad R. Deeb
Head of Investment Team
This patch series introduces UFFDIO_REMAP feature to userfaultfd, which
has long been implemented and maintained by Andrea in his local tree [1],
but was not upstreamed due to lack of use cases where this approach would
be better than allocating a new page and copying the contents.
UFFDIO_COPY performs ~20% better than UFFDIO_REMAP when the application
needs pages to be allocated [2]. However, with UFFDIO_REMAP, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [3].
We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread’s completion time by using UFFDIO_REMAP vs. UFFDIO_COPY. This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads).
More details of the usecase are explained in [3].
Furthermore, UFFDIO_REMAP enables remapping swapped-out pages without
touching them within the same vma. Today, it can only be done by mremap,
however it forces splitting the vma.
Main changes since Andrea's last version [1]:
- Trivial translations from page to folio, mmap_sem to mmap_lock
- Replace pmd_trans_unstable() with pte_offset_map_nolock() and handle its
possible failure
- Move pte mapping into remap_pages_pte to allow for retries when source
page or anon_vma is contended. Since pte_offset_map_nolock() start RCU
read section, we can't block anymore after mapping a pte, so have to unmap
the ptesm do the locking and retry.
- Add and use anon_vma_trylock_write() to avoid blocking while in RCU
read section.
- Accommodate changes in mmu_notifier_range_init() API, switch to
mmu_notifier_invalidate_range_start_nonblock() to avoid blocking while in
RCU read section.
- Open-code now removed __swp_swapcount()
- Replace pmd_read_atomic() with pmdp_get_lockless()
- Add new selftest for UFFDIO_REMAP
Changes since v1 [4]:
- add mmget_not_zero in userfaultfd_remap, per Jann Horn
- removed extern from function definitions, per Matthew Wilcox
- converted to folios in remap_pages_huge_pmd, per Matthew Wilcox
- use PageAnonExclusive in remap_pages_huge_pmd, per David Hildenbrand
- handle pgtable transfers between MMs, per Jann Horn
- ignore concurrent A/D pte bit changes, per Jann Horn
- split functions into smaller units, per David Hildenbrand
- test for folio_test_large in remap_anon_pte, per Matthew Wilcox
- use pte_swp_exclusive for swapcount check, per David Hildenbrand
- eliminated use of mmu_notifier_invalidate_range_start_nonblock,
per Jann Horn
- simplified THP alignment checks, per Jann Horn
- refactored the loop inside remap_pages, per Jann Horn
- additional clarifying comments, per Jann Horn
[1] https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc…
[2] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redha…
[3] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyj…
[4] https://lore.kernel.org/all/20230914152620.2743033-1-surenb@google.com/
Andrea Arcangeli (2):
userfaultfd: UFFDIO_REMAP: rmap preparation
userfaultfd: UFFDIO_REMAP uABI
Suren Baghdasaryan (1):
selftests/mm: add UFFDIO_REMAP ioctl test
fs/userfaultfd.c | 63 ++
include/linux/rmap.h | 5 +
include/linux/userfaultfd_k.h | 12 +
include/uapi/linux/userfaultfd.h | 22 +
mm/huge_memory.c | 130 ++++
mm/khugepaged.c | 3 +
mm/rmap.c | 13 +
mm/userfaultfd.c | 590 +++++++++++++++++++
tools/testing/selftests/mm/uffd-common.c | 41 +-
tools/testing/selftests/mm/uffd-common.h | 1 +
tools/testing/selftests/mm/uffd-unit-tests.c | 62 ++
11 files changed, 940 insertions(+), 2 deletions(-)
--
2.42.0.515.g380fc7ccd1-goog
The abi_test currently uses a long sized test value for enablement
checks. On LE this works fine, however, on BE this results in inaccurate
assert checks due to a bit being used and assuming it's value is the
same on both LE and BE.
Use int type for 32-bit values and long type for 64-bit values to ensure
appropriate behavior on both LE and BE.
Fixes: 60b1af8de8c1 ("tracing/user_events: Add ABI self-test")
Signed-off-by: Beau Belgrave <beaub(a)linux.microsoft.com>
---
V2 Changes:
Rebase to linux-kselftest/fixes.
tools/testing/selftests/user_events/abi_test.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/user_events/abi_test.c b/tools/testing/selftests/user_events/abi_test.c
index 8202f1327c39..aa297d3ad95e 100644
--- a/tools/testing/selftests/user_events/abi_test.c
+++ b/tools/testing/selftests/user_events/abi_test.c
@@ -47,7 +47,7 @@ static int change_event(bool enable)
return ret;
}
-static int reg_enable(long *enable, int size, int bit)
+static int reg_enable(void *enable, int size, int bit)
{
struct user_reg reg = {0};
int fd = open(data_file, O_RDWR);
@@ -69,7 +69,7 @@ static int reg_enable(long *enable, int size, int bit)
return ret;
}
-static int reg_disable(long *enable, int bit)
+static int reg_disable(void *enable, int bit)
{
struct user_unreg reg = {0};
int fd = open(data_file, O_RDWR);
@@ -90,7 +90,8 @@ static int reg_disable(long *enable, int bit)
}
FIXTURE(user) {
- long check;
+ int check;
+ long check_long;
bool umount;
};
@@ -99,6 +100,7 @@ FIXTURE_SETUP(user) {
change_event(false);
self->check = 0;
+ self->check_long = 0;
}
FIXTURE_TEARDOWN(user) {
@@ -136,9 +138,9 @@ TEST_F(user, bit_sizes) {
#if BITS_PER_LONG == 8
/* Allow 0-64 bits for 64-bit */
- ASSERT_EQ(0, reg_enable(&self->check, sizeof(long), 63));
- ASSERT_NE(0, reg_enable(&self->check, sizeof(long), 64));
- ASSERT_EQ(0, reg_disable(&self->check, 63));
+ ASSERT_EQ(0, reg_enable(&self->check_long, sizeof(long), 63));
+ ASSERT_NE(0, reg_enable(&self->check_long, sizeof(long), 64));
+ ASSERT_EQ(0, reg_disable(&self->check_long, 63));
#endif
/* Disallowed sizes (everything beside 4 and 8) */
base-commit: 6f874fa021dfc7bf37f4f37da3a5aaa41fe9c39c
--
2.34.1
All architectures should use a long aligned address passed to set_bit().
User processes can pass either a 32-bit or 64-bit sized value to be
updated when tracing is enabled when on a 64-bit kernel. Both cases are
ensured to be naturally aligned, however, that is not enough. The
address must be long aligned without affecting checks on the value
within the user process which require different adjustments for the bit
for little and big endian CPUs.
32 bit on 64 bit, even when properly long aligned, still require a 32 bit
offset to be done for BE. Due to this, it cannot be easily put into a
generic method.
The abi_test also used a long, which broke the test on 64-bit BE machines.
The change simply uses an int for 32-bit value checks and a long when on
64-bit kernels for 64-bit specific checks.
I've run these changes and self tests for user_events on ppc64 BE, x86_64
LE, and aarch64 LE. It'd be great to test this also on RISC-V, but I do
not have one.
Clément Léger originally put a patch together for the alignment issue, but
we uncovered more issues as we went further into the problem. Clément felt
my version was better [1] so I am sending this series out that addresses
the selftest, BE bit offset, and the alignment issue.
1. https://lore.kernel.org/linux-trace-kernel/713f4916-00ff-4a24-82d1-72884500…
Beau Belgrave (2):
tracing/user_events: Align set_bit() address for all archs
selftests/user_events: Fix abi_test for BE archs
kernel/trace/trace_events_user.c | 58 ++++++++++++++++---
.../testing/selftests/user_events/abi_test.c | 16 ++---
2 files changed, 60 insertions(+), 14 deletions(-)
base-commit: fc1653abba0d554aad80224e51bcad42b09895ed
--
2.34.1
Fix three issues with resctrl selftests.
The signal handling fix became necessary after the mount/umount fixes.
The other two came up when I ran resctrl selftests across the server
fleet in our lab to validate the upcoming CAT test rewrite (the rewrite
is not part of this series).
These are developed and should apply cleanly at least on top the
benchmark cleanup series (might apply cleanly also w/o the benchmark
series, I didn't test).
v2:
- Include patch to move _GNU_SOURCE to Makefile to allow normal #include
placement
- Rework the signal register/unregister into patch to use helpers
- Fixed incorrect function parameter description
- Use return !!res to avoid confusing implicit boolean conversion
- Improve MBA/MBM success bound patch's changelog
- Tweak Cc: stable dependencies (make it a chain).
Ilpo Järvinen (6):
selftests/resctrl: Extend signal handler coverage to unmount on
receiving signal
selftests/resctrl: Remove duplicate feature check from CMT test
selftests/resctrl: Move _GNU_SOURCE define into Makefile
selftests/resctrl: Refactor feature check to use resource and feature
name
selftests/resctrl: Fix feature checks
selftests/resctrl: Reduce failures due to outliers in MBA/MBM tests
tools/testing/selftests/resctrl/Makefile | 2 +-
tools/testing/selftests/resctrl/cat_test.c | 8 --
tools/testing/selftests/resctrl/cmt_test.c | 3 -
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 2 +-
tools/testing/selftests/resctrl/resctrl.h | 7 +-
.../testing/selftests/resctrl/resctrl_tests.c | 78 +++++++++++--------
tools/testing/selftests/resctrl/resctrl_val.c | 22 +++---
tools/testing/selftests/resctrl/resctrlfs.c | 69 +++++++---------
9 files changed, 88 insertions(+), 105 deletions(-)
--
2.30.2
This series extends KVM RISC-V to allow Guest/VM discover and use
conditional operations related ISA extensions (namely XVentanaCondOps
and Zicond).
To try these patches, use KVMTOOL from riscv_zbx_zicntr_smstateen_condops_v1
branch at: https://github.com/avpatel/kvmtool.git
These patches are based upon the latest riscv_kvm_queue and can also be
found in the riscv_kvm_condops_v1 branch at:
https://github.com/avpatel/linux.git
Anup Patel (7):
RISC-V: Detect XVentanaCondOps from ISA string
RISC-V: Detect Zicond from ISA string
RISC-V: KVM: Allow XVentanaCondOps extension for Guest/VM
RISC-V: KVM: Allow Zicond extension for Guest/VM
KVM: riscv: selftests: Add senvcfg register to get-reg-list test
KVM: riscv: selftests: Add smstateen registers to get-reg-list test
KVM: riscv: selftests: Add condops extensions to get-reg-list test
.../devicetree/bindings/riscv/extensions.yaml | 13 ++++++
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/uapi/asm/kvm.h | 2 +
arch/riscv/kernel/cpufeature.c | 2 +
arch/riscv/kvm/vcpu_onereg.c | 4 ++
.../selftests/kvm/riscv/get-reg-list.c | 41 +++++++++++++++++++
6 files changed, 64 insertions(+)
--
2.34.1
Add functionality to run built-in tests after boot by writing to a
debugfs file.
Add a new debugfs file labeled "run" for each test suite to use for
this purpose.
As an example, write to the file using the following:
echo "any string" > /sys/kernel/debugfs/kunit/<testsuite>/run
This will trigger the test suite to run and will print results to the
kernel log.
Note that what you "write" to the debugfs file will not be saved.
To guard against running tests concurrently with this feature, add a
mutex lock around running kunit. This supports the current practice of
not allowing tests to be run concurrently on the same kernel.
This functionality may not work for all tests.
This new functionality could be used to design a parameter
injection feature in the future.
Signed-off-by: Rae Moar <rmoar(a)google.com>
---
Changes since v1:
- Removed second patch as this problem has been fixed
- Added Documentation patch
- Made changes to work with new dynamically-extending log feature
Note that these patches now rely on (and are rebased on) the patch series:
https://lore.kernel.org/all/20230828104111.2394344-1-rf@opensource.cirrus.c…
lib/kunit/debugfs.c | 66 +++++++++++++++++++++++++++++++++++++++++++++
lib/kunit/test.c | 13 +++++++++
2 files changed, 79 insertions(+)
diff --git a/lib/kunit/debugfs.c b/lib/kunit/debugfs.c
index 270d185737e6..8c0a970321ce 100644
--- a/lib/kunit/debugfs.c
+++ b/lib/kunit/debugfs.c
@@ -8,12 +8,14 @@
#include <linux/module.h>
#include <kunit/test.h>
+#include <kunit/test-bug.h>
#include "string-stream.h"
#include "debugfs.h"
#define KUNIT_DEBUGFS_ROOT "kunit"
#define KUNIT_DEBUGFS_RESULTS "results"
+#define KUNIT_DEBUGFS_RUN "run"
/*
* Create a debugfs representation of test suites:
@@ -21,6 +23,8 @@
* Path Semantics
* /sys/kernel/debug/kunit/<testsuite>/results Show results of last run for
* testsuite
+ * /sys/kernel/debug/kunit/<testsuite>/run Write to this file to trigger
+ * testsuite to run
*
*/
@@ -99,6 +103,51 @@ static int debugfs_results_open(struct inode *inode, struct file *file)
return single_open(file, debugfs_print_results, suite);
}
+/*
+ * Print a usage message to the debugfs "run" file
+ * (/sys/kernel/debug/kunit/<testsuite>/run) if opened.
+ */
+static int debugfs_print_run(struct seq_file *seq, void *v)
+{
+ struct kunit_suite *suite = (struct kunit_suite *)seq->private;
+
+ seq_puts(seq, "Write to this file to trigger the test suite to run.\n");
+ seq_printf(seq, "usage: echo \"any string\" > /sys/kernel/debugfs/kunit/%s/run\n",
+ suite->name);
+ return 0;
+}
+
+/*
+ * The debugfs "run" file (/sys/kernel/debug/kunit/<testsuite>/run)
+ * contains no information. Write to the file to trigger the test suite
+ * to run.
+ */
+static int debugfs_run_open(struct inode *inode, struct file *file)
+{
+ struct kunit_suite *suite;
+
+ suite = (struct kunit_suite *)inode->i_private;
+
+ return single_open(file, debugfs_print_run, suite);
+}
+
+/*
+ * Trigger a test suite to run by writing to the suite's "run" debugfs
+ * file found at: /sys/kernel/debug/kunit/<testsuite>/run
+ *
+ * Note: what is written to this file will not be saved.
+ */
+static ssize_t debugfs_run(struct file *file,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ struct inode *f_inode = file->f_inode;
+ struct kunit_suite *suite = (struct kunit_suite *) f_inode->i_private;
+
+ __kunit_test_suites_init(&suite, 1);
+
+ return count;
+}
+
static const struct file_operations debugfs_results_fops = {
.open = debugfs_results_open,
.read = seq_read,
@@ -106,10 +155,23 @@ static const struct file_operations debugfs_results_fops = {
.release = debugfs_release,
};
+static const struct file_operations debugfs_run_fops = {
+ .open = debugfs_run_open,
+ .read = seq_read,
+ .write = debugfs_run,
+ .llseek = seq_lseek,
+ .release = debugfs_release,
+};
+
void kunit_debugfs_create_suite(struct kunit_suite *suite)
{
struct kunit_case *test_case;
+ if (suite->log) {
+ /* Clear the suite log that's leftover from a previous run. */
+ string_stream_clear(suite->log);
+ return;
+ }
/* Allocate logs before creating debugfs representation. */
suite->log = alloc_string_stream(GFP_KERNEL);
string_stream_set_append_newlines(suite->log, true);
@@ -124,6 +186,10 @@ void kunit_debugfs_create_suite(struct kunit_suite *suite)
debugfs_create_file(KUNIT_DEBUGFS_RESULTS, S_IFREG | 0444,
suite->debugfs,
suite, &debugfs_results_fops);
+
+ debugfs_create_file(KUNIT_DEBUGFS_RUN, S_IFREG | 0644,
+ suite->debugfs,
+ suite, &debugfs_run_fops);
}
void kunit_debugfs_destroy_suite(struct kunit_suite *suite)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 651cbda9f250..d376b886d72d 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -13,6 +13,7 @@
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
+#include <linux/mutex.h>
#include <linux/panic.h>
#include <linux/sched/debug.h>
#include <linux/sched.h>
@@ -22,6 +23,8 @@
#include "string-stream.h"
#include "try-catch-impl.h"
+static struct mutex kunit_run_lock;
+
/*
* Hook to fail the current test and print an error message to the log.
*/
@@ -668,6 +671,11 @@ int __kunit_test_suites_init(struct kunit_suite * const * const suites, int num_
return 0;
}
+ /* Use mutex lock to guard against running tests concurrently. */
+ if (mutex_lock_interruptible(&kunit_run_lock)) {
+ pr_err("kunit: test interrupted\n");
+ return -EINTR;
+ }
static_branch_inc(&kunit_running);
for (i = 0; i < num_suites; i++) {
@@ -676,6 +684,7 @@ int __kunit_test_suites_init(struct kunit_suite * const * const suites, int num_
}
static_branch_dec(&kunit_running);
+ mutex_unlock(&kunit_run_lock);
return 0;
}
EXPORT_SYMBOL_GPL(__kunit_test_suites_init);
@@ -836,6 +845,10 @@ static int __init kunit_init(void)
kunit_install_hooks();
kunit_debugfs_init();
+
+ /* Initialize lock to guard against running tests concurrently. */
+ mutex_init(&kunit_run_lock);
+
#ifdef CONFIG_MODULES
return register_module_notifier(&kunit_mod_nb);
#else
base-commit: b754593274e04fc840482a658b29791bc8f8b933
--
2.42.0.283.g2d96d420d3-goog
From: Björn Töpel <bjorn(a)rivosinc.com>
Commit 08d0ce30e0e4 ("riscv: Implement syscall wrappers") introduced
some regressions in libbpf, and the kselftests BPF suite, which are
fixed with these three patches.
Note that there's an outstanding fix [1] for ftrace syscall tracing
which is also a fallout from the commit above.
Björn
[1] https://lore.kernel.org/linux-riscv/20231003182407.32198-1-alexghiti@rivosi…
Alexandre Ghiti (1):
libbpf: Fix syscall access arguments on riscv
Björn Töpel (2):
selftests/bpf: Define SYS_PREFIX for riscv
selftests/bpf: Define SYS_NANOSLEEP_KPROBE_NAME for riscv
tools/lib/bpf/bpf_tracing.h | 2 --
tools/testing/selftests/bpf/progs/bpf_misc.h | 3 +++
tools/testing/selftests/bpf/test_progs.h | 2 ++
3 files changed, 5 insertions(+), 2 deletions(-)
base-commit: 9077fc228f09c9f975c498c55f5d2e882cd0da59
--
2.39.2
From: Björn Töpel <bjorn(a)rivosinc.com>
Yet another "more cross-building support for RISC-V" series.
An example how to invoke a gen_tar build:
| make ARCH=riscv CROSS_COMPILE=riscv64-linux-gnu- CC=riscv64-linux-gnu-gcc \
| HOSTCC=gcc O=/workspace/kbuild FORMAT= \
| SKIP_TARGETS="arm64 ia64 powerpc sparc64 x86 sgx" -j $(($(nproc)-1)) \
| -C tools/testing/selftests gen_tar
Björn
Björn Töpel (3):
selftests/bpf: Add cross-build support for urandom_read et al
selftests/bpf: Enable lld usage for RISC-V
selftests/bpf: Add uprobe_multi to gen_tar target
tools/testing/selftests/bpf/Makefile | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
base-commit: 2147c8d07e1abc8dfc3433ca18eed5295e230ede
--
2.39.2
Hi Linus,
Please pull the following Kselftest fixes update for Linux 6.6-rc5.
This kselftest fixes update for Linux 6.6-rc5 consists of one single
fix to Makefile to fix the incorrect TARGET name for uevent test.
diff is attached.
thanks.
-- Shuah
----------------------------------------------------------------
The following changes since commit 8ed99af4a266a3492d773b5d85c3f8e9f81254b6:
selftests/user_events: Fix to unmount tracefs when test created mount (2023-09-18 11:04:52 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux-kselftest-fixes-6.6-rc5
for you to fetch changes up to 6f874fa021dfc7bf37f4f37da3a5aaa41fe9c39c:
selftests: Fix wrong TARGET in kselftest top level Makefile (2023-09-26 18:47:37 -0600)
----------------------------------------------------------------
linux-kselftest-fixes-6.6-rc5
This kselftest fixes update for Linux 6.6-rc5 consists of one single
fix to Makefile to fix the incorrect TARGET name for uevent test.
----------------------------------------------------------------
Juntong Deng (1):
selftests: Fix wrong TARGET in kselftest top level Makefile
tools/testing/selftests/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------
user_events, tdx and dmabuf-heaps build a series of binaries that can be
safely ignored by git as it is done by other selftests.
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
---
Javier Carrasco (3):
selftests/user_events: add gitignore file
selftests/tdx: add gitignore file
selftests/dmabuf-heaps: add gitignore file
tools/testing/selftests/dmabuf-heaps/.gitignore | 1 +
tools/testing/selftests/tdx/.gitignore | 1 +
tools/testing/selftests/user_events/.gitignore | 4 ++++
3 files changed, 6 insertions(+)
---
base-commit: cbf3a2cb156a2c911d8f38d8247814b4c07f49a2
change-id: 20231004-topic-selftest_gitignore-3e82f4341001
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
From: Björn Töpel <bjorn(a)rivosinc.com>
Two minor changes to the kselftest-merge target:
1. Let builtin have presedence over modules when merging configs
2. Merge per-arch configs, if available
Björn
Björn Töpel (2):
kbuild: Let builtin have precedence over modules for kselftest-merge
kbuild: Merge per-arch config for kselftest-merge target
Makefile | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
base-commit: cbf3a2cb156a2c911d8f38d8247814b4c07f49a2
--
2.39.2
Szanowni Państwo!
W ramach nowej edycji programu Mój Prąd mogą otrzymać Państwo dofinansowanie na zakup i montaż fotowoltaiki i/lub magazynu energii. Maksymalna kwota dofinansowania wynosi 58 tys. zł.
Jako firma wyspecjalizowana w tym zakresie zajmiemy się Państwa wnioskiem o dofinansowanie oraz instalacją i serwisem dopasowanych do Państwa budynku paneli słonecznych.
Będę wdzięczny za informację czy są Państwo zainteresowani.
Pozdrawiam,
Kamil Lasek
Hi, I am sending this series on behalf of myself and Benjamin Tissoires. There
existed an initial n=3 patch series which was later expanded to n=4 and
is now back to n=3 with some fixes added in and rebased against
mainline.
This patch series aims to ensure that the hid/bpf selftests can be built
without errors.
Here's Benjamin's initial cover letter for context:
| These fixes have been triggered by [0]:
| basically, if you do not recompile the kernel first, and are
| running on an old kernel, vmlinux.h doesn't have the required
| symbols and the compilation fails.
|
| The tests will fail if you run them on that very same machine,
| of course, but the binary should compile.
|
| And while I was sorting out why it was failing, I realized I
| could do a couple of improvements on the Makefile.
|
| [0] https://lore.kernel.org/linux-input/56ba8125-2c6f-a9c9-d498-0ca1c153dcb2@re…
Changes from v1 -> v2:
- roll Justin's fix into patch 1/3
- add __attribute__((preserve_access_index)) (thanks Eduard)
- rebased onto mainline (2dde18cd1d8fac735875f2e4987f11817cc0bc2c)
- Link to v1: https://lore.kernel.org/all/20230825-wip-selftests-v1-0-c862769020a8@kernel…
Link: https://github.com/ClangBuiltLinux/linux/issues/1698
Link: https://github.com/ClangBuiltLinux/continuous-integration2/issues/61
---
Benjamin Tissoires (3):
selftests/hid: ensure we can compile the tests on kernels pre-6.3
selftests/hid: do not manually call headers_install
selftests/hid: force using our compiled libbpf headers
tools/testing/selftests/hid/Makefile | 10 ++---
tools/testing/selftests/hid/progs/hid.c | 3 --
.../testing/selftests/hid/progs/hid_bpf_helpers.h | 49 ++++++++++++++++++++++
3 files changed, 53 insertions(+), 9 deletions(-)
---
base-commit: 2dde18cd1d8fac735875f2e4987f11817cc0bc2c
change-id: 20230908-kselftest-09-08-56d7f4a8d5c4
Best regards,
--
Justin Stitt <justinstitt(a)google.com>
Write_schemata() uses fprintf() to write a bitmask into a schemata file
inside resctrl FS. It checks fprintf() return value but it doesn't check
fclose() return value. Error codes from fprintf() such as write errors,
are buffered and flushed back to the user only after fclose() is executed
which means any invalid bitmask can be written into the schemata file.
Rewrite write_schemata() to use syscalls instead of stdio file
operations to avoid the buffering.
The resctrlfs.c defines functions that interact with the resctrl FS
while resctrl_val.c defines functions that perform measurements on
the cache. Run_benchmark() fits logically into the second file before
resctrl_val() that uses it.
Move run_benchmark() from resctrlfs.c to resctrl_val.c and remove
redundant part of the kernel-doc comment. Make run_benchmark() static
and remove it from the header file.
Patch series is based on [1] which is based on [2] which are based on
kselftest next branch.
Changelog v5:
- Add Ilpo's reviewed-by tag to Patch 1/2.
- Reword patch messages slightly.
- Add error check to schema_len variable.
Changelog v4:
- Change git signature from Wieczor-Retman Maciej to Maciej
Wieczor-Retman.
- Rebase onto [1] which is based on [2]. (Reinette)
- Add fcntl.h explicitly to provide glibc backward compatibility.
(Reinette)
Changelog v3:
- Use snprintf() return value instead of strlen() in write_schemata().
(Ilpo)
- Make run_benchmark() static and remove it from the header file.
(Reinette)
- Add Ilpo's reviewed-by tag to Patch 2/2.
- Patch messages and cover letter rewording.
Changelog v2:
- Change sprintf() to snprintf() in write_schemata().
- Redo write_schemata() with syscalls instead of stdio functions.
- Fix typos and missing dots in patch messages.
- Branch printf attribute patch to a separate series.
[v1] https://lore.kernel.org/all/cover.1692880423.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1693213468.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1693575451.git.maciej.wieczor-retman@inte…
[v4] https://lore.kernel.org/all/cover.1695369120.git.maciej.wieczor-retman@inte…
[1] https://lore.kernel.org/all/20230915154438.82931-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/20230904095339.11321-1-ilpo.jarvinen@linux.inte…
Maciej Wieczor-Retman (2):
selftests/resctrl: Fix schemata write error check
selftests/resctrl: Move run_benchmark() to a more fitting file
tools/testing/selftests/resctrl/resctrl.h | 1 -
tools/testing/selftests/resctrl/resctrl_val.c | 50 +++++++++++
tools/testing/selftests/resctrl/resctrlfs.c | 88 +++++--------------
3 files changed, 73 insertions(+), 66 deletions(-)
base-commit: 3b3e8a34b1d50c2c5c6b030dab7682b123162cb4
--
2.42.0
The existing macro KUNIT_ARRAY_PARAM can produce parameter
generator function but only when we fully know the definition
of the array. However, there might be cases where we would like
to generate test params based on externaly defined array, which
is defined as zero-terminated array, like pci_driver.id_table.
Add helper macro KUNIT_ZERO_ARRAY_PARAM that can work with zero
terminated arrays and provide example how to use it.
$ ./tools/testing/kunit/kunit.py run \
--kunitconfig ./lib/kunit/.kunitconfig *.example_params*
[ ] Starting KUnit Kernel (1/1)...
[ ] ============================================================
[ ] ========================= example =========================
[ ] =================== example_params_test ===================
[ ] [SKIPPED] example value 3
[ ] [PASSED] example value 2
[ ] [PASSED] example value 1
[ ] [SKIPPED] example value 0
[ ] =============== [PASSED] example_params_test ===============
[ ] =================== example_params_test ===================
[ ] [SKIPPED] example value 3
[ ] [PASSED] example value 2
[ ] [PASSED] example value 1
[ ] =============== [PASSED] example_params_test ===============
[ ] ===================== [PASSED] example =====================
[ ] ============================================================
[ ] Testing complete. Ran 7 tests: passed: 4, skipped: 3
Signed-off-by: Michal Wajdeczko <michal.wajdeczko(a)intel.com>
Cc: David Gow <davidgow(a)google.com>
Cc: Rae Moar <rmoar(a)google.com>
---
include/kunit/test.h | 22 ++++++++++++++++++++++
lib/kunit/kunit-example-test.c | 2 ++
2 files changed, 24 insertions(+)
diff --git a/include/kunit/test.h b/include/kunit/test.h
index 20ed9f9275c9..280113ceb6a6 100644
--- a/include/kunit/test.h
+++ b/include/kunit/test.h
@@ -1514,6 +1514,28 @@ do { \
return NULL; \
}
+/**
+ * KUNIT_ZERO_ARRAY_PARAM() - Define test parameter generator from a zero terminated array.
+ * @name: prefix for the test parameter generator function.
+ * @array: zero terminated array of test parameters.
+ * @get_desc: function to convert param to description; NULL to use default
+ *
+ * Define function @name_gen_params which uses zero terminated @array to generate parameters.
+ */
+#define KUNIT_ZERO_ARRAY_PARAM(name, array, get_desc) \
+ static const void *name##_gen_params(const void *prev, char *desc) \
+ { \
+ typeof((array)[0]) *__prev = prev; \
+ typeof(__prev) __next = __prev ? __prev + 1 : (array); \
+ void (*__get_desc)(typeof(__next), char *) = get_desc; \
+ for (; memchr_inv(__next, 0, sizeof(*__next)); __prev = __next++) { \
+ if (__get_desc) \
+ __get_desc(__next, desc); \
+ return __next; \
+ } \
+ return NULL; \
+ }
+
// TODO(dlatypov(a)google.com): consider eventually migrating users to explicitly
// include resource.h themselves if they need it.
#include <kunit/resource.h>
diff --git a/lib/kunit/kunit-example-test.c b/lib/kunit/kunit-example-test.c
index 6bb5c2ef6696..ad9ebcfd513e 100644
--- a/lib/kunit/kunit-example-test.c
+++ b/lib/kunit/kunit-example-test.c
@@ -202,6 +202,7 @@ static void example_param_get_desc(const struct example_param *p, char *desc)
}
KUNIT_ARRAY_PARAM(example, example_params_array, example_param_get_desc);
+KUNIT_ZERO_ARRAY_PARAM(example_zero, example_params_array, example_param_get_desc);
/*
* This test shows the use of params.
@@ -246,6 +247,7 @@ static struct kunit_case example_test_cases[] = {
KUNIT_CASE(example_all_expect_macros_test),
KUNIT_CASE(example_static_stub_test),
KUNIT_CASE_PARAM(example_params_test, example_gen_params),
+ KUNIT_CASE_PARAM(example_params_test, example_zero_gen_params),
KUNIT_CASE_SLOW(example_slow_test),
{}
};
--
2.25.1
Fix four issues with resctrl selftests.
The signal handling fix became necessary after the mount/umount fixes
and the uninitialized member bug was discovered during the review.
The other two came up when I ran resctrl selftests across the server
fleet in our lab to validate the upcoming CAT test rewrite (the rewrite
is not part of this series).
These are developed and should apply cleanly at least on top the
benchmark cleanup series (might apply cleanly also w/o the benchmark
series, I didn't test).
v3:
- Add fix to uninitialized sa_flags
- Handle ksft_exit_fail_msg() in per test functions
- Make signal handler register fails to also exit
- Improve changelogs
v2:
- Include patch to move _GNU_SOURCE to Makefile to allow normal #include
placement
- Rework the signal register/unregister into patch to use helpers
- Fixed incorrect function parameter description
- Use return !!res to avoid confusing implicit boolean conversion
- Improve MBA/MBM success bound patch's changelog
- Tweak Cc: stable dependencies (make it a chain).
Ilpo Järvinen (7):
selftests/resctrl: Fix uninitialized .sa_flags
selftests/resctrl: Extend signal handler coverage to unmount on
receiving signal
selftests/resctrl: Remove duplicate feature check from CMT test
selftests/resctrl: Move _GNU_SOURCE define into Makefile
selftests/resctrl: Refactor feature check to use resource and feature
name
selftests/resctrl: Fix feature checks
selftests/resctrl: Reduce failures due to outliers in MBA/MBM tests
tools/testing/selftests/resctrl/Makefile | 2 +-
tools/testing/selftests/resctrl/cat_test.c | 8 --
tools/testing/selftests/resctrl/cmt_test.c | 3 -
tools/testing/selftests/resctrl/mba_test.c | 2 +-
tools/testing/selftests/resctrl/mbm_test.c | 2 +-
tools/testing/selftests/resctrl/resctrl.h | 7 +-
.../testing/selftests/resctrl/resctrl_tests.c | 82 ++++++++++++-------
tools/testing/selftests/resctrl/resctrl_val.c | 24 +++---
tools/testing/selftests/resctrl/resctrlfs.c | 69 ++++++----------
9 files changed, 96 insertions(+), 103 deletions(-)
--
2.30.2
Missing field KSM is added in g_smaps_rollup[] array as it
fixes assert in function test_proc_pid_smaps_rollup()
Without this patchset test fails for "proc-empty-vm" as can be seen below:
$make TARGETS="proc" kselftest
...
selftests: proc: proc-empty-vm
proc-empty-vm: proc-empty-vm.c:299: test_proc_pid_smaps_rollup:
Assertion `rv == sizeof(g_smaps_rollup) - 1' failed.
/usr/bin/timeout: the monitored command dumped core
Aborted
not ok 5 selftests: proc: proc-empty-vm # exit=134
...
With this patchset test passes for "proc-empty-vm" as can be seen below:
$make TARGETS="proc" kselftest
....
timeout set to 45
selftests: proc: proc-empty-vm
ok 5 selftests: proc: proc-empty-vm
....
Signed-off-by: Swarup Laxman Kotiaklapudi <swarupkotikalapudi(a)gmail.com>
---
tools/testing/selftests/proc/proc-empty-vm.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/tools/testing/selftests/proc/proc-empty-vm.c b/tools/testing/selftests/proc/proc-empty-vm.c
index b16c13688b88..ee71ce52cb6a 100644
--- a/tools/testing/selftests/proc/proc-empty-vm.c
+++ b/tools/testing/selftests/proc/proc-empty-vm.c
@@ -267,6 +267,7 @@ static const char g_smaps_rollup[] =
"Private_Dirty: 0 kB\n"
"Referenced: 0 kB\n"
"Anonymous: 0 kB\n"
+"KSM: 0 kB\n"
"LazyFree: 0 kB\n"
"AnonHugePages: 0 kB\n"
"ShmemPmdMapped: 0 kB\n"
--
2.34.1
As a general rule, the name of the selftest is printed at the beginning
of every message.
Use "static_keys" (name of the test itself) consistently instead of mixing
"static_key" and "static_keys" at the beginning of the messages in the
test_static_keys script.
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
---
tools/testing/selftests/static_keys/test_static_keys.sh | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/static_keys/test_static_keys.sh b/tools/testing/selftests/static_keys/test_static_keys.sh
index fc9f8cde7d42..3b0f17b81ac2 100755
--- a/tools/testing/selftests/static_keys/test_static_keys.sh
+++ b/tools/testing/selftests/static_keys/test_static_keys.sh
@@ -6,18 +6,18 @@
ksft_skip=4
if ! /sbin/modprobe -q -n test_static_key_base; then
- echo "static_key: module test_static_key_base is not found [SKIP]"
+ echo "static_keys: module test_static_key_base is not found [SKIP]"
exit $ksft_skip
fi
if ! /sbin/modprobe -q -n test_static_keys; then
- echo "static_key: module test_static_keys is not found [SKIP]"
+ echo "static_keys: module test_static_keys is not found [SKIP]"
exit $ksft_skip
fi
if /sbin/modprobe -q test_static_key_base; then
if /sbin/modprobe -q test_static_keys; then
- echo "static_key: ok"
+ echo "static_keys: ok"
/sbin/modprobe -q -r test_static_keys
/sbin/modprobe -q -r test_static_key_base
else
@@ -25,6 +25,6 @@ if /sbin/modprobe -q test_static_key_base; then
/sbin/modprobe -q -r test_static_key_base
fi
else
- echo "static_key: [FAIL]"
+ echo "static_keys: [FAIL]"
exit 1
fi
---
base-commit: cefc06e4de1477dbdc3cb2a91d4b1873b7797a5c
change-id: 20231001-topic-static_keys_selftest_messages-099232047b79
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
This series converts the execveat test to generate KTAP output so it
plays a bit more nicely with automation, KTAP means that kselftest
runners can track the individual tests in the suite rather than just an
overall pass/fail for the suite as a whole.
The first patch adding a perror() equivalent for kselftest was
previously sent as part of a similar conversion for the timers tests:
https://lore.kernel.org/linux-kselftest/8734yyfx00.ffs@tglx/T
there's probably no harm in applying it twice or possibly these should
both go via the kselftest tree - I'm not sure who usually applies timers
test changes.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
kselftest: Add a ksft_perror() helper
selftests/exec: Convert execveat test to generate KTAP output
tools/testing/selftests/exec/execveat.c | 87 ++++++++++++++++++++-------------
tools/testing/selftests/kselftest.h | 14 ++++++
2 files changed, 66 insertions(+), 35 deletions(-)
---
base-commit: 6465e260f48790807eef06b583b38ca9789b6072
change-id: 20230928-ktap-exec-45ea8d28309a
Best regards,
--
Mark Brown <broonie(a)kernel.org>
The ret variable is used to check function return values and assigning
values to it on error has no effect as it is an unused value.
The current implementation uses an additional variable (fret) to return
the error value, which in this case is unnecessary and lead to the above
described misuse. There is no restriction in the current implementation
to always return -1 on error and the actual negative error value can be
returned safely without storing -1 in a specific variable.
Simplify the error checking by using a single variable which always
holds the returned value.
Signed-off-by: Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
---
Changes in v2:
- Remove fret and use a single variable to check errors.
- Link to v1: https://lore.kernel.org/r/20230916-topic-self_uevent_filtering-v1-1-26ede50…
---
tools/testing/selftests/uevent/uevent_filtering.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/uevent/uevent_filtering.c b/tools/testing/selftests/uevent/uevent_filtering.c
index 5cebfb356345..dbe55f3a66f4 100644
--- a/tools/testing/selftests/uevent/uevent_filtering.c
+++ b/tools/testing/selftests/uevent/uevent_filtering.c
@@ -78,7 +78,7 @@ static int uevent_listener(unsigned long post_flags, bool expect_uevent,
{
int sk_fd, ret;
socklen_t sk_addr_len;
- int fret = -1, rcv_buf_sz = __UEVENT_BUFFER_SIZE;
+ int rcv_buf_sz = __UEVENT_BUFFER_SIZE;
uint64_t sync_add = 1;
struct sockaddr_nl sk_addr = { 0 }, rcv_addr = { 0 };
char buf[__UEVENT_BUFFER_SIZE] = { 0 };
@@ -121,6 +121,7 @@ static int uevent_listener(unsigned long post_flags, bool expect_uevent,
if ((size_t)sk_addr_len != sizeof(sk_addr)) {
fprintf(stderr, "Invalid socket address size\n");
+ ret = -1;
goto on_error;
}
@@ -147,11 +148,12 @@ static int uevent_listener(unsigned long post_flags, bool expect_uevent,
ret = write_nointr(sync_fd, &sync_add, sizeof(sync_add));
close(sync_fd);
if (ret != sizeof(sync_add)) {
+ ret = -1;
fprintf(stderr, "Failed to synchronize with parent process\n");
goto on_error;
}
- fret = 0;
+ ret = 0;
for (;;) {
ssize_t r;
@@ -187,7 +189,7 @@ static int uevent_listener(unsigned long post_flags, bool expect_uevent,
on_error:
close(sk_fd);
- return fret;
+ return ret;
}
int trigger_uevent(unsigned int times)
---
base-commit: cefc06e4de1477dbdc3cb2a91d4b1873b7797a5c
change-id: 20230916-topic-self_uevent_filtering-17b53262bc46
Best regards,
--
Javier Carrasco <javier.carrasco.cruz(a)gmail.com>
Kselftest.h declares many variadic functions that can print some
formatted message while also executing selftest logic. These
declarations don't have any compiler mechanism to verify if passed
arguments are valid in comparison with format specifiers used in
printf() calls.
Attribute addition can make debugging easier, the code more consistent
and prevent mismatched or missing variables.
Add a __printf() macro that validates types of variables passed to the
format string. The macro is similarly used in other tools in the kernel.
Add __printf() attributes to function definitions inside kselftest.h that
use printing.
Adding the __printf() macro exposes some mismatches in format strings
across different selftests.
Fix the mismatched format specifiers in multiple tests.
Series is based on kselftests next branch.
Changelog v3:
- Change git signature from Wieczor-Retman Maciej to Maciej
Wieczor-Retman.
- Add one review tag.
- Rebase onto updated kselftests next branch and change base commit.
Changelog v2:
- Add review and fixes tags to patches.
- Add two patches with mismatch fixes.
- Fix missed attribute in selftests/kvm. (Andrew)
- Fix previously missed issues in selftests/mm (Ilpo)
[v2] https://lore.kernel.org/all/cover.1693829810.git.maciej.wieczor-retman@inte…
[v1] https://lore.kernel.org/all/cover.1693216959.git.maciej.wieczor-retman@inte…
Maciej Wieczor-Retman (8):
selftests: Add printf attribute to ksefltest prints
selftests/cachestat: Fix print_cachestat format
selftests/openat2: Fix wrong format specifier
selftests/pidfd: Fix ksft print formats
selftests/sigaltstack: Fix wrong format specifier
selftests/kvm: Replace attribute with macro
selftests/mm: Substitute attribute with a macro
selftests/resctrl: Fix wrong format specifier
.../selftests/cachestat/test_cachestat.c | 2 +-
tools/testing/selftests/kselftest.h | 18 ++++++++++--------
.../testing/selftests/kvm/include/test_util.h | 8 ++++----
tools/testing/selftests/mm/mremap_test.c | 2 +-
tools/testing/selftests/mm/pkey-helpers.h | 2 +-
tools/testing/selftests/openat2/openat2_test.c | 2 +-
.../selftests/pidfd/pidfd_fdinfo_test.c | 2 +-
tools/testing/selftests/pidfd/pidfd_test.c | 12 ++++++------
tools/testing/selftests/resctrl/cache.c | 2 +-
tools/testing/selftests/sigaltstack/sas.c | 2 +-
10 files changed, 27 insertions(+), 25 deletions(-)
base-commit: ce9ecca0238b140b88f43859b211c9fdfd8e5b70
--
2.42.0
SVE 2.1 introduced a new feature FEAT_SVE_B16B16 which adds instructions
supporting the BFloat16 floating point format. Report this to userspace
through the ID registers and hwcap.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
arm64/sve: Report FEAT_SVE_B16B16 to userspace
kselftest/arm64: Verify HWCAP2_SVE_B16B16
Documentation/arch/arm64/cpu-feature-registers.rst | 2 ++
Documentation/arch/arm64/elf_hwcaps.rst | 3 +++
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/kernel/cpufeature.c | 3 +++
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/tools/sysreg | 6 +++++-
tools/testing/selftests/arm64/abi/hwcap.c | 13 +++++++++++++
8 files changed, 29 insertions(+), 1 deletion(-)
---
base-commit: 0bb80ecc33a8fb5a682236443c1e740d5c917d1d
change-id: 20230913-arm64-zfr-b16b16-el0-0811fc70f147
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Write_schemata() uses fprintf() to write a bitmask into a schemata file
inside resctrl FS. It checks fprintf() return value but it doesn't check
fclose() return value. Error codes from fprintf() such as write errors,
are buffered and flushed back to the user only after fclose() is executed
which means any invalid bitmask can be written into the schemata file.
Rewrite write_schemata() to use syscalls instead of stdio file
operations to avoid the buffering.
The resctrlfs.c file defines functions that interact with the resctrl FS
while resctrl_val.c file defines functions that perform measurements on
the cache. Run_benchmark() fits logically into the second file before
resctrl_val() function that uses it.
Move run_benchmark() from resctrlfs.c to resctrl_val.c and remove
redundant part of the kernel-doc comment. Make run_benchmark() static
and remove it from the header file.
Patch series is based on [1] which is based on [2] which are based on
ksefltest next branch.
Changelog v4:
- Change git signature from Wieczor-Retman Maciej to Maciej
Wieczor-Retman.
- Rebase onto [1] which is based on [2]. (Reinette)
- Add fcntl.h explicitly to provide glibc backward compatibility.
(Reinette)
Changelog v3:
- Use snprintf() return value instead of strlen() in write_schemata().
(Ilpo)
- Make run_benchmark() static and remove it from the header file.
(Reinette)
- Added Ilpo's reviewed-by tag to Patch 2/2.
- Patch messages and cover letter rewording.
Changelog v2:
- Change sprintf() to snprintf() in write_schemata().
- Redo write_schemata() with syscalls instead of stdio functions.
- Fix typos and missing dots in patch messages.
- Branch printf attribute patch to a separate series.
[v1] https://lore.kernel.org/all/cover.1692880423.git.maciej.wieczor-retman@inte…
[v2] https://lore.kernel.org/all/cover.1693213468.git.maciej.wieczor-retman@inte…
[v3] https://lore.kernel.org/all/cover.1693575451.git.maciej.wieczor-retman@inte…
[1] https://lore.kernel.org/all/20230915154438.82931-1-ilpo.jarvinen@linux.inte…
[2] https://lore.kernel.org/all/20230904095339.11321-1-ilpo.jarvinen@linux.inte…
Maciej Wieczor-Retman (2):
selftests/resctrl: Fix schemata write error check
selftests/resctrl: Move run_benchmark() to a more fitting file
tools/testing/selftests/resctrl/resctrl.h | 1 -
tools/testing/selftests/resctrl/resctrl_val.c | 50 +++++++++++
tools/testing/selftests/resctrl/resctrlfs.c | 82 ++++---------------
3 files changed, 67 insertions(+), 66 deletions(-)
base-commit: 3b3e8a34b1d50c2c5c6b030dab7682b123162cb4
--
2.42.0
Diagnostic message for failed KUNIT_ASSERT|EXPECT_NOT_ERR_OR_NULL
shows only raw error code, like in this example:
[ ] # example_all_expect_macros_test: EXPECTATION FAILED at lib/kunit/kunit-example-test.c:126
[ ] Expected myptr is not error, but is: -12
but we can improve it by using more friendly error pointer format:
[ ] # example_all_expect_macros_test: EXPECTATION FAILED at lib/kunit/kunit-example-test.c:126
[ ] Expected myptr is not error, but is -ENOMEM
Signed-off-by: Michal Wajdeczko <michal.wajdeczko(a)intel.com>
Cc: David Gow <davidgow(a)google.com>
Cc: Rae Moar <rmoar(a)google.com>
---
lib/kunit/assert.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/kunit/assert.c b/lib/kunit/assert.c
index dd1d633d0fe2..96ef236d3ca3 100644
--- a/lib/kunit/assert.c
+++ b/lib/kunit/assert.c
@@ -80,9 +80,9 @@ void kunit_ptr_not_err_assert_format(const struct kunit_assert *assert,
ptr_assert->text);
} else if (IS_ERR(ptr_assert->value)) {
string_stream_add(stream,
- KUNIT_SUBTEST_INDENT "Expected %s is not error, but is: %ld\n",
+ KUNIT_SUBTEST_INDENT "Expected %s is not error, but is %pe\n",
ptr_assert->text,
- PTR_ERR(ptr_assert->value));
+ ptr_assert->value);
}
kunit_assert_print_msg(message, stream);
}
--
2.25.1
Commit 804b854d374e ("net: bridge: disable bridge MTU auto tuning if it
was set manually") disabled auto-tuning of the bridge MTU when the MTU
was explicitly set by the user, however that would only happen when the
MTU was set after creation. This commit ensures auto-tuning is also
disabled when the MTU is set during bridge creation.
Currently when the br_netdev_ops br_change_mtu function is called, the
flag BROPT_MTU_SET_BY_USER is set. However this function is only called
when the MTU is changed after interface creation and is not called if
the MTU is specified during creation with IFLA_MTU (br_dev_newlink).
br_change_mtu also does not get called if the MTU is set to the same
value it currently has, which makes it difficult to work around this
issue (especially for the default MTU of 1500) as you have to first
change the MTU to some other value and then back to the desired value.
Add new selftests to ensure the bridge MTU is handled correctly:
- Bridge created with user-specified MTU (1500)
- Bridge created with user-specified MTU (2000)
- Bridge created without user-specified MTU
- Bridge created with user-specified MTU set after creation (2000)
Regression risk: Any workload which erroneously specified an MTU during
creation but accidentally relied upon auto-tuning to a different value
may be broken by this change.
Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2034099
Fixes: 804b854d374e ("net: bridge: disable bridge MTU auto tuning if it was set manually")
Signed-off-by: Trent Lloyd <trent.lloyd(a)canonical.com>
---
net/bridge/br_netlink.c | 3 +
.../selftests/drivers/net/bridge/Makefile | 10 ++
.../drivers/net/bridge/bridge-user-mtu.sh | 148 ++++++++++++++++++
.../drivers/net/bridge/net_forwarding_lib.sh | 1 +
4 files changed, 162 insertions(+)
create mode 100644 tools/testing/selftests/drivers/net/bridge/Makefile
create mode 100755 tools/testing/selftests/drivers/net/bridge/bridge-user-mtu.sh
create mode 120000 tools/testing/selftests/drivers/net/bridge/net_forwarding_lib.sh
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index 10f0d33d8ccf..8aff7d077848 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -1559,6 +1559,9 @@ static int br_dev_newlink(struct net *src_net, struct net_device *dev,
spin_unlock_bh(&br->lock);
}
+ if (tb[IFLA_MTU])
+ br_opt_toggle(br, BROPT_MTU_SET_BY_USER, true);
+
err = br_changelink(dev, tb, data, extack);
if (err)
br_dev_delete(dev, NULL);
diff --git a/tools/testing/selftests/drivers/net/bridge/Makefile b/tools/testing/selftests/drivers/net/bridge/Makefile
new file mode 100644
index 000000000000..23e407c75a7f
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/bridge/Makefile
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for net selftests
+
+TEST_PROGS := \
+ bridge-user-mtu.sh
+
+TEST_FILES := \
+ net_forwarding_lib.sh
+
+include ../../../lib.mk
diff --git a/tools/testing/selftests/drivers/net/bridge/bridge-user-mtu.sh b/tools/testing/selftests/drivers/net/bridge/bridge-user-mtu.sh
new file mode 100755
index 000000000000..07e0ac972b00
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/bridge/bridge-user-mtu.sh
@@ -0,0 +1,148 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Ensure a bridge MTU does not automatically change when it has been specified
+# by the user.
+#
+# To run independently:
+# make TARGETS=drivers/net/bridge kselftest
+
+ALL_TESTS="
+ bridge_created_with_user_specified_mtu
+ bridge_created_without_user_specified_mtu
+ bridge_with_late_user_specified_mtu
+"
+
+REQUIRE_MZ=no
+NUM_NETIFS=0
+lib_dir=$(dirname "$0")
+source "${lib_dir}"/net_forwarding_lib.sh
+
+setup_prepare()
+{
+ for i in 1 3 5; do
+ ip link add "vtest${i}" mtu 9000 type veth peer name "vtest${i}b" mtu 9000
+ done
+}
+
+cleanup()
+{
+ for interface in vtest1 vtest3 vtest5 br-test0 br-test1 br-test2; do
+ if [[ -d "/sys/class/net/${interface}" ]]; then
+ ip link del "${interface}" &> /dev/null
+ fi
+ done
+}
+
+check_mtu()
+{
+ cur_mtu=$(<"/sys/class/net/$1/mtu")
+ [[ ${cur_mtu} -eq $2 ]]
+ exit_status=$?
+ return "${exit_status}"
+}
+
+check_bridge_user_specified_mtu()
+{
+ if [[ -z $1 ]]
+ then
+ exit 1
+ fi
+ mtu=$1
+
+ RET=0
+
+ ip link add dev br-test0 mtu "${mtu}" type bridge
+ ip link set br-test0 up
+ check_mtu br-test0 "${mtu}"
+ check_err $? "Bridge was not created with the user-specified MTU"
+
+ check_mtu vtest1 9000
+ check_err $? "vtest1 does not have MTU 9000"
+
+ ip link set dev vtest1 master br-test0
+ check_mtu br-test0 "${mtu}"
+ check_err $? "Bridge user-specified MTU incorrectly changed after adding an interface"
+
+ log_test "Bridge created with user-specified MTU (${mtu})"
+
+ ip link del br-test0
+}
+
+bridge_created_with_user_specified_mtu() {
+ # Check two user-specified MTU values
+ # - 1500: To ensure the default MTU (1500) is not special-cased, you
+ # should be able to lock a bridge to the default MTU.
+ # - 2000: Ensure bridges are actually created with a user-specified MTU
+ check_bridge_user_specified_mtu 1500
+ check_bridge_user_specified_mtu 2000
+}
+
+bridge_created_without_user_specified_mtu()
+{
+ RET=0
+ ip link add dev br-test1 type bridge
+ ip link set br-test1 up
+ check_mtu br-test1 1500
+ check_err $? "Bridge was not created with the user-specified MTU"
+
+ ip link set dev vtest3 master br-test1
+ check_mtu br-test1 9000
+ check_err $? "Bridge without user-specified MTU did not change MTU"
+
+ log_test "Bridge created without user-specified MTU"
+
+ ip link del br-test1
+}
+
+check_bridge_late_user_specified_mtu()
+{
+ if [[ -z $1 ]]
+ then
+ exit 1
+ fi
+ mtu=$1
+
+ RET=0
+ ip link add dev br-test2 type bridge
+ ip link set br-test2 up
+ check_mtu br-test2 1500
+ check_err $? "Bridge was not created with default MTU (1500)"
+
+ ip link set br-test2 mtu "${mtu}"
+ check_mtu br-test2 "${mtu}"
+ check_err $? "User-specified MTU set after creation was not set"
+ check_mtu vtest5 9000
+ check_err $? "vtest5 does not have MTU 9000"
+
+ ip link set dev vtest5 master br-test2
+ check_mtu br-test2 "${mtu}"
+ check_err $? "Bridge late-specified MTU incorrectly changed after adding an interface"
+
+ log_test "Bridge created without user-specified MTU and changed after (${mtu})"
+
+ ip link del br-test2
+}
+
+bridge_with_late_user_specified_mtu()
+{
+ # Note: Unfortunately auto-tuning is not disabled when you set the MTU
+ # to it's current value, including the default of 1500. The reason is
+ # that dev_set_mtu_ext skips notifying any handlers if the MTU is set
+ # to the current value. Normally that makes sense, but is confusing
+ # since you might expect "ip link set br0 mtu 1500" to lock the MTU to
+ # 1500 but that will only happen if the MTU was not already 1500. So we
+ # only check a non-default value of 2000 here unlike the earlier
+ # bridge_created_with_user_specified_mtu test
+
+ # Check one user-specified MTU value
+ # - 2000: Ensure bridges actually change to a user-specified MTU
+ check_bridge_late_user_specified_mtu 2000
+}
+
+trap cleanup EXIT
+
+setup_prepare
+tests_run
+
+exit "${EXIT_STATUS}"
diff --git a/tools/testing/selftests/drivers/net/bridge/net_forwarding_lib.sh b/tools/testing/selftests/drivers/net/bridge/net_forwarding_lib.sh
new file mode 120000
index 000000000000..39c96828c5ef
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/bridge/net_forwarding_lib.sh
@@ -0,0 +1 @@
+../../../net/forwarding/lib.sh
\ No newline at end of file
--
2.34.1
This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
To mitigate Spectre v1, the verifier relies on static analysis to deduct
constant pointer bounds, which can then be enforced by rewriting pointer
arithmetic [1] or index masking [2]. This relies on the fact that every
memory region to be accessed has a static upper bound and every date
below that bound is accessible. The verifier can only rewrite pointer
arithmetic or insert masking instructions to mitigate Spectre v1 if a
static upper bound, below of which every access is valid, can be given.
When allowing packet pointer comparisons, this introduces a way for the
program to effectively construct an accessible pointer for which no
static upper bound is known. Intuitively, this is obvious as a packet
might be of any size and therefore 0 is the only statically known upper
bound below of which every date is always accessible (i.e., none).
To clarify, the problem is not that comparing two pointers can be used
for pointer leaks in the same way in that comparing a pointer to a known
scalar can be used for pointer leaks. That is because the "secret"
components of the addresses cancel each other out if the pointers are
into the same region.
With [3] applied, the following malicious BPF program can be loaded into
the kernel without CAP_PERFMON:
r2 = *(u32 *)(r1 + 76) // data
r3 = *(u32 *)(r1 + 80) // data_end
r4 = r2
r4 += 1
if r4 > r3 goto exit
r5 = *(u8 *)(r2 + 0) // speculatively read secret
r5 &= 1 // choose bit to leak
// ... side channel to leak secret bit
exit:
// ...
This is jited to the following amd64 code which still contains the
gadget:
0: endbr64
4: nopl 0x0(%rax,%rax,1)
9: xchg %ax,%ax
b: push %rbp
c: mov %rsp,%rbp
f: endbr64
13: push %rbx
14: mov 0xc8(%rdi),%rsi // data
1b: mov 0x50(%rdi),%rdx // data_end
1f: mov %rsi,%rcx
22: add $0x1,%rcx
26: cmp %rdx,%rcx
29: ja 0x000000000000003f // branch to mispredict
2b: movzbq 0x0(%rsi),%r8 // speculative load of secret
30: and $0x1,%r8 // choose bit to leak
34: xor %ebx,%ebx
36: cmp %rbx,%r8
39: je 0x000000000000003f // branch based on secret
3b: imul $0x61,%r8,%r8 // leak using port contention side channel
3f: xor %eax,%eax
41: pop %rbx
42: leaveq
43: retq
Here I'm using a port contention side channel because storing the secret
to the stack causes the verifier to insert an lfence for unrelated
reasons (SSB mitigation) which would terminate the speculation.
As Daniel already pointed out to me, data_end is even attacker
controlled as one could send many packets of sufficient length to train
the branch prediction into assuming data_end >= data will never be true.
When the attacker then sends a packet with insufficient data, the
Spectre v1 gadget leaks the chosen bit of some value that lies behind
data_end.
To make it clear that the problem is not the pointer comparison but the
missing masking instruction, it can be useful to transform the code
above into the following equivalent pseudocode:
r2 = data
r3 = data_end
r6 = ... // index to access, constant does not help
r7 = data_end - data // only known at runtime, could be [0,PKT_MAX)
if !(r6 < r7) goto exit
// no masking of index in r6 happens
r2 += r6 // addr. to access
r5 = *(u8 *)(r2 + 0) // speculatively read secret
// ... leak secret as above
One idea to resolve this while still allowing for unprivileged packet
access would be to always allocate a power of 2 for the packet data and
then also pass the respective index mask in the skb structure. The
verifier would then have to check that this index mask is always applied
to the offset before a packet pointer is dereferenced. This patch does
not implement this extension, but only reverts [3].
[1] 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 ("bpf: prevent out of bounds speculation on pointer arithmetic")
[2] b2157399cc9898260d6031c5bfe45fe137c1fbe7 ("bpf: prevent out-of-bounds speculation")
[3] d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2 ("bpf: Fix issue in verifying allow_ptr_leaks")
Reported-by: Daniel Borkmann <daniel(a)iogearbox.net>
Signed-off-by: Luis Gerhorst <gerhorst(a)amazon.de>
Signed-off-by: Luis Gerhorst <gerhorst(a)cs.fau.de>
Acked-by: Hagar Gamal Halim Hemdan <hagarhem(a)amazon.de>
Cc: Puranjay Mohan <puranjay12(a)gmail.com>
---
kernel/bpf/verifier.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bb78212fa5b2..b415a81149ed 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14050,12 +14050,6 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
return -EINVAL;
}
- /* check src2 operand */
- err = check_reg_arg(env, insn->dst_reg, SRC_OP);
- if (err)
- return err;
-
- dst_reg = ®s[insn->dst_reg];
if (BPF_SRC(insn->code) == BPF_X) {
if (insn->imm != 0) {
verbose(env, "BPF_JMP/JMP32 uses reserved fields\n");
@@ -14067,13 +14061,12 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
if (err)
return err;
- src_reg = ®s[insn->src_reg];
- if (!(reg_is_pkt_pointer_any(dst_reg) && reg_is_pkt_pointer_any(src_reg)) &&
- is_pointer_value(env, insn->src_reg)) {
+ if (is_pointer_value(env, insn->src_reg)) {
verbose(env, "R%d pointer comparison prohibited\n",
insn->src_reg);
return -EACCES;
}
+ src_reg = ®s[insn->src_reg];
} else {
if (insn->src_reg != BPF_REG_0) {
verbose(env, "BPF_JMP/JMP32 uses reserved fields\n");
@@ -14081,6 +14074,12 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
}
}
+ /* check src2 operand */
+ err = check_reg_arg(env, insn->dst_reg, SRC_OP);
+ if (err)
+ return err;
+
+ dst_reg = ®s[insn->dst_reg];
is_jmp32 = BPF_CLASS(insn->code) == BPF_JMP32;
if (BPF_SRC(insn->code) == BPF_K) {
--
2.40.1
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Check the stream pointer passed to string_stream_destroy() for
IS_ERR_OR_NULL() instead of only NULL.
Whatever alloc_string_stream() returns should be safe to pass
to string_stream_destroy(), and that will be an ERR_PTR.
It's obviously good practise and generally helpful to also check
for NULL pointers so that client cleanup code can call
string_stream_destroy() unconditionally - which could include
pointers that have never been set to anything and so are NULL.
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
---
lib/kunit/string-stream.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/kunit/string-stream.c b/lib/kunit/string-stream.c
index a6f3616c2048..54f4fdcbfac8 100644
--- a/lib/kunit/string-stream.c
+++ b/lib/kunit/string-stream.c
@@ -173,7 +173,7 @@ void string_stream_destroy(struct string_stream *stream)
{
KUNIT_STATIC_STUB_REDIRECT(string_stream_destroy, stream);
- if (!stream)
+ if (IS_ERR_OR_NULL(stream))
return;
string_stream_clear(stream);
--
2.30.2
In kunit_debugfs_create_suite() give up and skip creating the debugfs
file if any of the alloc_string_stream() calls return an error or NULL.
Only put a value in the log pointer of kunit_suite and kunit_test if it
is a valid pointer to a log.
This prevents the potential invalid dereference reported by smatch:
lib/kunit/debugfs.c:115 kunit_debugfs_create_suite() error: 'suite->log'
dereferencing possible ERR_PTR()
lib/kunit/debugfs.c:119 kunit_debugfs_create_suite() error: 'test_case->log'
dereferencing possible ERR_PTR()
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
Reported-by: Dan Carpenter <dan.carpenter(a)linaro.org>
Fixes: 05e2006ce493 ("kunit: Use string_stream for test log")
Reviewed-by: Rae Moar <rmoar(a)google.com>
---
Changes from V1:
- If the alloc_string_stream() for the suite->log fails
just return. Nothing has been created at this point so
there's nothing to clean up.
- Re-word the explanation of why the log pointers are only
set if they point to a valid log.
As these changes are trivial I've carried Rae Moar's
Reviewed-by from V1.
---
lib/kunit/debugfs.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)
diff --git a/lib/kunit/debugfs.c b/lib/kunit/debugfs.c
index 270d185737e6..9d167adfa746 100644
--- a/lib/kunit/debugfs.c
+++ b/lib/kunit/debugfs.c
@@ -109,14 +109,28 @@ static const struct file_operations debugfs_results_fops = {
void kunit_debugfs_create_suite(struct kunit_suite *suite)
{
struct kunit_case *test_case;
+ struct string_stream *stream;
- /* Allocate logs before creating debugfs representation. */
- suite->log = alloc_string_stream(GFP_KERNEL);
- string_stream_set_append_newlines(suite->log, true);
+ /*
+ * Allocate logs before creating debugfs representation.
+ * The suite->log and test_case->log pointer are expected to be NULL
+ * if there isn't a log, so only set it if the log stream was created
+ * successfully.
+ */
+ stream = alloc_string_stream(GFP_KERNEL);
+ if (IS_ERR_OR_NULL(stream))
+ return;
+
+ string_stream_set_append_newlines(stream, true);
+ suite->log = stream;
kunit_suite_for_each_test_case(suite, test_case) {
- test_case->log = alloc_string_stream(GFP_KERNEL);
- string_stream_set_append_newlines(test_case->log, true);
+ stream = alloc_string_stream(GFP_KERNEL);
+ if (IS_ERR_OR_NULL(stream))
+ goto err;
+
+ string_stream_set_append_newlines(stream, true);
+ test_case->log = stream;
}
suite->debugfs = debugfs_create_dir(suite->name, debugfs_rootdir);
@@ -124,6 +138,12 @@ void kunit_debugfs_create_suite(struct kunit_suite *suite)
debugfs_create_file(KUNIT_DEBUGFS_RESULTS, S_IFREG | 0444,
suite->debugfs,
suite, &debugfs_results_fops);
+ return;
+
+err:
+ string_stream_destroy(suite->log);
+ kunit_suite_for_each_test_case(suite, test_case)
+ string_stream_destroy(test_case->log);
}
void kunit_debugfs_destroy_suite(struct kunit_suite *suite)
--
2.30.2
KTAP is the standard output format for selftests, providing a method for
systems running the selftests to get results from individual tests
rather than just a pass/fail for the test program as a whole. While
many of the timers tests use KTAP some have custom output formats, let's
convert a few more to KTAP to make them work better in automation.
The posix_timers test made use of perror(), I've added a generic helper
to kselftest.h for that since it seems like it'll be useful elsewhere.
There are more tests that don't use KTAP, several of them just run a
single test so don't really benefit from KTAP and there were a couple
where the conversion was a bit more complex so I've left them for now.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (3):
kselftest: Add a ksft_perror() helper
selftests: timers: Convert posix_timers test to generate KTAP output
selftests: timers: Convert nsleep-lat test to generate KTAP output
tools/testing/selftests/kselftest.h | 14 +++++
tools/testing/selftests/timers/nsleep-lat.c | 26 ++++-----
tools/testing/selftests/timers/posix_timers.c | 81 ++++++++++++++-------------
3 files changed, 67 insertions(+), 54 deletions(-)
---
base-commit: 6465e260f48790807eef06b583b38ca9789b6072
change-id: 20230926-ktap-posix-timers-67e978466185
Best regards,
--
Mark Brown <broonie(a)kernel.org>
In kunit_debugfs_create_suite() give up and skip creating the debugfs
file if any of the alloc_string_stream() calls return an error or NULL.
Only put a value in the log pointer of kunit_suite and kunit_test if it
is a valid pointer to a log.
This prevents the potential invalid dereference reported by smatch:
lib/kunit/debugfs.c:115 kunit_debugfs_create_suite() error: 'suite->log'
dereferencing possible ERR_PTR()
lib/kunit/debugfs.c:119 kunit_debugfs_create_suite() error: 'test_case->log'
dereferencing possible ERR_PTR()
Signed-off-by: Richard Fitzgerald <rf(a)opensource.cirrus.com>
Reported-by: Dan Carpenter <dan.carpenter(a)linaro.org>
Fixes: 05e2006ce493 ("kunit: Use string_stream for test log")
---
lib/kunit/debugfs.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
diff --git a/lib/kunit/debugfs.c b/lib/kunit/debugfs.c
index 270d185737e6..73075ca6e88c 100644
--- a/lib/kunit/debugfs.c
+++ b/lib/kunit/debugfs.c
@@ -109,14 +109,27 @@ static const struct file_operations debugfs_results_fops = {
void kunit_debugfs_create_suite(struct kunit_suite *suite)
{
struct kunit_case *test_case;
+ struct string_stream *stream;
- /* Allocate logs before creating debugfs representation. */
- suite->log = alloc_string_stream(GFP_KERNEL);
- string_stream_set_append_newlines(suite->log, true);
+ /*
+ * Allocate logs before creating debugfs representation.
+ * The log pointer must be NULL if there isn't a log so only
+ * set it if the log stream was created successfully.
+ */
+ stream = alloc_string_stream(GFP_KERNEL);
+ if (IS_ERR_OR_NULL(stream))
+ goto err;
+
+ string_stream_set_append_newlines(stream, true);
+ suite->log = stream;
kunit_suite_for_each_test_case(suite, test_case) {
- test_case->log = alloc_string_stream(GFP_KERNEL);
- string_stream_set_append_newlines(test_case->log, true);
+ stream = alloc_string_stream(GFP_KERNEL);
+ if (IS_ERR_OR_NULL(stream))
+ goto err;
+
+ string_stream_set_append_newlines(stream, true);
+ test_case->log = stream;
}
suite->debugfs = debugfs_create_dir(suite->name, debugfs_rootdir);
@@ -124,6 +137,12 @@ void kunit_debugfs_create_suite(struct kunit_suite *suite)
debugfs_create_file(KUNIT_DEBUGFS_RESULTS, S_IFREG | 0444,
suite->debugfs,
suite, &debugfs_results_fops);
+ return;
+
+err:
+ string_stream_destroy(suite->log);
+ kunit_suite_for_each_test_case(suite, test_case)
+ string_stream_destroy(test_case->log);
}
void kunit_debugfs_destroy_suite(struct kunit_suite *suite)
--
2.30.2
The test_cases is not freed in kunit_free_suite_set().
And the copy pointer may be moved in kunit_filter_suites().
The filtered_suite and filtered_suite->test_cases allocated in the last
kunit_filter_attr_tests() in last inner for loop may be leaked if
kunit_filter_suites() fails.
If kunit_filter_suites() succeeds, not only copy but also filtered_suite
and filtered_suite->test_cases should be freed.
Changes in v4:
- Make free_suite_set() take a void * for the 4th patch.
- Add Suggested-by and Reviewed-by.
- Correct the fix tag.
Changes in v3:
- Update the kfree_at_end() to use kunit_free_suite_set() for 4th patch.
- Update the commit message for the 4th patch.
Changes in v2:
- Add Reviewed-by.
- Add the memory leak backtrace for the 4th patch.
- Remove the unused func kernel test robot noticed for the 4th patch.
- Update the commit message for the 4th patch.
Jinjie Ruan (4):
kunit: Fix missed memory release in kunit_free_suite_set()
kunit: Fix the wrong kfree of copy for kunit_filter_suites()
kunit: Fix possible memory leak in kunit_filter_suites()
kunit: test: Fix the possible memory leak in executor_test
lib/kunit/executor.c | 23 +++++++++++++++++------
lib/kunit/executor_test.c | 36 ++++++++++++++++++++++--------------
2 files changed, 39 insertions(+), 20 deletions(-)
--
2.34.1
This adds the pasid attach/detach uAPIs for userspace to attach/detach
a PASID of a device to/from a given ioas/hwpt. Only vfio-pci driver is
enabled in this series. After this series, PASID-capable devices bound
with vfio-pci can report PASID capability to userspace and VM to enable
PASID usages like Shared Virtual Addressing (SVA).
This series first adds the helpers for pasid attach in vfio core and then
add the device cdev ioctls for pasid attach/detach, finally exposes the
device PASID capability to user. It depends on iommufd pasid attach/detach
series [1].
Complete code can be found at [2]
[1] https://lore.kernel.org/linux-iommu/20230926092651.17041-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
Regards,
Yi Liu
Kevin Tian (1):
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
Yi Liu (2):
vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
vfio/pci: Expose PCIe PASID capability to userspace
drivers/vfio/device_cdev.c | 45 ++++++++++++++++++++++++
drivers/vfio/iommufd.c | 48 ++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci.c | 2 ++
drivers/vfio/pci/vfio_pci_config.c | 2 +-
drivers/vfio/vfio.h | 4 +++
drivers/vfio/vfio_main.c | 8 +++++
include/linux/vfio.h | 11 ++++++
include/uapi/linux/vfio.h | 55 ++++++++++++++++++++++++++++++
8 files changed, 174 insertions(+), 1 deletion(-)
--
2.34.1
Commit 2810c1e99867 ("kunit: Fix wild-memory-access bug in
kunit_free_suite_set()") causes test suites to run while the test
module is still in MODULE_STATE_COMING. In that state, the module
is not fully initialized, lacking sysfs, module_memory, args, init
function. This behavior can cause all sorts of problems while using
fake devices.
This RFC patch restores the normal execution flow, waiting for module
initialization to complete before running the tests, and uses the
refcount to avoid calling kunit_free_suite_set() if load_module()
fails.
Fixes: 2810c1e99867 ("kunit: Fix wild-memory-access bug in kunit_free_suite_set()")
Signed-off-by: Marco Pagani <marpagan(a)redhat.com>
---
lib/kunit/test.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 421f13981412..242f26ad387a 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -784,13 +784,13 @@ static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
switch (val) {
case MODULE_STATE_LIVE:
+ kunit_module_init(mod);
break;
case MODULE_STATE_GOING:
- kunit_module_exit(mod);
+ if (module_refcount(mod) == -1)
+ kunit_module_exit(mod);
break;
case MODULE_STATE_COMING:
- kunit_module_init(mod);
- break;
case MODULE_STATE_UNFORMED:
break;
}
--
2.41.0
This series extends KVM RISC-V to allow Guest/VM discover and use
conditional operations related ISA extensions (namely XVentanaCondOps
and Zicond).
To try these patches, use KVMTOOL from riscv_zbx_zicntr_smstateen_condops_v1
branch at: https://github.com/avpatel/kvmtool.git
These patches are based upon the latest riscv_kvm_queue and can also be
found in the riscv_kvm_condops_v2 branch at:
https://github.com/avpatel/linux.git
Changes since v1:
- Rebased the series on riscv_kvm_queue
- Split PATCH1 and PATCH2 of v1 series into two patches
- Added separate test configs for XVentanaCondOps and Zicond in PATCH7
of v1 series.
Anup Patel (9):
dt-bindings: riscv: Add XVentanaCondOps extension entry
RISC-V: Detect XVentanaCondOps from ISA string
dt-bindings: riscv: Add Zicond extension entry
RISC-V: Detect Zicond from ISA string
RISC-V: KVM: Allow XVentanaCondOps extension for Guest/VM
RISC-V: KVM: Allow Zicond extension for Guest/VM
KVM: riscv: selftests: Add senvcfg register to get-reg-list test
KVM: riscv: selftests: Add smstateen registers to get-reg-list test
KVM: riscv: selftests: Add condops extensions to get-reg-list test
.../devicetree/bindings/riscv/extensions.yaml | 13 ++++
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/uapi/asm/kvm.h | 2 +
arch/riscv/kernel/cpufeature.c | 2 +
arch/riscv/kvm/vcpu_onereg.c | 4 ++
.../selftests/kvm/riscv/get-reg-list.c | 71 +++++++++++++++++++
6 files changed, 94 insertions(+)
--
2.34.1
There is no reason why the KUnit Tests for the property entry API can
only be built-in. Add support for building these tests as a loadable
module, like is supported by most other tests.
Signed-off-by: Geert Uytterhoeven <geert+renesas(a)glider.be>
---
drivers/base/test/Kconfig | 4 ++--
drivers/base/test/property-entry-test.c | 4 ++++
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/base/test/Kconfig b/drivers/base/test/Kconfig
index 9d42051f8f8e715e..5c7fac80611ce8bc 100644
--- a/drivers/base/test/Kconfig
+++ b/drivers/base/test/Kconfig
@@ -14,6 +14,6 @@ config DM_KUNIT_TEST
depends on KUNIT
config DRIVER_PE_KUNIT_TEST
- bool "KUnit Tests for property entry API" if !KUNIT_ALL_TESTS
- depends on KUNIT=y
+ tristate "KUnit Tests for property entry API" if !KUNIT_ALL_TESTS
+ depends on KUNIT
default KUNIT_ALL_TESTS
diff --git a/drivers/base/test/property-entry-test.c b/drivers/base/test/property-entry-test.c
index dd2b606d76a3f546..a8657eb06f94e934 100644
--- a/drivers/base/test/property-entry-test.c
+++ b/drivers/base/test/property-entry-test.c
@@ -506,3 +506,7 @@ static struct kunit_suite property_entry_test_suite = {
};
kunit_test_suite(property_entry_test_suite);
+
+MODULE_DESCRIPTION("Test module for the property entry API");
+MODULE_AUTHOR("Dmitry Torokhov <dtor(a)chromium.org>");
+MODULE_LICENSE("GPL");
--
2.34.1