Hello,
kernel test robot noticed "kernel-selftests.uevent.uevent_filtering.fail" on:
commit: 5b45a753776be5d21cf395ec97e81c9187fbeaca ("selftests: uevent filtering: fix return on error in uevent_listener")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
[test failed on linux-next/master 2030579113a1b1b5bfd7ff24c0852847836d8fd1]
in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:
group: group-03
compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
we also noticed this issue does not always happen. as below, we saw 15 failures
out of 50 runs. however, parent keeps passing.
37013b557b7f39e6 5b45a753776be5d21cf395ec97e
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
:50 30% 15:50 kernel-selftests.uevent.uevent_filtering.fail
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang(a)intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310261454.46082aaa-oliver.sang@intel.com
TAP version 13
1..1
# timeout set to 300
# selftests: uevent: uevent_filtering
# TAP version 13
# 1..1
# # Starting 1 tests from 1 test cases.
# # RUN global.uevent_filtering ...
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3532
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3546
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3556
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3585
# add@/devices/virtual/mem/fullACTION=addDEVPATH=/devices/virtual/mem/fullSUBSYSTEM=memSYNTH_UUID=0MAJOR=1MINOR=7DEVNAME=fullDEVMODE=0666SEQNUM=3595
# No buffer space available - Failed to receive uevent
# # uevent_filtering.c:479:uevent_filtering:Expected 0 (0) == ret (-1)
# # uevent_filtering: Test failed at step #10
# # FAIL global.uevent_filtering
# not ok 1 global.uevent_filtering
# # FAILED: 0 / 1 tests passed.
# # Totals: pass:0 fail:1 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: uevent: uevent_filtering # exit=1
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231026/202310261454.46082aaa-oliv…
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Kunit recently gained support to setup attributes, the first one being
the speed of a given test, then allowing to filter out slow tests.
A slow test is defined in the documentation as taking more than one
second. There's an another speed attribute called "super slow" but whose
definition is less clear.
Add support to the test runner to check the test execution time, and
report tests that should be marked as slow but aren't.
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
To: Brendan Higgins <brendan.higgins(a)linux.dev>
To: David Gow <davidgow(a)google.com>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Rae Moar <rmoar(a)google.com>
Cc: linux-kselftest(a)vger.kernel.org
Cc: kunit-dev(a)googlegroups.com
Cc: linux-kernel(a)vger.kernel.org
Changes from v1:
- Split the patch out of the series
- Change to trigger the warning only if the runtime is twice the
threshold (Jani, Rae)
- Split the speed check into a separate function (Rae)
- Link: https://lore.kernel.org/all/20230911-kms-slow-tests-v1-0-d3800a69a1a1@kerne…
---
lib/kunit/test.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 49698a168437..a1d5dd2bf87d 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -372,6 +372,25 @@ void kunit_init_test(struct kunit *test, const char *name, char *log)
}
EXPORT_SYMBOL_GPL(kunit_init_test);
+#define KUNIT_SPEED_SLOW_THRESHOLD_S 1
+
+static void kunit_run_case_check_speed(struct kunit *test,
+ struct kunit_case *test_case,
+ struct timespec64 duration)
+{
+ enum kunit_speed speed = test_case->attr.speed;
+
+ if (duration.tv_sec < (2 * KUNIT_SPEED_SLOW_THRESHOLD_S))
+ return;
+
+ if (speed == KUNIT_SPEED_VERY_SLOW || speed == KUNIT_SPEED_SLOW)
+ return;
+
+ kunit_warn(test,
+ "Test should be marked slow (runtime: %lld.%09lds)",
+ duration.tv_sec, duration.tv_nsec);
+}
+
/*
* Initializes and runs test case. Does not clean up or do post validations.
*/
@@ -379,6 +398,8 @@ static void kunit_run_case_internal(struct kunit *test,
struct kunit_suite *suite,
struct kunit_case *test_case)
{
+ struct timespec64 start, end;
+
if (suite->init) {
int ret;
@@ -390,7 +411,13 @@ static void kunit_run_case_internal(struct kunit *test,
}
}
+ ktime_get_ts64(&start);
+
test_case->run_case(test);
+
+ ktime_get_ts64(&end);
+
+ kunit_run_case_check_speed(test, test_case, timespec64_sub(end, start));
}
static void kunit_case_internal_cleanup(struct kunit *test)
--
2.41.0
This is the first part to add Intel VT-d nested translation based on IOMMUFD
nesting infrastructure. As the iommufd nesting infrastructure series[1],
iommu core supports new ops to allocate domains with user data. For nesting,
the user data is vendor-specific, IOMMU_HWPT_DATA_VTD_S1 is defined for
the Intel VT-d stage-1 page table, it will be used in the stage-1 domain
allocation path. struct iommu_hwpt_vtd_s1 is defined to pass user_data
for the Intel VT-d stage-1 domain allocation. This series does not have
the cache invalidation path, it would be added in part 2/2.
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix
in patch8 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.
In reality we don't know a usage relying on DMA reads from the BIOS
region. Hence finding a way to skip RO regions (e.g. via a discard manager)
in Qemu might be an acceptable tradeoff. The actual change needs more
discussion in Qemu community. For now we just hacked Qemu to test.
Complete code can be found in [3], corresponding QEMU could can be found
in [4].
[1] https://lore.kernel.org/linux-iommu/20231024150609.46884-1-yi.l.liu@intel.c…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v7:
- Rebase on top of latest iommufd nesting part 1/2
- Add the nested_parent flag in patch 07 and sanitize it for nested domain
allocation (Baolu)
- Fail the nested domain allocation if dirty tracking flag is set
v6: https://lore.kernel.org/linux-iommu/20231020093246.17015-1-yi.l.liu@intel.c…
- Add Kevin's r-b for patch 1 and 8
- Drop Kevin's r-b for patch 7
- Address comments from Kevin
- Split the VT-d nesting series into two parts 1/2 and 2/2
v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.…
- Add Kevin's r-b for patch 2, 3 ,5 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.…
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It makes sense to check every pfns used in the stage-1 page
table. But it cannot make it. So just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c…
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow read-only mappings to nest parent domain
Yi Liu (3):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 88 +++++++++++++++++----------
drivers/iommu/intel/iommu.h | 46 ++++++++++++--
drivers/iommu/intel/nested.c | 109 ++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 112 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
include/uapi/linux/iommufd.h | 42 ++++++++++++-
7 files changed, 362 insertions(+), 39 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1
Hi,
while testing a new patch on the livepatch kselftests, I was testing the gen_tar
target and I figured that we only copy the resulting binaries to the final tar
file.
Per the kselftests documentation[1], the gen_tar target is used to package the
tests to run "on different systems". But what if the different system has
different libraries/library versions? Wouldn't it be a problem?
This question came when I was working to build the livepatch modules as part of
the kselftests testing suit. The plan was to just package the test
scripts/programs/modules and then run the tests on a different system, likewise
a different SLE version. Since the kernel would be different in this case, I
expected that gen_tar would copy the module source files so they can be compiled
on the target system.
While the current approach can work when the selftests rely solely on shell scripts(cpufreq, kexec),
those who compile userspace binaries (cgroup, alsa, sched, ...) may not work.
Am I missing something? Is gen_tar only meant to copy the tests to be run on
systems with the same libraries or with the libraries with the exactly the same
version?
Thanks in advance,
Marcos
[1]: https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
Isolated cpuset partition can currently be created to contain an
exclusive set of CPUs not used in other cgroups and with load balancing
disabled to reduce interference from the scheduler.
The main purpose of this isolated partition type is to dynamically
emulate what can be done via the "isolcpus" boot command line option,
specifically the default domain flag. One effect of the "isolcpus" option
is to remove the isolated CPUs from the cpumasks of unbound workqueues
since running work functions in an isolated CPU can be a major source
of interference. Changing the unbound workqueue cpumasks can be done at
run time by writing an appropriate cpumask without the isolated CPUs to
/sys/devices/virtual/workqueue/cpumask. So one can set up an isolated
cpuset partition and then write to the cpumask sysfs file to achieve
similar level of CPU isolation. However, this manual process can be
error prone.
This patch series implements automatic exclusion of isolated CPUs from
unbound workqueue cpumasks when an isolated cpuset partition is created
and then adds those CPUs back when the isolated partition is destroyed.
There are also other places in the kernel that look at the HK_FLAG_DOMAIN
cpumask or other HK_FLAG_* cpumasks and exclude the isolated CPUs from
certain actions to further reduce interference. CPUs in an isolated
cpuset partition will not be able to avoid those interferences yet. That
may change in the future as the need arises.
Waiman Long (4):
workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs
from wq_unbound_cpumask
selftests/cgroup: Minor code cleanup and reorganization of
test_cpuset_prs.sh
cgroup/cpuset: Keep track of CPUs in isolated partitions
cgroup/cpuset: Take isolated CPUs out of workqueue unbound cpumask
Documentation/admin-guide/cgroup-v2.rst | 10 +-
include/linux/workqueue.h | 2 +-
kernel/cgroup/cpuset.c | 237 +++++++++++++-----
kernel/workqueue.c | 42 +++-
.../selftests/cgroup/test_cpuset_prs.sh | 209 +++++++++------
5 files changed, 350 insertions(+), 150 deletions(-)
--
2.39.3
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages of address translations to get access
to the physical address. A stage-1 translation table is owned by userspace
(e.g. by a guest OS), while a stage-2 is owned by kernel. Any change to a
stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is guest I/O
page table. As the below diagram shows, the guest I/O page table pointer
in GPA (guest physical address) is passed to host and be used to perform
a stage-1 translation. Along with it, a modification to present mappings
in the guest I/O page table should be followed by an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each hwpt is backed by an iommu_domain allocated from an iommu driver.
So in this series hw_pagetable and iommu_domain means the same thing if no
special note. IOMMUFD has already supported allocating hw_pagetable linked
with an IOAS. However, a nesting case requires IOMMUFD to allow allocating
hw_pagetable with driver specific parameters and interface to sync stage-1
IOTLB as user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1] and nested
parent domain allocation [2]. It first extends domain_alloc_user to allocate
hwpt with user data by allowing the IOMMUFD internal infrastructure to accept
user_data and parent hwpt, relaying the user_data/parent to the iommu core
to allocate IOMMU_DOMAIN_NESTED. And it then extends the IOMMU_HWPT_ALLOC
ioctl to accept user data and a parent hwpt ID.
Note that this series is the part-1 set of a two-part nesting series. It
does not include the cache invalidation interface, which will be added in
the part 2.
Complete code can be found in [3], it is on top of Joao's dirty page tracking
v6 series and fix patches. QEMU could can be found in [4].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
[1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged
[2] https://lore.kernel.org/linux-iommu/20230928071528.26258-1-yi.l.liu@intel.c… - merged
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v6:
- Rebase on top of Joao's dirty tracking series:
https://lore.kernel.org/linux-iommu/20231024135109.73787-1-joao.m.martins@o…
- Rebase on top of the enforce_cache_coherency removal patch:
https://lore.kernel.org/linux-iommu/ZTcAhwYjjzqM0A5M@Asurada-Nvidia/
- Add parent and user_data check in iommu driver before the driver actually
supports the two input. This can make better bisect support, the change is
in patch 02.
v5: https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c…
- Split the iommufd nesting series into two parts of alloc_user and
invalidation (Jason)
- Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
do the same with the structures/alloc()/abort()/destroy(). Reworked the
selftest accordingly too. (Jason)
- Move hwpt/data_type into struct iommu_user_data from standalone op
arguments. (Jason)
- Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
_TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
- Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
- Add macro to the iommu_copy_struct_from_user() to calculate min_size
(Jason)
- Fix two bugs spotted by ZhaoYan
v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested
translation, this does not change the rule that resv regions should only
be added to the kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be
added in later series as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation
(Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
I/O page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to
avoid passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in
iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08
of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Jason Gunthorpe (2):
iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING
iommufd/device: Wrap IOMMUFD_OBJ_HWPT_PAGING-only configurations
Lu Baolu (1):
iommu: Add IOMMU_DOMAIN_NESTED
Nicolin Chen (6):
iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable
iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED
iommufd: Add a nested HW pagetable object
iommu: Add iommu_copy_struct_from_user helper
iommufd/selftest: Add nested domain allocation for mock domain
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
Yi Liu (1):
iommu: Pass in parent domain with user_data to domain_alloc_user op
drivers/iommu/intel/iommu.c | 7 +-
drivers/iommu/iommufd/device.c | 157 +++++++---
drivers/iommu/iommufd/hw_pagetable.c | 271 +++++++++++++-----
drivers/iommu/iommufd/iommufd_private.h | 73 +++--
drivers/iommu/iommufd/iommufd_test.h | 18 ++
drivers/iommu/iommufd/main.c | 10 +-
drivers/iommu/iommufd/selftest.c | 151 ++++++++--
drivers/iommu/iommufd/vfio_compat.c | 6 +-
include/linux/iommu.h | 72 ++++-
include/uapi/linux/iommufd.h | 31 +-
tools/testing/selftests/iommu/iommufd.c | 120 ++++++++
.../selftests/iommu/iommufd_fail_nth.c | 3 +-
tools/testing/selftests/iommu/iommufd_utils.h | 31 +-
13 files changed, 768 insertions(+), 182 deletions(-)
--
2.34.1
Clang uses a different set of CLI args for coverage, and the output
needs to be processed by a different set of tools.
Update the Makefile and add an example of usage in kunit docs.
Michał Winiarski (2):
arch: um: Add Clang coverage support
Documentation: kunit: Add clang UML coverage example
Documentation/dev-tools/kunit/running_tips.rst | 11 +++++++++++
arch/um/Makefile-skas | 5 +++++
2 files changed, 16 insertions(+)
--
2.42.0
In some conditions, background processes in udpgro don't have enough
time to set up the sockets. When foreground processes start, this
results in the bad GRO lookup test freezing or reporting that it
received 0 gro segments.
To fix this, increase the time given to background processes to complete
the startup before foreground processes start.
This is the same issue and the same fix as posted by Adrien Therry.
Link: https://lore.kernel.org/all/20221101184809.50013-1-athierry@redhat.com/
Signed-off-by: Lucas Karpinski <lkarpins(a)redhat.com>
---
tools/testing/selftests/net/udpgro.sh | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/udpgro.sh b/tools/testing/selftests/net/udpgro.sh
index 0c743752669a..4ccbcb2390ad 100755
--- a/tools/testing/selftests/net/udpgro.sh
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -97,7 +97,8 @@ run_one_nat() {
echo "ok" || \
echo "failed"&
- sleep 0.1
+ # Hack: let bg programs complete the startup
+ sleep 0.2
./udpgso_bench_tx ${tx_args}
ret=$?
kill -INT $pid
--
2.41.0
Change ifconfig with ip command,
on a system where ifconfig is
not used this script will not
work correcly.
Test result with this patchset:
sudo make TARGETS="net" kselftest
....
TAP version 13
1..1
timeout set to 1500
selftests: net: route_localnet.sh
run arp_announce test
net.ipv4.conf.veth0.route_localnet = 1
net.ipv4.conf.veth1.route_localnet = 1
net.ipv4.conf.veth0.arp_announce = 2
net.ipv4.conf.veth1.arp_announce = 2
PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
bytes of data.
64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.038 ms
64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.068 ms
64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.068 ms
--- 127.25.3.14 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4073ms
rtt min/avg/max/mdev = 0.038/0.062/0.068/0.012 ms
ok
run arp_ignore test
net.ipv4.conf.veth0.route_localnet = 1
net.ipv4.conf.veth1.route_localnet = 1
net.ipv4.conf.veth0.arp_ignore = 3
net.ipv4.conf.veth1.arp_ignore = 3
PING 127.25.3.14 (127.25.3.14) from 127.25.3.4 veth0: 56(84)
bytes of data.
64 bytes from 127.25.3.14: icmp_seq=1 ttl=64 time=0.032 ms
64 bytes from 127.25.3.14: icmp_seq=2 ttl=64 time=0.065 ms
64 bytes from 127.25.3.14: icmp_seq=3 ttl=64 time=0.066 ms
64 bytes from 127.25.3.14: icmp_seq=4 ttl=64 time=0.065 ms
64 bytes from 127.25.3.14: icmp_seq=5 ttl=64 time=0.065 ms
--- 127.25.3.14 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4092ms
rtt min/avg/max/mdev = 0.032/0.058/0.066/0.013 ms
ok
ok 1 selftests: net: route_localnet.sh
...
Signed-off-by: Swarup Laxman Kotiaklapudi <swarupkotikalapudi(a)gmail.com>
---
tools/testing/selftests/net/route_localnet.sh | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/net/route_localnet.sh b/tools/testing/selftests/net/route_localnet.sh
index 116bfeab72fa..3ab9beb4462c 100755
--- a/tools/testing/selftests/net/route_localnet.sh
+++ b/tools/testing/selftests/net/route_localnet.sh
@@ -18,8 +18,10 @@ setup() {
ip route del 127.0.0.0/8 dev lo table local
ip netns exec "${PEER_NS}" ip route del 127.0.0.0/8 dev lo table local
- ifconfig veth0 127.25.3.4/24 up
- ip netns exec "${PEER_NS}" ifconfig veth1 127.25.3.14/24 up
+ ip a add 127.25.3.4/24 dev veth0
+ ip link set dev veth0 up
+ ip netns exec "${PEER_NS}" ip a add 127.25.3.14/24 dev veth1
+ ip netns exec "${PEER_NS}" ip link set dev veth1 up
ip route flush cache
ip netns exec "${PEER_NS}" ip route flush cache
--
2.34.1
On Fri, Oct 20, 2023 at 06:21:40AM +0200, Nicolas Schier wrote:
> On Thu, Oct 19, 2023 at 03:50:05PM -0300, Marcos Paulo de Souza wrote:
> > On Sat, Oct 14, 2023 at 05:35:55PM +0900, Masahiro Yamada wrote:
> > > On Tue, Oct 10, 2023 at 5:43 AM Marcos Paulo de Souza <mpdesouza(a)suse.de> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > I found an issue while moving the livepatch kselftest modules to be built on the
> > > > fly, instead of building them on kernel building.
> > > >
> > > > If, for some reason, there is a recursive make invocation that starts from the
> > > > top level Makefile and in the leaf Makefile it tries to build a module (using M=
> > > > in the make invocation), it doesn't produce the module. This happens because the
> > > > toplevel Makefile checks for M= only once. This is controlled by the
> > > > sub_make_done variable, which is exported after checking the command line
> > > > options are passed to the top level Makefile. Once this variable is set it's
> > > > the M= setting is never checked again on the recursive call.
> > > >
> > > > This can be observed when cleaning the bpf kselftest dir. When calling
> > > >
> > > > $ make TARGETS="bpf" SKIP_TARGETS="" kselftest-clean
> > > >
> > > > What happens:
> > > >
> > > > 1. It checks for some command line settings (like M=) was passed (it wasn't),
> > > > set some definitions and exports sub_make_done.
> > > >
> > > > 2. Jump into tools/testing/selftests/bpf, and calls the clean target.
> > > >
> > > > 3. The clean target is overwritten to remove some files and then jump to
> > > > bpf_testmod dir and call clean there
> > > >
> > > > 4. On bpf_testmod/Makefile, the clean target will execute
> > > > $(Q)make -C $(KDIR) M=$(BPF_TESTMOD_DIR) clean
> > > >
> > > > 5. The KDIR is to toplevel dir. The top Makefile will check that sub_make_done was
> > > > already set, ignoring the M= setting.
> > > >
> > > > 6. As M= wasn't checked, KBUILD_EXTMOD isn't set, and the clean target applies
> > > > to the kernel as a whole, making it clean all generated code/objects and
> > > > everything.
> > > >
> > > > One way to avoid it is to call "unexport sub_make_done" on
> > > > tools/testing/selftests/bpf/bpf_testmod/Makefile before processing the all
> > > > target, forcing the toplevel Makefile to process the M=, producing the module
> > > > file correctly.
> > > >
> > > > If the M=dir points to /lib/modules/.../build, then it fails with "m2c: No such
> > > > file", which I already reported here[1]. At the time this problem was treated
> > > > like a problem with kselftest infrastructure.
> > > >
> > > > Important: The process works fine if the initial make invocation is targeted to a
> > > > different directory (using -C), since it doesn't goes through the toplevel
> > > > Makefile, and sub_make_done variable is not set.
> > > >
> > > > I attached a minimal reproducer, that can be used to better understand the
> > > > problem. The "make testmod" and "make testmod-clean" have the same effect that
> > > > can be seem with the bpf kselftests. There is a unexport call commented on
> > > > test-mods/Makefile, and once that is called the process works as expected.
> > > >
> > > > Is there a better way to fix this? Is this really a problem, or am I missing
> > > > something?
> > >
> > >
> > > Or, using KBUILD_EXTMOD will work too.
> >
> > Yes, that works, only if set to /lib/modules:
> >
> > $ make kselftest TARGETS=bpf SKIP_TARGETS=""
> > make[3]: Entering directory '/home/mpdesouza/git/linux/tools/testing/selftests/bpf'
> > MOD bpf_testmod.ko
> > warning: the compiler differs from the one used to build the kernel
> > The kernel was built by: gcc (SUSE Linux) 13.2.1 20230803 [revision cc279d6c64562f05019e1d12d0d825f9391b5553]
> > You are using: gcc (SUSE Linux) 13.2.1 20230912 [revision b96e66fd4ef3e36983969fb8cdd1956f551a074b]
> > CC [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.o
> > MODPOST /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/Module.symvers
> > CC [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.mod.o
> > LD [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.ko
> > BTF [M] /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.ko
> > Skipping BTF generation for /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.ko due to unavailability of vmlinux
> > BINARY xdp_synproxy
> > ...
> >
> > But if we set the KBUILD_EXTMOD to toplevel Makefile, it fails with a different
> > strange issue:
> >
> > $ make kselftest TARGETS=bpf SKIP_TARGETS=""
> > BINARY urandom_read
> > MOD bpf_testmod.ko
> > m2c -o scripts/Makefile.build -e scripts/Makefile.build scripts/Makefile.build.mod
> > make[6]: m2c: No such file or directory
> > make[6]: *** [<builtin>: scripts/Makefile.build] Error 127
> > make[5]: *** [Makefile:1913: /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod] Error 2
> > make[4]: *** [Makefile:19: all] Error 2
> > make[3]: *** [Makefile:229: /home/mpdesouza/git/linux/tools/testing/selftests/bpf/bpf_testmod.ko] Error 2
> > make[3]: Leaving directory '/home/mpdesouza/git/linux/tools/testing/selftests/bpf'
> > make[2]: *** [Makefile:175: all] Error 2
> > make[1]: *** [/home/mpdesouza/git/linux/Makefile:1362: kselftest] Error 2
> >
> > I attached a patch that can reproduce the case where it works, and the case
> > where it doesn't by changing the value of KDIR.
> >
> > I understand that KBUILD_EXTMOD, as the name implies, was designed to build
> > "external" modules, and not ones that live inside kernel, but how could this be
> > solved?
>
> It seems to me as if there is some confusion about in-tree vs.
> out-of-tree kmods.
>
> KBUILD_EXTMOD and M are almost the same and indicate that you want to
> build _external_ (=out-of-tree) kernel modules. In-tree modules are
> only those that stay in-tree _and_ are built along with the kernel.
> Thus, 'make modules KBUILD_EXTMOD=fs/ext4' could be used to build ext4
> kmod as "out-of-tree" kernel module, that even taints the kernel if it
> gets loaded.
>
> If you want bpf_testmod.ko to be an in-tree kmod, it has to be build
> during the usual kernel build, not by running 'make kselftest'.
>
> If you use 'make -C $(KDIR)' for building out-of-tree kmods, KDIR has to
> point to the kernel build directory. (Or it may point to the source
> tree if you give O=$(BUILDDIR) as well).
Thanks for the explanation Nicolas. In this, I believe that the BPF module
should be moved into lib/, like lib/livepatch, when then be built along with
other in-tree modules.
Currently there is a bug when running the kselftests-clean target with bpf:
make kselftest-clean TARGETS=bpf SKIP_TARGETS=""
As the M= argument is ignore on the toplevel Makefile, this make invocation
applies the clean to all built kernel objects/modules/everything, which is bug
IMO.
There is a statement in the BPF docs saying that the selftests should be run
inside the tools/testing/selftests/bpf directory. At the same time, kselftests
should comply with all the targets defined in the documention, like gen_tar, and
run_tests. In this case should the build process be fixed, or just make
kselftests less restrict?
(CCing kselftests and bpf ML)
>
> HTH.
>
> Kind regards,
> Nicolas
>
>
> > For the sake of my initial about livepatch kselftests, KBUILD_EXTMOD
> > will suffice, since we will target /lib/modules, but I would like to know what
> > we can do in this case. Do you have other suggestions?
> >
> > Thanks in advance,
> > Marcos
> >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Best Regards
> > > Masahiro Yamada
>
> > diff --git a/tools/testing/selftests/bpf/bpf_testmod/Makefile b/tools/testing/selftests/bpf/bpf_testmod/Makefile
> > index 15cb36c4483a..1dce76f35405 100644
> > --- a/tools/testing/selftests/bpf/bpf_testmod/Makefile
> > +++ b/tools/testing/selftests/bpf/bpf_testmod/Makefile
> > @@ -1,5 +1,6 @@
> > BPF_TESTMOD_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST)))))
> > -KDIR ?= $(abspath $(BPF_TESTMOD_DIR)/../../../../..)
> > +#KDIR ?= $(abspath $(BPF_TESTMOD_DIR)/../../../../..)
> > +KDIR ?= /lib/modules/$(shell uname -r)/build
> >
> > ifeq ($(V),1)
> > Q =
> > @@ -12,9 +13,10 @@ MODULES = bpf_testmod.ko
> > obj-m += bpf_testmod.o
> > CFLAGS_bpf_testmod.o = -I$(src)
> >
> > +export KBUILD_EXTMOD := $(BPF_TESTMOD_DIR)
> > +
> > all:
> > - +$(Q)make -C $(KDIR) M=$(BPF_TESTMOD_DIR) modules
> > + +$(Q)make -C $(KDIR) modules
> >
> > clean:
> > - +$(Q)make -C $(KDIR) M=$(BPF_TESTMOD_DIR) clean
> > -
> > + +$(Q)make -C $(KDIR) clean
>
This patch series introduces UFFDIO_MOVE feature to userfaultfd, which
has long been implemented and maintained by Andrea in his local tree [1],
but was not upstreamed due to lack of use cases where this approach would
be better than allocating a new page and copying the contents. Previous
upstraming attempts could be found at [6] and [7].
UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
needs pages to be allocated [2]. However, with UFFDIO_MOVE, if pages are
available (in userspace) for recycling, as is usually the case in heap
compaction algorithms, then we can avoid the page allocation and memcpy
(done by UFFDIO_COPY). Also, since the pages are recycled in the
userspace, we avoid the need to release (via madvise) the pages back to
the kernel [3].
We see over 40% reduction (on a Google pixel 6 device) in the compacting
thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
measured using a benchmark that emulates a heap compaction implementation
using userfaultfd (to allow concurrent accesses by application threads).
More details of the usecase are explained in [3].
Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
touching them within the same vma. Today, it can only be done by mremap,
however it forces splitting the vma.
Main changes since Andrea's last version [1]:
- Trivial translations from page to folio, mmap_sem to mmap_lock
- Replace pmd_trans_unstable() with pte_offset_map_nolock() and handle its
possible failure
- Move pte mapping into remap_pages_pte to allow for retries when source
page or anon_vma is contended. Since pte_offset_map_nolock() start RCU
read section, we can't block anymore after mapping a pte, so have to unmap
the ptesm do the locking and retry.
- Add and use anon_vma_trylock_write() to avoid blocking while in RCU
read section.
- Accommodate changes in mmu_notifier_range_init() API, switch to
mmu_notifier_invalidate_range_start_nonblock() to avoid blocking while in
RCU read section.
- Open-code now removed __swp_swapcount()
- Replace pmd_read_atomic() with pmdp_get_lockless()
- Add new selftest for UFFDIO_MOVE
Changes since v1 [4]:
- add mmget_not_zero in userfaultfd_remap, per Jann Horn
- removed extern from function definitions, per Matthew Wilcox
- converted to folios in remap_pages_huge_pmd, per Matthew Wilcox
- use PageAnonExclusive in remap_pages_huge_pmd, per David Hildenbrand
- handle pgtable transfers between MMs, per Jann Horn
- ignore concurrent A/D pte bit changes, per Jann Horn
- split functions into smaller units, per David Hildenbrand
- test for folio_test_large in remap_anon_pte, per Matthew Wilcox
- use pte_swp_exclusive for swapcount check, per David Hildenbrand
- eliminated use of mmu_notifier_invalidate_range_start_nonblock,
per Jann Horn
- simplified THP alignment checks, per Jann Horn
- refactored the loop inside remap_pages, per Jann Horn
- additional clarifying comments, per Jann Horn
Changes since v2 [5]:
- renamed UFFDIO_REMAP to UFFDIO_MOVE, per David Hildenbrand
- rebase over mm-unstable to use folio_move_anon_rmap(),
per David Hildenbrand
- added text for manpage explaining DONTFORK and KSM requirements for this
feature, per David Hildenbrand
- check for anon_vma changes in the fast path of folio_lock_anon_vma_read,
per Peter Xu
- updated the title and description of the first patch,
per David Hildenbrand
- updating comments in folio_lock_anon_vma_read() explaining the need for
anon_vma checks, per David Hildenbrand
- changed all mapcount checks to PageAnonExclusive, per Jann Horn and
David Hildenbrand
- changed counters in remap_swap_pte() from MM_ANONPAGES to MM_SWAPENTS,
per Jann Horn
- added a check for PTE change after folio is locked in remap_pages_pte(),
per Jann Horn
- added handling of PMD migration entries and bailout when pmd_devmap(),
per Jann Horn
- added checks to ensure both src and dst VMAs are writable, per Peter Xu
- added UFFD_FEATURE_MOVE, per Peter Xu
- removed obsolete comments, per Peter Xu
- renamed remap_anon_pte to remap_present_pte, per Peter Xu
- added a comment for folio_get_anon_vma() explaining the need for
anon_vma checks, per Peter Xu
- changed error handling in remap_pages() to make it more clear,
per Peter Xu
- changed EFAULT to EAGAIN to retry when a hugepage appears or disappears
from under us, per Peter Xu
- added links to previous upstreaming attempts, per David Hildenbrand
[1] https://gitlab.com/aarcange/aa/-/commit/2aec7aea56b10438a3881a20a411aa4b1fc…
[2] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redha…
[3] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyj…
[4] https://lore.kernel.org/all/20230914152620.2743033-1-surenb@google.com/
[5] https://lore.kernel.org/all/20230923013148.1390521-1-surenb@google.com/
[6] https://lore.kernel.org/all/1425575884-2574-21-git-send-email-aarcange@redh…
[7] https://lore.kernel.org/all/cover.1547251023.git.blake.caldwell@colorado.ed…
The patchset applies over mm-unstable.
Andrea Arcangeli (2):
mm/rmap: support move to different root anon_vma in
folio_move_anon_rmap()
userfaultfd: UFFDIO_MOVE uABI
Suren Baghdasaryan (1):
selftests/mm: add UFFDIO_MOVE ioctl test
Documentation/admin-guide/mm/userfaultfd.rst | 3 +
fs/userfaultfd.c | 63 ++
include/linux/rmap.h | 5 +
include/linux/userfaultfd_k.h | 12 +
include/uapi/linux/userfaultfd.h | 29 +-
mm/huge_memory.c | 138 +++++
mm/khugepaged.c | 3 +
mm/rmap.c | 30 +
mm/userfaultfd.c | 602 +++++++++++++++++++
tools/testing/selftests/mm/uffd-common.c | 41 +-
tools/testing/selftests/mm/uffd-common.h | 1 +
tools/testing/selftests/mm/uffd-unit-tests.c | 62 ++
12 files changed, 986 insertions(+), 3 deletions(-)
--
2.42.0.609.gbb76f46606-goog
From: Jeff Xu <jeffxu(a)google.com>
This patchset proposes a new mseal() syscall for the Linux kernel.
Modern CPUs support memory permissions such as RW and NX bits. Linux has
supported NX since the release of kernel version 2.6.8 in August 2004 [1].
The memory permission feature improves security stance on memory
corruption bugs, i.e. the attacker can’t just write to arbitrary memory
and point the code to it, the memory has to be marked with X bit, or
else an exception will happen.
Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where
a corrupted pointer is passed to a memory management syscall. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
Also, Chrome wants to adopt this feature for their CFI work [2] and this
patchset has been designed to be compatible with the Chrome use case.
The new mseal() is an architecture independent syscall, and with
following signature:
mseal(void addr, size_t len, unsigned int types, unsigned int flags)
addr/len: memory range. Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to comments in mseal.c. Those are also fully
covered by the selftest.
types: bit mask to specify which syscall to seal, currently they are:
MM_SEAL_MSEAL 0x1
MM_SEAL_MPROTECT 0x2
MM_SEAL_MUNMAP 0x4
MM_SEAL_MMAP 0x8
MM_SEAL_MREMAP 0x10
Each bit represents sealing for one specific syscall type, e.g.
MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
is that the API is extendable, i.e. when needed, the sealing can be
extended to madvise, mlock, etc. Backward compatibility is also easy.
The kernel will remember which seal types are applied, and the application
doesn’t need to repeat all existing seal types in the next mseal(). Once
a seal type is applied, it can’t be unsealed. Call mseal() on an existing
seal type is a no-action, not a failure.
MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
Internally, vm_area_struct adds a new field vm_seals, to store the bit
masks.
For the affected syscalls, such as mprotect, a check(can_modify_mm) for
sealing is added, this usually happens at the early point of the syscall,
before any update is made to VMAs. The effect of that is: if any of the
VMAs in the given address range fails the sealing check, none of the VMA
will be updated. It might be worth noting that this is different from the
rest of mprotect(), where some updates can happen even when mprotect
returns fail. Consider can_modify_mm only checks vm_seals in
vm_area_struct, and it is not going deeper in the page table or updating
any HW, success or none behavior might fit better here. I would like to
listen to the community's feedback on this.
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5], Chrome browser in ChromeOS will be the first user of this API.
In addition, Stephen is working on glibc change to add sealing support
into the dynamic linker to seal all non-writable segments at startup. When
that work is completed, all applications can automatically benefit from
these new protections.
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
Jeff Xu (8):
Add mseal syscall
Wire up mseal syscall
mseal: add can_modify_mm and can_modify_vma
mseal: seal mprotect
mseal munmap
mseal mremap
mseal mmap
selftest mm/mseal mprotect/munmap/mremap/mmap
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/aio.c | 5 +-
include/linux/mm.h | 55 +-
include/linux/mm_types.h | 7 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/mman.h | 6 +
ipc/shm.c | 3 +-
kernel/sys_ni.c | 1 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/internal.h | 4 +-
mm/mmap.c | 49 +-
mm/mprotect.c | 6 +
mm/mremap.c | 19 +-
mm/mseal.c | 328 +++++
mm/nommu.c | 6 +-
mm/util.c | 8 +-
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++
37 files changed, 1934 insertions(+), 28 deletions(-)
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.42.0.609.gbb76f46606-goog
Dzień dobry,
dostrzegam możliwość współpracy z Państwa firmą.
Świadczymy kompleksową obsługę inwestycji w fotowoltaikę, która obniża koszty energii elektrycznej.
Czy są Państwo zainteresowani weryfikacją wstępnych propozycji?
Pozdrawiam,
Kamil Lasek
We recently encountered a bug that makes all zswap store attempt fail.
Specifically, after:
"141fdeececb3 mm/zswap: delay the initialization of zswap"
if we build a kernel with zswap disabled by default, then enabled after
the swapfile is set up, the zswap tree will not be initialized. As a
result, all zswap store calls will be short-circuited. We have to
perform another swapon to get zswap working properly again.
Fortunately, this issue has since been fixed by the patch that kills
frontswap:
"42c06a0e8ebe mm: kill frontswap"
which performs zswap_swapon() unconditionally, i.e always initializing
the zswap tree.
This test add a sanity check that ensure zswap storing works as
intended.
Signed-off-by: Nhat Pham <nphamcs(a)gmail.com>
---
tools/testing/selftests/cgroup/test_zswap.c | 48 +++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/tools/testing/selftests/cgroup/test_zswap.c b/tools/testing/selftests/cgroup/test_zswap.c
index 49def87a909b..c99d2adaca3f 100644
--- a/tools/testing/selftests/cgroup/test_zswap.c
+++ b/tools/testing/selftests/cgroup/test_zswap.c
@@ -55,6 +55,11 @@ static int get_zswap_written_back_pages(size_t *value)
return read_int("/sys/kernel/debug/zswap/written_back_pages", value);
}
+static long get_zswpout(const char *cgroup)
+{
+ return cg_read_key_long(cgroup, "memory.stat", "zswpout ");
+}
+
static int allocate_bytes(const char *cgroup, void *arg)
{
size_t size = (size_t)arg;
@@ -68,6 +73,48 @@ static int allocate_bytes(const char *cgroup, void *arg)
return 0;
}
+/*
+ * Sanity test to check that pages are written into zswap.
+ */
+static int test_zswap_usage(const char *root)
+{
+ long zswpout_before, zswpout_after;
+ int ret = KSFT_FAIL;
+ char *test_group;
+
+ /* Set up */
+ test_group = cg_name(root, "no_shrink_test");
+ if (!test_group)
+ goto out;
+ if (cg_create(test_group))
+ goto out;
+ if (cg_write(test_group, "memory.max", "1M"))
+ goto out;
+
+ zswpout_before = get_zswpout(test_group);
+ if (zswpout_before < 0) {
+ ksft_print_msg("Failed to get zswpout\n");
+ goto out;
+ }
+
+ /* Allocate more than memory.max to push memory into zswap */
+ if (cg_run(test_group, allocate_bytes, (void *)MB(4)))
+ goto out;
+
+ /* Verify that pages come into zswap */
+ zswpout_after = get_zswpout(test_group);
+ if (zswpout_after <= zswpout_before) {
+ ksft_print_msg("zswpout does not increase after test program\n");
+ goto out;
+ }
+ ret = KSFT_PASS;
+
+out:
+ cg_destroy(test_group);
+ free(test_group);
+ return ret;
+}
+
/*
* When trying to store a memcg page in zswap, if the memcg hits its memory
* limit in zswap, writeback should not be triggered.
@@ -235,6 +282,7 @@ struct zswap_test {
int (*fn)(const char *root);
const char *name;
} tests[] = {
+ T(test_zswap_usage),
T(test_no_kmem_bypass),
T(test_no_invasive_cgroup_shrink),
};
--
2.34.1
This is the first part to add Intel VT-d nested translation based on IOMMUFD
nesting infrastructure. As the iommufd nesting infrastructure series[1],
iommu core supports new ops to allocate domains with user data. For nesting,
the user data is vendor-specific, IOMMU_HWPT_DATA_VTD_S1 is defined for
the Intel VT-d stage-1 page table, it will be used in the stage-1 domain
allocation path. struct iommu_hwpt_vtd_s1 is defined to pass user_data
for the Intel VT-d stage-1 domain allocation. This series does not have
the cache invalidation path, it would be added in part 2/2.
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix
in patch8 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.
In reality we don't know a usage relying on DMA reads from the BIOS
region. Hence finding a way to skip RO regions (e.g. via a discard manager)
in Qemu might be an acceptable tradeoff. The actual change needs more
discussion in Qemu community. For now we just hacked Qemu to test.
Complete code can be found in [3], corresponding QEMU could can be found
in [4].
[1] https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v6:
- Add Kevin's r-b for patch 1 and 8
- Drop Kevin's r-b for patch 7
- Address comments from Kevin
- Split the VT-d nesting series into two parts 1/2 and 2/2
v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.…
- Add Kevin's r-b for patch 2, 3 ,5 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.…
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It makes sense to check every pfns used in the stage-1 page
table. But it cannot make it. So just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c…
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow read-only mappings to nest parent domain
Yi Liu (3):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 63 +++++++++++++-------
drivers/iommu/intel/iommu.h | 46 ++++++++++++--
drivers/iommu/intel/nested.c | 109 ++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 112 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
include/uapi/linux/iommufd.h | 42 ++++++++++++-
7 files changed, 348 insertions(+), 28 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages (stage-1, stage-2) address translation
to get access to the physical address. stage-1 translation table is owned
by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes
to stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is I/O page
table. As the below diagram shows, guest I/O page table pointer in GPA
(guest physical address) is passed to host and be used to perform the stage-1
address translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed with an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each has an iommu_domain allocated from iommu driver. So in this series
hw_pagetable and iommu_domain means the same thing if no special note.
IOMMUFD has already supported allocating hw_pagetable that is linked with
an IOAS. However, nesting requires IOMMUFD to allow allocating hw_pagetable
with driver specific parameters and interface to sync stage-1 IOTLB as user
owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1]. It first
extends domain_alloc_user to allocate domains with user data and adds new
op for invalidate stage-1 IOTLB for user-managed domains, then extends the
IOMMUFD internal infrastructure to accept user_data and parent hwpt, relay
the user_data/parent to iommu core to allocate user-managed iommu_domain.
After it, extends the ioctl IOMMU_HWPT_ALLOC to accept user data and stage-2
hwpt ID. Along with it, ioctl IOMMU_HWPT_INVALIDATE is added to invalidate
stage-1 IOTLB. This is needed for user-managed hwpts. Selftest is added as
well to cover the new ioctls.
Complete code can be found in [2], QEMU could can be found in [3].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
[1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v4:
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation,
this does not change the rule that resv regions should only be added to the
kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be added in later series
as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O
page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to avoid
passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Lu Baolu (1):
iommu: Add nested domain support
Nicolin Chen (12):
iommufd: Unite all kernel-managed members into a struct
iommufd: Separate kernel-managed HWPT alloc/destroy/abort functions
iommufd: Add shared alloc_fn function pointer and mutex pointer
iommufd: Add user-managed hw_pagetable support
iommufd: Always setup MSI and anforce cc on kernel-managed domains
iommufd/device: Add helpers to enforce/remove device reserved regions
iommufd/selftest: Rework TEST_LENGTH to test min_size explicitly
iommufd/selftest: Add nested domain allocation for mock domain
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
iommufd/selftest: Add mock_domain_cache_invalidate_user support
iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
Yi Liu (4):
iommu: Add hwpt_type with user_data for domain_alloc_user op
iommufd: Pass in hwpt_type/user_data to iommufd_hw_pagetable_alloc()
iommufd: Support IOMMU_HWPT_ALLOC allocation with user data
iommufd: Add IOMMU_HWPT_INVALIDATE
drivers/iommu/intel/iommu.c | 5 +-
drivers/iommu/iommufd/device.c | 51 +++-
drivers/iommu/iommufd/hw_pagetable.c | 257 ++++++++++++++++--
drivers/iommu/iommufd/iommufd_private.h | 59 +++-
drivers/iommu/iommufd/iommufd_test.h | 40 +++
drivers/iommu/iommufd/main.c | 3 +
drivers/iommu/iommufd/selftest.c | 184 ++++++++++++-
include/linux/iommu.h | 110 +++++++-
include/uapi/linux/iommufd.h | 60 +++-
tools/testing/selftests/iommu/iommufd.c | 209 +++++++++++++-
.../selftests/iommu/iommufd_fail_nth.c | 3 +-
tools/testing/selftests/iommu/iommufd_utils.h | 91 ++++++-
12 files changed, 998 insertions(+), 74 deletions(-)
--
2.34.1
The sysfs code for online targets updating can result in adding more than
expected monigoring targets to the context. It can result in unexpected amount
of memory consumption and monitoring overhead. This patchset fixes the issue
(patch 1), and add a kunit test for avoiding similar bug of future (patch 2).
SeongJae Park (2):
mm/damon/sysfs: remove requested targets when online-commit inputs
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
mm/damon/Kconfig | 12 ++++++
mm/damon/sysfs-test.h | 86 +++++++++++++++++++++++++++++++++++++++++++
mm/damon/sysfs.c | 52 ++++++--------------------
3 files changed, 109 insertions(+), 41 deletions(-)
create mode 100644 mm/damon/sysfs-test.h
base-commit: 9a969da6ffb9609f5fa8d0b7fdc6859c37a10335
--
2.34.1
This is the second part to add Intel VT-d nested translation based on IOMMUFD
nesting infrastructure. As the iommufd nesting infrastructure series [1],
iommu core supports new ops to invalidate the cache after the modifictions
in stage-1 page table. So far, the cache invalidation data is vendor specific,
the data_type (IOMMU_HWPT_DATA_VTD_S1) defined for the vendor specific HWPT
allocation is reused in the cache invalidation path. User should provide the
correct data_type that suit with the type used in HWPT allocation.
IOMMU_HWPT_INVALIDATE iotcl returns an error in @out_driver_error_code. However
Intel VT-d does not define error code so far, so it's not easy to pre-define it
in iommufd neither. As a result, this field should just be ignored on VT-d platform.
Complete code can be found in [2], corresponding QEMU could can be found in [3].
[1] https://lore.kernel.org/linux-iommu/20231020092426.13907-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v6:
- Address comments from Kevin
- Split the VT-d nesting series into two parts (Jason)
v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.…
- Add Kevin's r-b for patch 2, 3 ,5 8, 10
- Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
- Remove duplicate agaw check in patch 04 (Kevin)
- Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
- Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
- uapi data structure check (Kevin)
- Simplify the errata handling as user can allocate nested parent domain
v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.…
- Remove ascii art tables (Jason)
- Drop EMT (Tina, Jason)
- Drop MTS and related definitions (Kevin)
- Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
- Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
- Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin)
- Do not trim the higher page levels of S2 domain in nested domain attachment as the
S2 domain may have been used independently. (Kevin)
- Remove the first-stage pgd check against the maximum address of s2_domain as hw
can check it anyhow. It makes sense to check every pfns used in the stage-1 page
table. But it cannot make it. So just leave it to hw. (Kevin)
- Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
- Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
- Minor fix for "make htmldocs"
v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.c…
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Yi Liu (3):
iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
iommu/vt-d: Make iotlb flush helpers to be extern
iommu/vt-d: Add iotlb flush for nested domain
drivers/iommu/intel/iommu.c | 10 +++----
drivers/iommu/intel/iommu.h | 6 ++++
drivers/iommu/intel/nested.c | 54 ++++++++++++++++++++++++++++++++++++
include/uapi/linux/iommufd.h | 36 ++++++++++++++++++++++++
4 files changed, 101 insertions(+), 5 deletions(-)
--
2.34.1
Hi Linus,
Please pull the following Kselftest fixes update for Linux 6.6-rc7.
This Kselftest update for Linux 6.6-rc7 consists of one single fix
to assert check in user_events abi_test to properly check bit value
on Big Endian architectures. The current code treats the bit values
as Little Endian and the check fails on Big Endian.
diff is attached.
thanks,
-- Shuah
----------------------------------------------------------------
The following changes since commit 6f874fa021dfc7bf37f4f37da3a5aaa41fe9c39c:
selftests: Fix wrong TARGET in kselftest top level Makefile (2023-09-26 18:47:37 -0600)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest tags/linux_kselftest_active-fixes-6.6-rc7
for you to fetch changes up to cf5a103c98a6fb9ee3164334cb5502df6360749b:
selftests/user_events: Fix abi_test for BE archs (2023-10-17 15:07:19 -0600)
----------------------------------------------------------------
linux_kselftest_active-fixes-6.6-rc7
This Kselftest update for Linux 6.6-rc7 consists of one single fix
to assert check in user_events abi_test to properly check bit value
on Big Endian architectures. The current code treats the bit values
as Little Endian and the check fails on Big Endian.
----------------------------------------------------------------
Beau Belgrave (1):
selftests/user_events: Fix abi_test for BE archs
tools/testing/selftests/user_events/abi_test.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
----------------------------------------------------------------
Changelog:
v3:
* Add a patch to export per-cgroup zswap writeback counters
* Add a patch to update zswap's kselftest
* Separate the new list_lru functions into its own prep patch
* Do not start from the top of the hierarchy when encounter a memcg
that is not online for the global limit zswap writeback (patch 2)
(suggested by Yosry Ahmed)
* Do not remove the swap entry from list_lru in
__read_swapcache_async() (patch 2) (suggested by Yosry Ahmed)
* Removed a redundant zswap pool getting (patch 2)
(reported by Ryan Roberts)
* Use atomic for the nr_zswap_protected (instead of lruvec's lock)
(patch 5) (suggested by Yosry Ahmed)
* Remove the per-cgroup zswap shrinker knob (patch 5)
(suggested by Yosry Ahmed)
v2:
* Fix loongarch compiler errors
* Use pool stats instead of memcg stats when !CONFIG_MEMCG_KEM
There are currently several issues with zswap writeback:
1. There is only a single global LRU for zswap, making it impossible to
perform worload-specific shrinking - an memcg under memory pressure
cannot determine which pages in the pool it owns, and often ends up
writing pages from other memcgs. This issue has been previously
observed in practice and mitigated by simply disabling
memcg-initiated shrinking:
https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u
But this solution leaves a lot to be desired, as we still do not
have an avenue for an memcg to free up its own memory locked up in
the zswap pool.
2. We only shrink the zswap pool when the user-defined limit is hit.
This means that if we set the limit too high, cold data that are
unlikely to be used again will reside in the pool, wasting precious
memory. It is hard to predict how much zswap space will be needed
ahead of time, as this depends on the workload (specifically, on
factors such as memory access patterns and compressibility of the
memory pages).
This patch series solves these issues by separating the global zswap
LRU into per-memcg and per-NUMA LRUs, and performs workload-specific
(i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The
new shrinker does not have any parameter that must be tuned by the
user, and can be opted in or out on a per-memcg basis.
As a proof of concept, we ran the following synthetic benchmark:
build the linux kernel in a memory-limited cgroup, and allocate some
cold data in tmpfs to see if the shrinker could write them out and
improved the overall performance. Depending on the amount of cold data
generated, we observe from 14% to 35% reduction in kernel CPU time used
in the kernel builds.
Domenico Cerasuolo (3):
zswap: make shrinking memcg-aware
mm: memcg: add per-memcg zswap writeback stat
selftests: cgroup: update per-memcg zswap writeback selftest
Nhat Pham (2):
mm: list_lru: allow external numa node and cgroup tracking
zswap: shrinks zswap pool based on memory pressure
Documentation/admin-guide/mm/zswap.rst | 7 +
include/linux/list_lru.h | 38 +++
include/linux/memcontrol.h | 7 +
include/linux/mmzone.h | 14 +
mm/list_lru.c | 43 ++-
mm/memcontrol.c | 15 +
mm/mmzone.c | 3 +
mm/swap.h | 3 +-
mm/swap_state.c | 38 ++-
mm/zswap.c | 335 ++++++++++++++++----
tools/testing/selftests/cgroup/test_zswap.c | 74 +++--
11 files changed, 485 insertions(+), 92 deletions(-)
--
2.34.1
From: Jeff Xu <jeffxu(a)google.com>
This patchset proposes a new mseal() syscall for the Linux kernel.
Modern CPUs support memory permissions such as RW and NX bits. Linux has
supported NX since the release of kernel version 2.6.8 in August 2004 [1].
The memory permission feature improves security stance on memory
corruption bugs, i.e. the attacker can’t just write to arbitrary memory
and point the code to it, the memory has to be marked with X bit, or
else an exception will happen. The protection is set by mmap(2),
mprotect(2), mremap(2).
Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where
a corrupted pointer is passed to a memory management syscall. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
Also, Chrome wants to adopt this feature for their CFI work [2] and this
patchset has been designed to be compatible with the Chrome use case.
The new mseal() is an architecture independent syscall, and with
following signature:
mseal(void addr, size_t len, unsigned long types, unsigned long flags)
addr/len: memory range. Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to comments in mseal.c. Those are also fully
covered by the selftest.
types: bit mask to specify which syscall to seal.
Five syscalls can be sealed, as specified by bitmasks:
MM_SEAL_MPROTECT: Deny mprotect(2)/pkey_mprotect(2).
MM_SEAL_MUNMAP: Deny munmap(2).
MM_SEAL_MMAP: Deny mmap(2).
MM_SEAL_MREMAP: Deny mremap(2).
MM_SEAL_MSEAL: Deny adding a new seal type.
Each bit represents sealing for one specific syscall type, e.g.
MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
is that the API is extendable, i.e. when needed, the sealing can be
extended to madvise, mlock, etc. Backward compatibility is also easy.
The kernel will remember which seal types are applied, and the application
doesn’t need to repeat all existing seal types in the next mseal(). Once
a seal type is applied, it can’t be unsealed. Call mseal() on an existing
seal type is a no-action, not a failure.
MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
Internally, vm_area_struct adds a new field vm_seals, to store the bit
masks.
For the affected syscalls, such as mprotect, a check(can_modify_mm) for
sealing is added, this usually happens at the early point of the syscall,
before any update is made to VMAs. The effect of that is: if any of the
VMAs in the given address range fails the sealing check, none of the VMA
will be updated.
The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5], Chrome browser in ChromeOS will be the first user of this API.
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b…
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXge…
PATCH history:
v1:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add detailed commit message.
v0:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/
Jeff Xu (8):
mseal: Add mseal(2) syscall.
mseal: Wire up mseal syscall
mseal: add can_modify_mm and can_modify_vma
mseal: Check seal flag for mprotect(2)
mseal: Check seal flag for munmap(2)
mseal: Check seal flag for mremap(2)
mseal:Check seal flag for mmap(2)
selftest mm/mseal mprotect/munmap/mremap/mmap
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/aio.c | 5 +-
include/linux/mm.h | 44 +-
include/linux/mm_types.h | 7 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/mman.h | 6 +
ipc/shm.c | 3 +-
kernel/sys_ni.c | 1 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/internal.h | 4 +-
mm/mmap.c | 57 +-
mm/mprotect.c | 15 +
mm/mremap.c | 30 +-
mm/mseal.c | 268 ++++
mm/nommu.c | 6 +-
mm/util.c | 8 +-
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++
37 files changed, 1891 insertions(+), 28 deletions(-)
create mode 100644 mm/mseal.c
create mode 100644 tools/testing/selftests/mm/mseal_test.c
--
2.42.0.655.g421f12c284-goog
Use a filter function to skip the time namespace test on systems with
!CONFIG_TIME_NS. This reworks a fix originally done by Tiezhu Yang prior
to the refactoring in 34dce23f7e40 ("selftests/clone3: Report descriptive
test names"). The changelog from their fix explains the issue very clearly:
When execute the following command to test clone3 under !CONFIG_TIME_NS:
# make headers && cd tools/testing/selftests/clone3 && make && ./clone3
we can see the following error info:
# [7538] Trying clone3() with flags 0x80 (size 0)
# Invalid argument - Failed to create new process
# [7538] clone3() with flags says: -22 expected 0
not ok 18 [7538] Result (-22) is different than expected (0)
...
# Totals: pass:18 fail:1 xfail:0 xpass:0 skip:0 error:0
This is because if CONFIG_TIME_NS is not set, but the flag
CLONE_NEWTIME (0x80) is used to clone a time namespace, it
will return -EINVAL in copy_time_ns().
If kernel does not support CONFIG_TIME_NS, /proc/self/ns/time
will be not exist, and then we should skip clone3() test with
CLONE_NEWTIME.
With this patch under !CONFIG_TIME_NS:
# make headers && cd tools/testing/selftests/clone3 && make && ./clone3
...
# Time namespaces are not supported
ok 18 # SKIP Skipping clone3() with CLONE_NEWTIME
...
# Totals: pass:18 fail:0 xfail:0 xpass:0 skip:1 error:0
Fixes: 515bddf0ec41 ("selftests/clone3: test clone3 with CLONE_NEWTIME")
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Original-fix-from: Tiezhu Yang <yangtiezhu(a)loongson.cn>
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/clone3/clone3.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index 9429d361059e..915846bc32cc 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -138,6 +138,16 @@ static bool not_root(void)
return false;
}
+static bool timens_unsupported(void)
+{
+ if (access("/proc/self/ns/time", F_OK) == 0) {
+ ksft_print_msg("Time namespaces are not supported\n");
+ return true;
+ }
+
+ return false;
+}
+
static size_t page_size_plus_8(void)
{
return getpagesize() + 8;
@@ -282,6 +292,7 @@ static const struct test tests[] = {
.size = 0,
.expected = 0,
.test_mode = CLONE3_ARGS_NO_TEST,
+ .filter = timens_unsupported,
},
{
.name = "exit signal (SIGCHLD) in flags",
---
base-commit: 8d4099dd0727acfc8b0f644eacaf852f9d5dc649
change-id: 20231019-kselftest-clone3-time-ns-730b6f4187c7
Best regards,
--
Mark Brown <broonie(a)kernel.org>
If name_show() is non unique, this test will try to install a kprobe on this
function which should fail returning EADDRNOTAVAIL.
On kernel where name_show() is not unique, this test is skipped.
Cc: stable(a)vger.kernel.org
Signed-off-by: Francis Laniel <flaniel(a)linux.microsoft.com>
---
.../ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc | 13 +++++++++++++
1 file changed, 13 insertions(+)
create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc
diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc
new file mode 100644
index 000000000000..bc9514428dba
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_non_uniq_symbol.tc
@@ -0,0 +1,13 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Test failure of registering kprobe on non unique symbol
+# requires: kprobe_events
+
+SYMBOL='name_show'
+
+# We skip this test on kernel where SYMBOL is unique or does not exist.
+if [ "$(grep -c -E "[[:alnum:]]+ t ${SYMBOL}" /proc/kallsyms)" -le '1' ]; then
+ exit_unsupported
+fi
+
+! echo "p:test_non_unique ${SYMBOL}" > kprobe_events
--
2.34.1
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages (stage-1, stage-2) address translation
to get access to the physical address. stage-1 translation table is owned
by userspace (e.g. by a guest OS), while stage-2 is owned by kernel. Changes
to stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is I/O page
table. As the below diagram shows, guest I/O page table pointer in GPA
(guest physical address) is passed to host and be used to perform the stage-1
address translation. Along with it, modifications to present mappings in the
guest I/O page table should be followed with an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
This series adds the cache invalidation path for the userspace to invalidate
cache after modifying the stage-1 page table. This is based on the first part
of nesting [1]
Complete code can be found in [2], QEMU could can be found in [3].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
[1] https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.c…
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v5:
- Split the iommufd nesting series into two parts of alloc_user and
invalidation (Jason)
- Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
do the same with the structures/alloc()/abort()/destroy(). Reworked the
selftest accordingly too. (Jason)
- Move hwpt/data_type into struct iommu_user_data from standalone op
arguments. (Jason)
- Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
_TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
- Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
- Add macro to the iommu_copy_struct_from_user() to calculate min_size
(Jason)
- Fix two bugs spotted by ZhaoYan
v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation,
this does not change the rule that resv regions should only be added to the
kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be added in later series
as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O
page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to avoid
passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Lu Baolu (1):
iommu: Add cache_invalidate_user op
Nicolin Chen (4):
iommu: Add iommu_copy_struct_from_user_array helper
iommufd/selftest: Add mock_domain_cache_invalidate_user support
iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
Yi Liu (1):
iommufd: Add IOMMU_HWPT_INVALIDATE
drivers/iommu/iommufd/hw_pagetable.c | 35 ++++++++
drivers/iommu/iommufd/iommufd_private.h | 9 ++
drivers/iommu/iommufd/iommufd_test.h | 22 +++++
drivers/iommu/iommufd/main.c | 3 +
drivers/iommu/iommufd/selftest.c | 69 +++++++++++++++
include/linux/iommu.h | 84 +++++++++++++++++++
include/uapi/linux/iommufd.h | 36 ++++++++
tools/testing/selftests/iommu/iommufd.c | 75 +++++++++++++++++
tools/testing/selftests/iommu/iommufd_utils.h | 63 ++++++++++++++
9 files changed, 396 insertions(+)
--
2.34.1
Nested translation is a hardware feature that is supported by many modern
IOMMU hardwares. It has two stages of address translations to get access
to the physical address. A stage-1 translation table is owned by userspace
(e.g. by a guest OS), while a stage-2 is owned by kernel. Any change to a
stage-1 translation table should be followed by an IOTLB invalidation.
Take Intel VT-d as an example, the stage-1 translation table is guest I/O
page table. As the below diagram shows, the guest I/O page table pointer
in GPA (guest physical address) is passed to host and be used to perform
a stage-1 translation. Along with it, a modification to present mappings
in the guest I/O page table should be followed by an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | FS for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.----------------------------------.
| | | SS for GPA->HPA, unmanaged domain|
| | '----------------------------------'
'-------------'
Where:
- FS = First stage page tables
- SS = Second stage page tables
<Intel VT-d Nested translation>
In IOMMUFD, all the translation tables are tracked by hw_pagetable (hwpt)
and each hwpt is backed by an iommu_domain allocated from an iommu driver.
So in this series hw_pagetable and iommu_domain means the same thing if no
special note. IOMMUFD has already supported allocating hw_pagetable linked
with an IOAS. However, a nesting case requires IOMMUFD to allow allocating
hw_pagetable with driver specific parameters and interface to sync stage-1
IOTLB as user owns the stage-1 translation table.
This series is based on the iommu hw info reporting series [1] and nested
parent domain allocation [2]. It first extends domain_alloc_user to allocate
hwpt with user data by allowing the IOMMUFD internal infrastructure to accept
user_data and parent hwpt, relaying the user_data/parent to the iommu core
to allocate IOMMU_DOMAIN_NESTED. And it then extends the IOMMU_HWPT_ALLOC
ioctl to accept user data and a parent hwpt ID.
Note that this series is the part-1 set of a two-part nesting series. It
does not include the cache invalidation interface, which will be added in
the part 2.
Complete code can be found in [3], QEMU could can be found in [4].
At last, this is a team work together with Nicolin Chen, Lu Baolu. Thanks
them for the help. ^_^. Look forward to your feedbacks.
[1] https://lore.kernel.org/linux-iommu/20230818101033.4100-1-yi.l.liu@intel.co… - merged
[2] https://lore.kernel.org/linux-iommu/20230928071528.26258-1-yi.l.liu@intel.c… - merged
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1
Change log:
v5:
- Split the iommufd nesting series into two parts of alloc_user and
invalidation (Jason)
- Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
do the same with the structures/alloc()/abort()/destroy(). Reworked the
selftest accordingly too. (Jason)
- Move hwpt/data_type into struct iommu_user_data from standalone op
arguments. (Jason)
- Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
_TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
- Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
- Add macro to the iommu_copy_struct_from_user() to calculate min_size
(Jason)
- Fix two bugs spotted by ZhaoYan
v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.…
- Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
and kernel-managed HWPTs
- Rework invalidate uAPI to be a multi-request array-based design
- Add a struct iommu_user_data_array and a helper for driver to sanitize
and copy the entry data from user space invalidation array
- Add a patch fixing TEST_LENGTH() in selftest program
- Drop IOMMU_RESV_IOVA_RANGES patches
- Update kdoc and inline comments
- Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested
translation, this does not change the rule that resv regions should only
be added to the kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be
added in later series as it is needed only by SMMU so far.
v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.…
- Add new uAPI things in alphabetical order
- Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
sanity, replacing the previous op->domain_alloc_user_data_len solution
- Return ERR_PTR from domain_alloc_user instead of NULL
- Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation
(Kevin)
- Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest
I/O page table). (Kevin)
- Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
- Minor changes per Kevin's inputs
v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
- Add union iommu_domain_user_data to include all user data structures to
avoid passing void * in kernel APIs.
- Add iommu op to return user data length for user domain allocation
- Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
- Store the invalidation data length in
iommu_domain_ops::cache_invalidate_user_data_len
- Convert cache_invalidate_user op to be int instead of void
- Remove @data_type in struct iommu_hwpt_invalidate
- Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08
of v1
v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.…
Thanks,
Yi Liu
Jason Gunthorpe (2):
iommufd: Rename IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING
iommufd/device: Wrap IOMMUFD_OBJ_HWPT_PAGING-only configurations
Lu Baolu (1):
iommu: Add IOMMU_DOMAIN_NESTED
Nicolin Chen (6):
iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable
iommufd: Share iommufd_hwpt_alloc with IOMMUFD_OBJ_HWPT_NESTED
iommufd: Add a nested HW pagetable object
iommu: Add iommu_copy_struct_from_user helper
iommufd/selftest: Add nested domain allocation for mock domain
iommufd/selftest: Add coverage for IOMMU_HWPT_ALLOC with nested HWPTs
Yi Liu (1):
iommu: Pass in parent domain with user_data to domain_alloc_user op
drivers/iommu/intel/iommu.c | 4 +-
drivers/iommu/iommufd/device.c | 184 +++++++++-----
drivers/iommu/iommufd/hw_pagetable.c | 237 ++++++++++++++----
drivers/iommu/iommufd/iommufd_private.h | 67 +++--
drivers/iommu/iommufd/iommufd_test.h | 18 ++
drivers/iommu/iommufd/main.c | 10 +-
drivers/iommu/iommufd/selftest.c | 137 ++++++++--
drivers/iommu/iommufd/vfio_compat.c | 6 +-
include/linux/iommu.h | 72 +++++-
include/uapi/linux/iommufd.h | 31 ++-
tools/testing/selftests/iommu/iommufd.c | 120 +++++++++
.../selftests/iommu/iommufd_fail_nth.c | 3 +-
tools/testing/selftests/iommu/iommufd_utils.h | 31 ++-
13 files changed, 752 insertions(+), 168 deletions(-)
--
2.34.1
hi, Akihiko Odaki,
sorry for sending again, the previous one has some problem that lost most
of CC part.
Hello,
kernel test robot noticed "kernel-selftests.net.make.fail" on:
commit: c04079dfb34c2f534f013408b12218c14b286b7d ("[RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing")
url: https://github.com/intel-lab-lkp/linux/commits/Akihiko-Odaki/net-skbuff-Add…
base: https://git.kernel.org/cgit/linux/kernel/git/shuah/linux-kselftest.git next
patch link: https://lore.kernel.org/all/20231008052101.144422-7-akihiko.odaki@daynix.co…
patch subject: [RFC PATCH 6/7] selftest: tun: Add tests for virtio-net hashing
in testcase: kernel-selftests
version: kernel-selftests-x86_64-60acb023-1_20230329
with following parameters:
group: net
test: fcnal-test.sh
atomic_test: ipv6_runtime
compiler: gcc-12
test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 32G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang(a)intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310192236.fde97031-oliver.sang@intel.com
KERNEL SELFTESTS: linux_headers_dir is /usr/src/linux-headers-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d
2023-10-14 17:54:16 mount --bind /lib/modules/6.6.0-rc2-00023-gc04079dfb34c/kernel/lib /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/lib
make: Entering directory '/usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/bpf/resolve_btfids'
...
gcc -Wall -Wl,--no-as-needed -O2 -g -I../../../../usr/include/ -I../../../include/ -isystem /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/usr/include -I../ txtimestamp.c -o /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/txtimestamp
reuseport_bpf.c: In function ‘attach_cbpf’:
reuseport_bpf.c:133:28: error: array type has incomplete element type ‘struct sock_filter’
133 | struct sock_filter code[] = {
| ^~~~
reuseport_bpf.c:139:29: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_H’?
139 | { BPF_RET | BPF_A, 0, 0, 0 },
| ^~~~~
| BPF_H
reuseport_bpf.c:139:29: note: each undeclared identifier is reported only once for each function it appears in
reuseport_bpf.c:141:16: error: variable ‘p’ has initializer but incomplete type
141 | struct sock_fprog p = {
| ^~~~~~~~~~
reuseport_bpf.c:142:18: error: ‘struct sock_fprog’ has no member named ‘len’
142 | .len = ARRAY_SIZE(code),
| ^~~
In file included from reuseport_bpf.c:27:
../kselftest.h:56:25: warning: excess elements in struct initializer
56 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
| ^
reuseport_bpf.c:142:24: note: in expansion of macro ‘ARRAY_SIZE’
142 | .len = ARRAY_SIZE(code),
| ^~~~~~~~~~
../kselftest.h:56:25: note: (near initialization for ‘p’)
56 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
| ^
reuseport_bpf.c:142:24: note: in expansion of macro ‘ARRAY_SIZE’
142 | .len = ARRAY_SIZE(code),
| ^~~~~~~~~~
reuseport_bpf.c:143:18: error: ‘struct sock_fprog’ has no member named ‘filter’
143 | .filter = code,
| ^~~~~~
reuseport_bpf.c:143:27: warning: excess elements in struct initializer
143 | .filter = code,
| ^~~~
reuseport_bpf.c:143:27: note: (near initialization for ‘p’)
reuseport_bpf.c:141:27: error: storage size of ‘p’ isn’t known
141 | struct sock_fprog p = {
| ^
reuseport_bpf.c:141:27: warning: unused variable ‘p’ [-Wunused-variable]
reuseport_bpf.c:133:28: warning: unused variable ‘code’ [-Wunused-variable]
133 | struct sock_filter code[] = {
| ^~~~
reuseport_bpf.c: In function ‘test_filter_no_reuseport’:
reuseport_bpf.c:346:28: error: array type has incomplete element type ‘struct sock_filter’
346 | struct sock_filter ccode[] = {{ BPF_RET | BPF_A, 0, 0, 0 }};
| ^~~~~
reuseport_bpf.c:346:51: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_H’?
346 | struct sock_filter ccode[] = {{ BPF_RET | BPF_A, 0, 0, 0 }};
| ^~~~~
| BPF_H
reuseport_bpf.c:348:27: error: storage size of ‘cprog’ isn’t known
348 | struct sock_fprog cprog;
| ^~~~~
reuseport_bpf.c:348:27: warning: unused variable ‘cprog’ [-Wunused-variable]
reuseport_bpf.c:346:28: warning: unused variable ‘ccode’ [-Wunused-variable]
346 | struct sock_filter ccode[] = {{ BPF_RET | BPF_A, 0, 0, 0 }};
| ^~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/reuseport_bpf] Error 1
make: *** Waiting for unfinished jobs....
reuseport_bpf_cpu.c: In function ‘attach_bpf’:
reuseport_bpf_cpu.c:79:28: error: array type has incomplete element type ‘struct sock_filter’
79 | struct sock_filter code[] = {
| ^~~~
reuseport_bpf_cpu.c:81:52: error: ‘SKF_AD_OFF’ undeclared (first use in this function)
81 | { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
| ^~~~~~~~~~
reuseport_bpf_cpu.c:81:52: note: each undeclared identifier is reported only once for each function it appears in
reuseport_bpf_cpu.c:81:65: error: ‘SKF_AD_CPU’ undeclared (first use in this function)
81 | { BPF_LD | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
| ^~~~~~~~~~
reuseport_bpf_cpu.c:83:29: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_H’?
83 | { BPF_RET | BPF_A, 0, 0, 0 },
| ^~~~~
| BPF_H
reuseport_bpf_cpu.c:85:16: error: variable ‘p’ has initializer but incomplete type
85 | struct sock_fprog p = {
| ^~~~~~~~~~
reuseport_bpf_cpu.c:86:18: error: ‘struct sock_fprog’ has no member named ‘len’
86 | .len = 2,
| ^~~
reuseport_bpf_cpu.c:86:24: warning: excess elements in struct initializer
86 | .len = 2,
| ^
reuseport_bpf_cpu.c:86:24: note: (near initialization for ‘p’)
reuseport_bpf_cpu.c:87:18: error: ‘struct sock_fprog’ has no member named ‘filter’
87 | .filter = code,
| ^~~~~~
reuseport_bpf_cpu.c:87:27: warning: excess elements in struct initializer
87 | .filter = code,
| ^~~~
reuseport_bpf_cpu.c:87:27: note: (near initialization for ‘p’)
reuseport_bpf_cpu.c:85:27: error: storage size of ‘p’ isn’t known
85 | struct sock_fprog p = {
| ^
reuseport_bpf_cpu.c:85:27: warning: unused variable ‘p’ [-Wunused-variable]
reuseport_bpf_cpu.c:79:28: warning: unused variable ‘code’ [-Wunused-variable]
79 | struct sock_filter code[] = {
| ^~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/reuseport_bpf_cpu] Error 1
In file included from psock_fanout.c:55:
psock_lib.h: In function ‘pair_udp_setfilter’:
psock_lib.h:52:28: error: array type has incomplete element type ‘struct sock_filter’
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
psock_lib.h:65:27: error: storage size of ‘bpf_prog’ isn’t known
65 | struct sock_fprog bpf_prog;
| ^~~~~~~~
psock_lib.h:65:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_lib.h:52:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
psock_fanout.c: In function ‘sock_fanout_set_cbpf’:
psock_fanout.c:114:28: error: array type has incomplete element type ‘struct sock_filter’
114 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
psock_fanout.c:115:17: warning: implicit declaration of function ‘BPF_STMT’; did you mean ‘BPF_STX’? [-Wimplicit-function-declaration]
115 | BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 80), /* ldb [80] */
| ^~~~~~~~
| BPF_STX
psock_fanout.c:116:36: error: ‘BPF_A’ undeclared (first use in this function); did you mean ‘BPF_X’?
116 | BPF_STMT(BPF_RET | BPF_A, 0), /* ret A */
| ^~~~~
| BPF_X
psock_fanout.c:116:36: note: each undeclared identifier is reported only once for each function it appears in
psock_fanout.c:118:27: error: storage size of ‘bpf_prog’ isn’t known
118 | struct sock_fprog bpf_prog;
| ^~~~~~~~
psock_fanout.c:118:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_fanout.c:114:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
114 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/psock_fanout] Error 1
In file included from psock_tpacket.c:47:
psock_lib.h: In function ‘pair_udp_setfilter’:
psock_lib.h:52:28: error: array type has incomplete element type ‘struct sock_filter’
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
psock_lib.h:65:27: error: storage size of ‘bpf_prog’ isn’t known
65 | struct sock_fprog bpf_prog;
| ^~~~~~~~
psock_lib.h:65:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_lib.h:52:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/psock_tpacket] Error 1
In file included from psock_snd.c:32:
psock_lib.h: In function ‘pair_udp_setfilter’:
psock_lib.h:52:28: error: array type has incomplete element type ‘struct sock_filter’
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
psock_lib.h:65:27: error: storage size of ‘bpf_prog’ isn’t known
65 | struct sock_fprog bpf_prog;
| ^~~~~~~~
psock_lib.h:65:27: warning: unused variable ‘bpf_prog’ [-Wunused-variable]
psock_lib.h:52:28: warning: unused variable ‘bpf_filter’ [-Wunused-variable]
52 | struct sock_filter bpf_filter[] = {
| ^~~~~~~~~~
make: *** [../lib.mk:181: /usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net/psock_snd] Error 1
make: Leaving directory '/usr/src/perf_selftests-x86_64-rhel-8.3-kselftests-c04079dfb34c2f534f013408b12218c14b286b7d/tools/testing/selftests/net'
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231019/202310192236.fde97031-oliv…
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling.
When GCS is active a secondary stack called the Guarded Control Stack is
maintained, protected with a memory attribute which means that it can
only be written with specific GCS operations. The current GCS pointer
can not be directly written to by userspace. When a BL is executed the
value stored in LR is also pushed onto the GCS, and when a RET is
executed the top of the GCS is popped and compared to LR with a fault
being raised if the values do not match. GCS operations may only be
performed on GCS pages, a data abort is generated if they are not.
The combination of hardware enforcement and lack of extra instructions
in the function entry and exit paths should result in something which
has less overhead and is more difficult to attack than a purely software
implementation like clang's shadow stacks.
This series implements support for use of GCS by userspace, along with
support for use of GCS within KVM guests. It does not enable use of GCS
by either EL1 or EL2, this will be implemented separately. Executables
are started without GCS and must use a prctl() to enable it, it is
expected that this will be done very early in application execution by
the dynamic linker or other startup code.
x86 has an equivalent feature called shadow stacks, this series depends
on the x86 patches for generic memory management support for the new
guarded/shadow stack page type and shares APIs as much as possible. As
there has been extensive discussion with the wider community around the
ABI for shadow stacks I have as far as practical kept implementation
decisions close to those for x86, anticipating that review would lead to
similar conclusions in the absence of strong reasoning for divergence.
The main divergence I am concious of is that x86 allows shadow stack to
be enabled and disabled repeatedly, freeing the shadow stack for the
thread whenever disabled, while this implementation keeps the GCS
allocated after disable but refuses to reenable it. This is to avoid
races with things actively walking the GCS during a disable, we do
anticipate that some systems will wish to disable GCS at runtime but are
not aware of any demand for subsequently reenabling it.
x86 uses an arch_prctl() to manage enable and disable, since only x86
and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a
patch set for the equivalent RISC-V zisslpcfi feature which I initially
adopted fairly directly but following review feedback has been revised
quite a bit.
There is an open issue with support for CRIU, on x86 this required the
ability to set the GCS mode via ptrace. This series supports
configuring mode bits other than enable/disable via ptrace but it needs
to be confirmed if this is sufficient.
There's a few bits where I'm not convinced with where I've placed
things, in particular the GCS write operation is in the GCS header not
in uaccess.h, I wasn't sure what was clearest there and am probably too
close to the code to have a clear opinion. The reporting of GCS in
/proc/PID/smaps is also a bit awkward.
The series depends on the x86 shadow stack support:
https://lore.kernel.org/lkml/20230227222957.24501-1-rick.p.edgecombe@intel.…
I've rebased this onto v6.5-rc4 but not included it in the series in
order to avoid confusion with Rick's work and cut down the size of the
series, you can see the branch at:
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs
[1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of
stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org
Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org
Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org
---
Mark Brown (36):
prctl: arch-agnostic prctl for shadow stack
arm64: Document boot requirements for Guarded Control Stacks
arm64/gcs: Document the ABI for Guarded Control Stacks
arm64/sysreg: Add new system registers for GCS
arm64/sysreg: Add definitions for architected GCS caps
arm64/gcs: Add manual encodings of GCS instructions
arm64/gcs: Provide copy_to_user_gcs()
arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
arm64/mm: Allocate PIE slots for EL0 guarded control stack
mm: Define VM_SHADOW_STACK for arm64 when we support GCS
arm64/mm: Map pages for guarded control stack
KVM: arm64: Manage GCS registers for guests
arm64/gcs: Allow GCS usage at EL0 and EL1
arm64/idreg: Add overrride for GCS
arm64/hwcap: Add hwcap for GCS
arm64/traps: Handle GCS exceptions
arm64/mm: Handle GCS data aborts
arm64/gcs: Context switch GCS state for EL0
arm64/gcs: Allocate a new GCS for threads with GCS enabled
arm64/gcs: Implement shadow stack prctl() interface
arm64/mm: Implement map_shadow_stack()
arm64/signal: Set up and restore the GCS context for signal handlers
arm64/signal: Expose GCS state in signal frames
arm64/ptrace: Expose GCS via ptrace and core files
arm64: Add Kconfig for Guarded Control Stack (GCS)
kselftest/arm64: Verify the GCS hwcap
kselftest/arm64: Add GCS as a detected feature in the signal tests
kselftest/arm64: Add framework support for GCS to signal handling tests
kselftest/arm64: Allow signals tests to specify an expected si_code
kselftest/arm64: Always run signals tests with GCS enabled
kselftest/arm64: Add very basic GCS test program
kselftest/arm64: Add a GCS test program built with the system libc
kselftest/arm64: Add test coverage for GCS mode locking
selftests/arm64: Add GCS signal tests
kselftest/arm64: Add a GCS stress test
kselftest/arm64: Enable GCS for the FP stress tests
Documentation/admin-guide/kernel-parameters.txt | 3 +
Documentation/arch/arm64/booting.rst | 22 +
Documentation/arch/arm64/elf_hwcaps.rst | 3 +
Documentation/arch/arm64/gcs.rst | 228 +++++++++
Documentation/arch/arm64/index.rst | 1 +
Documentation/filesystems/proc.rst | 2 +-
arch/arm64/Kconfig | 19 +
arch/arm64/include/asm/cpufeature.h | 6 +
arch/arm64/include/asm/el2_setup.h | 17 +
arch/arm64/include/asm/esr.h | 28 +-
arch/arm64/include/asm/exception.h | 2 +
arch/arm64/include/asm/gcs.h | 106 ++++
arch/arm64/include/asm/hwcap.h | 1 +
arch/arm64/include/asm/kvm_arm.h | 4 +-
arch/arm64/include/asm/kvm_host.h | 12 +
arch/arm64/include/asm/pgtable-prot.h | 14 +-
arch/arm64/include/asm/processor.h | 7 +
arch/arm64/include/asm/sysreg.h | 20 +
arch/arm64/include/asm/uaccess.h | 42 ++
arch/arm64/include/uapi/asm/hwcap.h | 1 +
arch/arm64/include/uapi/asm/ptrace.h | 8 +
arch/arm64/include/uapi/asm/sigcontext.h | 9 +
arch/arm64/kernel/cpufeature.c | 19 +
arch/arm64/kernel/cpuinfo.c | 1 +
arch/arm64/kernel/entry-common.c | 23 +
arch/arm64/kernel/idreg-override.c | 2 +
arch/arm64/kernel/process.c | 85 ++++
arch/arm64/kernel/ptrace.c | 59 +++
arch/arm64/kernel/signal.c | 237 ++++++++-
arch/arm64/kernel/traps.c | 11 +
arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 17 +
arch/arm64/kvm/sys_regs.c | 22 +
arch/arm64/mm/Makefile | 1 +
arch/arm64/mm/fault.c | 78 ++-
arch/arm64/mm/gcs.c | 234 +++++++++
arch/arm64/mm/mmap.c | 12 +-
arch/arm64/tools/cpucaps | 1 +
arch/arm64/tools/sysreg | 55 +++
fs/proc/task_mmu.c | 3 +
include/linux/mm.h | 16 +-
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 22 +
kernel/sys.c | 30 ++
kernel/sys_ni.c | 1 +
tools/testing/selftests/arm64/Makefile | 2 +-
tools/testing/selftests/arm64/abi/hwcap.c | 19 +
tools/testing/selftests/arm64/fp/assembler.h | 15 +
tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 +
tools/testing/selftests/arm64/fp/sve-test.S | 2 +
tools/testing/selftests/arm64/fp/za-test.S | 2 +
tools/testing/selftests/arm64/fp/zt-test.S | 2 +
tools/testing/selftests/arm64/gcs/.gitignore | 5 +
tools/testing/selftests/arm64/gcs/Makefile | 24 +
tools/testing/selftests/arm64/gcs/asm-offsets.h | 0
tools/testing/selftests/arm64/gcs/basic-gcs.c | 356 ++++++++++++++
tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++++
.../selftests/arm64/gcs/gcs-stress-thread.S | 311 ++++++++++++
tools/testing/selftests/arm64/gcs/gcs-stress.c | 532 +++++++++++++++++++++
tools/testing/selftests/arm64/gcs/gcs-util.h | 87 ++++
tools/testing/selftests/arm64/gcs/libc-gcs.c | 500 +++++++++++++++++++
tools/testing/selftests/arm64/signal/.gitignore | 1 +
.../testing/selftests/arm64/signal/test_signals.c | 17 +-
.../testing/selftests/arm64/signal/test_signals.h | 6 +
.../selftests/arm64/signal/test_signals_utils.c | 32 +-
.../selftests/arm64/signal/test_signals_utils.h | 39 ++
.../arm64/signal/testcases/gcs_exception_fault.c | 59 +++
.../selftests/arm64/signal/testcases/gcs_frame.c | 78 +++
.../arm64/signal/testcases/gcs_write_fault.c | 67 +++
.../selftests/arm64/signal/testcases/testcases.c | 7 +
.../selftests/arm64/signal/testcases/testcases.h | 1 +
72 files changed, 3823 insertions(+), 34 deletions(-)
---
base-commit: ed0e1456f04be7a93c9a186e8e13aed78b555617
change-id: 20230303-arm64-gcs-e311ab0d8729
Best regards,
--
Mark Brown <broonie(a)kernel.org>
Commit 2810c1e99867 ("kunit: Fix wild-memory-access bug in
kunit_free_suite_set()") is causing all test suites to run (when
built as modules) while still in MODULE_STATE_COMING. In that state,
test modules are not fully initialized and lack sysfs kobjects.
This behavior can cause a crash if the test module tries to register
fake devices.
This patch restores the normal execution flow, waiting for the module
initialization to complete before running the test suites.
The issue reported in the commit mentioned above is addressed using
virt_addr_valid() to detect if the module loading has failed
and mod->kunit_suites has not been allocated using kmalloc_array().
Fixes: 2810c1e99867 ("kunit: Fix wild-memory-access bug in kunit_free_suite_set()")
Signed-off-by: Marco Pagani <marpagan(a)redhat.com>
---
lib/kunit/test.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/lib/kunit/test.c b/lib/kunit/test.c
index 421f13981412..1a49569186fc 100644
--- a/lib/kunit/test.c
+++ b/lib/kunit/test.c
@@ -769,12 +769,14 @@ static void kunit_module_exit(struct module *mod)
};
const char *action = kunit_action();
+ if (!suite_set.start || !virt_addr_valid(suite_set.start))
+ return;
+
if (!action)
__kunit_test_suites_exit(mod->kunit_suites,
mod->num_kunit_suites);
- if (suite_set.start)
- kunit_free_suite_set(suite_set);
+ kunit_free_suite_set(suite_set);
}
static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
@@ -784,12 +786,12 @@ static int kunit_module_notify(struct notifier_block *nb, unsigned long val,
switch (val) {
case MODULE_STATE_LIVE:
+ kunit_module_init(mod);
break;
case MODULE_STATE_GOING:
kunit_module_exit(mod);
break;
case MODULE_STATE_COMING:
- kunit_module_init(mod);
break;
case MODULE_STATE_UNFORMED:
break;
--
2.41.0
Kselftests are kernel tests and must be build with kernel headers from
same source version. The kernel headers are already being included
correctly in futex selftest Makefile with the help of KHDR_INCLUDE. In
this patch, only the dead code is being removed. No functional change is
intended.
Signed-off-by: Muhammad Usama Anjum <usama.anjum(a)collabora.com>
---
Changes since v1:
- Make the explanation correct
---
.../selftests/futex/include/futextest.h | 22 -------------------
1 file changed, 22 deletions(-)
diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index ddbcfc9b7bac4..59f66af3a6d10 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h
@@ -25,28 +25,6 @@
typedef volatile u_int32_t futex_t;
#define FUTEX_INITIALIZER 0
-/* Define the newer op codes if the system header file is not up to date. */
-#ifndef FUTEX_WAIT_BITSET
-#define FUTEX_WAIT_BITSET 9
-#endif
-#ifndef FUTEX_WAKE_BITSET
-#define FUTEX_WAKE_BITSET 10
-#endif
-#ifndef FUTEX_WAIT_REQUEUE_PI
-#define FUTEX_WAIT_REQUEUE_PI 11
-#endif
-#ifndef FUTEX_CMP_REQUEUE_PI
-#define FUTEX_CMP_REQUEUE_PI 12
-#endif
-#ifndef FUTEX_WAIT_REQUEUE_PI_PRIVATE
-#define FUTEX_WAIT_REQUEUE_PI_PRIVATE (FUTEX_WAIT_REQUEUE_PI | \
- FUTEX_PRIVATE_FLAG)
-#endif
-#ifndef FUTEX_REQUEUE_PI_PRIVATE
-#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
- FUTEX_PRIVATE_FLAG)
-#endif
-
/**
* futex() - SYS_futex syscall wrapper
* @uaddr: address of first futex
--
2.40.1