virtio-net uses hashes in two ways: RSS and hash reporting.
Conventionally, the hash calculation was done by the VMM. However,
computing the hash after the queue has been chosen defeats the purpose
of RSS.
Another approach is to use an eBPF steering program. This approach has
its own downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.
Introduce code to compute hashes in the kernel in order to overcome
these challenges.
An alternative solution is to extend the eBPF steering program so that it
can report the hash to userspace, but that is based on context
rewrites, which are in feature freeze. We could adopt kfuncs, but they
would not be UAPIs. We opt for ioctls to align with other relevant UAPIs
(KVM and vhost_net).
The patches for QEMU to use this new feature were submitted as an RFC and
are available at:
https://patchew.org/QEMU/20250313-hash-v4-0-c75c494b495e@daynix.com/
This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/
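As a rough illustration of the two uses described above (not the code in
this series), the sketch below computes a Toeplitz hash, the hash function
virtio-net uses for RSS, and then picks a queue from an indirection table.
The names toeplitz_hash() and rss_select_queue() are hypothetical, the key
is assumed to be at least four bytes long, and the table length is assumed
to be a power of two. The series itself uses an optimized calculation (see
the v5 changelog).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash: walk the input bits MSB first and, for every set bit,
 * XOR in the 32-bit window of the key that starts at that bit position.
 * Assumption for this sketch: if the key is shorter than data_len + 4
 * bytes, the missing key bytes are treated as zero.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t data_len)
{
    uint32_t hash = 0;
    uint32_t window;
    size_t next_key_byte = 4;

    assert(key_len >= 4);
    window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
             ((uint32_t)key[2] << 8) | (uint32_t)key[3];

    for (size_t i = 0; i < data_len; i++) {
        /* Key byte that slides into the window while this input byte is consumed. */
        uint8_t next = next_key_byte < key_len ? key[next_key_byte] : 0;

        next_key_byte++;
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            window = (window << 1) | (uint32_t)((next >> bit) & 1);
        }
    }
    return hash;
}

/* RSS steering: the low bits of the hash index the indirection table. */
static uint16_t rss_select_queue(uint32_t hash, const uint16_t *table,
                                 size_t table_len)
{
    assert(table_len != 0 && (table_len & (table_len - 1)) == 0);
    return table[hash & (table_len - 1)];
}
```

With hash reporting, the value returned by toeplitz_hash() would also be
handed to the guest in the virtio-net header, which is why computing it
once, before the queue is chosen, serves both features.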
V1 -> V2:
Changed to introduce a new BPF program type.
Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com>
---
Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net
hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for
virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as
the testing function now needs access to more variant-specific
variables.
- Corrected the message of patch "selftest: tun: Add tests for
virtio-net hashing"; it mentioned validation of configuration, but
that is outside the scope of the patch.
- Expanded the description of patch "selftest: tun: Add tests for
virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch
"vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: https://lore.kernel.org/r/20250307-rss-v9-0-df76624025eb@daynix.com
Changes in v9:
- Added a missing return statement in patch
"tun: Introduce virtio-net hash feature".
- Link to v8: https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff423@daynix.com
Changes in v8:
- Disabled IPv6 to eliminate noise in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
adding extra NULL checks to the code handling other ioctls.
- Introduced a variable named "fd" in __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: https://lore.kernel.org/r/20250228-rss-v7-0-844205cbbdd6@daynix.com
Changes in v7:
- Ensured that hash_report is set to VIRTIO_NET_HASH_REPORT_NONE for
VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad708f6@daynix.com
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
Introduce virtio-net hash reporting feature", and "tun: Introduce
virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df005d@daynix.com
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324ca…
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0e6c@daynix.com
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
"tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: https://lore.kernel.org/r/20240915-rss-v3-0-c630015db082@daynix.com
Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
"tun: Introduce virtio-net hash reporting feature" and
"tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: https://lore.kernel.org/r/20231015141644.260646-1-akihiko.odaki@daynix.com
---
Akihiko Odaki (10):
virtio_net: Add functions for hashing
net: flow_dissector: Export flow_keys_dissector_symmetric
tun: Allow steering eBPF program to fall back
tun: Add common virtio-net hash feature code
tun: Introduce virtio-net hash feature
tap: Introduce virtio-net hash feature
selftest: tun: Test vnet ioctls without device
selftest: tun: Add tests for virtio-net hashing
selftest: tap: Add tests for virtio-net ioctls
vhost/net: Support VIRTIO_NET_F_HASH_REPORT
Documentation/networking/tuntap.rst | 7 +
drivers/net/Kconfig | 1 +
drivers/net/tap.c | 68 ++++-
drivers/net/tun.c | 90 +++++--
drivers/net/tun_vnet.h | 155 ++++++++++-
drivers/vhost/net.c | 68 ++---
include/linux/if_tap.h | 2 +
include/linux/skbuff.h | 3 +
include/linux/virtio_net.h | 188 ++++++++++++++
include/net/flow_dissector.h | 1 +
include/uapi/linux/if_tun.h | 82 ++++++
net/core/flow_dissector.c | 3 +-
net/core/skbuff.c | 4 +
tools/testing/selftests/net/Makefile | 2 +-
tools/testing/selftests/net/tap.c | 97 ++++++-
tools/testing/selftests/net/tun.c | 491 ++++++++++++++++++++++++++++++++++-
16 files changed, 1185 insertions(+), 77 deletions(-)
---
base-commit: dd83757f6e686a2188997cb58b5975f744bb7786
change-id: 20240403-rss-e737d89efa77
prerequisite-change-id: 20241230-tun-66e10a49b0c7:v6
prerequisite-patch-id: 871dc5f146fb6b0e3ec8612971a8e8190472c0fb
prerequisite-patch-id: 2797ed249d32590321f088373d4055ff3f430a0e
prerequisite-patch-id: ea3370c72d4904e2f0536ec76ba5d26784c0cede
prerequisite-patch-id: 837e4cf5d6b451424f9b1639455e83a260c4440d
prerequisite-patch-id: ea701076f57819e844f5a35efe5cbc5712d3080d
prerequisite-patch-id: 701646fb43ad04cc64dd2bf13c150ccbe6f828ce
prerequisite-patch-id: 53176dae0c003f5b6c114d43f936cf7140d31bb5
prerequisite-change-id: 20250116-buffers-96e14bf023fc:v2
prerequisite-patch-id: 25fd4f99d4236a05a5ef16ab79f3e85ee57e21cc
Best regards,
--
Akihiko Odaki <akihiko.odaki(a)daynix.com>
On Friday, 14 March 2025 05:14:30 CDT Su Hui wrote:
> On 2025/3/14 17:21, Dan Carpenter wrote:
> > On Fri, Mar 14, 2025 at 03:14:51PM +0800, Su Hui wrote:
> >> When 'manual=false' and 'signaled=true', then expected value when using
> >> NTSYNC_IOC_CREATE_EVENT should be greater than zero. Fix this typo error.
> >>
> >> Signed-off-by: Su Hui<suhui(a)nfschina.com>
> >> ---
> >> tools/testing/selftests/drivers/ntsync/ntsync.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/tools/testing/selftests/drivers/ntsync/ntsync.c b/tools/testing/selftests/drivers/ntsync/ntsync.c
> >> index 3aad311574c4..bfb6fad653d0 100644
> >> --- a/tools/testing/selftests/drivers/ntsync/ntsync.c
> >> +++ b/tools/testing/selftests/drivers/ntsync/ntsync.c
> >> @@ -968,7 +968,7 @@ TEST(wake_all)
> >> auto_event_args.manual = false;
> >> auto_event_args.signaled = true;
> >> objs[3] = ioctl(fd, NTSYNC_IOC_CREATE_EVENT, &auto_event_args);
> >> - EXPECT_EQ(0, objs[3]);
> >> + EXPECT_LE(0, objs[3]);
> > It's kind of weird how these macros put the constant on the left.
> > It returns an "fd" on success. So this look reasonable. It probably
> > won't return the zero fd so we could probably check EXPECT_LT()?
> Agreed, there are about 29 items that can be changed to EXPECT_LT().
> I can send a v2 patchset with this change if there is no more other
> suggestions.
I personally think it looks wrong to use EXPECT_LT(), but I'll certainly defer to a higher maintainer on this point.
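To make the EXPECT_LE() versus EXPECT_LT() question concrete: POSIX hands
out the lowest-numbered free descriptor, so 0 is a legal result whenever
stdin is closed. The sketch below demonstrates this with open() on
/dev/null rather than with the ntsync ioctl. The helper name
open_with_stdin_closed() is hypothetical, and the sketch assumes that
/dev/null can be opened and that the ioctl allocates descriptors the same
way open() does.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Close stdin, open a file, and report which descriptor number was
 * handed out. Stdin is restored before returning.
 */
static int open_with_stdin_closed(void)
{
    int saved = dup(STDIN_FILENO); /* -1 if stdin was already closed */
    int fd;

    close(STDIN_FILENO);
    fd = open("/dev/null", O_RDONLY); /* lowest free descriptor: 0 */

    if (fd >= 0)
        close(fd);
    if (saved >= 0) {
        dup2(saved, STDIN_FILENO);
        close(saved);
    }
    return fd;
}
```

A check of the form fd >= 0, which is what EXPECT_LE(0, fd) expresses,
accepts this result, while fd > 0 would report a failure for a valid
descriptor.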
Replace all occurrences of `addr_of!(place)` with `&raw const place`, and
all occurrences of `addr_of_mut!(place)` with `&raw mut place`.
Using the new syntax reduces macro complexity and improves consistency
with existing reference syntax: `&raw const` and `&raw mut` closely
resemble `&` and `&mut`, so they fit more naturally with the surrounding
code.
Suggested-by: Benno Lossin <benno.lossin(a)proton.me>
Link: https://github.com/Rust-for-Linux/linux/issues/1148
Signed-off-by: Antonio Hickey <contact(a)antoniohickey.com>
---
rust/kernel/block/mq/request.rs | 4 ++--
rust/kernel/faux.rs | 4 ++--
rust/kernel/fs/file.rs | 2 +-
rust/kernel/init.rs | 8 ++++----
rust/kernel/init/macros.rs | 28 +++++++++++++-------------
rust/kernel/jump_label.rs | 4 ++--
rust/kernel/kunit.rs | 4 ++--
rust/kernel/list.rs | 2 +-
rust/kernel/list/impl_list_item_mod.rs | 6 +++---
rust/kernel/net/phy.rs | 4 ++--
rust/kernel/pci.rs | 4 ++--
rust/kernel/platform.rs | 4 +---
rust/kernel/rbtree.rs | 22 ++++++++++----------
rust/kernel/sync/arc.rs | 2 +-
rust/kernel/task.rs | 4 ++--
rust/kernel/workqueue.rs | 8 ++++----
16 files changed, 54 insertions(+), 56 deletions(-)
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
index 7943f43b9575..4a5b7ec914ef 100644
--- a/rust/kernel/block/mq/request.rs
+++ b/rust/kernel/block/mq/request.rs
@@ -12,7 +12,7 @@
};
use core::{
marker::PhantomData,
- ptr::{addr_of_mut, NonNull},
+ ptr::NonNull,
sync::atomic::{AtomicU64, Ordering},
};
@@ -187,7 +187,7 @@ pub(crate) fn refcount(&self) -> &AtomicU64 {
pub(crate) unsafe fn refcount_ptr(this: *mut Self) -> *mut AtomicU64 {
// SAFETY: Because of the safety requirements of this function, the
// field projection is safe.
- unsafe { addr_of_mut!((*this).refcount) }
+ unsafe { &raw mut (*this).refcount }
}
}
diff --git a/rust/kernel/faux.rs b/rust/kernel/faux.rs
index 5acc0c02d451..52ac554c1119 100644
--- a/rust/kernel/faux.rs
+++ b/rust/kernel/faux.rs
@@ -7,7 +7,7 @@
//! C header: [`include/linux/device/faux.h`]
use crate::{bindings, device, error::code::*, prelude::*};
-use core::ptr::{addr_of_mut, null, null_mut, NonNull};
+use core::ptr::{null, null_mut, NonNull};
/// The registration of a faux device.
///
@@ -45,7 +45,7 @@ impl AsRef<device::Device> for Registration {
fn as_ref(&self) -> &device::Device {
// SAFETY: The underlying `device` in `faux_device` is guaranteed by the C API to be
// a valid initialized `device`.
- unsafe { device::Device::as_ref(addr_of_mut!((*self.as_raw()).dev)) }
+ unsafe { device::Device::as_ref(&raw mut (*self.as_raw()).dev) }
}
}
diff --git a/rust/kernel/fs/file.rs b/rust/kernel/fs/file.rs
index ed57e0137cdb..7ee4830b67f3 100644
--- a/rust/kernel/fs/file.rs
+++ b/rust/kernel/fs/file.rs
@@ -331,7 +331,7 @@ pub fn flags(&self) -> u32 {
// SAFETY: The file is valid because the shared reference guarantees a nonzero refcount.
//
// FIXME(read_once): Replace with `read_once` when available on the Rust side.
- unsafe { core::ptr::addr_of!((*self.as_ptr()).f_flags).read_volatile() }
+ unsafe { (&raw const (*self.as_ptr()).f_flags).read_volatile() }
}
}
diff --git a/rust/kernel/init.rs b/rust/kernel/init.rs
index 7fd1ea8265a5..a8fac6558671 100644
--- a/rust/kernel/init.rs
+++ b/rust/kernel/init.rs
@@ -122,7 +122,7 @@
//! ```rust
//! # #![expect(unreachable_pub, clippy::disallowed_names)]
//! use kernel::{init, types::Opaque};
-//! use core::{ptr::addr_of_mut, marker::PhantomPinned, pin::Pin};
+//! use core::{marker::PhantomPinned, pin::Pin};
//! # mod bindings {
//! # #![expect(non_camel_case_types)]
//! # #![expect(clippy::missing_safety_doc)]
@@ -159,7 +159,7 @@
//! unsafe {
//! init::pin_init_from_closure(move |slot: *mut Self| {
//! // `slot` contains uninit memory, avoid creating a reference.
-//! let foo = addr_of_mut!((*slot).foo);
+//! let foo = &raw mut (*slot).foo;
//!
//! // Initialize the `foo`
//! bindings::init_foo(Opaque::raw_get(foo));
@@ -541,7 +541,7 @@ macro_rules! stack_try_pin_init {
///
/// ```rust
/// # use kernel::{macros::{Zeroable, pin_data}, pin_init};
-/// # use core::{ptr::addr_of_mut, marker::PhantomPinned};
+/// # use core::marker::PhantomPinned;
/// #[pin_data]
/// #[derive(Zeroable)]
/// struct Buf {
@@ -554,7 +554,7 @@ macro_rules! stack_try_pin_init {
/// pin_init!(&this in Buf {
/// buf: [0; 64],
/// // SAFETY: TODO.
-/// ptr: unsafe { addr_of_mut!((*this.as_ptr()).buf).cast() },
+/// ptr: unsafe { (&raw mut (*this.as_ptr()).buf).cast() },
/// pin: PhantomPinned,
/// });
/// pin_init!(Buf {
diff --git a/rust/kernel/init/macros.rs b/rust/kernel/init/macros.rs
index 1fd146a83241..af525fbb2f01 100644
--- a/rust/kernel/init/macros.rs
+++ b/rust/kernel/init/macros.rs
@@ -244,25 +244,25 @@
//! struct __InitOk;
//! // This is the expansion of `t,`, which is syntactic sugar for `t: t,`.
//! {
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).t), t) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).t, t) };
//! }
//! // Since initialization could fail later (not in this case, since the
//! // error type is `Infallible`) we will need to drop this field if there
//! // is an error later. This `DropGuard` will drop the field when it gets
//! // dropped and has not yet been forgotten.
//! let __t_guard = unsafe {
-//! ::pinned_init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).t))
+//! ::pinned_init::__internal::DropGuard::new(&raw mut (*slot).t)
//! };
//! // Expansion of `x: 0,`:
//! // Since this can be an arbitrary expression we cannot place it inside
//! // of the `unsafe` block, so we bind it here.
//! {
//! let x = 0;
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).x), x) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).x, x) };
//! }
//! // We again create a `DropGuard`.
//! let __x_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).x))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).x)
//! };
//! // Since initialization has successfully completed, we can now forget
//! // the guards. This is not `mem::forget`, since we only have
@@ -459,15 +459,15 @@
//! {
//! struct __InitOk;
//! {
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).a), a) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).a, a) };
//! }
//! let __a_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).a))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).a)
//! };
//! let init = Bar::new(36);
-//! unsafe { data.b(::core::addr_of_mut!((*slot).b), b)? };
+//! unsafe { data.b(&raw mut (*slot).b, b)? };
//! let __b_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).b))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).b)
//! };
//! ::core::mem::forget(__b_guard);
//! ::core::mem::forget(__a_guard);
@@ -1210,7 +1210,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
// SAFETY: `slot` is valid, because we are inside of an initializer closure, we
// return when an error/panic occurs.
// We also use the `data` to require the correct trait (`Init` or `PinInit`) for `$field`.
- unsafe { $data.$field(::core::ptr::addr_of_mut!((*$slot).$field), init)? };
+ unsafe { $data.$field(&raw mut (*$slot).$field, init)? };
// Create the drop guard:
//
// We rely on macro hygiene to make it impossible for users to access this local variable.
@@ -1218,7 +1218,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot($use_data):
@@ -1241,7 +1241,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
//
// SAFETY: `slot` is valid, because we are inside of an initializer closure, we
// return when an error/panic occurs.
- unsafe { $crate::init::Init::__init(init, ::core::ptr::addr_of_mut!((*$slot).$field))? };
+ unsafe { $crate::init::Init::__init(init, &raw mut (*$slot).$field)? };
// Create the drop guard:
//
// We rely on macro hygiene to make it impossible for users to access this local variable.
@@ -1249,7 +1249,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot():
@@ -1272,7 +1272,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
// Initialize the field.
//
// SAFETY: The memory at `slot` is uninitialized.
- unsafe { ::core::ptr::write(::core::ptr::addr_of_mut!((*$slot).$field), $field) };
+ unsafe { ::core::ptr::write(&raw mut (*$slot).$field, $field) };
}
// Create the drop guard:
//
@@ -1281,7 +1281,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot($($use_data)?):
diff --git a/rust/kernel/jump_label.rs b/rust/kernel/jump_label.rs
index 4e974c768dbd..ca10abae0eee 100644
--- a/rust/kernel/jump_label.rs
+++ b/rust/kernel/jump_label.rs
@@ -20,8 +20,8 @@
#[macro_export]
macro_rules! static_branch_unlikely {
($key:path, $keytyp:ty, $field:ident) => {{
- let _key: *const $keytyp = ::core::ptr::addr_of!($key);
- let _key: *const $crate::bindings::static_key_false = ::core::ptr::addr_of!((*_key).$field);
+ let _key: *const $keytyp = &raw const $key;
+ let _key: *const $crate::bindings::static_key_false = &raw const (*_key).$field;
let _key: *const $crate::bindings::static_key = _key.cast();
#[cfg(not(CONFIG_JUMP_LABEL))]
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 824da0e9738a..a17ef3b2e860 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -128,9 +128,9 @@ unsafe impl Sync for UnaryAssert {}
unsafe {
$crate::bindings::__kunit_do_failed_assertion(
kunit_test,
- core::ptr::addr_of!(LOCATION.0),
+ &raw const LOCATION.0,
$crate::bindings::kunit_assert_type_KUNIT_ASSERTION,
- core::ptr::addr_of!(ASSERTION.0.assert),
+ &raw const ASSERTION.0.assert,
Some($crate::bindings::kunit_unary_assert_format),
core::ptr::null(),
);
diff --git a/rust/kernel/list.rs b/rust/kernel/list.rs
index c0ed227b8a4f..e98f0820f002 100644
--- a/rust/kernel/list.rs
+++ b/rust/kernel/list.rs
@@ -176,7 +176,7 @@ pub fn new() -> impl PinInit<Self> {
#[inline]
unsafe fn fields(me: *mut Self) -> *mut ListLinksFields {
// SAFETY: The caller promises that the pointer is valid.
- unsafe { Opaque::raw_get(ptr::addr_of!((*me).inner)) }
+ unsafe { Opaque::raw_get(&raw const (*me).inner) }
}
/// # Safety
diff --git a/rust/kernel/list/impl_list_item_mod.rs b/rust/kernel/list/impl_list_item_mod.rs
index a0438537cee1..014b6713d59d 100644
--- a/rust/kernel/list/impl_list_item_mod.rs
+++ b/rust/kernel/list/impl_list_item_mod.rs
@@ -49,7 +49,7 @@ macro_rules! impl_has_list_links {
// SAFETY: The implementation of `raw_get_list_links` only compiles if the field has the
// right type.
//
- // The behavior of `raw_get_list_links` is not changed since the `addr_of_mut!` macro is
+ // The behavior of `raw_get_list_links` is not changed since the `&raw mut` op is
// equivalent to the pointer offset operation in the trait definition.
unsafe impl$(<$($implarg),*>)? $crate::list::HasListLinks$(<$id>)? for
$self $(<$($selfarg),*>)?
@@ -61,7 +61,7 @@ unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut $crate::list::ListLinks$(<$
// SAFETY: The caller promises that the pointer is not dangling. We know that this
// expression doesn't follow any pointers, as the `offset_of!` invocation above
// would otherwise not compile.
- unsafe { ::core::ptr::addr_of_mut!((*ptr)$(.$field)*) }
+ unsafe { &raw mut (*ptr)$(.$field)* }
}
}
)*};
@@ -103,7 +103,7 @@ macro_rules! impl_has_list_links_self_ptr {
unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut $crate::list::ListLinks$(<$id>)? {
// SAFETY: The caller promises that the pointer is not dangling.
let ptr: *mut $crate::list::ListLinksSelfPtr<$item_type $(, $id)?> =
- unsafe { ::core::ptr::addr_of_mut!((*ptr).$field) };
+ unsafe { &raw mut (*ptr).$field };
ptr.cast()
}
}
diff --git a/rust/kernel/net/phy.rs b/rust/kernel/net/phy.rs
index a59469c785e3..757db052cc09 100644
--- a/rust/kernel/net/phy.rs
+++ b/rust/kernel/net/phy.rs
@@ -7,7 +7,7 @@
//! C headers: [`include/linux/phy.h`](srctree/include/linux/phy.h).
use crate::{error::*, prelude::*, types::Opaque};
-use core::{marker::PhantomData, ptr::addr_of_mut};
+use core::marker::PhantomData;
pub mod reg;
@@ -285,7 +285,7 @@ impl AsRef<kernel::device::Device> for Device {
fn as_ref(&self) -> &kernel::device::Device {
let phydev = self.0.get();
// SAFETY: The struct invariant ensures that `mdio.dev` is valid.
- unsafe { kernel::device::Device::as_ref(addr_of_mut!((*phydev).mdio.dev)) }
+ unsafe { kernel::device::Device::as_ref(&raw mut (*phydev).mdio.dev) }
}
}
diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index f7b2743828ae..6cb9ed1e7cbf 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -17,7 +17,7 @@
types::{ARef, ForeignOwnable, Opaque},
ThisModule,
};
-use core::{ops::Deref, ptr::addr_of_mut};
+use core::ops::Deref;
use kernel::prelude::*;
/// An adapter for the registration of PCI drivers.
@@ -60,7 +60,7 @@ extern "C" fn probe_callback(
) -> kernel::ffi::c_int {
// SAFETY: The PCI bus only ever calls the probe callback with a valid pointer to a
// `struct pci_dev`.
- let dev = unsafe { device::Device::get_device(addr_of_mut!((*pdev).dev)) };
+ let dev = unsafe { device::Device::get_device(&raw mut (*pdev).dev) };
// SAFETY: `dev` is guaranteed to be embedded in a valid `struct pci_dev` by the call
// above.
let mut pdev = unsafe { Device::from_dev(dev) };
diff --git a/rust/kernel/platform.rs b/rust/kernel/platform.rs
index 1297f5292ba9..344875ad7b82 100644
--- a/rust/kernel/platform.rs
+++ b/rust/kernel/platform.rs
@@ -14,8 +14,6 @@
ThisModule,
};
-use core::ptr::addr_of_mut;
-
/// An adapter for the registration of platform drivers.
pub struct Adapter<T: Driver>(T);
@@ -55,7 +53,7 @@ unsafe fn unregister(pdrv: &Opaque<Self::RegType>) {
impl<T: Driver + 'static> Adapter<T> {
extern "C" fn probe_callback(pdev: *mut bindings::platform_device) -> kernel::ffi::c_int {
// SAFETY: The platform bus only ever calls the probe callback with a valid `pdev`.
- let dev = unsafe { device::Device::get_device(addr_of_mut!((*pdev).dev)) };
+ let dev = unsafe { device::Device::get_device(&raw mut (*pdev).dev) };
// SAFETY: `dev` is guaranteed to be embedded in a valid `struct platform_device` by the
// call above.
let mut pdev = unsafe { Device::from_dev(dev) };
diff --git a/rust/kernel/rbtree.rs b/rust/kernel/rbtree.rs
index 1ea25c7092fb..b0ad35663cb0 100644
--- a/rust/kernel/rbtree.rs
+++ b/rust/kernel/rbtree.rs
@@ -11,7 +11,7 @@
cmp::{Ord, Ordering},
marker::PhantomData,
mem::MaybeUninit,
- ptr::{addr_of_mut, from_mut, NonNull},
+ ptr::{from_mut, NonNull},
};
/// A red-black tree with owned nodes.
@@ -238,7 +238,7 @@ pub fn values_mut(&mut self) -> impl Iterator<Item = &'_ mut V> {
/// Returns a cursor over the tree nodes, starting with the smallest key.
pub fn cursor_front(&mut self) -> Option<Cursor<'_, K, V>> {
- let root = addr_of_mut!(self.root);
+ let root = &raw mut self.root;
// SAFETY: `self.root` is always a valid root node
let current = unsafe { bindings::rb_first(root) };
NonNull::new(current).map(|current| {
@@ -253,7 +253,7 @@ pub fn cursor_front(&mut self) -> Option<Cursor<'_, K, V>> {
/// Returns a cursor over the tree nodes, starting with the largest key.
pub fn cursor_back(&mut self) -> Option<Cursor<'_, K, V>> {
- let root = addr_of_mut!(self.root);
+ let root = &raw mut self.root;
// SAFETY: `self.root` is always a valid root node
let current = unsafe { bindings::rb_last(root) };
NonNull::new(current).map(|current| {
@@ -459,7 +459,7 @@ pub fn cursor_lower_bound(&mut self, key: &K) -> Option<Cursor<'_, K, V>>
let best = best_match?;
// SAFETY: `best` is a non-null node so it is valid by the type invariants.
- let links = unsafe { addr_of_mut!((*best.as_ptr()).links) };
+ let links = unsafe { &raw mut (*best.as_ptr()).links };
NonNull::new(links).map(|current| {
// INVARIANT:
@@ -767,7 +767,7 @@ pub fn remove_current(self) -> (Option<Self>, RBTreeNode<K, V>) {
let node = RBTreeNode { node };
// SAFETY: The reference to the tree used to create the cursor outlives the cursor, so
// the tree cannot change. By the tree invariant, all nodes are valid.
- unsafe { bindings::rb_erase(&mut (*this).links, addr_of_mut!(self.tree.root)) };
+ unsafe { bindings::rb_erase(&mut (*this).links, &raw mut self.tree.root) };
let current = match (prev, next) {
(_, Some(next)) => next,
@@ -803,7 +803,7 @@ fn remove_neighbor(&mut self, direction: Direction) -> Option<RBTreeNode<K, V>>
let neighbor = neighbor.as_ptr();
// SAFETY: The reference to the tree used to create the cursor outlives the cursor, so
// the tree cannot change. By the tree invariant, all nodes are valid.
- unsafe { bindings::rb_erase(neighbor, addr_of_mut!(self.tree.root)) };
+ unsafe { bindings::rb_erase(neighbor, &raw mut self.tree.root) };
// SAFETY: By the type invariant of `Self`, all non-null `rb_node` pointers stored in `self`
// point to the links field of `Node<K, V>` objects.
let this = unsafe { container_of!(neighbor, Node<K, V>, links) }.cast_mut();
@@ -918,7 +918,7 @@ unsafe fn to_key_value_raw<'b>(node: NonNull<bindings::rb_node>) -> (&'b K, *mut
let k = unsafe { &(*this).key };
// SAFETY: The passed `node` is the current node or a non-null neighbor,
// thus `this` is valid by the type invariants.
- let v = unsafe { addr_of_mut!((*this).value) };
+ let v = unsafe { &raw mut (*this).value };
(k, v)
}
}
@@ -1027,7 +1027,7 @@ fn next(&mut self) -> Option<Self::Item> {
self.next = unsafe { bindings::rb_next(self.next) };
// SAFETY: By the same reasoning above, it is safe to dereference the node.
- Some(unsafe { (addr_of_mut!((*cur).key), addr_of_mut!((*cur).value)) })
+ Some(unsafe { (&raw mut (*cur).key, &raw mut (*cur).value) })
}
}
@@ -1170,7 +1170,7 @@ fn insert(self, node: RBTreeNode<K, V>) -> &'a mut V {
// SAFETY: `node` is valid at least until we call `Box::from_raw`, which only happens when
// the node is removed or replaced.
- let node_links = unsafe { addr_of_mut!((*node).links) };
+ let node_links = unsafe { &raw mut (*node).links };
// INVARIANT: We are linking in a new node, which is valid. It remains valid because we
// "forgot" it with `Box::into_raw`.
@@ -1178,7 +1178,7 @@ fn insert(self, node: RBTreeNode<K, V>) -> &'a mut V {
unsafe { bindings::rb_link_node(node_links, self.parent, self.child_field_of_parent) };
// SAFETY: All pointers are valid. `node` has just been inserted into the tree.
- unsafe { bindings::rb_insert_color(node_links, addr_of_mut!((*self.rbtree).root)) };
+ unsafe { bindings::rb_insert_color(node_links, &raw mut (*self.rbtree).root) };
// SAFETY: The node is valid until we remove it from the tree.
unsafe { &mut (*node).value }
@@ -1261,7 +1261,7 @@ fn replace(self, node: RBTreeNode<K, V>) -> RBTreeNode<K, V> {
// SAFETY: `node` is valid at least until we call `Box::from_raw`, which only happens when
// the node is removed or replaced.
- let new_node_links = unsafe { addr_of_mut!((*node).links) };
+ let new_node_links = unsafe { &raw mut (*node).links };
// SAFETY: This updates the pointers so that `new_node_links` is in the tree where
// `self.node_links` used to be.
diff --git a/rust/kernel/sync/arc.rs b/rust/kernel/sync/arc.rs
index 3cefda7a4372..81d8b0f84957 100644
--- a/rust/kernel/sync/arc.rs
+++ b/rust/kernel/sync/arc.rs
@@ -243,7 +243,7 @@ pub fn into_raw(self) -> *const T {
let ptr = self.ptr.as_ptr();
core::mem::forget(self);
// SAFETY: The pointer is valid.
- unsafe { core::ptr::addr_of!((*ptr).data) }
+ unsafe { &raw const (*ptr).data }
}
/// Recreates an [`Arc`] instance previously deconstructed via [`Arc::into_raw`].
diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index 49012e711942..b2ac768eed23 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -257,7 +257,7 @@ pub fn as_ptr(&self) -> *mut bindings::task_struct {
pub fn group_leader(&self) -> &Task {
// SAFETY: The group leader of a task never changes after initialization, so reading this
// field is not a data race.
- let ptr = unsafe { *ptr::addr_of!((*self.as_ptr()).group_leader) };
+ let ptr = unsafe { *(&raw const (*self.as_ptr()).group_leader) };
// SAFETY: The lifetime of the returned task reference is tied to the lifetime of `self`,
// and given that a task has a reference to its group leader, we know it must be valid for
@@ -269,7 +269,7 @@ pub fn group_leader(&self) -> &Task {
pub fn pid(&self) -> Pid {
// SAFETY: The pid of a task never changes after initialization, so reading this field is
// not a data race.
- unsafe { *ptr::addr_of!((*self.as_ptr()).pid) }
+ unsafe { *(&raw const (*self.as_ptr()).pid) }
}
/// Returns the UID of the given task.
diff --git a/rust/kernel/workqueue.rs b/rust/kernel/workqueue.rs
index 0cd100d2aefb..34e8abb38974 100644
--- a/rust/kernel/workqueue.rs
+++ b/rust/kernel/workqueue.rs
@@ -401,9 +401,9 @@ pub fn new(name: &'static CStr, key: &'static LockClassKey) -> impl PinInit<Self
pub unsafe fn raw_get(ptr: *const Self) -> *mut bindings::work_struct {
// SAFETY: The caller promises that the pointer is aligned and not dangling.
//
- // A pointer cast would also be ok due to `#[repr(transparent)]`. We use `addr_of!` so that
- // the compiler does not complain that the `work` field is unused.
- unsafe { Opaque::raw_get(core::ptr::addr_of!((*ptr).work)) }
+ // A pointer cast would also be ok due to `#[repr(transparent)]`. We use `&raw const (*ptr).work`
+ // so that the compiler does not complain that the `work` field is unused.
+ unsafe { Opaque::raw_get(&raw const (*ptr).work) }
}
}
@@ -510,7 +510,7 @@ macro_rules! impl_has_work {
unsafe fn raw_get_work(ptr: *mut Self) -> *mut $crate::workqueue::Work<$work_type $(, $id)?> {
// SAFETY: The caller promises that the pointer is not dangling.
unsafe {
- ::core::ptr::addr_of_mut!((*ptr).$field)
+ &raw mut (*ptr).$field
}
}
}
--
2.48.1
This series contains four small fixes for the ntsync selftests and
documentation. I split them into four patches because they address
different kinds of errors. I can squash them into one patch if that is
preferred.
Su Hui (4):
selftests: ntsync: fix the wrong condition in wake_all
selftests: ntsync: avoid possible overflow in 32-bit machine
selftests: ntsync: update config
docs: ntsync: update NTSYNC_IOC_*
Documentation/userspace-api/ntsync.rst | 18 +++++++++---------
tools/testing/selftests/drivers/ntsync/config | 2 +-
.../testing/selftests/drivers/ntsync/ntsync.c | 6 +++---
3 files changed, 13 insertions(+), 13 deletions(-)
--
2.30.2
I never had much luck running mm selftests so I spent a few hours
digging into why.
Looks like most of the reason is missing SKIP checks, so this series is
just adding a bunch of those that I found. I did not do anything like
all of them, just the ones I spotted in gup_longterm, gup_test, mmap,
userfaultfd and memfd_secret.
It's a bit unfortunate to have to skip those tests when ftruncate()
fails, but I don't have time to dig deep enough into it to actually make
them pass. I have observed the issue on 9pfs and heard rumours that NFS
has a similar problem.
I'm now able to run these test groups successfully:
- mmap
- gup_test
- compaction
- migration
- page_frag
- userfaultfd
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
---
Changes in v3:
- Added fix for userfaultfd tests.
- Dropped attempts to use sudo.
- Fixed garbage printf in uffd-stress.
(Added EXTRA_CFLAGS=-Werror FORCE_TARGETS=1 to my scripts to prevent
such errors happening again).
- Fixed missing newlines in ksft_test_result_skip() calls.
- Link to v2: https://lore.kernel.org/r/20250221-mm-selftests-v2-0-28c4d66383c5@google.com
Changes in v2 (Thanks to Dev for the reviews):
- Improve and cleanup some error messages
- Add some extra SKIPs
- Fix misnaming of nr_cpus variable in uffd tests
- Link to v1: https://lore.kernel.org/r/20250220-mm-selftests-v1-0-9bbf57d64463@google.com
---
Brendan Jackman (10):
selftests/mm: Report errno when things fail in gup_longterm
selftests/mm: Skip uffd-stress if userfaultfd not available
selftests/mm: Skip uffd-wp-mremap if userfaultfd not available
selftests/mm/uffd: Rename nr_cpus -> nr_threads
selftests/mm: Print some details when uffd-stress gets bad params
selftests/mm: Don't fail uffd-stress if too many CPUs
selftests/mm: Skip map_populate on weird filesystems
selftests/mm: Skip gup_longerm tests on weird filesystems
selftests/mm: Drop unnecessary sudo usage
selftests/mm: Ensure uffd-wp-mremap gets pages of each size
tools/testing/selftests/mm/gup_longterm.c | 45 ++++++++++++++++++----------
tools/testing/selftests/mm/map_populate.c | 7 +++++
tools/testing/selftests/mm/run_vmtests.sh | 25 ++++++++++++++--
tools/testing/selftests/mm/uffd-common.c | 8 ++---
tools/testing/selftests/mm/uffd-common.h | 2 +-
tools/testing/selftests/mm/uffd-stress.c | 42 ++++++++++++++++----------
tools/testing/selftests/mm/uffd-unit-tests.c | 2 +-
tools/testing/selftests/mm/uffd-wp-mremap.c | 5 +++-
8 files changed, 95 insertions(+), 41 deletions(-)
---
base-commit: 76544811c850a1f4c055aa182b513b7a843868ea
change-id: 20250220-mm-selftests-2d7d0542face
Best regards,
--
Brendan Jackman <jackmanb(a)google.com>
This series is built on top of the v3 write syscall support [1].
With James's KVM userfault [2], it is possible to handle stage-2 faults
in guest_memfd in userspace. However, KVM itself also triggers faults
in guest_memfd in some cases, for example: PV interfaces like kvmclock,
PV EOI and page table walking code when fetching the MMIO instruction on
x86. It was agreed in the guest_memfd upstream call on 23 Jan 2025 [3]
that KVM would be accessing those pages via userspace page tables. In
order for such faults to be handled in userspace, guest_memfd needs to
support userfaultfd.
This series proposes limited support for userfaultfd in guest_memfd:
- userfaultfd support is conditional on `CONFIG_KVM_GMEM_SHARED_MEM`
(as is fault support in general)
- Only the `page missing` event is currently supported
- Userspace is supposed to respond to the event with the `write`
syscall followed by the `UFFDIO_CONTINUE` ioctl to unblock the faulting
process. Note that we can't use `UFFDIO_COPY` here because
userfaultfd code does not know how to prepare guest_memfd pages, e.g.
remove them from the direct map [4].
Not included in this series:
- Proper interface for userfaultfd to recognise guest_memfd mappings
- Proper handling of truncation cases after locking the page
Request for comments:
- Is it a sensible workflow for guest_memfd to resolve a userfault
`page missing` event with `write` syscall + `UFFDIO_CONTINUE`? One
of the alternatives is teaching `UFFDIO_COPY` how to deal with
guest_memfd pages.
- What is a way forward to make userfaultfd code aware of guest_memfd?
I saw that Patrick hit a somewhat similar problem in [5] when trying
to use direct map manipulation functions in KVM and was pointed by
David at Elliot's guestmem library [6] that might include a shim for that.
Would the library be the right place to expose required interfaces like
`vma_is_gmem`?
Nikita
[1] https://lore.kernel.org/kvm/20250303130838.28812-1-kalyazin@amazon.com/T/
[2] https://lore.kernel.org/kvm/20250109204929.1106563-1-jthoughton@google.com/…
[3] https://docs.google.com/document/d/1M6766BzdY1Lhk7LiR5IqVR8B8mG3cr-cxTxOrAo…
[4] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/
[5] https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/T/…
[6] https://lore.kernel.org/kvm/20241122-guestmem-library-v5-2-450e92951a15@qui…
Nikita Kalyazin (5):
KVM: guest_memfd: add kvm_gmem_vma_is_gmem
KVM: guest_memfd: add support for uffd missing
mm: userfaultfd: allow to register userfaultfd for guest_memfd
mm: userfaultfd: support continue for guest_memfd
KVM: selftests: add uffd missing test for guest_memfd
include/linux/userfaultfd_k.h | 9 ++
mm/userfaultfd.c | 23 ++++-
.../testing/selftests/kvm/guest_memfd_test.c | 88 +++++++++++++++++++
virt/kvm/guest_memfd.c | 17 +++-
virt/kvm/kvm_mm.h | 1 +
5 files changed, 136 insertions(+), 2 deletions(-)
base-commit: 592e7531753dc4b711f96cd1daf808fd493d3223
--
2.47.1
Basics and overview
===================
Software with larger attack surfaces (e.g. network facing apps like databases,
browsers or apps relying on browser runtimes) suffers from memory corruption
issues which can be utilized by attackers to bend control flow of the program
to eventually gain control (by making their payload executable). Attackers are
able to perform such attacks by leveraging call-sites which rely on indirect
calls or return sites which rely on obtaining return address from stack memory.
To mitigate such attacks, risc-v extension zicfilp enforces that all indirect
calls must land on a landing pad instruction `lpad`, else the cpu will raise a
software check exception (a new cpu exception cause code on riscv).
Similarly for return flow, risc-v extension zicfiss extends architecture with
- `sspush` instruction to push return address on a shadow stack
- `sspopchk` instruction to pop return address from shadow stack
and compare with input operand (i.e. return address on stack)
- `sspopchk` to raise software check exception if comparison above
was a mismatch
- Protection mechanism using which shadow stack is not writeable via
regular store instructions
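The return-flow check listed above can be illustrated with a toy software
model. This is only a sketch in Rust: the `ShadowStack` type, its method
names and the `SoftwareCheck` error are hypothetical names chosen for
illustration, and the real zicfiss behaviour is implemented by the CPU, not
by code like this.

```rust
/// Toy model of the zicfiss return-address check (hypothetical names,
/// not kernel or hardware code).
#[derive(Debug, PartialEq)]
enum SoftwareCheck {
    /// Return address on the regular stack differs from the shadow copy.
    Mismatch { expected: u64, found: u64 },
    /// Pop attempted on an empty shadow stack (a model-only condition).
    Empty,
}

#[derive(Default)]
struct ShadowStack {
    // Only `sspush`/`sspopchk` touch this storage in the model, mirroring
    // the rule that regular stores cannot write the shadow stack.
    slots: Vec<u64>,
}

impl ShadowStack {
    /// Models `sspush`: save the return address on function entry.
    fn sspush(&mut self, return_address: u64) {
        self.slots.push(return_address);
    }

    /// Models `sspopchk`: pop the saved address and compare it with the
    /// return address read from the regular stack.
    fn sspopchk(&mut self, stack_return_address: u64) -> Result<(), SoftwareCheck> {
        let expected = self.slots.pop().ok_or(SoftwareCheck::Empty)?;
        if expected == stack_return_address {
            Ok(())
        } else {
            Err(SoftwareCheck::Mismatch {
                expected,
                found: stack_return_address,
            })
        }
    }
}

fn main() {
    let mut ss = ShadowStack::default();

    // Normal call/return: both copies of the return address agree.
    ss.sspush(0x1000);
    assert_eq!(ss.sspopchk(0x1000), Ok(()));

    // Corrupted return address on the regular stack: the check fails.
    ss.sspush(0x2000);
    assert_eq!(
        ss.sspopchk(0xdead_beef),
        Err(SoftwareCheck::Mismatch {
            expected: 0x2000,
            found: 0xdead_beef
        })
    );
}
```

In the model a failed check is just an `Err` value; on hardware the
equivalent outcome is the software check exception described above.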
More information and details can be found at the extensions' github repo [1].
Equivalent to landing pad (zicfilp) on x86 is `ENDBRANCH` instruction in Intel
CET [3] and branch target identification (BTI) [4] on arm.
Similarly x86's Intel CET has shadow stack [5] and arm64 has guarded control
stack (GCS) [6] which are very similar to risc-v's zicfiss shadow stack.
x86 and arm64 support for user mode shadow stack is already in mainline.
Kernel awareness for user control flow integrity
================================================
This series picks up Samuel Holland's envcfg changes [2] as well. So if those are
being applied independently, they should be removed from this series.
Enabling:
In order to maintain compatibility and not break anything in user mode, kernel
doesn't enable control flow integrity cpu extensions on binary by default.
Instead, it exposes a prctl interface to enable, disable and lock the shadow
stack or landing pad feature for a task. This allows userspace (loader) to
enumerate if all objects in its address space are compiled with shadow stack and
landing pad support and accordingly enable the feature. Additionally if a
subsequent `dlopen` happens on a library, user mode can take a decision again to
disable the feature (if incoming library is not compiled with support) OR
terminate the task (if user mode policy is strict to have all objects in address
space to be compiled with control flow integrity cpu feature). prctl to enable
shadow stack results in allocating shadow stack from virtual memory and
activating for user address space. x86 and arm64 are also following same
direction due to similar reason(s).
clone/fork:
On clone and fork, cfi state for task is inherited by child. Shadow stack is
part of virtual memory and is a writeable memory from kernel perspective
(writeable via a restricted set of instructions aka shadow stack instructions).
Thus kernel changes ensure that this memory is converted into read-only when
fork/clone happens and COWed when fault is taken due to sspush, sspopchk or
ssamoswap. In case `CLONE_VM` is specified and shadow stack is to be enabled,
kernel will automatically allocate a shadow stack for that clone call.
map_shadow_stack:
x86 introduced `map_shadow_stack` system call to allow user space to explicitly
map shadow stack memory in its address space. It is useful to allocate shadow
stacks for different contexts managed by a single thread (green threads or
contexts). risc-v implements this system call as well.
signal management:
If shadow stack is enabled for a task, kernel performs an asynchronous control
flow diversion to deliver the signal and eventually expects userspace to issue
sigreturn so that original execution can be resumed. Even though resume context
is prepared by kernel, it is in user space memory and is subject to memory
corruption, and corruption bugs can be utilized by an attacker in this race
window to perform arbitrary sigreturn and eventually bypass cfi mechanism.
Another issue is how to ensure that cfi related state on sigcontext area is not
trampled by legacy apps or apps compiled with old kernel headers.
In order to mitigate control-flow hijacking, kernel prepares a token and places
it on shadow stack before signal delivery and places address of token in
sigcontext structure. During sigreturn, kernel obtains address of token from
sigcontext structure, reads token from shadow stack and validates it and only
then allows sigreturn to succeed. Compatibility issue is solved by adopting
dynamic sigcontext management introduced for vector extension. This series
refactors the code a little bit to make future sigcontext management easier (as
proposed by Andy Chiu from SiFive).
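The token handshake described above can be sketched as a small model. This
is a hedged illustration in Rust, not the kernel implementation: `Task`,
`SigContext`, `TOKEN_TAG`, the token encoding (a tag bit combined with the
slot index) and the decision to truncate the shadow stack on return are all
assumptions made for the example; the text above does not specify them.

```rust
/// Assumption for the model only: a tag bit that marks a slot as a token.
const TOKEN_TAG: u64 = 1 << 63;

/// Kernel-side view of a task; the shadow stack is not writable by
/// ordinary user stores.
struct Task {
    shadow_stack: Vec<u64>,
}

/// User-writable signal frame; only the field relevant here is modelled.
struct SigContext {
    /// "Address" (slot index) of the token on the shadow stack.
    token_slot: usize,
}

impl Task {
    /// Before delivering a signal: push a token and record where it lives.
    fn deliver_signal(&mut self) -> SigContext {
        let slot = self.shadow_stack.len();
        self.shadow_stack.push(TOKEN_TAG | slot as u64);
        SigContext { token_slot: slot }
    }

    /// On sigreturn: the slot named by sigcontext must hold a valid token.
    fn sigreturn(&mut self, ctx: &SigContext) -> Result<(), &'static str> {
        let token = self
            .shadow_stack
            .get(ctx.token_slot)
            .copied()
            .ok_or("token address outside shadow stack")?;
        if token != (TOKEN_TAG | ctx.token_slot as u64) {
            return Err("invalid token");
        }
        // Model-only choice: drop the token and anything pushed above it.
        self.shadow_stack.truncate(ctx.token_slot);
        Ok(())
    }
}

fn main() {
    let mut task = Task {
        shadow_stack: vec![0x1000, 0x2000],
    };

    // Legitimate delivery followed by sigreturn succeeds.
    let ctx = task.deliver_signal();
    assert_eq!(task.sigreturn(&ctx), Ok(()));
    assert_eq!(task.shadow_stack, vec![0x1000, 0x2000]);

    // A forged sigcontext that points at an ordinary return address fails,
    // because that slot does not hold a token.
    let forged = SigContext { token_slot: 0 };
    assert_eq!(task.sigreturn(&forged), Err("invalid token"));
}
```

The point of the model is that the sigcontext lives in user-writable
memory, while the value it must match lives on the shadow stack, which
regular stores cannot modify.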
config and compilation:
Introduce a new risc-v config option `CONFIG_RISCV_USER_CFI`. Selecting this
config option picks the kernel support for user control flow integrity. This
option is presented only if toolchain has shadow stack and landing pad support.
It is guarded by toolchain support on purpose, because eventually the
vDSO also needs to be compiled in with shadow stack and landing pad support.
vDSO compile patches are not included as of now because landing pad labeling
scheme is yet to settle for usermode runtime.
To get more information on kernel interactions with respect to
zicfilp and zicfiss, patch series adds documentation for
`zicfilp` and `zicfiss` in following:
Documentation/arch/riscv/zicfiss.rst
Documentation/arch/riscv/zicfilp.rst
How to test this series
=======================
Toolchain
---------
$ git clone git@github.com:sifive/riscv-gnu-toolchain.git -b cfi-dev
$ riscv-gnu-toolchain/configure --prefix=<path-to-where-to-build> --with-arch=rv64gc_zicfilp_zicfiss --enable-linux --disable-gdb --with-extra-multilib-test="rv64gc_zicfilp_zicfiss-lp64d:-static"
$ make -j$(nproc)
Qemu
----
Get the latest qemu
$ cd qemu
$ mkdir build
$ cd build
$ ../configure --target-list=riscv64-softmmu
$ make -j$(nproc)
Opensbi
-------
$ git clone git@github.com:deepak0414/opensbi.git -b v6_cfi_spec_split_opensbi
$ make CROSS_COMPILE=<your riscv toolchain> -j$(nproc) PLATFORM=generic
Linux
-----
Running defconfig is fine. CFI is enabled by default if the toolchain
supports it.
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc) defconfig
$ make ARCH=riscv CROSS_COMPILE=<path-to-cfi-riscv-gnu-toolchain>/build/bin/riscv64-unknown-linux-gnu- -j$(nproc)
In case you're building your own rootfs using toolchain, please make sure you
pick following patch to ensure that vDSO is compiled with lpad and shadow stack.
"arch/riscv: compile vdso with landing pad"
Branch where above patch can be picked
https://github.com/deepak0414/linux-riscv-cfi/tree/vdso_user_cfi_v6.12-rc1
Running
-------
Modify your qemu command to have:
-bios <path-to-cfi-opensbi>/build/platform/generic/firmware/fw_dynamic.bin
-cpu rv64,zicfilp=true,zicfiss=true,zimop=true,zcmop=true
vDSO related Opens (in the flux)
=================================
I am listing these opens for laying out plan and what to expect in future
patch sets. And of course for the sake of discussion.
Shadow stack and landing pad enabling in vDSO
----------------------------------------------
vDSO must have shadow stack and landing pad support compiled in for task
to have shadow stack and landing pad support. This patch series doesn't
enable that (yet). Enabling shadow stack support in vDSO should be
straightforward (intend to do that in next versions of patch set). Enabling
landing pad support in vDSO requires some collaboration with toolchain folks
to follow a single label scheme for all object binaries. This is necessary to
ensure that all indirect call-sites are setting correct label and target landing
pads are decorated with same label scheme.
How many vDSOs
---------------
Shadow stack instructions are carved out of zimop (may-be-operations) and if CPU
doesn't implement zimop, they're illegal instructions. Kernel could be running on
a CPU which may or may not implement zimop. And thus kernel will have to carry 2
different vDSOs and expose the appropriate one depending on whether CPU implements
zimop or not.
References
==========
[1] - https://github.com/riscv/riscv-cfi
[2] - https://lore.kernel.org/all/20240814081126.956287-1-samuel.holland@sifive.c…
[3] - https://lwn.net/Articles/889475/
[4] - https://developer.arm.com/documentation/109576/0100/Branch-Target-Identific…
[5] - https://www.intel.com/content/dam/develop/external/us/en/documents/catc17-i…
[6] - https://lwn.net/Articles/940403/
---
changelog
---------
v11:
- patch "arch/riscv: compile vdso with landing pad" was unconditionally
selecting `_zicfilp` for vDSO compile. fixed that. Changed `lpad 1`
to `lpad 0`.
v10:
- dropped "mm: helper `is_shadow_stack_vma` to check shadow stack vma". This patch
is not that interesting to this patch series for risc-v. There are instances in
arch directories where VM_SHADOW_STACK flag is anyways used. Dropping this patch
to expedite merging in riscv tree.
- Took suggestions from `Clement` on "riscv: zicfiss / zicfilp enumeration" to
validate presence of cfi based on config.
- Added a patch for vDSO to have `lpad 0`. I had omitted this earlier to make sure
we add single vdso object with cfi enabled. But a vdso object with scheme of
zero labeled landing pad is least common denominator and should work with all
objects of zero labeled as well as function-signature labeled objects.
v9:
- rebased on master (39a803b754d5 fix braino in "9p: fix ->rename_sem exclusion")
- dropped "mm: Introduce ARCH_HAS_USER_SHADOW_STACK" (master has it from arm64/gcs)
- dropped "prctl: arch-agnostic prctl for shadow stack" (master has it from arm64/gcs)
v8:
- rebased on palmer/for-next
- dropped samuel holland's `envcfg` context switch patches.
they are in palmer/for-next
v7:
- Removed "riscv/Kconfig: enable HAVE_EXIT_THREAD for riscv"
Instead using `deactivate_mm` flow to clean up.
see here for more context
https://lore.kernel.org/all/20230908203655.543765-1-rick.p.edgecombe@intel.…
- Changed the header include in `kselftest`. Hopefully this fixes compile
issue faced by Zong Li at SiFive.
- Cleaned up an orphaned change to `mm/mmap.c` in below patch
"riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE"
- Lock interfaces for shadow stack and indirect branch tracking expect arg == 0
Any future evolution of this interface should accordingly define how arg should
be setup.
- `mm/map.c` has an instance of using `VM_SHADOW_STACK`. Fixed it to use helper
`is_shadow_stack_vma`.
- Link to v6: https://lore.kernel.org/r/20241008-v5_user_cfi_series-v6-0-60d9fe073f37@riv…
v6:
- Picked up Samuel Holland's changes as is with `envcfg` placed in
`thread` instead of `thread_info`
- fixed unaligned newline escapes in kselftest
- cleaned up messages in kselftest and included test output in commit message
- fixed a bug in clone path reported by Zong Li
- fixed a build issue if CONFIG_RISCV_ISA_V is not selected
(this was introduced due to re-factoring signal context
management code)
v5:
- rebased on v6.12-rc1
- Fixed schema related issues in device tree file
- Fixed some of the documentation related issues in zicfilp/ss.rst
(style issues and added index)
- added `SHADOW_STACK_SET_MARKER` so that implementation can define base
of shadow stack.
- Fixed warnings on definitions added in usercfi.h when
CONFIG_RISCV_USER_CFI is not selected.
- Adopted context header based signal handling as proposed by Andy Chiu
- Added support for enabling kernel mode access to shadow stack using
FWFT
(https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/src/ext-firmware…)
- Link to v5: https://lore.kernel.org/r/20241001-v5_user_cfi_series-v1-0-3ba65b6e550f@riv…
(Note: I had an issue in my workflow due to which version number wasn't
picked up correctly while sending out patches)
v4:
- rebased on 6.11-rc6
- envcfg: Converged with Samuel Holland's patches for envcfg management on per-
thread basis.
- vma_is_shadow_stack is renamed to is_vma_shadow_stack
- picked up Mark Brown's `ARCH_HAS_USER_SHADOW_STACK` patch
- signal context: using extended context management to maintain compatibility.
- fixed `-Wmissing-prototypes` compiler warnings for prctl functions
- Documentation fixes and amending typos.
- Link to v4: https://lore.kernel.org/all/20240912231650.3740732-1-debug@rivosinc.com/
v3:
- envcfg
logic to pick up base envcfg had a bug where `ENVCFG_CBZE` could have been
picked on per task basis, even though CPU didn't implement it. Fixed in
this series.
- dt-bindings
As suggested, split into separate commit. fixed the messaging that spec is
in public review
- arch_is_shadow_stack change
arch_is_shadow_stack changed to vma_is_shadow_stack
- hwprobe
zicfiss / zicfilp if present will get enumerated in hwprobe
- selftests
As suggested, added object and binary filenames to .gitignore
Selftest binary anyways need to be compiled with cfi enabled compiler which
will make sure that landing pad and shadow stack are enabled. Thus removed
separate enable/disable tests. Cleaned up tests a bit.
- Link to v3: https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
v2:
- Using config `CONFIG_RISCV_USER_CFI`, kernel support for riscv control flow
integrity for user mode programs can be compiled in the kernel.
- Enabling of control flow integrity for user programs is left to user runtime
- This patch series introduces arch agnostic `prctls` to enable shadow stack
and indirect branch tracking. And implements them on riscv.
---
---
Changes in v11:
- Link to v10: https://lore.kernel.org/r/20250210-v5_user_cfi_series-v10-0-163dcfa31c60@ri…
---
Andy Chiu (1):
riscv: signal: abstract header saving for setup_sigcontext
Clément Léger (1):
riscv: Add Firmware Feature SBI extensions definitions
Deepak Gupta (24):
mm: VM_SHADOW_STACK definition for riscv
dt-bindings: riscv: zicfilp and zicfiss in dt-bindings (extensions.yaml)
riscv: zicfiss / zicfilp enumeration
riscv: zicfiss / zicfilp extension csr and bit definitions
riscv: usercfi state for task and save/restore of CSR_SSP on trap entry/exit
riscv/mm : ensure PROT_WRITE leads to VM_READ | VM_WRITE
riscv mm: manufacture shadow stack pte
riscv mmu: teach pte_mkwrite to manufacture shadow stack PTEs
riscv mmu: write protect and shadow stack
riscv/mm: Implement map_shadow_stack() syscall
riscv/shstk: If needed allocate a new shadow stack on clone
riscv: Implements arch agnostic shadow stack prctls
prctl: arch-agnostic prctl for indirect branch tracking
riscv/traps: Introduce software check exception
riscv/signal: save and restore of shadow stack for signal
riscv/kernel: update __show_regs to print shadow stack register
riscv/ptrace: riscv cfi status and state via ptrace and in core files
riscv/hwprobe: zicfilp / zicfiss enumeration in hwprobe
riscv: enable kernel access to shadow stack memory via FWFT sbi call
riscv: kernel command line option to opt out of user cfi
riscv: create a config for shadow stack and landing pad instr support
riscv: Documentation for landing pad / indirect branch tracking
riscv: Documentation for shadow stack on riscv
kselftest/riscv: kselftest for user mode cfi
Jim Shu (1):
arch/riscv: compile vdso with landing pad
Documentation/arch/riscv/index.rst | 2 +
Documentation/arch/riscv/zicfilp.rst | 115 +++++
Documentation/arch/riscv/zicfiss.rst | 176 +++++++
.../devicetree/bindings/riscv/extensions.yaml | 14 +
arch/riscv/Kconfig | 20 +
arch/riscv/Makefile | 7 +-
arch/riscv/include/asm/asm-prototypes.h | 1 +
arch/riscv/include/asm/assembler.h | 44 ++
arch/riscv/include/asm/cpufeature.h | 13 +
arch/riscv/include/asm/csr.h | 16 +
arch/riscv/include/asm/entry-common.h | 2 +
arch/riscv/include/asm/hwcap.h | 2 +
arch/riscv/include/asm/mman.h | 25 +
arch/riscv/include/asm/mmu_context.h | 7 +
arch/riscv/include/asm/pgtable.h | 30 +-
arch/riscv/include/asm/processor.h | 2 +
arch/riscv/include/asm/sbi.h | 26 +
arch/riscv/include/asm/thread_info.h | 3 +
arch/riscv/include/asm/usercfi.h | 89 ++++
arch/riscv/include/asm/vector.h | 3 +
arch/riscv/include/uapi/asm/hwprobe.h | 2 +
arch/riscv/include/uapi/asm/ptrace.h | 22 +
arch/riscv/include/uapi/asm/sigcontext.h | 1 +
arch/riscv/kernel/Makefile | 1 +
arch/riscv/kernel/asm-offsets.c | 8 +
arch/riscv/kernel/cpufeature.c | 13 +
arch/riscv/kernel/entry.S | 31 +-
arch/riscv/kernel/head.S | 12 +
arch/riscv/kernel/process.c | 26 +-
arch/riscv/kernel/ptrace.c | 83 ++++
arch/riscv/kernel/signal.c | 142 +++++-
arch/riscv/kernel/sys_hwprobe.c | 2 +
arch/riscv/kernel/sys_riscv.c | 10 +
arch/riscv/kernel/traps.c | 43 ++
arch/riscv/kernel/usercfi.c | 524 +++++++++++++++++++++
arch/riscv/kernel/vdso/Makefile | 12 +
arch/riscv/kernel/vdso/flush_icache.S | 4 +
arch/riscv/kernel/vdso/getcpu.S | 4 +
arch/riscv/kernel/vdso/rt_sigreturn.S | 4 +
arch/riscv/kernel/vdso/sys_hwprobe.S | 4 +
arch/riscv/mm/init.c | 2 +-
arch/riscv/mm/pgtable.c | 17 +
include/linux/cpu.h | 4 +
include/linux/mm.h | 7 +
include/uapi/linux/elf.h | 1 +
include/uapi/linux/prctl.h | 27 ++
kernel/sys.c | 30 ++
tools/testing/selftests/riscv/Makefile | 2 +-
tools/testing/selftests/riscv/cfi/.gitignore | 3 +
tools/testing/selftests/riscv/cfi/Makefile | 10 +
tools/testing/selftests/riscv/cfi/cfi_rv_test.h | 84 ++++
tools/testing/selftests/riscv/cfi/riscv_cfi_test.c | 78 +++
tools/testing/selftests/riscv/cfi/shadowstack.c | 375 +++++++++++++++
tools/testing/selftests/riscv/cfi/shadowstack.h | 37 ++
54 files changed, 2193 insertions(+), 29 deletions(-)
---
base-commit: 39a803b754d5224a3522016b564113ee1e4091b2
change-id: 20240930-v5_user_cfi_series-3dc332f8f5b2
--
- debug
In Rust 1.51.0, Clippy introduced the `ptr_as_ptr` lint [1]:
> Though `as` casts between raw pointers are not terrible,
> `pointer::cast` is safer because it cannot accidentally change the
> pointer's mutability, nor cast the pointer to other types like `usize`.
There are a few classes of changes required:
- Modules generated by bindgen are marked
`#[allow(clippy::ptr_as_ptr)]`.
- Inferred casts (` as _`) are replaced with `.cast()`.
- Ascribed casts (` as *... T`) are replaced with `.cast::<T>()`.
- Multistep casts from references (` as *const _ as *const T`) are
replaced with `let x: *const _ = &x;` and `.cast()` or `.cast::<T>()`
according to the previous rules. The intermediate `let` binding is
required because `(x as *const _).cast::<T>()` results in inference
failure.
- Native literal C strings are replaced with `c_str!().as_char_ptr()`.
Apply these changes and enable the lint -- no functional change
intended.
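
For illustration only, the rewrite rules above look roughly like this
outside the kernel tree. This is a standalone sketch: the `Header` type and
the helper functions are hypothetical and are not part of this patch.

```rust
use core::ffi::c_void;

/// Hypothetical type used only for this illustration.
#[repr(C)]
struct Header {
    tag: u32,
    len: u32,
}

/// Ascribed cast: `p as *mut Header` becomes `p.cast::<Header>()`.
fn from_opaque(p: *mut c_void) -> *mut Header {
    p.cast::<Header>()
}

/// Inferred cast: `p as _` becomes `p.cast()`; the target type is taken
/// from the return type. `cast` cannot change `*mut` into `*const`.
fn to_opaque(p: *mut Header) -> *mut c_void {
    p.cast()
}

/// Multistep cast from a reference: `h as *const _ as *const u8` becomes
/// a typed `let` binding followed by `.cast::<u8>()`.
fn first_byte(h: &Header) -> u8 {
    let p: *const _ = h;
    // SAFETY: `p` comes from a live reference, so reading the first byte
    // of the object it points to is in bounds and initialized.
    unsafe { *p.cast::<u8>() }
}

fn main() {
    let mut h = Header {
        tag: 0x0102_0304,
        len: 8,
    };

    // The first byte in memory depends on endianness, so compare against
    // the native-endian byte representation rather than a fixed value.
    assert_eq!(first_byte(&h), h.tag.to_ne_bytes()[0]);

    let back = from_opaque(to_opaque(&mut h));
    // SAFETY: `back` points at `h`, which is alive and not otherwise
    // borrowed while it is read here.
    assert_eq!(unsafe { (*back).len }, 8);
}
```
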
Link: https://rust-lang.github.io/rust-clippy/master/index.html#ptr_as_ptr [1]
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Makefile | 1 +
rust/bindings/lib.rs | 1 +
rust/kernel/alloc/allocator_test.rs | 2 +-
rust/kernel/alloc/kvec.rs | 4 ++--
rust/kernel/device.rs | 5 +++--
rust/kernel/devres.rs | 2 +-
rust/kernel/error.rs | 2 +-
rust/kernel/fs/file.rs | 2 +-
rust/kernel/kunit.rs | 15 +++++++--------
rust/kernel/lib.rs | 4 ++--
rust/kernel/list/impl_list_item_mod.rs | 2 +-
rust/kernel/pci.rs | 2 +-
rust/kernel/platform.rs | 4 +++-
rust/kernel/print.rs | 11 +++++------
rust/kernel/seq_file.rs | 3 ++-
rust/kernel/str.rs | 2 +-
rust/kernel/sync/poll.rs | 2 +-
rust/kernel/workqueue.rs | 10 +++++-----
rust/uapi/lib.rs | 1 +
19 files changed, 40 insertions(+), 35 deletions(-)
diff --git a/Makefile b/Makefile
index 70bdbf2218fc..ec8efc8e23ba 100644
--- a/Makefile
+++ b/Makefile
@@ -483,6 +483,7 @@ export rust_common_flags := --edition=2021 \
-Wclippy::needless_continue \
-Aclippy::needless_lifetimes \
-Wclippy::no_mangle_with_rust_abi \
+ -Wclippy::ptr_as_ptr \
-Wclippy::undocumented_unsafe_blocks \
-Wclippy::unnecessary_safety_comment \
-Wclippy::unnecessary_safety_doc \
diff --git a/rust/bindings/lib.rs b/rust/bindings/lib.rs
index 014af0d1fc70..0486a32ed314 100644
--- a/rust/bindings/lib.rs
+++ b/rust/bindings/lib.rs
@@ -25,6 +25,7 @@
)]
#[allow(dead_code)]
+#[allow(clippy::ptr_as_ptr)]
#[allow(clippy::undocumented_unsafe_blocks)]
mod bindings_raw {
// Manual definition for blocklisted types.
diff --git a/rust/kernel/alloc/allocator_test.rs b/rust/kernel/alloc/allocator_test.rs
index c37d4c0c64e9..8017aa9d5213 100644
--- a/rust/kernel/alloc/allocator_test.rs
+++ b/rust/kernel/alloc/allocator_test.rs
@@ -82,7 +82,7 @@ unsafe fn realloc(
// SAFETY: Returns either NULL or a pointer to a memory allocation that satisfies or
// exceeds the given size and alignment requirements.
- let dst = unsafe { libc_aligned_alloc(layout.align(), layout.size()) } as *mut u8;
+ let dst = unsafe { libc_aligned_alloc(layout.align(), layout.size()) }.cast::<u8>();
let dst = NonNull::new(dst).ok_or(AllocError)?;
if flags.contains(__GFP_ZERO) {
diff --git a/rust/kernel/alloc/kvec.rs b/rust/kernel/alloc/kvec.rs
index ae9d072741ce..c12844764671 100644
--- a/rust/kernel/alloc/kvec.rs
+++ b/rust/kernel/alloc/kvec.rs
@@ -262,7 +262,7 @@ pub fn spare_capacity_mut(&mut self) -> &mut [MaybeUninit<T>] {
// - `self.len` is smaller than `self.capacity` and hence, the resulting pointer is
// guaranteed to be part of the same allocated object.
// - `self.len` can not overflow `isize`.
- let ptr = unsafe { self.as_mut_ptr().add(self.len) } as *mut MaybeUninit<T>;
+ let ptr = unsafe { self.as_mut_ptr().add(self.len) }.cast::<MaybeUninit<T>>();
// SAFETY: The memory between `self.len` and `self.capacity` is guaranteed to be allocated
// and valid, but uninitialized.
@@ -554,7 +554,7 @@ fn drop(&mut self) {
// - `ptr` points to memory with at least a size of `size_of::<T>() * len`,
// - all elements within `b` are initialized values of `T`,
// - `len` does not exceed `isize::MAX`.
- unsafe { Vec::from_raw_parts(ptr as _, len, len) }
+ unsafe { Vec::from_raw_parts(ptr.cast(), len, len) }
}
}
diff --git a/rust/kernel/device.rs b/rust/kernel/device.rs
index db2d9658ba47..9e500498835d 100644
--- a/rust/kernel/device.rs
+++ b/rust/kernel/device.rs
@@ -168,16 +168,17 @@ pub fn pr_dbg(&self, args: fmt::Arguments<'_>) {
/// `KERN_*`constants, for example, `KERN_CRIT`, `KERN_ALERT`, etc.
#[cfg_attr(not(CONFIG_PRINTK), allow(unused_variables))]
unsafe fn printk(&self, klevel: &[u8], msg: fmt::Arguments<'_>) {
+ let msg: *const _ = &msg;
// SAFETY: `klevel` is null-terminated and one of the kernel constants. `self.as_raw`
// is valid because `self` is valid. The "%pA" format string expects a pointer to
// `fmt::Arguments`, which is what we're passing as the last argument.
#[cfg(CONFIG_PRINTK)]
unsafe {
bindings::_dev_printk(
- klevel as *const _ as *const crate::ffi::c_char,
+ klevel.as_ptr().cast::<crate::ffi::c_char>(),
self.as_raw(),
c_str!("%pA").as_char_ptr(),
- &msg as *const _ as *const crate::ffi::c_void,
+ msg.cast::<crate::ffi::c_void>(),
)
};
}
diff --git a/rust/kernel/devres.rs b/rust/kernel/devres.rs
index 942376f6f3af..3a9d998ec371 100644
--- a/rust/kernel/devres.rs
+++ b/rust/kernel/devres.rs
@@ -157,7 +157,7 @@ fn remove_action(this: &Arc<Self>) {
#[allow(clippy::missing_safety_doc)]
unsafe extern "C" fn devres_callback(ptr: *mut kernel::ffi::c_void) {
- let ptr = ptr as *mut DevresInner<T>;
+ let ptr = ptr.cast::<DevresInner<T>>();
// Devres owned this memory; now that we received the callback, drop the `Arc` and hence the
// reference.
// SAFETY: Safe, since we leaked an `Arc` reference to devm_add_action() in
diff --git a/rust/kernel/error.rs b/rust/kernel/error.rs
index f6ecf09cb65f..8654d52b0bb9 100644
--- a/rust/kernel/error.rs
+++ b/rust/kernel/error.rs
@@ -152,7 +152,7 @@ pub(crate) fn to_blk_status(self) -> bindings::blk_status_t {
/// Returns the error encoded as a pointer.
pub fn to_ptr<T>(self) -> *mut T {
// SAFETY: `self.0` is a valid error due to its invariant.
- unsafe { bindings::ERR_PTR(self.0.get() as _) as *mut _ }
+ unsafe { bindings::ERR_PTR(self.0.get() as _).cast() }
}
/// Returns a string representing the error, if one exists.
diff --git a/rust/kernel/fs/file.rs b/rust/kernel/fs/file.rs
index e03dbe14d62a..8936afc234a4 100644
--- a/rust/kernel/fs/file.rs
+++ b/rust/kernel/fs/file.rs
@@ -364,7 +364,7 @@ fn deref(&self) -> &LocalFile {
//
// By the type invariants, there are no `fdget_pos` calls that did not take the
// `f_pos_lock` mutex.
- unsafe { LocalFile::from_raw_file(self as *const File as *const bindings::file) }
+ unsafe { LocalFile::from_raw_file((self as *const Self).cast()) }
}
}
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 824da0e9738a..7ed2063c1af0 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -8,19 +8,20 @@
use core::{ffi::c_void, fmt};
+#[cfg(CONFIG_PRINTK)]
+use crate::c_str;
+
/// Prints a KUnit error-level message.
///
/// Public but hidden since it should only be used from KUnit generated code.
#[doc(hidden)]
pub fn err(args: fmt::Arguments<'_>) {
+ let args: *const _ = &args;
// SAFETY: The format string is null-terminated and the `%pA` specifier matches the argument we
// are passing.
#[cfg(CONFIG_PRINTK)]
unsafe {
- bindings::_printk(
- c"\x013%pA".as_ptr() as _,
- &args as *const _ as *const c_void,
- );
+ bindings::_printk(c_str!("\x013%pA").as_char_ptr(), args.cast::<c_void>());
}
}
@@ -29,14 +30,12 @@ pub fn err(args: fmt::Arguments<'_>) {
/// Public but hidden since it should only be used from KUnit generated code.
#[doc(hidden)]
pub fn info(args: fmt::Arguments<'_>) {
+ let args: *const _ = &args;
// SAFETY: The format string is null-terminated and the `%pA` specifier matches the argument we
// are passing.
#[cfg(CONFIG_PRINTK)]
unsafe {
- bindings::_printk(
- c"\x016%pA".as_ptr() as _,
- &args as *const _ as *const c_void,
- );
+ bindings::_printk(c_str!("\x016%pA").as_char_ptr(), args.cast::<c_void>());
}
}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 7697c60b2d1a..01264e459c92 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -196,9 +196,9 @@ fn panic(info: &core::panic::PanicInfo<'_>) -> ! {
#[macro_export]
macro_rules! container_of {
($ptr:expr, $type:ty, $($f:tt)*) => {{
- let ptr = $ptr as *const _ as *const u8;
+ let ptr: *const _ = $ptr;
let offset: usize = ::core::mem::offset_of!($type, $($f)*);
- ptr.sub(offset) as *const $type
+ ptr.cast::<u8>().sub(offset).cast::<$type>()
}}
}
diff --git a/rust/kernel/list/impl_list_item_mod.rs b/rust/kernel/list/impl_list_item_mod.rs
index a0438537cee1..1f9498c1458f 100644
--- a/rust/kernel/list/impl_list_item_mod.rs
+++ b/rust/kernel/list/impl_list_item_mod.rs
@@ -34,7 +34,7 @@ pub unsafe trait HasListLinks<const ID: u64 = 0> {
unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut ListLinks<ID> {
// SAFETY: The caller promises that the pointer is valid. The implementer promises that the
// `OFFSET` constant is correct.
- unsafe { (ptr as *mut u8).add(Self::OFFSET) as *mut ListLinks<ID> }
+ unsafe { ptr.cast::<u8>().add(Self::OFFSET).cast() }
}
}
diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index 4c98b5b9aa1e..206f71d33ab2 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -75,7 +75,7 @@ extern "C" fn probe_callback(
// Let the `struct pci_dev` own a reference of the driver's private data.
// SAFETY: By the type invariant `pdev.as_raw` returns a valid pointer to a
// `struct pci_dev`.
- unsafe { bindings::pci_set_drvdata(pdev.as_raw(), data.into_foreign() as _) };
+ unsafe { bindings::pci_set_drvdata(pdev.as_raw(), data.into_foreign().cast()) };
}
Err(err) => return Error::to_errno(err),
}
diff --git a/rust/kernel/platform.rs b/rust/kernel/platform.rs
index 50e6b0421813..8f9e6b125faf 100644
--- a/rust/kernel/platform.rs
+++ b/rust/kernel/platform.rs
@@ -66,7 +66,9 @@ extern "C" fn probe_callback(pdev: *mut bindings::platform_device) -> kernel::ff
// Let the `struct platform_device` own a reference of the driver's private data.
// SAFETY: By the type invariant `pdev.as_raw` returns a valid pointer to a
// `struct platform_device`.
- unsafe { bindings::platform_set_drvdata(pdev.as_raw(), data.into_foreign() as _) };
+ unsafe {
+ bindings::platform_set_drvdata(pdev.as_raw(), data.into_foreign().cast())
+ };
}
Err(err) => return Error::to_errno(err),
}
diff --git a/rust/kernel/print.rs b/rust/kernel/print.rs
index b19ee490be58..0245c145ea32 100644
--- a/rust/kernel/print.rs
+++ b/rust/kernel/print.rs
@@ -25,7 +25,7 @@
// SAFETY: The C contract guarantees that `buf` is valid if it's less than `end`.
let mut w = unsafe { RawFormatter::from_ptrs(buf.cast(), end.cast()) };
// SAFETY: TODO.
- let _ = w.write_fmt(unsafe { *(ptr as *const fmt::Arguments<'_>) });
+ let _ = w.write_fmt(unsafe { *ptr.cast::<fmt::Arguments<'_>>() });
w.pos().cast()
}
@@ -102,6 +102,7 @@ pub unsafe fn call_printk(
module_name: &[u8],
args: fmt::Arguments<'_>,
) {
+ let args: *const _ = &args;
// `_printk` does not seem to fail in any path.
#[cfg(CONFIG_PRINTK)]
// SAFETY: TODO.
@@ -109,7 +110,7 @@ pub unsafe fn call_printk(
bindings::_printk(
format_string.as_ptr(),
module_name.as_ptr(),
- &args as *const _ as *const c_void,
+ args.cast::<c_void>(),
);
}
}
@@ -122,15 +123,13 @@ pub unsafe fn call_printk(
#[doc(hidden)]
#[cfg_attr(not(CONFIG_PRINTK), allow(unused_variables))]
pub fn call_printk_cont(args: fmt::Arguments<'_>) {
+ let args: *const _ = &args;
// `_printk` does not seem to fail in any path.
//
// SAFETY: The format string is fixed.
#[cfg(CONFIG_PRINTK)]
unsafe {
- bindings::_printk(
- format_strings::CONT.as_ptr(),
- &args as *const _ as *const c_void,
- );
+ bindings::_printk(format_strings::CONT.as_ptr(), args.cast::<c_void>());
}
}
diff --git a/rust/kernel/seq_file.rs b/rust/kernel/seq_file.rs
index 04947c672979..90545d28e6b7 100644
--- a/rust/kernel/seq_file.rs
+++ b/rust/kernel/seq_file.rs
@@ -31,12 +31,13 @@ pub unsafe fn from_raw<'a>(ptr: *mut bindings::seq_file) -> &'a SeqFile {
/// Used by the [`seq_print`] macro.
pub fn call_printf(&self, args: core::fmt::Arguments<'_>) {
+ let args: *const _ = &args;
// SAFETY: Passing a void pointer to `Arguments` is valid for `%pA`.
unsafe {
bindings::seq_printf(
self.inner.get(),
c_str!("%pA").as_char_ptr(),
- &args as *const _ as *const crate::ffi::c_void,
+ args.cast::<crate::ffi::c_void>(),
);
}
}
diff --git a/rust/kernel/str.rs b/rust/kernel/str.rs
index 28e2201604d6..6a1a982b946d 100644
--- a/rust/kernel/str.rs
+++ b/rust/kernel/str.rs
@@ -191,7 +191,7 @@ pub unsafe fn from_char_ptr<'a>(ptr: *const crate::ffi::c_char) -> &'a Self {
// to a `NUL`-terminated C string.
let len = unsafe { bindings::strlen(ptr) } + 1;
// SAFETY: Lifetime guaranteed by the safety precondition.
- let bytes = unsafe { core::slice::from_raw_parts(ptr as _, len) };
+ let bytes = unsafe { core::slice::from_raw_parts(ptr.cast(), len) };
// SAFETY: As `len` is returned by `strlen`, `bytes` does not contain interior `NUL`.
// As we have added 1 to `len`, the last byte is known to be `NUL`.
unsafe { Self::from_bytes_with_nul_unchecked(bytes) }
diff --git a/rust/kernel/sync/poll.rs b/rust/kernel/sync/poll.rs
index d5f17153b424..a151f54cde91 100644
--- a/rust/kernel/sync/poll.rs
+++ b/rust/kernel/sync/poll.rs
@@ -73,7 +73,7 @@ pub fn register_wait(&mut self, file: &File, cv: &PollCondVar) {
// be destroyed, the destructor must run. That destructor first removes all waiters,
// and then waits for an rcu grace period. Therefore, `cv.wait_queue_head` is valid for
// long enough.
- unsafe { qproc(file.as_ptr() as _, cv.wait_queue_head.get(), self.0.get()) };
+ unsafe { qproc(file.as_ptr().cast(), cv.wait_queue_head.get(), self.0.get()) };
}
}
}
diff --git a/rust/kernel/workqueue.rs b/rust/kernel/workqueue.rs
index 0cd100d2aefb..8ff54105be3f 100644
--- a/rust/kernel/workqueue.rs
+++ b/rust/kernel/workqueue.rs
@@ -170,7 +170,7 @@ impl Queue {
pub unsafe fn from_raw<'a>(ptr: *const bindings::workqueue_struct) -> &'a Queue {
// SAFETY: The `Queue` type is `#[repr(transparent)]`, so the pointer cast is valid. The
// caller promises that the pointer is not dangling.
- unsafe { &*(ptr as *const Queue) }
+ unsafe { &*ptr.cast::<Queue>() }
}
/// Enqueues a work item.
@@ -457,7 +457,7 @@ fn get_work_offset(&self) -> usize {
#[inline]
unsafe fn raw_get_work(ptr: *mut Self) -> *mut Work<T, ID> {
// SAFETY: The caller promises that the pointer is valid.
- unsafe { (ptr as *mut u8).add(Self::OFFSET) as *mut Work<T, ID> }
+ unsafe { ptr.cast::<u8>().add(Self::OFFSET).cast::<Work<T, ID>>() }
}
/// Returns a pointer to the struct containing the [`Work<T, ID>`] field.
@@ -472,7 +472,7 @@ unsafe fn work_container_of(ptr: *mut Work<T, ID>) -> *mut Self
{
// SAFETY: The caller promises that the pointer points at a field of the right type in the
// right kind of struct.
- unsafe { (ptr as *mut u8).sub(Self::OFFSET) as *mut Self }
+ unsafe { ptr.cast::<u8>().sub(Self::OFFSET).cast::<Self>() }
}
}
@@ -538,7 +538,7 @@ unsafe impl<T, const ID: u64> WorkItemPointer<ID> for Arc<T>
{
unsafe extern "C" fn run(ptr: *mut bindings::work_struct) {
// The `__enqueue` method always uses a `work_struct` stored in a `Work<T, ID>`.
- let ptr = ptr as *mut Work<T, ID>;
+ let ptr = ptr.cast::<Work<T, ID>>();
// SAFETY: This computes the pointer that `__enqueue` got from `Arc::into_raw`.
let ptr = unsafe { T::work_container_of(ptr) };
// SAFETY: This pointer comes from `Arc::into_raw` and we've been given back ownership.
@@ -591,7 +591,7 @@ unsafe impl<T, const ID: u64> WorkItemPointer<ID> for Pin<KBox<T>>
{
unsafe extern "C" fn run(ptr: *mut bindings::work_struct) {
// The `__enqueue` method always uses a `work_struct` stored in a `Work<T, ID>`.
- let ptr = ptr as *mut Work<T, ID>;
+ let ptr = ptr.cast::<Work<T, ID>>();
// SAFETY: This computes the pointer that `__enqueue` got from `Arc::into_raw`.
let ptr = unsafe { T::work_container_of(ptr) };
// SAFETY: This pointer comes from `Arc::into_raw` and we've been given back ownership.
diff --git a/rust/uapi/lib.rs b/rust/uapi/lib.rs
index 13495910271f..fe9bf7b5a306 100644
--- a/rust/uapi/lib.rs
+++ b/rust/uapi/lib.rs
@@ -15,6 +15,7 @@
#![allow(
clippy::all,
clippy::undocumented_unsafe_blocks,
+ clippy::ptr_as_ptr,
dead_code,
missing_docs,
non_camel_case_types,
---
base-commit: ff64846bee0e7e3e7bc9363ebad3bab42dd27e24
change-id: 20250307-ptr-as-ptr-21b1867fc4d4
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
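For readers skimming the diff above, the recurring change is the replacement of `as` pointer casts with the `cast()` method, which changes only the pointee type. The following is a minimal userspace sketch of the `cast::<u8>().add(..)` and `cast::<u8>().sub(..)` patterns seen in `raw_get_list_links` and `container_of!`; the `Outer` and `Inner` types and the helper names are hypothetical and not part of the kernel crate, and the sketch assumes a toolchain with stable `offset_of!` (Rust 1.77 or later).

```rust
use core::mem::offset_of;

// Hypothetical types, used only to illustrate the cast pattern.
#[repr(C)]
struct Inner {
    value: u64,
}

#[repr(C)]
struct Outer {
    tag: u32,
    inner: Inner,
}

/// Computes a pointer to the `inner` field from a pointer to `Outer`.
///
/// # Safety
///
/// `ptr` must point to a live `Outer`.
unsafe fn inner_of(ptr: *const Outer) -> *const Inner {
    // SAFETY: The caller promises `ptr` is valid, so the offset stays in bounds.
    // `cast` only changes the pointee type; unlike `as`, it cannot silently
    // change constness or turn the pointer into an integer.
    unsafe { ptr.cast::<u8>().add(offset_of!(Outer, inner)).cast::<Inner>() }
}

/// Recovers a pointer to the containing `Outer` from a pointer to its `inner` field.
///
/// # Safety
///
/// `ptr` must point to the `inner` field of a live `Outer` and must have been
/// derived from a pointer to that whole `Outer`.
unsafe fn outer_of(ptr: *const Inner) -> *const Outer {
    // SAFETY: The caller promises `ptr` points at `Outer::inner`, so stepping
    // back by the field offset lands at the start of the `Outer`.
    unsafe { ptr.cast::<u8>().sub(offset_of!(Outer, inner)).cast::<Outer>() }
}

/// Goes from `Outer` to its field and back, then reads both values.
fn round_trip(outer: &Outer) -> (u32, u64) {
    let base: *const Outer = outer;
    // SAFETY: `base` comes from a live reference, and `inner` is derived from it.
    unsafe {
        let inner = inner_of(base);
        let back = outer_of(inner);
        ((*back).tag, (*inner).value)
    }
}

fn main() {
    let outer = Outer { tag: 7, inner: Inner { value: 42 } };
    assert_eq!(round_trip(&outer), (7, 42));
}
```

Spelling out the target type with a turbofish, as in `cast::<Inner>()`, keeps the intended pointee visible at the call site even where inference would accept a bare `cast()`.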
Signal delivery during connect() may disconnect an already established
socket. The problem is that such a socket might have been placed in a sockmap
before the connection was closed.
PATCH 1 ensures this race won't leave an unconnected vsock in the
sockmap. PATCH 2 adds a selftest for it.
PATCH 3 fixes a related race. Note that the race window there is rather
difficult to hit, and I can't think of an easy way to test it.
Signed-off-by: Michal Luczaj <mhal(a)rbox.co>
---
Changes in v2:
- Handle one more path of tripping the warning
- Add a selftest
- Collect R-b [Stefano]
- Link to v1: https://lore.kernel.org/r/20250307-vsock-trans-signal-race-v1-1-3aca3f771fb…
---
Michal Luczaj (3):
vsock/bpf: Fix EINTR connect() racing sockmap update
selftest/bpf: Add test for AF_VSOCK connect() racing sockmap update
vsock/bpf: Fix bpf recvmsg() racing transport reassignment
net/vmw_vsock/af_vsock.c | 10 +-
net/vmw_vsock/vsock_bpf.c | 24 +++--
.../selftests/bpf/prog_tests/sockmap_basic.c | 111 +++++++++++++++++++++
3 files changed, 136 insertions(+), 9 deletions(-)
---
base-commit: da9e8efe7ee10e8425dc356a9fc593502c8e3933
change-id: 20250305-vsock-trans-signal-race-d62f7718d099
Best regards,
--
Michal Luczaj <mhal(a)rbox.co>
On 2025/3/14 18:14, Su Hui wrote:
> On 2025/3/14 17:21, Dan Carpenter wrote:
>> On Fri, Mar 14, 2025 at 03:14:51PM +0800, Su Hui wrote:
>>> When 'manual=false' and 'signaled=true', then expected value when using
>>> NTSYNC_IOC_CREATE_EVENT should be greater than zero. Fix this typo error.
>>>
>>> Signed-off-by: Su Hui<suhui(a)nfschina.com>
>>> ---
>>> tools/testing/selftests/drivers/ntsync/ntsync.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/tools/testing/selftests/drivers/ntsync/ntsync.c b/tools/testing/selftests/drivers/ntsync/ntsync.c
>>> index 3aad311574c4..bfb6fad653d0 100644
>>> --- a/tools/testing/selftests/drivers/ntsync/ntsync.c
>>> +++ b/tools/testing/selftests/drivers/ntsync/ntsync.c
>>> @@ -968,7 +968,7 @@ TEST(wake_all)
>>> auto_event_args.manual = false;
>>> auto_event_args.signaled = true;
>>> objs[3] = ioctl(fd, NTSYNC_IOC_CREATE_EVENT, &auto_event_args);
>>> - EXPECT_EQ(0, objs[3]);
>>> + EXPECT_LE(0, objs[3]);
>> It's kind of weird how these macros put the constant on the left.
>> It returns an "fd" on success. So this look reasonable. It probably
>> won't return the zero fd so we could probably check EXPECT_LT()?
> Agreed, there are about 29 items that can be changed to EXPECT_LT().
> I can send a v2 patchset with this change if there is no more other
> suggestions.
Sorry for the badly formatted email :(
Su Hui
After the recent merge between net-next and net, I got some conflicts on
my side because the merge resolution differed from Stephen's [1], which
I had applied on my side in the MPTCP tree.
It looks like the code that is now in net-next is using the old way to
retrieve the local and remote addresses. This patch is now using the new
way, like what was in Stephen's email [1].
Also, in get_interface_info(), there were no conflicts in this area,
because that was new code from 'net', but a small adaptation was needed
there as well to get the remote address.
Fixes: 941defcea7e1 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
Link: https://lore.kernel.org/20250311115758.17a1d414@canb.auug.org.au [1]
Suggested-by: Stephen Rothwell <sfr(a)canb.auug.org.au>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
---
tools/testing/selftests/drivers/net/ping.py | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/ping.py b/tools/testing/selftests/drivers/net/ping.py
index 7a1026a073681d159202015fc6945e91368863fe..79f07e0510ecc14d3bc2716e14f49f9381bb919f 100755
--- a/tools/testing/selftests/drivers/net/ping.py
+++ b/tools/testing/selftests/drivers/net/ping.py
@@ -15,18 +15,18 @@ no_sleep=False
def _test_v4(cfg) -> None:
cfg.require_ipver("4")
- cmd(f"ping -c 1 -W0.5 {cfg.remote_v4}")
- cmd(f"ping -c 1 -W0.5 {cfg.v4}", host=cfg.remote)
- cmd(f"ping -s 65000 -c 1 -W0.5 {cfg.remote_v4}")
- cmd(f"ping -s 65000 -c 1 -W0.5 {cfg.v4}", host=cfg.remote)
+ cmd("ping -c 1 -W0.5 " + cfg.remote_addr_v["4"])
+ cmd("ping -c 1 -W0.5 " + cfg.addr_v["4"], host=cfg.remote)
+ cmd("ping -s 65000 -c 1 -W0.5 " + cfg.remote_addr_v["4"])
+ cmd("ping -s 65000 -c 1 -W0.5 " + cfg.addr_v["4"], host=cfg.remote)
def _test_v6(cfg) -> None:
cfg.require_ipver("6")
- cmd(f"ping -c 1 -W5 {cfg.remote_v6}")
- cmd(f"ping -c 1 -W5 {cfg.v6}", host=cfg.remote)
- cmd(f"ping -s 65000 -c 1 -W0.5 {cfg.remote_v6}")
- cmd(f"ping -s 65000 -c 1 -W0.5 {cfg.v6}", host=cfg.remote)
+ cmd("ping -c 1 -W5 " + cfg.remote_addr_v["6"])
+ cmd("ping -c 1 -W5 " + cfg.addr_v["6"], host=cfg.remote)
+ cmd("ping -s 65000 -c 1 -W0.5 " + cfg.remote_addr_v["6"])
+ cmd("ping -s 65000 -c 1 -W0.5 " + cfg.addr_v["6"], host=cfg.remote)
def _test_tcp(cfg) -> None:
cfg.require_cmd("socat", remote=True)
@@ -120,7 +120,7 @@ def get_interface_info(cfg) -> None:
global remote_ifname
global no_sleep
- remote_info = cmd(f"ip -4 -o addr show to {cfg.remote_v4} | awk '{{print $2}}'", shell=True, host=cfg.remote).stdout
+ remote_info = cmd(f"ip -4 -o addr show to {cfg.remote_addr_v['4']} | awk '{{print $2}}'", shell=True, host=cfg.remote).stdout
remote_ifname = remote_info.rstrip('\n')
if remote_ifname == "":
raise KsftFailEx('Can not get remote interface')
---
base-commit: 941defcea7e11ad7ff8f0d4856716dd637d757dd
change-id: 20250314-net-next-drv-net-ping-fix-merge-b303167fde16
Best regards,
--
Matthieu Baerts (NGI0) <matttbe(a)kernel.org>
Replace all occurrences of `addr_of!(place)` with `&raw const place`, and
all occurrences of `addr_of_mut!(place)` with `&raw mut place`.
Using the new syntax reduces macro complexity and improves consistency
with the existing reference syntax: `&raw const` and `&raw mut` closely
mirror `&` and `&mut`, so they fit more naturally with the surrounding
code.
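As a rough illustration of the equivalence described above, the sketch below initializes a struct field by field, once with `addr_of_mut!` and once with `&raw mut`. The `Pair` type and both helper functions are hypothetical and exist only for this example; it assumes a toolchain where the raw borrow operators are available without a feature gate (stable since Rust 1.82), whereas the kernel enables `raw_ref_op` explicitly.

```rust
use core::mem::MaybeUninit;
use core::ptr::addr_of_mut;

// Hypothetical struct, used only for illustration.
struct Pair {
    a: u32,
    b: u64,
}

/// Initializes a `Pair` field by field using the `addr_of_mut!` macro.
fn init_with_macro(a: u32, b: u64) -> Pair {
    let mut slot = MaybeUninit::<Pair>::uninit();
    let p = slot.as_mut_ptr();
    // SAFETY: `p` points to memory owned by `slot`. No reference to the
    // uninitialized fields is created, and both fields are written before
    // `assume_init` is called.
    unsafe {
        addr_of_mut!((*p).a).write(a);
        addr_of_mut!((*p).b).write(b);
        slot.assume_init()
    }
}

/// Same initialization, written with the `&raw mut` operator.
fn init_with_raw_ref(a: u32, b: u64) -> Pair {
    let mut slot = MaybeUninit::<Pair>::uninit();
    let p = slot.as_mut_ptr();
    // SAFETY: Same reasoning as in `init_with_macro`.
    unsafe {
        // The parentheses matter: method calls bind tighter than `&raw mut`,
        // so `&raw mut (*p).a.write(a)` would call `write` on the field itself.
        (&raw mut (*p).a).write(a);
        (&raw mut (*p).b).write(b);
        slot.assume_init()
    }
}

fn main() {
    let x = init_with_macro(1, 2);
    let y = init_with_raw_ref(1, 2);
    assert_eq!((x.a, x.b), (y.a, y.b));
}
```

Because the operator is a prefix expression rather than a macro call, any conversion that chains a method such as `.cast()` or `.read_volatile()` onto the result needs the raw borrow wrapped in parentheses.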
Depends on: Patch 1/3 0001-rust-enable-raw_ref_op-feature.patch
Suggested-by: Benno Lossin <y86-dev(a)protonmail.com>
Link: https://github.com/Rust-for-Linux/linux/issues/1148
Signed-off-by: Antonio Hickey <contact(a)antoniohickey.com>
---
rust/kernel/block/mq/request.rs | 4 ++--
rust/kernel/faux.rs | 4 ++--
rust/kernel/fs/file.rs | 2 +-
rust/kernel/init.rs | 8 ++++----
rust/kernel/init/macros.rs | 28 +++++++++++++-------------
rust/kernel/jump_label.rs | 4 ++--
rust/kernel/kunit.rs | 4 ++--
rust/kernel/list.rs | 2 +-
rust/kernel/list/impl_list_item_mod.rs | 6 +++---
rust/kernel/net/phy.rs | 4 ++--
rust/kernel/pci.rs | 4 ++--
rust/kernel/platform.rs | 4 +---
rust/kernel/rbtree.rs | 22 ++++++++++----------
rust/kernel/sync/arc.rs | 2 +-
rust/kernel/task.rs | 4 ++--
rust/kernel/workqueue.rs | 8 ++++----
16 files changed, 54 insertions(+), 56 deletions(-)
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
index 7943f43b9575..4a5b7ec914ef 100644
--- a/rust/kernel/block/mq/request.rs
+++ b/rust/kernel/block/mq/request.rs
@@ -12,7 +12,7 @@
};
use core::{
marker::PhantomData,
- ptr::{addr_of_mut, NonNull},
+ ptr::NonNull,
sync::atomic::{AtomicU64, Ordering},
};
@@ -187,7 +187,7 @@ pub(crate) fn refcount(&self) -> &AtomicU64 {
pub(crate) unsafe fn refcount_ptr(this: *mut Self) -> *mut AtomicU64 {
// SAFETY: Because of the safety requirements of this function, the
// field projection is safe.
- unsafe { addr_of_mut!((*this).refcount) }
+ unsafe { &raw mut (*this).refcount }
}
}
diff --git a/rust/kernel/faux.rs b/rust/kernel/faux.rs
index 5acc0c02d451..52ac554c1119 100644
--- a/rust/kernel/faux.rs
+++ b/rust/kernel/faux.rs
@@ -7,7 +7,7 @@
//! C header: [`include/linux/device/faux.h`]
use crate::{bindings, device, error::code::*, prelude::*};
-use core::ptr::{addr_of_mut, null, null_mut, NonNull};
+use core::ptr::{null, null_mut, NonNull};
/// The registration of a faux device.
///
@@ -45,7 +45,7 @@ impl AsRef<device::Device> for Registration {
fn as_ref(&self) -> &device::Device {
// SAFETY: The underlying `device` in `faux_device` is guaranteed by the C API to be
// a valid initialized `device`.
- unsafe { device::Device::as_ref(addr_of_mut!((*self.as_raw()).dev)) }
+ unsafe { device::Device::as_ref(&raw mut (*self.as_raw()).dev) }
}
}
diff --git a/rust/kernel/fs/file.rs b/rust/kernel/fs/file.rs
index ed57e0137cdb..7ee4830b67f3 100644
--- a/rust/kernel/fs/file.rs
+++ b/rust/kernel/fs/file.rs
@@ -331,7 +331,7 @@ pub fn flags(&self) -> u32 {
// SAFETY: The file is valid because the shared reference guarantees a nonzero refcount.
//
// FIXME(read_once): Replace with `read_once` when available on the Rust side.
- unsafe { core::ptr::addr_of!((*self.as_ptr()).f_flags).read_volatile() }
+ unsafe { (&raw const (*self.as_ptr()).f_flags).read_volatile() }
}
}
diff --git a/rust/kernel/init.rs b/rust/kernel/init.rs
index 7fd1ea8265a5..a8fac6558671 100644
--- a/rust/kernel/init.rs
+++ b/rust/kernel/init.rs
@@ -122,7 +122,7 @@
//! ```rust
//! # #![expect(unreachable_pub, clippy::disallowed_names)]
//! use kernel::{init, types::Opaque};
-//! use core::{ptr::addr_of_mut, marker::PhantomPinned, pin::Pin};
+//! use core::{marker::PhantomPinned, pin::Pin};
//! # mod bindings {
//! # #![expect(non_camel_case_types)]
//! # #![expect(clippy::missing_safety_doc)]
@@ -159,7 +159,7 @@
//! unsafe {
//! init::pin_init_from_closure(move |slot: *mut Self| {
//! // `slot` contains uninit memory, avoid creating a reference.
-//! let foo = addr_of_mut!((*slot).foo);
+//! let foo = &raw mut (*slot).foo;
//!
//! // Initialize the `foo`
//! bindings::init_foo(Opaque::raw_get(foo));
@@ -541,7 +541,7 @@ macro_rules! stack_try_pin_init {
///
/// ```rust
/// # use kernel::{macros::{Zeroable, pin_data}, pin_init};
-/// # use core::{ptr::addr_of_mut, marker::PhantomPinned};
+/// # use core::marker::PhantomPinned;
/// #[pin_data]
/// #[derive(Zeroable)]
/// struct Buf {
@@ -554,7 +554,7 @@ macro_rules! stack_try_pin_init {
/// pin_init!(&this in Buf {
/// buf: [0; 64],
/// // SAFETY: TODO.
-/// ptr: unsafe { addr_of_mut!((*this.as_ptr()).buf).cast() },
+/// ptr: unsafe { (&raw mut (*this.as_ptr()).buf).cast() },
/// pin: PhantomPinned,
/// });
/// pin_init!(Buf {
diff --git a/rust/kernel/init/macros.rs b/rust/kernel/init/macros.rs
index 1fd146a83241..af525fbb2f01 100644
--- a/rust/kernel/init/macros.rs
+++ b/rust/kernel/init/macros.rs
@@ -244,25 +244,25 @@
//! struct __InitOk;
//! // This is the expansion of `t,`, which is syntactic sugar for `t: t,`.
//! {
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).t), t) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).t, t) };
//! }
//! // Since initialization could fail later (not in this case, since the
//! // error type is `Infallible`) we will need to drop this field if there
//! // is an error later. This `DropGuard` will drop the field when it gets
//! // dropped and has not yet been forgotten.
//! let __t_guard = unsafe {
-//! ::pinned_init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).t))
+//! ::pinned_init::__internal::DropGuard::new(&raw mut (*slot).t)
//! };
//! // Expansion of `x: 0,`:
//! // Since this can be an arbitrary expression we cannot place it inside
//! // of the `unsafe` block, so we bind it here.
//! {
//! let x = 0;
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).x), x) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).x, x) };
//! }
//! // We again create a `DropGuard`.
//! let __x_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).x))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).x)
//! };
//! // Since initialization has successfully completed, we can now forget
//! // the guards. This is not `mem::forget`, since we only have
@@ -459,15 +459,15 @@
//! {
//! struct __InitOk;
//! {
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).a), a) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).a, a) };
//! }
//! let __a_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).a))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).a)
//! };
//! let init = Bar::new(36);
-//! unsafe { data.b(::core::addr_of_mut!((*slot).b), b)? };
+//! unsafe { data.b(&raw mut (*slot).b, b)? };
//! let __b_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).b))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).b)
//! };
//! ::core::mem::forget(__b_guard);
//! ::core::mem::forget(__a_guard);
@@ -1210,7 +1210,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
// SAFETY: `slot` is valid, because we are inside of an initializer closure, we
// return when an error/panic occurs.
// We also use the `data` to require the correct trait (`Init` or `PinInit`) for `$field`.
- unsafe { $data.$field(::core::ptr::addr_of_mut!((*$slot).$field), init)? };
+ unsafe { $data.$field(&raw mut (*$slot).$field, init)? };
// Create the drop guard:
//
// We rely on macro hygiene to make it impossible for users to access this local variable.
@@ -1218,7 +1218,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot($use_data):
@@ -1241,7 +1241,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
//
// SAFETY: `slot` is valid, because we are inside of an initializer closure, we
// return when an error/panic occurs.
- unsafe { $crate::init::Init::__init(init, ::core::ptr::addr_of_mut!((*$slot).$field))? };
+ unsafe { $crate::init::Init::__init(init, &raw mut (*$slot).$field)? };
// Create the drop guard:
//
// We rely on macro hygiene to make it impossible for users to access this local variable.
@@ -1249,7 +1249,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot():
@@ -1272,7 +1272,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
// Initialize the field.
//
// SAFETY: The memory at `slot` is uninitialized.
- unsafe { ::core::ptr::write(::core::ptr::addr_of_mut!((*$slot).$field), $field) };
+ unsafe { ::core::ptr::write(&raw mut (*$slot).$field, $field) };
}
// Create the drop guard:
//
@@ -1281,7 +1281,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot($($use_data)?):
diff --git a/rust/kernel/jump_label.rs b/rust/kernel/jump_label.rs
index 4e974c768dbd..05d4564714c7 100644
--- a/rust/kernel/jump_label.rs
+++ b/rust/kernel/jump_label.rs
@@ -20,8 +20,8 @@
#[macro_export]
macro_rules! static_branch_unlikely {
($key:path, $keytyp:ty, $field:ident) => {{
- let _key: *const $keytyp = ::core::ptr::addr_of!($key);
- let _key: *const $crate::bindings::static_key_false = ::core::ptr::addr_of!((*_key).$field);
+ let _key: *const $keytyp = &raw const $key;
+ let _key: *const $crate::bindings::static_key_false = &raw const (*_key).$field;
let _key: *const $crate::bindings::static_key = _key.cast();
#[cfg(not(CONFIG_JUMP_LABEL))]
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 824da0e9738a..18357dd782ed 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -128,9 +128,9 @@ unsafe impl Sync for UnaryAssert {}
unsafe {
$crate::bindings::__kunit_do_failed_assertion(
kunit_test,
- core::ptr::addr_of!(LOCATION.0),
+ &raw const LOCATION.0,
$crate::bindings::kunit_assert_type_KUNIT_ASSERTION,
- core::ptr::addr_of!(ASSERTION.0.assert),
+ &raw const ASSERTION.0.assert,
Some($crate::bindings::kunit_unary_assert_format),
core::ptr::null(),
);
diff --git a/rust/kernel/list.rs b/rust/kernel/list.rs
index c0ed227b8a4f..e98f0820f002 100644
--- a/rust/kernel/list.rs
+++ b/rust/kernel/list.rs
@@ -176,7 +176,7 @@ pub fn new() -> impl PinInit<Self> {
#[inline]
unsafe fn fields(me: *mut Self) -> *mut ListLinksFields {
// SAFETY: The caller promises that the pointer is valid.
- unsafe { Opaque::raw_get(ptr::addr_of!((*me).inner)) }
+ unsafe { Opaque::raw_get(&raw const (*me).inner) }
}
/// # Safety
diff --git a/rust/kernel/list/impl_list_item_mod.rs b/rust/kernel/list/impl_list_item_mod.rs
index a0438537cee1..014b6713d59d 100644
--- a/rust/kernel/list/impl_list_item_mod.rs
+++ b/rust/kernel/list/impl_list_item_mod.rs
@@ -49,7 +49,7 @@ macro_rules! impl_has_list_links {
// SAFETY: The implementation of `raw_get_list_links` only compiles if the field has the
// right type.
//
- // The behavior of `raw_get_list_links` is not changed since the `addr_of_mut!` macro is
+ // The behavior of `raw_get_list_links` is not changed since the `&raw mut` op is
// equivalent to the pointer offset operation in the trait definition.
unsafe impl$(<$($implarg),*>)? $crate::list::HasListLinks$(<$id>)? for
$self $(<$($selfarg),*>)?
@@ -61,7 +61,7 @@ unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut $crate::list::ListLinks$(<$
// SAFETY: The caller promises that the pointer is not dangling. We know that this
// expression doesn't follow any pointers, as the `offset_of!` invocation above
// would otherwise not compile.
- unsafe { ::core::ptr::addr_of_mut!((*ptr)$(.$field)*) }
+ unsafe { &raw mut (*ptr)$(.$field)* }
}
}
)*};
@@ -103,7 +103,7 @@ macro_rules! impl_has_list_links_self_ptr {
unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut $crate::list::ListLinks$(<$id>)? {
// SAFETY: The caller promises that the pointer is not dangling.
let ptr: *mut $crate::list::ListLinksSelfPtr<$item_type $(, $id)?> =
- unsafe { ::core::ptr::addr_of_mut!((*ptr).$field) };
+ unsafe { &raw mut (*ptr).$field };
ptr.cast()
}
}
diff --git a/rust/kernel/net/phy.rs b/rust/kernel/net/phy.rs
index a59469c785e3..757db052cc09 100644
--- a/rust/kernel/net/phy.rs
+++ b/rust/kernel/net/phy.rs
@@ -7,7 +7,7 @@
//! C headers: [`include/linux/phy.h`](srctree/include/linux/phy.h).
use crate::{error::*, prelude::*, types::Opaque};
-use core::{marker::PhantomData, ptr::addr_of_mut};
+use core::marker::PhantomData;
pub mod reg;
@@ -285,7 +285,7 @@ impl AsRef<kernel::device::Device> for Device {
fn as_ref(&self) -> &kernel::device::Device {
let phydev = self.0.get();
// SAFETY: The struct invariant ensures that `mdio.dev` is valid.
- unsafe { kernel::device::Device::as_ref(addr_of_mut!((*phydev).mdio.dev)) }
+ unsafe { kernel::device::Device::as_ref(&raw mut (*phydev).mdio.dev) }
}
}
diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index f7b2743828ae..6cb9ed1e7cbf 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -17,7 +17,7 @@
types::{ARef, ForeignOwnable, Opaque},
ThisModule,
};
-use core::{ops::Deref, ptr::addr_of_mut};
+use core::ops::Deref;
use kernel::prelude::*;
/// An adapter for the registration of PCI drivers.
@@ -60,7 +60,7 @@ extern "C" fn probe_callback(
) -> kernel::ffi::c_int {
// SAFETY: The PCI bus only ever calls the probe callback with a valid pointer to a
// `struct pci_dev`.
- let dev = unsafe { device::Device::get_device(addr_of_mut!((*pdev).dev)) };
+ let dev = unsafe { device::Device::get_device(&raw mut (*pdev).dev) };
// SAFETY: `dev` is guaranteed to be embedded in a valid `struct pci_dev` by the call
// above.
let mut pdev = unsafe { Device::from_dev(dev) };
diff --git a/rust/kernel/platform.rs b/rust/kernel/platform.rs
index 1297f5292ba9..344875ad7b82 100644
--- a/rust/kernel/platform.rs
+++ b/rust/kernel/platform.rs
@@ -14,8 +14,6 @@
ThisModule,
};
-use core::ptr::addr_of_mut;
-
/// An adapter for the registration of platform drivers.
pub struct Adapter<T: Driver>(T);
@@ -55,7 +53,7 @@ unsafe fn unregister(pdrv: &Opaque<Self::RegType>) {
impl<T: Driver + 'static> Adapter<T> {
extern "C" fn probe_callback(pdev: *mut bindings::platform_device) -> kernel::ffi::c_int {
// SAFETY: The platform bus only ever calls the probe callback with a valid `pdev`.
- let dev = unsafe { device::Device::get_device(addr_of_mut!((*pdev).dev)) };
+ let dev = unsafe { device::Device::get_device(&raw mut (*pdev).dev) };
// SAFETY: `dev` is guaranteed to be embedded in a valid `struct platform_device` by the
// call above.
let mut pdev = unsafe { Device::from_dev(dev) };
diff --git a/rust/kernel/rbtree.rs b/rust/kernel/rbtree.rs
index 1ea25c7092fb..b0ad35663cb0 100644
--- a/rust/kernel/rbtree.rs
+++ b/rust/kernel/rbtree.rs
@@ -11,7 +11,7 @@
cmp::{Ord, Ordering},
marker::PhantomData,
mem::MaybeUninit,
- ptr::{addr_of_mut, from_mut, NonNull},
+ ptr::{from_mut, NonNull},
};
/// A red-black tree with owned nodes.
@@ -238,7 +238,7 @@ pub fn values_mut(&mut self) -> impl Iterator<Item = &'_ mut V> {
/// Returns a cursor over the tree nodes, starting with the smallest key.
pub fn cursor_front(&mut self) -> Option<Cursor<'_, K, V>> {
- let root = addr_of_mut!(self.root);
+ let root = &raw mut self.root;
// SAFETY: `self.root` is always a valid root node
let current = unsafe { bindings::rb_first(root) };
NonNull::new(current).map(|current| {
@@ -253,7 +253,7 @@ pub fn cursor_front(&mut self) -> Option<Cursor<'_, K, V>> {
/// Returns a cursor over the tree nodes, starting with the largest key.
pub fn cursor_back(&mut self) -> Option<Cursor<'_, K, V>> {
- let root = addr_of_mut!(self.root);
+ let root = &raw mut self.root;
// SAFETY: `self.root` is always a valid root node
let current = unsafe { bindings::rb_last(root) };
NonNull::new(current).map(|current| {
@@ -459,7 +459,7 @@ pub fn cursor_lower_bound(&mut self, key: &K) -> Option<Cursor<'_, K, V>>
let best = best_match?;
// SAFETY: `best` is a non-null node so it is valid by the type invariants.
- let links = unsafe { addr_of_mut!((*best.as_ptr()).links) };
+ let links = unsafe { &raw mut (*best.as_ptr()).links };
NonNull::new(links).map(|current| {
// INVARIANT:
@@ -767,7 +767,7 @@ pub fn remove_current(self) -> (Option<Self>, RBTreeNode<K, V>) {
let node = RBTreeNode { node };
// SAFETY: The reference to the tree used to create the cursor outlives the cursor, so
// the tree cannot change. By the tree invariant, all nodes are valid.
- unsafe { bindings::rb_erase(&mut (*this).links, addr_of_mut!(self.tree.root)) };
+ unsafe { bindings::rb_erase(&mut (*this).links, &raw mut self.tree.root) };
let current = match (prev, next) {
(_, Some(next)) => next,
@@ -803,7 +803,7 @@ fn remove_neighbor(&mut self, direction: Direction) -> Option<RBTreeNode<K, V>>
let neighbor = neighbor.as_ptr();
// SAFETY: The reference to the tree used to create the cursor outlives the cursor, so
// the tree cannot change. By the tree invariant, all nodes are valid.
- unsafe { bindings::rb_erase(neighbor, addr_of_mut!(self.tree.root)) };
+ unsafe { bindings::rb_erase(neighbor, &raw mut self.tree.root) };
// SAFETY: By the type invariant of `Self`, all non-null `rb_node` pointers stored in `self`
// point to the links field of `Node<K, V>` objects.
let this = unsafe { container_of!(neighbor, Node<K, V>, links) }.cast_mut();
@@ -918,7 +918,7 @@ unsafe fn to_key_value_raw<'b>(node: NonNull<bindings::rb_node>) -> (&'b K, *mut
let k = unsafe { &(*this).key };
// SAFETY: The passed `node` is the current node or a non-null neighbor,
// thus `this` is valid by the type invariants.
- let v = unsafe { addr_of_mut!((*this).value) };
+ let v = unsafe { &raw mut (*this).value };
(k, v)
}
}
@@ -1027,7 +1027,7 @@ fn next(&mut self) -> Option<Self::Item> {
self.next = unsafe { bindings::rb_next(self.next) };
// SAFETY: By the same reasoning above, it is safe to dereference the node.
- Some(unsafe { (addr_of_mut!((*cur).key), addr_of_mut!((*cur).value)) })
+ Some(unsafe { (&raw mut (*cur).key, &raw mut (*cur).value) })
}
}
@@ -1170,7 +1170,7 @@ fn insert(self, node: RBTreeNode<K, V>) -> &'a mut V {
// SAFETY: `node` is valid at least until we call `Box::from_raw`, which only happens when
// the node is removed or replaced.
- let node_links = unsafe { addr_of_mut!((*node).links) };
+ let node_links = unsafe { &raw mut (*node).links };
// INVARIANT: We are linking in a new node, which is valid. It remains valid because we
// "forgot" it with `Box::into_raw`.
@@ -1178,7 +1178,7 @@ fn insert(self, node: RBTreeNode<K, V>) -> &'a mut V {
unsafe { bindings::rb_link_node(node_links, self.parent, self.child_field_of_parent) };
// SAFETY: All pointers are valid. `node` has just been inserted into the tree.
- unsafe { bindings::rb_insert_color(node_links, addr_of_mut!((*self.rbtree).root)) };
+ unsafe { bindings::rb_insert_color(node_links, &raw mut (*self.rbtree).root) };
// SAFETY: The node is valid until we remove it from the tree.
unsafe { &mut (*node).value }
@@ -1261,7 +1261,7 @@ fn replace(self, node: RBTreeNode<K, V>) -> RBTreeNode<K, V> {
// SAFETY: `node` is valid at least until we call `Box::from_raw`, which only happens when
// the node is removed or replaced.
- let new_node_links = unsafe { addr_of_mut!((*node).links) };
+ let new_node_links = unsafe { &raw mut (*node).links };
// SAFETY: This updates the pointers so that `new_node_links` is in the tree where
// `self.node_links` used to be.
diff --git a/rust/kernel/sync/arc.rs b/rust/kernel/sync/arc.rs
index 3cefda7a4372..81d8b0f84957 100644
--- a/rust/kernel/sync/arc.rs
+++ b/rust/kernel/sync/arc.rs
@@ -243,7 +243,7 @@ pub fn into_raw(self) -> *const T {
let ptr = self.ptr.as_ptr();
core::mem::forget(self);
// SAFETY: The pointer is valid.
- unsafe { core::ptr::addr_of!((*ptr).data) }
+ unsafe { &raw const (*ptr).data }
}
/// Recreates an [`Arc`] instance previously deconstructed via [`Arc::into_raw`].
diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index 49012e711942..b2ac768eed23 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -257,7 +257,7 @@ pub fn as_ptr(&self) -> *mut bindings::task_struct {
pub fn group_leader(&self) -> &Task {
// SAFETY: The group leader of a task never changes after initialization, so reading this
// field is not a data race.
- let ptr = unsafe { *ptr::addr_of!((*self.as_ptr()).group_leader) };
+ let ptr = unsafe { *(&raw const (*self.as_ptr()).group_leader) };
// SAFETY: The lifetime of the returned task reference is tied to the lifetime of `self`,
// and given that a task has a reference to its group leader, we know it must be valid for
@@ -269,7 +269,7 @@ pub fn group_leader(&self) -> &Task {
pub fn pid(&self) -> Pid {
// SAFETY: The pid of a task never changes after initialization, so reading this field is
// not a data race.
- unsafe { *ptr::addr_of!((*self.as_ptr()).pid) }
+ unsafe { *(&raw const (*self.as_ptr()).pid) }
}
/// Returns the UID of the given task.
diff --git a/rust/kernel/workqueue.rs b/rust/kernel/workqueue.rs
index 0cd100d2aefb..34e8abb38974 100644
--- a/rust/kernel/workqueue.rs
+++ b/rust/kernel/workqueue.rs
@@ -401,9 +401,9 @@ pub fn new(name: &'static CStr, key: &'static LockClassKey) -> impl PinInit<Self
pub unsafe fn raw_get(ptr: *const Self) -> *mut bindings::work_struct {
// SAFETY: The caller promises that the pointer is aligned and not dangling.
//
- // A pointer cast would also be ok due to `#[repr(transparent)]`. We use `addr_of!` so that
- // the compiler does not complain that the `work` field is unused.
- unsafe { Opaque::raw_get(core::ptr::addr_of!((*ptr).work)) }
+ // A pointer cast would also be ok due to `#[repr(transparent)]`. We use `&raw const (*ptr).work`
+ // so that the compiler does not complain that the `work` field is unused.
+ unsafe { Opaque::raw_get(&raw const (*ptr).work) }
}
}
@@ -510,7 +510,7 @@ macro_rules! impl_has_work {
unsafe fn raw_get_work(ptr: *mut Self) -> *mut $crate::workqueue::Work<$work_type $(, $id)?> {
// SAFETY: The caller promises that the pointer is not dangling.
unsafe {
- ::core::ptr::addr_of_mut!((*ptr).$field)
+ &raw mut (*ptr).$field
}
}
}
--
2.48.1
Replace all occurrences of `addr_of!(place)` with `&raw const place`, and
all occurrences of `addr_of_mut!(place)` with `&raw mut place`.
Using the new syntax reduces macro complexity and improves consistency with
the existing reference syntax: `&raw const` and `&raw mut` closely resemble
`&` and `&mut`, so they fit more naturally with the surrounding code than
the previously used macros.
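As a rough illustration of the equivalence described above (not code from this
patch), the sketch below takes pointers to the same field with both spellings
and compares them. The `Packet` struct and both function names are made up for
the example, and it assumes a toolchain where the raw borrow operators are
available without a feature gate (Rust 1.82 or later).

```rust
use core::ptr::{addr_of, addr_of_mut};

/// Hypothetical packed struct: `len` may be misaligned, so a plain `&pkt.len`
/// reference would be rejected by the compiler.
#[allow(dead_code)]
#[repr(C, packed)]
struct Packet {
    tag: u8,
    len: u32,
}

/// Returns true when the macro form and the operator form yield the same pointers.
fn field_pointers_agree() -> bool {
    let mut pkt = Packet { tag: 1, len: 42 };

    // Old spelling: macros imported from `core::ptr`.
    let old_const: *const u32 = addr_of!(pkt.len);
    let old_mut: *mut u32 = addr_of_mut!(pkt.len);

    // New spelling: raw borrow operators, no import needed.
    let new_const: *const u32 = &raw const pkt.len;
    let new_mut: *mut u32 = &raw mut pkt.len;

    old_const == new_const && old_mut == new_mut
}

/// Reads the possibly misaligned field through a raw pointer.
fn read_len() -> u32 {
    let pkt = Packet { tag: 1, len: 42 };
    let len_ptr = &raw const pkt.len;
    // SAFETY: `len_ptr` points into the live local `pkt`, and `read_unaligned`
    // does not require the pointer to be aligned.
    unsafe { len_ptr.read_unaligned() }
}
```

Neither spelling creates an intermediate reference, which is why both are usable
on packed or uninitialized fields; the operator form simply drops the import.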
Suggested-by: Benno Lossin <benno.lossin(a)proton.me>
Link: https://github.com/Rust-for-Linux/linux/issues/1148
Signed-off-by: Antonio Hickey <contact(a)antoniohickey.com>
---
rust/kernel/block/mq/request.rs | 4 ++--
rust/kernel/faux.rs | 4 ++--
rust/kernel/fs/file.rs | 2 +-
rust/kernel/init.rs | 8 ++++----
rust/kernel/init/macros.rs | 28 +++++++++++++-------------
rust/kernel/jump_label.rs | 4 ++--
rust/kernel/kunit.rs | 4 ++--
rust/kernel/list.rs | 2 +-
rust/kernel/list/impl_list_item_mod.rs | 6 +++---
rust/kernel/net/phy.rs | 4 ++--
rust/kernel/pci.rs | 4 ++--
rust/kernel/platform.rs | 4 +---
rust/kernel/rbtree.rs | 22 ++++++++++----------
rust/kernel/sync/arc.rs | 2 +-
rust/kernel/task.rs | 4 ++--
rust/kernel/workqueue.rs | 8 ++++----
16 files changed, 54 insertions(+), 56 deletions(-)
diff --git a/rust/kernel/block/mq/request.rs b/rust/kernel/block/mq/request.rs
index 7943f43b9575..4a5b7ec914ef 100644
--- a/rust/kernel/block/mq/request.rs
+++ b/rust/kernel/block/mq/request.rs
@@ -12,7 +12,7 @@
};
use core::{
marker::PhantomData,
- ptr::{addr_of_mut, NonNull},
+ ptr::NonNull,
sync::atomic::{AtomicU64, Ordering},
};
@@ -187,7 +187,7 @@ pub(crate) fn refcount(&self) -> &AtomicU64 {
pub(crate) unsafe fn refcount_ptr(this: *mut Self) -> *mut AtomicU64 {
// SAFETY: Because of the safety requirements of this function, the
// field projection is safe.
- unsafe { addr_of_mut!((*this).refcount) }
+ unsafe { &raw mut (*this).refcount }
}
}
diff --git a/rust/kernel/faux.rs b/rust/kernel/faux.rs
index 5acc0c02d451..52ac554c1119 100644
--- a/rust/kernel/faux.rs
+++ b/rust/kernel/faux.rs
@@ -7,7 +7,7 @@
//! C header: [`include/linux/device/faux.h`]
use crate::{bindings, device, error::code::*, prelude::*};
-use core::ptr::{addr_of_mut, null, null_mut, NonNull};
+use core::ptr::{null, null_mut, NonNull};
/// The registration of a faux device.
///
@@ -45,7 +45,7 @@ impl AsRef<device::Device> for Registration {
fn as_ref(&self) -> &device::Device {
// SAFETY: The underlying `device` in `faux_device` is guaranteed by the C API to be
// a valid initialized `device`.
- unsafe { device::Device::as_ref(addr_of_mut!((*self.as_raw()).dev)) }
+ unsafe { device::Device::as_ref(&raw mut (*self.as_raw()).dev) }
}
}
diff --git a/rust/kernel/fs/file.rs b/rust/kernel/fs/file.rs
index ed57e0137cdb..7ee4830b67f3 100644
--- a/rust/kernel/fs/file.rs
+++ b/rust/kernel/fs/file.rs
@@ -331,7 +331,7 @@ pub fn flags(&self) -> u32 {
// SAFETY: The file is valid because the shared reference guarantees a nonzero refcount.
//
// FIXME(read_once): Replace with `read_once` when available on the Rust side.
- unsafe { core::ptr::addr_of!((*self.as_ptr()).f_flags).read_volatile() }
+ unsafe { (&raw const (*self.as_ptr()).f_flags).read_volatile() }
}
}
diff --git a/rust/kernel/init.rs b/rust/kernel/init.rs
index 7fd1ea8265a5..a8fac6558671 100644
--- a/rust/kernel/init.rs
+++ b/rust/kernel/init.rs
@@ -122,7 +122,7 @@
//! ```rust
//! # #![expect(unreachable_pub, clippy::disallowed_names)]
//! use kernel::{init, types::Opaque};
-//! use core::{ptr::addr_of_mut, marker::PhantomPinned, pin::Pin};
+//! use core::{marker::PhantomPinned, pin::Pin};
//! # mod bindings {
//! # #![expect(non_camel_case_types)]
//! # #![expect(clippy::missing_safety_doc)]
@@ -159,7 +159,7 @@
//! unsafe {
//! init::pin_init_from_closure(move |slot: *mut Self| {
//! // `slot` contains uninit memory, avoid creating a reference.
-//! let foo = addr_of_mut!((*slot).foo);
+//! let foo = &raw mut (*slot).foo;
//!
//! // Initialize the `foo`
//! bindings::init_foo(Opaque::raw_get(foo));
@@ -541,7 +541,7 @@ macro_rules! stack_try_pin_init {
///
/// ```rust
/// # use kernel::{macros::{Zeroable, pin_data}, pin_init};
-/// # use core::{ptr::addr_of_mut, marker::PhantomPinned};
+/// # use core::marker::PhantomPinned;
/// #[pin_data]
/// #[derive(Zeroable)]
/// struct Buf {
@@ -554,7 +554,7 @@ macro_rules! stack_try_pin_init {
/// pin_init!(&this in Buf {
/// buf: [0; 64],
/// // SAFETY: TODO.
-/// ptr: unsafe { addr_of_mut!((*this.as_ptr()).buf).cast() },
+/// ptr: unsafe { (&raw mut (*this.as_ptr()).buf).cast() },
/// pin: PhantomPinned,
/// });
/// pin_init!(Buf {
diff --git a/rust/kernel/init/macros.rs b/rust/kernel/init/macros.rs
index 1fd146a83241..af525fbb2f01 100644
--- a/rust/kernel/init/macros.rs
+++ b/rust/kernel/init/macros.rs
@@ -244,25 +244,25 @@
//! struct __InitOk;
//! // This is the expansion of `t,`, which is syntactic sugar for `t: t,`.
//! {
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).t), t) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).t, t) };
//! }
//! // Since initialization could fail later (not in this case, since the
//! // error type is `Infallible`) we will need to drop this field if there
//! // is an error later. This `DropGuard` will drop the field when it gets
//! // dropped and has not yet been forgotten.
//! let __t_guard = unsafe {
-//! ::pinned_init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).t))
+//! ::pinned_init::__internal::DropGuard::new(&raw mut (*slot).t)
//! };
//! // Expansion of `x: 0,`:
//! // Since this can be an arbitrary expression we cannot place it inside
//! // of the `unsafe` block, so we bind it here.
//! {
//! let x = 0;
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).x), x) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).x, x) };
//! }
//! // We again create a `DropGuard`.
//! let __x_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).x))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).x)
//! };
//! // Since initialization has successfully completed, we can now forget
//! // the guards. This is not `mem::forget`, since we only have
@@ -459,15 +459,15 @@
//! {
//! struct __InitOk;
//! {
-//! unsafe { ::core::ptr::write(::core::addr_of_mut!((*slot).a), a) };
+//! unsafe { ::core::ptr::write(&raw mut (*slot).a, a) };
//! }
//! let __a_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).a))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).a)
//! };
//! let init = Bar::new(36);
-//! unsafe { data.b(::core::addr_of_mut!((*slot).b), b)? };
+//! unsafe { data.b(&raw mut (*slot).b, b)? };
//! let __b_guard = unsafe {
-//! ::kernel::init::__internal::DropGuard::new(::core::addr_of_mut!((*slot).b))
+//! ::kernel::init::__internal::DropGuard::new(&raw mut (*slot).b)
//! };
//! ::core::mem::forget(__b_guard);
//! ::core::mem::forget(__a_guard);
@@ -1210,7 +1210,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
// SAFETY: `slot` is valid, because we are inside of an initializer closure, we
// return when an error/panic occurs.
// We also use the `data` to require the correct trait (`Init` or `PinInit`) for `$field`.
- unsafe { $data.$field(::core::ptr::addr_of_mut!((*$slot).$field), init)? };
+ unsafe { $data.$field(&raw mut (*$slot).$field, init)? };
// Create the drop guard:
//
// We rely on macro hygiene to make it impossible for users to access this local variable.
@@ -1218,7 +1218,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot($use_data):
@@ -1241,7 +1241,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
//
// SAFETY: `slot` is valid, because we are inside of an initializer closure, we
// return when an error/panic occurs.
- unsafe { $crate::init::Init::__init(init, ::core::ptr::addr_of_mut!((*$slot).$field))? };
+ unsafe { $crate::init::Init::__init(init, &raw mut (*$slot).$field)? };
// Create the drop guard:
//
// We rely on macro hygiene to make it impossible for users to access this local variable.
@@ -1249,7 +1249,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot():
@@ -1272,7 +1272,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
// Initialize the field.
//
// SAFETY: The memory at `slot` is uninitialized.
- unsafe { ::core::ptr::write(::core::ptr::addr_of_mut!((*$slot).$field), $field) };
+ unsafe { ::core::ptr::write(&raw mut (*$slot).$field, $field) };
}
// Create the drop guard:
//
@@ -1281,7 +1281,7 @@ fn assert_zeroable<T: $crate::init::Zeroable>(_: *mut T) {}
::kernel::macros::paste! {
// SAFETY: We forget the guard later when initialization has succeeded.
let [< __ $field _guard >] = unsafe {
- $crate::init::__internal::DropGuard::new(::core::ptr::addr_of_mut!((*$slot).$field))
+ $crate::init::__internal::DropGuard::new(&raw mut (*$slot).$field)
};
$crate::__init_internal!(init_slot($($use_data)?):
diff --git a/rust/kernel/jump_label.rs b/rust/kernel/jump_label.rs
index 4e974c768dbd..ca10abae0eee 100644
--- a/rust/kernel/jump_label.rs
+++ b/rust/kernel/jump_label.rs
@@ -20,8 +20,8 @@
#[macro_export]
macro_rules! static_branch_unlikely {
($key:path, $keytyp:ty, $field:ident) => {{
- let _key: *const $keytyp = ::core::ptr::addr_of!($key);
- let _key: *const $crate::bindings::static_key_false = ::core::ptr::addr_of!((*_key).$field);
+ let _key: *const $keytyp = &raw const $key;
+ let _key: *const $crate::bindings::static_key_false = &raw const (*_key).$field;
let _key: *const $crate::bindings::static_key = _key.cast();
#[cfg(not(CONFIG_JUMP_LABEL))]
diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
index 824da0e9738a..a17ef3b2e860 100644
--- a/rust/kernel/kunit.rs
+++ b/rust/kernel/kunit.rs
@@ -128,9 +128,9 @@ unsafe impl Sync for UnaryAssert {}
unsafe {
$crate::bindings::__kunit_do_failed_assertion(
kunit_test,
- core::ptr::addr_of!(LOCATION.0),
+ &raw const LOCATION.0,
$crate::bindings::kunit_assert_type_KUNIT_ASSERTION,
- core::ptr::addr_of!(ASSERTION.0.assert),
+ &raw const ASSERTION.0.assert,
Some($crate::bindings::kunit_unary_assert_format),
core::ptr::null(),
);
diff --git a/rust/kernel/list.rs b/rust/kernel/list.rs
index c0ed227b8a4f..e98f0820f002 100644
--- a/rust/kernel/list.rs
+++ b/rust/kernel/list.rs
@@ -176,7 +176,7 @@ pub fn new() -> impl PinInit<Self> {
#[inline]
unsafe fn fields(me: *mut Self) -> *mut ListLinksFields {
// SAFETY: The caller promises that the pointer is valid.
- unsafe { Opaque::raw_get(ptr::addr_of!((*me).inner)) }
+ unsafe { Opaque::raw_get(&raw const (*me).inner) }
}
/// # Safety
diff --git a/rust/kernel/list/impl_list_item_mod.rs b/rust/kernel/list/impl_list_item_mod.rs
index a0438537cee1..014b6713d59d 100644
--- a/rust/kernel/list/impl_list_item_mod.rs
+++ b/rust/kernel/list/impl_list_item_mod.rs
@@ -49,7 +49,7 @@ macro_rules! impl_has_list_links {
// SAFETY: The implementation of `raw_get_list_links` only compiles if the field has the
// right type.
//
- // The behavior of `raw_get_list_links` is not changed since the `addr_of_mut!` macro is
+ // The behavior of `raw_get_list_links` is not changed since the `&raw mut` op is
// equivalent to the pointer offset operation in the trait definition.
unsafe impl$(<$($implarg),*>)? $crate::list::HasListLinks$(<$id>)? for
$self $(<$($selfarg),*>)?
@@ -61,7 +61,7 @@ unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut $crate::list::ListLinks$(<$
// SAFETY: The caller promises that the pointer is not dangling. We know that this
// expression doesn't follow any pointers, as the `offset_of!` invocation above
// would otherwise not compile.
- unsafe { ::core::ptr::addr_of_mut!((*ptr)$(.$field)*) }
+ unsafe { &raw mut (*ptr)$(.$field)* }
}
}
)*};
@@ -103,7 +103,7 @@ macro_rules! impl_has_list_links_self_ptr {
unsafe fn raw_get_list_links(ptr: *mut Self) -> *mut $crate::list::ListLinks$(<$id>)? {
// SAFETY: The caller promises that the pointer is not dangling.
let ptr: *mut $crate::list::ListLinksSelfPtr<$item_type $(, $id)?> =
- unsafe { ::core::ptr::addr_of_mut!((*ptr).$field) };
+ unsafe { &raw mut (*ptr).$field };
ptr.cast()
}
}
diff --git a/rust/kernel/net/phy.rs b/rust/kernel/net/phy.rs
index a59469c785e3..757db052cc09 100644
--- a/rust/kernel/net/phy.rs
+++ b/rust/kernel/net/phy.rs
@@ -7,7 +7,7 @@
//! C headers: [`include/linux/phy.h`](srctree/include/linux/phy.h).
use crate::{error::*, prelude::*, types::Opaque};
-use core::{marker::PhantomData, ptr::addr_of_mut};
+use core::marker::PhantomData;
pub mod reg;
@@ -285,7 +285,7 @@ impl AsRef<kernel::device::Device> for Device {
fn as_ref(&self) -> &kernel::device::Device {
let phydev = self.0.get();
// SAFETY: The struct invariant ensures that `mdio.dev` is valid.
- unsafe { kernel::device::Device::as_ref(addr_of_mut!((*phydev).mdio.dev)) }
+ unsafe { kernel::device::Device::as_ref(&raw mut (*phydev).mdio.dev) }
}
}
diff --git a/rust/kernel/pci.rs b/rust/kernel/pci.rs
index f7b2743828ae..6cb9ed1e7cbf 100644
--- a/rust/kernel/pci.rs
+++ b/rust/kernel/pci.rs
@@ -17,7 +17,7 @@
types::{ARef, ForeignOwnable, Opaque},
ThisModule,
};
-use core::{ops::Deref, ptr::addr_of_mut};
+use core::ops::Deref;
use kernel::prelude::*;
/// An adapter for the registration of PCI drivers.
@@ -60,7 +60,7 @@ extern "C" fn probe_callback(
) -> kernel::ffi::c_int {
// SAFETY: The PCI bus only ever calls the probe callback with a valid pointer to a
// `struct pci_dev`.
- let dev = unsafe { device::Device::get_device(addr_of_mut!((*pdev).dev)) };
+ let dev = unsafe { device::Device::get_device(&raw mut (*pdev).dev) };
// SAFETY: `dev` is guaranteed to be embedded in a valid `struct pci_dev` by the call
// above.
let mut pdev = unsafe { Device::from_dev(dev) };
diff --git a/rust/kernel/platform.rs b/rust/kernel/platform.rs
index 1297f5292ba9..344875ad7b82 100644
--- a/rust/kernel/platform.rs
+++ b/rust/kernel/platform.rs
@@ -14,8 +14,6 @@
ThisModule,
};
-use core::ptr::addr_of_mut;
-
/// An adapter for the registration of platform drivers.
pub struct Adapter<T: Driver>(T);
@@ -55,7 +53,7 @@ unsafe fn unregister(pdrv: &Opaque<Self::RegType>) {
impl<T: Driver + 'static> Adapter<T> {
extern "C" fn probe_callback(pdev: *mut bindings::platform_device) -> kernel::ffi::c_int {
// SAFETY: The platform bus only ever calls the probe callback with a valid `pdev`.
- let dev = unsafe { device::Device::get_device(addr_of_mut!((*pdev).dev)) };
+ let dev = unsafe { device::Device::get_device(&raw mut (*pdev).dev) };
// SAFETY: `dev` is guaranteed to be embedded in a valid `struct platform_device` by the
// call above.
let mut pdev = unsafe { Device::from_dev(dev) };
diff --git a/rust/kernel/rbtree.rs b/rust/kernel/rbtree.rs
index 1ea25c7092fb..b0ad35663cb0 100644
--- a/rust/kernel/rbtree.rs
+++ b/rust/kernel/rbtree.rs
@@ -11,7 +11,7 @@
cmp::{Ord, Ordering},
marker::PhantomData,
mem::MaybeUninit,
- ptr::{addr_of_mut, from_mut, NonNull},
+ ptr::{from_mut, NonNull},
};
/// A red-black tree with owned nodes.
@@ -238,7 +238,7 @@ pub fn values_mut(&mut self) -> impl Iterator<Item = &'_ mut V> {
/// Returns a cursor over the tree nodes, starting with the smallest key.
pub fn cursor_front(&mut self) -> Option<Cursor<'_, K, V>> {
- let root = addr_of_mut!(self.root);
+ let root = &raw mut self.root;
// SAFETY: `self.root` is always a valid root node
let current = unsafe { bindings::rb_first(root) };
NonNull::new(current).map(|current| {
@@ -253,7 +253,7 @@ pub fn cursor_front(&mut self) -> Option<Cursor<'_, K, V>> {
/// Returns a cursor over the tree nodes, starting with the largest key.
pub fn cursor_back(&mut self) -> Option<Cursor<'_, K, V>> {
- let root = addr_of_mut!(self.root);
+ let root = &raw mut self.root;
// SAFETY: `self.root` is always a valid root node
let current = unsafe { bindings::rb_last(root) };
NonNull::new(current).map(|current| {
@@ -459,7 +459,7 @@ pub fn cursor_lower_bound(&mut self, key: &K) -> Option<Cursor<'_, K, V>>
let best = best_match?;
// SAFETY: `best` is a non-null node so it is valid by the type invariants.
- let links = unsafe { addr_of_mut!((*best.as_ptr()).links) };
+ let links = unsafe { &raw mut (*best.as_ptr()).links };
NonNull::new(links).map(|current| {
// INVARIANT:
@@ -767,7 +767,7 @@ pub fn remove_current(self) -> (Option<Self>, RBTreeNode<K, V>) {
let node = RBTreeNode { node };
// SAFETY: The reference to the tree used to create the cursor outlives the cursor, so
// the tree cannot change. By the tree invariant, all nodes are valid.
- unsafe { bindings::rb_erase(&mut (*this).links, addr_of_mut!(self.tree.root)) };
+ unsafe { bindings::rb_erase(&mut (*this).links, &raw mut self.tree.root) };
let current = match (prev, next) {
(_, Some(next)) => next,
@@ -803,7 +803,7 @@ fn remove_neighbor(&mut self, direction: Direction) -> Option<RBTreeNode<K, V>>
let neighbor = neighbor.as_ptr();
// SAFETY: The reference to the tree used to create the cursor outlives the cursor, so
// the tree cannot change. By the tree invariant, all nodes are valid.
- unsafe { bindings::rb_erase(neighbor, addr_of_mut!(self.tree.root)) };
+ unsafe { bindings::rb_erase(neighbor, &raw mut self.tree.root) };
// SAFETY: By the type invariant of `Self`, all non-null `rb_node` pointers stored in `self`
// point to the links field of `Node<K, V>` objects.
let this = unsafe { container_of!(neighbor, Node<K, V>, links) }.cast_mut();
@@ -918,7 +918,7 @@ unsafe fn to_key_value_raw<'b>(node: NonNull<bindings::rb_node>) -> (&'b K, *mut
let k = unsafe { &(*this).key };
// SAFETY: The passed `node` is the current node or a non-null neighbor,
// thus `this` is valid by the type invariants.
- let v = unsafe { addr_of_mut!((*this).value) };
+ let v = unsafe { &raw mut (*this).value };
(k, v)
}
}
@@ -1027,7 +1027,7 @@ fn next(&mut self) -> Option<Self::Item> {
self.next = unsafe { bindings::rb_next(self.next) };
// SAFETY: By the same reasoning above, it is safe to dereference the node.
- Some(unsafe { (addr_of_mut!((*cur).key), addr_of_mut!((*cur).value)) })
+ Some(unsafe { (&raw mut (*cur).key, &raw mut (*cur).value) })
}
}
@@ -1170,7 +1170,7 @@ fn insert(self, node: RBTreeNode<K, V>) -> &'a mut V {
// SAFETY: `node` is valid at least until we call `Box::from_raw`, which only happens when
// the node is removed or replaced.
- let node_links = unsafe { addr_of_mut!((*node).links) };
+ let node_links = unsafe { &raw mut (*node).links };
// INVARIANT: We are linking in a new node, which is valid. It remains valid because we
// "forgot" it with `Box::into_raw`.
@@ -1178,7 +1178,7 @@ fn insert(self, node: RBTreeNode<K, V>) -> &'a mut V {
unsafe { bindings::rb_link_node(node_links, self.parent, self.child_field_of_parent) };
// SAFETY: All pointers are valid. `node` has just been inserted into the tree.
- unsafe { bindings::rb_insert_color(node_links, addr_of_mut!((*self.rbtree).root)) };
+ unsafe { bindings::rb_insert_color(node_links, &raw mut (*self.rbtree).root) };
// SAFETY: The node is valid until we remove it from the tree.
unsafe { &mut (*node).value }
@@ -1261,7 +1261,7 @@ fn replace(self, node: RBTreeNode<K, V>) -> RBTreeNode<K, V> {
// SAFETY: `node` is valid at least until we call `Box::from_raw`, which only happens when
// the node is removed or replaced.
- let new_node_links = unsafe { addr_of_mut!((*node).links) };
+ let new_node_links = unsafe { &raw mut (*node).links };
// SAFETY: This updates the pointers so that `new_node_links` is in the tree where
// `self.node_links` used to be.
diff --git a/rust/kernel/sync/arc.rs b/rust/kernel/sync/arc.rs
index 3cefda7a4372..81d8b0f84957 100644
--- a/rust/kernel/sync/arc.rs
+++ b/rust/kernel/sync/arc.rs
@@ -243,7 +243,7 @@ pub fn into_raw(self) -> *const T {
let ptr = self.ptr.as_ptr();
core::mem::forget(self);
// SAFETY: The pointer is valid.
- unsafe { core::ptr::addr_of!((*ptr).data) }
+ unsafe { &raw const (*ptr).data }
}
/// Recreates an [`Arc`] instance previously deconstructed via [`Arc::into_raw`].
diff --git a/rust/kernel/task.rs b/rust/kernel/task.rs
index 49012e711942..b2ac768eed23 100644
--- a/rust/kernel/task.rs
+++ b/rust/kernel/task.rs
@@ -257,7 +257,7 @@ pub fn as_ptr(&self) -> *mut bindings::task_struct {
pub fn group_leader(&self) -> &Task {
// SAFETY: The group leader of a task never changes after initialization, so reading this
// field is not a data race.
- let ptr = unsafe { *ptr::addr_of!((*self.as_ptr()).group_leader) };
+ let ptr = unsafe { *(&raw const (*self.as_ptr()).group_leader) };
// SAFETY: The lifetime of the returned task reference is tied to the lifetime of `self`,
// and given that a task has a reference to its group leader, we know it must be valid for
@@ -269,7 +269,7 @@ pub fn group_leader(&self) -> &Task {
pub fn pid(&self) -> Pid {
// SAFETY: The pid of a task never changes after initialization, so reading this field is
// not a data race.
- unsafe { *ptr::addr_of!((*self.as_ptr()).pid) }
+ unsafe { *(&raw const (*self.as_ptr()).pid) }
}
/// Returns the UID of the given task.
diff --git a/rust/kernel/workqueue.rs b/rust/kernel/workqueue.rs
index 0cd100d2aefb..34e8abb38974 100644
--- a/rust/kernel/workqueue.rs
+++ b/rust/kernel/workqueue.rs
@@ -401,9 +401,9 @@ pub fn new(name: &'static CStr, key: &'static LockClassKey) -> impl PinInit<Self
pub unsafe fn raw_get(ptr: *const Self) -> *mut bindings::work_struct {
// SAFETY: The caller promises that the pointer is aligned and not dangling.
//
- // A pointer cast would also be ok due to `#[repr(transparent)]`. We use `addr_of!` so that
- // the compiler does not complain that the `work` field is unused.
- unsafe { Opaque::raw_get(core::ptr::addr_of!((*ptr).work)) }
+ // A pointer cast would also be ok due to `#[repr(transparent)]`. We use `&raw const (*ptr).work`
+ // so that the compiler does not complain that the `work` field is unused.
+ unsafe { Opaque::raw_get(&raw const (*ptr).work) }
}
}
@@ -510,7 +510,7 @@ macro_rules! impl_has_work {
unsafe fn raw_get_work(ptr: *mut Self) -> *mut $crate::workqueue::Work<$work_type $(, $id)?> {
// SAFETY: The caller promises that the pointer is not dangling.
unsafe {
- ::core::ptr::addr_of_mut!((*ptr).$field)
+ &raw mut (*ptr).$field
}
}
}
--
2.48.1
From: Jeff Xu <jeffxu(a)chromium.org>
When mseal was introduced in 6.10, an mprotect call on an address range
containing a sealed VMA was rejected, leaving every VMA in the range
unmodified. However, the extra loop that checked the mseal flag of every VMA
slowed things down a bit, so 6.12 removed can_modify_mm and checks each VMA’s
mseal flag directly, without an extra loop [1]. This is a semantic change:
partial updates are allowed, i.e. VMAs are updated until a sealed VMA is found.
The new semantics also mean that we can allow mprotect on a sealed VMA if the
new attributes of the VMA are the same as the old ones. Relaxing this avoids
unnecessary impact on applications that want to seal a particular mapping, and
it has no security impact.
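The semantics described above can be sketched as a small userspace model. This
is not the kernel implementation: the `Vma` struct, the `mprotect_model`
function and the error value (the index of the offending VMA) are all
hypothetical names chosen for the illustration, and only the protection bits
stand in for the VMA attributes.

```rust
const PROT_READ: u32 = 0x1;
const PROT_WRITE: u32 = 0x2;

/// Hypothetical stand-in for a VMA: only the protection bits and the seal flag.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Vma {
    prot: u32,
    sealed: bool,
}

/// Models mprotect over a range of VMAs with the relaxed check.
///
/// VMAs are updated in order. A sealed VMA stops the walk only if the request
/// would actually change its protection; earlier VMAs stay modified (partial
/// update). On failure, returns the index of the sealed VMA that was hit.
fn mprotect_model(vmas: &mut [Vma], new_prot: u32) -> Result<(), usize> {
    for (i, vma) in vmas.iter_mut().enumerate() {
        if vma.sealed && vma.prot != new_prot {
            return Err(i);
        }
        // Either unsealed, or sealed with unchanged attributes (a no-op).
        vma.prot = new_prot;
    }
    Ok(())
}
```

In this model a request that leaves a sealed VMA's protection unchanged
succeeds, while one that would change it fails after the preceding VMAs have
already been updated.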
This patch also modifies mseal_test to adapt to the new semantics. Note that
mseal_test is currently being refactored and will eventually be replaced with
a new memory sealing selftest, so this patch makes only the minimum change
needed for it to pass. I considered adding a new test case to mseal_test to
cover the new behavior, but the existing mseal_test uses the wrong patterns
and would not pass review. Such a test is better added to the refactored
memory sealing tests, which are currently pending review [2].
[1] https://lore.kernel.org/all/20240817-mseal-depessimize-v3-0-d8d2e037df30@gm…
[2] https://lore.kernel.org/all/20241211053311.245636-1-jeffxu@google.com/
Jeff Xu (2):
selftests/mm: mseal_test: avoid using no-op mprotect
mseal: allow noop mprotect
mm/mprotect.c | 6 +++---
tools/testing/selftests/mm/mseal_test.c | 6 +++---
2 files changed, 6 insertions(+), 6 deletions(-)
--
2.49.0.rc0.332.g42c0ae87b1-goog
The SO_RCVLOWAT option is defined as 18 in the selftest header,
which matches the generic definition. However, on powerpc,
SO_RCVLOWAT is defined as 16. This discrepancy causes
sol_socket_sockopt() to fail with the default switch case on powerpc.
Fix this by defining SO_RCVLOWAT as 16 for powerpc.
Signed-off-by: Saket Kumar Bhaskar <skb99(a)linux.ibm.com>
---
tools/testing/selftests/bpf/progs/bpf_tracing_net.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index 59843b430f76..bcd44d5018bf 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -15,7 +15,11 @@
#define SO_KEEPALIVE 9
#define SO_PRIORITY 12
#define SO_REUSEPORT 15
+#if defined(__TARGET_ARCH_powerpc)
+#define SO_RCVLOWAT 16
+#else
#define SO_RCVLOWAT 18
+#endif
#define SO_BINDTODEVICE 25
#define SO_MARK 36
#define SO_MAX_PACING_RATE 47
--
2.43.5
David, Brendan, Rae,
I am seeing the following error when I run
./tools/testing/kunit/kunit.py run --arch x86_64
ERROR:root:ld:arch/x86/realmode/rm/realmode.lds:236: undefined symbol `sev_es_trampoline_start' referenced in expression
I isolated it to a dependency on CONFIG_AMD_MEM_ENCRYPT, so
I added the option using --kconfig_add:
./tools/testing/kunit/kunit.py run --arch x86_64 --kconfig_add CONFIG_AMD_MEM_ENCRYPT=y
I then see the following:
ERROR:root:Not all Kconfig options selected in kunitconfig were in the generated .config.
This is probably due to unsatisfied dependencies.
Missing: CONFIG_AMD_MEM_ENCRYPT=y
Is there a better way to fix the dependencies? Does kunit default config
need changing for x86_64?
thanks,
-- Shuah
This is one of just 3 remaining "Test Module" kselftests (the others
being bitmap and scanf), the rest having been converted to KUnit.
I tested this using:
$ tools/testing/kunit/kunit.py run --arch arm64 --make_options LLVM=1 printf
I have also sent out a series converting scanf[0].
Link: https://lore.kernel.org/all/20250204-scanf-kunit-convert-v3-0-386d7c3ee714@… [0]
Signed-off-by: Tamir Duberstein <tamird(a)gmail.com>
---
Changes in v6:
- Use __printf correctly on `__test`. (Petr Mladek)
- Rebase on linux-next.
- Remove leftover references to `printf.sh`.
- Update comment in `hash_pointer`. (Petr Mladek)
- Avoid overrun in `KUNIT_EXPECT_MEMNEQ`. (Petr Mladek)
- Restore trailing newlines on printk strings and add some missing ones.
(Petr Mladek)
- Use `kunit_skip` on not-yet-initialized crng. (Petr Mladek)
- Link to v5: https://lore.kernel.org/r/20250221-printf-kunit-convert-v5-0-5db840301730@g…
Changes in v5:
- Update `do_test` `__printf` annotation (Rasmus Villemoes).
- Link to v4: https://lore.kernel.org/r/20250214-printf-kunit-convert-v4-0-c254572f1565@g…
Changes in v4:
- Add patch "implicate test line in failure messages".
- Rebase on linux-next, move scanf_kunit.c into lib/tests/.
- Link to v3: https://lore.kernel.org/r/20250210-printf-kunit-convert-v3-0-ee6ac5500f5e@g…
Changes in v3:
- Remove extraneous trailing newlines from failure messages.
- Replace `pr_warn` with `kunit_warn`.
- Drop arch changes.
- Remove KUnit boilerplate from CONFIG_PRINTF_KUNIT_TEST help text.
- Restore `total_tests` counting.
- Remove tc_fail macro in last patch.
- Link to v2: https://lore.kernel.org/r/20250207-printf-kunit-convert-v2-0-057b23860823@g…
Changes in v2:
- Incorporate code review from prior work[0] by Arpitha Raghunandan.
- Link to v1: https://lore.kernel.org/r/20250204-printf-kunit-convert-v1-0-ecf1b846a4de@g…
Link: https://lore.kernel.org/lkml/20200817043028.76502-1-98.arpi@gmail.com/t/#u [0]
---
Tamir Duberstein (3):
printf: convert self-test to KUnit
printf: break kunit into test cases
printf: implicate test line in failure messages
Documentation/core-api/printk-formats.rst | 4 +-
Documentation/dev-tools/kselftest.rst | 2 +-
MAINTAINERS | 2 +-
lib/Kconfig.debug | 12 +-
lib/Makefile | 1 -
lib/tests/Makefile | 1 +
lib/{test_printf.c => tests/printf_kunit.c} | 442 ++++++++++++----------------
tools/testing/selftests/kselftest/module.sh | 2 +-
tools/testing/selftests/lib/Makefile | 2 +-
tools/testing/selftests/lib/config | 1 -
tools/testing/selftests/lib/printf.sh | 4 -
11 files changed, 207 insertions(+), 266 deletions(-)
---
base-commit: 7ec162622e66a4ff886f8f28712ea1b13069e1aa
change-id: 20250131-printf-kunit-convert-fd4012aa2ec6
Best regards,
--
Tamir Duberstein <tamird(a)gmail.com>
This series fixes some issues with the global "irq_type" variable, which is
used to indicate the current IRQ type to users.
In addition, avoid an unexpected warning that occurs due to interrupts
remaining after displaying an error caused by devm_request_irq().
Patch 1 includes adding GET_IRQTYPE test (check for failure).
Patch 2-4 include fixes for stable kernels that have global "irq_type".
Patch 5-6 include improvements for the latest.
Changes since v3:
- Add GET_IRQTYPE check to pci_endpoint test in selftests
- Add the reason why global variables aren't necessary (patch 5/6)
- Add Reviewed-by: lines (patch {2, 4, 6}/6)
Changes since v2:
- Rebase to v6.14-rc1
- Update message to clarify, and add result of call trace (patch 1/5)
- Add Reviewed-by: lines (patch 2/5)
- Add new patch to remove global "irq_type" variable (patch 4/5)
- Add new patch to replace "devm" version of IRQ functions (patch 5/5)
Changes since v1:
- Divide original patch into two
- Add an error message example
- Add "pcitest" display example
- Add a patch to fix an interrupt remaining issue
Kunihiko Hayashi (6):
selftests: pci_endpoint: Add GET_IRQTYPE checks to each interrupt test
misc: pci_endpoint_test: Avoid issue of interrupts remaining after
request_irq error
misc: pci_endpoint_test: Fix displaying irq_type after request_irq
error
misc: pci_endpoint_test: Fix irq_type to convey the correct type
misc: pci_endpoint_test: Remove global 'irq_type' and 'no_msi'
misc: pci_endpoint_test: Do not use managed irq functions
drivers/misc/pci_endpoint_test.c | 31 +++++++------------
.../pci_endpoint/pci_endpoint_test.c | 11 ++++++-
2 files changed, 21 insertions(+), 21 deletions(-)
--
2.25.1
Hi all,
This patchset adds a new buddy allocator like (or non-uniform) large folio
split from an order-n folio to order-m with m < n. It
1. reduces the total number of after-split folios from 2^(n-m) to n-m+1;
2. reduces the amount of memory needed for multi-index xarray split from
2^(n/6-m/6) to n/6-m/6, assuming XA_CHUNK_SHIFT=6;
3. keeps more large folios after a split: the result changes from all
order-m folios to folios of order-(n-1) down to order-m.
For example, to split an order-9 to order-0, folio split generates 10
(or 11 for anonymous memory) folios instead of 512, allocates 1 xa_node
instead of 8, and leaves 1 order-8, 1 order-7, ..., 1 order-1 and 2 order-0
folios (or 4 order-0 for anonymous memory) instead of 512 order-0 folios.
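The folio counts quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and the anon adjustment assumes that the missing order-1 step turns one order-1 folio plus two order-0 folios into four order-0 folios.

```python
def uniform_split_folios(n, m):
    """Uniform split: every resulting folio has order m."""
    return 1 << (n - m)


def non_uniform_split_folios(n, m, anon=False):
    """Buddy allocator like split: one folio per order from n-1 down to m,
    plus one extra order-m folio."""
    count = n - m + 1
    # Assumption: anon folios cannot be order-1, so when the split passes
    # through order 1 the order-2 folio becomes four order-0 folios (+1).
    if anon and m == 0 and n >= 2:
        count += 1
    return count


print(uniform_split_folios(9, 0))                 # uniform split to order-0
print(non_uniform_split_folios(9, 0))             # pagecache
print(non_uniform_split_folios(9, 0, anon=True))  # anonymous memory
```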
Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split()
can support both uniform split and buddy allocator like (or non-uniform) split.
All existing split_huge_page*() users can be gradually converted to use
folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
xfstests quick group passed for both tmpfs and xfs. I also
semi-replicated Hugh's test[12] and ran it without any issue for almost
24 hours.
It is on top of mm-everything-2025-03-07-07-55. It is ready to be merged.
Changelog
===
From V9[13]
1. Incorporated Hugh's fixes[14] (Thanks Hugh):
a) moved folio_set_order() in __split_folio_to_order() to be called
only once for the input folio,
b) used folio_test_swapcache() to catch both anon and shmem in swap
cache cases,
c) moved folio_next() out of for(;;),
d) used mapping instead of origin_folio->mapping.
2. Added a TODO in __folio_split(), since large in-swap-cache shmem folio
split is not supported yet.
3. Changed __split_folio_to_order() based on David Hildenbrand's MM owner
tracking for large folios patchset[15], due to rebasing.
From V8[11]:
1. Removed gfp parameter from xas_try_split() and GFP_NOWAIT is used all
the time. (per Baolin Wang)
2. Used __xas_init_node_for_split() instead of
__xas_alloc_node_for_split() and moved node allocation out. It fixed
a bug when xa_node is pre-allocated by xas_nomem() before
xas_try_split() is called without being initialized for split.
From V7[9]:
1. Fixed a wrong function name in lib/test_xarray.c.
2. Made __split_folio_to_order() never fail, since the old order check
is already done in __folio_split(). (per David Hildenbrand)
3. Fixed an issue reported by syzbot[10] by not dropping the original
folio during truncate.
4. Fixed a WARNING when READ_ONLY_THP_FOR_FS is enabled. (Thanks to David
Hildenbrand for reporting the issue)
5. Used two separate struct page* parameters, split_at and lock_at, to
specify at which subpage the non-uniform split happens and which subpage
to keep locked after the split, respectively. It improves code
readability.
From V6[8]:
1. Added an xarray function xas_try_split() to support iterative folio split,
removing the need of using xas_split_alloc() and xas_split(). The
function guarantees that at most one xa_node is allocated for each
call.
2. Added concrete numbers of after-split folios and xa_node savings to
cover letter, commit log. (per Andrew)
From V5[7]:
1. Split shmem to any lower order patches are in mm tree, so dropped
from this series.
2. Rename split_folio_at() to try_folio_split() to clarify that
non-uniform split will not be used if it is not supported.
From V4[6]:
1. Enabled shmem support in both uniform and buddy allocator like split
and added selftests for it.
2. Added functions to check if uniform split and buddy allocator like
split are supported for the given folio and order.
3. Made truncate fall back to uniform split if buddy allocator split is
not supported (CONFIG_READ_ONLY_THP_FOR_FS and FS without large folio).
4. Added the missing folio_clear_has_hwpoisoned() to
__split_unmapped_folio().
From V3[5]:
1. Used xas_split_alloc(GFP_NOWAIT) instead of xas_nomem(), since extra
operations inside xas_split_alloc() are needed for correctness.
2. Enabled folio_split() for shmem and no issue was found with xfstests
quick test group.
3. Split both ends of a truncate range in truncate_inode_partial_folio()
to avoid wasting memory in shmem truncate (per David Hildenbrand).
4. Removed page_in_folio_offset() since page_folio() does the same
thing.
5. Finished truncate related tests from xfstests quick test group on XFS and
tmpfs without issues.
6. Disabled buddy allocator like split on CONFIG_READ_ONLY_THP_FOR_FS
and FS without large folio. This check was missed in the prior
versions.
From V2[3]:
1. Incorporated all the feedback from Kirill[4].
2. Used GFP_NOWAIT for xas_nomem().
3. Tested the code path when xas_nomem() fails.
4. Added selftests for folio_split().
5. Fixed no THP config build error.
From V1[2]:
1. Split the original patch 1 into multiple ones for easy review (per
Kirill).
2. Added xas_destroy() to avoid memory leak.
3. Fixed nr_dropped not used error (per kernel test robot).
4. Added proper error handling when xas_nomem() fails to allocate memory
for xas_split() during buddy allocator like split.
From RFC[1]:
1. Merged backend code of split_huge_page_to_list_to_order() and
folio_split(). The same code is used for both uniform split and buddy
allocator like split.
2. Use xas_nomem() instead of xas_split_alloc() for folio_split().
3. folio_split() now leaves the first after-split folio unlocked,
instead of the one containing the given page, since
the caller of truncate_inode_partial_folio() locks and unlocks the
first folio.
4. Extended split_huge_page debugfs to use folio_split().
5. Added truncate_inode_partial_folio() as first user of folio_split().
Design
===
folio_split() splits a large folio in the same way as the buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if a user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache,
since anon folios do not support order-1 yet.
The split process is similar to the existing approach:
1. Unmap all page mappings (split PMD mappings if exist);
2. Split meta data like memcg, page owner, page alloc tag;
3. Copy meta data in struct folio to sub pages, but instead of splitting
the whole folio into multiple smaller ones with the same order in one
shot, this approach splits the folio iteratively. Taking the example
above, this approach first splits the original order-9 into two order-8,
then splits left part of order-8 to two order-7 and so on;
4. Post-process split folios, like write mapping->i_pages for pagecache,
adjust folio refcounts, add split folios to corresponding list;
5. Remap split folios;
6. Unlock split folios.
__split_unmapped_folio() and __split_folio_to_order() replace
__split_huge_page() and __split_huge_page_tail() respectively.
__split_unmapped_folio() uses different approaches to perform
uniform split and buddy allocator like split:
1. uniform split: one single call to __split_folio_to_order() is used to
uniformly split the given folio. All resulting folios are put back to
the list after split. The folio containing the given page is left to
caller to unlock and others are unlocked.
2. buddy allocator like (or non-uniform) split: (old_order - new_order) calls
to __split_folio_to_order() are used to split the given folio at order N to
order N-1. After each call, the target folio is changed to the one
containing the page, which is given as a folio_split() parameter.
After each call, folios not containing the page are put back to the list.
The folio containing the page is put back to the list when its order
is new_order. All folios are unlocked except the first folio, which
is left to caller to unlock.
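The iterative, buddy allocator like split described above can be modelled in userspace as follows. This is a rough sketch only: it tracks page ranges and orders, ignores locking, refcounts and the xarray, and the function name, the allow_order_1 flag and the handling of the anon case (order-2 going straight to order-0) are assumptions made for illustration. It prints the per-folio orders for the order-9, 3rd-subpage case discussed in the Design section.

```python
def buddy_like_split(old_order, new_order, page_index, allow_order_1=True):
    """Model a non-uniform split; return [(first_page, order), ...] by position.

    allow_order_1=False models anon folios, where an order-2 folio is
    assumed to be split straight into order-0 folios.
    """
    if not 0 <= page_index < (1 << old_order):
        raise ValueError("page_index is outside the folio")
    if not allow_order_1 and new_order == 1:
        raise ValueError("order-1 target is not modelled for this case")
    done = []
    start, order = 0, old_order
    while order > new_order:
        next_order = order - 1
        if next_order == 1 and not allow_order_1:
            next_order = 0
        size = 1 << next_order
        # Split the current target; keep only the piece containing the page.
        for first in range(start, start + (1 << order), size):
            if first <= page_index < first + size:
                target = first
            else:
                done.append((first, next_order))
        start, order = target, next_order
    done.append((start, order))
    return sorted(done)


# The 3rd subpage has index 2.
for label, order_1_ok in (("pagecache", True), ("anon", False)):
    orders = [o for _, o in buddy_like_split(9, 0, 2, allow_order_1=order_1_ok)]
    print(label + ":", ", ".join(f"O-{o}" for o in orders))
```

In both cases the resulting folios cover exactly 512 pages, which is a quick sanity check on any such layout.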
Patch Overview
===
1. Patch 1 added a new xarray function xas_try_split() to perform
iterative xarray split.
2. Patch 2 added __split_unmapped_folio() and __split_folio_to_order() to
prepare for moving to new backend split code.
3. Patch 3 moved common code in split_huge_page_to_list_to_order() to
__folio_split().
4. Patch 4 added new folio_split() and made
split_huge_page_to_list_to_order() share the new
__split_unmapped_folio() with folio_split().
5. Patch 5 removed no longer used __split_huge_page() and
__split_huge_page_tail().
6. Patch 6 added a new in_folio_offset to split_huge_page debugfs for
folio_split() test.
7. Patch 7 used try_folio_split() for truncate operation.
8. Patch 8 added folio_split() tests.
Any comments and/or suggestions are welcome. Thanks.
[1] https://lore.kernel.org/linux-mm/20241008223748.555845-1-ziy@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20241028180932.1319265-1-ziy@nvidia.com/
[3] https://lore.kernel.org/linux-mm/20241101150357.1752726-1-ziy@nvidia.com/
[4] https://lore.kernel.org/linux-mm/e6ppwz5t4p4kvir6eqzoto4y5fmdjdxdyvxvtw43nc…
[5] https://lore.kernel.org/linux-mm/20241205001839.2582020-1-ziy@nvidia.com/
[6] https://lore.kernel.org/linux-mm/20250106165513.104899-1-ziy@nvidia.com/
[7] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@nvidia.com/
[8] https://lore.kernel.org/linux-mm/20250205031417.1771278-1-ziy@nvidia.com/
[9] https://lore.kernel.org/linux-mm/20250211155034.268962-1-ziy@nvidia.com/
[10] https://lore.kernel.org/all/67af65cb.050a0220.21dd3.004a.GAE@google.com/
[11] https://lore.kernel.org/linux-mm/20250218235012.1542225-1-ziy@nvidia.com/
[12] https://lore.kernel.org/linux-mm/D45D4F01-E5A5-47E6-8724-01610CC192CC@nvidi…
[13] https://lore.kernel.org/linux-mm/20250226210032.2044041-1-ziy@nvidia.com/
[14] https://lore.kernel.org/linux-mm/2fae27fe-6e2e-3587-4b68-072118d80cf8@googl…
[15] https://lore.kernel.org/all/20250303163014.1128035-4-david@redhat.com/
Zi Yan (8):
xarray: add xas_try_split() to split a multi-index entry
mm/huge_memory: add two new (not yet used) functions for folio_split()
mm/huge_memory: move folio split common code to __folio_split()
mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
mm/huge_memory: remove the old, unused __split_huge_page()
mm/huge_memory: add folio_split() to debugfs testing interface
mm/truncate: use folio_split() in truncate operation
selftests/mm: add tests for folio_split(), buddy allocator like split
Documentation/core-api/xarray.rst | 14 +-
include/linux/huge_mm.h | 36 +
include/linux/xarray.h | 6 +
lib/test_xarray.c | 52 ++
lib/xarray.c | 132 ++-
mm/huge_memory.c | 786 ++++++++++++------
mm/truncate.c | 37 +-
tools/testing/radix-tree/Makefile | 1 +
.../selftests/mm/split_huge_page_test.c | 34 +-
9 files changed, 809 insertions(+), 289 deletions(-)
--
2.47.2
On Thu Mar 13, 2025 at 6:33 AM CET, Antonio Hickey wrote:
> Replacing all occurrences of `addr_of!(place)` with `&raw place`, and
> all occurrences of `addr_of_mut!(place)` with `&raw mut place`.
>
> Utilizing the new feature will allow us to reduce macro complexity, and
> improve consistency with existing reference syntax as `&raw`, `&raw mut`
> is very similar to `&`, `&mut` making it fit more naturally with other
> existing code.
>
> Depends on: Patch 1/3 0001-rust-enable-raw_ref_op-feature.patch
This information shouldn't be in the commit message. You can put it
below the `---` (that won't end up in the commit message). But since you
sent this as part of a series, you don't need to mention it.
> Suggested-by: Benno Lossin <y86-dev(a)protonmail.com>
> Link: https://github.com/Rust-for-Linux/linux/issues/1148
> Signed-off-by: Antonio Hickey <contact(a)antoniohickey.com>
> ---
> rust/kernel/block/mq/request.rs | 4 ++--
> rust/kernel/faux.rs | 4 ++--
> rust/kernel/fs/file.rs | 2 +-
> rust/kernel/init.rs | 8 ++++----
> rust/kernel/init/macros.rs | 28 +++++++++++++-------------
> rust/kernel/jump_label.rs | 4 ++--
> rust/kernel/kunit.rs | 4 ++--
> rust/kernel/list.rs | 2 +-
> rust/kernel/list/impl_list_item_mod.rs | 6 +++---
> rust/kernel/net/phy.rs | 4 ++--
> rust/kernel/pci.rs | 4 ++--
> rust/kernel/platform.rs | 4 +---
> rust/kernel/rbtree.rs | 22 ++++++++++----------
> rust/kernel/sync/arc.rs | 2 +-
> rust/kernel/task.rs | 4 ++--
> rust/kernel/workqueue.rs | 8 ++++----
> 16 files changed, 54 insertions(+), 56 deletions(-)
[...]
> diff --git a/rust/kernel/jump_label.rs b/rust/kernel/jump_label.rs
> index 4e974c768dbd..05d4564714c7 100644
> --- a/rust/kernel/jump_label.rs
> +++ b/rust/kernel/jump_label.rs
> @@ -20,8 +20,8 @@
> #[macro_export]
> macro_rules! static_branch_unlikely {
> ($key:path, $keytyp:ty, $field:ident) => {{
> - let _key: *const $keytyp = ::core::ptr::addr_of!($key);
> - let _key: *const $crate::bindings::static_key_false = ::core::ptr::addr_of!((*_key).$field);
> + let _key: *const $keytyp = &raw $key;
This should be `&raw const $key`. I wrote that wrongly in the issue.
> + let _key: *const $crate::bindings::static_key_false = &raw (*_key).$field;
Same here.
> let _key: *const $crate::bindings::static_key = _key.cast();
>
> #[cfg(not(CONFIG_JUMP_LABEL))]
> diff --git a/rust/kernel/kunit.rs b/rust/kernel/kunit.rs
> index 824da0e9738a..18357dd782ed 100644
> --- a/rust/kernel/kunit.rs
> +++ b/rust/kernel/kunit.rs
> @@ -128,9 +128,9 @@ unsafe impl Sync for UnaryAssert {}
> unsafe {
> $crate::bindings::__kunit_do_failed_assertion(
> kunit_test,
> - core::ptr::addr_of!(LOCATION.0),
> + &raw LOCATION.0,
And here.
> $crate::bindings::kunit_assert_type_KUNIT_ASSERTION,
> - core::ptr::addr_of!(ASSERTION.0.assert),
> + &raw ASSERTION.0.assert,
Lastly here as well.
---
Cheers,
Benno
Commit 51bef03e1a71 ("selftests/net: deflake GRO tests") recently
switched to NAPI suspension, and lowered the timeout from 1ms to 100us.
This started causing flakes in netdev-run CI. Let's bump it to 200us.
In a quick test of a debug kernel I see failures with 100us, with 200us
in 5 runs I see 2 completely clean runs and 3 with a single retry
(GRO test will retry up to 5 times).
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
---
CC: krakauer(a)google.com
CC: willemb(a)google.com
CC: shuah(a)kernel.org
CC: linux-kselftest(a)vger.kernel.org
---
tools/testing/selftests/net/setup_veth.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/net/setup_veth.sh b/tools/testing/selftests/net/setup_veth.sh
index eb3182066d12..152bf4c65747 100644
--- a/tools/testing/selftests/net/setup_veth.sh
+++ b/tools/testing/selftests/net/setup_veth.sh
@@ -11,7 +11,7 @@ setup_veth_ns() {
local -r ns_mac="$4"
[[ -e /var/run/netns/"${ns_name}" ]] || ip netns add "${ns_name}"
- echo 100000 > "/sys/class/net/${ns_dev}/gro_flush_timeout"
+ echo 200000 > "/sys/class/net/${ns_dev}/gro_flush_timeout"
echo 1 > "/sys/class/net/${ns_dev}/napi_defer_hard_irqs"
ip link set dev "${ns_dev}" netns "${ns_name}" mtu 65535
ip -netns "${ns_name}" link set dev "${ns_dev}" up
--
2.48.1
Following the previous vIOMMU series, this adds another vDEVICE structure,
representing the association from an iommufd_device to an iommufd_viommu.
This gives the whole architecture a new "v" layer:
_______________________________________________________________________
| iommufd (with vIOMMU/vDEVICE) |
| _____________ _____________ |
| | | | | |
| |----------------| vIOMMU |<---| vDEVICE |<------| |
| | | | |_____________| | |
| | ______ | | _____________ ___|____ |
| | | | | | | | | | |
| | | IOAS |<---|(HWPT_PAGING)|<---| HWPT_NESTED |<--| DEVICE | |
| | |______| |_____________| |_____________| |________| |
|______|________|______________|__________________|_______________|_____|
| | | | |
______v_____ | ______v_____ ______v_____ ___v__
| struct | | PFN | (paging) | | (nested) | |struct|
|iommu_device| |------>|iommu_domain|<----|iommu_domain|<----|device|
|____________| storage|____________| |____________| |______|
This vDEVICE object is used to collect and store all vIOMMU-related device
information/attributes in a VM. As an initial series for vDEVICE, add only
the virt_id to the vDEVICE, which is a vIOMMU specific device ID in a VM:
e.g. vSID of ARM SMMUv3, vDeviceID of AMD IOMMU, and vRID of Intel VT-d to
a Context Table. This virt_id helps IOMMU drivers to link the vID to a pID
of the device against the physical IOMMU instance. This is essential for a
vIOMMU-based invalidation, where the request contains a device's vID for a
device cache flush, e.g. ATC invalidation.
Therefore, with this vDEVICE object, support a vIOMMU-based invalidation,
by reusing IOMMUFD_CMD_HWPT_INVALIDATE for a vIOMMU object to flush cache
with a given driver data.
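The role of virt_id in an invalidation can be pictured with a toy model: the vIOMMU keeps a virt_id to physical-ID table, and a vID-based request is translated through it before reaching hardware. The class and method names below are hypothetical and are not the iommufd API; the real objects, locking and error handling live in the kernel.

```python
class ToyViommu:
    """Toy stand-in for a vIOMMU object holding its vDEVICE associations."""

    def __init__(self):
        self._vdevs = {}  # virt_id (e.g. a vSID) -> physical device ID

    def vdevice_alloc(self, virt_id, phys_id):
        """Record that a device is known to the VM under virt_id."""
        if virt_id in self._vdevs:
            raise ValueError(f"virt_id {virt_id:#x} already in use")
        self._vdevs[virt_id] = phys_id

    def find_phys_id(self, virt_id):
        """Return the physical ID for virt_id, or None if unknown."""
        return self._vdevs.get(virt_id)

    def invalidate(self, virt_ids):
        """Translate vID-based invalidation requests into physical IDs."""
        phys_ids = []
        for virt_id in virt_ids:
            phys_id = self.find_phys_id(virt_id)
            if phys_id is None:
                raise KeyError(f"no vDEVICE for virt_id {virt_id:#x}")
            phys_ids.append(phys_id)
        return phys_ids


viommu = ToyViommu()
viommu.vdevice_alloc(virt_id=0x10, phys_id=0x8)
print(viommu.invalidate([0x10]))
```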
As for the implementation of the series, add driver support in ARM SMMUv3
for a real world use case.
This series is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p2-v7
(QEMU branch for testing will be provided in Jason's nesting series)
Changelog
v7
* Added "Reviewed-by" from Jason
* Corrected a line of comments in iommufd_vdevice_destroy()
v6
https://lore.kernel.org/all/cover.1730313494.git.nicolinc@nvidia.com/
* Fixed kdoc in the uAPI header
* Fixed indentations in iommufd.rst
* Replaced vdev->idev with vdev->dev
* Added "Reviewed-by" from Kevin and Jason
* Updated kdoc of struct iommu_vdevice_alloc
* Fixed lockdep function call in iommufd_viommu_find_dev
* Added missing iommu_dev validation between viommu and idev
* Skipped SMMUv3 driver changes (to post in a separate series)
* Replaced !cache_invalidate_user in WARN_ON of the allocation path
with cache_invalidate_user validation in iommufd_hwpt_invalidate
v5
https://lore.kernel.org/all/cover.1729897278.git.nicolinc@nvidia.com/
* Dropped driver-allocated vDEVICE support
* Changed vdev_to_dev helper to iommufd_viommu_find_dev
v4
https://lore.kernel.org/all/cover.1729555967.git.nicolinc@nvidia.com/
* Added missing brackets in switch-case
* Fixed the unreleased idev refcount issue
* Reworked the iommufd_vdevice_alloc allocator
* Dropped support for IOMMU_VIOMMU_TYPE_DEFAULT
* Added missing TEST_LENGTH and fail_nth coverages
* Added a verification to the driver-allocated vDEVICE object
* Added an iommufd_vdevice_abort for a missing mutex protection
* Added a u64 structure arm_vsmmu_invalidation_cmd for user command
conversion
v3
https://lore.kernel.org/all/cover.1728491532.git.nicolinc@nvidia.com/
* Added Jason's Reviewed-by
* Split this invalidation part out of the part-1 series
* Repurposed VDEV_ID ioctl to a wider vDEVICE structure and ioctl
* Reduced viommu_api functions by allowing drivers to access viommu
and vdevice structure directly
* Dropped vdevs_rwsem by using xa_lock instead
* Dropped arm_smmu_cache_invalidate_user
v2
https://lore.kernel.org/all/cover.1724776335.git.nicolinc@nvidia.com/
* Limited vdev_id to one per idev
* Added a rw_sem to protect the vdev_id list
* Reworked driver-level APIs with proper lockings
* Added a new viommu_api file for IOMMUFD_DRIVER config
* Dropped useless iommu_dev point from the viommu structure
* Added missing index numbers to new types in the uAPI header
* Dropped IOMMU_VIOMMU_INVALIDATE uAPI; Instead, reuse the HWPT one
* Reworked mock_viommu_cache_invalidate() using the new iommu helper
* Reordered details of set/unset_vdev_id handlers for proper lockings
v1
https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Jason Gunthorpe (1):
iommu: Add iommu_copy_struct_from_full_user_array helper
Nicolin Chen (9):
iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
iommufd/viommu: Add iommufd_viommu_find_dev helper
iommufd/selftest: Add mock_viommu_cache_invalidate
iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
Documentation: userspace-api: iommufd: Update vDEVICE
drivers/iommu/iommufd/iommufd_private.h | 18 ++
drivers/iommu/iommufd/iommufd_test.h | 30 +++
include/linux/iommu.h | 48 ++++-
include/linux/iommufd.h | 22 ++
include/uapi/linux/iommufd.h | 31 ++-
tools/testing/selftests/iommu/iommufd_utils.h | 83 +++++++
drivers/iommu/iommufd/driver.c | 13 ++
drivers/iommu/iommufd/hw_pagetable.c | 40 +++-
drivers/iommu/iommufd/main.c | 6 +
drivers/iommu/iommufd/selftest.c | 98 ++++++++-
drivers/iommu/iommufd/viommu.c | 76 +++++++
tools/testing/selftests/iommu/iommufd.c | 204 +++++++++++++++++-
.../selftests/iommu/iommufd_fail_nth.c | 4 +
Documentation/userspace-api/iommufd.rst | 41 +++-
14 files changed, 688 insertions(+), 26 deletions(-)
base-commit: 0780dd4af09a5360392f5c376c35ffc2599a9c0e
--
2.43.0
The first patch fixes incorrect lock usage in the bonding driver.
The second patch fixes the xfrm offload feature setup in active-backup
mode. The third patch adds an IPsec offload test.
v5: use list_for_each_entry_safe() when deleting items from the list (Nikolay Aleksandrov)
do not hold spin_lock_bh while calling the sleeping function xdo_dev_state_free (Nikolay Aleksandrov)
set xso.real_dev = NULL to avoid __xfrm_state_delete() being called in parallel (Cosmin Ratiu)
remove spin lock in bond_ipsec_add_sa_all() as it doesn't resolve the race condition.
v4: hold xs->lock for bond_ipsec_{del, add}_sa_all (Cosmin Ratiu)
use the defer helpers in lib.sh for selftest (Petr Machata)
v3: move the ipsec deletion to bond_ipsec_free_sa (Cosmin Ratiu)
v2: do not turn carrier on if bond change link failed (Nikolay Aleksandrov)
move the mutex lock to a work queue (Cosmin Ratiu)
Hangbin Liu (3):
bonding: fix calling sleeping function in spin lock and some race
conditions
bonding: fix xfrm offload feature setup on active-backup mode
selftests: bonding: add ipsec offload test
drivers/net/bonding/bond_main.c | 71 +++++---
drivers/net/bonding/bond_netlink.c | 16 +-
include/net/bonding.h | 1 +
.../selftests/drivers/net/bonding/Makefile | 3 +-
.../drivers/net/bonding/bond_ipsec_offload.sh | 154 ++++++++++++++++++
.../selftests/drivers/net/bonding/config | 4 +
6 files changed, 222 insertions(+), 27 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/bonding/bond_ipsec_offload.sh
--
2.46.0
A build break was reported on the powerpc mailing list for next-20250218 with the errors below:
make[1]: Nothing to be done for 'all'.
BUILD_TARGET=/root/venkat/linux-next/tools/testing/selftests/powerpc/mm; mkdir -p $BUILD_TARGET; make OUTPUT=$BUILD_TARGET -k -C mm all
CC pkey_exec_prot
In file included from pkey_exec_prot.c:18:
/root/venkat/linux-next/tools/testing/selftests/powerpc/include/pkeys.h: In function ‘pkeys_unsupported’:
/root/venkat/linux-next/tools/testing/selftests/powerpc/include/pkeys.h:96:34: error: ‘PKEY_UNRESTRICTED’ undeclared (first use in this function)
96 | pkey = sys_pkey_alloc(0, PKEY_UNRESTRICTED);
| ^~~~~~~~~~~~~~~~~
The patchset at https://lore.kernel.org/all/20250113170619.484698-2-yury.khrustalev@arm.com/
has been queued to arm64/for-next/pkey_unrestricted, which is causing a build break
in the selftest/powerpc builds.
Commit 6d61527d931ba ("mm/pkey: Add PKEY_UNRESTRICTED macro") added a macro
PKEY_UNRESTRICTED to handle implicit literal value of 0x0 (which is "unrestricted").
Add the same to selftest/powerpc/pkeys.h to fix the reported build break.
Reported-by: Venkat Rao Bagalkote <venkat88(a)linux.ibm.com>
Closes: https://lore.kernel.org/lkml/3267ea6e-5a1a-4752-96ef-8351c912d386@linux.ibm…
Tested-by: Venkat Rao Bagalkote <venkat88(a)linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy(a)linux.ibm.com>
---
Catalin, can you take this fix via arm64/for-next/pkey_unrestricted?
Patch applies clean on top of arm64/for-next/pkey_unrestricted
tools/testing/selftests/powerpc/include/pkeys.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/powerpc/include/pkeys.h b/tools/testing/selftests/powerpc/include/pkeys.h
index c6d4063dd4f6..d6deb6ffa1b9 100644
--- a/tools/testing/selftests/powerpc/include/pkeys.h
+++ b/tools/testing/selftests/powerpc/include/pkeys.h
@@ -24,6 +24,9 @@
#undef PKEY_DISABLE_EXECUTE
#define PKEY_DISABLE_EXECUTE 0x4
+#undef PKEY_UNRESTRICTED
+#define PKEY_UNRESTRICTED 0x0
+
/* Older versions of libc do not define this */
#ifndef SEGV_PKUERR
#define SEGV_PKUERR 4
--
2.48.1
Hello Jens and guys,
This patchset fixes several issues (1, 2, 4) and consolidates & improves
the tests in the following ways:
- support shellcheck and fix all warnings
- misc cleanup
- improve cleanup code path (module load/unload, cleanup temp files)
- help to reuse the same test source code and scripts for other
projects (liburing[1], blktest, ...); only a one-line change is needed to
override skip_code, since liburing uses 77 and kselftests uses 4
- add two stress tests for covering IO workloads vs. removing device &
killing ublk server, given buffer lifetime is one big thing for ublk-zc
[1] https://github.com/ming1/liburing/commits/ublk-zc
Ming Lei (11):
selftests: ublk: make ublk_stop_io_daemon() more reliable
selftests: ublk: fix build failure
selftests: ublk: add --foreground command line
selftests: ublk: fix parsing '-a' argument
selftests: ublk: support shellcheck and fix all warning
selftests: ublk: don't pass ${dev_id} to _cleanup_test()
selftests: ublk: move zero copy feature check into _add_ublk_dev()
selftests: ublk: load/unload ublk_drv when preparing & cleaning up
tests
selftests: ublk: add one stress test for covering IO vs. removing
device
selftests: ublk: add stress test for covering IO vs. killing ublk
server
selftests: ublk: improve test usability
tools/testing/selftests/ublk/Makefile | 6 +
tools/testing/selftests/ublk/kublk.c | 43 +++--
tools/testing/selftests/ublk/kublk.h | 2 +
tools/testing/selftests/ublk/test_common.sh | 167 ++++++++++++++----
tools/testing/selftests/ublk/test_loop_01.sh | 13 +-
tools/testing/selftests/ublk/test_loop_02.sh | 14 +-
tools/testing/selftests/ublk/test_loop_03.sh | 16 +-
tools/testing/selftests/ublk/test_loop_04.sh | 14 +-
tools/testing/selftests/ublk/test_null_01.sh | 9 +-
.../testing/selftests/ublk/test_stress_01.sh | 47 +++++
.../testing/selftests/ublk/test_stress_02.sh | 47 +++++
11 files changed, 300 insertions(+), 78 deletions(-)
create mode 100755 tools/testing/selftests/ublk/test_stress_01.sh
create mode 100755 tools/testing/selftests/ublk/test_stress_02.sh
--
2.47.0
I never had much luck running mm selftests so I spent a few hours
digging into why.
Looks like most of the reason is missing SKIP checks, so this series is
just adding a bunch of those that I found. I did not do anything like
all of them, just the ones I spotted in gup_longterm, gup_test, mmap,
userfaultfd and memfd_secret.
It's a bit unfortunate to have to skip those tests when ftruncate()
fails, but I don't have time to dig deep enough into it to actually make
them pass. I have observed the issue on 9pfs and heard rumours that NFS
has a similar problem.
I'm now able to run these test groups successfully:
- mmap
- gup_test
- compaction
- migration
- page_frag
- userfaultfd
- mlock
I've never gone past "Waiting for hugetlb memory to get depleted", in
the hugetlb tests. I don't know if they are stuck or if they would
eventually work if I was patient enough (testing on a 1G machine). I
have not investigated further.
I had some issues with mlock tests failing due to -ENOSRCH from
mlock2(), I can no longer reproduce that though, things work OK now.
Of the remaining tests there may be others that work fine, but there's
no convenient way to survey the whole output of run_vmtests.sh so I'm
just going test by test here.
In my spare moments I am slowly chipping away at a setup to run these
tests continuously in a reasonably hermetic QEMU environment via
virtme-ng:
https://github.com/bjackman/linux/blob/5fad4b9c592290f38e0f8bc73c9abb9c99d8…
Hopefully that will eventually offer a way to provide a "canned"
environment where the tests are known to work, which can be fairly
easily reproduced by any developer.
Signed-off-by: Brendan Jackman <jackmanb(a)google.com>
---
Changes in v4:
- NOT ADDRESSED: still using errno==ENOENT as a hacky way to detect
buggy filesystems:
https://lore.kernel.org/all/CA+i-1C3srkh44tN8dMQ5aD-jhoksUkdEpa+mMfdDtDrPAU…
- Added some incomplete cleanups for the mlock tests.
- Fixed divide-by-zero error when running uffd-stress on <32cpu systems.
- Fixed misnamed nr_threads variable (now nr_parallel).
- Fixed reporting io_uring errors (retval instead of errno).
- Link to v3: https://lore.kernel.org/r/20250228-mm-selftests-v3-0-958e3b6f0203@google.com
Changes in v3:
- Added fix for userfaultfd tests.
- Dropped attempts to use sudo.
- Fixed garbage printf in uffd-stress.
(Added EXTRA_CFLAGS=-Werror FORCE_TARGETS=1 to my scripts to prevent
such errors happening again).
- Fixed missing newlines in ksft_test_result_skip() calls.
- Link to v2: https://lore.kernel.org/r/20250221-mm-selftests-v2-0-28c4d66383c5@google.com
Changes in v2 (Thanks to Dev for the reviews):
- Improve and cleanup some error messages
- Add some extra SKIPs
- Fix misnaming of nr_cpus variable in uffd tests
- Link to v1: https://lore.kernel.org/r/20250220-mm-selftests-v1-0-9bbf57d64463@google.com
---
Brendan Jackman (12):
selftests/mm: Report errno when things fail in gup_longterm
selftests/mm: Skip uffd-stress if userfaultfd not available
selftests/mm: Skip uffd-wp-mremap if userfaultfd not available
selftests/mm/uffd: Rename nr_cpus -> nr_parallel
selftests/mm: Print some details when uffd-stress gets bad params
selftests/mm: Don't fail uffd-stress if too many CPUs
selftests/mm: Skip map_populate on weird filesystems
selftests/mm: Skip gup_longterm tests on weird filesystems
selftests/mm: Drop unnecessary sudo usage
selftests/mm: Ensure uffd-wp-mremap gets pages of each size
selftests/mm: Skip mlock tests if nobody user can't read it
selftests/mm/mlock: Print error on failure
tools/testing/selftests/mm/gup_longterm.c | 45 +++++++++++++++++---------
tools/testing/selftests/mm/map_populate.c | 7 ++++
tools/testing/selftests/mm/mlock-random-test.c | 4 +--
tools/testing/selftests/mm/mlock2.h | 8 ++++-
tools/testing/selftests/mm/run_vmtests.sh | 27 ++++++++++++++--
tools/testing/selftests/mm/uffd-common.c | 8 ++---
tools/testing/selftests/mm/uffd-common.h | 2 +-
tools/testing/selftests/mm/uffd-stress.c | 42 +++++++++++++++---------
tools/testing/selftests/mm/uffd-unit-tests.c | 2 +-
tools/testing/selftests/mm/uffd-wp-mremap.c | 5 ++-
10 files changed, 105 insertions(+), 45 deletions(-)
---
base-commit: dcb38e6757f1b7944af9347ce6b54263d3666478
change-id: 20250220-mm-selftests-2d7d0542face
Best regards,
--
Brendan Jackman <jackmanb(a)google.com>
The MAC address on the backup slave should be converted from the
Solicited-Node Multicast address, not from the bonding unicast target address.
v4: no change, just repost.
v3: also fix the mac setting for slave_set_ns_maddr. (Jay)
Add function description for slave_set_ns_maddr/slave_set_ns_maddrs (Jay)
v2: fix patch 01's subject
Hangbin Liu (2):
bonding: fix incorrect MAC address setting to receive NS messages
selftests: bonding: fix incorrect mac address
drivers/net/bonding/bond_options.c | 55 ++++++++++++++++---
.../drivers/net/bonding/bond_options.sh | 4 +-
2 files changed, 49 insertions(+), 10 deletions(-)
--
2.46.0
Hello,
While trying to implement an eBPF gatekeeper program, we ran into an
issue where the LSM hooks are missing some relevant data.
Certain subcommands passed to the bpf() syscall can be invoked from
either the kernel or userspace. Additionally, some fields in the
bpf_attr struct contain pointers, and depending on where the
subcommand was invoked, they could point to either user or kernel
memory. One example of this is the bpf_prog_load subcommand and its
fd_array. This data is made available and used by the verifier but not
made available to the LSM subsystem. This patchset simply exposes that
information to applicable LSM hooks.
Change list:
- v6 -> v7
- use gettid/pid in lieu of getpid/tgid in test condition
- v5 -> v6
- fix regression caused by is_kernel renaming
- simplify test logic
- v4 -> v5
- merge v4 selftest breakout patch back into a single patch
- change "is_kernel" to "kernel"
- add selftest using new kernel flag
- v3 -> v4
- split out selftest changes into a separate patch
- v2 -> v3
- reorder params so that the new boolean flag is the last param
- fixup function signatures in bpf selftests
- v1 -> v2
- Pass a boolean flag in lieu of bpfptr_t
Revisions:
- v6
https://lore.kernel.org/bpf/20250308013314.719150-1-bboscaccy@linux.microso…
- v5
https://lore.kernel.org/bpf/20250307213651.3065714-1-bboscaccy@linux.micros…
- v4
https://lore.kernel.org/bpf/20250304203123.3935371-1-bboscaccy@linux.micros…
- v3
https://lore.kernel.org/bpf/20250303222416.3909228-1-bboscaccy@linux.micros…
- v2
https://lore.kernel.org/bpf/20250228165322.3121535-1-bboscaccy@linux.micros…
- v1
https://lore.kernel.org/bpf/20250226003055.1654837-1-bboscaccy@linux.micros…
Blaise Boscaccy (2):
security: Propagate caller information in bpf hooks
selftests/bpf: Add a kernel flag test for LSM bpf hook
include/linux/lsm_hook_defs.h | 6 +--
include/linux/security.h | 12 +++---
kernel/bpf/syscall.c | 10 ++---
security/security.c | 15 ++++---
security/selinux/hooks.c | 6 +--
.../selftests/bpf/prog_tests/kernel_flag.c | 43 +++++++++++++++++++
.../selftests/bpf/progs/rcu_read_lock.c | 3 +-
.../bpf/progs/test_cgroup1_hierarchy.c | 4 +-
.../selftests/bpf/progs/test_kernel_flag.c | 28 ++++++++++++
.../bpf/progs/test_kfunc_dynptr_param.c | 6 +--
.../selftests/bpf/progs/test_lookup_key.c | 2 +-
.../selftests/bpf/progs/test_ptr_untrusted.c | 2 +-
.../bpf/progs/test_task_under_cgroup.c | 2 +-
.../bpf/progs/test_verify_pkcs7_sig.c | 2 +-
14 files changed, 108 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/kernel_flag.c
create mode 100644 tools/testing/selftests/bpf/progs/test_kernel_flag.c
--
2.48.1
A task in the kernel (task_mm_cid_work) runs somewhat periodically to
compact the mm_cid for each process. Add a test to validate that it runs
correctly and in a timely manner.
The test spawns 1 thread pinned to each CPU, then each thread, including
the main one, runs in short bursts for some time. During this period, the
mm_cids should span all numbers between 0 and nproc.
At the end of this phase, a thread with a high enough mm_cid (>= nproc/2)
is selected to be the new leader, and all other threads terminate.
After some time, the only running thread should see 0 as its mm_cid. If that
doesn't happen, the compaction mechanism didn't work and the test fails.
The test never fails if only 1 core is available, in which case we
cannot test anything as the only available mm_cid is 0.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers(a)efficios.com>
Signed-off-by: Gabriele Monaco <gmonaco(a)redhat.com>
---
tools/testing/selftests/rseq/.gitignore | 1 +
tools/testing/selftests/rseq/Makefile | 2 +-
.../selftests/rseq/mm_cid_compaction_test.c | 200 ++++++++++++++++++
3 files changed, 202 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/rseq/mm_cid_compaction_test.c
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
index 16496de5f6ce4..2c89f97e4f737 100644
--- a/tools/testing/selftests/rseq/.gitignore
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -3,6 +3,7 @@ basic_percpu_ops_test
basic_percpu_ops_mm_cid_test
basic_test
basic_rseq_op_test
+mm_cid_compaction_test
param_test
param_test_benchmark
param_test_compare_twice
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 5a3432fceb586..ce1b38f46a355 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -16,7 +16,7 @@ OVERRIDE_TARGETS = 1
TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test param_test \
param_test_benchmark param_test_compare_twice param_test_mm_cid \
- param_test_mm_cid_benchmark param_test_mm_cid_compare_twice
+ param_test_mm_cid_benchmark param_test_mm_cid_compare_twice mm_cid_compaction_test
TEST_GEN_PROGS_EXTENDED = librseq.so
diff --git a/tools/testing/selftests/rseq/mm_cid_compaction_test.c b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
new file mode 100644
index 0000000000000..7ddde3b657dd6
--- /dev/null
+++ b/tools/testing/selftests/rseq/mm_cid_compaction_test.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: LGPL-2.1
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "../kselftest.h"
+#include "rseq.h"
+
+#define VERBOSE 0
+#define printf_verbose(fmt, ...) \
+ do { \
+ if (VERBOSE) \
+ printf(fmt, ##__VA_ARGS__); \
+ } while (0)
+
+/* 0.5 s */
+#define RUNNER_PERIOD 500000
+/* Number of runs before we terminate or get the token */
+#define THREAD_RUNS 5
+
+/*
+ * Number of times we check that the mm_cid were compacted.
+ * Checks are repeated every RUNNER_PERIOD.
+ */
+#define MM_CID_COMPACT_TIMEOUT 10
+
+struct thread_args {
+ int cpu;
+ int num_cpus;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ pthread_t *tinfo;
+ struct thread_args *args_head;
+};
+
+static void __noreturn *thread_runner(void *arg)
+{
+ struct thread_args *args = arg;
+ int i, ret, curr_mm_cid;
+ cpu_set_t cpumask;
+
+ CPU_ZERO(&cpumask);
+ CPU_SET(args->cpu, &cpumask);
+ ret = pthread_setaffinity_np(pthread_self(), sizeof(cpumask), &cpumask);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to set affinity");
+ abort();
+ }
+ pthread_barrier_wait(args->barrier);
+
+ for (i = 0; i < THREAD_RUNS; i++)
+ usleep(RUNNER_PERIOD);
+ curr_mm_cid = rseq_current_mm_cid();
+ /*
+ * We select one thread with high enough mm_cid to be the new leader.
+ * All other threads (including the main thread) will terminate.
+ * After some time, the mm_cid of the only remaining thread should
+ * converge to 0, if not, the test fails.
+ */
+ if (curr_mm_cid >= args->num_cpus / 2 &&
+ !pthread_mutex_trylock(args->token)) {
+ printf_verbose(
+ "cpu%d has mm_cid=%d and will be the new leader.\n",
+ sched_getcpu(), curr_mm_cid);
+ for (i = 0; i < args->num_cpus; i++) {
+ if (args->tinfo[i] == pthread_self())
+ continue;
+ ret = pthread_join(args->tinfo[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to join thread");
+ abort();
+ }
+ }
+ pthread_barrier_destroy(args->barrier);
+ free(args->tinfo);
+ free(args->token);
+ free(args->barrier);
+ free(args->args_head);
+
+ for (i = 0; i < MM_CID_COMPACT_TIMEOUT; i++) {
+ curr_mm_cid = rseq_current_mm_cid();
+ printf_verbose("run %d: mm_cid=%d on cpu%d.\n", i,
+ curr_mm_cid, sched_getcpu());
+ if (curr_mm_cid == 0)
+ exit(EXIT_SUCCESS);
+ usleep(RUNNER_PERIOD);
+ }
+ exit(EXIT_FAILURE);
+ }
+ printf_verbose("cpu%d has mm_cid=%d and is going to terminate.\n",
+ sched_getcpu(), curr_mm_cid);
+ pthread_exit(NULL);
+}
+
+int test_mm_cid_compaction(void)
+{
+ cpu_set_t affinity;
+ int i, j, ret = 0, num_threads;
+ pthread_t *tinfo;
+ pthread_mutex_t *token;
+ pthread_barrier_t *barrier;
+ struct thread_args *args;
+
+ sched_getaffinity(0, sizeof(affinity), &affinity);
+ num_threads = CPU_COUNT(&affinity);
+ tinfo = calloc(num_threads, sizeof(*tinfo));
+ if (!tinfo) {
+ perror("Error: failed to allocate tinfo");
+ return -1;
+ }
+ args = calloc(num_threads, sizeof(*args));
+ if (!args) {
+ perror("Error: failed to allocate args");
+ ret = -1;
+ goto out_free_tinfo;
+ }
+ token = malloc(sizeof(*token));
+ if (!token) {
+ perror("Error: failed to allocate token");
+ ret = -1;
+ goto out_free_args;
+ }
+ barrier = malloc(sizeof(*barrier));
+ if (!barrier) {
+ perror("Error: failed to allocate barrier");
+ ret = -1;
+ goto out_free_token;
+ }
+ if (num_threads == 1) {
+ fprintf(stderr, "Cannot test on a single cpu. "
+ "Skipping mm_cid_compaction test.\n");
+ /* only skipping the test, this is not a failure */
+ goto out_free_barrier;
+ }
+ pthread_mutex_init(token, NULL);
+ ret = pthread_barrier_init(barrier, NULL, num_threads);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to initialise barrier");
+ goto out_free_barrier;
+ }
+ for (i = 0, j = 0; i < CPU_SETSIZE && j < num_threads; i++) {
+ if (!CPU_ISSET(i, &affinity))
+ continue;
+ args[j].num_cpus = num_threads;
+ args[j].tinfo = tinfo;
+ args[j].token = token;
+ args[j].barrier = barrier;
+ args[j].cpu = i;
+ args[j].args_head = args;
+ if (!j) {
+ /* The first thread is the main one */
+ tinfo[0] = pthread_self();
+ ++j;
+ continue;
+ }
+ ret = pthread_create(&tinfo[j], NULL, thread_runner, &args[j]);
+ if (ret) {
+ errno = ret;
+ perror("Error: failed to create thread");
+ abort();
+ }
+ ++j;
+ }
+ printf_verbose("Started %d threads.\n", num_threads);
+
+ /* Also main thread will terminate if it is not selected as leader */
+ thread_runner(&args[0]);
+
+ /* only reached in case of errors */
+out_free_barrier:
+ free(barrier);
+out_free_token:
+ free(token);
+out_free_args:
+ free(args);
+out_free_tinfo:
+ free(tinfo);
+
+ return ret;
+}
+
+int main(int argc, char **argv)
+{
+ if (!rseq_mm_cid_available()) {
+ fprintf(stderr, "Error: rseq_mm_cid unavailable\n");
+ return -1;
+ }
+ if (test_mm_cid_compaction())
+ return -1;
+ return 0;
+}
--
2.48.1
Replace commas between expressions with semicolons.
Using a ',' in place of a ';' can have unintended side effects.
Although that is not the case here, it seems best to use ';'
unless ',' is intended.
Found by inspection.
No functional change intended.
Compile tested only.
Signed-off-by: Chen Ni <nichen(a)iscas.ac.cn>
---
tools/testing/selftests/bpf/prog_tests/fd_array.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/fd_array.c b/tools/testing/selftests/bpf/prog_tests/fd_array.c
index a1d52e73fb16..9add890c2d37 100644
--- a/tools/testing/selftests/bpf/prog_tests/fd_array.c
+++ b/tools/testing/selftests/bpf/prog_tests/fd_array.c
@@ -83,8 +83,8 @@ static inline int bpf_prog_get_map_ids(int prog_fd, __u32 *nr_map_ids, __u32 *ma
int err;
memset(&info, 0, len);
- info.nr_map_ids = *nr_map_ids,
- info.map_ids = ptr_to_u64(map_ids),
+ info.nr_map_ids = *nr_map_ids;
+ info.map_ids = ptr_to_u64(map_ids);
err = bpf_prog_get_info_by_fd(prog_fd, &info, &len);
if (!ASSERT_OK(err, "bpf_prog_get_info_by_fd"))
--
2.25.1
Notable changes since v20:
* removed newline at the end of message in NL_SET_ERR_MSG_FMT_MOD
* dropped udp_init() and related build_protos() and directly use
.encap_destroy [instead of overriding sk->close()]
* deferred peer_del() call to worker in case of transport errors, as
we may be in non-sleepable context
* used kfree() instead of kfree_rcu() when releasing socket as we
just invoked synchronize_rcu()
* fix comment in ovpn_tcp_parse()
* invoked peer->tcp.sk_cb.prot->release_cb instead of explicitly
calling tcp_release_cb. This way we don't need to export it again
* moved call to peer_put() after call to peer->tcp.sk_cb.prot->close
* moved switch(mode) inside ovpn_peers_free()
* simplified skip logic in ovpn_peers_free()
* fixed check to avoid passing a negative delay to keepalive's
schedule_work()
Please note that some patches were already reviewed/tested by a few
people. These patches have retained the tags as they have hardly been
touched.
The latest code can also be found at:
https://github.com/OpenVPN/ovpn-net-next
Thanks a lot!
Best Regards,
Antonio Quartulli
OpenVPN Inc.
To: netdev(a)vger.kernel.org
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Paolo Abeni <pabeni(a)redhat.com>
To: Donald Hunter <donald.hunter(a)gmail.com>
To: Antonio Quartulli <antonio(a)openvpn.net>
To: Shuah Khan <shuah(a)kernel.org>
To: sd(a)queasysnail.net
To: ryazanov.s.a(a)gmail.com
To: Andrew Lunn <andrew+netdev(a)lunn.ch>
Cc: Simon Horman <horms(a)kernel.org>
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: Xiao Liang <shaw.leon(a)gmail.com>
Signed-off-by: Antonio Quartulli <antonio(a)openvpn.net>
---
Antonio Quartulli (24):
net: introduce OpenVPN Data Channel Offload (ovpn)
ovpn: add basic netlink support
ovpn: add basic interface creation/destruction/management routines
ovpn: keep carrier always on for MP interfaces
ovpn: introduce the ovpn_peer object
ovpn: introduce the ovpn_socket object
ovpn: implement basic TX path (UDP)
ovpn: implement basic RX path (UDP)
ovpn: implement packet processing
ovpn: store tunnel and transport statistics
ovpn: implement TCP transport
skb: implement skb_send_sock_locked_with_flags()
ovpn: add support for MSG_NOSIGNAL in tcp_sendmsg
ovpn: implement multi-peer support
ovpn: implement peer lookup logic
ovpn: implement keepalive mechanism
ovpn: add support for updating local UDP endpoint
ovpn: add support for peer floating
ovpn: implement peer add/get/dump/delete via netlink
ovpn: implement key add/get/del/swap via netlink
ovpn: kill key and notify userspace in case of IV exhaustion
ovpn: notify userspace when a peer is deleted
ovpn: add basic ethtool support
testing/selftests: add test tool and scripts for ovpn module
Documentation/netlink/specs/ovpn.yaml | 371 +++
Documentation/netlink/specs/rt_link.yaml | 16 +
MAINTAINERS | 11 +
drivers/net/Kconfig | 15 +
drivers/net/Makefile | 1 +
drivers/net/ovpn/Makefile | 22 +
drivers/net/ovpn/bind.c | 55 +
drivers/net/ovpn/bind.h | 101 +
drivers/net/ovpn/crypto.c | 211 ++
drivers/net/ovpn/crypto.h | 145 ++
drivers/net/ovpn/crypto_aead.c | 408 ++++
drivers/net/ovpn/crypto_aead.h | 33 +
drivers/net/ovpn/io.c | 462 ++++
drivers/net/ovpn/io.h | 34 +
drivers/net/ovpn/main.c | 339 +++
drivers/net/ovpn/main.h | 14 +
drivers/net/ovpn/netlink-gen.c | 213 ++
drivers/net/ovpn/netlink-gen.h | 41 +
drivers/net/ovpn/netlink.c | 1249 ++++++++++
drivers/net/ovpn/netlink.h | 18 +
drivers/net/ovpn/ovpnpriv.h | 57 +
drivers/net/ovpn/peer.c | 1352 +++++++++++
drivers/net/ovpn/peer.h | 163 ++
drivers/net/ovpn/pktid.c | 129 ++
drivers/net/ovpn/pktid.h | 87 +
drivers/net/ovpn/proto.h | 118 +
drivers/net/ovpn/skb.h | 61 +
drivers/net/ovpn/socket.c | 244 ++
drivers/net/ovpn/socket.h | 49 +
drivers/net/ovpn/stats.c | 21 +
drivers/net/ovpn/stats.h | 47 +
drivers/net/ovpn/tcp.c | 592 +++++
drivers/net/ovpn/tcp.h | 36 +
drivers/net/ovpn/udp.c | 442 ++++
drivers/net/ovpn/udp.h | 25 +
include/linux/skbuff.h | 2 +
include/uapi/linux/if_link.h | 15 +
include/uapi/linux/ovpn.h | 110 +
include/uapi/linux/udp.h | 1 +
net/core/skbuff.c | 18 +-
net/ipv6/af_inet6.c | 1 +
net/ipv6/udp.c | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/net/ovpn/.gitignore | 2 +
tools/testing/selftests/net/ovpn/Makefile | 31 +
tools/testing/selftests/net/ovpn/common.sh | 92 +
tools/testing/selftests/net/ovpn/config | 10 +
tools/testing/selftests/net/ovpn/data64.key | 5 +
tools/testing/selftests/net/ovpn/ovpn-cli.c | 2395 ++++++++++++++++++++
tools/testing/selftests/net/ovpn/tcp_peers.txt | 5 +
.../testing/selftests/net/ovpn/test-chachapoly.sh | 9 +
.../selftests/net/ovpn/test-close-socket-tcp.sh | 9 +
.../selftests/net/ovpn/test-close-socket.sh | 45 +
tools/testing/selftests/net/ovpn/test-float.sh | 9 +
tools/testing/selftests/net/ovpn/test-tcp.sh | 9 +
tools/testing/selftests/net/ovpn/test.sh | 113 +
tools/testing/selftests/net/ovpn/udp_peers.txt | 5 +
57 files changed, 10065 insertions(+), 5 deletions(-)
---
base-commit: fb05579a176f7bccc8d279665fc0e46dfed43dfb
change-id: 20250304-b4-ovpn-tmp-153379e78603
Best regards,
--
Antonio Quartulli <antonio(a)openvpn.net>
┌────────────┐ ┌───────────────────────────────────┐ ┌────────────────┐
│ │ │ │ │ │
│ │ │ PCI Endpoint │ │ PCI Host │
│ │ │ │ │ │
│ │◄──┤ 1.platform_msi_domain_alloc_irqs()│ │ │
│ │ │ │ │ │
│ MSI ├──►│ 2.write_msi_msg() ├──►├─BAR<n> │
│ Controller │ │ update doorbell register address│ │ │
│ │ │ for BAR │ │ │
│ │ │ │ │ 3. Write BAR<n>│
│ │◄──┼───────────────────────────────────┼───┤ │
│ │ │ │ │ │
│ ├──►│ 4.Irq Handle │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
└────────────┘ └───────────────────────────────────┘ └────────────────┘
This series is based on the older https://lore.kernel.org/imx/20221124055036.1630573-1-Frank.Li@nxp.com/
The original patch only targeted the vntb driver, but the method is
actually a common one.
This series adds a new API to pci-epf-core, so any EP driver can use it.
The previous v2 discussion is here:
https://lore.kernel.org/imx/20230911220920.1817033-1-Frank.Li@nxp.com/
Changes in v15:
- rebase to v6.14-rc1
- fix build issue found by kernel test robot
- Link to v14: https://lore.kernel.org/r/20250207-ep-msi-v14-0-9671b136f2b8@nxp.com
Changes in v14:
Marc Zyngier raised concerns about adding DOMAIN_BUS_DEVICE_PCI_EP_MSI. As
a result, the approach has been reverted to the v9 method. However, there
are several improvements:
MSI now supports msi-map in addition to msi-parent.
- The struct device: id is used as the endpoint function (EPF) device
identity to map to the stream ID (sideband information).
- The EPC device tree source (DTS) utilizes msi-map to provide such
information.
- The EPF device's of_node is set to the EPC controller’s node. This
approach is commonly used for multi-function device (MFD) platform child
devices, allowing them to inherit properties from the MFD device’s DTS,
such as reset-cells and gpio-cells. This method is well-suited for the
current case, as the EPF is inherently created for/bound to the EPC and
should inherit the EPC’s DTS node properties.
Additionally:
Since the basic IMX95 LUT support has already been merged into the
mainline, a DTS and driver increment patch is added to complete the
solution. The patch is rebased onto the latest linux-next tree and
aligned with the new pcitest framework.
- Link to v13: https://lore.kernel.org/r/20241218-ep-msi-v13-0-646e2192dc24@nxp.com
Changes in v13:
- Change to use DOMAIN_BUS_PCI_DEVICE_EP_MSI
- Change request id as func | vfunc << 3
- Remove IRQ_DOMAIN_MSI_IMMUTABLE
Thomas Gleixner:
I hope I captured all the points in your review comments. If I missed any, let me know.
- Link to v12: https://lore.kernel.org/r/20241211-ep-msi-v12-0-33d4532fa520@nxp.com
Changes in v12:
- Change to use IRQ_DOMAIN_MSI_IMMUTABLE and add helper function
irq_domain_msi_is_immuatble().
- split PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check to 3 patches
- Link to v11: https://lore.kernel.org/r/20241209-ep-msi-v11-0-7434fa8397bd@nxp.com
Changes in v11:
- Change to use MSI_FLAG_MSG_IMMUTABLE
- Link to v10: https://lore.kernel.org/r/20241204-ep-msi-v10-0-87c378dbcd6d@nxp.com
Changes in v10:
Thomas Gleixner:
There are big changes in pci-ep-msi.c. I am not sure whether I am on the
correct path. The key improvement is removing the limitation to
single-function devices.
I use a new patch for the immutable check, which is an additional
feature relative to the base enablement patch.
- Remove patch Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Add new patch irqchip/gic-v3-its: Avoid overwriting msi_prepare callback if provided by msi_domain_info
- Remove the limitation of supporting only 1 endpoint function.
- Create one MSI domain for each endpoint function device.
- Use "msi-map" in the PCI EP controller node instead of msi-parent. The first
argument is
(func_no << 8 | vfunc_no)
- Link to v9: https://lore.kernel.org/r/20241203-ep-msi-v9-0-a60dbc3f15dd@nxp.com
Changes in v9:
- Add patch platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
- Remove patch PCI: endpoint: Add pci_epc_get_fn() API for customizable filtering
- Remove API pci_epf_align_inbound_addr_lo_hi
- Move doorbell_alloc into the doorbell_enable function.
- Link to v8: https://lore.kernel.org/r/20241116-ep-msi-v8-0-6f1f68ffd1bb@nxp.com
Changes in v8:
- update helper function name to pci_epf_align_inbound_addr()
- Link to v7: https://lore.kernel.org/r/20241114-ep-msi-v7-0-d4ac7aafbd2c@nxp.com
Changes in v7:
- Add helper function pci_epf_align_addr();
- Link to v6: https://lore.kernel.org/r/20241112-ep-msi-v6-0-45f9722e3c2a@nxp.com
Changes in v6:
- change doorbell_addr to doorbell_offset
- use round_down()
- add Niklas's Tested-by tag
- rebase to pci/endpoint
- Link to v5: https://lore.kernel.org/r/20241108-ep-msi-v5-0-a14951c0d007@nxp.com
Changes in v5:
- Move request_irq to the epf test function driver for more flexible use cases
- Add fixed size bar handler
- Some minor improvements; see each patch's changelog.
- Link to v4: https://lore.kernel.org/r/20241031-ep-msi-v4-0-717da2d99b28@nxp.com
Changes in v4:
- Remove patch genirq/msi: Add cleanup guard define for msi_lock_descs()/msi_unlock_descs()
- Use a new method to avoid compatibility problems.
Add new commands DOORBELL_ENABLE and DOORBELL_DISABLE.
pcitest -B sends DOORBELL_ENABLE first, and the EP test function driver
tries to remap one of BAR_N (except the test register BAR) to the ITS MSI
MMIO space. An old driver doesn't support the new command, so it returns
failure with no side effect.
After the test, the DOORBELL_DISABLE command is sent to restore the
original mapping, so the pcitest BAR test can pass as normal.
- For other detailed changes, see each patch's changelog
- Link to v3: https://lore.kernel.org/r/20241015-ep-msi-v3-0-cedc89a16c1a@nxp.com
Change from v2 to v3
- Addressed Manivannan's comments
- Move common part to pci-ep-msi.c and pci-ep-msi.h
- rebase to 6.12-rc1
- use RevID to distinguish old version
mkdir /sys/kernel/config/pci_ep/functions/pci_epf_test/func1
echo 16 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/msi_interrupts
echo 0x080c > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/deviceid
echo 0x1957 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/vendorid
echo 1 > /sys/kernel/config/pci_ep/functions/pci_epf_test/func1/revid
^^^^^^ to enable platform msi support.
ln -s /sys/kernel/config/pci_ep/functions/pci_epf_test/func1 /sys/kernel/config/pci_ep/controllers/4c380000.pcie-ep
- use a new device ID, which identifies doorbell support, to avoid breaking
compatibility.
Enable doorbell support only for PCI_DEVICE_ID_IMX8_DB, while other devices
keep the same behavior as before.
EP side                  RC with old driver    RC with new driver
PCI_DEVICE_ID_IMX8_DB    no probe              doorbell enabled
Other device ID          doorbell disabled*    doorbell disabled*
* Behavior remains unchanged.
Change from v1 to v2
- Add missed patch for endpoint/pci-epf-test.c
- Move alloc and free to epc driver from epf.
- Provide general helper function for EPC driver to alloc platform msi irq.
- Addressed Manivannan's comments.
Signed-off-by: Frank Li <Frank.Li(a)nxp.com>
---
Frank Li (15):
platform-msi: Add msi_remove_device_irq_domain() in platform_device_msi_free_irqs_all()
irqdomain: Add IRQ_DOMAIN_FLAG_MSI_IMMUTABLE and irq_domain_is_msi_immutable()
irqchip/gic-v3-its: Set IRQ_DOMAIN_FLAG_MSI_IMMUTABLE for ITS
irqchip/gic-v3-its: Add support for device tree msi-map and msi-mask
PCI: endpoint: Set ID and of_node for function driver
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: endpoint: pci-ep-msi: Add MSI address/data pair mutable check
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for address alignment
PCI: endpoint: pci-epf-test: Add doorbell test support
misc: pci_endpoint_test: Add doorbell test case
selftests: pci_endpoint: Add doorbell test case
pci: imx6: Add helper function imx_pcie_add_lut_by_rid()
pci: imx6: Add LUT setting for MSI/IOMMU in Endpoint mode
arm64: dts: imx95: Add msi-map for pci-ep device
arm64: dts: imx95-19x19-evk: Add PCIe1 endpoint function overlay file
arch/arm64/boot/dts/freescale/Makefile | 3 +
.../dts/freescale/imx95-19x19-evk-pcie1-ep.dtso | 21 ++++
arch/arm64/boot/dts/freescale/imx95.dtsi | 1 +
drivers/base/platform-msi.c | 1 +
drivers/irqchip/irq-gic-v3-its-msi-parent.c | 8 ++
drivers/irqchip/irq-gic-v3-its.c | 2 +-
drivers/misc/pci_endpoint_test.c | 81 +++++++++++++
drivers/pci/controller/dwc/pci-imx6.c | 25 ++--
drivers/pci/endpoint/Makefile | 1 +
drivers/pci/endpoint/functions/pci-epf-test.c | 132 +++++++++++++++++++++
drivers/pci/endpoint/pci-ep-msi.c | 90 ++++++++++++++
drivers/pci/endpoint/pci-epf-core.c | 48 ++++++++
include/linux/irqdomain.h | 7 ++
include/linux/pci-ep-msi.h | 28 +++++
include/linux/pci-epf.h | 21 ++++
include/uapi/linux/pcitest.h | 1 +
.../selftests/pci_endpoint/pci_endpoint_test.c | 25 ++++
17 files changed, 486 insertions(+), 9 deletions(-)
---
base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b
change-id: 20241010-ep-msi-8b4cab33b1be
Best regards,
---
Frank Li <Frank.Li(a)nxp.com>
The first patch fixes the setting of an incorrect skb->truesize.
When an xdp-mb prog returns XDP_PASS, an skb is allocated and initialized.
Currently, the truesize is calculated as BNXT_RX_PAGE_SIZE *
sinfo->nr_frags, but sinfo->nr_frags is cleared by napi_build_skb().
So, the patch stores sinfo->nr_frags before calling napi_build_skb() and
then uses the stored value to calculate truesize.
The second patch fixes a kernel panic in bnxt_queue_mem_alloc().
bnxt_queue_mem_alloc() accesses the rx ring descriptors.
The rx ring descriptors are allocated when the interface is brought up and
freed when the interface is brought down.
So, if bnxt_queue_mem_alloc() is called while the interface is down, a
kernel panic occurs.
This patch makes bnxt_queue_mem_alloc() return -ENETDOWN if the rx ring
descriptors are not allocated.
The third patch fixes a kernel panic in bnxt_queue_{start | stop}().
When a queue is restarted, bnxt_queue_{start | stop}() are called.
These functions set the MRU to 0 to stop packet flow and then set up the
remaining things.
The MRU variable is a member of vnic_info[]; the first vnic_info is the
default one and the second is for ntuple.
The first vnic_info is always allocated when the interface is up, but the
second is allocated only when ntuple is enabled
(ethtool -K eth0 ntuple <on | off>).
Currently, bnxt_queue_{start | stop}() access
vnic_info[BNXT_VNIC_NTUPLE] regardless of whether ntuple is enabled,
so a kernel panic occurs when it is not.
This patch makes bnxt_queue_{start | stop}() use bp->nr_vnics instead
of BNXT_VNIC_NTUPLE.
The fourth patch fixes a warning caused by the checksum state.
bnxt_rx_pkt() checks whether skb->ip_summed is not CHECKSUM_NONE
before updating ip_summed. If ip_summed is not CHECKSUM_NONE, it WARNs
about it. However, bnxt_xdp_build_skb() is called in the XDP-MB-PASS
path and updates ip_summed earlier than bnxt_rx_pkt().
So, in the XDP-MB-PASS path, bnxt_rx_pkt() always warns about the
checksum.
Updating ip_summed in bnxt_xdp_build_skb() is unnecessary and
redundant, so it is removed.
The fifth patch fixes a kernel panic in
bnxt_get_queue_stats{rx | tx}().
The bnxt_get_queue_stats{rx | tx}() callback functions are called when
a queue is resetting.
They access the rx and tx rings without a NULL check, but the rings
are allocated and initialized only when the interface is up.
So, if these functions are called while the interface is down, a
kernel panic occurs.
The sixth patch fixes a memory leak in the queue reset logic.
When a queue is reset, tpa_info is allocated for the new queue and
the tpa_info of the old queue is no longer used.
So it should be freed, but currently it is not.
The seventh patch makes net_devmem_unbind_dmabuf() ignore -ENETDOWN.
When a devmem socket is closed, net_devmem_unbind_dmabuf() is called to
unbind/release resources.
If the interface is down, the driver returns -ENETDOWN.
The -ENETDOWN return value is not an actual error, because the interface
releases its resources when it goes down.
So, net_devmem_unbind_dmabuf() needs to ignore -ENETDOWN.
The last patch adds XDP testcases to
tools/testing/selftests/drivers/net/ping.py.
v3:
- Copy nr_frags instead of full copy. (1/8)
- Add Review tags from Somnath. (3/8)
- Add new patch for fixing kernel panic in the
bnxt_get_queue_stats{rx | tx}(). (5/8)
- Add new patch for fixing memory leak in queue reset. (6/8)
v2:
- Do not use num_frags in the bnxt_xdp_build_skb(). (1/6)
- Add Review tags from Somnath and Jakub. (2/6)
- Add new patch for fixing checksum warning. (4/6)
- Add new patch for fixing warning in net_devmem_unbind_dmabuf(). (5/6)
- Add new XDP testcases to ping.py (6/6)
Taehee Yoo (8):
eth: bnxt: fix truesize for mb-xdp-pass case
eth: bnxt: return fail if interface is down in bnxt_queue_mem_alloc()
eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue
restart logic
eth: bnxt: do not update checksum in bnxt_xdp_build_skb()
eth: bnxt: fix kernel panic in the bnxt_get_queue_stats{rx | tx}
eth: bnxt: fix memory leak in queue reset
net: devmem: do not WARN conditionally after netdev_rx_queue_restart()
selftests: drv-net: add xdp cases for ping.py
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 25 ++-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 13 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h | 3 +-
net/core/devmem.c | 4 +-
tools/testing/selftests/drivers/net/ping.py | 200 ++++++++++++++++--
.../testing/selftests/net/lib/xdp_dummy.bpf.c | 6 +
6 files changed, 220 insertions(+), 31 deletions(-)
--
2.34.1
Hello,
While trying to implement an eBPF gatekeeper program, we ran into an
issue where the LSM hooks are missing some relevant data.
Certain subcommands passed to the bpf() syscall can be invoked from
either the kernel or userspace. Additionally, some fields in the
bpf_attr struct contain pointers, and depending on where the
subcommand was invoked, they could point to either user or kernel
memory. One example of this is the bpf_prog_load subcommand and its
fd_array. This data is made available and used by the verifier but not
made available to the LSM subsystem. This patchset simply exposes that
information to applicable LSM hooks.
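To illustrate why the origin of the call matters to a hook, here is a minimal userspace sketch. Everything in it is an assumption made for illustration: struct attr_model, gatekeeper_prog_load_model() and the policy itself are hypothetical and do not reproduce the real hook signatures.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified model of bpf_attr: only the field discussed. */
struct attr_model {
	const int *fd_array;	/* may refer to user or kernel memory */
};

/* Hypothetical gatekeeper hook. The trailing 'kernel' flag tells the
 * hook where the request originated, and therefore which address space
 * the pointers inside 'attr' belong to. */
static int gatekeeper_prog_load_model(const struct attr_model *attr,
				      bool kernel)
{
	/* Policy assumption for this sketch: kernel-originated loads
	 * are trusted. */
	if (kernel)
		return 0;

	/* Userspace-originated: the pointer is a user address and must
	 * not be dereferenced directly. This sketch simply denies loads
	 * that carry an fd_array. */
	return attr->fd_array ? -EPERM : 0;
}
```

Without the flag, the hook could not tell the two cases apart and would have to treat every pointer the same way.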
Change list:
- v5 -> v6
- fix regression caused by is_kernel renaming
- simplify test logic
- v4 -> v5
- merge v4 selftest breakout patch back into a single patch
- change "is_kernel" to "kernel"
- add selftest using new kernel flag
- v3 -> v4
- split out selftest changes into a separate patch
- v2 -> v3
- reorder params so that the new boolean flag is the last param
- fixup function signatures in bpf selftests
- v1 -> v2
- Pass a boolean flag in lieu of bpfptr_t
Revisions:
- v5
https://lore.kernel.org/bpf/20250307213651.3065714-1-bboscaccy@linux.micros…
- v4
https://lore.kernel.org/bpf/20250304203123.3935371-1-bboscaccy@linux.micros…
- v3
https://lore.kernel.org/bpf/20250303222416.3909228-1-bboscaccy@linux.micros…
- v2
https://lore.kernel.org/bpf/20250228165322.3121535-1-bboscaccy@linux.micros…
- v1
https://lore.kernel.org/bpf/20250226003055.1654837-1-bboscaccy@linux.micros…
Blaise Boscaccy (2):
security: Propagate caller information in bpf hooks
selftests/bpf: Add a kernel flag test for LSM bpf hook
include/linux/lsm_hook_defs.h | 6 +--
include/linux/security.h | 12 +++---
kernel/bpf/syscall.c | 10 ++---
security/security.c | 15 ++++---
security/selinux/hooks.c | 6 +--
.../selftests/bpf/prog_tests/kernel_flag.c | 43 +++++++++++++++++++
.../selftests/bpf/progs/rcu_read_lock.c | 3 +-
.../bpf/progs/test_cgroup1_hierarchy.c | 4 +-
.../selftests/bpf/progs/test_kernel_flag.c | 28 ++++++++++++
.../bpf/progs/test_kfunc_dynptr_param.c | 6 +--
.../selftests/bpf/progs/test_lookup_key.c | 2 +-
.../selftests/bpf/progs/test_ptr_untrusted.c | 2 +-
.../bpf/progs/test_task_under_cgroup.c | 2 +-
.../bpf/progs/test_verify_pkcs7_sig.c | 2 +-
14 files changed, 108 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/kernel_flag.c
create mode 100644 tools/testing/selftests/bpf/progs/test_kernel_flag.c
--
2.48.1
When an after-split folio is large and needs to be dropped due to EOF,
folio_put_refs(folio, folio_nr_pages(folio)) should be used to drop
all page cache refs. Otherwise, the folio will not be freed, causing
a memory leak.
This leak happens when a truncate is performed on a filesystem with
blocksize > page_size, where the blocksize makes folios split into
folios of order > 0, so the truncated folios are never freed.
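The accounting can be modelled in a few lines of userspace C. This is a toy sketch, not kernel code: struct folio_model and put_refs_model() are hypothetical, and it assumes the page cache holds one reference per page of the folio.

```c
/* Toy model of a folio: only the page count and the reference count. */
struct folio_model {
	unsigned int nr_pages;
	int refcount;
};

/* Models dropping 'refs' references at once.
 * Returns 1 if the folio would be freed (refcount reached zero). */
static int put_refs_model(struct folio_model *f, int refs)
{
	f->refcount -= refs;
	return f->refcount == 0;
}
```

For an order-2 folio holding four page cache references, dropping a single reference leaves three behind and the folio is never freed; dropping nr_pages references frees it.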
Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages")
Reported-by: Hugh Dickins <hughd(a)google.com>
Closes: https://lore.kernel.org/all/fcbadb7f-dd3e-21df-f9a7-2853b53183c4@google.com/
Cc: stable(a)vger.kernel.org
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
---
mm/huge_memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3d3ebdc002d5..373781b21e5c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3304,7 +3304,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
folio_account_cleaned(tail,
inode_to_wb(folio->mapping->host));
__filemap_remove_folio(tail, NULL);
- folio_put(tail);
+ folio_put_refs(tail, folio_nr_pages(tail));
} else if (!folio_test_anon(folio)) {
__xa_store(&folio->mapping->i_pages, tail->index,
tail, 0);
--
2.47.2
Hi all,
This patchset adds a new buddy allocator like (or non-uniform) large folio
split from an order-n folio to order-m with m < n. Compared with a uniform
split, it:
1. reduces the total number of after-split folios from 2^(n-m) to n-m+1;
2. reduces the amount of memory needed for multi-index xarray split from
2^(n/6-m/6) to n/6-m/6, assuming XA_CHUNK_SHIFT=6;
3. keeps more large folios after a split: instead of all order-m folios,
the result is order-(n-1) to order-m folios.
For example, to split an order-9 to order-0, folio split generates 10
(or 11 for anonymous memory) folios instead of 512, allocates 1 xa_node
instead of 8, and leaves 1 order-8, 1 order-7, ..., 1 order-1 and 2 order-0
folios (or 4 order-0 for anonymous memory) instead of 512 order-0 folios.
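The folio counts quoted above can be checked with a short C sketch. The function names are hypothetical, and the extra folio for anonymous memory is an assumption derived from the order-9 example (anon folios cannot stay at order-1, so that folio becomes two order-0 folios).

```c
/* Folios produced by a uniform split of order-n to order-m: 2^(n-m). */
static unsigned long uniform_split_folios(unsigned int n, unsigned int m)
{
	return 1UL << (n - m);
}

/* Folios produced by a non-uniform split of order-n to order-m: n-m+1.
 * Assumption from the example: for anonymous memory split down to
 * order-0, the would-be order-1 folio is split into two order-0
 * folios, adding one more folio. */
static unsigned long nonuniform_split_folios(unsigned int n, unsigned int m,
					     int anon)
{
	unsigned long count = (unsigned long)(n - m) + 1;

	if (anon && m == 0 && n >= 2)
		count += 1;
	return count;
}
```

For n = 9 and m = 0 this gives 512 folios for the uniform split, and 10 (pagecache) or 11 (anon) for the non-uniform split.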
Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split()
can support both uniform split and buddy allocator like (or non-uniform) split.
All existing split_huge_page*() users can be gradually converted to use
folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
xfstests quick group passed for both tmpfs and xfs.
It is on top of mm-everything-2025-02-26-03-56 with V8 reverted. It is ready to
be merged.
Changelog
===
From V8[11]:
1. Removed gfp parameter from xas_try_split() and GFP_NOWAIT is used all
the time. (per Baolin Wang)
2. Used __xas_init_node_for_split() instead of
__xas_alloc_node_for_split() and moved node allocation out. It fixed
a bug when xa_node is pre-allocated by xas_nomem() before
xas_try_split() is called without being initialized for split.
From V7[9]:
1. Fixed a wrong function name in lib/test_xarray.c.
2. Made __split_folio_to_order() never fail, since the old order check
is already done in __folio_split(). (per David Hildenbrand)
3. Fixed an issue reported by syzbot[10] by not dropping the original
folio during truncate.
4. Fixed a WARNING when READ_ONLY_THP_FOR_FS is enabled. (Thanks to David
Hildenbrand for reporting the issue)
5. Used two separate struct page* parameters, split_at and lock_at, to
specify at which subpage the non-uniform split happens and which subpage
to keep locked after the split, respectively. It improves code
readability.
From V6[8]:
1. Added an xarray function xas_try_split() to support iterative folio split,
removing the need to use xas_split_alloc() and xas_split(). The
function guarantees that at most one xa_node is allocated for each
call.
2. Added concrete numbers of after-split folios and xa_node savings to
cover letter, commit log. (per Andrew)
From V5[7]:
1. The patches for splitting shmem to any lower order are in the mm tree,
so they were dropped from this series.
2. Renamed split_folio_at() to try_folio_split() to clarify that
non-uniform split will not be used if it is not supported.
From V4[6]:
1. Enabled shmem support in both uniform and buddy allocator like split
and added selftests for it.
2. Added functions to check if uniform split and buddy allocator like
split are supported for the given folio and order.
3. Made truncate fall back to uniform split if buddy allocator split is
not supported (CONFIG_READ_ONLY_THP_FOR_FS and FS without large folio).
4. Added the missing folio_clear_has_hwpoisoned() to
__split_unmapped_folio().
From V3[5]:
1. Used xas_split_alloc(GFP_NOWAIT) instead of xas_nomem(), since extra
operations inside xas_split_alloc() are needed for correctness.
2. Enabled folio_split() for shmem and no issue was found with xfstests
quick test group.
3. Split both ends of a truncate range in truncate_inode_partial_folio()
to avoid wasting memory in shmem truncate (per David Hildenbrand).
4. Removed page_in_folio_offset() since page_folio() does the same
thing.
5. Finished truncate related tests from xfstests quick test group on XFS and
tmpfs without issues.
6. Disabled buddy allocator like split on CONFIG_READ_ONLY_THP_FOR_FS
and FS without large folio. This check was missed in the prior
versions.
From V2[3]:
1. Incorporated all the feedback from Kirill[4].
2. Used GFP_NOWAIT for xas_nomem().
3. Tested the code path when xas_nomem() fails.
4. Added selftests for folio_split().
5. Fixed no THP config build error.
From V1[2]:
1. Split the original patch 1 into multiple ones for easy review (per
Kirill).
2. Added xas_destroy() to avoid memory leak.
3. Fixed nr_dropped not used error (per kernel test robot).
4. Added proper error handling when xas_nomem() fails to allocate memory
for xas_split() during buddy allocator like split.
From RFC[1]:
1. Merged backend code of split_huge_page_to_list_to_order() and
folio_split(). The same code is used for both uniform split and buddy
allocator like split.
2. Use xas_nomem() instead of xas_split_alloc() for folio_split().
3. folio_split() now leaves the first after-split folio unlocked,
instead of the one containing the given page, since
the caller of truncate_inode_partial_folio() locks and unlocks the
first folio.
4. Extended split_huge_page debugfs to use folio_split().
5. Added truncate_inode_partial_folio() as first user of folio_split().
Design
===
folio_split() splits a large folio in the same way as the buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if a user wants to free the
3rd subpage in an order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon
O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is pagecache
The two layouts differ because anon folios do not support order-1 yet.
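The layout for the example above can be reproduced with a small userspace model. This is a sketch of the halving idea only, not the kernel implementation: all names are hypothetical, pages are plain indices, and for anon it assumes new_order is not 1.

```c
#include <stddef.h>

/* Record one resulting folio order; returns the new count. */
static size_t emit_order(unsigned int *orders, size_t n, size_t cap,
			 unsigned int order)
{
	if (n < cap)
		orders[n] = order;
	return n + 1;
}

/* Halve the folio at [start, start + 2^order) until the piece holding
 * page_idx reaches new_order. Halves that do not hold the page are
 * kept whole, except that an anon order-1 half becomes two order-0. */
static size_t split_rec(unsigned long start, unsigned int order,
			unsigned int new_order, unsigned long page_idx,
			int anon, unsigned int *orders, size_t n, size_t cap)
{
	unsigned int half = order - 1;
	int i;

	if (order == new_order)
		return emit_order(orders, n, cap, order);

	for (i = 0; i < 2; i++) {
		unsigned long hs = start + ((unsigned long)i << half);
		int has_page = page_idx >= hs && page_idx < hs + (1UL << half);

		if (has_page) {
			n = split_rec(hs, half, new_order, page_idx, anon,
				      orders, n, cap);
		} else if (anon && half == 1) {
			n = emit_order(orders, n, cap, 0);
			n = emit_order(orders, n, cap, 0);
		} else {
			n = emit_order(orders, n, cap, half);
		}
	}
	return n;
}

/* Fill 'orders' left to right by page offset; returns the folio count. */
static size_t nonuniform_split_layout(unsigned int old_order,
				      unsigned int new_order,
				      unsigned long page_idx, int anon,
				      unsigned int *orders, size_t cap)
{
	return split_rec(0, old_order, new_order, page_idx, anon,
			 orders, 0, cap);
}
```

Calling nonuniform_split_layout(9, 0, 2, anon, orders, 16) models freeing the 3rd subpage (index 2) of an order-9 folio; it yields 10 folios for pagecache and 11 for anon.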
The split process is similar to the existing approach:
1. Unmap all page mappings (split PMD mappings if they exist);
2. Split meta data like memcg, page owner, page alloc tag;
3. Copy meta data in struct folio to sub pages, but instead of splitting
the whole folio into multiple smaller ones with the same order in one
shot, this approach splits the folio iteratively. Taking the example
above, this approach first splits the original order-9 into two order-8,
then splits the left order-8 into two order-7 and so on;
4. Post-process split folios, like write mapping->i_pages for pagecache,
adjust folio refcounts, add split folios to corresponding list;
5. Remap split folios;
6. Unlock split folios.
__split_unmapped_folio() and __split_folio_to_order() replace
__split_huge_page() and __split_huge_page_tail() respectively.
__split_unmapped_folio() uses different approaches to perform
uniform split and buddy allocator like split:
1. uniform split: a single call to __split_folio_to_order() is used to
uniformly split the given folio. All resulting folios are put back on
the list after the split. The folio containing the given page is left for
the caller to unlock; the others are unlocked.
2. buddy allocator like (or non-uniform) split: (old_order - new_order) calls
to __split_folio_to_order() are used, each splitting the target folio from
order N to order N-1. After each call, the target folio is changed to the one
containing the page, which is given as a folio_split() parameter, and the
folios not containing the page are put back on the list.
The folio containing the page is put back on the list when its order
reaches new_order. All folios are unlocked except the first folio, which
is left for the caller to unlock.
Patch Overview
===
1. Patch 1 added a new xarray function xas_try_split() to perform
iterative xarray split.
2. Patch 2 added __split_unmapped_folio() and __split_folio_to_order() to
prepare for moving to new backend split code.
3. Patch 3 moved common code in split_huge_page_to_list_to_order() to
__folio_split().
4. Patch 4 added new folio_split() and made
split_huge_page_to_list_to_order() share the new
__split_unmapped_folio() with folio_split().
5. Patch 5 removed no longer used __split_huge_page() and
__split_huge_page_tail().
6. Patch 6 added a new in_folio_offset to split_huge_page debugfs for
folio_split() test.
7. Patch 7 used try_folio_split() for truncate operation.
8. Patch 8 added folio_split() tests.
Any comments and/or suggestions are welcome. Thanks.
[1] https://lore.kernel.org/linux-mm/20241008223748.555845-1-ziy@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20241028180932.1319265-1-ziy@nvidia.com/
[3] https://lore.kernel.org/linux-mm/20241101150357.1752726-1-ziy@nvidia.com/
[4] https://lore.kernel.org/linux-mm/e6ppwz5t4p4kvir6eqzoto4y5fmdjdxdyvxvtw43nc…
[5] https://lore.kernel.org/linux-mm/20241205001839.2582020-1-ziy@nvidia.com/
[6] https://lore.kernel.org/linux-mm/20250106165513.104899-1-ziy@nvidia.com/
[7] https://lore.kernel.org/linux-mm/20250116211042.741543-1-ziy@nvidia.com/
[8] https://lore.kernel.org/linux-mm/20250205031417.1771278-1-ziy@nvidia.com/
[9] https://lore.kernel.org/linux-mm/20250211155034.268962-1-ziy@nvidia.com/
[10] https://lore.kernel.org/all/67af65cb.050a0220.21dd3.004a.GAE@google.com/
[11] https://lore.kernel.org/linux-mm/20250218235012.1542225-1-ziy@nvidia.com/
Zi Yan (8):
xarray: add xas_try_split() to split a multi-index entry
mm/huge_memory: add two new (not yet used) functions for folio_split()
mm/huge_memory: move folio split common code to __folio_split()
mm/huge_memory: add buddy allocator like (non-uniform) folio_split()
mm/huge_memory: remove the old, unused __split_huge_page()
mm/huge_memory: add folio_split() to debugfs testing interface
mm/truncate: use buddy allocator like folio split for truncate
operation
selftests/mm: add tests for folio_split(), buddy allocator like split
Documentation/core-api/xarray.rst | 14 +-
include/linux/huge_mm.h | 36 +
include/linux/xarray.h | 6 +
lib/test_xarray.c | 52 ++
lib/xarray.c | 131 ++-
mm/huge_memory.c | 755 ++++++++++++------
mm/truncate.c | 31 +-
tools/testing/radix-tree/Makefile | 1 +
.../selftests/mm/split_huge_page_test.c | 34 +-
9 files changed, 783 insertions(+), 277 deletions(-)
--
2.47.2