Hi everyone!
This is my first take on the Rust abstractions for the DRM subsystem. It includes the abstractions themselves, some minor prerequisite changes to the C side, as well as the drm-asahi GPU driver (for reference on how the abstractions are used, but not necessarily intended to land together).
These patches apply on top of the tree at [1], which is based on 6.3-rc1 with a large number of Rust abstraction/support commits added on top. Most of these are not prerequisites for the DRM abstractions themselves, but only for the driver.
* #1-12 introduce the abstractions, module by module, with minor C changes before the dependent abstraction.
* Patch 10 is a little addition to drm_sched that I ended up needing, but I can pull it out of the abstraction into its own patch if needed.
* #13-14 add a minor feature to drm/gem and its abstraction used by the driver.
* #15-16 introduce the (unstable) asahi UAPI. This is obviously not ready for merge yet, but comments are welcome!
* #17 adds a Rust helper macro to handle GPU core/firmware differences. This probably belongs in the driver at this point, but right now it has to live in rust/macros since there is no mechanism for per-driver proc macros.
* #18 adds the driver proper, in one big commit, for reference purposes.
I've been working since mid last year on an Apple AGX GPU driver for Linux, using the (at the time) out-of-tree Rust support. As part of this effort, I've been writing safe Rust abstractions for portions of the DRM subsystem.
Now that Rust itself is upstream, I'd like to get all the abstractions upstreamed so we can eventually get the driver upstreamed!
These abstractions have been used by the driver since our release in December [2], in a simpler synchronous-submission form:
* drm::ioctl
* drm::device
* drm::drv
* drm::file
* drm::{gem, gem::shmem}
* drm::mm
This series also adds the following abstractions, which are used by the explicit sync refactor of the driver (the version in this series):
* drm::syncobj
* drm::sched
* dma_fence
The major dependencies for the DRM abstractions themselves are:
* [3] rust: error: Add missing wrappers to convert to/from kernel error codes
* [4] rust: Miscellaneous macro improvements
* [5] rust: Add a Sealed trait
* [6] rust: device: Add a minimal RawDevice trait
* [7] rust: Enable the new_uninit feature for kernel and driver crates
* [8] rust: ioctl: Add ioctl number manipulation functions
* [9] rust: sync: Arc: Any downcasting and assume_init()
* rust: Add `container_of` and `offset_of` macros
* kernel::sync::mutex and dependencies
Most of these (the ones with links) have already been submitted, and I expect all of them to land for 6.4 (the mutex one will likely be last, since there is some refactoring that will happen over the current state to make it more ergonomic to use). The mutex dep is only necessary for drm::mm and dma_fence, and transitively drm::syncobj and drm::sched.
Things work! We've had most of the abstractions in production edge kernels with the driver, and the new explicit sync stuff has passed quite a few torture tests (this is how we found the drm_sched issue, patch 11).
The abstractions are intended to be safe (safety review very welcome!). While writing them, I tried to avoid making any changes to the C side unless absolutely necessary. I understand that it will probably make sense to adjust the C side to make some things easier, but I wanted to start from this as a baseline.
Known issues:
- The existing Rust integration does not currently allow building abstractions as modules, so the Rust abstractions are only available for DRM components that are built in. I added some extra Kconfig symbols to deal with this, so a driver built as a module can depend on having those built in. This should go away in the future (but may not be ready in time for submission... I understand this probably shouldn't be a blocker though?).
- DRM relies heavily on the "subclassing" pattern for driver objects, which doesn't map well to Rust. I tried several approaches for various bits, so we can see how they work out. In particular, I'd love to discuss whether wrapper types should pretend to be smart pointers and Deref to their inner driver-specific types, and whether they should be marked as method receivers (yuck, internal rustc implementation hacks! But Arc<T> already does the same thing, and it makes it possible to use them as `self` in driver-implemented callbacks) ^^.
- Only what I need for my driver is implemented (plus a small amount of obvious extras where better API completeness makes sense). I think the general idea with Rust abstractions is that we add things as they become necessary.
- The plain GEM vs. GEM-shmem duality ended up with quite a hairy type hierarchy. I'd love to figure out how to make this simpler...
- drm::mm ends up requiring a built-in mutex in the abstraction, instead of delegating that to the user with the usual Rust mutability rules. This is because nodes can be dropped at any time, and those operations need to be synchronized. We could try to avoid forbidding those drops or mark the node type !Send, but that would make it a lot less ergonomic to use...
I'm looking for feedback on the abstractions of all kinds, so we can move towards an upstreamable version. Optimistically, I'd love to get this upstream for 6.5, and the driver for 6.6.
Please feel free to ask any questions about the Rust bits, since I know a lot of this is new to many of the C folks!
This is a fairly complete driver for Apple AGX G13 and G14 series GPUs.
The driver today supports the Apple M1, M1 Pro, M1 Max, M1 Ultra, and M2 SoCs, across two firmware revisions each. It has an explicit sync UAPI heavily inspired by the upcoming Intel Xe UAPI, designed with Vulkan support in mind. On the Mesa side we currently have a Gallium driver that is mostly already upstream (missing the UAPI bits mostly) and passes the dEQP GLES2/EGL tests, with most of GLES3.0 passing in downstream work-in-progress branches. This is a reverse engineered community driver (we have no hardware documentation of any kind, other than some hints from aspects shared with PowerVR).
While developing the driver, I tried to make use of Rust's safety and lifetime features to provide not just CPU-side safety, but also partial firmware-ABI safety. Thanks to this, it has turned out to be a very stable driver even though GPU firmware crashes are fatal (no restart capability, need to reboot!) and the FW/driver interface is a huge mess of unsafe shared memory structures with complex pointer chains. There are over 70 ABI types and 3000+ lines of firmware ABI type definitions that vary between firmware builds and GPU cores...
In a simpler blocking-submission form, it has been shipping in Asahi Linux edge kernels since December [2], with lots of users and zero (!) reported oopses (and only a couple reports of GPU firmware crashes, though that issue should now be fixed). It has survived OOM scenarios (Rust makes error cleanup easy!), UAPI-level fuzzing, countless broken Mesa builds, uptimes of 40+ days, and more.
The explicit sync refactor significantly increases performance (and potential problems), but this version has survived a lot of torture with dEQP/piglit tests and some manual corner case testing.
In other words, Rust works! ^^
There are some design notes on the driver and further links at [10].
[1] https://github.com/AsahiLinux/linux.git drm-rfc-base-20230307
[2] https://asahilinux.org/2022/12/gpu-drivers-now-in-asahi-linux/
[3] https://lore.kernel.org/rust-for-linux/20230224-rust-error-v1-0-f8f9a9a87303...
[4] https://lore.kernel.org/rust-for-linux/20230224-rust-macros-v1-0-b39fae46e10...
[5] https://lore.kernel.org/rust-for-linux/20230224-rust-iopt-rtkit-v1-0-49ced33...
[6] https://lore.kernel.org/rust-for-linux/20230224-rust-iopt-rtkit-v1-0-49ced33...
[7] https://lore.kernel.org/rust-for-linux/CQV7ZNT6LMXI.1XG4YXSH8I7JK@vincent-ar...
[8] https://lore.kernel.org/rust-for-linux/61f734d6-1497-755f-3632-3f261b890846@...
[9] https://lore.kernel.org/rust-for-linux/20230224-rust-arc-v1-0-568eea613a41@a...
[10] https://github.com/AsahiLinux/docs/wiki/SW:AGX-driver-notes
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
Asahi Lina (18):
      rust: drm: ioctl: Add DRM ioctl abstraction
      rust: drm: Add Device and Driver abstractions
      rust: drm: file: Add File abstraction
      rust: drm: gem: Add GEM object abstraction
      drm/gem-shmem: Export VM ops functions
      rust: drm: gem: shmem: Add DRM shmem helper abstraction
      rust: drm: mm: Add DRM MM Range Allocator abstraction
      rust: dma_fence: Add DMA Fence abstraction
      rust: drm: syncobj: Add DRM Sync Object abstraction
      drm/scheduler: Add can_run_job callback
      drm/scheduler: Clean up jobs when the scheduler is torn down
      rust: drm: sched: Add GPU scheduler abstraction
      drm/gem: Add a flag to control whether objects can be exported
      rust: drm: gem: Add set_exportable() method
      drm/asahi: Add the Asahi driver UAPI [DO NOT MERGE]
      rust: bindings: Bind the Asahi DRM UAPI
      rust: macros: Add versions macro
      drm/asahi: Add the Asahi driver for Apple AGX GPUs

 drivers/gpu/drm/Kconfig                |   19 +
 drivers/gpu/drm/Makefile               |    1 +
 drivers/gpu/drm/asahi/Kconfig          |   35 +
 drivers/gpu/drm/asahi/Makefile         |    3 +
 drivers/gpu/drm/asahi/alloc.rs         | 1046 ++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/asahi.rs         |   53 ++
 drivers/gpu/drm/asahi/buffer.rs        |  694 ++++++++++++++++++
 drivers/gpu/drm/asahi/channel.rs       |  542 ++++++++++++++
 drivers/gpu/drm/asahi/debug.rs         |  129 ++++
 drivers/gpu/drm/asahi/driver.rs        |  166 +++++
 drivers/gpu/drm/asahi/event.rs         |  229 ++++++
 drivers/gpu/drm/asahi/file.rs          |  718 ++++++++++++++++++
 drivers/gpu/drm/asahi/float.rs         |  381 ++++++++++
 drivers/gpu/drm/asahi/fw/buffer.rs     |  170 +++++
 drivers/gpu/drm/asahi/fw/channels.rs   |  385 ++++++++++
 drivers/gpu/drm/asahi/fw/compute.rs    |  107 +++
 drivers/gpu/drm/asahi/fw/event.rs      |  100 +++
 drivers/gpu/drm/asahi/fw/fragment.rs   |  276 +++++++
 drivers/gpu/drm/asahi/fw/initdata.rs   | 1264 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/fw/job.rs        |   56 ++
 drivers/gpu/drm/asahi/fw/microseq.rs   |  384 ++++++++++
 drivers/gpu/drm/asahi/fw/mod.rs        |   15 +
 drivers/gpu/drm/asahi/fw/types.rs      |  233 ++++++
 drivers/gpu/drm/asahi/fw/vertex.rs     |  177 +++++
 drivers/gpu/drm/asahi/fw/workqueue.rs  |  168 +++++
 drivers/gpu/drm/asahi/gem.rs           |  301 ++++++++
 drivers/gpu/drm/asahi/gpu.rs           | 1088 +++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/hw/mod.rs        |  522 +++++++++++++
 drivers/gpu/drm/asahi/hw/t600x.rs      |  140 ++++
 drivers/gpu/drm/asahi/hw/t8103.rs      |   80 ++
 drivers/gpu/drm/asahi/hw/t8112.rs      |   82 +++
 drivers/gpu/drm/asahi/initdata.rs      |  777 ++++++++++++++++++++
 drivers/gpu/drm/asahi/mem.rs           |  133 ++++
 drivers/gpu/drm/asahi/microseq.rs      |   61 ++
 drivers/gpu/drm/asahi/mmu.rs           | 1249 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/object.rs        |  704 ++++++++++++++++++
 drivers/gpu/drm/asahi/place.rs         |  343 +++++++++
 drivers/gpu/drm/asahi/queue/common.rs  |   52 ++
 drivers/gpu/drm/asahi/queue/compute.rs |  371 ++++++++++
 drivers/gpu/drm/asahi/queue/mod.rs     |  725 ++++++++++++++++++
 drivers/gpu/drm/asahi/queue/render.rs  | 1173 +++++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/regs.rs          |  387 ++++++++++
 drivers/gpu/drm/asahi/slotalloc.rs     |  292 ++++++++
 drivers/gpu/drm/asahi/util.rs          |   44 ++
 drivers/gpu/drm/asahi/workqueue.rs     |  880 ++++++++++++++++++++
 drivers/gpu/drm/drm_gem.c              |    1 +
 drivers/gpu/drm/drm_gem_shmem_helper.c |    9 +-
 drivers/gpu/drm/drm_prime.c            |    5 +
 drivers/gpu/drm/scheduler/sched_main.c |   37 +-
 include/drm/drm_gem.h                  |    8 +
 include/drm/drm_gem_shmem_helper.h     |    3 +
 include/drm/gpu_scheduler.h            |    8 +
 include/uapi/drm/asahi_drm.h           |  556 ++++++++++++++
 rust/bindings/bindings_helper.h        |   14 +
 rust/helpers.c                         |  168 +++++
 rust/kernel/dma_fence.rs               |  532 ++++++++++++++
 rust/kernel/drm/device.rs              |   76 ++
 rust/kernel/drm/drv.rs                 |  342 +++++++++
 rust/kernel/drm/file.rs                |  113 +++
 rust/kernel/drm/gem/mod.rs             |  384 ++++++++++
 rust/kernel/drm/gem/shmem.rs           |  381 ++++++++++
 rust/kernel/drm/ioctl.rs               |  147 ++++
 rust/kernel/drm/mm.rs                  |  309 ++++++++
 rust/kernel/drm/mod.rs                 |   13 +
 rust/kernel/drm/sched.rs               |  358 +++++++++
 rust/kernel/drm/syncobj.rs             |   77 ++
 rust/kernel/lib.rs                     |    4 +
 rust/macros/lib.rs                     |    7 +
 rust/macros/versions.rs                |  267 +++++++
 69 files changed, 20569 insertions(+), 5 deletions(-)
---
base-commit: c9eb15274c9861026682a6b3e645891fccf88e07
change-id: 20230307-rust-drm-b5af3c2a9e55
Thank you,
~~ Lina
DRM drivers need to be able to declare which driver-specific ioctls they support. This abstraction adds the required types and a helper macro to generate the ioctl definition inside the DRM driver.
Note that this macro is not usable until further bits of the abstraction are in place (but it will not fail to compile on its own, if not called).
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/Kconfig         |   7 ++
 rust/bindings/bindings_helper.h |   2 +
 rust/kernel/drm/ioctl.rs        | 147 ++++++++++++++++++++++++++++++++++++++++
 rust/kernel/drm/mod.rs          |   5 ++
 rust/kernel/lib.rs              |   2 +
 5 files changed, 163 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index dc0f94f02a82..dab8f0f9aa96 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -27,6 +27,13 @@ menuconfig DRM
 	  details. You should also select and configure AGP
 	  (/dev/agpgart) support if it is available for your platform.
 
+# Rust abstractions cannot be built as modules currently, so force them as
+# bool by using these intermediate symbols. In the future these could be
+# tristate once abstractions themselves can be built as modules.
+config RUST_DRM
+	bool "Rust support for the DRM subsystem"
+	depends on DRM=y
+
 config DRM_MIPI_DBI
 	tristate
 	depends on DRM
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 91bb7906ca5a..2687bef1676f 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -6,6 +6,7 @@
  * Sorted alphabetically.
  */
 
+#include <drm/drm_ioctl.h>
 #include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/dma-mapping.h>
@@ -23,6 +24,7 @@
 #include <linux/sysctl.h>
 #include <linux/timekeeping.h>
 #include <linux/xarray.h>
+#include <uapi/drm/drm.h>
 
 /* `bindgen` gets confused at certain things. */
 const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
diff --git a/rust/kernel/drm/ioctl.rs b/rust/kernel/drm/ioctl.rs
new file mode 100644
index 000000000000..10304efbd5f1
--- /dev/null
+++ b/rust/kernel/drm/ioctl.rs
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+#![allow(non_snake_case)]
+
+//! DRM IOCTL definitions.
+//!
+//! C header: [`include/linux/drm/drm_ioctl.h`](../../../../include/linux/drm/drm_ioctl.h)
+
+use crate::ioctl;
+
+const BASE: u32 = bindings::DRM_IOCTL_BASE as u32;
+
+/// Construct a DRM ioctl number with no argument.
+pub const fn IO(nr: u32) -> u32 {
+    ioctl::_IO(BASE, nr)
+}
+
+/// Construct a DRM ioctl number with a read-only argument.
+pub const fn IOR<T>(nr: u32) -> u32 {
+    ioctl::_IOR::<T>(BASE, nr)
+}
+
+/// Construct a DRM ioctl number with a write-only argument.
+pub const fn IOW<T>(nr: u32) -> u32 {
+    ioctl::_IOW::<T>(BASE, nr)
+}
+
+/// Construct a DRM ioctl number with a read-write argument.
+pub const fn IOWR<T>(nr: u32) -> u32 {
+    ioctl::_IOWR::<T>(BASE, nr)
+}
+
+/// Descriptor type for DRM ioctls. Use the `declare_drm_ioctls!{}` macro to construct them.
+pub type DrmIoctlDescriptor = bindings::drm_ioctl_desc;
+
+/// This is for ioctl which are used for rendering, and require that the file descriptor is either
+/// for a render node, or if it’s a legacy/primary node, then it must be authenticated.
+pub const AUTH: u32 = bindings::drm_ioctl_flags_DRM_AUTH;
+
+/// This must be set for any ioctl which can change the modeset or display state. Userspace must
+/// call the ioctl through a primary node, while it is the active master.
+///
+/// Note that read-only modeset ioctl can also be called by unauthenticated clients, or when a
+/// master is not the currently active one.
+pub const MASTER: u32 = bindings::drm_ioctl_flags_DRM_MASTER;
+
+/// Anything that could potentially wreak a master file descriptor needs to have this flag set.
+///
+/// Current that’s only for the SETMASTER and DROPMASTER ioctl, which e.g. logind can call to force
+/// a non-behaving master (display compositor) into compliance.
+///
+/// This is equivalent to callers with the SYSADMIN capability.
+pub const ROOT_ONLY: u32 = bindings::drm_ioctl_flags_DRM_ROOT_ONLY;
+
+/// Whether drm_ioctl_desc.func should be called with the DRM BKL held or not. Enforced as the
+/// default for all modern drivers, hence there should never be a need to set this flag.
+///
+/// Do not use anywhere else than for the VBLANK_WAIT IOCTL, which is the only legacy IOCTL which
+/// needs this.
+pub const UNLOCKED: u32 = bindings::drm_ioctl_flags_DRM_UNLOCKED;
+
+/// This is used for all ioctl needed for rendering only, for drivers which support render nodes.
+/// This should be all new render drivers, and hence it should be always set for any ioctl with
+/// `AUTH` set. Note though that read-only query ioctl might have this set, but have not set
+/// DRM_AUTH because they do not require authentication.
+pub const RENDER_ALLOW: u32 = bindings::drm_ioctl_flags_DRM_RENDER_ALLOW;
+
+/// Declare the DRM ioctls for a driver.
+///
+/// Each entry in the list should have the form:
+///
+/// `(ioctl_number, argument_type, flags, user_callback),`
+///
+/// `argument_type` is the type name within the `bindings` crate.
+/// `user_callback` should have the following prototype:
+///
+/// ```
+/// fn foo(device: &kernel::drm::device::Device<Self>,
+///        data: &mut bindings::argument_type,
+///        file: &kernel::drm::file::File<Self::File>,
+/// )
+/// ```
+/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within.
+///
+/// # Examples
+///
+/// ```
+/// kernel::declare_drm_ioctls! {
+///     (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler),
+/// }
+/// ```
+///
+#[macro_export]
+macro_rules! declare_drm_ioctls {
+    ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
+        const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
+            const _:() = {
+                let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
+                // Assert that all the IOCTLs are in the right order and there are no gaps,
+                // and that the sizeof of the specified type is correct.
+                $(
+                    let cmd: u32 = $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd);
+                    ::core::assert!(i == $crate::ioctl::_IOC_NR(cmd));
+                    ::core::assert!(core::mem::size_of::<$crate::bindings::$struct>() == $crate::ioctl::_IOC_SIZE(cmd));
+                    let i: u32 = i + 1;
+                )*
+            };
+
+            let ioctls = &[$(
+                $crate::bindings::drm_ioctl_desc {
+                    cmd: $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd) as u32,
+                    func: {
+                        #[allow(non_snake_case)]
+                        unsafe extern "C" fn $cmd(
+                                raw_dev: *mut $crate::bindings::drm_device,
+                                raw_data: *mut ::core::ffi::c_void,
+                                raw_file_priv: *mut $crate::bindings::drm_file,
+                        ) -> core::ffi::c_int {
+                            // SAFETY: We never drop this, and the DRM core ensures the device lives
+                            // while callbacks are being called.
+                            //
+                            // FIXME: Currently there is nothing enforcing that the types of the
+                            // dev/file match the current driver these ioctls are being declared
+                            // for, and it's not clear how to enforce this within the type system.
+                            let dev = ::core::mem::ManuallyDrop::new(unsafe {
+                                $crate::drm::device::Device::from_raw(raw_dev)
+                            });
+                            // SAFETY: This is just the ioctl argument, which hopefully has the right type
+                            // (we've done our best checking the size).
+                            let data = unsafe { &mut *(raw_data as *mut $crate::bindings::$struct) };
+                            // SAFETY: This is just the DRM file structure
+                            let file = unsafe { $crate::drm::file::File::from_raw(raw_file_priv) };
+
+                            match $func(&*dev, data, &file) {
+                                Err(e) => e.to_kernel_errno(),
+                                Ok(i) => i.try_into().unwrap_or(ERANGE.to_kernel_errno()),
+                            }
+                        }
+                        Some($cmd)
+                    },
+                    flags: $flags,
+                    name: $crate::c_str!(::core::stringify!($cmd)).as_char_ptr(),
+                }
+            ),*];
+            ioctls
+        };
+    };
+}
diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
new file mode 100644
index 000000000000..9ec6d7cbcaf3
--- /dev/null
+++ b/rust/kernel/drm/mod.rs
@@ -0,0 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+
+//! DRM subsystem abstractions.
+
+pub mod ioctl;
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 7903490816bf..cb23d24c6718 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -37,6 +37,8 @@ mod build_assert;
 pub mod delay;
 pub mod device;
 pub mod driver;
+#[cfg(CONFIG_RUST_DRM)]
+pub mod drm;
 pub mod error;
 pub mod io_buffer;
 pub mod io_mem;
On Tue, Mar 7, 2023 at 3:27 PM Asahi Lina lina@asahilina.net wrote:
DRM drivers need to be able to declare which driver-specific ioctls they support. This abstraction adds the required types and a helper macro to generate the ioctl definition inside the DRM driver.
Note that this macro is not usable until further bits of the abstraction are in place (but it will not fail to compile on its own, if not called).
Signed-off-by: Asahi Lina lina@asahilina.net
drivers/gpu/drm/Kconfig | 7 ++ rust/bindings/bindings_helper.h | 2 + rust/kernel/drm/ioctl.rs | 147 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 5 ++ rust/kernel/lib.rs | 2 + 5 files changed, 163 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index dc0f94f02a82..dab8f0f9aa96 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -27,6 +27,13 @@ menuconfig DRM details. You should also select and configure AGP (/dev/agpgart) support if it is available for your platform.
+# Rust abstractions cannot be built as modules currently, so force them as +# bool by using these intermediate symbols. In the future these could be +# tristate once abstractions themselves can be built as modules. +config RUST_DRM
bool "Rust support for the DRM subsystem"
depends on DRM=y
config DRM_MIPI_DBI tristate depends on DRM diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 91bb7906ca5a..2687bef1676f 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -6,6 +6,7 @@
 * Sorted alphabetically.
 */
+#include <drm/drm_ioctl.h> #include <linux/delay.h> #include <linux/device.h> #include <linux/dma-mapping.h> @@ -23,6 +24,7 @@ #include <linux/sysctl.h> #include <linux/timekeeping.h> #include <linux/xarray.h> +#include <uapi/drm/drm.h>
might make more sense to add this chunk to the patch actually needing it
/* `bindgen` gets confused at certain things. */ const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL; diff --git a/rust/kernel/drm/ioctl.rs b/rust/kernel/drm/ioctl.rs new file mode 100644 index 000000000000..10304efbd5f1 --- /dev/null +++ b/rust/kernel/drm/ioctl.rs @@ -0,0 +1,147 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT +#![allow(non_snake_case)]
+//! DRM IOCTL definitions. +//! +//! C header: [`include/linux/drm/drm_ioctl.h`](../../../../include/linux/drm/drm_ioctl.h)
+use crate::ioctl;
+const BASE: u32 = bindings::DRM_IOCTL_BASE as u32;
+/// Construct a DRM ioctl number with no argument. +pub const fn IO(nr: u32) -> u32 {
- ioctl::_IO(BASE, nr)
+}
+/// Construct a DRM ioctl number with a read-only argument. +pub const fn IOR<T>(nr: u32) -> u32 {
- ioctl::_IOR::<T>(BASE, nr)
+}
+/// Construct a DRM ioctl number with a write-only argument. +pub const fn IOW<T>(nr: u32) -> u32 {
- ioctl::_IOW::<T>(BASE, nr)
+}
+/// Construct a DRM ioctl number with a read-write argument. +pub const fn IOWR<T>(nr: u32) -> u32 {
- ioctl::_IOWR::<T>(BASE, nr)
+}
+/// Descriptor type for DRM ioctls. Use the `declare_drm_ioctls!{}` macro to construct them. +pub type DrmIoctlDescriptor = bindings::drm_ioctl_desc;
+/// This is for ioctl which are used for rendering, and require that the file descriptor is either +/// for a render node, or if it’s a legacy/primary node, then it must be authenticated. +pub const AUTH: u32 = bindings::drm_ioctl_flags_DRM_AUTH;
+/// This must be set for any ioctl which can change the modeset or display state. Userspace must +/// call the ioctl through a primary node, while it is the active master. +/// +/// Note that read-only modeset ioctl can also be called by unauthenticated clients, or when a +/// master is not the currently active one. +pub const MASTER: u32 = bindings::drm_ioctl_flags_DRM_MASTER;
+/// Anything that could potentially wreak a master file descriptor needs to have this flag set. +/// +/// Current that’s only for the SETMASTER and DROPMASTER ioctl, which e.g. logind can call to force +/// a non-behaving master (display compositor) into compliance. +/// +/// This is equivalent to callers with the SYSADMIN capability. +pub const ROOT_ONLY: u32 = bindings::drm_ioctl_flags_DRM_ROOT_ONLY;
+/// Whether drm_ioctl_desc.func should be called with the DRM BKL held or not. Enforced as the +/// default for all modern drivers, hence there should never be a need to set this flag. +/// +/// Do not use anywhere else than for the VBLANK_WAIT IOCTL, which is the only legacy IOCTL which +/// needs this. +pub const UNLOCKED: u32 = bindings::drm_ioctl_flags_DRM_UNLOCKED;
+/// This is used for all ioctl needed for rendering only, for drivers which support render nodes. +/// This should be all new render drivers, and hence it should be always set for any ioctl with +/// `AUTH` set. Note though that read-only query ioctl might have this set, but have not set +/// DRM_AUTH because they do not require authentication. +pub const RENDER_ALLOW: u32 = bindings::drm_ioctl_flags_DRM_RENDER_ALLOW;
+/// Declare the DRM ioctls for a driver. +/// +/// Each entry in the list should have the form: +/// +/// `(ioctl_number, argument_type, flags, user_callback),` +/// +/// `argument_type` is the type name within the `bindings` crate. +/// `user_callback` should have the following prototype: +/// +/// ``` +/// fn foo(device: &kernel::drm::device::Device<Self>, +/// data: &mut bindings::argument_type, +/// file: &kernel::drm::file::FileSelf::File, +/// ) +/// ``` +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within. +/// +/// # Examples +/// +/// ``` +/// kernel::declare_drm_ioctls! { +/// (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler), +/// }
I am wondering.. couldn't we make it a proc_macro and just tag all the functions instead? Though I also see the point of having a central list of all ioctls... Maybe we should have some higher level discussions around on _how_ we want things to look like.
+/// ``` +/// +#[macro_export] +macro_rules! declare_drm_ioctls {
- ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
const _:() = {
let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
// Assert that all the IOCTLs are in the right order and there are no gaps,
// and that the sizeof of the specified type is correct.
$(
let cmd: u32 = $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd);
::core::assert!(i == $crate::ioctl::_IOC_NR(cmd));
::core::assert!(core::mem::size_of::<$crate::bindings::$struct>() == $crate::ioctl::_IOC_SIZE(cmd));
::core::mem::size_of
let i: u32 = i + 1;
)*
};
let ioctls = &[$(
$crate::bindings::drm_ioctl_desc {
cmd: $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd) as u32,
func: {
#[allow(non_snake_case)]
unsafe extern "C" fn $cmd(
raw_dev: *mut $crate::bindings::drm_device,
raw_data: *mut ::core::ffi::c_void,
raw_file_priv: *mut $crate::bindings::drm_file,
) -> core::ffi::c_int {
::core
// SAFETY: We never drop this, and the DRM core ensures the device lives
// while callbacks are being called.
//
// FIXME: Currently there is nothing enforcing that the types of the
// dev/file match the current driver these ioctls are being declared
// for, and it's not clear how to enforce this within the type system.
let dev = ::core::mem::ManuallyDrop::new(unsafe {
$crate::drm::device::Device::from_raw(raw_dev)
});
// SAFETY: This is just the ioctl argument, which hopefully has the right type
// (we've done our best checking the size).
let data = unsafe { &mut *(raw_data as *mut $crate::bindings::$struct) };
// SAFETY: This is just the DRM file structure
let file = unsafe { $crate::drm::file::File::from_raw(raw_file_priv) };
match $func(&*dev, data, &file) {
Err(e) => e.to_kernel_errno(),
Ok(i) => i.try_into().unwrap_or(ERANGE.to_kernel_errno()),
need to specify the namespace on ERANGE, no?
}
}
Some($cmd)
},
flags: $flags,
name: $crate::c_str!(::core::stringify!($cmd)).as_char_ptr(),
}
),*];
ioctls
};
- };
+} diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs new file mode 100644 index 000000000000..9ec6d7cbcaf3 --- /dev/null +++ b/rust/kernel/drm/mod.rs @@ -0,0 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM subsystem abstractions.
+pub mod ioctl; diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs index 7903490816bf..cb23d24c6718 100644 --- a/rust/kernel/lib.rs +++ b/rust/kernel/lib.rs @@ -37,6 +37,8 @@ mod build_assert; pub mod delay; pub mod device; pub mod driver; +#[cfg(CONFIG_RUST_DRM)] +pub mod drm; pub mod error; pub mod io_buffer; pub mod io_mem;
-- 2.35.1
On Tue, Mar 7, 2023 at 3:48 PM Karol Herbst kherbst@redhat.com wrote:
On Tue, Mar 7, 2023 at 3:27 PM Asahi Lina lina@asahilina.net wrote:
DRM drivers need to be able to declare which driver-specific ioctls they support. This abstraction adds the required types and a helper macro to generate the ioctl definition inside the DRM driver.
Note that this macro is not usable until further bits of the abstraction are in place (but it will not fail to compile on its own, if not called).
Signed-off-by: Asahi Lina lina@asahilina.net
drivers/gpu/drm/Kconfig | 7 ++ rust/bindings/bindings_helper.h | 2 + rust/kernel/drm/ioctl.rs | 147 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 5 ++ rust/kernel/lib.rs | 2 + 5 files changed, 163 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index dc0f94f02a82..dab8f0f9aa96 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -27,6 +27,13 @@ menuconfig DRM details. You should also select and configure AGP (/dev/agpgart) support if it is available for your platform.
+# Rust abstractions cannot be built as modules currently, so force them as +# bool by using these intermediate symbols. In the future these could be +# tristate once abstractions themselves can be built as modules. +config RUST_DRM
bool "Rust support for the DRM subsystem"
depends on DRM=y
config DRM_MIPI_DBI tristate depends on DRM diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 91bb7906ca5a..2687bef1676f 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -6,6 +6,7 @@
 * Sorted alphabetically.
 */
+#include <drm/drm_ioctl.h> #include <linux/delay.h> #include <linux/device.h> #include <linux/dma-mapping.h> @@ -23,6 +24,7 @@ #include <linux/sysctl.h> #include <linux/timekeeping.h> #include <linux/xarray.h> +#include <uapi/drm/drm.h>
might make more sense to add this chunk to the patch actually needing it
ehh, ignore this comment please :)
/* `bindgen` gets confused at certain things. */ const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL; diff --git a/rust/kernel/drm/ioctl.rs b/rust/kernel/drm/ioctl.rs new file mode 100644 index 000000000000..10304efbd5f1 --- /dev/null +++ b/rust/kernel/drm/ioctl.rs @@ -0,0 +1,147 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT +#![allow(non_snake_case)]
+//! DRM IOCTL definitions. +//! +//! C header: [`include/linux/drm/drm_ioctl.h`](../../../../include/linux/drm/drm_ioctl.h)
+use crate::ioctl;
+const BASE: u32 = bindings::DRM_IOCTL_BASE as u32;
+/// Construct a DRM ioctl number with no argument. +pub const fn IO(nr: u32) -> u32 {
- ioctl::_IO(BASE, nr)
+}
+/// Construct a DRM ioctl number with a read-only argument. +pub const fn IOR<T>(nr: u32) -> u32 {
- ioctl::_IOR::<T>(BASE, nr)
+}
+/// Construct a DRM ioctl number with a write-only argument. +pub const fn IOW<T>(nr: u32) -> u32 {
+ ioctl::_IOW::<T>(BASE, nr)
+}
+/// Construct a DRM ioctl number with a read-write argument. +pub const fn IOWR<T>(nr: u32) -> u32 {
+ ioctl::_IOWR::<T>(BASE, nr)
+}
+/// Descriptor type for DRM ioctls. Use the `declare_drm_ioctls!{}` macro to construct them. +pub type DrmIoctlDescriptor = bindings::drm_ioctl_desc;
+/// This is for ioctls which are used for rendering and require that the file descriptor be either +/// for a render node or, if it’s a legacy/primary node, authenticated. +pub const AUTH: u32 = bindings::drm_ioctl_flags_DRM_AUTH;
+/// This must be set for any ioctl which can change the modeset or display state. Userspace must +/// call the ioctl through a primary node, while it is the active master. +/// +/// Note that read-only modeset ioctl can also be called by unauthenticated clients, or when a +/// master is not the currently active one. +pub const MASTER: u32 = bindings::drm_ioctl_flags_DRM_MASTER;
+/// Anything that could potentially wreck a master file descriptor needs to have this flag set. +/// +/// Currently that’s only the SETMASTER and DROPMASTER ioctls, which e.g. logind can call to force +/// a misbehaving master (display compositor) into compliance. +/// +/// This is equivalent to callers with the SYSADMIN capability. +pub const ROOT_ONLY: u32 = bindings::drm_ioctl_flags_DRM_ROOT_ONLY;
+/// Whether drm_ioctl_desc.func should be called with the DRM BKL held or not. Enforced as the +/// default for all modern drivers, hence there should never be a need to set this flag. +/// +/// Do not use this for anything other than the VBLANK_WAIT IOCTL, which is the only legacy IOCTL +/// that needs it. +pub const UNLOCKED: u32 = bindings::drm_ioctl_flags_DRM_UNLOCKED;
+/// This is used for all ioctls needed only for rendering, for drivers which support render nodes. +/// This should cover all new render drivers, and hence it should always be set for any ioctl with +/// `AUTH` set. Note though that read-only query ioctls might have this set but not DRM_AUTH, +/// because they do not require authentication. +pub const RENDER_ALLOW: u32 = bindings::drm_ioctl_flags_DRM_RENDER_ALLOW;
+/// Declare the DRM ioctls for a driver. +/// +/// Each entry in the list should have the form: +/// +/// `(ioctl_number, argument_type, flags, user_callback),` +/// +/// `argument_type` is the type name within the `bindings` crate. +/// `user_callback` should have the following prototype: +/// +/// ``` +/// fn foo(device: &kernel::drm::device::Device<Self>, +/// data: &mut bindings::argument_type, +/// file: &kernel::drm::file::File<Self::File>, +/// ) +/// ``` +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within. +/// +/// # Examples +/// +/// ``` +/// kernel::declare_drm_ioctls! { +/// (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler), +/// }
I am wondering.. couldn't we make it a proc_macro and just tag all the functions instead? Though I also see the point of having a central list of all ioctls... Maybe we should have some higher level discussions around on _how_ we want things to look like.
+/// ``` +/// +#[macro_export] +macro_rules! declare_drm_ioctls {
+ ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
const _:() = {
let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
// Assert that all the IOCTLs are in the right order and there are no gaps,
// and that the sizeof of the specified type is correct.
$(
let cmd: u32 = $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd);
::core::assert!(i == $crate::ioctl::_IOC_NR(cmd));
::core::assert!(core::mem::size_of::<$crate::bindings::$struct>() == $crate::ioctl::_IOC_SIZE(cmd));
let i: u32 = i + 1;
)*
};
let ioctls = &[$(
$crate::bindings::drm_ioctl_desc {
cmd: $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd) as u32,
func: {
#[allow(non_snake_case)]
unsafe extern "C" fn $cmd(
raw_dev: *mut $crate::bindings::drm_device,
raw_data: *mut ::core::ffi::c_void,
raw_file_priv: *mut $crate::bindings::drm_file,
) -> core::ffi::c_int {
// SAFETY: We never drop this, and the DRM core ensures the device lives
// while callbacks are being called.
//
// FIXME: Currently there is nothing enforcing that the types of the
// dev/file match the current driver these ioctls are being declared
// for, and it's not clear how to enforce this within the type system.
let dev = ::core::mem::ManuallyDrop::new(unsafe {
$crate::drm::device::Device::from_raw(raw_dev)
});
// SAFETY: This is just the ioctl argument, which hopefully has the right type
// (we've done our best checking the size).
let data = unsafe { &mut *(raw_data as *mut $crate::bindings::$struct) };
// SAFETY: This is just the DRM file structure
let file = unsafe { $crate::drm::file::File::from_raw(raw_file_priv) };
match $func(&*dev, data, &file) {
Err(e) => e.to_kernel_errno(),
Ok(i) => i.try_into().unwrap_or(ERANGE.to_kernel_errno()),
need to specify the namespace on ERANGE, no?
}
}
Some($cmd)
},
flags: $flags,
name: $crate::c_str!(::core::stringify!($cmd)).as_char_ptr(),
}
),*];
ioctls
};
+ };
+} diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs new file mode 100644 index 000000000000..9ec6d7cbcaf3 --- /dev/null +++ b/rust/kernel/drm/mod.rs @@ -0,0 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM subsystem abstractions.
+pub mod ioctl; diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs index 7903490816bf..cb23d24c6718 100644 --- a/rust/kernel/lib.rs +++ b/rust/kernel/lib.rs @@ -37,6 +37,8 @@ mod build_assert; pub mod delay; pub mod device; pub mod driver; +#[cfg(CONFIG_RUST_DRM)] +pub mod drm; pub mod error; pub mod io_buffer; pub mod io_mem;
-- 2.35.1
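[Editorial note] For readers unfamiliar with the `_IOC` encoding that the `IO`/`IOR`/`IOW`/`IOWR` helpers in the patch above wrap: the resulting `u32` packs direction, argument size, type (`'d'` for DRM), and command number into fixed bit fields, which is also what the `_IOC_NR`/`_IOC_SIZE` checks in `declare_drm_ioctls!` rely on. A standalone sketch in plain Rust, assuming the generic asm-generic/ioctl.h layout (8 nr bits, 8 type bits, 14 size bits, 2 dir bits); this is not the kernel crate's actual `ioctl` module:

```rust
// Bit layout from the generic asm-generic/ioctl.h (most architectures).
const IOC_NRSHIFT: u32 = 0;
const IOC_TYPESHIFT: u32 = 8;
const IOC_SIZESHIFT: u32 = 16;
const IOC_DIRSHIFT: u32 = 30;
const IOC_WRITE: u32 = 1;
const IOC_READ: u32 = 2;

// DRM_IOCTL_BASE is the character 'd'.
const DRM_IOCTL_BASE: u32 = b'd' as u32; // 0x64

// Pack direction, type, number, and argument size into one ioctl command.
const fn ioc(dir: u32, ty: u32, nr: u32, size: u32) -> u32 {
    (dir << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT) | (ty << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT)
}

/// Read-write DRM ioctl number, analogous to `IOWR::<T>(nr)` in the patch.
pub const fn drm_iowr<T>(nr: u32) -> u32 {
    ioc(IOC_READ | IOC_WRITE, DRM_IOCTL_BASE, nr, core::mem::size_of::<T>() as u32)
}

/// Extract the command number, as `_IOC_NR` does.
pub const fn ioc_nr(cmd: u32) -> u32 {
    cmd & 0xff
}

/// Extract the argument size, as `_IOC_SIZE` does.
pub const fn ioc_size(cmd: u32) -> u32 {
    (cmd >> IOC_SIZESHIFT) & 0x3fff
}
```

This makes clear why the macro's compile-time asserts work: the command number and argument size are recoverable from the encoded `u32`, so they can be checked against the declaration order and the declared struct's `size_of`.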
On 3/7/23 11:25, Asahi Lina wrote:
DRM drivers need to be able to declare which driver-specific ioctls they support. This abstraction adds the required types and a helper macro to generate the ioctl definition inside the DRM driver.
Note that this macro is not usable until further bits of the abstraction are in place (but it will not fail to compile on its own, if not called).
Signed-off-by: Asahi Lina lina@asahilina.net
drivers/gpu/drm/Kconfig | 7 ++ rust/bindings/bindings_helper.h | 2 + rust/kernel/drm/ioctl.rs | 147 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 5 ++ rust/kernel/lib.rs | 2 + 5 files changed, 163 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index dc0f94f02a82..dab8f0f9aa96 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -27,6 +27,13 @@ menuconfig DRM details. You should also select and configure AGP (/dev/agpgart) support if it is available for your platform.
[...]
+/// Declare the DRM ioctls for a driver. +/// +/// Each entry in the list should have the form: +/// +/// `(ioctl_number, argument_type, flags, user_callback),` +/// +/// `argument_type` is the type name within the `bindings` crate. +/// `user_callback` should have the following prototype: +/// +/// ``` +/// fn foo(device: &kernel::drm::device::Device<Self>, +/// data: &mut bindings::argument_type, +/// file: &kernel::drm::file::FileSelf::File, +/// ) +/// ``` +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within. +/// +/// # Examples +/// +/// ``` +/// kernel::declare_drm_ioctls! { +/// (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler), +/// } +/// ``` +/// +#[macro_export] +macro_rules! declare_drm_ioctls {
- ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
const _:() = {
let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
// Assert that all the IOCTLs are in the right order and there are no gaps,
// and that the sizeof of the specified type is correct.
I believe that not necessarily the IOCTLs need to be in the right order and with no gaps. For example, armada_drm.h has a gap in between 0x00 and 0x02 and exynos_drm.h also have gaps. Moreover, some drivers, like vgem and virtgpu, start their IOCTLs with 0x01.
Best Regards, - Maíra Canal
$(
let cmd: u32 = $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd);
::core::assert!(i == $crate::ioctl::_IOC_NR(cmd));
::core::assert!(core::mem::size_of::<$crate::bindings::$struct>() == $crate::ioctl::_IOC_SIZE(cmd));
let i: u32 = i + 1;
)*
};
let ioctls = &[$(
$crate::bindings::drm_ioctl_desc {
cmd: $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd) as u32,
func: {
#[allow(non_snake_case)]
unsafe extern "C" fn $cmd(
raw_dev: *mut $crate::bindings::drm_device,
raw_data: *mut ::core::ffi::c_void,
raw_file_priv: *mut $crate::bindings::drm_file,
) -> core::ffi::c_int {
// SAFETY: We never drop this, and the DRM core ensures the device lives
// while callbacks are being called.
//
// FIXME: Currently there is nothing enforcing that the types of the
// dev/file match the current driver these ioctls are being declared
// for, and it's not clear how to enforce this within the type system.
let dev = ::core::mem::ManuallyDrop::new(unsafe {
$crate::drm::device::Device::from_raw(raw_dev)
});
// SAFETY: This is just the ioctl argument, which hopefully has the right type
// (we've done our best checking the size).
let data = unsafe { &mut *(raw_data as *mut $crate::bindings::$struct) };
// SAFETY: This is just the DRM file structure
let file = unsafe { $crate::drm::file::File::from_raw(raw_file_priv) };
match $func(&*dev, data, &file) {
Err(e) => e.to_kernel_errno(),
Ok(i) => i.try_into().unwrap_or(ERANGE.to_kernel_errno()),
}
}
Some($cmd)
},
flags: $flags,
name: $crate::c_str!(::core::stringify!($cmd)).as_char_ptr(),
}
),*];
ioctls
};
- };
+} diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs new file mode 100644 index 000000000000..9ec6d7cbcaf3 --- /dev/null +++ b/rust/kernel/drm/mod.rs @@ -0,0 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM subsystem abstractions.
+pub mod ioctl; diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs index 7903490816bf..cb23d24c6718 100644 --- a/rust/kernel/lib.rs +++ b/rust/kernel/lib.rs @@ -37,6 +37,8 @@ mod build_assert; pub mod delay; pub mod device; pub mod driver; +#[cfg(CONFIG_RUST_DRM)] +pub mod drm; pub mod error; pub mod io_buffer; pub mod io_mem;
On 08/03/2023 00.32, Maíra Canal wrote:
On 3/7/23 11:25, Asahi Lina wrote:
DRM drivers need to be able to declare which driver-specific ioctls they support. This abstraction adds the required types and a helper macro to generate the ioctl definition inside the DRM driver.
Note that this macro is not usable until further bits of the abstraction are in place (but it will not fail to compile on its own, if not called).
Signed-off-by: Asahi Lina lina@asahilina.net
drivers/gpu/drm/Kconfig | 7 ++ rust/bindings/bindings_helper.h | 2 + rust/kernel/drm/ioctl.rs | 147 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 5 ++ rust/kernel/lib.rs | 2 + 5 files changed, 163 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index dc0f94f02a82..dab8f0f9aa96 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -27,6 +27,13 @@ menuconfig DRM details. You should also select and configure AGP (/dev/agpgart) support if it is available for your platform.
[...]
+/// Declare the DRM ioctls for a driver. +/// +/// Each entry in the list should have the form: +/// +/// `(ioctl_number, argument_type, flags, user_callback),` +/// +/// `argument_type` is the type name within the `bindings` crate. +/// `user_callback` should have the following prototype: +/// +/// ``` +/// fn foo(device: &kernel::drm::device::Device<Self>, +/// data: &mut bindings::argument_type, +/// file: &kernel::drm::file::FileSelf::File, +/// ) +/// ``` +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within. +/// +/// # Examples +/// +/// ``` +/// kernel::declare_drm_ioctls! { +/// (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler), +/// } +/// ``` +/// +#[macro_export] +macro_rules! declare_drm_ioctls {
- ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
const _:() = {
let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
// Assert that all the IOCTLs are in the right order and there are no gaps,
// and that the sizeof of the specified type is correct.
I believe that not necessarily the IOCTLs need to be in the right order and with no gaps. For example, armada_drm.h has a gap in between 0x00 and 0x02 and exynos_drm.h also have gaps. Moreover, some drivers, like vgem and virtgpu, start their IOCTLs with 0x01.
Yeah, we talked about this a bit... do you have any ideas about how to design this? I think it should be possible with a const function initializing an array entry by entry, we just need a two-pass macro (once to determine the max ioctl number, then again to actually output the implementation).
I'm not sure why drivers would have gaps in the ioctl numbers though... my idea was that new drivers shouldn't need that as far as I can tell (you can't remove APIs after the fact due to UAPI stability guarantees, so as long as you don't have gaps to begin with...). But I guess if we're reimplementing existing drivers in Rust we'll need this... though maybe it makes sense to just say it's not supported and require reimplementations that have holes to just explicitly add dummy ioctls that return EINVAL? We could even provide such a dummy generic ioctl handler on the abstraction side, so drivers just have to add it to the list, or make the macro take a special token that is used for placeholder ioctls that don't exist (which then creates the NULL function pointer that the drm core interprets as invalid)...
Basically I'm not sure if it makes sense to fully support noncontiguous ioctl numbers automagically, or just say drivers need to explicitly list gaps. I'd love to hear the opinion of other DRM folks about this!
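[Editorial note] One possible shape for the placeholder idea above, in illustrative plain Rust (not the kernel macro, and not real DRM types; all names here are made up): a fixed table indexed by ioctl number where holes carry a `None` callback that the dispatcher turns into `-EINVAL`, analogous to the NULL `.func` pointer the DRM core treats as invalid.

```rust
// Hypothetical sketch of a sparse ioctl table with explicit holes.
type IoctlFn = fn(data: &mut [u8]) -> Result<u32, i32>;

const EINVAL: i32 = 22;

#[derive(Clone, Copy)]
pub struct IoctlDesc {
    // `None` marks a gap in the number space (a never-landed or grouped-away nr).
    pub func: Option<IoctlFn>,
}

pub struct IoctlTable<const N: usize> {
    descs: [IoctlDesc; N],
}

impl<const N: usize> IoctlTable<N> {
    /// Build a table of size N from (nr, handler) pairs; unlisted numbers
    /// stay as holes.
    pub fn new(entries: &[(usize, IoctlFn)]) -> Self {
        let mut descs = [IoctlDesc { func: None }; N];
        for &(nr, f) in entries {
            descs[nr] = IoctlDesc { func: Some(f) };
        }
        Self { descs }
    }

    /// Dispatch one ioctl; holes and out-of-range numbers reject with -EINVAL,
    /// mirroring how the DRM core handles a NULL function pointer.
    pub fn dispatch(&self, nr: usize, data: &mut [u8]) -> Result<u32, i32> {
        match self.descs.get(nr).and_then(|d| d.func) {
            Some(f) => f(data),
            None => Err(-EINVAL),
        }
    }
}
```

A two-pass macro as described would only need the first pass to compute `N` (the max declared number plus one) and the second to emit the entries.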
~~ Lina
On Thu, 9 Mar 2023 at 15:32, Asahi Lina lina@asahilina.net wrote:
On 08/03/2023 00.32, Maíra Canal wrote:
On 3/7/23 11:25, Asahi Lina wrote:
DRM drivers need to be able to declare which driver-specific ioctls they support. This abstraction adds the required types and a helper macro to generate the ioctl definition inside the DRM driver.
Note that this macro is not usable until further bits of the abstraction are in place (but it will not fail to compile on its own, if not called).
Signed-off-by: Asahi Lina lina@asahilina.net
drivers/gpu/drm/Kconfig | 7 ++ rust/bindings/bindings_helper.h | 2 + rust/kernel/drm/ioctl.rs | 147 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 5 ++ rust/kernel/lib.rs | 2 + 5 files changed, 163 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index dc0f94f02a82..dab8f0f9aa96 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -27,6 +27,13 @@ menuconfig DRM details. You should also select and configure AGP (/dev/agpgart) support if it is available for your platform.
[...]
+/// Declare the DRM ioctls for a driver. +/// +/// Each entry in the list should have the form: +/// +/// `(ioctl_number, argument_type, flags, user_callback),` +/// +/// `argument_type` is the type name within the `bindings` crate. +/// `user_callback` should have the following prototype: +/// +/// ``` +/// fn foo(device: &kernel::drm::device::Device<Self>, +/// data: &mut bindings::argument_type, +/// file: &kernel::drm::file::FileSelf::File, +/// ) +/// ``` +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within. +/// +/// # Examples +/// +/// ``` +/// kernel::declare_drm_ioctls! { +/// (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler), +/// } +/// ``` +/// +#[macro_export] +macro_rules! declare_drm_ioctls {
- ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
const _:() = {
let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
// Assert that all the IOCTLs are in the right order and there are no gaps,
// and that the sizeof of the specified type is correct.
I believe that not necessarily the IOCTLs need to be in the right order and with no gaps. For example, armada_drm.h has a gap in between 0x00 and 0x02 and exynos_drm.h also have gaps. Moreover, some drivers, like vgem and virtgpu, start their IOCTLs with 0x01.
Yeah, we talked about this a bit... do you have any ideas about how to design this? I think it should be possible with a const function initializing an array entry by entry, we just need a two-pass macro (once to determine the max ioctl number, then again to actually output the implementation).
I'm not sure why drivers would have gaps in the ioctl numbers though... my idea was that new drivers shouldn't need that as far as I can tell (you can't remove APIs after the fact due to UAPI stability guarantees, so as long as you don't have gaps to begin with...). But I guess if we're reimplementing existing drivers in Rust we'll need this... though maybe it makes sense to just say it's not supported and require reimplementations that have holes to just explicitly add dummy ioctls that return EINVAL? We could even provide such a dummy generic ioctl handler on the abstraction side, so drivers just have to add it to the list, or make the macro take a special token that is used for placeholder ioctls that don't exist (which then creates the NULL function pointer that the drm core interprets as invalid)...
I can think of two reason for gaps having appeared:
a) developers wanted to group new uapis at a nice base number. This is never essential it's just makes things easier to read, and allows slotting other ioctls into the gaps later.
b) parallel feature development ends up conflicting then one thread never lands. I've got two-three devs each adding a uAPI, we assign them 0x10, 0x11, 0x12 while they work, then 0x11 never lands because it was a bad idea.
However I think you should be fine enforcing a non-sparse space here unless we want to handle replacing current drivers, as long as it's hard to screw up so you know early.
Dave.
On 3/9/23 03:15, Dave Airlie wrote:
On Thu, 9 Mar 2023 at 15:32, Asahi Lina lina@asahilina.net wrote:
On 08/03/2023 00.32, Maíra Canal wrote:
On 3/7/23 11:25, Asahi Lina wrote:
DRM drivers need to be able to declare which driver-specific ioctls they support. This abstraction adds the required types and a helper macro to generate the ioctl definition inside the DRM driver.
Note that this macro is not usable until further bits of the abstraction are in place (but it will not fail to compile on its own, if not called).
Signed-off-by: Asahi Lina lina@asahilina.net
drivers/gpu/drm/Kconfig | 7 ++ rust/bindings/bindings_helper.h | 2 + rust/kernel/drm/ioctl.rs | 147 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 5 ++ rust/kernel/lib.rs | 2 + 5 files changed, 163 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index dc0f94f02a82..dab8f0f9aa96 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -27,6 +27,13 @@ menuconfig DRM details. You should also select and configure AGP (/dev/agpgart) support if it is available for your platform.
[...]
+/// Declare the DRM ioctls for a driver. +/// +/// Each entry in the list should have the form: +/// +/// `(ioctl_number, argument_type, flags, user_callback),` +/// +/// `argument_type` is the type name within the `bindings` crate. +/// `user_callback` should have the following prototype: +/// +/// ``` +/// fn foo(device: &kernel::drm::device::Device<Self>, +/// data: &mut bindings::argument_type, +/// file: &kernel::drm::file::FileSelf::File, +/// ) +/// ``` +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within. +/// +/// # Examples +/// +/// ``` +/// kernel::declare_drm_ioctls! { +/// (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler), +/// } +/// ``` +/// +#[macro_export] +macro_rules! declare_drm_ioctls {
- ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
const _:() = {
let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
// Assert that all the IOCTLs are in the right order and there are no gaps,
// and that the sizeof of the specified type is correct.
I believe that not necessarily the IOCTLs need to be in the right order and with no gaps. For example, armada_drm.h has a gap in between 0x00 and 0x02 and exynos_drm.h also have gaps. Moreover, some drivers, like vgem and virtgpu, start their IOCTLs with 0x01.
Yeah, we talked about this a bit... do you have any ideas about how to design this? I think it should be possible with a const function initializing an array entry by entry, we just need a two-pass macro (once to determine the max ioctl number, then again to actually output the implementation).
I'm not sure why drivers would have gaps in the ioctl numbers though... my idea was that new drivers shouldn't need that as far as I can tell (you can't remove APIs after the fact due to UAPI stability guarantees, so as long as you don't have gaps to begin with...). But I guess if we're reimplementing existing drivers in Rust we'll need this... though maybe it makes sense to just say it's not supported and require reimplementations that have holes to just explicitly add dummy ioctls that return EINVAL? We could even provide such a dummy generic ioctl handler on the abstraction side, so drivers just have to add it to the list, or make the macro take a special token that is used for placeholder ioctls that don't exist (which then creates the NULL function pointer that the drm core interprets as invalid)...
I can think of two reason for gaps having appeared:
a) developers wanted to group new uapis at a nice base number. This is never essential it's just makes things easier to read, and allows slotting other ioctls into the gaps later.
b) parallel feature development ends up conflicting then one thread never lands. I've got two-three devs each adding a uAPI, we assign them 0x10, 0x11, 0x12 while they work, then 0x11 never lands because it was a bad idea.
However I think you should be fine enforcing a non-sparse space here unless we want to handle replacing current drivers, as long as it's hard to screw up so you know early.
I guess it would be nice to support old UAPIs for cases of reimplementations. Currently, I'm working on a reimplementation of vgem and I ended up having to create a dummy IOCTL to deal with the sparse number space. Although creating dummy IOCTLs works, I don't believe it is a nice practice.
Moreover, I believe that if we keep developing new drivers with Rust, cases (a) and (b) will end up happening, and maybe the Rust abstractions should work like DRM and allow it to happen.
Best Regards, - Maíra Canal
Dave.
------- Original Message ------- On Tuesday, March 7th, 2023 at 15:25, Asahi Lina lina@asahilina.net wrote:
DRM drivers need to be able to declare which driver-specific ioctls they support. This abstraction adds the required types and a helper macro to generate the ioctl definition inside the DRM driver.
Note that this macro is not usable until further bits of the abstraction are in place (but it will not fail to compile on its own, if not called).
Signed-off-by: Asahi Lina lina@asahilina.net
drivers/gpu/drm/Kconfig | 7 ++ rust/bindings/bindings_helper.h | 2 + rust/kernel/drm/ioctl.rs | 147 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 5 ++ rust/kernel/lib.rs | 2 + 5 files changed, 163 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index dc0f94f02a82..dab8f0f9aa96 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -27,6 +27,13 @@ menuconfig DRM details. You should also select and configure AGP (/dev/agpgart) support if it is available for your platform.
+# Rust abstractions cannot be built as modules currently, so force them as +# bool by using these intermediate symbols. In the future these could be +# tristate once abstractions themselves can be built as modules. +config RUST_DRM
- bool "Rust support for the DRM subsystem"
- depends on DRM=y
config DRM_MIPI_DBI tristate depends on DRM diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 91bb7906ca5a..2687bef1676f 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -6,6 +6,7 @@
- Sorted alphabetically.
*/
+#include <drm/drm_ioctl.h> #include <linux/delay.h> #include <linux/device.h> #include <linux/dma-mapping.h> @@ -23,6 +24,7 @@ #include <linux/sysctl.h> #include <linux/timekeeping.h> #include <linux/xarray.h> +#include <uapi/drm/drm.h>
/* `bindgen` gets confused at certain things. */ const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL; diff --git a/rust/kernel/drm/ioctl.rs b/rust/kernel/drm/ioctl.rs new file mode 100644 index 000000000000..10304efbd5f1 --- /dev/null +++ b/rust/kernel/drm/ioctl.rs @@ -0,0 +1,147 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT +#![allow(non_snake_case)]
+//! DRM IOCTL definitions. +//! +//! C header: [`include/linux/drm/drm_ioctl.h`](../../../../include/linux/drm/drm_ioctl.h)
+use crate::ioctl;
+const BASE: u32 = bindings::DRM_IOCTL_BASE as u32;
+/// Construct a DRM ioctl number with no argument. +pub const fn IO(nr: u32) -> u32 {
- ioctl::_IO(BASE, nr)
+}
+/// Construct a DRM ioctl number with a read-only argument. +pub const fn IOR<T>(nr: u32) -> u32 {
- ioctl::_IOR::<T>(BASE, nr)
+}
+/// Construct a DRM ioctl number with a write-only argument. +pub const fn IOW<T>(nr: u32) -> u32 {
- ioctl::_IOW::<T>(BASE, nr)
+}
+/// Construct a DRM ioctl number with a read-write argument. +pub const fn IOWR<T>(nr: u32) -> u32 {
- ioctl::_IOWR::<T>(BASE, nr)
+}
+/// Descriptor type for DRM ioctls. Use the `declare_drm_ioctls!{}` macro to construct them. +pub type DrmIoctlDescriptor = bindings::drm_ioctl_desc;
+/// This is for ioctl which are used for rendering, and require that the file descriptor is either +/// for a render node, or if it’s a legacy/primary node, then it must be authenticated. +pub const AUTH: u32 = bindings::drm_ioctl_flags_DRM_AUTH;
+/// This must be set for any ioctl which can change the modeset or display state. Userspace must +/// call the ioctl through a primary node, while it is the active master. +/// +/// Note that read-only modeset ioctl can also be called by unauthenticated clients, or when a +/// master is not the currently active one. +pub const MASTER: u32 = bindings::drm_ioctl_flags_DRM_MASTER;
+/// Anything that could potentially wreak a master file descriptor needs to have this flag set. +/// +/// Current that’s only for the SETMASTER and DROPMASTER ioctl, which e.g. logind can call to force +/// a non-behaving master (display compositor) into compliance. +/// +/// This is equivalent to callers with the SYSADMIN capability. +pub const ROOT_ONLY: u32 = bindings::drm_ioctl_flags_DRM_ROOT_ONLY;
+/// Whether drm_ioctl_desc.func should be called with the DRM BKL held or not. Enforced as the +/// default for all modern drivers, hence there should never be a need to set this flag. +/// +/// Do not use anywhere else than for the VBLANK_WAIT IOCTL, which is the only legacy IOCTL which +/// needs this. +pub const UNLOCKED: u32 = bindings::drm_ioctl_flags_DRM_UNLOCKED;
+/// This is used for all ioctl needed for rendering only, for drivers which support render nodes. +/// This should be all new render drivers, and hence it should be always set for any ioctl with +/// `AUTH` set. Note though that read-only query ioctl might have this set, but have not set +/// DRM_AUTH because they do not require authentication. +pub const RENDER_ALLOW: u32 = bindings::drm_ioctl_flags_DRM_RENDER_ALLOW;
+/// Declare the DRM ioctls for a driver. +/// +/// Each entry in the list should have the form: +/// +/// `(ioctl_number, argument_type, flags, user_callback),` +/// +/// `argument_type` is the type name within the `bindings` crate. +/// `user_callback` should have the following prototype: +/// +/// ``` +/// fn foo(device: &kernel::drm::device::Device<Self>, +/// data: &mut bindings::argument_type, +/// file: &kernel::drm::file::FileSelf::File, +/// ) +/// ``` +/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within. +/// +/// # Examples +/// +/// ``` +/// kernel::declare_drm_ioctls! { +/// (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler), +/// } +/// ``` +/// +#[macro_export] +macro_rules! declare_drm_ioctls {
- ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
const _:() = {
let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
// Assert that all the IOCTLs are in the right order and there are no gaps,
// and that the sizeof of the specified type is correct.
$(
let cmd: u32 = $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd);
::core::assert!(i == $crate::ioctl::_IOC_NR(cmd));
::core::assert!(core::mem::size_of::<$crate::bindings::$struct>() == $crate::ioctl::_IOC_SIZE(cmd));
let i: u32 = i + 1;
)*
};
let ioctls = &[$(
$crate::bindings::drm_ioctl_desc {
cmd: $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd) as u32,
func: {
#[allow(non_snake_case)]
unsafe extern "C" fn $cmd(
raw_dev: *mut $crate::bindings::drm_device,
raw_data: *mut ::core::ffi::c_void,
raw_file_priv: *mut $crate::bindings::drm_file,
) -> core::ffi::c_int {
// SAFETY: We never drop this, and the DRM core ensures the device lives
// while callbacks are being called.
//
// FIXME: Currently there is nothing enforcing that the types of the
// dev/file match the current driver these ioctls are being declared
// for, and it's not clear how to enforce this within the type system.
let dev = ::core::mem::ManuallyDrop::new(unsafe {
$crate::drm::device::Device::from_raw(raw_dev)
});
// SAFETY: This is just the ioctl argument, which hopefully has the right type
// (we've done our best checking the size).
In the rust tree there is the ReadableFromBytes [1] trait which indicates that it is safe to read arbitrary bytes into the type. Maybe you could add it as bound on the argument type when it lands in rust-next? This way you can't end up with for example a struct containing a bool with the byte value 2, which is UB.
https://rust-for-linux.github.io/docs/kernel/io_buffer/trait.ReadableFromByt... [1]
let data = unsafe { &mut *(raw_data as *mut $crate::bindings::$struct) };
// SAFETY: This is just the DRM file structure
let file = unsafe { $crate::drm::file::File::from_raw(raw_file_priv) };
match $func(&*dev, data, &file) {
Err(e) => e.to_kernel_errno(),
Ok(i) => i.try_into().unwrap_or(ERANGE.to_kernel_errno()),
}
}
Some($cmd)
},
flags: $flags,
name: $crate::c_str!(::core::stringify!($cmd)).as_char_ptr(),
}
),*];
ioctls
};
- };
+} diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs new file mode 100644 index 000000000000..9ec6d7cbcaf3 --- /dev/null +++ b/rust/kernel/drm/mod.rs @@ -0,0 +1,5 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM subsystem abstractions.
+pub mod ioctl; diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs index 7903490816bf..cb23d24c6718 100644 --- a/rust/kernel/lib.rs +++ b/rust/kernel/lib.rs @@ -37,6 +37,8 @@ mod build_assert; pub mod delay; pub mod device; pub mod driver; +#[cfg(CONFIG_RUST_DRM)] +pub mod drm; pub mod error; pub mod io_buffer; pub mod io_mem;
-- 2.35.1
Cheers, Bjorn
On 08/03/2023 02.34, Björn Roy Baron wrote:
// SAFETY: This is just the ioctl argument, which hopefully has the right type
// (we've done our best checking the size).
In the rust tree there is the ReadableFromBytes [1] trait which indicates that it is safe to read arbitrary bytes into the type. Maybe you could add it as bound on the argument type when it lands in rust-next? This way you can't end up with for example a struct containing a bool with the byte value 2, which is UB.
There's actually a much bigger story here, because that trait isn't really very useful without a way to auto-derive it. I need the same kind of guarantee for all the GPU firmware structs...
There are existing crates that can derive such traits: one using only declarative macros [1] and one using proc macros [2]. And then, since ioctl arguments are declared in C UAPI header files, we need a way to derive those traits for them... which I guess means bindgen changes?
For now though, I don't think this is something we need to worry about too much for this particular use case because the macro forces all struct types to be part of `bindings`, and any driver UAPI should already follow these constraints if it is well-formed (and UAPIs are going to already attract a lot of scrutiny anyway). Technically you could try taking a random kernel struct containing a `bool` in an ioctl list, but that would stand out as nonsense just as much as trying to unsafe impl ReadableFromBytes for it so... it's kind of an academic problem ^^
Actually, I think we talked of moving UAPI types to a separate crate (so drivers can get access to those types and only those types, not the main bindings crate). Then maybe we could just say that if the macro forces the type to be from that crate, it's inherently safe since all UAPIs should already be castable to/from bytes if properly designed.
Aside: I'm not sure the ReadableFromBytes/WritableToBytes distinction is very useful. I know it exists (padding bytes, uninit fields, and technically bool should be WritableToBytes but not ReadableFromBytes), but I can't think of a good use case for it... I think I'd rather start with a single trait and just always enforce the union of the rules, because pretty much any time you're casting to/from bytes you want well-defined "bag of bytes" struct layouts anyway. ioctls can be R/W/RW so having separate traits depending on ioctl type complicates the code...
[1] https://github.com/QubesOS/qubes-gui-rust/blob/940754bfefb7325548eece658c307... [2] https://docs.rs/pkbuffer/latest/pkbuffer/derive.Castable.html
https://rust-for-linux.github.io/docs/kernel/io_buffer/trait.ReadableFromByt... [1]
~~ Lina
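[Editorial aside: the single-trait approach Lina argues for above can be sketched in plain Rust. This is a hedged illustration, not the kernel's actual `ReadableFromBytes` API; the trait name `PlainOldData` and the helper function are invented. The unsafe trait encodes the union of the rules: every byte pattern is a valid value and there are no padding bytes.]

```rust
/// Marker for "bag of bytes" types: every byte pattern is a valid value
/// (so reading from userspace cannot create UB like a `bool` of 2), and
/// there is no padding (so writing to userspace cannot leak kernel memory).
///
/// # Safety
///
/// Implementers must guarantee both properties.
pub unsafe trait PlainOldData: Sized {}

// SAFETY: all byte patterns are valid for these, and they have no padding.
unsafe impl PlainOldData for u8 {}
unsafe impl PlainOldData for u32 {}
unsafe impl PlainOldData for u64 {}

/// Reinterpret a byte slice as a reference to `T`, checking size and alignment.
pub fn cast_from_bytes<T: PlainOldData>(bytes: &[u8]) -> Option<&T> {
    if bytes.len() != core::mem::size_of::<T>()
        || bytes.as_ptr().align_offset(core::mem::align_of::<T>()) != 0
    {
        return None;
    }
    // SAFETY: size and alignment were checked, and `T: PlainOldData`
    // guarantees that any byte pattern is a valid `T`.
    Some(unsafe { &*(bytes.as_ptr() as *const T) })
}

fn main() {
    let arg = [2u8];
    assert_eq!(*cast_from_bytes::<u8>(&arg).unwrap(), 2);
}
```

With a single trait, R, W, and RW ioctls all take the same bound, which keeps the dispatch macro simple at the cost of the finer distinction discussed above.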
On Thu, 2023-03-09 at 15:04 +0900, Asahi Lina wrote:
On 08/03/2023 02.34, Björn Roy Baron wrote:
+ // SAFETY: This is just the ioctl argument, which hopefully has the right type + // (we've done our best checking the size).
In the rust tree there is the ReadableFromBytes [1] trait which indicates that it is safe to read arbitrary bytes into the type. Maybe you could add it as a bound on the argument type once it lands in rust-next? That way you can't end up with, for example, a struct containing a `bool` with the byte value 2, which is UB.
There's actually a much bigger story here, because that trait isn't really very useful without a way to auto-derive it. I need the same kind of guarantee for all the GPU firmware structs...
There's one using only declarative macros [1] and one using proc macros [2]. And then, since ioctl arguments are declared in C UAPI header files, we need a way to be able to derive those traits for them... which I guess means bindgen changes?
It'd be cool to be able to auto-verify that uAPI structs are all tightly packed and use the right subset of types. Maybe not possible this iteration but it'd be cool to see in future. I'd like to see it for C as well, ideally.
~Faith
On Thu, Mar 9, 2023 at 9:24 PM Faith Ekstrand faith.ekstrand@collabora.com wrote:
On Thu, 2023-03-09 at 15:04 +0900, Asahi Lina wrote:
On 08/03/2023 02.34, Björn Roy Baron wrote:
// SAFETY: This is just the ioctl argument, which hopefully has the right type
// (we've done our best checking the size).
In the rust tree there is the ReadableFromBytes [1] trait which indicates that it is safe to read arbitrary bytes into the type. Maybe you could add it as a bound on the argument type once it lands in rust-next? That way you can't end up with, for example, a struct containing a `bool` with the byte value 2, which is UB.
There's actually a much bigger story here, because that trait isn't really very useful without a way to auto-derive it. I need the same kind of guarantee for all the GPU firmware structs...
There's one using only declarative macros [1] and one using proc macros [2]. And then, since ioctl arguments are declared in C UAPI header files, we need a way to be able to derive those traits for them... which I guess means bindgen changes?
It'd be cool to be able to auto-verify that uAPI structs are all tightly packed and use the right subset of types. Maybe not possible this iteration but it'd be cool to see in future. I'd like to see it for C as well, ideally.
~Faith
I'm sure that with a macro you could verify that a struct definition doesn't contain any gaps, I'm just not sure how one would enforce that. Could we add a trait which can only be implemented through a proc macro? Maybe we can have a proc macro ensuring no gaps? Would be cool tech to have indeed.
On 10/03/2023 05.39, Karol Herbst wrote:
On Thu, Mar 9, 2023 at 9:24 PM Faith Ekstrand faith.ekstrand@collabora.com wrote:
On Thu, 2023-03-09 at 15:04 +0900, Asahi Lina wrote:
On 08/03/2023 02.34, Björn Roy Baron wrote:
// SAFETY: This is just the ioctl argument, which hopefully has the right type
// (we've done our best checking the size).
In the rust tree there is the ReadableFromBytes [1] trait which indicates that it is safe to read arbitrary bytes into the type. Maybe you could add it as a bound on the argument type once it lands in rust-next? That way you can't end up with, for example, a struct containing a `bool` with the byte value 2, which is UB.
There's actually a much bigger story here, because that trait isn't really very useful without a way to auto-derive it. I need the same kind of guarantee for all the GPU firmware structs...
There's one using only declarative macros [1] and one using proc macros [2]. And then, since ioctl arguments are declared in C UAPI header files, we need a way to be able to derive those traits for them... which I guess means bindgen changes?
It'd be cool to be able to auto-verify that uAPI structs are all tightly packed and use the right subset of types. Maybe not possible this iteration but it'd be cool to see in future. I'd like to see it for C as well, ideally.
~Faith
I'm sure that with a macro you could verify that a struct definition doesn't contain any gaps, I'm just not sure how one would enforce that. Could we add a trait which can only be implemented through a proc macro? Maybe we can have a proc macro ensuring no gaps? Would be cool tech to have indeed.
You just make the trait unsafe, as usual, then implement it via that macro. It's how the things I linked work ^^
The tricky thing with C UAPI definitions is just that we need to get bindgen to emit those macro instantiations around struct definitions somehow. Or maybe it could be done with a brute force text-based postprocessing pass? If we put all UAPI defs into their own crate, you could probably just do it with sed or a python script or something on the bindgen output to add it for all struct types...
@Rust folks: Should I try creating a uapi crate for this? I think we can just mirror the bindings crate logic, and we don't need helpers or anything like that here, so it shouldn't be very difficult. Then I could (eventually) eliminate all usage of the full bindings crate in the driver, and also try experimenting with stuff like this to validate all UAPI types and implement special traits for them...
~~ Lina
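[Editorial aside: the "unsafe trait implemented only via a macro" pattern discussed above can be sketched in plain Rust. All names here (`NoPadding`, `impl_no_padding!`, the example struct) are invented for illustration. The macro re-states the field types and proves at compile time that their sizes sum to the struct size, i.e. there are no compiler-inserted gaps, before emitting the unsafe impl.]

```rust
/// Marker for structs with no padding bytes.
///
/// # Safety
///
/// Must only be implemented for types whose size equals the sum of their
/// field sizes (no compiler-inserted padding).
pub unsafe trait NoPadding {}

/// Implements `NoPadding` only after a compile-time proof that the struct
/// has no gaps: the sum of the listed field sizes must equal the struct size.
macro_rules! impl_no_padding {
    ($ty:ty { $($field:ty),* $(,)? }) => {
        const _: () = {
            let sum = 0usize $(+ core::mem::size_of::<$field>())*;
            assert!(core::mem::size_of::<$ty>() == sum, "struct has padding bytes");
        };
        // SAFETY: the constant assertion above proves there is no padding.
        unsafe impl NoPadding for $ty {}
    };
}

#[repr(C)]
pub struct FooGetParam {
    pub param: u32,
    pub pad: u32, // explicit padding, per uAPI convention
    pub value: u64,
}

impl_no_padding!(FooGetParam { u32, u32, u64 });

fn main() {
    fn requires_no_padding<T: NoPadding>() {}
    requires_no_padding::<FooGetParam>();
    println!("FooGetParam is gap-free ({} bytes)", core::mem::size_of::<FooGetParam>());
}
```

The weakness of the declarative version is that the caller repeats the field list by hand; a proc macro (or bindgen emitting the invocation) would read the fields from the struct definition itself, which is what the derive crates linked above do.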
On Tue, Mar 07, 2023 at 11:25:26PM +0900, Asahi Lina wrote:
DRM drivers need to be able to declare which driver-specific ioctls they support. This abstraction adds the required types and a helper macro to generate the ioctl definition inside the DRM driver.
Note that this macro is not usable until further bits of the abstraction are in place (but it will not fail to compile on its own, if not called).
Signed-off-by: Asahi Lina <lina@asahilina.net>
A bunch of thoughts/questions:
- You have the pub functions to create ioctl numbers, but it looks like most drivers just do this in the C uapi headers instead and then use the direct number from the bindings? I wonder whether we shouldn't just use that as the standard way, since in the end we do need the C headers for userspace to use the ioctls/structs. Or could we generate the headers from Rust?
- More type safety would be nice. You have the one for device, but not yet for DrmFile. At least if I read the examples in asahi/vgem right. Also the FIXME for how to make sure you generate the table for the right kind of driver would be nice to fix.
- Type safety against the size of the struct an ioctl number is great!
- I wonder whether we could adjust the type according to _IOR/W/RW, i.e. if you have W then your ioctl function is Result<Struct>, if not then Result<()> since it's just errno, and you get the parameter only when you have R set. In the past we've had confusion where people got this wrong and wondered why their parameters don't make it to userspace.
- There's also the question of drm_ioctl() zero-extending the ioctl parameter struct both ways (i.e. newer kernel or newer userspace). I think trying to encode that with Some() is overkill, but maybe worth a thought.
- It would be _really_ great if rust ioctl abstractions enforce https://dri.freedesktop.org/docs/drm/process/botching-up-ioctls.html at the type level, i.e. everything naturally aligned, no gaps, all that stuff. This would also hold for get/put_user and all these things (I didn't look into that stuff yet in the drivers when you pull in entire arrays).
Cheers, Daniel
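[Editorial aside: Daniel's suggestion of tying the handler signature to the _IOR/_IOW direction bits could look roughly like this standalone sketch. All names are invented, not the series' API: a handler for a userspace-to-kernel argument only gets a shared reference, while a kernel-to-userspace handler never sees the input and must return the struct to be copied back, so the mistakes Daniel describes fail to compile.]

```rust
/// Dispatch trait: how a handler consumes or produces the ioctl argument.
pub trait IoctlHandler<T> {
    fn call(self, data: &mut T) -> Result<i32, i32>;
}

/// Handler wrapper for _IOW-style ioctls (userspace -> kernel): the handler
/// may only read the argument.
pub struct UserToKernel<F>(pub F);

/// Handler wrapper for _IOR-style ioctls (kernel -> userspace): the handler
/// cannot see the input and must produce the struct that is copied back.
pub struct KernelToUser<F>(pub F);

impl<T, F: FnOnce(&T) -> Result<i32, i32>> IoctlHandler<T> for UserToKernel<F> {
    fn call(self, data: &mut T) -> Result<i32, i32> {
        // The handler receives an immutable borrow; it cannot scribble on the
        // buffer and wonder why nothing reaches userspace.
        (self.0)(data)
    }
}

impl<T, F: FnOnce() -> Result<T, i32>> IoctlHandler<T> for KernelToUser<F> {
    fn call(self, data: &mut T) -> Result<i32, i32> {
        // The returned struct is what gets copied out to userspace.
        *data = (self.0)()?;
        Ok(0)
    }
}

fn main() {
    let mut arg: u32 = 7;
    assert_eq!(UserToKernel(|x: &u32| Ok(*x as i32)).call(&mut arg), Ok(7));

    let mut out: u32 = 0;
    assert_eq!(KernelToUser(|| Ok(42u32)).call(&mut out), Ok(0));
    assert_eq!(out, 42);
}
```

A real version would pick the wrapper automatically from the _IOC direction bits inside `declare_drm_ioctls!`, so the table and the handler signatures can never disagree.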
 drivers/gpu/drm/Kconfig         |   7 ++
 rust/bindings/bindings_helper.h |   2 +
 rust/kernel/drm/ioctl.rs        | 147 ++++++++++++++++++++++++++++++++++++++++
 rust/kernel/drm/mod.rs          |   5 ++
 rust/kernel/lib.rs              |   2 +
 5 files changed, 163 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index dc0f94f02a82..dab8f0f9aa96 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -27,6 +27,13 @@ menuconfig DRM
 	  details. You should also select and configure AGP
 	  (/dev/agpgart) support if it is available for your platform.
 
+# Rust abstractions cannot be built as modules currently, so force them as
+# bool by using these intermediate symbols. In the future these could be
+# tristate once abstractions themselves can be built as modules.
+config RUST_DRM
+	bool "Rust support for the DRM subsystem"
+	depends on DRM=y
+
 config DRM_MIPI_DBI
 	tristate
 	depends on DRM
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 91bb7906ca5a..2687bef1676f 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -6,6 +6,7 @@
  * Sorted alphabetically.
  */
 
+#include <drm/drm_ioctl.h>
 #include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/dma-mapping.h>
@@ -23,6 +24,7 @@
 #include <linux/sysctl.h>
 #include <linux/timekeeping.h>
 #include <linux/xarray.h>
+#include <uapi/drm/drm.h>
 
 /* `bindgen` gets confused at certain things. */
 const gfp_t BINDINGS_GFP_KERNEL = GFP_KERNEL;
diff --git a/rust/kernel/drm/ioctl.rs b/rust/kernel/drm/ioctl.rs
new file mode 100644
index 000000000000..10304efbd5f1
--- /dev/null
+++ b/rust/kernel/drm/ioctl.rs
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+#![allow(non_snake_case)]
+//! DRM IOCTL definitions.
+//!
+//! C header: [`include/linux/drm/drm_ioctl.h`](../../../../include/linux/drm/drm_ioctl.h)
+use crate::ioctl;
+const BASE: u32 = bindings::DRM_IOCTL_BASE as u32;
+/// Construct a DRM ioctl number with no argument.
+pub const fn IO(nr: u32) -> u32 {
+    ioctl::_IO(BASE, nr)
+}
+
+/// Construct a DRM ioctl number with a read-only argument.
+pub const fn IOR<T>(nr: u32) -> u32 {
+    ioctl::_IOR::<T>(BASE, nr)
+}
+
+/// Construct a DRM ioctl number with a write-only argument.
+pub const fn IOW<T>(nr: u32) -> u32 {
+    ioctl::_IOW::<T>(BASE, nr)
+}
+
+/// Construct a DRM ioctl number with a read-write argument.
+pub const fn IOWR<T>(nr: u32) -> u32 {
+    ioctl::_IOWR::<T>(BASE, nr)
+}
+/// Descriptor type for DRM ioctls. Use the `declare_drm_ioctls!{}` macro to construct them.
+pub type DrmIoctlDescriptor = bindings::drm_ioctl_desc;
+
+/// This is for ioctls which are used for rendering, and require that the file descriptor is
+/// either for a render node, or if it's a legacy/primary node, then it must be authenticated.
+pub const AUTH: u32 = bindings::drm_ioctl_flags_DRM_AUTH;
+
+/// This must be set for any ioctl which can change the modeset or display state. Userspace
+/// must call the ioctl through a primary node, while it is the active master.
+///
+/// Note that read-only modeset ioctls can also be called by unauthenticated clients, or when
+/// a master is not the currently active one.
+pub const MASTER: u32 = bindings::drm_ioctl_flags_DRM_MASTER;
+
+/// Anything that could potentially wreck a master file descriptor needs to have this flag
+/// set.
+///
+/// Currently that's only the SETMASTER and DROPMASTER ioctls, which e.g. logind can call to
+/// force a non-behaving master (display compositor) into compliance.
+///
+/// This is equivalent to callers with the SYSADMIN capability.
+pub const ROOT_ONLY: u32 = bindings::drm_ioctl_flags_DRM_ROOT_ONLY;
+
+/// Whether drm_ioctl_desc.func should be called with the DRM BKL held or not. Enforced as
+/// the default for all modern drivers, hence there should never be a need to set this flag.
+///
+/// Do not use anywhere else than for the VBLANK_WAIT IOCTL, which is the only legacy IOCTL
+/// which needs this.
+pub const UNLOCKED: u32 = bindings::drm_ioctl_flags_DRM_UNLOCKED;
+
+/// This is used for all ioctls needed for rendering only, for drivers which support render
+/// nodes. This should be all new render drivers, and hence it should be always set for any
+/// ioctl with `AUTH` set. Note though that read-only query ioctls might have this set, but
+/// have not set DRM_AUTH because they do not require authentication.
+pub const RENDER_ALLOW: u32 = bindings::drm_ioctl_flags_DRM_RENDER_ALLOW;
+/// Declare the DRM ioctls for a driver.
+///
+/// Each entry in the list should have the form:
+///
+/// `(ioctl_number, argument_type, flags, user_callback),`
+///
+/// `argument_type` is the type name within the `bindings` crate.
+/// `user_callback` should have the following prototype:
+///
+/// ```
+/// fn foo(device: &kernel::drm::device::Device<Self>,
+///        data: &mut bindings::argument_type,
+///        file: &kernel::drm::file::File<Self::File>,
+/// )
+/// ```
+/// where `Self` is the drm::drv::Driver implementation these ioctls are being declared within.
+///
+/// # Examples
+///
+/// ```
+/// kernel::declare_drm_ioctls! {
+///     (FOO_GET_PARAM, drm_foo_get_param, ioctl::RENDER_ALLOW, my_get_param_handler),
+/// }
+/// ```
+#[macro_export]
+macro_rules! declare_drm_ioctls {
+    ( $(($cmd:ident, $struct:ident, $flags:expr, $func:expr)),* $(,)? ) => {
const IOCTLS: &'static [$crate::drm::ioctl::DrmIoctlDescriptor] = {
const _:() = {
let i: u32 = $crate::bindings::DRM_COMMAND_BASE;
// Assert that all the IOCTLs are in the right order and there are no gaps,
// and that the sizeof of the specified type is correct.
$(
let cmd: u32 = $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd);
::core::assert!(i == $crate::ioctl::_IOC_NR(cmd));
::core::assert!(core::mem::size_of::<$crate::bindings::$struct>() == $crate::ioctl::_IOC_SIZE(cmd));
let i: u32 = i + 1;
)*
};
let ioctls = &[$(
$crate::bindings::drm_ioctl_desc {
cmd: $crate::macros::concat_idents!($crate::bindings::DRM_IOCTL_, $cmd) as u32,
func: {
#[allow(non_snake_case)]
unsafe extern "C" fn $cmd(
raw_dev: *mut $crate::bindings::drm_device,
raw_data: *mut ::core::ffi::c_void,
raw_file_priv: *mut $crate::bindings::drm_file,
) -> core::ffi::c_int {
// SAFETY: We never drop this, and the DRM core ensures the device lives
// while callbacks are being called.
//
// FIXME: Currently there is nothing enforcing that the types of the
// dev/file match the current driver these ioctls are being declared
// for, and it's not clear how to enforce this within the type system.
let dev = ::core::mem::ManuallyDrop::new(unsafe {
$crate::drm::device::Device::from_raw(raw_dev)
});
// SAFETY: This is just the ioctl argument, which hopefully has the right type
// (we've done our best checking the size).
let data = unsafe { &mut *(raw_data as *mut $crate::bindings::$struct) };
// SAFETY: This is just the DRM file structure
let file = unsafe { $crate::drm::file::File::from_raw(raw_file_priv) };
match $func(&*dev, data, &file) {
Err(e) => e.to_kernel_errno(),
Ok(i) => i.try_into().unwrap_or(ERANGE.to_kernel_errno()),
}
}
Some($cmd)
},
flags: $flags,
name: $crate::c_str!(::core::stringify!($cmd)).as_char_ptr(),
}
),*];
ioctls
};
    };
+}
diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
new file mode 100644
index 000000000000..9ec6d7cbcaf3
--- /dev/null
+++ b/rust/kernel/drm/mod.rs
@@ -0,0 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+
+//! DRM subsystem abstractions.
+
+pub mod ioctl;
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index 7903490816bf..cb23d24c6718 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -37,6 +37,8 @@ mod build_assert;
 pub mod delay;
 pub mod device;
 pub mod driver;
+#[cfg(CONFIG_RUST_DRM)]
+pub mod drm;
 pub mod error;
 pub mod io_buffer;
 pub mod io_mem;
-- 2.35.1
Add the initial abstractions for DRM drivers and devices. These go together in one commit since they are fairly tightly coupled types.
A few things have been stubbed out, to be implemented as further bits of the DRM subsystem are introduced.
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 rust/bindings/bindings_helper.h |   3 +
 rust/kernel/drm/device.rs       |  76 +++++++++
 rust/kernel/drm/drv.rs          | 339 ++++++++++++++++++++++++++++++++++++++++
 rust/kernel/drm/mod.rs          |   2 +
 4 files changed, 420 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 2687bef1676f..2a999138c4ae 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -6,10 +6,13 @@ * Sorted alphabetically. */
+#include <drm/drm_device.h> +#include <drm/drm_drv.h> #include <drm/drm_ioctl.h> #include <linux/delay.h> #include <linux/device.h> #include <linux/dma-mapping.h> +#include <linux/fs.h> #include <linux/ioctl.h> #include <linux/io-pgtable.h> #include <linux/ktime.h> diff --git a/rust/kernel/drm/device.rs b/rust/kernel/drm/device.rs new file mode 100644 index 000000000000..6007f941137a --- /dev/null +++ b/rust/kernel/drm/device.rs @@ -0,0 +1,76 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT + +//! DRM device. +//! +//! C header: [`include/linux/drm/drm_device.h`](../../../../include/linux/drm/drm_device.h) + +use crate::{bindings, device, drm, types::ForeignOwnable}; +use core::marker::PhantomData; + +/// Represents a reference to a DRM device. The device is reference-counted and is guaranteed to +/// not be dropped while this object is alive. +pub struct Device<T: drm::drv::Driver> { + // Type invariant: ptr must be a valid and initialized drm_device, + // and this value must either own a reference to it or the caller + // must ensure that it is never dropped if the reference is borrowed. + pub(super) ptr: *mut bindings::drm_device, + _p: PhantomData<T>, +} + +impl<T: drm::drv::Driver> Device<T> { + // Not intended to be called externally, except via declare_drm_ioctls!() + #[doc(hidden)] + pub unsafe fn from_raw(raw: *mut bindings::drm_device) -> Device<T> { + Device { + ptr: raw, + _p: PhantomData, + } + } + + #[allow(dead_code)] + pub(crate) fn raw(&self) -> *const bindings::drm_device { + self.ptr + } + + pub(crate) fn raw_mut(&mut self) -> *mut bindings::drm_device { + self.ptr + } + + /// Returns a borrowed reference to the user data associated with this Device. 
+ pub fn data(&self) -> <T::Data as ForeignOwnable>::Borrowed<'_> { + unsafe { T::Data::borrow((*self.ptr).dev_private) } + } +} + +impl<T: drm::drv::Driver> Drop for Device<T> { + fn drop(&mut self) { + // SAFETY: By the type invariants, we know that `self` owns a reference, so it is safe to + // relinquish it now. + unsafe { bindings::drm_dev_put(self.ptr) }; + } +} + +impl<T: drm::drv::Driver> Clone for Device<T> { + fn clone(&self) -> Self { + // SAFETY: We get a new reference and then create a new owning object from the raw pointer + unsafe { + bindings::drm_dev_get(self.ptr); + Device::from_raw(self.ptr) + } + } +} + +// SAFETY: `Device` only holds a pointer to a C device, which is safe to be used from any thread. +unsafe impl<T: drm::drv::Driver> Send for Device<T> {} + +// SAFETY: `Device` only holds a pointer to a C device, references to which are safe to be used +// from any thread. +unsafe impl<T: drm::drv::Driver> Sync for Device<T> {} + +// Make drm::Device work for dev_info!() and friends +unsafe impl<T: drm::drv::Driver> device::RawDevice for Device<T> { + fn raw_device(&self) -> *mut bindings::device { + // SAFETY: ptr must be valid per the type invariant + unsafe { (*self.ptr).dev } + } +} diff --git a/rust/kernel/drm/drv.rs b/rust/kernel/drm/drv.rs new file mode 100644 index 000000000000..29a465515dc9 --- /dev/null +++ b/rust/kernel/drm/drv.rs @@ -0,0 +1,339 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT + +//! DRM driver core. +//! +//! C header: [`include/linux/drm/drm_drv.h`](../../../../include/linux/drm/drm_drv.h) + +use crate::{ + bindings, device, drm, + error::code::*, + error::from_kernel_err_ptr, + error::{Error, Result}, + prelude::*, + private::Sealed, + str::CStr, + types::ForeignOwnable, + ThisModule, +}; +use core::{ + marker::{PhantomData, PhantomPinned}, + pin::Pin, +}; +use macros::vtable; + +/// Driver use the GEM memory manager. This should be set for all modern drivers. 
+pub const FEAT_GEM: u32 = bindings::drm_driver_feature_DRIVER_GEM; +/// Driver supports mode setting interfaces (KMS). +pub const FEAT_MODESET: u32 = bindings::drm_driver_feature_DRIVER_MODESET; +/// Driver supports dedicated render nodes. +pub const FEAT_RENDER: u32 = bindings::drm_driver_feature_DRIVER_RENDER; +/// Driver supports the full atomic modesetting userspace API. +/// +/// Drivers which only use atomic internally, but do not support the full userspace API (e.g. not +/// all properties converted to atomic, or multi-plane updates are not guaranteed to be tear-free) +/// should not set this flag. +pub const FEAT_ATOMIC: u32 = bindings::drm_driver_feature_DRIVER_ATOMIC; +/// Driver supports DRM sync objects for explicit synchronization of command submission. +pub const FEAT_SYNCOBJ: u32 = bindings::drm_driver_feature_DRIVER_SYNCOBJ; +/// Driver supports the timeline flavor of DRM sync objects for explicit synchronization of command +/// submission. +pub const FEAT_SYNCOBJ_TIMELINE: u32 = bindings::drm_driver_feature_DRIVER_SYNCOBJ_TIMELINE; + +/// Information data for a DRM Driver. +pub struct DriverInfo { + /// Driver major version. + pub major: i32, + /// Driver minor version. + pub minor: i32, + /// Driver patchlevel version. + pub patchlevel: i32, + /// Driver name. + pub name: &'static CStr, + /// Driver description. + pub desc: &'static CStr, + /// Driver date. + pub date: &'static CStr, +} + +/// Internal memory management operation set, normally created by memory managers (e.g. GEM). +/// +/// See `kernel::drm::gem` and `kernel::drm::gem::shmem`. 
+pub struct AllocOps { + pub(crate) gem_create_object: Option< + unsafe extern "C" fn( + dev: *mut bindings::drm_device, + size: usize, + ) -> *mut bindings::drm_gem_object, + >, + pub(crate) prime_handle_to_fd: Option< + unsafe extern "C" fn( + dev: *mut bindings::drm_device, + file_priv: *mut bindings::drm_file, + handle: u32, + flags: u32, + prime_fd: *mut core::ffi::c_int, + ) -> core::ffi::c_int, + >, + pub(crate) prime_fd_to_handle: Option< + unsafe extern "C" fn( + dev: *mut bindings::drm_device, + file_priv: *mut bindings::drm_file, + prime_fd: core::ffi::c_int, + handle: *mut u32, + ) -> core::ffi::c_int, + >, + pub(crate) gem_prime_import: Option< + unsafe extern "C" fn( + dev: *mut bindings::drm_device, + dma_buf: *mut bindings::dma_buf, + ) -> *mut bindings::drm_gem_object, + >, + pub(crate) gem_prime_import_sg_table: Option< + unsafe extern "C" fn( + dev: *mut bindings::drm_device, + attach: *mut bindings::dma_buf_attachment, + sgt: *mut bindings::sg_table, + ) -> *mut bindings::drm_gem_object, + >, + pub(crate) gem_prime_mmap: Option< + unsafe extern "C" fn( + obj: *mut bindings::drm_gem_object, + vma: *mut bindings::vm_area_struct, + ) -> core::ffi::c_int, + >, + pub(crate) dumb_create: Option< + unsafe extern "C" fn( + file_priv: *mut bindings::drm_file, + dev: *mut bindings::drm_device, + args: *mut bindings::drm_mode_create_dumb, + ) -> core::ffi::c_int, + >, + pub(crate) dumb_map_offset: Option< + unsafe extern "C" fn( + file_priv: *mut bindings::drm_file, + dev: *mut bindings::drm_device, + handle: u32, + offset: *mut u64, + ) -> core::ffi::c_int, + >, + pub(crate) dumb_destroy: Option< + unsafe extern "C" fn( + file_priv: *mut bindings::drm_file, + dev: *mut bindings::drm_device, + handle: u32, + ) -> core::ffi::c_int, + >, +} + +/// Trait for memory manager implementations. Implemented internally. +pub trait AllocImpl: Sealed { + /// The C callback operations for this memory manager. 
+ const ALLOC_OPS: AllocOps; +} + +/// A DRM driver implementation. +#[vtable] +pub trait Driver { + /// Context data associated with the DRM driver + /// + /// Determines the type of the context data passed to each of the methods of the trait. + type Data: ForeignOwnable + Sync + Send; + + /// The type used to manage memory for this driver. + /// + /// Should be either `drm::gem::Object<T>` or `drm::gem::shmem::Object<T>`. + type Object: AllocImpl; + + /// Driver metadata + const INFO: DriverInfo; + + /// Feature flags + const FEATURES: u32; + + /// IOCTL list. See `kernel::drm::ioctl::declare_drm_ioctls!{}`. + const IOCTLS: &'static [drm::ioctl::DrmIoctlDescriptor]; +} + +/// A registration of a DRM device +/// +/// # Invariants: +/// +/// drm is always a valid pointer to an allocated drm_device +pub struct Registration<T: Driver> { + drm: drm::device::Device<T>, + registered: bool, + fops: bindings::file_operations, + vtable: Pin<Boxbindings::drm_driver>, + _p: PhantomData<T>, + _pin: PhantomPinned, +} + +#[cfg(CONFIG_DRM_LEGACY)] +macro_rules! drm_legacy_fields { + ( $($field:ident: $val:expr),* $(,)? ) => { + bindings::drm_driver { + $( $field: $val ),*, + firstopen: None, + preclose: None, + dma_ioctl: None, + dma_quiescent: None, + context_dtor: None, + irq_handler: None, + irq_preinstall: None, + irq_postinstall: None, + irq_uninstall: None, + get_vblank_counter: None, + enable_vblank: None, + disable_vblank: None, + dev_priv_size: 0, + } + } +} + +#[cfg(not(CONFIG_DRM_LEGACY))] +macro_rules! drm_legacy_fields { + ( $($field:ident: $val:expr),* $(,)? ) => { + bindings::drm_driver { + $( $field: $val ),* + } + } +} + +/// Registers a DRM device with the rest of the kernel. +/// +/// It automatically picks up THIS_MODULE. +#[allow(clippy::crate_in_macro_def)] +#[macro_export] +macro_rules! drm_device_register { + ($reg:expr, $data:expr, $flags:expr $(,)?) 
=> {{ + $crate::drm::drv::Registration::register($reg, $data, $flags, &crate::THIS_MODULE) + }}; +} + +impl<T: Driver> Registration<T> { + const VTABLE: bindings::drm_driver = drm_legacy_fields! { + load: None, + open: None, // TODO: File abstraction + postclose: None, // TODO: File abstraction + lastclose: None, + unload: None, + release: None, + master_set: None, + master_drop: None, + debugfs_init: None, + gem_create_object: T::Object::ALLOC_OPS.gem_create_object, + prime_handle_to_fd: T::Object::ALLOC_OPS.prime_handle_to_fd, + prime_fd_to_handle: T::Object::ALLOC_OPS.prime_fd_to_handle, + gem_prime_import: T::Object::ALLOC_OPS.gem_prime_import, + gem_prime_import_sg_table: T::Object::ALLOC_OPS.gem_prime_import_sg_table, + gem_prime_mmap: T::Object::ALLOC_OPS.gem_prime_mmap, + dumb_create: T::Object::ALLOC_OPS.dumb_create, + dumb_map_offset: T::Object::ALLOC_OPS.dumb_map_offset, + dumb_destroy: T::Object::ALLOC_OPS.dumb_destroy, + + major: T::INFO.major, + minor: T::INFO.minor, + patchlevel: T::INFO.patchlevel, + name: T::INFO.name.as_char_ptr() as *mut _, + desc: T::INFO.desc.as_char_ptr() as *mut _, + date: T::INFO.date.as_char_ptr() as *mut _, + + driver_features: T::FEATURES, + ioctls: T::IOCTLS.as_ptr(), + num_ioctls: T::IOCTLS.len() as i32, + fops: core::ptr::null_mut(), + }; + + /// Creates a new [`Registration`] but does not register it yet. + /// + /// It is allowed to move. + pub fn new(parent: &dyn device::RawDevice) -> Result<Self> { + let vtable = Pin::new(Box::try_new(Self::VTABLE)?); + let raw_drm = unsafe { bindings::drm_dev_alloc(&*vtable, parent.raw_device()) }; + let raw_drm = from_kernel_err_ptr(raw_drm)?; + + // The reference count is one, and now we take ownership of that reference as a + // drm::device::Device. 
+ let drm = unsafe { drm::device::Device::from_raw(raw_drm) }; + + Ok(Self { + drm, + registered: false, + vtable, + fops: Default::default(), // TODO: GEM abstraction + _pin: PhantomPinned, + _p: PhantomData, + }) + } + + /// Registers a DRM device with the rest of the kernel. + /// + /// Users are encouraged to use the [`drm_device_register!()`] macro because it automatically + /// picks up the current module. + pub fn register( + self: Pin<&mut Self>, + data: T::Data, + flags: usize, + module: &'static ThisModule, + ) -> Result { + if self.registered { + // Already registered. + return Err(EINVAL); + } + + // SAFETY: We never move out of `this`. + let this = unsafe { self.get_unchecked_mut() }; + let data_pointer = <T::Data as ForeignOwnable>::into_foreign(data); + // SAFETY: `drm` is valid per the type invariant + unsafe { + (*this.drm.raw_mut()).dev_private = data_pointer as *mut _; + } + + this.fops.owner = module.0; + this.vtable.fops = &this.fops; + + // SAFETY: The device is now initialized and ready to be registered. + let ret = unsafe { bindings::drm_dev_register(this.drm.raw_mut(), flags as u64) }; + if ret < 0 { + // SAFETY: `data_pointer` was returned by `into_foreign` above. + unsafe { T::Data::from_foreign(data_pointer) }; + return Err(Error::from_kernel_errno(ret)); + } + + this.registered = true; + Ok(()) + } + + /// Returns a reference to the `Device` instance for this registration. + pub fn device(&self) -> &drm::device::Device<T> { + &self.drm + } +} + +// SAFETY: `Registration` doesn't offer any methods or access to fields when shared between threads +// or CPUs, so it is safe to share it. +unsafe impl<T: Driver> Sync for Registration<T> {} + +// SAFETY: Registration with and unregistration from the drm subsystem can happen from any thread. +// Additionally, `T::Data` (which is dropped during unregistration) is `Send`, so it is ok to move +// `Registration` to different threads. 
+#[allow(clippy::non_send_fields_in_send_ty)] +unsafe impl<T: Driver> Send for Registration<T> {} + +impl<T: Driver> Drop for Registration<T> { + /// Removes the registration from the kernel if it has completed successfully before. + fn drop(&mut self) { + if self.registered { + // Get a pointer to the data stored in device before destroying it. + // SAFETY: `drm` is valid per the type invariant + let data_pointer = unsafe { (*self.drm.raw_mut()).dev_private }; + + // SAFETY: Since `registered` is true, `self.drm` is both valid and registered. + unsafe { bindings::drm_dev_unregister(self.drm.raw_mut()) }; + + // Free data as well. + // SAFETY: `data_pointer` was returned by `into_foreign` during registration. + unsafe { <T::Data as ForeignOwnable>::from_foreign(data_pointer) }; + } + } +} diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs index 9ec6d7cbcaf3..69376b3c6db9 100644 --- a/rust/kernel/drm/mod.rs +++ b/rust/kernel/drm/mod.rs @@ -2,4 +2,6 @@
//! DRM subsystem abstractions.
+pub mod device;
+pub mod drv;
 pub mod ioctl;
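[Editorial aside: the Registration lifecycle in drv.rs above hinges on one ownership round-trip: `into_foreign` turns `T::Data` into a raw pointer stowed in `dev_private`, and `from_foreign` reclaims it exactly once, either on drm_dev_register() failure or in `Registration::drop`. Below is a standalone sketch of that pattern with `Box` in plain std Rust; it is an illustration of the idea, not the kernel's actual `ForeignOwnable` trait.]

```rust
use core::ffi::c_void;

struct DeviceData {
    id: u32,
}

/// Leak the data as a raw pointer, as into_foreign does before the pointer
/// is stored in dev_private.
fn into_foreign(data: Box<DeviceData>) -> *const c_void {
    Box::into_raw(data) as *const c_void
}

/// Reclaim ownership, as from_foreign does on registration failure or after
/// drm_dev_unregister().
///
/// # Safety
///
/// `ptr` must have come from `into_foreign` and must not be used again.
unsafe fn from_foreign(ptr: *const c_void) -> Box<DeviceData> {
    Box::from_raw(ptr as *mut DeviceData)
}

fn main() {
    let ptr = into_foreign(Box::new(DeviceData { id: 42 }));
    // ... while registered, callbacks borrow the data through dev_private ...
    // SAFETY: ptr came from into_foreign and is consumed exactly once.
    let data = unsafe { from_foreign(ptr) };
    assert_eq!(data.id, 42); // dropped here, mirroring unregistration
}
```

The important invariant, which the abstraction enforces with its `registered` flag, is that `from_foreign` runs exactly once per `into_foreign`: twice would be a double free, zero times a leak.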
------- Original Message ------- On Tuesday, March 7th, 2023 at 15:25, Asahi Lina lina@asahilina.net wrote:
Add the initial abstractions for DRM drivers and devices. These go together in one commit since they are fairly tightly coupled types.
A few things have been stubbed out, to be implemented as further bits of the DRM subsystem are introduced.
Signed-off-by: Asahi Lina <lina@asahilina.net>
 rust/bindings/bindings_helper.h |   3 +
 rust/kernel/drm/device.rs       |  76 +++++++++
 rust/kernel/drm/drv.rs          | 339 ++++++++++++++++++++++++++++++++++++++++
 rust/kernel/drm/mod.rs          |   2 +
 4 files changed, 420 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 2687bef1676f..2a999138c4ae 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -6,10 +6,13 @@
  * Sorted alphabetically.
  */
 
+#include <drm/drm_device.h>
+#include <drm/drm_drv.h>
 #include <drm/drm_ioctl.h>
 #include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/dma-mapping.h>
+#include <linux/fs.h>
 #include <linux/ioctl.h>
 #include <linux/io-pgtable.h>
 #include <linux/ktime.h>
diff --git a/rust/kernel/drm/device.rs b/rust/kernel/drm/device.rs
new file mode 100644
index 000000000000..6007f941137a
--- /dev/null
+++ b/rust/kernel/drm/device.rs
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM device.
+//!
+//! C header: [`include/linux/drm/drm_device.h`](../../../../include/linux/drm/drm_device.h)
+
+use crate::{bindings, device, drm, types::ForeignOwnable};
+use core::marker::PhantomData;
+/// Represents a reference to a DRM device. The device is reference-counted and is guaranteed to
+/// not be dropped while this object is alive.
+pub struct Device<T: drm::drv::Driver> {
+    // Type invariant: ptr must be a valid and initialized drm_device,
+    // and this value must either own a reference to it or the caller
+    // must ensure that it is never dropped if the reference is borrowed.
+    pub(super) ptr: *mut bindings::drm_device,
+    _p: PhantomData<T>,
+}
+
+impl<T: drm::drv::Driver> Device<T> {
+    // Not intended to be called externally, except via declare_drm_ioctls!()
+    #[doc(hidden)]
+    pub unsafe fn from_raw(raw: *mut bindings::drm_device) -> Device<T> {
+        Device {
+            ptr: raw,
+            _p: PhantomData,
+        }
+    }
+
+    #[allow(dead_code)]
+    pub(crate) fn raw(&self) -> *const bindings::drm_device {
+        self.ptr
+    }
+
+    pub(crate) fn raw_mut(&mut self) -> *mut bindings::drm_device {
+        self.ptr
+    }
+
+    /// Returns a borrowed reference to the user data associated with this Device.
+    pub fn data(&self) -> <T::Data as ForeignOwnable>::Borrowed<'_> {
+        unsafe { T::Data::borrow((*self.ptr).dev_private) }
+    }
+}
+
+impl<T: drm::drv::Driver> Drop for Device<T> {
+    fn drop(&mut self) {
+        // SAFETY: By the type invariants, we know that `self` owns a reference, so it is safe to
+        // relinquish it now.
+        unsafe { bindings::drm_dev_put(self.ptr) };
+    }
+}
+
+impl<T: drm::drv::Driver> Clone for Device<T> {
+    fn clone(&self) -> Self {
+        // SAFETY: We get a new reference and then create a new owning object from the raw pointer
+        unsafe {
+            bindings::drm_dev_get(self.ptr);
+            Device::from_raw(self.ptr)
+        }
+    }
+}
+
+// SAFETY: `Device` only holds a pointer to a C device, which is safe to be used from any thread.
+unsafe impl<T: drm::drv::Driver> Send for Device<T> {}
+
+// SAFETY: `Device` only holds a pointer to a C device, references to which are safe to be used
+// from any thread.
+unsafe impl<T: drm::drv::Driver> Sync for Device<T> {}
+
+// Make drm::Device work for dev_info!() and friends
+unsafe impl<T: drm::drv::Driver> device::RawDevice for Device<T> {
+    fn raw_device(&self) -> *mut bindings::device {
+        // SAFETY: ptr must be valid per the type invariant
+        unsafe { (*self.ptr).dev }
+    }
+}
diff --git a/rust/kernel/drm/drv.rs b/rust/kernel/drm/drv.rs
new file mode 100644
index 000000000000..29a465515dc9
--- /dev/null
+++ b/rust/kernel/drm/drv.rs
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM driver core.
+//!
+//! C header: [`include/linux/drm/drm_drv.h`](../../../../include/linux/drm/drm_drv.h)
+
+use crate::{
+    bindings, device, drm,
+    error::code::*,
+    error::from_kernel_err_ptr,
+    error::{Error, Result},
+    prelude::*,
+    private::Sealed,
+    str::CStr,
+    types::ForeignOwnable,
+    ThisModule,
+};
+use core::{
+    marker::{PhantomData, PhantomPinned},
+    pin::Pin,
+};
+use macros::vtable;
+
+/// Driver uses the GEM memory manager. This should be set for all modern drivers.
+pub const FEAT_GEM: u32 = bindings::drm_driver_feature_DRIVER_GEM;
+/// Driver supports mode setting interfaces (KMS).
+pub const FEAT_MODESET: u32 = bindings::drm_driver_feature_DRIVER_MODESET;
+/// Driver supports dedicated render nodes.
+pub const FEAT_RENDER: u32 = bindings::drm_driver_feature_DRIVER_RENDER;
+/// Driver supports the full atomic modesetting userspace API.
+///
+/// Drivers which only use atomic internally, but do not support the full userspace API (e.g. not
+/// all properties converted to atomic, or multi-plane updates are not guaranteed to be tear-free)
+/// should not set this flag.
+pub const FEAT_ATOMIC: u32 = bindings::drm_driver_feature_DRIVER_ATOMIC;
+/// Driver supports DRM sync objects for explicit synchronization of command submission.
+pub const FEAT_SYNCOBJ: u32 = bindings::drm_driver_feature_DRIVER_SYNCOBJ;
+/// Driver supports the timeline flavor of DRM sync objects for explicit synchronization of command
+/// submission.
+pub const FEAT_SYNCOBJ_TIMELINE: u32 = bindings::drm_driver_feature_DRIVER_SYNCOBJ_TIMELINE;
+/// Information data for a DRM Driver.
+pub struct DriverInfo {
+    /// Driver major version.
+    pub major: i32,
+    /// Driver minor version.
+    pub minor: i32,
+    /// Driver patchlevel version.
+    pub patchlevel: i32,
+    /// Driver name.
+    pub name: &'static CStr,
+    /// Driver description.
+    pub desc: &'static CStr,
+    /// Driver date.
+    pub date: &'static CStr,
+}
Could you please add an Invariants section to the doc comments indicating what requirements these function pointers must satisfy?
+/// Internal memory management operation set, normally created by memory managers (e.g. GEM).
+///
+/// See `kernel::drm::gem` and `kernel::drm::gem::shmem`.
+pub struct AllocOps {
+    pub(crate) gem_create_object: Option<
+        unsafe extern "C" fn(
+            dev: *mut bindings::drm_device,
+            size: usize,
+        ) -> *mut bindings::drm_gem_object,
+    >,
+    pub(crate) prime_handle_to_fd: Option<
+        unsafe extern "C" fn(
+            dev: *mut bindings::drm_device,
+            file_priv: *mut bindings::drm_file,
+            handle: u32,
+            flags: u32,
+            prime_fd: *mut core::ffi::c_int,
+        ) -> core::ffi::c_int,
+    >,
+    pub(crate) prime_fd_to_handle: Option<
+        unsafe extern "C" fn(
+            dev: *mut bindings::drm_device,
+            file_priv: *mut bindings::drm_file,
+            prime_fd: core::ffi::c_int,
+            handle: *mut u32,
+        ) -> core::ffi::c_int,
+    >,
+    pub(crate) gem_prime_import: Option<
+        unsafe extern "C" fn(
+            dev: *mut bindings::drm_device,
+            dma_buf: *mut bindings::dma_buf,
+        ) -> *mut bindings::drm_gem_object,
+    >,
+    pub(crate) gem_prime_import_sg_table: Option<
+        unsafe extern "C" fn(
+            dev: *mut bindings::drm_device,
+            attach: *mut bindings::dma_buf_attachment,
+            sgt: *mut bindings::sg_table,
+        ) -> *mut bindings::drm_gem_object,
+    >,
+    pub(crate) gem_prime_mmap: Option<
+        unsafe extern "C" fn(
+            obj: *mut bindings::drm_gem_object,
+            vma: *mut bindings::vm_area_struct,
+        ) -> core::ffi::c_int,
+    >,
+    pub(crate) dumb_create: Option<
+        unsafe extern "C" fn(
+            file_priv: *mut bindings::drm_file,
+            dev: *mut bindings::drm_device,
+            args: *mut bindings::drm_mode_create_dumb,
+        ) -> core::ffi::c_int,
+    >,
+    pub(crate) dumb_map_offset: Option<
+        unsafe extern "C" fn(
+            file_priv: *mut bindings::drm_file,
+            dev: *mut bindings::drm_device,
+            handle: u32,
+            offset: *mut u64,
+        ) -> core::ffi::c_int,
+    >,
+    pub(crate) dumb_destroy: Option<
+        unsafe extern "C" fn(
+            file_priv: *mut bindings::drm_file,
+            dev: *mut bindings::drm_device,
+            handle: u32,
+        ) -> core::ffi::c_int,
+    >,
+}
+/// Trait for memory manager implementations. Implemented internally.
+pub trait AllocImpl: Sealed {
+    /// The C callback operations for this memory manager.
+    const ALLOC_OPS: AllocOps;
+}
+
+/// A DRM driver implementation.
+#[vtable]
+pub trait Driver {
+    /// Context data associated with the DRM driver
+    ///
+    /// Determines the type of the context data passed to each of the methods of the trait.
+    type Data: ForeignOwnable + Sync + Send;
+
+    /// The type used to manage memory for this driver.
+    ///
+    /// Should be either `drm::gem::Object<T>` or `drm::gem::shmem::Object<T>`.
+    type Object: AllocImpl;
+
+    /// Driver metadata
+    const INFO: DriverInfo;
+
+    /// Feature flags
+    const FEATURES: u32;
+
+    /// IOCTL list. See `kernel::drm::ioctl::declare_drm_ioctls!{}`.
+    const IOCTLS: &'static [drm::ioctl::DrmIoctlDescriptor];
+}
+
+/// A registration of a DRM device
+///
+/// # Invariants:
+///
+/// drm is always a valid pointer to an allocated drm_device
+pub struct Registration<T: Driver> {
+    drm: drm::device::Device<T>,
+    registered: bool,
+    fops: bindings::file_operations,
+    vtable: Pin<Box<bindings::drm_driver>>,
+    _p: PhantomData<T>,
+    _pin: PhantomPinned,
+}
+#[cfg(CONFIG_DRM_LEGACY)]
+macro_rules! drm_legacy_fields {
+    ( $($field:ident: $val:expr),* $(,)? ) => {
+        bindings::drm_driver {
+            $( $field: $val ),*,
+            firstopen: None,
+            preclose: None,
+            dma_ioctl: None,
+            dma_quiescent: None,
+            context_dtor: None,
+            irq_handler: None,
+            irq_preinstall: None,
+            irq_postinstall: None,
+            irq_uninstall: None,
+            get_vblank_counter: None,
+            enable_vblank: None,
+            disable_vblank: None,
+            dev_priv_size: 0,
+        }
+    }
+}
+
+#[cfg(not(CONFIG_DRM_LEGACY))]
+macro_rules! drm_legacy_fields {
+    ( $($field:ident: $val:expr),* $(,)? ) => {
+        bindings::drm_driver {
+            $( $field: $val ),*
+        }
+    }
+}
+
+/// Registers a DRM device with the rest of the kernel.
+///
+/// It automatically picks up THIS_MODULE.
+#[allow(clippy::crate_in_macro_def)]
+#[macro_export]
+macro_rules! drm_device_register {
+    ($reg:expr, $data:expr, $flags:expr $(,)?) => {{
+        $crate::drm::drv::Registration::register($reg, $data, $flags, &crate::THIS_MODULE)
+    }};
+}
+impl<T: Driver> Registration<T> {
+    const VTABLE: bindings::drm_driver = drm_legacy_fields! {
+        load: None,
+        open: None, // TODO: File abstraction
+        postclose: None, // TODO: File abstraction
+        lastclose: None,
+        unload: None,
+        release: None,
+        master_set: None,
+        master_drop: None,
+        debugfs_init: None,
+        gem_create_object: T::Object::ALLOC_OPS.gem_create_object,
+        prime_handle_to_fd: T::Object::ALLOC_OPS.prime_handle_to_fd,
+        prime_fd_to_handle: T::Object::ALLOC_OPS.prime_fd_to_handle,
+        gem_prime_import: T::Object::ALLOC_OPS.gem_prime_import,
+        gem_prime_import_sg_table: T::Object::ALLOC_OPS.gem_prime_import_sg_table,
+        gem_prime_mmap: T::Object::ALLOC_OPS.gem_prime_mmap,
+        dumb_create: T::Object::ALLOC_OPS.dumb_create,
+        dumb_map_offset: T::Object::ALLOC_OPS.dumb_map_offset,
+        dumb_destroy: T::Object::ALLOC_OPS.dumb_destroy,
+        major: T::INFO.major,
+        minor: T::INFO.minor,
+        patchlevel: T::INFO.patchlevel,
+        name: T::INFO.name.as_char_ptr() as *mut _,
+        desc: T::INFO.desc.as_char_ptr() as *mut _,
+        date: T::INFO.date.as_char_ptr() as *mut _,
+        driver_features: T::FEATURES,
+        ioctls: T::IOCTLS.as_ptr(),
+        num_ioctls: T::IOCTLS.len() as i32,
+        fops: core::ptr::null_mut(),
+    };
+
+    /// Creates a new [`Registration`] but does not register it yet.
+    ///
+    /// It is allowed to move.
+    pub fn new(parent: &dyn device::RawDevice) -> Result<Self> {
+        let vtable = Pin::new(Box::try_new(Self::VTABLE)?);
+        let raw_drm = unsafe { bindings::drm_dev_alloc(&*vtable, parent.raw_device()) };
+        let raw_drm = from_kernel_err_ptr(raw_drm)?;
+
+        // The reference count is one, and now we take ownership of that reference as a
+        // drm::device::Device.
+        let drm = unsafe { drm::device::Device::from_raw(raw_drm) };
+
+        Ok(Self {
+            drm,
+            registered: false,
+            vtable,
+            fops: Default::default(), // TODO: GEM abstraction
+            _pin: PhantomPinned,
+            _p: PhantomData,
+        })
+    }
+
+    /// Registers a DRM device with the rest of the kernel.
+    ///
+    /// Users are encouraged to use the [`drm_device_register!()`] macro because it automatically
+    /// picks up the current module.
+    pub fn register(
+        self: Pin<&mut Self>,
+        data: T::Data,
+        flags: usize,
+        module: &'static ThisModule,
+    ) -> Result {
+        if self.registered {
+            // Already registered.
+            return Err(EINVAL);
+        }
+
+        // SAFETY: We never move out of `this`.
+        let this = unsafe { self.get_unchecked_mut() };
+        let data_pointer = <T::Data as ForeignOwnable>::into_foreign(data);
+        // SAFETY: `drm` is valid per the type invariant
+        unsafe {
+            (*this.drm.raw_mut()).dev_private = data_pointer as *mut _;
+        }
+
+        this.fops.owner = module.0;
+        this.vtable.fops = &this.fops;
+
+        // SAFETY: The device is now initialized and ready to be registered.
+        let ret = unsafe { bindings::drm_dev_register(this.drm.raw_mut(), flags as u64) };
+        if ret < 0 {
+            // SAFETY: `data_pointer` was returned by `into_foreign` above.
+            unsafe { T::Data::from_foreign(data_pointer) };
+            return Err(Error::from_kernel_errno(ret));
+        }
+
+        this.registered = true;
+        Ok(())
+    }
+
+    /// Returns a reference to the `Device` instance for this registration.
+    pub fn device(&self) -> &drm::device::Device<T> {
+        &self.drm
+    }
+}
+// SAFETY: `Registration` doesn't offer any methods or access to fields when shared between threads
+// or CPUs, so it is safe to share it.
+unsafe impl<T: Driver> Sync for Registration<T> {}
+
+// SAFETY: Registration with and unregistration from the drm subsystem can happen from any thread.
+// Additionally, `T::Data` (which is dropped during unregistration) is `Send`, so it is ok to move
+// `Registration` to different threads.
+#[allow(clippy::non_send_fields_in_send_ty)]
+unsafe impl<T: Driver> Send for Registration<T> {}
+
+impl<T: Driver> Drop for Registration<T> {
+    /// Removes the registration from the kernel if it has completed successfully before.
+    fn drop(&mut self) {
+        if self.registered {
+            // Get a pointer to the data stored in device before destroying it.
+            // SAFETY: `drm` is valid per the type invariant
+            let data_pointer = unsafe { (*self.drm.raw_mut()).dev_private };
+
+            // SAFETY: Since `registered` is true, `self.drm` is both valid and registered.
+            unsafe { bindings::drm_dev_unregister(self.drm.raw_mut()) };
+
+            // Free data as well.
+            // SAFETY: `data_pointer` was returned by `into_foreign` during registration.
+            unsafe { <T::Data as ForeignOwnable>::from_foreign(data_pointer) };
+        }
+    }
+}
diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
index 9ec6d7cbcaf3..69376b3c6db9 100644
--- a/rust/kernel/drm/mod.rs
+++ b/rust/kernel/drm/mod.rs
@@ -2,4 +2,6 @@
 //! DRM subsystem abstractions.
 
+pub mod device;
+pub mod drv;
 pub mod ioctl;
-- 2.35.1
Cheers, Bjorn
On 08/03/2023 03.19, Björn Roy Baron wrote:
------- Original Message -------
On Tuesday, March 7th, 2023 at 15:25, Asahi Lina <lina@asahilina.net> wrote:
Add the initial abstractions for DRM drivers and devices. These go together in one commit since they are fairly tightly coupled types.
A few things have been stubbed out, to be implemented as further bits of the DRM subsystem are introduced.
Signed-off-by: Asahi Lina <lina@asahilina.net>
[...]
+/// Information data for a DRM Driver.
+pub struct DriverInfo {
+    /// Driver major version.
+    pub major: i32,
+    /// Driver minor version.
+    pub minor: i32,
+    /// Driver patchlevel version.
+    pub patchlevel: i32,
+    /// Driver name.
+    pub name: &'static CStr,
+    /// Driver description.
+    pub desc: &'static CStr,
+    /// Driver date.
+    pub date: &'static CStr,
+}
Could you please add an Invariants section to the doc comments indicating what requirements these function pointers must satisfy?
I can try (as much as I can divine from the C side anyway...). I guess you want interface docs for each callback, so like what it must do and what invariants each one must uphold?
Note that this is a kernel crate-only struct (the fields are not public), so users can't create their own AllocOps variants anyway (and AllocImpl is sealed, on top of that), but I guess it makes sense to document it for internal kernel crate purposes. At some point it might make sense to allow drivers to override these with proper Rust callbacks (and then the wrappers need to ensure safety), but right now that's not implemented.
+/// Internal memory management operation set, normally created by memory managers (e.g. GEM).
+///
+/// See `kernel::drm::gem` and `kernel::drm::gem::shmem`.
+pub struct AllocOps {
+    pub(crate) gem_create_object: Option<
+        unsafe extern "C" fn(
+            dev: *mut bindings::drm_device,
+            size: usize,
+        ) -> *mut bindings::drm_gem_object,
+    >,
+    pub(crate) prime_handle_to_fd: Option<
+        unsafe extern "C" fn(
+            dev: *mut bindings::drm_device,
+            file_priv: *mut bindings::drm_file,
+            handle: u32,
+            flags: u32,
+            prime_fd: *mut core::ffi::c_int,
+        ) -> core::ffi::c_int,
+    >,
+    pub(crate) prime_fd_to_handle: Option<
+        unsafe extern "C" fn(
+            dev: *mut bindings::drm_device,
+            file_priv: *mut bindings::drm_file,
+            prime_fd: core::ffi::c_int,
+            handle: *mut u32,
+        ) -> core::ffi::c_int,
+    >,
+    pub(crate) gem_prime_import: Option<
+        unsafe extern "C" fn(
+            dev: *mut bindings::drm_device,
+            dma_buf: *mut bindings::dma_buf,
+        ) -> *mut bindings::drm_gem_object,
+    >,
+    pub(crate) gem_prime_import_sg_table: Option<
+        unsafe extern "C" fn(
+            dev: *mut bindings::drm_device,
+            attach: *mut bindings::dma_buf_attachment,
+            sgt: *mut bindings::sg_table,
+        ) -> *mut bindings::drm_gem_object,
+    >,
+    pub(crate) gem_prime_mmap: Option<
+        unsafe extern "C" fn(
+            obj: *mut bindings::drm_gem_object,
+            vma: *mut bindings::vm_area_struct,
+        ) -> core::ffi::c_int,
+    >,
+    pub(crate) dumb_create: Option<
+        unsafe extern "C" fn(
+            file_priv: *mut bindings::drm_file,
+            dev: *mut bindings::drm_device,
+            args: *mut bindings::drm_mode_create_dumb,
+        ) -> core::ffi::c_int,
+    >,
+    pub(crate) dumb_map_offset: Option<
+        unsafe extern "C" fn(
+            file_priv: *mut bindings::drm_file,
+            dev: *mut bindings::drm_device,
+            handle: u32,
+            offset: *mut u64,
+        ) -> core::ffi::c_int,
+    >,
+    pub(crate) dumb_destroy: Option<
+        unsafe extern "C" fn(
+            file_priv: *mut bindings::drm_file,
+            dev: *mut bindings::drm_device,
+            handle: u32,
+        ) -> core::ffi::c_int,
+    >,
+}
~~ Lina
On Tue, Mar 07, 2023 at 11:25:27PM +0900, Asahi Lina wrote:
[...]
+// SAFETY: `Device` only holds a pointer to a C device, which is safe to be used from any thread.
+unsafe impl<T: drm::drv::Driver> Send for Device<T> {}
+
+// SAFETY: `Device` only holds a pointer to a C device, references to which are safe to be used
+// from any thread.
+unsafe impl<T: drm::drv::Driver> Sync for Device<T> {}
Here is the mental model I use to check whether a type is `Send` or `Sync`:
* If an object of a type can be created on one thread and dropped on another thread, then it's `Send`.
* If multiple threads can call the immutable functions (i.e. functions taking `&self`) of the same object of a type, then it's `Sync`.
Maybe it's incomplete, but at least I find it useful to determine whether a type is `Send` or `Sync`: it's not just the struct representation, the behaviors (functions) of the struct also matter.
If that looks reasonable to you, maybe update the "SAFETY" comments in the future version? Thanks ;-)
(I know you brought this up in the meeting; sorry, I guess I wasn't fully awake when answering you ;-))
Regards, Boqun
+// Make drm::Device work for dev_info!() and friends
+unsafe impl<T: drm::drv::Driver> device::RawDevice for Device<T> {
+    fn raw_device(&self) -> *mut bindings::device {
+        // SAFETY: ptr must be valid per the type invariant
+        unsafe { (*self.ptr).dev }
+    }
+}
[...]
On Tue, Mar 07, 2023 at 11:25:27PM +0900, Asahi Lina wrote:
Add the initial abstractions for DRM drivers and devices. These go together in one commit since they are fairly tightly coupled types.
A few things have been stubbed out, to be implemented as further bits of the DRM subsystem are introduced.
Signed-off-by: Asahi Lina lina@asahilina.net
[...]
+//! DRM device.
+//!
+//! C header: [`include/linux/drm/drm_device.h`](../../../../include/linux/drm/drm_device.h)
+
+use crate::{bindings, device, drm, types::ForeignOwnable};
+use core::marker::PhantomData;
+/// Represents a reference to a DRM device. The device is reference-counted and is guaranteed to
+/// not be dropped while this object is alive.
+pub struct Device<T: drm::drv::Driver> {
+    // Type invariant: ptr must be a valid and initialized drm_device,
+    // and this value must either own a reference to it or the caller
+    // must ensure that it is never dropped if the reference is borrowed.
+    pub(super) ptr: *mut bindings::drm_device,
+    _p: PhantomData<T>,
+}
+
+impl<T: drm::drv::Driver> Device<T> {
+    // Not intended to be called externally, except via declare_drm_ioctls!()
+    #[doc(hidden)]
+    pub unsafe fn from_raw(raw: *mut bindings::drm_device) -> Device<T> {
+        Device {
+            ptr: raw,
+            _p: PhantomData,
+        }
+    }
+
+    #[allow(dead_code)]
+    pub(crate) fn raw(&self) -> *const bindings::drm_device {
+        self.ptr
+    }
+
+    pub(crate) fn raw_mut(&mut self) -> *mut bindings::drm_device {
+        self.ptr
+    }
Since you can always safely get a `*mut bindings::drm_device` from `a.raw() as *mut _`, this mutable version seems unnecessary to me. In other words, there is no way to prevent getting a `*mut bindings::drm_device` from only a `&Device`.
Regards, Boqun
+    /// Returns a borrowed reference to the user data associated with this Device.
+    pub fn data(&self) -> <T::Data as ForeignOwnable>::Borrowed<'_> {
+        unsafe { T::Data::borrow((*self.ptr).dev_private) }
+    }
+}
[...]
On Tue, Mar 07, 2023 at 11:25:27PM +0900, Asahi Lina wrote:
Add the initial abstractions for DRM drivers and devices. These go together in one commit since they are fairly tightly coupled types.
A few things have been stubbed out, to be implemented as further bits of the DRM subsystem are introduced.
Signed-off-by: Asahi Lina <lina@asahilina.net>
Ok so this is fairly fundamental lifetime fun and might be fairly orthogonal to most of the things you actually want to do with a drm driver (like implement gem or whatever). So separate mail.
So upfront short intro. There's 3 different lifetimes involved in building a drm driver:
- struct drm_device. It's refcounted because it's fundamentally an uapi interface thing, and all the various uapi interfaces that build on top of this (drm_file, dma_buf, ...) need to hold references on it. It's supposed to survive for as long as userspace needs it or the underlying driver is bound, whichever is longer.
- struct device. Refcounted and good for dmesg printing, nothing else. Yes, because ...
- ... the actual hardware resource, in many places also represented by struct device. Not refcounted; instead it's limited by hotunplug or, more precisely, by how long your driver is bound to the struct device. You could make a case that in C this is represented by the bus-specific type (e.g. platform_device), and the bus-specific hooks delineate the lifetime (for platform devices that's from ->probe to ->remove). Since there's no C type for this I'll call it the hwdevice.
I think it would be good if we model this a bit more precisely in Rust. It might be possible to use the bus-specific types as the hwdevice, but that's not entirely right either, because each bus device represents both the hwdevice and the refcounted struct device.
Now onto lifetimes, or at least how this is usually handled.
- struct device should be obvious; the important part really is that the rust wrappers should not allow anything to be done with it that is tied to the hwdevice lifetime. Which is almost everything you want to do with a struct (platform_)device (aside from pure sw stuff like dmesg printing).
- the hwdevice is _not_ refcounted. I think in rust this maps to borrow semantics, to make sure that the reference only stays valid during a driver callback. The driver core/bus driver ensure that all the various callbacks (pm_ops, platform_driver, ...) finish before the ->remove callback starts.
- usually the link from hwdevice to drm_device is done with a refcounted drm_device stored with dev_set_drvdata. For rust it'd be nice if that's the Driver, fully typesafe and automatically cleaned up.
- which brings us to how hwdevice cleanup works in C: that's done with all the devm_ helpers for practically anything you might want to set up for hw access: mappings, interrupts, .... Note that there are also devm_*malloc functions; when drivers use them that's almost always a bug, because generally allocations should stick around with the drm_device and not go down with the non-refcounted hwdevice lifetime.
For added fun the bus/driver core also uses devm_ to manage things tied to the refcounted struct device, which works because devm_ nests and ->probe opens up a new devm_ bucket which is torn down at ->remove time. But luckily drivers should never deal with that, so for them (on the C side at least) devm_ is the right lifetime model for things tied to the hwdevice lifetime.
For rust this means that we really should try to tie all the hw-related things into devm or an equivalent, and make sure they are automatically cleaned up at that point, but also no later (because if you clean up hw stuff after ->remove you have a driver bug).
- Similarly on the drm_device side we have drmm_. You can have some refcounted things within the lifetime of drm_device (like dma_buf), but if things the driver creates survive past the point of drm_device, then probably something went really wrong. Either a leak or you'll blow up in the C code.
So again for rust I think we should try to model this, and make sure (with borrow semantics and avoiding full refcounting like the plague in driver code) that driver structs and other sw things can't outlive the drm_device, but also don't hold it alive unduly.
- Since the point of a drm_device is to drive hardware, you need to be able to safely dereference the drm_device->dev pointer and know whether it's still a hwdevice (i.e. useful) or just a struct device because the hw is gone. That's done with drm_dev_enter/exit and making sure that ->remove calls drm_dev_unplug as the very first thing, before it starts tearing down hw resources like mappings, interrupts, ...
On the C side we entirely rely on review for this, and it just doesn't work. Unless exhaustively tested, hotunplug just dies, and I think for more complex drivers this is something where Rust type enforcement could really shine: we'd need to make sure that a driver can only get at the hwdevice where it's safe (bus/driver core callbacks or drm_dev_enter/exit as a mutex-guard thing). Furthermore we need to ensure that drm_dev_unplug really is the first thing done in ->remove (and by symmetry drm_dev_register the last thing probe does). I think a neat way would be if ->probe would return a container of things that implement a Uapi trait, which has register and unplug functions, and then the rust glue calls that.
More aggressively would be to outright not implement ->remove for rust drivers and entirely rely on the devm stack of cleanup actions. This would still need a type trick to ensure that drm_dev_register is the very last thing that's called (to make sure the matching drm_dev_unplug is the first thing).
- Despite the refcounted pointers going both ways between drm_device<->device there's no loop, because the device->drm_device reference is dropped with the hwdevice lifetime (through devm if you're using devm_drm_dev_alloc in a C driver), which breaks the loop. Note that the drm_device->device refcount/pointer stays until the very end of drm_device (we need that for dmesg printing), but outside of drm_dev_enter/exit it's really just a temptation for bugs.
- I think ideally drivers themselves should not even try to refcount drm_device or device, but instead have it all tied together directly. The exceptions really are only for when you have separate, free-standing uapi objects (like dma_buf or dma_fence or drm_file), and in those cases the combo of C code + rust glue should ensure that the refcounting is done right. If a rust driver has any additional refcounting need for these structs then I think we've screwed up the data lifetime model.
Apologies for the wall of text. I hope I didn't forget anything crucial, I've been pondering this for a few weeks now :-)
Imo this doesn't need to be fixed before we merge asahi, but it is something that I think we really should fix, because despite years of effort and all the auto-cleanup infrastructure like devm_ and drmm_, C drivers are still buggy by default, there's no clear understanding outside of a select few about the problems ("devm_kmalloc considered harmful" is an actual talk title), and I think this is something where Rust typing and the borrow checker really could substantially improve the state of the art.
And yes for a soc driver that's all fairly irrelevant, because it's physically not possible to remove a device, but for most drm drivers it is absolutely possible to burn them with a hotunplug (hotunplug of device pass-through to a vm if you can't physically hotunplug the device itself), so this isn't academic at all.
I'll try and type up the separate mail about semantics of gem drivers and all that stuff tomorrow.
Cheers, Daniel
rust/bindings/bindings_helper.h | 3 + rust/kernel/drm/device.rs | 76 +++++++++ rust/kernel/drm/drv.rs | 339 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 2 + 4 files changed, 420 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 2687bef1676f..2a999138c4ae 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -6,10 +6,13 @@
- Sorted alphabetically.
*/ +#include <drm/drm_device.h> +#include <drm/drm_drv.h> #include <drm/drm_ioctl.h> #include <linux/delay.h> #include <linux/device.h> #include <linux/dma-mapping.h> +#include <linux/fs.h> #include <linux/ioctl.h> #include <linux/io-pgtable.h> #include <linux/ktime.h> diff --git a/rust/kernel/drm/device.rs b/rust/kernel/drm/device.rs new file mode 100644 index 000000000000..6007f941137a --- /dev/null +++ b/rust/kernel/drm/device.rs @@ -0,0 +1,76 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM device. +//! +//! C header: [`include/linux/drm/drm_device.h`](../../../../include/linux/drm/drm_device.h)
+use crate::{bindings, device, drm, types::ForeignOwnable}; +use core::marker::PhantomData;
+/// Represents a reference to a DRM device. The device is reference-counted and is guaranteed to +/// not be dropped while this object is alive. +pub struct Device<T: drm::drv::Driver> {
- // Type invariant: ptr must be a valid and initialized drm_device,
- // and this value must either own a reference to it or the caller
- // must ensure that it is never dropped if the reference is borrowed.
- pub(super) ptr: *mut bindings::drm_device,
- _p: PhantomData<T>,
+}
+impl<T: drm::drv::Driver> Device<T> {
- // Not intended to be called externally, except via declare_drm_ioctls!()
- #[doc(hidden)]
- pub unsafe fn from_raw(raw: *mut bindings::drm_device) -> Device<T> {
Device {
ptr: raw,
_p: PhantomData,
}
- }
- #[allow(dead_code)]
- pub(crate) fn raw(&self) -> *const bindings::drm_device {
self.ptr
- }
- pub(crate) fn raw_mut(&mut self) -> *mut bindings::drm_device {
self.ptr
- }
- /// Returns a borrowed reference to the user data associated with this Device.
- pub fn data(&self) -> <T::Data as ForeignOwnable>::Borrowed<'_> {
unsafe { T::Data::borrow((*self.ptr).dev_private) }
- }
+}
+impl<T: drm::drv::Driver> Drop for Device<T> {
- fn drop(&mut self) {
// SAFETY: By the type invariants, we know that `self` owns a reference, so it is safe to
// relinquish it now.
unsafe { bindings::drm_dev_put(self.ptr) };
- }
+}
+impl<T: drm::drv::Driver> Clone for Device<T> {
- fn clone(&self) -> Self {
// SAFETY: We get a new reference and then create a new owning object from the raw pointer
unsafe {
bindings::drm_dev_get(self.ptr);
Device::from_raw(self.ptr)
}
- }
+}
+// SAFETY: `Device` only holds a pointer to a C device, which is safe to be used from any thread. +unsafe impl<T: drm::drv::Driver> Send for Device<T> {}
+// SAFETY: `Device` only holds a pointer to a C device, references to which are safe to be used +// from any thread. +unsafe impl<T: drm::drv::Driver> Sync for Device<T> {}
+// Make drm::Device work for dev_info!() and friends +unsafe impl<T: drm::drv::Driver> device::RawDevice for Device<T> {
- fn raw_device(&self) -> *mut bindings::device {
// SAFETY: ptr must be valid per the type invariant
unsafe { (*self.ptr).dev }
- }
+} diff --git a/rust/kernel/drm/drv.rs b/rust/kernel/drm/drv.rs new file mode 100644 index 000000000000..29a465515dc9 --- /dev/null +++ b/rust/kernel/drm/drv.rs @@ -0,0 +1,339 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM driver core. +//! +//! C header: [`include/linux/drm/drm_drv.h`](../../../../include/linux/drm/drm_drv.h)
+use crate::{
- bindings, device, drm,
- error::code::*,
- error::from_kernel_err_ptr,
- error::{Error, Result},
- prelude::*,
- private::Sealed,
- str::CStr,
- types::ForeignOwnable,
- ThisModule,
+}; +use core::{
- marker::{PhantomData, PhantomPinned},
- pin::Pin,
+}; +use macros::vtable;
/// Driver uses the GEM memory manager. This should be set for all modern drivers.
pub const FEAT_GEM: u32 = bindings::drm_driver_feature_DRIVER_GEM;
/// Driver supports mode setting interfaces (KMS).
pub const FEAT_MODESET: u32 = bindings::drm_driver_feature_DRIVER_MODESET;
/// Driver supports dedicated render nodes.
pub const FEAT_RENDER: u32 = bindings::drm_driver_feature_DRIVER_RENDER;
/// Driver supports the full atomic modesetting userspace API.
///
/// Drivers which only use atomic internally, but do not support the full userspace API (e.g. not
/// all properties converted to atomic, or multi-plane updates are not guaranteed to be tear-free)
/// should not set this flag.
pub const FEAT_ATOMIC: u32 = bindings::drm_driver_feature_DRIVER_ATOMIC;
/// Driver supports DRM sync objects for explicit synchronization of command submission.
pub const FEAT_SYNCOBJ: u32 = bindings::drm_driver_feature_DRIVER_SYNCOBJ;
/// Driver supports the timeline flavor of DRM sync objects for explicit synchronization of
/// command submission.
pub const FEAT_SYNCOBJ_TIMELINE: u32 = bindings::drm_driver_feature_DRIVER_SYNCOBJ_TIMELINE;
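For illustration: these feature flags are plain bitflags that get OR'd together into `drm_driver.driver_features`. The standalone userspace sketch below uses made-up constant values (the real ones come from the bindgen-generated `bindings` module) and shows how a render-only driver such as drm-asahi might combine them:

```rust
// Illustrative stand-ins for the bindgen-generated `drm_driver_feature_DRIVER_*`
// values; the real constants come from `bindings` and may differ.
pub const FEAT_GEM: u32 = 1 << 0;
pub const FEAT_MODESET: u32 = 1 << 1;
pub const FEAT_RENDER: u32 = 1 << 3;
pub const FEAT_SYNCOBJ: u32 = 1 << 5;
pub const FEAT_SYNCOBJ_TIMELINE: u32 = 1 << 6;

// A render-only (no KMS) feature set, roughly what a GPU compute/render
// driver would declare.
pub const FEATURES: u32 = FEAT_GEM | FEAT_RENDER | FEAT_SYNCOBJ | FEAT_SYNCOBJ_TIMELINE;

pub fn has_feature(features: u32, feat: u32) -> bool {
    features & feat != 0
}

fn main() {
    assert!(has_feature(FEATURES, FEAT_GEM));
    assert!(has_feature(FEATURES, FEAT_SYNCOBJ_TIMELINE));
    // No KMS: userspace gets a render node but no modesetting API.
    assert!(!has_feature(FEATURES, FEAT_MODESET));
    println!("features = {:#x}", FEATURES);
}
```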
+/// Information data for a DRM Driver. +pub struct DriverInfo {
- /// Driver major version.
- pub major: i32,
- /// Driver minor version.
- pub minor: i32,
- /// Driver patchlevel version.
- pub patchlevel: i32,
- /// Driver name.
- pub name: &'static CStr,
- /// Driver description.
- pub desc: &'static CStr,
- /// Driver date.
- pub date: &'static CStr,
+}
+/// Internal memory management operation set, normally created by memory managers (e.g. GEM). +/// +/// See `kernel::drm::gem` and `kernel::drm::gem::shmem`. +pub struct AllocOps {
    pub(crate) gem_create_object: Option<
        unsafe extern "C" fn(
            dev: *mut bindings::drm_device,
            size: usize,
        ) -> *mut bindings::drm_gem_object,
    >,
    pub(crate) prime_handle_to_fd: Option<
        unsafe extern "C" fn(
            dev: *mut bindings::drm_device,
            file_priv: *mut bindings::drm_file,
            handle: u32,
            flags: u32,
            prime_fd: *mut core::ffi::c_int,
        ) -> core::ffi::c_int,
    >,
    pub(crate) prime_fd_to_handle: Option<
        unsafe extern "C" fn(
            dev: *mut bindings::drm_device,
            file_priv: *mut bindings::drm_file,
            prime_fd: core::ffi::c_int,
            handle: *mut u32,
        ) -> core::ffi::c_int,
    >,
    pub(crate) gem_prime_import: Option<
        unsafe extern "C" fn(
            dev: *mut bindings::drm_device,
            dma_buf: *mut bindings::dma_buf,
        ) -> *mut bindings::drm_gem_object,
    >,
    pub(crate) gem_prime_import_sg_table: Option<
        unsafe extern "C" fn(
            dev: *mut bindings::drm_device,
            attach: *mut bindings::dma_buf_attachment,
            sgt: *mut bindings::sg_table,
        ) -> *mut bindings::drm_gem_object,
    >,
    pub(crate) gem_prime_mmap: Option<
        unsafe extern "C" fn(
            obj: *mut bindings::drm_gem_object,
            vma: *mut bindings::vm_area_struct,
        ) -> core::ffi::c_int,
    >,
    pub(crate) dumb_create: Option<
        unsafe extern "C" fn(
            file_priv: *mut bindings::drm_file,
            dev: *mut bindings::drm_device,
            args: *mut bindings::drm_mode_create_dumb,
        ) -> core::ffi::c_int,
    >,
    pub(crate) dumb_map_offset: Option<
        unsafe extern "C" fn(
            file_priv: *mut bindings::drm_file,
            dev: *mut bindings::drm_device,
            handle: u32,
            offset: *mut u64,
        ) -> core::ffi::c_int,
    >,
    pub(crate) dumb_destroy: Option<
        unsafe extern "C" fn(
            file_priv: *mut bindings::drm_file,
            dev: *mut bindings::drm_device,
            handle: u32,
        ) -> core::ffi::c_int,
    >,
}
+/// Trait for memory manager implementations. Implemented internally. +pub trait AllocImpl: Sealed {
- /// The C callback operations for this memory manager.
- const ALLOC_OPS: AllocOps;
+}
+/// A DRM driver implementation. +#[vtable] +pub trait Driver {
- /// Context data associated with the DRM driver
- ///
- /// Determines the type of the context data passed to each of the methods of the trait.
- type Data: ForeignOwnable + Sync + Send;
- /// The type used to manage memory for this driver.
- ///
- /// Should be either `drm::gem::Object<T>` or `drm::gem::shmem::Object<T>`.
- type Object: AllocImpl;
- /// Driver metadata
- const INFO: DriverInfo;
- /// Feature flags
- const FEATURES: u32;
- /// IOCTL list. See `kernel::drm::ioctl::declare_drm_ioctls!{}`.
- const IOCTLS: &'static [drm::ioctl::DrmIoctlDescriptor];
+}
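One design note worth spelling out: because `INFO`, `FEATURES`, and `IOCTLS` are associated *constants* rather than runtime values, the `Registration::<T>::VTABLE` below can be evaluated entirely at compile time, producing one vtable per driver type much like a C driver's `static struct drm_driver`. A standalone sketch of that mechanism (all types here are simplified mocks, not the real kernel ones):

```rust
// Simplified mocks of `DriverInfo` and the C `struct drm_driver`.
struct DriverInfo {
    major: i32,
    minor: i32,
}

struct RawDriverVtable {
    major: i32,
    minor: i32,
    driver_features: u32,
}

trait Driver {
    const INFO: DriverInfo;
    const FEATURES: u32;
}

struct Registration<T: Driver>(core::marker::PhantomData<T>);

impl<T: Driver> Registration<T> {
    // Built at compile time from the driver's associated constants,
    // like `Registration::<T>::VTABLE` in the patch.
    const VTABLE: RawDriverVtable = RawDriverVtable {
        major: T::INFO.major,
        minor: T::INFO.minor,
        driver_features: T::FEATURES,
    };
}

struct MyDriver; // hypothetical driver

impl Driver for MyDriver {
    const INFO: DriverInfo = DriverInfo { major: 0, minor: 1 };
    const FEATURES: u32 = 0b1;
}

fn main() {
    let vt = &Registration::<MyDriver>::VTABLE;
    assert_eq!(vt.major, 0);
    assert_eq!(vt.minor, 1);
    assert_eq!(vt.driver_features, 1);
    println!("vtable built from associated consts");
}
```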
+/// A registration of a DRM device +/// +/// # Invariants: +/// +/// drm is always a valid pointer to an allocated drm_device +pub struct Registration<T: Driver> {
- drm: drm::device::Device<T>,
- registered: bool,
- fops: bindings::file_operations,
- vtable: Pin<Box<bindings::drm_driver>>,
- _p: PhantomData<T>,
- _pin: PhantomPinned,
+}
+#[cfg(CONFIG_DRM_LEGACY)] +macro_rules! drm_legacy_fields {
- ( $($field:ident: $val:expr),* $(,)? ) => {
bindings::drm_driver {
$( $field: $val ),*,
firstopen: None,
preclose: None,
dma_ioctl: None,
dma_quiescent: None,
context_dtor: None,
irq_handler: None,
irq_preinstall: None,
irq_postinstall: None,
irq_uninstall: None,
get_vblank_counter: None,
enable_vblank: None,
disable_vblank: None,
dev_priv_size: 0,
}
- }
+}
+#[cfg(not(CONFIG_DRM_LEGACY))] +macro_rules! drm_legacy_fields {
- ( $($field:ident: $val:expr),* $(,)? ) => {
bindings::drm_driver {
$( $field: $val ),*
}
- }
+}
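The trick behind this macro pair: both `macro_rules!` definitions share one call site, and the `cfg`-selected variant fills in defaults for the fields that only exist when `CONFIG_DRM_LEGACY` is set. A standalone sketch of the same pattern, using a hypothetical cargo feature in place of the kernel config option:

```rust
// A struct with a field that only exists under a cfg (here a hypothetical
// "legacy" cargo feature standing in for CONFIG_DRM_LEGACY).
struct DrvOps {
    name: &'static str,
    #[cfg(feature = "legacy")]
    dev_priv_size: i32,
}

// With the cfg enabled, the macro fills in defaults for the extra fields...
#[cfg(feature = "legacy")]
macro_rules! drv_fields {
    ( $($field:ident: $val:expr),* $(,)? ) => {
        DrvOps {
            $( $field: $val ),*,
            dev_priv_size: 0,
        }
    }
}

// ...and without it, it just forwards whatever the caller wrote.
#[cfg(not(feature = "legacy"))]
macro_rules! drv_fields {
    ( $($field:ident: $val:expr),* $(,)? ) => {
        DrvOps {
            $( $field: $val ),*
        }
    }
}

fn main() {
    // The call site is identical either way, like `drm_legacy_fields!`.
    let ops = drv_fields! { name: "demo" };
    assert_eq!(ops.name, "demo");
    println!("ok");
}
```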
+/// Registers a DRM device with the rest of the kernel. +/// +/// It automatically picks up THIS_MODULE. +#[allow(clippy::crate_in_macro_def)] +#[macro_export] +macro_rules! drm_device_register {
- ($reg:expr, $data:expr, $flags:expr $(,)?) => {{
$crate::drm::drv::Registration::register($reg, $data, $flags, &crate::THIS_MODULE)
- }};
+}
+impl<T: Driver> Registration<T> {
- const VTABLE: bindings::drm_driver = drm_legacy_fields! {
load: None,
open: None, // TODO: File abstraction
postclose: None, // TODO: File abstraction
lastclose: None,
unload: None,
release: None,
master_set: None,
master_drop: None,
debugfs_init: None,
gem_create_object: T::Object::ALLOC_OPS.gem_create_object,
prime_handle_to_fd: T::Object::ALLOC_OPS.prime_handle_to_fd,
prime_fd_to_handle: T::Object::ALLOC_OPS.prime_fd_to_handle,
gem_prime_import: T::Object::ALLOC_OPS.gem_prime_import,
gem_prime_import_sg_table: T::Object::ALLOC_OPS.gem_prime_import_sg_table,
gem_prime_mmap: T::Object::ALLOC_OPS.gem_prime_mmap,
dumb_create: T::Object::ALLOC_OPS.dumb_create,
dumb_map_offset: T::Object::ALLOC_OPS.dumb_map_offset,
dumb_destroy: T::Object::ALLOC_OPS.dumb_destroy,
major: T::INFO.major,
minor: T::INFO.minor,
patchlevel: T::INFO.patchlevel,
name: T::INFO.name.as_char_ptr() as *mut _,
desc: T::INFO.desc.as_char_ptr() as *mut _,
date: T::INFO.date.as_char_ptr() as *mut _,
driver_features: T::FEATURES,
ioctls: T::IOCTLS.as_ptr(),
num_ioctls: T::IOCTLS.len() as i32,
fops: core::ptr::null_mut(),
- };
- /// Creates a new [`Registration`] but does not register it yet.
- ///
- /// It is allowed to move.
- pub fn new(parent: &dyn device::RawDevice) -> Result<Self> {
let vtable = Pin::new(Box::try_new(Self::VTABLE)?);
let raw_drm = unsafe { bindings::drm_dev_alloc(&*vtable, parent.raw_device()) };
let raw_drm = from_kernel_err_ptr(raw_drm)?;
// The reference count is one, and now we take ownership of that reference as a
// drm::device::Device.
let drm = unsafe { drm::device::Device::from_raw(raw_drm) };
Ok(Self {
drm,
registered: false,
vtable,
fops: Default::default(), // TODO: GEM abstraction
_pin: PhantomPinned,
_p: PhantomData,
})
- }
- /// Registers a DRM device with the rest of the kernel.
- ///
- /// Users are encouraged to use the [`drm_device_register!()`] macro because it automatically
- /// picks up the current module.
- pub fn register(
self: Pin<&mut Self>,
data: T::Data,
flags: usize,
module: &'static ThisModule,
- ) -> Result {
if self.registered {
// Already registered.
return Err(EINVAL);
}
// SAFETY: We never move out of `this`.
let this = unsafe { self.get_unchecked_mut() };
let data_pointer = <T::Data as ForeignOwnable>::into_foreign(data);
// SAFETY: `drm` is valid per the type invariant
unsafe {
(*this.drm.raw_mut()).dev_private = data_pointer as *mut _;
}
this.fops.owner = module.0;
this.vtable.fops = &this.fops;
// SAFETY: The device is now initialized and ready to be registered.
let ret = unsafe { bindings::drm_dev_register(this.drm.raw_mut(), flags as u64) };
if ret < 0 {
// SAFETY: `data_pointer` was returned by `into_foreign` above.
unsafe { T::Data::from_foreign(data_pointer) };
return Err(Error::from_kernel_errno(ret));
}
this.registered = true;
Ok(())
- }
- /// Returns a reference to the `Device` instance for this registration.
- pub fn device(&self) -> &drm::device::Device<T> {
&self.drm
- }
+}
+// SAFETY: `Registration` doesn't offer any methods or access to fields when shared between threads +// or CPUs, so it is safe to share it. +unsafe impl<T: Driver> Sync for Registration<T> {}
+// SAFETY: Registration with and unregistration from the drm subsystem can happen from any thread. +// Additionally, `T::Data` (which is dropped during unregistration) is `Send`, so it is ok to move +// `Registration` to different threads. +#[allow(clippy::non_send_fields_in_send_ty)] +unsafe impl<T: Driver> Send for Registration<T> {}
+impl<T: Driver> Drop for Registration<T> {
- /// Removes the registration from the kernel if it has completed successfully before.
- fn drop(&mut self) {
if self.registered {
// Get a pointer to the data stored in device before destroying it.
// SAFETY: `drm` is valid per the type invariant
let data_pointer = unsafe { (*self.drm.raw_mut()).dev_private };
// SAFETY: Since `registered` is true, `self.drm` is both valid and registered.
unsafe { bindings::drm_dev_unregister(self.drm.raw_mut()) };
// Free data as well.
// SAFETY: `data_pointer` was returned by `into_foreign` during registration.
unsafe { <T::Data as ForeignOwnable>::from_foreign(data_pointer) };
}
- }
+} diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs index 9ec6d7cbcaf3..69376b3c6db9 100644 --- a/rust/kernel/drm/mod.rs +++ b/rust/kernel/drm/mod.rs @@ -2,4 +2,6 @@ //! DRM subsystem abstractions. +pub mod device; +pub mod drv; pub mod ioctl;
-- 2.35.1
A DRM File is the DRM counterpart to a kernel file structure, representing an open DRM file descriptor. Add a Rust abstraction to allow drivers to implement their own File types that implement the DriverFile trait.
Signed-off-by: Asahi Lina <lina@asahilina.net> --- rust/bindings/bindings_helper.h | 1 + rust/kernel/drm/drv.rs | 7 ++- rust/kernel/drm/file.rs | 113 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 1 + 4 files changed, 120 insertions(+), 2 deletions(-)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 2a999138c4ae..7d7828faf89c 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -8,6 +8,7 @@
#include <drm/drm_device.h> #include <drm/drm_drv.h> +#include <drm/drm_file.h> #include <drm/drm_ioctl.h> #include <linux/delay.h> #include <linux/device.h> diff --git a/rust/kernel/drm/drv.rs b/rust/kernel/drm/drv.rs index 29a465515dc9..1dcb651e1417 100644 --- a/rust/kernel/drm/drv.rs +++ b/rust/kernel/drm/drv.rs @@ -144,6 +144,9 @@ pub trait Driver { /// Should be either `drm::gem::Object<T>` or `drm::gem::shmem::Object<T>`. type Object: AllocImpl;
+ /// The type used to represent a DRM File (client) + type File: drm::file::DriverFile; + /// Driver metadata const INFO: DriverInfo;
@@ -213,8 +216,8 @@ macro_rules! drm_device_register { impl<T: Driver> Registration<T> { const VTABLE: bindings::drm_driver = drm_legacy_fields! { load: None, - open: None, // TODO: File abstraction - postclose: None, // TODO: File abstraction + open: Some(drm::file::open_callback::<T::File>), + postclose: Some(drm::file::postclose_callback::<T::File>), lastclose: None, unload: None, release: None, diff --git a/rust/kernel/drm/file.rs b/rust/kernel/drm/file.rs new file mode 100644 index 000000000000..48751e93c38a --- /dev/null +++ b/rust/kernel/drm/file.rs @@ -0,0 +1,113 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT + +//! DRM File objects. +//! +//! C header: [`include/linux/drm/drm_file.h`](../../../../include/linux/drm/drm_file.h) + +use crate::{bindings, drm, error::Result}; +use alloc::boxed::Box; +use core::marker::PhantomData; +use core::ops::Deref; + +/// Trait that must be implemented by DRM drivers to represent a DRM File (a client instance). +pub trait DriverFile { + /// The parent `Driver` implementation for this `DriverFile`. + type Driver: drm::drv::Driver; + + /// Open a new file (called when a client opens the DRM device). + fn open(device: &drm::device::Device<Self::Driver>) -> Result<Box<Self>>; +} + +/// An open DRM File. +/// +/// # Invariants +/// `raw` is a valid pointer to a `drm_file` struct.
+#[repr(transparent)] +pub struct File<T: DriverFile> { + raw: *mut bindings::drm_file, + _p: PhantomData<T>, +} + +pub(super) unsafe extern "C" fn open_callback<T: DriverFile>( + raw_dev: *mut bindings::drm_device, + raw_file: *mut bindings::drm_file, +) -> core::ffi::c_int { + let drm = core::mem::ManuallyDrop::new(unsafe { drm::device::Device::from_raw(raw_dev) }); + // SAFETY: This reference won't escape this function + let file = unsafe { &mut *raw_file }; + + let inner = match T::open(&drm) { + Err(e) => { + return e.to_kernel_errno(); + } + Ok(i) => i, + }; + + file.driver_priv = Box::into_raw(inner) as *mut _; + + 0 +} + +pub(super) unsafe extern "C" fn postclose_callback<T: DriverFile>( + _dev: *mut bindings::drm_device, + raw_file: *mut bindings::drm_file, +) { + // SAFETY: This reference won't escape this function + let file = unsafe { &*raw_file }; + + // Drop the DriverFile + unsafe { Box::from_raw(file.driver_priv as *mut T) }; +} + +impl<T: DriverFile> File<T> { + // Not intended to be called externally, except via declare_drm_ioctls!() + #[doc(hidden)] + pub unsafe fn from_raw(raw_file: *mut bindings::drm_file) -> File<T> { + File { + raw: raw_file, + _p: PhantomData, + } + } + + #[allow(dead_code)] + /// Return the raw pointer to the underlying `drm_file`. + pub(super) fn raw(&self) -> *const bindings::drm_file { + self.raw + } + + /// Return an immutable reference to the raw `drm_file` structure. + pub(super) fn file(&self) -> &bindings::drm_file { + unsafe { &*self.raw } + } +} + +impl<T: DriverFile> Deref for File<T> { + type Target = T; + + fn deref(&self) -> &T { + unsafe { &*(self.file().driver_priv as *const T) } + } +} + +impl<T: DriverFile> crate::private::Sealed for File<T> {} + +/// Generic trait to allow users that don't care about driver specifics to accept any File<T>. +/// +/// # Safety +/// Must only be implemented for File<T> and return the pointer, following the normal invariants +/// of that type. 
+pub unsafe trait GenericFile: crate::private::Sealed { + /// Returns the raw const pointer to the `struct drm_file` + fn raw(&self) -> *const bindings::drm_file; + /// Returns the raw mut pointer to the `struct drm_file` + fn raw_mut(&mut self) -> *mut bindings::drm_file; +} + +unsafe impl<T: DriverFile> GenericFile for File<T> { + fn raw(&self) -> *const bindings::drm_file { + self.raw + } + fn raw_mut(&mut self) -> *mut bindings::drm_file { + self.raw + } +} diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs index 69376b3c6db9..a767942d0b52 100644 --- a/rust/kernel/drm/mod.rs +++ b/rust/kernel/drm/mod.rs @@ -4,4 +4,5 @@
pub mod device; pub mod drv; +pub mod file; pub mod ioctl;
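To make the ownership flow in `open_callback`/`postclose_callback` concrete: a `Box` is leaked into the C struct's `driver_priv` void pointer on open, borrowed back out via `Deref` while the file lives, and reconstituted (and thereby freed) on close. A standalone userspace sketch of that round trip, with a mocked-up `drm_file`:

```rust
use std::ffi::c_void;

// Mock of the C `struct drm_file` with its driver-private pointer.
struct DrmFile {
    driver_priv: *mut c_void,
}

struct MyFile {
    id: u32, // per-client driver state
}

// Like open_callback: box up the per-file data and park the raw pointer
// in the C struct. Ownership is now held (untracked) on the C side.
fn open(file: &mut DrmFile, id: u32) {
    let inner = Box::new(MyFile { id });
    file.driver_priv = Box::into_raw(inner) as *mut c_void;
}

// Like File::deref: borrow the data back out of the void pointer.
fn get(file: &DrmFile) -> &MyFile {
    unsafe { &*(file.driver_priv as *const MyFile) }
}

// Like postclose_callback: reconstitute the Box so Rust drops (frees) it.
fn postclose(file: &mut DrmFile) {
    unsafe { drop(Box::from_raw(file.driver_priv as *mut MyFile)) };
    file.driver_priv = std::ptr::null_mut();
}

fn main() {
    let mut file = DrmFile { driver_priv: std::ptr::null_mut() };
    open(&mut file, 42);
    assert_eq!(get(&file).id, 42);
    postclose(&mut file);
    assert!(file.driver_priv.is_null());
    println!("ok");
}
```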
On Tue, 2023-03-07 at 23:25 +0900, Asahi Lina wrote:
+pub(super) unsafe extern "C" fn open_callback<T: DriverFile>( + raw_dev: *mut bindings::drm_device, + raw_file: *mut bindings::drm_file, +) -> core::ffi::c_int { + let drm = core::mem::ManuallyDrop::new(unsafe { drm::device::Device::from_raw(raw_dev) });
Maybe you can help educate me a bit here... This feels like a really sketchy pattern. We're creating a Device from a pointer, an operation which inherently consumes a reference but then marking it ManuallyDrop so drm_device_put() never gets called. It took me a while but I think I figured out what you're trying to do: Make it so all the Rust stuff works with Device, not drm_device but it still feels really wrong. It works, it just feels like there's a lot of unsafe abstraction juggling happening here and I expect this operation is going to be pretty common in the Rust abstraction layer.
Am I missing something?
~Faith
On 10/03/2023 06.16, Faith Ekstrand wrote:
+pub(super) unsafe extern "C" fn open_callback<T: DriverFile>( + raw_dev: *mut bindings::drm_device, + raw_file: *mut bindings::drm_file, +) -> core::ffi::c_int { + let drm = core::mem::ManuallyDrop::new(unsafe { drm::device::Device::from_raw(raw_dev) });
Maybe you can help educate me a bit here... This feels like a really sketchy pattern. We're creating a Device from a pointer, an operation which inherently consumes a reference but then marking it ManuallyDrop so drm_device_put() never gets called. It took me a while but I think I figured out what you're trying to do: Make it so all the Rust stuff works with Device, not drm_device but it still feels really wrong. It works, it just feels like there's a lot of unsafe abstraction juggling happening here and I expect this operation is going to be pretty common in the Rust abstraction layer.
So I think this is going to be a pretty common pattern in this kind of abstraction. The problem is that, of course, in C there is no distinction between an owned reference and a borrowed one. Here we have a borrowed reference to a struct drm_device, and we need to turn it into a &Device (which is the Rust equivalent type). But for &Device to exist we need a Device to exist in the first place, and Device normally implies ownership of the underlying drm_device.
We could just acquire a reference here, but then we're needlessly grabbing a ref only to drop it at the end of the function, which is pointless when the caller is holding another reference for us while the callback runs. And of course Rust likes to claim to offer zero-cost abstractions, so it would be kind of sad to have to do that... ^^
Just doing drm::device::Device::from_raw(raw_dev) is a ticking time bomb, because we haven't acquired a reference (which would normally be required). If that Device ever gets dropped, we've messed up the refcounting and stolen the caller's reference. We could try to ensure it gets passed to core::mem::forget in all paths out, but that gets error-prone very quickly when trying to cover error paths. So instead, we put it into a ManuallyDrop. That takes care of neutering the ref drop, so we don't have to worry about messing that up. Then the only remaining safety requirement is that the ManuallyDrop<Device> never escape the callback function, and that's easy to ensure: we only pass a &ref to the user (which via auto-deref ends up being a &Device), and then nothing bad can happen. If the user wants an owned reference to the device to keep around, they can call .clone() on it and that's when the incref happens.
Basically, ManuallyDrop<T> where T is a reference counted type represents a borrowed reference to a T coming from the C side. You can see another use of this pattern in gem::Object, which contains a ManuallyDrop<Device> that represents a borrowed reference to the device that owns that object. The DRM core (as far as I know!) guarantees that DRM devices outlive all of their GEM objects, so we can materialize a borrowed reference and as long as it never leaves the GEM object, it will be sound. Then we can take &Device references from it whenever we want, and the usual Rust borrow checker rules ensure we can't do something illegal.
~~ Lina
On Fri, 2023-03-10 at 07:16 +0900, Asahi Lina wrote:
+pub(super) unsafe extern "C" fn open_callback<T: DriverFile>( + raw_dev: *mut bindings::drm_device, + raw_file: *mut bindings::drm_file, +) -> core::ffi::c_int { + let drm = core::mem::ManuallyDrop::new(unsafe { drm::device::Device::from_raw(raw_dev) });
Maybe you can help educate me a bit here... This feels like a really sketchy pattern. We're creating a Device from a pointer, an operation which inherently consumes a reference but then marking it ManuallyDrop so drm_device_put() never gets called. It took me a while but I think I figured out what you're trying to do: Make it so all the Rust stuff works with Device, not drm_device but it still feels really wrong. It works, it just feels like there's a lot of unsafe abstraction juggling happening here and I expect this operation is going to be pretty common in the Rust abstraction layer.
So I think this is going to be a pretty common pattern in this kind of abstraction. The problem is that, of course, in C there is no distinction between an owned reference and a borrowed one. Here we have a borrowed reference to a struct drm_device, and we need to turn it into a &Device (which is the Rust equivalent type). But for &Device to exist we need a Device to exist in the first place, and Device normally implies ownership of the underlying drm_device.
Thanks! Putting it in terms of borrow really helps clear up the difference.
We could just acquire a reference here, but then we're needlessly grabbing a ref only to drop it at the end of the function, which is pointless when the caller is holding another reference for us while the callback runs. And of course Rust likes to claim to offer zero-cost abstractions, so it would be kind of sad to have to do that... ^^
Yeah, I agree we don't want to take extra references.
Just doing drm::device::Device::from_raw(raw_dev) is a ticking time bomb, because we haven't acquired a reference (which would normally be required). If that Device ever gets dropped, we've messed up the refcounting and stolen the caller's reference. We could try to ensure it gets passed to core::mem::forget in all paths out, but that gets error-prone very quickly when trying to cover error paths. So instead, we put it into a ManuallyDrop. That takes care of neutering the ref drop, so we don't have to worry about messing that up. Then the only remaining safety requirement is that the ManuallyDrop<Device> never escapes the callback function, and that's easy to ensure: we only pass a &ref to the user (which via auto-deref ends up being a &Device), and then nothing bad can happen. If the user wants an owned reference to the device to keep around, they can call .clone() on it, and that's when the incref happens.
Basically, ManuallyDrop<T> where T is a reference counted type represents a borrowed reference to a T coming from the C side. You can see another use of this pattern in gem::Object, which contains a ManuallyDrop<Device> that represents a borrowed reference to the device that owns that object. The DRM core (as far as I know!) guarantees that DRM devices outlive all of their GEM objects, so we can materialize a borrowed reference and as long as it never leaves the GEM object, it will be sound. Then we can take &Device references from it whenever we want, and the usual Rust borrow checker rules ensure we can't do something illegal.
Ok, that all matches my understanding of what I thought was going on. I do wonder if it would be good to wrap this up in a
struct DeviceBorrow {
    dev: ManuallyDrop<Device>,
}

impl DeviceBorrow {
    pub unsafe fn from_raw(dev: *mut bindings::drm_device) -> DeviceBorrow;
}

impl Deref for DeviceBorrow {
    type Target = Device;
    ...
}
with documentation, etc. Seeing a ManuallyDrop which is never dropped sets my rust senses tingling. Maybe that's too much typing for each object? I don't want to add a bunch of extra work but this seems like a pretty common pattern we're going to hit everywhere.
~Faith
On Mon, Mar 13, 2023 at 12:49:57PM -0500, Faith Ekstrand wrote:
I just want to mention, there is a different way to do the abstraction here:
similar to https://lore.kernel.org/rust-for-linux/ZA9l0EHCRRr%2Fmyoq@boqun-archlinux
* Define Device as a transparent representation of struct drm_device:

    #[repr(transparent)]
    struct Device(Opaque<bindings::drm_device>);
* impl `AlwaysRefCounted`[1] for `Device`, therefore we can use `ARef<Device>`[2] as a smart pointer to `drm_device`.
* drm_device related methods are still implemented in `impl Device`
* In `open_callback`, we can just get a `&Device` from `*mut bindings::drm_device` unsafely, and that's all. Or introduce a helper function if we want:
    pub unsafe fn with_device<F>(ptr: *mut drm_device, f: F) -> Result
    where
        F: FnOnce(&Device) -> Result,
    {
        // SAFETY: with #[repr(transparent)], a valid drm_device pointer
        // can be reborrowed as a &Device.
        let d = unsafe { &*(ptr as *const Device) };
        f(d)
    }
The main difference is that we now treat a pointer to drm_device as a reference to the device, not the owner.
It seems we need to also change our driver/device framework to use this approach, but it looks better to me.
Regards, Boqun
[1]: https://rust-for-linux.github.io/docs/kernel/trait.AlwaysRefCounted.html [2]: https://rust-for-linux.github.io/docs/kernel/struct.ARef.html
On Mon, Mar 13, 2023 at 07:07:09PM -0700, Boqun Feng wrote:
So I really don't have enough rust clue to have any useful opinion on how the glue should look, but semantically the struct drm_file should only ever be borrowed as a parameter to a driver hook, so that rust can guarantee that the driver doesn't do anything funny with it or use it beyond the end of that function. This holds for all the callbacks like open/close and also for all the ioctls.
The other semantic thing is that the ioctls should be able to rely on open having fully constructed the thing. I think the trait and dependent type stuff ensures that?
What I've missed (but maybe just looked in the wrong place) is that the ioctl support (and really anything else where the driver gets a struct drm_file on the C side, but I don't think there is anything else) should also make sure you get the right driver-specific type and not something else.
I did notice the FIXME in the first patch, I guess if it makes landing all this easier we could also keep this as a todo item to improve once things landed. That btw holds for a lot of the big "how to map semantics correctly to rust" questions I'm throwing up here. Maybe a Documentation/gpu/rust.rst file would be good to include, with these todo items noted instead of just FIXME sprinkled in patches? At least for things that will take more effort to polish. -Daniel
The DRM GEM subsystem is the DRM memory management subsystem used by most modern drivers. Add a Rust abstraction to allow Rust DRM driver implementations to use it.
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 rust/bindings/bindings_helper.h |   1 +
 rust/helpers.c                  |  23 +++
 rust/kernel/drm/drv.rs          |   4 +-
 rust/kernel/drm/gem/mod.rs      | 374 ++++++++++++++++++++++++++++++++++++++++
 rust/kernel/drm/mod.rs          |   1 +
 5 files changed, 401 insertions(+), 2 deletions(-)

diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 7d7828faf89c..7183dfe6473f 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -9,6 +9,7 @@
 #include <drm/drm_device.h>
 #include <drm/drm_drv.h>
 #include <drm/drm_file.h>
+#include <drm/drm_gem.h>
 #include <drm/drm_ioctl.h>
 #include <linux/delay.h>
 #include <linux/device.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 73b2ce607f27..78ec4162b03b 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -18,6 +18,7 @@
  * accidentally exposed.
  */

+#include <drm/drm_gem.h>
 #include <linux/bug.h>
 #include <linux/build_bug.h>
 #include <linux/device.h>
@@ -374,6 +375,28 @@ void rust_helper_init_completion(struct completion *c)
 }
 EXPORT_SYMBOL_GPL(rust_helper_init_completion);

+#ifdef CONFIG_DRM
+
+void rust_helper_drm_gem_object_get(struct drm_gem_object *obj)
+{
+	drm_gem_object_get(obj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_gem_object_get);
+
+void rust_helper_drm_gem_object_put(struct drm_gem_object *obj)
+{
+	drm_gem_object_put(obj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_gem_object_put);
+
+__u64 rust_helper_drm_vma_node_offset_addr(struct drm_vma_offset_node *node)
+{
+	return drm_vma_node_offset_addr(node);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_vma_node_offset_addr);
+
+#endif
+
 /*
  * We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
  * as the Rust `usize` type, so we can use it in contexts where Rust
diff --git a/rust/kernel/drm/drv.rs b/rust/kernel/drm/drv.rs
index 1dcb651e1417..c138352cb489 100644
--- a/rust/kernel/drm/drv.rs
+++ b/rust/kernel/drm/drv.rs
@@ -126,7 +126,7 @@ pub struct AllocOps {
 }

 /// Trait for memory manager implementations. Implemented internally.
-pub trait AllocImpl: Sealed {
+pub trait AllocImpl: Sealed + drm::gem::IntoGEMObject {
     /// The C callback operations for this memory manager.
     const ALLOC_OPS: AllocOps;
 }
@@ -263,7 +263,7 @@ impl<T: Driver> Registration<T> {
             drm,
             registered: false,
             vtable,
-            fops: Default::default(), // TODO: GEM abstraction
+            fops: drm::gem::create_fops(),
             _pin: PhantomPinned,
             _p: PhantomData,
         })
diff --git a/rust/kernel/drm/gem/mod.rs b/rust/kernel/drm/gem/mod.rs
new file mode 100644
index 000000000000..8a7d99613718
--- /dev/null
+++ b/rust/kernel/drm/gem/mod.rs
@@ -0,0 +1,374 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+
+//! DRM GEM API
+//!
+//! C header: [`include/linux/drm/drm_gem.h`](../../../../include/linux/drm/drm_gem.h)
+
+use alloc::boxed::Box;
+
+use crate::{
+    bindings,
+    drm::{device, drv, file},
+    error::{to_result, Result},
+    prelude::*,
+};
+use core::{mem, mem::ManuallyDrop, ops::Deref, ops::DerefMut};
+
+/// GEM object functions, which must be implemented by drivers.
+pub trait BaseDriverObject<T: BaseObject>: Sync + Send + Sized {
+    /// Create a new driver data object for a GEM object of a given size.
+    fn new(dev: &device::Device<T::Driver>, size: usize) -> Result<Self>;
+
+    /// Open a new handle to an existing object, associated with a File.
+    fn open(
+        _obj: &<<T as IntoGEMObject>::Driver as drv::Driver>::Object,
+        _file: &file::File<<<T as IntoGEMObject>::Driver as drv::Driver>::File>,
+    ) -> Result {
+        Ok(())
+    }
+
+    /// Close a handle to an existing object, associated with a File.
+    fn close(
+        _obj: &<<T as IntoGEMObject>::Driver as drv::Driver>::Object,
+        _file: &file::File<<<T as IntoGEMObject>::Driver as drv::Driver>::File>,
+    ) {
+    }
+}
+
+/// Trait that represents a GEM object subtype
+pub trait IntoGEMObject: Sized + crate::private::Sealed {
+    /// Owning driver for this type
+    type Driver: drv::Driver;
+
+    /// Returns a pointer to the raw `drm_gem_object` structure, which must be valid as long as
+    /// this owning object is valid.
+    fn gem_obj(&self) -> *mut bindings::drm_gem_object;
+
+    /// Returns a reference to the raw `drm_gem_object` structure, which must be valid as long as
+    /// this owning object is valid.
+    fn gem_ref(&self) -> &bindings::drm_gem_object {
+        // SAFETY: gem_obj() must be valid per the above requirement.
+        unsafe { &*self.gem_obj() }
+    }
+
+    /// Converts a pointer to a `drm_gem_object` into a pointer to this type.
+    fn from_gem_obj(obj: *mut bindings::drm_gem_object) -> *mut Self;
+}
+
+/// Trait which must be implemented by drivers using base GEM objects.
+pub trait DriverObject: BaseDriverObject<Object<Self>> {
+    /// Parent `Driver` for this object.
+    type Driver: drv::Driver;
+}
+
+unsafe extern "C" fn free_callback<T: DriverObject>(obj: *mut bindings::drm_gem_object) {
+    // SAFETY: All of our objects are Object<T>.
+    let this = crate::container_of!(obj, Object<T>, obj) as *mut Object<T>;
+
+    // SAFETY: The pointer we got has to be valid
+    unsafe { bindings::drm_gem_object_release(obj) };
+
+    // SAFETY: All of our objects are allocated via Box<>, and we're in the
+    // free callback which guarantees this object has zero remaining references,
+    // so we can drop it
+    unsafe { Box::from_raw(this) };
+}
+
+unsafe extern "C" fn open_callback<T: BaseDriverObject<U>, U: BaseObject>(
+    raw_obj: *mut bindings::drm_gem_object,
+    raw_file: *mut bindings::drm_file,
+) -> core::ffi::c_int {
+    // SAFETY: The pointer we got has to be valid.
+    let file = unsafe {
+        file::File::<<<U as IntoGEMObject>::Driver as drv::Driver>::File>::from_raw(raw_file)
+    };
+    let obj =
+        <<<U as IntoGEMObject>::Driver as drv::Driver>::Object as IntoGEMObject>::from_gem_obj(
+            raw_obj,
+        );
+
+    // SAFETY: from_gem_obj() returns a valid pointer as long as the type is
+    // correct and the raw_obj we got is valid.
+    match T::open(unsafe { &*obj }, &file) {
+        Err(e) => e.to_kernel_errno(),
+        Ok(()) => 0,
+    }
+}
+
+unsafe extern "C" fn close_callback<T: BaseDriverObject<U>, U: BaseObject>(
+    raw_obj: *mut bindings::drm_gem_object,
+    raw_file: *mut bindings::drm_file,
+) {
+    // SAFETY: The pointer we got has to be valid.
+    let file = unsafe {
+        file::File::<<<U as IntoGEMObject>::Driver as drv::Driver>::File>::from_raw(raw_file)
+    };
+    let obj =
+        <<<U as IntoGEMObject>::Driver as drv::Driver>::Object as IntoGEMObject>::from_gem_obj(
+            raw_obj,
+        );
+
+    // SAFETY: from_gem_obj() returns a valid pointer as long as the type is
+    // correct and the raw_obj we got is valid.
+    T::close(unsafe { &*obj }, &file);
+}
+
+impl<T: DriverObject> IntoGEMObject for Object<T> {
+    type Driver = T::Driver;
+
+    fn gem_obj(&self) -> *mut bindings::drm_gem_object {
+        &self.obj as *const _ as *mut _
+    }
+
+    fn from_gem_obj(obj: *mut bindings::drm_gem_object) -> *mut Object<T> {
+        crate::container_of!(obj, Object<T>, obj) as *mut Object<T>
+    }
+}
+
+/// Base operations shared by all GEM object classes
+pub trait BaseObject: IntoGEMObject {
+    /// Returns the size of the object in bytes.
+    fn size(&self) -> usize {
+        self.gem_ref().size
+    }
+
+    /// Creates a new reference to the object.
+    fn reference(&self) -> ObjectRef<Self> {
+        // SAFETY: Having a reference to an Object implies holding a GEM reference
+        unsafe {
+            bindings::drm_gem_object_get(self.gem_obj());
+        }
+        ObjectRef {
+            ptr: self as *const _,
+        }
+    }
+
+    /// Creates a new handle for the object associated with a given `File`
+    /// (or returns an existing one).
+    fn create_handle(
+        &self,
+        file: &file::File<<<Self as IntoGEMObject>::Driver as drv::Driver>::File>,
+    ) -> Result<u32> {
+        let mut handle: u32 = 0;
+        // SAFETY: The arguments are all valid per the type invariants.
+        to_result(unsafe {
+            bindings::drm_gem_handle_create(file.raw() as *mut _, self.gem_obj(), &mut handle)
+        })?;
+        Ok(handle)
+    }
+
+    /// Looks up an object by its handle for a given `File`.
+    fn lookup_handle(
+        file: &file::File<<<Self as IntoGEMObject>::Driver as drv::Driver>::File>,
+        handle: u32,
+    ) -> Result<ObjectRef<Self>> {
+        // SAFETY: The arguments are all valid per the type invariants.
+        let ptr = unsafe { bindings::drm_gem_object_lookup(file.raw() as *mut _, handle) };
+
+        if ptr.is_null() {
+            Err(ENOENT)
+        } else {
+            Ok(ObjectRef {
+                ptr: ptr as *const _,
+            })
+        }
+    }
+
+    /// Creates an mmap offset to map the object from userspace.
+    fn create_mmap_offset(&self) -> Result<u64> {
+        // SAFETY: The arguments are valid per the type invariant.
+        to_result(unsafe {
+            // TODO: is this threadsafe?
+            bindings::drm_gem_create_mmap_offset(self.gem_obj())
+        })?;
+        Ok(unsafe {
+            bindings::drm_vma_node_offset_addr(&self.gem_ref().vma_node as *const _ as *mut _)
+        })
+    }
+}
+
+impl<T: IntoGEMObject> BaseObject for T {}
+
+/// A base GEM object.
+#[repr(C)]
+pub struct Object<T: DriverObject> {
+    obj: bindings::drm_gem_object,
+    // The DRM core ensures the Device exists as long as its objects exist, so we don't need to
+    // manage the reference count here.
+    dev: ManuallyDrop<device::Device<T::Driver>>,
+    inner: T,
+}
+
+impl<T: DriverObject> Object<T> {
+    /// The size of this object's structure.
+    pub const SIZE: usize = mem::size_of::<Self>();
+
+    const OBJECT_FUNCS: bindings::drm_gem_object_funcs = bindings::drm_gem_object_funcs {
+        free: Some(free_callback::<T>),
+        open: Some(open_callback::<T, Object<T>>),
+        close: Some(close_callback::<T, Object<T>>),
+        print_info: None,
+        export: None,
+        pin: None,
+        unpin: None,
+        get_sg_table: None,
+        vmap: None,
+        vunmap: None,
+        mmap: None,
+        vm_ops: core::ptr::null_mut(),
+    };
+
+    /// Create a new GEM object.
+    pub fn new(dev: &device::Device<T::Driver>, size: usize) -> Result<UniqueObjectRef<Self>> {
+        let mut obj: Box<Self> = Box::try_new(Self {
+            // SAFETY: This struct is expected to be zero-initialized
+            obj: unsafe { mem::zeroed() },
+            // SAFETY: The drm subsystem guarantees that the drm_device will live as long as
+            // the GEM object lives, so we can conjure a reference out of thin air.
+            dev: ManuallyDrop::new(unsafe { device::Device::from_raw(dev.ptr) }),
+            inner: T::new(dev, size)?,
+        })?;
+
+        obj.obj.funcs = &Self::OBJECT_FUNCS;
+        to_result(unsafe {
+            bindings::drm_gem_object_init(dev.raw() as *mut _, &mut obj.obj, size)
+        })?;
+
+        let obj_ref = UniqueObjectRef {
+            ptr: Box::leak(obj),
+        };
+
+        Ok(obj_ref)
+    }
+
+    /// Returns the `Device` that owns this GEM object.
+    pub fn dev(&self) -> &device::Device<T::Driver> {
+        &self.dev
+    }
+}
+
+impl<T: DriverObject> crate::private::Sealed for Object<T> {}
+
+impl<T: DriverObject> Deref for Object<T> {
+    type Target = T;
+
+    fn deref(&self) -> &Self::Target {
+        &self.inner
+    }
+}
+
+impl<T: DriverObject> DerefMut for Object<T> {
+    fn deref_mut(&mut self) -> &mut Self::Target {
+        &mut self.inner
+    }
+}
+
+impl<T: DriverObject> drv::AllocImpl for Object<T> {
+    const ALLOC_OPS: drv::AllocOps = drv::AllocOps {
+        gem_create_object: None,
+        prime_handle_to_fd: Some(bindings::drm_gem_prime_handle_to_fd),
+        prime_fd_to_handle: Some(bindings::drm_gem_prime_fd_to_handle),
+        gem_prime_import: None,
+        gem_prime_import_sg_table: None,
+        gem_prime_mmap: Some(bindings::drm_gem_prime_mmap),
+        dumb_create: None,
+        dumb_map_offset: None,
+        dumb_destroy: None,
+    };
+}
+
+/// A reference-counted shared reference to a base GEM object.
+pub struct ObjectRef<T: IntoGEMObject> {
+    // Invariant: the pointer is valid and initialized, and this ObjectRef owns a reference to it.
+    ptr: *const T,
+}
+
+/// SAFETY: GEM object references are safe to share between threads.
+unsafe impl<T: IntoGEMObject> Send for ObjectRef<T> {}
+unsafe impl<T: IntoGEMObject> Sync for ObjectRef<T> {}
+
+impl<T: IntoGEMObject> Clone for ObjectRef<T> {
+    fn clone(&self) -> Self {
+        self.reference()
+    }
+}
+
+impl<T: IntoGEMObject> Drop for ObjectRef<T> {
+    fn drop(&mut self) {
+        // SAFETY: Having an ObjectRef implies holding a GEM reference.
+        // The free callback will take care of deallocation.
+        unsafe {
+            bindings::drm_gem_object_put((*self.ptr).gem_obj());
+        }
+    }
+}
+
+impl<T: IntoGEMObject> Deref for ObjectRef<T> {
+    type Target = T;
+
+    fn deref(&self) -> &Self::Target {
+        // SAFETY: The pointer is valid per the invariant
+        unsafe { &*self.ptr }
+    }
+}
+
+/// A unique reference to a base GEM object.
+pub struct UniqueObjectRef<T: IntoGEMObject> {
+    // Invariant: the pointer is valid and initialized, and this ObjectRef owns the only reference
+    // to it.
+    ptr: *mut T,
+}
+
+impl<T: IntoGEMObject> UniqueObjectRef<T> {
+    /// Downgrade this reference to a shared reference.
+    pub fn into_ref(self) -> ObjectRef<T> {
+        let ptr = self.ptr as *const _;
+        core::mem::forget(self);
+
+        ObjectRef { ptr }
+    }
+}
+
+impl<T: IntoGEMObject> Drop for UniqueObjectRef<T> {
+    fn drop(&mut self) {
+        // SAFETY: Having a UniqueObjectRef implies holding a GEM
+        // reference. The free callback will take care of deallocation.
+        unsafe {
+            bindings::drm_gem_object_put((*self.ptr).gem_obj());
+        }
+    }
+}
+
+impl<T: IntoGEMObject> Deref for UniqueObjectRef<T> {
+    type Target = T;
+
+    fn deref(&self) -> &Self::Target {
+        // SAFETY: The pointer is valid per the invariant
+        unsafe { &*self.ptr }
+    }
+}
+
+impl<T: IntoGEMObject> DerefMut for UniqueObjectRef<T> {
+    fn deref_mut(&mut self) -> &mut Self::Target {
+        // SAFETY: The pointer is valid per the invariant
+        unsafe { &mut *self.ptr }
+    }
+}
+
+pub(super) fn create_fops() -> bindings::file_operations {
+    bindings::file_operations {
+        owner: core::ptr::null_mut(),
+        open: Some(bindings::drm_open),
+        release: Some(bindings::drm_release),
+        unlocked_ioctl: Some(bindings::drm_ioctl),
+        #[cfg(CONFIG_COMPAT)]
+        compat_ioctl: Some(bindings::drm_compat_ioctl),
+        #[cfg(not(CONFIG_COMPAT))]
+        compat_ioctl: None,
+        poll: Some(bindings::drm_poll),
+        read: Some(bindings::drm_read),
+        llseek: Some(bindings::noop_llseek),
+        mmap: Some(bindings::drm_gem_mmap),
+        ..Default::default()
+    }
+}
diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
index a767942d0b52..c44760a1332f 100644
--- a/rust/kernel/drm/mod.rs
+++ b/rust/kernel/drm/mod.rs
@@ -5,4 +5,5 @@
 pub mod device;
 pub mod drv;
 pub mod file;
+pub mod gem;
 pub mod ioctl;
Meta: I'm trying to unblock myself by limiting each reply to a narrow-ish topic. Otherwise it's just too much. Here's the first.
On Tue, Mar 07, 2023 at 11:25:29PM +0900, Asahi Lina wrote:
The DRM GEM subsystem is the DRM memory management subsystem used by most modern drivers. Add a Rust abstraction to allow Rust DRM driver implementations to use it.
Signed-off-by: Asahi Lina <lina@asahilina.net>
rust/bindings/bindings_helper.h | 1 + rust/helpers.c | 23 +++ rust/kernel/drm/drv.rs | 4 +- rust/kernel/drm/gem/mod.rs | 374 ++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 1 + 5 files changed, 401 insertions(+), 2 deletions(-)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 7d7828faf89c..7183dfe6473f 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -9,6 +9,7 @@ #include <drm/drm_device.h> #include <drm/drm_drv.h> #include <drm/drm_file.h> +#include <drm/drm_gem.h> #include <drm/drm_ioctl.h> #include <linux/delay.h> #include <linux/device.h> diff --git a/rust/helpers.c b/rust/helpers.c index 73b2ce607f27..78ec4162b03b 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -18,6 +18,7 @@
 * accidentally exposed.
 */

+#include <drm/drm_gem.h>
 #include <linux/bug.h>
 #include <linux/build_bug.h>
 #include <linux/device.h>
@@ -374,6 +375,28 @@ void rust_helper_init_completion(struct completion *c)
 }
 EXPORT_SYMBOL_GPL(rust_helper_init_completion);

+#ifdef CONFIG_DRM
+
+void rust_helper_drm_gem_object_get(struct drm_gem_object *obj)
+{
+	drm_gem_object_get(obj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_gem_object_get);
+
+void rust_helper_drm_gem_object_put(struct drm_gem_object *obj)
+{
+	drm_gem_object_put(obj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_gem_object_put);
+
+__u64 rust_helper_drm_vma_node_offset_addr(struct drm_vma_offset_node *node)
+{
+	return drm_vma_node_offset_addr(node);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_vma_node_offset_addr);
Uh, all the Rust helper wrappers for the whole kernel in a single file does not sound good. Can we not split these up per subsystem, and then, instead of sprinkling #ifdefs all over a .c file, make the compilation of each file conditional on Rust support (plus whatever Kconfig gate the corresponding C code already has)?
Otherwise if rust adoption picks up there's going to be endless amounts of cross-subsystem conflicts.
Also similarly, can we perhaps split up the bindings_helper.h file in a per-subsystem way?
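Meta: one way the split suggested here could look, as a kbuild sketch. The file names and the `helpers-` variable are assumptions for illustration, not an agreed layout:

```make
# Hypothetical rust/Makefile fragment: one helpers file per subsystem,
# compiled only when that subsystem's existing Kconfig gate is enabled,
# instead of #ifdef blocks inside a single helpers.c.
helpers-y                               += helpers/base.o
helpers-$(CONFIG_DRM)                   += helpers/drm.o
helpers-$(CONFIG_DRM_GEM_SHMEM_HELPER)  += helpers/drm_gem_shmem.o
```

The same per-subsystem layout would apply to the bindgen headers, with a small per-subsystem header included from `bindings_helper.h` (or replacing it) under the same gates.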
+#endif
+
 /*
  * We use `bindgen`'s `--size_t-is-usize` option to bind the C `size_t` type
  * as the Rust `usize` type, so we can use it in contexts where Rust
diff --git a/rust/kernel/drm/drv.rs b/rust/kernel/drm/drv.rs index 1dcb651e1417..c138352cb489 100644 --- a/rust/kernel/drm/drv.rs +++ b/rust/kernel/drm/drv.rs @@ -126,7 +126,7 @@ pub struct AllocOps {
Similarly, I guess this needs to all live under rust/ for Rust reasons. I'm assuming the plan is that Rust patches in here get acked/reviewed by Rust people, but then merged through the DRM subsystem? At least long term, I think that's the least painful way.
Meaning we need a MAINTAINERS entry for rust/kernel/drm which adds dri-devel for review and the usual git repos somewhere earlier in the series. -Daniel
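Meta: such an interim entry might look like the following; the entry name is an assumption for illustration, and `M:` lines are omitted here rather than invented:

```
DRM DRIVERS AND KERNEL RUST ABSTRACTIONS
L:	dri-devel@lists.freedesktop.org
L:	rust-for-linux@vger.kernel.org
T:	git git://anongit.freedesktop.org/drm/drm-misc
F:	rust/kernel/drm/
```

That way both lists get Cc'd automatically by get_maintainer.pl while the code still lives under rust/.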
 }

 /// Trait for memory manager implementations. Implemented internally.
-pub trait AllocImpl: Sealed {
+pub trait AllocImpl: Sealed + drm::gem::IntoGEMObject {
     /// The C callback operations for this memory manager.
     const ALLOC_OPS: AllocOps;
 }
@@ -263,7 +263,7 @@ impl<T: Driver> Registration<T> {
             drm,
             registered: false,
             vtable,
-            fops: Default::default(), // TODO: GEM abstraction
+            fops: drm::gem::create_fops(),
             _pin: PhantomPinned,
             _p: PhantomData,
         })
diff --git a/rust/kernel/drm/gem/mod.rs b/rust/kernel/drm/gem/mod.rs
new file mode 100644
index 000000000000..8a7d99613718
--- /dev/null
+++ b/rust/kernel/drm/gem/mod.rs
@@ -0,0 +1,374 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+
+//! DRM GEM API
+//!
+//! C header: [`include/linux/drm/drm_gem.h`](../../../../include/linux/drm/drm_gem.h)
+
+use alloc::boxed::Box;
+
+use crate::{
+    bindings,
+    drm::{device, drv, file},
+    error::{to_result, Result},
+    prelude::*,
+};
+use core::{mem, mem::ManuallyDrop, ops::Deref, ops::DerefMut};
+
+/// GEM object functions, which must be implemented by drivers.
+pub trait BaseDriverObject<T: BaseObject>: Sync + Send + Sized {
+    /// Create a new driver data object for a GEM object of a given size.
+    fn new(dev: &device::Device<T::Driver>, size: usize) -> Result<Self>;
+
+    /// Open a new handle to an existing object, associated with a File.
+    fn open(
+        _obj: &<<T as IntoGEMObject>::Driver as drv::Driver>::Object,
+        _file: &file::File<<<T as IntoGEMObject>::Driver as drv::Driver>::File>,
+    ) -> Result {
+        Ok(())
+    }
+
+    /// Close a handle to an existing object, associated with a File.
+    fn close(
+        _obj: &<<T as IntoGEMObject>::Driver as drv::Driver>::Object,
+        _file: &file::File<<<T as IntoGEMObject>::Driver as drv::Driver>::File>,
+    ) {
+    }
+}
+
+/// Trait that represents a GEM object subtype
+pub trait IntoGEMObject: Sized + crate::private::Sealed {
+    /// Owning driver for this type
+    type Driver: drv::Driver;
+
+    /// Returns a pointer to the raw `drm_gem_object` structure, which must be valid as long as
+    /// this owning object is valid.
+    fn gem_obj(&self) -> *mut bindings::drm_gem_object;
+
+    /// Returns a reference to the raw `drm_gem_object` structure, which must be valid as long as
+    /// this owning object is valid.
+    fn gem_ref(&self) -> &bindings::drm_gem_object {
+        // SAFETY: gem_obj() must be valid per the above requirement.
+        unsafe { &*self.gem_obj() }
+    }
+
+    /// Converts a pointer to a `drm_gem_object` into a pointer to this type.
+    fn from_gem_obj(obj: *mut bindings::drm_gem_object) -> *mut Self;
+}
+
+/// Trait which must be implemented by drivers using base GEM objects.
+pub trait DriverObject: BaseDriverObject<Object<Self>> {
+    /// Parent `Driver` for this object.
+    type Driver: drv::Driver;
+}
+
+unsafe extern "C" fn free_callback<T: DriverObject>(obj: *mut bindings::drm_gem_object) {
+    // SAFETY: All of our objects are Object<T>.
+    let this = crate::container_of!(obj, Object<T>, obj) as *mut Object<T>;
+
+    // SAFETY: The pointer we got has to be valid
+    unsafe { bindings::drm_gem_object_release(obj) };
+
+    // SAFETY: All of our objects are allocated via Box<>, and we're in the
+    // free callback which guarantees this object has zero remaining references,
+    // so we can drop it
+    unsafe { Box::from_raw(this) };
+}
+
+unsafe extern "C" fn open_callback<T: BaseDriverObject<U>, U: BaseObject>(
+    raw_obj: *mut bindings::drm_gem_object,
+    raw_file: *mut bindings::drm_file,
+) -> core::ffi::c_int {
+    // SAFETY: The pointer we got has to be valid.
+    let file = unsafe {
+        file::File::<<<U as IntoGEMObject>::Driver as drv::Driver>::File>::from_raw(raw_file)
+    };
+    let obj =
+        <<<U as IntoGEMObject>::Driver as drv::Driver>::Object as IntoGEMObject>::from_gem_obj(
+            raw_obj,
+        );
+
+    // SAFETY: from_gem_obj() returns a valid pointer as long as the type is
+    // correct and the raw_obj we got is valid.
+    match T::open(unsafe { &*obj }, &file) {
+        Err(e) => e.to_kernel_errno(),
+        Ok(()) => 0,
+    }
+}
+
+unsafe extern "C" fn close_callback<T: BaseDriverObject<U>, U: BaseObject>(
+    raw_obj: *mut bindings::drm_gem_object,
+    raw_file: *mut bindings::drm_file,
+) {
+    // SAFETY: The pointer we got has to be valid.
+    let file = unsafe {
+        file::File::<<<U as IntoGEMObject>::Driver as drv::Driver>::File>::from_raw(raw_file)
+    };
+    let obj =
+        <<<U as IntoGEMObject>::Driver as drv::Driver>::Object as IntoGEMObject>::from_gem_obj(
+            raw_obj,
+        );
+
+    // SAFETY: from_gem_obj() returns a valid pointer as long as the type is
+    // correct and the raw_obj we got is valid.
+    T::close(unsafe { &*obj }, &file);
+}
+impl<T: DriverObject> IntoGEMObject for Object<T> {
+    type Driver = T::Driver;
+
+    fn gem_obj(&self) -> *mut bindings::drm_gem_object {
+        &self.obj as *const _ as *mut _
+    }
+
+    fn from_gem_obj(obj: *mut bindings::drm_gem_object) -> *mut Object<T> {
+        crate::container_of!(obj, Object<T>, obj) as *mut Object<T>
+    }
+}
+
+/// Base operations shared by all GEM object classes
+pub trait BaseObject: IntoGEMObject {
+    /// Returns the size of the object in bytes.
+    fn size(&self) -> usize {
+        self.gem_ref().size
+    }
+
+    /// Creates a new reference to the object.
+    fn reference(&self) -> ObjectRef<Self> {
+        // SAFETY: Having a reference to an Object implies holding a GEM reference
+        unsafe {
+            bindings::drm_gem_object_get(self.gem_obj());
+        }
+        ObjectRef {
+            ptr: self as *const _,
+        }
+    }
+
+    /// Creates a new handle for the object associated with a given `File`
+    /// (or returns an existing one).
+    fn create_handle(
+        &self,
+        file: &file::File<<<Self as IntoGEMObject>::Driver as drv::Driver>::File>,
+    ) -> Result<u32> {
+        let mut handle: u32 = 0;
+        // SAFETY: The arguments are all valid per the type invariants.
+        to_result(unsafe {
+            bindings::drm_gem_handle_create(file.raw() as *mut _, self.gem_obj(), &mut handle)
+        })?;
+        Ok(handle)
+    }
+
+    /// Looks up an object by its handle for a given `File`.
+    fn lookup_handle(
+        file: &file::File<<<Self as IntoGEMObject>::Driver as drv::Driver>::File>,
+        handle: u32,
+    ) -> Result<ObjectRef<Self>> {
+        // SAFETY: The arguments are all valid per the type invariants.
+        let ptr = unsafe { bindings::drm_gem_object_lookup(file.raw() as *mut _, handle) };
+
+        if ptr.is_null() {
+            Err(ENOENT)
+        } else {
+            Ok(ObjectRef {
+                ptr: ptr as *const _,
+            })
+        }
+    }
+
+    /// Creates an mmap offset to map the object from userspace.
+    fn create_mmap_offset(&self) -> Result<u64> {
+        // SAFETY: The arguments are valid per the type invariant.
+        to_result(unsafe {
+            // TODO: is this threadsafe?
+            bindings::drm_gem_create_mmap_offset(self.gem_obj())
+        })?;
+        Ok(unsafe {
+            bindings::drm_vma_node_offset_addr(&self.gem_ref().vma_node as *const _ as *mut _)
+        })
+    }
+}
+
+impl<T: IntoGEMObject> BaseObject for T {}
+/// A base GEM object.
+#[repr(C)]
+pub struct Object<T: DriverObject> {
+    obj: bindings::drm_gem_object,
+    // The DRM core ensures the Device exists as long as its objects exist, so we don't need to
+    // manage the reference count here.
+    dev: ManuallyDrop<device::Device<T::Driver>>,
+    inner: T,
+}
+
+impl<T: DriverObject> Object<T> {
+    /// The size of this object's structure.
+    pub const SIZE: usize = mem::size_of::<Self>();
+
+    const OBJECT_FUNCS: bindings::drm_gem_object_funcs = bindings::drm_gem_object_funcs {
+        free: Some(free_callback::<T>),
+        open: Some(open_callback::<T, Object<T>>),
+        close: Some(close_callback::<T, Object<T>>),
+        print_info: None,
+        export: None,
+        pin: None,
+        unpin: None,
+        get_sg_table: None,
+        vmap: None,
+        vunmap: None,
+        mmap: None,
+        vm_ops: core::ptr::null_mut(),
+    };
+
+    /// Create a new GEM object.
+    pub fn new(dev: &device::Device<T::Driver>, size: usize) -> Result<UniqueObjectRef<Self>> {
+        let mut obj: Box<Self> = Box::try_new(Self {
+            // SAFETY: This struct is expected to be zero-initialized
+            obj: unsafe { mem::zeroed() },
+            // SAFETY: The drm subsystem guarantees that the drm_device will live as long as
+            // the GEM object lives, so we can conjure a reference out of thin air.
+            dev: ManuallyDrop::new(unsafe { device::Device::from_raw(dev.ptr) }),
+            inner: T::new(dev, size)?,
+        })?;
+
+        obj.obj.funcs = &Self::OBJECT_FUNCS;
+        to_result(unsafe {
+            bindings::drm_gem_object_init(dev.raw() as *mut _, &mut obj.obj, size)
+        })?;
+
+        let obj_ref = UniqueObjectRef {
+            ptr: Box::leak(obj),
+        };
+
+        Ok(obj_ref)
+    }
+
+    /// Returns the `Device` that owns this GEM object.
+    pub fn dev(&self) -> &device::Device<T::Driver> {
+        &self.dev
+    }
+}
+impl<T: DriverObject> crate::private::Sealed for Object<T> {}
+
+impl<T: DriverObject> Deref for Object<T> {
+    type Target = T;
+
+    fn deref(&self) -> &Self::Target {
+        &self.inner
+    }
+}
+
+impl<T: DriverObject> DerefMut for Object<T> {
+    fn deref_mut(&mut self) -> &mut Self::Target {
+        &mut self.inner
+    }
+}
+
+impl<T: DriverObject> drv::AllocImpl for Object<T> {
+    const ALLOC_OPS: drv::AllocOps = drv::AllocOps {
+        gem_create_object: None,
+        prime_handle_to_fd: Some(bindings::drm_gem_prime_handle_to_fd),
+        prime_fd_to_handle: Some(bindings::drm_gem_prime_fd_to_handle),
+        gem_prime_import: None,
+        gem_prime_import_sg_table: None,
+        gem_prime_mmap: Some(bindings::drm_gem_prime_mmap),
+        dumb_create: None,
+        dumb_map_offset: None,
+        dumb_destroy: None,
+    };
+}
+
+/// A reference-counted shared reference to a base GEM object.
+pub struct ObjectRef<T: IntoGEMObject> {
+    // Invariant: the pointer is valid and initialized, and this ObjectRef owns a reference to it.
+    ptr: *const T,
+}
+
+/// SAFETY: GEM object references are safe to share between threads.
+unsafe impl<T: IntoGEMObject> Send for ObjectRef<T> {}
+unsafe impl<T: IntoGEMObject> Sync for ObjectRef<T> {}
+
+impl<T: IntoGEMObject> Clone for ObjectRef<T> {
+    fn clone(&self) -> Self {
+        self.reference()
+    }
+}
+
+impl<T: IntoGEMObject> Drop for ObjectRef<T> {
+    fn drop(&mut self) {
+        // SAFETY: Having an ObjectRef implies holding a GEM reference.
+        // The free callback will take care of deallocation.
+        unsafe {
+            bindings::drm_gem_object_put((*self.ptr).gem_obj());
+        }
+    }
+}
+
+impl<T: IntoGEMObject> Deref for ObjectRef<T> {
+    type Target = T;
+
+    fn deref(&self) -> &Self::Target {
+        // SAFETY: The pointer is valid per the invariant
+        unsafe { &*self.ptr }
+    }
+}
+
+/// A unique reference to a base GEM object.
+pub struct UniqueObjectRef<T: IntoGEMObject> {
+    // Invariant: the pointer is valid and initialized, and this ObjectRef owns the only reference
+    // to it.
+    ptr: *mut T,
+}
+
+impl<T: IntoGEMObject> UniqueObjectRef<T> {
+    /// Downgrade this reference to a shared reference.
+    pub fn into_ref(self) -> ObjectRef<T> {
+        let ptr = self.ptr as *const _;
+        core::mem::forget(self);
+
+        ObjectRef { ptr }
+    }
+}
+
+impl<T: IntoGEMObject> Drop for UniqueObjectRef<T> {
+    fn drop(&mut self) {
+        // SAFETY: Having a UniqueObjectRef implies holding a GEM
+        // reference. The free callback will take care of deallocation.
+        unsafe {
+            bindings::drm_gem_object_put((*self.ptr).gem_obj());
+        }
+    }
+}
+
+impl<T: IntoGEMObject> Deref for UniqueObjectRef<T> {
+    type Target = T;
+
+    fn deref(&self) -> &Self::Target {
+        // SAFETY: The pointer is valid per the invariant
+        unsafe { &*self.ptr }
+    }
+}
+
+impl<T: IntoGEMObject> DerefMut for UniqueObjectRef<T> {
+    fn deref_mut(&mut self) -> &mut Self::Target {
+        // SAFETY: The pointer is valid per the invariant
+        unsafe { &mut *self.ptr }
+    }
+}
+pub(super) fn create_fops() -> bindings::file_operations {
+    bindings::file_operations {
+        owner: core::ptr::null_mut(),
+        open: Some(bindings::drm_open),
+        release: Some(bindings::drm_release),
+        unlocked_ioctl: Some(bindings::drm_ioctl),
+        #[cfg(CONFIG_COMPAT)]
+        compat_ioctl: Some(bindings::drm_compat_ioctl),
+        #[cfg(not(CONFIG_COMPAT))]
+        compat_ioctl: None,
+        poll: Some(bindings::drm_poll),
+        read: Some(bindings::drm_read),
+        llseek: Some(bindings::noop_llseek),
+        mmap: Some(bindings::drm_gem_mmap),
+        ..Default::default()
+    }
+}
diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
index a767942d0b52..c44760a1332f 100644
--- a/rust/kernel/drm/mod.rs
+++ b/rust/kernel/drm/mod.rs
@@ -5,4 +5,5 @@
 pub mod device;
 pub mod drv;
 pub mod file;
+pub mod gem;
 pub mod ioctl;
-- 2.35.1
On Wed, Apr 5, 2023 at 1:08 PM Daniel Vetter daniel@ffwll.ch wrote:
Uh all the rust helper wrappers for all the kernel in a single file does not sound good. Can we not split these up into each subsystem, and then maybe instead of sprinkling #ifdef all over a .c file Make the compilation of that file conditional on rust support (plus whatever other Kconfig gate the other c files has already)?
Indeed, the plan is splitting the `kernel` crate and giving each subsystem its own crate, bindings, helpers, etc.
Cheers, Miguel
On Wed, Apr 05, 2023 at 01:19:47PM +0200, Miguel Ojeda wrote:
On Wed, Apr 5, 2023 at 1:08 PM Daniel Vetter daniel@ffwll.ch wrote:
Uh all the rust helper wrappers for all the kernel in a single file does not sound good. Can we not split these up into each subsystem, and then maybe instead of sprinkling #ifdef all over a .c file Make the compilation of that file conditional on rust support (plus whatever other Kconfig gate the other c files has already)?
Indeed, the plan is splitting the `kernel` crate and giving each subsystem its own crate, bindings, helpers, etc.
Ok if this is just interim I think it's fine. Would still be good to have the MAINTAINERS entry though even just to cover the interim state. Least because I'm assuming that when things are split up you'd still want to keep the rust list on cc for the rust parts, even when they move into subsystems? -Daniel
On Wed, Apr 5, 2023 at 1:23 PM Daniel Vetter daniel@ffwll.ch wrote:
Ok if this is just interim I think it's fine. Would still be good to have the MAINTAINERS entry though even just to cover the interim state. Least because I'm assuming that when things are split up you'd still want to keep the rust list on cc for the rust parts, even when they move into subsystems?
Sorry, I missed replying to the second part of your email -- replying here.
Currently, the subsystem's code is under `rust/` (though modules can go already into other folders). One of the reasons was technical simplicity, and a nice side effect is that we could bootstrap things while getting C maintainers involved over time.
To accomplish that, the guidelines for contributing Rust code are that the respective maintainers need to be at least Cc'd, even if the files do not hit the `F:` fields for the time being -- see [1]. But, for us, ideally, the maintainers will take the changes through their tree, instead of going through the Rust one, since that is the end goal.
And, of course, if you already want to have `F:` fields for the Rust code, that is even better! (Whether those should be in the same entry or in a new one, it is up to you, of course, and whether it is a different set of people / level of support / etc.)
Then, when the `kernel` crate split happens, we can move the code directly under whatever folders it should be naturally, when their maintainers are ready. For some subsystems, that may mean they do not need any `F:` fields since they are already covered (e.g. if they did not create a new entry for Rust code only). And for cases like yours, where you already had `F:` fields, it means the move of the files can be done right away as soon as the split happens.
In short, we would definitely welcome if you add `F:` fields already (whether in existing or new entries) -- it would mean you are ahead of the curve! :)
As for the mailing list, yes, for the time being, I ask that all changes to please be sent to the Rust list, so that everybody that wants to follow the Rust progress has everything in a single place, so that we try to remain consistent in the beginning on e.g. coding guidelines, so that Rust reviewers can help spot mistakes, and so on and so forth.
But, as Rust grows in the kernel, as systems become non-experimental, and as maintainers take ownership of the code, that should eventually go away and let things be as usual with C code. Then the Rust subsystem (and its list) will become smaller, and it will be the subsystem (and the discussion place) for anything not covered by other subsystems, such as core Rust abstractions and types, Rust infrastructure and so on.
How does that sound?
[1] https://rust-for-linux.com/contributing#the-rust-subsystem (I may reorganize this to be Rust's `P:` field, by the way)
Cheers, Miguel
On Wed, Apr 05, 2023 at 02:32:12PM +0200, Miguel Ojeda wrote:
On Wed, Apr 5, 2023 at 1:23 PM Daniel Vetter daniel@ffwll.ch wrote:
Ok if this is just interim I think it's fine. Would still be good to have the MAINTAINERS entry though even just to cover the interim state. Least because I'm assuming that when things are split up you'd still want to keep the rust list on cc for the rust parts, even when they move into subsystems?
Sorry, I missed replying to the second part of your email -- replying here.
Currently, the subsystem's code is under `rust/` (though modules can go already into other folders). One of the reasons was technical simplicity, and a nice side effect is that we could bootstrap things while getting C maintainers involved over time.
To accomplish that, the guidelines for contributing Rust code are that the respective maintainers need to be at least Cc'd, even if the files do not hit the `F:` fields for the time being -- see [1]. But, for us, ideally, the maintainers will take the changes through their tree, instead of going through the Rust one, since that is the end goal.
And, of course, if you already want to have `F:` fields for the Rust code, that is even better! (Whether those should be in the same entry or in a new one, it is up to you, of course, and whether it is a different set of people / level of support / etc.)
Then, when the `kernel` crate split happens, we can move the code directly under whatever folders it should be naturally, when their maintainers are ready. For some subsystems, that may mean they do not need any `F:` fields since they are already covered (e.g. if they did not create a new entry for Rust code only). And for cases like yours, where you already had `F:` fields, it means the move of the files can be done right away as soon as the split happens.
In short, we would definitely welcome if you add `F:` fields already (whether in existing or new entries) -- it would mean you are ahead of the curve! :)
As for the mailing list, yes, for the time being, I ask that all changes to please be sent to the Rust list, so that everybody that wants to follow the Rust progress has everything in a single place, so that we try to remain consistent in the beginning on e.g. coding guidelines, so that Rust reviewers can help spot mistakes, and so on and so forth.
But, as Rust grows in the kernel, as systems become non-experimental, and as maintainers take ownership of the code, that should eventually go away and let things be as usual with C code. Then the Rust subsystem (and its list) will become smaller, and it will be the subsystem (and the discussion place) for anything not covered by other subsystems, such as core Rust abstractions and types, Rust infrastructure and so on.
How does that sound?
Yeah sounds all great!
I think interim at least a separate rust drm entry would be good, to make sure we always cc both rust and dri-devel. Once it's too much for you and you generally trust the dri-devel folks to not design stupid interfaces, we can then drop that and only ping rust folks when needed. I do expect that's some years out though. -Daniel
[1] https://rust-for-linux.com/contributing#the-rust-subsystem (I may reorganize this to be Rust's `P:` field, by the way)
Cheers, Miguel
There doesn't seem to be a way for the Rust bindings to get a compile-time constant reference to drm_gem_shmem_vm_ops, so we need to duplicate that structure in Rust... this isn't nice...
Signed-off-by: Asahi Lina <lina@asahilina.net> --- drivers/gpu/drm/drm_gem_shmem_helper.c | 9 ++++++--- include/drm/drm_gem_shmem_helper.h | 3 +++ 2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c index 75185a960fc4..10c09819410e 100644 --- a/drivers/gpu/drm/drm_gem_shmem_helper.c +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c @@ -534,7 +534,7 @@ int drm_gem_shmem_dumb_create(struct drm_file *file, struct drm_device *dev, } EXPORT_SYMBOL_GPL(drm_gem_shmem_dumb_create);
-static vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf) +vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct drm_gem_object *obj = vma->vm_private_data; @@ -563,8 +563,9 @@ static vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf)
return ret; } +EXPORT_SYMBOL_GPL(drm_gem_shmem_fault);
-static void drm_gem_shmem_vm_open(struct vm_area_struct *vma) +void drm_gem_shmem_vm_open(struct vm_area_struct *vma) { struct drm_gem_object *obj = vma->vm_private_data; struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj); @@ -585,8 +586,9 @@ static void drm_gem_shmem_vm_open(struct vm_area_struct *vma)
drm_gem_vm_open(vma); } +EXPORT_SYMBOL_GPL(drm_gem_shmem_vm_open);
-static void drm_gem_shmem_vm_close(struct vm_area_struct *vma) +void drm_gem_shmem_vm_close(struct vm_area_struct *vma) { struct drm_gem_object *obj = vma->vm_private_data; struct drm_gem_shmem_object *shmem = to_drm_gem_shmem_obj(obj); @@ -594,6 +596,7 @@ static void drm_gem_shmem_vm_close(struct vm_area_struct *vma) drm_gem_shmem_put_pages(shmem); drm_gem_vm_close(vma); } +EXPORT_SYMBOL_GPL(drm_gem_shmem_vm_close);
const struct vm_operations_struct drm_gem_shmem_vm_ops = { .fault = drm_gem_shmem_fault, diff --git a/include/drm/drm_gem_shmem_helper.h b/include/drm/drm_gem_shmem_helper.h index a2201b2488c5..b9f349b3ed76 100644 --- a/include/drm/drm_gem_shmem_helper.h +++ b/include/drm/drm_gem_shmem_helper.h @@ -138,6 +138,9 @@ void drm_gem_shmem_print_info(const struct drm_gem_shmem_object *shmem, struct drm_printer *p, unsigned int indent);
extern const struct vm_operations_struct drm_gem_shmem_vm_ops; +vm_fault_t drm_gem_shmem_fault(struct vm_fault *vmf); +void drm_gem_shmem_vm_open(struct vm_area_struct *vma); +void drm_gem_shmem_vm_close(struct vm_area_struct *vma);
/* * GEM object functions
The DRM shmem helper includes common code useful for drivers which allocate GEM objects as anonymous shmem. Add a Rust abstraction for this. Drivers can choose the raw GEM implementation or the shmem layer, depending on their needs.
Signed-off-by: Asahi Lina <lina@asahilina.net> --- drivers/gpu/drm/Kconfig | 5 + rust/bindings/bindings_helper.h | 2 + rust/helpers.c | 67 +++++++ rust/kernel/drm/gem/mod.rs | 3 + rust/kernel/drm/gem/shmem.rs | 381 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 458 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index dab8f0f9aa96..70a983a17ac2 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -34,6 +34,11 @@ config RUST_DRM bool "Rust support for the DRM subsystem" depends on DRM=y
+config RUST_DRM_GEM_SHMEM_HELPER + bool + depends on RUST_DRM + select DRM_GEM_SHMEM_HELPER + config DRM_MIPI_DBI tristate depends on DRM diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 7183dfe6473f..9f152d373df8 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -10,6 +10,7 @@ #include <drm/drm_drv.h> #include <drm/drm_file.h> #include <drm/drm_gem.h> +#include <drm/drm_gem_shmem_helper.h> #include <drm/drm_ioctl.h> #include <linux/delay.h> #include <linux/device.h> @@ -17,6 +18,7 @@ #include <linux/fs.h> #include <linux/ioctl.h> #include <linux/io-pgtable.h> +#include <linux/iosys-map.h> #include <linux/ktime.h> #include <linux/of.h> #include <linux/of_address.h> diff --git a/rust/helpers.c b/rust/helpers.c index 78ec4162b03b..388ff1100ea5 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -19,6 +19,7 @@ */
#include <drm/drm_gem.h> +#include <drm/drm_gem_shmem_helper.h> #include <linux/bug.h> #include <linux/build_bug.h> #include <linux/device.h> @@ -375,6 +376,18 @@ void rust_helper_init_completion(struct completion *c) } EXPORT_SYMBOL_GPL(rust_helper_init_completion);
+dma_addr_t rust_helper_sg_dma_address(const struct scatterlist *sg) +{ + return sg_dma_address(sg); +} +EXPORT_SYMBOL_GPL(rust_helper_sg_dma_address); + +int rust_helper_sg_dma_len(const struct scatterlist *sg) +{ + return sg_dma_len(sg); +} +EXPORT_SYMBOL_GPL(rust_helper_sg_dma_len); + #ifdef CONFIG_DRM
void rust_helper_drm_gem_object_get(struct drm_gem_object *obj) @@ -395,6 +408,60 @@ __u64 rust_helper_drm_vma_node_offset_addr(struct drm_vma_offset_node *node) } EXPORT_SYMBOL_GPL(rust_helper_drm_vma_node_offset_addr);
+#ifdef CONFIG_DRM_GEM_SHMEM_HELPER + +void rust_helper_drm_gem_shmem_object_free(struct drm_gem_object *obj) +{ + return drm_gem_shmem_object_free(obj); +} +EXPORT_SYMBOL_GPL(rust_helper_drm_gem_shmem_object_free); + +void rust_helper_drm_gem_shmem_object_print_info(struct drm_printer *p, unsigned int indent, + const struct drm_gem_object *obj) +{ + drm_gem_shmem_object_print_info(p, indent, obj); +} +EXPORT_SYMBOL_GPL(rust_helper_drm_gem_shmem_object_print_info); + +int rust_helper_drm_gem_shmem_object_pin(struct drm_gem_object *obj) +{ + return drm_gem_shmem_object_pin(obj); +} +EXPORT_SYMBOL_GPL(rust_helper_drm_gem_shmem_object_pin); + +void rust_helper_drm_gem_shmem_object_unpin(struct drm_gem_object *obj) +{ + drm_gem_shmem_object_unpin(obj); +} +EXPORT_SYMBOL_GPL(rust_helper_drm_gem_shmem_object_unpin); + +struct sg_table *rust_helper_drm_gem_shmem_object_get_sg_table(struct drm_gem_object *obj) +{ + return drm_gem_shmem_object_get_sg_table(obj); +} +EXPORT_SYMBOL_GPL(rust_helper_drm_gem_shmem_object_get_sg_table); + +int rust_helper_drm_gem_shmem_object_vmap(struct drm_gem_object *obj, + struct iosys_map *map) +{ + return drm_gem_shmem_object_vmap(obj, map); +} +EXPORT_SYMBOL_GPL(rust_helper_drm_gem_shmem_object_vmap); + +void rust_helper_drm_gem_shmem_object_vunmap(struct drm_gem_object *obj, + struct iosys_map *map) +{ + drm_gem_shmem_object_vunmap(obj, map); +} +EXPORT_SYMBOL_GPL(rust_helper_drm_gem_shmem_object_vunmap); + +int rust_helper_drm_gem_shmem_object_mmap(struct drm_gem_object *obj, struct vm_area_struct *vma) +{ + return drm_gem_shmem_object_mmap(obj, vma); +} +EXPORT_SYMBOL_GPL(rust_helper_drm_gem_shmem_object_mmap); + +#endif #endif
/* diff --git a/rust/kernel/drm/gem/mod.rs b/rust/kernel/drm/gem/mod.rs index 8a7d99613718..e66bdef35c2e 100644 --- a/rust/kernel/drm/gem/mod.rs +++ b/rust/kernel/drm/gem/mod.rs @@ -4,6 +4,9 @@ //! //! C header: [`include/linux/drm/drm_gem.h`](../../../../include/linux/drm/drm_gem.h)
+#[cfg(CONFIG_RUST_DRM_GEM_SHMEM_HELPER)] +pub mod shmem; + use alloc::boxed::Box;
use crate::{ diff --git a/rust/kernel/drm/gem/shmem.rs b/rust/kernel/drm/gem/shmem.rs new file mode 100644 index 000000000000..15446ea1113e --- /dev/null +++ b/rust/kernel/drm/gem/shmem.rs @@ -0,0 +1,381 @@ +// SPDX-License-Identifier: GPL-2.0 + +//! DRM GEM shmem helper objects +//! +//! C header: [`include/linux/drm/drm_gem_shmem_helper.h`](../../../../include/linux/drm/drm_gem_shmem_helper.h) + +use crate::drm::{device, drv, gem}; +use crate::{ + error::{from_kernel_err_ptr, to_result}, + prelude::*, +}; +use core::{ + marker::PhantomData, + mem, + mem::{ManuallyDrop, MaybeUninit}, + ops::{Deref, DerefMut}, + ptr::addr_of_mut, + slice, +}; + +use gem::BaseObject; + +/// Trait which must be implemented by drivers using shmem-backed GEM objects. +pub trait DriverObject: gem::BaseDriverObject<Object<Self>> { + /// Parent `Driver` for this object. + type Driver: drv::Driver; +} + +// FIXME: This is terrible and I don't know how to avoid it +#[cfg(CONFIG_NUMA)] +macro_rules! vm_numa_fields { + ( $($field:ident: $val:expr),* $(,)? ) => { + bindings::vm_operations_struct { + $( $field: $val ),*, + set_policy: None, + get_policy: None, + } + } +} + +#[cfg(not(CONFIG_NUMA))] +macro_rules! vm_numa_fields { + ( $($field:ident: $val:expr),* $(,)? ) => { + bindings::vm_operations_struct { + $( $field: $val ),* + } + } +} + +const SHMEM_VM_OPS: bindings::vm_operations_struct = vm_numa_fields! { + open: Some(bindings::drm_gem_shmem_vm_open), + close: Some(bindings::drm_gem_shmem_vm_close), + may_split: None, + mremap: None, + mprotect: None, + fault: Some(bindings::drm_gem_shmem_fault), + huge_fault: None, + map_pages: None, + pagesize: None, + page_mkwrite: None, + pfn_mkwrite: None, + access: None, + name: None, + find_special_page: None, +}; + +/// A shmem-backed GEM object. 
+#[repr(C)] +pub struct Object<T: DriverObject> { + obj: bindings::drm_gem_shmem_object, + // The DRM core ensures the Device exists as long as its objects exist, so we don't need to + // manage the reference count here. + dev: ManuallyDrop<device::Device<T::Driver>>, + inner: T, +} + +unsafe extern "C" fn gem_create_object<T: DriverObject>( + raw_dev: *mut bindings::drm_device, + size: usize, +) -> *mut bindings::drm_gem_object { + // SAFETY: GEM ensures the device lives as long as its objects live, + // so we can conjure up a reference from thin air and never drop it. + let dev = ManuallyDrop::new(unsafe { device::Device::from_raw(raw_dev) }); + + let inner = match T::new(&*dev, size) { + Ok(v) => v, + Err(e) => return e.to_ptr(), + }; + + let p = unsafe { + bindings::krealloc( + core::ptr::null(), + Object::<T>::SIZE, + bindings::GFP_KERNEL | bindings::__GFP_ZERO, + ) as *mut Object<T> + }; + + if p.is_null() { + return ENOMEM.to_ptr(); + } + + // SAFETY: p is valid as long as the alloc succeeded + unsafe { + addr_of_mut!((*p).dev).write(dev); + addr_of_mut!((*p).inner).write(inner); + } + + // SAFETY: drm_gem_shmem_object is safe to zero-init, and + // the rest of Object has been initialized + let new: &mut Object<T> = unsafe { &mut *(p as *mut _) }; + + new.obj.base.funcs = &Object::<T>::VTABLE; + &mut new.obj.base +} + +unsafe extern "C" fn free_callback<T: DriverObject>(obj: *mut bindings::drm_gem_object) { + // SAFETY: All of our objects are Object<T>. + let p = crate::container_of!(obj, Object<T>, obj) as *mut Object<T>; + + // SAFETY: p is never used after this + unsafe { + core::ptr::drop_in_place(&mut (*p).inner); + } + + // SAFETY: This pointer has to be valid, since p is valid + unsafe { + bindings::drm_gem_shmem_free(&mut (*p).obj); + } +} + +impl<T: DriverObject> Object<T> { + /// The size of this object's structure. + const SIZE: usize = mem::size_of::<Self>(); + + /// `drm_gem_object_funcs` vtable suitable for GEM shmem objects. 
+ const VTABLE: bindings::drm_gem_object_funcs = bindings::drm_gem_object_funcs { + free: Some(free_callback::<T>), + open: Some(super::open_callback::<T, Object<T>>), + close: Some(super::close_callback::<T, Object<T>>), + print_info: Some(bindings::drm_gem_shmem_object_print_info), + export: None, + pin: Some(bindings::drm_gem_shmem_object_pin), + unpin: Some(bindings::drm_gem_shmem_object_unpin), + get_sg_table: Some(bindings::drm_gem_shmem_object_get_sg_table), + vmap: Some(bindings::drm_gem_shmem_object_vmap), + vunmap: Some(bindings::drm_gem_shmem_object_vunmap), + mmap: Some(bindings::drm_gem_shmem_object_mmap), + vm_ops: &SHMEM_VM_OPS, + }; + + // SAFETY: Must only be used with DRM functions that are thread-safe + unsafe fn mut_shmem(&self) -> *mut bindings::drm_gem_shmem_object { + &self.obj as *const _ as *mut _ + } + + /// Create a new shmem-backed DRM object of the given size. + pub fn new(dev: &device::Device<T::Driver>, size: usize) -> Result<gem::UniqueObjectRef<Self>> { + // SAFETY: This function can be called as long as the ALLOC_OPS are set properly + // for this driver, and the gem_create_object is called. + let p = unsafe { bindings::drm_gem_shmem_create(dev.raw() as *mut _, size) }; + let p = crate::container_of!(p, Object<T>, obj) as *mut _; + + // SAFETY: The gem_create_object callback ensures this is a valid Object<T>, + // so we can take a unique reference to it. + let obj_ref = gem::UniqueObjectRef { ptr: p }; + + Ok(obj_ref) + } + + /// Returns the `Device` that owns this GEM object. + pub fn dev(&self) -> &device::Device<T::Driver> { + &self.dev + } + + /// Creates (if necessary) and returns a scatter-gather table of DMA pages for this object. + /// + /// This will pin the object in memory. + pub fn sg_table(&self) -> Result<SGTable<T>> { + // SAFETY: drm_gem_shmem_get_pages_sgt is thread-safe. 
+ let sgt = from_kernel_err_ptr(unsafe { + bindings::drm_gem_shmem_get_pages_sgt(self.mut_shmem()) + })?; + + Ok(SGTable { + sgt, + _owner: self.reference(), + }) + } + + /// Creates and returns a virtual kernel memory mapping for this object. + pub fn vmap(&self) -> Result<VMap<T>> { + let mut map: MaybeUninit<bindings::iosys_map> = MaybeUninit::uninit(); + + // SAFETY: drm_gem_shmem_vmap is thread-safe + to_result(unsafe { bindings::drm_gem_shmem_vmap(self.mut_shmem(), map.as_mut_ptr()) })?; + + // SAFETY: if drm_gem_shmem_vmap did not fail, map is initialized now + let map = unsafe { map.assume_init() }; + + Ok(VMap { + map, + owner: self.reference(), + }) + } + + /// Set the write-combine flag for this object. + /// + /// Should be called before any mappings are made. + pub fn set_wc(&mut self, map_wc: bool) { + unsafe { (*self.mut_shmem()).map_wc = map_wc }; + } +} + +impl<T: DriverObject> Deref for Object<T> { + type Target = T; + + fn deref(&self) -> &Self::Target { + &self.inner + } +} + +impl<T: DriverObject> DerefMut for Object<T> { + fn deref_mut(&mut self) -> &mut Self::Target { + &mut self.inner + } +} + +impl<T: DriverObject> crate::private::Sealed for Object<T> {} + +impl<T: DriverObject> gem::IntoGEMObject for Object<T> { + type Driver = T::Driver; + + fn gem_obj(&self) -> *mut bindings::drm_gem_object { + &self.obj.base as *const _ as *mut _ + } + + fn from_gem_obj(obj: *mut bindings::drm_gem_object) -> *mut Object<T> { + crate::container_of!(obj, Object<T>, obj) as *mut Object<T> + } +} + +impl<T: DriverObject> drv::AllocImpl for Object<T> { + const ALLOC_OPS: drv::AllocOps = drv::AllocOps { + gem_create_object: Some(gem_create_object::<T>), + prime_handle_to_fd: Some(bindings::drm_gem_prime_handle_to_fd), + prime_fd_to_handle: Some(bindings::drm_gem_prime_fd_to_handle), + gem_prime_import: None, + gem_prime_import_sg_table: Some(bindings::drm_gem_shmem_prime_import_sg_table), + gem_prime_mmap: Some(bindings::drm_gem_prime_mmap), + dumb_create:
Some(bindings::drm_gem_shmem_dumb_create), + dumb_map_offset: None, + dumb_destroy: None, + }; +} + +/// A virtual mapping for a shmem-backed GEM object in kernel address space. +pub struct VMap<T: DriverObject> { + map: bindings::iosys_map, + owner: gem::ObjectRef<Object<T>>, +} + +impl<T: DriverObject> VMap<T> { + /// Returns a const raw pointer to the start of the mapping. + pub fn as_ptr(&self) -> *const core::ffi::c_void { + // SAFETY: The shmem helpers always return non-iomem maps + unsafe { self.map.__bindgen_anon_1.vaddr } + } + + /// Returns a mutable raw pointer to the start of the mapping. + pub fn as_mut_ptr(&mut self) -> *mut core::ffi::c_void { + // SAFETY: The shmem helpers always return non-iomem maps + unsafe { self.map.__bindgen_anon_1.vaddr } + } + + /// Returns a byte slice view of the mapping. + pub fn as_slice(&self) -> &[u8] { + // SAFETY: The vmap maps valid memory up to the owner size + unsafe { slice::from_raw_parts(self.as_ptr() as *const u8, self.owner.size()) } + } + + /// Returns a mutable byte slice view of the mapping. + pub fn as_mut_slice(&mut self) -> &mut [u8] { + // SAFETY: The vmap maps valid memory up to the owner size + unsafe { slice::from_raw_parts_mut(self.as_mut_ptr() as *mut u8, self.owner.size()) } + } + + /// Borrows a reference to the object that owns this virtual mapping. + pub fn owner(&self) -> &gem::ObjectRef<Object<T>> { + &self.owner + } +} + +impl<T: DriverObject> Drop for VMap<T> { + fn drop(&mut self) { + // SAFETY: This function is thread-safe + unsafe { + bindings::drm_gem_shmem_vunmap(self.owner.mut_shmem(), &mut self.map); + } + } +} + +/// SAFETY: `iosys_map` objects are safe to send across threads. +unsafe impl<T: DriverObject> Send for VMap<T> {} +unsafe impl<T: DriverObject> Sync for VMap<T> {} + +/// A single scatter-gather entry, representing a span of pages in the device's DMA address space. +/// +/// For devices not behind a standalone IOMMU, this corresponds to physical addresses.
+#[repr(transparent)] +pub struct SGEntry(bindings::scatterlist); + +impl SGEntry { + /// Returns the starting DMA address of this span + pub fn dma_address(&self) -> usize { + (unsafe { bindings::sg_dma_address(&self.0) }) as usize + } + + /// Returns the length of this span in bytes + pub fn dma_len(&self) -> usize { + (unsafe { bindings::sg_dma_len(&self.0) }) as usize + } +} + +/// A scatter-gather table of DMA address spans for a GEM shmem object. +/// +/// # Invariants +/// `sgt` must be a valid pointer to the `sg_table`, which must correspond to the owned +/// object in `_owner` (which ensures it remains valid). +pub struct SGTable<T: DriverObject> { + sgt: *const bindings::sg_table, + _owner: gem::ObjectRef<Object<T>>, +} + +impl<T: DriverObject> SGTable<T> { + /// Returns an iterator through the SGTable's entries + pub fn iter(&'_ self) -> SGTableIter<'_> { + SGTableIter { + left: unsafe { (*self.sgt).nents } as usize, + sg: unsafe { (*self.sgt).sgl }, + _p: PhantomData, + } + } +} + +impl<'a, T: DriverObject> IntoIterator for &'a SGTable<T> { + type Item = &'a SGEntry; + type IntoIter = SGTableIter<'a>; + + fn into_iter(self) -> Self::IntoIter { + self.iter() + } +} + +/// SAFETY: `sg_table` objects are safe to send across threads. +unsafe impl<T: DriverObject> Send for SGTable<T> {} +unsafe impl<T: DriverObject> Sync for SGTable<T> {} + +/// An iterator through `SGTable` entries. +/// +/// # Invariants +/// `sg` must be a valid pointer to the scatterlist, which must outlive our lifetime. +pub struct SGTableIter<'a> { + sg: *mut bindings::scatterlist, + left: usize, + _p: PhantomData<&'a ()>, +} + +impl<'a> Iterator for SGTableIter<'a> { + type Item = &'a SGEntry; + + fn next(&mut self) -> Option<Self::Item> { + if self.left == 0 { + None + } else { + let sg = self.sg; + self.sg = unsafe { bindings::sg_next(self.sg) }; + self.left -= 1; + Some(unsafe { &(*(sg as *const SGEntry)) }) + } + } +}
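[Editor's note] The `SGTableIter` at the end of the patch walks `nents` raw scatterlist nodes while a `PhantomData` lifetime ties each borrowed entry back to the table. The same pattern can be exercised in plain user-space Rust; everything below is an illustrative mock (`MockSg` stands in for `bindings::scatterlist`, and `next` replaces `sg_next()`), not kernel code:

```rust
/// Mock scatterlist-style node; in the kernel this is `bindings::scatterlist`.
pub struct MockSg {
    pub dma_address: usize,
    pub dma_len: usize,
    pub next: *mut MockSg,
}

/// Mirror of `SGTableIter`: raw pointer + remaining count + borrow lifetime.
pub struct MockIter<'a> {
    sg: *mut MockSg,
    left: usize,
    _p: core::marker::PhantomData<&'a ()>,
}

impl<'a> Iterator for MockIter<'a> {
    type Item = &'a MockSg;

    fn next(&mut self) -> Option<Self::Item> {
        if self.left == 0 {
            None
        } else {
            let sg = self.sg;
            // The kernel version calls sg_next(); the mock just follows `next`.
            self.sg = unsafe { (*sg).next };
            self.left -= 1;
            // SAFETY (mock): caller guarantees the chain outlives 'a.
            Some(unsafe { &*sg })
        }
    }
}

/// Sums entry lengths, as a driver might when sizing a mapping.
pub fn total_len(sg: *mut MockSg, nents: usize) -> usize {
    MockIter { sg, left: nents, _p: core::marker::PhantomData }
        .map(|e| e.dma_len)
        .sum()
}
```

The `left` counter bounds the walk exactly as `nents` does in the patch, so a truncated chain is never over-read.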
On 3/7/23 11:25, Asahi Lina wrote:
The DRM shmem helper includes common code useful for drivers which allocate GEM objects as anonymous shmem. Add a Rust abstraction for this. Drivers can choose the raw GEM implementation or the shmem layer, depending on their needs.
Signed-off-by: Asahi Lina <lina@asahilina.net>
drivers/gpu/drm/Kconfig | 5 + rust/bindings/bindings_helper.h | 2 + rust/helpers.c | 67 +++++++ rust/kernel/drm/gem/mod.rs | 3 + rust/kernel/drm/gem/shmem.rs | 381 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 458 insertions(+)
[...]
+unsafe extern "C" fn gem_create_object<T: DriverObject>(
- raw_dev: *mut bindings::drm_device,
- size: usize,
+) -> *mut bindings::drm_gem_object {
- // SAFETY: GEM ensures the device lives as long as its objects live,
- // so we can conjure up a reference from thin air and never drop it.
- let dev = ManuallyDrop::new(unsafe { device::Device::from_raw(raw_dev) });
- let inner = match T::new(&*dev, size) {
Ok(v) => v,
Err(e) => return e.to_ptr(),
- };
- let p = unsafe {
bindings::krealloc(
core::ptr::null(),
Object::<T>::SIZE,
bindings::GFP_KERNEL | bindings::__GFP_ZERO,
) as *mut Object<T>
- };
- if p.is_null() {
return ENOMEM.to_ptr();
- }
- // SAFETY: p is valid as long as the alloc succeeded
- unsafe {
addr_of_mut!((*p).dev).write(dev);
addr_of_mut!((*p).inner).write(inner);
- }
- // SAFETY: drm_gem_shmem_object is safe to zero-init, and
- // the rest of Object has been initialized
- let new: &mut Object<T> = unsafe { &mut *(p as *mut _) };
- new.obj.base.funcs = &Object::<T>::VTABLE;
- &mut new.obj.base
+}
It would be nice to allow setting wc inside the gem_create_object callback, as some drivers do, like v3d, vc4, panfrost, lima...
Best Regards, - Maíra Canal
+unsafe extern "C" fn free_callback<T: DriverObject>(obj: *mut bindings::drm_gem_object) {
- // SAFETY: All of our objects are Object<T>.
- let p = crate::container_of!(obj, Object<T>, obj) as *mut Object<T>;
- // SAFETY: p is never used after this
- unsafe {
core::ptr::drop_in_place(&mut (*p).inner);
- }
- // SAFETY: This pointer has to be valid, since p is valid
- unsafe {
bindings::drm_gem_shmem_free(&mut (*p).obj);
- }
+}
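[Editor's note] `free_callback` recovers the outer `Object<T>` from the embedded `drm_gem_object` pointer via `container_of!`. The pointer arithmetic behind that macro can be sketched in plain Rust; `Outer` and `Embedded` below are hypothetical stand-ins, not kernel types:

```rust
/// Stand-in for the embedded C object (e.g. `drm_gem_object`).
#[repr(C)]
struct Embedded {
    pub refcount: u32,
}

/// Stand-in for `Object<T>`; #[repr(C)] fixes the field layout so the
/// field offset is well-defined, exactly as the real type requires.
#[repr(C)]
struct Outer {
    pub header: u64,
    pub obj: Embedded,
    pub inner: u64,
}

/// Equivalent of `container_of!(field, Outer, obj)`: subtract the field's
/// byte offset from the field pointer to get the containing struct pointer.
unsafe fn container_of_obj(field: *mut Embedded) -> *mut Outer {
    let offset = {
        // Compute offset_of(Outer, obj) without reading any memory:
        // addr_of! projects a raw pointer to the field of uninit storage.
        let dummy = core::mem::MaybeUninit::<Outer>::uninit();
        let base = dummy.as_ptr();
        let f = unsafe { core::ptr::addr_of!((*base).obj) };
        (f as usize) - (base as usize)
    };
    (field as usize - offset) as *mut Outer
}
```

This only holds because the embedded object really lives inside an `Outer`; the kernel abstraction guarantees that by routing all allocation through `gem_create_object`.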
+impl<T: DriverObject> Object<T> {
- /// The size of this object's structure.
- const SIZE: usize = mem::size_of::<Self>();
- /// `drm_gem_object_funcs` vtable suitable for GEM shmem objects.
- const VTABLE: bindings::drm_gem_object_funcs = bindings::drm_gem_object_funcs {
free: Some(free_callback::<T>),
open: Some(super::open_callback::<T, Object<T>>),
close: Some(super::close_callback::<T, Object<T>>),
print_info: Some(bindings::drm_gem_shmem_object_print_info),
export: None,
pin: Some(bindings::drm_gem_shmem_object_pin),
unpin: Some(bindings::drm_gem_shmem_object_unpin),
get_sg_table: Some(bindings::drm_gem_shmem_object_get_sg_table),
vmap: Some(bindings::drm_gem_shmem_object_vmap),
vunmap: Some(bindings::drm_gem_shmem_object_vunmap),
mmap: Some(bindings::drm_gem_shmem_object_mmap),
vm_ops: &SHMEM_VM_OPS,
- };
- // SAFETY: Must only be used with DRM functions that are thread-safe
- unsafe fn mut_shmem(&self) -> *mut bindings::drm_gem_shmem_object {
&self.obj as *const _ as *mut _
- }
- /// Create a new shmem-backed DRM object of the given size.
- pub fn new(dev: &device::Device<T::Driver>, size: usize) -> Result<gem::UniqueObjectRef<Self>> {
// SAFETY: This function can be called as long as the ALLOC_OPS are set properly
// for this driver, and the gem_create_object is called.
let p = unsafe { bindings::drm_gem_shmem_create(dev.raw() as *mut _, size) };
let p = crate::container_of!(p, Object<T>, obj) as *mut _;
// SAFETY: The gem_create_object callback ensures this is a valid Object<T>,
// so we can take a unique reference to it.
let obj_ref = gem::UniqueObjectRef { ptr: p };
Ok(obj_ref)
- }
- /// Returns the `Device` that owns this GEM object.
- pub fn dev(&self) -> &device::Device<T::Driver> {
&self.dev
- }
- /// Creates (if necessary) and returns a scatter-gather table of DMA pages for this object.
- ///
- /// This will pin the object in memory.
- pub fn sg_table(&self) -> Result<SGTable<T>> {
// SAFETY: drm_gem_shmem_get_pages_sgt is thread-safe.
let sgt = from_kernel_err_ptr(unsafe {
bindings::drm_gem_shmem_get_pages_sgt(self.mut_shmem())
})?;
Ok(SGTable {
sgt,
_owner: self.reference(),
})
- }
- /// Creates and returns a virtual kernel memory mapping for this object.
- pub fn vmap(&self) -> Result<VMap<T>> {
let mut map: MaybeUninit<bindings::iosys_map> = MaybeUninit::uninit();
// SAFETY: drm_gem_shmem_vmap is thread-safe
to_result(unsafe { bindings::drm_gem_shmem_vmap(self.mut_shmem(), map.as_mut_ptr()) })?;
// SAFETY: if drm_gem_shmem_vmap did not fail, map is initialized now
let map = unsafe { map.assume_init() };
Ok(VMap {
map,
owner: self.reference(),
})
- }
- /// Set the write-combine flag for this object.
- ///
- /// Should be called before any mappings are made.
- pub fn set_wc(&mut self, map_wc: bool) {
unsafe { (*self.mut_shmem()).map_wc = map_wc };
- }
+}
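[Editor's note] The `vmap()` method above uses the classic out-parameter pattern: hand uninitialized storage to a C function, then `assume_init()` only after it reports success. A user-space sketch of the same shape, with `fake_vmap` standing in for `drm_gem_shmem_vmap` (all names here are illustrative):

```rust
use core::mem::MaybeUninit;

/// Stand-in for `bindings::iosys_map`.
#[derive(Clone, Copy)]
struct Map {
    pub vaddr: usize,
}

/// Stand-in for the C function: fills the caller's struct, returns 0 on success.
fn fake_vmap(out: *mut Map) -> i32 {
    unsafe { out.write(Map { vaddr: 0xffff_8000 }) };
    0
}

fn vmap() -> Result<Map, i32> {
    let mut map: MaybeUninit<Map> = MaybeUninit::uninit();
    let ret = fake_vmap(map.as_mut_ptr());
    if ret != 0 {
        return Err(ret);
    }
    // SAFETY (mock): fake_vmap returned 0, so it initialized `map`.
    Ok(unsafe { map.assume_init() })
}
```

Reading the storage before `assume_init()` would be undefined behavior, which is why the abstraction checks `to_result(...)` first.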
+impl<T: DriverObject> Deref for Object<T> {
- type Target = T;
- fn deref(&self) -> &Self::Target {
&self.inner
- }
+}
+impl<T: DriverObject> DerefMut for Object<T> {
- fn deref_mut(&mut self) -> &mut Self::Target {
&mut self.inner
- }
+}
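[Editor's note] These `Deref`/`DerefMut` impls are what let driver code treat an `Object<T>` as the driver's own `T`. A user-space sketch of the pattern, with `MockObject` standing in for `Object<T>`:

```rust
use core::ops::{Deref, DerefMut};

/// Stand-in for `Object<T>`: a C-side part plus driver-private data.
struct MockObject<T> {
    pub size: usize, // in the real type, this lives in drm_gem_shmem_object
    inner: T,
}

impl<T> Deref for MockObject<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.inner
    }
}

impl<T> DerefMut for MockObject<T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut self.inner
    }
}

struct DriverData {
    pub flags: u32,
}

/// Driver code reaches the inner data transparently through Deref coercion.
fn bump_flags(obj: &mut MockObject<DriverData>) -> u32 {
    obj.flags |= 1;
    obj.flags
}

fn new_object(size: usize) -> MockObject<DriverData> {
    MockObject { size, inner: DriverData { flags: 0 } }
}
```

The GEM-level fields stay reachable by name (`obj.size`) while everything else forwards to the driver's type.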
+impl<T: DriverObject> crate::private::Sealed for Object<T> {}
+impl<T: DriverObject> gem::IntoGEMObject for Object<T> {
- type Driver = T::Driver;
- fn gem_obj(&self) -> *mut bindings::drm_gem_object {
&self.obj.base as *const _ as *mut _
- }
- fn from_gem_obj(obj: *mut bindings::drm_gem_object) -> *mut Object<T> {
crate::container_of!(obj, Object<T>, obj) as *mut Object<T>
- }
+}
+impl<T: DriverObject> drv::AllocImpl for Object<T> {
- const ALLOC_OPS: drv::AllocOps = drv::AllocOps {
gem_create_object: Some(gem_create_object::<T>),
prime_handle_to_fd: Some(bindings::drm_gem_prime_handle_to_fd),
prime_fd_to_handle: Some(bindings::drm_gem_prime_fd_to_handle),
gem_prime_import: None,
gem_prime_import_sg_table: Some(bindings::drm_gem_shmem_prime_import_sg_table),
gem_prime_mmap: Some(bindings::drm_gem_prime_mmap),
dumb_create: Some(bindings::drm_gem_shmem_dumb_create),
dumb_map_offset: None,
dumb_destroy: None,
- };
+}
+/// A virtual mapping for a shmem-backed GEM object in kernel address space. +pub struct VMap<T: DriverObject> {
- map: bindings::iosys_map,
- owner: gem::ObjectRef<Object<T>>,
+}
+impl<T: DriverObject> VMap<T> {
- /// Returns a const raw pointer to the start of the mapping.
- pub fn as_ptr(&self) -> *const core::ffi::c_void {
// SAFETY: The shmem helpers always return non-iomem maps
unsafe { self.map.__bindgen_anon_1.vaddr }
- }
- /// Returns a mutable raw pointer to the start of the mapping.
- pub fn as_mut_ptr(&mut self) -> *mut core::ffi::c_void {
// SAFETY: The shmem helpers always return non-iomem maps
unsafe { self.map.__bindgen_anon_1.vaddr }
- }
- /// Returns a byte slice view of the mapping.
- pub fn as_slice(&self) -> &[u8] {
// SAFETY: The vmap maps valid memory up to the owner size
unsafe { slice::from_raw_parts(self.as_ptr() as *const u8, self.owner.size()) }
- }
- /// Returns a mutable byte slice view of the mapping.
- pub fn as_mut_slice(&mut self) -> &mut [u8] {
// SAFETY: The vmap maps valid memory up to the owner size
unsafe { slice::from_raw_parts_mut(self.as_mut_ptr() as *mut u8, self.owner.size()) }
- }
- /// Borrows a reference to the object that owns this virtual mapping.
- pub fn owner(&self) -> &gem::ObjectRef<Object<T>> {
&self.owner
- }
+}
+impl<T: DriverObject> Drop for VMap<T> {
- fn drop(&mut self) {
// SAFETY: This function is thread-safe
unsafe {
bindings::drm_gem_shmem_vunmap(self.owner.mut_shmem(), &mut self.map);
}
- }
+}
+/// SAFETY: `iosys_map` objects are safe to send across threads. +unsafe impl<T: DriverObject> Send for VMap<T> {} +unsafe impl<T: DriverObject> Sync for VMap<T> {}
+/// A single scatter-gather entry, representing a span of pages in the device's DMA address space. +/// +/// For devices not behind a standalone IOMMU, this corresponds to physical addresses. +#[repr(transparent)] +pub struct SGEntry(bindings::scatterlist);
+impl SGEntry {
- /// Returns the starting DMA address of this span
- pub fn dma_address(&self) -> usize {
(unsafe { bindings::sg_dma_address(&self.0) }) as usize
- }
- /// Returns the length of this span in bytes
- pub fn dma_len(&self) -> usize {
(unsafe { bindings::sg_dma_len(&self.0) }) as usize
- }
+}
+/// A scatter-gather table of DMA address spans for a GEM shmem object. +/// +/// # Invariants +/// `sgt` must be a valid pointer to the `sg_table`, which must correspond to the owned +/// object in `_owner` (which ensures it remains valid). +pub struct SGTable<T: DriverObject> {
- sgt: *const bindings::sg_table,
- _owner: gem::ObjectRef<Object<T>>,
+}
+impl<T: DriverObject> SGTable<T> {
- /// Returns an iterator through the SGTable's entries
- pub fn iter(&'_ self) -> SGTableIter<'_> {
SGTableIter {
left: unsafe { (*self.sgt).nents } as usize,
sg: unsafe { (*self.sgt).sgl },
_p: PhantomData,
}
- }
+}
+impl<'a, T: DriverObject> IntoIterator for &'a SGTable<T> {
- type Item = &'a SGEntry;
- type IntoIter = SGTableIter<'a>;
- fn into_iter(self) -> Self::IntoIter {
self.iter()
- }
+}
+/// SAFETY: `sg_table` objects are safe to send across threads. +unsafe impl<T: DriverObject> Send for SGTable<T> {} +unsafe impl<T: DriverObject> Sync for SGTable<T> {}
+/// An iterator through `SGTable` entries. +/// +/// # Invariants +/// `sg` must be a valid pointer to the scatterlist, which must outlive our lifetime. +pub struct SGTableIter<'a> {
- sg: *mut bindings::scatterlist,
- left: usize,
- _p: PhantomData<&'a ()>,
+}
+impl<'a> Iterator for SGTableIter<'a> {
- type Item = &'a SGEntry;
- fn next(&mut self) -> Option<Self::Item> {
if self.left == 0 {
None
} else {
let sg = self.sg;
self.sg = unsafe { bindings::sg_next(self.sg) };
self.left -= 1;
Some(unsafe { &(*(sg as *const SGEntry)) })
}
- }
+}
On 08/03/2023 22.38, Maíra Canal wrote:
On 3/7/23 11:25, Asahi Lina wrote:
The DRM shmem helper includes common code useful for drivers which allocate GEM objects as anonymous shmem. Add a Rust abstraction for this. Drivers can choose the raw GEM implementation or the shmem layer, depending on their needs.
Signed-off-by: Asahi Lina <lina@asahilina.net>
drivers/gpu/drm/Kconfig | 5 + rust/bindings/bindings_helper.h | 2 + rust/helpers.c | 67 +++++++ rust/kernel/drm/gem/mod.rs | 3 + rust/kernel/drm/gem/shmem.rs | 381 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 458 insertions(+)
[...]
It would be nice to allow setting wc inside the gem_create_object callback, as some drivers do, like v3d, vc4, panfrost, lima...
This is actually a bit tricky to do safely: we can't just have a callback that takes the drm_gem_shmem_object instance inside gem_create_object, because it is not fully initialized yet from the point of view of the gem shmem API. Maybe we could have some sort of temporary proxy object that only lets you do safe things like set map_wc? Or maybe the new() callback could return something like a ShmemTemplate<T> type that contains both the inner data and some miscellaneous fields like the initial map_wc state?
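[Editor's note] A rough user-space sketch of the `ShmemTemplate<T>` idea floated here; every name below is hypothetical, this is not a real or proposed API, only an illustration of the shape:

```rust
/// Hypothetical: what new() would return, bundling the driver's inner data
/// with initial object settings the callback applies before the object
/// becomes visible to the shmem core.
struct ShmemTemplate<T> {
    pub inner: T,
    pub map_wc: bool,
}

/// Reduced stand-in for the DriverObject trait.
trait DriverObject: Sized {
    fn new(size: usize) -> Result<ShmemTemplate<Self>, i32>;
}

/// What the gem_create_object callback would do with the template: apply
/// the settings while it still has exclusive access, then store `inner`.
fn create_object<T: DriverObject>(size: usize) -> Result<(T, bool), i32> {
    let t = T::new(size)?;
    Ok((t.inner, t.map_wc))
}

struct MyData {
    pub size: usize,
}

impl DriverObject for MyData {
    fn new(size: usize) -> Result<ShmemTemplate<Self>, i32> {
        Ok(ShmemTemplate { inner: MyData { size }, map_wc: true })
    }
}
```

The point of the shape is that the driver never touches the half-initialized `drm_gem_shmem_object` itself; it only declares what the callback should set.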
I think we can also just wait until the first user before we do this though... the goal of the abstractions is to support the APIs we actually use. I know you need this for vgem, so please feel free to implement it as a separate patch! I think it's best if you get credit for the abstraction changes you need, so we can all work together on the design so it works for everyone's use cases instead of just having me make all the decisions ^^ (and it's fine if we have to refactor the APIs!)
~~ Lina
On 3/9/23 02:25, Asahi Lina wrote:
On 08/03/2023 22.38, Maíra Canal wrote:
On 3/7/23 11:25, Asahi Lina wrote:
The DRM shmem helper includes common code useful for drivers which allocate GEM objects as anonymous shmem. Add a Rust abstraction for this. Drivers can choose the raw GEM implementation or the shmem layer, depending on their needs.
Signed-off-by: Asahi Lina <lina@asahilina.net>
drivers/gpu/drm/Kconfig | 5 + rust/bindings/bindings_helper.h | 2 + rust/helpers.c | 67 +++++++ rust/kernel/drm/gem/mod.rs | 3 + rust/kernel/drm/gem/shmem.rs | 381 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 458 insertions(+)
[...]
It would be nice to allow setting wc inside the gem_create_object callback, as some drivers do, like v3d, vc4, panfrost, lima...
This is actually a bit tricky to do safely: we can't just have a callback that takes the drm_gem_shmem_object instance inside gem_create_object, because it is not fully initialized yet from the point of view of the gem shmem API. Maybe we could have some sort of temporary proxy object that only lets you do safe things like set map_wc? Or maybe the new() callback could return something like a ShmemTemplate<T> type that contains both the inner data and some miscellaneous fields like the initial map_wc state?
I see that most drivers use this hook to set map_wc and set funcs. What are your thoughts on something like this?
Best Regards, - Maíra Canal
From 61f23f4a39028c9d34d3df58d7640bfcd64e9af9 Mon Sep 17 00:00:00 2001 From: Maíra Canal <mcanal@igalia.com> Date: Thu, 9 Mar 2023 08:24:09 -0300 Subject: [PATCH] rust: drm: gem: shmem: Set map_wc on gem_create_object callback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Some drivers use the gem_create_object callback to make the object's mapping write-combined (map_wc). Currently, the DRM Rust abstractions don't allow such an operation. So, add a method to the DriverObject trait to allow drivers to set map_wc in the gem_create_object callback. By default, the method returns false, which is the shmem default value.
Signed-off-by: Maíra Canal <mcanal@igalia.com> --- rust/kernel/drm/gem/shmem.rs | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/rust/kernel/drm/gem/shmem.rs b/rust/kernel/drm/gem/shmem.rs index 8f17eba0be99..a7f33b66f60a 100644 --- a/rust/kernel/drm/gem/shmem.rs +++ b/rust/kernel/drm/gem/shmem.rs @@ -24,6 +24,11 @@ use gem::BaseObject; pub trait DriverObject: gem::BaseDriverObject<Object<Self>> { /// Parent `Driver` for this object. type Driver: drv::Driver; + + /// Define the map object write-combined + fn set_wc() -> bool { + false + } }
// FIXME: This is terrible and I don't know how to avoid it @@ -110,6 +115,8 @@ unsafe extern "C" fn gem_create_object<T: DriverObject>( let new: &mut Object<T> = unsafe { &mut *(p as *mut _) };
new.obj.base.funcs = &Object::<T>::VTABLE; + new.obj.map_wc = <T>::set_wc(); + &mut new.obj.base }
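[Editor's note] A user-space sketch of how the proposed `set_wc()` hook would behave; the trait below is a reduced stand-in for the real `DriverObject`, and `initial_map_wc` mirrors the `new.obj.map_wc = <T>::set_wc();` hunk:

```rust
/// Reduced stand-in for the patched DriverObject trait.
trait DriverObject {
    /// Whether new objects should map write-combined; shmem defaults to false.
    fn set_wc() -> bool {
        false
    }
}

/// A driver that keeps the default: an empty impl suffices.
struct CoherentGpu;
impl DriverObject for CoherentGpu {}

/// A driver like v3d/vc4/panfrost/lima that wants write-combined mappings.
struct WcGpu;
impl DriverObject for WcGpu {
    fn set_wc() -> bool {
        true
    }
}

/// Mirror of what gem_create_object would evaluate per driver type.
fn initial_map_wc<T: DriverObject>() -> bool {
    T::set_wc()
}
```

Because the hook is resolved per concrete driver type at compile time, there is no indirect call at object-creation time.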
I think we can also just wait until the first user before we do this though... the goal of the abstractions is to support the APIs we actually use. I know you need this for vgem, so please feel free to implement it as a separate patch! I think it's best if you get credit for the abstraction changes you need, so we can all work together on the design so it works for everyone's use cases instead of just having me make all the decisions ^^ (and it's fine if we have to refactor the APIs!)
~~ Lina
On 09/03/2023 20.47, Maíra Canal wrote:
On 3/9/23 02:25, Asahi Lina wrote:
On 08/03/2023 22.38, Maíra Canal wrote:
On 3/7/23 11:25, Asahi Lina wrote:
The DRM shmem helper includes common code useful for drivers which allocate GEM objects as anonymous shmem. Add a Rust abstraction for this. Drivers can choose the raw GEM implementation or the shmem layer, depending on their needs.
Signed-off-by: Asahi Lina <lina@asahilina.net>
drivers/gpu/drm/Kconfig | 5 + rust/bindings/bindings_helper.h | 2 + rust/helpers.c | 67 +++++++ rust/kernel/drm/gem/mod.rs | 3 + rust/kernel/drm/gem/shmem.rs | 381 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 458 insertions(+)
[...]
It would be nice to allow setting wc inside the gem_create_object callback, as some drivers do, like v3d, vc4, panfrost, lima...
This is actually a bit tricky to do safely: we can't just have a callback that takes the drm_gem_shmem_object instance inside gem_create_object, because it is not fully initialized yet from the point of view of the gem shmem API. Maybe we could have some sort of temporary proxy object that only lets you do safe things like set map_wc? Or maybe the new() callback could return something like a ShmemTemplate<T> type that contains both the inner data and some miscellaneous fields like the initial map_wc state?
I see that most drivers use this hook to set map_wc and set funcs. What are your thoughts on something like this?
Best Regards,
- Maíra Canal
From 61f23f4a39028c9d34d3df58d7640bfcd64e9af9 Mon Sep 17 00:00:00 2001 From: Maíra Canal <mcanal@igalia.com> Date: Thu, 9 Mar 2023 08:24:09 -0300 Subject: [PATCH] rust: drm: gem: shmem: Set map_wc on gem_create_object callback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Some drivers use the gem_create_object callback to make the object's mapping write-combined (map_wc). Currently, the DRM Rust abstractions don't allow such an operation. So, add a method to the DriverObject trait to allow drivers to set map_wc in the gem_create_object callback. By default, the method returns false, which is the shmem default value.
Signed-off-by: Maíra Canal <mcanal@igalia.com>
rust/kernel/drm/gem/shmem.rs | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/rust/kernel/drm/gem/shmem.rs b/rust/kernel/drm/gem/shmem.rs index 8f17eba0be99..a7f33b66f60a 100644 --- a/rust/kernel/drm/gem/shmem.rs +++ b/rust/kernel/drm/gem/shmem.rs @@ -24,6 +24,11 @@ use gem::BaseObject; pub trait DriverObject: gem::BaseDriverObject<Object<Self>> { /// Parent `Driver` for this object. type Driver: drv::Driver;
- /// Define the map object write-combined
- fn set_wc() -> bool {
false
- } }
I think if you're going to make it a static function like that, we might as well just make it an associated constant like `DEFAULT_WC`? After all, there is no information gem_create_object gets other than the size, so we can't really do anything more useful, and `set_wc()` can't do much other than return a constant ^^
The only corner case I can think of is cases where the WC mode depends on the device (for example, if some devices want to enable it or not depending on whether the particular hardware variant is cache-coherent), but then it should probably just be part of the return value for T::new since that function already gets all available information (device and size). But I think a constant works for now, we can always extend it when a use case comes for doing more.
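[Editor's note] A user-space sketch of the associated-constant variant discussed here; `DEFAULT_WC` is the hypothetical name from the thread, and the trait is again a reduced stand-in:

```rust
/// Reduced stand-in for DriverObject with an associated constant instead
/// of a static method: a constant makes it explicit that
/// gem_create_object has no per-object information to act on.
trait DriverObject {
    /// Initial write-combine state for new objects (hypothetical name).
    const DEFAULT_WC: bool = false;
}

struct PanfrostLike;
impl DriverObject for PanfrostLike {
    const DEFAULT_WC: bool = true;
}

struct AsahiLike;
impl DriverObject for AsahiLike {} // keeps the default: false

/// Mirror of what the callback would evaluate.
fn initial_map_wc<T: DriverObject>() -> bool {
    T::DEFAULT_WC
}
```

Per the corner case Lina raises, a device-dependent choice would instead travel through `T::new`'s return value, since that is where the device is available; the constant covers only the static case.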
~~ Lina
drm_mm provides a simple range allocator, useful for managing virtual address ranges. Add a Rust abstraction to expose this module to Rust drivers.
Signed-off-by: Asahi Lina <lina@asahilina.net> --- rust/kernel/drm/mm.rs | 309 +++++++++++++++++++++++++++++++++++++++++++++++++ rust/kernel/drm/mod.rs | 1 + 2 files changed, 310 insertions(+)
diff --git a/rust/kernel/drm/mm.rs b/rust/kernel/drm/mm.rs new file mode 100644 index 000000000000..83e27a7dcc7e --- /dev/null +++ b/rust/kernel/drm/mm.rs @@ -0,0 +1,309 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT + +//! DRM MM range allocator +//! +//! C header: [`include/linux/drm/drm_mm.h`](../../../../include/linux/drm/drm_mm.h) + +use crate::{ + bindings, + error::{to_result, Result}, + str::CStr, + sync::{Arc, LockClassKey, LockIniter, Mutex, UniqueArc}, + types::Opaque, +}; + +use alloc::boxed::Box; + +use core::{ + marker::{PhantomData, PhantomPinned}, + ops::Deref, + pin::Pin, +}; + +/// Type alias representing a DRM MM node. +pub type Node<A, T> = Pin<Box<NodeData<A, T>>>; + +/// Trait which must be implemented by the inner allocator state type provided by the user. +pub trait AllocInner<T> { + /// Notification that a node was dropped from the allocator. + fn drop_object(&mut self, _start: u64, _size: u64, _color: usize, _object: &mut T) {} +} + +impl<T> AllocInner<T> for () {} + +/// Wrapper type for a `struct drm_mm` plus user AllocInner object. +/// +/// # Invariants +/// The `drm_mm` struct is valid and initialized. +struct MmInner<A: AllocInner<T>, T>(Opaque<bindings::drm_mm>, A, PhantomData<T>); + +/// Represents a single allocated node in the MM allocator +pub struct NodeData<A: AllocInner<T>, T> { + node: bindings::drm_mm_node, + mm: Arc<Mutex<MmInner<A, T>>>, + valid: bool, + /// A drm_mm_node needs to be pinned because nodes reference each other in a linked list. + _pin: PhantomPinned, + inner: T, +} + +// SAFETY: Allocator ops take the mutex, and there are no mutable actions on the node. +unsafe impl<A: Send + AllocInner<T>, T: Send> Send for NodeData<A, T> {} +unsafe impl<A: Send + AllocInner<T>, T: Sync> Sync for NodeData<A, T> {} + +/// Available MM node insertion modes +#[repr(u32)] +pub enum InsertMode { + /// Search for the smallest hole (within the search range) that fits the desired node.
+    ///
+    /// Allocates the node from the bottom of the found hole.
+    Best = bindings::drm_mm_insert_mode_DRM_MM_INSERT_BEST,
+
+    /// Search for the lowest hole (address closest to 0, within the search range) that fits the
+    /// desired node.
+    ///
+    /// Allocates the node from the bottom of the found hole.
+    Low = bindings::drm_mm_insert_mode_DRM_MM_INSERT_LOW,
+
+    /// Search for the highest hole (address closest to U64_MAX, within the search range) that fits
+    /// the desired node.
+    ///
+    /// Allocates the node from the top of the found hole. The specified alignment for the node is
+    /// applied to the base of the node (`Node.start()`).
+    High = bindings::drm_mm_insert_mode_DRM_MM_INSERT_HIGH,
+
+    /// Search for the most recently evicted hole (within the search range) that fits the desired
+    /// node. This is appropriate for use immediately after performing an eviction scan and removing
+    /// the selected nodes to form a hole.
+    ///
+    /// Allocates the node from the bottom of the found hole.
+    Evict = bindings::drm_mm_insert_mode_DRM_MM_INSERT_EVICT,
+}
+
+/// A clonable, interlocked reference to the allocator state.
+///
+/// This is useful to perform actions on the user-supplied `AllocInner<T>` type given just a Node,
+/// without immediately taking the lock.
+#[derive(Clone)]
+pub struct InnerRef<A: AllocInner<T>, T>(Arc<Mutex<MmInner<A, T>>>);
+
+impl<A: AllocInner<T>, T> InnerRef<A, T> {
+    /// Operate on the user `AllocInner<T>` implementation, taking the lock.
+    pub fn with<RetVal>(&self, cb: impl FnOnce(&mut A) -> RetVal) -> RetVal {
+        let mut l = self.0.lock();
+        cb(&mut l.1)
+    }
+}
+
+impl<A: AllocInner<T>, T> NodeData<A, T> {
+    /// Returns the color of the node (an opaque value)
+    pub fn color(&self) -> usize {
+        self.node.color as usize
+    }
+
+    /// Returns the start address of the node
+    pub fn start(&self) -> u64 {
+        self.node.start
+    }
+
+    /// Returns the size of the node in bytes
+    pub fn size(&self) -> u64 {
+        self.node.size
+    }
+
+    /// Operate on the user `AllocInner<T>` implementation associated with this node's allocator.
+    pub fn with_inner<RetVal>(&self, cb: impl FnOnce(&mut A) -> RetVal) -> RetVal {
+        let mut l = self.mm.lock();
+        cb(&mut l.1)
+    }
+
+    /// Return a clonable, detached reference to the allocator inner data.
+    pub fn alloc_ref(&self) -> InnerRef<A, T> {
+        InnerRef(self.mm.clone())
+    }
+
+    /// Return a mutable reference to the inner data.
+    pub fn inner_mut(self: Pin<&mut Self>) -> &mut T {
+        // SAFETY: This is okay because inner is not structural
+        unsafe { &mut self.get_unchecked_mut().inner }
+    }
+}
+
+impl<A: AllocInner<T>, T> Deref for NodeData<A, T> {
+    type Target = T;
+
+    fn deref(&self) -> &Self::Target {
+        &self.inner
+    }
+}
+
+impl<A: AllocInner<T>, T> Drop for NodeData<A, T> {
+    fn drop(&mut self) {
+        if self.valid {
+            let mut guard = self.mm.lock();
+
+            // Inform the user allocator that a node is being dropped.
+            guard
+                .1
+                .drop_object(self.start(), self.size(), self.color(), &mut self.inner);
+            // SAFETY: The MM lock is still taken, so we can safely remove the node.
+            unsafe { bindings::drm_mm_remove_node(&mut self.node) };
+        }
+    }
+}
+
+/// An instance of a DRM MM range allocator.
+pub struct Allocator<A: AllocInner<T>, T> {
+    mm: Arc<Mutex<MmInner<A, T>>>,
+    _p: PhantomData<T>,
+}
+
+impl<A: AllocInner<T>, T> Allocator<A, T> {
+    /// Create a new range allocator for the given start and size range of addresses.
+    ///
+    /// The user may optionally provide an inner object representing allocator state, which will
+    /// be protected by the same lock. If not required, `()` can be used.
+    pub fn new(
+        start: u64,
+        size: u64,
+        inner: A,
+        name: &'static CStr,
+        lock_key: &'static LockClassKey,
+    ) -> Result<Allocator<A, T>> {
+        // SAFETY: We call `Mutex::init_lock` below.
+        let mut mm: Pin<UniqueArc<Mutex<MmInner<A, T>>>> = UniqueArc::try_new(unsafe {
+            Mutex::new(MmInner(Opaque::uninit(), inner, PhantomData))
+        })?
+        .into();
+
+        mm.as_mut().init_lock(name, lock_key);
+
+        unsafe {
+            // SAFETY: The Opaque instance provides a valid pointer, and it is initialized after
+            // this call.
+            bindings::drm_mm_init(mm.lock().0.get(), start, size);
+        }
+
+        Ok(Allocator {
+            mm: mm.into(),
+            _p: PhantomData,
+        })
+    }
+
+    /// Insert a new node into the allocator of a given size.
+    ///
+    /// `node` is the user `T` type data to store into the node.
+    pub fn insert_node(&mut self, node: T, size: u64) -> Result<Node<A, T>> {
+        self.insert_node_generic(node, size, 0, 0, InsertMode::Best)
+    }
+
+    /// Insert a new node into the allocator of a given size, with configurable alignment,
+    /// color, and insertion mode.
+    ///
+    /// `node` is the user `T` type data to store into the node.
+    pub fn insert_node_generic(
+        &mut self,
+        node: T,
+        size: u64,
+        alignment: u64,
+        color: usize,
+        mode: InsertMode,
+    ) -> Result<Node<A, T>> {
+        self.insert_node_in_range(node, size, alignment, color, 0, u64::MAX, mode)
+    }
+
+    /// Insert a new node into the allocator of a given size, with configurable alignment,
+    /// color, insertion mode, and sub-range to allocate from.
+    ///
+    /// `node` is the user `T` type data to store into the node.
+    #[allow(clippy::too_many_arguments)]
+    pub fn insert_node_in_range(
+        &mut self,
+        node: T,
+        size: u64,
+        alignment: u64,
+        color: usize,
+        start: u64,
+        end: u64,
+        mode: InsertMode,
+    ) -> Result<Node<A, T>> {
+        let mut mm_node = Box::try_new(NodeData {
+            // SAFETY: This C struct should be zero-initialized.
+            node: unsafe { core::mem::zeroed() },
+            valid: false,
+            inner: node,
+            mm: self.mm.clone(),
+            _pin: PhantomPinned,
+        })?;
+
+        let guard = self.mm.lock();
+        // SAFETY: We hold the lock and all pointers are valid.
+        to_result(unsafe {
+            bindings::drm_mm_insert_node_in_range(
+                guard.0.get(),
+                &mut mm_node.node,
+                size,
+                alignment,
+                color as core::ffi::c_ulong,
+                start,
+                end,
+                mode as u32,
+            )
+        })?;
+
+        mm_node.valid = true;
+
+        Ok(Pin::from(mm_node))
+    }
+
+    /// Insert a node into the allocator at a fixed start address.
+    ///
+    /// `node` is the user `T` type data to store into the node.
+    pub fn reserve_node(
+        &mut self,
+        node: T,
+        start: u64,
+        size: u64,
+        color: usize,
+    ) -> Result<Node<A, T>> {
+        let mut mm_node = Box::try_new(NodeData {
+            // SAFETY: This C struct should be zero-initialized.
+            node: unsafe { core::mem::zeroed() },
+            valid: false,
+            inner: node,
+            mm: self.mm.clone(),
+            _pin: PhantomPinned,
+        })?;
+
+        mm_node.node.start = start;
+        mm_node.node.size = size;
+        mm_node.node.color = color as core::ffi::c_ulong;
+
+        let guard = self.mm.lock();
+        // SAFETY: We hold the lock and all pointers are valid.
+        to_result(unsafe { bindings::drm_mm_reserve_node(guard.0.get(), &mut mm_node.node) })?;
+
+        mm_node.valid = true;
+
+        Ok(Pin::from(mm_node))
+    }
+
+    /// Operate on the inner user type `A`, taking the allocator lock
+    pub fn with_inner<RetVal>(&self, cb: impl FnOnce(&mut A) -> RetVal) -> RetVal {
+        let mut guard = self.mm.lock();
+        cb(&mut guard.1)
+    }
+}
+
+impl<A: AllocInner<T>, T> Drop for MmInner<A, T> {
+    fn drop(&mut self) {
+        // SAFETY: If the MmInner is dropped then all nodes are gone (since they hold references),
+        // so it is safe to tear down the allocator.
+        unsafe {
+            bindings::drm_mm_takedown(self.0.get());
+        }
+    }
+}
+
+// MmInner is safely Send if the AllocInner user type is Send.
+unsafe impl<A: Send + AllocInner<T>, T> Send for MmInner<A, T> {}
diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
index c44760a1332f..73fab2dee3af 100644
--- a/rust/kernel/drm/mod.rs
+++ b/rust/kernel/drm/mod.rs
@@ -7,3 +7,4 @@ pub mod drv;
 pub mod file;
 pub mod gem;
 pub mod ioctl;
+pub mod mm;
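As a quick illustration of what the range allocator above provides, here is a hand-rolled userspace sketch of the `InsertMode::Best` behavior (a hypothetical toy model, not the drm_mm code): free space is a list of `(start, size)` holes, and Best picks the smallest hole that fits, allocating from its bottom.

```rust
// Toy model of Best-mode insertion: pick the smallest hole that fits,
// allocate from its bottom, and shrink or remove the hole.
fn insert_best(holes: &mut Vec<(u64, u64)>, size: u64) -> Option<u64> {
    let mut best: Option<usize> = None;
    for (i, &(_, hsize)) in holes.iter().enumerate() {
        if hsize >= size && best.map_or(true, |b| hsize < holes[b].1) {
            best = Some(i);
        }
    }
    let idx = best?;
    let (start, hsize) = holes[idx];
    if hsize == size {
        // The node consumes the hole entirely.
        holes.remove(idx);
    } else {
        // Allocate from the bottom of the hole.
        holes[idx] = (start + size, hsize - size);
    }
    Some(start)
}

fn main() {
    // One big hole covering [0, 0x10000), as after a fresh init.
    let mut holes = vec![(0u64, 0x10000u64)];
    assert_eq!(insert_best(&mut holes, 0x1000), Some(0));
    assert_eq!(insert_best(&mut holes, 0x2000), Some(0x1000));
    println!("remaining holes: {:x?}", holes);
}
```

The real allocator also handles alignment, color, sub-ranges, and the other insertion modes; this only shows the hole-search idea.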
On Tue, Mar 07, 2023 at 11:25:32PM +0900, Asahi Lina wrote:
drm_mm provides a simple range allocator, useful for managing virtual address ranges. Add a Rust abstraction to expose this module to Rust drivers.
Signed-off-by: Asahi Lina <lina@asahilina.net>
In the cover letter you mentioned the design open about embedding the lock into the rust wrappers.
I think for a first step that's perfectly fine.
Longer term we might want to ramp up some "proof of locking" infrastructure in Rust, where callers can supply a lock guard and ideally rust validates at compile time that it's for the right type, and at runtime (like lockdep) that it's consistent and the callers don't mix up locks (like using different locks for the same drm_mm allocator).
There's a lot of libraries in the kernel that have this "caller ensures locking" pattern. drm/sched also has these requirements.
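The guard-passing idea can be sketched in plain userspace Rust (all names hypothetical, `std::sync::Mutex` standing in for the kernel mutex). The compiler then verifies that callers hold a guard of the right type; telling two locks of the same type apart still needs runtime help, as noted above:

```rust
use std::sync::{Mutex, MutexGuard};

struct MmState {
    allocated: u64,
}

struct Allocator {
    inner: Mutex<MmState>,
}

impl Allocator {
    fn new() -> Self {
        Allocator { inner: Mutex::new(MmState { allocated: 0 }) }
    }

    fn lock(&self) -> MutexGuard<'_, MmState> {
        self.inner.lock().unwrap()
    }
}

// "Caller ensures locking": this function can only be called with a guard
// for MmState, so forgetting the lock is a compile error. It cannot tell
// two MmState locks apart, though; that still needs lockdep-style runtime
// checks (or fancier brand-lifetime tricks).
fn insert_locked(state: &mut MutexGuard<'_, MmState>, size: u64) -> u64 {
    let start = state.allocated;
    state.allocated += size;
    start
}

fn main() {
    let alloc = Allocator::new();
    let mut guard = alloc.lock();
    assert_eq!(insert_locked(&mut guard, 0x1000), 0);
    assert_eq!(insert_locked(&mut guard, 0x1000), 0x1000);
    println!("ok");
}
```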
There's two other things I'd like to bring up on this patch though, just because it's a good example. But they're both really general points that apply for all the rust wrappers.
Documentation:
In drm we try to document all the interfaces that drivers use with formal docs. Yes there's some areas that are not great for historical reasons, but for new stuff and new wrappers we're really trying:
- This helps in telling internal (even across .c files or in rust across modules within a crate) from stuff drivers access. Sure you have static in C or pub in rust, but that doesn't tell you whether it's public all the way to drivers.
- ideally docs have a short intro section that explains the main concepts and links to the main data structures and functions. Just to give readers a good starting point to explore.
- Linking all the things, so that readers can connect the different parts. This is really important in C where e.g. get/put() or any such function pairs all needed to be linked together. With rust I'm hoping that rustdoc liberally sprinkles links already and we don't have to do this as much.
- Short explainers for parameters. For rust this also means type parameters, for those even simplified examples of how drivers are supposed to use them would help a lot in reading docs & understanding concepts.
- Ideally links from the rust to the sphinx side to link relevant chapters together. Often the bigger explanations are in .rst files with DOT graphs (kms has a bunch I've added) or similar, and it doesn't make that much sense to duplicate all that on the rust side I guess. But it needs to be discoverable.
This might be more a discussion topic for the rust people than you directly. Still needed for the merge-ready patches eventually.
Refcounting vs borrowing:
This is honestly much more the eyebrow raising one than the locking. Very often on the C side these datastructures all work with borrow semantics, and you need to explicitly upgrade to a full reference (kref_get or kref_get_unless_zero, depending whether it's a strong or weak reference) if you need the object outside of the mutex/lock guard section.
Again I think for now it's ok, but the sales pitch of rust is that it enables borrow lifetime checking with no runtime cost. Plus viz the vm cleanup example, if you have too many strong backreferences the cleanup flow gets complicated. And it would suck if rust drivers have to add complexity like the openrefcount for the vm example simply because we can't model the borrow semantics well enough to be safe.
So not something that's really bad here, but if we need to resort to full refcounting already for simple datastructures then I'm getting a bit worried about how well rust will cope with the really nasty borrowed reference tricks we're playing in other areas.
Again more a topic for the rust folks I think than specifically here about drm_mm wrapping. Just to get things going I think this is fine.
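The borrow-vs-refcount split described above maps onto Rust fairly directly. A userspace sketch with std types (an analogy, not the kernel API): within a lock guard's scope plain borrows cost nothing, and the kref_get-style upgrade to a strong reference is an explicit `Arc::clone` only where the object must outlive the critical section.

```rust
use std::sync::{Arc, Mutex};

struct Object {
    id: u32,
}

fn main() {
    let table: Mutex<Vec<Arc<Object>>> =
        Mutex::new(vec![Arc::new(Object { id: 7 })]);

    // Borrow semantics: inside the guard's scope, plain references work
    // with no refcounting traffic at all.
    {
        let guard = table.lock().unwrap();
        let obj: &Object = &guard[0];
        assert_eq!(obj.id, 7);
    } // the borrow cannot escape the guard; no kref_get equivalent needed

    // Explicit upgrade: take a strong reference only when the object has
    // to outlive the locked section (the kref_get analogue).
    let strong: Arc<Object> = {
        let guard = table.lock().unwrap();
        Arc::clone(&guard[0])
    };
    assert_eq!(strong.id, 7);
    println!("ok");
}
```

Whether patterns like this scale to the nastier borrowed-reference tricks in other subsystems is exactly the open question raised here.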
Cheers, Daniel
On Thu, Apr 6, 2023 at 4:15 PM Daniel Vetter <daniel@ffwll.ch> wrote:
Documentation:
In drm we try to document all the interfaces that drivers use with formal docs. Yes there's some areas that are not great for historical reasons, but for new stuff and new wrappers we're really trying:
- This helps in telling internal (even across .c files or in rust across modules within a crate) from stuff drivers access. Sure you have static in C or pub in rust, but that doesn't tell you whether it's public all the way to drivers.
I think you may be talking about the value of high-level docs here, but just in case, visibility in Rust is flexible enough to expose (or not) APIs to those that you need. In other words, it does tell you (and enforces!) whether it is public all the way to drivers.
There is also the possibility of even more fancy visibility, but so far we just needed `pub(crate)`.
`rustdoc` also shows/hides things as needed, thus the generated docs for the crate should only show what is usable by others.
Then there is the `kernel` crate split, too.
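A small single-file sketch of those visibility levels (hypothetical names; within one crate the compiler cannot demonstrate the rejection, so the comments mark what would fail from a separate driver crate):

```rust
mod kernel {
    pub mod drm {
        /// Public all the way to drivers.
        pub struct Device {
            pub(crate) id: u32, // field visible inside the kernel crate only
        }

        // Helper visible only within the kernel crate; a driver crate
        // calling this would get a compile error.
        pub(crate) fn internal_register(dev: &Device) -> u32 {
            dev.id
        }

        /// Driver-facing constructor (drivers cannot touch `id` directly).
        pub fn new_device(id: u32) -> Device {
            Device { id }
        }

        /// The driver-facing entry point.
        pub fn register(dev: &Device) -> u32 {
            internal_register(dev)
        }
    }
}

fn main() {
    let dev = kernel::drm::new_device(3);
    assert_eq!(kernel::drm::register(&dev), 3);
    println!("ok");
}
```

`rustdoc` would show only `Device`, `new_device`, and `register` in the generated docs; the `pub(crate)` items stay hidden.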
- ideally docs have a short intro section that explains the main concepts and links to the main data structures and functions. Just to give readers a good starting point to explore.
Agreed, this is typically done in Rust in the top-level doc comments (module or crate). For the Rust side of the kernel, we are definitely trying to emphasize the quality of the docs, including compile- and runtime-tested examples.
Regarding linking, `rustdoc` already generates a listing with the contents of each crate/module even if there are no other docs. So as long as the short descriptions of the items are good, it may be fairly readable already, e.g. see https://rust-for-linux.github.io/docs/rust/kernel/sync/index.html for an example in our old `rust` branch. But, of course, you can add extra docs at that level too when there are many things or it is unclear what should be used.
Also note that, sometimes, the docs we write are in the type, rather than the module, e.g. see the nice examples Wedson wrote for `RBTree`: https://rust-for-linux.github.io/docs/rust/kernel/rbtree/struct.RBTree.html.
- Linking all the things, so that readers can connect the different parts. This is really important in C where e.g. get/put() or any such function pairs all needed to be linked together. With rust I'm hoping that rustdoc liberally sprinkles links already and we don't have to do this as much.
If you mean within doc comments, it does! :) It is called "intra-doc links". Basically, you just write something in-between square brackets, and it is able to create the link to the right thing (in most cases, otherwise you can help it more), e.g.
/// Returns a new [`Foo`].
And, of course, for the rest of things that aren't inside comments, it automatically provides links etc.
There has been work on `rustdoc` on getting "Jump to Definition" and similar features to work on the source view, too.
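A minimal illustration of the syntax (hypothetical items):

```rust
/// A handle to some hypothetical resource.
pub struct Foo(pub u32);

/// Returns a new [`Foo`].
///
/// The square brackets above are an intra-doc link: `rustdoc` resolves
/// them to the generated page for `Foo` automatically, with no manual
/// anchor or path needed.
pub fn new_foo() -> Foo {
    Foo(42)
}

fn main() {
    assert_eq!(new_foo().0, 42);
    println!("ok");
}
```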
- Short explainers for parameters. For rust this also means type parameters, for those even simplified examples of how drivers are supposed to use them would help a lot in reading docs & understanding concepts.
For parameters, we are not forcing anyone to write explanations for every parameter (as in providing a list), but rather to write what is actually useful to know (referring to the parameters as needed). So it depends on the case.
In any case, it is in general clearer what parameters are compared to C, due to the stronger typing. Of course, if one uses integers everywhere, it is as confusing as C. But if one has a type, it is easier to tell, plus one may jump with a click into the explanation of that type etc.
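As a trivial illustration of that point (made-up names), a newtype already documents a parameter better than a bare integer, and misuse becomes a compile error:

```rust
/// Hypothetical unit-carrying newtype instead of a bare u64.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PageCount(u64);

// The signature itself says what the parameter means; calling
// reserve(16) with a raw integer would not compile.
fn reserve(pages: PageCount) -> u64 {
    pages.0 * 4096
}

fn main() {
    assert_eq!(reserve(PageCount(16)), 65536);
    println!("ok");
}
```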
Regarding examples, 100% agreed. And not only that, the examples are enforced to be kept up to date by compiling and running them via KUnit (not yet submitted for mainline, but we have been enforcing it for our old `rust` branch for a long time).
- Ideally links from the rust to the sphinx side to link relevant chapters together. Often the bigger explanations are in .rst files with DOT graphs (kms has a bunch I've added) or similar, and it doesn't make that much sense to duplicate all that on the rust side I guess. But it needs to be discoverable.
Definitely. One next step is having easy-to-write links to the rST docs. For this, a couple years ago I talked with the `rustdoc` maintainers about having a "External references map file" feature, so that we can link rST documents from the Rust docs, including generated C docs too. For instance, ideally we would be able to use the square brackets around a C type and have it work:
/// Exposes the kernel’s [`struct wait_queue_head`] as a condition variable.
Regarding the bigger explanations: we are trying to keep most of the docs close to the Rust code where it makes sense, as module-level/crate-level docs, rather than as rST docs. This has several benefits, like keeping them closer to the code, the linking features, having them organized equally as the code, no need to know whether there is a doc somewhere or not (e.g. if it is, it is near the code), examples are compiled, etc.
Of course, sometimes longer-form docs and other documents may not make sense as part of any code in particular, or may be shared across C and Rust, etc., and there it may more sense to use `Documentation/` files instead.
But, in general, the idea is that, compared to C, most of the docs go into the code. To give an idea of the difference: so far, in our old `rust` branch, we only needed a few documents in `Documentation/` (e.g. the Quick Start guide etc.), and everything else went into the code itself.
Cheers, Miguel
On Thu, Apr 06, 2023 at 05:28:59PM +0200, Miguel Ojeda wrote:
On Thu, Apr 6, 2023 at 4:15 PM Daniel Vetter <daniel@ffwll.ch> wrote:
Documentation:
In drm we try to document all the interfaces that drivers use with formal docs. Yes there's some areas that are not great for historical reasons, but for new stuff and new wrappers we're really trying:
- This helps in telling internal (even across .c files or in rust across modules within a crate) from stuff drivers access. Sure you have static in C or pub in rust, but that doesn't tell you whether it's public all the way to drivers.
I think you may be talking about the value of high-level docs here, but just in case, visibility in Rust is flexible enough to expose (or not) APIs to those that you need. In other words, it does tell you (and enforces!) whether it is public all the way to drivers.
There is also the possibility of even more fancy visibility, but so far we just needed `pub(crate)`.
`rustdoc` also shows/hides things as needed, thus the generated docs for the crate should only show what is usable by others.
Then there is the `kernel` crate split, too.
- ideally docs have a short intro section that explains the main concepts and links to the main data structures and functions. Just to give readers a good starting point to explore.
Agreed, this is typically done in Rust in the top-level doc comments (module or crate). For the Rust side of the kernel, we are definitely trying to emphasize the quality of the docs, including compile- and runtime-tested examples.
Regarding linking, `rustdoc` already generates a listing with the contents of each crate/module even if there are no other docs. So as long as the short descriptions of the items are good, it may be fairly readable already, e.g. see https://rust-for-linux.github.io/docs/rust/kernel/sync/index.html for an example in our old `rust` branch. But, of course, you can add extra docs at that level too when there are many things or it is unclear what should be used.
Also note that, sometimes, the docs we write are in the type, rather than the module, e.g. see the nice examples Wedson wrote for `RBTree`: https://rust-for-linux.github.io/docs/rust/kernel/rbtree/struct.RBTree.html.
Yeah this all looks great and very hyperlinked.
I think the only nit I have is that for types with two or more type variables (like the rbtree), the top intro should say what each of them represents. I can guess it's <Key, Value> and not the other way round, but confirming that takes quite a bit of scrolling through the function types.
Otherwise I think perfect api docs.
- Linking all the things, so that readers can connect the different parts. This is really important in C where e.g. get/put() or any such function pairs all needed to be linked together. With rust I'm hoping that rustdoc liberally sprinkles links already and we don't have to do this as much.
If you mean within doc comments, it does! :) It is called "intra-doc links". Basically, you just write something in-between square brackets, and it is able to create the link to the right thing (in most cases, otherwise you can help it more), e.g.
/// Returns a new [`Foo`].
And, of course, for the rest of things that aren't inside comments, it automatically provides links etc.
There has been work on `rustdoc` on getting "Jump to Definition" and similar features to work on the source view, too.
- Short explainers for parameters. For rust this also means type parameters, for those even simplified examples of how drivers are supposed to use them would help a lot in reading docs & understanding concepts.
For parameters, we are not forcing anyone to write explanations for every parameter (as in providing a list), but rather to write what is actually useful to know (referring to the parameters as needed). So it depends on the case.
In any case, it is in general clearer what parameters are compared to C, due to the stronger typing. Of course, if one uses integers everywhere, it is as confusing as C. But if one has a type, it is easier to tell, plus one may jump with a click into the explanation of that type etc.
Regarding examples, 100% agreed. And not only that, the examples are enforced to be kept up to date by compiling and running them via KUnit (not yet submitted for mainline, but we have been enforcing it for our old `rust` branch for a long time).
- Ideally links from the rust to the sphinx side to link relevant chapters together. Often the bigger explanations are in .rst files with DOT graphs (kms has a bunch I've added) or similar, and it doesn't make that much sense to duplicate all that on the rust side I guess. But it needs to be discoverable.
Definitely. One next step is having easy-to-write links to the rST docs. For this, a couple years ago I talked with the `rustdoc` maintainers about having a "External references map file" feature, so that we can link rST documents from the Rust docs, including generated C docs too. For instance, ideally we would be able to use the square brackets around a C type and have it work:
/// Exposes the kernel’s [`struct wait_queue_head`] as a condition variable.
Regarding the bigger explanations: we are trying to keep most of the docs close to the Rust code where it makes sense, as module-level/crate-level docs, rather than as rST docs. This has several benefits, like keeping them closer to the code, the linking features, having them organized equally as the code, no need to know whether there is a doc somewhere or not (e.g. if it is, it is near the code), examples are compiled, etc.
Just a quick comment on this, that's the same we do on the C side. Most overview chapters are actually DOC: sections pulled in from the code.
What I meant here is that for big overview stuff (like for modesetting how the display pipe structures tie together as an example: https://dri.freedesktop.org/docs/drm/gpu/drm-kms.html#overview) it doesn't make sense to duplicate that in rustdoc once more.
Of course, sometimes longer-form docs and other documents may not make sense as part of any code in particular, or may be shared across C and Rust, etc., and there it may make more sense to use `Documentation/` files instead.
But, in general, the idea is that, compared to C, most of the docs go into the code. To give an idea of the difference: so far, in our old `rust` branch, we only needed a few documents in `Documentation/` (e.g. the Quick Start guide etc.), and everything else went into the code itself.
Maybe drm is the exception, but if you look at our .rst files we also have most of our docs in the code:
https://cgit.freedesktop.org/drm/drm/tree/Documentation/gpu/drm-kms-helpers....
The rst files just provide the scaffolding because C doesn't have the crate/module hierarchy that would do this for you automatically.
Cheers, Daniel
On Thu, Apr 6, 2023 at 5:45 PM Daniel Vetter daniel@ffwll.ch wrote:
Yeah this all looks great and very hyperlinked.
I think the only nit I have is that for types with two or more type variables (like the rbtree), the top intro should explain what each of them represents. I can guess it's <Key, Value> and not the other way round, but confirmation takes quite a bit of scrolling to check against the function types.
Yeah, that is fair. Personally I prefer more descriptive names when there are several or they have a special/asymmetric role.
Otherwise I think perfect api docs.
Glad you like it!
Just a quick comment on this, that's the same we do on the C side. Most overview chapters are actually DOC: sections pulled in from the code.
What I meant here is that for big overview stuff (like for modesetting how the display pipe structures tie together as an example: https://dri.freedesktop.org/docs/drm/gpu/drm-kms.html#overview) it doesn't make sense to duplicate that in rustdoc once more.
Yeah, definitely, if it is already somewhere else for C, we shouldn't duplicate it (that is what I meant by the "shared across C and Rust" exception).
Maybe drm is the exception, but if you look at our .rst files we also have most of our docs in the code:
https://cgit.freedesktop.org/drm/drm/tree/Documentation/gpu/drm-kms-helpers....
The rst files just provide the scaffolding because C dosn't have crates/modules hierarchy that would do this for you automatically.
Sorry, I was talking in general in the kernel. That `drm-kms-helpers.rst` looks great.
From a quick grep, I think you are indeed one of the big users of `DOC: `, which indeed map closely to what you would do in Rust without the scaffolding need.
So I think you will like writing docs in Rust :)
Cheers, Miguel
On 06/04/2023 23.15, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:32PM +0900, Asahi Lina wrote:
drm_mm provides a simple range allocator, useful for managing virtual address ranges. Add a Rust abstraction to expose this module to Rust drivers.
Signed-off-by: Asahi Lina lina@asahilina.net
In the cover letter you mentioned the design open about embedding the lock into the rust wrappers.
I think for a first step that's perfectly fine.
Longer term we might want to ramp up some "proof of locking" infrastructure in Rust, where callers can supply a lock guard and ideally rust validates at compile time that it's for the right type, and at runtime (like lockdep) that it's consistent and the callers don't mix up locks (like using different locks for the same drm_mm allocator).
That proof-of-lock stuff works in Rust too as far as I know.
But the general thread safety story in Rust is much simpler, you just use methods that take &mut self when locking is the caller's responsibility. That effectively implies that there can only be one reference that can call those methods at any given time, thanks to the borrow checker. Shared references only give you &self, a locked Mutex upgrades that to &mut self, and that's how you get proof of locking at compile time, through and through, not just for the type but for the specific object.
There's a lot of libraries in the kernel that have this "caller ensures locking" pattern. drm/sched also has these requirements.
Yup, that all usually maps nicely to &mut self in Rust... except for the issue below.
There's two other things I'd like to bring up on this patch though, just because it's a good example. But they're both really general points that apply for all the rust wrappers.
Documentation:
In drm we try to document all the interfaces that drivers use with formal docs. Yes there's some areas that are not great for historical reasons, but for new stuff and new wrappers we're really trying:
This helps in telling internal (even across .c files or in rust across modules within a crate) from stuff drivers access. Sure you have static in C or pub in rust, but that doesn't tell you whether it's public all the way to drivers.
ideally docs have a short intro section that explains the main concepts and links to the main data structures and functions. Just to give readers a good starting point to explore.
Linking all the things, so that readers can connect the different parts. This is really important in C where e.g. get/put() or any such function pairs all needed to be linked together. With rust I'm hoping that rustdoc liberally sprinkles links already and we don't have to do this as much.
Short explainers for parameters. For rust this also means type parameters, for those even simplified examples of how drivers are supposed to use them would help a lot in reading docs & understanding concepts.
- Ideally links from the rust to the sphinx side to link relevant chapters together. Often the bigger explanations are in .rst files with DOT graphs (kms has a bunch I've added) or similar, and it doesn't make that much sense to duplicate all that on the rust side I guess. But it needs to be discoverable.
This might be more a discussion topic for the rust people than you directly. Still needed for the merge-ready patches eventually.
I don't know much about the doc gen stuff on the Rust side so yeah, this is something I need to look into to make it pretty and complete...
Refcounting vs borrowing:
This is honestly much more the eyebrow raising one than the locking. Very often on the C side these datastructures all work with borrow semantics, and you need to explicitly upgrade to a full reference (kref_get or kref_get_unless_zero, depending whether it's a strong or weak reference) if you need the object outside of the mutex/lock guard section.
Again I think for now it's ok, but the sales pitch of rust is that it enables borrow lifetime checking with no runtime cost. Plus viz the vm cleanup example, if you have too many strong backreferences the cleanup flow gets complicated. And it would suck if rust drivers have to add complexity like the openrefcount for the vm example simply because we can't model the borrow semantics well enough to be safe.
So not something that's really bad here, but if we need to resort to full refcounting already for simple datastructures then I'm getting a bit worried about how well rust will cope with the really nasty borrowed reference tricks we're playing in other areas.
Again more a topic for the rust folks I think than specifically here about drm_mm wrapping. Just to get things going I think this is fine.
Yeeeeah... this is a *specific* problem. Drop.
The Allocator<T> itself is perfectly safe to implement without any locking, refcounting, or anything. You just make the methods take &mut self (as they already do), the caller can use it with a single reference or wrap it in an Arc<Mutex<T>> and share it, or whatever.
The problem is the Node<A, T>. When you Drop that, it has to go back to the Allocator. But now you're a different object, so no thread safety guarantees. And you need to keep the Allocator alive. So now to make a safe abstraction, you need refcounting and a mutex.
Lifetimes just don't work here, sadly. Not for a useful abstraction.
I'd love to hear from the other Rust folks whether they have any better ideas...
One thing that *can* be done is making the Drop illegal (Rust can't do this "natively" but Linux already has hacks for that, we can make it fail to link if the Drop is ever called). Then you'd have to actively return the Node to the Allocator with a free function. Since Drop is forbidden, and Node is pinned, you'd always have to either return Node objects to the Allocator or leak them. You could drop the Allocator before its nodes, but as far as I know drm_mm can safely handle that (though it will complain), and then due to the previous guarantees the *only* thing you could do with orphan nodes is leak their memory, which is safe.
It would work... but it breaks the whole Rust automagic Drop stuff.
Thinking about this a bit, I think I want the current mutex/arc semantics for something like a memory allocator (which is one of my primary use cases for drm_mm), since I definitely don't want to be manually returning objects to their allocator all over the place, nor have overarching lifetime requirements that the allocator outlive its objects for safety (that sounds like a can of worms I don't want to open, I'd much rather use a refcount even if I "think" I can prove the lifetime bounds ad-hoc). But for something like a drm_mm that is tracking VA ranges within a VM with all Nodes held internally, maybe I could manage it all internally and have all node destruction be handled via an explicit call into the Allocator.
Maybe the mm abstraction should offer both options? The extra locking can be implemented in terms of the base unlocked version I think (perhaps with some Deref abuse for ergonomics)... I definitely want to hear more opinions about this from other Rust folks, since there are probably other options I haven't considered...
Aside: This, and all the other DRM abstractions, were written before the pin_init stuff from y86 that is in review right now was ready. That may open up more interesting/ergonomic/efficient APIs for some cases, especially where Pin and embedding C types into user objects in some way are involved. So maybe there's room for improvement here. Just a sidenote.
Cheers, Daniel
 rust/kernel/drm/mm.rs  | 309 +++++++++++++++++++++++++++++++++++++++++++++++++
 rust/kernel/drm/mod.rs |   1 +
 2 files changed, 310 insertions(+)

diff --git a/rust/kernel/drm/mm.rs b/rust/kernel/drm/mm.rs
new file mode 100644
index 000000000000..83e27a7dcc7e
--- /dev/null
+++ b/rust/kernel/drm/mm.rs
@@ -0,0 +1,309 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+
+//! DRM MM range allocator
+//!
+//! C header: [`include/linux/drm/drm_mm.h`](../../../../include/linux/drm/drm_mm.h)
+
+use crate::{
+    bindings,
+    error::{to_result, Result},
+    str::CStr,
+    sync::{Arc, LockClassKey, LockIniter, Mutex, UniqueArc},
+    types::Opaque,
+};
+
+use alloc::boxed::Box;
+
+use core::{
+    marker::{PhantomData, PhantomPinned},
+    ops::Deref,
+    pin::Pin,
+};
+
+/// Type alias representing a DRM MM node.
+pub type Node<A, T> = Pin<Box<NodeData<A, T>>>;
+
+/// Trait which must be implemented by the inner allocator state type provided by the user.
+pub trait AllocInner<T> {
+    /// Notification that a node was dropped from the allocator.
+    fn drop_object(&mut self, _start: u64, _size: u64, _color: usize, _object: &mut T) {}
+}
+
+impl<T> AllocInner<T> for () {}
+
+/// Wrapper type for a `struct drm_mm` plus user AllocInner object.
+///
+/// # Invariants
+/// The `drm_mm` struct is valid and initialized.
+struct MmInner<A: AllocInner<T>, T>(Opaque<bindings::drm_mm>, A, PhantomData<T>);
+
+/// Represents a single allocated node in the MM allocator
+pub struct NodeData<A: AllocInner<T>, T> {
+    node: bindings::drm_mm_node,
+    mm: Arc<Mutex<MmInner<A, T>>>,
+    valid: bool,
+    /// A drm_mm_node needs to be pinned because nodes reference each other in a linked list.
+    _pin: PhantomPinned,
+    inner: T,
+}
+
+// SAFETY: Allocator ops take the mutex, and there are no mutable actions on the node.
+unsafe impl<A: Send + AllocInner<T>, T: Send> Send for NodeData<A, T> {}
+unsafe impl<A: Send + AllocInner<T>, T: Sync> Sync for NodeData<A, T> {}
+
+/// Available MM node insertion modes
+#[repr(u32)]
+pub enum InsertMode {
+    /// Search for the smallest hole (within the search range) that fits the desired node.
+    ///
+    /// Allocates the node from the bottom of the found hole.
+    Best = bindings::drm_mm_insert_mode_DRM_MM_INSERT_BEST,
+
+    /// Search for the lowest hole (address closest to 0, within the search range) that fits the
+    /// desired node.
+    ///
+    /// Allocates the node from the bottom of the found hole.
+    Low = bindings::drm_mm_insert_mode_DRM_MM_INSERT_LOW,
+
+    /// Search for the highest hole (address closest to U64_MAX, within the search range) that fits
+    /// the desired node.
+    ///
+    /// Allocates the node from the top of the found hole. The specified alignment for the node is
+    /// applied to the base of the node (`Node.start()`).
+    High = bindings::drm_mm_insert_mode_DRM_MM_INSERT_HIGH,
+
+    /// Search for the most recently evicted hole (within the search range) that fits the desired
+    /// node. This is appropriate for use immediately after performing an eviction scan and removing
+    /// the selected nodes to form a hole.
+    ///
+    /// Allocates the node from the bottom of the found hole.
+    Evict = bindings::drm_mm_insert_mode_DRM_MM_INSERT_EVICT,
+}
+
+/// A clonable, interlocked reference to the allocator state.
+///
+/// This is useful to perform actions on the user-supplied `AllocInner<T>` type given just a Node,
+/// without immediately taking the lock.
+#[derive(Clone)]
+pub struct InnerRef<A: AllocInner<T>, T>(Arc<Mutex<MmInner<A, T>>>);
+
+impl<A: AllocInner<T>, T> InnerRef<A, T> {
+    /// Operate on the user `AllocInner<T>` implementation, taking the lock.
+    pub fn with<RetVal>(&self, cb: impl FnOnce(&mut A) -> RetVal) -> RetVal {
+        let mut l = self.0.lock();
+        cb(&mut l.1)
+    }
+}
+
+impl<A: AllocInner<T>, T> NodeData<A, T> {
+    /// Returns the color of the node (an opaque value)
+    pub fn color(&self) -> usize {
+        self.node.color as usize
+    }
+
+    /// Returns the start address of the node
+    pub fn start(&self) -> u64 {
+        self.node.start
+    }
+
+    /// Returns the size of the node in bytes
+    pub fn size(&self) -> u64 {
+        self.node.size
+    }
+
+    /// Operate on the user `AllocInner<T>` implementation associated with this node's allocator.
+    pub fn with_inner<RetVal>(&self, cb: impl FnOnce(&mut A) -> RetVal) -> RetVal {
+        let mut l = self.mm.lock();
+        cb(&mut l.1)
+    }
+
+    /// Return a clonable, detached reference to the allocator inner data.
+    pub fn alloc_ref(&self) -> InnerRef<A, T> {
+        InnerRef(self.mm.clone())
+    }
+
+    /// Return a mutable reference to the inner data.
+    pub fn inner_mut(self: Pin<&mut Self>) -> &mut T {
+        // SAFETY: This is okay because inner is not structural
+        unsafe { &mut self.get_unchecked_mut().inner }
+    }
+}
+
+impl<A: AllocInner<T>, T> Deref for NodeData<A, T> {
+    type Target = T;
+
+    fn deref(&self) -> &Self::Target {
+        &self.inner
+    }
+}
+
+impl<A: AllocInner<T>, T> Drop for NodeData<A, T> {
+    fn drop(&mut self) {
+        if self.valid {
+            let mut guard = self.mm.lock();
+
+            // Inform the user allocator that a node is being dropped.
+            guard
+                .1
+                .drop_object(self.start(), self.size(), self.color(), &mut self.inner);
+            // SAFETY: The MM lock is still taken, so we can safely remove the node.
+            unsafe { bindings::drm_mm_remove_node(&mut self.node) };
+        }
+    }
+}
+
+/// An instance of a DRM MM range allocator.
+pub struct Allocator<A: AllocInner<T>, T> {
+    mm: Arc<Mutex<MmInner<A, T>>>,
+    _p: PhantomData<T>,
+}
+
+impl<A: AllocInner<T>, T> Allocator<A, T> {
+    /// Create a new range allocator for the given start and size range of addresses.
+    ///
+    /// The user may optionally provide an inner object representing allocator state, which will
+    /// be protected by the same lock. If not required, `()` can be used.
+    pub fn new(
+        start: u64,
+        size: u64,
+        inner: A,
+        name: &'static CStr,
+        lock_key: &'static LockClassKey,
+    ) -> Result<Allocator<A, T>> {
+        // SAFETY: We call `Mutex::init_lock` below.
+        let mut mm: Pin<UniqueArc<Mutex<MmInner<A, T>>>> = UniqueArc::try_new(unsafe {
+            Mutex::new(MmInner(Opaque::uninit(), inner, PhantomData))
+        })?
+        .into();
+
+        mm.as_mut().init_lock(name, lock_key);
+
+        unsafe {
+            // SAFETY: The Opaque instance provides a valid pointer, and it is initialized after
+            // this call.
+            bindings::drm_mm_init(mm.lock().0.get(), start, size);
+        }
+
+        Ok(Allocator {
+            mm: mm.into(),
+            _p: PhantomData,
+        })
+    }
+
+    /// Insert a new node into the allocator of a given size.
+    ///
+    /// `node` is the user `T` type data to store into the node.
+    pub fn insert_node(&mut self, node: T, size: u64) -> Result<Node<A, T>> {
+        self.insert_node_generic(node, size, 0, 0, InsertMode::Best)
+    }
+
+    /// Insert a new node into the allocator of a given size, with configurable alignment,
+    /// color, and insertion mode.
+    ///
+    /// `node` is the user `T` type data to store into the node.
+    pub fn insert_node_generic(
+        &mut self,
+        node: T,
+        size: u64,
+        alignment: u64,
+        color: usize,
+        mode: InsertMode,
+    ) -> Result<Node<A, T>> {
+        self.insert_node_in_range(node, size, alignment, color, 0, u64::MAX, mode)
+    }
+
+    /// Insert a new node into the allocator of a given size, with configurable alignment,
+    /// color, insertion mode, and sub-range to allocate from.
+    ///
+    /// `node` is the user `T` type data to store into the node.
+    #[allow(clippy::too_many_arguments)]
+    pub fn insert_node_in_range(
+        &mut self,
+        node: T,
+        size: u64,
+        alignment: u64,
+        color: usize,
+        start: u64,
+        end: u64,
+        mode: InsertMode,
+    ) -> Result<Node<A, T>> {
+        let mut mm_node = Box::try_new(NodeData {
+            // SAFETY: This C struct should be zero-initialized.
+            node: unsafe { core::mem::zeroed() },
+            valid: false,
+            inner: node,
+            mm: self.mm.clone(),
+            _pin: PhantomPinned,
+        })?;
+
+        let guard = self.mm.lock();
+        // SAFETY: We hold the lock and all pointers are valid.
+        to_result(unsafe {
+            bindings::drm_mm_insert_node_in_range(
+                guard.0.get(),
+                &mut mm_node.node,
+                size,
+                alignment,
+                color as core::ffi::c_ulong,
+                start,
+                end,
+                mode as u32,
+            )
+        })?;
+
+        mm_node.valid = true;
+
+        Ok(Pin::from(mm_node))
+    }
+
+    /// Insert a node into the allocator at a fixed start address.
+    ///
+    /// `node` is the user `T` type data to store into the node.
+    pub fn reserve_node(
+        &mut self,
+        node: T,
+        start: u64,
+        size: u64,
+        color: usize,
+    ) -> Result<Node<A, T>> {
+        let mut mm_node = Box::try_new(NodeData {
+            // SAFETY: This C struct should be zero-initialized.
+            node: unsafe { core::mem::zeroed() },
+            valid: false,
+            inner: node,
+            mm: self.mm.clone(),
+            _pin: PhantomPinned,
+        })?;
+
+        mm_node.node.start = start;
+        mm_node.node.size = size;
+        mm_node.node.color = color as core::ffi::c_ulong;
+
+        let guard = self.mm.lock();
+        // SAFETY: We hold the lock and all pointers are valid.
+        to_result(unsafe { bindings::drm_mm_reserve_node(guard.0.get(), &mut mm_node.node) })?;
+
+        mm_node.valid = true;
+
+        Ok(Pin::from(mm_node))
+    }
+
+    /// Operate on the inner user type `A`, taking the allocator lock
+    pub fn with_inner<RetVal>(&self, cb: impl FnOnce(&mut A) -> RetVal) -> RetVal {
+        let mut guard = self.mm.lock();
+        cb(&mut guard.1)
+    }
+}
+
+impl<A: AllocInner<T>, T> Drop for MmInner<A, T> {
+    fn drop(&mut self) {
+        // SAFETY: If the MmInner is dropped then all nodes are gone (since they hold references),
+        // so it is safe to tear down the allocator.
+        unsafe {
+            bindings::drm_mm_takedown(self.0.get());
+        }
+    }
+}
+
+// MmInner is safely Send if the AllocInner user type is Send.
+unsafe impl<A: Send + AllocInner<T>, T> Send for MmInner<A, T> {}
diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
index c44760a1332f..73fab2dee3af 100644
--- a/rust/kernel/drm/mod.rs
+++ b/rust/kernel/drm/mod.rs
@@ -7,3 +7,4 @@ pub mod drv;
 pub mod file;
 pub mod gem;
 pub mod ioctl;
+pub mod mm;
-- 2.35.1
~~ Lina
On Fri, Apr 07, 2023 at 12:53:47AM +0900, Asahi Lina wrote:
On 06/04/2023 23.15, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:32PM +0900, Asahi Lina wrote:
drm_mm provides a simple range allocator, useful for managing virtual address ranges. Add a Rust abstraction to expose this module to Rust drivers.
Signed-off-by: Asahi Lina lina@asahilina.net
In the cover letter you mentioned the design open about embedding the lock into the rust wrappers.
I think for a first step that's perfectly fine.
Longer term we might want to ramp up some "proof of locking" infrastructure in Rust, where callers can supply a lock guard and ideally rust validates at compile time that it's for the right type, and at runtime (like lockdep) that it's consistent and the callers don't mix up locks (like using different locks for the same drm_mm allocator).
That proof-of-lock stuff works in Rust too as far as I know.
But the general thread safety story in Rust is much simpler, you just use methods that take &mut self when locking is the caller's responsibility. That effectively implies that there can only be one reference that can call those methods at any given time, thanks to the borrow checker. Shared references only give you &self, a locked Mutex upgrades that to &mut self, and that's how you get proof of locking at compile time, through and through, not just for the type but for the specific object.
Hm that still has the problem of making sure that you supply the right lock (for generic abstractions like drm_mm or drm/sched where the lock is supplied by the driver).
Once we have the lock then yeah the borrow checker makes sure you can't screw up, worst case needs a PhantomData (I guess) as token of proof to pass around the borrowed lifetime (if I got that right from your use of PhantomData in the sched wrappers).
There's a lot of libraries in the kernel that have this "caller ensures locking" pattern. drm/sched also has these requirements.
Yup, that all usually maps nicely to &mut self in Rust... except for the issue below.
There's two other things I'd like to bring up on this patch though, just because it's a good example. But they're both really general points that apply for all the rust wrappers.
Documentation:
In drm we try to document all the interfaces that drivers use with formal docs. Yes there's some areas that are not great for historical reasons, but for new stuff and new wrappers we're really trying:
This helps in telling internal (even across .c files or in rust across modules within a crate) from stuff drivers access. Sure you have static in C or pub in rust, but that doesn't tell you whether it's public all the way to drivers.
ideally docs have a short intro section that explains the main concepts and links to the main data structures and functions. Just to give readers a good starting point to explore.
Linking all the things, so that readers can connect the different parts. This is really important in C where e.g. get/put() or any such function pairs all needed to be linked together. With rust I'm hoping that rustdoc liberally sprinkles links already and we don't have to do this as much.
Short explainers for parameters. For rust this also means type parameters, for those even simplified examples of how drivers are supposed to use them would help a lot in reading docs & understanding concepts.
Ideally links from the rust to the sphinx side to link relevant chapters together. Often the bigger explanations are in .rst files with DOT graphs (kms has a bunch I've added) or similar, and it doesn't make that much sense to duplicate all that on the rust side I guess. But it needs to be discoverable.
This might be more a discussion topic for the rust people than you directly. Still needed for the merge-ready patches eventually.
I don't know much about the doc gen stuff on the Rust side so yeah, this is something I need to look into to make it pretty and complete...
From what Miguel has shown I think it's all there already, and the only missing pieces are the cross-linking at a chapter level from rustdoc to rst and, ideally, sphinx to rustdoc too. But I think for most rust wrappers that will be one link in each direction only (e.g. C drm_mm linking to kernel::drm::MM and the other way round, and done). So absolutely no problem if that one item is sorted out post merge once rustdoc/kernel-sphinx are ready.
Refcounting vs borrowing:
This is honestly much more the eyebrow raising one than the locking. Very often on the C side these datastructures all work with borrow semantics, and you need to explicitly upgrade to a full reference (kref_get or kref_get_unless_zero, depending whether it's a strong or weak reference) if you need the object outside of the mutex/lock guard section.
Again I think for now it's ok, but the sales pitch of rust is that it enables borrow lifetime checking with no runtime cost. Plus viz the vm cleanup example, if you have too many strong backreferences the cleanup flow gets complicated. And it would suck if rust drivers have to add complexity like the openrefcount for the vm example simply because we can't model the borrow semantics well enough to be safe.
So not something that's really bad here, but if we need to resort to full refcounting already for simple datastructures then I'm getting a bit worried about how well rust will cope with the really nasty borrowed reference tricks we're playing in other areas.
Again more a topic for the rust folks I think than specifically here about drm_mm wrapping. Just to get things going I think this is fine.
Yeeeeah... this is a *specific* problem. Drop.
The Allocator<T> itself is perfectly safe to implement without any locking, refcounting, or anything. You just make the methods take &mut self (as they already do), the caller can use it with a single reference or wrap it in an Arc<Mutex<T>> and share it, or whatever.
The problem is the Node<A, T>. When you Drop that, it has to go back to the Allocator. But now you're a different object, so no thread safety guarantees. And you need to keep the Allocator alive. So now to make a safe abstraction, you need refcounting and a mutex.
Lifetimes just don't work here, sadly. Not for a useful abstraction.
I'd love to hear from the other Rust folks whether they have any better ideas...
Hm yeah I think I get the gist of the issue. At time of Drop there's no allocator reference you can borrow and so you're screwed.
In C we tend to solve that by passing both to the unlink/drop stuff (and rust could then ensure that we have legit borrows for both), but I guess that just totally wrecks the entire wrapper and makes it really rough to use.
One thing that *can* be done is making the Drop illegal (Rust can't do this "natively" but Linux already has hacks for that, we can make it fail to link if the Drop is ever called). Then you'd have to actively return the Node to the Allocator with a free function. Since Drop is forbidden, and Node is pinned, you'd always have to either return Node objects to the Allocator or leak them. You could drop the Allocator before its nodes, but as far as I know drm_mm can safely handle that (though it will complain), and then due to the previous guarantees the *only* thing you could do with orphan nodes is leak their memory, which is safe.
It would work... but it breaks the whole Rust automagic Drop stuff.
Yeah I think I see the challenge ...
Thinking about this a bit, I think I want the current mutex/arc semantics for something like a memory allocator (which is one of my primary use cases for drm_mm), since I definitely don't want to be manually returning objects to their allocator all over the place, nor have overarching lifetime requirements that the allocator outlive its objects for safety (that sounds like a can of worms I don't want to open, I'd much rather use a refcount even if I "think" I can prove the lifetime bounds ad-hoc). But for something like a drm_mm that is tracking VA ranges within a VM with all Nodes held internally, maybe I could manage it all internally and have all node destruction be handled via an explicit call into the Allocator.
Yeah I think for gpuva we need to do better, but assuming the gpuva library is in C then rust would just need to encode the safety properties that (hopefully) the C library guarantees ...
And for any driver that just wants to use some range manager the standard wrapping leans heavily on the side of "easy to use".
Maybe the mm abstraction should offer both options? The extra locking can be implemented in terms of the base unlocked version I think (perhaps with some Deref abuse for ergonomics)... I definitely want to hear more opinions about this from other Rust folks, since there are probably other options I haven't considered...
I don't think we need the more raw/tricky one, at least not until we have some serious libraries like gpuva implemented in rust. Or drivers reimplementing the gpuva stuff in their driver :-)
Aside: This, and all the other DRM abstractions, were written before the pin_init stuff from y86 that is in review right now was ready. That may open up more interesting/ergonomic/efficient APIs for some cases, especially where Pin and embedding C types into user objects in some way are involved. So maybe there's room for improvement here. Just a sidenote.
Ah good to know, and yeah that may open up some interesting options. -Daniel
On 07/04/2023 01.13, Daniel Vetter wrote:
On Fri, Apr 07, 2023 at 12:53:47AM +0900, Asahi Lina wrote:
On 06/04/2023 23.15, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:32PM +0900, Asahi Lina wrote:
drm_mm provides a simple range allocator, useful for managing virtual address ranges. Add a Rust abstraction to expose this module to Rust drivers.
Signed-off-by: Asahi Lina lina@asahilina.net
In the cover letter you mentioned the design open about embedding the lock into the rust wrappers.
I think for a first step that's perfectly fine.
Longer term we might want to ramp up some "proof of locking" infrastructure in Rust, where callers can supply a lock guard and ideally rust validates at compile time that it's for the right type, and at runtime (like lockdep) that it's consistent and the callers don't mix up locks (like using different locks for the same drm_mm allocator).
That proof-of-lock stuff works in Rust too as far as I know.
But the general thread safety story in Rust is much simpler, you just use methods that take &mut self when locking is the caller's responsibility. That effectively implies that there can only be one reference that can call those methods at any given time, thanks to the borrow checker. Shared references only give you &self, a locked Mutex upgrades that to &mut self, and that's how you get proof of locking at compile time, through and through, not just for the type but for the specific object.
Hm that still has the problem of making sure that you supply the right lock (for generic abstractions like drm_mm or drm/sched where the lock is supplied by the driver).
No no, I mean you don't have to supply the lock at all. The idea is that if you have a mutable reference to the object *at all* then Rust says that's effectively interlocked, whether you achieve that with an actual lock or just by not sharing the object to begin with.
This is the standard pattern in Rust. Thread-safe methods take &self, and you can call those from multiple threads at once. Thread-unsafe methods take &mut self, and you can only call them from one thread at once. Mutex is one mechanism that allows you to upgrade a shared &self to a &mut self (while holding the lock). The actual object doesn't know anything about mutexes or locking, it just relies on the more fundamental property that Rust says that if you have a &mut obj, absolutely nobody else does, at any given time, by definition.
(And then everything also needs to impl Send + Sync for this to work across threads, but that's usually what you want)
Basically if there were to exist a Rust abstraction/object/anything that allows two threads to get ahold of a &mut to the same object at the same time without using unsafe, that abstraction/etc would be broken and unsound (and necessarily have to involve bad unsafe code within, because you can't do that with just safe code, the borrow checker stops you), and the state of affairs of two threads having such a reference is outright undefined behavior in the language model.
Once we have the lock then yeah, the borrow checker makes sure you can't screw up; worst case it needs a PhantomData (I guess) as a token of proof to pass around the borrowed lifetime (if I got that right from your use of PhantomData in the sched wrappers).
Ah, PhantomData is just a hack because Rust wants you to use all type parameters inside structs even when you don't actually need them for anything because they only have meaning to the abstraction itself. Without it it won't compile. Something something deep type system black magic rules (I'm pretty sure this requirement isn't gratuitous but I don't know what the whole story is here).
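For what it's worth, the rule behind this is rustc's E0392 ("parameter is never used"): every declared type parameter has to appear in some field, and PhantomData is the zero-sized way to satisfy that when the parameter only matters to the abstraction. A toy sketch (nothing here is the actual sched wrapper, the names are invented):

```rust
use std::marker::PhantomData;

// An allocator parameterized by a driver tag `A` that only selects
// behaviour through the trait; no value of type A is ever stored.
// Deleting the `_p` field makes this fail to compile with E0392.
struct Allocator<A> {
    next: u64,
    _p: PhantomData<A>,
}

trait Tag {
    const NAME: &'static str;
}

struct MyDriver;

impl Tag for MyDriver {
    const NAME: &'static str = "my-driver";
}

impl<A: Tag> Allocator<A> {
    fn new() -> Self {
        Allocator { next: 0, _p: PhantomData }
    }

    // The type parameter is "used" purely for its associated items.
    fn alloc(&mut self) -> (&'static str, u64) {
        let n = self.next;
        self.next += 1;
        (A::NAME, n)
    }
}

fn demo() -> Vec<(&'static str, u64)> {
    let mut a: Allocator<MyDriver> = Allocator::new();
    vec![a.alloc(), a.alloc()]
}

fn main() {
    println!("{:?}", demo());
}
```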
The lock does give you a Guard you could pass somewhere as proof, which itself contains a lifetime that ties it to the Mutex, but more importantly that Guard implements DerefMut to give you a &mut to whatever is inside the Mutex, and *that* mutable reference is the proof that you are the sole execution context with the right to access that one particular object. At that point the Guard doesn't matter, and lifetimes tie everything together so you can't stash that &mut somewhere else or anything like that to break the rules (modulo unsafe code, of course!).
There's a lot of libraries in the kernel that have this "caller ensures locking" pattern. drm/sched also has these requirements.
Yup, that all usually maps nicely to &mut self in Rust... except for the issue below.
There's two other things I'd like to bring up on this patch though, just because it's a good example. But they're both really general points that apply for all the rust wrappers.
Documentation:
In drm we try to document all the interfaces that drivers use with formal docs. Yes, there are some areas that are not great for historical reasons, but for new stuff and new wrappers we're really trying:
This helps in telling internal stuff (even across .c files, or in rust across modules within a crate) apart from stuff drivers access. Sure you have static in C or pub in rust, but that doesn't tell you whether something is public all the way to drivers.
Ideally docs have a short intro section that explains the main concepts and links to the main data structures and functions, just to give readers a good starting point to explore.
Linking all the things, so that readers can connect the different parts. This is really important in C, where e.g. get/put() or any such function pairs all need to be linked together. With rust I'm hoping that rustdoc liberally sprinkles links already and we don't have to do this as much.
Short explainers for parameters. For rust this also means type parameters, for those even simplified examples of how drivers are supposed to use them would help a lot in reading docs & understanding concepts.
Ideally links from the rust side to the sphinx side, to link relevant chapters together. Often the bigger explanations are in .rst files with DOT graphs (kms has a bunch I've added) or similar, and it doesn't make that much sense to duplicate all that on the rust side I guess. But it needs to be discoverable.
This might be more a discussion topic for the rust people than you directly. Still needed for the merge-ready patches eventually.
I don't know much about the doc gen stuff on the Rust side so yeah, this is something I need to look into to make it pretty and complete...
From what Miguel has shown I think it's all there already, and the only missing pieces are the cross-linking at a chapter level from rustdoc to rst, and ideally from sphinx to rustdoc too. But I think for most rust wrappers that will be one link in each direction only (e.g. C drm_mm linking to kernel::drm::MM and the other way round, and done). So absolutely no problem if that one item is sorted out post-merge once rustdoc/kernel-sphinx are ready.
Refcounting vs borrowing:
This is honestly much more the eyebrow raising one than the locking. Very often on the C side these datastructures all work with borrow semantics, and you need to explicitly upgrade to a full reference (kref_get or kref_get_unless_zero, depending whether it's a strong or weak reference) if you need the object outside of the mutex/lock guard section.
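(In Rust terms, the closest analogue to that kref_get_unless_zero upgrade is storing a Weak reference and upgrading it under the lock. A rough userspace sketch, with std's Arc/Weak standing in for kref and made-up types throughout:)

```rust
use std::sync::{Arc, Mutex, Weak};

// A registry that holds only weak (borrowed-style) references; lookups
// upgrade to a strong reference while the lock is held, and fail if the
// object already died: the kref_get_unless_zero pattern.
struct Registry {
    entries: Mutex<Vec<Weak<String>>>,
}

impl Registry {
    fn new() -> Self {
        Registry { entries: Mutex::new(Vec::new()) }
    }

    fn insert(&self, obj: &Arc<String>) {
        self.entries.lock().unwrap().push(Arc::downgrade(obj));
    }

    // Returns a strong reference only if the object is still alive.
    fn lookup(&self, idx: usize) -> Option<Arc<String>> {
        self.entries.lock().unwrap().get(idx)?.upgrade()
    }
}

fn demo() -> (bool, bool) {
    let registry = Registry::new();
    let obj = Arc::new(String::from("fence-ctx"));
    registry.insert(&obj);

    // Alive: upgrade succeeds, like kref_get_unless_zero returning true.
    let before = registry.lookup(0).is_some();

    drop(obj);

    // Last strong reference gone: the upgrade now fails safely.
    let after = registry.lookup(0).is_some();
    (before, after)
}

fn main() {
    println!("{:?}", demo());
}
```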
Again I think for now it's ok, but the sales pitch of rust is that it enables borrow lifetime checking with no runtime cost. Plus viz the vm cleanup example, if you have too many strong backreferences the cleanup flow gets complicated. And it would suck if rust drivers have to add complexity like the openrefcount for the vm example simply because we can't model the borrow semantics well enough to be safe.
So not something that's really bad here, but if we need to resort to full refcounting already for simple datastructures then I'm getting a bit worried about how well rust will cope with the really nasty borrowed reference tricks we're playing in other areas.
Again more a topic for the rust folks I think than specifically here about drm_mm wrapping. Just to get things going I think this is fine.
Yeeeeah... this is a *specific* problem. Drop.
The Allocator<T> itself is perfectly safe to implement without any locking, refcounting, or anything. You just make the methods take &mut self (as they already do), the caller can use it with a single reference or wrap it in an Arc<Mutex<T>> and share it, or whatever.
The problem is the Node<A, T>. When you Drop that, it has to go back to the Allocator. But now you're a different object, so no thread safety guarantees. And you need to keep the Allocator alive. So now to make a safe abstraction, you need refcounting and a mutex.
Lifetimes just don't work here, sadly. Not for a useful abstraction.
I'd love to hear from the other Rust folks whether they have any better ideas...
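Here's the shape of the problem in a standalone sketch (toy types, not the kernel::drm::mm API): Drop runs at some arbitrary later time with no Allocator borrow in hand, so for a safe abstraction the Node ends up carrying its own refcounted, locked backreference.

```rust
use std::sync::{Arc, Mutex};

struct Inner {
    live_nodes: usize,
}

struct Allocator {
    inner: Arc<Mutex<Inner>>,
}

struct Node {
    // The refcounted backreference that pure lifetimes can't replace:
    // Drop has no way to borrow the Allocator, so it must own a path
    // back to the shared state.
    inner: Arc<Mutex<Inner>>,
}

impl Allocator {
    fn new() -> Self {
        Allocator { inner: Arc::new(Mutex::new(Inner { live_nodes: 0 })) }
    }

    fn insert_node(&mut self) -> Node {
        self.inner.lock().unwrap().live_nodes += 1;
        Node { inner: self.inner.clone() }
    }

    fn live_nodes(&self) -> usize {
        self.inner.lock().unwrap().live_nodes
    }
}

impl Drop for Node {
    fn drop(&mut self) {
        // This is the point where &mut Allocator is unavailable: we take
        // the lock through our own Arc, even if the Allocator is gone.
        self.inner.lock().unwrap().live_nodes -= 1;
    }
}

fn demo() -> (usize, usize, usize) {
    let mut alloc = Allocator::new();
    let a = alloc.insert_node();
    let b = alloc.insert_node();
    let after_insert = alloc.live_nodes();
    drop(a);
    let after_drop = alloc.live_nodes();
    // Dropping the Allocator first is fine: `b` keeps Inner alive via its Arc.
    let inner = alloc.inner.clone();
    drop(alloc);
    drop(b);
    let final_count = inner.lock().unwrap().live_nodes;
    (after_insert, after_drop, final_count)
}

fn main() {
    println!("{:?}", demo());
}
```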
Hm yeah I think I get the gist of the issue. At time of Drop there's no allocator reference you can borrow and so you're screwed.
In C we tend to solve that by passing both to the unlink/drop stuff (and rust could then ensure that we have legit borrows for both), but I guess that just totally wrecks the entire wrapper and makes it really rough to use.
Yup, that's the issue ^^;;
One thing that *can* be done is making the Drop illegal (Rust can't do this "natively" but Linux already has hacks for that, we can make it fail to link if the Drop is ever called). Then you'd have to actively return the Node to the Allocator with a free function. Since Drop is forbidden, and Node is pinned, you'd always have to either return Node objects to the Allocator or leak them. You could drop the Allocator before its nodes, but as far as I know drm_mm can safely handle that (though it will complain), and then due to the previous guarantees the *only* thing you could do with orphan nodes is leak their memory, which is safe.
It would work... but it breaks the whole Rust automagic Drop stuff.
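For illustration, the explicit-return flavour looks roughly like this as a userspace toy (the real kernel trick turns a reachable Drop into a link-time failure, which a standalone example can't show; here the orphan path just leaks, which is the only safe fallback anyway):

```rust
struct Allocator {
    live: usize,
}

struct Node {
    start: u64,
    returned: bool,
}

impl Allocator {
    fn alloc(&mut self, start: u64) -> Node {
        self.live += 1;
        Node { start, returned: false }
    }

    // The only sanctioned way to destroy a Node: hand it back while we
    // hold &mut self, so the allocator's bookkeeping stays interlocked.
    fn free(&mut self, mut node: Node) {
        node.returned = true;
        self.live -= 1;
    }
}

impl Drop for Node {
    fn drop(&mut self) {
        if !self.returned {
            // Orphan node: the only safe option left is to leak its range.
            eprintln!("leaking node at {:#x}", self.start);
        }
    }
}

fn demo() -> (usize, usize) {
    let mut alloc = Allocator { live: 0 };
    let n = alloc.alloc(0x1000);
    let after_alloc = alloc.live;
    alloc.free(n);
    (after_alloc, alloc.live)
}

fn main() {
    println!("{:?}", demo());
}
```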
Yeah I think I see the challenge ...
Thinking about this a bit, I think I want the current mutex/arc semantics for something like a memory allocator (which is one of my primary use cases for drm_mm), since I definitely don't want to be manually returning objects to their allocator all over the place, nor have overarching lifetime requirements that the allocator outlive its objects for safety (that sounds like a can of worms I don't want to open, I'd much rather use a refcount even if I "think" I can prove the lifetime bounds ad-hoc). But for something like a drm_mm that is tracking VA ranges within a VM with all Nodes held internally, maybe I could manage it all internally and have all node destruction be handled via an explicit call into the Allocator.
Yeah I think for gpuva we need to do better, but assuming the gpuva library is in C then rust would just need to encode the safety properties that (hopefully) the C library guarantees ...
Yeah, if this is going to be common C code using drm_mm then it can provide whatever safety properties it wants and use drm_mm in ways not possible with the Rust abstraction, of course.
And for any driver that just wants to use some range manager the standard wrapping leans heavily on the side of "easy to use".
Maybe the mm abstraction should offer both options? The extra locking can be implemented in terms of the base unlocked version I think (perhaps with some Deref abuse for ergonomics)... I definitely want to hear more opinions about this from other Rust folks, since there are probably other options I haven't considered...
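Something like the following split might work, sketched here with invented toy types (RawAllocator is the caller-locked flavour, LockedAllocator the ergonomic one layered on top, with the guard's DerefMut reusing the same &mut methods):

```rust
use std::sync::{Arc, Mutex, MutexGuard};

// Caller-locked flavour: plain &mut self methods, no runtime cost.
struct RawAllocator {
    next: u64,
}

impl RawAllocator {
    fn alloc(&mut self) -> u64 {
        let n = self.next;
        self.next += 1;
        n
    }
}

// Convenience flavour: adds the Arc<Mutex<..>> and hands out guards.
// It is a thin layer over the raw one, not a reimplementation.
#[derive(Clone)]
struct LockedAllocator(Arc<Mutex<RawAllocator>>);

impl LockedAllocator {
    fn new() -> Self {
        LockedAllocator(Arc::new(Mutex::new(RawAllocator { next: 0 })))
    }

    // The guard DerefMuts to RawAllocator, so all raw methods are usable.
    fn lock(&self) -> MutexGuard<'_, RawAllocator> {
        self.0.lock().unwrap()
    }
}

fn demo() -> (u64, u64, u64) {
    // Unlocked flavour: single owner.
    let mut raw = RawAllocator { next: 0 };
    let first = raw.alloc();

    // Locked flavour: shareable across clones (and threads).
    let shared = LockedAllocator::new();
    let other = shared.clone();
    let a = shared.lock().alloc();
    let b = other.lock().alloc();
    (first, a, b)
}

fn main() {
    println!("{:?}", demo());
}
```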
I don't think we need the more raw/tricky one, at least not until we have some serious libraries like gpuva implemented in rust. Or drivers reimplementing the gpuva stuff in their driver :-)
It only just hit me that gpuva is an actual thing that's in RFC. Sounds like I should give it a shot when I do the vm_bind stuff instead of reinventing that wheel (which was my original plan)... sorry if I've been a bit slow here.
Aside: This, and all the other DRM abstractions, were written before the pin_init stuff from y86 that is in review right now was ready. That may open up more interesting/ergonomic/efficient APIs for some cases, especially where Pin and embedding C types into user objects in some way are involved. So maybe there's room for improvement here. Just a sidenote.
Ah good to know, and yeah that may open up some interesting options. -Daniel
~~ Lina
DMA fences are the internal synchronization primitive used for DMA operations like GPU rendering, video en/decoding, etc. Add an abstraction to allow Rust drivers to interact with this subsystem.
Note: This uses a raw spinlock living next to the fence, since we do not interact with it other than for initialization. TODO: Expose this to the user at some point with a safe abstraction.
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 rust/bindings/bindings_helper.h |   2 +
 rust/helpers.c                  |  53 ++++
 rust/kernel/dma_fence.rs        | 532 ++++++++++++++++++++++++++++++++++++++++
 rust/kernel/lib.rs              |   2 +
 4 files changed, 589 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index 9f152d373df8..705af292a5b4 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -14,6 +14,8 @@
 #include <drm/drm_ioctl.h>
 #include <linux/delay.h>
 #include <linux/device.h>
+#include <linux/dma-fence.h>
+#include <linux/dma-fence-chain.h>
 #include <linux/dma-mapping.h>
 #include <linux/fs.h>
 #include <linux/ioctl.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 388ff1100ea5..8e906a7a7d8a 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -23,6 +23,8 @@
 #include <linux/bug.h>
 #include <linux/build_bug.h>
 #include <linux/device.h>
+#include <linux/dma-fence.h>
+#include <linux/dma-fence-chain.h>
 #include <linux/dma-mapping.h>
 #include <linux/err.h>
 #include <linux/errname.h>
@@ -30,6 +32,7 @@
 #include <linux/of.h>
 #include <linux/of_device.h>
 #include <linux/platform_device.h>
+#include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/refcount.h>
 #include <linux/xarray.h>
@@ -388,6 +391,56 @@ int rust_helper_sg_dma_len(const struct scatterlist *sg)
 }
 EXPORT_SYMBOL_GPL(rust_helper_sg_dma_len);
 
+void rust_helper___spin_lock_init(spinlock_t *lock, const char *name,
+				  struct lock_class_key *key)
+{
+#ifdef CONFIG_DEBUG_SPINLOCK
+# ifndef CONFIG_PREEMPT_RT
+	__raw_spin_lock_init(spinlock_check(lock), name, key, LD_WAIT_CONFIG);
+# else
+	rt_mutex_base_init(&lock->lock);
+	__rt_spin_lock_init(lock, name, key, false);
+# endif
+#else
+	spin_lock_init(lock);
+#endif
+}
+EXPORT_SYMBOL_GPL(rust_helper___spin_lock_init);
+
+#ifdef CONFIG_DMA_SHARED_BUFFER
+
+void rust_helper_dma_fence_get(struct dma_fence *fence)
+{
+	dma_fence_get(fence);
+}
+EXPORT_SYMBOL_GPL(rust_helper_dma_fence_get);
+
+void rust_helper_dma_fence_put(struct dma_fence *fence)
+{
+	dma_fence_put(fence);
+}
+EXPORT_SYMBOL_GPL(rust_helper_dma_fence_put);
+
+struct dma_fence_chain *rust_helper_dma_fence_chain_alloc(void)
+{
+	return dma_fence_chain_alloc();
+}
+EXPORT_SYMBOL_GPL(rust_helper_dma_fence_chain_alloc);
+
+void rust_helper_dma_fence_chain_free(struct dma_fence_chain *chain)
+{
+	dma_fence_chain_free(chain);
+}
+EXPORT_SYMBOL_GPL(rust_helper_dma_fence_chain_free);
+
+void rust_helper_dma_fence_set_error(struct dma_fence *fence, int error)
+{
+	dma_fence_set_error(fence, error);
+}
+EXPORT_SYMBOL_GPL(rust_helper_dma_fence_set_error);
+
+#endif
+
 #ifdef CONFIG_DRM
void rust_helper_drm_gem_object_get(struct drm_gem_object *obj) diff --git a/rust/kernel/dma_fence.rs b/rust/kernel/dma_fence.rs new file mode 100644 index 000000000000..ca93380d9da2 --- /dev/null +++ b/rust/kernel/dma_fence.rs @@ -0,0 +1,532 @@ +// SPDX-License-Identifier: GPL-2.0 + +//! DMA fence abstraction. +//! +//! C header: [`include/linux/dma_fence.h`](../../include/linux/dma_fence.h) + +use crate::{ + bindings, + error::{to_result, Result}, + prelude::*, + sync::LockClassKey, + types::Opaque, +}; +use core::fmt::Write; +use core::ops::{Deref, DerefMut}; +use core::ptr::addr_of_mut; +use core::sync::atomic::{AtomicU64, Ordering}; + +/// Any kind of DMA Fence Object +/// +/// # Invariants +/// raw() returns a valid pointer to a dma_fence and we own a reference to it. +pub trait RawDmaFence: crate::private::Sealed { + /// Returns the raw `struct dma_fence` pointer. + fn raw(&self) -> *mut bindings::dma_fence; + + /// Returns the raw `struct dma_fence` pointer and consumes the object. + /// + /// The caller is responsible for dropping the reference. + fn into_raw(self) -> *mut bindings::dma_fence + where + Self: Sized, + { + let ptr = self.raw(); + core::mem::forget(self); + ptr + } + + /// Advances this fence to the chain node which will signal this sequence number. + /// If no sequence number is provided, this returns `self` again. + fn chain_find_seqno(self, seqno: u64) -> Result<Fence> + where + Self: Sized, + { + let mut ptr = self.into_raw(); + + // SAFETY: This will safely fail if this DmaFence is not a chain. + // `ptr` is valid per the type invariant. + let ret = unsafe { bindings::dma_fence_chain_find_seqno(&mut ptr, seqno) }; + + if ret != 0 { + // SAFETY: This is either an owned reference or NULL, dma_fence_put can handle both. + unsafe { bindings::dma_fence_put(ptr) }; + Err(Error::from_kernel_errno(ret)) + } else if ptr.is_null() { + Err(EINVAL) // When can this happen? + } else { + // SAFETY: ptr is valid and non-NULL as checked above. 
+ Ok(unsafe { Fence::from_raw(ptr) }) + } + } + + /// Signal completion of this fence + fn signal(&self) -> Result { + to_result(unsafe { bindings::dma_fence_signal(self.raw()) }) + } + + /// Set the error flag on this fence + fn set_error(&self, err: Error) { + unsafe { bindings::dma_fence_set_error(self.raw(), err.to_kernel_errno()) }; + } +} + +/// A generic DMA Fence Object +/// +/// # Invariants +/// ptr is a valid pointer to a dma_fence and we own a reference to it. +pub struct Fence { + ptr: *mut bindings::dma_fence, +} + +impl Fence { + /// Create a new Fence object from a raw pointer to a dma_fence. + /// + /// # Safety + /// The caller must own a reference to the dma_fence, which is transferred to the new object. + pub(crate) unsafe fn from_raw(ptr: *mut bindings::dma_fence) -> Fence { + Fence { ptr } + } + + /// Create a new Fence object from a raw pointer to a dma_fence. + /// + /// # Safety + /// Takes a borrowed reference to the dma_fence, and increments the reference count. + pub(crate) unsafe fn get_raw(ptr: *mut bindings::dma_fence) -> Fence { + // SAFETY: Pointer is valid per the safety contract + unsafe { bindings::dma_fence_get(ptr) }; + Fence { ptr } + } + + /// Create a new Fence object from a RawDmaFence. + pub fn from_fence(fence: &dyn RawDmaFence) -> Fence { + // SAFETY: Pointer is valid per the RawDmaFence contract + unsafe { Self::get_raw(fence.raw()) } + } +} + +impl crate::private::Sealed for Fence {} + +impl RawDmaFence for Fence { + fn raw(&self) -> *mut bindings::dma_fence { + self.ptr + } +} + +impl Drop for Fence { + fn drop(&mut self) { + // SAFETY: We own a reference to this syncobj. + unsafe { bindings::dma_fence_put(self.ptr) }; + } +} + +impl Clone for Fence { + fn clone(&self) -> Self { + // SAFETY: `ptr` is valid per the type invariant and we own a reference to it. 
+ unsafe { + bindings::dma_fence_get(self.ptr); + Self::from_raw(self.ptr) + } + } +} + +unsafe impl Sync for Fence {} +unsafe impl Send for Fence {} + +/// Trait which must be implemented by driver-specific fence objects. +#[vtable] +pub trait FenceOps: Sized + Send + Sync { + /// True if this dma_fence implementation uses 64bit seqno, false otherwise. + const USE_64BIT_SEQNO: bool; + + /// Returns the driver name. This is a callback to allow drivers to compute the name at + /// runtime, without having it to store permanently for each fence, or build a cache of + /// some sort. + fn get_driver_name<'a>(self: &'a FenceObject<Self>) -> &'a CStr; + + /// Return the name of the context this fence belongs to. This is a callback to allow drivers + /// to compute the name at runtime, without having it to store permanently for each fence, or + /// build a cache of some sort. + fn get_timeline_name<'a>(self: &'a FenceObject<Self>) -> &'a CStr; + + /// Enable software signaling of fence. + fn enable_signaling(self: &FenceObject<Self>) -> bool { + false + } + + /// Peek whether the fence is signaled, as a fastpath optimization for e.g. dma_fence_wait() or + /// dma_fence_add_callback(). + fn signaled(self: &FenceObject<Self>) -> bool { + false + } + + /// Callback to fill in free-form debug info specific to this fence, like the sequence number. + fn fence_value_str(self: &FenceObject<Self>, _output: &mut dyn Write) {} + + /// Fills in the current value of the timeline as a string, like the sequence number. Note that + /// the specific fence passed to this function should not matter, drivers should only use it to + /// look up the corresponding timeline structures. + fn timeline_value_str(self: &FenceObject<Self>, _output: &mut dyn Write) {} +} + +unsafe extern "C" fn get_driver_name_cb<T: FenceOps>( + fence: *mut bindings::dma_fence, +) -> *const core::ffi::c_char { + // SAFETY: All of our fences are FenceObject<T>. 
+ let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>; + + // SAFETY: The caller is responsible for passing a valid dma_fence subtype + T::get_driver_name(unsafe { &mut *p }).as_char_ptr() +} + +unsafe extern "C" fn get_timeline_name_cb<T: FenceOps>( + fence: *mut bindings::dma_fence, +) -> *const core::ffi::c_char { + // SAFETY: All of our fences are FenceObject<T>. + let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>; + + // SAFETY: The caller is responsible for passing a valid dma_fence subtype + T::get_timeline_name(unsafe { &mut *p }).as_char_ptr() +} + +unsafe extern "C" fn enable_signaling_cb<T: FenceOps>(fence: *mut bindings::dma_fence) -> bool { + // SAFETY: All of our fences are FenceObject<T>. + let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>; + + // SAFETY: The caller is responsible for passing a valid dma_fence subtype + T::enable_signaling(unsafe { &mut *p }) +} + +unsafe extern "C" fn signaled_cb<T: FenceOps>(fence: *mut bindings::dma_fence) -> bool { + // SAFETY: All of our fences are FenceObject<T>. + let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>; + + // SAFETY: The caller is responsible for passing a valid dma_fence subtype + T::signaled(unsafe { &mut *p }) +} + +unsafe extern "C" fn release_cb<T: FenceOps>(fence: *mut bindings::dma_fence) { + // SAFETY: All of our fences are FenceObject<T>. + let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>; + + // SAFETY: p is never used after this + unsafe { + core::ptr::drop_in_place(&mut (*p).inner); + } + + // SAFETY: All of our fences are allocated using kmalloc, so this is safe. 
+ unsafe { bindings::dma_fence_free(fence) }; +} + +unsafe extern "C" fn fence_value_str_cb<T: FenceOps>( + fence: *mut bindings::dma_fence, + string: *mut core::ffi::c_char, + size: core::ffi::c_int, +) { + let size: usize = size.try_into().unwrap_or(0); + + if size == 0 { + return; + } + + // SAFETY: All of our fences are FenceObject<T>. + let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>; + + // SAFETY: The caller is responsible for the validity of string/size + let mut f = unsafe { crate::str::Formatter::from_buffer(string as *mut _, size) }; + + // SAFETY: The caller is responsible for passing a valid dma_fence subtype + T::fence_value_str(unsafe { &mut *p }, &mut f); + let _ = f.write_str("\0"); + + // SAFETY: `size` is at least 1 per the check above + unsafe { *string.add(size - 1) = 0 }; +} + +unsafe extern "C" fn timeline_value_str_cb<T: FenceOps>( + fence: *mut bindings::dma_fence, + string: *mut core::ffi::c_char, + size: core::ffi::c_int, +) { + let size: usize = size.try_into().unwrap_or(0); + + if size == 0 { + return; + } + + // SAFETY: All of our fences are FenceObject<T>. + let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>; + + // SAFETY: The caller is responsible for the validity of string/size + let mut f = unsafe { crate::str::Formatter::from_buffer(string as *mut _, size) }; + + // SAFETY: The caller is responsible for passing a valid dma_fence subtype + T::timeline_value_str(unsafe { &mut *p }, &mut f); + let _ = f.write_str("\0"); + + // SAFETY: `size` is at least 1 per the check above + unsafe { *string.add(size - 1) = 0 }; +} + +// Allow FenceObject<Self> to be used as a self argument, for ergonomics +impl<T: FenceOps> core::ops::Receiver for FenceObject<T> {} + +/// A driver-specific DMA Fence Object +/// +/// # Invariants +/// ptr is a valid pointer to a dma_fence and we own a reference to it. 
+#[repr(C)]
+pub struct FenceObject<T: FenceOps> {
+    fence: bindings::dma_fence,
+    lock: Opaque<bindings::spinlock>,
+    inner: T,
+}
+
+impl<T: FenceOps> FenceObject<T> {
+    const SIZE: usize = core::mem::size_of::<Self>();
+
+    const VTABLE: bindings::dma_fence_ops = bindings::dma_fence_ops {
+        use_64bit_seqno: T::USE_64BIT_SEQNO,
+        get_driver_name: Some(get_driver_name_cb::<T>),
+        get_timeline_name: Some(get_timeline_name_cb::<T>),
+        enable_signaling: if T::HAS_ENABLE_SIGNALING {
+            Some(enable_signaling_cb::<T>)
+        } else {
+            None
+        },
+        signaled: if T::HAS_SIGNALED {
+            Some(signaled_cb::<T>)
+        } else {
+            None
+        },
+        wait: None, // Deprecated
+        release: Some(release_cb::<T>),
+        fence_value_str: if T::HAS_FENCE_VALUE_STR {
+            Some(fence_value_str_cb::<T>)
+        } else {
+            None
+        },
+        timeline_value_str: if T::HAS_TIMELINE_VALUE_STR {
+            Some(timeline_value_str_cb::<T>)
+        } else {
+            None
+        },
+    };
+}
+
+impl<T: FenceOps> Deref for FenceObject<T> {
+    type Target = T;
+
+    fn deref(&self) -> &T {
+        &self.inner
+    }
+}
+
+impl<T: FenceOps> DerefMut for FenceObject<T> {
+    fn deref_mut(&mut self) -> &mut T {
+        &mut self.inner
+    }
+}
+
+impl<T: FenceOps> crate::private::Sealed for FenceObject<T> {}
+impl<T: FenceOps> RawDmaFence for FenceObject<T> {
+    fn raw(&self) -> *mut bindings::dma_fence {
+        &self.fence as *const _ as *mut _
+    }
+}
+
+/// A unique reference to a driver-specific fence object
+pub struct UniqueFence<T: FenceOps>(*mut FenceObject<T>);
+
+impl<T: FenceOps> Deref for UniqueFence<T> {
+    type Target = FenceObject<T>;
+
+    fn deref(&self) -> &FenceObject<T> {
+        unsafe { &*self.0 }
+    }
+}
+
+impl<T: FenceOps> DerefMut for UniqueFence<T> {
+    fn deref_mut(&mut self) -> &mut FenceObject<T> {
+        unsafe { &mut *self.0 }
+    }
+}
+
+impl<T: FenceOps> crate::private::Sealed for UniqueFence<T> {}
+impl<T: FenceOps> RawDmaFence for UniqueFence<T> {
+    fn raw(&self) -> *mut bindings::dma_fence {
+        unsafe { addr_of_mut!((*self.0).fence) }
+    }
+}
+
+impl<T: FenceOps> From<UniqueFence<T>>
for UserFence<T> { + fn from(value: UniqueFence<T>) -> Self { + let ptr = value.0; + core::mem::forget(value); + + UserFence(ptr) + } +} + +impl<T: FenceOps> Drop for UniqueFence<T> { + fn drop(&mut self) { + // SAFETY: We own a reference to this fence. + unsafe { bindings::dma_fence_put(self.raw()) }; + } +} + +unsafe impl<T: FenceOps> Sync for UniqueFence<T> {} +unsafe impl<T: FenceOps> Send for UniqueFence<T> {} + +/// A shared reference to a driver-specific fence object +pub struct UserFence<T: FenceOps>(*mut FenceObject<T>); + +impl<T: FenceOps> Deref for UserFence<T> { + type Target = FenceObject<T>; + + fn deref(&self) -> &FenceObject<T> { + unsafe { &*self.0 } + } +} + +impl<T: FenceOps> Clone for UserFence<T> { + fn clone(&self) -> Self { + // SAFETY: `ptr` is valid per the type invariant and we own a reference to it. + unsafe { + bindings::dma_fence_get(self.raw()); + Self(self.0) + } + } +} + +impl<T: FenceOps> crate::private::Sealed for UserFence<T> {} +impl<T: FenceOps> RawDmaFence for UserFence<T> { + fn raw(&self) -> *mut bindings::dma_fence { + unsafe { addr_of_mut!((*self.0).fence) } + } +} + +impl<T: FenceOps> Drop for UserFence<T> { + fn drop(&mut self) { + // SAFETY: We own a reference to this fence. + unsafe { bindings::dma_fence_put(self.raw()) }; + } +} + +unsafe impl<T: FenceOps> Sync for UserFence<T> {} +unsafe impl<T: FenceOps> Send for UserFence<T> {} + +/// An array of fence contexts, out of which fences can be created. +pub struct FenceContexts { + start: u64, + count: u32, + seqnos: Vec<AtomicU64>, + lock_name: &'static CStr, + lock_key: &'static LockClassKey, +} + +impl FenceContexts { + /// Create a new set of fence contexts. 
+ pub fn new( + count: u32, + name: &'static CStr, + key: &'static LockClassKey, + ) -> Result<FenceContexts> { + let mut seqnos: Vec<AtomicU64> = Vec::new(); + + seqnos.try_reserve(count as usize)?; + + for _ in 0..count { + seqnos.try_push(Default::default())?; + } + + let start = unsafe { bindings::dma_fence_context_alloc(count as core::ffi::c_uint) }; + + Ok(FenceContexts { + start, + count, + seqnos, + lock_name: name, + lock_key: key, + }) + } + + /// Create a new fence in a given context index. + pub fn new_fence<T: FenceOps>(&self, context: u32, inner: T) -> Result<UniqueFence<T>> { + if context > self.count { + return Err(EINVAL); + } + + let p = unsafe { + bindings::krealloc( + core::ptr::null_mut(), + FenceObject::<T>::SIZE, + bindings::GFP_KERNEL | bindings::__GFP_ZERO, + ) as *mut FenceObject<T> + }; + + if p.is_null() { + return Err(ENOMEM); + } + + let seqno = self.seqnos[context as usize].fetch_add(1, Ordering::Relaxed); + + // SAFETY: The pointer is valid, so pointers to members are too. + // After this, all fields are initialized. + unsafe { + addr_of_mut!((*p).inner).write(inner); + bindings::__spin_lock_init( + addr_of_mut!((*p).lock) as *mut _, + self.lock_name.as_char_ptr(), + self.lock_key.get(), + ); + bindings::dma_fence_init( + addr_of_mut!((*p).fence), + &FenceObject::<T>::VTABLE, + addr_of_mut!((*p).lock) as *mut _, + self.start + context as u64, + seqno, + ); + }; + + Ok(UniqueFence(p)) + } +} + +/// A DMA Fence Chain Object +/// +/// # Invariants +/// ptr is a valid pointer to a dma_fence_chain which we own. +pub struct FenceChain { + ptr: *mut bindings::dma_fence_chain, +} + +impl FenceChain { + /// Create a new DmaFenceChain object. + pub fn new() -> Result<Self> { + // SAFETY: This function is safe to call and takes no arguments. + let ptr = unsafe { bindings::dma_fence_chain_alloc() }; + + if ptr.is_null() { + Err(ENOMEM) + } else { + Ok(FenceChain { ptr }) + } + } + + /// Convert the DmaFenceChain into the underlying raw pointer. 
+    ///
+    /// This assumes the caller will take ownership of the object.
+    pub(crate) fn into_raw(self) -> *mut bindings::dma_fence_chain {
+        let ptr = self.ptr;
+        core::mem::forget(self);
+        ptr
+    }
+}
+
+impl Drop for FenceChain {
+    fn drop(&mut self) {
+        // SAFETY: We own this dma_fence_chain.
+        unsafe { bindings::dma_fence_chain_free(self.ptr) };
+    }
+}
diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
index cb23d24c6718..31866069e0bc 100644
--- a/rust/kernel/lib.rs
+++ b/rust/kernel/lib.rs
@@ -36,6 +36,8 @@ mod allocator;
 mod build_assert;
 pub mod delay;
 pub mod device;
+#[cfg(CONFIG_DMA_SHARED_BUFFER)]
+pub mod dma_fence;
 pub mod driver;
 #[cfg(CONFIG_RUST_DRM)]
 pub mod drm;
On Tue, Mar 07, 2023 at 11:25:33PM +0900, Asahi Lina wrote:
DMA fences are the internal synchronization primitive used for DMA operations like GPU rendering, video en/decoding, etc. Add an abstraction to allow Rust drivers to interact with this subsystem.
Note: This uses a raw spinlock living next to the fence, since we do not interact with it other than for initialization. TODO: Expose this to the user at some point with a safe abstraction.
Signed-off-by: Asahi Lina <lina@asahilina.net>
rust/bindings/bindings_helper.h | 2 + rust/helpers.c | 53 ++++ rust/kernel/dma_fence.rs | 532 ++++++++++++++++++++++++++++++++++++++++
This should probably be in the dma-buf namespace like on the C side? There's a pile of tightly coupled concepts that I expect we'll all need sooner or later (dma-fence/buf/resv at least).
Also I guess same questions about separate files and MAINTAINER entries as for the drm stuff. -Daniel
rust/kernel/lib.rs | 2 + 4 files changed, 589 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 9f152d373df8..705af292a5b4 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -14,6 +14,8 @@ #include <drm/drm_ioctl.h> #include <linux/delay.h> #include <linux/device.h> +#include <linux/dma-fence.h> +#include <linux/dma-fence-chain.h> #include <linux/dma-mapping.h> #include <linux/fs.h> #include <linux/ioctl.h> diff --git a/rust/helpers.c b/rust/helpers.c index 388ff1100ea5..8e906a7a7d8a 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -23,6 +23,8 @@ #include <linux/bug.h> #include <linux/build_bug.h> #include <linux/device.h> +#include <linux/dma-fence.h> +#include <linux/dma-fence-chain.h> #include <linux/dma-mapping.h> #include <linux/err.h> #include <linux/errname.h> @@ -30,6 +32,7 @@ #include <linux/of.h> #include <linux/of_device.h> #include <linux/platform_device.h> +#include <linux/spinlock.h> #include <linux/rcupdate.h> #include <linux/refcount.h> #include <linux/xarray.h> @@ -388,6 +391,56 @@ int rust_helper_sg_dma_len(const struct scatterlist *sg) } EXPORT_SYMBOL_GPL(rust_helper_sg_dma_len); +void rust_helper___spin_lock_init(spinlock_t *lock, const char *name,
struct lock_class_key *key)
+{ +#ifdef CONFIG_DEBUG_SPINLOCK +# ifndef CONFIG_PREEMPT_RT
- __raw_spin_lock_init(spinlock_check(lock), name, key, LD_WAIT_CONFIG);
+# else
- rt_mutex_base_init(&lock->lock);
- __rt_spin_lock_init(lock, name, key, false);
+# endif +#else
- spin_lock_init(lock);
+#endif +} +EXPORT_SYMBOL_GPL(rust_helper___spin_lock_init);
+#ifdef CONFIG_DMA_SHARED_BUFFER
+void rust_helper_dma_fence_get(struct dma_fence *fence) +{
- dma_fence_get(fence);
+} +EXPORT_SYMBOL_GPL(rust_helper_dma_fence_get);
+void rust_helper_dma_fence_put(struct dma_fence *fence) +{
- dma_fence_put(fence);
+} +EXPORT_SYMBOL_GPL(rust_helper_dma_fence_put);
+struct dma_fence_chain *rust_helper_dma_fence_chain_alloc(void) +{
- return dma_fence_chain_alloc();
+} +EXPORT_SYMBOL_GPL(rust_helper_dma_fence_chain_alloc);
+void rust_helper_dma_fence_chain_free(struct dma_fence_chain *chain) +{
- dma_fence_chain_free(chain);
+} +EXPORT_SYMBOL_GPL(rust_helper_dma_fence_chain_free);
+void rust_helper_dma_fence_set_error(struct dma_fence *fence, int error) +{
- dma_fence_set_error(fence, error);
+} +EXPORT_SYMBOL_GPL(rust_helper_dma_fence_set_error);
+#endif
#ifdef CONFIG_DRM void rust_helper_drm_gem_object_get(struct drm_gem_object *obj) diff --git a/rust/kernel/dma_fence.rs b/rust/kernel/dma_fence.rs new file mode 100644 index 000000000000..ca93380d9da2 --- /dev/null +++ b/rust/kernel/dma_fence.rs @@ -0,0 +1,532 @@ +// SPDX-License-Identifier: GPL-2.0
+//! DMA fence abstraction. +//! +//! C header: [`include/linux/dma_fence.h`](../../include/linux/dma_fence.h)
+use crate::{
- bindings,
- error::{to_result, Result},
- prelude::*,
- sync::LockClassKey,
- types::Opaque,
+}; +use core::fmt::Write; +use core::ops::{Deref, DerefMut}; +use core::ptr::addr_of_mut; +use core::sync::atomic::{AtomicU64, Ordering};
+/// Any kind of DMA Fence Object
+///
+/// # Invariants
+/// raw() returns a valid pointer to a dma_fence and we own a reference to it.
+pub trait RawDmaFence: crate::private::Sealed {
+    /// Returns the raw `struct dma_fence` pointer.
+    fn raw(&self) -> *mut bindings::dma_fence;
+
+    /// Returns the raw `struct dma_fence` pointer and consumes the object.
+    ///
+    /// The caller is responsible for dropping the reference.
+    fn into_raw(self) -> *mut bindings::dma_fence
+    where
+        Self: Sized,
+    {
+        let ptr = self.raw();
+        core::mem::forget(self);
+        ptr
+    }
+
+    /// Advances this fence to the chain node which will signal this sequence number.
+    /// If no sequence number is provided, this returns `self` again.
+    fn chain_find_seqno(self, seqno: u64) -> Result<Fence>
+    where
+        Self: Sized,
+    {
+        let mut ptr = self.into_raw();
+
+        // SAFETY: This will safely fail if this DmaFence is not a chain.
+        // `ptr` is valid per the type invariant.
+        let ret = unsafe { bindings::dma_fence_chain_find_seqno(&mut ptr, seqno) };
+
+        if ret != 0 {
+            // SAFETY: This is either an owned reference or NULL, dma_fence_put can handle both.
+            unsafe { bindings::dma_fence_put(ptr) };
+            Err(Error::from_kernel_errno(ret))
+        } else if ptr.is_null() {
+            Err(EINVAL) // When can this happen?
+        } else {
+            // SAFETY: ptr is valid and non-NULL as checked above.
+            Ok(unsafe { Fence::from_raw(ptr) })
+        }
+    }
+
+    /// Signal completion of this fence
+    fn signal(&self) -> Result {
+        to_result(unsafe { bindings::dma_fence_signal(self.raw()) })
+    }
+
+    /// Set the error flag on this fence
+    fn set_error(&self, err: Error) {
+        unsafe { bindings::dma_fence_set_error(self.raw(), err.to_kernel_errno()) };
+    }
+}
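The `into_raw()`/`from_raw()` pair above is the standard ownership-transfer idiom: `into_raw()` skips `Drop` via `core::mem::forget()`, so exactly one reference is released no matter which path is taken. A userspace sketch of the same idiom, with a toy atomic counter standing in for the `dma_fence` refcount (all names here are illustrative, not kernel APIs):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Toy stand-in for the reference count inside a `struct dma_fence`.
static REFS: AtomicUsize = AtomicUsize::new(0);

struct ToyFence; // stand-in for a refcounted C object

fn toy_get() -> ToyFence {
    REFS.fetch_add(1, Ordering::Relaxed); // analogous to dma_fence_get()
    ToyFence
}

impl Drop for ToyFence {
    fn drop(&mut self) {
        REFS.fetch_sub(1, Ordering::Relaxed); // analogous to dma_fence_put()
    }
}

// Like RawDmaFence::into_raw(): leak the wrapper so Drop does not run.
fn into_raw(f: ToyFence) -> usize {
    std::mem::forget(f); // skip Drop: the caller now owns the reference
    REFS.load(Ordering::Relaxed)
}

fn main() {
    let f = toy_get();
    assert_eq!(REFS.load(Ordering::Relaxed), 1);
    // into_raw() leaks the wrapper, so the reference survives...
    assert_eq!(into_raw(f), 1);
    // ...and must eventually be released by hand (the C side's put).
    REFS.fetch_sub(1, Ordering::Relaxed);
    assert_eq!(REFS.load(Ordering::Relaxed), 0);
    // A wrapper that is simply dropped releases its reference automatically.
    drop(toy_get());
    assert_eq!(REFS.load(Ordering::Relaxed), 0);
    println!("ok");
}
```

The point is that either `Drop` or the manual release runs, never both, which is what keeps the refcount balanced across the FFI boundary.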
+/// A generic DMA Fence Object
+///
+/// # Invariants
+/// ptr is a valid pointer to a dma_fence and we own a reference to it.
+pub struct Fence {
+    ptr: *mut bindings::dma_fence,
+}
+
+impl Fence {
+    /// Create a new Fence object from a raw pointer to a dma_fence.
+    ///
+    /// # Safety
+    /// The caller must own a reference to the dma_fence, which is transferred to the new object.
+    pub(crate) unsafe fn from_raw(ptr: *mut bindings::dma_fence) -> Fence {
+        Fence { ptr }
+    }
+
+    /// Create a new Fence object from a raw pointer to a dma_fence.
+    ///
+    /// # Safety
+    /// Takes a borrowed reference to the dma_fence, and increments the reference count.
+    pub(crate) unsafe fn get_raw(ptr: *mut bindings::dma_fence) -> Fence {
+        // SAFETY: Pointer is valid per the safety contract
+        unsafe { bindings::dma_fence_get(ptr) };
+        Fence { ptr }
+    }
+
+    /// Create a new Fence object from a RawDmaFence.
+    pub fn from_fence(fence: &dyn RawDmaFence) -> Fence {
+        // SAFETY: Pointer is valid per the RawDmaFence contract
+        unsafe { Self::get_raw(fence.raw()) }
+    }
+}
+
+impl crate::private::Sealed for Fence {}
+
+impl RawDmaFence for Fence {
+    fn raw(&self) -> *mut bindings::dma_fence {
+        self.ptr
+    }
+}
+
+impl Drop for Fence {
+    fn drop(&mut self) {
+        // SAFETY: We own a reference to this fence.
+        unsafe { bindings::dma_fence_put(self.ptr) };
+    }
+}
+
+impl Clone for Fence {
+    fn clone(&self) -> Self {
+        // SAFETY: `ptr` is valid per the type invariant and we own a reference to it.
+        unsafe {
+            bindings::dma_fence_get(self.ptr);
+            Self::from_raw(self.ptr)
+        }
+    }
+}
+
+unsafe impl Sync for Fence {}
+unsafe impl Send for Fence {}
+/// Trait which must be implemented by driver-specific fence objects.
+#[vtable]
+pub trait FenceOps: Sized + Send + Sync {
+    /// True if this dma_fence implementation uses 64bit seqno, false otherwise.
+    const USE_64BIT_SEQNO: bool;
+
+    /// Returns the driver name. This is a callback to allow drivers to compute the name at
+    /// runtime, without having to store it permanently for each fence, or build a cache of
+    /// some sort.
+    fn get_driver_name<'a>(self: &'a FenceObject<Self>) -> &'a CStr;
+
+    /// Return the name of the context this fence belongs to. This is a callback to allow drivers
+    /// to compute the name at runtime, without having to store it permanently for each fence, or
+    /// build a cache of some sort.
+    fn get_timeline_name<'a>(self: &'a FenceObject<Self>) -> &'a CStr;
+
+    /// Enable software signaling of fence.
+    fn enable_signaling(self: &FenceObject<Self>) -> bool {
+        false
+    }
+
+    /// Peek whether the fence is signaled, as a fastpath optimization for e.g. dma_fence_wait() or
+    /// dma_fence_add_callback().
+    fn signaled(self: &FenceObject<Self>) -> bool {
+        false
+    }
+
+    /// Callback to fill in free-form debug info specific to this fence, like the sequence number.
+    fn fence_value_str(self: &FenceObject<Self>, _output: &mut dyn Write) {}
+
+    /// Fills in the current value of the timeline as a string, like the sequence number. Note that
+    /// the specific fence passed to this function should not matter, drivers should only use it to
+    /// look up the corresponding timeline structures.
+    fn timeline_value_str(self: &FenceObject<Self>, _output: &mut dyn Write) {}
+}
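The `#[vtable]` attribute generates `HAS_*` associated constants recording which optional trait methods a driver actually implemented; the `FenceObject::VTABLE` constant later uses them to emit `Some(callback)` or `None` at compile time. A hand-rolled userspace model of that mechanism (the names here are illustrative, not the real macro expansion):

```rust
// Model of the #[vtable] pattern: a HAS_* const per optional method, and a
// vtable builder that turns it into Some(fn)/None per implementing type.
trait Ops {
    const HAS_SIGNALED: bool;
    fn signaled() -> bool {
        false
    }
}

struct WithSignaled;
impl Ops for WithSignaled {
    const HAS_SIGNALED: bool = true;
    fn signaled() -> bool {
        true
    }
}

struct Minimal;
impl Ops for Minimal {
    const HAS_SIGNALED: bool = false;
}

// Analogous to the `signaled: if T::HAS_SIGNALED { Some(..) } else { None }`
// field initializer in FenceObject::VTABLE.
fn vtable_entry<T: Ops>() -> Option<fn() -> bool> {
    if T::HAS_SIGNALED {
        Some(T::signaled)
    } else {
        None
    }
}

fn main() {
    assert!(vtable_entry::<WithSignaled>().map(|f| f()).unwrap());
    assert!(vtable_entry::<Minimal>().is_none());
    println!("ok");
}
```

This is why the C side sees a NULL function pointer (and takes its default path) whenever the Rust driver relied on the trait's default method.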
+unsafe extern "C" fn get_driver_name_cb<T: FenceOps>(
+    fence: *mut bindings::dma_fence,
+) -> *const core::ffi::c_char {
+    // SAFETY: All of our fences are FenceObject<T>.
+    let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>;
+
+    // SAFETY: The caller is responsible for passing a valid dma_fence subtype
+    T::get_driver_name(unsafe { &mut *p }).as_char_ptr()
+}
+
+unsafe extern "C" fn get_timeline_name_cb<T: FenceOps>(
+    fence: *mut bindings::dma_fence,
+) -> *const core::ffi::c_char {
+    // SAFETY: All of our fences are FenceObject<T>.
+    let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>;
+
+    // SAFETY: The caller is responsible for passing a valid dma_fence subtype
+    T::get_timeline_name(unsafe { &mut *p }).as_char_ptr()
+}
+
+unsafe extern "C" fn enable_signaling_cb<T: FenceOps>(fence: *mut bindings::dma_fence) -> bool {
+    // SAFETY: All of our fences are FenceObject<T>.
+    let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>;
+
+    // SAFETY: The caller is responsible for passing a valid dma_fence subtype
+    T::enable_signaling(unsafe { &mut *p })
+}
+
+unsafe extern "C" fn signaled_cb<T: FenceOps>(fence: *mut bindings::dma_fence) -> bool {
+    // SAFETY: All of our fences are FenceObject<T>.
+    let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>;
+
+    // SAFETY: The caller is responsible for passing a valid dma_fence subtype
+    T::signaled(unsafe { &mut *p })
+}
+
+unsafe extern "C" fn release_cb<T: FenceOps>(fence: *mut bindings::dma_fence) {
+    // SAFETY: All of our fences are FenceObject<T>.
+    let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>;
+
+    // SAFETY: p is never used after this
+    unsafe {
+        core::ptr::drop_in_place(&mut (*p).inner);
+    }
+
+    // SAFETY: All of our fences are allocated using kmalloc, so this is safe.
+    unsafe { bindings::dma_fence_free(fence) };
+}
+
+unsafe extern "C" fn fence_value_str_cb<T: FenceOps>(
+    fence: *mut bindings::dma_fence,
+    string: *mut core::ffi::c_char,
+    size: core::ffi::c_int,
+) {
+    let size: usize = size.try_into().unwrap_or(0);
+
+    if size == 0 {
+        return;
+    }
+
+    // SAFETY: All of our fences are FenceObject<T>.
+    let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>;
+
+    // SAFETY: The caller is responsible for the validity of string/size
+    let mut f = unsafe { crate::str::Formatter::from_buffer(string as *mut _, size) };
+
+    // SAFETY: The caller is responsible for passing a valid dma_fence subtype
+    T::fence_value_str(unsafe { &mut *p }, &mut f);
+    let _ = f.write_str("\0");
+
+    // SAFETY: `size` is at least 1 per the check above
+    unsafe { *string.add(size - 1) = 0 };
+}
+
+unsafe extern "C" fn timeline_value_str_cb<T: FenceOps>(
+    fence: *mut bindings::dma_fence,
+    string: *mut core::ffi::c_char,
+    size: core::ffi::c_int,
+) {
+    let size: usize = size.try_into().unwrap_or(0);
+
+    if size == 0 {
+        return;
+    }
+
+    // SAFETY: All of our fences are FenceObject<T>.
+    let p = crate::container_of!(fence, FenceObject<T>, fence) as *mut FenceObject<T>;
+
+    // SAFETY: The caller is responsible for the validity of string/size
+    let mut f = unsafe { crate::str::Formatter::from_buffer(string as *mut _, size) };
+
+    // SAFETY: The caller is responsible for passing a valid dma_fence subtype
+    T::timeline_value_str(unsafe { &mut *p }, &mut f);
+    let _ = f.write_str("\0");
+
+    // SAFETY: `size` is at least 1 per the check above
+    unsafe { *string.add(size - 1) = 0 };
+}
+// Allow FenceObject<Self> to be used as a self argument, for ergonomics
+impl<T: FenceOps> core::ops::Receiver for FenceObject<T> {}
+
+/// A driver-specific DMA Fence Object
+///
+/// # Invariants
+/// ptr is a valid pointer to a dma_fence and we own a reference to it.
+#[repr(C)]
+pub struct FenceObject<T: FenceOps> {
+    fence: bindings::dma_fence,
+    lock: Opaque<bindings::spinlock>,
+    inner: T,
+}
+
+impl<T: FenceOps> FenceObject<T> {
+    const SIZE: usize = core::mem::size_of::<Self>();
+
+    const VTABLE: bindings::dma_fence_ops = bindings::dma_fence_ops {
+        use_64bit_seqno: T::USE_64BIT_SEQNO,
+        get_driver_name: Some(get_driver_name_cb::<T>),
+        get_timeline_name: Some(get_timeline_name_cb::<T>),
+        enable_signaling: if T::HAS_ENABLE_SIGNALING {
+            Some(enable_signaling_cb::<T>)
+        } else {
+            None
+        },
+        signaled: if T::HAS_SIGNALED {
+            Some(signaled_cb::<T>)
+        } else {
+            None
+        },
+        wait: None, // Deprecated
+        release: Some(release_cb::<T>),
+        fence_value_str: if T::HAS_FENCE_VALUE_STR {
+            Some(fence_value_str_cb::<T>)
+        } else {
+            None
+        },
+        timeline_value_str: if T::HAS_TIMELINE_VALUE_STR {
+            Some(timeline_value_str_cb::<T>)
+        } else {
+            None
+        },
+    };
+}
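All the `*_cb` trampolines above rely on `FenceObject<T>` being `#[repr(C)]` with `fence` as a field at a known offset, so `container_of!` can recover the full object from the `struct dma_fence` pointer the C core hands back. A minimal standalone sketch of that pointer arithmetic, using toy types rather than the kernel macro:

```rust
#[repr(C)]
struct Inner {
    a: u64, // stand-in for the embedded C struct (e.g. dma_fence)
}

#[repr(C)]
struct Outer {
    inner: Inner, // embedded field whose address the C side sees
    payload: u32, // driver-private data living alongside it
}

// Minimal container_of: recover *Outer from a pointer to its `inner` field
// by subtracting the field's offset within the containing struct.
unsafe fn container_of(inner: *const Inner) -> *const Outer {
    let offset = std::mem::offset_of!(Outer, inner);
    (inner as *const u8).sub(offset) as *const Outer
}

fn main() {
    let o = Outer {
        inner: Inner { a: 7 },
        payload: 42,
    };
    let p = unsafe { container_of(&o.inner as *const Inner) };
    // The outer struct, including driver-private state, is reachable again.
    assert_eq!(unsafe { (*p).payload }, 42);
    println!("ok");
}
```

This is only sound because `#[repr(C)]` fixes the field layout; with the default Rust representation the offset would be unspecified.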
+impl<T: FenceOps> Deref for FenceObject<T> {
+    type Target = T;
+
+    fn deref(&self) -> &T {
+        &self.inner
+    }
+}
+
+impl<T: FenceOps> DerefMut for FenceObject<T> {
+    fn deref_mut(&mut self) -> &mut T {
+        &mut self.inner
+    }
+}
+
+impl<T: FenceOps> crate::private::Sealed for FenceObject<T> {}
+impl<T: FenceOps> RawDmaFence for FenceObject<T> {
+    fn raw(&self) -> *mut bindings::dma_fence {
+        &self.fence as *const _ as *mut _
+    }
+}
+
+/// A unique reference to a driver-specific fence object
+pub struct UniqueFence<T: FenceOps>(*mut FenceObject<T>);
+
+impl<T: FenceOps> Deref for UniqueFence<T> {
+    type Target = FenceObject<T>;
+
+    fn deref(&self) -> &FenceObject<T> {
+        unsafe { &*self.0 }
+    }
+}
+
+impl<T: FenceOps> DerefMut for UniqueFence<T> {
+    fn deref_mut(&mut self) -> &mut FenceObject<T> {
+        unsafe { &mut *self.0 }
+    }
+}
+
+impl<T: FenceOps> crate::private::Sealed for UniqueFence<T> {}
+impl<T: FenceOps> RawDmaFence for UniqueFence<T> {
+    fn raw(&self) -> *mut bindings::dma_fence {
+        unsafe { addr_of_mut!((*self.0).fence) }
+    }
+}
+
+impl<T: FenceOps> From<UniqueFence<T>> for UserFence<T> {
+    fn from(value: UniqueFence<T>) -> Self {
+        let ptr = value.0;
+        core::mem::forget(value);
+
+        UserFence(ptr)
+    }
+}
+
+impl<T: FenceOps> Drop for UniqueFence<T> {
+    fn drop(&mut self) {
+        // SAFETY: We own a reference to this fence.
+        unsafe { bindings::dma_fence_put(self.raw()) };
+    }
+}
+
+unsafe impl<T: FenceOps> Sync for UniqueFence<T> {}
+unsafe impl<T: FenceOps> Send for UniqueFence<T> {}
+
+/// A shared reference to a driver-specific fence object
+pub struct UserFence<T: FenceOps>(*mut FenceObject<T>);
+
+impl<T: FenceOps> Deref for UserFence<T> {
+    type Target = FenceObject<T>;
+
+    fn deref(&self) -> &FenceObject<T> {
+        unsafe { &*self.0 }
+    }
+}
+
+impl<T: FenceOps> Clone for UserFence<T> {
+    fn clone(&self) -> Self {
+        // SAFETY: `ptr` is valid per the type invariant and we own a reference to it.
+        unsafe {
+            bindings::dma_fence_get(self.raw());
+            Self(self.0)
+        }
+    }
+}
+
+impl<T: FenceOps> crate::private::Sealed for UserFence<T> {}
+impl<T: FenceOps> RawDmaFence for UserFence<T> {
+    fn raw(&self) -> *mut bindings::dma_fence {
+        unsafe { addr_of_mut!((*self.0).fence) }
+    }
+}
+
+impl<T: FenceOps> Drop for UserFence<T> {
+    fn drop(&mut self) {
+        // SAFETY: We own a reference to this fence.
+        unsafe { bindings::dma_fence_put(self.raw()) };
+    }
+}
+
+unsafe impl<T: FenceOps> Sync for UserFence<T> {}
+unsafe impl<T: FenceOps> Send for UserFence<T> {}
+/// An array of fence contexts, out of which fences can be created.
+pub struct FenceContexts {
+    start: u64,
+    count: u32,
+    seqnos: Vec<AtomicU64>,
+    lock_name: &'static CStr,
+    lock_key: &'static LockClassKey,
+}
+
+impl FenceContexts {
+    /// Create a new set of fence contexts.
+    pub fn new(
+        count: u32,
+        name: &'static CStr,
+        key: &'static LockClassKey,
+    ) -> Result<FenceContexts> {
+        let mut seqnos: Vec<AtomicU64> = Vec::new();
+
+        seqnos.try_reserve(count as usize)?;
+
+        for _ in 0..count {
+            seqnos.try_push(Default::default())?;
+        }
+
+        let start = unsafe { bindings::dma_fence_context_alloc(count as core::ffi::c_uint) };
+
+        Ok(FenceContexts {
+            start,
+            count,
+            seqnos,
+            lock_name: name,
+            lock_key: key,
+        })
+    }
+
+    /// Create a new fence in a given context index.
+    pub fn new_fence<T: FenceOps>(&self, context: u32, inner: T) -> Result<UniqueFence<T>> {
+        if context >= self.count {
+            return Err(EINVAL);
+        }
+
+        let p = unsafe {
+            bindings::krealloc(
+                core::ptr::null_mut(),
+                FenceObject::<T>::SIZE,
+                bindings::GFP_KERNEL | bindings::__GFP_ZERO,
+            ) as *mut FenceObject<T>
+        };
+
+        if p.is_null() {
+            return Err(ENOMEM);
+        }
+
+        let seqno = self.seqnos[context as usize].fetch_add(1, Ordering::Relaxed);
+
+        // SAFETY: The pointer is valid, so pointers to members are too.
+        // After this, all fields are initialized.
+        unsafe {
+            addr_of_mut!((*p).inner).write(inner);
+
+            bindings::__spin_lock_init(
+                addr_of_mut!((*p).lock) as *mut _,
+                self.lock_name.as_char_ptr(),
+                self.lock_key.get(),
+            );
+
+            bindings::dma_fence_init(
+                addr_of_mut!((*p).fence),
+                &FenceObject::<T>::VTABLE,
+                addr_of_mut!((*p).lock) as *mut _,
+                self.start + context as u64,
+                seqno,
+            );
+        };
+
+        Ok(UniqueFence(p))
+    }
+}
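`new_fence()` combines a base context number handed out by `dma_fence_context_alloc()` with a per-context sequence number bumped via `AtomicU64::fetch_add`. A userspace model of that allocation scheme, including the bounds check that maps an out-of-range context index to an error (all names illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Per-context monotonic sequence numbers, in the spirit of FenceContexts.
struct Contexts {
    start: u64,             // base context number (dma_fence_context_alloc)
    seqnos: Vec<AtomicU64>, // one monotonically increasing seqno per context
}

impl Contexts {
    fn new(start: u64, count: usize) -> Self {
        Contexts {
            start,
            seqnos: (0..count).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    // Returns (context number, seqno) for the next fence in `context`,
    // or None for an out-of-range index (the EINVAL path above).
    fn next(&self, context: usize) -> Option<(u64, u64)> {
        let slot = self.seqnos.get(context)?;
        Some((
            self.start + context as u64,
            slot.fetch_add(1, Ordering::Relaxed),
        ))
    }
}

fn main() {
    let c = Contexts::new(100, 2);
    assert_eq!(c.next(0), Some((100, 0)));
    assert_eq!(c.next(0), Some((100, 1))); // seqno advances per context
    assert_eq!(c.next(1), Some((101, 0))); // independent timeline
    assert_eq!(c.next(2), None); // out of range
    println!("ok");
}
```

Relaxed ordering is enough here because the seqno only needs to be unique and monotonic, not to synchronize any other memory.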
+/// A DMA Fence Chain Object
+///
+/// # Invariants
+/// ptr is a valid pointer to a dma_fence_chain which we own.
+pub struct FenceChain {
+    ptr: *mut bindings::dma_fence_chain,
+}
+
+impl FenceChain {
+    /// Create a new DmaFenceChain object.
+    pub fn new() -> Result<Self> {
+        // SAFETY: This function is safe to call and takes no arguments.
+        let ptr = unsafe { bindings::dma_fence_chain_alloc() };
+
+        if ptr.is_null() {
+            Err(ENOMEM)
+        } else {
+            Ok(FenceChain { ptr })
+        }
+    }
+
+    /// Convert the DmaFenceChain into the underlying raw pointer.
+    ///
+    /// This assumes the caller will take ownership of the object.
+    pub(crate) fn into_raw(self) -> *mut bindings::dma_fence_chain {
+        let ptr = self.ptr;
+        core::mem::forget(self);
+        ptr
+    }
+}
+
+impl Drop for FenceChain {
+    fn drop(&mut self) {
+        // SAFETY: We own this dma_fence_chain.
+        unsafe { bindings::dma_fence_chain_free(self.ptr) };
+} diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs index cb23d24c6718..31866069e0bc 100644 --- a/rust/kernel/lib.rs +++ b/rust/kernel/lib.rs @@ -36,6 +36,8 @@ mod allocator; mod build_assert; pub mod delay; pub mod device; +#[cfg(CONFIG_DMA_SHARED_BUFFER)] +pub mod dma_fence; pub mod driver; #[cfg(CONFIG_RUST_DRM)] pub mod drm;
-- 2.35.1
DRM Sync Objects are a container for a DMA fence, and can be waited on, signaled, exported, and imported from userspace. Add a Rust abstraction so Rust DRM drivers can support this functionality.
Signed-off-by: Asahi Lina lina@asahilina.net --- rust/bindings/bindings_helper.h | 1 + rust/helpers.c | 19 ++++++++++ rust/kernel/drm/mod.rs | 1 + rust/kernel/drm/syncobj.rs | 77 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 98 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 705af292a5b4..b6696011f3a4 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -12,6 +12,7 @@ #include <drm/drm_gem.h> #include <drm/drm_gem_shmem_helper.h> #include <drm/drm_ioctl.h> +#include <drm/drm_syncobj.h> #include <linux/delay.h> #include <linux/device.h> #include <linux/dma-fence.h> diff --git a/rust/helpers.c b/rust/helpers.c index 8e906a7a7d8a..11965b1e2f4e 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -20,6 +20,7 @@
#include <drm/drm_gem.h> #include <drm/drm_gem_shmem_helper.h> +#include <drm/drm_syncobj.h> #include <linux/bug.h> #include <linux/build_bug.h> #include <linux/device.h> @@ -461,6 +462,24 @@ __u64 rust_helper_drm_vma_node_offset_addr(struct drm_vma_offset_node *node) } EXPORT_SYMBOL_GPL(rust_helper_drm_vma_node_offset_addr);
+void rust_helper_drm_syncobj_get(struct drm_syncobj *obj)
+{
+	drm_syncobj_get(obj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_syncobj_get);
+
+void rust_helper_drm_syncobj_put(struct drm_syncobj *obj)
+{
+	drm_syncobj_put(obj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_syncobj_put);
+
+struct dma_fence *rust_helper_drm_syncobj_fence_get(struct drm_syncobj *syncobj)
+{
+	return drm_syncobj_fence_get(syncobj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_syncobj_fence_get);
+
 #ifdef CONFIG_DRM_GEM_SHMEM_HELPER
 void rust_helper_drm_gem_shmem_object_free(struct drm_gem_object *obj)
diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
index 73fab2dee3af..dae98826edfd 100644
--- a/rust/kernel/drm/mod.rs
+++ b/rust/kernel/drm/mod.rs
@@ -8,3 +8,4 @@ pub mod file;
 pub mod gem;
 pub mod ioctl;
 pub mod mm;
+pub mod syncobj;
diff --git a/rust/kernel/drm/syncobj.rs b/rust/kernel/drm/syncobj.rs
new file mode 100644
index 000000000000..10eed05eb27a
--- /dev/null
+++ b/rust/kernel/drm/syncobj.rs
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+
+//! DRM Sync Objects
+//!
+//! C header: [`include/linux/drm/drm_syncobj.h`](../../../../include/linux/drm/drm_syncobj.h)
+
+use crate::{bindings, dma_fence::*, drm, error::Result, prelude::*};
+
+/// A DRM Sync Object
+///
+/// # Invariants
+/// ptr is a valid pointer to a drm_syncobj and we own a reference to it.
+pub struct SyncObj {
+    ptr: *mut bindings::drm_syncobj,
+}
+
+impl SyncObj {
+    /// Looks up a sync object by its handle for a given `File`.
+    pub fn lookup_handle(file: &impl drm::file::GenericFile, handle: u32) -> Result<SyncObj> {
+        // SAFETY: The arguments are all valid per the type invariants.
+        let ptr = unsafe { bindings::drm_syncobj_find(file.raw() as *mut _, handle) };
+
+        if ptr.is_null() {
+            Err(ENOENT)
+        } else {
+            Ok(SyncObj { ptr })
+        }
+    }
+
+    /// Returns the DMA fence associated with this sync object, if any.
+    pub fn fence_get(&self) -> Option<Fence> {
+        let fence = unsafe { bindings::drm_syncobj_fence_get(self.ptr) };
+        if fence.is_null() {
+            None
+        } else {
+            // SAFETY: The pointer is non-NULL and drm_syncobj_fence_get acquired an
+            // additional reference.
+            Some(unsafe { Fence::from_raw(fence) })
+        }
+    }
+
+    /// Replaces the DMA fence with a new one, or removes it if fence is None.
+    pub fn replace_fence(&self, fence: Option<&Fence>) {
+        unsafe {
+            bindings::drm_syncobj_replace_fence(
+                self.ptr,
+                fence.map_or(core::ptr::null_mut(), |a| a.raw()),
+            )
+        };
+    }
+
+    /// Adds a new timeline point to the syncobj.
+    pub fn add_point(&self, chain: FenceChain, fence: &Fence, point: u64) {
+        // SAFETY: All arguments should be valid per the respective type invariants.
+        // This takes over the FenceChain ownership.
+        unsafe { bindings::drm_syncobj_add_point(self.ptr, chain.into_raw(), fence.raw(), point) };
+    }
+}
+
+impl Drop for SyncObj {
+    fn drop(&mut self) {
+        // SAFETY: We own a reference to this syncobj.
+        unsafe { bindings::drm_syncobj_put(self.ptr) };
+    }
+}
+
+impl Clone for SyncObj {
+    fn clone(&self) -> Self {
+        // SAFETY: `ptr` is valid per the type invariant and we own a reference to it.
+        unsafe { bindings::drm_syncobj_get(self.ptr) };
+        SyncObj { ptr: self.ptr }
+    }
+}
+
+// SAFETY: drm_syncobj operations are internally locked.
+unsafe impl Sync for SyncObj {}
+unsafe impl Send for SyncObj {}
On Tue, Mar 07, 2023 at 11:25:34PM +0900, Asahi Lina wrote:
DRM Sync Objects are a container for a DMA fence, and can be waited on, signaled, exported, and imported from userspace. Add a Rust abstraction so Rust DRM drivers can support this functionality.
Signed-off-by: Asahi Lina lina@asahilina.net
rust/bindings/bindings_helper.h | 1 + rust/helpers.c | 19 ++++++++++ rust/kernel/drm/mod.rs | 1 + rust/kernel/drm/syncobj.rs | 77 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 98 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 705af292a5b4..b6696011f3a4 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -12,6 +12,7 @@ #include <drm/drm_gem.h> #include <drm/drm_gem_shmem_helper.h> #include <drm/drm_ioctl.h> +#include <drm/drm_syncobj.h> #include <linux/delay.h> #include <linux/device.h> #include <linux/dma-fence.h> diff --git a/rust/helpers.c b/rust/helpers.c index 8e906a7a7d8a..11965b1e2f4e 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -20,6 +20,7 @@ #include <drm/drm_gem.h> #include <drm/drm_gem_shmem_helper.h> +#include <drm/drm_syncobj.h> #include <linux/bug.h> #include <linux/build_bug.h> #include <linux/device.h> @@ -461,6 +462,24 @@ __u64 rust_helper_drm_vma_node_offset_addr(struct drm_vma_offset_node *node) } EXPORT_SYMBOL_GPL(rust_helper_drm_vma_node_offset_addr); +void rust_helper_drm_syncobj_get(struct drm_syncobj *obj) +{
+	drm_syncobj_get(obj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_syncobj_get);
+
+void rust_helper_drm_syncobj_put(struct drm_syncobj *obj)
+{
+	drm_syncobj_put(obj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_syncobj_put);
+
+struct dma_fence *rust_helper_drm_syncobj_fence_get(struct drm_syncobj *syncobj)
+{
+	return drm_syncobj_fence_get(syncobj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_syncobj_fence_get);
#ifdef CONFIG_DRM_GEM_SHMEM_HELPER void rust_helper_drm_gem_shmem_object_free(struct drm_gem_object *obj) diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs index 73fab2dee3af..dae98826edfd 100644 --- a/rust/kernel/drm/mod.rs +++ b/rust/kernel/drm/mod.rs @@ -8,3 +8,4 @@ pub mod file; pub mod gem; pub mod ioctl; pub mod mm; +pub mod syncobj; diff --git a/rust/kernel/drm/syncobj.rs b/rust/kernel/drm/syncobj.rs new file mode 100644 index 000000000000..10eed05eb27a --- /dev/null +++ b/rust/kernel/drm/syncobj.rs @@ -0,0 +1,77 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM Sync Objects +//! +//! C header: [`include/linux/drm/drm_syncobj.h`](../../../../include/linux/drm/drm_syncobj.h)
+use crate::{bindings, dma_fence::*, drm, error::Result, prelude::*};
+/// A DRM Sync Object
+///
+/// # Invariants
+/// ptr is a valid pointer to a drm_syncobj and we own a reference to it.
+pub struct SyncObj {
+    ptr: *mut bindings::drm_syncobj,
+}
+
+impl SyncObj {
+    /// Looks up a sync object by its handle for a given `File`.
+    pub fn lookup_handle(file: &impl drm::file::GenericFile, handle: u32) -> Result<SyncObj> {
+        // SAFETY: The arguments are all valid per the type invariants.
+        let ptr = unsafe { bindings::drm_syncobj_find(file.raw() as *mut _, handle) };
Just an aside, but the semantics of this are nasty: You're not allowed to hold any locks while calling this. We have runtime checks for that (if you enable lockdep), but I don't see any way to encode that on the rust side and check it at compile time :-/
+        if ptr.is_null() {
+            Err(ENOENT)
+        } else {
+            Ok(SyncObj { ptr })
+        }
+    }
+
+    /// Returns the DMA fence associated with this sync object, if any.
+    pub fn fence_get(&self) -> Option<Fence> {
+        let fence = unsafe { bindings::drm_syncobj_fence_get(self.ptr) };
+        if fence.is_null() {
+            None
+        } else {
+            // SAFETY: The pointer is non-NULL and drm_syncobj_fence_get acquired an
+            // additional reference.
+            Some(unsafe { Fence::from_raw(fence) })
+        }
+    }
+
+    /// Replaces the DMA fence with a new one, or removes it if fence is None.
+    pub fn replace_fence(&self, fence: Option<&Fence>) {
+        unsafe {
+            bindings::drm_syncobj_replace_fence(
+                self.ptr,
+                fence.map_or(core::ptr::null_mut(), |a| a.raw()),
+            )
+        };
+    }
+
+    /// Adds a new timeline point to the syncobj.
+    pub fn add_point(&self, chain: FenceChain, fence: &Fence, point: u64) {
+        // SAFETY: All arguments should be valid per the respective type invariants.
+        // This takes over the FenceChain ownership.
+        unsafe { bindings::drm_syncobj_add_point(self.ptr, chain.into_raw(), fence.raw(), point) };
+    }
+}
+
+impl Drop for SyncObj {
+    fn drop(&mut self) {
+        // SAFETY: We own a reference to this syncobj.
+        unsafe { bindings::drm_syncobj_put(self.ptr) };
+    }
+}
+
+impl Clone for SyncObj {
+    fn clone(&self) -> Self {
+        // SAFETY: `ptr` is valid per the type invariant and we own a reference to it.
+        unsafe { bindings::drm_syncobj_get(self.ptr) };
So yeah syncobj are refcounted because they're shareable uapi objects (you can pass them around as fd), but that really should be entirely the subsystems business, not for drivers.
This is kinda like drm_file, which is also refcounted (by virtue of hanging off struct file), but the refcounting is entirely handled by the vfs and all drivers get is a borrowed reference, which nicely bounds the lifetime to the callback (which is usually an ioctl handler). I think we want the same semantics for syncobj, because a driver should not be hanging onto a syncobj for longer than the ioctl. If my Rust understanding is right, we'd get that by dropping Clone here and relying on lookup_handle only being able to return stuff that's bound by the drm_file?
People are talking about drivers holding onto syncobj for longer, but I'm still not sold on the idea that this is any good and doesn't just bend the dma_fence and syncobj rules a bit too much over the breaking point. For kernel drivers it really should be just a different way to look up and return dma_fence from the ioctl, pretty much matching what you could also do with sync_file (but since syncobj provides a generic compat ioctl to convert to/from sync_file, drivers only need to handle syncobj). -Daniel
+        SyncObj { ptr: self.ptr }
+    }
+}
+
+// SAFETY: drm_syncobj operations are internally locked.
+unsafe impl Sync for SyncObj {}
+unsafe impl Send for SyncObj {}
-- 2.35.1
On 05/04/2023 21.33, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:34PM +0900, Asahi Lina wrote:
DRM Sync Objects are a container for a DMA fence, and can be waited on, signaled, exported, and imported from userspace. Add a Rust abstraction so Rust DRM drivers can support this functionality.
Signed-off-by: Asahi Lina lina@asahilina.net
rust/bindings/bindings_helper.h | 1 + rust/helpers.c | 19 ++++++++++ rust/kernel/drm/mod.rs | 1 + rust/kernel/drm/syncobj.rs | 77 +++++++++++++++++++++++++++++++++++++++++ 4 files changed, 98 insertions(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index 705af292a5b4..b6696011f3a4 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -12,6 +12,7 @@ #include <drm/drm_gem.h> #include <drm/drm_gem_shmem_helper.h> #include <drm/drm_ioctl.h> +#include <drm/drm_syncobj.h> #include <linux/delay.h> #include <linux/device.h> #include <linux/dma-fence.h> diff --git a/rust/helpers.c b/rust/helpers.c index 8e906a7a7d8a..11965b1e2f4e 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -20,6 +20,7 @@ #include <drm/drm_gem.h> #include <drm/drm_gem_shmem_helper.h> +#include <drm/drm_syncobj.h> #include <linux/bug.h> #include <linux/build_bug.h> #include <linux/device.h> @@ -461,6 +462,24 @@ __u64 rust_helper_drm_vma_node_offset_addr(struct drm_vma_offset_node *node) } EXPORT_SYMBOL_GPL(rust_helper_drm_vma_node_offset_addr); +void rust_helper_drm_syncobj_get(struct drm_syncobj *obj) +{
+	drm_syncobj_get(obj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_syncobj_get);
+
+void rust_helper_drm_syncobj_put(struct drm_syncobj *obj)
+{
+	drm_syncobj_put(obj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_syncobj_put);
+
+struct dma_fence *rust_helper_drm_syncobj_fence_get(struct drm_syncobj *syncobj)
+{
+	return drm_syncobj_fence_get(syncobj);
+}
+EXPORT_SYMBOL_GPL(rust_helper_drm_syncobj_fence_get);
+
 #ifdef CONFIG_DRM_GEM_SHMEM_HELPER
void rust_helper_drm_gem_shmem_object_free(struct drm_gem_object *obj) diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs index 73fab2dee3af..dae98826edfd 100644 --- a/rust/kernel/drm/mod.rs +++ b/rust/kernel/drm/mod.rs @@ -8,3 +8,4 @@ pub mod file; pub mod gem; pub mod ioctl; pub mod mm; +pub mod syncobj; diff --git a/rust/kernel/drm/syncobj.rs b/rust/kernel/drm/syncobj.rs new file mode 100644 index 000000000000..10eed05eb27a --- /dev/null +++ b/rust/kernel/drm/syncobj.rs @@ -0,0 +1,77 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM Sync Objects +//! +//! C header: [`include/linux/drm/drm_syncobj.h`](../../../../include/linux/drm/drm_syncobj.h)
+use crate::{bindings, dma_fence::*, drm, error::Result, prelude::*};
+/// A DRM Sync Object
+///
+/// # Invariants
+/// ptr is a valid pointer to a drm_syncobj and we own a reference to it.
+pub struct SyncObj {
+    ptr: *mut bindings::drm_syncobj,
+}
+
+impl SyncObj {
+    /// Looks up a sync object by its handle for a given `File`.
+    pub fn lookup_handle(file: &impl drm::file::GenericFile, handle: u32) -> Result<SyncObj> {
+        // SAFETY: The arguments are all valid per the type invariants.
+        let ptr = unsafe { bindings::drm_syncobj_find(file.raw() as *mut _, handle) };
Just an aside, but the semantics of this are nasty: You're not allowed to hold any locks while calling this. We have runtime checks for that (if you enable lockdep), but I don't see any way to encode that on the rust side and check it at compile time :-/
Oof, yeah, that's not possible today. Maybe in the future though, it's similar to the execution context stuff...
+        if ptr.is_null() {
+            Err(ENOENT)
+        } else {
+            Ok(SyncObj { ptr })
+        }
+    }
+
+    /// Returns the DMA fence associated with this sync object, if any.
+    pub fn fence_get(&self) -> Option<Fence> {
+        let fence = unsafe { bindings::drm_syncobj_fence_get(self.ptr) };
+        if fence.is_null() {
+            None
+        } else {
+            // SAFETY: The pointer is non-NULL and drm_syncobj_fence_get acquired an
+            // additional reference.
+            Some(unsafe { Fence::from_raw(fence) })
+        }
+    }
+
+    /// Replaces the DMA fence with a new one, or removes it if fence is None.
+    pub fn replace_fence(&self, fence: Option<&Fence>) {
+        unsafe {
+            bindings::drm_syncobj_replace_fence(
+                self.ptr,
+                fence.map_or(core::ptr::null_mut(), |a| a.raw()),
+            )
+        };
+    }
+
+    /// Adds a new timeline point to the syncobj.
+    pub fn add_point(&self, chain: FenceChain, fence: &Fence, point: u64) {
+        // SAFETY: All arguments should be valid per the respective type invariants.
+        // This takes over the FenceChain ownership.
+        unsafe { bindings::drm_syncobj_add_point(self.ptr, chain.into_raw(), fence.raw(), point) };
+    }
+}
+
+impl Drop for SyncObj {
+    fn drop(&mut self) {
+        // SAFETY: We own a reference to this syncobj.
+        unsafe { bindings::drm_syncobj_put(self.ptr) };
+    }
+}
+
+impl Clone for SyncObj {
+    fn clone(&self) -> Self {
+        // SAFETY: `ptr` is valid per the type invariant and we own a reference to it.
+        unsafe { bindings::drm_syncobj_get(self.ptr) };
So yeah syncobj are refcounted because they're shareable uapi objects (you can pass them around as fd), but that really should be entirely the subsystems business, not for drivers.
This is kinda like drm_file, which is also refcounted (by virtue of hanging off struct file), but the refcounting is entirely handled by the vfs and all drivers get is a borrowed reference, which nicely bounds the lifetime to the callback (which is usually an ioctl handler). I think we want the same semantics for syncobj, because a driver should not be hanging onto a syncobj for longer than the ioctl. If my Rust understanding is right, we'd get that by dropping Clone here and relying on lookup_handle only being able to return stuff that's bound by the drm_file?
Yeah, that should work! Lifetimes are perfect for this kind of stuff. I need to test it out and see what the right way to do it is (lifetime parameter or actual reference straight into the drm_syncobj) and see how it fits into the driver but I don't see why it wouldn't work, since I don't hold onto sync objects for longer than the ioctl. Might just need some minor refactoring since the current driver ioctl code wasn't written with lifetimes in mind ^^
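For reference, the borrow-bounded lookup discussed here can be modeled in plain Rust: giving the looked-up object a lifetime parameter tied to the file reference makes the compiler reject any attempt to keep it past the ioctl handler. A toy sketch (none of these types are the real DRM bindings):

```rust
// Toy stand-in for a drm_file the ioctl handler borrows.
struct File {
    name: &'static str,
}

// The looked-up object borrows the file it came from: it cannot outlive it,
// which is the compile-time version of "bounded to the ioctl handler".
struct SyncObjRef<'a> {
    file: &'a File,
    handle: u32,
}

fn lookup_handle<'a>(file: &'a File, handle: u32) -> Option<SyncObjRef<'a>> {
    Some(SyncObjRef { file, handle })
}

fn ioctl_handler(file: &File) -> u32 {
    let obj = lookup_handle(file, 7).unwrap();
    // `obj` is usable here, but stashing it somewhere that outlives `file`
    // would be rejected by the borrow checker, mirroring how a borrowed
    // drm_file would bound the syncobj's lifetime without any Clone impl.
    obj.handle
}

fn main() {
    let f = File { name: "drm_file" };
    assert_eq!(ioctl_handler(&f), 7);
    println!("ok");
}
```

Whether the real abstraction uses a lifetime parameter or a reference straight into the drm_syncobj is an open design question here; the sketch only shows that the bound itself is expressible.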
People are talking about drivers holding onto syncobj for longer, but I'm still not sold on the idea that this is any good and doesn't just bend the dma_fence and syncobj rules a bit too much over the breaking point. For kernel drivers it really should be just a different way to look up and return dma_fence from the ioctl, pretty much matching what you could also do with sync_file (but since syncobj provides a generic compat ioctl to convert to/from sync_file, drivers only need to handle syncobj).
Yeah, if you think restricting the API for this on the Rust side makes sense it works for me! I'm all for not abstracting features that aren't considered particularly useful/safe/a good idea.
-Daniel
        SyncObj { ptr: self.ptr }
    }
}

// SAFETY: drm_syncobj operations are internally locked.
unsafe impl Sync for SyncObj {}
unsafe impl Send for SyncObj {}
~~ Lina
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
 include/drm/gpu_scheduler.h            |  8 ++++++++
 2 files changed, 18 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 4e6ad6e122bc..5c0add2c7546 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
 		if (!entity)
 			continue;
 
+		if (sched->ops->can_run_job) {
+			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
+			if (!sched_job) {
+				complete_all(&entity->entity_idle);
+				continue;
+			}
+			if (!sched->ops->can_run_job(sched_job))
+				continue;
+		}
+
 		sched_job = drm_sched_entity_pop_job(entity);
 
 		if (!sched_job) {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9db9e5e504ee..bd89ea9507b9 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
 	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
 					 struct drm_sched_entity *s_entity);
 
+	/**
+	 * @can_run_job: Called before job execution to check whether the
+	 * hardware is free enough to run the job. This can be used to
+	 * implement more complex hardware resource policies than the
+	 * hw_submission limit.
+	 */
+	bool (*can_run_job)(struct drm_sched_job *sched_job);
+
 	/**
 	 * @run_job: Called to execute the job once all of the dependencies
 	 * have been resolved. This may be called multiple times, if
On 07.03.23 15:25, Asahi Lina wrote:
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Well complete NAK.
This is clearly going against the idea of having jobs only depend on fences and nothing else which is mandatory for correct memory management.
If the hw is busy with something you need to return the fence for this from the prepare_job callback so that the scheduler can be notified when the hw is available again.
Regards, Christian.
On 08/03/2023 17.46, Christian König wrote:
On 07.03.23 15:25, Asahi Lina wrote:
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Well complete NAK.
This is clearly going against the idea of having jobs only depend on fences and nothing else which is mandatory for correct memory management.
If the hw is busy with something you need to return the fence for this from the prepare_job callback so that the scheduler can be notified when the hw is available again.
I think you misunderstood the intent here... This isn't about job dependencies, it's about in-flight resource limits.
drm_sched already has a hw_submission_limit that specifies the number of submissions that can be in flight, but that doesn't work for us because each job from drm_sched's point of view consists of multiple commands split among 3 firmware queues. The firmware can only support up to 128 work commands in flight per queue (barriers don't count), otherwise it overflows a fixed-size buffer.
So we need more complex accounting of how many underlying commands are in flight per queue to determine whether it is safe to run a new job, and that is what this callback accomplishes. This has to happen even when individual jobs have no buffer/resource dependencies between them (which is what the fences would express).
You can see the driver implementation of that callback in drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that does the actual available slot count checks.
The can_run_job logic is written to mirror the hw_submission_limit logic (just a bit later in the sched main loop since we need to actually pick a job to do the check), and just like for that case, completion of any job in the same scheduler will cause another run of the main loop and another check (which is exactly what we want here).
This case (potentially scheduling more than the FW job limit) is rare but handling it is necessary, since otherwise the entire job completion/tracking logic gets screwed up on the firmware end and queues end up stuck (I've managed to trigger this before).
~~ Lina
On 08.03.23 10:41, Asahi Lina wrote:
On 08/03/2023 17.46, Christian König wrote:
On 07.03.23 15:25, Asahi Lina wrote:
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Well complete NAK.
This is clearly going against the idea of having jobs only depend on fences and nothing else which is mandatory for correct memory management.
If the hw is busy with something you need to return the fence for this from the prepare_job callback so that the scheduler can be notified when the hw is available again.
I think you misunderstood the intent here... This isn't about job dependencies, it's about in-flight resource limits.
drm_sched already has a hw_submission_limit that specifies the number of submissions that can be in flight, but that doesn't work for us because each job from drm_sched's point of view consists of multiple commands split among 3 firmware queues. The firmware can only support up to 128 work commands in flight per queue (barriers don't count), otherwise it overflows a fixed-size buffer.
So we need more complex accounting of how many underlying commands are in flight per queue to determine whether it is safe to run a new job, and that is what this callback accomplishes. This has to happen even when individual jobs have no buffer/resource dependencies between them (which is what the fences would express).
Yeah, I already assumed that you have something like this.
And to make it clear this is unfortunately a complete NAK to this approach! You can't do this!
The background is that core memory management requires that signaling a fence only depends on signaling other fences and hardware progress and nothing else. Otherwise you immediately run into problems because of circular dependencies or what we call infinite fences.
Jason Ekstrand gave a great presentation on that problem a few years ago at LPC. I strongly suggest you google that one up.
You can see the driver implementation of that callback in drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that does the actual available slot count checks.
The can_run_job logic is written to mirror the hw_submission_limit logic (just a bit later in the sched main loop since we need to actually pick a job to do the check), and just like for that case, completion of any job in the same scheduler will cause another run of the main loop and another check (which is exactly what we want here).
Yeah and that hw_submission_limit is based on a fence signaling again.
When you have some firmware limitation that a job needs resources which are currently in use by other submissions then those other submissions have fences as well and you can return those in the prepare_job callback.
If those other submissions don't have fences, then you have a major design problem inside your driver and we need to get back to square one and talk about that dependency handling.
This case (potentially scheduling more than the FW job limit) is rare but handling it is necessary, since otherwise the entire job completion/tracking logic gets screwed up on the firmware end and queues end up stuck (I've managed to trigger this before).
Actually that's a pretty normal use case. I have rejected similar requirements like this before as well.
For an example how this can work see amdgpu_job_prepare_job(): https://elixir.bootlin.com/linux/v6.3-rc1/source/drivers/gpu/drm/amd/amdgpu/...
The gang submit gives an example of a global fence lock and the VMIDs are an example of a global shared firmware resource.
Regards, Christian.
~~ Lina
On 08/03/2023 19.00, Christian König wrote:
On 08.03.23 10:41, Asahi Lina wrote:
On 08/03/2023 17.46, Christian König wrote:
On 07.03.23 15:25, Asahi Lina wrote:
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Well complete NAK.
This is clearly going against the idea of having jobs only depend on fences and nothing else which is mandatory for correct memory management.
If the hw is busy with something you need to return the fence for this from the prepare_job callback so that the scheduler can be notified when the hw is available again.
I think you misunderstood the intent here... This isn't about job dependencies, it's about in-flight resource limits.
drm_sched already has a hw_submission_limit that specifies the number of submissions that can be in flight, but that doesn't work for us because each job from drm_sched's point of view consists of multiple commands split among 3 firmware queues. The firmware can only support up to 128 work commands in flight per queue (barriers don't count), otherwise it overflows a fixed-size buffer.
So we need more complex accounting of how many underlying commands are in flight per queue to determine whether it is safe to run a new job, and that is what this callback accomplishes. This has to happen even when individual jobs have no buffer/resource dependencies between them (which is what the fences would express).
Yeah, I already assumed that you have something like this.
And to make it clear this is unfortunately a complete NAK to this approach! You can't do this!
I think you still have some significant misconceptions about how this driver works and uses drm_sched... I would appreciate it if you'd listen and try to understand the design before giving hard NAKs... (this isn't a Radeon)
The background is that core memory management requires that signaling a fence only depends on signaling other fences and hardware progress and nothing else. Otherwise you immediately run into problems because of circular dependencies or what we call infinite fences.
And hardware progress is exactly the only dependency here...
Jason Ekstrand gave a great presentation on that problem a few years ago at LPC. I strongly suggest you google that one up.
Faith Ekstrand (it looks like you mistyped that name...) is the person who proposed that I should use drm_sched in this way (see below), we've had a few private meetings about this design ^^
You can see the driver implementation of that callback in drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that does the actual available slot count checks.
The can_run_job logic is written to mirror the hw_submission_limit logic (just a bit later in the sched main loop since we need to actually pick a job to do the check), and just like for that case, completion of any job in the same scheduler will cause another run of the main loop and another check (which is exactly what we want here).
Yeah and that hw_submission_limit is based on a fence signaling again.
I don't think so...? It's just an atomic that gets checked in drm_sched_ready(). There are no extra fences involved (other than the job completion fences that trigger another scheduler run). The idea is that when the hardware queue makes forward progress you check against the limit again and submit more jobs as needed. I'm doing the same exact thing, I'm just using more complex logic for the notion of in-flight queue limits!
When you have some firmware limitation that a job needs resources which are currently in use by other submissions then those other submissions have fences as well and you can return those in the prepare_job callback.
If those other submissions don't have fences, then you have a major design problem inside your driver and we need to get back to square one and talk about that dependency handling.
I think we have a disconnect in our views of what is going on here...
This hardware has firmware-side scheduling with an arbitrary (as far as I know) number of queues. There is one scheduler instance and one entity per userspace queue (not global!). These queues process jobs in some logical sequence, though at the firmware level they get split into up to three queues each (and there is some parallelism allowed). The limitation here is in the number of in-flight jobs per firmware queue, not global.
There is no way for things to deadlock. If jobs have been submitted to the firmware queue, that means their dependencies were signaled already. Jobs have intra-job dependencies via driver barriers (which drm_sched knows nothing about), but the submission code in the driver guarantees that they are deadlock-free since you can only barrier on past commands, which by definition submit first.
If a firmware queue is full, drm_sched blocks. Since it is full, that means it will run those commands (since they have no outside dependencies and they are already queued and ready to run by the firmware), eventually space will be freed, and each time a job completes drm_sched will do the can_run_job check again and decide whether to run a new job.
Since the firmware queues contain commands which only have past-facing barriers on other already submitted commands, by definition they will become empty at some point as long as the firmware is making forward progress. And therefore, by definition, can_run_job will eventually return true at some point after a job completion fence is signaled (the one for the last job submitted prior). There is a check in the driver to ensure that we do not allow submissions which, by themselves, would exceed the queued command limit (we actually just limit to 64 commands overall right now, which is conservative but seems reasonable given the 128-per-firmware-queue limit).
I get the feeling that you are conflating pending jobs with submitted jobs. This isn't about how many jobs you can have pending in drm_sched before running them or anything like that. Of course, at that point, arbitrary dependencies come into play and you can end up with deadlocks on dependency fences. But that's not the case here. What can_run_job is waiting on is guaranteed to make forward progress.
This case (potentially scheduling more than the FW job limit) is rare but handling it is necessary, since otherwise the entire job completion/tracking logic gets screwed up on the firmware end and queues end up stuck (I've managed to trigger this before).
Actually that's a pretty normal use case. I have rejected similar requirements like this before as well.
For an example how this can work see amdgpu_job_prepare_job(): https://elixir.bootlin.com/linux/v6.3-rc1/source/drivers/gpu/drm/amd/amdgpu/...
The gang submit gives an example of a global fence lock and the VMIDs are an example of a global shared firmware resource.
But the resource can_run_job is checking on isn't globally shared! It's specific to this scheduler instance, just like hw_submission_limit is, so as long as the firmware behind the scheduler is making forward progress, the resource will be guaranteed to be freed until another job can run.
I actually know I have a different theoretical deadlock issue along these lines in the driver, because right now we grab actually global resources (including a VMID) before job submission to drm_sched. This is a known issue, and to fix it without reducing performance I need to introduce some kind of "patching/fixup" system for firmware commands (because we need to inject those identifiers in dozens of places, but we don't want to construct those commands from scratch at job run time because that introduces latency at the wrong time and makes error handling/validation more complicated and error-prone), and that is exactly what should happen in prepare_job, as you say. And yes, at that point that should use fences to block when those resources are exhausted.

But that's a different discussion we should have when reviewing the driver; it has nothing to do with the DRM abstractions nor the can_run_job callback I'm adding here nor the firmware queue length limit issue! (And also the global hardware devices are plentiful enough that I would be very surprised if anyone ever deadlocks it in practice even with the current code, so I honestly don't think that should be a blocker for driver submission either, I can and will fix it later...)
~~ Lina
On 08.03.23 15:53, Asahi Lina wrote:
[SNIP]
The background is that core memory management requires that signaling a fence only depends on signaling other fences and hardware progress and nothing else. Otherwise you immediately run into problems because of circular dependencies or what we call infinite fences.
And hardware progress is exactly the only dependency here...
Well then you should have a fence for that hardware progress.
Jason Ekstrand gave a great presentation on that problem a few years ago at LPC. I strongly suggest you google that one up.
Faith Ekstrand (it looks like you mistyped that name...)
My fault I was really just mistyping that :)
is the person who proposed that I should use drm_sched in this way (see below), we've had a few private meetings about this design ^^
You can see the driver implementation of that callback in drivers/gpu/drm/asahi/queue/mod.rs (QueueJob::can_run()), which then calls into drivers/gpu/drm/asahi/workqueue.rs (Job::can_submit()) that does the actual available slot count checks.
The can_run_job logic is written to mirror the hw_submission_limit logic (just a bit later in the sched main loop since we need to actually pick a job to do the check), and just like for that case, completion of any job in the same scheduler will cause another run of the main loop and another check (which is exactly what we want here).
Yeah and that hw_submission_limit is based on a fence signaling again.
I don't think so...? It's just an atomic that gets checked in drm_sched_ready(). There are no extra fences involved (other than the job completion fences that trigger another scheduler run). The idea is that when the hardware queue makes forward progress you check against the limit again and submit more jobs as needed. I'm doing the same exact thing, I'm just using more complex logic for the notion of in-flight queue limits!
Then why can't you express that logic in a dependency fence?
When you have some firmware limitation that a job needs resources which are currently in use by other submissions then those other submissions have fences as well and you can return those in the prepare_job callback.
If those other submissions don't have fences, then you have a major design problem inside your driver and we need to get back to square one and talk about that dependency handling.
I think we have a disconnect in our views of what is going on here...
This hardware has firmware-side scheduling with an arbitrary (as far as I know) number of queues. There is one scheduler instance and one entity per userspace queue (not global!). These queues process jobs in some logical sequence, though at the firmware level they get split into up to three queues each (and there is some parallelism allowed). The limitation here is in the number of in-flight jobs per firmware queue, not global.
So far I'm familiar with that design.
There is no way for things to deadlock. If jobs have been submitted to the firmware queue, that means their dependencies were signaled already. Jobs have intra-job dependencies via driver barriers (which drm_sched knows nothing about), but the submission code in the driver guarantees that they are deadlock-free since you can only barrier on past commands, which by definition submit first.
If a firmware queue is full, drm_sched blocks. Since it is full, that means it will run those commands (since they have no outside dependencies and they are already queued and ready to run by the firmware), eventually space will be freed, and each time a job completes drm_sched will do the can_run_job check again and decide whether to run a new job.
Since the firmware queues contain commands which only have past-facing barriers on other already submitted commands, by definition they will become empty at some point as long as the firmware is making forward progress. And therefore, by definition, can_run_job will eventually return true at some point after a job completion fence is signaled (the one for the last job submitted prior). There is a check in the driver to ensure that we do not allow submissions which, by themselves, would exceed the queued command limit (we actually just limit to 64 commands overall right now, which is conservative but seems reasonable given the 128-per-firmware-queue limit).
Well then again why don't you give that fence out as dependency? Is it because the scheduler tries to optimize those away?
I get the feeling that you are conflating pending jobs with submitted jobs. This isn't about how many jobs you can have pending in drm_sched before running them or anything like that. Of course, at that point, arbitrary dependencies come into play and you can end up with deadlocks on dependency fences. But that's not the case here. What can_run_job is waiting on is guaranteed to make forward progress.
I see that we have a disconnection here. As far as I can see you can use the can_run callback in only three ways:
1. To check for some userspace dependency (We don't need to discuss that, it's evil and we both know it).
2. You check for some hw resource availability. Similar to VMID on amdgpu hw.
This is what I think you do here (but I might be wrong). But this would be extremely problematic because you can then live lock. E.g. queue A keeps submitting jobs which take only a few resources and by doing so delays submitting jobs from queue B indefinitely.
3. You have an intra queue dependency. E.g. you have jobs which take X amount of resources, you can submit only to a specific limit. But in this case you should be able to return fences from the same queue as dependency and won't need that callback.
We would just need to adjust drm_sched_entity_add_dependency_cb() a bit because dependencies from the same queue are currently filtered out because it assumes a pipeline nature of submission (e.g. previous submissions are finished before new submissions start).
This case (potentially scheduling more than the FW job limit) is rare but handling it is necessary, since otherwise the entire job completion/tracking logic gets screwed up on the firmware end and queues end up stuck (I've managed to trigger this before).
Actually that's a pretty normal use case. I have rejected similar requirements like this before as well.
For an example how this can work see amdgpu_job_prepare_job(): https://elixir.bootlin.com/linux/v6.3-rc1/source/drivers/gpu/drm/amd/amdgpu/...
The gang submit gives an example of a global fence lock and the VMIDs are an example of a global shared firmware resource.
But the resource can_run_job is checking on isn't globally shared! It's specific to this scheduler instance, just like hw_submission_limit is, so as long as the firmware behind the scheduler is making forward progress, the resource will be guaranteed to be freed until another job can run.
Well either it should be globally shared because it is a shared resource (similar to our VMID or gangs) or it is an intra queue limitation in which case you could just use the fences previously submitted on the queue as dependency.
I actually know I have a different theoretical deadlock issue along these lines in the driver because right now we grab actually global resources (including a VMID) before job submission to drm_sched. This is a known issue, and to fix it without reducing performance I need to introduce some kind of "patching/fixup" system for firmware commands (because we need to inject those identifiers in dozens of places, but we don't want to construct those commands from scratch at job run time because that introduces latency at the wrong time and makes error handling/validation more complicated and error-prone), and that is exactly what should happen in prepare_job, as you say. And yes, at that point that should use fences to block when those resources are exhausted. But that's a different discussion we should have when reviewing the driver, it has nothing to do with the DRM abstractions nor the can_run_job callback I'm adding here nor the firmware queue length limit issue! (And also the global hardware devices are plentiful enough that I would be very surprised if anyone ever deadlocks it in practice even with the current code, so I honestly don't think that should be a blocker for driver submission either, I can and will fix it later...)
Well this is what I thought about those problems in amdgpu as well and it totally shipwrecked.
We still have memory allocations in the VMID code path which I'm still not sure how to remove.
Regards, Christian.
~~ Lina
On 09/03/2023 00.30, Christian König wrote:
On 08.03.23 15:53, Asahi Lina wrote:
[SNIP]
The background is that core memory management requires that signaling a fence only depends on signaling other fences and hardware progress and nothing else. Otherwise you immediately run into problems because of circular dependencies or what we call infinite fences.
And hardware progress is exactly the only dependency here...
Well then you should have a fence for that hardware progress.
I do, it's the prior job hardware completion fences that drm_sched already knows about!
Yes, I could return those in the prepare callback; it just means I need to start stashing fence references in the underlying firmware job queue command objects so I can find out what the oldest pending fence is, and return it when a queue is full. As long as drm_sched doesn't mind if I keep giving it fences (since multiple commands can have to complete before there is space) or the occasional already-signaled fence (because this process is inherently racy), it should work fine.
If you think this is the better way, I'll do it that way and drop this patch. It just seemed simpler to do it with another callback, since drm_sched is already tracking those fences and doing a hardware queue limit check anyway, and that way I can avoid tracking the fences down into the hardware queue code... *
(But I still maintain what I'm trying to do here is entirely correct and deadlock-free! If you prefer I use prepare_job and return prior job fences from that instead, that's very different from NAKing the patch saying it's broken...)
* If you're wondering how the fences get signaled at all then: callback closures that capture a reference to the fence when firmware commands are constructed and submitted. I know, I know, fancy Rust stuff... ^^ If you'd rather have me use the fences for the blocking, I'll probably just drop the signaling bit from the closures so we don't need to keep two redundant fence references in different places per command. I still need the closures for command completion processing though, since I use them to process statistics too...
Jason Ekstrand gave a great presentation on that problem a few years ago at LPC. I strongly suggest you google that one up.
Faith Ekstrand (it looks like you mistyped that name...)
My fault I was really just mistyping that :)
It's all good ^^
I see that we have a disconnection here. As far as I can see you can use the can_run callback in only three ways:
1. To check for some userspace dependency (We don't need to discuss that, it's evil and we both know it).

2. You check for some hw resource availability. Similar to VMID on amdgpu hw.
This is what I think you do here (but I might be wrong).
It isn't... I agree, it would be problematic. It doesn't make any sense to check for global resources this way, not just because you might deadlock but also because there might be nothing to signal to the scheduler that a resource was freed at all once it is!
But this would be extremely problematic because you can then live lock. E.g. queue A keeps submitting jobs which take only a few resources and by doing so delays submitting jobs from queue B indefinitely.
This particular issue aside, fairness in global resource allocation is a conversation I'd love to have! Right now the driver doesn't try to ensure that, a queue can easily monopolize certain hardware resources (though one queue can only monopolize one of each, so you'd need something like 63 queues with 63 distinct VMs all submitting free-running jobs back to back in order to starve other queues of resources forever). For starters, one thing I'm thinking of doing is reserving certain subsets of hardware resources for queues with a given priority, so you can at least guarantee forward progress of higher-priority queues when faced with misbehaving lower-priority queues. But if we want to guarantee proper fairness, I think I'll have to start doing things like switching to a CPU-roundtrip submission model when resources become scarce (to guarantee that queues actually release the resources once in a while) and then figure out how to add fairness to the allocation code...
But let's have that conversation when we talk about the driver (or maybe on IRC or something?), right now I'm more interested in getting the abstractions reviewed ^^
3. You have an intra queue dependency. E.g. you have jobs which take X amount of resources, you can submit only to a specific limit. But in this case you should be able to return fences from the same queue as dependency and won't need that callback.
Yes, I can do this. I can just do the same check can_run_job() does and if it fails, pick the oldest job in the full firmware queue and return its fence (it just means I need to keep track of those fences there, as I said above).
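In userspace-Rust pseudocode, the scheme would look roughly like this (names invented, not the actual driver code): keep the hw completion fences of in-flight commands in order, and when the firmware queue is full, hand the oldest one back as a dependency instead of blocking.

```rust
use std::collections::VecDeque;

// Placeholder for the hw completion fence the run() callback returned.
struct Fence(u64);

struct FwQueue {
    slots: usize,               // firmware queue depth
    in_flight: VecDeque<Fence>, // fences of submitted jobs, oldest first
}

impl FwQueue {
    // What a prepare_job-style hook would return: None means "no
    // dependency, go ahead"; Some(fence) means "wait for this first".
    // Occasionally the returned fence may already be signaled, since
    // completion races with this check, which is fine.
    fn prepare(&self) -> Option<&Fence> {
        if self.in_flight.len() >= self.slots {
            self.in_flight.front() // oldest pending fence
        } else {
            None
        }
    }

    fn submit(&mut self, f: Fence) {
        self.in_flight.push_back(f);
    }

    fn complete_oldest(&mut self) -> Option<Fence> {
        self.in_flight.pop_front()
    }
}
```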
We would just need to adjust drm_sched_entity_add_dependency_cb() a bit because dependencies from the same queue are currently filtered out because it assumes a pipeline nature of submission (e.g. previous submissions are finished before new submissions start).
Actually that should be fine, because I'd be returning the underlying hardware completion fences (what the run() callback returns) which the driver owns, and wouldn't be recognized as belonging to the sched.
I actually know I have a different theoretical deadlock issue along these lines in the driver because right now we grab actually global resources (including a VMID) before job submission to drm_sched. This is a known issue, and to fix it without reducing performance I need to introduce some kind of "patching/fixup" system for firmware commands (because we need to inject those identifiers in dozens of places, but we don't want to construct those commands from scratch at job run time because that introduces latency at the wrong time and makes error handling/validation more complicated and error-prone), and that is exactly what should happen in prepare_job, as you say. And yes, at that point that should use fences to block when those resources are exhausted. But that's a different discussion we should have when reviewing the driver, it has nothing to do with the DRM abstractions nor the can_run_job callback I'm adding here nor the firmware queue length limit issue! (And also the global hardware devices are plentiful enough that I would be very surprised if anyone ever deadlocks it in practice even with the current code, so I honestly don't think that should be a blocker for driver submission either, I can and will fix it later...)
Well this is what I thought about those problems in amdgpu as well and it totally shipwrecked.
We still have memory allocations in the VMID code path which I'm still not sure how to remove.
We don't even have a shrinker yet, and I'm sure that's going to be a lot of fun when we add it too... but yes, if we can't do any memory allocations in some of these callbacks (is this documented anywhere?), that's going to be interesting...
It's not all bad news though! All memory allocations are fallible in kernel Rust (and therefore explicit, and also failures have to be explicitly handled or propagated), so it's pretty easy to point out where they are, and there are already discussions of higher-level tooling to enforce rules like that (and things like wait contexts). Also, Rust makes it a lot easier to refactor code in general and not be scared that you're going to regress everything, so I'm not really worried if I need to turn a chunk of the driver on its head to solve some of these problems in the future ^^ (I already did that when I switched it from the "demo" synchronous submission model to the proper explicit sync + fences one.)
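As a userspace illustration of the "allocations are fallible and explicit" point (std's try_reserve is the closest analogue to kernel Rust's fallible constructors; the helper name here is invented):

```rust
use std::collections::TryReserveError;

// The allocation failure is a value the caller must handle or
// propagate; nothing can silently panic or OOM-kill from here.
fn push_checked(v: &mut Vec<u32>, x: u32) -> Result<(), TryReserveError> {
    v.try_reserve(1)?; // the only point where allocation can fail
    v.push(x);         // capacity is guaranteed now, cannot reallocate
    Ok(())
}
```

Because every such call site returns a Result, grepping for where allocations (and thus potential reclaim entry points) live is mechanical.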
~~ Lina
On 08.03.23 17:44, Asahi Lina wrote:
On 09/03/2023 00.30, Christian König wrote:
On 08.03.23 15:53, Asahi Lina wrote:
[SNIP]
The background is that core memory management requires that signaling a fence only depends on signaling other fences and hardware progress and nothing else. Otherwise you immediately run into problems because of circular dependencies or what we call infinite fences.
And hardware progress is exactly the only dependency here...
Well then you should have a fence for that hardware progress.
I do, it's the prior job hardware completion fences that drm_sched already knows about!
Yes, I could return those in the prepare callback, it just means I need to start stashing fence references in the underlying firmware job queue command objects so I can find out what the oldest pending fence is, and return it when a queue is full. As long as drm_sched doesn't mind if I keep giving it fences (since multiple commands can have to complete before there is space) or the occasional already signaled fence (because this process is inherently racy), it should work fine.
Well this handling is intentional and necessary, but see below for a more in-depth explanation.
If you think this is the better way, I'll do it that way and drop this patch. It just seemed simpler to do it with another callback, since drm_sched is already tracking those fences and doing a hardware queue limit check anyway, and that way I can avoid tracking the fences down into the hardware queue code... *
Well it's not the better way, it's the only way that works.
I have to admit that my bet on your intentions was wrong, but even that use case doesn't work correctly.
See when your callback returns false it is perfectly possible that all hw fences are signaled between returning that information and processing it.
The result would be that the scheduler goes to sleep and never wakes up again.
That's why we have that rule that all dependencies need to be expressed by those dma_fence objects, because those are designed with such races in mind.
(But I still maintain what I'm trying to do here is entirely correct and deadlock-free! If you prefer I use prepare_job and return prior job fences from that instead, that's very different from NAKing the patch saying it's broken...)
As I said we exercised those ideas before and yes this approach here came up before as well and no it doesn't work.
* If you're wondering how the fences get signaled at all: callback closures that capture a reference to the fence when firmware commands are constructed and submitted. I know, I know, fancy Rust stuff... ^^ If you'd rather have me use the fences for the blocking, I'll probably just drop the signaling bit from the closures so we don't need to keep two redundant fence references in different places per command. I still need the closures for command completion processing though, since I use them to process statistics too...
I see that we have a disconnection here. As far as I can see you can use the can_run callback in only three ways:
- To check for some userspace dependency (we don't need to discuss that, it's evil and we both know it).
- You check for some hw resource availability, similar to VMID on amdgpu hw. This is what I think you do here (but I might be wrong).
It isn't... I agree, it would be problematic. It doesn't make any sense to check for global resources this way, not just because you might deadlock but also because there might be nothing to signal to the scheduler that a resource was freed at all once it is!
But this would be extremely problematic because you can then live lock. E.g. queue A keeps submitting jobs which take only a few resources and by doing so delays submitting jobs from queue B indefinitely.
This particular issue aside, fairness in global resource allocation is a conversation I'd love to have! Right now the driver doesn't try to ensure that, a queue can easily monopolize certain hardware resources (though one queue can only monopolize one of each, so you'd need something like 63 queues with 63 distinct VMs all submitting free-running jobs back to back in order to starve other queues of resources forever). For starters, one thing I'm thinking of doing is reserving certain subsets of hardware resources for queues with a given priority, so you can at least guarantee forward progress of higher-priority queues when faced with misbehaving lower-priority queues. But if we want to guarantee proper fairness, I think I'll have to start doing things like switching to a CPU-roundtrip submission model when resources become scarce (to guarantee that queues actually release the resources once in a while) and then figure out how to add fairness to the allocation code...
But let's have that conversation when we talk about the driver (or maybe on IRC or something?), right now I'm more interested in getting the abstractions reviewed ^^
Well that stuff is highly problematic as well. The fairness aside you risk starvation which in turn breaks the guarantee of forward progress.
In this particular case you can catch this with a timeout for the hw operation, but you should consider blocking that from the sw side as well.
- You have an intra-queue dependency, e.g. jobs which take X amount of resources, so you can submit only up to a specific limit. But in this case you should be able to return fences from the same queue as the dependency and won't need that callback.
Yes, I can do this. I can just do the same check can_run_job() does and if it fails, pick the oldest job in the full firmware queue and return its fence (it just means I need to keep track of those fences there, as I said above).
We would just need to adjust drm_sched_entity_add_dependency_cb() a bit because dependencies from the same queue are currently filtered out because it assumes a pipeline nature of submission (e.g. previous submissions are finished before new submissions start).
Actually that should be fine, because I'd be returning the underlying hardware completion fences (what the run() callback returns) which the driver owns, and wouldn't be recognized as belonging to the sched.
I actually know I have a different theoretical deadlock issue along these lines in the driver because right now we grab actually global resources (including a VMID) before job submission to drm_sched. This is a known issue, and to fix it without reducing performance I need to introduce some kind of "patching/fixup" system for firmware commands (because we need to inject those identifiers in dozens of places, but we don't want to construct those commands from scratch at job run time because that introduces latency at the wrong time and makes error handling/validation more complicated and error-prone), and that is exactly what should happen in prepare_job, as you say. And yes, at that point that should use fences to block when those resources are exhausted. But that's a different discussion we should have when reviewing the driver, it has nothing to do with the DRM abstractions nor the can_run_job callback I'm adding here nor the firmware queue length limit issue! (And also the global hardware devices are plentiful enough that I would be very surprised if anyone ever deadlocks it in practice even with the current code, so I honestly don't think that should be a blocker for driver submission either, I can and will fix it later...)
Well this is what I thought about those problems in amdgpu as well and it totally shipwrecked.
We still have memory allocations in the VMID code path which I'm still not sure how to remove.
We don't even have a shrinker yet, and I'm sure that's going to be a lot of fun when we add it too... but yes, if we can't do any memory allocations in some of these callbacks (is this documented anywhere?), that's going to be interesting...
Yes, that is all part of the dma_fence documentation. It's just absolutely not obvious what all this means.
It's not all bad news though! All memory allocations are fallible in kernel Rust (and therefore explicit, and also failures have to be explicitly handled or propagated), so it's pretty easy to point out where they are, and there are already discussions of higher-level tooling to enforce rules like that (and things like wait contexts). Also, Rust makes it a lot easier to refactor code in general and not be scared that you're going to regress everything, so I'm not really worried if I need to turn a chunk of the driver on its head to solve some of these problems in the future ^^ (I already did that when I switched it from the "demo" synchronous submission model to the proper explicit sync + fences one.)
Yeah, well the problem isn't that you run into memory allocation failure.
The problem is rather something like this:
1. You try to allocate memory to signal your fence.
2. This memory allocation can't be fulfilled and goes to sleep to wait for reclaim.
3. On another CPU, reclaim is running and, through the general purpose shrinker, page fault or MMU notifier, ends up waiting for your dma_fence.
You don't even need to implement the shrinker for this to go boom extremely easily.
So everything involved with signaling the fence can allocate memory only with GFP_ATOMIC and only if you absolutely have to.
Christian.
~~ Lina
On 09/03/2023 02.57, Christian König wrote:
On 08.03.23 17:44, Asahi Lina wrote:
On 09/03/2023 00.30, Christian König wrote:
On 08.03.23 15:53, Asahi Lina wrote:
[SNIP]
The background is that core memory management requires that signaling a fence only depends on signaling other fences and hardware progress and nothing else. Otherwise you immediately run into problems because of circular dependencies or what we call infinite fences.
And hardware progress is exactly the only dependency here...
Well then you should have a fence for that hardware progress.
I do, it's the prior job hardware completion fences that drm_sched already knows about!
Yes, I could return those in the prepare callback, it just means I need to start stashing fence references in the underlying firmware job queue command objects so I can find out what the oldest pending fence is, and return it when a queue is full. As long as drm_sched doesn't mind if I keep giving it fences (since multiple commands can have to complete before there is space) or the occasional already signaled fence (because this process is inherently racy), it should work fine.
Well this handling is intentional and necessary, but see below for a more in-depth explanation.
If you think this is the better way, I'll do it that way and drop this patch. It just seemed simpler to do it with another callback, since drm_sched is already tracking those fences and doing a hardware queue limit check anyway, and that way I can avoid tracking the fences down into the hardware queue code... *
Well it's not the better way, it's the only way that works.
I have to admit that my bet on your intentions was wrong, but even that use case doesn't work correctly.
See when your callback returns false it is perfectly possible that all hw fences are signaled between returning that information and processing it.
The result would be that the scheduler goes to sleep and never wakes up again.
That can't happen, because it will just go into another iteration of the drm_sched main loop since there is an entity available still.
Rather there is probably the opposite bug in this patch: the can_run_job logic should be moved into the wait_event_interruptible() condition check, otherwise I think it can end up busy-looping since the condition itself can be true even when the can_run_job check blocks it.
But there is no risk of it going to sleep and never waking up because job completions will wake up the waitqueue by definition, and that happens after the driver-side queues are popped. If this problem could happen, then the existing hw_submission_limit logic would be broken in the same way. It is logically equivalent in how it works.
Basically, if properly done in wait_event_interruptible, it is exactly the logic of that macro that prevents this race condition and makes everything work at all. Without it, drm_sched would be completely broken.
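To illustrate why the wait_event-style loop is race-free (a userspace model with std primitives, not the actual kernel macro or driver code; all names invented): the condition is re-evaluated under the lock before every sleep, so a wakeup that lands between "check" and "sleep" can never be lost.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Scheduler side: sleep until the submit condition holds. This mirrors
// wait_event_interruptible(): check the predicate under the lock, and
// only then atomically release the lock and sleep.
fn wait_until_ready(state: &(Mutex<bool>, Condvar)) {
    let (lock, cvar) = state;
    let mut ready = lock.lock().unwrap();
    while !*ready {
        // Atomically releases the lock and sleeps; no window where a
        // notify can slip through unseen.
        ready = cvar.wait(ready).unwrap();
    }
}

// Completion side: flip the condition under the lock, then wake. This
// is the "job completion wakes the waitqueue" step.
fn complete_job(state: &(Mutex<bool>, Condvar)) {
    *state.0.lock().unwrap() = true;
    state.1.notify_all();
}

// Spawn a "completion" and verify the waiter always wakes up,
// regardless of which side runs first.
fn demo() -> bool {
    let state = Arc::new((Mutex::new(false), Condvar::new()));
    let s2 = Arc::clone(&state);
    let t = thread::spawn(move || complete_job(&s2));
    wait_until_ready(&state);
    t.join().unwrap();
    true
}
```

A check done outside this loop (as in the patch as it stands) can observe a stale "false", but the next wakeup re-runs the loop, which is why the failure mode is a busy loop rather than a missed wakeup.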
As I said we exercised those ideas before and yes this approach here came up before as well and no it doesn't work.
It can never deadlock with this patch as it stands (though it could busy loop), and if properly moved into the wait_event_interruptible(), it would also never busy loop and work entirely as intended. The actual API change is sound.
I don't know why you're trying so hard to convince everyone that this approach is fundamentally broken... It might be a bad idea for other reasons, it might encourage incorrect usage, it might not be the best option, there are plenty of arguments you can make... but you just keep trying to make an argument that it just can't work at all for some reason. Why? I already said I'm happy dropping it in favor of the fences...
It's intended to mirror the hw_submission_limit logic. If you think this is broken, then that's broken too. They are equivalent mechanisms.
This particular issue aside, fairness in global resource allocation is a conversation I'd love to have! Right now the driver doesn't try to ensure that, a queue can easily monopolize certain hardware resources (though one queue can only monopolize one of each, so you'd need something like 63 queues with 63 distinct VMs all submitting free-running jobs back to back in order to starve other queues of resources forever). For starters, one thing I'm thinking of doing is reserving certain subsets of hardware resources for queues with a given priority, so you can at least guarantee forward progress of higher-priority queues when faced with misbehaving lower-priority queues. But if we want to guarantee proper fairness, I think I'll have to start doing things like switching to a CPU-roundtrip submission model when resources become scarce (to guarantee that queues actually release the resources once in a while) and then figure out how to add fairness to the allocation code...
But let's have that conversation when we talk about the driver (or maybe on IRC or something?), right now I'm more interested in getting the abstractions reviewed ^^
Well that stuff is highly problematic as well. The fairness aside you risk starvation which in turn breaks the guarantee of forward progress.
In this particular case you can catch this with a timeout for the hw operation, but you should consider blocking that from the sw side as well.
In the current state I actually think it's not really that problematic, because the resources are acquired directly in the ioctl path. So that can block if starved, but if that can cause overall forward progress to stop because some fence doesn't get signaled, then so can just not doing the ioctl in the first place, so there's not much point (userspace can always misbehave with its fence usage...). By the time anything gets submitted to drm_sched, the resources are already guaranteed to be acquired, we never block in the run callback.
It needs to be fixed of course, but if the threat model is a malicious GPU process, well, there are many other ways to DoS your system... and I don't think it's very likely that 63+ queues (which usually means 63+ processes with OpenGL) will end up accidentally starving the GPU in a tight loop at the same time. I'd love to hear about real-world scenarios where this kind of thing has been a real problem and not just a theoretical one though... maybe I'm missing something?
Basically my priorities with the driver are:
1. Make sure it never crashes.
2. Make sure it works well for real users.
3. Make it work smoothly for real users under reasonable load (priorities, CPU scheduler interactions, etc.).
4. Make it handle accidental problems more gracefully (OOMs etc.; I need to look into private GEM BO accounting to processes so the OOM killer has better data to work with).
5. Make it more robust against deliberate abuse/starvation (this should matter more once we have some kind of paravirtualization solution...).
And right now we're somewhere between 2 and 3. So if there are cases where this resource acquisition stuff can cause a problem for real users, I'll want to fix it earlier. But if this is more theoretical than anything (with the resource limits of AGX GPUs), I'd rather focus on things like memory accounting and shrinker support first.
We don't even have a shrinker yet, and I'm sure that's going to be a lot of fun when we add it too... but yes, if we can't do any memory allocations in some of these callbacks (is this documented anywhere?), that's going to be interesting...
Yes, that is all part of the dma_fence documentation. It's just absolutely not obvious what all this means.
I mean is there any documentation on how this interacts with drm_sched? Like, am I not allowed to allocate memory in prepare()? What about run()? What about GPU interrupt work? (not a raw IRQ handler context, I mean the execution path from GPU IRQ to drm_sched run() fences getting signaled)
It's not all bad news though! All memory allocations are fallible in kernel Rust (and therefore explicit, and also failures have to be explicitly handled or propagated), so it's pretty easy to point out where they are, and there are already discussions of higher-level tooling to enforce rules like that (and things like wait contexts). Also, Rust makes it a lot easier to refactor code in general and not be scared that you're going to regress everything, so I'm not really worried if I need to turn a chunk of the driver on its head to solve some of these problems in the future ^^ (I already did that when I switched it from the "demo" synchronous submission model to the proper explicit sync + fences one.)
Yeah, well the problem isn't that you run into memory allocation failure.
What I mean is that the mandatory failure handling means it's relatively easy to audit where memory allocations can actually happen.
The problem is rather something like this:
1. You try to allocate memory to signal your fence.
2. This memory allocation can't be fulfilled and goes to sleep to wait for reclaim.
3. On another CPU, reclaim is running and, through the general purpose shrinker, page fault or MMU notifier, ends up waiting for your dma_fence.
You don't even need to implement the shrinker for this to go boom extremely easily.
Hmm, can you actually get something waiting on a dma_fence like that today with this driver? We don't have a shrinker, we don't have synchronous page faults or MMU notifications for the GPU, and this is explicit sync so all in/out fences cross over into userspace so surely they can't be trusted anyway?
I'm definitely not familiar with the intricacies of DMA fences and how they interact with everything else yet, but it's starting to sound like either this isn't quite broken for our simple driver yet, or it must be broken pretty much everywhere in some way...
So everything involved with signaling the fence can allocate memory only with GFP_ATOMIC and only if you absolutely have to.
I don't think we even have a good story for passing around gfp_flags in Rust code so that will be interesting... though I need to actually audit the code paths and see how many allocations we really do. I know I alloc some vectors for holding completed commands and stuff like that, but I'm pretty sure I can fix that one with some reworking, and I'm not sure how many other random things there really are...? Obviously most allocations happen at command creation time, on completion you mostly get a lot of freeing, so maybe I can just eliminate all allocs and not worry about GFP_ATOMIC.
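To make the "just eliminate allocs on completion" idea concrete, here's the shape of what I mean in userspace Rust (invented names, just a sketch): reserve the bookkeeping capacity at submission time, so the completion handler only pushes into existing capacity and never touches the allocator.

```rust
// Completed-command records, with capacity reserved up front at
// queue-creation/submission time rather than in the completion path.
struct CompletionLog {
    depth: usize,      // maximum in-flight commands we reserved for
    entries: Vec<u64>, // completed command ids
}

impl CompletionLog {
    fn with_depth(depth: usize) -> Self {
        Self {
            depth,
            // The one and only allocation, done in a context where
            // sleeping for reclaim is fine.
            entries: Vec::with_capacity(depth),
        }
    }

    // Callable from the allocation-averse completion context: pushes
    // within reserved capacity never reallocate. Overflow beyond the
    // reserved depth is dropped here for simplicity.
    fn record(&mut self, id: u64) {
        if self.entries.len() < self.depth {
            self.entries.push(id);
        }
    }
}
```

With that pattern the completion path needs neither GFP_ATOMIC nor fallible allocation handling at all, which seems like the cleanest way out.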
~~ Lina
On 08.03.23 20:05, Asahi Lina wrote:
[SNIP]
Well it's not the better way, it's the only way that works.
I have to admit that my bet on your intentions was wrong, but even that use case doesn't work correctly.
See when your callback returns false it is perfectly possible that all hw fences are signaled between returning that information and processing it.
The result would be that the scheduler goes to sleep and never wakes up again.
That can't happen, because it will just go into another iteration of the drm_sched main loop since there is an entity available still.
Rather there is probably the opposite bug in this patch: the can_run_job logic should be moved into the wait_event_interruptible() condition check, otherwise I think it can end up busy-looping since the condition itself can be true even when the can_run_job check blocks it.
But there is no risk of it going to sleep and never waking up because job completions will wake up the waitqueue by definition, and that happens after the driver-side queues are popped. If this problem could happen, then the existing hw_submission_limit logic would be broken in the same way. It is logically equivalent in how it works.
Basically, if properly done in wait_event_interruptible, it is exactly the logic of that macro that prevents this race condition and makes everything work at all. Without it, drm_sched would be completely broken.
As I said we exercised those ideas before and yes this approach here came up before as well and no it doesn't work.
It can never deadlock with this patch as it stands (though it could busy loop), and if properly moved into the wait_event_interruptible(), it would also never busy loop and work entirely as intended. The actual API change is sound.
I don't know why you're trying so hard to convince everyone that this approach is fundamentally broken... It might be a bad idea for other reasons, it might encourage incorrect usage, it might not be the best option, there are plenty of arguments you can make... but you just keep trying to make an argument that it just can't work at all for some reason. Why? I already said I'm happy dropping it in favor of the fences...
Well because it is broken.
When you move the check into the wait_event_interruptible condition then who is going to call wait_event_interruptible when the condition changes?
As I said this idea came up before and was rejected multiple times.
Regards, Christian.
It's intended to mirror the hw_submission_limit logic. If you think this is broken, then that's broken too. They are equivalent mechanisms.
This particular issue aside, fairness in global resource allocation is a conversation I'd love to have! Right now the driver doesn't try to ensure that, a queue can easily monopolize certain hardware resources (though one queue can only monopolize one of each, so you'd need something like 63 queues with 63 distinct VMs all submitting free-running jobs back to back in order to starve other queues of resources forever). For starters, one thing I'm thinking of doing is reserving certain subsets of hardware resources for queues with a given priority, so you can at least guarantee forward progress of higher-priority queues when faced with misbehaving lower-priority queues. But if we want to guarantee proper fairness, I think I'll have to start doing things like switching to a CPU-roundtrip submission model when resources become scarce (to guarantee that queues actually release the resources once in a while) and then figure out how to add fairness to the allocation code...
But let's have that conversation when we talk about the driver (or maybe on IRC or something?), right now I'm more interested in getting the abstractions reviewed ^^
Well that stuff is highly problematic as well. The fairness aside you risk starvation which in turn breaks the guarantee of forward progress.
In this particular case you can catch this with a timeout for the hw operation, but you should consider blocking that from the sw side as well.
In the current state I actually think it's not really that problematic, because the resources are acquired directly in the ioctl path. So that can block if starved, but if that can cause overall forward progress to stop because some fence doesn't get signaled, then so can just not doing the ioctl in the first place, so there's not much point (userspace can always misbehave with its fence usage...). By the time anything gets submitted to drm_sched, the resources are already guaranteed to be acquired, we never block in the run callback.
It needs to be fixed of course, but if the threat model is a malicious GPU process, well, there are many other ways to DoS your system... and I don't think it's very likely that 63+ queues (which usually means 63+ processes with OpenGL) will end up accidentally starving the GPU in a tight loop at the same time. I'd love to hear about real-world scenarios where this kind of thing has been a real problem and not just a theoretical one though... maybe I'm missing something?
Basically my priorities with the driver are:
1. Make sure it never crashes.
2. Make sure it works well for real users.
3. Make it work smoothly for real users under reasonable load (priorities, CPU scheduler interactions, etc.).
4. Make it handle accidental problems more gracefully (OOMs etc.; I need to look into private GEM BO accounting to processes so the OOM killer has better data to work with).
5. Make it more robust against deliberate abuse/starvation (this should matter more once we have some kind of paravirtualization solution...).
And right now we're somewhere between 2 and 3. So if there are cases where this resource acquisition stuff can cause a problem for real users, I'll want to fix it earlier. But if this is more theoretical than anything (with the resource limits of AGX GPUs), I'd rather focus on things like memory accounting and shrinker support first.
We don't even have a shrinker yet, and I'm sure that's going to be a lot of fun when we add it too... but yes, if we can't do any memory allocations in some of these callbacks (is this documented anywhere?), that's going to be interesting...
Yes, that is all part of the dma_fence documentation. It's just absolutely not obvious what all this means.
I mean is there any documentation on how this interacts with drm_sched? Like, am I not allowed to allocate memory in prepare()? What about run()? What about GPU interrupt work? (not a raw IRQ handler context, I mean the execution path from GPU IRQ to drm_sched run() fences getting signaled)
It's not all bad news though! All memory allocations are fallible in kernel Rust (and therefore explicit, and also failures have to be explicitly handled or propagated), so it's pretty easy to point out where they are, and there are already discussions of higher-level tooling to enforce rules like that (and things like wait contexts). Also, Rust makes it a lot easier to refactor code in general and not be scared that you're going to regress everything, so I'm not really worried if I need to turn a chunk of the driver on its head to solve some of these problems in the future ^^ (I already did that when I switched it from the "demo" synchronous submission model to the proper explicit sync + fences one.)
Yeah, well the problem isn't that you run into memory allocation failure.
What I mean is that the mandatory failure handling means it's relatively easy to audit where memory allocations can actually happen.
The problem is rather something like this:
1. You try to allocate memory to signal your fence.
2. This memory allocation can't be fulfilled and goes to sleep to wait for reclaim.
3. On another CPU, reclaim is running and, through the general purpose shrinker, page fault or MMU notifier, ends up waiting for your dma_fence.
You don't even need to implement the shrinker for this to go boom extremely easily.
Hmm, can you actually get something waiting on a dma_fence like that today with this driver? We don't have a shrinker, we don't have synchronous page faults or MMU notifications for the GPU, and this is explicit sync so all in/out fences cross over into userspace so surely they can't be trusted anyway?
I'm definitely not familiar with the intricacies of DMA fences and how they interact with everything else yet, but it's starting to sound like either this isn't quite broken for our simple driver yet, or it must be broken pretty much everywhere in some way...
So everything involved with signaling the fence can allocate memory only with GFP_ATOMIC and only if you absolutely have to.
I don't think we even have a good story for passing around gfp_flags in Rust code so that will be interesting... though I need to actually audit the code paths and see how many allocations we really do. I know I alloc some vectors for holding completed commands and stuff like that, but I'm pretty sure I can fix that one with some reworking, and I'm not sure how many other random things there really are...? Obviously most allocations happen at command creation time, on completion you mostly get a lot of freeing, so maybe I can just eliminate all allocs and not worry about GFP_ATOMIC.
~~ Lina
On 09/03/2023 04.12, Christian König wrote:
On 08.03.23 20:05, Asahi Lina wrote:
[SNIP]
Well it's not the better way, it's the only way that works.
I have to admit that my bet on your intentions was wrong, but even that use case doesn't work correctly.
See when your callback returns false it is perfectly possible that all hw fences are signaled between returning that information and processing it.
The result would be that the scheduler goes to sleep and never wakes up again.
That can't happen, because it will just go into another iteration of the drm_sched main loop since there is an entity available still.
Rather there is probably the opposite bug in this patch: the can_run_job logic should be moved into the wait_event_interruptible() condition check, otherwise I think it can end up busy-looping since the condition itself can be true even when the can_run_job check blocks it.
But there is no risk of it going to sleep and never waking up because job completions will wake up the waitqueue by definition, and that happens after the driver-side queues are popped. If this problem could happen, then the existing hw_submission_limit logic would be broken in the same way. It is logically equivalent in how it works.
Basically, if properly done in wait_event_interruptible, it is exactly the logic of that macro that prevents this race condition and makes everything work at all. Without it, drm_sched would be completely broken.
As I said we exercised those ideas before and yes this approach here came up before as well and no it doesn't work.
It can never deadlock with this patch as it stands (though it could busy loop), and if properly moved into the wait_event_interruptible(), it would also never busy loop and work entirely as intended. The actual API change is sound.
I don't know why you're trying so hard to convince everyone that this approach is fundamentally broken... It might be a bad idea for other reasons, it might encourage incorrect usage, it might not be the best option, there are plenty of arguments you can make... but you just keep trying to make an argument that it just can't work at all for some reason. Why? I already said I'm happy dropping it in favor of the fences...
Well because it is broken.
When you move the check into the wait_event_interruptible condition then who is going to call wait_event_interruptible when the condition changes?
I think you mean wake_up_interruptible(). That would be drm_sched_job_done(), on the fence callback when a job completes, which as I keep saying is the same logic used for hw_rq_count/hw_submission_limit tracking.
Please think about it for a second, it's really not that complicated to see why it works:
- Driver pops off completed commands <-- can_run_job condition satisfied
- Driver signals fence
- drm_sched_job_done_cb()
- drm_sched_job_done()
- atomic_dec(&sched->hw_rq_count); <-- hw_submission_limit satisfied
- ...
- wake_up_interruptible(&sched->wake_up_worker);
  ^- happens after both conditions are potentially satisfied
It really is completely equivalent to just making the hw_rq_count logic customizable by the driver. The actual flow is the same. As long as the driver guarantees it satisfies the can_run_job() condition before signaling the completion fence that triggered that change, it works fine.
As I said this idea came up before and was rejected multiple times.
Maybe it was a different idea, or maybe it was rejected for other reasons, or maybe it was wrongly rejected for being broken when it isn't ^^
~~ Lina
Am 08.03.23 um 20:45 schrieb Asahi Lina:
On 09/03/2023 04.12, Christian König wrote:
Am 08.03.23 um 20:05 schrieb Asahi Lina:
[SNIP]
Well it's not the better way, it's the only way that works.
I have to admit that my bet on your intentions was wrong, but even that use case doesn't work correctly.
See when your callback returns false it is perfectly possible that all hw fences are signaled between returning that information and processing it.
The result would be that the scheduler goes to sleep and never wakes up again.
That can't happen, because it will just go into another iteration of the drm_sched main loop since there is an entity available still.
Rather there is probably the opposite bug in this patch: the can_run_job logic should be moved into the wait_event_interruptible() condition check, otherwise I think it can end up busy-looping since the condition itself can be true even when the can_run_job check blocks it.
But there is no risk of it going to sleep and never waking up because job completions will wake up the waitqueue by definition, and that happens after the driver-side queues are popped. If this problem could happen, then the existing hw_submission_limit logic would be broken in the same way. It is logically equivalent in how it works.
Basically, if properly done in wait_event_interruptible, it is exactly the logic of that macro that prevents this race condition and makes everything work at all. Without it, drm_sched would be completely broken.
As I said we exercised those ideas before and yes this approach here came up before as well and no it doesn't work.
It can never deadlock with this patch as it stands (though it could busy loop), and if properly moved into the wait_event_interruptible(), it would also never busy loop and work entirely as intended. The actual API change is sound.
I don't know why you're trying so hard to convince everyone that this approach is fundamentally broken... It might be a bad idea for other reasons, it might encourage incorrect usage, it might not be the best option, there are plenty of arguments you can make... but you just keep trying to make an argument that it just can't work at all for some reason. Why? I already said I'm happy dropping it in favor of the fences...
Well because it is broken.
When you move the check into the wait_event_interruptible condition then who is going to call wait_event_interruptible when the condition changes?
I think you mean wake_up_interruptible(). That would be drm_sched_job_done(), on the fence callback when a job completes, which as I keep saying is the same logic used for hw_rq_count/hw_submission_limit tracking.
As the documentation to wait_event says:
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.

So what you essentially try to do here is to skip that and say drm_sched_job_done() would call that anyway, but when you read any variable to determine that state then as far as I can see nothing is guaranteeing that order.
The only other possibility how you could use the callback correctly would be to call dma_fence_is_signaled() to query the state of your hw submission from the same fence which is then signaled. But then the question is once more why you don't give that fence directly to the scheduler?
Please think about it for a second,
Yeah, I'm trying to really follow your intentions here. But that doesn't really make sense.
Either you are trying to do something invalid or you are trying to circumvent the object model somehow and add a shortcut for the signaling API. Both would be more than fishy.
Regards, Christian.
it's really not that complicated to see why it works:
- Driver pops off completed commands <-- can_run_job condition satisfied
- Driver signals fence
- drm_sched_job_done_cb()
- drm_sched_job_done()
- atomic_dec(&sched->hw_rq_count); <-- hw_submission_limit satisfied
- ...
- wake_up_interruptible(&sched->wake_up_worker);
  ^- happens after both conditions are potentially satisfied
It really is completely equivalent to just making the hw_rq_count logic customizable by the driver. The actual flow is the same. As long as the driver guarantees it satisfies the can_run_job() condition before signaling the completion fence that triggered that change, it works fine.
As I said this idea came up before and was rejected multiple times.
Maybe it was a different idea, or maybe it was rejected for other reasons, or maybe it was wrongly rejected for being broken when it isn't ^^
~~ Lina
On 09/03/2023 05.14, Christian König wrote:
I think you mean wake_up_interruptible(). That would be drm_sched_job_done(), on the fence callback when a job completes, which as I keep saying is the same logic used for hw_rq_count/hw_submission_limit tracking.
As the documentation to wait_event says:
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.

So what you essentially try to do here is to skip that and say drm_sched_job_done() would call that anyway, but when you read any variable to determine that state then as far as I can see nothing is guaranteeing that order.
The driver needs to guarantee that any changes to that state precede a job completion fence signal of course, that's the entire idea of the API. It's supposed to represent a check for per-scheduler (or more specific, but not more global) resources that are released on job completion. Of course if you misuse the API you could cause a problem, but what I'm trying to say is that the API as designed and when used as intended does work properly.
Put another way: job completions always need to cause the sched main loop to run an iteration anyway (otherwise we wouldn't make forward progress), and job completions are exactly the signal that the can_run_job() condition may have changed.
The only other possibility how you could use the callback correctly would be to call dma_fence_is_signaled() to query the state of your hw submission from the same fence which is then signaled. But then the question is once more why you don't give that fence directly to the scheduler?
But the driver is supposed to guarantee that the ordering is always 1. resources freed, 2. fence signaled. So you don't need to check for the fence, you can just check for the resource state. If the callback returns false then by definition the fence wasn't yet signaled at some point during its execution (because the resources weren't yet freed), and since it would be in the wait_event_interruptible() check path, by definition the fence signaling at any point during or after the check would cause the thread to wake up again and re-check.
Thread 1                                      Thread 2
1. wait_event_interruptible() arms wq         1. Free resources
2. can_run_job() checks resources             2. Signal fence
3. wait_event_interruptible() sleeps on wq    3. Fence wakes up wq
4. loop
There is no possible interleaving of those sequences that leads to a lost event and the thread not waking up:
- If T2.3 happens before T1.1, that means T2.1 happened earlier and T1.2 must return true.
- If T2.3 happens after T1.1 but before T1.3, the wq code will ensure the wq does not sleep (or immediately wakes up) at T1.3 since it was signaled during the condition check, after the wq was armed. At the next check loop, T1.2 will then return true, since T2.1 already happened before T2.3.
- If T2.3 happens during T1.3, the wq wakes up normally and does another check, and at that point T1.2 returns true.
QED.
~~ Lina
Am 09.03.23 um 07:30 schrieb Asahi Lina:
On 09/03/2023 05.14, Christian König wrote:
I think you mean wake_up_interruptible(). That would be drm_sched_job_done(), on the fence callback when a job completes, which as I keep saying is the same logic used for hw_rq_count/hw_submission_limit tracking.
As the documentation to wait_event says:
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.

So what you essentially try to do here is to skip that and say drm_sched_job_done() would call that anyway, but when you read any variable to determine that state then as far as I can see nothing is guaranteeing that order.
The driver needs to guarantee that any changes to that state precede a job completion fence signal of course, that's the entire idea of the API. It's supposed to represent a check for per-scheduler (or more specific, but not more global) resources that are released on job completion. Of course if you misuse the API you could cause a problem, but what I'm trying to say is that the API as designed and when used as intended does work properly.
Put another way: job completions always need to cause the sched main loop to run an iteration anyway (otherwise we wouldn't make forward progress), and job completions are exactly the signal that the can_run_job() condition may have changed.
The only other possibility how you could use the callback correctly would be to call dma_fence_is_signaled() to query the state of your hw submission from the same fence which is then signaled. But then the question is once more why you don't give that fence directly to the scheduler?
But the driver is supposed to guarantee that the ordering is always 1. resources freed, 2. fence signaled. So you don't need to check for the fence, you can just check for the resource state.
Yeah, but this is exactly what the dma_fence framework tried to prevent. We try very hard to avoid such side channel signaling :)
But putting that issue aside for a moment. What I don't get is when you have such intra queue dependencies, then why can't you check that at a much higher level?
In other words even userspace should be able to predict that for its submissions X amount of resources are needed and that when all of my submissions run in parallel that won't work.
Asking the firmware for a status is usually magnitudes slower than just computing it before submission.
Regards, Christian.
If the callback returns false then by definition the fence wasn't yet signaled at some point during its execution (because the resources weren't yet freed), and since it would be in the wait_event_interruptible() check path, by definition the fence signaling at any point during or after the check would cause the thread to wake up again and re-check.
Thread 1                                      Thread 2
1. wait_event_interruptible() arms wq         1. Free resources
2. can_run_job() checks resources             2. Signal fence
3. wait_event_interruptible() sleeps on wq    3. Fence wakes up wq
4. loop
There is no possible interleaving of those sequences that leads to a lost event and the thread not waking up:
- If T2.3 happens before T1.1, that means T2.1 happened earlier and T1.2
must return true.
- If T2.3 happens after T1.1 but before T1.3, the wq code will ensure
the wq does not sleep (or immediately wakes up) at T1.3 since it was signaled during the condition check, after the wq was armed. At the next check loop, T1.2 will then return true, since T2.1 already happened before T2.3.
- If T2.3 happens during T1.3, the wq wakes up normally and does another
check, and at that point T1.2 returns true.
QED.
~~ Lina
On 09/03/2023 17.05, Christian König wrote:
Am 09.03.23 um 07:30 schrieb Asahi Lina:
On 09/03/2023 05.14, Christian König wrote:
I think you mean wake_up_interruptible(). That would be drm_sched_job_done(), on the fence callback when a job completes, which as I keep saying is the same logic used for hw_rq_count/hw_submission_limit tracking.
As the documentation to wait_event says:
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.

So what you essentially try to do here is to skip that and say drm_sched_job_done() would call that anyway, but when you read any variable to determine that state then as far as I can see nothing is guaranteeing that order.
The driver needs to guarantee that any changes to that state precede a job completion fence signal of course, that's the entire idea of the API. It's supposed to represent a check for per-scheduler (or more specific, but not more global) resources that are released on job completion. Of course if you misuse the API you could cause a problem, but what I'm trying to say is that the API as designed and when used as intended does work properly.
Put another way: job completions always need to cause the sched main loop to run an iteration anyway (otherwise we wouldn't make forward progress), and job completions are exactly the signal that the can_run_job() condition may have changed.
The only other possibility how you could use the callback correctly would be to call dma_fence_is_signaled() to query the state of your hw submission from the same fence which is then signaled. But then the question is once more why you don't give that fence directly to the scheduler?
But the driver is supposed to guarantee that the ordering is always 1. resources freed, 2. fence signaled. So you don't need to check for the fence, you can just check for the resource state.
Yeah, but this is exactly what the dma_fence framework tried to prevent. We try very hard to avoid such side channel signaling :)
Right, and it's fine, I can use the fences directly easily enough. I'm just trying to explain why my original idea works too, even if it's not the best solution for other reasons!
Of course I don't have the context of what other drivers are doing or did historically and what the pitfalls are, so I can't know what the "right" solution for any of this is in that context. I did my best to understand the drm_sched code and come up with a solution that works (which it does) without any more info. When I saw the hw submission limit stuff, I thought "okay, I need the same thing but with slightly more complex logic, so let's add a callback so the driver can customize it and do its own inflight counting".
After this discussion, I can see that this is equivalent to doing the same check in prepare_job() followed by returning the oldest running job's fence (as long as there's no race there... it should be fine if the fence reference is taken first, before the resource check, or if everything is done within the same critical section taking the firmware queue lock), so I'm happy to switch to that and drop this patch.
But keep in mind none of this is documented, and there's no way for us driver authors to understand what we're supposed to do without documentation. As I said I spent a long time trying to understand drm_sched, and then my original attempt missed the drm_sched_fini() issue with dangling jobs and Alyssa managed to hit an oops on the test branch, I guessed what the problem was from her trace, figured out a way to reproduce it (the kill-loop glmark2 thing), and fixed it in the next patch in this series. So even trying my best to figure out how to do this, reading the code and what scarce docs there are, I managed to miss something that caused a potential oops on the first try. If I can't even get the API usage right after spending hours on it trying really hard not to (because it's not just about my driver, I need the Rust abstraction to be safe for any driver), there's no way I'm going to divine what approaches to resource/dependency signaling are problematic/easy to abuse... the most I can hope for is "I got the wrapper right and the API/driver interaction is correct and guarantees forward progress if the driver follows the rules".
So when I submit something, and you reply with "Well complete NAK", that's just not nice. Honestly, I was kind of upset when I got that email. It sounded as if you were saying my solution was completely broken and couldn't work, but no matter how I looked at it I couldn't figure out how it's broken. And then it took several emails to even understand what you were suggesting with the prepare_job callback (and yes, that works too and is probably harder to abuse than a new callback). I'm trying really hard to make this all work and be correct, and of course I make mistakes too... but then I look at the code and no matter what I can come up with it seems to work and be correct, what am I supposed to do? I'm happy to learn and figure out better approaches for everything that lead to better drivers, but I need an actual explanation of the issues, not just a NAK...
I also would appreciate it if people give me the benefit of the doubt and let me explain what I'm doing and how I'm doing it and how this hardware works, because the whole thing is subtle to the core and very different to other GPUs. Honestly, I don't think any reviewer that hasn't spent hours poring over the driver/abstraction code could confidently say that a certain subtle sync issue exists at a first pass (other than for really obvious bad code sequences). I'm happy to look into issues and I definitely want to know what cases to look at and what to check for and fix anything we find... but isn't it better if we work together instead of shouting "this is broken" at the first hint of possible trouble?
But putting that issue aside for a moment. What I don't get is when you have such intra queue dependencies, then why can't you check that at a much higher level?
In other words even userspace should be able to predict that for its submissions X amount of resources are needed and that when all of my submissions run in parallel that won't work.
Technically yes, but we can't trust userspace to honor this, since overflowing the firmware queue breaks everything, so the kernel has to do the check... plus we're trying to insulate userspace from the details of how work is queued at the firmware. We need to support multiple firmware versions including future ones we can't predict yet without breaking UAPI, so the less the UAPI depends on firmware details, the better. That's why at the UAPI level, this is boiled down to a simpler "max commands per submission" limit that gets passed in the params struct, which is conservative, and then the kernel can deal with the actual in-flight count tracking and only submit things to the hardware when they fit.
In the future we could even support job splitting on the kernel side and remove the max commands per submission limit altogether (though it probably still makes sense to have for other reasons, like bounding how much kernel/firmware memory a single queue can consume, so I'm not sure this is even worth doing at all).
Asking the firmware for a status is usually magnitudes slower than just computing it before submission.
I'm not asking the firmware for status, I'm just asking my own firmware queue code how many slots are currently free in each backing queue. That's just based on internal driver state, there is no firmware round trip!
I could technically compute this before submission and figure out how much work has been queued and pre-populate fences that ensure we never exceed the max, but honestly that's a lot more code to track job sizes and I don't think it makes sense when I can just ask "Do we have space? No? Okay, return the oldest running job fence for now and try again when it completes" in prepare_job(). Maybe it's faster in pathological cases to do something fancier, but let's wait until Vulkan works and we can run real AAA games and see where the bottlenecks are before going down the optimization road ^^
~~ Lina
Jumping in here quick... (Sorry, I was out yesterday and was ignoring my e-mail on Tuesday so I could finally type some compiler code.)
On Thu, 2023-03-09 at 18:14 +0900, Asahi Lina wrote:
On 09/03/2023 17.05, Christian König wrote:
Am 09.03.23 um 07:30 schrieb Asahi Lina:
On 09/03/2023 05.14, Christian König wrote:
I think you mean wake_up_interruptible(). That would be drm_sched_job_done(), on the fence callback when a job completes, which as I keep saying is the same logic used for hw_rq_count/hw_submission_limit tracking.
As the documentation to wait_event says:
 * wake_up() has to be called after changing any variable that could
 * change the result of the wait condition.

So what you essentially try to do here is to skip that and say drm_sched_job_done() would call that anyway, but when you read any variable to determine that state then as far as I can see nothing is guaranteeing that order.
The driver needs to guarantee that any changes to that state precede a job completion fence signal of course, that's the entire idea of the API. It's supposed to represent a check for per-scheduler (or more specific, but not more global) resources that are released on job completion. Of course if you misuse the API you could cause a problem, but what I'm trying to say is that the API as designed and when used as intended does work properly.
Put another way: job completions always need to cause the sched main loop to run an iteration anyway (otherwise we wouldn't make forward progress), and job completions are exactly the signal that the can_run_job() condition may have changed.
The only other possibility how you could use the callback correctly would be to call dma_fence_is_signaled() to query the state of your hw submission from the same fence which is then signaled. But then the question is once more why you don't give that fence directly to the scheduler?
But the driver is supposed to guarantee that the ordering is always 1. resources freed, 2. fence signaled. So you don't need to check for the fence, you can just check for the resource state.
Yeah, but this is exactly what the dma_fence framework tried to prevent. We try very hard to avoid such side channel signaling :)
Right, and it's fine, I can use the fences directly easily enough. I'm just trying to explain why my original idea works too, even if it's not the best solution for other reasons!
Of course I don't have the context of what other drivers are doing or did historically and what the pitfalls are, so I can't know what the "right" solution for any of this is in that context. I did my best to understand the drm_sched code and come up with a solution that works (which it does) without any more info. When I saw the hw submission limit stuff, I thought "okay, I need the same thing but with slightly more complex logic, so let's add a callback so the driver can customize it and do its own inflight counting".
So, I think there's a difference here between "impossible to implement correctly", "likely to be implemented correctly", and "impossible to implement incorrectly". It's obviously possible to implement correctly. You can just always return true or do exactly the same check or do some simple thing where you can guarantee that it will only ever return false when there's a bunch of other stuff in the queue. That doesn't mean that it's likely to be implemented correctly by some other driver. Some idiot will come along and try to take advantage of it and cause themselves horrible problems.
And, to be clear, for the purposes of this discussion, we're ALL idiots, myself included. If there's one thing the DRM community has learned over the years, it's that drivers are so complex that we all turn into idiots at some point, relative to the complexity of the code and hardware behavior. That's why things like dma_fence are written so incredibly defensively and why we're so harsh about the rules. It's the rules and not our individual smarts that keep us from making mistakes. (Kinda like Rust, in a way.) So while I appreciate the frustration of "I'm just trying to do something that's clearly correct here", that doesn't mean that the next person to come by and fix a bug by tweaking that callback isn't going to screw it up irreparably. That person may even be you in 6 to 12 months after this e-mail thread is a distant memory.
So, yes, does the implementation you have today work without deadlocks or starvation? Maybe it does. I've not verified. Is the suggested callback a giant foot-gun in the already treacherous territory of scheduling and fencing? Yeah, it probably is and there's another way to implement the same behavior which is likely safer in the long run.
After this discussion, I can see that this is equivalent to doing the same check in prepare_job() followed by returning the oldest running job's fence (as long as there's no race there... it should be fine if the fence reference is taken first, before the resource check, or if everything is done within the same critical section taking the firmware queue lock), so I'm happy to switch to that and drop this patch.
But keep in mind none of this is documented, and there's no way for us driver authors to understand what we're supposed to do without documentation. As I said I spent a long time trying to understand drm_sched, and then my original attempt missed the drm_sched_fini() issue with dangling jobs and Alyssa managed to hit an oops on the test branch, I guessed what the problem was from her trace, figured out a way to reproduce it (the kill-loop glmark2 thing), and fixed it in the next patch in this series. So even trying my best to figure out how to do this, reading the code and what scarce docs there are, I managed to miss something that caused a potential oops on the first try. If I can't even get the API usage right after spending hours on it trying really hard not to (because it's not just about my driver, I need the Rust abstraction to be safe for any driver), there's no way I'm going to divine what approaches to resource/dependency signaling are problematic/easy to abuse... the most I can hope for is "I got the wrapper right and the API/driver interaction is correct and guarantees forward progress if the driver follows the rules".
Your frustration with the lack of good documentation in DRM is entirely justified. It's a mess and there's not a whole lot of people who understand all these subtleties. Connecting to the hive mind via e-mail and asking questions is the best you can do a lot of the time, I'm afraid. I wish we had better documentation for a lot of things and I'd be happy to see the situation improved but we've got a lot of debt there and not always a lot of time. (Yeah, I know, that's every senior engineer's excuse...) We really are trying to be better about it moving forward, though. Daniel has been pushing people to document things a lot more in recent years. But, yeah, lots of debt...
Also, in a weird way, I think these conversations are sometimes better than documentation. It took a while to get around to it all but there's a lot of context that was brought together in this e-mail thread that wouldn't have been in the docs no matter how good they are. A lot of it isn't an isolated thing that should clearly be explained in the run_job docs. It's subtle interactions which happen when all the pieces come together. I see this complaint a lot about Vulkan as well. There are behaviors which only become evident when you find the right 5 pieces of the spec and put them all together and squint. It'd be good to call those out sometimes but there's no way we can document all of them.
So when I submit something, and you reply with "Well complete NAK", that's just not nice. Honestly, I was kind of upset when I got that email. It sounded as if you were saying my solution was completely broken and couldn't work, but no matter how I looked at it I couldn't figure out how it's broken. And then it took several emails to even understand what you were suggesting with the prepare_job callback (and yes, that works too and is probably harder to abuse than a new callback). I'm trying really hard to make this all work and be correct, and of course I make mistakes too... but then I look at the code and no matter what I can come up with it seems to work and be correct, what am I supposed to do? I'm happy to learn and figure out better approaches for everything that lead to better drivers, but I need an actual explanation of the issues, not just a NAK...
I also would appreciate it if people give me the benefit of the doubt and let me explain what I'm doing and how I'm doing it and how this hardware works, because the whole thing is subtle to the core and very different to other GPUs. Honestly, I don't think any reviewer that hasn't spent hours poring over the driver/abstraction code could confidently say that a certain subtle sync issue exists at a first pass (other than for really obvious bad code sequences). I'm happy to look into issues and I definitely want to know what cases to look at and what to check for and fix anything we find... but isn't it better if we work together instead of shouting "this is broken" at the first hint of possible trouble?
Debating if I want to wade in on this one because this thread is already getting a bit warm and I don't want to make it worse. But, I'm an idiot, so...
Knowing what I do of both people in this thread, I think Christian is giving you more benefit of the doubt than you realize. Yes, his tone may be a bit abrupt but he continued to spend his time responding in detail to every question you raised. That means he was taking you seriously, even if he wasn't yielding ground.
Communication is hard, especially with all the different personalities, languages, and cultures involved in an international community like this. Sometimes the clarity of saying "no, this isn't going to work" up-front is necessary. Sometimes the person on the other end of the e-mail could benefit from a gentler response. It's hard to know from early interactions. Enough people have been wrong about dma_fence over the years (Hi! It's me!) that "no" is often the right starting position. 😭️ It doesn't always feel great to be on the receiving end of that but Christian is pretty much guarding a dragon cave, so...
To be clear, none of that is a defense of the toxicity for which the Linux community has gotten a reputation. A lot of subsystem maintainers have been known to start off with "no" to any idea they didn't already think of. That's bad. Generally, you shouldn't assume everyone but you is an idiot. When it comes to dma_fence, though, the assumption is that we're ALL idiots and the "No, seriously, don't go into the dragon cave. You won't come out alive. You're not that special." signs are justified. 😓️
I hope the context I'm providing here is helpful. If not, feel free to ignore me. It looks like you got the technical issues sorted.
~Faith
On 10/03/2023 03.50, Faith Ekstrand wrote:
Jumping in here quick... (Sorry, I was out yesterday and was ignoring my e-mail on Tuesday so I could finally type some compiler code.)
On Thu, 2023-03-09 at 18:14 +0900, Asahi Lina wrote:
On 09/03/2023 17.05, Christian König wrote:
Am 09.03.23 um 07:30 schrieb Asahi Lina:
On 09/03/2023 05.14, Christian König wrote:
I think you mean wake_up_interruptible(). That would be drm_sched_job_done(), on the fence callback when a job completes, which as I keep saying is the same logic used for hw_rq_count/hw_submission_limit tracking.
As the documentation to wait_event says:
* wake_up() has to be called after changing any variable that could
* change the result of the wait condition.
So what you essentially try to do here is to skip that and say drm_sched_job_done() would call that anyway, but when you read any variable to determine that state then as far as I can see nothing is guaranteeing that order.
The driver needs to guarantee that any changes to that state precede a job completion fence signal of course, that's the entire idea of the API. It's supposed to represent a check for per-scheduler (or more specific, but not more global) resources that are released on job completion. Of course if you misuse the API you could cause a problem, but what I'm trying to say is that the API as designed and when used as intended does work properly.
Put another way: job completions always need to cause the sched main loop to run an iteration anyway (otherwise we wouldn't make forward progress), and job completions are exactly the signal that the can_run_job() condition may have changed.
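Lina's ordering argument above can be sketched with a small standalone example (all names here — SchedState, LIMIT, the method names — are made up for illustration, not actual driver or scheduler code): as long as the completion path updates the per-scheduler state *before* the fence signals, the scheduler iteration that the signal triggers always sees the updated value.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical per-scheduler in-flight limit.
const LIMIT: usize = 4;

struct SchedState {
    in_flight: AtomicUsize,
}

impl SchedState {
    // Called when a job is handed to the hardware.
    fn job_submitted(&self) {
        self.in_flight.fetch_add(1, Ordering::Release);
    }

    // Called from the job-done path *before* the completion fence is
    // signaled, so the main-loop iteration the signal wakes up is
    // guaranteed to observe the decrement.
    fn job_done(&self) {
        self.in_flight.fetch_sub(1, Ordering::Release);
    }

    // The can_run_job-style check run by the scheduler loop.
    fn can_run_job(&self) -> bool {
        self.in_flight.load(Ordering::Acquire) < LIMIT
    }
}
```

The point of the sketch is only the ordering: if job_done() ran after the fence signal instead, the woken-up loop could see a stale count and stall.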
The only other possibility how you could use the callback correctly would be to call drm_fence_is_signaled() to query the state of your hw submission from the same fence which is then signaled. But then the question is once more why you don't give that fence directly to the scheduler?
But the driver is supposed to guarantee that the ordering is always 1. resources freed, 2. fence signaled. So you don't need to check for the fence, you can just check for the resource state.
Yeah, but this is exactly what the dma_fence framework tried to prevent. We try very hard to avoid such side channel signaling :)
Right, and it's fine, I can use the fences directly easily enough. I'm just trying to explain why my original idea works too, even if it's not the best solution for other reasons!
Of course I don't have the context of what other drivers are doing or did historically and what the pitfalls are, so I can't know what the "right" solution for any of this is in that context. I did my best to understand the drm_sched code and come up with a solution that works (which it does) without any more info. When I saw the hw submission limit stuff, I thought "okay, I need the same thing but with slightly more complex logic, so let's add a callback so the driver can customize it and do its own inflight counting".
So, I think there's a difference here between "impossible to implement correctly", "likely to be implemented correctly", and "impossible to implement incorrectly". It's obviously possible to implement correctly. You can just always return true or do exactly the same check or do some simple thing where you can guarantee that it will only ever return false when there's a bunch of other stuff in the queue. That doesn't mean that it's likely to be implemented correctly by some other driver. Some idiot will come along and try to take advantage of it and cause themselves horrible problems.
And, to be clear, for the purposes of this discussion, we're ALL idiots, myself included. If there's one thing the DRM community has learned over the years, it's that drivers are so complex that we all turn into idiots at some point, relative to the complexity of the code and hardware behavior. That's why things like dma_fence are written so incredibly defensively and why we're so harsh about the rules. It's the rules and not our individual smarts that keep us from making mistakes. (Kinda like Rust, in a way.) So while I appreciate the frustration of "I'm just trying to do something that's clearly correct here", that doesn't mean that the next person to come by and fix a bug by tweaking that callback isn't going to screw it up irreparably. That person may even be you in 6 to 12 months after this e-mail thread is a distant memory.
So, yes, does the implementation you have today work without deadlocks or starvation? Maybe it does. I've not verified. Is the suggested callback a giant foot-gun in the already treacherous territory of scheduling and fencing? Yeah, it probably is and there's another way to implement the same behavior which is likely safer in the long run.
I understand that... I just wish the response had been along the lines of "this is a huge footgun for these reasons, and you don't need it because you can do it this other way instead", not "the concept is completely broken, NAK".
If the discussion were phrased around how the API can be used and abused, then I can understand what the concern is. But it was somehow always about me and what I'm doing...
This is clearly going against the idea of having jobs only depend on fences and nothing else which is mandatory for correct memory management.
That implies what I'm doing breaks memory management (and that it is obvious).
And to make it clear this is unfortunately a complete NAK to this approach! You can't do this!
Again that I can't do it... and then we got an argument over whether the code is actually broken or not. But that doesn't even matter, since the issue is how easy the API is to use or misuse, not whether I actually misuse it...
I'll switch to prepare_job() fences for the next version, so it's not an issue. Using that didn't even cross my mind because, knowing nothing about the intended usage here, the prepare_job() callback docs are quite obtuse:
Called when the scheduler is considering scheduling this job next, to get another struct dma_fence for this job to block on. Once it returns NULL, run_job() may be called.

Can be NULL if no additional preparation to the dependencies are necessary. Skipped when jobs are killed instead of run.
What's a "dependency"? To me that sounded like execution dependencies, and we clearly express those in the jobs themselves ahead of time. But it turns out the purpose of this callback is to grab resources just in time before execution or block on them becoming available through a fence, and then it makes a lot more sense how to use it to do in-flight command count limiting.
Aside: now that I understand this, I'm tempted to make the Rust signature for this return a Result<(), Fence>. Returning a fence is essentially the "error" case here, and that means in the implementation you can just do:
if job.foo_res.is_none() {
    job.foo_res = Some(foo.get_resource()?);
}
if job.bar_res.is_none() {
    job.bar_res = Some(bar.get_resource()?);
}
As long as all the get_resource() calls return a Result<Resource, Fence>.
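A minimal standalone sketch of that Result-based pattern (Fence, Resource, Pool and the method names are made-up stand-ins, not actual DRM or driver types): each get_resource() either hands out a resource now or returns the fence to block on, and ? propagates that fence straight out of the prepare_job-style hook.

```rust
// Hypothetical stand-ins for a dma_fence and a driver resource.
struct Fence(u32);
struct Resource(u32);

// A pool that either has a free resource or knows which fence will
// signal when one is released.
struct Pool {
    available: Vec<Resource>,
    wait_fence: u32,
}

impl Pool {
    fn get_resource(&mut self) -> Result<Resource, Fence> {
        self.available.pop().ok_or(Fence(self.wait_fence))
    }
}

struct Job {
    foo_res: Option<Resource>,
    bar_res: Option<Resource>,
}

impl Job {
    // prepare_job-style hook: Ok(()) means ready to run; Err(fence)
    // tells the scheduler what to wait on before retrying.
    fn prepare(&mut self, foo: &mut Pool, bar: &mut Pool) -> Result<(), Fence> {
        if self.foo_res.is_none() {
            self.foo_res = Some(foo.get_resource()?);
        }
        if self.bar_res.is_none() {
            self.bar_res = Some(bar.get_resource()?);
        }
        Ok(())
    }
}
```

A nice property of the Option caching is that resources acquired on an earlier attempt stay stored in the job, so a retry after the fence signals only goes after what's still missing.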
There's even more undocumented subtlety here though, since as far as I can tell, if all the resources aren't always grabbed in the same order, or if more than one instance of a single resource is grabbed separately, you could deadlock or even livelock?
This is theoretical since right now I don't handle this properly at all other than the command count limit (I need the command struct fixup system for this to be reasonably possible), but for example, I'll need 1-3 event IDs per job, and if I grab them one by one, you could end up deadlocking with all event IDs used by jobs waiting for more. And if I don't store them eagerly (so drop the IDs if you can't get all of them), then you can end up with livelocks where every scheduler is grabbing an ID, then dropping it when we can't get another one, which signals a fence for another blocked scheduler to grab another ID, which then drops it because it can't get more, etc. So I probably need to grab a number of event IDs atomically.
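A sketch of that all-or-nothing reservation idea (the EventIds allocator and its API are hypothetical, not the driver's actual code): by taking all N IDs under one lock, or none at all, a job never sits on a partial set, which removes both the deadlock (holding some IDs while waiting for more) and the livelock (grab one, drop it, repeat) described above.

```rust
use std::collections::BTreeSet;
use std::sync::Mutex;

// Hypothetical event-ID allocator: a job needs 1-3 IDs at once.
struct EventIds {
    free: Mutex<BTreeSet<u32>>,
}

impl EventIds {
    // Atomically reserve n IDs, or nothing. Never leaves the caller
    // holding a partial set that other schedulers could wait on.
    fn try_reserve(&self, n: usize) -> Option<Vec<u32>> {
        let mut free = self.free.lock().unwrap();
        if free.len() < n {
            return None;
        }
        let ids: Vec<u32> = free.iter().take(n).copied().collect();
        for id in &ids {
            free.remove(id);
        }
        Some(ids)
    }

    // Return the whole set on job completion (before signaling the
    // completion fence, per the ordering discussed earlier).
    fn release(&self, ids: &[u32]) {
        let mut free = self.free.lock().unwrap();
        free.extend(ids.iter().copied());
    }
}
```

On failure the caller would block on whichever fence corresponds to the next job completion and retry, holding no IDs in the meantime.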
Also, in a weird way, I think these conversations are sometimes better than documentation. It took a while to get around to it all but there's a lot of context that was brought together in this e-mail thread that wouldn't have been in the docs no matter how good they are. A lot of it isn't an isolated thing that should clearly be explained in the run_job docs. It's subtle interactions which happen when all the pieces come together. I see this complaint a lot about Vulkan as well. There are behaviors which only become evident when you find the right 5 pieces of the spec and put them all together and squint. It'd be good to call those out sometimes but there's no way we can document all of them.
That's true, but I think we could improve things a lot even with just better docs and more hyperlinking between docs... For example, the GEM and DMA fence docs do have quite a bit of prose that gets you some context (even if it's a bit outdated and not complete). But drm_sched just has one paragraph and a list giving a high-level design, and then goes straight into function docs. It definitely takes putting together the sched, fence, dma_resv, etc. docs together to get the big picture, but if those docs all at least point at each other and are individually reasonably complete, then we'd have a chance ^^
~~ Lina
On Wed, Mar 8, 2023 at 9:46 AM Christian König christian.koenig@amd.com wrote:
Am 07.03.23 um 15:25 schrieb Asahi Lina:
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Well complete NAK.
There hasn't even been any kind of discussion yet you already come around with a "Well complete NAK"
First, this can be seen as rude behavior and me being part of the drm community I don't want to have to see this kind of thing.
Obviously, any kind of strong "technical" review point is a nak until people settle with an agreement on what to land, there is no point in pointing out a "NAK", especially if that's the first thing you say. If you want to express your strong disagreement with the proposed solution, then state what your pain points are directly.
If there is a long discussion and a maintainer feels it's going nowhere and no conclusion will be reached it might be this kind of "speaking with authority" point has to be made. But not as the starter into a discussion. This is unnecessarily hostile towards the contributor. And I wished we wouldn't have to see this kind of behavior here.
Yes, some kernel maintainers do this a lot, but kernel maintainers also have this kind of reputation and people don't want to have to deal with this nonsense and decide to not contribute at all. So please just drop this attitude.
This is clearly going against the idea of having jobs only depend on fences and nothing else which is mandatory for correct memory management.
I'm sure it's all documented and there is a design document on how things have to look like you can point out? Might help to get a better understanding on how things should be.
If the hw is busy with something you need to return the fence for this from the prepare_job callback so that the scheduler can be notified when the hw is available again.
Regards, Christian.
Signed-off-by: Asahi Lina lina@asahilina.net
 drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
 include/drm/gpu_scheduler.h            |  8 ++++++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 4e6ad6e122bc..5c0add2c7546 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
 		if (!entity)
 			continue;

+		if (sched->ops->can_run_job) {
+			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
+			if (!sched_job) {
+				complete_all(&entity->entity_idle);
+				continue;
+			}
+			if (!sched->ops->can_run_job(sched_job))
+				continue;
+		}
+
 		sched_job = drm_sched_entity_pop_job(entity);
 		if (!sched_job) {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9db9e5e504ee..bd89ea9507b9 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
 	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
 					 struct drm_sched_entity *s_entity);

+	/**
+	 * @can_run_job: Called before job execution to check whether the
+	 * hardware is free enough to run the job. This can be used to
+	 * implement more complex hardware resource policies than the
+	 * hw_submission limit.
+	 */
+	bool (*can_run_job)(struct drm_sched_job *sched_job);
+
 	/**
 	 * @run_job: Called to execute the job once all of the dependencies
 	 * have been resolved. This may be called multiple times, if
Am 08.03.23 um 13:39 schrieb Karol Herbst:
On Wed, Mar 8, 2023 at 9:46 AM Christian König christian.koenig@amd.com wrote:
Am 07.03.23 um 15:25 schrieb Asahi Lina:
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Well complete NAK.
There hasn't even been any kind of discussion yet you already come around with a "Well complete NAK"
First, this can be seen as rude behavior and me being part of the drm community I don't want to have to see this kind of thing.
Obviously, any kind of strong "technical" review point is a nak until people settle with an agreement on what to land, there is no point in pointing out a "NAK", especially if that's the first thing you say. If you want to express your strong disagreement with the proposed solution, then state what your pain points are directly.
If there is a long discussion and a maintainer feels it's going nowhere and no conclusion will be reached it might be this kind of "speaking with authority" point has to be made. But not as the starter into a discussion. This is unnecessarily hostile towards the contributor. And I wished we wouldn't have to see this kind of behavior here.
Yes, some kernel maintainers do this a lot, but kernel maintainers also have this kind of reputation and people don't want to have to deal with this nonsense and decide to not contribute at all. So please just drop this attitude.
Yes, you are completely right with that, but getting this message to the recipient is intentional on my side.
I give complete NAKs when the author of a patch has missed such a fundamental technical connection that further discussion doesn't make sense.
It's not meant to be in any way rude or offensive. I can put a smiley behind it if it somehow helps, but we still need a way to raise this big red stop sign.
This is clearly going against the idea of having jobs only depend on fences and nothing else which is mandatory for correct memory management.
I'm sure it's all documented and there is a design document on how things have to look like you can point out? Might help to get a better understanding on how things should be.
Yeah, that's the problematic part. We have documented this very extensively: https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-...
And both Jason and Daniel gave talks about the underlying problem and try to come up with patches to raise warnings when that happens, but people still keep coming up with the same idea over and over again.
It's just that the technical relationship between preventing jobs from running and with that preventing dma_fences from signaling and the core memory management with page faults and shrinkers waiting for those fences is absolutely not obvious.
We had at least 10 different teams from different companies falling into the same trap already, and either the patches were rejected out of hand or had to be painfully reverted or mitigated later on.
Regards, Christian.
On Wed, Mar 8, 2023 at 2:47 PM Christian König christian.koenig@amd.com wrote:
Am 08.03.23 um 13:39 schrieb Karol Herbst:
On Wed, Mar 8, 2023 at 9:46 AM Christian König christian.koenig@amd.com wrote:
Am 07.03.23 um 15:25 schrieb Asahi Lina:
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Well complete NAK.
There hasn't even been any kind of discussion yet you already come around with a "Well complete NAK"
First, this can be seen as rude behavior and me being part of the drm community I don't want to have to see this kind of thing.
Obviously, any kind of strong "technical" review point is a nak until people settle with an agreement on what to land, there is no point in pointing out a "NAK", especially if that's the first thing you say. If you want to express your strong disagreement with the proposed solution, then state what your pain points are directly.
If there is a long discussion and a maintainer feels it's going nowhere and no conclusion will be reached it might be this kind of "speaking with authority" point has to be made. But not as the starter into a discussion. This is unnecessarily hostile towards the contributor. And I wished we wouldn't have to see this kind of behavior here.
Yes, some kernel maintainers do this a lot, but kernel maintainers also have this kind of reputation and people don't want to have to deal with this nonsense and decide to not contribute at all. So please just drop this attitude.
Yes, you are completely right with that, but getting this message to the recipient is intentional on my side.
I give complete NAKs when the author of a patch has missed such a fundamental technical connection that further discussion doesn't make sense.
It's not meant to be in any way rude or offensive. I can put a smiley behind it if it somehow helps, but we still need a way to raise this big red stop sign.
"further"? There was no discussion at all, you just started off like that. If you think somebody misses that connection, you can point out to documentation/videos whatever so the contributor can understand what's wrong with an approach. You did that, so that's fine. It's just starting off _any_ discussion with a "Well complete NAK" is terrible style. I'd feel uncomfortable if that happened to me and I'm sure there are enough people like that that we should be more reasonable with our replies. Just.. don't.
We are all humans here and people react negatively to such things. And if people do it on purpose it just makes it worse.
This is clearly going against the idea of having jobs only depend on fences and nothing else which is mandatory for correct memory management.
I'm sure it's all documented and there is a design document on how things have to look like you can point out? Might help to get a better understanding on how things should be.
Yeah, that's the problematic part. We have documented this very extensively: https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-...
And both Jason and Daniel gave talks about the underlying problem and
fyi: s/Jason/Faith/g
try to come up with patches to raise warnings when that happens, but people still keep coming up with the same idea over and over again.
Yes, and we'll have to tell them over and over again. Nothing wrong with that. That's just part of maintaining such a big subsystem. And that's definitely not a valid reason to phrase things like above.
It's just that the technical relationship between preventing jobs from running and with that preventing dma_fences from signaling and the core memory management with page faults and shrinkers waiting for those fences is absolutely not obvious.
We had at least 10 different teams from different companies falling into the same trap already, and either the patches were rejected out of hand or had to be painfully reverted or mitigated later on.
Sure, but that's just part of the job. And pointing out fundamental mistakes early on is important, but the situation won't get any better by being like that. Yes, we'll have to repeat the same words over and over again, and yes that might be annoying, but that's just how it is.
Am 08.03.23 um 15:43 schrieb Karol Herbst:
[SNIP] "further"? There was no discussion at all,
Yeah, well that is exactly what I wanted to achieve.
you just started off like that. If you think somebody misses that connection, you can point to documentation/videos/whatever so the contributor can understand what's wrong with an approach. You did that, so that's fine. It's just that starting off _any_ discussion with a "Well complete NAK" is terrible style. I'd feel uncomfortable if that happened to me, and I'm sure there are enough people like that that we should be more reasonable with our replies. Just... don't.
We are all humans here and people react negatively to such things. And if people do it on purpose it just makes it worse.
I completely see your point, I just don't know how to improve it.
I don't stop people like this because I want to make them uncomfortable but because I want to prevent further discussions on that topic.
In other words how can I make people notice that this is something fundamental while still being polite?
This is clearly going against the idea of having jobs only depend on fences and nothing else which is mandatory for correct memory management.
I'm sure it's all documented and there is a design document on how things have to look like you can point out? Might help to get a better understanding on how things should be.
Yeah, that's the problematic part. We have documented this very extensively: https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-...
And both Jason and Daniel gave talks about the underlying problem and
fyi: s/Jason/Faith/g
+1. I wasn't aware of that.
try to come up with patches to raise warnings when that happens, but people still keep coming up with the same idea over and over again.
Yes, and we'll have to tell them over and over again. Nothing wrong with that. That's just part of maintaining such a big subsystem. And that's definitely not a valid reason to phrase things like above.
It's just that the technical relationship between preventing jobs from running and with that preventing dma_fences from signaling and the core memory management with page faults and shrinkers waiting for those fences is absolutely not obvious.
We had at least 10 different teams from different companies falling into the same trap already, and either the patches were rejected out of hand or had to be painfully reverted or mitigated later on.
Sure, but that's just part of the job. And pointing out fundamental mistakes early on is important, but the situation won't get any better by being like that. Yes, we'll have to repeat the same words over and over again, and yes that might be annoying, but that's just how it is.
Well I have no problem explaining people why a solution doesn't work.
But what usually happens is that people don't realize that they need to back off from a design and completely start over.
Regards, Christian.
On Wed, Mar 8, 2023 at 4:09 PM Christian König christian.koenig@amd.com wrote:
Am 08.03.23 um 15:43 schrieb Karol Herbst:
[SNIP] "further"? There was no discussion at all,
Yeah, well that is exactly what I wanted to achieve.
you just started off like that. If you think somebody misses that connection, you can point to documentation/videos/whatever so the contributor can understand what's wrong with an approach. You did that, so that's fine. It's just that starting off _any_ discussion with a "Well complete NAK" is terrible style. I'd feel uncomfortable if that happened to me, and I'm sure there are enough people like that that we should be more reasonable with our replies. Just... don't.
We are all humans here and people react negatively to such things. And if people do it on purpose it just makes it worse.
I completely see your point, I just don't know how to improve it.
I don't stop people like this because I want to make them uncomfortable but because I want to prevent further discussions on that topic.
In other words how can I make people notice that this is something fundamental while still being polite?
I think a little improvement over this would be to at least wait a few replies before resorting to such strong statements, at least until the discussion risks becoming a pure waste of time.
This is clearly going against the idea of having jobs only depend on fences and nothing else which is mandatory for correct memory management.
I'm sure it's all documented and there is a design document on how things have to look like you can point out? Might help to get a better understanding on how things should be.
Yeah, that's the problematic part. We have documented this very extensively: https://www.kernel.org/doc/html/v5.9/driver-api/dma-buf.html#indefinite-dma-...
And both Jason and Daniel gave talks about the underlying problem and
fyi: s/Jason/Faith/g
+1. I wasn't aware of that.
try to come up with patches to raise warnings when that happens, but people still keep coming up with the same idea over and over again.
Yes, and we'll have to tell them over and over again. Nothing wrong with that. That's just part of maintaining such a big subsystem. And that's definitely not a valid reason to phrase things like above.
It's just that the technical relationship between preventing jobs from running and with that preventing dma_fences from signaling and the core memory management with page faults and shrinkers waiting for those fences is absolutely not obvious.
We had at least 10 different teams from different companies falling into the same trap already and either the patches were rejected of hand or had to painfully reverted or mitigated later on.
Sure, but that's just part of the job. And pointing out fundamental mistakes early on is important, but the situation won't get any better by being like that. Yes, we'll have to repeat the same words over and over again, and yes that might be annoying, but that's just how it is.
Well I have no problem explaining people why a solution doesn't work.
But what usually happens is that people don't realize that they need to back off from a design and completely start over.
Regards, Christian.
Regards, Christian.
If the hw is busy with something you need to return the fence for this from the prepare_job callback so that the scheduler can be notified when the hw is available again.
Regards, Christian.
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_main.c | 10 ++++++++++
 include/drm/gpu_scheduler.h            |  8 ++++++++
 2 files changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 4e6ad6e122bc..5c0add2c7546 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1001,6 +1001,16 @@ static int drm_sched_main(void *param)
 		if (!entity)
 			continue;
 
+		if (sched->ops->can_run_job) {
+			sched_job = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
+			if (!sched_job) {
+				complete_all(&entity->entity_idle);
+				continue;
+			}
+			if (!sched->ops->can_run_job(sched_job))
+				continue;
+		}
+
 		sched_job = drm_sched_entity_pop_job(entity);
 
 		if (!sched_job) {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9db9e5e504ee..bd89ea9507b9 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -396,6 +396,14 @@ struct drm_sched_backend_ops {
 	struct dma_fence *(*prepare_job)(struct drm_sched_job *sched_job,
 					 struct drm_sched_entity *s_entity);
 
+	/**
+	 * @can_run_job: Called before job execution to check whether the
+	 * hardware is free enough to run the job. This can be used to
+	 * implement more complex hardware resource policies than the
+	 * hw_submission limit.
+	 */
+	bool (*can_run_job)(struct drm_sched_job *sched_job);
+
 	/**
 	 * @run_job: Called to execute the job once all of the dependencies
 	 * have been resolved. This may be called multiple times, if
On Wed, Mar 08, 2023 at 04:19:17PM +0100, Karol Herbst wrote:
On Wed, Mar 8, 2023 at 4:09 PM Christian König christian.koenig@amd.com wrote:
Am 08.03.23 um 15:43 schrieb Karol Herbst:
[SNIP] "further"? There was no discussion at all,
Yeah, well that is exactly what I wanted to achieve.
you just started off like that. If you think somebody misses that connection, you can point out to documentation/videos whatever so the contributor can understand what's wrong with an approach. You did that, so that's fine. It's just starting off _any_ discussion with a "Well complete NAK" is terrible style. I'd feel uncomfortable if that happened to me and I'm sure there are enough people like that that we should be more reasonable with our replies. Just.. don't.
We are all humans here and people react negatively to such things. And if people do it on purpose it just makes it worse.
I completely see your point, I just don't know how to improve it.
I don't stop people like this because I want to make them uncomfortable but because I want to prevent further discussions on that topic.
In other words how can I make people notice that this is something fundamental while still being polite?
Ask them to improve the docs. Gets them on board, and for bonus points you:
- can check they actually get it when you review the doc patch
- get scheduler docs for free
- have an easily pasteable link for next time around, instead of just an aggressive NAK that helps no one really (aside from getting people boiling).
It's not really about being polite, but about making sure that efficient communication happens and that you don't have to repeat yourself. In rare cases you get to type the docs yourself when people are too dense to learn (like what I had to do with the various dma_fence docs).
I think a little improvement over this would be to at least wait a few replies before resorting to those strong statements. Just before it becomes a risk in just wasting time.
See above for what I'm trying to do: get the message to sink in as either a proper doc patch, or by linking to the doc patch for next time around (because let's face it, this entire concept of "dma_fence committed for execution" is extremely tricky; there will be repetitions of this question until we've sunset dma_fence, which is probably decades away).
If the learning does not happen, then it's the time to whack the big hammer (and if people don't get it, you can escalate to Dave&me, we have tools to make sure people get the message). But this really should be the end, not the start of the escalation chain :-)
Cheers, Daniel
On Tue, Mar 07, 2023 at 11:25:35PM +0900, Asahi Lina wrote:
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Signed-off-by: Asahi Lina lina@asahilina.net
Ok scheduler rules, or trying to summarize the entire discussion:
dma_fence rules are very tricky. The two main chapters in the docs are
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_b... https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_b...
Unfortunately I don't think it's possible to check this at compile time; thus far all we can do is validate at runtime. I've posted two patches for this:
https://lore.kernel.org/dri-devel/20201023122216.2373294-17-daniel.vetter@ff... https://lore.kernel.org/dri-devel/20201023122216.2373294-20-daniel.vetter@ff...
Unfortunately most drivers are buggy and get this completely wrong, so realistically we'd need to make this a per-driver opt-out and annotate all current drivers. Well except amdgpu is correct by now I think (they'd still need to test that). And Rob Clark is working on patches to fix up msm.
I think best here is if you work together with Rob to make sure these annotations are mandatory for any rust drivers (I don't want new buggy drivers at least). Would also be great to improve the kerneldoc for all the driver hooks to explain these restrictions and link to the relevant kerneldocs (there's also one for the dma_fence signalling annotations which might be worth linking too).
I don't see any way to make this explicit in rust types, it's really only something runtime tests (using lockdep) can catch. Somewhat disappointing.
For the other things discussed here:
- Option<Dma_Fence> as the return value for ->prepare_job makes sense to me.
- I don't see any way a driver can use ->can_run_job without breaking the above rules, that really doesn't sound like a good idea to me.
Cheers, Daniel
Am 05.04.23 um 15:40 schrieb Daniel Vetter:
On Tue, Mar 07, 2023 at 11:25:35PM +0900, Asahi Lina wrote:
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Signed-off-by: Asahi Lina lina@asahilina.net
Ok scheduler rules, or trying to summarize the entire discussion:
dma_fence rules are very tricky. The two main chapters in the docs are
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_b... https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_b...
Unfortunately I don't think it's possible to check this at compile time; thus far all we can do is validate at runtime. I've posted two patches for this:
https://lore.kernel.org/dri-devel/20201023122216.2373294-17-daniel.vetter@ff... https://lore.kernel.org/dri-devel/20201023122216.2373294-20-daniel.vetter@ff...
Unfortunately most drivers are buggy and get this completely wrong, so realistically we'd need to make this a per-driver opt-out and annotate all current drivers. Well except amdgpu is correct by now I think (they'd still need to test that).
There is still one potential memory allocation in the run_job callback in amdgpu which I wasn't able to fix yet.
But that one is purely academic and could potentially be trivially replaced with using GFP_ATOMIC if we ever have to.
Christian.
And Rob Clark is working on patches to fix up msm.
I think best here is if you work together with Rob to make sure these annotations are mandatory for any rust drivers (I don't want new buggy drivers at least). Would also be great to improve the kerneldoc for all the driver hooks to explain these restrictions and link to the relevant kerneldocs (there's also one for the dma_fence signalling annotations which might be worth linking too).
I don't see any way to make this explicit in rust types, it's really only something runtime tests (using lockdep) can catch. Somewhat disappointing.
For the other things discussed here:
Option<Dma_Fence> as the return value for ->prepare_job makes sense to me.
I don't see any way a driver can use ->can_run_job without breaking the above rules, that really doesn't sound like a good idea to me.
Cheers, Daniel
On Wed, Apr 05, 2023 at 04:14:11PM +0200, Christian König wrote:
Am 05.04.23 um 15:40 schrieb Daniel Vetter:
On Tue, Mar 07, 2023 at 11:25:35PM +0900, Asahi Lina wrote:
Some hardware may require more complex resource utilization accounting than the simple job count supported by drm_sched internally. Add a can_run_job callback to allow drivers to implement more logic before deciding whether to run a GPU job.
Signed-off-by: Asahi Lina lina@asahilina.net
Ok scheduler rules, or trying to summarize the entire discussion:
dma_fence rules are very tricky. The two main chapters in the docs are
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_b... https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_b...
Unfortunately I don't think it's possible to check this at compile time; thus far all we can do is validate at runtime. I've posted two patches for this:
https://lore.kernel.org/dri-devel/20201023122216.2373294-17-daniel.vetter@ff... https://lore.kernel.org/dri-devel/20201023122216.2373294-20-daniel.vetter@ff...
Unfortunately most drivers are buggy and get this completely wrong, so realistically we'd need to make this a per-driver opt-out and annotate all current drivers. Well except amdgpu is correct by now I think (they'd still need to test that).
There is still one potential memory allocation in the run_job callback in amdgpu which I wasn't able to fix yet.
But that one is purely academic and could potentially be trivially replaced with using GFP_ATOMIC if we ever have to.
I think the modeset in the tdr code was more scary, and I'm not sure you really managed to get rid of absolutely everything in there yet. -Daniel
Christian.
And Rob Clark is working on patches to fix up msm.
I think best here is if you work together with Rob to make sure these annotations are mandatory for any rust drivers (I don't want new buggy drivers at least). Would also be great to improve the kerneldoc for all the driver hooks to explain these restrictions and link to the relevant kerneldocs (there's also one for the dma_fence signalling annotations which might be worth linking too).
I don't see any way to make this explicit in rust types, it's really only something runtime tests (using lockdep) can catch. Somewhat disappointing.
For the other things discussed here:
Option<Dma_Fence> as the return value for ->prepare_job makes sense to me.
I don't see any way a driver can use ->can_run_job without breaking the above rules, that really doesn't sound like a good idea to me.
Cheers, Daniel
drm_sched_fini() currently leaves any pending jobs dangling, which causes segfaults and other badness when job completion fences are signaled after the scheduler is torn down.
Explicitly detach all jobs from their completion callbacks and free them. This makes it possible to write a sensible safe abstraction for drm_sched, without having to externally duplicate the tracking of in-flight jobs.
This shouldn't regress any existing drivers, since calling drm_sched_fini() with any pending jobs is broken and this change should be a no-op if there are no pending jobs.
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5c0add2c7546..0aab1e0aebdd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
+	struct drm_sched_job *s_job, *tmp;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	if (!sched->thread)
+		return;
+
+	/*
+	 * Stop the scheduler, detaching all jobs from their hardware callbacks
+	 * and cleaning up complete jobs.
+	 */
+	drm_sched_stop(sched, NULL);
+
+	/*
+	 * Iterate through the pending job list and free all jobs.
+	 * This assumes the driver has either guaranteed jobs are already stopped, or that
+	 * otherwise it is responsible for keeping any necessary data structures for
+	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
+	 * putting them in its own queue or doing its own refcounting).
+	 */
+	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&s_job->list);
+		spin_unlock(&sched->job_list_lock);
+		sched->ops->free_job(s_job);
+	}
+
+	kthread_stop(sched->thread);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
On 2023-03-07 15:25, Asahi Lina wrote:
drm_sched_fini() currently leaves any pending jobs dangling, which causes segfaults and other badness when job completion fences are signaled after the scheduler is torn down.
Explicitly detach all jobs from their completion callbacks and free them. This makes it possible to write a sensible safe abstraction for drm_sched, without having to externally duplicate the tracking of in-flight jobs.
This shouldn't regress any existing drivers, since calling drm_sched_fini() with any pending jobs is broken and this change should be a no-op if there are no pending jobs.
Signed-off-by: Asahi Lina lina@asahilina.net
 drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5c0add2c7546..0aab1e0aebdd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
+	struct drm_sched_job *s_job, *tmp;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	if (!sched->thread)
+		return;
+
+	/*
+	 * Stop the scheduler, detaching all jobs from their hardware callbacks
+	 * and cleaning up complete jobs.
+	 */
+	drm_sched_stop(sched, NULL);
+
+	/*
+	 * Iterate through the pending job list and free all jobs.
+	 * This assumes the driver has either guaranteed jobs are already stopped, or that
+	 * otherwise it is responsible for keeping any necessary data structures for
+	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
+	 * putting them in its own queue or doing its own refcounting).
+	 */
+	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&s_job->list);
+		spin_unlock(&sched->job_list_lock);
+		sched->ops->free_job(s_job);
+	}

I would stop the kthread first, then delete all jobs without spinlock since nothing else can race against sched_fini?

If you do need the spinlock, it would need to guard list_for_each_entry too.

+
+	kthread_stop(sched->thread);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
On 2023-03-07 15:25, Asahi Lina wrote:
drm_sched_fini() currently leaves any pending jobs dangling, which causes segfaults and other badness when job completion fences are signaled after the scheduler is torn down.
Explicitly detach all jobs from their completion callbacks and free them. This makes it possible to write a sensible safe abstraction for drm_sched, without having to externally duplicate the tracking of in-flight jobs.
This shouldn't regress any existing drivers, since calling drm_sched_fini() with any pending jobs is broken and this change should be a no-op if there are no pending jobs.
Signed-off-by: Asahi Lina lina@asahilina.net
I would stop the kthread first, then delete all jobs without spinlock since nothing else can race against sched_fini?
If you do need the spinlock, It would need to guard list_for_each_entry too.
Well this case here actually should not happen in the first place.
Jobs depend on their device, so as long as there are jobs there should also be a reference to the scheduler.
What could be is that you have allocated a scheduler instance dynamically, but even then you should first tear down all entities and then the scheduler.
Regards, Christian.
On 08/03/2023 19.03, Christian König wrote:
Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
On 2023-03-07 15:25, Asahi Lina wrote:
drm_sched_fini() currently leaves any pending jobs dangling, which causes segfaults and other badness when job completion fences are signaled after the scheduler is torn down.
Explicitly detach all jobs from their completion callbacks and free them. This makes it possible to write a sensible safe abstraction for drm_sched, without having to externally duplicate the tracking of in-flight jobs.
This shouldn't regress any existing drivers, since calling drm_sched_fini() with any pending jobs is broken and this change should be a no-op if there are no pending jobs.
Signed-off-by: Asahi Lina lina@asahilina.net
I would stop the kthread first, then delete all jobs without spinlock since nothing else can race against sched_fini?
If you do need the spinlock, It would need to guard list_for_each_entry too.
Well this case here actually should not happen in the first place.
"This should not happen in the first place" is how you end up with C APIs that have corner cases that lead to kernel oopses...
The idea with Rust abstractions is that it needs to be actually impossible to create memory safety problems for the user of the abstraction, you can't impose arbitrary constraints like "you must wait for all jobs to finish before destroying the scheduler"... it needs to be intrinsically safe.
Jobs depend on their device, so as long as there are jobs there should also be a reference to the scheduler.
These schedulers are created dynamically per userspace queue. The memory management and reference counting involved make it safe to destroy the scheduler even when behind the scenes hardware jobs are still running, as long as drm_sched itself doesn't crash on fences firing without a scheduler (which is what this patch fixes).
This is the power of Rust: it forces you to architect your code in a way that you don't have complex high-level dependencies that span the entire driver and are difficult to prove hold. In my driver, you can kill a process and that destroys the drm_sched, closes all GEM objects, everything, even if the GPU is still running jobs from that process. The worst that can happen is that the GPU faults as in-use userspace buffers are unmapped out from under the running user job, but that's fine (GPU faults are recoverable). The actual firmware resources, queues, etc. in use are all kept alive until the commands finish executing (or fault, which is just an abnormal completion), even if the userspace process that owned them is long gone. I've tested this extensively by doing things like large-resolution glmark runs in a loop that get `kill -9`'d repeatedly, and it works very well! Tons of GPU faults but no firmware crashes, no oopses, nothing. And the firmware *will* crash irrecoverably if anything goes wrong with its shared memory structures, so that it doesn't is pretty good evidence that all this works!
What could be is that you have allocated a scheduler instance dynamically, but even then you should first tear down all entities and then the scheduler.
This is about creating a safe Rust abstraction, so we can't impose requirements on users like that, the abstraction has to take care of it. Unfortunately, the jobs cannot depend on the scheduler at the abstraction level. I tried that (putting a reference counted reference to the scheduler in the job abstraction), but it doesn't work because a job completing can end up dropping the last reference to the scheduler, and then you end up trying to stop and clean up the scheduler from a callback called from the scheduler kthread itself, which deadlocks. We could throw those cleanups into a workqueue or something, but that's just adding bandages around the problem that the drm_sched interface today is just not safe without this patch...
Right now, it is not possible to create a safe Rust abstraction for drm_sched without doing something like duplicating all job tracking in the abstraction, or the above backreference + deferred cleanup mess, or something equally silly. So let's just fix the C side please ^^
So far, drm_sched is the only DRM API that has had such a fundamental API safety issue that I had to make a change like this to the C to make the Rust abstraction possible/reasonable... drm_sched has also been by far the hardest DRM component API to understand from a safety point of view, with the most inconsistent documentation about what the ownership/freeing rules are, and what objects need to outlive what other objects (I had to just read the code to figure most of this out). That's also one nice outcome of writing Rust abstractions: it forces us to make all these rules and invariants explicit, instead of leaving them as unwritten assumptions (almost nobody consistently documents this in C APIs...).
If I got it right, anyone using the Rust drm_sched abstraction doesn't have to worry about this any more because if they do something that would oops with it, their code won't compile. But I need this patch to be able to make that guarantee...
~~ Lina
Am 08.03.23 um 16:18 schrieb Asahi Lina:
On 08/03/2023 19.03, Christian König wrote:
Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
On 2023-03-07 15:25, Asahi Lina wrote:
drm_sched_fini() currently leaves any pending jobs dangling, which causes segfaults and other badness when job completion fences are signaled after the scheduler is torn down.
Explicitly detach all jobs from their completion callbacks and free them. This makes it possible to write a sensible safe abstraction for drm_sched, without having to externally duplicate the tracking of in-flight jobs.
This shouldn't regress any existing drivers, since calling drm_sched_fini() with any pending jobs is broken and this change should be a no-op if there are no pending jobs.
Signed-off-by: Asahi Lina <lina@asahilina.net>
 drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5c0add2c7546..0aab1e0aebdd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
+	struct drm_sched_job *s_job, *tmp;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	if (!sched->thread)
+		return;
+
+	/*
+	 * Stop the scheduler, detaching all jobs from their hardware callbacks
+	 * and cleaning up complete jobs.
+	 */
+	drm_sched_stop(sched, NULL);
+
+	/*
+	 * Iterate through the pending job list and free all jobs.
+	 * This assumes the driver has either guaranteed jobs are already stopped, or that
+	 * otherwise it is responsible for keeping any necessary data structures for
+	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
+	 * putting them in its own queue or doing its own refcounting).
+	 */
+	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&s_job->list);
+		spin_unlock(&sched->job_list_lock);
+		sched->ops->free_job(s_job);
+	}
I would stop the kthread first, then delete all jobs without spinlock since nothing else can race against sched_fini?
If you do need the spinlock, it would need to guard list_for_each_entry too.
Well this case here actually should not happen in the first place.
"This should not happen in the first place" is how you end up with C APIs that have corner cases that lead to kernel oopses...
The idea with Rust abstractions is that it needs to be actually impossible to create memory safety problems for the user of the abstraction, you can't impose arbitrary constraints like "you must wait for all jobs to finish before destroying the scheduler"... it needs to be intrinsically safe.
Jobs depend on their device, so as long as there are jobs there should also be a reference to the scheduler.
These schedulers are created dynamically per userspace queue. The memory management and reference counting involved make it safe to destroy the scheduler even when behind the scenes hardware jobs are still running, as long as drm_sched itself doesn't crash on fences firing without a scheduler (which is what this patch fixes).
We have originally rejected that approach, but I still think it might work if done right.
This is the power of Rust: it forces you to architect your code in a way that you don't have complex high-level dependencies that span the entire driver and are difficult to prove hold. In my driver, you can kill a process and that destroys the drm_sched, closes all GEM objects, everything, even if the GPU is still running jobs from that process. The worst that can happen is that the GPU faults as in-use userspace buffers are unmapped out from under the running user job, but that's fine (GPU faults are recoverable). The actual firmware resources, queues, etc. in use are all kept alive until the commands finish executing (or fault, which is just an abnormal completion), even if the userspace process that owned them is long gone. I've tested this extensively by doing things like large-resolution glmark runs in a loop that get `kill -9`'d repeatedly, and it works very well! Tons of GPU faults but no firmware crashes, no oopses, nothing. And the firmware *will* crash irrecoverably if anything goes wrong with its shared memory structures, so that it doesn't is pretty good evidence that all this works!
Well, testing is no proof at all of a correct design.
What could be happening is that you have allocated a scheduler instance dynamically, but even then you should first tear down all entities and then the scheduler.
This is about creating a safe Rust abstraction, so we can't impose requirements on users like that, the abstraction has to take care of it. Unfortunately, the jobs cannot depend on the scheduler at the abstraction level. I tried that (putting a reference counted reference to the scheduler in the job abstraction), but it doesn't work because a job completing can end up dropping the last reference to the scheduler, and then you end up trying to stop and clean up the scheduler from a callback called from the scheduler kthread itself, which deadlocks. We could throw those cleanups into a workqueue or something, but that's just adding bandages around the problem that the drm_sched interface today is just not safe without this patch...
Well that won't work like this. The scheduler has a pretty clear tear down procedure.
And that procedure implies that all entities which might provide jobs are destroyed before the scheduler is destroyed.
Destroying the entities in turn cleans up the pending jobs inside of them. We could add a warning when users of this API don't do this correctly, but cleaning up incorrect API use is clearly something we don't want here.
Right now, it is not possible to create a safe Rust abstraction for drm_sched without doing something like duplicating all job tracking in the abstraction, or the above backreference + deferred cleanup mess, or something equally silly. So let's just fix the C side please ^^
Nope, as far as I can see this is just not correctly tearing down the objects in the right order.
So you are trying to do something which is not supposed to work in the first place.
Regards, Christian.
So far, drm_sched is the only DRM API that has had such a fundamental API safety issue that I had to make a change like this to the C to make the Rust abstraction possible/reasonable... drm_sched has also been by far the hardest DRM component API to understand from a safety point of view, with the most inconsistent documentation about what the ownership/freeing rules are, and what objects need to outlive what other objects (I had to just read the code to figure most of this out). That's also one nice outcome of writing Rust abstractions: it forces us to make all these rules and invariants explicit, instead of leaving them as unwritten assumptions (almost nobody consistently documents this in C APIs...).
If I got it right, anyone using the Rust drm_sched abstraction doesn't have to worry about this any more because if they do something that would oops with it, their code won't compile. But I need this patch to be able to make that guarantee...
~~ Lina
On 09/03/2023 00.42, Christian König wrote:
Am 08.03.23 um 16:18 schrieb Asahi Lina:
On 08/03/2023 19.03, Christian König wrote:
Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
On 2023-03-07 15:25, Asahi Lina wrote:
drm_sched_fini() currently leaves any pending jobs dangling, which causes segfaults and other badness when job completion fences are signaled after the scheduler is torn down.
Explicitly detach all jobs from their completion callbacks and free them. This makes it possible to write a sensible safe abstraction for drm_sched, without having to externally duplicate the tracking of in-flight jobs.
This shouldn't regress any existing drivers, since calling drm_sched_fini() with any pending jobs is broken and this change should be a no-op if there are no pending jobs.
Signed-off-by: Asahi Lina <lina@asahilina.net>
 drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5c0add2c7546..0aab1e0aebdd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
+	struct drm_sched_job *s_job, *tmp;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	if (!sched->thread)
+		return;
+
+	/*
+	 * Stop the scheduler, detaching all jobs from their hardware callbacks
+	 * and cleaning up complete jobs.
+	 */
+	drm_sched_stop(sched, NULL);
+
+	/*
+	 * Iterate through the pending job list and free all jobs.
+	 * This assumes the driver has either guaranteed jobs are already stopped, or that
+	 * otherwise it is responsible for keeping any necessary data structures for
+	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
+	 * putting them in its own queue or doing its own refcounting).
+	 */
+	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&s_job->list);
+		spin_unlock(&sched->job_list_lock);
+		sched->ops->free_job(s_job);
+	}
I would stop the kthread first, then delete all jobs without spinlock since nothing else can race against sched_fini?
If you do need the spinlock, it would need to guard list_for_each_entry too.
Well this case here actually should not happen in the first place.
"This should not happen in the first place" is how you end up with C APIs that have corner cases that lead to kernel oopses...
The idea with Rust abstractions is that it needs to be actually impossible to create memory safety problems for the user of the abstraction, you can't impose arbitrary constraints like "you must wait for all jobs to finish before destroying the scheduler"... it needs to be intrinsically safe.
Jobs depend on their device, so as long as there are jobs there should also be a reference to the scheduler.
These schedulers are created dynamically per userspace queue. The memory management and reference counting involved make it safe to destroy the scheduler even when behind the scenes hardware jobs are still running, as long as drm_sched itself doesn't crash on fences firing without a scheduler (which is what this patch fixes).
We have originally rejected that approach, but I still think it might work if done right.
This is the power of Rust: it forces you to architect your code in a way that you don't have complex high-level dependencies that span the entire driver and are difficult to prove hold. In my driver, you can kill a process and that destroys the drm_sched, closes all GEM objects, everything, even if the GPU is still running jobs from that process. The worst that can happen is that the GPU faults as in-use userspace buffers are unmapped out from under the running user job, but that's fine (GPU faults are recoverable). The actual firmware resources, queues, etc. in use are all kept alive until the commands finish executing (or fault, which is just an abnormal completion), even if the userspace process that owned them is long gone. I've tested this extensively by doing things like large-resolution glmark runs in a loop that get `kill -9`'d repeatedly, and it works very well! Tons of GPU faults but no firmware crashes, no oopses, nothing. And the firmware *will* crash irrecoverably if anything goes wrong with its shared memory structures, so that it doesn't is pretty good evidence that all this works!
Well, testing is no proof at all of a correct design.
Well, I'm guessing you don't have a formal correctness proof for amdgpu either... ^^
There's actually no way to prove my design is correct, since this is a reverse engineered driver that talks to proprietary firmware and I don't have the benefit of both open and internal docs like you AMD people have, never mind access to firmware source code... all I can do is try to understand how it should work based on how macOS does things and running tests, and then design something that should work with it. I spent months writing a prototype Python driver before even starting on the real DRM driver (long story...), and I keep going back to it to test little details of the firmware interface. There's over 3300 lines of just firmware structure definitions, it's kind of crazy...
But even with all that... this driver has no right to be as stable as it is, considering I wrote it in just a few months. It hasn't even been a year since I started working on AGX at all! As I mentioned in the cover letter, we've gotten zero reports of oopses in production.

I tried fuzzing the UAPI and all I managed to do was crash the firmware after a lot of GPU faults (that was a subtle firmware data cache coherency issue, now fixed); the driver itself was fine. I didn't have to debug the OOM error codepaths when we first started running Xonotic on 8GB RAM machines with no texture compression support on high quality texture settings (bad combination...); it all just worked even though all those error/cleanup paths had never been tested before at all.

The only memory leaks I managed to cause were due to circular references between VMs and GEM objects (tricky to avoid; I did manage to miss one special case object in the first driver release...), everything else just cleans itself up by design. And it's not because I'm a genius or anything like that... it's because Rust just makes getting all this right *so* much easier than C.
So I can at least say I'm quite confident that, as long as my understanding of the firmware structure lifetimes is correct and I encode it in the Rust object model the driver uses to represent them, things will work without crashing without relying on high-level invariants like "you must wait for all job completions before tearing down the top-level scheduler for a user queue" ^^
What could be happening is that you have allocated a scheduler instance dynamically, but even then you should first tear down all entities and then the scheduler.
This is about creating a safe Rust abstraction, so we can't impose requirements on users like that, the abstraction has to take care of it. Unfortunately, the jobs cannot depend on the scheduler at the abstraction level. I tried that (putting a reference counted reference to the scheduler in the job abstraction), but it doesn't work because a job completing can end up dropping the last reference to the scheduler, and then you end up trying to stop and clean up the scheduler from a callback called from the scheduler kthread itself, which deadlocks. We could throw those cleanups into a workqueue or something, but that's just adding bandages around the problem that the drm_sched interface today is just not safe without this patch...
Well that won't work like this. The scheduler has a pretty clear tear down procedure.
Well... I wouldn't call it "clear". I had to reverse engineer this from reading drm_sched source code, the docs don't tell you. The entire documentation of "drm_sched_fini()" is as follows:
"Tears down and cleans up the scheduler."
That's it.
This is why I had so much trouble writing this abstraction, and I spent hours reading the drm_sched code to understand how it worked in order to use the API correctly... and yet...
And that procedure implies that all entities which might provide jobs are destroyed before the scheduler is destroyed.
Yes, I do this: the entity abstraction holds a reference to the scheduler for this reason, so the scheduler can only be destroyed once all entities are destroyed. But...
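As a sketch of how that reference works (names here are illustrative, not the actual abstraction code): the entity simply holds an `Arc` to the scheduler, so the scheduler's teardown cannot run until the last entity is gone.

```rust
use std::sync::Arc;

// Hypothetical stand-ins: `Scheduler` would wrap the C drm_gpu_scheduler,
// and its Drop impl is where drm_sched_fini() would be called.
struct Scheduler {
    name: &'static str,
}

impl Drop for Scheduler {
    fn drop(&mut self) {
        // Guaranteed to run only after every Entity holding an Arc is gone.
        println!("tearing down scheduler {}", self.name);
    }
}

struct Entity {
    // The back-reference: an Entity cannot outlive its Scheduler.
    sched: Arc<Scheduler>,
}

fn main() {
    let sched = Arc::new(Scheduler { name: "queue0" });
    let entity = Entity { sched: Arc::clone(&sched) };

    drop(sched); // drops only this handle; the Entity still keeps it alive
    assert_eq!(Arc::strong_count(&entity.sched), 1);

    // The Scheduler is finally destroyed here, when `entity` is dropped.
}
```

This direction of dependency (entity → scheduler) is safe; the problem described above is the *other* direction (job → scheduler).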
Destroying the entities in turn cleans up the pending jobs inside of them.
Yes but... none of this cleans up jobs that are already submitted by the scheduler and in its pending list, with registered completion callbacks, which were already popped off of the entities.
*That* is the problem this patch fixes!
We could add a warning when users of this API don't do this correctly, but cleaning up incorrect API use is clearly something we don't want here.
It is the job of the Rust abstractions to make incorrect API use that leads to memory unsafety impossible. So even if you don't want that in C, it's my job to do that for Rust... and right now, I just can't because drm_sched doesn't provide an API that can be safely wrapped without weird bits of babysitting functionality on top (like tracking jobs outside or awkwardly making jobs hold a reference to the scheduler and defer dropping it to another thread).
Right now, it is not possible to create a safe Rust abstraction for drm_sched without doing something like duplicating all job tracking in the abstraction, or the above backreference + deferred cleanup mess, or something equally silly. So let's just fix the C side please ^^
Nope, as far as I can see this is just not correctly tearing down the objects in the right order.
There's no API to clean up in-flight jobs in a drm_sched at all. Destroying an entity won't do it. So there is no reasonable way to do this at all...
So you are trying to do something which is not supposed to work in the first place.
I need to make things that aren't supposed to work impossible to do in the first place, or at least fail gracefully instead of just oopsing like drm_sched does today...
If you're convinced there's a way to do this, can you tell me exactly what code sequence I need to run to safely shut down a scheduler assuming all entities are already destroyed? You can't ask me for a list of pending jobs (the scheduler knows this, it doesn't make any sense to duplicate that outside), and you can't ask me to just not do this until all jobs complete execution (because then we either end up with the messy deadlock situation I described if I take a reference, or more duplicative in-flight job count tracking and blocking in the free path of the Rust abstraction, which doesn't make any sense either).
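To make the deadlock concrete, here is a toy model (hypothetical names, nothing from the real drm_sched code): if a job holds the last scheduler reference and is freed on the scheduler's own thread, the scheduler's teardown would end up stopping the very thread it is running on. The sketch just detects the situation instead of actually deadlocking.

```rust
use std::sync::{mpsc, Arc};
use std::thread;

struct Scheduler;

impl Drop for Scheduler {
    fn drop(&mut self) {
        // The real teardown would kthread_stop() the scheduler thread.
        // If we are *on* that thread, that call would never return.
        if thread::current().name() == Some("sched-worker") {
            println!("would deadlock: tearing down scheduler from its own thread");
        }
    }
}

struct Job {
    // The backreference that looked like a fix at the abstraction level.
    _sched: Arc<Scheduler>,
}

fn main() {
    let sched = Arc::new(Scheduler);
    let (tx, rx) = mpsc::channel::<Job>();

    let worker = thread::Builder::new()
        .name("sched-worker".into())
        .spawn(move || {
            // Models the free_job() path: the job is dropped on the
            // scheduler thread, taking the last Scheduler reference with it.
            let job = rx.recv().unwrap();
            drop(job);
        })
        .unwrap();

    let job = Job { _sched: Arc::clone(&sched) };
    drop(sched); // the userspace-side handle goes away first
    tx.send(job).unwrap();
    worker.join().unwrap(); // Scheduler::drop ran on the worker thread
}
```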
~~ Lina
You can't ask me for a list of pending jobs (the scheduler knows this, it doesn't make any sense to duplicate that outside)
Silly question: could you add a new exported function to drm_sched to get the list of pending jobs, to be used by the Rust abstraction internally? IDK if that makes any sense.
On 09/03/2023 02.39, alyssa@rosenzweig.io wrote:
You can't ask me for a list of pending jobs (the scheduler knows this, it doesn't make any sense to duplicate that outside)
Silly question: could you add a new exported function to drm_sched to get the list of pending jobs, to be used by the Rust abstraction internally? IDK if that makes any sense.
The drm_sched struct is public, we could just go in there and do it anyway... but then I need to figure out how to do `list_for_each_entry_safe` in Rust and this all makes very little sense when it's clearly the scheduler's job to provide some form of cleanup function users can use to do it...
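For what it's worth, the core trick of `list_for_each_entry_safe` (cache the next pointer before freeing the current entry) is expressible in unsafe Rust; a self-contained toy sketch, not the kernel list bindings:

```rust
// Minimal stand-in for a job with an intrusive next pointer
// (a real binding would go through the kernel's list_head instead).
struct Job {
    id: u32,
    next: *mut Job,
}

// Free every node in the list. Caching `next` before freeing `cur` is
// exactly what makes the C macro "safe" against deletion mid-iteration.
unsafe fn free_all(mut cur: *mut Job) -> (u32, u32) {
    let (mut freed, mut sum) = (0, 0);
    while !cur.is_null() {
        let next = (*cur).next; // grab next *before* freeing cur
        sum += (*cur).id;
        drop(Box::from_raw(cur));
        freed += 1;
        cur = next;
    }
    (freed, sum)
}

fn main() {
    // Build a three-element list of heap-allocated nodes.
    let mut head: *mut Job = std::ptr::null_mut();
    for id in 0..3 {
        head = Box::into_raw(Box::new(Job { id, next: head }));
    }

    let (freed, sum) = unsafe { free_all(head) };
    assert_eq!(freed, 3);
    assert_eq!(sum, 3); // ids 0 + 1 + 2
}
```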
I mean, I guess I can do that if Christian is adamantly against providing a safe C API, but it's clearly not the right solution and I hope this is not the approach maintainers take with Rust abstractions, because that's going to make our lives a lot harder for no good reason, and it also means C users don't get any of the benefits of Rust abstraction work if the APIs can't be improved at all along with it.
~~ Lina
Am 08.03.23 um 18:39 schrieb alyssa@rosenzweig.io:
You can't ask me for a list of pending jobs (the scheduler knows this, it doesn't make any sense to duplicate that outside)
Silly question: could you add a new exported function to drm_sched to get the list of pending jobs, to be used by the Rust abstraction internally? IDK if that makes any sense.
I was thinking about something similar as well. The problem is that you could only use this function from the scheduler thread itself, e.g. from one of its callback functions.
Christian.
Am 08.03.23 um 18:32 schrieb Asahi Lina:
[SNIP] Yes but... none of this cleans up jobs that are already submitted by the scheduler and in its pending list, with registered completion callbacks, which were already popped off of the entities.
*That* is the problem this patch fixes!
Ah! Yes that makes more sense now.
We could add a warning when users of this API don't do this correctly, but cleaning up incorrect API use is clearly something we don't want here.
It is the job of the Rust abstractions to make incorrect API use that leads to memory unsafety impossible. So even if you don't want that in C, it's my job to do that for Rust... and right now, I just can't because drm_sched doesn't provide an API that can be safely wrapped without weird bits of babysitting functionality on top (like tracking jobs outside or awkwardly making jobs hold a reference to the scheduler and defer dropping it to another thread).
Yeah, that was discussed before but rejected.
The argument was that the upper layer needs to wait for the hw to become idle before the scheduler can be destroyed anyway.
Right now, it is not possible to create a safe Rust abstraction for drm_sched without doing something like duplicating all job tracking in the abstraction, or the above backreference + deferred cleanup mess, or something equally silly. So let's just fix the C side please ^^
Nope, as far as I can see this is just not correctly tearing down the objects in the right order.
There's no API to clean up in-flight jobs in a drm_sched at all. Destroying an entity won't do it. So there is no reasonable way to do this at all...
Yes, this was removed.
So you are trying to do something which is not supposed to work in the first place.
I need to make things that aren't supposed to work impossible to do in the first place, or at least fail gracefully instead of just oopsing like drm_sched does today...
If you're convinced there's a way to do this, can you tell me exactly what code sequence I need to run to safely shut down a scheduler assuming all entities are already destroyed? You can't ask me for a list of pending jobs (the scheduler knows this, it doesn't make any sense to duplicate that outside), and you can't ask me to just not do this until all jobs complete execution (because then we either end up with the messy deadlock situation I described if I take a reference, or more duplicative in-flight job count tracking and blocking in the free path of the Rust abstraction, which doesn't make any sense either).
Good question. We don't have anybody upstream who uses the scheduler lifetime like this.
Essentially the job list in the scheduler is something we wanted to remove because it causes tons of race conditions during hw recovery.
When you tear down the firmware queue how do you handle already submitted jobs there?
Regards, Christian.
~~ Lina
On 09/03/2023 03.12, Christian König wrote:
Am 08.03.23 um 18:32 schrieb Asahi Lina:
[SNIP] Yes but... none of this cleans up jobs that are already submitted by the scheduler and in its pending list, with registered completion callbacks, which were already popped off of the entities.
*That* is the problem this patch fixes!
Ah! Yes that makes more sense now.
We could add a warning when users of this API don't do this correctly, but cleaning up incorrect API use is clearly something we don't want here.
It is the job of the Rust abstractions to make incorrect API use that leads to memory unsafety impossible. So even if you don't want that in C, it's my job to do that for Rust... and right now, I just can't because drm_sched doesn't provide an API that can be safely wrapped without weird bits of babysitting functionality on top (like tracking jobs outside or awkwardly making jobs hold a reference to the scheduler and defer dropping it to another thread).
Yeah, that was discussed before but rejected.
The argument was that the upper layer needs to wait for the hw to become idle before the scheduler can be destroyed anyway.
Unfortunately, that's not a requirement you can encode in the Rust type system easily as far as I know, and Rust safety rules mean we need to make it safe even if the upper layer doesn't do this... (or else we have to mark the entire drm_sched abstraction unsafe, but that would be a pity).
I know it's a different way of thinking, but it has pretty clear benefits since with Rust you can actually guarantee that things are safe overall by just auditing explicitly unsafe code. If we just mark all of drm_sched unsafe, that means we now need to audit all details about how the driver uses it for safety. It makes more sense to just make the abstraction safe, which is much easier to audit.
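A toy illustration of that audit boundary (nothing here is real kernel code): all the `unsafe` is confined to one small wrapper whose invariant is stated once, and the public API cannot be misused no matter what the driver on top does.

```rust
mod raw {
    // Pretend this is a C-side resource reached through a raw pointer.
    pub struct RawCtx {
        pub value: i32,
    }
}

pub struct Ctx {
    // INVARIANT: `ptr` is non-null, uniquely owned, and freed exactly once
    // in Drop. This is the only invariant an auditor has to check.
    ptr: *mut raw::RawCtx,
}

impl Ctx {
    pub fn new(value: i32) -> Ctx {
        Ctx { ptr: Box::into_raw(Box::new(raw::RawCtx { value })) }
    }

    pub fn value(&self) -> i32 {
        // SAFETY: the invariant above guarantees `ptr` is valid for the
        // lifetime of `self`.
        unsafe { (*self.ptr).value }
    }
}

impl Drop for Ctx {
    fn drop(&mut self) {
        // SAFETY: sole owner; this is the only place the pointer is freed.
        unsafe { drop(Box::from_raw(self.ptr)) };
    }
}

fn main() {
    let ctx = Ctx::new(42);
    assert_eq!(ctx.value(), 42);
    // `ctx` is freed here; use-after-free is unreachable from the safe API.
}
```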
Right now, it is not possible to create a safe Rust abstraction for drm_sched without doing something like duplicating all job tracking in the abstraction, or the above backreference + deferred cleanup mess, or something equally silly. So let's just fix the C side please ^^
Nope, as far as I can see this is just not correctly tearing down the objects in the right order.
There's no API to clean up in-flight jobs in a drm_sched at all. Destroying an entity won't do it. So there is no reasonable way to do this at all...
Yes, this was removed.
So you are trying to do something which is not supposed to work in the first place.
I need to make things that aren't supposed to work impossible to do in the first place, or at least fail gracefully instead of just oopsing like drm_sched does today...
If you're convinced there's a way to do this, can you tell me exactly what code sequence I need to run to safely shut down a scheduler assuming all entities are already destroyed? You can't ask me for a list of pending jobs (the scheduler knows this, it doesn't make any sense to duplicate that outside), and you can't ask me to just not do this until all jobs complete execution (because then we either end up with the messy deadlock situation I described if I take a reference, or more duplicative in-flight job count tracking and blocking in the free path of the Rust abstraction, which doesn't make any sense either).
Good question. We don't have anybody upstream who uses the scheduler lifetime like this.
Essentially the job list in the scheduler is something we wanted to remove because it causes tons of race conditions during hw recovery.
When you tear down the firmware queue how do you handle already submitted jobs there?
The firmware queue is itself reference counted and any firmware queue that has acquired an event notification resource (that is, which is busy with running or upcoming jobs) hands off a reference to itself into the event subsystem, so it can get notified of job completions by the firmware. Then once it becomes idle it unregisters itself, and at that point if it has no owning userspace queue, that would be the last reference and it gets dropped. So we don't tear down firmware queues until they are idle.
(There is a subtle deadlock break in the event module to make this work out, where we clone a reference to the queue and drop the event subsystem lock before signaling it of completions, so it can call back in and take the lock as it unregisters itself if needed. Then the actual teardown happens when the signaling is complete and that reference clone is the last one to get dropped.)
If a queue is idle at the firmware level but has upcoming jobs queued in drm_sched, when those get deleted as part of an explicit drm_sched teardown (free_job()) the queue notices it lost its upcoming jobs and relinquishes the event resource if there are no running jobs. I'm not even sure exactly what order all of this happens in practice (it depends on structure field order in Rust!), but it doesn't really matter because either way everything gets cleaned up one way or another.
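A heavily simplified sketch of that scheme (all names hypothetical, using plain `Arc`/`Mutex` rather than the kernel types): a busy queue hands a clone of its own reference to the event subsystem, and the completion path clones the reference out from under the lock before calling back into the queue, so the queue can unregister itself without deadlocking.

```rust
use std::sync::{Arc, Mutex};

struct Queue {
    running_jobs: Mutex<u32>,
}

struct Events {
    // Queues that hold an event resource; each entry keeps its queue alive.
    registered: Mutex<Vec<Arc<Queue>>>,
}

impl Events {
    fn register(&self, q: &Arc<Queue>) {
        self.registered.lock().unwrap().push(Arc::clone(q));
    }

    fn signal_completion(&self, idx: usize) {
        // Deadlock break: clone the reference and drop the lock before
        // touching the queue, which may re-take the lock to unregister.
        let q = {
            let reg = self.registered.lock().unwrap();
            Arc::clone(&reg[idx])
        };
        let idle = {
            let mut n = q.running_jobs.lock().unwrap();
            *n -= 1;
            *n == 0
        };
        if idle {
            // The queue relinquishes its event resource once idle.
            self.registered.lock().unwrap().remove(idx);
        }
        // If userspace already dropped its handle, `q` going out of scope
        // here is the last reference and the queue is freed.
    }
}

fn main() {
    let events = Events { registered: Mutex::new(Vec::new()) };
    let q = Arc::new(Queue { running_jobs: Mutex::new(1) });
    events.register(&q);
    drop(q); // userspace goes away while a job is still running
    events.signal_completion(0); // firmware reports completion; queue idles out
    assert!(events.registered.lock().unwrap().is_empty());
}
```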
I actually don't know of any way to actively abort jobs on the firmware, so this is pretty much the only option I have. I've even seen long-running compute jobs on macOS run to completion even if you kill the submitting process, so there might be no way to do this at all. Though in practice since we unmap everything from the VM anyway when the userspace stuff gets torn down, almost any normal GPU work is going to immediately fault at that point (macOS doesn't do this because macOS effectively does implicit sync with BO tracking at the kernel level...).
By the way, I don't really use the hardware recovery stuff right now. I'm not even sure if there is a sensible way I could use it, since as I said we can't exactly abort jobs. I know there are ways to lock up the firmware/GPU, but so far those have all been things the kernel driver can prevent, and I'm not even sure if there is any way to recover from that anyway. The firmware itself has its own timeouts and recovery for "normal" problems.

From the point of view of the driver and everything above it, in-flight commands during a GPU fault or timeout are just marked complete by the firmware, after a firmware recovery cycle where the driver gets notified of the problem (that's when we mark the commands failed so we can propagate the error). There is no re-submission or anything, userspace just gets told of the problem but the queue survives.

In the future it might be possible to re-submit innocent commands (it is possible for a GPU fault to break another process running concurrently, and this is a problem macOS has too...), which is still not perfect due to side effects but might work most of the time, but that depends on the "command patching" stuff I mentioned, and I'm still not even sure if it will be possible to do safely. There's a lot of subtlety around what we can and can't do during a firmware recovery cycle that I haven't even started to investigate yet (the answer could be "nothing" even).
~~ Lina
Am 08.03.23 um 20:37 schrieb Asahi Lina:
On 09/03/2023 03.12, Christian König wrote:
Am 08.03.23 um 18:32 schrieb Asahi Lina:
[SNIP] Yes but... none of this cleans up jobs that are already submitted by the scheduler and in its pending list, with registered completion callbacks, which were already popped off of the entities.
*That* is the problem this patch fixes!
Ah! Yes that makes more sense now.
We could add a warning when users of this API don't do this correctly, but cleaning up incorrect API use is clearly something we don't want here.
It is the job of the Rust abstractions to make incorrect API use that leads to memory unsafety impossible. So even if you don't want that in C, it's my job to do that for Rust... and right now, I just can't because drm_sched doesn't provide an API that can be safely wrapped without weird bits of babysitting functionality on top (like tracking jobs outside or awkwardly making jobs hold a reference to the scheduler and defer dropping it to another thread).
Yeah, that was discussed before but rejected.
The argument was that the upper layer needs to wait for the hw to become idle before the scheduler can be destroyed anyway.
Unfortunately, that's not a requirement you can encode in the Rust type system easily as far as I know, and Rust safety rules mean we need to make it safe even if the upper layer doesn't do this... (or else we have to mark the entire drm_sched abstraction unsafe, but that would be a pity).
Yeah, that should really not be something we should do.
But you could make the scheduler depend on your fw context object, couldn't you?
Detaching the scheduler from the underlying hw fences is certainly possible, but we removed that functionality because some people tried to force push some Windows recovery module into Linux. We are in the process of reverting that and cleaning things up once more, but that will take a while.
Instead of detaching you could also block for the hw to become idle, but if you do that synchronously on process termination you run into trouble as well.
I know it's a different way of thinking, but it has pretty clear benefits since with Rust you can actually guarantee that things are safe overall by just auditing explicitly unsafe code. If we just mark all of drm_sched unsafe, that means we now need to audit all details about how the driver uses it for safety. It makes more sense to just make the abstraction safe, which is much easier to audit.
I'm pretty familiar with that approach.
Right now, it is not possible to create a safe Rust abstraction for drm_sched without doing something like duplicating all job tracking in the abstraction, or the above backreference + deferred cleanup mess, or something equally silly. So let's just fix the C side please ^^
Nope, as far as I can see this is just not correctly tearing down the objects in the right order.
There's no API to clean up in-flight jobs in a drm_sched at all. Destroying an entity won't do it. So there is no reasonable way to do this at all...
Yes, this was removed.
So you are trying to do something which is not supposed to work in the first place.
I need to make things that aren't supposed to work impossible to do in the first place, or at least fail gracefully instead of just oopsing like drm_sched does today...
If you're convinced there's a way to do this, can you tell me exactly what code sequence I need to run to safely shut down a scheduler assuming all entities are already destroyed? You can't ask me for a list of pending jobs (the scheduler knows this, it doesn't make any sense to duplicate that outside), and you can't ask me to just not do this until all jobs complete execution (because then we either end up with the messy deadlock situation I described if I take a reference, or more duplicative in-flight job count tracking and blocking in the free path of the Rust abstraction, which doesn't make any sense either).
Good question. We don't have anybody upstream who uses the scheduler lifetime like this.
Essentially the job list in the scheduler is something we wanted to remove because it causes tons of race conditions during hw recovery.
When you tear down the firmware queue, how do you handle already submitted jobs there?
The firmware queue is itself reference counted and any firmware queue that has acquired an event notification resource (that is, which is busy with running or upcoming jobs) hands off a reference to itself into the event subsystem, so it can get notified of job completions by the firmware. Then once it becomes idle it unregisters itself, and at that point if it has no owning userspace queue, that would be the last reference and it gets dropped. So we don't tear down firmware queues until they are idle.
And could those fw queue not reference the scheduler?
(There is a subtle deadlock break in the event module to make this work out, where we clone a reference to the queue and drop the event subsystem lock before signaling it of completions, so it can call back in and take the lock as it unregisters itself if needed. Then the actual teardown happens when the signaling is complete and that reference clone is the last one to get dropped.)
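As a userspace Rust sketch of that pattern (all names made up for the demo; this is not the actual driver code): clone the queue reference while holding the event-subsystem lock, release the lock, and only then signal, so the completion handler can re-take the lock to unregister the queue.

```rust
use std::sync::{Arc, Mutex};

struct Queue {
    id: usize,
}

struct EventSubsystem {
    queues: Mutex<Vec<Arc<Queue>>>,
}

impl EventSubsystem {
    fn signal_completion(&self, id: usize) -> bool {
        // Clone a reference to the queue under the lock...
        let queue = {
            let guard = self.queues.lock().unwrap();
            guard.iter().find(|q| q.id == id).cloned()
        }; // ...and release the lock here, before calling back in.

        match queue {
            Some(q) => {
                // This may call unregister(), which takes the lock again;
                // holding it across this call would self-deadlock.
                q.on_complete(self);
                true
            }
            None => false,
        }
    }

    fn unregister(&self, id: usize) {
        self.queues.lock().unwrap().retain(|q| q.id != id);
    }
}

impl Queue {
    fn on_complete(&self, events: &EventSubsystem) {
        // The queue went idle: drop our registration. The Arc clone held
        // by the caller keeps us alive until signaling finishes, and the
        // actual teardown happens when that clone is dropped.
        events.unregister(self.id);
    }
}
```

The point of the inner scope is exactly the lock-ordering rule described above: the guard is dropped before `on_complete()` runs.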
If a queue is idle at the firmware level but has upcoming jobs queued in drm_sched, when those get deleted as part of an explicit drm_sched teardown (free_job()) the queue notices it lost its upcoming jobs and relinquishes the event resource if there are no running jobs. I'm not even sure exactly what order this all happens in, in practice (it depends on structure field order in Rust!), but it doesn't really matter because either way everything gets cleaned up one way or another.
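The "it depends on structure field order in Rust" bit is because Rust drops struct fields in declaration order. A tiny self-contained demonstration (field names invented for the demo, nothing to do with the real structs):

```rust
use std::cell::RefCell;
use std::rc::Rc;

type Log = Rc<RefCell<Vec<&'static str>>>;

// Records its name into the shared log when dropped.
struct Tracer(&'static str, Log);

impl Drop for Tracer {
    fn drop(&mut self) {
        self.1.borrow_mut().push(self.0);
    }
}

// Struct fields are dropped in declaration order, so `sched` is torn
// down before `fw_queue` here. Reordering the fields reorders teardown.
struct UserQueue {
    sched: Tracer,
    fw_queue: Tracer,
}

fn teardown_order() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let q = UserQueue {
        sched: Tracer("sched", log.clone()),
        fw_queue: Tracer("fw_queue", log.clone()),
    };
    drop(q);
    let order = log.borrow().clone();
    order
}
```

So whichever field is declared first is the one that gets cleaned up first when the owning struct goes away.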
I actually don't know of any way to actively abort jobs on the firmware, so this is pretty much the only option I have. I've even seen long-running compute jobs on macOS run to completion even if you kill the submitting process, so there might be no way to do this at all. Though in practice since we unmap everything from the VM anyway when the userspace stuff gets torn down, almost any normal GPU work is going to immediately fault at that point (macOS doesn't do this because macOS effectively does implicit sync with BO tracking at the kernel level...).
Oh, that is interesting information. How does macOS do explicit sync then, or isn't that supported at all?
By the way, I don't really use the hardware recovery stuff right now. I'm not even sure if there is a sensible way I could use it, since as I said we can't exactly abort jobs. I know there are ways to lock up the firmware/GPU, but so far those have all been things the kernel driver can prevent, and I'm not even sure if there is any way to recover from that anyway. The firmware itself has its own timeouts and recovery for "normal" problems. From the point of view of the driver and everything above it, in-flight commands during a GPU fault or timeout are just marked complete by the firmware, after a firmware recovery cycle where the driver gets notified of the problem (that's when we mark the commands failed so we can propagate the error).
Yeah, that's exactly what we have been telling our fw people for years that we need as well.
There is no re-submission or anything, userspace just gets told of the problem but the queue survives.
In the future it might be possible to re-submit innocent commands
Long story short: Don't do this! This is what the Windows drivers have been doing and it creates tons of problems.
Just signal the problem back to userspace and let the user space driver decide what to do.
The background is that most graphics applications (games etc.) would then rather start on the next frame instead of submitting the current one again, while compute applications make sure that they abort and tell the user that the calculations might be corrupted and need to be redone.
Regards, Christian.
(it is possible for a GPU fault to break another process running concurrently, and this is a problem macOS has too...), which is still not perfect due to side effects but might work most of the time, but that depends on the "command patching" stuff I mentioned, and I'm still not even sure if it will be possible to do safely. There's a lot of subtlety around what we can and can't do during a firmware recovery cycle that I haven't even started to investigate yet (the answer could be "nothing" even).
~~ Lina
On 09/03/2023 17.42, Christian König wrote:
Am 08.03.23 um 20:37 schrieb Asahi Lina:
On 09/03/2023 03.12, Christian König wrote:
Am 08.03.23 um 18:32 schrieb Asahi Lina:
[SNIP] Yes but... none of this cleans up jobs that are already submitted by the scheduler and in its pending list, with registered completion callbacks, which were already popped off of the entities.
*That* is the problem this patch fixes!
Ah! Yes that makes more sense now.
We could add a warning when users of this API don't do this correctly, but cleaning up incorrect API use is clearly something we don't want here.
It is the job of the Rust abstractions to make incorrect API use that leads to memory unsafety impossible. So even if you don't want that in C, it's my job to do that for Rust... and right now, I just can't because drm_sched doesn't provide an API that can be safely wrapped without weird bits of babysitting functionality on top (like tracking jobs outside or awkwardly making jobs hold a reference to the scheduler and defer dropping it to another thread).
Yeah, that was discussed before but rejected.
The argument was that the upper layer needs to wait for the hw to become idle before the scheduler can be destroyed anyway.
Unfortunately, that's not a requirement you can encode in the Rust type system easily as far as I know, and Rust safety rules mean we need to make it safe even if the upper layer doesn't do this... (or else we have to mark the entire drm_sched abstraction unsafe, but that would be a pity).
Yeah, that should really not be something we should do.
But you could make the scheduler depend on your fw context object, couldn't you?
Yes, and that would fix the problem for this driver, but it wouldn't make the abstraction safe. The thing is we have to make it *impossible* to misuse drm_sched in such a way that it crashes, at the Rust abstraction level. If we start depending on the driver following rules like that, that means the drm_sched abstraction has to be marked unsafe.
Detaching the scheduler from the underlying hw fences is certainly possible, but we removed that functionality because some people tried to force push some Windows recovery module into Linux. We are in the process of reverting that and cleaning things up once more, but that will take a while.
Okay, but I don't see why that should block the Rust abstractions... I don't even need a new API to do that, all I need is to know that drm_sched_fini() will do it so it won't crash when the hw fences complete later, as this patch does.
Instead of detaching you could also block for the hw to become idle, but if you do that synchronously on process termination you run into trouble as well.
Yes, but again this is something that can only be done at the driver level so it doesn't solve the safe abstraction problem...
The firmware queue is itself reference counted and any firmware queue that has acquired an event notification resource (that is, which is busy with running or upcoming jobs) hands off a reference to itself into the event subsystem, so it can get notified of job completions by the firmware. Then once it becomes idle it unregisters itself, and at that point if it has no owning userspace queue, that would be the last reference and it gets dropped. So we don't tear down firmware queues until they are idle.
And could those fw queue not reference the scheduler?
Yes but again, that rule can't be encoded in the abstraction... so that makes it unsafe. The goal is to have a safe abstraction, which means that all the rules that you need to follow to avoid memory safety issues are checked by the Rust compiler.
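As a tiny illustration of what "checked by the Rust compiler" means (a userspace sketch with made-up names, not the actual abstraction): if a job borrows the scheduler, the borrow checker itself rejects tearing the scheduler down while any job is still alive, rather than that being a rule the driver author has to remember.

```rust
struct Scheduler {
    name: &'static str,
}

struct Job<'a> {
    // The borrow ties each job's lifetime to the scheduler's.
    sched: &'a Scheduler,
}

impl Scheduler {
    fn new_job(&self) -> Job<'_> {
        Job { sched: self }
    }
}

fn demo() -> &'static str {
    let sched = Scheduler { name: "gpu" };
    let job = sched.new_job();
    // drop(sched); // error[E0505]: cannot move out of `sched`
    //              // because it is borrowed by `job`
    job.sched.name
}
```

The real abstraction can't use plain borrows like this (jobs outlive the submitting call), which is exactly why the teardown rules have to be enforced some other way, either by the C API or by extra tracking.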
I actually don't know of any way to actively abort jobs on the firmware, so this is pretty much the only option I have. I've even seen long-running compute jobs on macOS run to completion even if you kill the submitting process, so there might be no way to do this at all. Though in practice since we unmap everything from the VM anyway when the userspace stuff gets torn down, almost any normal GPU work is going to immediately fault at that point (macOS doesn't do this because macOS effectively does implicit sync with BO tracking at the kernel level...).
Oh, that is interesting information. How does macOS do explicit sync then, or isn't that supported at all?
They have the equivalent of sync objects at the UAPI level, but they also have the implicit stuff and their UAPI seems to always pass a BO list to the kernel as far as we could tell, even though it still works without it. I think it's a weird hybrid of explicit+implicit sync. From the Metal docs:
By default, Metal tracks the write hazards and synchronizes the resources (see Resource Fundamentals) you create from an MTLDevice and directly bind to a pipeline. However, Metal doesn’t, by default, track resources you allocate from an MTLHeap (see Memory Heaps).
So it's both, and you can override it...
At the firmware level, I've never seen Metal use queue barriers yet like I do (other than the vertex->fragment ones), so either they always do CPU round trips for cross-subqueue sync (render<->compute) or we just haven't figured out the magic combination to get it to do that yet. Honestly, I suspect they just always do it on the CPU. macOS is pretty ugly behind the scenes and it's pretty obvious a lot of their own driver was rushed (the firmware seems to support quite a few features the driver doesn't... maybe it even has a job abort mechanism, we just haven't found it yet).
Of course, our goal is to do things better than macOS (and we already do some things better!) but getting confident enough about firmware/HW details to diverge from what macOS does is tricky and a slow process...
By the way, I don't really use the hardware recovery stuff right now. I'm not even sure if there is a sensible way I could use it, since as I said we can't exactly abort jobs. I know there are ways to lock up the firmware/GPU, but so far those have all been things the kernel driver can prevent, and I'm not even sure if there is any way to recover from that anyway. The firmware itself has its own timeouts and recovery for "normal" problems. From the point of view of the driver and everything above it, in-flight commands during a GPU fault or timeout are just marked complete by the firmware, after a firmware recovery cycle where the driver gets notified of the problem (that's when we mark the commands failed so we can propagate the error).
Yeah, that's exactly what we have been telling our fw people for years that we need as well.
Yeah, the ugly bit is that the firmware does a full GPU recovery even on simple page faults (which could be handled more gracefully) so even stuff like that can possibly break concurrent GPU work.
On the other hand, macOS configures things so page faults are ignored and silently return all-00 on reads for shader accesses, which is how they implement sparse buffers/textures... and we'll probably have to do that to improve reliability against app faults if nothing else. But right now the driver enables explicit page faults for everything so we can debug Mesa (it's a kernel module param, GPU-global, and I haven't found a way to change it after initial load unfortunately, though it might be possible).
I think there's also a way to do actual page fault handling (like swap in pages and resume the GPU), but that's one of those firmware features Apple's driver just never uses as far as I can tell. There's so much unexplored territory...
There is no re-submission or anything, userspace just gets told of the problem but the queue survives.
In the future it might be possible to re-submit innocent commands
Long story short: Don't do this! This is what the Windows drivers have been doing and it creates tons of problems.
Just signal the problem back to userspace and let the user space driver decide what to do.
The background is that most graphics applications (games etc.) would then rather start on the next frame instead of submitting the current one again, while compute applications make sure that they abort and tell the user that the calculations might be corrupted and need to be redone.
Then we're good with what we're currently doing, since we already notify userspace like that!
Actually I wanted to ask about error notifications. Right now we have an out-of-band mechanism to provide detailed fault info to userspace which works fine, but in principle it's optional. However, I also mark the hw fences as errored when a fault happens (with an errno that describes the overall situation), but that never makes it into the drm_sched job complete fence. I looked at the drm_sched code and I didn't see any error propagation. Is that supposed to work, or am I supposed to directly mark the drm_sched side fence as complete, or did I misunderstand all this? I get the feeling maybe existing drivers just rely on the recovery/timeout/etc paths to mark jobs as errored (since those do it explicitly) and never need error forwarding from the hw fence?
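To make the question concrete, here's a toy userspace Rust model of the behavior I'd expect (all names hypothetical; this is not the actual dma_fence/drm_sched API): whatever error was recorded on the hw fence gets forwarded onto the job's finished fence when the job completes.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum FenceState {
    Pending,
    Signaled(Result<(), i32>), // Err holds a negative errno.
}

struct Fence {
    state: FenceState,
}

impl Fence {
    fn new() -> Self {
        Fence { state: FenceState::Pending }
    }

    // Roughly analogous to recording an error before signaling.
    fn set_error(&mut self, errno: i32) {
        if self.state == FenceState::Pending {
            self.state = FenceState::Signaled(Err(errno));
        }
    }

    fn signal(&mut self) {
        if self.state == FenceState::Pending {
            self.state = FenceState::Signaled(Ok(()));
        }
    }
}

// What I'd want job completion to do: forward any error from the hw
// fence to the scheduler-side finished fence, then signal it.
fn complete_job(hw_fence: &Fence, finished: &mut Fence) {
    if let FenceState::Signaled(Err(errno)) = hw_fence.state {
        finished.set_error(errno);
    }
    finished.signal();
}
```

The question is whether drm_sched is supposed to do that forwarding step itself, or whether the driver is expected to poke the scheduler-side fence directly.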
~~ Lina
Am 09.03.23 um 10:43 schrieb Asahi Lina:
On 09/03/2023 17.42, Christian König wrote:
Am 08.03.23 um 20:37 schrieb Asahi Lina:
On 09/03/2023 03.12, Christian König wrote:
Am 08.03.23 um 18:32 schrieb Asahi Lina:
[SNIP] Yes but... none of this cleans up jobs that are already submitted by the scheduler and in its pending list, with registered completion callbacks, which were already popped off of the entities.
*That* is the problem this patch fixes!
Ah! Yes that makes more sense now.
We could add a warning when users of this API don't do this correctly, but cleaning up incorrect API use is clearly something we don't want here.
It is the job of the Rust abstractions to make incorrect API use that leads to memory unsafety impossible. So even if you don't want that in C, it's my job to do that for Rust... and right now, I just can't because drm_sched doesn't provide an API that can be safely wrapped without weird bits of babysitting functionality on top (like tracking jobs outside or awkwardly making jobs hold a reference to the scheduler and defer dropping it to another thread).
Yeah, that was discussed before but rejected.
The argument was that the upper layer needs to wait for the hw to become idle before the scheduler can be destroyed anyway.
Unfortunately, that's not a requirement you can encode in the Rust type system easily as far as I know, and Rust safety rules mean we need to make it safe even if the upper layer doesn't do this... (or else we have to mark the entire drm_sched abstraction unsafe, but that would be a pity).
Yeah, that should really not be something we should do.
But you could make the scheduler depend on your fw context object, couldn't you?
Yes, and that would fix the problem for this driver, but it wouldn't make the abstraction safe. The thing is we have to make it *impossible* to misuse drm_sched in such a way that it crashes, at the Rust abstraction level. If we start depending on the driver following rules like that, that means the drm_sched abstraction has to be marked unsafe.
Detaching the scheduler from the underlying hw fences is certainly possible, but we removed that functionality because some people tried to force push some Windows recovery module into Linux. We are in the process of reverting that and cleaning things up once more, but that will take a while.
Okay, but I don't see why that should block the Rust abstractions...
Because even with removing the fence callback this is inherently unsafe.
You not only need to remove the callback, but also make sure that no parallel timeout handling is running.
This might not matter for your driver at the moment, but it's certainly something you need to keep in mind when you really want safe handling.
Apart from that I don't have many objections to this here as long as Maarten's comments are addressed as well.
Regards, Christian.
I don't even need a new API to do that, all I need is to know that drm_sched_fini() will do it so it won't crash when the hw fences complete later, as this patch does.
Instead of detaching you could also block for the hw to become idle, but if you do that synchronously on process termination you run into trouble as well.
Yes, but again this is something that can only be done at the driver level so it doesn't solve the safe abstraction problem...
The firmware queue is itself reference counted and any firmware queue that has acquired an event notification resource (that is, which is busy with running or upcoming jobs) hands off a reference to itself into the event subsystem, so it can get notified of job completions by the firmware. Then once it becomes idle it unregisters itself, and at that point if it has no owning userspace queue, that would be the last reference and it gets dropped. So we don't tear down firmware queues until they are idle.
And could those fw queue not reference the scheduler?
Yes but again, that rule can't be encoded in the abstraction... so that makes it unsafe. The goal is to have a safe abstraction, which means that all the rules that you need to follow to avoid memory safety issues are checked by the Rust compiler.
I actually don't know of any way to actively abort jobs on the firmware, so this is pretty much the only option I have. I've even seen long-running compute jobs on macOS run to completion even if you kill the submitting process, so there might be no way to do this at all. Though in practice since we unmap everything from the VM anyway when the userspace stuff gets torn down, almost any normal GPU work is going to immediately fault at that point (macOS doesn't do this because macOS effectively does implicit sync with BO tracking at the kernel level...).
Oh, that is interesting information. How does macOS do explicit sync then, or isn't that supported at all?
They have the equivalent of sync objects at the UAPI level, but they also have the implicit stuff and their UAPI seems to always pass a BO list to the kernel as far as we could tell, even though it still works without it. I think it's a weird hybrid of explicit+implicit sync. From the Metal docs:
By default, Metal tracks the write hazards and synchronizes the resources (see Resource Fundamentals) you create from an MTLDevice and directly bind to a pipeline. However, Metal doesn’t, by default, track resources you allocate from an MTLHeap (see Memory Heaps).
So it's both, and you can override it...
At the firmware level, I've never seen Metal use queue barriers yet like I do (other than the vertex->fragment ones), so either they always do CPU round trips for cross-subqueue sync (render<->compute) or we just haven't figured out the magic combination to get it to do that yet. Honestly, I suspect they just always do it on the CPU. macOS is pretty ugly behind the scenes and it's pretty obvious a lot of their own driver was rushed (the firmware seems to support quite a few features the driver doesn't... maybe it even has a job abort mechanism, we just haven't found it yet).
Of course, our goal is to do things better than macOS (and we already do some things better!) but getting confident enough about firmware/HW details to diverge from what macOS does is tricky and a slow process...
By the way, I don't really use the hardware recovery stuff right now. I'm not even sure if there is a sensible way I could use it, since as I said we can't exactly abort jobs. I know there are ways to lock up the firmware/GPU, but so far those have all been things the kernel driver can prevent, and I'm not even sure if there is any way to recover from that anyway. The firmware itself has its own timeouts and recovery for "normal" problems. From the point of view of the driver and everything above it, in-flight commands during a GPU fault or timeout are just marked complete by the firmware, after a firmware recovery cycle where the driver gets notified of the problem (that's when we mark the commands failed so we can propagate the error).
Yeah, that's exactly what we have been telling our fw people for years that we need as well.
Yeah, the ugly bit is that the firmware does a full GPU recovery even on simple page faults (which could be handled more gracefully) so even stuff like that can possibly break concurrent GPU work.
On the other hand, macOS configures things so page faults are ignored and silently return all-00 on reads for shader accesses, which is how they implement sparse buffers/textures... and we'll probably have to do that to improve reliability against app faults if nothing else. But right now the driver enables explicit page faults for everything so we can debug Mesa (it's a kernel module param, GPU-global, and I haven't found a way to change it after initial load unfortunately, though it might be possible).
I think there's also a way to do actual page fault handling (like swap in pages and resume the GPU), but that's one of those firmware features Apple's driver just never uses as far as I can tell. There's so much unexplored territory...
There is no re-submission or anything, userspace just gets told of the problem but the queue survives. In the future it might be possible to re-submit innocent commands
Long story short: Don't do this! This is what the Windows drivers have been doing and it creates tons of problems.
Just signal the problem back to userspace and let the user space driver decide what to do.
The background is that most graphics applications (games etc.) would then rather start on the next frame instead of submitting the current one again, while compute applications make sure that they abort and tell the user that the calculations might be corrupted and need to be redone.
Then we're good with what we're currently doing, since we already notify userspace like that!
Actually I wanted to ask about error notifications. Right now we have an out-of-band mechanism to provide detailed fault info to userspace which works fine, but in principle it's optional. However, I also mark the hw fences as errored when a fault happens (with an errno that describes the overall situation), but that never makes it into the drm_sched job complete fence. I looked at the drm_sched code and I didn't see any error propagation. Is that supposed to work, or am I supposed to directly mark the drm_sched side fence as complete, or did I misunderstand all this? I get the feeling maybe existing drivers just rely on the recovery/timeout/etc paths to mark jobs as errored (since those do it explicitly) and never need error forwarding from the hw fence?
~~ Lina
On 09/03/2023 20.47, Christian König wrote:
Am 09.03.23 um 10:43 schrieb Asahi Lina:
On 09/03/2023 17.42, Christian König wrote:
Am 08.03.23 um 20:37 schrieb Asahi Lina:
On 09/03/2023 03.12, Christian König wrote:
Am 08.03.23 um 18:32 schrieb Asahi Lina:
[SNIP] Yes but... none of this cleans up jobs that are already submitted by the scheduler and in its pending list, with registered completion callbacks, which were already popped off of the entities.
*That* is the problem this patch fixes!
Ah! Yes that makes more sense now.
We could add a warning when users of this API don't do this correctly, but cleaning up incorrect API use is clearly something we don't want here.
It is the job of the Rust abstractions to make incorrect API use that leads to memory unsafety impossible. So even if you don't want that in C, it's my job to do that for Rust... and right now, I just can't because drm_sched doesn't provide an API that can be safely wrapped without weird bits of babysitting functionality on top (like tracking jobs outside or awkwardly making jobs hold a reference to the scheduler and defer dropping it to another thread).
Yeah, that was discussed before but rejected.
The argument was that the upper layer needs to wait for the hw to become idle before the scheduler can be destroyed anyway.
Unfortunately, that's not a requirement you can encode in the Rust type system easily as far as I know, and Rust safety rules mean we need to make it safe even if the upper layer doesn't do this... (or else we have to mark the entire drm_sched abstraction unsafe, but that would be a pity).
Yeah, that should really not be something we should do.
But you could make the scheduler depend on your fw context object, couldn't you?
Yes, and that would fix the problem for this driver, but it wouldn't make the abstraction safe. The thing is we have to make it *impossible* to misuse drm_sched in such a way that it crashes, at the Rust abstraction level. If we start depending on the driver following rules like that, that means the drm_sched abstraction has to be marked unsafe.
Detaching the scheduler from the underlying hw fences is certainly possible, but we removed that functionality because some people tried to force push some Windows recovery module into Linux. We are in the process of reverting that and cleaning things up once more, but that will take a while.
Okay, but I don't see why that should block the Rust abstractions...
Because even with removing the fence callback this is inherently unsafe.
You not only need to remove the callback, but also make sure that no parallel timeout handling is running.
If by that you mean that the timeout handling functions aren't being called by the driver, then that's implied. If the scheduler is being dropped, by definition there are no references left to call into the scheduler directly from the Rust side. So we only need to worry about what drm_sched itself does.
Right now the cleanup function tears down the timeout work at the end, but it probably makes sense to do it at the start? Then if we do that and stop the kthread, we can be really sure nothing else is accessing the scheduler and we can clean up without taking any locks:
Roughly:
void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
	sched->ready = false; /* Should probably do this first? */

	kthread_stop(sched->thread);
	cancel_delayed_work_sync(&sched->work_tdr);

	/* Clean up the pending_list here */
}
I'm also not sure what the rest of the drm_sched_fini() function is doing right now. It's going through all entities and removing them, and then wakes up entities stuck in drm_sched_entity_flush()... but didn't we just agree that the API requires users to tear down entities before tearing down the scheduler anyway?
~~ Lina
On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
On 09/03/2023 17.42, Christian König wrote:
Am 08.03.23 um 20:37 schrieb Asahi Lina:
On 09/03/2023 03.12, Christian König wrote:
Am 08.03.23 um 18:32 schrieb Asahi Lina:
[SNIP] Yes but... none of this cleans up jobs that are already submitted by the scheduler and in its pending list, with registered completion callbacks, which were already popped off of the entities.
*That* is the problem this patch fixes!
Ah! Yes that makes more sense now.
We could add a warning when users of this API don't do this correctly, but cleaning up incorrect API use is clearly something we don't want here.
It is the job of the Rust abstractions to make incorrect API use that leads to memory unsafety impossible. So even if you don't want that in C, it's my job to do that for Rust... and right now, I just can't because drm_sched doesn't provide an API that can be safely wrapped without weird bits of babysitting functionality on top (like tracking jobs outside or awkwardly making jobs hold a reference to the scheduler and defer dropping it to another thread).
Yeah, that was discussed before but rejected.
The argument was that the upper layer needs to wait for the hw to become idle before the scheduler can be destroyed anyway.
Unfortunately, that's not a requirement you can encode in the Rust type system easily as far as I know, and Rust safety rules mean we need to make it safe even if the upper layer doesn't do this... (or else we have to mark the entire drm_sched abstraction unsafe, but that would be a pity).
Yeah, that should really not be something we should do.
But you could make the scheduler depend on your fw context object, couldn't you?
Yes, and that would fix the problem for this driver, but it wouldn't make the abstraction safe. The thing is we have to make it *impossible* to misuse drm_sched in such a way that it crashes, at the Rust abstraction level. If we start depending on the driver following rules like that, that means the drm_sched abstraction has to be marked unsafe.
Detaching the scheduler from the underlying hw fences is certainly possible, but we removed that functionality because some people tried to force push some Windows recovery module into Linux. We are in the process of reverting that and cleaning things up once more, but that will take a while.
Okay, but I don't see why that should block the Rust abstractions... I don't even need a new API to do that, all I need is to know that drm_sched_fini() will do it so it won't crash when the hw fences complete later, as this patch does.
Instead of detaching, you could also block until the hw becomes idle, but if you do that synchronously on process termination you run into trouble as well.
Yes, but again this is something that can only be done at the driver level, so it doesn't solve the safe abstraction problem...
The firmware queue is itself reference counted and any firmware queue that has acquired an event notification resource (that is, which is busy with running or upcoming jobs) hands off a reference to itself into the event subsystem, so it can get notified of job completions by the firmware. Then once it becomes idle it unregisters itself, and at that point if it has no owning userspace queue, that would be the last reference and it gets dropped. So we don't tear down firmware queues until they are idle.
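A toy plain-Rust model of that lifecycle (illustrative names only, not the driver code): a busy firmware queue hands a reference to itself to the event subsystem, and is only freed once it goes idle and unregisters.

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Hypothetical stand-in for a firmware queue.
struct FwQueue {
    id: u32,
}

fn main() {
    // Event subsystem: one reference per busy queue, keyed by event slot.
    let mut registered: HashMap<u32, Rc<FwQueue>> = HashMap::new();

    let q = Rc::new(FwQueue { id: 1 });

    // Queue becomes busy: register a reference with the event subsystem.
    registered.insert(q.id, Rc::clone(&q));

    // The owning userspace queue goes away, but the fw queue survives
    // because the event subsystem still holds a reference.
    drop(q);
    assert!(registered.contains_key(&1));

    // Queue goes idle and unregisters: this is now the last reference,
    // so the fw queue is freed only here, after it is idle.
    let last = registered.remove(&1).unwrap();
    assert_eq!(Rc::strong_count(&last), 1);
}
```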
And could those fw queues not reference the scheduler?
Yes but again, that rule can't be encoded in the abstraction... so that makes it unsafe. The goal is to have a safe abstraction, which means that all the rules that you need to follow to avoid memory safety issues are checked by the Rust compiler.
I actually don't know of any way to actively abort jobs on the firmware, so this is pretty much the only option I have. I've even seen long-running compute jobs on macOS run to completion even if you kill the submitting process, so there might be no way to do this at all. Though in practice since we unmap everything from the VM anyway when the userspace stuff gets torn down, almost any normal GPU work is going to immediately fault at that point (macOS doesn't do this because macOS effectively does implicit sync with BO tracking at the kernel level...).
Oh, that is interesting information. How does macOS do explicit sync then, or isn't that supported at all?
They have the equivalent of sync objects at the UAPI level, but they also have the implicit stuff and their UAPI seems to always pass a BO list to the kernel as far as we could tell, even though it still works without it. I think it's a weird hybrid of explicit+implicit sync. From the Metal docs:
By default, Metal tracks the write hazards and synchronizes the resources (see Resource Fundamentals) you create from an MTLDevice and directly bind to a pipeline. However, Metal doesn’t, by default, track resources you allocate from an MTLHeap (see Memory Heaps).
So it's both, and you can override it...
At the firmware level, I've never seen Metal use queue barriers yet like I do (other than the vertex->fragment ones), so either they always do CPU round trips for cross-subqueue sync (render<->compute) or we just haven't figured out the magic combination to get it to do that yet. Honestly, I suspect they just always do it on the CPU. macOS is pretty ugly behind the scenes and it's pretty obvious a lot of their own driver was rushed (the firmware seems to support quite a few features the driver doesn't... maybe it even has a job abort mechanism, we just haven't found it yet).
Of course, our goal is to do things better than macOS (and we already do some things better!) but getting confident enough about firmware/HW details to diverge from what macOS does is tricky and a slow process...
By the way, I don't really use the hardware recovery stuff right now. I'm not even sure if there is a sensible way I could use it, since as I said we can't exactly abort jobs. I know there are ways to lock up the firmware/GPU, but so far those have all been things the kernel driver can prevent, and I'm not even sure if there is any way to recover from that anyway. The firmware itself has its own timeouts and recovery for "normal" problems. From the point of view of the driver and everything above it, in-flight commands during a GPU fault or timeout are just marked complete by the firmware, after a firmware recovery cycle where the driver gets notified of the problem (that's when we mark the commands failed so we can propagate the error).
Yeah, that's exactly what we have been telling our fw people for years: we need this as well.
Yeah, the ugly bit is that the firmware does a full GPU recovery even on simple page faults (which could be handled more gracefully) so even stuff like that can possibly break concurrent GPU work.
On the other hand, macOS configures things so page faults are ignored and shader reads silently return all-zeroes, which is how they implement sparse buffers/textures... and we'll probably have to do that too, to improve reliability against app faults if nothing else. But right now the driver enables explicit page faults for everything so we can debug Mesa (it's a kernel module parameter, GPU-global, and I haven't found a way to change it after initial load unfortunately, though it might be possible).
I think there's also a way to do actual page fault handling (like swap in pages and resume the GPU), but that's one of those firmware features Apple's driver just never uses as far as I can tell. There's so much unexplored territory...
There is no re-submission or anything, userspace just gets told of the problem but the queue survives.
In the future it might be possible to re-submit innocent commands
Long story short: Don't do this! This is what the Windows drivers have been doing and it creates tons of problems.
Yeah, we tried to do a bit of that in the GL days. It was a bad idea.
Just signal the problem back to userspace and let the user space driver decide what to do.
The background is that most graphics applications (games etc..) then rather start on the next frame instead of submitting the current one again, while compute applications make sure that they abort and tell the user that the calculations might be corrupted and need to be redone.
The guarantee that Vulkan makes is that, if you idle the GPU and you haven't gotten a DEVICE_LOST yet, your data is good. If you get a DEVICE_LOST, all bets are off. The problem is that, no matter how fast the error propagation may be in the kernel or userspace driver, errors can still show up in strange ways. An OOB buffer access could end up modifying a shader binary which gets run 3 frames later and causes a corruption. Once you've faulted, you really have no idea how far back is good or what memory is corrupted. You have to assume that everything mapped to the GPU VA space is potentially toast.
Then we're good with what we're currently doing, since we already notify userspace like that!
Actually I wanted to ask about error notifications. Right now we have an out-of-band mechanism to provide detailed fault info to userspace which works fine, but in principle it's optional.
This is fine, in principle. Because of the nature of errors, async is fine as long as the error shows up eventually. Faster is better, for sure, but error latency doesn't really matter in practice.
However, I also mark the hw fences as errored when a fault happens (with an errno that describes the overall situation), but that never makes it into the drm_sched job complete fence. I looked at the drm_sched code and I didn't see any error propagation. Is that supposed to work, or am I supposed to directly mark the drm_sched side fence as complete, or did I misunderstand all this? I get the feeling maybe existing drivers just rely on the recovery/timeout/etc paths to mark jobs as errored (since those do it explicitly) and never need error forwarding from the hw fence?
The end behavior needs to be that all fences for all jobs submitted to the queue get signaled. That's needed to satisfy the finite time guarantees of dma_fence. Exactly how that happens (let the job run, abort all the jobs, etc.) is an implementation detail for the driver to decide. If you want, you can also set a bit on the context (or queue) to mark it as dead and start returning EIO or similar from any ioctls trying to submit more work if you wanted. Not required but you can.
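The optional dead-queue bit could be sketched like this (plain Rust, hypothetical names; a model of the idea, not any real driver): once a fault kills the queue, further submissions fail with EIO while already-queued fences still get signaled.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

const EIO: i32 = 5;

// Hypothetical submission queue with a one-way "dead" flag.
struct Queue {
    dead: AtomicBool,
}

impl Queue {
    fn submit(&self) -> Result<(), i32> {
        // Reject new work once the queue has been marked dead.
        if self.dead.load(Ordering::Acquire) {
            return Err(EIO);
        }
        // ... build and push the job here ...
        Ok(())
    }

    fn kill(&self) {
        // Called from fault handling; in-flight fences still signal.
        self.dead.store(true, Ordering::Release);
    }
}

fn main() {
    let q = Queue { dead: AtomicBool::new(false) };
    assert!(q.submit().is_ok());
    q.kill();
    assert_eq!(q.submit(), Err(EIO));
}
```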
~Faith
On 10/03/2023 04.59, Faith Ekstrand wrote:
On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
On 09/03/2023 17.42, Christian König wrote:
Long story short: Don't do this! This is what the Windows drivers have been doing and it creates tons of problems.
Yeah, we tried to do a bit of that in the GL days. It was a bad idea.
I think I should clarify: I was proposing re-queueing innocent jobs from innocent queues/VMs that were impacted by a fault. The reason is that we may be able to tweak firmware state to force it to do that safely, during the firmware recovery cycle, such that an aborted job restarts and then subsequent jobs/commands continue as normal. We can't leave it to userspace because if we do nothing, the affected job ends up incomplete but then everything after it that is already queued still runs, and that is definitely a recipe for a bigger mess if userspace wants to seamlessly recover. The firmware recovery cycle is a "stop-the-world" situation for the GPU (the firmware literally busy-loops waiting for the driver to set a continue flag in memory...), so that's the only real chance that the driver gets to make decisions about what is going to happen next.
Of course, that only works if individual possibly concurrently running commands are idempotent, but I think a lot of typical GPU work is? (E.g. any render pass without side effects other than the render targets and where the background shader does no loads, or even render passes that do loads but where all draws are opaque, which are all things the current Gallium driver is intimately familiar with since Crazy Tiler Optimizations™ need that info to be provided anyway). So I was wondering whether it'd make sense to have such an idempotency/restartable flag on job submission, and then the driver would do its best to recover and rerun it if it gets killed by an unrelated concurrent bad job.
Then again this all depends on an investigation into what we *can* do during firmware recovery that hasn't happened at all yet. It might be that it isn't safe to do anything really, or that doing things depends on touching even deeper firmware state structs that we treat as opaque right now and we really don't want to have to touch...
But maybe none of this is worth it in practice, it just sounded like it could be useful maybe?
Now that I look at it, we have a lovely "what is this flag doing anyway" bit already passed from Mesa through to the firmware we called ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S which, now that I look at it, is actually getting set when any attachment (any color, Z, S) is not being cleared for that pass (so it's loaded). That could very well be an "is not idempotent" flag... and maybe that means the firmware does this for us already? Sounds like something to test... I might have some 16Kx16K GLmark runs to do concurrent with an evil faulting job now ^^ (and then that also means we need to set it when shaders have side effects and stuff, which right now we don't).
Just signal the problem back to userspace and let the user space driver decide what to do.
The background is that most graphics applications (games etc..) then rather start on the next frame instead of submitting the current one again, while compute applications make sure that they abort and tell the user that the calculations might be corrupted and need to be redone.
The guarantee that Vulkan makes is that, if you idle the GPU and you haven't gotten a DEVICE_LOST yet, your data is good. If you get a DEVICE_LOST, all bets are off. The problem is that, no matter how fast the error propagation may be in the kernel or userspace driver, errors can still show up in strange ways. An OOB buffer access could end up modifying a shader binary which gets run 3 frames later and causes a corruption. Once you've faulted, you really have no idea how far back is good or what memory is corrupted. You have to assume that everything mapped to the GPU VA space is potentially toast.
Yes of course, for the actually faulting VM all bets are off after a fault (though we can try a bit harder at least... I have a READ_ONLY BO flag now, I should set it on the shader pools!).
Actually I wanted to ask about error notifications. Right now we have an out-of-band mechanism to provide detailed fault info to userspace which works fine, but in principle it's optional.
This is fine, in principle. Because of the nature of errors, async is fine as long as the error shows up eventually. Faster is better, for sure, but error latency doesn't really matter in practice.
However, I also mark the hw fences as errored when a fault happens (with an errno that describes the overall situation), but that never makes it into the drm_sched job complete fence. I looked at the drm_sched code and I didn't see any error propagation. Is that supposed to work, or am I supposed to directly mark the drm_sched side fence as complete, or did I misunderstand all this? I get the feeling maybe existing drivers just rely on the recovery/timeout/etc paths to mark jobs as errored (since those do it explicitly) and never need error forwarding from the hw fence?
The end behavior needs to be that all fences for all jobs submitted to the queue get signaled. That's needed to satisfy the finite time guarantees of dma_fence. Exactly how that happens (let the job run, abort all the jobs, etc.) is an implementation detail for the driver to decide. If you want, you can also set a bit on the context (or queue) to mark it as dead and start returning EIO or similar from any ioctls trying to submit more work if you wanted. Not required but you can.
Fences have an error flag though; does that get reported to userspace somehow? I thought it did, but maybe not, or maybe only drm_sched not propagating it is the issue?
In other words, absent my fancy stats reporting BO system, what is the normal way that an explicit sync driver signals to userspace that the job associated with a syncobj has failed?
(If there is no way, then I'll probably want to change the stats BO system to be configurable, so if you ask for no stats/time info, you only get overall job status and faults, which has less overhead.)
~~ Lina
On Fri, 2023-03-10 at 18:58 +0900, Asahi Lina wrote:
On 10/03/2023 04.59, Faith Ekstrand wrote:
On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
On 09/03/2023 17.42, Christian König wrote:
Long story short: Don't do this! This is what the Windows drivers have been doing and it creates tons of problems.
Yeah, we tried to do a bit of that in the GL days. It was a bad idea.
I think I should clarify: I was proposing re-queueing innocent jobs from innocent queues/VMs that were impacted by a fault. The reason is that we may be able to tweak firmware state to force it to do that safely, during the firmware recovery cycle, such that an aborted job restarts and then subsequent jobs/commands continue as normal. We can't leave it to userspace because if we do nothing, the affected job ends up incomplete but then everything after it that is already queued still runs, and that is definitely a recipe for a bigger mess if userspace wants to seamlessly recover. The firmware recovery cycle is a "stop-the-world" situation for the GPU (the firmware literally busy-loops waiting for the driver to set a continue flag in memory...), so that's the only real chance that the driver gets to make decisions about what is going to happen next.
Ok, that makes sense. Yes, if you have other jobs on other queues and are able to recover everything that isn't in the faulting VM, that's a good thing. I wasn't sure how hang/fault recovery worked on AGX. In that case, I don't think there's a dma_fence problem. As long as you keep recovering and killing off any faulting contexts, eventually the good contexts should make progress and those fences should signal.
Of course, the firmware recovery cycle may be complex and need (or at least appear to) memory allocation or similar and that's where everything gets hairy. Hopefully, though, if you've already got the resources from the old context, you can re-use them after a bit of clean-up work and still get deterministic and reliable recovery cycles.
Of course, that only works if individual possibly concurrently running commands are idempotent, but I think a lot of typical GPU work is?
No, that's not a valid assumption. For a single 3D render pass which doesn't do any image or SSBO access, it may be possible to re-run it. However, that won't be true of compute work and isn't necessarily true of back-to-back passes. Lots of modern apps do temporal stuff where one frame depends on the previous and a re-run might screw that up. Also, with Vulkan's memory aliasing, it's hard to tell just from which resources are accessed whether or not a command buffer leaves its input memory undamaged.
(E.g. any render pass without side effects other than the render targets and where the background shader does no loads, or even render passes that do loads but where all draws are opaque, which are all things the current Gallium driver is intimately familiar with since Crazy Tiler Optimizations™ need that info to be provided anyway). So I was wondering whether it'd make sense to have such an idempotency/restartable flag on job submission, and then the driver would do its best to recover and rerun it if it gets killed by an unrelated concurrent bad job.
Then again this all depends on an investigation into what we *can* do during firmware recovery that hasn't happened at all yet. It might be that it isn't safe to do anything really, or that doing things depends on touching even deeper firmware state structs that we treat as opaque right now and we really don't want to have to touch...
But maybe none of this is worth it in practice, it just sounded like it could be useful maybe?
Maybe? It's not clear to me that such a flag would be useful or even practical to provide from the Mesa side. Ideally, you'd be able to figure out when a fault happens, what VM it happened in and exactly what work was in-flight when it happened and only kill the one guilty VM. However, it sounds like your understanding of the firmware is currently rough enough that doing so may not be practical. In that case, the best thing to do is to kill any VMs which were on the GPU at the time and hope the individual apps are able to recover.
Now that I look at it, we have a lovely "what is this flag doing anyway" bit already passed from Mesa through to the firmware we called ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S which, now that I look at it, is actually getting set when any attachment (any color, Z, S) is not being cleared for that pass (so it's loaded). That could very well be an "is not idempotent" flag... and maybe that means the firmware does this for us already? Sounds like something to test... I might have some 16Kx16K GLmark runs to do concurrent with an evil faulting job now ^^ (and then that also means we need to set it when shaders have side effects and stuff, which right now we don't).
Just signal the problem back to userspace and let the user space driver decide what to do.
The background is that most graphics applications (games etc..) then rather start on the next frame instead of submitting the current one again, while compute applications make sure that they abort and tell the user that the calculations might be corrupted and need to be redone.
The guarantee that Vulkan makes is that, if you idle the GPU and you haven't gotten a DEVICE_LOST yet, your data is good. If you get a DEVICE_LOST, all bets are off. The problem is that, no matter how fast the error propagation may be in the kernel or userspace driver, errors can still show up in strange ways. An OOB buffer access could end up modifying a shader binary which gets run 3 frames later and causes a corruption. Once you've faulted, you really have no idea how far back is good or what memory is corrupted. You have to assume that everything mapped to the GPU VA space is potentially toast.
Yes of course, for the actually faulting VM all bets are off after a fault (though we can try a bit harder at least... I have a READ_ONLY BO flag now, I should set it on the shader pools!).
Actually I wanted to ask about error notifications. Right now we have an out-of-band mechanism to provide detailed fault info to userspace which works fine, but in principle it's optional.
This is fine, in principle. Because of the nature of errors, async is fine as long as the error shows up eventually. Faster is better, for sure, but error latency doesn't really matter in practice.
However, I also mark the hw fences as errored when a fault happens (with an errno that describes the overall situation), but that never makes it into the drm_sched job complete fence. I looked at the drm_sched code and I didn't see any error propagation. Is that supposed to work, or am I supposed to directly mark the drm_sched side fence as complete, or did I misunderstand all this? I get the feeling maybe existing drivers just rely on the recovery/timeout/etc paths to mark jobs as errored (since those do it explicitly) and never need error forwarding from the hw fence?
The end behavior needs to be that all fences for all jobs submitted to the queue get signaled. That's needed to satisfy the finite time guarantees of dma_fence. Exactly how that happens (let the job run, abort all the jobs, etc.) is an implementation detail for the driver to decide. If you want, you can also set a bit on the context (or queue) to mark it as dead and start returning EIO or similar from any ioctls trying to submit more work if you wanted. Not required but you can.
Fences have an error flag though; does that get reported to userspace somehow? I thought it did, but maybe not, or maybe only drm_sched not propagating it is the issue?
In other words, absent my fancy stats reporting BO system, what is the normal way that an explicit sync driver signals to userspace that the job associated with a syncobj has failed?
One is via the return value from exec/submit. Often there's also a query mechanism for more detailed information. It's not particularly standard at the moment, I'm afraid. I could point you at i915 but I wouldn't call that uAPI something to be emulated, in general.
(If there is no way, then I'll probably want to change the stats BO system to be configurable, so if you ask for no stats/time info, you only get overall job status and faults, which has less overhead.)
There is an error but it doesn't automatically get propagated to userspace. So, for instance, a SYNCOBJ_WAIT ioctl won't return an error if it sees a fence error. It needs to get caught by the driver and returned through a driver ioctl somehow.
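So the driver has to forward the error itself. A toy model of that forwarding step (plain Rust, mirroring the shape of dma_fence_set_error(); illustrative only, not kernel code): the hw fence records an errno-style code, and completing the job copies it onto the job-done fence before signaling, which is the propagation drm_sched does not do automatically.

```rust
// Toy fence: an error may only be attached before the fence is signaled,
// mirroring the dma_fence_set_error() contract.
#[derive(Default)]
struct Fence {
    signaled: bool,
    error: Option<i32>,
}

impl Fence {
    fn set_error(&mut self, err: i32) {
        assert!(!self.signaled, "error must be set before signaling");
        self.error = Some(err);
    }

    fn signal(&mut self) {
        self.signaled = true;
    }
}

// Driver-side completion: forward any error from the hw fence onto the
// job-done fence, then signal it.
fn complete_job(hw: &Fence, done: &mut Fence) {
    if let Some(err) = hw.error {
        done.set_error(err);
    }
    done.signal();
}

fn main() {
    let mut hw = Fence::default();
    hw.set_error(-5); // e.g. -EIO after a GPU fault
    hw.signal();

    let mut done = Fence::default();
    complete_job(&hw, &mut done);
    assert!(done.signaled);
    assert_eq!(done.error, Some(-5));
}
```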
~Faith
On Tue, Mar 07, 2023 at 11:25:36PM +0900, Asahi Lina wrote:
drm_sched_fini() currently leaves any pending jobs dangling, which causes segfaults and other badness when job completion fences are signaled after the scheduler is torn down.
Explicitly detach all jobs from their completion callbacks and free them. This makes it possible to write a sensible safe abstraction for drm_sched, without having to externally duplicate the tracking of in-flight jobs.
This shouldn't regress any existing drivers, since calling drm_sched_fini() with any pending jobs is broken and this change should be a no-op if there are no pending jobs.
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/scheduler/sched_main.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5c0add2c7546..0aab1e0aebdd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
+	struct drm_sched_job *s_job, *tmp;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	if (!sched->thread)
+		return;
+
+	/*
+	 * Stop the scheduler, detaching all jobs from their hardware callbacks
+	 * and cleaning up complete jobs.
+	 */
+	drm_sched_stop(sched, NULL);
+
+	/*
+	 * Iterate through the pending job list and free all jobs.
+	 * This assumes the driver has either guaranteed jobs are already stopped, or that
+	 * otherwise it is responsible for keeping any necessary data structures for
+	 * in-progress jobs alive even when the free_job() callback is called early (e.g. by
+	 * putting them in its own queue or doing its own refcounting).
+	 */
This comment makes me wonder whether we shouldn't go one step further and have a drm_sched_quiescent, which waits for any in-flight jobs to complete and cancels everything else. Because even if rust guarantees that you don't have any memory bugs, if you just leak things by sprinkling reference-counted pointer wrappers everywhere you still have a semantic bug.
Except now it's much harder to realize that, because there's no Oops and KASAN doesn't tell you about it either. I think it would be much better if the scheduler code and rust abstraction provide drivers the correct lifetimes and very strongly encourage them to only have borrowed references and not additional refcounting of their own.
I think Christian mentioned that this would block in close() or context destruction, which is no good at all. And with the 1:1 drm_scheduler:drm_sched_entity design there's no other place. This is why I've suggested in the Xe threads that we should make the current drm_scheduler an implementation detail hidden from drivers, with a new drm_scheduler which is always per-engine for all cases as the driver api interface, and the internal scheduler attached to either that (for current drivers) or drm_sched_entity (for fw scheduling drivers) as needed. With that:
- the sched_entity cleanup could take care of this code here for the fw scheduler case
- the drm_sched_fini could take care of blocking appropriately before the driver is unloaded for any lagging in-flight jobs, without blocking userspace
- drivers should not end up with any need to reference-count either per-ctx/drm_sched_entity or per-drm_sched_job data, ever
Any comment along the lines of "drivers need to refcount" is bad business, because it either means leaks (rust) or crashes (C). I much prefer when drivers have to put in extra effort to get things wrong, because by default the lifetimes are Just Right(tm).
-Daniel
+	list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
+		spin_lock(&sched->job_list_lock);
+		list_del_init(&s_job->list);
+		spin_unlock(&sched->job_list_lock);
+		sched->ops->free_job(s_job);
+	}
+
+	kthread_stop(sched->thread);
+
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
-- 
2.35.1
The GPU scheduler manages scheduling GPU jobs and dependencies between them. This Rust abstraction allows Rust DRM drivers to use this functionality.
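For a rough feel of the driver-facing shape, here is a simplified std-Rust model of the trait this patch introduces (Job and Fence are plain stand-ins for the kernel wrappers, and MyJob is a hypothetical driver job type; this sketches the shape of the API, not the kernel code itself):

```rust
// Stand-ins for the kernel's fence and job wrappers.
struct Fence;
struct Job<T>(T);

// Simplified model of the JobImpl trait defined in this patch.
trait JobImpl: Sized {
    // Optional extra dependency to wait on before running (default: none).
    fn prepare(_job: &mut Job<Self>) -> Option<Fence> {
        None
    }
    // Whether the hardware currently has room for this job.
    fn can_run(_job: &mut Job<Self>) -> bool {
        true
    }
    // Submit the job to the hardware; returns its completion fence.
    fn run(job: &mut Job<Self>) -> Option<Fence>;
}

// A hypothetical driver job type.
struct MyJob {
    id: u32,
}

impl JobImpl for MyJob {
    fn run(job: &mut Job<Self>) -> Option<Fence> {
        println!("submitting job {} to hw", job.0.id);
        Some(Fence)
    }
}

fn main() {
    let mut j = Job(MyJob { id: 7 });
    assert!(MyJob::prepare(&mut j).is_none());
    assert!(MyJob::can_run(&mut j));
    assert!(MyJob::run(&mut j).is_some());
}
```

In the real abstraction the trait methods are invoked from the C scheduler through the `extern "C"` callback shims below, and the job is owned by the scheduler once pushed.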
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/Kconfig         |   5 +
 rust/bindings/bindings_helper.h |   1 +
 rust/helpers.c                  |   6 +
 rust/kernel/drm/mod.rs          |   2 +
 rust/kernel/drm/sched.rs        | 358 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 372 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 70a983a17ac2..8b5ad6aee126 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -39,6 +39,11 @@ config RUST_DRM_GEM_SHMEM_HELPER
 	depends on RUST_DRM
 	select DRM_GEM_SHMEM_HELPER
 
+config RUST_DRM_SCHED
+	bool
+	depends on RUST_DRM
+	select DRM_SCHED
+
 config DRM_MIPI_DBI
 	tristate
 	depends on DRM
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h
index b6696011f3a4..dc01be08676e 100644
--- a/rust/bindings/bindings_helper.h
+++ b/rust/bindings/bindings_helper.h
@@ -13,6 +13,7 @@
 #include <drm/drm_gem_shmem_helper.h>
 #include <drm/drm_ioctl.h>
 #include <drm/drm_syncobj.h>
+#include <drm/gpu_scheduler.h>
 #include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/dma-fence.h>
diff --git a/rust/helpers.c b/rust/helpers.c
index 11965b1e2f4e..1b33ed602090 100644
--- a/rust/helpers.c
+++ b/rust/helpers.c
@@ -408,6 +408,12 @@ void rust_helper___spin_lock_init(spinlock_t *lock, const char *name,
 }
 EXPORT_SYMBOL_GPL(rust_helper___spin_lock_init);
 
+unsigned long rust_helper_msecs_to_jiffies(const unsigned int m)
+{
+	return msecs_to_jiffies(m);
+}
+EXPORT_SYMBOL_GPL(rust_helper_msecs_to_jiffies);
+
 #ifdef CONFIG_DMA_SHARED_BUFFER
 void rust_helper_dma_fence_get(struct dma_fence *fence)
diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs
index dae98826edfd..3ddf7712aab3 100644
--- a/rust/kernel/drm/mod.rs
+++ b/rust/kernel/drm/mod.rs
@@ -8,4 +8,6 @@ pub mod file;
 pub mod gem;
 pub mod ioctl;
 pub mod mm;
+#[cfg(CONFIG_RUST_DRM_SCHED)]
+pub mod sched;
 pub mod syncobj;
diff --git a/rust/kernel/drm/sched.rs b/rust/kernel/drm/sched.rs
new file mode 100644
index 000000000000..a5275cc16179
--- /dev/null
+++ b/rust/kernel/drm/sched.rs
@@ -0,0 +1,358 @@
+// SPDX-License-Identifier: GPL-2.0 OR MIT
+
+//! DRM Scheduler
+//!
+//! C header: [`include/linux/drm/gpu_scheduler.h`](../../../../include/linux/drm/gpu_scheduler.h)
+
+use crate::{
+    bindings, device,
+    dma_fence::*,
+    error::{to_result, Result},
+    prelude::*,
+    sync::{Arc, UniqueArc},
+};
+use alloc::boxed::Box;
+use core::marker::PhantomData;
+use core::mem::MaybeUninit;
+use core::ops::{Deref, DerefMut};
+use core::ptr::addr_of_mut;
+
+/// Scheduler status after timeout recovery
+#[repr(u32)]
+pub enum Status {
+    /// Device recovered from the timeout and can execute jobs again
+    Nominal = bindings::drm_gpu_sched_stat_DRM_GPU_SCHED_STAT_NOMINAL,
+    /// Device is no longer available
+    NoDevice = bindings::drm_gpu_sched_stat_DRM_GPU_SCHED_STAT_ENODEV,
+}
+
+/// Scheduler priorities
+#[repr(i32)]
+pub enum Priority {
+    /// Low userspace priority
+    Min = bindings::drm_sched_priority_DRM_SCHED_PRIORITY_MIN,
+    /// Normal userspace priority
+    Normal = bindings::drm_sched_priority_DRM_SCHED_PRIORITY_NORMAL,
+    /// High userspace priority
+    High = bindings::drm_sched_priority_DRM_SCHED_PRIORITY_HIGH,
+    /// Kernel priority (highest)
+    Kernel = bindings::drm_sched_priority_DRM_SCHED_PRIORITY_KERNEL,
+}
+
+/// Trait to be implemented by driver job objects.
+pub trait JobImpl: Sized {
+    /// Called when the scheduler is considering scheduling this job next, to get another Fence
+    /// for this job to block on. Once it returns None, run() may be called.
+    fn prepare(_job: &mut Job<Self>) -> Option<Fence> {
+        None // Equivalent to NULL function pointer
+    }
+
+    /// Called before job execution to check whether the hardware is free enough to run the job.
+    /// This can be used to implement more complex hardware resource policies than the hw_submission
+    /// limit.
+    fn can_run(_job: &mut Job<Self>) -> bool {
+        true
+    }
+
+    /// Called to execute the job once all of the dependencies have been resolved. This may be
+    /// called multiple times, if timed_out() has happened and drm_sched_job_recovery() decides
+    /// to try it again.
+    fn run(job: &mut Job<Self>) -> Result<Option<Fence>>;
+
+    /// Called when a job has taken too long to execute, to trigger GPU recovery.
+    ///
+    /// This method is called in a workqueue context.
+    fn timed_out(job: &mut Job<Self>) -> Status;
+}
+
+unsafe extern "C" fn prepare_job_cb<T: JobImpl>(
+    sched_job: *mut bindings::drm_sched_job,
+    _s_entity: *mut bindings::drm_sched_entity,
+) -> *mut bindings::dma_fence {
+    // SAFETY: All of our jobs are Job<T>.
+    let p = crate::container_of!(sched_job, Job<T>, job) as *mut Job<T>;
+
+    match T::prepare(unsafe { &mut *p }) {
+        None => core::ptr::null_mut(),
+        Some(fence) => fence.into_raw(),
+    }
+}
+
+unsafe extern "C" fn run_job_cb<T: JobImpl>(
+    sched_job: *mut bindings::drm_sched_job,
+) -> *mut bindings::dma_fence {
+    // SAFETY: All of our jobs are Job<T>.
+    let p = crate::container_of!(sched_job, Job<T>, job) as *mut Job<T>;
+
+    match T::run(unsafe { &mut *p }) {
+        Err(e) => e.to_ptr(),
+        Ok(None) => core::ptr::null_mut(),
+        Ok(Some(fence)) => fence.into_raw(),
+    }
+}
+
+unsafe extern "C" fn can_run_job_cb<T: JobImpl>(sched_job: *mut bindings::drm_sched_job) -> bool {
+    // SAFETY: All of our jobs are Job<T>.
+    let p = crate::container_of!(sched_job, Job<T>, job) as *mut Job<T>;
+
+    T::can_run(unsafe { &mut *p })
+}
+
+unsafe extern "C" fn timedout_job_cb<T: JobImpl>(
+    sched_job: *mut bindings::drm_sched_job,
+) -> bindings::drm_gpu_sched_stat {
+    // SAFETY: All of our jobs are Job<T>.
+    let p = crate::container_of!(sched_job, Job<T>, job) as *mut Job<T>;
+
+    T::timed_out(unsafe { &mut *p }) as bindings::drm_gpu_sched_stat
+}
+
+unsafe extern "C" fn free_job_cb<T: JobImpl>(sched_job: *mut bindings::drm_sched_job) {
+    // SAFETY: All of our jobs are Job<T>.
+    let p = crate::container_of!(sched_job, Job<T>, job) as *mut Job<T>;
+
+    // Convert the job back to a Box and drop it
+    // SAFETY: All of our Job<T>s are created inside a box.
+    unsafe { Box::from_raw(p) };
+}
+
+/// A DRM scheduler job.
+pub struct Job<T: JobImpl> {
+    job: bindings::drm_sched_job,
+    inner: T,
+}
+
+impl<T: JobImpl> Deref for Job<T> {
+    type Target = T;
+
+    fn deref(&self) -> &Self::Target {
+        &self.inner
+    }
+}
+
+impl<T: JobImpl> DerefMut for Job<T> {
+    fn deref_mut(&mut self) -> &mut Self::Target {
+        &mut self.inner
+    }
+}
+
+impl<T: JobImpl> Drop for Job<T> {
+    fn drop(&mut self) {
+        // SAFETY: At this point the job has either been submitted and this is being called from
+        // `free_job_cb` above, or it hasn't and it is safe to call `drm_sched_job_cleanup`.
+ unsafe { bindings::drm_sched_job_cleanup(&mut self.job) }; + } +} + +/// A pending DRM scheduler job (not yet armed) +pub struct PendingJob<'a, T: JobImpl>(Box<Job<T>>, PhantomData<&'a T>); + +impl<'a, T: JobImpl> PendingJob<'a, T> { + /// Add a fence as a dependency to the job + pub fn add_dependency(&mut self, fence: Fence) -> Result { + to_result(unsafe { + bindings::drm_sched_job_add_dependency(&mut self.0.job, fence.into_raw()) + }) + } + + /// Arm the job to make it ready for execution + pub fn arm(mut self) -> ArmedJob<'a, T> { + unsafe { bindings::drm_sched_job_arm(&mut self.0.job) }; + ArmedJob(self.0, PhantomData) + } +} + +impl<'a, T: JobImpl> Deref for PendingJob<'a, T> { + type Target = Job<T>; + + fn deref(&self) -> &Self::Target { + &self.0 + } +} + +impl<'a, T: JobImpl> DerefMut for PendingJob<'a, T> { + fn deref_mut(&mut self) -> &mut Self::Target { + &mut self.0 + } +} + +/// An armed DRM scheduler job (not yet submitted) +pub struct ArmedJob<'a, T: JobImpl>(Box<Job<T>>, PhantomData<&'a T>); + +impl<'a, T: JobImpl> ArmedJob<'a, T> { + /// Returns the job fences + pub fn fences(&self) -> JobFences<'_> { + JobFences(unsafe { &mut *self.0.job.s_fence }) + } + + /// Push the job for execution into the scheduler + pub fn push(self) { + // After this point, the job is submitted and owned by the scheduler + let ptr = match self { + ArmedJob(job, _) => Box::<Job<T>>::into_raw(job), + }; + + // SAFETY: We are passing in ownership of a valid Box raw pointer. 
+ unsafe { bindings::drm_sched_entity_push_job(addr_of_mut!((*ptr).job)) }; + } +} +impl<'a, T: JobImpl> Deref for ArmedJob<'a, T> { + type Target = Job<T>; + + fn deref(&self) -> &Self::Target { + &self.0 + } +} + +impl<'a, T: JobImpl> DerefMut for ArmedJob<'a, T> { + fn deref_mut(&mut self) -> &mut Self::Target { + &mut self.0 + } +} + +/// Reference to the bundle of fences attached to a DRM scheduler job +pub struct JobFences<'a>(&'a mut bindings::drm_sched_fence); + +impl<'a> JobFences<'a> { + /// Returns a new reference to the job scheduled fence. + pub fn scheduled(&mut self) -> Fence { + unsafe { Fence::get_raw(&mut self.0.scheduled) } + } + + /// Returns a new reference to the job finished fence. + pub fn finished(&mut self) -> Fence { + unsafe { Fence::get_raw(&mut self.0.finished) } + } +} + +struct EntityInner<T: JobImpl> { + entity: bindings::drm_sched_entity, + // TODO: Allow users to share guilty flag between entities + sched: Arc<SchedulerInner<T>>, + guilty: bindings::atomic_t, + _p: PhantomData<T>, +} + +impl<T: JobImpl> Drop for EntityInner<T> { + fn drop(&mut self) { + // SAFETY: The EntityInner is initialized. This will cancel/free all jobs. + unsafe { bindings::drm_sched_entity_destroy(&mut self.entity) }; + } +} + +// SAFETY: TODO +unsafe impl<T: JobImpl> Sync for EntityInner<T> {} +unsafe impl<T: JobImpl> Send for EntityInner<T> {} + +/// A DRM scheduler entity. +pub struct Entity<T: JobImpl>(Pin<Box<EntityInner<T>>>); + +impl<T: JobImpl> Entity<T> { + /// Create a new scheduler entity. + pub fn new(sched: &Scheduler<T>, priority: Priority) -> Result<Self> { + let mut entity: Box<MaybeUninit<EntityInner<T>>> = Box::try_new_zeroed()?; + + let mut sched_ptr = &sched.0.sched as *const _ as *mut _; + + // SAFETY: The Box is allocated above and valid. 
+ unsafe { + bindings::drm_sched_entity_init( + addr_of_mut!((*entity.as_mut_ptr()).entity), + priority as _, + &mut sched_ptr, + 1, + addr_of_mut!((*entity.as_mut_ptr()).guilty), + ) + }; + + // SAFETY: The Box is allocated above and valid. + unsafe { addr_of_mut!((*entity.as_mut_ptr()).sched).write(sched.0.clone()) }; + + // SAFETY: entity is now initialized. + Ok(Self(Pin::from(unsafe { entity.assume_init() }))) + } + + /// Create a new job on this entity. + /// + /// The entity must outlive the pending job until it transitions into the submitted state, + /// after which the scheduler owns it. + pub fn new_job(&self, inner: T) -> Result<PendingJob<'_, T>> { + let mut job: Box<MaybeUninit<Job<T>>> = Box::try_new_zeroed()?; + + // SAFETY: We hold a reference to the entity (which is a valid pointer), + // and the job object was just allocated above. + to_result(unsafe { + bindings::drm_sched_job_init( + addr_of_mut!((*job.as_mut_ptr()).job), + &self.0.as_ref().get_ref().entity as *const _ as *mut _, + core::ptr::null_mut(), + ) + })?; + + // SAFETY: The Box pointer is valid, and this initializes the inner member. + unsafe { addr_of_mut!((*job.as_mut_ptr()).inner).write(inner) }; + + // SAFETY: All fields of the Job<T> are now initialized. + Ok(PendingJob(unsafe { job.assume_init() }, PhantomData)) + } +} + +/// DRM scheduler inner data +pub struct SchedulerInner<T: JobImpl> { + sched: bindings::drm_gpu_scheduler, + _p: PhantomData<T>, +} + +impl<T: JobImpl> Drop for SchedulerInner<T> { + fn drop(&mut self) { + // SAFETY: The scheduler is valid. This assumes drm_sched_fini() will take care of + // freeing all in-progress jobs. 
+ unsafe { bindings::drm_sched_fini(&mut self.sched) }; + } +} + +// SAFETY: TODO +unsafe impl<T: JobImpl> Sync for SchedulerInner<T> {} +unsafe impl<T: JobImpl> Send for SchedulerInner<T> {} + +/// A DRM Scheduler +pub struct Scheduler<T: JobImpl>(Arc<SchedulerInner<T>>); + +impl<T: JobImpl> Scheduler<T> { + const OPS: bindings::drm_sched_backend_ops = bindings::drm_sched_backend_ops { + prepare_job: Some(prepare_job_cb::<T>), + can_run_job: Some(can_run_job_cb::<T>), + run_job: Some(run_job_cb::<T>), + timedout_job: Some(timedout_job_cb::<T>), + free_job: Some(free_job_cb::<T>), + }; + /// Creates a new DRM Scheduler object + // TODO: Shared timeout workqueues & scores + pub fn new( + device: &impl device::RawDevice, + hw_submission: u32, + hang_limit: u32, + timeout_ms: usize, + name: &'static CStr, + ) -> Result<Scheduler<T>> { + let mut sched: UniqueArc<MaybeUninit<SchedulerInner<T>>> = UniqueArc::try_new_uninit()?; + + // SAFETY: The drm_sched pointer is valid and pinned as it was just allocated above. + to_result(unsafe { + bindings::drm_sched_init( + addr_of_mut!((*sched.as_mut_ptr()).sched), + &Self::OPS, + hw_submission, + hang_limit, + bindings::msecs_to_jiffies(timeout_ms.try_into()?).try_into()?, + core::ptr::null_mut(), + core::ptr::null_mut(), + name.as_char_ptr(), + device.raw_device(), + ) + })?; + + // SAFETY: All fields of SchedulerInner are now initialized. + Ok(Scheduler(unsafe { sched.assume_init() }.into())) + } +}
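For readers skimming the abstraction, the intended driver-side flow of the Pending -> Armed -> pushed state machine can be sketched as a plain userspace mock (hypothetical names, no kernel bindings involved); the real types wrap drm_sched_job and call into the C scheduler at each transition:

```rust
struct Job {
    id: u32,
}

// A job that has been initialized (drm_sched_job_init) but not armed.
struct PendingJob(Job);

// A job that has been armed (drm_sched_job_arm) but not pushed.
struct ArmedJob(Job);

impl PendingJob {
    // Consumes the pending job, so it cannot be armed twice.
    fn arm(self) -> ArmedJob {
        ArmedJob(self.0)
    }
}

impl ArmedJob {
    // Consumes the armed job, modeling drm_sched_entity_push_job() taking
    // ownership; afterwards the driver cannot touch the job anymore.
    fn push(self) -> u32 {
        self.0.id
    }
}

fn main() {
    let pending = PendingJob(Job { id: 42 });
    // pending.push(); // would not compile: push() only exists on ArmedJob
    let armed = pending.arm();
    let id = armed.push();
    // armed.push(); // would not compile: push(self) consumed the job
    println!("{id}");
}
```

The consuming-`self` signatures are what encode the one-way state transitions at compile time.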
On Tue, Mar 07, 2023 at 11:25:37PM +0900, Asahi Lina wrote:
The GPU scheduler manages scheduling GPU jobs and dependencies between them. This Rust abstraction allows Rust DRM drivers to use this functionality.
Signed-off-by: Asahi Lina lina@asahilina.net
Overall (with my limited rust knowledge) I really like this, it nicely encodes the state transitions of jobs and anything else I looked into. Some thoughts/questions below.
drivers/gpu/drm/Kconfig | 5 + rust/bindings/bindings_helper.h | 1 + rust/helpers.c | 6 + rust/kernel/drm/mod.rs | 2 + rust/kernel/drm/sched.rs | 358 ++++++++++++++++++++++++++++++++++++++++ 5 files changed, 372 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index 70a983a17ac2..8b5ad6aee126 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -39,6 +39,11 @@ config RUST_DRM_GEM_SHMEM_HELPER depends on RUST_DRM select DRM_GEM_SHMEM_HELPER +config RUST_DRM_SCHED
+	bool
+	depends on RUST_DRM
+	select DRM_SCHED
config DRM_MIPI_DBI tristate depends on DRM diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index b6696011f3a4..dc01be08676e 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -13,6 +13,7 @@ #include <drm/drm_gem_shmem_helper.h> #include <drm/drm_ioctl.h> #include <drm/drm_syncobj.h> +#include <drm/gpu_scheduler.h> #include <linux/delay.h> #include <linux/device.h> #include <linux/dma-fence.h> diff --git a/rust/helpers.c b/rust/helpers.c index 11965b1e2f4e..1b33ed602090 100644 --- a/rust/helpers.c +++ b/rust/helpers.c @@ -408,6 +408,12 @@ void rust_helper___spin_lock_init(spinlock_t *lock, const char *name, } EXPORT_SYMBOL_GPL(rust_helper___spin_lock_init); +unsigned long rust_helper_msecs_to_jiffies(const unsigned int m) +{
+	return msecs_to_jiffies(m);
+}
+EXPORT_SYMBOL_GPL(rust_helper_msecs_to_jiffies);
#ifdef CONFIG_DMA_SHARED_BUFFER void rust_helper_dma_fence_get(struct dma_fence *fence) diff --git a/rust/kernel/drm/mod.rs b/rust/kernel/drm/mod.rs index dae98826edfd..3ddf7712aab3 100644 --- a/rust/kernel/drm/mod.rs +++ b/rust/kernel/drm/mod.rs @@ -8,4 +8,6 @@ pub mod file; pub mod gem; pub mod ioctl; pub mod mm; +#[cfg(CONFIG_RUST_DRM_SCHED)] +pub mod sched; pub mod syncobj; diff --git a/rust/kernel/drm/sched.rs b/rust/kernel/drm/sched.rs new file mode 100644 index 000000000000..a5275cc16179 --- /dev/null +++ b/rust/kernel/drm/sched.rs @@ -0,0 +1,358 @@ +// SPDX-License-Identifier: GPL-2.0 OR MIT
+//! DRM Scheduler
+//!
+//! C header: [`include/drm/gpu_scheduler.h`](../../../../include/drm/gpu_scheduler.h)
+use crate::{
- bindings, device,
- dma_fence::*,
- error::{to_result, Result},
- prelude::*,
- sync::{Arc, UniqueArc},
+}; +use alloc::boxed::Box; +use core::marker::PhantomData; +use core::mem::MaybeUninit; +use core::ops::{Deref, DerefMut}; +use core::ptr::addr_of_mut;
+/// Scheduler status after timeout recovery +#[repr(u32)] +pub enum Status {
+    /// Device recovered from the timeout and can execute jobs again
+    Nominal = bindings::drm_gpu_sched_stat_DRM_GPU_SCHED_STAT_NOMINAL,
+    /// Device is no longer available
+    NoDevice = bindings::drm_gpu_sched_stat_DRM_GPU_SCHED_STAT_ENODEV,
+}
+/// Scheduler priorities +#[repr(i32)] +pub enum Priority {
+    /// Low userspace priority
+    Min = bindings::drm_sched_priority_DRM_SCHED_PRIORITY_MIN,
+    /// Normal userspace priority
+    Normal = bindings::drm_sched_priority_DRM_SCHED_PRIORITY_NORMAL,
+    /// High userspace priority
+    High = bindings::drm_sched_priority_DRM_SCHED_PRIORITY_HIGH,
+    /// Kernel priority (highest)
+    Kernel = bindings::drm_sched_priority_DRM_SCHED_PRIORITY_KERNEL,
+}
+/// Trait to be implemented by driver job objects. +pub trait JobImpl: Sized {
- /// Called when the scheduler is considering scheduling this job next, to get another Fence
- /// for this job to block on. Once it returns None, run() may be called.
- fn prepare(_job: &mut Job<Self>) -> Option<Fence> {
So if I get this all right then Job<T> allows us to nicely parametrize the job with the driver structure itself, but not really anything else. I do wonder whether this needs a bit more, with a type both for the job and the entity, and the drm/sched code + rust wrapper guaranteeing that the lifetimes of these make sense. With just the job parametrized, drivers need to make sure they refcount anything else hanging off of that properly, which means if they get some detail wrong there might be an unintentional leak.
If we instead also give a parametrized entity where the driver can stuff anything necessary, and the sched code guarantees that it'll clean up any mess on teardown and that the entity survives, I think a lot of drivers could benefit from that: it would be easier for them to get the right lifetimes for everything, with no leaks.
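One possible shape for that idea, sketched as a plain userspace mock (the associated type and all names here are hypothetical, not part of this patch): the job trait also names a driver-private entity payload, which the entity wrapper owns and drops on teardown, so jobs can borrow it without any ad-hoc refcounting by the driver.

```rust
// Hypothetical sketch: parametrize the entity as well as the job, so
// driver-private state lives in (and dies with) the entity wrapper.

trait JobImpl: Sized {
    /// Driver-private data hanging off the entity; the wrapper guarantees it
    /// outlives every job run on this entity and frees it on teardown.
    type EntityData;

    fn run(&mut self, entity_data: &Self::EntityData) -> u32;
}

struct Entity<T: JobImpl> {
    data: T::EntityData,
}

impl<T: JobImpl> Entity<T> {
    fn new(data: T::EntityData) -> Self {
        Entity { data }
    }

    // Stand-in for the scheduler invoking run_job: the entity data is passed
    // by reference, with its lifetime managed here instead of by the driver.
    fn run_job(&self, job: &mut T) -> u32 {
        job.run(&self.data)
    }
}

struct MyJob(u32);

impl JobImpl for MyJob {
    type EntityData = u32; // e.g. a per-entity firmware queue id

    fn run(&mut self, entity_data: &u32) -> u32 {
        self.0 + entity_data
    }
}

fn main() {
    let entity = Entity::<MyJob>::new(10);
    println!("{}", entity.run_job(&mut MyJob(5)));
}
```

The point of the sketch is only the ownership shape: the driver never has to refcount the per-entity data, because the wrapper's drop order guarantees it outlives the jobs.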
None // Equivalent to NULL function pointer
- }
- /// Called before job execution to check whether the hardware is free enough to run the job.
- /// This can be used to implement more complex hardware resource policies than the hw_submission
- /// limit.
- fn can_run(_job: &mut Job<Self>) -> bool {
true
- }
- /// Called to execute the job once all of the dependencies have been resolved. This may be
- /// called multiple times, if timed_out() has happened and drm_sched_job_recovery() decides
- /// to try it again.
- fn run(job: &mut Job<Self>) -> Result<Option<Fence>>;
- /// Called when a job has taken too long to execute, to trigger GPU recovery.
- ///
- /// This method is called in a workqueue context.
- fn timed_out(job: &mut Job<Self>) -> Status;
+}
+unsafe extern "C" fn prepare_job_cb<T: JobImpl>(
- sched_job: *mut bindings::drm_sched_job,
- _s_entity: *mut bindings::drm_sched_entity,
+) -> *mut bindings::dma_fence {
- // SAFETY: All of our jobs are Job<T>.
- let p = crate::container_of!(sched_job, Job<T>, job) as *mut Job<T>;
- match T::prepare(unsafe { &mut *p }) {
None => core::ptr::null_mut(),
Some(fence) => fence.into_raw(),
- }
+}
+unsafe extern "C" fn run_job_cb<T: JobImpl>(
- sched_job: *mut bindings::drm_sched_job,
+) -> *mut bindings::dma_fence {
- // SAFETY: All of our jobs are Job<T>.
- let p = crate::container_of!(sched_job, Job<T>, job) as *mut Job<T>;
- match T::run(unsafe { &mut *p }) {
Err(e) => e.to_ptr(),
Ok(None) => core::ptr::null_mut(),
Ok(Some(fence)) => fence.into_raw(),
- }
+}
+unsafe extern "C" fn can_run_job_cb<T: JobImpl>(sched_job: *mut bindings::drm_sched_job) -> bool {
- // SAFETY: All of our jobs are Job<T>.
- let p = crate::container_of!(sched_job, Job<T>, job) as *mut Job<T>;
- T::can_run(unsafe { &mut *p })
+}
+unsafe extern "C" fn timedout_job_cb<T: JobImpl>(
- sched_job: *mut bindings::drm_sched_job,
+) -> bindings::drm_gpu_sched_stat {
- // SAFETY: All of our jobs are Job<T>.
- let p = crate::container_of!(sched_job, Job<T>, job) as *mut Job<T>;
- T::timed_out(unsafe { &mut *p }) as bindings::drm_gpu_sched_stat
+}
+unsafe extern "C" fn free_job_cb<T: JobImpl>(sched_job: *mut bindings::drm_sched_job) {
- // SAFETY: All of our jobs are Job<T>.
- let p = crate::container_of!(sched_job, Job<T>, job) as *mut Job<T>;
- // Convert the job back to a Box and drop it
- // SAFETY: All of our Job<T>s are created inside a box.
- unsafe { Box::from_raw(p) };
+}
+/// A DRM scheduler job. +pub struct Job<T: JobImpl> {
- job: bindings::drm_sched_job,
- inner: T,
+}
+impl<T: JobImpl> Deref for Job<T> {
- type Target = T;
- fn deref(&self) -> &Self::Target {
&self.inner
- }
+}
+impl<T: JobImpl> DerefMut for Job<T> {
- fn deref_mut(&mut self) -> &mut Self::Target {
&mut self.inner
- }
+}
+impl<T: JobImpl> Drop for Job<T> {
- fn drop(&mut self) {
// SAFETY: At this point the job has either been submitted and this is being called from
// `free_job_cb` above, or it hasn't and it is safe to call `drm_sched_job_cleanup`.
unsafe { bindings::drm_sched_job_cleanup(&mut self.job) };
- }
+}
+/// A pending DRM scheduler job (not yet armed) +pub struct PendingJob<'a, T: JobImpl>(Box<Job<T>>, PhantomData<&'a T>);
+impl<'a, T: JobImpl> PendingJob<'a, T> {
- /// Add a fence as a dependency to the job
- pub fn add_dependency(&mut self, fence: Fence) -> Result {
to_result(unsafe {
bindings::drm_sched_job_add_dependency(&mut self.0.job, fence.into_raw())
})
- }
- /// Arm the job to make it ready for execution
- pub fn arm(mut self) -> ArmedJob<'a, T> {
unsafe { bindings::drm_sched_job_arm(&mut self.0.job) };
ArmedJob(self.0, PhantomData)
- }
+}
+impl<'a, T: JobImpl> Deref for PendingJob<'a, T> {
- type Target = Job<T>;
- fn deref(&self) -> &Self::Target {
&self.0
- }
+}
+impl<'a, T: JobImpl> DerefMut for PendingJob<'a, T> {
- fn deref_mut(&mut self) -> &mut Self::Target {
&mut self.0
- }
+}
+/// An armed DRM scheduler job (not yet submitted) +pub struct ArmedJob<'a, T: JobImpl>(Box<Job<T>>, PhantomData<&'a T>);
+impl<'a, T: JobImpl> ArmedJob<'a, T> {
- /// Returns the job fences
- pub fn fences(&self) -> JobFences<'_> {
JobFences(unsafe { &mut *self.0.job.s_fence })
- }
- /// Push the job for execution into the scheduler
- pub fn push(self) {
// After this point, the job is submitted and owned by the scheduler
let ptr = match self {
ArmedJob(job, _) => Box::<Job<T>>::into_raw(job),
};
If I get this all right then this all makes sure that drivers can't use the job after push and that they don't forget to call arm.
What I'm not seeing is how we force drivers to call push once they've called arm? I haven't checked what the code does, but from the docs it sounds like if you don't call push then drop will get called, which wrecks the book-keeping on an armed job. Or is there something that prevents ArmedJob<T> from having the Drop trait, so that the only way to not go boom is by pushing it?
Googling for "rust undroppable" seems to indicate that this isn't a thing rust can do?
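Indeed, Rust has no undroppable (linear) types today. The closest available idiom is a runtime "drop bomb": panic in Drop unless the value was consumed. A hedged userspace sketch of what that could look like for this type (hypothetical, not what the patch currently does):

```rust
use std::mem::ManuallyDrop;

// Drop-bomb sketch: dropping an armed job without pushing it is a driver bug,
// so the Drop impl panics instead of silently wrecking the book-keeping.
struct ArmedJob {
    inner: ManuallyDrop<String>, // stands in for the Box<Job<T>>
}

impl ArmedJob {
    fn push(mut self) -> String {
        // SAFETY (mock): inner is taken exactly once, then Drop is skipped.
        let job = unsafe { ManuallyDrop::take(&mut self.inner) };
        std::mem::forget(self); // defuse the bomb: Drop never runs
        job
    }
}

impl Drop for ArmedJob {
    fn drop(&mut self) {
        panic!("ArmedJob dropped without being pushed");
    }
}

fn main() {
    let armed = ArmedJob { inner: ManuallyDrop::new("job".to_string()) };
    assert_eq!(armed.push(), "job");

    // Dropping without push() trips the bomb at runtime:
    let trip = std::panic::catch_unwind(|| {
        let _armed = ArmedJob { inner: ManuallyDrop::new("oops".to_string()) };
    });
    assert!(trip.is_err());
    println!("ok");
}
```

This only turns the misuse into a loud runtime failure, not a compile error; the compile-time guarantee would need the C side to grow an "unarm" cleanup path that Drop could call instead.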
// SAFETY: We are passing in ownership of a valid Box raw pointer.
unsafe { bindings::drm_sched_entity_push_job(addr_of_mut!((*ptr).job)) };
- }
+} +impl<'a, T: JobImpl> Deref for ArmedJob<'a, T> {
- type Target = Job<T>;
- fn deref(&self) -> &Self::Target {
&self.0
- }
+}
+impl<'a, T: JobImpl> DerefMut for ArmedJob<'a, T> {
- fn deref_mut(&mut self) -> &mut Self::Target {
&mut self.0
- }
+}
+/// Reference to the bundle of fences attached to a DRM scheduler job +pub struct JobFences<'a>(&'a mut bindings::drm_sched_fence);
+impl<'a> JobFences<'a> {
- /// Returns a new reference to the job scheduled fence.
- pub fn scheduled(&mut self) -> Fence {
unsafe { Fence::get_raw(&mut self.0.scheduled) }
This feels a bit murky, because the safety of this relies on the safety of the ArmedJob and the guarantee (promise?) that the driver will push it. I'd just have two functions scheduled_fence and finished_fence in the ArmedJob impl and one safety note explaining why we can wrap it in the refcounted Fence.
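A userspace mock of that suggestion (hypothetical names; in the kernel these accessors would return refcounted dma_fence wrappers rather than plain integers):

```rust
// Mock fence bundle; in drm_sched this is struct drm_sched_fence.
struct SchedFence {
    scheduled: u32,
    finished: u32,
}

struct ArmedJob {
    s_fence: SchedFence,
}

impl ArmedJob {
    // Exposing the accessors only on ArmedJob keeps the safety argument
    // local: s_fence is only guaranteed valid once the job has been armed,
    // so no separate JobFences handle has to restate that invariant.
    fn scheduled_fence(&self) -> u32 {
        self.s_fence.scheduled
    }

    fn finished_fence(&self) -> u32 {
        self.s_fence.finished
    }
}

fn main() {
    let job = ArmedJob { s_fence: SchedFence { scheduled: 1, finished: 2 } };
    println!("{} {}", job.scheduled_fence(), job.finished_fence());
}
```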
- }
- /// Returns a new reference to the job finished fence.
- pub fn finished(&mut self) -> Fence {
unsafe { Fence::get_raw(&mut self.0.finished) }
- }
+}
+struct EntityInner<T: JobImpl> {
- entity: bindings::drm_sched_entity,
- // TODO: Allow users to share guilty flag between entities
- sched: Arc<SchedulerInner<T>>,
- guilty: bindings::atomic_t,
- _p: PhantomData<T>,
+}
+impl<T: JobImpl> Drop for EntityInner<T> {
- fn drop(&mut self) {
// SAFETY: The EntityInner is initialized. This will cancel/free all jobs.
unsafe { bindings::drm_sched_entity_destroy(&mut self.entity) };
- }
+}
+// SAFETY: TODO +unsafe impl<T: JobImpl> Sync for EntityInner<T> {} +unsafe impl<T: JobImpl> Send for EntityInner<T> {}
+/// A DRM scheduler entity. +pub struct Entity<T: JobImpl>(Pin<Box<EntityInner<T>>>);
+impl<T: JobImpl> Entity<T> {
- /// Create a new scheduler entity.
- pub fn new(sched: &Scheduler<T>, priority: Priority) -> Result<Self> {
let mut entity: Box<MaybeUninit<EntityInner<T>>> = Box::try_new_zeroed()?;
let mut sched_ptr = &sched.0.sched as *const _ as *mut _;
// SAFETY: The Box is allocated above and valid.
unsafe {
bindings::drm_sched_entity_init(
addr_of_mut!((*entity.as_mut_ptr()).entity),
priority as _,
&mut sched_ptr,
1,
addr_of_mut!((*entity.as_mut_ptr()).guilty),
)
};
// SAFETY: The Box is allocated above and valid.
unsafe { addr_of_mut!((*entity.as_mut_ptr()).sched).write(sched.0.clone()) };
// SAFETY: entity is now initialized.
Ok(Self(Pin::from(unsafe { entity.assume_init() })))
- }
- /// Create a new job on this entity.
- ///
- /// The entity must outlive the pending job until it transitions into the submitted state,
- /// after which the scheduler owns it.
- pub fn new_job(&self, inner: T) -> Result<PendingJob<'_, T>> {
let mut job: Box<MaybeUninit<Job<T>>> = Box::try_new_zeroed()?;
// SAFETY: We hold a reference to the entity (which is a valid pointer),
// and the job object was just allocated above.
to_result(unsafe {
bindings::drm_sched_job_init(
addr_of_mut!((*job.as_mut_ptr()).job),
&self.0.as_ref().get_ref().entity as *const _ as *mut _,
core::ptr::null_mut(),
)
})?;
// SAFETY: The Box pointer is valid, and this initializes the inner member.
unsafe { addr_of_mut!((*job.as_mut_ptr()).inner).write(inner) };
// SAFETY: All fields of the Job<T> are now initialized.
Ok(PendingJob(unsafe { job.assume_init() }, PhantomData))
- }
+}
+/// DRM scheduler inner data +pub struct SchedulerInner<T: JobImpl> {
- sched: bindings::drm_gpu_scheduler,
- _p: PhantomData<T>,
+}
+impl<T: JobImpl> Drop for SchedulerInner<T> {
- fn drop(&mut self) {
// SAFETY: The scheduler is valid. This assumes drm_sched_fini() will take care of
// freeing all in-progress jobs.
unsafe { bindings::drm_sched_fini(&mut self.sched) };
- }
+}
+// SAFETY: TODO +unsafe impl<T: JobImpl> Sync for SchedulerInner<T> {} +unsafe impl<T: JobImpl> Send for SchedulerInner<T> {}
+/// A DRM Scheduler +pub struct Scheduler<T: JobImpl>(Arc<SchedulerInner<T>>);
+impl<T: JobImpl> Scheduler<T> {
- const OPS: bindings::drm_sched_backend_ops = bindings::drm_sched_backend_ops {
prepare_job: Some(prepare_job_cb::<T>),
can_run_job: Some(can_run_job_cb::<T>),
run_job: Some(run_job_cb::<T>),
timedout_job: Some(timedout_job_cb::<T>),
free_job: Some(free_job_cb::<T>),
Two general questions with no relevance here really, just about vtable best practices:
So the trait has default impls for exactly the functions that are optional here, but either way we always end up with non-NULL function pointers. I guess there's no way to avoid that when you have a nice wrapping with traits and all that like here?
Another unrelated thing: How const is const? The C code side generally uses ops pointers for runtime type casting, so if the const is less const than a naive C hacker would expect, it might result in some fun.
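For what it's worth, the answer matters for exactly that C idiom: a Rust `const` is a value that is conceptually inlined at each use site, so `&Self::OPS` is only guaranteed to point at some promoted copy with `'static` lifetime, not at one unique address per scheduler type. If the C side ever compares ops pointers for identity, a `static` (one fixed address for the whole program) would be the safer choice. A quick userspace illustration:

```rust
struct Ops {
    run: fn() -> u32,
}

fn run_impl() -> u32 {
    7
}

// `const`: a value, conceptually copy-pasted at every use site.
const OPS_CONST: Ops = Ops { run: run_impl };

// `static`: a single object with one address for the program's lifetime,
// which is what C-style "compare the ops pointer" casts rely on.
static OPS_STATIC: Ops = Ops { run: run_impl };

fn main() {
    let a: *const Ops = &OPS_STATIC;
    let b: *const Ops = &OPS_STATIC;
    assert_eq!(a, b); // guaranteed: statics have stable identity

    // `&OPS_CONST` also compiles (the temporary is promoted to 'static),
    // but distinct uses are not guaranteed to share one address.
    assert_eq!((OPS_CONST.run)(), 7);
    println!("ok");
}
```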
Cheers, Daniel
- };
- /// Creates a new DRM Scheduler object
- // TODO: Shared timeout workqueues & scores
- pub fn new(
device: &impl device::RawDevice,
hw_submission: u32,
hang_limit: u32,
timeout_ms: usize,
name: &'static CStr,
- ) -> Result<Scheduler<T>> {
let mut sched: UniqueArc<MaybeUninit<SchedulerInner<T>>> = UniqueArc::try_new_uninit()?;
// SAFETY: The drm_sched pointer is valid and pinned as it was just allocated above.
to_result(unsafe {
bindings::drm_sched_init(
addr_of_mut!((*sched.as_mut_ptr()).sched),
&Self::OPS,
hw_submission,
hang_limit,
bindings::msecs_to_jiffies(timeout_ms.try_into()?).try_into()?,
core::ptr::null_mut(),
core::ptr::null_mut(),
name.as_char_ptr(),
device.raw_device(),
)
})?;
// SAFETY: All fields of SchedulerInner are now initialized.
Ok(Scheduler(unsafe { sched.assume_init() }.into()))
- }
+}
-- 2.35.1
On Wed, Apr 05, 2023 at 05:43:01PM +0200, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:37PM +0900, Asahi Lina wrote:
+/// An armed DRM scheduler job (not yet submitted) +pub struct ArmedJob<'a, T: JobImpl>(Box<Job<T>>, PhantomData<&'a T>);
+impl<'a, T: JobImpl> ArmedJob<'a, T> {
- /// Returns the job fences
- pub fn fences(&self) -> JobFences<'_> {
JobFences(unsafe { &mut *self.0.job.s_fence })
- }
- /// Push the job for execution into the scheduler
- pub fn push(self) {
// After this point, the job is submitted and owned by the scheduler
let ptr = match self {
ArmedJob(job, _) => Box::<Job<T>>::into_raw(job),
};
If I get this all right then this all makes sure that drivers can't use the job after push and that they don't forget to call arm.
What I'm not seeing is how we force drivers to call push once they've called arm? I haven't checked what the code does, but from the docs it sounds like if you don't call push then drop will get called, which wrecks the book-keeping on an armed job. Or is there something that prevents ArmedJob<T> from having the Drop trait, so that the only way to not go boom is by pushing it?
Googling for "rust undroppable" seems to indicate that this isn't a thing rust can do?
Another thing that I just realized: The driver must ensure that the arm->push sequence on a given drm_sched_entity isn't interrupted by another thread doing the same, i.e. you need to wrap it all in a lock, and it always needs to be the same lock for a given entity.
I have no idea how to guarantee that, but I guess somehow we should? -Daniel
On Wed, Apr 05, 2023 at 09:29:02PM +0200, Daniel Vetter wrote:
On Wed, Apr 05, 2023 at 05:43:01PM +0200, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:37PM +0900, Asahi Lina wrote:
+/// An armed DRM scheduler job (not yet submitted) +pub struct ArmedJob<'a, T: JobImpl>(Box<Job<T>>, PhantomData<&'a T>);
+impl<'a, T: JobImpl> ArmedJob<'a, T> {
- /// Returns the job fences
- pub fn fences(&self) -> JobFences<'_> {
JobFences(unsafe { &mut *self.0.job.s_fence })
- }
- /// Push the job for execution into the scheduler
- pub fn push(self) {
// After this point, the job is submitted and owned by the scheduler
let ptr = match self {
ArmedJob(job, _) => Box::<Job<T>>::into_raw(job),
};
If I get this all right then this all makes sure that drivers can't use the job after push and that they don't forget to call arm.
What I'm not seeing is how we force drivers to call push once they've called arm? I haven't checked what the code does, but from the docs it sounds like if you don't call push then drop will get called, which wrecks the book-keeping on an armed job. Or is there something that prevents ArmedJob<T> from having the Drop trait, so that the only way to not go boom is by pushing it?
Googling for "rust undroppable" seems to indicate that this isn't a thing rust can do?
Another thing that I just realized: The driver must ensure that the arm->push sequence on a given drm_sched_entity isn't interrupted by another thread doing the same, i.e. you need to wrap it all in a lock, and it always needs to be the same lock for a given entity.
I have no idea how to guarantee that, but I guess somehow we should?
Ok I was wrong here, pushing the job is optional, but the locking rules are still the same.
I think we can solve this in rust with:
- passing &mut Entity to a new submit_job function. That way locking rules are left to the driver, which I think is best.
- the submit_job also takes a closure, and passes the armed job as a &mut ArmedJob to it. That way we guarantee that the armed job never survives longer than the mutex guard (or whatever trick the driver is using) for the Entity.
- that closure probably should have a Result return type which submit_job just passes on, because some drivers (when you support userptr, that is) need to be able to bail out. Since the ArmedJob is borrowed it shouldn't be able to escape through the return value.
- only ArmedJob has push_job.
I think with that we fully uphold the drm_sched arm/push_job contract on the rust side? -Daniel
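A userspace mock of that proposed contract (all names hypothetical, nothing here touches the real bindings): submit_job takes &mut Entity, so whatever lock the driver wraps around the entity covers the whole arm -> push critical section, and the closure only ever sees a &mut ArmedJob, which the higher-ranked borrow prevents from escaping.

```rust
struct ArmedJob {
    pushed: bool,
}

impl ArmedJob {
    fn push_job(&mut self) {
        self.pushed = true;
    }
}

struct Entity {
    submitted: u32,
}

impl Entity {
    // &mut self means the caller must already hold whatever lock serializes
    // this entity, so two threads cannot interleave arm -> push. The closure
    // gets a borrowed &mut ArmedJob; because the borrow is higher-ranked,
    // the job cannot leak out through the return value R.
    fn submit_job<R>(
        &mut self,
        f: impl for<'a> FnOnce(&'a mut ArmedJob) -> Result<R, i32>,
    ) -> Result<R, i32> {
        let mut job = ArmedJob { pushed: false }; // "arm" happens here
        let r = f(&mut job)?; // driver may bail out (e.g. userptr handling)
        if job.pushed {
            self.submitted += 1;
        }
        Ok(r)
    }
}

fn main() {
    let mut entity = Entity { submitted: 0 };
    let r = entity.submit_job(|job| {
        job.push_job();
        Ok::<_, i32>(123)
    });
    assert_eq!(r, Ok(123));
    assert_eq!(entity.submitted, 1);

    // Bail-out path: the error is passed straight through.
    let e = entity.submit_job(|_job| Err::<u32, i32>(-14));
    assert_eq!(e, Err(-14));
    println!("ok");
}
```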
Drivers may want to support driver-private objects, which cannot be shared. This allows them to share a single lock and enables other optimizations.
Add an `exportable` field to drm_gem_object, which blocks PRIME export if set to false. It is initialized to true in drm_gem_private_object_init.
Signed-off-by: Asahi Lina lina@asahilina.net --- drivers/gpu/drm/drm_gem.c | 1 + drivers/gpu/drm/drm_prime.c | 5 +++++ include/drm/drm_gem.h | 8 ++++++++ 3 files changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 7a3cb08dc942..152ad9295a8d 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -166,6 +166,7 @@ void drm_gem_private_object_init(struct drm_device *dev,
drm_vma_node_reset(&obj->vma_node); INIT_LIST_HEAD(&obj->lru_node); + obj->exportable = true; } EXPORT_SYMBOL(drm_gem_private_object_init);
diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index f924b8b4ab6b..9d2dd982580e 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -391,6 +391,11 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; }
+ if (!obj->exportable) { + dmabuf = ERR_PTR(-EINVAL); + return dmabuf; + } + if (obj->funcs && obj->funcs->export) dmabuf = obj->funcs->export(obj, flags); else diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h index 772a4adf5287..852dec3cf763 100644 --- a/include/drm/drm_gem.h +++ b/include/drm/drm_gem.h @@ -361,6 +361,14 @@ struct drm_gem_object { * The current LRU list that the GEM object is on. */ struct drm_gem_lru *lru; + + /** + * @exportable: + * + * Whether this GEM object can be exported via the drm_gem_object_funcs->export + * callback. Defaults to true. + */ + bool exportable; };
/**
On Tue, Mar 07, 2023 at 11:25:38PM +0900, Asahi Lina wrote:
Drivers may want to support driver-private objects, which cannot be shared. This allows them to share a single lock and enables other optimizations.
Add an `exportable` field to drm_gem_object, which blocks PRIME export if set to false. It is initialized to true in drm_gem_private_object_init.
Signed-off-by: Asahi Lina lina@asahilina.net
Two comments on this:
- for kernel objects which userspace never accesses itself the usual approach is to simply not install a gem handle on that drm_file. If userspace doesn't even have a handle it also can't export the object. I think that should take care of the kernel object case you have in the asahi driver.
- for the vm-private object case you need some more checks anyway, since you can't even use such objects on a different vm within the same drm_file. Maybe the gpuva helpers can eventually cover this, but in general these driver cases are handled by simply overwriting the ->export hook: check there for vm_id.is_none(), and only when that holds hand the actual exporting on to the helper function.
Whether this is done in the rust wrappers and you keep the set_exportable or just in asahi code is kinda meh, but personally for consistency I'd put that into asahi code. Imo it's much clearer when you explicitly list (by coding them into your export impl) the reasons why a buffer isn't exportable, instead of forcing people to chase set_exportable calls throughout the codebase. But that's also a bit a matter of taste :-)
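Coding the reasons into the export impl could look roughly like this userspace mock (Bo, vm_id and the return values are stand-ins for the asahi driver's actual types, not a real DRM API):

```rust
// Mock BO: Some(vm_id) models an ASAHI_GEM_VM_PRIVATE object.
struct Bo {
    vm_id: Option<u32>,
}

const EINVAL: i32 = 22;

// Driver export hook: every reason a buffer is not exportable is listed
// right here, instead of being scattered across set_exportable() call sites.
fn export(bo: &Bo) -> Result<&'static str, i32> {
    if bo.vm_id.is_some() {
        // VM-private objects must never leave their VM, let alone the driver.
        return Err(-EINVAL);
    }
    // Otherwise defer to the generic PRIME export path.
    Ok("dma_buf")
}

fn main() {
    assert_eq!(export(&Bo { vm_id: None }), Ok("dma_buf"));
    assert_eq!(export(&Bo { vm_id: Some(3) }), Err(-EINVAL));
    println!("ok");
}
```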
Either way (unless a missed a case) this should imo be handled in asahi code and not in C or the rust glue. -Daniel
drivers/gpu/drm/drm_gem.c | 1 + drivers/gpu/drm/drm_prime.c | 5 +++++ include/drm/drm_gem.h | 8 ++++++++ 3 files changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c index 7a3cb08dc942..152ad9295a8d 100644 --- a/drivers/gpu/drm/drm_gem.c +++ b/drivers/gpu/drm/drm_gem.c @@ -166,6 +166,7 @@ void drm_gem_private_object_init(struct drm_device *dev, drm_vma_node_reset(&obj->vma_node); INIT_LIST_HEAD(&obj->lru_node);
+	obj->exportable = true;
} EXPORT_SYMBOL(drm_gem_private_object_init); diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c index f924b8b4ab6b..9d2dd982580e 100644 --- a/drivers/gpu/drm/drm_prime.c +++ b/drivers/gpu/drm/drm_prime.c @@ -391,6 +391,11 @@ static struct dma_buf *export_and_register_object(struct drm_device *dev, return dmabuf; }
+	if (!obj->exportable) {
+		dmabuf = ERR_PTR(-EINVAL);
+		return dmabuf;
+	}
+
 	if (obj->funcs && obj->funcs->export)
 		dmabuf = obj->funcs->export(obj, flags);
 	else
diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h index 772a4adf5287..852dec3cf763 100644 --- a/include/drm/drm_gem.h +++ b/include/drm/drm_gem.h @@ -361,6 +361,14 @@ struct drm_gem_object { * The current LRU list that the GEM object is on. */ struct drm_gem_lru *lru;
+
+	/**
+	 * @exportable:
+	 *
+	 * Whether this GEM object can be exported via the drm_gem_object_funcs->export
+	 * callback. Defaults to true.
+	 */
+	bool exportable;
}; /**
-- 2.35.1
This allows drivers to control whether a given GEM object is allowed to be exported via PRIME to other drivers. --- rust/kernel/drm/gem/mod.rs | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/rust/kernel/drm/gem/mod.rs b/rust/kernel/drm/gem/mod.rs index e66bdef35c2e..196252a25b5a 100644 --- a/rust/kernel/drm/gem/mod.rs +++ b/rust/kernel/drm/gem/mod.rs @@ -135,6 +135,13 @@ pub trait BaseObject: IntoGEMObject { self.gem_ref().size }
+ /// Sets the exportable flag, which controls whether the object can be exported via PRIME. + fn set_exportable(&mut self, exportable: bool) { + // SAFETY: gem_obj() is valid per the type invariant, and this is safe to write if we + // are the only holder (mutable ref). + unsafe { (*self.gem_obj()).exportable = exportable }; + } + /// Creates a new reference to the object. fn reference(&self) -> ObjectRef<Self> { // SAFETY: Having a reference to an Object implies holding a GEM reference
Adds the Asahi GPU driver UAPI. Note: this API is not yet stable and therefore not ready for merging!
Signed-off-by: Asahi Lina lina@asahilina.net --- include/uapi/drm/asahi_drm.h | 556 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 556 insertions(+)
On Tue, Mar 7, 2023 at 3:28 PM Asahi Lina <lina@asahilina.net> wrote:
Adds the Asahi GPU driver UAPI. Note: this API is not yet stable and therefore not ready for merging!
Signed-off-by: Asahi Lina <lina@asahilina.net>
include/uapi/drm/asahi_drm.h | 556 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 556 insertions(+)
diff --git a/include/uapi/drm/asahi_drm.h b/include/uapi/drm/asahi_drm.h new file mode 100644 index 000000000000..7b15b486d03d --- /dev/null +++ b/include/uapi/drm/asahi_drm.h @@ -0,0 +1,556 @@ +/* SPDX-License-Identifier: MIT */ +/*
+ * Copyright (C) The Asahi Linux Contributors
+ *
+ * Heavily inspired by xe_drm.h.
+ */
+#ifndef _ASAHI_DRM_H_ +#define _ASAHI_DRM_H_
+#include "drm.h"
+#if defined(__cplusplus) +extern "C" { +#endif
+#define DRM_ASAHI_UNSTABLE_UABI_VERSION 10006
+#define DRM_ASAHI_GET_PARAMS 0x00 +#define DRM_ASAHI_VM_CREATE 0x01 +#define DRM_ASAHI_VM_DESTROY 0x02 +#define DRM_ASAHI_GEM_CREATE 0x03 +#define DRM_ASAHI_GEM_MMAP_OFFSET 0x04 +#define DRM_ASAHI_GEM_BIND 0x05 +#define DRM_ASAHI_QUEUE_CREATE 0x06 +#define DRM_ASAHI_QUEUE_DESTROY 0x07 +#define DRM_ASAHI_SUBMIT 0x08 +#define DRM_ASAHI_GET_TIME 0x09
+#define DRM_ASAHI_MAX_CLUSTERS 32
+struct drm_asahi_params_global {
__u32 unstable_uabi_version;
__u32 pad0;
__u64 feat_compat;
__u64 feat_incompat;
__u32 gpu_generation;
__u32 gpu_variant;
__u32 gpu_revision;
__u32 chip_id;
__u32 num_dies;
__u32 num_clusters_total;
__u32 num_cores_per_cluster;
__u32 num_frags_per_cluster;
__u32 num_gps_per_cluster;
__u32 num_cores_total_active;
__u64 core_masks[DRM_ASAHI_MAX_CLUSTERS];
__u32 vm_page_size;
__u32 pad1;
__u64 vm_user_start;
__u64 vm_user_end;
__u64 vm_shader_start;
__u64 vm_shader_end;
__u32 max_syncs_per_submission;
__u32 max_commands_per_submission;
__u32 max_commands_in_flight;
__u32 max_attachments;
__u32 timer_frequency_hz;
__u32 min_frequency_khz;
__u32 max_frequency_khz;
__u32 max_power_mw;
__u32 result_render_size;
__u32 result_compute_size;
+};
+/* +enum drm_asahi_feat_compat { +}; +*/
+enum drm_asahi_feat_incompat {
DRM_ASAHI_FEAT_MANDATORY_ZS_COMPRESSION = (1UL << 0),
+};
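As with other drivers using the compat/incompat split, userspace presumably has to refuse to run when the kernel advertises incompat bits it does not understand, while unknown compat bits can be ignored. A minimal sketch of that check (the supported mask is a placeholder for whatever a real userspace driver implements):

```rust
/// Incompat feature bits this (hypothetical) userspace driver understands.
const SUPPORTED_INCOMPAT: u64 = 1 << 0; // DRM_ASAHI_FEAT_MANDATORY_ZS_COMPRESSION

/// Returns Err with the unknown bits if the kernel reports an incompat
/// feature we cannot honor; unknown compat bits may be ignored instead.
fn check_incompat(feat_incompat: u64) -> Result<(), u64> {
    let unknown = feat_incompat & !SUPPORTED_INCOMPAT;
    if unknown != 0 {
        Err(unknown)
    } else {
        Ok(())
    }
}

fn main() {
    assert_eq!(check_incompat(1 << 0), Ok(()));
    assert_eq!(check_incompat((1 << 0) | (1 << 5)), Err(1 << 5));
    println!("feature checks ok");
}
```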
+struct drm_asahi_get_params {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @param_group: Parameter group to fetch (MBZ) */
__u32 param_group;
/** @pad: MBZ */
__u32 pad;
/** @pointer: User pointer to write the parameter struct to */
__u64 pointer;
/** @size: Size of the user buffer; on return, the max size supported */
__u64 size;
+};
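The pointer/size pair implements the usual extensible-struct handshake: the kernel copies at most `size` bytes of its parameter struct and writes back the size it actually supports, so an older userspace gets a prefix-compatible truncation and a newer userspace can detect an older kernel. A simulation of that logic (no real ioctl involved, byte slices stand in for the structs):

```rust
/// Simulated kernel side of GET_PARAMS: copy min(user_size, kernel_size)
/// bytes of the kernel's parameter struct and report the kernel's full
/// size back, so userspace can detect truncation.
fn get_params(kernel_params: &[u8], user_buf: &mut [u8]) -> usize {
    let n = kernel_params.len().min(user_buf.len());
    user_buf[..n].copy_from_slice(&kernel_params[..n]);
    kernel_params.len() // written back into drm_asahi_get_params::size
}

fn main() {
    let kernel = [1u8, 2, 3, 4, 5, 6, 7, 8]; // stand-in for the real struct
    let mut small = [0u8; 4];
    let supported = get_params(&kernel, &mut small);
    assert_eq!(supported, 8); // kernel struct is bigger than ours
    assert_eq!(small, [1, 2, 3, 4]); // prefix-compatible truncation
    println!("handshake ok");
}
```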
+struct drm_asahi_vm_create {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @vm_id: Returned VM ID */
__u32 vm_id;
/** @pad: MBZ */
__u32 pad;
+};
+struct drm_asahi_vm_destroy {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @vm_id: VM ID to be destroyed */
__u32 vm_id;
/** @pad: MBZ */
__u32 pad;
+};
+#define ASAHI_GEM_WRITEBACK (1L << 0) +#define ASAHI_GEM_VM_PRIVATE (1L << 1)
+struct drm_asahi_gem_create {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @size: Size of the BO */
__u64 size;
/** @flags: BO creation flags */
__u32 flags;
/** @vm_id: VM ID to assign to the BO, if ASAHI_GEM_VM_PRIVATE is set. */
__u32 vm_id;
/** @handle: Returned GEM handle for the BO */
__u32 handle;
+};
+struct drm_asahi_gem_mmap_offset {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @handle: Handle for the object being mapped. */
__u32 handle;
/** @flags: Must be zero */
__u32 flags;
/** @offset: The fake offset to use for subsequent mmap call */
__u64 offset;
+};
+enum drm_asahi_bind_op {
ASAHI_BIND_OP_BIND = 0,
ASAHI_BIND_OP_UNBIND = 1,
ASAHI_BIND_OP_UNBIND_ALL = 2,
+};
+#define ASAHI_BIND_READ (1L << 0) +#define ASAHI_BIND_WRITE (1L << 1)
+struct drm_asahi_gem_bind {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @op: Bind operation (one of drm_asahi_bind_op) */
__u32 op;
/** @flags: One or more of ASAHI_BIND_* */
__u32 flags;
/** @handle: GEM object handle to bind */
__u32 handle;
/** @vm_id: The ID of the VM to bind to */
__u32 vm_id;
/** @offset: Offset into the object */
__u64 offset;
/** @range: Number of bytes from the object to bind to addr */
__u64 range;
/** @addr: Address to bind to */
__u64 addr;
+};
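Presumably the offset, range and address of a bind all have to respect the vm_page_size reported by GET_PARAMS, and the range must be non-empty and not wrap. A hedged sketch of the validation a driver (or userspace, before issuing the ioctl) might apply; 16384 is just an example page size, the real value comes from drm_asahi_params_global:

```rust
/// Validate a hypothetical bind request against the VM page size.
/// offset, range and addr must all be page-aligned (page_size is a
/// power of two), range must be non-empty, and addr + range must not wrap.
fn validate_bind(page_size: u64, offset: u64, range: u64, addr: u64) -> bool {
    let mask = page_size - 1;
    range != 0
        && (offset | range | addr) & mask == 0
        && addr.checked_add(range).is_some()
}

fn main() {
    assert!(validate_bind(16384, 0, 16384, 0x1000_0000));
    assert!(!validate_bind(16384, 0, 0, 0x1000_0000)); // empty range
    assert!(!validate_bind(16384, 4096, 16384, 0x1000_0000)); // misaligned
    println!("bind validation ok");
}
```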
+enum drm_asahi_cmd_type {
DRM_ASAHI_CMD_RENDER = 0,
DRM_ASAHI_CMD_BLIT = 1,
DRM_ASAHI_CMD_COMPUTE = 2,
+};
+/* Note: this is an enum so that it can be resolved by Rust bindgen. */ +enum drm_asahi_queue_cap {
DRM_ASAHI_QUEUE_CAP_RENDER = (1UL << DRM_ASAHI_CMD_RENDER),
DRM_ASAHI_QUEUE_CAP_BLIT = (1UL << DRM_ASAHI_CMD_BLIT),
DRM_ASAHI_QUEUE_CAP_COMPUTE = (1UL << DRM_ASAHI_CMD_COMPUTE),
+};
+struct drm_asahi_queue_create {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @flags: MBZ */
__u32 flags;
/** @vm_id: The ID of the VM this queue is bound to */
__u32 vm_id;
/** @queue_caps: Bitmask of DRM_ASAHI_QUEUE_CAP_* */
__u32 queue_caps;
/** @priority: Queue priority, 0-3 */
__u32 priority;
/** @queue_id: The returned queue ID */
__u32 queue_id;
+};
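Since the DRM_ASAHI_QUEUE_CAP_* bits are defined directly from the drm_asahi_cmd_type values, checking whether a queue accepts a given command type is a one-bit test. A sketch (the constants mirror the UAPI values, for illustration only):

```rust
// Mirrors of the UAPI command type values, for illustration only.
const DRM_ASAHI_CMD_RENDER: u32 = 0;
const DRM_ASAHI_CMD_BLIT: u32 = 1;
const DRM_ASAHI_CMD_COMPUTE: u32 = 2;

/// True if a queue created with `queue_caps` accepts `cmd_type` commands.
fn queue_accepts(queue_caps: u32, cmd_type: u32) -> bool {
    queue_caps & (1 << cmd_type) != 0
}

fn main() {
    let caps = (1 << DRM_ASAHI_CMD_RENDER) | (1 << DRM_ASAHI_CMD_COMPUTE);
    assert!(queue_accepts(caps, DRM_ASAHI_CMD_RENDER));
    assert!(queue_accepts(caps, DRM_ASAHI_CMD_COMPUTE));
    assert!(!queue_accepts(caps, DRM_ASAHI_CMD_BLIT)); // cap not requested
    println!("queue cap checks ok");
}
```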
+struct drm_asahi_queue_destroy {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @queue_id: The queue ID to be destroyed */
__u32 queue_id;
+};
+enum drm_asahi_sync_type {
DRM_ASAHI_SYNC_SYNCOBJ = 0,
DRM_ASAHI_SYNC_TIMELINE_SYNCOBJ = 1,
+};
+struct drm_asahi_sync {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @sync_type: One of drm_asahi_sync_type */
__u32 sync_type;
/** @handle: The sync object handle */
__u32 handle;
/** @timeline_value: Timeline value for timeline sync objects */
__u64 timeline_value;
+};
+enum drm_asahi_subqueue {
DRM_ASAHI_SUBQUEUE_RENDER = 0, /* Also blit */
DRM_ASAHI_SUBQUEUE_COMPUTE = 1,
DRM_ASAHI_SUBQUEUE_COUNT = 2,
+};
+#define DRM_ASAHI_BARRIER_NONE ~(0U)
+struct drm_asahi_command {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @cmd_type: One of drm_asahi_cmd_type */
__u32 cmd_type;
/** @flags: Flags for command submission */
__u32 flags;
/** @cmd_buffer: Pointer to the appropriate command buffer structure */
__u64 cmd_buffer;
/** @cmd_buffer_size: Size of the command buffer structure */
__u64 cmd_buffer_size;
/** @result_offset: Offset into the result BO to return information about this command */
__u64 result_offset;
/** @result_size: Size of the result data structure */
__u64 result_size;
/** @barriers: Array of command indices per subqueue to wait on */
__u32 barriers[DRM_ASAHI_SUBQUEUE_COUNT];
+};
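The barriers array names, per subqueue, the index of a command earlier in the same submission that must complete before this one runs, or DRM_ASAHI_BARRIER_NONE. A sketch of how a userspace driver might fill it for a command that depends only on the most recent render command (a hypothetical helper, not part of the UAPI):

```rust
// Mirrors of the UAPI subqueue constants, for illustration only.
const DRM_ASAHI_SUBQUEUE_RENDER: usize = 0;
const DRM_ASAHI_SUBQUEUE_COUNT: usize = 2; // render + compute
const DRM_ASAHI_BARRIER_NONE: u32 = !0u32;

/// Build the per-subqueue barrier array: wait for command index
/// `last_render` on the render subqueue; the compute slot stays NONE.
fn barriers_after_render(last_render: Option<u32>) -> [u32; DRM_ASAHI_SUBQUEUE_COUNT] {
    let mut b = [DRM_ASAHI_BARRIER_NONE; DRM_ASAHI_SUBQUEUE_COUNT];
    if let Some(idx) = last_render {
        b[DRM_ASAHI_SUBQUEUE_RENDER] = idx;
    }
    b
}

fn main() {
    assert_eq!(barriers_after_render(Some(3)), [3, DRM_ASAHI_BARRIER_NONE]);
    assert_eq!(barriers_after_render(None), [DRM_ASAHI_BARRIER_NONE; 2]);
    println!("barrier setup ok");
}
```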
+struct drm_asahi_submit {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @in_syncs: An optional array of drm_asahi_sync to wait on before starting this job. */
__u64 in_syncs;
/** @out_syncs: An optional array of drm_asahi_sync objects to signal upon completion. */
__u64 out_syncs;
/** @commands: Pointer to the drm_asahi_command array of commands to submit. */
__u64 commands;
/** @flags: Flags for command submission (MBZ) */
__u32 flags;
/** @queue_id: The queue ID to be submitted to */
__u32 queue_id;
/** @result_handle: An optional BO handle to place result data in */
__u32 result_handle;
/** @in_sync_count: Number of sync objects to wait on before starting this job. */
__u32 in_sync_count;
/** @out_sync_count: Number of sync objects to signal upon completion of this job. */
__u32 out_sync_count;
/** @command_count: Number of commands to be submitted */
__u32 command_count;
+};
+/* FIXME: This doesn't make any sense, figure out exactly what the attachment flags are */ +#define ASAHI_ATTACHMENT_C 0 +#define ASAHI_ATTACHMENT_Z 1 +#define ASAHI_ATTACHMENT_S 2
+struct drm_asahi_attachment {
__u32 type;
__u32 size;
__u64 pointer;
+};
+#define ASAHI_RENDER_NO_CLEAR_PIPELINE_TEXTURES (1UL << 0) +#define ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S (1UL << 1) +#define ASAHI_RENDER_MEMORYLESS_RTS_USED (1UL << 2) /* Not yet implemented */ +#define ASAHI_RENDER_PROCESS_EMPTY_TILES (1UL << 3) +#define ASAHI_RENDER_NO_VERTEX_CLUSTERING (1UL << 4)
+struct drm_asahi_cmd_render {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
__u64 flags;
__u64 encoder_ptr;
__u64 attachments;
__u32 attachment_count;
__u32 pad;
__u64 depth_buffer_1;
__u64 depth_buffer_2;
__u64 depth_buffer_3;
__u64 depth_meta_buffer_1;
__u64 depth_meta_buffer_2;
__u64 depth_meta_buffer_3;
__u64 stencil_buffer_1;
__u64 stencil_buffer_2;
__u64 stencil_buffer_3;
__u64 stencil_meta_buffer_1;
__u64 stencil_meta_buffer_2;
__u64 stencil_meta_buffer_3;
__u64 scissor_array;
__u64 depth_bias_array;
__u64 visibility_result_buffer;
__u64 zls_ctrl;
__u64 ppp_multisamplectl;
__u32 ppp_ctrl;
__u32 fb_width;
__u32 fb_height;
__u32 utile_width;
__u32 utile_height;
__u32 samples;
__u32 layers;
__u32 encoder_id;
__u32 cmd_ta_id;
__u32 cmd_3d_id;
__u32 iogpu_unk_49;
__u32 iogpu_unk_212;
__u32 iogpu_unk_214;
__u32 merge_upper_x;
__u32 merge_upper_y;
__u32 load_pipeline;
__u32 load_pipeline_bind;
__u32 store_pipeline;
__u32 store_pipeline_bind;
__u32 partial_reload_pipeline;
__u32 partial_reload_pipeline_bind;
__u32 partial_store_pipeline;
__u32 partial_store_pipeline_bind;
__u32 depth_dimensions;
__u32 isp_bgobjdepth;
__u32 isp_bgobjvals;
+};
+struct drm_asahi_cmd_compute {
__u64 flags;
__u64 encoder_ptr;
__u64 encoder_end;
__u64 attachments;
__u32 attachment_count;
__u32 pad;
__u64 buffer_descriptor;
__u32 buffer_descriptor_size; /* ? */
__u32 ctx_switch_prog;
__u32 encoder_id;
__u32 cmd_id;
__u32 iogpu_unk_40;
__u32 iogpu_unk_44;
+};
+enum drm_asahi_status {
DRM_ASAHI_STATUS_PENDING = 0,
DRM_ASAHI_STATUS_COMPLETE,
DRM_ASAHI_STATUS_UNKNOWN_ERROR,
DRM_ASAHI_STATUS_TIMEOUT,
DRM_ASAHI_STATUS_FAULT,
DRM_ASAHI_STATUS_KILLED,
DRM_ASAHI_STATUS_NO_DEVICE,
+};
+enum drm_asahi_fault {
DRM_ASAHI_FAULT_NONE = 0,
DRM_ASAHI_FAULT_UNKNOWN,
DRM_ASAHI_FAULT_UNMAPPED,
DRM_ASAHI_FAULT_AF_FAULT,
DRM_ASAHI_FAULT_WRITE_ONLY,
DRM_ASAHI_FAULT_READ_ONLY,
DRM_ASAHI_FAULT_NO_ACCESS,
+};
+struct drm_asahi_result_info {
/** @status: One of enum drm_asahi_status */
__u32 status;
/** @fault_type: One of enum drm_asahi_fault */
__u32 fault_type;
/** @unit: Unit number, hardware dependent */
__u32 unit;
/** @sideband: Sideband information, hardware dependent */
__u32 sideband;
/** @level: Page table level at which the fault occurred, hardware dependent */
__u8 level;
/** @is_read: Fault was a read */
__u8 is_read;
/** @pad: MBZ */
__u16 pad;
/** @extra: Extra bits, hardware dependent */
__u32 extra;
/** @address: Fault address, cache line aligned */
__u64 address;
+};
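The fault fields are only meaningful when status reports a fault; a hypothetical userspace decoder turning the fault code and direction into a log message might look like this (the enum mirrors drm_asahi_fault, for illustration only):

```rust
// Mirror of enum drm_asahi_fault, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Fault {
    None = 0,
    Unknown,
    Unmapped,
    AfFault,
    WriteOnly,
    ReadOnly,
    NoAccess,
}

fn fault_from_u32(v: u32) -> Fault {
    match v {
        0 => Fault::None,
        1 => Fault::Unknown,
        2 => Fault::Unmapped,
        3 => Fault::AfFault,
        4 => Fault::WriteOnly,
        5 => Fault::ReadOnly,
        _ => Fault::NoAccess, // hedge: treat out-of-range codes as NO_ACCESS
    }
}

/// Render a fault description like the one a driver might log.
fn describe(fault_type: u32, is_read: u8, address: u64) -> String {
    let dir = if is_read != 0 { "read" } else { "write" };
    format!("{:?} fault on {} at {:#x}", fault_from_u32(fault_type), dir, address)
}

fn main() {
    assert_eq!(describe(2, 1, 0x1000), "Unmapped fault on read at 0x1000");
    println!("{}", describe(2, 1, 0x1000));
}
```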
+#define DRM_ASAHI_RESULT_RENDER_TVB_GROW_OVF (1UL << 0) +#define DRM_ASAHI_RESULT_RENDER_TVB_GROW_MIN (1UL << 1) +#define DRM_ASAHI_RESULT_RENDER_TVB_OVERFLOWED (1UL << 2)
+struct drm_asahi_result_render {
/** @info: Common result information */
struct drm_asahi_result_info info;
/** @flags: Zero or more of DRM_ASAHI_RESULT_RENDER_* */
__u64 flags;
/** @vertex_ts_start: Timestamp of the start of vertex processing */
__u64 vertex_ts_start;
/** @vertex_ts_end: Timestamp of the end of vertex processing */
__u64 vertex_ts_end;
/** @fragment_ts_start: Timestamp of the start of fragment processing */
__u64 fragment_ts_start;
/** @fragment_ts_end: Timestamp of the end of fragment processing */
__u64 fragment_ts_end;
/** @tvb_size_bytes: TVB size at the start of this render */
__u64 tvb_size_bytes;
/** @tvb_usage_bytes: Total TVB usage in bytes for this render */
__u64 tvb_usage_bytes;
/** @num_tvb_overflows: Number of TVB overflows that occurred for this render */
__u32 num_tvb_overflows;
+};
+struct drm_asahi_result_compute {
/** @info: Common result information */
struct drm_asahi_result_info info;
/** @flags: Zero or more of DRM_ASAHI_RESULT_COMPUTE_* */
__u64 flags;
/** @ts_start: Timestamp of the start of this compute command */
__u64 ts_start;
/** @ts_end: Timestamp of the end of this compute command */
__u64 ts_end;
+};
+struct drm_asahi_get_time {
/** @extensions: Pointer to the first extension struct, if any */
__u64 extensions;
/** @flags: MBZ. */
__u64 flags;
/** @tv_sec: On return, seconds part of a point in time */
__s64 tv_sec;
/** @tv_nsec: On return, nanoseconds part of a point in time */
__s64 tv_nsec;
/** @gpu_timestamp: On return, the GPU timestamp at that point in time */
__u64 gpu_timestamp;
+};
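GET_TIME pairs a CPU timespec with the raw GPU timer, so userspace can correlate the ts_* fields in the result structs with CPU time. Converting a tick delta to nanoseconds needs a widened intermediate to avoid overflow; a sketch (24 MHz is an assumed example frequency here, the real value comes from timer_frequency_hz in GET_PARAMS):

```rust
/// Convert a GPU timestamp delta (in timer ticks) to nanoseconds,
/// widening to u128 so ticks * 1e9 cannot overflow.
fn ticks_to_ns(ticks: u64, timer_frequency_hz: u32) -> u64 {
    (ticks as u128 * 1_000_000_000u128 / timer_frequency_hz as u128) as u64
}

fn main() {
    // Assumed example frequency: 24 MHz.
    assert_eq!(ticks_to_ns(24_000_000, 24_000_000), 1_000_000_000); // 1 second
    assert_eq!(ticks_to_ns(12, 24_000_000), 500); // 12 ticks = 500 ns
    println!("timestamp conversion ok");
}
```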
+/* Note: this is an enum so that it can be resolved by Rust bindgen. */ +enum {
- DRM_IOCTL_ASAHI_GET_PARAMS = DRM_IOWR(DRM_COMMAND_BASE + DRM_ASAHI_GET_PARAMS, struct drm_asahi_get_params),
- DRM_IOCTL_ASAHI_VM_CREATE = DRM_IOWR(DRM_COMMAND_BASE + DRM_ASAHI_VM_CREATE, struct drm_asahi_vm_create),
- DRM_IOCTL_ASAHI_VM_DESTROY = DRM_IOW(DRM_COMMAND_BASE + DRM_ASAHI_VM_DESTROY, struct drm_asahi_vm_destroy),
- DRM_IOCTL_ASAHI_GEM_CREATE = DRM_IOWR(DRM_COMMAND_BASE + DRM_ASAHI_GEM_CREATE, struct drm_asahi_gem_create),
- DRM_IOCTL_ASAHI_GEM_MMAP_OFFSET = DRM_IOWR(DRM_COMMAND_BASE + DRM_ASAHI_GEM_MMAP_OFFSET, struct drm_asahi_gem_mmap_offset),
- DRM_IOCTL_ASAHI_GEM_BIND = DRM_IOW(DRM_COMMAND_BASE + DRM_ASAHI_GEM_BIND, struct drm_asahi_gem_bind),
- DRM_IOCTL_ASAHI_QUEUE_CREATE = DRM_IOWR(DRM_COMMAND_BASE + DRM_ASAHI_QUEUE_CREATE, struct drm_asahi_queue_create),
- DRM_IOCTL_ASAHI_QUEUE_DESTROY = DRM_IOW(DRM_COMMAND_BASE + DRM_ASAHI_QUEUE_DESTROY, struct drm_asahi_queue_destroy),
- DRM_IOCTL_ASAHI_SUBMIT = DRM_IOW(DRM_COMMAND_BASE + DRM_ASAHI_SUBMIT, struct drm_asahi_submit),
- DRM_IOCTL_ASAHI_GET_TIME = DRM_IOWR(DRM_COMMAND_BASE + DRM_ASAHI_GET_TIME, struct drm_asahi_get_time),
+};
heh.. I had the same issue in mesa and wasn't thinking of doing this instead
+#if defined(__cplusplus) +} +#endif
+#endif /* _ASAHI_DRM_H_ */
-- 2.35.1
Add the Asahi UAPI to bindings_helper.h so Rust code can use it.
Signed-off-by: Asahi Lina <lina@asahilina.net> --- rust/bindings/bindings_helper.h | 1 + 1 file changed, 1 insertion(+)
diff --git a/rust/bindings/bindings_helper.h b/rust/bindings/bindings_helper.h index dc01be08676e..e21c87e6d317 100644 --- a/rust/bindings/bindings_helper.h +++ b/rust/bindings/bindings_helper.h @@ -35,6 +35,7 @@ #include <linux/sysctl.h> #include <linux/timekeeping.h> #include <linux/xarray.h> +#include <uapi/drm/asahi_drm.h> #include <uapi/drm/drm.h>
/* `bindgen` gets confused at certain things. */
This macro allows Rust code to build multiple versions of the same code, conditionally including certain fields or code segments.
The asahi driver uses this to support multiple GPU types and firmware revisions in the same codebase, without duplicating everything.
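To illustrate the idea, here is a hand-written sketch of roughly what the macro produces (not its actual output, and the field names are hypothetical): a `#[versions(AGX)]` struct with `#[ver(...)]`-gated fields becomes one concrete type per (GPU, firmware) combination, with the version tag appended wherever the source says `::ver`:

```rust
#![allow(non_camel_case_types)]

// Hypothetical input:
//
//     #[versions(AGX)]
//     struct JobParams {
//         unk_8: u32,
//         #[ver(V >= V13_0B4)]
//         unk_c: u32,
//     }
//
// Conceptually expands to one struct per supported (G, V) combination,
// simplified here to two variants:
struct JobParamsG13V12_3 {
    unk_8: u32,
}

struct JobParamsG13V13_2 {
    unk_8: u32,
    unk_c: u32, // only present on firmware >= V13_0B4
}

fn main() {
    // Code written against `JobParams::ver` is compiled once per variant.
    let old = JobParamsG13V12_3 { unk_8: 1 };
    let new = JobParamsG13V13_2 { unk_8: 1, unk_c: 2 };
    assert_eq!(old.unk_8 + new.unk_8 + new.unk_c, 4);
    println!("version variants ok");
}
```

This trades binary size (each variant is monomorphized separately) for the ability to express per-version structure layouts without runtime branching.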
Signed-off-by: Asahi Lina lina@asahilina.net --- rust/macros/lib.rs | 7 ++ rust/macros/versions.rs | 267 ++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 274 insertions(+)
diff --git a/rust/macros/lib.rs b/rust/macros/lib.rs index c1d385e345b9..3ab9bae4ab52 100644 --- a/rust/macros/lib.rs +++ b/rust/macros/lib.rs @@ -5,6 +5,7 @@ mod concat_idents; mod helpers; mod module; +mod versions; mod vtable;
use proc_macro::TokenStream; @@ -73,6 +74,12 @@ pub fn module(ts: TokenStream) -> TokenStream { module::module(ts) }
+/// Declares multiple variants of a structure or impl code +#[proc_macro_attribute] +pub fn versions(attr: TokenStream, item: TokenStream) -> TokenStream { + versions::versions(attr, item) +} + /// Declares or implements a vtable trait. /// /// Linux's use of pure vtables is very close to Rust traits, but they differ diff --git a/rust/macros/versions.rs b/rust/macros/versions.rs new file mode 100644 index 000000000000..3bcd5f557289 --- /dev/null +++ b/rust/macros/versions.rs @@ -0,0 +1,267 @@ +use proc_macro::{token_stream, Group, Ident, Punct, Spacing, Span, TokenStream, TokenTree}; + +use crate::helpers::{expect_group, expect_punct}; + +fn drop_until_punct(it: &mut impl Iterator<Item = TokenTree>, delimiter: &str) { + let mut depth: isize = 0; + for token in it.by_ref() { + if let TokenTree::Punct(punct) = token { + match punct.as_char() { + '<' => { + depth += 1; + } + '>' => { + depth -= 1; + } + _ => { + if depth == 0 && delimiter.contains(&punct.to_string()) { + break; + } + } + } + } + } +} + +struct VersionConfig { + fields: &'static [&'static str], + enums: &'static [&'static [&'static str]], + versions: &'static [&'static [&'static str]], +} + +static AGX_VERSIONS: VersionConfig = VersionConfig { + fields: &["G", "V"], + enums: &[&["G13", "G14"], &["V12_3", "V12_4", "V13_0B4", "V13_2"]], + versions: &[ + &["G13", "V12_3"], + &["G14", "V12_4"], + &["G13", "V13_2"], + &["G14", "V13_2"], + ], +}; + +fn check_version(config: &VersionConfig, ver: &[usize], it: &mut token_stream::IntoIter) -> bool { + let first = it.next().unwrap(); + let val: bool = match &first { + TokenTree::Group(group) => check_version(config, ver, &mut group.stream().into_iter()), + TokenTree::Ident(ident) => { + let key = config + .fields + .iter() + .position(|&r| r == ident.to_string()) + .unwrap_or_else(|| panic!("Unknown field {}", ident)); + let mut operator = expect_punct(it).to_string(); + let mut rhs_token = it.next().unwrap(); + if let TokenTree::Punct(punct) = &rhs_token { + 
operator.extend(std::iter::once(punct.as_char())); + rhs_token = it.next().unwrap(); + } + let rhs_name = if let TokenTree::Ident(ident) = &rhs_token { + ident.to_string() + } else { + panic!("Unexpected token {}", ident) + }; + + let rhs = config.enums[key] + .iter() + .position(|&r| r == rhs_name) + .unwrap_or_else(|| panic!("Unknown value for {}:{}", ident, rhs_name)); + let lhs = ver[key]; + + match operator.as_str() { + "==" => lhs == rhs, + "!=" => lhs != rhs, + ">" => lhs > rhs, + ">=" => lhs >= rhs, + "<" => lhs < rhs, + "<=" => lhs <= rhs, + _ => panic!("Unknown operator {}", operator), + } + } + _ => { + panic!("Unknown token {}", first) + } + }; + + let boolop = it.next(); + match boolop { + Some(TokenTree::Punct(punct)) => { + let right = expect_punct(it).to_string(); + if right != punct.to_string() { + panic!("Unexpected op {}{}", punct, right); + } + match punct.as_char() { + '&' => val && check_version(config, ver, it), + '|' => val || check_version(config, ver, it), + _ => panic!("Unexpected op {}{}", right, right), + } + } + Some(a) => panic!("Unexpected op {}", a), + None => val, + } +} + +fn filter_versions( + config: &VersionConfig, + tag: &str, + ver: &[usize], + mut it: &mut token_stream::IntoIter, + is_struct: bool, +) -> Vec<TokenTree> { + let mut out = Vec::<TokenTree>::new(); + + while let Some(token) = it.next() { + let mut tail: Option<TokenTree> = None; + match &token { + TokenTree::Punct(punct) if punct.to_string() == "#" => { + let group = expect_group(it); + let mut grp_it = group.stream().into_iter(); + let attr = grp_it.next().unwrap(); + match attr { + TokenTree::Ident(ident) if ident.to_string() == "ver" => { + if check_version(config, ver, &mut grp_it) { + } else if is_struct { + drop_until_punct(&mut it, ","); + } else { + let first = it.next().unwrap(); + match &first { + TokenTree::Group(_) => (), + _ => { + drop_until_punct(&mut it, ",;"); + } + } + } + } + _ => { + out.push(token.clone()); + 
out.push(TokenTree::Group(group.clone())); + } + } + continue; + } + TokenTree::Punct(punct) if punct.to_string() == ":" => { + let next = it.next(); + match next { + Some(TokenTree::Punct(punct)) if punct.to_string() == ":" => { + let next = it.next(); + match next { + Some(TokenTree::Ident(idtag)) if idtag.to_string() == "ver" => { + let ident = match out.pop() { + Some(TokenTree::Ident(ident)) => ident, + a => panic!("$ver not following ident: {:?}", a), + }; + let name = ident.to_string() + tag; + let new_ident = Ident::new(name.as_str(), ident.span()); + out.push(TokenTree::Ident(new_ident)); + continue; + } + Some(a) => { + out.push(token.clone()); + out.push(token.clone()); + tail = Some(a); + } + None => { + out.push(token.clone()); + out.push(token.clone()); + } + } + } + Some(a) => { + out.push(token.clone()); + tail = Some(a); + } + None => { + out.push(token.clone()); + continue; + } + } + } + _ => { + tail = Some(token); + } + } + match &tail { + Some(TokenTree::Group(group)) => { + let new_body = + filter_versions(config, tag, ver, &mut group.stream().into_iter(), is_struct); + let mut stream = TokenStream::new(); + stream.extend(new_body); + let mut filtered_group = Group::new(group.delimiter(), stream); + filtered_group.set_span(group.span()); + out.push(TokenTree::Group(filtered_group)); + } + Some(token) => { + out.push(token.clone()); + } + None => {} + } + } + + out +} + +pub(crate) fn versions(attr: TokenStream, item: TokenStream) -> TokenStream { + let config = match attr.to_string().as_str() { + "AGX" => &AGX_VERSIONS, + _ => panic!("Unknown version group {}", attr), + }; + + let mut it = item.into_iter(); + let mut out = TokenStream::new(); + let mut body: Vec<TokenTree> = Vec::new(); + let mut is_struct = false; + + while let Some(token) = it.next() { + match token { + TokenTree::Punct(punct) if punct.to_string() == "#" => { + body.push(TokenTree::Punct(punct)); + body.push(it.next().unwrap()); + } + TokenTree::Ident(ident) + if ["struct", 
"enum", "union", "const", "type"] + .contains(&ident.to_string().as_str()) => + { + is_struct = ident.to_string() != "const"; + body.push(TokenTree::Ident(ident)); + body.push(it.next().unwrap()); + // This isn't valid syntax in a struct definition, so add it for the user + body.push(TokenTree::Punct(Punct::new(':', Spacing::Joint))); + body.push(TokenTree::Punct(Punct::new(':', Spacing::Alone))); + body.push(TokenTree::Ident(Ident::new("ver", Span::call_site()))); + break; + } + TokenTree::Ident(ident) if ident.to_string() == "impl" => { + body.push(TokenTree::Ident(ident)); + break; + } + TokenTree::Ident(ident) if ident.to_string() == "fn" => { + body.push(TokenTree::Ident(ident)); + break; + } + _ => { + body.push(token); + } + } + } + + body.extend(it); + + for ver in config.versions { + let tag = ver.join(""); + let mut ver_num = Vec::<usize>::new(); + for (i, comp) in ver.iter().enumerate() { + let idx = config.enums[i].iter().position(|&r| r == *comp).unwrap(); + ver_num.push(idx); + } + let tt = TokenStream::from_iter(body.clone().into_iter()); + out.extend(filter_versions( + config, + &tag, + &ver_num, + &mut tt.into_iter(), + is_struct, + )); + } + + out +}
The `asahi` DRM driver supports Apple AGX GPUs of the following generations:

- G13G (Apple M1)
- G13S (Apple M1 Pro)
- G13C (Apple M1 Max)
- G13D (Apple M1 Ultra)
- G14G (Apple M2)
Signed-off-by: Asahi Lina <lina@asahilina.net>
---
 drivers/gpu/drm/Kconfig                |    2 +
 drivers/gpu/drm/Makefile               |    1 +
 drivers/gpu/drm/asahi/Kconfig          |   35 +
 drivers/gpu/drm/asahi/Makefile         |    3 +
 drivers/gpu/drm/asahi/alloc.rs         | 1046 ++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/asahi.rs         |   53 ++
 drivers/gpu/drm/asahi/buffer.rs        |  694 ++++++++++++++++++
 drivers/gpu/drm/asahi/channel.rs       |  542 ++++++++++++++
 drivers/gpu/drm/asahi/debug.rs         |  129 ++++
 drivers/gpu/drm/asahi/driver.rs        |  166 +++++
 drivers/gpu/drm/asahi/event.rs         |  229 ++++++
 drivers/gpu/drm/asahi/file.rs          |  718 ++++++++++++++++++
 drivers/gpu/drm/asahi/float.rs         |  381 ++++++++++
 drivers/gpu/drm/asahi/fw/buffer.rs     |  170 +++++
 drivers/gpu/drm/asahi/fw/channels.rs   |  385 ++++++++++
 drivers/gpu/drm/asahi/fw/compute.rs    |  107 +++
 drivers/gpu/drm/asahi/fw/event.rs      |  100 +++
 drivers/gpu/drm/asahi/fw/fragment.rs   |  276 +++++++
 drivers/gpu/drm/asahi/fw/initdata.rs   | 1264 ++++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/fw/job.rs        |   56 ++
 drivers/gpu/drm/asahi/fw/microseq.rs   |  384 ++++++++++
 drivers/gpu/drm/asahi/fw/mod.rs        |   15 +
 drivers/gpu/drm/asahi/fw/types.rs      |  233 ++++++
 drivers/gpu/drm/asahi/fw/vertex.rs     |  177 +++++
 drivers/gpu/drm/asahi/fw/workqueue.rs  |  168 +++++
 drivers/gpu/drm/asahi/gem.rs           |  301 ++++++++
 drivers/gpu/drm/asahi/gpu.rs           | 1088 +++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/hw/mod.rs        |  522 +++++++++++++
 drivers/gpu/drm/asahi/hw/t600x.rs      |  140 ++++
 drivers/gpu/drm/asahi/hw/t8103.rs      |   80 ++
 drivers/gpu/drm/asahi/hw/t8112.rs      |   82 +++
 drivers/gpu/drm/asahi/initdata.rs      |  777 ++++++++++++++++++++
 drivers/gpu/drm/asahi/mem.rs           |  133 ++++
 drivers/gpu/drm/asahi/microseq.rs      |   61 ++
 drivers/gpu/drm/asahi/mmu.rs           | 1249 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/object.rs        |  704 ++++++++++++++++++
 drivers/gpu/drm/asahi/place.rs         |  343 +++++++++
 drivers/gpu/drm/asahi/queue/common.rs  |   52 ++
 drivers/gpu/drm/asahi/queue/compute.rs |  371 ++++++++++
 drivers/gpu/drm/asahi/queue/mod.rs     |  725 ++++++++++++++++++
 drivers/gpu/drm/asahi/queue/render.rs  | 1173 +++++++++++++++++++++++++++++
 drivers/gpu/drm/asahi/regs.rs          |  387 ++++++++++
 drivers/gpu/drm/asahi/slotalloc.rs     |  292 ++++++++
 drivers/gpu/drm/asahi/util.rs          |   44 ++
 drivers/gpu/drm/asahi/workqueue.rs     |  880 ++++++++++++++++++++++
 45 files changed, 16738 insertions(+)
diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index 8b5ad6aee126..acdd42fc543a 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -384,6 +384,8 @@ source "drivers/gpu/drm/solomon/Kconfig"
source "drivers/gpu/drm/sprd/Kconfig"
+source "drivers/gpu/drm/asahi/Kconfig" + config DRM_HYPERV tristate "DRM Support for Hyper-V synthetic video device" depends on DRM && PCI && MMU && HYPERV diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile index ab4460fcd63f..f144501d9c60 100644 --- a/drivers/gpu/drm/Makefile +++ b/drivers/gpu/drm/Makefile @@ -190,3 +190,4 @@ obj-y += gud/ obj-$(CONFIG_DRM_HYPERV) += hyperv/ obj-y += solomon/ obj-$(CONFIG_DRM_SPRD) += sprd/ +obj-$(CONFIG_DRM_ASAHI) += asahi/ diff --git a/drivers/gpu/drm/asahi/Kconfig b/drivers/gpu/drm/asahi/Kconfig new file mode 100644 index 000000000000..9d7caa49b64c --- /dev/null +++ b/drivers/gpu/drm/asahi/Kconfig @@ -0,0 +1,35 @@ +# SPDX-License-Identifier: GPL-2.0 + +config DRM_ASAHI + tristate "Asahi (DRM support for Apple AGX GPUs)" + depends on RUST + depends on RUST_DRM + depends on RUST_APPLE_RTKIT + depends on (ARM64 && ARCH_APPLE) || (COMPILE_TEST && !GENERIC_ATOMIC64) + depends on MMU + select IOMMU_SUPPORT + select IOMMU_IO_PGTABLE_LPAE + select RUST_DRM_SCHED + select RUST_DRM_GEM_SHMEM_HELPER + help + DRM driver for Apple AGX GPUs (G13x/G14). + + This driver supports the following SoCs: + + - T8103 "M1" + - T8112 "M2" + - T6000 "M1 Pro" + - T6001 "M1 Max" + - T6002 "M1 Ultra" + +config DRM_ASAHI_DEBUG_ALLOCATOR + bool "Use debug allocator" + depends on DRM_ASAHI + help + Use an alternate, simpler allocator which significantly reduces + performance, but can help find firmware- or GPU-side memory safety + issues. However, it can also trigger firmware bugs more easily, + so expect GPU crashes. + + Say N unless you are debugging firmware structures or porting to a + new firmware version. 
diff --git a/drivers/gpu/drm/asahi/Makefile b/drivers/gpu/drm/asahi/Makefile new file mode 100644 index 000000000000..e67248667987 --- /dev/null +++ b/drivers/gpu/drm/asahi/Makefile @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-$(CONFIG_DRM_ASAHI) += asahi.o diff --git a/drivers/gpu/drm/asahi/alloc.rs b/drivers/gpu/drm/asahi/alloc.rs new file mode 100644 index 000000000000..d918b19e9721 --- /dev/null +++ b/drivers/gpu/drm/asahi/alloc.rs @@ -0,0 +1,1046 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! GPU kernel object allocator. +//! +//! This kernel driver needs to manage a large number of GPU objects, in both firmware/kernel +//! address space and user address space. This module implements a simple grow-only heap allocator +//! based on the DRM MM range allocator, and a debug allocator that allocates each object as a +//! separate GEM object. +//! +//! Allocations may optionally have debugging enabled, which adds preambles that store metadata +//! about the allocation. This is useful for live debugging using the hypervisor or postmortem +//! debugging with a GPU memory snapshot, since it makes it easier to identify use-after-free and +//! caching issues. 
+ +use kernel::{c_str, drm::mm, error::Result, prelude::*, str::CString, sync::LockClassKey}; + +use crate::debug::*; +use crate::driver::AsahiDevice; +use crate::fw::types::Zeroed; +use crate::mmu; +use crate::object::{GpuArray, GpuObject, GpuOnlyArray, GpuStruct, GpuWeakPointer}; + +use core::cmp::Ordering; +use core::fmt; +use core::fmt::{Debug, Formatter}; +use core::marker::PhantomData; +use core::mem; +use core::mem::MaybeUninit; +use core::ptr::NonNull; + +const DEBUG_CLASS: DebugFlags = DebugFlags::Alloc; + +#[cfg(not(CONFIG_DRM_ASAHI_DEBUG_ALLOCATOR))] +/// The driver-global allocator type +pub(crate) type DefaultAllocator = HeapAllocator; + +#[cfg(not(CONFIG_DRM_ASAHI_DEBUG_ALLOCATOR))] +/// The driver-global allocation type +pub(crate) type DefaultAllocation = HeapAllocation; + +#[cfg(CONFIG_DRM_ASAHI_DEBUG_ALLOCATOR)] +/// The driver-global allocator type +pub(crate) type DefaultAllocator = SimpleAllocator; + +#[cfg(CONFIG_DRM_ASAHI_DEBUG_ALLOCATOR)] +/// The driver-global allocation type +pub(crate) type DefaultAllocation = SimpleAllocation; + +/// Represents a raw allocation (without any type information). +pub(crate) trait RawAllocation { + /// Returns the CPU-side pointer (if CPU mapping is enabled) as a byte non-null pointer. + fn ptr(&self) -> Option<NonNull<u8>>; + /// Returns the GPU VA pointer as a u64. + fn gpu_ptr(&self) -> u64; + /// Returns the size of the allocation in bytes. + fn size(&self) -> usize; + /// Returns the AsahiDevice that owns this allocation. + fn device(&self) -> &AsahiDevice; +} + +/// Represents a typed allocation. +pub(crate) trait Allocation<T>: Debug { + /// Returns the typed CPU-side pointer (if CPU mapping is enabled). + fn ptr(&self) -> Option<NonNull<T>>; + /// Returns the GPU VA pointer as a u64. + fn gpu_ptr(&self) -> u64; + /// Returns the size of the allocation in bytes. + fn size(&self) -> usize; + /// Returns the AsahiDevice that owns this allocation. 
+ fn device(&self) -> &AsahiDevice; +} + +/// A generic typed allocation wrapping a RawAllocation. +/// +/// This is currently the only Allocation implementation, since it is shared by all allocators. +pub(crate) struct GenericAlloc<T, U: RawAllocation> { + alloc: U, + alloc_size: usize, + debug_offset: usize, + padding: usize, + _p: PhantomData<T>, +} + +impl<T, U: RawAllocation> Allocation<T> for GenericAlloc<T, U> { + fn ptr(&self) -> Option<NonNull<T>> { + self.alloc + .ptr() + .map(|p| unsafe { NonNull::new_unchecked(p.as_ptr().add(self.debug_offset) as *mut T) }) + } + fn gpu_ptr(&self) -> u64 { + self.alloc.gpu_ptr() + self.debug_offset as u64 + } + fn size(&self) -> usize { + self.alloc_size + } + fn device(&self) -> &AsahiDevice { + self.alloc.device() + } +} + +impl<T, U: RawAllocation> Debug for GenericAlloc<T, U> { + fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { + f.debug_struct(core::any::type_name::<GenericAlloc<T, U>>()) + .field("ptr", &format_args!("{:?}", self.ptr())) + .field("gpu_ptr", &format_args!("{:#X?}", self.gpu_ptr())) + .field("size", &format_args!("{:#X?}", self.size())) + .finish() + } +} + +/// Debugging data associated with an allocation, when debugging is enabled. +#[repr(C)] +struct AllocDebugData { + state: u32, + _pad: u32, + size: u64, + base_gpuva: u64, + obj_gpuva: u64, + name: [u8; 0x20], +} + +/// Magic flag indicating a live allocation. +const STATE_LIVE: u32 = 0x4556494c; +/// Magic flag indicating a freed allocation. +const STATE_DEAD: u32 = 0x44414544; + +/// Marker byte to identify when firmware/GPU write beyond the end of an allocation. 
+const GUARD_MARKER: u8 = 0x93; + +impl<T, U: RawAllocation> Drop for GenericAlloc<T, U> { + fn drop(&mut self) { + let debug_len = mem::size_of::<AllocDebugData>(); + if self.debug_offset >= debug_len { + if let Some(p) = self.alloc.ptr() { + unsafe { + let p = p.as_ptr().add(self.debug_offset - debug_len); + (p as *mut u32).write(STATE_DEAD); + } + } + } + if debug_enabled(DebugFlags::FillAllocations) { + if let Some(p) = self.ptr() { + unsafe { (p.as_ptr() as *mut u8).write_bytes(0xde, self.size()) }; + } + } + if self.padding != 0 { + if let Some(p) = self.ptr() { + let guard = unsafe { + core::slice::from_raw_parts( + (p.as_ptr() as *mut u8 as *const u8).add(self.size()), + self.padding, + ) + }; + if let Some(first_err) = guard.iter().position(|&r| r != GUARD_MARKER) { + let last_err = guard + .iter() + .rev() + .position(|&r| r != GUARD_MARKER) + .unwrap_or(0); + dev_warn!( + self.device(), + "Allocator: Corruption after object of type {} at {:#x}:{:#x} + {:#x}..={:#x}\n", + core::any::type_name::<T>(), + self.gpu_ptr(), + self.size(), + first_err, + self.padding - last_err - 1 + ); + } + } + } + } +} + +static_assert!(mem::size_of::<AllocDebugData>() == 0x40); + +/// A trait representing an allocator. +pub(crate) trait Allocator { + /// The raw allocation type used by this allocator. + type Raw: RawAllocation; + // TODO: Needs associated_type_defaults + // type Allocation<T> = GenericAlloc<T, Self::Raw>; + + /// Returns the `AsahiDevice` associated with this allocator. + fn device(&self) -> &AsahiDevice; + /// Returns whether CPU-side mapping is enabled. + fn cpu_maps(&self) -> bool; + /// Returns the minimum alignment for allocations. + fn min_align(&self) -> usize; + /// Allocate an object of the given size in bytes with the given alignment. + fn alloc(&mut self, size: usize, align: usize) -> Result<Self::Raw>; + + /// Returns a tuple of (count, size) of how much garbage (freed but not yet reusable objects) + /// exists in this allocator. Optional. 
+ fn garbage(&self) -> (usize, usize) { + (0, 0) + } + /// Collect garbage for this allocator, up to the given object count. Optional. + fn collect_garbage(&mut self, _count: usize) {} + + /// Allocate a new GpuStruct object. See [`GpuObject::new`]. + #[inline(never)] + fn new_object<T: GpuStruct>( + &mut self, + inner: T, + callback: impl for<'a> FnOnce(&'a T) -> T::Raw<'a>, + ) -> Result<GpuObject<T, GenericAlloc<T, Self::Raw>>> { + GpuObject::<T, GenericAlloc<T, Self::Raw>>::new(self.alloc_object()?, inner, callback) + } + + /// Allocate a new GpuStruct object. See [`GpuObject::new_boxed`]. + #[inline(never)] + fn new_boxed<T: GpuStruct>( + &mut self, + inner: Box<T>, + callback: impl for<'a> FnOnce( + &'a T, + &'a mut MaybeUninit<T::Raw<'a>>, + ) -> Result<&'a mut T::Raw<'a>>, + ) -> Result<GpuObject<T, GenericAlloc<T, Self::Raw>>> { + GpuObject::<T, GenericAlloc<T, Self::Raw>>::new_boxed(self.alloc_object()?, inner, callback) + } + + /// Allocate a new GpuStruct object. See [`GpuObject::new_inplace`]. + #[inline(never)] + fn new_inplace<T: GpuStruct>( + &mut self, + inner: T, + callback: impl for<'a> FnOnce( + &'a T, + &'a mut MaybeUninit<T::Raw<'a>>, + ) -> Result<&'a mut T::Raw<'a>>, + ) -> Result<GpuObject<T, GenericAlloc<T, Self::Raw>>> { + GpuObject::<T, GenericAlloc<T, Self::Raw>>::new_inplace( + self.alloc_object()?, + inner, + callback, + ) + } + + /// Allocate a new GpuStruct object. See [`GpuObject::new_default`]. + #[inline(never)] + fn new_default<T: GpuStruct + Default>( + &mut self, + ) -> Result<GpuObject<T, GenericAlloc<T, Self::Raw>>> + where + for<'a> <T as GpuStruct>::Raw<'a>: Default + Zeroed, + { + GpuObject::<T, GenericAlloc<T, Self::Raw>>::new_default(self.alloc_object()?) + } + + /// Allocate a new GpuStruct object. See [`GpuObject::new_prealloc`]. 
+ #[inline(never)] + fn new_prealloc<T: GpuStruct>( + &mut self, + inner_cb: impl FnOnce(GpuWeakPointer<T>) -> Result<Box<T>>, + raw_cb: impl for<'a> FnOnce( + &'a T, + &'a mut MaybeUninit<T::Raw<'a>>, + ) -> Result<&'a mut T::Raw<'a>>, + ) -> Result<GpuObject<T, GenericAlloc<T, Self::Raw>>> { + GpuObject::<T, GenericAlloc<T, Self::Raw>>::new_prealloc( + self.alloc_object()?, + inner_cb, + raw_cb, + ) + } + + /// Allocate a generic buffer of the given size and alignment, applying the debug features if + /// enabled to tag it and detect overflows. + fn alloc_generic<T>( + &mut self, + size: usize, + align: usize, + ) -> Result<GenericAlloc<T, Self::Raw>> { + let padding = if debug_enabled(DebugFlags::DetectOverflows) { + size + } else { + 0 + }; + + let ret: GenericAlloc<T, Self::Raw> = + if self.cpu_maps() && debug_enabled(debug::DebugFlags::DebugAllocations) { + let debug_align = self.min_align().max(align); + let debug_len = mem::size_of::<AllocDebugData>(); + let debug_offset = (debug_len * 2 + debug_align - 1) & !(debug_align - 1); + + let alloc = self.alloc(size + debug_offset + padding, align)?; + + let mut debug = AllocDebugData { + state: STATE_LIVE, + _pad: 0, + size: size as u64, + base_gpuva: alloc.gpu_ptr(), + obj_gpuva: alloc.gpu_ptr() + debug_offset as u64, + name: [0; 0x20], + }; + + let name = core::any::type_name::<T>().as_bytes(); + let len = name.len().min(debug.name.len() - 1); + debug.name[..len].copy_from_slice(&name[..len]); + + if let Some(p) = alloc.ptr() { + unsafe { + let p = p.as_ptr(); + p.write_bytes(0x42, debug_offset - 2 * debug_len); + let cur = p.add(debug_offset - debug_len) as *mut AllocDebugData; + let prev = p.add(debug_offset - 2 * debug_len) as *mut AllocDebugData; + prev.copy_from(cur, 1); + cur.copy_from(&debug, 1); + }; + } + + GenericAlloc { + alloc, + alloc_size: size, + debug_offset, + padding, + _p: PhantomData, + } + } else { + GenericAlloc { + alloc: self.alloc(size + padding, align)?, + alloc_size: size, + 
debug_offset: 0, + padding, + _p: PhantomData, + } + }; + + if debug_enabled(DebugFlags::FillAllocations) { + if let Some(p) = ret.ptr() { + unsafe { (p.as_ptr() as *mut u8).write_bytes(0xaa, ret.size()) }; + } + } + + if padding != 0 { + if let Some(p) = ret.ptr() { + unsafe { + (p.as_ptr() as *mut u8) + .add(ret.size()) + .write_bytes(GUARD_MARKER, padding); + } + } + } + + Ok(ret) + } + + /// Allocate an object of a given type, without actually initializing the allocation. + /// + /// This is useful to directly call [`GpuObject::new_*`], without borrowing a reference to the + /// allocator for the entire duration (e.g. if further allocations need to happen inside the + /// callbacks). + fn alloc_object<T: GpuStruct>(&mut self) -> Result<GenericAlloc<T, Self::Raw>> { + let size = mem::size_of::<T::Raw<'static>>(); + let align = mem::align_of::<T::Raw<'static>>(); + + self.alloc_generic(size, align) + } + + /// Allocate an empty `GpuArray` of a given type and length. + fn array_empty<T: Sized + Default>( + &mut self, + count: usize, + ) -> Result<GpuArray<T, GenericAlloc<T, Self::Raw>>> { + let size = mem::size_of::<T>() * count; + let align = mem::align_of::<T>(); + + let alloc = self.alloc_generic(size, align)?; + GpuArray::<T, GenericAlloc<T, Self::Raw>>::empty(alloc, count) + } + + /// Allocate an empty `GpuOnlyArray` of a given type and length. + fn array_gpuonly<T: Sized + Default>( + &mut self, + count: usize, + ) -> Result<GpuOnlyArray<T, GenericAlloc<T, Self::Raw>>> { + let size = mem::size_of::<T>() * count; + let align = mem::align_of::<T>(); + + let alloc = self.alloc_generic(size, align)?; + GpuOnlyArray::<T, GenericAlloc<T, Self::Raw>>::new(alloc, count) + } +} + +/// A simple allocation backed by a separate GEM object. +/// +/// # Invariants +/// `ptr` is either None or a valid, non-null pointer to the CPU view of the object. +/// `gpu_ptr` is the GPU-side VA of the object. 
+pub(crate) struct SimpleAllocation { + dev: AsahiDevice, + ptr: Option<NonNull<u8>>, + gpu_ptr: u64, + size: usize, + vm: mmu::Vm, + obj: crate::gem::ObjectRef, +} + +/// SAFETY: `SimpleAllocation` just points to raw memory and should be safe to send across threads. +unsafe impl Send for SimpleAllocation {} +unsafe impl Sync for SimpleAllocation {} + +impl Drop for SimpleAllocation { + fn drop(&mut self) { + mod_dev_dbg!( + self.device(), + "SimpleAllocator: drop object @ {:#x}\n", + self.gpu_ptr() + ); + if debug_enabled(DebugFlags::FillAllocations) { + if let Ok(vmap) = self.obj.vmap() { + vmap.as_mut_slice().fill(0x42); + } + } + self.obj.drop_vm_mappings(self.vm.id()); + } +} + +impl RawAllocation for SimpleAllocation { + fn ptr(&self) -> Option<NonNull<u8>> { + self.ptr + } + fn gpu_ptr(&self) -> u64 { + self.gpu_ptr + } + fn size(&self) -> usize { + self.size + } + + fn device(&self) -> &AsahiDevice { + &self.dev + } +} + +/// A simple allocator that allocates each object as its own GEM object, aligned to the end of a +/// page. +/// +/// This is very slow, but it has the advantage that over-reads by the firmware or GPU will fault on +/// the guard page after the allocation, which can be useful to validate that the firmware's or +/// GPU's idea of object size is what we expect. +pub(crate) struct SimpleAllocator { + dev: AsahiDevice, + start: u64, + end: u64, + prot: u32, + vm: mmu::Vm, + min_align: usize, + cpu_maps: bool, +} + +impl SimpleAllocator { + /// Create a new `SimpleAllocator` for a given address range and `Vm`. 
+ #[allow(dead_code)] + #[allow(clippy::too_many_arguments)] + pub(crate) fn new( + dev: &AsahiDevice, + vm: &mmu::Vm, + start: u64, + end: u64, + min_align: usize, + prot: u32, + _block_size: usize, + mut cpu_maps: bool, + _name: fmt::Arguments<'_>, + _keep_garbage: bool, + ) -> Result<SimpleAllocator> { + if debug_enabled(DebugFlags::ForceCPUMaps) { + cpu_maps = true; + } + Ok(SimpleAllocator { + dev: dev.clone(), + vm: vm.clone(), + start, + end, + prot, + min_align, + cpu_maps, + }) + } +} + +impl Allocator for SimpleAllocator { + type Raw = SimpleAllocation; + + fn device(&self) -> &AsahiDevice { + &self.dev + } + + fn cpu_maps(&self) -> bool { + self.cpu_maps + } + + fn min_align(&self) -> usize { + self.min_align + } + + #[inline(never)] + fn alloc(&mut self, size: usize, align: usize) -> Result<SimpleAllocation> { + let size_aligned = (size + mmu::UAT_PGSZ - 1) & !mmu::UAT_PGMSK; + let align = self.min_align.max(align); + let offset = (size_aligned - size) & !(align - 1); + + mod_dev_dbg!( + &self.dev, + "SimpleAllocator::new: size={:#x} size_al={:#x} al={:#x} off={:#x}\n", + size, + size_aligned, + align, + offset + ); + + let mut obj = crate::gem::new_kernel_object(&self.dev, size_aligned)?; + let p = obj.vmap()?.as_mut_ptr() as *mut u8; + if debug_enabled(DebugFlags::FillAllocations) { + obj.vmap()?.as_mut_slice().fill(0xde); + } + let iova = obj.map_into_range( + &self.vm, + self.start, + self.end, + self.min_align.max(mmu::UAT_PGSZ) as u64, + self.prot, + true, + )?; + + let ptr = unsafe { p.add(offset) } as *mut u8; + let gpu_ptr = (iova + offset) as u64; + + mod_dev_dbg!( + &self.dev, + "SimpleAllocator::new -> {:#?} / {:#?} | {:#x} / {:#x}\n", + p, + ptr, + iova, + gpu_ptr + ); + + Ok(SimpleAllocation { + dev: self.dev.clone(), + ptr: NonNull::new(ptr), + gpu_ptr, + size, + vm: self.vm.clone(), + obj, + }) + } +} + +/// Inner data for an allocation from the heap allocator. +/// +/// This is wrapped in an `mm::Node`. 
+pub(crate) struct HeapAllocationInner { + dev: AsahiDevice, + ptr: Option<NonNull<u8>>, + real_size: usize, +} + +/// SAFETY: `HeapAllocationInner` just points to raw memory and should be safe to send across threads. +unsafe impl Send for HeapAllocationInner {} +unsafe impl Sync for HeapAllocationInner {} + +/// Outer view of a heap allocation. +/// +/// This uses an Option<> so we can move the internal `Node` into the garbage pool when it gets +/// dropped. +/// +/// # Invariants +/// The `Option` must always be `Some(...)` while this object is alive. +pub(crate) struct HeapAllocation(Option<mm::Node<HeapAllocatorInner, HeapAllocationInner>>); + +impl Drop for HeapAllocation { + fn drop(&mut self) { + let node = self.0.take().unwrap(); + let size = node.size(); + let alloc = node.alloc_ref(); + + alloc.with(|a| { + if let Some(garbage) = a.garbage.as_mut() { + if garbage.try_push(node).is_err() { + dev_err!( + &a.dev, + "HeapAllocation[{}]::drop: Failed to keep garbage\n", + &*a.name, + ); + } + a.total_garbage += size as usize; + None + } else { + // We need to ensure node survives this scope, since dropping it + // will try to take the mm lock and deadlock us + Some(node) + } + }); + } +} + +impl mm::AllocInner<HeapAllocationInner> for HeapAllocatorInner { + fn drop_object( + &mut self, + start: u64, + _size: u64, + _color: usize, + obj: &mut HeapAllocationInner, + ) { + /* real_size == 0 means it's a guard node */ + if obj.real_size > 0 { + mod_dev_dbg!( + obj.dev, + "HeapAllocator[{}]: drop object @ {:#x} ({} bytes)\n", + &*self.name, + start, + obj.real_size, + ); + self.allocated -= obj.real_size; + } + } +} + +impl RawAllocation for HeapAllocation { + // SAFETY: This function must always return a valid pointer. 
+ // Since the HeapAllocation contains a reference to the + // backing_objects array that contains the object backing this pointer, + // and objects are only ever added to it, this pointer is guaranteed to + // remain valid for the lifetime of the HeapAllocation. + fn ptr(&self) -> Option<NonNull<u8>> { + self.0.as_ref().unwrap().ptr + } + // SAFETY: This function must always return a valid GPU pointer. + // See the explanation in ptr(). + fn gpu_ptr(&self) -> u64 { + self.0.as_ref().unwrap().start() + } + fn size(&self) -> usize { + self.0.as_ref().unwrap().size() as usize + } + fn device(&self) -> &AsahiDevice { + &self.0.as_ref().unwrap().dev + } +} + +/// Inner data for a heap allocator which uses the DRM MM range allocator to manage the heap. +/// +/// This is wrapped by an `mm::Allocator`. +struct HeapAllocatorInner { + dev: AsahiDevice, + allocated: usize, + backing_objects: Vec<(crate::gem::ObjectRef, u64)>, + garbage: Option<Vec<mm::Node<HeapAllocatorInner, HeapAllocationInner>>>, + total_garbage: usize, + name: CString, + vm_id: u64, +} + +/// A heap allocator which uses the DRM MM range allocator to manage its objects. +/// +/// The heap is composed of a series of GEM objects. This implementation only ever grows the heap, +/// never shrinks it. +pub(crate) struct HeapAllocator { + dev: AsahiDevice, + start: u64, + end: u64, + top: u64, + prot: u32, + vm: mmu::Vm, + min_align: usize, + block_size: usize, + cpu_maps: bool, + guard_nodes: Vec<mm::Node<HeapAllocatorInner, HeapAllocationInner>>, + mm: mm::Allocator<HeapAllocatorInner, HeapAllocationInner>, + name: CString, +} + +static LOCK_KEY: LockClassKey = LockClassKey::new(); + +impl HeapAllocator { + /// Create a new HeapAllocator for a given `Vm` and address range. 
+ #[allow(dead_code)] + #[allow(clippy::too_many_arguments)] + pub(crate) fn new( + dev: &AsahiDevice, + vm: &mmu::Vm, + start: u64, + end: u64, + min_align: usize, + prot: u32, + block_size: usize, + mut cpu_maps: bool, + name: fmt::Arguments<'_>, + keep_garbage: bool, + ) -> Result<HeapAllocator> { + if !min_align.is_power_of_two() { + return Err(EINVAL); + } + if debug_enabled(DebugFlags::ForceCPUMaps) { + cpu_maps = true; + } + + let name = CString::try_from_fmt(name)?; + + let inner = HeapAllocatorInner { + dev: dev.clone(), + allocated: 0, + backing_objects: Vec::new(), + // TODO: This clearly needs a try_clone() or similar + name: CString::try_from_fmt(fmt!("{}", &*name))?, + vm_id: vm.id(), + garbage: if keep_garbage { Some(Vec::new()) } else { None }, + total_garbage: 0, + }; + + let mm = mm::Allocator::new( + start, + end - start + 1, + inner, + c_str!("HeapAllocator"), + &LOCK_KEY, + )?; + + Ok(HeapAllocator { + dev: dev.clone(), + vm: vm.clone(), + start, + end, + top: start, + prot, + min_align, + block_size: block_size.max(min_align), + cpu_maps, + guard_nodes: Vec::new(), + mm, + name, + }) + } + + /// Add a new backing block of the given size to this heap. + /// + /// If CPU mapping is enabled, this also adds a guard node to the range allocator to ensure that + /// objects cannot straddle backing block boundaries, since we cannot easily create a contiguous + /// CPU VA mapping for them. This can create some fragmentation. If CPU mapping is disabled, we + /// skip the guard blocks, since the GPU view of the heap is always contiguous. 
+ fn add_block(&mut self, size: usize) -> Result { + let size_aligned = (size + mmu::UAT_PGSZ - 1) & !mmu::UAT_PGMSK; + + mod_dev_dbg!( + &self.dev, + "HeapAllocator[{}]::add_block: size={:#x} size_al={:#x}\n", + &*self.name, + size, + size_aligned, + ); + + if self.top.saturating_add(size_aligned as u64) >= self.end { + dev_err!( + &self.dev, + "HeapAllocator[{}]::add_block: Exhausted VA space\n", + &*self.name, + ); + } + + let mut obj = crate::gem::new_kernel_object(&self.dev, size_aligned)?; + if self.cpu_maps && debug_enabled(DebugFlags::FillAllocations) { + obj.vmap()?.as_mut_slice().fill(0xde); + } + + let gpu_ptr = self.top; + if let Err(e) = obj.map_at(&self.vm, gpu_ptr, self.prot, self.cpu_maps) { + dev_err!( + &self.dev, + "HeapAllocator[{}]::add_block: Failed to map at {:#x} ({:?})\n", + &*self.name, + gpu_ptr, + e + ); + return Err(e); + } + + self.mm + .with_inner(|inner| inner.backing_objects.try_reserve(1))?; + + let mut new_top = self.top + size_aligned as u64; + if self.cpu_maps { + let guard = self.min_align.max(mmu::UAT_PGSZ); + mod_dev_dbg!( + &self.dev, + "HeapAllocator[{}]::add_block: Adding guard node {:#x}:{:#x}\n", + &*self.name, + new_top, + guard + ); + + let inner = HeapAllocationInner { + dev: self.dev.clone(), + ptr: None, + real_size: 0, + }; + + let node = match self.mm.reserve_node(inner, new_top, guard as u64, 0) { + Ok(a) => a, + Err(a) => { + dev_err!( + &self.dev, + "HeapAllocator[{}]::add_block: Failed to reserve guard node {:#x}:{:#x}: {:?}\n", + &*self.name, + guard, + new_top, + a + ); + return Err(EIO); + } + }; + + self.guard_nodes.try_push(node)?; + + new_top += guard as u64; + } + mod_dev_dbg!( + &self.dev, + "HeapAllocator[{}]::add_block: top={:#x}\n", + &*self.name, + new_top + ); + + self.mm + .with_inner(|inner| inner.backing_objects.try_push((obj, gpu_ptr)))?; + + self.top = new_top; + + cls_dev_dbg!( + MemStats, + &self.dev, + "{} Heap: grow to {} bytes\n", + &*self.name, + self.top - self.start + ); + + Ok(()) + 
} + + /// Find the backing object index that backs a given GPU address. + fn find_obj(&mut self, addr: u64) -> Result<usize> { + self.mm.with_inner(|inner| { + inner + .backing_objects + .binary_search_by(|obj| { + let start = obj.1; + let end = obj.1 + obj.0.size() as u64; + if start > addr { + Ordering::Greater + } else if end <= addr { + Ordering::Less + } else { + Ordering::Equal + } + }) + .or(Err(ENOENT)) + }) + } +} + +impl Allocator for HeapAllocator { + type Raw = HeapAllocation; + + fn device(&self) -> &AsahiDevice { + &self.dev + } + + fn cpu_maps(&self) -> bool { + self.cpu_maps + } + + fn min_align(&self) -> usize { + self.min_align + } + + fn alloc(&mut self, size: usize, align: usize) -> Result<HeapAllocation> { + if align != 0 && !align.is_power_of_two() { + return Err(EINVAL); + } + let align = self.min_align.max(align); + let size_aligned = (size + align - 1) & !(align - 1); + + mod_dev_dbg!( + &self.dev, + "HeapAllocator[{}]::new: size={:#x} size_al={:#x}\n", + &*self.name, + size, + size_aligned, + ); + + let inner = HeapAllocationInner { + dev: self.dev.clone(), + ptr: None, + real_size: size, + }; + + let mut node = match self.mm.insert_node_generic( + inner, + size_aligned as u64, + align as u64, + 0, + mm::InsertMode::Best, + ) { + Ok(a) => a, + Err(a) => { + dev_err!( + &self.dev, + "HeapAllocator[{}]::new: Failed to insert node of size {:#x} / align {:#x}: {:?}\n", + &*self.name, size_aligned, align, a + ); + return Err(a); + } + }; + + self.mm.with_inner(|inner| inner.allocated += size); + + let mut new_object = false; + let start = node.start(); + let end = start + node.size(); + if end > self.top { + if start > self.top { + dev_warn!( + self.dev, + "HeapAllocator[{}]::alloc: top={:#x}, start={:#x}\n", + &*self.name, + self.top, + start + ); + } + let block_size = self.block_size.max((end - self.top) as usize); + self.add_block(block_size)?; + new_object = true; + } + assert!(end <= self.top); + + if self.cpu_maps { + mod_dev_dbg!( + 
self.dev, + "HeapAllocator[{}]::alloc: mapping to CPU\n", + &*self.name + ); + + let idx = if new_object { + None + } else { + Some(match self.find_obj(start) { + Ok(a) => a, + Err(_) => { + dev_warn!( + self.dev, + "HeapAllocator[{}]::alloc: Failed to find object at {:#x}\n", + &*self.name, + start + ); + return Err(EIO); + } + }) + }; + let (obj_start, obj_size, p) = self.mm.with_inner(|inner| -> Result<_> { + let idx = idx.unwrap_or(inner.backing_objects.len() - 1); + let obj = &mut inner.backing_objects[idx]; + let p = obj.0.vmap()?.as_mut_ptr() as *mut u8; + Ok((obj.1, obj.0.size(), p)) + })?; + assert!(obj_start <= start); + assert!(obj_start + obj_size as u64 >= end); + node.as_mut().inner_mut().ptr = + NonNull::new(unsafe { p.add((start - obj_start) as usize) }); + mod_dev_dbg!( + self.dev, + "HeapAllocator[{}]::alloc: CPU pointer = {:?}\n", + &*self.name, + node.ptr + ); + } + + mod_dev_dbg!( + self.dev, + "HeapAllocator[{}]::alloc: Allocated {:#x} bytes @ {:#x}\n", + &*self.name, + end - start, + start + ); + + Ok(HeapAllocation(Some(node))) + } + + fn garbage(&self) -> (usize, usize) { + self.mm.with_inner(|inner| { + if let Some(g) = inner.garbage.as_ref() { + (g.len(), inner.total_garbage) + } else { + (0, 0) + } + }) + } + + fn collect_garbage(&mut self, count: usize) { + // Take the garbage out of the inner block, so we can safely drop it without deadlocking + let mut garbage = Vec::new(); + + if garbage.try_reserve(count).is_err() { + dev_crit!( + self.dev, + "HeapAllocator[{}]:collect_garbage: failed to reserve space\n", + &*self.name, + ); + return; + } + + self.mm.with_inner(|inner| { + if let Some(g) = inner.garbage.as_mut() { + for node in g.drain(0..count) { + inner.total_garbage -= node.size() as usize; + garbage + .try_push(node) + .expect("try_push() failed after reserve()"); + } + } + }); + } +} + +impl Drop for HeapAllocatorInner { + fn drop(&mut self) { + mod_dev_dbg!( + self.dev, + "HeapAllocator[{}]: dropping allocator\n", + 
&*self.name + ); + if self.allocated > 0 { + // This should never happen + dev_crit!( + self.dev, + "HeapAllocator[{}]: dropping with {} bytes allocated\n", + &*self.name, + self.allocated + ); + } else { + for mut obj in self.backing_objects.drain(..) { + obj.0.drop_vm_mappings(self.vm_id); + } + } + } +} diff --git a/drivers/gpu/drm/asahi/asahi.rs b/drivers/gpu/drm/asahi/asahi.rs new file mode 100644 index 000000000000..e511d83f4cd1 --- /dev/null +++ b/drivers/gpu/drm/asahi/asahi.rs @@ -0,0 +1,53 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT +#![recursion_limit = "1024"] + +//! Driver for the Apple AGX GPUs found in Apple Silicon SoCs. + +mod alloc; +mod buffer; +mod channel; +mod debug; +mod driver; +mod event; +mod file; +mod float; +mod fw; +mod gem; +mod gpu; +mod hw; +mod initdata; +mod mem; +mod microseq; +mod mmu; +mod object; +mod place; +mod queue; +mod regs; +mod slotalloc; +mod util; +mod workqueue; + +use kernel::module_platform_driver; + +module_platform_driver! { + type: driver::AsahiDriver, + name: "asahi", + license: "Dual MIT/GPL", + params: { + debug_flags: u64 { + default: 0, + permissions: 0o644, + description: "Debug flags", + }, + fault_control: u32 { + default: 0, + permissions: 0, + description: "Fault control (0x0: hard faults, 0xb: macOS default)", + }, + initial_tvb_size: usize { + default: 0x8, + permissions: 0o644, + description: "Initial TVB size in blocks", + }, + }, +} diff --git a/drivers/gpu/drm/asahi/buffer.rs b/drivers/gpu/drm/asahi/buffer.rs new file mode 100644 index 000000000000..767ea161176f --- /dev/null +++ b/drivers/gpu/drm/asahi/buffer.rs @@ -0,0 +1,694 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! Tiled Vertex Buffer management +//! +//! This module manages the Tiled Vertex Buffer, also known as the Parameter Buffer (in imgtec +//! parlance) or the tiler heap (on other architectures). This buffer holds transformed primitive +//! data between the vertex/tiling stage and the fragment stage. +//! +//! 
On AGX, the buffer is a heap of 128K blocks split into 32K pages (which must be aligned to a +//! multiple of 32K in VA space). The buffer can be shared between multiple render jobs, and each +//! will allocate pages from it during vertex processing and return them during fragment processing. +//! +//! If the buffer runs out of free pages, the vertex pass stops and a partial fragment pass occurs, +//! spilling the intermediate render target state to RAM (a partial render). This is all managed +//! transparently by the firmware. Since partial renders are less efficient, the kernel must grow +//! the heap in response to feedback from the firmware to avoid partial renders in the future. +//! Currently, we only ever grow the heap, and never shrink it. +//! +//! AGX also supports memoryless render targets, which can be used for intermediate results within +//! a render pass. To support partial renders, it seems the GPU/firmware has the ability to borrow +//! pages from the TVB buffer as a temporary render target buffer. Since this happens during a +//! partial render itself, if the buffer runs out of space, it requires synchronous growth in +//! response to a firmware interrupt. This is not currently supported, but may be in the future, +//! though it is unclear whether it is worth the effort. +//! +//! This module is also in charge of managing the temporary objects associated with a single render +//! pass, which includes the top-level tile array, the tail pointer cache, preemption buffers, and +//! other miscellaneous structures collectively managed as a "scene". +//! +//! To avoid runaway memory usage, there is a maximum size for buffers (at that point it's unlikely +//! that partial renders will incur much overhead over the buffer data access itself). This is +//! different depending on whether memoryless render targets are in use, and is currently hardcoded +//! to the most common value used by macOS. 
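The block/page geometry described above can be sketched in isolation (plain standalone Rust, not driver code; `blocks_for` is a hypothetical helper mirroring the `PAGE_SHIFT`/`PAGES_PER_BLOCK` constants the module defines):

```rust
// Standalone sketch of the TVB geometry: 32K pages, 4 pages per 128K block.
// Not part of the driver; for illustration only.
const PAGE_SHIFT: usize = 15; // 32K pages
const PAGE_SIZE: usize = 1 << PAGE_SHIFT;
const PAGES_PER_BLOCK: usize = 4;
const BLOCK_SIZE: usize = PAGE_SIZE * PAGES_PER_BLOCK; // 128K

/// Number of whole blocks needed to back `bytes` of heap, rounding up.
fn blocks_for(bytes: usize) -> usize {
    (bytes + BLOCK_SIZE - 1) / BLOCK_SIZE
}

fn main() {
    assert_eq!(BLOCK_SIZE, 0x20000); // 128K blocks
    assert_eq!(blocks_for(1), 1); // even one byte needs a whole block
    assert_eq!(blocks_for(BLOCK_SIZE), 1);
    assert_eq!(blocks_for(BLOCK_SIZE + 1), 2);
    println!("1 MiB needs {} blocks", blocks_for(1 << 20));
}
```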
+ +use crate::debug::*; +use crate::fw::buffer; +use crate::fw::types::*; +use crate::util::*; +use crate::{alloc, fw, gpu, mmu, slotalloc}; +use crate::{box_in_place, place}; +use core::sync::atomic::Ordering; +use kernel::prelude::*; +use kernel::sync::{smutex::Mutex, Arc}; + +const DEBUG_CLASS: DebugFlags = DebugFlags::Buffer; + +/// There are 127 GPU/firmware-side buffer manager slots (yes, 127, not 128). +const NUM_BUFFERS: u32 = 127; + +/// Page size bits for buffer pages (32K). VAs must be aligned to this size. +pub(crate) const PAGE_SHIFT: usize = 15; +/// Page size for buffer pages. +pub(crate) const PAGE_SIZE: usize = 1 << PAGE_SHIFT; +/// Number of pages in a buffer block, which should be contiguous in VA space. +pub(crate) const PAGES_PER_BLOCK: usize = 4; +/// Size of a buffer block. +pub(crate) const BLOCK_SIZE: usize = PAGE_SIZE * PAGES_PER_BLOCK; + +/// Metadata about the tiling configuration for a scene. This is computed in the `render` module +/// based on dimensions, tile size, and other info. +pub(crate) struct TileInfo { + /// Tile count in the X dimension. Tiles are always 32x32. + pub(crate) tiles_x: u32, + /// Tile count in the Y dimension. Tiles are always 32x32. + pub(crate) tiles_y: u32, + /// Total tile count. + pub(crate) tiles: u32, + /// Micro-tile width (16 or 32). + pub(crate) utile_width: u32, + /// Micro-tile height (16 or 32). + pub(crate) utile_height: u32, + // Macro-tiles in the X dimension. Always 4. + //pub(crate) mtiles_x: u32, + // Macro-tiles in the Y dimension. Always 4. + //pub(crate) mtiles_y: u32, + /// Tiles per macro-tile in the X dimension. + pub(crate) tiles_per_mtile_x: u32, + /// Tiles per macro-tile in the Y dimension. + pub(crate) tiles_per_mtile_y: u32, + // Total tiles per macro-tile. + //pub(crate) tiles_per_mtile: u32, + /// Micro-tiles per macro-tile in the X dimension. + pub(crate) utiles_per_mtile_x: u32, + /// Micro-tiles per macro-tile in the Y dimension. 
+ pub(crate) utiles_per_mtile_y: u32, + // Total micro-tiles per macro-tile. + //pub(crate) utiles_per_mtile: u32, + /// Size of the top-level tilemap, in bytes (for all layers, one cluster). + pub(crate) tilemap_size: usize, + /// Size of the Tail Pointer Cache, in bytes (for all layers * clusters). + pub(crate) tpc_size: usize, + /// Number of blocks in the clustering meta buffer (for clustering). + pub(crate) meta1_blocks: u32, + /// Minimum number of TVB blocks for this render. + pub(crate) min_tvb_blocks: usize, + /// XXX: Allocation factor for cluster tilemaps and meta4. Always 2? + pub(crate) cluster_factor: usize, + /// Tiling parameter structure passed to firmware. + pub(crate) params: fw::vertex::raw::TilingParameters, +} + +/// A single scene, representing a render pass and its required buffers. +#[versions(AGX)] +#[derive(Debug)] +pub(crate) struct Scene { + object: GpuObject<buffer::Scene::ver>, + slot: u32, + rebind: bool, + preempt2_off: usize, + preempt3_off: usize, + // Note: these are dead code only on some version variants. + // It's easier to do this than to propagate the version conditionals everywhere. + #[allow(dead_code)] + meta2_off: usize, + #[allow(dead_code)] + meta3_off: usize, + #[allow(dead_code)] + meta4_off: usize, +} + +#[versions(AGX)] +impl Scene::ver { + /// Returns true if the buffer was bound to a fresh manager slot, and therefore needs an init + /// command before a render. + pub(crate) fn rebind(&self) -> bool { + self.rebind + } + + /// Returns the buffer manager slot this scene's buffer was bound to. + pub(crate) fn slot(&self) -> u32 { + self.slot + } + + /// Returns the GPU pointer to the [`buffer::Scene::ver`]. + pub(crate) fn gpu_pointer(&self) -> GpuPointer<'_, buffer::Scene::ver> { + self.object.gpu_pointer() + } + + /// Returns the GPU weak pointer to the [`buffer::Scene::ver`]. 
+ pub(crate) fn weak_pointer(&self) -> GpuWeakPointer<buffer::Scene::ver> { + self.object.weak_pointer() + } + + /// Returns the GPU weak pointer to the kernel-side temp buffer. + /// (purpose unknown...) + pub(crate) fn kernel_buffer_pointer(&self) -> GpuWeakPointer<[u8]> { + self.object.buffer.inner.lock().kernel_buffer.weak_pointer() + } + + /// Returns the GPU pointer to the `buffer::Info::ver` object associated with this Scene. + pub(crate) fn buffer_pointer(&self) -> GpuPointer<'_, buffer::Info::ver> { + // We can't return the strong pointer directly since its lifetime crosses a lock, but we know + // its lifetime will be valid as long as &self since we hold a reference to the buffer, + // so just construct the strong pointer with the right lifetime here. + unsafe { self.weak_buffer_pointer().upgrade() } + } + + /// Returns the GPU weak pointer to the `buffer::Info::ver` object associated with this Scene. + pub(crate) fn weak_buffer_pointer(&self) -> GpuWeakPointer<buffer::Info::ver> { + self.object.buffer.inner.lock().info.weak_pointer() + } + + /// Returns the GPU pointer to the TVB heap metadata buffer. + pub(crate) fn tvb_heapmeta_pointer(&self) -> GpuPointer<'_, &'_ [u8]> { + self.object.tvb_heapmeta.gpu_pointer() + } + + /// Returns the GPU pointer to the top-level TVB tilemap buffer. + pub(crate) fn tvb_tilemap_pointer(&self) -> GpuPointer<'_, &'_ [u8]> { + self.object.tvb_tilemap.gpu_pointer() + } + + /// Returns the GPU pointer to the Tail Pointer Cache buffer. + pub(crate) fn tpc_pointer(&self) -> GpuPointer<'_, &'_ [u8]> { + self.object.tpc.gpu_pointer() + } + + /// Returns the GPU pointer to the first preemption scratch buffer. + pub(crate) fn preempt_buf_1_pointer(&self) -> GpuPointer<'_, &'_ [u8]> { + self.object.preempt_buf.gpu_pointer() + } + + /// Returns the GPU pointer to the second preemption scratch buffer. 
+ pub(crate) fn preempt_buf_2_pointer(&self) -> GpuPointer<'_, &'_ [u8]> { + self.object + .preempt_buf + .gpu_offset_pointer(self.preempt2_off) + } + + /// Returns the GPU pointer to the third preemption scratch buffer. + pub(crate) fn preempt_buf_3_pointer(&self) -> GpuPointer<'_, &'_ [u8]> { + self.object + .preempt_buf + .gpu_offset_pointer(self.preempt3_off) + } + + /// Returns the GPU pointer to the per-cluster tilemap buffer, if clustering is enabled. + #[allow(dead_code)] + pub(crate) fn cluster_tilemaps_pointer(&self) -> Option<GpuPointer<'_, &'_ [u8]>> { + self.object + .clustering + .as_ref() + .map(|c| c.tilemaps.gpu_pointer()) + } + + /// Returns the GPU pointer to the clustering metadata 1 buffer, if clustering is enabled. + #[allow(dead_code)] + pub(crate) fn meta_1_pointer(&self) -> Option<GpuPointer<'_, &'_ [u8]>> { + self.object + .clustering + .as_ref() + .map(|c| c.meta.gpu_pointer()) + } + + /// Returns the GPU pointer to the clustering metadata 2 buffer, if clustering is enabled. + #[allow(dead_code)] + pub(crate) fn meta_2_pointer(&self) -> Option<GpuPointer<'_, &'_ [u8]>> { + self.object + .clustering + .as_ref() + .map(|c| c.meta.gpu_offset_pointer(self.meta2_off)) + } + + /// Returns the GPU pointer to the clustering metadata 3 buffer, if clustering is enabled. + #[allow(dead_code)] + pub(crate) fn meta_3_pointer(&self) -> Option<GpuPointer<'_, &'_ [u8]>> { + self.object + .clustering + .as_ref() + .map(|c| c.meta.gpu_offset_pointer(self.meta3_off)) + } + + /// Returns the GPU pointer to the clustering metadata 4 buffer, if clustering is enabled. + #[allow(dead_code)] + pub(crate) fn meta_4_pointer(&self) -> Option<GpuPointer<'_, &'_ [u8]>> { + self.object + .clustering + .as_ref() + .map(|c| c.meta.gpu_offset_pointer(self.meta4_off)) + } + + /// Returns the GPU pointer to an unknown buffer with incrementing numbers. 
+ pub(crate) fn seq_buf_pointer(&self) -> GpuPointer<'_, &'_ [u64]> { + self.object.seq_buf.gpu_pointer() + } + + /// Returns the number of TVB bytes used for this scene. + pub(crate) fn used_bytes(&self) -> usize { + self.object + .with(|raw, _inner| raw.total_page_count.load(Ordering::Relaxed) as usize * PAGE_SIZE) + } + + /// Returns whether the TVB overflowed while rendering this scene. + pub(crate) fn overflowed(&self) -> bool { + self.object.with(|raw, _inner| { + raw.total_page_count.load(Ordering::Relaxed) + > raw.pass_page_count.load(Ordering::Relaxed) + }) + } +} + +#[versions(AGX)] +impl Drop for Scene::ver { + fn drop(&mut self) { + let mut inner = self.object.buffer.inner.lock(); + assert_ne!(inner.active_scenes, 0); + inner.active_scenes -= 1; + + if inner.active_scenes == 0 { + mod_pr_debug!( + "Buffer: no scenes left, dropping slot {}", + inner.active_slot.take().unwrap().slot() + ); + inner.active_slot = None; + } + } +} + +/// Inner data for a single TVB buffer object. +#[versions(AGX)] +struct BufferInner { + info: GpuObject<buffer::Info::ver>, + ualloc: Arc<Mutex<alloc::DefaultAllocator>>, + ualloc_priv: Arc<Mutex<alloc::DefaultAllocator>>, + blocks: Vec<GpuOnlyArray<u8>>, + max_blocks: usize, + max_blocks_nomemless: usize, + mgr: BufferManager, + active_scenes: usize, + active_slot: Option<slotalloc::Guard<()>>, + last_token: Option<slotalloc::SlotToken>, + tpc: Option<Arc<GpuArray<u8>>>, + kernel_buffer: GpuArray<u8>, + stats: GpuObject<buffer::Stats>, + preempt1_size: usize, + preempt2_size: usize, + preempt3_size: usize, + num_clusters: usize, +} + +/// Locked and reference counted TVB buffer. +#[versions(AGX)] +pub(crate) struct Buffer { + inner: Arc<Mutex<BufferInner::ver>>, +} + +#[versions(AGX)] +impl Buffer::ver { + /// Create a new Buffer for a given VM, given the per-VM allocators. 
+ pub(crate) fn new( + gpu: &dyn gpu::GpuManager, + alloc: &mut gpu::KernelAllocators, + ualloc: Arc<Mutex<alloc::DefaultAllocator>>, + ualloc_priv: Arc<Mutex<alloc::DefaultAllocator>>, + mgr: &BufferManager, + ) -> Result<Buffer::ver> { + // These are the typical max numbers on macOS. + // 8GB machines have this halved. + let max_size: usize = 862_322_688; // bytes + let max_size_nomemless = max_size / 3; + + let max_blocks = max_size / BLOCK_SIZE; + let max_blocks_nomemless = max_size_nomemless / BLOCK_SIZE; + let max_pages = max_blocks * PAGES_PER_BLOCK; + let max_pages_nomemless = max_blocks_nomemless * PAGES_PER_BLOCK; + + let num_clusters = gpu.get_dyncfg().id.num_clusters as usize; + let num_clusters_adj = if num_clusters > 1 { + num_clusters + 1 + } else { + 1 + }; + + let preempt1_size = num_clusters_adj * gpu.get_cfg().preempt1_size; + let preempt2_size = num_clusters_adj * gpu.get_cfg().preempt2_size; + let preempt3_size = num_clusters_adj * gpu.get_cfg().preempt3_size; + + let inner = box_in_place!(buffer::Info::ver { + block_ctl: alloc.shared.new_default::<buffer::BlockControl>()?, + counter: alloc.shared.new_default::<buffer::Counter>()?, + page_list: ualloc_priv.lock().array_empty(max_pages)?, + block_list: ualloc_priv.lock().array_empty(max_blocks * 2)?, + })?; + + let info = alloc.private.new_boxed(inner, |inner, ptr| { + Ok(place!( + ptr, + buffer::raw::Info::ver { + gpu_counter: 0x0, + unk_4: 0, + last_id: 0x0, + cur_id: -1, + unk_10: 0x0, + gpu_counter2: 0x0, + unk_18: 0x0, + #[ver(V < V13_0B4)] + unk_1c: 0x0, + page_list: inner.page_list.gpu_pointer(), + page_list_size: (4 * max_pages).try_into()?, + page_count: AtomicU32::new(0), + max_blocks: max_blocks.try_into()?, + block_count: AtomicU32::new(0), + unk_38: 0x0, + block_list: inner.block_list.gpu_pointer(), + block_ctl: inner.block_ctl.gpu_pointer(), + last_page: AtomicU32::new(0), + gpu_page_ptr1: 0x0, + gpu_page_ptr2: 0x0, + unk_58: 0x0, + block_size: BLOCK_SIZE as u32, + unk_60: U64(0x0), + 
counter: inner.counter.gpu_pointer(), + unk_70: 0x0, + unk_74: 0x0, + unk_78: 0x0, + unk_7c: 0x0, + unk_80: 0x1, + max_pages: max_pages.try_into()?, + max_pages_nomemless: max_pages_nomemless.try_into()?, + unk_8c: 0x0, + unk_90: Default::default(), + } + )) + })?; + + // Technically similar to Scene below, let's play it safe. + let kernel_buffer = alloc.shared.array_empty(0x40)?; + let stats = alloc + .shared + .new_object(Default::default(), |_inner| buffer::raw::Stats { + reset: AtomicU32::from(1), + ..Default::default() + })?; + + Ok(Buffer::ver { + inner: Arc::try_new(Mutex::new(BufferInner::ver { + info, + ualloc, + ualloc_priv, + blocks: Vec::new(), + max_blocks, + max_blocks_nomemless, + mgr: mgr.clone(), + active_scenes: 0, + active_slot: None, + last_token: None, + tpc: None, + kernel_buffer, + stats, + preempt1_size, + preempt2_size, + preempt3_size, + num_clusters, + }))?, + }) + } + + /// Returns the total block count allocated to this Buffer. + pub(crate) fn block_count(&self) -> u32 { + self.inner.lock().blocks.len() as u32 + } + + /// Returns the total size in bytes allocated to this Buffer. + pub(crate) fn size(&self) -> usize { + self.block_count() as usize * BLOCK_SIZE + } + + /// Automatically grow the Buffer based on feedback from the statistics. 
+ pub(crate) fn auto_grow(&self) -> Result<bool> { + let inner = self.inner.lock(); + + let used_pages = inner.stats.with(|raw, _inner| { + let used = raw.max_pages.load(Ordering::Relaxed); + raw.reset.store(1, Ordering::Release); + used as usize + }); + + let need_blocks = div_ceil(used_pages * 2, PAGES_PER_BLOCK).min(inner.max_blocks_nomemless); + let want_blocks = div_ceil(used_pages * 3, PAGES_PER_BLOCK).min(inner.max_blocks_nomemless); + + let cur_count = inner.blocks.len(); + + if need_blocks <= cur_count { + Ok(false) + } else { + // Grow to 3x requested size (same logic as macOS) + core::mem::drop(inner); + self.ensure_blocks(want_blocks)?; + Ok(true) + } + } + + /// Ensure that the buffer has at least a certain minimum size in blocks. + pub(crate) fn ensure_blocks(&self, min_blocks: usize) -> Result<bool> { + let mut inner = self.inner.lock(); + + let cur_count = inner.blocks.len(); + if cur_count >= min_blocks { + return Ok(false); + } + if min_blocks > inner.max_blocks { + return Err(ENOMEM); + } + + let add_blocks = min_blocks - cur_count; + let new_count = min_blocks; + + let mut new_blocks: Vec<GpuOnlyArray<u8>> = Vec::new(); + + // Allocate the new blocks first, so if it fails they will be dropped + let mut ualloc = inner.ualloc.lock(); + for _i in 0..add_blocks { + new_blocks.try_push(ualloc.array_gpuonly(BLOCK_SIZE)?)?; + } + core::mem::drop(ualloc); + + // Then actually commit them + inner.blocks.try_reserve(add_blocks)?; + + for (i, block) in new_blocks.into_iter().enumerate() { + let page_num = (block.gpu_va().get() >> PAGE_SHIFT) as u32; + + inner + .blocks + .try_push(block) + .expect("try_push() failed after try_reserve()"); + inner.info.block_list[2 * (cur_count + i)] = page_num; + for j in 0..PAGES_PER_BLOCK { + inner.info.page_list[(cur_count + i) * PAGES_PER_BLOCK + j] = page_num + j as u32; + } + } + + inner.info.block_ctl.with(|raw, _inner| { + raw.total.store(new_count as u32, Ordering::SeqCst); + raw.wptr.store(new_count as u32, 
Ordering::SeqCst); + }); + + let page_count = (new_count * PAGES_PER_BLOCK) as u32; + inner.info.with(|raw, _inner| { + raw.page_count.store(page_count, Ordering::Relaxed); + raw.block_count.store(new_count as u32, Ordering::Relaxed); + raw.last_page.store(page_count - 1, Ordering::Relaxed); + }); + + Ok(true) + } + + /// Create a new [`Scene::ver`] (render pass) using this buffer. + pub(crate) fn new_scene( + &self, + alloc: &mut gpu::KernelAllocators, + tile_info: &TileInfo, + ) -> Result<Scene::ver> { + let mut inner = self.inner.lock(); + + let tilemap_size = tile_info.tilemap_size; + let tpc_size = tile_info.tpc_size; + + // TODO: what is this exactly? + mod_pr_debug!("Buffer: Allocating TVB buffers\n"); + + // This seems to be a list, with 4x2 bytes of headers and 8 bytes per entry. + // On single-cluster devices, the used length always seems to be 1. + // On M1 Ultra, it can grow and usually doesn't exceed 8 * cluster_factor + // entries. macOS allocates a whole 64K * 0x80 for this, so let's go with + // that to be safe... + let user_buffer = inner.ualloc.lock().array_empty(if inner.num_clusters > 1 { + 0x10080 + } else { + 0x80 + })?; + + let tvb_heapmeta = inner.ualloc.lock().array_empty(0x200)?; + let tvb_tilemap = inner.ualloc.lock().array_empty(tilemap_size)?; + + mod_pr_debug!("Buffer: Allocating misc buffers\n"); + let preempt_buf = inner + .ualloc + .lock() + .array_empty(inner.preempt1_size + inner.preempt2_size + inner.preempt3_size)?; + + let mut seq_buf = inner.ualloc.lock().array_empty(0x800)?; + for i in 1..0x400 { + seq_buf[i] = (i + 1) as u64; + } + + let tpc = match inner.tpc.as_ref() { + Some(buf) if buf.len() >= tpc_size => buf.clone(), + _ => { + // macOS allocates this as shared GPU+FW, but + // priv seems to work and might be faster? + // Needs to be FW-writable anyway, so ualloc + // won't work. 
+ let buf = Arc::try_new( + inner + .ualloc_priv + .lock() + .array_empty((tpc_size + mmu::UAT_PGMSK) & !mmu::UAT_PGMSK)?, + )?; + inner.tpc = Some(buf.clone()); + buf + } + }; + + // Maybe: (4x4 macro tiles + 1 global page)*n, 32bit each (17*4*n) + let meta1_size = align(tile_info.meta1_blocks as usize * 0x44, 0x80); + // check + let meta2_size = align(0x190 * inner.num_clusters, 0x80); + let meta3_size = align(0x280 * inner.num_clusters, 0x80); + // Like user_buffer for single-cluster modes, 0x30 per cluster * the cluster + // factor. + let meta4_size = align(0x30 * inner.num_clusters * tile_info.cluster_factor, 0x80); + let meta_size = meta1_size + meta2_size + meta3_size + meta4_size; + + let clustering = if inner.num_clusters > 1 { + mod_pr_debug!("Buffer: Allocating clustering buffers\n"); + let tilemaps = inner + .ualloc + .lock() + .array_empty(inner.num_clusters * tilemap_size * tile_info.cluster_factor)?; + let meta = inner.ualloc.lock().array_empty(meta_size)?; + Some(buffer::ClusterBuffers { tilemaps, meta }) + } else { + None + }; + + let scene_inner = box_in_place!(buffer::Scene::ver { + user_buffer: user_buffer, + buffer: self.clone(), + tvb_heapmeta: tvb_heapmeta, + tvb_tilemap: tvb_tilemap, + tpc: tpc, + clustering: clustering, + preempt_buf: preempt_buf, + seq_buf: seq_buf, + })?; + + // Could be made strong, but we wind up with a deadlock if we try to grab the + // pointer through the inner.buffer path inside the closure. + let stats_pointer = inner.stats.weak_pointer(); + + // macOS allocates this as private. However, the firmware does not + // DC CIVAC this before reading it (like it does most other things), + // which causes odd cache incoherency bugs when combined with + // speculation on the firmware side (maybe). This doesn't happen + // on macOS because these structs are a circular pool that is mapped + // already initialized. Just mark this shared for now. 
+ let scene = alloc.shared.new_boxed(scene_inner, |inner, ptr| { + Ok(place!( + ptr, + buffer::raw::Scene { + pass_page_count: AtomicU32::new(0), + unk_4: 0, + unk_8: U64(0), + unk_10: U64(0), + user_buffer: inner.user_buffer.gpu_pointer(), + unk_20: 0, + stats: stats_pointer, + total_page_count: AtomicU32::new(0), + unk_30: U64(0), + unk_38: U64(0), + } + )) + })?; + + let mut rebind = false; + + if inner.active_slot.is_none() { + assert_eq!(inner.active_scenes, 0); + + let slot = inner.mgr.0.get(inner.last_token)?; + rebind = slot.changed(); + + mod_pr_debug!("Buffer: assigning slot {} (rebind={})", slot.slot(), rebind); + + inner.last_token = Some(slot.token()); + inner.active_slot = Some(slot); + } + + inner.active_scenes += 1; + + Ok(Scene::ver { + object: scene, + slot: inner.active_slot.as_ref().unwrap().slot(), + rebind, + preempt2_off: inner.preempt1_size, + preempt3_off: inner.preempt1_size + inner.preempt2_size, + meta2_off: meta1_size, + meta3_off: meta1_size + meta2_size, + meta4_off: meta1_size + meta2_size + meta3_size, + }) + } + + /// Increment the buffer manager usage count. Should be done once we know the Scene is ready + /// to be committed and used in commands submitted to the GPU. + pub(crate) fn increment(&self) { + let inner = self.inner.lock(); + inner.info.counter.with(|raw, _inner| { + // We could use fetch_add, but the non-LSE atomic + // sequence Rust produces confuses the hypervisor. + // We have inner locked anyway, so this is not racy. + let v = raw.count.load(Ordering::Relaxed); + raw.count.store(v + 1, Ordering::Relaxed); + }); + } +} + +#[versions(AGX)] +impl Clone for Buffer::ver { + fn clone(&self) -> Self { + Buffer::ver { + inner: self.inner.clone(), + } + } +} + +/// The GPU-global buffer manager, used to allocate and release buffer slots from the pool. 
+pub(crate) struct BufferManager(slotalloc::SlotAllocator<()>); + +impl BufferManager { + pub(crate) fn new() -> Result<BufferManager> { + Ok(BufferManager(slotalloc::SlotAllocator::new( + NUM_BUFFERS, + (), + |_inner, _slot| (), + )?)) + } +} + +impl Clone for BufferManager { + fn clone(&self) -> Self { + BufferManager(self.0.clone()) + } +} diff --git a/drivers/gpu/drm/asahi/channel.rs b/drivers/gpu/drm/asahi/channel.rs new file mode 100644 index 000000000000..0b3c3b65c279 --- /dev/null +++ b/drivers/gpu/drm/asahi/channel.rs @@ -0,0 +1,542 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! GPU ring buffer channels +//! +//! The GPU firmware uses a set of ring buffer channels to receive commands from the driver and send +//! it notifications and status messages. +//! +//! These ring buffers mostly follow uniform conventions, so they share the same base +//! implementation. + +use crate::debug::*; +use crate::driver::AsahiDevice; +use crate::fw::channels::*; +use crate::fw::initdata::{raw, ChannelRing}; +use crate::fw::types::*; +use crate::{event, gpu, mem}; +use core::time::Duration; +use kernel::{c_str, delay::coarse_sleep, prelude::*, sync::Arc, time}; + +pub(crate) use crate::fw::channels::PipeType; + +/// A receive (FW->driver) channel. +pub(crate) struct RxChannel<T: RxChannelState, U: Copy + Default> +where + for<'a> <T as GpuStruct>::Raw<'a>: Debug + Default + Zeroed, +{ + ring: ChannelRing<T, U>, + // FIXME: needs feature(generic_const_exprs) + //rptr: [u32; T::SUB_CHANNELS], + rptr: [u32; 6], + count: u32, +} + +impl<T: RxChannelState, U: Copy + Default> RxChannel<T, U> +where + for<'a> <T as GpuStruct>::Raw<'a>: Debug + Default + Zeroed, +{ + /// Allocates a new receive channel with a given message count. 
+ pub(crate) fn new(alloc: &mut gpu::KernelAllocators, count: usize) -> Result<RxChannel<T, U>> { + Ok(RxChannel { + ring: ChannelRing { + state: alloc.shared.new_default()?, + ring: alloc.shared.array_empty(T::SUB_CHANNELS * count)?, + }, + rptr: Default::default(), + count: count as u32, + }) + } + + /// Receives a message on the specified sub-channel index, optionally leaving in the ring + /// buffer. + /// + /// Returns None if the channel is empty. + fn get_or_peek(&mut self, index: usize, peek: bool) -> Option<U> { + self.ring.state.with(|raw, _inner| { + let wptr = T::wptr(raw, index); + let rptr = &mut self.rptr[index]; + if wptr == *rptr { + None + } else { + let off = self.count as usize * index; + let msg = self.ring.ring[off + *rptr as usize]; + if !peek { + *rptr = (*rptr + 1) % self.count; + T::set_rptr(raw, index, *rptr); + } + Some(msg) + } + }) + } + + /// Receives a message on the specified sub-channel index, and dequeues it from the ring buffer. + /// + /// Returns None if the channel is empty. + pub(crate) fn get(&mut self, index: usize) -> Option<U> { + self.get_or_peek(index, false) + } + + /// Peeks a message on the specified sub-channel index, leaving it in the ring buffer. + /// + /// Returns None if the channel is empty. + pub(crate) fn peek(&mut self, index: usize) -> Option<U> { + self.get_or_peek(index, true) + } +} + +/// A transmit (driver->FW) channel. +pub(crate) struct TxChannel<T: TxChannelState, U: Copy + Default> +where + for<'a> <T as GpuStruct>::Raw<'a>: Debug + Default + Zeroed, +{ + ring: ChannelRing<T, U>, + wptr: u32, + count: u32, +} + +impl<T: TxChannelState, U: Copy + Default> TxChannel<T, U> +where + for<'a> <T as GpuStruct>::Raw<'a>: Debug + Default + Zeroed, +{ + /// Allocates a new cached transmit channel with a given message count. 
+ pub(crate) fn new(alloc: &mut gpu::KernelAllocators, count: usize) -> Result<TxChannel<T, U>> { + Ok(TxChannel { + ring: ChannelRing { + state: alloc.shared.new_default()?, + ring: alloc.private.array_empty(count)?, + }, + wptr: 0, + count: count as u32, + }) + } + + /// Allocates a new uncached transmit channel with a given message count. + pub(crate) fn new_uncached( + alloc: &mut gpu::KernelAllocators, + count: usize, + ) -> Result<TxChannel<T, U>> { + Ok(TxChannel { + ring: ChannelRing { + state: alloc.shared.new_default()?, + ring: alloc.shared.array_empty(count)?, + }, + wptr: 0, + count: count as u32, + }) + } + + /// Send a message to the ring, returning a cookie with the ring buffer position. + /// + /// This will poll/block if the ring is full, which we don't really expect to happen. + pub(crate) fn put(&mut self, msg: &U) -> u32 { + self.ring.state.with(|raw, _inner| { + let next_wptr = (self.wptr + 1) % self.count; + let mut rptr = T::rptr(raw); + if next_wptr == rptr { + pr_err!( + "TX ring buffer is full! Waiting... ({}, {})\n", + next_wptr, + rptr + ); + // TODO: block properly on incoming messages? + while next_wptr == rptr { + coarse_sleep(Duration::from_millis(8)); + rptr = T::rptr(raw); + } + } + self.ring.ring[self.wptr as usize] = *msg; + mem::sync(); + T::set_wptr(raw, next_wptr); + self.wptr = next_wptr; + }); + self.wptr + } + + /// Wait for a previously submitted message to be popped off of the ring by the GPU firmware. + /// + /// This busy-loops, and is intended to be used for rare cases when we need to block for + /// completion of a cache management or invalidation operation synchronously (which + /// the firmware normally completes fast enough not to be worth sleeping for). + /// If the poll takes longer than 10ms, this switches to sleeping between polls. 
+ pub(crate) fn wait_for(&mut self, wptr: u32, timeout_ms: u64) -> Result { + const MAX_FAST_POLL: u64 = 10; + let start = time::ktime_get(); + let timeout_fast = start + Duration::from_millis(timeout_ms.min(MAX_FAST_POLL)); + let timeout_slow = start + Duration::from_millis(timeout_ms); + self.ring.state.with(|raw, _inner| { + while time::ktime_get() < timeout_fast { + if T::rptr(raw) == wptr { + return Ok(()); + } + mem::sync(); + } + while time::ktime_get() < timeout_slow { + if T::rptr(raw) == wptr { + return Ok(()); + } + coarse_sleep(Duration::from_millis(5)); + mem::sync(); + } + Err(ETIMEDOUT) + }) + } +} + +/// Device Control channel for global device management commands. +#[versions(AGX)] +pub(crate) struct DeviceControlChannel { + dev: AsahiDevice, + ch: TxChannel<ChannelState, DeviceControlMsg::ver>, +} + +#[versions(AGX)] +impl DeviceControlChannel::ver { + const COMMAND_TIMEOUT_MS: u64 = 1000; + + /// Allocate a new Device Control channel. + pub(crate) fn new( + dev: &AsahiDevice, + alloc: &mut gpu::KernelAllocators, + ) -> ResultDeviceControlChannel::ver { + Ok(DeviceControlChannel::ver { + dev: dev.clone(), + ch: TxChannel::<ChannelState, DeviceControlMsg::ver>::new(alloc, 0x100)?, + }) + } + + /// Returns the raw `ChannelRing` structure to pass to firmware. + pub(crate) fn to_raw(&self) -> raw::ChannelRing<ChannelState, DeviceControlMsg::ver> { + self.ch.ring.to_raw() + } + + /// Submits a Device Control command. + pub(crate) fn send(&mut self, msg: &DeviceControlMsg::ver) -> u32 { + cls_dev_dbg!(DeviceControlCh, self.dev, "DeviceControl: {:?}\n", msg); + self.ch.put(msg) + } + + /// Waits for a previously submitted Device Control command to complete. + pub(crate) fn wait_for(&mut self, wptr: u32) -> Result { + self.ch.wait_for(wptr, Self::COMMAND_TIMEOUT_MS) + } +} + +/// Pipe channel to submit WorkQueue execution requests. 
+#[versions(AGX)] +pub(crate) struct PipeChannel { + dev: AsahiDevice, + ch: TxChannel<ChannelState, PipeMsg::ver>, +} + +#[versions(AGX)] +impl PipeChannel::ver { + /// Allocate a new Pipe submission channel. + pub(crate) fn new( + dev: &AsahiDevice, + alloc: &mut gpu::KernelAllocators, + ) -> ResultPipeChannel::ver { + Ok(PipeChannel::ver { + dev: dev.clone(), + ch: TxChannel::<ChannelState, PipeMsg::ver>::new(alloc, 0x100)?, + }) + } + + /// Returns the raw `ChannelRing` structure to pass to firmware. + pub(crate) fn to_raw(&self) -> raw::ChannelRing<ChannelState, PipeMsg::ver> { + self.ch.ring.to_raw() + } + + /// Submits a Pipe kick command to the firmware. + pub(crate) fn send(&mut self, msg: &PipeMsg::ver) { + cls_dev_dbg!(PipeCh, self.dev, "Pipe: {:?}\n", msg); + self.ch.put(msg); + } +} + +/// Firmware Control channel, used for secure cache flush requests. +pub(crate) struct FwCtlChannel { + dev: AsahiDevice, + ch: TxChannel<FwCtlChannelState, FwCtlMsg>, +} + +impl FwCtlChannel { + const COMMAND_TIMEOUT_MS: u64 = 1000; + + /// Allocate a new Firmware Control channel. + pub(crate) fn new( + dev: &AsahiDevice, + alloc: &mut gpu::KernelAllocators, + ) -> Result<FwCtlChannel> { + Ok(FwCtlChannel { + dev: dev.clone(), + ch: TxChannel::<FwCtlChannelState, FwCtlMsg>::new_uncached(alloc, 0x100)?, + }) + } + + /// Returns the raw `ChannelRing` structure to pass to firmware. + pub(crate) fn to_raw(&self) -> raw::ChannelRing<FwCtlChannelState, FwCtlMsg> { + self.ch.ring.to_raw() + } + + /// Submits a Firmware Control command to the firmware. + pub(crate) fn send(&mut self, msg: &FwCtlMsg) -> u32 { + cls_dev_dbg!(FwCtlCh, self.dev, "FwCtl: {:?}\n", msg); + self.ch.put(msg) + } + + /// Waits for a previously submitted Firmware Control command to complete. 
+    pub(crate) fn wait_for(&mut self, wptr: u32) -> Result {
+        self.ch.wait_for(wptr, Self::COMMAND_TIMEOUT_MS)
+    }
+}
+
+/// Event channel, used to notify the driver of command completions, GPU faults and errors, and
+/// other events.
+pub(crate) struct EventChannel {
+    dev: AsahiDevice,
+    ch: RxChannel<ChannelState, RawEventMsg>,
+    mgr: Arc<event::EventManager>,
+    gpu: Option<Arc<dyn gpu::GpuManager>>,
+}
+
+impl EventChannel {
+    /// Allocate a new Event channel.
+    pub(crate) fn new(
+        dev: &AsahiDevice,
+        alloc: &mut gpu::KernelAllocators,
+        mgr: Arc<event::EventManager>,
+    ) -> Result<EventChannel> {
+        Ok(EventChannel {
+            dev: dev.clone(),
+            ch: RxChannel::<ChannelState, RawEventMsg>::new(alloc, 0x100)?,
+            mgr,
+            gpu: None,
+        })
+    }
+
+    /// Registers the managing `Gpu` instance that will handle events on this channel.
+    pub(crate) fn set_manager(&mut self, gpu: Arc<dyn gpu::GpuManager>) {
+        self.gpu = Some(gpu);
+    }
+
+    /// Returns the raw `ChannelRing` structure to pass to firmware.
+    pub(crate) fn to_raw(&self) -> raw::ChannelRing<ChannelState, RawEventMsg> {
+        self.ch.ring.to_raw()
+    }
+
+    /// Polls for new Event messages on this ring.
+    pub(crate) fn poll(&mut self) {
+        while let Some(msg) = self.ch.get(0) {
+            let tag = unsafe { msg.raw.0 };
+            match tag {
+                0..=EVENT_MAX => {
+                    let msg = unsafe { msg.msg };
+
+                    cls_dev_dbg!(EventCh, self.dev, "Event: {:?}\n", msg);
+                    match msg {
+                        EventMsg::Fault => match self.gpu.as_ref() {
+                            Some(gpu) => gpu.handle_fault(),
+                            None => {
+                                dev_crit!(self.dev, "EventChannel: No GPU manager available!\n")
+                            }
+                        },
+                        EventMsg::Timeout {
+                            counter,
+                            event_slot,
+                            ..
+                        } => match self.gpu.as_ref() {
+                            Some(gpu) => gpu.handle_timeout(counter, event_slot),
+                            None => {
+                                dev_crit!(self.dev, "EventChannel: No GPU manager available!\n")
+                            }
+                        },
+                        EventMsg::Flag {
+                            firing,
+                            ..
} => { + for (i, flags) in firing.iter().enumerate() { + for j in 0..32 { + if flags & (1u32 << j) != 0 { + self.mgr.signal((i * 32 + j) as u32); + } + } + } + } + msg => { + dev_crit!(self.dev, "Unknown event message: {:?}\n", msg); + } + } + } + _ => { + dev_warn!(self.dev, "Unknown event message: {:?}\n", unsafe { + msg.raw + }); + } + } + } + } +} + +/// Firmware Log channel. This one is pretty special, since it has 6 sub-channels (for different log +/// levels), and it also uses a side buffer to actually hold the log messages, only passing around +/// pointers in the main buffer. +pub(crate) struct FwLogChannel { + dev: AsahiDevice, + ch: RxChannel<FwLogChannelState, RawFwLogMsg>, + payload_buf: GpuArray<RawFwLogPayloadMsg>, +} + +impl FwLogChannel { + const RING_SIZE: usize = 0x100; + const BUF_SIZE: usize = 0x100; + + /// Allocate a new Firmware Log channel. + pub(crate) fn new( + dev: &AsahiDevice, + alloc: &mut gpu::KernelAllocators, + ) -> Result<FwLogChannel> { + Ok(FwLogChannel { + dev: dev.clone(), + ch: RxChannel::<FwLogChannelState, RawFwLogMsg>::new(alloc, Self::RING_SIZE)?, + payload_buf: alloc + .shared + .array_empty(Self::BUF_SIZE * FwLogChannelState::SUB_CHANNELS)?, + }) + } + + /// Returns the raw `ChannelRing` structure to pass to firmware. + pub(crate) fn to_raw(&self) -> raw::ChannelRing<FwLogChannelState, RawFwLogMsg> { + self.ch.ring.to_raw() + } + + /// Returns the GPU pointers to the firmware log payload buffer. + pub(crate) fn get_buf(&self) -> GpuWeakPointer<[RawFwLogPayloadMsg]> { + self.payload_buf.weak_pointer() + } + + /// Polls for new log messages on all sub-rings. 
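[Aside for reviewers, not part of the patch: the firmware log ring above carries only indices; the log text itself lives in a shared side buffer, with each of the six sub-channels owning a `BUF_SIZE`-sized window of it. A minimal standalone sketch of the bounds-checked index math, mirroring the checks done in `FwLogChannel::poll` (names illustrative):]

```rust
/// One shared payload array, partitioned per sub-channel:
/// sub-channel `i` owns entries [BUF_SIZE * i, BUF_SIZE * (i + 1)).
const SUB_CHANNELS: usize = 6;
const BUF_SIZE: usize = 0x100;

/// Translate (sub_channel, ring message index) into a payload array index,
/// rejecting out-of-bounds values as the driver's poll loop does.
fn payload_index(sub_channel: usize, msg_index: usize) -> Option<usize> {
    if sub_channel >= SUB_CHANNELS || msg_index >= BUF_SIZE {
        return None;
    }
    Some(BUF_SIZE * sub_channel + msg_index)
}
```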
+ pub(crate) fn poll(&mut self) { + for i in 0..=FwLogChannelState::SUB_CHANNELS - 1 { + while let Some(msg) = self.ch.peek(i) { + cls_dev_dbg!(FwLogCh, self.dev, "FwLog{}: {:?}\n", i, msg); + if msg.msg_type != 2 { + dev_warn!(self.dev, "Unknown FWLog{} message: {:?}\n", i, msg); + self.ch.get(i); + continue; + } + if msg.msg_index.0 as usize >= Self::BUF_SIZE { + dev_warn!( + self.dev, + "FWLog{} message index out of bounds: {:?}\n", + i, + msg + ); + self.ch.get(i); + continue; + } + let index = Self::BUF_SIZE * i + msg.msg_index.0 as usize; + let payload = &self.payload_buf.as_slice()[index]; + if payload.msg_type != 3 { + dev_warn!(self.dev, "Unknown FWLog{} payload: {:?}\n", i, payload); + self.ch.get(i); + continue; + } + let msg = if let Some(end) = payload.msg.iter().position(|&r| r == 0) { + CStr::from_bytes_with_nul(&(*payload.msg)[..end + 1]) + .unwrap_or(c_str!("cstr_err")) + } else { + dev_warn!( + self.dev, + "FWLog{} payload not NUL-terminated: {:?}\n", + i, + payload + ); + self.ch.get(i); + continue; + }; + match i { + 0 => dev_dbg!(self.dev, "FWLog: {}\n", msg), + 1 => dev_info!(self.dev, "FWLog: {}\n", msg), + 2 => dev_notice!(self.dev, "FWLog: {}\n", msg), + 3 => dev_warn!(self.dev, "FWLog: {}\n", msg), + 4 => dev_err!(self.dev, "FWLog: {}\n", msg), + 5 => dev_crit!(self.dev, "FWLog: {}\n", msg), + _ => (), + }; + self.ch.get(i); + } + } + } +} + +pub(crate) struct KTraceChannel { + dev: AsahiDevice, + ch: RxChannel<ChannelState, RawKTraceMsg>, +} + +/// KTrace channel, used to receive detailed execution trace markers from the firmware. +/// We currently disable this in initdata, so no messages are expected here at this time. +impl KTraceChannel { + /// Allocate a new KTrace channel. 
+    pub(crate) fn new(
+        dev: &AsahiDevice,
+        alloc: &mut gpu::KernelAllocators,
+    ) -> Result<KTraceChannel> {
+        Ok(KTraceChannel {
+            dev: dev.clone(),
+            ch: RxChannel::<ChannelState, RawKTraceMsg>::new(alloc, 0x200)?,
+        })
+    }
+
+    /// Returns the raw `ChannelRing` structure to pass to firmware.
+    pub(crate) fn to_raw(&self) -> raw::ChannelRing<ChannelState, RawKTraceMsg> {
+        self.ch.ring.to_raw()
+    }
+
+    /// Polls for new KTrace messages on this ring.
+    pub(crate) fn poll(&mut self) {
+        while let Some(msg) = self.ch.get(0) {
+            cls_dev_dbg!(KTraceCh, self.dev, "KTrace: {:?}\n", msg);
+        }
+    }
+}
+
+/// Statistics channel, reporting power-related statistics to the driver.
+/// Not really implemented other than debug logs yet...
+#[versions(AGX)]
+pub(crate) struct StatsChannel {
+    dev: AsahiDevice,
+    ch: RxChannel<ChannelState, RawStatsMsg::ver>,
+}
+
+#[versions(AGX)]
+impl StatsChannel::ver {
+    /// Allocate a new Statistics channel.
+    pub(crate) fn new(
+        dev: &AsahiDevice,
+        alloc: &mut gpu::KernelAllocators,
+    ) -> Result<StatsChannel::ver> {
+        Ok(StatsChannel::ver {
+            dev: dev.clone(),
+            ch: RxChannel::<ChannelState, RawStatsMsg::ver>::new(alloc, 0x100)?,
+        })
+    }
+
+    /// Returns the raw `ChannelRing` structure to pass to firmware.
+    pub(crate) fn to_raw(&self) -> raw::ChannelRing<ChannelState, RawStatsMsg::ver> {
+        self.ch.ring.to_raw()
+    }
+
+    /// Polls for new statistics messages on this ring.
+ pub(crate) fn poll(&mut self) { + while let Some(msg) = self.ch.get(0) { + let tag = unsafe { msg.raw.0 }; + match tag { + 0..=STATS_MAX::ver => { + let msg = unsafe { msg.msg }; + cls_dev_dbg!(StatsCh, self.dev, "Stats: {:?}\n", msg); + } + _ => { + pr_warn!("Unknown stats message: {:?}\n", unsafe { msg.raw }); + } + } + } + } +} diff --git a/drivers/gpu/drm/asahi/debug.rs b/drivers/gpu/drm/asahi/debug.rs new file mode 100644 index 000000000000..2f3a70e04cfd --- /dev/null +++ b/drivers/gpu/drm/asahi/debug.rs @@ -0,0 +1,129 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT +#![allow(dead_code)] + +//! Debug enable/disable flags and convenience macros + +#[allow(unused_imports)] +pub(crate) use super::{cls_dev_dbg, cls_pr_debug, debug, mod_dev_dbg, mod_pr_debug}; +use core::sync::atomic::{AtomicU64, Ordering}; + +static DEBUG_FLAGS: AtomicU64 = AtomicU64::new(0); + +/// Debug flag bit indices +pub(crate) enum DebugFlags { + // 0-3: Memory-related debug + Mmu = 0, + Alloc = 1, + Gem = 2, + Object = 3, + + // 4-7: Firmware objects and resources + Event = 4, + Buffer = 5, + WorkQueue = 6, + + // 8-13: DRM interface, rendering, compute, GPU globals + Gpu = 8, + File = 9, + Queue = 10, + Render = 11, + Compute = 12, + + // 14-15: Misc stats + MemStats = 14, + TVBStats = 15, + + // 16-22: Channels + FwLogCh = 16, + KTraceCh = 17, + StatsCh = 18, + EventCh = 19, + PipeCh = 20, + DeviceControlCh = 21, + FwCtlCh = 22, + + // 32-35: Allocator debugging + FillAllocations = 32, + DebugAllocations = 33, + DetectOverflows = 34, + ForceCPUMaps = 35, + + // 36-: Behavior flags + ConservativeTlbi = 36, + KeepGpuPowered = 37, + WaitForPowerOff = 38, + NoGpuRecovery = 39, + DisableClustering = 40, + + // 48-: Misc + Debug0 = 48, + Debug1 = 49, + Debug2 = 50, + Debug3 = 51, + Debug4 = 52, + Debug5 = 53, + Debug6 = 54, + Debug7 = 55, +} + +/// Update the cached global debug flags from the module parameter +pub(crate) fn update_debug_flags() { + let flags = { + let lock = 
crate::THIS_MODULE.kernel_param_lock(); + *crate::debug_flags.read(&lock) + }; + + DEBUG_FLAGS.store(flags, Ordering::Relaxed); +} + +/// Check whether debug is enabled for a given flag +#[inline(always)] +pub(crate) fn debug_enabled(flag: DebugFlags) -> bool { + DEBUG_FLAGS.load(Ordering::Relaxed) & 1 << (flag as usize) != 0 +} + +/// Run some code only if debug is enabled for the calling module +#[macro_export] +macro_rules! debug { + ($($arg:tt)*) => { + if $crate::debug::debug_enabled(DEBUG_CLASS) { + $($arg)* + } + }; +} + +/// pr_info!() if debug is enabled for the calling module +#[macro_export] +macro_rules! mod_pr_debug ( + ($($arg:tt)*) => ( + $crate::debug! { ::kernel::pr_info! ( $($arg)* ); } + ) +); + +/// dev_info!() if debug is enabled for the calling module +#[macro_export] +macro_rules! mod_dev_dbg ( + ($($arg:tt)*) => ( + $crate::debug! { ::kernel::dev_info! ( $($arg)* ); } + ) +); + +/// pr_info!() if debug is enabled for a specific module +#[macro_export] +macro_rules! cls_pr_debug ( + ($cls:ident, $($arg:tt)*) => ( + if $crate::debug::debug_enabled($crate::debug::DebugFlags::$cls) { + ::kernel::pr_info! ( $($arg)* ); + } + ) +); + +/// dev_info!() if debug is enabled for a specific module +#[macro_export] +macro_rules! cls_dev_dbg ( + ($cls:ident, $($arg:tt)*) => ( + if $crate::debug::debug_enabled($crate::debug::DebugFlags::$cls) { + ::kernel::dev_info! ( $($arg)* ); + } + ) +); diff --git a/drivers/gpu/drm/asahi/driver.rs b/drivers/gpu/drm/asahi/driver.rs new file mode 100644 index 000000000000..d49d8b1934a4 --- /dev/null +++ b/drivers/gpu/drm/asahi/driver.rs @@ -0,0 +1,166 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! Top-level GPU driver implementation. 
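[Aside for reviewers, not part of the patch: the debug machinery in debug.rs above caches the module parameter into a single `AtomicU64` and tests one bit per debug class, so per-message checks are a relaxed load plus a mask. A standalone userspace sketch of the same pattern (names illustrative, flag subset abridged):]

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Cached copy of the debug flags, as in debug.rs.
static DEBUG_FLAGS: AtomicU64 = AtomicU64::new(0);

/// Debug flag bit indices (abridged).
#[derive(Copy, Clone)]
enum DebugFlags {
    Mmu = 0,
    Alloc = 1,
    FwLogCh = 16,
}

/// Refresh the cached flags from the (here: caller-supplied) parameter.
fn update_debug_flags(param: u64) {
    DEBUG_FLAGS.store(param, Ordering::Relaxed);
}

/// Cheap per-message check: one relaxed load and a bit test.
fn debug_enabled(flag: DebugFlags) -> bool {
    DEBUG_FLAGS.load(Ordering::Relaxed) & (1u64 << (flag as usize)) != 0
}
```

[Relaxed ordering suffices because the flags only gate log verbosity; a stale value is harmless.]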
+ +use kernel::{ + c_str, device, drm, drm::drv, drm::ioctl, error::Result, of, platform, prelude::*, sync::Arc, +}; + +use crate::{debug, file, gem, gpu, hw, regs}; + +use kernel::device::RawDevice; +use kernel::macros::vtable; + +/// Driver metadata +const INFO: drv::DriverInfo = drv::DriverInfo { + major: 0, + minor: 0, + patchlevel: 0, + name: c_str!("asahi"), + desc: c_str!("Apple AGX Graphics"), + date: c_str!("20220831"), +}; + +/// Device data for the driver registration. +/// +/// Holds a reference to the top-level `GpuManager` object. +pub(crate) struct AsahiData { + pub(crate) dev: device::Device, + pub(crate) gpu: Arc<dyn gpu::GpuManager>, +} + +/// Convenience type alias for the `device::Data` type for this driver. +type DeviceData = device::Data<drv::Registration<AsahiDriver>, regs::Resources, AsahiData>; + +/// Empty struct representing this driver. +pub(crate) struct AsahiDriver; + +/// Convenience type alias for the DRM device type for this driver. +pub(crate) type AsahiDevice = kernel::drm::device::Device<AsahiDriver>; + +/// DRM Driver implementation for `AsahiDriver`. +#[vtable] +impl drv::Driver for AsahiDriver { + /// Our `DeviceData` type, reference-counted + type Data = Arc<DeviceData>; + /// Our `File` type. + type File = file::File; + /// Our `Object` type. + type Object = gem::Object; + + const INFO: drv::DriverInfo = INFO; + const FEATURES: u32 = + drv::FEAT_GEM | drv::FEAT_RENDER | drv::FEAT_SYNCOBJ | drv::FEAT_SYNCOBJ_TIMELINE; + + kernel::declare_drm_ioctls! 
{ + (ASAHI_GET_PARAMS, drm_asahi_get_params, + ioctl::RENDER_ALLOW, file::File::get_params), + (ASAHI_VM_CREATE, drm_asahi_vm_create, + ioctl::AUTH | ioctl::RENDER_ALLOW, file::File::vm_create), + (ASAHI_VM_DESTROY, drm_asahi_vm_destroy, + ioctl::AUTH | ioctl::RENDER_ALLOW, file::File::vm_destroy), + (ASAHI_GEM_CREATE, drm_asahi_gem_create, + ioctl::AUTH | ioctl::RENDER_ALLOW, file::File::gem_create), + (ASAHI_GEM_MMAP_OFFSET, drm_asahi_gem_mmap_offset, + ioctl::AUTH | ioctl::RENDER_ALLOW, file::File::gem_mmap_offset), + (ASAHI_GEM_BIND, drm_asahi_gem_bind, + ioctl::AUTH | ioctl::RENDER_ALLOW, file::File::gem_bind), + (ASAHI_QUEUE_CREATE, drm_asahi_queue_create, + ioctl::AUTH | ioctl::RENDER_ALLOW, file::File::queue_create), + (ASAHI_QUEUE_DESTROY, drm_asahi_queue_destroy, + ioctl::AUTH | ioctl::RENDER_ALLOW, file::File::queue_destroy), + (ASAHI_SUBMIT, drm_asahi_submit, + ioctl::AUTH | ioctl::RENDER_ALLOW, file::File::submit), + } +} + +// OF Device ID table. +kernel::define_of_id_table! {ASAHI_ID_TABLE, &'static hw::HwConfig, [ + (of::DeviceId::Compatible(b"apple,agx-t8103"), Some(&hw::t8103::HWCONFIG)), + (of::DeviceId::Compatible(b"apple,agx-t8112"), Some(&hw::t8112::HWCONFIG)), + (of::DeviceId::Compatible(b"apple,agx-t6000"), Some(&hw::t600x::HWCONFIG_T6000)), + (of::DeviceId::Compatible(b"apple,agx-t6001"), Some(&hw::t600x::HWCONFIG_T6001)), + (of::DeviceId::Compatible(b"apple,agx-t6002"), Some(&hw::t600x::HWCONFIG_T6002)), +]} + +/// Platform Driver implementation for `AsahiDriver`. +impl platform::Driver for AsahiDriver { + /// Our `DeviceData` type, reference-counted + type Data = Arc<DeviceData>; + /// Data associated with each hardware ID. + type IdInfo = &'static hw::HwConfig; + + // Assign the above OF ID table to this driver. + kernel::driver_of_id_table!(ASAHI_ID_TABLE); + + /// Device probe function. 
+ fn probe( + pdev: &mut platform::Device, + id_info: Option<&Self::IdInfo>, + ) -> Result<Arc<DeviceData>> { + debug::update_debug_flags(); + + let dev = device::Device::from_dev(pdev); + + dev_info!(dev, "Probing...\n"); + + let cfg = id_info.ok_or(ENODEV)?; + + pdev.set_dma_masks((1 << cfg.uat_oas) - 1)?; + + let res = regs::Resources::new(pdev)?; + + // Initialize misc MMIO + res.init_mmio()?; + + // Start the coprocessor CPU, so UAT can initialize the handoff + res.start_cpu()?; + + let node = dev.of_node().ok_or(EIO)?; + let compat: Vec<u32> = node.get_property(c_str!("apple,firmware-compat"))?; + + let reg = drm::drv::Registration::<AsahiDriver>::new(&dev)?; + let gpu = match (cfg.gpu_gen, compat.as_slice()) { + (hw::GpuGen::G13, &[12, 3, 0]) => { + gpu::GpuManagerG13V12_3::new(reg.device(), &res, cfg)? as Arc<dyn gpu::GpuManager> + } + (hw::GpuGen::G13, &[13, 2, 0]) => { + gpu::GpuManagerG13V13_2::new(reg.device(), &res, cfg)? as Arc<dyn gpu::GpuManager> + } + (hw::GpuGen::G14, &[12, 4, 0]) => { + gpu::GpuManagerG14V12_4::new(reg.device(), &res, cfg)? as Arc<dyn gpu::GpuManager> + } + (hw::GpuGen::G14, &[13, 2, 0]) => { + gpu::GpuManagerG14V13_2::new(reg.device(), &res, cfg)? as Arc<dyn gpu::GpuManager> + } + _ => { + dev_info!( + dev, + "Unsupported GPU/firmware combination ({:?}, {:?})\n", + cfg.gpu_gen, + compat + ); + return Err(ENODEV); + } + }; + + let data = + kernel::new_device_data!(reg, res, AsahiData { dev, gpu }, "Asahi::Registrations")?; + + let data = Arc::<DeviceData>::from(data); + + data.gpu.init()?; + + kernel::drm_device_register!( + data.registrations().ok_or(ENXIO)?.as_pinned_mut(), + data.clone(), + 0 + )?; + + dev_info!(data.dev, "Probed!\n"); + Ok(data) + } +} + +// Export the OF ID table as a module ID table, to make modpost/autoloading work. 
+kernel::module_of_id_table!(MOD_TABLE, ASAHI_ID_TABLE); diff --git a/drivers/gpu/drm/asahi/event.rs b/drivers/gpu/drm/asahi/event.rs new file mode 100644 index 000000000000..ccf00e4104be --- /dev/null +++ b/drivers/gpu/drm/asahi/event.rs @@ -0,0 +1,229 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! GPU event manager +//! +//! The GPU firmware manages work completion by using event objects (Apple calls them "stamps"), +//! which are monotonically incrementing counters. There are a fixed number of objects, and +//! they are managed with a `SlotAllocator`. +//! +//! This module manages the set of available events and lets users compute expected values. +//! It also manages signaling owners when the GPU firmware reports that an event fired. + +use crate::debug::*; +use crate::fw::types::*; +use crate::{gpu, slotalloc, workqueue}; +use core::cmp; +use core::sync::atomic::Ordering; +use kernel::prelude::*; +use kernel::sync::Arc; + +const DEBUG_CLASS: DebugFlags = DebugFlags::Event; + +/// Number of events managed by the firmware. +const NUM_EVENTS: u32 = 128; + +/// Inner data associated with a given event slot. +pub(crate) struct EventInner { + /// CPU pointer to the driver notification event stamp + stamp: *const AtomicU32, + /// GPU pointer to the driver notification event stamp + gpu_stamp: GpuWeakPointer<Stamp>, + /// GPU pointer to the firmware-internal event stamp + gpu_fw_stamp: GpuWeakPointer<FwStamp>, +} + +/// SAFETY: The event slots are safe to send across threads. +unsafe impl Send for EventInner {} + +/// Alias for an event token, which allows requesting the same event. +pub(crate) type Token = slotalloc::SlotToken; +/// Alias for an allocated `Event` that has a slot. +pub(crate) type Event = slotalloc::Guard<EventInner>; + +/// Represents a given stamp value for an event. 
+#[derive(Eq, PartialEq, Copy, Clone, Debug)]
+#[repr(transparent)]
+pub(crate) struct EventValue(u32);
+
+impl EventValue {
+    /// Returns the `EventValue` that succeeds this one.
+    pub(crate) fn next(&self) -> EventValue {
+        EventValue(self.0.wrapping_add(0x100))
+    }
+
+    /// Increments this `EventValue` in place.
+    pub(crate) fn increment(&mut self) {
+        self.0 = self.0.wrapping_add(0x100);
+    }
+
+    /* Not used
+    /// Increments this `EventValue` in place by a certain count.
+    pub(crate) fn add(&mut self, val: u32) {
+        self.0 = self
+            .0
+            .wrapping_add(val.checked_mul(0x100).expect("Adding too many events"));
+    }
+    */
+
+    /// Decrements this `EventValue` in place by a certain count.
+    pub(crate) fn sub(&mut self, val: u32) {
+        self.0 = self
+            .0
+            .wrapping_sub(val.checked_mul(0x100).expect("Subtracting too many events"));
+    }
+
+    /// Computes the delta between this event and another event.
+    pub(crate) fn delta(&self, other: &EventValue) -> i32 {
+        (self.0.wrapping_sub(other.0) as i32) >> 8
+    }
+}
+
+impl PartialOrd for EventValue {
+    fn partial_cmp(&self, other: &Self) -> Option<cmp::Ordering> {
+        Some(self.cmp(other))
+    }
+}
+
+impl Ord for EventValue {
+    fn cmp(&self, other: &Self) -> cmp::Ordering {
+        self.delta(other).cmp(&0)
+    }
+}
+
+impl EventInner {
+    /// Returns the GPU pointer to the driver notification stamp
+    pub(crate) fn stamp_pointer(&self) -> GpuWeakPointer<Stamp> {
+        self.gpu_stamp
+    }
+
+    /// Returns the GPU pointer to the firmware internal stamp
+    pub(crate) fn fw_stamp_pointer(&self) -> GpuWeakPointer<FwStamp> {
+        self.gpu_fw_stamp
+    }
+
+    /// Fetches the current event value from shared memory
+    pub(crate) fn current(&self) -> EventValue {
+        // SAFETY: The pointer is always valid as constructed in
+        // EventManager below, and outside users cannot construct
+        // new EventInners, nor move or copy them, and Guards as
+        // returned by the SlotAllocator hold a reference to the
+        // SlotAllocator containing the EventManagerInner, which
+        // keeps the
GpuObject the stamp is contained within alive. + EventValue(unsafe { &*self.stamp }.load(Ordering::Acquire)) + } +} + +impl slotalloc::SlotItem for EventInner { + type Data = EventManagerInner; + + fn release(&mut self, data: &mut Self::Data, slot: u32) { + mod_pr_debug!("EventManager: Released slot {}\n", slot); + data.owners[slot as usize] = None; + } +} + +/// Inner data for the event manager, to be protected by the SlotAllocator lock. +pub(crate) struct EventManagerInner { + stamps: GpuArray<Stamp>, + fw_stamps: GpuArray<FwStamp>, + // Note: Use dyn to avoid having to version this entire module. + owners: Vec<Option<Arc<dyn workqueue::WorkQueue + Send + Sync>>>, +} + +/// Top-level EventManager object. +pub(crate) struct EventManager { + alloc: slotalloc::SlotAllocator<EventInner>, +} + +impl EventManager { + /// Create a new EventManager. + #[inline(never)] + pub(crate) fn new(alloc: &mut gpu::KernelAllocators) -> Result<EventManager> { + let mut owners = Vec::new(); + for _i in 0..(NUM_EVENTS as usize) { + owners.try_push(None)?; + } + let inner = EventManagerInner { + stamps: alloc.shared.array_empty(NUM_EVENTS as usize)?, + fw_stamps: alloc.private.array_empty(NUM_EVENTS as usize)?, + owners, + }; + + Ok(EventManager { + alloc: slotalloc::SlotAllocator::new( + NUM_EVENTS, + inner, + |inner: &mut EventManagerInner, slot| EventInner { + stamp: &inner.stamps[slot as usize].0, + gpu_stamp: inner.stamps.weak_item_pointer(slot as usize), + gpu_fw_stamp: inner.fw_stamps.weak_item_pointer(slot as usize), + }, + )?, + }) + } + + /// Gets a free `Event`, optionally trying to reuse the last one allocated by this caller. 
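[Aside for reviewers, not part of the patch: the `EventValue` stamp arithmetic above advances in steps of 0x100 per command and defines ordering via the *signed* wrapping delta, so comparisons keep working across u32 wrap-around. A standalone sketch of that arithmetic (hypothetical names, mirroring `next`/`delta`):]

```rust
/// Sketch of the firmware stamp arithmetic: one command = one step of 0x100,
/// and the event count between two stamps is the signed wrapping difference
/// shifted down by 8 bits.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct StampValue(u32);

impl StampValue {
    /// The stamp value after one more command completes.
    fn next(self) -> StampValue {
        StampValue(self.0.wrapping_add(0x100))
    }

    /// Number of events between `self` and `other`; positive if `self` is ahead.
    /// The arithmetic right shift preserves the sign across wrap-around.
    fn delta(self, other: StampValue) -> i32 {
        (self.0.wrapping_sub(other.0) as i32) >> 8
    }
}
```

[Because ordering is derived from `delta`'s sign rather than raw comparison, a stamp that has just wrapped past 0 still compares greater than one just below `u32::MAX`.]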
+ pub(crate) fn get( + &self, + token: Option<Token>, + owner: Arc<dyn workqueue::WorkQueue + Send + Sync>, + ) -> Result<Event> { + let ev = self.alloc.get_inner(token, |inner, ev| { + mod_pr_debug!( + "EventManager: Registered owner {:p} on slot {}\n", + &*owner, + ev.slot() + ); + inner.owners[ev.slot() as usize] = Some(owner); + Ok(()) + })?; + Ok(ev) + } + + /// Signals an event by slot, indicating completion (of one or more commands). + pub(crate) fn signal(&self, slot: u32) { + match self + .alloc + .with_inner(|inner| inner.owners[slot as usize].as_ref().cloned()) + { + Some(owner) => { + owner.signal(); + } + None => { + mod_pr_debug!("EventManager: Received event for empty slot {}\n", slot); + } + } + } + + /// Marks the owner of an event as having lost its work due to a GPU error. + pub(crate) fn mark_error(&self, slot: u32, wait_value: u32, error: workqueue::WorkError) { + match self + .alloc + .with_inner(|inner| inner.owners[slot as usize].as_ref().cloned()) + { + Some(owner) => { + owner.mark_error(EventValue(wait_value), error); + } + None => { + pr_err!("Received error for empty slot {}\n", slot); + } + } + } + + /// Fail all commands, used when the GPU crashes. + pub(crate) fn fail_all(&self, error: workqueue::WorkError) { + let mut owners: Vec<Arc<dyn workqueue::WorkQueue + Send + Sync>> = Vec::new(); + + self.alloc.with_inner(|inner| { + for wq in inner.owners.iter().filter_map(|o| o.as_ref()).cloned() { + if owners.try_push(wq).is_err() { + pr_err!("Failed to signal failure to WorkQueue\n"); + } + } + }); + + for wq in owners { + wq.fail_all(error); + } + } +} diff --git a/drivers/gpu/drm/asahi/file.rs b/drivers/gpu/drm/asahi/file.rs new file mode 100644 index 000000000000..5d47feb30134 --- /dev/null +++ b/drivers/gpu/drm/asahi/file.rs @@ -0,0 +1,718 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT +#![allow(clippy::unusual_byte_groupings)] + +//! File implementation, which represents a single DRM client. +//! +//! 
This is in charge of managing the resources associated with one GPU client, including an
+//! arbitrary number of submission queues and Vm objects, and reporting hardware/driver
+//! information to userspace and accepting submissions.
+
+use crate::debug::*;
+use crate::driver::AsahiDevice;
+use crate::{alloc, buffer, driver, gem, mmu, queue};
+use core::mem::MaybeUninit;
+use kernel::dma_fence::RawDmaFence;
+use kernel::drm::gem::BaseObject;
+use kernel::io_buffer::{IoBufferReader, IoBufferWriter};
+use kernel::prelude::*;
+use kernel::sync::{smutex::Mutex, Arc};
+use kernel::user_ptr::UserSlicePtr;
+use kernel::{bindings, dma_fence, drm, xarray};
+
+const DEBUG_CLASS: DebugFlags = DebugFlags::File;
+
+const MAX_SYNCS_PER_SUBMISSION: u32 = 64;
+const MAX_COMMANDS_PER_SUBMISSION: u32 = 64;
+pub(crate) const MAX_COMMANDS_IN_FLIGHT: u32 = 1024;
+
+/// A client instance of an `mmu::Vm` address space.
+struct Vm {
+    ualloc: Arc<Mutex<alloc::DefaultAllocator>>,
+    ualloc_priv: Arc<Mutex<alloc::DefaultAllocator>>,
+    vm: mmu::Vm,
+    dummy_obj: gem::ObjectRef,
+}
+
+impl Drop for Vm {
+    fn drop(&mut self) {
+        // Mappings create a reference loop, make sure to break it.
+        self.dummy_obj.drop_vm_mappings(self.vm.id());
+    }
+}
+
+/// Sync object from userspace.
+pub(crate) struct SyncItem {
+    pub(crate) syncobj: drm::syncobj::SyncObj,
+    pub(crate) fence: Option<dma_fence::Fence>,
+    pub(crate) chain_fence: Option<dma_fence::FenceChain>,
+    pub(crate) timeline_value: u64,
+}
+
+impl SyncItem {
+    fn parse_one(file: &DrmFile, data: bindings::drm_asahi_sync, out: bool) -> Result<SyncItem> {
+        if data.extensions != 0 {
+            return Err(EINVAL);
+        }
+
+        match data.sync_type {
+            bindings::drm_asahi_sync_type_DRM_ASAHI_SYNC_SYNCOBJ => {
+                if data.timeline_value != 0 {
+                    return Err(EINVAL);
+                }
+                let syncobj = drm::syncobj::SyncObj::lookup_handle(file, data.handle)?;
+
+                Ok(SyncItem {
+                    fence: if out {
+                        None
+                    } else {
+                        Some(syncobj.fence_get().ok_or(EINVAL)?)
+                    },
+                    syncobj,
+                    chain_fence: None,
+                    timeline_value: data.timeline_value,
+                })
+            }
+            bindings::drm_asahi_sync_type_DRM_ASAHI_SYNC_TIMELINE_SYNCOBJ => {
+                let syncobj = drm::syncobj::SyncObj::lookup_handle(file, data.handle)?;
+                let fence = if out {
+                    None
+                } else {
+                    Some(
+                        syncobj
+                            .fence_get()
+                            .ok_or(EINVAL)?
+                            .chain_find_seqno(data.timeline_value)?,
+                    )
+                };
+
+                Ok(SyncItem {
+                    fence,
+                    syncobj,
+                    chain_fence: if out {
+                        Some(dma_fence::FenceChain::new()?)
+                    } else {
+                        None
+                    },
+                    timeline_value: data.timeline_value,
+                })
+            }
+            _ => Err(EINVAL),
+        }
+    }
+
+    fn parse_array(file: &DrmFile, ptr: u64, count: u32, out: bool) -> Result<Vec<SyncItem>> {
+        let mut vec = Vec::try_with_capacity(count as usize)?;
+
+        const STRIDE: usize = core::mem::size_of::<bindings::drm_asahi_sync>();
+        let size = STRIDE * count as usize;
+
+        // SAFETY: We only read this once, so there are no TOCTOU issues.
+        let mut reader = unsafe { UserSlicePtr::new(ptr as usize as *mut _, size).reader() };
+
+        for _i in 0..count {
+            let mut sync: MaybeUninit<bindings::drm_asahi_sync> = MaybeUninit::uninit();
+
+            // SAFETY: The size of `sync` is STRIDE
+            unsafe { reader.read_raw(sync.as_mut_ptr() as *mut u8, STRIDE)? };
+
+            // SAFETY: All bit patterns in the struct are valid
+            let sync = unsafe { sync.assume_init() };
+
+            vec.try_push(SyncItem::parse_one(file, sync, out)?)?;
+        }
+
+        Ok(vec)
+    }
+}
+
+/// State associated with a client.
+pub(crate) struct File {
+    id: u64,
+    vms: xarray::XArray<Box<Vm>>,
+    queues: xarray::XArray<Arc<Mutex<Box<dyn queue::Queue>>>>,
+}
+
+/// Convenience type alias for our DRM `File` type.
+pub(crate) type DrmFile = drm::file::File<File>;
+
+/// Start address of the 32-bit USC address space.
+const VM_SHADER_START: u64 = 0x11_00000000;
+/// End address of the 32-bit USC address space.
+const VM_SHADER_END: u64 = 0x11_ffffffff;
+/// Start address of the general user mapping region.
+const VM_USER_START: u64 = 0x20_00000000; +/// End address of the general user mapping region. +const VM_USER_END: u64 = 0x5f_ffffffff; + +/// Start address of the kernel-managed GPU-only mapping region. +const VM_DRV_GPU_START: u64 = 0x60_00000000; +/// End address of the kernel-managed GPU-only mapping region. +const VM_DRV_GPU_END: u64 = 0x60_ffffffff; +/// Start address of the kernel-managed GPU/FW shared mapping region. +const VM_DRV_GPUFW_START: u64 = 0x61_00000000; +/// End address of the kernel-managed GPU/FW shared mapping region. +const VM_DRV_GPUFW_END: u64 = 0x61_ffffffff; +/// Address of a special dummy page? +const VM_UNK_PAGE: u64 = 0x6f_ffff8000; + +impl drm::file::DriverFile for File { + type Driver = driver::AsahiDriver; + + /// Create a new `File` instance for a fresh client. + fn open(device: &AsahiDevice) -> Result<Box<Self>> { + debug::update_debug_flags(); + + let gpu = &device.data().gpu; + let id = gpu.ids().file.next(); + + mod_dev_dbg!(device, "[File {}]: DRM device opened\n", id); + Ok(Box::try_new(Self { + id, + vms: xarray::XArray::new(xarray::flags::ALLOC1)?, + queues: xarray::XArray::new(xarray::flags::ALLOC1)?, + })?) + } +} + +impl File { + /// IOCTL: get_param: Get a driver parameter value. 
+    pub(crate) fn get_params(
+        device: &AsahiDevice,
+        data: &mut bindings::drm_asahi_get_params,
+        file: &DrmFile,
+    ) -> Result<u32> {
+        mod_dev_dbg!(device, "[File {}]: IOCTL: get_params\n", file.id);
+
+        let gpu = &device.data().gpu;
+
+        if data.extensions != 0 || data.param_group != 0 || data.pad != 0 {
+            return Err(EINVAL);
+        }
+
+        let mut params = bindings::drm_asahi_params_global {
+            unstable_uabi_version: bindings::DRM_ASAHI_UNSTABLE_UABI_VERSION,
+            pad0: 0,
+
+            feat_compat: gpu.get_cfg().gpu_feat_compat,
+            feat_incompat: gpu.get_cfg().gpu_feat_incompat,
+
+            gpu_generation: gpu.get_dyncfg().id.gpu_gen as u32,
+            gpu_variant: gpu.get_dyncfg().id.gpu_variant as u32,
+            gpu_revision: gpu.get_dyncfg().id.gpu_rev as u32,
+            chip_id: gpu.get_cfg().chip_id,
+
+            num_dies: gpu.get_dyncfg().id.max_dies,
+            num_clusters_total: gpu.get_dyncfg().id.num_clusters,
+            num_cores_per_cluster: gpu.get_dyncfg().id.num_cores,
+            num_frags_per_cluster: gpu.get_dyncfg().id.num_frags,
+            num_gps_per_cluster: gpu.get_dyncfg().id.num_gps,
+            num_cores_total_active: gpu.get_dyncfg().id.total_active_cores,
+            core_masks: [0; bindings::DRM_ASAHI_MAX_CLUSTERS as usize],
+
+            vm_page_size: mmu::UAT_PGSZ as u32,
+            pad1: 0,
+            vm_user_start: VM_USER_START,
+            vm_user_end: VM_USER_END,
+            vm_shader_start: VM_SHADER_START,
+            vm_shader_end: VM_SHADER_END,
+
+            max_syncs_per_submission: MAX_SYNCS_PER_SUBMISSION,
+            max_commands_per_submission: MAX_COMMANDS_PER_SUBMISSION,
+            max_commands_in_flight: MAX_COMMANDS_IN_FLIGHT,
+            max_attachments: crate::microseq::MAX_ATTACHMENTS as u32,
+
+            timer_frequency_hz: gpu.get_cfg().base_clock_hz,
+            min_frequency_khz: gpu.get_dyncfg().pwr.min_frequency_khz(),
+            max_frequency_khz: gpu.get_dyncfg().pwr.max_frequency_khz(),
+            max_power_mw: gpu.get_dyncfg().pwr.max_power_mw,
+
+            result_render_size: core::mem::size_of::<bindings::drm_asahi_result_render>() as u32,
+            result_compute_size: core::mem::size_of::<bindings::drm_asahi_result_compute>() as u32,
+        };
+
+        for (i, mask) in gpu.get_dyncfg().id.core_masks.iter().enumerate() {
+            *(params.core_masks.get_mut(i).ok_or(EIO)?) = (*mask).try_into()?;
+        }
+
+        let size =
+            core::mem::size_of::<bindings::drm_asahi_params_global>().min(data.size.try_into()?);
+
+        // SAFETY: We only write to this userptr once, so there are no TOCTOU issues.
+        let mut params_writer =
+            unsafe { UserSlicePtr::new(data.pointer as usize as *mut _, size).writer() };
+
+        // SAFETY: `size` is at most the sizeof of `params`
+        unsafe { params_writer.write_raw(&params as *const _ as *const u8, size)? };
+
+        Ok(0)
+    }
+
+    /// IOCTL: vm_create: Create a new `Vm`.
+    pub(crate) fn vm_create(
+        device: &AsahiDevice,
+        data: &mut bindings::drm_asahi_vm_create,
+        file: &DrmFile,
+    ) -> Result<u32> {
+        if data.extensions != 0 {
+            return Err(EINVAL);
+        }
+
+        let gpu = &device.data().gpu;
+        let file_id = file.id;
+        let vm = gpu.new_vm(file_id)?;
+
+        let resv = file.vms.reserve()?;
+        let id: u32 = resv.index().try_into()?;
+
+        mod_dev_dbg!(device, "[File {} VM {}]: VM Create\n", file_id, id);
+        mod_dev_dbg!(
+            device,
+            "[File {} VM {}]: Creating allocators\n",
+            file_id,
+            id
+        );
+        let ualloc = Arc::try_new(Mutex::new(alloc::DefaultAllocator::new(
+            device,
+            &vm,
+            VM_DRV_GPU_START,
+            VM_DRV_GPU_END,
+            buffer::PAGE_SIZE,
+            mmu::PROT_GPU_SHARED_RW,
+            512 * 1024,
+            true,
+            fmt!("File {} VM {} GPU Shared", file_id, id),
+            false,
+        )?))?;
+        let ualloc_priv = Arc::try_new(Mutex::new(alloc::DefaultAllocator::new(
+            device,
+            &vm,
+            VM_DRV_GPUFW_START,
+            VM_DRV_GPUFW_END,
+            buffer::PAGE_SIZE,
+            mmu::PROT_GPU_FW_PRIV_RW,
+            64 * 1024,
+            true,
+            fmt!("File {} VM {} GPU FW Private", file_id, id),
+            false,
+        )?))?;
+
+        mod_dev_dbg!(
+            device,
+            "[File {} VM {}]: Creating dummy object\n",
+            file_id,
+            id
+        );
+        let mut dummy_obj = gem::new_kernel_object(device, 0x4000)?;
+        dummy_obj.vmap()?.as_mut_slice().fill(0);
+        dummy_obj.map_at(&vm, VM_UNK_PAGE, mmu::PROT_GPU_SHARED_RW, true)?;
+
+        mod_dev_dbg!(device, "[File {} VM {}]: VM created\n", file_id,
id); + resv.store(Box::try_new(Vm { + ualloc, + ualloc_priv, + vm, + dummy_obj, + })?)?; + + data.vm_id = id; + + Ok(0) + } + + /// IOCTL: vm_destroy: Destroy a `Vm`. + pub(crate) fn vm_destroy( + _device: &AsahiDevice, + data: &mut bindings::drm_asahi_vm_destroy, + file: &DrmFile, + ) -> Result<u32> { + if data.extensions != 0 { + return Err(EINVAL); + } + + if file.vms.remove(data.vm_id as usize).is_none() { + Err(ENOENT) + } else { + Ok(0) + } + } + + /// IOCTL: gem_create: Create a new GEM object. + pub(crate) fn gem_create( + device: &AsahiDevice, + data: &mut bindings::drm_asahi_gem_create, + file: &DrmFile, + ) -> Result<u32> { + mod_dev_dbg!( + device, + "[File {}]: IOCTL: gem_create size={:#x?}\n", + file.id, + data.size + ); + + if data.extensions != 0 + || (data.flags & !(bindings::ASAHI_GEM_WRITEBACK | bindings::ASAHI_GEM_VM_PRIVATE)) != 0 + || (data.flags & bindings::ASAHI_GEM_VM_PRIVATE == 0 && data.vm_id != 0) + { + return Err(EINVAL); + } + + let vm_id = if data.flags & bindings::ASAHI_GEM_VM_PRIVATE != 0 { + Some(file.vms.get(data.vm_id.try_into()?).ok_or(ENOENT)?.vm.id()) + } else { + None + }; + + let bo = gem::new_object(device, data.size.try_into()?, data.flags, vm_id)?; + + let handle = bo.gem.create_handle(file)?; + data.handle = handle; + + mod_dev_dbg!( + device, + "[File {}]: IOCTL: gem_create size={:#x} handle={:#x?}\n", + file.id, + data.size, + data.handle + ); + + Ok(0) + } + + /// IOCTL: gem_mmap_offset: Assign an mmap offset to a GEM object. 
+    pub(crate) fn gem_mmap_offset(
+        device: &AsahiDevice,
+        data: &mut bindings::drm_asahi_gem_mmap_offset,
+        file: &DrmFile,
+    ) -> Result<u32> {
+        mod_dev_dbg!(
+            device,
+            "[File {}]: IOCTL: gem_mmap_offset handle={:#x?}\n",
+            file.id,
+            data.handle
+        );
+
+        if data.extensions != 0 || data.flags != 0 {
+            return Err(EINVAL);
+        }
+
+        let bo = gem::lookup_handle(file, data.handle)?;
+        data.offset = bo.gem.create_mmap_offset()?;
+        Ok(0)
+    }
+
+    /// IOCTL: gem_bind: Map or unmap a GEM object into a Vm.
+    pub(crate) fn gem_bind(
+        device: &AsahiDevice,
+        data: &mut bindings::drm_asahi_gem_bind,
+        file: &DrmFile,
+    ) -> Result<u32> {
+        mod_dev_dbg!(
+            device,
+            "[File {} VM {}]: IOCTL: gem_bind op={:?} handle={:#x?} flags={:#x?} {:#x?}:{:#x?} -> {:#x?}\n",
+            file.id,
+            data.vm_id,
+            data.op,
+            data.handle,
+            data.flags,
+            data.offset,
+            data.range,
+            data.addr
+        );
+
+        if data.extensions != 0 {
+            return Err(EINVAL);
+        }
+
+        match data.op {
+            bindings::drm_asahi_bind_op_ASAHI_BIND_OP_BIND => Self::do_gem_bind(device, data, file),
+            bindings::drm_asahi_bind_op_ASAHI_BIND_OP_UNBIND => Err(ENOTSUPP),
+            bindings::drm_asahi_bind_op_ASAHI_BIND_OP_UNBIND_ALL => {
+                Self::do_gem_unbind_all(device, data, file)
+            }
+            _ => Err(EINVAL),
+        }
+    }
+
+    pub(crate) fn do_gem_bind(
+        _device: &AsahiDevice,
+        data: &mut bindings::drm_asahi_gem_bind,
+        file: &DrmFile,
+    ) -> Result<u32> {
+        if data.offset != 0 {
+            return Err(EINVAL); // Not supported yet
+        }
+
+        if (data.addr | data.range) as usize & mmu::UAT_PGMSK != 0 {
+            return Err(EINVAL); // Must be page aligned
+        }
+
+        if (data.flags & !(bindings::ASAHI_BIND_READ | bindings::ASAHI_BIND_WRITE)) != 0 {
+            return Err(EINVAL);
+        }
+
+        let mut bo = gem::lookup_handle(file, data.handle)?;
+
+        if data.range != bo.size().try_into()?
{ + return Err(EINVAL); // Not supported yet + } + + let start = data.addr; + let end = data.addr + data.range - 1; + + if (VM_SHADER_START..=VM_SHADER_END).contains(&start) { + if !(VM_SHADER_START..=VM_SHADER_END).contains(&end) { + return Err(EINVAL); // Invalid map range + } + } else if (VM_USER_START..=VM_USER_END).contains(&start) { + if !(VM_USER_START..=VM_USER_END).contains(&end) { + return Err(EINVAL); // Invalid map range + } + } else { + return Err(EINVAL); // Invalid map range + } + + // Just in case + if end >= VM_DRV_GPU_START { + return Err(EINVAL); + } + + let prot = if data.flags & bindings::ASAHI_BIND_READ != 0 { + if data.flags & bindings::ASAHI_BIND_WRITE != 0 { + mmu::PROT_GPU_SHARED_RW + } else { + mmu::PROT_GPU_SHARED_RO + } + } else if data.flags & bindings::ASAHI_BIND_WRITE != 0 { + mmu::PROT_GPU_SHARED_WO + } else { + return Err(EINVAL); // Must specify one of ASAHI_BIND_{READ,WRITE} + }; + + // Clone it immediately so we aren't holding the XArray lock + let vm = file + .vms + .get(data.vm_id.try_into()?) + .ok_or(ENOENT)? + .vm + .clone(); + + bo.map_at(&vm, start, prot, true)?; + + Ok(0) + } + + pub(crate) fn do_gem_unbind_all( + _device: &AsahiDevice, + data: &mut bindings::drm_asahi_gem_bind, + file: &DrmFile, + ) -> Result<u32> { + if data.flags != 0 || data.offset != 0 || data.range != 0 || data.addr != 0 { + return Err(EINVAL); + } + + let mut bo = gem::lookup_handle(file, data.handle)?; + + if data.vm_id == 0 { + bo.drop_file_mappings(file.id); + } else { + let vm_id = file.vms.get(data.vm_id.try_into()?).ok_or(ENOENT)?.vm.id(); + bo.drop_vm_mappings(vm_id); + } + + Ok(0) + } + + /// IOCTL: queue_create: Create a new command submission queue of a given type. 
+ pub(crate) fn queue_create( + device: &AsahiDevice, + data: &mut bindings::drm_asahi_queue_create, + file: &DrmFile, + ) -> Result<u32> { + let file_id = file.id; + + mod_dev_dbg!( + device, + "[File {} VM {}]: Creating queue caps={:?} prio={:?} flags={:#x?}\n", + file_id, + data.vm_id, + data.queue_caps, + data.priority, + data.flags, + ); + + if data.extensions != 0 + || data.flags != 0 + || data.priority > 3 + || data.queue_caps == 0 + || (data.queue_caps + & !(bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_RENDER + | bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_BLIT + | bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_COMPUTE)) + != 0 + { + return Err(EINVAL); + } + + let resv = file.queues.reserve()?; + let file_vm = file.vms.get(data.vm_id.try_into()?).ok_or(ENOENT)?; + let vm = file_vm.vm.clone(); + let ualloc = file_vm.ualloc.clone(); + let ualloc_priv = file_vm.ualloc_priv.clone(); + // Drop the vms lock eagerly + core::mem::drop(file_vm); + + let queue = + device + .data() + .gpu + .new_queue(vm, ualloc, ualloc_priv, data.priority, data.queue_caps)?; + + data.queue_id = resv.index().try_into()?; + resv.store(Arc::try_new(Mutex::new(queue))?)?; + + Ok(0) + } + + /// IOCTL: queue_destroy: Destroy a command submission queue. + pub(crate) fn queue_destroy( + _device: &AsahiDevice, + data: &mut bindings::drm_asahi_queue_destroy, + file: &DrmFile, + ) -> Result<u32> { + if data.extensions != 0 { + return Err(EINVAL); + } + + if file.queues.remove(data.queue_id as usize).is_none() { + Err(ENOENT) + } else { + Ok(0) + } + } + + /// IOCTL: submit: Submit GPU work to a command submission queue. 
+    pub(crate) fn submit(
+        device: &AsahiDevice,
+        data: &mut bindings::drm_asahi_submit,
+        file: &DrmFile,
+    ) -> Result<u32> {
+        if data.extensions != 0
+            || data.flags != 0
+            || data.in_sync_count > MAX_SYNCS_PER_SUBMISSION
+            || data.out_sync_count > MAX_SYNCS_PER_SUBMISSION
+            || data.command_count > MAX_COMMANDS_PER_SUBMISSION
+        {
+            return Err(EINVAL);
+        }
+
+        debug::update_debug_flags();
+
+        let gpu = &device.data().gpu;
+        gpu.update_globals();
+
+        // Upgrade to Arc<T> to drop the XArray lock early
+        let queue: Arc<Mutex<Box<dyn queue::Queue>>> = file
+            .queues
+            .get(data.queue_id.try_into()?)
+            .ok_or(ENOENT)?
+            .borrow()
+            .into();
+
+        let id = gpu.ids().submission.next();
+        mod_dev_dbg!(
+            device,
+            "[File {} Queue {}]: IOCTL: submit (submission ID: {})\n",
+            file.id,
+            data.queue_id,
+            id
+        );
+
+        mod_dev_dbg!(
+            device,
+            "[File {} Queue {}]: IOCTL: submit({}): Parsing in_syncs\n",
+            file.id,
+            data.queue_id,
+            id
+        );
+        let in_syncs = SyncItem::parse_array(file, data.in_syncs, data.in_sync_count, false)?;
+        mod_dev_dbg!(
+            device,
+            "[File {} Queue {}]: IOCTL: submit({}): Parsing out_syncs\n",
+            file.id,
+            data.queue_id,
+            id
+        );
+        let out_syncs = SyncItem::parse_array(file, data.out_syncs, data.out_sync_count, true)?;
+
+        let result_buf = if data.result_handle != 0 {
+            mod_dev_dbg!(
+                device,
+                "[File {} Queue {}]: IOCTL: submit({}): Looking up result_handle {}\n",
+                file.id,
+                data.queue_id,
+                id,
+                data.result_handle
+            );
+            Some(gem::lookup_handle(file, data.result_handle)?)
+        } else {
+            None
+        };
+
+        mod_dev_dbg!(
+            device,
+            "[File {} Queue {}]: IOCTL: submit({}): Parsing commands\n",
+            file.id,
+            data.queue_id,
+            id
+        );
+        let mut commands = Vec::try_with_capacity(data.command_count as usize)?;
+
+        const STRIDE: usize = core::mem::size_of::<bindings::drm_asahi_command>();
+        let size = STRIDE * data.command_count as usize;
+
+        // SAFETY: We only read this once, so there are no TOCTOU issues.
+        let mut reader =
+            unsafe { UserSlicePtr::new(data.commands as usize as *mut _, size).reader() };
+
+        for _i in 0..data.command_count {
+            let mut cmd: MaybeUninit<bindings::drm_asahi_command> = MaybeUninit::uninit();
+
+            // SAFETY: The size of `cmd` is STRIDE
+            unsafe { reader.read_raw(cmd.as_mut_ptr() as *mut u8, STRIDE)? };
+
+            // SAFETY: All bit patterns in the struct are valid
+            commands.try_push(unsafe { cmd.assume_init() })?;
+        }
+
+        let ret = queue
+            .lock()
+            .submit(id, in_syncs, out_syncs, result_buf, commands);
+
+        match ret {
+            Err(ERESTARTSYS) => Err(ERESTARTSYS),
+            Err(e) => {
+                dev_info!(
+                    device,
+                    "[File {} Queue {}]: IOCTL: submit failed! (submission ID: {} err: {:?})\n",
+                    file.id,
+                    data.queue_id,
+                    id,
+                    e
+                );
+                Err(e)
+            }
+            Ok(_) => Ok(0),
+        }
+    }
+
+    /// Returns the unique file ID for this `File`.
+    pub(crate) fn file_id(&self) -> u64 {
+        self.id
+    }
+}
+
+impl Drop for File {
+    fn drop(&mut self) {
+        mod_pr_debug!("[File {}]: Closing...\n", self.id);
+    }
+}
diff --git a/drivers/gpu/drm/asahi/float.rs b/drivers/gpu/drm/asahi/float.rs
new file mode 100644
index 000000000000..e73b4b628cf9
--- /dev/null
+++ b/drivers/gpu/drm/asahi/float.rs
@@ -0,0 +1,381 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+//! Basic soft floating-point support
+//!
+//! The GPU firmware requires a large number of power-related configuration values, many of which
+//! are IEEE 754 32-bit floating point values. These values change not only between GPU/SoC
+//! variants, but also between specific hardware platforms using these SoCs, so they must be
+//! derived from device tree properties. There are many redundant values computed from the same
+//! inputs with simple add/sub/mul/div calculations, plus a few values that are actually specific
+//! to each individual device depending on its binning and fused voltage configuration, so it
+//! doesn't make sense to store the final values to be passed to the firmware in the device tree.
+//!
+//! Therefore, we need a way to perform floating-point calculations in the kernel.
+//!
+//! Using the actual FPU from kernel mode is asking for trouble, since there is no way to bound
+//! the execution of FPU instructions to a controlled section of code without outright putting it
+//! in its own compilation unit, which is quite painful for Rust. Since these calculations only
+//! have to happen at initialization time and there is no need for performance, let's use a simple
+//! software float implementation instead.
+//!
+//! This implementation makes no attempt to be fully IEEE754 compliant, but it's good enough and
+//! gives bit-identical results to macOS in the vast majority of cases, with one or two exceptions
+//! related to slightly non-compliant rounding.
+
+use core::ops;
+use kernel::{of, prelude::*};
+
+/// An IEEE754-compatible floating point number implemented in software.
+#[derive(Default, Debug, Copy, Clone)]
+pub(crate) struct F32(u32);
+
+#[derive(Default, Debug, Copy, Clone)]
+struct F32U {
+    sign: bool,
+    exp: i32,
+    frac: i64,
+}
+
+impl F32 {
+    /// Convert a raw 32-bit representation into an F32
+    pub(crate) const fn from_bits(u: u32) -> F32 {
+        F32(u)
+    }
+
+    // Convert a `f32` value into an F32
+    //
+    // This must ONLY be used in const context. Use the `f32!{}` macro to do it safely.
+    #[doc(hidden)]
+    pub(crate) const fn from_f32(v: f32) -> F32 {
+        F32(unsafe { core::mem::transmute(v) })
+    }
+
+    // Convert an F32 into a `f32` value
+    //
+    // For testing only.
+    #[doc(hidden)]
+    #[cfg(test)]
+    pub(crate) fn to_f32(self) -> f32 {
+        f32::from_bits(self.0)
+    }
+
+    const fn unpack(&self) -> F32U {
+        F32U {
+            sign: self.0 & (1 << 31) != 0,
+            exp: ((self.0 >> 23) & 0xff) as i32 - 127,
+            frac: (((self.0 & 0x7fffff) | 0x800000) as i64) << 9,
+        }
+        .norm()
+    }
+}
+
+/// Safely construct an `F32` out of a constant floating-point value.
+///
+/// This ensures that the conversion happens in const context, so no floating point operations are
+/// emitted.
+#[macro_export] +macro_rules! f32 { + ([$($val:expr),*]) => {{ + [$(f32!($val)),*] + }}; + ($val:expr) => {{ + const _K: $crate::float::F32 = $crate::float::F32::from_f32($val); + _K + }}; +} + +impl ops::Neg for F32 { + type Output = F32; + + fn neg(self) -> F32 { + F32(self.0 ^ (1 << 31)) + } +} + +impl ops::Add<F32> for F32 { + type Output = F32; + + fn add(self, rhs: F32) -> F32 { + self.unpack().add(rhs.unpack()).pack() + } +} + +impl ops::Sub<F32> for F32 { + type Output = F32; + + fn sub(self, rhs: F32) -> F32 { + self.unpack().add((-rhs).unpack()).pack() + } +} + +impl ops::Mul<F32> for F32 { + type Output = F32; + + fn mul(self, rhs: F32) -> F32 { + self.unpack().mul(rhs.unpack()).pack() + } +} + +impl ops::Div<F32> for F32 { + type Output = F32; + + fn div(self, rhs: F32) -> F32 { + self.unpack().div(rhs.unpack()).pack() + } +} + +macro_rules! from_ints { + ($u:ty, $i:ty) => { + impl From<$i> for F32 { + fn from(v: $i) -> F32 { + F32U::from_i64(v as i64).pack() + } + } + impl From<$u> for F32 { + fn from(v: $u) -> F32 { + F32U::from_u64(v as u64).pack() + } + } + }; +} + +from_ints!(u8, i8); +from_ints!(u16, i16); +from_ints!(u32, i32); +from_ints!(u64, i64); + +impl F32U { + const INFINITY: F32U = f32!(f32::INFINITY).unpack(); + const NEG_INFINITY: F32U = f32!(f32::NEG_INFINITY).unpack(); + + fn from_i64(v: i64) -> F32U { + F32U { + sign: v < 0, + exp: 32, + frac: v.abs(), + } + .norm() + } + + fn from_u64(mut v: u64) -> F32U { + let mut exp = 32; + if v >= (1 << 63) { + exp = 31; + v >>= 1; + } + F32U { + sign: false, + exp, + frac: v as i64, + } + .norm() + } + + fn shr(&mut self, shift: i32) { + if shift > 63 { + self.exp = 0; + self.frac = 0; + } else { + self.frac >>= shift; + } + } + + fn align(a: &mut F32U, b: &mut F32U) { + if a.exp > b.exp { + b.shr(a.exp - b.exp); + b.exp = a.exp; + } else { + a.shr(b.exp - a.exp); + a.exp = b.exp; + } + } + + fn mul(self, other: F32U) -> F32U { + F32U { + sign: self.sign != other.sign, + exp: self.exp + 
other.exp,
+            frac: ((self.frac >> 8) * (other.frac >> 8)) >> 16,
+        }
+    }
+
+    fn div(self, other: F32U) -> F32U {
+        if other.frac == 0 || self.is_inf() {
+            if self.sign {
+                F32U::NEG_INFINITY
+            } else {
+                F32U::INFINITY
+            }
+        } else {
+            F32U {
+                sign: self.sign != other.sign,
+                exp: self.exp - other.exp,
+                frac: ((self.frac << 24) / (other.frac >> 8)),
+            }
+        }
+    }
+
+    fn add(mut self, mut other: F32U) -> F32U {
+        F32U::align(&mut self, &mut other);
+        if self.sign == other.sign {
+            self.frac += other.frac;
+        } else {
+            self.frac -= other.frac;
+        }
+        if self.frac < 0 {
+            self.sign = !self.sign;
+            self.frac = -self.frac;
+        }
+        self
+    }
+
+    const fn norm(mut self) -> F32U {
+        let lz = self.frac.leading_zeros() as i32;
+        if lz > 31 {
+            self.frac <<= lz - 31;
+            self.exp -= lz - 31;
+        } else if lz < 31 {
+            self.frac >>= 31 - lz;
+            self.exp += 31 - lz;
+        }
+
+        if self.is_zero() {
+            return F32U {
+                sign: self.sign,
+                frac: 0,
+                exp: 0,
+            };
+        }
+        self
+    }
+
+    const fn is_zero(&self) -> bool {
+        self.frac == 0 || self.exp < -126
+    }
+
+    const fn is_inf(&self) -> bool {
+        self.exp > 127
+    }
+
+    const fn pack(mut self) -> F32 {
+        self = self.norm();
+        if !self.is_zero() {
+            self.frac += 0x100;
+            self = self.norm();
+        }
+
+        if self.is_inf() {
+            if self.sign {
+                return f32!(f32::NEG_INFINITY);
+            } else {
+                return f32!(f32::INFINITY);
+            }
+        } else if self.is_zero() {
+            if self.sign {
+                return f32!(-0.0);
+            } else {
+                return f32!(0.0);
+            }
+        }
+
+        F32(if self.sign { 1u32 << 31 } else { 0u32 }
+            | ((self.exp + 127) as u32) << 23
+            | ((self.frac >> 9) & 0x7fffff) as u32)
+    }
+}
+
+impl<'a> TryFrom<of::Property<'a>> for F32 {
+    type Error = Error;
+
+    fn try_from(p: of::Property<'_>) -> core::result::Result<F32, Self::Error> {
+        let bits: u32 = p.try_into()?;
+        Ok(F32::from_bits(bits))
+    }
+}
+
+impl of::PropertyUnit for F32 {
+    const UNIT_SIZE: usize = 4;
+
+    fn from_bytes(data: &[u8]) -> Result<Self> {
+        Ok(F32::from_bits(<u32 as of::PropertyUnit>::from_bytes(data)?))
+    }
+}
+
+// TODO: Make this an actual test and figure out how to make it run.
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_all() {
+        fn add(a: f32, b: f32) {
+            println!(
+                "{} + {} = {} {}",
+                a,
+                b,
+                (F32::from_f32(a) + F32::from_f32(b)).to_f32(),
+                a + b
+            );
+        }
+        fn sub(a: f32, b: f32) {
+            println!(
+                "{} - {} = {} {}",
+                a,
+                b,
+                (F32::from_f32(a) - F32::from_f32(b)).to_f32(),
+                a - b
+            );
+        }
+        fn mul(a: f32, b: f32) {
+            println!(
+                "{} * {} = {} {}",
+                a,
+                b,
+                (F32::from_f32(a) * F32::from_f32(b)).to_f32(),
+                a * b
+            );
+        }
+        fn div(a: f32, b: f32) {
+            println!(
+                "{} / {} = {} {}",
+                a,
+                b,
+                (F32::from_f32(a) / F32::from_f32(b)).to_f32(),
+                a / b
+            );
+        }
+
+        fn test(a: f32, b: f32) {
+            add(a, b);
+            sub(a, b);
+            mul(a, b);
+            div(a, b);
+        }
+
+        test(1.123, 7.567);
+        test(1.123, 1.456);
+        test(7.567, 1.123);
+        test(1.123, -7.567);
+        test(1.123, -1.456);
+        test(7.567, -1.123);
+        test(-1.123, -7.567);
+        test(-1.123, -1.456);
+        test(-7.567, -1.123);
+        test(1000.123, 0.001);
+        test(1000.123, 0.0000001);
+        test(0.0012, 1000.123);
+        test(0.0000001, 1000.123);
+        test(0., 0.);
+        test(0., 1.);
+        test(1., 0.);
+        test(1., 1.);
+        test(2., f32::INFINITY);
+        test(2., f32::NEG_INFINITY);
+        test(f32::INFINITY, 2.);
+        test(f32::NEG_INFINITY, 2.);
+        test(f32::NEG_INFINITY, 2.);
+        test(f32::MAX, 2.);
+        test(f32::MIN, 2.);
+        test(f32::MIN_POSITIVE, 2.);
+        test(2., f32::MAX);
+        test(2., f32::MIN);
+        test(2., f32::MIN_POSITIVE);
+    }
+}
diff --git a/drivers/gpu/drm/asahi/fw/buffer.rs b/drivers/gpu/drm/asahi/fw/buffer.rs
new file mode 100644
index 000000000000..a8a467879518
--- /dev/null
+++ b/drivers/gpu/drm/asahi/fw/buffer.rs
@@ -0,0 +1,170 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+//!
GPU tiled vertex buffer control firmware structures + +use super::types::*; +use super::workqueue; +use crate::{default_zeroed, no_debug, trivial_gpustruct}; +use kernel::sync::Arc; + +pub(crate) mod raw { + use super::*; + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct BlockControl { + pub(crate) total: AtomicU32, + pub(crate) wptr: AtomicU32, + pub(crate) unk: AtomicU32, + pub(crate) pad: Pad<0x34>, + } + default_zeroed!(BlockControl); + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct Counter { + pub(crate) count: AtomicU32, + __pad: Pad<0x3c>, + } + default_zeroed!(Counter); + + #[derive(Debug, Default)] + #[repr(C)] + pub(crate) struct Stats { + pub(crate) max_pages: AtomicU32, + pub(crate) max_b: AtomicU32, + pub(crate) overflow_count: AtomicU32, + pub(crate) gpu_c: AtomicU32, + pub(crate) __pad0: Pad<0x10>, + pub(crate) reset: AtomicU32, + pub(crate) __pad1: Pad<0x1c>, + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct Info<'a> { + pub(crate) gpu_counter: u32, + pub(crate) unk_4: u32, + pub(crate) last_id: i32, + pub(crate) cur_id: i32, + pub(crate) unk_10: u32, + pub(crate) gpu_counter2: u32, + pub(crate) unk_18: u32, + + #[ver(V < V13_0B4)] + pub(crate) unk_1c: u32, + + pub(crate) page_list: GpuPointer<'a, &'a [u32]>, + pub(crate) page_list_size: u32, + pub(crate) page_count: AtomicU32, + pub(crate) max_blocks: u32, + pub(crate) block_count: AtomicU32, + pub(crate) unk_38: u32, + pub(crate) block_list: GpuPointer<'a, &'a [u32]>, + pub(crate) block_ctl: GpuPointer<'a, super::BlockControl>, + pub(crate) last_page: AtomicU32, + pub(crate) gpu_page_ptr1: u32, + pub(crate) gpu_page_ptr2: u32, + pub(crate) unk_58: u32, + pub(crate) block_size: u32, + pub(crate) unk_60: U64, + pub(crate) counter: GpuPointer<'a, super::Counter>, + pub(crate) unk_70: u32, + pub(crate) unk_74: u32, + pub(crate) unk_78: u32, + pub(crate) unk_7c: u32, + pub(crate) unk_80: u32, + pub(crate) max_pages: u32, + pub(crate) max_pages_nomemless: u32, + 
pub(crate) unk_8c: u32,
+        pub(crate) unk_90: Array<0x30, u8>,
+    }
+
+    #[derive(Debug)]
+    #[repr(C)]
+    pub(crate) struct Scene<'a> {
+        pub(crate) pass_page_count: AtomicU32,
+        pub(crate) unk_4: u32,
+        pub(crate) unk_8: U64,
+        pub(crate) unk_10: U64,
+        pub(crate) user_buffer: GpuPointer<'a, &'a [u8]>,
+        pub(crate) unk_20: u32,
+        pub(crate) stats: GpuWeakPointer<super::Stats>,
+        pub(crate) total_page_count: AtomicU32,
+        pub(crate) unk_30: U64, // pad
+        pub(crate) unk_38: U64, // pad
+    }
+
+    #[versions(AGX)]
+    #[derive(Debug)]
+    #[repr(C)]
+    pub(crate) struct InitBuffer<'a> {
+        pub(crate) tag: workqueue::CommandType,
+        pub(crate) vm_slot: u32,
+        pub(crate) buffer_slot: u32,
+        pub(crate) unk_c: u32,
+        pub(crate) block_count: u32,
+        pub(crate) buffer: GpuPointer<'a, super::Info::ver>,
+        pub(crate) stamp_value: EventValue,
+    }
+}
+
+trivial_gpustruct!(BlockControl);
+trivial_gpustruct!(Counter);
+trivial_gpustruct!(Stats);
+
+#[versions(AGX)]
+#[derive(Debug)]
+pub(crate) struct Info {
+    pub(crate) block_ctl: GpuObject<BlockControl>,
+    pub(crate) counter: GpuObject<Counter>,
+    pub(crate) page_list: GpuArray<u32>,
+    pub(crate) block_list: GpuArray<u32>,
+}
+
+#[versions(AGX)]
+impl GpuStruct for Info::ver {
+    type Raw<'a> = raw::Info::ver<'a>;
+}
+
+pub(crate) struct ClusterBuffers {
+    pub(crate) tilemaps: GpuArray<u8>,
+    pub(crate) meta: GpuArray<u8>,
+}
+
+#[versions(AGX)]
+pub(crate) struct Scene {
+    pub(crate) user_buffer: GpuArray<u8>,
+    pub(crate) buffer: crate::buffer::Buffer::ver,
+    pub(crate) tvb_heapmeta: GpuArray<u8>,
+    pub(crate) tvb_tilemap: GpuArray<u8>,
+    pub(crate) tpc: Arc<GpuArray<u8>>,
+    pub(crate) clustering: Option<ClusterBuffers>,
+    pub(crate) preempt_buf: GpuArray<u8>,
+    pub(crate) seq_buf: GpuArray<u64>,
+}
+
+#[versions(AGX)]
+no_debug!(Scene::ver);
+
+#[versions(AGX)]
+impl GpuStruct for Scene::ver {
+    type Raw<'a> = raw::Scene<'a>;
+}
+
+#[versions(AGX)]
+pub(crate) struct InitBuffer {
+    pub(crate) scene: Arc<crate::buffer::Scene::ver>,
+}
+
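(Reviewer aside: the firmware structures above all follow the same split, where an owning wrapper struct implements `GpuStruct` and names its GPU-visible `#[repr(C)]` layout through a lifetime-generic associated type. Below is a minimal self-contained sketch of that pattern with hypothetical names, not the driver's actual types, just to illustrate why `Raw<'a>` carries a lifetime: the raw form may borrow from the wrapper's owned allocations.)

```rust
// Minimal sketch of the GpuStruct raw/wrapper split (hypothetical names).
trait GpuStruct {
    // The firmware-visible layout; lifetime-generic so it can borrow.
    type Raw<'a>;
}

// Owning side: allocations the driver manages.
struct Counter {
    values: Vec<u32>,
}

// Firmware-visible side: plain data plus a borrow of the owned buffer.
#[repr(C)]
struct RawCounter<'a> {
    count: u32,
    values: &'a [u32],
}

impl GpuStruct for Counter {
    type Raw<'a> = RawCounter<'a>;
}

// Building the raw view borrows from the wrapper, so it cannot outlive it.
fn make_raw(c: &Counter) -> <Counter as GpuStruct>::Raw<'_> {
    RawCounter {
        count: c.values.len() as u32,
        values: &c.values,
    }
}

fn main() {
    let c = Counter {
        values: vec![1, 2, 3],
    };
    let raw = make_raw(&c);
    assert_eq!(raw.count, 3);
    assert_eq!(raw.values, &[1, 2, 3][..]);
}
```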
+#[versions(AGX)] +no_debug!(InitBuffer::ver); + +#[versions(AGX)] +impl workqueue::Command for InitBuffer::ver {} + +#[versions(AGX)] +impl GpuStruct for InitBuffer::ver { + type Raw<'a> = raw::InitBuffer::ver<'a>; +} diff --git a/drivers/gpu/drm/asahi/fw/channels.rs b/drivers/gpu/drm/asahi/fw/channels.rs new file mode 100644 index 000000000000..db5ac9a3ded5 --- /dev/null +++ b/drivers/gpu/drm/asahi/fw/channels.rs @@ -0,0 +1,385 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! GPU communication channel firmware structures (ring buffers) + +use super::types::*; +use crate::default_zeroed; +use core::sync::atomic::Ordering; + +pub(crate) mod raw { + use super::*; + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct ChannelState<'a> { + pub(crate) read_ptr: AtomicU32, + __pad0: Pad<0x1c>, + pub(crate) write_ptr: AtomicU32, + __pad1: Pad<0xc>, + _p: PhantomData<&'a ()>, + } + default_zeroed!(<'a>, ChannelState<'a>); + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct FwCtlChannelState<'a> { + pub(crate) read_ptr: AtomicU32, + __pad0: Pad<0xc>, + pub(crate) write_ptr: AtomicU32, + __pad1: Pad<0xc>, + _p: PhantomData<&'a ()>, + } + default_zeroed!(<'a>, FwCtlChannelState<'a>); +} + +pub(crate) trait RxChannelState: GpuStruct + Debug + Default +where + for<'a> <Self as GpuStruct>::Raw<'a>: Default + Zeroed, +{ + const SUB_CHANNELS: usize; + + fn wptr(raw: &Self::Raw<'_>, index: usize) -> u32; + fn set_rptr(raw: &Self::Raw<'_>, index: usize, rptr: u32); +} + +#[derive(Debug, Default)] +pub(crate) struct ChannelState {} + +impl GpuStruct for ChannelState { + type Raw<'a> = raw::ChannelState<'a>; +} + +impl RxChannelState for ChannelState { + const SUB_CHANNELS: usize = 1; + + fn wptr(raw: &Self::Raw<'_>, _index: usize) -> u32 { + raw.write_ptr.load(Ordering::Acquire) + } + + fn set_rptr(raw: &Self::Raw<'_>, _index: usize, rptr: u32) { + raw.read_ptr.store(rptr, Ordering::Release); + } +} + +#[derive(Debug, Default)] +pub(crate) struct FwLogChannelState {} + 
+impl GpuStruct for FwLogChannelState {
+    type Raw<'a> = Array<6, raw::ChannelState<'a>>;
+}
+
+impl RxChannelState for FwLogChannelState {
+    const SUB_CHANNELS: usize = 6;
+
+    fn wptr(raw: &Self::Raw<'_>, index: usize) -> u32 {
+        raw[index].write_ptr.load(Ordering::Acquire)
+    }
+
+    fn set_rptr(raw: &Self::Raw<'_>, index: usize, rptr: u32) {
+        raw[index].read_ptr.store(rptr, Ordering::Release);
+    }
+}
+
+#[derive(Debug, Default)]
+pub(crate) struct FwCtlChannelState {}
+
+impl GpuStruct for FwCtlChannelState {
+    type Raw<'a> = raw::FwCtlChannelState<'a>;
+}
+
+pub(crate) trait TxChannelState: GpuStruct + Debug + Default {
+    fn rptr(raw: &Self::Raw<'_>) -> u32;
+    fn set_wptr(raw: &Self::Raw<'_>, wptr: u32);
+}
+
+impl TxChannelState for ChannelState {
+    fn rptr(raw: &Self::Raw<'_>) -> u32 {
+        raw.read_ptr.load(Ordering::Acquire)
+    }
+
+    fn set_wptr(raw: &Self::Raw<'_>, wptr: u32) {
+        raw.write_ptr.store(wptr, Ordering::Release);
+    }
+}
+
+impl TxChannelState for FwCtlChannelState {
+    fn rptr(raw: &Self::Raw<'_>) -> u32 {
+        raw.read_ptr.load(Ordering::Acquire)
+    }
+
+    fn set_wptr(raw: &Self::Raw<'_>, wptr: u32) {
+        raw.write_ptr.store(wptr, Ordering::Release);
+    }
+}
+
+#[derive(Debug, Copy, Clone, Default)]
+#[repr(u32)]
+pub(crate) enum PipeType {
+    #[default]
+    Vertex = 0,
+    Fragment = 1,
+    Compute = 2,
+}
+
+#[versions(AGX)]
+#[derive(Debug, Copy, Clone, Default)]
+#[repr(C)]
+pub(crate) struct RunWorkQueueMsg {
+    pub(crate) pipe_type: PipeType,
+    pub(crate) work_queue: Option<GpuWeakPointer<super::workqueue::QueueInfo::ver>>,
+    pub(crate) wptr: u32,
+    pub(crate) event_slot: u32,
+    pub(crate) is_new: bool,
+    #[ver(V >= V13_2 && G >= G14)]
+    pub(crate) __pad: Pad<0x2b>,
+    #[ver(V < V13_2 || G < G14)]
+    pub(crate) __pad: Pad<0x1b>,
+}
+
+#[versions(AGX)]
+pub(crate) type PipeMsg = RunWorkQueueMsg::ver;
+
+#[versions(AGX)]
+pub(crate) const DEVICECONTROL_SZ: usize = {
+    #[ver(V < V13_2 || G < G14)]
+    {
+        0x2c
+    }
+    #[ver(V >= V13_2 && G >= G14)]
+    {
+        0x3c
+    }
+};
+
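(Reviewer aside: the `TxChannelState`/`RxChannelState` accessors above encode the usual single-producer ring protocol: the producer fills a slot and then publishes it with a Release store of the write pointer; the consumer observes it with an Acquire load and retires entries by advancing the read pointer the same way. A simplified userspace sketch of that pointer discipline, with a `Vec` standing in for the shared ring slots, is below; the names and ring size are illustrative, not the driver's.)

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Stand-in for the shared channel header: only the pointer protocol is
// modeled here.
struct ChannelHdr {
    read_ptr: AtomicU32,
    write_ptr: AtomicU32,
}

const RING_SIZE: u32 = 4; // ring capacity in messages (illustrative)

impl ChannelHdr {
    fn new() -> Self {
        ChannelHdr {
            read_ptr: AtomicU32::new(0),
            write_ptr: AtomicU32::new(0),
        }
    }

    // Producer: write the slot first, then publish with a Release store,
    // mirroring TxChannelState::set_wptr above.
    fn push(&self, slots: &mut Vec<u32>, msg: u32) -> bool {
        let wptr = self.write_ptr.load(Ordering::Relaxed);
        let rptr = self.read_ptr.load(Ordering::Acquire);
        // One slot stays free so "full" is distinguishable from "empty".
        if (wptr + 1) % RING_SIZE == rptr {
            return false;
        }
        slots.push(msg);
        self.write_ptr.store((wptr + 1) % RING_SIZE, Ordering::Release);
        true
    }

    // Consumer: the Acquire load of wptr makes the slot contents visible;
    // rptr then advances with Release, mirroring RxChannelState::set_rptr.
    fn pop(&self, slots: &mut Vec<u32>) -> Option<u32> {
        let rptr = self.read_ptr.load(Ordering::Relaxed);
        if rptr == self.write_ptr.load(Ordering::Acquire) {
            return None; // ring empty
        }
        let msg = slots.remove(0);
        self.read_ptr.store((rptr + 1) % RING_SIZE, Ordering::Release);
        Some(msg)
    }
}

fn main() {
    let hdr = ChannelHdr::new();
    let mut slots = Vec::new();
    assert!(hdr.push(&mut slots, 10));
    assert!(hdr.push(&mut slots, 20));
    assert!(hdr.push(&mut slots, 30));
    assert!(!hdr.push(&mut slots, 40)); // full: only SIZE - 1 slots usable
    assert_eq!(hdr.pop(&mut slots), Some(10));
    assert_eq!(hdr.pop(&mut slots), Some(20));
    assert_eq!(hdr.pop(&mut slots), Some(30));
    assert_eq!(hdr.pop(&mut slots), None); // empty
}
```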
+// TODO: clean up when arbitrary_enum_discriminant is stable
+// https://github.com/rust-lang/rust/issues/60553
+
+#[versions(AGX)]
+#[derive(Debug, Copy, Clone)]
+#[repr(C, u32)]
+#[allow(dead_code)]
+pub(crate) enum DeviceControlMsg {
+    Unk00(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk01(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk02(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk03(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk04(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk05(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk06(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk07(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk08(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk09(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk0a(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk0b(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk0c(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk0d(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk0e(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk0f(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk10(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk11(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk12(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk13(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk14(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk15(Array<DEVICECONTROL_SZ::ver, u8>),
+    Unk16(Array<DEVICECONTROL_SZ::ver, u8>),
+    DestroyContext {
+        unk_4: u32,
+        ctx_23: u8,
+        __pad0: Pad<3>,
+        unk_c: u32,
+        unk_10: u32,
+        ctx_0: u8,
+        ctx_1: u8,
+        ctx_4: u8,
+        __pad1: Pad<1>,
+        unk_18: u32,
+        gpu_context: Option<GpuWeakPointer<super::workqueue::GpuContextData>>,
+        __pad2: Pad<{ DEVICECONTROL_SZ::ver - 0x20 }>,
+    },
+    Unk18(Array<DEVICECONTROL_SZ::ver, u8>),
+    Initialize(Pad<DEVICECONTROL_SZ::ver>),
+}
+
+#[versions(AGX)]
+default_zeroed!(DeviceControlMsg::ver);
+
+#[derive(Copy, Clone, Default, Debug)]
+#[repr(C)]
+#[allow(dead_code)]
+pub(crate) struct FwCtlMsg {
+    pub(crate) addr: U64,
+    pub(crate) unk_8: u32,
+    pub(crate) slot: u32,
+    pub(crate) page_count: u16,
+    pub(crate) unk_12: u16,
+}
+
+pub(crate) const EVENT_SZ: usize = 0x34;
+
+#[derive(Debug, Copy, Clone)]
+#[repr(C, u32)]
+#[allow(dead_code)] +pub(crate) enum EventMsg { + Fault, + Flag { + firing: [u32; 4], + unk_14: u16, + }, + Unk2(Array<EVENT_SZ, u8>), + Unk3(Array<EVENT_SZ, u8>), + Timeout { + counter: u32, + unk_8: u32, + event_slot: u32, + }, // Max discriminant: 0x4 +} + +pub(crate) const EVENT_MAX: u32 = 0x4; + +#[derive(Copy, Clone)] +#[repr(C)] +pub(crate) union RawEventMsg { + pub(crate) raw: (u32, Array<EVENT_SZ, u8>), + pub(crate) msg: EventMsg, +} + +default_zeroed!(RawEventMsg); + +#[derive(Debug, Copy, Clone, Default)] +#[repr(C)] +pub(crate) struct RawFwLogMsg { + pub(crate) msg_type: u32, + __pad0: u32, + pub(crate) msg_index: U64, + __pad1: Pad<0x28>, +} + +#[derive(Debug, Copy, Clone, Default)] +#[repr(C)] +pub(crate) struct RawFwLogPayloadMsg { + pub(crate) msg_type: u32, + pub(crate) seq_no: u32, + pub(crate) timestamp: U64, + pub(crate) msg: Array<0xc8, u8>, +} + +#[derive(Debug, Copy, Clone, Default)] +#[repr(C)] +pub(crate) struct RawKTraceMsg { + pub(crate) msg_type: u32, + pub(crate) timestamp: U64, + pub(crate) args: Array<4, U64>, + pub(crate) code: u8, + pub(crate) channel: u8, + __pad: Pad<1>, + pub(crate) thread: u8, + pub(crate) unk_flag: U64, +} + +#[versions(AGX)] +pub(crate) const STATS_SZ: usize = { + #[ver(V < V13_0B4)] + { + 0x2c + } + #[ver(V >= V13_0B4)] + { + 0x3c + } +}; + +#[versions(AGX)] +#[derive(Debug, Copy, Clone)] +#[repr(C, u32)] +#[allow(dead_code)] +pub(crate) enum StatsMsg { + Power { + // 0x00 + __pad: Pad<0x18>, + power: U64, + }, + Unk1(Array<{ STATS_SZ::ver }, u8>), + PowerOn { + // 0x02 + off_time: U64, + }, + PowerOff { + // 0x03 + on_time: U64, + }, + Utilization { + // 0x04 + timestamp: U64, + util1: u32, + util2: u32, + util3: u32, + util4: u32, + }, + Unk5(Array<{ STATS_SZ::ver }, u8>), + Unk6(Array<{ STATS_SZ::ver }, u8>), + Unk7(Array<{ STATS_SZ::ver }, u8>), + Unk8(Array<{ STATS_SZ::ver }, u8>), + AvgPower { + // 0x09 + active_cs: U64, + unk2: u32, + unk3: u32, + unk4: u32, + avg_power: u32, + }, + Temperature { + // 
0x0a + __pad: Pad<0x8>, + raw_value: u32, + scale: u32, + tmin: u32, + tmax: u32, + }, + PowerState { + // 0x0b + timestamp: U64, + last_busy_ts: U64, + active: u32, + poweroff: u32, + unk1: u32, + pstate: u32, + unk2: u32, + unk3: u32, + }, + FwBusy { + // 0x0c + timestamp: U64, + busy: u32, + }, + PState { + // 0x0d + __pad: Pad<0x8>, + ps_min: u32, + unk1: u32, + ps_max: u32, + unk2: u32, + }, + TempSensor { + // 0x0e + __pad: Pad<0x4>, + sensor_id: u32, + raw_value: u32, + scale: u32, + tmin: u32, + tmax: u32, + }, // Max discriminant: 0xe +} + +#[versions(AGX)] +pub(crate) const STATS_MAX: u32 = 0xe; + +#[versions(AGX)] +#[derive(Copy, Clone)] +#[repr(C)] +pub(crate) union RawStatsMsg { + pub(crate) raw: (u32, Array<{ STATS_SZ::ver }, u8>), + pub(crate) msg: StatsMsg::ver, +} + +#[versions(AGX)] +default_zeroed!(RawStatsMsg::ver); diff --git a/drivers/gpu/drm/asahi/fw/compute.rs b/drivers/gpu/drm/asahi/fw/compute.rs new file mode 100644 index 000000000000..0dbcd77c5e3e --- /dev/null +++ b/drivers/gpu/drm/asahi/fw/compute.rs @@ -0,0 +1,107 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! 
GPU compute job firmware structures + +use super::types::*; +use super::{event, job, workqueue}; +use crate::{microseq, mmu}; +use kernel::sync::Arc; + +pub(crate) mod raw { + use super::*; + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct JobParameters1<'a> { + pub(crate) preempt_buf1: GpuPointer<'a, &'a [u8]>, + pub(crate) encoder: U64, + pub(crate) preempt_buf2: GpuPointer<'a, &'a [u8]>, + pub(crate) preempt_buf3: GpuPointer<'a, &'a [u8]>, + pub(crate) preempt_buf4: GpuPointer<'a, &'a [u8]>, + pub(crate) preempt_buf5: GpuPointer<'a, &'a [u8]>, + pub(crate) pipeline_base: U64, + pub(crate) unk_38: U64, + pub(crate) unk_40: u32, + pub(crate) unk_44: u32, + pub(crate) compute_layout_addr: U64, + pub(crate) unk_50: u32, + pub(crate) unk_54: u32, + pub(crate) unk_58: u32, + pub(crate) unk_5c: u32, + pub(crate) iogpu_unk_40: u32, + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct JobParameters2<'a> { + #[ver(V >= V13_0B4)] + pub(crate) unk_0_0: u32, + pub(crate) unk_0: Array<0x24, u8>, + pub(crate) preempt_buf1: GpuPointer<'a, &'a [u8]>, + pub(crate) encoder_end: U64, + pub(crate) unk_34: Array<0x28, u8>, + #[ver(V < V13_0B4)] + pub(crate) unk_5c: u32, + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct RunCompute<'a> { + pub(crate) tag: workqueue::CommandType, + + #[ver(V >= V13_0B4)] + pub(crate) counter: U64, + + pub(crate) unk_4: u32, + pub(crate) vm_slot: u32, + pub(crate) notifier: GpuPointer<'a, event::Notifier::ver>, + pub(crate) unk_pointee: Array<0x54, u8>, + pub(crate) job_params1: JobParameters1<'a>, + pub(crate) unk_b8: Array<0x11c, u8>, + pub(crate) microsequence: GpuPointer<'a, &'a [u8]>, + pub(crate) microsequence_size: u32, + pub(crate) job_params2: JobParameters2::ver<'a>, + pub(crate) encoder_params: job::raw::EncoderParams<'a>, + pub(crate) meta: job::raw::JobMeta, + pub(crate) cur_ts: U64, + pub(crate) start_ts: Option<GpuPointer<'a, AtomicU64>>, + pub(crate) end_ts: Option<GpuPointer<'a, 
AtomicU64>>, + pub(crate) unk_2c0: u32, + pub(crate) unk_2c4: u32, + pub(crate) unk_2c8: u32, + pub(crate) unk_2cc: u32, + pub(crate) client_sequence: u8, + pub(crate) pad_2d1: Array<3, u8>, + pub(crate) unk_2d4: u32, + pub(crate) unk_2d8: u8, + #[ver(V >= V13_0B4)] + pub(crate) unk_ts: U64, + #[ver(V >= V13_0B4)] + pub(crate) unk_2e1: Array<0x1c, u8>, + #[ver(V >= V13_0B4)] + pub(crate) unk_flag: U32, + #[ver(V >= V13_0B4)] + pub(crate) unk_pad: Array<0x10, u8>, + } +} + +#[versions(AGX)] +#[derive(Debug)] +pub(crate) struct RunCompute { + pub(crate) notifier: Arc<GpuObject<event::Notifier::ver>>, + pub(crate) preempt_buf: GpuArray<u8>, + pub(crate) seq_buf: GpuArray<u64>, + pub(crate) micro_seq: microseq::MicroSequence, + pub(crate) vm_bind: mmu::VmBind, + pub(crate) timestamps: Arc<GpuObject<job::JobTimestamps>>, +} + +#[versions(AGX)] +impl GpuStruct for RunCompute::ver { + type Raw<'a> = raw::RunCompute::ver<'a>; +} + +#[versions(AGX)] +impl workqueue::Command for RunCompute::ver {} diff --git a/drivers/gpu/drm/asahi/fw/event.rs b/drivers/gpu/drm/asahi/fw/event.rs new file mode 100644 index 000000000000..fbf65ab6d976 --- /dev/null +++ b/drivers/gpu/drm/asahi/fw/event.rs @@ -0,0 +1,100 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! 
GPU events control structures & stamps + +use super::types::*; +use crate::{default_zeroed, trivial_gpustruct}; +use core::sync::atomic::Ordering; + +pub(crate) mod raw { + use super::*; + + #[derive(Debug, Clone, Copy, Default)] + #[repr(C)] + pub(crate) struct LinkedListHead { + pub(crate) prev: Option<GpuWeakPointer<LinkedListHead>>, + pub(crate) next: Option<GpuWeakPointer<LinkedListHead>>, + } + + #[derive(Debug, Clone, Copy)] + #[repr(C)] + pub(crate) struct NotifierList { + pub(crate) list_head: LinkedListHead, + pub(crate) unkptr_10: U64, + } + default_zeroed!(NotifierList); + + #[versions(AGX)] + #[derive(Debug, Clone, Copy)] + #[repr(C)] + pub(crate) struct NotifierState { + unk_14: u32, + unk_18: U64, + unk_20: u32, + vm_slot: u32, + has_vtx: u32, + pstamp_vtx: Array<4, U64>, + has_frag: u32, + pstamp_frag: Array<4, U64>, + has_comp: u32, + pstamp_comp: Array<4, U64>, + #[ver(G >= G14 && V < V13_0B4)] + unk_98_g14_0: Array<0x14, u8>, + in_list: u32, + list_head: LinkedListHead, + #[ver(G >= G14 && V < V13_0B4)] + unk_a8_g14_0: Pad<4>, + #[ver(V >= V13_0B4)] + pub(crate) unk_buf: Array<0x8, u8>, // Init to all-ff + } + + #[versions(AGX)] + impl Default for NotifierState::ver { + fn default() -> Self { + #[allow(unused_mut)] + let mut s: Self = unsafe { core::mem::zeroed() }; + #[ver(V >= V13_0B4)] + s.unk_buf = Array::new([0xff; 0x8]); + s + } + } + + #[derive(Debug)] + #[repr(transparent)] + pub(crate) struct Threshold(AtomicU64); + default_zeroed!(Threshold); + + impl Threshold { + pub(crate) fn increment(&self) { + // We could use fetch_add, but the non-LSE atomic + // sequence Rust produces confuses the hypervisor. 
+ let v = self.0.load(Ordering::Relaxed); + self.0.store(v + 1, Ordering::Relaxed); + } + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct Notifier<'a> { + pub(crate) threshold: GpuPointer<'a, super::Threshold>, + pub(crate) generation: AtomicU32, + pub(crate) cur_count: AtomicU32, + pub(crate) unk_10: AtomicU32, + pub(crate) state: NotifierState::ver, + } +} + +trivial_gpustruct!(Threshold); +trivial_gpustruct!(NotifierList); + +#[versions(AGX)] +#[derive(Debug)] +pub(crate) struct Notifier { + pub(crate) threshold: GpuObject<Threshold>, +} + +#[versions(AGX)] +impl GpuStruct for Notifier::ver { + type Raw<'a> = raw::Notifier::ver<'a>; +} diff --git a/drivers/gpu/drm/asahi/fw/fragment.rs b/drivers/gpu/drm/asahi/fw/fragment.rs new file mode 100644 index 000000000000..eca275efb967 --- /dev/null +++ b/drivers/gpu/drm/asahi/fw/fragment.rs @@ -0,0 +1,276 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! GPU fragment job firmware structures + +use super::types::*; +use super::{event, job, workqueue}; +use crate::{buffer, fw, microseq, mmu}; +use kernel::sync::Arc; + +pub(crate) mod raw { + use super::*; + + #[derive(Debug, Clone, Copy)] + #[repr(C)] + pub(crate) struct ClearPipelineBinding { + pub(crate) pipeline_bind: U64, + pub(crate) address: U64, + } + + #[derive(Debug, Clone, Copy, Default)] + #[repr(C)] + pub(crate) struct StorePipelineBinding { + pub(crate) unk_0: U64, + pub(crate) unk_8: u32, + pub(crate) pipeline_bind: u32, + pub(crate) unk_10: u32, + pub(crate) address: u32, + pub(crate) unk_18: u32, + pub(crate) unk_1c_padding: u32, + } + + impl StorePipelineBinding { + pub(crate) fn new(pipeline_bind: u32, address: u32) -> StorePipelineBinding { + StorePipelineBinding { + pipeline_bind, + address, + ..Default::default() + } + } + } + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct ArrayAddr { + pub(crate) ptr: U64, + pub(crate) unk_padding: U64, + } + + #[versions(AGX)] + #[derive(Debug, Clone, Copy)] + #[repr(C)] + 
pub(crate) struct AuxFBInfo { + pub(crate) iogpu_unk_214: u32, + pub(crate) unk2: u32, + pub(crate) width: u32, + pub(crate) height: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk3: U64, + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct JobParameters1<'a> { + pub(crate) utile_config: u32, + pub(crate) unk_4: u32, + pub(crate) clear_pipeline: ClearPipelineBinding, + pub(crate) ppp_multisamplectl: U64, + pub(crate) scissor_array: U64, + pub(crate) depth_bias_array: U64, + pub(crate) aux_fb_info: AuxFBInfo::ver, + pub(crate) depth_dimensions: U64, + pub(crate) visibility_result_buffer: U64, + pub(crate) zls_ctrl: U64, + + #[ver(G >= G14)] + pub(crate) unk_58_g14_0: U64, + #[ver(G >= G14)] + pub(crate) unk_58_g14_8: U64, + + pub(crate) depth_buffer_ptr1: U64, + pub(crate) depth_buffer_ptr2: U64, + pub(crate) stencil_buffer_ptr1: U64, + pub(crate) stencil_buffer_ptr2: U64, + + #[ver(G >= G14)] + pub(crate) unk_68_g14_0: Array<0x20, u8>, + + pub(crate) unk_78: Array<0x4, U64>, + pub(crate) depth_meta_buffer_ptr1: U64, + pub(crate) unk_a0: U64, + pub(crate) depth_meta_buffer_ptr2: U64, + pub(crate) unk_b0: U64, + pub(crate) stencil_meta_buffer_ptr1: U64, + pub(crate) unk_c0: U64, + pub(crate) stencil_meta_buffer_ptr2: U64, + pub(crate) unk_d0: U64, + pub(crate) tvb_tilemap: GpuPointer<'a, &'a [u8]>, + pub(crate) tvb_heapmeta: GpuPointer<'a, &'a [u8]>, + pub(crate) mtile_stride_dwords: U64, + pub(crate) tvb_heapmeta_2: GpuPointer<'a, &'a [u8]>, + pub(crate) tile_config: U64, + pub(crate) aux_fb: GpuPointer<'a, &'a [u8]>, + pub(crate) unk_108: Array<0x6, U64>, + pub(crate) pipeline_base: U64, + pub(crate) unk_140: U64, + pub(crate) unk_148: U64, + pub(crate) unk_150: U64, + pub(crate) unk_158: U64, + pub(crate) unk_160: U64, + + #[ver(G < G14)] + pub(crate) unk_168_padding: Array<0x1d8, u8>, + #[ver(G >= G14)] + pub(crate) unk_168_padding: Array<0x1a8, u8>, + #[ver(V < V13_0B4)] + pub(crate) __pad0: Pad<0x8>, + } + + #[derive(Debug)] + #[repr(C)] 
+ pub(crate) struct JobParameters2 { + pub(crate) store_pipeline_bind: u32, + pub(crate) store_pipeline_addr: u32, + pub(crate) unk_8: u32, + pub(crate) unk_c: u32, + pub(crate) merge_upper_x: F32, + pub(crate) merge_upper_y: F32, + pub(crate) unk_18: U64, + pub(crate) utiles_per_mtile_y: u16, + pub(crate) utiles_per_mtile_x: u16, + pub(crate) unk_24: u32, + pub(crate) tile_counts: u32, + pub(crate) iogpu_unk_212: u32, + pub(crate) isp_bgobjdepth: u32, + pub(crate) isp_bgobjvals: u32, + pub(crate) unk_38: u32, + pub(crate) unk_3c: u32, + pub(crate) unk_40: u32, + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct JobParameters3 { + pub(crate) unk_44_padding: Array<0xac, u8>, + pub(crate) depth_bias_array: ArrayAddr, + pub(crate) scissor_array: ArrayAddr, + pub(crate) visibility_result_buffer: U64, + pub(crate) unk_118: U64, + pub(crate) unk_120: Array<0x25, U64>, + pub(crate) unk_reload_pipeline: ClearPipelineBinding, + pub(crate) unk_258: U64, + pub(crate) unk_260: U64, + pub(crate) unk_268: U64, + pub(crate) unk_270: U64, + pub(crate) reload_pipeline: ClearPipelineBinding, + pub(crate) zls_ctrl: U64, + pub(crate) unk_290: U64, + pub(crate) depth_buffer_ptr1: U64, + pub(crate) unk_2a0: U64, + pub(crate) unk_2a8: U64, + pub(crate) depth_buffer_ptr2: U64, + pub(crate) depth_buffer_ptr3: U64, + pub(crate) depth_meta_buffer_ptr3: U64, + pub(crate) stencil_buffer_ptr1: U64, + pub(crate) unk_2d0: U64, + pub(crate) unk_2d8: U64, + pub(crate) stencil_buffer_ptr2: U64, + pub(crate) stencil_buffer_ptr3: U64, + pub(crate) stencil_meta_buffer_ptr3: U64, + pub(crate) unk_2f8: Array<2, U64>, + pub(crate) iogpu_unk_212: u32, + pub(crate) unk_30c: u32, + pub(crate) aux_fb_info: AuxFBInfo::ver, + pub(crate) unk_320_padding: Array<0x10, u8>, + pub(crate) unk_partial_store_pipeline: StorePipelineBinding, + pub(crate) partial_store_pipeline: StorePipelineBinding, + pub(crate) isp_bgobjdepth: u32, + pub(crate) isp_bgobjvals: u32, + pub(crate) iogpu_unk_49: u32, 
+ pub(crate) unk_37c: u32, + pub(crate) unk_380: U64, + pub(crate) unk_388: U64, + + #[ver(V >= V13_0B4)] + pub(crate) unk_390_0: U64, + + pub(crate) depth_dimensions: U64, + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct RunFragment<'a> { + pub(crate) tag: workqueue::CommandType, + + #[ver(V >= V13_0B4)] + pub(crate) counter: U64, + + pub(crate) vm_slot: u32, + pub(crate) unk_8: u32, + pub(crate) microsequence: GpuPointer<'a, &'a [u8]>, + pub(crate) microsequence_size: u32, + pub(crate) notifier: GpuPointer<'a, event::Notifier::ver>, + pub(crate) buffer: GpuPointer<'a, fw::buffer::Info::ver>, + pub(crate) scene: GpuPointer<'a, fw::buffer::Scene::ver>, + pub(crate) unk_buffer_buf: GpuWeakPointer<[u8]>, + pub(crate) tvb_tilemap: GpuPointer<'a, &'a [u8]>, + pub(crate) ppp_multisamplectl: U64, + pub(crate) samples: u32, + pub(crate) tiles_per_mtile_y: u16, + pub(crate) tiles_per_mtile_x: u16, + pub(crate) unk_50: U64, + pub(crate) unk_58: U64, + pub(crate) merge_upper_x: F32, + pub(crate) merge_upper_y: F32, + pub(crate) unk_68: U64, + pub(crate) tile_count: U64, + pub(crate) job_params1: JobParameters1::ver<'a>, + pub(crate) job_params2: JobParameters2, + pub(crate) job_params3: JobParameters3::ver, + pub(crate) unk_758_flag: u32, + pub(crate) unk_75c_flag: u32, + pub(crate) unk_buf: Array<0x110, u8>, + pub(crate) busy_flag: u32, + pub(crate) tvb_overflow_count: u32, + pub(crate) unk_878: u32, + pub(crate) encoder_params: job::raw::EncoderParams<'a>, + pub(crate) process_empty_tiles: u32, + pub(crate) no_clear_pipeline_textures: u32, + pub(crate) unk_param: u32, + pub(crate) unk_pointee: u32, + pub(crate) meta: job::raw::JobMeta, + pub(crate) unk_after_meta: u32, + pub(crate) unk_buf_0: U64, + pub(crate) unk_buf_8: U64, + pub(crate) unk_buf_10: U64, + pub(crate) cur_ts: U64, + pub(crate) start_ts: Option<GpuPointer<'a, AtomicU64>>, + pub(crate) end_ts: Option<GpuPointer<'a, AtomicU64>>, + pub(crate) unk_914: u32, + pub(crate) unk_918: U64, 
+ pub(crate) unk_920: u32, + pub(crate) client_sequence: u8, + pub(crate) pad_925: Array<3, u8>, + pub(crate) unk_928: u32, + pub(crate) unk_92c: u8, + + #[ver(V >= V13_0B4)] + pub(crate) unk_ts: U64, + + #[ver(V >= V13_0B4)] + pub(crate) unk_92d_8: Array<0x1b, u8>, + } +} + +#[versions(AGX)] +#[derive(Debug)] +pub(crate) struct RunFragment { + pub(crate) notifier: Arc<GpuObject<event::Notifier::ver>>, + pub(crate) scene: Arc<buffer::Scene::ver>, + pub(crate) micro_seq: microseq::MicroSequence, + pub(crate) vm_bind: mmu::VmBind, + pub(crate) aux_fb: GpuArray<u8>, + pub(crate) timestamps: Arc<GpuObject<job::RenderTimestamps>>, +} + +#[versions(AGX)] +impl GpuStruct for RunFragment::ver { + type Raw<'a> = raw::RunFragment::ver<'a>; +} + +#[versions(AGX)] +impl workqueue::Command for RunFragment::ver {} diff --git a/drivers/gpu/drm/asahi/fw/initdata.rs b/drivers/gpu/drm/asahi/fw/initdata.rs new file mode 100644 index 000000000000..44de0c1cccf3 --- /dev/null +++ b/drivers/gpu/drm/asahi/fw/initdata.rs @@ -0,0 +1,1264 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! 
GPU initialization / global structures + +use super::channels; +use super::types::*; +use crate::{default_zeroed, no_debug, trivial_gpustruct}; + +pub(crate) mod raw { + use super::*; + + #[derive(Debug, Default)] + #[repr(C)] + pub(crate) struct ChannelRing<T: GpuStruct + Debug + Default, U: Copy> { + pub(crate) state: Option<GpuWeakPointer<T>>, + pub(crate) ring: Option<GpuWeakPointer<[U]>>, + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct PipeChannels { + pub(crate) vtx: ChannelRing<channels::ChannelState, channels::PipeMsg::ver>, + pub(crate) frag: ChannelRing<channels::ChannelState, channels::PipeMsg::ver>, + pub(crate) comp: ChannelRing<channels::ChannelState, channels::PipeMsg::ver>, + } + #[versions(AGX)] + default_zeroed!(PipeChannels::ver); + + #[derive(Debug, Default)] + #[repr(C)] + pub(crate) struct FwStatusFlags { + pub(crate) halt_count: AtomicU32, + __pad0: Pad<0xc>, + pub(crate) halted: AtomicU32, + __pad1: Pad<0xc>, + pub(crate) resume: AtomicU32, + __pad2: Pad<0xc>, + pub(crate) unk_40: u32, + __pad3: Pad<0xc>, + pub(crate) unk_ctr: u32, + __pad4: Pad<0xc>, + pub(crate) unk_60: u32, + __pad5: Pad<0xc>, + pub(crate) unk_70: u32, + __pad6: Pad<0xc>, + } + + #[derive(Debug, Default)] + #[repr(C)] + pub(crate) struct FwStatus { + pub(crate) fwctl_channel: ChannelRing<channels::FwCtlChannelState, channels::FwCtlMsg>, + pub(crate) flags: FwStatusFlags, + } + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct HwDataShared1 { + pub(crate) table: Array<16, i32>, + pub(crate) unk_44: Array<0x60, u8>, + pub(crate) unk_a4: u32, + pub(crate) unk_a8: u32, + } + default_zeroed!(HwDataShared1); + + #[derive(Debug, Default)] + #[repr(C)] + pub(crate) struct HwDataShared2Curve { + pub(crate) unk_0: u32, + pub(crate) unk_4: u32, + pub(crate) t1: Array<16, i16>, + pub(crate) t2: Array<16, i16>, + pub(crate) t3: Array<8, Array<16, i32>>, + } + + #[derive(Debug, Default)] + #[repr(C)] + pub(crate) struct HwDataShared2T8112 { + pub(crate) 
unk_0: Array<5, u32>, + pub(crate) unk_14: u32, + pub(crate) unk_18: Array<8, u32>, + pub(crate) curve1: HwDataShared2Curve, + pub(crate) curve2: HwDataShared2Curve, + } + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct HwDataShared2 { + pub(crate) table: Array<10, i32>, + pub(crate) unk_28: Array<0x10, u8>, + pub(crate) t8112: HwDataShared2T8112, + pub(crate) unk_500: u32, + pub(crate) unk_504: u32, + pub(crate) unk_508: u32, + pub(crate) unk_50c: u32, + pub(crate) unk_510: u32, + } + default_zeroed!(HwDataShared2); + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct HwDataShared3 { + pub(crate) unk_0: u32, + pub(crate) unk_4: u32, + pub(crate) unk_8: u32, + pub(crate) table: Array<16, u32>, + pub(crate) unk_4c: u32, + } + default_zeroed!(HwDataShared3); + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct HwDataA130Extra { + pub(crate) unk_0: Array<0x38, u8>, + pub(crate) unk_38: u32, + pub(crate) unk_3c: u32, + pub(crate) unk_40: u32, + pub(crate) unk_44: u32, + pub(crate) unk_48: u32, + pub(crate) unk_4c: u32, + pub(crate) unk_50: u32, + pub(crate) unk_54: u32, + pub(crate) unk_58: u32, + pub(crate) unk_5c: u32, + pub(crate) unk_60: F32, + pub(crate) unk_64: F32, + pub(crate) unk_68: F32, + pub(crate) unk_6c: F32, + pub(crate) unk_70: F32, + pub(crate) unk_74: F32, + pub(crate) unk_78: F32, + pub(crate) unk_7c: F32, + pub(crate) unk_80: F32, + pub(crate) unk_84: F32, + pub(crate) unk_88: u32, + pub(crate) unk_8c: u32, + pub(crate) max_pstate_scaled_1: u32, + pub(crate) unk_94: u32, + pub(crate) unk_98: u32, + pub(crate) unk_9c: F32, + pub(crate) unk_a0: u32, + pub(crate) unk_a4: u32, + pub(crate) unk_a8: u32, + pub(crate) unk_ac: u32, + pub(crate) unk_b0: u32, + pub(crate) unk_b4: u32, + pub(crate) unk_b8: u32, + pub(crate) unk_bc: u32, + pub(crate) unk_c0: u32, + pub(crate) unk_c4: F32, + pub(crate) unk_c8: Array<0x4c, u8>, + pub(crate) unk_114: F32, + pub(crate) unk_118: u32, + pub(crate) unk_11c: u32, + pub(crate) unk_120: u32, + pub(crate) unk_124: 
u32, + pub(crate) max_pstate_scaled_2: u32, + pub(crate) unk_12c: Array<0x8c, u8>, + } + default_zeroed!(HwDataA130Extra); + + #[derive(Default)] + #[repr(C)] + pub(crate) struct T81xxData { + pub(crate) unk_d8c: u32, + pub(crate) unk_d90: u32, + pub(crate) unk_d94: u32, + pub(crate) unk_d98: u32, + pub(crate) unk_d9c: F32, + pub(crate) unk_da0: u32, + pub(crate) unk_da4: F32, + pub(crate) unk_da8: u32, + pub(crate) unk_dac: F32, + pub(crate) unk_db0: u32, + pub(crate) unk_db4: u32, + pub(crate) unk_db8: F32, + pub(crate) unk_dbc: F32, + pub(crate) unk_dc0: u32, + pub(crate) unk_dc4: u32, + pub(crate) unk_dc8: u32, + pub(crate) max_pstate_scaled: u32, + } + + #[versions(AGX)] + #[derive(Default, Copy, Clone)] + #[repr(C)] + pub(crate) struct PowerZone { + pub(crate) val: F32, + pub(crate) target: u32, + pub(crate) target_off: u32, + pub(crate) filter_tc_x4: u32, + pub(crate) filter_tc_xperiod: u32, + #[ver(V >= V13_0B4)] + pub(crate) unk_10: u32, + #[ver(V >= V13_0B4)] + pub(crate) unk_14: u32, + pub(crate) filter_a_neg: F32, + pub(crate) filter_a: F32, + pub(crate) pad: u32, + } + + #[versions(AGX)] + #[repr(C)] + pub(crate) struct HwDataA { + pub(crate) unk_0: u32, + pub(crate) clocks_per_period: u32, + + #[ver(V >= V13_0B4)] + pub(crate) clocks_per_period_2: u32, + + pub(crate) unk_8: u32, + pub(crate) pwr_status: AtomicU32, + pub(crate) unk_10: F32, + pub(crate) unk_14: u32, + pub(crate) unk_18: u32, + pub(crate) unk_1c: u32, + pub(crate) unk_20: u32, + pub(crate) unk_24: u32, + pub(crate) actual_pstate: u32, + pub(crate) tgt_pstate: u32, + pub(crate) unk_30: u32, + pub(crate) cur_pstate: u32, + pub(crate) unk_38: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_3c_0: u32, + + pub(crate) base_pstate_scaled: u32, + pub(crate) unk_40: u32, + pub(crate) max_pstate_scaled: u32, + pub(crate) unk_48: u32, + pub(crate) min_pstate_scaled: u32, + pub(crate) freq_mhz: F32, + pub(crate) unk_54: Array<0x20, u8>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_74_0: u32, + + 
pub(crate) sram_k: Array<0x10, F32>, + pub(crate) unk_b4: Array<0x100, u8>, + pub(crate) unk_1b4: u32, + pub(crate) temp_c: u32, + pub(crate) avg_power_mw: u32, + pub(crate) update_ts: U64, + pub(crate) unk_1c8: u32, + pub(crate) unk_1cc: Array<0x478, u8>, + pub(crate) pad_644: Pad<0x8>, + pub(crate) unk_64c: u32, + pub(crate) unk_650: u32, + pub(crate) pad_654: u32, + pub(crate) pwr_filter_a_neg: F32, + pub(crate) pad_65c: u32, + pub(crate) pwr_filter_a: F32, + pub(crate) pad_664: u32, + pub(crate) pwr_integral_gain: F32, + pub(crate) pad_66c: u32, + pub(crate) pwr_integral_min_clamp: F32, + pub(crate) max_power_1: F32, + pub(crate) pwr_proportional_gain: F32, + pub(crate) pad_67c: u32, + pub(crate) pwr_pstate_related_k: F32, + pub(crate) pwr_pstate_max_dc_offset: i32, + pub(crate) unk_688: u32, + pub(crate) max_pstate_scaled_2: u32, + pub(crate) pad_690: u32, + pub(crate) unk_694: u32, + pub(crate) max_power_2: u32, + pub(crate) pad_69c: Pad<0x18>, + pub(crate) unk_6b4: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_6b8_0: Array<0x10, u8>, + + pub(crate) max_pstate_scaled_3: u32, + pub(crate) unk_6bc: u32, + pub(crate) pad_6c0: Pad<0x14>, + pub(crate) ppm_filter_tc_periods_x4: u32, + pub(crate) unk_6d8: u32, + pub(crate) pad_6dc: u32, + pub(crate) ppm_filter_a_neg: F32, + pub(crate) pad_6e4: u32, + pub(crate) ppm_filter_a: F32, + pub(crate) pad_6ec: u32, + pub(crate) ppm_ki_dt: F32, + pub(crate) pad_6f4: u32, + pub(crate) pwr_integral_min_clamp_2: u32, + pub(crate) unk_6fc: F32, + pub(crate) ppm_kp: F32, + pub(crate) pad_704: u32, + pub(crate) unk_708: u32, + pub(crate) pwr_min_duty_cycle: u32, + pub(crate) max_pstate_scaled_4: u32, + pub(crate) unk_714: u32, + pub(crate) pad_718: u32, + pub(crate) unk_71c: F32, + pub(crate) max_power_3: u32, + pub(crate) cur_power_mw_2: u32, + pub(crate) ppm_filter_tc_ms: u32, + pub(crate) unk_72c: u32, + + #[ver(V >= V13_0B4)] + pub(crate) ppm_filter_tc_clks: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_730_4: u32, + + 
#[ver(V >= V13_0B4)] + pub(crate) unk_730_8: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_730_c: u32, + + pub(crate) unk_730: F32, + pub(crate) unk_734: u32, + pub(crate) unk_738: u32, + pub(crate) unk_73c: u32, + pub(crate) unk_740: u32, + pub(crate) unk_744: u32, + pub(crate) unk_748: Array<0x4, F32>, + pub(crate) unk_758: u32, + pub(crate) perf_tgt_utilization: u32, + pub(crate) pad_760: u32, + pub(crate) perf_boost_min_util: u32, + pub(crate) perf_boost_ce_step: u32, + pub(crate) perf_reset_iters: u32, + pub(crate) pad_770: u32, + pub(crate) unk_774: u32, + pub(crate) unk_778: u32, + pub(crate) perf_filter_drop_threshold: u32, + pub(crate) perf_filter_a_neg: F32, + pub(crate) perf_filter_a2_neg: F32, + pub(crate) perf_filter_a: F32, + pub(crate) perf_filter_a2: F32, + pub(crate) perf_ki: F32, + pub(crate) perf_ki2: F32, + pub(crate) perf_integral_min_clamp: F32, + pub(crate) unk_79c: F32, + pub(crate) perf_kp: F32, + pub(crate) perf_kp2: F32, + pub(crate) boost_state_unk_k: F32, + pub(crate) base_pstate_scaled_2: u32, + pub(crate) max_pstate_scaled_5: u32, + pub(crate) base_pstate_scaled_3: u32, + pub(crate) pad_7b8: u32, + pub(crate) perf_cur_utilization: F32, + pub(crate) perf_tgt_utilization_2: u32, + pub(crate) pad_7c4: Pad<0x18>, + pub(crate) unk_7dc: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_7e0_0: Array<0x10, u8>, + + pub(crate) base_pstate_scaled_4: u32, + pub(crate) pad_7e4: u32, + pub(crate) unk_7e8: Array<0x14, u8>, + pub(crate) unk_7fc: F32, + pub(crate) pwr_min_duty_cycle_2: F32, + pub(crate) max_pstate_scaled_6: F32, + pub(crate) max_freq_mhz: u32, + pub(crate) pad_80c: u32, + pub(crate) unk_810: u32, + pub(crate) pad_814: u32, + pub(crate) pwr_min_duty_cycle_3: u32, + pub(crate) unk_81c: u32, + pub(crate) pad_820: u32, + pub(crate) min_pstate_scaled_4: F32, + pub(crate) max_pstate_scaled_7: u32, + pub(crate) unk_82c: u32, + pub(crate) unk_alpha_neg: F32, + pub(crate) unk_alpha: F32, + pub(crate) unk_838: u32, + pub(crate) unk_83c: u32, + 
pub(crate) pad_840: Pad<0x2c>, + pub(crate) unk_86c: u32, + pub(crate) fast_die0_sensor_mask: U64, + pub(crate) fast_die0_release_temp_cc: u32, + pub(crate) unk_87c: i32, + pub(crate) unk_880: u32, + pub(crate) unk_884: u32, + pub(crate) pad_888: u32, + pub(crate) unk_88c: u32, + pub(crate) pad_890: u32, + pub(crate) unk_894: F32, + pub(crate) pad_898: u32, + pub(crate) fast_die0_ki_dt: F32, + pub(crate) pad_8a0: u32, + pub(crate) unk_8a4: u32, + pub(crate) unk_8a8: F32, + pub(crate) fast_die0_kp: F32, + pub(crate) pad_8b0: u32, + pub(crate) unk_8b4: u32, + pub(crate) pwr_min_duty_cycle_4: u32, + pub(crate) max_pstate_scaled_8: u32, + pub(crate) max_pstate_scaled_9: u32, + pub(crate) fast_die0_prop_tgt_delta: u32, + pub(crate) unk_8c8: u32, + pub(crate) unk_8cc: u32, + pub(crate) pad_8d0: Pad<0x14>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_8e4_0: Array<0x10, u8>, + + pub(crate) unk_8e4: u32, + pub(crate) unk_8e8: u32, + pub(crate) max_pstate_scaled_10: u32, + pub(crate) unk_8f0: u32, + pub(crate) unk_8f4: u32, + pub(crate) pad_8f8: u32, + pub(crate) pad_8fc: u32, + pub(crate) unk_900: Array<0x24, u8>, + pub(crate) unk_coef_a1: Array<8, Array<8, F32>>, + pub(crate) unk_coef_a2: Array<8, Array<8, F32>>, + pub(crate) pad_b24: Pad<0x70>, + pub(crate) max_pstate_scaled_11: u32, + pub(crate) freq_with_off: u32, + pub(crate) unk_b9c: u32, + pub(crate) unk_ba0: U64, + pub(crate) unk_ba8: U64, + pub(crate) unk_bb0: u32, + pub(crate) unk_bb4: u32, + pub(crate) pad_bb8: Pad<0x74>, + pub(crate) unk_c2c: u32, + pub(crate) power_zone_count: u32, + pub(crate) max_power_4: u32, + pub(crate) max_power_5: u32, + pub(crate) max_power_6: u32, + pub(crate) unk_c40: u32, + pub(crate) unk_c44: F32, + pub(crate) avg_power_target_filter_a_neg: F32, + pub(crate) avg_power_target_filter_a: F32, + pub(crate) avg_power_target_filter_tc_x4: u32, + pub(crate) avg_power_target_filter_tc_xperiod: u32, + + #[ver(V >= V13_0B4)] + pub(crate) avg_power_target_filter_tc_clks: u32, + + #[ver(V >= 
V13_0B4)] + pub(crate) unk_c58_4: u32, + + pub(crate) power_zones: Array<5, PowerZone::ver>, + pub(crate) avg_power_filter_tc_periods_x4: u32, + pub(crate) unk_cfc: u32, + pub(crate) unk_d00: u32, + pub(crate) avg_power_filter_a_neg: F32, + pub(crate) unk_d08: u32, + pub(crate) avg_power_filter_a: F32, + pub(crate) unk_d10: u32, + pub(crate) avg_power_ki_dt: F32, + pub(crate) unk_d18: u32, + pub(crate) unk_d1c: u32, + pub(crate) unk_d20: F32, + pub(crate) avg_power_kp: F32, + pub(crate) unk_d28: u32, + pub(crate) unk_d2c: u32, + pub(crate) avg_power_min_duty_cycle: u32, + pub(crate) max_pstate_scaled_12: u32, + pub(crate) max_pstate_scaled_13: u32, + pub(crate) unk_d3c: u32, + pub(crate) max_power_7: F32, + pub(crate) max_power_8: u32, + pub(crate) unk_d48: u32, + pub(crate) avg_power_filter_tc_ms: u32, + pub(crate) unk_d50: u32, + + #[ver(V >= V13_0B4)] + pub(crate) avg_power_filter_tc_clks: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_d54_4: Array<0xc, u8>, + + pub(crate) unk_d54: Array<0x10, u8>, + pub(crate) max_pstate_scaled_14: u32, + pub(crate) unk_d68: Array<0x24, u8>, + + pub(crate) t81xx_data: T81xxData, + + pub(crate) unk_dd0: Array<0x40, u8>, + + #[ver(V >= V13_2)] + pub(crate) unk_e10_pad: Array<0x10, u8>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_e10_0: HwDataA130Extra, + + pub(crate) unk_e10: Array<0xc, u8>, + pub(crate) fast_die0_sensor_mask_2: U64, + pub(crate) unk_e24: u32, + pub(crate) unk_e28: u32, + pub(crate) unk_e2c: Pad<0x1c>, + pub(crate) unk_coef_b1: Array<8, Array<8, F32>>, + pub(crate) unk_coef_b2: Array<8, Array<8, F32>>, + pub(crate) pad_1048: Pad<0x5e4>, + pub(crate) fast_die0_sensor_mask_alt: U64, + #[ver(V < V13_0B4)] + pub(crate) fast_die0_sensor_present: U64, + + pub(crate) unk_163c: u32, + + pub(crate) unk_1640: Array<0x2000, u8>, + pub(crate) unk_3640: u32, + pub(crate) unk_3644: u32, + pub(crate) hws1: HwDataShared1, + + #[ver(V >= V13_0B4)] + pub(crate) unk_pad1: Pad<0x20>, + + pub(crate) hws2: HwDataShared2, + pub(crate) 
unk_3c04: u32, + pub(crate) hws3: HwDataShared3, + pub(crate) unk_3c58: Array<0x3c, u8>, + pub(crate) unk_3c94: u32, + pub(crate) unk_3c98: U64, + pub(crate) unk_3ca0: U64, + pub(crate) unk_3ca8: U64, + pub(crate) unk_3cb0: U64, + pub(crate) ts_last_idle: U64, + pub(crate) ts_last_poweron: U64, + pub(crate) ts_last_poweroff: U64, + pub(crate) unk_3cd0: U64, + pub(crate) unk_3cd8: U64, + + #[ver(V >= V13_0B4)] + pub(crate) unk_3ce0_0: u32, + + pub(crate) unk_3ce0: u32, + pub(crate) unk_3ce4: u32, + pub(crate) unk_3ce8: u32, + pub(crate) unk_3cec: u32, + pub(crate) unk_3cf0: u32, + pub(crate) core_leak_coef: Array<8, F32>, + pub(crate) sram_leak_coef: Array<8, F32>, + pub(crate) unk_3d34: Array<0x38, u8>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_3d6c: Array<0x38, u8>, + } + #[versions(AGX)] + default_zeroed!(HwDataA::ver); + #[versions(AGX)] + no_debug!(HwDataA::ver); + + #[derive(Debug, Default, Clone, Copy)] + #[repr(C)] + pub(crate) struct IOMapping { + pub(crate) phys_addr: U64, + pub(crate) virt_addr: U64, + pub(crate) size: u32, + pub(crate) range_size: u32, + pub(crate) readwrite: U64, + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct HwDataB { + #[ver(V < V13_0B4)] + pub(crate) unk_0: U64, + + pub(crate) unk_8: U64, + + #[ver(V < V13_0B4)] + pub(crate) unk_10: U64, + + pub(crate) unk_18: U64, + pub(crate) unk_20: U64, + pub(crate) unk_28: U64, + pub(crate) unk_30: U64, + pub(crate) unkptr_38: U64, + pub(crate) pad_40: Pad<0x20>, + + #[ver(V < V13_0B4)] + pub(crate) yuv_matrices: Array<0xf, Array<3, Array<4, i16>>>, + + #[ver(V >= V13_0B4)] + pub(crate) yuv_matrices: Array<0x3f, Array<3, Array<4, i16>>>, + + pub(crate) pad_1c8: Pad<0x8>, + pub(crate) io_mappings: Array<0x14, IOMapping>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_450_0: Array<0x68, u8>, + + pub(crate) chip_id: u32, + pub(crate) unk_454: u32, + pub(crate) unk_458: u32, + pub(crate) unk_45c: u32, + pub(crate) unk_460: u32, + pub(crate) unk_464: u32, + pub(crate) 
unk_468: u32, + pub(crate) unk_46c: u32, + pub(crate) unk_470: u32, + pub(crate) unk_474: u32, + pub(crate) unk_478: u32, + pub(crate) unk_47c: u32, + pub(crate) unk_480: u32, + pub(crate) unk_484: u32, + pub(crate) unk_488: u32, + pub(crate) unk_48c: u32, + pub(crate) base_clock_khz: u32, + pub(crate) power_sample_period: u32, + pub(crate) pad_498: Pad<0x4>, + pub(crate) unk_49c: u32, + pub(crate) unk_4a0: u32, + pub(crate) unk_4a4: u32, + pub(crate) pad_4a8: Pad<0x4>, + pub(crate) unk_4ac: u32, + pub(crate) pad_4b0: Pad<0x8>, + pub(crate) unk_4b8: u32, + pub(crate) unk_4bc: Array<0x4, u8>, + pub(crate) unk_4c0: u32, + pub(crate) unk_4c4: u32, + pub(crate) unk_4c8: u32, + pub(crate) unk_4cc: u32, + pub(crate) unk_4d0: u32, + pub(crate) unk_4d4: u32, + pub(crate) unk_4d8: Array<0x4, u8>, + pub(crate) unk_4dc: u32, + pub(crate) unk_4e0: U64, + pub(crate) unk_4e8: u32, + pub(crate) unk_4ec: u32, + pub(crate) unk_4f0: u32, + pub(crate) unk_4f4: u32, + pub(crate) unk_4f8: u32, + pub(crate) unk_4fc: u32, + pub(crate) unk_500: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_504_0: u32, + + pub(crate) unk_504: u32, + pub(crate) unk_508: u32, + pub(crate) unk_50c: u32, + pub(crate) unk_510: u32, + pub(crate) unk_514: u32, + pub(crate) unk_518: u32, + pub(crate) unk_51c: u32, + pub(crate) unk_520: u32, + pub(crate) unk_524: u32, + pub(crate) unk_528: u32, + pub(crate) unk_52c: u32, + pub(crate) unk_530: u32, + pub(crate) unk_534: u32, + pub(crate) unk_538: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_53c_0: u32, + + pub(crate) num_frags: u32, + pub(crate) unk_540: u32, + pub(crate) unk_544: u32, + pub(crate) unk_548: u32, + pub(crate) unk_54c: u32, + pub(crate) unk_550: u32, + pub(crate) unk_554: u32, + pub(crate) uat_ttb_base: U64, + pub(crate) gpu_core_id: u32, + pub(crate) gpu_rev_id: u32, + pub(crate) num_cores: u32, + pub(crate) max_pstate: u32, + + #[ver(V < V13_0B4)] + pub(crate) num_pstates: u32, + + pub(crate) frequencies: Array<0x10, u32>, + pub(crate) voltages: 
Array<0x10, [u32; 0x8]>, + pub(crate) voltages_sram: Array<0x10, [u32; 0x8]>, + pub(crate) sram_k: Array<0x10, F32>, + pub(crate) unk_9f4: Array<0x10, u32>, + pub(crate) rel_max_powers: Array<0x10, u32>, + pub(crate) rel_boost_freqs: Array<0x10, u32>, + + #[ver(V < V13_0B4)] + pub(crate) min_sram_volt: u32, + + #[ver(V < V13_0B4)] + pub(crate) unk_ab8: u32, + + #[ver(V < V13_0B4)] + pub(crate) unk_abc: u32, + + #[ver(V < V13_0B4)] + pub(crate) unk_ac0: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_ac4_0: Array<0x1f0, u8>, + + pub(crate) pad_ac4: Pad<0x8>, + pub(crate) unk_acc: u32, + pub(crate) unk_ad0: u32, + pub(crate) pad_ad4: Pad<0x10>, + pub(crate) unk_ae4: Array<0x4, u32>, + pub(crate) pad_af4: Pad<0x4>, + pub(crate) unk_af8: u32, + pub(crate) pad_afc: Pad<0x8>, + pub(crate) unk_b04: u32, + pub(crate) unk_b08: u32, + pub(crate) unk_b0c: u32, + pub(crate) unk_b10: u32, + pub(crate) pad_b14: Pad<0x8>, + pub(crate) unk_b1c: u32, + pub(crate) unk_b20: u32, + pub(crate) unk_b24: u32, + pub(crate) unk_b28: u32, + pub(crate) unk_b2c: u32, + pub(crate) unk_b30: u32, + pub(crate) unk_b34: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_b38_0: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_b38_4: u32, + + pub(crate) unk_b38: Array<0xc, u32>, + pub(crate) unk_b68: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_b6c: Array<0xd0, u8>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_c3c: u32, + } + #[versions(AGX)] + default_zeroed!(HwDataB::ver); + + #[derive(Debug, Clone, Copy)] + #[repr(C, packed)] + pub(crate) struct GpuQueueStatsVtx { + pub(crate) busy: u32, + pub(crate) unk_4: u32, + pub(crate) cur_cmdqueue: U64, + pub(crate) cur_count: u32, + pub(crate) unk_14: u32, + } + default_zeroed!(GpuQueueStatsVtx); + + #[versions(AGX)] + #[derive(Debug, Default, Clone, Copy)] + #[repr(C, packed)] + pub(crate) struct GpuStatsVtx { + pub(crate) unk_4: u32, + pub(crate) queues: Array<0x4, GpuQueueStatsVtx>, + pub(crate) unk_68: Array<0x8, u8>, + pub(crate) unk_70: u32, + pub(crate) 
unk_74: u32, + pub(crate) unk_timestamp: U64, + pub(crate) unk_80: Array<0x40, u8>, + } + + #[derive(Debug, Default, Clone, Copy)] + #[repr(C, packed)] + pub(crate) struct GpuQueueStatsFrag { + pub(crate) busy: u32, + pub(crate) cur_cmdqueue: U64, + pub(crate) unk_c: u32, + pub(crate) unk_10: u32, + pub(crate) unk_14: Array<0x14, u8>, + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct GpuStatsFrag { + pub(crate) unk_0: Array<0x18, u8>, + pub(crate) queues: Array<0x4, GpuQueueStatsFrag>, + pub(crate) unk_d0: Array<0x38, u8>, + pub(crate) tvb_overflows_1: u32, + pub(crate) tvb_overflows_2: u32, + pub(crate) unk_f8: u32, + pub(crate) unk_fc: u32, + pub(crate) cur_stamp_id: i32, + pub(crate) unk_104: Array<0x14, u8>, + pub(crate) unk_118: i32, + pub(crate) unk_11c: u32, + pub(crate) unk_120: u32, + pub(crate) unk_124: u32, + pub(crate) unk_128: u32, + pub(crate) unk_12c: u32, + pub(crate) unk_timestamp: U64, + pub(crate) unk_134: Array<0x8c, u8>, + } + #[versions(AGX)] + default_zeroed!(GpuStatsFrag::ver); + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct GpuGlobalStatsVtx { + pub(crate) total_cmds: u32, + pub(crate) stats: GpuStatsVtx::ver, + #[ver(V >= V13_0B4)] + pub(crate) unk_pad: Array<0x5c4, u8>, + } + #[versions(AGX)] + default_zeroed!(GpuGlobalStatsVtx::ver); + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct GpuGlobalStatsFrag { + pub(crate) total_cmds: u32, + pub(crate) unk_4: u32, + pub(crate) stats: GpuStatsFrag::ver, + #[ver(V >= V13_0B4)] + pub(crate) unk_pad: Array<0x580, u8>, + } + #[versions(AGX)] + default_zeroed!(GpuGlobalStatsFrag::ver); + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct GpuStatsComp { + pub(crate) unk: Array<0x140, u8>, + } + default_zeroed!(GpuStatsComp); + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct RuntimeScratch { + pub(crate) unk_280: Array<0x6800, u8>, + pub(crate) unk_6a80: u32, + pub(crate) gpu_idle: u32, + pub(crate) unkpad_6a88: Pad<0x14>, 
+ pub(crate) unk_6a9c: u32, + pub(crate) unk_ctr0: u32, + pub(crate) unk_ctr1: u32, + pub(crate) unk_6aa8: u32, + pub(crate) unk_6aac: u32, + pub(crate) unk_ctr2: u32, + pub(crate) unk_6ab4: u32, + pub(crate) unk_6ab8: u32, + pub(crate) unk_6abc: u32, + pub(crate) unk_6ac0: u32, + pub(crate) unk_6ac4: u32, + pub(crate) unk_ctr3: u32, + pub(crate) unk_6acc: u32, + pub(crate) unk_6ad0: u32, + pub(crate) unk_6ad4: u32, + pub(crate) unk_6ad8: u32, + pub(crate) unk_6adc: u32, + pub(crate) unk_6ae0: u32, + pub(crate) unk_6ae4: u32, + pub(crate) unk_6ae8: u32, + pub(crate) unk_6aec: u32, + pub(crate) unk_6af0: u32, + pub(crate) unk_ctr4: u32, + pub(crate) unk_ctr5: u32, + pub(crate) unk_6afc: u32, + pub(crate) pad_6b00: Pad<0x38>, + pub(crate) unk_6b38: u32, + pub(crate) pad_6b3c: Pad<0x84>, + } + default_zeroed!(RuntimeScratch); + + pub(crate) type BufferMgrCtl = Array<4, u32>; + + #[versions(AGX)] + #[repr(C)] + pub(crate) struct RuntimePointers<'a> { + pub(crate) pipes: Array<4, PipeChannels::ver>, + + pub(crate) device_control: + ChannelRing<channels::ChannelState, channels::DeviceControlMsg::ver>, + pub(crate) event: ChannelRing<channels::ChannelState, channels::RawEventMsg>, + pub(crate) fw_log: ChannelRing<channels::FwLogChannelState, channels::RawFwLogMsg>, + pub(crate) ktrace: ChannelRing<channels::ChannelState, channels::RawKTraceMsg>, + pub(crate) stats: ChannelRing<channels::ChannelState, channels::RawStatsMsg::ver>, + + pub(crate) __pad0: Pad<0x50>, + pub(crate) unk_160: U64, + pub(crate) unk_168: U64, + pub(crate) stats_vtx: GpuPointer<'a, super::GpuGlobalStatsVtx::ver>, + pub(crate) stats_frag: GpuPointer<'a, super::GpuGlobalStatsFrag::ver>, + pub(crate) stats_comp: GpuPointer<'a, super::GpuStatsComp>, + pub(crate) hwdata_a: GpuPointer<'a, super::HwDataA::ver>, + pub(crate) unkptr_190: GpuPointer<'a, &'a [u8]>, + pub(crate) unkptr_198: GpuPointer<'a, &'a [u8]>, + pub(crate) hwdata_b: GpuPointer<'a, super::HwDataB::ver>, + pub(crate) hwdata_b_2: 
GpuPointer<'a, super::HwDataB::ver>, + pub(crate) fwlog_buf: Option<GpuWeakPointer<[channels::RawFwLogPayloadMsg]>>, + pub(crate) unkptr_1b8: GpuPointer<'a, &'a [u8]>, + pub(crate) unkptr_1c0: GpuPointer<'a, &'a [u8]>, + pub(crate) unkptr_1c8: GpuPointer<'a, &'a [u8]>, + pub(crate) unk_1d0: u32, + pub(crate) unk_1d4: u32, + pub(crate) unk_1d8: Array<0x3c, u8>, + pub(crate) buffer_mgr_ctl: GpuPointer<'a, &'a [BufferMgrCtl]>, + pub(crate) buffer_mgr_ctl_2: GpuPointer<'a, &'a [BufferMgrCtl]>, + pub(crate) __pad1: Pad<0x5c>, + pub(crate) gpu_scratch: RuntimeScratch, + } + #[versions(AGX)] + no_debug!(RuntimePointers::ver<'_>); + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct PendingStamp { + pub(crate) info: AtomicU32, + pub(crate) wait_value: AtomicU32, + } + default_zeroed!(PendingStamp); + + #[derive(Debug, Clone, Copy)] + #[repr(C, packed)] + pub(crate) struct FaultInfo { + pub(crate) unk_0: u32, + pub(crate) unk_4: u32, + pub(crate) queue_uuid: u32, + pub(crate) unk_c: u32, + pub(crate) unk_10: u32, + pub(crate) unk_14: u32, + } + default_zeroed!(FaultInfo); + + #[versions(AGX)] + #[derive(Debug, Clone, Copy)] + #[repr(C, packed)] + pub(crate) struct GlobalsSub { + pub(crate) unk_54: u16, + pub(crate) unk_56: u16, + pub(crate) unk_58: u16, + pub(crate) unk_5a: U32, + pub(crate) unk_5e: U32, + pub(crate) unk_62: U32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_66_0: Array<0xc, u8>, + + pub(crate) unk_66: U32, + pub(crate) unk_6a: Array<0x16, u8>, + } + #[versions(AGX)] + default_zeroed!(GlobalsSub::ver); + + #[derive(Debug, Clone, Copy)] + #[repr(C)] + pub(crate) struct PowerZoneGlobal { + pub(crate) target: u32, + pub(crate) target_off: u32, + pub(crate) filter_tc: u32, + } + default_zeroed!(PowerZoneGlobal); + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct Globals { + pub(crate) ktrace_enable: u32, + pub(crate) unk_4: Array<0x20, u8>, + + #[ver(V >= V13_2)] + pub(crate) unk_24_0: u32, + + pub(crate) unk_24: u32, + + #[ver(V >= V13_0B4)] 
+ pub(crate) unk_28_0: u32, + + pub(crate) unk_28: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_2c_0: u32, + + pub(crate) unk_2c: u32, + pub(crate) unk_30: u32, + pub(crate) unk_34: u32, + pub(crate) unk_38: Array<0x1c, u8>, + + pub(crate) sub: GlobalsSub::ver, + + pub(crate) unk_80: Array<0xf80, u8>, + pub(crate) unk_1000: Array<0x7000, u8>, + pub(crate) unk_8000: Array<0x900, u8>, + + #[ver(V >= V13_0B4 && V < V13_2)] + pub(crate) unk_8900_0: u32, + + pub(crate) unk_8900: u32, + pub(crate) pending_submissions: AtomicU32, + pub(crate) max_power: u32, + pub(crate) max_pstate_scaled: u32, + pub(crate) max_pstate_scaled_2: u32, + pub(crate) unk_8914: u32, + pub(crate) unk_8918: u32, + pub(crate) max_pstate_scaled_3: u32, + pub(crate) unk_8920: u32, + pub(crate) power_zone_count: u32, + pub(crate) avg_power_filter_tc_periods: u32, + pub(crate) avg_power_ki_dt: F32, + pub(crate) avg_power_kp: F32, + pub(crate) avg_power_min_duty_cycle: u32, + pub(crate) avg_power_target_filter_tc: u32, + pub(crate) power_zones: Array<5, PowerZoneGlobal>, + pub(crate) unk_8978: Array<0x44, u8>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_89bc_0: Array<0x3c, u8>, + + pub(crate) unk_89bc: u32, + pub(crate) fast_die0_release_temp: u32, + pub(crate) unk_89c4: i32, + pub(crate) fast_die0_prop_tgt_delta: u32, + pub(crate) fast_die0_kp: F32, + pub(crate) fast_die0_ki_dt: F32, + pub(crate) unk_89d4: Array<0xc, u8>, + pub(crate) unk_89e0: u32, + pub(crate) max_power_2: u32, + pub(crate) ppm_kp: F32, + pub(crate) ppm_ki_dt: F32, + pub(crate) unk_89f0: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_89f4_0: Array<0x8, u8>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_89f4_8: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_89f4_c: Array<0x50, u8>, + + pub(crate) unk_89f4: u32, + pub(crate) hws1: HwDataShared1, + pub(crate) hws2: HwDataShared2, + + #[ver(V >= V13_0B4)] + pub(crate) unk_hws2_0: Array<0x28, u8>, + + pub(crate) hws3: HwDataShared3, + pub(crate) unk_9004: Array<8, u8>, + pub(crate) unk_900c: 
u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_9010_0: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_9010_4: Array<0x14, u8>, + + pub(crate) unk_9010: Array<0x2c, u8>, + pub(crate) unk_903c: u32, + pub(crate) unk_9040: Array<0xc0, u8>, + pub(crate) unk_9100: Array<0x6f00, u8>, + pub(crate) unk_10000: Array<0xe50, u8>, + pub(crate) unk_10e50: u32, + pub(crate) unk_10e54: Array<0x2c, u8>, + pub(crate) fault_control: u32, + pub(crate) do_init: u32, + pub(crate) unk_10e88: Array<0x188, u8>, + pub(crate) idle_ts: U64, + pub(crate) idle_unk: U64, + pub(crate) unk_11020: u32, + pub(crate) unk_11024: u32, + pub(crate) unk_11028: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_1102c_0: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_1102c_4: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_1102c_8: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_1102c_c: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_1102c_10: u32, + + pub(crate) unk_1102c: u32, + pub(crate) idle_off_delay_ms: AtomicU32, + pub(crate) fender_idle_off_delay_ms: u32, + pub(crate) fw_early_wake_timeout_ms: u32, + pub(crate) pending_stamps: Array<0x110, PendingStamp>, + pub(crate) unk_117bc: u32, + pub(crate) fault_info: FaultInfo, + pub(crate) counter: u32, + pub(crate) unk_118dc: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_118e0_0: Array<0x9c, u8>, + + pub(crate) unk_118e0: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_118e4_0: u32, + + pub(crate) unk_118e4: u32, + pub(crate) unk_118e8: u32, + pub(crate) unk_118ec: Array<0x15, u8>, + pub(crate) unk_11901: Array<0x43f, u8>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_11d40: Array<0x19c, u8>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_11edc: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_11ee0: Array<0x1c, u8>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_11efc: u32, + } + #[versions(AGX)] + default_zeroed!(Globals::ver); + + #[derive(Debug, Default, Clone, Copy)] + #[repr(C, packed)] + pub(crate) struct UatLevelInfo { + pub(crate) unk_3: u8, + pub(crate) unk_1: u8, + 
pub(crate) unk_2: u8, + pub(crate) index_shift: u8, + pub(crate) num_entries: u16, + pub(crate) unk_4: u16, + pub(crate) unk_8: U64, + pub(crate) unk_10: U64, + pub(crate) index_mask: U64, + } + + #[versions(AGX)] + #[derive(Debug)] + #[repr(C)] + pub(crate) struct InitData<'a> { + #[ver(V >= V13_0B4)] + pub(crate) ver_info: Array<0x4, u16>, + + pub(crate) unk_buf: GpuPointer<'a, &'a [u8]>, + pub(crate) unk_8: u32, + pub(crate) unk_c: u32, + pub(crate) runtime_pointers: GpuPointer<'a, super::RuntimePointers::ver>, + pub(crate) globals: GpuPointer<'a, super::Globals::ver>, + pub(crate) fw_status: GpuPointer<'a, super::FwStatus>, + pub(crate) uat_page_size: u16, + pub(crate) uat_page_bits: u8, + pub(crate) uat_num_levels: u8, + pub(crate) uat_level_info: Array<0x3, UatLevelInfo>, + pub(crate) __pad0: Pad<0x14>, + pub(crate) host_mapped_fw_allocations: u32, + pub(crate) unk_ac: u32, + pub(crate) unk_b0: u32, + pub(crate) unk_b4: u32, + pub(crate) unk_b8: u32, + } +} + +#[derive(Debug)] +pub(crate) struct ChannelRing<T: GpuStruct + Debug + Default, U: Copy> +where + for<'a> <T as GpuStruct>::Raw<'a>: Debug, +{ + pub(crate) state: GpuObject<T>, + pub(crate) ring: GpuArray<U>, +} + +impl<T: GpuStruct + Debug + Default, U: Copy> ChannelRing<T, U> +where + for<'a> <T as GpuStruct>::Raw<'a>: Debug, +{ + pub(crate) fn to_raw(&self) -> raw::ChannelRing<T, U> { + raw::ChannelRing { + state: Some(self.state.weak_pointer()), + ring: Some(self.ring.weak_pointer()), + } + } +} + +trivial_gpustruct!(FwStatus); + +#[versions(AGX)] +#[derive(Debug, Default)] +pub(crate) struct GpuGlobalStatsVtx {} + +#[versions(AGX)] +impl GpuStruct for GpuGlobalStatsVtx::ver { + type Raw<'a> = raw::GpuGlobalStatsVtx::ver; +} + +#[versions(AGX)] +#[derive(Debug, Default)] +pub(crate) struct GpuGlobalStatsFrag {} + +#[versions(AGX)] +impl GpuStruct for GpuGlobalStatsFrag::ver { + type Raw<'a> = raw::GpuGlobalStatsFrag::ver; +} + +#[derive(Debug, Default)] +pub(crate) struct GpuStatsComp {} + +impl 
GpuStruct for GpuStatsComp { + type Raw<'a> = raw::GpuStatsComp; +} + +#[versions(AGX)] +#[derive(Debug, Default)] +pub(crate) struct HwDataA {} + +#[versions(AGX)] +impl GpuStruct for HwDataA::ver { + type Raw<'a> = raw::HwDataA::ver; +} + +#[versions(AGX)] +#[derive(Debug, Default)] +pub(crate) struct HwDataB {} + +#[versions(AGX)] +impl GpuStruct for HwDataB::ver { + type Raw<'a> = raw::HwDataB::ver; +} + +#[versions(AGX)] +#[derive(Debug)] +pub(crate) struct Stats { + pub(crate) vtx: GpuObject<GpuGlobalStatsVtx::ver>, + pub(crate) frag: GpuObject<GpuGlobalStatsFrag::ver>, + pub(crate) comp: GpuObject<GpuStatsComp>, +} + +#[versions(AGX)] +#[derive(Debug)] +pub(crate) struct RuntimePointers { + pub(crate) stats: Stats::ver, + + pub(crate) hwdata_a: GpuObject<HwDataA::ver>, + pub(crate) unkptr_190: GpuArray<u8>, + pub(crate) unkptr_198: GpuArray<u8>, + pub(crate) hwdata_b: GpuObject<HwDataB::ver>, + + pub(crate) unkptr_1b8: GpuArray<u8>, + pub(crate) unkptr_1c0: GpuArray<u8>, + pub(crate) unkptr_1c8: GpuArray<u8>, + + pub(crate) buffer_mgr_ctl: GpuArray<raw::BufferMgrCtl>, +} + +#[versions(AGX)] +impl GpuStruct for RuntimePointers::ver { + type Raw<'a> = raw::RuntimePointers::ver<'a>; +} + +#[versions(AGX)] +#[derive(Debug, Default)] +pub(crate) struct Globals {} + +#[versions(AGX)] +impl GpuStruct for Globals::ver { + type Raw<'a> = raw::Globals::ver; +} + +#[versions(AGX)] +#[derive(Debug)] +pub(crate) struct InitData { + pub(crate) unk_buf: GpuArray<u8>, + pub(crate) runtime_pointers: GpuObject<RuntimePointers::ver>, + pub(crate) globals: GpuObject<Globals::ver>, + pub(crate) fw_status: GpuObject<FwStatus>, +} + +#[versions(AGX)] +impl GpuStruct for InitData::ver { + type Raw<'a> = raw::InitData::ver<'a>; +} diff --git a/drivers/gpu/drm/asahi/fw/job.rs b/drivers/gpu/drm/asahi/fw/job.rs new file mode 100644 index 000000000000..a0bbf67b1b1d --- /dev/null +++ b/drivers/gpu/drm/asahi/fw/job.rs @@ -0,0 +1,56 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! 
Common GPU job firmware structures + +use super::types::*; +use crate::{default_zeroed, trivial_gpustruct}; + +pub(crate) mod raw { + use super::*; + + #[derive(Debug, Clone, Copy)] + #[repr(C)] + pub(crate) struct JobMeta { + pub(crate) unk_4: u32, + pub(crate) stamp: GpuWeakPointer<Stamp>, + pub(crate) fw_stamp: GpuWeakPointer<FwStamp>, + pub(crate) stamp_value: EventValue, + pub(crate) stamp_slot: u32, + pub(crate) evctl_index: u32, + pub(crate) flush_stamps: u32, + pub(crate) uuid: u32, + pub(crate) cmd_seq: u32, + } + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct EncoderParams<'a> { + pub(crate) unk_8: u32, + pub(crate) unk_c: u32, + pub(crate) unk_10: u32, + pub(crate) encoder_id: u32, + pub(crate) unk_18: u32, + pub(crate) iogpu_compute_unk44: u32, + pub(crate) seq_buffer: GpuPointer<'a, &'a [u64]>, + pub(crate) unk_28: U64, + } + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct JobTimestamps { + pub(crate) start: AtomicU64, + pub(crate) end: AtomicU64, + } + default_zeroed!(JobTimestamps); + + #[derive(Debug)] + #[repr(C)] + pub(crate) struct RenderTimestamps { + pub(crate) vtx: JobTimestamps, + pub(crate) frag: JobTimestamps, + } + default_zeroed!(RenderTimestamps); +} + +trivial_gpustruct!(JobTimestamps); +trivial_gpustruct!(RenderTimestamps); diff --git a/drivers/gpu/drm/asahi/fw/microseq.rs b/drivers/gpu/drm/asahi/fw/microseq.rs new file mode 100644 index 000000000000..8deea3fb9914 --- /dev/null +++ b/drivers/gpu/drm/asahi/fw/microseq.rs @@ -0,0 +1,384 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! 
GPU firmware microsequence operations + +use super::types::*; +use super::{buffer, compute, fragment, initdata, vertex, workqueue}; +use crate::default_zeroed; + +pub(crate) trait Operation {} + +#[derive(Debug, Copy, Clone)] +#[repr(u32)] +enum OpCode { + WaitForIdle = 0x01, + RetireStamp = 0x18, + #[allow(dead_code)] + Timestamp = 0x19, + StartVertex = 0x22, + FinalizeVertex = 0x23, + StartFragment = 0x24, + FinalizeFragment = 0x25, + StartCompute = 0x29, + FinalizeCompute = 0x2a, +} + +#[derive(Debug, Copy, Clone)] +#[repr(u32)] +pub(crate) enum Pipe { + Vertex = 1 << 0, + Fragment = 1 << 8, + Compute = 1 << 15, +} + +pub(crate) const MAX_ATTACHMENTS: usize = 16; + +#[derive(Debug, Clone, Copy)] +#[repr(C)] +pub(crate) struct Attachment { + pub(crate) address: U64, + pub(crate) size: u32, + pub(crate) unk_c: u16, + pub(crate) unk_e: u16, +} +default_zeroed!(Attachment); + +#[derive(Debug, Clone, Copy, Default)] +#[repr(C)] +pub(crate) struct Attachments { + pub(crate) list: Array<MAX_ATTACHMENTS, Attachment>, + pub(crate) count: u32, +} + +#[derive(Debug, Copy, Clone)] +#[repr(transparent)] +pub(crate) struct OpHeader(u32); + +impl OpHeader { + const fn new(opcode: OpCode) -> OpHeader { + OpHeader(opcode as u32) + } + const fn with_args(opcode: OpCode, args: u32) -> OpHeader { + OpHeader(opcode as u32 | args) + } +} + +macro_rules! 
simple_op { + ($name:ident) => { + #[derive(Debug, Copy, Clone)] + pub(crate) struct $name(OpHeader); + + impl $name { + pub(crate) const HEADER: $name = $name(OpHeader::new(OpCode::$name)); + } + }; +} + +pub(crate) mod op { + use super::*; + + simple_op!(StartVertex); + simple_op!(FinalizeVertex); + simple_op!(StartFragment); + simple_op!(FinalizeFragment); + simple_op!(StartCompute); + simple_op!(FinalizeCompute); + + #[derive(Debug, Copy, Clone)] + pub(crate) struct RetireStamp(OpHeader); + impl RetireStamp { + pub(crate) const HEADER: RetireStamp = + RetireStamp(OpHeader::with_args(OpCode::RetireStamp, 0x40000000)); + } + + #[derive(Debug, Copy, Clone)] + pub(crate) struct WaitForIdle(OpHeader); + impl WaitForIdle { + pub(crate) const fn new(pipe: Pipe) -> WaitForIdle { + WaitForIdle(OpHeader::with_args(OpCode::WaitForIdle, (pipe as u32) << 8)) + } + } + + #[derive(Debug, Copy, Clone)] + pub(crate) struct Timestamp(OpHeader); + impl Timestamp { + #[allow(dead_code)] + pub(crate) const fn new(flag: bool) -> Timestamp { + Timestamp(OpHeader::with_args(OpCode::Timestamp, (flag as u32) << 31)) + } + } +} + +#[derive(Debug)] +#[repr(C)] +pub(crate) struct WaitForIdle { + pub(crate) header: op::WaitForIdle, +} + +impl Operation for WaitForIdle {} + +#[derive(Debug)] +#[repr(C)] +pub(crate) struct RetireStamp { + pub(crate) header: op::RetireStamp, +} + +impl Operation for RetireStamp {} + +#[versions(AGX)] +#[derive(Debug)] +#[repr(C)] +pub(crate) struct Timestamp<'a> { + pub(crate) header: op::Timestamp, + pub(crate) cur_ts: GpuWeakPointer<U64>, + pub(crate) start_ts: GpuWeakPointer<Option<GpuPointer<'a, AtomicU64>>>, + pub(crate) update_ts: GpuWeakPointer<Option<GpuPointer<'a, AtomicU64>>>, + pub(crate) work_queue: GpuWeakPointer<workqueue::QueueInfo::ver>, + pub(crate) unk_24: U64, + + #[ver(V >= V13_0B4)] + pub(crate) unk_ts: GpuWeakPointer<U64>, + + pub(crate) uuid: u32, + pub(crate) unk_30_padding: u32, +} + +#[versions(AGX)] +impl<'a> Operation for 
Timestamp::ver<'a> {} + +#[versions(AGX)] +#[derive(Debug)] +#[repr(C)] +pub(crate) struct StartVertex<'a> { + pub(crate) header: op::StartVertex, + pub(crate) tiling_params: GpuWeakPointer<vertex::raw::TilingParameters>, + pub(crate) job_params1: GpuWeakPointer<vertex::raw::JobParameters1::ver<'a>>, + pub(crate) buffer: GpuWeakPointer<buffer::Info::ver>, + pub(crate) scene: GpuWeakPointer<buffer::Scene::ver>, + pub(crate) stats: GpuWeakPointer<initdata::raw::GpuStatsVtx::ver>, + pub(crate) work_queue: GpuWeakPointer<workqueue::QueueInfo::ver>, + pub(crate) vm_slot: u32, + pub(crate) unk_38: u32, + pub(crate) event_generation: u32, + pub(crate) buffer_slot: u32, + pub(crate) unk_44: u32, + pub(crate) cmd_seq: U64, + pub(crate) unk_50: u32, + pub(crate) unk_pointer: GpuWeakPointer<u32>, + pub(crate) unk_job_buf: GpuWeakPointer<U64>, + pub(crate) unk_64: u32, + pub(crate) unk_68: u32, + pub(crate) uuid: u32, + pub(crate) unk_70: u32, + pub(crate) unk_74: Array<0x1d, U64>, + pub(crate) unk_15c: u32, + pub(crate) unk_160: U64, + pub(crate) unk_168: u32, + pub(crate) unk_16c: u32, + pub(crate) unk_170: U64, + + #[ver(V >= V13_0B4)] + pub(crate) counter: U64, + + #[ver(V >= V13_0B4)] + pub(crate) notifier_buf: GpuWeakPointer<Array<0x8, u8>>, + + pub(crate) unk_178: u32, +} + +#[versions(AGX)] +impl<'a> Operation for StartVertex::ver<'a> {} + +#[versions(AGX)] +#[derive(Debug)] +#[repr(C)] +pub(crate) struct FinalizeVertex { + pub(crate) header: op::FinalizeVertex, + pub(crate) scene: GpuWeakPointer<buffer::Scene::ver>, + pub(crate) buffer: GpuWeakPointer<buffer::Info::ver>, + pub(crate) stats: GpuWeakPointer<initdata::raw::GpuStatsVtx::ver>, + pub(crate) work_queue: GpuWeakPointer<workqueue::QueueInfo::ver>, + pub(crate) vm_slot: u32, + pub(crate) unk_28: u32, + pub(crate) unk_pointer: GpuWeakPointer<u32>, + pub(crate) unk_34: u32, + pub(crate) uuid: u32, + pub(crate) fw_stamp: GpuWeakPointer<FwStamp>, + pub(crate) stamp_value: EventValue, + pub(crate) unk_48: U64, + pub(crate) unk_50: 
u32, + pub(crate) unk_54: u32, + pub(crate) unk_58: U64, + pub(crate) unk_60: u32, + pub(crate) unk_64: u32, + pub(crate) unk_68: u32, + + #[ver(G >= G14 && V < V13_0B4)] + pub(crate) unk_68_g14: U64, + + pub(crate) restart_branch_offset: i32, + pub(crate) unk_70: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_74: Array<0x10, u8>, +} + +#[versions(AGX)] +impl Operation for FinalizeVertex::ver {} + +#[versions(AGX)] +#[derive(Debug)] +#[repr(C)] +pub(crate) struct StartFragment<'a> { + pub(crate) header: op::StartFragment, + pub(crate) job_params2: GpuWeakPointer<fragment::raw::JobParameters2>, + pub(crate) job_params1: GpuWeakPointer<fragment::raw::JobParameters1::ver<'a>>, + pub(crate) scene: GpuPointer<'a, buffer::Scene::ver>, + pub(crate) stats: GpuWeakPointer<initdata::raw::GpuStatsFrag::ver>, + pub(crate) busy_flag: GpuWeakPointer<u32>, + pub(crate) tvb_overflow_count: GpuWeakPointer<u32>, + pub(crate) unk_pointer: GpuWeakPointer<u32>, + pub(crate) work_queue: GpuWeakPointer<workqueue::QueueInfo::ver>, + pub(crate) work_item: GpuWeakPointer<fragment::RunFragment::ver>, + pub(crate) vm_slot: u32, + pub(crate) unk_50: u32, + pub(crate) event_generation: u32, + pub(crate) buffer_slot: u32, + pub(crate) unk_5c: u32, + pub(crate) cmd_seq: U64, + pub(crate) unk_68: u32, + pub(crate) unk_758_flag: GpuWeakPointer<u32>, + pub(crate) unk_job_buf: GpuWeakPointer<U64>, + pub(crate) unk_7c: u32, + pub(crate) unk_80: u32, + pub(crate) unk_84: u32, + pub(crate) uuid: u32, + pub(crate) attachments: Attachments, + pub(crate) unk_190: u32, + + #[ver(V >= V13_0B4)] + pub(crate) counter: U64, + + #[ver(V >= V13_0B4)] + pub(crate) notifier_buf: GpuWeakPointer<Array<0x8, u8>>, +} + +#[versions(AGX)] +impl<'a> Operation for StartFragment::ver<'a> {} + +#[versions(AGX)] +#[derive(Debug)] +#[repr(C)] +pub(crate) struct FinalizeFragment { + pub(crate) header: op::FinalizeFragment, + pub(crate) uuid: u32, + pub(crate) unk_8: u32, + pub(crate) fw_stamp: GpuWeakPointer<FwStamp>, + pub(crate) 
stamp_value: EventValue, + pub(crate) unk_18: u32, + pub(crate) scene: GpuWeakPointer<buffer::Scene::ver>, + pub(crate) buffer: GpuWeakPointer<buffer::Info::ver>, + pub(crate) unk_2c: U64, + pub(crate) stats: GpuWeakPointer<initdata::raw::GpuStatsFrag::ver>, + pub(crate) unk_pointer: GpuWeakPointer<u32>, + pub(crate) busy_flag: GpuWeakPointer<u32>, + pub(crate) work_queue: GpuWeakPointer<workqueue::QueueInfo::ver>, + pub(crate) work_item: GpuWeakPointer<fragment::RunFragment::ver>, + pub(crate) vm_slot: u32, + pub(crate) unk_60: u32, + pub(crate) unk_758_flag: GpuWeakPointer<u32>, + pub(crate) unk_6c: U64, + pub(crate) unk_74: U64, + pub(crate) unk_7c: U64, + pub(crate) unk_84: U64, + pub(crate) unk_8c: U64, + + #[ver(G == G14 && V < V13_0B4)] + pub(crate) unk_8c_g14: U64, + + pub(crate) restart_branch_offset: i32, + pub(crate) unk_98: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_9c: Array<0x10, u8>, +} + +#[versions(AGX)] +impl Operation for FinalizeFragment::ver {} + +#[versions(AGX)] +#[derive(Debug)] +#[repr(C)] +pub(crate) struct StartCompute<'a> { + pub(crate) header: op::StartCompute, + pub(crate) unk_pointer: GpuWeakPointer<Array<0x54, u8>>, + pub(crate) job_params1: GpuWeakPointer<compute::raw::JobParameters1<'a>>, + pub(crate) stats: GpuWeakPointer<initdata::GpuStatsComp>, + pub(crate) work_queue: GpuWeakPointer<workqueue::QueueInfo::ver>, + pub(crate) vm_slot: u32, + pub(crate) unk_28: u32, + pub(crate) event_generation: u32, + pub(crate) cmd_seq: U64, + pub(crate) unk_38: u32, + pub(crate) job_params2: GpuWeakPointer<compute::raw::JobParameters2::ver<'a>>, + pub(crate) unk_44: u32, + pub(crate) uuid: u32, + pub(crate) attachments: Attachments, + pub(crate) padding: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_flag: GpuWeakPointer<U32>, + + #[ver(V >= V13_0B4)] + pub(crate) counter: U64, + + #[ver(V >= V13_0B4)] + pub(crate) notifier_buf: GpuWeakPointer<Array<0x8, u8>>, +} + +#[versions(AGX)] +impl<'a> Operation for StartCompute::ver<'a> {} + +#[versions(AGX)] 
+#[derive(Debug)] +#[repr(C)] +pub(crate) struct FinalizeCompute<'a> { + pub(crate) header: op::FinalizeCompute, + pub(crate) stats: GpuWeakPointer<initdata::GpuStatsComp>, + pub(crate) work_queue: GpuWeakPointer<workqueue::QueueInfo::ver>, + pub(crate) vm_slot: u32, + #[ver(V < V13_0B4)] + pub(crate) unk_18: u32, + pub(crate) job_params2: GpuWeakPointer<compute::raw::JobParameters2::ver<'a>>, + pub(crate) unk_24: u32, + pub(crate) uuid: u32, + pub(crate) fw_stamp: GpuWeakPointer<FwStamp>, + pub(crate) stamp_value: EventValue, + pub(crate) unk_38: u32, + pub(crate) unk_3c: u32, + pub(crate) unk_40: u32, + pub(crate) unk_44: u32, + pub(crate) unk_48: u32, + pub(crate) unk_4c: u32, + pub(crate) unk_50: u32, + pub(crate) unk_54: u32, + pub(crate) unk_58: u32, + + #[ver(G == G14 && V < V13_0B4)] + pub(crate) unk_5c_g14: U64, + + pub(crate) restart_branch_offset: i32, + pub(crate) unk_60: u32, + + #[ver(V >= V13_0B4)] + pub(crate) unk_64: Array<0xd, u8>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_flag: GpuWeakPointer<U32>, + + #[ver(V >= V13_0B4)] + pub(crate) unk_79: Array<0x7, u8>, +} + +#[versions(AGX)] +impl<'a> Operation for FinalizeCompute::ver<'a> {} diff --git a/drivers/gpu/drm/asahi/fw/mod.rs b/drivers/gpu/drm/asahi/fw/mod.rs new file mode 100644 index 000000000000..a5649aa20d3a --- /dev/null +++ b/drivers/gpu/drm/asahi/fw/mod.rs @@ -0,0 +1,15 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! 
Firmware structures for Apple AGX GPUs + +pub(crate) mod buffer; +pub(crate) mod channels; +pub(crate) mod compute; +pub(crate) mod event; +pub(crate) mod fragment; +pub(crate) mod initdata; +pub(crate) mod job; +pub(crate) mod microseq; +pub(crate) mod types; +pub(crate) mod vertex; +pub(crate) mod workqueue; diff --git a/drivers/gpu/drm/asahi/fw/types.rs b/drivers/gpu/drm/asahi/fw/types.rs new file mode 100644 index 000000000000..c1a07be1e047 --- /dev/null +++ b/drivers/gpu/drm/asahi/fw/types.rs @@ -0,0 +1,233 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! Common types for firmware structure definitions + +use crate::{alloc, object}; +use core::fmt; +use core::ops::{Deref, DerefMut, Index, IndexMut}; + +pub(crate) use crate::event::EventValue; +pub(crate) use crate::object::{GpuPointer, GpuStruct, GpuWeakPointer}; +pub(crate) use crate::{f32, float::F32}; + +pub(crate) use ::alloc::boxed::Box; +pub(crate) use core::fmt::Debug; +pub(crate) use core::marker::PhantomData; +pub(crate) use core::sync::atomic::{AtomicI32, AtomicU32, AtomicU64}; +pub(crate) use kernel::macros::versions; + +// Make the trait visible +pub(crate) use crate::alloc::Allocator as _Allocator; + +/// General allocator type used for the driver +pub(crate) type Allocator = alloc::DefaultAllocator; + +/// General GpuObject type used for the driver +pub(crate) type GpuObject<T> = + object::GpuObject<T, alloc::GenericAlloc<T, alloc::DefaultAllocation>>; + +/// General GpuArray type used for the driver +pub(crate) type GpuArray<T> = object::GpuArray<T, alloc::GenericAlloc<T, alloc::DefaultAllocation>>; + +/// General GpuOnlyArray type used for the driver +pub(crate) type GpuOnlyArray<T> = + object::GpuOnlyArray<T, alloc::GenericAlloc<T, alloc::DefaultAllocation>>; + +/// A stamp slot that is shared between firmware and the driver. +#[derive(Debug, Default)] +#[repr(transparent)] +pub(crate) struct Stamp(pub(crate) AtomicU32); + +/// A stamp slot that is for private firmware use. 
+/// +/// This is a separate type to guard against pointer type confusion. +#[derive(Debug, Default)] +#[repr(transparent)] +pub(crate) struct FwStamp(pub(crate) AtomicU32); + +/// An unaligned u64 type. +/// +/// This is useful to avoid having to pack firmware structures entirely, since that is incompatible +/// with `#[derive(Debug)]` and atomics. +#[derive(Copy, Clone, Default)] +#[repr(C, packed(1))] +pub(crate) struct U64(pub(crate) u64); + +unsafe impl Zeroed for U64 {} + +impl fmt::Debug for U64 { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + let v = self.0; + f.write_fmt(format_args!("{:#x}", v)) + } +} + +/// An unaligned u32 type. +/// +/// This is useful to avoid having to pack firmware structures entirely, since that is incompatible +/// with `#[derive(Debug)]` and atomics. +#[derive(Copy, Clone, Default)] +#[repr(C, packed(1))] +pub(crate) struct U32(pub(crate) u32); + +unsafe impl Zeroed for U32 {} + +impl fmt::Debug for U32 { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + let v = self.0; + f.write_fmt(format_args!("{:#x}", v)) + } +} + +unsafe impl Zeroed for u8 {} +unsafe impl Zeroed for u16 {} +unsafe impl Zeroed for u32 {} +unsafe impl Zeroed for u64 {} +unsafe impl Zeroed for i8 {} +unsafe impl Zeroed for i16 {} +unsafe impl Zeroed for i32 {} +unsafe impl Zeroed for i64 {} + +/// Create a dummy `Debug` implementation, for when we need it but it's too painful to write by +/// hand or not very useful. +#[macro_export] +macro_rules! no_debug { + ($type:ty) => { + impl Debug for $type { + fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { + write!(f, "...") + } + } + }; +} + +/// Types which can be safely initialized with an all-zero bit pattern. 
+/// +/// See: https://github.com/rust-lang/rfcs/issues/2626 +/// +/// # Safety +/// +/// This trait must only be implemented if a type only contains primitive types which can be +/// zero-initialized, FFI structs intended to be zero-initialized, or other types which impl Zeroed. +pub(crate) unsafe trait Zeroed: Default { + fn zeroed() -> Self { + // SAFETY: The user is responsible for ensuring this is safe. + unsafe { core::mem::zeroed() } + } +} + +/// Implement Zeroed for a given type (and Default along with it). +/// +/// # Safety +/// +/// This macro must only be used if a type only contains primitive types which can be +/// zero-initialized, FFI structs intended to be zero-initialized, or other types which impl Zeroed. +#[macro_export] +macro_rules! default_zeroed { + (<$($lt:lifetime),*>, $type:ty) => { + impl<$($lt),*> Default for $type { + fn default() -> $type { + Zeroed::zeroed() + } + } + // SAFETY: The user is responsible for ensuring this is safe. + unsafe impl<$($lt),*> Zeroed for $type {} + }; + ($type:ty) => { + impl Default for $type { + fn default() -> $type { + Zeroed::zeroed() + } + } + // SAFETY: The user is responsible for ensuring this is safe. + unsafe impl Zeroed for $type {} + }; +} + +/// A convenience type for a number of padding bytes. Hidden from Debug formatting. +#[derive(Copy, Clone)] +#[repr(C, packed)] +pub(crate) struct Pad<const N: usize>([u8; N]); + +/// SAFETY: Primitive type, safe to zero-init. +unsafe impl<const N: usize> Zeroed for Pad<N> {} + +impl<const N: usize> Default for Pad<N> { + fn default() -> Self { + Zeroed::zeroed() + } +} + +impl<const N: usize> fmt::Debug for Pad<N> { + fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result { + f.write_fmt(format_args!("<pad>")) + } +} + +/// A convenience type for a fixed-sized array with Default/Zeroed impls. 
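(As an aside, the `Zeroed`/`default_zeroed!` pattern above, together with the unaligned `U64` wrapper, can be sketched standalone in userspace. The names below are illustrative only, not the driver's actual types:)

```rust
use std::mem;

// Sketch of the Zeroed pattern described above: types whose all-zero bit
// pattern is valid get a blanket `zeroed()` constructor.
unsafe trait Zeroed: Sized {
    fn zeroed() -> Self {
        // SAFETY: implementors guarantee an all-zero bit pattern is valid.
        unsafe { mem::zeroed() }
    }
}

// An unaligned u64 wrapper in the spirit of the firmware `U64` type:
// packed(1) drops the natural 8-byte alignment, so surrounding structs
// need no implicit padding and stay layout-compatible with the firmware.
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(C, packed(1))]
struct UnalignedU64(u64);

// SAFETY: a plain integer wrapper is valid when zeroed.
unsafe impl Zeroed for UnalignedU64 {}

fn main() {
    // Still 8 bytes of payload, but alignment is 1 instead of 8.
    assert_eq!(mem::size_of::<UnalignedU64>(), 8);
    assert_eq!(mem::align_of::<UnalignedU64>(), 1);
    // Braces force a copy out of the packed field (no unaligned reference).
    let z = UnalignedU64::zeroed();
    assert_eq!({ z.0 }, 0u64);
}
```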
+#[derive(Copy, Clone)]
+#[repr(C)]
+pub(crate) struct Array<const N: usize, T>([T; N]);
+
+impl<const N: usize, T> Array<N, T> {
+    pub(crate) fn new(data: [T; N]) -> Self {
+        Self(data)
+    }
+}
+
+// SAFETY: Arrays of Zeroed values can be safely Zeroed.
+unsafe impl<const N: usize, T: Zeroed> Zeroed for Array<N, T> {}
+
+impl<const N: usize, T: Zeroed> Default for Array<N, T> {
+    fn default() -> Self {
+        Zeroed::zeroed()
+    }
+}
+
+impl<const N: usize, T> Index<usize> for Array<N, T> {
+    type Output = T;
+
+    fn index(&self, index: usize) -> &Self::Output {
+        &self.0[index]
+    }
+}
+
+impl<const N: usize, T> IndexMut<usize> for Array<N, T> {
+    fn index_mut(&mut self, index: usize) -> &mut Self::Output {
+        &mut self.0[index]
+    }
+}
+
+impl<const N: usize, T> Deref for Array<N, T> {
+    type Target = [T; N];
+
+    fn deref(&self) -> &Self::Target {
+        &self.0
+    }
+}
+
+impl<const N: usize, T> DerefMut for Array<N, T> {
+    fn deref_mut(&mut self) -> &mut Self::Target {
+        &mut self.0
+    }
+}
+
+impl<const N: usize, T: Sized + fmt::Debug> fmt::Debug for Array<N, T> {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        self.0.fmt(f)
+    }
+}
+
+/// Convenience macro to define an identically-named trivial GpuStruct with no inner fields for a
+/// given raw type name.
+#[macro_export]
+macro_rules! trivial_gpustruct {
+    ($type:ident) => {
+        #[derive(Debug, Default)]
+        pub(crate) struct $type {}
+
+        impl GpuStruct for $type {
+            type Raw<'a> = raw::$type;
+        }
+    };
+}
diff --git a/drivers/gpu/drm/asahi/fw/vertex.rs b/drivers/gpu/drm/asahi/fw/vertex.rs
new file mode 100644
index 000000000000..959a0913e693
--- /dev/null
+++ b/drivers/gpu/drm/asahi/fw/vertex.rs
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+//! GPU vertex job firmware structures
+
+use super::types::*;
+use super::{event, job, workqueue};
+use crate::{buffer, fw, microseq, mmu};
+use kernel::sync::Arc;
+
+pub(crate) mod raw {
+    use super::*;
+
+    #[derive(Debug, Default, Copy, Clone)]
+    #[repr(C)]
+    pub(crate) struct TilingParameters {
+        pub(crate) rgn_size: u32,
+        pub(crate) unk_4: u32,
+        pub(crate) ppp_ctrl: u32,
+        pub(crate) x_max: u16,
+        pub(crate) y_max: u16,
+        pub(crate) te_screen: u32,
+        pub(crate) te_mtile1: u32,
+        pub(crate) te_mtile2: u32,
+        pub(crate) tiles_per_mtile: u32,
+        pub(crate) tpc_stride: u32,
+        pub(crate) unk_24: u32,
+        pub(crate) unk_28: u32,
+    }
+
+    #[versions(AGX)]
+    #[derive(Debug)]
+    #[repr(C)]
+    pub(crate) struct JobParameters1<'a> {
+        pub(crate) unk_0: U64,
+        pub(crate) unk_8: F32,
+        pub(crate) unk_c: F32,
+        pub(crate) tvb_tilemap: GpuPointer<'a, &'a [u8]>,
+        #[ver(G < G14)]
+        pub(crate) tvb_cluster_tilemaps: Option<GpuPointer<'a, &'a [u8]>>,
+        pub(crate) tpc: GpuPointer<'a, &'a [u8]>,
+        pub(crate) tvb_heapmeta: GpuPointer<'a, &'a [u8]>,
+        pub(crate) iogpu_unk_54: u32,
+        pub(crate) iogpu_unk_55: u32,
+        pub(crate) iogpu_unk_56: U64,
+        #[ver(G < G14)]
+        pub(crate) tvb_cluster_meta1: Option<GpuPointer<'a, &'a [u8]>>,
+        pub(crate) utile_config: u32,
+        pub(crate) unk_4c: u32,
+        pub(crate) ppp_multisamplectl: U64,
+        pub(crate) tvb_heapmeta_2: GpuPointer<'a, &'a [u8]>,
+        #[ver(G < G14)]
+        pub(crate) unk_60: U64,
+        #[ver(G < G14)]
+        pub(crate) core_mask: Array<2, u32>,
+        pub(crate) preempt_buf1: GpuPointer<'a, &'a [u8]>,
+        pub(crate) preempt_buf2: GpuPointer<'a, &'a [u8]>,
+        pub(crate) unk_80: U64,
+        pub(crate) preempt_buf3: GpuPointer<'a, &'a [u8]>,
+        pub(crate) encoder_addr: U64,
+        #[ver(G < G14)]
+        pub(crate) tvb_cluster_meta2: Option<GpuPointer<'a, &'a [u8]>>,
+        #[ver(G < G14)]
+        pub(crate) tvb_cluster_meta3: Option<GpuPointer<'a, &'a [u8]>>,
+        #[ver(G < G14)]
+        pub(crate) tiling_control: u32,
+        #[ver(G < G14)]
+        pub(crate) unk_ac: u32,
+        pub(crate) unk_b0: Array<6, U64>,
+        pub(crate) pipeline_base: U64,
+        #[ver(G < G14)]
+        pub(crate) tvb_cluster_meta4: Option<GpuPointer<'a, &'a [u8]>>,
+        #[ver(G < G14)]
+        pub(crate) unk_f0: U64,
+        pub(crate) unk_f8: U64,
+        pub(crate) unk_100: Array<3, U64>,
+        pub(crate) unk_118: u32,
+        #[ver(G >= G14)]
+        pub(crate) __pad: Pad<{ 8 * 9 }>,
+    }
+
+    #[derive(Debug)]
+    #[repr(C)]
+    pub(crate) struct JobParameters2<'a> {
+        pub(crate) unk_480: Array<4, u32>,
+        pub(crate) unk_498: U64,
+        pub(crate) unk_4a0: u32,
+        pub(crate) preempt_buf1: GpuPointer<'a, &'a [u8]>,
+        pub(crate) unk_4ac: u32,
+        pub(crate) unk_4b0: U64,
+        pub(crate) unk_4b8: u32,
+        pub(crate) unk_4bc: U64,
+        pub(crate) unk_4c4_padding: Array<0x48, u8>,
+        pub(crate) unk_50c: u32,
+        pub(crate) unk_510: U64,
+        pub(crate) unk_518: U64,
+        pub(crate) unk_520: U64,
+    }
+
+    #[versions(AGX)]
+    #[derive(Debug)]
+    #[repr(C)]
+    pub(crate) struct RunVertex<'a> {
+        pub(crate) tag: workqueue::CommandType,
+
+        #[ver(V >= V13_0B4)]
+        pub(crate) counter: U64,
+
+        pub(crate) vm_slot: u32,
+        pub(crate) unk_8: u32,
+        pub(crate) notifier: GpuPointer<'a, event::Notifier::ver>,
+        pub(crate) buffer_slot: u32,
+        pub(crate) unk_1c: u32,
+        pub(crate) buffer: GpuPointer<'a, fw::buffer::Info::ver>,
+        pub(crate) scene: GpuPointer<'a, fw::buffer::Scene::ver>,
+        pub(crate) unk_buffer_buf: GpuWeakPointer<[u8]>,
+        pub(crate) unk_34: u32,
+        pub(crate) job_params1: JobParameters1::ver<'a>,
+        pub(crate) unk_154: Array<0x268, u8>,
+        pub(crate) tiling_params: TilingParameters,
+        pub(crate) unk_3e8: Array<0x74, u8>,
+        pub(crate) tpc: GpuPointer<'a, &'a [u8]>,
+        pub(crate) tpc_size: U64,
+        pub(crate) microsequence: GpuPointer<'a, &'a [u8]>,
+        pub(crate) microsequence_size: u32,
+        pub(crate) fragment_stamp_slot: u32,
+        pub(crate) fragment_stamp_value: EventValue,
+        pub(crate) unk_pointee: u32,
+        pub(crate) unk_pad: u32,
+        pub(crate) job_params2: JobParameters2<'a>,
+        pub(crate) encoder_params: job::raw::EncoderParams<'a>,
+        pub(crate) unk_55c: u32,
+        pub(crate) unk_560: u32,
+        pub(crate) memoryless_rts_used: u32,
+        pub(crate) unk_568: u32,
+        pub(crate) unk_56c: u32,
+        pub(crate) meta: job::raw::JobMeta,
+        pub(crate) unk_after_meta: u32,
+        pub(crate) unk_buf_0: U64,
+        pub(crate) unk_buf_8: U64,
+        pub(crate) unk_buf_10: U64,
+        pub(crate) cur_ts: U64,
+        pub(crate) start_ts: Option<GpuPointer<'a, AtomicU64>>,
+        pub(crate) end_ts: Option<GpuPointer<'a, AtomicU64>>,
+        pub(crate) unk_5c4: u32,
+        pub(crate) unk_5c8: u32,
+        pub(crate) unk_5cc: u32,
+        pub(crate) unk_5d0: u32,
+        pub(crate) client_sequence: u8,
+        pub(crate) pad_5d5: Array<3, u8>,
+        pub(crate) unk_5d8: u32,
+        pub(crate) unk_5dc: u8,
+
+        #[ver(V >= V13_0B4)]
+        pub(crate) unk_ts: U64,
+
+        #[ver(V >= V13_0B4)]
+        pub(crate) unk_5dd_8: Array<0x1b, u8>,
+    }
+}
+
+#[versions(AGX)]
+#[derive(Debug)]
+pub(crate) struct RunVertex {
+    pub(crate) notifier: Arc<GpuObject<event::Notifier::ver>>,
+    pub(crate) scene: Arc<buffer::Scene::ver>,
+    pub(crate) micro_seq: microseq::MicroSequence,
+    pub(crate) vm_bind: mmu::VmBind,
+    pub(crate) timestamps: Arc<GpuObject<job::RenderTimestamps>>,
+}
+
+#[versions(AGX)]
+impl GpuStruct for RunVertex::ver {
+    type Raw<'a> = raw::RunVertex::ver<'a>;
+}
+
+#[versions(AGX)]
+impl workqueue::Command for RunVertex::ver {}
diff --git a/drivers/gpu/drm/asahi/fw/workqueue.rs b/drivers/gpu/drm/asahi/fw/workqueue.rs
new file mode 100644
index 000000000000..e81025b6c014
--- /dev/null
+++ b/drivers/gpu/drm/asahi/fw/workqueue.rs
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+//! GPU work queue firmware structures
+
+use super::event;
+use super::types::*;
+use crate::event::EventValue;
+use crate::{default_zeroed, trivial_gpustruct};
+use kernel::sync::Arc;
+
+#[derive(Debug)]
+#[repr(u32)]
+pub(crate) enum CommandType {
+    RunVertex = 0,
+    RunFragment = 1,
+    #[allow(dead_code)]
+    RunBlitter = 2,
+    RunCompute = 3,
+    Barrier = 4,
+    InitBuffer = 6,
+}
+
+pub(crate) trait Command: GpuStruct + Send + Sync {}
+
+pub(crate) mod raw {
+    use super::*;
+
+    #[derive(Debug)]
+    #[repr(C)]
+    pub(crate) struct Barrier {
+        pub(crate) tag: CommandType,
+        pub(crate) wait_stamp: GpuWeakPointer<FwStamp>,
+        pub(crate) wait_value: EventValue,
+        pub(crate) wait_slot: u32,
+        pub(crate) stamp_self: EventValue,
+        pub(crate) uuid: u32,
+        pub(crate) unk: u32,
+    }
+
+    #[derive(Debug, Clone, Copy)]
+    #[repr(C)]
+    pub(crate) struct GpuContextData {
+        pub(crate) unk_0: u8,
+        pub(crate) unk_1: u8,
+        unk_2: Array<0x2, u8>,
+        pub(crate) unk_4: u8,
+        pub(crate) unk_5: u8,
+        unk_6: Array<0x18, u8>,
+        pub(crate) unk_1e: u8,
+        pub(crate) unk_1f: u8,
+        unk_20: Array<0x3, u8>,
+        pub(crate) unk_23: u8,
+        unk_24: Array<0x1c, u8>,
+    }
+
+    impl Default for GpuContextData {
+        fn default() -> Self {
+            Self {
+                unk_0: 0xff,
+                unk_1: 0xff,
+                unk_2: Default::default(),
+                unk_4: 0,
+                unk_5: 1,
+                unk_6: Default::default(),
+                unk_1e: 0xff,
+                unk_1f: 0,
+                unk_20: Default::default(),
+                unk_23: 2,
+                unk_24: Default::default(),
+            }
+        }
+    }
+
+    #[derive(Debug)]
+    #[repr(C)]
+    pub(crate) struct RingState {
+        pub(crate) gpu_doneptr: AtomicU32,
+        __pad0: Pad<0xc>,
+        pub(crate) unk_10: AtomicU32,
+        __pad1: Pad<0xc>,
+        pub(crate) unk_20: AtomicU32,
+        __pad2: Pad<0xc>,
+        pub(crate) gpu_rptr: AtomicU32,
+        __pad3: Pad<0xc>,
+        pub(crate) cpu_wptr: AtomicU32,
+        __pad4: Pad<0xc>,
+        pub(crate) rb_size: u32,
+        __pad5: Pad<0xc>,
+        // This isn't part of the structure, but it's here as a
+        // debugging hack so we can inspect what ring position
+        // the driver considered complete and freeable.
+        pub(crate) cpu_freeptr: AtomicU32,
+        __pad6: Pad<0xc>,
+    }
+    default_zeroed!(RingState);
+
+    #[derive(Debug, Clone, Copy)]
+    #[repr(C)]
+    pub(crate) struct Priority(u32, u32, U64, u32, u32, u32);
+
+    pub(crate) const PRIORITY: [Priority; 4] = [
+        Priority(0, 0, U64(0xffff_ffff_ffff_0000), 1, 0, 1),
+        Priority(1, 1, U64(0xffff_ffff_0000_0000), 0, 0, 0),
+        Priority(2, 2, U64(0xffff_0000_0000_0000), 0, 0, 2),
+        Priority(3, 3, U64(0x0000_0000_0000_0000), 0, 0, 3),
+    ];
+
+    impl Default for Priority {
+        fn default() -> Priority {
+            PRIORITY[2]
+        }
+    }
+
+    #[versions(AGX)]
+    #[derive(Debug)]
+    #[repr(C)]
+    pub(crate) struct QueueInfo<'a> {
+        pub(crate) state: GpuPointer<'a, super::RingState>,
+        pub(crate) ring: GpuPointer<'a, &'a [u64]>,
+        pub(crate) notifier_list: GpuPointer<'a, event::NotifierList>,
+        pub(crate) gpu_buf: GpuPointer<'a, &'a [u8]>,
+        pub(crate) gpu_rptr1: AtomicU32,
+        pub(crate) gpu_rptr2: AtomicU32,
+        pub(crate) gpu_rptr3: AtomicU32,
+        pub(crate) event_id: AtomicI32,
+        pub(crate) priority: Priority,
+        pub(crate) unk_4c: i32,
+        pub(crate) uuid: u32,
+        pub(crate) unk_54: i32,
+        pub(crate) unk_58: U64,
+        pub(crate) busy: AtomicU32,
+        pub(crate) __pad: Pad<0x20>,
+        pub(crate) unk_84_state: AtomicU32,
+        pub(crate) unk_88: u32,
+        pub(crate) unk_8c: u32,
+        pub(crate) unk_90: u32,
+        pub(crate) unk_94: u32,
+        pub(crate) pending: AtomicU32,
+        pub(crate) unk_9c: u32,
+        #[ver(V >= V13_2)]
+        pub(crate) unk_a0_0: u32,
+        pub(crate) gpu_context: GpuPointer<'a, super::GpuContextData>,
+        pub(crate) unk_a8: U64,
+        #[ver(V >= V13_2)]
+        pub(crate) unk_b0: u32,
+    }
+}
+
+trivial_gpustruct!(Barrier);
+trivial_gpustruct!(GpuContextData);
+trivial_gpustruct!(RingState);
+
+impl Command for Barrier {}
+
+#[versions(AGX)]
+#[derive(Debug)]
+pub(crate) struct QueueInfo {
+    pub(crate) state: GpuObject<RingState>,
+    pub(crate) ring: GpuArray<u64>,
+    pub(crate) gpu_buf: GpuArray<u8>,
+    pub(crate) notifier_list: Arc<GpuObject<event::NotifierList>>,
+    pub(crate) gpu_context: Arc<crate::workqueue::GpuContext>,
+}
+
+#[versions(AGX)]
+impl GpuStruct for QueueInfo::ver {
+    type Raw<'a> = raw::QueueInfo::ver<'a>;
+}
diff --git a/drivers/gpu/drm/asahi/gem.rs b/drivers/gpu/drm/asahi/gem.rs
new file mode 100644
index 000000000000..b334bebb0e8e
--- /dev/null
+++ b/drivers/gpu/drm/asahi/gem.rs
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+//! Asahi driver GEM object implementation
+//!
+//! Basic wrappers and adaptations between generic GEM shmem objects and this driver's
+//! view of what a GPU buffer object is. It is in charge of keeping track of all mappings for
+//! each GEM object so we can remove them when a client (File) or a Vm are destroyed, as well as
+//! implementing RTKit buffers on top of GEM objects for firmware use.
+
+use kernel::{
+    bindings,
+    drm::{gem, gem::shmem},
+    error::Result,
+    prelude::*,
+    soc::apple::rtkit,
+    sync::smutex::Mutex,
+};
+
+use kernel::drm::gem::BaseObject;
+
+use core::sync::atomic::{AtomicU64, Ordering};
+
+use crate::debug::*;
+use crate::driver::AsahiDevice;
+use crate::file::DrmFile;
+
+const DEBUG_CLASS: DebugFlags = DebugFlags::Gem;
+
+/// Represents the inner data of a GEM object for this driver.
+pub(crate) struct DriverObject {
+    /// Whether this is a kernel-created object.
+    kernel: bool,
+    /// Object creation flags.
+    flags: u32,
+    /// VM ID for VM-private objects.
+    vm_id: Option<u64>,
+    /// Locked list of mapping tuples: (file_id, vm_id, mapping)
+    mappings: Mutex<Vec<(u64, u64, crate::mmu::Mapping)>>,
+    /// ID for debug
+    id: u64,
+}
+
+/// Type alias for the shmem GEM object type for this driver.
+pub(crate) type Object = shmem::Object<DriverObject>;
+
+/// Type alias for the SGTable type for this driver.
+pub(crate) type SGTable = shmem::SGTable<DriverObject>;
+
+/// A shared reference to a GEM object for this driver.
+pub(crate) struct ObjectRef {
+    /// The underlying GEM object reference
+    pub(crate) gem: gem::ObjectRef<shmem::Object<DriverObject>>,
+    /// The kernel-side VMap of this object, if needed
+    vmap: Option<shmem::VMap<DriverObject>>,
+}
+
+static GEM_ID: AtomicU64 = AtomicU64::new(0);
+
+impl DriverObject {
+    /// Drop all object mappings for a given file ID.
+    ///
+    /// Used on file close.
+    fn drop_file_mappings(&self, file_id: u64) {
+        let mut mappings = self.mappings.lock();
+        for (index, (mapped_fid, _mapped_vmid, _mapping)) in mappings.iter().enumerate() {
+            if *mapped_fid == file_id {
+                mappings.swap_remove(index);
+                return;
+            }
+        }
+    }
+
+    /// Drop all object mappings for a given VM ID.
+    ///
+    /// Used on VM destroy.
+    fn drop_vm_mappings(&self, vm_id: u64) {
+        let mut mappings = self.mappings.lock();
+        for (index, (_mapped_fid, mapped_vmid, _mapping)) in mappings.iter().enumerate() {
+            if *mapped_vmid == vm_id {
+                mappings.swap_remove(index);
+                return;
+            }
+        }
+    }
+}
+
+impl ObjectRef {
+    /// Create a new wrapper for a raw GEM object reference.
+    pub(crate) fn new(gem: gem::ObjectRef<shmem::Object<DriverObject>>) -> ObjectRef {
+        ObjectRef { gem, vmap: None }
+    }
+
+    /// Return the `VMap` for this object, creating it if necessary.
+    pub(crate) fn vmap(&mut self) -> Result<&mut shmem::VMap<DriverObject>> {
+        if self.vmap.is_none() {
+            self.vmap = Some(self.gem.vmap()?);
+        }
+        Ok(self.vmap.as_mut().unwrap())
+    }
+
+    /// Return the IOVA of this object at which it is mapped in a given `Vm` identified by its ID,
+    /// if it is mapped in that `Vm`.
+    pub(crate) fn iova(&self, vm_id: u64) -> Option<usize> {
+        let mappings = self.gem.mappings.lock();
+        for (_mapped_fid, mapped_vmid, mapping) in mappings.iter() {
+            if *mapped_vmid == vm_id {
+                return Some(mapping.iova());
+            }
+        }
+
+        None
+    }
+
+    /// Returns the size of an object in bytes
+    pub(crate) fn size(&self) -> usize {
+        self.gem.size()
+    }
+
+    /// Maps an object into a given `Vm` at any free address.
+    ///
+    /// Returns Err(EBUSY) if there is already a mapping.
+    pub(crate) fn map_into(&mut self, vm: &crate::mmu::Vm) -> Result<usize> {
+        let vm_id = vm.id();
+
+        if self.gem.vm_id.is_some() && self.gem.vm_id != Some(vm_id) {
+            return Err(EINVAL);
+        }
+
+        let mut mappings = self.gem.mappings.lock();
+        for (_mapped_fid, mapped_vmid, _mapping) in mappings.iter() {
+            if *mapped_vmid == vm_id {
+                return Err(EBUSY);
+            }
+        }
+
+        let sgt = self.gem.sg_table()?;
+        let new_mapping = vm.map(self.gem.size(), sgt)?;
+
+        let iova = new_mapping.iova();
+        mappings.try_push((vm.file_id(), vm_id, new_mapping))?;
+        Ok(iova)
+    }
+
+    /// Maps an object into a given `Vm` at any free address within a given range.
+    ///
+    /// Returns Err(EBUSY) if there is already a mapping.
+    pub(crate) fn map_into_range(
+        &mut self,
+        vm: &crate::mmu::Vm,
+        start: u64,
+        end: u64,
+        alignment: u64,
+        prot: u32,
+        guard: bool,
+    ) -> Result<usize> {
+        let vm_id = vm.id();
+
+        if self.gem.vm_id.is_some() && self.gem.vm_id != Some(vm_id) {
+            return Err(EINVAL);
+        }
+
+        let mut mappings = self.gem.mappings.lock();
+        for (_mapped_fid, mapped_vmid, _mapping) in mappings.iter() {
+            if *mapped_vmid == vm_id {
+                return Err(EBUSY);
+            }
+        }
+
+        let sgt = self.gem.sg_table()?;
+        let new_mapping =
+            vm.map_in_range(self.gem.size(), sgt, alignment, start, end, prot, guard)?;
+
+        let iova = new_mapping.iova();
+        mappings.try_push((vm.file_id(), vm_id, new_mapping))?;
+        Ok(iova)
+    }
+
+    /// Maps an object into a given `Vm` at a specific address.
+    ///
+    /// Returns Err(EBUSY) if there is already a mapping.
+    /// Returns Err(ENOSPC) if the requested address is already busy.
+    pub(crate) fn map_at(
+        &mut self,
+        vm: &crate::mmu::Vm,
+        addr: u64,
+        prot: u32,
+        guard: bool,
+    ) -> Result {
+        let vm_id = vm.id();
+
+        if self.gem.vm_id.is_some() && self.gem.vm_id != Some(vm_id) {
+            return Err(EINVAL);
+        }
+
+        let mut mappings = self.gem.mappings.lock();
+        for (_mapped_fid, mapped_vmid, _mapping) in mappings.iter() {
+            if *mapped_vmid == vm_id {
+                return Err(EBUSY);
+            }
+        }
+
+        let sgt = self.gem.sg_table()?;
+        let new_mapping = vm.map_at(addr, self.gem.size(), sgt, prot, guard)?;
+
+        let iova = new_mapping.iova();
+        assert!(iova == addr as usize);
+        mappings.try_push((vm.file_id(), vm_id, new_mapping))?;
+        Ok(())
+    }
+
+    /// Drop all mappings for this object owned by a given `Vm` identified by its ID.
+    pub(crate) fn drop_vm_mappings(&mut self, vm_id: u64) {
+        self.gem.drop_vm_mappings(vm_id);
+    }
+
+    /// Drop all mappings for this object owned by a given `File` identified by its ID.
+    pub(crate) fn drop_file_mappings(&mut self, file_id: u64) {
+        self.gem.drop_file_mappings(file_id);
+    }
+}
+
+/// Create a new kernel-owned GEM object.
+pub(crate) fn new_kernel_object(dev: &AsahiDevice, size: usize) -> Result<ObjectRef> {
+    let mut gem = shmem::Object::<DriverObject>::new(dev, size)?;
+    gem.kernel = true;
+    gem.flags = 0;
+
+    gem.set_exportable(false);
+
+    mod_pr_debug!("DriverObject new kernel object id={}\n", gem.id);
+    Ok(ObjectRef::new(gem.into_ref()))
+}
+
+/// Create a new user-owned GEM object with the given flags.
+pub(crate) fn new_object(
+    dev: &AsahiDevice,
+    size: usize,
+    flags: u32,
+    vm_id: Option<u64>,
+) -> Result<ObjectRef> {
+    let mut gem = shmem::Object::<DriverObject>::new(dev, size)?;
+    gem.kernel = false;
+    gem.flags = flags;
+    gem.vm_id = vm_id;
+
+    gem.set_exportable(vm_id.is_none());
+    gem.set_wc(flags & bindings::ASAHI_GEM_WRITEBACK == 0);
+
+    mod_pr_debug!(
+        "DriverObject new user object: vm_id={:?} id={}\n",
+        vm_id,
+        gem.id
+    );
+    Ok(ObjectRef::new(gem.into_ref()))
+}
+
+/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
+pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
+    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
+}
+
+impl gem::BaseDriverObject<Object> for DriverObject {
+    /// Callback to create the inner data of a GEM object
+    fn new(_dev: &AsahiDevice, _size: usize) -> Result<DriverObject> {
+        let id = GEM_ID.fetch_add(1, Ordering::Relaxed);
+        mod_pr_debug!("DriverObject::new id={}\n", id);
+        Ok(DriverObject {
+            kernel: false,
+            flags: 0,
+            vm_id: None,
+            mappings: Mutex::new(Vec::new()),
+            id,
+        })
+    }
+
+    /// Callback to drop all mappings for a GEM object owned by a given `File`
+    fn close(obj: &Object, file: &DrmFile) {
+        mod_pr_debug!("DriverObject::close vm_id={:?} id={}\n", obj.vm_id, obj.id);
+        obj.drop_file_mappings(file.file_id());
+    }
+}
+
+impl Drop for DriverObject {
+    fn drop(&mut self) {
+        mod_pr_debug!("DriverObject::drop vm_id={:?} id={}\n", self.vm_id, self.id);
+    }
+}
+
+impl shmem::DriverObject for DriverObject {
+    type Driver = crate::driver::AsahiDriver;
+}
+
+impl rtkit::Buffer for ObjectRef {
+    fn iova(&self) -> Result<usize> {
+        self.iova(0).ok_or(EIO)
+    }
+    fn buf(&mut self) -> Result<&mut [u8]> {
+        let vmap = self.vmap.as_mut().ok_or(ENOMEM)?;
+        Ok(vmap.as_mut_slice())
+    }
+}
diff --git a/drivers/gpu/drm/asahi/gpu.rs b/drivers/gpu/drm/asahi/gpu.rs
new file mode 100644
index 000000000000..1384b52d11a2
--- /dev/null
+++ b/drivers/gpu/drm/asahi/gpu.rs
@@ -0,0 +1,1088 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+//! Top-level GPU manager
+//!
+//! This module is the root of all GPU firmware management for a given driver instance. It is
+//! responsible for initialization, owning the top-level managers (events, UAT, etc.), and
+//! communicating with the raw RtKit endpoints to send and receive messages to/from the GPU
+//! firmware.
+//!
+//! It is also the point where diverging driver firmware/GPU variants (using the versions macro)
+//! are unified, so that the top level of the driver itself (in `driver`) does not have to concern
+//! itself with version dependence.
+
+use core::any::Any;
+use core::sync::atomic::{AtomicBool, AtomicU64, Ordering};
+use core::time::Duration;
+
+use kernel::{
+    delay::coarse_sleep,
+    error::code::*,
+    macros::versions,
+    prelude::*,
+    soc::apple::rtkit,
+    sync::{smutex::Mutex, Arc, Guard, UniqueArc},
+    time,
+    types::ForeignOwnable,
+};
+
+use crate::alloc::Allocator;
+use crate::box_in_place;
+use crate::debug::*;
+use crate::driver::AsahiDevice;
+use crate::fw::channels::PipeType;
+use crate::fw::types::U64;
+use crate::{
+    alloc, buffer, channel, event, fw, gem, hw, initdata, mem, mmu, queue, regs, workqueue,
+};
+
+const DEBUG_CLASS: DebugFlags = DebugFlags::Gpu;
+
+/// Firmware endpoint for init & incoming notifications.
+const EP_FIRMWARE: u8 = 0x20;
+
+/// Doorbell endpoint for work/message submissions.
+const EP_DOORBELL: u8 = 0x21;
+
+/// Initialize the GPU firmware.
+const MSG_INIT: u64 = 0x81 << 48;
+const INIT_DATA_MASK: u64 = (1 << 44) - 1;
+
+/// TX channel doorbell.
+const MSG_TX_DOORBELL: u64 = 0x83 << 48;
+/// Firmware control channel doorbell.
+const MSG_FWCTL: u64 = 0x84 << 48;
+// /// Halt the firmware (?).
+// const MSG_HALT: u64 = 0x85 << 48;
+
+/// Receive channel doorbell notification.
+const MSG_RX_DOORBELL: u64 = 0x42 << 48;
+
+/// Doorbell number for firmware kicks/wakeups.
+const DOORBELL_KICKFW: u64 = 0x10;
+/// Doorbell number for device control channel kicks.
+const DOORBELL_DEVCTRL: u64 = 0x11;
+
+// Upper kernel half VA address ranges.
+/// Private (cached) firmware structure VA range base.
+const IOVA_KERN_PRIV_BASE: u64 = 0xffffffa000000000;
+/// Private (cached) firmware structure VA range top.
+const IOVA_KERN_PRIV_TOP: u64 = 0xffffffa7ffffffff;
+/// Shared (uncached) firmware structure VA range base.
+const IOVA_KERN_SHARED_BASE: u64 = 0xffffffa800000000;
+/// Shared (uncached) firmware structure VA range top.
+const IOVA_KERN_SHARED_TOP: u64 = 0xffffffa9ffffffff;
+/// Shared (uncached) read-only firmware structure VA range base.
+const IOVA_KERN_SHARED_RO_BASE: u64 = 0xffffffaa00000000;
+/// Shared (uncached) read-only firmware structure VA range top.
+const IOVA_KERN_SHARED_RO_TOP: u64 = 0xffffffabffffffff;
+/// GPU/FW shared structure VA range base.
+const IOVA_KERN_GPU_BASE: u64 = 0xffffffaf00000000;
+/// GPU/FW shared structure VA range top.
+const IOVA_KERN_GPU_TOP: u64 = 0xffffffafffffffff;
+
+/// Timeout for entering the halt state after a fault or request.
+const HALT_ENTER_TIMEOUT_MS: u64 = 100;
+
+/// Global allocators used for kernel-half structures.
+pub(crate) struct KernelAllocators {
+    pub(crate) private: alloc::DefaultAllocator,
+    pub(crate) shared: alloc::DefaultAllocator,
+    pub(crate) shared_ro: alloc::DefaultAllocator,
+    pub(crate) gpu: alloc::DefaultAllocator,
+}
+
+/// Receive (GPU->driver) ring buffer channels.
+#[versions(AGX)]
+struct RxChannels {
+    event: channel::EventChannel,
+    fw_log: channel::FwLogChannel,
+    ktrace: channel::KTraceChannel,
+    stats: channel::StatsChannel::ver,
+}
+
+/// GPU work submission pipe channels (driver->GPU).
+#[versions(AGX)]
+struct PipeChannels {
+    pub(crate) vtx: Vec<Mutex<channel::PipeChannel::ver>>,
+    pub(crate) frag: Vec<Mutex<channel::PipeChannel::ver>>,
+    pub(crate) comp: Vec<Mutex<channel::PipeChannel::ver>>,
+}
+
+/// Misc command transmit (driver->GPU) channels.
+#[versions(AGX)]
+struct TxChannels {
+    pub(crate) device_control: channel::DeviceControlChannel::ver,
+}
+
+/// Number of work submission pipes per type, one for each priority level.
+const NUM_PIPES: usize = 4;
+
+/// A generic monotonically incrementing ID used to uniquely identify object instances within the
+/// driver.
+pub(crate) struct ID(AtomicU64);
+
+impl ID {
+    /// Create a new ID counter with a given value.
+    fn new(val: u64) -> ID {
+        ID(AtomicU64::new(val))
+    }
+
+    /// Fetch the next unique ID.
+    pub(crate) fn next(&self) -> u64 {
+        self.0.fetch_add(1, Ordering::Relaxed)
+    }
+}
+
+impl Default for ID {
+    /// IDs default to starting at 2, as 0/1 are considered reserved for the system.
+    fn default() -> Self {
+        Self::new(2)
+    }
+}
+
+/// A guard representing one active submission on the GPU. When dropped, decrements the active
+/// submission count.
+pub(crate) struct OpGuard(Arc<dyn GpuManagerPriv>);
+
+impl Drop for OpGuard {
+    fn drop(&mut self) {
+        self.0.end_op();
+    }
+}
+
+/// Set of global sequence IDs used in the driver.
+#[derive(Default)]
+pub(crate) struct SequenceIDs {
+    /// `File` instance ID.
+    pub(crate) file: ID,
+    /// `Vm` instance ID.
+    pub(crate) vm: ID,
+    /// Submission instance ID.
+    pub(crate) submission: ID,
+    /// `Queue` instance ID.
+    pub(crate) queue: ID,
+}
+
+/// Top-level GPU manager that owns all the global state relevant to the driver instance.
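(For readers less familiar with the `OpGuard` pattern above: it is plain RAII over an `Arc`, so the in-flight counter is decremented exactly once no matter where the guard goes out of scope. A simplified userspace sketch with illustrative names; the real guard holds an `Arc<dyn GpuManagerPriv>` instead:)

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Simplified stand-in for the driver's GpuManager: just the op counter.
struct Manager {
    ops: AtomicU64,
}

// RAII guard: holds a reference to the manager and decrements the
// in-flight count in Drop, mirroring OpGuard/end_op() in the patch.
struct OpGuard(Arc<Manager>);

/// Start an operation: bump the counter and hand out a guard (illustrative).
fn begin_op(mgr: &Arc<Manager>) -> OpGuard {
    mgr.ops.fetch_add(1, Ordering::SeqCst);
    OpGuard(Arc::clone(mgr))
}

impl Drop for OpGuard {
    fn drop(&mut self) {
        self.0.ops.fetch_sub(1, Ordering::SeqCst);
    }
}
```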
+#[versions(AGX)]
+pub(crate) struct GpuManager {
+    dev: AsahiDevice,
+    cfg: &'static hw::HwConfig,
+    dyncfg: Box<hw::DynConfig>,
+    pub(crate) initdata: Box<fw::types::GpuObject<fw::initdata::InitData::ver>>,
+    uat: Box<mmu::Uat>,
+    crashed: AtomicBool,
+    alloc: Mutex<KernelAllocators>,
+    io_mappings: Vec<mmu::Mapping>,
+    rtkit: Mutex<Option<Box<rtkit::RtKit<GpuManager::ver>>>>,
+    rx_channels: Mutex<Box<RxChannels::ver>>,
+    tx_channels: Mutex<Box<TxChannels::ver>>,
+    fwctl_channel: Mutex<Box<channel::FwCtlChannel>>,
+    pipes: PipeChannels::ver,
+    event_manager: Arc<event::EventManager>,
+    buffer_mgr: buffer::BufferManager,
+    ids: SequenceIDs,
+}
+
+/// Trait used to abstract the firmware/GPU-dependent variants of the GpuManager.
+pub(crate) trait GpuManager: Send + Sync {
+    /// Cast as an Any type.
+    fn as_any(&self) -> &dyn Any;
+    /// Cast Arc<Self> as an Any type.
+    fn arc_as_any(self: Arc<Self>) -> Arc<dyn Any + Sync + Send>;
+    /// Initialize the GPU.
+    fn init(&self) -> Result;
+    /// Update the GPU globals from global info
+    ///
+    /// TODO: Unclear what can and cannot be updated like this.
+    fn update_globals(&self);
+    /// Get a reference to the KernelAllocators.
+    fn alloc(&self) -> Guard<'_, Mutex<KernelAllocators>>;
+    /// Create a new `Vm` given a unique `File` ID.
+    fn new_vm(&self, file_id: u64) -> Result<mmu::Vm>;
+    /// Bind a `Vm` to an available slot and return the `VmBind`.
+    fn bind_vm(&self, vm: &mmu::Vm) -> Result<mmu::VmBind>;
+    /// Create a new user command queue.
+    fn new_queue(
+        &self,
+        vm: mmu::Vm,
+        ualloc: Arc<Mutex<alloc::DefaultAllocator>>,
+        ualloc_priv: Arc<Mutex<alloc::DefaultAllocator>>,
+        priority: u32,
+        caps: u32,
+    ) -> Result<Box<dyn queue::Queue>>;
+    /// Return a reference to the global `SequenceIDs` instance.
+    fn ids(&self) -> &SequenceIDs;
+    /// Kick the firmware (wake it up if asleep).
+    ///
+    /// This can be useful to reduce latency on work submission, so we can ask the firmware to
+    /// wake up while we do some preparatory work for the work submission.
+    fn kick_firmware(&self) -> Result;
+    /// Invalidate a GPU scheduler context. Must be called before the relevant structures are freed.
+    fn invalidate_context(
+        &self,
+        context: &fw::types::GpuObject<fw::workqueue::GpuContextData>,
+    ) -> Result;
+    /// Flush the entire firmware cache.
+    ///
+    /// TODO: Does this actually work?
+    fn flush_fw_cache(&self) -> Result;
+    /// Handle a GPU work timeout event.
+    fn handle_timeout(&self, counter: u32, event_slot: u32);
+    /// Handle a GPU fault event.
+    fn handle_fault(&self);
+    /// Wait for the GPU to become idle and power off.
+    fn wait_for_poweroff(&self, timeout: usize) -> Result;
+    /// Send a firmware control command (secure cache flush).
+    fn fwctl(&self, msg: fw::channels::FwCtlMsg) -> Result;
+    /// Get the static GPU configuration for this SoC.
+    fn get_cfg(&self) -> &'static hw::HwConfig;
+    /// Get the dynamic GPU configuration for this SoC.
+    fn get_dyncfg(&self) -> &hw::DynConfig;
+}
+
+/// Private generic trait for functions that don't need to escape this module.
+trait GpuManagerPriv {
+    /// Decrement the pending submission counter.
+    fn end_op(&self);
+}
+
+#[versions(AGX)]
+#[vtable]
+impl rtkit::Operations for GpuManager::ver {
+    type Data = Arc<GpuManager::ver>;
+    type Buffer = gem::ObjectRef;
+
+    fn recv_message(data: <Self::Data as ForeignOwnable>::Borrowed<'_>, ep: u8, msg: u64) {
+        let dev = &data.dev;
+        //dev_info!(dev, "RtKit message: {:#x}:{:#x}\n", ep, msg);
+
+        if ep != EP_FIRMWARE || msg != MSG_RX_DOORBELL {
+            dev_err!(dev, "Unknown message: {:#x}:{:#x}\n", ep, msg);
+            return;
+        }
+
+        let mut ch = data.rx_channels.lock();
+
+        ch.fw_log.poll();
+        ch.ktrace.poll();
+        ch.stats.poll();
+        ch.event.poll();
+    }
+
+    fn crashed(data: <Self::Data as ForeignOwnable>::Borrowed<'_>) {
+        let dev = &data.dev;
+        dev_err!(dev, "GPU firmware crashed, failing all jobs\n");
+
+        data.crashed.store(true, Ordering::Relaxed);
+        data.event_manager.fail_all(workqueue::WorkError::NoDevice);
+    }
+
+    fn shmem_alloc(
+        data: <Self::Data as ForeignOwnable>::Borrowed<'_>,
+        size: usize,
+    ) -> Result<Self::Buffer> {
+        let dev = &data.dev;
+        mod_dev_dbg!(dev, "shmem_alloc() {:#x} bytes\n", size);
+
+        let mut obj = gem::new_kernel_object(dev, size)?;
+        obj.vmap()?;
+        let iova = obj.map_into(data.uat.kernel_vm())?;
+        mod_dev_dbg!(dev, "shmem_alloc() -> VA {:#x}\n", iova);
+        Ok(obj)
+    }
+}
+
+#[versions(AGX)]
+impl GpuManager::ver {
+    /// Create a new GpuManager of this version/GPU combination.
+    #[inline(never)]
+    pub(crate) fn new(
+        dev: &AsahiDevice,
+        res: &regs::Resources,
+        cfg: &'static hw::HwConfig,
+    ) -> Result<Arc<GpuManager::ver>> {
+        let uat = Self::make_uat(dev, cfg)?;
+        let dyncfg = Self::make_dyncfg(dev, res, cfg, &uat)?;
+
+        let mut alloc = KernelAllocators {
+            private: alloc::DefaultAllocator::new(
+                dev,
+                uat.kernel_vm(),
+                IOVA_KERN_PRIV_BASE,
+                IOVA_KERN_PRIV_TOP,
+                0x80,
+                mmu::PROT_FW_PRIV_RW,
+                1024 * 1024,
+                true,
+                fmt!("Kernel Private"),
+                true,
+            )?,
+            shared: alloc::DefaultAllocator::new(
+                dev,
+                uat.kernel_vm(),
+                IOVA_KERN_SHARED_BASE,
+                IOVA_KERN_SHARED_TOP,
+                0x80,
+                mmu::PROT_FW_SHARED_RW,
+                1024 * 1024,
+                true,
+                fmt!("Kernel Shared"),
+                false,
+            )?,
+            shared_ro: alloc::DefaultAllocator::new(
+                dev,
+                uat.kernel_vm(),
+                IOVA_KERN_SHARED_RO_BASE,
+                IOVA_KERN_SHARED_RO_TOP,
+                0x80,
+                mmu::PROT_FW_SHARED_RO,
+                64 * 1024,
+                true,
+                fmt!("Kernel RO Shared"),
+                false,
+            )?,
+            gpu: alloc::DefaultAllocator::new(
+                dev,
+                uat.kernel_vm(),
+                IOVA_KERN_GPU_BASE,
+                IOVA_KERN_GPU_TOP,
+                0x80,
+                mmu::PROT_GPU_FW_SHARED_RW,
+                64 * 1024,
+                true,
+                fmt!("Kernel GPU Shared"),
+                false,
+            )?,
+        };
+
+        let event_manager = Self::make_event_manager(&mut alloc)?;
+        let initdata = Self::make_initdata(cfg, &dyncfg, &mut alloc)?;
+        let mut mgr = Self::make_mgr(dev, cfg, dyncfg, uat, alloc, event_manager, initdata)?;
+
+        {
+            let fwctl = mgr.fwctl_channel.lock();
+            let p_fwctl = fwctl.to_raw();
+            core::mem::drop(fwctl);
+
+            mgr.initdata.fw_status.with_mut(|raw, _inner| {
+                raw.fwctl_channel = p_fwctl;
+            });
+        }
+
+        {
+            let txc = mgr.tx_channels.lock();
+            let p_device_control = txc.device_control.to_raw();
+            core::mem::drop(txc);
+
+            let rxc = mgr.rx_channels.lock();
+            let p_event = rxc.event.to_raw();
+            let p_fw_log = rxc.fw_log.to_raw();
+            let p_ktrace = rxc.ktrace.to_raw();
+            let p_stats = rxc.stats.to_raw();
+            let p_fwlog_buf = rxc.fw_log.get_buf();
+            core::mem::drop(rxc);
+
+            mgr.initdata.runtime_pointers.with_mut(|raw, _inner| {
+                raw.device_control = p_device_control;
+                raw.event = p_event;
+                raw.fw_log = p_fw_log;
+                raw.ktrace = p_ktrace;
+                raw.stats = p_stats;
+                raw.fwlog_buf = Some(p_fwlog_buf);
+            });
+        }
+
+        let mut p_pipes: Vec<fw::initdata::raw::PipeChannels::ver> = Vec::new();
+
+        for ((v, f), c) in mgr
+            .pipes
+            .vtx
+            .iter()
+            .zip(&mgr.pipes.frag)
+            .zip(&mgr.pipes.comp)
+        {
+            p_pipes.try_push(fw::initdata::raw::PipeChannels::ver {
+                vtx: v.lock().to_raw(),
+                frag: f.lock().to_raw(),
+                comp: c.lock().to_raw(),
+            })?;
+        }
+
+        mgr.initdata.runtime_pointers.with_mut(|raw, _inner| {
+            for (i, p) in p_pipes.into_iter().enumerate() {
+                raw.pipes[i].vtx = p.vtx;
+                raw.pipes[i].frag = p.frag;
+                raw.pipes[i].comp = p.comp;
+            }
+        });
+
+        for (i, map) in cfg.io_mappings.iter().enumerate() {
+            if let Some(map) = map.as_ref() {
+                mgr.iomap(i, map)?;
+            }
+        }
+
+        let mgr = Arc::from(mgr);
+
+        let rtkit = Box::try_new(rtkit::RtKit::<GpuManager::ver>::new(
+            dev,
+            None,
+            0,
+            mgr.clone(),
+        )?)?;
+
+        *mgr.rtkit.lock() = Some(rtkit);
+
+        {
+            let mut rxc = mgr.rx_channels.lock();
+            rxc.event.set_manager(mgr.clone());
+        }
+
+        Ok(mgr)
+    }
+
+    /// Build the entire GPU InitData structure tree and return it as a boxed GpuObject.
+    fn make_initdata(
+        cfg: &'static hw::HwConfig,
+        dyncfg: &hw::DynConfig,
+        alloc: &mut KernelAllocators,
+    ) -> Result<Box<fw::types::GpuObject<fw::initdata::InitData::ver>>> {
+        let mut builder = initdata::InitDataBuilder::ver::new(alloc, cfg, dyncfg);
+        builder.build()
+    }
+
+    /// Create a fresh boxed Uat instance.
+    ///
+    /// Force disable inlining to avoid blowing up the stack.
+    #[inline(never)]
+    fn make_uat(dev: &AsahiDevice, cfg: &'static hw::HwConfig) -> Result<Box<mmu::Uat>> {
+        Ok(Box::try_new(mmu::Uat::new(dev, cfg)?)?)
+    }
+
+    /// Actually create the final GpuManager instance, as a UniqueArc.
+    ///
+    /// Force disable inlining to avoid blowing up the stack.
+    #[inline(never)]
+    fn make_mgr(
+        dev: &AsahiDevice,
+        cfg: &'static hw::HwConfig,
+        dyncfg: Box<hw::DynConfig>,
+        uat: Box<mmu::Uat>,
+        mut alloc: KernelAllocators,
+        event_manager: Arc<event::EventManager>,
+        initdata: Box<fw::types::GpuObject<fw::initdata::InitData::ver>>,
+    ) -> Result<UniqueArc<GpuManager::ver>> {
+        let mut pipes = PipeChannels::ver {
+            vtx: Vec::new(),
+            frag: Vec::new(),
+            comp: Vec::new(),
+        };
+
+        for _i in 0..=NUM_PIPES - 1 {
+            pipes
+                .vtx
+                .try_push(Mutex::new(channel::PipeChannel::ver::new(dev, &mut alloc)?))?;
+            pipes
+                .frag
+                .try_push(Mutex::new(channel::PipeChannel::ver::new(dev, &mut alloc)?))?;
+            pipes
+                .comp
+                .try_push(Mutex::new(channel::PipeChannel::ver::new(dev, &mut alloc)?))?;
+        }
+
+        UniqueArc::try_new(GpuManager::ver {
+            dev: dev.clone(),
+            cfg,
+            dyncfg,
+            initdata,
+            uat,
+            io_mappings: Vec::new(),
+            rtkit: Mutex::new(None),
+            crashed: AtomicBool::new(false),
+            rx_channels: Mutex::new(box_in_place!(RxChannels::ver {
+                event: channel::EventChannel::new(dev, &mut alloc, event_manager.clone())?,
+                fw_log: channel::FwLogChannel::new(dev, &mut alloc)?,
+                ktrace: channel::KTraceChannel::new(dev, &mut alloc)?,
+                stats: channel::StatsChannel::ver::new(dev, &mut alloc)?,
+            })?),
+            tx_channels: Mutex::new(Box::try_new(TxChannels::ver {
+                device_control: channel::DeviceControlChannel::ver::new(dev, &mut alloc)?,
+            })?),
+            fwctl_channel: Mutex::new(Box::try_new(channel::FwCtlChannel::new(dev, &mut alloc)?)?),
+            pipes,
+            event_manager,
+            buffer_mgr: buffer::BufferManager::new()?,
+            alloc: Mutex::new(alloc),
+            ids: Default::default(),
+        })
+    }
+
+    /// Fetch and validate the GPU dynamic configuration from the device tree and hardware.
+    ///
+    /// Force disable inlining to avoid blowing up the stack.
+    #[inline(never)]
+    fn make_dyncfg(
+        dev: &AsahiDevice,
+        res: &regs::Resources,
+        cfg: &'static hw::HwConfig,
+        uat: &mmu::Uat,
+    ) -> Result<Box<hw::DynConfig>> {
+        let gpu_id = res.get_gpu_id()?;
+
+        dev_info!(dev, "GPU Information:\n");
+        dev_info!(
+            dev,
+            "  Type: {:?}{:?}\n",
+            gpu_id.gpu_gen,
+            gpu_id.gpu_variant
+        );
+        dev_info!(dev, "  Max dies: {}\n", gpu_id.max_dies);
+        dev_info!(dev, "  Clusters: {}\n", gpu_id.num_clusters);
+        dev_info!(
+            dev,
+            "  Cores: {} ({})\n",
+            gpu_id.num_cores,
+            gpu_id.num_cores * gpu_id.num_clusters
+        );
+        dev_info!(
+            dev,
+            "  Frags: {} ({})\n",
+            gpu_id.num_frags,
+            gpu_id.num_frags * gpu_id.num_clusters
+        );
+        dev_info!(
+            dev,
+            "  GPs: {} ({})\n",
+            gpu_id.num_gps,
+            gpu_id.num_gps * gpu_id.num_clusters
+        );
+        dev_info!(dev, "  Core masks: {:#x?}\n", gpu_id.core_masks);
+        dev_info!(dev, "  Active cores: {}\n", gpu_id.total_active_cores);
+
+        dev_info!(dev, "Getting configuration from device tree...\n");
+        let pwr_cfg = hw::PwrConfig::load(dev, cfg)?;
+        dev_info!(dev, "Dynamic configuration fetched\n");
+
+        if gpu_id.gpu_gen != cfg.gpu_gen || gpu_id.gpu_variant != cfg.gpu_variant {
+            dev_err!(
+                dev,
+                "GPU type mismatch (expected {:?}{:?}, found {:?}{:?})\n",
+                cfg.gpu_gen,
+                cfg.gpu_variant,
+                gpu_id.gpu_gen,
+                gpu_id.gpu_variant
+            );
+            return Err(EIO);
+        }
+        if gpu_id.num_clusters > cfg.max_num_clusters {
+            dev_err!(
+                dev,
+                "Too many clusters ({} > {})\n",
+                gpu_id.num_clusters,
+                cfg.max_num_clusters
+            );
+            return Err(EIO);
+        }
+        if gpu_id.num_cores > cfg.max_num_cores {
+            dev_err!(
+                dev,
+                "Too many cores ({} > {})\n",
+                gpu_id.num_cores,
+                cfg.max_num_cores
+            );
+            return Err(EIO);
+        }
+        if gpu_id.num_frags > cfg.max_num_frags {
+            dev_err!(
+                dev,
+                "Too many frags ({} > {})\n",
+                gpu_id.num_frags,
+                cfg.max_num_frags
+            );
+            return Err(EIO);
+        }
+        if gpu_id.num_gps > cfg.max_num_gps {
+            dev_err!(
+                dev,
+                "Too many GPs ({} > {})\n",
+                gpu_id.num_gps,
+                cfg.max_num_gps
+            );
+            return Err(EIO);
+        }
+
+        Ok(Box::try_new(hw::DynConfig {
+            pwr: pwr_cfg,
+            uat_ttb_base: uat.ttb_base(),
+            id: gpu_id,
+        })?)
+    }
+
+    /// Create the global GPU event manager, and return an `Arc<>` to it.
+    fn make_event_manager(alloc: &mut KernelAllocators) -> Result<Arc<event::EventManager>> {
+        Arc::try_new(event::EventManager::new(alloc)?)
+    }
+
+    /// Create a new MMIO mapping and add it to the mappings list in initdata at the specified
+    /// index.
+    fn iomap(&mut self, index: usize, map: &hw::IOMapping) -> Result {
+        let off = map.base & mmu::UAT_PGMSK;
+        let base = map.base - off;
+        let end = (map.base + map.size + mmu::UAT_PGMSK) & !mmu::UAT_PGMSK;
+        let mapping = self
+            .uat
+            .kernel_vm()
+            .map_io(base, end - base, map.writable)?;
+
+        self.initdata.runtime_pointers.hwdata_b.with_mut(|raw, _| {
+            raw.io_mappings[index] = fw::initdata::raw::IOMapping {
+                phys_addr: U64(map.base as u64),
+                virt_addr: U64((mapping.iova() + off) as u64),
+                size: map.size as u32,
+                range_size: map.range_size as u32,
+                readwrite: U64(map.writable as u64),
+            };
+        });
+
+        self.io_mappings.try_push(mapping)?;
+        Ok(())
+    }
+
+    /// Mark work associated with currently in-progress event slots as failed, after a fault or
+    /// timeout.
+    fn mark_pending_events(&self, culprit_slot: Option<u32>, error: workqueue::WorkError) {
+        dev_err!(self.dev, "  Pending events:\n");
+
+        self.initdata.globals.with(|raw, _inner| {
+            for i in raw.pending_stamps.iter() {
+                let info = i.info.load(Ordering::Relaxed);
+                let wait_value = i.wait_value.load(Ordering::Relaxed);
+
+                if info & 1 != 0 {
+                    let slot = info >> 3;
+                    let flags = info & 0x7;
+                    dev_err!(
+                        self.dev,
+                        "    [{}] flags={} value={:#x}\n",
+                        slot,
+                        flags,
+                        wait_value
+                    );
+                    let error = if culprit_slot.is_some() && culprit_slot != Some(slot) {
+                        workqueue::WorkError::Killed
+                    } else {
+                        error
+                    };
+                    self.event_manager.mark_error(slot, wait_value, error);
+                    i.info.store(0, Ordering::Relaxed);
+                    i.wait_value.store(0, Ordering::Relaxed);
+                }
+            }
+        });
+    }
+
+    /// Fetch the GPU MMU fault information from the hardware registers.
+    fn get_fault_info(&self) -> Option<regs::FaultInfo> {
+        let data = self.dev.data();
+
+        let res = match data.resources() {
+            Some(res) => res,
+            None => {
+                dev_err!(self.dev, "  Failed to acquire resources\n");
+                return None;
+            }
+        };
+
+        let info = res.get_fault_info();
+        if info.is_some() {
+            dev_err!(self.dev, "  Fault info: {:#x?}\n", info.as_ref().unwrap());
+        }
+        info
+    }
+
+    /// Resume the GPU firmware after it halts (due to a timeout, fault, or request).
+ fn recover(&self) { + self.initdata.fw_status.with(|raw, _inner| { + let halt_count = raw.flags.halt_count.load(Ordering::Relaxed); + let mut halted = raw.flags.halted.load(Ordering::Relaxed); + dev_err!(self.dev, " Halt count: {}\n", halt_count); + dev_err!(self.dev, " Halted: {}\n", halted); + + if halted == 0 { + let timeout = time::ktime_get() + Duration::from_millis(HALT_ENTER_TIMEOUT_MS); + while time::ktime_get() < timeout { + halted = raw.flags.halted.load(Ordering::Relaxed); + if halted != 0 { + break; + } + mem::sync(); + } + halted = raw.flags.halted.load(Ordering::Relaxed); + } + + if debug_enabled(DebugFlags::NoGpuRecovery) { + dev_crit!(self.dev, " GPU recovery is disabled, wedging forever!\n"); + } else if halted != 0 { + dev_err!(self.dev, " Attempting recovery...\n"); + raw.flags.halted.store(0, Ordering::SeqCst); + raw.flags.resume.store(1, Ordering::SeqCst); + } else { + dev_err!(self.dev, " Cannot recover.\n"); + } + }); + } + + /// Return the packed GPU enabled core masks. + // Only used for some versions + #[allow(dead_code)] + pub(crate) fn core_masks_packed(&self) -> &[u32] { + self.dyncfg.id.core_masks_packed.as_slice() + } + + /// Kick a submission pipe for a submitted job to tell the firmware to start processing it. 
+    pub(crate) fn run_job(&self, job: workqueue::JobSubmission::ver<'_>) -> Result {
+        mod_dev_dbg!(self.dev, "GPU: run_job\n");
+
+        let pipe_type = job.pipe_type();
+        mod_dev_dbg!(self.dev, "GPU: run_job: pipe_type={:?}\n", pipe_type);
+
+        let pipes = match pipe_type {
+            PipeType::Vertex => &self.pipes.vtx,
+            PipeType::Fragment => &self.pipes.frag,
+            PipeType::Compute => &self.pipes.comp,
+        };
+
+        let index: usize = job.priority() as usize;
+        let mut pipe = pipes.get(index).ok_or(EIO)?.lock();
+
+        mod_dev_dbg!(self.dev, "GPU: run_job: run()\n");
+        job.run(&mut pipe);
+        mod_dev_dbg!(self.dev, "GPU: run_job: ring doorbell\n");
+
+        let mut guard = self.rtkit.lock();
+        let rtk = guard.as_mut().unwrap();
+        rtk.send_message(
+            EP_DOORBELL,
+            MSG_TX_DOORBELL | pipe_type as u64 | ((index as u64) << 2),
+        )?;
+        mod_dev_dbg!(self.dev, "GPU: run_job: done\n");
+
+        Ok(())
+    }
+
+    pub(crate) fn is_crashed(&self) -> bool {
+        self.crashed.load(Ordering::Relaxed)
+    }
+
+    pub(crate) fn start_op(self: &Arc<GpuManager::ver>) -> Result<OpGuard> {
+        if self.is_crashed() {
+            return Err(ENODEV);
+        }
+
+        let val = self
+            .initdata
+            .globals
+            .with(|raw, _inner| raw.pending_submissions.fetch_add(1, Ordering::Acquire));
+
+        mod_dev_dbg!(self.dev, "OP start (pending: {})\n", val + 1);
+        self.kick_firmware()?;
+        Ok(OpGuard(self.clone()))
+    }
+}
+
+#[versions(AGX)]
+impl GpuManager for GpuManager::ver {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn arc_as_any(self: Arc<Self>) -> Arc<dyn Any + Sync + Send> {
+        self as Arc<dyn Any + Sync + Send>
+    }
+
+    fn init(&self) -> Result {
+        self.tx_channels.lock().device_control.send(
+            &fw::channels::DeviceControlMsg::ver::Initialize(Default::default()),
+        );
+
+        let initdata = self.initdata.gpu_va().get();
+        let mut guard = self.rtkit.lock();
+        let rtk = guard.as_mut().unwrap();
+
+        rtk.boot()?;
+        rtk.start_endpoint(EP_FIRMWARE)?;
+        rtk.start_endpoint(EP_DOORBELL)?;
+        rtk.send_message(EP_FIRMWARE, MSG_INIT | (initdata & INIT_DATA_MASK))?;
+        rtk.send_message(EP_DOORBELL, MSG_TX_DOORBELL | DOORBELL_DEVCTRL)?;
+        core::mem::drop(guard);
+
+        self.kick_firmware()?;
+        Ok(())
+    }
+
+    fn update_globals(&self) {
+        let mut timeout: u32 = 2;
+        if debug_enabled(DebugFlags::WaitForPowerOff) {
+            timeout = 0;
+        } else if debug_enabled(DebugFlags::KeepGpuPowered) {
+            timeout = 5000;
+        }
+
+        self.initdata.globals.with(|raw, _inner| {
+            raw.idle_off_delay_ms.store(timeout, Ordering::Relaxed);
+        });
+    }
+
+    fn alloc(&self) -> Guard<'_, Mutex<KernelAllocators>> {
+        let mut guard = self.alloc.lock();
+        let (garbage_count, garbage_bytes) = guard.private.garbage();
+        if garbage_bytes > 1024 * 1024 {
+            mod_dev_dbg!(
+                self.dev,
+                "Collecting kalloc garbage ({} objects, {} bytes)\n",
+                garbage_count,
+                garbage_bytes
+            );
+            if self.flush_fw_cache().is_err() {
+                dev_err!(self.dev, "Failed to flush FW cache\n");
+            } else {
+                guard.private.collect_garbage(garbage_count);
+            }
+        }
+
+        guard
+    }
+
+    fn new_vm(&self, file_id: u64) -> Result<mmu::Vm> {
+        self.uat.new_vm(self.ids.vm.next(), file_id)
+    }
+
+    fn bind_vm(&self, vm: &mmu::Vm) -> Result<mmu::VmBind> {
+        self.uat.bind(vm)
+    }
+
+    fn new_queue(
+        &self,
+        vm: mmu::Vm,
+        ualloc: Arc<Mutex<alloc::DefaultAllocator>>,
+        ualloc_priv: Arc<Mutex<alloc::DefaultAllocator>>,
+        priority: u32,
+        caps: u32,
+    ) -> Result<Box<dyn queue::Queue>> {
+        let mut kalloc = self.alloc();
+        let id = self.ids.queue.next();
+        Ok(Box::try_new(queue::Queue::ver::new(
+            &self.dev,
+            vm,
+            &mut kalloc,
+            ualloc,
+            ualloc_priv,
+            self.event_manager.clone(),
+            &self.buffer_mgr,
+            id,
+            priority,
+            caps,
+        )?)?)
+    }
+
+    fn kick_firmware(&self) -> Result {
+        if self.is_crashed() {
+            return Err(ENODEV);
+        }
+
+        let mut guard = self.rtkit.lock();
+        let rtk = guard.as_mut().unwrap();
+        rtk.send_message(EP_DOORBELL, MSG_TX_DOORBELL | DOORBELL_KICKFW)?;
+
+        Ok(())
+    }
+
+    fn invalidate_context(
+        &self,
+        context: &fw::types::GpuObject<fw::workqueue::GpuContextData>,
+    ) -> Result {
+        mod_dev_dbg!(
+            self.dev,
+            "Invalidating GPU context @ {:?}\n",
+            context.weak_pointer()
+        );
+
+        if self.is_crashed() {
+            return Err(ENODEV);
+        }
+
+        let mut guard = self.alloc.lock();
+        let (garbage_count, _) = guard.private.garbage();
+
+        let dc = context.with(
+            |raw, _inner| fw::channels::DeviceControlMsg::ver::DestroyContext {
+                unk_4: 0,
+                ctx_23: raw.unk_23,
+                __pad0: Default::default(),
+                unk_c: 0,
+                unk_10: 0,
+                ctx_0: raw.unk_0,
+                ctx_1: raw.unk_1,
+                ctx_4: raw.unk_4,
+                __pad1: Default::default(),
+                unk_18: 0,
+                gpu_context: Some(context.weak_pointer()),
+                __pad2: Default::default(),
+            },
+        );
+
+        mod_dev_dbg!(self.dev, "Context invalidation command: {:?}\n", &dc);
+
+        let mut txch = self.tx_channels.lock();
+
+        let token = txch.device_control.send(&dc);
+
+        {
+            let mut guard = self.rtkit.lock();
+            let rtk = guard.as_mut().unwrap();
+            rtk.send_message(EP_DOORBELL, MSG_TX_DOORBELL | DOORBELL_DEVCTRL)?;
+        }
+
+        txch.device_control.wait_for(token)?;
+
+        mod_dev_dbg!(
+            self.dev,
+            "GPU context invalidated: {:?}\n",
+            context.weak_pointer()
+        );
+
+        // The invalidation does a cache flush, so it is okay to collect garbage
+        guard.private.collect_garbage(garbage_count);
+
+        Ok(())
+    }
+
+    fn flush_fw_cache(&self) -> Result {
+        mod_dev_dbg!(self.dev, "Flushing coprocessor data cache\n");
+
+        if self.is_crashed() {
+            return Err(ENODEV);
+        }
+
+        // ctx_0 == 0xff or ctx_1 == 0xff cause no effect on context,
+        // but this command does a full cache flush too, so abuse it
+        // for that.
+
+        let dc = fw::channels::DeviceControlMsg::ver::DestroyContext {
+            unk_4: 0,
+            ctx_23: 0,
+            __pad0: Default::default(),
+            unk_c: 0,
+            unk_10: 0,
+            ctx_0: 0xff,
+            ctx_1: 0xff,
+            ctx_4: 0,
+            __pad1: Default::default(),
+            unk_18: 0,
+            gpu_context: None,
+            __pad2: Default::default(),
+        };
+
+        let mut txch = self.tx_channels.lock();
+
+        let token = txch.device_control.send(&dc);
+        {
+            let mut guard = self.rtkit.lock();
+            let rtk = guard.as_mut().unwrap();
+            rtk.send_message(EP_DOORBELL, MSG_TX_DOORBELL | DOORBELL_DEVCTRL)?;
+        }
+
+        txch.device_control.wait_for(token)?;
+        Ok(())
+    }
+
+    fn ids(&self) -> &SequenceIDs {
+        &self.ids
+    }
+
+    fn handle_timeout(&self, counter: u32, event_slot: u32) {
+        dev_err!(self.dev, " (\\________/) \n");
+        dev_err!(self.dev, "  |        |  \n");
+        dev_err!(self.dev, "'.| \\ , / |.'\n");
+        dev_err!(self.dev, "--| / (( \\ |--\n");
+        dev_err!(self.dev, ".'|  _-_-  |'.\n");
+        dev_err!(self.dev, "  |________|  \n");
+        dev_err!(self.dev, "** GPU timeout nya~!!!!! **\n");
+        dev_err!(self.dev, "  Event slot: {}\n", event_slot);
+        dev_err!(self.dev, "  Timeout count: {}\n", counter);
+
+        // If we have fault info, consider it a fault.
+        let error = match self.get_fault_info() {
+            Some(info) => workqueue::WorkError::Fault(info),
+            None => workqueue::WorkError::Timeout,
+        };
+        self.mark_pending_events(Some(event_slot), error);
+        self.recover();
+    }
+
+    fn handle_fault(&self) {
+        dev_err!(self.dev, " (\\________/) \n");
+        dev_err!(self.dev, "  |        |  \n");
+        dev_err!(self.dev, "'.| \\ , / |.'\n");
+        dev_err!(self.dev, "--| / (( \\ |--\n");
+        dev_err!(self.dev, ".'|  _-_-  |'.\n");
+        dev_err!(self.dev, "  |________|  \n");
+        dev_err!(self.dev, "GPU fault nya~!!!!!\n");
+        let error = match self.get_fault_info() {
+            Some(info) => workqueue::WorkError::Fault(info),
+            None => workqueue::WorkError::Unknown,
+        };
+        self.mark_pending_events(None, error);
+        self.recover();
+    }
+
+    fn wait_for_poweroff(&self, timeout: usize) -> Result {
+        self.initdata.runtime_pointers.hwdata_a.with(|raw, _inner| {
+            for _i in 0..timeout {
+                if raw.pwr_status.load(Ordering::Relaxed) == 4 {
+                    return Ok(());
+                }
+                coarse_sleep(Duration::from_millis(1));
+            }
+            Err(ETIMEDOUT)
+        })
+    }
+
+    fn fwctl(&self, msg: fw::channels::FwCtlMsg) -> Result {
+        if self.is_crashed() {
+            return Err(ENODEV);
+        }
+
+        let mut fwctl = self.fwctl_channel.lock();
+        let token = fwctl.send(&msg);
+        {
+            let mut guard = self.rtkit.lock();
+            let rtk = guard.as_mut().unwrap();
+            rtk.send_message(EP_DOORBELL, MSG_FWCTL)?;
+        }
+        fwctl.wait_for(token)?;
+        Ok(())
+    }
+
+    fn get_cfg(&self) -> &'static hw::HwConfig {
+        self.cfg
+    }
+
+    fn get_dyncfg(&self) -> &hw::DynConfig {
+        &self.dyncfg
+    }
+}
+
+#[versions(AGX)]
+impl GpuManagerPriv for GpuManager::ver {
+    fn end_op(&self) {
+        let val = self
+            .initdata
+            .globals
+            .with(|raw, _inner| raw.pending_submissions.fetch_sub(1, Ordering::Release));
+
+        mod_dev_dbg!(self.dev, "OP end (pending: {})\n", val - 1);
+    }
+}
diff --git a/drivers/gpu/drm/asahi/hw/mod.rs b/drivers/gpu/drm/asahi/hw/mod.rs
new file mode 100644
index 000000000000..a92bb70aeae8
--- /dev/null
+++ b/drivers/gpu/drm/asahi/hw/mod.rs
@@ -0,0 +1,522 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+//! Per-SoC hardware configuration structures
+//!
+//! This module contains the definitions used to store per-GPU and per-SoC configuration data.
+
+use crate::driver::AsahiDevice;
+use crate::fw::types::*;
+use alloc::vec::Vec;
+use kernel::c_str;
+use kernel::device::RawDevice;
+use kernel::prelude::*;
+
+const MAX_POWERZONES: usize = 5;
+
+pub(crate) mod t600x;
+pub(crate) mod t8103;
+pub(crate) mod t8112;
+
+/// GPU generation enumeration. Note: Part of the UABI.
+#[derive(Debug, PartialEq, Copy, Clone)]
+#[repr(u32)]
+pub(crate) enum GpuGen {
+    G13 = 13,
+    G14 = 14,
+}
+
+/// GPU variant enumeration. Note: Part of the UABI.
+#[derive(Debug, PartialEq, Copy, Clone)]
+#[repr(u32)]
+pub(crate) enum GpuVariant {
+    P = 'P' as u32,
+    G = 'G' as u32,
+    S = 'S' as u32,
+    C = 'C' as u32,
+    D = 'D' as u32,
+}
+
+/// GPU revision enumeration. Note: Part of the UABI.
+#[derive(Debug, PartialEq, Copy, Clone)]
+#[repr(u32)]
+pub(crate) enum GpuRevision {
+    A0 = 0x00,
+    A1 = 0x01,
+    B0 = 0x10,
+    B1 = 0x11,
+    C0 = 0x20,
+    C1 = 0x21,
+}
+
+/// GPU core type enumeration. Note: Part of the firmware ABI.
+#[derive(Debug, Copy, Clone)]
+#[repr(u32)]
+pub(crate) enum GpuCore {
+    // Unknown = 0,
+    // G5P = 1,
+    // G5G = 2,
+    // G9P = 3,
+    // G9G = 4,
+    // G10P = 5,
+    // G11P = 6,
+    // G11M = 7,
+    // G11G = 8,
+    // G12P = 9,
+    // G13P = 10,
+    G13G = 11,
+    G13S = 12,
+    G13C = 13,
+    // G14P = 14,
+    G14G = 15,
+}
+
+/// GPU revision ID. Note: Part of the firmware ABI.
+#[derive(Debug, PartialEq, Copy, Clone)]
+#[repr(u32)]
+pub(crate) enum GpuRevisionID {
+    // Unknown = 0,
+    A0 = 1,
+    A1 = 2,
+    B0 = 3,
+    B1 = 4,
+    C0 = 5,
+    C1 = 6,
+}
+
+/// GPU driver/hardware features, from the UABI.
+pub(crate) mod feat {
+    /// Backwards-compatible features.
+    pub(crate) mod compat {}
+
+    /// Backwards-incompatible features.
+    pub(crate) mod incompat {
+        use kernel::bindings;
+
+        /// Hardware requires Z/S compression to be mandatorily enabled.
+ pub(crate) const MANDATORY_ZS_COMPRESSION: u64 = + bindings::drm_asahi_feat_incompat_DRM_ASAHI_FEAT_MANDATORY_ZS_COMPRESSION as u64; + } +} + +/// A single performance state of the GPU. +#[derive(Debug)] +pub(crate) struct PState { + /// Voltage in millivolts, per GPU cluster. + pub(crate) volt_mv: Vec<u32>, + /// Frequency in hertz. + pub(crate) freq_hz: u32, + /// Maximum power consumption of the GPU at this pstate, in milliwatts. + pub(crate) pwr_mw: u32, +} + +/// A power zone definition (we have no idea what this is but Apple puts them in the DT). +#[allow(missing_docs)] +#[derive(Debug, Copy, Clone)] +pub(crate) struct PowerZone { + pub(crate) target: u32, + pub(crate) target_offset: u32, + pub(crate) filter_tc: u32, +} + +/// An MMIO mapping used by the firmware. +#[derive(Debug, Copy, Clone)] +pub(crate) struct IOMapping { + /// Base physical address of the mapping. + pub(crate) base: usize, + /// Size of the mapping. + pub(crate) size: usize, + /// Range size of the mapping (for arrays?) + pub(crate) range_size: usize, + /// Whether the mapping should be writable. + pub(crate) writable: bool, +} + +impl IOMapping { + /// Convenience constructor for a new IOMapping. + pub(crate) const fn new( + base: usize, + size: usize, + range_size: usize, + writable: bool, + ) -> IOMapping { + IOMapping { + base, + size, + range_size, + writable, + } + } +} + +/// Unknown HwConfigA fields that vary from SoC to SoC. +#[allow(missing_docs)] +#[derive(Debug, Copy, Clone)] +pub(crate) struct HwConfigA { + pub(crate) unk_87c: i32, + pub(crate) unk_8cc: u32, + pub(crate) unk_e24: u32, +} + +/// Unknown HwConfigB fields that vary from SoC to SoC. +#[allow(missing_docs)] +#[derive(Debug, Copy, Clone)] +pub(crate) struct HwConfigB { + pub(crate) unk_4e0: u64, + pub(crate) unk_534: u32, + pub(crate) unk_ab8: u32, + pub(crate) unk_abc: u32, + pub(crate) unk_b30: u32, +} + +/// Render command configs that vary from SoC to SoC. 
+#[derive(Debug, Copy, Clone)]
+pub(crate) struct HwRenderConfig {
+    /// Vertex/tiling-related configuration register (lsb: disable clustering)
+    pub(crate) tiling_control: u32,
+}
+
+/// Static hardware configuration for a given SoC model.
+#[derive(Debug)]
+pub(crate) struct HwConfig {
+    /// Chip ID in hex format (e.g. 0x8103 for t8103).
+    pub(crate) chip_id: u32,
+    /// GPU generation.
+    pub(crate) gpu_gen: GpuGen,
+    /// GPU variant type.
+    pub(crate) gpu_variant: GpuVariant,
+    /// GPU core type ID (as known by the firmware).
+    pub(crate) gpu_core: GpuCore,
+    /// Compatible feature bitmask for this GPU.
+    pub(crate) gpu_feat_compat: u64,
+    /// Incompatible feature bitmask for this GPU.
+    pub(crate) gpu_feat_incompat: u64,
+
+    /// Base clock used for timekeeping.
+    pub(crate) base_clock_hz: u32,
+    /// Output address space for the UAT on this SoC.
+    pub(crate) uat_oas: usize,
+    /// Maximum number of clusters on this SoC.
+    pub(crate) max_num_clusters: u32,
+    /// Maximum number of cores per cluster for this GPU.
+    pub(crate) max_num_cores: u32,
+    /// Maximum number of frags per cluster for this GPU.
+    pub(crate) max_num_frags: u32,
+    /// Maximum number of GPs per cluster for this GPU.
+    pub(crate) max_num_gps: u32,
+
+    /// Required size of the first preemption buffer.
+    pub(crate) preempt1_size: usize,
+    /// Required size of the second preemption buffer.
+    pub(crate) preempt2_size: usize,
+    /// Required size of the third preemption buffer.
+    pub(crate) preempt3_size: usize,
+
+    /// Rendering-relevant configuration.
+    pub(crate) render: HwRenderConfig,
+
+    /// Misc HWDataA field values.
+    pub(crate) da: HwConfigA,
+    /// Misc HWDataB field values.
+    pub(crate) db: HwConfigB,
+    /// HwDataShared1.table.
+    pub(crate) shared1_tab: &'static [i32],
+    /// HwDataShared1.unk_a4.
+    pub(crate) shared1_a4: u32,
+    /// HwDataShared2.table.
+    pub(crate) shared2_tab: &'static [i32],
+    /// HwDataShared2.unk_508.
+ pub(crate) shared2_unk_508: u32, + /// Constant related to SRAM voltages. + pub(crate) sram_k: F32, + /// Unknown per-cluster coefficients 1. + pub(crate) unk_coef_a: &'static [&'static [F32]], + /// Unknown per-cluster coefficients 2. + pub(crate) unk_coef_b: &'static [&'static [F32]], + /// Unknown table in Global struct. + pub(crate) global_tab: Option<&'static [u8]>, + + /// Temperature sensor list (8 bits per sensor). + pub(crate) fast_die0_sensor_mask: u64, + /// Temperature sensor list (alternate). + pub(crate) fast_die0_sensor_mask_alt: u64, + /// Temperature sensor present bitmask. + pub(crate) fast_die0_sensor_present: u32, + /// Required MMIO mappings for this GPU/firmware. + pub(crate) io_mappings: &'static [Option<IOMapping>], +} + +/// Dynamic (fetched from hardware/DT) configuration. +#[derive(Debug)] +pub(crate) struct DynConfig { + /// Base physical address of the UAT TTB (from DT reserved memory region). + pub(crate) uat_ttb_base: u64, + /// GPU ID configuration read from hardware. + pub(crate) id: GpuIdConfig, + /// Power calibration configuration for this specific chip/device. + pub(crate) pwr: PwrConfig, +} + +/// Specific GPU ID configuration fetched from SGX MMIO registers. +#[derive(Debug)] +pub(crate) struct GpuIdConfig { + /// GPU generation (should match static config). + pub(crate) gpu_gen: GpuGen, + /// GPU variant type (should match static config). + pub(crate) gpu_variant: GpuVariant, + /// GPU silicon revision. + pub(crate) gpu_rev: GpuRevision, + /// GPU silicon revision ID (firmware enum). + pub(crate) gpu_rev_id: GpuRevisionID, + /// Maximum number of dies supported. + pub(crate) max_dies: u32, + /// Total number of GPU clusters. + pub(crate) num_clusters: u32, + /// Maximum number of GPU cores per cluster. + pub(crate) num_cores: u32, + /// Number of frags per cluster. + pub(crate) num_frags: u32, + /// Number of GPs per cluster. + pub(crate) num_gps: u32, + /// Total number of active cores for the whole GPU. 
+ pub(crate) total_active_cores: u32, + /// Mask of active cores per cluster. + pub(crate) core_masks: Vec<u32>, + /// Packed mask of all active cores. + pub(crate) core_masks_packed: Vec<u32>, +} + +/// Configurable GPU power settings from the device tree. +#[derive(Debug)] +pub(crate) struct PwrConfig { + /// GPU performance state list. + pub(crate) perf_states: Vec<PState>, + /// GPU power zone list. + pub(crate) power_zones: Vec<PowerZone>, + + /// Core leakage coefficient per cluster. + pub(crate) core_leak_coef: Vec<F32>, + /// SRAM leakage coefficient per cluster. + pub(crate) sram_leak_coef: Vec<F32>, + + /// Maximum total power of the GPU in milliwatts. + pub(crate) max_power_mw: u32, + /// Maximum frequency of the GPU in megahertz. + pub(crate) max_freq_mhz: u32, + + /// Minimum performance state to start at. + pub(crate) perf_base_pstate: u32, + /// Maximum enabled performance state. + pub(crate) perf_max_pstate: u32, + + /// Minimum voltage for the SRAM power domain in microvolts. + pub(crate) min_sram_microvolt: u32, + + // Most of these fields are just named after Apple ADT property names and we don't fully + // understand them. They configure various power-related PID loops and filters. + /// Average power filter time constant in milliseconds. + pub(crate) avg_power_filter_tc_ms: u32, + /// Average power filter PID integral gain? + pub(crate) avg_power_ki_only: F32, + /// Average power filter PID proportional gain? + pub(crate) avg_power_kp: F32, + pub(crate) avg_power_min_duty_cycle: u32, + /// Average power target filter time constant in periods. + pub(crate) avg_power_target_filter_tc: u32, + /// "Fast die0" (temperature?) PID integral gain. + pub(crate) fast_die0_integral_gain: F32, + /// "Fast die0" (temperature?) PID proportional gain. + pub(crate) fast_die0_proportional_gain: F32, + pub(crate) fast_die0_prop_tgt_delta: u32, + pub(crate) fast_die0_release_temp: u32, + /// Delay from the fender (?) 
becoming idle to powerdown + pub(crate) fender_idle_off_delay_ms: u32, + /// Timeout from firmware early wake to sleep if no work was submitted (?) + pub(crate) fw_early_wake_timeout_ms: u32, + /// Delay from the GPU becoming idle to powerdown + pub(crate) idle_off_delay_ms: u32, + /// Percent? + pub(crate) perf_boost_ce_step: u32, + /// Minimum utilization before performance state is increased in %. + pub(crate) perf_boost_min_util: u32, + pub(crate) perf_filter_drop_threshold: u32, + /// Performance PID filter time constant? (periods?) + pub(crate) perf_filter_time_constant: u32, + /// Performance PID filter time constant 2? (periods?) + pub(crate) perf_filter_time_constant2: u32, + /// Performance PID integral gain. + pub(crate) perf_integral_gain: F32, + /// Performance PID integral gain 2 (?). + pub(crate) perf_integral_gain2: F32, + pub(crate) perf_integral_min_clamp: u32, + /// Performance PID proportional gain. + pub(crate) perf_proportional_gain: F32, + /// Performance PID proportional gain 2 (?). + pub(crate) perf_proportional_gain2: F32, + pub(crate) perf_reset_iters: u32, + /// Target GPU utilization for the performance controller in %. + pub(crate) perf_tgt_utilization: u32, + /// Power sampling period in milliseconds. + pub(crate) power_sample_period: u32, + /// PPM (?) filter time constant in milliseconds. + pub(crate) ppm_filter_time_constant_ms: u32, + /// PPM (?) filter PID integral gain. + pub(crate) ppm_ki: F32, + /// PPM (?) filter PID proportional gain. + pub(crate) ppm_kp: F32, + /// Power consumption filter time constant (periods?) + pub(crate) pwr_filter_time_constant: u32, + /// Power consumption filter PID integral gain. + pub(crate) pwr_integral_gain: F32, + pub(crate) pwr_integral_min_clamp: u32, + pub(crate) pwr_min_duty_cycle: u32, + pub(crate) pwr_proportional_gain: F32, +} + +impl PwrConfig { + /// Load the GPU power configuration from the device tree. 
+    pub(crate) fn load(dev: &AsahiDevice, cfg: &HwConfig) -> Result<PwrConfig> {
+        let mut perf_states = Vec::new();
+
+        let node = dev.of_node().ok_or(EIO)?;
+        let opps = node
+            .parse_phandle(c_str!("operating-points-v2"), 0)
+            .ok_or(EIO)?;
+
+        let mut max_power_mw: u32 = 0;
+        let mut max_freq_mhz: u32 = 0;
+
+        macro_rules! prop {
+            ($prop:expr, $default:expr) => {{
+                node.get_opt_property(c_str!($prop))
+                    .map_err(|e| {
+                        dev_err!(dev, "Error reading property {}: {:?}\n", $prop, e);
+                        e
+                    })?
+                    .unwrap_or($default)
+            }};
+            ($prop:expr) => {{
+                node.get_property(c_str!($prop)).map_err(|e| {
+                    dev_err!(dev, "Error reading property {}: {:?}\n", $prop, e);
+                    e
+                })?
+            }};
+        }
+
+        for opp in opps.children() {
+            let freq_hz: u64 = opp.get_property(c_str!("opp-hz"))?;
+            let mut volt_uv: Vec<u32> = opp.get_property(c_str!("opp-microvolt"))?;
+            let pwr_uw: u32 = opp.get_property(c_str!("opp-microwatt"))?;
+
+            if volt_uv.len() != cfg.max_num_clusters as usize {
+                dev_err!(
+                    dev,
+                    "Invalid opp-microvolt length (expected {}, got {})\n",
+                    cfg.max_num_clusters,
+                    volt_uv.len()
+                );
+                return Err(EINVAL);
+            }
+
+            volt_uv.iter_mut().for_each(|a| *a /= 1000);
+            let volt_mv = volt_uv;
+
+            let pwr_mw = pwr_uw / 1000;
+            max_power_mw = max_power_mw.max(pwr_mw);
+
+            let freq_mhz: u32 = (freq_hz / 1_000_000).try_into()?;
+            max_freq_mhz = max_freq_mhz.max(freq_mhz);
+
+            perf_states.try_push(PState {
+                freq_hz: freq_hz.try_into()?,
+                volt_mv,
+                pwr_mw,
+            })?;
+        }
+
+        let pz_data = prop!("apple,power-zones", Vec::new());
+
+        if pz_data.len() > 3 * MAX_POWERZONES || pz_data.len() % 3 != 0 {
+            dev_err!(dev, "Invalid apple,power-zones value\n");
+            return Err(EINVAL);
+        }
+
+        let mut power_zones = Vec::new();
+        for i in (0..pz_data.len()).step_by(3) {
+            power_zones.try_push(PowerZone {
+                target: pz_data[i],
+                target_offset: pz_data[i + 1],
+                filter_tc: pz_data[i + 2],
+            })?;
+        }
+
+        let core_leak_coef: Vec<F32> = prop!("apple,core-leak-coef");
+        let sram_leak_coef: Vec<F32> = prop!("apple,sram-leak-coef");
+
+        if core_leak_coef.len() != cfg.max_num_clusters as usize {
+            dev_err!(dev, "Invalid apple,core-leak-coef\n");
+            return Err(EINVAL);
+        }
+        if sram_leak_coef.len() != cfg.max_num_clusters as usize {
+            dev_err!(dev, "Invalid apple,sram-leak-coef\n");
+            return Err(EINVAL);
+        }
+
+        Ok(PwrConfig {
+            core_leak_coef,
+            sram_leak_coef,
+
+            max_power_mw,
+            max_freq_mhz,
+
+            perf_base_pstate: prop!("apple,perf-base-pstate", 1),
+            perf_max_pstate: perf_states.len() as u32 - 1,
+            min_sram_microvolt: prop!("apple,min-sram-microvolt"),
+
+            avg_power_filter_tc_ms: prop!("apple,avg-power-filter-tc-ms"),
+            avg_power_ki_only: prop!("apple,avg-power-ki-only"),
+            avg_power_kp: prop!("apple,avg-power-kp"),
+            avg_power_min_duty_cycle: prop!("apple,avg-power-min-duty-cycle"),
+            avg_power_target_filter_tc: prop!("apple,avg-power-target-filter-tc"),
+            fast_die0_integral_gain: prop!("apple,fast-die0-integral-gain"),
+            fast_die0_proportional_gain: prop!("apple,fast-die0-proportional-gain"),
+            fast_die0_prop_tgt_delta: prop!("apple,fast-die0-prop-tgt-delta", 0),
+            fast_die0_release_temp: prop!("apple,fast-die0-release-temp", 80),
+            fender_idle_off_delay_ms: prop!("apple,fender-idle-off-delay-ms", 40),
+            fw_early_wake_timeout_ms: prop!("apple,fw-early-wake-timeout-ms", 5),
+            idle_off_delay_ms: prop!("apple,idle-off-delay-ms", 2),
+            perf_boost_ce_step: prop!("apple,perf-boost-ce-step", 25),
+            perf_boost_min_util: prop!("apple,perf-boost-min-util", 100),
+            perf_filter_drop_threshold: prop!("apple,perf-filter-drop-threshold"),
+            perf_filter_time_constant2: prop!("apple,perf-filter-time-constant2"),
+            perf_filter_time_constant: prop!("apple,perf-filter-time-constant"),
+            perf_integral_gain2: prop!("apple,perf-integral-gain2"),
+            perf_integral_gain: prop!("apple,perf-integral-gain", f32!(7.8956833)),
+            perf_integral_min_clamp: prop!("apple,perf-integral-min-clamp"),
+            perf_proportional_gain2: prop!("apple,perf-proportional-gain2"),
+            perf_proportional_gain: prop!("apple,perf-proportional-gain", f32!(14.707963)),
+            perf_reset_iters: prop!("apple,perf-reset-iters", 6),
+            perf_tgt_utilization: prop!("apple,perf-tgt-utilization"),
+            power_sample_period: prop!("apple,power-sample-period"),
+            ppm_filter_time_constant_ms: prop!("apple,ppm-filter-time-constant-ms"),
+            ppm_ki: prop!("apple,ppm-ki"),
+            ppm_kp: prop!("apple,ppm-kp"),
+            pwr_filter_time_constant: prop!("apple,pwr-filter-time-constant", 313),
+            pwr_integral_gain: prop!("apple,pwr-integral-gain", f32!(0.0202129)),
+            pwr_integral_min_clamp: prop!("apple,pwr-integral-min-clamp", 0),
+            pwr_min_duty_cycle: prop!("apple,pwr-min-duty-cycle"),
+            pwr_proportional_gain: prop!("apple,pwr-proportional-gain", f32!(5.2831855)),
+
+            perf_states,
+            power_zones,
+        })
+    }
+
+    pub(crate) fn min_frequency_khz(&self) -> u32 {
+        self.perf_states[self.perf_base_pstate as usize].freq_hz / 1000
+    }
+
+    pub(crate) fn max_frequency_khz(&self) -> u32 {
+        self.perf_states[self.perf_max_pstate as usize].freq_hz / 1000
+    }
+}
diff --git a/drivers/gpu/drm/asahi/hw/t600x.rs b/drivers/gpu/drm/asahi/hw/t600x.rs
new file mode 100644
index 000000000000..8a8267a7e18a
--- /dev/null
+++ b/drivers/gpu/drm/asahi/hw/t600x.rs
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+//! Hardware configuration for t600x (M1 Pro/Max/Ultra) platforms.
+ +use crate::f32; + +use super::*; + +const fn iomaps(mcc_count: usize, has_die1: bool) -> [Option<IOMapping>; 20] { + [ + Some(IOMapping::new(0x404d00000, 0x1c000, 0x1c000, true)), // Fender + Some(IOMapping::new(0x20e100000, 0x4000, 0x4000, false)), // AICTimer + Some(IOMapping::new(0x28e104000, 0x4000, 0x4000, true)), // AICSWInt + Some(IOMapping::new(0x404000000, 0x20000, 0x20000, true)), // RGX + None, // UVD + None, // unused + None, // DisplayUnderrunWA + Some(IOMapping::new(0x28e494000, 0x1000, 0x1000, false)), // AnalogTempSensorControllerRegs + None, // PMPDoorbell + Some(IOMapping::new(0x404d80000, 0x8000, 0x8000, true)), // MetrologySensorRegs + Some(IOMapping::new(0x204d61000, 0x1000, 0x1000, true)), // GMGIFAFRegs + Some(IOMapping::new( + 0x200000000, + mcc_count * 0xd8000, + 0xd6400, + true, + )), // MCache registers + None, // AICBankedRegisters + None, // PMGRScratch + Some(IOMapping::new(0x2643c4000, 0x1000, 0x1000, true)), // NIA Special agent idle register die 0 + if has_die1 { + // NIA Special agent idle register die 1 + Some(IOMapping::new(0x22643c4000, 0x1000, 0x1000, true)) + } else { + None + }, + None, // CRE registers + None, // Streaming codec registers + Some(IOMapping::new(0x28e3d0000, 0x1000, 0x1000, true)), // ? + Some(IOMapping::new(0x28e3c0000, 0x1000, 0x1000, false)), // ? 
+ ] +} + +pub(crate) const HWCONFIG_T6002: super::HwConfig = HwConfig { + chip_id: 0x6002, + gpu_gen: GpuGen::G13, + gpu_variant: GpuVariant::D, + gpu_core: GpuCore::G13C, + gpu_feat_compat: 0, + gpu_feat_incompat: feat::incompat::MANDATORY_ZS_COMPRESSION, + + base_clock_hz: 24_000_000, + uat_oas: 42, + max_num_clusters: 8, + max_num_cores: 8, + max_num_frags: 8, + max_num_gps: 4, + + preempt1_size: 0x540, + preempt2_size: 0x280, + preempt3_size: 0x20, + + render: HwRenderConfig { + tiling_control: 0xa540, + }, + + da: HwConfigA { + unk_87c: 900, + unk_8cc: 11000, + unk_e24: 125, + }, + db: HwConfigB { + unk_4e0: 4, + unk_534: 1, + unk_ab8: 0x2084, + unk_abc: 0x80, + unk_b30: 0, + }, + shared1_tab: &[ + 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, + 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, + ], + shared1_a4: 0xffff, + shared2_tab: &[-1, -1, -1, -1, 0x2aa, 0xaaa, -1, -1, 0, 0], + shared2_unk_508: 0xcc00001, + sram_k: f32!(1.02), + unk_coef_a: &[ + &f32!([9.838]), + &f32!([9.819]), + &f32!([9.826]), + &f32!([9.799]), + &f32!([9.799]), + &f32!([9.826]), + &f32!([9.819]), + &f32!([9.838]), + ], + unk_coef_b: &[ + &f32!([13.0]), + &f32!([13.0]), + &f32!([13.0]), + &f32!([13.0]), + &f32!([13.0]), + &f32!([13.0]), + &f32!([13.0]), + &f32!([13.0]), + ], + global_tab: Some(&[ + 0, 1, 2, 1, 1, 90, 75, 1, 1, 1, 2, 90, 75, 1, 1, 1, 1, 90, 75, 1, 1, + ]), + fast_die0_sensor_mask: 0x8080808080808080, + fast_die0_sensor_mask_alt: 0x9090909090909090, + fast_die0_sensor_present: 0xff, + io_mappings: &iomaps(16, true), +}; + +pub(crate) const HWCONFIG_T6001: super::HwConfig = HwConfig { + chip_id: 0x6001, + gpu_variant: GpuVariant::C, + gpu_core: GpuCore::G13C, + + max_num_clusters: 4, + fast_die0_sensor_mask: 0x80808080, + fast_die0_sensor_mask_alt: 0x90909090, + fast_die0_sensor_present: 0x0f, + io_mappings: &iomaps(8, false), + ..HWCONFIG_T6002 +}; + +pub(crate) const HWCONFIG_T6000: super::HwConfig = HwConfig { + chip_id: 0x6000, + 
gpu_variant: GpuVariant::S, + gpu_core: GpuCore::G13S, + + max_num_clusters: 2, + fast_die0_sensor_mask: 0x8080, + fast_die0_sensor_mask_alt: 0x9090, + fast_die0_sensor_present: 0x03, + io_mappings: &iomaps(4, false), + ..HWCONFIG_T6001 +}; diff --git a/drivers/gpu/drm/asahi/hw/t8103.rs b/drivers/gpu/drm/asahi/hw/t8103.rs new file mode 100644 index 000000000000..3d38b088a0f5 --- /dev/null +++ b/drivers/gpu/drm/asahi/hw/t8103.rs @@ -0,0 +1,80 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! Hardware configuration for t8103 platforms (M1). + +use crate::f32; + +use super::*; + +pub(crate) const HWCONFIG: super::HwConfig = HwConfig { + chip_id: 0x8103, + gpu_gen: GpuGen::G13, + gpu_variant: GpuVariant::G, + gpu_core: GpuCore::G13G, + gpu_feat_compat: 0, + gpu_feat_incompat: 0, + + base_clock_hz: 24_000_000, + uat_oas: 40, + max_num_clusters: 1, + max_num_cores: 8, + max_num_frags: 8, + max_num_gps: 4, + + preempt1_size: 0x540, + preempt2_size: 0x280, + preempt3_size: 0x20, + + render: HwRenderConfig { + // bit 0: disable clustering (always) + tiling_control: 0xa041, + }, + + da: HwConfigA { + unk_87c: -220, + unk_8cc: 9880, + unk_e24: 112, + }, + db: HwConfigB { + unk_4e0: 0, + unk_534: 0, + unk_ab8: 0x48, + unk_abc: 0x8, + unk_b30: 0, + }, + shared1_tab: &[ + -1, 0x7282, 0x50ea, 0x370a, 0x25be, 0x1c1f, 0x16fb, -1, -1, -1, -1, -1, -1, -1, -1, -1, + ], + shared1_a4: 0xffff, + shared2_tab: &[0x800, 0x1555, -1, -1, -1, -1, -1, -1, 0, 0], + shared2_unk_508: 0xc00007, + sram_k: f32!(1.02), + unk_coef_a: &[], + unk_coef_b: &[], + global_tab: None, + fast_die0_sensor_mask: 0x12, + fast_die0_sensor_mask_alt: 0x12, + fast_die0_sensor_present: 0x01, + io_mappings: &[ + Some(IOMapping::new(0x204d00000, 0x1c000, 0x1c000, true)), // Fender + Some(IOMapping::new(0x20e100000, 0x4000, 0x4000, false)), // AICTimer + Some(IOMapping::new(0x23b104000, 0x4000, 0x4000, true)), // AICSWInt + Some(IOMapping::new(0x204000000, 0x20000, 0x20000, true)), // RGX + None, // UVD + None, 
// unused + None, // DisplayUnderrunWA + Some(IOMapping::new(0x23b2e8000, 0x1000, 0x1000, false)), // AnalogTempSensorControllerRegs + Some(IOMapping::new(0x23bc00000, 0x1000, 0x1000, true)), // PMPDoorbell + Some(IOMapping::new(0x204d80000, 0x5000, 0x5000, true)), // MetrologySensorRegs + Some(IOMapping::new(0x204d61000, 0x1000, 0x1000, true)), // GMGIFAFRegs + Some(IOMapping::new(0x200000000, 0xd6400, 0xd6400, true)), // MCache registers + None, // AICBankedRegisters + Some(IOMapping::new(0x23b738000, 0x1000, 0x1000, true)), // PMGRScratch + None, // NIA Special agent idle register die 0 + None, // NIA Special agent idle register die 1 + None, // CRE registers + None, // Streaming codec registers + None, // + None, // + ], +}; diff --git a/drivers/gpu/drm/asahi/hw/t8112.rs b/drivers/gpu/drm/asahi/hw/t8112.rs new file mode 100644 index 000000000000..5624dca130be --- /dev/null +++ b/drivers/gpu/drm/asahi/hw/t8112.rs @@ -0,0 +1,82 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! Hardware configuration for t8112 platforms (M2). 
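(Aside on the config pattern used in these files: `HWCONFIG_T6001` and `HWCONFIG_T6000` above override only the fields that differ and inherit everything else from `HWCONFIG_T6002` via Rust's const struct-update syntax. A minimal standalone sketch of that pattern, with an illustrative reduced field set, not the driver's actual `HwConfig`:)

```rust
// Reduced stand-in for hw::HwConfig; field names are illustrative only.
struct HwConfig {
    chip_id: u32,
    max_num_clusters: u32,
    base_clock_hz: u32,
}

// Base config, like HWCONFIG_T6002.
const BASE: HwConfig = HwConfig {
    chip_id: 0x6002,
    max_num_clusters: 8,
    base_clock_hz: 24_000_000,
};

// Derived config, like HWCONFIG_T6001: respecify only the differing fields,
// inherit the rest with `..BASE` (valid in const items).
const DERIVED: HwConfig = HwConfig {
    chip_id: 0x6001,
    max_num_clusters: 4,
    ..BASE
};

fn main() {
    assert_eq!(DERIVED.chip_id, 0x6001); // overridden
    assert_eq!(DERIVED.base_clock_hz, 24_000_000); // inherited from BASE
}
```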
+ +use crate::f32; + +use super::*; + +pub(crate) const HWCONFIG: super::HwConfig = HwConfig { + chip_id: 0x8112, + gpu_gen: GpuGen::G14, + gpu_variant: GpuVariant::G, + gpu_core: GpuCore::G14G, + gpu_feat_compat: 0, + gpu_feat_incompat: 0, + + base_clock_hz: 24_000_000, + uat_oas: 40, + max_num_clusters: 1, + max_num_cores: 10, + max_num_frags: 10, + max_num_gps: 4, + + preempt1_size: 0x540, + preempt2_size: 0x280, + preempt3_size: 0x20, + + render: HwRenderConfig { + // TODO: this is unused here, may be present in newer FW + tiling_control: 0xa041, + }, + + da: HwConfigA { + unk_87c: 900, + unk_8cc: 11000, + unk_e24: 125, + }, + db: HwConfigB { + unk_4e0: 4, + unk_534: 0, + unk_ab8: 0x2048, + unk_abc: 0x4000, + unk_b30: 1, + }, + shared1_tab: &[ + 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, + 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, + ], + shared1_a4: 0, + shared2_tab: &[-1, -1, -1, -1, -1, -1, -1, -1, 0xaa5aa, 0], + shared2_unk_508: 0xc00000, + sram_k: f32!(1.02), + // 13.2: last coef changed from 6.6 to 5.3, assuming that was a fix we can backport + unk_coef_a: &[&f32!([0.0, 0.0, 0.0, 0.0, 5.3, 0.0, 5.3, /*6.6*/ 5.3])], + unk_coef_b: &[&f32!([0.0, 0.0, 0.0, 0.0, 5.3, 0.0, 5.3, /*6.6*/ 5.3])], + global_tab: None, + fast_die0_sensor_mask: 0x6800, + fast_die0_sensor_mask_alt: 0x6800, + fast_die0_sensor_present: 0x02, + io_mappings: &[ + Some(IOMapping::new(0x204d00000, 0x14000, 0x14000, true)), // Fender + Some(IOMapping::new(0x20e100000, 0x4000, 0x4000, false)), // AICTimer + Some(IOMapping::new(0x23b0c4000, 0x4000, 0x4000, true)), // AICSWInt + Some(IOMapping::new(0x204000000, 0x20000, 0x20000, true)), // RGX + None, // UVD + None, // unused + None, // DisplayUnderrunWA + Some(IOMapping::new(0x23b2c0000, 0x1000, 0x1000, false)), // AnalogTempSensorControllerRegs + None, // PMPDoorbell + Some(IOMapping::new(0x204d80000, 0x8000, 0x8000, true)), // MetrologySensorRegs + Some(IOMapping::new(0x204d61000, 0x1000, 0x1000, 
true)), // GMGIFAFRegs + Some(IOMapping::new(0x200000000, 0xd6400, 0xd6400, true)), // MCache registers + None, // AICBankedRegisters + None, // PMGRScratch + None, // NIA Special agent idle register die 0 + None, // NIA Special agent idle register die 1 + Some(IOMapping::new(0x204e00000, 0x10000, 0x10000, true)), // CRE registers + Some(IOMapping::new(0x27d050000, 0x4000, 0x4000, true)), // Streaming codec registers + Some(IOMapping::new(0x23b3d0000, 0x1000, 0x1000, true)), // + Some(IOMapping::new(0x23b3c0000, 0x1000, 0x1000, true)), // + ], +}; diff --git a/drivers/gpu/drm/asahi/initdata.rs b/drivers/gpu/drm/asahi/initdata.rs new file mode 100644 index 000000000000..472c42169130 --- /dev/null +++ b/drivers/gpu/drm/asahi/initdata.rs @@ -0,0 +1,777 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT +#![allow(clippy::unusual_byte_groupings)] + +//! GPU initialization data builder. +//! +//! The root of all interaction between the GPU firmware and the host driver is a complex set of +//! nested structures that we call InitData. This includes both GPU hardware/firmware configuration +//! and the pointers to the ring buffers and global data fields that are used for communication at +//! runtime. +//! +//! Many of these structures are poorly understood, so there are lots of hardcoded unknown values +//! derived from observing the InitData structures that macOS generates. + +use crate::fw::initdata::*; +use crate::fw::types::*; +use crate::{box_in_place, f32, place}; +use crate::{gpu, hw, mmu}; +use kernel::error::Result; +use kernel::macros::versions; + +/// Builder helper for the global GPU InitData. 
+#[versions(AGX)] +pub(crate) struct InitDataBuilder<'a> { + alloc: &'a mut gpu::KernelAllocators, + cfg: &'a hw::HwConfig, + dyncfg: &'a hw::DynConfig, +} + +#[versions(AGX)] +impl<'a> InitDataBuilder::ver<'a> { + /// Create a new InitData builder + pub(crate) fn new( + alloc: &'a mut gpu::KernelAllocators, + cfg: &'a hw::HwConfig, + dyncfg: &'a hw::DynConfig, + ) -> InitDataBuilder::ver<'a> { + InitDataBuilder::ver { alloc, cfg, dyncfg } + } + + /// Create the HwDataShared1 structure, which is used in two places in InitData. + #[inline(never)] + fn hw_shared1(cfg: &hw::HwConfig) -> raw::HwDataShared1 { + let mut ret = raw::HwDataShared1 { + unk_a4: cfg.shared1_a4, + ..Default::default() + }; + for (i, val) in cfg.shared1_tab.iter().enumerate() { + ret.table[i] = *val; + } + ret + } + + fn init_curve( + curve: &mut raw::HwDataShared2Curve, + unk_0: u32, + unk_4: u32, + t1: &[i16], + t2: &[i16], + t3: &[&[i32]], + ) { + curve.unk_0 = unk_0; + curve.unk_4 = unk_4; + (*curve.t1)[..t1.len()].copy_from_slice(t1); + (*curve.t1)[t1.len()..].fill(t1[0]); + (*curve.t2)[..t2.len()].copy_from_slice(t2); + (*curve.t2)[t2.len()..].fill(t2[0]); + for (i, a) in curve.t3.iter_mut().enumerate() { + a.fill(0x3ffffff); + if i < t3.len() { + let b = t3[i]; + (**a)[..b.len()].copy_from_slice(b); + } + } + } + + /// Create the HwDataShared2 structure, which is used in two places in InitData. 
+ #[inline(never)] + fn hw_shared2(cfg: &hw::HwConfig) -> Result<Box<raw::HwDataShared2>> { + let mut ret = box_in_place!(raw::HwDataShared2 { + unk_28: Array::new([0xff; 16]), + t8112: Default::default(), + unk_508: cfg.shared2_unk_508, + ..Default::default() + })?; + + for (i, val) in cfg.shared2_tab.iter().enumerate() { + ret.table[i] = *val; + } + + if cfg.chip_id == 0x8112 { + ret.t8112.unk_14 = 0x6000000; + Self::init_curve(&mut ret.t8112.curve1, 0, 0x20000000, &[-1], &[0x0f07], &[]); + Self::init_curve( + &mut ret.t8112.curve2, + 7, + 0x80000000, + &[-1, 25740, 17429, 12550, 9597, 7910, 6657, 5881, 5421], + &[ + 0x0f07, 0x04c0, 0x06c0, 0x08c0, 0x0ac0, 0x0c40, 0x0dc0, 0x0ec0, 0x0f80, + ], + &[ + &[0x3ffffff, 107, 101, 94, 87, 82, 77, 73, 71], + &[ + 0x3ffffff, 38240, 36251, 33562, 31368, 29379, 27693, 26211, 25370, + ], + &[ + 0x3ffffff, 123933, 117485, 108771, 101661, 95217, 89751, 84948, 82222, + ], + ], + ); + } + + Ok(ret) + } + + /// Create the HwDataShared3 structure, which is used in two places in InitData. + #[inline(never)] + fn hw_shared3(cfg: &hw::HwConfig) -> Result<Box<raw::HwDataShared3>> { + let mut ret = box_in_place!(raw::HwDataShared3 { + ..Default::default() + })?; + + if cfg.chip_id == 0x8112 { + ret.unk_0 = 1; + ret.unk_4 = 500; + ret.unk_8 = 5; + ret.table.copy_from_slice(&[ + 10700, 10700, 10700, 10700, 10700, 6000, 1000, 1000, 1000, 10700, 10700, 10700, + 10700, 10700, 10700, 10700, + ]); + ret.unk_4c = 1; + } + + Ok(ret) + } + + /// Create an unknown T81xx-specific data structure. + fn t81xx_data(dyncfg: &'a hw::DynConfig) -> raw::T81xxData { + raw::T81xxData { + unk_d8c: 0x80000000, + unk_d90: 4, + unk_d9c: f32!(0.6), + unk_da4: f32!(0.4), + unk_dac: f32!(0.38552), + unk_db8: f32!(65536.0), + unk_dbc: f32!(13.56), + max_pstate_scaled: 100 * dyncfg.pwr.perf_max_pstate, + ..Default::default() + } + } + + /// Create the HwDataA structure. This mostly contains power-related configuration. 
+ #[inline(never)] + fn hwdata_a(&mut self) -> Result<GpuObject<HwDataA::ver>> { + self.alloc + .private + .new_inplace(Default::default(), |_inner, ptr| { + let pwr = &self.dyncfg.pwr; + let period_ms = pwr.power_sample_period; + let period_s = F32::from(period_ms) / f32!(1000.0); + let ppm_filter_tc_periods = pwr.ppm_filter_time_constant_ms / period_ms; + #[ver(V >= V13_0B4)] + let ppm_filter_tc_ms_rounded = ppm_filter_tc_periods * period_ms; + let ppm_filter_a = f32!(1.0) / ppm_filter_tc_periods.into(); + let perf_filter_a = f32!(1.0) / pwr.perf_filter_time_constant.into(); + let perf_filter_a2 = f32!(1.0) / pwr.perf_filter_time_constant2.into(); + let avg_power_target_filter_a = f32!(1.0) / pwr.avg_power_target_filter_tc.into(); + let avg_power_filter_tc_periods = pwr.avg_power_filter_tc_ms / period_ms; + #[ver(V >= V13_0B4)] + let avg_power_filter_tc_ms_rounded = avg_power_filter_tc_periods * period_ms; + let avg_power_filter_a = f32!(1.0) / avg_power_filter_tc_periods.into(); + let pwr_filter_a = f32!(1.0) / pwr.pwr_filter_time_constant.into(); + + let base_ps = pwr.perf_base_pstate; + let base_ps_scaled = 100 * base_ps; + let max_ps = pwr.perf_max_pstate; + let max_ps_scaled = 100 * max_ps; + let boost_ps_count = max_ps - base_ps; + + let base_clock_khz = self.cfg.base_clock_hz / 1000; + let clocks_per_period = base_clock_khz * period_ms; + + let raw = place!( + ptr, + raw::HwDataA::ver { + clocks_per_period: clocks_per_period, + #[ver(V >= V13_0B4)] + clocks_per_period_2: clocks_per_period, + pwr_status: AtomicU32::new(4), + unk_10: f32!(1.0), + actual_pstate: 1, + tgt_pstate: 1, + base_pstate_scaled: base_ps_scaled, + unk_40: 1, + max_pstate_scaled: max_ps_scaled, + min_pstate_scaled: 100, + unk_64c: 625, + pwr_filter_a_neg: f32!(1.0) - pwr_filter_a, + pwr_filter_a: pwr_filter_a, + pwr_integral_gain: pwr.pwr_integral_gain, + pwr_integral_min_clamp: pwr.pwr_integral_min_clamp.into(), + max_power_1: pwr.max_power_mw.into(), + pwr_proportional_gain: 
pwr.pwr_proportional_gain, + pwr_pstate_related_k: -F32::from(max_ps_scaled) / pwr.max_power_mw.into(), + pwr_pstate_max_dc_offset: pwr.pwr_min_duty_cycle as i32 + - max_ps_scaled as i32, + max_pstate_scaled_2: max_ps_scaled, + max_power_2: pwr.max_power_mw, + max_pstate_scaled_3: max_ps_scaled, + ppm_filter_tc_periods_x4: ppm_filter_tc_periods * 4, + ppm_filter_a_neg: f32!(1.0) - ppm_filter_a, + ppm_filter_a: ppm_filter_a, + ppm_ki_dt: pwr.ppm_ki * period_s, + unk_6fc: f32!(65536.0), + ppm_kp: pwr.ppm_kp, + pwr_min_duty_cycle: pwr.pwr_min_duty_cycle, + max_pstate_scaled_4: max_ps_scaled, + unk_71c: f32!(0.0), + max_power_3: pwr.max_power_mw, + cur_power_mw_2: 0x0, + ppm_filter_tc_ms: pwr.ppm_filter_time_constant_ms, + #[ver(V >= V13_0B4)] + ppm_filter_tc_clks: ppm_filter_tc_ms_rounded * base_clock_khz, + perf_tgt_utilization: pwr.perf_tgt_utilization, + perf_boost_min_util: pwr.perf_boost_min_util, + perf_boost_ce_step: pwr.perf_boost_ce_step, + perf_reset_iters: pwr.perf_reset_iters, + unk_774: 6, + unk_778: 1, + perf_filter_drop_threshold: pwr.perf_filter_drop_threshold, + perf_filter_a_neg: f32!(1.0) - perf_filter_a, + perf_filter_a2_neg: f32!(1.0) - perf_filter_a2, + perf_filter_a: perf_filter_a, + perf_filter_a2: perf_filter_a2, + perf_ki: pwr.perf_integral_gain, + perf_ki2: pwr.perf_integral_gain2, + perf_integral_min_clamp: pwr.perf_integral_min_clamp.into(), + unk_79c: f32!(95.0), + perf_kp: pwr.perf_proportional_gain, + perf_kp2: pwr.perf_proportional_gain2, + boost_state_unk_k: F32::from(boost_ps_count) / f32!(0.95), + base_pstate_scaled_2: base_ps_scaled, + max_pstate_scaled_5: max_ps_scaled, + base_pstate_scaled_3: base_ps_scaled, + perf_tgt_utilization_2: pwr.perf_tgt_utilization, + base_pstate_scaled_4: base_ps_scaled, + unk_7fc: f32!(65536.0), + pwr_min_duty_cycle_2: pwr.pwr_min_duty_cycle.into(), + max_pstate_scaled_6: max_ps_scaled.into(), + max_freq_mhz: pwr.max_freq_mhz, + pwr_min_duty_cycle_3: pwr.pwr_min_duty_cycle, + min_pstate_scaled_4: 
f32!(100.0), + max_pstate_scaled_7: max_ps_scaled, + unk_alpha_neg: f32!(0.8), + unk_alpha: f32!(0.2), + fast_die0_sensor_mask: U64(self.cfg.fast_die0_sensor_mask), + fast_die0_release_temp_cc: 100 * pwr.fast_die0_release_temp, + unk_87c: self.cfg.da.unk_87c, + unk_880: 0x4, + unk_894: f32!(1.0), + + fast_die0_ki_dt: pwr.fast_die0_integral_gain * period_s, + unk_8a8: f32!(65536.0), + fast_die0_kp: pwr.fast_die0_proportional_gain, + pwr_min_duty_cycle_4: pwr.pwr_min_duty_cycle, + max_pstate_scaled_8: max_ps_scaled, + max_pstate_scaled_9: max_ps_scaled, + fast_die0_prop_tgt_delta: 100 * pwr.fast_die0_prop_tgt_delta, + unk_8cc: self.cfg.da.unk_8cc, + max_pstate_scaled_10: max_ps_scaled, + max_pstate_scaled_11: max_ps_scaled, + unk_c2c: 1, + power_zone_count: pwr.power_zones.len() as u32, + max_power_4: pwr.max_power_mw, + max_power_5: pwr.max_power_mw, + max_power_6: pwr.max_power_mw, + avg_power_target_filter_a_neg: f32!(1.0) - avg_power_target_filter_a, + avg_power_target_filter_a: avg_power_target_filter_a, + avg_power_target_filter_tc_x4: 4 * pwr.avg_power_target_filter_tc, + avg_power_target_filter_tc_xperiod: period_ms + * pwr.avg_power_target_filter_tc, + #[ver(V >= V13_0B4)] + avg_power_target_filter_tc_clks: period_ms + * pwr.avg_power_target_filter_tc + * base_clock_khz, + avg_power_filter_tc_periods_x4: 4 * avg_power_filter_tc_periods, + avg_power_filter_a_neg: f32!(1.0) - avg_power_filter_a, + avg_power_filter_a: avg_power_filter_a, + avg_power_ki_dt: pwr.avg_power_ki_only * period_s, + unk_d20: f32!(65536.0), + avg_power_kp: pwr.avg_power_kp, + avg_power_min_duty_cycle: pwr.avg_power_min_duty_cycle, + max_pstate_scaled_12: max_ps_scaled, + max_pstate_scaled_13: max_ps_scaled, + max_power_7: pwr.max_power_mw.into(), + max_power_8: pwr.max_power_mw, + avg_power_filter_tc_ms: pwr.avg_power_filter_tc_ms, + #[ver(V >= V13_0B4)] + avg_power_filter_tc_clks: avg_power_filter_tc_ms_rounded * base_clock_khz, + max_pstate_scaled_14: max_ps_scaled, + t81xx_data: 
match self.cfg.chip_id { + 0x8103 | 0x8112 => Self::t81xx_data(self.dyncfg), + _ => Default::default(), + }, + #[ver(V >= V13_0B4)] + unk_e10_0: raw::HwDataA130Extra { + unk_38: 4, + unk_3c: 8000, + unk_40: 2500, + unk_48: 0xffffffff, + unk_4c: 50, + unk_54: 50, + unk_58: 0x1, + unk_60: f32!(0.8888889), + unk_64: f32!(0.6666667), + unk_68: f32!(0.11111111), + unk_6c: f32!(0.33333333), + unk_70: f32!(-0.4), + unk_74: f32!(-0.8), + unk_7c: f32!(65536.0), + unk_80: f32!(-5.0), + unk_84: f32!(-10.0), + unk_8c: 40, + max_pstate_scaled_1: max_ps_scaled, + unk_9c: f32!(8000.0), + unk_a0: 1400, + unk_a8: 72, + unk_ac: 24, + unk_b0: 1728000, + unk_b8: 576000, + unk_c4: f32!(65536.0), + unk_114: f32!(65536.0), + unk_124: 40, + max_pstate_scaled_2: max_ps_scaled, + ..Default::default() + }, + fast_die0_sensor_mask_2: U64(self.cfg.fast_die0_sensor_mask), + unk_e24: self.cfg.da.unk_e24, + unk_e28: 1, + fast_die0_sensor_mask_alt: U64(self.cfg.fast_die0_sensor_mask_alt), + #[ver(V < V13_0B4)] + fast_die0_sensor_present: U64(self.cfg.fast_die0_sensor_present as u64), + unk_163c: 1, + unk_3644: 0, + hws1: Self::hw_shared1(self.cfg), + hws2: *Self::hw_shared2(self.cfg)?, + hws3: *Self::hw_shared3(self.cfg)?, + unk_3ce8: 1, + ..Default::default() + } + ); + + for i in 0..self.dyncfg.pwr.perf_states.len() { + raw.sram_k[i] = self.cfg.sram_k; + } + + for (i, coef) in pwr.core_leak_coef.iter().enumerate() { + raw.core_leak_coef[i] = *coef; + } + + for (i, coef) in pwr.sram_leak_coef.iter().enumerate() { + raw.sram_leak_coef[i] = *coef; + } + + for i in 0..self.dyncfg.id.num_clusters as usize { + if let Some(coef_a) = self.cfg.unk_coef_a.get(i) { + (*raw.unk_coef_a1[i])[..coef_a.len()].copy_from_slice(coef_a); + (*raw.unk_coef_a2[i])[..coef_a.len()].copy_from_slice(coef_a); + } + if let Some(coef_b) = self.cfg.unk_coef_b.get(i) { + (*raw.unk_coef_b1[i])[..coef_b.len()].copy_from_slice(coef_b); + (*raw.unk_coef_b2[i])[..coef_b.len()].copy_from_slice(coef_b); + } + } + + for (i, pz) in 
pwr.power_zones.iter().enumerate() { + raw.power_zones[i].target = pz.target; + raw.power_zones[i].target_off = pz.target - pz.target_offset; + raw.power_zones[i].filter_tc_x4 = 4 * pz.filter_tc; + raw.power_zones[i].filter_tc_xperiod = period_ms * pz.filter_tc; + let filter_a = f32!(1.0) / pz.filter_tc.into(); + raw.power_zones[i].filter_a = filter_a; + raw.power_zones[i].filter_a_neg = f32!(1.0) - filter_a; + #[ver(V >= V13_0B4)] + raw.power_zones[i].unk_10 = 1320000000; + } + + Ok(raw) + }) + } + + /// Create the HwDataB structure. This mostly contains GPU-related configuration. + #[inline(never)] + fn hwdata_b(&mut self) -> Result<GpuObject<HwDataB::ver>> { + self.alloc + .private + .new_inplace(Default::default(), |_inner, ptr| { + let raw = place!( + ptr, + raw::HwDataB::ver { + // Userspace VA map related + #[ver(V < V13_0B4)] + unk_0: U64(0x13_00000000), + unk_8: U64(0x14_00000000), + #[ver(V < V13_0B4)] + unk_10: U64(0x1_00000000), + unk_18: U64(0xffc00000), + unk_20: U64(0x11_00000000), + unk_28: U64(0x11_00000000), + // userspace address? + unk_30: U64(0x6f_ffff8000), + // unmapped? 
+ unkptr_38: U64(0xffffffa0_11800000), + // TODO: yuv matrices + chip_id: self.cfg.chip_id, + unk_454: 0x1, + unk_458: 0x1, + unk_460: 0x1, + unk_464: 0x1, + unk_468: 0x1, + unk_47c: 0x1, + unk_484: 0x1, + unk_48c: 0x1, + base_clock_khz: self.cfg.base_clock_hz / 1000, + power_sample_period: self.dyncfg.pwr.power_sample_period, + unk_49c: 0x1, + unk_4a0: 0x1, + unk_4a4: 0x1, + unk_4c0: 0x1f, + unk_4e0: U64(self.cfg.db.unk_4e0), + unk_4f0: 0x1, + unk_4f4: 0x1, + unk_504: 0x31, + unk_524: 0x1, // use_secure_cache_flush + unk_534: self.cfg.db.unk_534, + num_frags: self.dyncfg.id.num_frags * self.dyncfg.id.num_clusters, + unk_554: 0x1, + uat_ttb_base: U64(self.dyncfg.uat_ttb_base), + gpu_core_id: self.cfg.gpu_core as u32, + gpu_rev_id: self.dyncfg.id.gpu_rev_id as u32, + num_cores: self.dyncfg.id.num_cores * self.dyncfg.id.num_clusters, + max_pstate: self.dyncfg.pwr.perf_states.len() as u32 - 1, + #[ver(V < V13_0B4)] + num_pstates: self.dyncfg.pwr.perf_states.len() as u32, + #[ver(V < V13_0B4)] + min_sram_volt: self.dyncfg.pwr.min_sram_microvolt / 1000, + #[ver(V < V13_0B4)] + unk_ab8: self.cfg.db.unk_ab8, + #[ver(V < V13_0B4)] + unk_abc: self.cfg.db.unk_abc, + #[ver(V < V13_0B4)] + unk_ac0: 0x1020, + + #[ver(V >= V13_0B4)] + unk_ae4: Array::new([0x0, 0x3, 0x7, 0x7]), + #[ver(V < V13_0B4)] + unk_ae4: Array::new([0x0, 0xf, 0x3f, 0x3f]), + unk_b10: 0x1, + unk_b24: 0x1, + unk_b28: 0x1, + unk_b2c: 0x1, + unk_b30: self.cfg.db.unk_b30, + #[ver(V >= V13_0B4)] + unk_b38_0: 1, + #[ver(V >= V13_0B4)] + unk_b38_4: 1, + unk_b38: Array::new([0xffffffff; 12]), + #[ver(V >= V13_0B4)] + unk_c3c: 0x19, + ..Default::default() + } + ); + + let base_ps = self.dyncfg.pwr.perf_base_pstate as usize; + let max_ps = self.dyncfg.pwr.perf_max_pstate as usize; + let base_freq = self.dyncfg.pwr.perf_states[base_ps].freq_hz; + let max_freq = self.dyncfg.pwr.perf_states[max_ps].freq_hz; + + for (i, ps) in self.dyncfg.pwr.perf_states.iter().enumerate() { + raw.frequencies[i] = ps.freq_hz / 1000000; + 
for (j, mv) in ps.volt_mv.iter().enumerate() { + let sram_mv = (*mv).max(self.dyncfg.pwr.min_sram_microvolt / 1000); + raw.voltages[i][j] = *mv; + raw.voltages_sram[i][j] = sram_mv; + } + raw.sram_k[i] = self.cfg.sram_k; + raw.rel_max_powers[i] = ps.pwr_mw * 100 / self.dyncfg.pwr.max_power_mw; + raw.rel_boost_freqs[i] = if i > base_ps { + (ps.freq_hz - base_freq) / ((max_freq - base_freq) / 100) + } else { + 0 + }; + } + + Ok(raw) + }) + } + + /// Create the Globals structure, which contains global firmware config including more power + /// configuration data and globals used to exchange state between the firmware and driver. + #[inline(never)] + fn globals(&mut self) -> Result<GpuObject<Globals::ver>> { + self.alloc + .shared + .new_inplace(Default::default(), |_inner, ptr| { + let pwr = &self.dyncfg.pwr; + let period_ms = pwr.power_sample_period; + let period_s = F32::from(period_ms) / f32!(1000.0); + let avg_power_filter_tc_periods = pwr.avg_power_filter_tc_ms / period_ms; + + let max_ps = pwr.perf_max_pstate; + let max_ps_scaled = 100 * max_ps; + + let raw = place!( + ptr, + raw::Globals::ver { + //ktrace_enable: 0xffffffff, + ktrace_enable: 0, + #[ver(V >= V13_2)] + unk_24_0: 3000, + unk_24: 0, + #[ver(V >= V13_0B4)] + unk_28_0: 0, // debug + unk_28: 1, + #[ver(V >= V13_0B4)] + unk_2c_0: 0, + unk_2c: 1, + unk_30: 0, + unk_34: 120, + sub: raw::GlobalsSub::ver { + unk_54: 0xffff, + unk_56: 40, + unk_58: 0xffff, + unk_5e: U32(1), + unk_66: U32(1), + ..Default::default() + }, + unk_8900: 1, + pending_submissions: AtomicU32::new(0), + max_power: pwr.max_power_mw, + max_pstate_scaled: max_ps_scaled, + max_pstate_scaled_2: max_ps_scaled, + max_pstate_scaled_3: max_ps_scaled, + power_zone_count: pwr.power_zones.len() as u32, + avg_power_filter_tc_periods: avg_power_filter_tc_periods, + avg_power_ki_dt: pwr.avg_power_ki_only * period_s, + avg_power_kp: pwr.avg_power_kp, + avg_power_min_duty_cycle: pwr.avg_power_min_duty_cycle, + avg_power_target_filter_tc: 
pwr.avg_power_target_filter_tc, + unk_89bc: self.cfg.da.unk_8cc, + fast_die0_release_temp: 100 * pwr.fast_die0_release_temp, + unk_89c4: self.cfg.da.unk_87c, + fast_die0_prop_tgt_delta: 100 * pwr.fast_die0_prop_tgt_delta, + fast_die0_kp: pwr.fast_die0_proportional_gain, + fast_die0_ki_dt: pwr.fast_die0_integral_gain * period_s, + unk_89e0: 1, + max_power_2: pwr.max_power_mw, + ppm_kp: pwr.ppm_kp, + ppm_ki_dt: pwr.ppm_ki * period_s, + #[ver(V >= V13_0B4)] + unk_89f4_8: 1, + unk_89f4: 0, + hws1: Self::hw_shared1(self.cfg), + hws2: *Self::hw_shared2(self.cfg)?, + hws3: *Self::hw_shared3(self.cfg)?, + unk_900c: 1, + #[ver(V >= V13_0B4)] + unk_9010_0: 1, + #[ver(V >= V13_0B4)] + unk_903c: 1, + #[ver(V < V13_0B4)] + unk_903c: 0, + fault_control: *crate::fault_control.read(), + do_init: 1, + unk_11020: 40, + unk_11024: 10, + unk_11028: 250, + #[ver(V >= V13_0B4)] + unk_1102c_0: 1, + #[ver(V >= V13_0B4)] + unk_1102c_4: 1, + #[ver(V >= V13_0B4)] + unk_1102c_8: 100, + #[ver(V >= V13_0B4)] + unk_1102c_c: 1, + idle_off_delay_ms: AtomicU32::new(pwr.idle_off_delay_ms), + fender_idle_off_delay_ms: pwr.fender_idle_off_delay_ms, + fw_early_wake_timeout_ms: pwr.fw_early_wake_timeout_ms, + unk_118e0: 40, + #[ver(V >= V13_0B4)] + unk_118e4_0: 50, + #[ver(V >= V13_0B4)] + unk_11edc: 0, + #[ver(V >= V13_0B4)] + unk_11efc: 0, + ..Default::default() + } + ); + + for (i, pz) in pwr.power_zones.iter().enumerate() { + raw.power_zones[i].target = pz.target; + raw.power_zones[i].target_off = pz.target - pz.target_offset; + raw.power_zones[i].filter_tc = pz.filter_tc; + } + + if let Some(tab) = self.cfg.global_tab.as_ref() { + for (i, x) in tab.iter().enumerate() { + raw.unk_118ec[i] = *x; + } + raw.unk_118e8 = 1; + } + + Ok(raw) + }) + } + + /// Create the RuntimePointers structure, which contains pointers to most of the other + /// structures including the ring buffer channels, statistics structures, and HwDataA/HwDataB. 
+ +#[inline(never)] + fn runtime_pointers(&mut self) -> Result<GpuObject<RuntimePointers::ver>> { + let hwa = self.hwdata_a()?; + let hwb = self.hwdata_b()?; + + let pointers: Box<RuntimePointers::ver> = box_in_place!(RuntimePointers::ver { + stats: Stats::ver { + vtx: self.alloc.private.new_default::<GpuGlobalStatsVtx::ver>()?, + frag: self.alloc.private.new_inplace( + Default::default(), + |_inner, ptr: &mut MaybeUninit<raw::GpuGlobalStatsFrag::ver>| { + Ok(place!( + ptr, + raw::GpuGlobalStatsFrag::ver { + stats: raw::GpuStatsFrag::ver { + cur_stamp_id: -1, + unk_118: -1, + ..Default::default() + }, + ..Default::default() + } + )) + }, + )?, + comp: self.alloc.private.new_default::<GpuStatsComp>()?, + }, + + hwdata_a: hwa, + unkptr_190: self.alloc.private.array_empty(0x80)?, + unkptr_198: self.alloc.private.array_empty(0xc0)?, + hwdata_b: hwb, + + unkptr_1b8: self.alloc.private.array_empty(0x1000)?, + unkptr_1c0: self.alloc.private.array_empty(0x300)?, + unkptr_1c8: self.alloc.private.array_empty(0x1000)?, + + buffer_mgr_ctl: self.alloc.gpu.array_empty(127)?, + })?; + + self.alloc.private.new_boxed(pointers, |inner, ptr| { + Ok(place!( + ptr, + raw::RuntimePointers::ver { + pipes: Default::default(), + device_control: Default::default(), + event: Default::default(), + fw_log: Default::default(), + ktrace: Default::default(), + stats: Default::default(), + + stats_vtx: inner.stats.vtx.gpu_pointer(), + stats_frag: inner.stats.frag.gpu_pointer(), + stats_comp: inner.stats.comp.gpu_pointer(), + + hwdata_a: inner.hwdata_a.gpu_pointer(), + unkptr_190: inner.unkptr_190.gpu_pointer(), + unkptr_198: inner.unkptr_198.gpu_pointer(), + hwdata_b: inner.hwdata_b.gpu_pointer(), + hwdata_b_2: inner.hwdata_b.gpu_pointer(), + + fwlog_buf: None, + + unkptr_1b8: inner.unkptr_1b8.gpu_pointer(), + unkptr_1c0: inner.unkptr_1c0.gpu_pointer(), + unkptr_1c8: inner.unkptr_1c8.gpu_pointer(), + + buffer_mgr_ctl: inner.buffer_mgr_ctl.gpu_pointer(), + buffer_mgr_ctl_2: 
inner.buffer_mgr_ctl.gpu_pointer(), + + __pad0: Default::default(), + unk_160: U64(0), + unk_168: U64(0), + unk_1d0: 0, + unk_1d4: 0, + unk_1d8: Default::default(), + + __pad1: Default::default(), + gpu_scratch: raw::RuntimeScratch { + unk_6b38: 0xff, + ..Default::default() + }, + } + )) + }) + } + + /// Create the FwStatus structure, which is used to coordinate the firmware halt state between + /// the firmware and the driver. + #[inline(never)] + fn fw_status(&mut self) -> Result<GpuObject<FwStatus>> { + self.alloc + .shared + .new_object(Default::default(), |_inner| Default::default()) + } + + /// Create one UatLevelInfo structure, which describes one level of translation for the UAT MMU. + #[inline(never)] + fn uat_level_info( + cfg: &hw::HwConfig, + index_shift: usize, + num_entries: usize, + ) -> raw::UatLevelInfo { + raw::UatLevelInfo { + index_shift: index_shift as _, + unk_1: 14, + unk_2: 14, + unk_3: 8, + unk_4: 0x4000, + num_entries: num_entries as _, + unk_8: U64(1), + unk_10: U64(((1u64 << cfg.uat_oas) - 1) & !(mmu::UAT_PGMSK as u64)), + index_mask: U64(((num_entries - 1) << index_shift) as u64), + } + } + + /// Build the top-level InitData object. 
+ +#[inline(never)] + pub(crate) fn build(&mut self) -> Result<Box<GpuObject<InitData::ver>>> { + let inner: Box<InitData::ver> = box_in_place!(InitData::ver { + unk_buf: self.alloc.shared_ro.array_empty(0x4000)?, + runtime_pointers: self.runtime_pointers()?, + globals: self.globals()?, + fw_status: self.fw_status()?, + })?; + + Ok(Box::try_new(self.alloc.shared_ro.new_boxed( + inner, + |inner, ptr| { + Ok(place!( + ptr, + raw::InitData::ver { + #[ver(V >= V13_0B4)] + ver_info: Array::new([1, 1, 16, 1]), + unk_buf: inner.unk_buf.gpu_pointer(), + unk_8: 0, + unk_c: 0, + runtime_pointers: inner.runtime_pointers.gpu_pointer(), + globals: inner.globals.gpu_pointer(), + fw_status: inner.fw_status.gpu_pointer(), + uat_page_size: 0x4000, + uat_page_bits: 14, + uat_num_levels: 3, + uat_level_info: Array::new([ + Self::uat_level_info(self.cfg, 36, 8), + Self::uat_level_info(self.cfg, 25, 2048), + Self::uat_level_info(self.cfg, 14, 2048), + ]), + __pad0: Default::default(), + host_mapped_fw_allocations: 1, + unk_ac: 0, + unk_b0: 0, + unk_b4: 0, + unk_b8: 0, + } + )) + }, + )?)?) + } +} diff --git a/drivers/gpu/drm/asahi/mem.rs b/drivers/gpu/drm/asahi/mem.rs new file mode 100644 index 000000000000..491d4f8a4016 --- /dev/null +++ b/drivers/gpu/drm/asahi/mem.rs @@ -0,0 +1,133 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! ARM64 low level memory operations. +//! +//! This GPU uses CPU-side `tlbi` outer-shareable instructions to manage its TLBs. +//! Yes, really. Even though the VA address spaces are unrelated. +//! +//! Right now we pick our own ASIDs and don't coordinate with the CPU. This might result +//! in needless TLB shootdowns on the CPU side... TODO: fix this. + +use core::arch::asm; +use core::cmp::min; + +use crate::debug::*; +use crate::mmu; + +type Asid = u8; + +/// Invalidate the entire GPU TLB. +#[inline(always)] +pub(crate) fn tlbi_all() { + unsafe { + asm!(".arch armv8.4-a", "tlbi vmalle1os",); + } +} + +/// Invalidate all TLB entries for a given ASID. 
+#[inline(always)] +pub(crate) fn tlbi_asid(asid: Asid) { + if debug_enabled(DebugFlags::ConservativeTlbi) { + tlbi_all(); + sync(); + return; + } + + unsafe { + asm!( + ".arch armv8.4-a", + "tlbi aside1os, {x}", + x = in(reg) ((asid as u64) << 48) + ); + } +} + +/// Invalidate a single page for a given ASID. +#[inline(always)] +pub(crate) fn tlbi_page(asid: Asid, va: usize) { + if debug_enabled(DebugFlags::ConservativeTlbi) { + tlbi_all(); + sync(); + return; + } + + let val: u64 = ((asid as u64) << 48) | ((va as u64 >> 12) & 0xffffffffffc); + unsafe { + asm!( + ".arch armv8.4-a", + "tlbi vae1os, {x}", + x = in(reg) val + ); + } +} + +/// Invalidate a range of pages for a given ASID. +#[inline(always)] +pub(crate) fn tlbi_range(asid: Asid, va: usize, len: usize) { + if debug_enabled(DebugFlags::ConservativeTlbi) { + tlbi_all(); + sync(); + return; + } + + if len == 0 { + return; + } + + let start_pg = va >> mmu::UAT_PGBIT; + let end_pg = (va + len + mmu::UAT_PGMSK) >> mmu::UAT_PGBIT; + + let mut val: u64 = ((asid as u64) << 48) | (2 << 46) | (start_pg as u64 & 0x1fffffffff); + let pages = end_pg - start_pg; + + if pages == 1 { + tlbi_page(asid, va); + return; + } + + // Page count is always in units of 2 + let num = ((pages + 1) >> 1) as u64; + // base: 5 bits + // exp: 2 bits + // pages = (base + 1) << (5 * exp + 1) + // 0:00000 -> 2 pages = 2 << 0 + // 0:11111 -> 32 * 2 pages = 2 << 5 + // 1:00000 -> 1 * 32 * 2 pages = 2 << 5 + // 1:11111 -> 32 * 32 * 2 pages = 2 << 10 + // 2:00000 -> 1 * 32 * 32 * 2 pages = 2 << 10 + // 2:11111 -> 32 * 32 * 32 * 2 pages = 2 << 15 + // 3:00000 -> 1 * 32 * 32 * 32 * 2 pages = 2 << 15 + // 3:11111 -> 32 * 32 * 32 * 32 * 2 pages = 2 << 20 + let exp = min(3, (64 - num.leading_zeros()) / 5); + let bits = 5 * exp; + let mut base = (num + (1 << bits) - 1) >> bits; + + val |= (exp as u64) << 44; + + while base > 32 { + unsafe { + asm!( + ".arch armv8.4-a", + "tlbi rvae1os, {x}", + x = in(reg) val | (31 << 39) + ); + } + base -= 32; + } 
+
+    unsafe {
+        asm!(
+            ".arch armv8.4-a",
+            "tlbi rvae1os, {x}",
+            x = in(reg) val | ((base - 1) << 39)
+        );
+    }
+}
+
+/// Issue a memory barrier (`dsb sy`).
+#[inline(always)]
+pub(crate) fn sync() {
+    unsafe {
+        asm!("dsb sy");
+    }
+}
diff --git a/drivers/gpu/drm/asahi/microseq.rs b/drivers/gpu/drm/asahi/microseq.rs
new file mode 100644
index 000000000000..dca94ebc53a1
--- /dev/null
+++ b/drivers/gpu/drm/asahi/microseq.rs
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+
+//! GPU Micro operation sequence builder
+//!
+//! As part of a single job submission to the GPU, the GPU firmware interprets a sequence of
+//! commands that we call a "microsequence". These are responsible for setting up the job execution,
+//! timestamping the process, waiting for completion, tearing down any resources, and signaling
+//! completion to the driver via the event stamp mechanism.
+//!
+//! Although the microsequences used by the macOS driver are usually quite uniform and simple, the
+//! firmware actually implements enough operations to make this interpreter Turing-complete (!).
+//! Most of those aren't implemented yet, since we don't need them, but they could come in handy in
+//! the future to do strange things or work around firmware bugs...
+//!
+//! This module simply implements a collection of microsequence operations that can be appended to
+//! and later concatenated into one buffer, ready for firmware execution.
+
+use crate::fw::microseq;
+pub(crate) use crate::fw::microseq::*;
+use crate::fw::types::*;
+use kernel::prelude::*;
+
+/// MicroSequence object type, which is just an opaque byte array.
+pub(crate) type MicroSequence = GpuArray<u8>;
+
+/// MicroSequence builder.
+pub(crate) struct Builder {
+    ops: Vec<u8>,
+}
+
+impl Builder {
+    /// Create a new Builder object
+    pub(crate) fn new() -> Builder {
+        Builder { ops: Vec::new() }
+    }
+
+    /// Get the relative offset from the current pointer to a given target offset.
+ /// + /// Used for relative jumps. + pub(crate) fn offset_to(&self, target: i32) -> i32 { + target - self.ops.len() as i32 + } + + /// Add an operation to the end of the sequence. + pub(crate) fn add<T: microseq::Operation>(&mut self, op: T) -> Result<i32> { + let off = self.ops.len(); + let p: *const T = &op; + let p: *const u8 = p as *const u8; + let s: &[u8] = unsafe { core::slice::from_raw_parts(p, core::mem::size_of::<T>()) }; + self.ops.try_extend_from_slice(s)?; + Ok(off as i32) + } + + /// Collect all submitted operations into a finalized GPU object. + pub(crate) fn build(self, alloc: &mut Allocator) -> Result<MicroSequence> { + let mut array = alloc.array_empty::<u8>(self.ops.len())?; + + array.as_mut_slice().clone_from_slice(self.ops.as_slice()); + Ok(array) + } +} diff --git a/drivers/gpu/drm/asahi/mmu.rs b/drivers/gpu/drm/asahi/mmu.rs new file mode 100644 index 000000000000..226ca0b7c1d7 --- /dev/null +++ b/drivers/gpu/drm/asahi/mmu.rs @@ -0,0 +1,1249 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! GPU UAT (MMU) management +//! +//! AGX GPUs use an MMU called the UAT, which is largely compatible with the ARM64 page table +//! format. This module manages the global MMU structures, including a shared handoff structure +//! that is used to coordinate VM management operations with the firmware, the TTBAT which points +//! to currently active GPU VM contexts, as well as the individual `Vm` operations to map and +//! unmap buffer objects into a single user or kernel address space. +//! +//! The actual page table management is delegated to the common kernel `io_pgtable` code. 
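The base/exp range encoding computed in `mem::tlbi_range` above is compact enough to be worth sanity-checking outside the kernel. The following is a standalone sketch (plain userspace Rust, not driver code) of the same arithmetic, following the comment table: page counts are encoded in units of 2 pages, scaled by `32^exp`, with `base` rounded up so the invalidation always covers at least the requested range:

```rust
// Standalone model of the base/exp arithmetic in `tlbi_range`.
fn range_encoding(pages: u64) -> (u32, u64) {
    // Page count is always in units of 2
    let num = (pages + 1) >> 1;
    // exp: how many 5-bit scale steps are needed (capped at 3)
    let exp = ((64 - num.leading_zeros()) / 5).min(3);
    let bits = 5 * exp;
    // base is rounded up so the invalidation covers at least `pages`
    let base = (num + (1u64 << bits) - 1) >> bits;
    (exp, base)
}

// Pages covered by one encoding: base * 2 * 32^exp = base << (5 * exp + 1).
fn covered(exp: u32, base: u64) -> u64 {
    base << (5 * exp + 1)
}

fn main() {
    for pages in [1u64, 2, 3, 63, 100, 4096, 1 << 20] {
        let (exp, base) = range_encoding(pages);
        // The encoding may over-invalidate, but never under-invalidates.
        assert!(covered(exp, base) >= pages);
    }
    println!("{:?}", range_encoding(100)); // (1, 2): 2 units of 64 pages = 128
}
```

Note that `base` can exceed 32 for very large ranges, which is why the driver loop above splits the invalidation into multiple `tlbi rvae1os` instructions with the NUM field saturated at 31.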
+ +use core::fmt::Debug; +use core::mem::size_of; +use core::ptr::{addr_of_mut, NonNull}; +use core::sync::atomic::{fence, AtomicU32, AtomicU64, AtomicU8, Ordering}; +use core::time::Duration; + +use kernel::{ + bindings, c_str, delay, device, + drm::mm, + error::{to_result, Result}, + io_pgtable, + io_pgtable::{prot, AppleUAT, IoPageTable}, + prelude::*, + sync::{smutex::Mutex, Guard}, + sync::{Arc, LockClassKey, UniqueArc}, + time, + types::ForeignOwnable, +}; + +use crate::debug::*; +use crate::no_debug; +use crate::{driver, fw, gem, hw, mem, slotalloc}; + +const DEBUG_CLASS: DebugFlags = DebugFlags::Mmu; + +/// PPL magic number for the handoff region +const PPL_MAGIC: u64 = 0x4b1d000000000002; + +/// Number of supported context entries in the TTBAT +const UAT_NUM_CTX: usize = 64; +/// First context available for users +const UAT_USER_CTX_START: usize = 1; +/// Number of available user contexts +const UAT_USER_CTX: usize = UAT_NUM_CTX - UAT_USER_CTX_START; + +/// Number of bits in a page offset. +pub(crate) const UAT_PGBIT: usize = 14; +/// UAT page size. +pub(crate) const UAT_PGSZ: usize = 1 << UAT_PGBIT; +/// UAT page offset mask. +pub(crate) const UAT_PGMSK: usize = UAT_PGSZ - 1; + +type Pte = AtomicU64; + +/// Number of PTEs per page. 
+const UAT_NPTE: usize = UAT_PGSZ / size_of::<Pte>(); + +/// UAT input address space (user) +pub(crate) const UAT_IAS: usize = 39; +/// "Fake" kernel UAT input address space (one page level lower) +pub(crate) const UAT_IAS_KERN: usize = 36; + +/// Lower/user base VA +const IOVA_USER_BASE: usize = UAT_PGSZ; +/// Lower/user top VA +const IOVA_USER_TOP: usize = (1 << UAT_IAS) - 1; +/// Upper/kernel base VA +// const IOVA_TTBR1_BASE: usize = 0xffffff8000000000; +/// Driver-managed kernel base VA +const IOVA_KERN_BASE: usize = 0xffffffa000000000; +/// Driver-managed kernel top VA +const IOVA_KERN_TOP: usize = 0xffffffafffffffff; + +const TTBR_VALID: u64 = 0x1; // BIT(0) +const TTBR_ASID_SHIFT: usize = 48; + +const PTE_TABLE: u64 = 0x3; // BIT(0) | BIT(1) + +// Mapping protection types + +// Note: prot::CACHE means "cache coherency", which for UAT means *uncached*, +// since uncached mappings from the GFX ASC side are cache coherent with the AP cache. +// Not having that flag means *cached noncoherent*. 
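Since `prot::CACHE` is inverted relative to its usual meaning here, a small standalone sketch (with placeholder bit values, not the kernel's real `io_pgtable` prot bits; only the flag combinations mirror the constants below) may help clarify which mappings need the remap-and-flush dance performed by `Mapping`'s `Drop` implementation later in this file:

```rust
// Sketch of the UAT protection-flag convention. Bit values are
// placeholders for illustration, not the kernel's io_pgtable bits.
mod prot {
    pub const READ: u32 = 1 << 0;
    pub const WRITE: u32 = 1 << 1;
    pub const PRIV: u32 = 1 << 2;
    // On UAT, "cache coherent" effectively means *uncached* for the firmware.
    pub const CACHE: u32 = 1 << 3;
}

/// Firmware shared (uncached) RW
const PROT_FW_SHARED_RW: u32 = prot::PRIV | prot::READ | prot::WRITE | prot::CACHE;
/// Firmware private (cached) RW
const PROT_FW_PRIV_RW: u32 = prot::PRIV | prot::READ | prot::WRITE;

/// A mapping without the CACHE flag is cached-noncoherent on the firmware
/// side, so it needs the remap-as-uncached + cache-flush dance before unmap.
fn needs_cache_flush(flags: u32) -> bool {
    flags & prot::CACHE == 0
}

fn main() {
    // Shared (uncached) mappings can be unmapped directly...
    assert!(!needs_cache_flush(PROT_FW_SHARED_RW));
    // ...while private (cached) mappings must be flushed first.
    assert!(needs_cache_flush(PROT_FW_PRIV_RW));
    println!("ok");
}
```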
+
+/// Firmware MMIO R/W
+pub(crate) const PROT_FW_MMIO_RW: u32 =
+    prot::PRIV | prot::READ | prot::WRITE | prot::CACHE | prot::MMIO;
+/// Firmware MMIO R/O
+pub(crate) const PROT_FW_MMIO_RO: u32 = prot::PRIV | prot::READ | prot::CACHE | prot::MMIO;
+/// Firmware shared (uncached) RW
+pub(crate) const PROT_FW_SHARED_RW: u32 = prot::PRIV | prot::READ | prot::WRITE | prot::CACHE;
+/// Firmware shared (uncached) RO
+pub(crate) const PROT_FW_SHARED_RO: u32 = prot::PRIV | prot::READ | prot::CACHE;
+/// Firmware private (cached) RW
+pub(crate) const PROT_FW_PRIV_RW: u32 = prot::PRIV | prot::READ | prot::WRITE;
+/*
+/// Firmware private (cached) RO
+pub(crate) const PROT_FW_PRIV_RO: u32 = prot::PRIV | prot::READ;
+*/
+/// Firmware/GPU shared (uncached) RW
+pub(crate) const PROT_GPU_FW_SHARED_RW: u32 = prot::READ | prot::WRITE | prot::CACHE;
+/// Firmware/GPU shared (private) RW
+pub(crate) const PROT_GPU_FW_PRIV_RW: u32 = prot::READ | prot::WRITE;
+/// GPU shared/coherent RW
+pub(crate) const PROT_GPU_SHARED_RW: u32 = prot::READ | prot::WRITE | prot::CACHE | prot::NOEXEC;
+/// GPU shared/coherent RO
+pub(crate) const PROT_GPU_SHARED_RO: u32 = prot::READ | prot::CACHE | prot::NOEXEC;
+/// GPU shared/coherent WO
+pub(crate) const PROT_GPU_SHARED_WO: u32 = prot::WRITE | prot::CACHE | prot::NOEXEC;
+/*
+/// GPU private/noncoherent RW
+pub(crate) const PROT_GPU_PRIV_RW: u32 = prot::READ | prot::WRITE | prot::NOEXEC;
+/// GPU private/noncoherent RO
+pub(crate) const PROT_GPU_PRIV_RO: u32 = prot::READ | prot::NOEXEC;
+*/
+
+type PhysAddr = bindings::phys_addr_t;
+
+/// A pre-allocated memory region for UAT management
+struct UatRegion {
+    base: PhysAddr,
+    map: NonNull<core::ffi::c_void>,
+}
+
+/// It's safe to share UAT region records across threads.
+unsafe impl Send for UatRegion {}
+unsafe impl Sync for UatRegion {}
+
+/// Handoff region flush info structure
+#[repr(C)]
+struct FlushInfo {
+    state: AtomicU64,
+    addr: AtomicU64,
+    size: AtomicU64,
+}
+
+/// UAT Handoff region layout
+#[repr(C)]
+struct Handoff {
+    magic_ap: AtomicU64,
+    magic_fw: AtomicU64,
+
+    lock_ap: AtomicU8,
+    lock_fw: AtomicU8,
+    // Implicit padding: 2 bytes
+    turn: AtomicU32,
+    cur_slot: AtomicU32,
+    // Implicit padding: 4 bytes
+    flush: [FlushInfo; UAT_NUM_CTX + 1],
+
+    unk2: AtomicU8,
+    // Implicit padding: 7 bytes
+    unk3: AtomicU64,
+}
+
+const HANDOFF_SIZE: usize = size_of::<Handoff>();
+
+/// One VM slot in the TTBAT
+#[repr(C)]
+struct SlotTTBS {
+    ttb0: AtomicU64,
+    ttb1: AtomicU64,
+}
+
+const SLOTS_SIZE: usize = UAT_NUM_CTX * size_of::<SlotTTBS>();
+
+// We need at least page 0 (ttb0)
+const PAGETABLES_SIZE: usize = UAT_PGSZ;
+
+/// Inner data for a Vm instance. This is reference-counted by the outer Vm object.
+struct VmInner {
+    dev: driver::AsahiDevice,
+    is_kernel: bool,
+    min_va: usize,
+    max_va: usize,
+    page_table: AppleUAT<Uat>,
+    mm: mm::Allocator<(), MappingInner>,
+    uat_inner: Arc<UatInner>,
+    active_users: usize,
+    binding: Option<slotalloc::Guard<SlotInner>>,
+    bind_token: Option<slotalloc::SlotToken>,
+    id: u64,
+}
+
+impl VmInner {
+    /// Returns the slot index, if this VM is bound.
+    fn slot(&self) -> Option<u32> {
+        if self.is_kernel {
+            // The GFX ASC does not care about the ASID. Pick an arbitrary one.
+            // TODO: This needs to be a persistently reserved ASID once we integrate
+            // with the ARM64 kernel ASID machinery to avoid overlap.
+            Some(0)
+        } else {
+            // We don't check whether we lost the slot, which could cause unnecessary
+            // invalidations against another Vm. However, this situation should be very
+            // rare (e.g. a Vm lost its slot, which means 63 other Vms bound in the
+            // interim, and then it gets killed / drops its mappings without doing any
+            // final rendering). Anything doing active maps/unmaps is probably also
+            // rendering and therefore likely bound.
+            self.bind_token
+                .as_ref()
+                .map(|token| (token.last_slot() + UAT_USER_CTX_START as u32))
+        }
+    }
+
+    /// Returns the translation table base for this Vm
+    fn ttb(&self) -> u64 {
+        self.page_table.cfg().ttbr
+    }
+
+    /// Map an IOVA to the shifted address the underlying io_pgtable uses.
+    fn map_iova(&self, iova: usize, size: usize) -> Result<usize> {
+        if iova < self.min_va || (iova + size - 1) > self.max_va {
+            Err(EINVAL)
+        } else if self.is_kernel {
+            Ok(iova - self.min_va)
+        } else {
+            Ok(iova)
+        }
+    }
+
+    /// Map a contiguous range of virtual->physical pages.
+    fn map_pages(
+        &mut self,
+        mut iova: usize,
+        mut paddr: usize,
+        pgsize: usize,
+        pgcount: usize,
+        prot: u32,
+    ) -> Result<usize> {
+        let mut left = pgcount;
+        while left > 0 {
+            let mapped_iova = self.map_iova(iova, pgsize * left)?;
+            let mapped = self
+                .page_table
+                .map_pages(mapped_iova, paddr, pgsize, left, prot)?;
+            assert!(mapped <= left * pgsize);
+
+            left -= mapped / pgsize;
+            paddr += mapped;
+            iova += mapped;
+        }
+        Ok(pgcount * pgsize)
+    }
+
+    /// Unmap a contiguous range of pages.
+    fn unmap_pages(&mut self, mut iova: usize, pgsize: usize, pgcount: usize) -> Result<usize> {
+        let mut left = pgcount;
+        while left > 0 {
+            let mapped_iova = self.map_iova(iova, pgsize * left)?;
+            let unmapped = self.page_table.unmap_pages(mapped_iova, pgsize, left);
+            assert!(unmapped <= left * pgsize);
+
+            left -= unmapped / pgsize;
+            iova += unmapped;
+        }
+
+        Ok(pgcount * pgsize)
+    }
+
+    /// Map an `mm::Node` representing a mapping in VA space.
+ fn map_node(&mut self, node: &mm::Node<(), MappingInner>, prot: u32) -> Result { + let mut iova = node.start() as usize; + let sgt = node.sgt.as_ref().ok_or(EINVAL)?; + + for range in sgt.iter() { + let addr = range.dma_address(); + let len = range.dma_len(); + + if (addr | len | iova) & UAT_PGMSK != 0 { + dev_err!( + self.dev, + "MMU: Mapping {:#x}:{:#x} -> {:#x} is not page-aligned\n", + addr, + len, + iova + ); + return Err(EINVAL); + } + + mod_dev_dbg!( + self.dev, + "MMU: map: {:#x}:{:#x} -> {:#x}\n", + addr, + len, + iova + ); + + self.map_pages(iova, addr, UAT_PGSZ, len >> UAT_PGBIT, prot)?; + + iova += len; + } + Ok(()) + } +} + +/// Shared reference to a virtual memory address space ([`Vm`]). +#[derive(Clone)] +pub(crate) struct Vm { + id: u64, + file_id: u64, + inner: Arc<Mutex<VmInner>>, +} +no_debug!(Vm); + +/// Slot data for a [`Vm`] slot (nothing, we only care about the indices). +pub(crate) struct SlotInner(); + +impl slotalloc::SlotItem for SlotInner { + type Data = (); +} + +/// Represents a single user of a binding of a [`Vm`] to a slot. +/// +/// The number of users is counted, and the slot will be freed when it drops to 0. +#[derive(Debug)] +pub(crate) struct VmBind(Vm, u32); + +impl VmBind { + /// Returns the slot that this `Vm` is bound to. + pub(crate) fn slot(&self) -> u32 { + self.1 + } +} + +impl Drop for VmBind { + fn drop(&mut self) { + let mut inner = self.0.inner.lock(); + + assert_ne!(inner.active_users, 0); + inner.active_users -= 1; + mod_pr_debug!("MMU: slot {} active users {}\n", self.1, inner.active_users); + if inner.active_users == 0 { + inner.binding = None; + } + } +} + +impl Clone for VmBind { + fn clone(&self) -> VmBind { + let mut inner = self.0.inner.lock(); + + inner.active_users += 1; + mod_pr_debug!("MMU: slot {} active users {}\n", self.1, inner.active_users); + VmBind(self.0.clone(), self.1) + } +} + +/// Inner data required for an object mapping into a [`Vm`]. 
+pub(crate) struct MappingInner {
+    owner: Arc<Mutex<VmInner>>,
+    uat_inner: Arc<UatInner>,
+    prot: u32,
+    mapped_size: usize,
+    sgt: Option<gem::SGTable>,
+}
+
+/// An object mapping into a [`Vm`], which reserves the address range from use by other mappings.
+pub(crate) struct Mapping(mm::Node<(), MappingInner>);
+
+impl Mapping {
+    /// Returns the IOVA base of this mapping
+    pub(crate) fn iova(&self) -> usize {
+        self.0.start() as usize
+    }
+
+    /// Returns the size of this mapping in bytes
+    pub(crate) fn size(&self) -> usize {
+        self.0.mapped_size
+    }
+
+    /// Remap a cached mapping as uncached, then synchronously flush that range of VAs from the
+    /// coprocessor cache. This is required to safely unmap cached/private mappings.
+    fn remap_uncached_and_flush(&mut self) {
+        let mut owner = self.0.owner.lock();
+        mod_dev_dbg!(
+            owner.dev,
+            "MMU: remap as uncached {:#x}:{:#x}\n",
+            self.iova(),
+            self.size()
+        );
+
+        // The IOMMU API does not allow us to remap things in-place...
+        // just do an unmap and map again for now.
+ // Do not try to unmap guard page (-1) + if owner + .unmap_pages(self.iova(), UAT_PGSZ, self.size() >> UAT_PGBIT) + .is_err() + { + dev_err!( + owner.dev, + "MMU: unmap for remap {:#x}:{:#x} failed\n", + self.iova(), + self.size() + ); + } + + let prot = self.0.prot | prot::CACHE; + if owner.map_node(&self.0, prot).is_err() { + dev_err!( + owner.dev, + "MMU: remap {:#x}:{:#x} failed\n", + self.iova(), + self.size() + ); + } + + // If we don't have (and have never had) a VM slot, just return + let slot = match owner.slot() { + None => return, + Some(slot) => slot, + }; + + let flush_slot = if owner.is_kernel { + // If this is a kernel mapping, always flush on index 64 + UAT_NUM_CTX as u32 + } else { + // Otherwise, check if this slot is the active one, otherwise return + // Also check that we actually own this slot + let ttb = owner.ttb() | TTBR_VALID | (slot as u64) << TTBR_ASID_SHIFT; + + let uat_inner = self.0.uat_inner.lock(); + uat_inner.handoff().lock(); + let cur_slot = uat_inner.handoff().current_slot(); + let ttb_cur = uat_inner.ttbs()[slot as usize].ttb0.load(Ordering::Relaxed); + uat_inner.handoff().unlock(); + if cur_slot == Some(slot) && ttb_cur == ttb { + slot + } else { + return; + } + }; + + // FIXME: There is a race here, though it'll probably never happen in practice. + // In theory, it's possible for the ASC to finish using our slot, whatever command + // it was processing to complete, the slot to be lost to another context, and the ASC + // to begin using it again with a different page table, thus faulting when it gets a + // flush request here. In practice, the chance of this happening is probably vanishingly + // small, as all 62 other slots would have to be recycled or in use before that slot can + // be reused, and the ASC using user contexts at all is very rare. + + // Still, the locking around UAT/Handoff/TTBs should probably be redesigned to better + // model the interactions with the firmware and avoid these races. 
+ // Possibly TTB changes should be tied to slot locks: + + // Flush: + // - Can early check handoff here (no need to lock). + // If user slot and it doesn't match the active ASC slot, + // we can elide the flush as the ASC guarantees it flushes + // TLBs/caches when it switches context. We just need a + // barrier to ensure ordering. + // - Lock TTB slot + // - If user ctx: + // - Lock handoff AP-side + // - Lock handoff dekker + // - Check TTB & handoff cur ctx + // - Perform flush if necessary + // - This implies taking the fwring lock + // + // TTB change: + // - lock TTB slot + // - lock handoff AP-side + // - lock handoff dekker + // change TTB + + // Lock this flush slot, and write the range to it + let flush = self.0.uat_inner.lock_flush(flush_slot); + let pages = self.size() >> UAT_PGBIT; + flush.begin_flush(self.iova() as u64, self.size() as u64); + if pages >= 0x10000 { + dev_err!(owner.dev, "MMU: Flush too big ({:#x} pages))\n", pages); + } + + let cmd = fw::channels::FwCtlMsg { + addr: fw::types::U64(self.iova() as u64), + unk_8: 0, + slot: flush_slot, + page_count: pages as u16, + unk_12: 2, // ? + }; + + // Tell the firmware to do a cache flush + if let Err(e) = owner.dev.data().gpu.fwctl(cmd) { + dev_err!( + owner.dev, + "MMU: ASC cache flush {:#x}:{:#x} failed (err: {:?})\n", + self.iova(), + self.size(), + e + ); + } + + // Finish the flush + flush.end_flush(); + + // Slot is unlocked here + } +} + +impl Drop for Mapping { + fn drop(&mut self) { + // This is the main unmap function for UAT mappings. + // The sequence of operations here is finicky, due to the interaction + // between cached GFX ASC mappings and the page tables. These mappings + // always have to be flushed from the cache before being unmapped. + + // For uncached mappings, just unmapping and flushing the TLB is sufficient. + + // For cached mappings, this is the required sequence: + // 1. Remap it as uncached + // 2. Flush the TLB range + // 3. 
If kernel VA mapping OR user VA mapping and handoff.current_slot() == slot: + // a. Take a lock for this slot + // b. Write the flush range to the right context slot in handoff area + // c. Issue a cache invalidation request via FwCtl queue + // d. Poll for completion via queue + // e. Check for completion flag in the handoff area + // f. Drop the lock + // 4. Unmap + // 5. Flush the TLB range again + + // prot::CACHE means "cache coherent" which means *uncached* here. + if self.0.prot & prot::CACHE == 0 { + self.remap_uncached_and_flush(); + } + + let mut owner = self.0.owner.lock(); + mod_dev_dbg!( + owner.dev, + "MMU: unmap {:#x}:{:#x}\n", + self.iova(), + self.size() + ); + + if owner + .unmap_pages(self.iova(), UAT_PGSZ, self.size() >> UAT_PGBIT) + .is_err() + { + dev_err!( + owner.dev, + "MMU: unmap {:#x}:{:#x} failed\n", + self.iova(), + self.size() + ); + } + + if let Some(asid) = owner.slot() { + mem::tlbi_range(asid as u8, self.iova(), self.size()); + mod_dev_dbg!( + owner.dev, + "MMU: flush range: asid={:#x} start={:#x} len={:#x}\n", + asid, + self.iova(), + self.size() + ); + mem::sync(); + } + } +} + +/// Shared UAT global data structures +struct UatShared { + handoff_rgn: UatRegion, + ttbs_rgn: UatRegion, +} + +impl UatShared { + /// Returns the handoff region area + fn handoff(&self) -> &Handoff { + // SAFETY: pointer is non-null per the type invariant + unsafe { (self.handoff_rgn.map.as_ptr() as *mut Handoff).as_ref() }.unwrap() + } + + /// Returns the TTBAT area + fn ttbs(&self) -> &[SlotTTBS; UAT_NUM_CTX] { + // SAFETY: pointer is non-null per the type invariant + unsafe { (self.ttbs_rgn.map.as_ptr() as *mut [SlotTTBS; UAT_NUM_CTX]).as_ref() }.unwrap() + } +} + +// SAFETY: Nothing here is unsafe to send across threads. +unsafe impl Send for UatShared {} + +/// Inner data for the top-level UAT instance. 
+struct UatInner { + shared: Mutex<UatShared>, + handoff_flush: [Mutex<HandoffFlush>; UAT_NUM_CTX + 1], +} + +impl UatInner { + /// Take the lock on the shared data and return the guard. + fn lock(&self) -> Guard<'_, Mutex<UatShared>> { + self.shared.lock() + } + + /// Take a lock on a handoff flush slot and return the guard. + fn lock_flush(&self, slot: u32) -> Guard<'_, Mutex<HandoffFlush>> { + self.handoff_flush[slot as usize].lock() + } +} + +/// Top-level UAT manager object +pub(crate) struct Uat { + dev: driver::AsahiDevice, + cfg: &'static hw::HwConfig, + pagetables_rgn: UatRegion, + + inner: Arc<UatInner>, + slots: slotalloc::SlotAllocator<SlotInner>, + + kernel_vm: Vm, + _kernel_lower_vm: Vm, +} + +impl Drop for UatRegion { + fn drop(&mut self) { + // SAFETY: the pointer is valid by the type invariant + unsafe { bindings::memunmap(self.map.as_ptr()) }; + } +} + +impl Handoff { + /// Lock the handoff region from firmware access + fn lock(&self) { + self.lock_ap.store(1, Ordering::Relaxed); + fence(Ordering::SeqCst); + + while self.lock_fw.load(Ordering::Relaxed) != 0 { + if self.turn.load(Ordering::Relaxed) != 0 { + self.lock_ap.store(0, Ordering::Relaxed); + while self.turn.load(Ordering::Relaxed) != 0 {} + self.lock_ap.store(1, Ordering::Relaxed); + fence(Ordering::SeqCst); + } + } + fence(Ordering::Acquire); + } + + /// Unlock the handoff region, allowing firmware access + fn unlock(&self) { + self.turn.store(1, Ordering::Relaxed); + self.lock_ap.store(0, Ordering::Release); + } + + /// Returns the current Vm slot mapped by the firmware for lower/unprivileged access, if any. 
+ fn current_slot(&self) -> Option<u32> { + let slot = self.cur_slot.load(Ordering::Relaxed); + if slot == 0 || slot == u32::MAX { + None + } else { + Some(slot) + } + } + + /// Initialize the handoff region + fn init(&self) -> Result { + self.magic_ap.store(PPL_MAGIC, Ordering::Relaxed); + self.cur_slot.store(0, Ordering::Relaxed); + self.unk3.store(0, Ordering::Relaxed); + fence(Ordering::SeqCst); + + let timeout = time::ktime_get() + Duration::from_millis(1000); + + self.lock(); + while time::ktime_get() < timeout { + if self.magic_fw.load(Ordering::Relaxed) == PPL_MAGIC { + break; + } else { + self.unlock(); + delay::coarse_sleep(Duration::from_millis(10)); + self.lock(); + } + } + + if self.magic_fw.load(Ordering::Relaxed) != PPL_MAGIC { + self.unlock(); + pr_err!("Handoff: Failed to initialize (firmware not running?)\n"); + return Err(EIO); + } + + self.unlock(); + + for i in 0..=UAT_NUM_CTX { + self.flush[i].state.store(0, Ordering::Relaxed); + self.flush[i].addr.store(0, Ordering::Relaxed); + self.flush[i].size.store(0, Ordering::Relaxed); + } + fence(Ordering::SeqCst); + Ok(()) + } +} + +/// Represents a single flush info slot in the handoff region. +/// +/// # Invariants +/// The pointer is valid and there is no aliasing HandoffFlush instance. +struct HandoffFlush(*const FlushInfo); + +// SAFETY: These pointers are safe to send across threads. 
+unsafe impl Send for HandoffFlush {} + +impl HandoffFlush { + /// Set up a flush operation for the coprocessor + fn begin_flush(&self, start: u64, size: u64) { + let flush = unsafe { self.0.as_ref().unwrap() }; + + let state = flush.state.load(Ordering::Relaxed); + if state != 0 { + pr_err!("Handoff: expected flush state 0, got {}\n", state); + } + flush.addr.store(start, Ordering::Relaxed); + flush.size.store(size, Ordering::Relaxed); + flush.state.store(1, Ordering::Relaxed); + } + + /// Complete a flush operation for the coprocessor + fn end_flush(&self) { + let flush = unsafe { self.0.as_ref().unwrap() }; + let state = flush.state.load(Ordering::Relaxed); + if state != 2 { + pr_err!("Handoff: expected flush state 2, got {}\n", state); + } + flush.state.store(0, Ordering::Relaxed); + } +} + +// We do not implement FlushOps, since we flush manually in this module after +// page table operations. Just provide dummy implementations. +impl io_pgtable::FlushOps for Uat { + type Data = (); + + fn tlb_flush_all(_data: <Self::Data as ForeignOwnable>::Borrowed<'_>) {} + fn tlb_flush_walk( + _data: <Self::Data as ForeignOwnable>::Borrowed<'_>, + _iova: usize, + _size: usize, + _granule: usize, + ) { + } + fn tlb_add_page( + _data: <Self::Data as ForeignOwnable>::Borrowed<'_>, + _iova: usize, + _granule: usize, + ) { + } +} + +static LOCK_KEY: LockClassKey = LockClassKey::new(); + +impl Vm { + /// Create a new virtual memory address space + fn new( + dev: driver::AsahiDevice, + uat_inner: Arc<UatInner>, + cfg: &'static hw::HwConfig, + is_kernel: bool, + id: u64, + file_id: u64, + ) -> Result<Vm> { + let page_table = AppleUAT::new( + &dev, + io_pgtable::Config { + pgsize_bitmap: UAT_PGSZ, + ias: if is_kernel { UAT_IAS_KERN } else { UAT_IAS }, + oas: cfg.uat_oas, + coherent_walk: true, + quirks: 0, + }, + (), + )?; + let min_va = if is_kernel { + IOVA_KERN_BASE + } else { + IOVA_USER_BASE + }; + let max_va = if is_kernel { + IOVA_KERN_TOP + } else { + IOVA_USER_TOP + }; + + 
let mm = mm::Allocator::new( + min_va as u64, + (max_va - min_va + 1) as u64, + (), + c_str!("asahi Vm"), + &LOCK_KEY, + )?; + + Ok(Vm { + id, + file_id, + inner: Arc::try_new(Mutex::new(VmInner { + dev, + min_va, + max_va, + is_kernel, + page_table, + mm, + uat_inner, + binding: None, + bind_token: None, + active_users: 0, + id, + }))?, + }) + } + + /// Get the translation table base for this Vm + fn ttb(&self) -> u64 { + self.inner.lock().ttb() + } + + /// Map a GEM object (using its `SGTable`) into this Vm at a free address. + pub(crate) fn map(&self, size: usize, sgt: gem::SGTable) -> Result<Mapping> { + let mut inner = self.inner.lock(); + + let uat_inner = inner.uat_inner.clone(); + let node = inner.mm.insert_node( + MappingInner { + owner: self.inner.clone(), + uat_inner, + prot: PROT_FW_SHARED_RW, + sgt: Some(sgt), + mapped_size: size, + }, + (size + UAT_PGSZ) as u64, // Add guard page + )?; + + inner.map_node(&node, PROT_FW_SHARED_RW)?; + Ok(Mapping(node)) + } + + /// Map a GEM object (using its `SGTable`) into this Vm at a free address in a given range. + #[allow(clippy::too_many_arguments)] + pub(crate) fn map_in_range( + &self, + size: usize, + sgt: gem::SGTable, + alignment: u64, + start: u64, + end: u64, + prot: u32, + guard: bool, + ) -> Result<Mapping> { + let mut inner = self.inner.lock(); + + let uat_inner = inner.uat_inner.clone(); + let node = inner.mm.insert_node_in_range( + MappingInner { + owner: self.inner.clone(), + uat_inner, + prot, + sgt: Some(sgt), + mapped_size: size, + }, + (size + if guard { UAT_PGSZ } else { 0 }) as u64, // Add guard page + alignment, + 0, + start, + end, + mm::InsertMode::Best, + )?; + + inner.map_node(&node, prot)?; + Ok(Mapping(node)) + } + + /// Map a GEM object (using its `SGTable`) into this Vm at a specific address. 
+ #[allow(clippy::too_many_arguments)] + pub(crate) fn map_at( + &self, + addr: u64, + size: usize, + sgt: gem::SGTable, + prot: u32, + guard: bool, + ) -> Result<Mapping> { + let mut inner = self.inner.lock(); + + let uat_inner = inner.uat_inner.clone(); + let node = inner.mm.reserve_node( + MappingInner { + owner: self.inner.clone(), + uat_inner, + prot, + sgt: Some(sgt), + mapped_size: size, + }, + addr, + (size + if guard { UAT_PGSZ } else { 0 }) as u64, // Add guard page + 0, + )?; + + inner.map_node(&node, prot)?; + Ok(Mapping(node)) + } + + /// Add a direct MMIO mapping to this Vm at a free address. + pub(crate) fn map_io(&self, phys: usize, size: usize, rw: bool) -> Result<Mapping> { + let prot = if rw { PROT_FW_MMIO_RW } else { PROT_FW_MMIO_RO }; + let mut inner = self.inner.lock(); + + let uat_inner = inner.uat_inner.clone(); + let node = inner.mm.insert_node( + MappingInner { + owner: self.inner.clone(), + uat_inner, + prot, + sgt: None, + mapped_size: size, + }, + (size + UAT_PGSZ) as u64, // Add guard page + )?; + + let iova = node.start() as usize; + + if (phys | size | iova) & UAT_PGMSK != 0 { + dev_err!( + inner.dev, + "MMU: Mapping {:#x}:{:#x} -> {:#x} is not page-aligned\n", + phys, + size, + iova + ); + return Err(EINVAL); + } + + dev_info!( + inner.dev, + "MMU: IO map: {:#x}:{:#x} -> {:#x}\n", + phys, + size, + iova + ); + + inner.map_pages(iova, phys, UAT_PGSZ, size >> UAT_PGBIT, prot)?; + + Ok(Mapping(node)) + } + + /// Returns the unique ID of this Vm + pub(crate) fn id(&self) -> u64 { + self.id + } + + /// Returns the unique File ID of the owner of this Vm + pub(crate) fn file_id(&self) -> u64 { + self.file_id + } +} + +impl Drop for VmInner { + fn drop(&mut self) { + assert_eq!(self.active_users, 0); + + mod_pr_debug!( + "VmInner::Drop [{}]: bind_token={:?}\n", + self.id, + self.bind_token + ); + + // Make sure this VM is not mapped to a TTB if it was + if let Some(token) = self.bind_token.take() { + let idx = (token.last_slot() as usize) + 
UAT_USER_CTX_START;
+            let ttb = self.ttb() | TTBR_VALID | (idx as u64) << TTBR_ASID_SHIFT;
+
+            let uat_inner = self.uat_inner.lock();
+            uat_inner.handoff().lock();
+            let handoff_cur = uat_inner.handoff().current_slot();
+            let ttb_cur = uat_inner.ttbs()[idx].ttb0.load(Ordering::SeqCst);
+            let inval = ttb_cur == ttb;
+            if inval {
+                if handoff_cur == Some(idx as u32) {
+                    pr_err!(
+                        "VmInner::drop owning slot {}, but it is currently in use by the ASC?\n",
+                        idx
+                    );
+                }
+                uat_inner.ttbs()[idx].ttb0.store(0, Ordering::SeqCst);
+            }
+            uat_inner.handoff().unlock();
+            core::mem::drop(uat_inner);
+
+            // In principle we dropped all the Mappings already, but we might as
+            // well play it safe and invalidate the whole ASID.
+            if inval {
+                mod_pr_debug!(
+                    "VmInner::Drop [{}]: need inval for ASID {:#x}\n",
+                    self.id,
+                    idx
+                );
+                mem::tlbi_asid(idx as u8);
+                mem::sync();
+            }
+        }
+    }
+}
+
+impl Uat {
+    /// Map a bootloader-preallocated memory region
+    fn map_region(
+        dev: &dyn device::RawDevice,
+        name: &CStr,
+        size: usize,
+        cached: bool,
+    ) -> Result<UatRegion> {
+        let rdev = dev.raw_device();
+
+        let mut res = core::mem::MaybeUninit::<bindings::resource>::uninit();
+
+        let res = unsafe {
+            let idx = bindings::of_property_match_string(
+                (*rdev).of_node,
+                c_str!("memory-region-names").as_char_ptr(),
+                name.as_char_ptr(),
+            );
+            to_result(idx)?;
+
+            let np = bindings::of_parse_phandle(
+                (*rdev).of_node,
+                c_str!("memory-region").as_char_ptr(),
+                idx,
+            );
+            if np.is_null() {
+                dev_err!(dev, "Missing {} region\n", name);
+                return Err(EINVAL);
+            }
+            let ret = bindings::of_address_to_resource(np, 0, res.as_mut_ptr());
+            bindings::of_node_put(np);
+
+            if ret < 0 {
+                dev_err!(dev, "Failed to get {} region\n", name);
+                to_result(ret)?
+ } + + res.assume_init() + }; + + let rgn_size: usize = unsafe { bindings::resource_size(&res) } as usize; + + if size > rgn_size { + dev_err!( + dev, + "Region {} is too small (expected {}, got {})\n", + name, + size, + rgn_size + ); + return Err(ENOMEM); + } + + let flags = if cached { + bindings::MEMREMAP_WB + } else { + bindings::MEMREMAP_WC + }; + let map = unsafe { bindings::memremap(res.start, rgn_size, flags.into()) }; + let map = NonNull::new(map); + + match map { + None => { + dev_err!(dev, "Failed to remap {} region\n", name); + Err(ENOMEM) + } + Some(map) => Ok(UatRegion { + base: res.start, + map, + }), + } + } + + /// Returns a view into the root kernel (upper half) page table + fn kpt0(&self) -> &[Pte; UAT_NPTE] { + // SAFETY: pointer is non-null per the type invariant + unsafe { (self.pagetables_rgn.map.as_ptr() as *mut [Pte; UAT_NPTE]).as_ref() }.unwrap() + } + + /// Returns a reference to the global kernel (upper half) `Vm` + pub(crate) fn kernel_vm(&self) -> &Vm { + &self.kernel_vm + } + + /// Returns the base physical address of the TTBAT region. + pub(crate) fn ttb_base(&self) -> u64 { + let inner = self.inner.lock(); + + inner.ttbs_rgn.base + } + + /// Binds a `Vm` to a slot, preferring the last used one. 
+ pub(crate) fn bind(&self, vm: &Vm) -> Result<VmBind> { + let mut inner = vm.inner.lock(); + + if inner.binding.is_none() { + assert_eq!(inner.active_users, 0); + + let slot = self.slots.get(inner.bind_token)?; + if slot.changed() { + mod_pr_debug!("Vm Bind [{}]: bind_token={:?}\n", vm.id, slot.token(),); + let idx = (slot.slot() as usize) + UAT_USER_CTX_START; + let ttb = inner.ttb() | TTBR_VALID | (idx as u64) << TTBR_ASID_SHIFT; + + let uat_inner = self.inner.lock(); + let ttbs = uat_inner.ttbs(); + uat_inner.handoff().lock(); + if uat_inner.handoff().current_slot() == Some(idx as u32) { + pr_err!( + "Vm::bind to slot {}, but it is currently in use by the ASC?\n", + idx + ); + } + ttbs[idx].ttb0.store(ttb, Ordering::Relaxed); + ttbs[idx].ttb1.store(0, Ordering::Relaxed); + uat_inner.handoff().unlock(); + core::mem::drop(uat_inner); + + // Make sure all TLB entries from the previous owner of this ASID are gone + mem::tlbi_asid(idx as u8); + mem::sync(); + } + + inner.bind_token = Some(slot.token()); + inner.binding = Some(slot); + } + + inner.active_users += 1; + + let slot = inner.binding.as_ref().unwrap().slot() + UAT_USER_CTX_START as u32; + mod_pr_debug!("MMU: slot {} active users {}\n", slot, inner.active_users); + Ok(VmBind(vm.clone(), slot)) + } + + /// Creates a new `Vm` linked to this UAT. + pub(crate) fn new_vm(&self, id: u64, file_id: u64) -> Result<Vm> { + Vm::new( + self.dev.clone(), + self.inner.clone(), + self.cfg, + false, + id, + file_id, + ) + } + + /// Creates the reference-counted inner data for a new `Uat` instance. 
+ #[inline(never)] + fn make_inner(dev: &driver::AsahiDevice) -> Result<Arc<UatInner>> { + let handoff_rgn = Self::map_region(dev, c_str!("handoff"), HANDOFF_SIZE, false)?; + let ttbs_rgn = Self::map_region(dev, c_str!("ttbs"), SLOTS_SIZE, false)?; + + dev_info!(dev, "MMU: Initializing kernel page table\n"); + + let mut inner = UniqueArc::<UatInner>::try_new_uninit()?; + let ptr = inner.as_mut_ptr(); + + Ok(unsafe { + let handoff = &(handoff_rgn.map.as_ptr() as *mut Handoff).as_ref().unwrap(); + + for i in 0..UAT_NUM_CTX + 1 { + addr_of_mut!((*ptr).handoff_flush[i]) + .write(Mutex::new(HandoffFlush(&handoff.flush[i]))); + } + + addr_of_mut!((*ptr).shared).write(Mutex::new(UatShared { + handoff_rgn, + ttbs_rgn, + })); + + inner.assume_init() + } + .into()) + } + + /// Creates a new `Uat` instance given the relevant hardware config. + #[inline(never)] + pub(crate) fn new(dev: &driver::AsahiDevice, cfg: &'static hw::HwConfig) -> Result<Self> { + dev_info!(dev, "MMU: Initializing...\n"); + + let inner = Self::make_inner(dev)?; + + let pagetables_rgn = Self::map_region(dev, c_str!("pagetables"), PAGETABLES_SIZE, true)?; + + dev_info!(dev, "MMU: Creating kernel page tables\n"); + let kernel_lower_vm = Vm::new(dev.clone(), inner.clone(), cfg, false, 1, 0)?; + let kernel_vm = Vm::new(dev.clone(), inner.clone(), cfg, true, 0, 0)?; + + dev_info!(dev, "MMU: Kernel page tables created\n"); + + let ttb0 = kernel_lower_vm.ttb(); + let ttb1 = kernel_vm.ttb(); + + let uat = Self { + dev: dev.clone(), + cfg, + pagetables_rgn, + kernel_vm, + _kernel_lower_vm: kernel_lower_vm, + inner, + slots: slotalloc::SlotAllocator::new(UAT_USER_CTX as u32, (), |_inner, _slot| { + SlotInner() + })?, + }; + + let inner = uat.inner.lock(); + + inner.handoff().init()?; + + dev_info!(dev, "MMU: Initializing TTBs\n"); + + inner.handoff().lock(); + + let ttbs = inner.ttbs(); + + ttbs[0].ttb0.store(ttb0 | TTBR_VALID, Ordering::Relaxed); + ttbs[0] + .ttb1 + .store(uat.pagetables_rgn.base | TTBR_VALID, 
Ordering::Relaxed); + + for ctx in &ttbs[1..] { + ctx.ttb0.store(0, Ordering::Relaxed); + ctx.ttb1.store(0, Ordering::Relaxed); + } + + inner.handoff().unlock(); + + core::mem::drop(inner); + + uat.kpt0()[2].store(ttb1 | PTE_TABLE, Ordering::Relaxed); + + dev_info!(dev, "MMU: initialized\n"); + + Ok(uat) + } +} + +impl Drop for Uat { + fn drop(&mut self) { + // Unmap what we mapped + self.kpt0()[2].store(0, Ordering::Relaxed); + + // Make sure we flush the TLBs + fence(Ordering::SeqCst); + mem::tlbi_all(); + mem::sync(); + } +} diff --git a/drivers/gpu/drm/asahi/object.rs b/drivers/gpu/drm/asahi/object.rs new file mode 100644 index 000000000000..449899b88181 --- /dev/null +++ b/drivers/gpu/drm/asahi/object.rs @@ -0,0 +1,704 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! Asahi GPU object model +//! +//! The AGX GPU includes a coprocessor that uses a large number of shared memory structures to +//! communicate with the driver. These structures contain GPU VA pointers to each other, which are +//! directly dereferenced by the firmware and are expected to always be valid for the usage +//! lifetime of the containing struct (which is an implicit contract, not explicitly managed). +//! Any faults cause an unrecoverable firmware crash, requiring a full system reboot. +//! +//! In order to manage this complexity safely, we implement a GPU object model using Rust's type +//! system to enforce GPU object lifetime relationships. GPU objects represent an allocated piece +//! of memory of a given type, mapped to the GPU (and usually also the CPU). On the CPU side, +//! these objects are associated with a pure Rust structure that contains the objects it depends +//! on (or references to them). This allows us to map Rust lifetimes into the GPU object model +//! system. Then, GPU VA pointers also inherit those lifetimes, which means the Rust borrow checker +//! can ensure that each pointer is assigned an address that is guaranteed to outlive the GPU +//! 
object it points to. +//! +//! Since the firmware object model does have self-referencing pointers (and there is of course no +//! underlying revocability mechanism to make it safe), we must have an escape hatch. GPU pointers +//! can be weak pointers, which do not enforce lifetimes. In those cases, it is the user's +//! responsibility to ensure that lifetime requirements are met. +//! +//! In other words, the model is necessarily leaky and there is no way to fully map Rust safety to +//! GPU firmware object safety. The goal of the model is to make it easy to model the lifetimes of +//! GPU objects and have the compiler help in avoiding mistakes, rather than to guarantee safety +//! 100% of the time as would be the case for CPU-side Rust code. + +// TODO: There is a fundamental soundness issue with sharing memory with the GPU (that even affects +// C code too). Since the GPU is free to mutate that memory at any time, normal reference invariants +// cannot be enforced on the CPU side. For example, the compiler could perform an optimization that +// assumes that a given memory location does not change between two reads, and causes UB otherwise, +// and then the GPU could mutate that memory out from under the CPU. +// +// For cases where we *expect* this to happen, we use atomic types, which avoid this issue. However, +// doing so for every single field of every type is a non-starter. Right now, there seems to be no +// good solution for this that does not come with significant performance or ergonomics downsides. +// +// In *practice* we are almost always only writing GPU memory, and only reading from atomics, so the +// chances of this actually triggering UB (e.g. a security issue that can be triggered from the GPU +// side) due to a compiler optimization are very slim. 
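[Note: the compiler-optimization caveat above can be illustrated outside the kernel with a minimal, hypothetical sketch; here a spawned thread plays the role of the GPU mutating shared memory, and all names are illustrative, not driver code. Memory that another agent can mutate must be read through atomics, since a plain `&u32` would let the compiler assume the value never changes between reads.]

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// A thread stands in for the GPU: it mutates memory the "CPU side"
// also observes. Reading through an atomic avoids the UB that a plain
// shared reference would invite under this concurrent mutation.
fn read_shared() -> u32 {
    let shared = Arc::new(AtomicU32::new(0));
    let gpu = Arc::clone(&shared);
    let t = thread::spawn(move || gpu.store(42, Ordering::Relaxed));
    t.join().unwrap();
    // The writer has been joined, so this load reliably observes 42.
    shared.load(Ordering::Relaxed)
}
```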
+// +// Further discussion: https://github.com/rust-lang/unsafe-code-guidelines/issues/152 + +use kernel::{error::code::*, prelude::*}; + +use alloc::boxed::Box; +use core::fmt; +use core::fmt::Debug; +use core::fmt::Formatter; +use core::marker::PhantomData; +use core::mem::MaybeUninit; +use core::num::NonZeroU64; +use core::ops::{Deref, DerefMut, Index, IndexMut}; +use core::{mem, ptr, slice}; + +use crate::alloc::Allocation; +use crate::debug::*; +use crate::fw::types::Zeroed; + +const DEBUG_CLASS: DebugFlags = DebugFlags::Object; + +/// A GPU-side strong pointer, which is a 64-bit non-zero VA with an associated lifetime. +/// +/// In rare cases these pointers are not aligned, so this is `packed(1)`. +#[repr(C, packed(1))] +pub(crate) struct GpuPointer<'a, T: ?Sized>(NonZeroU64, PhantomData<&'a T>); + +impl<'a, T: ?Sized> GpuPointer<'a, T> { + /// Logical OR the pointer with an arbitrary `u64`. This is used when GPU struct fields contain + /// misc flag fields in the upper bits. The lifetime is retained. This is GPU-unsafe in + /// principle, but we assert that only non-implemented address bits are touched, which is safe + /// for pointers used by the GPU (not by firmware). + pub(crate) fn or(&self, other: u64) -> GpuPointer<'a, T> { + // This will fail for kernel-half pointers, which should not be ORed. + assert_eq!(self.0.get() & other, 0); + // Assert that we only touch the high bits. + assert_eq!(other & 0xffffffffff, 0); + GpuPointer(self.0 | other, PhantomData) + } + + /// Add an arbitrary offset to the pointer. This is not safe (from the GPU perspective), and + /// should only be used via the `inner_ptr` macro to get pointers to inner fields, hence we mark + /// it `unsafe` to discourage direct use. + // NOTE: The third argument is a type inference hack. 
+ pub(crate) unsafe fn offset<U>(&self, off: usize, _: *const U) -> GpuPointer<'a, U> { + GpuPointer::<'a, U>( + NonZeroU64::new(self.0.get() + (off as u64)).unwrap(), + PhantomData, + ) + } +} + +impl<'a, T: ?Sized> Debug for GpuPointer<'a, T> { + fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { + let val = self.0; + f.write_fmt(format_args!("{:#x} ({})", val, core::any::type_name::<T>())) + } +} + +/// Take a pointer to a sub-field within a structure pointed to by a GpuPointer, keeping the +/// lifetime. +#[macro_export] +macro_rules! inner_ptr { + ($gpuva:expr, $($f:tt)*) => ({ + // This mirrors kernel::offset_of(), except we use type inference to avoid having to know + // the type of the pointer explicitly. + fn uninit_from<'a, T: GpuStruct>(_: GpuPointer<'a, T>) -> core::mem::MaybeUninit<T::Raw<'static>> { + core::mem::MaybeUninit::uninit() + } + let tmp = uninit_from($gpuva); + let outer = tmp.as_ptr(); + // SAFETY: The pointer is valid and aligned, just not initialised; `addr_of` ensures that + // we don't actually read from `outer` (which would be UB) nor create an intermediate + // reference. + let p: *const _ = unsafe { core::ptr::addr_of!((*outer).$($f)*) }; + let inner = p as *const u8; + // SAFETY: The two pointers are within the same allocation block. + let off = unsafe { inner.offset_from(outer as *const u8) }; + // SAFETY: The resulting pointer is guaranteed to point to valid memory within the outer + // object. + unsafe { $gpuva.offset(off.try_into().unwrap(), p) } + }) +} + +/// A GPU-side weak pointer, which is a 64-bit non-zero VA with no lifetime. +/// +/// In rare cases these pointers are not aligned, so this is `packed(1)`. +#[repr(C, packed(1))] +pub(crate) struct GpuWeakPointer<T: ?Sized>(NonZeroU64, PhantomData<*const T>); + +/// SAFETY: GPU weak pointers are always safe to share between threads. 
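[Note: the field-offset trick used by `inner_ptr!` above can be shown as standalone Rust; `Raw` and `offset_of_b` are hypothetical stand-ins for a `T::Raw` firmware struct, not driver types. The point is that `MaybeUninit` plus `addr_of!` computes an offset without ever reading uninitialized memory or creating an intermediate reference.]

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of;

// Hypothetical stand-in for a firmware `T::Raw` layout.
#[repr(C)]
struct Raw {
    a: u64,
    b: u32,
}

// Compute the byte offset of `b` without constructing a `Raw` value,
// mirroring the uninit_from()/addr_of!()/offset_from() steps above.
fn offset_of_b() -> usize {
    let tmp = MaybeUninit::<Raw>::uninit();
    let outer = tmp.as_ptr();
    // SAFETY: `addr_of!` only takes the address; the uninitialized
    // memory is never read and no reference is created.
    let inner = unsafe { addr_of!((*outer).b) } as *const u8;
    // SAFETY: both pointers lie within the same `tmp` allocation.
    unsafe { inner.offset_from(outer as *const u8) as usize }
}
```

With `repr(C)` the `u64` field occupies bytes 0..8, so `b` lands at offset 8.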
+unsafe impl<T: ?Sized> Send for GpuWeakPointer<T> {} +unsafe impl<T: ?Sized> Sync for GpuWeakPointer<T> {} + +// Weak pointers can be copied/cloned regardless of their target type. +impl<T: ?Sized> Copy for GpuWeakPointer<T> {} + +impl<T: ?Sized> Clone for GpuWeakPointer<T> { + fn clone(&self) -> Self { + *self + } +} + +impl<T: ?Sized> GpuWeakPointer<T> { + /// Add an arbitrary offset to the pointer. This is not safe (from the GPU perspective), and + /// should only be used via the `inner_ptr` macro to get pointers to inner fields, hence we mark + /// it `unsafe` to discourage direct use. + // NOTE: The third argument is a type inference hack. + pub(crate) unsafe fn offset<U>(&self, off: usize, _: *const U) -> GpuWeakPointer<U> { + GpuWeakPointer::<U>( + NonZeroU64::new(self.0.get() + (off as u64)).unwrap(), + PhantomData, + ) + } + + /// Upgrade a weak pointer into a strong pointer. This is not considered safe from the GPU + /// perspective. + pub(crate) unsafe fn upgrade<'a>(&self) -> GpuPointer<'a, T> { + GpuPointer(self.0, PhantomData) + } +} + +impl<T: ?Sized> Debug for GpuWeakPointer<T> { + fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { + let val = self.0; + f.write_fmt(format_args!("{:#x} ({})", val, core::any::type_name::<T>())) + } +} + +/// Take a pointer to a sub-field within a structure pointed to by a GpuWeakPointer. +#[macro_export] +macro_rules! inner_weak_ptr { + ($gpuva:expr, $($f:tt)*) => ({ + // See inner_ptr() + fn uninit_from<T: GpuStruct>(_: GpuWeakPointer<T>) -> core::mem::MaybeUninit<T::Raw<'static>> { + core::mem::MaybeUninit::uninit() + } + let tmp = uninit_from($gpuva); + let outer = tmp.as_ptr(); + // SAFETY: The pointer is valid and aligned, just not initialised; `addr_of` ensures that + // we don't actually read from `outer` (which would be UB) nor create an intermediate + // reference. 
+ let p: *const _ = unsafe { core::ptr::addr_of!((*outer).$($f)*) }; + let inner = p as *const u8; + // SAFETY: The two pointers are within the same allocation block. + let off = unsafe { inner.offset_from(outer as *const u8) }; + // SAFETY: The resulting pointer is guaranteed to point to valid memory within the outer + // object. + unsafe { $gpuva.offset(off.try_into().unwrap(), p) } + }) +} + +/// Types that implement this trait represent a GPU structure from the CPU side. +/// +/// The `Raw` type represents the actual raw structure definition on the GPU side. +/// +/// Types implementing [`GpuStruct`] must have fields owning any objects (or strong references +/// to them) that GPU pointers in the `Raw` structure point to. This mechanism is used to enforce +/// lifetimes. +pub(crate) trait GpuStruct: 'static { + /// The type of the GPU-side structure definition representing the firmware struct layout. + type Raw<'a>; +} + +/// An instance of a GPU object in memory. +/// +/// # Invariants +/// `raw` must point to a valid mapping of the `T::Raw` type associated with the `alloc` allocation. +/// `gpu_ptr` must be the GPU address of the same object. +pub(crate) struct GpuObject<T: GpuStruct, U: Allocation<T>> { + raw: *mut T::Raw<'static>, + alloc: U, + gpu_ptr: GpuWeakPointer<T>, + inner: Box<T>, +} + +impl<T: GpuStruct, U: Allocation<T>> GpuObject<T, U> { + /// Create a new GpuObject given an allocator and the inner data (a type implementing + /// GpuStruct). + /// + /// The caller passes a closure that constructs the `T::Raw` type given a reference to the + /// `GpuStruct`. This is the mechanism used to enforce lifetimes. 
+ pub(crate) fn new( + alloc: U, + inner: T, + callback: impl for<'a> FnOnce(&'a T) -> T::Raw<'a>, + ) -> Result<Self> { + let size = mem::size_of::<T::Raw<'static>>(); + if size > 0x1000 { + dev_crit!( + alloc.device(), + "Allocating {} of size {:#x}, with new, please use new_boxed!\n", + core::any::type_name::<T>(), + size + ); + } + if alloc.size() < size { + return Err(ENOMEM); + } + let gpu_ptr = + GpuWeakPointer::<T>(NonZeroU64::new(alloc.gpu_ptr()).ok_or(EINVAL)?, PhantomData); + mod_dev_dbg!( + alloc.device(), + "Allocating {} @ {:#x}\n", + core::any::type_name::<T>(), + alloc.gpu_ptr() + ); + let p = alloc.ptr().ok_or(EINVAL)?.as_ptr() as *mut T::Raw<'static>; + let mut raw = callback(&inner); + // SAFETY: `p` is guaranteed to be valid per the Allocation invariant, and the type is + // identical to the type of `raw` other than the lifetime. + unsafe { p.copy_from(&mut raw as *mut _ as *mut u8 as *mut _, 1) }; + mem::forget(raw); + Ok(Self { + raw: p, + gpu_ptr, + alloc, + inner: Box::try_new(inner)?, + }) + } + + /// Create a new GpuObject given an allocator and the boxed inner data (a type implementing + /// GpuStruct). + /// + /// The caller passes a closure that initializes the `T::Raw` type given a reference to the + /// `GpuStruct` and a `MaybeUninit<T::Raw>`. This is intended to be used with the place!() + /// macro to avoid constructing the whole `T::Raw` object on the stack. 
+ pub(crate) fn new_boxed( + alloc: U, + inner: Box<T>, + callback: impl for<'a> FnOnce( + &'a T, + &'a mut MaybeUninit<T::Raw<'a>>, + ) -> Result<&'a mut T::Raw<'a>>, + ) -> Result<Self> { + if alloc.size() < mem::size_of::<T::Raw<'static>>() { + return Err(ENOMEM); + } + let gpu_ptr = + GpuWeakPointer::<T>(NonZeroU64::new(alloc.gpu_ptr()).ok_or(EINVAL)?, PhantomData); + mod_dev_dbg!( + alloc.device(), + "Allocating {} @ {:#x}\n", + core::any::type_name::<T>(), + alloc.gpu_ptr() + ); + let p = alloc.ptr().ok_or(EINVAL)?.as_ptr() as *mut MaybeUninit<T::Raw<'_>>; + // SAFETY: `p` is guaranteed to be valid per the Allocation invariant. + let raw = callback(&inner, unsafe { &mut *p })?; + if p as *mut T::Raw<'_> != raw as *mut _ { + dev_err!( + alloc.device(), + "Allocation callback returned a mismatched reference ({})\n", + core::any::type_name::<T>(), + ); + return Err(EINVAL); + } + Ok(Self { + raw: p as *mut u8 as *mut T::Raw<'static>, + gpu_ptr, + alloc, + inner, + }) + } + + /// Create a new GpuObject given an allocator and the inner data (a type implementing + /// GpuStruct). + /// + /// The caller passes a closure that initializes the `T::Raw` type given a reference to the + /// `GpuStruct` and a `MaybeUninit<T::Raw>`. This is intended to be used with the place!() + /// macro to avoid constructing the whole `T::Raw` object on the stack. + pub(crate) fn new_inplace( + alloc: U, + inner: T, + callback: impl for<'a> FnOnce( + &'a T, + &'a mut MaybeUninit<T::Raw<'a>>, + ) -> Result<&'a mut T::Raw<'a>>, + ) -> Result<Self> { + GpuObject::<T, U>::new_boxed(alloc, Box::try_new(inner)?, callback) + } + + /// Create a new GpuObject given an allocator, with callback-based initialization. + /// + /// This is used when the construction of the `T` type requires knowing the GPU VA address of + /// the structure that is being constructed ahead of time. 
The first callback constructs a + /// `Box<T>` given the pointer to the about-to-be-initialized GPU structure, and the second + /// callback initializes that structure as in `new_boxed`. + pub(crate) fn new_prealloc( + alloc: U, + inner_cb: impl FnOnce(GpuWeakPointer<T>) -> Result<Box<T>>, + raw_cb: impl for<'a> FnOnce( + &'a T, + &'a mut MaybeUninit<T::Raw<'a>>, + ) -> Result<&'a mut T::Raw<'a>>, + ) -> Result<Self> { + if alloc.size() < mem::size_of::<T::Raw<'static>>() { + return Err(ENOMEM); + } + let gpu_ptr = + GpuWeakPointer::<T>(NonZeroU64::new(alloc.gpu_ptr()).ok_or(EINVAL)?, PhantomData); + mod_dev_dbg!( + alloc.device(), + "Allocating {} @ {:#x}\n", + core::any::type_name::<T>(), + alloc.gpu_ptr() + ); + let inner = inner_cb(gpu_ptr)?; + let p = alloc.ptr().ok_or(EINVAL)?.as_ptr() as *mut MaybeUninit<T::Raw<'_>>; + // SAFETY: `p` is guaranteed to be valid per the Allocation invariant. + let raw = raw_cb(&*inner, unsafe { &mut *p })?; + if p as *mut T::Raw<'_> != raw as *mut _ { + dev_err!( + alloc.device(), + "Allocation callback returned a mismatched reference ({})\n", + core::any::type_name::<T>(), + ); + return Err(EINVAL); + } + Ok(Self { + raw: p as *mut u8 as *mut T::Raw<'static>, + gpu_ptr, + alloc, + inner, + }) + } + + /// Returns the GPU VA of this object (as a raw [`NonZeroU64`]) + pub(crate) fn gpu_va(&self) -> NonZeroU64 { + self.gpu_ptr.0 + } + + /// Returns a strong GPU pointer to this object, with a lifetime. + pub(crate) fn gpu_pointer(&self) -> GpuPointer<'_, T> { + GpuPointer(self.gpu_ptr.0, PhantomData) + } + + /// Returns a weak GPU pointer to this object, with no lifetime. + pub(crate) fn weak_pointer(&self) -> GpuWeakPointer<T> { + GpuWeakPointer(self.gpu_ptr.0, PhantomData) + } + + /// Perform a mutation to the inner `Raw` data given a user-supplied callback. + /// + /// The callback gets a mutable reference to the `GpuStruct` type. 
+ pub(crate) fn with_mut<RetVal>( + &mut self, + callback: impl for<'a> FnOnce(&'a mut <T as GpuStruct>::Raw<'a>, &'a mut T) -> RetVal, + ) -> RetVal { + // SAFETY: `self.raw` is valid per the type invariant, and the second half is just + // converting lifetimes. + unsafe { callback(&mut *self.raw, &mut *(&mut *self.inner as *mut _)) } + } + + /// Access the inner `Raw` data given a user-supplied callback. + /// + /// The callback gets a reference to the `GpuStruct` type. + pub(crate) fn with<RetVal>( + &self, + callback: impl for<'a> FnOnce(&'a <T as GpuStruct>::Raw<'a>, &'a T) -> RetVal, + ) -> RetVal { + // SAFETY: `self.raw` is valid per the type invariant, and the second half is just + // converting lifetimes. + unsafe { callback(&*self.raw, &*(&*self.inner as *const _)) } + } +} + +impl<T: GpuStruct, U: Allocation<T>> Deref for GpuObject<T, U> { + type Target = T; + + fn deref(&self) -> &Self::Target { + &self.inner + } +} + +impl<T: GpuStruct, U: Allocation<T>> DerefMut for GpuObject<T, U> { + fn deref_mut(&mut self) -> &mut Self::Target { + &mut self.inner + } +} + +impl<T: GpuStruct + Debug, U: Allocation<T>> Debug for GpuObject<T, U> +where + <T as GpuStruct>::Raw<'static>: Debug, +{ + fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { + f.debug_struct(core::any::type_name::<T>()) + // SAFETY: `self.raw` is valid per the type invariant. + .field("raw", &format_args!("{:#X?}", unsafe { &*self.raw })) + .field("inner", &format_args!("{:#X?}", &self.inner)) + .field("alloc", &format_args!("{:?}", &self.alloc)) + .finish() + } +} + +impl<T: GpuStruct + Default, U: Allocation<T>> GpuObject<T, U> +where + for<'a> <T as GpuStruct>::Raw<'a>: Default + Zeroed, +{ + /// Create a new GpuObject with default data. `T` must implement `Default` and `T::Raw` must + /// implement `Zeroed`, since the GPU-side memory is initialized by zeroing. 
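[Note: the zero-initialization pattern that `new_default` relies on can be sketched standalone; `Counters` is a hypothetical type for which all-zero bytes are a valid value, playing the role of a `T::Raw` that would implement the driver's `Zeroed` marker.]

```rust
use std::mem::MaybeUninit;
use std::ptr;

// Hypothetical all-zero-valid struct standing in for a `T::Raw`.
#[derive(Debug, PartialEq)]
#[repr(C)]
struct Counters {
    hits: u64,
    misses: u64,
}

// Zero-initialize in place with `ptr::write_bytes`, then assert the
// memory is initialized, as `new_default` does for `Zeroed` types.
fn zeroed_counters() -> Counters {
    let mut buf = MaybeUninit::<Counters>::uninit();
    // SAFETY: every all-zero bit pattern is a valid `Counters`.
    unsafe {
        ptr::write_bytes(buf.as_mut_ptr(), 0, 1);
        buf.assume_init()
    }
}
```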
+ pub(crate) fn new_default(alloc: U) -> Result<Self> { + GpuObject::<T, U>::new_inplace(alloc, Default::default(), |_inner, raw| { + // SAFETY: `raw` is valid here, and `T::Raw` implements `Zeroed`. + Ok(unsafe { + ptr::write_bytes(raw, 0, 1); + (*raw).assume_init_mut() + }) + }) + } +} + +impl<T: GpuStruct, U: Allocation<T>> Drop for GpuObject<T, U> { + fn drop(&mut self) { + mod_dev_dbg!( + self.alloc.device(), + "Dropping {} @ {:?}\n", + core::any::type_name::<T>(), + self.gpu_pointer() + ); + } +} + +// SAFETY: GpuObjects are Send as long as the GpuStruct itself is Send +unsafe impl<T: GpuStruct + Send, U: Allocation<T>> Send for GpuObject<T, U> {} +// SAFETY: GpuObjects are Sync as long as the GpuStruct itself is Sync +unsafe impl<T: GpuStruct + Sync, U: Allocation<T>> Sync for GpuObject<T, U> {} + +/// Trait used to erase the type of a GpuObject, used when we need to keep a list of heterogeneous +/// objects around. +pub(crate) trait OpaqueGpuObject: Send + Sync { + fn gpu_va(&self) -> NonZeroU64; +} + +impl<T: GpuStruct + Sync + Send, U: Allocation<T>> OpaqueGpuObject for GpuObject<T, U> { + fn gpu_va(&self) -> NonZeroU64 { + Self::gpu_va(self) + } +} + +/// An array of raw GPU objects that is only accessible to the GPU (no CPU-side mapping required). +/// +/// This must necessarily be uninitialized as far as the GPU is concerned, so it cannot be used +/// when initialization is required. +/// +/// # Invariants +/// +/// `alloc` is valid and at least as large as `len` times the size of one `T`. +/// `gpu_ptr` is valid and points to the allocation start. +pub(crate) struct GpuOnlyArray<T, U: Allocation<T>> { + len: usize, + alloc: U, + gpu_ptr: NonZeroU64, + _p: PhantomData<T>, +} + +impl<T, U: Allocation<T>> GpuOnlyArray<T, U> { + /// Allocate a new GPU-only array with the given length. 
+ pub(crate) fn new(alloc: U, count: usize) -> Result<GpuOnlyArray<T, U>> { + let bytes = count * mem::size_of::<T>(); + let gpu_ptr = NonZeroU64::new(alloc.gpu_ptr()).ok_or(EINVAL)?; + if alloc.size() < bytes { + return Err(ENOMEM); + } + Ok(Self { + len: count, + alloc, + gpu_ptr, + _p: PhantomData, + }) + } + + /// Returns the GPU VA of this array (as a raw [`NonZeroU64`]) + pub(crate) fn gpu_va(&self) -> NonZeroU64 { + self.gpu_ptr + } + + /// Returns a strong GPU pointer to this array, with a lifetime. + pub(crate) fn gpu_pointer(&self) -> GpuPointer<'_, &'_ [T]> { + GpuPointer(self.gpu_ptr, PhantomData) + } + + /// Returns a weak GPU pointer to this array, with no lifetime. + pub(crate) fn weak_pointer(&self) -> GpuWeakPointer<[T]> { + GpuWeakPointer(self.gpu_ptr, PhantomData) + } + + /// Returns a pointer to an offset within the array (as a subslice). + pub(crate) fn gpu_offset_pointer(&self, offset: usize) -> GpuPointer<'_, &'_ [T]> { + if offset > self.len { + panic!("Index {} out of bounds (len: {})", offset, self.len); + } + GpuPointer( + NonZeroU64::new(self.gpu_ptr.get() + (offset * mem::size_of::<T>()) as u64).unwrap(), + PhantomData, + ) + } + + /* Not used yet + /// Returns a weak pointer to an offset within the array (as a subslice). + pub(crate) fn weak_offset_pointer(&self, offset: usize) -> GpuWeakPointer<[T]> { + if offset > self.len { + panic!("Index {} out of bounds (len: {})", offset, self.len); + } + GpuWeakPointer( + NonZeroU64::new(self.gpu_ptr.get() + (offset * mem::size_of::<T>()) as u64).unwrap(), + PhantomData, + ) + } + + /// Returns a pointer to an element within the array. + pub(crate) fn gpu_item_pointer(&self, index: usize) -> GpuPointer<'_, &'_ T> { + if index >= self.len { + panic!("Index {} out of bounds (len: {})", index, self.len); + } + GpuPointer( + NonZeroU64::new(self.gpu_ptr.get() + (index * mem::size_of::<T>()) as u64).unwrap(), + PhantomData, + ) + } + */ + + /// Returns a weak pointer to an element within the array. 
+ pub(crate) fn weak_item_pointer(&self, index: usize) -> GpuWeakPointer<T> { + if index >= self.len { + panic!("Index {} out of bounds (len: {})", index, self.len); + } + GpuWeakPointer( + NonZeroU64::new(self.gpu_ptr.get() + (index * mem::size_of::<T>()) as u64).unwrap(), + PhantomData, + ) + } + + /// Returns the length of the array. + pub(crate) fn len(&self) -> usize { + self.len + } +} + +impl<T: Debug, U: Allocation<T>> Debug for GpuOnlyArray<T, U> { + fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { + f.debug_struct(core::any::type_name::<T>()) + .field("len", &format_args!("{:#X?}", self.len())) + .finish() + } +} + +impl<T, U: Allocation<T>> Drop for GpuOnlyArray<T, U> { + fn drop(&mut self) { + mod_dev_dbg!( + self.alloc.device(), + "Dropping {} @ {:?}\n", + core::any::type_name::<T>(), + self.gpu_pointer() + ); + } +} + +/// An array of raw GPU objects that is also CPU-accessible. +/// +/// # Invariants +/// +/// `raw` is valid and points to the CPU-side view of the array (which must have one). +pub(crate) struct GpuArray<T, U: Allocation<T>> { + raw: *mut T, + array: GpuOnlyArray<T, U>, +} + +/* Not used yet +impl<T: Copy, U: Allocation<T>> GpuArray<T, U> { + /// Allocate a new GPU array, copying the contents from a slice. + pub(crate) fn new(alloc: U, data: &[T]) -> Result<GpuArray<T, U>> { + let p = alloc.ptr().ok_or(EINVAL)?.as_ptr(); + let inner = GpuOnlyArray::new(alloc, data.len())?; + // SAFETY: `p` is valid per the Allocation type invariant, and GpuOnlyArray guarantees + // that its size is at least as large as `data.len()`. + unsafe { ptr::copy(data.as_ptr(), p, data.len()) }; + Ok(Self { + raw: p, + array: inner, + }) + } +} +*/ + +impl<T: Default, U: Allocation<T>> GpuArray<T, U> { + /// Allocate a new GPU array, initializing each element to its default. 
+ pub(crate) fn empty(alloc: U, count: usize) -> Result<GpuArray<T, U>> { + let p = alloc.ptr().ok_or(EINVAL)?.as_ptr() as *mut T; + let inner = GpuOnlyArray::new(alloc, count)?; + let mut pi = p; + for _i in 0..count { + // SAFETY: `pi` is valid per the Allocation type invariant, and GpuOnlyArray guarantees + // that it can never iterate beyond the buffer length. + unsafe { + pi.write(Default::default()); + pi = pi.add(1); + } + } + Ok(Self { + raw: p, + array: inner, + }) + } +} + +impl<T, U: Allocation<T>> GpuArray<T, U> { + /// Get a slice view of the array contents. + pub(crate) fn as_slice(&self) -> &[T] { + // SAFETY: self.raw / self.len are valid per the type invariant + unsafe { slice::from_raw_parts(self.raw, self.len) } + } + + /// Get a mutable slice view of the array contents. + pub(crate) fn as_mut_slice(&mut self) -> &mut [T] { + // SAFETY: self.raw / self.len are valid per the type invariant + unsafe { slice::from_raw_parts_mut(self.raw, self.len) } + } +} + +impl<T, U: Allocation<T>> Deref for GpuArray<T, U> { + type Target = GpuOnlyArray<T, U>; + + fn deref(&self) -> &GpuOnlyArray<T, U> { + &self.array + } +} + +impl<T, U: Allocation<T>> Index<usize> for GpuArray<T, U> { + type Output = T; + + fn index(&self, index: usize) -> &T { + if index >= self.len { + panic!("Index {} out of bounds (len: {})", index, self.len); + } + // SAFETY: This is bounds checked above + unsafe { &*(self.raw.add(index)) } + } +} + +impl<T, U: Allocation<T>> IndexMut<usize> for GpuArray<T, U> { + fn index_mut(&mut self, index: usize) -> &mut T { + if index >= self.len { + panic!("Index {} out of bounds (len: {})", index, self.len); + } + // SAFETY: This is bounds checked above + unsafe { &mut *(self.raw.add(index)) } + } +} + +// SAFETY: GpuArray is Send as long as the contained type itself is Send +unsafe impl<T: Send, U: Allocation<T>> Send for GpuArray<T, U> {} +// SAFETY: GpuArray is Sync as long as the contained type itself is Sync +unsafe impl<T: Sync, U: 
Allocation<T>> Sync for GpuArray<T, U> {} + +impl<T: Debug, U: Allocation<T>> Debug for GpuArray<T, U> { + fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result { + f.debug_struct(core::any::type_name::<T>()) + .field("array", &format_args!("{:#X?}", self.as_slice())) + .finish() + } +} diff --git a/drivers/gpu/drm/asahi/place.rs b/drivers/gpu/drm/asahi/place.rs new file mode 100644 index 000000000000..40c51f4fab8d --- /dev/null +++ b/drivers/gpu/drm/asahi/place.rs @@ -0,0 +1,343 @@ +// SPDX-License-Identifier: Apache-2.0 OR MIT + +//! "Placement new" macro +//! +//! This cursed abomination of a declarative macro is used to emulate a "placement new" feature, +//! which allows initializing objects directly in a user-provided memory region without first +//! going through the stack. +//! +//! This driver needs to manage several large GPU objects of a fixed layout. Linux kernel stacks are +//! very small, so it is impossible to create these objects on the stack. While the compiler can +//! sometimes optimize away the stack copy and directly instantiate in target memory, this is not +//! guaranteed and not reliable. Therefore, we need some mechanism to ergonomically initialize +//! complex structures directly in a pre-allocated piece of memory. +//! +//! This issue also affects some driver-internal structs which are large/complex enough to overflow +//! the stack. While this can be solved by breaking them up into pieces and using `Box` more +//! liberally, this has performance implications and still isn't very nice. This macro can also be +//! used to solve this issue. +//! +//! # Further reading +//! https://github.com/rust-lang/rust/issues/27779#issuecomment-378416911 +//! https://internals.rust-lang.org/t/removal-of-all-unstable-placement-features... + +/// Initialize a `MaybeUninit` in-place, without constructing the value on the stack first. +/// +/// This macro is analogous to `MaybeUninit::write()`. 
In other words, +/// `place!(foo, bar)` is equivalent to `MaybeUninit::write(foo, bar)`, except that `bar` is not +/// constructed first, but rather its fields (if it is a structure constructor) are copied one by +/// one into the correct location in the `MaybeUninit`. +/// +/// The macro supports most Rust initialization syntax including type paths, generic arguments, +/// and nested structures. Nested structures are themselves initialized in-place field by field. +/// `..Default::default()` is supported, but this macro converts it to `..Zeroed::zeroed()`, as it +/// initializes those structs by zero-initializing the underlying memory. Usage of +/// `..Default::default()` with a type not implementing `Zeroed` will result in a compile error. +/// +/// Usage: +/// ``` +/// let mut buf = MaybeUninit::uninit(); +/// let mut_ref = place!(&mut buf, MyStruct { +/// b: true, +/// s: String::from("works"), +/// i: str::parse::<i32>("123").unwrap(), +/// v: vec![String::from("works")], +/// x: foo::MyOtherCoolStruct { +/// a: false, +/// b: String::from("Hello, world!"), +/// }, +/// y: foo::MyOtherCoolStruct { +/// a: false, +/// b: String::from("Hello, world!"), +/// }, +/// z: foo::MyCoolGenericStruct::<bool, String> { +/// a: false, +/// b: String::from("Hello, world!"), +/// }, +/// }); +/// // `mut_ref` is now a mutable reference to the `buf`, which is now safely initialized. +/// ``` +/// +/// Based on https://crates.io/crates/place by DianaNites, with contributions by Joshua Barretto. +#[macro_export] +macro_rules! 
place { + // Top-level struct + (@STRUCT $ptr:ident, _TOP, $typ:path, {$($typ_init:tt)*} { $($fields:tt)* }) => {{ + place!(@STRUCT_ZERO $ptr, {$($typ_init)*} { $($fields)* }); + place!(@STRUCT_CHECK $ptr, {$($typ_init)*} { $($fields)* } { + place!(@FIELDS $ptr, $($fields)*); + }); + }}; + // Nested structure + (@STRUCT $ptr:ident, $f_struct:ident, $typ:path, {$($typ_init:tt)*} { $($fields:tt)* }) => {{ + use core::ptr::addr_of_mut; + let buf = unsafe { addr_of_mut!((*$ptr).$f_struct) }; + place!(@STRUCT_ZERO buf, {$($typ_init)*} { $($fields)* }); + place!(@STRUCT_CHECK $ptr, {$($typ_init)*} { $($fields)* } { + place!(@FIELDS buf, $($fields)*); + }); + }}; + + // Zero-initialize structure if the initializer ends in ..Default::default() + (@STRUCT_ZERO $ptr:ident, {$($typ_init:tt)*} { $($f:ident $(: $v:expr)?),* $(,)? }) => {}; + (@STRUCT_ZERO $ptr:ident, {$($typ_init:tt)*} { $($($f:ident $(: $v:expr)?),*,)? ..Default::default() }) => {{ + // Check that the structure actually implements Zeroed + const _: () = { + fn _check_default() { + let _ = $($typ_init)* { + ..Zeroed::zeroed() + }; + } + }; + use core::ptr; + unsafe { ptr::write_bytes($ptr, 0, 1) }; + + }}; + + // Check that all fields are specified + (@STRUCT_CHECK $ptr:ident, {$($typ_init:tt)*} { $($($f:ident $(: $v:expr)?),*,)? ..Default::default() } {$($body:tt)*}) => { + if false { + #[allow(clippy::redundant_field_names)] + let _x = $($typ_init)* { + $($( + $f $(: $v)? + ),* + ,)? + ..Zeroed::zeroed() + }; + } else { + {$($body)*} + } + }; + (@STRUCT_CHECK $ptr:ident, {$($typ_init:tt)*} { $($f:ident $(: $v:expr)?),* $(,)? } {$($body:tt)*}) => { + if false { + #[allow(clippy::redundant_field_names)] + let _x = $($typ_init)* { + $( + $f $(: $v)? 
+ ),* + }; + } else { + {$($body)*} + } + }; + // Top-level scalar + (@SCALAR $ptr:ident, _TOP, $val:expr) => { + let tmp = $val; + unsafe { $ptr.write(tmp); } + }; + // Regular field + (@SCALAR $ptr:ident, $f:ident, $val:expr) => {{ + use core::ptr::addr_of_mut; + let tmp = $val; + unsafe { addr_of_mut!((*$ptr).$f).write(tmp); } + }}; + // Type-like name followed by braces is a nested structure + (@PARTIAL $ptr:ident, $f:ident, {$($head:tt)*}, {{ $($fields:tt)* } $($tail:tt)*}) => { + place!(@STRUCT $ptr, $f, $($head)*, {$($head)*} { $($fields)* }); + place!(@FIELDS $ptr $($tail)*) + }; + // Type-like name followed by ::ident, append to head + (@PARTIAL $ptr:ident, $f:ident, {$($head:tt)*}, {::$id:ident $($tail:tt)*}) => { + place!(@PARTIAL $ptr, $f, {$($head)* :: $id}, {$($tail)*}); + }; + // Type-like name followed by ::<args>, append to head + (@PARTIAL $ptr:ident, $f:ident, {$($head:tt)*}, {::<$($gen:ty),*> $($tail:tt)*}) => { + place!(@PARTIAL $ptr, $f, {$($head)* :: <$($gen),*>}, {$($tail)*}); + }; + // Type-like name followed by ::<'lifetime>, append to head + (@PARTIAL $ptr:ident, $f:ident, {$($head:tt)*}, {::<$li:lifetime> $($tail:tt)*}) => { + place!(@PARTIAL $ptr, $f, {$($head)* :: <$li>}, {$($tail)*}); + }; + // Anything else, parse it as an expression + (@PARTIAL $ptr:ident, $f:ident, {$($head:tt)*}, {$($tail:tt)*}) => { + place!(@EXPR $ptr, $f, $($head)* $($tail)*) + }; + // Expression followed by more fields + (@EXPR $ptr:ident, $f:ident, $val:expr, $($tail:tt)*) => { + place!(@SCALAR $ptr, $f, $val); + place!(@FIELDS $ptr, $($tail)*) + }; + // Last field expression, without a trailing comma + (@EXPR $ptr:ident, $f:ident, $val:expr) => { + place!(@SCALAR $ptr, $f, $val); + }; + // Field with a value starting with an ident, start incremental type parsing + (@FIELDS $ptr:ident, $f:ident : $id:ident $($tail:tt)*) => { + place!(@PARTIAL $ptr, $f, {$id}, {$($tail)*}); + }; + // Same, but starting with ::ident + (@FIELDS $ptr:ident, $f:ident : ::$id:ident 
$($tail:tt)*) => { + place!(@PARTIAL $ptr, $f, {::$id}, {$($tail)*}); + }; + // Otherwise, parse it as an expression + (@FIELDS $ptr:ident, $f:ident : $($tail:tt)*) => { + place!(@EXPR $ptr, $f, $($tail)*) + }; + // Default terminating case + (@FIELDS $ptr:ident, ..Default::default() ) => {}; + // Terminating case + (@FIELDS $ptr:ident $(,)? ) => {}; + ( + $buf:expr, + $($val:tt)* + ) => {{ + use core::mem::MaybeUninit; + // Ensures types are correct + let obj: &mut MaybeUninit<_> = $buf; + let top_ptr = obj.as_mut_ptr(); + place!(@FIELDS top_ptr, _TOP: $($val)*); + // SAFETY: All fields have been initialized above + // The compiler ensures that all fields were used, all types were correct, + // and that size and alignment are correct. + unsafe { obj.assume_init_mut() } + }}; +} + +/// Helper macro to get the struct type part of a struct initialization expression. +#[macro_export] +#[doc(hidden)] +macro_rules! get_type { + ($t:ty { $($val:tt)* }) => { + $t + }; +} + +/// Like `Box::try_new(...)`, but with in-place initialization. +#[macro_export] +macro_rules! 
box_in_place { + ($($val:tt)*) => {{ + use $crate::place; + let b = Box::<$crate::get_type!($($val)*)>::try_new_uninit(); + match b { + Ok(mut p) => { + place!((&mut *p), $($val)*); + Ok(unsafe { p.assume_init() }) + } + Err(e) => Err(e) + } + }}; +} + +// TODO: figure out how to make this run +#[cfg(test)] +mod tests { + use super::*; + use core::mem::MaybeUninit; + + #[derive(Debug, PartialEq)] + struct MyCoolStruct { + b: bool, + s: String, + i: i32, + v: Vec<String>, + x: MyOtherCoolStruct, + y: MyOtherCoolStruct, + z: foo::MyCoolGenericStruct<bool, String>, + } + + #[derive(Debug, PartialEq)] + struct MyDefaultStruct { + b: bool, + i: i32, + j: i16, + } + default_zeroed!(MyDefaultStruct); + + mod foo { + #[derive(Debug, PartialEq)] + pub struct MyOtherCoolStruct { + pub a: bool, + pub b: String, + } + #[derive(Debug, PartialEq)] + pub struct MyCoolGenericStruct<T, U> { + pub a: T, + pub b: U, + } + } + + use foo::MyOtherCoolStruct; + + #[test] + fn test_initialized() { + let mut buf: MaybeUninit<MyCoolStruct> = MaybeUninit::uninit(); + + let x: &mut MyCoolStruct = place!( + &mut buf, + MyCoolStruct { + b: true, + s: String::from("works"), + i: str::parse::<i32>("123").unwrap(), + v: vec![String::from("works")], + x: MyOtherCoolStruct { + a: false, + b: String::from("Hello, world!"), + }, + y: foo::MyOtherCoolStruct { + a: false, + b: String::from("Hello, world!"), + }, + z: foo::MyCoolGenericStruct::<bool, String> { + a: false, + b: String::from("Hello, world!"), + } + } + ); + //dbg!(x); + + assert_eq!( + x, + &MyCoolStruct { + b: true, + s: String::from("works"), + i: str::parse::<i32>("123").unwrap(), + v: vec![String::from("works")], + x: foo::MyOtherCoolStruct { + a: false, + b: String::from("Hello, world!"), + }, + y: foo::MyOtherCoolStruct { + a: false, + b: String::from("Hello, world!"), + }, + z: foo::MyCoolGenericStruct::<bool, String> { + a: false, + b: String::from("Hello, world!"), + }, + }, + ); + } + + #[test] + fn test_default() { + let mut 
buf: MaybeUninit<MyDefaultStruct> = MaybeUninit::uninit(); + + let x: &mut MyDefaultStruct = place!( + &mut buf, + MyDefaultStruct { + b: true, + i: 1, + ..Default::default() + } + ); + + assert_eq!( + x, + &MyDefaultStruct { + b: true, + i: 1, + j: 0, + }, + ); + } + + #[test] + fn test_scalar() { + let mut buf: MaybeUninit<u32> = MaybeUninit::uninit(); + + let x: &mut u32 = place!(&mut buf, 1234); + + assert_eq!(x, &mut 1234u32); + } +} diff --git a/drivers/gpu/drm/asahi/queue/common.rs b/drivers/gpu/drm/asahi/queue/common.rs new file mode 100644 index 000000000000..127b4ccc6eca --- /dev/null +++ b/drivers/gpu/drm/asahi/queue/common.rs @@ -0,0 +1,52 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! Common queue functionality. +//! +//! Shared helpers used by the submission logic for multiple command types. + +use crate::fw::microseq; +use crate::fw::types::*; + +use kernel::bindings; +use kernel::io_buffer::IoBufferReader; +use kernel::prelude::*; +use kernel::user_ptr::UserSlicePtr; + +use core::mem::MaybeUninit; + +pub(super) fn build_attachments(pointer: u64, count: u32) -> Result<microseq::Attachments> { + if count as usize > microseq::MAX_ATTACHMENTS { + return Err(EINVAL); + } + + const STRIDE: usize = core::mem::size_of::<bindings::drm_asahi_attachment>(); + let size = STRIDE * count as usize; + + // SAFETY: We only read this once, so there are no TOCTOU issues. + let mut reader = unsafe { UserSlicePtr::new(pointer as usize as *mut _, size).reader() }; + + let mut attachments: microseq::Attachments = Default::default(); + + for i in 0..count { + let mut att: MaybeUninit<bindings::drm_asahi_attachment> = MaybeUninit::uninit(); + + // SAFETY: The size of `att` is STRIDE + unsafe { reader.read_raw(att.as_mut_ptr() as *mut u8, STRIDE)?
}; + + // SAFETY: All bit patterns in the struct are valid + let att = unsafe { att.assume_init() }; + + let cache_lines = (att.size + 127) >> 7; + let order = 1; + attachments.list[i as usize] = microseq::Attachment { + address: U64(att.pointer), + size: cache_lines, + unk_c: 0x17, + unk_e: order, + }; + + attachments.count += 1; + } + + Ok(attachments) +} diff --git a/drivers/gpu/drm/asahi/queue/compute.rs b/drivers/gpu/drm/asahi/queue/compute.rs new file mode 100644 index 000000000000..6590382c75af --- /dev/null +++ b/drivers/gpu/drm/asahi/queue/compute.rs @@ -0,0 +1,371 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT +#![allow(clippy::unusual_byte_groupings)] + +//! Compute work queue. +//! +//! A compute queue consists of one underlying WorkQueue. +//! This module is in charge of creating all of the firmware structures required to submit compute +//! work to the GPU, based on the userspace command buffer. + +use super::common; +use crate::alloc::Allocator; +use crate::debug::*; +use crate::fw::types::*; +use crate::gpu::GpuManager; +use crate::{box_in_place, inner_ptr, inner_weak_ptr, place}; +use crate::{fw, gpu, microseq}; +use core::mem::MaybeUninit; +use core::sync::atomic::Ordering; +use kernel::bindings; +use kernel::dma_fence::RawDmaFence; +use kernel::drm::sched::Job; +use kernel::io_buffer::IoBufferReader; +use kernel::prelude::*; +use kernel::sync::Arc; +use kernel::user_ptr::UserSlicePtr; + +const DEBUG_CLASS: DebugFlags = DebugFlags::Compute; + +#[versions(AGX)] +impl super::Queue::ver { + /// Submit work to a compute queue. 
+ pub(super) fn submit_compute( + &self, + job: &mut Job<super::QueueJob::ver>, + cmd: &bindings::drm_asahi_command, + result_writer: Option<super::ResultWriter>, + id: u64, + flush_stamps: bool, + ) -> Result { + if cmd.cmd_type != bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_COMPUTE { + return Err(EINVAL); + } + + let dev = self.dev.data(); + let gpu = match dev.gpu.as_any().downcast_ref::<gpu::GpuManager::ver>() { + Some(gpu) => gpu, + None => { + dev_crit!(self.dev, "GpuManager mismatched with Queue!\n"); + return Err(EIO); + } + }; + + let mut alloc = gpu.alloc(); + let kalloc = &mut *alloc; + + mod_dev_dbg!(self.dev, "[Submission {}] Compute!\n", id); + + let mut cmdbuf_reader = unsafe { + UserSlicePtr::new( + cmd.cmd_buffer as usize as *mut _, + core::mem::size_of::<bindings::drm_asahi_cmd_compute>(), + ) + .reader() + }; + + let mut cmdbuf: MaybeUninit<bindings::drm_asahi_cmd_compute> = MaybeUninit::uninit(); + unsafe { + cmdbuf_reader.read_raw( + cmdbuf.as_mut_ptr() as *mut u8, + core::mem::size_of::<bindings::drm_asahi_cmd_compute>(), + )?; + } + let cmdbuf = unsafe { cmdbuf.assume_init() }; + + if cmdbuf.flags != 0 { + return Err(EINVAL); + } + + // This sequence number increases per new client/VM? assigned to some slot, + // but it's unclear *which* slot... + let slot_client_seq: u8 = (self.id & 0xff) as u8; + + let vm_bind = job.vm_bind.clone(); + + mod_dev_dbg!( + self.dev, + "[Submission {}] VM slot = {}\n", + id, + vm_bind.slot() + ); + + let notifier = self.notifier.clone(); + + let fence = job.fence.clone(); + let comp_job = job.get_comp()?; + let ev_comp = comp_job.event_info(); + + // TODO: Is this the same on all GPUs? Is this really for preemption?
+ let preempt_size = 0x7fa0; + let preempt2_off = 0x7f80; + let preempt3_off = 0x7f88; + let preempt4_off = 0x7f90; + let preempt5_off = 0x7f98; + + let preempt_buf = self.ualloc.lock().array_empty(preempt_size)?; + + let mut seq_buf = self.ualloc.lock().array_empty(0x800)?; + for i in 1..0x400 { + seq_buf[i] = (i + 1) as u64; + } + + mod_dev_dbg!( + self.dev, + "[Submission {}] Event #{} {:#x?} -> {:#x?}\n", + id, + ev_comp.slot, + ev_comp.value, + ev_comp.value.next(), + ); + + let timestamps = Arc::try_new(kalloc.shared.new_default::<fw::job::JobTimestamps>()?)?; + + let uuid = cmdbuf.cmd_id; + + let unk3 = debug_enabled(debug::DebugFlags::Debug3); + + mod_dev_dbg!(self.dev, "[Submission {}] UUID = {:#x?}\n", id, uuid); + + // TODO: check + #[ver(V >= V13_0B4)] + let count = self.counter.fetch_add(1, Ordering::Relaxed); + + let comp = GpuObject::new_prealloc( + kalloc.private.alloc_object()?, + |ptr: GpuWeakPointer<fw::compute::RunCompute::ver>| { + let mut builder = microseq::Builder::new(); + + let stats = gpu.initdata.runtime_pointers.stats.comp.weak_pointer(); + + let start_comp = builder.add(microseq::StartCompute::ver { + header: microseq::op::StartCompute::HEADER, + unk_pointer: inner_weak_ptr!(ptr, unk_pointee), + job_params1: inner_weak_ptr!(ptr, job_params1), + stats, + work_queue: ev_comp.info_ptr, + vm_slot: vm_bind.slot(), + unk_28: 0x1, + event_generation: self.id as u32, + cmd_seq: U64(ev_comp.cmd_seq), + unk_38: 0x0, + job_params2: inner_weak_ptr!(ptr, job_params2), + unk_44: 0x0, + uuid, + attachments: common::build_attachments( + cmdbuf.attachments, + cmdbuf.attachment_count, + )?, + padding: Default::default(), + #[ver(V >= V13_0B4)] + unk_flag: inner_weak_ptr!(ptr, unk_flag), + #[ver(V >= V13_0B4)] + counter: U64(count), + #[ver(V >= V13_0B4)] + notifier_buf: inner_weak_ptr!(notifier.weak_pointer(), state.unk_buf), + })?; + + if result_writer.is_some() { + builder.add(microseq::Timestamp::ver { + header: microseq::op::Timestamp::new(true), +
cur_ts: inner_weak_ptr!(ptr, cur_ts), + start_ts: inner_weak_ptr!(ptr, start_ts), + update_ts: inner_weak_ptr!(ptr, start_ts), + work_queue: ev_comp.info_ptr, + unk_24: U64(0), + #[ver(V >= V13_0B4)] + unk_ts: inner_weak_ptr!(ptr, unk_ts), + uuid, + unk_30_padding: 0, + })?; + } + + builder.add(microseq::WaitForIdle { + header: microseq::op::WaitForIdle::new(microseq::Pipe::Compute), + })?; + + if result_writer.is_some() { + builder.add(microseq::Timestamp::ver { + header: microseq::op::Timestamp::new(false), + cur_ts: inner_weak_ptr!(ptr, cur_ts), + start_ts: inner_weak_ptr!(ptr, start_ts), + update_ts: inner_weak_ptr!(ptr, end_ts), + work_queue: ev_comp.info_ptr, + unk_24: U64(0), + #[ver(V >= V13_0B4)] + unk_ts: inner_weak_ptr!(ptr, unk_ts), + uuid, + unk_30_padding: 0, + })?; + } + + let off = builder.offset_to(start_comp); + builder.add(microseq::FinalizeCompute::ver { + header: microseq::op::FinalizeCompute::HEADER, + stats, + work_queue: ev_comp.info_ptr, + vm_slot: vm_bind.slot(), + #[ver(V < V13_0B4)] + unk_18: 0, + job_params2: inner_weak_ptr!(ptr, job_params2), + unk_24: 0, + uuid, + fw_stamp: ev_comp.fw_stamp_pointer, + stamp_value: ev_comp.value.next(), + unk_38: 0, + unk_3c: 0, + unk_40: 0, + unk_44: 0, + unk_48: 0, + unk_4c: 0, + unk_50: 0, + unk_54: 0, + unk_58: 0, + #[ver(G == G14 && V < V13_0B4)] + unk_5c_g14: U64(0), + restart_branch_offset: off, + unk_60: unk3.into(), + #[ver(V >= V13_0B4)] + unk_64: Default::default(), + #[ver(V >= V13_0B4)] + unk_flag: inner_weak_ptr!(ptr, unk_flag), + #[ver(V >= V13_0B4)] + unk_79: Default::default(), + })?; + + builder.add(microseq::RetireStamp { + header: microseq::op::RetireStamp::HEADER, + })?; + + Ok(box_in_place!(fw::compute::RunCompute::ver { + notifier: notifier.clone(), + preempt_buf: preempt_buf, + seq_buf: seq_buf, + micro_seq: builder.build(&mut kalloc.private)?, + vm_bind: vm_bind.clone(), + timestamps: timestamps.clone(), + })?) 
+ }, + |inner, ptr| { + Ok(place!( + ptr, + fw::compute::raw::RunCompute::ver { + tag: fw::workqueue::CommandType::RunCompute, + #[ver(V >= V13_0B4)] + counter: U64(count), + unk_4: 0, + vm_slot: vm_bind.slot(), + notifier: inner.notifier.gpu_pointer(), + unk_pointee: Default::default(), + job_params1: fw::compute::raw::JobParameters1 { + preempt_buf1: inner.preempt_buf.gpu_pointer(), + encoder: U64(cmdbuf.encoder_ptr), + // buf2-5 Only if internal program is used + preempt_buf2: inner.preempt_buf.gpu_offset_pointer(preempt2_off), + preempt_buf3: inner.preempt_buf.gpu_offset_pointer(preempt3_off), + preempt_buf4: inner.preempt_buf.gpu_offset_pointer(preempt4_off), + preempt_buf5: inner.preempt_buf.gpu_offset_pointer(preempt5_off), + pipeline_base: U64(0x11_00000000), + unk_38: U64(0x8c60), + unk_40: cmdbuf.ctx_switch_prog, // Internal program addr | 1 + unk_44: 0, + compute_layout_addr: U64(cmdbuf.buffer_descriptor), // Only if internal program used + unk_50: cmdbuf.buffer_descriptor_size, // 0x40 if internal program used + unk_54: 0, + unk_58: 1, + unk_5c: 0, + iogpu_unk_40: cmdbuf.iogpu_unk_40, // 0x1c if internal program used + }, + unk_b8: Default::default(), + microsequence: inner.micro_seq.gpu_pointer(), + microsequence_size: inner.micro_seq.len() as u32, + job_params2: fw::compute::raw::JobParameters2::ver { + #[ver(V >= V13_0B4)] + unk_0_0: 0, + unk_0: Default::default(), + preempt_buf1: inner.preempt_buf.gpu_pointer(), + encoder_end: U64(cmdbuf.encoder_end), + unk_34: Default::default(), + #[ver(V < V13_0B4)] + unk_5c: 0, + }, + encoder_params: fw::job::raw::EncoderParams { + unk_8: 0x0, // fixed + unk_c: 0x0, // fixed + unk_10: 0x0, // fixed + encoder_id: cmdbuf.encoder_id, + unk_18: 0x0, // fixed + iogpu_compute_unk44: cmdbuf.iogpu_unk_44, + seq_buffer: inner.seq_buf.gpu_pointer(), + unk_28: U64(0x0), // fixed + }, + meta: fw::job::raw::JobMeta { + unk_4: 0, + stamp: ev_comp.stamp_pointer, + fw_stamp: ev_comp.fw_stamp_pointer, + stamp_value: 
ev_comp.value.next(), + stamp_slot: ev_comp.slot, + evctl_index: 0, // fixed + flush_stamps: flush_stamps as u32, + uuid: uuid, + cmd_seq: ev_comp.cmd_seq as u32, + }, + cur_ts: U64(0), + start_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), start)), + end_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), end)), + unk_2c0: 0, + unk_2c4: 0, + unk_2c8: 0, + unk_2cc: 0, + client_sequence: slot_client_seq, + pad_2d1: Default::default(), + unk_2d4: 0, + unk_2d8: 0, + #[ver(V >= V13_0B4)] + unk_ts: U64(0), + #[ver(V >= V13_0B4)] + unk_2e1: Default::default(), + #[ver(V >= V13_0B4)] + unk_flag: U32(0), + #[ver(V >= V13_0B4)] + unk_pad: Default::default(), + } + )) + }, + )?; + + core::mem::drop(alloc); + + fence.add_command(); + comp_job.add_cb(comp, vm_bind.slot(), move |cmd, error| { + if let Some(err) = error { + fence.set_error(err.into()) + } + if let Some(mut rw) = result_writer { + let mut result: bindings::drm_asahi_result_compute = Default::default(); + + cmd.timestamps.with(|raw, _inner| { + result.ts_start = raw.start.load(Ordering::Relaxed); + result.ts_end = raw.end.load(Ordering::Relaxed); + }); + + if let Some(err) = error { + result.info = err.into(); + } else { + result.info.status = bindings::drm_asahi_status_DRM_ASAHI_STATUS_COMPLETE; + } + + rw.write(result); + } + + fence.command_complete(); + })?; + + notifier.threshold.with(|raw, _inner| { + raw.increment(); + }); + + comp_job.next_seq(); + + Ok(()) + } +} diff --git a/drivers/gpu/drm/asahi/queue/mod.rs b/drivers/gpu/drm/asahi/queue/mod.rs new file mode 100644 index 000000000000..15988af33cf3 --- /dev/null +++ b/drivers/gpu/drm/asahi/queue/mod.rs @@ -0,0 +1,725 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! Submission queue management +//! +//! This module implements the userspace view of submission queues and the logic to map userspace +//! submissions to firmware queues. 
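[Note: the queue module's `JobFence` tracks completion by counting outstanding commands in an atomic and signaling the fence when the count drops to zero. The pattern can be illustrated outside the kernel; the following is a minimal userspace sketch, where `CountingFence` and its `signaled` flag are illustrative stand-ins rather than the driver's dma_fence-backed implementation:]

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// Toy stand-in for a counting completion fence: it "signals" (sets a flag)
// once every command that was added has completed.
struct CountingFence {
    pending: AtomicU64,
    signaled: AtomicBool,
}

impl CountingFence {
    fn new() -> Self {
        Self {
            pending: AtomicU64::new(0),
            signaled: AtomicBool::new(false),
        }
    }

    // Mirrors JobFence::add_command: one more command in flight.
    fn add_command(&self) {
        self.pending.fetch_add(1, Ordering::Relaxed);
    }

    // Mirrors JobFence::command_complete: the thread that retires the last
    // outstanding command signals the fence.
    fn command_complete(&self) {
        if self.pending.fetch_sub(1, Ordering::Relaxed) == 1 {
            self.signaled.store(true, Ordering::Release);
        }
    }
}

fn main() {
    let fence = CountingFence::new();
    fence.add_command();
    fence.add_command();
    fence.command_complete();
    // One command still pending: not signaled yet.
    assert!(!fence.signaled.load(Ordering::Acquire));
    fence.command_complete();
    // Last command retired: signaled.
    assert!(fence.signaled.load(Ordering::Acquire));
}
```

[In the driver, the signal path additionally reports errors into the fence and runs under the dma_fence locking rules; the sketch only shows the counting discipline.]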
+ +use kernel::dma_fence::*; +use kernel::prelude::*; +use kernel::{ + bindings, c_str, dma_fence, + drm::gem::shmem::VMap, + drm::sched, + macros::versions, + sync::{smutex::Mutex, Arc}, +}; + +use crate::alloc::Allocator; +use crate::debug::*; +use crate::driver::AsahiDevice; +use crate::fw::types::*; +use crate::gpu::GpuManager; +use crate::{alloc, buffer, channel, event, file, fw, gem, gpu, mmu, workqueue}; +use crate::{inner_weak_ptr, place}; + +use core::mem::MaybeUninit; +use core::sync::atomic::{AtomicU64, Ordering}; + +const DEBUG_CLASS: DebugFlags = DebugFlags::Queue; + +const WQ_SIZE: u32 = 0x500; + +mod common; +mod compute; +mod render; + +/// Trait implemented by all versioned queues. +pub(crate) trait Queue: Send + Sync { + fn submit( + &mut self, + id: u64, + in_syncs: Vec<file::SyncItem>, + out_syncs: Vec<file::SyncItem>, + result_buf: Option<gem::ObjectRef>, + commands: Vec<bindings::drm_asahi_command>, + ) -> Result; +} + +#[versions(AGX)] +struct SubQueue { + wq: Arc<workqueue::WorkQueue::ver>, +} + +#[versions(AGX)] +impl SubQueue::ver { + fn new_job(&mut self) -> SubQueueJob::ver { + SubQueueJob::ver { + wq: self.wq.clone(), + job: None, + } + } +} + +#[versions(AGX)] +struct SubQueueJob { + wq: Arc<workqueue::WorkQueue::ver>, + job: Option<workqueue::Job::ver>, +} + +#[versions(AGX)] +impl SubQueueJob::ver { + fn get(&mut self) -> Result<&mut workqueue::Job::ver> { + if self.job.is_none() { + mod_pr_debug!("SubQueueJob: Creating {:?} job\n", self.wq.pipe_type()); + self.job.replace(self.wq.new_job()?); + } + Ok(self.job.as_mut().expect("expected a Job")) + } + + fn commit(&mut self) -> Result { + match self.job.as_mut() { + Some(job) => job.commit(), + None => Ok(()), + } + } + + fn can_submit(&self) -> bool { + match self.job.as_ref() { + None => true, + Some(job) => job.can_submit(), + } + } +} + +#[versions(AGX)] +pub(crate) struct Queue { + dev: AsahiDevice, + _sched: sched::Scheduler<QueueJob::ver>, + entity: sched::Entity<QueueJob::ver>, + vm: mmu::Vm, +
ualloc: Arc<Mutex<alloc::DefaultAllocator>>, + q_vtx: Option<SubQueue::ver>, + q_frag: Option<SubQueue::ver>, + q_comp: Option<SubQueue::ver>, + buffer: Option<Mutex<buffer::Buffer::ver>>, + gpu_context: Arc<workqueue::GpuContext>, + notifier_list: Arc<GpuObject<fw::event::NotifierList>>, + notifier: Arc<GpuObject<fw::event::Notifier::ver>>, + id: u64, + fence_ctx: FenceContexts, + #[ver(V >= V13_0B4)] + counter: AtomicU64, +} + +#[versions(AGX)] +#[derive(Default)] +pub(crate) struct JobFence { + id: u64, + pending: AtomicU64, +} + +#[versions(AGX)] +impl JobFence::ver { + fn add_command(self: &FenceObject<Self>) { + self.pending.fetch_add(1, Ordering::Relaxed); + } + + fn command_complete(self: &FenceObject<Self>) { + let remain = self.pending.fetch_sub(1, Ordering::Relaxed) - 1; + mod_pr_debug!( + "JobFence[{}]: Command complete (remain: {})\n", + self.id, + remain + ); + if remain == 0 { + mod_pr_debug!("JobFence[{}]: Signaling\n", self.id); + if self.signal().is_err() { + pr_err!("JobFence[{}]: Fence signal failed\n", self.id); + } + } + } +} + +#[versions(AGX)] +#[vtable] +impl dma_fence::FenceOps for JobFence::ver { + const USE_64BIT_SEQNO: bool = true; + + fn get_driver_name<'a>(self: &'a FenceObject<Self>) -> &'a CStr { + c_str!("asahi") + } + fn get_timeline_name<'a>(self: &'a FenceObject<Self>) -> &'a CStr { + c_str!("queue") + } +} + +#[versions(AGX)] +pub(crate) struct QueueJob { + dev: AsahiDevice, + vm_bind: mmu::VmBind, + op_guard: Option<gpu::OpGuard>, + sj_vtx: Option<SubQueueJob::ver>, + sj_frag: Option<SubQueueJob::ver>, + sj_comp: Option<SubQueueJob::ver>, + fence: UserFence<JobFence::ver>, + did_run: bool, + id: u64, +} + +#[versions(AGX)] +impl QueueJob::ver { + fn get_vtx(&mut self) -> Result<&mut workqueue::Job::ver> { + self.sj_vtx.as_mut().ok_or(EINVAL)?.get() + } + fn get_frag(&mut self) -> Result<&mut workqueue::Job::ver> { + self.sj_frag.as_mut().ok_or(EINVAL)?.get() + } + fn get_comp(&mut self) -> Result<&mut workqueue::Job::ver> {
self.sj_comp.as_mut().ok_or(EINVAL)?.get() + } + + fn commit(&mut self) -> Result { + mod_dev_dbg!(self.dev, "QueueJob: Committing\n"); + + self.sj_vtx.as_mut().map(|a| a.commit()).unwrap_or(Ok(()))?; + self.sj_frag + .as_mut() + .map(|a| a.commit()) + .unwrap_or(Ok(()))?; + self.sj_comp.as_mut().map(|a| a.commit()).unwrap_or(Ok(())) + } +} + +#[versions(AGX)] +impl sched::JobImpl for QueueJob::ver { + fn can_run(job: &mut sched::Job<Self>) -> bool { + mod_dev_dbg!(job.dev, "QueueJob {}: Checking runnability\n", job.id); + + if let Some(sj) = job.sj_vtx.as_ref() { + if !sj.can_submit() { + mod_dev_dbg!( + job.dev, + "QueueJob {}: Blocking due to vertex queue full\n", + job.id + ); + return false; + } + } + if let Some(sj) = job.sj_frag.as_ref() { + if !sj.can_submit() { + mod_dev_dbg!( + job.dev, + "QueueJob {}: Blocking due to fragment queue full\n", + job.id + ); + return false; + } + } + if let Some(sj) = job.sj_comp.as_ref() { + if !sj.can_submit() { + mod_dev_dbg!( + job.dev, + "QueueJob {}: Blocking due to compute queue full\n", + job.id + ); + return false; + } + } + true + } + + #[allow(unused_assignments)] + fn run(job: &mut sched::Job<Self>) -> Result<Option<dma_fence::Fence>> { + mod_dev_dbg!(job.dev, "QueueJob {}: Running Job\n", job.id); + + let dev = job.dev.data(); + let gpu = match dev + .gpu + .clone() + .arc_as_any() + .downcast::<gpu::GpuManager::ver>() + { + Ok(gpu) => gpu, + Err(_) => { + dev_crit!(job.dev, "GpuManager mismatched with QueueJob!\n"); + return Err(EIO); + } + }; + + if job.op_guard.is_none() { + job.op_guard = Some(gpu.start_op()?); + } + + // First submit all the commands for each queue. This can fail.
+ + let mut frag_job = None; + let mut frag_sub = None; + if let Some(sj) = job.sj_frag.as_mut() { + frag_job = sj.job.take(); + if let Some(wqjob) = frag_job.as_mut() { + mod_dev_dbg!(job.dev, "QueueJob {}: Submit fragment\n", job.id); + frag_sub = Some(wqjob.submit()?); + } + } + + let mut vtx_job = None; + let mut vtx_sub = None; + if let Some(sj) = job.sj_vtx.as_mut() { + vtx_job = sj.job.take(); + if let Some(wqjob) = vtx_job.as_mut() { + mod_dev_dbg!(job.dev, "QueueJob {}: Submit vertex\n", job.id); + vtx_sub = Some(wqjob.submit()?); + } + } + + let mut comp_job = None; + let mut comp_sub = None; + if let Some(sj) = job.sj_comp.as_mut() { + comp_job = sj.job.take(); + if let Some(wqjob) = comp_job.as_mut() { + mod_dev_dbg!(job.dev, "QueueJob {}: Submit compute\n", job.id); + comp_sub = Some(wqjob.submit()?); + } + } + + // Now we fully commit to running the job + mod_dev_dbg!(job.dev, "QueueJob {}: Run fragment\n", job.id); + frag_sub.map(|a| gpu.run_job(a)).transpose()?; + + mod_dev_dbg!(job.dev, "QueueJob {}: Run vertex\n", job.id); + vtx_sub.map(|a| gpu.run_job(a)).transpose()?; + + mod_dev_dbg!(job.dev, "QueueJob {}: Run compute\n", job.id); + comp_sub.map(|a| gpu.run_job(a)).transpose()?; + + mod_dev_dbg!(job.dev, "QueueJob {}: Drop compute job\n", job.id); + core::mem::drop(comp_job); + mod_dev_dbg!(job.dev, "QueueJob {}: Drop vertex job\n", job.id); + core::mem::drop(vtx_job); + mod_dev_dbg!(job.dev, "QueueJob {}: Drop fragment job\n", job.id); + core::mem::drop(frag_job); + + job.did_run = true; + + Ok(Some(Fence::from_fence(&job.fence))) + } + + fn timed_out(job: &mut sched::Job<Self>) -> sched::Status { + // FIXME: Handle timeouts properly + dev_err!( + job.dev, + "QueueJob {}: Job timed out on the DRM scheduler, things will probably break (ran: {})\n", + job.id, job.did_run + ); + sched::Status::NoDevice + } +} + +#[versions(AGX)] +impl Drop for QueueJob::ver { + fn drop(&mut self) { + mod_dev_dbg!(self.dev, "QueueJob {}: Dropping\n", self.id); + } 
+} + +struct ResultWriter { + vmap: VMap<gem::DriverObject>, + offset: usize, + len: usize, +} + +impl ResultWriter { + fn write<T>(&mut self, mut value: T) { + let p: *mut u8 = &mut value as *mut _ as *mut u8; + // SAFETY: We know `p` points to a type T of that size, and UAPI types must have + // no padding and all bit patterns valid. + let slice = unsafe { core::slice::from_raw_parts_mut(p, core::mem::size_of::<T>()) }; + let len = slice.len().min(self.len); + self.vmap.as_mut_slice()[self.offset..self.offset + len].copy_from_slice(&slice[..len]); + } +} + +static QUEUE_NAME: &CStr = c_str!("asahi_fence"); +static QUEUE_CLASS_KEY: kernel::sync::LockClassKey = kernel::sync::LockClassKey::new(); + +#[versions(AGX)] +impl Queue::ver { + /// Create a new user queue. + #[allow(clippy::too_many_arguments)] + pub(crate) fn new( + dev: &AsahiDevice, + vm: mmu::Vm, + alloc: &mut gpu::KernelAllocators, + ualloc: Arc<Mutex<alloc::DefaultAllocator>>, + ualloc_priv: Arc<Mutex<alloc::DefaultAllocator>>, + event_manager: Arc<event::EventManager>, + mgr: &buffer::BufferManager, + id: u64, + priority: u32, + caps: u32, + ) -> Result<Queue::ver> { + mod_dev_dbg!(dev, "[Queue {}] Creating queue\n", id); + + let data = dev.data(); + + let mut notifier_list = alloc.private.new_default::<fw::event::NotifierList>()?; + + let self_ptr = notifier_list.weak_pointer(); + notifier_list.with_mut(|raw, _inner| { + raw.list_head.next = Some(inner_weak_ptr!(self_ptr, list_head)); + }); + + let threshold = alloc.shared.new_default::<fw::event::Threshold>()?; + + let notifier: Arc<GpuObject<fw::event::Notifier::ver>> = + Arc::try_new(alloc.private.new_inplace( + fw::event::Notifier::ver { threshold }, + |inner, ptr: &mut MaybeUninit<fw::event::raw::Notifier::ver<'_>>| { + Ok(place!( + ptr, + fw::event::raw::Notifier::ver { + threshold: inner.threshold.gpu_pointer(), + generation: AtomicU32::new(id as u32), + cur_count: AtomicU32::new(0), + unk_10: AtomicU32::new(0x50), + state: Default::default() + } + )) + }, +
)?)?; + + let sched = sched::Scheduler::new(dev, WQ_SIZE, 0, 100000, c_str!("asahi_sched"))?; + // Priorities are handled by the AGX scheduler, there is no meaning within a + // per-queue scheduler. + let entity = sched::Entity::new(&sched, sched::Priority::Normal)?; + + let mut ret = Queue::ver { + dev: dev.clone(), + _sched: sched, + entity, + vm, + ualloc, + q_vtx: None, + q_frag: None, + q_comp: None, + buffer: None, + gpu_context: Arc::try_new(workqueue::GpuContext::new(dev, alloc)?)?, + notifier_list: Arc::try_new(notifier_list)?, + notifier, + id, + fence_ctx: FenceContexts::new(1, QUEUE_NAME, &QUEUE_CLASS_KEY)?, + #[ver(V >= V13_0B4)] + counter: AtomicU64::new(0), + }; + + // Rendering structures + if caps & bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_RENDER != 0 { + let buffer = + buffer::Buffer::ver::new(&*data.gpu, alloc, ret.ualloc.clone(), ualloc_priv, mgr)?; + let tvb_blocks = { + let lock = crate::THIS_MODULE.kernel_param_lock(); + *crate::initial_tvb_size.read(&lock) + }; + + buffer.ensure_blocks(tvb_blocks)?; + + ret.buffer = Some(Mutex::new(buffer)); + ret.q_vtx = Some(SubQueue::ver { + wq: workqueue::WorkQueue::ver::new( + alloc, + event_manager.clone(), + ret.gpu_context.clone(), + ret.notifier_list.clone(), + channel::PipeType::Vertex, + id, + priority, + WQ_SIZE, + )?, + }); + } + + // Rendering & blit structures + if caps + & (bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_RENDER + | bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_BLIT) + != 0 + { + ret.q_frag = Some(SubQueue::ver { + wq: workqueue::WorkQueue::ver::new( + alloc, + event_manager.clone(), + ret.gpu_context.clone(), + ret.notifier_list.clone(), + channel::PipeType::Fragment, + id, + priority, + WQ_SIZE, + )?, + }); + } + + // Compute structures + if caps & bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_COMPUTE != 0 { + ret.q_comp = Some(SubQueue::ver { + wq: workqueue::WorkQueue::ver::new( + alloc, + event_manager, + ret.gpu_context.clone(), + 
ret.notifier_list.clone(), + channel::PipeType::Compute, + id, + priority, + WQ_SIZE, + )?, + }); + } + + mod_dev_dbg!(dev, "[Queue {}] Queue created\n", id); + Ok(ret) + } +} + +const SQ_RENDER: usize = bindings::drm_asahi_subqueue_DRM_ASAHI_SUBQUEUE_RENDER as usize; +const SQ_COMPUTE: usize = bindings::drm_asahi_subqueue_DRM_ASAHI_SUBQUEUE_COMPUTE as usize; +const SQ_COUNT: usize = bindings::drm_asahi_subqueue_DRM_ASAHI_SUBQUEUE_COUNT as usize; + +#[versions(AGX)] +impl Queue for Queue::ver { + fn submit( + &mut self, + id: u64, + in_syncs: Vec<file::SyncItem>, + out_syncs: Vec<file::SyncItem>, + result_buf: Option<gem::ObjectRef>, + commands: Vec<bindings::drm_asahi_command>, + ) -> Result { + let dev = self.dev.data(); + let gpu = match dev + .gpu + .clone() + .arc_as_any() + .downcast::<gpu::GpuManager::ver>() + { + Ok(gpu) => gpu, + Err(_) => { + dev_crit!(self.dev, "GpuManager mismatched with JobImpl!\n"); + return Err(EIO); + } + }; + + mod_dev_dbg!(self.dev, "[Submission {}] Submit job\n", id); + + if gpu.is_crashed() { + dev_err!( + self.dev, + "[Submission {}] GPU is crashed, cannot submit\n", + id + ); + return Err(ENODEV); + } + + // Empty submissions are not legal + if commands.is_empty() { + return Err(EINVAL); + } + + let op_guard = if !in_syncs.is_empty() { + Some(gpu.start_op()?)
+ } else { + None + }; + + let mut events: [Vec<Option<workqueue::QueueEventInfo::ver>>; SQ_COUNT] = + Default::default(); + + events[SQ_RENDER].try_push(self.q_frag.as_ref().and_then(|a| a.wq.event_info()))?; + events[SQ_COMPUTE].try_push(self.q_comp.as_ref().and_then(|a| a.wq.event_info()))?; + + let vm_bind = gpu.bind_vm(&self.vm)?; + let vm_slot = vm_bind.slot(); + + mod_dev_dbg!(self.dev, "[Submission {}] Creating job\n", id); + let mut job = self.entity.new_job(QueueJob::ver { + dev: self.dev.clone(), + vm_bind, + op_guard, + sj_vtx: self.q_vtx.as_mut().map(|a| a.new_job()), + sj_frag: self.q_frag.as_mut().map(|a| a.new_job()), + sj_comp: self.q_comp.as_mut().map(|a| a.new_job()), + fence: self + .fence_ctx + .new_fence::<JobFence::ver>( + 0, + JobFence::ver { + id, + pending: Default::default(), + }, + )? + .into(), + did_run: false, + id, + })?; + + mod_dev_dbg!( + self.dev, + "[Submission {}] Adding {} in_syncs\n", + id, + in_syncs.len() + ); + for sync in in_syncs { + job.add_dependency(sync.fence.expect("in_sync missing fence"))?; + } + + let mut last_render = None; + let mut last_compute = None; + + for (i, cmd) in commands.iter().enumerate() { + match cmd.cmd_type { + bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_RENDER => last_render = Some(i), + bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_COMPUTE => last_compute = Some(i), + _ => return Err(EINVAL), + } + } + + mod_dev_dbg!( + self.dev, + "[Submission {}] Submitting {} commands\n", + id, + commands.len() + ); + for (i, cmd) in commands.into_iter().enumerate() { + for (queue_idx, index) in cmd.barriers.iter().enumerate() { + if *index == bindings::DRM_ASAHI_BARRIER_NONE as u32 { + continue; + } + if let Some(event) = events[queue_idx].get(*index as usize).ok_or(EINVAL)?
{ + let mut alloc = gpu.alloc(); + let queue_job = match cmd.cmd_type { + bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_RENDER => job.get_vtx()?, + bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_COMPUTE => job.get_comp()?, + _ => return Err(EINVAL), + }; + mod_dev_dbg!(self.dev, "[Submission {}] Create Explicit Barrier\n", id); + let barrier: GpuObject<fw::workqueue::Barrier> = alloc.private.new_inplace( + Default::default(), + |_inner, ptr: &mut MaybeUninit<fw::workqueue::raw::Barrier>| { + Ok(place!( + ptr, + fw::workqueue::raw::Barrier { + tag: fw::workqueue::CommandType::Barrier, + wait_stamp: event.fw_stamp_pointer, + wait_value: event.value, + wait_slot: event.slot, + stamp_self: queue_job.event_info().value.next(), + uuid: 0xffffbbbb, + unk: 0, + } + )) + }, + )?; + mod_dev_dbg!(self.dev, "[Submission {}] Add Explicit Barrier\n", id); + queue_job.add(barrier, vm_slot)?; + } else { + assert!(*index == 0); + } + } + + let result_writer = match result_buf.as_ref() { + None => { + if cmd.result_offset != 0 || cmd.result_size != 0 { + return Err(EINVAL); + } + None + } + Some(buf) => { + if cmd.result_size != 0 { + if cmd + .result_offset + .checked_add(cmd.result_size) + .ok_or(EINVAL)?
+ > buf.size() as u64 + { + return Err(EINVAL); + } + Some(ResultWriter { + vmap: buf.gem.vmap()?, + offset: cmd.result_offset.try_into()?, + len: cmd.result_size.try_into()?, + }) + } else { + None + } + } + }; + + match cmd.cmd_type { + bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_RENDER => { + self.submit_render( + &mut job, + &cmd, + result_writer, + id, + last_render.unwrap() == i, + )?; + events[SQ_RENDER].try_push(Some( + job.sj_frag + .as_ref() + .expect("No frag queue?") + .job + .as_ref() + .expect("No frag job?") + .event_info(), + ))?; + } + bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_COMPUTE => { + self.submit_compute( + &mut job, + &cmd, + result_writer, + id, + last_compute.unwrap() == i, + )?; + events[SQ_COMPUTE].try_push(Some( + job.sj_comp + .as_ref() + .expect("No comp queue?") + .job + .as_ref() + .expect("No comp job?") + .event_info(), + ))?; + } + _ => return Err(EINVAL), + } + } + + mod_dev_dbg!(self.dev, "Queue: Committing job\n"); + job.commit()?; + + mod_dev_dbg!(self.dev, "Queue: Arming job\n"); + let job = job.arm(); + let out_fence = job.fences().finished(); + mod_dev_dbg!(self.dev, "Queue: Pushing job\n"); + job.push(); + + mod_dev_dbg!(self.dev, "Queue: Adding {} out_syncs\n", out_syncs.len()); + for mut sync in out_syncs { + if let Some(chain) = sync.chain_fence.take() { + sync.syncobj + .add_point(chain, &out_fence, sync.timeline_value); + } else { + sync.syncobj.replace_fence(Some(&out_fence)); + } + } + + Ok(()) + } +} + +#[versions(AGX)] +impl Drop for Queue::ver { + fn drop(&mut self) { + mod_dev_dbg!(self.dev, "[Queue {}] Dropping queue\n", self.id); + } +} diff --git a/drivers/gpu/drm/asahi/queue/render.rs b/drivers/gpu/drm/asahi/queue/render.rs new file mode 100644 index 000000000000..318c952df020 --- /dev/null +++ b/drivers/gpu/drm/asahi/queue/render.rs @@ -0,0 +1,1173 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT +#![allow(clippy::unusual_byte_groupings)] + +//! Render work queue. +//! +//! 
A render queue consists of two underlying WorkQueues, one for vertex and one for fragment work. +//! This module is in charge of creating all of the firmware structures required to submit 3D +//! rendering work to the GPU, based on the userspace command buffer. + +use super::common; +use crate::alloc::Allocator; +use crate::debug::*; +use crate::fw::types::*; +use crate::gpu::GpuManager; +use crate::util::*; +use crate::workqueue::WorkError; +use crate::{box_in_place, inner_ptr, inner_weak_ptr, place}; +use crate::{buffer, fw, gpu, microseq, workqueue}; +use core::mem::MaybeUninit; +use core::sync::atomic::Ordering; +use kernel::bindings; +use kernel::dma_fence::RawDmaFence; +use kernel::drm::sched::Job; +use kernel::io_buffer::IoBufferReader; +use kernel::prelude::*; +use kernel::sync::{smutex::Mutex, Arc}; +use kernel::user_ptr::UserSlicePtr; + +const DEBUG_CLASS: DebugFlags = DebugFlags::Render; + +/// Tiling/Vertex control bit to disable using more than one GPU cluster. This results in decreased +/// throughput but also less latency, which is probably desirable for light vertex loads where the +/// overhead of clustering/merging would exceed the time it takes to just run the job on one +/// cluster. 
+const TILECTL_DISABLE_CLUSTERING: u32 = 1u32 << 0; + +struct RenderResult { + result: bindings::drm_asahi_result_render, + vtx_complete: bool, + frag_complete: bool, + vtx_error: Option<workqueue::WorkError>, + frag_error: Option<workqueue::WorkError>, + writer: super::ResultWriter, +} + +impl RenderResult { + fn commit(&mut self) { + if !self.vtx_complete || !self.frag_complete { + return; + } + + let mut error = self.vtx_error.take(); + if let Some(frag_error) = self.frag_error.take() { + if error.is_none() || error == Some(WorkError::Killed) { + error = Some(frag_error); + } + } + + if let Some(err) = error { + self.result.info = err.into(); + } else { + self.result.info.status = bindings::drm_asahi_status_DRM_ASAHI_STATUS_COMPLETE; + } + + self.writer.write(self.result); + } +} + +#[versions(AGX)] +impl super::Queue::ver { + /// Get the appropriate tiling parameters for a given userspace command buffer. + fn get_tiling_params( + cmdbuf: &bindings::drm_asahi_cmd_render, + num_clusters: u32, + ) -> Result<buffer::TileInfo> { + let width: u32 = cmdbuf.fb_width; + let height: u32 = cmdbuf.fb_height; + let layers: u32 = cmdbuf.layers; + + if width > 65536 || height > 65536 { + return Err(EINVAL); + } + + if layers == 0 || layers > 2048 { + return Err(EINVAL); + } + + let tile_width = 32u32; + let tile_height = 32u32; + + let utile_width = cmdbuf.utile_width; + let utile_height = cmdbuf.utile_height; + + match (utile_width, utile_height) { + (32, 32) | (32, 16) | (16, 16) => (), + _ => return Err(EINVAL), + }; + + let utiles_per_tile_x = tile_width / utile_width; + let utiles_per_tile_y = tile_height / utile_height; + + let utiles_per_tile = utiles_per_tile_x * utiles_per_tile_y; + + let tiles_x = (width + tile_width - 1) / tile_width; + let tiles_y = (height + tile_height - 1) / tile_height; + let tiles = tiles_x * tiles_y; + + let mtiles_x = 4u32; + let mtiles_y = 4u32; + let mtiles = mtiles_x * mtiles_y; + + // TODO: *samples + let tiles_per_mtile_x =
align(div_ceil(tiles_x, mtiles_x), 4); + let tiles_per_mtile_y = align(div_ceil(tiles_y, mtiles_y), 4); + let tiles_per_mtile = tiles_per_mtile_x * tiles_per_mtile_y; + + let mtile_x1 = tiles_per_mtile_x; + let mtile_x2 = 2 * tiles_per_mtile_x; + let mtile_x3 = 3 * tiles_per_mtile_x; + + let mtile_y1 = tiles_per_mtile_y; + let mtile_y2 = 2 * tiles_per_mtile_y; + let mtile_y3 = 3 * tiles_per_mtile_y; + + let rgn_entry_size = 5; + // Macrotile stride in 32-bit words + let rgn_size = align(rgn_entry_size * tiles_per_mtile * utiles_per_tile, 4) / 4; + let tilemap_size = (4 * rgn_size * mtiles * layers) as usize; + + let tpc_entry_size = 8; + // TPC stride in 32-bit words + let tpc_mtile_stride = tpc_entry_size * utiles_per_tile * tiles_per_mtile / 4; + let tpc_size = (num_clusters * (4 * tpc_mtile_stride * mtiles) * layers) as usize; + + // No idea where this comes from, but it fits what macOS does... + // TODO: layers? + let meta1_blocks = if num_clusters > 1 { + div_ceil(align(tiles_x, 2) * align(tiles_y, 4), 0x1980) + } else { + 0 + }; + + let min_tvb_blocks = + div_ceil(tiles_x * tiles_y, 128).max(if num_clusters > 1 { 9 } else { 8 }) as usize; + + // Sometimes clustering seems to use twice the cluster tilemap count + // and twice the meta4 size. TODO: Is this random or can we calculate + // it somehow??? Does it go higher??? 
+ let cluster_factor = 2; + + Ok(buffer::TileInfo { + tiles_x, + tiles_y, + tiles, + utile_width, + utile_height, + //mtiles_x, + //mtiles_y, + tiles_per_mtile_x, + tiles_per_mtile_y, + //tiles_per_mtile, + utiles_per_mtile_x: tiles_per_mtile_x * utiles_per_tile_x, + utiles_per_mtile_y: tiles_per_mtile_y * utiles_per_tile_y, + //utiles_per_mtile: tiles_per_mtile * utiles_per_tile, + tilemap_size, + tpc_size, + meta1_blocks, + min_tvb_blocks, + cluster_factor, + params: fw::vertex::raw::TilingParameters { + rgn_size, + unk_4: 0x88, + ppp_ctrl: cmdbuf.ppp_ctrl, + x_max: (width - 1) as u16, + y_max: (height - 1) as u16, + te_screen: ((tiles_y - 1) << 12) | (tiles_x - 1), + te_mtile1: mtile_x3 | (mtile_x2 << 9) | (mtile_x1 << 18), + te_mtile2: mtile_y3 | (mtile_y2 << 9) | (mtile_y1 << 18), + tiles_per_mtile, + tpc_stride: tpc_mtile_stride, + unk_24: 0x100, + unk_28: if layers > 1 { + 0xe000 | (layers - 1) + } else { + 0x8000 + }, + }, + }) + } + + /// Submit work to a render queue. + pub(super) fn submit_render( + &self, + job: &mut Job<super::QueueJob::ver>, + cmd: &bindings::drm_asahi_command, + result_writer: Option<super::ResultWriter>, + id: u64, + flush_stamps: bool, + ) -> Result { + if cmd.cmd_type != bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_RENDER { + return Err(EINVAL); + } + + mod_dev_dbg!(self.dev, "[Submission {}] Render!\n", id); + + let mut cmdbuf_reader = unsafe { + UserSlicePtr::new( + cmd.cmd_buffer as usize as *mut _, + core::mem::size_of::<bindings::drm_asahi_cmd_render>(), + ) + .reader() + }; + + let mut cmdbuf: MaybeUninit<bindings::drm_asahi_cmd_render> = MaybeUninit::uninit(); + unsafe { + cmdbuf_reader.read_raw( + cmdbuf.as_mut_ptr() as *mut u8, + core::mem::size_of::<bindings::drm_asahi_cmd_render>(), + )?; + } + let cmdbuf = unsafe { cmdbuf.assume_init() }; + + if cmdbuf.flags + & !(bindings::ASAHI_RENDER_NO_CLEAR_PIPELINE_TEXTURES + | bindings::ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S + | bindings::ASAHI_RENDER_MEMORYLESS_RTS_USED + |
bindings::ASAHI_RENDER_PROCESS_EMPTY_TILES + | bindings::ASAHI_RENDER_NO_VERTEX_CLUSTERING) as u64 + != 0 + { + return Err(EINVAL); + } + + if cmdbuf.flags & bindings::ASAHI_RENDER_MEMORYLESS_RTS_USED as u64 != 0 { + // Not supported yet + return Err(EINVAL); + } + + if cmdbuf.fb_width == 0 + || cmdbuf.fb_height == 0 + || cmdbuf.fb_width > 16384 + || cmdbuf.fb_height > 16384 + { + mod_dev_dbg!( + self.dev, + "[Submission {}] Invalid dimensions {}x{}\n", + id, + cmdbuf.fb_width, + cmdbuf.fb_height + ); + return Err(EINVAL); + } + + let dev = self.dev.data(); + let gpu = match dev.gpu.as_any().downcast_ref::<gpu::GpuManager::ver>() { + Some(gpu) => gpu, + None => { + dev_crit!(self.dev, "GpuManager mismatched with Queue!\n"); + return Err(EIO); + } + }; + + let nclusters = gpu.get_dyncfg().id.num_clusters; + + // Can be set to false to disable clustering (for simpler jobs), but then the + // core masks below should be adjusted to cover a single rolling cluster. + let mut clustering = nclusters > 1; + + if debug_enabled(debug::DebugFlags::DisableClustering) + || cmdbuf.flags & bindings::ASAHI_RENDER_NO_VERTEX_CLUSTERING as u64 != 0 + { + clustering = false; + } + + #[ver(G < G14)] + let tiling_control = { + let render_cfg = gpu.get_cfg().render; + let mut tiling_control = render_cfg.tiling_control; + + if !clustering { + tiling_control |= TILECTL_DISABLE_CLUSTERING; + } + tiling_control + }; + + let mut alloc = gpu.alloc(); + let kalloc = &mut *alloc; + + // This sequence number increases per new client/VM? assigned to some slot, + // but it's unclear *which* slot...
+ let slot_client_seq: u8 = (self.id & 0xff) as u8; + + let tile_info = Self::get_tiling_params(&cmdbuf, if clustering { nclusters } else { 1 })?; + + let buffer = self.buffer.as_ref().ok_or(EINVAL)?.lock(); + + let scene = Arc::try_new(buffer.new_scene(kalloc, &tile_info)?)?; + + let notifier = self.notifier.clone(); + + let tvb_autogrown = buffer.auto_grow()?; + if tvb_autogrown { + let new_size = buffer.block_count() as usize; + cls_dev_dbg!( + TVBStats, + &self.dev, + "[Submission {}] TVB grew to {} bytes ({} blocks) due to overflows\n", + id, + new_size * buffer::BLOCK_SIZE, + new_size, + ); + } + + let tvb_grown = buffer.ensure_blocks(tile_info.min_tvb_blocks)?; + if tvb_grown { + cls_dev_dbg!( + TVBStats, + &self.dev, + "[Submission {}] TVB grew to {} bytes ({} blocks) due to dimensions ({}x{})\n", + id, + tile_info.min_tvb_blocks * buffer::BLOCK_SIZE, + tile_info.min_tvb_blocks, + cmdbuf.fb_width, + cmdbuf.fb_height + ); + } + + let vm_bind = job.vm_bind.clone(); + + mod_dev_dbg!( + self.dev, + "[Submission {}] VM slot = {}\n", + id, + vm_bind.slot() + ); + + let ev_vtx = job.get_vtx()?.event_info(); + let ev_frag = job.get_frag()?.event_info(); + + mod_dev_dbg!( + self.dev, + "[Submission {}] Vert event #{} -> {:#x?}\n", + id, + ev_vtx.slot, + ev_vtx.value.next(), + ); + mod_dev_dbg!( + self.dev, + "[Submission {}] Frag event #{} -> {:#x?}\n", + id, + ev_frag.slot, + ev_frag.value.next(), + ); + + let uuid_3d = cmdbuf.cmd_3d_id; + let uuid_ta = cmdbuf.cmd_ta_id; + + mod_dev_dbg!( + self.dev, + "[Submission {}] Vert UUID = {:#x?}\n", + id, + uuid_ta + ); + mod_dev_dbg!( + self.dev, + "[Submission {}] Frag UUID = {:#x?}\n", + id, + uuid_3d + ); + + let fence = job.fence.clone(); + let frag_job = job.get_frag()?; + + mod_dev_dbg!(self.dev, "[Submission {}] Create Barrier\n", id); + let barrier: GpuObject<fw::workqueue::Barrier> = kalloc.private.new_inplace( + Default::default(), + |_inner, ptr: &mut MaybeUninit<fw::workqueue::raw::Barrier>| { + Ok(place!( + ptr,
+ fw::workqueue::raw::Barrier { + tag: fw::workqueue::CommandType::Barrier, + wait_stamp: ev_vtx.fw_stamp_pointer, + wait_value: ev_vtx.value.next(), + wait_slot: ev_vtx.slot, + stamp_self: ev_frag.value.next(), + uuid: uuid_3d, + unk: 0, + } + )) + }, + )?; + + mod_dev_dbg!(self.dev, "[Submission {}] Add Barrier\n", id); + frag_job.add(barrier, vm_bind.slot())?; + + let timestamps = Arc::try_new(kalloc.shared.new_default::<fw::job::RenderTimestamps>()?)?; + + let unk1 = debug_enabled(debug::DebugFlags::Debug1); + let unk2 = debug_enabled(debug::DebugFlags::Debug2); + let unk3 = debug_enabled(debug::DebugFlags::Debug3); + + let mut tile_config: u64 = 0; + if !unk1 { + tile_config |= 0x280; + } + if cmdbuf.layers > 1 { + tile_config |= 1; + } + if cmdbuf.flags & bindings::ASAHI_RENDER_PROCESS_EMPTY_TILES as u64 != 0 { + tile_config |= 0x10000; + } + + let mut utile_config = + ((tile_info.utile_width / 16) << 12) | ((tile_info.utile_height / 16) << 14); + utile_config |= match cmdbuf.samples { + 1 => 0, + 2 => 1, + 4 => 2, + _ => return Err(EINVAL), + }; + + let frag_result = result_writer + .map(|writer| { + let mut result = RenderResult { + result: Default::default(), + vtx_complete: false, + frag_complete: false, + vtx_error: None, + frag_error: None, + writer, + }; + + if tvb_autogrown { + result.result.flags |= bindings::DRM_ASAHI_RESULT_RENDER_TVB_GROW_OVF as u64; + } + if tvb_grown { + result.result.flags |= bindings::DRM_ASAHI_RESULT_RENDER_TVB_GROW_MIN as u64; + } + result.result.tvb_size_bytes = buffer.size() as u64; + + Arc::try_new(Mutex::new(result)) + }) + .transpose()?; + + let vtx_result = frag_result.clone(); + + // TODO: check + #[ver(V >= V13_0B4)] + let count_frag = self.counter.fetch_add(2, Ordering::Relaxed); + #[ver(V >= V13_0B4)] + let count_vtx = count_frag + 1; + + mod_dev_dbg!(self.dev, "[Submission {}] Create Frag\n", id); + let frag = GpuObject::new_prealloc( + kalloc.private.alloc_object()?, + |ptr:
GpuWeakPointer<fw::fragment::RunFragment::ver>| { + let mut builder = microseq::Builder::new(); + + let stats = inner_weak_ptr!( + gpu.initdata.runtime_pointers.stats.frag.weak_pointer(), + stats + ); + + let start_frag = builder.add(microseq::StartFragment::ver { + header: microseq::op::StartFragment::HEADER, + job_params2: inner_weak_ptr!(ptr, job_params2), + job_params1: inner_weak_ptr!(ptr, job_params1), + scene: scene.gpu_pointer(), + stats, + busy_flag: inner_weak_ptr!(ptr, busy_flag), + tvb_overflow_count: inner_weak_ptr!(ptr, tvb_overflow_count), + unk_pointer: inner_weak_ptr!(ptr, unk_pointee), + work_queue: ev_frag.info_ptr, + work_item: ptr, + vm_slot: vm_bind.slot(), + unk_50: 0x1, // fixed + event_generation: self.id as u32, + buffer_slot: scene.slot(), + unk_5c: 0, + cmd_seq: U64(ev_frag.cmd_seq), + unk_68: 0, + unk_758_flag: inner_weak_ptr!(ptr, unk_758_flag), + unk_job_buf: inner_weak_ptr!(ptr, unk_buf_0), + unk_7c: 0, + unk_80: 0, + unk_84: 0, + uuid: uuid_3d, + attachments: common::build_attachments( + cmdbuf.attachments, + cmdbuf.attachment_count, + )?, + unk_190: 0, + #[ver(V >= V13_0B4)] + counter: U64(count_frag), + #[ver(V >= V13_0B4)] + notifier_buf: inner_weak_ptr!(notifier.weak_pointer(), state.unk_buf), + })?; + + if frag_result.is_some() { + builder.add(microseq::Timestamp::ver { + header: microseq::op::Timestamp::new(true), + cur_ts: inner_weak_ptr!(ptr, cur_ts), + start_ts: inner_weak_ptr!(ptr, start_ts), + update_ts: inner_weak_ptr!(ptr, start_ts), + work_queue: ev_frag.info_ptr, + unk_24: U64(0), + #[ver(V >= V13_0B4)] + unk_ts: inner_weak_ptr!(ptr, unk_ts), + uuid: uuid_3d, + unk_30_padding: 0, + })?; + } + + builder.add(microseq::WaitForIdle { + header: microseq::op::WaitForIdle::new(microseq::Pipe::Fragment), + })?; + + if frag_result.is_some() { + builder.add(microseq::Timestamp::ver { + header: microseq::op::Timestamp::new(false), + cur_ts: inner_weak_ptr!(ptr, cur_ts), + start_ts: inner_weak_ptr!(ptr, start_ts), + update_ts:
inner_weak_ptr!(ptr, end_ts), + work_queue: ev_frag.info_ptr, + unk_24: U64(0), + #[ver(V >= V13_0B4)] + unk_ts: inner_weak_ptr!(ptr, unk_ts), + uuid: uuid_3d, + unk_30_padding: 0, + })?; + } + + let off = builder.offset_to(start_frag); + builder.add(microseq::FinalizeFragment::ver { + header: microseq::op::FinalizeFragment::HEADER, + uuid: uuid_3d, + unk_8: 0, + fw_stamp: ev_frag.fw_stamp_pointer, + stamp_value: ev_frag.value.next(), + unk_18: 0, + scene: scene.weak_pointer(), + buffer: scene.weak_buffer_pointer(), + unk_2c: U64(1), + stats, + unk_pointer: inner_weak_ptr!(ptr, unk_pointee), + busy_flag: inner_weak_ptr!(ptr, busy_flag), + work_queue: ev_frag.info_ptr, + work_item: ptr, + vm_slot: vm_bind.slot(), + unk_60: 0, + unk_758_flag: inner_weak_ptr!(ptr, unk_758_flag), + unk_6c: U64(0), + unk_74: U64(0), + unk_7c: U64(0), + unk_84: U64(0), + unk_8c: U64(0), + #[ver(G == G14 && V < V13_0B4)] + unk_8c_g14: U64(0), + restart_branch_offset: off, + unk_98: unk3.into(), + #[ver(V >= V13_0B4)] + unk_9c: Default::default(), + })?; + + builder.add(microseq::RetireStamp { + header: microseq::op::RetireStamp::HEADER, + })?; + + Ok(box_in_place!(fw::fragment::RunFragment::ver { + notifier: notifier.clone(), + scene: scene.clone(), + micro_seq: builder.build(&mut kalloc.private)?, + vm_bind: vm_bind.clone(), + aux_fb: self.ualloc.lock().array_empty(0x8000)?, + timestamps: timestamps.clone(), + })?) 
+ }, + |inner, ptr| { + let aux_fb_info = fw::fragment::raw::AuxFBInfo::ver { + iogpu_unk_214: cmdbuf.iogpu_unk_214, + unk2: 0, + width: cmdbuf.fb_width, + height: cmdbuf.fb_height, + #[ver(V >= V13_0B4)] + unk3: U64(0x100000), + }; + + Ok(place!( + ptr, + fw::fragment::raw::RunFragment::ver { + tag: fw::workqueue::CommandType::RunFragment, + #[ver(V >= V13_0B4)] + counter: U64(count_frag), + vm_slot: vm_bind.slot(), + unk_8: 0, + microsequence: inner.micro_seq.gpu_pointer(), + microsequence_size: inner.micro_seq.len() as u32, + notifier: inner.notifier.gpu_pointer(), + buffer: inner.scene.buffer_pointer(), + scene: inner.scene.gpu_pointer(), + unk_buffer_buf: inner.scene.kernel_buffer_pointer(), + tvb_tilemap: inner.scene.tvb_tilemap_pointer(), + ppp_multisamplectl: U64(cmdbuf.ppp_multisamplectl), + samples: cmdbuf.samples, + tiles_per_mtile_y: tile_info.tiles_per_mtile_y as u16, + tiles_per_mtile_x: tile_info.tiles_per_mtile_x as u16, + unk_50: U64(0), + unk_58: U64(0), + merge_upper_x: F32::from_bits(cmdbuf.merge_upper_x), + merge_upper_y: F32::from_bits(cmdbuf.merge_upper_y), + unk_68: U64(0), + tile_count: U64(tile_info.tiles as u64), + job_params1: fw::fragment::raw::JobParameters1::ver { + utile_config: utile_config, + unk_4: 0, + clear_pipeline: fw::fragment::raw::ClearPipelineBinding { + pipeline_bind: U64(cmdbuf.load_pipeline_bind as u64), + address: U64(cmdbuf.load_pipeline as u64), + }, + ppp_multisamplectl: U64(cmdbuf.ppp_multisamplectl), + scissor_array: U64(cmdbuf.scissor_array), + depth_bias_array: U64(cmdbuf.depth_bias_array), + aux_fb_info: aux_fb_info, + depth_dimensions: U64(cmdbuf.depth_dimensions as u64), + visibility_result_buffer: U64(cmdbuf.visibility_result_buffer), + zls_ctrl: U64(cmdbuf.zls_ctrl), + #[ver(G >= G14)] + unk_58_g14_0: U64(0x4040404), + #[ver(G >= G14)] + unk_58_g14_8: U64(0), + depth_buffer_ptr1: U64(cmdbuf.depth_buffer_1), + depth_buffer_ptr2: U64(cmdbuf.depth_buffer_2), + stencil_buffer_ptr1: U64(cmdbuf.stencil_buffer_1), 
+ stencil_buffer_ptr2: U64(cmdbuf.stencil_buffer_2), + #[ver(G >= G14)] + unk_68_g14_0: Default::default(), + unk_78: Default::default(), + depth_meta_buffer_ptr1: U64(cmdbuf.depth_meta_buffer_1), + unk_a0: Default::default(), + depth_meta_buffer_ptr2: U64(cmdbuf.depth_meta_buffer_2), + unk_b0: Default::default(), + stencil_meta_buffer_ptr1: U64(cmdbuf.stencil_meta_buffer_1), + unk_c0: Default::default(), + stencil_meta_buffer_ptr2: U64(cmdbuf.stencil_meta_buffer_2), + unk_d0: Default::default(), + tvb_tilemap: inner.scene.tvb_tilemap_pointer(), + tvb_heapmeta: inner.scene.tvb_heapmeta_pointer(), + mtile_stride_dwords: U64((4 * tile_info.params.rgn_size as u64) << 24), + tvb_heapmeta_2: inner.scene.tvb_heapmeta_pointer(), + tile_config: U64(tile_config), + aux_fb: inner.aux_fb.gpu_pointer(), + unk_108: Default::default(), + pipeline_base: U64(0x11_00000000), + unk_140: U64(0x8c60), + unk_148: U64(0x0), + unk_150: U64(0x0), + unk_158: U64(0x1c), + unk_160: U64(0), + unk_168_padding: Default::default(), + #[ver(V < V13_0B4)] + __pad0: Default::default(), + }, + job_params2: fw::fragment::raw::JobParameters2 { + store_pipeline_bind: cmdbuf.store_pipeline_bind, + store_pipeline_addr: cmdbuf.store_pipeline, + unk_8: 0x0, + unk_c: 0x0, + merge_upper_x: F32::from_bits(cmdbuf.merge_upper_x), + merge_upper_y: F32::from_bits(cmdbuf.merge_upper_y), + unk_18: U64(0x0), + utiles_per_mtile_y: tile_info.utiles_per_mtile_y as u16, + utiles_per_mtile_x: tile_info.utiles_per_mtile_x as u16, + unk_24: 0x0, + tile_counts: ((tile_info.tiles_y - 1) << 12) | (tile_info.tiles_x - 1), + iogpu_unk_212: cmdbuf.iogpu_unk_212, + isp_bgobjdepth: cmdbuf.isp_bgobjdepth, + // TODO: does this flag need to be exposed to userspace? 
+ isp_bgobjvals: cmdbuf.isp_bgobjvals | 0x400, + unk_38: 0x0, + unk_3c: 0x1, + unk_40: 0, + }, + job_params3: fw::fragment::raw::JobParameters3::ver { + unk_44_padding: Default::default(), + depth_bias_array: fw::fragment::raw::ArrayAddr { + ptr: U64(cmdbuf.depth_bias_array), + unk_padding: U64(0), + }, + scissor_array: fw::fragment::raw::ArrayAddr { + ptr: U64(cmdbuf.scissor_array), + unk_padding: U64(0), + }, + visibility_result_buffer: U64(cmdbuf.visibility_result_buffer), + unk_118: U64(0x0), + unk_120: Default::default(), + unk_reload_pipeline: fw::fragment::raw::ClearPipelineBinding { + pipeline_bind: U64(cmdbuf.partial_reload_pipeline_bind as u64), + address: U64(cmdbuf.partial_reload_pipeline as u64), + }, + unk_258: U64(0), + unk_260: U64(0), + unk_268: U64(0), + unk_270: U64(0), + reload_pipeline: fw::fragment::raw::ClearPipelineBinding { + pipeline_bind: U64(cmdbuf.partial_reload_pipeline_bind as u64), + address: U64(cmdbuf.partial_reload_pipeline as u64), + }, + zls_ctrl: U64(cmdbuf.zls_ctrl), + unk_290: U64(0x0), + depth_buffer_ptr1: U64(cmdbuf.depth_buffer_1), + unk_2a0: U64(0x0), + unk_2a8: U64(0x0), + depth_buffer_ptr2: U64(cmdbuf.depth_buffer_2), + depth_buffer_ptr3: U64(cmdbuf.depth_buffer_3), + depth_meta_buffer_ptr3: U64(cmdbuf.depth_meta_buffer_3), + stencil_buffer_ptr1: U64(cmdbuf.stencil_buffer_1), + unk_2d0: U64(0x0), + unk_2d8: U64(0x0), + stencil_buffer_ptr2: U64(cmdbuf.stencil_buffer_2), + stencil_buffer_ptr3: U64(cmdbuf.stencil_buffer_3), + stencil_meta_buffer_ptr3: U64(cmdbuf.stencil_meta_buffer_3), + unk_2f8: Default::default(), + iogpu_unk_212: cmdbuf.iogpu_unk_212, + unk_30c: 0x0, + aux_fb_info: aux_fb_info, + unk_320_padding: Default::default(), + unk_partial_store_pipeline: + fw::fragment::raw::StorePipelineBinding::new( + cmdbuf.partial_store_pipeline_bind, + cmdbuf.partial_store_pipeline + ), + partial_store_pipeline: fw::fragment::raw::StorePipelineBinding::new( + cmdbuf.partial_store_pipeline_bind, + 
cmdbuf.partial_store_pipeline + ), + isp_bgobjdepth: cmdbuf.isp_bgobjdepth, + isp_bgobjvals: cmdbuf.isp_bgobjvals, + iogpu_unk_49: cmdbuf.iogpu_unk_49, + unk_37c: 0x0, + unk_380: U64(0x0), + unk_388: U64(0x0), + #[ver(V >= V13_0B4)] + unk_390_0: U64(0x0), + depth_dimensions: U64(cmdbuf.depth_dimensions as u64), + }, + unk_758_flag: 0, + unk_75c_flag: 0, + unk_buf: Default::default(), + busy_flag: 0, + tvb_overflow_count: 0, + unk_878: 0, + encoder_params: fw::job::raw::EncoderParams { + unk_8: (cmdbuf.flags + & bindings::ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S as u64 + != 0) as u32, + unk_c: 0x0, // fixed + unk_10: 0x0, // fixed + encoder_id: cmdbuf.encoder_id, + unk_18: 0x0, // fixed + iogpu_compute_unk44: 0xffffffff, + seq_buffer: inner.scene.seq_buf_pointer(), + unk_28: U64(0x0), // fixed + }, + process_empty_tiles: (cmdbuf.flags + & bindings::ASAHI_RENDER_PROCESS_EMPTY_TILES as u64 + != 0) as u32, + no_clear_pipeline_textures: (cmdbuf.flags + & bindings::ASAHI_RENDER_NO_CLEAR_PIPELINE_TEXTURES as u64 + != 0) as u32, + unk_param: unk2.into(), // 1 for boot stuff? 
+ unk_pointee: 0, + meta: fw::job::raw::JobMeta { + unk_4: 0, + stamp: ev_frag.stamp_pointer, + fw_stamp: ev_frag.fw_stamp_pointer, + stamp_value: ev_frag.value.next(), + stamp_slot: ev_frag.slot, + evctl_index: 0, // fixed + flush_stamps: flush_stamps as u32, + uuid: uuid_3d, + cmd_seq: ev_frag.cmd_seq as u32, + }, + unk_after_meta: unk1.into(), + unk_buf_0: U64(0), + unk_buf_8: U64(0), + unk_buf_10: U64(1), + cur_ts: U64(0), + start_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), frag.start)), + end_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), frag.end)), + unk_914: 0, + unk_918: U64(0), + unk_920: 0, + client_sequence: slot_client_seq, + pad_925: Default::default(), + unk_928: 0, + unk_92c: 0, + #[ver(V >= V13_0B4)] + unk_ts: U64(0), + #[ver(V >= V13_0B4)] + unk_92d_8: Default::default(), + } + )) + }, + )?; + + mod_dev_dbg!(self.dev, "[Submission {}] Add Frag\n", id); + fence.add_command(); + + frag_job.add_cb(frag, vm_bind.slot(), move |cmd, error| { + if let Some(err) = error { + fence.set_error(err.into()); + } + if let Some(mut res) = frag_result.as_ref().map(|a| a.lock()) { + cmd.timestamps.with(|raw, _inner| { + res.result.fragment_ts_start = raw.frag.start.load(Ordering::Relaxed); + res.result.fragment_ts_end = raw.frag.end.load(Ordering::Relaxed); + }); + cmd.with(|raw, _inner| { + res.result.num_tvb_overflows = raw.tvb_overflow_count; + }); + res.frag_error = error; + res.frag_complete = true; + res.commit(); + } + fence.command_complete(); + })?; + + let fence = job.fence.clone(); + let vtx_job = job.get_vtx()?; + + if scene.rebind() || tvb_grown || tvb_autogrown { + mod_dev_dbg!(self.dev, "[Submission {}] Create Bind Buffer\n", id); + let bind_buffer = kalloc.private.new_inplace( + fw::buffer::InitBuffer::ver { + scene: scene.clone(), + }, + |inner, ptr: &mut MaybeUninit<fw::buffer::raw::InitBuffer::ver<'_>>| { + Ok(place!( + ptr, + fw::buffer::raw::InitBuffer::ver { + tag: fw::workqueue::CommandType::InitBuffer, + vm_slot: vm_bind.slot(), 
+ buffer_slot: inner.scene.slot(), + unk_c: 0, + block_count: buffer.block_count(), + buffer: inner.scene.buffer_pointer(), + stamp_value: ev_vtx.value.next(), + } + )) + }, + )?; + + mod_dev_dbg!(self.dev, "[Submission {}] Add Bind Buffer\n", id); + vtx_job.add(bind_buffer, vm_bind.slot())?; + } + + mod_dev_dbg!(self.dev, "[Submission {}] Create Vertex\n", id); + let vtx = GpuObject::new_prealloc( + kalloc.private.alloc_object()?, + |ptr: GpuWeakPointer<fw::vertex::RunVertex::ver>| { + let mut builder = microseq::Builder::new(); + + let stats = inner_weak_ptr!( + gpu.initdata.runtime_pointers.stats.vtx.weak_pointer(), + stats + ); + + let start_vtx = builder.add(microseq::StartVertex::ver { + header: microseq::op::StartVertex::HEADER, + tiling_params: inner_weak_ptr!(ptr, tiling_params), + job_params1: inner_weak_ptr!(ptr, job_params1), + buffer: scene.weak_buffer_pointer(), + scene: scene.weak_pointer(), + stats, + work_queue: ev_vtx.info_ptr, + vm_slot: vm_bind.slot(), + unk_38: 1, // fixed + event_generation: self.id as u32, + buffer_slot: scene.slot(), + unk_44: 0, + cmd_seq: U64(ev_vtx.cmd_seq), + unk_50: 0, + unk_pointer: inner_weak_ptr!(ptr, unk_pointee), + unk_job_buf: inner_weak_ptr!(ptr, unk_buf_0), + unk_64: 0x0, // fixed + unk_68: unk1.into(), + uuid: uuid_ta, + unk_70: 0x0, // fixed + unk_74: Default::default(), // fixed + unk_15c: 0x0, // fixed + unk_160: U64(0x0), // fixed + unk_168: 0x0, // fixed + unk_16c: 0x0, // fixed + unk_170: U64(0x0), // fixed + #[ver(V >= V13_0B4)] + counter: U64(count_vtx), + #[ver(V >= V13_0B4)] + notifier_buf: inner_weak_ptr!(notifier.weak_pointer(), state.unk_buf), + unk_178: 0x0, // padding?
+ })?; + + if vtx_result.is_some() { + builder.add(microseq::Timestamp::ver { + header: microseq::op::Timestamp::new(true), + cur_ts: inner_weak_ptr!(ptr, cur_ts), + start_ts: inner_weak_ptr!(ptr, start_ts), + update_ts: inner_weak_ptr!(ptr, start_ts), + work_queue: ev_vtx.info_ptr, + unk_24: U64(0), + #[ver(V >= V13_0B4)] + unk_ts: inner_weak_ptr!(ptr, unk_ts), + uuid: uuid_ta, + unk_30_padding: 0, + })?; + } + + builder.add(microseq::WaitForIdle { + header: microseq::op::WaitForIdle::new(microseq::Pipe::Vertex), + })?; + + if vtx_result.is_some() { + builder.add(microseq::Timestamp::ver { + header: microseq::op::Timestamp::new(false), + cur_ts: inner_weak_ptr!(ptr, cur_ts), + start_ts: inner_weak_ptr!(ptr, start_ts), + update_ts: inner_weak_ptr!(ptr, end_ts), + work_queue: ev_vtx.info_ptr, + unk_24: U64(0), + #[ver(V >= V13_0B4)] + unk_ts: inner_weak_ptr!(ptr, unk_ts), + uuid: uuid_ta, + unk_30_padding: 0, + })?; + } + + let off = builder.offset_to(start_vtx); + builder.add(microseq::FinalizeVertex::ver { + header: microseq::op::FinalizeVertex::HEADER, + scene: scene.weak_pointer(), + buffer: scene.weak_buffer_pointer(), + stats, + work_queue: ev_vtx.info_ptr, + vm_slot: vm_bind.slot(), + unk_28: 0x0, // fixed + unk_pointer: inner_weak_ptr!(ptr, unk_pointee), + unk_34: 0x0, // fixed + uuid: uuid_ta, + fw_stamp: ev_vtx.fw_stamp_pointer, + stamp_value: ev_vtx.value.next(), + unk_48: U64(0x0), // fixed + unk_50: 0x0, // fixed + unk_54: 0x0, // fixed + unk_58: U64(0x0), // fixed + unk_60: 0x0, // fixed + unk_64: 0x0, // fixed + unk_68: 0x0, // fixed + #[ver(G >= G14 && V < V13_0B4)] + unk_68_g14: U64(0), + restart_branch_offset: off, + unk_70: 0x0, // fixed + #[ver(V >= V13_0B4)] + unk_74: Default::default(), // Ventura + })?; + + builder.add(microseq::RetireStamp { + header: microseq::op::RetireStamp::HEADER, + })?; + + Ok(box_in_place!(fw::vertex::RunVertex::ver { + notifier: notifier, + scene: scene.clone(), + micro_seq: builder.build(&mut kalloc.private)?, + 
vm_bind: vm_bind.clone(), + timestamps: timestamps, + })?) + }, + |inner, ptr| { + #[ver(G < G14)] + let core_masks = gpu.core_masks_packed(); + Ok(place!( + ptr, + fw::vertex::raw::RunVertex::ver { + tag: fw::workqueue::CommandType::RunVertex, + #[ver(V >= V13_0B4)] + counter: U64(count_vtx), + vm_slot: vm_bind.slot(), + unk_8: 0, + notifier: inner.notifier.gpu_pointer(), + buffer_slot: inner.scene.slot(), + unk_1c: 0, + buffer: inner.scene.buffer_pointer(), + scene: inner.scene.gpu_pointer(), + unk_buffer_buf: inner.scene.kernel_buffer_pointer(), + unk_34: 0, + job_params1: fw::vertex::raw::JobParameters1::ver { + unk_0: U64(if unk1 { 0 } else { 0x200 }), // sometimes 0 + unk_8: f32!(1e-20), // fixed + unk_c: f32!(1e-20), // fixed + tvb_tilemap: inner.scene.tvb_tilemap_pointer(), + #[ver(G < G14)] + tvb_cluster_tilemaps: inner.scene.cluster_tilemaps_pointer(), + tpc: inner.scene.tpc_pointer(), + tvb_heapmeta: inner + .scene + .tvb_heapmeta_pointer() + .or(0x8000_0000_0000_0000), + iogpu_unk_54: 0x6b0003, // fixed + iogpu_unk_55: 0x3a0012, // fixed + iogpu_unk_56: U64(0x1), // fixed + #[ver(G < G14)] + tvb_cluster_meta1: inner + .scene + .meta_1_pointer() + .map(|x| x.or((tile_info.meta1_blocks as u64) << 50)), + utile_config: utile_config, + unk_4c: 0, + ppp_multisamplectl: U64(cmdbuf.ppp_multisamplectl), // fixed + tvb_heapmeta_2: inner.scene.tvb_heapmeta_pointer(), + #[ver(G < G14)] + unk_60: U64(0x0), // fixed + #[ver(G < G14)] + core_mask: Array::new([ + *core_masks.first().unwrap_or(&0), + *core_masks.get(1).unwrap_or(&0), + ]), + preempt_buf1: inner.scene.preempt_buf_1_pointer(), + preempt_buf2: inner.scene.preempt_buf_2_pointer(), + unk_80: U64(0x1), // fixed + preempt_buf3: inner + .scene + .preempt_buf_3_pointer() + .or(0x4_0000_0000_0000), // check + encoder_addr: U64(cmdbuf.encoder_ptr), + #[ver(G < G14)] + tvb_cluster_meta2: inner.scene.meta_2_pointer(), + #[ver(G < G14)] + tvb_cluster_meta3: inner.scene.meta_3_pointer(), + #[ver(G < G14)] + 
tiling_control: tiling_control, + #[ver(G < G14)] + unk_ac: Default::default(), // fixed + unk_b0: Default::default(), // fixed + pipeline_base: U64(0x11_00000000), + #[ver(G < G14)] + tvb_cluster_meta4: inner + .scene + .meta_4_pointer() + .map(|x| x.or(0x3000_0000_0000_0000)), + #[ver(G < G14)] + unk_f0: U64(0x1c + align(tile_info.meta1_blocks, 4) as u64), + unk_f8: U64(0x8c60), // fixed + unk_100: Default::default(), // fixed + unk_118: 0x1c, // fixed + #[ver(G >= G14)] + __pad: Default::default(), + }, + unk_154: Default::default(), + tiling_params: tile_info.params, + unk_3e8: Default::default(), + tpc: inner.scene.tpc_pointer(), + tpc_size: U64(tile_info.tpc_size as u64), + microsequence: inner.micro_seq.gpu_pointer(), + microsequence_size: inner.micro_seq.len() as u32, + fragment_stamp_slot: ev_frag.slot, + fragment_stamp_value: ev_frag.value.next(), + unk_pointee: 0, + unk_pad: 0, + job_params2: fw::vertex::raw::JobParameters2 { + unk_480: Default::default(), // fixed + unk_498: U64(0x0), // fixed + unk_4a0: 0x0, // fixed + preempt_buf1: inner.scene.preempt_buf_1_pointer(), + unk_4ac: 0x0, // fixed + unk_4b0: U64(0x0), // fixed + unk_4b8: 0x0, // fixed + unk_4bc: U64(0x0), // fixed + unk_4c4_padding: Default::default(), + unk_50c: 0x0, // fixed + unk_510: U64(0x0), // fixed + unk_518: U64(0x0), // fixed + unk_520: U64(0x0), // fixed + }, + encoder_params: fw::job::raw::EncoderParams { + unk_8: 0x0, // fixed + unk_c: 0x0, // fixed + unk_10: 0x0, // fixed + encoder_id: cmdbuf.encoder_id, + unk_18: 0x0, // fixed + iogpu_compute_unk44: 0xffffffff, + seq_buffer: inner.scene.seq_buf_pointer(), + unk_28: U64(0x0), // fixed + }, + unk_55c: 0, + unk_560: 0, + memoryless_rts_used: (cmdbuf.flags + & bindings::ASAHI_RENDER_MEMORYLESS_RTS_USED as u64 + != 0) as u32, + unk_568: 0, + unk_56c: 0, + meta: fw::job::raw::JobMeta { + unk_4: 0, + stamp: ev_vtx.stamp_pointer, + fw_stamp: ev_vtx.fw_stamp_pointer, + stamp_value: ev_vtx.value.next(), + stamp_slot: ev_vtx.slot, + 
evctl_index: 0, // fixed + flush_stamps: flush_stamps as u32, + uuid: uuid_ta, + cmd_seq: ev_vtx.cmd_seq as u32, + }, + unk_after_meta: unk1.into(), + unk_buf_0: U64(0), + unk_buf_8: U64(0), + unk_buf_10: U64(0), + cur_ts: U64(0), + start_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), vtx.start)), + end_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), vtx.end)), + unk_5c4: 0, + unk_5c8: 0, + unk_5cc: 0, + unk_5d0: 0, + client_sequence: slot_client_seq, + pad_5d5: Default::default(), + unk_5d8: 0, + unk_5dc: 0, + #[ver(V >= V13_0B4)] + unk_ts: U64(0), + #[ver(V >= V13_0B4)] + unk_5dd_8: Default::default(), + } + )) + }, + )?; + + core::mem::drop(alloc); + + mod_dev_dbg!(self.dev, "[Submission {}] Add Vertex\n", id); + fence.add_command(); + vtx_job.add_cb(vtx, vm_bind.slot(), move |cmd, error| { + if let Some(err) = error { + fence.set_error(err.into()) + } + if let Some(mut res) = vtx_result.as_ref().map(|a| a.lock()) { + cmd.timestamps.with(|raw, _inner| { + res.result.vertex_ts_start = raw.vtx.start.load(Ordering::Relaxed); + res.result.vertex_ts_end = raw.vtx.end.load(Ordering::Relaxed); + }); + res.result.tvb_usage_bytes = cmd.scene.used_bytes() as u64; + if cmd.scene.overflowed() { + res.result.flags |= bindings::DRM_ASAHI_RESULT_RENDER_TVB_OVERFLOWED as u64; + } + res.vtx_error = error; + res.vtx_complete = true; + res.commit(); + } + fence.command_complete(); + })?; + + mod_dev_dbg!(self.dev, "[Submission {}] Increment counters\n", id); + self.notifier.threshold.with(|raw, _inner| { + raw.increment(); + raw.increment(); + }); + + // TODO: handle rollbacks, move to job submit? + buffer.increment(); + + job.get_vtx()?.next_seq(); + job.get_frag()?.next_seq(); + + Ok(()) + } +} diff --git a/drivers/gpu/drm/asahi/regs.rs b/drivers/gpu/drm/asahi/regs.rs new file mode 100644 index 000000000000..019d7214793d --- /dev/null +++ b/drivers/gpu/drm/asahi/regs.rs @@ -0,0 +1,387 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! 
GPU MMIO register abstraction +//! +//! Since the vast majority of the interactions with the GPU are brokered through the firmware, +//! there is very little need to interact directly with GPU MMIO registers. This module abstracts +//! the few operations that require that, mainly reading the MMU fault status, reading GPU ID +//! information, and starting the GPU firmware coprocessor. + +use crate::hw; +use kernel::{device, io_mem::IoMem, platform, prelude::*}; + +/// Size of the ASC control MMIO region. +pub(crate) const ASC_CTL_SIZE: usize = 0x4000; + +/// Size of the SGX MMIO region. +pub(crate) const SGX_SIZE: usize = 0x1000000; + +const CPU_CONTROL: usize = 0x44; +const CPU_RUN: u32 = 0x1 << 4; // BIT(4) + +const FAULT_INFO: usize = 0x17030; + +const ID_VERSION: usize = 0xd04000; +const ID_UNK08: usize = 0xd04008; +const ID_COUNTS_1: usize = 0xd04010; +const ID_COUNTS_2: usize = 0xd04014; +const ID_UNK18: usize = 0xd04018; +const ID_CLUSTERS: usize = 0xd0401c; + +const CORE_MASK_0: usize = 0xd01500; +const CORE_MASK_1: usize = 0xd01514; + +/// Enum representing the unit that caused an MMU fault. +#[allow(non_camel_case_types)] +#[allow(clippy::upper_case_acronyms)] +#[derive(Copy, Clone, Debug, Eq, PartialEq)] +pub(crate) enum FaultUnit { + /// Decompress / pixel fetch + DCMP(u8), + /// USC L1 Cache (device loads/stores) + UL1C(u8), + /// Compress / pixel store + CMP(u8), + GSL1(u8), + IAP(u8), + VCE(u8), + /// Tiling Engine + TE(u8), + RAS(u8), + /// Vertex Data Master + VDM(u8), + PPP(u8), + /// ISP Parameter Fetch + IPF(u8), + IPF_CPF(u8), + VF(u8), + VF_CPF(u8), + /// Depth/Stencil load/store + ZLS(u8), + + /// Parameter Management + dPM, + /// Compute Data Master + dCDM_KS(u8), + dIPP, + dIPP_CS, + /// Vertex Data Master + dVDM_CSD, + dVDM_SSD, + dVDM_ILF, + dVDM_ILD, + dRDE(u8), + FC, + GSL2, + + /// Graphics L2 Cache Control? 
+ GL2CC_META(u8), + GL2CC_MB, + + /// Parameter Management + gPM_SP(u8), + /// Vertex Data Master - CSD + gVDM_CSD_SP(u8), + gVDM_SSD_SP(u8), + gVDM_ILF_SP(u8), + gVDM_TFP_SP(u8), + gVDM_MMB_SP(u8), + /// Compute Data Master + gCDM_CS_KS0_SP(u8), + gCDM_CS_KS1_SP(u8), + gCDM_CS_KS2_SP(u8), + gCDM_KS0_SP(u8), + gCDM_KS1_SP(u8), + gCDM_KS2_SP(u8), + gIPP_SP(u8), + gIPP_CS_SP(u8), + gRDE0_SP(u8), + gRDE1_SP(u8), + + Unknown(u8), +} + +/// Reason for an MMU fault. +#[derive(Copy, Clone, Debug, Eq, PartialEq)] +pub(crate) enum FaultReason { + Unmapped, + AfFault, + WriteOnly, + ReadOnly, + NoAccess, + Unknown(u8), +} + +/// Collection of information about an MMU fault. +#[derive(Copy, Clone, Debug, Eq, PartialEq)] +pub(crate) struct FaultInfo { + pub(crate) address: u64, + pub(crate) sideband: u8, + pub(crate) vm_slot: u32, + pub(crate) unit_code: u8, + pub(crate) unit: FaultUnit, + pub(crate) level: u8, + pub(crate) unk_5: u8, + pub(crate) read: bool, + pub(crate) reason: FaultReason, +} + +/// Device resources for this GPU instance. +pub(crate) struct Resources { + dev: device::Device, + asc: IoMem<ASC_CTL_SIZE>, + sgx: IoMem<SGX_SIZE>, +} + +impl Resources { + /// Map the required resources given our platform device. + pub(crate) fn new(pdev: &mut platform::Device) -> Result<Resources> { + // TODO: add device abstraction to ioremap by name + let asc_res = unsafe { pdev.ioremap_resource(0)? }; + let sgx_res = unsafe { pdev.ioremap_resource(1)? }; + + Ok(Resources { + // SAFETY: This device does DMA via the UAT IOMMU. 
+ dev: device::Device::from_dev(pdev), + asc: asc_res, + sgx: sgx_res, + }) + } + + fn sgx_read32(&self, off: usize) -> u32 { + self.sgx.readl_relaxed(off) + } + + /* Not yet used + fn sgx_write32(&self, off: usize, val: u32) { + self.sgx.writel_relaxed(val, off) + } + */ + + fn sgx_read64(&self, off: usize) -> u64 { + self.sgx.readq_relaxed(off) + } + + /* Not yet used + fn sgx_write64(&self, off: usize, val: u64) { + self.sgx.writeq_relaxed(val, off) + } + */ + + /// Initialize the MMIO registers for the GPU. + pub(crate) fn init_mmio(&self) -> Result { + // Nothing to do for now... + + Ok(()) + } + + /// Start the ASC coprocessor CPU. + pub(crate) fn start_cpu(&self) -> Result { + let val = self.asc.readl_relaxed(CPU_CONTROL); + + self.asc.writel_relaxed(val | CPU_RUN, CPU_CONTROL); + + Ok(()) + } + + /// Get the GPU identification info from registers. + /// + /// See [`hw::GpuIdConfig`] for the result. + pub(crate) fn get_gpu_id(&self) -> Result<hw::GpuIdConfig> { + let id_version = self.sgx_read32(ID_VERSION); + let id_unk08 = self.sgx_read32(ID_UNK08); + let id_counts_1 = self.sgx_read32(ID_COUNTS_1); + let id_counts_2 = self.sgx_read32(ID_COUNTS_2); + let id_unk18 = self.sgx_read32(ID_UNK18); + let id_clusters = self.sgx_read32(ID_CLUSTERS); + + dev_info!( + self.dev, + "GPU ID registers: {:#x} {:#x} {:#x} {:#x} {:#x} {:#x}\n", + id_version, + id_unk08, + id_counts_1, + id_counts_2, + id_unk18, + id_clusters + ); + + let core_mask_0 = self.sgx_read32(CORE_MASK_0); + let core_mask_1 = self.sgx_read32(CORE_MASK_1); + let mut core_mask = (core_mask_0 as u64) | ((core_mask_1 as u64) << 32); + + dev_info!(self.dev, "Core mask: {:#x}\n", core_mask); + + let num_clusters = (id_clusters >> 12) & 0xff; + let num_cores = id_counts_1 & 0xff; + + if num_cores * num_clusters > 64 { + dev_err!( + self.dev, + "Too many total cores ({} x {} > 64)\n", + num_clusters, + num_cores + ); + return Err(ENODEV); + } + + let mut core_masks = Vec::new(); + let mut total_active_cores: 
u32 = 0; + + let max_core_mask = (1u64 << num_cores) - 1; + for _i in 0..num_clusters { + let mask = core_mask & max_core_mask; + core_masks.try_push(mask as u32)?; + core_mask >>= num_cores; + total_active_cores += mask.count_ones(); + } + let mut core_masks_packed = Vec::new(); + core_masks_packed.try_push(core_mask_0)?; + if core_mask_1 != 0 { + core_masks_packed.try_push(core_mask_1)?; + } + + if core_mask != 0 { + dev_err!(self.dev, "Leftover core mask: {:#x}\n", core_mask); + return Err(EIO); + } + + let (gpu_rev, gpu_rev_id) = match (id_version >> 8) & 0xff { + 0x00 => (hw::GpuRevision::A0, hw::GpuRevisionID::A0), + 0x01 => (hw::GpuRevision::A1, hw::GpuRevisionID::A1), + 0x10 => (hw::GpuRevision::B0, hw::GpuRevisionID::B0), + 0x11 => (hw::GpuRevision::B1, hw::GpuRevisionID::B1), + 0x20 => (hw::GpuRevision::C0, hw::GpuRevisionID::C0), + 0x21 => (hw::GpuRevision::C1, hw::GpuRevisionID::C1), + a => { + dev_err!(self.dev, "Unknown GPU revision {}\n", a); + return Err(ENODEV); + } + }; + + Ok(hw::GpuIdConfig { + gpu_gen: match (id_version >> 24) & 0xff { + 4 => hw::GpuGen::G13, + 5 => hw::GpuGen::G14, + a => { + dev_err!(self.dev, "Unknown GPU generation {}\n", a); + return Err(ENODEV); + } + }, + gpu_variant: match (id_version >> 16) & 0xff { + 1 => hw::GpuVariant::P, // Guess + 2 => hw::GpuVariant::G, + 3 => hw::GpuVariant::S, + 4 => { + if num_clusters > 4 { + hw::GpuVariant::D + } else { + hw::GpuVariant::C + } + } + a => { + dev_err!(self.dev, "Unknown GPU variant {}\n", a); + return Err(ENODEV); + } + }, + gpu_rev, + gpu_rev_id, + max_dies: (id_clusters >> 20) & 0xf, + num_clusters, + num_cores, + num_frags: (id_counts_1 >> 8) & 0xff, + num_gps: (id_counts_2 >> 16) & 0xff, + total_active_cores, + core_masks, + core_masks_packed, + }) + } + + /// Get the fault information from the MMU status register, if one occurred. 
+ pub(crate) fn get_fault_info(&self) -> Option<FaultInfo> { + let fault_info = self.sgx_read64(FAULT_INFO); + + if fault_info & 1 == 0 { + return None; + } + + let unit_code = ((fault_info >> 9) & 0xff) as u8; + let unit = match unit_code { + 0x00..=0x9f => match unit_code & 0xf { + 0x0 => FaultUnit::DCMP(unit_code >> 4), + 0x1 => FaultUnit::UL1C(unit_code >> 4), + 0x2 => FaultUnit::CMP(unit_code >> 4), + 0x3 => FaultUnit::GSL1(unit_code >> 4), + 0x4 => FaultUnit::IAP(unit_code >> 4), + 0x5 => FaultUnit::VCE(unit_code >> 4), + 0x6 => FaultUnit::TE(unit_code >> 4), + 0x7 => FaultUnit::RAS(unit_code >> 4), + 0x8 => FaultUnit::VDM(unit_code >> 4), + 0x9 => FaultUnit::PPP(unit_code >> 4), + 0xa => FaultUnit::IPF(unit_code >> 4), + 0xb => FaultUnit::IPF_CPF(unit_code >> 4), + 0xc => FaultUnit::VF(unit_code >> 4), + 0xd => FaultUnit::VF_CPF(unit_code >> 4), + 0xe => FaultUnit::ZLS(unit_code >> 4), + _ => FaultUnit::Unknown(unit_code), + }, + 0xa1 => FaultUnit::dPM, + 0xa2 => FaultUnit::dCDM_KS(0), + 0xa3 => FaultUnit::dCDM_KS(1), + 0xa4 => FaultUnit::dCDM_KS(2), + 0xa5 => FaultUnit::dIPP, + 0xa6 => FaultUnit::dIPP_CS, + 0xa7 => FaultUnit::dVDM_CSD, + 0xa8 => FaultUnit::dVDM_SSD, + 0xa9 => FaultUnit::dVDM_ILF, + 0xaa => FaultUnit::dVDM_ILD, + 0xab => FaultUnit::dRDE(0), + 0xac => FaultUnit::dRDE(1), + 0xad => FaultUnit::FC, + 0xae => FaultUnit::GSL2, + 0xb0..=0xb7 => FaultUnit::GL2CC_META(unit_code & 0xf), + 0xb8 => FaultUnit::GL2CC_MB, + 0xe0..=0xff => match unit_code & 0xf { + 0x0 => FaultUnit::gPM_SP((unit_code >> 4) & 1), + 0x1 => FaultUnit::gVDM_CSD_SP((unit_code >> 4) & 1), + 0x2 => FaultUnit::gVDM_SSD_SP((unit_code >> 4) & 1), + 0x3 => FaultUnit::gVDM_ILF_SP((unit_code >> 4) & 1), + 0x4 => FaultUnit::gVDM_TFP_SP((unit_code >> 4) & 1), + 0x5 => FaultUnit::gVDM_MMB_SP((unit_code >> 4) & 1), + 0x6 => FaultUnit::gCDM_CS_KS0_SP((unit_code >> 4) & 1), + 0x7 => FaultUnit::gCDM_CS_KS1_SP((unit_code >> 4) & 1), + 0x8 => FaultUnit::gCDM_CS_KS2_SP((unit_code >> 4) & 1), + 
0x9 => FaultUnit::gCDM_KS0_SP((unit_code >> 4) & 1), + 0xa => FaultUnit::gCDM_KS1_SP((unit_code >> 4) & 1), + 0xb => FaultUnit::gCDM_KS2_SP((unit_code >> 4) & 1), + 0xc => FaultUnit::gIPP_SP((unit_code >> 4) & 1), + 0xd => FaultUnit::gIPP_CS_SP((unit_code >> 4) & 1), + 0xe => FaultUnit::gRDE0_SP((unit_code >> 4) & 1), + 0xf => FaultUnit::gRDE1_SP((unit_code >> 4) & 1), + _ => FaultUnit::Unknown(unit_code), + }, + _ => FaultUnit::Unknown(unit_code), + }; + + let reason = match (fault_info >> 1) & 0x7 { + 0 => FaultReason::Unmapped, + 1 => FaultReason::AfFault, + 2 => FaultReason::WriteOnly, + 3 => FaultReason::ReadOnly, + 4 => FaultReason::NoAccess, + a => FaultReason::Unknown(a as u8), + }; + + Some(FaultInfo { + address: (fault_info >> 30) << 6, + sideband: ((fault_info >> 23) & 0x7f) as u8, + vm_slot: ((fault_info >> 17) & 0x3f) as u32, + unit_code, + unit, + level: ((fault_info >> 7) & 3) as u8, + unk_5: ((fault_info >> 5) & 3) as u8, + read: (fault_info & (1 << 4)) != 0, + reason, + }) + } +} diff --git a/drivers/gpu/drm/asahi/slotalloc.rs b/drivers/gpu/drm/asahi/slotalloc.rs new file mode 100644 index 000000000000..6493111643fe --- /dev/null +++ b/drivers/gpu/drm/asahi/slotalloc.rs @@ -0,0 +1,292 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! Generic slot allocator +//! +//! This is a simple allocator to manage fixed-size pools of GPU resources that are transiently +//! required during command execution. Each item resides in a "slot" at a given index. Users borrow +//! and return free items from the available pool. +//! +//! Allocations are "sticky", and return a token that callers can use to request the same slot +//! again later. This allows slots to be lazily invalidated, so that multiple uses by the same user +//! avoid any actual cleanup work. +//! +//! The allocation policy is currently a simple LRU mechanism, doing a full linear scan over the +//! slots when no token was previously provided. 
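The sticky-token fast path plus LRU fallback described above can be sketched in plain Rust. This is a hypothetical standalone illustration with invented names (`Token`, `Pool`), not the in-kernel `SlotAllocator` API: each slot records whether it is free, when it was last handed out, and when it was last returned; a caller's old token reuses its previous slot if the get-time still matches, otherwise the least-recently-returned free slot is evicted.

```rust
// Hypothetical userspace sketch of the slotalloc policy (not the kernel API).

#[derive(Copy, Clone, Debug, PartialEq)]
struct Token {
    slot: usize,
    time: u64,
}

struct Pool {
    // (free, last_get_time, last_drop_time) per slot
    slots: Vec<(bool, u64, u64)>,
    clock: u64,
}

impl Pool {
    fn new(n: usize) -> Self {
        Pool { slots: vec![(true, 0, 0); n], clock: 0 }
    }

    fn get(&mut self, token: Option<Token>) -> Option<Token> {
        // Fast path: the token's slot is free and its get-time still matches,
        // so no other user claimed it in between; the caller can skip re-init.
        if let Some(t) = token {
            let s = &mut self.slots[t.slot];
            if s.0 && s.1 == t.time {
                s.0 = false;
                return Some(t);
            }
        }
        // Slow path: full linear scan for the least-recently-returned free slot.
        let idx = self
            .slots
            .iter()
            .enumerate()
            .filter(|(_, s)| s.0)
            .min_by_key(|&(_, s)| s.2)?
            .0;
        self.clock += 1;
        self.slots[idx].0 = false;
        self.slots[idx].1 = self.clock;
        Some(Token { slot: idx, time: self.clock })
    }

    fn put(&mut self, t: Token) {
        self.clock += 1;
        self.slots[t.slot].0 = true;
        self.slots[t.slot].2 = self.clock;
    }
}
```

The real allocator additionally blocks on a condvar when no slot is free and runs a release callback on return; the policy itself is the same.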
This is probably good enough, since in the absence +//! of serious system contention most allocation requests will be immediately fulfilled from the +//! previous slot without doing an LRU scan. + +use core::ops::{Deref, DerefMut}; +use kernel::{ + error::{code::*, Result}, + prelude::*, + sync::{Arc, CondVar, Mutex, UniqueArc}, +}; + +/// Trait representing a single item within a slot. +pub(crate) trait SlotItem { + /// Arbitrary user data associated with the SlotAllocator. + type Data; + + /// Called eagerly when this item is released back into the available pool. + fn release(&mut self, _data: &mut Self::Data, _slot: u32) {} +} + +/// Trivial implementation for users which do not require any slot data nor any allocator data. +impl SlotItem for () { + type Data = (); +} + +/// Represents a current or previous allocation of an item from a slot. Users keep `SlotToken`s +/// around across allocations to request that, if possible, the same slot be reused. +#[derive(Copy, Clone, Debug)] +pub(crate) struct SlotToken { + time: u64, + slot: u32, +} + +impl SlotToken { + /// Returns the slot index that this token represents a past assignment to. + pub(crate) fn last_slot(&self) -> u32 { + self.slot + } +} + +/// A guard representing active ownership of a slot. +pub(crate) struct Guard<T: SlotItem> { + item: Option<T>, + changed: bool, + token: SlotToken, + alloc: Arc<SlotAllocatorOuter<T>>, +} + +impl<T: SlotItem> Guard<T> { + /// Returns the active slot owned by this `Guard`. + pub(crate) fn slot(&self) -> u32 { + self.token.slot + } + + /// Returns `true` if the slot changed since the last allocation (or no `SlotToken` was + /// provided), or `false` if the previously allocated slot was successfully re-acquired with + /// no other users in the interim. + pub(crate) fn changed(&self) -> bool { + self.changed + } + + /// Returns a `SlotToken` that can be used to re-request the same slot at a later time, after + /// this `Guard` is dropped. 
+ pub(crate) fn token(&self) -> SlotToken { + self.token + } +} + +impl<T: SlotItem> Deref for Guard<T> { + type Target = T; + + fn deref(&self) -> &Self::Target { + self.item.as_ref().expect("SlotItem Guard lost our item!") + } +} + +impl<T: SlotItem> DerefMut for Guard<T> { + fn deref_mut(&mut self) -> &mut Self::Target { + self.item.as_mut().expect("SlotItem Guard lost our item!") + } +} + +/// A slot item that is currently free. +struct Entry<T: SlotItem> { + item: T, + get_time: u64, + drop_time: u64, +} + +/// Inner data for the `SlotAllocator`, protected by a `Mutex`. +struct SlotAllocatorInner<T: SlotItem> { + data: T::Data, + slots: Vec<Option<Entry<T>>>, + get_count: u64, + drop_count: u64, +} + +/// A single slot allocator instance. +struct SlotAllocatorOuter<T: SlotItem> { + inner: Mutex<SlotAllocatorInner<T>>, + cond: CondVar, +} + +/// A shared reference to a slot allocator instance. +pub(crate) struct SlotAllocator<T: SlotItem>(Arc<SlotAllocatorOuter<T>>); + +impl<T: SlotItem> SlotAllocator<T> { + /// Creates a new `SlotAllocator`, with a fixed number of slots and arbitrary associated data. + /// + /// The caller provides a constructor callback which takes a reference to the `T::Data` and + /// creates a single slot. This is called during construction to create all the initial + /// items, which then live the lifetime of the `SlotAllocator`. + pub(crate) fn new( + num_slots: u32, + mut data: T::Data, + mut constructor: impl FnMut(&mut T::Data, u32) -> T, + ) -> Result<SlotAllocator<T>> { + let mut slots = Vec::try_with_capacity(num_slots as usize)?; + + for i in 0..num_slots { + slots + .try_push(Some(Entry { + item: constructor(&mut data, i), + get_time: 0, + drop_time: 0, + })) + .expect("try_push() failed after reservation"); + } + + let inner = SlotAllocatorInner { + data, + slots, + get_count: 0, + drop_count: 0, + }; + + let mut alloc = Pin::from(UniqueArc::try_new(SlotAllocatorOuter { + // SAFETY: `condvar_init!` is called below. 
+ cond: unsafe { CondVar::new() }, + // SAFETY: `mutex_init!` is called below. + inner: unsafe { Mutex::new(inner) }, + })?); + + // SAFETY: `cond` is pinned when `alloc` is. + let pinned = unsafe { alloc.as_mut().map_unchecked_mut(|s| &mut s.cond) }; + kernel::condvar_init!(pinned, "SlotAllocator::cond"); + + // SAFETY: `inner` is pinned when `alloc` is. + let pinned = unsafe { alloc.as_mut().map_unchecked_mut(|s| &mut s.inner) }; + kernel::mutex_init!(pinned, "SlotAllocator::inner"); + + Ok(SlotAllocator(alloc.into())) + } + + /// Calls a callback on the inner data associated with this allocator, taking the lock. + pub(crate) fn with_inner<RetVal>(&self, cb: impl FnOnce(&mut T::Data) -> RetVal) -> RetVal { + let mut inner = self.0.inner.lock(); + cb(&mut inner.data) + } + + /// Gets a fresh slot, optionally reusing a previous allocation if a `SlotToken` is provided. + /// + /// Blocks if no slots are free. + pub(crate) fn get(&self, token: Option<SlotToken>) -> Result<Guard<T>> { + self.get_inner(token, |_a, _b| Ok(())) + } + + /// Gets a fresh slot, optionally reusing a previous allocation if a `SlotToken` is provided. + /// + /// Blocks if no slots are free. + /// + /// This version allows the caller to pass in a callback that gets a mutable reference to the + /// user data for the allocator and the freshly acquired slot, which is called before the + /// allocator lock is released. This can be used to perform bookkeeping associated with + /// specific slots (such as tracking their current owner). 
+ pub(crate) fn get_inner( + &self, + token: Option<SlotToken>, + cb: impl FnOnce(&mut T::Data, &mut Guard<T>) -> Result<()>, + ) -> Result<Guard<T>> { + let mut inner = self.0.inner.lock(); + + if let Some(token) = token { + let slot = &mut inner.slots[token.slot as usize]; + if slot.is_some() { + let count = slot.as_ref().unwrap().get_time; + if count == token.time { + let mut guard = Guard { + item: Some(slot.take().unwrap().item), + token, + changed: false, + alloc: self.0.clone(), + }; + cb(&mut inner.data, &mut guard)?; + return Ok(guard); + } + } + } + + let mut first = true; + let slot = loop { + let mut oldest_time = u64::MAX; + let mut oldest_slot = 0u32; + + for (i, slot) in inner.slots.iter().enumerate() { + if let Some(slot) = slot.as_ref() { + if slot.drop_time < oldest_time { + oldest_slot = i as u32; + oldest_time = slot.drop_time; + } + } + } + + if oldest_time == u64::MAX { + if first { + pr_warn!( + "{}: out of slots, blocking\n", + core::any::type_name::<Self>() + ); + } + first = false; + if self.0.cond.wait(&mut inner) { + return Err(ERESTARTSYS); + } + } else { + break oldest_slot; + } + }; + + inner.get_count += 1; + + let item = inner.slots[slot as usize] + .take() + .expect("Someone stole our slot?") + .item; + + let mut guard = Guard { + item: Some(item), + changed: true, + token: SlotToken { + time: inner.get_count, + slot, + }, + alloc: self.0.clone(), + }; + + cb(&mut inner.data, &mut guard)?; + Ok(guard) + } +} + +impl<T: SlotItem> Clone for SlotAllocator<T> { + fn clone(&self) -> Self { + SlotAllocator(self.0.clone()) + } +} + +impl<T: SlotItem> Drop for Guard<T> { + fn drop(&mut self) { + let mut inner = self.alloc.inner.lock(); + if inner.slots[self.token.slot as usize].is_some() { + pr_crit!( + "{}: tried to return an item into a full slot ({})\n", + core::any::type_name::<Self>(), + self.token.slot + ); + } else { + inner.drop_count += 1; + let mut item = self.item.take().expect("Guard lost its item"); + item.release(&mut 
inner.data, self.token.slot); + inner.slots[self.token.slot as usize] = Some(Entry { + item, + get_time: self.token.time, + drop_time: inner.drop_count, + }); + self.alloc.cond.notify_one(); + } + } +} diff --git a/drivers/gpu/drm/asahi/util.rs b/drivers/gpu/drm/asahi/util.rs new file mode 100644 index 000000000000..8d1a37f17cd8 --- /dev/null +++ b/drivers/gpu/drm/asahi/util.rs @@ -0,0 +1,44 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! Miscellaneous utility functions + +use core::ops::{Add, BitAnd, Div, Not, Sub}; + +/// Aligns an integer type to a power of two. +pub(crate) fn align<T>(a: T, b: T) -> T +where + T: Copy + + Default + + BitAnd<Output = T> + + Not<Output = T> + + Add<Output = T> + + Sub<Output = T> + + Div<Output = T> + + core::cmp::PartialEq, +{ + let def: T = Default::default(); + #[allow(clippy::eq_op)] + let one: T = !def / !def; + + assert!((b & (b - one)) == def); + + (a + b - one) & !(b - one) +} + +/// Integer division rounding up. +pub(crate) fn div_ceil<T>(a: T, b: T) -> T +where + T: Copy + + Default + + BitAnd<Output = T> + + Not<Output = T> + + Add<Output = T> + + Sub<Output = T> + + Div<Output = T>, +{ + let def: T = Default::default(); + #[allow(clippy::eq_op)] + let one: T = !def / !def; + + (a + b - one) / b +} diff --git a/drivers/gpu/drm/asahi/workqueue.rs b/drivers/gpu/drm/asahi/workqueue.rs new file mode 100644 index 000000000000..ce1d1f89e48e --- /dev/null +++ b/drivers/gpu/drm/asahi/workqueue.rs @@ -0,0 +1,880 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT + +//! GPU command execution queues +//! +//! The AGX GPU firmware schedules GPU work commands out of work queues, which are ring buffers of +//! pointers to work commands. There can be an arbitrary number of work queues. Work queues have an +//! associated type (vertex, fragment, or compute) and may only contain generic commands or commands +//! specific to that type. +//! +//! 
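The generic `align`/`div_ceil` helpers in util.rs above rely on the standard power-of-two bit trick (`!def / !def` is just a type-generic way to spell `1`). A plain-`u64` sketch of the same arithmetic, for illustration only:

```rust
// Plain-u64 versions of the util.rs helpers (illustrative, not the generic code).
// `align` rounds `a` up to a multiple of `b`, which must be a power of two.
fn align(a: u64, b: u64) -> u64 {
    assert!(b.is_power_of_two());
    // Adding b - 1 carries into the next multiple; the mask clears the low bits.
    (a + b - 1) & !(b - 1)
}

// `div_ceil` is integer division rounding up instead of down.
fn div_ceil(a: u64, b: u64) -> u64 {
    (a + b - 1) / b
}
```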
This module manages queueing work commands into a work queue and submitting them for execution +//! by the firmware. An active work queue needs an event to signal completion of its work, which is +//! owned by what we call a batch. This event then notifies the work queue when work is completed, +//! and that triggers freeing of all resources associated with that work. An idle work queue gives +//! up its associated event. + +use crate::debug::*; +use crate::fw::channels::PipeType; +use crate::fw::types::*; +use crate::fw::workqueue::*; +use crate::object::OpaqueGpuObject; +use crate::regs::FaultReason; +use crate::{box_in_place, no_debug, place}; +use crate::{channel, driver, event, fw, gpu, object, regs}; +use core::num::NonZeroU64; +use core::sync::atomic::Ordering; +use kernel::{ + bindings, + error::code::*, + prelude::*, + sync::{Arc, Guard, Mutex, UniqueArc}, +}; + +const DEBUG_CLASS: DebugFlags = DebugFlags::WorkQueue; + +const MAX_JOB_SLOTS: u32 = 127; + +/// An enum of possible errors that might cause a piece of work to fail execution. +#[derive(Copy, Clone, Debug, PartialEq, Eq)] +pub(crate) enum WorkError { + /// GPU timeout (command execution took too long). + Timeout, + /// GPU MMU fault (invalid access). + Fault(regs::FaultInfo), + /// Work failed due to an error caused by other concurrent GPU work. + Killed, + /// The GPU crashed. + NoDevice, + /// Unknown reason. 
+ Unknown, +} + +impl From<WorkError> for bindings::drm_asahi_result_info { + fn from(err: WorkError) -> Self { + match err { + WorkError::Fault(info) => Self { + status: bindings::drm_asahi_status_DRM_ASAHI_STATUS_FAULT, + fault_type: match info.reason { + FaultReason::Unmapped => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_UNMAPPED, + FaultReason::AfFault => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_AF_FAULT, + FaultReason::WriteOnly => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_WRITE_ONLY, + FaultReason::ReadOnly => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_READ_ONLY, + FaultReason::NoAccess => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_NO_ACCESS, + FaultReason::Unknown(_) => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_UNKNOWN, + }, + unit: info.unit_code.into(), + sideband: info.sideband.into(), + level: info.level, + extra: info.unk_5.into(), + is_read: info.read as u8, + pad: 0, + address: info.address, + }, + a => Self { + status: match a { + WorkError::Timeout => bindings::drm_asahi_status_DRM_ASAHI_STATUS_TIMEOUT, + WorkError::Killed => bindings::drm_asahi_status_DRM_ASAHI_STATUS_KILLED, + WorkError::NoDevice => bindings::drm_asahi_status_DRM_ASAHI_STATUS_NO_DEVICE, + _ => bindings::drm_asahi_status_DRM_ASAHI_STATUS_UNKNOWN_ERROR, + }, + ..Default::default() + }, + } + } +} + +impl From<WorkError> for kernel::error::Error { + fn from(err: WorkError) -> Self { + match err { + WorkError::Timeout => ETIMEDOUT, + // Not EFAULT because that's for userspace faults + WorkError::Fault(_) => EIO, + WorkError::Unknown => ENODATA, + WorkError::Killed => ECANCELED, + WorkError::NoDevice => ENODEV, + } + } +} + +/// A GPU context tracking structure, which must be explicitly invalidated when dropped. +pub(crate) struct GpuContext { + dev: driver::AsahiDevice, + data: GpuObject<fw::workqueue::GpuContextData>, +} +no_debug!(GpuContext); + +impl GpuContext { + /// Allocate a new GPU context. 
+ pub(crate) fn new( + dev: &driver::AsahiDevice, + alloc: &mut gpu::KernelAllocators, + ) -> Result<GpuContext> { + Ok(GpuContext { + dev: dev.clone(), + data: alloc + .shared + .new_object(Default::default(), |_inner| Default::default())?, + }) + } + + /// Returns the GPU pointer to the inner GPU context data structure. + pub(crate) fn gpu_pointer(&self) -> GpuPointer<'_, fw::workqueue::GpuContextData> { + self.data.gpu_pointer() + } +} + +impl Drop for GpuContext { + fn drop(&mut self) { + mod_dev_dbg!(self.dev, "GpuContext: Invalidating GPU context\n"); + let dev = self.dev.data(); + if dev.gpu.invalidate_context(&self.data).is_err() { + dev_err!(self.dev, "GpuContext: Failed to invalidate GPU context!\n"); + } + } +} + +struct SubmittedWork<O, C> +where + O: OpaqueGpuObject, + C: FnOnce(O, Option<WorkError>) + Send + Sync + 'static, +{ + object: O, + value: EventValue, + error: Option<WorkError>, + wptr: u32, + vm_slot: u32, + callback: C, +} + +trait GenSubmittedWork: Send + Sync { + fn gpu_va(&self) -> NonZeroU64; + fn value(&self) -> event::EventValue; + fn wptr(&self) -> u32; + fn set_wptr(&mut self, wptr: u32); + fn mark_error(&mut self, error: WorkError); + fn complete(self: Box<Self>); +} + +impl<O: OpaqueGpuObject, C: FnOnce(O, Option<WorkError>) + Send + Sync> GenSubmittedWork + for SubmittedWork<O, C> +{ + fn gpu_va(&self) -> NonZeroU64 { + self.object.gpu_va() + } + + fn value(&self) -> event::EventValue { + self.value + } + + fn wptr(&self) -> u32 { + self.wptr + } + + fn set_wptr(&mut self, wptr: u32) { + self.wptr = wptr; + } + + fn complete(self: Box<Self>) { + let SubmittedWork { + object, + value: _, + error, + wptr: _, + vm_slot: _, + callback, + } = *self; + + callback(object, error); + } + + fn mark_error(&mut self, error: WorkError) { + mod_pr_debug!("WorkQueue: Command at value {:#x?} failed\n", self.value); + self.error = Some(match error { + WorkError::Fault(info) if info.vm_slot != self.vm_slot => WorkError::Killed, + err => err, + }); 
+ } +} + +/// Inner data for managing a single work queue. +#[versions(AGX)] +struct WorkQueueInner { + event_manager: Arc<event::EventManager>, + info: GpuObject<QueueInfo::ver>, + new: bool, + pipe_type: PipeType, + size: u32, + wptr: u32, + pending: Vec<Box<dyn GenSubmittedWork>>, + last_token: Option<event::Token>, + pending_jobs: usize, + last_submitted: Option<event::EventValue>, + last_completed: Option<event::EventValue>, + event: Option<(event::Event, event::EventValue)>, + priority: u32, + commit_seq: u64, + submit_seq: u64, +} + +/// An instance of a work queue. +#[versions(AGX)] +pub(crate) struct WorkQueue { + info_pointer: GpuWeakPointer<QueueInfo::ver>, + inner: Mutex<WorkQueueInner::ver>, +} + +#[versions(AGX)] +impl WorkQueueInner::ver { + /// Return the GPU done pointer, representing how many work items have been completed by the + /// GPU. + fn doneptr(&self) -> u32 { + self.info + .state + .with(|raw, _inner| raw.gpu_doneptr.load(Ordering::Acquire)) + } +} + +#[versions(AGX)] +#[derive(Copy, Clone)] +pub(crate) struct QueueEventInfo { + pub(crate) stamp_pointer: GpuWeakPointer<Stamp>, + pub(crate) fw_stamp_pointer: GpuWeakPointer<FwStamp>, + pub(crate) slot: u32, + pub(crate) value: event::EventValue, + pub(crate) cmd_seq: u64, + pub(crate) info_ptr: GpuWeakPointer<QueueInfo::ver>, +} + +#[versions(AGX)] +pub(crate) struct Job { + wq: Arc<WorkQueue::ver>, + event_info: QueueEventInfo::ver, + start_value: EventValue, + pending: Vec<Box<dyn GenSubmittedWork>>, + committed: bool, + submitted: bool, + event_count: usize, +} + +#[versions(AGX)] +pub(crate) struct JobSubmission<'a> { + inner: Option<Guard<'a, Mutex<WorkQueueInner::ver>>>, + wptr: u32, + event_count: usize, + command_count: usize, +} + +#[versions(AGX)] +impl Job::ver { + pub(crate) fn event_info(&self) -> QueueEventInfo::ver { + let mut info = self.event_info; + info.cmd_seq += self.event_count as u64; + + info + } + + pub(crate) fn next_seq(&mut self) { + self.event_count += 1; + 
self.event_info.value.increment(); + } + + pub(crate) fn add<O: object::OpaqueGpuObject + 'static>( + &mut self, + command: O, + vm_slot: u32, + ) -> Result { + self.add_cb(command, vm_slot, |_, _| {}) + } + + pub(crate) fn add_cb<O: object::OpaqueGpuObject + 'static>( + &mut self, + command: O, + vm_slot: u32, + callback: impl FnOnce(O, Option<WorkError>) + Sync + Send + 'static, + ) -> Result { + if self.committed { + pr_err!("WorkQueue: Tried to mutate committed Job\n"); + return Err(EINVAL); + } + + self.pending.try_push(Box::try_new(SubmittedWork::<_, _> { + object: command, + value: self.event_info.value.next(), + error: None, + callback, + wptr: 0, + vm_slot, + })?)?; + + Ok(()) + } + + pub(crate) fn commit(&mut self) -> Result { + if self.committed { + pr_err!("WorkQueue: Tried to commit committed Job\n"); + return Err(EINVAL); + } + + if self.pending.is_empty() { + pr_err!("WorkQueue: Job::commit() with no commands\n"); + return Err(EINVAL); + } + + let mut inner = self.wq.inner.lock(); + + let ev = inner.event.as_mut().expect("WorkQueue: Job lost its event"); + + if ev.1 != self.start_value { + pr_err!( + "WorkQueue: Job::commit() out of order (event slot {} {:?} != {:?}\n", + ev.0.slot(), + ev.1, + self.start_value + ); + return Err(EINVAL); + } + + ev.1 = self.event_info.value; + inner.commit_seq += self.pending.len() as u64; + self.committed = true; + + Ok(()) + } + + pub(crate) fn can_submit(&self) -> bool { + self.wq.free_slots() > self.event_count && self.wq.free_space() > self.pending.len() + } + + pub(crate) fn submit(&mut self) -> Result<JobSubmission::ver<'_>> { + if !self.committed { + pr_err!("WorkQueue: Tried to submit uncommitted Job\n"); + return Err(EINVAL); + } + + if self.submitted { + pr_err!("WorkQueue: Tried to submit Job twice\n"); + return Err(EINVAL); + } + + if self.pending.is_empty() { + pr_err!("WorkQueue: Job::submit() with no commands\n"); + return Err(EINVAL); + } + + let mut inner = self.wq.inner.lock(); + + if 
inner.submit_seq != self.event_info.cmd_seq { + pr_err!( + "WorkQueue: Job::submit() out of order (submit_seq {} != {})\n", + inner.submit_seq, + self.event_info.cmd_seq + ); + return Err(EINVAL); + } + + if inner.commit_seq < (self.event_info.cmd_seq + self.pending.len() as u64) { + pr_err!( + "WorkQueue: Job::submit() out of order (commit_seq {} != {})\n", + inner.commit_seq, + (self.event_info.cmd_seq + self.pending.len() as u64) + ); + return Err(EINVAL); + } + + let mut wptr = inner.wptr; + let command_count = self.pending.len(); + + if inner.free_space() <= command_count { + pr_err!("WorkQueue: Job does not fit in ring buffer\n"); + return Err(EBUSY); + } + + inner.pending.try_reserve(command_count)?; + + inner.last_submitted = inner.event.as_ref().map(|e| e.1); + + for mut command in self.pending.drain(..) { + command.set_wptr(wptr); + + let next_wptr = (wptr + 1) % inner.size; + assert!(inner.doneptr() != next_wptr); + inner.info.ring[wptr as usize] = command.gpu_va().get(); + wptr = next_wptr; + + // Cannot fail, since we did a try_reserve(1) above + inner + .pending + .try_push(command) + .expect("try_push() failed after try_reserve()"); + } + + self.submitted = true; + + Ok(JobSubmission::ver { + inner: Some(inner), + wptr, + command_count, + event_count: self.event_count, + }) + } +} + +#[versions(AGX)] +impl<'a> JobSubmission::ver<'a> { + pub(crate) fn run(mut self, channel: &mut channel::PipeChannel::ver) { + let command_count = self.command_count; + let mut inner = self.inner.take().expect("No inner?"); + let wptr = self.wptr; + core::mem::forget(self); + + inner + .info + .state + .with(|raw, _inner| raw.cpu_wptr.store(wptr, Ordering::Release)); + + inner.wptr = wptr; + + let event = inner.event.as_mut().expect("JobSubmission lost its event"); + + let event_slot = event.0.slot(); + + let msg = fw::channels::RunWorkQueueMsg::ver { + pipe_type: inner.pipe_type, + work_queue: Some(inner.info.weak_pointer()), + wptr: inner.wptr, + event_slot, + is_new: 
inner.new, + __pad: Default::default(), + }; + channel.send(&msg); + inner.new = false; + + inner.submit_seq += command_count as u64; + } + + pub(crate) fn pipe_type(&self) -> PipeType { + self.inner.as_ref().expect("No inner?").pipe_type + } + + pub(crate) fn priority(&self) -> u32 { + self.inner.as_ref().expect("No inner?").priority + } +} + +#[versions(AGX)] +impl Drop for Job::ver { + fn drop(&mut self) { + mod_pr_debug!("WorkQueue: Dropping Job\n"); + let mut inner = self.wq.inner.lock(); + + if self.committed && !self.submitted { + let pipe_type = inner.pipe_type; + let event = inner.event.as_mut().expect("Job lost its event"); + mod_pr_debug!( + "WorkQueue({:?}): Roll back {} events (slot {} val {:#x?}) and {} commands\n", + pipe_type, + self.event_count, + event.0.slot(), + event.1, + self.pending.len() + ); + event.1.sub(self.event_count as u32); + inner.commit_seq -= self.pending.len() as u64; + } + + inner.pending_jobs -= 1; + + if inner.pending.is_empty() && inner.pending_jobs == 0 { + mod_pr_debug!("WorkQueue({:?}): Dropping event\n", inner.pipe_type); + inner.event = None; + inner.last_submitted = None; + inner.last_completed = None; + } + mod_pr_debug!("WorkQueue({:?}): Dropped Job\n", inner.pipe_type); + } +} + +#[versions(AGX)] +impl<'a> Drop for JobSubmission::ver<'a> { + fn drop(&mut self) { + let inner = self.inner.as_mut().expect("No inner?"); + mod_pr_debug!("WorkQueue({:?}): Dropping JobSubmission\n", inner.pipe_type); + + let new_len = inner.pending.len() - self.command_count; + inner.pending.truncate(new_len); + + let pipe_type = inner.pipe_type; + let event = inner.event.as_mut().expect("JobSubmission lost its event"); + mod_pr_debug!( + "WorkQueue({:?}): Roll back {} events (slot {} val {:#x?}) and {} commands\n", + pipe_type, + self.event_count, + event.0.slot(), + event.1, + self.command_count + ); + event.1.sub(self.event_count as u32); + inner.commit_seq -= self.command_count as u64; + mod_pr_debug!("WorkQueue({:?}): Dropped 
JobSubmission\n", inner.pipe_type);
+    }
+}
+
+#[versions(AGX)]
+impl WorkQueueInner::ver {
+    /// Return the number of free entries in the workqueue
+    pub(crate) fn free_space(&self) -> usize {
+        self.size as usize - self.pending.len() - 1
+    }
+
+    pub(crate) fn free_slots(&self) -> usize {
+        let busy_slots = if let Some(ls) = self.last_submitted {
+            let lc = self
+                .last_completed
+                .expect("last_submitted but not completed?");
+            ls.delta(&lc)
+        } else {
+            0
+        };
+
+        ((MAX_JOB_SLOTS as i32) - busy_slots).max(0) as usize
+    }
+}
+
+#[versions(AGX)]
+impl WorkQueue::ver {
+    /// Create a new WorkQueue of a given type and priority.
+    #[allow(clippy::too_many_arguments)]
+    pub(crate) fn new(
+        alloc: &mut gpu::KernelAllocators,
+        event_manager: Arc<event::EventManager>,
+        gpu_context: Arc<GpuContext>,
+        notifier_list: Arc<GpuObject<fw::event::NotifierList>>,
+        pipe_type: PipeType,
+        id: u64,
+        priority: u32,
+        size: u32,
+    ) -> Result<Arc<WorkQueue::ver>> {
+        let mut info = box_in_place!(QueueInfo::ver {
+            state: alloc.shared.new_default::<RingState>()?,
+            ring: alloc.shared.array_empty(size as usize)?,
+            gpu_buf: alloc.private.array_empty(0x2c18)?,
+            notifier_list: notifier_list,
+            gpu_context: gpu_context,
+        })?;
+
+        info.state.with_mut(|raw, _inner| {
+            raw.rb_size = size;
+        });
+
+        let inner = WorkQueueInner::ver {
+            event_manager,
+            info: alloc.private.new_boxed(info, |inner, ptr| {
+                Ok(place!(
+                    ptr,
+                    raw::QueueInfo::ver {
+                        state: inner.state.gpu_pointer(),
+                        ring: inner.ring.gpu_pointer(),
+                        notifier_list: inner.notifier_list.gpu_pointer(),
+                        gpu_buf: inner.gpu_buf.gpu_pointer(),
+                        gpu_rptr1: Default::default(),
+                        gpu_rptr2: Default::default(),
+                        gpu_rptr3: Default::default(),
+                        event_id: AtomicI32::new(-1),
+                        priority: *raw::PRIORITY.get(priority as usize).ok_or(EINVAL)?,
+                        unk_4c: -1,
+                        uuid: id as u32,
+                        unk_54: -1,
+                        unk_58: Default::default(),
+                        busy: Default::default(),
+                        __pad: Default::default(),
+                        unk_84_state: Default::default(),
+                        unk_88: 0,
+                        unk_8c: 0,
                        unk_90: 0,
+                        unk_94: 0,
+                        pending: Default::default(),
+                        unk_9c: 0,
+                        #[ver(V >= V13_2)]
+                        unk_a0_0: 0,
+                        gpu_context: inner.gpu_context.gpu_pointer(),
+                        unk_a8: Default::default(),
+                        #[ver(V >= V13_2)]
+                        unk_b0: 0,
+                    }
+                ))
+            })?,
+            new: true,
+            pipe_type,
+            size,
+            wptr: 0,
+            pending: Vec::new(),
+            last_token: None,
+            event: None,
+            priority,
+            pending_jobs: 0,
+            commit_seq: 0,
+            submit_seq: 0,
+            last_completed: None,
+            last_submitted: None,
+        };
+
+        let mut queue = Pin::from(UniqueArc::try_new(Self {
+            info_pointer: inner.info.weak_pointer(),
+            // SAFETY: `mutex_init!` is called below.
+            inner: unsafe { Mutex::new(inner) },
+        })?);
+
+        // SAFETY: `inner` is pinned when `queue` is.
+        let pinned = unsafe { queue.as_mut().map_unchecked_mut(|s| &mut s.inner) };
+        match pipe_type {
+            PipeType::Vertex => kernel::mutex_init!(pinned, "WorkQueue::inner (Vertex)"),
+            PipeType::Fragment => kernel::mutex_init!(pinned, "WorkQueue::inner (Fragment)"),
+            PipeType::Compute => kernel::mutex_init!(pinned, "WorkQueue::inner (Compute)"),
+        }
+
+        Ok(queue.into())
+    }
+
+    pub(crate) fn event_info(&self) -> Option<QueueEventInfo::ver> {
+        let inner = self.inner.lock();
+
+        inner.event.as_ref().map(|ev| QueueEventInfo::ver {
+            stamp_pointer: ev.0.stamp_pointer(),
+            fw_stamp_pointer: ev.0.fw_stamp_pointer(),
+            slot: ev.0.slot(),
+            value: ev.1,
+            cmd_seq: inner.commit_seq,
+            info_ptr: self.info_pointer,
+        })
+    }
+
+    pub(crate) fn new_job(self: &Arc<Self>) -> Result<Job::ver> {
+        let mut inner = self.inner.lock();
+
+        if inner.event.is_none() {
+            mod_pr_debug!("WorkQueue({:?}): Grabbing event\n", inner.pipe_type);
+            let event = inner.event_manager.get(inner.last_token, self.clone())?;
+            let cur = event.current();
+            inner.last_token = Some(event.token());
+            mod_pr_debug!(
+                "WorkQueue({:?}): Grabbed event slot {}: {:#x?}\n",
+                inner.pipe_type,
+                event.slot(),
+                cur
+            );
+            inner.event = Some((event, cur));
+            inner.last_submitted = Some(cur);
+            inner.last_completed = Some(cur);
+        }
+
inner.pending_jobs += 1; + + let ev = &inner.event.as_ref().unwrap(); + + mod_pr_debug!("WorkQueue({:?}): New job\n", inner.pipe_type); + Ok(Job::ver { + wq: self.clone(), + event_info: QueueEventInfo::ver { + stamp_pointer: ev.0.stamp_pointer(), + fw_stamp_pointer: ev.0.fw_stamp_pointer(), + slot: ev.0.slot(), + value: ev.1, + cmd_seq: inner.commit_seq, + info_ptr: self.info_pointer, + }, + start_value: ev.1, + pending: Vec::new(), + event_count: 0, + committed: false, + submitted: false, + }) + } + + /// Return the number of free entries in the workqueue + pub(crate) fn free_space(&self) -> usize { + self.inner.lock().free_space() + } + + /// Return the number of free job slots in the workqueue + pub(crate) fn free_slots(&self) -> usize { + self.inner.lock().free_slots() + } + + pub(crate) fn pipe_type(&self) -> PipeType { + self.inner.lock().pipe_type + } +} + +/// Trait used to erase the version-specific type of WorkQueues, to avoid leaking +/// version-specificity into the event module. +pub(crate) trait WorkQueue { + fn signal(&self) -> bool; + fn mark_error(&self, value: event::EventValue, error: WorkError); + fn fail_all(&self, error: WorkError); +} + +#[versions(AGX)] +impl WorkQueue for WorkQueue::ver { + /// Signal a workqueue that some work was completed. + /// + /// This will check the event stamp value to find out exactly how many commands were processed. 
+    fn signal(&self) -> bool {
+        let mut inner = self.inner.lock();
+        let event = inner.event.as_ref();
+        let value = match event {
+            None => {
+                pr_err!("WorkQueue: signal() called but no event?\n");
+                return true;
+            }
+            Some(event) => event.0.current(),
+        };
+
+        inner.last_completed = Some(value);
+
+        mod_pr_debug!(
+            "WorkQueue({:?}): Signaling event {:?} value {:#x?}\n",
+            inner.pipe_type,
+            inner.last_token,
+            value
+        );
+
+        let mut completed_commands: usize = 0;
+
+        for cmd in inner.pending.iter() {
+            if cmd.value() <= value {
+                mod_pr_debug!(
+                    "WorkQueue({:?}): Command at value {:#x?} complete\n",
+                    inner.pipe_type,
+                    cmd.value()
+                );
+                completed_commands += 1;
+            } else {
+                break;
+            }
+        }
+
+        if completed_commands == 0 {
+            return inner.pending.is_empty();
+        }
+
+        let mut completed = Vec::new();
+
+        if completed.try_reserve(completed_commands).is_err() {
+            pr_crit!(
+                "WorkQueue({:?}): Failed to allocate space for {} completed commands\n",
+                inner.pipe_type,
+                completed_commands
+            );
+        }
+
+        let pipe_type = inner.pipe_type;
+
+        for cmd in inner.pending.drain(..completed_commands) {
+            if completed.try_push(cmd).is_err() {
+                pr_crit!(
+                    "WorkQueue({:?}): Failed to signal a completed command\n",
+                    pipe_type,
+                );
+            }
+        }
+
+        mod_pr_debug!(
+            "WorkQueue({:?}): Completed {} commands\n",
+            inner.pipe_type,
+            completed_commands
+        );
+
+        if let Some(i) = completed.last() {
+            inner
+                .info
+                .state
+                .with(|raw, _inner| raw.cpu_freeptr.store(i.wptr(), Ordering::Release));
+        }
+
+        let empty = inner.pending.is_empty();
+        if empty && inner.pending_jobs == 0 {
+            inner.event = None;
+            inner.last_submitted = None;
+            inner.last_completed = None;
+        }
+
+        core::mem::drop(inner);
+
+        for cmd in completed {
+            cmd.complete();
+        }
+
+        empty
+    }
+
+    /// Mark this queue's work up to a certain stamp value as having failed.
+ fn mark_error(&self, value: event::EventValue, error: WorkError) { + // If anything is marked completed, we can consider it successful + // at this point, even if we didn't get the signal event yet. + self.signal(); + + let mut inner = self.inner.lock(); + + if inner.event.is_none() { + pr_err!("WorkQueue: signal_fault() called but no event?\n"); + return; + } + + mod_pr_debug!( + "WorkQueue({:?}): Signaling fault for event {:?} at value {:#x?}\n", + inner.pipe_type, + inner.last_token, + value + ); + + for cmd in inner.pending.iter_mut() { + if cmd.value() <= value { + cmd.mark_error(error); + } else { + break; + } + } + } + + /// Mark all of this queue's work as having failed, and complete it. + fn fail_all(&self, error: WorkError) { + // If anything is marked completed, we can consider it successful + // at this point, even if we didn't get the signal event yet. + self.signal(); + + let mut inner = self.inner.lock(); + + if inner.event.is_none() { + pr_err!("WorkQueue: fail_all() called but no event?\n"); + return; + } + + mod_pr_debug!( + "WorkQueue({:?}): Failing all jobs {:?}\n", + inner.pipe_type, + error + ); + + let mut cmds = Vec::new(); + + core::mem::swap(&mut inner.pending, &mut cmds); + + if inner.pending_jobs == 0 { + inner.event = None; + } + + core::mem::drop(inner); + + for mut cmd in cmds { + cmd.mark_error(error); + cmd.complete(); + } + } +} + +#[versions(AGX)] +impl Drop for WorkQueue::ver { + fn drop(&mut self) { + mod_pr_debug!("WorkQueue({:?}): Dropping\n", self.inner.lock().pipe_type); + } +}
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// A generic monotonically incrementing ID used to uniquely identify object instances within the
+/// driver.
+pub(crate) struct ID(AtomicU64);
+impl ID {
+    /// Create a new ID counter with a given value.
+    fn new(val: u64) -> ID {
+        ID(AtomicU64::new(val))
+    }
+
+    /// Fetch the next unique ID.
+    pub(crate) fn next(&self) -> u64 {
+        self.0.fetch_add(1, Ordering::Relaxed)
+    }
+}
Continuing the theme of me commenting on individual things: I stumbled over this because I noticed that there are a lot of id-based lookups where I don't expect them, and started chasing.
- For ids use xarray, not atomic counters. Yes, I know dma_fence timelines get this wrong; this goes back to an innocent time when we didn't allocate more than one timeline per engine, and no one has fixed it since. Yes, u64 should be big enough for everyone :-/
- Attaching ID spaces to drm_device is also not great. drm is full of these mistakes. Much better if they're per drm_file and thus private to each client.
- They shouldn't be used for anything other than uapi id -> kernel object lookup at the beginning of ioctl code, and nowhere else. At least from skimming, it seems like these are used all over the driver codebase, which does freak me out. At least on the C side that's a clear indicator of a refcount/locking/data structure model that hasn't been thought through at all.
What's going on here, what am I missing? -Daniel
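To make the xarray suggestion above concrete: the point is that uapi IDs should come from a per-client allocator that supports lookup and slot reuse, rather than a global monotonic counter. Below is a minimal userspace Rust sketch of that pattern; `ClientIds` and its methods are illustrative stand-ins for the kernel's C-side `xa_alloc()`/`xa_load()`/`xa_erase()`, not real kernel API:

```rust
use std::collections::BTreeMap;

/// Userspace sketch of an xarray-style per-client ID allocator. IDs are
/// handed out from a map, so they can be looked up later and reused after
/// removal, unlike a bare monotonically increasing counter.
struct ClientIds<T> {
    map: BTreeMap<u64, T>,
}

impl<T> ClientIds<T> {
    fn new() -> Self {
        ClientIds { map: BTreeMap::new() }
    }

    /// Insert `value` at the lowest free ID and return that ID,
    /// mirroring what xa_alloc() does on the C side.
    fn alloc(&mut self, value: T) -> u64 {
        let mut id = 0;
        // Keys iterate in sorted order, so the first gap is the lowest free ID.
        for &k in self.map.keys() {
            if k != id {
                break;
            }
            id += 1;
        }
        self.map.insert(id, value);
        id
    }

    /// uapi id -> object lookup, done once at the top of ioctl code.
    fn get(&self, id: u64) -> Option<&T> {
        self.map.get(&id)
    }

    /// Free an ID so it can be handed out again.
    fn remove(&mut self, id: u64) -> Option<T> {
        self.map.remove(&id)
    }
}
```

In the driver this allocator would live in the per-drm_file state (one per client), so IDs are private to each client and never leak across file descriptors; on the C side it would simply be a `struct xarray` embedded in the per-file struct.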
+impl Default for ID {
+    /// IDs default to starting at 2, as 0/1 are considered reserved for the system.
+    fn default() -> Self {
+        Self::new(2)
+    }
+}
+/// A guard representing one active submission on the GPU. When dropped, decrements the active
+/// submission count.
+pub(crate) struct OpGuard(Arc<dyn GpuManagerPriv>);
+impl Drop for OpGuard {
+    fn drop(&mut self) {
+        self.0.end_op();
+    }
+}
+/// Set of global sequence IDs used in the driver.
+#[derive(Default)]
+pub(crate) struct SequenceIDs {
+    /// `File` instance ID.
+    pub(crate) file: ID,
+    /// `Vm` instance ID.
+    pub(crate) vm: ID,
+    /// Submission instance ID.
+    pub(crate) submission: ID,
+    /// `Queue` instance ID.
+    pub(crate) queue: ID,
+}
+/// Top-level GPU manager that owns all the global state relevant to the driver instance.
+#[versions(AGX)]
+pub(crate) struct GpuManager {
+    dev: AsahiDevice,
+    cfg: &'static hw::HwConfig,
+    dyncfg: Box<hw::DynConfig>,
+    pub(crate) initdata: Box<fw::types::GpuObject<fw::initdata::InitData::ver>>,
+    uat: Box<mmu::Uat>,
+    crashed: AtomicBool,
+    alloc: Mutex<KernelAllocators>,
+    io_mappings: Vec<mmu::Mapping>,
+    rtkit: Mutex<Option<Box<rtkit::RtKit<GpuManager::ver>>>>,
+    rx_channels: Mutex<Box<RxChannels::ver>>,
+    tx_channels: Mutex<Box<TxChannels::ver>>,
+    fwctl_channel: Mutex<Box<channel::FwCtlChannel>>,
+    pipes: PipeChannels::ver,
+    event_manager: Arc<event::EventManager>,
+    buffer_mgr: buffer::BufferManager,
+    ids: SequenceIDs,
+}
+/// Trait used to abstract the firmware/GPU-dependent variants of the GpuManager.
+pub(crate) trait GpuManager: Send + Sync {
- /// Cast as an Any type.
- fn as_any(&self) -> &dyn Any;
- /// Cast Arc<Self> as an Any type.
- fn arc_as_any(self: Arc<Self>) -> Arc<dyn Any + Sync + Send>;
- /// Initialize the GPU.
- fn init(&self) -> Result;
- /// Update the GPU globals from global info
- ///
- /// TODO: Unclear what can and cannot be updated like this.
- fn update_globals(&self);
- /// Get a reference to the KernelAllocators.
- fn alloc(&self) -> Guard<'_, Mutex<KernelAllocators>>;
- /// Create a new `Vm` given a unique `File` ID.
+    fn new_vm(&self, file_id: u64) -> Result<mmu::Vm>;
- /// Bind a `Vm` to an available slot and return the `VmBind`.
+    fn bind_vm(&self, vm: &mmu::Vm) -> Result<mmu::VmBind>;
- /// Create a new user command queue.
- fn new_queue(
&self,
vm: mmu::Vm,
ualloc: Arc<Mutex<alloc::DefaultAllocator>>,
ualloc_priv: Arc<Mutex<alloc::DefaultAllocator>>,
priority: u32,
caps: u32,
- ) -> Result<Box<dyn queue::Queue>>;
- /// Return a reference to the global `SequenceIDs` instance.
- fn ids(&self) -> &SequenceIDs;
- /// Kick the firmware (wake it up if asleep).
- ///
- /// This should be useful to reduce latency on work submission, so we can ask the firmware to
- /// wake up while we do some preparatory work for the work submission.
- fn kick_firmware(&self) -> Result;
- /// Invalidate a GPU scheduler context. Must be called before the relevant structures are freed.
- fn invalidate_context(
&self,
context: &fw::types::GpuObject<fw::workqueue::GpuContextData>,
- ) -> Result;
- /// Flush the entire firmware cache.
- ///
- /// TODO: Does this actually work?
- fn flush_fw_cache(&self) -> Result;
- /// Handle a GPU work timeout event.
- fn handle_timeout(&self, counter: u32, event_slot: u32);
- /// Handle a GPU fault event.
- fn handle_fault(&self);
- /// Wait for the GPU to become idle and power off.
- fn wait_for_poweroff(&self, timeout: usize) -> Result;
- /// Send a firmware control command (secure cache flush).
- fn fwctl(&self, msg: fw::channels::FwCtlMsg) -> Result;
- /// Get the static GPU configuration for this SoC.
- fn get_cfg(&self) -> &'static hw::HwConfig;
- /// Get the dynamic GPU configuration for this SoC.
- fn get_dyncfg(&self) -> &hw::DynConfig;
+}
+/// Private generic trait for functions that don't need to escape this module.
+trait GpuManagerPriv {
+    /// Decrement the pending submission counter.
+    fn end_op(&self);
+}
+#[versions(AGX)]
+#[vtable]
+impl rtkit::Operations for GpuManager::ver {
+    type Data = Arc<GpuManager::ver>;
+    type Buffer = gem::ObjectRef;
- fn recv_message(data: <Self::Data as ForeignOwnable>::Borrowed<'_>, ep: u8, msg: u64) {
let dev = &data.dev;
//dev_info!(dev, "RtKit message: {:#x}:{:#x}\n", ep, msg);
if ep != EP_FIRMWARE || msg != MSG_RX_DOORBELL {
dev_err!(dev, "Unknown message: {:#x}:{:#x}\n", ep, msg);
return;
}
let mut ch = data.rx_channels.lock();
ch.fw_log.poll();
ch.ktrace.poll();
ch.stats.poll();
ch.event.poll();
- }
- fn crashed(data: <Self::Data as ForeignOwnable>::Borrowed<'_>) {
let dev = &data.dev;
dev_err!(dev, "GPU firmware crashed, failing all jobs\n");
data.crashed.store(true, Ordering::Relaxed);
data.event_manager.fail_all(workqueue::WorkError::NoDevice);
- }
- fn shmem_alloc(
data: <Self::Data as ForeignOwnable>::Borrowed<'_>,
size: usize,
+    ) -> Result<Self::Buffer> {
let dev = &data.dev;
mod_dev_dbg!(dev, "shmem_alloc() {:#x} bytes\n", size);
let mut obj = gem::new_kernel_object(dev, size)?;
obj.vmap()?;
let iova = obj.map_into(data.uat.kernel_vm())?;
mod_dev_dbg!(dev, "shmem_alloc() -> VA {:#x}\n", iova);
Ok(obj)
- }
+}
+#[versions(AGX)]
+impl GpuManager::ver {
- /// Create a new GpuManager of this version/GPU combination.
- #[inline(never)]
- pub(crate) fn new(
dev: &AsahiDevice,
res: ®s::Resources,
cfg: &'static hw::HwConfig,
+    ) -> Result<Arc<GpuManager::ver>> {
let uat = Self::make_uat(dev, cfg)?;
let dyncfg = Self::make_dyncfg(dev, res, cfg, &uat)?;
let mut alloc = KernelAllocators {
private: alloc::DefaultAllocator::new(
dev,
uat.kernel_vm(),
IOVA_KERN_PRIV_BASE,
IOVA_KERN_PRIV_TOP,
0x80,
mmu::PROT_FW_PRIV_RW,
1024 * 1024,
true,
fmt!("Kernel Private"),
true,
)?,
shared: alloc::DefaultAllocator::new(
dev,
uat.kernel_vm(),
IOVA_KERN_SHARED_BASE,
IOVA_KERN_SHARED_TOP,
0x80,
mmu::PROT_FW_SHARED_RW,
1024 * 1024,
true,
fmt!("Kernel Shared"),
false,
)?,
shared_ro: alloc::DefaultAllocator::new(
dev,
uat.kernel_vm(),
IOVA_KERN_SHARED_RO_BASE,
IOVA_KERN_SHARED_RO_TOP,
0x80,
mmu::PROT_FW_SHARED_RO,
64 * 1024,
true,
fmt!("Kernel RO Shared"),
false,
)?,
gpu: alloc::DefaultAllocator::new(
dev,
uat.kernel_vm(),
IOVA_KERN_GPU_BASE,
IOVA_KERN_GPU_TOP,
0x80,
mmu::PROT_GPU_FW_SHARED_RW,
64 * 1024,
true,
fmt!("Kernel GPU Shared"),
false,
)?,
};
let event_manager = Self::make_event_manager(&mut alloc)?;
let initdata = Self::make_initdata(cfg, &dyncfg, &mut alloc)?;
let mut mgr = Self::make_mgr(dev, cfg, dyncfg, uat, alloc, event_manager, initdata)?;
{
let fwctl = mgr.fwctl_channel.lock();
let p_fwctl = fwctl.to_raw();
core::mem::drop(fwctl);
mgr.initdata.fw_status.with_mut(|raw, _inner| {
raw.fwctl_channel = p_fwctl;
});
}
{
let txc = mgr.tx_channels.lock();
let p_device_control = txc.device_control.to_raw();
core::mem::drop(txc);
let rxc = mgr.rx_channels.lock();
let p_event = rxc.event.to_raw();
let p_fw_log = rxc.fw_log.to_raw();
let p_ktrace = rxc.ktrace.to_raw();
let p_stats = rxc.stats.to_raw();
let p_fwlog_buf = rxc.fw_log.get_buf();
core::mem::drop(rxc);
mgr.initdata.runtime_pointers.with_mut(|raw, _inner| {
raw.device_control = p_device_control;
raw.event = p_event;
raw.fw_log = p_fw_log;
raw.ktrace = p_ktrace;
raw.stats = p_stats;
raw.fwlog_buf = Some(p_fwlog_buf);
});
}
let mut p_pipes: Vec<fw::initdata::raw::PipeChannels::ver> = Vec::new();
for ((v, f), c) in mgr
.pipes
.vtx
.iter()
.zip(&mgr.pipes.frag)
.zip(&mgr.pipes.comp)
{
p_pipes.try_push(fw::initdata::raw::PipeChannels::ver {
vtx: v.lock().to_raw(),
frag: f.lock().to_raw(),
comp: c.lock().to_raw(),
})?;
}
mgr.initdata.runtime_pointers.with_mut(|raw, _inner| {
for (i, p) in p_pipes.into_iter().enumerate() {
raw.pipes[i].vtx = p.vtx;
raw.pipes[i].frag = p.frag;
raw.pipes[i].comp = p.comp;
}
});
for (i, map) in cfg.io_mappings.iter().enumerate() {
if let Some(map) = map.as_ref() {
mgr.iomap(i, map)?;
}
}
let mgr = Arc::from(mgr);
let rtkit = Box::try_new(rtkit::RtKit::<GpuManager::ver>::new(
dev,
None,
0,
mgr.clone(),
)?)?;
*mgr.rtkit.lock() = Some(rtkit);
{
let mut rxc = mgr.rx_channels.lock();
rxc.event.set_manager(mgr.clone());
}
Ok(mgr)
- }
- /// Build the entire GPU InitData structure tree and return it as a boxed GpuObject.
- fn make_initdata(
cfg: &'static hw::HwConfig,
dyncfg: &hw::DynConfig,
alloc: &mut KernelAllocators,
+    ) -> Result<Box<fw::types::GpuObject<fw::initdata::InitData::ver>>> {
let mut builder = initdata::InitDataBuilder::ver::new(alloc, cfg, dyncfg);
builder.build()
- }
- /// Create a fresh boxed Uat instance.
- ///
- /// Force disable inlining to avoid blowing up the stack.
- #[inline(never)]
+    fn make_uat(dev: &AsahiDevice, cfg: &'static hw::HwConfig) -> Result<Box<mmu::Uat>> {
Ok(Box::try_new(mmu::Uat::new(dev, cfg)?)?)
- }
- /// Actually create the final GpuManager instance, as a UniqueArc.
- ///
- /// Force disable inlining to avoid blowing up the stack.
- #[inline(never)]
- fn make_mgr(
dev: &AsahiDevice,
cfg: &'static hw::HwConfig,
dyncfg: Box<hw::DynConfig>,
uat: Box<mmu::Uat>,
mut alloc: KernelAllocators,
event_manager: Arc<event::EventManager>,
initdata: Box<fw::types::GpuObject<fw::initdata::InitData::ver>>,
+    ) -> Result<UniqueArc<GpuManager::ver>> {
let mut pipes = PipeChannels::ver {
vtx: Vec::new(),
frag: Vec::new(),
comp: Vec::new(),
};
for _i in 0..=NUM_PIPES - 1 {
pipes
.vtx
.try_push(Mutex::new(channel::PipeChannel::ver::new(dev, &mut alloc)?))?;
pipes
.frag
.try_push(Mutex::new(channel::PipeChannel::ver::new(dev, &mut alloc)?))?;
pipes
.comp
.try_push(Mutex::new(channel::PipeChannel::ver::new(dev, &mut alloc)?))?;
}
UniqueArc::try_new(GpuManager::ver {
dev: dev.clone(),
cfg,
dyncfg,
initdata,
uat,
io_mappings: Vec::new(),
rtkit: Mutex::new(None),
crashed: AtomicBool::new(false),
rx_channels: Mutex::new(box_in_place!(RxChannels::ver {
event: channel::EventChannel::new(dev, &mut alloc, event_manager.clone())?,
fw_log: channel::FwLogChannel::new(dev, &mut alloc)?,
ktrace: channel::KTraceChannel::new(dev, &mut alloc)?,
stats: channel::StatsChannel::ver::new(dev, &mut alloc)?,
})?),
tx_channels: Mutex::new(Box::try_new(TxChannels::ver {
device_control: channel::DeviceControlChannel::ver::new(dev, &mut alloc)?,
})?),
fwctl_channel: Mutex::new(Box::try_new(channel::FwCtlChannel::new(dev, &mut alloc)?)?),
pipes,
event_manager,
buffer_mgr: buffer::BufferManager::new()?,
alloc: Mutex::new(alloc),
ids: Default::default(),
})
- }
- /// Fetch and validate the GPU dynamic configuration from the device tree and hardware.
- ///
- /// Force disable inlining to avoid blowing up the stack.
- #[inline(never)]
- fn make_dyncfg(
dev: &AsahiDevice,
res: ®s::Resources,
cfg: &'static hw::HwConfig,
uat: &mmu::Uat,
+    ) -> Result<Box<hw::DynConfig>> {
let gpu_id = res.get_gpu_id()?;
dev_info!(dev, "GPU Information:\n");
dev_info!(
dev,
" Type: {:?}{:?}\n",
gpu_id.gpu_gen,
gpu_id.gpu_variant
);
dev_info!(dev, " Max dies: {}\n", gpu_id.max_dies);
dev_info!(dev, " Clusters: {}\n", gpu_id.num_clusters);
dev_info!(
dev,
" Cores: {} ({})\n",
gpu_id.num_cores,
gpu_id.num_cores * gpu_id.num_clusters
);
dev_info!(
dev,
" Frags: {} ({})\n",
gpu_id.num_frags,
gpu_id.num_frags * gpu_id.num_clusters
);
dev_info!(
dev,
" GPs: {} ({})\n",
gpu_id.num_gps,
gpu_id.num_gps * gpu_id.num_clusters
);
dev_info!(dev, " Core masks: {:#x?}\n", gpu_id.core_masks);
dev_info!(dev, " Active cores: {}\n", gpu_id.total_active_cores);
dev_info!(dev, "Getting configuration from device tree...\n");
let pwr_cfg = hw::PwrConfig::load(dev, cfg)?;
dev_info!(dev, "Dynamic configuration fetched\n");
if gpu_id.gpu_gen != cfg.gpu_gen || gpu_id.gpu_variant != cfg.gpu_variant {
dev_err!(
dev,
"GPU type mismatch (expected {:?}{:?}, found {:?}{:?})\n",
cfg.gpu_gen,
cfg.gpu_variant,
gpu_id.gpu_gen,
gpu_id.gpu_variant
);
return Err(EIO);
}
if gpu_id.num_clusters > cfg.max_num_clusters {
dev_err!(
dev,
"Too many clusters ({} > {})\n",
gpu_id.num_clusters,
cfg.max_num_clusters
);
return Err(EIO);
}
if gpu_id.num_cores > cfg.max_num_cores {
dev_err!(
dev,
"Too many cores ({} > {})\n",
gpu_id.num_cores,
cfg.max_num_cores
);
return Err(EIO);
}
if gpu_id.num_frags > cfg.max_num_frags {
dev_err!(
dev,
"Too many frags ({} > {})\n",
gpu_id.num_frags,
cfg.max_num_frags
);
return Err(EIO);
}
if gpu_id.num_gps > cfg.max_num_gps {
dev_err!(
dev,
"Too many GPs ({} > {})\n",
gpu_id.num_gps,
cfg.max_num_gps
);
return Err(EIO);
}
Ok(Box::try_new(hw::DynConfig {
pwr: pwr_cfg,
uat_ttb_base: uat.ttb_base(),
id: gpu_id,
})?)
- }
- /// Create the global GPU event manager, and return an `Arc<>` to it.
+    fn make_event_manager(alloc: &mut KernelAllocators) -> Result<Arc<event::EventManager>> {
Arc::try_new(event::EventManager::new(alloc)?)
- }
- /// Create a new MMIO mapping and add it to the mappings list in initdata at the specified
- /// index.
- fn iomap(&mut self, index: usize, map: &hw::IOMapping) -> Result {
let off = map.base & mmu::UAT_PGMSK;
let base = map.base - off;
let end = (map.base + map.size + mmu::UAT_PGMSK) & !mmu::UAT_PGMSK;
let mapping = self
.uat
.kernel_vm()
.map_io(base, end - base, map.writable)?;
self.initdata.runtime_pointers.hwdata_b.with_mut(|raw, _| {
raw.io_mappings[index] = fw::initdata::raw::IOMapping {
phys_addr: U64(map.base as u64),
virt_addr: U64((mapping.iova() + off) as u64),
size: map.size as u32,
range_size: map.range_size as u32,
readwrite: U64(map.writable as u64),
};
});
self.io_mappings.try_push(mapping)?;
Ok(())
- }
- /// Mark work associated with currently in-progress event slots as failed, after a fault or
- /// timeout.
- fn mark_pending_events(&self, culprit_slot: Option<u32>, error: workqueue::WorkError) {
dev_err!(self.dev, " Pending events:\n");
self.initdata.globals.with(|raw, _inner| {
for i in raw.pending_stamps.iter() {
let info = i.info.load(Ordering::Relaxed);
let wait_value = i.wait_value.load(Ordering::Relaxed);
if info & 1 != 0 {
let slot = info >> 3;
let flags = info & 0x7;
dev_err!(
self.dev,
" [{}] flags={} value={:#x}\n",
slot,
flags,
wait_value
);
let error = if culprit_slot.is_some() && culprit_slot != Some(slot) {
workqueue::WorkError::Killed
} else {
error
};
self.event_manager.mark_error(slot, wait_value, error);
i.info.store(0, Ordering::Relaxed);
i.wait_value.store(0, Ordering::Relaxed);
}
}
});
- }
- /// Fetch the GPU MMU fault information from the hardware registers.
+    fn get_fault_info(&self) -> Option<regs::FaultInfo> {
let data = self.dev.data();
let res = match data.resources() {
Some(res) => res,
None => {
dev_err!(self.dev, " Failed to acquire resources\n");
return None;
}
};
let info = res.get_fault_info();
if info.is_some() {
dev_err!(self.dev, " Fault info: {:#x?}\n", info.as_ref().unwrap());
}
info
- }
- /// Resume the GPU firmware after it halts (due to a timeout, fault, or request).
- fn recover(&self) {
self.initdata.fw_status.with(|raw, _inner| {
let halt_count = raw.flags.halt_count.load(Ordering::Relaxed);
let mut halted = raw.flags.halted.load(Ordering::Relaxed);
dev_err!(self.dev, " Halt count: {}\n", halt_count);
dev_err!(self.dev, " Halted: {}\n", halted);
if halted == 0 {
let timeout = time::ktime_get() + Duration::from_millis(HALT_ENTER_TIMEOUT_MS);
while time::ktime_get() < timeout {
halted = raw.flags.halted.load(Ordering::Relaxed);
if halted != 0 {
break;
}
mem::sync();
}
halted = raw.flags.halted.load(Ordering::Relaxed);
}
if debug_enabled(DebugFlags::NoGpuRecovery) {
dev_crit!(self.dev, " GPU recovery is disabled, wedging forever!\n");
} else if halted != 0 {
dev_err!(self.dev, " Attempting recovery...\n");
raw.flags.halted.store(0, Ordering::SeqCst);
raw.flags.resume.store(1, Ordering::SeqCst);
} else {
dev_err!(self.dev, " Cannot recover.\n");
}
});
- }
- /// Return the packed GPU enabled core masks.
- // Only used for some versions
- #[allow(dead_code)]
- pub(crate) fn core_masks_packed(&self) -> &[u32] {
self.dyncfg.id.core_masks_packed.as_slice()
- }
- /// Kick a submission pipe for a submitted job to tell the firmware to start processing it.
- pub(crate) fn run_job(&self, job: workqueue::JobSubmission::ver<'_>) -> Result {
mod_dev_dbg!(self.dev, "GPU: run_job\n");
let pipe_type = job.pipe_type();
mod_dev_dbg!(self.dev, "GPU: run_job: pipe_type={:?}\n", pipe_type);
let pipes = match pipe_type {
PipeType::Vertex => &self.pipes.vtx,
PipeType::Fragment => &self.pipes.frag,
PipeType::Compute => &self.pipes.comp,
};
let index: usize = job.priority() as usize;
let mut pipe = pipes.get(index).ok_or(EIO)?.lock();
mod_dev_dbg!(self.dev, "GPU: run_job: run()\n");
job.run(&mut pipe);
mod_dev_dbg!(self.dev, "GPU: run_job: ring doorbell\n");
let mut guard = self.rtkit.lock();
let rtk = guard.as_mut().unwrap();
rtk.send_message(
EP_DOORBELL,
MSG_TX_DOORBELL | pipe_type as u64 | ((index as u64) << 2),
)?;
mod_dev_dbg!(self.dev, "GPU: run_job: done\n");
Ok(())
- }
- pub(crate) fn is_crashed(&self) -> bool {
self.crashed.load(Ordering::Relaxed)
- }
+    pub(crate) fn start_op(self: &Arc<GpuManager::ver>) -> Result<OpGuard> {
if self.is_crashed() {
return Err(ENODEV);
}
let val = self
.initdata
.globals
.with(|raw, _inner| raw.pending_submissions.fetch_add(1, Ordering::Acquire));
mod_dev_dbg!(self.dev, "OP start (pending: {})\n", val + 1);
self.kick_firmware()?;
Ok(OpGuard(self.clone()))
- }
+}
+#[versions(AGX)]
+impl GpuManager for GpuManager::ver {
- fn as_any(&self) -> &dyn Any {
self
- }
- fn arc_as_any(self: Arc<Self>) -> Arc<dyn Any + Sync + Send> {
self as Arc<dyn Any + Sync + Send>
- }
- fn init(&self) -> Result {
self.tx_channels.lock().device_control.send(
&fw::channels::DeviceControlMsg::ver::Initialize(Default::default()),
);
let initdata = self.initdata.gpu_va().get();
let mut guard = self.rtkit.lock();
let rtk = guard.as_mut().unwrap();
rtk.boot()?;
rtk.start_endpoint(EP_FIRMWARE)?;
rtk.start_endpoint(EP_DOORBELL)?;
rtk.send_message(EP_FIRMWARE, MSG_INIT | (initdata & INIT_DATA_MASK))?;
rtk.send_message(EP_DOORBELL, MSG_TX_DOORBELL | DOORBELL_DEVCTRL)?;
core::mem::drop(guard);
self.kick_firmware()?;
Ok(())
- }
- fn update_globals(&self) {
let mut timeout: u32 = 2;
if debug_enabled(DebugFlags::WaitForPowerOff) {
timeout = 0;
} else if debug_enabled(DebugFlags::KeepGpuPowered) {
timeout = 5000;
}
self.initdata.globals.with(|raw, _inner| {
raw.idle_off_delay_ms.store(timeout, Ordering::Relaxed);
});
- }
- fn alloc(&self) -> Guard<'_, Mutex<KernelAllocators>> {
let mut guard = self.alloc.lock();
let (garbage_count, garbage_bytes) = guard.private.garbage();
if garbage_bytes > 1024 * 1024 {
mod_dev_dbg!(
self.dev,
"Collecting kalloc garbage ({} objects, {} bytes)\n",
garbage_count,
garbage_bytes
);
if self.flush_fw_cache().is_err() {
dev_err!(self.dev, "Failed to flush FW cache\n");
} else {
guard.private.collect_garbage(garbage_count);
}
}
guard
- }
- fn new_vm(&self, file_id: u64) -> Result<mmu::Vm> {
self.uat.new_vm(self.ids.vm.next(), file_id)
- }
- fn bind_vm(&self, vm: &mmu::Vm) -> Result<mmu::VmBind> {
self.uat.bind(vm)
- }
- fn new_queue(
&self,
vm: mmu::Vm,
ualloc: Arc<Mutex<alloc::DefaultAllocator>>,
ualloc_priv: Arc<Mutex<alloc::DefaultAllocator>>,
priority: u32,
caps: u32,
- ) -> Result<Box<dyn queue::Queue>> {
let mut kalloc = self.alloc();
let id = self.ids.queue.next();
Ok(Box::try_new(queue::Queue::ver::new(
&self.dev,
vm,
&mut kalloc,
ualloc,
ualloc_priv,
self.event_manager.clone(),
&self.buffer_mgr,
id,
priority,
caps,
)?)?)
- }
- fn kick_firmware(&self) -> Result {
if self.is_crashed() {
return Err(ENODEV);
}
let mut guard = self.rtkit.lock();
let rtk = guard.as_mut().unwrap();
rtk.send_message(EP_DOORBELL, MSG_TX_DOORBELL | DOORBELL_KICKFW)?;
Ok(())
- }
- fn invalidate_context(
&self,
context: &fw::types::GpuObject<fw::workqueue::GpuContextData>,
- ) -> Result {
mod_dev_dbg!(
self.dev,
"Invalidating GPU context @ {:?}\n",
context.weak_pointer()
);
if self.is_crashed() {
return Err(ENODEV);
}
let mut guard = self.alloc.lock();
let (garbage_count, _) = guard.private.garbage();
let dc = context.with(
|raw, _inner| fw::channels::DeviceControlMsg::ver::DestroyContext {
unk_4: 0,
ctx_23: raw.unk_23,
__pad0: Default::default(),
unk_c: 0,
unk_10: 0,
ctx_0: raw.unk_0,
ctx_1: raw.unk_1,
ctx_4: raw.unk_4,
__pad1: Default::default(),
unk_18: 0,
gpu_context: Some(context.weak_pointer()),
__pad2: Default::default(),
},
);
mod_dev_dbg!(self.dev, "Context invalidation command: {:?}\n", &dc);
let mut txch = self.tx_channels.lock();
let token = txch.device_control.send(&dc);
{
let mut guard = self.rtkit.lock();
let rtk = guard.as_mut().unwrap();
rtk.send_message(EP_DOORBELL, MSG_TX_DOORBELL | DOORBELL_DEVCTRL)?;
}
txch.device_control.wait_for(token)?;
mod_dev_dbg!(
self.dev,
"GPU context invalidated: {:?}\n",
context.weak_pointer()
);
// The invalidation does a cache flush, so it is okay to collect garbage
guard.private.collect_garbage(garbage_count);
Ok(())
- }
- fn flush_fw_cache(&self) -> Result {
mod_dev_dbg!(self.dev, "Flushing coprocessor data cache\n");
if self.is_crashed() {
return Err(ENODEV);
}
// ctx_0 == 0xff or ctx_1 == 0xff cause no effect on context,
// but this command does a full cache flush too, so abuse it
// for that.
let dc = fw::channels::DeviceControlMsg::ver::DestroyContext {
unk_4: 0,
ctx_23: 0,
__pad0: Default::default(),
unk_c: 0,
unk_10: 0,
ctx_0: 0xff,
ctx_1: 0xff,
ctx_4: 0,
__pad1: Default::default(),
unk_18: 0,
gpu_context: None,
__pad2: Default::default(),
};
let mut txch = self.tx_channels.lock();
let token = txch.device_control.send(&dc);
{
let mut guard = self.rtkit.lock();
let rtk = guard.as_mut().unwrap();
rtk.send_message(EP_DOORBELL, MSG_TX_DOORBELL | DOORBELL_DEVCTRL)?;
}
txch.device_control.wait_for(token)?;
Ok(())
- }
- fn ids(&self) -> &SequenceIDs {
&self.ids
- }
- fn handle_timeout(&self, counter: u32, event_slot: u32) {
dev_err!(self.dev, " (\\________/) \n");
dev_err!(self.dev, " | | \n");
dev_err!(self.dev, "'.| \\ , / |.'\n");
dev_err!(self.dev, "--| / (( \\ |--\n");
dev_err!(self.dev, ".'| _-_- |'.\n");
dev_err!(self.dev, " |________| \n");
dev_err!(self.dev, "** GPU timeout nya~!!!!! **\n");
dev_err!(self.dev, " Event slot: {}\n", event_slot);
dev_err!(self.dev, " Timeout count: {}\n", counter);
// If we have fault info, consider it a fault.
let error = match self.get_fault_info() {
Some(info) => workqueue::WorkError::Fault(info),
None => workqueue::WorkError::Timeout,
};
self.mark_pending_events(Some(event_slot), error);
self.recover();
- }
- fn handle_fault(&self) {
dev_err!(self.dev, " (\\________/) \n");
dev_err!(self.dev, " | | \n");
dev_err!(self.dev, "'.| \\ , / |.'\n");
dev_err!(self.dev, "--| / (( \\ |--\n");
dev_err!(self.dev, ".'| _-_- |'.\n");
dev_err!(self.dev, " |________| \n");
dev_err!(self.dev, "GPU fault nya~!!!!!\n");
let error = match self.get_fault_info() {
Some(info) => workqueue::WorkError::Fault(info),
None => workqueue::WorkError::Unknown,
};
self.mark_pending_events(None, error);
self.recover();
- }
- fn wait_for_poweroff(&self, timeout: usize) -> Result {
self.initdata.runtime_pointers.hwdata_a.with(|raw, _inner| {
for _i in 0..timeout {
if raw.pwr_status.load(Ordering::Relaxed) == 4 {
return Ok(());
}
coarse_sleep(Duration::from_millis(1));
}
Err(ETIMEDOUT)
})
- }
- fn fwctl(&self, msg: fw::channels::FwCtlMsg) -> Result {
if self.is_crashed() {
return Err(ENODEV);
}
let mut fwctl = self.fwctl_channel.lock();
let token = fwctl.send(&msg);
{
let mut guard = self.rtkit.lock();
let rtk = guard.as_mut().unwrap();
rtk.send_message(EP_DOORBELL, MSG_FWCTL)?;
}
fwctl.wait_for(token)?;
Ok(())
- }
- fn get_cfg(&self) -> &'static hw::HwConfig {
self.cfg
- }
- fn get_dyncfg(&self) -> &hw::DynConfig {
&self.dyncfg
- }
+}
+#[versions(AGX)]
+impl GpuManagerPriv for GpuManager::ver {
- fn end_op(&self) {
let val = self
.initdata
.globals
.with(|raw, _inner| raw.pending_submissions.fetch_sub(1, Ordering::Release));
mod_dev_dbg!(self.dev, "OP end (pending: {})\n", val - 1);
- }
+}
diff --git a/drivers/gpu/drm/asahi/hw/mod.rs b/drivers/gpu/drm/asahi/hw/mod.rs
new file mode 100644
index 000000000000..a92bb70aeae8
--- /dev/null
+++ b/drivers/gpu/drm/asahi/hw/mod.rs
@@ -0,0 +1,522 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! Per-SoC hardware configuration structures
+//!
+//! This module contains the definitions used to store per-GPU and per-SoC configuration data.
+use crate::driver::AsahiDevice;
+use crate::fw::types::*;
+use alloc::vec::Vec;
+use kernel::c_str;
+use kernel::device::RawDevice;
+use kernel::prelude::*;
+const MAX_POWERZONES: usize = 5;
+pub(crate) mod t600x;
+pub(crate) mod t8103;
+pub(crate) mod t8112;
+/// GPU generation enumeration. Note: Part of the UABI.
+#[derive(Debug, PartialEq, Copy, Clone)]
+#[repr(u32)]
+pub(crate) enum GpuGen {
- G13 = 13,
- G14 = 14,
+}
+/// GPU variant enumeration. Note: Part of the UABI.
+#[derive(Debug, PartialEq, Copy, Clone)]
+#[repr(u32)]
+pub(crate) enum GpuVariant {
- P = 'P' as u32,
- G = 'G' as u32,
- S = 'S' as u32,
- C = 'C' as u32,
- D = 'D' as u32,
+}
+/// GPU revision enumeration. Note: Part of the UABI.
+#[derive(Debug, PartialEq, Copy, Clone)]
+#[repr(u32)]
+pub(crate) enum GpuRevision {
- A0 = 0x00,
- A1 = 0x01,
- B0 = 0x10,
- B1 = 0x11,
- C0 = 0x20,
- C1 = 0x21,
+}
+/// GPU core type enumeration. Note: Part of the firmware ABI.
+#[derive(Debug, Copy, Clone)]
+#[repr(u32)]
+pub(crate) enum GpuCore {
- // Unknown = 0,
- // G5P = 1,
- // G5G = 2,
- // G9P = 3,
- // G9G = 4,
- // G10P = 5,
- // G11P = 6,
- // G11M = 7,
- // G11G = 8,
- // G12P = 9,
- // G13P = 10,
- G13G = 11,
- G13S = 12,
- G13C = 13,
- // G14P = 14,
- G14G = 15,
+}
+/// GPU revision ID. Note: Part of the firmware ABI.
+#[derive(Debug, PartialEq, Copy, Clone)]
+#[repr(u32)]
+pub(crate) enum GpuRevisionID {
- // Unknown = 0,
- A0 = 1,
- A1 = 2,
- B0 = 3,
- B1 = 4,
- C0 = 5,
- C1 = 6,
+}
+/// GPU driver/hardware features, from the UABI.
+pub(crate) mod feat {
- /// Backwards-compatible features.
- pub(crate) mod compat {}
- /// Backwards-incompatible features.
- pub(crate) mod incompat {
use kernel::bindings;
/// Hardware requires Z/S compression to be mandatorily enabled.
pub(crate) const MANDATORY_ZS_COMPRESSION: u64 =
bindings::drm_asahi_feat_incompat_DRM_ASAHI_FEAT_MANDATORY_ZS_COMPRESSION as u64;
- }
+}
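The incompat bitmask gates UAPI compatibility: userspace must acknowledge every incompatible feature bit the driver reports, or it cannot safely drive the GPU. A minimal standalone sketch of that check, where the bit value and function name are illustrative rather than taken from the UAPI:

```rust
// Illustrative stand-in for a bit from the incompat feature mask; the real
// value comes from the drm_asahi UAPI bindings.
const MANDATORY_ZS_COMPRESSION: u64 = 1 << 0;

// Returns true iff userspace acknowledged every incompatible feature bit.
fn check_feats(supported_incompat: u64, acked_by_userspace: u64) -> bool {
    (supported_incompat & !acked_by_userspace) == 0
}

fn main() {
    assert!(check_feats(MANDATORY_ZS_COMPRESSION, MANDATORY_ZS_COMPRESSION));
    assert!(!check_feats(MANDATORY_ZS_COMPRESSION, 0));
    // Acknowledging bits the driver does not report is harmless.
    assert!(check_feats(0, u64::MAX));
}
```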
+/// A single performance state of the GPU.
+#[derive(Debug)]
+pub(crate) struct PState {
- /// Voltage in millivolts, per GPU cluster.
- pub(crate) volt_mv: Vec<u32>,
- /// Frequency in hertz.
- pub(crate) freq_hz: u32,
- /// Maximum power consumption of the GPU at this pstate, in milliwatts.
- pub(crate) pwr_mw: u32,
+}
+/// A power zone definition (we have no idea what this is but Apple puts them in the DT).
+#[allow(missing_docs)]
+#[derive(Debug, Copy, Clone)]
+pub(crate) struct PowerZone {
- pub(crate) target: u32,
- pub(crate) target_offset: u32,
- pub(crate) filter_tc: u32,
+}
+/// An MMIO mapping used by the firmware.
+#[derive(Debug, Copy, Clone)]
+pub(crate) struct IOMapping {
- /// Base physical address of the mapping.
- pub(crate) base: usize,
- /// Size of the mapping.
- pub(crate) size: usize,
- /// Range size of the mapping (for arrays?)
- pub(crate) range_size: usize,
- /// Whether the mapping should be writable.
- pub(crate) writable: bool,
+}
+impl IOMapping {
- /// Convenience constructor for a new IOMapping.
- pub(crate) const fn new(
base: usize,
size: usize,
range_size: usize,
writable: bool,
- ) -> IOMapping {
IOMapping {
base,
size,
range_size,
writable,
}
- }
+}
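A standalone sketch of how these `IOMapping` entries are built; the struct is reproduced here outside the kernel crate so the example compiles on its own, and the example entry is the t8103 "Fender" window from the `io_mappings` table later in this patch:

```rust
// Local copy of the IOMapping struct and its const constructor, mirroring the
// definitions above so this compiles standalone.
#[derive(Debug, Copy, Clone)]
pub struct IOMapping {
    pub base: usize,
    pub size: usize,
    pub range_size: usize,
    pub writable: bool,
}

impl IOMapping {
    pub const fn new(base: usize, size: usize, range_size: usize, writable: bool) -> IOMapping {
        IOMapping { base, size, range_size, writable }
    }
}

// The t8103 "Fender" window, evaluated at compile time via the const fn.
pub const FENDER: IOMapping = IOMapping::new(0x204d00000, 0x1c000, 0x1c000, true);

fn main() {
    assert!(FENDER.writable);
    // This particular window happens to be 16K-page aligned in base and size.
    assert_eq!(FENDER.base % 0x4000, 0);
    assert_eq!(FENDER.size % 0x4000, 0);
}
```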
+/// Unknown HwConfigA fields that vary from SoC to SoC.
+#[allow(missing_docs)]
+#[derive(Debug, Copy, Clone)]
+pub(crate) struct HwConfigA {
- pub(crate) unk_87c: i32,
- pub(crate) unk_8cc: u32,
- pub(crate) unk_e24: u32,
+}
+/// Unknown HwConfigB fields that vary from SoC to SoC.
+#[allow(missing_docs)]
+#[derive(Debug, Copy, Clone)]
+pub(crate) struct HwConfigB {
- pub(crate) unk_4e0: u64,
- pub(crate) unk_534: u32,
- pub(crate) unk_ab8: u32,
- pub(crate) unk_abc: u32,
- pub(crate) unk_b30: u32,
+}
+/// Render command configs that vary from SoC to SoC.
+#[derive(Debug, Copy, Clone)]
+pub(crate) struct HwRenderConfig {
- /// Vertex/tiling-related configuration register (lsb: disable clustering)
- pub(crate) tiling_control: u32,
+}
+/// Static hardware configuration for a given SoC model.
+#[derive(Debug)]
+pub(crate) struct HwConfig {
- /// Chip ID in hex format (e.g. 0x8103 for t8103).
- pub(crate) chip_id: u32,
- /// GPU generation.
- pub(crate) gpu_gen: GpuGen,
- /// GPU variant type.
- pub(crate) gpu_variant: GpuVariant,
- /// GPU core type ID (as known by the firmware).
- pub(crate) gpu_core: GpuCore,
- /// Compatible feature bitmask for this GPU.
- pub(crate) gpu_feat_compat: u64,
- /// Incompatible feature bitmask for this GPU.
- pub(crate) gpu_feat_incompat: u64,
- /// Base clock used for timekeeping.
- pub(crate) base_clock_hz: u32,
- /// Output address space for the UAT on this SoC.
- pub(crate) uat_oas: usize,
- /// Maximum number of clusters on this SoC.
- pub(crate) max_num_clusters: u32,
- /// Maximum number of cores per cluster for this GPU.
- pub(crate) max_num_cores: u32,
- /// Maximum number of frags per cluster for this GPU.
- pub(crate) max_num_frags: u32,
- /// Maximum number of GPs per cluster for this GPU.
- pub(crate) max_num_gps: u32,
- /// Required size of the first preemption buffer.
- pub(crate) preempt1_size: usize,
- /// Required size of the second preemption buffer.
- pub(crate) preempt2_size: usize,
- /// Required size of the third preemption buffer.
- pub(crate) preempt3_size: usize,
- /// Rendering-relevant configuration.
- pub(crate) render: HwRenderConfig,
- /// Misc HWDataA field values.
- pub(crate) da: HwConfigA,
- /// Misc HWDataB field values.
- pub(crate) db: HwConfigB,
- /// HwDataShared1.table.
- pub(crate) shared1_tab: &'static [i32],
- /// HwDataShared1.unk_a4.
- pub(crate) shared1_a4: u32,
- /// HwDataShared2.table.
- pub(crate) shared2_tab: &'static [i32],
- /// HwDataShared2.unk_508.
- pub(crate) shared2_unk_508: u32,
- /// Constant related to SRAM voltages.
- pub(crate) sram_k: F32,
- /// Unknown per-cluster coefficients 1.
- pub(crate) unk_coef_a: &'static [&'static [F32]],
- /// Unknown per-cluster coefficients 2.
- pub(crate) unk_coef_b: &'static [&'static [F32]],
- /// Unknown table in Global struct.
- pub(crate) global_tab: Option<&'static [u8]>,
- /// Temperature sensor list (8 bits per sensor).
- pub(crate) fast_die0_sensor_mask: u64,
- /// Temperature sensor list (alternate).
- pub(crate) fast_die0_sensor_mask_alt: u64,
- /// Temperature sensor present bitmask.
- pub(crate) fast_die0_sensor_present: u32,
- /// Required MMIO mappings for this GPU/firmware.
- pub(crate) io_mappings: &'static [Option<IOMapping>],
+}
+/// Dynamic (fetched from hardware/DT) configuration.
+#[derive(Debug)]
+pub(crate) struct DynConfig {
- /// Base physical address of the UAT TTB (from DT reserved memory region).
- pub(crate) uat_ttb_base: u64,
- /// GPU ID configuration read from hardware.
- pub(crate) id: GpuIdConfig,
- /// Power calibration configuration for this specific chip/device.
- pub(crate) pwr: PwrConfig,
+}
+/// Specific GPU ID configuration fetched from SGX MMIO registers.
+#[derive(Debug)]
+pub(crate) struct GpuIdConfig {
- /// GPU generation (should match static config).
- pub(crate) gpu_gen: GpuGen,
- /// GPU variant type (should match static config).
- pub(crate) gpu_variant: GpuVariant,
- /// GPU silicon revision.
- pub(crate) gpu_rev: GpuRevision,
- /// GPU silicon revision ID (firmware enum).
- pub(crate) gpu_rev_id: GpuRevisionID,
- /// Maximum number of dies supported.
- pub(crate) max_dies: u32,
- /// Total number of GPU clusters.
- pub(crate) num_clusters: u32,
- /// Maximum number of GPU cores per cluster.
- pub(crate) num_cores: u32,
- /// Number of frags per cluster.
- pub(crate) num_frags: u32,
- /// Number of GPs per cluster.
- pub(crate) num_gps: u32,
- /// Total number of active cores for the whole GPU.
- pub(crate) total_active_cores: u32,
- /// Mask of active cores per cluster.
- pub(crate) core_masks: Vec<u32>,
- /// Packed mask of all active cores.
- pub(crate) core_masks_packed: Vec<u32>,
+}
+/// Configurable GPU power settings from the device tree.
+#[derive(Debug)]
+pub(crate) struct PwrConfig {
- /// GPU performance state list.
- pub(crate) perf_states: Vec<PState>,
- /// GPU power zone list.
- pub(crate) power_zones: Vec<PowerZone>,
- /// Core leakage coefficient per cluster.
- pub(crate) core_leak_coef: Vec<F32>,
- /// SRAM leakage coefficient per cluster.
- pub(crate) sram_leak_coef: Vec<F32>,
- /// Maximum total power of the GPU in milliwatts.
- pub(crate) max_power_mw: u32,
- /// Maximum frequency of the GPU in megahertz.
- pub(crate) max_freq_mhz: u32,
- /// Minimum performance state to start at.
- pub(crate) perf_base_pstate: u32,
- /// Maximum enabled performance state.
- pub(crate) perf_max_pstate: u32,
- /// Minimum voltage for the SRAM power domain in microvolts.
- pub(crate) min_sram_microvolt: u32,
- // Most of these fields are just named after Apple ADT property names and we don't fully
- // understand them. They configure various power-related PID loops and filters.
- /// Average power filter time constant in milliseconds.
- pub(crate) avg_power_filter_tc_ms: u32,
- /// Average power filter PID integral gain?
- pub(crate) avg_power_ki_only: F32,
- /// Average power filter PID proportional gain?
- pub(crate) avg_power_kp: F32,
- pub(crate) avg_power_min_duty_cycle: u32,
- /// Average power target filter time constant in periods.
- pub(crate) avg_power_target_filter_tc: u32,
- /// "Fast die0" (temperature?) PID integral gain.
- pub(crate) fast_die0_integral_gain: F32,
- /// "Fast die0" (temperature?) PID proportional gain.
- pub(crate) fast_die0_proportional_gain: F32,
- pub(crate) fast_die0_prop_tgt_delta: u32,
- pub(crate) fast_die0_release_temp: u32,
- /// Delay from the fender (?) becoming idle to powerdown
- pub(crate) fender_idle_off_delay_ms: u32,
- /// Timeout from firmware early wake to sleep if no work was submitted (?)
- pub(crate) fw_early_wake_timeout_ms: u32,
- /// Delay from the GPU becoming idle to powerdown
- pub(crate) idle_off_delay_ms: u32,
- /// Percent?
- pub(crate) perf_boost_ce_step: u32,
- /// Minimum utilization before performance state is increased in %.
- pub(crate) perf_boost_min_util: u32,
- pub(crate) perf_filter_drop_threshold: u32,
- /// Performance PID filter time constant? (periods?)
- pub(crate) perf_filter_time_constant: u32,
- /// Performance PID filter time constant 2? (periods?)
- pub(crate) perf_filter_time_constant2: u32,
- /// Performance PID integral gain.
- pub(crate) perf_integral_gain: F32,
- /// Performance PID integral gain 2 (?).
- pub(crate) perf_integral_gain2: F32,
- pub(crate) perf_integral_min_clamp: u32,
- /// Performance PID proportional gain.
- pub(crate) perf_proportional_gain: F32,
- /// Performance PID proportional gain 2 (?).
- pub(crate) perf_proportional_gain2: F32,
- pub(crate) perf_reset_iters: u32,
- /// Target GPU utilization for the performance controller in %.
- pub(crate) perf_tgt_utilization: u32,
- /// Power sampling period in milliseconds.
- pub(crate) power_sample_period: u32,
- /// PPM (?) filter time constant in milliseconds.
- pub(crate) ppm_filter_time_constant_ms: u32,
- /// PPM (?) filter PID integral gain.
- pub(crate) ppm_ki: F32,
- /// PPM (?) filter PID proportional gain.
- pub(crate) ppm_kp: F32,
- /// Power consumption filter time constant (periods?)
- pub(crate) pwr_filter_time_constant: u32,
- /// Power consumption filter PID integral gain.
- pub(crate) pwr_integral_gain: F32,
- pub(crate) pwr_integral_min_clamp: u32,
- pub(crate) pwr_min_duty_cycle: u32,
- pub(crate) pwr_proportional_gain: F32,
+}
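The flat `apple,power-zones` DT property parsed in `PwrConfig::load()` below packs three u32s per zone (target, target offset, filter time constant). A standalone sketch of that unpacking, with hypothetical values since the real ones come from the device tree:

```rust
// Minimal local mirror of the PowerZone struct above.
#[derive(Debug, PartialEq)]
struct PowerZone {
    target: u32,
    target_offset: u32,
    filter_tc: u32,
}

// Split the flat triple-per-zone array into PowerZone entries; errors out on
// a length that is not a multiple of three, like the driver does.
fn parse_power_zones(data: &[u32]) -> Result<Vec<PowerZone>, ()> {
    if data.len() % 3 != 0 {
        return Err(());
    }
    Ok(data
        .chunks_exact(3)
        .map(|c| PowerZone { target: c[0], target_offset: c[1], filter_tc: c[2] })
        .collect())
}

fn main() {
    // Hypothetical two-zone property.
    let zones = parse_power_zones(&[30000, 0, 3000, 29000, 100, 6000]).unwrap();
    assert_eq!(zones.len(), 2);
    assert_eq!(zones[1].filter_tc, 6000);
    assert!(parse_power_zones(&[1, 2]).is_err());
}
```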
+impl PwrConfig {
- /// Load the GPU power configuration from the device tree.
- pub(crate) fn load(dev: &AsahiDevice, cfg: &HwConfig) -> Result<PwrConfig> {
let mut perf_states = Vec::new();
let node = dev.of_node().ok_or(EIO)?;
let opps = node
.parse_phandle(c_str!("operating-points-v2"), 0)
.ok_or(EIO)?;
let mut max_power_mw: u32 = 0;
let mut max_freq_mhz: u32 = 0;
macro_rules! prop {
($prop:expr, $default:expr) => {{
node.get_opt_property(c_str!($prop))
.map_err(|e| {
dev_err!(dev, "Error reading property {}: {:?}\n", $prop, e);
e
})?
.unwrap_or($default)
}};
($prop:expr) => {{
node.get_property(c_str!($prop)).map_err(|e| {
dev_err!(dev, "Error reading property {}: {:?}\n", $prop, e);
e
})?
}};
}
for opp in opps.children() {
let freq_hz: u64 = opp.get_property(c_str!("opp-hz"))?;
let mut volt_uv: Vec<u32> = opp.get_property(c_str!("opp-microvolt"))?;
let pwr_uw: u32 = opp.get_property(c_str!("opp-microwatt"))?;
if volt_uv.len() != cfg.max_num_clusters as usize {
dev_err!(
dev,
"Invalid opp-microvolt length (expected {}, got {})\n",
cfg.max_num_clusters,
volt_uv.len()
);
return Err(EINVAL);
}
volt_uv.iter_mut().for_each(|a| *a /= 1000);
let volt_mv = volt_uv;
let pwr_mw = pwr_uw / 1000;
max_power_mw = max_power_mw.max(pwr_mw);
let freq_mhz: u32 = (freq_hz / 1_000_000).try_into()?;
max_freq_mhz = max_freq_mhz.max(freq_mhz);
perf_states.try_push(PState {
freq_hz: freq_hz.try_into()?,
volt_mv,
pwr_mw,
})?;
}
let pz_data = prop!("apple,power-zones", Vec::new());
if pz_data.len() > 3 * MAX_POWERZONES || pz_data.len() % 3 != 0 {
dev_err!(dev, "Invalid apple,power-zones value\n");
return Err(EINVAL);
}
let mut power_zones = Vec::new();
// Each zone is described by three consecutive u32s in the flat property.
for i in (0..pz_data.len()).step_by(3) {
power_zones.try_push(PowerZone {
target: pz_data[i],
target_offset: pz_data[i + 1],
filter_tc: pz_data[i + 2],
})?;
}
let core_leak_coef: Vec<F32> = prop!("apple,core-leak-coef");
let sram_leak_coef: Vec<F32> = prop!("apple,sram-leak-coef");
if core_leak_coef.len() != cfg.max_num_clusters as usize {
dev_err!(dev, "Invalid apple,core-leak-coef\n");
return Err(EINVAL);
}
if sram_leak_coef.len() != cfg.max_num_clusters as usize {
dev_err!(dev, "Invalid apple,sram_leak_coef\n");
return Err(EINVAL);
}
Ok(PwrConfig {
core_leak_coef,
sram_leak_coef,
max_power_mw,
max_freq_mhz,
perf_base_pstate: prop!("apple,perf-base-pstate", 1),
perf_max_pstate: perf_states.len() as u32 - 1,
min_sram_microvolt: prop!("apple,min-sram-microvolt"),
avg_power_filter_tc_ms: prop!("apple,avg-power-filter-tc-ms"),
avg_power_ki_only: prop!("apple,avg-power-ki-only"),
avg_power_kp: prop!("apple,avg-power-kp"),
avg_power_min_duty_cycle: prop!("apple,avg-power-min-duty-cycle"),
avg_power_target_filter_tc: prop!("apple,avg-power-target-filter-tc"),
fast_die0_integral_gain: prop!("apple,fast-die0-integral-gain"),
fast_die0_proportional_gain: prop!("apple,fast-die0-proportional-gain"),
fast_die0_prop_tgt_delta: prop!("apple,fast-die0-prop-tgt-delta", 0),
fast_die0_release_temp: prop!("apple,fast-die0-release-temp", 80),
fender_idle_off_delay_ms: prop!("apple,fender-idle-off-delay-ms", 40),
fw_early_wake_timeout_ms: prop!("apple,fw-early-wake-timeout-ms", 5),
idle_off_delay_ms: prop!("apple,idle-off-delay-ms", 2),
perf_boost_ce_step: prop!("apple,perf-boost-ce-step", 25),
perf_boost_min_util: prop!("apple,perf-boost-min-util", 100),
perf_filter_drop_threshold: prop!("apple,perf-filter-drop-threshold"),
perf_filter_time_constant2: prop!("apple,perf-filter-time-constant2"),
perf_filter_time_constant: prop!("apple,perf-filter-time-constant"),
perf_integral_gain2: prop!("apple,perf-integral-gain2"),
perf_integral_gain: prop!("apple,perf-integral-gain", f32!(7.8956833)),
perf_integral_min_clamp: prop!("apple,perf-integral-min-clamp"),
perf_proportional_gain2: prop!("apple,perf-proportional-gain2"),
perf_proportional_gain: prop!("apple,perf-proportional-gain", f32!(14.707963)),
perf_reset_iters: prop!("apple,perf-reset-iters", 6),
perf_tgt_utilization: prop!("apple,perf-tgt-utilization"),
power_sample_period: prop!("apple,power-sample-period"),
ppm_filter_time_constant_ms: prop!("apple,ppm-filter-time-constant-ms"),
ppm_ki: prop!("apple,ppm-ki"),
ppm_kp: prop!("apple,ppm-kp"),
pwr_filter_time_constant: prop!("apple,pwr-filter-time-constant", 313),
pwr_integral_gain: prop!("apple,pwr-integral-gain", f32!(0.0202129)),
pwr_integral_min_clamp: prop!("apple,pwr-integral-min-clamp", 0),
pwr_min_duty_cycle: prop!("apple,pwr-min-duty-cycle"),
pwr_proportional_gain: prop!("apple,pwr-proportional-gain", f32!(5.2831855)),
perf_states,
power_zones,
})
- }
- pub(crate) fn min_frequency_khz(&self) -> u32 {
self.perf_states[self.perf_base_pstate as usize].freq_hz / 1000
- }
- pub(crate) fn max_frequency_khz(&self) -> u32 {
self.perf_states[self.perf_max_pstate as usize].freq_hz / 1000
- }
+}
diff --git a/drivers/gpu/drm/asahi/hw/t600x.rs b/drivers/gpu/drm/asahi/hw/t600x.rs
new file mode 100644
index 000000000000..8a8267a7e18a
--- /dev/null
+++ b/drivers/gpu/drm/asahi/hw/t600x.rs
@@ -0,0 +1,140 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! Hardware configuration for t600x (M1 Pro/Max/Ultra) platforms.
+use crate::f32;
+use super::*;
+const fn iomaps(mcc_count: usize, has_die1: bool) -> [Option<IOMapping>; 20] {
- [
Some(IOMapping::new(0x404d00000, 0x1c000, 0x1c000, true)), // Fender
Some(IOMapping::new(0x20e100000, 0x4000, 0x4000, false)), // AICTimer
Some(IOMapping::new(0x28e104000, 0x4000, 0x4000, true)), // AICSWInt
Some(IOMapping::new(0x404000000, 0x20000, 0x20000, true)), // RGX
None, // UVD
None, // unused
None, // DisplayUnderrunWA
Some(IOMapping::new(0x28e494000, 0x1000, 0x1000, false)), // AnalogTempSensorControllerRegs
None, // PMPDoorbell
Some(IOMapping::new(0x404d80000, 0x8000, 0x8000, true)), // MetrologySensorRegs
Some(IOMapping::new(0x204d61000, 0x1000, 0x1000, true)), // GMGIFAFRegs
Some(IOMapping::new(
0x200000000,
mcc_count * 0xd8000,
0xd6400,
true,
)), // MCache registers
None, // AICBankedRegisters
None, // PMGRScratch
Some(IOMapping::new(0x2643c4000, 0x1000, 0x1000, true)), // NIA Special agent idle register die 0
if has_die1 {
// NIA Special agent idle register die 1
Some(IOMapping::new(0x22643c4000, 0x1000, 0x1000, true))
} else {
None
},
None, // CRE registers
None, // Streaming codec registers
Some(IOMapping::new(0x28e3d0000, 0x1000, 0x1000, true)), // ?
Some(IOMapping::new(0x28e3c0000, 0x1000, 0x1000, false)), // ?
- ]
+}
+pub(crate) const HWCONFIG_T6002: super::HwConfig = HwConfig {
- chip_id: 0x6002,
- gpu_gen: GpuGen::G13,
- gpu_variant: GpuVariant::D,
- gpu_core: GpuCore::G13C,
- gpu_feat_compat: 0,
- gpu_feat_incompat: feat::incompat::MANDATORY_ZS_COMPRESSION,
- base_clock_hz: 24_000_000,
- uat_oas: 42,
- max_num_clusters: 8,
- max_num_cores: 8,
- max_num_frags: 8,
- max_num_gps: 4,
- preempt1_size: 0x540,
- preempt2_size: 0x280,
- preempt3_size: 0x20,
- render: HwRenderConfig {
tiling_control: 0xa540,
- },
- da: HwConfigA {
unk_87c: 900,
unk_8cc: 11000,
unk_e24: 125,
- },
- db: HwConfigB {
unk_4e0: 4,
unk_534: 1,
unk_ab8: 0x2084,
unk_abc: 0x80,
unk_b30: 0,
- },
- shared1_tab: &[
0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff,
0xffff, 0xffff, 0xffff, 0xffff, 0xffff,
- ],
- shared1_a4: 0xffff,
- shared2_tab: &[-1, -1, -1, -1, 0x2aa, 0xaaa, -1, -1, 0, 0],
- shared2_unk_508: 0xcc00001,
- sram_k: f32!(1.02),
- unk_coef_a: &[
&f32!([9.838]),
&f32!([9.819]),
&f32!([9.826]),
&f32!([9.799]),
&f32!([9.799]),
&f32!([9.826]),
&f32!([9.819]),
&f32!([9.838]),
- ],
- unk_coef_b: &[
&f32!([13.0]),
&f32!([13.0]),
&f32!([13.0]),
&f32!([13.0]),
&f32!([13.0]),
&f32!([13.0]),
&f32!([13.0]),
&f32!([13.0]),
- ],
- global_tab: Some(&[
0, 1, 2, 1, 1, 90, 75, 1, 1, 1, 2, 90, 75, 1, 1, 1, 1, 90, 75, 1, 1,
- ]),
- fast_die0_sensor_mask: 0x8080808080808080,
- fast_die0_sensor_mask_alt: 0x9090909090909090,
- fast_die0_sensor_present: 0xff,
- io_mappings: &iomaps(16, true),
+};
+pub(crate) const HWCONFIG_T6001: super::HwConfig = HwConfig {
- chip_id: 0x6001,
- gpu_variant: GpuVariant::C,
- gpu_core: GpuCore::G13C,
- max_num_clusters: 4,
- fast_die0_sensor_mask: 0x80808080,
- fast_die0_sensor_mask_alt: 0x90909090,
- fast_die0_sensor_present: 0x0f,
- io_mappings: &iomaps(8, false),
- ..HWCONFIG_T6002
+};
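`HWCONFIG_T6001` and `HWCONFIG_T6000` derive from `HWCONFIG_T6002` via Rust's const struct-update syntax (`..BASE`), so each variant only spells out the fields that differ. A minimal standalone sketch of the pattern, with illustrative field values:

```rust
// A cut-down stand-in for HwConfig, just to show the derivation pattern.
#[derive(Debug, Copy, Clone, PartialEq)]
struct Cfg {
    chip_id: u32,
    max_num_clusters: u32,
    base_clock_hz: u32,
}

const BASE: Cfg = Cfg { chip_id: 0x6002, max_num_clusters: 8, base_clock_hz: 24_000_000 };

// Only the differing fields are written; the rest are copied from BASE at
// compile time.
const DERIVED: Cfg = Cfg { chip_id: 0x6001, max_num_clusters: 4, ..BASE };

fn main() {
    assert_eq!(DERIVED.base_clock_hz, BASE.base_clock_hz); // inherited
    assert_eq!(DERIVED.max_num_clusters, 4); // overridden
    assert_eq!(DERIVED.chip_id, 0x6001);
}
```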
+pub(crate) const HWCONFIG_T6000: super::HwConfig = HwConfig {
- chip_id: 0x6000,
- gpu_variant: GpuVariant::S,
- gpu_core: GpuCore::G13S,
- max_num_clusters: 2,
- fast_die0_sensor_mask: 0x8080,
- fast_die0_sensor_mask_alt: 0x9090,
- fast_die0_sensor_present: 0x03,
- io_mappings: &iomaps(4, false),
- ..HWCONFIG_T6001
+};
diff --git a/drivers/gpu/drm/asahi/hw/t8103.rs b/drivers/gpu/drm/asahi/hw/t8103.rs
new file mode 100644
index 000000000000..3d38b088a0f5
--- /dev/null
+++ b/drivers/gpu/drm/asahi/hw/t8103.rs
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! Hardware configuration for t8103 platforms (M1).
+use crate::f32;
+use super::*;
+pub(crate) const HWCONFIG: super::HwConfig = HwConfig {
- chip_id: 0x8103,
- gpu_gen: GpuGen::G13,
- gpu_variant: GpuVariant::G,
- gpu_core: GpuCore::G13G,
- gpu_feat_compat: 0,
- gpu_feat_incompat: 0,
- base_clock_hz: 24_000_000,
- uat_oas: 40,
- max_num_clusters: 1,
- max_num_cores: 8,
- max_num_frags: 8,
- max_num_gps: 4,
- preempt1_size: 0x540,
- preempt2_size: 0x280,
- preempt3_size: 0x20,
- render: HwRenderConfig {
// bit 0: disable clustering (always)
tiling_control: 0xa041,
- },
- da: HwConfigA {
unk_87c: -220,
unk_8cc: 9880,
unk_e24: 112,
- },
- db: HwConfigB {
unk_4e0: 0,
unk_534: 0,
unk_ab8: 0x48,
unk_abc: 0x8,
unk_b30: 0,
- },
- shared1_tab: &[
-1, 0x7282, 0x50ea, 0x370a, 0x25be, 0x1c1f, 0x16fb, -1, -1, -1, -1, -1, -1, -1, -1, -1,
- ],
- shared1_a4: 0xffff,
- shared2_tab: &[0x800, 0x1555, -1, -1, -1, -1, -1, -1, 0, 0],
- shared2_unk_508: 0xc00007,
- sram_k: f32!(1.02),
- unk_coef_a: &[],
- unk_coef_b: &[],
- global_tab: None,
- fast_die0_sensor_mask: 0x12,
- fast_die0_sensor_mask_alt: 0x12,
- fast_die0_sensor_present: 0x01,
- io_mappings: &[
Some(IOMapping::new(0x204d00000, 0x1c000, 0x1c000, true)), // Fender
Some(IOMapping::new(0x20e100000, 0x4000, 0x4000, false)), // AICTimer
Some(IOMapping::new(0x23b104000, 0x4000, 0x4000, true)), // AICSWInt
Some(IOMapping::new(0x204000000, 0x20000, 0x20000, true)), // RGX
None, // UVD
None, // unused
None, // DisplayUnderrunWA
Some(IOMapping::new(0x23b2e8000, 0x1000, 0x1000, false)), // AnalogTempSensorControllerRegs
Some(IOMapping::new(0x23bc00000, 0x1000, 0x1000, true)), // PMPDoorbell
Some(IOMapping::new(0x204d80000, 0x5000, 0x5000, true)), // MetrologySensorRegs
Some(IOMapping::new(0x204d61000, 0x1000, 0x1000, true)), // GMGIFAFRegs
Some(IOMapping::new(0x200000000, 0xd6400, 0xd6400, true)), // MCache registers
None, // AICBankedRegisters
Some(IOMapping::new(0x23b738000, 0x1000, 0x1000, true)), // PMGRScratch
None, // NIA Special agent idle register die 0
None, // NIA Special agent idle register die 1
None, // CRE registers
None, // Streaming codec registers
None, //
None, //
- ],
+};
diff --git a/drivers/gpu/drm/asahi/hw/t8112.rs b/drivers/gpu/drm/asahi/hw/t8112.rs
new file mode 100644
index 000000000000..5624dca130be
--- /dev/null
+++ b/drivers/gpu/drm/asahi/hw/t8112.rs
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! Hardware configuration for t8112 platforms (M2).
+use crate::f32;
+use super::*;
+pub(crate) const HWCONFIG: super::HwConfig = HwConfig {
- chip_id: 0x8112,
- gpu_gen: GpuGen::G14,
- gpu_variant: GpuVariant::G,
- gpu_core: GpuCore::G14G,
- gpu_feat_compat: 0,
- gpu_feat_incompat: 0,
- base_clock_hz: 24_000_000,
- uat_oas: 40,
- max_num_clusters: 1,
- max_num_cores: 10,
- max_num_frags: 10,
- max_num_gps: 4,
- preempt1_size: 0x540,
- preempt2_size: 0x280,
- preempt3_size: 0x20,
- render: HwRenderConfig {
// TODO: this is unused here, may be present in newer FW
tiling_control: 0xa041,
- },
- da: HwConfigA {
unk_87c: 900,
unk_8cc: 11000,
unk_e24: 125,
- },
- db: HwConfigB {
unk_4e0: 4,
unk_534: 0,
unk_ab8: 0x2048,
unk_abc: 0x4000,
unk_b30: 1,
- },
- shared1_tab: &[
0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff,
0xffff, 0xffff, 0xffff, 0xffff, 0xffff,
- ],
- shared1_a4: 0,
- shared2_tab: &[-1, -1, -1, -1, -1, -1, -1, -1, 0xaa5aa, 0],
- shared2_unk_508: 0xc00000,
- sram_k: f32!(1.02),
- // 13.2: last coef changed from 6.6 to 5.3, assuming that was a fix we can backport
- unk_coef_a: &[&f32!([0.0, 0.0, 0.0, 0.0, 5.3, 0.0, 5.3, /*6.6*/ 5.3])],
- unk_coef_b: &[&f32!([0.0, 0.0, 0.0, 0.0, 5.3, 0.0, 5.3, /*6.6*/ 5.3])],
- global_tab: None,
- fast_die0_sensor_mask: 0x6800,
- fast_die0_sensor_mask_alt: 0x6800,
- fast_die0_sensor_present: 0x02,
- io_mappings: &[
Some(IOMapping::new(0x204d00000, 0x14000, 0x14000, true)), // Fender
Some(IOMapping::new(0x20e100000, 0x4000, 0x4000, false)), // AICTimer
Some(IOMapping::new(0x23b0c4000, 0x4000, 0x4000, true)), // AICSWInt
Some(IOMapping::new(0x204000000, 0x20000, 0x20000, true)), // RGX
None, // UVD
None, // unused
None, // DisplayUnderrunWA
Some(IOMapping::new(0x23b2c0000, 0x1000, 0x1000, false)), // AnalogTempSensorControllerRegs
None, // PMPDoorbell
Some(IOMapping::new(0x204d80000, 0x8000, 0x8000, true)), // MetrologySensorRegs
Some(IOMapping::new(0x204d61000, 0x1000, 0x1000, true)), // GMGIFAFRegs
Some(IOMapping::new(0x200000000, 0xd6400, 0xd6400, true)), // MCache registers
None, // AICBankedRegisters
None, // PMGRScratch
None, // NIA Special agent idle register die 0
None, // NIA Special agent idle register die 1
Some(IOMapping::new(0x204e00000, 0x10000, 0x10000, true)), // CRE registers
Some(IOMapping::new(0x27d050000, 0x4000, 0x4000, true)), // Streaming codec registers
Some(IOMapping::new(0x23b3d0000, 0x1000, 0x1000, true)), //
Some(IOMapping::new(0x23b3c0000, 0x1000, 0x1000, true)), //
- ],
+};
diff --git a/drivers/gpu/drm/asahi/initdata.rs b/drivers/gpu/drm/asahi/initdata.rs
new file mode 100644
index 000000000000..472c42169130
--- /dev/null
+++ b/drivers/gpu/drm/asahi/initdata.rs
@@ -0,0 +1,777 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+#![allow(clippy::unusual_byte_groupings)]
+//! GPU initialization data builder.
+//!
+//! The root of all interaction between the GPU firmware and the host driver is a complex set of
+//! nested structures that we call InitData. This includes both GPU hardware/firmware configuration
+//! and the pointers to the ring buffers and global data fields that are used for communication at
+//! runtime.
+//!
+//! Many of these structures are poorly understood, so there are lots of hardcoded unknown values
+//! derived from observing the InitData structures that macOS generates.
+use crate::fw::initdata::*; +use crate::fw::types::*; +use crate::{box_in_place, f32, place}; +use crate::{gpu, hw, mmu}; +use kernel::error::Result; +use kernel::macros::versions;
+/// Builder helper for the global GPU InitData. +#[versions(AGX)] +pub(crate) struct InitDataBuilder<'a> {
- alloc: &'a mut gpu::KernelAllocators,
- cfg: &'a hw::HwConfig,
- dyncfg: &'a hw::DynConfig,
+}
+#[versions(AGX)]
+impl<'a> InitDataBuilder::ver<'a> {
- /// Create a new InitData builder
- pub(crate) fn new(
alloc: &'a mut gpu::KernelAllocators,
cfg: &'a hw::HwConfig,
dyncfg: &'a hw::DynConfig,
- ) -> InitDataBuilder::ver<'a> {
InitDataBuilder::ver { alloc, cfg, dyncfg }
- }
- /// Create the HwDataShared1 structure, which is used in two places in InitData.
- #[inline(never)]
- fn hw_shared1(cfg: &hw::HwConfig) -> raw::HwDataShared1 {
let mut ret = raw::HwDataShared1 {
unk_a4: cfg.shared1_a4,
..Default::default()
};
for (i, val) in cfg.shared1_tab.iter().enumerate() {
ret.table[i] = *val;
}
ret
- }
- fn init_curve(
curve: &mut raw::HwDataShared2Curve,
unk_0: u32,
unk_4: u32,
t1: &[i16],
t2: &[i16],
t3: &[&[i32]],
- ) {
curve.unk_0 = unk_0;
curve.unk_4 = unk_4;
(*curve.t1)[..t1.len()].copy_from_slice(t1);
(*curve.t1)[t1.len()..].fill(t1[0]);
(*curve.t2)[..t2.len()].copy_from_slice(t2);
(*curve.t2)[t2.len()..].fill(t2[0]);
for (i, a) in curve.t3.iter_mut().enumerate() {
a.fill(0x3ffffff);
if i < t3.len() {
let b = t3[i];
(**a)[..b.len()].copy_from_slice(b);
}
}
- }
- /// Create the HwDataShared2 structure, which is used in two places in InitData.
- #[inline(never)]
- fn hw_shared2(cfg: &hw::HwConfig) -> Result<Box<raw::HwDataShared2>> {
let mut ret = box_in_place!(raw::HwDataShared2 {
unk_28: Array::new([0xff; 16]),
t8112: Default::default(),
unk_508: cfg.shared2_unk_508,
..Default::default()
})?;
for (i, val) in cfg.shared2_tab.iter().enumerate() {
ret.table[i] = *val;
}
if cfg.chip_id == 0x8112 {
ret.t8112.unk_14 = 0x6000000;
Self::init_curve(&mut ret.t8112.curve1, 0, 0x20000000, &[-1], &[0x0f07], &[]);
Self::init_curve(
&mut ret.t8112.curve2,
7,
0x80000000,
&[-1, 25740, 17429, 12550, 9597, 7910, 6657, 5881, 5421],
&[
0x0f07, 0x04c0, 0x06c0, 0x08c0, 0x0ac0, 0x0c40, 0x0dc0, 0x0ec0, 0x0f80,
],
&[
&[0x3ffffff, 107, 101, 94, 87, 82, 77, 73, 71],
&[
0x3ffffff, 38240, 36251, 33562, 31368, 29379, 27693, 26211, 25370,
],
&[
0x3ffffff, 123933, 117485, 108771, 101661, 95217, 89751, 84948, 82222,
],
],
);
}
Ok(ret)
- }
- /// Create the HwDataShared3 structure, which is used in two places in InitData.
- #[inline(never)]
- fn hw_shared3(cfg: &hw::HwConfig) -> Result<Box<raw::HwDataShared3>> {
let mut ret = box_in_place!(raw::HwDataShared3 {
..Default::default()
})?;
if cfg.chip_id == 0x8112 {
ret.unk_0 = 1;
ret.unk_4 = 500;
ret.unk_8 = 5;
ret.table.copy_from_slice(&[
10700, 10700, 10700, 10700, 10700, 6000, 1000, 1000, 1000, 10700, 10700, 10700,
10700, 10700, 10700, 10700,
]);
ret.unk_4c = 1;
}
Ok(ret)
- }
- /// Create an unknown T81xx-specific data structure.
- fn t81xx_data(dyncfg: &'a hw::DynConfig) -> raw::T81xxData {
raw::T81xxData {
unk_d8c: 0x80000000,
unk_d90: 4,
unk_d9c: f32!(0.6),
unk_da4: f32!(0.4),
unk_dac: f32!(0.38552),
unk_db8: f32!(65536.0),
unk_dbc: f32!(13.56),
max_pstate_scaled: 100 * dyncfg.pwr.perf_max_pstate,
..Default::default()
}
- }
- /// Create the HwDataA structure. This mostly contains power-related configuration.
- #[inline(never)]
- fn hwdata_a(&mut self) -> Result<GpuObject<HwDataA::ver>> {
self.alloc
.private
.new_inplace(Default::default(), |_inner, ptr| {
let pwr = &self.dyncfg.pwr;
let period_ms = pwr.power_sample_period;
let period_s = F32::from(period_ms) / f32!(1000.0);
let ppm_filter_tc_periods = pwr.ppm_filter_time_constant_ms / period_ms;
#[ver(V >= V13_0B4)]
let ppm_filter_tc_ms_rounded = ppm_filter_tc_periods * period_ms;
let ppm_filter_a = f32!(1.0) / ppm_filter_tc_periods.into();
let perf_filter_a = f32!(1.0) / pwr.perf_filter_time_constant.into();
let perf_filter_a2 = f32!(1.0) / pwr.perf_filter_time_constant2.into();
let avg_power_target_filter_a = f32!(1.0) / pwr.avg_power_target_filter_tc.into();
let avg_power_filter_tc_periods = pwr.avg_power_filter_tc_ms / period_ms;
#[ver(V >= V13_0B4)]
let avg_power_filter_tc_ms_rounded = avg_power_filter_tc_periods * period_ms;
let avg_power_filter_a = f32!(1.0) / avg_power_filter_tc_periods.into();
let pwr_filter_a = f32!(1.0) / pwr.pwr_filter_time_constant.into();
let base_ps = pwr.perf_base_pstate;
let base_ps_scaled = 100 * base_ps;
let max_ps = pwr.perf_max_pstate;
let max_ps_scaled = 100 * max_ps;
let boost_ps_count = max_ps - base_ps;
let base_clock_khz = self.cfg.base_clock_hz / 1000;
let clocks_per_period = base_clock_khz * period_ms;
let raw = place!(
ptr,
raw::HwDataA::ver {
clocks_per_period: clocks_per_period,
#[ver(V >= V13_0B4)]
clocks_per_period_2: clocks_per_period,
pwr_status: AtomicU32::new(4),
unk_10: f32!(1.0),
actual_pstate: 1,
tgt_pstate: 1,
base_pstate_scaled: base_ps_scaled,
unk_40: 1,
max_pstate_scaled: max_ps_scaled,
min_pstate_scaled: 100,
unk_64c: 625,
pwr_filter_a_neg: f32!(1.0) - pwr_filter_a,
pwr_filter_a: pwr_filter_a,
pwr_integral_gain: pwr.pwr_integral_gain,
pwr_integral_min_clamp: pwr.pwr_integral_min_clamp.into(),
max_power_1: pwr.max_power_mw.into(),
pwr_proportional_gain: pwr.pwr_proportional_gain,
pwr_pstate_related_k: -F32::from(max_ps_scaled) / pwr.max_power_mw.into(),
pwr_pstate_max_dc_offset: pwr.pwr_min_duty_cycle as i32
- max_ps_scaled as i32,
max_pstate_scaled_2: max_ps_scaled,
max_power_2: pwr.max_power_mw,
max_pstate_scaled_3: max_ps_scaled,
ppm_filter_tc_periods_x4: ppm_filter_tc_periods * 4,
ppm_filter_a_neg: f32!(1.0) - ppm_filter_a,
ppm_filter_a: ppm_filter_a,
ppm_ki_dt: pwr.ppm_ki * period_s,
unk_6fc: f32!(65536.0),
ppm_kp: pwr.ppm_kp,
pwr_min_duty_cycle: pwr.pwr_min_duty_cycle,
max_pstate_scaled_4: max_ps_scaled,
unk_71c: f32!(0.0),
max_power_3: pwr.max_power_mw,
cur_power_mw_2: 0x0,
ppm_filter_tc_ms: pwr.ppm_filter_time_constant_ms,
#[ver(V >= V13_0B4)]
ppm_filter_tc_clks: ppm_filter_tc_ms_rounded * base_clock_khz,
perf_tgt_utilization: pwr.perf_tgt_utilization,
perf_boost_min_util: pwr.perf_boost_min_util,
perf_boost_ce_step: pwr.perf_boost_ce_step,
perf_reset_iters: pwr.perf_reset_iters,
unk_774: 6,
unk_778: 1,
perf_filter_drop_threshold: pwr.perf_filter_drop_threshold,
perf_filter_a_neg: f32!(1.0) - perf_filter_a,
perf_filter_a2_neg: f32!(1.0) - perf_filter_a2,
perf_filter_a: perf_filter_a,
perf_filter_a2: perf_filter_a2,
perf_ki: pwr.perf_integral_gain,
perf_ki2: pwr.perf_integral_gain2,
perf_integral_min_clamp: pwr.perf_integral_min_clamp.into(),
unk_79c: f32!(95.0),
perf_kp: pwr.perf_proportional_gain,
perf_kp2: pwr.perf_proportional_gain2,
boost_state_unk_k: F32::from(boost_ps_count) / f32!(0.95),
base_pstate_scaled_2: base_ps_scaled,
max_pstate_scaled_5: max_ps_scaled,
base_pstate_scaled_3: base_ps_scaled,
perf_tgt_utilization_2: pwr.perf_tgt_utilization,
base_pstate_scaled_4: base_ps_scaled,
unk_7fc: f32!(65536.0),
pwr_min_duty_cycle_2: pwr.pwr_min_duty_cycle.into(),
max_pstate_scaled_6: max_ps_scaled.into(),
max_freq_mhz: pwr.max_freq_mhz,
pwr_min_duty_cycle_3: pwr.pwr_min_duty_cycle,
min_pstate_scaled_4: f32!(100.0),
max_pstate_scaled_7: max_ps_scaled,
unk_alpha_neg: f32!(0.8),
unk_alpha: f32!(0.2),
fast_die0_sensor_mask: U64(self.cfg.fast_die0_sensor_mask),
fast_die0_release_temp_cc: 100 * pwr.fast_die0_release_temp,
unk_87c: self.cfg.da.unk_87c,
unk_880: 0x4,
unk_894: f32!(1.0),
fast_die0_ki_dt: pwr.fast_die0_integral_gain * period_s,
unk_8a8: f32!(65536.0),
fast_die0_kp: pwr.fast_die0_proportional_gain,
pwr_min_duty_cycle_4: pwr.pwr_min_duty_cycle,
max_pstate_scaled_8: max_ps_scaled,
max_pstate_scaled_9: max_ps_scaled,
fast_die0_prop_tgt_delta: 100 * pwr.fast_die0_prop_tgt_delta,
unk_8cc: self.cfg.da.unk_8cc,
max_pstate_scaled_10: max_ps_scaled,
max_pstate_scaled_11: max_ps_scaled,
unk_c2c: 1,
power_zone_count: pwr.power_zones.len() as u32,
max_power_4: pwr.max_power_mw,
max_power_5: pwr.max_power_mw,
max_power_6: pwr.max_power_mw,
avg_power_target_filter_a_neg: f32!(1.0) - avg_power_target_filter_a,
avg_power_target_filter_a: avg_power_target_filter_a,
avg_power_target_filter_tc_x4: 4 * pwr.avg_power_target_filter_tc,
avg_power_target_filter_tc_xperiod: period_ms
* pwr.avg_power_target_filter_tc,
#[ver(V >= V13_0B4)]
avg_power_target_filter_tc_clks: period_ms
* pwr.avg_power_target_filter_tc
* base_clock_khz,
avg_power_filter_tc_periods_x4: 4 * avg_power_filter_tc_periods,
avg_power_filter_a_neg: f32!(1.0) - avg_power_filter_a,
avg_power_filter_a: avg_power_filter_a,
avg_power_ki_dt: pwr.avg_power_ki_only * period_s,
unk_d20: f32!(65536.0),
avg_power_kp: pwr.avg_power_kp,
avg_power_min_duty_cycle: pwr.avg_power_min_duty_cycle,
max_pstate_scaled_12: max_ps_scaled,
max_pstate_scaled_13: max_ps_scaled,
max_power_7: pwr.max_power_mw.into(),
max_power_8: pwr.max_power_mw,
avg_power_filter_tc_ms: pwr.avg_power_filter_tc_ms,
#[ver(V >= V13_0B4)]
avg_power_filter_tc_clks: avg_power_filter_tc_ms_rounded * base_clock_khz,
max_pstate_scaled_14: max_ps_scaled,
t81xx_data: match self.cfg.chip_id {
0x8103 | 0x8112 => Self::t81xx_data(self.dyncfg),
_ => Default::default(),
},
#[ver(V >= V13_0B4)]
unk_e10_0: raw::HwDataA130Extra {
unk_38: 4,
unk_3c: 8000,
unk_40: 2500,
unk_48: 0xffffffff,
unk_4c: 50,
unk_54: 50,
unk_58: 0x1,
unk_60: f32!(0.8888889),
unk_64: f32!(0.6666667),
unk_68: f32!(0.11111111),
unk_6c: f32!(0.33333333),
unk_70: f32!(-0.4),
unk_74: f32!(-0.8),
unk_7c: f32!(65536.0),
unk_80: f32!(-5.0),
unk_84: f32!(-10.0),
unk_8c: 40,
max_pstate_scaled_1: max_ps_scaled,
unk_9c: f32!(8000.0),
unk_a0: 1400,
unk_a8: 72,
unk_ac: 24,
unk_b0: 1728000,
unk_b8: 576000,
unk_c4: f32!(65536.0),
unk_114: f32!(65536.0),
unk_124: 40,
max_pstate_scaled_2: max_ps_scaled,
..Default::default()
},
fast_die0_sensor_mask_2: U64(self.cfg.fast_die0_sensor_mask),
unk_e24: self.cfg.da.unk_e24,
unk_e28: 1,
fast_die0_sensor_mask_alt: U64(self.cfg.fast_die0_sensor_mask_alt),
#[ver(V < V13_0B4)]
fast_die0_sensor_present: U64(self.cfg.fast_die0_sensor_present as u64),
unk_163c: 1,
unk_3644: 0,
hws1: Self::hw_shared1(self.cfg),
hws2: *Self::hw_shared2(self.cfg)?,
hws3: *Self::hw_shared3(self.cfg)?,
unk_3ce8: 1,
..Default::default()
}
);
for i in 0..self.dyncfg.pwr.perf_states.len() {
raw.sram_k[i] = self.cfg.sram_k;
}
for (i, coef) in pwr.core_leak_coef.iter().enumerate() {
raw.core_leak_coef[i] = *coef;
}
for (i, coef) in pwr.sram_leak_coef.iter().enumerate() {
raw.sram_leak_coef[i] = *coef;
}
for i in 0..self.dyncfg.id.num_clusters as usize {
if let Some(coef_a) = self.cfg.unk_coef_a.get(i) {
(*raw.unk_coef_a1[i])[..coef_a.len()].copy_from_slice(coef_a);
(*raw.unk_coef_a2[i])[..coef_a.len()].copy_from_slice(coef_a);
}
if let Some(coef_b) = self.cfg.unk_coef_b.get(i) {
(*raw.unk_coef_b1[i])[..coef_b.len()].copy_from_slice(coef_b);
(*raw.unk_coef_b2[i])[..coef_b.len()].copy_from_slice(coef_b);
}
}
for (i, pz) in pwr.power_zones.iter().enumerate() {
raw.power_zones[i].target = pz.target;
raw.power_zones[i].target_off = pz.target - pz.target_offset;
raw.power_zones[i].filter_tc_x4 = 4 * pz.filter_tc;
raw.power_zones[i].filter_tc_xperiod = period_ms * pz.filter_tc;
let filter_a = f32!(1.0) / pz.filter_tc.into();
raw.power_zones[i].filter_a = filter_a;
raw.power_zones[i].filter_a_neg = f32!(1.0) - filter_a;
#[ver(V >= V13_0B4)]
raw.power_zones[i].unk_10 = 1320000000;
}
Ok(raw)
})
- }
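A side note on the `filter_a` / `filter_a_neg = 1 - filter_a` coefficient pairs computed above: they look like the coefficients of a first-order IIR low-pass with the time constant expressed in sample periods. A minimal sketch under that assumption (hypothetical helper, not driver code):

```rust
// One step of the first-order low-pass that the (a, 1 - a) pairs above
// appear to implement: y' = a * x + (1 - a) * y, with a = 1 / tc_periods.
// `filter_step` is a hypothetical name for illustration only.
fn filter_step(y: f32, x: f32, tc_periods: f32) -> f32 {
    let a = 1.0 / tc_periods;
    a * x + (1.0 - a) * y
}
```

Repeated application converges toward the input `x`, which is consistent with the firmware using these as smoothing filters for power/performance telemetry.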
- /// Create the HwDataB structure. This mostly contains GPU-related configuration.
- #[inline(never)]
- fn hwdata_b(&mut self) -> Result<GpuObject<HwDataB::ver>> {
self.alloc
.private
.new_inplace(Default::default(), |_inner, ptr| {
let raw = place!(
ptr,
raw::HwDataB::ver {
// Userspace VA map related
#[ver(V < V13_0B4)]
unk_0: U64(0x13_00000000),
unk_8: U64(0x14_00000000),
#[ver(V < V13_0B4)]
unk_10: U64(0x1_00000000),
unk_18: U64(0xffc00000),
unk_20: U64(0x11_00000000),
unk_28: U64(0x11_00000000),
// userspace address?
unk_30: U64(0x6f_ffff8000),
// unmapped?
unkptr_38: U64(0xffffffa0_11800000),
// TODO: yuv matrices
chip_id: self.cfg.chip_id,
unk_454: 0x1,
unk_458: 0x1,
unk_460: 0x1,
unk_464: 0x1,
unk_468: 0x1,
unk_47c: 0x1,
unk_484: 0x1,
unk_48c: 0x1,
base_clock_khz: self.cfg.base_clock_hz / 1000,
power_sample_period: self.dyncfg.pwr.power_sample_period,
unk_49c: 0x1,
unk_4a0: 0x1,
unk_4a4: 0x1,
unk_4c0: 0x1f,
unk_4e0: U64(self.cfg.db.unk_4e0),
unk_4f0: 0x1,
unk_4f4: 0x1,
unk_504: 0x31,
unk_524: 0x1, // use_secure_cache_flush
unk_534: self.cfg.db.unk_534,
num_frags: self.dyncfg.id.num_frags * self.dyncfg.id.num_clusters,
unk_554: 0x1,
uat_ttb_base: U64(self.dyncfg.uat_ttb_base),
gpu_core_id: self.cfg.gpu_core as u32,
gpu_rev_id: self.dyncfg.id.gpu_rev_id as u32,
num_cores: self.dyncfg.id.num_cores * self.dyncfg.id.num_clusters,
max_pstate: self.dyncfg.pwr.perf_states.len() as u32 - 1,
#[ver(V < V13_0B4)]
num_pstates: self.dyncfg.pwr.perf_states.len() as u32,
#[ver(V < V13_0B4)]
min_sram_volt: self.dyncfg.pwr.min_sram_microvolt / 1000,
#[ver(V < V13_0B4)]
unk_ab8: self.cfg.db.unk_ab8,
#[ver(V < V13_0B4)]
unk_abc: self.cfg.db.unk_abc,
#[ver(V < V13_0B4)]
unk_ac0: 0x1020,
#[ver(V >= V13_0B4)]
unk_ae4: Array::new([0x0, 0x3, 0x7, 0x7]),
#[ver(V < V13_0B4)]
unk_ae4: Array::new([0x0, 0xf, 0x3f, 0x3f]),
unk_b10: 0x1,
unk_b24: 0x1,
unk_b28: 0x1,
unk_b2c: 0x1,
unk_b30: self.cfg.db.unk_b30,
#[ver(V >= V13_0B4)]
unk_b38_0: 1,
#[ver(V >= V13_0B4)]
unk_b38_4: 1,
unk_b38: Array::new([0xffffffff; 12]),
#[ver(V >= V13_0B4)]
unk_c3c: 0x19,
..Default::default()
}
);
let base_ps = self.dyncfg.pwr.perf_base_pstate as usize;
let max_ps = self.dyncfg.pwr.perf_max_pstate as usize;
let base_freq = self.dyncfg.pwr.perf_states[base_ps].freq_hz;
let max_freq = self.dyncfg.pwr.perf_states[max_ps].freq_hz;
for (i, ps) in self.dyncfg.pwr.perf_states.iter().enumerate() {
raw.frequencies[i] = ps.freq_hz / 1000000;
for (j, mv) in ps.volt_mv.iter().enumerate() {
let sram_mv = (*mv).max(self.dyncfg.pwr.min_sram_microvolt / 1000);
raw.voltages[i][j] = *mv;
raw.voltages_sram[i][j] = sram_mv;
}
raw.sram_k[i] = self.cfg.sram_k;
raw.rel_max_powers[i] = ps.pwr_mw * 100 / self.dyncfg.pwr.max_power_mw;
raw.rel_boost_freqs[i] = if i > base_ps {
(ps.freq_hz - base_freq) / ((max_freq - base_freq) / 100)
} else {
0
};
}
Ok(raw)
})
- }
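The `rel_boost_freqs` entries computed above are integer percentages of the base-to-max frequency range. The same expression in standalone form (hypothetical helper name, mirroring the integer math in the loop):

```rust
// Boost p-state frequency as an integer percentage of the base..max
// range, as computed for `rel_boost_freqs` above. Note the divisor is
// computed first ((max - base) / 100), matching the driver's rounding.
fn rel_boost_pct(freq_hz: u32, base_hz: u32, max_hz: u32) -> u32 {
    if freq_hz <= base_hz {
        0
    } else {
        (freq_hz - base_hz) / ((max_hz - base_hz) / 100)
    }
}
```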
- /// Create the Globals structure, which contains global firmware config including more power
- /// configuration data and globals used to exchange state between the firmware and driver.
- #[inline(never)]
- fn globals(&mut self) -> Result<GpuObject<Globals::ver>> {
self.alloc
.shared
.new_inplace(Default::default(), |_inner, ptr| {
let pwr = &self.dyncfg.pwr;
let period_ms = pwr.power_sample_period;
let period_s = F32::from(period_ms) / f32!(1000.0);
let avg_power_filter_tc_periods = pwr.avg_power_filter_tc_ms / period_ms;
let max_ps = pwr.perf_max_pstate;
let max_ps_scaled = 100 * max_ps;
let raw = place!(
ptr,
raw::Globals::ver {
//ktrace_enable: 0xffffffff,
ktrace_enable: 0,
#[ver(V >= V13_2)]
unk_24_0: 3000,
unk_24: 0,
#[ver(V >= V13_0B4)]
unk_28_0: 0, // debug
unk_28: 1,
#[ver(V >= V13_0B4)]
unk_2c_0: 0,
unk_2c: 1,
unk_30: 0,
unk_34: 120,
sub: raw::GlobalsSub::ver {
unk_54: 0xffff,
unk_56: 40,
unk_58: 0xffff,
unk_5e: U32(1),
unk_66: U32(1),
..Default::default()
},
unk_8900: 1,
pending_submissions: AtomicU32::new(0),
max_power: pwr.max_power_mw,
max_pstate_scaled: max_ps_scaled,
max_pstate_scaled_2: max_ps_scaled,
max_pstate_scaled_3: max_ps_scaled,
power_zone_count: pwr.power_zones.len() as u32,
avg_power_filter_tc_periods: avg_power_filter_tc_periods,
avg_power_ki_dt: pwr.avg_power_ki_only * period_s,
avg_power_kp: pwr.avg_power_kp,
avg_power_min_duty_cycle: pwr.avg_power_min_duty_cycle,
avg_power_target_filter_tc: pwr.avg_power_target_filter_tc,
unk_89bc: self.cfg.da.unk_8cc,
fast_die0_release_temp: 100 * pwr.fast_die0_release_temp,
unk_89c4: self.cfg.da.unk_87c,
fast_die0_prop_tgt_delta: 100 * pwr.fast_die0_prop_tgt_delta,
fast_die0_kp: pwr.fast_die0_proportional_gain,
fast_die0_ki_dt: pwr.fast_die0_integral_gain * period_s,
unk_89e0: 1,
max_power_2: pwr.max_power_mw,
ppm_kp: pwr.ppm_kp,
ppm_ki_dt: pwr.ppm_ki * period_s,
#[ver(V >= V13_0B4)]
unk_89f4_8: 1,
unk_89f4: 0,
hws1: Self::hw_shared1(self.cfg),
hws2: *Self::hw_shared2(self.cfg)?,
hws3: *Self::hw_shared3(self.cfg)?,
unk_900c: 1,
#[ver(V >= V13_0B4)]
unk_9010_0: 1,
#[ver(V >= V13_0B4)]
unk_903c: 1,
#[ver(V < V13_0B4)]
unk_903c: 0,
fault_control: *crate::fault_control.read(),
do_init: 1,
unk_11020: 40,
unk_11024: 10,
unk_11028: 250,
#[ver(V >= V13_0B4)]
unk_1102c_0: 1,
#[ver(V >= V13_0B4)]
unk_1102c_4: 1,
#[ver(V >= V13_0B4)]
unk_1102c_8: 100,
#[ver(V >= V13_0B4)]
unk_1102c_c: 1,
idle_off_delay_ms: AtomicU32::new(pwr.idle_off_delay_ms),
fender_idle_off_delay_ms: pwr.fender_idle_off_delay_ms,
fw_early_wake_timeout_ms: pwr.fw_early_wake_timeout_ms,
unk_118e0: 40,
#[ver(V >= V13_0B4)]
unk_118e4_0: 50,
#[ver(V >= V13_0B4)]
unk_11edc: 0,
#[ver(V >= V13_0B4)]
unk_11efc: 0,
..Default::default()
}
);
for (i, pz) in pwr.power_zones.iter().enumerate() {
raw.power_zones[i].target = pz.target;
raw.power_zones[i].target_off = pz.target - pz.target_offset;
raw.power_zones[i].filter_tc = pz.filter_tc;
}
if let Some(tab) = self.cfg.global_tab.as_ref() {
for (i, x) in tab.iter().enumerate() {
raw.unk_118ec[i] = *x;
}
raw.unk_118e8 = 1;
}
Ok(raw)
})
- }
- /// Create the RuntimePointers structure, which contains pointers to most of the other
- /// structures including the ring buffer channels, statistics structures, and HwDataA/HwDataB.
- #[inline(never)]
- fn runtime_pointers(&mut self) -> Result<GpuObject<RuntimePointers::ver>> {
let hwa = self.hwdata_a()?;
let hwb = self.hwdata_b()?;
let pointers: Box<RuntimePointers::ver> = box_in_place!(RuntimePointers::ver {
stats: Stats::ver {
vtx: self.alloc.private.new_default::<GpuGlobalStatsVtx::ver>()?,
frag: self.alloc.private.new_inplace(
Default::default(),
|_inner, ptr: &mut MaybeUninit<raw::GpuGlobalStatsFrag::ver>| {
Ok(place!(
ptr,
raw::GpuGlobalStatsFrag::ver {
stats: raw::GpuStatsFrag::ver {
cur_stamp_id: -1,
unk_118: -1,
..Default::default()
},
..Default::default()
}
))
},
)?,
comp: self.alloc.private.new_default::<GpuStatsComp>()?,
},
hwdata_a: hwa,
unkptr_190: self.alloc.private.array_empty(0x80)?,
unkptr_198: self.alloc.private.array_empty(0xc0)?,
hwdata_b: hwb,
unkptr_1b8: self.alloc.private.array_empty(0x1000)?,
unkptr_1c0: self.alloc.private.array_empty(0x300)?,
unkptr_1c8: self.alloc.private.array_empty(0x1000)?,
buffer_mgr_ctl: self.alloc.gpu.array_empty(127)?,
})?;
self.alloc.private.new_boxed(pointers, |inner, ptr| {
Ok(place!(
ptr,
raw::RuntimePointers::ver {
pipes: Default::default(),
device_control: Default::default(),
event: Default::default(),
fw_log: Default::default(),
ktrace: Default::default(),
stats: Default::default(),
stats_vtx: inner.stats.vtx.gpu_pointer(),
stats_frag: inner.stats.frag.gpu_pointer(),
stats_comp: inner.stats.comp.gpu_pointer(),
hwdata_a: inner.hwdata_a.gpu_pointer(),
unkptr_190: inner.unkptr_190.gpu_pointer(),
unkptr_198: inner.unkptr_198.gpu_pointer(),
hwdata_b: inner.hwdata_b.gpu_pointer(),
hwdata_b_2: inner.hwdata_b.gpu_pointer(),
fwlog_buf: None,
unkptr_1b8: inner.unkptr_1b8.gpu_pointer(),
unkptr_1c0: inner.unkptr_1c0.gpu_pointer(),
unkptr_1c8: inner.unkptr_1c8.gpu_pointer(),
buffer_mgr_ctl: inner.buffer_mgr_ctl.gpu_pointer(),
buffer_mgr_ctl_2: inner.buffer_mgr_ctl.gpu_pointer(),
__pad0: Default::default(),
unk_160: U64(0),
unk_168: U64(0),
unk_1d0: 0,
unk_1d4: 0,
unk_1d8: Default::default(),
__pad1: Default::default(),
gpu_scratch: raw::RuntimeScratch {
unk_6b38: 0xff,
..Default::default()
},
}
))
})
- }
- /// Create the FwStatus structure, which is used to coordinate the firmware halt state between
- /// the firmware and the driver.
- #[inline(never)]
- fn fw_status(&mut self) -> Result<GpuObject<FwStatus>> {
self.alloc
.shared
.new_object(Default::default(), |_inner| Default::default())
- }
- /// Create one UatLevelInfo structure, which describes one level of translation for the UAT MMU.
- #[inline(never)]
- fn uat_level_info(
cfg: &hw::HwConfig,
index_shift: usize,
num_entries: usize,
- ) -> raw::UatLevelInfo {
raw::UatLevelInfo {
index_shift: index_shift as _,
unk_1: 14,
unk_2: 14,
unk_3: 8,
unk_4: 0x4000,
num_entries: num_entries as _,
unk_8: U64(1),
unk_10: U64(((1u64 << cfg.uat_oas) - 1) & !(mmu::UAT_PGMSK as u64)),
index_mask: U64(((num_entries - 1) << index_shift) as u64),
}
- }
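For reviewers unfamiliar with the UAT layout: the `index_shift` / `index_mask` pair describes how a VA is resolved to a per-level table index, with `index_mask = (num_entries - 1) << index_shift`. A sketch under that reading (hypothetical helper, for illustration only):

```rust
// Resolve a VA to a table index at one translation level, using the
// shift/mask scheme that `UatLevelInfo` describes.
fn level_index(va: u64, index_shift: u32, num_entries: u64) -> u64 {
    let mask = (num_entries - 1) << index_shift;
    (va & mask) >> index_shift
}
```

With the three levels built below (shifts 36/25/14), this walks 8, 2048, and 2048 entries respectively, matching a 16K-granule ARM64-style table.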
- /// Build the top-level InitData object.
- #[inline(never)]
- pub(crate) fn build(&mut self) -> Result<Box<GpuObject<InitData::ver>>> {
let inner: Box<InitData::ver> = box_in_place!(InitData::ver {
unk_buf: self.alloc.shared_ro.array_empty(0x4000)?,
runtime_pointers: self.runtime_pointers()?,
globals: self.globals()?,
fw_status: self.fw_status()?,
})?;
Ok(Box::try_new(self.alloc.shared_ro.new_boxed(
inner,
|inner, ptr| {
Ok(place!(
ptr,
raw::InitData::ver {
#[ver(V >= V13_0B4)]
ver_info: Array::new([1, 1, 16, 1]),
unk_buf: inner.unk_buf.gpu_pointer(),
unk_8: 0,
unk_c: 0,
runtime_pointers: inner.runtime_pointers.gpu_pointer(),
globals: inner.globals.gpu_pointer(),
fw_status: inner.fw_status.gpu_pointer(),
uat_page_size: 0x4000,
uat_page_bits: 14,
uat_num_levels: 3,
uat_level_info: Array::new([
Self::uat_level_info(self.cfg, 36, 8),
Self::uat_level_info(self.cfg, 25, 2048),
Self::uat_level_info(self.cfg, 14, 2048),
]),
__pad0: Default::default(),
host_mapped_fw_allocations: 1,
unk_ac: 0,
unk_b0: 0,
unk_b4: 0,
unk_b8: 0,
}
))
},
)?)?)
- }
+}
diff --git a/drivers/gpu/drm/asahi/mem.rs b/drivers/gpu/drm/asahi/mem.rs
new file mode 100644
index 000000000000..491d4f8a4016
--- /dev/null
+++ b/drivers/gpu/drm/asahi/mem.rs
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! ARM64 low level memory operations.
+//!
+//! This GPU uses CPU-side `tlbi` outer-shareable instructions to manage its TLBs.
+//! Yes, really. Even though the VA address spaces are unrelated.
+//!
+//! Right now we pick our own ASIDs and don't coordinate with the CPU. This might result
+//! in needless TLB shootdowns on the CPU side... TODO: fix this.
+use core::arch::asm;
+use core::cmp::min;
+use crate::debug::*;
+use crate::mmu;
+type Asid = u8;
+/// Invalidate the entire GPU TLB.
+#[inline(always)]
+pub(crate) fn tlbi_all() {
- unsafe {
asm!(".arch armv8.4-a", "tlbi vmalle1os",);
- }
+}
+/// Invalidate all TLB entries for a given ASID.
+#[inline(always)]
+pub(crate) fn tlbi_asid(asid: Asid) {
- if debug_enabled(DebugFlags::ConservativeTlbi) {
tlbi_all();
sync();
return;
- }
- unsafe {
asm!(
".arch armv8.4-a",
"tlbi aside1os, {x}",
x = in(reg) ((asid as u64) << 48)
);
- }
+}
+/// Invalidate a single page for a given ASID.
+#[inline(always)]
+pub(crate) fn tlbi_page(asid: Asid, va: usize) {
- if debug_enabled(DebugFlags::ConservativeTlbi) {
tlbi_all();
sync();
return;
- }
- let val: u64 = ((asid as u64) << 48) | ((va as u64 >> 12) & 0xffffffffffc);
- unsafe {
asm!(
".arch armv8.4-a",
"tlbi vae1os, {x}",
x = in(reg) val
);
- }
+}
+/// Invalidate a range of pages for a given ASID.
+#[inline(always)]
+pub(crate) fn tlbi_range(asid: Asid, va: usize, len: usize) {
- if debug_enabled(DebugFlags::ConservativeTlbi) {
tlbi_all();
sync();
return;
- }
- if len == 0 {
return;
- }
- let start_pg = va >> mmu::UAT_PGBIT;
- let end_pg = (va + len + mmu::UAT_PGMSK) >> mmu::UAT_PGBIT;
- let mut val: u64 = ((asid as u64) << 48) | (2 << 46) | (start_pg as u64 & 0x1fffffffff);
- let pages = end_pg - start_pg;
- if pages == 1 {
tlbi_page(asid, va);
return;
- }
- // Page count is always in units of 2
- let num = ((pages + 1) >> 1) as u64;
- // base: 5 bits
- // exp: 2 bits
- // pages = (base + 1) << (5 * exp + 1)
- // 0:00000 -> 2 pages = 2 << 0
- // 0:11111 -> 32 * 2 pages = 2 << 5
- // 1:00000 -> 1 * 32 * 2 pages = 2 << 5
- // 1:11111 -> 32 * 32 * 2 pages = 2 << 10
- // 2:00000 -> 1 * 32 * 32 * 2 pages = 2 << 10
- // 2:11111 -> 32 * 32 * 32 * 2 pages = 2 << 15
- // 3:00000 -> 1 * 32 * 32 * 32 * 2 pages = 2 << 15
- // 3:11111 -> 32 * 32 * 32 * 32 * 2 pages = 2 << 20
- let exp = min(3, (64 - num.leading_zeros()) / 5);
- let bits = 5 * exp;
- let mut base = (num + (1 << bits) - 1) >> bits;
- val |= (exp as u64) << 44;
- while base > 32 {
unsafe {
asm!(
".arch armv8.4-a",
"tlbi rvae1os, {x}",
x = in(reg) val | (31 << 39)
);
}
base -= 32;
- }
- unsafe {
asm!(
".arch armv8.4-a",
"tlbi rvae1os, {x}",
x = in(reg) val | ((base - 1) << 39)
);
- }
+}
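The base/exp encoding worked out in the comments above can be summarized as a standalone sketch (hypothetical helper mirroring the arithmetic in `tlbi_range`; note the register field stores `base - 1`, so a run of pages is covered by `base << (5 * exp + 1)` pages):

```rust
// Split a page count into the TLBI RVA range encoding used above.
// Returns (exp, base), where the instruction covers
// base << (5 * exp + 1) pages and the register encodes base - 1.
fn rva_encode(pages: u64) -> (u64, u64) {
    // Page count is always in units of 2.
    let num = (pages + 1) >> 1;
    // exp scales in steps of 2^5; clamp to the 2-bit field.
    let exp = core::cmp::min(3, (64 - num.leading_zeros() as u64) / 5);
    let bits = 5 * exp;
    // Round num up to a multiple of the selected scale.
    let base = (num + (1u64 << bits) - 1) >> bits;
    (exp, base)
}
```

Ranges that don't fit a single scale round up, which is harmless for invalidation (we may just invalidate a few extra pages).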
+/// Issue a memory barrier (`dsb sy`).
+#[inline(always)]
+pub(crate) fn sync() {
- unsafe {
asm!("dsb sy");
- }
+}
diff --git a/drivers/gpu/drm/asahi/microseq.rs b/drivers/gpu/drm/asahi/microseq.rs
new file mode 100644
index 000000000000..dca94ebc53a1
--- /dev/null
+++ b/drivers/gpu/drm/asahi/microseq.rs
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! GPU Micro operation sequence builder
+//!
+//! As part of a single job submission to the GPU, the GPU firmware interprets a sequence of
+//! commands that we call a "microsequence". These are responsible for setting up the job execution,
+//! timestamping the process, waiting for completion, tearing down any resources, and signaling
+//! completion to the driver via the event stamp mechanism.
+//!
+//! Although the microsequences used by the macOS driver are usually quite uniform and simple, the
+//! firmware actually implements enough operations to make this interpreter Turing-complete (!).
+//! Most of those aren't implemented yet, since we don't need them, but they could come in handy in
+//! the future to do strange things or work around firmware bugs...
+//!
+//! This module simply implements a collection of microsequence operations that can be appended to
+//! and later concatenated into one buffer, ready for firmware execution.
+use crate::fw::microseq;
+pub(crate) use crate::fw::microseq::*;
+use crate::fw::types::*;
+use kernel::prelude::*;
+/// MicroSequence object type, which is just an opaque byte array.
+pub(crate) type MicroSequence = GpuArray<u8>;
+/// MicroSequence builder.
+pub(crate) struct Builder {
- ops: Vec<u8>,
+}
+impl Builder {
- /// Create a new Builder object
- pub(crate) fn new() -> Builder {
Builder { ops: Vec::new() }
- }
- /// Get the relative offset from the current pointer to a given target offset.
- ///
- /// Used for relative jumps.
- pub(crate) fn offset_to(&self, target: i32) -> i32 {
target - self.ops.len() as i32
- }
- /// Add an operation to the end of the sequence.
- pub(crate) fn add<T: microseq::Operation>(&mut self, op: T) -> Result<i32> {
let off = self.ops.len();
let p: *const T = &op;
let p: *const u8 = p as *const u8;
let s: &[u8] = unsafe { core::slice::from_raw_parts(p, core::mem::size_of::<T>()) };
self.ops.try_extend_from_slice(s)?;
Ok(off as i32)
- }
- /// Collect all submitted operations into a finalized GPU object.
- pub(crate) fn build(self, alloc: &mut Allocator) -> Result<MicroSequence> {
let mut array = alloc.array_empty::<u8>(self.ops.len())?;
array.as_mut_slice().clone_from_slice(self.ops.as_slice());
Ok(array)
- }
+}
diff --git a/drivers/gpu/drm/asahi/mmu.rs b/drivers/gpu/drm/asahi/mmu.rs
new file mode 100644
index 000000000000..226ca0b7c1d7
--- /dev/null
+++ b/drivers/gpu/drm/asahi/mmu.rs
@@ -0,0 +1,1249 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! GPU UAT (MMU) management
+//!
+//! AGX GPUs use an MMU called the UAT, which is largely compatible with the ARM64 page table
+//! format. This module manages the global MMU structures, including a shared handoff structure
+//! that is used to coordinate VM management operations with the firmware, the TTBAT which points
+//! to currently active GPU VM contexts, as well as the individual `Vm` operations to map and
+//! unmap buffer objects into a single user or kernel address space.
+//!
+//! The actual page table management is delegated to the common kernel `io_pgtable` code.
+use core::fmt::Debug;
+use core::mem::size_of;
+use core::ptr::{addr_of_mut, NonNull};
+use core::sync::atomic::{fence, AtomicU32, AtomicU64, AtomicU8, Ordering};
+use core::time::Duration;
+use kernel::{
- bindings, c_str, delay, device,
- drm::mm,
- error::{to_result, Result},
- io_pgtable,
- io_pgtable::{prot, AppleUAT, IoPageTable},
- prelude::*,
- sync::{smutex::Mutex, Guard},
- sync::{Arc, LockClassKey, UniqueArc},
- time,
- types::ForeignOwnable,
+};
+use crate::debug::*;
+use crate::no_debug;
+use crate::{driver, fw, gem, hw, mem, slotalloc};
+const DEBUG_CLASS: DebugFlags = DebugFlags::Mmu;
+/// PPL magic number for the handoff region
+const PPL_MAGIC: u64 = 0x4b1d000000000002;
+/// Number of supported context entries in the TTBAT
+const UAT_NUM_CTX: usize = 64;
+/// First context available for users
+const UAT_USER_CTX_START: usize = 1;
+/// Number of available user contexts
+const UAT_USER_CTX: usize = UAT_NUM_CTX - UAT_USER_CTX_START;
+/// Number of bits in a page offset.
+pub(crate) const UAT_PGBIT: usize = 14;
+/// UAT page size.
+pub(crate) const UAT_PGSZ: usize = 1 << UAT_PGBIT;
+/// UAT page offset mask.
+pub(crate) const UAT_PGMSK: usize = UAT_PGSZ - 1;
+type Pte = AtomicU64;
+/// Number of PTEs per page.
+const UAT_NPTE: usize = UAT_PGSZ / size_of::<Pte>();
+/// UAT input address space (user)
+pub(crate) const UAT_IAS: usize = 39;
+/// "Fake" kernel UAT input address space (one page level lower)
+pub(crate) const UAT_IAS_KERN: usize = 36;
+/// Lower/user base VA
+const IOVA_USER_BASE: usize = UAT_PGSZ;
+/// Lower/user top VA
+const IOVA_USER_TOP: usize = (1 << UAT_IAS) - 1;
+/// Upper/kernel base VA
+// const IOVA_TTBR1_BASE: usize = 0xffffff8000000000;
+/// Driver-managed kernel base VA
+const IOVA_KERN_BASE: usize = 0xffffffa000000000;
+/// Driver-managed kernel top VA
+const IOVA_KERN_TOP: usize = 0xffffffafffffffff;
+const TTBR_VALID: u64 = 0x1; // BIT(0)
+const TTBR_ASID_SHIFT: usize = 48;
+const PTE_TABLE: u64 = 0x3; // BIT(0) | BIT(1)
+// Mapping protection types
+// Note: prot::CACHE means "cache coherency", which for UAT means *uncached*,
+// since uncached mappings from the GFX ASC side are cache coherent with the AP cache.
+// Not having that flag means *cached noncoherent*.
+/// Firmware MMIO R/W
+pub(crate) const PROT_FW_MMIO_RW: u32 =
- prot::PRIV | prot::READ | prot::WRITE | prot::CACHE | prot::MMIO;
+/// Firmware MMIO R/O
+pub(crate) const PROT_FW_MMIO_RO: u32 = prot::PRIV | prot::READ | prot::CACHE | prot::MMIO;
+/// Firmware shared (uncached) RW
+pub(crate) const PROT_FW_SHARED_RW: u32 = prot::PRIV | prot::READ | prot::WRITE | prot::CACHE;
+/// Firmware shared (uncached) RO
+pub(crate) const PROT_FW_SHARED_RO: u32 = prot::PRIV | prot::READ | prot::CACHE;
+/// Firmware private (cached) RW
+pub(crate) const PROT_FW_PRIV_RW: u32 = prot::PRIV | prot::READ | prot::WRITE;
+/*
+/// Firmware private (cached) RO
+pub(crate) const PROT_FW_PRIV_RO: u32 = prot::PRIV | prot::READ;
+*/
+/// Firmware/GPU shared (uncached) RW
+pub(crate) const PROT_GPU_FW_SHARED_RW: u32 = prot::READ | prot::WRITE | prot::CACHE;
+/// Firmware/GPU shared (private) RW
+pub(crate) const PROT_GPU_FW_PRIV_RW: u32 = prot::READ | prot::WRITE;
+/// GPU shared/coherent RW
+pub(crate) const PROT_GPU_SHARED_RW: u32 = prot::READ | prot::WRITE | prot::CACHE | prot::NOEXEC;
+/// GPU shared/coherent RO
+pub(crate) const PROT_GPU_SHARED_RO: u32 = prot::READ | prot::CACHE | prot::NOEXEC;
+/// GPU shared/coherent WO
+pub(crate) const PROT_GPU_SHARED_WO: u32 = prot::WRITE | prot::CACHE | prot::NOEXEC;
+/*
+/// GPU private/noncoherent RW
+pub(crate) const PROT_GPU_PRIV_RW: u32 = prot::READ | prot::WRITE | prot::NOEXEC;
+/// GPU private/noncoherent RO
+pub(crate) const PROT_GPU_PRIV_RO: u32 = prot::READ | prot::NOEXEC;
+*/
+type PhysAddr = bindings::phys_addr_t;
+/// A pre-allocated memory region for UAT management
+struct UatRegion {
- base: PhysAddr,
- map: NonNull<core::ffi::c_void>,
+}
+/// It's safe to share UAT region records across threads.
+unsafe impl Send for UatRegion {}
+unsafe impl Sync for UatRegion {}
+/// Handoff region flush info structure
+#[repr(C)]
+struct FlushInfo {
- state: AtomicU64,
- addr: AtomicU64,
- size: AtomicU64,
+}
+/// UAT Handoff region layout
+#[repr(C)]
+struct Handoff {
- magic_ap: AtomicU64,
- magic_fw: AtomicU64,
- lock_ap: AtomicU8,
- lock_fw: AtomicU8,
- // Implicit padding: 2 bytes
- turn: AtomicU32,
- cur_slot: AtomicU32,
- // Implicit padding: 4 bytes
- flush: [FlushInfo; UAT_NUM_CTX + 1],
- unk2: AtomicU8,
- // Implicit padding: 7 bytes
- unk3: AtomicU64,
+}
+const HANDOFF_SIZE: usize = size_of::<Handoff>();
+/// One VM slot in the TTBAT
+#[repr(C)]
+struct SlotTTBS {
- ttb0: AtomicU64,
- ttb1: AtomicU64,
+}
+const SLOTS_SIZE: usize = UAT_NUM_CTX * size_of::<SlotTTBS>();
+// We need at least page 0 (ttb0)
+const PAGETABLES_SIZE: usize = UAT_PGSZ;
+/// Inner data for a Vm instance. This is reference-counted by the outer Vm object.
+struct VmInner {
- dev: driver::AsahiDevice,
- is_kernel: bool,
- min_va: usize,
- max_va: usize,
- page_table: AppleUAT<Uat>,
- mm: mm::Allocator<(), MappingInner>,
- uat_inner: Arc<UatInner>,
- active_users: usize,
- binding: Option<slotalloc::Guard<SlotInner>>,
- bind_token: Option<slotalloc::SlotToken>,
- id: u64,
+}
+impl VmInner {
- /// Returns the slot index, if this VM is bound.
- fn slot(&self) -> Option<u32> {
if self.is_kernel {
// The GFX ASC does not care about the ASID. Pick an arbitrary one.
// TODO: This needs to be a persistently reserved ASID once we integrate
// with the ARM64 kernel ASID machinery to avoid overlap.
Some(0)
} else {
// We don't check whether we lost the slot, which could cause unnecessary
// invalidations against another Vm. However, this situation should be very
// rare (e.g. a Vm lost its slot, which means 63 other Vms bound in the
// interim, and then it gets killed / drops its mappings without doing any
// final rendering). Anything doing active maps/unmaps is probably also
// rendering and therefore likely bound.
self.bind_token
.as_ref()
.map(|token| (token.last_slot() + UAT_USER_CTX_START as u32))
}
- }
- /// Returns the translation table base for this Vm
- fn ttb(&self) -> u64 {
self.page_table.cfg().ttbr
- }
- /// Map an IOVA to the shifted address the underlying io_pgtable uses.
- fn map_iova(&self, iova: usize, size: usize) -> Result<usize> {
if iova < self.min_va || (iova + size - 1) > self.max_va {
Err(EINVAL)
} else if self.is_kernel {
Ok(iova - self.min_va)
} else {
Ok(iova)
}
- }
- /// Map a contiguous range of virtual->physical pages.
- fn map_pages(
&mut self,
mut iova: usize,
mut paddr: usize,
pgsize: usize,
pgcount: usize,
prot: u32,
- ) -> Result<usize> {
let mut left = pgcount;
while left > 0 {
let mapped_iova = self.map_iova(iova, pgsize * left)?;
let mapped = self
.page_table
.map_pages(mapped_iova, paddr, pgsize, left, prot)?;
assert!(mapped <= left * pgsize);
left -= mapped / pgsize;
paddr += mapped;
iova += mapped;
}
Ok(pgcount * pgsize)
- }
- /// Unmap a contiguous range of pages.
- fn unmap_pages(&mut self, mut iova: usize, pgsize: usize, pgcount: usize) -> Result<usize> {
let mut left = pgcount;
while left > 0 {
let mapped_iova = self.map_iova(iova, pgsize * left)?;
let unmapped = self.page_table.unmap_pages(mapped_iova, pgsize, left);
assert!(unmapped <= left * pgsize);
left -= unmapped / pgsize;
iova += unmapped;
}
Ok(pgcount * pgsize)
- }
- /// Map an `mm::Node` representing a mapping in VA space.
- fn map_node(&mut self, node: &mm::Node<(), MappingInner>, prot: u32) -> Result {
let mut iova = node.start() as usize;
let sgt = node.sgt.as_ref().ok_or(EINVAL)?;
for range in sgt.iter() {
let addr = range.dma_address();
let len = range.dma_len();
if (addr | len | iova) & UAT_PGMSK != 0 {
dev_err!(
self.dev,
"MMU: Mapping {:#x}:{:#x} -> {:#x} is not page-aligned\n",
addr,
len,
iova
);
return Err(EINVAL);
}
mod_dev_dbg!(
self.dev,
"MMU: map: {:#x}:{:#x} -> {:#x}\n",
addr,
len,
iova
);
self.map_pages(iova, addr, UAT_PGSZ, len >> UAT_PGBIT, prot)?;
iova += len;
}
Ok(())
- }
+}
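[Editor's note: the `map_pages` loop above handles short maps from the io_pgtable layer by advancing `iova`/`paddr` and retrying until everything is mapped. A userspace sketch of the same pattern; `MockPgtable` and all names here are invented stand-ins for `AppleUAT`:]

```rust
// A mock page table that, like io_pgtable's map_pages, may map fewer
// pages per call than requested.
struct MockPgtable {
    max_per_call: usize,         // max pages mapped per invocation
    mapped: Vec<(usize, usize)>, // recorded (iova, paddr) pairs, page-granular
}

const PGSZ: usize = 0x4000; // 16K pages, as on AGX

impl MockPgtable {
    // Returns the number of bytes actually mapped (possibly short).
    fn map_pages(&mut self, iova: usize, paddr: usize, count: usize) -> usize {
        let n = count.min(self.max_per_call);
        for i in 0..n {
            self.mapped.push((iova + i * PGSZ, paddr + i * PGSZ));
        }
        n * PGSZ
    }
}

// The retry loop, structured like VmInner::map_pages in the patch.
fn map_all(pt: &mut MockPgtable, mut iova: usize, mut paddr: usize, pgcount: usize) -> usize {
    let mut left = pgcount;
    while left > 0 {
        let mapped = pt.map_pages(iova, paddr, left);
        assert!(mapped <= left * PGSZ);
        left -= mapped / PGSZ;
        paddr += mapped;
        iova += mapped;
    }
    pgcount * PGSZ
}

fn main() {
    let mut pt = MockPgtable { max_per_call: 3, mapped: Vec::new() };
    // 10 pages require four calls (3 + 3 + 3 + 1), but the caller sees one map.
    let total = map_all(&mut pt, 0x10000000, 0x800000000, 10);
    assert_eq!(total, 10 * PGSZ);
    assert_eq!(pt.mapped.len(), 10);
    // Pages stay contiguous in both address spaces across the retries.
    assert_eq!(pt.mapped[9], (0x10000000 + 9 * PGSZ, 0x800000000 + 9 * PGSZ));
}
```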
+/// Shared reference to a virtual memory address space ([`Vm`]).
+#[derive(Clone)]
+pub(crate) struct Vm {
- id: u64,
- file_id: u64,
- inner: Arc<Mutex<VmInner>>,
+}
+no_debug!(Vm);
+/// Slot data for a [`Vm`] slot (nothing, we only care about the indices).
+pub(crate) struct SlotInner();
+impl slotalloc::SlotItem for SlotInner {
- type Data = ();
+}
+/// Represents a single user of a binding of a [`Vm`] to a slot.
+///
+/// The number of users is counted, and the slot will be freed when it drops to 0.
+#[derive(Debug)]
+pub(crate) struct VmBind(Vm, u32);
+impl VmBind {
- /// Returns the slot that this `Vm` is bound to.
- pub(crate) fn slot(&self) -> u32 {
self.1
- }
+}
+impl Drop for VmBind {
- fn drop(&mut self) {
let mut inner = self.0.inner.lock();
assert_ne!(inner.active_users, 0);
inner.active_users -= 1;
mod_pr_debug!("MMU: slot {} active users {}\n", self.1, inner.active_users);
if inner.active_users == 0 {
inner.binding = None;
}
- }
+}
+impl Clone for VmBind {
- fn clone(&self) -> VmBind {
let mut inner = self.0.inner.lock();
inner.active_users += 1;
mod_pr_debug!("MMU: slot {} active users {}\n", self.1, inner.active_users);
VmBind(self.0.clone(), self.1)
- }
+}
+/// Inner data required for an object mapping into a [`Vm`].
+pub(crate) struct MappingInner {
- owner: Arc<Mutex<VmInner>>,
- uat_inner: Arc<UatInner>,
- prot: u32,
- mapped_size: usize,
- sgt: Option<gem::SGTable>,
+}
+/// An object mapping into a [`Vm`], which reserves the address range from use by other mappings.
+pub(crate) struct Mapping(mm::Node<(), MappingInner>);
+impl Mapping {
- /// Returns the IOVA base of this mapping
- pub(crate) fn iova(&self) -> usize {
self.0.start() as usize
- }
- /// Returns the size of this mapping in bytes
- pub(crate) fn size(&self) -> usize {
self.0.mapped_size
- }
- /// Remap a cached mapping as uncached, then synchronously flush that range of VAs from the
- /// coprocessor cache. This is required to safely unmap cached/private mappings.
- fn remap_uncached_and_flush(&mut self) {
let mut owner = self.0.owner.lock();
mod_dev_dbg!(
owner.dev,
"MMU: remap as uncached {:#x}:{:#x}\n",
self.iova(),
self.size()
);
// The IOMMU API does not allow us to remap things in-place...
// just do an unmap and map again for now.
// Do not try to unmap guard page (-1)
if owner
.unmap_pages(self.iova(), UAT_PGSZ, self.size() >> UAT_PGBIT)
.is_err()
{
dev_err!(
owner.dev,
"MMU: unmap for remap {:#x}:{:#x} failed\n",
self.iova(),
self.size()
);
}
let prot = self.0.prot | prot::CACHE;
if owner.map_node(&self.0, prot).is_err() {
dev_err!(
owner.dev,
"MMU: remap {:#x}:{:#x} failed\n",
self.iova(),
self.size()
);
}
// If we don't have (and have never had) a VM slot, just return
let slot = match owner.slot() {
None => return,
Some(slot) => slot,
};
let flush_slot = if owner.is_kernel {
// If this is a kernel mapping, always flush on index 64
UAT_NUM_CTX as u32
} else {
// Otherwise, check if this slot is the active one, otherwise return
// Also check that we actually own this slot
let ttb = owner.ttb() | TTBR_VALID | (slot as u64) << TTBR_ASID_SHIFT;
let uat_inner = self.0.uat_inner.lock();
uat_inner.handoff().lock();
let cur_slot = uat_inner.handoff().current_slot();
let ttb_cur = uat_inner.ttbs()[slot as usize].ttb0.load(Ordering::Relaxed);
uat_inner.handoff().unlock();
if cur_slot == Some(slot) && ttb_cur == ttb {
slot
} else {
return;
}
};
// FIXME: There is a race here, though it'll probably never happen in practice.
// In theory, it's possible for the ASC to finish using our slot, whatever command
// it was processing to complete, the slot to be lost to another context, and the ASC
// to begin using it again with a different page table, thus faulting when it gets a
// flush request here. In practice, the chance of this happening is probably vanishingly
// small, as all 62 other slots would have to be recycled or in use before that slot can
// be reused, and the ASC using user contexts at all is very rare.
// Still, the locking around UAT/Handoff/TTBs should probably be redesigned to better
// model the interactions with the firmware and avoid these races.
// Possibly TTB changes should be tied to slot locks:
// Flush:
// - Can early check handoff here (no need to lock).
// If user slot and it doesn't match the active ASC slot,
// we can elide the flush as the ASC guarantees it flushes
// TLBs/caches when it switches context. We just need a
// barrier to ensure ordering.
// - Lock TTB slot
// - If user ctx:
// - Lock handoff AP-side
// - Lock handoff dekker
// - Check TTB & handoff cur ctx
// - Perform flush if necessary
// - This implies taking the fwring lock
//
// TTB change:
// - lock TTB slot
// - lock handoff AP-side
// - lock handoff dekker
// change TTB
// Lock this flush slot, and write the range to it
let flush = self.0.uat_inner.lock_flush(flush_slot);
let pages = self.size() >> UAT_PGBIT;
flush.begin_flush(self.iova() as u64, self.size() as u64);
if pages >= 0x10000 {
dev_err!(owner.dev, "MMU: Flush too big ({:#x} pages)\n", pages);
}
let cmd = fw::channels::FwCtlMsg {
addr: fw::types::U64(self.iova() as u64),
unk_8: 0,
slot: flush_slot,
page_count: pages as u16,
unk_12: 2, // ?
};
// Tell the firmware to do a cache flush
if let Err(e) = owner.dev.data().gpu.fwctl(cmd) {
dev_err!(
owner.dev,
"MMU: ASC cache flush {:#x}:{:#x} failed (err: {:?})\n",
self.iova(),
self.size(),
e
);
}
// Finish the flush
flush.end_flush();
// Slot is unlocked here
- }
+}
+impl Drop for Mapping {
- fn drop(&mut self) {
// This is the main unmap function for UAT mappings.
// The sequence of operations here is finicky, due to the interaction
// between cached GFX ASC mappings and the page tables. These mappings
// always have to be flushed from the cache before being unmapped.
// For uncached mappings, just unmapping and flushing the TLB is sufficient.
// For cached mappings, this is the required sequence:
// 1. Remap it as uncached
// 2. Flush the TLB range
// 3. If kernel VA mapping OR user VA mapping and handoff.current_slot() == slot:
// a. Take a lock for this slot
// b. Write the flush range to the right context slot in handoff area
// c. Issue a cache invalidation request via FwCtl queue
// d. Poll for completion via queue
// e. Check for completion flag in the handoff area
// f. Drop the lock
// 4. Unmap
// 5. Flush the TLB range again
// prot::CACHE means "cache coherent" which means *uncached* here.
if self.0.prot & prot::CACHE == 0 {
self.remap_uncached_and_flush();
}
let mut owner = self.0.owner.lock();
mod_dev_dbg!(
owner.dev,
"MMU: unmap {:#x}:{:#x}\n",
self.iova(),
self.size()
);
if owner
.unmap_pages(self.iova(), UAT_PGSZ, self.size() >> UAT_PGBIT)
.is_err()
{
dev_err!(
owner.dev,
"MMU: unmap {:#x}:{:#x} failed\n",
self.iova(),
self.size()
);
}
if let Some(asid) = owner.slot() {
mem::tlbi_range(asid as u8, self.iova(), self.size());
mod_dev_dbg!(
owner.dev,
"MMU: flush range: asid={:#x} start={:#x} len={:#x}\n",
asid,
self.iova(),
self.size()
);
mem::sync();
}
- }
+}
+/// Shared UAT global data structures
+struct UatShared {
- handoff_rgn: UatRegion,
- ttbs_rgn: UatRegion,
+}
+impl UatShared {
- /// Returns the handoff region area
- fn handoff(&self) -> &Handoff {
// SAFETY: pointer is non-null per the type invariant
unsafe { (self.handoff_rgn.map.as_ptr() as *mut Handoff).as_ref() }.unwrap()
- }
- /// Returns the TTBAT area
- fn ttbs(&self) -> &[SlotTTBS; UAT_NUM_CTX] {
// SAFETY: pointer is non-null per the type invariant
unsafe { (self.ttbs_rgn.map.as_ptr() as *mut [SlotTTBS; UAT_NUM_CTX]).as_ref() }.unwrap()
- }
+}
+// SAFETY: Nothing here is unsafe to send across threads.
+unsafe impl Send for UatShared {}
+/// Inner data for the top-level UAT instance.
+struct UatInner {
- shared: Mutex<UatShared>,
- handoff_flush: [Mutex<HandoffFlush>; UAT_NUM_CTX + 1],
+}
+impl UatInner {
- /// Take the lock on the shared data and return the guard.
- fn lock(&self) -> Guard<'_, Mutex<UatShared>> {
self.shared.lock()
- }
- /// Take a lock on a handoff flush slot and return the guard.
- fn lock_flush(&self, slot: u32) -> Guard<'_, Mutex<HandoffFlush>> {
self.handoff_flush[slot as usize].lock()
- }
+}
+/// Top-level UAT manager object
+pub(crate) struct Uat {
- dev: driver::AsahiDevice,
- cfg: &'static hw::HwConfig,
- pagetables_rgn: UatRegion,
- inner: Arc<UatInner>,
- slots: slotalloc::SlotAllocator<SlotInner>,
- kernel_vm: Vm,
- _kernel_lower_vm: Vm,
+}
+impl Drop for UatRegion {
- fn drop(&mut self) {
// SAFETY: the pointer is valid by the type invariant
unsafe { bindings::memunmap(self.map.as_ptr()) };
- }
+}
+impl Handoff {
- /// Lock the handoff region from firmware access
- fn lock(&self) {
self.lock_ap.store(1, Ordering::Relaxed);
fence(Ordering::SeqCst);
while self.lock_fw.load(Ordering::Relaxed) != 0 {
if self.turn.load(Ordering::Relaxed) != 0 {
self.lock_ap.store(0, Ordering::Relaxed);
while self.turn.load(Ordering::Relaxed) != 0 {}
self.lock_ap.store(1, Ordering::Relaxed);
fence(Ordering::SeqCst);
}
}
fence(Ordering::Acquire);
- }
- /// Unlock the handoff region, allowing firmware access
- fn unlock(&self) {
self.turn.store(1, Ordering::Relaxed);
self.lock_ap.store(0, Ordering::Release);
- }
- /// Returns the current Vm slot mapped by the firmware for lower/unprivileged access, if any.
- fn current_slot(&self) -> Option<u32> {
let slot = self.cur_slot.load(Ordering::Relaxed);
if slot == 0 || slot == u32::MAX {
None
} else {
Some(slot)
}
- }
- /// Initialize the handoff region
- fn init(&self) -> Result {
self.magic_ap.store(PPL_MAGIC, Ordering::Relaxed);
self.cur_slot.store(0, Ordering::Relaxed);
self.unk3.store(0, Ordering::Relaxed);
fence(Ordering::SeqCst);
let timeout = time::ktime_get() + Duration::from_millis(1000);
self.lock();
while time::ktime_get() < timeout {
if self.magic_fw.load(Ordering::Relaxed) == PPL_MAGIC {
break;
} else {
self.unlock();
delay::coarse_sleep(Duration::from_millis(10));
self.lock();
}
}
if self.magic_fw.load(Ordering::Relaxed) != PPL_MAGIC {
self.unlock();
pr_err!("Handoff: Failed to initialize (firmware not running?)\n");
return Err(EIO);
}
self.unlock();
for i in 0..=UAT_NUM_CTX {
self.flush[i].state.store(0, Ordering::Relaxed);
self.flush[i].addr.store(0, Ordering::Relaxed);
self.flush[i].size.store(0, Ordering::Relaxed);
}
fence(Ordering::SeqCst);
Ok(())
- }
+}
+/// Represents a single flush info slot in the handoff region.
+///
+/// # Invariants
+/// The pointer is valid and there is no aliasing HandoffFlush instance.
+struct HandoffFlush(*const FlushInfo);
+// SAFETY: These pointers are safe to send across threads.
+unsafe impl Send for HandoffFlush {}
+impl HandoffFlush {
- /// Set up a flush operation for the coprocessor
- fn begin_flush(&self, start: u64, size: u64) {
let flush = unsafe { self.0.as_ref().unwrap() };
let state = flush.state.load(Ordering::Relaxed);
if state != 0 {
pr_err!("Handoff: expected flush state 0, got {}\n", state);
}
flush.addr.store(start, Ordering::Relaxed);
flush.size.store(size, Ordering::Relaxed);
flush.state.store(1, Ordering::Relaxed);
- }
- /// Complete a flush operation for the coprocessor
- fn end_flush(&self) {
let flush = unsafe { self.0.as_ref().unwrap() };
let state = flush.state.load(Ordering::Relaxed);
if state != 2 {
pr_err!("Handoff: expected flush state 2, got {}\n", state);
}
flush.state.store(0, Ordering::Relaxed);
- }
+}
+// We do not implement FlushOps, since we flush manually in this module after
+// page table operations. Just provide dummy implementations.
+impl io_pgtable::FlushOps for Uat {
- type Data = ();
- fn tlb_flush_all(_data: <Self::Data as ForeignOwnable>::Borrowed<'_>) {}
- fn tlb_flush_walk(
_data: <Self::Data as ForeignOwnable>::Borrowed<'_>,
_iova: usize,
_size: usize,
_granule: usize,
- ) {
- }
- fn tlb_add_page(
_data: <Self::Data as ForeignOwnable>::Borrowed<'_>,
_iova: usize,
_granule: usize,
- ) {
- }
+}
+static LOCK_KEY: LockClassKey = LockClassKey::new();
+impl Vm {
- /// Create a new virtual memory address space
- fn new(
dev: driver::AsahiDevice,
uat_inner: Arc<UatInner>,
cfg: &'static hw::HwConfig,
is_kernel: bool,
id: u64,
file_id: u64,
- ) -> Result<Vm> {
let page_table = AppleUAT::new(
&dev,
io_pgtable::Config {
pgsize_bitmap: UAT_PGSZ,
ias: if is_kernel { UAT_IAS_KERN } else { UAT_IAS },
oas: cfg.uat_oas,
coherent_walk: true,
quirks: 0,
},
(),
)?;
let min_va = if is_kernel {
IOVA_KERN_BASE
} else {
IOVA_USER_BASE
};
let max_va = if is_kernel {
IOVA_KERN_TOP
} else {
IOVA_USER_TOP
};
let mm = mm::Allocator::new(
min_va as u64,
(max_va - min_va + 1) as u64,
(),
c_str!("asahi Vm"),
&LOCK_KEY,
)?;
Ok(Vm {
id,
file_id,
inner: Arc::try_new(Mutex::new(VmInner {
dev,
min_va,
max_va,
is_kernel,
page_table,
mm,
uat_inner,
binding: None,
bind_token: None,
active_users: 0,
id,
}))?,
})
- }
- /// Get the translation table base for this Vm
- fn ttb(&self) -> u64 {
self.inner.lock().ttb()
- }
- /// Map a GEM object (using its `SGTable`) into this Vm at a free address.
- pub(crate) fn map(&self, size: usize, sgt: gem::SGTable) -> Result<Mapping> {
let mut inner = self.inner.lock();
let uat_inner = inner.uat_inner.clone();
let node = inner.mm.insert_node(
MappingInner {
owner: self.inner.clone(),
uat_inner,
prot: PROT_FW_SHARED_RW,
sgt: Some(sgt),
mapped_size: size,
},
(size + UAT_PGSZ) as u64, // Add guard page
)?;
inner.map_node(&node, PROT_FW_SHARED_RW)?;
Ok(Mapping(node))
- }
- /// Map a GEM object (using its `SGTable`) into this Vm at a free address in a given range.
- #[allow(clippy::too_many_arguments)]
- pub(crate) fn map_in_range(
&self,
size: usize,
sgt: gem::SGTable,
alignment: u64,
start: u64,
end: u64,
prot: u32,
guard: bool,
- ) -> Result<Mapping> {
let mut inner = self.inner.lock();
let uat_inner = inner.uat_inner.clone();
let node = inner.mm.insert_node_in_range(
MappingInner {
owner: self.inner.clone(),
uat_inner,
prot,
sgt: Some(sgt),
mapped_size: size,
},
(size + if guard { UAT_PGSZ } else { 0 }) as u64, // Add guard page
alignment,
0,
start,
end,
mm::InsertMode::Best,
)?;
inner.map_node(&node, prot)?;
Ok(Mapping(node))
- }
- /// Map a GEM object (using its `SGTable`) into this Vm at a specific address.
- #[allow(clippy::too_many_arguments)]
- pub(crate) fn map_at(
&self,
addr: u64,
size: usize,
sgt: gem::SGTable,
prot: u32,
guard: bool,
- ) -> Result<Mapping> {
let mut inner = self.inner.lock();
let uat_inner = inner.uat_inner.clone();
let node = inner.mm.reserve_node(
MappingInner {
owner: self.inner.clone(),
uat_inner,
prot,
sgt: Some(sgt),
mapped_size: size,
},
addr,
(size + if guard { UAT_PGSZ } else { 0 }) as u64, // Add guard page
0,
)?;
inner.map_node(&node, prot)?;
Ok(Mapping(node))
- }
- /// Add a direct MMIO mapping to this Vm at a free address.
- pub(crate) fn map_io(&self, phys: usize, size: usize, rw: bool) -> Result<Mapping> {
let prot = if rw { PROT_FW_MMIO_RW } else { PROT_FW_MMIO_RO };
let mut inner = self.inner.lock();
let uat_inner = inner.uat_inner.clone();
let node = inner.mm.insert_node(
MappingInner {
owner: self.inner.clone(),
uat_inner,
prot,
sgt: None,
mapped_size: size,
},
(size + UAT_PGSZ) as u64, // Add guard page
)?;
let iova = node.start() as usize;
if (phys | size | iova) & UAT_PGMSK != 0 {
dev_err!(
inner.dev,
"MMU: Mapping {:#x}:{:#x} -> {:#x} is not page-aligned\n",
phys,
size,
iova
);
return Err(EINVAL);
}
dev_info!(
inner.dev,
"MMU: IO map: {:#x}:{:#x} -> {:#x}\n",
phys,
size,
iova
);
inner.map_pages(iova, phys, UAT_PGSZ, size >> UAT_PGBIT, prot)?;
Ok(Mapping(node))
- }
- /// Returns the unique ID of this Vm
- pub(crate) fn id(&self) -> u64 {
self.id
- }
- /// Returns the unique File ID of the owner of this Vm
- pub(crate) fn file_id(&self) -> u64 {
self.file_id
- }
+}
+impl Drop for VmInner {
- fn drop(&mut self) {
assert_eq!(self.active_users, 0);
mod_pr_debug!(
"VmInner::Drop [{}]: bind_token={:?}\n",
self.id,
self.bind_token
);
// Make sure this VM is not mapped to a TTB if it was
if let Some(token) = self.bind_token.take() {
let idx = (token.last_slot() as usize) + UAT_USER_CTX_START;
let ttb = self.ttb() | TTBR_VALID | (idx as u64) << TTBR_ASID_SHIFT;
let uat_inner = self.uat_inner.lock();
uat_inner.handoff().lock();
let handoff_cur = uat_inner.handoff().current_slot();
let ttb_cur = uat_inner.ttbs()[idx].ttb0.load(Ordering::SeqCst);
let inval = ttb_cur == ttb;
if inval {
if handoff_cur == Some(idx as u32) {
pr_err!(
"VmInner::drop owning slot {}, but it is currently in use by the ASC?\n",
idx
);
}
uat_inner.ttbs()[idx].ttb0.store(0, Ordering::SeqCst);
}
uat_inner.handoff().unlock();
core::mem::drop(uat_inner);
// In principle we dropped all the Mappings already, but we might as
// well play it safe and invalidate the whole ASID.
if inval {
mod_pr_debug!(
"VmInner::Drop [{}]: need inval for ASID {:#x}\n",
self.id,
idx
);
mem::tlbi_asid(idx as u8);
mem::sync();
}
}
- }
+}
+impl Uat {
- /// Map a bootloader-preallocated memory region
- fn map_region(
dev: &dyn device::RawDevice,
name: &CStr,
size: usize,
cached: bool,
- ) -> Result<UatRegion> {
let rdev = dev.raw_device();
let mut res = core::mem::MaybeUninit::<bindings::resource>::uninit();
let res = unsafe {
let idx = bindings::of_property_match_string(
(*rdev).of_node,
c_str!("memory-region-names").as_char_ptr(),
name.as_char_ptr(),
);
to_result(idx)?;
let np = bindings::of_parse_phandle(
(*rdev).of_node,
c_str!("memory-region").as_char_ptr(),
idx,
);
if np.is_null() {
dev_err!(dev, "Missing {} region\n", name);
return Err(EINVAL);
}
let ret = bindings::of_address_to_resource(np, 0, res.as_mut_ptr());
bindings::of_node_put(np);
if ret < 0 {
dev_err!(dev, "Failed to get {} region\n", name);
to_result(ret)?
}
res.assume_init()
};
let rgn_size: usize = unsafe { bindings::resource_size(&res) } as usize;
if size > rgn_size {
dev_err!(
dev,
"Region {} is too small (expected {}, got {})\n",
name,
size,
rgn_size
);
return Err(ENOMEM);
}
let flags = if cached {
bindings::MEMREMAP_WB
} else {
bindings::MEMREMAP_WC
};
let map = unsafe { bindings::memremap(res.start, rgn_size, flags.into()) };
let map = NonNull::new(map);
match map {
None => {
dev_err!(dev, "Failed to remap {} region\n", name);
Err(ENOMEM)
}
Some(map) => Ok(UatRegion {
base: res.start,
map,
}),
}
- }
- /// Returns a view into the root kernel (upper half) page table
- fn kpt0(&self) -> &[Pte; UAT_NPTE] {
// SAFETY: pointer is non-null per the type invariant
unsafe { (self.pagetables_rgn.map.as_ptr() as *mut [Pte; UAT_NPTE]).as_ref() }.unwrap()
- }
- /// Returns a reference to the global kernel (upper half) `Vm`
- pub(crate) fn kernel_vm(&self) -> &Vm {
&self.kernel_vm
- }
- /// Returns the base physical address of the TTBAT region.
- pub(crate) fn ttb_base(&self) -> u64 {
let inner = self.inner.lock();
inner.ttbs_rgn.base
- }
- /// Binds a `Vm` to a slot, preferring the last used one.
- pub(crate) fn bind(&self, vm: &Vm) -> Result<VmBind> {
let mut inner = vm.inner.lock();
if inner.binding.is_none() {
assert_eq!(inner.active_users, 0);
let slot = self.slots.get(inner.bind_token)?;
if slot.changed() {
mod_pr_debug!("Vm Bind [{}]: bind_token={:?}\n", vm.id, slot.token(),);
let idx = (slot.slot() as usize) + UAT_USER_CTX_START;
let ttb = inner.ttb() | TTBR_VALID | (idx as u64) << TTBR_ASID_SHIFT;
let uat_inner = self.inner.lock();
let ttbs = uat_inner.ttbs();
uat_inner.handoff().lock();
if uat_inner.handoff().current_slot() == Some(idx as u32) {
pr_err!(
"Vm::bind to slot {}, but it is currently in use by the ASC?\n",
idx
);
}
ttbs[idx].ttb0.store(ttb, Ordering::Relaxed);
ttbs[idx].ttb1.store(0, Ordering::Relaxed);
uat_inner.handoff().unlock();
core::mem::drop(uat_inner);
// Make sure all TLB entries from the previous owner of this ASID are gone
mem::tlbi_asid(idx as u8);
mem::sync();
}
inner.bind_token = Some(slot.token());
inner.binding = Some(slot);
}
inner.active_users += 1;
let slot = inner.binding.as_ref().unwrap().slot() + UAT_USER_CTX_START as u32;
mod_pr_debug!("MMU: slot {} active users {}\n", slot, inner.active_users);
Ok(VmBind(vm.clone(), slot))
- }
- /// Creates a new `Vm` linked to this UAT.
- pub(crate) fn new_vm(&self, id: u64, file_id: u64) -> Result<Vm> {
Vm::new(
self.dev.clone(),
self.inner.clone(),
self.cfg,
false,
id,
file_id,
)
- }
- /// Creates the reference-counted inner data for a new `Uat` instance.
- #[inline(never)]
- fn make_inner(dev: &driver::AsahiDevice) -> Result<Arc<UatInner>> {
let handoff_rgn = Self::map_region(dev, c_str!("handoff"), HANDOFF_SIZE, false)?;
let ttbs_rgn = Self::map_region(dev, c_str!("ttbs"), SLOTS_SIZE, false)?;
dev_info!(dev, "MMU: Initializing kernel page table\n");
let mut inner = UniqueArc::<UatInner>::try_new_uninit()?;
let ptr = inner.as_mut_ptr();
Ok(unsafe {
let handoff = &(handoff_rgn.map.as_ptr() as *mut Handoff).as_ref().unwrap();
for i in 0..UAT_NUM_CTX + 1 {
addr_of_mut!((*ptr).handoff_flush[i])
.write(Mutex::new(HandoffFlush(&handoff.flush[i])));
}
addr_of_mut!((*ptr).shared).write(Mutex::new(UatShared {
handoff_rgn,
ttbs_rgn,
}));
inner.assume_init()
}
.into())
- }
- /// Creates a new `Uat` instance given the relevant hardware config.
- #[inline(never)]
- pub(crate) fn new(dev: &driver::AsahiDevice, cfg: &'static hw::HwConfig) -> Result<Self> {
dev_info!(dev, "MMU: Initializing...\n");
let inner = Self::make_inner(dev)?;
let pagetables_rgn = Self::map_region(dev, c_str!("pagetables"), PAGETABLES_SIZE, true)?;
dev_info!(dev, "MMU: Creating kernel page tables\n");
let kernel_lower_vm = Vm::new(dev.clone(), inner.clone(), cfg, false, 1, 0)?;
let kernel_vm = Vm::new(dev.clone(), inner.clone(), cfg, true, 0, 0)?;
dev_info!(dev, "MMU: Kernel page tables created\n");
let ttb0 = kernel_lower_vm.ttb();
let ttb1 = kernel_vm.ttb();
let uat = Self {
dev: dev.clone(),
cfg,
pagetables_rgn,
kernel_vm,
_kernel_lower_vm: kernel_lower_vm,
inner,
slots: slotalloc::SlotAllocator::new(UAT_USER_CTX as u32, (), |_inner, _slot| {
SlotInner()
})?,
};
let inner = uat.inner.lock();
inner.handoff().init()?;
dev_info!(dev, "MMU: Initializing TTBs\n");
inner.handoff().lock();
let ttbs = inner.ttbs();
ttbs[0].ttb0.store(ttb0 | TTBR_VALID, Ordering::Relaxed);
ttbs[0]
.ttb1
.store(uat.pagetables_rgn.base | TTBR_VALID, Ordering::Relaxed);
for ctx in &ttbs[1..] {
ctx.ttb0.store(0, Ordering::Relaxed);
ctx.ttb1.store(0, Ordering::Relaxed);
}
inner.handoff().unlock();
core::mem::drop(inner);
uat.kpt0()[2].store(ttb1 | PTE_TABLE, Ordering::Relaxed);
dev_info!(dev, "MMU: initialized\n");
Ok(uat)
- }
+}
+impl Drop for Uat {
- fn drop(&mut self) {
// Unmap what we mapped
self.kpt0()[2].store(0, Ordering::Relaxed);
// Make sure we flush the TLBs
fence(Ordering::SeqCst);
mem::tlbi_all();
mem::sync();
- }
+}

diff --git a/drivers/gpu/drm/asahi/object.rs b/drivers/gpu/drm/asahi/object.rs
new file mode 100644
index 000000000000..449899b88181
--- /dev/null
+++ b/drivers/gpu/drm/asahi/object.rs
@@ -0,0 +1,704 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! Asahi GPU object model
+//!
+//! The AGX GPU includes a coprocessor that uses a large number of shared memory structures to
+//! communicate with the driver. These structures contain GPU VA pointers to each other, which are
+//! directly dereferenced by the firmware and are expected to always be valid for the usage
+//! lifetime of the containing struct (which is an implicit contract, not explicitly managed).
+//! Any faults cause an unrecoverable firmware crash, requiring a full system reboot.
+//!
+//! In order to manage this complexity safely, we implement a GPU object model using Rust's type
+//! system to enforce GPU object lifetime relationships. GPU objects represent an allocated piece
+//! of memory of a given type, mapped to the GPU (and usually also the CPU). On the CPU side,
+//! these objects are associated with a pure Rust structure that contains the objects it depends
+//! on (or references to them). This allows us to map Rust lifetimes into the GPU object model
+//! system. Then, GPU VA pointers also inherit those lifetimes, which means the Rust borrow checker
+//! can ensure that all pointers are assigned an address that is guaranteed to outlive the GPU
+//! object it points to.
+//!
+//! Since the firmware object model does have self-referencing pointers (and there is of course no
+//! underlying revocability mechanism to make it safe), we must have an escape hatch. GPU pointers
+//! can be weak pointers, which do not enforce lifetimes. In those cases, it is the user's
+//! responsibility to ensure that lifetime requirements are met.
+//!
+//! In other words, the model is necessarily leaky and there is no way to fully map Rust safety to
+//! GPU firmware object safety. The goal of the model is to make it easy to model the lifetimes of
+//! GPU objects and have the compiler help in avoiding mistakes, rather than to guarantee safety
+//! 100% of the time as would be the case for CPU-side Rust code.
+// TODO: There is a fundamental soundness issue with sharing memory with the GPU (that even
+// affects C code too). Since the GPU is free to mutate that memory at any time, normal reference
+// invariants cannot be enforced on the CPU side. For example, the compiler could perform an
+// optimization that assumes that a given memory location does not change between two reads, and
+// causes UB otherwise, and then the GPU could mutate that memory out from under the CPU.
+//
+// For cases where we *expect* this to happen, we use atomic types, which avoid this issue.
+// However, doing so for every single field of every type is a non-starter. Right now, there
+// seems to be no good solution for this that does not come with significant performance or
+// ergonomics downsides.
+//
+// In *practice* we are almost always only writing GPU memory, and only reading from atomics, so
+// the chances of this actually triggering UB (e.g. a security issue that can be triggered from
+// the GPU side) due to a compiler optimization are very slim.
+//
+// Further discussion: https://github.com/rust-lang/unsafe-code-guidelines/issues/152
+use kernel::{error::code::*, prelude::*};
+use alloc::boxed::Box;
+use core::fmt;
+use core::fmt::Debug;
+use core::fmt::Formatter;
+use core::marker::PhantomData;
+use core::mem::MaybeUninit;
+use core::num::NonZeroU64;
+use core::ops::{Deref, DerefMut, Index, IndexMut};
+use core::{mem, ptr, slice};
+use crate::alloc::Allocation;
+use crate::debug::*;
+use crate::fw::types::Zeroed;
+const DEBUG_CLASS: DebugFlags = DebugFlags::Object;
+/// A GPU-side strong pointer, which is a 64-bit non-zero VA with an associated lifetime.
+///
+/// In rare cases these pointers are not aligned, so this is `packed(1)`.
+#[repr(C, packed(1))]
+pub(crate) struct GpuPointer<'a, T: ?Sized>(NonZeroU64, PhantomData<&'a T>);
+impl<'a, T: ?Sized> GpuPointer<'a, T> {
- /// Logical OR the pointer with an arbitrary `u64`. This is used when GPU struct fields contain
- /// misc flag fields in the upper bits. The lifetime is retained. This is GPU-unsafe in
- /// principle, but we assert that only non-implemented address bits are touched, which is safe
- /// for pointers used by the GPU (not by firmware).
- pub(crate) fn or(&self, other: u64) -> GpuPointer<'a, T> {
// This will fail for kernel-half pointers, which should not be ORed.
assert_eq!(self.0.get() & other, 0);
// Assert that we only touch the high bits.
assert_eq!(other & 0xffffffffff, 0);
GpuPointer(self.0 | other, PhantomData)
- }
- /// Add an arbitrary offset to the pointer. This is not safe (from the GPU perspective), and
- /// should only be used via the `inner_ptr` macro to get pointers to inner fields, hence we mark
- /// it `unsafe` to discourage direct use.
- // NOTE: The third argument is a type inference hack.
- pub(crate) unsafe fn offset<U>(&self, off: usize, _: *const U) -> GpuPointer<'a, U> {
GpuPointer::<'a, U>(
NonZeroU64::new(self.0.get() + (off as u64)).unwrap(),
PhantomData,
)
- }
+}
+impl<'a, T: ?Sized> Debug for GpuPointer<'a, T> {
- fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
let val = self.0;
f.write_fmt(format_args!("{:#x} ({})", val, core::any::type_name::<T>()))
- }
+}
+/// Take a pointer to a sub-field within a structure pointed to by a GpuPointer, keeping the
+/// lifetime.
+#[macro_export]
+macro_rules! inner_ptr {
- ($gpuva:expr, $($f:tt)*) => ({
// This mirrors kernel::offset_of(), except we use type inference to avoid having to know
// the type of the pointer explicitly.
fn uninit_from<'a, T: GpuStruct>(_: GpuPointer<'a, T>) -> core::mem::MaybeUninit<T::Raw<'static>> {
core::mem::MaybeUninit::uninit()
}
let tmp = uninit_from($gpuva);
let outer = tmp.as_ptr();
// SAFETY: The pointer is valid and aligned, just not initialised; `addr_of` ensures that
// we don't actually read from `outer` (which would be UB) nor create an intermediate
// reference.
let p: *const _ = unsafe { core::ptr::addr_of!((*outer).$($f)*) };
let inner = p as *const u8;
// SAFETY: The two pointers are within the same allocation block.
let off = unsafe { inner.offset_from(outer as *const u8) };
// SAFETY: The resulting pointer is guaranteed to point to valid memory within the outer
// object.
unsafe { $gpuva.offset(off.try_into().unwrap(), p) }
- })
+}
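The core trick `inner_ptr!` expands to — computing a field offset from an uninitialised buffer without ever reading it — works in plain stable Rust. A minimal standalone sketch (the `Outer` struct is made up for illustration):

```rust
use core::mem::MaybeUninit;
use core::ptr::addr_of;

#[repr(C)]
struct Outer {
    a: u32,
    b: u64,
}

// Compute the byte offset of `b` without reading uninitialised memory,
// mirroring what inner_ptr!/inner_weak_ptr! expand to.
fn offset_of_b() -> usize {
    let tmp: MaybeUninit<Outer> = MaybeUninit::uninit();
    let outer = tmp.as_ptr();
    // SAFETY: addr_of! takes the address without creating a reference,
    // so the uninitialised data is never read.
    let inner = unsafe { addr_of!((*outer).b) } as *const u8;
    // SAFETY: both pointers lie within the same allocation.
    unsafe { inner.offset_from(outer as *const u8) as usize }
}

fn main() {
    // With repr(C), the u64 field is aligned to 8 bytes, so `b` sits at 8.
    assert_eq!(offset_of_b(), 8);
    println!("offset = {}", offset_of_b());
}
```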
+/// A GPU-side weak pointer, which is a 64-bit non-zero VA with no lifetime. +/// +/// In rare cases these pointers are not aligned, so this is `packed(1)`. +#[repr(C, packed(1))] +pub(crate) struct GpuWeakPointer<T: ?Sized>(NonZeroU64, PhantomData<*const T>);
+/// SAFETY: GPU weak pointers are always safe to share between threads. +unsafe impl<T: ?Sized> Send for GpuWeakPointer<T> {} +unsafe impl<T: ?Sized> Sync for GpuWeakPointer<T> {}
+// Weak pointers can be copied/cloned regardless of their target type. +impl<T: ?Sized> Copy for GpuWeakPointer<T> {}
+impl<T: ?Sized> Clone for GpuWeakPointer<T> {
- fn clone(&self) -> Self {
*self
- }
+}
+impl<T: ?Sized> GpuWeakPointer<T> {
- /// Add an arbitrary offset to the pointer. This is not safe (from the GPU perspective), and
- /// should only be used via the `inner_ptr` macro to get pointers to inner fields, hence we mark
- /// it `unsafe` to discourage direct use.
- // NOTE: The third argument is a type inference hack.
- pub(crate) unsafe fn offset<U>(&self, off: usize, _: *const U) -> GpuWeakPointer<U> {
GpuWeakPointer::<U>(
NonZeroU64::new(self.0.get() + (off as u64)).unwrap(),
PhantomData,
)
- }
- /// Upgrade a weak pointer into a strong pointer. This is not considered safe from the GPU
- /// perspective.
- pub(crate) unsafe fn upgrade<'a>(&self) -> GpuPointer<'a, T> {
GpuPointer(self.0, PhantomData)
- }
+}
+impl<T: ?Sized> Debug for GpuWeakPointer<T> {
- fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
let val = self.0;
f.write_fmt(format_args!("{:#x} ({})", val, core::any::type_name::<T>()))
- }
+}
+/// Take a pointer to a sub-field within a structure pointed to by a GpuWeakPointer. +#[macro_export] +macro_rules! inner_weak_ptr {
- ($gpuva:expr, $($f:tt)*) => ({
// See inner_ptr()
fn uninit_from<T: GpuStruct>(_: GpuWeakPointer<T>) -> core::mem::MaybeUninit<T::Raw<'static>> {
core::mem::MaybeUninit::uninit()
}
let tmp = uninit_from($gpuva);
let outer = tmp.as_ptr();
// SAFETY: The pointer is valid and aligned, just not initialised; `addr_of` ensures that
// we don't actually read from `outer` (which would be UB) nor create an intermediate
// reference.
let p: *const _ = unsafe { core::ptr::addr_of!((*outer).$($f)*) };
let inner = p as *const u8;
// SAFETY: The two pointers are within the same allocation block.
let off = unsafe { inner.offset_from(outer as *const u8) };
// SAFETY: The resulting pointer is guaranteed to point to valid memory within the outer
// object.
unsafe { $gpuva.offset(off.try_into().unwrap(), p) }
- })
+}
+/// Types that implement this trait represent a GPU structure from the CPU side. +/// +/// The `Raw` type represents the actual raw structure definition on the GPU side. +/// +/// Types implementing [`GpuStruct`] must have fields owning any objects (or strong references +/// to them) that GPU pointers in the `Raw` structure point to. This mechanism is used to enforce +/// lifetimes. +pub(crate) trait GpuStruct: 'static {
- /// The type of the GPU-side structure definition representing the firmware struct layout.
- type Raw<'a>;
+}
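The `GpuStruct` pattern relies on a generic associated type to tie raw GPU-visible pointers to the lifetime of the CPU-side owner. A standalone sketch of the idea, with made-up types standing in for the driver's real ones:

```rust
use core::marker::PhantomData;

// The CPU-side type owns the data; the GAT `Raw<'a>` represents the
// GPU-visible layout, borrowing from that owner.
trait GpuStruct: 'static {
    type Raw<'a>;
}

struct Buffer {
    data: Vec<u8>,
}

#[repr(C)]
struct RawBuffer<'a> {
    addr: *const u8,
    len: u64,
    _p: PhantomData<&'a ()>,
}

impl GpuStruct for Buffer {
    type Raw<'a> = RawBuffer<'a>;
}

// The raw view cannot outlive the Buffer it points into.
fn make_raw(b: &Buffer) -> <Buffer as GpuStruct>::Raw<'_> {
    RawBuffer {
        addr: b.data.as_ptr(),
        len: b.data.len() as u64,
        _p: PhantomData,
    }
}

fn main() {
    let b = Buffer { data: vec![1, 2, 3] };
    let raw = make_raw(&b);
    assert_eq!(raw.len, 3);
}
```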
+/// An instance of a GPU object in memory. +/// +/// # Invariants +/// `raw` must point to a valid mapping of the `T::Raw` type associated with the `alloc` allocation. +/// `gpu_ptr` must be the GPU address of the same object. +pub(crate) struct GpuObject<T: GpuStruct, U: Allocation<T>> {
- raw: *mut T::Raw<'static>,
- alloc: U,
- gpu_ptr: GpuWeakPointer<T>,
- inner: Box<T>,
+}
+impl<T: GpuStruct, U: Allocation<T>> GpuObject<T, U> {
- /// Create a new GpuObject given an allocator and the inner data (a type implementing
- /// GpuStruct).
- ///
- /// The caller passes a closure that constructs the `T::Raw` type given a reference to the
- /// `GpuStruct`. This is the mechanism used to enforce lifetimes.
- pub(crate) fn new(
alloc: U,
inner: T,
callback: impl for<'a> FnOnce(&'a T) -> T::Raw<'a>,
- ) -> Result<Self> {
let size = mem::size_of::<T::Raw<'static>>();
if size > 0x1000 {
dev_crit!(
alloc.device(),
"Allocating {} of size {:#x}, with new, please use new_boxed!\n",
core::any::type_name::<T>(),
size
);
}
if alloc.size() < size {
return Err(ENOMEM);
}
let gpu_ptr =
GpuWeakPointer::<T>(NonZeroU64::new(alloc.gpu_ptr()).ok_or(EINVAL)?, PhantomData);
mod_dev_dbg!(
alloc.device(),
"Allocating {} @ {:#x}\n",
core::any::type_name::<T>(),
alloc.gpu_ptr()
);
let p = alloc.ptr().ok_or(EINVAL)?.as_ptr() as *mut T::Raw<'static>;
let mut raw = callback(&inner);
// SAFETY: `p` is guaranteed to be valid per the Allocation invariant, and the type is
// identical to the type of `raw` other than the lifetime.
unsafe { p.copy_from(&mut raw as *mut _ as *mut u8 as *mut _, 1) };
mem::forget(raw);
Ok(Self {
raw: p,
gpu_ptr,
alloc,
inner: Box::try_new(inner)?,
})
- }
- /// Create a new GpuObject given an allocator and the boxed inner data (a type implementing
- /// GpuStruct).
- ///
- /// The caller passes a closure that initializes the `T::Raw` type given a reference to the
- /// `GpuStruct` and a `MaybeUninit<T::Raw>`. This is intended to be used with the place!()
- /// macro to avoid constructing the whole `T::Raw` object on the stack.
- pub(crate) fn new_boxed(
alloc: U,
inner: Box<T>,
callback: impl for<'a> FnOnce(
&'a T,
&'a mut MaybeUninit<T::Raw<'a>>,
) -> Result<&'a mut T::Raw<'a>>,
- ) -> Result<Self> {
if alloc.size() < mem::size_of::<T::Raw<'static>>() {
return Err(ENOMEM);
}
let gpu_ptr =
GpuWeakPointer::<T>(NonZeroU64::new(alloc.gpu_ptr()).ok_or(EINVAL)?, PhantomData);
mod_dev_dbg!(
alloc.device(),
"Allocating {} @ {:#x}\n",
core::any::type_name::<T>(),
alloc.gpu_ptr()
);
let p = alloc.ptr().ok_or(EINVAL)?.as_ptr() as *mut MaybeUninit<T::Raw<'_>>;
// SAFETY: `p` is guaranteed to be valid per the Allocation invariant.
let raw = callback(&inner, unsafe { &mut *p })?;
if p as *mut T::Raw<'_> != raw as *mut _ {
dev_err!(
alloc.device(),
"Allocation callback returned a mismatched reference ({})\n",
core::any::type_name::<T>(),
);
return Err(EINVAL);
}
Ok(Self {
raw: p as *mut u8 as *mut T::Raw<'static>,
gpu_ptr,
alloc,
inner,
})
- }
- /// Create a new GpuObject given an allocator and the inner data (a type implementing
- /// GpuStruct).
- ///
- /// The caller passes a closure that initializes the `T::Raw` type given a reference to the
- /// `GpuStruct` and a `MaybeUninit<T::Raw>`. This is intended to be used with the place!()
- /// macro to avoid constructing the whole `T::Raw` object on the stack.
- pub(crate) fn new_inplace(
alloc: U,
inner: T,
callback: impl for<'a> FnOnce(
&'a T,
&'a mut MaybeUninit<T::Raw<'a>>,
) -> Result<&'a mut T::Raw<'a>>,
- ) -> Result<Self> {
GpuObject::<T, U>::new_boxed(alloc, Box::try_new(inner)?, callback)
- }
- /// Create a new GpuObject given an allocator, with callback-based initialization.
- ///
- /// This is used when the construction of the `T` type requires knowing the GPU VA address of
- /// the structure that is being constructed ahead of time. The first callback constructs a
- /// `Box<T>` given the pointer to the about-to-be-initialized GPU structure, and the second
- /// callback initializes that structure as in `new_boxed`.
- pub(crate) fn new_prealloc(
alloc: U,
inner_cb: impl FnOnce(GpuWeakPointer<T>) -> Result<Box<T>>,
raw_cb: impl for<'a> FnOnce(
&'a T,
&'a mut MaybeUninit<T::Raw<'a>>,
) -> Result<&'a mut T::Raw<'a>>,
- ) -> Result<Self> {
if alloc.size() < mem::size_of::<T::Raw<'static>>() {
return Err(ENOMEM);
}
let gpu_ptr =
GpuWeakPointer::<T>(NonZeroU64::new(alloc.gpu_ptr()).ok_or(EINVAL)?, PhantomData);
mod_dev_dbg!(
alloc.device(),
"Allocating {} @ {:#x}\n",
core::any::type_name::<T>(),
alloc.gpu_ptr()
);
let inner = inner_cb(gpu_ptr)?;
let p = alloc.ptr().ok_or(EINVAL)?.as_ptr() as *mut MaybeUninit<T::Raw<'_>>;
// SAFETY: `p` is guaranteed to be valid per the Allocation invariant.
let raw = raw_cb(&*inner, unsafe { &mut *p })?;
if p as *mut T::Raw<'_> != raw as *mut _ {
dev_err!(
alloc.device(),
"Allocation callback returned a mismatched reference ({})\n",
core::any::type_name::<T>(),
);
return Err(EINVAL);
}
Ok(Self {
raw: p as *mut u8 as *mut T::Raw<'static>,
gpu_ptr,
alloc,
inner,
})
- }
- /// Returns the GPU VA of this object (as a raw [`NonZeroU64`])
- pub(crate) fn gpu_va(&self) -> NonZeroU64 {
self.gpu_ptr.0
- }
- /// Returns a strong GPU pointer to this object, with a lifetime.
- pub(crate) fn gpu_pointer(&self) -> GpuPointer<'_, T> {
GpuPointer(self.gpu_ptr.0, PhantomData)
- }
- /// Returns a weak GPU pointer to this object, with no lifetime.
- pub(crate) fn weak_pointer(&self) -> GpuWeakPointer<T> {
GpuWeakPointer(self.gpu_ptr.0, PhantomData)
- }
- /// Perform a mutation to the inner `Raw` data given a user-supplied callback.
- ///
- /// The callback gets a mutable reference to the `GpuStruct` type.
- pub(crate) fn with_mut<RetVal>(
&mut self,
callback: impl for<'a> FnOnce(&'a mut <T as GpuStruct>::Raw<'a>, &'a mut T) -> RetVal,
- ) -> RetVal {
// SAFETY: `self.raw` is valid per the type invariant, and the second half is just
// converting lifetimes.
unsafe { callback(&mut *self.raw, &mut *(&mut *self.inner as *mut _)) }
- }
- /// Access the inner `Raw` data given a user-supplied callback.
- ///
- /// The callback gets a reference to the `GpuStruct` type.
- pub(crate) fn with<RetVal>(
&self,
callback: impl for<'a> FnOnce(&'a <T as GpuStruct>::Raw<'a>, &'a T) -> RetVal,
- ) -> RetVal {
// SAFETY: `self.raw` is valid per the type invariant, and the second half is just
// converting lifetimes.
unsafe { callback(&*self.raw, &*(&*self.inner as *const _)) }
- }
+}
+impl<T: GpuStruct, U: Allocation<T>> Deref for GpuObject<T, U> {
- type Target = T;
- fn deref(&self) -> &Self::Target {
&self.inner
- }
+}
+impl<T: GpuStruct, U: Allocation<T>> DerefMut for GpuObject<T, U> {
- fn deref_mut(&mut self) -> &mut Self::Target {
&mut self.inner
- }
+}
+impl<T: GpuStruct + Debug, U: Allocation<T>> Debug for GpuObject<T, U> +where
- <T as GpuStruct>::Raw<'static>: Debug,
+{
- fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
f.debug_struct(core::any::type_name::<T>())
// SAFETY: `self.raw` is valid per the type invariant.
.field("raw", &format_args!("{:#X?}", unsafe { &*self.raw }))
.field("inner", &format_args!("{:#X?}", &self.inner))
.field("alloc", &format_args!("{:?}", &self.alloc))
.finish()
- }
+}
+impl<T: GpuStruct + Default, U: Allocation<T>> GpuObject<T, U> +where
- for<'a> <T as GpuStruct>::Raw<'a>: Default + Zeroed,
+{
- /// Create a new GpuObject with default data. `T` must implement `Default` and `T::Raw` must
- /// implement `Zeroed`, since the GPU-side memory is initialized by zeroing.
- pub(crate) fn new_default(alloc: U) -> Result<Self> {
GpuObject::<T, U>::new_inplace(alloc, Default::default(), |_inner, raw| {
// SAFETY: `raw` is valid here, and `T::Raw` implements `Zeroed`.
Ok(unsafe {
ptr::write_bytes(raw, 0, 1);
(*raw).assume_init_mut()
})
})
- }
+}
+impl<T: GpuStruct, U: Allocation<T>> Drop for GpuObject<T, U> {
- fn drop(&mut self) {
mod_dev_dbg!(
self.alloc.device(),
"Dropping {} @ {:?}\n",
core::any::type_name::<T>(),
self.gpu_pointer()
);
- }
+}
+// SAFETY: GpuObjects are Send as long as the GpuStruct itself is Send +unsafe impl<T: GpuStruct + Send, U: Allocation<T>> Send for GpuObject<T, U> {} +// SAFETY: GpuObjects are Sync as long as the GpuStruct itself is Sync +unsafe impl<T: GpuStruct + Sync, U: Allocation<T>> Sync for GpuObject<T, U> {}
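The bounded `unsafe impl` pattern used here — a raw pointer makes a type `!Send` by default, so thread-safety is re-asserted only when the pointee itself allows it — can be shown with a small standalone sketch (the `Handle` type is hypothetical, not from the driver):

```rust
use std::thread;

// A raw pointer to an owned allocation; exclusive access is guaranteed
// because Handle is the sole owner.
struct Handle<T> {
    ptr: *mut T,
}

// SAFETY: Handle owns its allocation, so sending it just moves that
// ownership; this is only sound when T itself can move between threads.
unsafe impl<T: Send> Send for Handle<T> {}

impl<T> Handle<T> {
    fn new(v: T) -> Self {
        Self { ptr: Box::into_raw(Box::new(v)) }
    }
    fn get(&self) -> &T {
        // SAFETY: `ptr` is valid for the lifetime of the Handle.
        unsafe { &*self.ptr }
    }
}

impl<T> Drop for Handle<T> {
    fn drop(&mut self) {
        // SAFETY: `ptr` came from Box::into_raw and is dropped exactly once.
        unsafe { drop(Box::from_raw(self.ptr)) };
    }
}

fn main() {
    let h = Handle::new(7u32);
    // Compiles only because u32: Send makes Handle<u32>: Send.
    let v = thread::spawn(move || *h.get()).join().unwrap();
    assert_eq!(v, 7);
}
```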
+/// Trait used to erase the type of a GpuObject, used when we need to keep a list of heterogeneous +/// objects around. +pub(crate) trait OpaqueGpuObject: Send + Sync {
- fn gpu_va(&self) -> NonZeroU64;
+}
+impl<T: GpuStruct + Sync + Send, U: Allocation<T>> OpaqueGpuObject for GpuObject<T, U> {
- fn gpu_va(&self) -> NonZeroU64 {
Self::gpu_va(self)
- }
+}
+/// An array of raw GPU objects that is only accessible to the GPU (no CPU-side mapping required). +/// +/// This must necessarily be uninitialized as far as the GPU is concerned, so it cannot be used +/// when initialization is required. +/// +/// # Invariants +/// +/// `alloc` is valid and at least as large as `len` times the size of one `T`. +/// `gpu_ptr` is valid and points to the allocation start. +pub(crate) struct GpuOnlyArray<T, U: Allocation<T>> {
- len: usize,
- alloc: U,
- gpu_ptr: NonZeroU64,
- _p: PhantomData<T>,
+}
+impl<T, U: Allocation<T>> GpuOnlyArray<T, U> {
- /// Allocate a new GPU-only array with the given length.
- pub(crate) fn new(alloc: U, count: usize) -> Result<GpuOnlyArray<T, U>> {
let bytes = count * mem::size_of::<T>();
let gpu_ptr = NonZeroU64::new(alloc.gpu_ptr()).ok_or(EINVAL)?;
if alloc.size() < bytes {
return Err(ENOMEM);
}
Ok(Self {
len: count,
alloc,
gpu_ptr,
_p: PhantomData,
})
- }
-    /// Returns the GPU VA of this array (as a raw [`NonZeroU64`])
- pub(crate) fn gpu_va(&self) -> NonZeroU64 {
self.gpu_ptr
- }
- /// Returns a strong GPU pointer to this array, with a lifetime.
- pub(crate) fn gpu_pointer(&self) -> GpuPointer<'_, &'_ [T]> {
GpuPointer(self.gpu_ptr, PhantomData)
- }
- /// Returns a weak GPU pointer to this array, with no lifetime.
- pub(crate) fn weak_pointer(&self) -> GpuWeakPointer<[T]> {
GpuWeakPointer(self.gpu_ptr, PhantomData)
- }
- /// Returns a pointer to an offset within the array (as a subslice).
- pub(crate) fn gpu_offset_pointer(&self, offset: usize) -> GpuPointer<'_, &'_ [T]> {
if offset > self.len {
panic!("Index {} out of bounds (len: {})", offset, self.len);
}
GpuPointer(
NonZeroU64::new(self.gpu_ptr.get() + (offset * mem::size_of::<T>()) as u64).unwrap(),
PhantomData,
)
- }
- /* Not used yet
- /// Returns a weak pointer to an offset within the array (as a subslice).
- pub(crate) fn weak_offset_pointer(&self, offset: usize) -> GpuWeakPointer<[T]> {
if offset > self.len {
panic!("Index {} out of bounds (len: {})", offset, self.len);
}
GpuWeakPointer(
NonZeroU64::new(self.gpu_ptr.get() + (offset * mem::size_of::<T>()) as u64).unwrap(),
PhantomData,
)
- }
- /// Returns a pointer to an element within the array.
- pub(crate) fn gpu_item_pointer(&self, index: usize) -> GpuPointer<'_, &'_ T> {
if index >= self.len {
panic!("Index {} out of bounds (len: {})", index, self.len);
}
GpuPointer(
NonZeroU64::new(self.gpu_ptr.get() + (index * mem::size_of::<T>()) as u64).unwrap(),
PhantomData,
)
- }
- */
- /// Returns a weak pointer to an element within the array.
- pub(crate) fn weak_item_pointer(&self, index: usize) -> GpuWeakPointer<T> {
if index >= self.len {
panic!("Index {} out of bounds (len: {})", index, self.len);
}
GpuWeakPointer(
NonZeroU64::new(self.gpu_ptr.get() + (index * mem::size_of::<T>()) as u64).unwrap(),
PhantomData,
)
- }
- /// Returns the length of the array.
- pub(crate) fn len(&self) -> usize {
self.len
- }
+}
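The address arithmetic behind `gpu_offset_pointer()` and `weak_item_pointer()` is plain strided indexing with a bounds check. A standalone sketch (function name is illustrative):

```rust
use core::mem;

// Element i of an array at GPU VA `base` lives at base + i * size_of::<T>(),
// provided i is within bounds.
fn item_va<T>(base: u64, len: usize, index: usize) -> u64 {
    assert!(index < len, "Index {} out of bounds (len: {})", index, len);
    base + (index * mem::size_of::<T>()) as u64
}

fn main() {
    let base = 0x1_0000_0000u64;
    // u64 elements are 8 bytes, so element 3 sits 24 bytes in.
    assert_eq!(item_va::<u64>(base, 16, 3), base + 24);
    // 128-byte elements: element 2 sits 256 bytes in.
    assert_eq!(item_va::<[u8; 128]>(base, 4, 2), base + 256);
}
```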
+impl<T: Debug, U: Allocation<T>> Debug for GpuOnlyArray<T, U> {
- fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
f.debug_struct(core::any::type_name::<T>())
.field("len", &format_args!("{:#X?}", self.len()))
.finish()
- }
+}
+impl<T, U: Allocation<T>> Drop for GpuOnlyArray<T, U> {
- fn drop(&mut self) {
mod_dev_dbg!(
self.alloc.device(),
"Dropping {} @ {:?}\n",
core::any::type_name::<T>(),
self.gpu_pointer()
);
- }
+}
+/// An array of raw GPU objects that is also CPU-accessible. +/// +/// # Invariants +/// +/// `raw` is valid and points to the CPU-side view of the array (which must have one). +pub(crate) struct GpuArray<T, U: Allocation<T>> {
- raw: *mut T,
- array: GpuOnlyArray<T, U>,
+}
+/* Not used yet +impl<T: Copy, U: Allocation<T>> GpuArray<T, U> {
- /// Allocate a new GPU array, copying the contents from a slice.
- pub(crate) fn new(alloc: U, data: &[T]) -> Result<GpuArray<T, U>> {
let p = alloc.ptr().ok_or(EINVAL)?.as_ptr();
let inner = GpuOnlyArray::new(alloc, data.len())?;
// SAFETY: `p` is valid per the Allocation type invariant, and GpuOnlyArray guarantees
// that its size is at least as large as `data.len()`.
unsafe { ptr::copy(data.as_ptr(), p, data.len()) };
Ok(Self {
raw: p,
array: inner,
})
- }
+} +*/
+impl<T: Default, U: Allocation<T>> GpuArray<T, U> {
- /// Allocate a new GPU array, initializing each element to its default.
- pub(crate) fn empty(alloc: U, count: usize) -> Result<GpuArray<T, U>> {
let p = alloc.ptr().ok_or(EINVAL)?.as_ptr() as *mut T;
let inner = GpuOnlyArray::new(alloc, count)?;
let mut pi = p;
for _i in 0..count {
// SAFETY: `pi` is valid per the Allocation type invariant, and GpuOnlyArray guarantees
// that it can never iterate beyond the buffer length.
unsafe {
pi.write(Default::default());
pi = pi.add(1);
}
}
Ok(Self {
raw: p,
array: inner,
})
- }
+}
+impl<T, U: Allocation<T>> GpuArray<T, U> {
- /// Get a slice view of the array contents.
- pub(crate) fn as_slice(&self) -> &[T] {
// SAFETY: self.raw / self.len are valid per the type invariant
unsafe { slice::from_raw_parts(self.raw, self.len) }
- }
- /// Get a mutable slice view of the array contents.
- pub(crate) fn as_mut_slice(&mut self) -> &mut [T] {
// SAFETY: self.raw / self.len are valid per the type invariant
unsafe { slice::from_raw_parts_mut(self.raw, self.len) }
- }
+}
+impl<T, U: Allocation<T>> Deref for GpuArray<T, U> {
- type Target = GpuOnlyArray<T, U>;
- fn deref(&self) -> &GpuOnlyArray<T, U> {
&self.array
- }
+}
+impl<T, U: Allocation<T>> Index<usize> for GpuArray<T, U> {
- type Output = T;
- fn index(&self, index: usize) -> &T {
if index >= self.len {
panic!("Index {} out of bounds (len: {})", index, self.len);
}
// SAFETY: This is bounds checked above
unsafe { &*(self.raw.add(index)) }
- }
+}
+impl<T, U: Allocation<T>> IndexMut<usize> for GpuArray<T, U> {
- fn index_mut(&mut self, index: usize) -> &mut T {
if index >= self.len {
panic!("Index {} out of bounds (len: {})", index, self.len);
}
// SAFETY: This is bounds checked above
unsafe { &mut *(self.raw.add(index)) }
- }
+}
+// SAFETY: GpuArray is Send as long as the contained type itself is Send +unsafe impl<T: Send, U: Allocation<T>> Send for GpuArray<T, U> {} +// SAFETY: GpuArray is Sync as long as the contained type itself is Sync +unsafe impl<T: Sync, U: Allocation<T>> Sync for GpuArray<T, U> {}
+impl<T: Debug, U: Allocation<T>> Debug for GpuArray<T, U> {
- fn fmt(&self, f: &mut Formatter<'_>) -> fmt::Result {
f.debug_struct(core::any::type_name::<T>())
.field("array", &format_args!("{:#X?}", self.as_slice()))
.finish()
- }
+} diff --git a/drivers/gpu/drm/asahi/place.rs b/drivers/gpu/drm/asahi/place.rs new file mode 100644 index 000000000000..40c51f4fab8d --- /dev/null +++ b/drivers/gpu/drm/asahi/place.rs @@ -0,0 +1,343 @@ +// SPDX-License-Identifier: Apache-2.0 OR MIT
+//! "Placement new" macro +//! +//! This cursed abomination of a declarative macro is used to emulate a "placement new" feature, +//! which allows initializing objects directly in a user-provided memory region without first +//! going through the stack. +//! +//! This driver needs to manage several large GPU objects of a fixed layout. Linux kernel stacks are +//! very small, so it is impossible to create these objects on the stack. While the compiler can +//! sometimes optimize away the stack copy and directly instantiate in target memory, this is not +//! guaranteed and not reliable. Therefore, we need some mechanism to ergonomically initialize +//! complex structures directly in a pre-allocated piece of memory. +//! +//! This issue also affects some driver-internal structs which are large/complex enough to overflow +//! the stack. While this can be solved by breaking them up into pieces and using `Box` more +//! liberally, this has performance implications and still isn't very nice. This macro can also be +//! used to solve this issue. +//! +//! # Further reading +//! https://github.com/rust-lang/rust/issues/27779#issuecomment-378416911 +//! https://internals.rust-lang.org/t/removal-of-all-unstable-placement-features...
+/// Initialize a `MaybeUninit` in-place, without constructing the value on the stack first. +/// +/// This macro is analogous to `MaybeUninit::write()`. In other words, +/// `place!(foo, bar)` is equivalent to `MaybeUninit::write(foo, bar)`, except that `bar` is not +/// constructed first, but rather its fields (if it is a structure constructor) are copied one by +/// one into the correct location in the `MaybeUninit`. +/// +/// The macro supports most Rust initialization syntax including type paths, generic arguments, +/// and nested structures. Nested structures are themselves initialized in-place field by field. +/// `..Default::default()` is supported, but this macro converts it to `..Zeroed::zeroed()`, as it +/// initializes those structs by zero-initializing the underlying memory. Usage of +/// `..Default::default()` with a type not implementing `Zeroed` will result in a compile error. +/// +/// Usage: +/// ``` +/// let mut buf = MaybeUninit::uninit(); +/// let mut_ref = place!(&mut buf, MyStruct { +/// b: true, +/// s: String::from("works"), +/// i: str::parse::<i32>("123").unwrap(), +/// v: vec![String::from("works")], +/// x: foo::MyOtherCoolStruct { +/// a: false, +/// b: String::from("Hello, world!"), +/// }, +/// y: foo::MyOtherCoolStruct { +/// a: false, +/// b: String::from("Hello, world!"), +/// }, +/// z: foo::MyCoolGenericStruct::<bool, String> { +/// a: false, +/// b: String::from("Hello, world!"), +/// }, +/// }); +/// // `mut_ref` is now a mutable reference to `buf`, which is now safely initialized. +/// ``` +/// +/// Based on https://crates.io/crates/place by DianaNites, with contributions by Joshua Barretto. +#[macro_export] +macro_rules! place {
- // Top-level struct
- (@STRUCT $ptr:ident, _TOP, $typ:path, {$($typ_init:tt)*} { $($fields:tt)* }) => {{
place!(@STRUCT_ZERO $ptr, {$($typ_init)*} { $($fields)* });
place!(@STRUCT_CHECK $ptr, {$($typ_init)*} { $($fields)* } {
place!(@FIELDS $ptr, $($fields)*);
});
- }};
- // Nested structure
- (@STRUCT $ptr:ident, $f_struct:ident, $typ:path, {$($typ_init:tt)*} { $($fields:tt)* }) => {{
use core::ptr::addr_of_mut;
let buf = unsafe { addr_of_mut!((*$ptr).$f_struct) };
place!(@STRUCT_ZERO buf, {$($typ_init)*} { $($fields)* });
place!(@STRUCT_CHECK $ptr, {$($typ_init)*} { $($fields)* } {
place!(@FIELDS buf, $($fields)*);
});
- }};
-    // Zero-initialize structure if the initializer ends in ..Default::default()
- (@STRUCT_ZERO $ptr:ident, {$($typ_init:tt)*} { $($f:ident $(: $v:expr)?),* $(,)? }) => {};
- (@STRUCT_ZERO $ptr:ident, {$($typ_init:tt)*} { $($($f:ident $(: $v:expr)?),*,)? ..Default::default() }) => {{
// Check that the structure actually implements Zeroed
const _: () = {
fn _check_default() {
let _ = $($typ_init)* {
..Zeroed::zeroed()
};
}
};
use core::ptr;
unsafe { ptr::write_bytes($ptr, 0, 1) };
- }};
- // Check that all fields are specified
- (@STRUCT_CHECK $ptr:ident, {$($typ_init:tt)*} { $($($f:ident $(: $v:expr)?),*,)? ..Default::default() } {$($body:tt)*}) => {
if false {
#[allow(clippy::redundant_field_names)]
let _x = $($typ_init)* {
$($(
$f $(: $v)?
),*
,)?
..Zeroed::zeroed()
};
} else {
{$($body)*}
}
- };
- (@STRUCT_CHECK $ptr:ident, {$($typ_init:tt)*} { $($f:ident $(: $v:expr)?),* $(,)? } {$($body:tt)*}) => {
if false {
#[allow(clippy::redundant_field_names)]
let _x = $($typ_init)* {
$(
$f $(: $v)?
),*
};
} else {
{$($body)*}
}
- };
- // Top-level scalar
- (@SCALAR $ptr:ident, _TOP, $val:expr) => {
let tmp = $val;
unsafe { $ptr.write(tmp); }
- };
- // Regular field
- (@SCALAR $ptr:ident, $f:ident, $val:expr) => {{
use core::ptr::addr_of_mut;
let tmp = $val;
unsafe { addr_of_mut!((*$ptr).$f).write(tmp); }
- }};
- // Type-like name followed by braces is a nested structure
- (@PARTIAL $ptr:ident, $f:ident, {$($head:tt)*}, {{ $($fields:tt)* } $($tail:tt)*}) => {
place!(@STRUCT $ptr, $f, $($head)*, {$($head)*} { $($fields)* });
place!(@FIELDS $ptr $($tail)*)
- };
- // Type-like name followed by ::ident, append to head
- (@PARTIAL $ptr:ident, $f:ident, {$($head:tt)*}, {::$id:ident $($tail:tt)*}) => {
place!(@PARTIAL $ptr, $f, {$($head)* :: $id}, {$($tail)*});
- };
- // Type-like name followed by ::<args>, append to head
- (@PARTIAL $ptr:ident, $f:ident, {$($head:tt)*}, {::<$($gen:ty),*> $($tail:tt)*}) => {
place!(@PARTIAL $ptr, $f, {$($head)* :: <$($gen),*>}, {$($tail)*});
- };
- // Type-like name followed by ::<'lifetime>, append to head
- (@PARTIAL $ptr:ident, $f:ident, {$($head:tt)*}, {::<$li:lifetime> $($tail:tt)*}) => {
place!(@PARTIAL $ptr, $f, {$($head)* :: <$li>}, {$($tail)*});
- };
- // Anything else, parse it as an expression
- (@PARTIAL $ptr:ident, $f:ident, {$($head:tt)*}, {$($tail:tt)*}) => {
place!(@EXPR $ptr, $f, $($head)* $($tail)*)
- };
- // Expression followed by more fields
- (@EXPR $ptr:ident, $f:ident, $val:expr, $($tail:tt)*) => {
place!(@SCALAR $ptr, $f, $val);
place!(@FIELDS $ptr, $($tail)*)
- };
- // Last field expression, without a trailing comma
- (@EXPR $ptr:ident, $f:ident, $val:expr) => {
place!(@SCALAR $ptr, $f, $val);
- };
- // Field with a value starting with an ident, start incremental type parsing
- (@FIELDS $ptr:ident, $f:ident : $id:ident $($tail:tt)*) => {
place!(@PARTIAL $ptr, $f, {$id}, {$($tail)*});
- };
- // Same, but starting with ::ident
- (@FIELDS $ptr:ident, $f:ident : ::$id:ident $($tail:tt)*) => {
place!(@PARTIAL $ptr, $f, {::$id}, {$($tail)*});
- };
- // Otherwise, parse it as an expression
- (@FIELDS $ptr:ident, $f:ident : $($tail:tt)*) => {
place!(@EXPR $ptr, $f, $($tail)*)
- };
- // Default terminating case
- (@FIELDS $ptr:ident, ..Default::default() ) => {};
- // Terminating case
- (@FIELDS $ptr:ident $(,)? ) => {};
- (
$buf:expr,
$($val:tt)*
- ) => {{
use core::mem::MaybeUninit;
// Ensures types are correct
let obj: &mut MaybeUninit<_> = $buf;
let top_ptr = obj.as_mut_ptr();
place!(@FIELDS top_ptr, _TOP: $($val)*);
// SAFETY: All fields have been initialized above
// The compiler ensures that all fields were used, all types were correct,
// and that size and alignment are correct.
unsafe { obj.assume_init_mut() }
- }};
+}
+/// Helper macro to get the struct type part of a struct initialization expression. +#[macro_export] +#[doc(hidden)] +macro_rules! get_type {
- ($t:ty { $($val:tt)* }) => {
$t
- };
+}
+/// Like `Box::try_new(...)`, but with in-place initialization. +#[macro_export] +macro_rules! box_in_place {
- ($($val:tt)*) => {{
use $crate::place;
let b = Box::<$crate::get_type!($($val)*)>::try_new_uninit();
match b {
Ok(mut p) => {
place!((&mut *p), $($val)*);
Ok(unsafe { p.assume_init() })
}
Err(e) => Err(e)
}
- }};
+}
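The heap-side initialization that `box_in_place!` performs can be reproduced with only stable std APIs. A standalone sketch, assuming a made-up `Big` struct: allocate uninitialised storage on the heap, write each field through raw pointers, then reinterpret the box once everything is initialised — no stack copy of `Big` is ever made.

```rust
use core::mem::MaybeUninit;
use core::ptr::addr_of_mut;

#[derive(Debug, PartialEq)]
struct Big {
    a: u64,
    b: [u8; 16],
}

// Initialise a Big directly on the heap, field by field — the same idea
// box_in_place!/place! implement, expressed with stable std APIs.
fn big_in_place(a: u64) -> Box<Big> {
    let mut buf: Box<MaybeUninit<Big>> = Box::new(MaybeUninit::uninit());
    let p = buf.as_mut_ptr();
    unsafe {
        // addr_of_mut! avoids creating references to uninitialised fields.
        addr_of_mut!((*p).a).write(a);
        addr_of_mut!((*p).b).write([0u8; 16]);
        // SAFETY: every field of Big was initialised above.
        Box::from_raw(Box::into_raw(buf) as *mut Big)
    }
}

fn main() {
    let b = big_in_place(42);
    assert_eq!(b.a, 42);
    assert_eq!(b.b[0], 0);
}
```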
+// TODO: figure out how to make this run +#[cfg(test)] +mod tests {
- use super::*;
- use core::mem::MaybeUninit;
- #[derive(Debug, PartialEq)]
- struct MyCoolStruct {
b: bool,
s: String,
i: i32,
v: Vec<String>,
x: MyOtherCoolStruct,
y: MyOtherCoolStruct,
z: foo::MyCoolGenericStruct<bool, String>,
- }
- #[derive(Debug, PartialEq)]
- struct MyDefaultStruct {
b: bool,
i: i32,
j: i16,
- }
- default_zeroed!(MyDefaultStruct);
- mod foo {
#[derive(Debug, PartialEq)]
pub struct MyOtherCoolStruct {
pub a: bool,
pub b: String,
}
#[derive(Debug, PartialEq)]
pub struct MyCoolGenericStruct<T, U> {
pub a: T,
pub b: U,
}
- }
- use foo::MyOtherCoolStruct;
- #[test]
- fn test_initialized() {
let mut buf: MaybeUninit<MyCoolStruct> = MaybeUninit::uninit();
let x: &mut MyCoolStruct = place!(
&mut buf,
MyCoolStruct {
b: true,
s: String::from("works"),
i: str::parse::<i32>("123").unwrap(),
v: vec![String::from("works")],
x: MyOtherCoolStruct {
a: false,
b: String::from("Hello, world!"),
},
y: foo::MyOtherCoolStruct {
a: false,
b: String::from("Hello, world!"),
},
z: foo::MyCoolGenericStruct::<bool, String> {
a: false,
b: String::from("Hello, world!"),
}
}
);
//dbg!(x);
assert_eq!(
x,
&MyCoolStruct {
b: true,
s: String::from("works"),
i: str::parse::<i32>("123").unwrap(),
v: vec![String::from("works")],
x: foo::MyOtherCoolStruct {
a: false,
b: String::from("Hello, world!"),
},
y: foo::MyOtherCoolStruct {
a: false,
b: String::from("Hello, world!"),
},
z: foo::MyCoolGenericStruct::<bool, String> {
a: false,
b: String::from("Hello, world!"),
},
},
);
- }
- #[test]
- fn test_default() {
let mut buf: MaybeUninit<MyDefaultStruct> = MaybeUninit::uninit();
let x: &mut MyDefaultStruct = place!(
&mut buf,
MyDefaultStruct {
b: true,
i: 1,
..Default::default()
}
);
assert_eq!(
x,
&MyDefaultStruct {
b: true,
i: 1,
j: 0,
},
);
- }
- #[test]
- fn test_scalar() {
let mut buf: MaybeUninit<u32> = MaybeUninit::uninit();
let x: &mut u32 = place!(&mut buf, 1234);
assert_eq!(x, &mut 1234u32);
- }
+} diff --git a/drivers/gpu/drm/asahi/queue/common.rs b/drivers/gpu/drm/asahi/queue/common.rs new file mode 100644 index 000000000000..127b4ccc6eca --- /dev/null +++ b/drivers/gpu/drm/asahi/queue/common.rs @@ -0,0 +1,52 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! Common queue functionality. +//! +//! Shared helpers used by the submission logic for multiple command types.
+use crate::fw::microseq; +use crate::fw::types::*;
+use kernel::bindings; +use kernel::io_buffer::IoBufferReader; +use kernel::prelude::*; +use kernel::user_ptr::UserSlicePtr;
+use core::mem::MaybeUninit;
+pub(super) fn build_attachments(pointer: u64, count: u32) -> Result<microseq::Attachments> {
- if count as usize > microseq::MAX_ATTACHMENTS {
return Err(EINVAL);
- }
-    const STRIDE: usize = core::mem::size_of::<bindings::drm_asahi_attachment>();
- let size = STRIDE * count as usize;
- // SAFETY: We only read this once, so there are no TOCTOU issues.
- let mut reader = unsafe { UserSlicePtr::new(pointer as usize as *mut _, size).reader() };
- let mut attachments: microseq::Attachments = Default::default();
- for i in 0..count {
let mut att: MaybeUninit<bindings::drm_asahi_attachment> = MaybeUninit::uninit();
// SAFETY: The size of `att` is STRIDE
unsafe { reader.read_raw(att.as_mut_ptr() as *mut u8, STRIDE)? };
// SAFETY: All bit patterns in the struct are valid
let att = unsafe { att.assume_init() };
let cache_lines = (att.size + 127) >> 7;
let order = 1;
attachments.list[i as usize] = microseq::Attachment {
address: U64(att.pointer),
size: cache_lines,
unk_c: 0x17,
unk_e: order,
};
attachments.count += 1;
- }
- Ok(attachments)
+} diff --git a/drivers/gpu/drm/asahi/queue/compute.rs b/drivers/gpu/drm/asahi/queue/compute.rs new file mode 100644 index 000000000000..6590382c75af --- /dev/null +++ b/drivers/gpu/drm/asahi/queue/compute.rs @@ -0,0 +1,371 @@ +// SPDX-License-Identifier: GPL-2.0-only OR MIT +#![allow(clippy::unusual_byte_groupings)]
+//! Compute work queue. +//! +//! A compute queue consists of one underlying WorkQueue. +//! This module is in charge of creating all of the firmware structures required to submit compute +//! work to the GPU, based on the userspace command buffer.
+use super::common; +use crate::alloc::Allocator; +use crate::debug::*; +use crate::fw::types::*; +use crate::gpu::GpuManager; +use crate::{box_in_place, inner_ptr, inner_weak_ptr, place}; +use crate::{fw, gpu, microseq}; +use core::mem::MaybeUninit; +use core::sync::atomic::Ordering; +use kernel::bindings; +use kernel::dma_fence::RawDmaFence; +use kernel::drm::sched::Job; +use kernel::io_buffer::IoBufferReader; +use kernel::prelude::*; +use kernel::sync::Arc; +use kernel::user_ptr::UserSlicePtr;
+const DEBUG_CLASS: DebugFlags = DebugFlags::Compute;
+#[versions(AGX)] +impl super::Queue::ver {
- /// Submit work to a compute queue.
- pub(super) fn submit_compute(
&self,
job: &mut Job<super::QueueJob::ver>,
cmd: &bindings::drm_asahi_command,
result_writer: Option<super::ResultWriter>,
id: u64,
flush_stamps: bool,
    ) -> Result {
if cmd.cmd_type != bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_COMPUTE {
return Err(EINVAL);
}
let dev = self.dev.data();
let gpu = match dev.gpu.as_any().downcast_ref::<gpu::GpuManager::ver>() {
Some(gpu) => gpu,
None => {
dev_crit!(self.dev, "GpuManager mismatched with Queue!\n");
return Err(EIO);
}
};
let mut alloc = gpu.alloc();
let kalloc = &mut *alloc;
mod_dev_dbg!(self.dev, "[Submission {}] Compute!\n", id);
let mut cmdbuf_reader = unsafe {
UserSlicePtr::new(
cmd.cmd_buffer as usize as *mut _,
core::mem::size_of::<bindings::drm_asahi_cmd_compute>(),
)
.reader()
};
let mut cmdbuf: MaybeUninit<bindings::drm_asahi_cmd_compute> = MaybeUninit::uninit();
unsafe {
cmdbuf_reader.read_raw(
cmdbuf.as_mut_ptr() as *mut u8,
core::mem::size_of::<bindings::drm_asahi_cmd_compute>(),
)?;
}
let cmdbuf = unsafe { cmdbuf.assume_init() };
if cmdbuf.flags != 0 {
return Err(EINVAL);
}
// This sequence number increases per new client/VM? assigned to some slot,
// but it's unclear *which* slot...
let slot_client_seq: u8 = (self.id & 0xff) as u8;
let vm_bind = job.vm_bind.clone();
mod_dev_dbg!(
self.dev,
"[Submission {}] VM slot = {}\n",
id,
vm_bind.slot()
);
let notifier = self.notifier.clone();
let fence = job.fence.clone();
let comp_job = job.get_comp()?;
let ev_comp = comp_job.event_info();
// TODO: Is this the same on all GPUs? Is this really for preemption?
let preempt_size = 0x7fa0;
let preempt2_off = 0x7f80;
let preempt3_off = 0x7f88;
let preempt4_off = 0x7f90;
let preempt5_off = 0x7f98;
let preempt_buf = self.ualloc.lock().array_empty(preempt_size)?;
let mut seq_buf = self.ualloc.lock().array_empty(0x800)?;
for i in 1..0x400 {
seq_buf[i] = (i + 1) as u64;
}
mod_dev_dbg!(
self.dev,
"[Submission {}] Event #{} {:#x?} -> {:#x?}\n",
id,
ev_comp.slot,
ev_comp.value,
ev_comp.value.next(),
);
let timestamps = Arc::try_new(kalloc.shared.new_default::<fw::job::JobTimestamps>()?)?;
let uuid = cmdbuf.cmd_id;
let unk3 = debug_enabled(debug::DebugFlags::Debug3);
mod_dev_dbg!(self.dev, "[Submission {}] UUID = {:#x?}\n", id, uuid);
// TODO: check
#[ver(V >= V13_0B4)]
let count = self.counter.fetch_add(1, Ordering::Relaxed);
let comp = GpuObject::new_prealloc(
kalloc.private.alloc_object()?,
|ptr: GpuWeakPointer<fw::compute::RunCompute::ver>| {
let mut builder = microseq::Builder::new();
let stats = gpu.initdata.runtime_pointers.stats.comp.weak_pointer();
let start_comp = builder.add(microseq::StartCompute::ver {
header: microseq::op::StartCompute::HEADER,
unk_pointer: inner_weak_ptr!(ptr, unk_pointee),
job_params1: inner_weak_ptr!(ptr, job_params1),
stats,
work_queue: ev_comp.info_ptr,
vm_slot: vm_bind.slot(),
unk_28: 0x1,
event_generation: self.id as u32,
cmd_seq: U64(ev_comp.cmd_seq),
unk_38: 0x0,
job_params2: inner_weak_ptr!(ptr, job_params2),
unk_44: 0x0,
uuid,
attachments: common::build_attachments(
cmdbuf.attachments,
cmdbuf.attachment_count,
)?,
padding: Default::default(),
#[ver(V >= V13_0B4)]
unk_flag: inner_weak_ptr!(ptr, unk_flag),
#[ver(V >= V13_0B4)]
counter: U64(count),
#[ver(V >= V13_0B4)]
notifier_buf: inner_weak_ptr!(notifier.weak_pointer(), state.unk_buf),
})?;
if result_writer.is_some() {
builder.add(microseq::Timestamp::ver {
header: microseq::op::Timestamp::new(true),
cur_ts: inner_weak_ptr!(ptr, cur_ts),
start_ts: inner_weak_ptr!(ptr, start_ts),
update_ts: inner_weak_ptr!(ptr, start_ts),
work_queue: ev_comp.info_ptr,
unk_24: U64(0),
#[ver(V >= V13_0B4)]
unk_ts: inner_weak_ptr!(ptr, unk_ts),
uuid,
unk_30_padding: 0,
})?;
}
builder.add(microseq::WaitForIdle {
header: microseq::op::WaitForIdle::new(microseq::Pipe::Compute),
})?;
if result_writer.is_some() {
builder.add(microseq::Timestamp::ver {
header: microseq::op::Timestamp::new(false),
cur_ts: inner_weak_ptr!(ptr, cur_ts),
start_ts: inner_weak_ptr!(ptr, start_ts),
update_ts: inner_weak_ptr!(ptr, end_ts),
work_queue: ev_comp.info_ptr,
unk_24: U64(0),
#[ver(V >= V13_0B4)]
unk_ts: inner_weak_ptr!(ptr, unk_ts),
uuid,
unk_30_padding: 0,
})?;
}
let off = builder.offset_to(start_comp);
builder.add(microseq::FinalizeCompute::ver {
header: microseq::op::FinalizeCompute::HEADER,
stats,
work_queue: ev_comp.info_ptr,
vm_slot: vm_bind.slot(),
#[ver(V < V13_0B4)]
unk_18: 0,
job_params2: inner_weak_ptr!(ptr, job_params2),
unk_24: 0,
uuid,
fw_stamp: ev_comp.fw_stamp_pointer,
stamp_value: ev_comp.value.next(),
unk_38: 0,
unk_3c: 0,
unk_40: 0,
unk_44: 0,
unk_48: 0,
unk_4c: 0,
unk_50: 0,
unk_54: 0,
unk_58: 0,
#[ver(G == G14 && V < V13_0B4)]
unk_5c_g14: U64(0),
restart_branch_offset: off,
unk_60: unk3.into(),
#[ver(V >= V13_0B4)]
unk_64: Default::default(),
#[ver(V >= V13_0B4)]
unk_flag: inner_weak_ptr!(ptr, unk_flag),
#[ver(V >= V13_0B4)]
unk_79: Default::default(),
})?;
builder.add(microseq::RetireStamp {
header: microseq::op::RetireStamp::HEADER,
})?;
Ok(box_in_place!(fw::compute::RunCompute::ver {
notifier: notifier.clone(),
preempt_buf: preempt_buf,
seq_buf: seq_buf,
micro_seq: builder.build(&mut kalloc.private)?,
vm_bind: vm_bind.clone(),
timestamps: timestamps.clone(),
})?)
},
|inner, ptr| {
Ok(place!(
ptr,
fw::compute::raw::RunCompute::ver {
tag: fw::workqueue::CommandType::RunCompute,
#[ver(V >= V13_0B4)]
counter: U64(count),
unk_4: 0,
vm_slot: vm_bind.slot(),
notifier: inner.notifier.gpu_pointer(),
unk_pointee: Default::default(),
job_params1: fw::compute::raw::JobParameters1 {
preempt_buf1: inner.preempt_buf.gpu_pointer(),
encoder: U64(cmdbuf.encoder_ptr),
// buf2-5 Only if internal program is used
preempt_buf2: inner.preempt_buf.gpu_offset_pointer(preempt2_off),
preempt_buf3: inner.preempt_buf.gpu_offset_pointer(preempt3_off),
preempt_buf4: inner.preempt_buf.gpu_offset_pointer(preempt4_off),
preempt_buf5: inner.preempt_buf.gpu_offset_pointer(preempt5_off),
pipeline_base: U64(0x11_00000000),
unk_38: U64(0x8c60),
unk_40: cmdbuf.ctx_switch_prog, // Internal program addr | 1
unk_44: 0,
compute_layout_addr: U64(cmdbuf.buffer_descriptor), // Only if internal program used
unk_50: cmdbuf.buffer_descriptor_size, // 0x40 if internal program used
unk_54: 0,
unk_58: 1,
unk_5c: 0,
iogpu_unk_40: cmdbuf.iogpu_unk_40, // 0x1c if internal program used
},
unk_b8: Default::default(),
microsequence: inner.micro_seq.gpu_pointer(),
microsequence_size: inner.micro_seq.len() as u32,
job_params2: fw::compute::raw::JobParameters2::ver {
#[ver(V >= V13_0B4)]
unk_0_0: 0,
unk_0: Default::default(),
preempt_buf1: inner.preempt_buf.gpu_pointer(),
encoder_end: U64(cmdbuf.encoder_end),
unk_34: Default::default(),
#[ver(V < V13_0B4)]
unk_5c: 0,
},
encoder_params: fw::job::raw::EncoderParams {
unk_8: 0x0, // fixed
unk_c: 0x0, // fixed
unk_10: 0x0, // fixed
encoder_id: cmdbuf.encoder_id,
unk_18: 0x0, // fixed
iogpu_compute_unk44: cmdbuf.iogpu_unk_44,
seq_buffer: inner.seq_buf.gpu_pointer(),
unk_28: U64(0x0), // fixed
},
meta: fw::job::raw::JobMeta {
unk_4: 0,
stamp: ev_comp.stamp_pointer,
fw_stamp: ev_comp.fw_stamp_pointer,
stamp_value: ev_comp.value.next(),
stamp_slot: ev_comp.slot,
evctl_index: 0, // fixed
flush_stamps: flush_stamps as u32,
uuid: uuid,
cmd_seq: ev_comp.cmd_seq as u32,
},
cur_ts: U64(0),
start_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), start)),
end_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), end)),
unk_2c0: 0,
unk_2c4: 0,
unk_2c8: 0,
unk_2cc: 0,
client_sequence: slot_client_seq,
pad_2d1: Default::default(),
unk_2d4: 0,
unk_2d8: 0,
#[ver(V >= V13_0B4)]
unk_ts: U64(0),
#[ver(V >= V13_0B4)]
unk_2e1: Default::default(),
#[ver(V >= V13_0B4)]
unk_flag: U32(0),
#[ver(V >= V13_0B4)]
unk_pad: Default::default(),
}
))
},
)?;
core::mem::drop(alloc);
fence.add_command();
comp_job.add_cb(comp, vm_bind.slot(), move |cmd, error| {
if let Some(err) = error {
fence.set_error(err.into())
}
if let Some(mut rw) = result_writer {
let mut result: bindings::drm_asahi_result_compute = Default::default();
cmd.timestamps.with(|raw, _inner| {
result.ts_start = raw.start.load(Ordering::Relaxed);
result.ts_end = raw.end.load(Ordering::Relaxed);
});
if let Some(err) = error {
result.info = err.into();
} else {
result.info.status = bindings::drm_asahi_status_DRM_ASAHI_STATUS_COMPLETE;
}
rw.write(result);
}
fence.command_complete();
})?;
notifier.threshold.with(|raw, _inner| {
raw.increment();
});
comp_job.next_seq();
Ok(())
    }
}

diff --git a/drivers/gpu/drm/asahi/queue/mod.rs b/drivers/gpu/drm/asahi/queue/mod.rs
new file mode 100644
index 000000000000..15988af33cf3
--- /dev/null
+++ b/drivers/gpu/drm/asahi/queue/mod.rs
@@ -0,0 +1,725 @@
// SPDX-License-Identifier: GPL-2.0-only OR MIT

//! Submission queue management
//!
//! This module implements the userspace view of submission queues and the logic to map userspace
//! submissions to firmware queues.

use kernel::dma_fence::*;
use kernel::prelude::*;
use kernel::{
    bindings, c_str, dma_fence,
    drm::gem::shmem::VMap,
    drm::sched,
    macros::versions,
    sync::{smutex::Mutex, Arc},
};

use crate::alloc::Allocator;
use crate::debug::*;
use crate::driver::AsahiDevice;
use crate::fw::types::*;
use crate::gpu::GpuManager;
use crate::{alloc, buffer, channel, event, file, fw, gem, gpu, mmu, workqueue};
use crate::{inner_weak_ptr, place};

use core::mem::MaybeUninit;
use core::sync::atomic::{AtomicU64, Ordering};

const DEBUG_CLASS: DebugFlags = DebugFlags::Queue;

const WQ_SIZE: u32 = 0x500;

mod common;
mod compute;
mod render;

/// Trait implemented by all versioned queues.
pub(crate) trait Queue: Send + Sync {
    fn submit(
&mut self,
id: u64,
in_syncs: Vec<file::SyncItem>,
out_syncs: Vec<file::SyncItem>,
result_buf: Option<gem::ObjectRef>,
commands: Vec<bindings::drm_asahi_command>,
    ) -> Result;
}

#[versions(AGX)]
struct SubQueue {
    wq: Arc<workqueue::WorkQueue::ver>,
}

#[versions(AGX)]
impl SubQueue::ver {
    fn new_job(&mut self) -> SubQueueJob::ver {
        SubQueueJob::ver {
            wq: self.wq.clone(),
            job: None,
        }
    }
}

#[versions(AGX)]
struct SubQueueJob {
    wq: Arc<workqueue::WorkQueue::ver>,
    job: Option<workqueue::Job::ver>,
}

#[versions(AGX)]
impl SubQueueJob::ver {
    fn get(&mut self) -> Result<&mut workqueue::Job::ver> {
if self.job.is_none() {
mod_pr_debug!("SubQueueJob: Creating {:?} job\n", self.wq.pipe_type());
self.job.replace(self.wq.new_job()?);
}
Ok(self.job.as_mut().expect("expected a Job"))
    }

    fn commit(&mut self) -> Result {
match self.job.as_mut() {
Some(job) => job.commit(),
None => Ok(()),
}
    }

    fn can_submit(&self) -> bool {
match self.job.as_ref() {
None => true,
Some(job) => job.can_submit(),
}
    }
}
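The `SubQueueJob::get()` pattern above lazily creates the underlying work-queue job on first use and hands out the same job on subsequent calls, so a multi-command submission accumulates into one job per pipe. A minimal userspace sketch of the same idiom (the `Job`/`SubQueue` types here are illustrative stand-ins, not the driver's):

```rust
// Sketch of the lazy job-creation idiom used by SubQueueJob::get().
struct Job {
    commands: usize,
}

struct SubQueue {
    job: Option<Job>,
    created: usize, // counts how many jobs were actually allocated
}

impl SubQueue {
    fn new() -> Self {
        SubQueue { job: None, created: 0 }
    }

    // Create the job on first call, then keep handing out the same one.
    fn get(&mut self) -> &mut Job {
        if self.job.is_none() {
            self.created += 1;
            self.job.replace(Job { commands: 0 });
        }
        self.job.as_mut().expect("expected a Job")
    }
}

pub fn demo() -> (usize, usize) {
    let mut sq = SubQueue::new();
    sq.get().commands += 1;
    sq.get().commands += 1; // second call reuses the same Job
    (sq.created, sq.job.as_ref().map(|j| j.commands).unwrap_or(0))
}
```
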
#[versions(AGX)]
pub(crate) struct Queue {
    dev: AsahiDevice,
    _sched: sched::Scheduler<QueueJob::ver>,
    entity: sched::Entity<QueueJob::ver>,
    vm: mmu::Vm,
    ualloc: Arc<Mutex<alloc::DefaultAllocator>>,
    q_vtx: Option<SubQueue::ver>,
    q_frag: Option<SubQueue::ver>,
    q_comp: Option<SubQueue::ver>,
    buffer: Option<Mutex<buffer::Buffer::ver>>,
    gpu_context: Arc<workqueue::GpuContext>,
    notifier_list: Arc<GpuObject<fw::event::NotifierList>>,
    notifier: Arc<GpuObject<fw::event::Notifier::ver>>,
    id: u64,
    fence_ctx: FenceContexts,
    #[ver(V >= V13_0B4)]
    counter: AtomicU64,
}
#[versions(AGX)]
#[derive(Default)]
pub(crate) struct JobFence {
    id: u64,
    pending: AtomicU64,
}
#[versions(AGX)]
impl JobFence::ver {
    fn add_command(self: &FenceObject<Self>) {
self.pending.fetch_add(1, Ordering::Relaxed);
    }

    fn command_complete(self: &FenceObject<Self>) {
let remain = self.pending.fetch_sub(1, Ordering::Relaxed) - 1;
mod_pr_debug!(
"JobFence[{}]: Command complete (remain: {})\n",
self.id,
remain
);
if remain == 0 {
mod_pr_debug!("JobFence[{}]: Signaling\n", self.id);
if self.signal().is_err() {
pr_err!("JobFence[{}]: Fence signal failed\n", self.id);
}
}
    }
}
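`JobFence` above counts outstanding commands per job: `add_command()` increments a pending counter, `command_complete()` decrements it, and the fence signals only when the last command retires. A hedged, userspace-only sketch of that counting scheme, with an atomic bool standing in for the real dma_fence signal:

```rust
use core::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// Sketch of the JobFence pending-command counter: the fence "signals"
// exactly once, when the count of outstanding commands drops to zero.
pub struct ToyFence {
    pending: AtomicU64,
    signaled: AtomicBool,
}

impl ToyFence {
    pub fn new() -> Self {
        ToyFence {
            pending: AtomicU64::new(0),
            signaled: AtomicBool::new(false),
        }
    }

    pub fn add_command(&self) {
        self.pending.fetch_add(1, Ordering::Relaxed);
    }

    pub fn command_complete(&self) {
        // fetch_sub returns the previous value, so subtract one for the remainder.
        let remain = self.pending.fetch_sub(1, Ordering::Relaxed) - 1;
        if remain == 0 {
            self.signaled.store(true, Ordering::Release);
        }
    }

    pub fn is_signaled(&self) -> bool {
        self.signaled.load(Ordering::Acquire)
    }
}
```
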
#[versions(AGX)]
#[vtable]
impl dma_fence::FenceOps for JobFence::ver {
    const USE_64BIT_SEQNO: bool = true;

    fn get_driver_name<'a>(self: &'a FenceObject<Self>) -> &'a CStr {
c_str!("asahi")
    }

    fn get_timeline_name<'a>(self: &'a FenceObject<Self>) -> &'a CStr {
c_str!("queue")
    }
}

#[versions(AGX)]
pub(crate) struct QueueJob {
    dev: AsahiDevice,
    vm_bind: mmu::VmBind,
    op_guard: Option<gpu::OpGuard>,
    sj_vtx: Option<SubQueueJob::ver>,
    sj_frag: Option<SubQueueJob::ver>,
    sj_comp: Option<SubQueueJob::ver>,
    fence: UserFence<JobFence::ver>,
    did_run: bool,
    id: u64,
}
#[versions(AGX)]
impl QueueJob::ver {
    fn get_vtx(&mut self) -> Result<&mut workqueue::Job::ver> {
self.sj_vtx.as_mut().ok_or(EINVAL)?.get()
    }

    fn get_frag(&mut self) -> Result<&mut workqueue::Job::ver> {
self.sj_frag.as_mut().ok_or(EINVAL)?.get()
    }

    fn get_comp(&mut self) -> Result<&mut workqueue::Job::ver> {
self.sj_comp.as_mut().ok_or(EINVAL)?.get()
    }

    fn commit(&mut self) -> Result {
mod_dev_dbg!(self.dev, "QueueJob: Committing\n");
self.sj_vtx.as_mut().map(|a| a.commit()).unwrap_or(Ok(()))?;
self.sj_frag
.as_mut()
.map(|a| a.commit())
.unwrap_or(Ok(()))?;
self.sj_comp.as_mut().map(|a| a.commit()).unwrap_or(Ok(()))
    }
}
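`QueueJob::commit()` above has to treat an absent subqueue as success, hence the `map(|a| a.commit()).unwrap_or(Ok(()))?` chain: `None` means "this pipe was never created for this queue", which is not an error, while a present subqueue's commit may still fail. A small sketch of that Option-to-Result chaining, with toy types standing in for the driver's:

```rust
// Toy stand-ins: committing a present subqueue may fail; an absent one
// is a no-op success, mirroring QueueJob::commit().
#[derive(Debug, PartialEq)]
struct Error;

struct Sub {
    ok: bool,
}

impl Sub {
    fn commit(&mut self) -> Result<(), Error> {
        if self.ok { Ok(()) } else { Err(Error) }
    }
}

fn commit_all(subs: &mut [Option<Sub>]) -> Result<(), Error> {
    for s in subs.iter_mut() {
        // None means "this queue type was not created", which is fine.
        s.as_mut().map(|a| a.commit()).unwrap_or(Ok(()))?;
    }
    Ok(())
}
```
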
#[versions(AGX)]
impl sched::JobImpl for QueueJob::ver {
    fn can_run(job: &mut sched::Job<Self>) -> bool {
mod_dev_dbg!(job.dev, "QueueJob {}: Checking runnability\n", job.id);
if let Some(sj) = job.sj_vtx.as_ref() {
if !sj.can_submit() {
mod_dev_dbg!(
job.dev,
"QueueJob {}: Blocking due to vertex queue full\n",
job.id
);
return false;
}
}
if let Some(sj) = job.sj_frag.as_ref() {
if !sj.can_submit() {
mod_dev_dbg!(
job.dev,
"QueueJob {}: Blocking due to fragment queue full\n",
job.id
);
return false;
}
}
if let Some(sj) = job.sj_comp.as_ref() {
if !sj.can_submit() {
mod_dev_dbg!(
job.dev,
"QueueJob {}: Blocking due to compute queue full\n",
job.id
);
return false;
}
}
true
    }

    #[allow(unused_assignments)]
    fn run(job: &mut sched::Job<Self>) -> Result<Option<dma_fence::Fence>> {
mod_dev_dbg!(job.dev, "QueueJob {}: Running Job\n", job.id);
let dev = job.dev.data();
let gpu = match dev
.gpu
.clone()
.arc_as_any()
.downcast::<gpu::GpuManager::ver>()
{
Ok(gpu) => gpu,
Err(_) => {
dev_crit!(job.dev, "GpuManager mismatched with QueueJob!\n");
return Err(EIO);
}
};
if job.op_guard.is_none() {
job.op_guard = Some(gpu.start_op()?);
}
// First submit all the commands for each queue. This can fail.
let mut frag_job = None;
let mut frag_sub = None;
if let Some(sj) = job.sj_frag.as_mut() {
frag_job = sj.job.take();
if let Some(wqjob) = frag_job.as_mut() {
mod_dev_dbg!(job.dev, "QueueJob {}: Submit fragment\n", job.id);
frag_sub = Some(wqjob.submit()?);
}
}
let mut vtx_job = None;
let mut vtx_sub = None;
if let Some(sj) = job.sj_vtx.as_mut() {
vtx_job = sj.job.take();
if let Some(wqjob) = vtx_job.as_mut() {
mod_dev_dbg!(job.dev, "QueueJob {}: Submit vertex\n", job.id);
vtx_sub = Some(wqjob.submit()?);
}
}
let mut comp_job = None;
let mut comp_sub = None;
if let Some(sj) = job.sj_comp.as_mut() {
comp_job = sj.job.take();
if let Some(wqjob) = comp_job.as_mut() {
mod_dev_dbg!(job.dev, "QueueJob {}: Submit compute\n", job.id);
comp_sub = Some(wqjob.submit()?);
}
}
// Now we fully commit to running the job
mod_dev_dbg!(job.dev, "QueueJob {}: Run fragment\n", job.id);
frag_sub.map(|a| gpu.run_job(a)).transpose()?;
mod_dev_dbg!(job.dev, "QueueJob {}: Run vertex\n", job.id);
vtx_sub.map(|a| gpu.run_job(a)).transpose()?;
mod_dev_dbg!(job.dev, "QueueJob {}: Run compute\n", job.id);
comp_sub.map(|a| gpu.run_job(a)).transpose()?;
mod_dev_dbg!(job.dev, "QueueJob {}: Drop compute job\n", job.id);
core::mem::drop(comp_job);
mod_dev_dbg!(job.dev, "QueueJob {}: Drop vertex job\n", job.id);
core::mem::drop(vtx_job);
mod_dev_dbg!(job.dev, "QueueJob {}: Drop fragment job\n", job.id);
core::mem::drop(frag_job);
job.did_run = true;
Ok(Some(Fence::from_fence(&job.fence)))
    }

    fn timed_out(job: &mut sched::Job<Self>) -> sched::Status {
// FIXME: Handle timeouts properly
dev_err!(
job.dev,
"QueueJob {}: Job timed out on the DRM scheduler, things will probably break (ran: {})\n",
job.id, job.did_run
);
sched::Status::NoDevice
    }
}

#[versions(AGX)]
impl Drop for QueueJob::ver {
    fn drop(&mut self) {
mod_dev_dbg!(self.dev, "QueueJob {}: Dropping\n", self.id);
    }
}
struct ResultWriter {
    vmap: VMap<gem::DriverObject>,
    offset: usize,
    len: usize,
}

impl ResultWriter {
    fn write<T>(&mut self, mut value: T) {
let p: *mut u8 = &mut value as *mut _ as *mut u8;
// SAFETY: We know `p` points to a type T of that size, and UAPI types must have
// no padding and all bit patterns valid.
let slice = unsafe { core::slice::from_raw_parts_mut(p, core::mem::size_of::<T>()) };
let len = slice.len().min(self.len);
self.vmap.as_mut_slice()[self.offset..self.offset + len].copy_from_slice(&slice[..len]);
    }
}
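`ResultWriter::write()` above views a plain-old-data result struct as bytes and copies it into the mapped result buffer, clamping to the userspace-provided length so a short buffer truncates the result instead of overflowing. A userspace sketch of the same clamped byte copy (the `Result32` struct is illustrative, not a UAPI type):

```rust
// Sketch of ResultWriter::write(): view a POD value as bytes and copy at
// most `len` of them into the destination buffer at `offset`.
fn write_pod<T: Copy>(dst: &mut [u8], offset: usize, len: usize, value: T) {
    let p: *const u8 = &value as *const T as *const u8;
    // SAFETY: `p` points to a live `T`; we only read `size_of::<T>()` bytes.
    let src = unsafe { core::slice::from_raw_parts(p, core::mem::size_of::<T>()) };
    let n = src.len().min(len);
    dst[offset..offset + n].copy_from_slice(&src[..n]);
}

#[repr(C)]
#[derive(Clone, Copy)]
struct Result32 {
    status: u32,
    value: u32,
}

pub fn demo() -> Vec<u8> {
    let mut buf = vec![0u8; 16];
    // Only 4 bytes of room: the copy is clamped, so `value` never lands.
    write_pod(&mut buf, 8, 4, Result32 { status: 1, value: 0xdead_beef });
    buf
}
```
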
static QUEUE_NAME: &CStr = c_str!("asahi_fence");
static QUEUE_CLASS_KEY: kernel::sync::LockClassKey = kernel::sync::LockClassKey::new();

#[versions(AGX)]
impl Queue::ver {
    /// Create a new user queue.
    #[allow(clippy::too_many_arguments)]
    pub(crate) fn new(
dev: &AsahiDevice,
vm: mmu::Vm,
alloc: &mut gpu::KernelAllocators,
ualloc: Arc<Mutex<alloc::DefaultAllocator>>,
ualloc_priv: Arc<Mutex<alloc::DefaultAllocator>>,
event_manager: Arc<event::EventManager>,
mgr: &buffer::BufferManager,
id: u64,
priority: u32,
caps: u32,
    ) -> Result<Queue::ver> {
mod_dev_dbg!(dev, "[Queue {}] Creating queue\n", id);
let data = dev.data();
let mut notifier_list = alloc.private.new_default::<fw::event::NotifierList>()?;
let self_ptr = notifier_list.weak_pointer();
notifier_list.with_mut(|raw, _inner| {
raw.list_head.next = Some(inner_weak_ptr!(self_ptr, list_head));
});
let threshold = alloc.shared.new_default::<fw::event::Threshold>()?;
let notifier: Arc<GpuObject<fw::event::Notifier::ver>> =
Arc::try_new(alloc.private.new_inplace(
fw::event::Notifier::ver { threshold },
|inner, ptr: &mut MaybeUninit<fw::event::raw::Notifier::ver<'_>>| {
Ok(place!(
ptr,
fw::event::raw::Notifier::ver {
threshold: inner.threshold.gpu_pointer(),
generation: AtomicU32::new(id as u32),
cur_count: AtomicU32::new(0),
unk_10: AtomicU32::new(0x50),
state: Default::default()
}
))
},
)?)?;
let sched = sched::Scheduler::new(dev, WQ_SIZE, 0, 100000, c_str!("asahi_sched"))?;
        // Priorities are handled by the AGX scheduler, so they have no meaning
        // within a per-queue scheduler.
let entity = sched::Entity::new(&sched, sched::Priority::Normal)?;
let mut ret = Queue::ver {
dev: dev.clone(),
_sched: sched,
entity,
vm,
ualloc,
q_vtx: None,
q_frag: None,
q_comp: None,
buffer: None,
gpu_context: Arc::try_new(workqueue::GpuContext::new(dev, alloc)?)?,
notifier_list: Arc::try_new(notifier_list)?,
notifier,
id,
fence_ctx: FenceContexts::new(1, QUEUE_NAME, &QUEUE_CLASS_KEY)?,
#[ver(V >= V13_0B4)]
counter: AtomicU64::new(0),
};
// Rendering structures
if caps & bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_RENDER != 0 {
let buffer =
buffer::Buffer::ver::new(&*data.gpu, alloc, ret.ualloc.clone(), ualloc_priv, mgr)?;
let tvb_blocks = {
let lock = crate::THIS_MODULE.kernel_param_lock();
*crate::initial_tvb_size.read(&lock)
};
buffer.ensure_blocks(tvb_blocks)?;
ret.buffer = Some(Mutex::new(buffer));
ret.q_vtx = Some(SubQueue::ver {
wq: workqueue::WorkQueue::ver::new(
alloc,
event_manager.clone(),
ret.gpu_context.clone(),
ret.notifier_list.clone(),
channel::PipeType::Vertex,
id,
priority,
WQ_SIZE,
)?,
});
}
// Rendering & blit structures
if caps
& (bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_RENDER
| bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_BLIT)
!= 0
{
ret.q_frag = Some(SubQueue::ver {
wq: workqueue::WorkQueue::ver::new(
alloc,
event_manager.clone(),
ret.gpu_context.clone(),
ret.notifier_list.clone(),
channel::PipeType::Fragment,
id,
priority,
WQ_SIZE,
)?,
});
}
// Compute structures
if caps & bindings::drm_asahi_queue_cap_DRM_ASAHI_QUEUE_CAP_COMPUTE != 0 {
ret.q_comp = Some(SubQueue::ver {
wq: workqueue::WorkQueue::ver::new(
alloc,
event_manager,
ret.gpu_context.clone(),
ret.notifier_list.clone(),
channel::PipeType::Compute,
id,
priority,
WQ_SIZE,
)?,
});
}
mod_dev_dbg!(dev, "[Queue {}] Queue created\n", id);
Ok(ret)
    }
}
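`Queue::new()` above decides which subqueues to instantiate from the UAPI capability bits: RENDER creates a vertex queue, RENDER or BLIT a fragment queue, and COMPUTE a compute queue. A sketch of that selection logic with stand-in bit values (the real constants come from the generated `drm_asahi` bindings, so the values here are assumptions for illustration only):

```rust
// Stand-in capability bits; the driver reads the real values from the
// drm_asahi UAPI bindings.
const CAP_RENDER: u32 = 1 << 0;
const CAP_BLIT: u32 = 1 << 1;
const CAP_COMPUTE: u32 = 1 << 2;

// Returns (vertex, fragment, compute) subqueue presence for a caps mask,
// mirroring the three `if caps & ...` blocks in Queue::new().
pub fn subqueues_for_caps(caps: u32) -> (bool, bool, bool) {
    let vtx = caps & CAP_RENDER != 0;
    let frag = caps & (CAP_RENDER | CAP_BLIT) != 0;
    let comp = caps & CAP_COMPUTE != 0;
    (vtx, frag, comp)
}
```
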
const SQ_RENDER: usize = bindings::drm_asahi_subqueue_DRM_ASAHI_SUBQUEUE_RENDER as usize;
const SQ_COMPUTE: usize = bindings::drm_asahi_subqueue_DRM_ASAHI_SUBQUEUE_COMPUTE as usize;
const SQ_COUNT: usize = bindings::drm_asahi_subqueue_DRM_ASAHI_SUBQUEUE_COUNT as usize;
#[versions(AGX)]
impl Queue for Queue::ver {
    fn submit(
&mut self,
id: u64,
in_syncs: Vec<file::SyncItem>,
out_syncs: Vec<file::SyncItem>,
result_buf: Option<gem::ObjectRef>,
commands: Vec<bindings::drm_asahi_command>,
    ) -> Result {
let dev = self.dev.data();
let gpu = match dev
.gpu
.clone()
.arc_as_any()
.downcast::<gpu::GpuManager::ver>()
{
Ok(gpu) => gpu,
Err(_) => {
dev_crit!(self.dev, "GpuManager mismatched with JobImpl!\n");
return Err(EIO);
}
};
mod_dev_dbg!(self.dev, "[Submission {}] Submit job\n", id);
if gpu.is_crashed() {
dev_err!(
self.dev,
"[Submission {}] GPU is crashed, cannot submit\n",
id
);
return Err(ENODEV);
}
// Empty submissions are not legal
if commands.is_empty() {
return Err(EINVAL);
}
let op_guard = if !in_syncs.is_empty() {
Some(gpu.start_op()?)
} else {
None
};
let mut events: [Vec<Option<workqueue::QueueEventInfo::ver>>; SQ_COUNT] =
Default::default();
events[SQ_RENDER].try_push(self.q_frag.as_ref().and_then(|a| a.wq.event_info()))?;
events[SQ_COMPUTE].try_push(self.q_comp.as_ref().and_then(|a| a.wq.event_info()))?;
let vm_bind = gpu.bind_vm(&self.vm)?;
let vm_slot = vm_bind.slot();
mod_dev_dbg!(self.dev, "[Submission {}] Creating job\n", id);
let mut job = self.entity.new_job(QueueJob::ver {
dev: self.dev.clone(),
vm_bind,
op_guard,
sj_vtx: self.q_vtx.as_mut().map(|a| a.new_job()),
sj_frag: self.q_frag.as_mut().map(|a| a.new_job()),
sj_comp: self.q_comp.as_mut().map(|a| a.new_job()),
fence: self
.fence_ctx
.new_fence::<JobFence::ver>(
0,
JobFence::ver {
id,
pending: Default::default(),
},
)?
.into(),
did_run: false,
id,
})?;
mod_dev_dbg!(
self.dev,
"[Submission {}] Adding {} in_syncs\n",
id,
in_syncs.len()
);
for sync in in_syncs {
job.add_dependency(sync.fence.expect("in_sync missing fence"))?;
}
let mut last_render = None;
let mut last_compute = None;
for (i, cmd) in commands.iter().enumerate() {
match cmd.cmd_type {
bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_RENDER => last_render = Some(i),
bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_COMPUTE => last_compute = Some(i),
_ => return Err(EINVAL),
}
}
mod_dev_dbg!(
self.dev,
"[Submission {}] Submitting {} commands\n",
id,
commands.len()
);
for (i, cmd) in commands.into_iter().enumerate() {
for (queue_idx, index) in cmd.barriers.iter().enumerate() {
if *index == bindings::DRM_ASAHI_BARRIER_NONE as u32 {
continue;
}
if let Some(event) = events[queue_idx].get(*index as usize).ok_or(EINVAL)? {
let mut alloc = gpu.alloc();
let queue_job = match cmd.cmd_type {
bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_RENDER => job.get_vtx()?,
bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_COMPUTE => job.get_comp()?,
_ => return Err(EINVAL),
};
mod_dev_dbg!(self.dev, "[Submission {}] Create Explicit Barrier\n", id);
let barrier: GpuObject<fw::workqueue::Barrier> = alloc.private.new_inplace(
Default::default(),
|_inner, ptr: &mut MaybeUninit<fw::workqueue::raw::Barrier>| {
Ok(place!(
ptr,
fw::workqueue::raw::Barrier {
tag: fw::workqueue::CommandType::Barrier,
wait_stamp: event.fw_stamp_pointer,
wait_value: event.value,
wait_slot: event.slot,
stamp_self: queue_job.event_info().value.next(),
uuid: 0xffffbbbb,
unk: 0,
}
))
},
)?;
mod_dev_dbg!(self.dev, "[Submission {}] Add Explicit Barrier\n", id);
queue_job.add(barrier, vm_slot)?;
} else {
assert!(*index == 0);
}
}
let result_writer = match result_buf.as_ref() {
None => {
if cmd.result_offset != 0 || cmd.result_size != 0 {
return Err(EINVAL);
}
None
}
Some(buf) => {
if cmd.result_size != 0 {
if cmd
.result_offset
.checked_add(cmd.result_size)
.ok_or(EINVAL)?
> buf.size() as u64
{
return Err(EINVAL);
}
Some(ResultWriter {
vmap: buf.gem.vmap()?,
offset: cmd.result_offset.try_into()?,
len: cmd.result_size.try_into()?,
})
} else {
None
}
}
};
match cmd.cmd_type {
bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_RENDER => {
self.submit_render(
&mut job,
&cmd,
result_writer,
id,
last_render.unwrap() == i,
)?;
events[SQ_RENDER].try_push(Some(
job.sj_frag
.as_ref()
.expect("No frag queue?")
.job
.as_ref()
.expect("No frag job?")
.event_info(),
))?;
}
bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_COMPUTE => {
self.submit_compute(
&mut job,
&cmd,
result_writer,
id,
last_compute.unwrap() == i,
)?;
events[SQ_COMPUTE].try_push(Some(
job.sj_comp
.as_ref()
.expect("No comp queue?")
.job
.as_ref()
.expect("No comp job?")
.event_info(),
))?;
}
_ => return Err(EINVAL),
}
}
mod_dev_dbg!(self.dev, "Queue: Committing job\n");
job.commit()?;
mod_dev_dbg!(self.dev, "Queue: Arming job\n");
let job = job.arm();
let out_fence = job.fences().finished();
mod_dev_dbg!(self.dev, "Queue: Pushing job\n");
job.push();
mod_dev_dbg!(self.dev, "Queue: Adding {} out_syncs\n", out_syncs.len());
for mut sync in out_syncs {
if let Some(chain) = sync.chain_fence.take() {
sync.syncobj
.add_point(chain, &out_fence, sync.timeline_value);
} else {
sync.syncobj.replace_fence(Some(&out_fence));
}
}
Ok(())
    }
}
#[versions(AGX)]
impl Drop for Queue::ver {
    fn drop(&mut self) {
        mod_dev_dbg!(self.dev, "[Queue {}] Dropping queue\n", self.id);
    }
}

diff --git a/drivers/gpu/drm/asahi/queue/render.rs b/drivers/gpu/drm/asahi/queue/render.rs
new file mode 100644
index 000000000000..318c952df020
--- /dev/null
+++ b/drivers/gpu/drm/asahi/queue/render.rs
@@ -0,0 +1,1173 @@
// SPDX-License-Identifier: GPL-2.0-only OR MIT
#![allow(clippy::unusual_byte_groupings)]

//! Render work queue.
//!
//! A render queue consists of two underlying WorkQueues, one for vertex and one for fragment work.
//! This module is in charge of creating all of the firmware structures required to submit 3D
//! rendering work to the GPU, based on the userspace command buffer.

use super::common;
use crate::alloc::Allocator;
use crate::debug::*;
use crate::fw::types::*;
use crate::gpu::GpuManager;
use crate::util::*;
use crate::workqueue::WorkError;
use crate::{box_in_place, inner_ptr, inner_weak_ptr, place};
use crate::{buffer, fw, gpu, microseq, workqueue};
use core::mem::MaybeUninit;
use core::sync::atomic::Ordering;
use kernel::bindings;
use kernel::dma_fence::RawDmaFence;
use kernel::drm::sched::Job;
use kernel::io_buffer::IoBufferReader;
use kernel::prelude::*;
use kernel::sync::{smutex::Mutex, Arc};
use kernel::user_ptr::UserSlicePtr;

const DEBUG_CLASS: DebugFlags = DebugFlags::Render;

/// Tiling/Vertex control bit to disable using more than one GPU cluster. This results in decreased
/// throughput but also less latency, which is probably desirable for light vertex loads where the
/// overhead of clustering/merging would exceed the time it takes to just run the job on one
/// cluster.
const TILECTL_DISABLE_CLUSTERING: u32 = 1u32 << 0;

struct RenderResult {
    result: bindings::drm_asahi_result_render,
    vtx_complete: bool,
    frag_complete: bool,
    vtx_error: Option<workqueue::WorkError>,
    frag_error: Option<workqueue::WorkError>,
    writer: super::ResultWriter,
}
impl RenderResult {
    fn commit(&mut self) {
if !self.vtx_complete || !self.frag_complete {
return;
}
let mut error = self.vtx_error.take();
if let Some(frag_error) = self.frag_error.take() {
if error.is_none() || error == Some(WorkError::Killed) {
error = Some(frag_error);
}
}
if let Some(err) = error {
self.result.info = err.into();
} else {
self.result.info.status = bindings::drm_asahi_status_DRM_ASAHI_STATUS_COMPLETE;
}
self.writer.write(self.result);
    }
}
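`RenderResult::commit()` above merges the vertex and fragment errors into one reported error, preferring the vertex error unless it is absent or merely a `Killed` cascade, in which case the fragment error is the more informative one. A sketch of that merge rule with a toy error enum (the variants besides `Killed` are illustrative):

```rust
// Toy stand-in for the driver's WorkError; only Killed matters to the rule.
#[derive(Clone, Copy, Debug, PartialEq)]
enum WorkError {
    Killed,
    Fault,
    Timeout,
}

// Prefer the vertex error, unless it is absent or just a Killed cascade,
// in which case the fragment error (if any) wins.
fn merge_errors(vtx: Option<WorkError>, frag: Option<WorkError>) -> Option<WorkError> {
    let mut error = vtx;
    if let Some(frag_error) = frag {
        if error.is_none() || error == Some(WorkError::Killed) {
            error = Some(frag_error);
        }
    }
    error
}
```
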
#[versions(AGX)]
impl super::Queue::ver {
    /// Get the appropriate tiling parameters for a given userspace command buffer.
    fn get_tiling_params(
cmdbuf: &bindings::drm_asahi_cmd_render,
num_clusters: u32,
    ) -> Result<buffer::TileInfo> {
let width: u32 = cmdbuf.fb_width;
let height: u32 = cmdbuf.fb_height;
let layers: u32 = cmdbuf.layers;
if width > 65536 || height > 65536 {
return Err(EINVAL);
}
if layers == 0 || layers > 2048 {
return Err(EINVAL);
}
let tile_width = 32u32;
let tile_height = 32u32;
let utile_width = cmdbuf.utile_width;
let utile_height = cmdbuf.utile_height;
match (utile_width, utile_height) {
(32, 32) | (32, 16) | (16, 16) => (),
_ => return Err(EINVAL),
};
let utiles_per_tile_x = tile_width / utile_width;
let utiles_per_tile_y = tile_height / utile_height;
let utiles_per_tile = utiles_per_tile_x * utiles_per_tile_y;
let tiles_x = (width + tile_width - 1) / tile_width;
let tiles_y = (height + tile_height - 1) / tile_height;
let tiles = tiles_x * tiles_y;
let mtiles_x = 4u32;
let mtiles_y = 4u32;
let mtiles = mtiles_x * mtiles_y;
// TODO: *samples
let tiles_per_mtile_x = align(div_ceil(tiles_x, mtiles_x), 4);
let tiles_per_mtile_y = align(div_ceil(tiles_y, mtiles_y), 4);
let tiles_per_mtile = tiles_per_mtile_x * tiles_per_mtile_y;
let mtile_x1 = tiles_per_mtile_x;
let mtile_x2 = 2 * tiles_per_mtile_x;
let mtile_x3 = 3 * tiles_per_mtile_x;
let mtile_y1 = tiles_per_mtile_y;
let mtile_y2 = 2 * tiles_per_mtile_y;
let mtile_y3 = 3 * tiles_per_mtile_y;
let rgn_entry_size = 5;
// Macrotile stride in 32-bit words
let rgn_size = align(rgn_entry_size * tiles_per_mtile * utiles_per_tile, 4) / 4;
let tilemap_size = (4 * rgn_size * mtiles * layers) as usize;
let tpc_entry_size = 8;
// TPC stride in 32-bit words
let tpc_mtile_stride = tpc_entry_size * utiles_per_tile * tiles_per_mtile / 4;
let tpc_size = (num_clusters * (4 * tpc_mtile_stride * mtiles) * layers) as usize;
// No idea where this comes from, but it fits what macOS does...
// TODO: layers?
let meta1_blocks = if num_clusters > 1 {
div_ceil(align(tiles_x, 2) * align(tiles_y, 4), 0x1980)
} else {
0
};
let min_tvb_blocks =
div_ceil(tiles_x * tiles_y, 128).max(if num_clusters > 1 { 9 } else { 8 }) as usize;
// Sometimes clustering seems to use twice the cluster tilemap count
// and twice the meta4 size. TODO: Is this random or can we calculate
// it somehow??? Does it go higher???
let cluster_factor = 2;
Ok(buffer::TileInfo {
tiles_x,
tiles_y,
tiles,
utile_width,
utile_height,
//mtiles_x,
//mtiles_y,
tiles_per_mtile_x,
tiles_per_mtile_y,
//tiles_per_mtile,
utiles_per_mtile_x: tiles_per_mtile_x * utiles_per_tile_x,
utiles_per_mtile_y: tiles_per_mtile_y * utiles_per_tile_y,
//utiles_per_mtile: tiles_per_mtile * utiles_per_tile,
tilemap_size,
tpc_size,
meta1_blocks,
min_tvb_blocks,
cluster_factor,
params: fw::vertex::raw::TilingParameters {
rgn_size,
unk_4: 0x88,
ppp_ctrl: cmdbuf.ppp_ctrl,
x_max: (width - 1) as u16,
y_max: (height - 1) as u16,
te_screen: ((tiles_y - 1) << 12) | (tiles_x - 1),
te_mtile1: mtile_x3 | (mtile_x2 << 9) | (mtile_x1 << 18),
te_mtile2: mtile_y3 | (mtile_y2 << 9) | (mtile_y1 << 18),
tiles_per_mtile,
tpc_stride: tpc_mtile_stride,
unk_24: 0x100,
unk_28: if layers > 1 {
0xe000 | (layers - 1)
} else {
0x8000
},
},
})
    }
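The tile-count arithmetic in `get_tiling_params()` above rounds the framebuffer up to 32x32-pixel tiles and distributes them across a 4x4 macrotile grid, aligning each per-macrotile count to 4. A standalone sketch of that math, with `div_ceil`/`align` helpers mirroring the ones pulled in from `crate::util`:

```rust
// Helpers mirroring crate::util's div_ceil/align for u32.
fn div_ceil(a: u32, b: u32) -> u32 {
    (a + b - 1) / b
}

fn align(a: u32, b: u32) -> u32 {
    div_ceil(a, b) * b
}

// Tile counts as computed in get_tiling_params(): 32x32 pixel tiles,
// a 4x4 macrotile grid, and per-macrotile counts aligned up to 4.
pub fn tile_counts(width: u32, height: u32) -> (u32, u32, u32, u32) {
    let (tile_w, tile_h) = (32u32, 32u32);
    let (mtiles_x, mtiles_y) = (4u32, 4u32);
    let tiles_x = div_ceil(width, tile_w);
    let tiles_y = div_ceil(height, tile_h);
    let tiles_per_mtile_x = align(div_ceil(tiles_x, mtiles_x), 4);
    let tiles_per_mtile_y = align(div_ceil(tiles_y, mtiles_y), 4);
    (tiles_x, tiles_y, tiles_per_mtile_x, tiles_per_mtile_y)
}
```

For a 1920x1080 framebuffer this gives 60x34 tiles and 16x12 tiles per macrotile.
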
    /// Submit work to a render queue.
    pub(super) fn submit_render(
&self,
job: &mut Job<super::QueueJob::ver>,
cmd: &bindings::drm_asahi_command,
result_writer: Option<super::ResultWriter>,
id: u64,
flush_stamps: bool,
    ) -> Result {
if cmd.cmd_type != bindings::drm_asahi_cmd_type_DRM_ASAHI_CMD_RENDER {
return Err(EINVAL);
}
mod_dev_dbg!(self.dev, "[Submission {}] Render!\n", id);
let mut cmdbuf_reader = unsafe {
UserSlicePtr::new(
cmd.cmd_buffer as usize as *mut _,
core::mem::size_of::<bindings::drm_asahi_cmd_render>(),
)
.reader()
};
let mut cmdbuf: MaybeUninit<bindings::drm_asahi_cmd_render> = MaybeUninit::uninit();
unsafe {
cmdbuf_reader.read_raw(
cmdbuf.as_mut_ptr() as *mut u8,
core::mem::size_of::<bindings::drm_asahi_cmd_render>(),
)?;
}
let cmdbuf = unsafe { cmdbuf.assume_init() };
if cmdbuf.flags
& !(bindings::ASAHI_RENDER_NO_CLEAR_PIPELINE_TEXTURES
| bindings::ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S
| bindings::ASAHI_RENDER_MEMORYLESS_RTS_USED
| bindings::ASAHI_RENDER_PROCESS_EMPTY_TILES
| bindings::ASAHI_RENDER_NO_VERTEX_CLUSTERING) as u64
!= 0
{
return Err(EINVAL);
}
if cmdbuf.flags & bindings::ASAHI_RENDER_MEMORYLESS_RTS_USED as u64 != 0 {
// Not supported yet
return Err(EINVAL);
}
if cmdbuf.fb_width == 0
|| cmdbuf.fb_height == 0
|| cmdbuf.fb_width > 16384
|| cmdbuf.fb_height > 16384
{
mod_dev_dbg!(
self.dev,
"[Submission {}] Invalid dimensions {}x{}\n",
id,
cmdbuf.fb_width,
cmdbuf.fb_height
);
return Err(EINVAL);
}
let dev = self.dev.data();
let gpu = match dev.gpu.as_any().downcast_ref::<gpu::GpuManager::ver>() {
Some(gpu) => gpu,
None => {
dev_crit!(self.dev, "GpuManager mismatched with Queue!\n");
return Err(EIO);
}
};
let nclusters = gpu.get_dyncfg().id.num_clusters;
// Can be set to false to disable clustering (for simpler jobs), but then the
// core masks below should be adjusted to cover a single rolling cluster.
let mut clustering = nclusters > 1;
if debug_enabled(debug::DebugFlags::DisableClustering)
|| cmdbuf.flags & bindings::ASAHI_RENDER_NO_VERTEX_CLUSTERING as u64 != 0
{
clustering = false;
}
#[ver(G < G14)]
let tiling_control = {
let render_cfg = gpu.get_cfg().render;
let mut tiling_control = render_cfg.tiling_control;
if !clustering {
tiling_control |= TILECTL_DISABLE_CLUSTERING;
}
tiling_control
};
let mut alloc = gpu.alloc();
let kalloc = &mut *alloc;
// This sequence number increases per new client/VM? assigned to some slot,
// but it's unclear *which* slot...
let slot_client_seq: u8 = (self.id & 0xff) as u8;
let tile_info = Self::get_tiling_params(&cmdbuf, if clustering { nclusters } else { 1 })?;
let buffer = self.buffer.as_ref().ok_or(EINVAL)?.lock();
let scene = Arc::try_new(buffer.new_scene(kalloc, &tile_info)?)?;
let notifier = self.notifier.clone();
let tvb_autogrown = buffer.auto_grow()?;
if tvb_autogrown {
let new_size = buffer.block_count() as usize;
cls_dev_dbg!(
TVBStats,
&self.dev,
"[Submission {}] TVB grew to {} bytes ({} blocks) due to overflows\n",
id,
new_size * buffer::BLOCK_SIZE,
new_size,
);
}
let tvb_grown = buffer.ensure_blocks(tile_info.min_tvb_blocks)?;
if tvb_grown {
cls_dev_dbg!(
TVBStats,
&self.dev,
"[Submission {}] TVB grew to {} bytes ({} blocks) due to dimensions ({}x{})\n",
id,
tile_info.min_tvb_blocks * buffer::BLOCK_SIZE,
tile_info.min_tvb_blocks,
cmdbuf.fb_width,
cmdbuf.fb_height
);
}
let vm_bind = job.vm_bind.clone();
mod_dev_dbg!(
self.dev,
"[Submission {}] VM slot = {}\n",
id,
vm_bind.slot()
);
let ev_vtx = job.get_vtx()?.event_info();
let ev_frag = job.get_frag()?.event_info();
mod_dev_dbg!(
self.dev,
"[Submission {}] Vert event #{} -> {:#x?}\n",
id,
ev_vtx.slot,
ev_vtx.value.next(),
);
mod_dev_dbg!(
self.dev,
"[Submission {}] Frag event #{} -> {:#x?}\n",
id,
ev_frag.slot,
ev_frag.value.next(),
);
let uuid_3d = cmdbuf.cmd_3d_id;
let uuid_ta = cmdbuf.cmd_ta_id;
mod_dev_dbg!(
self.dev,
"[Submission {}] Vert UUID = {:#x?}\n",
id,
uuid_ta
);
mod_dev_dbg!(
self.dev,
"[Submission {}] Frag UUID = {:#x?}\n",
id,
uuid_3d
);
let fence = job.fence.clone();
let frag_job = job.get_frag()?;
mod_dev_dbg!(self.dev, "[Submission {}] Create Barrier\n", id);
let barrier: GpuObject<fw::workqueue::Barrier> = kalloc.private.new_inplace(
Default::default(),
|_inner, ptr: &mut MaybeUninit<fw::workqueue::raw::Barrier>| {
Ok(place!(
ptr,
fw::workqueue::raw::Barrier {
tag: fw::workqueue::CommandType::Barrier,
wait_stamp: ev_vtx.fw_stamp_pointer,
wait_value: ev_vtx.value.next(),
wait_slot: ev_vtx.slot,
stamp_self: ev_frag.value.next(),
uuid: uuid_3d,
unk: 0,
}
))
},
)?;
mod_dev_dbg!(self.dev, "[Submission {}] Add Barrier\n", id);
frag_job.add(barrier, vm_bind.slot())?;
let timestamps = Arc::try_new(kalloc.shared.new_default::<fw::job::RenderTimestamps>()?)?;
let unk1 = debug_enabled(debug::DebugFlags::Debug1);
let unk2 = debug_enabled(debug::DebugFlags::Debug2);
let unk3 = debug_enabled(debug::DebugFlags::Debug3);
let mut tile_config: u64 = 0;
if !unk1 {
tile_config |= 0x280;
}
if cmdbuf.layers > 1 {
tile_config |= 1;
}
if cmdbuf.flags & bindings::ASAHI_RENDER_PROCESS_EMPTY_TILES as u64 != 0 {
tile_config |= 0x10000;
}
let mut utile_config =
((tile_info.utile_width / 16) << 12) | ((tile_info.utile_height / 16) << 14);
utile_config |= match cmdbuf.samples {
1 => 0,
2 => 1,
4 => 2,
_ => return Err(EINVAL),
};
let frag_result = result_writer
.map(|writer| {
let mut result = RenderResult {
result: Default::default(),
vtx_complete: false,
frag_complete: false,
vtx_error: None,
frag_error: None,
writer,
};
if tvb_autogrown {
result.result.flags |= bindings::DRM_ASAHI_RESULT_RENDER_TVB_GROW_OVF as u64;
}
if tvb_grown {
result.result.flags |= bindings::DRM_ASAHI_RESULT_RENDER_TVB_GROW_MIN as u64;
}
result.result.tvb_size_bytes = buffer.size() as u64;
Arc::try_new(Mutex::new(result))
})
.transpose()?;
let vtx_result = frag_result.clone();
// TODO: check
#[ver(V >= V13_0B4)]
let count_frag = self.counter.fetch_add(2, Ordering::Relaxed);
#[ver(V >= V13_0B4)]
let count_vtx = count_frag + 1;
mod_dev_dbg!(self.dev, "[Submission {}] Create Frag\n", id);
let frag = GpuObject::new_prealloc(
kalloc.private.alloc_object()?,
|ptr: GpuWeakPointer<fw::fragment::RunFragment::ver>| {
let mut builder = microseq::Builder::new();
let stats = inner_weak_ptr!(
gpu.initdata.runtime_pointers.stats.frag.weak_pointer(),
stats
);
let start_frag = builder.add(microseq::StartFragment::ver {
header: microseq::op::StartFragment::HEADER,
job_params2: inner_weak_ptr!(ptr, job_params2),
job_params1: inner_weak_ptr!(ptr, job_params1),
scene: scene.gpu_pointer(),
stats,
busy_flag: inner_weak_ptr!(ptr, busy_flag),
tvb_overflow_count: inner_weak_ptr!(ptr, tvb_overflow_count),
unk_pointer: inner_weak_ptr!(ptr, unk_pointee),
work_queue: ev_frag.info_ptr,
work_item: ptr,
vm_slot: vm_bind.slot(),
unk_50: 0x1, // fixed
event_generation: self.id as u32,
buffer_slot: scene.slot(),
unk_5c: 0,
cmd_seq: U64(ev_frag.cmd_seq),
unk_68: 0,
unk_758_flag: inner_weak_ptr!(ptr, unk_758_flag),
unk_job_buf: inner_weak_ptr!(ptr, unk_buf_0),
unk_7c: 0,
unk_80: 0,
unk_84: 0,
uuid: uuid_3d,
attachments: common::build_attachments(
cmdbuf.attachments,
cmdbuf.attachment_count,
)?,
unk_190: 0,
#[ver(V >= V13_0B4)]
counter: U64(count_frag),
#[ver(V >= V13_0B4)]
notifier_buf: inner_weak_ptr!(notifier.weak_pointer(), state.unk_buf),
})?;
if frag_result.is_some() {
builder.add(microseq::Timestamp::ver {
header: microseq::op::Timestamp::new(true),
cur_ts: inner_weak_ptr!(ptr, cur_ts),
start_ts: inner_weak_ptr!(ptr, start_ts),
update_ts: inner_weak_ptr!(ptr, start_ts),
work_queue: ev_frag.info_ptr,
unk_24: U64(0),
#[ver(V >= V13_0B4)]
unk_ts: inner_weak_ptr!(ptr, unk_ts),
uuid: uuid_3d,
unk_30_padding: 0,
})?;
}
builder.add(microseq::WaitForIdle {
header: microseq::op::WaitForIdle::new(microseq::Pipe::Fragment),
})?;
if frag_result.is_some() {
builder.add(microseq::Timestamp::ver {
header: microseq::op::Timestamp::new(false),
cur_ts: inner_weak_ptr!(ptr, cur_ts),
start_ts: inner_weak_ptr!(ptr, start_ts),
update_ts: inner_weak_ptr!(ptr, end_ts),
work_queue: ev_frag.info_ptr,
unk_24: U64(0),
#[ver(V >= V13_0B4)]
unk_ts: inner_weak_ptr!(ptr, unk_ts),
uuid: uuid_3d,
unk_30_padding: 0,
})?;
}
let off = builder.offset_to(start_frag);
builder.add(microseq::FinalizeFragment::ver {
header: microseq::op::FinalizeFragment::HEADER,
uuid: uuid_3d,
unk_8: 0,
fw_stamp: ev_frag.fw_stamp_pointer,
stamp_value: ev_frag.value.next(),
unk_18: 0,
scene: scene.weak_pointer(),
buffer: scene.weak_buffer_pointer(),
unk_2c: U64(1),
stats,
unk_pointer: inner_weak_ptr!(ptr, unk_pointee),
busy_flag: inner_weak_ptr!(ptr, busy_flag),
work_queue: ev_frag.info_ptr,
work_item: ptr,
vm_slot: vm_bind.slot(),
unk_60: 0,
unk_758_flag: inner_weak_ptr!(ptr, unk_758_flag),
unk_6c: U64(0),
unk_74: U64(0),
unk_7c: U64(0),
unk_84: U64(0),
unk_8c: U64(0),
#[ver(G == G14 && V < V13_0B4)]
unk_8c_g14: U64(0),
restart_branch_offset: off,
unk_98: unk3.into(),
#[ver(V >= V13_0B4)]
unk_9c: Default::default(),
})?;
builder.add(microseq::RetireStamp {
header: microseq::op::RetireStamp::HEADER,
})?;
Ok(box_in_place!(fw::fragment::RunFragment::ver {
notifier: notifier.clone(),
scene: scene.clone(),
micro_seq: builder.build(&mut kalloc.private)?,
vm_bind: vm_bind.clone(),
aux_fb: self.ualloc.lock().array_empty(0x8000)?,
timestamps: timestamps.clone(),
})?)
},
|inner, ptr| {
let aux_fb_info = fw::fragment::raw::AuxFBInfo::ver {
iogpu_unk_214: cmdbuf.iogpu_unk_214,
unk2: 0,
width: cmdbuf.fb_width,
height: cmdbuf.fb_height,
#[ver(V >= V13_0B4)]
unk3: U64(0x100000),
};
Ok(place!(
ptr,
fw::fragment::raw::RunFragment::ver {
tag: fw::workqueue::CommandType::RunFragment,
#[ver(V >= V13_0B4)]
counter: U64(count_frag),
vm_slot: vm_bind.slot(),
unk_8: 0,
microsequence: inner.micro_seq.gpu_pointer(),
microsequence_size: inner.micro_seq.len() as u32,
notifier: inner.notifier.gpu_pointer(),
buffer: inner.scene.buffer_pointer(),
scene: inner.scene.gpu_pointer(),
unk_buffer_buf: inner.scene.kernel_buffer_pointer(),
tvb_tilemap: inner.scene.tvb_tilemap_pointer(),
ppp_multisamplectl: U64(cmdbuf.ppp_multisamplectl),
samples: cmdbuf.samples,
tiles_per_mtile_y: tile_info.tiles_per_mtile_y as u16,
tiles_per_mtile_x: tile_info.tiles_per_mtile_x as u16,
unk_50: U64(0),
unk_58: U64(0),
merge_upper_x: F32::from_bits(cmdbuf.merge_upper_x),
merge_upper_y: F32::from_bits(cmdbuf.merge_upper_y),
unk_68: U64(0),
tile_count: U64(tile_info.tiles as u64),
job_params1: fw::fragment::raw::JobParameters1::ver {
utile_config: utile_config,
unk_4: 0,
clear_pipeline: fw::fragment::raw::ClearPipelineBinding {
pipeline_bind: U64(cmdbuf.load_pipeline_bind as u64),
address: U64(cmdbuf.load_pipeline as u64),
},
ppp_multisamplectl: U64(cmdbuf.ppp_multisamplectl),
scissor_array: U64(cmdbuf.scissor_array),
depth_bias_array: U64(cmdbuf.depth_bias_array),
aux_fb_info: aux_fb_info,
depth_dimensions: U64(cmdbuf.depth_dimensions as u64),
visibility_result_buffer: U64(cmdbuf.visibility_result_buffer),
zls_ctrl: U64(cmdbuf.zls_ctrl),
#[ver(G >= G14)]
unk_58_g14_0: U64(0x4040404),
#[ver(G >= G14)]
unk_58_g14_8: U64(0),
depth_buffer_ptr1: U64(cmdbuf.depth_buffer_1),
depth_buffer_ptr2: U64(cmdbuf.depth_buffer_2),
stencil_buffer_ptr1: U64(cmdbuf.stencil_buffer_1),
stencil_buffer_ptr2: U64(cmdbuf.stencil_buffer_2),
#[ver(G >= G14)]
unk_68_g14_0: Default::default(),
unk_78: Default::default(),
depth_meta_buffer_ptr1: U64(cmdbuf.depth_meta_buffer_1),
unk_a0: Default::default(),
depth_meta_buffer_ptr2: U64(cmdbuf.depth_meta_buffer_2),
unk_b0: Default::default(),
stencil_meta_buffer_ptr1: U64(cmdbuf.stencil_meta_buffer_1),
unk_c0: Default::default(),
stencil_meta_buffer_ptr2: U64(cmdbuf.stencil_meta_buffer_2),
unk_d0: Default::default(),
tvb_tilemap: inner.scene.tvb_tilemap_pointer(),
tvb_heapmeta: inner.scene.tvb_heapmeta_pointer(),
mtile_stride_dwords: U64((4 * tile_info.params.rgn_size as u64) << 24),
tvb_heapmeta_2: inner.scene.tvb_heapmeta_pointer(),
tile_config: U64(tile_config),
aux_fb: inner.aux_fb.gpu_pointer(),
unk_108: Default::default(),
pipeline_base: U64(0x11_00000000),
unk_140: U64(0x8c60),
unk_148: U64(0x0),
unk_150: U64(0x0),
unk_158: U64(0x1c),
unk_160: U64(0),
unk_168_padding: Default::default(),
#[ver(V < V13_0B4)]
__pad0: Default::default(),
},
job_params2: fw::fragment::raw::JobParameters2 {
store_pipeline_bind: cmdbuf.store_pipeline_bind,
store_pipeline_addr: cmdbuf.store_pipeline,
unk_8: 0x0,
unk_c: 0x0,
merge_upper_x: F32::from_bits(cmdbuf.merge_upper_x),
merge_upper_y: F32::from_bits(cmdbuf.merge_upper_y),
unk_18: U64(0x0),
utiles_per_mtile_y: tile_info.utiles_per_mtile_y as u16,
utiles_per_mtile_x: tile_info.utiles_per_mtile_x as u16,
unk_24: 0x0,
tile_counts: ((tile_info.tiles_y - 1) << 12) | (tile_info.tiles_x - 1),
iogpu_unk_212: cmdbuf.iogpu_unk_212,
isp_bgobjdepth: cmdbuf.isp_bgobjdepth,
// TODO: does this flag need to be exposed to userspace?
isp_bgobjvals: cmdbuf.isp_bgobjvals | 0x400,
unk_38: 0x0,
unk_3c: 0x1,
unk_40: 0,
},
job_params3: fw::fragment::raw::JobParameters3::ver {
unk_44_padding: Default::default(),
depth_bias_array: fw::fragment::raw::ArrayAddr {
ptr: U64(cmdbuf.depth_bias_array),
unk_padding: U64(0),
},
scissor_array: fw::fragment::raw::ArrayAddr {
ptr: U64(cmdbuf.scissor_array),
unk_padding: U64(0),
},
visibility_result_buffer: U64(cmdbuf.visibility_result_buffer),
unk_118: U64(0x0),
unk_120: Default::default(),
unk_reload_pipeline: fw::fragment::raw::ClearPipelineBinding {
pipeline_bind: U64(cmdbuf.partial_reload_pipeline_bind as u64),
address: U64(cmdbuf.partial_reload_pipeline as u64),
},
unk_258: U64(0),
unk_260: U64(0),
unk_268: U64(0),
unk_270: U64(0),
reload_pipeline: fw::fragment::raw::ClearPipelineBinding {
pipeline_bind: U64(cmdbuf.partial_reload_pipeline_bind as u64),
address: U64(cmdbuf.partial_reload_pipeline as u64),
},
zls_ctrl: U64(cmdbuf.zls_ctrl),
unk_290: U64(0x0),
depth_buffer_ptr1: U64(cmdbuf.depth_buffer_1),
unk_2a0: U64(0x0),
unk_2a8: U64(0x0),
depth_buffer_ptr2: U64(cmdbuf.depth_buffer_2),
depth_buffer_ptr3: U64(cmdbuf.depth_buffer_3),
depth_meta_buffer_ptr3: U64(cmdbuf.depth_meta_buffer_3),
stencil_buffer_ptr1: U64(cmdbuf.stencil_buffer_1),
unk_2d0: U64(0x0),
unk_2d8: U64(0x0),
stencil_buffer_ptr2: U64(cmdbuf.stencil_buffer_2),
stencil_buffer_ptr3: U64(cmdbuf.stencil_buffer_3),
stencil_meta_buffer_ptr3: U64(cmdbuf.stencil_meta_buffer_3),
unk_2f8: Default::default(),
iogpu_unk_212: cmdbuf.iogpu_unk_212,
unk_30c: 0x0,
aux_fb_info: aux_fb_info,
unk_320_padding: Default::default(),
unk_partial_store_pipeline:
fw::fragment::raw::StorePipelineBinding::new(
cmdbuf.partial_store_pipeline_bind,
cmdbuf.partial_store_pipeline
),
partial_store_pipeline: fw::fragment::raw::StorePipelineBinding::new(
cmdbuf.partial_store_pipeline_bind,
cmdbuf.partial_store_pipeline
),
isp_bgobjdepth: cmdbuf.isp_bgobjdepth,
isp_bgobjvals: cmdbuf.isp_bgobjvals,
iogpu_unk_49: cmdbuf.iogpu_unk_49,
unk_37c: 0x0,
unk_380: U64(0x0),
unk_388: U64(0x0),
#[ver(V >= V13_0B4)]
unk_390_0: U64(0x0),
depth_dimensions: U64(cmdbuf.depth_dimensions as u64),
},
unk_758_flag: 0,
unk_75c_flag: 0,
unk_buf: Default::default(),
busy_flag: 0,
tvb_overflow_count: 0,
unk_878: 0,
encoder_params: fw::job::raw::EncoderParams {
unk_8: (cmdbuf.flags
& bindings::ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S as u64
!= 0) as u32,
unk_c: 0x0, // fixed
unk_10: 0x0, // fixed
encoder_id: cmdbuf.encoder_id,
unk_18: 0x0, // fixed
iogpu_compute_unk44: 0xffffffff,
seq_buffer: inner.scene.seq_buf_pointer(),
unk_28: U64(0x0), // fixed
},
process_empty_tiles: (cmdbuf.flags
& bindings::ASAHI_RENDER_PROCESS_EMPTY_TILES as u64
!= 0) as u32,
no_clear_pipeline_textures: (cmdbuf.flags
& bindings::ASAHI_RENDER_NO_CLEAR_PIPELINE_TEXTURES as u64
!= 0) as u32,
unk_param: unk2.into(), // 1 for boot stuff?
unk_pointee: 0,
meta: fw::job::raw::JobMeta {
unk_4: 0,
stamp: ev_frag.stamp_pointer,
fw_stamp: ev_frag.fw_stamp_pointer,
stamp_value: ev_frag.value.next(),
stamp_slot: ev_frag.slot,
evctl_index: 0, // fixed
flush_stamps: flush_stamps as u32,
uuid: uuid_3d,
cmd_seq: ev_frag.cmd_seq as u32,
},
unk_after_meta: unk1.into(),
unk_buf_0: U64(0),
unk_buf_8: U64(0),
unk_buf_10: U64(1),
cur_ts: U64(0),
start_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), frag.start)),
end_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), frag.end)),
unk_914: 0,
unk_918: U64(0),
unk_920: 0,
client_sequence: slot_client_seq,
pad_925: Default::default(),
unk_928: 0,
unk_92c: 0,
#[ver(V >= V13_0B4)]
unk_ts: U64(0),
#[ver(V >= V13_0B4)]
unk_92d_8: Default::default(),
}
))
},
)?;
mod_dev_dbg!(self.dev, "[Submission {}] Add Frag\n", id);
fence.add_command();
frag_job.add_cb(frag, vm_bind.slot(), move |cmd, error| {
if let Some(err) = error {
fence.set_error(err.into());
}
if let Some(mut res) = frag_result.as_ref().map(|a| a.lock()) {
cmd.timestamps.with(|raw, _inner| {
res.result.fragment_ts_start = raw.frag.start.load(Ordering::Relaxed);
res.result.fragment_ts_end = raw.frag.end.load(Ordering::Relaxed);
});
cmd.with(|raw, _inner| {
res.result.num_tvb_overflows = raw.tvb_overflow_count;
});
res.frag_error = error;
res.frag_complete = true;
res.commit();
}
fence.command_complete();
})?;
let fence = job.fence.clone();
let vtx_job = job.get_vtx()?;
if scene.rebind() || tvb_grown || tvb_autogrown {
mod_dev_dbg!(self.dev, "[Submission {}] Create Bind Buffer\n", id);
let bind_buffer = kalloc.private.new_inplace(
fw::buffer::InitBuffer::ver {
scene: scene.clone(),
},
|inner, ptr: &mut MaybeUninit<fw::buffer::raw::InitBuffer::ver<'_>>| {
Ok(place!(
ptr,
fw::buffer::raw::InitBuffer::ver {
tag: fw::workqueue::CommandType::InitBuffer,
vm_slot: vm_bind.slot(),
buffer_slot: inner.scene.slot(),
unk_c: 0,
block_count: buffer.block_count(),
buffer: inner.scene.buffer_pointer(),
stamp_value: ev_vtx.value.next(),
}
))
},
)?;
mod_dev_dbg!(self.dev, "[Submission {}] Add Bind Buffer\n", id);
vtx_job.add(bind_buffer, vm_bind.slot())?;
}
mod_dev_dbg!(self.dev, "[Submission {}] Create Vertex\n", id);
let vtx = GpuObject::new_prealloc(
kalloc.private.alloc_object()?,
|ptr: GpuWeakPointer<fw::vertex::RunVertex::ver>| {
let mut builder = microseq::Builder::new();
let stats = inner_weak_ptr!(
gpu.initdata.runtime_pointers.stats.vtx.weak_pointer(),
stats
);
let start_vtx = builder.add(microseq::StartVertex::ver {
header: microseq::op::StartVertex::HEADER,
tiling_params: inner_weak_ptr!(ptr, tiling_params),
job_params1: inner_weak_ptr!(ptr, job_params1),
buffer: scene.weak_buffer_pointer(),
scene: scene.weak_pointer(),
stats,
work_queue: ev_vtx.info_ptr,
vm_slot: vm_bind.slot(),
unk_38: 1, // fixed
event_generation: self.id as u32,
buffer_slot: scene.slot(),
unk_44: 0,
cmd_seq: U64(ev_vtx.cmd_seq),
unk_50: 0,
unk_pointer: inner_weak_ptr!(ptr, unk_pointee),
unk_job_buf: inner_weak_ptr!(ptr, unk_buf_0),
unk_64: 0x0, // fixed
unk_68: unk1.into(),
uuid: uuid_ta,
unk_70: 0x0, // fixed
unk_74: Default::default(), // fixed
unk_15c: 0x0, // fixed
unk_160: U64(0x0), // fixed
unk_168: 0x0, // fixed
unk_16c: 0x0, // fixed
unk_170: U64(0x0), // fixed
#[ver(V >= V13_0B4)]
counter: U64(count_vtx),
#[ver(V >= V13_0B4)]
notifier_buf: inner_weak_ptr!(notifier.weak_pointer(), state.unk_buf),
unk_178: 0x0, // padding?
})?;
if vtx_result.is_some() {
builder.add(microseq::Timestamp::ver {
header: microseq::op::Timestamp::new(true),
cur_ts: inner_weak_ptr!(ptr, cur_ts),
start_ts: inner_weak_ptr!(ptr, start_ts),
update_ts: inner_weak_ptr!(ptr, start_ts),
work_queue: ev_vtx.info_ptr,
unk_24: U64(0),
#[ver(V >= V13_0B4)]
unk_ts: inner_weak_ptr!(ptr, unk_ts),
uuid: uuid_ta,
unk_30_padding: 0,
})?;
}
builder.add(microseq::WaitForIdle {
header: microseq::op::WaitForIdle::new(microseq::Pipe::Vertex),
})?;
if vtx_result.is_some() {
builder.add(microseq::Timestamp::ver {
header: microseq::op::Timestamp::new(false),
cur_ts: inner_weak_ptr!(ptr, cur_ts),
start_ts: inner_weak_ptr!(ptr, start_ts),
update_ts: inner_weak_ptr!(ptr, end_ts),
work_queue: ev_vtx.info_ptr,
unk_24: U64(0),
#[ver(V >= V13_0B4)]
unk_ts: inner_weak_ptr!(ptr, unk_ts),
uuid: uuid_ta,
unk_30_padding: 0,
})?;
}
let off = builder.offset_to(start_vtx);
builder.add(microseq::FinalizeVertex::ver {
header: microseq::op::FinalizeVertex::HEADER,
scene: scene.weak_pointer(),
buffer: scene.weak_buffer_pointer(),
stats,
work_queue: ev_vtx.info_ptr,
vm_slot: vm_bind.slot(),
unk_28: 0x0, // fixed
unk_pointer: inner_weak_ptr!(ptr, unk_pointee),
unk_34: 0x0, // fixed
uuid: uuid_ta,
fw_stamp: ev_vtx.fw_stamp_pointer,
stamp_value: ev_vtx.value.next(),
unk_48: U64(0x0), // fixed
unk_50: 0x0, // fixed
unk_54: 0x0, // fixed
unk_58: U64(0x0), // fixed
unk_60: 0x0, // fixed
unk_64: 0x0, // fixed
unk_68: 0x0, // fixed
#[ver(G >= G14 && V < V13_0B4)]
unk_68_g14: U64(0),
restart_branch_offset: off,
unk_70: 0x0, // fixed
#[ver(V >= V13_0B4)]
unk_74: Default::default(), // Ventura
})?;
builder.add(microseq::RetireStamp {
header: microseq::op::RetireStamp::HEADER,
})?;
Ok(box_in_place!(fw::vertex::RunVertex::ver {
notifier: notifier,
scene: scene.clone(),
micro_seq: builder.build(&mut kalloc.private)?,
vm_bind: vm_bind.clone(),
timestamps: timestamps,
})?)
},
|inner, ptr| {
#[ver(G < G14)]
let core_masks = gpu.core_masks_packed();
Ok(place!(
ptr,
fw::vertex::raw::RunVertex::ver {
tag: fw::workqueue::CommandType::RunVertex,
#[ver(V >= V13_0B4)]
counter: U64(count_vtx),
vm_slot: vm_bind.slot(),
unk_8: 0,
notifier: inner.notifier.gpu_pointer(),
buffer_slot: inner.scene.slot(),
unk_1c: 0,
buffer: inner.scene.buffer_pointer(),
scene: inner.scene.gpu_pointer(),
unk_buffer_buf: inner.scene.kernel_buffer_pointer(),
unk_34: 0,
job_params1: fw::vertex::raw::JobParameters1::ver {
unk_0: U64(if unk1 { 0 } else { 0x200 }), // sometimes 0
unk_8: f32!(1e-20), // fixed
unk_c: f32!(1e-20), // fixed
tvb_tilemap: inner.scene.tvb_tilemap_pointer(),
#[ver(G < G14)]
tvb_cluster_tilemaps: inner.scene.cluster_tilemaps_pointer(),
tpc: inner.scene.tpc_pointer(),
tvb_heapmeta: inner
.scene
.tvb_heapmeta_pointer()
.or(0x8000_0000_0000_0000),
iogpu_unk_54: 0x6b0003, // fixed
iogpu_unk_55: 0x3a0012, // fixed
iogpu_unk_56: U64(0x1), // fixed
#[ver(G < G14)]
tvb_cluster_meta1: inner
.scene
.meta_1_pointer()
.map(|x| x.or((tile_info.meta1_blocks as u64) << 50)),
utile_config: utile_config,
unk_4c: 0,
ppp_multisamplectl: U64(cmdbuf.ppp_multisamplectl), // fixed
tvb_heapmeta_2: inner.scene.tvb_heapmeta_pointer(),
#[ver(G < G14)]
unk_60: U64(0x0), // fixed
#[ver(G < G14)]
core_mask: Array::new([
*core_masks.first().unwrap_or(&0),
*core_masks.get(1).unwrap_or(&0),
]),
preempt_buf1: inner.scene.preempt_buf_1_pointer(),
preempt_buf2: inner.scene.preempt_buf_2_pointer(),
unk_80: U64(0x1), // fixed
preempt_buf3: inner
.scene
.preempt_buf_3_pointer()
.or(0x4_0000_0000_0000), // check
encoder_addr: U64(cmdbuf.encoder_ptr),
#[ver(G < G14)]
tvb_cluster_meta2: inner.scene.meta_2_pointer(),
#[ver(G < G14)]
tvb_cluster_meta3: inner.scene.meta_3_pointer(),
#[ver(G < G14)]
tiling_control: tiling_control,
#[ver(G < G14)]
unk_ac: Default::default(), // fixed
unk_b0: Default::default(), // fixed
pipeline_base: U64(0x11_00000000),
#[ver(G < G14)]
tvb_cluster_meta4: inner
.scene
.meta_4_pointer()
.map(|x| x.or(0x3000_0000_0000_0000)),
#[ver(G < G14)]
unk_f0: U64(0x1c + align(tile_info.meta1_blocks, 4) as u64),
unk_f8: U64(0x8c60), // fixed
unk_100: Default::default(), // fixed
unk_118: 0x1c, // fixed
#[ver(G >= G14)]
__pad: Default::default(),
},
unk_154: Default::default(),
tiling_params: tile_info.params,
unk_3e8: Default::default(),
tpc: inner.scene.tpc_pointer(),
tpc_size: U64(tile_info.tpc_size as u64),
microsequence: inner.micro_seq.gpu_pointer(),
microsequence_size: inner.micro_seq.len() as u32,
fragment_stamp_slot: ev_frag.slot,
fragment_stamp_value: ev_frag.value.next(),
unk_pointee: 0,
unk_pad: 0,
job_params2: fw::vertex::raw::JobParameters2 {
unk_480: Default::default(), // fixed
unk_498: U64(0x0), // fixed
unk_4a0: 0x0, // fixed
preempt_buf1: inner.scene.preempt_buf_1_pointer(),
unk_4ac: 0x0, // fixed
unk_4b0: U64(0x0), // fixed
unk_4b8: 0x0, // fixed
unk_4bc: U64(0x0), // fixed
unk_4c4_padding: Default::default(),
unk_50c: 0x0, // fixed
unk_510: U64(0x0), // fixed
unk_518: U64(0x0), // fixed
unk_520: U64(0x0), // fixed
},
encoder_params: fw::job::raw::EncoderParams {
unk_8: 0x0, // fixed
unk_c: 0x0, // fixed
unk_10: 0x0, // fixed
encoder_id: cmdbuf.encoder_id,
unk_18: 0x0, // fixed
iogpu_compute_unk44: 0xffffffff,
seq_buffer: inner.scene.seq_buf_pointer(),
unk_28: U64(0x0), // fixed
},
unk_55c: 0,
unk_560: 0,
memoryless_rts_used: (cmdbuf.flags
& bindings::ASAHI_RENDER_MEMORYLESS_RTS_USED as u64
!= 0) as u32,
unk_568: 0,
unk_56c: 0,
meta: fw::job::raw::JobMeta {
unk_4: 0,
stamp: ev_vtx.stamp_pointer,
fw_stamp: ev_vtx.fw_stamp_pointer,
stamp_value: ev_vtx.value.next(),
stamp_slot: ev_vtx.slot,
evctl_index: 0, // fixed
flush_stamps: flush_stamps as u32,
uuid: uuid_ta,
cmd_seq: ev_vtx.cmd_seq as u32,
},
unk_after_meta: unk1.into(),
unk_buf_0: U64(0),
unk_buf_8: U64(0),
unk_buf_10: U64(0),
cur_ts: U64(0),
start_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), vtx.start)),
end_ts: Some(inner_ptr!(inner.timestamps.gpu_pointer(), vtx.end)),
unk_5c4: 0,
unk_5c8: 0,
unk_5cc: 0,
unk_5d0: 0,
client_sequence: slot_client_seq,
pad_5d5: Default::default(),
unk_5d8: 0,
unk_5dc: 0,
#[ver(V >= V13_0B4)]
unk_ts: U64(0),
#[ver(V >= V13_0B4)]
unk_5dd_8: Default::default(),
}
))
},
)?;
core::mem::drop(alloc);
mod_dev_dbg!(self.dev, "[Submission {}] Add Vertex\n", id);
fence.add_command();
vtx_job.add_cb(vtx, vm_bind.slot(), move |cmd, error| {
if let Some(err) = error {
fence.set_error(err.into())
}
if let Some(mut res) = vtx_result.as_ref().map(|a| a.lock()) {
cmd.timestamps.with(|raw, _inner| {
res.result.vertex_ts_start = raw.vtx.start.load(Ordering::Relaxed);
res.result.vertex_ts_end = raw.vtx.end.load(Ordering::Relaxed);
});
res.result.tvb_usage_bytes = cmd.scene.used_bytes() as u64;
if cmd.scene.overflowed() {
res.result.flags |= bindings::DRM_ASAHI_RESULT_RENDER_TVB_OVERFLOWED as u64;
}
res.vtx_error = error;
res.vtx_complete = true;
res.commit();
}
fence.command_complete();
})?;
mod_dev_dbg!(self.dev, "[Submission {}] Increment counters\n", id);
self.notifier.threshold.with(|raw, _inner| {
raw.increment();
raw.increment();
});
// TODO: handle rollbacks, move to job submit?
buffer.increment();
job.get_vtx()?.next_seq();
job.get_frag()?.next_seq();
Ok(())
}
}

diff --git a/drivers/gpu/drm/asahi/regs.rs b/drivers/gpu/drm/asahi/regs.rs
new file mode 100644
index 000000000000..019d7214793d
--- /dev/null
+++ b/drivers/gpu/drm/asahi/regs.rs
@@ -0,0 +1,387 @@
// SPDX-License-Identifier: GPL-2.0-only OR MIT
//! GPU MMIO register abstraction
//!
//! Since the vast majority of the interactions with the GPU are brokered through the firmware,
//! there is very little need to interact directly with GPU MMIO registers. This module abstracts
//! the few operations that require that, mainly reading the MMU fault status, reading GPU ID
//! information, and starting the GPU firmware coprocessor.
use crate::hw;
use kernel::{device, io_mem::IoMem, platform, prelude::*};
/// Size of the ASC control MMIO region.
pub(crate) const ASC_CTL_SIZE: usize = 0x4000;

/// Size of the SGX MMIO region.
pub(crate) const SGX_SIZE: usize = 0x1000000;

const CPU_CONTROL: usize = 0x44;
const CPU_RUN: u32 = 0x1 << 4; // BIT(4)

const FAULT_INFO: usize = 0x17030;

const ID_VERSION: usize = 0xd04000;
const ID_UNK08: usize = 0xd04008;
const ID_COUNTS_1: usize = 0xd04010;
const ID_COUNTS_2: usize = 0xd04014;
const ID_UNK18: usize = 0xd04018;
const ID_CLUSTERS: usize = 0xd0401c;

const CORE_MASK_0: usize = 0xd01500;
const CORE_MASK_1: usize = 0xd01514;
/// Enum representing the unit that caused an MMU fault.
#[allow(non_camel_case_types)]
#[allow(clippy::upper_case_acronyms)]
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub(crate) enum FaultUnit {
/// Decompress / pixel fetch
DCMP(u8),
/// USC L1 Cache (device loads/stores)
UL1C(u8),
/// Compress / pixel store
CMP(u8),
GSL1(u8),
IAP(u8),
VCE(u8),
/// Tiling Engine
TE(u8),
RAS(u8),
/// Vertex Data Master
VDM(u8),
PPP(u8),
/// ISP Parameter Fetch
IPF(u8),
IPF_CPF(u8),
VF(u8),
VF_CPF(u8),
/// Depth/Stencil load/store
ZLS(u8),
/// Parameter Management
dPM,
/// Compute Data Master
dCDM_KS(u8),
dIPP,
dIPP_CS,
/// Vertex Data Master
dVDM_CSD,
dVDM_SSD,
dVDM_ILF,
dVDM_ILD,
dRDE(u8),
FC,
GSL2,
/// Graphics L2 Cache Control?
GL2CC_META(u8),
GL2CC_MB,
/// Parameter Management
gPM_SP(u8),
/// Vertex Data Master - CSD
gVDM_CSD_SP(u8),
gVDM_SSD_SP(u8),
gVDM_ILF_SP(u8),
gVDM_TFP_SP(u8),
gVDM_MMB_SP(u8),
/// Compute Data Master
gCDM_CS_KS0_SP(u8),
gCDM_CS_KS1_SP(u8),
gCDM_CS_KS2_SP(u8),
gCDM_KS0_SP(u8),
gCDM_KS1_SP(u8),
gCDM_KS2_SP(u8),
gIPP_SP(u8),
gIPP_CS_SP(u8),
gRDE0_SP(u8),
gRDE1_SP(u8),
Unknown(u8),
}
/// Reason for an MMU fault.
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub(crate) enum FaultReason {
Unmapped,
AfFault,
WriteOnly,
ReadOnly,
NoAccess,
Unknown(u8),
}
/// Collection of information about an MMU fault.
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
pub(crate) struct FaultInfo {
pub(crate) address: u64,
pub(crate) sideband: u8,
pub(crate) vm_slot: u32,
pub(crate) unit_code: u8,
pub(crate) unit: FaultUnit,
pub(crate) level: u8,
pub(crate) unk_5: u8,
pub(crate) read: bool,
pub(crate) reason: FaultReason,
}
/// Device resources for this GPU instance.
pub(crate) struct Resources {
dev: device::Device,
asc: IoMem<ASC_CTL_SIZE>,
sgx: IoMem<SGX_SIZE>,
}

impl Resources {
/// Map the required resources given our platform device.
pub(crate) fn new(pdev: &mut platform::Device) -> Result<Resources> {
// TODO: add device abstraction to ioremap by name
let asc_res = unsafe { pdev.ioremap_resource(0)? };
let sgx_res = unsafe { pdev.ioremap_resource(1)? };
Ok(Resources {
// SAFETY: This device does DMA via the UAT IOMMU.
dev: device::Device::from_dev(pdev),
asc: asc_res,
sgx: sgx_res,
})
}

fn sgx_read32(&self, off: usize) -> u32 {
self.sgx.readl_relaxed(off)
}

/* Not yet used
fn sgx_write32(&self, off: usize, val: u32) {
self.sgx.writel_relaxed(val, off)
}
*/

fn sgx_read64(&self, off: usize) -> u64 {
self.sgx.readq_relaxed(off)
}

/* Not yet used
fn sgx_write64(&self, off: usize, val: u64) {
self.sgx.writeq_relaxed(val, off)
}
*/

/// Initialize the MMIO registers for the GPU.
pub(crate) fn init_mmio(&self) -> Result {
// Nothing to do for now...
Ok(())
}

/// Start the ASC coprocessor CPU.
pub(crate) fn start_cpu(&self) -> Result {
let val = self.asc.readl_relaxed(CPU_CONTROL);
self.asc.writel_relaxed(val | CPU_RUN, CPU_CONTROL);
Ok(())
}

/// Get the GPU identification info from registers.
///
/// See [`hw::GpuIdConfig`] for the result.
pub(crate) fn get_gpu_id(&self) -> Result<hw::GpuIdConfig> {
let id_version = self.sgx_read32(ID_VERSION);
let id_unk08 = self.sgx_read32(ID_UNK08);
let id_counts_1 = self.sgx_read32(ID_COUNTS_1);
let id_counts_2 = self.sgx_read32(ID_COUNTS_2);
let id_unk18 = self.sgx_read32(ID_UNK18);
let id_clusters = self.sgx_read32(ID_CLUSTERS);
dev_info!(
self.dev,
"GPU ID registers: {:#x} {:#x} {:#x} {:#x} {:#x} {:#x}\n",
id_version,
id_unk08,
id_counts_1,
id_counts_2,
id_unk18,
id_clusters
);
let core_mask_0 = self.sgx_read32(CORE_MASK_0);
let core_mask_1 = self.sgx_read32(CORE_MASK_1);
let mut core_mask = (core_mask_0 as u64) | ((core_mask_1 as u64) << 32);
dev_info!(self.dev, "Core mask: {:#x}\n", core_mask);
let num_clusters = (id_clusters >> 12) & 0xff;
let num_cores = id_counts_1 & 0xff;
if num_cores * num_clusters > 64 {
dev_err!(
self.dev,
"Too many total cores ({} x {} > 64)\n",
num_clusters,
num_cores
);
return Err(ENODEV);
}
let mut core_masks = Vec::new();
let mut total_active_cores: u32 = 0;
let max_core_mask = (1u64 << num_cores) - 1;
for _i in 0..num_clusters {
let mask = core_mask & max_core_mask;
core_masks.try_push(mask as u32)?;
core_mask >>= num_cores;
total_active_cores += mask.count_ones();
}
let mut core_masks_packed = Vec::new();
core_masks_packed.try_push(core_mask_0)?;
if core_mask_1 != 0 {
core_masks_packed.try_push(core_mask_1)?;
}
if core_mask != 0 {
dev_err!(self.dev, "Leftover core mask: {:#x}\n", core_mask);
return Err(EIO);
}
let (gpu_rev, gpu_rev_id) = match (id_version >> 8) & 0xff {
0x00 => (hw::GpuRevision::A0, hw::GpuRevisionID::A0),
0x01 => (hw::GpuRevision::A1, hw::GpuRevisionID::A1),
0x10 => (hw::GpuRevision::B0, hw::GpuRevisionID::B0),
0x11 => (hw::GpuRevision::B1, hw::GpuRevisionID::B1),
0x20 => (hw::GpuRevision::C0, hw::GpuRevisionID::C0),
0x21 => (hw::GpuRevision::C1, hw::GpuRevisionID::C1),
a => {
dev_err!(self.dev, "Unknown GPU revision {}\n", a);
return Err(ENODEV);
}
};
Ok(hw::GpuIdConfig {
gpu_gen: match (id_version >> 24) & 0xff {
4 => hw::GpuGen::G13,
5 => hw::GpuGen::G14,
a => {
dev_err!(self.dev, "Unknown GPU generation {}\n", a);
return Err(ENODEV);
}
},
gpu_variant: match (id_version >> 16) & 0xff {
1 => hw::GpuVariant::P, // Guess
2 => hw::GpuVariant::G,
3 => hw::GpuVariant::S,
4 => {
if num_clusters > 4 {
hw::GpuVariant::D
} else {
hw::GpuVariant::C
}
}
a => {
dev_err!(self.dev, "Unknown GPU variant {}\n", a);
return Err(ENODEV);
}
},
gpu_rev,
gpu_rev_id,
max_dies: (id_clusters >> 20) & 0xf,
num_clusters,
num_cores,
num_frags: (id_counts_1 >> 8) & 0xff,
num_gps: (id_counts_2 >> 16) & 0xff,
total_active_cores,
core_masks,
core_masks_packed,
})
- }
- /// Get the fault information from the MMU status register, if one occurred.
- pub(crate) fn get_fault_info(&self) -> Option<FaultInfo> {
let fault_info = self.sgx_read64(FAULT_INFO);
if fault_info & 1 == 0 {
return None;
}
let unit_code = ((fault_info >> 9) & 0xff) as u8;
let unit = match unit_code {
0x00..=0x9f => match unit_code & 0xf {
0x0 => FaultUnit::DCMP(unit_code >> 4),
0x1 => FaultUnit::UL1C(unit_code >> 4),
0x2 => FaultUnit::CMP(unit_code >> 4),
0x3 => FaultUnit::GSL1(unit_code >> 4),
0x4 => FaultUnit::IAP(unit_code >> 4),
0x5 => FaultUnit::VCE(unit_code >> 4),
0x6 => FaultUnit::TE(unit_code >> 4),
0x7 => FaultUnit::RAS(unit_code >> 4),
0x8 => FaultUnit::VDM(unit_code >> 4),
0x9 => FaultUnit::PPP(unit_code >> 4),
0xa => FaultUnit::IPF(unit_code >> 4),
0xb => FaultUnit::IPF_CPF(unit_code >> 4),
0xc => FaultUnit::VF(unit_code >> 4),
0xd => FaultUnit::VF_CPF(unit_code >> 4),
0xe => FaultUnit::ZLS(unit_code >> 4),
_ => FaultUnit::Unknown(unit_code),
},
0xa1 => FaultUnit::dPM,
0xa2 => FaultUnit::dCDM_KS(0),
0xa3 => FaultUnit::dCDM_KS(1),
0xa4 => FaultUnit::dCDM_KS(2),
0xa5 => FaultUnit::dIPP,
0xa6 => FaultUnit::dIPP_CS,
0xa7 => FaultUnit::dVDM_CSD,
0xa8 => FaultUnit::dVDM_SSD,
0xa9 => FaultUnit::dVDM_ILF,
0xaa => FaultUnit::dVDM_ILD,
0xab => FaultUnit::dRDE(0),
0xac => FaultUnit::dRDE(1),
0xad => FaultUnit::FC,
0xae => FaultUnit::GSL2,
0xb0..=0xb7 => FaultUnit::GL2CC_META(unit_code & 0xf),
0xb8 => FaultUnit::GL2CC_MB,
0xe0..=0xff => match unit_code & 0xf {
0x0 => FaultUnit::gPM_SP((unit_code >> 4) & 1),
0x1 => FaultUnit::gVDM_CSD_SP((unit_code >> 4) & 1),
0x2 => FaultUnit::gVDM_SSD_SP((unit_code >> 4) & 1),
0x3 => FaultUnit::gVDM_ILF_SP((unit_code >> 4) & 1),
0x4 => FaultUnit::gVDM_TFP_SP((unit_code >> 4) & 1),
0x5 => FaultUnit::gVDM_MMB_SP((unit_code >> 4) & 1),
0x6 => FaultUnit::gCDM_CS_KS0_SP((unit_code >> 4) & 1),
0x7 => FaultUnit::gCDM_CS_KS1_SP((unit_code >> 4) & 1),
0x8 => FaultUnit::gCDM_CS_KS2_SP((unit_code >> 4) & 1),
0x9 => FaultUnit::gCDM_KS0_SP((unit_code >> 4) & 1),
0xa => FaultUnit::gCDM_KS1_SP((unit_code >> 4) & 1),
0xb => FaultUnit::gCDM_KS2_SP((unit_code >> 4) & 1),
0xc => FaultUnit::gIPP_SP((unit_code >> 4) & 1),
0xd => FaultUnit::gIPP_CS_SP((unit_code >> 4) & 1),
0xe => FaultUnit::gRDE0_SP((unit_code >> 4) & 1),
0xf => FaultUnit::gRDE1_SP((unit_code >> 4) & 1),
_ => FaultUnit::Unknown(unit_code),
},
_ => FaultUnit::Unknown(unit_code),
};
let reason = match (fault_info >> 1) & 0x7 {
0 => FaultReason::Unmapped,
1 => FaultReason::AfFault,
2 => FaultReason::WriteOnly,
3 => FaultReason::ReadOnly,
4 => FaultReason::NoAccess,
a => FaultReason::Unknown(a as u8),
};
Some(FaultInfo {
address: (fault_info >> 30) << 6,
sideband: ((fault_info >> 23) & 0x7f) as u8,
vm_slot: ((fault_info >> 17) & 0x3f) as u32,
unit_code,
unit,
level: ((fault_info >> 7) & 3) as u8,
unk_5: ((fault_info >> 5) & 3) as u8,
read: (fault_info & (1 << 4)) != 0,
reason,
})
- }
+}
diff --git a/drivers/gpu/drm/asahi/slotalloc.rs b/drivers/gpu/drm/asahi/slotalloc.rs
new file mode 100644
index 000000000000..6493111643fe
--- /dev/null
+++ b/drivers/gpu/drm/asahi/slotalloc.rs
@@ -0,0 +1,292 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! Generic slot allocator
+//!
+//! This is a simple allocator to manage fixed-size pools of GPU resources that are transiently
+//! required during command execution. Each item resides in a "slot" at a given index. Users borrow
+//! and return free items from the available pool.
+//!
+//! Allocations are "sticky", and return a token that callers can use to request the same slot
+//! again later. This allows slots to be lazily invalidated, so that multiple uses by the same user
+//! avoid any actual cleanup work.
+//!
+//! The allocation policy is currently a simple LRU mechanism, doing a full linear scan over the
+//! slots when no token was previously provided. This is probably good enough, since in the absence
+//! of serious system contention most allocation requests will be immediately fulfilled from the
+//! previous slot without doing an LRU scan.
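As a side note for reviewers, the sticky-token policy described above can be illustrated with a minimal user-space sketch. All names here are hypothetical and simplified; the real allocator additionally carries user data, protects its state with a `Mutex`, and blocks on a `CondVar` when no slot is free.

```rust
// Hypothetical, simplified sketch of the sticky-slot policy (std-only, no
// locking): each free slot remembers the global get-counter value of its last
// hand-out; a token whose `time` still matches can reclaim its old slot
// without a scan, otherwise the least-recently-dropped free slot is taken.
#[derive(Copy, Clone, Debug)]
struct Token {
    time: u64,
    slot: usize,
}

struct Slots<T> {
    // (item, get_time, drop_time); `None` means the slot is handed out.
    items: Vec<Option<(T, u64, u64)>>,
    get_count: u64,
    drop_count: u64,
}

impl<T> Slots<T> {
    fn new(items: Vec<T>) -> Self {
        Slots {
            items: items.into_iter().map(|i| Some((i, 0, 0))).collect(),
            get_count: 0,
            drop_count: 0,
        }
    }

    // Returns (item, token, changed); `changed` is false only on a sticky hit.
    fn get(&mut self, token: Option<Token>) -> Option<(T, Token, bool)> {
        if let Some(t) = token {
            let sticky_hit =
                matches!(self.items[t.slot].as_ref(), Some((_, gt, _)) if *gt == t.time);
            if sticky_hit {
                let (item, _, _) = self.items[t.slot].take().unwrap();
                return Some((item, t, false));
            }
        }
        // LRU scan: pick the free slot with the oldest drop_time.
        let slot = self
            .items
            .iter()
            .enumerate()
            .filter_map(|(i, s)| s.as_ref().map(|&(_, _, d)| (i, d)))
            .min_by_key(|&(_, d)| d)?
            .0;
        self.get_count += 1;
        let (item, _, _) = self.items[slot].take().unwrap();
        Some((item, Token { time: self.get_count, slot }, true))
    }

    fn put(&mut self, item: T, token: Token) {
        self.drop_count += 1;
        self.items[token.slot] = Some((item, token.time, self.drop_count));
    }
}
```

The key invariant is the one the kernel version also relies on: a token can only skip the LRU scan if the slot's recorded `get_time` still equals the token's `time`, i.e. nobody else was handed that slot in the interim.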
+use core::ops::{Deref, DerefMut};
+use kernel::{
- error::{code::*, Result},
- prelude::*,
- sync::{Arc, CondVar, Mutex, UniqueArc},
+};
+/// Trait representing a single item within a slot.
+pub(crate) trait SlotItem {
- /// Arbitrary user data associated with the SlotAllocator.
- type Data;
- /// Called eagerly when this item is released back into the available pool.
- fn release(&mut self, _data: &mut Self::Data, _slot: u32) {}
+}
+/// Trivial implementation for users which do not require any slot data nor any allocator data.
+impl SlotItem for () {
- type Data = ();
+}
+/// Represents a current or previous allocation of an item from a slot. Users keep `SlotToken`s
+/// around across allocations to request that, if possible, the same slot be reused.
+#[derive(Copy, Clone, Debug)]
+pub(crate) struct SlotToken {
- time: u64,
- slot: u32,
+}
+impl SlotToken {
- /// Returns the slot index that this token represents a past assignment to.
- pub(crate) fn last_slot(&self) -> u32 {
self.slot
- }
+}
+/// A guard representing active ownership of a slot.
+pub(crate) struct Guard<T: SlotItem> {
- item: Option<T>,
- changed: bool,
- token: SlotToken,
- alloc: Arc<SlotAllocatorOuter<T>>,
+}
+impl<T: SlotItem> Guard<T> {
- /// Returns the active slot owned by this `Guard`.
- pub(crate) fn slot(&self) -> u32 {
self.token.slot
- }
- /// Returns `true` if the slot changed since the last allocation (or no `SlotToken` was
- /// provided), or `false` if the previously allocated slot was successfully re-acquired with
- /// no other users in the interim.
- pub(crate) fn changed(&self) -> bool {
self.changed
- }
- /// Returns a `SlotToken` that can be used to re-request the same slot at a later time, after
- /// this `Guard` is dropped.
- pub(crate) fn token(&self) -> SlotToken {
self.token
- }
+}
+impl<T: SlotItem> Deref for Guard<T> {
- type Target = T;
- fn deref(&self) -> &Self::Target {
self.item.as_ref().expect("SlotItem Guard lost our item!")
- }
+}
+impl<T: SlotItem> DerefMut for Guard<T> {
- fn deref_mut(&mut self) -> &mut Self::Target {
self.item.as_mut().expect("SlotItem Guard lost our item!")
- }
+}
+/// A slot item that is currently free.
+struct Entry<T: SlotItem> {
- item: T,
- get_time: u64,
- drop_time: u64,
+}
+/// Inner data for the `SlotAllocator`, protected by a `Mutex`.
+struct SlotAllocatorInner<T: SlotItem> {
- data: T::Data,
- slots: Vec<Option<Entry<T>>>,
- get_count: u64,
- drop_count: u64,
+}
+/// A single slot allocator instance.
+struct SlotAllocatorOuter<T: SlotItem> {
- inner: Mutex<SlotAllocatorInner<T>>,
- cond: CondVar,
+}
+/// A shared reference to a slot allocator instance.
+pub(crate) struct SlotAllocator<T: SlotItem>(Arc<SlotAllocatorOuter<T>>);
+impl<T: SlotItem> SlotAllocator<T> {
- /// Creates a new `SlotAllocator`, with a fixed number of slots and arbitrary associated data.
- ///
- /// The caller provides a constructor callback which takes a reference to the `T::Data` and
- /// creates a single slot. This is called during construction to create all the initial
- /// items, which then live the lifetime of the `SlotAllocator`.
- pub(crate) fn new(
num_slots: u32,
mut data: T::Data,
mut constructor: impl FnMut(&mut T::Data, u32) -> T,
- ) -> Result<SlotAllocator<T>> {
let mut slots = Vec::try_with_capacity(num_slots as usize)?;
for i in 0..num_slots {
slots
.try_push(Some(Entry {
item: constructor(&mut data, i),
get_time: 0,
drop_time: 0,
}))
.expect("try_push() failed after reservation");
}
let inner = SlotAllocatorInner {
data,
slots,
get_count: 0,
drop_count: 0,
};
let mut alloc = Pin::from(UniqueArc::try_new(SlotAllocatorOuter {
// SAFETY: `condvar_init!` is called below.
cond: unsafe { CondVar::new() },
// SAFETY: `mutex_init!` is called below.
inner: unsafe { Mutex::new(inner) },
})?);
// SAFETY: `cond` is pinned when `alloc` is.
let pinned = unsafe { alloc.as_mut().map_unchecked_mut(|s| &mut s.cond) };
kernel::condvar_init!(pinned, "SlotAllocator::cond");
// SAFETY: `inner` is pinned when `alloc` is.
let pinned = unsafe { alloc.as_mut().map_unchecked_mut(|s| &mut s.inner) };
kernel::mutex_init!(pinned, "SlotAllocator::inner");
Ok(SlotAllocator(alloc.into()))
- }
- /// Calls a callback on the inner data associated with this allocator, taking the lock.
- pub(crate) fn with_inner<RetVal>(&self, cb: impl FnOnce(&mut T::Data) -> RetVal) -> RetVal {
let mut inner = self.0.inner.lock();
cb(&mut inner.data)
- }
- /// Gets a fresh slot, optionally reusing a previous allocation if a `SlotToken` is provided.
- ///
- /// Blocks if no slots are free.
- pub(crate) fn get(&self, token: Option<SlotToken>) -> Result<Guard<T>> {
self.get_inner(token, |_a, _b| Ok(()))
- }
- /// Gets a fresh slot, optionally reusing a previous allocation if a `SlotToken` is provided.
- ///
- /// Blocks if no slots are free.
- ///
- /// This version allows the caller to pass in a callback that gets a mutable reference to the
- /// user data for the allocator and the freshly acquired slot, which is called before the
- /// allocator lock is released. This can be used to perform bookkeeping associated with
- /// specific slots (such as tracking their current owner).
- pub(crate) fn get_inner(
&self,
token: Option<SlotToken>,
cb: impl FnOnce(&mut T::Data, &mut Guard<T>) -> Result<()>,
- ) -> Result<Guard<T>> {
let mut inner = self.0.inner.lock();
if let Some(token) = token {
let slot = &mut inner.slots[token.slot as usize];
if slot.is_some() {
let count = slot.as_ref().unwrap().get_time;
if count == token.time {
let mut guard = Guard {
item: Some(slot.take().unwrap().item),
token,
changed: false,
alloc: self.0.clone(),
};
cb(&mut inner.data, &mut guard)?;
return Ok(guard);
}
}
}
let mut first = true;
let slot = loop {
let mut oldest_time = u64::MAX;
let mut oldest_slot = 0u32;
for (i, slot) in inner.slots.iter().enumerate() {
if let Some(slot) = slot.as_ref() {
if slot.drop_time < oldest_time {
oldest_slot = i as u32;
oldest_time = slot.drop_time;
}
}
}
if oldest_time == u64::MAX {
if first {
pr_warn!(
"{}: out of slots, blocking\n",
core::any::type_name::<Self>()
);
}
first = false;
if self.0.cond.wait(&mut inner) {
return Err(ERESTARTSYS);
}
} else {
break oldest_slot;
}
};
inner.get_count += 1;
let item = inner.slots[slot as usize]
.take()
.expect("Someone stole our slot?")
.item;
let mut guard = Guard {
item: Some(item),
changed: true,
token: SlotToken {
time: inner.get_count,
slot,
},
alloc: self.0.clone(),
};
cb(&mut inner.data, &mut guard)?;
Ok(guard)
- }
+}
+impl<T: SlotItem> Clone for SlotAllocator<T> {
- fn clone(&self) -> Self {
SlotAllocator(self.0.clone())
- }
+}
+impl<T: SlotItem> Drop for Guard<T> {
- fn drop(&mut self) {
let mut inner = self.alloc.inner.lock();
if inner.slots[self.token.slot as usize].is_some() {
pr_crit!(
"{}: tried to return an item into a full slot ({})\n",
core::any::type_name::<Self>(),
self.token.slot
);
} else {
inner.drop_count += 1;
let mut item = self.item.take().expect("Guard lost its item");
item.release(&mut inner.data, self.token.slot);
inner.slots[self.token.slot as usize] = Some(Entry {
item,
get_time: self.token.time,
drop_time: inner.drop_count,
});
self.alloc.cond.notify_one();
}
- }
+}
diff --git a/drivers/gpu/drm/asahi/util.rs b/drivers/gpu/drm/asahi/util.rs
new file mode 100644
index 000000000000..8d1a37f17cd8
--- /dev/null
+++ b/drivers/gpu/drm/asahi/util.rs
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! Miscellaneous utility functions
+use core::ops::{Add, BitAnd, Div, Not, Sub};
+/// Aligns an integer type to a power of two.
+pub(crate) fn align<T>(a: T, b: T) -> T
+where
- T: Copy
+ Default
+ BitAnd<Output = T>
+ Not<Output = T>
+ Add<Output = T>
+ Sub<Output = T>
+ Div<Output = T>
+ core::cmp::PartialEq,
+{
- let def: T = Default::default();
- #[allow(clippy::eq_op)]
- let one: T = !def / !def;
- assert!((b & (b - one)) == def);
- (a + b - one) & !(b - one)
+}
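For reviewers unfamiliar with the trick used above: `!def / !def` manufactures a generic `1` without a `One` trait bound, because for any unsigned integer type `!0 / !0 == MAX / MAX == 1`. A concrete-typed sketch of the same logic (illustrative only, monomorphized to `u32`):

```rust
// Concrete-typed sketch of the generic one-from-Default trick in align():
// `!def / !def` with def == 0 is u32::MAX / u32::MAX == 1, so the function
// needs only Default, Not, Div, Add, Sub and BitAnd bounds in generic form.
fn align_u32(a: u32, b: u32) -> u32 {
    let def: u32 = Default::default(); // 0
    let one: u32 = !def / !def; // u32::MAX / u32::MAX == 1
    // b & (b - 1) == 0 holds exactly for powers of two (and 0).
    assert!(b & (b - one) == def, "b must be a power of two");
    // Round a up to the next multiple of b.
    (a + b - one) & !(b - one)
}
```

`div_ceil()` below uses the identical `one` construction, just ending in `(a + b - one) / b` so it works for non-power-of-two divisors too.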
+/// Integer division rounding up.
+pub(crate) fn div_ceil<T>(a: T, b: T) -> T
+where
- T: Copy
+ Default
+ BitAnd<Output = T>
+ Not<Output = T>
+ Add<Output = T>
+ Sub<Output = T>
+ Div<Output = T>,
+{
- let def: T = Default::default();
- #[allow(clippy::eq_op)]
- let one: T = !def / !def;
- (a + b - one) / b
+}
diff --git a/drivers/gpu/drm/asahi/workqueue.rs b/drivers/gpu/drm/asahi/workqueue.rs
new file mode 100644
index 000000000000..ce1d1f89e48e
--- /dev/null
+++ b/drivers/gpu/drm/asahi/workqueue.rs
@@ -0,0 +1,880 @@
+// SPDX-License-Identifier: GPL-2.0-only OR MIT
+//! GPU command execution queues
+//!
+//! The AGX GPU firmware schedules GPU work commands out of work queues, which are ring buffers of
+//! pointers to work commands. There can be an arbitrary number of work queues. Work queues have an
+//! associated type (vertex, fragment, or compute) and may only contain generic commands or commands
+//! specific to that type.
+//!
+//! This module manages queueing work commands into a work queue and submitting them for execution
+//! by the firmware. An active work queue needs an event to signal completion of its work, which is
+//! owned by what we call a batch. This event then notifies the work queue when work is completed,
+//! and that triggers freeing of all resources associated with that work. An idle work queue gives
+//! up its associated event.
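The ring-buffer discipline described above (a CPU write pointer chasing a GPU done pointer, with one entry always left empty so that `free_space()` is `size - pending - 1`) can be sketched in isolation. This is a hypothetical, simplified model; in the real queue the entries are GPU VAs of work commands and both pointers live in firmware-shared memory with acquire/release atomics.

```rust
// Minimal single-producer ring sketch of the wptr/doneptr scheme: the CPU
// advances `wptr` as it queues commands, the (modeled) firmware advances
// `doneptr` as it retires them, and one entry is always left empty so that
// wptr == doneptr unambiguously means "empty" rather than "full".
struct Ring {
    entries: Vec<u64>, // GPU VAs of queued commands
    wptr: u32,
    doneptr: u32,
}

impl Ring {
    fn new(size: usize) -> Ring {
        Ring { entries: vec![0; size], wptr: 0, doneptr: 0 }
    }

    fn free_space(&self) -> usize {
        let size = self.entries.len() as u32;
        // Entries currently in flight, modulo the ring size.
        let used = (self.wptr + size - self.doneptr) % size;
        (size - used - 1) as usize
    }

    fn push(&mut self, gpu_va: u64) -> Result<(), &'static str> {
        if self.free_space() == 0 {
            return Err("ring full");
        }
        self.entries[self.wptr as usize] = gpu_va;
        self.wptr = (self.wptr + 1) % self.entries.len() as u32;
        Ok(())
    }

    // Models the firmware retiring `n` commands.
    fn retire(&mut self, n: u32) {
        self.doneptr = (self.doneptr + n) % self.entries.len() as u32;
    }
}
```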
+use crate::debug::*;
+use crate::fw::channels::PipeType;
+use crate::fw::types::*;
+use crate::fw::workqueue::*;
+use crate::object::OpaqueGpuObject;
+use crate::regs::FaultReason;
+use crate::{box_in_place, no_debug, place};
+use crate::{channel, driver, event, fw, gpu, object, regs};
+use core::num::NonZeroU64;
+use core::sync::atomic::Ordering;
+use kernel::{
- bindings,
- error::code::*,
- prelude::*,
- sync::{Arc, Guard, Mutex, UniqueArc},
+};
+const DEBUG_CLASS: DebugFlags = DebugFlags::WorkQueue;
+const MAX_JOB_SLOTS: u32 = 127;
+/// An enum of possible errors that might cause a piece of work to fail execution.
+#[derive(Copy, Clone, Debug, PartialEq, Eq)]
+pub(crate) enum WorkError {
- /// GPU timeout (command execution took too long).
- Timeout,
- /// GPU MMU fault (invalid access).
- Fault(regs::FaultInfo),
- /// Work failed due to an error caused by other concurrent GPU work.
- Killed,
- /// The GPU crashed.
- NoDevice,
- /// Unknown reason.
- Unknown,
+}
+impl From<WorkError> for bindings::drm_asahi_result_info {
- fn from(err: WorkError) -> Self {
match err {
WorkError::Fault(info) => Self {
status: bindings::drm_asahi_status_DRM_ASAHI_STATUS_FAULT,
fault_type: match info.reason {
FaultReason::Unmapped => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_UNMAPPED,
FaultReason::AfFault => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_AF_FAULT,
FaultReason::WriteOnly => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_WRITE_ONLY,
FaultReason::ReadOnly => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_READ_ONLY,
FaultReason::NoAccess => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_NO_ACCESS,
FaultReason::Unknown(_) => bindings::drm_asahi_fault_DRM_ASAHI_FAULT_UNKNOWN,
},
unit: info.unit_code.into(),
sideband: info.sideband.into(),
level: info.level,
extra: info.unk_5.into(),
is_read: info.read as u8,
pad: 0,
address: info.address,
},
a => Self {
status: match a {
WorkError::Timeout => bindings::drm_asahi_status_DRM_ASAHI_STATUS_TIMEOUT,
WorkError::Killed => bindings::drm_asahi_status_DRM_ASAHI_STATUS_KILLED,
WorkError::NoDevice => bindings::drm_asahi_status_DRM_ASAHI_STATUS_NO_DEVICE,
_ => bindings::drm_asahi_status_DRM_ASAHI_STATUS_UNKNOWN_ERROR,
},
..Default::default()
},
}
- }
+}
+impl From<WorkError> for kernel::error::Error {
- fn from(err: WorkError) -> Self {
match err {
WorkError::Timeout => ETIMEDOUT,
// Not EFAULT because that's for userspace faults
WorkError::Fault(_) => EIO,
WorkError::Unknown => ENODATA,
WorkError::Killed => ECANCELED,
WorkError::NoDevice => ENODEV,
}
- }
+}
+/// A GPU context tracking structure, which must be explicitly invalidated when dropped.
+pub(crate) struct GpuContext {
- dev: driver::AsahiDevice,
- data: GpuObject<fw::workqueue::GpuContextData>,
+}
+no_debug!(GpuContext);
+impl GpuContext {
- /// Allocate a new GPU context.
- pub(crate) fn new(
dev: &driver::AsahiDevice,
alloc: &mut gpu::KernelAllocators,
- ) -> Result<GpuContext> {
Ok(GpuContext {
dev: dev.clone(),
data: alloc
.shared
.new_object(Default::default(), |_inner| Default::default())?,
})
- }
- /// Returns the GPU pointer to the inner GPU context data structure.
- pub(crate) fn gpu_pointer(&self) -> GpuPointer<'_, fw::workqueue::GpuContextData> {
self.data.gpu_pointer()
- }
+}
+impl Drop for GpuContext {
- fn drop(&mut self) {
mod_dev_dbg!(self.dev, "GpuContext: Invalidating GPU context\n");
let dev = self.dev.data();
if dev.gpu.invalidate_context(&self.data).is_err() {
dev_err!(self.dev, "GpuContext: Failed to invalidate GPU context!\n");
}
- }
+}
+struct SubmittedWork<O, C>
+where
- O: OpaqueGpuObject,
- C: FnOnce(O, Option<WorkError>) + Send + Sync + 'static,
+{
- object: O,
- value: EventValue,
- error: Option<WorkError>,
- wptr: u32,
- vm_slot: u32,
- callback: C,
+}
+trait GenSubmittedWork: Send + Sync {
- fn gpu_va(&self) -> NonZeroU64;
- fn value(&self) -> event::EventValue;
- fn wptr(&self) -> u32;
- fn set_wptr(&mut self, wptr: u32);
- fn mark_error(&mut self, error: WorkError);
- fn complete(self: Box<Self>);
+}
+impl<O: OpaqueGpuObject, C: FnOnce(O, Option<WorkError>) + Send + Sync> GenSubmittedWork
- for SubmittedWork<O, C>
+{
- fn gpu_va(&self) -> NonZeroU64 {
self.object.gpu_va()
- }
- fn value(&self) -> event::EventValue {
self.value
- }
- fn wptr(&self) -> u32 {
self.wptr
- }
- fn set_wptr(&mut self, wptr: u32) {
self.wptr = wptr;
- }
- fn complete(self: Box<Self>) {
let SubmittedWork {
object,
value: _,
error,
wptr: _,
vm_slot: _,
callback,
} = *self;
callback(object, error);
- }
- fn mark_error(&mut self, error: WorkError) {
mod_pr_debug!("WorkQueue: Command at value {:#x?} failed\n", self.value);
self.error = Some(match error {
WorkError::Fault(info) if info.vm_slot != self.vm_slot => WorkError::Killed,
err => err,
});
- }
+}
+/// Inner data for managing a single work queue.
+#[versions(AGX)]
+struct WorkQueueInner {
- event_manager: Arc<event::EventManager>,
- info: GpuObject<QueueInfo::ver>,
- new: bool,
- pipe_type: PipeType,
- size: u32,
- wptr: u32,
- pending: Vec<Box<dyn GenSubmittedWork>>,
- last_token: Option<event::Token>,
- pending_jobs: usize,
- last_submitted: Option<event::EventValue>,
- last_completed: Option<event::EventValue>,
- event: Option<(event::Event, event::EventValue)>,
- priority: u32,
- commit_seq: u64,
- submit_seq: u64,
+}
+/// An instance of a work queue.
+#[versions(AGX)]
+pub(crate) struct WorkQueue {
- info_pointer: GpuWeakPointer<QueueInfo::ver>,
- inner: Mutex<WorkQueueInner::ver>,
+}
+#[versions(AGX)]
+impl WorkQueueInner::ver {
- /// Return the GPU done pointer, representing how many work items have been completed by the
- /// GPU.
- fn doneptr(&self) -> u32 {
self.info
.state
.with(|raw, _inner| raw.gpu_doneptr.load(Ordering::Acquire))
- }
+}
+#[versions(AGX)]
+#[derive(Copy, Clone)]
+pub(crate) struct QueueEventInfo {
- pub(crate) stamp_pointer: GpuWeakPointer<Stamp>,
- pub(crate) fw_stamp_pointer: GpuWeakPointer<FwStamp>,
- pub(crate) slot: u32,
- pub(crate) value: event::EventValue,
- pub(crate) cmd_seq: u64,
- pub(crate) info_ptr: GpuWeakPointer<QueueInfo::ver>,
+}
+#[versions(AGX)]
+pub(crate) struct Job {
- wq: Arc<WorkQueue::ver>,
- event_info: QueueEventInfo::ver,
- start_value: EventValue,
- pending: Vec<Box<dyn GenSubmittedWork>>,
- committed: bool,
- submitted: bool,
- event_count: usize,
+}
+#[versions(AGX)]
+pub(crate) struct JobSubmission<'a> {
- inner: Option<Guard<'a, Mutex<WorkQueueInner::ver>>>,
- wptr: u32,
- event_count: usize,
- command_count: usize,
+}
+#[versions(AGX)]
+impl Job::ver {
- pub(crate) fn event_info(&self) -> QueueEventInfo::ver {
let mut info = self.event_info;
info.cmd_seq += self.event_count as u64;
info
- }
- pub(crate) fn next_seq(&mut self) {
self.event_count += 1;
self.event_info.value.increment();
- }
- pub(crate) fn add<O: object::OpaqueGpuObject + 'static>(
&mut self,
command: O,
vm_slot: u32,
- ) -> Result {
self.add_cb(command, vm_slot, |_, _| {})
- }
- pub(crate) fn add_cb<O: object::OpaqueGpuObject + 'static>(
&mut self,
command: O,
vm_slot: u32,
callback: impl FnOnce(O, Option<WorkError>) + Sync + Send + 'static,
- ) -> Result {
if self.committed {
pr_err!("WorkQueue: Tried to mutate committed Job\n");
return Err(EINVAL);
}
self.pending.try_push(Box::try_new(SubmittedWork::<_, _> {
object: command,
value: self.event_info.value.next(),
error: None,
callback,
wptr: 0,
vm_slot,
})?)?;
Ok(())
- }
- pub(crate) fn commit(&mut self) -> Result {
if self.committed {
pr_err!("WorkQueue: Tried to commit committed Job\n");
return Err(EINVAL);
}
if self.pending.is_empty() {
pr_err!("WorkQueue: Job::commit() with no commands\n");
return Err(EINVAL);
}
let mut inner = self.wq.inner.lock();
let ev = inner.event.as_mut().expect("WorkQueue: Job lost its event");
if ev.1 != self.start_value {
pr_err!(
"WorkQueue: Job::commit() out of order (event slot {} {:?} != {:?}\n",
ev.0.slot(),
ev.1,
self.start_value
);
return Err(EINVAL);
}
ev.1 = self.event_info.value;
inner.commit_seq += self.pending.len() as u64;
self.committed = true;
Ok(())
- }
- pub(crate) fn can_submit(&self) -> bool {
self.wq.free_slots() > self.event_count && self.wq.free_space() > self.pending.len()
- }
- pub(crate) fn submit(&mut self) -> Result<JobSubmission::ver<'_>> {
if !self.committed {
pr_err!("WorkQueue: Tried to submit uncommitted Job\n");
return Err(EINVAL);
}
if self.submitted {
pr_err!("WorkQueue: Tried to submit Job twice\n");
return Err(EINVAL);
}
if self.pending.is_empty() {
pr_err!("WorkQueue: Job::submit() with no commands\n");
return Err(EINVAL);
}
let mut inner = self.wq.inner.lock();
if inner.submit_seq != self.event_info.cmd_seq {
pr_err!(
"WorkQueue: Job::submit() out of order (submit_seq {} != {})\n",
inner.submit_seq,
self.event_info.cmd_seq
);
return Err(EINVAL);
}
if inner.commit_seq < (self.event_info.cmd_seq + self.pending.len() as u64) {
pr_err!(
"WorkQueue: Job::submit() out of order (commit_seq {} != {})\n",
inner.commit_seq,
(self.event_info.cmd_seq + self.pending.len() as u64)
);
return Err(EINVAL);
}
let mut wptr = inner.wptr;
let command_count = self.pending.len();
if inner.free_space() <= command_count {
pr_err!("WorkQueue: Job does not fit in ring buffer\n");
return Err(EBUSY);
}
inner.pending.try_reserve(command_count)?;
inner.last_submitted = inner.event.as_ref().map(|e| e.1);
for mut command in self.pending.drain(..) {
command.set_wptr(wptr);
let next_wptr = (wptr + 1) % inner.size;
assert!(inner.doneptr() != next_wptr);
inner.info.ring[wptr as usize] = command.gpu_va().get();
wptr = next_wptr;
// Cannot fail, since we did a try_reserve(1) above
inner
.pending
.try_push(command)
.expect("try_push() failed after try_reserve()");
}
self.submitted = true;
Ok(JobSubmission::ver {
inner: Some(inner),
wptr,
command_count,
event_count: self.event_count,
})
- }
+}
+#[versions(AGX)]
+impl<'a> JobSubmission::ver<'a> {
- pub(crate) fn run(mut self, channel: &mut channel::PipeChannel::ver) {
let command_count = self.command_count;
let mut inner = self.inner.take().expect("No inner?");
let wptr = self.wptr;
core::mem::forget(self);
inner
.info
.state
.with(|raw, _inner| raw.cpu_wptr.store(wptr, Ordering::Release));
inner.wptr = wptr;
let event = inner.event.as_mut().expect("JobSubmission lost its event");
let event_slot = event.0.slot();
let msg = fw::channels::RunWorkQueueMsg::ver {
pipe_type: inner.pipe_type,
work_queue: Some(inner.info.weak_pointer()),
wptr: inner.wptr,
event_slot,
is_new: inner.new,
__pad: Default::default(),
};
channel.send(&msg);
inner.new = false;
inner.submit_seq += command_count as u64;
- }
- pub(crate) fn pipe_type(&self) -> PipeType {
self.inner.as_ref().expect("No inner?").pipe_type
- }
- pub(crate) fn priority(&self) -> u32 {
self.inner.as_ref().expect("No inner?").priority
- }
+}
+#[versions(AGX)]
+impl Drop for Job::ver {
- fn drop(&mut self) {
mod_pr_debug!("WorkQueue: Dropping Job\n");
let mut inner = self.wq.inner.lock();
if self.committed && !self.submitted {
let pipe_type = inner.pipe_type;
let event = inner.event.as_mut().expect("Job lost its event");
mod_pr_debug!(
"WorkQueue({:?}): Roll back {} events (slot {} val {:#x?}) and {} commands\n",
pipe_type,
self.event_count,
event.0.slot(),
event.1,
self.pending.len()
);
event.1.sub(self.event_count as u32);
inner.commit_seq -= self.pending.len() as u64;
}
inner.pending_jobs -= 1;
if inner.pending.is_empty() && inner.pending_jobs == 0 {
mod_pr_debug!("WorkQueue({:?}): Dropping event\n", inner.pipe_type);
inner.event = None;
inner.last_submitted = None;
inner.last_completed = None;
}
mod_pr_debug!("WorkQueue({:?}): Dropped Job\n", inner.pipe_type);
- }
+}
+#[versions(AGX)]
+impl<'a> Drop for JobSubmission::ver<'a> {
- fn drop(&mut self) {
let inner = self.inner.as_mut().expect("No inner?");
mod_pr_debug!("WorkQueue({:?}): Dropping JobSubmission\n", inner.pipe_type);
let new_len = inner.pending.len() - self.command_count;
inner.pending.truncate(new_len);
let pipe_type = inner.pipe_type;
let event = inner.event.as_mut().expect("JobSubmission lost its event");
mod_pr_debug!(
"WorkQueue({:?}): Roll back {} events (slot {} val {:#x?}) and {} commands\n",
pipe_type,
self.event_count,
event.0.slot(),
event.1,
self.command_count
);
event.1.sub(self.event_count as u32);
inner.commit_seq -= self.command_count as u64;
mod_pr_debug!("WorkQueue({:?}): Dropped JobSubmission\n", inner.pipe_type);
- }
+}
+#[versions(AGX)]
+impl WorkQueueInner::ver {
- /// Return the number of free entries in the workqueue
- pub(crate) fn free_space(&self) -> usize {
self.size as usize - self.pending.len() - 1
- }
- pub(crate) fn free_slots(&self) -> usize {
let busy_slots = if let Some(ls) = self.last_submitted {
let lc = self
.last_completed
.expect("last_submitted but not completed?");
ls.delta(&lc)
} else {
0
};
((MAX_JOB_SLOTS as i32) - busy_slots).max(0) as usize
- }
+}
+#[versions(AGX)]
+impl WorkQueue::ver {
- /// Create a new WorkQueue of a given type and priority.
- #[allow(clippy::too_many_arguments)]
- pub(crate) fn new(
alloc: &mut gpu::KernelAllocators,
event_manager: Arc<event::EventManager>,
gpu_context: Arc<GpuContext>,
notifier_list: Arc<GpuObject<fw::event::NotifierList>>,
pipe_type: PipeType,
id: u64,
priority: u32,
size: u32,
- ) -> Result<Arc<WorkQueue::ver>> {
let mut info = box_in_place!(QueueInfo::ver {
state: alloc.shared.new_default::<RingState>()?,
ring: alloc.shared.array_empty(size as usize)?,
gpu_buf: alloc.private.array_empty(0x2c18)?,
notifier_list: notifier_list,
gpu_context: gpu_context,
})?;
info.state.with_mut(|raw, _inner| {
raw.rb_size = size;
});
let inner = WorkQueueInner::ver {
event_manager,
info: alloc.private.new_boxed(info, |inner, ptr| {
Ok(place!(
ptr,
raw::QueueInfo::ver {
state: inner.state.gpu_pointer(),
ring: inner.ring.gpu_pointer(),
notifier_list: inner.notifier_list.gpu_pointer(),
gpu_buf: inner.gpu_buf.gpu_pointer(),
gpu_rptr1: Default::default(),
gpu_rptr2: Default::default(),
gpu_rptr3: Default::default(),
event_id: AtomicI32::new(-1),
priority: *raw::PRIORITY.get(priority as usize).ok_or(EINVAL)?,
unk_4c: -1,
uuid: id as u32,
unk_54: -1,
unk_58: Default::default(),
busy: Default::default(),
__pad: Default::default(),
unk_84_state: Default::default(),
unk_88: 0,
unk_8c: 0,
unk_90: 0,
unk_94: 0,
pending: Default::default(),
unk_9c: 0,
#[ver(V >= V13_2)]
unk_a0_0: 0,
gpu_context: inner.gpu_context.gpu_pointer(),
unk_a8: Default::default(),
#[ver(V >= V13_2)]
unk_b0: 0,
}
))
})?,
new: true,
pipe_type,
size,
wptr: 0,
pending: Vec::new(),
last_token: None,
event: None,
priority,
pending_jobs: 0,
commit_seq: 0,
submit_seq: 0,
last_completed: None,
last_submitted: None,
};
let mut queue = Pin::from(UniqueArc::try_new(Self {
info_pointer: inner.info.weak_pointer(),
// SAFETY: `mutex_init!` is called below.
inner: unsafe { Mutex::new(inner) },
})?);
// SAFETY: `inner` is pinned when `queue` is.
let pinned = unsafe { queue.as_mut().map_unchecked_mut(|s| &mut s.inner) };
match pipe_type {
PipeType::Vertex => kernel::mutex_init!(pinned, "WorkQueue::inner (Vertex)"),
PipeType::Fragment => kernel::mutex_init!(pinned, "WorkQueue::inner (Fragment)"),
PipeType::Compute => kernel::mutex_init!(pinned, "WorkQueue::inner (Compute)"),
}
Ok(queue.into())
- }
- pub(crate) fn event_info(&self) -> Option<QueueEventInfo::ver> {
let inner = self.inner.lock();
inner.event.as_ref().map(|ev| QueueEventInfo::ver {
stamp_pointer: ev.0.stamp_pointer(),
fw_stamp_pointer: ev.0.fw_stamp_pointer(),
slot: ev.0.slot(),
value: ev.1,
cmd_seq: inner.commit_seq,
info_ptr: self.info_pointer,
})
- }
- pub(crate) fn new_job(self: &Arc<Self>) -> Result<Job::ver> {
let mut inner = self.inner.lock();
if inner.event.is_none() {
mod_pr_debug!("WorkQueue({:?}): Grabbing event\n", inner.pipe_type);
let event = inner.event_manager.get(inner.last_token, self.clone())?;
let cur = event.current();
inner.last_token = Some(event.token());
mod_pr_debug!(
"WorkQueue({:?}): Grabbed event slot {}: {:#x?}\n",
inner.pipe_type,
event.slot(),
cur
);
inner.event = Some((event, cur));
inner.last_submitted = Some(cur);
inner.last_completed = Some(cur);
}
inner.pending_jobs += 1;
let ev = &inner.event.as_ref().unwrap();
mod_pr_debug!("WorkQueue({:?}): New job\n", inner.pipe_type);
Ok(Job::ver {
wq: self.clone(),
event_info: QueueEventInfo::ver {
stamp_pointer: ev.0.stamp_pointer(),
fw_stamp_pointer: ev.0.fw_stamp_pointer(),
slot: ev.0.slot(),
value: ev.1,
cmd_seq: inner.commit_seq,
info_ptr: self.info_pointer,
},
start_value: ev.1,
pending: Vec::new(),
event_count: 0,
committed: false,
submitted: false,
})
- }
- /// Return the number of free entries in the workqueue
- pub(crate) fn free_space(&self) -> usize {
self.inner.lock().free_space()
- }
- /// Return the number of free job slots in the workqueue
- pub(crate) fn free_slots(&self) -> usize {
self.inner.lock().free_slots()
- }
- pub(crate) fn pipe_type(&self) -> PipeType {
self.inner.lock().pipe_type
- }
+}
+/// Trait used to erase the version-specific type of WorkQueues, to avoid leaking
+/// version-specificity into the event module.
+pub(crate) trait WorkQueue {
- fn signal(&self) -> bool;
- fn mark_error(&self, value: event::EventValue, error: WorkError);
- fn fail_all(&self, error: WorkError);
+}
+#[versions(AGX)]
+impl WorkQueue for WorkQueue::ver {
- /// Signal a workqueue that some work was completed.
- ///
- /// This will check the event stamp value to find out exactly how many commands were processed.
- fn signal(&self) -> bool {
let mut inner = self.inner.lock();
let event = inner.event.as_ref();
let value = match event {
None => {
pr_err!("WorkQueue: signal() called but no event?\n");
return true;
}
Some(event) => event.0.current(),
};
inner.last_completed = Some(value);
mod_pr_debug!(
"WorkQueue({:?}): Signaling event {:?} value {:#x?}\n",
inner.pipe_type,
inner.last_token,
value
);
let mut completed_commands: usize = 0;
for cmd in inner.pending.iter() {
if cmd.value() <= value {
mod_pr_debug!(
"WorkQueue({:?}): Command at value {:#x?} complete\n",
inner.pipe_type,
cmd.value()
);
completed_commands += 1;
} else {
break;
}
}
if completed_commands == 0 {
return inner.pending.is_empty();
}
let mut completed = Vec::new();
if completed.try_reserve(completed_commands).is_err() {
pr_crit!(
"WorkQueue({:?}): Failed to allocate space for {} completed commands\n",
inner.pipe_type,
completed_commands
);
}
let pipe_type = inner.pipe_type;
for cmd in inner.pending.drain(..completed_commands) {
if completed.try_push(cmd).is_err() {
pr_crit!(
"WorkQueue({:?}): Failed to signal a completed command\n",
pipe_type,
);
}
}
mod_pr_debug!(
"WorkQueue({:?}): Completed {} commands\n",
inner.pipe_type,
completed_commands
);
if let Some(i) = completed.last() {
inner
.info
.state
.with(|raw, _inner| raw.cpu_freeptr.store(i.wptr(), Ordering::Release));
}
let empty = inner.pending.is_empty();
if empty && inner.pending_jobs == 0 {
inner.event = None;
inner.last_submitted = None;
inner.last_completed = None;
}
core::mem::drop(inner);
for cmd in completed {
cmd.complete();
}
empty
- }
- /// Mark this queue's work up to a certain stamp value as having failed.
- fn mark_error(&self, value: event::EventValue, error: WorkError) {
// If anything is marked completed, we can consider it successful
// at this point, even if we didn't get the signal event yet.
self.signal();
let mut inner = self.inner.lock();
if inner.event.is_none() {
pr_err!("WorkQueue: signal_fault() called but no event?\n");
return;
}
mod_pr_debug!(
"WorkQueue({:?}): Signaling fault for event {:?} at value {:#x?}\n",
inner.pipe_type,
inner.last_token,
value
);
for cmd in inner.pending.iter_mut() {
if cmd.value() <= value {
cmd.mark_error(error);
} else {
break;
}
}
- }
- /// Mark all of this queue's work as having failed, and complete it.
- fn fail_all(&self, error: WorkError) {
// If anything is marked completed, we can consider it successful
// at this point, even if we didn't get the signal event yet.
self.signal();
let mut inner = self.inner.lock();
if inner.event.is_none() {
pr_err!("WorkQueue: fail_all() called but no event?\n");
return;
}
mod_pr_debug!(
"WorkQueue({:?}): Failing all jobs {:?}\n",
inner.pipe_type,
error
);
let mut cmds = Vec::new();
core::mem::swap(&mut inner.pending, &mut cmds);
if inner.pending_jobs == 0 {
inner.event = None;
}
core::mem::drop(inner);
for mut cmd in cmds {
cmd.mark_error(error);
cmd.complete();
}
- }
+}
+#[versions(AGX)] +impl Drop for WorkQueue::ver {
- fn drop(&mut self) {
mod_pr_debug!("WorkQueue({:?}): Dropping\n", self.inner.lock().pipe_type);
- }
+}
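The core of signal() above is the stamp comparison: pending commands are ordered by ascending completion value, so comparing each against the current event stamp tells you exactly how many finished. A standalone sketch of just that scan (plain std Rust; the slice of u64 values is an illustrative stand-in for the driver's pending command queue):

```rust
/// Count how many pending commands are complete at `stamp`, relying on
/// the queue being ordered by ascending completion value. The first
/// command whose value exceeds the stamp ends the scan, mirroring the
/// early `break` in signal().
fn completed_count(pending: &[u64], stamp: u64) -> usize {
    pending.iter().take_while(|&&v| v <= stamp).count()
}

fn main() {
    let pending = [8, 16, 24, 32];
    // Firmware reports stamp 24: the first three commands are done.
    assert_eq!(completed_count(&pending, 24), 3);
    // Stamp below everything: nothing has completed yet.
    assert_eq!(completed_count(&pending, 4), 0);
    println!("ok");
}
```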
On 05/04/2023 23.37, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
/// A generic monotonically incrementing ID used to uniquely identify object instances within the
/// driver.
pub(crate) struct ID(AtomicU64);

impl ID {
    /// Create a new ID counter with a given value.
    fn new(val: u64) -> ID {
        ID(AtomicU64::new(val))
    }

    /// Fetch the next unique ID.
    pub(crate) fn next(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed)
    }
}
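For readers without the tree handy, the counter above behaves like this plain-std equivalent: fetch_add hands out monotonically increasing, never-repeating values, and Relaxed ordering suffices because nothing is synchronized through the counter itself, only the returned value matters.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Userspace sketch of the same pattern (names lowercased to fit std
// conventions; not the driver's actual type).
pub struct Id(AtomicU64);

impl Id {
    pub fn new(val: u64) -> Id {
        Id(AtomicU64::new(val))
    }
    /// Returns the current value and atomically advances the counter.
    pub fn next(&self) -> u64 {
        self.0.fetch_add(1, Ordering::Relaxed)
    }
}

fn main() {
    let ids = Id::new(1);
    assert_eq!(ids.next(), 1);
    assert_eq!(ids.next(), 2);
    println!("ok");
}
```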
Continuing the theme of me commenting on individual things, I stumbled over this because I noticed that there's a lot of id based lookups where I don't expect them, and started chasing.
For ids use xarray, not atomic counters. Yes I know dma_fence timelines gets this wrong, this goes back to an innocent time where we didn't allocate more than one timeline per engine, and no one fixed it since then. Yes u64 should be big enough for everyone :-/
Attaching ID spaces to drm_device is also not great. drm is full of these mistakes. Much better if they're per drm_file and so private to each client.
They shouldn't be used for anything else than uapi id -> kernel object lookup at the beginning of ioctl code, and nowhere else. At least from skimming it seems like these are used all over the driver codebase, which does freak me out. At least on the C side that's a clear indicator for a refcount/locking/data structure model that's not thought out at all.
What's going on here, what do I miss?
These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use xarray and are per-File). Most of them are just for debugging, so that when I enable full debug spam I have some way to correlate different things that are happening together (this subset of interleaved log lines relate to the same submission). Basically just object names that are easier to read (and less of a security leak) than pointers and guaranteed not to repeat. You could get rid of most of them and it wouldn't affect the driver design, it just makes it very hard to see what's going on with debug logs ^^;
There are only two that are ever used for non-debugging purposes: the VM ID, and the File ID. Both are per-device global IDs attached to the VMs (not the UAPI VM objects, but rather the underlying MMU address space managers they represent, including the kernel-internal ones) and to Files themselves. They are used for destroying GEM objects: since the objects are also device-global across multiple clients, I need a way to do things like "clean up all mappings for this File" or "clean up all mappings for this VM". There's an annoying circular reference between GEM objects and their mappings, which is why this is explicitly coded out in destroy paths instead of naturally happening via Drop semantics (without that cleanup code, the circular reference leaks it).
So e.g. when a File does a GEM close or explicitly asks for all mappings of an object to be removed, it goes out to the (possibly shared) GEM object and tells it to drop all mappings marked as owned by that unique File ID. When an explicit "unmap all in VM" op happens, it asks the GEM object to drop all mappings for that underlying VM ID. Similarly, when a UAPI VM object is dropped (in the Drop impl, so both explicitly and when the whole File/xarray is dropped and such), that does an explicit unmap of a special dummy object it owns which would otherwise leak since it is not tracked as a GEM object owned by that File and therefore not handled by GEM closing. And again along the same lines, the allocators in alloc.rs explicitly destroy the mappings for their backing GEM objects on Drop. All this is due to that annoying circular reference between VMs and GEM objects that I'm not sure how to fix.
Note that if I *don't* do this (or forget to do it somewhere) the consequence is just that we leak memory, and if you try to destroy the wrong IDs somehow the worst that can happen is you unmap things you shouldn't and fault the GPU (or, in the kernel or kernel-managed user VM cases, potentially the firmware). Rust safety guarantees still keep things from going entirely off the rails within the kernel, since everything that matters is reference counted (which is why these reference cycles are possible at all).
This all started when I was looking at the panfrost driver for reference. It does the same thing except it uses actual pointers to the owning entities instead of IDs, and pointer comparison (see panfrost_gem_close). Of course you could try do that in Rust too (literally storing and comparing raw pointers that aren't owned references), but then you're introducing a Pin<> requirement on those objects to make their addresses stable and it feels way more icky and error-prone than unique IDs (since addresses can be reused). panfrost only has a single mmu (what I call the raw VM) per File while I have an arbitrary number, which is why I end up with the extra distinction/complexity of both File and VM IDs, but the concept is the same.
Some of this is going to be refactored when I implement arbitrary VM range mapping/unmapping, which would be a good time to improve this... but is there something particularly wrong/broken about the way I'm doing it now that I missed? I figured unique u64 IDs would be a pretty safe way to identify entities and cleanup the mappings when needed.
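The ID-keyed cleanup described above can be modeled in a few lines: a shared GEM object tracks its mappings tagged with the owning File and VM IDs, and the teardown paths drop mappings by ID rather than by comparing owner pointers. All types and method names below are invented for this sketch, not the driver's actual API.

```rust
// Each mapping records which File and which VM (MMU address space) owns it.
struct Mapping {
    file_id: u64,
    vm_id: u64,
    // page tables, VA ranges, etc. elided
}

// A device-global GEM object potentially shared across clients.
struct GemObject {
    mappings: Vec<Mapping>,
}

impl GemObject {
    /// "Clean up all mappings for this File" (GEM close path).
    fn drop_file_mappings(&mut self, file_id: u64) {
        self.mappings.retain(|m| m.file_id != file_id);
    }

    /// "Clean up all mappings for this VM" (unmap-all / VM drop path).
    fn drop_vm_mappings(&mut self, vm_id: u64) {
        self.mappings.retain(|m| m.vm_id != vm_id);
    }
}

fn main() {
    let mut bo = GemObject {
        mappings: vec![
            Mapping { file_id: 1, vm_id: 10 },
            Mapping { file_id: 1, vm_id: 11 },
            Mapping { file_id: 2, vm_id: 10 },
        ],
    };
    bo.drop_file_mappings(1); // File 1 closed its handle
    assert_eq!(bo.mappings.len(), 1);
    bo.drop_vm_mappings(10); // VM 10 torn down
    assert!(bo.mappings.is_empty());
    println!("ok");
}
```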
~~ Lina
Argh. This (and my other reply) was supposed to go to Daniel, but Thunderbird... just dropped that recipient? And then my silly brain saw all the Cc:s go to To: and figured it was some weird consolidation and so I moved everything to Cc: except the only name that started with "Da" and... yeah, that wasn't the same person.
Sorry for the confusion... I have no idea why Thunderbird hates Daniel...
On Thu, Apr 06, 2023 at 02:09:21PM +0900, Asahi Lina wrote:
Argh. This (and my other reply) was supposed to go to Daniel, but Thunderbird... just dropped that recipient? And then my silly brain saw all the Cc:s go to To: and figured it was some weird consolidation and so I moved everything to Cc: except the only name that started with "Da" and... yeah, that wasn't the same person.
Sorry for the confusion... I have no idea why Thunderbird hates Daniel...
Don't worry, I get cc'ed on so much stuff that whether I'm cc'ed or not has zero impact on whether I'll read a mail or not. It just kinda disappears into the big lable:cc bucket ... -Daniel
On Thu, Apr 06, 2023 at 01:44:22PM +0900, Asahi Lina wrote:
On 05/04/2023 23.37, Daniel Vetter wrote:
Continuing the theme of me commenting on individual things, I stumbled over this because I noticed that there's a lot of id based lookups where I don't expect them, and started chasing.
For ids use xarray, not atomic counters. Yes I know dma_fence timelines gets this wrong, this goes back to an innocent time where we didn't allocate more than one timeline per engine, and no one fixed it since then. Yes u64 should be big enough for everyone :-/
Attaching ID spaces to drm_device is also not great. drm is full of these mistakes. Much better if their per drm_file and so private to each client.
They shouldn't be used for anything else than uapi id -> kernel object lookup at the beginning of ioctl code, and nowhere else. At least from skimming it seems like these are used all over the driver codebase, which does freak me out. At least on the C side that's a clear indicator for a refcount/lockin/data structure model that's not thought out at all.
What's going on here, what do I miss?
These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use xarray and are per-File). Most of them are just for debugging, so that when I enable full debug spam I have some way to correlate different things that are happening together (this subset of interleaved log lines relate to the same submission). Basically just object names that are easier to read (and less of a security leak) than pointers and guaranteed not to repeat. You could get rid of most of them and it wouldn't affect the driver design, it just makes it very hard to see what's going on with debug logs ^^;
Hm, generally we just print kernel addresses with the right printk modifiers. Those filter/hash addresses if you have the right paranoia settings enabled. I guess throwing in a debug id doesn't hurt, but it would be good to make that a lot clearer.
I haven't read the full driver yet because I'm still too much lost, that's why I guess I missed the xarray stuff on the file. I'll try and go understand that.
For the big topic below I need to think more. -Daniel
There are only two that are ever used for non-debugging purposes: the VM ID, and the File ID. Both are per-device global IDs attached to the VMs (not the UAPI VM objects, but rather the underlyng MMU address space managers they represent, including the kernel-internal ones) and to Files themselves. They are used for destroying GEM objects: since the objects are also device-global across multiple clients, I need a way to do things like "clean up all mappings for this File" or "clean up all mappings for this VM". There's an annoying circular reference between GEM objects and their mappings, which is why this is explicitly coded out in destroy paths instead of naturally happening via Drop semantics (without that cleanup code, the circular reference leaks it).
So e.g. when a File does a GEM close or explicitly asks for all mappings of an object to be removed, it goes out to the (possibly shared) GEM object and tells it to drop all mappings marked as owned by that unique File ID. When an explicit "unmap all in VM" op happens, it asks the GEM object to drop all mappings for that underlying VM ID. Similarly, when a UAPI VM object is dropped (in the Drop impl, so both explicitly and when the whole File/xarray is dropped and such), that does an explicit unmap of a special dummy object it owns which would otherwise leak since it is not tracked as a GEM object owned by that File and therefore not handled by GEM closing. And again along the same lines, the allocators in alloc.rs explicitly destroy the mappings for their backing GEM objects on Drop. All this is due to that annoying circular reference between VMs and GEM objects that I'm not sure how to fix.
Note that if I *don't* do this (or forget to do it somewhere) the consequence is just that we leak memory, and if you try to destroy the wrong IDs somehow the worst that can happen is you unmap things you shouldn't and fault the GPU (or, in the kernel or kernel-managed user VM cases, potentially the firmware). Rust safety guarantees still keep things from going entirely off the rails within the kernel, since everything that matters is reference counted (which is why these reference cycles are possible at all).
This all started when I was looking at the panfrost driver for reference. It does the same thing except it uses actual pointers to the owning entities instead of IDs, and pointer comparison (see panfrost_gem_close). Of course you could try do that in Rust too (literally storing and comparing raw pointers that aren't owned references), but then you're introducing a Pin<> requirement on those objects to make their addresses stable and it feels way more icky and error-prone than unique IDs (since addresses can be reused). panfrost only has a single mmu (what I call the raw VM) per File while I have an arbitrary number, which is why I end up with the extra distinction/complexity of both File and VM IDs, but the concept is the same.
Some of this is going to be refactored when I implement arbitrary VM range mapping/unmapping, which would be a good time to improve this... but is there something particularly wrong/broken about the way I'm doing it now that I missed? I figured unique u64 IDs would be a pretty safe way to identify entities and cleanup the mappings when needed.
~~ Lina
On Thu, Apr 06, 2023 at 01:44:22PM +0900, Asahi Lina wrote:
On 05/04/2023 23.37, Daniel Vetter wrote:
Continuing the theme of me commenting on individual things, I stumbled over this because I noticed that there's a lot of id based lookups where I don't expect them, and started chasing.
For ids use xarray, not atomic counters. Yes I know dma_fence timelines gets this wrong, this goes back to an innocent time where we didn't allocate more than one timeline per engine, and no one fixed it since then. Yes u64 should be big enough for everyone :-/
Attaching ID spaces to drm_device is also not great. drm is full of these mistakes. Much better if their per drm_file and so private to each client.
They shouldn't be used for anything else than uapi id -> kernel object lookup at the beginning of ioctl code, and nowhere else. At least from skimming it seems like these are used all over the driver codebase, which does freak me out. At least on the C side that's a clear indicator for a refcount/lockin/data structure model that's not thought out at all.
What's going on here, what do I miss?
These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use xarray and are per-File). Most of them are just for debugging, so that when I enable full debug spam I have some way to correlate different things that are happening together (this subset of interleaved log lines relate to the same submission). Basically just object names that are easier to read (and less of a security leak) than pointers and guaranteed not to repeat. You could get rid of most of them and it wouldn't affect the driver design, it just makes it very hard to see what's going on with debug logs ^^;
There are only two that are ever used for non-debugging purposes: the VM ID, and the File ID. Both are per-device global IDs attached to the VMs (not the UAPI VM objects, but rather the underlyng MMU address space managers they represent, including the kernel-internal ones) and to Files themselves. They are used for destroying GEM objects: since the objects are also device-global across multiple clients, I need a way to do things like "clean up all mappings for this File" or "clean up all mappings for this VM". There's an annoying circular reference between GEM objects and their mappings, which is why this is explicitly coded out in destroy paths instead of naturally happening via Drop semantics (without that cleanup code, the circular reference leaks it).
So e.g. when a File does a GEM close or explicitly asks for all mappings of an object to be removed, it goes out to the (possibly shared) GEM object and tells it to drop all mappings marked as owned by that unique File ID. When an explicit "unmap all in VM" op happens, it asks the GEM object to drop all mappings for that underlying VM ID. Similarly, when a UAPI VM object is dropped (in the Drop impl, so both explicitly and when the whole File/xarray is dropped and such), that does an explicit unmap of a special dummy object it owns which would otherwise leak since it is not tracked as a GEM object owned by that File and therefore not handled by GEM closing. And again along the same lines, the allocators in alloc.rs explicitly destroy the mappings for their backing GEM objects on Drop. All this is due to that annoying circular reference between VMs and GEM objects that I'm not sure how to fix.
Note that if I *don't* do this (or forget to do it somewhere) the consequence is just that we leak memory, and if you try to destroy the wrong IDs somehow the worst that can happen is you unmap things you shouldn't and fault the GPU (or, in the kernel or kernel-managed user VM cases, potentially the firmware). Rust safety guarantees still keep things from going entirely off the rails within the kernel, since everything that matters is reference counted (which is why these reference cycles are possible at all).
This all started when I was looking at the panfrost driver for reference. It does the same thing except it uses actual pointers to the owning entities instead of IDs, and pointer comparison (see panfrost_gem_close). Of course you could try do that in Rust too (literally storing and comparing raw pointers that aren't owned references), but then you're introducing a Pin<> requirement on those objects to make their addresses stable and it feels way more icky and error-prone than unique IDs (since addresses can be reused). panfrost only has a single mmu (what I call the raw VM) per File while I have an arbitrary number, which is why I end up with the extra distinction/complexity of both File and VM IDs, but the concept is the same.
Some of this is going to be refactored when I implement arbitrary VM range mapping/unmapping, which would be a good time to improve this... but is there something particularly wrong/broken about the way I'm doing it now that I missed? I figured unique u64 IDs would be a pretty safe way to identify entities and cleanup the mappings when needed.
Ok, some attempt at going through the vm_id/file_id stuff. Extremely high-level purely informed by having read too many drivers:
First on the drm_file/struct file/file_id. This is the uapi interface object, and it's refcounted in the vfs, but that's entirely the vfs' business and none of the driver's (or even the subsystem's). Once userspace has done the final close() the file is gone; there's no way to ever get anything meaningful out of it because userspace dropped it. So if the driver has any kind of backpointer to it, that's a design bug, because in all the places you might want to care (ioctl, fdinfo for sched stats, any other file_operations callback) the vfs ensures it stays alive during the callback and you essentially have a borrowed reference.
I've seen a lot of drivers try to make clever backpointers to stuff that's essentially tied to the drm_file, and I've not found a single case that made sense. iow, file_id as a lookup thingie needs to go. In principle it's the same argument I've already made for the syncobj rust wrappers. For specific uses I guess I need some rust reading help, but from your description it sounds like the vm_id is much more the core piece.
So for that we have the gpu ctx -> vm -> gem_bos chain of reference. Now on the C side if you have a modern driver that uses the vm_bind/unbind/gpuva manager approach, the reference counts go in that single direction only, anything else is essentially borrowed references under protection of a mutex/lock or similar thing (for e.g. going from the bo to the vm for eviction).
In addition to the above chain the xarray in the drm_file also holds references to each of these. So far so good, in the drm_file ->postclose callback you just walk the xarrays and drop all the references, and everything gets cleaned up, at least in the C world.
Aside: I'm ignoring the entire sched/job/gpu-ctx side because that's a separate can of worms, with other big threads floating around for it already.
But if, either due to the uabi being a bit more legacy, or Rust requiring that the backpointers are reference-counted from the gem_bo->vma->vm and can't follow borrow semantics (afaiui the usual linux list_head pattern of walking the list under a lock, giving you a borrowed reference for each element, doesn't work too well in rust?), then that's not a problem, you can still clean it all out:
- The key bit is that your vm struct needs both a refcount like kref and a separate open count. Each gpu ctx and the xarray for vm objects in drm_file hold _both_ the kref and the open refcount (in rust the open refcount implies the Arc or things go sideways).
- the other key bit is that drm_file ->postclose does _not_ have simple Drop semantics, it's more explicit.
- in the drm_file lastclose you first walk all the gpu ctx. The simplest semantics is that close() synchronously tears down all leftover gpu ctx, i.e. you unload them from the gpu. Details are under a lot of discussion in the various scheduler threads, but essentially this should ensure that the gpu ctx destruction completely removes all references to the ctx. If instead you have the legacy problem of apps expecting that rendering continues even if they called exit() before it finishes, then it gets more messy. I have no idea whether that's still a problem for new drivers or can be avoided.
- Next up you do the same thing for the vm xarray (which drops both the kref and open refcounts).
- At this point there might still be a ton of vm objects around with elevated kref. Except not, because at this point the open refcount of each vm should have dropped to zero. When that happens the vm object itself is still alive, plus even better for rust, you are in the vm_close(vm) function call so you have a full borrowed reference to it. Which means you can walk the entire address space and unmap everything explicitly. Which should get rid of any gem_bo->vma->vm backpointers you have lying around.
- At that point all your vm objects are gone too, because the kref managed backpointers are gone.
- You walk the xarray of gem_bo (well, the drm subsystem does that for you), which cleans out the remaining references to gem_bo. Only the gem_bo which are shared with other processes or have a dma_buf will survive, like they should.
No leak, no funky driver-internal vm_id based lookup, and with rust we should even be able to guarantee you never mix up Arc<Vm> with OpenRef<Vm> (or however that exactly works in rust types, I have not much real clue).
If you have any other functional needs for vm_id then I guess I need to go through them, but they should all be fixable. -Daniel
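A minimal userspace sketch of the two-count scheme described above, assuming the Arc<Vm>/OpenRef<Vm> split works roughly as suggested (all names here are hypothetical; the `closed` counter stands in for the real walk-and-unmap teardown):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical Vm: kref-like lifetime via Arc, plus a separate open count.
struct Vm {
    open_count: AtomicUsize,
    closed: AtomicUsize, // stand-in hook: counts how often close() ran
}

impl Vm {
    fn new() -> Arc<Vm> {
        Arc::new(Vm {
            open_count: AtomicUsize::new(0),
            closed: AtomicUsize::new(0),
        })
    }

    // Explicit teardown: the real thing would walk the address space and
    // unmap everything, dropping any gem_bo -> vma -> vm backpointers.
    fn close(&self) {
        self.closed.fetch_add(1, Ordering::Relaxed);
    }
}

// An open reference implies holding the Arc, so the two counts can't get
// out of sync ("the open refcount implies the Arc or things go sideways").
struct OpenRef(Arc<Vm>);

impl OpenRef {
    fn open(vm: &Arc<Vm>) -> OpenRef {
        vm.open_count.fetch_add(1, Ordering::Relaxed);
        OpenRef(Arc::clone(vm))
    }
}

impl Drop for OpenRef {
    fn drop(&mut self) {
        // Last opener triggers the explicit unmap; the Vm itself may stay
        // Arc-alive (e.g. borrowed inside vm_close), but it is now inert.
        if self.0.open_count.fetch_sub(1, Ordering::AcqRel) == 1 {
            self.0.close();
        }
    }
}
```

The point of the split: dropping the last OpenRef runs the explicit cleanup even though other Arc holders may keep the object allocated, which is exactly what breaks the reference cycles.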
On 06/04/2023 20.55, Daniel Vetter wrote:
On Thu, Apr 06, 2023 at 01:44:22PM +0900, Asahi Lina wrote:
On 05/04/2023 23.37, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// A generic monotonically incrementing ID used to uniquely identify object instances within the
+/// driver.
+pub(crate) struct ID(AtomicU64);
+
+impl ID {
+    /// Create a new ID counter with a given value.
+    fn new(val: u64) -> ID {
+        ID(AtomicU64::new(val))
+    }
+
+    /// Fetch the next unique ID.
+    pub(crate) fn next(&self) -> u64 {
+        self.0.fetch_add(1, Ordering::Relaxed)
+    }
+}
Continuing the theme of me commenting on individual things, I stumbled over this because I noticed that there's a lot of id based lookups where I don't expect them, and started chasing.
For ids use xarray, not atomic counters. Yes I know dma_fence timelines get this wrong; this goes back to an innocent time when we didn't allocate more than one timeline per engine, and no one fixed it since then. Yes u64 should be big enough for everyone :-/
Attaching ID spaces to drm_device is also not great. drm is full of these mistakes. Much better if they're per drm_file and so private to each client.
They shouldn't be used for anything other than uapi id -> kernel object lookup at the beginning of ioctl code, and nowhere else. At least from skimming it seems like these are used all over the driver codebase, which does freak me out. At least on the C side that's a clear indicator for a refcount/locking/data structure model that's not thought out at all.
What's going on here, what do I miss?
These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use xarray and are per-File). Most of them are just for debugging, so that when I enable full debug spam I have some way to correlate different things that are happening together (this subset of interleaved log lines relate to the same submission). Basically just object names that are easier to read (and less of a security leak) than pointers and guaranteed not to repeat. You could get rid of most of them and it wouldn't affect the driver design, it just makes it very hard to see what's going on with debug logs ^^;
There are only two that are ever used for non-debugging purposes: the VM ID, and the File ID. Both are per-device global IDs attached to the VMs (not the UAPI VM objects, but rather the underlying MMU address space managers they represent, including the kernel-internal ones) and to Files themselves. They are used for destroying GEM objects: since the objects are also device-global across multiple clients, I need a way to do things like "clean up all mappings for this File" or "clean up all mappings for this VM". There's an annoying circular reference between GEM objects and their mappings, which is why this is explicitly coded out in destroy paths instead of naturally happening via Drop semantics (without that cleanup code, the circular reference leaks it).
So e.g. when a File does a GEM close or explicitly asks for all mappings of an object to be removed, it goes out to the (possibly shared) GEM object and tells it to drop all mappings marked as owned by that unique File ID. When an explicit "unmap all in VM" op happens, it asks the GEM object to drop all mappings for that underlying VM ID. Similarly, when a UAPI VM object is dropped (in the Drop impl, so both explicitly and when the whole File/xarray is dropped and such), that does an explicit unmap of a special dummy object it owns which would otherwise leak since it is not tracked as a GEM object owned by that File and therefore not handled by GEM closing. And again along the same lines, the allocators in alloc.rs explicitly destroy the mappings for their backing GEM objects on Drop. All this is due to that annoying circular reference between VMs and GEM objects that I'm not sure how to fix.
Note that if I *don't* do this (or forget to do it somewhere) the consequence is just that we leak memory, and if you try to destroy the wrong IDs somehow the worst that can happen is you unmap things you shouldn't and fault the GPU (or, in the kernel or kernel-managed user VM cases, potentially the firmware). Rust safety guarantees still keep things from going entirely off the rails within the kernel, since everything that matters is reference counted (which is why these reference cycles are possible at all).
This all started when I was looking at the panfrost driver for reference. It does the same thing except it uses actual pointers to the owning entities instead of IDs, and pointer comparison (see panfrost_gem_close). Of course you could try to do that in Rust too (literally storing and comparing raw pointers that aren't owned references), but then you're introducing a Pin<> requirement on those objects to make their addresses stable, and it feels way more icky and error-prone than unique IDs (since addresses can be reused). panfrost only has a single mmu (what I call the raw VM) per File while I have an arbitrary number, which is why I end up with the extra distinction/complexity of both File and VM IDs, but the concept is the same.
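For illustration, a hedged userspace sketch of the ID-tagged mapping-list approach described above: mappings on a shared object are keyed by never-reused u64 IDs instead of possibly-recycled pointer addresses. `GemObject`, `Mapping`, and `drop_file_mappings` are hypothetical stand-ins for the driver's actual types:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Monotonic ID source, mirroring the driver's ID(AtomicU64) shown above.
static NEXT_ID: AtomicU64 = AtomicU64::new(1);
fn next_id() -> u64 {
    NEXT_ID.fetch_add(1, Ordering::Relaxed)
}

// A shared object's mapping list, tagged with owner IDs rather than raw
// pointers (so no Pin<> requirement, and stale IDs can never be confused
// with a reused address).
struct Mapping {
    file_id: u64,
    vm_id: u64,
}

struct GemObject {
    mappings: Vec<Mapping>,
}

impl GemObject {
    // panfrost_gem_close-style cleanup, but keyed on a unique u64.
    fn drop_file_mappings(&mut self, file_id: u64) {
        self.mappings.retain(|m| m.file_id != file_id);
    }

    // "Unmap all in VM" uses the underlying VM's ID the same way.
    fn drop_vm_mappings(&mut self, vm_id: u64) {
        self.mappings.retain(|m| m.vm_id != vm_id);
    }
}
```

The worst case of passing a stale ID here is that nothing matches and nothing is dropped (a leak), matching the failure mode described in the mail.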
Some of this is going to be refactored when I implement arbitrary VM range mapping/unmapping, which would be a good time to improve this... but is there something particularly wrong/broken about the way I'm doing it now that I missed? I figured unique u64 IDs would be a pretty safe way to identify entities and cleanup the mappings when needed.
Ok, some attempt at going through the vm_id/file_id stuff. Extremely high-level purely informed by having read too many drivers:
First on the drm_file/struct file/file_id. This is the uapi interface object, and it's refcounted in the vfs, but that's entirely the vfs' business and none of the driver's (or even the subsystem's). Once userspace has done the final close() the file is gone; there's no way to ever get anything meaningful out of it because userspace dropped it. So if the driver has any kind of backpointer to it, that's a design bug, because in all the places you might want to care (ioctl, fdinfo for sched stats, any other file_operations callback) the vfs ensures it stays alive during the callback and you essentially have a borrowed reference.
Right, there's none of that for the File, and it is not refcounted itself. Certainly there are no direct references, and as for the IDs: the IDs of relevant Files live in GEM objects that hold mappings owned by that file. As part of File close all the GEM objects get closed, which removes those mappings. So by the time the File goes away there should be no references to its ID anywhere (other than if I stashed some away for debugging, I forget whether I did in some child object).
If this process breaks for some reason (say, stray mappings remain indexed to a File ID that is gone), that means we leak the mappings, which leaks the GEM objects themselves and the VM they are mapped to. Not great but not fireworks either. As long as the DRM core properly calls the GEM close callback on everything before calling the File close callback though, that shouldn't happen.
I've seen a lot of drivers try to make clever backpointers to stuff that's essentially tied to the drm_file, and I've not found a single case that made sense. iow, file_id as a lookup thingie needs to go. In principle it's the same argument I've made already for the syncobj rust wrappers. For specific uses I guess I need some rust reading help, but from your description it sounds like the vm_id is much more the core piece.
The file ID is simply how GEM mappings are identified as belonging to an active file within the mapping list of an object. GEM object close is literally the only place this ID is ever used for anything other than passing around:
/// Callback to drop all mappings for a GEM object owned by a given `File`
fn close(obj: &Object, file: &DrmFile) {
    mod_pr_debug!("DriverObject::close vm_id={:?} id={}\n", obj.vm_id, obj.id);
    obj.drop_file_mappings(file.inner().file_id());
}
I could also just iterate through the VM XArray for the File and drop mappings one VM at a time instead of doing all of them in one go; it's just slightly more cumbersome (though potentially less code, because I could get rid of all the forwarding of the file_id I do now).
On the other hand, once we implement arbitrary VM maps, I suspect this is going to go away anyway with the new design, so I'm not really very inclined to fix it until that happens... ^^
So for that we have the gpu ctx -> vm -> gem_bos chain of reference. Now on the C side if you have a modern driver that uses the vm_bind/unbind/gpuva manager approach, the reference counts go in that single direction only, anything else is essentially borrowed references under protection of a mutex/lock or similar thing (for e.g. going from the bo to the vm for eviction).
Right, so that is what is going to change with the pending refactor. What I have right now is a design that used to be the old driver-managed VM design (and still retains part of that for kernel-managed objects) for the old synchronous demo UAPI, that I then shoehorned into the redesigned vm_bind UAPI by just not supporting the interesting cases (partial maps/unmaps/remaps, etc.). This is all temporary, it's just to get us by for now since OpenGL doesn't need it and there is no usable Vulkan driver that cares yet... I wanted to focus on the explicit sync and general sched/queuing part of the new UAPI before I got to the VM bind stuff, since I figured that would be more interesting (and pulls in all the new abstractions, plus major perf benefit). So the UAPI itself has vm_bind but only the "easy" subset of cases are supported by the driver (whole object maps/unmaps) and the refcounting is still backwards.
As I said this originally came from the Panfrost design that doesn't have vm_bind but instead keeps a list of mappings with pointer equality checks in BOs... so that's why ^^
Thanks for explaining the design approach though, it's roughly what I had in mind but it's good to hear I'm on the right track! I'd love to go into more detail about how to implement vm_bind if you have time though (maybe a meeting?). In particular things like using the mm allocator to keep track of mapping ranges and supporting splitting and all that.
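As a rough illustration of the range-splitting question raised above (how a vm_bind-style manager tracks mapping ranges and supports partial unmaps), here is a hedged userspace sketch. This is not drm_mm or the gpuva manager, just a `BTreeMap` stand-in, and all names are hypothetical:

```rust
use std::collections::BTreeMap;

// Hypothetical VA range tracker: start -> (end_exclusive, tag).
// The tricky case for vm_bind is a partial unmap, which may split an
// existing mapping into up to two remainders.
struct RangeMap(BTreeMap<u64, (u64, u64)>);

impl RangeMap {
    fn new() -> Self {
        RangeMap(BTreeMap::new())
    }

    fn map(&mut self, start: u64, end: u64, tag: u64) {
        self.0.insert(start, (end, tag));
    }

    fn unmap(&mut self, start: u64, end: u64) {
        // Collect overlapping entries first; we can't mutate while iterating.
        let hits: Vec<(u64, u64, u64)> = self
            .0
            .iter()
            .filter(|(&s, &(e, _))| s < end && start < e)
            .map(|(&s, &(e, t))| (s, e, t))
            .collect();
        for (s, e, t) in hits {
            self.0.remove(&s);
            if s < start {
                self.0.insert(s, (start, t)); // left remainder survives
            }
            if end < e {
                self.0.insert(end, (e, t)); // right remainder survives
            }
        }
    }
}
```

A real implementation would also have to merge adjacent compatible ranges and keep the per-range backing-object references in sync, but the split logic is the part that distinguishes "arbitrary VM range mapping/unmapping" from the whole-object maps the driver supports today.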
In addition to the above chain the xarray in the drm_file also holds references to each of these. So far so good, in the drm_file ->postclose callback you just walk the xarrays and drop all the references, and everything gets cleaned up, at least in the C world.
In the Rust world you just do nothing since the XArray abstraction knows how to drop all of its contained objects!
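A small sketch of what that Drop-based cleanup looks like in plain Rust, with a `BTreeMap` of `Arc`s standing in for the XArray abstraction (all names hypothetical):

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Counts object destructions so the behavior is observable.
static DROPPED: AtomicUsize = AtomicUsize::new(0);

// Stand-in for a refcounted vm/gem object held by the file's xarray.
struct Obj;

impl Drop for Obj {
    fn drop(&mut self) {
        DROPPED.fetch_add(1, Ordering::Relaxed);
    }
}

// Stand-in for drm_file: the container owns Arc references. Dropping the
// file drops the container, which drops each Arc, which frees any object
// not shared elsewhere. No explicit postclose walk needed on this path.
struct File {
    objects: BTreeMap<u64, Arc<Obj>>,
}
```

Objects shared with another holder (another process, a dma_buf importer) survive the file teardown automatically, because only their refcount drops.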
But if, either due to the uabi being a bit more legacy, or Rust requiring that the backpointers are reference-counted from the gem_bo->vma->vm and can't follow borrow semantics (afaiui the usual linux list_head pattern of walking the list under a lock, giving you a borrowed reference for each element, doesn't work too well in rust?), then that's not a problem, you can still clean it all out:
- The key bit is that your vm struct needs both a refcount like kref and a separate open count. Each gpu ctx and the xarray for vm objects in drm_file hold _both_ the kref and the open refcount (in rust the open refcount implies the Arc or things go sideways).
- the other key bit is that drm_file ->postclose does _not_ have simple Drop semantics, it's more explicit.
- in the drm_file lastclose you first walk all the gpu ctx. The simplest semantics is that close() synchronously tears down all leftover gpu ctx, i.e. you unload them from the gpu. Details are under a lot of discussion in the various scheduler threads, but essentially this should ensure that the gpu ctx destruction completely removes all references to the ctx. If instead you have the legacy problem of apps expecting that rendering continues even if they called exit() before it finishes, then it gets more messy. I have no idea whether that's still a problem for new drivers or can be avoided.
- Next up you do the same thing for the vm xarray (which drops both the kref and open refcounts).
- At this point there might still be a ton of vm objects around with elevated kref. Except not, because at this point the open refcount of each vm should have dropped to zero. When that happens the vm object itself is still alive, plus even better for rust, you are in the vm_close(vm) function call so you have a full borrowed reference to it. Which means you can walk the entire address space and unmap everything explicitly. Which should get rid of any gem_bo->vma->vm backpointers you have lying around.
- At that point all your vm objects are gone too, because the kref managed backpointers are gone.
- You walk the xarray of gem_bo (well, the drm subsystem does that for you), which cleans out the remaining references to gem_bo. Only the gem_bo which are shared with other processes or have a dma_buf will survive, like they should.
No leak, no funky driver-internal vm_id based lookup, and with rust we should even be able to guarantee you never mix up Arc<Vm> with OpenRef<Vm> (or however that exactly works in rust types, I have not much real clue).
That would totally work, and actually I already use somewhat analogous mechanisms in other places like firmware queues!
If this all weren't getting turned on its head for the new VM management I'd implement it, but hopefully we can agree there's not much point right now... I'd rather focus on the DRM abstraction design and work on improving the driver in parallel right now, and then about one kernel cycle or so from now it should definitely be in a better place for review. Honestly, there are bigger design problems with the driver right now than these IDs (that I already know about)... so I want to focus more on the abstractions and their usage right now than the internal driver design which I *know* has problems ^^
Rust is really good at getting you to come up with a *safe* design as far as memory and ownership go, but that doesn't mean the code is perfectly clean, and more importantly it does nothing for deadlocks, allocating in the wrong paths, getting resource allocation semantics right, etc. The GPU FW queue stuff is at the very least due for another major refactor/cleanup to defer resource allocation and actual queuing to job prepare/run time (right now there are some horrible hacks to do it upfront at submit, because I don't have a mechanism to back-patch job structures with those resource IDs later at exec time, but I want to add that). Along the way I can also switch from the can_run_job thing to using job fences to block on pending job count, which is what Christian really wants me to do. And then getting all this resource stuff truly right is also going to mean eventually using fences to handle blocking on resource exhaustion too (though maybe I can get away with implementing that a bit later)...
The driver works stupidly well for how quickly I wrote it, but it still has all these rough edges that definitely need fixing before it's something I could say I'm happy with... I'm sure if you start hammering it with evil workloads you will hit some of its current problems (like I did yesterday with the deadlocks on GpuContext inval). I also need to learn more about the subtleties of fence signaling and all that, especially once a shrinker comes into play...
~~ Lina
On Thu, Apr 06, 2023 at 10:15:56PM +0900, Asahi Lina wrote:
On 06/04/2023 20.55, Daniel Vetter wrote:
On Thu, Apr 06, 2023 at 01:44:22PM +0900, Asahi Lina wrote:
On 05/04/2023 23.37, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// A generic monotonically incrementing ID used to uniquely identify object instances within the
+/// driver.
+pub(crate) struct ID(AtomicU64);
+
+impl ID {
+    /// Create a new ID counter with a given value.
+    fn new(val: u64) -> ID {
+        ID(AtomicU64::new(val))
+    }
+
+    /// Fetch the next unique ID.
+    pub(crate) fn next(&self) -> u64 {
+        self.0.fetch_add(1, Ordering::Relaxed)
+    }
+}
Continuing the theme of me commenting on individual things, I stumbled over this because I noticed that there's a lot of id based lookups where I don't expect them, and started chasing.
For ids use xarray, not atomic counters. Yes I know dma_fence timelines get this wrong; this goes back to an innocent time when we didn't allocate more than one timeline per engine, and no one fixed it since then. Yes u64 should be big enough for everyone :-/
Attaching ID spaces to drm_device is also not great. drm is full of these mistakes. Much better if they're per drm_file and so private to each client.
They shouldn't be used for anything other than uapi id -> kernel object lookup at the beginning of ioctl code, and nowhere else. At least from skimming it seems like these are used all over the driver codebase, which does freak me out. At least on the C side that's a clear indicator for a refcount/locking/data structure model that's not thought out at all.
What's going on here, what do I miss?
These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use xarray and are per-File). Most of them are just for debugging, so that when I enable full debug spam I have some way to correlate different things that are happening together (this subset of interleaved log lines relate to the same submission). Basically just object names that are easier to read (and less of a security leak) than pointers and guaranteed not to repeat. You could get rid of most of them and it wouldn't affect the driver design, it just makes it very hard to see what's going on with debug logs ^^;
There are only two that are ever used for non-debugging purposes: the VM ID, and the File ID. Both are per-device global IDs attached to the VMs (not the UAPI VM objects, but rather the underlying MMU address space managers they represent, including the kernel-internal ones) and to Files themselves. They are used for destroying GEM objects: since the objects are also device-global across multiple clients, I need a way to do things like "clean up all mappings for this File" or "clean up all mappings for this VM". There's an annoying circular reference between GEM objects and their mappings, which is why this is explicitly coded out in destroy paths instead of naturally happening via Drop semantics (without that cleanup code, the circular reference leaks it).
So e.g. when a File does a GEM close or explicitly asks for all mappings of an object to be removed, it goes out to the (possibly shared) GEM object and tells it to drop all mappings marked as owned by that unique File ID. When an explicit "unmap all in VM" op happens, it asks the GEM object to drop all mappings for that underlying VM ID. Similarly, when a UAPI VM object is dropped (in the Drop impl, so both explicitly and when the whole File/xarray is dropped and such), that does an explicit unmap of a special dummy object it owns which would otherwise leak since it is not tracked as a GEM object owned by that File and therefore not handled by GEM closing. And again along the same lines, the allocators in alloc.rs explicitly destroy the mappings for their backing GEM objects on Drop. All this is due to that annoying circular reference between VMs and GEM objects that I'm not sure how to fix.
Note that if I *don't* do this (or forget to do it somewhere) the consequence is just that we leak memory, and if you try to destroy the wrong IDs somehow the worst that can happen is you unmap things you shouldn't and fault the GPU (or, in the kernel or kernel-managed user VM cases, potentially the firmware). Rust safety guarantees still keep things from going entirely off the rails within the kernel, since everything that matters is reference counted (which is why these reference cycles are possible at all).
This all started when I was looking at the panfrost driver for reference. It does the same thing except it uses actual pointers to the owning entities instead of IDs, and pointer comparison (see panfrost_gem_close). Of course you could try to do that in Rust too (literally storing and comparing raw pointers that aren't owned references), but then you're introducing a Pin<> requirement on those objects to make their addresses stable, and it feels way more icky and error-prone than unique IDs (since addresses can be reused). panfrost only has a single mmu (what I call the raw VM) per File while I have an arbitrary number, which is why I end up with the extra distinction/complexity of both File and VM IDs, but the concept is the same.
Some of this is going to be refactored when I implement arbitrary VM range mapping/unmapping, which would be a good time to improve this... but is there something particularly wrong/broken about the way I'm doing it now that I missed? I figured unique u64 IDs would be a pretty safe way to identify entities and cleanup the mappings when needed.
Ok, some attempt at going through the vm_id/file_id stuff. Extremely high-level purely informed by having read too many drivers:
First on the drm_file/struct file/file_id. This is the uapi interface object, and it's refcounted in the vfs, but that's entirely the vfs' business and none of the driver's (or even the subsystem's). Once userspace has done the final close() the file is gone; there's no way to ever get anything meaningful out of it because userspace dropped it. So if the driver has any kind of backpointer to it, that's a design bug, because in all the places you might want to care (ioctl, fdinfo for sched stats, any other file_operations callback) the vfs ensures it stays alive during the callback and you essentially have a borrowed reference.
Right, there's none of that for the File, and it is not refcounted itself. Certainly there are no direct references, and as for the IDs: the IDs of relevant Files live in GEM objects that hold mappings owned by that file. As part of File close all the GEM objects get closed, which removes those mappings. So by the time the File goes away there should be no references to its ID anywhere (other than if I stashed some away for debugging, I forget whether I did in some child object).
If this process breaks for some reason (say, stray mappings remain indexed to a File ID that is gone), that means we leak the mappings, which leaks the GEM objects themselves and the VM they are mapped to. Not great but not fireworks either. As long as the DRM core properly calls the GEM close callback on everything before calling the File close callback though, that shouldn't happen.
I've seen a lot of drivers try to make clever backpointers to stuff that's essentially tied to the drm_file, and I've not found a single case that made sense. iow, file_id as a lookup thingie needs to go. In principle it's the same argument I've made already for the syncobj rust wrappers. For specific uses I guess I need some rust reading help, but from your description it sounds like the vm_id is much more the core piece.
The file ID is simply how GEM mappings are identified as belonging to an active file within the mapping list of an object. GEM object close is literally the only place this ID is ever used for anything other than passing around:
/// Callback to drop all mappings for a GEM object owned by a given `File`
fn close(obj: &Object, file: &DrmFile) {
    mod_pr_debug!("DriverObject::close vm_id={:?} id={}\n", obj.vm_id, obj.id);
    obj.drop_file_mappings(file.inner().file_id());
}
I could also just iterate through the VM XArray for the File and drop mappings one VM at a time instead of doing all of them in one go; it's just slightly more cumbersome (though potentially less code, because I could get rid of all the forwarding of the file_id I do now).
On the other hand, once we implement arbitrary VM maps, I suspect this is going to go away anyway with the new design, so I'm not really very inclined to fix it until that happens... ^^
Yeah, the driver-managed vm needs a bunch more reference loops and gets awkward fast. The gpuva library might need to keep support for that, but I really hope it's not needed.
So for that we have the gpu ctx -> vm -> gem_bos chain of reference. Now on the C side if you have a modern driver that uses the vm_bind/unbind/gpuva manager approach, the reference counts go in that single direction only, anything else is essentially borrowed references under protection of a mutex/lock or similar thing (for e.g. going from the bo to the vm for eviction).
Right, so that is what is going to change with the pending refactor. What I have right now is a design that used to be the old driver-managed VM design (and still retains part of that for kernel-managed objects) for the old synchronous demo UAPI, that I then shoehorned into the redesigned vm_bind UAPI by just not supporting the interesting cases (partial maps/unmaps/remaps, etc.). This is all temporary, it's just to get us by for now since OpenGL doesn't need it and there is no usable Vulkan driver that cares yet... I wanted to focus on the explicit sync and general sched/queuing part of the new UAPI before I got to the VM bind stuff, since I figured that would be more interesting (and pulls in all the new abstractions, plus major perf benefit). So the UAPI itself has vm_bind but only the "easy" subset of cases are supported by the driver (whole object maps/unmaps) and the refcounting is still backwards.
As I said this originally came from the Panfrost design that doesn't have vm_bind but instead keeps a list of mappings with pointer equality checks in BOs... so that's why ^^
Thanks for explaining the design approach though, it's roughly what I had in mind but it's good to hear I'm on the right track! I'd love to go into more detail about how to implement vm_bind if you have time though (maybe a meeting?). In particular things like using the mm allocator to keep track of mapping ranges and supporting splitting and all that.
Yeah vm_bind sounds like a good topic to discuss. I don't think we'll get all the pieces aligned to land that before asahi, but the driver internals should at least match wrt semantics with that so that the refactoring isn't total pain.
In addition to the above chain the xarray in the drm_file also holds references to each of these. So far so good, in the drm_file ->postclose callback you just walk the xarrays and drop all the references, and everything gets cleaned up, at least in the C world.
In the Rust world you just do nothing since the XArray abstraction knows how to drop all of its contained objects!
Yeah, xarray should work with Drop, but I guess you need a special uapi/open-reference object that knows that it needs to perform additional cleanup (like quiescing the gpu ctx or unmapping everything for the vm).
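A hedged sketch of such an open-reference wrapper, with all names hypothetical: dropping it is more than an Arc decrement, it runs the explicit teardown (here, quiescing a stand-in gpu ctx) before releasing the reference.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stand-in gpu context; `quiesced` records that teardown ran.
struct GpuCtx {
    quiesced: AtomicBool,
}

impl GpuCtx {
    fn quiesce(&self) {
        // The real thing would synchronously unload the ctx from the gpu.
        self.quiesced.store(true, Ordering::Relaxed);
    }
}

// The uapi/open-reference object the drm_file xarray would hold: its Drop
// performs the additional cleanup (a vm wrapper would unmap everything
// instead), so dropping the xarray on postclose does the right thing.
struct UapiCtx(Arc<GpuCtx>);

impl Drop for UapiCtx {
    fn drop(&mut self) {
        self.0.quiesce();
    }
}
```

This is the piece that turns "the XArray just Drops its contents" into semantically correct postclose behavior without an explicit walk in the driver.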
But if, either due to the uabi being a bit more legacy, or Rust requiring that the backpointers are reference-counted from the gem_bo->vma->vm and can't follow borrow semantics (afaiui the usual linux list_head pattern of walking the list under a lock, giving you a borrowed reference for each element, doesn't work too well in rust?), then that's not a problem, you can still clean it all out:
- The key bit is that your vm struct needs both a refcount like kref and a separate open count. Each gpu ctx and the xarray for vm objects in drm_file hold _both_ the kref and the open refcount (in rust the open refcount implies the Arc or things go sideways).
- the other key bit is that drm_file ->postclose does _not_ have simple Drop semantics, it's more explicit.
- in the drm_file lastclose you first walk all the gpu ctx. The simplest semantics is that close() synchronously tears down all leftover gpu ctx, i.e. you unload them from the gpu. Details are under a lot of discussion in the various scheduler threads, but essentially this should ensure that the gpu ctx destruction completely removes all references to the ctx. If instead you have the legacy problem of apps expecting that rendering continues even if they called exit() before it finishes, then it gets more messy. I have no idea whether that's still a problem for new drivers or can be avoided.
- Next up you do the same thing for the vm xarray (which drops both the kref and open refcounts).
- At this point there might still be a ton of vm objects around with elevated kref. Except not, because at this point the open refcount of each vm should have dropped to zero. When that happens the vm object itself is still alive, plus even better for rust, you are in the vm_close(vm) function call so you have a full borrowed reference to it. Which means you can walk the entire address space and unmap everything explicitly. Which should get rid of any gem_bo->vma->vm backpointers you have lying around.
- At that point all your vm objects are gone too, because the kref managed backpointers are gone.
- You walk the xarray of gem_bo (well, the drm subsystem does that for you), which cleans out the remaining references to gem_bo. Only the gem_bo which are shared with other processes or have a dma_buf will survive, like they should.
No leak, no funky driver-internal vm_id based lookup, and with rust we should even be able to guarantee you never mix up Arc<Vm> with OpenRef<Vm> (or however that exactly works in rust types, I have not much real clue).
That would totally work, and actually I already use somewhat analogous mechanisms in other places like firmware queues!
If this all weren't getting turned on its head for the new VM management I'd implement it, but hopefully we can agree there's not much point right now... I'd rather focus on the DRM abstraction design and work on improving the driver in parallel right now, and then about one kernel cycle or so from now it should definitely be in a better place for review. Honestly, there are bigger design problems with the driver right now than these IDs (that I already know about)... so I want to focus more on the abstractions and their usage right now than the internal driver design which I *know* has problems ^^
Yeah I think the only fundamental issue you have is that (if I get this all right) you're trying to clean up mappings from the gem_bo, not from the vm. The gem_bo (unlike the vm) is freely shareable (at least in general), so tying anything else to the lifetime of a gem_bo in any way is a design flaw.
This is similar to dma_fence that can end up absolutely everywhere, and why drm/sched has this decoupling between hw_fence and drm_job fences with wider visibility. i915-gem/i915-scheduler and a lot of the really old drivers all get this wrong, and you end up with either terrible explicit cleanup code that tries to go around looking for all the references that it needs to drop. Or you just leak.
All these things need to be sorted out at design time so that they're impossible.
Rust is really good at getting you to come up with a *safe* design as far as memory and ownership, but that doesn't mean it's perfectly clean code and more importantly it does nothing for deadlocks and allocating in the wrong paths and getting resource allocation semantics right etc etc. The GPU FW queue stuff is at the very least due for another major refactor/cleanup to defer resource allocation and actual queuing to job prepare/run time (right now there's some horrible hacks to do it upfront at submit because I don't have a mechanism to back-patch job structures with those resource IDs later at exec time, but I want to add that), and along the way I can also fix the using job fences to block on pending job count thing that Christian really wants me to do instead of the can_run_job thing, and then getting all this resource stuff truly right is also going to mean eventually using fences to handle blocking on resource exhaustion too (though maybe I can get away with implementing that a bit later)...
The driver works stupidly well for how quickly I wrote it, but it still has all these rough edges that definitely need fixing before it's something I could say I'm happy with... I'm sure if you start hammering it with evil workloads you will hit some of its current problems (like I did yesterday with the deadlocks on GpuContext inval). I also need to learn more about the subtleties of fence signaling and all that, especially once a shrinker comes into play...
Yeah I think rust is impressive at creating working code. The real challenge, and really where I see all the short term value at least, is in clarifying the semantics. Because that'll help us to clarify the semantics on the C side too, which gives immediate benefits for everyone. Not just new drivers in rust.
But it's also the part that's really, really hard work. -Daniel
On 06/04/2023 22.48, Daniel Vetter wrote:
On Thu, Apr 06, 2023 at 10:15:56PM +0900, Asahi Lina wrote:
On 06/04/2023 20.55, Daniel Vetter wrote:
On Thu, Apr 06, 2023 at 01:44:22PM +0900, Asahi Lina wrote:
On 05/04/2023 23.37, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// A generic monotonically incrementing ID used to uniquely identify object
+/// instances within the driver.
+pub(crate) struct ID(AtomicU64);
+
+impl ID {
+    /// Create a new ID counter with a given value.
+    fn new(val: u64) -> ID {
+        ID(AtomicU64::new(val))
+    }
+
+    /// Fetch the next unique ID.
+    pub(crate) fn next(&self) -> u64 {
+        self.0.fetch_add(1, Ordering::Relaxed)
+    }
+}
Continuing the theme of me commenting on individual things, I stumbled over this because I noticed that there's a lot of id based lookups where I don't expect them, and started chasing.
For ids use xarray, not atomic counters. Yes I know dma_fence timelines get this wrong; this goes back to an innocent time when we didn't allocate more than one timeline per engine, and no one has fixed it since then. Yes u64 should be big enough for everyone :-/
Attaching ID spaces to drm_device is also not great. drm is full of these mistakes. Much better if they're per drm_file and so private to each client.
They shouldn't be used for anything other than uapi id -> kernel object lookup at the beginning of ioctl code, and nowhere else. At least from skimming it seems like these are used all over the driver codebase, which does freak me out. At least on the C side that's a clear indicator for a refcount/locking/data structure model that's not thought out at all.
What's going on here, what do I miss?
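The per-client scheme suggested here — an id space private to each drm_file, used only for uapi id -> object lookup at ioctl entry — can be sketched outside the kernel with a plain map standing in for the xarray (all names here are hypothetical, not the actual abstraction):

```rust
use std::collections::BTreeMap;

/// Hypothetical per-file handle table; in the kernel this would be an
/// xarray hanging off the drm_file, so the id space is private to each
/// client rather than attached to the drm_device.
struct DrmFilePriv<T> {
    handles: BTreeMap<u32, T>,
    next: u32,
}

impl<T> DrmFilePriv<T> {
    fn new() -> Self {
        DrmFilePriv { handles: BTreeMap::new(), next: 1 }
    }

    /// Allocate a handle for an object (roughly xa_alloc on the C side).
    fn create_handle(&mut self, obj: T) -> u32 {
        let id = self.next;
        self.next += 1;
        self.handles.insert(id, obj);
        id
    }

    /// uapi id -> object lookup, done at the start of ioctl code and
    /// nowhere else.
    fn lookup_handle(&self, id: u32) -> Option<&T> {
        self.handles.get(&id)
    }
}
```

The key property is that the ids never escape the ioctl entry path: everything past the lookup works with the object reference itself.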
These aren't UAPI IDs, they are driver-internal IDs (the UAPI IDs do use xarray and are per-File). Most of them are just for debugging, so that when I enable full debug spam I have some way to correlate different things that are happening together (this subset of interleaved log lines relate to the same submission). Basically just object names that are easier to read (and less of a security leak) than pointers and guaranteed not to repeat. You could get rid of most of them and it wouldn't affect the driver design, it just makes it very hard to see what's going on with debug logs ^^;
There are only two that are ever used for non-debugging purposes: the VM ID, and the File ID. Both are per-device global IDs attached to the VMs (not the UAPI VM objects, but rather the underlying MMU address space managers they represent, including the kernel-internal ones) and to Files themselves. They are used for destroying GEM objects: since the objects are also device-global across multiple clients, I need a way to do things like "clean up all mappings for this File" or "clean up all mappings for this VM". There's an annoying circular reference between GEM objects and their mappings, which is why this is explicitly coded out in destroy paths instead of naturally happening via Drop semantics (without that cleanup code, the circular reference leaks it).
So e.g. when a File does a GEM close or explicitly asks for all mappings of an object to be removed, it goes out to the (possibly shared) GEM object and tells it to drop all mappings marked as owned by that unique File ID. When an explicit "unmap all in VM" op happens, it asks the GEM object to drop all mappings for that underlying VM ID. Similarly, when a UAPI VM object is dropped (in the Drop impl, so both explicitly and when the whole File/xarray is dropped and such), that does an explicit unmap of a special dummy object it owns which would otherwise leak since it is not tracked as a GEM object owned by that File and therefore not handled by GEM closing. And again along the same lines, the allocators in alloc.rs explicitly destroy the mappings for their backing GEM objects on Drop. All this is due to that annoying circular reference between VMs and GEM objects that I'm not sure how to fix.
Note that if I *don't* do this (or forget to do it somewhere) the consequence is just that we leak memory, and if you try to destroy the wrong IDs somehow the worst that can happen is you unmap things you shouldn't and fault the GPU (or, in the kernel or kernel-managed user VM cases, potentially the firmware). Rust safety guarantees still keep things from going entirely off the rails within the kernel, since everything that matters is reference counted (which is why these reference cycles are possible at all).
This all started when I was looking at the panfrost driver for reference. It does the same thing except it uses actual pointers to the owning entities instead of IDs, and pointer comparison (see panfrost_gem_close). Of course you could try do that in Rust too (literally storing and comparing raw pointers that aren't owned references), but then you're introducing a Pin<> requirement on those objects to make their addresses stable and it feels way more icky and error-prone than unique IDs (since addresses can be reused). panfrost only has a single mmu (what I call the raw VM) per File while I have an arbitrary number, which is why I end up with the extra distinction/complexity of both File and VM IDs, but the concept is the same.
Some of this is going to be refactored when I implement arbitrary VM range mapping/unmapping, which would be a good time to improve this... but is there something particularly wrong/broken about the way I'm doing it now that I missed? I figured unique u64 IDs would be a pretty safe way to identify entities and clean up the mappings when needed.
Ok, some attempt at going through the vm_id/file_id stuff. Extremely high-level purely informed by having read too many drivers:
First on the drm_file/struct file/file_id. This is the uapi interface object, and it's refcounted in the vfs, but that's entirely the vfs' business and none of the driver's (or even the subsystem's). Once userspace has done the final close() the file is gone; there's no way to ever get anything meaningful out of it because userspace dropped it. So if the driver has any kind of backpointer to it, that's a design bug, because in all the places you might want to care (ioctl, fdinfo for sched stats, any other file_operations callback) the vfs ensures it stays alive during the callback and you essentially have a borrowed reference.
Right, there's none of that for the File, and it is not refcounted itself. Certainly there are no direct references, and as for the IDs: the IDs of relevant Files live in GEM objects that hold mappings owned by that file. As part of File close all the GEM objects get closed, which removes those mappings. So by the time the File goes away there should be no references to its ID anywhere (other than if I stashed some away for debugging, I forget whether I did in some child object).
If this process breaks for some reason (say, stray mappings remain indexed to a File ID that is gone), that means we leak the mappings, which leaks the GEM objects themselves and the VM they are mapped to. Not great but not fireworks either. As long as the DRM core properly calls the GEM close callback on everything before calling the File close callback though, that shouldn't happen.
I've seen a lot of drivers try to make clever backpointers to stuff that's essentially tied to the drm_file, and I've not found a single case that made sense. iow, file_id as a lookup thingie needs to go. In principle it's the same argument I've made already for the syncobj rust wrappers. For specific uses I guess I need some rust reading help, but from your description it sounds like the vm_id is much more the core piece.
The file ID is simply how GEM mappings are identified as belonging to an active file within the mapping list of an object. GEM object close is literally the only place this ID is ever used for anything other than passing around:
/// Callback to drop all mappings for a GEM object owned by a given `File`
fn close(obj: &Object, file: &DrmFile) {
    mod_pr_debug!("DriverObject::close vm_id={:?} id={}\n", obj.vm_id, obj.id);
    obj.drop_file_mappings(file.inner().file_id());
}
I could also just iterate through the VM XArray for the File and drop mappings one VM at a time instead of doing all of them in one go, it's just slightly more cumbersome (though potentially less code because I could get rid of all the forwarding the file_id I do now).
On the other hand, once we implement arbitrary VM maps, I suspect this is going to go away anyway with the new design, so I'm not really very inclined to fix it until that happens... ^^
Yeah the driver-managed vm needs a bunch more reference loops and gets awkward fast. the gpuva library might need to keep support for that, but I really hope it's not needed.
So for that we have the gpu ctx -> vm -> gem_bos chain of reference. Now on the C side if you have a modern driver that uses the vm_bind/unbind/gpuva manager approach, the reference counts go in that single direction only, anything else is essentially borrowed references under protection of a mutex/lock or similar thing (for e.g. going from the bo to the vm for eviction).
Right, so that is what is going to change with the pending refactor. What I have right now is a design that used to be the old driver-managed VM design (and still retains part of that for kernel-managed objects) for the old synchronous demo UAPI, that I then shoehorned into the redesigned vm_bind UAPI by just not supporting the interesting cases (partial maps/unmaps/remaps, etc.). This is all temporary, it's just to get us by for now since OpenGL doesn't need it and there is no usable Vulkan driver that cares yet... I wanted to focus on the explicit sync and general sched/queuing part of the new UAPI before I got to the VM bind stuff, since I figured that would be more interesting (and pulls in all the new abstractions, plus major perf benefit). So the UAPI itself has vm_bind but only the "easy" subset of cases are supported by the driver (whole object maps/unmaps) and the refcounting is still backwards.
As I said this originally came from the Panfrost design that doesn't have vm_bind but instead keeps a list of mappings with pointer equality checks in BOs... so that's why ^^
Thanks for explaining the design approach though, it's roughly what I had in mind but it's good to hear I'm on the right track! I'd love to go into more detail about how to implement vm_bind if you have time though (maybe a meeting?). In particular things like using the mm allocator to keep track of mapping ranges and supporting splitting and all that.
Yeah vm_bind sounds like a good topic to discuss. I don't think we'll get all the pieces aligned to land that before asahi, but the driver internals should at least match wrt semantics with that so that the refactoring isn't total pain.
In addition to the above chain the xarray in the drm_file also holds references to each of these. So far so good, in the drm_file ->postclose callback you just walk the xarrays and drop all the references, and everything gets cleaned up, at least in the C world.
In the Rust world you just do nothing since the XArray abstraction knows how to drop all of its contained objects!
Yeah xarray should work with Drop, but I guess you need a special uapi/open-reference object that knows that it needs to perform additional cleanup (like quiesce the gpu ctx or unmap everything for the vm).
Yeah, I already have that for VMs. Since I have a layer between UAPI VM objects and the underlying MMU VM objects, the UAPI VM object Drop impl can take care of explicitly unmapping whatever it needs to, or however that ends up working out with the new design. I prefer that to explicit cleanup code since it means you can't forget to do it.
Rust is pretty nice for throwing around tiny objects, 1:1 wrappers, or even zero-sized types that just do one thing + Drop in order to make some semantic ergonomic to use. That's how the XArray reservation stuff works: you get back a trivial object that just references the queue (yay lifetimes, no refcounting here) and holds the reservation open, and then you either fill it (which consumes the reservation guard) or drop it (which cleans up the reservation). There's lots of that kind of pattern in kernel Rust and I think we should use it often, it just makes things a lot less error-prone (ScopeGuard is another nice one!)
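The reservation-guard pattern described above can be sketched in plain Rust like this (`Queue`/`Reservation` are hypothetical stand-ins for illustration, not the actual XArray abstraction):

```rust
use std::cell::RefCell;

#[derive(Clone, PartialEq, Debug)]
enum Slot {
    Free,
    Reserved,
    Filled(u32),
}

struct Queue {
    slots: RefCell<Vec<Slot>>,
}

/// Tiny guard object: borrows the queue (lifetimes, no refcounting) and
/// holds the reservation open until it is filled or dropped.
struct Reservation<'a> {
    queue: &'a Queue,
    index: usize,
    filled: bool,
}

impl Queue {
    fn new(n: usize) -> Queue {
        Queue { slots: RefCell::new(vec![Slot::Free; n]) }
    }

    fn reserve(&self) -> Option<Reservation<'_>> {
        let mut slots = self.slots.borrow_mut();
        let index = slots.iter().position(|s| *s == Slot::Free)?;
        slots[index] = Slot::Reserved;
        Some(Reservation { queue: self, index, filled: false })
    }
}

impl<'a> Reservation<'a> {
    /// Filling consumes the guard, so a reservation is used at most once.
    fn fill(mut self, val: u32) {
        self.queue.slots.borrow_mut()[self.index] = Slot::Filled(val);
        self.filled = true;
    }
}

impl<'a> Drop for Reservation<'a> {
    fn drop(&mut self) {
        // Dropped without filling: clean up the reservation.
        if !self.filled {
            self.queue.slots.borrow_mut()[self.index] = Slot::Free;
        }
    }
}
```

The type system makes it impossible to forget the cleanup (Drop does it) or to fill a reservation twice (fill consumes the guard), which is the ergonomic win being described.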
But if either due to the uabi being a bit more legacy, or Rust requiring that the backpointers are reference-counted from the gem_bo->vma->vm and can't follow borrow semantics (afaiui the usual linux list_head pattern of walking the list under a lock giving you a borrowed reference for each element doesn't work too well in rust?) then that's not a problem, you can still all clean it out:
The key bit is that your vm struct needs both a refcount like kref and a separate open count. Each gpu ctx and the xarray for vm objects in drm_file hold _both_ the kref and the open refcount (in rust the open refcount implies the Arc or things go sideways).
the other key bit is that drm_file ->postclose does _not_ have simple Drop semantics, it's more explicit.
in the drm_file lastclose you first walk all the gpu ctx. The simplest semantics is that close() synchronously tears down all leftover gpu ctx, i.e. you unload them from the gpu. Details are under a lot of discussion in the various scheduler threads, but essentially this should ensure that the gpu ctx destruction completely removes all references to the ctx. If instead you have the legacy problem of apps expecting that rendering continues even if they called exit() before it finishes, then it gets more messy. I have no idea whether that's still a problem for new drivers or can be avoided.
Next up you do the same thing for the vm xarray (which drops both the kref and open refcounts).
At this point there might still be a ton of vm objects around with elevated kref. Except not, because at this point the open refcount of each vm should have dropped to zero. When that happens the vm object itself is still alive, plus even better for rust, you are in the vm_close(vm) function call so you have a full borrowed reference to it. Which means you can walk the entire address space and unmap everything explicitly. Which should get rid of any gem_bo->vma->vm backpointers you have lying around.
At that point all your vm objects are gone too, because the kref managed backpointers are gone.
You walk the xarray of gem_bo (well, the drm subsystem does that for you), which cleans out the remaining references to gem_bo. Only the gem_bo which are shared with other processes or have a dma_buf will survive, like they should.
No leak, no funky driver-internal vm_id based lookup, and with rust we should even be able to guarantee you never mix up Arc<Vm> with OpenRef<Vm> (or however that exactly works in rust types, I have not much real clue).
That would totally work, and actually I already use somewhat analogous mechanisms in other places like firmware queues!
If this all weren't getting turned on its head for the new VM management I'd implement it, but hopefully we can agree there's not much point right now... I'd rather focus on the DRM abstraction design and work on improving the driver in parallel right now, and then about one kernel cycle or so from now it should definitely be in a better place for review. Honestly, there are bigger design problems with the driver right now than these IDs (that I already know about)... so I want to focus more on the abstractions and their usage right now than the internal driver design which I *know* has problems ^^
Yeah I think the only fundamental issue you have is that (if I get this all right) you're trying to clean up mappings from the gem_bo, not from the vm. The gem_bo (unlike the vm) is freely shareable (at least in general), so tying anything else to the lifetime of a gem_bo in any way is a design flaw.
Yeah, it wasn't nice from the start. Actually the first bit of code I wrote is the MMU code, and originally it was even literally C code based on the panfrost MMU code as-is... I quickly realized that the C wasn't going to be that useful when I started diving into the GEM abstractions, so it got rewritten in Rust early on...
So right now it works (and I have no reason to believe it has actual leak bugs lurking today) but it's not a nice design and it's going to get a major refactor/redesign once I switch to proper vm_bind tracking.
This is similar to dma_fence that can end up absolutely everywhere, and why drm/sched has this decoupling between hw_fence and drm_job fences with wider visibility. i915-gem/i915-scheduler and a lot of the really old drivers all get this wrong, and you end up with either terrible explicit cleanup code that tries to go around looking for all the references that it needs to drop. Or you just leak.
I think for fences my general approach is going to be to just try to keep to what I'm doing now and minimize the references fences hold, and treat them as a signaling mechanism that ideally doesn't have to hold a reference to anything other than the module. After all, the real king of what needs to be alive is the firmware, and its mechanisms don't map well to fences directly, so I need to do bespoke resource management there anyway (and then just plug it into fences so it can feed into drm_sched and the rest of the world). I don't know if that makes sense, but it feels like it does? I still need to spend a bunch of time thinking about this though...
All these things need to be sorted out at design time so that they're impossible.
That's the other nice thing about Rust, it makes refactoring a lot faster too! The compiler is really good at annoying you and refusing to compile things until you've fixed all the really dumb mistakes you introduced, and then there's a pretty good chance it'll run and the remaining bugs will be really obvious after that. As much as you learn to hate the compiler, it's so much better than trying to debug things at runtime... ^^
I'm not sure what your opinion is on this, but personally if you/others were okay with it I wouldn't be too worried about hypothetically merging the driver in the state it's in today, with the expectation to hack major parts of it to bits and pieces over the next few months. I've done it a few times already... it usually doesn't take more than a day or two to make some major refactor to a component and get it back up and running. (I do expect to do a bunch of that cleanup over the next few months before it's even possible to merge anyway, just a hypothetical).
Rust is really good at getting you to come up with a *safe* design as far as memory and ownership, but that doesn't mean it's perfectly clean code and more importantly it does nothing for deadlocks and allocating in the wrong paths and getting resource allocation semantics right etc etc. The GPU FW queue stuff is at the very least due for another major refactor/cleanup to defer resource allocation and actual queuing to job prepare/run time (right now there's some horrible hacks to do it upfront at submit because I don't have a mechanism to back-patch job structures with those resource IDs later at exec time, but I want to add that), and along the way I can also fix the using job fences to block on pending job count thing that Christian really wants me to do instead of the can_run_job thing, and then getting all this resource stuff truly right is also going to mean eventually using fences to handle blocking on resource exhaustion too (though maybe I can get away with implementing that a bit later)...
The driver works stupidly well for how quickly I wrote it, but it still has all these rough edges that definitely need fixing before it's something I could say I'm happy with... I'm sure if you start hammering it with evil workloads you will hit some of its current problems (like I did yesterday with the deadlocks on GpuContext inval). I also need to learn more about the subtleties of fence signaling and all that, especially once a shrinker comes into play...
Yeah I think rust is impressive at creating working code. The real challenge, and really where I see all the short term value at least, is in clarifying the semantics. Because that'll help us to clarify the semantics on the C side too, which gives immediate benefits for everyone. Not just new drivers in rust.
But it's also the part that's really, really hard work.
Yup!
~~ Lina
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
+pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
+    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
+}
So maybe my expectations for rust typing is a bit too much, but I kinda expected this to be fully generic:
- trait Driver (drm_driver) knows the driver's object type
- a generic create_handle function could ensure that for drm_file (which is always for a specific drm_device and hence Driver) can ensure at the type level that you only put the right objects into the drm_file
- a generic lookup_handle function on the drm_file knows the Driver trait and so can give you back the right type right away.
Why the wrapping, what do I miss? -Daniel
On 05/04/2023 23.44, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
+pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
+    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
+}
So maybe my expectations for rust typing is a bit too much, but I kinda expected this to be fully generic:
- trait Driver (drm_driver) knows the driver's object type
- a generic create_handle function could ensure that for drm_file (which is always for a specific drm_device and hence Driver) can ensure at the type level that you only put the right objects into the drm_file
- a generic lookup_handle function on the drm_file knows the Driver trait and so can give you back the right type right away.
Why the wrapping, what do I miss?
Sigh, so this is one of the many ways I'm trying to work around the "Rust doesn't do subclasses" problem (so we can figure out what the best one is ^^).
The generic shmem::Object::lookup_handle() call *is* fully generic and will get you back a driver-specific object. But since Rust doesn't do subclassing, what you get back isn't a driver-specific type T, but rather a (reference to a) shmem::Object<T>. T represents the inner driver-specific data/functionality (only), and the outer shmem::Object<T> includes the actual drm_gem_shmem_object plus a T. This is backwards from C, where you expect the opposite situation where T contains a shmem object, but that just doesn't work with Rust because there's no way to build a safe API around that model as far as I know.
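The inverted layout can be sketched like this (field and type names are approximate illustrations, not the real kernel bindings):

```rust
/// Stand-in for the C struct drm_gem_shmem_object.
struct RawGemShmem {
    size: usize,
}

/// The generic wrapper owns the C object *and* the driver data: this is
/// the shmem::Object<T> shape, the reverse of C's "T embeds the gem
/// object" layout.
struct ShmemObject<T> {
    gem: RawGemShmem, // the actual C object lives inside the wrapper
    inner: T,         // driver-specific data/functionality only
}

impl<T> ShmemObject<T> {
    fn new(size: usize, inner: T) -> ShmemObject<T> {
        ShmemObject { gem: RawGemShmem { size }, inner }
    }

    /// Generic GEM functionality lives on the wrapper...
    fn size(&self) -> usize {
        self.gem.size
    }

    /// ...while driver-specific data is reached through `inner`.
    fn inner(&self) -> &T {
        &self.inner
    }
}

/// What T would be for a driver, conceptually: just the driver's own state.
struct DriverObject {
    id: u64,
}
```

So a generic lookup can hand back a `&ShmemObject<DriverObject>` with full type safety, but any driver method defined on `T` needs one of the options discussed below to be callable ergonomically.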
Now the problem is from the higher layers I want object operations that interact with the shmem::Object<T> (that is, they call generic GEM functions on the object). Options so far:
1. Add an outer wrapper and put that functionality there.
2. Just have the functions on T as helpers, so you need to call T::foo(obj) instead of obj.foo().
3. Use the undocumented method receiver trait thing to make shmem::Object<T> a valid `self` type, plus add auto-Deref to shmem::Object. Then obj.foo() works.
#1 is what I use here. #2 is how the driver-specific File ioctl callbacks are implemented, and also sched::Job<T>. #3 is used for fence callbacks (FenceObject<T>). None of them are great, and I'd love to hear what people think of the various options...
There are other unexplored options, like in this GEM case it could be covered with a driver-internal auxiliary trait impl'd on shmem::Object<T> buuut that doesn't work when you actually need callbacks on T itself to circle back to shmem::Object<T>, as is the case with File/Job/FenceObject.
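The call-syntax ergonomics of option 3 can be approximated in plain Rust with just a Deref impl (the arbitrary-self-types/Receiver part elided); this is only an illustration of the idea, not the actual abstraction:

```rust
use std::ops::Deref;

/// Stand-in for the generic GEM part.
struct Gem {
    size: usize,
}

/// The wrapper holding the GEM part plus driver data, as in Object<T>.
struct Object<T> {
    gem: Gem,
    inner: T,
}

// Auto-Deref from the wrapper to the driver data, so methods defined on T
// can be called directly on an Object<T>: obj.foo() resolves via deref.
impl<T> Deref for Object<T> {
    type Target = T;
    fn deref(&self) -> &T {
        &self.inner
    }
}

/// Hypothetical driver-specific data with its own method.
struct DriverData {
    id: u64,
}

impl DriverData {
    fn id(&self) -> u64 {
        self.id
    }
}
```

With this, `obj.id()` works without an explicit `T::id(&obj.inner)` call; the full option 3 additionally needs the receiver machinery so that `T`'s methods can take `self: &Object<T>` and circle back to the wrapper.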
~~ Lina
Same as the prior email, this was supposed to go to Daniel...
On 06/04/2023 14.02, Asahi Lina wrote:
On 05/04/2023 23.44, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
+pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
+    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
+}
So maybe my expectations for rust typing is a bit too much, but I kinda expected this to be fully generic:
- trait Driver (drm_driver) knows the driver's object type
- a generic create_handle function could ensure that for drm_file (which is always for a specific drm_device and hence Driver) can ensure at the type level that you only put the right objects into the drm_file
- a generic lookup_handle function on the drm_file knows the Driver trait and so can give you back the right type right away.
Why the wrapping, what do I miss?
Sigh, so this is one of the many ways I'm trying to work around the "Rust doesn't do subclasses" problem (so we can figure out what the best one is ^^).
The generic shmem::Object::lookup_handle() call *is* fully generic and will get you back a driver-specific object. But since Rust doesn't do subclassing, what you get back isn't a driver-specific type T, but rather a (reference to a) shmem::Object<T>. T represents the inner driver-specific data/functionality (only), and the outer shmem::Object<T> includes the actual drm_gem_shmem_object plus a T. This is backwards from C, where you expect the opposite situation where T contains a shmem object, but that just doesn't work with Rust because there's no way to build a safe API around that model as far as I know.
Now the problem is from the higher layers I want object operations that interact with the shmem::Object<T> (that is, they call generic GEM functions on the object). Options so far:
1. Add an outer wrapper and put that functionality there.
2. Just have the functions on T as helpers, so you need to call T::foo(obj) instead of obj.foo().
3. Use the undocumented method receiver trait thing to make shmem::Object<T> a valid `self` type, plus add auto-Deref to shmem::Object. Then obj.foo() works.
#1 is what I use here. #2 is how the driver-specific File ioctl callbacks are implemented, and also sched::Job<T>. #3 is used for fence callbacks (FenceObject<T>). None of them are great, and I'd love to hear what people think of the various options...
There are other unexplored options, like in this GEM case it could be covered with a driver-internal auxiliary trait impl'd on shmem::Object<T> buuut that doesn't work when you actually need callbacks on T itself to circle back to shmem::Object<T>, as is the case with File/Job/FenceObject.
~~ Lina
~~ Lina
On Thu, Apr 06, 2023 at 02:02:55PM +0900, Asahi Lina wrote:
On 05/04/2023 23.44, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
+pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
+    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
+}
So maybe my expectations for rust typing is a bit too much, but I kinda expected this to be fully generic:
- trait Driver (drm_driver) knows the driver's object type
- a generic create_handle function could ensure that for drm_file (which is always for a specific drm_device and hence Driver) can ensure at the type level that you only put the right objects into the drm_file
- a generic lookup_handle function on the drm_file knows the Driver trait and so can give you back the right type right away.
Why the wrapping, what do I miss?
Sigh, so this is one of the many ways I'm trying to work around the "Rust doesn't do subclasses" problem (so we can figure out what the best one is ^^).
The generic shmem::Object::lookup_handle() call *is* fully generic and will get you back a driver-specific object. But since Rust doesn't do subclassing, what you get back isn't a driver-specific type T, but rather a (reference to a) shmem::Object<T>. T represents the inner driver-specific data/functionality (only), and the outer shmem::Object<T> includes the actual drm_gem_shmem_object plus a T. This is backwards from C, where you expect the opposite situation where T contains a shmem object, but that just doesn't work with Rust because there's no way to build a safe API around that model as far as I know.
Ah I think I just got confused. I did untangle (I think at least) the Object<T> trick, I guess the only thing that confused me here is why this is in the shmem module? Or is that the rust problem again?
I'd kinda have expected that we'd have a gem::Object<T> here that the lookup_handle function returns. So for the shmem case I guess that would then be gem::Object<shmem::Object<T>> for the driver type T with driver specific stuff? I guess not very pretty ...
Now the problem is from the higher layers I want object operations that interact with the shmem::Object<T> (that is, they call generic GEM functions on the object). Options so far:
1. Add an outer wrapper and put that functionality there.
2. Just have the functions on T as helpers, so you need to call T::foo(obj) instead of obj.foo().
3. Use the undocumented method receiver trait thing to make shmem::Object<T> a valid `self` type, plus add auto-Deref to shmem::Object. Then obj.foo() works.
#1 is what I use here. #2 is how the driver-specific File ioctl callbacks are implemented, and also sched::Job<T>. #3 is used for fence callbacks (FenceObject<T>). None of them are great, and I'd love to hear what people think of the various options...
There are other unexplored options, like in this GEM case it could be covered with a driver-internal auxiliary trait impl'd on shmem::Object<T> buuut that doesn't work when you actually need callbacks on T itself to circle back to shmem::Object<T>, as is the case with File/Job/FenceObject.
Ok I think I'm completely lost here. But I also haven't looked at how this is all really used in the driver, it's really just the shmem:: module in the lookup_handle function which looked strange to me.
-Daniel
On 06/04/2023 20.25, Daniel Vetter wrote:
On Thu, Apr 06, 2023 at 02:02:55PM +0900, Asahi Lina wrote:
On 05/04/2023 23.44, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
+pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
+    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
+}
So maybe my expectations for rust typing are a bit too much, but I kinda expected this to be fully generic:
- trait Driver (drm_driver) knows the driver's object type
- a generic create_handle function on drm_file (which is always for a specific drm_device and hence Driver) could ensure at the type level that you only put the right objects into the drm_file
- a generic lookup_handle function on the drm_file knows the Driver trait and so can give you back the right type right away.
Why the wrapping, what do I miss?
Sigh, so this is one of the many ways I'm trying to work around the "Rust doesn't do subclasses" problem (so we can figure out what the best one is ^^).
The generic shmem::Object::lookup_handle() call *is* fully generic and will get you back a driver-specific object. But since Rust doesn't do subclassing, what you get back isn't a driver-specific type T, but rather a (reference to a) shmem::Object<T>. T represents the inner driver-specific data/functionality (only), and the outer shmem::Object<T> includes the actual drm_gem_shmem_object plus a T. This is backwards from C, where you expect the opposite situation where T contains a shmem object, but that just doesn't work with Rust because there's no way to build a safe API around that model as far as I know.
Ah I think I just got confused. I did untangle (I think at least) the Object<T> trick, I guess the only thing that confused me here is why this is in the shmem module? Or is that the rust problem again?
I'd kinda have expected that we'd have a gem::Object<T> here that the lookup_handle function returns. So for the shmem case I guess that would then be gem::Object<shmem::Object<T>> for the driver type T with driver specific stuff? I guess not very pretty ...
Ahh, uh... Yeah, so shmem objects are allocated their own way (the shmem core expects to kfree them in drm_gem_shmem_free) and bindings::drm_gem_shmem_object already contains a bindings::drm_gem_object. Since the composition is already done in the C side, we can't just do it again in Rust cleanly. That's why I have this weird setup with both a common trait for common GEM functionality and separate actual types that both implement it.
Honestly the whole GEM codepath is untested other than the bits inherited by shmem. I'm not sure I'll be able to verify that this all makes sense until another Rust driver comes along that needs something other than shmem. I just felt I had to do *something* for GEM since the hierarchy is there and I needed shmem...
This whole gem stuff is IMO the messiest part of the abstractions though, so I'm happy to turn it on its head if it makes it better and someone has an idea of how to do that ^^
Now the problem is from the higher layers I want object operations that interact with the shmem::Object<T> (that is, they call generic GEM functions on the object). Options so far:
1. Add an outer wrapper and put that functionality there.
2. Just have the functions on T as helpers, so you need to call T::foo(obj) instead of obj.foo().
3. Use the undocumented method receiver trait thing to make shmem::Object<T> a valid `self` type, plus add auto-Deref to shmem::Object. Then obj.foo() works.
#1 is what I use here. #2 is how the driver-specific File ioctl callbacks are implemented, and also sched::Job<T>. #3 is used for fence callbacks (FenceObject<T>). None of them are great, and I'd love to hear what people think of the various options...
There are other unexplored options, like in this GEM case it could be covered with a driver-internal auxiliary trait impl'd on shmem::Object<T> buuut that doesn't work when you actually need callbacks on T itself to circle back to shmem::Object<T>, as is the case with File/Job/FenceObject.
Ok I think I'm completely lost here. But I also haven't looked at how this is all really used in the driver, it's really just the shmem:: module in the lookup_handle function which looked strange to me.
Ah, sorry, I misunderstood what you were talking about in my previous email then. That's just a default trait function. It comes from common functionality in the gem module, but shmem::Object implements the trait so it ends up offering it too (lookup_handle() is not duplicated, it only lives in gem, shmem only has to implement going to/from the drm_gem_object pointer so the rest of the methods can use it). That's part of why the type/trait hierarchy is kind of messy here, it's so I can share functionality between both types even though they are pre-composed on the C side.
In the end the object types are specialized for any given driver, so you're always getting your own unique kind of object anyway. It's just that drivers based on shmem will go through it to reach the common code and work with a shmem::Object<T>, and drivers using raw gem will use gem::Object<T> instead.
~~ Lina
On Thu, Apr 06, 2023 at 10:32:29PM +0900, Asahi Lina wrote:
On 06/04/2023 20.25, Daniel Vetter wrote:
On Thu, Apr 06, 2023 at 02:02:55PM +0900, Asahi Lina wrote:
On 05/04/2023 23.44, Daniel Vetter wrote:
On Tue, Mar 07, 2023 at 11:25:43PM +0900, Asahi Lina wrote:
+/// Look up a GEM object handle for a `File` and return an `ObjectRef` for it.
+pub(crate) fn lookup_handle(file: &DrmFile, handle: u32) -> Result<ObjectRef> {
+    Ok(ObjectRef::new(shmem::Object::lookup_handle(file, handle)?))
+}
So maybe my expectations for rust typing are a bit too much, but I kinda expected this to be fully generic:
- trait Driver (drm_driver) knows the driver's object type
- a generic create_handle function on drm_file (which is always for a specific drm_device and hence Driver) could ensure at the type level that you only put the right objects into the drm_file
- a generic lookup_handle function on the drm_file knows the Driver trait and so can give you back the right type right away.
Why the wrapping, what do I miss?
Sigh, so this is one of the many ways I'm trying to work around the "Rust doesn't do subclasses" problem (so we can figure out what the best one is ^^).
The generic shmem::Object::lookup_handle() call *is* fully generic and will get you back a driver-specific object. But since Rust doesn't do subclassing, what you get back isn't a driver-specific type T, but rather a (reference to a) shmem::Object<T>. T represents the inner driver-specific data/functionality (only), and the outer shmem::Object<T> includes the actual drm_gem_shmem_object plus a T. This is backwards from C, where you expect the opposite situation where T contains a shmem object, but that just doesn't work with Rust because there's no way to build a safe API around that model as far as I know.
Ah I think I just got confused. I did untangle (I think at least) the Object<T> trick, I guess the only thing that confused me here is why this is in the shmem module? Or is that the rust problem again?
I'd kinda have expected that we'd have a gem::Object<T> here that the lookup_handle function returns. So for the shmem case I guess that would then be gem::Object<shmem::Object<T>> for the driver type T with driver specific stuff? I guess not very pretty ...
Ahh, uh... Yeah, so shmem objects are allocated their own way (the shmem core expects to kfree them in drm_gem_shmem_free) and bindings::drm_gem_shmem_object already contains a bindings::drm_gem_object. Since the composition is already done in the C side, we can't just do it again in Rust cleanly. That's why I have this weird setup with both a common trait for common GEM functionality and separate actual types that both implement it.
Hm this is annoying. For a single driver it doesn't matter, but I do expect that once we have more, and especially once we have more libraries wrapped (ttm, gpuva, execbuf submit helpers, ...) then the common glue really becomes the gem_bo for many of these things.
Could we have a GemObject trait which covers this? Its sole function would be an unsafe one that gives you the raw C pointer :-) It still means that every gem memory manager library needs to impl that trait, but all the manager-agnostic bits in the wrappers would be generic? The trait would then also have the right dependent type to ensure type safety in all this.
Maybe something to discuss in the next meeting with the rust folks.
Honestly the whole GEM codepath is untested other than the bits inherited by shmem. I'm not sure I'll be able to verify that this all makes sense until another Rust driver comes along that needs something other than shmem. I just felt I had to do *something* for GEM since the hierarchy is there and I needed shmem...
This whole gem stuff is IMO the messiest part of the abstractions though, so I'm happy to turn it on its head if it makes it better and someone has an idea of how to do that ^^
Yeah I still haven't worked up enough courage to type up my gem abstraction review :-/
Now the problem is from the higher layers I want object operations that interact with the shmem::Object<T> (that is, they call generic GEM functions on the object). Options so far:
1. Add an outer wrapper and put that functionality there.
2. Just have the functions on T as helpers, so you need to call T::foo(obj) instead of obj.foo().
3. Use the undocumented method receiver trait thing to make shmem::Object<T> a valid `self` type, plus add auto-Deref to shmem::Object. Then obj.foo() works.
#1 is what I use here. #2 is how the driver-specific File ioctl callbacks are implemented, and also sched::Job<T>. #3 is used for fence callbacks (FenceObject<T>). None of them are great, and I'd love to hear what people think of the various options...
There are other unexplored options, like in this GEM case it could be covered with a driver-internal auxiliary trait impl'd on shmem::Object<T> buuut that doesn't work when you actually need callbacks on T itself to circle back to shmem::Object<T>, as is the case with File/Job/FenceObject.
Ok I think I'm completely lost here. But I also haven't looked at how this is all really used in the driver, it's really just the shmem:: module in the lookup_handle function which looked strange to me.
Ah, sorry, I misunderstood what you were talking about in my previous email then. That's just a default trait function. It comes from common functionality in the gem module, but shmem::Object implements the trait so it ends up offering it too (lookup_handle() is not duplicated, it only lives in gem, shmem only has to implement going to/from the drm_gem_object pointer so the rest of the methods can use it). That's part of why the type/trait hierarchy is kind of messy here, it's so I can share functionality between both types even though they are pre-composed on the C side.
Ok, so it's all already what I expect and I'm just confused with rust syntax.
In the end the object types are specialized for any given driver, so you're always getting your own unique kind of object anyway. It's just that drivers based on shmem will go through it to reach the common code and work with a shmem::Object<T>, and drivers using raw gem will use gem::Object<T> instead.
Ok, sounds all good. -Daniel
That was supposed to have Markdown-style section headings, but I forgot that b4 considers a leading # as a comment... sorry for the abrupt topic changes...
The intended headings are below.
On 07/03/2023 23.25, Asahi Lina wrote:
Hi everyone!
This is my first take on the Rust abstractions for the DRM subsystem. It includes the abstractions themselves, some minor prerequisite changes to the C side, as well as the drm-asahi GPU driver (for reference on how the abstractions are used, but not necessarily intended to land together).
These patches apply on top of the tree at [1], which is based on 6.3-rc1 with a large number of Rust abstraction/support commits added on top. Most of these are not prerequisites for the DRM abstractions themselves, but rather only of the driver.
- #1-12 introduce the abstractions, module by module, with minor C changes before the dependent abstraction.
- Patch 10 is a little addition to drm_sched that I ended up needing, but I can pull it out of the abstraction into its own patch if needed.
- #13-14 add a minor feature to drm/gem and its abstraction used by the driver.
- #15-16 introduce the (unstable) asahi UAPI. This is obviously not ready for merge yet, but comments are welcome!
- #17 adds a Rust helper macro to handle GPU core/firmware differences. This probably belongs in the driver at this point, but right now it has to live in rust/macros since there is no mechanism for per-driver proc macros.
- #18 adds the driver proper, in one big commit, for reference purposes.
## Background
I've been working since mid last year on an Apple AGX GPU driver for Linux, using the (at the time) out-of-tree Rust support. As part of this effort, I've been writing safe Rust abstractions for portions of the DRM subsystem.
Now that Rust itself is upstream, I'd like to get all the abstractions upstreamed so we can eventually get the driver upstreamed!
These abstractions have been used by the driver since our release in December [2], in a simpler synchronous-submission form:
- drm::ioctl
- drm::device
- drm::drv
- drm::file
- drm::{gem, gem::shmem}
- drm::mm
This series adds these too, which are used by the explicit sync refactor of the driver (the version in this series):
- drm::syncobj
- drm::sched
- dma_fence
The major dependencies for the DRM abstractions themselves are:
- [3] rust: error: Add missing wrappers to convert to/from kernel error codes
- [4] rust: Miscellaneous macro improvements
- [5] rust: Add a Sealed trait
- [6] rust: device: Add a minimal RawDevice trait
- [7] rust: Enable the new_uninit feature for kernel and driver crates
- [8] rust: ioctl: Add ioctl number manipulation functions
- [9] rust: sync: Arc: Any downcasting and assume_init()
- rust: Add `container_of` and `offset_of` macros
- kernel::sync::mutex and dependencies
Most of these (the ones with links) have already been submitted, and I expect all of them to land for 6.4 (the mutex one will likely be last, since there is some refactoring that will happen over the current state to make it more ergonomic to use). The mutex dep is only necessary for drm::mm and dma_fence, and transitively drm::syncobj and drm::sched.
## State
Things work! We've had most of the abstractions in production edge kernels with the driver, and the new explicit sync stuff has passed quite a few torture tests (this is how we found the drm_sched issue, patch 11).
The abstractions are intended to be safe (safety review very welcome!). While writing them, I tried to avoid making any changes to the C side unless absolutely necessary. I understand that it will probably make sense to adjust the C side to make some things easier, but I wanted to start from this as a baseline.
Known issues:
The existing Rust integration does not currently allow building abstractions as modules, so the Rust abstractions are only available for DRM components that are built in. I added some extra Kconfig symbols to deal with this, so a driver built as a module can depend on having those built in. This should go away in the future (but may not be ready in time for submission... I understand this probably shouldn't be a blocker though?).
DRM relies heavily on the "subclassing" pattern for driver objects, and this doesn't map well to Rust. I tried several approaches for various bits, so we can see how they work out. In particular, whether wrapper types should pretend to be smart pointers and Deref to their inner driver-specific types, and whether they should be marked as method receivers (Yuck, internal rustc implementation hacks! But Arc<T> already does the same thing and it makes usage in driver-implemented callbacks as `self` possible) are things I'd love to discuss ^^.
Only what I need for my driver is implemented (plus a small amount of obvious extras where better API completeness makes sense). I think the general idea with Rust abstractions is that we add things as they become necessary.
The plain GEM vs. GEM-shmem duality ended up with quite a hairy type hierarchy. I'd love to figure out how to make this simpler...
drm::mm ends up requiring a built-in mutex in the abstraction, instead of delegating that to the user with the usual Rust mutability rules. This is because nodes can be dropped at any time, and those operations need to be synchronized. We could try to avoid forbidding those drops or mark the node type !Send, but that would make it a lot less ergonomic to use...
I'm looking for feedback on the abstractions of all kinds, so we can move towards an upstreamable version. Optimistically, I'd love to get this upstream for 6.5, and the driver for 6.6.
Please feel free to ask any questions about the Rust bits, since I know a lot of this is new to many of the C folks!
## About the drm-asahi driver
This is a fairly complete driver for Apple AGX G13 and G14 series GPUs.
The driver today supports the Apple M1, M1 Pro, M1 Max, M1 Ultra, and M2 SoCs, across two firmware revisions each. It has an explicit sync UAPI heavily inspired by the upcoming Intel Xe UAPI, designed with Vulkan support in mind. On the Mesa side we currently have a Gallium driver that is mostly already upstream (missing the UAPI bits mostly) and passes the dEQP GLES2/EGL tests, with most of GLES3.0 passing in downstream work-in-progress branches. This is a reverse engineered community driver (we have no hardware documentation of any kind, other than some hints from aspects shared with PowerVR).
While developing the driver, I tried to make use of Rust's safety and lifetime features to provide not just CPU-side safety, but also partial firmware-ABI safety. Thanks to this, it has turned out to be a very stable driver even though GPU firmware crashes are fatal (no restart capability, need to reboot!) and the FW/driver interface is a huge mess of unsafe shared memory structures with complex pointer chains. There are over 70 ABI types and 3000+ lines of firmware ABI type definitions that vary between firmware builds and GPU cores...
In a simpler blocking-submission form, it has been shipping in Asahi Linux edge kernels since December [2], with lots of users and zero (!) reported oopses (and only a couple reports of GPU firmware crashes, though that issue should now be fixed). It has survived OOM scenarios (Rust makes error cleanup easy!), UAPI-level fuzzing, countless broken Mesa builds, uptimes of 40+ days, and more.
The explicit sync refactor significantly increases performance (and potential problems), but this version has survived a lot of torture with dEQP/piglit tests and some manual corner case testing.
In other words, Rust works! ^^
There are some design notes on the driver and further links at [10].
## Links
[1] https://github.com/AsahiLinux/linux.git drm-rfc-base-20230307
[2] https://asahilinux.org/2022/12/gpu-drivers-now-in-asahi-linux/
[3] https://lore.kernel.org/rust-for-linux/20230224-rust-error-v1-0-f8f9a9a87303...
[4] https://lore.kernel.org/rust-for-linux/20230224-rust-macros-v1-0-b39fae46e10...
[5] https://lore.kernel.org/rust-for-linux/20230224-rust-iopt-rtkit-v1-0-49ced33...
[6] https://lore.kernel.org/rust-for-linux/20230224-rust-iopt-rtkit-v1-0-49ced33...
[7] https://lore.kernel.org/rust-for-linux/CQV7ZNT6LMXI.1XG4YXSH8I7JK@vincent-ar...
[8] https://lore.kernel.org/rust-for-linux/61f734d6-1497-755f-3632-3f261b890846@...
[9] https://lore.kernel.org/rust-for-linux/20230224-rust-arc-v1-0-568eea613a41@a...
[10] https://github.com/AsahiLinux/docs/wiki/SW:AGX-driver-notes
~~ Lina