Hi Arnd,
On Wed, Dec 11, 2019 at 9:53 PM Arnd Bergmann arnd@arndb.de wrote:
Documentation/process/botching-up-ioctls.rst was orignally written as a blog post for DRM driver writers, so it it misses some points while going into a lot of detail on others.
Try to provide a replacement that addresses typical issues across a wider range of subsystems, and follows the style of the core-api documentation better.
Signed-off-by: Arnd Bergmann arnd@arndb.de
Thanks for your patch!
--- /dev/null +++ b/Documentation/core-api/ioctl.rst @@ -0,0 +1,250 @@ +====================== +ioctl based interfaces +======================
+:c:func:`ioctl` is the most common way for applications to interface +with device drivers. It is flexible and easily extended by adding new +commands and can be passed through character devices, block devices as +well as sockets and other special file descriptors.
+However, it is also very easy to get ioctl command definitions wrong, +and hard to fix them later without breaking existing applications, +so this documentation tries to help developers get it right.
+Command number definitions +==========================
+The command number, or request number, is the second argument passed to +the ioctl system call. While this can be any 32-bit number that uniquely +identifies an action for a particular driver, there are a number of +conventions around defining them.
Interesting. I never realized the action is 32-bit in the kernel, but unsigned long in userspace...
+``include/uapi/asm-generic/ioctl.h`` provides four macros for defining +ioctl commands that follow modern conventions: ``_IOC``, ``_IOR``, +``_IOW``, and ``_IORW``. These should be used for all new commands, +with the correct parameters:
+_IO/_IOR/_IOW/_IOWR
This says _IO....
- The macro name determines whether the argument is used for passing
- data into kernel (_IOW), from the kernel (_IOR), both (_IOWR) or is
into the kernel , or is
- not a pointer (_IOC). It is possible but not recommended to pass an
- integer value instead of a pointer with _IOC.
...which is not explained here, but _IOC is?
+type
- An 8-bit number, often a character literal, specific to a subsystem
- or driver, and listed in :doc:`../ioctl/ioctl-number`
+nr
- An 8-bit number identifying the specific command, unique for a give
- value of 'type'
+size
- The name of the data type pointed to by the argument, the command
- number encodes the ``sizeof(size)`` value in a 13-bit or 14-bit integer,
- leading to a limit of 8191 bytes for the maximum size of the argument.
- Note: do not pass sizeof(type) type into _IOR/IOW, as that will lead
- to encoding sizeof(sizeof(type)), i.e. sizeof(size_t).
Looks like "size" could be renamed, to make this more obvious?
+Timestamps +==========
+Traditionally, timestamps and timeout values are passed as ``struct +timespec`` or ``struct timeval``, but these are problematic because of +incompatible definitions of these structures in user space after the +move to 64-bit time_t.
+The __kernel_timespec type can be used instead to be embedded in other +data structures when separate second/nanosecond values are desired, +or passed to user space directly. This is still not ideal though, +as the structure matches neither the kernel's timespec64 nor the user +space timespec exactly. The get_timespec64() and put_timespec64() helper +functions canbe used to ensure that the layout remains compatible with
can be
+user space and the padding is treated correctly.
+As it is cheap to convert seconds to nanoseconds, but the opposite +requires an expensive 64-bit division, a simple __u64 nanosecond value +can be simpler and more efficient.
+Timeout values and timestamps should ideally use CLOCK_MONOTONIC time, +as returned by ``ktime_get_ns()`` or ``ktime_get_ts64()``. Unlike +CLOCK_REALTIME, this makes the timestamps immune from jumping backwards +or forwards due to leap second adjustments and clock_settime() calls.
+``ktime_get_real_ns()`` can be used for CLOCK_REALTIME timestamps that +may be required for timestamps that need to be persistent across a reboot
Drop "may be required for timestamps that"?
+or between multiple machines.
+Structure layout +----------------
+Compatible data structures have the same layout on all architectures, +avoiding all problematic members:
+* ``long`` and ``unsigned long`` are the size of a register, so
- they can be either 32 bit or 64 bit wide and cannot be used in portable
32-bit or 64-bit (for consistency with the rest of the document)
- data structures. Fixed-length replacements are ``__s32``, ``__u32``,
- ``__s64`` and ``__u64``.
+* Pointers have the same problem, in addition to requiring the
- use of ``compat_ptr()``. The best workaround is to use ``__u64``
- in place of pointers, which requires a cast to ``uintptr_t`` in user
- space, and the use of ``u64_to_user_ptr()`` in the kernel to convert
- it back into a user pointer.
+* On the x86-32 (i386) architecture, the alignment of 64-bit variables
- is only 32 bit, but they are naturally aligned on most other
32-bit
- architectures including x86-64. This means a structure like
- ::
- struct foo {
__u32 a;
__u64 b;
__u32 c;
- };
- has four bytes of padding between a and b on x86-64, plus another four
- bytes of padding at the end, but no padding on i386, and it needs a
- compat_ioctl conversion handler to translate between the two formats.
- To avoid this problem, all structures should have their members
- naturally aligned, or explicit reserved fields added in place of the
- implicit padding.
+* On ARM OABI user space, 16-bit member variables have 32-bit
- alignment, making them incompatible with modern EABI kernels.
- Conversely, on the m68k architecture, all struct members have at most
- 16-bit alignment. These rarely cause problems as neither ARM-OABI nor
"have at most 16-bit alignment" sounds a bit weird to me, as a member may have a greater alignment. "struct members are not guaranteed to have an alignment greater than 16-bit"?
- m68k are supported by any compat mode, but for consistency, it is best
- to completely avoid 16-bit member variables.
+* Bitfields and enums generally work as one would expect them to,
- but some properties of them are implementation-defined, so it is better
- to avoid them completely in ioctl interfaces.
+* ``char`` members can be either signed or unsigned, depending on
- the architecture, so the __u8 and __s8 types should be used for 8-bit
- integer values, though char arrays are clearer for fixed-length strings.
+Information leaks +=================
+Uninitialized data must not be copied back to user space, as this can +cause an information leak, which can be used to defeat kernel address +space layout randomization (KASLR), helping in an attack.
+As explained for the compat mode, it is best to not avoid any padding in
best to avoid any implicit padding?
+data structures, but if there is already padding in existing structures, +the kernel driver must be careful to zero out the padding using +``memset()`` or similar before copying it to user space.
+Subsystem abstractions +======================
+While some device drivers implement their own ioctl function, most +subsystems implement the same command for multiple drivers. Ideally the +subsystem has an .ioctl() handler that copies the arguments from and +to user space, passing them into subsystem specific callback functions +through normal kernel pointers.
+This helps in various ways:
+* Applications written for one driver are more likely to work for
- another one in the same subsystem if there are no subtle differences
- in the user space ABI.
+* The complexity of user space access and data structure layout at done
is done
- in one place, reducing the potential for implementation bugs.
+* It is more likely to be reviewed by experienced developers
- that can spot problems in the interface when the ioctl is shared
- between multiple drivers than when it is only used in a single driver.
+Alternatives to ioctl +=====================
+There are many cases in which ioctl is not the best solution for a +problem. Alternatives include
:
+* System calls are a better choice for a system-wide feature that
- is not tied to a physical device or constrained by the file system
- permissions of a character device node
+* netlink is the preferred way of configuring any network related
- objects through sockets.
+* debugfs is used for ad-hoc interfaces for debugging functionality
- that does not need to be exposed as a stable interface to applications.
+* sysfs is a good way to expose the state of an in-kernel object
- that is not tied to a file descriptor.
+* configfs can be used for more complex configuration than sysfs
+* A custom file system can provide extra flexibility with a simple
- user interface but add a lot of complexity in the implementation.
adds ... to
Gr{oetje,eeting}s,
Geert