October 2023 - Linux-kselftest-mirror

[PATCH v4 0/8] Add printf attribute to kselftest functions

by Maciej Wieczor-Retman

Kselftest.h declares many variadic functions that can print some formatted message while also executing selftest logic. These declarations don't have any compiler mechanism to verify if passed arguments are valid in comparison with format specifiers used in printf() calls. Attribute addition can make debugging easier, the code more consistent and prevent mismatched or missing variables. Add a __printf() macro that validates types of variables passed to the format string. The macro is similarly used in other tools in the kernel. Add __printf() attributes to function definitions inside kselftest.h that use printing. Adding the __printf() macro exposes some mismatches in format strings across different selftests. Fix the mismatched format specifiers in multiple tests. Series is based on kselftests next branch. Changelog v4: - Fix patch 1/8 subject typo. - Add Reinette's reviewed-by tags. - Rebased onto updated kselftests next branch. Changelog v3: - Changed git signature from Wieczor-Retman Maciej to Maciej Wieczor-Retman. - Added one review tag. - Rebased onto updated kselftests next branch. Changelog v2: - Add review and fixes tags to patches. - Add two patches with mismatch fixes. - Fix missed attribute in selftests/kvm. (Andrew) - Fix previously missed issues in selftests/mm (Ilpo) [v3] https://lore.kernel.org/all/cover.1695373131.git.maciej.wieczor-retman@inte… [v2] https://lore.kernel.org/all/cover.1693829810.git.maciej.wieczor-retman@inte… [v1] https://lore.kernel.org/all/cover.1693216959.git.maciej.wieczor-retman@inte… Maciej Wieczor-Retman (8): selftests: Add printf attribute to kselftest prints selftests/cachestat: Fix print_cachestat format selftests/openat2: Fix wrong format specifier selftests/pidfd: Fix ksft print formats selftests/sigaltstack: Fix wrong format specifier selftests/kvm: Replace attribute with macro selftests/mm: Substitute attribute with a macro selftests/resctrl: Fix wrong format specifier .../selftests/cachestat/test_cachestat.c | 2 +- tools/testing/selftests/kselftest.h | 18 ++++++++++-------- .../testing/selftests/kvm/include/test_util.h | 8 ++++---- tools/testing/selftests/mm/mremap_test.c | 2 +- tools/testing/selftests/mm/pkey-helpers.h | 2 +- tools/testing/selftests/openat2/openat2_test.c | 2 +- .../selftests/pidfd/pidfd_fdinfo_test.c | 2 +- tools/testing/selftests/pidfd/pidfd_test.c | 12 ++++++------ tools/testing/selftests/resctrl/cache.c | 2 +- tools/testing/selftests/sigaltstack/sas.c | 2 +- 10 files changed, 27 insertions(+), 25 deletions(-) base-commit: f1020c687153609f246f3314db5b74821025c185 -- 2.42.0

2 years

3
10
0 0

kselftest/next build: 6 builds: 0 failed, 6 passed, 1 warning (v6.6-rc2-18-g2531f374f922e)

by kernelci.org bot

kselftest/next build: 6 builds: 0 failed, 6 passed, 1 warning (v6.6-rc2-18-g2531f374f922e) Full Build Summary: https://kernelci.org/build/kselftest/branch/next/kernel/v6.6-rc2-18-g2531f3… Tree: kselftest Branch: next Git Describe: v6.6-rc2-18-g2531f374f922e Git Commit: 2531f374f922e77ba51f24d1aa6fa11c7f4c36b8 Git URL: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git Built: 4 unique architectures Warnings Detected: arm64: arm: i386: x86_64: x86_64_defconfig+kselftest (clang-16): 1 warning Warnings summary: 1 vmlinux.o: warning: objtool: set_ftrace_ops_ro+0x23: relocation to !ENDBR: .text+0x13950b ================================================================================ Detailed per-defconfig build reports: -------------------------------------------------------------------------------- defconfig+kselftest (arm64, gcc-10) — PASS, 0 errors, 0 warnings, 0 section mismatches -------------------------------------------------------------------------------- defconfig+kselftest+arm64-chromebook (arm64, gcc-10) — PASS, 0 errors, 0 warnings, 0 section mismatches -------------------------------------------------------------------------------- i386_defconfig+kselftest (i386, gcc-10) — PASS, 0 errors, 0 warnings, 0 section mismatches -------------------------------------------------------------------------------- multi_v7_defconfig+kselftest (arm, gcc-10) — PASS, 0 errors, 0 warnings, 0 section mismatches -------------------------------------------------------------------------------- x86_64_defconfig+kselftest (x86_64, gcc-10) — PASS, 0 errors, 0 warnings, 0 section mismatches -------------------------------------------------------------------------------- x86_64_defconfig+kselftest (x86_64, clang-16) — PASS, 0 errors, 1 warning, 0 section mismatches Warnings: vmlinux.o: warning: objtool: set_ftrace_ops_ro+0x23: relocation to !ENDBR: .text+0x13950b --- For more info write to <info(a)kernelci.org>

2 years

1
0
0 0

[RFC 0/8] iommufd support pasid attach/replace

by Yi Liu

PASID (Process Address Space ID) is a PCIe extension to tag the DMA transactions out of a physical device, and most modern IOMMU hardware have supported PASID granular address translation. So a PASID-capable devices can be attached to multiple hwpts (a.k.a. domains), each attachment is tagged with a PASID. This series first adds a missing iommu API to replace domain for a pasid, then adds iommufd APIs for device drivers to attach/replace/detach pasid to/from hwpt per userspace's request, and adds selftest to validate the iommufd APIs. pasid attach/replace is mandatory on Intel VT-d given the PASID table locates in the physical address space hence must be managed by the kernel, both for supporting vSVA and coming SIOV. But it's optional on ARM/AMD which allow configuring the PASID/CD table either in host physical address space or nested on top of an GPA address space. This series only add VT-d support as the minimal requirement. Complete code can be found in below link: https://github.com/yiliu1765/iommufd/tree/iommufd_pasid Regards, Yi Liu Kevin Tian (1): iommufd: Support attach/replace hwpt per pasid Lu Baolu (2): iommu: Introduce a replace API for device pasid iommu/vt-d: Add set_dev_pasid callback for nested domain Yi Liu (5): iommufd: replace attach_fn with a structure iommufd/selftest: Add set_dev_pasid and remove_dev_pasid in mock iommu iommufd/selftest: Add a helper to get test device iommufd/selftest: Add test ops to test pasid attach/detach iommufd/selftest: Add coverage for iommufd pasid attach/detach drivers/iommu/intel/nested.c | 47 +++++ drivers/iommu/iommu-priv.h | 2 + drivers/iommu/iommu.c | 73 ++++++-- drivers/iommu/iommufd/Makefile | 1 + drivers/iommu/iommufd/device.c | 42 +++-- drivers/iommu/iommufd/iommufd_private.h | 16 ++ drivers/iommu/iommufd/iommufd_test.h | 24 +++ drivers/iommu/iommufd/pasid.c | 152 ++++++++++++++++ drivers/iommu/iommufd/selftest.c | 158 ++++++++++++++-- include/linux/iommufd.h | 6 + tools/testing/selftests/iommu/iommufd.c | 172 ++++++++++++++++++ .../selftests/iommu/iommufd_fail_nth.c | 28 ++- tools/testing/selftests/iommu/iommufd_utils.h | 78 ++++++++ 13 files changed, 756 insertions(+), 43 deletions(-) create mode 100644 drivers/iommu/iommufd/pasid.c -- 2.34.1

2 years

4
20
0 0

Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

by Willem de Bruijn

On Sun, Oct 8, 2023 at 12:22 AM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote: > > virtio-net have two usage of hashes: one is RSS and another is hash > reporting. Conventionally the hash calculation was done by the VMM. > However, computing the hash after the queue was chosen defeats the > purpose of RSS. > > Another approach is to use eBPF steering program. This approach has > another downside: it cannot report the calculated hash due to the > restrictive nature of eBPF. > > Introduce the code to compute hashes to the kernel in order to overcome > thse challenges. An alternative solution is to extend the eBPF steering > program so that it will be able to report to the userspace, but it makes > little sense to allow to implement different hashing algorithms with > eBPF since the hash value reported by virtio-net is strictly defined by > the specification. > > The hash value already stored in sk_buff is not used and computed > independently since it may have been computed in a way not conformant > with the specification. > > Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com> > @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, > } > > if (vnet_hdr_sz) { > - struct virtio_net_hdr gso; > + union { > + struct virtio_net_hdr hdr; > + struct virtio_net_hdr_v1_hash v1_hash_hdr; > + } hdr; > + int ret; > > if (iov_iter_count(iter) < vnet_hdr_sz) > return -EINVAL; > > - if (virtio_net_hdr_from_skb(skb, &gso, > - tun_is_little_endian(tun), true, > - vlan_hlen)) { > + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && > + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && > + skb->tun_vnet_hash) { Isn't vnet_hdr_sz guaranteed to be >= hdr.v1_hash_hdr, by virtue of the set hash ioctl failing otherwise? Such checks should be limited to control path where possible > + vnet_hdr_content_sz = sizeof(hdr.v1_hash_hdr); > + ret = virtio_net_hdr_v1_hash_from_skb(skb, > + &hdr.v1_hash_hdr, > + true, > + vlan_hlen, > + &vnet_hash); > + } else { > + vnet_hdr_content_sz = sizeof(hdr.hdr); > + ret = virtio_net_hdr_from_skb(skb, &hdr.hdr, > + tun_is_little_endian(tun), > + true, vlan_hlen); > + } > +

2 years

3
8
0 0

Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

by Michael S. Tsirkin

Akihiko Odaki sorry about reposts. Having an email with two "@" in the CC list: xuanzhuo@linux.alibaba.comshuah@kernel.org tripped up mutt's reply-all for me and made it send to you only. I am guessing you meant two addresses: xuanzhuo(a)linux.alibaba.com and shuah(a)kernel.org. On Sun, Oct 08, 2023 at 02:20:49PM +0900, Akihiko Odaki wrote: > virtio-net have two usage of hashes: one is RSS and another is hash > reporting. Conventionally the hash calculation was done by the VMM. > However, computing the hash after the queue was chosen defeats the > purpose of RSS. > > Another approach is to use eBPF steering program. This approach has > another downside: it cannot report the calculated hash due to the > restrictive nature of eBPF. > > Introduce the code to compute hashes to the kernel in order to overcome > thse challenges. An alternative solution is to extend the eBPF steering > program so that it will be able to report to the userspace, but it makes > little sense to allow to implement different hashing algorithms with > eBPF since the hash value reported by virtio-net is strictly defined by > the specification. > > The hash value already stored in sk_buff is not used and computed > independently since it may have been computed in a way not conformant > with the specification. > > Signed-off-by: Akihiko Odaki <akihiko.odaki(a)daynix.com> > --- > drivers/net/tun.c | 187 ++++++++++++++++++++++++++++++++---- > include/uapi/linux/if_tun.h | 16 +++ > 2 files changed, 182 insertions(+), 21 deletions(-) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 89ab9efe522c..561a573cd008 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -171,6 +171,9 @@ struct tun_prog { > struct bpf_prog *prog; > }; > > +#define TUN_VNET_HASH_MAX_KEY_SIZE 40 > +#define TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH 128 > + where do these come from? > /* Since the socket were moved to tun_file, to preserve the behavior of persist > * device, socket filter, sndbuf and vnet header size were restore when the > * file were attached to a persist device. > @@ -209,6 +212,9 @@ struct tun_struct { > struct tun_prog __rcu *steering_prog; > struct tun_prog __rcu *filter_prog; > struct ethtool_link_ksettings link_ksettings; > + struct tun_vnet_hash vnet_hash; > + u16 vnet_hash_indirection_table[TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH]; > + u32 vnet_hash_key[TUN_VNET_HASH_MAX_KEY_SIZE / 4]; That's quite a lot of data to add in this struct, and will be used by a small minority of users. Are you pushing any hot data out of cache with this? Why not allocate these as needed? > /* init args */ > struct file *file; > struct ifreq *ifr; > @@ -219,6 +225,13 @@ struct veth { > __be16 h_vlan_TCI; > }; > > +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = { > + .max_indirection_table_length = > + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH, > + > + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES > +}; > + > static void tun_flow_init(struct tun_struct *tun); > static void tun_flow_uninit(struct tun_struct *tun); > > @@ -320,10 +333,16 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp) > if (get_user(be, argp)) > return -EFAULT; > > - if (be) > + if (be) { > + if (!(tun->flags & TUN_VNET_LE) && > + (tun->vnet_hash.flags & TUN_VNET_HASH_REPORT)) { > + return -EINVAL; > + } > + > tun->flags |= TUN_VNET_BE; > - else > + } else { > tun->flags &= ~TUN_VNET_BE; > + } > > return 0; > } > @@ -558,15 +577,47 @@ static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb) > return ret % numqueues; > } > > +static u16 tun_vnet_select_queue(struct tun_struct *tun, struct sk_buff *skb) > +{ > + u32 value = qdisc_skb_cb(skb)->tun_vnet_hash_value; > + u32 numqueues; > + u32 index; > + u16 queue; > + > + numqueues = READ_ONCE(tun->numqueues); > + if (!numqueues) > + return 0; > + > + index = value & READ_ONCE(tun->vnet_hash.indirection_table_mask); > + queue = READ_ONCE(tun->vnet_hash_indirection_table[index]); > + if (!queue) > + queue = READ_ONCE(tun->vnet_hash.unclassified_queue); Apparently 0 is an illegal queue value? You are making this part of UAPI better document things like this. > + > + return queue % numqueues; > +} > + > static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb, > struct net_device *sb_dev) > { > struct tun_struct *tun = netdev_priv(dev); > + u8 vnet_hash_flags = READ_ONCE(tun->vnet_hash.flags); > + struct virtio_net_hash hash; > u16 ret; > > + if (vnet_hash_flags & (TUN_VNET_HASH_RSS | TUN_VNET_HASH_REPORT)) { > + virtio_net_hash(skb, READ_ONCE(tun->vnet_hash.types), > + tun->vnet_hash_key, &hash); > + What are all these READ_ONCE things doing? E.g. you seem to be very content to have tun->vnet_hash.types read twice apparently? This is volatile which basically breaks all compiler's attempts to optimize code - needs to be used judiciously. > + skb->tun_vnet_hash = true; > + qdisc_skb_cb(skb)->tun_vnet_hash_value = hash.value; > + qdisc_skb_cb(skb)->tun_vnet_hash_report = hash.report; > + } > + > rcu_read_lock(); > if (rcu_dereference(tun->steering_prog)) > ret = tun_ebpf_select_queue(tun, skb); > + else if (vnet_hash_flags & TUN_VNET_HASH_RSS) > + ret = tun_vnet_select_queue(tun, skb); > else > ret = tun_automq_select_queue(tun, skb); > rcu_read_unlock(); > @@ -2088,10 +2139,15 @@ static ssize_t tun_put_user(struct tun_struct *tun, > struct iov_iter *iter) > { > struct tun_pi pi = { 0, skb->protocol }; > + struct virtio_net_hash vnet_hash = { > + .value = qdisc_skb_cb(skb)->tun_vnet_hash_value, > + .report = qdisc_skb_cb(skb)->tun_vnet_hash_report > + }; > ssize_t total; > int vlan_offset = 0; > int vlan_hlen = 0; > int vnet_hdr_sz = 0; > + size_t vnet_hdr_content_sz; > > if (skb_vlan_tag_present(skb)) > vlan_hlen = VLAN_HLEN; > @@ -2116,31 +2172,49 @@ static ssize_t tun_put_user(struct tun_struct *tun, > } > > if (vnet_hdr_sz) { > - struct virtio_net_hdr gso; > + union { > + struct virtio_net_hdr hdr; > + struct virtio_net_hdr_v1_hash v1_hash_hdr; > + } hdr; > + int ret; > > if (iov_iter_count(iter) < vnet_hdr_sz) > return -EINVAL; > > - if (virtio_net_hdr_from_skb(skb, &gso, > - tun_is_little_endian(tun), true, > - vlan_hlen)) { > + if ((READ_ONCE(tun->vnet_hash.flags) & TUN_VNET_HASH_REPORT) && > + vnet_hdr_sz >= sizeof(hdr.v1_hash_hdr) && > + skb->tun_vnet_hash) { > + vnet_hdr_content_sz = sizeof(hdr.v1_hash_hdr); > + ret = virtio_net_hdr_v1_hash_from_skb(skb, > + &hdr.v1_hash_hdr, > + true, > + vlan_hlen, > + &vnet_hash); > + } else { > + vnet_hdr_content_sz = sizeof(hdr.hdr); > + ret = virtio_net_hdr_from_skb(skb, &hdr.hdr, > + tun_is_little_endian(tun), > + true, vlan_hlen); > + } > + > + if (ret) { > struct skb_shared_info *sinfo = skb_shinfo(skb); > pr_err("unexpected GSO type: " > "0x%x, gso_size %d, hdr_len %d\n", > - sinfo->gso_type, tun16_to_cpu(tun, gso.gso_size), > - tun16_to_cpu(tun, gso.hdr_len)); > + sinfo->gso_type, tun16_to_cpu(tun, hdr.hdr.gso_size), > + tun16_to_cpu(tun, hdr.hdr.hdr_len)); > print_hex_dump(KERN_ERR, "tun: ", > DUMP_PREFIX_NONE, > 16, 1, skb->head, > - min((int)tun16_to_cpu(tun, gso.hdr_len), 64), true); > + min((int)tun16_to_cpu(tun, hdr.hdr.hdr_len), 64), true); > WARN_ON_ONCE(1); > return -EINVAL; > } > > - if (copy_to_iter(&gso, sizeof(gso), iter) != sizeof(gso)) > + if (copy_to_iter(&hdr, vnet_hdr_content_sz, iter) != vnet_hdr_content_sz) > return -EFAULT; > > - iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso)); > + iov_iter_advance(iter, vnet_hdr_sz - vnet_hdr_content_sz); > } > > if (vlan_hlen) { > @@ -3007,24 +3081,27 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr) > return ret; > } > > -static int tun_set_ebpf(struct tun_struct *tun, struct tun_prog __rcu **prog_p, > - void __user *data) > +static struct bpf_prog *tun_set_ebpf(struct tun_struct *tun, > + struct tun_prog __rcu **prog_p, > + void __user *data) > { > struct bpf_prog *prog; > int fd; > + int ret; > > if (copy_from_user(&fd, data, sizeof(fd))) > - return -EFAULT; > + return ERR_PTR(-EFAULT); > > if (fd == -1) { > prog = NULL; > } else { > prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER); > if (IS_ERR(prog)) > - return PTR_ERR(prog); > + return prog; > } > > - return __tun_set_ebpf(tun, prog_p, prog); > + ret = __tun_set_ebpf(tun, prog_p, prog); > + return ret ? ERR_PTR(ret) : prog; > } > > /* Return correct value for tun->dev->addr_len based on tun->dev->type. */ > @@ -3082,6 +3159,11 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, > int le; > int ret; > bool do_notify = false; > + struct bpf_prog *bpf_ret; > + struct tun_vnet_hash vnet_hash; > + u16 vnet_hash_indirection_table[TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH]; > + u8 vnet_hash_key[TUN_VNET_HASH_MAX_KEY_SIZE]; > + size_t len; > > if (cmd == TUNSETIFF || cmd == TUNSETQUEUE || > (_IOC_TYPE(cmd) == SOCK_IOC_TYPE && cmd != SIOCGSKNS)) { > @@ -3295,7 +3377,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, > ret = -EFAULT; > break; > } > - if (vnet_hdr_sz < (int)sizeof(struct virtio_net_hdr)) { > + if (vnet_hdr_sz < > + (int)((tun->vnet_hash.flags & TUN_VNET_HASH_REPORT) ? > + sizeof(struct virtio_net_hdr_v1_hash) : > + sizeof(struct virtio_net_hdr))) { > ret = -EINVAL; > break; > } > @@ -3314,10 +3399,16 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, > ret = -EFAULT; > break; > } > - if (le) > + if (le) { > tun->flags |= TUN_VNET_LE; > - else > + } else { > + if (!tun_legacy_is_little_endian(tun)) { > + ret = -EINVAL; > + break; > + } > + > tun->flags &= ~TUN_VNET_LE; > + } > break; > > case TUNGETVNETBE: > @@ -3360,11 +3451,17 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, > break; > > case TUNSETSTEERINGEBPF: > - ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp); > + if (IS_ERR(bpf_ret)) > + ret = PTR_ERR(bpf_ret); > + else if (bpf_ret) > + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS; what is this doing? > break; > > case TUNSETFILTEREBPF: > - ret = tun_set_ebpf(tun, &tun->filter_prog, argp); > + bpf_ret = tun_set_ebpf(tun, &tun->filter_prog, argp); > + if (IS_ERR(bpf_ret)) > + ret = PTR_ERR(bpf_ret); > break; > > case TUNSETCARRIER: > @@ -3382,6 +3479,54 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd, > ret = open_related_ns(&net->ns, get_net_ns); > break; > > + case TUNGETVNETHASHCAP: > + if (copy_to_user(argp, &tun_vnet_hash_cap, > + sizeof(tun_vnet_hash_cap))) > + ret = -EFAULT; > + break; > + > + case TUNSETVNETHASH: > + len = sizeof(vnet_hash); > + if (copy_from_user(&vnet_hash, argp, len)) { > + ret = -EFAULT; > + break; > + } what if flags has some bits set you don't know how to handle? should these be ignored as now or cause a failure? these decisions all affect uapi. > + > + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) && > + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) || > + !tun_is_little_endian(tun))) || > + vnet_hash.indirection_table_mask >= > + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) { > + ret = -EINVAL; > + break; > + } Given this is later used to index within an array one has to be very careful about spectre things here, which this code isn't. > + > + argp = (u8 __user *)argp + len; > + len = (vnet_hash.indirection_table_mask + 1) * 2; comment pointer math tricks like this extensively please. > + if (copy_from_user(vnet_hash_indirection_table, argp, len)) { > + ret = -EFAULT; > + break; > + } > + > + argp = (u8 __user *)argp + len; > + len = virtio_net_hash_key_length(vnet_hash.types); > + > + if (copy_from_user(vnet_hash_key, argp, len)) { > + ret = -EFAULT; > + break; > + } > + > + tun->vnet_hash = vnet_hash; > + memcpy(tun->vnet_hash_indirection_table, > + vnet_hash_indirection_table, > + (vnet_hash.indirection_table_mask + 1) * 2); > + memcpy(tun->vnet_hash_key, vnet_hash_key, len); > + > + if (vnet_hash.flags & TUN_VNET_HASH_RSS) > + __tun_set_ebpf(tun, &tun->steering_prog, NULL); > + > + break; > + > default: > ret = -EINVAL; > break; > diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h > index 287cdc81c939..dc591cd897c8 100644 > --- a/include/uapi/linux/if_tun.h > +++ b/include/uapi/linux/if_tun.h > @@ -61,6 +61,8 @@ > #define TUNSETFILTEREBPF _IOR('T', 225, int) > #define TUNSETCARRIER _IOW('T', 226, int) > #define TUNGETDEVNETNS _IO('T', 227) > +#define TUNGETVNETHASHCAP _IO('T', 228) > +#define TUNSETVNETHASH _IOW('T', 229, unsigned int) > > /* TUNSETIFF ifr flags */ > #define IFF_TUN 0x0001 > @@ -115,4 +117,18 @@ struct tun_filter { > __u8 addr[][ETH_ALEN]; > }; > > +struct tun_vnet_hash_cap { > + __u16 max_indirection_table_length; > + __u32 types; > +}; > + There's hidden padding in this struct - not good, copy will leak kernel info out. > +#define TUN_VNET_HASH_RSS 0x01 > +#define TUN_VNET_HASH_REPORT 0x02 Do you intend to add more flags down the road? How will userspace know what is supported? > +struct tun_vnet_hash { > + __u8 flags; > + __u32 types; > + __u16 indirection_table_mask; > + __u16 unclassified_queue; > +}; > + Padding here too. Best avoided. In any case, document UAPI please. > #endif /* _UAPI__IF_TUN_H */ > -- > 2.42.0

2 years

2
1
0 0

[PATCH v4 0/7] selftests/resctrl: Fixes to failing tests

by Ilpo Järvinen

Fix four issues with resctrl selftests. The signal handling fix became necessary after the mount/umount fixes and the uninitialized member bug was discovered during the review. The other two came up when I ran resctrl selftests across the server fleet in our lab to validate the upcoming CAT test rewrite (the rewrite is not part of this series). These are developed and should apply cleanly at least on top the benchmark cleanup series (might apply cleanly also w/o the benchmark series, I didn't test). v4: - Use func(void) for functions taking no arguments - Correct Fixes tag formatting v3: - Add fix to uninitialized sa_flags - Handle ksft_exit_fail_msg() in per test functions - Make signal handler register fails to also exit - Improve changelogs v2: - Include patch to move _GNU_SOURCE to Makefile to allow normal #include placement - Rework the signal register/unregister into patch to use helpers - Fixed incorrect function parameter description - Use return !!res to avoid confusing implicit boolean conversion - Improve MBA/MBM success bound patch's changelog - Tweak Cc: stable dependencies (make it a chain). Ilpo Järvinen (7): selftests/resctrl: Fix uninitialized .sa_flags selftests/resctrl: Extend signal handler coverage to unmount on receiving signal selftests/resctrl: Remove duplicate feature check from CMT test selftests/resctrl: Move _GNU_SOURCE define into Makefile selftests/resctrl: Refactor feature check to use resource and feature name selftests/resctrl: Fix feature checks selftests/resctrl: Reduce failures due to outliers in MBA/MBM tests tools/testing/selftests/resctrl/Makefile | 2 +- tools/testing/selftests/resctrl/cat_test.c | 8 -- tools/testing/selftests/resctrl/cmt_test.c | 3 - tools/testing/selftests/resctrl/mba_test.c | 2 +- tools/testing/selftests/resctrl/mbm_test.c | 2 +- tools/testing/selftests/resctrl/resctrl.h | 7 +- .../testing/selftests/resctrl/resctrl_tests.c | 82 ++++++++++++------- tools/testing/selftests/resctrl/resctrl_val.c | 26 +++--- tools/testing/selftests/resctrl/resctrlfs.c | 69 ++++++---------- 9 files changed, 97 insertions(+), 104 deletions(-) -- 2.30.2

2 years

3
11
0 0

selftests: cgroup: test_core - Unable to handle kernel NULL pointer dereference at virtual address

by Naresh Kamboju

While running selftests: cgroup: test_kmem on FVP following kernel crash noticed on Linux next 6.6.0-rc4-next-20231006. Reported-by: Linux Kernel Functional Testing <lkft(a)linaro.org> Boot log: [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd0f0] [ 0.000000] Linux version 6.6.0-rc4-next-20231006 (tuxmake@tuxmake) (aarch64-linux-gnu-gcc (Debian 13.2.0-2) 13.2.0, GNU ld (GNU Binutils for Debian) 2.41) #1 SMP PREEMPT @1696592107 [ 0.000000] KASLR enabled [ 0.000000] Machine model: FVP Base RevC ... Running selftests: cgroup # selftests: cgroup: test_kmem # ok 1 test_kmem_basic # not ok 2 selftests: cgroup: test_kmem # TIMEOUT 45 seconds # timeout set to 45 # selftests: cgroup: test_core # ok 1 test_cgcore_internal_process_constraint # ok 2 test_cgcore_top_down_constraint_enable # ok 3 test_cgcore_top_down_constraint_disable # ok 4 test_cgcore_no_internal_process_constraint_on_threads # ok 5 test_cgcore_parent_becomes_threaded # ok 6 test_cgcore_invalid_domain # ok 7 test_cgcore_populated # ok 8 test_cgcore_proc_migration # ok 9 test_cgcore_thread_migration # ok 10 test_cgcore_destroy # ok 11 test_cgcore_lesser_euid_open # ok 12 test_cgcore_lesser_ns_open [ 400.108176] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 [ 400.108404] Mem abort info: [ 400.108523] ESR = 0x0000000096000004 [ 400.108656] EC = 0x25: DABT (current EL), IL = 32 bits [ 400.108810] SET = 0, FnV = 0 [ 400.108942] EA = 0, S1PTW = 0 [ 400.109074] FSC = 0x04: level 0 translation fault [ 400.109219] Data abort info: [ 400.109338] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 [ 400.109488] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 ok 3 selftests: cgroup: test_core [ 400.109644] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 400.109802] user pgtable: 4k pages, 48-bit VAs, pgdp=00000008898f3000 [ 400.109969] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ 400.110267] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP [ 400.110372] Modules linked in: pl111_drm drm_dma_helper arm_spe_pmu panel_simple crct10dif_ce drm_kms_helper fuse drm backlight dm_mod ip_tables x_tables [ 400.110872] CPU: 4 PID: 131 Comm: kworker/4:2 Not tainted 6.6.0-rc4-next-20231006 #1 [ 400.111010] Hardware name: FVP Base RevC (DT) [ 400.111093] Workqueue: cgroup_destroy css_free_rwork_fn [ 400.111238] pstate: 03402009 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 400.111380] pc : percpu_ref_put_many.constprop.0+0xa0/0xf0 [ 400.111540] lr : percpu_ref_put_many.constprop.0+0x18/0xf0 [ 400.111700] sp : ffff800080713ca0 [ 400.111774] x29: ffff800080713ca0 x28: 0000000000000000 x27: 0000000000000000 [ 400.111970] x26: ffff00087f779d28 x25: ffff000800a3f700 x24: ffff0008003c2205 [ 400.112173] x23: 0000000000000036 x22: ffffd7c64df6a000 x21: ffffd7c64df6cb70 [ 400.112373] x20: ffff0008094d2000 x19: ffff000806dfa4c0 x18: ffff800083893c48 [ 400.112575] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001 [ 400.112765] x14: 0000000000000004 x13: ffffd7c64df87258 x12: 0000000000000000 [ 400.112964] x11: ffff000800402e60 x10: ffff000800402da0 x9 : ffffd7c64b786a90 [ 400.113166] x8 : ffff800080713b68 x7 : 0000000000000000 x6 : 0000000000000001 [ 400.113360] x5 : ffffd7c64df6a000 x4 : ffffd7c64df6a288 x3 : 0000000000000000 [ 400.113558] x2 : ffff0008044e0000 x1 : 0000000000000000 x0 : ffffffffffffffff [ 400.113756] Call trace: [ 400.113819] percpu_ref_put_many.constprop.0+0xa0/0xf0 [ 400.113980] __mem_cgroup_free+0x2c/0xe8 [ 400.114129] mem_cgroup_css_free+0x16c/0x1e8 [ 400.114281] css_free_rwork_fn+0x54/0x370 [ 400.114408] process_one_work+0x148/0x3b8 [ 400.114530] worker_thread+0x32c/0x450 [ 400.114650] kthread+0x104/0x118 [ 400.114797] ret_from_fork+0x10/0x20 [ 400.114954] Code: d65f03c0 f9400661 d503201f 92800000 (f8e00020) [ 400.115051] ---[ end trace 0000000000000000 ]--- Links: - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20231006/te… - https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20231006/te… - https://tuxapi.tuxsuite.com/v1/groups/linaro/projects/lkft/tests/2WO7SlYRh8… - https://storage.tuxsuite.com/public/linaro/lkft/builds/2WO7RIllBsiwSAbiLChz… - https://storage.tuxsuite.com/public/linaro/lkft/builds/2WO7RIllBsiwSAbiLChz… https://storage.tuxsuite.com/public/linaro/lkft/builds/2WO7RIllBsiwSAbiLChz… -- Linaro LKFT https://lkft.linaro.org

2 years

2
3
0 0

[RFC PATCH v2 00/20] context_tracking,x86: Defer some IPIs until a user->kernel transition

by Valentin Schneider

Context ======= We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a pure-userspace application get regularly interrupted by IPIs sent from housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs leading to various on_each_cpu() calls, e.g.: 64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush) smp_call_function_many_cond+0x1 smp_call_function+0x39 on_each_cpu+0x2a flush_tlb_kernel_range+0x7b __purge_vmap_area_lazy+0x70 _vm_unmap_aliases.part.42+0xdf change_page_attr_set_clr+0x16a set_memory_ro+0x26 bpf_int_jit_compile+0x2f9 bpf_prog_select_runtime+0xc6 bpf_prepare_filter+0x523 sk_attach_filter+0x13 sock_setsockopt+0x92c __sys_setsockopt+0x16a __x64_sys_setsockopt+0x20 do_syscall_64+0x87 entry_SYSCALL_64_after_hwframe+0x65 The heart of this series is the thought that while we cannot remove NOHZ_FULL CPUs from the list of CPUs targeted by these IPIs, they may not have to execute the callbacks immediately. Anything that only affects kernelspace can wait until the next user->kernel transition, providing it can be executed "early enough" in the entry code. The original implementation is from Peter [1]. Nicolas then added kernel TLB invalidation deferral to that [2], and I picked it up from there. Deferral approach ================= Storing each and every callback, like a secondary call_single_queue turned out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in userspace for as long as possible - no signal of any form would be sent when deferring an IPI. This means that any form of queuing for deferred callbacks would end up as a convoluted memory leak. Deferred IPIs must thus be coalesced, which this series achieves by assigning IPIs a "type" and having a mapping of IPI type to callback, leveraged upon kernel entry. What about IPIs whose callback take a parameter, you may ask? Peter suggested during OSPM23 [3] that since on_each_cpu() targets housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or housekeeping-CPU-local state to "reconstruct" the data that would have been sent via the IPI. This series does not affect any IPI callback that requires an argument, but the approach would remain the same (one coalescable callback executed on kernel entry). Kernel entry vs execution of the deferred operation =================================================== There is a non-zero length of code that is executed upon kernel entry before the deferred operation can be itself executed (i.e. before we start getting into context_tracking.c proper). This means one must take extra care to what can happen in the early entry code, and that <bad things> cannot happen. For instance, we really don't want to hit instructions that have been modified by a remote text_poke() while we're on our way to execute a deferred sync_core(). Patches ======= o Patches 1-9 have been submitted separately and are included for the sake of testing o Patches 10-14 focus on having objtool detect problematic static key usage in early entry o Patch 15 adds the infrastructure for IPI deferral. o Patches 16-17 add some RCU testing infrastructure o Patch 18 adds text_poke() IPI deferral. o Patches 19-20 add vunmap() flush_tlb_kernel_range() IPI deferral These ones I'm a lot less confident about, mostly due to lacking instrumentation/verification. The actual deferred callback is also incomplete as it's not properly noinstr: vmlinux.o: warning: objtool: __flush_tlb_all_noinstr+0x19: call to native_write_cr4() leaves .noinstr.text section and it doesn't support PARAVIRT - it's going to need a pv_ops.mmu entry, but I have *no idea* what a sane implementation would be for Xen so I haven't touched that yet. Patches are also available at: https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v2 Testing ======= Note: this is a different machine than used for v1, because that machine decided to act difficult. Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs. RHEL9 userspace. Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is: $ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \ -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \ -e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \ rteval --onlyload --loads-cpulist=$HK_CPUS \ --hackbench-runlowmem=True --duration=$DURATION This only records IPIs sent to isolated CPUs, so any event there is interference (with a bit of fuzz at the start/end of the workload when spawning the processes). All tests were done with a duration of 30 minutes. v6.5-rc1 (+ cpumask filtering patches): # This is the actual IPI count $ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr 338 callback=generic_smp_call_function_single_interrupt+0x0 # These are the different CSD's that caused IPIs $ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr 9207 func=do_flush_tlb_all 1116 func=do_sync_core 62 func=do_kernel_range_flush 3 func=nohz_full_kick_func v6.5-rc1 + patches: # This is the actual IPI count $ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr 2 callback=generic_smp_call_function_single_interrupt+0x0 # These are the different CSD's that caused IPIs $ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr 2 func=nohz_full_kick_func The incriminating IPIs are all gone, but note that on the machine I used to test v1 there were still some do_flush_tlb_all() IPIs caused by pcpu_balance_workfn(), since only vmalloc is affected by the deferral mechanism. Acknowledgements ================ Special thanks to: o Clark Williams for listening to my ramblings about this and throwing ideas my way o Josh Poimboeuf for his guidance regarding objtool and hinting at the .data..ro_after_init section. Links ===== [1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/ [2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip [3]: https://youtu.be/0vjE6fjoVVE Revisions ========= RFCv1 -> RFCv2 ++++++++++++++ o Rebased onto v6.5-rc1 o Updated the trace filter patches (Steven) o Fixed __ro_after_init keys used in modules (Peter) o Dropped the extra context_tracking atomic, squashed the new bits in the existing .state field (Peter, Frederic) o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an rcutorture case for a low-size counter (Paul) The new TREE11 case with a 2-bit dynticks counter seems to pass when ran against this series. o Fixed flush_tlb_kernel_range_deferrable() definition Peter Zijlstra (1): jump_label,module: Don't alloc static_key_mod for __ro_after_init keys Valentin Schneider (19): tracing/filters: Dynamically allocate filter_pred.regex tracing/filters: Enable filtering a cpumask field by another cpumask tracing/filters: Enable filtering a scalar field by a cpumask tracing/filters: Enable filtering the CPU common field by a cpumask tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU tracing/filters: Further optimise scalar vs cpumask comparison tracing/filters: Document cpumask filtering objtool: Flesh out warning related to pv_ops[] calls objtool: Warn about non __ro_after_init static key usage in .noinstr context_tracking: Make context_tracking_key __ro_after_init x86/kvm: Make kvm_async_pf_enabled __ro_after_init context-tracking: Introduce work deferral infrastructure rcu: Make RCU dynticks counter size configurable rcutorture: Add a test config to torture test low RCU_DYNTICKS width context_tracking,x86: Defer kernel text patching IPIs context_tracking,x86: Add infrastructure to defer kernel TLBI x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs Documentation/trace/events.rst | 14 + arch/Kconfig | 9 + arch/x86/Kconfig | 1 + arch/x86/include/asm/context_tracking_work.h | 20 ++ arch/x86/include/asm/text-patching.h | 1 + arch/x86/include/asm/tlbflush.h | 2 + arch/x86/kernel/alternative.c | 24 +- arch/x86/kernel/kprobes/core.c | 4 +- arch/x86/kernel/kprobes/opt.c | 4 +- arch/x86/kernel/kvm.c | 2 +- arch/x86/kernel/module.c | 2 +- arch/x86/mm/tlb.c | 40 ++- include/asm-generic/sections.h | 5 + include/linux/context_tracking.h | 26 ++ include/linux/context_tracking_state.h | 65 +++- include/linux/context_tracking_work.h | 28 ++ include/linux/jump_label.h | 1 + include/linux/trace_events.h | 1 + init/main.c | 1 + kernel/context_tracking.c | 53 ++- kernel/jump_label.c | 49 +++ kernel/rcu/Kconfig | 33 ++ kernel/time/Kconfig | 5 + kernel/trace/trace_events_filter.c | 302 ++++++++++++++++-- mm/vmalloc.c | 19 +- tools/objtool/check.c | 22 +- tools/objtool/include/objtool/check.h | 1 + tools/objtool/include/objtool/special.h | 2 + tools/objtool/special.c | 3 + .../selftests/rcutorture/configs/rcu/TREE11 | 19 ++ .../rcutorture/configs/rcu/TREE11.boot | 1 + 31 files changed, 695 insertions(+), 64 deletions(-) create mode 100644 arch/x86/include/asm/context_tracking_work.h create mode 100644 include/linux/context_tracking_work.h create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot -- 2.31.1

2 years

12
75
0 0

[PATCH] Documentation: kselftests: Remove references to bpf tests

by Marcos Paulo de Souza

Currently the bpf selftests are skipped by default, so is someone would like to run the tests one would need to run: $ make TARGETS=bpf SKIP_TARGETS="" kselftest To overwrite the SKIP_TARGETS that defines bpf by default. Also, following the BPF instructions[1], to run the bpf selftests one would need to enter in the tools/testing/selftests/bpf/ directory, and then run make, which is not the standard way to run selftests per it's documentation. For the reasons above stop mentioning bpf in the kselftests as examples of how to run a test suite. [1]: Documentation/bpf/bpf_devel_QA.rst Signed-off-by: Marcos Paulo de Souza <mpdesouza(a)suse.com> --- Documentation/dev-tools/kselftest.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Documentation/dev-tools/kselftest.rst b/Documentation/dev-tools/kselftest.rst index deede972f254..ab376b316c36 100644 --- a/Documentation/dev-tools/kselftest.rst +++ b/Documentation/dev-tools/kselftest.rst @@ -112,7 +112,7 @@ You can specify multiple tests to skip:: You can also specify a restricted list of tests to run together with a dedicated skiplist:: - $ make TARGETS="bpf breakpoints size timers" SKIP_TARGETS=bpf kselftest + $ make TARGETS="breakpoints size timers" SKIP_TARGETS=size kselftest See the top-level tools/testing/selftests/Makefile for the list of all possible targets. @@ -165,7 +165,7 @@ To see the list of available tests, the `-l` option can be used:: The `-c` option can be used to run all the tests from a test collection, or the `-t` option for specific single tests. Either can be used multiple times:: - $ ./run_kselftest.sh -c bpf -c seccomp -t timers:posix_timers -t timer:nanosleep + $ ./run_kselftest.sh -c size -c seccomp -t timers:posix_timers -t timer:nanosleep For other features see the script usage output, seen with the `-h` option. @@ -210,7 +210,7 @@ option is supported, such as:: tests by using variables specified in `Running a subset of selftests`_ section:: - $ make -C tools/testing/selftests gen_tar TARGETS="bpf" FORMAT=.xz + $ make -C tools/testing/selftests gen_tar TARGETS="size" FORMAT=.xz .. _tar's auto-compress: https://www.gnu.org/software/tar/manual/html_node/gzip.html#auto_002dcompre… -- 2.42.0

2 years

1
0
0 0

[PATCH v6 00/38] arm64/gcs: Provide support for GCS in userspace

by Mark Brown

The arm64 Guarded Control Stack (GCS) feature provides support for hardware protected stacks of return addresses, intended to provide hardening against return oriented programming (ROP) attacks and to make it easier to gather call stacks for applications such as profiling. When GCS is active a secondary stack called the Guarded Control Stack is maintained, protected with a memory attribute which means that it can only be written with specific GCS operations. The current GCS pointer can not be directly written to by userspace. When a BL is executed the value stored in LR is also pushed onto the GCS, and when a RET is executed the top of the GCS is popped and compared to LR with a fault being raised if the values do not match. GCS operations may only be performed on GCS pages, a data abort is generated if they are not. The combination of hardware enforcement and lack of extra instructions in the function entry and exit paths should result in something which has less overhead and is more difficult to attack than a purely software implementation like clang's shadow stacks. This series implements support for use of GCS by userspace, along with support for use of GCS within KVM guests. It does not enable use of GCS by either EL1 or EL2, this will be implemented separately. Executables are started without GCS and must use a prctl() to enable it, it is expected that this will be done very early in application execution by the dynamic linker or other startup code. For dynamic linking this will be done by checking that everything in the executable is marked as GCS compatible. x86 has an equivalent feature called shadow stacks, this series depends on the x86 patches for generic memory management support for the new guarded/shadow stack page type and shares APIs as much as possible. As there has been extensive discussion with the wider community around the ABI for shadow stacks I have as far as practical kept implementation decisions close to those for x86, anticipating that review would lead to similar conclusions in the absence of strong reasoning for divergence. The main divergence I am concious of is that x86 allows shadow stack to be enabled and disabled repeatedly, freeing the shadow stack for the thread whenever disabled, while this implementation keeps the GCS allocated after disable but refuses to reenable it. This is to avoid races with things actively walking the GCS during a disable, we do anticipate that some systems will wish to disable GCS at runtime but are not aware of any demand for subsequently reenabling it. x86 uses an arch_prctl() to manage enable and disable, since only x86 and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a patch set for the equivalent RISC-V zisslpcfi feature which I initially adopted fairly directly but following review feedback has been revised quite a bit. There is an open issue with support for CRIU, on x86 this required the ability to set the GCS mode via ptrace. This series supports configuring mode bits other than enable/disable via ptrace but it needs to be confirmed if this is sufficient. There's a few bits where I'm not convinced with where I've placed things, in particular the GCS write operation is in the GCS header not in uaccess.h, I wasn't sure what was clearest there and am probably too close to the code to have a clear opinion. The reporting of GCS in /proc/PID/smaps is also a bit awkward. The series depends on the x86 shadow stack support: https://lore.kernel.org/lkml/20230227222957.24501-1-rick.p.edgecombe@intel.… I've rebased this onto v6.5-rc4 but not included it in the series in order to avoid confusion with Rick's work and cut down the size of the series, you can see the branch at: https://git.kernel.org/pub/scm/linux/kernel/git/broonie/misc.git arm64-gcs [1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/ Pending feedback from Catalin: - Use clone3() paramaters to size/place the GCS. - Switch copy_to_user_gcs() to be put_user_gcs(). Signed-off-by: Mark Brown <broonie(a)kernel.org> --- Changes in v6: - Rebase onto v6.6-rc3. - Add some more gcsb_dsync() barriers following spec clarifications. - Due to ongoing discussion around clone()/clone3() I've not updated anything there, the behaviour is the same as on previous versions. - Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org Changes in v5: - Don't map any permissions for user GCSs, we always use EL0 accessors or use a separate mapping of the page. - Reduce the standard size of the GCS to RLIMIT_STACK/2. - Enforce a PAGE_SIZE alignment requirement on map_shadow_stack(). - Clarifications and fixes to documentation. - More tests. - Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org Changes in v4: - Implement flags for map_shadow_stack() allowing the cap and end of stack marker to be enabled independently or not at all. - Relax size and alignment requirements for map_shadow_stack(). - Add more blurb explaining the advantages of hardware enforcement. - Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org Changes in v3: - Rebase onto v6.5-rc4. - Add a GCS barrier on context switch. - Add a GCS stress test. - Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org Changes in v2: - Rebase onto v6.5-rc3. - Rework prctl() interface to allow each bit to be locked independently. - map_shadow_stack() now places the cap token based on the size requested by the caller not the actual space allocated. - Mode changes other than enable via ptrace are now supported. - Expand test coverage. - Various smaller fixes and adjustments. - Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org --- Mark Brown (38): arm64/mm: Restructure arch_validate_flags() for extensibility prctl: arch-agnostic prctl for shadow stack mman: Add map_shadow_stack() flags arm64: Document boot requirements for Guarded Control Stacks arm64/gcs: Document the ABI for Guarded Control Stacks arm64/sysreg: Add new system registers for GCS arm64/sysreg: Add definitions for architected GCS caps arm64/gcs: Add manual encodings of GCS instructions arm64/gcs: Provide copy_to_user_gcs() arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS) arm64/mm: Allocate PIE slots for EL0 guarded control stack mm: Define VM_SHADOW_STACK for arm64 when we support GCS arm64/mm: Map pages for guarded control stack KVM: arm64: Manage GCS registers for guests arm64/gcs: Allow GCS usage at EL0 and EL1 arm64/idreg: Add overrride for GCS arm64/hwcap: Add hwcap for GCS arm64/traps: Handle GCS exceptions arm64/mm: Handle GCS data aborts arm64/gcs: Context switch GCS state for EL0 arm64/gcs: Allocate a new GCS for threads with GCS enabled arm64/gcs: Implement shadow stack prctl() interface arm64/mm: Implement map_shadow_stack() arm64/signal: Set up and restore the GCS context for signal handlers arm64/signal: Expose GCS state in signal frames arm64/ptrace: Expose GCS via ptrace and core files arm64: Add Kconfig for Guarded Control Stack (GCS) kselftest/arm64: Verify the GCS hwcap kselftest/arm64: Add GCS as a detected feature in the signal tests kselftest/arm64: Add framework support for GCS to signal handling tests kselftest/arm64: Allow signals tests to specify an expected si_code kselftest/arm64: Always run signals tests with GCS enabled kselftest/arm64: Add very basic GCS test program kselftest/arm64: Add a GCS test program built with the system libc kselftest/arm64: Add test coverage for GCS mode locking selftests/arm64: Add GCS signal tests kselftest/arm64: Add a GCS stress test kselftest/arm64: Enable GCS for the FP stress tests Documentation/admin-guide/kernel-parameters.txt | 6 + Documentation/arch/arm64/booting.rst | 22 + Documentation/arch/arm64/elf_hwcaps.rst | 3 + Documentation/arch/arm64/gcs.rst | 233 +++++++ Documentation/arch/arm64/index.rst | 1 + Documentation/filesystems/proc.rst | 2 +- arch/arm64/Kconfig | 19 + arch/arm64/include/asm/cpufeature.h | 6 + arch/arm64/include/asm/el2_setup.h | 17 + arch/arm64/include/asm/esr.h | 28 +- arch/arm64/include/asm/exception.h | 2 + arch/arm64/include/asm/gcs.h | 106 +++ arch/arm64/include/asm/hwcap.h | 1 + arch/arm64/include/asm/kvm_arm.h | 4 +- arch/arm64/include/asm/kvm_host.h | 12 + arch/arm64/include/asm/mman.h | 23 +- arch/arm64/include/asm/pgtable-prot.h | 14 +- arch/arm64/include/asm/processor.h | 7 + arch/arm64/include/asm/sysreg.h | 20 + arch/arm64/include/asm/uaccess.h | 42 ++ arch/arm64/include/uapi/asm/hwcap.h | 1 + arch/arm64/include/uapi/asm/ptrace.h | 8 + arch/arm64/include/uapi/asm/sigcontext.h | 9 + arch/arm64/kernel/cpufeature.c | 19 + arch/arm64/kernel/cpuinfo.c | 1 + arch/arm64/kernel/entry-common.c | 23 + arch/arm64/kernel/idreg-override.c | 2 + arch/arm64/kernel/process.c | 92 +++ arch/arm64/kernel/ptrace.c | 59 ++ arch/arm64/kernel/signal.c | 237 ++++++- arch/arm64/kernel/traps.c | 11 + arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 17 + arch/arm64/kvm/sys_regs.c | 22 + arch/arm64/mm/Makefile | 1 + arch/arm64/mm/fault.c | 79 ++- arch/arm64/mm/gcs.c | 228 +++++++ arch/arm64/mm/mmap.c | 13 +- arch/arm64/tools/cpucaps | 1 + arch/arm64/tools/sysreg | 55 ++ arch/x86/include/uapi/asm/mman.h | 3 - fs/proc/task_mmu.c | 3 + include/linux/mm.h | 16 +- include/uapi/asm-generic/mman.h | 4 + include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/elf.h | 1 + include/uapi/linux/prctl.h | 22 + kernel/sys.c | 30 + tools/testing/selftests/arm64/Makefile | 2 +- tools/testing/selftests/arm64/abi/hwcap.c | 19 + tools/testing/selftests/arm64/fp/assembler.h | 15 + tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 + tools/testing/selftests/arm64/fp/sve-test.S | 2 + tools/testing/selftests/arm64/fp/za-test.S | 2 + tools/testing/selftests/arm64/fp/zt-test.S | 2 + tools/testing/selftests/arm64/gcs/.gitignore | 5 + tools/testing/selftests/arm64/gcs/Makefile | 24 + tools/testing/selftests/arm64/gcs/asm-offsets.h | 0 tools/testing/selftests/arm64/gcs/basic-gcs.c | 356 ++++++++++ tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++ .../selftests/arm64/gcs/gcs-stress-thread.S | 311 +++++++++ tools/testing/selftests/arm64/gcs/gcs-stress.c | 532 +++++++++++++++ tools/testing/selftests/arm64/gcs/gcs-util.h | 100 +++ tools/testing/selftests/arm64/gcs/libc-gcs.c | 742 +++++++++++++++++++++ tools/testing/selftests/arm64/signal/.gitignore | 1 + .../testing/selftests/arm64/signal/test_signals.c | 17 +- .../testing/selftests/arm64/signal/test_signals.h | 6 + .../selftests/arm64/signal/test_signals_utils.c | 32 +- .../selftests/arm64/signal/test_signals_utils.h | 39 ++ .../arm64/signal/testcases/gcs_exception_fault.c | 59 ++ .../selftests/arm64/signal/testcases/gcs_frame.c | 78 +++ .../arm64/signal/testcases/gcs_write_fault.c | 67 ++ .../selftests/arm64/signal/testcases/testcases.c | 7 + .../selftests/arm64/signal/testcases/testcases.h | 1 + 73 files changed, 4110 insertions(+), 41 deletions(-) --- base-commit: 6465e260f48790807eef06b583b38ca9789b6072 change-id: 20230303-arm64-gcs-e311ab0d8729 Best regards, -- Mark Brown <broonie(a)kernel.org>

2 years

1
38
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-kselftest-mirror October 2023