*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give up on using uffd_wp_range() and write new helpers; flush the TLB only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the Windows
GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses of
the pages that have been written to in a region of virtual memory.
This syscall is used by Windows applications and games. It is currently
emulated in userspace in a rather slow manner. Our purpose is to
enhance the kernel so that the emulation becomes efficient.
Some out-of-tree hack patches are currently used to emulate it efficiently
in some kernels. We intend to replace those with these patches
so that gaming on Linux as a whole can benefit. This means there would be
many users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we felt that the kernel's soft-dirty
feature could be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use an ioctl on the pagemap file to read and/or reset the
soft-dirty flag. But with the soft-dirty flag, we sometimes get extra pages
which weren't even written. They had become soft-dirty because of VMA merging
and the VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We
were able to bypass this shortcoming by ignoring VM_SOFTDIRTY until David
reported that mprotect etc. messes up the soft-dirty flag while ignoring
VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
discussed whether we could revert these patches, but could not reach any
conclusion. So at this point, I made a couple of attempts to solve this whole
VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and merges.
The reply I got was not to increase the size of the VMA by 8 bytes.
At this point, we abandoned soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where the
kernel resolves the faults itself when the WP_ASYNC feature is used. It was
straightforward to add the WP_ASYNC feature to userfaultfd. Now the only
pages reported as dirty or written-to are those which have really been
written. (PS: another userfaultfd feature, WP_UNPOPULATED, is also required;
it is needed to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU devs to make the
interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As the
kernel already has a soft-dirty feature, which we have given up on
using, we use the written-to terminology while using UFFD async WP under
the hood.
The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information on whether the pages have been written-to
(PAGE_IS_WRITTEN), are file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT)
or swapped (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace,
but the available approaches aren't efficient:
- mprotect() and a SIGSEGV handler for bookkeeping
- userfaultfd wp (synchronous) with a handler for bookkeeping
Some benchmarks can be seen here [1]. This series adds features that weren't
present earlier:
- The kernel has no atomic operation to get the soft-dirty/written-to status
and clear it.
- The pages which have been written-to cannot be found accurately.
(The kernel's soft-dirty PTE bit + soft-dirty VMA bit show more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have a use case where we need to track the soft-dirty PTE bit for
only specific pages on demand. We need this mechanism to track and clear
a region of memory while the process is running, to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of the soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. The feature is too delicate and inaccurate, as it shows more
soft-dirty pages than there actually are. There is no interest in
correcting it [2][3] as this is how the feature was written years ago,
and its behaviour shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
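As a rough illustration of the "!PM_UFFD_WP" reading mentioned above, the following sketch decodes 64-bit pagemap entries. The bit positions are the ones documented in Documentation/admin-guide/mm/pagemap.rst; the function names are hypothetical and not part of this series. The result is only meaningful for pages in a range that was write-protected beforehand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions from Documentation/admin-guide/mm/pagemap.rst. */
#define PM_SOFT_DIRTY	(1ULL << 55)	/* PTE is soft-dirty */
#define PM_UFFD_WP	(1ULL << 57)	/* PTE is uffd write-protected */
#define PM_PRESENT	(1ULL << 63)	/* page is present in RAM */

/* Byte offset of the 64-bit pagemap entry that describes vaddr. */
uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	assert(page_size != 0);
	return (vaddr / page_size) * sizeof(uint64_t);
}

/*
 * With async uffd-wp the kernel drops the uffd-wp bit when it resolves a
 * write fault, so a previously protected page with the bit clear has been
 * written to. (Assumes the page was write-protected before.)
 */
bool page_written(uint64_t entry)
{
	return !(entry & PM_UFFD_WP);
}

/* Hypothetical helper: count written-to pages in an array of entries. */
size_t count_written(const uint64_t *entries, size_t n)
{
	size_t written = 0;

	for (size_t i = 0; i < n; i++)
		if (page_written(entries[i]))
			written++;
	return written;
}
```

A caller would pread() the entries from /proc/<pid>/pagemap at pagemap_offset() and pass them to count_written(); doing this per page is the cost that the in-kernel scan is meant to avoid.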
The information about whether a page is file mapped, present or
swapped is required for the CRIU project [5][6]. The required mask,
any mask, excluded mask and return mask were also added
for the CRIU project [5].
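One way to read the four masks is sketched below. This is an assumption about the intended semantics, not code from the series: the struct, the function names and the numeric bit values are all hypothetical, and only the PAGE_IS_* names come from this cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag names from the cover letter; the bit values are assumptions. */
#define PAGE_IS_WRITTEN	(1ULL << 0)
#define PAGE_IS_FILE	(1ULL << 1)
#define PAGE_IS_PRESENT	(1ULL << 2)
#define PAGE_IS_SWAPPED	(1ULL << 3)

/* Hypothetical container for the four masks. */
struct scan_masks {
	uint64_t required;	/* every bit here must be set */
	uint64_t anyof;		/* if non-zero, at least one bit must be set */
	uint64_t excluded;	/* no bit here may be set */
	uint64_t ret;		/* bits reported back for a matching page */
};

/* Decide whether a page with the given flags is of interest. */
bool page_matches(uint64_t flags, const struct scan_masks *m)
{
	assert(m != NULL);
	if ((flags & m->required) != m->required)
		return false;
	if (m->anyof && !(flags & m->anyof))
		return false;
	if (flags & m->excluded)
		return false;
	return true;
}

/* Flags that would be reported to userspace for a matching page. */
uint64_t reported_flags(uint64_t flags, const struct scan_masks *m)
{
	assert(m != NULL);
	return flags & m->ret;
}
```

For example, required = PAGE_IS_WRITTEN with excluded = PAGE_IS_FILE would select written-to pages that are not file mapped.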
The IOCTL returns the addresses of the pages which match the specified
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where the user only wants
to get a specific number of pages, so there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages is
optional. If max_pages is specified, it must be equal to or greater than
vec_size. This restriction is needed to handle the worst case, where each
page_region contains info about only one page and cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
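The compaction and the two limits can be pictured with a small userspace model. This is only a sketch under assumptions: the struct layout, the function name and the exact stop conditions are hypothetical and may differ from the kernel implementation in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: a run of consecutive pages sharing the same flags. */
struct page_region {
	uint64_t start;	/* address of the first page in the run */
	uint64_t len;	/* number of pages in the run */
	uint64_t flags;	/* flags common to every page in the run */
};

/*
 * Compact per-page flags into regions. Pages whose flags are 0 are treated
 * as not matching. Stops when vec is full or when max_pages pages have been
 * reported (max_pages == 0 means no limit). Returns the regions filled.
 */
size_t compact_pages(const uint64_t *flags, size_t npages, uint64_t base,
		     uint64_t page_size, struct page_region *vec,
		     size_t vec_len, uint64_t max_pages)
{
	size_t nregions = 0;
	uint64_t found = 0;

	assert(page_size != 0);
	for (size_t i = 0; i < npages; i++) {
		struct page_region *last = nregions ? &vec[nregions - 1] : NULL;
		uint64_t addr = base + i * page_size;

		if (!flags[i])
			continue;
		if (max_pages && found == max_pages)
			break;
		/* Extend the previous region if this page is adjacent and alike. */
		if (last && last->flags == flags[i] &&
		    last->start + last->len * page_size == addr) {
			last->len++;
		} else {
			if (nregions == vec_len)
				break;
			vec[nregions++] = (struct page_region){
				.start = addr, .len = 1, .flags = flags[i],
			};
		}
		found++;
	}
	return nregions;
}
```

When every other page matches, no two pages can share a region, so each reported page consumes one vector slot; this is the worst case that the restriction on max_pages and vec_size refers to.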
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 560 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 54 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 54 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1464 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2329 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
Erdem Aktas wrote:
> On Mon, Jun 12, 2023 at 12:03 PM Dan Williams <dan.j.williams(a)intel.com>
> wrote:
>
> > [ add David, Brijesh, and Atish]
> >
> > Kuppuswamy Sathyanarayanan wrote:
> > > In TDX guest, the second stage of the attestation process is Quote
> > > generation. This process is required to convert the locally generated
> > > TDREPORT into a remotely verifiable Quote. It involves sending the
> > > TDREPORT data to a Quoting Enclave (QE) which will verify the
> > > integrity of the TDREPORT and sign it with an attestation key.
> > >
> > > Intel's TDX attestation driver exposes TDX_CMD_GET_QUOTE IOCTL to
> > > allow the user agent to get the TD Quote.
> > >
> > > Add a kernel selftest module to verify the Quote generation feature.
> > >
> > > TD Quote generation involves following steps:
> > >
> > > * Get the TDREPORT data using TDX_CMD_GET_REPORT IOCTL.
> > > * Embed the TDREPORT data in quote buffer and request for quote
> > > generation via TDX_CMD_GET_QUOTE IOCTL request.
> > > * Upon completion of the GetQuote request, check for non zero value
> > > in the status field of Quote header to make sure the generated
> > > quote is valid.
> >
> > What this cover letter does not say is that this is adding another
> > instance of the similar pattern as SNP_GET_REPORT.
> >
> > Linux is best served when multiple vendors trying to do similar
> > operations are brought together behind a common ABI. We see this in the
> > history of wrangling SCSI vendors behind common interfaces.
>
> Compared to the number of SCSI vendors, I think the number of CPU vendors
> for confidential computing seems manageable to me. Is this really a good
> comparison?
Fair enough, and prompted by this I talk a bit more about the
motivations and benefits of a Keys abstraction for attestation here:
https://lore.kernel.org/all/64961c3baf8ce_142af829436@dwillia2-xfh.jf.intel…
> > Now multiple
> > confidential computing vendors trying to develop similar flows with
> > differentiated formats where that differentiation need not leak over the
> > ABI boundary.
> >
>
> <Just my personal opinion below>
> I agree with this statement in the high level but it is also somehow
> surprising for me after all the discussion happened around this topic.
> Honestly, I feel like there are multiple versions of "Intel" working in
> different directions.
This proposal was sent while firmly wearing my Linux community hat. I
agree, the timing here is unfortunate.
> If we want multiple vendors trying to do the similar things behind a common
> ABI, it should start with the spec. Since this comment is coming from
> Intel, I wonder if there is any plan to combine the GHCB and GHCI
> interfaces under common ABI in the future or why it did not even happen in
> the first place.
Per above comment about firmly wearing my Linux hat I am coming at this
purely from the perspective of what do we do now as a community that
continues to see these implementations proliferate and grow more
features. Common specs are great, but I agree with you, it is too late
for that, but I hope that as Linux asserts "this is what it should look
like" it starts to influence future IP innovation, and attestation
service providers, to accommodate the kernel's ABI momentum.
> What I see is that Intel has GETQUOTE TDVMCALL interface in its spec and
> again Intel does not really want to provide support for it in linux. It
> feels really frustrating.
I am aware of how frustrating late feedback can be. I am also encouraged
by some of the conversations and investigations that have already
happened around how Keys fits what these attestation solutions need.
> > My observation of SNP_GET_REPORT and TDX_CMD_GET_REPORT is that they are
> > both passing blobs across the user/kernel and platform/kernel boundary
> > for the purposes of unlocking other resources. To me that is a flow that
> > the Keys subsystem has infrastructure to handle. It has the concept of
> > upcalls and asynchronous population of blobs by handles and mechanisms
> > to protect and cache those communications. Linux / the Keys subsystem
> > could benefit from the enhancements it would need to cover these 2
> > cases. Specifically, the benefit that when ARM and RISC-V arrive with
> > similar communications with platform TSMs (Trusted Security Module) they
> > can build upon the same infrastructure.
> >
> > David, am I reaching with that association? My strawman mapping of
> > TDX_CMD_GET_QUOTE to request_key() is something like:
> >
> > request_key(coco_quote, "description", "<uuencoded tdreport>")
> >
> > Where this is a common key_type for all vendors, but the description and
> > arguments have room for vendor differentiation when doing the upcall to
> > the platform TSM, but userspace never needs to contend with the
> > different vendor formats, that is all handled internally to the kernel.
> >
> I think the problem definition here is not accurate. With AMD SNP, guests
> need to do a hypercall to KVM and KVM needs to issue
> a SNP_GUEST_REQUEST(MSG_REPORT_REQ) to the SP firmware. In TDX, guests
> need to do a TDCALL to TDXMODULE to get the TDREPORT and then it needs to
> get that report delivered to the host userspace to get the TDQUOTE
> generated by the SGX quoting enclave. Also TDQUOTE is designed to work
> async while the SNP_GUEST_REQUESTS are blocking vmcalls.
>
> Those are completely different flows. Are you suggesting that intel should
> also come down to a single call to get the TDQUOTE like AMD SNP?
The Keys subsystem supports async instantiation of key material with
usermode upcalls if necessary. So I do not see a problem supporting
these flows behind a common key type.
> The TDCALL interface asking for the TDREPORT is already there. AMD does not
> need to ask the report and the quote separately.
>
> Here, the problem was that Intel (upstream) did not want to implement
> hypercall for TDQUOTE which would be handled by the user space VMM. The
> alternative implementation (using vsock) does not work for many use cases
> including ours. I do not see how your suggestion addresses the problem that
> this patch was trying to solve.
Perhaps the strawman mockup makes it more clear:
https://lore.kernel.org/all/64961c3baf8ce_142af829436@dwillia2-xfh.jf.intel…
> So while I like the suggested direction, I am not sure how much it is
> possible to come up with a common ABI even with just only for 2 vendors
> (AMD and Intel) without doing spec changes which is a multi year effort
> imho.
I agree, hardware spec changes are out of scope for this effort, but
Keys might require some additional flows to be built up in the kernel
that could be previously handled in userspace. I.e. the "bottom half"
that I reference in the mockup.
This is something we went through with using "encrypted-keys" for
nvdimm. Instead of an ioctl to inject a secret key over the user/kernel
boundary, a key server needs to store a serialized version of the
encrypted key blob and pass that into the kernel.
The restoring of TPIDR2 signal context has been broken since it was
merged; fix this and add a test case covering it. This is a result of
TPIDR2 context management following a different flow to any of the other
state that we provide and the fact that we don't expose TPIDR (which
follows the same pattern) to signals.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Changes in v2:
- Added a feature check for SME to the new test.
- Link to v1: https://lore.kernel.org/r/20230621-arm64-fix-tpidr2-signal-restore-v1-0-b6d…
---
Mark Brown (2):
arm64/signal: Restore TPIDR2 register rather than memory state
kselftest/arm64: Add a test case for TPIDR2 restore
arch/arm64/kernel/signal.c | 2 +-
tools/testing/selftests/arm64/signal/.gitignore | 2 +-
.../arm64/signal/testcases/tpidr2_restore.c | 86 ++++++++++++++++++++++
3 files changed, 88 insertions(+), 2 deletions(-)
---
base-commit: 858fd168a95c5b9669aac8db6c14a9aeab446375
change-id: 20230621-arm64-fix-tpidr2-signal-restore-713d93798f99
Best regards,
--
Mark Brown <broonie(a)kernel.org>
TCP SYN/ACK packets of connections from processes/sockets outside a
cgroup on the same host are not received by the cgroup's installed
cgroup_skb filters.
There were two BPF cgroup_skb programs attached to a cgroup named
"my_cgroup".
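#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup_skb/ingress")
int ingress(struct __sk_buff *skb)
{
	/* .... process skb ... */
	return 1;	/* 1 lets the packet through, 0 drops it */
}

SEC("cgroup_skb/egress")
int egress(struct __sk_buff *skb)
{
	/* .... process skb ... */
	return 1;	/* 1 lets the packet through, 0 drops it */
}

char _license[] SEC("license") = "GPL";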
We discovered that when running the command "nc -6 -l 8000" in
"my_cgroup" and connecting to it from outside of "my_cgroup" with the
command "nc -6 localhost 8000", the egress filter did not detect the
SYN/ACK packet. However, we did observe the SYN/ACK packet at the
ingress when connecting from a socket in "my_cgroup" to a socket
outside of it.
We came across BPF_CGROUP_RUN_PROG_INET_EGRESS(). This macro is
responsible for calling BPF programs that are attached to the egress
hook of a cgroup and it skips programs if the sending socket is not the
owner of the skb. Specifically, in our situation, the SYN/ACK
skb is owned by a struct request_sock instance, but the sending
socket is the listener socket we use to receive incoming
connections. The request_sock is created to manage an incoming
connection.
It has been determined that checking the owner of a skb against
the sending socket is not required. Removing this check will allow the
filters to receive SYN/ACK packets.
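The effect of the ownership check can be modelled in plain userspace C. This is a simplified sketch, not the macro itself: the two structs are hypothetical stand-ins for the kernel's socket and skb types, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's socket and skb structures. */
struct model_sock {
	int id;
};

struct model_skb {
	const struct model_sock *sk;	/* owner of the skb */
};

/* Behaviour described above: filters run only if the sender owns the skb. */
bool egress_runs_before(const struct model_sock *sender,
			const struct model_skb *skb)
{
	assert(skb != NULL);
	return sender != NULL && sender == skb->sk;
}

/* Behaviour after the change: the ownership comparison is dropped. */
bool egress_runs_after(const struct model_sock *sender,
		       const struct model_skb *skb)
{
	assert(skb != NULL);
	return sender != NULL;
}
```

In this model a SYN/ACK owned by a request socket but sent through the listener is skipped by the first function and seen by the second, which matches the behaviour reported above.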
A new self-test has been added as well, to ensure that cgroup_skb filters
can receive all signaling packets, including SYN, SYN/ACK, ACK, FIN, and
FIN/ACK.
Changes from v2:
- Remove redundant blank lines.
Changes from v1:
- Check the number of observed packets instead of just sleeping.
- Use ASSERT_XXX() instead of CHECK().
[v1] https://lore.kernel.org/all/20230612191641.441774-1-kuifeng@meta.com/
[v2] https://lore.kernel.org/all/20230617052756.640916-2-kuifeng@meta.com/
Kui-Feng Lee (2):
net: bpf: Always call BPF cgroup filters for egress.
selftests/bpf: Verify that the cgroup_skb filters receive expected
packets.
include/linux/bpf-cgroup.h | 2 +-
tools/testing/selftests/bpf/cgroup_helpers.c | 12 +
tools/testing/selftests/bpf/cgroup_helpers.h | 1 +
tools/testing/selftests/bpf/cgroup_tcp_skb.h | 35 ++
.../selftests/bpf/prog_tests/cgroup_tcp_skb.c | 399 ++++++++++++++++++
.../selftests/bpf/progs/cgroup_tcp_skb.c | 382 +++++++++++++++++
6 files changed, 830 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/cgroup_tcp_skb.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_tcp_skb.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_tcp_skb.c
--
2.34.1
Patch 1-3/9 track and expose some aggregated data counters at the MPTCP
level: the number of retransmissions and the bytes that have been
transferred. The first patch prepares the work by moving where snd_una
is updated for fallback sockets while the last patch adds some tests to
cover the new code.
Patch 4-6/9 introduce a new getsockopt for SOL_MPTCP: MPTCP_FULL_INFO.
This new socket option combines the info from the MPTCP_INFO,
MPTCP_TCPINFO and MPTCP_SUBFLOW_ADDRS socket options into one. Having
all the info in one call can be needed because the path-manager can close
and re-create subflows between getsockopt() calls, fooling the accounting. The
first patch introduces a unique subflow ID to easily detect when
subflows are being re-created with the same 5-tuple while the last patch
adds some tests to cover the new code.
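The role of the unique subflow ID can be sketched with two snapshots taken at different times. The struct and function names below are hypothetical and are not the uAPI added by this series; the sketch is IPv4-only to stay short, and it assumes an ID is never reused within a connection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-subflow record taken from one snapshot. */
struct subflow_snap {
	uint32_t id;		/* unique subflow ID (assumed never reused) */
	uint32_t local_addr;
	uint32_t remote_addr;
	uint16_t local_port;
	uint16_t remote_port;
};

/* Same addresses and ports; the protocol is TCP in both cases. */
bool same_endpoints(const struct subflow_snap *a, const struct subflow_snap *b)
{
	assert(a != NULL && b != NULL);
	return a->local_addr == b->local_addr &&
	       a->remote_addr == b->remote_addr &&
	       a->local_port == b->local_port &&
	       a->remote_port == b->remote_port;
}

/* True when b reuses a's addresses and ports but is a different subflow. */
bool subflow_recreated(const struct subflow_snap *a,
		       const struct subflow_snap *b)
{
	return same_endpoints(a, b) && a->id != b->id;
}
```

Without the ID, the two snapshots would look identical and per-subflow counters from the old and new subflow could be mixed up.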
Please note that patch 5/9 ("mptcp: introduce MPTCP_FULL_INFO getsockopt")
can reveal a bug that has been there for some time, see [1]. A fix has
recently been sent to netdev for the -net tree: "mptcp: ensure listener
is unhashed before updating the sk status", see [2]. There are no
conflicts between the two patches but it might be better to apply this
series after the one for -net and after having merged "net" into
"net-next".
Patch 7/9 is similar to commit 47867f0a7e83 ("selftests: mptcp: join:
skip check if MIB counter not supported") recently applied in the -net
tree but here it adapts the new code that is only in net-next (and it
fixes a merge conflict resolution which didn't have any impact).
Patch 8 and 9/9 are two simple refactorings. One to consolidate the
transition to TCP_CLOSE in mptcp_do_fastclose() and avoid duplicated
code. The other one reduces the scope of an argument passed to
mptcp_pm_alloc_anno_list() function.
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/407 [1]
Link: https://lore.kernel.org/netdev/20230620-upstream-net-20230620-misc-fixes-fo… [2]
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
---
Geliang Tang (1):
mptcp: pass addr to mptcp_pm_alloc_anno_list
Matthieu Baerts (1):
selftests: mptcp: join: skip check if MIB counter not supported (part 2)
Paolo Abeni (7):
mptcp: move snd_una update earlier for fallback socket
mptcp: track some aggregate data counters
selftests: mptcp: explicitly tests aggregate counters
mptcp: add subflow unique id
mptcp: introduce MPTCP_FULL_INFO getsockopt
selftests: mptcp: add MPTCP_FULL_INFO testcase
mptcp: consolidate transition to TCP_CLOSE in mptcp_do_fastclose()
include/uapi/linux/mptcp.h | 29 +++++
net/mptcp/options.c | 14 +-
net/mptcp/pm_netlink.c | 8 +-
net/mptcp/pm_userspace.c | 2 +-
net/mptcp/protocol.c | 31 +++--
net/mptcp/protocol.h | 11 +-
net/mptcp/sockopt.c | 152 +++++++++++++++++++++-
net/mptcp/subflow.c | 2 +
tools/testing/selftests/net/mptcp/mptcp_join.sh | 33 ++---
tools/testing/selftests/net/mptcp/mptcp_sockopt.c | 120 ++++++++++++++++-
10 files changed, 356 insertions(+), 46 deletions(-)
---
base-commit: 712557f210723101717570844c95ac0913af74d7
change-id: 20230620-upstream-net-next-20230620-mptcp-expose-more-info-and-misc-6b4a3a415ec5
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
*Changes in v19*
- Minor changes and interface updates
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebaase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
GetWriteWatch() syscall [1]. The GetWriteWatch{} retrieves the addresses of
the pages that are written to in a region of virtual memory.
This syscall is used in Windows applications and games etc. This syscall is
being emulated in pretty slow manner in userspace. Our purpose is to
enhance the kernel such that we translate it efficiently in a better way.
Currently some out of tree hack patches are being used to efficiently
emulate it in some kernels. We intend to replace those with these patches.
So the whole gaming on Linux can effectively get benefit from this. It
means there would be tons of users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei's defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use ioctl on pagemap file to read or/and reset soft-dirty
flag. But using soft-dirty flag, sometimes we get extra pages which weren't
even written. They had become soft-dirty because of VMA merging and
VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were
able to by-pass this short coming by ignoring VM_SOFTDIRTY until David
reported that mprotect etc messes up the soft-dirty flag while ignoring
VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
discussed if we can revert these patches. But we could not reach to any
conclusion. So at this point, I made couple of tries to solve this whole
VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had potential to cause
regression. We left it behind.
* [8] Keep a list of soft-dirty part of a VMA across splits and merges. I
got the reply don't increase the size of the VMA by 8 bytes.
At this point, we left soft-dirty considering it is too much delicate and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on userfaultfd wp feature where
kernel resolves the faults itself when WP_ASYNC feature is used. It was
straight forward to add WP_ASYNC feature in userfautlfd. Now we get only
those pages dirty or written-to which are really written in reality. (PS
There is another WP_UNPOPULATED userfautfd feature is required which is
needed to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added on the request of CRIU devs to create
interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As
the kernel already has a soft-dirty feature, which we have given up
using, we use the written-to terminology while using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get soft-dirty/Written-to status and clear present in
the kernel.
- The pages which have been written-to cannot be found in an accurate way.
(Kernel's soft-dirty PTE bit + soft-dirty VMA bit shows more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have the use case where we need to track the soft-dirty PTE bit for
only specific pages on-demand. We need this tracking and clear mechanism
of a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. They are too delicate and inaccurate, as they show more
soft-dirty pages than there actually are. There is no interest in
correcting this [2][3] as this is how the feature was written years ago,
and its behaviour shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
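As a rough illustration of the "!PM_UFFD_WP" check above, here is a
minimal userspace sketch. It assumes the 64-bit pagemap entry layout
described in Documentation/admin-guide/mm/pagemap.rst, where bit 57 means
"pte is uffd-wp write-protected". The helper names entry_is_written() and
pagemap_offset() are hypothetical and are not part of this series. The
result is only meaningful for pages that were write-protected through
userfaultfd beforehand.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 57 of a pagemap entry: pte is uffd-wp write-protected (assumed
 * from the pagemap documentation, not defined by this series). */
#define PM_UFFD_WP_BIT 57

/* Hypothetical helper: a previously write-protected page counts as
 * written-to once its uffd-wp bit is no longer set. */
static bool entry_is_written(uint64_t entry)
{
	return (entry & (UINT64_C(1) << PM_UFFD_WP_BIT)) == 0;
}

/* Hypothetical helper: byte offset of the 8-byte pagemap entry that
 * describes the page containing vaddr. */
static uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
	return (vaddr / page_size) * sizeof(uint64_t);
}
```

A caller would typically pread() 8 bytes from /proc/<pid>/pagemap at
pagemap_offset(addr, page_size) and pass the value to entry_is_written().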
The information related to pages if the page is file mapped, present and
swapped is required for the CRIU project [5][6]. The addition of the
required mask, any mask, excluded mask and return masks are also required
for the CRIU project [5].
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. The max_pages is needed to support a use case where the user only
wants to get a specific number of pages, so there is no need to find all
the pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. The max_pages is
optional. If max_pages is specified, it must be equal to or greater than
the vec_size. This restriction is needed to handle the worst case, when
one page_region only contains info of one page and cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
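The compaction and the max_pages cut-off described above can be sketched
in plain C as follows. This is only a userspace model under stated
assumptions: struct region_sketch and compact_pages() are hypothetical
names, the record layout is not the UAPI struct page_region, and the input
is assumed to be a sorted list of matching page addresses with their
flags. The max_pages >= vec_size check mentioned above is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output record; NOT the UAPI struct page_region. */
struct region_sketch {
	uint64_t start;		/* address of the first page in the run */
	uint64_t npages;	/* number of contiguous pages in the run */
	uint64_t flags;		/* flags shared by every page in the run */
};

/*
 * Fold sorted page addresses into runs of contiguous pages that share the
 * same flags. Stops when max_pages pages have been consumed (0 means no
 * limit) or when a new run is needed but vec is full.
 * Returns the number of vec entries used.
 */
static size_t compact_pages(const uint64_t *addrs, const uint64_t *flags,
			    size_t n, uint64_t page_size,
			    struct region_sketch *vec, size_t vec_size,
			    size_t max_pages)
{
	size_t used = 0, found = 0;

	for (size_t i = 0; i < n; i++) {
		struct region_sketch *last = used ? &vec[used - 1] : NULL;

		if (max_pages && found == max_pages)
			break;

		if (last && last->flags == flags[i] &&
		    last->start + last->npages * page_size == addrs[i]) {
			/* Page extends the current run. */
			last->npages++;
		} else {
			/* Page starts a new run; needs a free vec slot. */
			if (used == vec_size)
				break;
			vec[used].start = addrs[i];
			vec[used].npages = 1;
			vec[used].flags = flags[i];
			used++;
		}
		found++;
	}
	return used;
}
```

In the worst case no two pages can be merged, so every page occupies its
own entry, which is the situation the vec_size restriction above refers to.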
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 526 +++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 53 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 53 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1458 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2287 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
This patch introduces a specific test case for the EVIOCGLED ioctl.
The test covers the case where len > maxlen in the
EVIOCGLED(sizeof(all_leds)), all_leds) ioctl.
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
Changes in v2:
- Changed variable leds from an array to an int
This patch depends on '[v3] selftests/input: Introduce basic tests for evdev ioctls' [1] sent to the ML.
[1] https://patchwork.kernel.org/project/linux-input/patch/20230607153214.15933…
tools/testing/selftests/input/evioc-test.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index ad7b93fe39cf..378db2b4dd56 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -234,4 +234,21 @@ TEST(eviocsrep_set_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocgled_get_all_leds)
+{
+ struct selftest_uinput *uidev;
+ int leds = 0;
+ int rc;
+
+ rc = selftest_uinput_create_device(&uidev, -1);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+
+ /* ioctl to set the maxlen = 0 */
+ rc = ioctl(uidev->evdev_fd, EVIOCGLED(0), leds);
+ ASSERT_EQ(0, rc);
+
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
This patch introduces a specific test case for the EVIOCGKEY ioctl.
The test covers the case where len > maxlen in the
EVIOCGKEY(sizeof(keystate)), keystate) ioctl.
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
Changes in v3:
- Edited commit's subject and description
- Renamed variable rep_values to keystate
- Added argument to selftest_uinput_create_device()
- Removed memset
Changes in v2:
- Added following note about the patch's dependency
This patch depends on '[v3] selftests/input: Introduce basic tests for evdev ioctls' [1] sent to the ML.
[1] https://patchwork.kernel.org/project/linux-input/patch/20230607153214.15933…
tools/testing/selftests/input/evioc-test.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index ad7b93fe39cf..e0f69459f504 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -234,4 +234,21 @@ TEST(eviocsrep_set_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocgkey_get_global_key_state)
+{
+ struct selftest_uinput *uidev;
+ int keystate = 0;
+ int rc;
+
+ rc = selftest_uinput_create_device(&uidev, -1);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+
+ /* ioctl to create the scenario where len > maxlen in bits_to_user() */
+ rc = ioctl(uidev->evdev_fd, EVIOCGKEY(0), keystate);
+ ASSERT_EQ(0, rc);
+
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
This patch introduces a specific test case for the EVIOCGLED ioctl.
The test covers the case where len > maxlen in the
EVIOCGLED(sizeof(all_leds)), all_leds) ioctl.
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
This patch depends on '[v3] selftests/input: Introduce basic tests for evdev ioctls' [1] sent to the ML.
[1] https://patchwork.kernel.org/project/linux-input/patch/20230607153214.15933…
tools/testing/selftests/input/evioc-test.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index ad7b93fe39cf..2bf1b32ae01a 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -234,4 +234,21 @@ TEST(eviocsrep_set_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocgled_get_all_leds)
+{
+ struct selftest_uinput *uidev;
+ int leds[2];
+ int rc;
+
+ rc = selftest_uinput_create_device(&uidev, -1);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+
+ /* ioctl to set the maxlen = 0 */
+ rc = ioctl(uidev->evdev_fd, EVIOCGLED(0), leds);
+ ASSERT_EQ(0, rc);
+
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
The restoring of TPIDR2 signal context has been broken since it was
merged, fix this and add a test case covering it. This is a result of
TPIDR2 context management following a different flow to any of the other
state that we provide and the fact that we don't expose TPIDR (which
follows the same pattern) to signals.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
Mark Brown (2):
arm64/signal: Restore TPIDR2 register rather than memory state
kselftest/arm64: Add a test case for TPIDR2 restore
arch/arm64/kernel/signal.c | 2 +-
tools/testing/selftests/arm64/signal/.gitignore | 2 +-
.../arm64/signal/testcases/tpidr2_restore.c | 85 ++++++++++++++++++++++
3 files changed, 87 insertions(+), 2 deletions(-)
---
base-commit: 858fd168a95c5b9669aac8db6c14a9aeab446375
change-id: 20230621-arm64-fix-tpidr2-signal-restore-713d93798f99
Best regards,
--
Mark Brown <broonie(a)kernel.org>
To cover this case, the test sets 'maxlen = 0'. The reasoning is as
follows:
EVIOCGKEY is executed from evdev_do_ioctl(), which is called from
evdev_ioctl_handler().
evdev_ioctl_handler() is called from 2 functions, where by code coverage,
only the first one is in use.
‘compat’ is given the value ‘0’ [1].
Thus, the condition [2] is always false.
This means ‘len’ always equals a positive number [3]
‘maxlen’ in evdev_handle_get_val [4] is defined locally in
evdev_do_ioctl() [5], and is sent in the variable 'size' [6]
[1] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1281
[2] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L705
[3] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L707
[4] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L886
[5] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1155
[6] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1141
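The length clamping that the explanation above relies on can be modelled
in userspace as follows. This is a simplified sketch, not the driver code:
copy_len_sketch() and bits_to_longs_sketch() are hypothetical names, the
compat path is left out because the text above argues it is never taken,
and the bitmap width is passed in as a parameter.

```c
#include <limits.h>
#include <stddef.h>

/* Number of longs needed to hold nbits bits (userspace stand-in for the
 * kernel's BITS_TO_LONGS()). */
static size_t bits_to_longs_sketch(size_t nbits)
{
	size_t bits_per_long = sizeof(long) * CHAR_BIT;

	return (nbits + bits_per_long - 1) / bits_per_long;
}

/*
 * Model of the non-compat length computation: len is the full bitmap size
 * in bytes, clamped to the size the caller encoded in the ioctl number
 * (maxlen).
 */
static size_t copy_len_sketch(size_t maxbit, size_t maxlen)
{
	size_t len = bits_to_longs_sketch(maxbit) * sizeof(long);

	if (len > maxlen)
		len = maxlen;	/* the branch the new test is meant to hit */

	return len;
}
```

With maxlen = 0, as in EVIOCGKEY(0), the clamped length is 0, which
appears consistent with the test expecting the ioctl to return 0.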
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
Changes in v2:
- Added following note about the patch's dependency
This patch depends on '[v3] selftests/input: Introduce basic tests for evdev ioctls' [1] sent to the ML.
[1] https://patchwork.kernel.org/project/linux-input/patch/20230607153214.15933…
tools/testing/selftests/input/evioc-test.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index ad7b93fe39cf..b94de2ee5596 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -234,4 +234,23 @@ TEST(eviocsrep_set_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocgkey_get_global_key_state)
+{
+ struct selftest_uinput *uidev;
+ int rep_values[2];
+ int rc;
+
+ memset(rep_values, 0, sizeof(rep_values));
+
+ rc = selftest_uinput_create_device(&uidev);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+
+ /* ioctl to create the scenario where len > maxlen in bits_to_user() */
+ rc = ioctl(uidev->evdev_fd, EVIOCGKEY(0), rep_values);
+ ASSERT_EQ(0, rc);
+
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
From: Danielle Ratson <danieller(a)nvidia.com>
When mirroring to a gretap in hardware the device expects to be
programmed with the egress port and all the encapsulating headers. This
requires the driver to resolve the path the packet will take in the
software data path and program the device accordingly.
If the path cannot be resolved (in this case because of an unresolved
neighbor), then mirror installation fails until the path is resolved.
This results in a race that causes the test to sometimes fail.
Fix this by setting the neighbor's state to permanent in a couple of
tests, so that it is always valid.
Fixes: 35c31d5c323f ("selftests: forwarding: Test mirror-to-gretap w/ UL 802.1d")
Fixes: 239e754af854 ("selftests: forwarding: Test mirror-to-gretap w/ UL 802.1q")
Signed-off-by: Danielle Ratson <danieller(a)nvidia.com>
Reviewed-by: Petr Machata <petrm(a)nvidia.com>
Signed-off-by: Petr Machata <petrm(a)nvidia.com>
---
.../testing/selftests/net/forwarding/mirror_gre_bridge_1d.sh | 4 ++++
.../testing/selftests/net/forwarding/mirror_gre_bridge_1q.sh | 4 ++++
2 files changed, 8 insertions(+)
diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d.sh b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d.sh
index c5095da7f6bf..aec752a22e9e 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1d.sh
@@ -93,12 +93,16 @@ cleanup()
test_gretap()
{
+ ip neigh replace 192.0.2.130 lladdr $(mac_get $h3) \
+ nud permanent dev br2
full_test_span_gre_dir gt4 ingress 8 0 "mirror to gretap"
full_test_span_gre_dir gt4 egress 0 8 "mirror to gretap"
}
test_ip6gretap()
{
+ ip neigh replace 2001:db8:2::2 lladdr $(mac_get $h3) \
+ nud permanent dev br2
full_test_span_gre_dir gt6 ingress 8 0 "mirror to ip6gretap"
full_test_span_gre_dir gt6 egress 0 8 "mirror to ip6gretap"
}
diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1q.sh b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1q.sh
index 9ff22f28032d..0cf4c47a46f9 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1q.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_bridge_1q.sh
@@ -90,12 +90,16 @@ cleanup()
test_gretap()
{
+ ip neigh replace 192.0.2.130 lladdr $(mac_get $h3) \
+ nud permanent dev br1
full_test_span_gre_dir gt4 ingress 8 0 "mirror to gretap"
full_test_span_gre_dir gt4 egress 0 8 "mirror to gretap"
}
test_ip6gretap()
{
+ ip neigh replace 2001:db8:2::2 lladdr $(mac_get $h3) \
+ nud permanent dev br1
full_test_span_gre_dir gt6 ingress 8 0 "mirror to ip6gretap"
full_test_span_gre_dir gt6 egress 0 8 "mirror to ip6gretap"
}
--
2.40.1
When calling socket lookup from L2 (tc, xdp), VRF boundaries aren't
respected. This patchset fixes this by regarding the incoming device's
VRF attachment when performing the socket lookups from tc/xdp.
The first two patches are coding changes which factor out the tc helper's
logic which was shared with cg/sk_skb (which operate correctly).
This refactoring is needed in order to avoid affecting the cgroup/sk_skb
flows as there does not seem to be a strict criteria for discerning which
flow the helper is called from based on the net device or packet
information.
The third patch contains the actual bugfix.
The fourth patch adds bpf tests for these lookup functions.
---
v6: - Remove redundant IS_ENABLED as suggested by Daniel Borkmann
- Declare net_device variable and use it as suggested by Daniel Borkmann
v5: Use reverse xmas tree indentation
v4: - Move dev_sdif() to include/linux/netdevice.h as suggested by Stanislav Fomichev
- Remove SYS and SYS_NOFAIL duplicate definitions
v3: - Rename bpf_l2_sdif() to dev_sdif() as suggested by Stanislav Fomichev
- Added xdp tests as suggested by Daniel Borkmann
- Use start_server() to avoid duplicate code as suggested by Stanislav Fomichev
v2: Fixed uninitialized var in test patch (4).
Gilad Sever (4):
bpf: factor out socket lookup functions for the TC hookpoint.
bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC
hookpoint
bpf: fix bpf socket lookup from tc/xdp to respect socket VRF bindings
selftests/bpf: Add vrf_socket_lookup tests
include/linux/netdevice.h | 9 +
net/core/filter.c | 141 ++++++--
.../bpf/prog_tests/vrf_socket_lookup.c | 312 ++++++++++++++++++
.../selftests/bpf/progs/vrf_socket_lookup.c | 88 +++++
4 files changed, 526 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/vrf_socket_lookup.c
create mode 100644 tools/testing/selftests/bpf/progs/vrf_socket_lookup.c
--
2.34.1
The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.
Over the course of the following several patchsets, mlxsw code is going to
be adjusted to diminish the space of wrongly offloaded configurations.
Ideally the offload state will reflect the actual state, regardless of the
sequence of operation used to construct that state.
Several selftests build configurations that will not be offloadable in the
future on some systems. The reason is that what will get offloaded is the
actual configuration, not the configuration steps.
For example, when a port is added to a bridge that has an IP address, that
bridge will get a RIF, which it would not have with the current code. But
on Nvidia Spectrum-1 machines, MAC addresses of all RIFs need to have the
same prefix, which the bridge will violate. The RIF thus couldn't be
created, and the enslavement is therefore canceled, because it would lead
to an unoffloadable configuration. This breaks some selftests.
In this patchset, adjust selftests to avoid the configurations that mlxsw
would be incapable of offloading, while maintaining relevance with regards
to the feature that is being tested. There are generally two cases of
fixes:
- Disabling IPv6 autogen on bridges that do not participate in routing,
either because of the abovementioned requirement to keep the same MAC
prefix on all in-HW router interfaces, or, on 802.1ad bridges, because
in-HW router interfaces are not supported at all.
- Setting the bridge MAC address to what it will become after the first
member port is attached, so that the in-HW router interface is created
with a supported MAC address.
The patchset is then split thus:
- Patches #1-#7 adjust generic selftests
- Patches #8-#16 adjust mlxsw-specific selftests
Petr Machata (16):
selftests: forwarding: q_in_vni: Disable IPv6 autogen on bridges
selftests: forwarding: dual_vxlan_bridge: Disable IPv6 autogen on
bridges
selftests: forwarding: skbedit_priority: Disable IPv6 autogen on a
bridge
selftests: forwarding: pedit_dsfield: Disable IPv6 autogen on a bridge
selftests: forwarding: mirror_gre_*: Disable IPv6 autogen on bridges
selftests: forwarding: mirror_gre_*: Use port MAC for bridge address
selftests: forwarding: router_bridge: Use port MAC for bridge address
selftests: mlxsw: q_in_q_veto: Disable IPv6 autogen on bridges
selftests: mlxsw: extack: Disable IPv6 autogen on bridges
selftests: mlxsw: mirror_gre_scale: Disable IPv6 autogen on a bridge
selftests: mlxsw: qos_dscp_bridge: Disable IPv6 autogen on a bridge
selftests: mlxsw: qos_ets_strict: Disable IPv6 autogen on bridges
selftests: mlxsw: qos_mc_aware: Disable IPv6 autogen on bridges
selftests: mlxsw: spectrum: q_in_vni_veto: Disable IPv6 autogen on a
bridge
selftests: mlxsw: vxlan: Disable IPv6 autogen on bridges
selftests: mlxsw: one_armed_router: Use port MAC for bridge address
.../selftests/drivers/net/mlxsw/extack.sh | 24 ++++++++---
.../drivers/net/mlxsw/mirror_gre_scale.sh | 1 +
.../drivers/net/mlxsw/one_armed_router.sh | 3 +-
.../drivers/net/mlxsw/q_in_q_veto.sh | 8 ++++
.../drivers/net/mlxsw/qos_dscp_bridge.sh | 1 +
.../drivers/net/mlxsw/qos_ets_strict.sh | 8 +++-
.../drivers/net/mlxsw/qos_mc_aware.sh | 2 +
.../net/mlxsw/spectrum/q_in_vni_veto.sh | 1 +
.../selftests/drivers/net/mlxsw/vxlan.sh | 41 ++++++++++++++-----
.../net/forwarding/dual_vxlan_bridge.sh | 1 +
.../net/forwarding/mirror_gre_bound.sh | 1 +
.../net/forwarding/mirror_gre_bridge_1d.sh | 3 +-
.../forwarding/mirror_gre_bridge_1d_vlan.sh | 3 +-
.../forwarding/mirror_gre_bridge_1q_lag.sh | 3 +-
.../net/forwarding/mirror_topo_lib.sh | 1 +
.../selftests/net/forwarding/pedit_dsfield.sh | 4 +-
.../selftests/net/forwarding/q_in_vni.sh | 1 +
.../selftests/net/forwarding/router_bridge.sh | 3 +-
.../net/forwarding/skbedit_priority.sh | 4 +-
19 files changed, 88 insertions(+), 25 deletions(-)
--
2.40.1
If we get an unexpected signal during a signal test log a bit more data to
aid diagnostics.
Signed-off-by: Mark Brown <broonie(a)kernel.org>
---
tools/testing/selftests/arm64/signal/test_signals_utils.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/arm64/signal/test_signals_utils.c b/tools/testing/selftests/arm64/signal/test_signals_utils.c
index 40be8443949d..0dc948db3a4a 100644
--- a/tools/testing/selftests/arm64/signal/test_signals_utils.c
+++ b/tools/testing/selftests/arm64/signal/test_signals_utils.c
@@ -249,7 +249,8 @@ static void default_handler(int signum, siginfo_t *si, void *uc)
fprintf(stderr, "-- Timeout !\n");
} else {
fprintf(stderr,
- "-- RX UNEXPECTED SIGNAL: %d\n", signum);
+ "-- RX UNEXPECTED SIGNAL: %d code %d address %p\n",
+ signum, si->si_code, si->si_addr);
}
default_result(current, 1);
}
---
base-commit: 44c026a73be8038f03dbdeef028b642880cf1511
change-id: 20230620-arm64-selftest-log-wrong-signal-cd8c34ae5e4f
Best regards,
--
Mark Brown <broonie(a)kernel.org>
This series adds 2 zswap related selftests that verify known and fixed
issues. A new dedicated test program (test_zswap) is proposed since
the test cases are specific to zswap and hosts specific helpers.
The first patch adds the (empty) test program, while the other 2 add an
actual test function each.
Domenico Cerasuolo (3):
selftests: cgroup: add test_zswap program
selftests: cgroup: add test_zswap with no kmem bypass test
selftests: cgroup: add zswap-memcg unwanted writeback test
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 2 +
tools/testing/selftests/cgroup/test_zswap.c | 286 ++++++++++++++++++++
3 files changed, 289 insertions(+)
create mode 100644 tools/testing/selftests/cgroup/test_zswap.c
--
2.34.1
We want to replace iptables TPROXY with a BPF program at TC ingress.
To make this work in all cases we need to assign a SO_REUSEPORT socket
to an skb, which is currently prohibited. This series adds support for
such sockets to bpf_sk_assign. See patch 5 for details.
I did some refactoring to cut down on the amount of duplicate code. The
key to this is to use INDIRECT_CALL in the reuseport helpers. To show
that this approach is not just beneficial to TC sk_assign I removed
duplicate code for bpf_sk_lookup as well.
Changes from v1:
- Correct commit abbrev length (Kuniyuki)
- Reduce duplication (Kuniyuki)
- Add checks on sk_state (Martin)
- Split exporting inet[6]_lookup_reuseport into separate patch (Eric)
Joint work with Daniel Borkmann.
Signed-off-by: Lorenz Bauer <lmb(a)isovalent.com>
---
Daniel Borkmann (1):
selftests/bpf: Test that SO_REUSEPORT can be used with sk_assign helper
Lorenz Bauer (5):
net: export inet_lookup_reuseport and inet6_lookup_reuseport
net: document inet[6]_lookup_reuseport sk_state requirements
net: remove duplicate reuseport_lookup functions
net: remove duplicate sk_lookup helpers
bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign
include/net/inet6_hashtables.h | 84 ++++++++-
include/net/inet_hashtables.h | 77 +++++++-
include/net/sock.h | 7 +-
include/uapi/linux/bpf.h | 3 -
net/core/filter.c | 2 -
net/ipv4/inet_hashtables.c | 69 +++++---
net/ipv4/udp.c | 73 +++-----
net/ipv6/inet6_hashtables.c | 71 +++++---
net/ipv6/udp.c | 85 +++------
tools/include/uapi/linux/bpf.h | 3 -
tools/testing/selftests/bpf/network_helpers.c | 3 +
.../selftests/bpf/prog_tests/assign_reuse.c | 197 +++++++++++++++++++++
.../selftests/bpf/progs/test_assign_reuse.c | 142 +++++++++++++++
13 files changed, 637 insertions(+), 179 deletions(-)
---
base-commit: 25085b4e9251c77758964a8e8651338972353642
change-id: 20230613-so-reuseport-e92c526173ee
Best regards,
--
Lorenz Bauer <lmb(a)isovalent.com>
*Changes in v18*
- Rebase on top of next-20230613
- Minor updates
*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch
*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back
*Changes in v15*
- Build fix (Add missed build fix in RESEND)
*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs
*Changes in v13*
- Rebase on top of next-20230414
- Give-up on using uffd_wp_range() and write new helpers, flush tlb only
once
*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates
*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
UFFDIO_WRITEPROTECT
*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation
*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code
*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation
*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
flags
*Motivation*
The real motivation for adding PAGEMAP_SCAN IOCTL is to emulate Windows
GetWriteWatch() syscall [1]. GetWriteWatch() retrieves the addresses of
the pages that are written to in a region of virtual memory.
This syscall is used in Windows applications and games etc. This syscall is
currently emulated in a pretty slow manner in userspace. Our purpose is to
enhance the kernel so that it can be emulated efficiently.
Currently some out of tree hack patches are being used to efficiently
emulate it in some kernels. We intend to replace those with these patches,
so that gaming on Linux as a whole can benefit from this. It
means there would be tons of users of this code.
CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.
Andrei defines the following uses of this code:
* it is more granular and allows us to track changed pages more
effectively. The current interface can clear dirty bits for the entire
process only. In addition, reading info about pages is a separate
operation. It means we must freeze the process to read information
about all its pages, reset dirty bits, only then we can start dumping
pages. The information about pages becomes more and more outdated,
while we are processing pages. The new interface solves both these
downsides. First, it allows us to read pte bits and clear the
soft-dirty bit atomically. It means that CRIU will not need to freeze
processes to pre-dump their memory. Second, it clears soft-dirty bits
for a specified region of memory. It means CRIU will have actual info
about pages to the moment of dumping them.
* The new interface has to be much faster because basic page filtering
is happening in the kernel. With the old interface, we have to read
pagemap for each page.
*Implementation Evolution (Short Summary)*
From the definition of GetWriteWatch(), we feel like kernel's soft-dirty
feature can be used under the hood with some additions like:
* reset soft-dirty flag for only a specific region of memory instead of
clearing the flag for the entire process
* get and clear soft-dirty flag for a specific region atomically
So we decided to use ioctl on pagemap file to read and/or reset soft-dirty
flag. But using soft-dirty flag, sometimes we get extra pages which weren't
even written. They had become soft-dirty because of VMA merging and
VM_SOFTDIRTY flag. This breaks the definition of GetWriteWatch(). We were
able to bypass this shortcoming by ignoring VM_SOFTDIRTY until David
reported that mprotect etc. messes up the soft-dirty flag while ignoring
VM_SOFTDIRTY [5]. This wasn't happening until [6] got introduced. We
discussed whether these patches could be reverted, but could not reach
any conclusion. So at this point, I made a couple of attempts to solve this
whole VM_SOFTDIRTY issue by correcting the soft-dirty implementation:
* [7] Correct the bug fixed wrongly back in 2014. It had the potential to
cause regressions. We left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and merges.
The reply I got was not to increase the size of the VMA by 8 bytes.
At this point, we gave up on soft-dirty, considering it too delicate, and
userfaultfd [9] seemed like the only way forward. From there onward, we
have been basing soft-dirty emulation on the userfaultfd wp feature, where
the kernel resolves the faults itself when the WP_ASYNC feature is used. It
was straightforward to add the WP_ASYNC feature to userfaultfd. Now only
those pages which have really been written to are reported as dirty or
written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is also
required to avoid pre-faulting memory before write-protecting [9].)
All the different masks were added at the request of the CRIU devs to make
the interface more generic and better.
[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-…
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com
*Original Cover letter from v8*
Hello,
Note:
Soft-dirty pages and pages which have been written-to are synonyms. As
the kernel already has a soft-dirty feature, which we have given up
using, we use the written-to terminology while using UFFD async WP under
the hood.
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
(PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
pages have been written-to.
- Find pages which have been written-to and write protect the pages
(atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
It is possible to find and clear soft-dirty pages entirely in userspace.
But it isn't efficient:
- The mprotect and SIGSEGV handler for bookkeeping
- The userfaultfd wp (synchronous) with the handler for bookkeeping
Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There is no atomic get-and-clear of the soft-dirty/written-to status in
the kernel.
- The pages which have been written-to cannot be found accurately.
(The kernel's soft-dirty PTE bit + soft-dirty VMA bit show more soft-dirty
pages than there actually are.)
Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have a use case where we need to track the soft-dirty PTE bit for
only specific pages on demand. We need this track-and-clear mechanism
for a region of memory while the process is running to emulate the
GetWriteWatch() syscall of Windows.
*(Moved to using UFFD instead of the soft-dirty feature to find pages which
have been written-to from v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. The feature is too delicate and inaccurate, as it reports more
soft-dirty pages than are actually soft-dirty. There is no interest in
correcting it [2][3], as this is how the feature was written years ago and
its behaviour shouldn't be changed now. Peter Xu has suggested
using the async version of the UFFD WP [4] as it is based inherently
on the PTEs.
So in this patch series, I've added a new mode to the UFFD which is an
asynchronous version of the write protect. When this variant of the
UFFD WP is used, the page faults are resolved automatically by the
kernel. The pages which have been written-to can be found by reading
pagemap file (!PM_UFFD_WP). This feature can be used successfully to
find which pages have been written to from the time the pages were
write protected. This works just like the soft-dirty flag without
showing any extra pages which aren't soft-dirty in reality.
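As a rough illustration of the pagemap check described above, the sketch below reads one 64-bit pagemap entry for the calling process and tests the uffd-wp bit. It assumes bit 57 is the "pte is uffd-wp write-protected" flag, as documented in pagemap.rst; the ex_ names are hypothetical. The result is only meaningful for a range that was registered with userfaultfd in WP_ASYNC mode and write-protected beforehand; for any other memory the bit is simply clear.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Assumed from pagemap.rst: bit 57 = pte is uffd-wp write-protected. */
#define EX_PM_UFFD_WP (1ULL << 57)

/* A page counts as written-to once its uffd-wp bit has been cleared. */
static int ex_entry_written(uint64_t entry)
{
	return !(entry & EX_PM_UFFD_WP);
}

/* Read the pagemap entry covering addr; returns 0 on success, -1 on error. */
static int ex_pagemap_read(const void *addr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t off = (off_t)((uintptr_t)addr / (uintptr_t)page_size) *
		    (off_t)sizeof(*entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = pread(fd, entry, sizeof(*entry), off);
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

Reading the file one entry at a time like this is exactly the slow path; the IOCTL in this series exists so that a whole range can be scanned, and optionally re-protected, in one call.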
Information on whether a page is file mapped, present or swapped is
required for the CRIU project [5][6]. The required mask, any mask,
excluded mask and return mask are also required for the CRIU
project [5].
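One plausible reading of how the four masks combine is sketched below. The semantics are inferred from the mask names only, and the flag values and ex_ identifiers are illustrative; the actual definitions live in the UAPI header added by the series.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values, not the ones from the UAPI header. */
#define EX_PAGE_IS_WRITTEN (1u << 0)
#define EX_PAGE_IS_FILE    (1u << 1)
#define EX_PAGE_IS_PRESENT (1u << 2)
#define EX_PAGE_IS_SWAPPED (1u << 3)

struct ex_scan_masks {
	uint32_t required;	/* all of these flags must be set */
	uint32_t anyof;		/* if non-zero, at least one must be set */
	uint32_t excluded;	/* none of these flags may be set */
	uint32_t ret;		/* flags reported back for matching pages */
};

/* Returns 1 if a page with the given flags should be reported. */
static int ex_page_matches(uint32_t flags, const struct ex_scan_masks *m)
{
	if ((flags & m->required) != m->required)
		return 0;
	if (m->anyof && !(flags & m->anyof))
		return 0;
	if (flags & m->excluded)
		return 0;
	return 1;
}

/* Flags actually returned to the caller for a matching page. */
static uint32_t ex_reported_flags(uint32_t flags, const struct ex_scan_masks *m)
{
	return flags & m->ret;
}
```

With required set to the written flag and excluded set to the file flag, for example, only written anonymous pages would be reported.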
The IOCTL returns the addresses of the pages which match the specific
masks. The page addresses are returned in struct page_region in a compact
form. max_pages supports the use case where the user only wants
to get a specific number of pages, so there is no need to find all the
pages of interest in the range when max_pages is specified. The IOCTL
returns when the maximum number of pages has been found. max_pages is
optional. If max_pages is specified, it must be equal to or greater than
vec_size. This restriction is needed to handle the worst case, when each
page_region contains info about only one page and cannot be compacted.
This is needed to emulate the Windows GetWriteWatch() syscall.
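The compaction idea can be sketched in userspace C as follows. This is an assumption-laden model, not the kernel code: the struct layout, the ex_ names and the convention that a zero flags value means "page not of interest" are all hypothetical. Consecutive pages with identical flags are merged into one region, and the walk stops when either the output vector is full or max_pages pages have been collected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region layout: start address, length in pages, flags. */
struct ex_page_region {
	uint64_t start;
	uint64_t len;
	uint32_t flags;
};

/*
 * Coalesce per-page flags into regions. flags[i] == 0 means page i is not
 * of interest. max_pages == 0 means "no limit". Returns the number of
 * regions written to vec.
 */
static size_t ex_build_regions(uint64_t start, uint64_t page_size,
			       const uint32_t *flags, size_t npages,
			       struct ex_page_region *vec, size_t vec_len,
			       size_t max_pages)
{
	size_t nregions = 0, found = 0, i;

	for (i = 0; i < npages; i++) {
		uint64_t addr = start + i * page_size;
		struct ex_page_region *last;

		if (!flags[i])
			continue;
		if (max_pages && found == max_pages)
			break;

		last = nregions ? &vec[nregions - 1] : NULL;
		if (last && last->flags == flags[i] &&
		    last->start + last->len * page_size == addr) {
			last->len++;		/* extend the current region */
		} else {
			if (nregions == vec_len)
				break;		/* output vector is full */
			vec[nregions].start = addr;
			vec[nregions].len = 1;
			vec[nregions].flags = flags[i];
			nregions++;
		}
		found++;
	}
	return nregions;
}
```

In the worst case no two pages can be merged, so each region describes a single page; that is the case the restriction between max_pages and vec_size is meant to cover.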
The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.
[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora…
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.…
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.…
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/
Regards,
Muhammad Usama Anjum
Muhammad Usama Anjum (4):
fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
about PTEs
tools headers UAPI: Update linux/fs.h with the kernel sources
mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
selftests: mm: add pagemap ioctl tests
Peter Xu (1):
userfaultfd: UFFD_FEATURE_WP_ASYNC
Documentation/admin-guide/mm/pagemap.rst | 58 +
Documentation/admin-guide/mm/userfaultfd.rst | 35 +
fs/proc/task_mmu.c | 513 ++++++
fs/userfaultfd.c | 26 +-
include/linux/hugetlb.h | 1 +
include/linux/userfaultfd_k.h | 21 +-
include/uapi/linux/fs.h | 53 +
include/uapi/linux/userfaultfd.h | 9 +-
mm/hugetlb.c | 34 +-
mm/memory.c | 27 +-
tools/include/uapi/linux/fs.h | 53 +
tools/testing/selftests/mm/.gitignore | 2 +
tools/testing/selftests/mm/Makefile | 3 +-
tools/testing/selftests/mm/config | 1 +
tools/testing/selftests/mm/pagemap_ioctl.c | 1459 ++++++++++++++++++
tools/testing/selftests/mm/run_vmtests.sh | 4 +
16 files changed, 2275 insertions(+), 24 deletions(-)
create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh
--
2.39.2
This patchset is based on the next branch of shuah/linux-kselftest.git
Tiezhu Yang (2):
selftests/vDSO: Add support for LoongArch
selftests/vDSO: Get version and name for all archs
tools/testing/selftests/vDSO/vdso_config.h | 6 ++++-
tools/testing/selftests/vDSO/vdso_test_getcpu.c | 16 +++++--------
.../selftests/vDSO/vdso_test_gettimeofday.c | 26 ++++++----------------
3 files changed, 18 insertions(+), 30 deletions(-)
--
2.1.0
When executing the following command to test clone3 on LoongArch:
# cd tools/testing/selftests/clone3 && make && ./clone3
we can see the following error info:
# [5719] Trying clone3() with flags 0x80 (size 0)
# Invalid argument - Failed to create new process
# [5719] clone3() with flags says: -22 expected 0
not ok 18 [5719] Result (-22) is different than expected (0)
This is because if CONFIG_TIME_NS is not set, but the flag
CLONE_NEWTIME (0x80) is used to clone a time namespace, it
will return -EINVAL in copy_time_ns().
If the kernel does not support CONFIG_TIME_NS, /proc/self/ns/time
will not exist, and then we should skip the clone3() test with
CLONE_NEWTIME.
With this patch under !CONFIG_TIME_NS:
# cd tools/testing/selftests/clone3 && make && ./clone3
...
# Time namespaces are not supported
ok 18 # SKIP Skipping clone3() with CLONE_NEWTIME
# Totals: pass:17 fail:0 xfail:0 xpass:0 skip:1 error:0
Fixes: 515bddf0ec41 ("selftests/clone3: test clone3 with CLONE_NEWTIME")
Suggested-by: Thomas Gleixner <tglx(a)linutronix.de>
Signed-off-by: Tiezhu Yang <yangtiezhu(a)loongson.cn>
---
v5:
-- Rebase on the next branch of shuah/linux-kselftest.git
to avoid potential merge conflicts due to changes in the link:
https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/c…
-- Update the commit message and send it as a single patch
Here is the v4 patch:
https://lore.kernel.org/loongarch/1685968410-5412-2-git-send-email-yangtiez…
tools/testing/selftests/clone3/clone3.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/clone3/clone3.c b/tools/testing/selftests/clone3/clone3.c
index e60cf4d..1c61e3c 100644
--- a/tools/testing/selftests/clone3/clone3.c
+++ b/tools/testing/selftests/clone3/clone3.c
@@ -196,7 +196,12 @@ int main(int argc, char *argv[])
CLONE3_ARGS_NO_TEST);
/* Do a clone3() in a new time namespace */
- test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST);
+ if (access("/proc/self/ns/time", F_OK) == 0) {
+ test_clone3(CLONE_NEWTIME, 0, 0, CLONE3_ARGS_NO_TEST);
+ } else {
+ ksft_print_msg("Time namespaces are not supported\n");
+ ksft_test_result_skip("Skipping clone3() with CLONE_NEWTIME\n");
+ }
/* Do a clone3() with exit signal (SIGCHLD) in flags */
test_clone3(SIGCHLD, 0, -EINVAL, CLONE3_ARGS_NO_TEST);
--
2.1.0
Hello,
This patchset builds upon a soon-to-be-published WIP patchset that Sean
published at https://github.com/sean-jc/linux/tree/x86/kvm_gmem_solo, mentioned
at [1].
The tree can be found at:
https://github.com/googleprodkernel/linux-cc/tree/gmem-hugetlb-rfc-v1
In this patchset, hugetlb support for KVM's guest_mem (aka gmem) is introduced,
allowing VM private memory (for confidential computing) to be backed by hugetlb
pages.
guest_mem provides userspace with a handle, with which userspace can allocate
and deallocate memory for confidential VMs without mapping the memory into
userspace.
Why use hugetlb instead of introducing a new allocator, like gmem does for 4K
and transparent hugepages?
+ hugetlb provides the following useful functionality, which would otherwise
have to be reimplemented:
+ Allocation of hugetlb pages at boot time, including
+ Parsing of kernel boot parameters to configure hugetlb
+ Tracking of usage in hstate
+ gmem will share the same system-wide pool of hugetlb pages, so users
don't have to have separate pools for hugetlb and gmem
+ Page accounting with subpools
+ hugetlb pages are tracked in subpools, which gmem uses to reserve
pages from the global hstate
+ Memory charging
+ hugetlb provides code that charges memory to cgroups
+ Reporting: hugetlb usage and availability are available at /proc/meminfo,
etc
The first 11 patches in this patchset are a series of refactorings to decouple
hugetlb and hugetlbfs.
The central thread binding the refactoring is that some functions (like
inode_resv_map(), inode_subpool(), inode_hstate(), etc) rely on a hugetlbfs
concept: that the resv_map, subpool and hstate are stored in specific fields
of a hugetlb inode.
Refactoring to parametrize functions by hstate, subpool, resv_map will allow
hugetlb to be used by gmem and in other places where these data structures
aren't necessarily stored in the same positions in the inode.
The refactoring proposed here is just the minimum required to get a
proof-of-concept working with gmem. I would like to get opinions on this
approach before doing further refactoring. (See TODOs)
TODOs:
+ hugetlb/hugetlbfs refactoring
+ remove_inode_hugepages() no longer needs to be exposed, it is hugetlbfs
specific and used only in inode.c
+ remove_mapping_hugepages(), remove_inode_single_folio(),
hugetlb_unreserve_pages() shouldn't need to take inode as a parameter
+ Updating inode->i_blocks can be refactored to a separate function and
called from hugetlbfs and gmem
+ alloc_hugetlb_folio_from_subpool() shouldn't need to be parametrized by
vma
+ hugetlb_reserve_pages() should be refactored to be symmetric with
hugetlb_unreserve_pages()
+ It should be parametrized by resv_map
+ alloc_hugetlb_folio_from_subpool() could perhaps use
hugetlb_reserve_pages()?
+ gmem
+ Figure out if resv_map should be used by gmem at all
+ Probably needs more refactoring to decouple resv_map from hugetlb
functions
Questions for the community:
1. In this patchset, every gmem file backed with hugetlb is given a new
subpool. Is that desirable?
+ In hugetlbfs, a subpool always belongs to a mount, and hugetlbfs has one
mount per hugetlb size (2M, 1G, etc)
+ memfd_create(MFD_HUGETLB) effectively returns a full hugetlbfs file, so it
(rightfully) uses the hugetlbfs kernel mounts and their subpools
+ I gave each file a subpool mostly to speed up implementation and still be
able to reserve hugetlb pages from the global hstate based on the gmem
file size.
+ gmem, unlike hugetlbfs, isn't meant to be a full filesystem, so
+ Should there be multiple mounts, one for each hugetlb size?
+ Will the mounts be initialized on boot or on first gmem file creation?
+ Or is one subpool per gmem file fine?
2. Should resv_map be used for gmem at all, since gmem doesn't allow userspace
reservations?
[1] https://lore.kernel.org/lkml/ZEM5Zq8oo+xnApW9@google.com/
---
Ackerley Tng (19):
mm: hugetlb: Expose get_hstate_idx()
mm: hugetlb: Move and expose hugetlbfs_zero_partial_page
mm: hugetlb: Expose remove_inode_hugepages
mm: hugetlb: Decouple hstate, subpool from inode
mm: hugetlb: Allow alloc_hugetlb_folio() to be parametrized by subpool
and hstate
mm: hugetlb: Provide hugetlb_filemap_add_folio()
mm: hugetlb: Refactor vma_*_reservation functions
mm: hugetlb: Refactor restore_reserve_on_error
mm: hugetlb: Use restore_reserve_on_error directly in filesystems
mm: hugetlb: Parametrize alloc_hugetlb_folio_from_subpool() by
resv_map
mm: hugetlb: Parametrize hugetlb functions by resv_map
mm: truncate: Expose preparation steps for truncate_inode_pages_final
KVM: guest_mem: Refactor kvm_gmem fd creation to be in layers
KVM: guest_mem: Refactor cleanup to separate inode and file cleanup
KVM: guest_mem: hugetlb: initialization and cleanup
KVM: guest_mem: hugetlb: allocate and truncate from hugetlb
KVM: selftests: Add basic selftests for hugetlbfs-backed guest_mem
KVM: selftests: Support various types of backing sources for private
memory
KVM: selftests: Update test for various private memory backing source
types
fs/hugetlbfs/inode.c | 102 ++--
include/linux/hugetlb.h | 86 ++-
include/linux/mm.h | 1 +
include/uapi/linux/kvm.h | 25 +
mm/hugetlb.c | 324 +++++++-----
mm/truncate.c | 24 +-
.../testing/selftests/kvm/guest_memfd_test.c | 33 +-
.../testing/selftests/kvm/include/test_util.h | 14 +
tools/testing/selftests/kvm/lib/test_util.c | 74 +++
.../kvm/x86_64/private_mem_conversions_test.c | 38 +-
virt/kvm/guest_mem.c | 488 ++++++++++++++----
11 files changed, 882 insertions(+), 327 deletions(-)
--
2.41.0.rc0.172.g3f132b7071-goog
KVM_GET_REG_LIST will dump all register IDs that are available to
KVM_GET/SET_ONE_REG, and it is very useful for identifying platform
regression issues during VM migration.
Patches 1-7 restructure the get-reg-list test in aarch64 to turn some
of the code into a common test framework that can be shared by riscv.
Patch 8 enables the KVM_GET_REG_LIST API in riscv and patches 9-10 add
the corresponding kselftest for checking possible register regressions.
The get-reg-list kvm selftest was ported from aarch64 and tested with
Linux 6.4-rc5 on a Qemu riscv64 virt machine.
---
Changed since v2:
* Rebase to Linux 6.4-rc5
* Filter out ZICBO* config and ISA_EXT registers report if the
extensions were not supported in host
* Enable AIA CSR test
* Move vCPU extension check_supported() to finalize_vcpu() per
Andrew's suggestion
* Switch to use KVM_REG_SIZE_ULONG for most registers' definition
---
Changed since v1:
* rebase to Andrew's changes
* fix coding style
Andrew Jones (7):
KVM: arm64: selftests: Replace str_with_index with strdup_printf
KVM: arm64: selftests: Drop SVE cap check in print_reg
KVM: arm64: selftests: Remove print_reg's dependency on vcpu_config
KVM: arm64: selftests: Rename vcpu_config and add to kvm_util.h
KVM: arm64: selftests: Delete core_reg_fixup
KVM: arm64: selftests: Split get-reg-list test code
KVM: arm64: selftests: Finish generalizing get-reg-list
Haibo Xu (3):
KVM: riscv: Add KVM_GET_REG_LIST API support
KVM: riscv: selftests: Skip some registers set operation
KVM: riscv: selftests: Add get-reg-list test
Documentation/virt/kvm/api.rst | 2 +-
arch/riscv/kvm/vcpu.c | 378 +++++++++++
tools/testing/selftests/kvm/Makefile | 11 +-
.../selftests/kvm/aarch64/get-reg-list.c | 540 ++--------------
tools/testing/selftests/kvm/get-reg-list.c | 421 ++++++++++++
.../selftests/kvm/include/kvm_util_base.h | 16 +
.../selftests/kvm/include/riscv/processor.h | 3 +
.../testing/selftests/kvm/include/test_util.h | 2 +
tools/testing/selftests/kvm/lib/test_util.c | 15 +
.../selftests/kvm/riscv/get-reg-list.c | 611 ++++++++++++++++++
10 files changed, 1499 insertions(+), 500 deletions(-)
create mode 100644 tools/testing/selftests/kvm/get-reg-list.c
create mode 100644 tools/testing/selftests/kvm/riscv/get-reg-list.c
--
2.34.1
When calling socket lookup from L2 (tc, xdp), VRF boundaries aren't
respected. This patchset fixes this by regarding the incoming device's
VRF attachment when performing the socket lookups from tc/xdp.
The first two patches are coding changes which factor out the tc helper's
logic which was shared with cg/sk_skb (which operate correctly).
This refactoring is needed in order to avoid affecting the cgroup/sk_skb
flows as there does not seem to be a strict criterion for discerning which
flow the helper is called from based on the net device or packet
information.
The third patch contains the actual bugfix.
The fourth patch adds bpf tests for these lookup functions.
---
v5: Use reverse xmas tree indentation
v4: - Move dev_sdif() to include/linux/netdevice.h as suggested by Stanislav Fomichev
- Remove SYS and SYS_NOFAIL duplicate definitions
v3: - Rename bpf_l2_sdif() to dev_sdif() as suggested by Stanislav Fomichev
- Added xdp tests as suggested by Daniel Borkmann
- Use start_server() to avoid duplicate code as suggested by Stanislav Fomichev
v2: Fixed uninitialized var in test patch (4).
Gilad Sever (4):
bpf: factor out socket lookup functions for the TC hookpoint.
bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC
hookpoint
bpf: fix bpf socket lookup from tc/xdp to respect socket VRF bindings
selftests/bpf: Add vrf_socket_lookup tests
include/linux/netdevice.h | 9 +
net/core/filter.c | 123 +++++--
.../bpf/prog_tests/vrf_socket_lookup.c | 312 ++++++++++++++++++
.../selftests/bpf/progs/vrf_socket_lookup.c | 88 +++++
4 files changed, 511 insertions(+), 21 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/vrf_socket_lookup.c
create mode 100644 tools/testing/selftests/bpf/progs/vrf_socket_lookup.c
--
2.34.1
PTP_SYS_OFFSET_EXTENDED was added in November 2018 in
361800876f80 ("ptp: add PTP_SYS_OFFSET_EXTENDED ioctl")
and PTP_SYS_OFFSET_PRECISE was added in February 2016 in
719f1aa4a671 ("ptp: Add PTP_SYS_OFFSET_PRECISE for driver crosstimestamping")
The PTP selftest code is lacking support for these two IOCTLs.
This short series of patches adds support for them.
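For readers unfamiliar with the extended variant: each sample it returns is a triplet of timestamps (system clock before, PHC, system clock after). A minimal sketch of how one such triplet could be turned into an offset estimate is given below. The struct and the ex_ helpers are hypothetical stand-ins rather than the UAPI types, and the midpoint estimate is one common choice, not necessarily what testptp prints.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one timestamp (seconds + nanoseconds). */
struct ex_ts {
	int64_t sec;
	uint32_t nsec;
};

static int64_t ex_ts_to_ns(struct ex_ts t)
{
	return t.sec * 1000000000LL + (int64_t)t.nsec;
}

/*
 * Estimate the PHC offset relative to the system clock from one
 * (before, phc, after) triplet, using the midpoint of the two system
 * timestamps. The window length is stored in *delay_ns if non-NULL.
 */
static int64_t ex_phc_offset_ns(struct ex_ts before, struct ex_ts phc,
				struct ex_ts after, int64_t *delay_ns)
{
	int64_t t1 = ex_ts_to_ns(before);
	int64_t tp = ex_ts_to_ns(phc);
	int64_t t2 = ex_ts_to_ns(after);

	if (delay_ns)
		*delay_ns = t2 - t1;
	return tp - (t1 + (t2 - t1) / 2);
}
```

A narrower before/after window gives a tighter bound on the error of the estimate, which is the point of bracketing the PHC read with two system timestamps.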
Alex Maftei (2):
selftests/ptp: Add -x option for testing PTP_SYS_OFFSET_EXTENDED
selftests/ptp: Add -X option for testing PTP_SYS_OFFSET_PRECISE
tools/testing/selftests/ptp/testptp.c | 71 ++++++++++++++++++++++++++-
1 file changed, 69 insertions(+), 2 deletions(-)
--
2.28.0
Currently the write operation returns the write count whether events are
enabled or disabled. Fix this by returning -ENOENT when events are disabled.
v1 -> v2:
- Change the return value from -EFAULT to -ENOENT
sunliming (3):
tracing/user_events: Fix incorrect return value for writing operation
when events are disabled
selftests/user_events: Enable the event before write_fault test in
ftrace self-test
selftests/user_events: Add test cases when event is disabled
kernel/trace/trace_events_user.c | 3 ++-
tools/testing/selftests/user_events/ftrace_test.c | 8 ++++++++
2 files changed, 10 insertions(+), 1 deletion(-)
--
2.25.1
This patch-set implements 2 small extensions to the current F_OFD_GETLK,
allowing it to gather more information than it currently returns.
The first extension allows using F_UNLCK in a query, which currently returns
EINVAL. Instead, it can be used to query the locks on a particular fd,
something that is not currently possible. The basic idea is that on
F_OFD_GETLK, F_UNLCK would "conflict" with (or query) any type of
lock on the same fd, and ignore any locks on other fds.
Use-cases:
1. CRIU-alike scenario when you want to read the locking info from an
fd for the later reconstruction. This can now be done by setting
l_start and l_len to 0 to cover entire file range, and do F_OFD_GETLK.
In the loop you need to advance l_start past the returned lock ranges,
to eventually collect all locked ranges.
2. Implementing the lock checking/enforcing policy.
Say you want to implement an "auditor" module in your program,
that checks that the I/O is done only after the proper locking is
applied on a file region. In this case you need to know if the
particular region is locked on that fd, and if so - with what type
of the lock. If you did that currently (without this extension),
you could only check for write locks, and for that you would need to
probe the lock on your fd and then open the same file via another fd and
probe there. That way you can identify the write lock on a particular
fd, but such a trick is non-atomic and complex. As for finding out the
read lock on a particular fd, that is impossible.
This extension allows such queries without any extra effort.
3. Implementing the mandatory locking policy.
Suppose you want to make a policy where the write lock inhibits any
unlocked readers and writers. Currently you need to check if the
write lock is present on some other fd, and if it is not there - allow
the I/O operation. But because the write lock can appear at any moment,
you need to do that under some global lock, which can be released only
when the I/O operation is finished.
With the proposed extension you can instead just check the write lock
on your own fd first, and if it is there - allow the I/O operation on
that fd without using any global lock. Only if there is no write lock
on this fd, then you need to take global lock and check for a write
lock on other fds.
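The loop from use-case 1 could look roughly like the sketch below. It relies on the behaviour proposed in this series: on a kernel without the extension, F_OFD_GETLK with F_UNLCK fails with EINVAL and the helper returns -1. The ex_collect_ofd_locks() name is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Collect up to max locked ranges held via fd into out[].
 * Returns the number of ranges found, or -1 if the query fails
 * (for example EINVAL on a kernel without the F_UNLCK extension).
 */
static int ex_collect_ofd_locks(int fd, struct flock *out, int max)
{
	off_t start = 0;
	int n = 0;

	while (n < max) {
		struct flock fl;

		memset(&fl, 0, sizeof(fl));	/* l_pid must be 0 for OFD commands */
		fl.l_type = F_UNLCK;		/* query any lock type on this fd */
		fl.l_whence = SEEK_SET;
		fl.l_start = start;
		fl.l_len = 0;			/* from start to end of file */

		if (fcntl(fd, F_OFD_GETLK, &fl) == -1)
			return -1;
		if (fl.l_type == F_UNLCK)
			break;			/* no more locks past start */

		out[n++] = fl;
		if (fl.l_len == 0)
			break;			/* lock extends to end of file */
		start = fl.l_start + fl.l_len;	/* advance past this range */
	}
	return n;
}
```

Each returned struct flock carries the type and range of one lock held through that fd, which is the information a checkpoint tool would need to re-create the locks later.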
The second patch implements another extension.
Currently F_OFD_GETLK returns -1 in the l_pid member.
This patch removes the code that writes -1 there, so that the proper
pid is returned. I am not sure why it was decided to deliberately hide
the owner's pid. It may be needed in case you want to send some
message to the offending locker, like eg SIGKILL.
The third patch adds a test-case for OFD locks.
It tests both the generic things and the proposed extensions.
Stas Sergeev (3):
fs/locks: F_UNLCK extension for F_OFD_GETLK
fd/locks: allow get the lock owner by F_OFD_GETLK
selftests: add OFD lock tests
fs/locks.c | 25 +++-
tools/testing/selftests/locking/Makefile | 2 +
tools/testing/selftests/locking/ofdlocks.c | 135 +++++++++++++++++++++
3 files changed, 157 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/locking/ofdlocks.c
CC: Jeff Layton <jlayton(a)kernel.org>
CC: Chuck Lever <chuck.lever(a)oracle.com>
CC: Alexander Viro <viro(a)zeniv.linux.org.uk>
CC: Christian Brauner <brauner(a)kernel.org>
CC: linux-fsdevel(a)vger.kernel.org
CC: linux-kernel(a)vger.kernel.org
CC: Shuah Khan <shuah(a)kernel.org>
CC: linux-kselftest(a)vger.kernel.org
--
2.39.2
This series adds Intel VT-d nested translation based on the IOMMUFD nesting
infrastructure. As described in the iommufd nesting infrastructure series [1],
the iommu core supports new ops to report iommu hardware information, allocate
domains with user data and sync stage-1 IOTLB. The data required in
the three paths are vendor-specific, so
1) IOMMU_HW_INFO_TYPE_INTEL_VTD and struct iommu_device_info_vtd are
defined to report iommu hardware information for Intel VT-d.
2) IOMMU_HWPT_DATA_VTD_S1 is defined for the Intel VT-d stage-1 page
table, it will be used in the stage-1 domain allocation and IOTLB
syncing path. struct iommu_hwpt_intel_vtd is defined to pass user_data
for the Intel VT-d stage-1 domain allocation.
struct iommu_hwpt_invalidate_intel_vtd is defined to pass the data for
the Intel VT-d stage-1 IOTLB invalidation.
With above IOMMUFD extensions, the intel iommu driver implements the three
paths to support nested translation.
The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.
A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.
Intel-iommu driver is armed with necessary checks to prevent such mix
in patch10 of this series.
Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.
In reality we don't know of any usage relying on DMA reads from the BIOS
region. Hence finding a way to allow the user to opt out of RO mappings in
Qemu might be an acceptable tradeoff. But how to achieve it cleanly
needs more discussion in the Qemu community. For now we just hacked Qemu
to test.
Complete code can be found in [3]; the QEMU code can be found in [4].
base-commit: ce9b593b1f74ccd090edc5d2ad397da84baa9946
[1] https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.c…
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-deta…
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/wip/iommufd_rfcv4.mig.reset.v4_var3%…
Change log:
v3:
- Further split the patches into an order of adding helpers for nested
domain, iotlb flush, nested domain attachment and nested domain allocation
callback, then report the hw_info to userspace.
- Add batch support in cache invalidation from userspace
- Disallow nested translation usage if RO mappings exists in stage-2 domain
due to errata on readonly mappings on Sapphire Rapids platform.
v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.…
- The iommufd infrastructure is split to be separate series.
v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.c…
Regards,
Yi Liu
Lu Baolu (5):
iommu/vt-d: Extend dmar_domain to support nested domain
iommu/vt-d: Add helper for nested domain allocation
iommu/vt-d: Add helper to setup pasid nested translation
iommu/vt-d: Add nested domain allocation
iommu/vt-d: Disallow nesting on domains with read-only mappings
Yi Liu (5):
iommufd: Add data structure for Intel VT-d stage-1 domain allocation
iommu/vt-d: Make domain attach helpers to be extern
iommu/vt-d: Set the nested domain to a device
iommu/vt-d: Add iotlb flush for nested domain
iommu/vt-d: Implement hw_info for iommu capability query
drivers/iommu/intel/Makefile | 2 +-
drivers/iommu/intel/iommu.c | 78 ++++++++++++---
drivers/iommu/intel/iommu.h | 55 +++++++++--
drivers/iommu/intel/nested.c | 181 +++++++++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.c | 151 +++++++++++++++++++++++++++++
drivers/iommu/intel/pasid.h | 2 +
drivers/iommu/iommufd/main.c | 6 ++
include/linux/iommu.h | 1 +
include/uapi/linux/iommufd.h | 149 ++++++++++++++++++++++++++++
9 files changed, 603 insertions(+), 22 deletions(-)
create mode 100644 drivers/iommu/intel/nested.c
--
2.34.1
To cover this case, 'maxlen' is set to 0, for the following
reason:
EVIOCGKEY is executed from evdev_do_ioctl(), which is called from
evdev_ioctl_handler().
evdev_ioctl_handler() is called from 2 functions, where by code coverage,
only the first one is in use.
‘compat’ is given the value ‘0’ [1].
Thus, the condition [2] is always false.
This means ‘len’ always equals a positive number [3]
‘maxlen’ in evdev_handle_get_val [4] is defined locally in
evdev_do_ioctl() [5], and is sent in the variable 'size' [6]
[1] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1281
[2] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L705
[3] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L707
[4] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L886
[5] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1155
[6] https://elixir.bootlin.com/linux/v6.2/source/drivers/input/evdev.c#L1141
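The length clamp that this test exercises can be modelled in plain C as below. This is a simplified sketch of the non-compat path only; the ex_ names are hypothetical and the 0x2ff bit count used in the usage note is assumed to be KEY_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of longs needed to hold nbits bits. */
#define EX_BITS_TO_LONGS(nbits) \
	(((nbits) + sizeof(long) * CHAR_BIT - 1) / (sizeof(long) * CHAR_BIT))

/*
 * Number of bytes that would be copied to userspace: the size of the
 * bitmap rounded up to whole longs, clamped to the caller-supplied maxlen.
 */
static size_t ex_bits_copy_len(unsigned int maxbit, size_t maxlen)
{
	size_t len = EX_BITS_TO_LONGS(maxbit) * sizeof(long);

	if (len > maxlen)
		len = maxlen;	/* the len > maxlen branch the test targets */
	return len;
}
```

With a size of 0 encoded in the ioctl, as in EVIOCGKEY(0), the clamp yields 0 bytes, which is consistent with the test expecting a return value of 0.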
Signed-off-by: Dana Elfassy <dangel101(a)gmail.com>
---
tools/testing/selftests/input/evioc-test.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/tools/testing/selftests/input/evioc-test.c b/tools/testing/selftests/input/evioc-test.c
index ad7b93fe39cf..b94de2ee5596 100644
--- a/tools/testing/selftests/input/evioc-test.c
+++ b/tools/testing/selftests/input/evioc-test.c
@@ -234,4 +234,23 @@ TEST(eviocsrep_set_repeat_settings)
selftest_uinput_destroy(uidev);
}
+TEST(eviocgkey_get_global_key_state)
+{
+ struct selftest_uinput *uidev;
+ int rep_values[2];
+ int rc;
+
+ memset(rep_values, 0, sizeof(rep_values));
+
+ rc = selftest_uinput_create_device(&uidev);
+ ASSERT_EQ(0, rc);
+ ASSERT_NE(NULL, uidev);
+
+ /* ioctl to create the scenario where len > maxlen in bits_to_user() */
+ rc = ioctl(uidev->evdev_fd, EVIOCGKEY(0), rep_values);
+ ASSERT_EQ(0, rc);
+
+ selftest_uinput_destroy(uidev);
+}
+
TEST_HARNESS_MAIN
--
2.41.0
From: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 4acfe3dfde685a5a9eaec5555351918e2d7266a1 ]
Dan Carpenter spotted a race condition in a couple of situations like
these in the test_firmware driver:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
	u8 val;
	int ret;
	ret = kstrtou8(buf, 10, &val);
	if (ret)
		return ret;
	mutex_lock(&test_fw_mutex);
	*(u8 *)cfg = val;
	mutex_unlock(&test_fw_mutex);
	/* Always return full write size even if we didn't consume all */
	return size;
}
static ssize_t config_num_requests_store(struct device *dev,
					 struct device_attribute *attr,
					 const char *buf, size_t count)
{
	int rc;
	mutex_lock(&test_fw_mutex);
	if (test_fw_config->reqs) {
		pr_err("Must call release_all_firmware prior to changing config\n");
		rc = -EINVAL;
		mutex_unlock(&test_fw_mutex);
		goto out;
	}
	mutex_unlock(&test_fw_mutex);
	rc = test_dev_config_update_u8(buf, count,
				       &test_fw_config->num_requests);
out:
	return rc;
}
static ssize_t config_read_fw_idx_store(struct device *dev,
					struct device_attribute *attr,
					const char *buf, size_t count)
{
	return test_dev_config_update_u8(buf, count,
					 &test_fw_config->read_fw_idx);
}
The function test_dev_config_update_u8() is called from both the locked
and the unlocked context, function config_num_requests_store() and
config_read_fw_idx_store() which can both be called asynchronously as
they are driver's methods, while test_dev_config_update_u8() and siblings
change their argument pointed to by u8 *cfg or similar pointer.
To avoid deadlock on test_fw_mutex, the lock is dropped before calling
test_dev_config_update_u8() and re-acquired within test_dev_config_update_u8()
itself, but alas this creates a race condition.
Having two locks wouldn't assure a race-proof mutual exclusion.
This situation is best avoided by the introduction of a new, unlocked
function __test_dev_config_update_u8() which can be called from the locked
context and reducing test_dev_config_update_u8() to:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
	int ret;
	mutex_lock(&test_fw_mutex);
	ret = __test_dev_config_update_u8(buf, size, cfg);
	mutex_unlock(&test_fw_mutex);
	return ret;
}
doing the locking and calling the unlocked primitive, which enables both
locked and unlocked versions without duplication of code.
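The same pattern can be tried out in userspace with a pthread mutex. The sketch below is an analogy, not the driver code: all ex_ names are hypothetical, strtoul() stands in for kstrtou8(), and the helpers return 0 on success rather than the write size.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t ex_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned char ex_num_requests;
static int ex_busy;	/* stands in for test_fw_config->reqs being set */

/* Unlocked primitive: the caller must already hold ex_lock. */
static int ex_update_u8_nolock(const char *buf, unsigned char *cfg)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno || end == buf || val > 255)
		return -EINVAL;
	*cfg = (unsigned char)val;
	return 0;
}

/* Locked wrapper for callers that do not hold ex_lock. */
static int ex_update_u8(const char *buf, unsigned char *cfg)
{
	int ret;

	pthread_mutex_lock(&ex_lock);
	ret = ex_update_u8_nolock(buf, cfg);
	pthread_mutex_unlock(&ex_lock);
	return ret;
}

/* Check and update under one critical section, with no unlock/relock gap. */
static int ex_store_num_requests(const char *buf)
{
	int ret;

	pthread_mutex_lock(&ex_lock);
	if (ex_busy)
		ret = -EINVAL;
	else
		ret = ex_update_u8_nolock(buf, &ex_num_requests);
	pthread_mutex_unlock(&ex_lock);
	return ret;
}
```

Because the busy check and the update happen while the lock is held continuously, no other thread can change the state between them.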
A similar approach was applied to all functions called from both the locked
and the unlocked context, which safely mitigates both deadlocks and race
conditions in the driver.
The unlocked versions __test_dev_config_update_bool(),
__test_dev_config_update_u8() and __test_dev_config_update_size_t()
were introduced to be called from the locked contexts as a workaround,
without releasing the main driver's lock and thereby causing a race
condition.
The test_dev_config_update_bool(), test_dev_config_update_u8() and
test_dev_config_update_size_t() locked versions of the functions
are being called from driver methods without the unnecessary multiplying
of the locking and unlocking code for each method, and complicating
the code with saving of the return value across lock.
Fixes: 7feebfa487b92 ("test_firmware: add support for request_firmware_into_buf")
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Russ Weight <russell.h.weight(a)intel.com>
Cc: Takashi Iwai <tiwai(a)suse.de>
Cc: Tianfei Zhang <tianfei.zhang(a)intel.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Colin Ian King <colin.i.king(a)gmail.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: linux-kselftest(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # v5.4
Suggested-by: Dan Carpenter <error27(a)gmail.com>
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Link: https://lore.kernel.org/r/20230509084746.48259-1-mirsad.todorovac@alu.unizg…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
lib/test_firmware.c | 52 ++++++++++++++++++++++++++++++---------------
1 file changed, 35 insertions(+), 17 deletions(-)
diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index b99cf0a50a698..4884057eb53f0 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -321,16 +321,26 @@ static ssize_t config_test_show_str(char *dst,
return len;
}
-static int test_dev_config_update_bool(const char *buf, size_t size,
+static inline int __test_dev_config_update_bool(const char *buf, size_t size,
bool *cfg)
{
int ret;
- mutex_lock(&test_fw_mutex);
if (kstrtobool(buf, cfg) < 0)
ret = -EINVAL;
else
ret = size;
+
+ return ret;
+}
+
+static int test_dev_config_update_bool(const char *buf, size_t size,
+ bool *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_bool(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
@@ -341,7 +351,8 @@ static ssize_t test_dev_config_show_bool(char *buf, bool val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_size_t(const char *buf,
+static int __test_dev_config_update_size_t(
+ const char *buf,
size_t size,
size_t *cfg)
{
@@ -352,9 +363,7 @@ static int test_dev_config_update_size_t(const char *buf,
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(size_t *)cfg = new;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
@@ -370,7 +379,7 @@ static ssize_t test_dev_config_show_int(char *buf, int val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+static int __test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
@@ -379,14 +388,23 @@ static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
+static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_u8(buf, size, cfg);
+ mutex_unlock(&test_fw_mutex);
+
+ return ret;
+}
+
static ssize_t test_dev_config_show_u8(char *buf, u8 val)
{
return snprintf(buf, PAGE_SIZE, "%u\n", val);
@@ -413,10 +431,10 @@ static ssize_t config_num_requests_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_u8(buf, count,
- &test_fw_config->num_requests);
+ rc = __test_dev_config_update_u8(buf, count,
+ &test_fw_config->num_requests);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -460,10 +478,10 @@ static ssize_t config_buf_size_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->buf_size);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->buf_size);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -490,10 +508,10 @@ static ssize_t config_file_offset_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->file_offset);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->file_offset);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
--
2.39.2
From: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 4acfe3dfde685a5a9eaec5555351918e2d7266a1 ]
Dan Carpenter spotted a race condition in a couple of situations like
these in the test_firmware driver:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
ret = kstrtou8(buf, 10, &val);
if (ret)
return ret;
mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
static ssize_t config_num_requests_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
int rc;
mutex_lock(&test_fw_mutex);
if (test_fw_config->reqs) {
pr_err("Must call release_all_firmware prior to changing config\n");
rc = -EINVAL;
mutex_unlock(&test_fw_mutex);
goto out;
}
mutex_unlock(&test_fw_mutex);
rc = test_dev_config_update_u8(buf, count,
&test_fw_config->num_requests);
out:
return rc;
}
static ssize_t config_read_fw_idx_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
return test_dev_config_update_u8(buf, count,
&test_fw_config->read_fw_idx);
}
The function test_dev_config_update_u8() is called from both the locked
and the unlocked context, namely from config_num_requests_store() and
config_read_fw_idx_store(), which can both be called asynchronously as
they are driver methods, while test_dev_config_update_u8() and its siblings
modify the object pointed to by u8 *cfg or a similar pointer.
To avoid deadlock on test_fw_mutex, the lock is dropped before calling
test_dev_config_update_u8() and re-acquired within test_dev_config_update_u8()
itself, but alas this creates a race condition.
Taking the lock twice in separate critical sections does not assure
race-proof mutual exclusion.
This situation is best avoided by the introduction of a new, unlocked
function __test_dev_config_update_u8() which can be called from the locked
context and reducing test_dev_config_update_u8() to:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
int ret;
mutex_lock(&test_fw_mutex);
ret = __test_dev_config_update_u8(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
}
doing the locking and calling the unlocked primitive, which enables both
locked and unlocked versions without duplication of code.
A similar approach was applied to all functions called from both the locked
and the unlocked context, which safely mitigates both deadlocks and race
conditions in the driver.
The unlocked versions __test_dev_config_update_bool(),
__test_dev_config_update_u8() and __test_dev_config_update_size_t()
were introduced to be called from the locked contexts without releasing
the main driver's lock, which would otherwise cause a race condition.
The locked versions test_dev_config_update_bool(),
test_dev_config_update_u8() and test_dev_config_update_size_t()
are called from driver methods, which avoids duplicating the locking and
unlocking code in each method and complicating the code by saving the
return value across the lock.
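The locked/unlocked split described above can be sketched in userspace C
with a pthread mutex. This is only an illustration, not the driver code:
struct demo_config, cfg_mutex and the demo_* functions are hypothetical
names, the busy flag merely stands in for test_fw_config->reqs, and the
unlocked primitive carries a _nolock suffix because identifiers with a
leading double underscore are reserved in userspace C.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's configuration and its mutex. */
struct demo_config {
	unsigned char num_requests;
	unsigned char read_idx;
	int busy;	/* plays the role of test_fw_config->reqs */
};

static struct demo_config cfg;
static pthread_mutex_t cfg_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Unlocked primitive: the caller must already hold cfg_mutex. */
static long demo_update_u8_nolock(const char *buf, size_t size,
				  unsigned char *field)
{
	char *end;
	unsigned long val;

	assert(field != NULL);
	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno || end == buf || val > 255)
		return -EINVAL;

	*field = (unsigned char)val;
	return (long)size;
}

/* Locked wrapper for callers that do not hold cfg_mutex. */
static long demo_update_u8(const char *buf, size_t size, unsigned char *field)
{
	long ret;

	pthread_mutex_lock(&cfg_mutex);
	ret = demo_update_u8_nolock(buf, size, field);
	pthread_mutex_unlock(&cfg_mutex);
	return ret;
}

/* The check and the update share one critical section. */
static long demo_num_requests_store(const char *buf, size_t size)
{
	long ret;

	pthread_mutex_lock(&cfg_mutex);
	if (cfg.busy)
		ret = -EINVAL;
	else
		ret = demo_update_u8_nolock(buf, size, &cfg.num_requests);
	pthread_mutex_unlock(&cfg_mutex);
	return ret;
}
```

Because demo_num_requests_store() never drops the mutex between testing
the busy flag and writing the field, no other thread can change the state
in between, which is the window the original code left open.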
Fixes: 7feebfa487b92 ("test_firmware: add support for request_firmware_into_buf")
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Russ Weight <russell.h.weight(a)intel.com>
Cc: Takashi Iwai <tiwai(a)suse.de>
Cc: Tianfei Zhang <tianfei.zhang(a)intel.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Colin Ian King <colin.i.king(a)gmail.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: linux-kselftest(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # v5.4
Suggested-by: Dan Carpenter <error27(a)gmail.com>
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Link: https://lore.kernel.org/r/20230509084746.48259-1-mirsad.todorovac@alu.unizg…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
lib/test_firmware.c | 52 ++++++++++++++++++++++++++++++---------------
1 file changed, 35 insertions(+), 17 deletions(-)
diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index 0b4e3de3f1748..4ad01dbe7e729 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -321,16 +321,26 @@ static ssize_t config_test_show_str(char *dst,
return len;
}
-static int test_dev_config_update_bool(const char *buf, size_t size,
+static inline int __test_dev_config_update_bool(const char *buf, size_t size,
bool *cfg)
{
int ret;
- mutex_lock(&test_fw_mutex);
if (kstrtobool(buf, cfg) < 0)
ret = -EINVAL;
else
ret = size;
+
+ return ret;
+}
+
+static int test_dev_config_update_bool(const char *buf, size_t size,
+ bool *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_bool(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
@@ -341,7 +351,8 @@ static ssize_t test_dev_config_show_bool(char *buf, bool val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_size_t(const char *buf,
+static int __test_dev_config_update_size_t(
+ const char *buf,
size_t size,
size_t *cfg)
{
@@ -352,9 +363,7 @@ static int test_dev_config_update_size_t(const char *buf,
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(size_t *)cfg = new;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
@@ -370,7 +379,7 @@ static ssize_t test_dev_config_show_int(char *buf, int val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+static int __test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
@@ -379,14 +388,23 @@ static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
+static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_u8(buf, size, cfg);
+ mutex_unlock(&test_fw_mutex);
+
+ return ret;
+}
+
static ssize_t test_dev_config_show_u8(char *buf, u8 val)
{
return snprintf(buf, PAGE_SIZE, "%u\n", val);
@@ -413,10 +431,10 @@ static ssize_t config_num_requests_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_u8(buf, count,
- &test_fw_config->num_requests);
+ rc = __test_dev_config_update_u8(buf, count,
+ &test_fw_config->num_requests);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -460,10 +478,10 @@ static ssize_t config_buf_size_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->buf_size);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->buf_size);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -490,10 +508,10 @@ static ssize_t config_file_offset_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->file_offset);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->file_offset);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
--
2.39.2
From: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
[ Upstream commit 4acfe3dfde685a5a9eaec5555351918e2d7266a1 ]
Dan Carpenter spotted a race condition in a couple of situations like
these in the test_firmware driver:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
ret = kstrtou8(buf, 10, &val);
if (ret)
return ret;
mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
static ssize_t config_num_requests_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
int rc;
mutex_lock(&test_fw_mutex);
if (test_fw_config->reqs) {
pr_err("Must call release_all_firmware prior to changing config\n");
rc = -EINVAL;
mutex_unlock(&test_fw_mutex);
goto out;
}
mutex_unlock(&test_fw_mutex);
rc = test_dev_config_update_u8(buf, count,
&test_fw_config->num_requests);
out:
return rc;
}
static ssize_t config_read_fw_idx_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
{
return test_dev_config_update_u8(buf, count,
&test_fw_config->read_fw_idx);
}
The function test_dev_config_update_u8() is called from both the locked
and the unlocked context, namely from config_num_requests_store() and
config_read_fw_idx_store(), which can both be called asynchronously as
they are driver methods, while test_dev_config_update_u8() and its siblings
modify the object pointed to by u8 *cfg or a similar pointer.
To avoid deadlock on test_fw_mutex, the lock is dropped before calling
test_dev_config_update_u8() and re-acquired within test_dev_config_update_u8()
itself, but alas this creates a race condition.
Taking the lock twice in separate critical sections does not assure
race-proof mutual exclusion.
This situation is best avoided by the introduction of a new, unlocked
function __test_dev_config_update_u8() which can be called from the locked
context and reducing test_dev_config_update_u8() to:
static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
int ret;
mutex_lock(&test_fw_mutex);
ret = __test_dev_config_update_u8(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
}
doing the locking and calling the unlocked primitive, which enables both
locked and unlocked versions without duplication of code.
A similar approach was applied to all functions called from both the locked
and the unlocked context, which safely mitigates both deadlocks and race
conditions in the driver.
The unlocked versions __test_dev_config_update_bool(),
__test_dev_config_update_u8() and __test_dev_config_update_size_t()
were introduced to be called from the locked contexts without releasing
the main driver's lock, which would otherwise cause a race condition.
The locked versions test_dev_config_update_bool(),
test_dev_config_update_u8() and test_dev_config_update_size_t()
are called from driver methods, which avoids duplicating the locking and
unlocking code in each method and complicating the code by saving the
return value across the lock.
Fixes: 7feebfa487b92 ("test_firmware: add support for request_firmware_into_buf")
Cc: Luis Chamberlain <mcgrof(a)kernel.org>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: Russ Weight <russell.h.weight(a)intel.com>
Cc: Takashi Iwai <tiwai(a)suse.de>
Cc: Tianfei Zhang <tianfei.zhang(a)intel.com>
Cc: Shuah Khan <shuah(a)kernel.org>
Cc: Colin Ian King <colin.i.king(a)gmail.com>
Cc: Randy Dunlap <rdunlap(a)infradead.org>
Cc: linux-kselftest(a)vger.kernel.org
Cc: stable(a)vger.kernel.org # v5.4
Suggested-by: Dan Carpenter <error27(a)gmail.com>
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Link: https://lore.kernel.org/r/20230509084746.48259-1-mirsad.todorovac@alu.unizg…
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal(a)kernel.org>
---
lib/test_firmware.c | 52 ++++++++++++++++++++++++++++++---------------
1 file changed, 35 insertions(+), 17 deletions(-)
diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index 6ef3e6926da8a..13d3fa6aa972c 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -360,16 +360,26 @@ static ssize_t config_test_show_str(char *dst,
return len;
}
-static int test_dev_config_update_bool(const char *buf, size_t size,
+static inline int __test_dev_config_update_bool(const char *buf, size_t size,
bool *cfg)
{
int ret;
- mutex_lock(&test_fw_mutex);
if (kstrtobool(buf, cfg) < 0)
ret = -EINVAL;
else
ret = size;
+
+ return ret;
+}
+
+static int test_dev_config_update_bool(const char *buf, size_t size,
+ bool *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_bool(buf, size, cfg);
mutex_unlock(&test_fw_mutex);
return ret;
@@ -380,7 +390,8 @@ static ssize_t test_dev_config_show_bool(char *buf, bool val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_size_t(const char *buf,
+static int __test_dev_config_update_size_t(
+ const char *buf,
size_t size,
size_t *cfg)
{
@@ -391,9 +402,7 @@ static int test_dev_config_update_size_t(const char *buf,
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(size_t *)cfg = new;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
@@ -409,7 +418,7 @@ static ssize_t test_dev_config_show_int(char *buf, int val)
return snprintf(buf, PAGE_SIZE, "%d\n", val);
}
-static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+static int __test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
{
u8 val;
int ret;
@@ -418,14 +427,23 @@ static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
if (ret)
return ret;
- mutex_lock(&test_fw_mutex);
*(u8 *)cfg = val;
- mutex_unlock(&test_fw_mutex);
/* Always return full write size even if we didn't consume all */
return size;
}
+static int test_dev_config_update_u8(const char *buf, size_t size, u8 *cfg)
+{
+ int ret;
+
+ mutex_lock(&test_fw_mutex);
+ ret = __test_dev_config_update_u8(buf, size, cfg);
+ mutex_unlock(&test_fw_mutex);
+
+ return ret;
+}
+
static ssize_t test_dev_config_show_u8(char *buf, u8 val)
{
return snprintf(buf, PAGE_SIZE, "%u\n", val);
@@ -478,10 +496,10 @@ static ssize_t config_num_requests_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_u8(buf, count,
- &test_fw_config->num_requests);
+ rc = __test_dev_config_update_u8(buf, count,
+ &test_fw_config->num_requests);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -525,10 +543,10 @@ static ssize_t config_buf_size_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->buf_size);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->buf_size);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
@@ -555,10 +573,10 @@ static ssize_t config_file_offset_store(struct device *dev,
mutex_unlock(&test_fw_mutex);
goto out;
}
- mutex_unlock(&test_fw_mutex);
- rc = test_dev_config_update_size_t(buf, count,
- &test_fw_config->file_offset);
+ rc = __test_dev_config_update_size_t(buf, count,
+ &test_fw_config->file_offset);
+ mutex_unlock(&test_fw_mutex);
out:
return rc;
--
2.39.2
This is part of the effort to remove the empty element of the ctl_table
structures (used to calculate size) and replace it with an ARRAY_SIZE call. By
replacing the child element in struct ctl_table with a flags element we make
sure that there are no forward recursions on child nodes and therefore set
ourselves up for just using an ARRAY_SIZE. We also added some self tests to
make sure that we do not break anything.
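As a rough userspace illustration of the difference between a table
terminated by an empty element and one sized with ARRAY_SIZE (struct
demo_entry and both tables are hypothetical and are not the kernel's
struct ctl_table):

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical table entry, much simpler than struct ctl_table. */
struct demo_entry {
	const char *name;
	int mode;
};

/* Old style: a trailing empty element marks the end of the table. */
static const struct demo_entry table_with_sentinel[] = {
	{ "alpha", 0644 },
	{ "beta",  0444 },
	{ "gamma", 0600 },
	{ NULL, 0 },
};

/* New style: no terminator, the size comes from ARRAY_SIZE(). */
static const struct demo_entry table_sized[] = {
	{ "alpha", 0644 },
	{ "beta",  0444 },
	{ "gamma", 0600 },
};

/* Walk the table until the empty element is found. */
static size_t count_until_sentinel(const struct demo_entry *t)
{
	size_t n = 0;

	assert(t != NULL);
	while (t[n].name)
		n++;
	return n;
}
```

The sentinel walk only needs a pointer, while ARRAY_SIZE() works only
where the array definition itself is visible, so code that registers a
table would have to receive the size explicitly.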
The patchset is separated into 4 parts: parport fixes, selftests fixes, selftests additions and
replacement of the child element. Tested everything with the sysctl self tests and everything
seems "ok".
1. parport fixes: This is related to my previous series and it plugs a sysctl
table leak in the parport driver. @mcgrof: I'm just leaving this here so we
don't have to retest the parport stuff
2. Selftests fixes: Remove the prefixed zeros when passing an awk field to the
awk print command because it was causing $0009 to be interpreted as $0.
Replaced continue with return in sysctl.sh(test_case) so the test actually
gets skipped. The skip decision is now in sysctl.sh(skip_test).
3. Selftest additions: New test to confirm that unregister actually removes
targets. New test to confirm that permanently empty targets are indeed
created and that no other targets can be created "on top".
4. Replaced the child pointer in struct ctl_table with an enum which is used to
differentiate between permanently empty targets and non-empty ones.
V2: Replaced the u8 flag with an enumeration.
Comments/feedback greatly appreciated
Best
Joel
Joel Granados (8):
parport: plug a sysctl register leak
test_sysctl: Fix test metadata getters
test_sysctl: Group node sysctl test under one func
test_sysctl: Add an unregister sysctl test
test_sysctl: Add an option to prevent test skip
test_sysclt: Test for registering a mount point
sysctl: Remove debugging dump_stack
sysctl: replace child with an enumeration
drivers/parport/procfs.c | 23 ++---
fs/proc/proc_sysctl.c | 82 ++++------------
include/linux/sysctl.h | 14 ++-
lib/test_sysctl.c | 91 ++++++++++++++++--
tools/testing/selftests/sysctl/sysctl.sh | 115 +++++++++++++++++------
5 files changed, 214 insertions(+), 111 deletions(-)
--
2.30.2
The Events Tracing infrastructure contains a lot of files and directories
(internally represented as inodes and dentries), which end up consuming
several MB of memory. There can be multiple instances of Events Tracing, which
require even more memory.
Instead of creating inodes/dentries up front, eventfs could keep only meta-data
and skip the creation of inodes/dentries. When required, eventfs will
create the inodes/dentries only for the files/directories that are needed.
eventfs would also delete the inodes/dentries once they are no longer required,
but preserve the meta-data.
Tracing events took ~9MB; with this approach it took ~4.5MB
for ~10K files/dirs.
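The idea of keeping only meta-data and creating the heavy object on demand
can be sketched in userspace C as follows. This is not the eventfs code:
struct demo_meta, struct demo_node, demo_get() and demo_put() are
hypothetical, and a plain heap allocation stands in for an inode/dentry
pair.

```c
#include <assert.h>
#include <stdlib.h>

/* Heavy object, standing in for an inode/dentry pair. */
struct demo_node {
	char payload[256];
};

/* Light meta-data that is always kept around. */
struct demo_meta {
	const char *name;
	struct demo_node *node;	/* NULL until somebody needs it */
	unsigned int refs;
};

/* Create the heavy object on first use and take a reference. */
static struct demo_node *demo_get(struct demo_meta *m)
{
	if (!m->node) {
		m->node = calloc(1, sizeof(*m->node));
		if (!m->node)
			return NULL;
	}
	m->refs++;
	return m->node;
}

/* Drop a reference; free the heavy object but keep the meta-data. */
static void demo_put(struct demo_meta *m)
{
	assert(m->refs > 0);
	if (--m->refs == 0) {
		free(m->node);
		m->node = NULL;
	}
}
```

With many entries that are never looked up, only the small demo_meta
records stay resident, which is the kind of saving the numbers above
describe.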
Diff from v1:
Patch 1: add header file
Patch 2: resolved kernel test robot issues
protecting eventfs lists using nested eventfs_rwsem
Patch 3: protecting eventfs lists using nested eventfs_rwsem
Patch 4: improve events cleanup code to fix crashes
Patch 5: resolved kernel test robot issues
removed d_instantiate_anon() calls
Patch 6: resolved kernel test robot issues
fix kprobe test in eventfs_root_lookup()
protecting eventfs lists using nested eventfs_rwsem
Patch 7: remove header file
Patch 8: pass eventfs_rwsem as argument to eventfs functions
called eventfs_remove_events_dir() instead of tracefs_remove()
from event_trace_del_tracer()
Patch 9: new patch to fix kprobe test case
fs/tracefs/Makefile | 1 +
fs/tracefs/event_inode.c | 761 ++++++++++++++++++
fs/tracefs/inode.c | 124 ++-
fs/tracefs/internal.h | 25 +
include/linux/trace_events.h | 1 +
include/linux/tracefs.h | 49 ++
kernel/trace/trace.h | 3 +-
kernel/trace/trace_events.c | 66 +-
.../ftrace/test.d/kprobe/kprobe_args_char.tc | 4 +-
.../test.d/kprobe/kprobe_args_string.tc | 4 +-
10 files changed, 992 insertions(+), 46 deletions(-)
create mode 100644 fs/tracefs/event_inode.c
create mode 100644 fs/tracefs/internal.h
--
2.39.0
Some test cases from net/tls, net/fcnal-test and net/vrf-xfrm-tests
that rely on cryptographic functions and use algorithms that are not
FIPS-compliant fail in FIPS mode.
In order to allow these tests to pass in a wider set of kernels,
- for net/tls, skip the test variants that use the ChaCha20-Poly1305
and SM4 algorithms, when FIPS mode is enabled;
- for net/fcnal-test, skip the MD5 tests, when FIPS mode is enabled;
- for net/vrf-xfrm-tests, replace the algorithms that are not
FIPS-compliant with compliant ones.
Changes in v4:
- Remove extra newline.
- Add R-b tag.
Changes in v3:
- Add new commit to allow skipping test directly from test setup.
- No need to initialize static variable to zero.
- Skip tests during test setup only.
- Use the constructor attribute to set fips_enabled before entering
main().
Changes in v2:
- Add R-b tags.
- Put fips_non_compliant into the variants.
- Turn fips_enabled into a static global variable.
- Read /proc/sys/crypto/fips_enabled only once at main().
v1: https://lore.kernel.org/netdev/20230607174302.19542-1-magali.lemes@canonica…
v2: https://lore.kernel.org/netdev/20230609164324.497813-1-magali.lemes@canonic…
v3: https://lore.kernel.org/netdev/20230612125107.73795-1-magali.lemes@canonica…
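A minimal sketch of the approach named in the changelogs above, reading
/proc/sys/crypto/fips_enabled once from a constructor before main() runs.
The helper read_fips_flag() and the function name fips_check() are
assumptions made for illustration; the actual patch may structure this
differently.

```c
#include <assert.h>
#include <stdio.h>

/* Static storage is zero-initialized, so no explicit "= 0" is needed. */
static int fips_enabled;

/* Hypothetical helper: returns 1 if the file holds a non-zero integer. */
static int read_fips_flag(const char *path)
{
	FILE *f;
	int val = 0;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return 0;	/* file missing: treat as "FIPS mode off" */
	if (fscanf(f, "%d", &val) != 1)
		val = 0;
	fclose(f);
	return val != 0;
}

/* Runs before main(), so every test setup sees the final value. */
static void __attribute__((constructor)) fips_check(void)
{
	fips_enabled = read_fips_flag("/proc/sys/crypto/fips_enabled");
}
```

A test setup could then consult fips_enabled and skip a variant that uses
a non-compliant algorithm instead of letting it fail.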
Magali Lemes (4):
selftests/harness: allow tests to be skipped during setup
selftests: net: tls: check if FIPS mode is enabled
selftests: net: vrf-xfrm-tests: change authentication and encryption
algos
selftests: net: fcnal-test: check if FIPS mode is enabled
tools/testing/selftests/kselftest_harness.h | 6 ++--
tools/testing/selftests/net/fcnal-test.sh | 27 +++++++++++-----
tools/testing/selftests/net/tls.c | 24 +++++++++++++-
tools/testing/selftests/net/vrf-xfrm-tests.sh | 32 +++++++++----------
4 files changed, 61 insertions(+), 28 deletions(-)
--
2.34.1
From: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
According to Mirsad the gpio-sim.sh test appears to FAIL in a wrong way
due to missing initialisation of shell variables:
4.2. Bias settings work correctly
cat: /sys/devices/platform/gpio-sim.0/gpiochip18/sim_gpio0/value: No such file or directory
./gpio-sim.sh: line 393: test: =: unary operator expected
bias setting does not work
GPIO gpio-sim test FAIL
After this change the test passed:
4.2. Bias settings work correctly
GPIO gpio-sim test PASS
His testing environment is AlmaLinux 8.7 on Lenovo desktop box with
the latest Linux kernel based on v6.2:
Linux 6.2.0-mglru-kmlk-andy-09238-gd2980d8d8265 x86_64
Suggested-by: Mirsad Todorovac <mirsad.todorovac(a)alu.unizg.hr>
Signed-off-by: Andy Shevchenko <andriy.shevchenko(a)linux.intel.com>
---
tools/testing/selftests/gpio/gpio-sim.sh | 3 +++
1 file changed, 3 insertions(+)
diff --git a/tools/testing/selftests/gpio/gpio-sim.sh b/tools/testing/selftests/gpio/gpio-sim.sh
index 9f539d454ee4..fa2ce2b9dd5f 100755
--- a/tools/testing/selftests/gpio/gpio-sim.sh
+++ b/tools/testing/selftests/gpio/gpio-sim.sh
@@ -389,6 +389,9 @@ create_chip chip
create_bank chip bank
set_num_lines chip bank 8
enable_chip chip
+DEVNAME=`configfs_dev_name chip`
+CHIPNAME=`configfs_chip_name chip bank`
+SYSFS_PATH="/sys/devices/platform/$DEVNAME/$CHIPNAME/sim_gpio0/value"
$BASE_DIR/gpio-mockup-cdev -b pull-up /dev/`configfs_chip_name chip bank` 0
test `cat $SYSFS_PATH` = "1" || fail "bias setting does not work"
remove_chip chip
--
2.40.0.1.gaa8946217a0b
The default timeout for kselftests is 45 seconds, but pcm-test can take
longer than that to run depending on the number of PCMs present on a
device.
As a data point, running pcm-test on mt8192-asurada-spherion takes about
1m15s.
Set the timeout to 10 minutes, which should give enough slack to run the
test even on devices with many PCMs.
Signed-off-by: Nícolas F. R. A. Prado <nfraprado(a)collabora.com>
---
tools/testing/selftests/alsa/settings | 1 +
1 file changed, 1 insertion(+)
create mode 100644 tools/testing/selftests/alsa/settings
diff --git a/tools/testing/selftests/alsa/settings b/tools/testing/selftests/alsa/settings
new file mode 100644
index 000000000000..a62d2fa1275c
--- /dev/null
+++ b/tools/testing/selftests/alsa/settings
@@ -0,0 +1 @@
+timeout=600
--
2.39.0
Here is a series with some fixes and cleanups to the resctrl selftests and
a rewrite of the CAT test into something that really tests whether CAT is
working or not.
v2:
- Rebased on top of next to solve the conflicts
- Added 2 patches related to resctrl FS mount/umount (fix + cleanup)
- Consistently use "alloc" in cache_alloc_size()
- CAT test error handling tweaked
- Remove a spurious newline change from the CAT patch
- Small improvements to changelogs
Ilpo Järvinen (24):
selftests/resctrl: Add resctrl.h into build deps
selftests/resctrl: Check also too low values for CBM bits
selftests/resctrl: Move resctrl FS mount/umount to higher level
selftests/resctrl: Remove mum_resctrlfs
selftests/resctrl: Make span unsigned long everywhere
selftests/resctrl: Express span in bytes
selftests/resctrl: Remove duplicated preparation for span arg
selftests/resctrl: Don't use variable argument list for ->setup()
selftests/resctrl: Remove "malloc_and_init_memory" param from
run_fill_buf()
selftests/resctrl: Split run_fill_buf() to alloc, work, and dealloc
helpers
selftests/resctrl: Remove start_buf local variable from buffer alloc
func
selftests/resctrl: Don't pass test name to fill_buf
selftests/resctrl: Add flush_buffer() to fill_buf
selftests/resctrl: Remove test type checks from cat_val()
selftests/resctrl: Refactor get_cbm_mask()
selftests/resctrl: Create cache_alloc_size() helper
selftests/resctrl: Replace count_bits with count_consecutive_bits()
selftests/resctrl: Exclude shareable bits from schemata in CAT test
selftests/resctrl: Pass the real number of tests to show_cache_info()
selftests/resctrl: Move CAT/CMT test global vars to func they are used
selftests/resctrl: Read in less obvious order to defeat prefetch
optimizations
selftests/resctrl: Split measure_cache_vals() function
selftests/resctrl: Split show_cache_info() to test specific and
generic parts
selftests/resctrl: Rewrite Cache Allocation Technology (CAT) test
tools/testing/selftests/resctrl/Makefile | 2 +-
tools/testing/selftests/resctrl/cache.c | 154 ++++++------
tools/testing/selftests/resctrl/cat_test.c | 235 ++++++++----------
tools/testing/selftests/resctrl/cmt_test.c | 65 +++--
tools/testing/selftests/resctrl/fill_buf.c | 105 ++++----
tools/testing/selftests/resctrl/mba_test.c | 9 +-
tools/testing/selftests/resctrl/mbm_test.c | 17 +-
tools/testing/selftests/resctrl/resctrl.h | 32 +--
.../testing/selftests/resctrl/resctrl_tests.c | 82 ++++--
tools/testing/selftests/resctrl/resctrl_val.c | 9 +-
tools/testing/selftests/resctrl/resctrlfs.c | 187 ++++++++++----
11 files changed, 499 insertions(+), 398 deletions(-)
--
2.30.2
Fix the following coccicheck warning:
tools/testing/selftests/nolibc/nolibc-test.c:646:5-8: Unneeded variable:
"ret". Return "0"
Signed-off-by: Yonggang Wu <wuyonggang001(a)208suo.com>
---
tools/testing/selftests/nolibc/nolibc-test.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/tools/testing/selftests/nolibc/nolibc-test.c b/tools/testing/selftests/nolibc/nolibc-test.c
index 486334981e60..2b723354e085 100644
--- a/tools/testing/selftests/nolibc/nolibc-test.c
+++ b/tools/testing/selftests/nolibc/nolibc-test.c
@@ -546,7 +546,6 @@ int run_syscall(int min, int max)
int proc;
int test;
int tmp;
- int ret = 0;
void *p1, *p2;
/* <proc> indicates whether or not /proc is mounted */
@@ -632,18 +631,17 @@ int run_syscall(int min, int max)
CASE_TEST(syscall_noargs); EXPECT_SYSEQ(1, syscall(__NR_getpid), getpid()); break;
CASE_TEST(syscall_args); EXPECT_SYSER(1, syscall(__NR_statx, 0, NULL, 0, 0, NULL), -1, EFAULT); break;
case __LINE__:
- return ret; /* must be last */
+ return 0; /* must be last */
/* note: do not set any defaults so as to permit holes above */
}
}
- return ret;
+ return 0;
}
int run_stdlib(int min, int max)
{
int test;
int tmp;
- int ret = 0;
void *p1, *p2;
for (test = min; test >= 0 && test <= max; test++) {
@@ -726,11 +724,11 @@ int run_stdlib(int min, int max)
# warning "__SIZEOF_LONG__ is undefined"
#endif /* __SIZEOF_LONG__ */
case __LINE__:
- return ret; /* must be last */
+ return 0; /* must be last */
/* note: do not set any defaults so as to permit holes above */
}
}
- return ret;
+ return 0;
}
#define EXPECT_VFPRINTF(c, expected, fmt, ...) \
@@ -790,7 +788,6 @@ static int run_vfprintf(int min, int max)
{
int test;
int tmp;
- int ret = 0;
void *p1, *p2;
for (test = min; test >= 0 && test <= max; test++) {
@@ -810,11 +807,11 @@ static int run_vfprintf(int min, int max)
CASE_TEST(hex); EXPECT_VFPRINTF(1, "f", "%x", 0xf); break;
CASE_TEST(pointer); EXPECT_VFPRINTF(3, "0x1", "%p", (void *) 0x1); break;
case __LINE__:
- return ret; /* must be last */
+ return 0; /* must be last */
/* note: do not set any defaults so as to permit holes above */
}
}
- return ret;
+ return 0;
}
static int smash_stack(void)
Currently the MM selftests attempt to work out the target architecture by
using CROSS_COMPILE or otherwise querying the host machine, storing the
target architecture in a variable called MACHINE rather than the usual ARCH,
though as far as I can tell (including for x86_64) the value is the same as
we would use for ARCH.
When cross compiling with LLVM we don't need a CROSS_COMPILE as LLVM can
support many target architectures in a single build, so this logic does not
work: CROSS_COMPILE is not set and we end up selecting tests for the host
rather than the target architecture. Fix this by using the more standard ARCH
to describe the architecture, taking it from the environment if specified.
Signed-off-by: Mark Brown <broonie@kernel.org>
---
tools/testing/selftests/mm/Makefile | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 23af4633f0f4..4f0c50c33ba7 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -5,12 +5,15 @@ LOCAL_HDRS += $(selfdir)/mm/local_config.h $(top_srcdir)/mm/gup_test.h
include local_config.mk
+ifeq ($(ARCH),)
+
ifeq ($(CROSS_COMPILE),)
uname_M := $(shell uname -m 2>/dev/null || echo not)
else
uname_M := $(shell echo $(CROSS_COMPILE) | grep -o '^[a-z0-9]\+')
endif
-MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/ppc64/')
+ARCH ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/ppc64/')
+endif
# Without this, failed build products remain, with up-to-date timestamps,
# thus tricking Make (and you!) into believing that All Is Well, in subsequent
@@ -65,7 +68,7 @@ TEST_GEN_PROGS += ksm_tests
TEST_GEN_PROGS += ksm_functional_tests
TEST_GEN_PROGS += mdwe_test
-ifeq ($(MACHINE),x86_64)
+ifeq ($(ARCH),x86_64)
CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_32bit_program.c -m32)
CAN_BUILD_X86_64 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_64bit_program.c)
CAN_BUILD_WITH_NOPIE := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_program.c -no-pie)
@@ -87,13 +90,13 @@ TEST_GEN_PROGS += $(BINARIES_64)
endif
else
-ifneq (,$(findstring $(MACHINE),ppc64))
+ifneq (,$(findstring $(ARCH),ppc64))
TEST_GEN_PROGS += protection_keys
endif
endif
-ifneq (,$(filter $(MACHINE),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sparc64 x86_64))
+ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sparc64 x86_64))
TEST_GEN_PROGS += va_high_addr_switch
TEST_GEN_PROGS += virtual_address_range
TEST_GEN_PROGS += write_to_hugetlbfs
@@ -112,7 +115,7 @@ $(TEST_GEN_PROGS): vm_util.c
$(OUTPUT)/uffd-stress: uffd-common.c
$(OUTPUT)/uffd-unit-tests: uffd-common.c
-ifeq ($(MACHINE),x86_64)
+ifeq ($(ARCH),x86_64)
BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32))
BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64))
---
base-commit: 858fd168a95c5b9669aac8db6c14a9aeab446375
change-id: 20230614-kselftest-mm-llvm-a25a7daffa6f
Best regards,
--
Mark Brown <broonie@kernel.org>
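
For illustration only, the selection order described in the commit message above (ARCH from the environment first, then the CROSS_COMPILE prefix, then the host's `uname -m`) can be sketched as a standalone shell function. The name `detect_arch` and its two positional parameters are hypothetical and not part of the patch; the `grep` and `sed` expressions are copied from the Makefile hunk, and `\+` assumes a grep that accepts that extension (GNU grep does).

```shell
#!/bin/sh
# Hypothetical helper mirroring the Makefile's fallback order.
# It is a sketch for reading the patch, not code from the patch itself.
detect_arch() {
    arch="$1"            # value of ARCH from the environment, may be empty
    cross_compile="$2"   # value of CROSS_COMPILE, may be empty

    # 1. An explicitly provided ARCH always wins.
    if [ -n "$arch" ]; then
        echo "$arch"
        return 0
    fi

    # 2. Otherwise derive a machine name from the CROSS_COMPILE prefix,
    #    or from the host when CROSS_COMPILE is unset.
    if [ -z "$cross_compile" ]; then
        uname_m=$(uname -m 2>/dev/null || echo not)
    else
        uname_m=$(echo "$cross_compile" | grep -o '^[a-z0-9]\+')
    fi

    # 3. Normalise to the names the Makefile compares against.
    echo "$uname_m" | sed -e 's/aarch64.*/arm64/' -e 's/ppc64.*/ppc64/'
}
```

Calling `detect_arch "$ARCH" "$CROSS_COMPILE"` with ARCH=arm64 and an empty CROSS_COMPILE, as one would expect in an LLVM cross build, yields arm64 regardless of the host. With both empty it falls back to the host machine name, which is the case the patch is meant to avoid for cross builds.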