This is the start of the stable review cycle for the 5.19.6 release. There are 158 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Wed, 31 Aug 2022 10:57:37 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.19.6-rc1.... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.19.y and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 5.19.6-rc1
Daniel Borkmann daniel@iogearbox.net bpf: Don't use tnum_range on array range checking for poke descriptors
Conor Dooley conor.dooley@microchip.com riscv: dts: microchip: mpfs: remove pci axi address translation property
Conor Dooley conor.dooley@microchip.com riscv: dts: microchip: mpfs: remove bogus card-detect-delay
Conor Dooley conor.dooley@microchip.com riscv: dts: microchip: mpfs: remove ti,fifo-depth property
Conor Dooley conor.dooley@microchip.com riscv: dts: microchip: mpfs: fix incorrect pcie child node name
Mike Christie michael.christie@oracle.com scsi: core: Fix passthrough retry counter handling
Saurabh Sengar ssengar@linux.microsoft.com scsi: storvsc: Remove WQ_MEM_RECLAIM from storvsc_error_wq
Kiwoong Kim kwmad.kim@samsung.com scsi: ufs: core: Enable link lost interrupt
Mark Brown broonie@kernel.org arm64/sme: Don't flush SVE register state when handling SME traps
Mark Brown broonie@kernel.org arm64/sme: Don't flush SVE register state when allocating SME storage
Mark Brown broonie@kernel.org arm64/signal: Flush FPSIMD register state when disabling streaming mode
Mark Rutland mark.rutland@arm.com arm64: fix rodata=full
Ian Rogers irogers@google.com perf stat: Clear evsel->reset_group for each stat run
Stephane Eranian eranian@google.com perf/x86/intel/ds: Fix precise store latency handling
Stephane Eranian eranian@google.com perf/x86/intel/uncore: Fix broken read_counter() for SNB IMC PMU
James Clark james.clark@arm.com perf python: Fix build when PYTHON_CONFIG is user supplied
Yu Kuai yukuai3@huawei.com blk-mq: fix io hung due to missing commit_rqs
Salvatore Bonaccorso carnil@debian.org Documentation/ABI: Mention retbleed vulnerability info file for sysfs
Prike Liang Prike.Liang@amd.com drm/amdkfd: Fix isa version for the GC 10.3.7
Peter Zijlstra peterz@infradead.org x86/nospec: Fix i386 RSB stuffing
Liam Howlett liam.howlett@oracle.com binder_alloc: add missing mmap_lock calls when using the VMA
Zenghui Yu yuzenghui@huawei.com arm64: Fix match_list for erratum 1286807 on Arm Cortex-A76
Guoqing Jiang guoqing.jiang@linux.dev md: call __md_stop_writes in md_stop
Guoqing Jiang guoqing.jiang@linux.dev Revert "md-raid: destroy the bitmap after destroying the thread"
David Hildenbrand david@redhat.com mm/hugetlb: fix hugetlb not supporting softdirty tracking
Jens Axboe axboe@kernel.dk io_uring: fix issue with io_write() not always undoing sb_start_write()
Jiri Slaby jirislaby@kernel.org Revert "zram: remove double compression logic"
Heinrich Schuchardt heinrich.schuchardt@canonical.com riscv: dts: microchip: correct L2 cache interrupts
Conor Dooley conor.dooley@microchip.com riscv: traps: add missing prototype
Conor Dooley conor.dooley@microchip.com riscv: signal: fix missing prototype warning
Juergen Gross jgross@suse.com xen/privcmd: fix error exit of privcmd_ioctl_dm_op()
Heming Zhao ocfs2-devel@oss.oracle.com ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
David Howells dhowells@redhat.com smb3: missing inode locks in punch hole
Karol Herbst kherbst@redhat.com nouveau: explicitly wait on the fence in nouveau_bo_move_m2mf
Riwen Lu luriwen@kylinos.cn ACPI: processor: Remove freq Qos request for all CPUs
Matthew Wilcox (Oracle) willy@infradead.org shmem: update folio if shmem_replace_page() updates the page
Shakeel Butt shakeelb@google.com Revert "memcg: cleanup racy sum avoidance code"
Shigeru Yoshida syoshida@redhat.com fbdev: fbcon: Properly revert changes when vc_resize() failed
Brian Foster bfoster@redhat.com s390: fix double free of GS and RI CBs on fork() failure
Paulo Alcantara pc@cjr.nz cifs: skip extra NULL byte in filenames
Peter Xu peterx@redhat.com mm/mprotect: only reference swap pfn page if type match
Miaohe Lin linmiaohe@huawei.com mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte
Liu Shixin liushixin2@huawei.com bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
Gerald Schaefer gerald.schaefer@linux.ibm.com s390/mm: do not trigger write fault when vma does not allow VM_WRITE
Badari Pulavarty badari.pulavarty@intel.com mm/damon/dbgfs: avoid duplicate context directory creation
Quanyang Wang quanyang.wang@windriver.com asm-generic: sections: refactor memory_intersects
Richard Guy Briggs rgb@redhat.com audit: move audit_return_fixup before the filters
Khazhismel Kumykov khazhy@chromium.org writeback: avoid use-after-free after removing device
Siddh Raman Pant code@siddh.me loop: Check for overflow while configuring loop
Jan Beulich jbeulich@suse.com x86/PAT: Have pat_enabled() properly reflect state when running on Xen
Peter Zijlstra peterz@infradead.org x86/nospec: Unwreck the RSB stuffing
Pawan Gupta pawan.kumar.gupta@linux.intel.com x86/bugs: Add "unknown" reporting for MMIO Stale Data
Tom Lendacky thomas.lendacky@amd.com x86/sev: Don't use cc_platform_has() for early SEV-SNP calls
Chen Zhongjin chenzhongjin@huawei.com x86/unwind/orc: Unwind ftrace trampolines with correct ORC entry
Juergen Gross jgross@suse.com x86/entry: Fix entry_INT80_compat for Xen PV guests
Kan Liang kan.liang@linux.intel.com perf/x86/lbr: Enable the branch type for the Arch LBR by default
Kan Liang kan.liang@linux.intel.com perf/x86/intel: Fix pebs event constraints for ADL
Michael Roth michael.roth@amd.com x86/boot: Don't propagate uninitialized boot_params->cc_blob_address
Filipe Manana fdmanana@suse.com btrfs: update generation of hole file extent item when merging holes
Zixuan Fu r33s3n6@gmail.com btrfs: fix possible memory leak in btrfs_get_dev_args_from_path()
Goldwyn Rodrigues rgoldwyn@suse.de btrfs: check if root is readonly while setting security xattr
Omar Sandoval osandov@fb.com btrfs: fix space cache corruption and potential double allocations
Anand Jain anand.jain@oracle.com btrfs: add info when mount fails due to stale replace target
Anand Jain anand.jain@oracle.com btrfs: replace: drop assert for suspended replace
Filipe Manana fdmanana@suse.com btrfs: fix silent failure when deleting root reference
Aleksander Jan Bajkowski olek2@wp.pl net: lantiq_xrx200: restore buffer if memory allocation failed
Aleksander Jan Bajkowski olek2@wp.pl net: lantiq_xrx200: fix lock under memory pressure
Aleksander Jan Bajkowski olek2@wp.pl net: lantiq_xrx200: confirm skb is allocated before using
Heiner Kallweit hkallweit1@gmail.com net: stmmac: work around sporadic tx issue on link-up
R Mohamed Shah mohamed@pensando.io ionic: VF initial random MAC address if no assigned mac
Shannon Nelson snelson@pensando.io ionic: fix up issues with handling EAGAIN on FW cmds
Shannon Nelson snelson@pensando.io ionic: clear broken state on generation change
David Howells dhowells@redhat.com rxrpc: Fix locking in rxrpc's sendmsg
Lorenzo Bianconi lorenzo@kernel.org net: ethernet: mtk_eth_soc: fix hw hash reporting for MTK_NETSYS_V2
Lorenzo Bianconi lorenzo@kernel.org net: ethernet: mtk_eth_soc: enable rx cksum offload for MTK_NETSYS_V2
Sylwester Dziedziuch sylwesterx.dziedziuch@intel.com i40e: Fix incorrect address type for IPv6 flow rules
Jacob Keller jacob.e.keller@intel.com ixgbe: stop resetting SYSTIME in ixgbe_ptp_start_cyclecounter
Kuniyuki Iwashima kuniyu@amazon.com net: Fix a data-race around sysctl_somaxconn.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix a data-race around netdev_unregister_timeout_secs.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix a data-race around gro_normal_batch.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix data-races around sysctl_devconf_inherit_init_net.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix data-races around sysctl_fb_tunnels_only_for_init_net.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix a data-race around netdev_budget_usecs.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix data-races around sysctl_max_skb_frags.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix a data-race around netdev_budget.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix a data-race around sysctl_net_busy_read.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix a data-race around sysctl_net_busy_poll.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix a data-race around sysctl_tstamp_allow_data.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix data-races around sysctl_optmem_max.
Kuniyuki Iwashima kuniyu@amazon.com ratelimit: Fix data-races in ___ratelimit().
Kuniyuki Iwashima kuniyu@amazon.com net: Fix data-races around netdev_tstamp_prequeue.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix data-races around netdev_max_backlog.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix data-races around weight_p and dev_weight_[rt]x_bias.
Kuniyuki Iwashima kuniyu@amazon.com net: Fix data-races around sysctl_[rw]mem_(max|default).
Pablo Neira Ayuso pablo@netfilter.org netfilter: flowtable: fix stuck flows on cleanup due to pending work
Pablo Neira Ayuso pablo@netfilter.org netfilter: flowtable: add function to invoke garbage collection immediately
Pablo Neira Ayuso pablo@netfilter.org netfilter: nf_tables: disallow binding to already bound chain
Pablo Neira Ayuso pablo@netfilter.org netfilter: nft_tunnel: restrict it to netdev family
Pablo Neira Ayuso pablo@netfilter.org netfilter: nft_osf: restrict osf to ipv4, ipv6 and inet families
Pablo Neira Ayuso pablo@netfilter.org netfilter: nf_tables: do not leave chain stats enabled on error
Pablo Neira Ayuso pablo@netfilter.org netfilter: nft_payload: do not truncate csum_offset and csum_type
Pablo Neira Ayuso pablo@netfilter.org netfilter: nft_payload: report ERANGE for too long offset and length
Pablo Neira Ayuso pablo@netfilter.org netfilter: nf_tables: make table handle allocation per-netns friendly
Pablo Neira Ayuso pablo@netfilter.org netfilter: nf_tables: disallow updates of implicit chain
Vikas Gupta vikas.gupta@broadcom.com bnxt_en: fix LRO/GRO_HW features in ndo_fix_features callback
Vikas Gupta vikas.gupta@broadcom.com bnxt_en: fix NQ resource accounting during vf creation on 57500 chips
Vikas Gupta vikas.gupta@broadcom.com bnxt_en: set missing reload flag in devlink features
Pavan Chebbi pavan.chebbi@broadcom.com bnxt_en: Use PAGE_SIZE to init buffer when multi buffer XDP is not in use
Florian Westphal fw@strlen.de netfilter: nft_tproxy: restrict to prerouting hook
Florian Westphal fw@strlen.de netfilter: ebtables: reject blobs that don't provide all entry points
Maciej Żenczykowski maze@google.com net: ipvtap - add __init/__exit annotations to module init/exit funcs
Jonathan Toppins jtoppins@redhat.com bonding: 802.3ad: fix no transmission of LACPDUs
Sergei Antonov saproj@gmail.com net: moxa: get rid of asymmetry in DMA mapping/unmapping
Xiaolei Wang xiaolei.wang@windriver.com net: phy: Don't WARN for PHY_READY state in mdio_bus_phy_resume()
Alex Elder elder@linaro.org net: ipa: don't assume SMEM is page-aligned
Vladimir Oltean vladimir.oltean@nxp.com net: dsa: microchip: keep compatibility with device tree blobs with no phy-mode
Arun Ramadoss arun.ramadoss@microchip.com net: dsa: microchip: update the ksz_phylink_get_caps
Arun Ramadoss arun.ramadoss@microchip.com net: dsa: microchip: move the port mirror to ksz_common
Arun Ramadoss arun.ramadoss@microchip.com net: dsa: microchip: move vlan functionality to ksz_common
Arun Ramadoss arun.ramadoss@microchip.com net: dsa: microchip: move tag_protocol to ksz_common
Arun Ramadoss arun.ramadoss@microchip.com net: dsa: microchip: move switch chip_id detection to ksz_common
Arun Ramadoss arun.ramadoss@microchip.com net: dsa: microchip: ksz9477: cleanup the ksz9477_switch_detect
Maor Dickman maord@nvidia.com net/mlx5e: Fix wrong tc flag used when set hw-tc-offload off
Aya Levin ayal@nvidia.com net/mlx5e: Fix wrong application of the LRO state
Moshe Shemesh moshe@nvidia.com net/mlx5: Avoid false positive lockdep warning by adding lock_class_key
Roy Novich royno@nvidia.com net/mlx5: Fix cmd error logging for manage pages cmd
Vlad Buslov vladbu@nvidia.com net/mlx5: Disable irq when locking lag_lock
Eli Cohen elic@nvidia.com net/mlx5: Eswitch, Fix forwarding decision to uplink
Eli Cohen elic@nvidia.com net/mlx5: LAG, fix logic over MLX5_LAG_FLAG_NDEVS_READY
Vlad Buslov vladbu@nvidia.com net/mlx5e: Properly disable vlan strip on non-UL reps
Maciej Fijalkowski maciej.fijalkowski@intel.com ice: xsk: use Rx ring's XDP ring when picking NAPI context
Maciej Fijalkowski maciej.fijalkowski@intel.com ice: xsk: prohibit usage of non-balanced queue id
Duoming Zhou duoming@zju.edu.cn nfc: pn533: Fix use-after-free bugs caused by pn532_cmd_timeout
Hayes Wang hayeswang@realtek.com r8152: fix the RX FIFO settings when suspending
Hayes Wang hayeswang@realtek.com r8152: fix the units of some registers for RTL8156A
Bernard Pidoux f6bvp@free.fr rose: check NULL rose_loopback_neigh->loopback
Christian Brauner brauner@kernel.org ntfs: fix acl handling
Peter Xu peterx@redhat.com mm/smaps: don't access young/dirty bit if pte unpresent
Trond Myklebust trond.myklebust@hammerspace.com SUNRPC: RPC level errors should set task->tk_rpc_status
Olga Kornievskaia kolga@netapp.com NFSv4.2 fix problems with __nfs42_ssc_open
Sabrina Dubroca sd@queasysnail.net Revert "net: macsec: update SCI upon MAC address change."
Seth Forshee sforshee@digitalocean.com fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
Nikolay Aleksandrov razor@blackwall.org xfrm: policy: fix metadata dst->dev xmit null pointer dereference
Herbert Xu herbert@gondor.apana.org.au af_key: Do not call xfrm_probe_algs in parallel
Antony Antony antony.antony@secunet.com xfrm: clone missing x->lastused in xfrm_do_migrate
Antony Antony antony.antony@secunet.com Revert "xfrm: update SA curlft.use_time"
Xin Xiong xiongx18@fudan.edu.cn xfrm: fix refcount leak in __xfrm_policy_check()
Deren Wu deren.wu@mediatek.com mt76: mt7921: fix command timeout in AP stop period
David Hildenbrand david@redhat.com mm/hugetlb: support write-faults in shared mappings
Peter Xu peterx@redhat.com mm/uffd: reset write protection when unregister with wp-mode
Kuniyuki Iwashima kuniyu@amazon.com kprobes: don't call disarm_kprobe() for disabled kprobes
Randy Dunlap rdunlap@infradead.org kernel/sys_ni: add compat entry for fadvise64_64
Helge Deller deller@gmx.de parisc: Fix exception handler for fldw and fstw instructions
Helge Deller deller@gmx.de parisc: Make CONFIG_64BIT available for ARCH=parisc64 only
Jing-Ting Wu Jing-Ting.Wu@mediatek.com cgroup: Fix race condition at rebind_subsystems()
Gaosheng Cui cuigaosheng1@huawei.com audit: fix potential double free on error path from fsnotify_add_inode_mark
Trond Myklebust trond.myklebust@hammerspace.com NFS: Fix another fsync() issue after a server reboot
David Hildenbrand david@redhat.com mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
-------------
Diffstat:
 Documentation/ABI/testing/sysfs-devices-system-cpu | 1 +
 .../hw-vuln/processor_mmio_stale_data.rst | 14 ++
 Documentation/admin-guide/kernel-parameters.txt | 2 +
 Documentation/admin-guide/sysctl/net.rst | 2 +-
 Makefile | 4 +-
 arch/arm64/include/asm/fpsimd.h | 4 +-
 arch/arm64/include/asm/setup.h | 17 ++
 arch/arm64/kernel/cpu_errata.c | 2 +
 arch/arm64/kernel/fpsimd.c | 21 +--
 arch/arm64/kernel/ptrace.c | 6 +-
 arch/arm64/kernel/signal.c | 12 +-
 arch/arm64/mm/mmu.c | 18 ---
 arch/parisc/Kconfig | 21 +--
 arch/parisc/kernel/unaligned.c | 2 +-
 arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts | 3 -
 arch/riscv/boot/dts/microchip/mpfs-polarberry.dts | 3 -
 arch/riscv/boot/dts/microchip/mpfs.dtsi | 5 +-
 arch/riscv/include/asm/signal.h | 12 ++
 arch/riscv/include/asm/thread_info.h | 2 +
 arch/riscv/kernel/signal.c | 1 +
 arch/riscv/kernel/traps.c | 3 +-
 arch/s390/kernel/process.c | 22 ++-
 arch/s390/mm/fault.c | 4 +-
 arch/x86/boot/compressed/misc.h | 12 +-
 arch/x86/boot/compressed/sev.c | 8 +
 arch/x86/entry/entry_64_compat.S | 2 +-
 arch/x86/events/intel/ds.c | 12 +-
 arch/x86/events/intel/lbr.c | 8 +
 arch/x86/events/intel/uncore_snb.c | 18 ++-
 arch/x86/include/asm/cpufeatures.h | 5 +-
 arch/x86/include/asm/nospec-branch.h | 92 ++++++-----
 arch/x86/kernel/cpu/bugs.c | 14 +-
 arch/x86/kernel/cpu/common.c | 42 +++--
 arch/x86/kernel/sev.c | 16 +-
 arch/x86/kernel/unwind_orc.c | 15 +-
 arch/x86/mm/pat/memtype.c | 10 +-
 block/blk-mq.c | 5 +-
 drivers/acpi/processor_thermal.c | 2 +-
 drivers/android/binder_alloc.c | 31 ++--
 drivers/block/loop.c | 5 +
 drivers/block/zram/zram_drv.c | 42 +++--
 drivers/block/zram/zram_drv.h | 1 +
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 6 +-
 drivers/gpu/drm/nouveau/nouveau_bo.c | 9 ++
 drivers/md/md.c | 3 +-
 drivers/net/bonding/bond_3ad.c | 38 ++---
 drivers/net/dsa/microchip/ksz8795.c | 102 +++---------
 drivers/net/dsa/microchip/ksz8795_reg.h | 16 --
 drivers/net/dsa/microchip/ksz9477.c | 89 +++-------
 drivers/net/dsa/microchip/ksz9477_reg.h | 1 -
 drivers/net/dsa/microchip/ksz_common.c | 173 ++++++++++++++++++++-
 drivers/net/dsa/microchip/ksz_common.h | 47 +++++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 5 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c | 1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 10 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 +-
 drivers/net/ethernet/intel/ice/ice.h | 36 +++--
 drivers/net/ethernet/intel/ice/ice_lib.c | 4 +-
 drivers/net/ethernet/intel/ice/ice_main.c | 25 ++-
 drivers/net/ethernet/intel/ice/ice_xsk.c | 18 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 59 +++++--
 drivers/net/ethernet/lantiq_xrx200.c | 9 +-
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 29 ++--
 drivers/net/ethernet/mediatek/mtk_eth_soc.h | 5 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 12 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 2 +
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c | 3 +-
 drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c | 57 ++++---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 +
 .../net/ethernet/mellanox/mlx5/core/pagealloc.c | 9 +-
 drivers/net/ethernet/moxa/moxart_ether.c | 11 +-
 drivers/net/ethernet/pensando/ionic/ionic_lif.c | 95 ++++++++++-
 drivers/net/ethernet/pensando/ionic/ionic_main.c | 4 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c | 8 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 9 +-
 drivers/net/ipa/ipa_mem.c | 2 +-
 drivers/net/ipvlan/ipvtap.c | 4 +-
 drivers/net/macsec.c | 11 +-
 drivers/net/phy/phy_device.c | 8 +-
 drivers/net/usb/r8152.c | 27 ++--
 .../net/wireless/mediatek/mt76/mt76_connac_mcu.c | 2 +
 drivers/net/wireless/mediatek/mt76/mt7921/main.c | 47 ++++--
 drivers/net/wireless/mediatek/mt76/mt7921/mcu.c | 5 +-
 drivers/net/wireless/mediatek/mt76/mt7921/mt7921.h | 2 +
 drivers/nfc/pn533/uart.c | 1 +
 drivers/scsi/scsi_lib.c | 3 +-
 drivers/scsi/storvsc_drv.c | 2 +-
 drivers/video/fbdev/core/fbcon.c | 27 +++-
 drivers/xen/privcmd.c | 21 +--
 fs/btrfs/block-group.c | 47 ++----
 fs/btrfs/block-group.h | 4 +-
 fs/btrfs/ctree.h | 1 -
 fs/btrfs/dev-replace.c | 5 +-
 fs/btrfs/extent-tree.c | 30 +---
 fs/btrfs/file.c | 2 +
 fs/btrfs/root-tree.c | 5 +-
 fs/btrfs/volumes.c | 5 +-
 fs/btrfs/xattr.c | 3 +
 fs/cifs/smb2ops.c | 12 +-
 fs/cifs/smb2pdu.c | 16 +-
 fs/fs-writeback.c | 12 +-
 fs/namespace.c | 7 +
 fs/nfs/file.c | 15 +-
 fs/nfs/inode.c | 1 +
 fs/nfs/nfs4file.c | 6 +
 fs/nfs/write.c | 6 +-
 fs/ntfs3/xattr.c | 16 +-
 fs/ocfs2/dlmglue.c | 8 +-
 fs/ocfs2/super.c | 3 +-
 fs/proc/task_mmu.c | 7 +-
 fs/userfaultfd.c | 4 +
 include/asm-generic/sections.h | 7 +-
 include/linux/memcontrol.h | 15 +-
 include/linux/mlx5/driver.h | 1 +
 include/linux/mm.h | 1 -
 include/linux/netdevice.h | 20 ++-
 include/linux/netfilter_bridge/ebtables.h | 4 -
 include/linux/nfs_fs.h | 1 +
 include/linux/userfaultfd_k.h | 2 +
 include/net/busy_poll.h | 2 +-
 include/net/gro.h | 2 +-
 include/net/netfilter/nf_flow_table.h | 3 +
 include/net/netfilter/nf_tables.h | 1 +
 include/ufs/ufshci.h | 6 +-
 init/main.c | 18 ++-
 io_uring/io_uring.c | 7 +-
 kernel/audit_fsnotify.c | 1 +
 kernel/auditsc.c | 4 +-
 kernel/bpf/verifier.c | 10 +-
 kernel/cgroup/cgroup.c | 1 +
 kernel/kprobes.c | 9 +-
 kernel/sys_ni.c | 1 +
 lib/ratelimit.c | 12 +-
 mm/backing-dev.c | 10 +-
 mm/bootmem_info.c | 2 +
 mm/damon/dbgfs.c | 3 +
 mm/gup.c | 69 +++++---
 mm/huge_memory.c | 65 +++++---
 mm/hugetlb.c | 28 +++-
 mm/mmap.c | 8 +-
 mm/mprotect.c | 3 +-
 mm/page-writeback.c | 6 +-
 mm/shmem.c | 1 +
 mm/userfaultfd.c | 29 ++--
 net/bridge/netfilter/ebtable_broute.c | 8 -
 net/bridge/netfilter/ebtable_filter.c | 8 -
 net/bridge/netfilter/ebtable_nat.c | 8 -
 net/bridge/netfilter/ebtables.c | 8 +-
 net/core/bpf_sk_storage.c | 5 +-
 net/core/dev.c | 20 +--
 net/core/filter.c | 13 +-
 net/core/gro_cells.c | 2 +-
 net/core/skbuff.c | 2 +-
 net/core/sock.c | 18 ++-
 net/core/sysctl_net_core.c | 15 +-
 net/ipv4/devinet.c | 16 +-
 net/ipv4/ip_output.c | 2 +-
 net/ipv4/ip_sockglue.c | 6 +-
 net/ipv4/tcp.c | 4 +-
 net/ipv4/tcp_output.c | 2 +-
 net/ipv6/addrconf.c | 5 +-
 net/ipv6/ipv6_sockglue.c | 4 +-
 net/key/af_key.c | 3 +
 net/mptcp/protocol.c | 2 +-
 net/netfilter/ipvs/ip_vs_sync.c | 4 +-
 net/netfilter/nf_flow_table_core.c | 15 +-
 net/netfilter/nf_flow_table_offload.c | 8 +
 net/netfilter/nf_tables_api.c | 14 +-
 net/netfilter/nft_osf.c | 18 ++-
 net/netfilter/nft_payload.c | 29 +++-
 net/netfilter/nft_tproxy.c | 8 +
 net/netfilter/nft_tunnel.c | 1 +
 net/rose/rose_loopback.c | 3 +-
 net/rxrpc/call_object.c | 4 +-
 net/rxrpc/sendmsg.c | 92 ++++++-----
 net/sched/sch_generic.c | 2 +-
 net/socket.c | 2 +-
 net/sunrpc/clnt.c | 2 +-
 net/xfrm/espintcp.c | 2 +-
 net/xfrm/xfrm_input.c | 3 +-
 net/xfrm/xfrm_output.c | 1 -
 net/xfrm/xfrm_policy.c | 3 +-
 net/xfrm/xfrm_state.c | 1 +
 tools/perf/Makefile.config | 2 +-
 tools/perf/builtin-stat.c | 1 +
 187 files changed, 1603 insertions(+), 917 deletions(-)
From: David Hildenbrand david@redhat.com
commit 5535be3099717646781ce1540cf725965d680e7b upstream.
Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know that FOLL_FORCE can be possibly dangerous, especially if there are races that can be exploited by user space.
Right now, it would be sufficient to have some code that sets a PTE of a R/O-mapped shared page dirty, in order for it to erroneously become writable by FOLL_FORCE. The implications of setting a write-protected PTE dirty might not be immediately obvious to everyone.
And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map a shmem page R/O while marking the pte dirty. This can be used by unprivileged user space to modify tmpfs/shmem file content even if the user does not have write permissions to the file, and to bypass memfd write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
To fix such security issues for good, the insight is that we really only need that fancy retry logic (FOLL_COW) for COW mappings that are not writable (!VM_WRITE). And in a COW mapping, we really only broke COW if we have an exclusive anonymous page mapped. If we have something else mapped, or the mapped anonymous page might be shared (!PageAnonExclusive), we have to trigger a write fault to break COW. If we don't find an exclusive anonymous page when we retry, we have to trigger COW breaking once again because something intervened.
Let's move away from this mandatory-retry + dirty handling and rely on our PageAnonExclusive() flag for making a similar decision, to use the same COW logic as in other kernel parts here as well. In case we stumble over a PTE in a COW mapping that does not map an exclusive anonymous page, COW was not properly broken and we have to trigger a fake write-fault to break COW.
Just like we do in can_change_pte_writable() added via commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection") and commit 76aefad628aa ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take care of softdirty and uffd-wp manually.
For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler. Note that FOLL_FORCE is not only used for debug access, but also triggered by applications without debug intentions, for example, when pinning pages via RDMA.
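To make the FOLL_FORCE path concrete, here is a minimal user-space sketch (illustration only, not part of this patch) of the classic debugger-style access: writing through /proc/self/mem succeeds even though the target bytes live in a read-only private mapping, because the kernel resolves the page via GUP with FOLL_FORCE:

--------------------------------------------------------------------------
/* Illustration only: poke a read-only variable via /proc/self/mem. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static const char secret[] = "read-only data";

int main(void)
{
	const char buf[] = "poked via mem!";	/* same length as secret */
	int fd = open("/proc/self/mem", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* The file offset is the virtual address to access; the kernel
	 * uses GUP with FOLL_FORCE and breaks COW on the private page. */
	if (pwrite(fd, buf, sizeof(buf), (off_t)(unsigned long)secret) < 0)
		perror("pwrite");
	printf("secret: %s\n", secret);
	close(fd);
	return 0;
}
--------------------------------------------------------------------------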
This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So let's just get rid of it.
Thanks to Nadav Amit for pointing out that the pte_dirty() check in FOLL_FORCE code is problematic and might be exploitable.
Note 1: We don't check for the PTE being dirty because it doesn't matter for making a "was COWed" decision anymore, and whoever modifies the page has to set the page dirty either way.
Note 2: Kernels before extended uffd-wp support and before PageAnonExclusive (< 5.19) can simply revert the problematic commit instead and be safe regarding UFFDIO_CONTINUE. A backport to v5.19 requires minor adjustments due to lack of vma_soft_dirty_enabled().
Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
Signed-off-by: David Hildenbrand david@redhat.com
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: Axel Rasmussen axelrasmussen@google.com
Cc: Nadav Amit nadav.amit@gmail.com
Cc: Peter Xu peterx@redhat.com
Cc: Hugh Dickins hughd@google.com
Cc: Andrea Arcangeli aarcange@redhat.com
Cc: Matthew Wilcox willy@infradead.org
Cc: Vlastimil Babka vbabka@suse.cz
Cc: John Hubbard jhubbard@nvidia.com
Cc: Jason Gunthorpe jgg@nvidia.com
Cc: David Laight David.Laight@ACULAB.COM
Cc: stable@vger.kernel.org [5.16]
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: David Hildenbrand david@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 include/linux/mm.h | 1
 mm/gup.c | 69 ++++++++++++++++++++++++++++++++++++-----------------
 mm/huge_memory.c | 65 +++++++++++++++++++++++++++++++++----------------
 3 files changed, 91 insertions(+), 44 deletions(-)
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2939,7 +2939,6 @@ struct page *follow_page(struct vm_area_
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
-#define FOLL_COW	0x4000	/* internal GUP flag */
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -478,14 +478,43 @@ static int follow_pfn_pte(struct vm_area
 	return -EEXIST;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
+/* FOLL_FORCE can write to even unwritable PTEs in COW mappings. */
+static inline bool can_follow_write_pte(pte_t pte, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	/* If the pte is writable, we can write to the page. */
+	if (pte_write(pte))
+		return true;
+
+	/* Maybe FOLL_FORCE is set to override it? */
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/* But FOLL_FORCE has no effect on shared mappings */
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return false;
+
+	/* ... or read-only private ones */
+	if (!(vma->vm_flags & VM_MAYWRITE))
+		return false;
+
+	/* ... or already writable ones that just need to take a write fault */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page ...
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+
+	/* ... and a write-fault isn't required for other reasons. */
+	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) &&
+	    !(vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
+		return false;
+	return !userfaultfd_pte_wp(vma, pte);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -528,12 +557,19 @@ retry:
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
 		goto no_page;
-	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
-		pte_unmap_unlock(ptep, ptl);
-		return NULL;
-	}
 
 	page = vm_normal_page(vma, address, pte);
+
+	/*
+	 * We only care about anon pages in can_follow_write_pte() and don't
+	 * have to worry about pte_devmap() because they are never anon.
+	 */
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pte(pte, page, vma, flags)) {
+		page = NULL;
+		goto out;
+	}
+
 	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
 		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -967,17 +1003,6 @@ static int faultin_page(struct vm_area_s
 		return -EBUSY;
 	}
 
-	/*
-	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
-	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
-	 * can thus safely do subsequent page lookups as if they were reads.
-	 * But only do so when looping for pte_write is futile: in some cases
-	 * userspace may also be wanting to write to the gotten user page,
-	 * which a read fault here might prevent (a readonly page might get
-	 * reCOWed by userspace write).
-	 */
-	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
-		*flags |= FOLL_COW;
 	return 0;
 }
 
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -978,12 +978,6 @@ struct page *follow_devmap_pmd(struct vm
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	/*
-	 * When we COW a devmap PMD entry, we split it into PTEs, so we should
-	 * not be in this function with `flags & FOLL_COW` set.
-	 */
-	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
-
 	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
 			 (FOLL_PIN | FOLL_GET)))
@@ -1349,14 +1343,43 @@ fallback:
 	return VM_FAULT_FALLBACK;
 }
 
-/*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
- */
-static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+/* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+					struct vm_area_struct *vma,
+					unsigned int flags)
 {
-	return pmd_write(pmd) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	/* If the pmd is writable, we can write to the page. */
+	if (pmd_write(pmd))
+		return true;
+
+	/* Maybe FOLL_FORCE is set to override it? */
+	if (!(flags & FOLL_FORCE))
+		return false;
+
+	/* But FOLL_FORCE has no effect on shared mappings */
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return false;
+
+	/* ... or read-only private ones */
+	if (!(vma->vm_flags & VM_MAYWRITE))
+		return false;
+
+	/* ... or already writable ones that just need to take a write fault */
+	if (vma->vm_flags & VM_WRITE)
+		return false;
+
+	/*
+	 * See can_change_pte_writable(): we broke COW and could map the page
+	 * writable if we have an exclusive anonymous page ...
+	 */
+	if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+		return false;
+
+	/* ... and a write-fault isn't required for other reasons. */
+	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) &&
+	    !(vma->vm_flags & VM_SOFTDIRTY) && !pmd_soft_dirty(pmd))
+		return false;
+	return !userfaultfd_huge_pmd_wp(vma, pmd);
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
@@ -1365,12 +1388,16 @@ struct page *follow_trans_huge_pmd(struc
 				   unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct page *page = NULL;
+	struct page *page;
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
-	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
-		goto out;
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if ((flags & FOLL_WRITE) &&
+	    !can_follow_write_pmd(*pmd, page, vma, flags))
+		return NULL;
 
 	/* Avoid dumping huge zero page */
 	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
@@ -1378,10 +1405,7 @@ struct page *follow_trans_huge_pmd(struc
 
 	/* Full NUMA hinting faults to serialise migration in fault paths */
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
-		goto out;
-
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+		return NULL;
 
 	if (!pmd_write(*pmd) && gup_must_unshare(flags, page))
 		return ERR_PTR(-EMLINK);
@@ -1398,7 +1422,6 @@ struct page *follow_trans_huge_pmd(struc
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
 
-out:
 	return page;
 }
From: Trond Myklebust trond.myklebust@hammerspace.com
commit 67f4b5dc49913abcdb5cc736e73674e2f352f81d upstream.
Currently, when the writeback code detects a server reboot, it redirties any pages that were not committed to disk, and it sets the flag NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor that dirtied the file. While this allows the file descriptor in question to redrive its own writes, it violates the fsync() requirement that we should be synchronising all writes to disk. While the problem is infrequent, we do see corner cases where an untimely server reboot causes the fsync() call to abandon its attempt to sync data to disk, causing data corruption issues due to missed error conditions or similar.
In order to tighten up the client's ability to deal with this situation without introducing livelocks, add a counter that records the number of times pages are redirtied due to a server reboot-like condition, and use that in fsync() to redrive the sync to disk.
Fixes: 2197e9b06c22 ("NFS: Fix up fsync() when the server rebooted")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/nfs/file.c | 15 ++++++---------
 fs/nfs/inode.c | 1 +
 fs/nfs/write.c | 6 ++++--
 include/linux/nfs_fs.h | 1 +
 4 files changed, 12 insertions(+), 11 deletions(-)
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -221,8 +221,10 @@ nfs_file_fsync_commit(struct file *file,
 int
 nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 {
-	struct nfs_open_context *ctx = nfs_file_open_context(file);
 	struct inode *inode = file_inode(file);
+	struct nfs_inode *nfsi = NFS_I(inode);
+	long save_nredirtied = atomic_long_read(&nfsi->redirtied_pages);
+	long nredirtied;
 	int ret;
 
 	trace_nfs_fsync_enter(inode);
@@ -237,15 +239,10 @@ nfs_file_fsync(struct file *file, loff_t
 		ret = pnfs_sync_inode(inode, !!datasync);
 		if (ret != 0)
 			break;
-		if (!test_and_clear_bit(NFS_CONTEXT_RESEND_WRITES, &ctx->flags))
+		nredirtied = atomic_long_read(&nfsi->redirtied_pages);
+		if (nredirtied == save_nredirtied)
 			break;
-		/*
-		 * If nfs_file_fsync_commit detected a server reboot, then
-		 * resend all dirty pages that might have been covered by
-		 * the NFS_CONTEXT_RESEND_WRITES flag
-		 */
-		start = 0;
-		end = LLONG_MAX;
+		save_nredirtied = nredirtied;
 	}
 
 	trace_nfs_fsync_exit(inode, ret);
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -426,6 +426,7 @@ nfs_ilookup(struct super_block *sb, stru
 static void nfs_inode_init_regular(struct nfs_inode *nfsi)
 {
 	atomic_long_set(&nfsi->nrequests, 0);
+	atomic_long_set(&nfsi->redirtied_pages, 0);
 	INIT_LIST_HEAD(&nfsi->commit_info.list);
 	atomic_long_set(&nfsi->commit_info.ncommit, 0);
 	atomic_set(&nfsi->commit_info.rpcs_out, 0);
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1419,10 +1419,12 @@ static void nfs_initiate_write(struct nf
  */
 static void nfs_redirty_request(struct nfs_page *req)
 {
+	struct nfs_inode *nfsi = NFS_I(page_file_mapping(req->wb_page)->host);
+
 	/* Bump the transmission count */
 	req->wb_nio++;
 	nfs_mark_request_dirty(req);
-	set_bit(NFS_CONTEXT_RESEND_WRITES, &nfs_req_openctx(req)->flags);
+	atomic_long_inc(&nfsi->redirtied_pages);
 	nfs_end_page_writeback(req);
 	nfs_release_request(req);
 }
@@ -1892,7 +1894,7 @@ static void nfs_commit_release_pages(str
 			/* We have a mismatch. Write the page again */
 			dprintk_cont(" mismatch\n");
 			nfs_mark_request_dirty(req);
-			set_bit(NFS_CONTEXT_RESEND_WRITES, &nfs_req_openctx(req)->flags);
+			atomic_long_inc(&NFS_I(data->inode)->redirtied_pages);
 next:
 			nfs_unlock_and_release_request(req);
 			/* Latency breaker */
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -182,6 +182,7 @@ struct nfs_inode {
 		/* Regular file */
 		struct {
 			atomic_long_t	nrequests;
+			atomic_long_t	redirtied_pages;
 			struct nfs_mds_commit_info commit_info;
 			struct mutex	commit_mutex;
 		};
From: Gaosheng Cui cuigaosheng1@huawei.com
commit ad982c3be4e60c7d39c03f782733503cbd88fd2a upstream.
audit_alloc_mark() assigns the pathname to audit_mark->path. On the error path from fsnotify_add_inode_mark(), fsnotify_put_mark() frees the memory of audit_mark->path, but the caller of audit_alloc_mark() then frees the pathname again, so there is a double-free problem.
Fix this by resetting audit_mark->path to NULL pointer on error path from fsnotify_add_inode_mark().
Cc: stable@vger.kernel.org
Fixes: 7b1293234084d ("fsnotify: Add group pointer in fsnotify_init_mark()")
Signed-off-by: Gaosheng Cui cuigaosheng1@huawei.com
Reviewed-by: Jan Kara jack@suse.cz
Signed-off-by: Paul Moore paul@paul-moore.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 kernel/audit_fsnotify.c | 1 +
 1 file changed, 1 insertion(+)
--- a/kernel/audit_fsnotify.c
+++ b/kernel/audit_fsnotify.c
@@ -102,6 +102,7 @@ struct audit_fsnotify_mark *audit_alloc_
 
 	ret = fsnotify_add_inode_mark(&audit_mark->mark, inode, 0);
 	if (ret < 0) {
+		audit_mark->path = NULL;
 		fsnotify_put_mark(&audit_mark->mark);
 		audit_mark = ERR_PTR(ret);
 	}
From: Jing-Ting Wu Jing-Ting.Wu@mediatek.com
commit 763f4fb76e24959c370cdaa889b2492ba6175580 upstream.
Root cause: rebind_subsystems() holds no lock while it moves a css object from list A to list B, so list B's head can be treated as a css node by a concurrent list_for_each_entry_rcu().
Solution: Add grace period before invalidating the removed rstat_css_node.
Reported-by: Jing-Ting Wu jing-ting.wu@mediatek.com
Suggested-by: Michal Koutný mkoutny@suse.com
Signed-off-by: Jing-Ting Wu jing-ting.wu@mediatek.com
Tested-by: Jing-Ting Wu jing-ting.wu@mediatek.com
Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c0112...
Acked-by: Mukesh Ojha quic_mojha@quicinc.com
Fixes: a7df69b81aac ("cgroup: rstat: support cgroup1")
Cc: stable@vger.kernel.org # v5.13+
Signed-off-by: Tejun Heo tj@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 kernel/cgroup/cgroup.c | 1 +
 1 file changed, 1 insertion(+)
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1811,6 +1811,7 @@ int rebind_subsystems(struct cgroup_root
 
 		if (ss->css_rstat_flush) {
 			list_del_rcu(&css->rstat_css_node);
+			synchronize_rcu();
 			list_add_rcu(&css->rstat_css_node,
 				     &dcgrp->rstat_css_list);
 		}
From: Helge Deller deller@gmx.de
commit 3dcfb729b5f4a0c9b50742865cd5e6c4dbcc80dc upstream.
With this patch the ARCH= parameter decides if the CONFIG_64BIT option will be set or not. This means, the ARCH= parameter will give:
  ARCH=parisc   -> 32-bit kernel
  ARCH=parisc64 -> 64-bit kernel
This simplifies the usage of the other config options like randconfig, allmodconfig and allyesconfig a lot and produces the output which is expected for parisc64 (64-bit) vs. parisc (32-bit).
Suggested-by: Masahiro Yamada masahiroy@kernel.org
Signed-off-by: Helge Deller deller@gmx.de
Tested-by: Randy Dunlap rdunlap@infradead.org
Reviewed-by: Randy Dunlap rdunlap@infradead.org
Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/parisc/Kconfig | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -147,10 +147,10 @@ menu "Processor type and features"
 
 choice
 	prompt "Processor type"
-	default PA7000
+	default PA7000 if "$(ARCH)" = "parisc"
 
 config PA7000
-	bool "PA7000/PA7100"
+	bool "PA7000/PA7100" if "$(ARCH)" = "parisc"
 	help
 	  This is the processor type of your CPU.  This information is
 	  used for optimizing purposes.  In order to compile a kernel
@@ -161,21 +161,21 @@ config PA7000
 	  which is required on some machines.
 
 config PA7100LC
-	bool "PA7100LC"
+	bool "PA7100LC" if "$(ARCH)" = "parisc"
 	help
 	  Select this option for the PCX-L processor, as used in the
 	  712, 715/64, 715/80, 715/100, 715/100XC, 725/100, 743, 748,
 	  D200, D210, D300, D310 and E-class
 
 config PA7200
-	bool "PA7200"
+	bool "PA7200" if "$(ARCH)" = "parisc"
 	help
 	  Select this option for the PCX-T' processor, as used in the
 	  C100, C110, J100, J110, J210XC, D250, D260, D350, D360,
 	  K100, K200, K210, K220, K400, K410 and K420
 
 config PA7300LC
-	bool "PA7300LC"
+	bool "PA7300LC" if "$(ARCH)" = "parisc"
 	help
 	  Select this option for the PCX-L2 processor, as used in the
 	  744, A180, B132L, B160L, B180L, C132L, C160L, C180L,
@@ -225,17 +225,8 @@ config MLONGCALLS
 	  Enabling this option will probably slow down your kernel.
 
 config 64BIT
-	bool "64-bit kernel"
+	def_bool "$(ARCH)" = "parisc64"
 	depends on PA8X00
-	help
-	  Enable this if you want to support 64bit kernel on PA-RISC platform.
-
-	  At the moment, only people willing to use more than 2GB of RAM,
-	  or having a 64bit-only capable PA-RISC machine should say Y here.
-
-	  Since there is no 64bit userland on PA-RISC, there is no point to
-	  enable this option otherwise. The 64bit kernel is significantly bigger
-	  and slower than the 32bit one.
 
 choice
 	prompt "Kernel page size"
From: Helge Deller deller@gmx.de
commit 7ae1f5508d9a33fd58ed3059bd2d569961e3b8bd upstream.
The exception handler is broken for unaligned memory accesses with fldw and fstw instructions, because on loads and stores it randomly trashes or uses some other floating-point register than the one specified in the instruction word.
The instruction "fldw 0(addr),%fr22L" (and the other fldw/fstw instructions) encode the target register (%fr22) in the rightmost 5 bits of the instruction word. The 7th rightmost bit of the instruction word defines if the left or right half of %fr22 should be used.
While processing unaligned address accesses, the FR3() define is used to extract the offset into the local floating-point register set. But the calculation in FR3() was buggy, so that for example instead of %fr22, register %fr12 [((22 * 2) & 0x1f) = 12] was used.
This bug has existed in the parisc kernel since forever, and I wonder why it wasn't detected earlier. Interestingly, I noticed this bug only because the libime debian package failed to build on *native* hardware, while it built successfully in qemu.
This patch corrects the bitshift and masking calculation in FR3().
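To see the arithmetic, here is a small user-space check (hypothetical, not part of the patch) that plugs the register field of %fr22 into the old and new versions of the macro:

--------------------------------------------------------------------------
/* Hypothetical check of the FR3() fix: the macro builds an index from
 * the 5-bit register field (bits 0-4) and the left/right-half selector
 * (bit 6) of the instruction word. */
#include <stdio.h>

#define FR3_OLD(i) ((((i)<<1)&0x1f)|(((i)>>6)&1))	/* buggy */
#define FR3_NEW(i) ((((i)&0x1f)<<1)|(((i)>>6)&1))	/* fixed */

int main(void)
{
	unsigned int insn = 22;	/* low 5 bits encode %fr22, bit 6 clear */

	/* Old: (22 << 1) & 0x1f = 12 -- the top bit of the register
	 * number is shifted out before masking, so the wrong register
	 * half is addressed. */
	printf("old index: %u\n", FR3_OLD(insn));
	/* New: (22 & 0x1f) << 1 = 44 -- %fr22, left half, as encoded. */
	printf("new index: %u\n", FR3_NEW(insn));
	return 0;
}
--------------------------------------------------------------------------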
Signed-off-by: Helge Deller deller@gmx.de
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/parisc/kernel/unaligned.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/parisc/kernel/unaligned.c
+++ b/arch/parisc/kernel/unaligned.c
@@ -93,7 +93,7 @@
 #define R1(i) (((i)>>21)&0x1f)
 #define R2(i) (((i)>>16)&0x1f)
 #define R3(i) ((i)&0x1f)
-#define FR3(i) ((((i)<<1)&0x1f)|(((i)>>6)&1))
+#define FR3(i) ((((i)&0x1f)<<1)|(((i)>>6)&1))
 #define IM(i,n) (((i)>>1&((1<<(n-1))-1))|((i)&1?((0-1L)<<(n-1)):0))
 #define IM5_2(i) IM((i)>>16,5)
 #define IM5_3(i) IM((i),5)
From: Randy Dunlap rdunlap@infradead.org
commit a8faed3a02eeb75857a3b5d660fa80fe79db77a3 upstream.
When CONFIG_ADVISE_SYSCALLS is not set/enabled and CONFIG_COMPAT is set/enabled, the riscv compat_syscall_table references 'compat_sys_fadvise64_64', which is not defined:
riscv64-linux-ld: arch/riscv/kernel/compat_syscall_table.o:(.rodata+0x6f8): undefined reference to `compat_sys_fadvise64_64'
Add 'fadvise64_64' to kernel/sys_ni.c as a conditional COMPAT function so that when CONFIG_ADVISE_SYSCALLS is not set, there is a fallback function available.
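For readers unfamiliar with the COND_SYSCALL*() machinery, the idiom boils down to a weak stub that loses to a strong definition at link time; the following user-space sketch (simplified, not the kernel's exact macro expansion) shows the principle:

--------------------------------------------------------------------------
/* Simplified sketch of the COND_SYSCALL()/COND_SYSCALL_COMPAT() idiom:
 * a weak stub returns -ENOSYS; when the real implementation is built
 * in, its strong definition overrides the stub at link time. */
#include <errno.h>
#include <stdio.h>

long __attribute__((weak)) compat_sys_fadvise64_64(void)
{
	return -ENOSYS;	/* fallback when CONFIG_ADVISE_SYSCALLS=n */
}

int main(void)
{
	/* With no strong definition linked in, the weak stub runs. */
	printf("compat_sys_fadvise64_64() = %ld\n",
	       compat_sys_fadvise64_64());
	return 0;
}
--------------------------------------------------------------------------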
Link: https://lkml.kernel.org/r/20220807220934.5689-1-rdunlap@infradead.org
Fixes: d3ac21cacc24 ("mm: Support compiling out madvise and fadvise")
Signed-off-by: Randy Dunlap rdunlap@infradead.org
Suggested-by: Arnd Bergmann arnd@arndb.de
Reviewed-by: Arnd Bergmann arnd@arndb.de
Cc: Josh Triplett josh@joshtriplett.org
Cc: Paul Walmsley paul.walmsley@sifive.com
Cc: Palmer Dabbelt palmer@dabbelt.com
Cc: Albert Ou aou@eecs.berkeley.edu
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 kernel/sys_ni.c | 1 +
 1 file changed, 1 insertion(+)
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -277,6 +277,7 @@ COND_SYSCALL(landlock_restrict_self);
 
 /* mm/fadvise.c */
 COND_SYSCALL(fadvise64_64);
+COND_SYSCALL_COMPAT(fadvise64_64);
 
 /* mm/, CONFIG_MMU only */
 COND_SYSCALL(swapon);
From: Kuniyuki Iwashima kuniyu@amazon.com
commit 9c80e79906b4ca440d09e7f116609262bb747909 upstream.
The assumption in __disable_kprobe() is wrong, and it could try to disarm an already disarmed kprobe and fire the WARN_ONCE() below [0]. We can easily reproduce this issue.
1. Write 0 to /sys/kernel/debug/kprobes/enabled.
# echo 0 > /sys/kernel/debug/kprobes/enabled
2. Run execsnoop. At this time, one kprobe is disabled.
# /usr/share/bcc/tools/execsnoop &
[1] 2460
PCOMM            PID    PPID   RET ARGS
# cat /sys/kernel/debug/kprobes/list
ffffffff91345650  r  __x64_sys_execve+0x0  [FTRACE]
ffffffff91345650  k  __x64_sys_execve+0x0  [DISABLED][FTRACE]
3. Write 1 to /sys/kernel/debug/kprobes/enabled, which changes kprobes_all_disarmed to false but does not arm the disabled kprobe.
# echo 1 > /sys/kernel/debug/kprobes/enabled
# cat /sys/kernel/debug/kprobes/list
ffffffff91345650  r  __x64_sys_execve+0x0  [FTRACE]
ffffffff91345650  k  __x64_sys_execve+0x0  [DISABLED][FTRACE]
4. Kill execsnoop; at that point __disable_kprobe() calls disarm_kprobe() for the disabled kprobe and hits the WARN_ONCE() in __disarm_kprobe_ftrace().
# fg
/usr/share/bcc/tools/execsnoop
^C
Actually, WARN_ONCE() is fired twice, and __unregister_kprobe_top() misses some cleanups and leaves the aggregated kprobe in the hash table. Then, __unregister_trace_kprobe() initialises tk->rp.kp.list and creates an infinite loop like this.
aggregated kprobe.list -> kprobe.list -.
                              ^        |
                              '.__.'
In this situation, these commands fall into the infinite loop and result in RCU stall or soft lockup.
cat /sys/kernel/debug/kprobes/list : show_kprobe_addr() enters into the infinite loop with RCU.
/usr/share/bcc/tools/execsnoop : warn_kprobe_rereg() holds kprobe_mutex, and __get_valid_kprobe() is stuck in the loop.
To avoid the issue, make sure we don't call disarm_kprobe() for disabled kprobes.
[0]
Failed to disarm kprobe-ftrace at __x64_sys_execve+0x0/0x40 (error -2)
WARNING: CPU: 6 PID: 2460 at kernel/kprobes.c:1130 __disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
Modules linked in: ena
CPU: 6 PID: 2460 Comm: execsnoop Not tainted 5.19.0+ #28
Hardware name: Amazon EC2 c5.2xlarge/, BIOS 1.0 10/16/2017
RIP: 0010:__disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
Code: 24 8b 02 eb c1 80 3d c4 83 f2 01 00 75 d4 48 8b 75 00 89 c2 48 c7 c7 90 fa 0f 92 89 04 24 c6 05 ab 83 01 e8 e4 94 f0 ff <0f> 0b 8b 04 24 eb b1 89 c6 48 c7 c7 60 fa 0f 92 89 04 24 e8 cc 94
RSP: 0018:ffff9e6ec154bd98 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffffff930f7b00 RCX: 0000000000000001
RDX: 0000000080000001 RSI: ffffffff921461c5 RDI: 00000000ffffffff
RBP: ffff89c504286da8 R08: 0000000000000000 R09: c0000000fffeffff
R10: 0000000000000000 R11: ffff9e6ec154bc28 R12: ffff89c502394e40
R13: ffff89c502394c00 R14: ffff9e6ec154bc00 R15: 0000000000000000
FS:  00007fe800398740(0000) GS:ffff89c812d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c00057f010 CR3: 0000000103b54006 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
 __disable_kprobe (kernel/kprobes.c:1716)
 disable_kprobe (kernel/kprobes.c:2392)
 __disable_trace_kprobe (kernel/trace/trace_kprobe.c:340)
 disable_trace_kprobe (kernel/trace/trace_kprobe.c:429)
 perf_trace_event_unreg.isra.2 (./include/linux/tracepoint.h:93 kernel/trace/trace_event_perf.c:168)
 perf_kprobe_destroy (kernel/trace/trace_event_perf.c:295)
 _free_event (kernel/events/core.c:4971)
 perf_event_release_kernel (kernel/events/core.c:5176)
 perf_release (kernel/events/core.c:5186)
 __fput (fs/file_table.c:321)
 task_work_run (./include/linux/sched.h:2056 (discriminator 1) kernel/task_work.c:179 (discriminator 1))
 exit_to_user_mode_prepare (./include/linux/resume_user_mode.h:49 kernel/entry/common.c:169 kernel/entry/common.c:201)
 syscall_exit_to_user_mode (./arch/x86/include/asm/jump_label.h:55 ./arch/x86/include/asm/nospec-branch.h:384 ./arch/x86/include/asm/entry-common.h:94 kernel/entry/common.c:133 kernel/entry/common.c:296)
 do_syscall_64 (arch/x86/entry/common.c:87)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
RIP: 0033:0x7fe7ff210654
Code: 15 79 89 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb be 0f 1f 00 8b 05 9a cd 20 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 34 fc
RSP: 002b:00007ffdbd1d3538 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007fe7ff210654
RDX: 0000000000000000 RSI: 0000000000002401 RDI: 0000000000000008
RBP: 0000000000000000 R08: 94ae31d6fda838a4 R09: 00007fe8001c9d30
R10: 00007ffdbd1d34b0 R11: 0000000000000246 R12: 00007ffdbd1d3600
R13: 0000000000000000 R14: fffffffffffffffc R15: 00007ffdbd1d3560
</TASK>
Link: https://lkml.kernel.org/r/20220813020509.90805-1-kuniyu@amazon.com
Fixes: 69d54b916d83 ("kprobes: makes kprobes/enabled works correctly for optimized kprobes.")
Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com
Reported-by: Ayushman Dutta ayudutta@amazon.com
Cc: "Naveen N. Rao" naveen.n.rao@linux.ibm.com
Cc: Anil S Keshavamurthy anil.s.keshavamurthy@intel.com
Cc: "David S. Miller" davem@davemloft.net
Cc: Masami Hiramatsu mhiramat@kernel.org
Cc: Wang Nan wangnan0@huawei.com
Cc: Kuniyuki Iwashima kuniyu@amazon.com
Cc: Kuniyuki Iwashima kuni1840@gmail.com
Cc: Ayushman Dutta ayudutta@amazon.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 kernel/kprobes.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1707,11 +1707,12 @@ static struct kprobe *__disable_kprobe(s
 	/* Try to disarm and disable this/parent probe */
 	if (p == orig_p || aggr_kprobe_disabled(orig_p)) {
 		/*
-		 * If 'kprobes_all_disarmed' is set, 'orig_p'
-		 * should have already been disarmed, so
-		 * skip unneed disarming process.
+		 * Don't be lazy here.  Even if 'kprobes_all_disarmed'
+		 * is false, 'orig_p' might not have been armed yet.
+		 * Note arm_all_kprobes() __tries__ to arm all kprobes
+		 * on the best effort basis.
 		 */
-		if (!kprobes_all_disarmed) {
+		if (!kprobes_all_disarmed && !kprobe_disabled(orig_p)) {
 			ret = disarm_kprobe(orig_p, true);
 			if (ret) {
 				p->flags &= ~KPROBE_FLAG_DISABLED;
From: Peter Xu peterx@redhat.com
commit f369b07c861435bd812a9d14493f71b34132ed6f upstream.
The motivation of this patch comes from a recent report and patchfix from David Hildenbrand on hugetlb shared handling of wr-protected page [1].
With the reproducer provided in commit message of [1], one can leverage the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect not only the attacker process, but also the whole system.
The lazy-reset mechanism of uffd-wp was used to make unregister faster; meanwhile it rests on the assumption that any leftover pgtable entries should only affect the process itself: not only should the user be aware of anything the process does, but the leftovers should also not affect anything outside of the process.
But it seems that this is not true, and it can also be utilized to make some exploit easier.
So far there's no clue showing that the lazy-reset is important to any userfaultfd users, because normally the unregister will only happen once for a specific range of memory during the lifecycle of the process.
Considering all above, what this patch proposes is to do explicit pte resets when unregister an uffd region with wr-protect mode enabled.
It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false) right before ioctl(UFFDIO_UNREGISTER) for the user. So potentially it'll make the unregister slower. From that point of view it's a very slight ABI change, but hopefully nothing should break with this change either.
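Spelled out as user-space code, that equivalence looks like the sketch below (unregister_with_reset() is a hypothetical helper; it assumes 'uffd' is a userfaultfd file descriptor and addr/len describe a range previously registered with UFFDIO_REGISTER_MODE_WP):

--------------------------------------------------------------------------
/* Sketch of the user-visible equivalent of the new unregister
 * behaviour; unregister_with_reset() is a hypothetical helper. */
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

static int unregister_with_reset(int uffd, void *addr, unsigned long len)
{
	struct uffdio_writeprotect wp;
	struct uffdio_range range;

	/* Explicit write-unprotect (wp=false): mode without ..._MODE_WP. */
	wp.range.start = (unsigned long) addr;
	wp.range.len = len;
	wp.mode = 0;
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		return -1;

	/* ...immediately followed by the unregister itself. */
	range.start = (unsigned long) addr;
	range.len = len;
	return ioctl(uffd, UFFDIO_UNREGISTER, &range);
}
--------------------------------------------------------------------------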
Regarding to the change itself - core of uffd write [un]protect operation is moved into a separate function (uffd_wp_range()) and it is reused in the unregister code path.
Note that the new function will not check for anything, e.g. ranges or memory types, because they should have been checked during the previous UFFDIO_REGISTER or it should have failed already. It also doesn't check mmap_changing because we're with mmap write lock held anyway.
I added a Fixes tag upon the introduction of uffd-wp shmem+hugetlbfs support, because that's the only issue reported so far and that's the commit with which David's reproducer starts working (v5.19+). But the whole idea actually applies not only to file memories but also to anonymous memory. It's just that we don't need to fix anonymous memory prior to v5.19 because there's no known way to exploit it.
IOW, this patch can also fix the issue reported in [1] as the patch 2 does.
[1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/
Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs")
Signed-off-by: Peter Xu peterx@redhat.com
Cc: David Hildenbrand david@redhat.com
Cc: Mike Rapoport rppt@linux.vnet.ibm.com
Cc: Mike Kravetz mike.kravetz@oracle.com
Cc: Andrea Arcangeli aarcange@redhat.com
Cc: Nadav Amit nadav.amit@gmail.com
Cc: Axel Rasmussen axelrasmussen@google.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 fs/userfaultfd.c | 4 ++++
 include/linux/userfaultfd_k.h | 2 ++
 mm/userfaultfd.c | 29 ++++++++++++++++-----------
 3 files changed, 24 insertions(+), 11 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1c44bf75f916..175de70e3adf 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1601,6 +1601,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 			wake_userfault(vma->vm_userfaultfd_ctx.ctx, &range);
 		}
 
+		/* Reset ptes for the whole vma range if wr-protected */
+		if (userfaultfd_wp(vma))
+			uffd_wp_range(mm, vma, start, vma_end - start, false);
+
 		new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
 		prev = vma_merge(mm, prev, start, vma_end, new_flags,
 				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 732b522bacb7..e1b8a915e9e9 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -73,6 +73,8 @@ extern ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long dst_start,
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
 			       bool enable_wp, atomic_t *mmap_changing);
+extern void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *vma,
+			  unsigned long start, unsigned long len, bool enable_wp);
 
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 07d3befc80e4..7327b2573f7c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -703,14 +703,29 @@ ssize_t mcopy_continue(struct mm_struct *dst_mm, unsigned long start,
 			      mmap_changing, 0);
 }
 
+void uffd_wp_range(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+		   unsigned long start, unsigned long len, bool enable_wp)
+{
+	struct mmu_gather tlb;
+	pgprot_t newprot;
+
+	if (enable_wp)
+		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
+	else
+		newprot = vm_get_page_prot(dst_vma->vm_flags);
+
+	tlb_gather_mmu(&tlb, dst_mm);
+	change_protection(&tlb, dst_vma, start, start + len, newprot,
+			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+	tlb_finish_mmu(&tlb);
+}
+
 int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 			unsigned long len, bool enable_wp,
 			atomic_t *mmap_changing)
 {
 	struct vm_area_struct *dst_vma;
 	unsigned long page_mask;
-	struct mmu_gather tlb;
-	pgprot_t newprot;
 	int err;
 
 	/*
@@ -750,15 +765,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 		goto out_unlock;
 	}
 
-	if (enable_wp)
-		newprot = vm_get_page_prot(dst_vma->vm_flags & ~(VM_WRITE));
-	else
-		newprot = vm_get_page_prot(dst_vma->vm_flags);
-
-	tlb_gather_mmu(&tlb, dst_mm);
-	change_protection(&tlb, dst_vma, start, start + len, newprot,
-			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
-	tlb_finish_mmu(&tlb);
+	uffd_wp_range(dst_mm, dst_vma, start, len, enable_wp);
 
 	err = 0;
 out_unlock:
From: David Hildenbrand david@redhat.com
commit 1d8d14641fd94a01b20a4abbf2749fd8eddcf57b upstream.
If we ever get a write-fault on a write-protected page in a shared mapping, we'd be in trouble (again). Instead, we can simply map the page writable.
And in fact, there is even a way right now to trigger that code via uffd-wp, ever since we started to support it for shmem in 5.19:
--------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

#define HUGETLB_SIZE (2 * 1024 * 1024u)

static char *map;
int uffd;

static int temp_setup_uffd(void)
{
	struct uffdio_api uffdio_api;
	struct uffdio_register uffdio_register;
	struct uffdio_writeprotect uffd_writeprotect;
	struct uffdio_range uffd_range;

	uffd = syscall(__NR_userfaultfd,
		       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
	if (uffd < 0) {
		fprintf(stderr, "syscall() failed: %d\n", errno);
		return -errno;
	}

	uffdio_api.api = UFFD_API;
	uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
	if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
		fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
		return -errno;
	}

	if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
		fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
		return -ENOSYS;
	}

	/* Register UFFD-WP */
	uffdio_register.range.start = (unsigned long) map;
	uffdio_register.range.len = HUGETLB_SIZE;
	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
		fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
		return -errno;
	}

	/* Writeprotect a single page. */
	uffd_writeprotect.range.start = (unsigned long) map;
	uffd_writeprotect.range.len = HUGETLB_SIZE;
	uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
		fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
		return -errno;
	}

	/* Unregister UFFD-WP without prior writeunprotection. */
	uffd_range.start = (unsigned long) map;
	uffd_range.len = HUGETLB_SIZE;
	if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
		fprintf(stderr, "UFFDIO_UNREGISTER failed: %d\n", errno);
		return -errno;
	}

	return 0;
}

int main(int argc, char **argv)
{
	int fd;

	fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		fprintf(stderr, "open() failed\n");
		return -errno;
	}
	if (ftruncate(fd, HUGETLB_SIZE)) {
		fprintf(stderr, "ftruncate() failed\n");
		return -errno;
	}

	map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		return -errno;
	}

	*map = 0;

	if (temp_setup_uffd())
		return 1;

	*map = 0;

	return 0;
}
--------------------------------------------------------------------------
The above test fails with SIGBUS when there is only a single free hugetlb page:

# echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
Bus error (core dumped)
And worse, with sufficient free hugetlb pages it will map an anonymous page into a shared mapping, for example, messing up accounting during unmap and breaking MAP_SHARED semantics:

# echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# ./test
# cat /proc/meminfo | grep HugePages_
HugePages_Total:       2
HugePages_Free:        1
HugePages_Rsvd:    18446744073709551615
HugePages_Surp:        0
The reason is that uffd-wp doesn't clear the uffd-wp PTE bit when unregistering and consequently keeps the PTE writeprotected. The reason for that is to avoid the additional overhead when unregistering. Note that this is the case also for !hugetlb and that we will end up with writable PTEs that still have the uffd-wp PTE bit set once we return from hugetlb_wp(). I'm not touching the uffd-wp PTE bit for now, because it seems to be a generic thing -- wp_page_reuse() also doesn't clear it.
VM_MAYSHARE handling in hugetlb_fault() for FAULT_FLAG_WRITE indicates that MAP_SHARED handling was at least envisioned, but could never have worked as expected.
While at it, make sure that we never end up in hugetlb_wp() on write faults without VM_WRITE, because we don't support maybe_mkwrite() semantics as commonly used in the !hugetlb case -- for example, in wp_page_reuse().
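For reference, here is a minimal sketch of the !hugetlb helper referred to above, roughly as it looks in include/linux/mm.h around this kernel version (reproduced only for illustration; the explanatory comment is mine):

static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
	/* Only actually map the pte writable if the VMA allows writes;
	 * FOLL_FORCE-style write faults rely on the pte staying R/O
	 * otherwise. hugetlb_wp() has no equivalent of this. */
	if (likely(vma->vm_flags & VM_WRITE))
		pte = pte_mkwrite(pte);
	return pte;
}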
Note that there is no need to do any kind of reservation in hugetlb_fault() in this case, because we already have a hugetlb page mapped R/O that we will simply map writable, and we are not dealing with COW/unsharing.
Link: https://lkml.kernel.org/r/20220811103435.188481-3-david@redhat.com Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") Signed-off-by: David Hildenbrand david@redhat.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Cc: Bjorn Helgaas bhelgaas@google.com Cc: Cyrill Gorcunov gorcunov@openvz.org Cc: Hugh Dickins hughd@google.com Cc: Jamie Liu jamieliu@google.com Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com Cc: Muchun Song songmuchun@bytedance.com Cc: Naoya Horiguchi n-horiguchi@ah.jp.nec.com Cc: Pavel Emelyanov xemul@parallels.com Cc: Peter Feiner pfeiner@google.com Cc: Peter Xu peterx@redhat.com Cc: stable@vger.kernel.org [5.19] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/hugetlb.c | 26 +++++++++++++++++++------- 1 file changed, 19 insertions(+), 7 deletions(-)
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5232,6 +5232,21 @@ static vm_fault_t hugetlb_wp(struct mm_s
 	VM_BUG_ON(unshare && (flags & FOLL_WRITE));
 	VM_BUG_ON(!unshare && !(flags & FOLL_WRITE));
 
+	/*
+	 * hugetlb does not support FOLL_FORCE-style write faults that keep the
+	 * PTE mapped R/O such as maybe_mkwrite() would do.
+	 */
+	if (WARN_ON_ONCE(!unshare && !(vma->vm_flags & VM_WRITE)))
+		return VM_FAULT_SIGSEGV;
+
+	/* Let's take out MAP_SHARED mappings first. */
+	if (vma->vm_flags & VM_MAYSHARE) {
+		if (unlikely(unshare))
+			return 0;
+		set_huge_ptep_writable(vma, haddr, ptep);
+		return 0;
+	}
+
 	pte = huge_ptep_get(ptep);
 	old_page = pte_page(pte);
@@ -5766,12 +5781,11 @@ vm_fault_t hugetlb_fault(struct mm_struc
 	 * If we are going to COW/unshare the mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that any
 	 * allocations necessary to record that reservation occur outside the
-	 * spinlock. For private mappings, we also lookup the pagecache
-	 * page now as it is used to determine if a reservation has been
-	 * consumed.
+	 * spinlock. Also lookup the pagecache page now as it is used to
+	 * determine if a reservation has been consumed.
 	 */
 	if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
-	    !huge_pte_write(entry)) {
+	    !(vma->vm_flags & VM_MAYSHARE) && !huge_pte_write(entry)) {
 		if (vma_needs_reservation(h, vma, haddr) < 0) {
 			ret = VM_FAULT_OOM;
 			goto out_mutex;
@@ -5779,9 +5793,7 @@ vm_fault_t hugetlb_fault(struct mm_struc
 		/* Just decrements count, does not deallocate */
 		vma_end_reservation(h, vma, haddr);
 
-		if (!(vma->vm_flags & VM_MAYSHARE))
-			pagecache_page = hugetlbfs_pagecache_page(h,
-								vma, haddr);
+		pagecache_page = hugetlbfs_pagecache_page(h, vma, haddr);
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
From: Deren Wu deren.wu@mediatek.com
commit 9d958b60ebc2434f2b7eae83d77849e22d1059eb upstream.
Because the AP was stopped improperly, the mt7921 driver would face random command timeouts caused by a chip firmware problem. Migrate the AP start/stop process to .start_ap/.stop_ap and configure the BSS network settings in both hooks.
The new flow is shown below.

* AP start
  .start_ap()
      configure BSS network resource
      set BSS to connected state
  .bss_info_changed()
      enable fw beacon offload

* AP stop
  .bss_info_changed()
      disable fw beacon offload (skip this command)
  .stop_ap()
      set BSS to disconnected state (beacon offload disabled automatically)
      destroy BSS network resource
Fixes: 116c69603b01 ("mt76: mt7921: Add AP mode support") Signed-off-by: Sean Wang sean.wang@mediatek.com Signed-off-by: Deren Wu deren.wu@mediatek.com Signed-off-by: Felix Fietkau nbd@nbd.name Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/wireless/mediatek/mt76/mt76_connac_mcu.c | 2 drivers/net/wireless/mediatek/mt76/mt7921/main.c | 47 +++++++++++++++---- drivers/net/wireless/mediatek/mt76/mt7921/mcu.c | 5 -- drivers/net/wireless/mediatek/mt76/mt7921/mt7921.h | 2 4 files changed, 43 insertions(+), 13 deletions(-)
--- a/drivers/net/wireless/mediatek/mt76/mt76_connac_mcu.c +++ b/drivers/net/wireless/mediatek/mt76/mt76_connac_mcu.c @@ -1403,6 +1403,8 @@ int mt76_connac_mcu_uni_add_bss(struct m else conn_type = CONNECTION_INFRA_AP; basic_req.basic.conn_type = cpu_to_le32(conn_type); + /* Fully active/deactivate BSS network in AP mode only */ + basic_req.basic.active = enable; break; case NL80211_IFTYPE_STATION: if (vif->p2p) --- a/drivers/net/wireless/mediatek/mt76/mt7921/main.c +++ b/drivers/net/wireless/mediatek/mt76/mt7921/main.c @@ -653,15 +653,6 @@ static void mt7921_bss_info_changed(stru } }
- if (changed & BSS_CHANGED_BEACON_ENABLED && info->enable_beacon) { - struct mt7921_vif *mvif = (struct mt7921_vif *)vif->drv_priv; - - mt76_connac_mcu_uni_add_bss(phy->mt76, vif, &mvif->sta.wcid, - true); - mt7921_mcu_sta_update(dev, NULL, vif, true, - MT76_STA_INFO_STATE_NONE); - } - if (changed & (BSS_CHANGED_BEACON | BSS_CHANGED_BEACON_ENABLED)) mt7921_mcu_uni_add_beacon_offload(dev, hw, vif, @@ -1500,6 +1491,42 @@ mt7921_channel_switch_beacon(struct ieee mt7921_mutex_release(dev); }
+static int +mt7921_start_ap(struct ieee80211_hw *hw, struct ieee80211_vif *vif) +{ + struct mt7921_vif *mvif = (struct mt7921_vif *)vif->drv_priv; + struct mt7921_phy *phy = mt7921_hw_phy(hw); + struct mt7921_dev *dev = mt7921_hw_dev(hw); + int err; + + err = mt76_connac_mcu_uni_add_bss(phy->mt76, vif, &mvif->sta.wcid, + true); + if (err) + return err; + + err = mt7921_mcu_set_bss_pm(dev, vif, true); + if (err) + return err; + + return mt7921_mcu_sta_update(dev, NULL, vif, true, + MT76_STA_INFO_STATE_NONE); +} + +static void +mt7921_stop_ap(struct ieee80211_hw *hw, struct ieee80211_vif *vif) +{ + struct mt7921_vif *mvif = (struct mt7921_vif *)vif->drv_priv; + struct mt7921_phy *phy = mt7921_hw_phy(hw); + struct mt7921_dev *dev = mt7921_hw_dev(hw); + int err; + + err = mt7921_mcu_set_bss_pm(dev, vif, false); + if (err) + return; + + mt76_connac_mcu_uni_add_bss(phy->mt76, vif, &mvif->sta.wcid, false); +} + const struct ieee80211_ops mt7921_ops = { .tx = mt7921_tx, .start = mt7921_start, @@ -1510,6 +1537,8 @@ const struct ieee80211_ops mt7921_ops = .conf_tx = mt7921_conf_tx, .configure_filter = mt7921_configure_filter, .bss_info_changed = mt7921_bss_info_changed, + .start_ap = mt7921_start_ap, + .stop_ap = mt7921_stop_ap, .sta_state = mt7921_sta_state, .sta_pre_rcu_remove = mt76_sta_pre_rcu_remove, .set_key = mt7921_set_key, --- a/drivers/net/wireless/mediatek/mt76/mt7921/mcu.c +++ b/drivers/net/wireless/mediatek/mt76/mt7921/mcu.c @@ -1020,7 +1020,7 @@ mt7921_mcu_uni_bss_bcnft(struct mt7921_d &bcnft_req, sizeof(bcnft_req), true); }
-static int +int mt7921_mcu_set_bss_pm(struct mt7921_dev *dev, struct ieee80211_vif *vif, bool enable) { @@ -1049,9 +1049,6 @@ mt7921_mcu_set_bss_pm(struct mt7921_dev }; int err;
- if (vif->type != NL80211_IFTYPE_STATION) - return 0; - err = mt76_mcu_send_msg(&dev->mt76, MCU_CE_CMD(SET_BSS_ABORT), &req_hdr, sizeof(req_hdr), false); if (err < 0 || !enable) --- a/drivers/net/wireless/mediatek/mt76/mt7921/mt7921.h +++ b/drivers/net/wireless/mediatek/mt76/mt7921/mt7921.h @@ -280,6 +280,8 @@ int mt7921_wpdma_reset(struct mt7921_dev int mt7921_wpdma_reinit_cond(struct mt7921_dev *dev); void mt7921_dma_cleanup(struct mt7921_dev *dev); int mt7921_run_firmware(struct mt7921_dev *dev); +int mt7921_mcu_set_bss_pm(struct mt7921_dev *dev, struct ieee80211_vif *vif, + bool enable); int mt7921_mcu_sta_update(struct mt7921_dev *dev, struct ieee80211_sta *sta, struct ieee80211_vif *vif, bool enable, enum mt76_sta_info_state state);
From: Xin Xiong xiongx18@fudan.edu.cn
[ Upstream commit 9c9cb23e00ddf45679b21b4dacc11d1ae7961ebe ]
The issue happens on an error path in __xfrm_policy_check(). When fetching the object `pols[1]` fails, the function simply returns 0, forgetting to decrement the reference count of `pols[0]`, which was incremented earlier by either xfrm_sk_policy_lookup() or xfrm_policy_lookup(). This may result in memory leaks.
Fix it by decreasing the reference count of `pols[0]` in that path.
Fixes: 134b0fc544ba ("IPsec: propagate security module errors up from flow_cache_lookup") Signed-off-by: Xin Xiong xiongx18@fudan.edu.cn Signed-off-by: Xin Tan tanxin.ctf@gmail.com Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/xfrm/xfrm_policy.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index f1a0bab920a55..4f8bbb825abcb 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -3599,6 +3599,7 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 	if (pols[1]) {
 		if (IS_ERR(pols[1])) {
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMINPOLERROR);
+			xfrm_pol_put(pols[0]);
 			return 0;
 		}
 		pols[1]->curlft.use_time = ktime_get_real_seconds();
From: Antony Antony antony.antony@secunet.com
[ Upstream commit 717ada9f10f2de8c4f4d72ad045f3b67a7ced715 ]
This reverts commit af734a26a1a95a9fda51f2abb0c22a7efcafd5ca.
The above commit is a regression according to RFC 2367. A better fix would be to use x->lastused, which will be proposed later (a sketch follows the RFC quote below).

According to RFC 2367, use_time == sadb_lifetime_usetime:
"sadb_lifetime_usetime For CURRENT, the time, in seconds, when association was first used. For HARD and SOFT, the number of seconds after the first use of the association until it expires."
Fixes: af734a26a1a9 ("xfrm: update SA curlft.use_time") Signed-off-by: Antony Antony antony.antony@secunet.com Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/xfrm/xfrm_input.c | 1 - net/xfrm/xfrm_output.c | 1 - 2 files changed, 2 deletions(-)
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 144238a50f3d4..70a8c36f0ba6e 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -669,7 +669,6 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 
 		x->curlft.bytes += skb->len;
 		x->curlft.packets++;
-		x->curlft.use_time = ktime_get_real_seconds();
 
 		spin_unlock(&x->lock);
 
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 555ab35cd119a..9a5e79a38c679 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -534,7 +534,6 @@ static int xfrm_output_one(struct sk_buff *skb, int err)
 
 	x->curlft.bytes += skb->len;
 	x->curlft.packets++;
-	x->curlft.use_time = ktime_get_real_seconds();
 
 	spin_unlock_bh(&x->lock);
From: Antony Antony antony.antony@secunet.com
[ Upstream commit 6aa811acdb76facca0b705f4e4c1d948ccb6af8b ]
x->lastused was not cloned in xfrm_do_migrate. Clone it as well during the migration.
Fixes: 80c9abaabf42 ("[XFRM]: Extension for dynamic update of endpoint address(es)") Signed-off-by: Antony Antony antony.antony@secunet.com Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/xfrm/xfrm_state.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index ccfb172eb5b8d..11d89af9cb55a 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1592,6 +1592,7 @@ static struct xfrm_state *xfrm_state_clone(struct xfrm_state *orig,
 	x->replay = orig->replay;
 	x->preplay = orig->preplay;
 	x->mapping_maxage = orig->mapping_maxage;
+	x->lastused = orig->lastused;
 	x->new_mapping = 0;
 	x->new_mapping_sport = 0;
From: Herbert Xu herbert@gondor.apana.org.au
[ Upstream commit ba953a9d89a00c078b85f4b190bc1dde66fe16b5 ]
When namespace support was added to xfrm/afkey, it caused the previously single-threaded call to xfrm_probe_algs to become multi-threaded. This is buggy and needs to be fixed with a mutex.
Reported-by: Abhishek Shah abhishek.shah@columbia.edu Fixes: 283bc9f35bbb ("xfrm: Namespacify xfrm state/policy locks") Signed-off-by: Herbert Xu herbert@gondor.apana.org.au Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/key/af_key.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/key/af_key.c b/net/key/af_key.c
index fb16d7c4e1b8d..20e73643b9c89 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -1697,9 +1697,12 @@ static int pfkey_register(struct sock *sk, struct sk_buff *skb, const struct sad
 		pfk->registered |= (1<<hdr->sadb_msg_satype);
 	}
 
+	mutex_lock(&pfkey_mutex);
 	xfrm_probe_algs();
 
 	supp_skb = compose_sadb_supported(hdr, GFP_KERNEL | __GFP_ZERO);
+	mutex_unlock(&pfkey_mutex);
+
 	if (!supp_skb) {
 		if (hdr->sadb_msg_satype != SADB_SATYPE_UNSPEC)
 			pfk->registered &= ~(1<<hdr->sadb_msg_satype);
From: Nikolay Aleksandrov razor@blackwall.org
[ Upstream commit 17ecd4a4db4783392edd4944f5e8268205083f70 ]
When we try to transmit an skb with a metadata_dst attached (i.e. dst->dev == NULL) through an xfrm interface, we can hit a null pointer dereference [1] in xfrmi_xmit2() -> xfrm_lookup_with_ifid(), due to the no-policy check for a loopback skb device, which dereferences dst->dev unconditionally. A missing dst->dev can be interpreted as the device not being a loopback device, so just add a check for a NULL dst_orig->dev.
With this fix xfrm interface's Tx error counters go up as usual.
[1] net-next calltrace captured via netconsole: BUG: kernel NULL pointer dereference, address: 00000000000000c0 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 7231 Comm: ping Kdump: loaded Not tainted 5.19.0+ #24 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-1.fc36 04/01/2014 RIP: 0010:xfrm_lookup_with_ifid+0x5eb/0xa60 Code: 8d 74 24 38 e8 26 a4 37 00 48 89 c1 e9 12 fc ff ff 49 63 ed 41 83 fd be 0f 85 be 01 00 00 41 be ff ff ff ff 45 31 ed 48 8b 03 <f6> 80 c0 00 00 00 08 75 0f 41 80 bc 24 19 0d 00 00 01 0f 84 1e 02 RSP: 0018:ffffb0db82c679f0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffd0db7fcad430 RCX: ffffb0db82c67a10 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb0db82c67a80 RBP: ffffb0db82c67a80 R08: ffffb0db82c67a14 R09: 0000000000000000 R10: 0000000000000000 R11: ffff8fa449667dc8 R12: ffffffff966db880 R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000000 FS: 00007ff35c83f000(0000) GS:ffff8fa478480000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c0 CR3: 000000001ebb7000 CR4: 0000000000350ee0 Call Trace: <TASK> xfrmi_xmit+0xde/0x460 ? tcf_bpf_act+0x13d/0x2a0 dev_hard_start_xmit+0x72/0x1e0 __dev_queue_xmit+0x251/0xd30 ip_finish_output2+0x140/0x550 ip_push_pending_frames+0x56/0x80 raw_sendmsg+0x663/0x10a0 ? try_charge_memcg+0x3fd/0x7a0 ? __mod_memcg_lruvec_state+0x93/0x110 ? sock_sendmsg+0x30/0x40 sock_sendmsg+0x30/0x40 __sys_sendto+0xeb/0x130 ? handle_mm_fault+0xae/0x280 ? do_user_addr_fault+0x1e7/0x680 ? kvm_read_and_reset_apf_flags+0x3b/0x50 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7ff35cac1366 Code: eb 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 11 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 72 c3 90 55 48 83 ec 30 44 89 4c 24 2c 4c 89 RSP: 002b:00007fff738e4028 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007fff738e57b0 RCX: 00007ff35cac1366 RDX: 0000000000000040 RSI: 0000557164e4b450 RDI: 0000000000000003 RBP: 0000557164e4b450 R08: 00007fff738e7a2c R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040 R13: 00007fff738e5770 R14: 00007fff738e4030 R15: 0000001d00000001 </TASK> Modules linked in: netconsole veth br_netfilter bridge bonding virtio_net [last unloaded: netconsole] CR2: 00000000000000c0
CC: Steffen Klassert steffen.klassert@secunet.com CC: Daniel Borkmann daniel@iogearbox.net Fixes: 2d151d39073a ("xfrm: Add possibility to set the default to block if we have no policy") Signed-off-by: Nikolay Aleksandrov razor@blackwall.org Signed-off-by: Steffen Klassert steffen.klassert@secunet.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/xfrm/xfrm_policy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 4f8bbb825abcb..cc6ab79609e29 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -3162,7 +3162,7 @@ struct dst_entry *xfrm_lookup_with_ifid(struct net *net,
 	return dst;
 
 nopol:
-	if (!(dst_orig->dev->flags & IFF_LOOPBACK) &&
+	if ((!dst_orig->dev || !(dst_orig->dev->flags & IFF_LOOPBACK)) &&
 	    net->xfrm.policy_default[dir] == XFRM_USERPOLICY_BLOCK) {
 		err = -EPERM;
 		goto error;
From: Seth Forshee sforshee@digitalocean.com
[ Upstream commit bf1ac16edf6770a92bc75cf2373f1f9feea398a4 ]
Idmapped mounts should not allow a user to map file ownership into a range of ids which is not under the control of that user. However, we currently don't check whether the mounter is privileged with respect to the target user namespace.
Currently no FS_USERNS_MOUNT filesystems support idmapped mounts, thus this is not a problem as only CAP_SYS_ADMIN in init_user_ns is allowed to set up idmapped mounts. But this could change in the future, so add a check to refuse to create idmapped mounts when the mounter does not have CAP_SYS_ADMIN in the target user namespace.
Fixes: bd303368b776 ("fs: support mapped mounts of mapped filesystems") Signed-off-by: Seth Forshee sforshee@digitalocean.com Reviewed-by: Christian Brauner (Microsoft) brauner@kernel.org Link: https://lore.kernel.org/r/20220816164752.2595240-1-sforshee@digitalocean.com Signed-off-by: Christian Brauner (Microsoft) brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/namespace.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/fs/namespace.c b/fs/namespace.c
index e6a7e769d25dd..a59f8d645654a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -4238,6 +4238,13 @@ static int build_mount_idmapped(const struct mount_attr *attr, size_t usize,
 		err = -EPERM;
 		goto out_fput;
 	}
+
+	/* We're not controlling the target namespace. */
+	if (!ns_capable(mnt_userns, CAP_SYS_ADMIN)) {
+		err = -EPERM;
+		goto out_fput;
+	}
+
 	kattr->mnt_userns = get_user_ns(mnt_userns);
out_fput:
From: Sabrina Dubroca sd@queasysnail.net
[ Upstream commit e82c649e851c9c25367fb7a2a6cf3479187de467 ]
This reverts commit 6fc498bc82929ee23aa2f35a828c6178dfd3f823.
Commit 6fc498bc8292 states:
SCI should be updated, because it contains MAC in its first 6 octets.
That's not entirely correct. The SCI can be based on the MAC address, but doesn't have to be; we can also use any 64-bit number as the SCI. When the SCI is based on the MAC address, it uses a 16-bit "port number" provided by userspace, which commit 6fc498bc8292 overwrites with 1.
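For context, a sketch of how a MAC-based SCI is composed, modeled on macsec's make_sci() helper (shown for illustration only and simplified to use a plain u64 instead of macsec's sci_t; not part of this revert):

static u64 make_sci(const u8 *addr, __be16 port)
{
	u64 sci;

	/* First 6 octets: the MAC address; last 2 octets: the 16-bit
	 * port number provided by userspace. */
	memcpy(&sci, addr, ETH_ALEN);
	memcpy(((char *)&sci) + ETH_ALEN, &port, sizeof(port));

	return sci;
}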
In addition, changing the SCI after macsec has been setup can just confuse the receiver. If we configure the RXSC on the peer based on the original SCI, we should keep the same SCI on TX.
When the macsec device is being managed by a userspace key negotiation daemon such as wpa_supplicant, commit 6fc498bc8292 would also overwrite the SCI defined by userspace.
Fixes: 6fc498bc8292 ("net: macsec: update SCI upon MAC address change.") Signed-off-by: Sabrina Dubroca sd@queasysnail.net Link: https://lore.kernel.org/r/9b1a9d28327e7eb54550a92eebda45d25e54dd0d.166066703... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/macsec.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index f354fad05714a..5b0b23e55fa76 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -449,11 +449,6 @@ static struct macsec_eth_header *macsec_ethhdr(struct sk_buff *skb)
 	return (struct macsec_eth_header *)skb_mac_header(skb);
 }
 
-static sci_t dev_to_sci(struct net_device *dev, __be16 port)
-{
-	return make_sci(dev->dev_addr, port);
-}
-
 static void __macsec_pn_wrapped(struct macsec_secy *secy,
 				struct macsec_tx_sa *tx_sa)
 {
@@ -3622,7 +3617,6 @@ static int macsec_set_mac_address(struct net_device *dev, void *p)
 
 out:
 	eth_hw_addr_set(dev, addr->sa_data);
-	macsec->secy.sci = dev_to_sci(dev, MACSEC_PORT_ES);
 
 	/* If h/w offloading is available, propagate to the device */
 	if (macsec_is_offloaded(macsec)) {
@@ -3960,6 +3954,11 @@ static bool sci_exists(struct net_device *dev, sci_t sci)
 	return false;
 }
 
+static sci_t dev_to_sci(struct net_device *dev, __be16 port)
+{
+	return make_sci(dev->dev_addr, port);
+}
+
 static int macsec_add_dev(struct net_device *dev, sci_t sci, u8 icv_len)
 {
 	struct macsec_dev *macsec = macsec_priv(dev);
From: Olga Kornievskaia kolga@netapp.com
[ Upstream commit fcfc8be1e9cf2f12b50dce8b579b3ae54443a014 ]
A destination server, while doing a COPY, shouldn't accept the passed-in filehandle if it's not a regular filehandle.
If alloc_file_pseudo() has failed, we need to decrement a reference on the newly created inode, otherwise it leaks.
Reported-by: Al Viro viro@zeniv.linux.org.uk Fixes: ec4b092508982 ("NFS: inter ssc open") Signed-off-by: Olga Kornievskaia kolga@netapp.com Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org --- fs/nfs/nfs4file.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index e88f6b18445ec..9eb1812878795 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -340,6 +340,11 @@ static struct file *__nfs42_ssc_open(struct vfsmount *ss_mnt,
 		goto out;
 	}
 
+	if (!S_ISREG(fattr->mode)) {
+		res = ERR_PTR(-EBADF);
+		goto out;
+	}
+
 	res = ERR_PTR(-ENOMEM);
 	len = strlen(SSC_READ_NAME_BODY) + 16;
 	read_name = kzalloc(len, GFP_KERNEL);
@@ -357,6 +362,7 @@ static struct file *__nfs42_ssc_open(struct vfsmount *ss_mnt,
 			     r_ino->i_fop);
 	if (IS_ERR(filep)) {
 		res = ERR_CAST(filep);
+		iput(r_ino);
 		goto out_free_name;
 	}
From: Trond Myklebust trond.myklebust@hammerspace.com
[ Upstream commit ed06fce0b034b2e25bd93430f5c4cbb28036cc1a ]
Fix up a case in call_encode() where we're failing to set task->tk_rpc_status when an RPC level error occurred.
Fixes: 9c5948c24869 ("SUNRPC: task should be exit if encode return EKEYEXPIRED more times") Signed-off-by: Trond Myklebust trond.myklebust@hammerspace.com Signed-off-by: Sasha Levin sashal@kernel.org --- net/sunrpc/clnt.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 733f9f2260926..c1a01947530f0 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1888,7 +1888,7 @@ call_encode(struct rpc_task *task)
 		break;
 	case -EKEYEXPIRED:
 		if (!task->tk_cred_retry) {
-			rpc_exit(task, task->tk_status);
+			rpc_call_rpcerror(task, task->tk_status);
 		} else {
 			task->tk_action = call_refresh;
 			task->tk_cred_retry--;
From: Peter Xu peterx@redhat.com
[ Upstream commit efd4149342db2df41b1bbe68972ead853b30e444 ]
These bits are only valid when the ptes are present. Introduce two booleans for them and set them to false when !pte_present(), for both the pte and pmd accounting paths.
The bug was found during code reading; no real-world issue has been reported. But logically such an error can cause incorrect readings for quite a few fields of either the smaps or smaps_rollup output.

For example, it could cause over-estimates of values like Shared_Dirty, Private_Dirty and Referenced, or under-estimates of values like LazyFree, Shared_Clean and Private_Clean.
Link: https://lkml.kernel.org/r/20220805160003.58929-1-peterx@redhat.com Fixes: b1d4d9e0cbd0 ("proc/smaps: carefully handle migration entries") Fixes: c94b6923fa0a ("/proc/PID/smaps: Add PMD migration entry parsing") Signed-off-by: Peter Xu peterx@redhat.com Reviewed-by: Vlastimil Babka vbabka@suse.cz Reviewed-by: David Hildenbrand david@redhat.com Reviewed-by: Yang Shi shy828301@gmail.com Cc: Konstantin Khlebnikov khlebnikov@openvz.org Cc: Huang Ying ying.huang@intel.com Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/proc/task_mmu.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2d04e3470d4cd..313788bc0c307 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -525,10 +525,12 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	struct vm_area_struct *vma = walk->vma;
 	bool locked = !!(vma->vm_flags & VM_LOCKED);
 	struct page *page = NULL;
-	bool migration = false;
+	bool migration = false, young = false, dirty = false;
 
 	if (pte_present(*pte)) {
 		page = vm_normal_page(vma, addr, *pte);
+		young = pte_young(*pte);
+		dirty = pte_dirty(*pte);
 	} else if (is_swap_pte(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
@@ -558,8 +560,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	if (!page)
 		return;
 
-	smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte),
-		      locked, migration);
+	smaps_account(mss, page, false, young, dirty, locked, migration);
 }
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
From: Christian Brauner brauner@kernel.org
[ Upstream commit 0c3bc7899e6dfb52df1c46118a5a670ae619645f ]
While looking at our current POSIX ACL handling in the context of some overlayfs work I went through a range of other filesystems checking how they handle them currently and encountered ntfs3.
The posix_acl_{from,to}_xattr() helpers always need to operate on the filesystem idmapping. Since ntfs3 can only be mounted in the initial user namespace, the relevant idmapping is init_user_ns.
The posix_acl_{from,to}_xattr() helpers are concerned with translating between the kernel internal struct posix_acl{_entry} and the uapi struct posix_acl_xattr_{header,entry} and the kernel internal data structure is cached filesystem wide.
Additional idmappings such as the caller's idmapping or the mount's idmapping are handled higher up in the VFS. Individual filesystems usually do not need to concern themselves with these.
The posix_acl_valid() helper is concerned with checking whether the values in the kernel internal struct posix_acl can be represented in the filesystem's idmapping. IOW, if they can be written to disk. So this helper too needs to take the filesystem's idmapping.
Fixes: be71b5cba2e6 ("fs/ntfs3: Add attrib operations") Cc: Konstantin Komarov almaz.alexandrovich@paragon-software.com Cc: ntfs3@lists.linux.dev Signed-off-by: Christian Brauner (Microsoft) brauner@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- fs/ntfs3/xattr.c | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/fs/ntfs3/xattr.c b/fs/ntfs3/xattr.c
index 1b8c89dbf6684..3629049decac1 100644
--- a/fs/ntfs3/xattr.c
+++ b/fs/ntfs3/xattr.c
@@ -478,8 +478,7 @@ static noinline int ntfs_set_ea(struct inode *inode, const char *name,
 }
 
 #ifdef CONFIG_NTFS3_FS_POSIX_ACL
-static struct posix_acl *ntfs_get_acl_ex(struct user_namespace *mnt_userns,
-					 struct inode *inode, int type,
+static struct posix_acl *ntfs_get_acl_ex(struct inode *inode, int type,
 					 int locked)
 {
 	struct ntfs_inode *ni = ntfs_i(inode);
@@ -514,7 +513,7 @@ static struct posix_acl *ntfs_get_acl_ex(struct user_namespace *mnt_userns,
 
 	/* Translate extended attribute to acl. */
 	if (err >= 0) {
-		acl = posix_acl_from_xattr(mnt_userns, buf, err);
+		acl = posix_acl_from_xattr(&init_user_ns, buf, err);
 	} else if (err == -ENODATA) {
 		acl = NULL;
 	} else {
@@ -537,8 +536,7 @@ struct posix_acl *ntfs_get_acl(struct inode *inode, int type, bool rcu)
 	if (rcu)
 		return ERR_PTR(-ECHILD);
 
-	/* TODO: init_user_ns? */
-	return ntfs_get_acl_ex(&init_user_ns, inode, type, 0);
+	return ntfs_get_acl_ex(inode, type, 0);
 }
 
 static noinline int ntfs_set_acl_ex(struct user_namespace *mnt_userns,
@@ -590,7 +588,7 @@ static noinline int ntfs_set_acl_ex(struct user_namespace *mnt_userns,
 		value = kmalloc(size, GFP_NOFS);
 		if (!value)
 			return -ENOMEM;
-		err = posix_acl_to_xattr(mnt_userns, acl, value, size);
+		err = posix_acl_to_xattr(&init_user_ns, acl, value, size);
 		if (err < 0)
 			goto out;
 		flags = 0;
@@ -641,7 +639,7 @@ static int ntfs_xattr_get_acl(struct user_namespace *mnt_userns,
 	if (!acl)
 		return -ENODATA;
 
-	err = posix_acl_to_xattr(mnt_userns, acl, buffer, size);
+	err = posix_acl_to_xattr(&init_user_ns, acl, buffer, size);
 	posix_acl_release(acl);
 
 	return err;
@@ -665,12 +663,12 @@ static int ntfs_xattr_set_acl(struct user_namespace *mnt_userns,
 	if (!value) {
 		acl = NULL;
 	} else {
-		acl = posix_acl_from_xattr(mnt_userns, value, size);
+		acl = posix_acl_from_xattr(&init_user_ns, value, size);
 		if (IS_ERR(acl))
 			return PTR_ERR(acl);
 
 		if (acl) {
-			err = posix_acl_valid(mnt_userns, acl);
+			err = posix_acl_valid(&init_user_ns, acl);
 			if (err)
 				goto release_and_out;
 		}
From: Bernard Pidoux f6bvp@free.fr
[ Upstream commit 3c53cd65dece47dd1f9d3a809f32e59d1d87b2b8 ]
Commit 3b3fd068c56e3fbea30090859216a368398e39bf added a NULL check for `rose_loopback_neigh->dev` in rose_loopback_timer(), but omitted to also check rose_loopback_neigh->loopback.

It thus prevents *all* ROSE connections.

The reason is that the special loopback rose_neigh has a NULL device.
/proc/net/rose_neigh illustrates it via the rose_neigh_show() function:

[...]
	seq_printf(seq, "%05d %-9s %-4s %3d %3d %3s %3s %3lu %3lu",
		   rose_neigh->number,
		   (rose_neigh->loopback) ? "RSLOOP-0" :
		   ax2asc(buf, &rose_neigh->callsign),
		   rose_neigh->dev ? rose_neigh->dev->name : "???",
		   rose_neigh->count,
/proc/net/rose_neigh displays special rose_loopback_neigh->loopback as callsign RSLOOP-0:
addr  callsign  dev  count use mode restart  t0  tf digipeaters
00001 RSLOOP-0  ???      1   2 DCE  yes       0   0
By also checking rose_loopback_neigh->loopback, rose_rx_call_request() is called even when rose_loopback_neigh->dev is NULL. This repairs ROSE connections.
Verification with rose client application FPAC:
FPAC-Node v 4.1.3 (built Aug  5 2022) for LINUX (help = h)
F6BVP-4 (Commands = ?) : u
Users - AX.25 Level 2 sessions :
Port   Callsign  Callsign   AX.25 state  ROSE state  NetRom status
axudp  F6BVP-5   -> F6BVP-9  Connected    Connected   ---------
Fixes: 3b3fd068c56e ("rose: Fix Null pointer dereference in rose_send_frame()") Signed-off-by: Bernard Pidoux f6bvp@free.fr Suggested-by: Francois Romieu romieu@fr.zoreil.com Cc: Thomas DL9SAU Osterried thomas@osterried.de Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/rose/rose_loopback.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/rose/rose_loopback.c b/net/rose/rose_loopback.c
index 11c45c8c6c164..036d92c0ad794 100644
--- a/net/rose/rose_loopback.c
+++ b/net/rose/rose_loopback.c
@@ -96,7 +96,8 @@ static void rose_loopback_timer(struct timer_list *unused)
 	}
 
 		if (frametype == ROSE_CALL_REQUEST) {
-			if (!rose_loopback_neigh->dev) {
+			if (!rose_loopback_neigh->dev &&
+			    !rose_loopback_neigh->loopback) {
 				kfree_skb(skb);
 				continue;
 			}
From: Hayes Wang hayeswang@realtek.com
[ Upstream commit 6dc4df12d741c0fe8f885778a43039e0619b9cd9 ]
The units of PLA_RX_FIFO_FULL and PLA_RX_FIFO_EMPTY are 16 bytes, so byte thresholds have to be divided by 16 before being written; a 1024-byte threshold, for example, is programmed as 1024 / 16 = 64.
Fixes: 195aae321c82 ("r8152: support new chips") Signed-off-by: Hayes Wang hayeswang@realtek.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/usb/r8152.c | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-)
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 0f6efaabaa32b..46c7954d27629 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -6431,21 +6431,8 @@ static void r8156_fc_parameter(struct r8152 *tp)
 	u32 pause_on = tp->fc_pause_on ? tp->fc_pause_on : fc_pause_on_auto(tp);
 	u32 pause_off = tp->fc_pause_off ? tp->fc_pause_off : fc_pause_off_auto(tp);
 
-	switch (tp->version) {
-	case RTL_VER_10:
-	case RTL_VER_11:
-		ocp_write_word(tp, MCU_TYPE_PLA, PLA_RX_FIFO_FULL, pause_on / 8);
-		ocp_write_word(tp, MCU_TYPE_PLA, PLA_RX_FIFO_EMPTY, pause_off / 8);
-		break;
-	case RTL_VER_12:
-	case RTL_VER_13:
-	case RTL_VER_15:
-		ocp_write_word(tp, MCU_TYPE_PLA, PLA_RX_FIFO_FULL, pause_on / 16);
-		ocp_write_word(tp, MCU_TYPE_PLA, PLA_RX_FIFO_EMPTY, pause_off / 16);
-		break;
-	default:
-		break;
-	}
+	ocp_write_word(tp, MCU_TYPE_PLA, PLA_RX_FIFO_FULL, pause_on / 16);
+	ocp_write_word(tp, MCU_TYPE_PLA, PLA_RX_FIFO_EMPTY, pause_off / 16);
 }
static void rtl8156_change_mtu(struct r8152 *tp)
From: Hayes Wang hayeswang@realtek.com
[ Upstream commit b75d612014447e04abdf0e37ffb8f2fd8b0b49d6 ]
The RX FIFO would be changed when suspending, so the related settings have to be modified, too. Otherwise, the flow control would work abnormally.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216333 Reported-by: Mark Blakeney mark.blakeney@bullet-systems.net Fixes: cdf0b86b250f ("r8152: fix a WOL issue") Signed-off-by: Hayes Wang hayeswang@realtek.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/usb/r8152.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 46c7954d27629..d142ac8fcf6e2 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -5906,6 +5906,11 @@ static void r8153_enter_oob(struct r8152 *tp)
 	ocp_data &= ~NOW_IS_OOB;
 	ocp_write_byte(tp, MCU_TYPE_PLA, PLA_OOB_CTRL, ocp_data);
 
+	/* RX FIFO settings for OOB */
+	ocp_write_dword(tp, MCU_TYPE_PLA, PLA_RXFIFO_CTRL0, RXFIFO_THR1_OOB);
+	ocp_write_word(tp, MCU_TYPE_PLA, PLA_RXFIFO_CTRL1, RXFIFO_THR2_OOB);
+	ocp_write_word(tp, MCU_TYPE_PLA, PLA_RXFIFO_CTRL2, RXFIFO_THR3_OOB);
+
 	rtl_disable(tp);
 	rtl_reset_bmu(tp);
 
@@ -6544,6 +6549,11 @@ static void rtl8156_down(struct r8152 *tp)
 	ocp_data &= ~NOW_IS_OOB;
 	ocp_write_byte(tp, MCU_TYPE_PLA, PLA_OOB_CTRL, ocp_data);
 
+	/* RX FIFO settings for OOB */
+	ocp_write_word(tp, MCU_TYPE_PLA, PLA_RXFIFO_FULL, 64 / 16);
+	ocp_write_word(tp, MCU_TYPE_PLA, PLA_RX_FIFO_FULL, 1024 / 16);
+	ocp_write_word(tp, MCU_TYPE_PLA, PLA_RX_FIFO_EMPTY, 4096 / 16);
+
 	rtl_disable(tp);
 	rtl_reset_bmu(tp);
From: Duoming Zhou duoming@zju.edu.cn
[ Upstream commit f1e941dbf80a9b8bab0bffbc4cbe41cc7f4c6fb6 ]
When the pn532 UART device is detaching, pn532_uart_remove() is called. But nothing in pn532_uart_remove() deletes the cmd_timeout timer, which can cause use-after-free bugs. The race is shown below:
(thread 1)                 |        (thread 2)
                           |  pn532_uart_send_frame
pn532_uart_remove          |    mod_timer(&pn532->cmd_timeout,...)
  ...                      |  (wait a time)
  kfree(pn532) //FREE      |  pn532_cmd_timeout
                           |    pn532_uart_send_frame
                           |      pn532->... //USE
This patch adds del_timer_sync() in pn532_uart_remove() in order to prevent the use-after-free bugs. What's more, pn53x_unregister_nfc() is well synchronized: it sets nfc_dev->shutting_down to true, and no syscall can restart the cmd_timeout timer afterwards.
Fixes: c656aa4c27b1 ("nfc: pn533: add UART phy driver") Signed-off-by: Duoming Zhou duoming@zju.edu.cn Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/nfc/pn533/uart.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/nfc/pn533/uart.c b/drivers/nfc/pn533/uart.c
index 2caf997f9bc94..07596bf5f7d6d 100644
--- a/drivers/nfc/pn533/uart.c
+++ b/drivers/nfc/pn533/uart.c
@@ -310,6 +310,7 @@ static void pn532_uart_remove(struct serdev_device *serdev)
 	pn53x_unregister_nfc(pn532->priv);
 	serdev_device_close(serdev);
 	pn53x_common_clean(pn532->priv);
+	del_timer_sync(&pn532->cmd_timeout);
 	kfree_skb(pn532->recv_skb);
 	kfree(pn532);
 }
From: Maciej Fijalkowski maciej.fijalkowski@intel.com
[ Upstream commit 5a42f112d367bb4700a8a41f5c12724fde6bfbb9 ]
Fix the following scenario:

1. ethtool -L $IFACE rx 8 tx 96
2. xdpsock -q 10 -t -z
The above refers to a case where a user would like to attach an XSK socket in txonly mode at a queue id that does not have a corresponding Rx queue. At the moment, ice's XSK logic is tightly bound to acting on a "queue pair", i.e. both the Tx and Rx queues at a given queue id are disabled/enabled together and both of them get the XSK pool assigned, which is broken for the presented queue configuration. This results in the splat included at the bottom, which is basically an OOB access to the Rx ring array.

To fix this, allow using only queue ids in the scope of the "combined" queues reported by ethtool. The logic should be rewritten later on to allow such configurations, but that would end up as a complete rewrite of the control path, so let us go with this temporary fix.
[420160.558008] BUG: kernel NULL pointer dereference, address: 0000000000000082 [420160.566359] #PF: supervisor read access in kernel mode [420160.572657] #PF: error_code(0x0000) - not-present page [420160.579002] PGD 0 P4D 0 [420160.582756] Oops: 0000 [#1] PREEMPT SMP NOPTI [420160.588396] CPU: 10 PID: 21232 Comm: xdpsock Tainted: G OE 5.19.0-rc7+ #10 [420160.597893] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019 [420160.609894] RIP: 0010:ice_xsk_pool_setup+0x44/0x7d0 [ice] [420160.616968] Code: f3 48 83 ec 40 48 8b 4f 20 48 8b 3f 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 48 8d 04 ed 00 00 00 00 48 01 c1 48 8b 11 <0f> b7 92 82 00 00 00 48 85 d2 0f 84 2d 75 00 00 48 8d 72 ff 48 85 [420160.639421] RSP: 0018:ffffc9002d2afd48 EFLAGS: 00010282 [420160.646650] RAX: 0000000000000050 RBX: ffff88811d8bdd00 RCX: ffff888112c14ff8 [420160.655893] RDX: 0000000000000000 RSI: ffff88811d8bdd00 RDI: ffff888109861000 [420160.665166] RBP: 000000000000000a R08: 000000000000000a R09: 0000000000000000 [420160.674493] R10: 000000000000889f R11: 0000000000000000 R12: 000000000000000a [420160.683833] R13: 000000000000000a R14: 0000000000000000 R15: ffff888117611828 [420160.693211] FS: 00007fa869fc1f80(0000) GS:ffff8897e0880000(0000) knlGS:0000000000000000 [420160.703645] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [420160.711783] CR2: 0000000000000082 CR3: 00000001d076c001 CR4: 00000000007706e0 [420160.721399] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [420160.731045] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [420160.740707] PKRU: 55555554 [420160.745960] Call Trace: [420160.750962] <TASK> [420160.755597] ? kmalloc_large_node+0x79/0x90 [420160.762703] ? __kmalloc_node+0x3f5/0x4b0 [420160.769341] xp_assign_dev+0xfd/0x210 [420160.775661] ? shmem_file_read_iter+0x29a/0x420 [420160.782896] xsk_bind+0x152/0x490 [420160.788943] __sys_bind+0xd0/0x100 [420160.795097] ? 
exit_to_user_mode_prepare+0x20/0x120 [420160.802801] __x64_sys_bind+0x16/0x20 [420160.809298] do_syscall_64+0x38/0x90 [420160.815741] entry_SYSCALL_64_after_hwframe+0x63/0xcd [420160.823731] RIP: 0033:0x7fa86a0dd2fb [420160.830264] Code: c3 66 0f 1f 44 00 00 48 8b 15 69 8b 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb bc 0f 1f 44 00 00 f3 0f 1e fa b8 31 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3d 8b 0c 00 f7 d8 64 89 01 48 [420160.855410] RSP: 002b:00007ffc1146f618 EFLAGS: 00000246 ORIG_RAX: 0000000000000031 [420160.866366] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa86a0dd2fb [420160.876957] RDX: 0000000000000010 RSI: 00007ffc1146f680 RDI: 0000000000000003 [420160.887604] RBP: 000055d7113a0520 R08: 00007fa868fb8000 R09: 0000000080000000 [420160.898293] R10: 0000000000008001 R11: 0000000000000246 R12: 000055d7113a04e0 [420160.909038] R13: 000055d7113a0320 R14: 000000000000000a R15: 0000000000000000 [420160.919817] </TASK> [420160.925659] Modules linked in: ice(OE) af_packet binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp mei_me coretemp ioatdma mei ipmi_si wmi ipmi_msghandler acpi_pad acpi_power_meter ip_tables x_tables autofs4 ixgbe i40e crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd ahci mdio dca libahci lpc_ich [last unloaded: ice] [420160.977576] CR2: 0000000000000082 [420160.985037] ---[ end trace 0000000000000000 ]--- [420161.097724] RIP: 0010:ice_xsk_pool_setup+0x44/0x7d0 [ice] [420161.107341] Code: f3 48 83 ec 40 48 8b 4f 20 48 8b 3f 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 48 8d 04 ed 00 00 00 00 48 01 c1 48 8b 11 <0f> b7 92 82 00 00 00 48 85 d2 0f 84 2d 75 00 00 48 8d 72 ff 48 85 [420161.134741] RSP: 0018:ffffc9002d2afd48 EFLAGS: 00010282 [420161.144274] RAX: 0000000000000050 RBX: ffff88811d8bdd00 RCX: ffff888112c14ff8 [420161.155690] RDX: 0000000000000000 RSI: ffff88811d8bdd00 RDI: ffff888109861000 [420161.168088] RBP: 000000000000000a R08: 000000000000000a R09: 0000000000000000 [420161.179295] R10: 000000000000889f R11: 0000000000000000 R12: 000000000000000a [420161.190420] R13: 000000000000000a R14: 0000000000000000 R15: ffff888117611828 [420161.201505] FS: 00007fa869fc1f80(0000) GS:ffff8897e0880000(0000) knlGS:0000000000000000 [420161.213628] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [420161.223413] CR2: 0000000000000082 CR3: 00000001d076c001 CR4: 00000000007706e0 [420161.234653] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [420161.245893] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [420161.257052] PKRU: 55555554
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Signed-off-by: Maciej Fijalkowski maciej.fijalkowski@intel.com Tested-by: George Kuruvinakunnel george.kuruvinakunnel@intel.com Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/intel/ice/ice_xsk.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 49ba8bfdbf047..45f88e6ec25e8 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -329,6 +329,12 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
 	bool if_running, pool_present = !!pool;
 	int ret = 0, pool_failure = 0;
 
+	if (qid >= vsi->num_rxq || qid >= vsi->num_txq) {
+		netdev_err(vsi->netdev, "Please use queue id in scope of combined queues count\n");
+		pool_failure = -EINVAL;
+		goto failure;
+	}
+
 	if (!is_power_of_2(vsi->rx_rings[qid]->count) ||
 	    !is_power_of_2(vsi->tx_rings[qid]->count)) {
 		netdev_err(vsi->netdev, "Please align ring sizes to power of 2\n");
From: Maciej Fijalkowski maciej.fijalkowski@intel.com
[ Upstream commit 9ead7e74bfd6dd54db12ef133b8604add72511de ]
The ice driver allocates per-CPU XDP queues so that the redirect path can safely use smp_processor_id() as an index into the array. At the same time, though, XDP rings are used to pick the NAPI context to call napi_schedule() on or to set NAPIF_STATE_MISSED. When the user reduces the queue count, say to 8, and num_possible_cpus() of the underlying platform is 44, this means queue vectors with correlated NAPI contexts will carry several XDP queues.

This in turn can result in broken behavior where the NAPI context of interest will never be scheduled and the AF_XDP socket will not process any traffic.

To fix this, let us change the way XDP rings are assigned to Rx rings and use this information later on when setting the ice_tx_ring::xsk_pool pointer. For each Rx ring, grab the associated queue vector and walk through the queue vector's Tx ring list; once we stumble upon an XDP ring in it, assign this ring to ice_rx_ring::xdp_ring.

The previous approach [0] to fixing this issue was for the txonly scenario, because of the described grouping of XDP rings across queue vectors: relying on the Rx ring meant that a NAPI context could be scheduled for a queue vector without an XDP ring with an associated XSK pool.
[0]: https://lore.kernel.org/netdev/20220707161128.54215-1-maciej.fijalkowski@int...
Fixes: 2d4238f55697 ("ice: Add support for AF_XDP") Fixes: 22bf877e528f ("ice: introduce XDP_TX fallback path") Signed-off-by: Maciej Fijalkowski maciej.fijalkowski@intel.com Tested-by: George Kuruvinakunnel george.kuruvinakunnel@intel.com Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/intel/ice/ice.h | 36 +++++++++++++++-------- drivers/net/ethernet/intel/ice/ice_lib.c | 4 +-- drivers/net/ethernet/intel/ice/ice_main.c | 25 +++++++++++----- drivers/net/ethernet/intel/ice/ice_xsk.c | 12 ++++---- 4 files changed, 48 insertions(+), 29 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h index 60453b3b8d233..6911cbb7afa50 100644 --- a/drivers/net/ethernet/intel/ice/ice.h +++ b/drivers/net/ethernet/intel/ice/ice.h @@ -684,8 +684,8 @@ static inline void ice_set_ring_xdp(struct ice_tx_ring *ring) * ice_xsk_pool - get XSK buffer pool bound to a ring * @ring: Rx ring to use * - * Returns a pointer to xdp_umem structure if there is a buffer pool present, - * NULL otherwise. + * Returns a pointer to xsk_buff_pool structure if there is a buffer pool + * present, NULL otherwise. */ static inline struct xsk_buff_pool *ice_xsk_pool(struct ice_rx_ring *ring) { @@ -699,23 +699,33 @@ static inline struct xsk_buff_pool *ice_xsk_pool(struct ice_rx_ring *ring) }
/** - * ice_tx_xsk_pool - get XSK buffer pool bound to a ring - * @ring: Tx ring to use + * ice_tx_xsk_pool - assign XSK buff pool to XDP ring + * @vsi: pointer to VSI + * @qid: index of a queue to look at XSK buff pool presence * - * Returns a pointer to xdp_umem structure if there is a buffer pool present, - * NULL otherwise. Tx equivalent of ice_xsk_pool. + * Sets XSK buff pool pointer on XDP ring. + * + * XDP ring is picked from Rx ring, whereas Rx ring is picked based on provided + * queue id. Reason for doing so is that queue vectors might have assigned more + * than one XDP ring, e.g. when user reduced the queue count on netdev; Rx ring + * carries a pointer to one of these XDP rings for its own purposes, such as + * handling XDP_TX action, therefore we can piggyback here on the + * rx_ring->xdp_ring assignment that was done during XDP rings initialization. */ -static inline struct xsk_buff_pool *ice_tx_xsk_pool(struct ice_tx_ring *ring) +static inline void ice_tx_xsk_pool(struct ice_vsi *vsi, u16 qid) { - struct ice_vsi *vsi = ring->vsi; - u16 qid; + struct ice_tx_ring *ring;
- qid = ring->q_index - vsi->alloc_txq; + ring = vsi->rx_rings[qid]->xdp_ring; + if (!ring) + return;
- if (!ice_is_xdp_ena_vsi(vsi) || !test_bit(qid, vsi->af_xdp_zc_qps)) - return NULL; + if (!ice_is_xdp_ena_vsi(vsi) || !test_bit(qid, vsi->af_xdp_zc_qps)) { + ring->xsk_pool = NULL; + return; + }
- return xsk_get_pool_from_qid(vsi->netdev, qid); + ring->xsk_pool = xsk_get_pool_from_qid(vsi->netdev, qid); }
/** diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c index d6aafa272fb0b..6c4e1d45235ef 100644 --- a/drivers/net/ethernet/intel/ice/ice_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_lib.c @@ -1983,8 +1983,8 @@ int ice_vsi_cfg_xdp_txqs(struct ice_vsi *vsi) if (ret) return ret;
- ice_for_each_xdp_txq(vsi, i) - vsi->xdp_rings[i]->xsk_pool = ice_tx_xsk_pool(vsi->xdp_rings[i]); + ice_for_each_rxq(vsi, i) + ice_tx_xsk_pool(vsi, i);
return ret; } diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index bfd97a9a8f2e0..3d45e075204e3 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -2581,7 +2581,6 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi) if (ice_setup_tx_ring(xdp_ring)) goto free_xdp_rings; ice_set_ring_xdp(xdp_ring); - xdp_ring->xsk_pool = ice_tx_xsk_pool(xdp_ring); spin_lock_init(&xdp_ring->tx_lock); for (j = 0; j < xdp_ring->count; j++) { tx_desc = ICE_TX_DESC(xdp_ring, j); @@ -2589,13 +2588,6 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi) } }
- ice_for_each_rxq(vsi, i) { - if (static_key_enabled(&ice_xdp_locking_key)) - vsi->rx_rings[i]->xdp_ring = vsi->xdp_rings[i % vsi->num_xdp_txq]; - else - vsi->rx_rings[i]->xdp_ring = vsi->xdp_rings[i]; - } - return 0;
free_xdp_rings: @@ -2685,6 +2677,23 @@ int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog) xdp_rings_rem -= xdp_rings_per_v; }
+ ice_for_each_rxq(vsi, i) { + if (static_key_enabled(&ice_xdp_locking_key)) { + vsi->rx_rings[i]->xdp_ring = vsi->xdp_rings[i % vsi->num_xdp_txq]; + } else { + struct ice_q_vector *q_vector = vsi->rx_rings[i]->q_vector; + struct ice_tx_ring *ring; + + ice_for_each_tx_ring(ring, q_vector->tx) { + if (ice_ring_is_xdp(ring)) { + vsi->rx_rings[i]->xdp_ring = ring; + break; + } + } + } + ice_tx_xsk_pool(vsi, i); + } + /* omit the scheduler update if in reset path; XDP queues will be * taken into account at the end of ice_vsi_rebuild, where * ice_cfg_vsi_lan is being called diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c index 45f88e6ec25e8..e48e29258450f 100644 --- a/drivers/net/ethernet/intel/ice/ice_xsk.c +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c @@ -243,7 +243,7 @@ static int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx) if (err) goto free_buf; ice_set_ring_xdp(xdp_ring); - xdp_ring->xsk_pool = ice_tx_xsk_pool(xdp_ring); + ice_tx_xsk_pool(vsi, q_idx); }
err = ice_vsi_cfg_rxq(rx_ring); @@ -359,7 +359,7 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid) if (if_running) { ret = ice_qp_ena(vsi, qid); if (!ret && pool_present) - napi_schedule(&vsi->xdp_rings[qid]->q_vector->napi); + napi_schedule(&vsi->rx_rings[qid]->xdp_ring->q_vector->napi); else if (ret) netdev_err(vsi->netdev, "ice_qp_ena error = %d\n", ret); } @@ -950,13 +950,13 @@ ice_xsk_wakeup(struct net_device *netdev, u32 queue_id, if (!ice_is_xdp_ena_vsi(vsi)) return -EINVAL;
- if (queue_id >= vsi->num_txq) + if (queue_id >= vsi->num_txq || queue_id >= vsi->num_rxq) return -EINVAL;
- if (!vsi->xdp_rings[queue_id]->xsk_pool) - return -EINVAL; + ring = vsi->rx_rings[queue_id]->xdp_ring;
- ring = vsi->xdp_rings[queue_id]; + if (!ring->xsk_pool) + return -EINVAL;
/* The idea here is that if NAPI is running, mark a miss, so * it will run again. If not, trigger an interrupt and
From: Vlad Buslov vladbu@nvidia.com
[ Upstream commit f37044fd759b6bc40b6398a978e0b1acdf717372 ]
When querying the capabilities of mlx5 non-uplink representors with ethtool, rx-vlan-offload is marked as "off [fixed]". However, it is actually always enabled, because mlx5e_params->vlan_strip_disable is 0 by default when the struct mlx5e_params instance is initialized. Fix the issue by explicitly setting vlan_strip_disable to 'true' for non-uplink representors.
Fixes: cb67b832921c ("net/mlx5e: Introduce SRIOV VF representors") Signed-off-by: Vlad Buslov vladbu@nvidia.com Reviewed-by: Roi Dayan roid@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index f797fd97d305b..7da3dc6261929 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -662,6 +662,8 @@ static void mlx5e_build_rep_params(struct net_device *netdev)
 
 	params->mqprio.num_tc = 1;
 	params->tunneled_offload_en = false;
+	if (rep->vport != MLX5_VPORT_UPLINK)
+		params->vlan_strip_disable = true;
 
 	mlx5_query_min_inline(mdev, &params->tx_min_inline_mode);
 }
From: Eli Cohen elic@nvidia.com
[ Upstream commit a6e675a66175869b7d87c0e1dd0ddf93e04f8098 ]
Only set MLX5_LAG_FLAG_NDEVS_READY if both netdevices are registered. Doing so guarantees that both ldev->pf[MLX5_LAG_P0].dev and ldev->pf[MLX5_LAG_P1].dev have valid pointers when MLX5_LAG_FLAG_NDEVS_READY is set.
The core issue is asymmetry in setting MLX5_LAG_FLAG_NDEVS_READY and clearing it. It is wrongly set as soon as both ldev->pf[MLX5_LAG_P0].dev and ldev->pf[MLX5_LAG_P1].dev are set, but correctly cleared when either ldev->pf[i].netdev is cleared.
Consider the following scenario:

1. PF0 loads and sets ldev->pf[MLX5_LAG_P0].dev to a valid pointer.
2. PF1 loads and sets both ldev->pf[MLX5_LAG_P1].dev and
   ldev->pf[MLX5_LAG_P1].netdev to valid pointers. This results in
   MLX5_LAG_FLAG_NDEVS_READY being set.
3. PF0 is unloaded before setting ldev->pf[MLX5_LAG_P0].netdev.
   MLX5_LAG_FLAG_NDEVS_READY remains set.

Further execution of mlx5_do_bond() will then result in a null pointer dereference when calling mlx5_lag_is_multipath().
This patch fixes the following call trace actually encountered:
[ 1293.475195] BUG: kernel NULL pointer dereference, address: 00000000000009a8 [ 1293.478756] #PF: supervisor read access in kernel mode [ 1293.481320] #PF: error_code(0x0000) - not-present page [ 1293.483686] PGD 0 P4D 0 [ 1293.484434] Oops: 0000 [#1] SMP PTI [ 1293.485377] CPU: 1 PID: 23690 Comm: kworker/u16:2 Not tainted 5.18.0-rc5_for_upstream_min_debug_2022_05_05_10_13 #1 [ 1293.488039] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 1293.490836] Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core] [ 1293.492448] RIP: 0010:mlx5_lag_is_multipath+0x5/0x50 [mlx5_core] [ 1293.494044] Code: e8 70 40 ff e0 48 8b 14 24 48 83 05 5c 1a 1b 00 01 e9 19 ff ff ff 48 83 05 47 1a 1b 00 01 eb d7 0f 1f 44 00 00 0f 1f 44 00 00 <48> 8b 87 a8 09 00 00 48 85 c0 74 26 48 83 05 a7 1b 1b 00 01 41 b8 [ 1293.498673] RSP: 0018:ffff88811b2fbe40 EFLAGS: 00010202 [ 1293.500152] RAX: ffff88818a94e1c0 RBX: ffff888165eca6c0 RCX: 0000000000000000 [ 1293.501841] RDX: 0000000000000001 RSI: ffff88818a94e1c0 RDI: 0000000000000000 [ 1293.503585] RBP: 0000000000000000 R08: ffff888119886740 R09: ffff888165eca73c [ 1293.505286] R10: 0000000000000018 R11: 0000000000000018 R12: ffff88818a94e1c0 [ 1293.506979] R13: ffff888112729800 R14: 0000000000000000 R15: ffff888112729858 [ 1293.508753] FS: 0000000000000000(0000) GS:ffff88852cc40000(0000) knlGS:0000000000000000 [ 1293.510782] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1293.512265] CR2: 00000000000009a8 CR3: 00000001032d4002 CR4: 0000000000370ea0 [ 1293.514001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1293.515806] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Fixes: 8a66e4585979 ("net/mlx5: Change ownership model for lag") Signed-off-by: Eli Cohen elic@nvidia.com Reviewed-by: Maor Dickman maord@nvidia.com Reviewed-by: Mark Bloch mbloch@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c index 5d41e19378e09..c520edb942ca5 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c @@ -1234,7 +1234,7 @@ void mlx5_lag_add_netdev(struct mlx5_core_dev *dev, mlx5_ldev_add_netdev(ldev, dev, netdev);
for (i = 0; i < ldev->ports; i++) - if (!ldev->pf[i].dev) + if (!ldev->pf[i].netdev) break;
if (i >= ldev->ports)
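To make the now-symmetric set path concrete, here is a minimal sketch of the readiness check after this change; the helper name and the flag-update call are illustrative, not the verbatim driver code. The clear path already keyed off ldev->pf[i].netdev:

/* Illustrative sketch: the flag may only be set once every port has
 * registered its netdev, matching the clear path. Previously .dev was
 * tested here, so the flag could be set while a netdev slot was still
 * NULL, leading to the mlx5_do_bond() crash described above.
 */
static void mlx5_ldev_set_ndevs_ready(struct mlx5_lag *ldev)
{
	int i;

	for (i = 0; i < ldev->ports; i++)
		if (!ldev->pf[i].netdev)
			return;	/* at least one netdev not yet registered */

	set_bit(MLX5_LAG_FLAG_NDEVS_READY, &ldev->state_flags);
}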
From: Eli Cohen elic@nvidia.com
[ Upstream commit 942fca7e762be39204e5926e91a288a343a97c72 ]
Make sure to modify the rule for uplink forwarding only in the case where the destination vport number is MLX5_VPORT_UPLINK.
Fixes: 94db33177819 ("net/mlx5: Support multiport eswitch mode") Signed-off-by: Eli Cohen elic@nvidia.com Reviewed-by: Maor Dickman maord@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c index eb79810199d3e..d04739cb793e5 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c @@ -427,7 +427,8 @@ esw_setup_vport_dest(struct mlx5_flow_destination *dest, struct mlx5_flow_act *f dest[dest_idx].vport.vhca_id = MLX5_CAP_GEN(esw_attr->dests[attr_idx].mdev, vhca_id); dest[dest_idx].vport.flags |= MLX5_FLOW_DEST_VPORT_VHCA_ID; - if (mlx5_lag_mpesw_is_activated(esw->dev)) + if (dest[dest_idx].vport.num == MLX5_VPORT_UPLINK && + mlx5_lag_mpesw_is_activated(esw->dev)) dest[dest_idx].type = MLX5_FLOW_DESTINATION_TYPE_UPLINK; } if (esw_attr->dests[attr_idx].flags & MLX5_ESW_DEST_ENCAP) {
From: Vlad Buslov vladbu@nvidia.com
[ Upstream commit 8e93f29422ffe968d7161f91acdf0d47f5323727 ]
The lag_lock is taken from both process and softirq contexts, which results in the lockdep warning [0] about a potential deadlock. However, just disabling softirqs by using the *_bh spinlock API is not enough, since it will cause a warning in some contexts where the lock is obtained with hard irqs disabled. To fix the issue, save the current irq state, disable irqs before obtaining the lock, and re-enable irqs from the saved state after releasing it.
[0]:
[Sun Aug 7 13:12:29 2022] ================================ [Sun Aug 7 13:12:29 2022] WARNING: inconsistent lock state [Sun Aug 7 13:12:29 2022] 5.19.0_for_upstream_debug_2022_08_04_16_06 #1 Not tainted [Sun Aug 7 13:12:29 2022] -------------------------------- [Sun Aug 7 13:12:29 2022] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [Sun Aug 7 13:12:29 2022] swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes: [Sun Aug 7 13:12:29 2022] ffffffffa06dc0d8 (lag_lock){+.?.}-{2:2}, at: mlx5_lag_is_shared_fdb+0x1f/0x120 [mlx5_core] [Sun Aug 7 13:12:29 2022] {SOFTIRQ-ON-W} state was registered at: [Sun Aug 7 13:12:29 2022] lock_acquire+0x1c1/0x550 [Sun Aug 7 13:12:29 2022] _raw_spin_lock+0x2c/0x40 [Sun Aug 7 13:12:29 2022] mlx5_lag_add_netdev+0x13b/0x480 [mlx5_core] [Sun Aug 7 13:12:29 2022] mlx5e_nic_enable+0x114/0x470 [mlx5_core] [Sun Aug 7 13:12:29 2022] mlx5e_attach_netdev+0x30e/0x6a0 [mlx5_core] [Sun Aug 7 13:12:29 2022] mlx5e_resume+0x105/0x160 [mlx5_core] [Sun Aug 7 13:12:29 2022] mlx5e_probe+0xac3/0x14f0 [mlx5_core] [Sun Aug 7 13:12:29 2022] auxiliary_bus_probe+0x9d/0xe0 [Sun Aug 7 13:12:29 2022] really_probe+0x1e0/0xaa0 [Sun Aug 7 13:12:29 2022] __driver_probe_device+0x219/0x480 [Sun Aug 7 13:12:29 2022] driver_probe_device+0x49/0x130 [Sun Aug 7 13:12:29 2022] __driver_attach+0x1e4/0x4d0 [Sun Aug 7 13:12:29 2022] bus_for_each_dev+0x11e/0x1a0 [Sun Aug 7 13:12:29 2022] bus_add_driver+0x3f4/0x5a0 [Sun Aug 7 13:12:29 2022] driver_register+0x20f/0x390 [Sun Aug 7 13:12:29 2022] __auxiliary_driver_register+0x14e/0x260 [Sun Aug 7 13:12:29 2022] mlx5e_init+0x38/0x90 [mlx5_core] [Sun Aug 7 13:12:29 2022] vhost_iotlb_itree_augment_rotate+0xcb/0x180 [vhost_iotlb] [Sun Aug 7 13:12:29 2022] do_one_initcall+0xc4/0x400 [Sun Aug 7 13:12:29 2022] do_init_module+0x18a/0x620 [Sun Aug 7 13:12:29 2022] load_module+0x563a/0x7040 [Sun Aug 7 13:12:29 2022] __do_sys_finit_module+0x122/0x1d0 [Sun Aug 7 13:12:29 2022] do_syscall_64+0x3d/0x90 [Sun Aug 7 13:12:29 2022] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [Sun Aug 7 13:12:29 2022] irq event stamp: 3596508 [Sun Aug 7 13:12:29 2022] hardirqs last enabled at (3596508): [<ffffffff813687c2>] __local_bh_enable_ip+0xa2/0x100 [Sun Aug 7 13:12:29 2022] hardirqs last disabled at (3596507): [<ffffffff813687da>] __local_bh_enable_ip+0xba/0x100 [Sun Aug 7 13:12:29 2022] softirqs last enabled at (3596488): [<ffffffff81368a2a>] irq_exit_rcu+0x11a/0x170 [Sun Aug 7 13:12:29 2022] softirqs last disabled at (3596495): [<ffffffff81368a2a>] irq_exit_rcu+0x11a/0x170 [Sun Aug 7 13:12:29 2022] other info that might help us debug this: [Sun Aug 7 13:12:29 2022] Possible unsafe locking scenario:
[Sun Aug 7 13:12:29 2022] CPU0 [Sun Aug 7 13:12:29 2022] ---- [Sun Aug 7 13:12:29 2022] lock(lag_lock); [Sun Aug 7 13:12:29 2022] <Interrupt> [Sun Aug 7 13:12:29 2022] lock(lag_lock); [Sun Aug 7 13:12:29 2022] *** DEADLOCK ***
[Sun Aug 7 13:12:29 2022] 4 locks held by swapper/0/0: [Sun Aug 7 13:12:29 2022] #0: ffffffff84643260 (rcu_read_lock){....}-{1:2}, at: mlx5e_napi_poll+0x43/0x20a0 [mlx5_core] [Sun Aug 7 13:12:29 2022] #1: ffffffff84643260 (rcu_read_lock){....}-{1:2}, at: netif_receive_skb_list_internal+0x2d7/0xd60 [Sun Aug 7 13:12:29 2022] #2: ffff888144a18b58 (&br->hash_lock){+.-.}-{2:2}, at: br_fdb_update+0x301/0x570 [Sun Aug 7 13:12:29 2022] #3: ffffffff84643260 (rcu_read_lock){....}-{1:2}, at: atomic_notifier_call_chain+0x5/0x1d0 [Sun Aug 7 13:12:29 2022] stack backtrace: [Sun Aug 7 13:12:29 2022] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0_for_upstream_debug_2022_08_04_16_06 #1 [Sun Aug 7 13:12:29 2022] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [Sun Aug 7 13:12:29 2022] Call Trace: [Sun Aug 7 13:12:29 2022] <IRQ> [Sun Aug 7 13:12:29 2022] dump_stack_lvl+0x57/0x7d [Sun Aug 7 13:12:29 2022] mark_lock.part.0.cold+0x5f/0x92 [Sun Aug 7 13:12:29 2022] ? lock_chain_count+0x20/0x20 [Sun Aug 7 13:12:29 2022] ? unwind_next_frame+0x1c4/0x1b50 [Sun Aug 7 13:12:29 2022] ? secondary_startup_64_no_verify+0xcd/0xdb [Sun Aug 7 13:12:29 2022] ? mlx5e_napi_poll+0x4e9/0x20a0 [mlx5_core] [Sun Aug 7 13:12:29 2022] ? mlx5e_napi_poll+0x4e9/0x20a0 [mlx5_core] [Sun Aug 7 13:12:29 2022] ? stack_access_ok+0x1d0/0x1d0 [Sun Aug 7 13:12:29 2022] ? start_kernel+0x3a7/0x3c5 [Sun Aug 7 13:12:29 2022] __lock_acquire+0x1260/0x6720 [Sun Aug 7 13:12:29 2022] ? lock_chain_count+0x20/0x20 [Sun Aug 7 13:12:29 2022] ? lock_chain_count+0x20/0x20 [Sun Aug 7 13:12:29 2022] ? register_lock_class+0x1880/0x1880 [Sun Aug 7 13:12:29 2022] ? mark_lock.part.0+0xed/0x3060 [Sun Aug 7 13:12:29 2022] ? stack_trace_save+0x91/0xc0 [Sun Aug 7 13:12:29 2022] lock_acquire+0x1c1/0x550 [Sun Aug 7 13:12:29 2022] ? mlx5_lag_is_shared_fdb+0x1f/0x120 [mlx5_core] [Sun Aug 7 13:12:29 2022] ? lockdep_hardirqs_on_prepare+0x400/0x400 [Sun Aug 7 13:12:29 2022] ? __lock_acquire+0xd6f/0x6720 [Sun Aug 7 13:12:29 2022] _raw_spin_lock+0x2c/0x40 [Sun Aug 7 13:12:29 2022] ? mlx5_lag_is_shared_fdb+0x1f/0x120 [mlx5_core] [Sun Aug 7 13:12:29 2022] mlx5_lag_is_shared_fdb+0x1f/0x120 [mlx5_core] [Sun Aug 7 13:12:29 2022] mlx5_esw_bridge_rep_vport_num_vhca_id_get+0x1a0/0x600 [mlx5_core] [Sun Aug 7 13:12:29 2022] ? mlx5_esw_bridge_update_work+0x90/0x90 [mlx5_core] [Sun Aug 7 13:12:29 2022] ? lock_acquire+0x1c1/0x550 [Sun Aug 7 13:12:29 2022] mlx5_esw_bridge_switchdev_event+0x185/0x8f0 [mlx5_core] [Sun Aug 7 13:12:29 2022] ? mlx5_esw_bridge_port_obj_attr_set+0x3e0/0x3e0 [mlx5_core] [Sun Aug 7 13:12:29 2022] ? check_chain_key+0x24a/0x580 [Sun Aug 7 13:12:29 2022] atomic_notifier_call_chain+0xd7/0x1d0 [Sun Aug 7 13:12:29 2022] br_switchdev_fdb_notify+0xea/0x100 [Sun Aug 7 13:12:29 2022] ? br_switchdev_set_port_flag+0x310/0x310 [Sun Aug 7 13:12:29 2022] fdb_notify+0x11b/0x150 [Sun Aug 7 13:12:29 2022] br_fdb_update+0x34c/0x570 [Sun Aug 7 13:12:29 2022] ? lock_chain_count+0x20/0x20 [Sun Aug 7 13:12:29 2022] ? br_fdb_add_local+0x50/0x50 [Sun Aug 7 13:12:29 2022] ? br_allowed_ingress+0x5f/0x1070 [Sun Aug 7 13:12:29 2022] ? check_chain_key+0x24a/0x580 [Sun Aug 7 13:12:29 2022] br_handle_frame_finish+0x786/0x18e0 [Sun Aug 7 13:12:29 2022] ? check_chain_key+0x24a/0x580 [Sun Aug 7 13:12:29 2022] ? br_handle_local_finish+0x20/0x20 [Sun Aug 7 13:12:29 2022] ? __lock_acquire+0xd6f/0x6720 [Sun Aug 7 13:12:29 2022] ? sctp_inet_bind_verify+0x4d/0x190 [Sun Aug 7 13:12:29 2022] ? 
xlog_unpack_data+0x2e0/0x310 [Sun Aug 7 13:12:29 2022] ? br_handle_local_finish+0x20/0x20 [Sun Aug 7 13:12:29 2022] br_nf_hook_thresh+0x227/0x380 [br_netfilter] [Sun Aug 7 13:12:29 2022] ? setup_pre_routing+0x460/0x460 [br_netfilter] [Sun Aug 7 13:12:29 2022] ? br_handle_local_finish+0x20/0x20 [Sun Aug 7 13:12:29 2022] ? br_nf_pre_routing_ipv6+0x48b/0x69c [br_netfilter] [Sun Aug 7 13:12:29 2022] br_nf_pre_routing_finish_ipv6+0x5c2/0xbf0 [br_netfilter] [Sun Aug 7 13:12:29 2022] ? br_handle_local_finish+0x20/0x20 [Sun Aug 7 13:12:29 2022] br_nf_pre_routing_ipv6+0x4c6/0x69c [br_netfilter] [Sun Aug 7 13:12:29 2022] ? br_validate_ipv6+0x9e0/0x9e0 [br_netfilter] [Sun Aug 7 13:12:29 2022] ? br_nf_forward_arp+0xb70/0xb70 [br_netfilter] [Sun Aug 7 13:12:29 2022] ? br_nf_pre_routing+0xacf/0x1160 [br_netfilter] [Sun Aug 7 13:12:29 2022] br_handle_frame+0x8a9/0x1270 [Sun Aug 7 13:12:29 2022] ? br_handle_frame_finish+0x18e0/0x18e0 [Sun Aug 7 13:12:29 2022] ? register_lock_class+0x1880/0x1880 [Sun Aug 7 13:12:29 2022] ? br_handle_local_finish+0x20/0x20 [Sun Aug 7 13:12:29 2022] ? bond_handle_frame+0xf9/0xac0 [bonding] [Sun Aug 7 13:12:29 2022] ? br_handle_frame_finish+0x18e0/0x18e0 [Sun Aug 7 13:12:29 2022] __netif_receive_skb_core+0x7c0/0x2c70 [Sun Aug 7 13:12:29 2022] ? check_chain_key+0x24a/0x580 [Sun Aug 7 13:12:29 2022] ? generic_xdp_tx+0x5b0/0x5b0 [Sun Aug 7 13:12:29 2022] ? __lock_acquire+0xd6f/0x6720 [Sun Aug 7 13:12:29 2022] ? register_lock_class+0x1880/0x1880 [Sun Aug 7 13:12:29 2022] ? check_chain_key+0x24a/0x580 [Sun Aug 7 13:12:29 2022] __netif_receive_skb_list_core+0x2d7/0x8a0 [Sun Aug 7 13:12:29 2022] ? lock_acquire+0x1c1/0x550 [Sun Aug 7 13:12:29 2022] ? process_backlog+0x960/0x960 [Sun Aug 7 13:12:29 2022] ? lockdep_hardirqs_on_prepare+0x129/0x400 [Sun Aug 7 13:12:29 2022] ? kvm_clock_get_cycles+0x14/0x20 [Sun Aug 7 13:12:29 2022] netif_receive_skb_list_internal+0x5f4/0xd60 [Sun Aug 7 13:12:29 2022] ? do_xdp_generic+0x150/0x150 [Sun Aug 7 13:12:29 2022] ? mlx5e_poll_rx_cq+0xf6b/0x2960 [mlx5_core] [Sun Aug 7 13:12:29 2022] ? mlx5e_poll_ico_cq+0x3d/0x1590 [mlx5_core] [Sun Aug 7 13:12:29 2022] napi_complete_done+0x188/0x710 [Sun Aug 7 13:12:29 2022] mlx5e_napi_poll+0x4e9/0x20a0 [mlx5_core] [Sun Aug 7 13:12:29 2022] ? __queue_work+0x53c/0xeb0 [Sun Aug 7 13:12:29 2022] __napi_poll+0x9f/0x540 [Sun Aug 7 13:12:29 2022] net_rx_action+0x420/0xb70 [Sun Aug 7 13:12:29 2022] ? napi_threaded_poll+0x470/0x470 [Sun Aug 7 13:12:29 2022] ? 
__common_interrupt+0x79/0x1a0 [Sun Aug 7 13:12:29 2022] __do_softirq+0x271/0x92c [Sun Aug 7 13:12:29 2022] irq_exit_rcu+0x11a/0x170 [Sun Aug 7 13:12:29 2022] common_interrupt+0x7d/0xa0 [Sun Aug 7 13:12:29 2022] </IRQ> [Sun Aug 7 13:12:29 2022] <TASK> [Sun Aug 7 13:12:29 2022] asm_common_interrupt+0x22/0x40 [Sun Aug 7 13:12:29 2022] RIP: 0010:default_idle+0x42/0x60 [Sun Aug 7 13:12:29 2022] Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 0f b6 14 11 38 d0 7c 04 84 d2 75 14 8b 05 6b f1 22 02 85 c0 7e 07 0f 00 2d 80 3b 4a 00 fb f4 <c3> 48 c7 c7 e0 07 7e 85 e8 21 bd 40 fe eb de 66 66 2e 0f 1f 84 00 [Sun Aug 7 13:12:29 2022] RSP: 0018:ffffffff84407e18 EFLAGS: 00000242 [Sun Aug 7 13:12:29 2022] RAX: 0000000000000001 RBX: ffffffff84ec4a68 RCX: 1ffffffff0afc0fc [Sun Aug 7 13:12:29 2022] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff835b1fac [Sun Aug 7 13:12:29 2022] RBP: 0000000000000000 R08: 0000000000000001 R09: ffff8884d2c44ac3 [Sun Aug 7 13:12:29 2022] R10: ffffed109a588958 R11: 00000000ffffffff R12: 0000000000000000 [Sun Aug 7 13:12:29 2022] R13: ffffffff84efac20 R14: 0000000000000000 R15: dffffc0000000000 [Sun Aug 7 13:12:29 2022] ? default_idle_call+0xcc/0x460 [Sun Aug 7 13:12:29 2022] default_idle_call+0xec/0x460 [Sun Aug 7 13:12:29 2022] do_idle+0x394/0x450 [Sun Aug 7 13:12:29 2022] ? arch_cpu_idle_exit+0x40/0x40 [Sun Aug 7 13:12:29 2022] cpu_startup_entry+0x19/0x20 [Sun Aug 7 13:12:29 2022] rest_init+0x156/0x250 [Sun Aug 7 13:12:29 2022] arch_call_rest_init+0xf/0x15 [Sun Aug 7 13:12:29 2022] start_kernel+0x3a7/0x3c5 [Sun Aug 7 13:12:29 2022] secondary_startup_64_no_verify+0xcd/0xdb [Sun Aug 7 13:12:29 2022] </TASK>
Fixes: ff9b7521468b ("net/mlx5: Bridge, support LAG") Signed-off-by: Vlad Buslov vladbu@nvidia.com Reviewed-by: Mark Bloch mbloch@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org --- .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 55 +++++++++++-------- 1 file changed, 33 insertions(+), 22 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c index c520edb942ca5..d98acd68af2ec 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c @@ -1067,30 +1067,32 @@ static void mlx5_ldev_add_netdev(struct mlx5_lag *ldev, struct net_device *netdev) { unsigned int fn = mlx5_get_dev_index(dev); + unsigned long flags;
if (fn >= ldev->ports) return;
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); ldev->pf[fn].netdev = netdev; ldev->tracker.netdev_state[fn].link_up = 0; ldev->tracker.netdev_state[fn].tx_enabled = 0; - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags); }
static void mlx5_ldev_remove_netdev(struct mlx5_lag *ldev, struct net_device *netdev) { + unsigned long flags; int i;
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); for (i = 0; i < ldev->ports; i++) { if (ldev->pf[i].netdev == netdev) { ldev->pf[i].netdev = NULL; break; } } - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags); }
static void mlx5_ldev_add_mdev(struct mlx5_lag *ldev, @@ -1246,12 +1248,13 @@ void mlx5_lag_add_netdev(struct mlx5_core_dev *dev, bool mlx5_lag_is_roce(struct mlx5_core_dev *dev) { struct mlx5_lag *ldev; + unsigned long flags; bool res;
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); ldev = mlx5_lag_dev(dev); res = ldev && __mlx5_lag_is_roce(ldev); - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags);
return res; } @@ -1260,12 +1263,13 @@ EXPORT_SYMBOL(mlx5_lag_is_roce); bool mlx5_lag_is_active(struct mlx5_core_dev *dev) { struct mlx5_lag *ldev; + unsigned long flags; bool res;
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); ldev = mlx5_lag_dev(dev); res = ldev && __mlx5_lag_is_active(ldev); - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags);
return res; } @@ -1274,13 +1278,14 @@ EXPORT_SYMBOL(mlx5_lag_is_active); bool mlx5_lag_is_master(struct mlx5_core_dev *dev) { struct mlx5_lag *ldev; + unsigned long flags; bool res;
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); ldev = mlx5_lag_dev(dev); res = ldev && __mlx5_lag_is_active(ldev) && dev == ldev->pf[MLX5_LAG_P1].dev; - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags);
return res; } @@ -1289,12 +1294,13 @@ EXPORT_SYMBOL(mlx5_lag_is_master); bool mlx5_lag_is_sriov(struct mlx5_core_dev *dev) { struct mlx5_lag *ldev; + unsigned long flags; bool res;
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); ldev = mlx5_lag_dev(dev); res = ldev && __mlx5_lag_is_sriov(ldev); - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags);
return res; } @@ -1303,13 +1309,14 @@ EXPORT_SYMBOL(mlx5_lag_is_sriov); bool mlx5_lag_is_shared_fdb(struct mlx5_core_dev *dev) { struct mlx5_lag *ldev; + unsigned long flags; bool res;
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); ldev = mlx5_lag_dev(dev); res = ldev && __mlx5_lag_is_sriov(ldev) && test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &ldev->mode_flags); - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags);
return res; } @@ -1352,9 +1359,10 @@ struct net_device *mlx5_lag_get_roce_netdev(struct mlx5_core_dev *dev) { struct net_device *ndev = NULL; struct mlx5_lag *ldev; + unsigned long flags; int i;
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); ldev = mlx5_lag_dev(dev);
if (!(ldev && __mlx5_lag_is_roce(ldev))) @@ -1373,7 +1381,7 @@ struct net_device *mlx5_lag_get_roce_netdev(struct mlx5_core_dev *dev) dev_hold(ndev);
unlock: - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags);
return ndev; } @@ -1383,10 +1391,11 @@ u8 mlx5_lag_get_slave_port(struct mlx5_core_dev *dev, struct net_device *slave) { struct mlx5_lag *ldev; + unsigned long flags; u8 port = 0; int i;
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); ldev = mlx5_lag_dev(dev); if (!(ldev && __mlx5_lag_is_roce(ldev))) goto unlock; @@ -1401,7 +1410,7 @@ u8 mlx5_lag_get_slave_port(struct mlx5_core_dev *dev, port = ldev->v2p_map[port * ldev->buckets];
unlock: - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags); return port; } EXPORT_SYMBOL(mlx5_lag_get_slave_port); @@ -1422,8 +1431,9 @@ struct mlx5_core_dev *mlx5_lag_get_peer_mdev(struct mlx5_core_dev *dev) { struct mlx5_core_dev *peer_dev = NULL; struct mlx5_lag *ldev; + unsigned long flags;
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); ldev = mlx5_lag_dev(dev); if (!ldev) goto unlock; @@ -1433,7 +1443,7 @@ struct mlx5_core_dev *mlx5_lag_get_peer_mdev(struct mlx5_core_dev *dev) ldev->pf[MLX5_LAG_P1].dev;
unlock: - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags); return peer_dev; } EXPORT_SYMBOL(mlx5_lag_get_peer_mdev); @@ -1446,6 +1456,7 @@ int mlx5_lag_query_cong_counters(struct mlx5_core_dev *dev, int outlen = MLX5_ST_SZ_BYTES(query_cong_statistics_out); struct mlx5_core_dev **mdev; struct mlx5_lag *ldev; + unsigned long flags; int num_ports; int ret, i, j; void *out; @@ -1462,7 +1473,7 @@ int mlx5_lag_query_cong_counters(struct mlx5_core_dev *dev,
memset(values, 0, sizeof(*values) * num_counters);
- spin_lock(&lag_lock); + spin_lock_irqsave(&lag_lock, flags); ldev = mlx5_lag_dev(dev); if (ldev && __mlx5_lag_is_active(ldev)) { num_ports = ldev->ports; @@ -1472,7 +1483,7 @@ int mlx5_lag_query_cong_counters(struct mlx5_core_dev *dev, num_ports = 1; mdev[MLX5_LAG_P1] = dev; } - spin_unlock(&lag_lock); + spin_unlock_irqrestore(&lag_lock, flags);
for (i = 0; i < num_ports; ++i) { u32 in[MLX5_ST_SZ_DW(query_cong_statistics_in)] = {};
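For reference, a minimal sketch of the locking pattern this patch applies throughout lag.c, assuming a lock (demo_lock, standing in for lag_lock) that is reachable from process context, softirq context, and contexts that already run with hard irqs disabled:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);	/* stand-in for lag_lock */

struct demo_state {
	int counter;
};

/* Safe from any context: save the current irq state, keep irqs
 * disabled while holding the lock, then restore the saved state.
 * Plain spin_lock_bh() only disables softirqs and would itself warn
 * in callers that already run with hard irqs disabled.
 */
static void demo_update(struct demo_state *st)
{
	unsigned long flags;

	spin_lock_irqsave(&demo_lock, flags);
	st->counter++;
	spin_unlock_irqrestore(&demo_lock, flags);
}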
From: Roy Novich royno@nvidia.com
[ Upstream commit 090f3e4f4089ab8041ed7d632c7851c2a42fcc10 ]
When the driver unloads, give/reclaim_pages may fail because the PF driver is in teardown flow; the current code then leads to the following kernel log print: 'failed reclaiming pages: err 0'.
Fix it to get the same behavior as before the cited commits by calling mlx5_cmd_check before handling the error state. mlx5_cmd_check will verify whether the returned error is an actual error that needs to be handled by the driver, and will return an appropriate value.
Fixes: 8d564292a166 ("net/mlx5: Remove redundant error on reclaim pages") Fixes: 4dac2f10ada0 ("net/mlx5: Remove redundant notify fail on give pages") Signed-off-by: Roy Novich royno@nvidia.com Reviewed-by: Moshe Shemesh moshe@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c index ec76a8b1acc1c..60596357bfc7a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c @@ -376,8 +376,8 @@ static int give_pages(struct mlx5_core_dev *dev, u16 func_id, int npages, goto out_dropped; } } + err = mlx5_cmd_check(dev, err, in, out); if (err) { - err = mlx5_cmd_check(dev, err, in, out); mlx5_core_warn(dev, "func_id 0x%x, npages %d, err %d\n", func_id, npages, err); goto out_dropped; @@ -524,10 +524,13 @@ static int reclaim_pages(struct mlx5_core_dev *dev, u16 func_id, int npages, dev->priv.reclaim_pages_discard += npages; } /* if triggered by FW event and failed by FW then ignore */ - if (event && err == -EREMOTEIO) + if (event && err == -EREMOTEIO) { err = 0; + goto out_free; + } + + err = mlx5_cmd_check(dev, err, in, out); if (err) { - err = mlx5_cmd_check(dev, err, in, out); mlx5_core_err(dev, "failed reclaiming pages: err %d\n", err); goto out_free; }
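A minimal sketch of the corrected ordering, simplified from the give_pages() hunk above (demo_give_pages_tail is an illustrative wrapper, not a real driver function): mlx5_cmd_check() translates the raw command status into a driver error code first, returning 0 for statuses that need no handling, such as teardown-time failures, so the warning only fires for real errors:

/* Illustrative: translate the firmware status before acting on it. */
static int demo_give_pages_tail(struct mlx5_core_dev *dev, int err,
				void *in, void *out, u16 func_id, int npages)
{
	err = mlx5_cmd_check(dev, err, in, out);
	if (err)	/* only a real, driver-visible error lands here */
		mlx5_core_warn(dev, "func_id 0x%x, npages %d, err %d\n",
			       func_id, npages, err);
	return err;
}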
From: Moshe Shemesh moshe@nvidia.com
[ Upstream commit d59b73a66e5e0682442b6d7b4965364e57078b80 ]
Add a lock_class_key per mlx5 device to avoid a false-positive "possible circular locking dependency" warning by lockdep on flows that lock more than one mlx5 device, such as adding an SF.
kernel log: ====================================================== WARNING: possible circular locking dependency detected 5.19.0-rc8+ #2 Not tainted ------------------------------------------------------ kworker/u20:0/8 is trying to acquire lock: ffff88812dfe0d98 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_init_one+0x2e/0x490 [mlx5_core]
but task is already holding lock: ffff888101aa7898 (&(¬ifier->n_head)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x5a/0x130
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&(¬ifier->n_head)->rwsem){++++}-{3:3}: down_write+0x90/0x150 blocking_notifier_chain_register+0x53/0xa0 mlx5_sf_table_init+0x369/0x4a0 [mlx5_core] mlx5_init_one+0x261/0x490 [mlx5_core] probe_one+0x430/0x680 [mlx5_core] local_pci_probe+0xd6/0x170 work_for_cpu_fn+0x4e/0xa0 process_one_work+0x7c2/0x1340 worker_thread+0x6f6/0xec0 kthread+0x28f/0x330 ret_from_fork+0x1f/0x30
-> #0 (&dev->intf_state_mutex){+.+.}-{3:3}: __lock_acquire+0x2fc7/0x6720 lock_acquire+0x1c1/0x550 __mutex_lock+0x12c/0x14b0 mlx5_init_one+0x2e/0x490 [mlx5_core] mlx5_sf_dev_probe+0x29c/0x370 [mlx5_core] auxiliary_bus_probe+0x9d/0xe0 really_probe+0x1e0/0xaa0 __driver_probe_device+0x219/0x480 driver_probe_device+0x49/0x130 __device_attach_driver+0x1b8/0x280 bus_for_each_drv+0x123/0x1a0 __device_attach+0x1a3/0x460 bus_probe_device+0x1a2/0x260 device_add+0x9b1/0x1b40 __auxiliary_device_add+0x88/0xc0 mlx5_sf_dev_state_change_handler+0x67e/0x9d0 [mlx5_core] blocking_notifier_call_chain+0xd5/0x130 mlx5_vhca_state_work_handler+0x2b0/0x3f0 [mlx5_core] process_one_work+0x7c2/0x1340 worker_thread+0x59d/0xec0 kthread+0x28f/0x330 ret_from_fork+0x1f/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1 ---- ---- lock(&(¬ifier->n_head)->rwsem); lock(&dev->intf_state_mutex); lock(&(¬ifier->n_head)->rwsem); lock(&dev->intf_state_mutex);
*** DEADLOCK ***
4 locks held by kworker/u20:0/8: #0: ffff888150612938 ((wq_completion)mlx5_events){+.+.}-{0:0}, at: process_one_work+0x6e2/0x1340 #1: ffff888100cafdb8 ((work_completion)(&work->work)#3){+.+.}-{0:0}, at: process_one_work+0x70f/0x1340 #2: ffff888101aa7898 (&(¬ifier->n_head)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x5a/0x130 #3: ffff88813682d0e8 (&dev->mutex){....}-{3:3}, at:__device_attach+0x76/0x460
stack backtrace: CPU: 6 PID: 8 Comm: kworker/u20:0 Not tainted 5.19.0-rc8+ Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Workqueue: mlx5_events mlx5_vhca_state_work_handler [mlx5_core] Call Trace: <TASK> dump_stack_lvl+0x57/0x7d check_noncircular+0x278/0x300 ? print_circular_bug+0x460/0x460 ? lock_chain_count+0x20/0x20 ? register_lock_class+0x1880/0x1880 __lock_acquire+0x2fc7/0x6720 ? register_lock_class+0x1880/0x1880 ? register_lock_class+0x1880/0x1880 lock_acquire+0x1c1/0x550 ? mlx5_init_one+0x2e/0x490 [mlx5_core] ? lockdep_hardirqs_on_prepare+0x400/0x400 __mutex_lock+0x12c/0x14b0 ? mlx5_init_one+0x2e/0x490 [mlx5_core] ? mlx5_init_one+0x2e/0x490 [mlx5_core] ? _raw_read_unlock+0x1f/0x30 ? mutex_lock_io_nested+0x1320/0x1320 ? __ioremap_caller.constprop.0+0x306/0x490 ? mlx5_sf_dev_probe+0x269/0x370 [mlx5_core] ? iounmap+0x160/0x160 mlx5_init_one+0x2e/0x490 [mlx5_core] mlx5_sf_dev_probe+0x29c/0x370 [mlx5_core] ? mlx5_sf_dev_remove+0x130/0x130 [mlx5_core] auxiliary_bus_probe+0x9d/0xe0 really_probe+0x1e0/0xaa0 __driver_probe_device+0x219/0x480 ? auxiliary_match_id+0xe9/0x140 driver_probe_device+0x49/0x130 __device_attach_driver+0x1b8/0x280 ? driver_allows_async_probing+0x140/0x140 bus_for_each_drv+0x123/0x1a0 ? bus_for_each_dev+0x1a0/0x1a0 ? lockdep_hardirqs_on_prepare+0x286/0x400 ? trace_hardirqs_on+0x2d/0x100 __device_attach+0x1a3/0x460 ? device_driver_attach+0x1e0/0x1e0 ? kobject_uevent_env+0x22d/0xf10 bus_probe_device+0x1a2/0x260 device_add+0x9b1/0x1b40 ? dev_set_name+0xab/0xe0 ? __fw_devlink_link_to_suppliers+0x260/0x260 ? memset+0x20/0x40 ? lockdep_init_map_type+0x21a/0x7d0 __auxiliary_device_add+0x88/0xc0 ? auxiliary_device_init+0x86/0xa0 mlx5_sf_dev_state_change_handler+0x67e/0x9d0 [mlx5_core] blocking_notifier_call_chain+0xd5/0x130 mlx5_vhca_state_work_handler+0x2b0/0x3f0 [mlx5_core] ? mlx5_vhca_event_arm+0x100/0x100 [mlx5_core] ? lock_downgrade+0x6e0/0x6e0 ? lockdep_hardirqs_on_prepare+0x286/0x400 process_one_work+0x7c2/0x1340 ? lockdep_hardirqs_on_prepare+0x400/0x400 ? pwq_dec_nr_in_flight+0x230/0x230 ? rwlock_bug.part.0+0x90/0x90 worker_thread+0x59d/0xec0 ? process_one_work+0x1340/0x1340 kthread+0x28f/0x330 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK>
Fixes: 6a3273217469 ("net/mlx5: SF, Port function state change support") Signed-off-by: Moshe Shemesh moshe@nvidia.com Reviewed-by: Shay Drory shayd@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 ++++ include/linux/mlx5/driver.h | 1 + 2 files changed, 5 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index ba2e5232b90be..616207c3b187a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -1472,7 +1472,9 @@ int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx) memcpy(&dev->profile, &profile[profile_idx], sizeof(dev->profile)); INIT_LIST_HEAD(&priv->ctx_list); spin_lock_init(&priv->ctx_lock); + lockdep_register_key(&dev->lock_key); mutex_init(&dev->intf_state_mutex); + lockdep_set_class(&dev->intf_state_mutex, &dev->lock_key);
mutex_init(&priv->bfregs.reg_head.lock); mutex_init(&priv->bfregs.wc_head.lock); @@ -1527,6 +1529,7 @@ int mlx5_mdev_init(struct mlx5_core_dev *dev, int profile_idx) mutex_destroy(&priv->bfregs.wc_head.lock); mutex_destroy(&priv->bfregs.reg_head.lock); mutex_destroy(&dev->intf_state_mutex); + lockdep_unregister_key(&dev->lock_key); return err; }
@@ -1545,6 +1548,7 @@ void mlx5_mdev_uninit(struct mlx5_core_dev *dev) mutex_destroy(&priv->bfregs.wc_head.lock); mutex_destroy(&priv->bfregs.reg_head.lock); mutex_destroy(&dev->intf_state_mutex); + lockdep_unregister_key(&dev->lock_key); }
static int probe_one(struct pci_dev *pdev, const struct pci_device_id *id) diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h index 5040cd774c5a3..b0b4ac92354a2 100644 --- a/include/linux/mlx5/driver.h +++ b/include/linux/mlx5/driver.h @@ -773,6 +773,7 @@ struct mlx5_core_dev { enum mlx5_device_state state; /* sync interface state */ struct mutex intf_state_mutex; + struct lock_class_key lock_key; unsigned long intf_state; struct mlx5_priv priv; struct mlx5_profile profile;
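A minimal sketch of the per-device lock class pattern introduced above (demo_dev is an illustrative stand-in for mlx5_core_dev). Registering a lock_class_key per instance and attaching it to the mutex gives each device its own lockdep class, so taking a second device's intf_state_mutex while holding the first one's no longer looks like recursion on a single class:

#include <linux/lockdep.h>
#include <linux/mutex.h>

/* Illustrative device structure mirroring the mlx5_core_dev additions. */
struct demo_dev {
	struct mutex intf_state_mutex;
	struct lock_class_key lock_key;
};

static void demo_dev_init(struct demo_dev *dev)
{
	/* One lockdep class per device instance, instead of one class
	 * shared by every intf_state_mutex in the system.
	 */
	lockdep_register_key(&dev->lock_key);
	mutex_init(&dev->intf_state_mutex);
	lockdep_set_class(&dev->intf_state_mutex, &dev->lock_key);
}

static void demo_dev_uninit(struct demo_dev *dev)
{
	mutex_destroy(&dev->intf_state_mutex);
	lockdep_unregister_key(&dev->lock_key);
}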
From: Aya Levin ayal@nvidia.com
[ Upstream commit 7b3707fc79044871ab8f3d5fa5e9603155bb5577 ]
The driver caches the packet merge type in the mlx5e_params instance, which must be in perfect sync with the netdev_features bit. Prior to this patch, in certain conditions (*) the LRO state was set in mlx5e_params while the netdev_features bit was off, causing LRO to be applied on the RQs (HW level).
(*) This can happen only on profile init (mlx5e_build_nic_params()), when the RQ expects a non-linear SKB and the PCI bus is fast enough in comparison to the link width.
Solution: remove the setting of the packet merge type from mlx5e_build_nic_params(), as netdev features are not updated there.
Fixes: 619a8f2a42f1 ("net/mlx5e: Use linear SKB in Striding RQ") Signed-off-by: Aya Levin ayal@nvidia.com Reviewed-by: Tariq Toukan tariqt@nvidia.com Reviewed-by: Maxim Mikityanskiy maximmi@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 8 -------- 1 file changed, 8 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 087952b84ccb0..62aab20025345 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -4733,14 +4733,6 @@ void mlx5e_build_nic_params(struct mlx5e_priv *priv, struct mlx5e_xsk *xsk, u16 /* RQ */ mlx5e_build_rq_params(mdev, params);
- /* HW LRO */ - if (MLX5_CAP_ETH(mdev, lro_cap) && - params->rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ) { - /* No XSK params: checking the availability of striding RQ in general. */ - if (!mlx5e_rx_mpwqe_is_linear_skb(mdev, params, NULL)) - params->packet_merge.type = slow_pci_heuristic(mdev) ? - MLX5E_PACKET_MERGE_NONE : MLX5E_PACKET_MERGE_LRO; - } params->packet_merge.timeout = mlx5e_choose_lro_timeout(mdev, MLX5E_DEFAULT_LRO_TIMEOUT);
/* CQ moderation params */
From: Maor Dickman maord@nvidia.com
[ Upstream commit 550f96432e6f6770efdaee0e65239d61431062a1 ]
The cited commit reintroduced the ability to set hw-tc-offload in switchdev mode by reusing NIC mode calls without modifying them to support both modes; this can cause an illegal memory access when trying to turn hw-tc-offload off.
Fix this by using the right TC_FLAG when checking if tc rules are installed while disabling hw-tc-offload.
Fixes: d3cbd4254df8 ("net/mlx5e: Add ndo_set_feature for uplink representor") Signed-off-by: Maor Dickman maord@nvidia.com Signed-off-by: Saeed Mahameed saeedm@nvidia.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 62aab20025345..9e6db779b6efa 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -3678,7 +3678,9 @@ static int set_feature_hw_tc(struct net_device *netdev, bool enable) struct mlx5e_priv *priv = netdev_priv(netdev);
#if IS_ENABLED(CONFIG_MLX5_CLS_ACT) - if (!enable && mlx5e_tc_num_filters(priv, MLX5_TC_FLAG(NIC_OFFLOAD))) { + int tc_flag = mlx5e_is_uplink_rep(priv) ? MLX5_TC_FLAG(ESW_OFFLOAD) : + MLX5_TC_FLAG(NIC_OFFLOAD); + if (!enable && mlx5e_tc_num_filters(priv, tc_flag)) { netdev_err(netdev, "Active offloaded tc filters, can't turn hw_tc_offload off\n"); return -EINVAL;
From: Arun Ramadoss arun.ramadoss@microchip.com
[ Upstream commit 27faa0aa85f6696d411bbbebaed9f0f723c2a175 ]
ksz9477_switch_detect detects the chip id from location 0x00 and also checks gigabit compatibility and the number of ports based on the global_options0 register. To prepare for a common ksz switch detect function, everything other than the chip id read is moved to ksz9477_switch_init.
Signed-off-by: Arun Ramadoss arun.ramadoss@microchip.com Reviewed-by: Vladimir Oltean olteanv@gmail.com Signed-off-by: Paolo Abeni pabeni@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/microchip/ksz9477.c | 48 +++++++++++++---------------- 1 file changed, 22 insertions(+), 26 deletions(-)
diff --git a/drivers/net/dsa/microchip/ksz9477.c b/drivers/net/dsa/microchip/ksz9477.c index ebad795e4e95f..876a801ac23a4 100644 --- a/drivers/net/dsa/microchip/ksz9477.c +++ b/drivers/net/dsa/microchip/ksz9477.c @@ -1365,12 +1365,30 @@ static u32 ksz9477_get_port_addr(int port, int offset)
static int ksz9477_switch_detect(struct ksz_device *dev) { - u8 data8; - u8 id_hi; - u8 id_lo; u32 id32; int ret;
+ /* read chip id */ + ret = ksz_read32(dev, REG_CHIP_ID0__1, &id32); + if (ret) + return ret; + + dev_dbg(dev->dev, "Switch detect: ID=%08x\n", id32); + + dev->chip_id = id32 & 0x00FFFF00; + + return 0; +} + +static int ksz9477_switch_init(struct ksz_device *dev) +{ + u8 data8; + int ret; + + dev->ds->ops = &ksz9477_switch_ops; + + dev->port_mask = (1 << dev->info->port_cnt) - 1; + /* turn off SPI DO Edge select */ ret = ksz_read8(dev, REG_SW_GLOBAL_SERIAL_CTRL_0, &data8); if (ret) @@ -1381,10 +1399,6 @@ static int ksz9477_switch_detect(struct ksz_device *dev) if (ret) return ret;
- /* read chip id */ - ret = ksz_read32(dev, REG_CHIP_ID0__1, &id32); - if (ret) - return ret; ret = ksz_read8(dev, REG_GLOBAL_OPTIONS, &data8); if (ret) return ret; @@ -1395,10 +1409,7 @@ static int ksz9477_switch_detect(struct ksz_device *dev) /* Default capability is gigabit capable. */ dev->features = GBIT_SUPPORT;
- dev_dbg(dev->dev, "Switch detect: ID=%08x%02x\n", id32, data8); - id_hi = (u8)(id32 >> 16); - id_lo = (u8)(id32 >> 8); - if ((id_lo & 0xf) == 3) { + if (dev->chip_id == KSZ9893_CHIP_ID) { /* Chip is from KSZ9893 design. */ dev_info(dev->dev, "Found KSZ9893\n"); dev->features |= IS_9893; @@ -1416,21 +1427,6 @@ static int ksz9477_switch_detect(struct ksz_device *dev) if (!(data8 & SW_GIGABIT_ABLE)) dev->features &= ~GBIT_SUPPORT; } - - /* Change chip id to known ones so it can be matched against them. */ - id32 = (id_hi << 16) | (id_lo << 8); - - dev->chip_id = id32; - - return 0; -} - -static int ksz9477_switch_init(struct ksz_device *dev) -{ - dev->ds->ops = &ksz9477_switch_ops; - - dev->port_mask = (1 << dev->info->port_cnt) - 1; - return 0; }
From: Arun Ramadoss arun.ramadoss@microchip.com
[ Upstream commit 91a98917a8839923d404a77c21646ca5fc9e330a ]
KSZ87xx and KSZ88xx represent the chip_id at register location 0x00, while KSZ9477-compatible switches and the LAN937x switches share the same chip_id detection at locations 0x01 and 0x02. To provide common switch detect functionality for the ksz switches, the ksz_switch_detect function is introduced.
Signed-off-by: Arun Ramadoss arun.ramadoss@microchip.com Signed-off-by: Paolo Abeni pabeni@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/microchip/ksz8795.c | 48 +-------------- drivers/net/dsa/microchip/ksz8795_reg.h | 16 ----- drivers/net/dsa/microchip/ksz9477.c | 21 ------- drivers/net/dsa/microchip/ksz9477_reg.h | 1 - drivers/net/dsa/microchip/ksz_common.c | 78 +++++++++++++++++++++++-- drivers/net/dsa/microchip/ksz_common.h | 19 +++++- 6 files changed, 93 insertions(+), 90 deletions(-)
diff --git a/drivers/net/dsa/microchip/ksz8795.c b/drivers/net/dsa/microchip/ksz8795.c index 12a599d5e61a4..3cc51ee5fb6cc 100644 --- a/drivers/net/dsa/microchip/ksz8795.c +++ b/drivers/net/dsa/microchip/ksz8795.c @@ -1272,7 +1272,7 @@ static void ksz8_config_cpu_port(struct dsa_switch *ds) continue; if (!ksz_is_ksz88x3(dev)) { ksz_pread8(dev, i, regs[P_REMOTE_STATUS], &remote); - if (remote & PORT_FIBER_MODE) + if (remote & KSZ8_PORT_FIBER_MODE) p->fiber = 1; } if (p->fiber) @@ -1424,51 +1424,6 @@ static u32 ksz8_get_port_addr(int port, int offset) return PORT_CTRL_ADDR(port, offset); }
-static int ksz8_switch_detect(struct ksz_device *dev) -{ - u8 id1, id2; - u16 id16; - int ret; - - /* read chip id */ - ret = ksz_read16(dev, REG_CHIP_ID0, &id16); - if (ret) - return ret; - - id1 = id16 >> 8; - id2 = id16 & SW_CHIP_ID_M; - - switch (id1) { - case KSZ87_FAMILY_ID: - if ((id2 != CHIP_ID_94 && id2 != CHIP_ID_95)) - return -ENODEV; - - if (id2 == CHIP_ID_95) { - u8 val; - - id2 = 0x95; - ksz_read8(dev, REG_PORT_STATUS_0, &val); - if (val & PORT_FIBER_MODE) - id2 = 0x65; - } else if (id2 == CHIP_ID_94) { - id2 = 0x94; - } - break; - case KSZ88_FAMILY_ID: - if (id2 != CHIP_ID_63) - return -ENODEV; - break; - default: - dev_err(dev->dev, "invalid family id: %d\n", id1); - return -ENODEV; - } - id16 &= ~0xff; - id16 |= id2; - dev->chip_id = id16; - - return 0; -} - static int ksz8_switch_init(struct ksz_device *dev) { struct ksz8 *ksz8 = dev->priv; @@ -1522,7 +1477,6 @@ static const struct ksz_dev_ops ksz8_dev_ops = { .freeze_mib = ksz8_freeze_mib, .port_init_cnt = ksz8_port_init_cnt, .shutdown = ksz8_reset_switch, - .detect = ksz8_switch_detect, .init = ksz8_switch_init, .exit = ksz8_switch_exit, }; diff --git a/drivers/net/dsa/microchip/ksz8795_reg.h b/drivers/net/dsa/microchip/ksz8795_reg.h index 4109433b6b6c2..b8f6ad7581bcd 100644 --- a/drivers/net/dsa/microchip/ksz8795_reg.h +++ b/drivers/net/dsa/microchip/ksz8795_reg.h @@ -14,23 +14,10 @@ #define KS_PRIO_M 0x3 #define KS_PRIO_S 2
-#define REG_CHIP_ID0 0x00 - -#define KSZ87_FAMILY_ID 0x87 -#define KSZ88_FAMILY_ID 0x88 - -#define REG_CHIP_ID1 0x01 - -#define SW_CHIP_ID_M 0xF0 -#define SW_CHIP_ID_S 4 #define SW_REVISION_M 0x0E #define SW_REVISION_S 1 #define SW_START 0x01
-#define CHIP_ID_94 0x60 -#define CHIP_ID_95 0x90 -#define CHIP_ID_63 0x30 - #define KSZ8863_REG_SW_RESET 0x43
#define KSZ8863_GLOBAL_SOFTWARE_RESET BIT(4) @@ -217,8 +204,6 @@ #define REG_PORT_4_STATUS_0 0x48
/* For KSZ8765. */ -#define PORT_FIBER_MODE BIT(7) - #define PORT_REMOTE_ASYM_PAUSE BIT(5) #define PORT_REMOTE_SYM_PAUSE BIT(4) #define PORT_REMOTE_100BTX_FD BIT(3) @@ -322,7 +307,6 @@
#define REG_PORT_CTRL_5 0x05
-#define REG_PORT_STATUS_0 0x08 #define REG_PORT_STATUS_1 0x09 #define REG_PORT_LINK_MD_CTRL 0x0A #define REG_PORT_LINK_MD_RESULT 0x0B diff --git a/drivers/net/dsa/microchip/ksz9477.c b/drivers/net/dsa/microchip/ksz9477.c index 876a801ac23a4..bcfdd505ca79a 100644 --- a/drivers/net/dsa/microchip/ksz9477.c +++ b/drivers/net/dsa/microchip/ksz9477.c @@ -1363,23 +1363,6 @@ static u32 ksz9477_get_port_addr(int port, int offset) return PORT_CTRL_ADDR(port, offset); }
-static int ksz9477_switch_detect(struct ksz_device *dev) -{ - u32 id32; - int ret; - - /* read chip id */ - ret = ksz_read32(dev, REG_CHIP_ID0__1, &id32); - if (ret) - return ret; - - dev_dbg(dev->dev, "Switch detect: ID=%08x\n", id32); - - dev->chip_id = id32 & 0x00FFFF00; - - return 0; -} - static int ksz9477_switch_init(struct ksz_device *dev) { u8 data8; @@ -1410,8 +1393,6 @@ static int ksz9477_switch_init(struct ksz_device *dev) dev->features = GBIT_SUPPORT;
if (dev->chip_id == KSZ9893_CHIP_ID) { - /* Chip is from KSZ9893 design. */ - dev_info(dev->dev, "Found KSZ9893\n"); dev->features |= IS_9893;
/* Chip does not support gigabit. */ @@ -1419,7 +1400,6 @@ static int ksz9477_switch_init(struct ksz_device *dev) dev->features &= ~GBIT_SUPPORT; dev->phy_port_cnt = 2; } else { - dev_info(dev->dev, "Found KSZ9477 or compatible\n"); /* Chip uses new XMII register definitions. */ dev->features |= NEW_XMII;
@@ -1446,7 +1426,6 @@ static const struct ksz_dev_ops ksz9477_dev_ops = { .freeze_mib = ksz9477_freeze_mib, .port_init_cnt = ksz9477_port_init_cnt, .shutdown = ksz9477_reset_switch, - .detect = ksz9477_switch_detect, .init = ksz9477_switch_init, .exit = ksz9477_switch_exit, }; diff --git a/drivers/net/dsa/microchip/ksz9477_reg.h b/drivers/net/dsa/microchip/ksz9477_reg.h index 7a2c8d4767aff..077e35ab11b54 100644 --- a/drivers/net/dsa/microchip/ksz9477_reg.h +++ b/drivers/net/dsa/microchip/ksz9477_reg.h @@ -25,7 +25,6 @@
#define REG_CHIP_ID2__1 0x0002
-#define CHIP_ID_63 0x63 #define CHIP_ID_66 0x66 #define CHIP_ID_67 0x67 #define CHIP_ID_77 0x77 diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c index 92a500e1ccd21..4511e99823f57 100644 --- a/drivers/net/dsa/microchip/ksz_common.c +++ b/drivers/net/dsa/microchip/ksz_common.c @@ -930,6 +930,72 @@ void ksz_port_stp_state_set(struct dsa_switch *ds, int port, } EXPORT_SYMBOL_GPL(ksz_port_stp_state_set);
+static int ksz_switch_detect(struct ksz_device *dev) +{ + u8 id1, id2; + u16 id16; + u32 id32; + int ret; + + /* read chip id */ + ret = ksz_read16(dev, REG_CHIP_ID0, &id16); + if (ret) + return ret; + + id1 = FIELD_GET(SW_FAMILY_ID_M, id16); + id2 = FIELD_GET(SW_CHIP_ID_M, id16); + + switch (id1) { + case KSZ87_FAMILY_ID: + if (id2 == KSZ87_CHIP_ID_95) { + u8 val; + + dev->chip_id = KSZ8795_CHIP_ID; + + ksz_read8(dev, KSZ8_PORT_STATUS_0, &val); + if (val & KSZ8_PORT_FIBER_MODE) + dev->chip_id = KSZ8765_CHIP_ID; + } else if (id2 == KSZ87_CHIP_ID_94) { + dev->chip_id = KSZ8794_CHIP_ID; + } else { + return -ENODEV; + } + break; + case KSZ88_FAMILY_ID: + if (id2 == KSZ88_CHIP_ID_63) + dev->chip_id = KSZ8830_CHIP_ID; + else + return -ENODEV; + break; + default: + ret = ksz_read32(dev, REG_CHIP_ID0, &id32); + if (ret) + return ret; + + dev->chip_rev = FIELD_GET(SW_REV_ID_M, id32); + id32 &= ~0xFF; + + switch (id32) { + case KSZ9477_CHIP_ID: + case KSZ9897_CHIP_ID: + case KSZ9893_CHIP_ID: + case KSZ9567_CHIP_ID: + case LAN9370_CHIP_ID: + case LAN9371_CHIP_ID: + case LAN9372_CHIP_ID: + case LAN9373_CHIP_ID: + case LAN9374_CHIP_ID: + dev->chip_id = id32; + break; + default: + dev_err(dev->dev, + "unsupported switch detected %x)\n", id32); + return -ENODEV; + } + } + return 0; +} + struct ksz_device *ksz_switch_alloc(struct device *base, void *priv) { struct dsa_switch *ds; @@ -986,10 +1052,9 @@ int ksz_switch_register(struct ksz_device *dev, mutex_init(&dev->alu_mutex); mutex_init(&dev->vlan_mutex);
- dev->dev_ops = ops; - - if (dev->dev_ops->detect(dev)) - return -EINVAL; + ret = ksz_switch_detect(dev); + if (ret) + return ret;
info = ksz_lookup_info(dev->chip_id); if (!info) @@ -998,10 +1063,15 @@ int ksz_switch_register(struct ksz_device *dev, /* Update the compatible info with the probed one */ dev->info = info;
+ dev_info(dev->dev, "found switch: %s, rev %i\n", + dev->info->dev_name, dev->chip_rev); + ret = ksz_check_device_id(dev); if (ret) return ret;
+ dev->dev_ops = ops; + ret = dev->dev_ops->init(dev); if (ret) return ret; diff --git a/drivers/net/dsa/microchip/ksz_common.h b/drivers/net/dsa/microchip/ksz_common.h index 8500eaedad67a..e6bc5fb2b1303 100644 --- a/drivers/net/dsa/microchip/ksz_common.h +++ b/drivers/net/dsa/microchip/ksz_common.h @@ -90,6 +90,7 @@ struct ksz_device {
/* chip specific data */ u32 chip_id; + u8 chip_rev; int cpu_port; /* port connected to CPU */ int phy_port_cnt; phy_interface_t compat_interface; @@ -182,7 +183,6 @@ struct ksz_dev_ops { void (*freeze_mib)(struct ksz_device *dev, int port, bool freeze); void (*port_init_cnt)(struct ksz_device *dev, int port); int (*shutdown)(struct ksz_device *dev); - int (*detect)(struct ksz_device *dev); int (*init)(struct ksz_device *dev); void (*exit)(struct ksz_device *dev); }; @@ -353,6 +353,23 @@ static inline void ksz_regmap_unlock(void *__mtx) #define PORT_RX_ENABLE BIT(1) #define PORT_LEARN_DISABLE BIT(0)
+/* Switch ID Defines */ +#define REG_CHIP_ID0 0x00 + +#define SW_FAMILY_ID_M GENMASK(15, 8) +#define KSZ87_FAMILY_ID 0x87 +#define KSZ88_FAMILY_ID 0x88 + +#define KSZ8_PORT_STATUS_0 0x08 +#define KSZ8_PORT_FIBER_MODE BIT(7) + +#define SW_CHIP_ID_M GENMASK(7, 4) +#define KSZ87_CHIP_ID_94 0x6 +#define KSZ87_CHIP_ID_95 0x9 +#define KSZ88_CHIP_ID_63 0x3 + +#define SW_REV_ID_M GENMASK(7, 4) + /* Regmap tables generation */ #define KSZ_SPI_OP_RD 3 #define KSZ_SPI_OP_WR 2
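For reference, a small sketch of the bitfield helpers the new detect code relies on; the mask values are copied from the hunk above and the helper name is illustrative:

#include <linux/bitfield.h>

#define SW_FAMILY_ID_M	GENMASK(15, 8)	/* e.g. 0x87, 0x88 */
#define SW_CHIP_ID_M	GENMASK(7, 4)	/* 4-bit chip id */

/* FIELD_GET() masks out the field described by a GENMASK() and shifts
 * it down to bit 0, replacing the old manual shift/mask arithmetic on
 * the 16-bit chip id register.
 */
static void demo_parse_chip_id(u16 id16, u8 *family, u8 *chip)
{
	*family = FIELD_GET(SW_FAMILY_ID_M, id16);
	*chip = FIELD_GET(SW_CHIP_ID_M, id16);
}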
From: Arun Ramadoss arun.ramadoss@microchip.com
[ Upstream commit 534a0431e9e68959e2c0d71c141d5b911d66ad7c ]
This patch moves the dsa hook get_tag_protocol to the ksz_common file. The tag_protocol is returned based on dev->chip_id.
Signed-off-by: Arun Ramadoss arun.ramadoss@microchip.com Signed-off-by: Paolo Abeni pabeni@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/microchip/ksz8795.c | 13 +------------ drivers/net/dsa/microchip/ksz9477.c | 14 +------------- drivers/net/dsa/microchip/ksz_common.c | 24 ++++++++++++++++++++++++ drivers/net/dsa/microchip/ksz_common.h | 2 ++ 4 files changed, 28 insertions(+), 25 deletions(-)
diff --git a/drivers/net/dsa/microchip/ksz8795.c b/drivers/net/dsa/microchip/ksz8795.c index 3cc51ee5fb6cc..041956e3c7b1a 100644 --- a/drivers/net/dsa/microchip/ksz8795.c +++ b/drivers/net/dsa/microchip/ksz8795.c @@ -898,17 +898,6 @@ static void ksz8_w_phy(struct ksz_device *dev, u16 phy, u16 reg, u16 val) } }
-static enum dsa_tag_protocol ksz8_get_tag_protocol(struct dsa_switch *ds, - int port, - enum dsa_tag_protocol mp) -{ - struct ksz_device *dev = ds->priv; - - /* ksz88x3 uses the same tag schema as KSZ9893 */ - return ksz_is_ksz88x3(dev) ? - DSA_TAG_PROTO_KSZ9893 : DSA_TAG_PROTO_KSZ8795; -} - static u32 ksz8_sw_get_phy_flags(struct dsa_switch *ds, int port) { /* Silicon Errata Sheet (DS80000830A): @@ -1394,7 +1383,7 @@ static void ksz8_get_caps(struct dsa_switch *ds, int port, }
static const struct dsa_switch_ops ksz8_switch_ops = { - .get_tag_protocol = ksz8_get_tag_protocol, + .get_tag_protocol = ksz_get_tag_protocol, .get_phy_flags = ksz8_sw_get_phy_flags, .setup = ksz8_setup, .phy_read = ksz_phy_read16, diff --git a/drivers/net/dsa/microchip/ksz9477.c b/drivers/net/dsa/microchip/ksz9477.c index bcfdd505ca79a..31be767027feb 100644 --- a/drivers/net/dsa/microchip/ksz9477.c +++ b/drivers/net/dsa/microchip/ksz9477.c @@ -276,18 +276,6 @@ static void ksz9477_port_init_cnt(struct ksz_device *dev, int port) mutex_unlock(&mib->cnt_mutex); }
-static enum dsa_tag_protocol ksz9477_get_tag_protocol(struct dsa_switch *ds, - int port, - enum dsa_tag_protocol mp) -{ - enum dsa_tag_protocol proto = DSA_TAG_PROTO_KSZ9477; - struct ksz_device *dev = ds->priv; - - if (dev->features & IS_9893) - proto = DSA_TAG_PROTO_KSZ9893; - return proto; -} - static int ksz9477_phy_read16(struct dsa_switch *ds, int addr, int reg) { struct ksz_device *dev = ds->priv; @@ -1329,7 +1317,7 @@ static int ksz9477_setup(struct dsa_switch *ds) }
static const struct dsa_switch_ops ksz9477_switch_ops = { - .get_tag_protocol = ksz9477_get_tag_protocol, + .get_tag_protocol = ksz_get_tag_protocol, .setup = ksz9477_setup, .phy_read = ksz9477_phy_read16, .phy_write = ksz9477_phy_write16, diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c index 4511e99823f57..0713a40685fa9 100644 --- a/drivers/net/dsa/microchip/ksz_common.c +++ b/drivers/net/dsa/microchip/ksz_common.c @@ -930,6 +930,30 @@ void ksz_port_stp_state_set(struct dsa_switch *ds, int port, } EXPORT_SYMBOL_GPL(ksz_port_stp_state_set);
+enum dsa_tag_protocol ksz_get_tag_protocol(struct dsa_switch *ds, + int port, enum dsa_tag_protocol mp) +{ + struct ksz_device *dev = ds->priv; + enum dsa_tag_protocol proto = DSA_TAG_PROTO_NONE; + + if (dev->chip_id == KSZ8795_CHIP_ID || + dev->chip_id == KSZ8794_CHIP_ID || + dev->chip_id == KSZ8765_CHIP_ID) + proto = DSA_TAG_PROTO_KSZ8795; + + if (dev->chip_id == KSZ8830_CHIP_ID || + dev->chip_id == KSZ9893_CHIP_ID) + proto = DSA_TAG_PROTO_KSZ9893; + + if (dev->chip_id == KSZ9477_CHIP_ID || + dev->chip_id == KSZ9897_CHIP_ID || + dev->chip_id == KSZ9567_CHIP_ID) + proto = DSA_TAG_PROTO_KSZ9477; + + return proto; +} +EXPORT_SYMBOL_GPL(ksz_get_tag_protocol); + static int ksz_switch_detect(struct ksz_device *dev) { u8 id1, id2; diff --git a/drivers/net/dsa/microchip/ksz_common.h b/drivers/net/dsa/microchip/ksz_common.h index e6bc5fb2b1303..21db6f79035fa 100644 --- a/drivers/net/dsa/microchip/ksz_common.h +++ b/drivers/net/dsa/microchip/ksz_common.h @@ -231,6 +231,8 @@ int ksz_port_mdb_del(struct dsa_switch *ds, int port, int ksz_enable_port(struct dsa_switch *ds, int port, struct phy_device *phy); void ksz_get_strings(struct dsa_switch *ds, int port, u32 stringset, uint8_t *buf); +enum dsa_tag_protocol ksz_get_tag_protocol(struct dsa_switch *ds, + int port, enum dsa_tag_protocol mp);
/* Common register access functions */
From: Arun Ramadoss arun.ramadoss@microchip.com
[ Upstream commit f0d997e31bb307c7aa046c4992c568547fd25195 ]
This patch moves the vlan dsa_switch_ops, namely vlan_add, vlan_del and vlan_filtering, from the individual files ksz8795.c and ksz9477.c to the ksz_common.c file.
Signed-off-by: Arun Ramadoss arun.ramadoss@microchip.com Reviewed-by: Vladimir Oltean olteanv@gmail.com Signed-off-by: Paolo Abeni pabeni@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/microchip/ksz8795.c | 19 +++++++------ drivers/net/dsa/microchip/ksz9477.c | 19 +++++++------ drivers/net/dsa/microchip/ksz_common.c | 37 ++++++++++++++++++++++++++ drivers/net/dsa/microchip/ksz_common.h | 14 ++++++++++ 4 files changed, 69 insertions(+), 20 deletions(-)
diff --git a/drivers/net/dsa/microchip/ksz8795.c b/drivers/net/dsa/microchip/ksz8795.c index 041956e3c7b1a..16e946dbd9d42 100644 --- a/drivers/net/dsa/microchip/ksz8795.c +++ b/drivers/net/dsa/microchip/ksz8795.c @@ -958,11 +958,9 @@ static void ksz8_flush_dyn_mac_table(struct ksz_device *dev, int port) } }
-static int ksz8_port_vlan_filtering(struct dsa_switch *ds, int port, bool flag, +static int ksz8_port_vlan_filtering(struct ksz_device *dev, int port, bool flag, struct netlink_ext_ack *extack) { - struct ksz_device *dev = ds->priv; - if (ksz_is_ksz88x3(dev)) return -ENOTSUPP;
@@ -987,12 +985,11 @@ static void ksz8_port_enable_pvid(struct ksz_device *dev, int port, bool state) } }
-static int ksz8_port_vlan_add(struct dsa_switch *ds, int port, +static int ksz8_port_vlan_add(struct ksz_device *dev, int port, const struct switchdev_obj_port_vlan *vlan, struct netlink_ext_ack *extack) { bool untagged = vlan->flags & BRIDGE_VLAN_INFO_UNTAGGED; - struct ksz_device *dev = ds->priv; struct ksz_port *p = &dev->ports[port]; u16 data, new_pvid = 0; u8 fid, member, valid; @@ -1060,10 +1057,9 @@ static int ksz8_port_vlan_add(struct dsa_switch *ds, int port, return 0; }
-static int ksz8_port_vlan_del(struct dsa_switch *ds, int port, +static int ksz8_port_vlan_del(struct ksz_device *dev, int port, const struct switchdev_obj_port_vlan *vlan) { - struct ksz_device *dev = ds->priv; u16 data, pvid; u8 fid, member, valid;
@@ -1398,9 +1394,9 @@ static const struct dsa_switch_ops ksz8_switch_ops = { .port_bridge_leave = ksz_port_bridge_leave, .port_stp_state_set = ksz8_port_stp_state_set, .port_fast_age = ksz_port_fast_age, - .port_vlan_filtering = ksz8_port_vlan_filtering, - .port_vlan_add = ksz8_port_vlan_add, - .port_vlan_del = ksz8_port_vlan_del, + .port_vlan_filtering = ksz_port_vlan_filtering, + .port_vlan_add = ksz_port_vlan_add, + .port_vlan_del = ksz_port_vlan_del, .port_fdb_dump = ksz_port_fdb_dump, .port_mdb_add = ksz_port_mdb_add, .port_mdb_del = ksz_port_mdb_del, @@ -1465,6 +1461,9 @@ static const struct ksz_dev_ops ksz8_dev_ops = { .r_mib_pkt = ksz8_r_mib_pkt, .freeze_mib = ksz8_freeze_mib, .port_init_cnt = ksz8_port_init_cnt, + .vlan_filtering = ksz8_port_vlan_filtering, + .vlan_add = ksz8_port_vlan_add, + .vlan_del = ksz8_port_vlan_del, .shutdown = ksz8_reset_switch, .init = ksz8_switch_init, .exit = ksz8_switch_exit, diff --git a/drivers/net/dsa/microchip/ksz9477.c b/drivers/net/dsa/microchip/ksz9477.c index 31be767027feb..1bb994a9109cd 100644 --- a/drivers/net/dsa/microchip/ksz9477.c +++ b/drivers/net/dsa/microchip/ksz9477.c @@ -377,12 +377,10 @@ static void ksz9477_flush_dyn_mac_table(struct ksz_device *dev, int port) } }
-static int ksz9477_port_vlan_filtering(struct dsa_switch *ds, int port, +static int ksz9477_port_vlan_filtering(struct ksz_device *dev, int port, bool flag, struct netlink_ext_ack *extack) { - struct ksz_device *dev = ds->priv; - if (flag) { ksz_port_cfg(dev, port, REG_PORT_LUE_CTRL, PORT_VLAN_LOOKUP_VID_0, true); @@ -396,11 +394,10 @@ static int ksz9477_port_vlan_filtering(struct dsa_switch *ds, int port, return 0; }
-static int ksz9477_port_vlan_add(struct dsa_switch *ds, int port, +static int ksz9477_port_vlan_add(struct ksz_device *dev, int port, const struct switchdev_obj_port_vlan *vlan, struct netlink_ext_ack *extack) { - struct ksz_device *dev = ds->priv; u32 vlan_table[3]; bool untagged = vlan->flags & BRIDGE_VLAN_INFO_UNTAGGED; int err; @@ -433,10 +430,9 @@ static int ksz9477_port_vlan_add(struct dsa_switch *ds, int port, return 0; }
-static int ksz9477_port_vlan_del(struct dsa_switch *ds, int port, +static int ksz9477_port_vlan_del(struct ksz_device *dev, int port, const struct switchdev_obj_port_vlan *vlan) { - struct ksz_device *dev = ds->priv; bool untagged = vlan->flags & BRIDGE_VLAN_INFO_UNTAGGED; u32 vlan_table[3]; u16 pvid; @@ -1331,9 +1327,9 @@ static const struct dsa_switch_ops ksz9477_switch_ops = { .port_bridge_leave = ksz_port_bridge_leave, .port_stp_state_set = ksz9477_port_stp_state_set, .port_fast_age = ksz_port_fast_age, - .port_vlan_filtering = ksz9477_port_vlan_filtering, - .port_vlan_add = ksz9477_port_vlan_add, - .port_vlan_del = ksz9477_port_vlan_del, + .port_vlan_filtering = ksz_port_vlan_filtering, + .port_vlan_add = ksz_port_vlan_add, + .port_vlan_del = ksz_port_vlan_del, .port_fdb_dump = ksz9477_port_fdb_dump, .port_fdb_add = ksz9477_port_fdb_add, .port_fdb_del = ksz9477_port_fdb_del, @@ -1413,6 +1409,9 @@ static const struct ksz_dev_ops ksz9477_dev_ops = { .r_mib_stat64 = ksz_r_mib_stats64, .freeze_mib = ksz9477_freeze_mib, .port_init_cnt = ksz9477_port_init_cnt, + .vlan_filtering = ksz9477_port_vlan_filtering, + .vlan_add = ksz9477_port_vlan_add, + .vlan_del = ksz9477_port_vlan_del, .shutdown = ksz9477_reset_switch, .init = ksz9477_switch_init, .exit = ksz9477_switch_exit, diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c index 0713a40685fa9..5db2b55152885 100644 --- a/drivers/net/dsa/microchip/ksz_common.c +++ b/drivers/net/dsa/microchip/ksz_common.c @@ -954,6 +954,43 @@ enum dsa_tag_protocol ksz_get_tag_protocol(struct dsa_switch *ds, } EXPORT_SYMBOL_GPL(ksz_get_tag_protocol);
+int ksz_port_vlan_filtering(struct dsa_switch *ds, int port, + bool flag, struct netlink_ext_ack *extack) +{ + struct ksz_device *dev = ds->priv; + + if (!dev->dev_ops->vlan_filtering) + return -EOPNOTSUPP; + + return dev->dev_ops->vlan_filtering(dev, port, flag, extack); +} +EXPORT_SYMBOL_GPL(ksz_port_vlan_filtering); + +int ksz_port_vlan_add(struct dsa_switch *ds, int port, + const struct switchdev_obj_port_vlan *vlan, + struct netlink_ext_ack *extack) +{ + struct ksz_device *dev = ds->priv; + + if (!dev->dev_ops->vlan_add) + return -EOPNOTSUPP; + + return dev->dev_ops->vlan_add(dev, port, vlan, extack); +} +EXPORT_SYMBOL_GPL(ksz_port_vlan_add); + +int ksz_port_vlan_del(struct dsa_switch *ds, int port, + const struct switchdev_obj_port_vlan *vlan) +{ + struct ksz_device *dev = ds->priv; + + if (!dev->dev_ops->vlan_del) + return -EOPNOTSUPP; + + return dev->dev_ops->vlan_del(dev, port, vlan); +} +EXPORT_SYMBOL_GPL(ksz_port_vlan_del); + static int ksz_switch_detect(struct ksz_device *dev) { u8 id1, id2; diff --git a/drivers/net/dsa/microchip/ksz_common.h b/drivers/net/dsa/microchip/ksz_common.h index 21db6f79035fa..1baa270859aa2 100644 --- a/drivers/net/dsa/microchip/ksz_common.h +++ b/drivers/net/dsa/microchip/ksz_common.h @@ -180,6 +180,13 @@ struct ksz_dev_ops { void (*r_mib_pkt)(struct ksz_device *dev, int port, u16 addr, u64 *dropped, u64 *cnt); void (*r_mib_stat64)(struct ksz_device *dev, int port); + int (*vlan_filtering)(struct ksz_device *dev, int port, + bool flag, struct netlink_ext_ack *extack); + int (*vlan_add)(struct ksz_device *dev, int port, + const struct switchdev_obj_port_vlan *vlan, + struct netlink_ext_ack *extack); + int (*vlan_del)(struct ksz_device *dev, int port, + const struct switchdev_obj_port_vlan *vlan); void (*freeze_mib)(struct ksz_device *dev, int port, bool freeze); void (*port_init_cnt)(struct ksz_device *dev, int port); int (*shutdown)(struct ksz_device *dev); @@ -233,6 +240,13 @@ void ksz_get_strings(struct dsa_switch *ds, int port, u32 stringset, uint8_t *buf); enum dsa_tag_protocol ksz_get_tag_protocol(struct dsa_switch *ds, int port, enum dsa_tag_protocol mp); +int ksz_port_vlan_filtering(struct dsa_switch *ds, int port, + bool flag, struct netlink_ext_ack *extack); +int ksz_port_vlan_add(struct dsa_switch *ds, int port, + const struct switchdev_obj_port_vlan *vlan, + struct netlink_ext_ack *extack); +int ksz_port_vlan_del(struct dsa_switch *ds, int port, + const struct switchdev_obj_port_vlan *vlan);
/* Common register access functions */
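These wrappers all share one dispatch shape: a dsa_switch_ops entry point in ksz_common.c that bounces into the per-chip ksz_dev_ops callback and reports -EOPNOTSUPP when a chip leaves the hook unset. A hedged sketch of that shape (example_op is hypothetical; the real wrappers are the ksz_port_vlan_* functions above):

#include <linux/errno.h>

/* Illustrative wrapper; example_op is a hypothetical ksz_dev_ops hook. */
static int ksz_port_example_op(struct dsa_switch *ds, int port)
{
	struct ksz_device *dev = ds->priv;

	if (!dev->dev_ops->example_op)	/* hook not provided by this chip */
		return -EOPNOTSUPP;

	return dev->dev_ops->example_op(dev, port);
}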
From: Arun Ramadoss arun.ramadoss@microchip.com
[ Upstream commit 00a298bbc23876288b1cd04c38752d8e7ed53ae2 ]
This patch moves the common port mirror add/del dsa_switch_ops into ksz_common.c; the individual switch implementations are invoked through the ksz_dev_ops function pointers.
Signed-off-by: Arun Ramadoss arun.ramadoss@microchip.com Reviewed-by: Vladimir Oltean olteanv@gmail.com Reviewed-by: Florian Fainelli f.fainelli@gmail.com Signed-off-by: Paolo Abeni pabeni@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/microchip/ksz8795.c | 13 ++++++------- drivers/net/dsa/microchip/ksz9477.c | 12 ++++++------ drivers/net/dsa/microchip/ksz_common.c | 23 +++++++++++++++++++++++ drivers/net/dsa/microchip/ksz_common.h | 10 ++++++++++ 4 files changed, 45 insertions(+), 13 deletions(-)
diff --git a/drivers/net/dsa/microchip/ksz8795.c b/drivers/net/dsa/microchip/ksz8795.c index 16e946dbd9d42..2e3d24a3260e1 100644 --- a/drivers/net/dsa/microchip/ksz8795.c +++ b/drivers/net/dsa/microchip/ksz8795.c @@ -1089,12 +1089,10 @@ static int ksz8_port_vlan_del(struct ksz_device *dev, int port, return 0; }
-static int ksz8_port_mirror_add(struct dsa_switch *ds, int port, +static int ksz8_port_mirror_add(struct ksz_device *dev, int port, struct dsa_mall_mirror_tc_entry *mirror, bool ingress, struct netlink_ext_ack *extack) { - struct ksz_device *dev = ds->priv; - if (ingress) { ksz_port_cfg(dev, port, P_MIRROR_CTRL, PORT_MIRROR_RX, true); dev->mirror_rx |= BIT(port); @@ -1113,10 +1111,9 @@ static int ksz8_port_mirror_add(struct dsa_switch *ds, int port, return 0; }
-static void ksz8_port_mirror_del(struct dsa_switch *ds, int port, +static void ksz8_port_mirror_del(struct ksz_device *dev, int port, struct dsa_mall_mirror_tc_entry *mirror) { - struct ksz_device *dev = ds->priv; u8 data;
if (mirror->ingress) { @@ -1400,8 +1397,8 @@ static const struct dsa_switch_ops ksz8_switch_ops = { .port_fdb_dump = ksz_port_fdb_dump, .port_mdb_add = ksz_port_mdb_add, .port_mdb_del = ksz_port_mdb_del, - .port_mirror_add = ksz8_port_mirror_add, - .port_mirror_del = ksz8_port_mirror_del, + .port_mirror_add = ksz_port_mirror_add, + .port_mirror_del = ksz_port_mirror_del, };
static u32 ksz8_get_port_addr(int port, int offset) @@ -1464,6 +1461,8 @@ static const struct ksz_dev_ops ksz8_dev_ops = { .vlan_filtering = ksz8_port_vlan_filtering, .vlan_add = ksz8_port_vlan_add, .vlan_del = ksz8_port_vlan_del, + .mirror_add = ksz8_port_mirror_add, + .mirror_del = ksz8_port_mirror_del, .shutdown = ksz8_reset_switch, .init = ksz8_switch_init, .exit = ksz8_switch_exit, diff --git a/drivers/net/dsa/microchip/ksz9477.c b/drivers/net/dsa/microchip/ksz9477.c index 1bb994a9109cd..cd4a3088e9473 100644 --- a/drivers/net/dsa/microchip/ksz9477.c +++ b/drivers/net/dsa/microchip/ksz9477.c @@ -819,11 +819,10 @@ static int ksz9477_port_mdb_del(struct dsa_switch *ds, int port, return ret; }
-static int ksz9477_port_mirror_add(struct dsa_switch *ds, int port, +static int ksz9477_port_mirror_add(struct ksz_device *dev, int port, struct dsa_mall_mirror_tc_entry *mirror, bool ingress, struct netlink_ext_ack *extack) { - struct ksz_device *dev = ds->priv; u8 data; int p;
@@ -859,10 +858,9 @@ static int ksz9477_port_mirror_add(struct dsa_switch *ds, int port, return 0; }
-static void ksz9477_port_mirror_del(struct dsa_switch *ds, int port, +static void ksz9477_port_mirror_del(struct ksz_device *dev, int port, struct dsa_mall_mirror_tc_entry *mirror) { - struct ksz_device *dev = ds->priv; bool in_use = false; u8 data; int p; @@ -1335,8 +1333,8 @@ static const struct dsa_switch_ops ksz9477_switch_ops = { .port_fdb_del = ksz9477_port_fdb_del, .port_mdb_add = ksz9477_port_mdb_add, .port_mdb_del = ksz9477_port_mdb_del, - .port_mirror_add = ksz9477_port_mirror_add, - .port_mirror_del = ksz9477_port_mirror_del, + .port_mirror_add = ksz_port_mirror_add, + .port_mirror_del = ksz_port_mirror_del, .get_stats64 = ksz_get_stats64, .port_change_mtu = ksz9477_change_mtu, .port_max_mtu = ksz9477_max_mtu, @@ -1412,6 +1410,8 @@ static const struct ksz_dev_ops ksz9477_dev_ops = { .vlan_filtering = ksz9477_port_vlan_filtering, .vlan_add = ksz9477_port_vlan_add, .vlan_del = ksz9477_port_vlan_del, + .mirror_add = ksz9477_port_mirror_add, + .mirror_del = ksz9477_port_mirror_del, .shutdown = ksz9477_reset_switch, .init = ksz9477_switch_init, .exit = ksz9477_switch_exit, diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c index 5db2b55152885..676669d353ea6 100644 --- a/drivers/net/dsa/microchip/ksz_common.c +++ b/drivers/net/dsa/microchip/ksz_common.c @@ -991,6 +991,29 @@ int ksz_port_vlan_del(struct dsa_switch *ds, int port, } EXPORT_SYMBOL_GPL(ksz_port_vlan_del);
+int ksz_port_mirror_add(struct dsa_switch *ds, int port, + struct dsa_mall_mirror_tc_entry *mirror, + bool ingress, struct netlink_ext_ack *extack) +{ + struct ksz_device *dev = ds->priv; + + if (!dev->dev_ops->mirror_add) + return -EOPNOTSUPP; + + return dev->dev_ops->mirror_add(dev, port, mirror, ingress, extack); +} +EXPORT_SYMBOL_GPL(ksz_port_mirror_add); + +void ksz_port_mirror_del(struct dsa_switch *ds, int port, + struct dsa_mall_mirror_tc_entry *mirror) +{ + struct ksz_device *dev = ds->priv; + + if (dev->dev_ops->mirror_del) + dev->dev_ops->mirror_del(dev, port, mirror); +} +EXPORT_SYMBOL_GPL(ksz_port_mirror_del); + static int ksz_switch_detect(struct ksz_device *dev) { u8 id1, id2; diff --git a/drivers/net/dsa/microchip/ksz_common.h b/drivers/net/dsa/microchip/ksz_common.h index 1baa270859aa2..c724cbb437e29 100644 --- a/drivers/net/dsa/microchip/ksz_common.h +++ b/drivers/net/dsa/microchip/ksz_common.h @@ -187,6 +187,11 @@ struct ksz_dev_ops { struct netlink_ext_ack *extack); int (*vlan_del)(struct ksz_device *dev, int port, const struct switchdev_obj_port_vlan *vlan); + int (*mirror_add)(struct ksz_device *dev, int port, + struct dsa_mall_mirror_tc_entry *mirror, + bool ingress, struct netlink_ext_ack *extack); + void (*mirror_del)(struct ksz_device *dev, int port, + struct dsa_mall_mirror_tc_entry *mirror); void (*freeze_mib)(struct ksz_device *dev, int port, bool freeze); void (*port_init_cnt)(struct ksz_device *dev, int port); int (*shutdown)(struct ksz_device *dev); @@ -247,6 +252,11 @@ int ksz_port_vlan_add(struct dsa_switch *ds, int port, struct netlink_ext_ack *extack); int ksz_port_vlan_del(struct dsa_switch *ds, int port, const struct switchdev_obj_port_vlan *vlan); +int ksz_port_mirror_add(struct dsa_switch *ds, int port, + struct dsa_mall_mirror_tc_entry *mirror, + bool ingress, struct netlink_ext_ack *extack); +void ksz_port_mirror_del(struct dsa_switch *ds, int port, + struct dsa_mall_mirror_tc_entry *mirror);
/* Common register access functions */
From: Arun Ramadoss arun.ramadoss@microchip.com
[ Upstream commit 7012033ce10e0968e6cb82709aa0ed7f2080b61e ]
This patch points phylink_get_caps in ksz8795 and ksz9477 at the common ksz_phylink_get_caps, and updates their mac_capabilities through the respective ksz_dev_ops.
Signed-off-by: Arun Ramadoss arun.ramadoss@microchip.com Reviewed-by: Vladimir Oltean olteanv@gmail.com Signed-off-by: Paolo Abeni pabeni@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/microchip/ksz8795.c | 9 +++------ drivers/net/dsa/microchip/ksz9477.c | 7 +++---- drivers/net/dsa/microchip/ksz_common.c | 3 +++ drivers/net/dsa/microchip/ksz_common.h | 2 ++ 4 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/drivers/net/dsa/microchip/ksz8795.c b/drivers/net/dsa/microchip/ksz8795.c index 2e3d24a3260e1..c771797fd902f 100644 --- a/drivers/net/dsa/microchip/ksz8795.c +++ b/drivers/net/dsa/microchip/ksz8795.c @@ -1353,13 +1353,9 @@ static int ksz8_setup(struct dsa_switch *ds) return ksz8_handle_global_errata(ds); }
-static void ksz8_get_caps(struct dsa_switch *ds, int port, +static void ksz8_get_caps(struct ksz_device *dev, int port, struct phylink_config *config) { - struct ksz_device *dev = ds->priv; - - ksz_phylink_get_caps(ds, port, config); - config->mac_capabilities = MAC_10 | MAC_100;
/* Silicon Errata Sheet (DS80000830A): @@ -1381,7 +1377,7 @@ static const struct dsa_switch_ops ksz8_switch_ops = { .setup = ksz8_setup, .phy_read = ksz_phy_read16, .phy_write = ksz_phy_write16, - .phylink_get_caps = ksz8_get_caps, + .phylink_get_caps = ksz_phylink_get_caps, .phylink_mac_link_down = ksz_mac_link_down, .port_enable = ksz_enable_port, .get_strings = ksz_get_strings, @@ -1463,6 +1459,7 @@ static const struct ksz_dev_ops ksz8_dev_ops = { .vlan_del = ksz8_port_vlan_del, .mirror_add = ksz8_port_mirror_add, .mirror_del = ksz8_port_mirror_del, + .get_caps = ksz8_get_caps, .shutdown = ksz8_reset_switch, .init = ksz8_switch_init, .exit = ksz8_switch_exit, diff --git a/drivers/net/dsa/microchip/ksz9477.c b/drivers/net/dsa/microchip/ksz9477.c index cd4a3088e9473..125124fdefbf4 100644 --- a/drivers/net/dsa/microchip/ksz9477.c +++ b/drivers/net/dsa/microchip/ksz9477.c @@ -1082,11 +1082,9 @@ static void ksz9477_phy_errata_setup(struct ksz_device *dev, int port) ksz9477_port_mmd_write(dev, port, 0x1c, 0x20, 0xeeee); }
-static void ksz9477_get_caps(struct dsa_switch *ds, int port, +static void ksz9477_get_caps(struct ksz_device *dev, int port, struct phylink_config *config) { - ksz_phylink_get_caps(ds, port, config); - config->mac_capabilities = MAC_10 | MAC_100 | MAC_1000FD | MAC_ASYM_PAUSE | MAC_SYM_PAUSE; } @@ -1316,7 +1314,7 @@ static const struct dsa_switch_ops ksz9477_switch_ops = { .phy_read = ksz9477_phy_read16, .phy_write = ksz9477_phy_write16, .phylink_mac_link_down = ksz_mac_link_down, - .phylink_get_caps = ksz9477_get_caps, + .phylink_get_caps = ksz_phylink_get_caps, .port_enable = ksz_enable_port, .get_strings = ksz_get_strings, .get_ethtool_stats = ksz_get_ethtool_stats, @@ -1412,6 +1410,7 @@ static const struct ksz_dev_ops ksz9477_dev_ops = { .vlan_del = ksz9477_port_vlan_del, .mirror_add = ksz9477_port_mirror_add, .mirror_del = ksz9477_port_mirror_del, + .get_caps = ksz9477_get_caps, .shutdown = ksz9477_reset_switch, .init = ksz9477_switch_init, .exit = ksz9477_switch_exit, diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c index 676669d353ea6..0f02c62b02685 100644 --- a/drivers/net/dsa/microchip/ksz_common.c +++ b/drivers/net/dsa/microchip/ksz_common.c @@ -456,6 +456,9 @@ void ksz_phylink_get_caps(struct dsa_switch *ds, int port, if (dev->info->internal_phy[port]) __set_bit(PHY_INTERFACE_MODE_INTERNAL, config->supported_interfaces); + + if (dev->dev_ops->get_caps) + dev->dev_ops->get_caps(dev, port, config); } EXPORT_SYMBOL_GPL(ksz_phylink_get_caps);
diff --git a/drivers/net/dsa/microchip/ksz_common.h b/drivers/net/dsa/microchip/ksz_common.h index c724cbb437e29..10f9ef2dbf1ca 100644 --- a/drivers/net/dsa/microchip/ksz_common.h +++ b/drivers/net/dsa/microchip/ksz_common.h @@ -192,6 +192,8 @@ struct ksz_dev_ops { bool ingress, struct netlink_ext_ack *extack); void (*mirror_del)(struct ksz_device *dev, int port, struct dsa_mall_mirror_tc_entry *mirror); + void (*get_caps)(struct ksz_device *dev, int port, + struct phylink_config *config); void (*freeze_mib)(struct ksz_device *dev, int port, bool freeze); void (*port_init_cnt)(struct ksz_device *dev, int port); int (*shutdown)(struct ksz_device *dev);
From: Vladimir Oltean vladimir.oltean@nxp.com
[ Upstream commit 5fbb08eb7f945c7e8896ea39f03143ce66dfa4c7 ]
DSA has multiple ways of specifying a MAC connection to an internal PHY. One requires a DT description like this:
port@0 {
	reg = <0>;
	phy-handle = <&internal_phy>;
	phy-mode = "internal";
};
(which is IMO the recommended approach, as it is the clearest description)
but it is also possible to leave the specification as just:
port@0 {
	reg = <0>;
};
and if the driver implements ds->ops->phy_read and ds->ops->phy_write, the DSA framework "knows" it should create a ds->slave_mii_bus, and it should connect to a non-OF-based internal PHY on this MDIO bus, at an MDIO address equal to the port address.
There is also an intermediary way of describing things:
port@0 {
	reg = <0>;
	phy-handle = <&internal_phy>;
};
In case 2, DSA calls phylink_connect_phy() and in case 3, it calls phylink_of_phy_connect(). In both cases, phylink_create() has been called with a phy_interface_t of PHY_INTERFACE_MODE_NA, and in both cases, PHY_INTERFACE_MODE_NA is translated into phy->interface.
It is important to note that phy_device_create() initializes dev->interface = PHY_INTERFACE_MODE_GMII, and so, when we use phylink_create(PHY_INTERFACE_MODE_NA), no one will override this, and we will end up with a PHY_INTERFACE_MODE_GMII interface inherited from the PHY.
All this means that in order to maintain compatibility with device tree blobs where the phy-mode property is missing, we need to allow the "gmii" phy-mode and treat it as "internal".
Fixes: 2c709e0bdad4 ("net: dsa: microchip: ksz8795: add phylink support") Link: https://bugzilla.kernel.org/show_bug.cgi?id=216320 Reported-by: Craig McQueen craig@mcqueen.id.au Signed-off-by: Vladimir Oltean vladimir.oltean@nxp.com Reviewed-by: Alvin Šipraga alsi@bang-olufsen.dk Tested-by: Rasmus Villemoes rasmus.villemoes@prevas.dk Link: https://lore.kernel.org/r/20220818143250.2797111-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/dsa/microchip/ksz_common.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c index 0f02c62b02685..c9389880ad1fa 100644 --- a/drivers/net/dsa/microchip/ksz_common.c +++ b/drivers/net/dsa/microchip/ksz_common.c @@ -453,9 +453,15 @@ void ksz_phylink_get_caps(struct dsa_switch *ds, int port, if (dev->info->supports_rgmii[port]) phy_interface_set_rgmii(config->supported_interfaces);
- if (dev->info->internal_phy[port]) + if (dev->info->internal_phy[port]) { __set_bit(PHY_INTERFACE_MODE_INTERNAL, config->supported_interfaces); + /* Compatibility for phylib's default interface type when the + * phy-mode property is absent + */ + __set_bit(PHY_INTERFACE_MODE_GMII, + config->supported_interfaces); + }
if (dev->dev_ops->get_caps) dev->dev_ops->get_caps(dev, port, config);
From: Alex Elder elder@linaro.org
[ Upstream commit b8d4380365c515d8e0351f2f46d371738dd19be1 ]
In ipa_smem_init(), a Qualcomm SMEM region is allocated (if needed) and then its virtual address is fetched using qcom_smem_get(). The physical address associated with that region is also fetched.
The physical address is adjusted so that it is page-aligned, and an attempt is made to update the size of the region to compensate for any non-zero adjustment.
But that adjustment isn't done properly. The physical address is aligned twice, and as a result the size is never actually adjusted.
Fix this by *not* aligning the "addr" local variable, and instead deriving the page-aligned "phys" local variable from the unmodified "addr" value.
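As a worked example (all values assumed for illustration, compiled as plain userspace C):

  #include <stdio.h>

  #define PAGE_SIZE	4096UL
  #define PAGE_MASK	(~(PAGE_SIZE - 1))
  #define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & PAGE_MASK)

  int main(void)
  {
  	unsigned long addr = 0x41234;	/* assumed qcom_smem_virt_to_phys() result */
  	unsigned long size = 0xf00;	/* assumed region size */
  	unsigned long phys, fixed, buggy;

  	/* fixed code: keep addr unaligned and derive phys from it */
  	phys = addr & PAGE_MASK;			/* 0x41000 */
  	fixed = PAGE_ALIGN(size + addr - phys);		/* 0x2000, covers the tail */

  	/* buggy code: addr was masked up front, so addr - phys == 0 */
  	addr &= PAGE_MASK;
  	phys = addr & PAGE_MASK;
  	buggy = PAGE_ALIGN(size + addr - phys);		/* 0x1000, too small */

  	printf("fixed=%#lx buggy=%#lx\n", fixed, buggy);
  	return 0;
  }

With the assumed values, the region [0x41234, 0x42134) spans two pages, which only the fixed arithmetic accounts for.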
Fixes: a0036bb413d5b ("net: ipa: define SMEM memory region for IPA") Signed-off-by: Alex Elder elder@linaro.org Link: https://lore.kernel.org/r/20220818134206.567618-1-elder@linaro.org Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ipa/ipa_mem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ipa/ipa_mem.c b/drivers/net/ipa/ipa_mem.c index 1e9eae208e44f..53a1dbeaffa6d 100644 --- a/drivers/net/ipa/ipa_mem.c +++ b/drivers/net/ipa/ipa_mem.c @@ -568,7 +568,7 @@ static int ipa_smem_init(struct ipa *ipa, u32 item, size_t size) }
/* Align the address down and the size up to a page boundary */ - addr = qcom_smem_virt_to_phys(virt) & PAGE_MASK; + addr = qcom_smem_virt_to_phys(virt); phys = addr & PAGE_MASK; size = PAGE_ALIGN(size + addr - phys); iova = phys; /* We just want a direct mapping */
From: Xiaolei Wang xiaolei.wang@windriver.com
[ Upstream commit 6dbe852c379ff032a70a6b13a91914918c82cb07 ]
Some MAC drivers set mac_managed_pm to true only in their ->ndo_open() callback. Before mac_managed_pm is set to true, we still want to leverage mdio_bus_phy_suspend()/resume() for PHY device suspend and resume. In that case the PHY device is in the PHY_READY state, and we shouldn't warn about it. The check of mac_managed_pm in the WARN_ON also seems redundant, since we already check it at the entry of mdio_bus_phy_resume(), so drop it.
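As a rough sketch of the pattern described above (hypothetical driver, not code from this patch): mac_managed_pm is only set once the device is opened, so a resume before that point goes through mdio_bus_phy_resume() and finds the PHY still in PHY_READY.

  /* hypothetical MAC driver */
  static int foo_ndo_open(struct net_device *ndev)
  {
  	/* From this point on the MAC handles PHY suspend/resume itself.
  	 * Before open, mdio_bus_phy_suspend()/resume() still run, with
  	 * the PHY in PHY_READY - now accepted without a warning. */
  	ndev->phydev->mac_managed_pm = true;

  	return 0;	/* plus the driver's usual open work */
  }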
Fixes: 744d23c71af3 ("net: phy: Warn about incorrect mdio_bus_phy_resume() state") Signed-off-by: Xiaolei Wang xiaolei.wang@windriver.com Acked-by: Florian Fainelli f.fainelli@gmail.com Link: https://lore.kernel.org/r/20220819082451.1992102-1-xiaolei.wang@windriver.co... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/phy/phy_device.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 608de5a94165f..f90a21781d8d6 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -316,11 +316,11 @@ static __maybe_unused int mdio_bus_phy_resume(struct device *dev)
phydev->suspended_by_mdio_bus = 0;
- /* If we managed to get here with the PHY state machine in a state other - * than PHY_HALTED this is an indication that something went wrong and - * we should most likely be using MAC managed PM and we are not. + /* If we managed to get here with the PHY state machine in a state neither + * PHY_HALTED nor PHY_READY this is an indication that something went wrong + * and we should most likely be using MAC managed PM and we are not. */ - WARN_ON(phydev->state != PHY_HALTED && !phydev->mac_managed_pm); + WARN_ON(phydev->state != PHY_HALTED && phydev->state != PHY_READY);
ret = phy_init_hw(phydev); if (ret < 0)
From: Sergei Antonov saproj@gmail.com
[ Upstream commit 0ee7828dfc56e97d71e51e6374dc7b4eb2b6e081 ]
Since priv->rx_mapping[i] is mapped in moxart_mac_open(), we should unmap it in moxart_mac_stop(). This fixes two warnings.
1. During error unwinding in moxart_mac_probe() ("goto init_fail;"), moxart_mac_free_memory() calls dma_unmap_single() with the priv->rx_mapping[i] pointers still zeroed:
WARNING: CPU: 0 PID: 1 at kernel/dma/debug.c:963 check_unmap+0x704/0x980
DMA-API: moxart-ethernet 92000000.mac: device driver tries to free DMA memory it has not allocated [device address=0x0000000000000000] [size=1600 bytes]
CPU: 0 PID: 1 Comm: swapper Not tainted 5.19.0+ #60
Hardware name: Generic DT based system
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from dump_stack_lvl+0x34/0x44
 dump_stack_lvl from __warn+0xbc/0x1f0
 __warn from warn_slowpath_fmt+0x94/0xc8
 warn_slowpath_fmt from check_unmap+0x704/0x980
 check_unmap from debug_dma_unmap_page+0x8c/0x9c
 debug_dma_unmap_page from moxart_mac_free_memory+0x3c/0xa8
 moxart_mac_free_memory from moxart_mac_probe+0x190/0x218
 moxart_mac_probe from platform_probe+0x48/0x88
 platform_probe from really_probe+0xc0/0x2e4
2. After commands:

 ip link set dev eth0 down
 ip link set dev eth0 up
WARNING: CPU: 0 PID: 55 at kernel/dma/debug.c:570 add_dma_entry+0x204/0x2ec
DMA-API: moxart-ethernet 92000000.mac: cacheline tracking EEXIST, overlapping mappings aren't supported
CPU: 0 PID: 55 Comm: ip Not tainted 5.19.0+ #57
Hardware name: Generic DT based system
 unwind_backtrace from show_stack+0x10/0x14
 show_stack from dump_stack_lvl+0x34/0x44
 dump_stack_lvl from __warn+0xbc/0x1f0
 __warn from warn_slowpath_fmt+0x94/0xc8
 warn_slowpath_fmt from add_dma_entry+0x204/0x2ec
 add_dma_entry from dma_map_page_attrs+0x110/0x328
 dma_map_page_attrs from moxart_mac_open+0x134/0x320
 moxart_mac_open from __dev_open+0x11c/0x1ec
 __dev_open from __dev_change_flags+0x194/0x22c
 __dev_change_flags from dev_change_flags+0x14/0x44
 dev_change_flags from devinet_ioctl+0x6d4/0x93c
 devinet_ioctl from inet_ioctl+0x1ac/0x25c
v1 -> v2: Extraneous change removed.
Fixes: 6c821bd9edc9 ("net: Add MOXA ART SoCs ethernet driver") Signed-off-by: Sergei Antonov saproj@gmail.com Reviewed-by: Andrew Lunn andrew@lunn.ch Link: https://lore.kernel.org/r/20220819110519.1230877-1-saproj@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/moxa/moxart_ether.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/moxa/moxart_ether.c b/drivers/net/ethernet/moxa/moxart_ether.c index f11f1cb92025f..3b6beb96ca856 100644 --- a/drivers/net/ethernet/moxa/moxart_ether.c +++ b/drivers/net/ethernet/moxa/moxart_ether.c @@ -74,11 +74,6 @@ static int moxart_set_mac_address(struct net_device *ndev, void *addr) static void moxart_mac_free_memory(struct net_device *ndev) { struct moxart_mac_priv_t *priv = netdev_priv(ndev); - int i; - - for (i = 0; i < RX_DESC_NUM; i++) - dma_unmap_single(&priv->pdev->dev, priv->rx_mapping[i], - priv->rx_buf_size, DMA_FROM_DEVICE);
if (priv->tx_desc_base) dma_free_coherent(&priv->pdev->dev, @@ -193,6 +188,7 @@ static int moxart_mac_open(struct net_device *ndev) static int moxart_mac_stop(struct net_device *ndev) { struct moxart_mac_priv_t *priv = netdev_priv(ndev); + int i;
napi_disable(&priv->napi);
@@ -204,6 +200,11 @@ static int moxart_mac_stop(struct net_device *ndev) /* disable all functions */ writel(0, priv->base + REG_MAC_CTRL);
+ /* unmap areas mapped in moxart_mac_setup_desc_ring() */ + for (i = 0; i < RX_DESC_NUM; i++) + dma_unmap_single(&priv->pdev->dev, priv->rx_mapping[i], + priv->rx_buf_size, DMA_FROM_DEVICE); + return 0; }
From: Jonathan Toppins jtoppins@redhat.com
[ Upstream commit d745b5062ad2b5da90a5e728d7ca884fc07315fd ]
The bond's failure to transmit LACPDUs is caused by the global variable ad_ticks_per_sec being zero, as demonstrated by the reproducer script discussed below. This makes all timer values in __ad_timer_to_ticks zero, so the periodic timer never fires.
To reproduce: Run the script in `tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh` which puts bonding into a state where it never transmits LACPDUs.
line 44: ip link add fbond type bond mode 4 miimon 200 \
         xmit_hash_policy 1 ad_actor_sys_prio 65535 lacp_rate fast

setting bond param: ad_actor_sys_prio
given: params.ad_actor_system = 0
call stack:
  bond_option_ad_actor_sys_prio()
   -> bond_3ad_update_ad_actor_settings()
      -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
      -> ad.system.sys_mac_addr = bond->dev->dev_addr;
         because params.ad_actor_system == 0
results: ad.system.sys_mac_addr = bond->dev->dev_addr

line 48: ip link set fbond address 52:54:00:3B:7C:A6

setting bond MAC addr
call stack: bond->dev->dev_addr = new_mac

line 52: ip link set fbond type bond ad_actor_sys_prio 65535

setting bond param: ad_actor_sys_prio
given: params.ad_actor_system = 0
call stack:
  bond_option_ad_actor_sys_prio()
   -> bond_3ad_update_ad_actor_settings()
      -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio
      -> ad.system.sys_mac_addr = bond->dev->dev_addr;
         because params.ad_actor_system == 0
results: ad.system.sys_mac_addr = bond->dev->dev_addr

line 60: ip link set veth1-bond down master fbond

given:
  params.ad_actor_system = 0
  params.mode = BOND_MODE_8023AD
  ad.system.sys_mac_addr == bond->dev->dev_addr
call stack:
  bond_enslave
   -> bond_3ad_initialize(); because first slave
      -> if ad.system.sys_mac_addr != bond->dev->dev_addr return
results: Nothing is run in bond_3ad_initialize() because dev_addr equals sys_mac_addr, leaving the global ad_ticks_per_sec zero as it is never initialized anywhere else.
The if check around the contents of bond_3ad_initialize() is no longer needed due to commit 5ee14e6d336f ("bonding: 3ad: apply ad_actor settings changes immediately"), which sets ad.system.sys_mac_addr whenever any bonding parameter whose set function calls bond_3ad_update_ad_actor_settings() is changed: if ad.system.sys_mac_addr is zero it is set to the current bond MAC address, which causes the if check to never be true.
Fixes: 5ee14e6d336f ("bonding: 3ad: apply ad_actor settings changes immediately") Signed-off-by: Jonathan Toppins jtoppins@redhat.com Acked-by: Jay Vosburgh jay.vosburgh@canonical.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/bonding/bond_3ad.c | 38 ++++++++++++++-------------------- 1 file changed, 16 insertions(+), 22 deletions(-)
diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c index d7fb33c078e81..1f0120cbe9e80 100644 --- a/drivers/net/bonding/bond_3ad.c +++ b/drivers/net/bonding/bond_3ad.c @@ -2007,30 +2007,24 @@ void bond_3ad_initiate_agg_selection(struct bonding *bond, int timeout) */ void bond_3ad_initialize(struct bonding *bond, u16 tick_resolution) { - /* check that the bond is not initialized yet */ - if (!MAC_ADDRESS_EQUAL(&(BOND_AD_INFO(bond).system.sys_mac_addr), - bond->dev->dev_addr)) { - - BOND_AD_INFO(bond).aggregator_identifier = 0; - - BOND_AD_INFO(bond).system.sys_priority = - bond->params.ad_actor_sys_prio; - if (is_zero_ether_addr(bond->params.ad_actor_system)) - BOND_AD_INFO(bond).system.sys_mac_addr = - *((struct mac_addr *)bond->dev->dev_addr); - else - BOND_AD_INFO(bond).system.sys_mac_addr = - *((struct mac_addr *)bond->params.ad_actor_system); + BOND_AD_INFO(bond).aggregator_identifier = 0; + BOND_AD_INFO(bond).system.sys_priority = + bond->params.ad_actor_sys_prio; + if (is_zero_ether_addr(bond->params.ad_actor_system)) + BOND_AD_INFO(bond).system.sys_mac_addr = + *((struct mac_addr *)bond->dev->dev_addr); + else + BOND_AD_INFO(bond).system.sys_mac_addr = + *((struct mac_addr *)bond->params.ad_actor_system);
- /* initialize how many times this module is called in one - * second (should be about every 100ms) - */ - ad_ticks_per_sec = tick_resolution; + /* initialize how many times this module is called in one + * second (should be about every 100ms) + */ + ad_ticks_per_sec = tick_resolution;
- bond_3ad_initiate_agg_selection(bond, - AD_AGGREGATOR_SELECTION_TIMER * - ad_ticks_per_sec); - } + bond_3ad_initiate_agg_selection(bond, + AD_AGGREGATOR_SELECTION_TIMER * + ad_ticks_per_sec); }
/**
From: Maciej Żenczykowski maze@google.com
[ Upstream commit 4b2e3a17e9f279325712b79fb01d1493f9e3e005 ]
The __init/__exit annotations look to have been left out in an oversight.
Cc: Mahesh Bandewar maheshb@google.com Cc: Sainath Grandhi sainath.grandhi@intel.com Fixes: 235a9d89da97 ('ipvtap: IP-VLAN based tap driver') Signed-off-by: Maciej Żenczykowski maze@google.com Link: https://lore.kernel.org/r/20220821130808.12143-1-zenczykowski@gmail.com Signed-off-by: Paolo Abeni pabeni@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ipvlan/ipvtap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ipvlan/ipvtap.c b/drivers/net/ipvlan/ipvtap.c index ef02f2cf5ce13..cbabca167a078 100644 --- a/drivers/net/ipvlan/ipvtap.c +++ b/drivers/net/ipvlan/ipvtap.c @@ -194,7 +194,7 @@ static struct notifier_block ipvtap_notifier_block __read_mostly = { .notifier_call = ipvtap_device_event, };
-static int ipvtap_init(void) +static int __init ipvtap_init(void) { int err;
@@ -228,7 +228,7 @@ static int ipvtap_init(void) } module_init(ipvtap_init);
-static void ipvtap_exit(void) +static void __exit ipvtap_exit(void) { rtnl_link_unregister(&ipvtap_link_ops); unregister_netdevice_notifier(&ipvtap_notifier_block);
From: Florian Westphal fw@strlen.de
[ Upstream commit 7997eff82828304b780dc0a39707e1946d6f1ebf ]
Harshit Mogalapalli says: In ebt_do_table() function dereferencing 'private->hook_entry[hook]' can lead to NULL pointer dereference. [..] Kernel panic:
general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
[..]
RIP: 0010:ebt_do_table+0x1dc/0x1ce0
Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 5c 16 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 6c df 08 48 8d 7d 2c 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 88
[..]
Call Trace:
 nf_hook_slow+0xb1/0x170
 __br_forward+0x289/0x730
 maybe_deliver+0x24b/0x380
 br_flood+0xc6/0x390
 br_dev_xmit+0xa2e/0x12c0
For some reason ebtables rejects blobs that provide entry points that are not supported by the table, but what it should instead reject is the opposite: blobs that DO NOT provide an entry point supported by the table.
t->valid_hooks is the bitmask of hooks (input, forward ...) that will see packets. Providing an entry point that is not supported is harmless (it is never called/used), but the inverse isn't: it results in a crash because the ebtables traverser doesn't expect a NULL blob for a location it is receiving packets for.
Instead of fixing all the individual checks, do what iptables is doing and reject all blobs that differ from the expected hooks.
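The difference between the two checks can be illustrated with a small userspace sketch (hook bit values assumed, not the kernel's):

  #include <stdint.h>
  #include <stdio.h>

  #define HOOK_IN		(1u << 0)	/* assumed bit layout */
  #define HOOK_FWD	(1u << 1)
  #define HOOK_OUT	(1u << 2)

  /* old logic: only blobs with EXTRA entry points were rejected */
  static int old_check(uint32_t blob_hooks, uint32_t table_hooks)
  {
  	return (blob_hooks & ~table_hooks) ? -1 : 0;
  }

  /* new logic: reject any mismatch, as iptables does */
  static int new_check(uint32_t blob_hooks, uint32_t table_hooks)
  {
  	return (blob_hooks != table_hooks) ? -1 : 0;
  }

  int main(void)
  {
  	uint32_t table = HOOK_IN | HOOK_FWD | HOOK_OUT;
  	uint32_t blob = HOOK_IN;	/* provides no FORWARD/OUTPUT entry */

  	/* old: 0 (accepted, leaving hook_entry[] NULL for hooks the
  	 * table registers); new: -1 (rejected up front) */
  	printf("old=%d new=%d\n", old_check(blob, table), new_check(blob, table));
  	return 0;
  }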
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Harshit Mogalapalli harshit.m.mogalapalli@oracle.com Reported-by: syzkaller syzkaller@googlegroups.com Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/netfilter_bridge/ebtables.h | 4 ---- net/bridge/netfilter/ebtable_broute.c | 8 -------- net/bridge/netfilter/ebtable_filter.c | 8 -------- net/bridge/netfilter/ebtable_nat.c | 8 -------- net/bridge/netfilter/ebtables.c | 8 +------- 5 files changed, 1 insertion(+), 35 deletions(-)
diff --git a/include/linux/netfilter_bridge/ebtables.h b/include/linux/netfilter_bridge/ebtables.h index a13296d6c7ceb..fd533552a062c 100644 --- a/include/linux/netfilter_bridge/ebtables.h +++ b/include/linux/netfilter_bridge/ebtables.h @@ -94,10 +94,6 @@ struct ebt_table { struct ebt_replace_kernel *table; unsigned int valid_hooks; rwlock_t lock; - /* e.g. could be the table explicitly only allows certain - * matches, targets, ... 0 == let it in */ - int (*check)(const struct ebt_table_info *info, - unsigned int valid_hooks); /* the data used by the kernel */ struct ebt_table_info *private; struct nf_hook_ops *ops; diff --git a/net/bridge/netfilter/ebtable_broute.c b/net/bridge/netfilter/ebtable_broute.c index 1a11064f99907..8f19253024b0a 100644 --- a/net/bridge/netfilter/ebtable_broute.c +++ b/net/bridge/netfilter/ebtable_broute.c @@ -36,18 +36,10 @@ static struct ebt_replace_kernel initial_table = { .entries = (char *)&initial_chain, };
-static int check(const struct ebt_table_info *info, unsigned int valid_hooks) -{ - if (valid_hooks & ~(1 << NF_BR_BROUTING)) - return -EINVAL; - return 0; -} - static const struct ebt_table broute_table = { .name = "broute", .table = &initial_table, .valid_hooks = 1 << NF_BR_BROUTING, - .check = check, .me = THIS_MODULE, };
diff --git a/net/bridge/netfilter/ebtable_filter.c b/net/bridge/netfilter/ebtable_filter.c index cb949436bc0e3..278f324e67524 100644 --- a/net/bridge/netfilter/ebtable_filter.c +++ b/net/bridge/netfilter/ebtable_filter.c @@ -43,18 +43,10 @@ static struct ebt_replace_kernel initial_table = { .entries = (char *)initial_chains, };
-static int check(const struct ebt_table_info *info, unsigned int valid_hooks) -{ - if (valid_hooks & ~FILTER_VALID_HOOKS) - return -EINVAL; - return 0; -} - static const struct ebt_table frame_filter = { .name = "filter", .table = &initial_table, .valid_hooks = FILTER_VALID_HOOKS, - .check = check, .me = THIS_MODULE, };
diff --git a/net/bridge/netfilter/ebtable_nat.c b/net/bridge/netfilter/ebtable_nat.c index 5ee0531ae5061..9066f7f376d57 100644 --- a/net/bridge/netfilter/ebtable_nat.c +++ b/net/bridge/netfilter/ebtable_nat.c @@ -43,18 +43,10 @@ static struct ebt_replace_kernel initial_table = { .entries = (char *)initial_chains, };
-static int check(const struct ebt_table_info *info, unsigned int valid_hooks) -{ - if (valid_hooks & ~NAT_VALID_HOOKS) - return -EINVAL; - return 0; -} - static const struct ebt_table frame_nat = { .name = "nat", .table = &initial_table, .valid_hooks = NAT_VALID_HOOKS, - .check = check, .me = THIS_MODULE, };
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c index f2dbefb61ce83..9a0ae59cdc500 100644 --- a/net/bridge/netfilter/ebtables.c +++ b/net/bridge/netfilter/ebtables.c @@ -1040,8 +1040,7 @@ static int do_replace_finish(struct net *net, struct ebt_replace *repl, goto free_iterate; }
- /* the table doesn't like it */ - if (t->check && (ret = t->check(newinfo, repl->valid_hooks))) + if (repl->valid_hooks != t->valid_hooks) goto free_unlock;
if (repl->num_counters && repl->num_counters != t->private->nentries) { @@ -1231,11 +1230,6 @@ int ebt_register_table(struct net *net, const struct ebt_table *input_table, if (ret != 0) goto free_chainstack;
- if (table->check && table->check(newinfo, table->valid_hooks)) { - ret = -EINVAL; - goto free_chainstack; - } - table->private = newinfo; rwlock_init(&table->lock); mutex_lock(&ebt_mutex);
From: Florian Westphal fw@strlen.de
[ Upstream commit 18bbc3213383a82b05383827f4b1b882e3f0a5a5 ]
TPROXY is only allowed from prerouting, but nft_tproxy doesn't check this. This fixes a crash (null dereference) when using tproxy from e.g. output.
Fixes: 4ed8eb6570a4 ("netfilter: nf_tables: Add native tproxy support") Reported-by: Shell Chen xierch@gmail.com Signed-off-by: Florian Westphal fw@strlen.de Signed-off-by: Sasha Levin sashal@kernel.org --- net/netfilter/nft_tproxy.c | 8 ++++++++ 1 file changed, 8 insertions(+)
diff --git a/net/netfilter/nft_tproxy.c b/net/netfilter/nft_tproxy.c index 801f013971dfa..a701ad64f10af 100644 --- a/net/netfilter/nft_tproxy.c +++ b/net/netfilter/nft_tproxy.c @@ -312,6 +312,13 @@ static int nft_tproxy_dump(struct sk_buff *skb, return 0; }
+static int nft_tproxy_validate(const struct nft_ctx *ctx, + const struct nft_expr *expr, + const struct nft_data **data) +{ + return nft_chain_validate_hooks(ctx->chain, 1 << NF_INET_PRE_ROUTING); +} + static struct nft_expr_type nft_tproxy_type; static const struct nft_expr_ops nft_tproxy_ops = { .type = &nft_tproxy_type, @@ -321,6 +328,7 @@ static const struct nft_expr_ops nft_tproxy_ops = { .destroy = nft_tproxy_destroy, .dump = nft_tproxy_dump, .reduce = NFT_REDUCE_READONLY, + .validate = nft_tproxy_validate, };
static struct nft_expr_type nft_tproxy_type __read_mostly = {
From: Pavan Chebbi pavan.chebbi@broadcom.com
[ Upstream commit 7dd3de7cb1d657a918c6b2bc673c71e318aa0c05 ]
Using BNXT_PAGE_MODE_BUF_SIZE + offset as the buffer length value is not sufficient when running single-buffer XDP programs that do redirect operations. The stack will complain about missing skb tail room. Fix it by using PAGE_SIZE when calling xdp_init_buff() for single-buffer programs.
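Roughly, the frame size passed to xdp_init_buff() is what later bounds the tailroom available when a redirected frame is converted into an skb: declaring only BNXT_PAGE_MODE_BUF_SIZE + offset of an (assumed) 4 KiB page leaves no declared room for the trailing struct skb_shared_info, while declaring PAGE_SIZE does.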
Fixes: b231c3f3414c ("bnxt: refactor bnxt_rx_xdp to separate xdp_init_buff/xdp_prepare_buff") Reviewed-by: Somnath Kotur somnath.kotur@broadcom.com Signed-off-by: Pavan Chebbi pavan.chebbi@broadcom.com Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 + drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 10 ++++++++-- 2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h index 075c6206325ce..b1b17f9113006 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h @@ -2130,6 +2130,7 @@ struct bnxt { #define BNXT_DUMP_CRASH 1
struct bpf_prog *xdp_prog; + u8 xdp_has_frags;
struct bnxt_ptp_cfg *ptp_cfg; u8 ptp_all_rx_tstamp; diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index f53387ed0167b..c3065ec0a4798 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -181,6 +181,7 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, struct xdp_buff *xdp) { struct bnxt_sw_rx_bd *rx_buf; + u32 buflen = PAGE_SIZE; struct pci_dev *pdev; dma_addr_t mapping; u32 offset; @@ -192,7 +193,10 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, mapping = rx_buf->mapping - bp->rx_dma_offset; dma_sync_single_for_cpu(&pdev->dev, mapping + offset, *len, bp->rx_dir);
- xdp_init_buff(xdp, BNXT_PAGE_MODE_BUF_SIZE + offset, &rxr->xdp_rxq); + if (bp->xdp_has_frags) + buflen = BNXT_PAGE_MODE_BUF_SIZE + offset; + + xdp_init_buff(xdp, buflen, &rxr->xdp_rxq); xdp_prepare_buff(xdp, *data_ptr - offset, offset, *len, false); }
@@ -397,8 +401,10 @@ static int bnxt_xdp_set(struct bnxt *bp, struct bpf_prog *prog) netdev_warn(dev, "ethtool rx/tx channels must be combined to support XDP.\n"); return -EOPNOTSUPP; } - if (prog) + if (prog) { tx_xdp = bp->rx_nr_rings; + bp->xdp_has_frags = prog->aux->xdp_has_frags; + }
tc = netdev_get_num_tc(dev); if (!tc)
From: Vikas Gupta vikas.gupta@broadcom.com
[ Upstream commit 574b2bb9692fd3d45ed631ac447176d4679f3010 ]
Add the missing devlink_set_features() call, without which the reload_down and reload_up callbacks do not function.
Fixes: 228ea8c187d8 ("bnxt_en: implement devlink dev reload driver_reinit") Reviewed-by: Somnath Kotur somnath.kotur@broadcom.com Signed-off-by: Vikas Gupta vikas.gupta@broadcom.com Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c index 6b3d4f4c2a75f..d83be40785b89 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_devlink.c @@ -1246,6 +1246,7 @@ int bnxt_dl_register(struct bnxt *bp) if (rc) goto err_dl_port_unreg;
+ devlink_set_features(dl, DEVLINK_F_RELOAD); out: devlink_register(dl); return 0;
From: Vikas Gupta vikas.gupta@broadcom.com
[ Upstream commit 09a89cc59ad67794a11e1d3dd13c5b3172adcc51 ]
There are 2 issues:
1. We should decrement hw_resc->max_nqs, not hw_resc->max_irqs, by the number of NQs assigned to the VFs. The IRQs are fixed on each function and cannot be re-assigned; only the NQs are being assigned to the VFs.
2. vf_msix is the total number of NQs to be assigned to the VFs. So we should decrement vf_msix from hw_resc->max_nqs.
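To put numbers on it (illustrative values only): with n = 4 VFs and vf_msix = 32 NQs in total, the old code subtracted 32 * 4 = 128 from hw_resc->max_irqs, charging the wrong resource and over-counting four-fold; the new code subtracts exactly 32 from hw_resc->max_nqs.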
Fixes: b16b68918674 ("bnxt_en: Add SR-IOV support for 57500 chips.") Signed-off-by: Vikas Gupta vikas.gupta@broadcom.com Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c index a1a2c7a64fd58..c9cf0569451a2 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c @@ -623,7 +623,7 @@ static int bnxt_hwrm_func_vf_resc_cfg(struct bnxt *bp, int num_vfs, bool reset) hw_resc->max_stat_ctxs -= le16_to_cpu(req->min_stat_ctx) * n; hw_resc->max_vnics -= le16_to_cpu(req->min_vnics) * n; if (bp->flags & BNXT_FLAG_CHIP_P5) - hw_resc->max_irqs -= vf_msix * n; + hw_resc->max_nqs -= vf_msix;
rc = pf->active_vfs; }
From: Vikas Gupta vikas.gupta@broadcom.com
[ Upstream commit 366c304741729e64d778c80555d9eb422cf5cc89 ]
LRO/GRO_HW should be disabled if there is an attached XDP program. BNXT_FLAG_TPA reflects the *current* LRO/GRO_HW setting, so keying the check off it latches the features off: once they are disabled the flag is cleared, and every later pass through bnxt_fix_features() disables them again, even after the XDP program is removed.
Fixes: 1dc4c557bfed ("bnxt: adding bnxt_xdp_build_skb to build skb from multibuffer xdp_buff") Signed-off-by: Vikas Gupta vikas.gupta@broadcom.com Signed-off-by: Michael Chan michael.chan@broadcom.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index cf9b00576ed36..964354536f9ce 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -11183,10 +11183,7 @@ static netdev_features_t bnxt_fix_features(struct net_device *dev, if ((features & NETIF_F_NTUPLE) && !bnxt_rfs_capable(bp)) features &= ~NETIF_F_NTUPLE;
- if (bp->flags & BNXT_FLAG_NO_AGG_RINGS) - features &= ~(NETIF_F_LRO | NETIF_F_GRO_HW); - - if (!(bp->flags & BNXT_FLAG_TPA)) + if ((bp->flags & BNXT_FLAG_NO_AGG_RINGS) || bp->xdp_prog) features &= ~(NETIF_F_LRO | NETIF_F_GRO_HW);
if (!(features & NETIF_F_GRO))
From: Pablo Neira Ayuso pablo@netfilter.org
[ Upstream commit 5dc52d83baac30decf5f3b371d5eb41dfa1d1412 ]
Updates on an existing implicit chain make no sense; disallow this.
Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/netfilter/nf_tables_api.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 4bd6e9427c918..8b6ee9df817fb 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -2574,6 +2574,9 @@ static int nf_tables_newchain(struct sk_buff *skb, const struct nfnl_info *info, nft_ctx_init(&ctx, net, skb, info->nlh, family, table, chain, nla);
if (chain != NULL) { + if (chain->flags & NFT_CHAIN_BINDING) + return -EINVAL; + if (info->nlh->nlmsg_flags & NLM_F_EXCL) { NL_SET_BAD_ATTR(extack, attr); return -EEXIST;
From: Pablo Neira Ayuso pablo@netfilter.org
[ Upstream commit ab482c6b66a4a8c0a8c0b0f577a785cf9ff1c2e2 ]
The commit mutex is per-netns, so move table_handle to the pernet area, where it is protected by that mutex.
*read-write* to 0xffffffff883a01e8 of 8 bytes by task 6542 on cpu 0:
 nf_tables_newtable+0x6dc/0xc00 net/netfilter/nf_tables_api.c:1221
 nfnetlink_rcv_batch net/netfilter/nfnetlink.c:513 [inline]
 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:634 [inline]
 nfnetlink_rcv+0xa6a/0x13a0 net/netfilter/nfnetlink.c:652
 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
 netlink_unicast+0x652/0x730 net/netlink/af_netlink.c:1345
 netlink_sendmsg+0x643/0x740 net/netlink/af_netlink.c:1921
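For illustration (userspace sketch, not kernel code), the lost-update race KCSAN is reporting on the old global counter looks like this:

  #include <pthread.h>
  #include <stdio.h>

  static unsigned long handle;	/* stand-in for the old global table_handle */

  static void *worker(void *arg)
  {
  	for (int i = 0; i < 1000000; i++)
  		handle++;	/* load/add/store can interleave across threads */
  	return NULL;
  }

  int main(void)
  {
  	pthread_t a, b;

  	pthread_create(&a, NULL, worker, NULL);
  	pthread_create(&b, NULL, worker, NULL);
  	pthread_join(a, NULL);
  	pthread_join(b, NULL);

  	printf("handle = %lu (usually < 2000000)\n", handle);
  	return 0;
  }

Moving the counter into struct nftables_pernet puts every increment under that netns' commit_mutex, so tables in the same netns can no longer race, and different netns use independent counters.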
Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions") Reported-by: Abhishek Shah abhishek.shah@columbia.edu Reviewed-by: Florian Westphal fw@strlen.de Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- include/net/netfilter/nf_tables.h | 1 + net/netfilter/nf_tables_api.c | 3 +-- 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h index b8890ace0f879..0daad6e63ccb2 100644 --- a/include/net/netfilter/nf_tables.h +++ b/include/net/netfilter/nf_tables.h @@ -1635,6 +1635,7 @@ struct nftables_pernet { struct list_head module_list; struct list_head notify_list; struct mutex commit_mutex; + u64 table_handle; unsigned int base_seq; u8 validate_state; }; diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index 8b6ee9df817fb..e171257739c2f 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -32,7 +32,6 @@ static LIST_HEAD(nf_tables_objects); static LIST_HEAD(nf_tables_flowtables); static LIST_HEAD(nf_tables_destroy_list); static DEFINE_SPINLOCK(nf_tables_destroy_list_lock); -static u64 table_handle;
enum { NFT_VALIDATE_SKIP = 0, @@ -1235,7 +1234,7 @@ static int nf_tables_newtable(struct sk_buff *skb, const struct nfnl_info *info, INIT_LIST_HEAD(&table->flowtables); table->family = family; table->flags = flags; - table->handle = ++table_handle; + table->handle = ++nft_net->table_handle; if (table->flags & NFT_TABLE_F_OWNER) table->nlpid = NETLINK_CB(skb).portid;
From: Pablo Neira Ayuso pablo@netfilter.org
[ Upstream commit 94254f990c07e9ddf1634e0b727fab821c3b5bf9 ]
Instead of truncating offset and length to u8, report ERANGE.
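A tiny illustration (value assumed) of the silent truncation that the range check now turns into an error:

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
  	uint32_t offset = 256;		/* as carried in NFTA_PAYLOAD_OFFSET */
  	uint8_t stored = offset;	/* the expression stores offset in a u8 */

  	/* before: 256 silently became 0 and matched the wrong bytes;
  	 * now nft_parse_u32_check(attr, U8_MAX, &offset) fails with -ERANGE */
  	printf("stored offset: %u\n", stored);
  	return 0;
  }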
Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/netfilter/nft_payload.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c index 2e7ac007cb30f..4fee67abfe2c5 100644 --- a/net/netfilter/nft_payload.c +++ b/net/netfilter/nft_payload.c @@ -833,6 +833,7 @@ nft_payload_select_ops(const struct nft_ctx *ctx, { enum nft_payload_bases base; unsigned int offset, len; + int err;
if (tb[NFTA_PAYLOAD_BASE] == NULL || tb[NFTA_PAYLOAD_OFFSET] == NULL || @@ -859,8 +860,13 @@ nft_payload_select_ops(const struct nft_ctx *ctx, if (tb[NFTA_PAYLOAD_DREG] == NULL) return ERR_PTR(-EINVAL);
- offset = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_OFFSET])); - len = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_LEN])); + err = nft_parse_u32_check(tb[NFTA_PAYLOAD_OFFSET], U8_MAX, &offset); + if (err < 0) + return ERR_PTR(err); + + err = nft_parse_u32_check(tb[NFTA_PAYLOAD_LEN], U8_MAX, &len); + if (err < 0) + return ERR_PTR(err);
if (len <= 4 && is_power_of_2(len) && IS_ALIGNED(offset, len) && base != NFT_PAYLOAD_LL_HEADER && base != NFT_PAYLOAD_INNER_HEADER)
From: Pablo Neira Ayuso pablo@netfilter.org
[ Upstream commit 7044ab281febae9e2fa9b0b247693d6026166293 ]
Instead, report ERANGE if csum_offset is out of range, and EOPNOTSUPP if the checksum type is not supported.
Fixes: 7ec3f7b47b8d ("netfilter: nft_payload: add packet mangling support") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/netfilter/nft_payload.c | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/net/netfilter/nft_payload.c b/net/netfilter/nft_payload.c index 4fee67abfe2c5..eb0e40c297121 100644 --- a/net/netfilter/nft_payload.c +++ b/net/netfilter/nft_payload.c @@ -740,17 +740,23 @@ static int nft_payload_set_init(const struct nft_ctx *ctx, const struct nlattr * const tb[]) { struct nft_payload_set *priv = nft_expr_priv(expr); + u32 csum_offset, csum_type = NFT_PAYLOAD_CSUM_NONE; + int err;
priv->base = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_BASE])); priv->offset = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_OFFSET])); priv->len = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_LEN]));
if (tb[NFTA_PAYLOAD_CSUM_TYPE]) - priv->csum_type = - ntohl(nla_get_be32(tb[NFTA_PAYLOAD_CSUM_TYPE])); - if (tb[NFTA_PAYLOAD_CSUM_OFFSET]) - priv->csum_offset = - ntohl(nla_get_be32(tb[NFTA_PAYLOAD_CSUM_OFFSET])); + csum_type = ntohl(nla_get_be32(tb[NFTA_PAYLOAD_CSUM_TYPE])); + if (tb[NFTA_PAYLOAD_CSUM_OFFSET]) { + err = nft_parse_u32_check(tb[NFTA_PAYLOAD_CSUM_OFFSET], U8_MAX, + &csum_offset); + if (err < 0) + return err; + + priv->csum_offset = csum_offset; + } if (tb[NFTA_PAYLOAD_CSUM_FLAGS]) { u32 flags;
@@ -761,7 +767,7 @@ static int nft_payload_set_init(const struct nft_ctx *ctx, priv->csum_flags = flags; }
- switch (priv->csum_type) { + switch (csum_type) { case NFT_PAYLOAD_CSUM_NONE: case NFT_PAYLOAD_CSUM_INET: break; @@ -775,6 +781,7 @@ static int nft_payload_set_init(const struct nft_ctx *ctx, default: return -EOPNOTSUPP; } + priv->csum_type = csum_type;
return nft_parse_register_load(tb[NFTA_PAYLOAD_SREG], &priv->sreg, priv->len);
From: Pablo Neira Ayuso pablo@netfilter.org
[ Upstream commit 43eb8949cfdffa764b92bc6c54b87cbe5b0003fe ]
An error might occur later in the nf_tables_addchain() codepath; enable the static key only after the transaction has been created.
Fixes: 9f08ea848117 ("netfilter: nf_tables: keep chain counters away from hot path") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/netfilter/nf_tables_api.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index e171257739c2f..b2c89e8c2a655 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -2195,9 +2195,9 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask, struct netlink_ext_ack *extack) { const struct nlattr * const *nla = ctx->nla; + struct nft_stats __percpu *stats = NULL; struct nft_table *table = ctx->table; struct nft_base_chain *basechain; - struct nft_stats __percpu *stats; struct net *net = ctx->net; char name[NFT_NAME_MAXLEN]; struct nft_rule_blob *blob; @@ -2235,7 +2235,6 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask, return PTR_ERR(stats); } rcu_assign_pointer(basechain->stats, stats); - static_branch_inc(&nft_counters_enabled); }
err = nft_basechain_init(basechain, family, &hook, flags); @@ -2318,6 +2317,9 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask, goto err_unregister_hook; }
+ if (stats) + static_branch_inc(&nft_counters_enabled); + table->use++;
return 0;
From: Pablo Neira Ayuso pablo@netfilter.org
[ Upstream commit 5f3b7aae14a706d0d7da9f9e39def52ff5fc3d39 ]
As originally intended, restrict the extension to supported families.
Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/netfilter/nft_osf.c | 18 +++++++++++++++--- 1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/net/netfilter/nft_osf.c b/net/netfilter/nft_osf.c index 5eed18f90b020..175d666c8d87e 100644 --- a/net/netfilter/nft_osf.c +++ b/net/netfilter/nft_osf.c @@ -115,9 +115,21 @@ static int nft_osf_validate(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nft_data **data) { - return nft_chain_validate_hooks(ctx->chain, (1 << NF_INET_LOCAL_IN) | - (1 << NF_INET_PRE_ROUTING) | - (1 << NF_INET_FORWARD)); + unsigned int hooks; + + switch (ctx->family) { + case NFPROTO_IPV4: + case NFPROTO_IPV6: + case NFPROTO_INET: + hooks = (1 << NF_INET_LOCAL_IN) | + (1 << NF_INET_PRE_ROUTING) | + (1 << NF_INET_FORWARD); + break; + default: + return -EOPNOTSUPP; + } + + return nft_chain_validate_hooks(ctx->chain, hooks); }
static bool nft_osf_reduce(struct nft_regs_track *track,
From: Pablo Neira Ayuso pablo@netfilter.org
[ Upstream commit 01e4092d53bc4fe122a6e4b6d664adbd57528ca3 ]
Only allow this expression to be used from the NFPROTO_NETDEV family.
Fixes: af308b94a2a4 ("netfilter: nf_tables: add tunnel support") Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/netfilter/nft_tunnel.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/net/netfilter/nft_tunnel.c b/net/netfilter/nft_tunnel.c index d0f9b1d51b0e9..96b03e0bf74ff 100644 --- a/net/netfilter/nft_tunnel.c +++ b/net/netfilter/nft_tunnel.c @@ -161,6 +161,7 @@ static const struct nft_expr_ops nft_tunnel_get_ops = {
static struct nft_expr_type nft_tunnel_type __read_mostly = { .name = "tunnel", + .family = NFPROTO_NETDEV, .ops = &nft_tunnel_get_ops, .policy = nft_tunnel_policy, .maxattr = NFTA_TUNNEL_MAX,
From: Pablo Neira Ayuso pablo@netfilter.org
[ Upstream commit e02f0d3970404bfea385b6edb86f2d936db0ea2b ]
Update nft_data_init() to report EINVAL if the chain is already bound.
Fixes: d0e2c7de92c7 ("netfilter: nf_tables: add NFT_CHAIN_BINDING") Reported-by: Gwangun Jung exsociety@gmail.com Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/netfilter/nf_tables_api.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c index b2c89e8c2a655..bc690238a3c56 100644 --- a/net/netfilter/nf_tables_api.c +++ b/net/netfilter/nf_tables_api.c @@ -9657,6 +9657,8 @@ static int nft_verdict_init(const struct nft_ctx *ctx, struct nft_data *data, return PTR_ERR(chain); if (nft_is_base_chain(chain)) return -EOPNOTSUPP; + if (nft_chain_is_bound(chain)) + return -EINVAL; if (desc->flags & NFT_DATA_DESC_SETELEM && chain->flags & NFT_CHAIN_BINDING) return -EINVAL;
From: Pablo Neira Ayuso pablo@netfilter.org
[ Upstream commit 759eebbcfafcefa23b59e912396306543764bd3c ]
Expose nf_flow_table_gc_run() to force a garbage collector run from the offload infrastructure.
Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- include/net/netfilter/nf_flow_table.h | 1 + net/netfilter/nf_flow_table_core.c | 12 +++++++++--- 2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h index 64daafd1fc41c..32c25122ab184 100644 --- a/include/net/netfilter/nf_flow_table.h +++ b/include/net/netfilter/nf_flow_table.h @@ -270,6 +270,7 @@ void flow_offload_refresh(struct nf_flowtable *flow_table,
struct flow_offload_tuple_rhash *flow_offload_lookup(struct nf_flowtable *flow_table, struct flow_offload_tuple *tuple); +void nf_flow_table_gc_run(struct nf_flowtable *flow_table); void nf_flow_table_gc_cleanup(struct nf_flowtable *flowtable, struct net_device *dev); void nf_flow_table_cleanup(struct net_device *dev); diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c index f2def06d10709..18453fa25199c 100644 --- a/net/netfilter/nf_flow_table_core.c +++ b/net/netfilter/nf_flow_table_core.c @@ -442,12 +442,17 @@ static void nf_flow_offload_gc_step(struct nf_flowtable *flow_table, } }
+void nf_flow_table_gc_run(struct nf_flowtable *flow_table) +{ + nf_flow_table_iterate(flow_table, nf_flow_offload_gc_step, NULL); +} + static void nf_flow_offload_work_gc(struct work_struct *work) { struct nf_flowtable *flow_table;
flow_table = container_of(work, struct nf_flowtable, gc_work.work); - nf_flow_table_iterate(flow_table, nf_flow_offload_gc_step, NULL); + nf_flow_table_gc_run(flow_table); queue_delayed_work(system_power_efficient_wq, &flow_table->gc_work, HZ); }
@@ -606,10 +611,11 @@ void nf_flow_table_free(struct nf_flowtable *flow_table)
cancel_delayed_work_sync(&flow_table->gc_work); nf_flow_table_iterate(flow_table, nf_flow_table_do_cleanup, NULL); - nf_flow_table_iterate(flow_table, nf_flow_offload_gc_step, NULL); + nf_flow_table_gc_run(flow_table); nf_flow_table_offload_flush(flow_table); if (nf_flowtable_hw_offload(flow_table)) - nf_flow_table_iterate(flow_table, nf_flow_offload_gc_step, NULL); + nf_flow_table_gc_run(flow_table); + rhashtable_destroy(&flow_table->rhashtable); } EXPORT_SYMBOL_GPL(nf_flow_table_free);
From: Pablo Neira Ayuso pablo@netfilter.org
[ Upstream commit 9afb4b27349a499483ae0134282cefd0c90f480f ]
To clear the flow table on flow table free, the following sequence normally happens in order:
1) gc_step work is stopped to disable any further stats/del requests.
2) All flow table entries are set to teardown state.
3) Run gc_step which will queue HW del work for each flow table entry.
4) Waiting for the above del work to finish (flush).
5) Run gc_step again, deleting all entries from the flow table.
6) Flow table is freed.
But if a flow table entry already has pending HW stats or HW add work, step 3 will not queue HW del work (it will be skipped), step 4 will wait for the pending add/stats to finish, and step 5 will queue HW del work which might execute after the flow table has been freed.
To fix the above, this patch first flushes the pending work, then sets the teardown flag on all flows in the flowtable and forces a garbage collector run to queue work to remove the flows from hardware, then flushes this new pending work and finally forces another garbage collector run to remove the entries from the software flowtable.
Stack trace:

[47773.882335] BUG: KASAN: use-after-free in down_read+0x99/0x460
[47773.883634] Write of size 8 at addr ffff888103b45aa8 by task kworker/u20:6/543704
[47773.885634] CPU: 3 PID: 543704 Comm: kworker/u20:6 Not tainted 5.12.0-rc7+ #2
[47773.886745] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)
[47773.888438] Workqueue: nf_ft_offload_del flow_offload_work_handler [nf_flow_table]
[47773.889727] Call Trace:
[47773.890214]  dump_stack+0xbb/0x107
[47773.890818]  print_address_description.constprop.0+0x18/0x140
[47773.892990]  kasan_report.cold+0x7c/0xd8
[47773.894459]  kasan_check_range+0x145/0x1a0
[47773.895174]  down_read+0x99/0x460
[47773.899706]  nf_flow_offload_tuple+0x24f/0x3c0 [nf_flow_table]
[47773.907137]  flow_offload_work_handler+0x72d/0xbe0 [nf_flow_table]
[47773.913372]  process_one_work+0x8ac/0x14e0
[47773.921325]
[47773.921325] Allocated by task 592159:
[47773.922031]  kasan_save_stack+0x1b/0x40
[47773.922730]  __kasan_kmalloc+0x7a/0x90
[47773.923411]  tcf_ct_flow_table_get+0x3cb/0x1230 [act_ct]
[47773.924363]  tcf_ct_init+0x71c/0x1156 [act_ct]
[47773.925207]  tcf_action_init_1+0x45b/0x700
[47773.925987]  tcf_action_init+0x453/0x6b0
[47773.926692]  tcf_exts_validate+0x3d0/0x600
[47773.927419]  fl_change+0x757/0x4a51 [cls_flower]
[47773.928227]  tc_new_tfilter+0x89a/0x2070
[47773.936652]
[47773.936652] Freed by task 543704:
[47773.937303]  kasan_save_stack+0x1b/0x40
[47773.938039]  kasan_set_track+0x1c/0x30
[47773.938731]  kasan_set_free_info+0x20/0x30
[47773.939467]  __kasan_slab_free+0xe7/0x120
[47773.940194]  slab_free_freelist_hook+0x86/0x190
[47773.941038]  kfree+0xce/0x3a0
[47773.941644]  tcf_ct_flow_table_cleanup_work
Original patch description and stack trace by Paul Blakey.
Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support") Reported-by: Paul Blakey paulb@nvidia.com Tested-by: Paul Blakey paulb@nvidia.com Signed-off-by: Pablo Neira Ayuso pablo@netfilter.org Signed-off-by: Sasha Levin sashal@kernel.org --- include/net/netfilter/nf_flow_table.h | 2 ++ net/netfilter/nf_flow_table_core.c | 7 +++---- net/netfilter/nf_flow_table_offload.c | 8 ++++++++ 3 files changed, 13 insertions(+), 4 deletions(-)
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h index 32c25122ab184..9c93e4981b680 100644 --- a/include/net/netfilter/nf_flow_table.h +++ b/include/net/netfilter/nf_flow_table.h @@ -307,6 +307,8 @@ void nf_flow_offload_stats(struct nf_flowtable *flowtable, struct flow_offload *flow);
void nf_flow_table_offload_flush(struct nf_flowtable *flowtable); +void nf_flow_table_offload_flush_cleanup(struct nf_flowtable *flowtable); + int nf_flow_table_offload_setup(struct nf_flowtable *flowtable, struct net_device *dev, enum flow_block_command cmd); diff --git a/net/netfilter/nf_flow_table_core.c b/net/netfilter/nf_flow_table_core.c index 18453fa25199c..483b18d35cade 100644 --- a/net/netfilter/nf_flow_table_core.c +++ b/net/netfilter/nf_flow_table_core.c @@ -610,12 +610,11 @@ void nf_flow_table_free(struct nf_flowtable *flow_table) mutex_unlock(&flowtable_lock);
cancel_delayed_work_sync(&flow_table->gc_work); + nf_flow_table_offload_flush(flow_table); + /* ... no more pending work after this stage ... */ nf_flow_table_iterate(flow_table, nf_flow_table_do_cleanup, NULL); nf_flow_table_gc_run(flow_table); - nf_flow_table_offload_flush(flow_table); - if (nf_flowtable_hw_offload(flow_table)) - nf_flow_table_gc_run(flow_table); - + nf_flow_table_offload_flush_cleanup(flow_table); rhashtable_destroy(&flow_table->rhashtable); } EXPORT_SYMBOL_GPL(nf_flow_table_free); diff --git a/net/netfilter/nf_flow_table_offload.c b/net/netfilter/nf_flow_table_offload.c index 11b6e19420920..4d1169b634c5f 100644 --- a/net/netfilter/nf_flow_table_offload.c +++ b/net/netfilter/nf_flow_table_offload.c @@ -1063,6 +1063,14 @@ void nf_flow_offload_stats(struct nf_flowtable *flowtable, flow_offload_queue_work(offload); }
+void nf_flow_table_offload_flush_cleanup(struct nf_flowtable *flowtable) +{ + if (nf_flowtable_hw_offload(flowtable)) { + flush_workqueue(nf_flow_offload_del_wq); + nf_flow_table_gc_run(flowtable); + } +} + void nf_flow_table_offload_flush(struct nf_flowtable *flowtable) { if (nf_flowtable_hw_offload(flowtable)) {
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit 1227c1771dd2ad44318aa3ab9e3a293b3f34ff2a ]
While reading sysctl_[rw]mem_(max|default), they can be changed concurrently. Thus, we need to add READ_ONCE() to their readers.
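The pattern, illustrated with a hypothetical helper (clamp_sndbuf() is not part of the patch; READ_ONCE() and sysctl_wmem_max are real):

	/* Hypothetical illustration of the READ_ONCE() pattern: take one
	 * untorn snapshot of the sysctl; a plain read may be torn or
	 * refetched by the compiler while a writer updates the knob.
	 */
	static u32 clamp_sndbuf(u32 requested)
	{
		u32 max = READ_ONCE(sysctl_wmem_max);

		return min_t(u32, requested, max);
	}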
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/filter.c | 4 ++-- net/core/sock.c | 8 ++++---- net/ipv4/ip_output.c | 2 +- net/ipv4/tcp_output.c | 2 +- net/netfilter/ipvs/ip_vs_sync.c | 4 ++-- 5 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c index 74f05ed6aff29..60c854e7d98ba 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5036,14 +5036,14 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname, /* Only some socketops are supported */ switch (optname) { case SO_RCVBUF: - val = min_t(u32, val, sysctl_rmem_max); + val = min_t(u32, val, READ_ONCE(sysctl_rmem_max)); val = min_t(int, val, INT_MAX / 2); sk->sk_userlocks |= SOCK_RCVBUF_LOCK; WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF)); break; case SO_SNDBUF: - val = min_t(u32, val, sysctl_wmem_max); + val = min_t(u32, val, READ_ONCE(sysctl_wmem_max)); val = min_t(int, val, INT_MAX / 2); sk->sk_userlocks |= SOCK_SNDBUF_LOCK; WRITE_ONCE(sk->sk_sndbuf, diff --git a/net/core/sock.c b/net/core/sock.c index 2ff40dd0a7a65..62f69bc3a0e6e 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1100,7 +1100,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname, * play 'guess the biggest size' games. RCVBUF/SNDBUF * are treated in BSD as hints */ - val = min_t(u32, val, sysctl_wmem_max); + val = min_t(u32, val, READ_ONCE(sysctl_wmem_max)); set_sndbuf: /* Ensure val * 2 fits into an int, to prevent max_t() * from treating it as a negative value. @@ -1132,7 +1132,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname, * play 'guess the biggest size' games. RCVBUF/SNDBUF * are treated in BSD as hints */ - __sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max)); + __sock_set_rcvbuf(sk, min_t(u32, val, READ_ONCE(sysctl_rmem_max))); break;
case SO_RCVBUFFORCE: @@ -3307,8 +3307,8 @@ void sock_init_data(struct socket *sock, struct sock *sk) timer_setup(&sk->sk_timer, NULL, 0);
sk->sk_allocation = GFP_KERNEL; - sk->sk_rcvbuf = sysctl_rmem_default; - sk->sk_sndbuf = sysctl_wmem_default; + sk->sk_rcvbuf = READ_ONCE(sysctl_rmem_default); + sk->sk_sndbuf = READ_ONCE(sysctl_wmem_default); sk->sk_state = TCP_CLOSE; sk_set_socket(sk, sock);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 00b4bf26fd932..da8b3cc67234d 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1712,7 +1712,7 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
sk->sk_protocol = ip_hdr(skb)->protocol; sk->sk_bound_dev_if = arg->bound_dev_if; - sk->sk_sndbuf = sysctl_wmem_default; + sk->sk_sndbuf = READ_ONCE(sysctl_wmem_default); ipc.sockc.mark = fl4.flowi4_mark; err = ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base, len, 0, &ipc, &rt, MSG_DONTWAIT); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index aed0c5f828bef..84314de754f87 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -239,7 +239,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss, if (wscale_ok) { /* Set window scaling on max possible window */ space = max_t(u32, space, READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2])); - space = max_t(u32, space, sysctl_rmem_max); + space = max_t(u32, space, READ_ONCE(sysctl_rmem_max)); space = min_t(u32, space, *window_clamp); *rcv_wscale = clamp_t(int, ilog2(space) - 15, 0, TCP_MAX_WSCALE); diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c index 9d43277b8b4fe..a56fd0b5a430a 100644 --- a/net/netfilter/ipvs/ip_vs_sync.c +++ b/net/netfilter/ipvs/ip_vs_sync.c @@ -1280,12 +1280,12 @@ static void set_sock_size(struct sock *sk, int mode, int val) lock_sock(sk); if (mode) { val = clamp_t(int, val, (SOCK_MIN_SNDBUF + 1) / 2, - sysctl_wmem_max); + READ_ONCE(sysctl_wmem_max)); sk->sk_sndbuf = val * 2; sk->sk_userlocks |= SOCK_SNDBUF_LOCK; } else { val = clamp_t(int, val, (SOCK_MIN_RCVBUF + 1) / 2, - sysctl_rmem_max); + READ_ONCE(sysctl_rmem_max)); sk->sk_rcvbuf = val * 2; sk->sk_userlocks |= SOCK_RCVBUF_LOCK; }
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit bf955b5ab8f6f7b0632cdef8e36b14e4f6e77829 ]
While reading weight_p, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader.
Also, dev_[rt]x_weight can be read and written at the same time, so we need to use READ_ONCE() and WRITE_ONCE() for those accesses. Moreover, to ensure the same weight_p value is used while changing dev_[rt]x_weight, we add a mutex in proc_do_dev_weight().
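Condensed from the diff below, the handler now serializes updates so that both derived weights are computed from a single snapshot of weight_p:

	static DEFINE_MUTEX(dev_weight_mutex);
	...
	mutex_lock(&dev_weight_mutex);
	ret = proc_dointvec(table, write, buffer, lenp, ppos);
	if (!ret && write) {
		weight = READ_ONCE(weight_p);
		WRITE_ONCE(dev_rx_weight, weight * dev_weight_rx_bias);
		WRITE_ONCE(dev_tx_weight, weight * dev_weight_tx_bias);
	}
	mutex_unlock(&dev_weight_mutex);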
Fixes: 3d48b53fb2ae ("net: dev_weight: TX/RX orthogonality") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/dev.c | 2 +- net/core/sysctl_net_core.c | 15 +++++++++------ net/sched/sch_generic.c | 2 +- 3 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c index 30a1603a7225c..2a7d81cd9e2ea 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5917,7 +5917,7 @@ static int process_backlog(struct napi_struct *napi, int quota) net_rps_action_and_irq_enable(sd); }
- napi->weight = dev_rx_weight; + napi->weight = READ_ONCE(dev_rx_weight); while (again) { struct sk_buff *skb;
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c index 71a13596ea2bf..725891527814c 100644 --- a/net/core/sysctl_net_core.c +++ b/net/core/sysctl_net_core.c @@ -234,14 +234,17 @@ static int set_default_qdisc(struct ctl_table *table, int write, static int proc_do_dev_weight(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { - int ret; + static DEFINE_MUTEX(dev_weight_mutex); + int ret, weight;
+ mutex_lock(&dev_weight_mutex); ret = proc_dointvec(table, write, buffer, lenp, ppos); - if (ret != 0) - return ret; - - dev_rx_weight = weight_p * dev_weight_rx_bias; - dev_tx_weight = weight_p * dev_weight_tx_bias; + if (!ret && write) { + weight = READ_ONCE(weight_p); + WRITE_ONCE(dev_rx_weight, weight * dev_weight_rx_bias); + WRITE_ONCE(dev_tx_weight, weight * dev_weight_tx_bias); + } + mutex_unlock(&dev_weight_mutex);
return ret; } diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index dba0b3e24af5e..a64c3c1541118 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -409,7 +409,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
void __qdisc_run(struct Qdisc *q) { - int quota = dev_tx_weight; + int quota = READ_ONCE(dev_tx_weight); int packets;
while (qdisc_restart(q, &packets)) {
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit 5dcd08cd19912892586c6082d56718333e2d19db ]
While reading netdev_max_backlog, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
While at it, we remove the unnecessary spaces in the doc.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- Documentation/admin-guide/sysctl/net.rst | 2 +- net/core/dev.c | 4 ++-- net/core/gro_cells.c | 2 +- net/xfrm/espintcp.c | 2 +- net/xfrm/xfrm_input.c | 2 +- 5 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index fcd650bdbc7e2..01d9858197832 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -271,7 +271,7 @@ poll cycle or the number of packets processed reaches netdev_budget. netdev_max_backlog ------------------
-Maximum number of packets, queued on the INPUT side, when the interface +Maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them.
netdev_rss_key diff --git a/net/core/dev.c b/net/core/dev.c index 2a7d81cd9e2ea..e1496e626a532 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4623,7 +4623,7 @@ static bool skb_flow_limit(struct sk_buff *skb, unsigned int qlen) struct softnet_data *sd; unsigned int old_flow, new_flow;
- if (qlen < (netdev_max_backlog >> 1)) + if (qlen < (READ_ONCE(netdev_max_backlog) >> 1)) return false;
sd = this_cpu_ptr(&softnet_data); @@ -4671,7 +4671,7 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu, if (!netif_running(skb->dev)) goto drop; qlen = skb_queue_len(&sd->input_pkt_queue); - if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) { + if (qlen <= READ_ONCE(netdev_max_backlog) && !skb_flow_limit(skb, qlen)) { if (qlen) { enqueue: __skb_queue_tail(&sd->input_pkt_queue, skb); diff --git a/net/core/gro_cells.c b/net/core/gro_cells.c index 541c7a72a28a4..21619c70a82b7 100644 --- a/net/core/gro_cells.c +++ b/net/core/gro_cells.c @@ -26,7 +26,7 @@ int gro_cells_receive(struct gro_cells *gcells, struct sk_buff *skb)
cell = this_cpu_ptr(gcells->cells);
- if (skb_queue_len(&cell->napi_skbs) > netdev_max_backlog) { + if (skb_queue_len(&cell->napi_skbs) > READ_ONCE(netdev_max_backlog)) { drop: dev_core_stats_rx_dropped_inc(dev); kfree_skb(skb); diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c index 82d14eea1b5ad..974eb97b77d22 100644 --- a/net/xfrm/espintcp.c +++ b/net/xfrm/espintcp.c @@ -168,7 +168,7 @@ int espintcp_queue_out(struct sock *sk, struct sk_buff *skb) { struct espintcp_ctx *ctx = espintcp_getctx(sk);
- if (skb_queue_len(&ctx->out_queue) >= netdev_max_backlog) + if (skb_queue_len(&ctx->out_queue) >= READ_ONCE(netdev_max_backlog)) return -ENOBUFS;
__skb_queue_tail(&ctx->out_queue, skb); diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index 70a8c36f0ba6e..b2f4ec9c537f0 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -782,7 +782,7 @@ int xfrm_trans_queue_net(struct net *net, struct sk_buff *skb,
trans = this_cpu_ptr(&xfrm_trans_tasklet);
- if (skb_queue_len(&trans->queue) >= netdev_max_backlog) + if (skb_queue_len(&trans->queue) >= READ_ONCE(netdev_max_backlog)) return -ENOBUFS;
BUILD_BUG_ON(sizeof(struct xfrm_trans_cb) > sizeof(skb->cb));
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit 61adf447e38664447526698872e21c04623afb8e ]
While reading netdev_tstamp_prequeue, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 3b098e2d7c69 ("net: Consistent skb timestamping") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/dev.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c index e1496e626a532..34282b93c3f60 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4927,7 +4927,7 @@ static int netif_rx_internal(struct sk_buff *skb) { int ret;
- net_timestamp_check(netdev_tstamp_prequeue, skb); + net_timestamp_check(READ_ONCE(netdev_tstamp_prequeue), skb);
trace_netif_rx(skb);
@@ -5280,7 +5280,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc, int ret = NET_RX_DROP; __be16 type;
- net_timestamp_check(!netdev_tstamp_prequeue, skb); + net_timestamp_check(!READ_ONCE(netdev_tstamp_prequeue), skb);
trace_netif_receive_skb(skb);
@@ -5663,7 +5663,7 @@ static int netif_receive_skb_internal(struct sk_buff *skb) { int ret;
- net_timestamp_check(netdev_tstamp_prequeue, skb); + net_timestamp_check(READ_ONCE(netdev_tstamp_prequeue), skb);
if (skb_defer_rx_timestamp(skb)) return NET_RX_SUCCESS; @@ -5693,7 +5693,7 @@ void netif_receive_skb_list_internal(struct list_head *head)
INIT_LIST_HEAD(&sublist); list_for_each_entry_safe(skb, next, head, list) { - net_timestamp_check(netdev_tstamp_prequeue, skb); + net_timestamp_check(READ_ONCE(netdev_tstamp_prequeue), skb); skb_list_del_init(skb); if (!skb_defer_rx_timestamp(skb)) list_add_tail(&skb->list, &sublist);
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit 6bae8ceb90ba76cdba39496db936164fa672b9be ]
While reading rs->interval and rs->burst, they can be changed concurrently via sysctl (e.g. net_ratelimit_state). Thus, we need to add READ_ONCE() to their readers.
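Condensed from the diff below: ___ratelimit() now snapshots both values once at entry, so the interval/burst pair it acts on stays consistent even if the sysctl handler rewrites them mid-call:

	int interval = READ_ONCE(rs->interval);
	int burst = READ_ONCE(rs->burst);

	if (!interval)
		return 1;	/* ratelimiting disabled */
	...
	if (burst && burst > rs->printed)
		/* still within the burst allowed for this interval */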
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- lib/ratelimit.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/lib/ratelimit.c b/lib/ratelimit.c index e01a93f46f833..ce945c17980b9 100644 --- a/lib/ratelimit.c +++ b/lib/ratelimit.c @@ -26,10 +26,16 @@ */ int ___ratelimit(struct ratelimit_state *rs, const char *func) { + /* Paired with WRITE_ONCE() in .proc_handler(). + * Changing two values separately could be inconsistent + * and some message could be lost. (See: net_ratelimit_state). + */ + int interval = READ_ONCE(rs->interval); + int burst = READ_ONCE(rs->burst); unsigned long flags; int ret;
- if (!rs->interval) + if (!interval) return 1;
/* @@ -44,7 +50,7 @@ int ___ratelimit(struct ratelimit_state *rs, const char *func) if (!rs->begin) rs->begin = jiffies;
- if (time_is_before_jiffies(rs->begin + rs->interval)) { + if (time_is_before_jiffies(rs->begin + interval)) { if (rs->missed) { if (!(rs->flags & RATELIMIT_MSG_ON_RELEASE)) { printk_deferred(KERN_WARNING @@ -56,7 +62,7 @@ int ___ratelimit(struct ratelimit_state *rs, const char *func) rs->begin = jiffies; rs->printed = 0; } - if (rs->burst && rs->burst > rs->printed) { + if (burst && burst > rs->printed) { rs->printed++; ret = 1; } else {
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit 7de6d09f51917c829af2b835aba8bb5040f8e86a ]
While reading sysctl_optmem_max, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
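The single snapshot also matters where the limit is compared twice in one check; condensed from the sock_kmalloc() hunk below, both comparisons see the same value even if the sysctl changes between them:

	int optmem_max = READ_ONCE(sysctl_optmem_max);

	if ((unsigned int)size <= optmem_max &&
	    atomic_read(&sk->sk_omem_alloc) + size < optmem_max) {
		/* safe to charge: both tests used one consistent limit */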
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/bpf_sk_storage.c | 5 +++-- net/core/filter.c | 9 +++++---- net/core/sock.c | 8 +++++--- net/ipv4/ip_sockglue.c | 6 +++--- net/ipv6/ipv6_sockglue.c | 4 ++-- 5 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c index 1b7f385643b4c..94374d529ea42 100644 --- a/net/core/bpf_sk_storage.c +++ b/net/core/bpf_sk_storage.c @@ -310,11 +310,12 @@ BPF_CALL_2(bpf_sk_storage_delete, struct bpf_map *, map, struct sock *, sk) static int bpf_sk_storage_charge(struct bpf_local_storage_map *smap, void *owner, u32 size) { + int optmem_max = READ_ONCE(sysctl_optmem_max); struct sock *sk = (struct sock *)owner;
/* same check as in sock_kmalloc() */ - if (size <= sysctl_optmem_max && - atomic_read(&sk->sk_omem_alloc) + size < sysctl_optmem_max) { + if (size <= optmem_max && + atomic_read(&sk->sk_omem_alloc) + size < optmem_max) { atomic_add(size, &sk->sk_omem_alloc); return 0; } diff --git a/net/core/filter.c b/net/core/filter.c index 60c854e7d98ba..063176428086b 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1214,10 +1214,11 @@ void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp) static bool __sk_filter_charge(struct sock *sk, struct sk_filter *fp) { u32 filter_size = bpf_prog_size(fp->prog->len); + int optmem_max = READ_ONCE(sysctl_optmem_max);
/* same check as in sock_kmalloc() */ - if (filter_size <= sysctl_optmem_max && - atomic_read(&sk->sk_omem_alloc) + filter_size < sysctl_optmem_max) { + if (filter_size <= optmem_max && + atomic_read(&sk->sk_omem_alloc) + filter_size < optmem_max) { atomic_add(filter_size, &sk->sk_omem_alloc); return true; } @@ -1548,7 +1549,7 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk) if (IS_ERR(prog)) return PTR_ERR(prog);
- if (bpf_prog_size(prog->len) > sysctl_optmem_max) + if (bpf_prog_size(prog->len) > READ_ONCE(sysctl_optmem_max)) err = -ENOMEM; else err = reuseport_attach_prog(sk, prog); @@ -1615,7 +1616,7 @@ int sk_reuseport_attach_bpf(u32 ufd, struct sock *sk) } } else { /* BPF_PROG_TYPE_SOCKET_FILTER */ - if (bpf_prog_size(prog->len) > sysctl_optmem_max) { + if (bpf_prog_size(prog->len) > READ_ONCE(sysctl_optmem_max)) { err = -ENOMEM; goto err_prog_put; } diff --git a/net/core/sock.c b/net/core/sock.c index 62f69bc3a0e6e..d672e63a5c2d4 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2535,7 +2535,7 @@ struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
/* small safe race: SKB_TRUESIZE may differ from final skb->truesize */ if (atomic_read(&sk->sk_omem_alloc) + SKB_TRUESIZE(size) > - sysctl_optmem_max) + READ_ONCE(sysctl_optmem_max)) return NULL;
skb = alloc_skb(size, priority); @@ -2553,8 +2553,10 @@ struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size, */ void *sock_kmalloc(struct sock *sk, int size, gfp_t priority) { - if ((unsigned int)size <= sysctl_optmem_max && - atomic_read(&sk->sk_omem_alloc) + size < sysctl_optmem_max) { + int optmem_max = READ_ONCE(sysctl_optmem_max); + + if ((unsigned int)size <= optmem_max && + atomic_read(&sk->sk_omem_alloc) + size < optmem_max) { void *mem; /* First do the add, to avoid the race if kmalloc * might sleep. diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index a8a323ecbb54b..e49a61a053a68 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -772,7 +772,7 @@ static int ip_set_mcast_msfilter(struct sock *sk, sockptr_t optval, int optlen)
if (optlen < GROUP_FILTER_SIZE(0)) return -EINVAL; - if (optlen > sysctl_optmem_max) + if (optlen > READ_ONCE(sysctl_optmem_max)) return -ENOBUFS;
gsf = memdup_sockptr(optval, optlen); @@ -808,7 +808,7 @@ static int compat_ip_set_mcast_msfilter(struct sock *sk, sockptr_t optval,
if (optlen < size0) return -EINVAL; - if (optlen > sysctl_optmem_max - 4) + if (optlen > READ_ONCE(sysctl_optmem_max) - 4) return -ENOBUFS;
p = kmalloc(optlen + 4, GFP_KERNEL); @@ -1233,7 +1233,7 @@ static int do_ip_setsockopt(struct sock *sk, int level, int optname,
if (optlen < IP_MSFILTER_SIZE(0)) goto e_inval; - if (optlen > sysctl_optmem_max) { + if (optlen > READ_ONCE(sysctl_optmem_max)) { err = -ENOBUFS; break; } diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c index 222f6bf220ba0..e0dcc7a193df2 100644 --- a/net/ipv6/ipv6_sockglue.c +++ b/net/ipv6/ipv6_sockglue.c @@ -210,7 +210,7 @@ static int ipv6_set_mcast_msfilter(struct sock *sk, sockptr_t optval,
if (optlen < GROUP_FILTER_SIZE(0)) return -EINVAL; - if (optlen > sysctl_optmem_max) + if (optlen > READ_ONCE(sysctl_optmem_max)) return -ENOBUFS;
gsf = memdup_sockptr(optval, optlen); @@ -244,7 +244,7 @@ static int compat_ipv6_set_mcast_msfilter(struct sock *sk, sockptr_t optval,
if (optlen < size0) return -EINVAL; - if (optlen > sysctl_optmem_max - 4) + if (optlen > READ_ONCE(sysctl_optmem_max) - 4) return -ENOBUFS;
p = kmalloc(optlen + 4, GFP_KERNEL);
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit d2154b0afa73c0159b2856f875c6b4fe7cf6a95e ]
While reading sysctl_tstamp_allow_data, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: b245be1f4db1 ("net-timestamp: no-payload only sysctl") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/skbuff.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 5b3559cb1d827..bebf58464d667 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4772,7 +4772,7 @@ static bool skb_may_tx_timestamp(struct sock *sk, bool tsonly) { bool ret;
- if (likely(sysctl_tstamp_allow_data || tsonly)) + if (likely(READ_ONCE(sysctl_tstamp_allow_data) || tsonly)) return true;
read_lock_bh(&sk->sk_callback_lock);
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit c42b7cddea47503411bfb5f2f93a4154aaffa2d9 ]
While reading sysctl_net_busy_poll, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 060212928670 ("net: add low latency socket poll") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- include/net/busy_poll.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h index c4898fcbf923b..f90f0021f5f2d 100644 --- a/include/net/busy_poll.h +++ b/include/net/busy_poll.h @@ -33,7 +33,7 @@ extern unsigned int sysctl_net_busy_poll __read_mostly;
static inline bool net_busy_loop_on(void) { - return sysctl_net_busy_poll; + return READ_ONCE(sysctl_net_busy_poll); }
static inline bool sk_can_busy_loop(const struct sock *sk)
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit e59ef36f0795696ab229569c153936bfd068d21c ]
While reading sysctl_net_busy_read, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 2d48d67fa8cd ("net: poll/select low latency socket support") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/sock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/sock.c b/net/core/sock.c index d672e63a5c2d4..16ab5ef749c60 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -3365,7 +3365,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
#ifdef CONFIG_NET_RX_BUSY_POLL sk->sk_napi_id = 0; - sk->sk_ll_usec = sysctl_net_busy_read; + sk->sk_ll_usec = READ_ONCE(sysctl_net_busy_read); #endif
sk->sk_max_pacing_rate = ~0UL;
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit 2e0c42374ee32e72948559d2ae2f7ba3dc6b977c ]
While reading netdev_budget, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 51b0bdedb8e7 ("[NET]: Separate two usages of netdev_max_backlog.") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/dev.c b/net/core/dev.c index 34282b93c3f60..a330f93629314 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6647,7 +6647,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) struct softnet_data *sd = this_cpu_ptr(&softnet_data); unsigned long time_limit = jiffies + usecs_to_jiffies(netdev_budget_usecs); - int budget = netdev_budget; + int budget = READ_ONCE(netdev_budget); LIST_HEAD(list); LIST_HEAD(repoll);
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit 657b991afb89d25fe6c4783b1b75a8ad4563670d ]
While reading sysctl_max_skb_frags, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/ipv4/tcp.c | 4 ++-- net/mptcp/protocol.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 3ae2ea0488838..3d446773ff2a5 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1000,7 +1000,7 @@ static struct sk_buff *tcp_build_frag(struct sock *sk, int size_goal, int flags,
i = skb_shinfo(skb)->nr_frags; can_coalesce = skb_can_coalesce(skb, i, page, offset); - if (!can_coalesce && i >= sysctl_max_skb_frags) { + if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) { tcp_mark_push(tp, skb); goto new_segment; } @@ -1348,7 +1348,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (!skb_can_coalesce(skb, i, pfrag->page, pfrag->offset)) { - if (i >= sysctl_max_skb_frags) { + if (i >= READ_ONCE(sysctl_max_skb_frags)) { tcp_mark_push(tp, skb); goto new_segment; } diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 3d90fa9653ef3..513f571a082ba 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -1299,7 +1299,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
i = skb_shinfo(skb)->nr_frags; can_coalesce = skb_can_coalesce(skb, i, dfrag->page, offset); - if (!can_coalesce && i >= sysctl_max_skb_frags) { + if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) { tcp_mark_push(tcp_sk(ssk), skb); goto alloc_skb; }
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit fa45d484c52c73f79db2c23b0cdfc6c6455093ad ]
While reading netdev_budget_usecs, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 7acf8a1e8a28 ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/dev.c b/net/core/dev.c index a330f93629314..19baeaf65a646 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6646,7 +6646,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h) { struct softnet_data *sd = this_cpu_ptr(&softnet_data); unsigned long time_limit = jiffies + - usecs_to_jiffies(netdev_budget_usecs); + usecs_to_jiffies(READ_ONCE(netdev_budget_usecs)); int budget = READ_ONCE(netdev_budget); LIST_HEAD(list); LIST_HEAD(repoll);
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit af67508ea6cbf0e4ea27f8120056fa2efce127dd ]
While reading sysctl_fb_tunnels_only_for_init_net, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/netdevice.h | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 2563d30736e9a..78dd63a5c7c80 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -640,9 +640,14 @@ extern int sysctl_devconf_inherit_init_net; */ static inline bool net_has_fallback_tunnels(const struct net *net) { - return !IS_ENABLED(CONFIG_SYSCTL) || - !sysctl_fb_tunnels_only_for_init_net || - (net == &init_net && sysctl_fb_tunnels_only_for_init_net == 1); +#if IS_ENABLED(CONFIG_SYSCTL) + int fb_tunnels_only_for_init_net = READ_ONCE(sysctl_fb_tunnels_only_for_init_net); + + return !fb_tunnels_only_for_init_net || + (net_eq(net, &init_net) && fb_tunnels_only_for_init_net == 1); +#else + return true; +#endif }
static inline int netdev_queue_numa_node_read(const struct netdev_queue *q)
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit a5612ca10d1aa05624ebe72633e0c8c792970833 ]
While reading sysctl_devconf_inherit_init_net, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 856c395cfa63 ("net: introduce a knob to control whether to inherit devconf config") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- include/linux/netdevice.h | 9 +++++++++ net/ipv4/devinet.c | 16 ++++++++++------ net/ipv6/addrconf.c | 5 ++--- 3 files changed, 21 insertions(+), 9 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 78dd63a5c7c80..db40bc62213bd 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -650,6 +650,15 @@ static inline bool net_has_fallback_tunnels(const struct net *net) #endif }
+static inline int net_inherit_devconf(void) +{ +#if IS_ENABLED(CONFIG_SYSCTL) + return READ_ONCE(sysctl_devconf_inherit_init_net); +#else + return 0; +#endif +} + static inline int netdev_queue_numa_node_read(const struct netdev_queue *q) { #if defined(CONFIG_XPS) && defined(CONFIG_NUMA) diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index b2366ad540e62..787a44e3222db 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -2682,23 +2682,27 @@ static __net_init int devinet_init_net(struct net *net) #endif
if (!net_eq(net, &init_net)) { - if (IS_ENABLED(CONFIG_SYSCTL) && - sysctl_devconf_inherit_init_net == 3) { + switch (net_inherit_devconf()) { + case 3: /* copy from the current netns */ memcpy(all, current->nsproxy->net_ns->ipv4.devconf_all, sizeof(ipv4_devconf)); memcpy(dflt, current->nsproxy->net_ns->ipv4.devconf_dflt, sizeof(ipv4_devconf_dflt)); - } else if (!IS_ENABLED(CONFIG_SYSCTL) || - sysctl_devconf_inherit_init_net != 2) { - /* inherit == 0 or 1: copy from init_net */ + break; + case 0: + case 1: + /* copy from init_net */ memcpy(all, init_net.ipv4.devconf_all, sizeof(ipv4_devconf)); memcpy(dflt, init_net.ipv4.devconf_dflt, sizeof(ipv4_devconf_dflt)); + break; + case 2: + /* use compiled values */ + break; } - /* else inherit == 2: use compiled values */ }
#ifdef CONFIG_SYSCTL diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 49cc6587dd771..b738eb7e1cae8 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -7158,9 +7158,8 @@ static int __net_init addrconf_init_net(struct net *net) if (!dflt) goto err_alloc_dflt;
- if (IS_ENABLED(CONFIG_SYSCTL) && - !net_eq(net, &init_net)) { - switch (sysctl_devconf_inherit_init_net) { + if (!net_eq(net, &init_net)) { + switch (net_inherit_devconf()) { case 1: /* copy from init_net */ memcpy(all, init_net.ipv6.devconf_all, sizeof(ipv6_devconf));
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit 8db24af3f02ebdbf302196006ebb270c4c3a2706 ]
While reading gro_normal_batch, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 323ebb61e32b ("net: use listified RX for handling GRO_NORMAL skbs") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Acked-by: Edward Cree ecree.xilinx@gmail.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- include/net/gro.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/net/gro.h b/include/net/gro.h index 867656b0739c0..24003dea8fa4d 100644 --- a/include/net/gro.h +++ b/include/net/gro.h @@ -439,7 +439,7 @@ static inline void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb, { list_add_tail(&skb->list, &napi->rx_list); napi->rx_count += segs; - if (napi->rx_count >= gro_normal_batch) + if (napi->rx_count >= READ_ONCE(gro_normal_batch)) gro_normal_list(napi); }
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit 05e49cfc89e4f325eebbc62d24dd122e55f94c23 ]
While reading netdev_unregister_timeout_secs, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 5aa3afe107d9 ("net: make unregister netdev warning timeout configurable") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Acked-by: Dmitry Vyukov dvyukov@google.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/core/dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/dev.c b/net/core/dev.c index 19baeaf65a646..a77a979a4bf75 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -10265,7 +10265,7 @@ static struct net_device *netdev_wait_allrefs_any(struct list_head *list) return dev;
if (time_after(jiffies, warning_time + - netdev_unregister_timeout_secs * HZ)) { + READ_ONCE(netdev_unregister_timeout_secs) * HZ)) { list_for_each_entry(dev, list, todo_list) { pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n", dev->name, netdev_refcnt_read(dev));
From: Kuniyuki Iwashima kuniyu@amazon.com
[ Upstream commit 3c9ba81d72047f2e81bb535d42856517b613aba7 ]
While reading sysctl_somaxconn, it can be changed concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima kuniyu@amazon.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Sasha Levin sashal@kernel.org --- net/socket.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/socket.c b/net/socket.c index 96300cdc06251..34102aa4ab0a6 100644 --- a/net/socket.c +++ b/net/socket.c @@ -1801,7 +1801,7 @@ int __sys_listen(int fd, int backlog)
sock = sockfd_lookup_light(fd, &err, &fput_needed); if (sock) { - somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn; + somaxconn = READ_ONCE(sock_net(sock->sk)->core.sysctl_somaxconn); if ((unsigned int)backlog > somaxconn) backlog = somaxconn;
From: Jacob Keller jacob.e.keller@intel.com
[ Upstream commit 25d7a5f5a6bb15a2dae0a3f39ea5dda215024726 ]
The ixgbe_ptp_start_cyclecounter() function is intended to be called whenever the cyclecounter parameters need to be changed.
Since commit a9763f3cb54c ("ixgbe: Update PTP to support X550EM_x devices"), this function has cleared the SYSTIME registers and reset the TSAUXC DISABLE_SYSTIME bit.
While these need to be cleared during ixgbe_ptp_reset, it is wrong to clear them during ixgbe_ptp_start_cyclecounter. This function may be called during both reset and link status change. When link changes, the SYSTIME counter is still operating normally, but the cyclecounter should be updated to account for the possibly changed parameters.
Clearing SYSTIME when link changes causes the timecounter to jump because the cycle counter now reads zero.
Extract the SYSTIME initialization out to a new function and call this during ixgbe_ptp_reset. This prevents the timecounter adjustment and avoids an unnecessary reset of the current time.
This also restores the original SYSTIME clearing that occurred during ixgbe_ptp_reset before the commit above.
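Condensed from the diff below, the reset path now performs the two steps explicitly, while a plain link change keeps calling only the first:

	/* in ixgbe_ptp_reset(): */
	ixgbe_ptp_start_cyclecounter(adapter);	/* recompute cyclecounter parameters */
	ixgbe_ptp_init_systime(adapter);	/* zero and re-enable SYSTIME (reset only) */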
Reported-by: Steve Payne spayne@aurora.tech Reported-by: Ilya Evenbach ievenbach@aurora.tech Fixes: a9763f3cb54c ("ixgbe: Update PTP to support X550EM_x devices") Signed-off-by: Jacob Keller jacob.e.keller@intel.com Tested-by: Gurucharan gurucharanx.g@intel.com (A Contingent worker at Intel) Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 59 +++++++++++++++----- 1 file changed, 46 insertions(+), 13 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c index 336426a67ac1b..38cda659f65f4 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c @@ -1208,7 +1208,6 @@ void ixgbe_ptp_start_cyclecounter(struct ixgbe_adapter *adapter) struct cyclecounter cc; unsigned long flags; u32 incval = 0; - u32 tsauxc = 0; u32 fuse0 = 0;
/* For some of the boards below this mask is technically incorrect. @@ -1243,18 +1242,6 @@ void ixgbe_ptp_start_cyclecounter(struct ixgbe_adapter *adapter) case ixgbe_mac_x550em_a: case ixgbe_mac_X550: cc.read = ixgbe_ptp_read_X550; - - /* enable SYSTIME counter */ - IXGBE_WRITE_REG(hw, IXGBE_SYSTIMR, 0); - IXGBE_WRITE_REG(hw, IXGBE_SYSTIML, 0); - IXGBE_WRITE_REG(hw, IXGBE_SYSTIMH, 0); - tsauxc = IXGBE_READ_REG(hw, IXGBE_TSAUXC); - IXGBE_WRITE_REG(hw, IXGBE_TSAUXC, - tsauxc & ~IXGBE_TSAUXC_DISABLE_SYSTIME); - IXGBE_WRITE_REG(hw, IXGBE_TSIM, IXGBE_TSIM_TXTS); - IXGBE_WRITE_REG(hw, IXGBE_EIMS, IXGBE_EIMS_TIMESYNC); - - IXGBE_WRITE_FLUSH(hw); break; case ixgbe_mac_X540: cc.read = ixgbe_ptp_read_82599; @@ -1286,6 +1273,50 @@ void ixgbe_ptp_start_cyclecounter(struct ixgbe_adapter *adapter) spin_unlock_irqrestore(&adapter->tmreg_lock, flags); }
+/** + * ixgbe_ptp_init_systime - Initialize SYSTIME registers + * @adapter: the ixgbe private board structure + * + * Initialize and start the SYSTIME registers. + */ +static void ixgbe_ptp_init_systime(struct ixgbe_adapter *adapter) +{ + struct ixgbe_hw *hw = &adapter->hw; + u32 tsauxc; + + switch (hw->mac.type) { + case ixgbe_mac_X550EM_x: + case ixgbe_mac_x550em_a: + case ixgbe_mac_X550: + tsauxc = IXGBE_READ_REG(hw, IXGBE_TSAUXC); + + /* Reset SYSTIME registers to 0 */ + IXGBE_WRITE_REG(hw, IXGBE_SYSTIMR, 0); + IXGBE_WRITE_REG(hw, IXGBE_SYSTIML, 0); + IXGBE_WRITE_REG(hw, IXGBE_SYSTIMH, 0); + + /* Reset interrupt settings */ + IXGBE_WRITE_REG(hw, IXGBE_TSIM, IXGBE_TSIM_TXTS); + IXGBE_WRITE_REG(hw, IXGBE_EIMS, IXGBE_EIMS_TIMESYNC); + + /* Activate the SYSTIME counter */ + IXGBE_WRITE_REG(hw, IXGBE_TSAUXC, + tsauxc & ~IXGBE_TSAUXC_DISABLE_SYSTIME); + break; + case ixgbe_mac_X540: + case ixgbe_mac_82599EB: + /* Reset SYSTIME registers to 0 */ + IXGBE_WRITE_REG(hw, IXGBE_SYSTIML, 0); + IXGBE_WRITE_REG(hw, IXGBE_SYSTIMH, 0); + break; + default: + /* Other devices aren't supported */ + return; + }; + + IXGBE_WRITE_FLUSH(hw); +} + /** * ixgbe_ptp_reset * @adapter: the ixgbe private board structure @@ -1312,6 +1343,8 @@ void ixgbe_ptp_reset(struct ixgbe_adapter *adapter)
ixgbe_ptp_start_cyclecounter(adapter);
+ ixgbe_ptp_init_systime(adapter); + spin_lock_irqsave(&adapter->tmreg_lock, flags); timecounter_init(&adapter->hw_tc, &adapter->hw_cc, ktime_to_ns(ktime_get_real()));
From: Sylwester Dziedziuch sylwesterx.dziedziuch@intel.com
[ Upstream commit bcf3a156429306070afbfda5544f2b492d25e75b ]
It was not possible to create a 1-tuple flow director rule for the IPv6 flow type. This was caused by incorrectly checking the source IP address when validating the user-provided destination IP address.
Fix this by changing ip6src to correct ip6dst address in destination IP address validation for IPv6 flow type.
Fixes: efca91e89b67 ("i40e: Add flow director support for IPv6") Signed-off-by: Sylwester Dziedziuch sylwesterx.dziedziuch@intel.com Tested-by: Gurucharan gurucharanx.g@intel.com (A Contingent worker at Intel) Signed-off-by: Tony Nguyen anthony.l.nguyen@intel.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 19704f5c8291c..22a61802a4027 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -4395,7 +4395,7 @@ static int i40e_check_fdir_input_set(struct i40e_vsi *vsi, (struct in6_addr *)&ipv6_full_mask)) new_mask |= I40E_L3_V6_DST_MASK; else if (ipv6_addr_any((struct in6_addr *) - &usr_ip6_spec->ip6src)) + &usr_ip6_spec->ip6dst)) new_mask &= ~I40E_L3_V6_DST_MASK; else return -EOPNOTSUPP;
From: Lorenzo Bianconi lorenzo@kernel.org
[ Upstream commit da6e113ff010815fdd21ee1e9af2e8d179a2680f ]
Enable rx checksum offload for mt7986 chipset.
Signed-off-by: Lorenzo Bianconi lorenzo@kernel.org Link: https://lore.kernel.org/r/c8699805c18f7fd38315fcb8da2787676d83a32c.165454458... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/mediatek/mtk_eth_soc.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c index 59c9a10f83ba5..6beb3d4873a37 100644 --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c @@ -1444,8 +1444,8 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget, int done = 0, bytes = 0;
while (done < budget) { + unsigned int pktlen, *rxdcsum; struct net_device *netdev; - unsigned int pktlen; dma_addr_t dma_addr; u32 hash, reason; int mac = 0; @@ -1512,7 +1512,13 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget, pktlen = RX_DMA_GET_PLEN0(trxd.rxd2); skb->dev = netdev; skb_put(skb, pktlen); - if (trxd.rxd4 & eth->soc->txrx.rx_dma_l4_valid) + + if (MTK_HAS_CAPS(eth->soc->caps, MTK_NETSYS_V2)) + rxdcsum = &trxd.rxd3; + else + rxdcsum = &trxd.rxd4; + + if (*rxdcsum & eth->soc->txrx.rx_dma_l4_valid) skb->ip_summed = CHECKSUM_UNNECESSARY; else skb_checksum_none_assert(skb); @@ -3761,6 +3767,7 @@ static const struct mtk_soc_data mt7986_data = { .txd_size = sizeof(struct mtk_tx_dma_v2), .rxd_size = sizeof(struct mtk_rx_dma_v2), .rx_irq_done_mask = MTK_RX_DONE_INT_V2, + .rx_dma_l4_valid = RX_DMA_L4_VALID_V2, .dma_max_len = MTK_TX_DMA_BUF_LEN_V2, .dma_len_offset = 8, },
From: Lorenzo Bianconi lorenzo@kernel.org
[ Upstream commit 0cf731f9ebb5bf6f252055bebf4463a5c0bd490b ]
Properly report the hw rx hash for the mt7986 chipset according to the new dma descriptor layout.
Fixes: 197c9e9b17b11 ("net: ethernet: mtk_eth_soc: introduce support for mt7986 chipset") Signed-off-by: Lorenzo Bianconi lorenzo@kernel.org Link: https://lore.kernel.org/r/091394ea4e705fbb35f828011d98d0ba33808f69.166125729... Signed-off-by: Paolo Abeni pabeni@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/mediatek/mtk_eth_soc.c | 22 +++++++++++---------- drivers/net/ethernet/mediatek/mtk_eth_soc.h | 5 +++++ 2 files changed, 17 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c b/drivers/net/ethernet/mediatek/mtk_eth_soc.c index 6beb3d4873a37..dcf0aac0aa65d 100644 --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c @@ -1513,10 +1513,19 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget, skb->dev = netdev; skb_put(skb, pktlen);
- if (MTK_HAS_CAPS(eth->soc->caps, MTK_NETSYS_V2)) + if (MTK_HAS_CAPS(eth->soc->caps, MTK_NETSYS_V2)) { + hash = trxd.rxd5 & MTK_RXD5_FOE_ENTRY; + if (hash != MTK_RXD5_FOE_ENTRY) + skb_set_hash(skb, jhash_1word(hash, 0), + PKT_HASH_TYPE_L4); rxdcsum = &trxd.rxd3; - else + } else { + hash = trxd.rxd4 & MTK_RXD4_FOE_ENTRY; + if (hash != MTK_RXD4_FOE_ENTRY) + skb_set_hash(skb, jhash_1word(hash, 0), + PKT_HASH_TYPE_L4); rxdcsum = &trxd.rxd4; + }
if (*rxdcsum & eth->soc->txrx.rx_dma_l4_valid) skb->ip_summed = CHECKSUM_UNNECESSARY; @@ -1525,16 +1534,9 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget, skb->protocol = eth_type_trans(skb, netdev); bytes += pktlen;
- hash = trxd.rxd4 & MTK_RXD4_FOE_ENTRY; - if (hash != MTK_RXD4_FOE_ENTRY) { - hash = jhash_1word(hash, 0); - skb_set_hash(skb, hash, PKT_HASH_TYPE_L4); - } - reason = FIELD_GET(MTK_RXD4_PPE_CPU_REASON, trxd.rxd4); if (reason == MTK_PPE_CPU_REASON_HIT_UNBIND_RATE_REACHED) - mtk_ppe_check_skb(eth->ppe, skb, - trxd.rxd4 & MTK_RXD4_FOE_ENTRY); + mtk_ppe_check_skb(eth->ppe, skb, hash);
if (netdev->features & NETIF_F_HW_VLAN_CTAG_RX) { if (MTK_HAS_CAPS(eth->soc->caps, MTK_NETSYS_V2)) { diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.h b/drivers/net/ethernet/mediatek/mtk_eth_soc.h index 0a632896451a4..98d6a6d047e32 100644 --- a/drivers/net/ethernet/mediatek/mtk_eth_soc.h +++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.h @@ -307,6 +307,11 @@ #define RX_DMA_L4_VALID_PDMA BIT(30) /* when PDMA is used */ #define RX_DMA_SPECIAL_TAG BIT(22)
+/* PDMA descriptor rxd5 */ +#define MTK_RXD5_FOE_ENTRY GENMASK(14, 0) +#define MTK_RXD5_PPE_CPU_REASON GENMASK(22, 18) +#define MTK_RXD5_SRC_PORT GENMASK(29, 26) + #define RX_DMA_GET_SPORT(x) (((x) >> 19) & 0xf) #define RX_DMA_GET_SPORT_V2(x) (((x) >> 26) & 0x7)
From: David Howells dhowells@redhat.com
[ Upstream commit b0f571ecd7943423c25947439045f0d352ca3dbf ]
Fix three bugs in rxrpc's sendmsg implementation:
(1) rxrpc_new_client_call() should release the socket lock when returning an error from rxrpc_get_call_slot().
(2) rxrpc_wait_for_tx_window_intr() will return without the call mutex held in the event that we're interrupted by a signal whilst waiting for tx space on the socket or relocking the call mutex afterwards.
Fix this by: (a) moving the unlock/lock of the call mutex up to rxrpc_send_data() such that the lock is not held around all of rxrpc_wait_for_tx_window*() and (b) indicating to higher callers whether we returned with the lock dropped. Note that this means recvmsg() will not block on this call whilst we're waiting.
(3) After dropping and regaining the call mutex, rxrpc_send_data() needs to recheck the state of the tx_pending buffer and redo the tx_total_len check in case we raced with another sendmsg() on the same call.
Thinking on this some more, it might make sense to have different locks for sendmsg() and recvmsg(). There's probably no need to make recvmsg() wait for sendmsg(). It does mean that recvmsg() can return MSG_EOR indicating that a call is dead before a sendmsg() to that call returns - but that can currently happen anyway.
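Condensed from the diff below, the new contract between rxrpc_send_data() and its callers: the callee reports through *_dropped_lock whether it released call->user_mutex while waiting for tx space, and the caller only unlocks if it still owns the mutex:

	bool dropped_lock = false;
	int ret;

	ret = rxrpc_send_data(rx, call, msg, len, NULL, &dropped_lock);
	...
	if (!dropped_lock)
		mutex_unlock(&call->user_mutex);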
Without fix (2), something like the following can be induced:
WARNING: bad unlock balance detected!
5.16.0-rc6-syzkaller #0 Not tainted
-------------------------------------
syz-executor011/3597 is trying to release lock (&call->user_mutex) at:
[<ffffffff885163a3>] rxrpc_do_sendmsg+0xc13/0x1350 net/rxrpc/sendmsg.c:748
but there are no more locks to release!

other info that might help us debug this:
no locks held by syz-executor011/3597.
...
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_unlock_imbalance_bug include/trace/events/lock.h:58 [inline]
 __lock_release kernel/locking/lockdep.c:5306 [inline]
 lock_release.cold+0x49/0x4e kernel/locking/lockdep.c:5657
 __mutex_unlock_slowpath+0x99/0x5e0 kernel/locking/mutex.c:900
 rxrpc_do_sendmsg+0xc13/0x1350 net/rxrpc/sendmsg.c:748
 rxrpc_sendmsg+0x420/0x630 net/rxrpc/af_rxrpc.c:561
 sock_sendmsg_nosec net/socket.c:704 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:724
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2409
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2463
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2492
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
[Thanks to Hawkins Jiawei and Khalid Masum for their attempts to fix this]
Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Reported-by: syzbot+7f0483225d0c94cb3441@syzkaller.appspotmail.com Signed-off-by: David Howells dhowells@redhat.com Reviewed-by: Marc Dionne marc.dionne@auristor.com Tested-by: syzbot+7f0483225d0c94cb3441@syzkaller.appspotmail.com cc: Hawkins Jiawei yin31149@gmail.com cc: Khalid Masum khalid.masum.92@gmail.com cc: Dan Carpenter dan.carpenter@oracle.com cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/166135894583.600315.7170979436768124075.stgit@wart... Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- net/rxrpc/call_object.c | 4 +- net/rxrpc/sendmsg.c | 92 ++++++++++++++++++++++++----------------- 2 files changed, 57 insertions(+), 39 deletions(-)
diff --git a/net/rxrpc/call_object.c b/net/rxrpc/call_object.c index 84d0a41096450..6401cdf7a6246 100644 --- a/net/rxrpc/call_object.c +++ b/net/rxrpc/call_object.c @@ -285,8 +285,10 @@ struct rxrpc_call *rxrpc_new_client_call(struct rxrpc_sock *rx, _enter("%p,%lx", rx, p->user_call_ID);
limiter = rxrpc_get_call_slot(p, gfp); - if (!limiter) + if (!limiter) { + release_sock(&rx->sk); return ERR_PTR(-ERESTARTSYS); + }
call = rxrpc_alloc_client_call(rx, srx, gfp, debug_id); if (IS_ERR(call)) { diff --git a/net/rxrpc/sendmsg.c b/net/rxrpc/sendmsg.c index 1d38e279e2efa..3c3a626459deb 100644 --- a/net/rxrpc/sendmsg.c +++ b/net/rxrpc/sendmsg.c @@ -51,10 +51,7 @@ static int rxrpc_wait_for_tx_window_intr(struct rxrpc_sock *rx, return sock_intr_errno(*timeo);
trace_rxrpc_transmit(call, rxrpc_transmit_wait); - mutex_unlock(&call->user_mutex); *timeo = schedule_timeout(*timeo); - if (mutex_lock_interruptible(&call->user_mutex) < 0) - return sock_intr_errno(*timeo); } }
@@ -290,37 +287,48 @@ static int rxrpc_queue_packet(struct rxrpc_sock *rx, struct rxrpc_call *call, static int rxrpc_send_data(struct rxrpc_sock *rx, struct rxrpc_call *call, struct msghdr *msg, size_t len, - rxrpc_notify_end_tx_t notify_end_tx) + rxrpc_notify_end_tx_t notify_end_tx, + bool *_dropped_lock) { struct rxrpc_skb_priv *sp; struct sk_buff *skb; struct sock *sk = &rx->sk; + enum rxrpc_call_state state; long timeo; - bool more; - int ret, copied; + bool more = msg->msg_flags & MSG_MORE; + int ret, copied = 0;
timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
/* this should be in poll */ sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
+reload: + ret = -EPIPE; if (sk->sk_shutdown & SEND_SHUTDOWN) - return -EPIPE; - - more = msg->msg_flags & MSG_MORE; - + goto maybe_error; + state = READ_ONCE(call->state); + ret = -ESHUTDOWN; + if (state >= RXRPC_CALL_COMPLETE) + goto maybe_error; + ret = -EPROTO; + if (state != RXRPC_CALL_CLIENT_SEND_REQUEST && + state != RXRPC_CALL_SERVER_ACK_REQUEST && + state != RXRPC_CALL_SERVER_SEND_REPLY) + goto maybe_error; + + ret = -EMSGSIZE; if (call->tx_total_len != -1) { - if (len > call->tx_total_len) - return -EMSGSIZE; - if (!more && len != call->tx_total_len) - return -EMSGSIZE; + if (len - copied > call->tx_total_len) + goto maybe_error; + if (!more && len - copied != call->tx_total_len) + goto maybe_error; }
skb = call->tx_pending; call->tx_pending = NULL; rxrpc_see_skb(skb, rxrpc_skb_seen);
- copied = 0; do { /* Check to see if there's a ping ACK to reply to. */ if (call->ackr_reason == RXRPC_ACK_PING_RESPONSE) @@ -331,16 +339,8 @@ static int rxrpc_send_data(struct rxrpc_sock *rx,
_debug("alloc");
- if (!rxrpc_check_tx_space(call, NULL)) { - ret = -EAGAIN; - if (msg->msg_flags & MSG_DONTWAIT) - goto maybe_error; - ret = rxrpc_wait_for_tx_window(rx, call, - &timeo, - msg->msg_flags & MSG_WAITALL); - if (ret < 0) - goto maybe_error; - } + if (!rxrpc_check_tx_space(call, NULL)) + goto wait_for_space;
/* Work out the maximum size of a packet. Assume that * the security header is going to be in the padded @@ -468,6 +468,27 @@ static int rxrpc_send_data(struct rxrpc_sock *rx, efault: ret = -EFAULT; goto out; + +wait_for_space: + ret = -EAGAIN; + if (msg->msg_flags & MSG_DONTWAIT) + goto maybe_error; + mutex_unlock(&call->user_mutex); + *_dropped_lock = true; + ret = rxrpc_wait_for_tx_window(rx, call, &timeo, + msg->msg_flags & MSG_WAITALL); + if (ret < 0) + goto maybe_error; + if (call->interruptibility == RXRPC_INTERRUPTIBLE) { + if (mutex_lock_interruptible(&call->user_mutex) < 0) { + ret = sock_intr_errno(timeo); + goto maybe_error; + } + } else { + mutex_lock(&call->user_mutex); + } + *_dropped_lock = false; + goto reload; }
/* @@ -629,6 +650,7 @@ int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len) enum rxrpc_call_state state; struct rxrpc_call *call; unsigned long now, j; + bool dropped_lock = false; int ret;
struct rxrpc_send_params p = { @@ -737,21 +759,13 @@ int rxrpc_do_sendmsg(struct rxrpc_sock *rx, struct msghdr *msg, size_t len) ret = rxrpc_send_abort_packet(call); } else if (p.command != RXRPC_CMD_SEND_DATA) { ret = -EINVAL; - } else if (rxrpc_is_client_call(call) && - state != RXRPC_CALL_CLIENT_SEND_REQUEST) { - /* request phase complete for this client call */ - ret = -EPROTO; - } else if (rxrpc_is_service_call(call) && - state != RXRPC_CALL_SERVER_ACK_REQUEST && - state != RXRPC_CALL_SERVER_SEND_REPLY) { - /* Reply phase not begun or not complete for service call. */ - ret = -EPROTO; } else { - ret = rxrpc_send_data(rx, call, msg, len, NULL); + ret = rxrpc_send_data(rx, call, msg, len, NULL, &dropped_lock); }
out_put_unlock: - mutex_unlock(&call->user_mutex); + if (!dropped_lock) + mutex_unlock(&call->user_mutex); error_put: rxrpc_put_call(call, rxrpc_call_put); _leave(" = %d", ret); @@ -779,6 +793,7 @@ int rxrpc_kernel_send_data(struct socket *sock, struct rxrpc_call *call, struct msghdr *msg, size_t len, rxrpc_notify_end_tx_t notify_end_tx) { + bool dropped_lock = false; int ret;
_enter("{%d,%s},", call->debug_id, rxrpc_call_states[call->state]); @@ -796,7 +811,7 @@ int rxrpc_kernel_send_data(struct socket *sock, struct rxrpc_call *call, case RXRPC_CALL_SERVER_ACK_REQUEST: case RXRPC_CALL_SERVER_SEND_REPLY: ret = rxrpc_send_data(rxrpc_sk(sock->sk), call, msg, len, - notify_end_tx); + notify_end_tx, &dropped_lock); break; case RXRPC_CALL_COMPLETE: read_lock_bh(&call->state_lock); @@ -810,7 +825,8 @@ int rxrpc_kernel_send_data(struct socket *sock, struct rxrpc_call *call, break; }
- mutex_unlock(&call->user_mutex); + if (!dropped_lock) + mutex_unlock(&call->user_mutex); _leave(" = %d", ret); return ret; }
From: Shannon Nelson snelson@pensando.io
[ Upstream commit 9cb9dadb8f45c67e4310e002c2f221b70312b293 ]
There is a case found in heavy testing where a link flap happens just before a firmware Recovery event and the driver gets stuck in the BROKEN state. This happens when the driver is interrupted by a FW generation change while coming back up from the link flap, causing the call to ionic_start_queues() in ionic_link_status_check() to fail. This can be addressed by having the fw_up code clear the BROKEN bit if seen, rather than waiting for a user to manually force the interface down and then back up.
Fixes: 9e8eaf8427b6 ("ionic: stop watchdog when in broken state") Signed-off-by: Shannon Nelson snelson@pensando.io Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/pensando/ionic/ionic_lif.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/pensando/ionic/ionic_lif.c b/drivers/net/ethernet/pensando/ionic/ionic_lif.c index 1443f788ee37c..d4226999547e8 100644 --- a/drivers/net/ethernet/pensando/ionic/ionic_lif.c +++ b/drivers/net/ethernet/pensando/ionic/ionic_lif.c @@ -2963,6 +2963,9 @@ static void ionic_lif_handle_fw_up(struct ionic_lif *lif)
mutex_lock(&lif->queue_lock);
+ if (test_and_clear_bit(IONIC_LIF_F_BROKEN, lif->state)) + dev_info(ionic->dev, "FW Up: clearing broken state\n"); + err = ionic_qcqs_alloc(lif); if (err) goto err_unlock;
From: Shannon Nelson snelson@pensando.io
[ Upstream commit 0fc4dd452d6c14828eed6369155c75c0ac15bab3 ]
In looping on FW update tests we occasionally see the FW_ACTIVATE_STATUS command fail while it is in its EAGAIN loop waiting for the FW activate step to finish inside the FW. The firmware is complaining that the done bit is set when a new dev_cmd is going to be processed.
Cleaning the cmd registers and doorbell before exiting the wait-for-done loop, and clearing the done bit before the sleep, prevents this from occurring.
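Condensed from the diff below, the retry path now acknowledges the previous completion before sleeping and only then rings the doorbell, so the firmware never sees a stale done bit on the retried command:

	iowrite32(0, &idev->dev_cmd_regs->done);	/* ack old completion first */
	msleep(1000);					/* let the FW settle */
	iowrite32(1, &idev->dev_cmd_regs->doorbell);	/* then resubmit */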
Fixes: fbfb8031533c ("ionic: Add hardware init and device commands") Signed-off-by: Shannon Nelson snelson@pensando.io Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/pensando/ionic/ionic_main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/pensando/ionic/ionic_main.c b/drivers/net/ethernet/pensando/ionic/ionic_main.c index 4029b4e021f86..56f93b0305519 100644 --- a/drivers/net/ethernet/pensando/ionic/ionic_main.c +++ b/drivers/net/ethernet/pensando/ionic/ionic_main.c @@ -474,8 +474,8 @@ static int __ionic_dev_cmd_wait(struct ionic *ionic, unsigned long max_seconds, ionic_opcode_to_str(opcode), opcode, ionic_error_to_str(err), err);
- msleep(1000); iowrite32(0, &idev->dev_cmd_regs->done); + msleep(1000); iowrite32(1, &idev->dev_cmd_regs->doorbell); goto try_again; } @@ -488,6 +488,8 @@ static int __ionic_dev_cmd_wait(struct ionic *ionic, unsigned long max_seconds, return ionic_error_to_errno(err); }
+ ionic_dev_cmd_clean(ionic); + return 0; }
From: R Mohamed Shah mohamed@pensando.io
[ Upstream commit 19058be7c48ceb3e60fa3948e24da1059bd68ee4 ]
Assign a random MAC address to the VF interface station address if it boots with a zero MAC address, in order to match similar behavior seen in other VF drivers. Handle the errors where older firmware does not allow the VF to set its own station address.
Newer firmware will allow the VF to set the station mac address if it hasn't already been set administratively through the PF. Setting it will also be allowed if the VF has trust.
Fixes: fbb39807e9ae ("ionic: support sr-iov operations") Signed-off-by: R Mohamed Shah mohamed@pensando.io Signed-off-by: Shannon Nelson snelson@pensando.io Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- .../net/ethernet/pensando/ionic/ionic_lif.c | 92 ++++++++++++++++++- 1 file changed, 87 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/pensando/ionic/ionic_lif.c b/drivers/net/ethernet/pensando/ionic/ionic_lif.c index d4226999547e8..0be79c5167813 100644 --- a/drivers/net/ethernet/pensando/ionic/ionic_lif.c +++ b/drivers/net/ethernet/pensando/ionic/ionic_lif.c @@ -1564,8 +1564,67 @@ static int ionic_set_features(struct net_device *netdev, return err; }
+static int ionic_set_attr_mac(struct ionic_lif *lif, u8 *mac) +{ + struct ionic_admin_ctx ctx = { + .work = COMPLETION_INITIALIZER_ONSTACK(ctx.work), + .cmd.lif_setattr = { + .opcode = IONIC_CMD_LIF_SETATTR, + .index = cpu_to_le16(lif->index), + .attr = IONIC_LIF_ATTR_MAC, + }, + }; + + ether_addr_copy(ctx.cmd.lif_setattr.mac, mac); + return ionic_adminq_post_wait(lif, &ctx); +} + +static int ionic_get_attr_mac(struct ionic_lif *lif, u8 *mac_addr) +{ + struct ionic_admin_ctx ctx = { + .work = COMPLETION_INITIALIZER_ONSTACK(ctx.work), + .cmd.lif_getattr = { + .opcode = IONIC_CMD_LIF_GETATTR, + .index = cpu_to_le16(lif->index), + .attr = IONIC_LIF_ATTR_MAC, + }, + }; + int err; + + err = ionic_adminq_post_wait(lif, &ctx); + if (err) + return err; + + ether_addr_copy(mac_addr, ctx.comp.lif_getattr.mac); + return 0; +} + +static int ionic_program_mac(struct ionic_lif *lif, u8 *mac) +{ + u8 get_mac[ETH_ALEN]; + int err; + + err = ionic_set_attr_mac(lif, mac); + if (err) + return err; + + err = ionic_get_attr_mac(lif, get_mac); + if (err) + return err; + + /* To deal with older firmware that silently ignores the set attr mac: + * doesn't actually change the mac and doesn't return an error, so we + * do the get attr to verify whether or not the set actually happened + */ + if (!ether_addr_equal(get_mac, mac)) + return 1; + + return 0; +} + static int ionic_set_mac_address(struct net_device *netdev, void *sa) { + struct ionic_lif *lif = netdev_priv(netdev); struct sockaddr *addr = sa; u8 *mac; int err; @@ -1574,6 +1633,14 @@ static int ionic_set_mac_address(struct net_device *netdev, void *sa) if (ether_addr_equal(netdev->dev_addr, mac)) return 0;
+ err = ionic_program_mac(lif, mac); + if (err < 0) + return err; + + if (err > 0) + netdev_dbg(netdev, "%s: SET and GET ATTR Mac are not equal-due to old FW running\n", + __func__); + err = eth_prepare_mac_addr_change(netdev, addr); if (err) return err; @@ -3172,6 +3239,7 @@ static int ionic_station_set(struct ionic_lif *lif) .attr = IONIC_LIF_ATTR_MAC, }, }; + u8 mac_address[ETH_ALEN]; struct sockaddr addr; int err;
@@ -3180,8 +3248,23 @@ static int ionic_station_set(struct ionic_lif *lif) return err; netdev_dbg(lif->netdev, "found initial MAC addr %pM\n", ctx.comp.lif_getattr.mac); - if (is_zero_ether_addr(ctx.comp.lif_getattr.mac)) - return 0; + ether_addr_copy(mac_address, ctx.comp.lif_getattr.mac); + + if (is_zero_ether_addr(mac_address)) { + eth_hw_addr_random(netdev); + netdev_dbg(netdev, "Random Mac generated: %pM\n", netdev->dev_addr); + ether_addr_copy(mac_address, netdev->dev_addr); + + err = ionic_program_mac(lif, mac_address); + if (err < 0) + return err; + + if (err > 0) { + netdev_dbg(netdev, "%s:SET/GET ATTR Mac are not same-due to old FW running\n", + __func__); + return 0; + } + }
if (!is_zero_ether_addr(netdev->dev_addr)) { /* If the netdev mac is non-zero and doesn't match the default @@ -3189,12 +3272,11 @@ static int ionic_station_set(struct ionic_lif *lif) * likely here again after a fw-upgrade reset. We need to be * sure the netdev mac is in our filter list. */ - if (!ether_addr_equal(ctx.comp.lif_getattr.mac, - netdev->dev_addr)) + if (!ether_addr_equal(mac_address, netdev->dev_addr)) ionic_lif_addr_add(lif, netdev->dev_addr); } else { /* Update the netdev mac with the device's mac */ - memcpy(addr.sa_data, ctx.comp.lif_getattr.mac, netdev->addr_len); + ether_addr_copy(addr.sa_data, mac_address); addr.sa_family = AF_INET; err = eth_prepare_mac_addr_change(netdev, &addr); if (err) {
From: Heiner Kallweit hkallweit1@gmail.com
[ Upstream commit a3a57bf07de23fe1ff779e0fdf710aa581c3ff73 ]
This is a follow-up to the discussion in [0]. It seems to me that at least the IP version used on Amlogic SoCs sometimes has a problem if register MAC_CTRL_REG is written whilst the chip is still processing a previous write. But that's just a guess. Adding a delay between two writes to this register helps, but we can also simply omit the offending second write. This patch uses the second approach and is based on a suggestion from Qi Duan. A benefit of this approach is that we save a few register writes, also on chip versions that are not affected.
[0] https://www.spinics.net/lists/netdev/msg831526.html
Fixes: bfab27a146ed ("stmmac: add the experimental PCI support") Suggested-by: Qi Duan qi.duan@amlogic.com Suggested-by: Jerome Brunet jbrunet@baylibre.com Signed-off-by: Heiner Kallweit hkallweit1@gmail.com Link: https://lore.kernel.org/r/e99857ce-bd90-5093-ca8c-8cd480b5a0a2@gmail.com Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c | 8 ++++++-- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 9 +++++---- 2 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c b/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c index caa4bfc4c1d62..9b6138b117766 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac_lib.c @@ -258,14 +258,18 @@ EXPORT_SYMBOL_GPL(stmmac_set_mac_addr); /* Enable disable MAC RX/TX */ void stmmac_set_mac(void __iomem *ioaddr, bool enable) { - u32 value = readl(ioaddr + MAC_CTRL_REG); + u32 old_val, value; + + old_val = readl(ioaddr + MAC_CTRL_REG); + value = old_val;
if (enable) value |= MAC_ENABLE_RX | MAC_ENABLE_TX; else value &= ~(MAC_ENABLE_TX | MAC_ENABLE_RX);
- writel(value, ioaddr + MAC_CTRL_REG); + if (value != old_val) + writel(value, ioaddr + MAC_CTRL_REG); }
void stmmac_get_mac_addr(void __iomem *ioaddr, unsigned char *addr, diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index c5f33630e7718..78f11dabca056 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -983,10 +983,10 @@ static void stmmac_mac_link_up(struct phylink_config *config, bool tx_pause, bool rx_pause) { struct stmmac_priv *priv = netdev_priv(to_net_dev(config->dev)); - u32 ctrl; + u32 old_ctrl, ctrl;
- ctrl = readl(priv->ioaddr + MAC_CTRL_REG); - ctrl &= ~priv->hw->link.speed_mask; + old_ctrl = readl(priv->ioaddr + MAC_CTRL_REG); + ctrl = old_ctrl & ~priv->hw->link.speed_mask;
if (interface == PHY_INTERFACE_MODE_USXGMII) { switch (speed) { @@ -1061,7 +1061,8 @@ static void stmmac_mac_link_up(struct phylink_config *config, if (tx_pause && rx_pause) stmmac_mac_flow_ctrl(priv, duplex);
- writel(ctrl, priv->ioaddr + MAC_CTRL_REG); + if (ctrl != old_ctrl) + writel(ctrl, priv->ioaddr + MAC_CTRL_REG);
stmmac_mac_set(priv, priv->ioaddr, true); if (phy && priv->dma_cap.eee) {
From: Aleksander Jan Bajkowski olek2@wp.pl
[ Upstream commit c8b043702dc0894c07721c5b019096cebc8c798f ]
xrx200_hw_receive() assumes build_skb() always works and goes straight to skb_reserve(). However, build_skb() can fail under memory pressure.
Add a check for the case where build_skb() fails to allocate and returns NULL.
Fixes: e015593573b3 ("net: lantiq_xrx200: convert to build_skb") Reported-by: Eric Dumazet edumazet@google.com Signed-off-by: Aleksander Jan Bajkowski olek2@wp.pl Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/lantiq_xrx200.c | 6 ++++++ 1 file changed, 6 insertions(+)
diff --git a/drivers/net/ethernet/lantiq_xrx200.c b/drivers/net/ethernet/lantiq_xrx200.c index 5edb68a8aab1e..89314b645c822 100644 --- a/drivers/net/ethernet/lantiq_xrx200.c +++ b/drivers/net/ethernet/lantiq_xrx200.c @@ -239,6 +239,12 @@ static int xrx200_hw_receive(struct xrx200_chan *ch) }
skb = build_skb(buf, priv->rx_skb_size); + if (!skb) { + skb_free_frag(buf); + net_dev->stats.rx_dropped++; + return -ENOMEM; + } + skb_reserve(skb, NET_SKB_PAD); skb_put(skb, len);
From: Aleksander Jan Bajkowski olek2@wp.pl
[ Upstream commit c4b6e9341f930e4dd089231c0414758f5f1f9dbd ]
When the xrx200_hw_receive() function returns -ENOMEM, the NAPI poll function immediately returns an error. This is incorrect for two reasons:
* the function terminates without enabling interrupts or scheduling NAPI,
* the error code (-ENOMEM) is returned instead of the number of received packets.
After the first memory allocation failure occurs, packet reception is locked up because interrupts from the DMA remain disabled.
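A minimal sketch of the NAPI poll contract this fix restores (generic pattern, not the driver's exact code; example_rx_one() and example_irq_enable() are hypothetical stand-ins):

	#include <linux/netdevice.h>

	/* hypothetical helpers standing in for the driver's real code */
	static int example_rx_one(struct napi_struct *napi);	/* <0 on alloc failure */
	static void example_irq_enable(struct napi_struct *napi);

	static int example_poll(struct napi_struct *napi, int budget)
	{
		int done = 0;

		while (done < budget) {
			if (example_rx_one(napi) < 0)
				break;	/* keep the count, never return -ENOMEM */
			done++;
		}

		/* report work done; re-arm interrupts only when truly done */
		if (done < budget && napi_complete_done(napi, done))
			example_irq_enable(napi);

		return done;
	}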
Fixes: fe1a56420cf2 ("net: lantiq: Add Lantiq / Intel VRX200 Ethernet driver") Signed-off-by: Aleksander Jan Bajkowski olek2@wp.pl Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/lantiq_xrx200.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/lantiq_xrx200.c b/drivers/net/ethernet/lantiq_xrx200.c index 89314b645c822..25adce7f0c7c0 100644 --- a/drivers/net/ethernet/lantiq_xrx200.c +++ b/drivers/net/ethernet/lantiq_xrx200.c @@ -294,7 +294,7 @@ static int xrx200_poll_rx(struct napi_struct *napi, int budget) if (ret == XRX200_DMA_PACKET_IN_PROGRESS) continue; if (ret != XRX200_DMA_PACKET_COMPLETE) - return ret; + break; rx++; } else { break;
From: Aleksander Jan Bajkowski olek2@wp.pl
[ Upstream commit c9c3b1775f80fa21f5bff874027d2ccb10f5d90c ]
In a situation where memory allocation fails, an invalid buffer address is stored in the rx descriptor ring. When this descriptor is used again, the system panics in build_skb() when accessing that memory.
Fixes: 7ea6cd16f159 ("lantiq: net: fix duplicated skb in rx descriptor ring") Signed-off-by: Aleksander Jan Bajkowski olek2@wp.pl Signed-off-by: Jakub Kicinski kuba@kernel.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/net/ethernet/lantiq_xrx200.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/lantiq_xrx200.c b/drivers/net/ethernet/lantiq_xrx200.c index 25adce7f0c7c0..57f27cc7724e7 100644 --- a/drivers/net/ethernet/lantiq_xrx200.c +++ b/drivers/net/ethernet/lantiq_xrx200.c @@ -193,6 +193,7 @@ static int xrx200_alloc_buf(struct xrx200_chan *ch, void *(*alloc)(unsigned int
ch->rx_buff[ch->dma.desc] = alloc(priv->rx_skb_size); if (!ch->rx_buff[ch->dma.desc]) { + ch->rx_buff[ch->dma.desc] = buf; ret = -ENOMEM; goto skip; }
From: Filipe Manana fdmanana@suse.com
commit 47bf225a8d2cccb15f7e8d4a1ed9b757dd86afd7 upstream.
At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end up returning from the function with a value of 0 (success). This happens because the function returns the value stored in the variable 'err', which is 0, while the error value we got from btrfs_search_slot() is stored in the 'ret' variable.
So fix it by setting 'err' with the error value.
Fixes: 8289ed9f93bef2 ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Qu Wenruo wqu@suse.com Signed-off-by: Filipe Manana fdmanana@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/root-tree.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
--- a/fs/btrfs/root-tree.c +++ b/fs/btrfs/root-tree.c @@ -349,9 +349,10 @@ int btrfs_del_root_ref(struct btrfs_tran key.offset = ref_id; again: ret = btrfs_search_slot(trans, tree_root, &key, path, -1, 1); - if (ret < 0) + if (ret < 0) { + err = ret; goto out; - if (ret == 0) { + } else if (ret == 0) { leaf = path->nodes[0]; ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_root_ref);
From: Anand Jain anand.jain@oracle.com
commit 59a3991984dbc1fc47e5651a265c5200bd85464e upstream.
If the filesystem is mounted with the replace operation in a suspended state and we try to cancel the suspended replace operation, we hit the assert. The assert came from commit fe97e2e173af ("btrfs: dev-replace: replace's scrub must not be running in suspended state"), and it is not actually required. So just remove it.
$ mount /dev/sda5 /btrfs
BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'
$ mount -o degraded /dev/sda5 /btrfs <-- success.
$ btrfs replace cancel /btrfs
kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
kernel: ------------[ cut here ]------------
kernel: kernel BUG at fs/btrfs/ctree.h:3750!
After the patch:
$ btrfs replace cancel /btrfs
BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled
Fixes: fe97e2e173af ("btrfs: dev-replace: replace's scrub must not be running in suspended state") CC: stable@vger.kernel.org # 5.0+ Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/dev-replace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
--- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -1128,8 +1128,7 @@ int btrfs_dev_replace_cancel(struct btrf up_write(&dev_replace->rwsem);
/* Scrub for replace must not be running in suspended state */ - ret = btrfs_scrub_cancel(fs_info); - ASSERT(ret != -ENOTCONN); + btrfs_scrub_cancel(fs_info);
trans = btrfs_start_transaction(root, 0); if (IS_ERR(trans)) {
From: Anand Jain anand.jain@oracle.com
commit f2c3bec215694fb8bc0ef5010f2a758d1906fc2d upstream.
If the replace target device reappears after the suspended replace is cancelled, it blocks the mount operation, as the matching replace item can't be found in the metadata. As shown below:
BTRFS error (device sda5): replace devid present without an active replace item
To overcome this situation, the user can run the command
btrfs device scan --forget <replace target device>
and try the mount command again. Also, to avoid repeating the issue, the superblock on the devid=0 device must be wiped:
wipefs -a <device-path-to-devid=0>
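For example, assuming the stale replace target shows up as /dev/sdc (hypothetical path):

	$ btrfs device scan --forget /dev/sdc
	$ wipefs -a /dev/sdc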
This patch adds some info when this situation occurs.
Reported-by: Samuel Greiner samuel@balkonien.org Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@bal... CC: stable@vger.kernel.org # 5.0+ Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/dev-replace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -165,7 +165,7 @@ no_valid_dev_replace_entry_found: */ if (btrfs_find_device(fs_info->fs_devices, &args)) { btrfs_err(fs_info, - "replace devid present without an active replace item"); +"replace without active item, run 'device scan --forget' on the target device"); ret = -EUCLEAN; } else { dev_replace->srcdev = NULL;
From: Omar Sandoval osandov@fb.com
commit ced8ecf026fd8084cf175530ff85c76d6085d715 upstream.
When testing space_cache v2 on a large set of machines, we encountered a few symptoms:
1. "unable to add free space :-17" (EEXIST) errors. 2. Missing free space info items, sometimes caught with a "missing free space info for X" error. 3. Double-accounted space: ranges that were allocated in the extent tree and also marked as free in the free space tree, ranges that were marked as allocated twice in the extent tree, or ranges that were marked as free twice in the free space tree. If the latter made it onto disk, the next reboot would hit the BUG_ON() in add_new_free_space(). 4. On some hosts with no on-disk corruption or error messages, the in-memory space cache (dumped with drgn) disagreed with the free space tree.
All of these symptoms have the same underlying cause: a race between caching the free space for a block group and returning free space to the in-memory space cache for pinned extents causes us to double-add a free range to the space cache. This race exists when free space is cached from the free space tree (space_cache=v2) or the extent tree (nospace_cache, or space_cache=v1 if the cache needs to be regenerated). struct btrfs_block_group::last_byte_to_unpin and struct btrfs_block_group::progress are supposed to protect against this race, but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") subtly broke this by allowing multiple transactions to be unpinning extents at the same time.
Specifically, the race is as follows:
1. An extent is deleted from an uncached block group in transaction A.
2. btrfs_commit_transaction() is called for transaction A.
3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed ref for the deleted extent.
4. __btrfs_free_extent() -> do_free_extent_accounting() -> add_to_free_space_tree() adds the deleted extent back to the free space tree.
5. do_free_extent_accounting() -> btrfs_update_block_group() -> btrfs_cache_block_group() queues up the block group to get cached. block_group->progress is set to block_group->start.
6. btrfs_commit_transaction() for transaction A calls switch_commit_roots(). It sets block_group->last_byte_to_unpin to block_group->progress, which is block_group->start because the block group hasn't been cached yet.
7. The caching thread gets to our block group. Since the commit roots were already switched, load_free_space_tree() sees the deleted extent as free and adds it to the space cache. It finishes caching and sets block_group->progress to U64_MAX.
8. btrfs_commit_transaction() advances transaction A to TRANS_STATE_SUPER_COMMITTED.
9. fsync calls btrfs_commit_transaction() for transaction B. Since transaction A is already in TRANS_STATE_SUPER_COMMITTED and the commit is for fsync, it advances.
10. btrfs_commit_transaction() for transaction B calls switch_commit_roots(). This time, the block group has already been cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
11. btrfs_commit_transaction() for transaction A calls btrfs_finish_extent_commit(), which calls unpin_extent_range() for the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by transaction B!), so it adds the deleted extent to the space cache again!
This explains all of our symptoms above:
* If the sequence of events is exactly as described above, when the free space is re-added in step 11, it will fail with EEXIST.
* If another thread reallocates the deleted extent in between steps 7 and 11, then step 11 will silently re-add that space to the space cache as free even though it is actually allocated. Then, if that space is allocated *again*, the free space tree will be corrupted (namely, the wrong item will be deleted).
* If we don't catch this free space tree corruption, it will continue to get worse as extents are deleted and reallocated.
The v1 space_cache is synchronously loaded when an extent is deleted (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group() with load_cache_only=1), so it is not normally affected by this bug. However, as noted above, if we fail to load the space cache, we will fall back to caching from the extent tree and may hit this bug.
The easiest fix for this race is to also make caching from the free space tree or extent tree synchronous. Josef tested this and found no performance regressions.
A few extra changes fall out of this. Namely, the fix does the following, with step 2 being the crucial one:
1. Factor btrfs_caching_ctl_wait_done() out of btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl that we already hold a reference to.
2. Change the call in btrfs_cache_block_group() of btrfs_wait_space_cache_v1_finished() to btrfs_caching_ctl_wait_done(), which makes us wait regardless of the space_cache option.
3. Delete the now unused btrfs_wait_space_cache_v1_finished() and space_cache_v1_done().
4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to `bool wait` to more accurately describe its new meaning.
5. Change a few callers which had a separate call to btrfs_wait_block_group_cache_done() to use wait = true instead.
6. Make btrfs_wait_block_group_cache_done() static now that it's not used outside of block-group.c anymore.
Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana fdmanana@suse.com Signed-off-by: Omar Sandoval osandov@fb.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/block-group.c | 47 +++++++++++++++-------------------------------- fs/btrfs/block-group.h | 4 +--- fs/btrfs/ctree.h | 1 - fs/btrfs/extent-tree.c | 30 ++++++------------------------ 4 files changed, 22 insertions(+), 60 deletions(-)
--- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -440,39 +440,26 @@ void btrfs_wait_block_group_cache_progre btrfs_put_caching_control(caching_ctl); }
-int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache) +static int btrfs_caching_ctl_wait_done(struct btrfs_block_group *cache, + struct btrfs_caching_control *caching_ctl) +{ + wait_event(caching_ctl->wait, btrfs_block_group_done(cache)); + return cache->cached == BTRFS_CACHE_ERROR ? -EIO : 0; +} + +static int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache) { struct btrfs_caching_control *caching_ctl; - int ret = 0; + int ret;
caching_ctl = btrfs_get_caching_control(cache); if (!caching_ctl) return (cache->cached == BTRFS_CACHE_ERROR) ? -EIO : 0; - - wait_event(caching_ctl->wait, btrfs_block_group_done(cache)); - if (cache->cached == BTRFS_CACHE_ERROR) - ret = -EIO; + ret = btrfs_caching_ctl_wait_done(cache, caching_ctl); btrfs_put_caching_control(caching_ctl); return ret; }
-static bool space_cache_v1_done(struct btrfs_block_group *cache) -{ - bool ret; - - spin_lock(&cache->lock); - ret = cache->cached != BTRFS_CACHE_FAST; - spin_unlock(&cache->lock); - - return ret; -} - -void btrfs_wait_space_cache_v1_finished(struct btrfs_block_group *cache, - struct btrfs_caching_control *caching_ctl) -{ - wait_event(caching_ctl->wait, space_cache_v1_done(cache)); -} - #ifdef CONFIG_BTRFS_DEBUG static void fragment_free_space(struct btrfs_block_group *block_group) { @@ -750,9 +737,8 @@ done: btrfs_put_block_group(block_group); }
-int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only) +int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait) { - DEFINE_WAIT(wait); struct btrfs_fs_info *fs_info = cache->fs_info; struct btrfs_caching_control *caching_ctl = NULL; int ret = 0; @@ -785,10 +771,7 @@ int btrfs_cache_block_group(struct btrfs } WARN_ON(cache->caching_ctl); cache->caching_ctl = caching_ctl; - if (btrfs_test_opt(fs_info, SPACE_CACHE)) - cache->cached = BTRFS_CACHE_FAST; - else - cache->cached = BTRFS_CACHE_STARTED; + cache->cached = BTRFS_CACHE_STARTED; cache->has_caching_ctl = 1; spin_unlock(&cache->lock);
@@ -801,8 +784,8 @@ int btrfs_cache_block_group(struct btrfs
btrfs_queue_work(fs_info->caching_workers, &caching_ctl->work); out: - if (load_cache_only && caching_ctl) - btrfs_wait_space_cache_v1_finished(cache, caching_ctl); + if (wait && caching_ctl) + ret = btrfs_caching_ctl_wait_done(cache, caching_ctl); if (caching_ctl) btrfs_put_caching_control(caching_ctl);
@@ -3313,7 +3296,7 @@ int btrfs_update_block_group(struct btrf * space back to the block group, otherwise we will leak space. */ if (!alloc && !btrfs_block_group_done(cache)) - btrfs_cache_block_group(cache, 1); + btrfs_cache_block_group(cache, true);
byte_in_group = bytenr - cache->start; WARN_ON(byte_in_group > cache->length); --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -263,9 +263,7 @@ void btrfs_dec_nocow_writers(struct btrf void btrfs_wait_nocow_writers(struct btrfs_block_group *bg); void btrfs_wait_block_group_cache_progress(struct btrfs_block_group *cache, u64 num_bytes); -int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache); -int btrfs_cache_block_group(struct btrfs_block_group *cache, - int load_cache_only); +int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait); void btrfs_put_caching_control(struct btrfs_caching_control *ctl); struct btrfs_caching_control *btrfs_get_caching_control( struct btrfs_block_group *cache); --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -494,7 +494,6 @@ struct btrfs_free_cluster { enum btrfs_caching_type { BTRFS_CACHE_NO, BTRFS_CACHE_STARTED, - BTRFS_CACHE_FAST, BTRFS_CACHE_FINISHED, BTRFS_CACHE_ERROR, }; --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2567,17 +2567,10 @@ int btrfs_pin_extent_for_log_replay(stru return -EINVAL;
/* - * pull in the free space cache (if any) so that our pin - * removes the free space from the cache. We have load_only set - * to one because the slow code to read in the free extents does check - * the pinned extents. + * Fully cache the free space first so that our pin removes the free space + * from the cache. */ - btrfs_cache_block_group(cache, 1); - /* - * Make sure we wait until the cache is completely built in case it is - * missing or is invalid and therefore needs to be rebuilt. - */ - ret = btrfs_wait_block_group_cache_done(cache); + ret = btrfs_cache_block_group(cache, true); if (ret) goto out;
@@ -2600,12 +2593,7 @@ static int __exclude_logged_extent(struc if (!block_group) return -EINVAL;
- btrfs_cache_block_group(block_group, 1); - /* - * Make sure we wait until the cache is completely built in case it is - * missing or is invalid and therefore needs to be rebuilt. - */ - ret = btrfs_wait_block_group_cache_done(block_group); + ret = btrfs_cache_block_group(block_group, true); if (ret) goto out;
@@ -4415,7 +4403,7 @@ have_block_group: ffe_ctl->cached = btrfs_block_group_done(block_group); if (unlikely(!ffe_ctl->cached)) { ffe_ctl->have_caching_bg = true; - ret = btrfs_cache_block_group(block_group, 0); + ret = btrfs_cache_block_group(block_group, false);
/* * If we get ENOMEM here or something else we want to @@ -6169,13 +6157,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *
if (end - start >= range->minlen) { if (!btrfs_block_group_done(cache)) { - ret = btrfs_cache_block_group(cache, 0); - if (ret) { - bg_failed++; - bg_ret = ret; - continue; - } - ret = btrfs_wait_block_group_cache_done(cache); + ret = btrfs_cache_block_group(cache, true); if (ret) { bg_failed++; bg_ret = ret;
From: Goldwyn Rodrigues rgoldwyn@suse.de
commit b51111271b0352aa596c5ae8faf06939e91b3b68 upstream.
For a filesystem which has the btrfs read-only property set to true, all write operations, including xattr changes, should be denied. However, security xattrs can still be changed even if the btrfs ro property is true.
This happens because xattr_permission() does not place any restrictions on security.*, system.*, and in some cases trusted.* xattrs from the VFS, and the decision is left to the underlying filesystem. See comments in xattr_permission() for more details.
This patch checks if the root is read-only before performing the set xattr operation.
Testcase:
DEV=/dev/vdb
MNT=/mnt
mkfs.btrfs -f $DEV
mount $DEV $MNT
echo "file one" > $MNT/f1
setfattr -n "security.one" -v 2 $MNT/f1 btrfs property set /mnt ro true
setfattr -n "security.one" -v 1 $MNT/f1
umount $MNT
CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Qu Wenruo wqu@suse.com Reviewed-by: Filipe Manana fdmanana@suse.com Signed-off-by: Goldwyn Rodrigues rgoldwyn@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/xattr.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/fs/btrfs/xattr.c +++ b/fs/btrfs/xattr.c @@ -371,6 +371,9 @@ static int btrfs_xattr_handler_set(const const char *name, const void *buffer, size_t size, int flags) { + if (btrfs_root_readonly(BTRFS_I(inode)->root)) + return -EROFS; + name = xattr_full_name(handler, name); return btrfs_setxattr_trans(inode, name, buffer, size, flags); }
From: Zixuan Fu r33s3n6@gmail.com
commit 9ea0106a7a3d8116860712e3f17cd52ce99f6707 upstream.
In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if the path is invalid. In this case, btrfs_get_dev_args_from_path() returns directly without freeing the args->uuid and args->fsid allocated earlier, which causes a memory leak.
To fix these possible leaks, when btrfs_get_bdev_and_sb() fails, btrfs_put_dev_args_from_path() is called to clean up the memory.
Reported-by: TOTE Robot oslab@tsinghua.edu.cn Fixes: faa775c41d655 ("btrfs: add a btrfs_get_dev_args_from_path helper") CC: stable@vger.kernel.org # 5.16 Reviewed-by: Boris Burkov boris@bur.io Signed-off-by: Zixuan Fu r33s3n6@gmail.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/volumes.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)
--- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2344,8 +2344,11 @@ int btrfs_get_dev_args_from_path(struct
ret = btrfs_get_bdev_and_sb(path, FMODE_READ, fs_info->bdev_holder, 0, &bdev, &disk_super); - if (ret) + if (ret) { + btrfs_put_dev_args_from_path(args); return ret; + } + args->devid = btrfs_stack_device_id(&disk_super->dev_item); memcpy(args->uuid, disk_super->dev_item.uuid, BTRFS_UUID_SIZE); if (btrfs_fs_incompat(fs_info, METADATA_UUID))
From: Filipe Manana fdmanana@suse.com
commit e6e3dec6c3c288d556b991a85d5d8e3ee71e9046 upstream.
When punching a hole into a file range that is adjacent to a hole, and we are not using the no-holes feature, we expand the range of the adjacent file extent item that represents a hole, to save metadata space.
However, we don't update the generation of the hole's file extent item, which means a full fsync will not log that file extent item if the fsync happens in a later transaction (since commit 7f30c07288bb9e ("btrfs: stop copying old file extents when doing a full fsync")).
For example, if we do this:
$ mkfs.btrfs -f -O ^no-holes /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
$ sync
We end up with 2 file extent items in our file:
1) One that represents the hole for the file range [0, 2M), with a generation of 7;
2) Another one that represents an extent covering the range [2M, 4M).
After that if we do the following:
$ xfs_io -c "fpunch 2M 2M" /mnt/foobar
We end up with a single file extent item in the file, which represents a hole for the range [0, 4M) and has a generation of 7 - because we end up dropping the data extent for the range [2M, 4M) and then update the file extent item that represented the hole at [0, 2M), increasing its length from 2M to 4M.
Then doing a full fsync and power failing:
$ xfs_io -c "fsync" /mnt/foobar <power failure>
will result in the full fsync not logging the file extent item that represents the hole for the range [0, 4M), because its generation is 7, which is lower than the generation of the current transaction (8). As a consequence, after mounting the filesystem again (after log replay), the region [2M, 4M) does not have a hole; it still points to the previous data extent.
So fix this by always updating the generation of existing file extent items representing holes when we merge/expand them. This solves the problem and it's the same approach as when we merge prealloc extents that got written (at btrfs_mark_extent_written()). Setting the generation to the current transaction's generation is also what we do when merging the new hole extent map with the previous one or the next one.
A test case for fstests, covering both cases of hole file extent item merging (to the left and to the right), will be sent soon.
Fixes: 7f30c07288bb9e ("btrfs: stop copying old file extents when doing a full fsync") CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Josef Bacik josef@toxicpanda.com Signed-off-by: Filipe Manana fdmanana@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/file.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2483,6 +2483,7 @@ static int fill_holes(struct btrfs_trans btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes); btrfs_set_file_extent_offset(leaf, fi, 0); + btrfs_set_file_extent_generation(leaf, fi, trans->transid); btrfs_mark_buffer_dirty(leaf); goto out; } @@ -2499,6 +2500,7 @@ static int fill_holes(struct btrfs_trans btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes); btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes); btrfs_set_file_extent_offset(leaf, fi, 0); + btrfs_set_file_extent_generation(leaf, fi, trans->transid); btrfs_mark_buffer_dirty(leaf); goto out; }
From: Michael Roth michael.roth@amd.com
commit 4b1c742407571eff58b6de9881889f7ca7c4b4dc upstream.
In some cases, bootloaders will leave boot_params->cc_blob_address uninitialized rather than zeroing it out. This field is only meant to be set by the boot/compressed kernel in order to pass information to the uncompressed kernel when SEV-SNP support is enabled.
Therefore, there are no cases where the bootloader-provided values should be treated as anything other than garbage. Otherwise, the uncompressed kernel may attempt to access this bogus address, leading to a crash during early boot.
Normally, sanitize_boot_params() would be used to clear out such fields but that happens too late: sev_enable() may have already initialized it to a valid value that should not be zeroed out. Instead, have sev_enable() zero it out unconditionally beforehand.
Also ensure this happens for !CONFIG_AMD_MEM_ENCRYPT as well by also including this handling in the sev_enable() stub function.
[ bp: Massage commit message and comments. ]
Fixes: b190a043c49a ("x86/sev: Add SEV-SNP feature detection/setup") Reported-by: Jeremi Piotrowski jpiotrowski@linux.microsoft.com Reported-by: watnuss@gmx.de Signed-off-by: Michael Roth michael.roth@amd.com Signed-off-by: Borislav Petkov bp@suse.de Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=216387 Link: https://lore.kernel.org/r/20220823160734.89036-1-michael.roth@amd.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/boot/compressed/misc.h | 12 +++++++++++- arch/x86/boot/compressed/sev.c | 8 ++++++++ 2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/arch/x86/boot/compressed/misc.h b/arch/x86/boot/compressed/misc.h index 4910bf230d7b..62208ec04ca4 100644 --- a/arch/x86/boot/compressed/misc.h +++ b/arch/x86/boot/compressed/misc.h @@ -132,7 +132,17 @@ void snp_set_page_private(unsigned long paddr); void snp_set_page_shared(unsigned long paddr); void sev_prep_identity_maps(unsigned long top_level_pgt); #else -static inline void sev_enable(struct boot_params *bp) { } +static inline void sev_enable(struct boot_params *bp) +{ + /* + * bp->cc_blob_address should only be set by boot/compressed kernel. + * Initialize it to 0 unconditionally (thus here in this stub too) to + * ensure that uninitialized values from buggy bootloaders aren't + * propagated. + */ + if (bp) + bp->cc_blob_address = 0; +} static inline void sev_es_shutdown_ghcb(void) { } static inline bool sev_es_check_ghcb_fault(unsigned long address) { diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c index 52f989f6acc2..c93930d5ccbd 100644 --- a/arch/x86/boot/compressed/sev.c +++ b/arch/x86/boot/compressed/sev.c @@ -276,6 +276,14 @@ void sev_enable(struct boot_params *bp) struct msr m; bool snp;
+ /* + * bp->cc_blob_address should only be set by boot/compressed kernel. + * Initialize it to 0 to ensure that uninitialized values from + * buggy bootloaders aren't propagated. + */ + if (bp) + bp->cc_blob_address = 0; + /* * Setup/preliminary detection of SNP. This will be sanity-checked * against CPUID/MSR values later.
From: Kan Liang kan.liang@linux.intel.com
commit cde643ff75bc20c538dfae787ca3b587bab16b50 upstream.
According to the latest event list, the LOAD_LATENCY PEBS event only works on GP counters 0 and 1 for ADL and RPL.
Update the pebs event constraints table.
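For reference, and assuming the constraint macro's second argument keeps its usual meaning of a bitmask of allowed GP counters, the change from 0xf to 0x3 narrows the 0x5d0 load-latency event from counters 0-3 to counters 0 and 1:

	/* second argument = allowed GP counter bitmask (assumed encoding):
	 * 0xf -> counters 0-3, 0x3 -> counters 0 and 1
	 */
	INTEL_HYBRID_LAT_CONSTRAINT(0x5d0, 0x3),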
Fixes: f83d2f91d259 ("perf/x86/intel: Add Alder Lake Hybrid support") Reported-by: Ammy Yi ammy.yi@intel.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220818184429.2355857-1-kan.liang@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/events/intel/ds.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/events/intel/ds.c +++ b/arch/x86/events/intel/ds.c @@ -822,7 +822,7 @@ struct event_constraint intel_glm_pebs_e
struct event_constraint intel_grt_pebs_event_constraints[] = { /* Allow all events as PEBS with no flags */ - INTEL_HYBRID_LAT_CONSTRAINT(0x5d0, 0xf), + INTEL_HYBRID_LAT_CONSTRAINT(0x5d0, 0x3), INTEL_HYBRID_LAT_CONSTRAINT(0x6d0, 0xf), EVENT_CONSTRAINT_END };
From: Kan Liang kan.liang@linux.intel.com
commit 32ba156df1b1c8804a4e5be5339616945eafea22 upstream.
On the platform with Arch LBR, the HW raw branch type encoding may leak to the perf tool when the SAVE_TYPE option is not set.
In intel_pmu_store_lbr(), the HW raw branch type is stored in lbr_entries[].type. If the SAVE_TYPE option is set, lbr_entries[].type will be converted into the generic PERF_BR_* type in intel_pmu_lbr_filter() and exposed to the user tools. But if the SAVE_TYPE option is NOT set by the user, the current perf kernel doesn't clear the field, and the HW raw branch type leaks.
There are two solutions to fix the issue for the Arch LBR. One is to clear the field if the SAVE_TYPE option is NOT set. The other solution is to unconditionally convert the branch type and expose the generic type to the user tools.
The latter is implemented here, because:
- The branch type is valuable information. I don't see a case where you would not benefit from the branch type. (Stephane Eranian)
- Not having the branch type DOES NOT save any space in the branch record. (Stephane Eranian)
- The Arch LBR HW can retrieve the common branch types from the LBR_INFO. It doesn't require the high overhead SW disassemble.
Fixes: 47125db27e47 ("perf/x86/intel/lbr: Support Architectural LBR") Reported-by: Stephane Eranian eranian@google.com Signed-off-by: Kan Liang kan.liang@linux.intel.com Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220816125612.2042397-1-kan.liang@linux.intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/events/intel/lbr.c | 8 ++++++++ 1 file changed, 8 insertions(+)
--- a/arch/x86/events/intel/lbr.c +++ b/arch/x86/events/intel/lbr.c @@ -1097,6 +1097,14 @@ static int intel_pmu_setup_hw_lbr_filter
if (static_cpu_has(X86_FEATURE_ARCH_LBR)) { reg->config = mask; + + /* + * The Arch LBR HW can retrieve the common branch types + * from the LBR_INFO. It doesn't require the high overhead + * SW disassemble. + * Enable the branch type by default for the Arch LBR. + */ + reg->reg |= X86_BR_TYPE_SAVE; return 0; }
From: Juergen Gross jgross@suse.com
commit 5b9f0c4df1c1152403c738373fb063e9ffdac0a1 upstream.
Commit
c89191ce67ef ("x86/entry: Convert SWAPGS to swapgs and remove the definition of SWAPGS")
missed one use case of SWAPGS in entry_INT80_compat(). Removing the SWAPGS macro led to the asm just using "swapgs", as the assembler accepts instructions in capital letters, too.
This in turn leads to splats in Xen PV guests like:
[ 36.145223] general protection fault, maybe for address 0x2d: 0000 [#1] PREEMPT SMP NOPTI
[ 36.145794] CPU: 2 PID: 1847 Comm: ld-linux.so.2 Not tainted 5.19.1-1-default #1 \
             openSUSE Tumbleweed f3b44bfb672cdb9f235aff53b57724eba8b9411b
[ 36.146608] Hardware name: HP ProLiant ML350p Gen8, BIOS P72 11/14/2013
[ 36.148126] RIP: e030:entry_INT80_compat+0x3/0xa3
Fix that by open coding this single instance of the SWAPGS macro, i.e. use an ALTERNATIVE that emits swapgs normally and patches it out on Xen PV.
Fixes: c89191ce67ef ("x86/entry: Convert SWAPGS to swapgs and remove the definition of SWAPGS") Signed-off-by: Juergen Gross jgross@suse.com Signed-off-by: Borislav Petkov bp@suse.de Reviewed-by: Jan Beulich jbeulich@suse.com Cc: stable@vger.kernel.org # 5.19 Link: https://lore.kernel.org/r/20220816071137.4893-1-jgross@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/entry/entry_64_compat.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/entry/entry_64_compat.S +++ b/arch/x86/entry/entry_64_compat.S @@ -311,7 +311,7 @@ SYM_CODE_START(entry_INT80_compat) * Interrupts are off on entry. */ ASM_CLAC /* Do this early to minimize exposure */ - SWAPGS + ALTERNATIVE "swapgs", "", X86_FEATURE_XENPV
/* * User tracing code (ptrace or signal handlers) might assume that
From: Chen Zhongjin chenzhongjin@huawei.com
commit fc2e426b1161761561624ebd43ce8c8d2fa058da upstream.
When meeting ftrace trampolines in ORC unwinding, the unwinder uses the address of ftrace_{regs_}call to find the ORC entry, which gets the next frame at sp+176.
If an IRQ hits at the sub $0xa8,%rsp instruction, the next frame should be at sp+8 instead of sp+176. This makes the unwinder skip the correct frame and throw warnings such as "wrong direction" or "can't access registers", etc., depending on the content of the incorrect frame address.
By adding the offset ip - ops->trampoline to the base address ftrace_{regs_}caller, we can get the correct address to find the ORC entry.
Also change "caller" to "tramp_addr" to make variable name conform to its content.
[ mingo: Clarified the changelog a bit. ]
Fixes: 6be7fa3c74d1 ("ftrace, orc, x86: Handle ftrace dynamically allocated trampolines") Signed-off-by: Chen Zhongjin chenzhongjin@huawei.com Signed-off-by: Ingo Molnar mingo@kernel.org Reviewed-by: Steven Rostedt (Google) rostedt@goodmis.org Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220819084334.244016-1-chenzhongjin@huawei.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/unwind_orc.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)
--- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -93,22 +93,27 @@ static struct orc_entry *orc_find(unsign static struct orc_entry *orc_ftrace_find(unsigned long ip) { struct ftrace_ops *ops; - unsigned long caller; + unsigned long tramp_addr, offset;
ops = ftrace_ops_trampoline(ip); if (!ops) return NULL;
+ /* Set tramp_addr to the start of the code copied by the trampoline */ if (ops->flags & FTRACE_OPS_FL_SAVE_REGS) - caller = (unsigned long)ftrace_regs_call; + tramp_addr = (unsigned long)ftrace_regs_caller; else - caller = (unsigned long)ftrace_call; + tramp_addr = (unsigned long)ftrace_caller; + + /* Now place tramp_addr to the location within the trampoline ip is at */ + offset = ip - ops->trampoline; + tramp_addr += offset;
/* Prevent unlikely recursion */ - if (ip == caller) + if (ip == tramp_addr) return NULL;
- return orc_find(caller); + return orc_find(tramp_addr); } #else static struct orc_entry *orc_ftrace_find(unsigned long ip)
From: Tom Lendacky thomas.lendacky@amd.com
commit cdaa0a407f1acd3a44861e3aea6e3c7349e668f1 upstream.
When running identity-mapped and depending on the kernel configuration, it is possible that the compiler uses jump tables when generating code for cc_platform_has().
This causes a boot failure because the jump table uses un-mapped kernel virtual addresses, not identity-mapped addresses. This has been seen with CONFIG_RETPOLINE=n.
Similar to sme_encrypt_kernel(), use an open-coded direct check for the status of SNP rather than trying to eliminate the jump table. This preserves any code optimization in cc_platform_has() that can be useful post boot. It also limits the changes to SEV-specific files so that future compiler features won't necessarily require possible build changes just because they are not compatible with running identity-mapped.
[ bp: Massage commit message. ]
Fixes: 5e5ccff60a29 ("x86/sev: Add helper for validating pages in early enc attribute changes") Reported-by: Sean Christopherson seanjc@google.com Suggested-by: Sean Christopherson seanjc@google.com Signed-off-by: Tom Lendacky thomas.lendacky@amd.com Signed-off-by: Borislav Petkov bp@suse.de Cc: stable@vger.kernel.org # 5.19.x Link: https://lore.kernel.org/all/YqfabnTRxFSM+LoX@google.com/ Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/kernel/sev.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-)
--- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -701,7 +701,13 @@ e_term: void __init early_snp_set_memory_private(unsigned long vaddr, unsigned long paddr, unsigned int npages) { - if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) + /* + * This can be invoked in early boot while running identity mapped, so + * use an open coded check for SNP instead of using cc_platform_has(). + * This eliminates worries about jump tables or checking boot_cpu_data + * in the cc_platform_has() function. + */ + if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED)) return;
/* @@ -717,7 +723,13 @@ void __init early_snp_set_memory_private void __init early_snp_set_memory_shared(unsigned long vaddr, unsigned long paddr, unsigned int npages) { - if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) + /* + * This can be invoked in early boot while running identity mapped, so + * use an open coded check for SNP instead of using cc_platform_has(). + * This eliminates worries about jump tables or checking boot_cpu_data + * in the cc_platform_has() function. + */ + if (!(sev_status & MSR_AMD64_SEV_SNP_ENABLED)) return;
/* Invalidate the memory pages before they are marked shared in the RMP table. */
From: Pawan Gupta pawan.kumar.gupta@linux.intel.com
commit 7df548840c496b0141fb2404b889c346380c2b22 upstream.
Older Intel CPUs that are not in the affected processor list for MMIO Stale Data vulnerabilities currently report "Not affected" in sysfs, which may not be correct. Vulnerability status for these older CPUs is unknown.
Add known-not-affected CPUs to the whitelist. Report "unknown" mitigation status for CPUs that are in neither the blacklist nor the whitelist and also don't enumerate the MSR ARCH_CAPABILITIES bits that reflect hardware immunity to MMIO Stale Data vulnerabilities.
Mitigation is not deployed when the status is unknown.
[ bp: Massage, fixup. ]
Fixes: 8d50cdf8b834 ("x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data") Suggested-by: Andrew Cooper andrew.cooper3@citrix.com Suggested-by: Tony Luck tony.luck@intel.com Signed-off-by: Pawan Gupta pawan.kumar.gupta@linux.intel.com Signed-off-by: Borislav Petkov bp@suse.de Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/a932c154772f2121794a5f2eded1a11013114711.165784626... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst | 14 +++ arch/x86/include/asm/cpufeatures.h | 5 - arch/x86/kernel/cpu/bugs.c | 14 ++- arch/x86/kernel/cpu/common.c | 42 ++++++---- 4 files changed, 56 insertions(+), 19 deletions(-)
--- a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst +++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst @@ -230,6 +230,20 @@ The possible values in this file are: * - 'Mitigation: Clear CPU buffers' - The processor is vulnerable and the CPU buffer clearing mitigation is enabled. + * - 'Unknown: No mitigations' + - The processor vulnerability status is unknown because it is + out of Servicing period. Mitigation is not attempted. + +Definitions: +------------ + +Servicing period: The process of providing functional and security updates to +Intel processors or platforms, utilizing the Intel Platform Update (IPU) +process or other similar mechanisms. + +End of Servicing Updates (ESU): ESU is the date at which Intel will no +longer provide Servicing, such as through IPU or other similar update +processes. ESU dates will typically be aligned to end of quarter.
If the processor is vulnerable then the following information is appended to the above information: --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -456,7 +456,8 @@ #define X86_BUG_ITLB_MULTIHIT X86_BUG(23) /* CPU may incur MCE during certain page attribute changes */ #define X86_BUG_SRBDS X86_BUG(24) /* CPU may leak RNG bits if not mitigated */ #define X86_BUG_MMIO_STALE_DATA X86_BUG(25) /* CPU is affected by Processor MMIO Stale Data vulnerabilities */ -#define X86_BUG_RETBLEED X86_BUG(26) /* CPU is affected by RETBleed */ -#define X86_BUG_EIBRS_PBRSB X86_BUG(27) /* EIBRS is vulnerable to Post Barrier RSB Predictions */ +#define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */ +#define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */ +#define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
#endif /* _ASM_X86_CPUFEATURES_H */ --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -433,7 +433,8 @@ static void __init mmio_select_mitigatio u64 ia32_cap;
if (!boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA) || - cpu_mitigations_off()) { + boot_cpu_has_bug(X86_BUG_MMIO_UNKNOWN) || + cpu_mitigations_off()) { mmio_mitigation = MMIO_MITIGATION_OFF; return; } @@ -538,6 +539,8 @@ out: pr_info("TAA: %s\n", taa_strings[taa_mitigation]); if (boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA)) pr_info("MMIO Stale Data: %s\n", mmio_strings[mmio_mitigation]); + else if (boot_cpu_has_bug(X86_BUG_MMIO_UNKNOWN)) + pr_info("MMIO Stale Data: Unknown: No mitigations\n"); }
static void __init md_clear_select_mitigation(void) @@ -2275,6 +2278,9 @@ static ssize_t tsx_async_abort_show_stat
static ssize_t mmio_stale_data_show_state(char *buf) { + if (boot_cpu_has_bug(X86_BUG_MMIO_UNKNOWN)) + return sysfs_emit(buf, "Unknown: No mitigations\n"); + if (mmio_mitigation == MMIO_MITIGATION_OFF) return sysfs_emit(buf, "%s\n", mmio_strings[mmio_mitigation]);
@@ -2421,6 +2427,7 @@ static ssize_t cpu_show_common(struct de return srbds_show_state(buf);
case X86_BUG_MMIO_STALE_DATA: + case X86_BUG_MMIO_UNKNOWN: return mmio_stale_data_show_state(buf);
case X86_BUG_RETBLEED: @@ -2480,7 +2487,10 @@ ssize_t cpu_show_srbds(struct device *de
ssize_t cpu_show_mmio_stale_data(struct device *dev, struct device_attribute *attr, char *buf) { - return cpu_show_common(dev, attr, buf, X86_BUG_MMIO_STALE_DATA); + if (boot_cpu_has_bug(X86_BUG_MMIO_UNKNOWN)) + return cpu_show_common(dev, attr, buf, X86_BUG_MMIO_UNKNOWN); + else + return cpu_show_common(dev, attr, buf, X86_BUG_MMIO_STALE_DATA); }
ssize_t cpu_show_retbleed(struct device *dev, struct device_attribute *attr, char *buf) --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1135,7 +1135,8 @@ static void identify_cpu_without_cpuid(s #define NO_SWAPGS BIT(6) #define NO_ITLB_MULTIHIT BIT(7) #define NO_SPECTRE_V2 BIT(8) -#define NO_EIBRS_PBRSB BIT(9) +#define NO_MMIO BIT(9) +#define NO_EIBRS_PBRSB BIT(10)
#define VULNWL(vendor, family, model, whitelist) \ X86_MATCH_VENDOR_FAM_MODEL(vendor, family, model, whitelist) @@ -1158,6 +1159,11 @@ static const __initconst struct x86_cpu_ VULNWL(VORTEX, 6, X86_MODEL_ANY, NO_SPECULATION),
/* Intel Family 6 */ + VULNWL_INTEL(TIGERLAKE, NO_MMIO), + VULNWL_INTEL(TIGERLAKE_L, NO_MMIO), + VULNWL_INTEL(ALDERLAKE, NO_MMIO), + VULNWL_INTEL(ALDERLAKE_L, NO_MMIO), + VULNWL_INTEL(ATOM_SALTWELL, NO_SPECULATION | NO_ITLB_MULTIHIT), VULNWL_INTEL(ATOM_SALTWELL_TABLET, NO_SPECULATION | NO_ITLB_MULTIHIT), VULNWL_INTEL(ATOM_SALTWELL_MID, NO_SPECULATION | NO_ITLB_MULTIHIT), @@ -1176,9 +1182,9 @@ static const __initconst struct x86_cpu_ VULNWL_INTEL(ATOM_AIRMONT_MID, NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT), VULNWL_INTEL(ATOM_AIRMONT_NP, NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
- VULNWL_INTEL(ATOM_GOLDMONT, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT), - VULNWL_INTEL(ATOM_GOLDMONT_D, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT), - VULNWL_INTEL(ATOM_GOLDMONT_PLUS, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_EIBRS_PBRSB), + VULNWL_INTEL(ATOM_GOLDMONT, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO), + VULNWL_INTEL(ATOM_GOLDMONT_D, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO), + VULNWL_INTEL(ATOM_GOLDMONT_PLUS, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO | NO_EIBRS_PBRSB),
/* * Technically, swapgs isn't serializing on AMD (despite it previously @@ -1193,18 +1199,18 @@ static const __initconst struct x86_cpu_ VULNWL_INTEL(ATOM_TREMONT_D, NO_ITLB_MULTIHIT | NO_EIBRS_PBRSB),
/* AMD Family 0xf - 0x12 */ - VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT), - VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT), - VULNWL_AMD(0x11, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT), - VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT), + VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO), + VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO), + VULNWL_AMD(0x11, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO), + VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO),
/* FAMILY_ANY must be last, otherwise 0x0f - 0x12 matches won't work */ - VULNWL_AMD(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT), - VULNWL_HYGON(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT), + VULNWL_AMD(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO), + VULNWL_HYGON(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO),
/* Zhaoxin Family 7 */ - VULNWL(CENTAUR, 7, X86_MODEL_ANY, NO_SPECTRE_V2 | NO_SWAPGS), - VULNWL(ZHAOXIN, 7, X86_MODEL_ANY, NO_SPECTRE_V2 | NO_SWAPGS), + VULNWL(CENTAUR, 7, X86_MODEL_ANY, NO_SPECTRE_V2 | NO_SWAPGS | NO_MMIO), + VULNWL(ZHAOXIN, 7, X86_MODEL_ANY, NO_SPECTRE_V2 | NO_SWAPGS | NO_MMIO), {} };
@@ -1358,10 +1364,16 @@ static void __init cpu_set_bug_bits(stru * Affected CPU list is generally enough to enumerate the vulnerability, * but for virtualization case check for ARCH_CAP MSR bits also, VMM may * not want the guest to enumerate the bug. + * + * Set X86_BUG_MMIO_UNKNOWN for CPUs that are neither in the blacklist, + * nor in the whitelist and also don't enumerate MSR ARCH_CAP MMIO bits. */ - if (cpu_matches(cpu_vuln_blacklist, MMIO) && - !arch_cap_mmio_immune(ia32_cap)) - setup_force_cpu_bug(X86_BUG_MMIO_STALE_DATA); + if (!arch_cap_mmio_immune(ia32_cap)) { + if (cpu_matches(cpu_vuln_blacklist, MMIO)) + setup_force_cpu_bug(X86_BUG_MMIO_STALE_DATA); + else if (!cpu_matches(cpu_vuln_whitelist, NO_MMIO)) + setup_force_cpu_bug(X86_BUG_MMIO_UNKNOWN); + }
if (!cpu_has(c, X86_FEATURE_BTC_NO)) { if (cpu_matches(cpu_vuln_blacklist, RETBLEED) || (ia32_cap & ARCH_CAP_RSBA))
From: Peter Zijlstra peterz@infradead.org
commit 4e3aa9238277597c6c7624f302d81a7b568b6f2d upstream.
Commit 2b1299322016 ("x86/speculation: Add RSB VM Exit protections") made a right mess of the RSB stuffing; rewrite the whole thing to not suck.
Thanks to Andrew for the enlightening comment about Post-Barrier RSB things so we can make this code less magical.
Cc: stable@vger.kernel.org Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org Link: https://lkml.kernel.org/r/YvuNdDWoUZSBjYcm@worktop.programming.kicks-ass.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/include/asm/nospec-branch.h | 80 +++++++++++++++++------------------ 1 file changed, 39 insertions(+), 41 deletions(-)
--- a/arch/x86/include/asm/nospec-branch.h +++ b/arch/x86/include/asm/nospec-branch.h @@ -35,33 +35,44 @@ #define RSB_CLEAR_LOOPS 32 /* To forcibly overwrite all entries */
/* + * Common helper for __FILL_RETURN_BUFFER and __FILL_ONE_RETURN. + */ +#define __FILL_RETURN_SLOT \ + ANNOTATE_INTRA_FUNCTION_CALL; \ + call 772f; \ + int3; \ +772: + +/* + * Stuff the entire RSB. + * * Google experimented with loop-unrolling and this turned out to be * the optimal version - two calls, each with their own speculation * trap should their return address end up getting used, in a loop. */ -#define __FILL_RETURN_BUFFER(reg, nr, sp) \ - mov $(nr/2), reg; \ -771: \ - ANNOTATE_INTRA_FUNCTION_CALL; \ - call 772f; \ -773: /* speculation trap */ \ - UNWIND_HINT_EMPTY; \ - pause; \ - lfence; \ - jmp 773b; \ -772: \ - ANNOTATE_INTRA_FUNCTION_CALL; \ - call 774f; \ -775: /* speculation trap */ \ - UNWIND_HINT_EMPTY; \ - pause; \ - lfence; \ - jmp 775b; \ -774: \ - add $(BITS_PER_LONG/8) * 2, sp; \ - dec reg; \ - jnz 771b; \ - /* barrier for jnz misprediction */ \ +#define __FILL_RETURN_BUFFER(reg, nr) \ + mov $(nr/2), reg; \ +771: \ + __FILL_RETURN_SLOT \ + __FILL_RETURN_SLOT \ + add $(BITS_PER_LONG/8) * 2, %_ASM_SP; \ + dec reg; \ + jnz 771b; \ + /* barrier for jnz misprediction */ \ + lfence; + +/* + * Stuff a single RSB slot. + * + * To mitigate Post-Barrier RSB speculation, one CALL instruction must be + * forced to retire before letting a RET instruction execute. + * + * On PBRSB-vulnerable CPUs, it is not safe for a RET to be executed + * before this point. + */ +#define __FILL_ONE_RETURN \ + __FILL_RETURN_SLOT \ + add $(BITS_PER_LONG/8), %_ASM_SP; \ lfence;
#ifdef __ASSEMBLY__ @@ -120,28 +131,15 @@ #endif .endm
-.macro ISSUE_UNBALANCED_RET_GUARD - ANNOTATE_INTRA_FUNCTION_CALL - call .Lunbalanced_ret_guard_@ - int3 -.Lunbalanced_ret_guard_@: - add $(BITS_PER_LONG/8), %_ASM_SP - lfence -.endm - /* * A simpler FILL_RETURN_BUFFER macro. Don't make people use the CPP * monstrosity above, manually. */ -.macro FILL_RETURN_BUFFER reg:req nr:req ftr:req ftr2 -.ifb \ftr2 - ALTERNATIVE "jmp .Lskip_rsb_@", "", \ftr -.else - ALTERNATIVE_2 "jmp .Lskip_rsb_@", "", \ftr, "jmp .Lunbalanced_@", \ftr2 -.endif - __FILL_RETURN_BUFFER(\reg,\nr,%_ASM_SP) -.Lunbalanced_@: - ISSUE_UNBALANCED_RET_GUARD +.macro FILL_RETURN_BUFFER reg:req nr:req ftr:req ftr2=ALT_NOT(X86_FEATURE_ALWAYS) + ALTERNATIVE_2 "jmp .Lskip_rsb_@", \ + __stringify(__FILL_RETURN_BUFFER(\reg,\nr)), \ftr, \ + __stringify(__FILL_ONE_RETURN), \ftr2 + .Lskip_rsb_@: .endm
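Roughly, and purely for illustration, __FILL_RETURN_BUFFER(%rax, 16) now expands on x86-64 to something like the following (ANNOTATE_* directives dropped, numeric labels invented here, not the macro's own):

	mov	$8, %rax		/* nr/2 loop iterations */
771:	call	1f			/* plant a fake return address: one RSB slot */
	int3				/* speculation trap behind the call */
1:	call	2f			/* second slot */
	int3
2:	add	$16, %rsp		/* drop both return addresses from the stack */
	dec	%rax
	jnz	771b
	lfence				/* barrier for jnz misprediction */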
From: Jan Beulich jbeulich@suse.com
commit 72cbc8f04fe2fa93443c0fcccb7ad91dfea3d9ce upstream.
After the commit named in the Fixes: tag, pat_enabled() returns false (because PAT initialization is suppressed in the absence of MTRRs being announced as available).
This has become a problem: the i915 driver now fails to initialize when running PV on Xen (i915_gem_object_pin_map() is where I located the induced failure), and its error handling is flaky enough to (at least sometimes) result in a hung system.
Yet even beyond that problem the keying of the use of WC mappings to pat_enabled() (see arch_can_pci_mmap_wc()) means that in particular graphics frame buffer accesses would have been quite a bit less optimal than possible.
Arrange for the function to return true in such environments, without undermining the rest of PAT MSR management logic considering PAT to be disabled: specifically, no writes to the PAT MSR should occur.
For the new boolean to live in .init.data, init_cache_modes() also needs moving to .init.text (where it could/should have lived already before).
[ bp: This is the "small fix" variant for stable. It'll get replaced with a proper PAT and MTRR detection split upstream but that is too involved for a stable backport. - additional touchups to commit msg. Use cpu_feature_enabled(). ]
Fixes: bdd8b6c98239 ("drm/i915: replace X86_FEATURE_PAT with pat_enabled()") Signed-off-by: Jan Beulich jbeulich@suse.com Signed-off-by: Borislav Petkov bp@suse.de Acked-by: Ingo Molnar mingo@kernel.org Cc: stable@vger.kernel.org Cc: Juergen Gross jgross@suse.com Cc: Lucas De Marchi lucas.demarchi@intel.com Link: https://lore.kernel.org/r/9385fa60-fa5d-f559-a137-6608408f88b0@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/x86/mm/pat/memtype.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)
--- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -62,6 +62,7 @@
static bool __read_mostly pat_bp_initialized; static bool __read_mostly pat_disabled = !IS_ENABLED(CONFIG_X86_PAT); +static bool __initdata pat_force_disabled = !IS_ENABLED(CONFIG_X86_PAT); static bool __read_mostly pat_bp_enabled; static bool __read_mostly pat_cm_initialized;
@@ -86,6 +87,7 @@ void pat_disable(const char *msg_reason) static int __init nopat(char *str) { pat_disable("PAT support disabled via boot option."); + pat_force_disabled = true; return 0; } early_param("nopat", nopat); @@ -272,7 +274,7 @@ static void pat_ap_init(u64 pat) wrmsrl(MSR_IA32_CR_PAT, pat); }
-void init_cache_modes(void) +void __init init_cache_modes(void) { u64 pat = 0;
@@ -313,6 +315,12 @@ void init_cache_modes(void) */ pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) | PAT(4, WB) | PAT(5, WT) | PAT(6, UC_MINUS) | PAT(7, UC); + } else if (!pat_force_disabled && cpu_feature_enabled(X86_FEATURE_HYPERVISOR)) { + /* + * Clearly PAT is enabled underneath. Allow pat_enabled() to + * reflect this. + */ + pat_bp_enabled = true; }
__init_cache_modes(pat);
On 8/29/2022 6:59 AM, Greg Kroah-Hartman wrote:
Dear Boris, Jan, Juergen, Thorsten, and Greg,
Just a note to say thanks to everyone who worked on this regression. I just upgraded my Xen PV dom0 on fc36 that was affected with the Fedora build of 5.19.6, which has the fix, and it works fine.
Thanks again,
Chuck
From: Siddh Raman Pant code@siddh.me
commit c490a0b5a4f36da3918181a8acdc6991d967c5f3 upstream.
Userspace can configure a loop device using an ioctl call, wherein a configuration of type loop_config is passed (see lo_ioctl()'s case on line 1550 of drivers/block/loop.c). This proceeds to call loop_configure(), which in turn calls loop_set_status_from_info() (see line 1050 of loop.c), passing &config->info, which is of type loop_info64 *. This function then sets the appropriate values, like the offset.
loop_device has lo_offset of type loff_t (see line 52 of loop.c), which is typedef-chained to long long, whereas loop_info64 has lo_offset of type __u64 (see line 56 of include/uapi/linux/loop.h).
The function directly copies the offset from info to the device as follows (see line 980 of loop.c):

	lo->lo_offset = info->lo_offset;
This results in an overflow, which triggers a warning in iomap_iter() due to a call to iomap_iter_done(), which has:

	WARN_ON_ONCE(iter->iomap.offset > iter->pos);
Thus, check for negative value during loop_set_status_from_info().
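A minimal userspace sketch of the sign flip (stand-in typedefs and a hypothetical offset value, not the kernel code):

#include <stdio.h>

typedef long long loff_t_demo;		/* stands in for loff_t */
typedef unsigned long long u64_demo;	/* stands in for __u64 */

int main(void)
{
	u64_demo info_offset = 0x8000000000000000ULL;	/* from userspace */
	loff_t_demo lo_offset = info_offset;		/* the unchecked copy */

	/* lo_offset is now negative; the fixed kernel returns -EOVERFLOW */
	printf("lo_offset = %lld\n", lo_offset);
	return 0;
}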
Bug report: https://syzkaller.appspot.com/bug?id=c620fe14aac810396d3c3edc9ad73848bf69a29...
Reported-and-tested-by: syzbot+a8e049cd3abd342936b6@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Reviewed-by: Matthew Wilcox (Oracle) willy@infradead.org Signed-off-by: Siddh Raman Pant code@siddh.me Reviewed-by: Christoph Hellwig hch@lst.de Link: https://lore.kernel.org/r/20220823160810.181275-1-code@siddh.me Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/block/loop.c | 5 +++++ 1 file changed, 5 insertions(+)
--- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -979,6 +979,11 @@ loop_set_status_from_info(struct loop_de
lo->lo_offset = info->lo_offset; lo->lo_sizelimit = info->lo_sizelimit; + + /* loff_t vars have been assigned __u64 */ + if (lo->lo_offset < 0 || lo->lo_sizelimit < 0) + return -EOVERFLOW; + memcpy(lo->lo_file_name, info->lo_file_name, LO_NAME_SIZE); lo->lo_file_name[LO_NAME_SIZE-1] = 0; lo->lo_flags = info->lo_flags;
From: Khazhismel Kumykov khazhy@chromium.org
commit f87904c075515f3e1d8f4a7115869d3b914674fd upstream.
When a disk is removed, bdi_unregister gets called to stop further writeback and wait for associated delayed work to complete. However, wb_inode_writeback_end() may schedule bandwidth estimation dwork after this has completed, which can result in the timer attempting to access the just freed bdi_writeback.
Fix this by checking if the bdi_writeback is alive, similar to when scheduling writeback work.
Since this requires wb->work_lock, and wb_inode_writeback_end() may get called from interrupt context, switch wb->work_lock to an irqsafe lock.
Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com Fixes: 45a2966fd641 ("writeback: fix bandwidth estimate for spiky workload") Signed-off-by: Khazhismel Kumykov khazhy@google.com Reviewed-by: Jan Kara jack@suse.cz Cc: Michael Stapelberg stapelberg+linux@google.com Cc: Wu Fengguang fengguang.wu@intel.com Cc: Alexander Viro viro@zeniv.linux.org.uk Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/fs-writeback.c | 12 ++++++------ mm/backing-dev.c | 10 +++++----- mm/page-writeback.c | 6 +++++- 3 files changed, 16 insertions(+), 12 deletions(-)
--- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -134,10 +134,10 @@ static bool inode_io_list_move_locked(st
static void wb_wakeup(struct bdi_writeback *wb) { - spin_lock_bh(&wb->work_lock); + spin_lock_irq(&wb->work_lock); if (test_bit(WB_registered, &wb->state)) mod_delayed_work(bdi_wq, &wb->dwork, 0); - spin_unlock_bh(&wb->work_lock); + spin_unlock_irq(&wb->work_lock); }
static void finish_writeback_work(struct bdi_writeback *wb, @@ -164,7 +164,7 @@ static void wb_queue_work(struct bdi_wri if (work->done) atomic_inc(&work->done->cnt);
- spin_lock_bh(&wb->work_lock); + spin_lock_irq(&wb->work_lock);
if (test_bit(WB_registered, &wb->state)) { list_add_tail(&work->list, &wb->work_list); @@ -172,7 +172,7 @@ static void wb_queue_work(struct bdi_wri } else finish_writeback_work(wb, work);
- spin_unlock_bh(&wb->work_lock); + spin_unlock_irq(&wb->work_lock); }
/** @@ -2082,13 +2082,13 @@ static struct wb_writeback_work *get_nex { struct wb_writeback_work *work = NULL;
- spin_lock_bh(&wb->work_lock); + spin_lock_irq(&wb->work_lock); if (!list_empty(&wb->work_list)) { work = list_entry(wb->work_list.next, struct wb_writeback_work, list); list_del_init(&work->list); } - spin_unlock_bh(&wb->work_lock); + spin_unlock_irq(&wb->work_lock); return work; }
--- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -260,10 +260,10 @@ void wb_wakeup_delayed(struct bdi_writeb unsigned long timeout;
timeout = msecs_to_jiffies(dirty_writeback_interval * 10); - spin_lock_bh(&wb->work_lock); + spin_lock_irq(&wb->work_lock); if (test_bit(WB_registered, &wb->state)) queue_delayed_work(bdi_wq, &wb->dwork, timeout); - spin_unlock_bh(&wb->work_lock); + spin_unlock_irq(&wb->work_lock); }
static void wb_update_bandwidth_workfn(struct work_struct *work) @@ -334,12 +334,12 @@ static void cgwb_remove_from_bdi_list(st static void wb_shutdown(struct bdi_writeback *wb) { /* Make sure nobody queues further work */ - spin_lock_bh(&wb->work_lock); + spin_lock_irq(&wb->work_lock); if (!test_and_clear_bit(WB_registered, &wb->state)) { - spin_unlock_bh(&wb->work_lock); + spin_unlock_irq(&wb->work_lock); return; } - spin_unlock_bh(&wb->work_lock); + spin_unlock_irq(&wb->work_lock);
cgwb_remove_from_bdi_list(wb); /* --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2867,6 +2867,7 @@ static void wb_inode_writeback_start(str
static void wb_inode_writeback_end(struct bdi_writeback *wb) { + unsigned long flags; atomic_dec(&wb->writeback_inodes); /* * Make sure estimate of writeback throughput gets updated after @@ -2875,7 +2876,10 @@ static void wb_inode_writeback_end(struc * that if multiple inodes end writeback at a similar time, they get * batched into one bandwidth update. */ - queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL); + spin_lock_irqsave(&wb->work_lock, flags); + if (test_bit(WB_registered, &wb->state)) + queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL); + spin_unlock_irqrestore(&wb->work_lock, flags); }
bool __folio_end_writeback(struct folio *folio)
From: Richard Guy Briggs rgb@redhat.com
commit d4fefa4801a1c2f9c0c7a48fbb0fdf384e89a4ab upstream.
The success and return_code are needed by the filters. Move audit_return_fixup() before the filters. This was causing syscall auditing events to be missed.
Link: https://github.com/linux-audit/audit-kernel/issues/138 Cc: stable@vger.kernel.org Fixes: 12c5e81d3fd0 ("audit: prepare audit_context for use in calling contexts beyond syscalls") Signed-off-by: Richard Guy Briggs rgb@redhat.com [PM: manual merge required] Signed-off-by: Paul Moore paul@paul-moore.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/auditsc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
--- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -1965,6 +1965,7 @@ void __audit_uring_exit(int success, lon goto out; }
+ audit_return_fixup(ctx, success, code); if (ctx->context == AUDIT_CTX_SYSCALL) { /* * NOTE: See the note in __audit_uring_entry() about the case @@ -2006,7 +2007,6 @@ void __audit_uring_exit(int success, lon audit_filter_inodes(current, ctx); if (ctx->current_state != AUDIT_STATE_RECORD) goto out; - audit_return_fixup(ctx, success, code); audit_log_exit();
out: @@ -2090,13 +2090,13 @@ void __audit_syscall_exit(int success, l if (!list_empty(&context->killed_trees)) audit_kill_trees(context);
+ audit_return_fixup(context, success, return_code); /* run through both filters to ensure we set the filterkey properly */ audit_filter_syscall(current, context); audit_filter_inodes(current, context); if (context->current_state < AUDIT_STATE_RECORD) goto out;
- audit_return_fixup(context, success, return_code); audit_log_exit();
out:
From: Quanyang Wang quanyang.wang@windriver.com
commit 0c7d7cc2b4fe2e74ef8728f030f0f1674f9f6aee upstream.
There are two problems with the current code of memory_intersects:
First, it doesn't check whether the region (begin, end) falls inside the region (virt, vend), that is (virt < begin && vend > end).
The second problem is that if vend is equal to begin, it will return true, but this is wrong since vend (virt + size) is not the last address of the memory region; (virt + size - 1) is. The wrong determination will trigger misreporting when check_for_illegal_area() calls memory_intersects() to check if the DMA region intersects with the stext region.
The misreporting is as below (stext is at 0x80100000):

WARNING: CPU: 0 PID: 77 at kernel/dma/debug.c:1073 check_for_illegal_area+0x130/0x168
DMA-API: chipidea-usb2 e0002000.usb: device driver maps memory from kernel text or rodata [addr=800f0000] [len=65536]
Modules linked in:
CPU: 1 PID: 77 Comm: usb-storage Not tainted 5.19.0-yocto-standard #5
Hardware name: Xilinx Zynq Platform
 unwind_backtrace from show_stack+0x18/0x1c
 show_stack from dump_stack_lvl+0x58/0x70
 dump_stack_lvl from __warn+0xb0/0x198
 __warn from warn_slowpath_fmt+0x80/0xb4
 warn_slowpath_fmt from check_for_illegal_area+0x130/0x168
 check_for_illegal_area from debug_dma_map_sg+0x94/0x368
 debug_dma_map_sg from __dma_map_sg_attrs+0x114/0x128
 __dma_map_sg_attrs from dma_map_sg_attrs+0x18/0x24
 dma_map_sg_attrs from usb_hcd_map_urb_for_dma+0x250/0x3b4
 usb_hcd_map_urb_for_dma from usb_hcd_submit_urb+0x194/0x214
 usb_hcd_submit_urb from usb_sg_wait+0xa4/0x118
 usb_sg_wait from usb_stor_bulk_transfer_sglist+0xa0/0xec
 usb_stor_bulk_transfer_sglist from usb_stor_bulk_srb+0x38/0x70
 usb_stor_bulk_srb from usb_stor_Bulk_transport+0x150/0x360
 usb_stor_Bulk_transport from usb_stor_invoke_transport+0x38/0x440
 usb_stor_invoke_transport from usb_stor_control_thread+0x1e0/0x238
 usb_stor_control_thread from kthread+0xf8/0x104
 kthread from ret_from_fork+0x14/0x2c
Refactor memory_intersects to fix the two problems above.
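A small userspace harness (not the kernel code; the array offsets are arbitrary) makes both failure modes concrete:

#include <stddef.h>
#include <stdbool.h>
#include <stdio.h>

/* old predicate: misses full containment and counts vend == begin as a hit */
static bool old_intersects(char *begin, char *end, char *virt, size_t size)
{
	char *vend = virt + size;

	return (virt >= begin && virt < end) || (vend >= begin && vend < end);
}

/* new predicate: standard half-open interval overlap */
static bool new_intersects(char *begin, char *end, char *virt, size_t size)
{
	char *vend = virt + size;

	return virt < end && vend > begin;
}

int main(void)
{
	char mem[64];

	/* object [16,32) ends exactly where region [32,48) begins: no overlap */
	printf("adjacent:    old=%d new=%d\n",
	       old_intersects(mem + 32, mem + 48, mem + 16, 16),
	       new_intersects(mem + 32, mem + 48, mem + 16, 16));

	/* region [16,24) lies fully inside object [8,40): overlap */
	printf("containment: old=%d new=%d\n",
	       old_intersects(mem + 16, mem + 24, mem + 8, 32),
	       new_intersects(mem + 16, mem + 24, mem + 8, 32));
	return 0;
}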
Before commit 1d7db834a027e ("dma-debug: use memory_intersects() directly"), memory_intersects() was called only by printk_late_init():
printk_late_init -> init_section_intersects -> memory_intersects.
There were few places where memory_intersects was called.
When commit 1d7db834a027e ("dma-debug: use memory_intersects() directly") was merged and CONFIG_DMA_API_DEBUG is enabled, the DMA subsystem uses it to check for an illegal area, and the calltrace above is triggered.
[akpm@linux-foundation.org: fix nearby comment typo] Link: https://lkml.kernel.org/r/20220819081145.948016-1-quanyang.wang@windriver.co... Fixes: 979559362516 ("asm/sections: add helpers to check for section data") Signed-off-by: Quanyang Wang quanyang.wang@windriver.com Cc: Ard Biesheuvel ardb@kernel.org Cc: Arnd Bergmann arnd@arndb.de Cc: Thierry Reding treding@nvidia.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/asm-generic/sections.h | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
--- a/include/asm-generic/sections.h +++ b/include/asm-generic/sections.h @@ -97,7 +97,7 @@ static inline bool memory_contains(void /** * memory_intersects - checks if the region occupied by an object intersects * with another memory region - * @begin: virtual address of the beginning of the memory regien + * @begin: virtual address of the beginning of the memory region * @end: virtual address of the end of the memory region * @virt: virtual address of the memory object * @size: size of the memory object @@ -110,7 +110,10 @@ static inline bool memory_intersects(voi { void *vend = virt + size;
- return (virt >= begin && virt < end) || (vend >= begin && vend < end); + if (virt < end && vend > begin) + return true; + + return false; }
/**
From: Badari Pulavarty badari.pulavarty@intel.com
commit d26f60703606ab425eee9882b32a1781a8bed74d upstream.
When a user tries to create a DAMON context via the DAMON debugfs interface with the name of an already existing context, the context directory creation fails, but a new context is created and added to the internal data structure, due to the absence of a check on directory creation success. As a result, memory could leak and DAMON cannot be turned on. An example test case is below:
# cd /sys/kernel/debug/damon/
# echo "off" > monitor_on
# echo paddr > target_ids
# echo "abc" > mk_context
# echo "abc" > mk_context
# echo $$ > abc/target_ids
# echo "on" > monitor_on   <<< fails
The return value of 'debugfs_create_dir()' is expected to be ignored in general, but this is an exceptional case, as the DAMON feature depends on the debugfs functionality and has the potential duplicate-name issue. This commit therefore fixes the issue by checking for directory creation failure and immediately returning the error in that case.
Link: https://lkml.kernel.org/r/20220821180853.2400-1-sj@kernel.org Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts") Signed-off-by: Badari Pulavarty badari.pulavarty@intel.com Signed-off-by: SeongJae Park sj@kernel.org Cc: stable@vger.kernel.org [ 5.15.x] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/damon/dbgfs.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -787,6 +787,9 @@ static int dbgfs_mk_context(char *name) return -ENOENT;
new_dir = debugfs_create_dir(name, root); + /* Below check is required for a potential duplicated name case */ + if (IS_ERR(new_dir)) + return PTR_ERR(new_dir); dbgfs_dirs[dbgfs_nr_ctxs] = new_dir;
new_ctx = dbgfs_new_ctx();
From: Gerald Schaefer gerald.schaefer@linux.ibm.com
commit 41ac42f137080bc230b5882e3c88c392ab7f2d32 upstream.
For non-protection pXd_none() page faults in do_dat_exception(), we call do_exception() with access == (VM_READ | VM_WRITE | VM_EXEC). In do_exception(), vma->vm_flags is checked against that before calling handle_mm_fault().
Since commit 92f842eac7ee3 ("[S390] store indication fault optimization"), we call handle_mm_fault() with FAULT_FLAG_WRITE, when recognizing that it was a write access. However, the vma flags check is still only checking against (VM_READ | VM_WRITE | VM_EXEC), and therefore also calling handle_mm_fault() with FAULT_FLAG_WRITE in cases where the vma does not allow VM_WRITE.
Fix this by changing access check in do_exception() to VM_WRITE only, when recognizing write access.
Link: https://lkml.kernel.org/r/20220811103435.188481-3-david@redhat.com Fixes: 92f842eac7ee3 ("[S390] store indication fault optimization") Cc: stable@vger.kernel.org Reported-by: David Hildenbrand david@redhat.com Reviewed-by: Heiko Carstens hca@linux.ibm.com Signed-off-by: Gerald Schaefer gerald.schaefer@linux.ibm.com Signed-off-by: Vasily Gorbik gor@linux.ibm.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/s390/mm/fault.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
--- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -379,7 +379,9 @@ static inline vm_fault_t do_exception(st flags = FAULT_FLAG_DEFAULT; if (user_mode(regs)) flags |= FAULT_FLAG_USER; - if (access == VM_WRITE || is_write) + if (is_write) + access = VM_WRITE; + if (access == VM_WRITE) flags |= FAULT_FLAG_WRITE; mmap_read_lock(mm);
From: Liu Shixin liushixin2@huawei.com
commit dd0ff4d12dd284c334f7e9b07f8f335af856ac78 upstream.
The vmemmap pages are marked by kmemleak when allocated from memblock. Remove them from kmemleak when freeing the page. Otherwise, when we reuse the page, kmemleak may report an error like the one below and then stop working.
kmemleak: Cannot insert 0xffff98fb6eab3d40 into the object search tree (overlaps existing)
kmemleak: Kernel memory leak detector disabled
kmemleak: Object 0xffff98fb6be00000 (size 335544320):
kmemleak:   comm "swapper", pid 0, jiffies 4294892296
kmemleak:   min_count = 0
kmemleak:   count = 0
kmemleak:   flags = 0x1
kmemleak:   checksum = 0
kmemleak:   backtrace:
Link: https://lkml.kernel.org/r/20220819094005.2928241-1-liushixin2@huawei.com Fixes: f41f2ed43ca5 (mm: hugetlb: free the vmemmap pages associated with each HugeTLB page) Signed-off-by: Liu Shixin liushixin2@huawei.com Reviewed-by: Muchun Song songmuchun@bytedance.com Cc: Matthew Wilcox willy@infradead.org Cc: Mike Kravetz mike.kravetz@oracle.com Cc: Oscar Salvador osalvador@suse.de Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/bootmem_info.c | 2 ++ 1 file changed, 2 insertions(+)
--- a/mm/bootmem_info.c +++ b/mm/bootmem_info.c @@ -12,6 +12,7 @@ #include <linux/memblock.h> #include <linux/bootmem_info.h> #include <linux/memory_hotplug.h> +#include <linux/kmemleak.h>
void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) { @@ -33,6 +34,7 @@ void put_page_bootmem(struct page *page) ClearPagePrivate(page); set_page_private(page, 0); INIT_LIST_HEAD(&page->lru); + kmemleak_free_part(page_to_virt(page), PAGE_SIZE); free_reserved_page(page); } }
From: Miaohe Lin linmiaohe@huawei.com
commit ab74ef708dc51df7cf2b8a890b9c6990fac5c0c6 upstream.
In the MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page cache are installed in the ptes. But hugepage_add_new_anon_rmap() is called for them mistakenly because they're not vm_shared. This will corrupt the page->mapping used by the page cache code.
Link: https://lkml.kernel.org/r/20220712130542.18836-1-linmiaohe@huawei.com Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") Signed-off-by: Miaohe Lin linmiaohe@huawei.com Reviewed-by: Mike Kravetz mike.kravetz@oracle.com Cc: Axel Rasmussen axelrasmussen@google.com Cc: Peter Xu peterx@redhat.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6026,7 +6026,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_s if (!huge_pte_none_mostly(huge_ptep_get(dst_pte))) goto out_release_unlock;
- if (vm_shared) { + if (page_in_pagecache) { page_dup_file_rmap(page, true); } else { ClearHPageRestoreReserve(page);
From: Peter Xu peterx@redhat.com
commit 3d2f78f08cd8388035ac375e731ec1ac1b79b09d upstream.
Yu Zhao reported a bug after the commit "mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry" added a check in swp_offset_pfn() for swap type [1]:
kernel BUG at include/linux/swapops.h:117!
CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S O L 6.0.0-dbg-DEV #2
RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0
Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6 c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e <0f> 0b 48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48
RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282
RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000
RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b
RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000
R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738
R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a
FS:  00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 change_pte_range+0x36e/0x880
 change_p4d_range+0x2e8/0x670
 change_protection_range+0x14e/0x2c0
 mprotect_fixup+0x1ee/0x330
 do_mprotect_pkey+0x34c/0x440
 __x64_sys_mprotect+0x1d/0x30
It triggers because pfn_swap_entry_to_page() could be called upon e.g. a genuine swap entry.
Fix it by only calling it when it's a write migration entry where the page* is used.
[1] https://lore.kernel.org/lkml/CAOUHufaVC2Za-p8m0aiHw6YkheDcrO-C3wRGixwDS32VTS...
Link: https://lkml.kernel.org/r/20220823221138.45602-1-peterx@redhat.com Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: Peter Xu peterx@redhat.com Reported-by: Yu Zhao yuzhao@google.com Tested-by: Yu Zhao yuzhao@google.com Reviewed-by: David Hildenbrand david@redhat.com Cc: "Huang, Ying" ying.huang@intel.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/mprotect.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
--- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -158,10 +158,11 @@ static unsigned long change_pte_range(st pages++; } else if (is_swap_pte(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte); - struct page *page = pfn_swap_entry_to_page(entry); pte_t newpte;
if (is_writable_migration_entry(entry)) { + struct page *page = pfn_swap_entry_to_page(entry); + /* * A protection check is difficult so * just be safe and disable write
From: Paulo Alcantara pc@cjr.nz
commit a1d2eb51f0a33c28f5399a1610e66b3fbd24e884 upstream.
Since commit "cifs: alloc_path_with_tree_prefix: do not append sep. if the path is empty", the alloc_path_with_tree_prefix() function no longer includes the trailing separator when @path is empty, although @out_len was still assuming a path separator, thus adding an extra byte to the final filename.
This has caused mount issues in some Synology servers due to the extra NULL byte in filenames when sending SMB2_CREATE requests with SMB2_FLAGS_DFS_OPERATIONS set.
Fix this by checking if @path is not empty and only then adding the extra byte for the separator. Also, do not include any trailing NULL bytes in the filename, as MS-SMB2 requires it to be 8-byte aligned and not NULL-terminated.
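To make the sizing arithmetic concrete, a stand-alone sketch with a hypothetical 13-character tree name and an empty @path (ROUNDUP mirrors the kernel's roundup()):

#include <stdio.h>

#define ROUNDUP(x, y) ((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
	unsigned int treename_len = 13, path_len = 0;	/* empty @path */

	/* before: a separator char is counted even though none is written */
	unsigned int old_len  = treename_len + 1 + path_len;
	unsigned int old_size = ROUNDUP((old_len + 1) * 2, 8);

	/* after: separator only for a non-empty path; NUL kept out of out_len */
	unsigned int new_len  = treename_len + (path_len ? 1 : 0) + path_len;
	unsigned int new_size = ROUNDUP(new_len * 2, 8) + 2;	/* + trailing NUL */

	printf("old: out_len=%u alloc=%u\n", old_len, old_size);	/* 14, 32 */
	printf("new: out_len=%u alloc=%u\n", new_len, new_size);	/* 13, 34 */
	return 0;
}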
Cc: stable@vger.kernel.org Fixes: 7eacba3b00a3 ("cifs: alloc_path_with_tree_prefix: do not append sep. if the path is empty") Signed-off-by: Paulo Alcantara (SUSE) pc@cjr.nz Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/cifs/smb2pdu.c | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-)
--- a/fs/cifs/smb2pdu.c +++ b/fs/cifs/smb2pdu.c @@ -2571,19 +2571,15 @@ alloc_path_with_tree_prefix(__le16 **out
path_len = UniStrnlen((wchar_t *)path, PATH_MAX);
- /* - * make room for one path separator between the treename and - * path - */ - *out_len = treename_len + 1 + path_len; + /* make room for one path separator only if @path isn't empty */ + *out_len = treename_len + (path[0] ? 1 : 0) + path_len;
/* - * final path needs to be null-terminated UTF16 with a - * size aligned to 8 + * final path needs to be 8-byte aligned as specified in + * MS-SMB2 2.2.13 SMB2 CREATE Request. */ - - *out_size = roundup((*out_len+1)*2, 8); - *out_path = kzalloc(*out_size, GFP_KERNEL); + *out_size = roundup(*out_len * sizeof(__le16), 8); + *out_path = kzalloc(*out_size + sizeof(__le16) /* null */, GFP_KERNEL); if (!*out_path) return -ENOMEM;
From: Brian Foster bfoster@redhat.com
commit 13cccafe0edcd03bf1c841de8ab8a1c8e34f77d9 upstream.
The pointers for guarded storage and runtime instrumentation control blocks are stored in the thread_struct of the associated task. These pointers are initially copied on fork() via arch_dup_task_struct() and then cleared via copy_thread() before fork() returns. If fork() happens to fail after the initial task dup and before copy_thread(), the newly allocated task and associated thread_struct memory are freed via free_task() -> arch_release_task_struct(). This results in a double free of the guarded storage and runtime info structs because the fields in the failed task still refer to memory associated with the source task.
This problem can manifest as a BUG_ON() in set_freepointer() (with CONFIG_SLAB_FREELIST_HARDENED enabled) or KASAN splat (if enabled) when running trinity syscall fuzz tests on s390x. To avoid this problem, clear the associated pointer fields in arch_dup_task_struct() immediately after the new task is copied. Note that the RI flag is still cleared in copy_thread() because it resides in thread stack memory and that is where stack info is copied.
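The bug class is generic; a userspace sketch (struct shape invented, field name borrowed from the patch) shows why clearing right after the copy matters:

#include <stdlib.h>
#include <string.h>

struct thread_demo {
	void *ri_cb;			/* owning pointer */
};

static int dup_task(struct thread_demo *dst, const struct thread_demo *src)
{
	memcpy(dst, src, sizeof(*dst));	/* dst->ri_cb now aliases src->ri_cb */
	dst->ri_cb = NULL;		/* the fix: clear before any error path can free it */
	return 0;
}

int main(void)
{
	struct thread_demo src = { .ri_cb = malloc(16) }, dst;

	dup_task(&dst, &src);
	free(dst.ri_cb);		/* frees NULL: harmless */
	free(src.ri_cb);		/* single, valid free of the real block */
	return 0;
}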
Signed-off-by: Brian Foster bfoster@redhat.com Fixes: 8d9047f8b967c ("s390/runtime instrumentation: simplify task exit handling") Fixes: 7b83c6297d2fc ("s390/guarded storage: simplify task exit handling") Cc: stable@vger.kernel.org # 4.15 Reviewed-by: Gerald Schaefer gerald.schaefer@linux.ibm.com Reviewed-by: Heiko Carstens hca@linux.ibm.com Link: https://lore.kernel.org/r/20220816155407.537372-1-bfoster@redhat.com Signed-off-by: Vasily Gorbik gor@linux.ibm.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/s390/kernel/process.c | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-)
--- a/arch/s390/kernel/process.c +++ b/arch/s390/kernel/process.c @@ -91,6 +91,18 @@ int arch_dup_task_struct(struct task_str
memcpy(dst, src, arch_task_struct_size); dst->thread.fpu.regs = dst->thread.fpu.fprs; + + /* + * Don't transfer over the runtime instrumentation or the guarded + * storage control block pointers. These fields are cleared here instead + * of in copy_thread() to avoid premature freeing of associated memory + * on fork() failure. Wait to clear the RI flag because ->stack still + * refers to the source thread. + */ + dst->thread.ri_cb = NULL; + dst->thread.gs_cb = NULL; + dst->thread.gs_bc_cb = NULL; + return 0; }
@@ -150,13 +162,11 @@ int copy_thread(struct task_struct *p, c frame->childregs.flags = 0; if (new_stackp) frame->childregs.gprs[15] = new_stackp; - - /* Don't copy runtime instrumentation info */ - p->thread.ri_cb = NULL; + /* + * Clear the runtime instrumentation flag after the above childregs + * copy. The CB pointer was already cleared in arch_dup_task_struct(). + */ frame->childregs.psw.mask &= ~PSW_MASK_RI; - /* Don't copy guarded storage control block */ - p->thread.gs_cb = NULL; - p->thread.gs_bc_cb = NULL;
/* Set a new TLS ? */ if (clone_flags & CLONE_SETTLS) {
From: Shigeru Yoshida syoshida@redhat.com
commit a5a923038d70d2d4a86cb4e3f32625a5ee6e7e24 upstream.
fbcon_do_set_font() calls vc_resize() when the font size is changed. However, if vc_resize() fails, the current implementation doesn't revert the font size changes, and this causes an inconsistent state.
syzbot reported an unable-to-handle page fault due to this issue [1]. syzbot's repro uses fault injection, which causes a memory allocation failure, so vc_resize() failed.
This patch fixes this issue by properly reverting the changes to the font-related data when vc_resize() fails.
Link: https://syzkaller.appspot.com/bug?id=3443d3a1fa6d964dd7310a0cb1696d165a3e07c... [1] Reported-by: syzbot+a168dbeaaa7778273c1b@syzkaller.appspotmail.com Signed-off-by: Shigeru Yoshida syoshida@redhat.com Signed-off-by: Helge Deller deller@gmx.de CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/video/fbdev/core/fbcon.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-)
--- a/drivers/video/fbdev/core/fbcon.c +++ b/drivers/video/fbdev/core/fbcon.c @@ -2402,15 +2402,21 @@ static int fbcon_do_set_font(struct vc_d struct fb_info *info = fbcon_info_from_console(vc->vc_num); struct fbcon_ops *ops = info->fbcon_par; struct fbcon_display *p = &fb_display[vc->vc_num]; - int resize; + int resize, ret, old_userfont, old_width, old_height, old_charcount; char *old_data = NULL;
resize = (w != vc->vc_font.width) || (h != vc->vc_font.height); if (p->userfont) old_data = vc->vc_font.data; vc->vc_font.data = (void *)(p->fontdata = data); + old_userfont = p->userfont; if ((p->userfont = userfont)) REFCOUNT(data)++; + + old_width = vc->vc_font.width; + old_height = vc->vc_font.height; + old_charcount = vc->vc_font.charcount; + vc->vc_font.width = w; vc->vc_font.height = h; vc->vc_font.charcount = charcount; @@ -2426,7 +2432,9 @@ static int fbcon_do_set_font(struct vc_d rows = FBCON_SWAP(ops->rotate, info->var.yres, info->var.xres); cols /= w; rows /= h; - vc_resize(vc, cols, rows); + ret = vc_resize(vc, cols, rows); + if (ret) + goto err_out; } else if (con_is_visible(vc) && vc->vc_mode == KD_TEXT) { fbcon_clear_margins(vc, 0); @@ -2436,6 +2444,21 @@ static int fbcon_do_set_font(struct vc_d if (old_data && (--REFCOUNT(old_data) == 0)) kfree(old_data - FONT_EXTRA_WORDS * sizeof(int)); return 0; + +err_out: + p->fontdata = old_data; + vc->vc_font.data = (void *)old_data; + + if (userfont) { + p->userfont = old_userfont; + REFCOUNT(data)--; + } + + vc->vc_font.width = old_width; + vc->vc_font.height = old_height; + vc->vc_font.charcount = old_charcount; + + return ret; }
/*
From: Shakeel Butt shakeelb@google.com
commit dbb16df6443c59e8a1ef21c2272fcf387d600ddf upstream.
This reverts commit 96e51ccf1af33e82f429a0d6baebba29c6448d0f.
Recently we started running the kernel with the rstat infrastructure on production traffic and began to see negative memcg stats values. In particular, the 'sock' stat is the one we observed having a negative value.
$ grep "sock " /mnt/memory/job/memory.stat sock 253952 total_sock 18446744073708724224
Re-run after a couple of seconds:
$ grep "sock " /mnt/memory/job/memory.stat sock 253952 total_sock 53248
For now we are only seeing this issue on large machines (256 CPUs) and only with the 'sock' stat. I think the networking stack increases the stat on one CPU and decreases it on another CPU much more often. So this negative 'sock' value is due to the rstat flusher flushing the stats on the CPU that has seen the decrement of sock but missing the CPU that holds the increment. A typical race condition.
For an easy stable backport, a revert is the simplest solution. For a long-term solution, I am thinking of two directions. The first is to just reduce the race window by optimizing the rstat flusher. The second is, if the reader sees a negative stat value, to force a flush and restart the stat collection. Basically retry, but limited.
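A toy model of the race and of the read-side clamp the revert restores (invented numbers, plain C):

#include <stdio.h>

static long percpu_delta[2];	/* per-CPU contributions to one stat */

static long flush_sum(void)
{
	long x = percpu_delta[0] + percpu_delta[1];	/* reads race with updates */

	return x < 0 ? 0 : x;	/* the clamp the restored code applies under SMP */
}

int main(void)
{
	/* CPU 1's decrement is visible; CPU 0's matching increment is not yet */
	percpu_delta[1] = -253952;

	printf("raw sum     = %ld\n", percpu_delta[0] + percpu_delta[1]);
	printf("clamped sum = %ld\n", flush_sum());
	return 0;
}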
Link: https://lkml.kernel.org/r/20220817172139.3141101-1-shakeelb@google.com Fixes: 96e51ccf1af33e8 ("memcg: cleanup racy sum avoidance code") Signed-off-by: Shakeel Butt shakeelb@google.com Cc: "Michal Koutný" mkoutny@suse.com Cc: Johannes Weiner hannes@cmpxchg.org Cc: Michal Hocko mhocko@kernel.org Cc: Roman Gushchin roman.gushchin@linux.dev Cc: Muchun Song songmuchun@bytedance.com Cc: David Hildenbrand david@redhat.com Cc: Yosry Ahmed yosryahmed@google.com Cc: Greg Thelen gthelen@google.com Cc: stable@vger.kernel.org [5.15] Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- include/linux/memcontrol.h | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-)
--- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -978,19 +978,30 @@ static inline void mod_memcg_page_state(
static inline unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) { - return READ_ONCE(memcg->vmstats.state[idx]); + long x = READ_ONCE(memcg->vmstats.state[idx]); +#ifdef CONFIG_SMP + if (x < 0) + x = 0; +#endif + return x; }
static inline unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx) { struct mem_cgroup_per_node *pn; + long x;
if (mem_cgroup_disabled()) return node_page_state(lruvec_pgdat(lruvec), idx);
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - return READ_ONCE(pn->lruvec_stats.state[idx]); + x = READ_ONCE(pn->lruvec_stats.state[idx]); +#ifdef CONFIG_SMP + if (x < 0) + x = 0; +#endif + return x; }
static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
From: Matthew Wilcox (Oracle) willy@infradead.org
commit 9dfb3b8d655022760ca68af11821f1c63aa547c3 upstream.
If we allocate a new page, we need to make sure that our folio matches that new page.
If we do end up in this code path, we store the wrong page in the shmem inode's page cache, and I would rather imagine that data corruption ensues.
This will be solved by changing shmem_replace_page() to shmem_replace_folio(), but this is the minimal fix.
Link: https://lkml.kernel.org/r/20220730042518.1264767-1-willy@infradead.org Fixes: da08e9b79323 ("mm/shmem: convert shmem_swapin_page() to shmem_swapin_folio()") Signed-off-by: Matthew Wilcox (Oracle) willy@infradead.org Reviewed-by: William Kucharski william.kucharski@oracle.com Cc: Hugh Dickins hughd@google.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- mm/shmem.c | 1 + 1 file changed, 1 insertion(+)
--- a/mm/shmem.c +++ b/mm/shmem.c @@ -1771,6 +1771,7 @@ static int shmem_swapin_folio(struct ino
if (shmem_should_replace_folio(folio, gfp)) { error = shmem_replace_page(&page, gfp, info, index); + folio = page_folio(page); if (error) goto failed; }
From: Riwen Lu luriwen@kylinos.cn
commit 36527b9d882362567ceb4eea8666813280f30e6f upstream.
The freq QoS request would be removed repeatedly if the cpufreq policy relates to more than one CPU, which then causes the "called for unknown object" warning.
Remove the freq QoS request for each CPU related to the cpufreq policy, instead of repeatedly removing it for the last CPU of the policy.
Fixes: a1bb46c36ce3 ("ACPI: processor: Add QoS requests for all CPUs") Reported-by: Jeremy Linton Jeremy.Linton@arm.com Tested-by: Jeremy Linton jeremy.linton@arm.com Signed-off-by: Riwen Lu luriwen@kylinos.cn Cc: 5.4+ stable@vger.kernel.org # 5.4+ Signed-off-by: Rafael J. Wysocki rafael.j.wysocki@intel.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/acpi/processor_thermal.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/acpi/processor_thermal.c +++ b/drivers/acpi/processor_thermal.c @@ -151,7 +151,7 @@ void acpi_thermal_cpufreq_exit(struct cp unsigned int cpu;
for_each_cpu(cpu, policy->related_cpus) { - struct acpi_processor *pr = per_cpu(processors, policy->cpu); + struct acpi_processor *pr = per_cpu(processors, cpu);
if (pr) freq_qos_remove_request(&pr->thermal_req);
From: Karol Herbst kherbst@redhat.com
commit 6b04ce966a738ecdd9294c9593e48513c0dc90aa upstream.
It is a bit unclear to us why that helps, but it does, and it unbreaks suspend/resume on a lot of GPUs without any known drawbacks.
Cc: stable@vger.kernel.org # v5.15+ Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156 Signed-off-by: Karol Herbst kherbst@redhat.com Reviewed-by: Lyude Paul lyude@redhat.com Link: https://patchwork.freedesktop.org/patch/msgid/20220819200928.401416-1-kherbs... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/gpu/drm/nouveau/nouveau_bo.c | 9 +++++++++ 1 file changed, 9 insertions(+)
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -820,6 +820,15 @@ nouveau_bo_move_m2mf(struct ttm_buffer_o if (ret == 0) { ret = nouveau_fence_new(chan, false, &fence); if (ret == 0) { + /* TODO: figure out a better solution here + * + * wait on the fence here explicitly as going through + * ttm_bo_move_accel_cleanup somehow doesn't seem to do it. + * + * Without this the operation can timeout and we'll fallback to a + * software copy, which might take several minutes to finish. + */ + nouveau_fence_wait(fence, false, false); ret = ttm_bo_move_accel_cleanup(bo, &fence->base, evict, false,
From: David Howells dhowells@redhat.com
commit ba0803050d610d5072666be727bca5e03e55b242 upstream.
smb3 fallocate punch hole was not grabbing the inode or filemap_invalidate locks, so it could race with pagemap reinstantiating the page.
Cc: stable@vger.kernel.org Signed-off-by: David Howells dhowells@redhat.com Signed-off-by: Steve French stfrench@microsoft.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/cifs/smb2ops.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)
--- a/fs/cifs/smb2ops.c +++ b/fs/cifs/smb2ops.c @@ -3671,7 +3671,7 @@ static long smb3_zero_range(struct file static long smb3_punch_hole(struct file *file, struct cifs_tcon *tcon, loff_t offset, loff_t len) { - struct inode *inode; + struct inode *inode = file_inode(file); struct cifsFileInfo *cfile = file->private_data; struct file_zero_data_information fsctl_buf; long rc; @@ -3680,14 +3680,12 @@ static long smb3_punch_hole(struct file
xid = get_xid();
- inode = d_inode(cfile->dentry); - + inode_lock(inode); /* Need to make file sparse, if not already, before freeing range. */ /* Consider adding equivalent for compressed since it could also work */ if (!smb2_set_sparse(xid, tcon, cfile, inode, set_sparse)) { rc = -EOPNOTSUPP; - free_xid(xid); - return rc; + goto out; }
filemap_invalidate_lock(inode->i_mapping); @@ -3707,8 +3705,10 @@ static long smb3_punch_hole(struct file true /* is_fctl */, (char *)&fsctl_buf, sizeof(struct file_zero_data_information), CIFSMaxBufSize, NULL, NULL); - free_xid(xid); filemap_invalidate_unlock(inode->i_mapping); +out: + inode_unlock(inode); + free_xid(xid); return rc; }
From: Heming Zhao ocfs2-devel@oss.oracle.com
commit 550842cc60987b269e31b222283ade3e1b6c7fc8 upstream.
After commit 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before return error"), any procedure after ocfs2_dlm_init() fails will trigger crash when calling ocfs2_dlm_shutdown().
I.e.: in local mount mode, no dlm resource is initialized. If ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling will call ocfs2_dlm_shutdown(), which then does the dlm resource cleanup job and triggers a kernel crash.
This solution should bypass uninitialized resources in ocfs2_dlm_shutdown().
Link: https://lkml.kernel.org/r/20220815085754.20417-1-heming.zhao@suse.com Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before return error") Signed-off-by: Heming Zhao heming.zhao@suse.com Reviewed-by: Joseph Qi joseph.qi@linux.alibaba.com Cc: Mark Fasheh mark@fasheh.com Cc: Joel Becker jlbec@evilplan.org Cc: Junxiao Bi junxiao.bi@oracle.com Cc: Changwei Ge gechangwei@live.cn Cc: Gang He ghe@suse.com Cc: Jun Piao piaojun@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton akpm@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/ocfs2/dlmglue.c | 8 +++++--- fs/ocfs2/super.c | 3 +-- 2 files changed, 6 insertions(+), 5 deletions(-)
--- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -3403,10 +3403,12 @@ void ocfs2_dlm_shutdown(struct ocfs2_sup ocfs2_lock_res_free(&osb->osb_nfs_sync_lockres); ocfs2_lock_res_free(&osb->osb_orphan_scan.os_lockres);
- ocfs2_cluster_disconnect(osb->cconn, hangup_pending); - osb->cconn = NULL; + if (osb->cconn) { + ocfs2_cluster_disconnect(osb->cconn, hangup_pending); + osb->cconn = NULL;
- ocfs2_dlm_shutdown_debug(osb); + ocfs2_dlm_shutdown_debug(osb); + } }
static int ocfs2_drop_lock(struct ocfs2_super *osb, --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -1914,8 +1914,7 @@ static void ocfs2_dismount_volume(struct !ocfs2_is_hard_readonly(osb)) hangup_needed = 1;
- if (osb->cconn) - ocfs2_dlm_shutdown(osb, hangup_needed); + ocfs2_dlm_shutdown(osb, hangup_needed);
ocfs2_blockcheck_stats_debugfs_remove(&osb->osb_ecc_stats); debugfs_remove_recursive(osb->osb_debug_root);
From: Juergen Gross jgross@suse.com
commit c5deb27895e017a0267de0a20d140ad5fcc55a54 upstream.
The error exit of privcmd_ioctl_dm_op() may call unlock_pages() with pages being NULL, leading to a NULL pointer dereference.
Additionally, lock_pages() doesn't check whether pin_user_pages_fast() was completely successful, so potentially not all pages are locked into memory. This could result in sporadic failures when the related memory is used in user mode.
Fix all of that by always calling unlock_pages() with the real number of pinned pages (which will be zero in case pages is NULL), and by checking that the number of pages pinned by pin_user_pages_fast() matches the expected number of pages.
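A userspace analogue of the fixed pinning loop (pin_some() stands in for pin_user_pages_fast() and may make partial progress on each call; everything here is invented for illustration):

#include <stdio.h>

static int pin_some(unsigned int want)
{
	return want < 3 ? (int)want : 3;	/* pretend at most 3 pages pin per call */
}

int main(void)
{
	unsigned int requested = 8, off = 0, pinned = 0;

	while (off < requested) {
		int got = pin_some(requested - off);

		if (got <= 0) {
			/* hard failure: the caller must still unpin 'pinned' pages */
			fprintf(stderr, "failed with %u pages pinned\n", pinned);
			return 1;
		}
		off += (unsigned int)got;	/* resume right after what was pinned */
		pinned += (unsigned int)got;
	}
	printf("pinned all %u pages\n", pinned);
	return 0;
}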
Cc: stable@vger.kernel.org Fixes: ab520be8cd5d ("xen/privcmd: Add IOCTL_PRIVCMD_DM_OP") Reported-by: Rustam Subkhankulov subkhankulov@ispras.ru Signed-off-by: Juergen Gross jgross@suse.com Reviewed-by: Jan Beulich jbeulich@suse.com Reviewed-by: Oleksandr Tyshchenko oleksandr_tyshchenko@epam.com Link: https://lore.kernel.org/r/20220825141918.3581-1-jgross@suse.com Signed-off-by: Juergen Gross jgross@suse.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/xen/privcmd.c | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-)
--- a/drivers/xen/privcmd.c +++ b/drivers/xen/privcmd.c @@ -581,27 +581,30 @@ static int lock_pages( struct privcmd_dm_op_buf kbufs[], unsigned int num, struct page *pages[], unsigned int nr_pages, unsigned int *pinned) { - unsigned int i; + unsigned int i, off = 0;
- for (i = 0; i < num; i++) { + for (i = 0; i < num; ) { unsigned int requested; int page_count;
requested = DIV_ROUND_UP( offset_in_page(kbufs[i].uptr) + kbufs[i].size, - PAGE_SIZE); + PAGE_SIZE) - off; if (requested > nr_pages) return -ENOSPC;
page_count = pin_user_pages_fast( - (unsigned long) kbufs[i].uptr, + (unsigned long)kbufs[i].uptr + off * PAGE_SIZE, requested, FOLL_WRITE, pages); - if (page_count < 0) - return page_count; + if (page_count <= 0) + return page_count ? : -EFAULT;
*pinned += page_count; nr_pages -= page_count; pages += page_count; + + off = (requested == page_count) ? 0 : off + page_count; + i += !off; }
return 0; @@ -677,10 +680,8 @@ static long privcmd_ioctl_dm_op(struct f }
rc = lock_pages(kbufs, kdata.num, pages, nr_pages, &pinned); - if (rc < 0) { - nr_pages = pinned; + if (rc < 0) goto out; - }
for (i = 0; i < kdata.num; i++) { set_xen_guest_handle(xbufs[i].h, kbufs[i].uptr); @@ -692,7 +693,7 @@ static long privcmd_ioctl_dm_op(struct f xen_preemptible_hcall_end();
out: - unlock_pages(pages, nr_pages); + unlock_pages(pages, pinned); kfree(xbufs); kfree(pages); kfree(kbufs);
From: Conor Dooley conor.dooley@microchip.com
commit b5c3aca86d2698c4850b6ee8b341938025d2780c upstream.
Fix the warning:

arch/riscv/kernel/signal.c:316:27: warning: no previous prototype for function 'do_notify_resume' [-Wmissing-prototypes]
asmlinkage __visible void do_notify_resume(struct pt_regs *regs,
All other functions in the file are static & none of the existing headers stood out as an obvious location. Create signal.h to hold the declaration.
Fixes: e2c0cdfba7f6 ("RISC-V: User-facing API") Signed-off-by: Conor Dooley conor.dooley@microchip.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220814141237.493457-4-mail@conchuod.ie Signed-off-by: Palmer Dabbelt palmer@rivosinc.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/riscv/include/asm/signal.h | 12 ++++++++++++ arch/riscv/kernel/signal.c | 1 + 2 files changed, 13 insertions(+) create mode 100644 arch/riscv/include/asm/signal.h
--- /dev/null +++ b/arch/riscv/include/asm/signal.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +#ifndef __ASM_SIGNAL_H +#define __ASM_SIGNAL_H + +#include <uapi/asm/signal.h> +#include <uapi/asm/ptrace.h> + +asmlinkage __visible +void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags); + +#endif --- a/arch/riscv/kernel/signal.c +++ b/arch/riscv/kernel/signal.c @@ -15,6 +15,7 @@
#include <asm/ucontext.h> #include <asm/vdso.h> +#include <asm/signal.h> #include <asm/signal32.h> #include <asm/switch_to.h> #include <asm/csr.h>
From: Conor Dooley conor.dooley@microchip.com
commit d951b20b9def73dcc39a5379831525d0d2a537e9 upstream.
Sparse complains:

arch/riscv/kernel/traps.c:213:6: warning: symbol 'shadow_stack' was not declared. Should it be static?
The variable is used in entry.S, so declare shadow_stack there alongside SHADOW_OVERFLOW_STACK_SIZE.
Fixes: 31da94c25aea ("riscv: add VMAP_STACK overflow detection") Signed-off-by: Conor Dooley conor.dooley@microchip.com Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20220814141237.493457-5-mail@conchuod.ie Signed-off-by: Palmer Dabbelt palmer@rivosinc.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/riscv/include/asm/thread_info.h | 2 ++ arch/riscv/kernel/traps.c | 3 ++- 2 files changed, 4 insertions(+), 1 deletion(-)
--- a/arch/riscv/include/asm/thread_info.h +++ b/arch/riscv/include/asm/thread_info.h @@ -42,6 +42,8 @@
#ifndef __ASSEMBLY__
+extern long shadow_stack[SHADOW_OVERFLOW_STACK_SIZE / sizeof(long)]; + #include <asm/processor.h> #include <asm/csr.h>
--- a/arch/riscv/kernel/traps.c +++ b/arch/riscv/kernel/traps.c @@ -20,9 +20,10 @@
#include <asm/asm-prototypes.h> #include <asm/bug.h> +#include <asm/csr.h> #include <asm/processor.h> #include <asm/ptrace.h> -#include <asm/csr.h> +#include <asm/thread_info.h>
int show_unhandled_signals = 1;
From: Heinrich Schuchardt heinrich.schuchardt@canonical.com
commit 34fc9cc3aebe8b9e27d3bc821543dd482dc686ca upstream.
The "PolarFire SoC MSS Technical Reference Manual" documents the following PLIC interrupts:
1 - L2 Cache Controller: signals when a metadata correction event occurs
2 - L2 Cache Controller: signals when an uncorrectable metadata event occurs
3 - L2 Cache Controller: signals when a data correction event occurs
4 - L2 Cache Controller: signals when an uncorrectable data event occurs
This differs from the SiFive FU540 which only has three L2 cache related interrupts.
The sequence in the device tree is defined by an enum:
enum {
	DIR_CORR = 0,
	DATA_CORR,
	DATA_UNCORR,
	DIR_UNCORR,
};
So the correct sequence of the L2 cache interrupts is
interrupts = <1>, <3>, <4>, <2>;
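As a sanity check, indexing the fixed property by the driver's enum order recovers the documented mapping (a throwaway userspace sketch, not driver code):

#include <stdio.h>

enum { DIR_CORR = 0, DATA_CORR, DATA_UNCORR, DIR_UNCORR };

int main(void)
{
	const int irqs[] = { 1, 3, 4, 2 };	/* interrupts = <1>, <3>, <4>, <2>; */
	const char *names[] = { "DIR_CORR", "DATA_CORR", "DATA_UNCORR", "DIR_UNCORR" };

	for (int i = DIR_CORR; i <= DIR_UNCORR; i++)
		printf("%-11s -> PLIC source %d\n", names[i], irqs[i]);
	return 0;
}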
[Conor] This manifests as an unusable system if the l2-cache driver is enabled, as the wrong interrupt gets cleared & the handler prints errors to the console ad infinitum.
Fixes: 0fa6107eca41 ("RISC-V: Initial DTS for Microchip ICICLE board") CC: stable@vger.kernel.org # 5.15: e35b07a7df9b: riscv: dts: microchip: mpfs: Group tuples in interrupt properties Signed-off-by: Heinrich Schuchardt heinrich.schuchardt@canonical.com Signed-off-by: Conor Dooley conor.dooley@microchip.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- arch/riscv/boot/dts/microchip/mpfs.dtsi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/riscv/boot/dts/microchip/mpfs.dtsi +++ b/arch/riscv/boot/dts/microchip/mpfs.dtsi @@ -169,7 +169,7 @@ cache-size = <2097152>; cache-unified; interrupt-parent = <&plic>; - interrupts = <1>, <2>, <3>; + interrupts = <1>, <3>, <4>, <2>; };
clint: clint@2000000 {
From: Jiri Slaby jslaby@suse.cz
commit 37887783b3fef877bf34b8992c9199864da4afcb upstream.
This reverts commit e7be8d1dd983156b ("zram: remove double compression logic"), as it causes zram failures. It does not revert cleanly because PTR_ERR handling was introduced in the meantime; this is handled by an appropriate IS_ERR.
When under memory pressure, zs_malloc() can fail. Before the above commit, the allocation was retried with direct reclaim enabled (GFP_NOIO). After the commit, it is not -- only __GFP_KSWAPD_RECLAIM is tried.
So when the failure occurs under memory pressure, an overlaying filesystem such as ext2 (mounted by the ext4 module in this case) can emit failures, making the (file)system unusable:

EXT4-fs warning (device zram0): ext4_end_bio:343: I/O error 10 writing to inode 16386 starting block 159744)
Buffer I/O error on device zram0, logical block 159744
With direct reclaim, memory is really reclaimed and the allocation succeeds, eventually. In the worst case, the oom killer is invoked, which is the proper outcome if the user sets up zram too large (in comparison to the available RAM).
This very diff doesn't apply to 5.19 (stable) cleanly (see PTR_ERR note above). Use revert of e7be8d1dd983 directly.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1202203
Link: https://lkml.kernel.org/r/20220810070609.14402-1-jslaby@suse.cz
Fixes: e7be8d1dd983 ("zram: remove double compression logic")
Signed-off-by: Jiri Slaby jslaby@suse.cz
Reviewed-by: Sergey Senozhatsky senozhatsky@chromium.org
Cc: Minchan Kim minchan@kernel.org
Cc: Nitin Gupta ngupta@vflare.org
Cc: Alexey Romanov avromanov@sberdevices.ru
Cc: Dmitry Rokosov ddrokosov@sberdevices.ru
Cc: Lukas Czerner lczerner@redhat.com
Cc: stable@vger.kernel.org [5.19]
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/block/zram/zram_drv.c | 42 ++++++++++++++++++++++++++++++++----------
 drivers/block/zram/zram_drv.h |  1 +
 2 files changed, 33 insertions(+), 10 deletions(-)
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1144,14 +1144,15 @@ static ssize_t bd_stat_show(struct devic
 static ssize_t debug_stat_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
-	int version = 2;
+	int version = 1;
 	struct zram *zram = dev_to_zram(dev);
 	ssize_t ret;
 
 	down_read(&zram->init_lock);
 	ret = scnprintf(buf, PAGE_SIZE,
-			"version: %d\n%8llu\n",
+			"version: %d\n%8llu %8llu\n",
 			version,
+			(u64)atomic64_read(&zram->stats.writestall),
 			(u64)atomic64_read(&zram->stats.miss_free));
 	up_read(&zram->init_lock);
 
@@ -1367,6 +1368,7 @@ static int __zram_bvec_write(struct zram
 	}
 	kunmap_atomic(mem);
 
+compress_again:
 	zstrm = zcomp_stream_get(zram->comp);
 	src = kmap_atomic(page);
 	ret = zcomp_compress(zstrm, src, &comp_len);
@@ -1375,20 +1377,39 @@ static int __zram_bvec_write(struct zram
 	if (unlikely(ret)) {
 		zcomp_stream_put(zram->comp);
 		pr_err("Compression failed! err=%d\n", ret);
+		zs_free(zram->mem_pool, handle);
 		return ret;
 	}
 
 	if (comp_len >= huge_class_size)
 		comp_len = PAGE_SIZE;
-
-	handle = zs_malloc(zram->mem_pool, comp_len,
-			__GFP_KSWAPD_RECLAIM |
-			__GFP_NOWARN |
-			__GFP_HIGHMEM |
-			__GFP_MOVABLE);
-
-	if (unlikely(!handle)) {
+	/*
+	 * handle allocation has 2 paths:
+	 * a) fast path is executed with preemption disabled (for
+	 *  per-cpu streams) and has __GFP_DIRECT_RECLAIM bit clear,
+	 *  since we can't sleep;
+	 * b) slow path enables preemption and attempts to allocate
+	 *  the page with __GFP_DIRECT_RECLAIM bit set. we have to
+	 *  put per-cpu compression stream and, thus, to re-do
+	 *  the compression once handle is allocated.
+	 *
+	 * if we have a 'non-null' handle here then we are coming
+	 * from the slow path and handle has already been allocated.
+	 */
+	if (!handle)
+		handle = zs_malloc(zram->mem_pool, comp_len,
+				__GFP_KSWAPD_RECLAIM |
+				__GFP_NOWARN |
+				__GFP_HIGHMEM |
+				__GFP_MOVABLE);
+	if (!handle) {
 		zcomp_stream_put(zram->comp);
+		atomic64_inc(&zram->stats.writestall);
+		handle = zs_malloc(zram->mem_pool, comp_len,
+				GFP_NOIO | __GFP_HIGHMEM |
+				__GFP_MOVABLE);
+		if (handle)
+			goto compress_again;
 		return -ENOMEM;
 	}
 
@@ -1946,6 +1967,7 @@ static int zram_add(void)
 	if (ZRAM_LOGICAL_BLOCK_SIZE == PAGE_SIZE)
 		blk_queue_max_write_zeroes_sectors(zram->disk->queue, UINT_MAX);
 
+	blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, zram->disk->queue);
 	ret = device_add_disk(NULL, zram->disk, zram_disk_groups);
 	if (ret)
 		goto out_cleanup_disk;

--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -81,6 +81,7 @@ struct zram_stats {
 	atomic64_t huge_pages_since;	/* no. of huge pages since zram set up */
 	atomic64_t pages_stored;	/* no. of pages currently stored */
 	atomic_long_t max_used_pages;	/* no. of maximum pages stored */
+	atomic64_t writestall;		/* no. of write slow paths */
 	atomic64_t miss_free;		/* no. of missed free */
 #ifdef CONFIG_ZRAM_WRITEBACK
 	atomic64_t bd_count;		/* no. of pages in backing device */
From: Jens Axboe axboe@kernel.dk
commit e053aaf4da56cbf0afb33a0fda4a62188e2c0637 upstream.
This is actually an older issue, but we never used to hit the -EAGAIN path before having done sb_start_write(). Make sure that we always call kiocb_end_write() if we need to retry the write, so that we keep the calls to sb_start_write() etc balanced.
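The invariant being restored, in sketch form (simplified from the retry tail in the diff below): every superblock write reference taken when the write began must be released before the request is parked for retry, because the retry will take it again.

	/* retry path (simplified): the write will be re-attempted, so
	 * release the sb write reference taken at submission time */
	if (kiocb->ki_flags & IOCB_WRITE)
		kiocb_end_write(req);	/* pairs with sb_start_write() */
	return -EAGAIN;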
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 io_uring/io_uring.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -4331,7 +4331,12 @@ done:
 copy_iov:
 		iov_iter_restore(&s->iter, &s->iter_state);
 		ret = io_setup_async_rw(req, iovec, s, false);
-		return ret ?: -EAGAIN;
+		if (!ret) {
+			if (kiocb->ki_flags & IOCB_WRITE)
+				kiocb_end_write(req);
+			return -EAGAIN;
+		}
+		return ret;
 	}
 out_free:
 	/* it's reportedly faster than delegating the null check to kfree() */
From: David Hildenbrand david@redhat.com
commit f96f7a40874d7c746680c0b9f57cef2262ae551f upstream.
Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2.
I observed that hugetlb does not support/expect write-faults in shared mappings that would have to map the R/O-mapped page writable -- and I found two cases where we could currently get such faults and would erroneously map an anon page into a shared mapping.

Reproducers are part of the patches.
I propose to backport both fixes to stable trees. The first fix needs a small adjustment.
This patch (of 2):
Staring at hugetlb_wp(), one might wonder where all the logic for shared mappings is when stumbling over a write-protected page in a shared mapping. In fact, there is none, and so far we thought we could get away with that because e.g., mprotect() should always do the right thing and map all pages directly writable.
Looks like we were wrong:
--------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/mman.h>

#define HUGETLB_SIZE (2 * 1024 * 1024u)

static void clear_softdirty(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);
	const char *ctrl = "4";
	int ret;

	if (fd < 0) {
		fprintf(stderr, "open(clear_refs) failed\n");
		exit(1);
	}
	ret = write(fd, ctrl, strlen(ctrl));
	if (ret != strlen(ctrl)) {
		fprintf(stderr, "write(clear_refs) failed\n");
		exit(1);
	}
	close(fd);
}

int main(int argc, char **argv)
{
	char *map;
	int fd;

	fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT);
	if (!fd) {
		fprintf(stderr, "open() failed\n");
		return -errno;
	}
	if (ftruncate(fd, HUGETLB_SIZE)) {
		fprintf(stderr, "ftruncate() failed\n");
		return -errno;
	}

	map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		fprintf(stderr, "mmap() failed\n");
		return -errno;
	}

	*map = 0;

	if (mprotect(map, HUGETLB_SIZE, PROT_READ)) {
		fprintf(stderr, "mmprotect() failed\n");
		return -errno;
	}

	clear_softdirty();

	if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) {
		fprintf(stderr, "mmprotect() failed\n");
		return -errno;
	}

	*map = 0;

	return 0;
}
--------------------------------------------------------------------------
The above test fails with SIGBUS when there is only a single free hugetlb page:

 # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
 # ./test
 Bus error (core dumped)
And worse, with sufficient free hugetlb pages it will map an anonymous page into a shared mapping, for example, messing up accounting during unmap and breaking MAP_SHARED semantics:

 # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
 # ./test
 # cat /proc/meminfo | grep HugePages_
 HugePages_Total:       2
 HugePages_Free:        1
 HugePages_Rsvd:    18446744073709551615
 HugePages_Surp:        0
The reason in this particular case is that vma_wants_writenotify() will return "true", removing VM_SHARED in vma_set_page_prot() so that pages get mapped write-protected. Let's teach vma_wants_writenotify() that hugetlb does not support softdirty tracking.
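Conceptually, the fix is a single extra condition in vma_wants_writenotify(); a simplified view of the check added by the diff below:

	/* softdirty tracking wants write notifications, but hugetlb
	 * does not support softdirty tracking yet, so never enable
	 * them for hugetlb VMAs on this path */
	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY) &&
	    !is_vm_hugetlb_page(vma))
		return 1;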
Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com
Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com
Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
Signed-off-by: David Hildenbrand david@redhat.com
Reviewed-by: Mike Kravetz mike.kravetz@oracle.com
Cc: Peter Feiner pfeiner@google.com
Cc: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: Cyrill Gorcunov gorcunov@openvz.org
Cc: Pavel Emelyanov xemul@parallels.com
Cc: Jamie Liu jamieliu@google.com
Cc: Hugh Dickins hughd@google.com
Cc: Naoya Horiguchi n-horiguchi@ah.jp.nec.com
Cc: Bjorn Helgaas bhelgaas@google.com
Cc: Muchun Song songmuchun@bytedance.com
Cc: Peter Xu peterx@redhat.com
Cc: stable@vger.kernel.org [3.18+]
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: David Hildenbrand david@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 mm/mmap.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1693,8 +1693,12 @@ int vma_wants_writenotify(struct vm_area
 	    pgprot_val(vm_pgprot_modify(vm_page_prot, vm_flags)))
 		return 0;
 
-	/* Do we need to track softdirty? */
-	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
+	/*
+	 * Do we need to track softdirty? hugetlb does not support softdirty
+	 * tracking yet.
+	 */
+	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY) &&
+	    !is_vm_hugetlb_page(vma))
 		return 1;
 
 	/* Specialty mapping? */
From: Guoqing Jiang guoqing.jiang@linux.dev
commit 1d258758cf06a0734482989911d184dd5837ed4e upstream.
This reverts commit e151db8ecfb019b7da31d076130a794574c89f6f. It obviously breaks clustered raid, as noticed by Neil, even though it fixed a KASAN issue for dm-raid; let's revert it here and fix the KASAN issue in the next commit.
[1]. https://lore.kernel.org/linux-raid/a6657e08-b6a7-358b-2d2a-0ac37d49d23a@linu...
Fixes: e151db8ecfb0 ("md-raid: destroy the bitmap after destroying the thread")
Signed-off-by: Guoqing Jiang guoqing.jiang@linux.dev
Signed-off-by: Song Liu song@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/md/md.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6244,11 +6244,11 @@ static void mddev_detach(struct mddev *m
 static void __md_stop(struct mddev *mddev)
 {
 	struct md_personality *pers = mddev->pers;
+	md_bitmap_destroy(mddev);
 	mddev_detach(mddev);
 	/* Ensure ->event_work is done */
 	if (mddev->event_work.func)
 		flush_workqueue(md_misc_wq);
-	md_bitmap_destroy(mddev);
 	spin_lock(&mddev->lock);
 	mddev->pers = NULL;
 	spin_unlock(&mddev->lock);
From: Guoqing Jiang guoqing.jiang@linux.dev
commit 0dd84b319352bb8ba64752d4e45396d8b13e6018 upstream.
From the link [1], we can see raid1d was running even after the path
raid_dtr -> md_stop -> __md_stop.
Let's stop writes first in the destructor to align with normal md-raid and fix the KASAN issue.
[1]. https://lore.kernel.org/linux-raid/CAPhsuW5gc4AakdGNdF8ubpezAuDLFOYUO_sfMZce...
Fixes: 48df498daf62 ("md: move bitmap_destroy to the beginning of __md_stop")
Reported-by: Mikulas Patocka mpatocka@redhat.com
Signed-off-by: Guoqing Jiang guoqing.jiang@linux.dev
Signed-off-by: Song Liu song@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/md/md.c | 1 +
 1 file changed, 1 insertion(+)
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6266,6 +6266,7 @@ void md_stop(struct mddev *mddev)
 	/* stop the array and free an attached data structures.
 	 * This is called from dm-raid
 	 */
+	__md_stop_writes(mddev);
 	__md_stop(mddev);
 	bioset_exit(&mddev->bio_set);
 	bioset_exit(&mddev->sync_set);
From: Zenghui Yu yuzenghui@huawei.com
commit 5e1e087457c94ad7fafbe1cf6f774c6999ee29d4 upstream.
Since commit 51f559d66527 ("arm64: Enable repeat tlbi workaround on KRYO4XX gold CPUs"), we have failed to detect erratum 1286807 on Cortex-A76 because its entry in arm64_repeat_tlbi_list[] was accidentally corrupted by this commit.
Fix this issue by creating a separate entry for Kryo4xx Gold.
Fixes: 51f559d66527 ("arm64: Enable repeat tlbi workaround on KRYO4XX gold CPUs")
Cc: Shreyas K K quic_shrekk@quicinc.com
Signed-off-by: Zenghui Yu yuzenghui@huawei.com
Acked-by: Marc Zyngier maz@kernel.org
Link: https://lore.kernel.org/r/20220809043848.969-1-yuzenghui@huawei.com
Signed-off-by: Will Deacon will@kernel.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/arm64/kernel/cpu_errata.c | 2 ++
 1 file changed, 2 insertions(+)
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -208,6 +208,8 @@ static const struct arm64_cpu_capabiliti
 #ifdef CONFIG_ARM64_ERRATUM_1286807
 	{
 		ERRATA_MIDR_RANGE(MIDR_CORTEX_A76, 0, 0, 3, 0),
+	},
+	{
 		/* Kryo4xx Gold (rcpe to rfpe) => (r0p0 to r3p0) */
 		ERRATA_MIDR_RANGE(MIDR_QCOM_KRYO_4XX_GOLD, 0xc, 0xe, 0xf, 0xe),
 	},
From: Liam Howlett liam.howlett@oracle.com
commit 44e602b4e52f70f04620bbbf4fe46ecb40170bde upstream.
Take the mmap_read_lock() when using the VMA in binder_alloc_print_pages() and when checking for a VMA in binder_alloc_new_buf_locked().
It is worth noting that binder_alloc_new_buf_locked() drops the mmap read lock after it verifies a VMA exists, but the lock may be taken again deeper in the call stack, if necessary.
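The resulting locking pattern is take-check-drop (simplified sketch of the binder_alloc_new_buf_locked() hunk in the diff below):

	mmap_read_lock(alloc->vma_vm_mm);
	if (!binder_alloc_get_vma(alloc)) {
		mmap_read_unlock(alloc->vma_vm_mm);
		return ERR_PTR(-ESRCH);		/* no VMA mapped */
	}
	mmap_read_unlock(alloc->vma_vm_mm);
	/* proceed; the lock may be re-taken deeper in the call stack */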
Link: https://lkml.kernel.org/r/20220810160209.1630707-1-Liam.Howlett@oracle.com
Fixes: a43cfc87caaf ("android: binder: stop saving a pointer to the VMA")
Signed-off-by: Liam R. Howlett Liam.Howlett@oracle.com
Reported-by: Ondrej Mosnacek omosnace@redhat.com
Reported-by: syzbot+a7b60a176ec13cafb793@syzkaller.appspotmail.com
Acked-by: Carlos Llamas cmllamas@google.com
Tested-by: Ondrej Mosnacek omosnace@redhat.com
Cc: Minchan Kim minchan@kernel.org
Cc: Christian Brauner (Microsoft) brauner@kernel.org
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: Hridya Valsaraju hridya@google.com
Cc: Joel Fernandes joel@joelfernandes.org
Cc: Martijn Coenen maco@android.com
Cc: Suren Baghdasaryan surenb@google.com
Cc: Todd Kjos tkjos@android.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: "Arve Hjønnevåg" arve@android.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/android/binder_alloc.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -395,12 +395,15 @@ static struct binder_buffer *binder_allo
 	size_t size, data_offsets_size;
 	int ret;
 
+	mmap_read_lock(alloc->vma_vm_mm);
 	if (!binder_alloc_get_vma(alloc)) {
+		mmap_read_unlock(alloc->vma_vm_mm);
 		binder_alloc_debug(BINDER_DEBUG_USER_ERROR,
 				   "%d: binder_alloc_buf, no vma\n",
 				   alloc->pid);
 		return ERR_PTR(-ESRCH);
 	}
+	mmap_read_unlock(alloc->vma_vm_mm);
 
 	data_offsets_size = ALIGN(data_size, sizeof(void *)) +
 		ALIGN(offsets_size, sizeof(void *));
@@ -922,17 +925,25 @@ void binder_alloc_print_pages(struct seq
 	 * Make sure the binder_alloc is fully initialized, otherwise we might
 	 * read inconsistent state.
 	 */
-	if (binder_alloc_get_vma(alloc) != NULL) {
-		for (i = 0; i < alloc->buffer_size / PAGE_SIZE; i++) {
-			page = &alloc->pages[i];
-			if (!page->page_ptr)
-				free++;
-			else if (list_empty(&page->lru))
-				active++;
-			else
-				lru++;
-		}
+
+	mmap_read_lock(alloc->vma_vm_mm);
+	if (binder_alloc_get_vma(alloc) == NULL) {
+		mmap_read_unlock(alloc->vma_vm_mm);
+		goto uninitialized;
 	}
+
+	mmap_read_unlock(alloc->vma_vm_mm);
+	for (i = 0; i < alloc->buffer_size / PAGE_SIZE; i++) {
+		page = &alloc->pages[i];
+		if (!page->page_ptr)
+			free++;
+		else if (list_empty(&page->lru))
+			active++;
+		else
+			lru++;
+	}
+
+uninitialized:
 	mutex_unlock(&alloc->mutex);
 	seq_printf(m, "  pages: %d:%d:%d\n", active, lru, free);
 	seq_printf(m, "  pages high watermark: %zu\n", alloc->pages_high);
From: Peter Zijlstra peterz@infradead.org
commit 332924973725e8cdcc783c175f68cf7e162cb9e5 upstream.
Turns out that i386 doesn't unconditionally have LFENCE, as such the loop in __FILL_RETURN_BUFFER isn't actually speculation safe on such chips.
Fixes: ba6e31af2be9 ("x86/speculation: Add LFENCE to RSB fill sequence")
Reported-by: Ben Hutchings ben@decadent.org.uk
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org
Link: https://lkml.kernel.org/r/Yv9tj9vbQ9nNlXoY@worktop.programming.kicks-ass.net
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/include/asm/nospec-branch.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -50,6 +50,7 @@
  * the optimal version - two calls, each with their own speculation
  * trap should their return address end up getting used, in a loop.
  */
+#ifdef CONFIG_X86_64
 #define __FILL_RETURN_BUFFER(reg, nr)			\
 	mov	$(nr/2), reg;				\
 771:							\
@@ -60,6 +61,17 @@
 	jnz	771b;					\
 	/* barrier for jnz misprediction */		\
 	lfence;
+#else
+/*
+ * i386 doesn't unconditionally have LFENCE, as such it can't
+ * do a loop.
+ */
+#define __FILL_RETURN_BUFFER(reg, nr)			\
+	.rept nr;					\
+	__FILL_RETURN_SLOT;				\
+	.endr;						\
+	add	$(BITS_PER_LONG/8) * nr, %_ASM_SP;
+#endif
 
 /*
  * Stuff a single RSB slot.
From: Prike Liang Prike.Liang@amd.com
commit ee8086dbc1585d9f4020a19447388246a5cff5c8 upstream.
Correct the isa version for handling KFD test.
Fixes: 7c4f4f197e0c ("drm/amdkfd: Add GC 10.3.6 and 10.3.7 KFD definitions")
Signed-off-by: Prike Liang Prike.Liang@amd.com
Reviewed-by: Aaron Liu aaron.liu@amd.com
Signed-off-by: Alex Deucher alexander.deucher@amd.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 drivers/gpu/drm/amd/amdkfd/kfd_device.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
@@ -377,12 +377,8 @@ struct kfd_dev *kgd2kfd_probe(struct amd
 		f2g = &gfx_v10_3_kfd2kgd;
 		break;
 	case IP_VERSION(10, 3, 6):
-		gfx_target_version = 100306;
-		if (!vf)
-			f2g = &gfx_v10_3_kfd2kgd;
-		break;
 	case IP_VERSION(10, 3, 7):
-		gfx_target_version = 100307;
+		gfx_target_version = 100306;
 		if (!vf)
 			f2g = &gfx_v10_3_kfd2kgd;
 		break;
From: Salvatore Bonaccorso carnil@debian.org
commit 00da0cb385d05a89226e150a102eb49d8abb0359 upstream.
While reporting for the AMD retbleed vulnerability was added in
6b80b59b3555 ("x86/bugs: Report AMD retbleed vulnerability")
the new sysfs file has so far not been mentioned in the ABI documentation for sysfs-devices-system-cpu. Fix that.
Fixes: 6b80b59b3555 ("x86/bugs: Report AMD retbleed vulnerability")
Signed-off-by: Salvatore Bonaccorso carnil@debian.org
Signed-off-by: Borislav Petkov bp@suse.de
Link: https://lore.kernel.org/r/20220801091529.325327-1-carnil@debian.org
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 1 +
 1 file changed, 1 insertion(+)
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -527,6 +527,7 @@ What:		/sys/devices/system/cpu/vulnerabi
 		/sys/devices/system/cpu/vulnerabilities/tsx_async_abort
 		/sys/devices/system/cpu/vulnerabilities/itlb_multihit
 		/sys/devices/system/cpu/vulnerabilities/mmio_stale_data
+		/sys/devices/system/cpu/vulnerabilities/retbleed
 Date:		January 2018
 Contact:	Linux kernel mailing list linux-kernel@vger.kernel.org
 Description:	Information about CPU vulnerabilities
From: Yu Kuai yukuai3@huawei.com
commit 65fac0d54f374625b43a9d6ad1f2c212bd41f518 upstream.
Currently, in virtio_scsi, if 'bd->last' is not set to true while dispatching a request, such io will stay in the driver's queue, and the driver will wait for the block layer to dispatch more rqs. However, if the block layer fails to dispatch more rqs, it should trigger commit_rqs to inform the driver.

There is a problem in blk_mq_try_issue_list_directly() where commit_rqs won't be called:
// assume that queue_depth is set to 1, list contains two rq
blk_mq_try_issue_list_directly
 blk_mq_request_issue_directly
 // dispatch first rq
 // last is false
  __blk_mq_try_issue_directly
   blk_mq_get_dispatch_budget
   // succeed to get first budget
   __blk_mq_issue_directly
    scsi_queue_rq
     cmd->flags |= SCMD_LAST
      virtscsi_queuecommand
       kick = (sc->flags & SCMD_LAST) != 0
       // kick is false, first rq won't issue to disk
 queued++

 blk_mq_request_issue_directly
 // dispatch second rq
  __blk_mq_try_issue_directly
   blk_mq_get_dispatch_budget
   // failed to get second budget
  ret == BLK_STS_RESOURCE
  blk_mq_request_bypass_insert
 // errors is still 0

 if (!list_empty(list) || errors && ...)
 // won't pass, commit_rqs won't be called
In this situation, the first rq relies on the second rq to be dispatched, while the second rq relies on the first rq to complete, thus they will both hang.

Fix the problem by also treating 'BLK_STS_*RESOURCE' as 'errors', since it means the request was not queued successfully.

The same problem exists in blk_mq_dispatch_rq_list(); 'BLK_STS_*RESOURCE' can't be treated as 'errors' there, so fix the problem by calling commit_rqs if queue_rq returns 'BLK_STS_*RESOURCE'.
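In short, the contract is: if any request was handed to the driver with last == false and the list was not fully flushed, commit_rqs() must be called so the driver issues what it already holds. A simplified sketch of the check at the end of the direct-issue loop (per the diff below):

	/* tell the driver to flush if we queued rqs with last == false
	 * but could not hand over the whole list */
	if ((!list_empty(list) || errors) && q->mq_ops->commit_rqs && queued)
		q->mq_ops->commit_rqs(hctx);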
Fixes: d666ba98f849 ("blk-mq: add mq_ops->commit_rqs()")
Signed-off-by: Yu Kuai yukuai3@huawei.com
Reviewed-by: Ming Lei ming.lei@redhat.com
Link: https://lore.kernel.org/r/20220726122224.1790882-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 block/blk-mq.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1925,7 +1925,8 @@ out:
 	/*
 	 * If we didn't flush the entire list, we could have told the driver
 	 * there was more coming, but that turned out to be a lie.
 	 */
-	if ((!list_empty(list) || errors) && q->mq_ops->commit_rqs && queued)
+	if ((!list_empty(list) || errors || needs_resource ||
+	     ret == BLK_STS_DEV_RESOURCE) && q->mq_ops->commit_rqs && queued)
 		q->mq_ops->commit_rqs(hctx);
 	/*
 	 * Any items that need requeuing? Stuff them into hctx->dispatch,
@@ -2678,6 +2679,7 @@ void blk_mq_try_issue_list_directly(stru
 		list_del_init(&rq->queuelist);
 		ret = blk_mq_request_issue_directly(rq, list_empty(list));
 		if (ret != BLK_STS_OK) {
+			errors++;
 			if (ret == BLK_STS_RESOURCE ||
 					ret == BLK_STS_DEV_RESOURCE) {
 				blk_mq_request_bypass_insert(rq, false,
@@ -2685,7 +2687,6 @@ void blk_mq_try_issue_list_directly(stru
 				break;
 			}
 			blk_mq_end_request(rq, ret);
-			errors++;
 		} else
 			queued++;
 	}
From: James Clark james.clark@arm.com
commit bc9e7fe313d5e56d4d5f34bcc04d1165f94f86fb upstream.
The previous change to Python autodetection had a small mistake where the auto value was used to determine the Python binary, rather than the user supplied value. The Python binary is only used for one part of the build process, rather than the final linking, so it was producing correct builds in most scenarios, especially when the auto detected value matched what the user wanted, or the system only had a valid set of Pythons.
Change it so that the Python binary path is derived from either the PYTHON_CONFIG value or PYTHON value, depending on what is specified by the user. This was the original intention.
This error was spotted in a build failure in an odd cross-compilation environment after commit 4c41cb46a732fe82 ("perf python: Prefer python3") was merged.
Fixes: 630af16eee495f58 ("perf tools: Use Python devtools for version autodetection rather than runtime")
Signed-off-by: James Clark james.clark@arm.com
Acked-by: Ian Rogers irogers@google.com
Cc: Alexander Shishkin alexander.shishkin@linux.intel.com
Cc: Ingo Molnar mingo@redhat.com
Cc: James Clark james.clark@arm.com
Cc: Jiri Olsa jolsa@kernel.org
Cc: Mark Rutland mark.rutland@arm.com
Cc: Namhyung Kim namhyung@kernel.org
Cc: Peter Zijlstra peterz@infradead.org
Link: https://lore.kernel.org/r/20220728093946.1337642-1-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo acme@redhat.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 tools/perf/Makefile.config | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -265,7 +265,7 @@ endif
 # defined. get-executable-or-default fails with an error if the first argument is supplied but
 # doesn't exist.
 override PYTHON_CONFIG := $(call get-executable-or-default,PYTHON_CONFIG,$(PYTHON_AUTO))
-override PYTHON := $(call get-executable-or-default,PYTHON,$(subst -config,,$(PYTHON_AUTO)))
+override PYTHON := $(call get-executable-or-default,PYTHON,$(subst -config,,$(PYTHON_CONFIG)))
 
 grep-libs  = $(filter -l%,$(1))
 strip-libs  = $(filter-out -l%,$(1))
From: Stephane Eranian eranian@google.com
commit 11745ecfe8fea4b4a4c322967a7605d2ecbd5080 upstream.
Existing code was generating bogus counts for the SNB IMC bandwidth counters:
$ perf stat -a -I 1000 -e uncore_imc/data_reads/,uncore_imc/data_writes/
  1.000327813      1,024.03 MiB  uncore_imc/data_reads/
  1.000327813         20.73 MiB  uncore_imc/data_writes/
  2.000580153    261,120.00 MiB  uncore_imc/data_reads/
  2.000580153         23.28 MiB  uncore_imc/data_writes/
The problem was introduced by commit 07ce734dd8ad ("perf/x86/intel/uncore: Clean up client IMC"), where the read_counter callback was replaced to point to the generic uncore_mmio_read_counter() function.

The SNB IMC counters are free-running 32-bit counters laid out contiguously in MMIO. But uncore_mmio_read_counter() uses a readq() call to read from MMIO, therefore reading 64 bits at a time. Although this is okay for the uncore_perf_event_update() function, because it shifts the value based on the actual counter width to compute a delta, it is not okay for uncore_pmu_event_start(), which simply reads the counter and therefore primes event->prev_count with a bogus value that is responsible for the bogus deltas in the perf stat output above.
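To illustrate with made-up numbers: with two adjacent free-running 32-bit counters in MMIO, reading the same offset with readq() versus readl() gives:

	/* illustrative values only */
	u64 wide   = readq(box->io_addr + hwc->event_base);
	/* 0x0000002a89abcdef: the upper half is the neighbouring counter */
	u32 narrow = readl(box->io_addr + hwc->event_base);
	/* 0x89abcdef: the real 32-bit counter value */

Priming prev_count with the wide value makes the first computed delta wildly wrong, matching the 261,120.00 MiB artifact above.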
The fix is to reintroduce the custom callback for read_counter for the SNB IMC PMU and use readl() instead of readq(). With the change, the output of perf stat is back to normal:

$ perf stat -a -I 1000 -e uncore_imc/data_reads/,uncore_imc/data_writes/
  1.000120987        296.94 MiB  uncore_imc/data_reads/
  1.000120987        138.42 MiB  uncore_imc/data_writes/
  2.000403144        175.91 MiB  uncore_imc/data_reads/
  2.000403144         68.50 MiB  uncore_imc/data_writes/
Fixes: 07ce734dd8ad ("perf/x86/intel/uncore: Clean up client IMC")
Signed-off-by: Stephane Eranian eranian@google.com
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org
Reviewed-by: Kan Liang kan.liang@linux.intel.com
Link: https://lore.kernel.org/r/20220803160031.1379788-1-eranian@google.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/events/intel/uncore_snb.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)
--- a/arch/x86/events/intel/uncore_snb.c +++ b/arch/x86/events/intel/uncore_snb.c @@ -841,6 +841,22 @@ int snb_pci2phy_map_init(int devid) return 0; }
+static u64 snb_uncore_imc_read_counter(struct intel_uncore_box *box, struct perf_event *event) +{ + struct hw_perf_event *hwc = &event->hw; + + /* + * SNB IMC counters are 32-bit and are laid out back to back + * in MMIO space. Therefore we must use a 32-bit accessor function + * using readq() from uncore_mmio_read_counter() causes problems + * because it is reading 64-bit at a time. This is okay for the + * uncore_perf_event_update() function because it drops the upper + * 32-bits but not okay for plain uncore_read_counter() as invoked + * in uncore_pmu_event_start(). + */ + return (u64)readl(box->io_addr + hwc->event_base); +} + static struct pmu snb_uncore_imc_pmu = { .task_ctx_nr = perf_invalid_context, .event_init = snb_uncore_imc_event_init, @@ -860,7 +876,7 @@ static struct intel_uncore_ops snb_uncor .disable_event = snb_uncore_imc_disable_event, .enable_event = snb_uncore_imc_enable_event, .hw_config = snb_uncore_imc_hw_config, - .read_counter = uncore_mmio_read_counter, + .read_counter = snb_uncore_imc_read_counter, };
static struct intel_uncore_type snb_uncore_imc = {
From: Stephane Eranian eranian@google.com
commit d4bdb0bebc5ba3299d74f123c782d99cd4e25c49 upstream.
With the existing code in store_latency_data(), the memory operation (mem_op) returned to the user is always OP_LOAD when, in fact, it should be OP_STORE. This comes from the fact that the function simply grabs the information from a data source map which covers only load accesses. Intel 12th gen CPUs offer precise store sampling that captures both the data source and latency. Therefore they can use the data source mapping table but must override the memory operation to reflect stores instead of loads.
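The override itself is small; conceptually (simplified from the diff below):

	union perf_mem_data_src src;

	src.val = val;			/* data source decoded via the
					 * load-oriented table */
	src.mem_op = P(OP, STORE);	/* but this PEBS record is a store */
	return src.val;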
Fixes: 61b985e3e775 ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids")
Signed-off-by: Stephane Eranian eranian@google.com
Signed-off-by: Peter Zijlstra (Intel) peterz@infradead.org
Link: https://lkml.kernel.org/r/20220818054613.1548130-1-eranian@google.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
---
 arch/x86/events/intel/ds.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -291,6 +291,7 @@ static u64 load_latency_data(struct perf
 static u64 store_latency_data(struct perf_event *event, u64 status)
 {
 	union intel_x86_pebs_dse dse;
+	union perf_mem_data_src src;
 	u64 val;
 
 	dse.val = status;
@@ -304,7 +305,14 @@ static u64 store_latency_data(struct per
 
 	val |= P(BLK, NA);
 
-	return val;
+	/*
+	 * the pebs_data_source table is only for loads
+	 * so override the mem_op to say STORE instead
+	 */
+	src.val = val;
+	src.mem_op = P(OP,STORE);
+
+	return src.val;
 }
 
 struct pebs_record_core {
From: Ian Rogers irogers@google.com
commit bf515f024e4c0ca46a1b08c4f31860c01781d8a5 upstream.
If a weak group is broken then the reset_group flag remains set for the nex