This is the start of the stable review cycle for the 5.4.141 release. There are 27 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 15 Aug 2021 15:05:12 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.141-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman gregkh@linuxfoundation.org Linux 5.4.141-rc1
Nikolay Borisov nborisov@suse.com btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
Nikolay Borisov nborisov@suse.com btrfs: export and rename qgroup_reserve_meta
Qu Wenruo wqu@suse.com btrfs: qgroup: don't commit transaction when we already hold the handle
YueHaibing yuehaibing@huawei.com net: xilinx_emaclite: Do not print real IOMEM pointer
Filipe Manana fdmanana@suse.com btrfs: fix lockdep splat when enabling and disabling qgroups
Qu Wenruo wqu@suse.com btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT
Qu Wenruo wqu@suse.com btrfs: transaction: Cleanup unused TRANS_STATE_BLOCKED
Qu Wenruo wqu@suse.com btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
Qu Wenruo wqu@suse.com btrfs: qgroup: allow to unreserve range without releasing other ranges
Nikolay Borisov nborisov@suse.com btrfs: make btrfs_qgroup_reserve_data take btrfs_inode
Nikolay Borisov nborisov@suse.com btrfs: make qgroup_free_reserved_data take btrfs_inode
Miklos Szeredi mszeredi@redhat.com ovl: prevent private clone if bind mount is not allowed
Pali Rohár pali@kernel.org ppp: Fix generating ppp unit id when ifname is not specified
Luke D Jones luke@ljones.dev ALSA: hda: Add quirk for ASUS Flow x13
Longfang Liu liulongfang@huawei.com USB:ehci:fix Kunpeng920 ehci hardware problem
Lai Jiangshan laijs@linux.alibaba.com KVM: X86: MMU: Use the correct inherited permissions to get shadow page
Wesley Cheng wcheng@codeaurora.org usb: dwc3: gadget: Avoid runtime resume if disabling pullup
Wesley Cheng wcheng@codeaurora.org usb: dwc3: gadget: Disable gadget IRQ during pullup disable
Wesley Cheng wcheng@codeaurora.org usb: dwc3: gadget: Clear DEP flags after stop transfers in ep disable
Wesley Cheng wcheng@codeaurora.org usb: dwc3: gadget: Prevent EP queuing while stopping transfers
Wesley Cheng wcheng@codeaurora.org usb: dwc3: gadget: Restart DWC3 gadget when enabling pullup
Wesley Cheng wcheng@codeaurora.org usb: dwc3: gadget: Allow runtime suspend if UDC unbinded
Wesley Cheng wcheng@codeaurora.org usb: dwc3: Stop active transfers before halting the controller
Masami Hiramatsu mhiramat@kernel.org tracing: Reject string operand in the histogram expression
Alexandre Courbot gnurou@gmail.com media: v4l2-mem2mem: always consider OUTPUT queue during poll
Sumit Garg sumit.garg@linaro.org tee: Correct inappropriate usage of TEE_SHM_DMA_BUF flag
Sean Christopherson seanjc@google.com KVM: SVM: Fix off-by-one indexing when nullifying last used SEV VMCB
-------------
Diffstat:
 Documentation/virt/kvm/mmu.txt                |   4 +-
 Makefile                                      |   4 +-
 arch/x86/kvm/paging_tmpl.h                    |  14 +-
 arch/x86/kvm/svm.c                            |   2 +-
 drivers/media/v4l2-core/v4l2-mem2mem.c        |   6 +-
 drivers/net/ethernet/xilinx/xilinx_emaclite.c |   5 +-
 drivers/net/ppp/ppp_generic.c                 |  19 +-
 drivers/tee/optee/call.c                      |   2 +-
 drivers/tee/optee/core.c                      |   3 +-
 drivers/tee/optee/rpc.c                       |   5 +-
 drivers/tee/optee/shm_pool.c                  |   8 +-
 drivers/tee/tee_shm.c                         |   4 +-
 drivers/usb/dwc3/ep0.c                        |   2 +-
 drivers/usb/dwc3/gadget.c                     | 118 +++++++--
 drivers/usb/host/ehci-pci.c                   |   3 +
 fs/btrfs/ctree.h                              |  13 +-
 fs/btrfs/delalloc-space.c                     |   2 +-
 fs/btrfs/delayed-inode.c                      |   3 +-
 fs/btrfs/disk-io.c                            |   4 +-
 fs/btrfs/file.c                               |   7 +-
 fs/btrfs/inode.c                              |   2 +-
 fs/btrfs/qgroup.c                             | 328 +++++++++++++++++++------
 fs/btrfs/qgroup.h                             |   5 +-
 fs/btrfs/transaction.c                        |  16 +-
 fs/btrfs/transaction.h                        |  15 --
 fs/namespace.c                                |  42 ++--
 include/linux/tee_drv.h                       |   1 +
 kernel/trace/trace_events_hist.c              |  20 +-
 sound/pci/hda/patch_realtek.c                 |   1 +

 29 files changed, 468 insertions(+), 190 deletions(-)
From: Sean Christopherson seanjc@google.com
[ Upstream commit 179c6c27bf487273652efc99acd3ba512a23c137 ]
Use the raw ASID, not ASID-1, when nullifying the last used VMCB when freeing an SEV ASID. The consumer, pre_sev_run(), indexes the array by the raw ASID, thus KVM could get a false negative when checking for a different VMCB if KVM manages to reallocate the same ASID+VMCB combo for a new VM.
Note, this cannot cause a functional issue _in the current code_, as pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is guaranteed to mismatch on the first VMRUN. However, prior to commit 8a14fe4f0c54 ("kvm: x86: Move last_cpu into kvm_vcpu_arch as last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the last_cpu variable. Thus it's theoretically possible that older versions of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID and VMCB exactly match those of a prior VM.
Fixes: 70cd94e60c73 ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled") Cc: Tom Lendacky thomas.lendacky@amd.com Cc: Brijesh Singh brijesh.singh@amd.com Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson seanjc@google.com Signed-off-by: Paolo Bonzini pbonzini@redhat.com Signed-off-by: Sasha Levin sashal@kernel.org --- arch/x86/kvm/svm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7341d22ed04f..2a958dcc80f2 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1783,7 +1783,7 @@ static void __sev_asid_free(int asid)

 	for_each_possible_cpu(cpu) {
 		sd = per_cpu(svm_data, cpu);
-		sd->sev_vmcbs[pos] = NULL;
+		sd->sev_vmcbs[asid] = NULL;
 	}
 }
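As an aside for reviewers, here is a minimal userspace C model of the off-by-one (illustrative names only, not the real KVM types): freeing at asid - 1 leaves a stale cache entry that pre_sev_run(), which indexes by the raw ASID, could then falsely match.

#include <stdio.h>

#define MAX_ASIDS 8

static void *sev_vmcbs[MAX_ASIDS];	/* pre_sev_run() indexes by raw ASID */

static void sev_asid_free_buggy(int asid)
{
	sev_vmcbs[asid - 1] = NULL;	/* wrong slot: off by one */
}

static void sev_asid_free_fixed(int asid)
{
	sev_vmcbs[asid] = NULL;		/* matches the consumer's indexing */
}

int main(void)
{
	char vmcb;			/* stand-in for a VMCB allocation */
	int asid = 3;

	sev_vmcbs[asid] = &vmcb;	/* a VM last ran with this ASID+VMCB */
	sev_asid_free_buggy(asid);
	printf("stale entry survives buggy free: %s\n",
	       sev_vmcbs[asid] == &vmcb ? "yes (flush could be skipped)" : "no");

	sev_vmcbs[asid] = &vmcb;
	sev_asid_free_fixed(asid);
	printf("stale entry survives fixed free: %s\n",
	       sev_vmcbs[asid] == &vmcb ? "yes" : "no");
	return 0;
}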
From: Sumit Garg sumit.garg@linaro.org
[ Upstream commit 376e4199e327a5cf29b8ec8fb0f64f3d8b429819 ]
Currently the TEE_SHM_DMA_BUF flag has been inappropriately used to avoid registering shared memory allocated for private use by the underlying TEE driver: OP-TEE in this case. So instead add a new flag, TEE_SHM_PRIV, that can be utilized by underlying TEE drivers for private allocation and use of shared memory.

With this corrected, allow tee_shm_alloc_kernel_buf() to allocate a shared memory region without dma-buf backing.
Cc: stable@vger.kernel.org Signed-off-by: Sumit Garg sumit.garg@linaro.org Co-developed-by: Tyler Hicks tyhicks@linux.microsoft.com Signed-off-by: Tyler Hicks tyhicks@linux.microsoft.com Reviewed-by: Jens Wiklander jens.wiklander@linaro.org Reviewed-by: Sumit Garg sumit.garg@linaro.org Signed-off-by: Jens Wiklander jens.wiklander@linaro.org Signed-off-by: Sasha Levin sashal@kernel.org --- drivers/tee/optee/call.c | 2 +- drivers/tee/optee/core.c | 3 ++- drivers/tee/optee/rpc.c | 5 +++-- drivers/tee/optee/shm_pool.c | 8 ++++++-- drivers/tee/tee_shm.c | 4 ++-- include/linux/tee_drv.h | 1 + 6 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/drivers/tee/optee/call.c b/drivers/tee/optee/call.c index 4b5069f88d78..3a54455d9ddf 100644 --- a/drivers/tee/optee/call.c +++ b/drivers/tee/optee/call.c @@ -181,7 +181,7 @@ static struct tee_shm *get_msg_arg(struct tee_context *ctx, size_t num_params, struct optee_msg_arg *ma;
shm = tee_shm_alloc(ctx, OPTEE_MSG_GET_ARG_SIZE(num_params), - TEE_SHM_MAPPED); + TEE_SHM_MAPPED | TEE_SHM_PRIV); if (IS_ERR(shm)) return shm;
diff --git a/drivers/tee/optee/core.c b/drivers/tee/optee/core.c index 432dd38921dd..4bb4c8f28cbd 100644 --- a/drivers/tee/optee/core.c +++ b/drivers/tee/optee/core.c @@ -254,7 +254,8 @@ static void optee_release(struct tee_context *ctx) if (!ctxdata) return;
- shm = tee_shm_alloc(ctx, sizeof(struct optee_msg_arg), TEE_SHM_MAPPED); + shm = tee_shm_alloc(ctx, sizeof(struct optee_msg_arg), + TEE_SHM_MAPPED | TEE_SHM_PRIV); if (!IS_ERR(shm)) { arg = tee_shm_get_va(shm, 0); /* diff --git a/drivers/tee/optee/rpc.c b/drivers/tee/optee/rpc.c index b4ade54d1f28..aecf62016e7b 100644 --- a/drivers/tee/optee/rpc.c +++ b/drivers/tee/optee/rpc.c @@ -220,7 +220,7 @@ static void handle_rpc_func_cmd_shm_alloc(struct tee_context *ctx, shm = cmd_alloc_suppl(ctx, sz); break; case OPTEE_MSG_RPC_SHM_TYPE_KERNEL: - shm = tee_shm_alloc(ctx, sz, TEE_SHM_MAPPED); + shm = tee_shm_alloc(ctx, sz, TEE_SHM_MAPPED | TEE_SHM_PRIV); break; default: arg->ret = TEEC_ERROR_BAD_PARAMETERS; @@ -405,7 +405,8 @@ void optee_handle_rpc(struct tee_context *ctx, struct optee_rpc_param *param,
switch (OPTEE_SMC_RETURN_GET_RPC_FUNC(param->a0)) { case OPTEE_SMC_RPC_FUNC_ALLOC: - shm = tee_shm_alloc(ctx, param->a1, TEE_SHM_MAPPED); + shm = tee_shm_alloc(ctx, param->a1, + TEE_SHM_MAPPED | TEE_SHM_PRIV); if (!IS_ERR(shm) && !tee_shm_get_pa(shm, 0, &pa)) { reg_pair_from_64(¶m->a1, ¶m->a2, pa); reg_pair_from_64(¶m->a4, ¶m->a5, diff --git a/drivers/tee/optee/shm_pool.c b/drivers/tee/optee/shm_pool.c index da06ce9b9313..c41a9a501a6e 100644 --- a/drivers/tee/optee/shm_pool.c +++ b/drivers/tee/optee/shm_pool.c @@ -27,7 +27,11 @@ static int pool_op_alloc(struct tee_shm_pool_mgr *poolm, shm->paddr = page_to_phys(page); shm->size = PAGE_SIZE << order;
- if (shm->flags & TEE_SHM_DMA_BUF) { + /* + * Shared memory private to the OP-TEE driver doesn't need + * to be registered with OP-TEE. + */ + if (!(shm->flags & TEE_SHM_PRIV)) { unsigned int nr_pages = 1 << order, i; struct page **pages;
@@ -60,7 +64,7 @@ err: static void pool_op_free(struct tee_shm_pool_mgr *poolm, struct tee_shm *shm) { - if (shm->flags & TEE_SHM_DMA_BUF) + if (!(shm->flags & TEE_SHM_PRIV)) optee_shm_unregister(shm->ctx, shm);
free_pages((unsigned long)shm->kaddr, get_order(shm->size)); diff --git a/drivers/tee/tee_shm.c b/drivers/tee/tee_shm.c index 1b4b4a1ba91d..d6491e973fa4 100644 --- a/drivers/tee/tee_shm.c +++ b/drivers/tee/tee_shm.c @@ -117,7 +117,7 @@ static struct tee_shm *__tee_shm_alloc(struct tee_context *ctx, return ERR_PTR(-EINVAL); }
- if ((flags & ~(TEE_SHM_MAPPED | TEE_SHM_DMA_BUF))) { + if ((flags & ~(TEE_SHM_MAPPED | TEE_SHM_DMA_BUF | TEE_SHM_PRIV))) { dev_err(teedev->dev.parent, "invalid shm flags 0x%x", flags); return ERR_PTR(-EINVAL); } @@ -233,7 +233,7 @@ EXPORT_SYMBOL_GPL(tee_shm_priv_alloc); */ struct tee_shm *tee_shm_alloc_kernel_buf(struct tee_context *ctx, size_t size) { - return tee_shm_alloc(ctx, size, TEE_SHM_MAPPED | TEE_SHM_DMA_BUF); + return tee_shm_alloc(ctx, size, TEE_SHM_MAPPED); } EXPORT_SYMBOL_GPL(tee_shm_alloc_kernel_buf);
diff --git a/include/linux/tee_drv.h b/include/linux/tee_drv.h index 91677f2fa2e8..cd15c1b7fae0 100644 --- a/include/linux/tee_drv.h +++ b/include/linux/tee_drv.h @@ -26,6 +26,7 @@ #define TEE_SHM_REGISTER BIT(3) /* Memory registered in secure world */ #define TEE_SHM_USER_MAPPED BIT(4) /* Memory mapped in user space */ #define TEE_SHM_POOL BIT(5) /* Memory allocated from pool */ +#define TEE_SHM_PRIV BIT(7) /* Memory private to TEE driver */
struct device; struct tee_device;
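For illustration only, a small userspace sketch of the new flag semantics. TEE_SHM_PRIV is BIT(7) per the hunk above; the bit positions of the two pre-existing flags are assumed here, and needs_registration() is a hypothetical stand-in for the test used by pool_op_alloc()/pool_op_free().

#include <stdbool.h>
#include <stdio.h>

#define TEE_SHM_MAPPED	(1 << 0)	/* assumed bit positions for the */
#define TEE_SHM_DMA_BUF	(1 << 1)	/* two pre-existing flags */
#define TEE_SHM_PRIV	(1 << 7)	/* BIT(7), as added by the patch */

/* Before the patch, driver-private buffers dodged registration by not
 * setting TEE_SHM_DMA_BUF; now privacy is explicit, so a kernel buffer
 * without dma-buf backing can still be registered with OP-TEE. */
static bool needs_registration(unsigned int flags)
{
	return !(flags & TEE_SHM_PRIV);	/* the new pool_op_alloc() test */
}

int main(void)
{
	unsigned int msg_arg = TEE_SHM_MAPPED | TEE_SHM_PRIV;
	unsigned int kernel_buf = TEE_SHM_MAPPED; /* tee_shm_alloc_kernel_buf() */

	printf("driver-private msg arg registered: %d\n",
	       needs_registration(msg_arg));
	printf("kernel buffer registered:          %d\n",
	       needs_registration(kernel_buf));
	return 0;
}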
From: Alexandre Courbot gnurou@gmail.com
commit 566463afdbc43c7744c5a1b89250fc808df03833 upstream.
If poll() is called on an m2m device with the EPOLLOUT event after the last buffer of the CAPTURE queue is dequeued, any buffer available on the OUTPUT queue will never be signaled, because v4l2_m2m_poll_for_data() starts by checking whether dst_q->last_buffer_dequeued is set and returns EPOLLIN in this case, without looking at the state of the OUTPUT queue.

Fix this by not returning early, so that we keep checking the state of the OUTPUT queue afterwards.
Signed-off-by: Alexandre Courbot gnurou@gmail.com Reviewed-by: Ezequiel Garcia ezequiel@collabora.com Signed-off-by: Hans Verkuil hverkuil-cisco@xs4all.nl Signed-off-by: Mauro Carvalho Chehab mchehab+huawei@kernel.org Cc: Lecopzer Chen lecopzer.chen@mediatek.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/media/v4l2-core/v4l2-mem2mem.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
--- a/drivers/media/v4l2-core/v4l2-mem2mem.c
+++ b/drivers/media/v4l2-core/v4l2-mem2mem.c
@@ -635,10 +635,8 @@ static __poll_t v4l2_m2m_poll_for_data(s
 		 * If the last buffer was dequeued from the capture queue,
 		 * return immediately. DQBUF will return -EPIPE.
 		 */
-		if (dst_q->last_buffer_dequeued) {
-			spin_unlock_irqrestore(&dst_q->done_lock, flags);
-			return EPOLLIN | EPOLLRDNORM;
-		}
+		if (dst_q->last_buffer_dequeued)
+			rc |= EPOLLIN | EPOLLRDNORM;
 	}

 	spin_unlock_irqrestore(&dst_q->done_lock, flags);
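For illustration, a tiny userspace model of the fix (the queue state struct and bit handling are made up): accumulating readiness bits instead of returning early is what lets the OUTPUT queue still be reported.

#include <stdbool.h>
#include <stdio.h>

#define EPOLLIN		0x001
#define EPOLLOUT	0x004

struct m2m_state {
	bool capture_last_buffer_dequeued;
	bool output_has_free_buffer;
};

static unsigned int poll_for_data(const struct m2m_state *s)
{
	unsigned int rc = 0;

	if (s->capture_last_buffer_dequeued)
		rc |= EPOLLIN;	/* old code returned here, hiding EPOLLOUT */
	if (s->output_has_free_buffer)
		rc |= EPOLLOUT;
	return rc;
}

int main(void)
{
	struct m2m_state s = {
		.capture_last_buffer_dequeued = true,
		.output_has_free_buffer = true,
	};
	unsigned int rc = poll_for_data(&s);

	printf("EPOLLIN=%d EPOLLOUT=%d\n", !!(rc & EPOLLIN), !!(rc & EPOLLOUT));
	return 0;
}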
From: Masami Hiramatsu mhiramat@kernel.org
commit a9d10ca4986571bffc19778742d508cc8dd13e02 upstream.
Since the string type cannot be the target of an addition / subtraction operation, it must be rejected. Without this fix, a string operand was silently converted to digits.
Link: https://lkml.kernel.org/r/162742654278.290973.1523000673366456634.stgit@devn...
Cc: stable@vger.kernel.org Fixes: 100719dcef447 ("tracing: Add simple expression support to hist triggers") Signed-off-by: Masami Hiramatsu mhiramat@kernel.org Signed-off-by: Steven Rostedt (VMware) rostedt@goodmis.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- kernel/trace/trace_events_hist.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-)
--- a/kernel/trace/trace_events_hist.c +++ b/kernel/trace/trace_events_hist.c @@ -66,7 +66,8 @@ C(INVALID_SUBSYS_EVENT, "Invalid subsystem or event name"), \ C(INVALID_REF_KEY, "Using variable references in keys not supported"), \ C(VAR_NOT_FOUND, "Couldn't find variable"), \ - C(FIELD_NOT_FOUND, "Couldn't find field"), + C(FIELD_NOT_FOUND, "Couldn't find field"), \ + C(INVALID_STR_OPERAND, "String type can not be an operand in expression"),
#undef C #define C(a, b) HIST_ERR_##a @@ -3038,6 +3039,13 @@ static struct hist_field *parse_unary(st ret = PTR_ERR(operand1); goto free; } + if (operand1->flags & HIST_FIELD_FL_STRING) { + /* String type can not be the operand of unary operator. */ + hist_err(file->tr, HIST_ERR_INVALID_STR_OPERAND, errpos(str)); + destroy_hist_field(operand1, 0); + ret = -EINVAL; + goto free; + }
expr->flags |= operand1->flags & (HIST_FIELD_FL_TIMESTAMP | HIST_FIELD_FL_TIMESTAMP_USECS); @@ -3139,6 +3147,11 @@ static struct hist_field *parse_expr(str operand1 = NULL; goto free; } + if (operand1->flags & HIST_FIELD_FL_STRING) { + hist_err(file->tr, HIST_ERR_INVALID_STR_OPERAND, errpos(operand1_str)); + ret = -EINVAL; + goto free; + }
/* rest of string could be another expression e.g. b+c in a+b+c */ operand_flags = 0; @@ -3148,6 +3161,11 @@ static struct hist_field *parse_expr(str operand2 = NULL; goto free; } + if (operand2->flags & HIST_FIELD_FL_STRING) { + hist_err(file->tr, HIST_ERR_INVALID_STR_OPERAND, errpos(str)); + ret = -EINVAL; + goto free; + }
ret = check_expr_operands(file->tr, operand1, operand2); if (ret)
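As a sketch of the idea (the field table and helper below are invented for illustration, not the real hist-trigger parser), the added check amounts to rejecting string-typed operands during expression parsing instead of letting them decay to numbers.

#include <stdio.h>
#include <string.h>

enum field_type { FIELD_NUM, FIELD_STRING };

struct field { const char *name; enum field_type type; };

static const struct field fields[] = {
	{ "bytes_req", FIELD_NUM },
	{ "comm",      FIELD_STRING },
};

/* Returns 0 if the named field may be an expression operand, -1 if it
 * is string-typed (mirrors the -EINVAL + INVALID_STR_OPERAND error). */
static int check_operand(const char *name)
{
	for (size_t i = 0; i < sizeof(fields) / sizeof(fields[0]); i++) {
		if (strcmp(fields[i].name, name))
			continue;
		if (fields[i].type == FIELD_STRING) {
			fprintf(stderr,
				"%s: String type can not be an operand in expression\n",
				name);
			return -1;
		}
		return 0;
	}
	return -1;			/* field not found */
}

int main(void)
{
	printf("bytes_req+bytes_req -> %d\n", check_operand("bytes_req"));
	printf("comm+bytes_req      -> %d\n", check_operand("comm"));
	return 0;
}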
From: Wesley Cheng wcheng@codeaurora.org
[ Upstream commit ae7e86108b12351028fa7e8796a59f9b2d9e1774 ]
In the DWC3 databook, for a device-initiated disconnect or bus reset, the driver is required to send DEPENDXFER commands for any pending transfers. In addition, before the controller can move to the halted state, the SW needs to acknowledge any pending events. If the controller is not halted properly, there is a chance the controller will continue accessing stale or freed TRBs and buffers.
Signed-off-by: Wesley Cheng wcheng@codeaurora.org Reviewed-by: Thinh Nguyen thinhn@synopsys.com Signed-off-by: Felipe Balbi balbi@kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/usb/dwc3/ep0.c | 2 - drivers/usb/dwc3/gadget.c | 66 +++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 66 insertions(+), 2 deletions(-)
--- a/drivers/usb/dwc3/ep0.c +++ b/drivers/usb/dwc3/ep0.c @@ -197,7 +197,7 @@ int dwc3_gadget_ep0_queue(struct usb_ep int ret;
spin_lock_irqsave(&dwc->lock, flags); - if (!dep->endpoint.desc) { + if (!dep->endpoint.desc || !dwc->pullups_connected) { dev_err(dwc->dev, "%s: can't queue to disabled endpoint\n", dep->name); ret = -ESHUTDOWN; --- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -1511,7 +1511,7 @@ static int __dwc3_gadget_ep_queue(struct { struct dwc3 *dwc = dep->dwc;
- if (!dep->endpoint.desc) { + if (!dep->endpoint.desc || !dwc->pullups_connected) { dev_err(dwc->dev, "%s: can't queue to disabled endpoint\n", dep->name); return -ESHUTDOWN; @@ -1931,6 +1931,21 @@ static int dwc3_gadget_set_selfpowered(s return 0; }
+static void dwc3_stop_active_transfers(struct dwc3 *dwc) +{ + u32 epnum; + + for (epnum = 2; epnum < dwc->num_eps; epnum++) { + struct dwc3_ep *dep; + + dep = dwc->eps[epnum]; + if (!dep) + continue; + + dwc3_remove_requests(dwc, dep); + } +} + static int dwc3_gadget_run_stop(struct dwc3 *dwc, int is_on, int suspend) { u32 reg; @@ -1976,6 +1991,9 @@ static int dwc3_gadget_run_stop(struct d return 0; }
+static void dwc3_gadget_disable_irq(struct dwc3 *dwc); +static void __dwc3_gadget_stop(struct dwc3 *dwc); + static int dwc3_gadget_pullup(struct usb_gadget *g, int is_on) { struct dwc3 *dwc = gadget_to_dwc(g); @@ -1999,7 +2017,46 @@ static int dwc3_gadget_pullup(struct usb } }
+ /* + * Synchronize any pending event handling before executing the controller + * halt routine. + */ + if (!is_on) { + dwc3_gadget_disable_irq(dwc); + synchronize_irq(dwc->irq_gadget); + } + spin_lock_irqsave(&dwc->lock, flags); + + if (!is_on) { + u32 count; + + /* + * In the Synopsis DesignWare Cores USB3 Databook Rev. 3.30a + * Section 4.1.8 Table 4-7, it states that for a device-initiated + * disconnect, the SW needs to ensure that it sends "a DEPENDXFER + * command for any active transfers" before clearing the RunStop + * bit. + */ + dwc3_stop_active_transfers(dwc); + __dwc3_gadget_stop(dwc); + + /* + * In the Synopsis DesignWare Cores USB3 Databook Rev. 3.30a + * Section 1.3.4, it mentions that for the DEVCTRLHLT bit, the + * "software needs to acknowledge the events that are generated + * (by writing to GEVNTCOUNTn) while it is waiting for this bit + * to be set to '1'." + */ + count = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0)); + count &= DWC3_GEVNTCOUNT_MASK; + if (count > 0) { + dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), count); + dwc->ev_buf->lpos = (dwc->ev_buf->lpos + count) % + dwc->ev_buf->length; + } + } + ret = dwc3_gadget_run_stop(dwc, is_on, false); spin_unlock_irqrestore(&dwc->lock, flags);
@@ -3038,6 +3095,13 @@ static void dwc3_gadget_reset_interrupt( }
dwc3_reset_gadget(dwc); + /* + * In the Synopsis DesignWare Cores USB3 Databook Rev. 3.30a + * Section 4.1.2 Table 4-2, it states that during a USB reset, the SW + * needs to ensure that it sends "a DEPENDXFER command for any active + * transfers." + */ + dwc3_stop_active_transfers(dwc);
reg = dwc3_readl(dwc->regs, DWC3_DCTL); reg &= ~DWC3_DCTL_TSTCTRL_MASK;
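For illustration, a userspace model of the event-acknowledgment step above (ring size and names are made up): the pending count is consumed and the software read position advances modulo the event buffer length, as the hunk does for dwc->ev_buf.

#include <stdio.h>

struct evt_buf { unsigned int lpos, length; };

/* Acknowledge 'count' pending events: the real driver also writes the
 * count back to GEVNTCOUNT (omitted here). */
static void ack_pending_events(struct evt_buf *ev, unsigned int count)
{
	if (count > 0)
		ev->lpos = (ev->lpos + count) % ev->length;
}

int main(void)
{
	struct evt_buf ev = { .lpos = 4000, .length = 4096 };

	ack_pending_events(&ev, 200);			/* wraps around the ring */
	printf("lpos after ack: %u\n", ev.lpos);	/* 104 */
	return 0;
}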
From: Wesley Cheng wcheng@codeaurora.org
[ Upstream commit 77adb8bdf4227257e26b7ff67272678e66a0b250 ]
The DWC3 runtime suspend routine checks the USB connected parameter to determine if the controller can enter a low power state. The connected state is only set to false after receiving a disconnect event. However, in the case of a device-initiated disconnect (i.e. UDC unbind), the controller is halted and a disconnect event is never generated. Set the connected flag to false when issuing a device-initiated disconnect to allow the controller to be suspended.
Signed-off-by: Wesley Cheng wcheng@codeaurora.org Link: https://lore.kernel.org/r/1609283136-22140-2-git-send-email-wcheng@codeauror... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/usb/dwc3/gadget.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
--- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -2018,6 +2018,17 @@ static int dwc3_gadget_pullup(struct usb }
/* + * Check the return value for successful resume, or error. For a + * successful resume, the DWC3 runtime PM resume routine will handle + * the run stop sequence, so avoid duplicate operations here. + */ + ret = pm_runtime_get_sync(dwc->dev); + if (!ret || ret < 0) { + pm_runtime_put(dwc->dev); + return 0; + } + + /* * Synchronize any pending event handling before executing the controller * halt routine. */ @@ -2055,10 +2066,12 @@ static int dwc3_gadget_pullup(struct usb dwc->ev_buf->lpos = (dwc->ev_buf->lpos + count) % dwc->ev_buf->length; } + dwc->connected = false; }
ret = dwc3_gadget_run_stop(dwc, is_on, false); spin_unlock_irqrestore(&dwc->lock, flags); + pm_runtime_put(dwc->dev);
return ret; }
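A minimal sketch of the pm_runtime_get_sync() return-value handling added above (simplified semantics: <0 error, 0 the device was resumed just now, 1 it was already active):

#include <stdio.h>

/* Returns 1 when the pullup routine itself must run the run/stop
 * sequence; 0 when the runtime-resume callback (or the error path)
 * already took care of it. */
static int should_do_run_stop(int rpm_ret)
{
	if (rpm_ret <= 0)	/* the patch's "!ret || ret < 0" test */
		return 0;
	return 1;		/* device was already active */
}

int main(void)
{
	printf("just resumed   -> run/stop here: %d\n", should_do_run_stop(0));
	printf("already active -> run/stop here: %d\n", should_do_run_stop(1));
	return 0;
}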
From: Wesley Cheng wcheng@codeaurora.org
[ Upstream commit a1383b3537a7bea1c213baa7878ccc4ecf4413b5 ]
usb_gadget_deactivate()/usb_gadget_activate() do not execute the UDC start operation, which may leave EP0 and device event IRQs disabled when re-activating the function. Move the enabling/disabling of USB EP0 and device event IRQs into the pullup routine.
Fixes: ae7e86108b12 ("usb: dwc3: Stop active transfers before halting the controller") Tested-by: Michael Tretter m.tretter@pengutronix.de Cc: stable stable@vger.kernel.org Reported-by: Michael Tretter m.tretter@pengutronix.de Signed-off-by: Wesley Cheng wcheng@codeaurora.org Link: https://lore.kernel.org/r/1609282837-21666-1-git-send-email-wcheng@codeauror... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/usb/dwc3/gadget.c | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-)
--- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -1993,6 +1993,7 @@ static int dwc3_gadget_run_stop(struct d
static void dwc3_gadget_disable_irq(struct dwc3 *dwc); static void __dwc3_gadget_stop(struct dwc3 *dwc); +static int __dwc3_gadget_start(struct dwc3 *dwc);
static int dwc3_gadget_pullup(struct usb_gadget *g, int is_on) { @@ -2067,6 +2068,8 @@ static int dwc3_gadget_pullup(struct usb dwc->ev_buf->length; } dwc->connected = false; + } else { + __dwc3_gadget_start(dwc); }
ret = dwc3_gadget_run_stop(dwc, is_on, false); @@ -2244,10 +2247,6 @@ static int dwc3_gadget_start(struct usb_ }
dwc->gadget_driver = driver; - - if (pm_runtime_active(dwc->dev)) - __dwc3_gadget_start(dwc); - spin_unlock_irqrestore(&dwc->lock, flags);
return 0; @@ -2273,13 +2272,6 @@ static int dwc3_gadget_stop(struct usb_g unsigned long flags;
spin_lock_irqsave(&dwc->lock, flags); - - if (pm_runtime_suspended(dwc->dev)) - goto out; - - __dwc3_gadget_stop(dwc); - -out: dwc->gadget_driver = NULL; spin_unlock_irqrestore(&dwc->lock, flags);
From: Wesley Cheng wcheng@codeaurora.org
[ Upstream commit f09ddcfcb8c569675066337adac2ac205113471f ]
In situations where the DWC3 gadget stops active transfers, once dwc3_gadget_giveback() is called there is a chance for a function driver to queue a new USB request in the window where the dwc3 lock has been released and re-acquired. This occurs after we've already issued an ENDXFER command. When stop active transfers then continues to remove USB requests from all dep lists, the newly added request will also be removed, while the controller still has an active TRB for it. This can lead to the controller accessing an unmapped memory address.
Fix this by ensuring parameters to prevent EP queuing are set before calling the stop active transfers API.
Fixes: ae7e86108b12 ("usb: dwc3: Stop active transfers before halting the controller") Signed-off-by: Wesley Cheng wcheng@codeaurora.org Link: https://lore.kernel.org/r/1615507142-23097-1-git-send-email-wcheng@codeauror... Cc: stable stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/usb/dwc3/gadget.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
--- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -746,8 +746,6 @@ static int __dwc3_gadget_ep_disable(stru
trace_dwc3_gadget_ep_disable(dep);
- dwc3_remove_requests(dwc, dep); - /* make sure HW endpoint isn't stalled */ if (dep->flags & DWC3_EP_STALL) __dwc3_gadget_ep_set_halt(dep, 0, false); @@ -766,6 +764,8 @@ static int __dwc3_gadget_ep_disable(stru dep->endpoint.desc = NULL; }
+ dwc3_remove_requests(dwc, dep); + return 0; }
@@ -1511,7 +1511,7 @@ static int __dwc3_gadget_ep_queue(struct { struct dwc3 *dwc = dep->dwc;
- if (!dep->endpoint.desc || !dwc->pullups_connected) { + if (!dep->endpoint.desc || !dwc->pullups_connected || !dwc->connected) { dev_err(dwc->dev, "%s: can't queue to disabled endpoint\n", dep->name); return -ESHUTDOWN; @@ -2043,6 +2043,7 @@ static int dwc3_gadget_pullup(struct usb if (!is_on) { u32 count;
+ dwc->connected = false; /* * In the Synopsis DesignWare Cores USB3 Databook Rev. 3.30a * Section 4.1.8 Table 4-7, it states that for a device-initiated @@ -2067,7 +2068,6 @@ static int dwc3_gadget_pullup(struct usb dwc->ev_buf->lpos = (dwc->ev_buf->lpos + count) % dwc->ev_buf->length; } - dwc->connected = false; } else { __dwc3_gadget_start(dwc); } @@ -3057,8 +3057,6 @@ static void dwc3_gadget_reset_interrupt( { u32 reg;
- dwc->connected = true; - /* * Ideally, dwc3_reset_gadget() would trigger the function * drivers to stop any active transfers through ep disable. @@ -3107,6 +3105,7 @@ static void dwc3_gadget_reset_interrupt( * transfers." */ dwc3_stop_active_transfers(dwc); + dwc->connected = true;
reg = dwc3_readl(dwc->regs, DWC3_DCTL); reg &= ~DWC3_DCTL_TSTCTRL_MASK;
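As an illustration (simplified types, not the real driver), the strengthened gate in the queue path now rejects requests as soon as the disconnect sequence clears dwc->connected, closing the window described above:

#include <stdbool.h>
#include <stdio.h>

struct dev_state { bool has_desc, pullups_connected, connected; };

static int ep_queue(const struct dev_state *d)
{
	/* the gate from __dwc3_gadget_ep_queue() after this patch */
	if (!d->has_desc || !d->pullups_connected || !d->connected)
		return -108;	/* -ESHUTDOWN */
	return 0;
}

int main(void)
{
	struct dev_state d = { true, true, true };

	printf("queue while connected: %d\n", ep_queue(&d));
	d.connected = false;	/* pullup(0) now clears this first */
	printf("queue during teardown: %d\n", ep_queue(&d));
	return 0;
}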
From: Wesley Cheng wcheng@codeaurora.org
[ Upstream commit 5aef629704ad4d983ecf5c8a25840f16e45b6d59 ]
Ensure that dep->flags is not cleared until after stop active transfers has completed. Otherwise, the ENDXFER command will not be executed during ep disable.
Fixes: f09ddcfcb8c5 ("usb: dwc3: gadget: Prevent EP queuing while stopping transfers") Cc: stable stable@vger.kernel.org Reported-and-tested-by: Andy Shevchenko andy.shevchenko@gmail.com Tested-by: Marek Szyprowski m.szyprowski@samsung.com Signed-off-by: Wesley Cheng wcheng@codeaurora.org Link: https://lore.kernel.org/r/1616610664-16495-1-git-send-email-wcheng@codeauror... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/usb/dwc3/gadget.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -754,10 +754,6 @@ static int __dwc3_gadget_ep_disable(stru
 	reg &= ~DWC3_DALEPENA_EP(dep->number);
 	dwc3_writel(dwc->regs, DWC3_DALEPENA, reg);

-	dep->stream_capable = false;
-	dep->type = 0;
-	dep->flags = 0;
-
 	/* Clear out the ep descriptors for non-ep0 */
 	if (dep->number > 1) {
 		dep->endpoint.comp_desc = NULL;
@@ -766,6 +762,10 @@ static int __dwc3_gadget_ep_disable(stru

 	dwc3_remove_requests(dwc, dep);

+	dep->stream_capable = false;
+	dep->type = 0;
+	dep->flags = 0;
+
 	return 0;
 }
From: Wesley Cheng wcheng@codeaurora.org
[ Upstream commit 8212937305f84ef73ea81036dafb80c557583d4b ]
The current sequence utilizes dwc3_gadget_disable_irq() alongside synchronize_irq() to ensure that no further DWC3 events are generated. However, the dwc3_gadget_disable_irq() API only disables device-specific events; endpoint events can still be generated. Briefly disable the interrupt line so that the cleanup code (__dwc3_gadget_stop() and dwc3_stop_active_transfers(), respectively) can run without device and endpoint events being generated.

Without doing so, both the interrupt handler and the pullup disable routine can end up writing to the GEVNTCOUNT register, which will cause an incorrect count to be read from future interrupts.
Fixes: ae7e86108b12 ("usb: dwc3: Stop active transfers before halting the controller") Signed-off-by: Wesley Cheng wcheng@codeaurora.org Link: https://lore.kernel.org/r/1621571037-1424-1-git-send-email-wcheng@codeaurora... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/usb/dwc3/gadget.c | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-)
--- a/drivers/usb/dwc3/gadget.c +++ b/drivers/usb/dwc3/gadget.c @@ -2030,13 +2030,10 @@ static int dwc3_gadget_pullup(struct usb }
/* - * Synchronize any pending event handling before executing the controller - * halt routine. + * Synchronize and disable any further event handling while controller + * is being enabled/disabled. */ - if (!is_on) { - dwc3_gadget_disable_irq(dwc); - synchronize_irq(dwc->irq_gadget); - } + disable_irq(dwc->irq_gadget);
spin_lock_irqsave(&dwc->lock, flags);
@@ -2074,6 +2071,8 @@ static int dwc3_gadget_pullup(struct usb
ret = dwc3_gadget_run_stop(dwc, is_on, false); spin_unlock_irqrestore(&dwc->lock, flags); + enable_irq(dwc->irq_gadget); + pm_runtime_put(dwc->dev);
return ret;
From: Wesley Cheng wcheng@codeaurora.org
[ Upstream commit cb10f68ad8150f243964b19391711aaac5e8ff42 ]
If the device is already in the runtime-suspended state, any call to the pullup routine will issue a runtime resume on the DWC3 core device. If the USB gadget is disabling the pullup, avoid having to issue a runtime resume, as the DWC3 gadget has already been halted/stopped.
This fixes an issue where the following condition occurs:
usb_gadget_remove_driver()
  -->usb_gadget_disconnect()
    -->dwc3_gadget_pullup(0)
      -->pm_runtime_get_sync() -> ret = 0
      -->pm_runtime_put() [async]
  -->usb_gadget_udc_stop()
    -->dwc3_gadget_stop()
      -->dwc->gadget_driver = NULL
...

dwc3_suspend_common()
  -->dwc3_gadget_suspend()
    -->DWC3 halt/stop routine skipped, driver_data == NULL
This leads to a situation where the DWC3 gadget is not properly stopped, as the runtime resume would have re-enabled EP0 and event interrupts, and since we avoided the DWC3 gadget suspend, these resources were never disabled.
Fixes: 77adb8bdf422 ("usb: dwc3: gadget: Allow runtime suspend if UDC unbinded") Cc: stable stable@vger.kernel.org Acked-by: Felipe Balbi balbi@kernel.org Signed-off-by: Wesley Cheng wcheng@codeaurora.org Link: https://lore.kernel.org/r/1628058245-30692-1-git-send-email-wcheng@codeauror... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/usb/dwc3/gadget.c | 11 +++++++++++ 1 file changed, 11 insertions(+)
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -2019,6 +2019,17 @@ static int dwc3_gadget_pullup(struct usb
 	}

 	/*
+	 * Avoid issuing a runtime resume if the device is already in the
+	 * suspended state during gadget disconnect. DWC3 gadget was already
+	 * halted/stopped during runtime suspend.
+	 */
+	if (!is_on) {
+		pm_runtime_barrier(dwc->dev);
+		if (pm_runtime_suspended(dwc->dev))
+			return 0;
+	}
+
+	/*
 	 * Check the return value for successful resume, or error. For a
 	 * successful resume, the DWC3 runtime PM resume routine will handle
 	 * the run stop sequence, so avoid duplicate operations here.
From: Lai Jiangshan laijs@linux.alibaba.com
commit b1bd5cba3306691c771d558e94baa73e8b0b96b7 upstream.
When computing the access permissions of a shadow page, use the effective permissions of the walk up to that point, i.e. the logical AND of its parents' permissions. Two guest PxE entries that point at the same table gfn need to be shadowed with different shadow pages if their parents' permissions are different. KVM currently uses the effective permissions of the last non-leaf entry for all non-leaf entries. Because all non-leaf SPTEs have full ("uwx") permissions, and the effective permissions are recorded only in role.access and merged into the leaves, this can lead to incorrect reuse of a shadow page and eventually to a missing guest protection page fault.
For example, here is a shared pagetable:
  pgd[]   pud[]        pmd[]            virtual address pointers
                    /->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
           /->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
  pgd-|          (shared pmd[] as above)
           \->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
                       \->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)
pud1 and pud2 point to the same pmd table, so:
- ptr1 and ptr3 point to the same page.
- ptr2 and ptr4 point to the same page.
(pud1 and pud2 here are pud entries, while pmd1 and pmd2 here are pmd entries)
- First, the guest reads from ptr1 and KVM prepares a shadow page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1. "u--" comes from the effective permissions of pgd, pud1 and pmd1, which are stored in pt->access. "u--" is also used to get the pagetable for pud1, instead of "uw-".
- Then the guest writes to ptr2 and KVM reuses pud1, which is present. The hypervisor sets up a shadow page for ptr2 with pt->access = "uw-" even though the pud1 pmd (because of the incorrect argument to kvm_mmu_get_page in the previous step) has role.access="u--".
- Then the guest reads from ptr3. The hypervisor reuses pud1's shadow pmd for pud2, because both use "u--" for their permissions. Thus, the shadow pmd already includes entries for both pmd1 and pmd2.
- At last, the guest writes to ptr4. This causes no vmexit or page fault, because pud1's shadow page structures included an "uw-" page even though its role.access was "u--".
Any kind of shared pagetable might have a similar problem in a virtual machine without TDP enabled, if the permissions inherited from different ancestors differ.
In order to fix the problem, we change pt->access to be an array, and any access in it will not include permissions ANDed from child ptes.
The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/ Remember to test it with TDP disabled.
The problem had existed long before the commit 41074d07c78b ("KVM: MMU: Fix inherited permissions for emulated guest pte updates"), and it is hard to find which commit is the culprit. So there is no Fixes tag here.
Signed-off-by: Lai Jiangshan laijs@linux.alibaba.com Message-Id: 20210603052455.21023-1-jiangshanlai@gmail.com Cc: stable@vger.kernel.org Fixes: cea0f0e7ea54 ("[PATCH] KVM: MMU: Shadow page table caching") Signed-off-by: Paolo Bonzini pbonzini@redhat.com [OP: - apply arch/x86/kvm/mmu/* changes to arch/x86/kvm - apply documentation changes to Documentation/virt/kvm/mmu.txt - adjusted context in arch/x86/kvm/paging_tmpl.h] Signed-off-by: Ovidiu Panait ovidiu.panait@windriver.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- Documentation/virt/kvm/mmu.txt | 4 ++-- arch/x86/kvm/paging_tmpl.h | 14 +++++++++----- 2 files changed, 11 insertions(+), 7 deletions(-)
--- a/Documentation/virt/kvm/mmu.txt +++ b/Documentation/virt/kvm/mmu.txt @@ -152,8 +152,8 @@ Shadow pages contain the following infor shadow pages) so role.quadrant takes values in the range 0..3. Each quadrant maps 1GB virtual address space. role.access: - Inherited guest access permissions in the form uwx. Note execute - permission is positive, not negative. + Inherited guest access permissions from the parent ptes in the form uwx. + Note execute permission is positive, not negative. role.invalid: The page is invalid and should not be used. It is a root page that is currently pinned (by a cpu hardware register pointing to it); once it is --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -90,8 +90,8 @@ struct guest_walker { gpa_t pte_gpa[PT_MAX_FULL_LEVELS]; pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS]; bool pte_writable[PT_MAX_FULL_LEVELS]; - unsigned pt_access; - unsigned pte_access; + unsigned int pt_access[PT_MAX_FULL_LEVELS]; + unsigned int pte_access; gfn_t gfn; struct x86_exception fault; }; @@ -406,13 +406,15 @@ retry_walk: }
walker->ptes[walker->level - 1] = pte; + + /* Convert to ACC_*_MASK flags for struct guest_walker. */ + walker->pt_access[walker->level - 1] = FNAME(gpte_access)(pt_access ^ walk_nx_mask); } while (!is_last_gpte(mmu, walker->level, pte));
pte_pkey = FNAME(gpte_pkeys)(vcpu, pte); accessed_dirty = have_ad ? pte_access & PT_GUEST_ACCESSED_MASK : 0;
/* Convert to ACC_*_MASK flags for struct guest_walker. */ - walker->pt_access = FNAME(gpte_access)(pt_access ^ walk_nx_mask); walker->pte_access = FNAME(gpte_access)(pte_access ^ walk_nx_mask); errcode = permission_fault(vcpu, mmu, walker->pte_access, pte_pkey, access); if (unlikely(errcode)) @@ -451,7 +453,8 @@ retry_walk: }
pgprintk("%s: pte %llx pte_access %x pt_access %x\n", - __func__, (u64)pte, walker->pte_access, walker->pt_access); + __func__, (u64)pte, walker->pte_access, + walker->pt_access[walker->level - 1]); return 1;
error: @@ -620,7 +623,7 @@ static int FNAME(fetch)(struct kvm_vcpu { struct kvm_mmu_page *sp = NULL; struct kvm_shadow_walk_iterator it; - unsigned direct_access, access = gw->pt_access; + unsigned int direct_access, access; int top_level, ret; gfn_t gfn, base_gfn;
@@ -652,6 +655,7 @@ static int FNAME(fetch)(struct kvm_vcpu sp = NULL; if (!is_shadow_present_pte(*it.sptep)) { table_gfn = gw->table_gfn[it.level - 2]; + access = gw->pt_access[it.level - 2]; sp = kvm_mmu_get_page(vcpu, table_gfn, addr, it.level-1, false, access); }
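A worked userspace model of the fix (permission encoding invented for illustration): tracking the ANDed access per level gives the shared pmd different role.access values on the two walks, so it can no longer be wrongly reused.

#include <stdio.h>

#define P_U 4
#define P_W 2
#define P_X 1

/* AND permissions along a guest walk, recording the value per level,
 * as the pt_access[] array in the patch does. */
static void walk(const unsigned int *pte_perms, int levels,
		 unsigned int *pt_access)
{
	unsigned int acc = P_U | P_W | P_X;

	for (int lvl = 0; lvl < levels; lvl++) {
		acc &= pte_perms[lvl];
		pt_access[lvl] = acc;
	}
}

int main(void)
{
	/* pgd -> pud1(uw-) -> pmd2(uw-)  vs  pgd -> pud2(u--) -> pmd2(uw-) */
	unsigned int walk_a[] = { P_U | P_W | P_X, P_U | P_W, P_U | P_W };
	unsigned int walk_b[] = { P_U | P_W | P_X, P_U,       P_U | P_W };
	unsigned int acc_a[3], acc_b[3];

	walk(walk_a, 3, acc_a);
	walk(walk_b, 3, acc_b);

	/* The shared pmd must be shadowed with different role.access: */
	printf("walk A pmd-level access: %x\n", acc_a[2]);	/* uw- = 6 */
	printf("walk B pmd-level access: %x\n", acc_b[2]);	/* u-- = 4 */
	return 0;
}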
From: Longfang Liu liulongfang@huawei.com
commit 26b75952ca0b8b4b3050adb9582c8e2f44d49687 upstream.
Kunpeng920's EHCI controller does not have an SBRN register. Reading the SBRN register when the controller driver is initialized will return 0.

When rebooting, ehci_shutdown() will be called. If the sbrn flag is 0, ehci_shutdown() returns directly. The sbrn flag being 0 thus causes the EHCI interrupt signal not to be turned off after reboot, and this interrupt, which is never disabled, will cause an exception on any device sharing the interrupt line.
Therefore, the EHCI controller of Kunpeng920 needs to skip the read operation of the SBRN register.
Acked-by: Alan Stern stern@rowland.harvard.edu Signed-off-by: Longfang Liu liulongfang@huawei.com Link: https://lore.kernel.org/r/1617958081-17999-1-git-send-email-liulongfang@huaw... Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/usb/host/ehci-pci.c | 3 +++ 1 file changed, 3 insertions(+)
--- a/drivers/usb/host/ehci-pci.c
+++ b/drivers/usb/host/ehci-pci.c
@@ -298,6 +298,9 @@ static int ehci_pci_setup(struct usb_hcd
 	if (pdev->vendor == PCI_VENDOR_ID_STMICRO
 	    && pdev->device == PCI_DEVICE_ID_STMICRO_USB_HOST)
 		;	/* ConneXT has no sbrn register */
+	else if (pdev->vendor == PCI_VENDOR_ID_HUAWEI
+		 && pdev->device == 0xa239)
+		;	/* HUAWEI Kunpeng920 USB EHCI has no sbrn register */
 	else
 		pci_read_config_byte(pdev, 0x60, &ehci->sbrn);
From: Luke D Jones luke@ljones.dev
commit 739d0959fbed23838a96c48fbce01dd2f6fb2c5f upstream.
The ASUS GV301QH sound appears to work well with the quirk for ALC294_FIXUP_ASUS_DUAL_SPK.
Signed-off-by: Luke D Jones luke@ljones.dev Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210807025805.27321-1-luke@ljones.dev Signed-off-by: Takashi Iwai tiwai@suse.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- sound/pci/hda/patch_realtek.c | 1 + 1 file changed, 1 insertion(+)
--- a/sound/pci/hda/patch_realtek.c
+++ b/sound/pci/hda/patch_realtek.c
@@ -8122,6 +8122,7 @@ static const struct snd_pci_quirk alc269
 	SND_PCI_QUIRK(0x1043, 0x16e3, "ASUS UX50", ALC269_FIXUP_STEREO_DMIC),
 	SND_PCI_QUIRK(0x1043, 0x1740, "ASUS UX430UA", ALC295_FIXUP_ASUS_DACS),
 	SND_PCI_QUIRK(0x1043, 0x17d1, "ASUS UX431FL", ALC294_FIXUP_ASUS_DUAL_SPK),
+	SND_PCI_QUIRK(0x1043, 0x1662, "ASUS GV301QH", ALC294_FIXUP_ASUS_DUAL_SPK),
 	SND_PCI_QUIRK(0x1043, 0x1881, "ASUS Zephyrus S/M", ALC294_FIXUP_ASUS_GX502_PINS),
 	SND_PCI_QUIRK(0x1043, 0x18b1, "Asus MJ401TA", ALC256_FIXUP_ASUS_HEADSET_MIC),
 	SND_PCI_QUIRK(0x1043, 0x18f1, "Asus FX505DT", ALC256_FIXUP_ASUS_HEADSET_MIC),
From: Pali Rohár pali@kernel.org
commit 3125f26c514826077f2a4490b75e9b1c7a644c42 upstream.
When registering a new ppp interface via the PPPIOCNEWUNIT ioctl, the kernel has to choose the interface name, as this ioctl API does not support specifying one.

In this case the kernel registers the new interface with the name "ppp<id>", where <id> is the ppp unit id, which can be obtained via the PPPIOCGUNIT ioctl. This also applies when registering a new ppp interface via rtnl without supplying IFLA_IFNAME.

The PPPIOCNEWUNIT ioctl allows userspace to specify its own ppp unit id, which the kernel will assign to the ppp interface, provided that this unit id is not already used by another ppp interface.

If the user does not specify a ppp unit id, the kernel chooses the first free one. This also applies when creating a ppp interface via the rtnl method, as it does not provide a way to specify a unit id.

If some network interface (which does not have to be ppp) has the name "ppp<id>" with this first free ppp unit id, the PPPIOCNEWUNIT ioctl or rtnl call fails.

Registering a new ppp interface is then not possible anymore, until the interface holding the conflicting name is renamed, or unless the rtnl method is used with a custom interface name in IFLA_IFNAME.

As the list of allocated / used ppp unit ids cannot be retrieved from the kernel, userspace has no idea what is happening nor which interface is causing the conflict.

So change the algorithm by which the ppp unit id is generated: choose the first number that is neither used as a ppp unit id nor taken by a network interface named with the pattern "ppp<id>".
This issue can be reproduced with the following pppd call when no ppp interface is registered and no interface has a name matching the pattern "ppp<id>":
pppd ifname ppp1 +ipv6 noip noauth nolock local nodetach pty "pppd +ipv6 noip noauth nolock local nodetach notty"
Or by creating one ppp interface (which gets assigned ppp unit id 0), renaming it to "ppp1", and then trying to create a new ppp interface (which will always fail, as the next free ppp unit id is 1, but a network interface with the name "ppp1" already exists).

This patch fixes the issue described above by generating successive ppp unit ids until an id that does not conflict with any network interface name is found.
Signed-off-by: Pali Rohár pali@kernel.org Cc: stable@vger.kernel.org Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ppp/ppp_generic.c | 19 +++++++++++++++---- 1 file changed, 15 insertions(+), 4 deletions(-)
--- a/drivers/net/ppp/ppp_generic.c +++ b/drivers/net/ppp/ppp_generic.c @@ -283,7 +283,7 @@ static struct channel *ppp_find_channel( static int ppp_connect_channel(struct channel *pch, int unit); static int ppp_disconnect_channel(struct channel *pch); static void ppp_destroy_channel(struct channel *pch); -static int unit_get(struct idr *p, void *ptr); +static int unit_get(struct idr *p, void *ptr, int min); static int unit_set(struct idr *p, void *ptr, int n); static void unit_put(struct idr *p, int n); static void *unit_find(struct idr *p, int n); @@ -959,9 +959,20 @@ static int ppp_unit_register(struct ppp mutex_lock(&pn->all_ppp_mutex);
if (unit < 0) { - ret = unit_get(&pn->units_idr, ppp); + ret = unit_get(&pn->units_idr, ppp, 0); if (ret < 0) goto err; + if (!ifname_is_set) { + while (1) { + snprintf(ppp->dev->name, IFNAMSIZ, "ppp%i", ret); + if (!__dev_get_by_name(ppp->ppp_net, ppp->dev->name)) + break; + unit_put(&pn->units_idr, ret); + ret = unit_get(&pn->units_idr, ppp, ret + 1); + if (ret < 0) + goto err; + } + } } else { /* Caller asked for a specific unit number. Fail with -EEXIST * if unavailable. For backward compatibility, return -EEXIST @@ -3294,9 +3305,9 @@ static int unit_set(struct idr *p, void }
/* get new free unit number and associate pointer with it */ -static int unit_get(struct idr *p, void *ptr) +static int unit_get(struct idr *p, void *ptr, int min) { - return idr_alloc(p, ptr, 0, 0, GFP_KERNEL); + return idr_alloc(p, ptr, min, 0, GFP_KERNEL); }
/* put unit number back to a pool */
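For illustration, a userspace model of the new allocation loop (plain arrays stand in for the idr and for __dev_get_by_name()), reproducing the "renamed to ppp1" scenario above:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static const char *taken_names[] = { "ppp1", "eth0" };	/* ppp0 was renamed */
static bool unit_used[16];				/* idr stand-in */

static bool name_in_use(const char *name)
{
	for (size_t i = 0; i < sizeof(taken_names) / sizeof(*taken_names); i++)
		if (!strcmp(taken_names[i], name))
			return true;
	return false;
}

static int unit_get(int min)	/* like idr_alloc(p, ptr, min, 0, ...) */
{
	for (int i = min; i < 16; i++)
		if (!unit_used[i]) {
			unit_used[i] = true;
			return i;
		}
	return -1;
}

int main(void)
{
	char name[16];
	int ret;

	unit_used[0] = true;	/* existing ppp device, renamed to "ppp1" */

	ret = unit_get(0);	/* -> 1, but the name "ppp1" is taken */
	while (ret >= 0) {
		snprintf(name, sizeof(name), "ppp%i", ret);
		if (!name_in_use(name))
			break;
		unit_used[ret] = false;		/* unit_put() */
		ret = unit_get(ret + 1);	/* retry from the next id */
	}
	printf("allocated %s (unit %d)\n", name, ret);	/* ppp2 (unit 2) */
	return 0;
}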
From: Miklos Szeredi mszeredi@redhat.com
commit 427215d85e8d1476da1a86b8d67aceb485eb3631 upstream.
Add the following checks from __do_loopback() to clone_private_mount() as well:
- verify that the mount is in the current namespace
- verify that there are no locked children
Reported-by: Alois Wohlschlager alois1@gmx-topmail.de Fixes: c771d683a62e ("vfs: introduce clone_private_mount()") Cc: stable@vger.kernel.org # v3.18 Signed-off-by: Miklos Szeredi mszeredi@redhat.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/namespace.c | 42 ++++++++++++++++++++++++++++-------------- 1 file changed, 28 insertions(+), 14 deletions(-)
--- a/fs/namespace.c +++ b/fs/namespace.c @@ -1861,6 +1861,20 @@ void drop_collected_mounts(struct vfsmou namespace_unlock(); }
+static bool has_locked_children(struct mount *mnt, struct dentry *dentry) +{ + struct mount *child; + + list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) { + if (!is_subdir(child->mnt_mountpoint, dentry)) + continue; + + if (child->mnt.mnt_flags & MNT_LOCKED) + return true; + } + return false; +} + /** * clone_private_mount - create a private clone of a path * @@ -1875,14 +1889,27 @@ struct vfsmount *clone_private_mount(con struct mount *old_mnt = real_mount(path->mnt); struct mount *new_mnt;
+ down_read(&namespace_sem); if (IS_MNT_UNBINDABLE(old_mnt)) - return ERR_PTR(-EINVAL); + goto invalid; + + if (!check_mnt(old_mnt)) + goto invalid; + + if (has_locked_children(old_mnt, path->dentry)) + goto invalid;
new_mnt = clone_mnt(old_mnt, path->dentry, CL_PRIVATE); + up_read(&namespace_sem); + if (IS_ERR(new_mnt)) return ERR_CAST(new_mnt);
return &new_mnt->mnt; + +invalid: + up_read(&namespace_sem); + return ERR_PTR(-EINVAL); } EXPORT_SYMBOL_GPL(clone_private_mount);
@@ -2234,19 +2261,6 @@ static int do_change_type(struct path *p return err; }
-static bool has_locked_children(struct mount *mnt, struct dentry *dentry) -{ - struct mount *child; - list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) { - if (!is_subdir(child->mnt_mountpoint, dentry)) - continue; - - if (child->mnt.mnt_flags & MNT_LOCKED) - return true; - } - return false; -} - static struct mount *__do_loopback(struct path *old_path, int recurse) { struct mount *mnt = ERR_PTR(-EINVAL), *old = real_mount(old_path->mnt);
From: Nikolay Borisov nborisov@suse.com
commit df2cfd131fd33dbef1ce33be8b332b1f3d645f35 upstream
It only uses btrfs_inode, so it can just as easily take it as an argument.
Signed-off-by: Nikolay Borisov nborisov@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/qgroup.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-)
--- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -3481,10 +3481,10 @@ cleanup: }
/* Free ranges specified by @reserved, normally in error path */ -static int qgroup_free_reserved_data(struct inode *inode, +static int qgroup_free_reserved_data(struct btrfs_inode *inode, struct extent_changeset *reserved, u64 start, u64 len) { - struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_root *root = inode->root; struct ulist_node *unode; struct ulist_iterator uiter; struct extent_changeset changeset; @@ -3520,8 +3520,8 @@ static int qgroup_free_reserved_data(str * EXTENT_QGROUP_RESERVED, we won't double free. * So not need to rush. */ - ret = clear_record_extent_bits(&BTRFS_I(inode)->io_tree, - free_start, free_start + free_len - 1, + ret = clear_record_extent_bits(&inode->io_tree, free_start, + free_start + free_len - 1, EXTENT_QGROUP_RESERVED, &changeset); if (ret < 0) goto out; @@ -3550,7 +3550,8 @@ static int __btrfs_qgroup_release_data(s /* In release case, we shouldn't have @reserved */ WARN_ON(!free && reserved); if (free && reserved) - return qgroup_free_reserved_data(inode, reserved, start, len); + return qgroup_free_reserved_data(BTRFS_I(inode), reserved, + start, len); extent_changeset_init(&changeset); ret = clear_record_extent_bits(&BTRFS_I(inode)->io_tree, start, start + len -1, EXTENT_QGROUP_RESERVED, &changeset);
From: Nikolay Borisov nborisov@suse.com
commit 7661a3e033ab782366e0e1f4b6aad0df3555fcbd upstream
There's only a single use of vfs_inode in a tracepoint so let's take btrfs_inode directly.
Signed-off-by: Nikolay Borisov nborisov@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/delalloc-space.c | 2 +- fs/btrfs/file.c | 7 ++++--- fs/btrfs/qgroup.c | 10 +++++----- fs/btrfs/qgroup.h | 2 +- 4 files changed, 11 insertions(+), 10 deletions(-)
--- a/fs/btrfs/delalloc-space.c +++ b/fs/btrfs/delalloc-space.c @@ -151,7 +151,7 @@ int btrfs_check_data_free_space(struct i return ret;
/* Use new btrfs_qgroup_reserve_data to reserve precious data space. */ - ret = btrfs_qgroup_reserve_data(inode, reserved, start, len); + ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), reserved, start, len); if (ret < 0) btrfs_free_reserved_data_space_noquota(inode, start, len); else --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3149,7 +3149,7 @@ reserve_space: &cached_state); if (ret) goto out; - ret = btrfs_qgroup_reserve_data(inode, &data_reserved, + ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, alloc_start, bytes_to_reserve); if (ret) { unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, @@ -3322,8 +3322,9 @@ static long btrfs_fallocate(struct file free_extent_map(em); break; } - ret = btrfs_qgroup_reserve_data(inode, &data_reserved, - cur_offset, last_byte - cur_offset); + ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), + &data_reserved, cur_offset, + last_byte - cur_offset); if (ret < 0) { cur_offset = last_byte; free_extent_map(em); --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -3425,11 +3425,11 @@ btrfs_qgroup_rescan_resume(struct btrfs_ * same @reserved, caller must ensure when error happens it's OK * to free *ALL* reserved space. */ -int btrfs_qgroup_reserve_data(struct inode *inode, +int btrfs_qgroup_reserve_data(struct btrfs_inode *inode, struct extent_changeset **reserved_ret, u64 start, u64 len) { - struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_root *root = inode->root; struct ulist_node *unode; struct ulist_iterator uiter; struct extent_changeset *reserved; @@ -3452,12 +3452,12 @@ int btrfs_qgroup_reserve_data(struct ino reserved = *reserved_ret; /* Record already reserved space */ orig_reserved = reserved->bytes_changed; - ret = set_record_extent_bits(&BTRFS_I(inode)->io_tree, start, + ret = set_record_extent_bits(&inode->io_tree, start, start + len -1, EXTENT_QGROUP_RESERVED, reserved);
/* Newly reserved space */ to_reserve = reserved->bytes_changed - orig_reserved; - trace_btrfs_qgroup_reserve_data(inode, start, len, + trace_btrfs_qgroup_reserve_data(&inode->vfs_inode, start, len, to_reserve, QGROUP_RESERVE); if (ret < 0) goto cleanup; @@ -3471,7 +3471,7 @@ cleanup: /* cleanup *ALL* already reserved ranges */ ULIST_ITER_INIT(&uiter); while ((unode = ulist_next(&reserved->range_changed, &uiter))) - clear_extent_bit(&BTRFS_I(inode)->io_tree, unode->val, + clear_extent_bit(&inode->io_tree, unode->val, unode->aux, EXTENT_QGROUP_RESERVED, 0, 0, NULL); /* Also free data bytes of already reserved one */ btrfs_qgroup_free_refroot(root->fs_info, root->root_key.objectid, --- a/fs/btrfs/qgroup.h +++ b/fs/btrfs/qgroup.h @@ -344,7 +344,7 @@ int btrfs_verify_qgroup_counts(struct bt #endif
/* New io_tree based accurate qgroup reserve API */ -int btrfs_qgroup_reserve_data(struct inode *inode, +int btrfs_qgroup_reserve_data(struct btrfs_inode *inode, struct extent_changeset **reserved, u64 start, u64 len); int btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len); int btrfs_qgroup_free_data(struct inode *inode,
From: Qu Wenruo wqu@suse.com
commit 263da812e87bac4098a4778efaa32c54275641db upstream
[PROBLEM] Before this patch, when btrfs_qgroup_reserve_data() fails, we free all reserved space of the changeset.
For example:
	ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
	ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
If the last btrfs_qgroup_reserve_data() failed, it will release the entire [0, 3M) range.
This behavior is kind of OK for now, as when we hit -EDQUOT, we normally go down the error handling path and need to release all reserved ranges anyway.
But this also means the following call is not possible:
	ret = btrfs_qgroup_reserve_data();
	if (ret == -EDQUOT) {
		/* Do something to free some qgroup space */
		ret = btrfs_qgroup_reserve_data();
	}
Because if the first btrfs_qgroup_reserve_data() fails, it will free all of the reserved qgroup space.
[CAUSE] This is because we release all reserved ranges when btrfs_qgroup_reserve_data() fails.
[FIX] This patch will implement a new function, qgroup_unreserve_range(), to iterate through the ulist nodes, find any nodes in the failed range, remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and decrease extent_changeset::bytes_changed, so that we can revert to the previous state.
This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT happens.
Suggested-by: Josef Bacik josef@toxicpanda.com Reviewed-by: Josef Bacik josef@toxicpanda.com Signed-off-by: Qu Wenruo wqu@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/qgroup.c | 92 +++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 77 insertions(+), 15 deletions(-)
--- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -3411,6 +3411,73 @@ btrfs_qgroup_rescan_resume(struct btrfs_ } }
+#define rbtree_iterate_from_safe(node, next, start) \ + for (node = start; node && ({ next = rb_next(node); 1;}); node = next) + +static int qgroup_unreserve_range(struct btrfs_inode *inode, + struct extent_changeset *reserved, u64 start, + u64 len) +{ + struct rb_node *node; + struct rb_node *next; + struct ulist_node *entry = NULL; + int ret = 0; + + node = reserved->range_changed.root.rb_node; + while (node) { + entry = rb_entry(node, struct ulist_node, rb_node); + if (entry->val < start) + node = node->rb_right; + else if (entry) + node = node->rb_left; + else + break; + } + + /* Empty changeset */ + if (!entry) + return 0; + + if (entry->val > start && rb_prev(&entry->rb_node)) + entry = rb_entry(rb_prev(&entry->rb_node), struct ulist_node, + rb_node); + + rbtree_iterate_from_safe(node, next, &entry->rb_node) { + u64 entry_start; + u64 entry_end; + u64 entry_len; + int clear_ret; + + entry = rb_entry(node, struct ulist_node, rb_node); + entry_start = entry->val; + entry_end = entry->aux; + entry_len = entry_end - entry_start + 1; + + if (entry_start >= start + len) + break; + if (entry_start + entry_len <= start) + continue; + /* + * Now the entry is in [start, start + len), revert the + * EXTENT_QGROUP_RESERVED bit. + */ + clear_ret = clear_extent_bits(&inode->io_tree, entry_start, + entry_end, EXTENT_QGROUP_RESERVED); + if (!ret && clear_ret < 0) + ret = clear_ret; + + ulist_del(&reserved->range_changed, entry->val, entry->aux); + if (likely(reserved->bytes_changed >= entry_len)) { + reserved->bytes_changed -= entry_len; + } else { + WARN_ON(1); + reserved->bytes_changed = 0; + } + } + + return ret; +} + /* * Reserve qgroup space for range [start, start + len). * @@ -3421,18 +3488,14 @@ btrfs_qgroup_rescan_resume(struct btrfs_ * Return <0 for error (including -EQUOT) * * NOTE: this function may sleep for memory allocation. - * if btrfs_qgroup_reserve_data() is called multiple times with - * same @reserved, caller must ensure when error happens it's OK - * to free *ALL* reserved space. */ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode, struct extent_changeset **reserved_ret, u64 start, u64 len) { struct btrfs_root *root = inode->root; - struct ulist_node *unode; - struct ulist_iterator uiter; struct extent_changeset *reserved; + bool new_reserved = false; u64 orig_reserved; u64 to_reserve; int ret; @@ -3445,6 +3508,7 @@ int btrfs_qgroup_reserve_data(struct btr if (WARN_ON(!reserved_ret)) return -EINVAL; if (!*reserved_ret) { + new_reserved = true; *reserved_ret = extent_changeset_alloc(); if (!*reserved_ret) return -ENOMEM; @@ -3460,7 +3524,7 @@ int btrfs_qgroup_reserve_data(struct btr trace_btrfs_qgroup_reserve_data(&inode->vfs_inode, start, len, to_reserve, QGROUP_RESERVE); if (ret < 0) - goto cleanup; + goto out; ret = qgroup_reserve(root, to_reserve, true, BTRFS_QGROUP_RSV_DATA); if (ret < 0) goto cleanup; @@ -3468,15 +3532,13 @@ int btrfs_qgroup_reserve_data(struct btr return ret;
cleanup: - /* cleanup *ALL* already reserved ranges */ - ULIST_ITER_INIT(&uiter); - while ((unode = ulist_next(&reserved->range_changed, &uiter))) - clear_extent_bit(&inode->io_tree, unode->val, - unode->aux, EXTENT_QGROUP_RESERVED, 0, 0, NULL); - /* Also free data bytes of already reserved one */ - btrfs_qgroup_free_refroot(root->fs_info, root->root_key.objectid, - orig_reserved, BTRFS_QGROUP_RSV_DATA); - extent_changeset_release(reserved); + qgroup_unreserve_range(inode, reserved, start, len); +out: + if (new_reserved) { + extent_changeset_release(reserved); + kfree(reserved); + *reserved_ret = NULL; + } return ret; }
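A simplified userspace model of qgroup_unreserve_range() (plain arrays replace the rb-tree/ulist used by btrfs): only ranges overlapping the failed reservation are reverted, leaving earlier successful reservations intact.

#include <stdio.h>

#define SZ_1M (1024 * 1024ULL)

struct range { unsigned long long start, len; int live; };

static struct range changed[] = {
	{ 0,         SZ_1M, 1 },
	{ SZ_1M,     SZ_1M, 1 },
	{ 2 * SZ_1M, SZ_1M, 1 },	/* this reservation failed */
};
static unsigned long long bytes_changed = 3 * SZ_1M;

/* Drop recorded ranges overlapping [start, start + len) and give back
 * their bytes, mirroring the ulist_del + bytes_changed adjustment. */
static void unreserve_range(unsigned long long start, unsigned long long len)
{
	for (size_t i = 0; i < sizeof(changed) / sizeof(*changed); i++) {
		struct range *r = &changed[i];

		if (!r->live || r->start >= start + len ||
		    r->start + r->len <= start)
			continue;
		r->live = 0;
		bytes_changed -= r->len;	/* revert only this range */
	}
}

int main(void)
{
	unreserve_range(2 * SZ_1M, SZ_1M);
	printf("bytes still reserved: %lluM\n", bytes_changed / SZ_1M); /* 2 */
	return 0;
}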
From: Qu Wenruo wqu@suse.com
commit c53e9653605dbf708f5be02902de51831be4b009 upstream
[PROBLEM] There are known problems related to how btrfs handles qgroup reserved space. One of the most obvious cases is the test case btrfs/153, which does fallocate, then writes into the preallocated range.
# btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
#    --- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
#    +++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
#    @@ -1,2 +1,5 @@
#     QA output created by 153
#    +pwrite: Disk quota exceeded
#    +/mnt/scratch/testfile2: Disk quota exceeded
#    +/mnt/scratch/testfile2: Disk quota exceeded
#     Silence is golden
#    ...
#    (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
[CAUSE] Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"), we always reserve space regardless of whether the write is COW or not.
That behavior change was made mostly for performance, and reverting it is not a good idea anyway.
For a preallocated extent, qgroup data space has already been reserved, and since we also reserve data space at buffered write time, writing into the preallocated range requires twice the space.
This leads to -EDQUOT in the buffered write routine, even though only one copy of the data will ever be written.
And we can't follow the same solution: unlike the data/metadata space check, qgroup reserved space is shared between data and metadata, and the -EDQUOT can happen at the metadata reservation, so doing a NODATACOW check after a qgroup reservation failure is not a solution.
[FIX] To solve the problem, we don't return -EDQUOT directly; instead, every time we get -EDQUOT, we try to flush qgroup space:
- Flush all inodes of the root. NODATACOW writes will free their qgroup reserved space at run_delalloc_range(). However, we don't have the infrastructure to flush only NODATACOW inodes, so here we flush all inodes anyway.
- Wait for ordered extents. This converts the preallocated metadata space into per-trans metadata, which can be freed by a later transaction commit.
- Commit the transaction. This frees all per-trans metadata space.
Also, we don't want to trigger the flush multiple times, so we introduce a per-root wait list and a new root status to ensure that only one thread starts the flushing.
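For reference, the mechanism boils down to the following condensed C sketch, re-flowed from the diff below (illustrative only; the applied diff is authoritative):

/*
 * Only one thread per root performs the flush; any other thread that hits
 * -EDQUOT meanwhile just waits for that flush to finish and retries.
 */
static int try_flush_qgroup(struct btrfs_root *root)
{
	struct btrfs_trans_handle *trans;
	int ret;

	if (test_and_set_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state)) {
		wait_event(root->qgroup_flush_wait,
			   !test_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state));
		return 0;
	}

	/* Flush all inodes of the root, then wait for ordered extents. */
	ret = btrfs_start_delalloc_snapshot(root);
	if (ret < 0)
		goto out;
	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);

	/* Commit the transaction to free per-trans metadata space. */
	trans = btrfs_join_transaction(root);
	if (IS_ERR(trans)) {
		ret = PTR_ERR(trans);
		goto out;
	}
	ret = btrfs_commit_transaction(trans);
out:
	clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state);
	wake_up(&root->qgroup_flush_wait);
	return ret;
}

/* Reservations now retry exactly once after flushing. */
int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
			      struct extent_changeset **reserved_ret,
			      u64 start, u64 len)
{
	int ret;

	ret = qgroup_reserve_data(inode, reserved_ret, start, len);
	if (ret <= 0 && ret != -EDQUOT)
		return ret;

	ret = try_flush_qgroup(inode->root);
	if (ret < 0)
		return ret;
	return qgroup_reserve_data(inode, reserved_ret, start, len);
}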
Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to") Reviewed-by: Josef Bacik josef@toxicpanda.com Signed-off-by: Qu Wenruo wqu@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/ctree.h | 3 + fs/btrfs/disk-io.c | 1 fs/btrfs/qgroup.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++----- 3 files changed, 96 insertions(+), 8 deletions(-)
--- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -945,6 +945,8 @@ enum { BTRFS_ROOT_DEAD_TREE, /* The root has a log tree. Used only for subvolume roots. */ BTRFS_ROOT_HAS_LOG_TREE, + /* Qgroup flushing is in progress */ + BTRFS_ROOT_QGROUP_FLUSHING, };
/* @@ -1097,6 +1099,7 @@ struct btrfs_root { spinlock_t qgroup_meta_rsv_lock; u64 qgroup_meta_rsv_pertrans; u64 qgroup_meta_rsv_prealloc; + wait_queue_head_t qgroup_flush_wait;
/* Number of active swapfiles */ atomic_t nr_swapfiles; --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1154,6 +1154,7 @@ static void __setup_root(struct btrfs_ro mutex_init(&root->log_mutex); mutex_init(&root->ordered_extent_mutex); mutex_init(&root->delalloc_mutex); + init_waitqueue_head(&root->qgroup_flush_wait); init_waitqueue_head(&root->log_writer_wait); init_waitqueue_head(&root->log_commit_wait[0]); init_waitqueue_head(&root->log_commit_wait[1]); --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -3479,17 +3479,58 @@ static int qgroup_unreserve_range(struct }
/* - * Reserve qgroup space for range [start, start + len). + * Try to free some space for qgroup. * - * This function will either reserve space from related qgroups or doing - * nothing if the range is already reserved. + * For qgroup, there are only 3 ways to free qgroup space: + * - Flush nodatacow write + * Any nodatacow write will free its reserved data space at run_delalloc_range(). + * In theory, we should only flush nodatacow inodes, but it's not yet + * possible, so we need to flush the whole root. * - * Return 0 for successful reserve - * Return <0 for error (including -EQUOT) + * - Wait for ordered extents + * When ordered extents are finished, their reserved metadata is finally + * converted to per_trans status, which can be freed by later commit + * transaction. * - * NOTE: this function may sleep for memory allocation. + * - Commit transaction + * This would free the meta_per_trans space. + * In theory this shouldn't provide much space, but any more qgroup space + * is needed. */ -int btrfs_qgroup_reserve_data(struct btrfs_inode *inode, +static int try_flush_qgroup(struct btrfs_root *root) +{ + struct btrfs_trans_handle *trans; + int ret; + + /* + * We don't want to run flush again and again, so if there is a running + * one, we won't try to start a new flush, but exit directly. + */ + if (test_and_set_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state)) { + wait_event(root->qgroup_flush_wait, + !test_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state)); + return 0; + } + + ret = btrfs_start_delalloc_snapshot(root); + if (ret < 0) + goto out; + btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1); + + trans = btrfs_join_transaction(root); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + goto out; + } + + ret = btrfs_commit_transaction(trans); +out: + clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state); + wake_up(&root->qgroup_flush_wait); + return ret; +} + +static int qgroup_reserve_data(struct btrfs_inode *inode, struct extent_changeset **reserved_ret, u64 start, u64 len) { @@ -3542,6 +3583,34 @@ out: return ret; }
+/* + * Reserve qgroup space for range [start, start + len). + * + * This function will either reserve space from related qgroups or do nothing + * if the range is already reserved. + * + * Return 0 for successful reservation + * Return <0 for error (including -EQUOT) + * + * NOTE: This function may sleep for memory allocation, dirty page flushing and + * commit transaction. So caller should not hold any dirty page locked. + */ +int btrfs_qgroup_reserve_data(struct btrfs_inode *inode, + struct extent_changeset **reserved_ret, u64 start, + u64 len) +{ + int ret; + + ret = qgroup_reserve_data(inode, reserved_ret, start, len); + if (ret <= 0 && ret != -EDQUOT) + return ret; + + ret = try_flush_qgroup(inode->root); + if (ret < 0) + return ret; + return qgroup_reserve_data(inode, reserved_ret, start, len); +} + /* Free ranges specified by @reserved, normally in error path */ static int qgroup_free_reserved_data(struct btrfs_inode *inode, struct extent_changeset *reserved, u64 start, u64 len) @@ -3712,7 +3781,7 @@ static int sub_root_meta_rsv(struct btrf return num_bytes; }
-int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes, +static int qgroup_reserve_meta(struct btrfs_root *root, int num_bytes, enum btrfs_qgroup_rsv_type type, bool enforce) { struct btrfs_fs_info *fs_info = root->fs_info; @@ -3739,6 +3808,21 @@ int __btrfs_qgroup_reserve_meta(struct b return ret; }
+int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes, + enum btrfs_qgroup_rsv_type type, bool enforce) +{ + int ret; + + ret = qgroup_reserve_meta(root, num_bytes, type, enforce); + if (ret <= 0 && ret != -EDQUOT) + return ret; + + ret = try_flush_qgroup(root); + if (ret < 0) + return ret; + return qgroup_reserve_meta(root, num_bytes, type, enforce); +} + void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root) { struct btrfs_fs_info *fs_info = root->fs_info;
From: Qu Wenruo wqu@suse.com
commit 3296bf562443a8ca35aaad959a76a49e9b412760 upstream
The state was introduced in commit 4a9d8bdee368 ("Btrfs: make the state of the transaction more readable"), then in commit 302167c50b32 ("btrfs: don't end the transaction for delayed refs in throttle") the last place that set the state was removed.
So we can just clean up the state, since it is only compared against but never set.
Signed-off-by: Qu Wenruo wqu@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/disk-io.c | 2 +- fs/btrfs/transaction.c | 15 +++------------ fs/btrfs/transaction.h | 1 - 3 files changed, 4 insertions(+), 14 deletions(-)
--- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1748,7 +1748,7 @@ static int transaction_kthread(void *arg }
now = ktime_get_seconds(); - if (cur->state < TRANS_STATE_BLOCKED && + if (cur->state < TRANS_STATE_COMMIT_START && !test_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags) && (now < cur->start_time || now - cur->start_time < fs_info->commit_interval)) { --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -27,7 +27,6 @@
static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = { [TRANS_STATE_RUNNING] = 0U, - [TRANS_STATE_BLOCKED] = __TRANS_START, [TRANS_STATE_COMMIT_START] = (__TRANS_START | __TRANS_ATTACH), [TRANS_STATE_COMMIT_DOING] = (__TRANS_START | __TRANS_ATTACH | @@ -388,7 +387,7 @@ int btrfs_record_root_in_trans(struct bt
static inline int is_transaction_blocked(struct btrfs_transaction *trans) { - return (trans->state >= TRANS_STATE_BLOCKED && + return (trans->state >= TRANS_STATE_COMMIT_START && trans->state < TRANS_STATE_UNBLOCKED && !TRANS_ABORTED(trans)); } @@ -580,7 +579,7 @@ again: INIT_LIST_HEAD(&h->new_bgs);
smp_mb(); - if (cur_trans->state >= TRANS_STATE_BLOCKED && + if (cur_trans->state >= TRANS_STATE_COMMIT_START && may_wait_transaction(fs_info, type)) { current->journal_info = h; btrfs_commit_transaction(h); @@ -797,7 +796,7 @@ int btrfs_should_end_transaction(struct struct btrfs_transaction *cur_trans = trans->transaction;
smp_mb(); - if (cur_trans->state >= TRANS_STATE_BLOCKED || + if (cur_trans->state >= TRANS_STATE_COMMIT_START || cur_trans->delayed_refs.flushing) return 1;
@@ -830,7 +829,6 @@ static int __btrfs_end_transaction(struc { struct btrfs_fs_info *info = trans->fs_info; struct btrfs_transaction *cur_trans = trans->transaction; - int lock = (trans->type != TRANS_JOIN_NOLOCK); int err = 0;
if (refcount_read(&trans->use_count) > 1) { @@ -846,13 +844,6 @@ static int __btrfs_end_transaction(struc
btrfs_trans_release_chunk_metadata(trans);
- if (lock && READ_ONCE(cur_trans->state) == TRANS_STATE_BLOCKED) { - if (throttle) - return btrfs_commit_transaction(trans); - else - wake_up_process(info->transaction_kthread); - } - if (trans->type & __TRANS_FREEZABLE) sb_end_intwrite(info->sb);
--- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -13,7 +13,6 @@
enum btrfs_trans_state { TRANS_STATE_RUNNING, - TRANS_STATE_BLOCKED, TRANS_STATE_COMMIT_START, TRANS_STATE_COMMIT_DOING, TRANS_STATE_UNBLOCKED,
From: Qu Wenruo wqu@suse.com
commit adca4d945c8dca28a85df45c5b117e6dac2e77f1 upstream
commit a514d63882c3 ("btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT") tries to reduce early EDQUOT problems by checking the free qgroup space against a threshold and waking up the commit kthread to free some space.
The problem with that mechanism is that it can only free qgroup per-trans metadata space; it can't do anything for data or for prealloc qgroup space.
Now that we have the ability to flush qgroup space and have implemented retry-after-EDQUOT behavior, that mechanism can be completely replaced.
So this patch cleans it up in favor of retry-after-EDQUOT.
Reviewed-by: Josef Bacik josef@toxicpanda.com Signed-off-by: Qu Wenruo wqu@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/ctree.h | 5 ----- fs/btrfs/disk-io.c | 1 - fs/btrfs/qgroup.c | 43 ++----------------------------------------- fs/btrfs/transaction.c | 1 - fs/btrfs/transaction.h | 14 -------------- 5 files changed, 2 insertions(+), 62 deletions(-)
--- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -505,11 +505,6 @@ enum { */ BTRFS_FS_EXCL_OP, /* - * To info transaction_kthread we need an immediate commit so it - * doesn't need to wait for commit_interval - */ - BTRFS_FS_NEED_ASYNC_COMMIT, - /* * Indicate that balance has been set up from the ioctl and is in the * main phase. The fs_info::balance_ctl is initialized. * Set and cleared while holding fs_info::balance_mutex. --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1749,7 +1749,6 @@ static int transaction_kthread(void *arg
now = ktime_get_seconds(); if (cur->state < TRANS_STATE_COMMIT_START && - !test_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags) && (now < cur->start_time || now - cur->start_time < fs_info->commit_interval)) { spin_unlock(&fs_info->trans_lock); --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -11,7 +11,6 @@ #include <linux/slab.h> #include <linux/workqueue.h> #include <linux/btrfs.h> -#include <linux/sizes.h>
#include "ctree.h" #include "transaction.h" @@ -2840,20 +2839,8 @@ out: return ret; }
-/* - * Two limits to commit transaction in advance. - * - * For RATIO, it will be 1/RATIO of the remaining limit as threshold. - * For SIZE, it will be in byte unit as threshold. - */ -#define QGROUP_FREE_RATIO 32 -#define QGROUP_FREE_SIZE SZ_32M -static bool qgroup_check_limits(struct btrfs_fs_info *fs_info, - const struct btrfs_qgroup *qg, u64 num_bytes) +static bool qgroup_check_limits(const struct btrfs_qgroup *qg, u64 num_bytes) { - u64 free; - u64 threshold; - if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) && qgroup_rsv_total(qg) + (s64)qg->rfer + num_bytes > qg->max_rfer) return false; @@ -2862,32 +2849,6 @@ static bool qgroup_check_limits(struct b qgroup_rsv_total(qg) + (s64)qg->excl + num_bytes > qg->max_excl) return false;
- /* - * Even if we passed the check, it's better to check if reservation - * for meta_pertrans is pushing us near limit. - * If there is too much pertrans reservation or it's near the limit, - * let's try commit transaction to free some, using transaction_kthread - */ - if ((qg->lim_flags & (BTRFS_QGROUP_LIMIT_MAX_RFER | - BTRFS_QGROUP_LIMIT_MAX_EXCL))) { - if (qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) { - free = qg->max_excl - qgroup_rsv_total(qg) - qg->excl; - threshold = min_t(u64, qg->max_excl / QGROUP_FREE_RATIO, - QGROUP_FREE_SIZE); - } else { - free = qg->max_rfer - qgroup_rsv_total(qg) - qg->rfer; - threshold = min_t(u64, qg->max_rfer / QGROUP_FREE_RATIO, - QGROUP_FREE_SIZE); - } - - /* - * Use transaction_kthread to commit transaction, so we no - * longer need to bother nested transaction nor lock context. - */ - if (free < threshold) - btrfs_commit_transaction_locksafe(fs_info); - } - return true; }
@@ -2937,7 +2898,7 @@ static int qgroup_reserve(struct btrfs_r
qg = unode_aux_to_qgroup(unode);
- if (enforce && !qgroup_check_limits(fs_info, qg, num_bytes)) { + if (enforce && !qgroup_check_limits(qg, num_bytes)) { ret = -EDQUOT; goto out; } --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -2297,7 +2297,6 @@ int btrfs_commit_transaction(struct btrf */ cur_trans->state = TRANS_STATE_COMPLETED; wake_up(&cur_trans->commit_wait); - clear_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags);
spin_lock(&fs_info->trans_lock); list_del_init(&cur_trans->list); --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -207,20 +207,6 @@ int btrfs_clean_one_deleted_snapshot(str int btrfs_commit_transaction(struct btrfs_trans_handle *trans); int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, int wait_for_unblock); - -/* - * Try to commit transaction asynchronously, so this is safe to call - * even holding a spinlock. - * - * It's done by informing transaction_kthread to commit transaction without - * waiting for commit interval. - */ -static inline void btrfs_commit_transaction_locksafe( - struct btrfs_fs_info *fs_info) -{ - set_bit(BTRFS_FS_NEED_ASYNC_COMMIT, &fs_info->flags); - wake_up_process(fs_info->transaction_kthread); -} int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans); int btrfs_should_end_transaction(struct btrfs_trans_handle *trans); void btrfs_throttle(struct btrfs_fs_info *fs_info);
From: Filipe Manana fdmanana@suse.com
commit a855fbe69229078cd8aecd8974fb996a5ca651e6 upstream
When running test case btrfs/017 from fstests, lockdep reported the following splat:
[ 1297.067385] ======================================================
[ 1297.067708] WARNING: possible circular locking dependency detected
[ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
[ 1297.068322] ------------------------------------------------------
[ 1297.068629] btrfs/189080 is trying to acquire lock:
[ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.069274] but task is already holding lock:
[ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
[ 1297.070219] which lock already depends on the new lock.
[ 1297.071131] the existing dependency chain (in reverse order) is:
[ 1297.071721] -> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
[ 1297.072375]        lock_acquire+0xd8/0x490
[ 1297.072710]        __mutex_lock+0xa3/0xb30
[ 1297.073061]        btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
[ 1297.073421]        create_subvol+0x194/0x990 [btrfs]
[ 1297.073780]        btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
[ 1297.074133]        __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
[ 1297.074498]        btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
[ 1297.074872]        btrfs_ioctl+0x1a90/0x36f0 [btrfs]
[ 1297.075245]        __x64_sys_ioctl+0x83/0xb0
[ 1297.075617]        do_syscall_64+0x33/0x80
[ 1297.075993]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.076380] -> #0 (sb_internal#2){.+.+}-{0:0}:
[ 1297.077166]        check_prev_add+0x91/0xc60
[ 1297.077572]        __lock_acquire+0x1740/0x3110
[ 1297.077984]        lock_acquire+0xd8/0x490
[ 1297.078411]        start_transaction+0x3c5/0x760 [btrfs]
[ 1297.078853]        btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.079323]        btrfs_ioctl+0x2c60/0x36f0 [btrfs]
[ 1297.079789]        __x64_sys_ioctl+0x83/0xb0
[ 1297.080232]        do_syscall_64+0x33/0x80
[ 1297.080680]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.081139] other info that might help us debug this:
[ 1297.082536] Possible unsafe locking scenario:
[ 1297.083510]        CPU0                    CPU1
[ 1297.084005]        ----                    ----
[ 1297.084500]   lock(&fs_info->qgroup_ioctl_lock);
[ 1297.084994]                                lock(sb_internal#2);
[ 1297.085485]                                lock(&fs_info->qgroup_ioctl_lock);
[ 1297.085974]   lock(sb_internal#2);
[ 1297.086454] *** DEADLOCK ***
[ 1297.087880] 3 locks held by btrfs/189080:
[ 1297.088324] #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
[ 1297.088799] #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
[ 1297.089284] #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
[ 1297.089771] stack backtrace:
[ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
[ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1297.092123] Call Trace:
[ 1297.092629]  dump_stack+0x8d/0xb5
[ 1297.093115]  check_noncircular+0xff/0x110
[ 1297.093596]  check_prev_add+0x91/0xc60
[ 1297.094076]  ? kvm_clock_read+0x14/0x30
[ 1297.094553]  ? kvm_sched_clock_read+0x5/0x10
[ 1297.095029]  __lock_acquire+0x1740/0x3110
[ 1297.095510]  lock_acquire+0xd8/0x490
[ 1297.095993]  ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.096476]  start_transaction+0x3c5/0x760 [btrfs]
[ 1297.096962]  ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.097451]  btrfs_quota_enable+0xaf/0xa70 [btrfs]
[ 1297.097941]  ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
[ 1297.098429]  btrfs_ioctl+0x2c60/0x36f0 [btrfs]
[ 1297.098904]  ? do_user_addr_fault+0x20c/0x430
[ 1297.099382]  ? kvm_clock_read+0x14/0x30
[ 1297.099854]  ? kvm_sched_clock_read+0x5/0x10
[ 1297.100328]  ? sched_clock+0x5/0x10
[ 1297.100801]  ? sched_clock_cpu+0x12/0x180
[ 1297.101272]  ? __x64_sys_ioctl+0x83/0xb0
[ 1297.101739]  __x64_sys_ioctl+0x83/0xb0
[ 1297.102207]  do_syscall_64+0x33/0x80
[ 1297.102673]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1297.103148] RIP: 0033:0x7f773ff65d87
This is because during the quota enable ioctl we first lock the qgroup_ioctl_lock mutex and then start a transaction, and starting a transaction acquires a fs freeze semaphore (at the VFS level). However, in every other code path, except for the quota disable ioctl path, we do the opposite: we start a transaction and then lock the mutex.
So fix this by making the quota enable and disable paths start the transaction without the mutex held, and then, after starting the transaction, lock the mutex and check whether some other task has already enabled or disabled the quotas, bailing out with success if that was the case.
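In pattern form, the reordering on the enable path looks like this (a minimal sketch; the disable path is analogous and the diff below is authoritative):

	mutex_lock(&fs_info->qgroup_ioctl_lock);
	if (fs_info->quota_root)
		goto out;			/* already enabled */
	mutex_unlock(&fs_info->qgroup_ioctl_lock);

	/* sb_internal (fs freeze protection) is taken with the mutex dropped */
	trans = btrfs_start_transaction(tree_root, 2);

	mutex_lock(&fs_info->qgroup_ioctl_lock);
	if (IS_ERR(trans)) {
		ret = PTR_ERR(trans);
		trans = NULL;
		goto out;
	}
	if (fs_info->quota_root)
		goto out;	/* another task enabled quotas meanwhile */
	/* ... create the quota tree under the mutex ... */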
Signed-off-by: Filipe Manana fdmanana@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Anand Jain anand.jain@oracle.com
Conflicts: fs/btrfs/qgroup.c Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/ctree.h | 5 +++- fs/btrfs/qgroup.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++-------- 2 files changed, 52 insertions(+), 9 deletions(-)
--- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -827,7 +827,10 @@ struct btrfs_fs_info { */ struct ulist *qgroup_ulist;
- /* protect user change for quota operations */ + /* + * Protect user change for quota operations. If a transaction is needed, + * it must be started before locking this lock. + */ struct mutex qgroup_ioctl_lock;
/* list of dirty qgroups to be written at next commit */ --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -886,6 +886,7 @@ int btrfs_quota_enable(struct btrfs_fs_i struct btrfs_key found_key; struct btrfs_qgroup *qgroup = NULL; struct btrfs_trans_handle *trans = NULL; + struct ulist *ulist = NULL; int ret = 0; int slot;
@@ -893,13 +894,28 @@ int btrfs_quota_enable(struct btrfs_fs_i if (fs_info->quota_root) goto out;
- fs_info->qgroup_ulist = ulist_alloc(GFP_KERNEL); - if (!fs_info->qgroup_ulist) { + ulist = ulist_alloc(GFP_KERNEL); + if (!ulist) { ret = -ENOMEM; goto out; }
/* + * Unlock qgroup_ioctl_lock before starting the transaction. This is to + * avoid lock acquisition inversion problems (reported by lockdep) between + * qgroup_ioctl_lock and the vfs freeze semaphores, acquired when we + * start a transaction. + * After we started the transaction lock qgroup_ioctl_lock again and + * check if someone else created the quota root in the meanwhile. If so, + * just return success and release the transaction handle. + * + * Also we don't need to worry about someone else calling + * btrfs_sysfs_add_qgroups() after we unlock and getting an error because + * that function returns 0 (success) when the sysfs entries already exist. + */ + mutex_unlock(&fs_info->qgroup_ioctl_lock); + + /* * 1 for quota root item * 1 for BTRFS_QGROUP_STATUS item * @@ -908,12 +924,20 @@ int btrfs_quota_enable(struct btrfs_fs_i * would be a lot of overkill. */ trans = btrfs_start_transaction(tree_root, 2); + + mutex_lock(&fs_info->qgroup_ioctl_lock); if (IS_ERR(trans)) { ret = PTR_ERR(trans); trans = NULL; goto out; }
+ if (fs_info->quota_root) + goto out; + + fs_info->qgroup_ulist = ulist; + ulist = NULL; + /* * initially create the quota tree */ @@ -1046,10 +1070,13 @@ out: if (ret) { ulist_free(fs_info->qgroup_ulist); fs_info->qgroup_ulist = NULL; - if (trans) - btrfs_end_transaction(trans); } mutex_unlock(&fs_info->qgroup_ioctl_lock); + if (ret && trans) + btrfs_end_transaction(trans); + else if (trans) + ret = btrfs_end_transaction(trans); + ulist_free(ulist); return ret; }
@@ -1062,19 +1089,29 @@ int btrfs_quota_disable(struct btrfs_fs_ mutex_lock(&fs_info->qgroup_ioctl_lock); if (!fs_info->quota_root) goto out; + mutex_unlock(&fs_info->qgroup_ioctl_lock);
/* * 1 For the root item * * We should also reserve enough items for the quota tree deletion in * btrfs_clean_quota_tree but this is not done. + * + * Also, we must always start a transaction without holding the mutex + * qgroup_ioctl_lock, see btrfs_quota_enable(). */ trans = btrfs_start_transaction(fs_info->tree_root, 1); + + mutex_lock(&fs_info->qgroup_ioctl_lock); if (IS_ERR(trans)) { ret = PTR_ERR(trans); + trans = NULL; goto out; }
+ if (!fs_info->quota_root) + goto out; + clear_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags); btrfs_qgroup_wait_for_completion(fs_info, false); spin_lock(&fs_info->qgroup_lock); @@ -1088,13 +1125,13 @@ int btrfs_quota_disable(struct btrfs_fs_ ret = btrfs_clean_quota_tree(trans, quota_root); if (ret) { btrfs_abort_transaction(trans, ret); - goto end_trans; + goto out; }
ret = btrfs_del_root(trans, "a_root->root_key); if (ret) { btrfs_abort_transaction(trans, ret); - goto end_trans; + goto out; }
list_del("a_root->dirty_list); @@ -1108,10 +1145,13 @@ int btrfs_quota_disable(struct btrfs_fs_ free_extent_buffer(quota_root->commit_root); kfree(quota_root);
-end_trans: - ret = btrfs_end_transaction(trans); out: mutex_unlock(&fs_info->qgroup_ioctl_lock); + if (ret && trans) + btrfs_end_transaction(trans); + else if (trans) + ret = btrfs_end_transaction(trans); + return ret; }
From: YueHaibing yuehaibing@huawei.com
commit d0d62baa7f505bd4c59cd169692ff07ec49dde37 upstream.
Printing kernel pointers is discouraged because they might leak the kernel memory layout. This fixes the following smatch warning:
drivers/net/ethernet/xilinx/xilinx_emaclite.c:1191 xemaclite_of_probe() warn: argument 4 to %08lX specifier is cast from pointer
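For reference, the discouraged and preferred forms look like this (a sketch; in kernels since v4.15 printk's %p prints a hashed value rather than the raw pointer, which is why it does not leak the layout):

	/* Discouraged: the cast prints the real kernel virtual address */
	dev_info(dev, "mapped to 0x%08X\n",
		 (unsigned int __force)lp->base_addr);

	/* Preferred: %p output is hashed before it reaches the log */
	dev_info(dev, "mapped to 0x%p\n", lp->base_addr);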
Signed-off-by: YueHaibing yuehaibing@huawei.com Signed-off-by: David S. Miller davem@davemloft.net Signed-off-by: Pavel Machek (CIP) pavel@denx.de Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- drivers/net/ethernet/xilinx/xilinx_emaclite.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
--- a/drivers/net/ethernet/xilinx/xilinx_emaclite.c +++ b/drivers/net/ethernet/xilinx/xilinx_emaclite.c @@ -1191,9 +1191,8 @@ static int xemaclite_of_probe(struct pla }
dev_info(dev, - "Xilinx EmacLite at 0x%08X mapped to 0x%08X, irq=%d\n", - (unsigned int __force)ndev->mem_start, - (unsigned int __force)lp->base_addr, ndev->irq); + "Xilinx EmacLite at 0x%08X mapped to 0x%p, irq=%d\n", + (unsigned int __force)ndev->mem_start, lp->base_addr, ndev->irq); return 0;
error:
From: Qu Wenruo wqu@suse.com
commit 6f23277a49e68f8a9355385c846939ad0b1261e7 upstream
[BUG] When running the following script, btrfs will trigger an ASSERT():
#!/bin/bash
mkfs.btrfs -f $dev
mount $dev $mnt
xfs_io -f -c "pwrite 0 1G" $mnt/file
sync
btrfs quota enable $mnt
btrfs quota rescan -w $mnt
# Manually set the limit below current usage
btrfs qgroup limit 512M $mnt $mnt

# Crash happens
touch $mnt/file
The dmesg looks like this:
assertion failed: refcount_read(&trans->use_count) == 1, in fs/btrfs/transaction.c:2022
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3230!
invalid opcode: 0000 [#1] SMP PTI
RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
 btrfs_commit_transaction.cold+0x11/0x5d [btrfs]
 try_flush_qgroup+0x67/0x100 [btrfs]
 __btrfs_qgroup_reserve_meta+0x3a/0x60 [btrfs]
 btrfs_delayed_update_inode+0xaa/0x350 [btrfs]
 btrfs_update_inode+0x9d/0x110 [btrfs]
 btrfs_dirty_inode+0x5d/0xd0 [btrfs]
 touch_atime+0xb5/0x100
 iterate_dir+0xf1/0x1b0
 __x64_sys_getdents64+0x78/0x110
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb5afe588db
[CAUSE] In try_flush_qgroup(), we assume we don't hold a transaction handle at all. This is true for data reservation and mostly true for metadata, since data space reservation always happens before we start a transaction, and for most metadata operations we reserve space in start_transaction().
But there is an exception: btrfs_delayed_inode_reserve_metadata(). It holds a transaction handle while still trying to reserve extra metadata space.
When we hit EDQUOT inside btrfs_delayed_inode_reserve_metadata(), we join the current transaction and commit it, while the transaction handle held when we entered the qgroup code is still outstanding.
[FIX] Let's check current->journal_info before we join the transaction.
If current->journal_info is unset or is BTRFS_SEND_TRANS_STUB, it means we are not holding a transaction, and thus we are able to join and then commit the transaction.
If current->journal_info is a valid transaction handle, we avoid committing the transaction and just end it instead.
This is less effective than committing the current transaction, as it won't free reserved metadata space, but we may still free some data space before new data writes.
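The resulting check is small; as a sketch of the logic described above (the applied hunk follows):

	bool can_commit = true;

	/*
	 * A real transaction handle in current->journal_info means the
	 * caller is already inside a transaction, so committing here would
	 * trip the trans->use_count assertion; only end the handle instead.
	 */
	if (current->journal_info &&
	    current->journal_info != BTRFS_SEND_TRANS_STUB)
		can_commit = false;

	...

	if (can_commit)
		ret = btrfs_commit_transaction(trans);
	else
		ret = btrfs_end_transaction(trans);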
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1178634 Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") Reviewed-by: Filipe Manana fdmanana@suse.com Signed-off-by: Qu Wenruo wqu@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/qgroup.c | 20 +++++++++++++++++++- 1 file changed, 19 insertions(+), 1 deletion(-)
--- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -3502,6 +3502,7 @@ static int try_flush_qgroup(struct btrfs { struct btrfs_trans_handle *trans; int ret; + bool can_commit = true;
/* * We don't want to run flush again and again, so if there is a running @@ -3513,6 +3514,20 @@ static int try_flush_qgroup(struct btrfs return 0; }
+ /* + * If current process holds a transaction, we shouldn't flush, as we + * assume all space reservation happens before a transaction handle is + * held. + * + * But there are cases like btrfs_delayed_item_reserve_metadata() where + * we try to reserve space with one transction handle already held. + * In that case we can't commit transaction, but at least try to end it + * and hope the started data writes can free some space. + */ + if (current->journal_info && + current->journal_info != BTRFS_SEND_TRANS_STUB) + can_commit = false; + ret = btrfs_start_delalloc_snapshot(root); if (ret < 0) goto out; @@ -3524,7 +3539,10 @@ static int try_flush_qgroup(struct btrfs goto out; }
- ret = btrfs_commit_transaction(trans); + if (can_commit) + ret = btrfs_commit_transaction(trans); + else + ret = btrfs_end_transaction(trans); out: clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state); wake_up(&root->qgroup_flush_wait);
From: Nikolay Borisov nborisov@suse.com
commit 80e9baed722c853056e0c5374f51524593cb1031 upstream
Signed-off-by: Nikolay Borisov nborisov@suse.com Reviewed-by: David Sterba dsterba@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/qgroup.c | 8 ++++---- fs/btrfs/qgroup.h | 3 ++- 2 files changed, 6 insertions(+), 5 deletions(-)
--- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -3800,8 +3800,8 @@ static int sub_root_meta_rsv(struct btrf return num_bytes; }
-static int qgroup_reserve_meta(struct btrfs_root *root, int num_bytes, - enum btrfs_qgroup_rsv_type type, bool enforce) +int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes, + enum btrfs_qgroup_rsv_type type, bool enforce) { struct btrfs_fs_info *fs_info = root->fs_info; int ret; @@ -3832,14 +3832,14 @@ int __btrfs_qgroup_reserve_meta(struct b { int ret;
- ret = qgroup_reserve_meta(root, num_bytes, type, enforce); + ret = btrfs_qgroup_reserve_meta(root, num_bytes, type, enforce); if (ret <= 0 && ret != -EDQUOT) return ret;
ret = try_flush_qgroup(root); if (ret < 0) return ret; - return qgroup_reserve_meta(root, num_bytes, type, enforce); + return btrfs_qgroup_reserve_meta(root, num_bytes, type, enforce); }
void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root) --- a/fs/btrfs/qgroup.h +++ b/fs/btrfs/qgroup.h @@ -349,7 +349,8 @@ int btrfs_qgroup_reserve_data(struct btr int btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len); int btrfs_qgroup_free_data(struct inode *inode, struct extent_changeset *reserved, u64 start, u64 len); - +int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes, + enum btrfs_qgroup_rsv_type type, bool enforce); int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes, enum btrfs_qgroup_rsv_type type, bool enforce); /* Reserve metadata space for pertrans and prealloc type */
From: Nikolay Borisov nborisov@suse.com
commit 4d14c5cde5c268a2bc26addecf09489cb953ef64 upstream
Calling btrfs_qgroup_reserve_meta_prealloc from btrfs_delayed_inode_reserve_metadata can result in flushing delalloc while holding a transaction and delayed node locks. This is deadlock prone. In the past, multiple commits:
* ae5e070eaca9 ("btrfs: qgroup: don't try to wait flushing if we're already holding a transaction")
* 6f23277a49e6 ("btrfs: qgroup: don't commit transaction when we already hold the handle")
tried to solve various aspects of this, but it was always a whack-a-mole game. Unfortunately those two fixes don't solve a deadlock scenario involving btrfs_delayed_node::mutex. Namely, one thread can call btrfs_dirty_inode as a result of reading a file and modifying its atime:
PID: 6963 TASK: ffff8c7f3f94c000 CPU: 2 COMMAND: "test"
 #0 __schedule at ffffffffa529e07d
 #1 schedule at ffffffffa529e4ff
 #2 schedule_timeout at ffffffffa52a1bdd
 #3 wait_for_completion at ffffffffa529eeea         <-- sleeps with delayed node mutex held
 #4 start_delalloc_inodes at ffffffffc0380db5
 #5 btrfs_start_delalloc_snapshot at ffffffffc0393836
 #6 try_flush_qgroup at ffffffffc03f04b2
 #7 __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6 <-- tries to reserve space and starts delalloc inodes.
 #8 btrfs_delayed_update_inode at ffffffffc03e31aa  <-- acquires delayed node mutex
 #9 btrfs_update_inode at ffffffffc0385ba8
 #10 btrfs_dirty_inode at ffffffffc038627b          <-- TRANSACTION OPENED
 #11 touch_atime at ffffffffa4cf0000
 #12 generic_file_read_iter at ffffffffa4c1f123
 #13 new_sync_read at ffffffffa4ccdc8a
 #14 vfs_read at ffffffffa4cd0849
 #15 ksys_read at ffffffffa4cd0bd1
 #16 do_syscall_64 at ffffffffa4a052eb
 #17 entry_SYSCALL_64_after_hwframe at ffffffffa540008c
This causes asynchronous work to be queued to flush the delalloc inodes, and that work can then try to acquire the same delayed_node mutex:
PID: 455 TASK: ffff8c8085fa4000 CPU: 5 COMMAND: "kworker/u16:30"
 #0 __schedule at ffffffffa529e07d
 #1 schedule at ffffffffa529e4ff
 #2 schedule_preempt_disabled at ffffffffa529e80a
 #3 __mutex_lock at ffffffffa529fdcb                <-- goes to sleep, never wakes up.
 #4 btrfs_delayed_update_inode at ffffffffc03e3143  <-- tries to acquire the mutex
 #5 btrfs_update_inode at ffffffffc0385ba8          <-- this is the same inode that pid 6963 is holding
 #6 cow_file_range_inline.constprop.78 at ffffffffc0386be7
 #7 cow_file_range at ffffffffc03879c1
 #8 btrfs_run_delalloc_range at ffffffffc038894c
 #9 writepage_delalloc at ffffffffc03a3c8f
 #10 __extent_writepage at ffffffffc03a4c01
 #11 extent_write_cache_pages at ffffffffc03a500b
 #12 extent_writepages at ffffffffc03a6de2
 #13 do_writepages at ffffffffa4c277eb
 #14 __filemap_fdatawrite_range at ffffffffa4c1e5bb
 #15 btrfs_run_delalloc_work at ffffffffc0380987    <-- starts running delayed nodes
 #16 normal_work_helper at ffffffffc03b706c
 #17 process_one_work at ffffffffa4aba4e4
 #18 worker_thread at ffffffffa4aba6fd
 #19 kthread at ffffffffa4ac0a3d
 #20 ret_from_fork at ffffffffa54001ff
To fully address those cases, the complete fix is to never issue any flushing while holding the transaction or the delayed node lock. This patch achieves that by calling qgroup_reserve_meta directly, which will either succeed without flushing or fail and return -EDQUOT. In the latter case that return value is propagated to btrfs_dirty_inode, which falls back to starting a new transaction. That's fine, as the majority of the time we expect the inode to have the BTRFS_DELAYED_NODE_INODE_DIRTY flag set, which results in directly copying the in-memory state.
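Condensed, the two hunks below amount to the following sketch:

	/*
	 * btrfs_delayed_inode_reserve_metadata(): reserve without flushing,
	 * so -EDQUOT propagates instead of flushing delalloc under the
	 * transaction and delayed node locks.
	 */
	ret = btrfs_qgroup_reserve_meta(root, num_bytes,
					BTRFS_QGROUP_RSV_META_PREALLOC, true);
	if (ret < 0)
		return ret;

	/*
	 * btrfs_dirty_inode(): treat -EDQUOT like -ENOSPC and retry with a
	 * full transaction, which reserves the needed space up front.
	 */
	if (ret && (ret == -ENOSPC || ret == -EDQUOT)) {
		btrfs_end_transaction(trans);
		trans = btrfs_start_transaction(root, 1);
	}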
Fixes: c53e9653605d ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo wqu@suse.com Signed-off-by: Nikolay Borisov nborisov@suse.com Signed-off-by: David Sterba dsterba@suse.com Signed-off-by: Anand Jain anand.jain@oracle.com Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org --- fs/btrfs/delayed-inode.c | 3 ++- fs/btrfs/inode.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-)
--- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -627,7 +627,8 @@ static int btrfs_delayed_inode_reserve_m */ if (!src_rsv || (!trans->bytes_reserved && src_rsv->type != BTRFS_BLOCK_RSV_DELALLOC)) { - ret = btrfs_qgroup_reserve_meta_prealloc(root, num_bytes, true); + ret = btrfs_qgroup_reserve_meta(root, num_bytes, + BTRFS_QGROUP_RSV_META_PREALLOC, true); if (ret < 0) return ret; ret = btrfs_block_rsv_add(root, dst_rsv, num_bytes, --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6375,7 +6375,7 @@ static int btrfs_dirty_inode(struct inod return PTR_ERR(trans);
ret = btrfs_update_inode(trans, root, inode); - if (ret && ret == -ENOSPC) { + if (ret && (ret == -ENOSPC || ret == -EDQUOT)) { /* whoops, lets try again with the full transaction */ btrfs_end_transaction(trans); trans = btrfs_start_transaction(root, 1);
On 8/13/21 9:06 AM, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.4.141 release. There are 27 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 15 Aug 2021 15:05:12 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.141-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
Compiled and booted on my test system. No dmesg regressions.
Tested-by: Shuah Khan skhan@linuxfoundation.org
thanks,
--
Shuah
Hi Greg,
On Fri, Aug 13, 2021 at 05:06:58PM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.4.141 release. There are 27 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 15 Aug 2021 15:05:12 +0000. Anything received after that time might be too late.
Build test:
mips (gcc version 11.1.1 20210723): 65 configs -> no failure
arm (gcc version 11.1.1 20210723): 107 configs -> no new failure
arm64 (gcc version 11.1.1 20210723): 2 configs -> no failure
x86_64 (gcc version 10.2.1 20210110): 4 configs -> no failure

Boot test:
x86_64: Booted on my test laptop. No regression.
x86_64: Booted on qemu. No regression. [1]
[1]. https://openqa.qa.codethink.co.uk/tests/28
Tested-by: Sudip Mukherjee sudip.mukherjee@codethink.co.uk
--
Regards
Sudip
On Fri, 13 Aug 2021 at 20:44, Greg Kroah-Hartman gregkh@linuxfoundation.org wrote:
This is the start of the stable review cycle for the 5.4.141 release. There are 27 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 15 Aug 2021 15:05:12 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.141-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
Results from Linaro’s test farm. No regressions on arm64, arm, x86_64, and i386.
Tested-by: Linux Kernel Functional Testing lkft@linaro.org
## Build
* kernel: 5.4.141-rc1
* git: https://gitlab.com/Linaro/lkft/mirrors/stable/linux-stable-rc
* git branch: linux-5.4.y
* git commit: df6ce9b59c705e13e7fdf19dbba5a715b993b712
* git describe: v5.4.139-114-gdf6ce9b59c70
* test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.4.y/build/v5.4.13...
## No regressions (compared to v5.4.139-86-gff7bc8590c20)
## No fixes (compared to v5.4.139-86-gff7bc8590c20)
## Test result summary
total: 77974, pass: 64106, fail: 387, skip: 12161, xfail: 1320

## Build Summary
* arc: 10 total, 10 passed, 0 failed
* arm: 193 total, 193 passed, 0 failed
* arm64: 27 total, 27 passed, 0 failed
* i386: 15 total, 15 passed, 0 failed
* mips: 45 total, 45 passed, 0 failed
* parisc: 9 total, 9 passed, 0 failed
* powerpc: 27 total, 27 passed, 0 failed
* riscv: 21 total, 21 passed, 0 failed
* s390: 9 total, 9 passed, 0 failed
* sh: 18 total, 18 passed, 0 failed
* sparc: 9 total, 9 passed, 0 failed
* x86_64: 27 total, 27 passed, 0 failed

## Test suites summary
* fwts
* igt-gpu-tools
* kselftest-android
* kselftest-arm64
* kselftest-breakpoints
* kselftest-capabilities
* kselftest-cgroup
* kselftest-clone3
* kselftest-core
* kselftest-cpu-hotplug
* kselftest-cpufreq
* kselftest-drivers
* kselftest-efivarfs
* kselftest-filesystems
* kselftest-firmware
* kselftest-fpu
* kselftest-futex
* kselftest-gpio
* kselftest-intel_pstate
* kselftest-ipc
* kselftest-ir
* kselftest-kcmp
* kselftest-kvm
* kselftest-lib
* kselftest-livepatch
* kselftest-lkdtm
* kselftest-membarrier
* kselftest-memfd
* kselftest-memory-hotplug
* kselftest-mincore
* kselftest-mount
* kselftest-mqueue
* kselftest-net
* kselftest-netfilter
* kselftest-nsfs
* kselftest-openat2
* kselftest-pid_namespace
* kselftest-pidfd
* kselftest-proc
* kselftest-pstore
* kselftest-ptrace
* kselftest-rseq
* kselftest-rtc
* kselftest-timens
* kselftest-timers
* kselftest-tmpfs
* kselftest-tpm2
* kselftest-user
* kselftest-vm
* kselftest-x86
* kselftest-zram
* kvm-unit-tests
* libhugetlbfs
* linux-log-parser
* ltp-cap_bounds-tests
* ltp-commands-tests
* ltp-containers-tests
* ltp-controllers-tests
* ltp-cpuhotplug-tests
* ltp-crypto-tests
* ltp-cve-tests
* ltp-dio-tests
* ltp-fcntl-locktests-tests
* ltp-filecaps-tests
* ltp-fs-tests
* ltp-fs_bind-tests
* ltp-fs_perms_simple-tests
* ltp-fsx-tests
* ltp-hugetlb-tests
* ltp-io-tests
* ltp-ipc-tests
* ltp-math-tests
* ltp-mm-tests
* ltp-nptl-tests
* ltp-open-posix-tests
* ltp-pty-tests
* ltp-sched-tests
* ltp-securebits-tests
* ltp-syscalls-tests
* ltp-tracing-tests
* network-basic-tests
* packetdrill
* perf
* rcutorture
* ssuite
* v4l2-compliance
--
Linaro LKFT
https://lkft.linaro.org
On Fri, Aug 13, 2021 at 05:06:58PM +0200, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.4.141 release. There are 27 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 15 Aug 2021 15:05:12 +0000. Anything received after that time might be too late.
Build results:
	total: 157 pass: 157 fail: 0
Qemu test results:
	total: 444 pass: 444 fail: 0
Tested-by: Guenter Roeck linux@roeck-us.net
Guenter
On 2021/8/13 23:06, Greg Kroah-Hartman wrote:
This is the start of the stable review cycle for the 5.4.141 release. There are 27 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know.
Responses should be made by Sun, 15 Aug 2021 15:05:12 +0000. Anything received after that time might be too late.
The whole patch series can be found in one patch at: https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.4.141-rc1... or in the git tree and branch at: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.4.y and the diffstat can be found below.
thanks,
greg k-h
Tested on arm64 and x86 for 5.4.141-rc1,
Kernel repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Branch: linux-5.4.y
Version: 5.4.141-rc1
Commit: df6ce9b59c705e13e7fdf19dbba5a715b993b712
Compiler: gcc version 7.3.0 (GCC)
arm64:
--------------------------------------------------------------------
Testcase Result Summary:
total: 8906 passed: 8906 failed: 0 timeout: 0
--------------------------------------------------------------------

x86:
--------------------------------------------------------------------
Testcase Result Summary:
total: 8906 passed: 8906 failed: 0 timeout: 0
--------------------------------------------------------------------
Tested-by: Hulk Robot hulkrobot@huawei.com