June 2024 - Linux-stable-mirror

[PATCH AUTOSEL 6.1 01/29] scsi: core: alua: I/O errors for ALUA state transitions

by Sasha Levin

From: Martin Wilck <martin.wilck(a)suse.com> [ Upstream commit 10157b1fc1a762293381e9145041253420dfc6ad ] When a host is configured with a few LUNs and I/O is running, injecting FC faults repeatedly leads to path recovery problems. The LUNs have 4 paths each and 3 of them come back active after say an FC fault which makes 2 of the paths go down, instead of all 4. This happens after several iterations of continuous FC faults. Reason here is that we're returning an I/O error whenever we're encountering sense code 06/04/0a (LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION) instead of retrying. [mwilck: The original patch was developed by Rajashekhar M A and Hannes Reinecke. I moved the code to alua_check_sense() as suggested by Mike Christie [1]. Evan Milne had raised the question whether pg->state should be set to transitioning in the UA case [2]. I believe that doing this is correct. SCSI_ACCESS_STATE_TRANSITIONING by itself doesn't cause I/O errors. Our handler schedules an RTPG, which will only result in an I/O error condition if the transitioning timeout expires.] [1] https://lore.kernel.org/all/0bc96e82-fdda-4187-148d-5b34f81d4942@oracle.com/ [2] https://lore.kernel.org/all/CAGtn9r=kicnTDE2o7Gt5Y=yoidHYD7tG8XdMHEBJTBraVE… Co-developed-by: Rajashekhar M A <rajs(a)netapp.com> Co-developed-by: Hannes Reinecke <hare(a)suse.de> Signed-off-by: Hannes Reinecke <hare(a)suse.de> Signed-off-by: Martin Wilck <martin.wilck(a)suse.com> Link: https://lore.kernel.org/r/20240514140344.19538-1-mwilck@suse.com Reviewed-by: Damien Le Moal <dlemoal(a)kernel.org> Reviewed-by: Christoph Hellwig <hch(a)lst.de> Reviewed-by: Mike Christie <michael.christie(a)oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com> Signed-off-by: Sasha Levin <sashal(a)kernel.org> --- drivers/scsi/device_handler/scsi_dh_alua.c | 31 +++++++++++++++------- 1 file changed, 22 insertions(+), 9 deletions(-) diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c index 0781f991e7845..f5fc8631883d5 100644 --- a/drivers/scsi/device_handler/scsi_dh_alua.c +++ b/drivers/scsi/device_handler/scsi_dh_alua.c @@ -406,28 +406,40 @@ static char print_alua_state(unsigned char state) } } -static enum scsi_disposition alua_check_sense(struct scsi_device *sdev, - struct scsi_sense_hdr *sense_hdr) +static void alua_handle_state_transition(struct scsi_device *sdev) { struct alua_dh_data *h = sdev->handler_data; struct alua_port_group *pg; + rcu_read_lock(); + pg = rcu_dereference(h->pg); + if (pg) + pg->state = SCSI_ACCESS_STATE_TRANSITIONING; + rcu_read_unlock(); + alua_check(sdev, false); +} + +static enum scsi_disposition alua_check_sense(struct scsi_device *sdev, + struct scsi_sense_hdr *sense_hdr) +{ switch (sense_hdr->sense_key) { case NOT_READY: if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0a) { /* * LUN Not Accessible - ALUA state transition */ - rcu_read_lock(); - pg = rcu_dereference(h->pg); - if (pg) - pg->state = SCSI_ACCESS_STATE_TRANSITIONING; - rcu_read_unlock(); - alua_check(sdev, false); + alua_handle_state_transition(sdev); return NEEDS_RETRY; } break; case UNIT_ATTENTION: + if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0a) { + /* + * LUN Not Accessible - ALUA state transition + */ + alua_handle_state_transition(sdev); + return NEEDS_RETRY; + } if (sense_hdr->asc == 0x29 && sense_hdr->ascq == 0x00) { /* * Power On, Reset, or Bus Device Reset. @@ -494,7 +506,8 @@ static int alua_tur(struct scsi_device *sdev) retval = scsi_test_unit_ready(sdev, ALUA_FAILOVER_TIMEOUT * HZ, ALUA_FAILOVER_RETRIES, &sense_hdr); - if (sense_hdr.sense_key == NOT_READY && + if ((sense_hdr.sense_key == NOT_READY || + sense_hdr.sense_key == UNIT_ATTENTION) && sense_hdr.asc == 0x04 && sense_hdr.ascq == 0x0a) return SCSI_DH_RETRY; else if (retval) -- 2.43.0

1 year, 5 months

4
32
0 0

[PATCH v2] pmdomain: qcom: rpmhpd: Skip retention level for Power Domains

by Taniya Das

In the cases where the power domain connected to logics is allowed to transition from a level(L)-->power collapse(0)-->retention(1) or vice versa retention(1)-->power collapse(0)-->level(L) will cause the logic to lose the configurations. The ARC does not support retention to collapse transition on MxC rails. The targets from SM8450 onwards the PLL logics of clock controllers are connected to MxC rails and the recommended configurations are carried out during the clock controller probes. The MxC transition as mentioned above should be skipped to ensure the PLL settings are intact across clock controller power on & off. On older targets that do not split MX into MxA and MxC does not collapse the logic and it is parked always at RETENTION, thus this issue is never observed on those targets. Cc: stable(a)vger.kernel.org # v5.17 Reviewed-by: Bjorn Andersson <andersson(a)kernel.org> Signed-off-by: Taniya Das <quic_tdas(a)quicinc.com> --- [Changes in v2]: Incorporate the comments in the commit text. --- drivers/pmdomain/qcom/rpmhpd.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/drivers/pmdomain/qcom/rpmhpd.c b/drivers/pmdomain/qcom/rpmhpd.c index de9121ef4216..d2cb4271a1ca 100644 --- a/drivers/pmdomain/qcom/rpmhpd.c +++ b/drivers/pmdomain/qcom/rpmhpd.c @@ -40,6 +40,7 @@ * @addr: Resource address as looped up using resource name from * cmd-db * @state_synced: Indicator that sync_state has been invoked for the rpmhpd resource + * @skip_retention_level: Indicate that retention level should not be used for the power domain */ struct rpmhpd { struct device *dev; @@ -56,6 +57,7 @@ struct rpmhpd { const char *res_name; u32 addr; bool state_synced; + bool skip_retention_level; }; struct rpmhpd_desc { @@ -173,6 +175,7 @@ static struct rpmhpd mxc = { .pd = { .name = "mxc", }, .peer = &mxc_ao, .res_name = "mxc.lvl", + .skip_retention_level = true, }; static struct rpmhpd mxc_ao = { @@ -180,6 +183,7 @@ static struct rpmhpd mxc_ao = { .active_only = true, .peer = &mxc, .res_name = "mxc.lvl", + .skip_retention_level = true, }; static struct rpmhpd nsp = { @@ -819,6 +823,9 @@ static int rpmhpd_update_level_mapping(struct rpmhpd *rpmhpd) return -EINVAL; for (i = 0; i < rpmhpd->level_count; i++) { + if (rpmhpd->skip_retention_level && buf[i] == RPMH_REGULATOR_LEVEL_RETENTION) + continue; + rpmhpd->level[i] = buf[i]; /* Remember the first corner with non-zero level */ --- base-commit: 62c97045b8f720c2eac807a5f38e26c9ed512371 change-id: 20240625-avoid_mxc_retention-b095a761d981 Best regards, -- Taniya Das <quic_tdas(a)quicinc.com>

1 year, 5 months

2
1
0 0

[PATCH iwl-next] ice: Add a per-VF limit on number of FDIR filters

by Ahmed Zaki

While the iavf driver adds a s/w limit (128) on the number of FDIR filters that the VF can request, a malicious VF driver can request more than that and exhaust the resources for other VFs. Add a similar limit in ice. CC: stable(a)vger.kernel.org Reviewed-by: Przemek Kitszel <przemyslaw.kitszel(a)intel.com> Suggested-by: Sridhar Samudrala <sridhar.samudrala(a)intel.com> Signed-off-by: Ahmed Zaki <ahmed.zaki(a)intel.com> --- .../net/ethernet/intel/ice/ice_ethtool_fdir.c | 2 +- drivers/net/ethernet/intel/ice/ice_fdir.h | 3 +++ .../net/ethernet/intel/ice/ice_virtchnl_fdir.c | 16 ++++++++++++++++ .../net/ethernet/intel/ice/ice_virtchnl_fdir.h | 1 + 4 files changed, 21 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool_fdir.c b/drivers/net/ethernet/intel/ice/ice_ethtool_fdir.c index e3cab8e98f52..5412eff8ef23 100644 --- a/drivers/net/ethernet/intel/ice/ice_ethtool_fdir.c +++ b/drivers/net/ethernet/intel/ice/ice_ethtool_fdir.c @@ -534,7 +534,7 @@ ice_parse_rx_flow_user_data(struct ethtool_rx_flow_spec *fsp, * * Returns the number of available flow director filters to this VSI */ -static int ice_fdir_num_avail_fltr(struct ice_hw *hw, struct ice_vsi *vsi) +int ice_fdir_num_avail_fltr(struct ice_hw *hw, struct ice_vsi *vsi) { u16 vsi_num = ice_get_hw_vsi_num(hw, vsi->idx); u16 num_guar; diff --git a/drivers/net/ethernet/intel/ice/ice_fdir.h b/drivers/net/ethernet/intel/ice/ice_fdir.h index 021ecbac7848..ab5b118daa2d 100644 --- a/drivers/net/ethernet/intel/ice/ice_fdir.h +++ b/drivers/net/ethernet/intel/ice/ice_fdir.h @@ -207,6 +207,8 @@ struct ice_fdir_base_pkt { const u8 *tun_pkt; }; +struct ice_vsi; + int ice_alloc_fd_res_cntr(struct ice_hw *hw, u16 *cntr_id); int ice_free_fd_res_cntr(struct ice_hw *hw, u16 cntr_id); int ice_alloc_fd_guar_item(struct ice_hw *hw, u16 *cntr_id, u16 num_fltr); @@ -218,6 +220,7 @@ int ice_fdir_get_gen_prgm_pkt(struct ice_hw *hw, struct ice_fdir_fltr *input, u8 *pkt, bool frag, bool tun); int ice_get_fdir_cnt_all(struct ice_hw *hw); +int ice_fdir_num_avail_fltr(struct ice_hw *hw, struct ice_vsi *vsi); bool ice_fdir_is_dup_fltr(struct ice_hw *hw, struct ice_fdir_fltr *input); bool ice_fdir_has_frag(enum ice_fltr_ptype flow); struct ice_fdir_fltr * diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.c b/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.c index b8df8d0b2d85..60bf71da53bd 100644 --- a/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.c +++ b/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.c @@ -550,6 +550,8 @@ static void ice_vc_fdir_reset_cnt_all(struct ice_vf_fdir *fdir) fdir->fdir_fltr_cnt[flow][0] = 0; fdir->fdir_fltr_cnt[flow][1] = 0; } + + fdir->fdir_fltr_cnt_total = 0; } /** @@ -1694,6 +1696,7 @@ ice_vc_add_fdir_fltr_post(struct ice_vf *vf, struct ice_vf_fdir_ctx *ctx, resp->status = status; resp->flow_id = conf->flow_id; vf->fdir.fdir_fltr_cnt[conf->input.flow_type][is_tun]++; + vf->fdir.fdir_fltr_cnt_total++; ret = ice_vc_send_msg_to_vf(vf, ctx->v_opcode, v_ret, (u8 *)resp, len); @@ -1758,6 +1761,7 @@ ice_vc_del_fdir_fltr_post(struct ice_vf *vf, struct ice_vf_fdir_ctx *ctx, resp->status = status; ice_vc_fdir_remove_entry(vf, conf, conf->flow_id); vf->fdir.fdir_fltr_cnt[conf->input.flow_type][is_tun]--; + vf->fdir.fdir_fltr_cnt_total--; ret = ice_vc_send_msg_to_vf(vf, ctx->v_opcode, v_ret, (u8 *)resp, len); @@ -2074,6 +2078,7 @@ int ice_vc_add_fdir_fltr(struct ice_vf *vf, u8 *msg) struct virtchnl_fdir_add *stat = NULL; struct virtchnl_fdir_fltr_conf *conf; enum virtchnl_status_code v_ret; + struct ice_vsi *vf_vsi; struct device *dev; struct ice_pf *pf; int is_tun = 0; @@ -2082,6 +2087,17 @@ int ice_vc_add_fdir_fltr(struct ice_vf *vf, u8 *msg) pf = vf->pf; dev = ice_pf_to_dev(pf); + vf_vsi = ice_get_vf_vsi(vf); + +#define ICE_VF_MAX_FDIR_FILTERS 128 + if (!ice_fdir_num_avail_fltr(&pf->hw, vf_vsi) || + vf->fdir.fdir_fltr_cnt_total >= ICE_VF_MAX_FDIR_FILTERS) { + v_ret = VIRTCHNL_STATUS_ERR_PARAM; + dev_err(dev, "Max number of FDIR filters for VF %d is reached\n", + vf->vf_id); + goto err_exit; + } + ret = ice_vc_fdir_param_check(vf, fltr->vsi_id); if (ret) { v_ret = VIRTCHNL_STATUS_ERR_PARAM; diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.h b/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.h index c5bcc8d7481c..ac6dcab454b4 100644 --- a/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.h +++ b/drivers/net/ethernet/intel/ice/ice_virtchnl_fdir.h @@ -29,6 +29,7 @@ struct ice_vf_fdir_ctx { struct ice_vf_fdir { u16 fdir_fltr_cnt[ICE_FLTR_PTYPE_MAX][ICE_FD_HW_SEG_MAX]; int prof_entry_cnt[ICE_FLTR_PTYPE_MAX][ICE_FD_HW_SEG_MAX]; + u16 fdir_fltr_cnt_total; struct ice_fd_hw_prof **fdir_prof; struct idr fdir_rule_idr; -- 2.43.0

1 year, 5 months

3
2
0 0

[PATCH AUTOSEL 6.6 01/18] nvme-multipath: find NUMA path only for online numa-node

by Sasha Levin

From: Nilay Shroff <nilay(a)linux.ibm.com> [ Upstream commit d3a043733f25d743f3aa617c7f82dbcb5ee2211a ] In current native multipath design when a shared namespace is created, we loop through each possible numa-node, calculate the NUMA distance of that node from each nvme controller and then cache the optimal IO path for future reference while sending IO. The issue with this design is that we may refer to the NUMA distance table for an offline node which may not be populated at the time and so we may inadvertently end up finding and caching a non-optimal path for IO. Then latter when the corresponding numa-node becomes online and hence the NUMA distance table entry for that node is created, ideally we should re-calculate the multipath node distance for the newly added node however that doesn't happen unless we rescan/reset the controller. So essentially, we may keep using non-optimal IO path for a node which is made online after namespace is created. This patch helps fix this issue ensuring that when a shared namespace is created, we calculate the multipath node distance for each online numa-node instead of each possible numa-node. Then latter when a node becomes online and we receive any IO on that newly added node, we would calculate the multipath node distance for newly added node but this time NUMA distance table would have been already populated for newly added node. Hence we would be able to correctly calculate the multipath node distance and choose the optimal path for the IO. Signed-off-by: Nilay Shroff <nilay(a)linux.ibm.com> Reviewed-by: Christoph Hellwig <hch(a)lst.de> Signed-off-by: Keith Busch <kbusch(a)kernel.org> Signed-off-by: Sasha Levin <sashal(a)kernel.org> --- drivers/nvme/host/multipath.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 0a88d7bdc5e37..6a444ce273366 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -592,7 +592,7 @@ static void nvme_mpath_set_live(struct nvme_ns *ns) int node, srcu_idx; srcu_idx = srcu_read_lock(&head->srcu); - for_each_node(node) + for_each_online_node(node) __nvme_find_path(head, node); srcu_read_unlock(&head->srcu, srcu_idx); } -- 2.43.0

1 year, 5 months

2
20
0 0

[PATCH] FIXUP: genirq: defuse spurious-irq timebomb

by Pete Swain

The flapping-irq detector still has a timebomb. A pathological workload, or test script, can arm the spurious-irq timebomb described in 4f27c00bf80f ("Improve behaviour of spurious IRQ detect") This leads to irqs being moved the much slower polled mode, despite the actual unhandled-irq rate being well under the 99.9k/100k threshold that the code appears to check. How? - Queued completion handler, like nvme, servicing events as they appear in the queue, even if the irq corresponding to the event has not yet been seen. - queues frequently empty, so seeing "spurious" irqs whenever the last events of a threaded handler's while (events_queued()) process_them(); ends with those events' irqs posted while thread was scanning. In this case the while() has consumed last event(s), so next handler says IRQ_NONE. - In each run of "unhandled" irqs, exactly one IRQ_NONE response is promoted from IRQ_NONE to IRQ_HANDLED, by note_interrupt()'s SPURIOUS_DEFERRED logic. - Any 2+ unhandled-irq runs will increment irqs_unhandled. The time_after() check in note_interrupt() resets irqs_unhandled to 1 after an idle period, but if irqs are never spaced more than HZ/10 apart, irqs_unhandled keeps growing. - During processing of long completion queues, the non-threaded handlers will return IRQ_WAKE_THREAD, for potentially thousands of per-event irqs. These bypass note_interrupt()'s irq_count++ logic, so do not count as handled, and do not invoke the flapping-irq logic. - When the _counted_ irq_count reaches the 100k threshold, it's possible for irqs_unhandled > 99.9k to force a move to polling mode, even though many millions of _WAKE_THREAD irqs have been handled without being counted. Solution: include IRQ_WAKE_THREAD events in irq_count. Only when IRQ_NONE responses outweigh (IRQ_HANDLED + IRQ_WAKE_THREAD) by the old 99:1 ratio will an irq be moved to polling mode. Fixes: 4f27c00bf80f ("Improve behaviour of spurious IRQ detect") Cc: stable(a)vger.kernel.org Signed-off-by: Pete Swain <swine(a)google.com> --- kernel/irq/spurious.c | 68 +++++++++++++++++++++---------------------- 1 file changed, 34 insertions(+), 34 deletions(-) diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c index 02b2daf07441..ac596c8dc4b1 100644 --- a/kernel/irq/spurious.c +++ b/kernel/irq/spurious.c @@ -321,44 +321,44 @@ void note_interrupt(struct irq_desc *desc, irqreturn_t action_ret) */ if (!(desc->threads_handled_last & SPURIOUS_DEFERRED)) { desc->threads_handled_last |= SPURIOUS_DEFERRED; - return; - } - /* - * Check whether one of the threaded handlers - * returned IRQ_HANDLED since the last - * interrupt happened. - * - * For simplicity we just set bit 31, as it is - * set in threads_handled_last as well. So we - * avoid extra masking. And we really do not - * care about the high bits of the handled - * count. We just care about the count being - * different than the one we saw before. - */ - handled = atomic_read(&desc->threads_handled); - handled |= SPURIOUS_DEFERRED; - if (handled != desc->threads_handled_last) { - action_ret = IRQ_HANDLED; - /* - * Note: We keep the SPURIOUS_DEFERRED - * bit set. We are handling the - * previous invocation right now. - * Keep it for the current one, so the - * next hardware interrupt will - * account for it. - */ - desc->threads_handled_last = handled; } else { /* - * None of the threaded handlers felt - * responsible for the last interrupt + * Check whether one of the threaded handlers + * returned IRQ_HANDLED since the last + * interrupt happened. * - * We keep the SPURIOUS_DEFERRED bit - * set in threads_handled_last as we - * need to account for the current - * interrupt as well. + * For simplicity we just set bit 31, as it is + * set in threads_handled_last as well. So we + * avoid extra masking. And we really do not + * care about the high bits of the handled + * count. We just care about the count being + * different than the one we saw before. */ - action_ret = IRQ_NONE; + handled = atomic_read(&desc->threads_handled); + handled |= SPURIOUS_DEFERRED; + if (handled != desc->threads_handled_last) { + action_ret = IRQ_HANDLED; + /* + * Note: We keep the SPURIOUS_DEFERRED + * bit set. We are handling the + * previous invocation right now. + * Keep it for the current one, so the + * next hardware interrupt will + * account for it. + */ + desc->threads_handled_last = handled; + } else { + /* + * None of the threaded handlers felt + * responsible for the last interrupt + * + * We keep the SPURIOUS_DEFERRED bit + * set in threads_handled_last as we + * need to account for the current + * interrupt as well. + */ + action_ret = IRQ_NONE; + } } } else { /* -- 2.45.2.627.g7a2c4fd464-goog

1 year, 5 months

2
1
0 0

[PATCH] mm/page_alloc: add one PCP list for THP

by yangge1116＠126.com

From: yangge <yangge1116(a)126.com> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations") no longer differentiates the migration type of pages in THP-sized PCP list, it's possible that non-movable allocation requests may get a CMA page from the list, in some cases, it's not acceptable. If a large number of CMA memory are configured in system (for example, the CMA memory accounts for 50% of the system memory), starting a virtual machine with device passthrough will get stuck. During starting the virtual machine, it will call pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory. Normally if a page is present and in CMA area, pin_user_pages_remote() will migrate the page from CMA area to non-CMA area because of FOLL_LONGTERM flag. But if non-movable allocation requests return CMA memory, migrate_longterm_unpinnable_pages() will migrate a CMA page to another CMA page, which will fail to pass the check in check_and_migrate_movable_pages() and cause migration endless. Call trace: pin_user_pages_remote --__gup_longterm_locked // endless loops in this function ----_get_user_pages_locked ----check_and_migrate_movable_pages ------migrate_longterm_unpinnable_pages --------alloc_migration_target This problem will also have a negative impact on CMA itself. For example, when CMA is borrowed by THP, and we need to reclaim it through cma_alloc() or dma_alloc_coherent(), we must move those pages out to ensure CMA's users can retrieve that contigous memory. Currently, CMA's memory is occupied by non-movable pages, meaning we can't relocate them. As a result, cma_alloc() is more likely to fail. To fix the problem above, we add one PCP list for THP, which will not introduce a new cacheline for struct per_cpu_pages. THP will have 2 PCP lists, one PCP list is used by MOVABLE allocation, and the other PCP list is used by UNMOVABLE allocation. MOVABLE allocation contains GPF_MOVABLE, and UNMOVABLE allocation contains GFP_UNMOVABLE and GFP_RECLAIMABLE. Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations") Signed-off-by: yangge <yangge1116(a)126.com> --- include/linux/mmzone.h | 9 ++++----- mm/page_alloc.c | 9 +++++++-- 2 files changed, 11 insertions(+), 7 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index b7546dd..cb7f265 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -656,13 +656,12 @@ enum zone_watermarks { }; /* - * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional list - * for THP which will usually be GFP_MOVABLE. Even if it is another type, - * it should not contribute to serious fragmentation causing THP allocation - * failures. + * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. Two additional lists + * are added for THP. One PCP list is used by GPF_MOVABLE, and the other PCP list + * is used by GFP_UNMOVABLE and GFP_RECLAIMABLE. */ #ifdef CONFIG_TRANSPARENT_HUGEPAGE -#define NR_PCP_THP 1 +#define NR_PCP_THP 2 #else #define NR_PCP_THP 0 #endif diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8f416a0..0a837e6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -504,10 +504,15 @@ static void bad_page(struct page *page, const char *reason) static inline unsigned int order_to_pindex(int migratetype, int order) { + bool __maybe_unused movable; + #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (order > PAGE_ALLOC_COSTLY_ORDER) { VM_BUG_ON(order != HPAGE_PMD_ORDER); - return NR_LOWORDER_PCP_LISTS; + + movable = migratetype == MIGRATE_MOVABLE; + + return NR_LOWORDER_PCP_LISTS + movable; } #else VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); @@ -521,7 +526,7 @@ static inline int pindex_to_order(unsigned int pindex) int order = pindex / MIGRATE_PCPTYPES; #ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (pindex == NR_LOWORDER_PCP_LISTS) + if (pindex >= NR_LOWORDER_PCP_LISTS) order = HPAGE_PMD_ORDER; #else VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); -- 2.7.4

1 year, 5 months

6
6
0 0

[PATCH v2] net: ethernet: mtk_ppe: Change PPE entries number to 16K

by Shengyu Qu

MT7981,7986 and 7988 all supports 32768 PPE entries, and MT7621/MT7620 supports 16384 PPE entries, but only set to 8192 entries in driver. So incrase max entries to 16384 instead. Cc: stable(a)vger.kernel.org Signed-off-by: Elad Yifee <eladwf(a)gmail.com> Signed-off-by: Shengyu Qu <wiagn233(a)outlook.com> Fixes: ba37b7caf1ed ("net: ethernet: mtk_eth_soc: add support for initializing the PPE") --- Changes since V1: - Reduced max entries from 32768 to 16384 to keep compatible with MT7620/21 devices. - Add fixes tag --- drivers/net/ethernet/mediatek/mtk_ppe.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mediatek/mtk_ppe.h b/drivers/net/ethernet/mediatek/mtk_ppe.h index 691806bca372..223f709e2704 100644 --- a/drivers/net/ethernet/mediatek/mtk_ppe.h +++ b/drivers/net/ethernet/mediatek/mtk_ppe.h @@ -8,7 +8,7 @@ #include <linux/bitfield.h> #include <linux/rhashtable.h> -#define MTK_PPE_ENTRIES_SHIFT 3 +#define MTK_PPE_ENTRIES_SHIFT 4 #define MTK_PPE_ENTRIES (1024 << MTK_PPE_ENTRIES_SHIFT) #define MTK_PPE_HASH_MASK (MTK_PPE_ENTRIES - 1) #define MTK_PPE_WAIT_TIMEOUT_US 1000000 -- 2.39.2

1 year, 5 months

3
6
0 0

[PATCH] kobject_uevent: Fix OOB access within zap_modalias_env()

by Zijun Hu

zap_modalias_env() wrongly calculates size of memory block to move, so maybe cause OOB memory access issue, fixed by correcting size to memmove. Fixes: 9b3fa47d4a76 ("kobject: fix suppressing modalias in uevents delivered over netlink") Cc: stable(a)vger.kernel.org Signed-off-by: Zijun Hu <quic_zijuhu(a)quicinc.com> --- lib/kobject_uevent.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c index 03b427e2707e..f153b4f9d4d9 100644 --- a/lib/kobject_uevent.c +++ b/lib/kobject_uevent.c @@ -434,7 +434,7 @@ static void zap_modalias_env(struct kobj_uevent_env *env) if (i != env->envp_idx - 1) { memmove(env->envp[i], env->envp[i + 1], - env->buflen - len); + env->buf + env->buflen - env->envp[i + 1]); for (j = i; j < env->envp_idx - 1; j++) env->envp[j] = env->envp[j + 1] - len; -- 2.7.4

1 year, 5 months

4
10
0 0

[PATCHv5 3/4] x86/tdx: Dynamically disable SEPT violations from causing #VEs

by Kirill A. Shutemov

Memory access #VE's are hard for Linux to handle in contexts like the entry code or NMIs. But other OSes need them for functionality. There's a static (pre-guest-boot) way for a VMM to choose one or the other. But VMMs don't always know which OS they are booting, so they choose to deliver those #VE's so the "other" OSes will work. That, unfortunately has left us in the lurch and exposed to these hard-to-handle #VEs. The TDX module has introduced a new feature. Even if the static configuration is "send nasty #VE's", the kernel can dynamically request that they be disabled. Check if the feature is available and disable SEPT #VE if possible. If the TD allowed to disable/enable SEPT #VEs, the ATTR_SEPT_VE_DISABLE attribute is no longer reliable. It reflects the initial state of the control for the TD, but it will not be updated if someone (e.g. bootloader) changes it before the kernel starts. Kernel must check TDCS_TD_CTLS bit to determine if SEPT #VEs are enabled or disabled. Signed-off-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com> Fixes: 373e715e31bf ("x86/tdx: Panic on bad configs that #VE on "private" memory access") Cc: stable(a)vger.kernel.org --- arch/x86/coco/tdx/tdx.c | 76 ++++++++++++++++++++++++------- arch/x86/include/asm/shared/tdx.h | 10 +++- 2 files changed, 69 insertions(+), 17 deletions(-) diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c index 08ce488b54d0..ba3103877b21 100644 --- a/arch/x86/coco/tdx/tdx.c +++ b/arch/x86/coco/tdx/tdx.c @@ -78,7 +78,7 @@ static inline void tdcall(u64 fn, struct tdx_module_args *args) } /* Read TD-scoped metadata */ -static inline u64 __maybe_unused tdg_vm_rd(u64 field, u64 *value) +static inline u64 tdg_vm_rd(u64 field, u64 *value) { struct tdx_module_args args = { .rdx = field, @@ -193,6 +193,62 @@ static void __noreturn tdx_panic(const char *msg) __tdx_hypercall(&args); } +/* + * The kernel cannot handle #VEs when accessing normal kernel memory. Ensure + * that no #VE will be delivered for accesses to TD-private memory. + * + * TDX 1.0 does not allow the guest to disable SEPT #VE on its own. The VMM + * controls if the guest will receive such #VE with TD attribute + * ATTR_SEPT_VE_DISABLE. + * + * Newer TDX module allows the guest to control if it wants to receive SEPT + * violation #VEs. + * + * Check if the feature is available and disable SEPT #VE if possible. + * + * If the TD allowed to disable/enable SEPT #VEs, the ATTR_SEPT_VE_DISABLE + * attribute is no longer reliable. It reflects the initial state of the + * control for the TD, but it will not be updated if someone (e.g. bootloader) + * changes it before the kernel starts. Kernel must check TDCS_TD_CTLS bit to + * determine if SEPT #VEs are enabled or disabled. + */ +static void disable_sept_ve(u64 td_attr) +{ + const char *msg = "TD misconfiguration: SEPT #VE has to be disabled"; + bool debug = td_attr & ATTR_DEBUG; + u64 config, controls; + + /* Is this TD allowed to disable SEPT #VE */ + tdg_vm_rd(TDCS_CONFIG_FLAGS, &config); + if (!(config & TDCS_CONFIG_FLEXIBLE_PENDING_VE)) { + /* No SEPT #VE controls for the guest: check the attribute */ + if (td_attr & ATTR_SEPT_VE_DISABLE) + return; + + /* Relax SEPT_VE_DISABLE check for debug TD for backtraces */ + if (debug) + pr_warn("%s\n", msg); + else + tdx_panic(msg); + return; + } + + /* Check if SEPT #VE has been disabled before us */ + tdg_vm_rd(TDCS_TD_CTLS, &controls); + if (controls & TD_CTLS_PENDING_VE_DISABLE) + return; + + /* Keep #VEs enabled for splats in debugging environments */ + if (debug) + return; + + /* Disable SEPT #VEs */ + tdg_vm_wr(TDCS_TD_CTLS, TD_CTLS_PENDING_VE_DISABLE, + TD_CTLS_PENDING_VE_DISABLE); + + return; +} + static void tdx_setup(u64 *cc_mask) { struct tdx_module_args args = {}; @@ -218,24 +274,12 @@ static void tdx_setup(u64 *cc_mask) gpa_width = args.rcx & GENMASK(5, 0); *cc_mask = BIT_ULL(gpa_width - 1); + td_attr = args.rdx; + /* Kernel does not use NOTIFY_ENABLES and does not need random #VEs */ tdg_vm_wr(TDCS_NOTIFY_ENABLES, 0, -1ULL); - /* - * The kernel can not handle #VE's when accessing normal kernel - * memory. Ensure that no #VE will be delivered for accesses to - * TD-private memory. Only VMM-shared memory (MMIO) will #VE. - */ - td_attr = args.rdx; - if (!(td_attr & ATTR_SEPT_VE_DISABLE)) { - const char *msg = "TD misconfiguration: SEPT_VE_DISABLE attribute must be set."; - - /* Relax SEPT_VE_DISABLE check for debug TD. */ - if (td_attr & ATTR_DEBUG) - pr_warn("%s\n", msg); - else - tdx_panic(msg); - } + disable_sept_ve(td_attr); } /* diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h index 7e12cfa28bec..fecb2a6e864b 100644 --- a/arch/x86/include/asm/shared/tdx.h +++ b/arch/x86/include/asm/shared/tdx.h @@ -19,9 +19,17 @@ #define TDG_VM_RD 7 #define TDG_VM_WR 8 -/* TDCS fields. To be used by TDG.VM.WR and TDG.VM.RD module calls */ +/* TDX TD-Scope Metadata. To be used by TDG.VM.WR and TDG.VM.RD */ +#define TDCS_CONFIG_FLAGS 0x1110000300000016 +#define TDCS_TD_CTLS 0x1110000300000017 #define TDCS_NOTIFY_ENABLES 0x9100000000000010 +/* TDCS_CONFIG_FLAGS bits */ +#define TDCS_CONFIG_FLEXIBLE_PENDING_VE BIT_ULL(1) + +/* TDCS_TD_CTLS bits */ +#define TD_CTLS_PENDING_VE_DISABLE BIT_ULL(0) + /* TDX hypercall Leaf IDs */ #define TDVMCALL_MAP_GPA 0x10001 #define TDVMCALL_GET_QUOTE 0x10002 -- 2.43.0

1 year, 5 months

3
5
0 0

[PATCH] acpi: Support CONFIG_ACPI without CONFIG_PCI

by Suraj Jitindar Singh

Make is possible to use ACPI without having CONFIG_PCI set. When initialising ACPI the following call chain occurs: acpi_init() -> acpi_bus_init() -> acpi_load_tables() -> acpi_ev_install_region_handlers() -> acpi_ev_install_region_handlers() calls acpi_ev_install_space_handler() on each of the default address spaces defined as: u8 acpi_gbl_default_address_spaces[ACPI_NUM_DEFAULT_SPACES] = { ACPI_ADR_SPACE_SYSTEM_MEMORY, ACPI_ADR_SPACE_SYSTEM_IO, ACPI_ADR_SPACE_PCI_CONFIG, ACPI_ADR_SPACE_DATA_TABLE }; However in acpi_ev_install_space_handler() the case statement for ACPI_ADR_SPACE_PCI_CONFIG is ifdef'd as: #ifdef ACPI_PCI_CONFIGURED case ACPI_ADR_SPACE_PCI_CONFIG: handler = acpi_ex_pci_config_space_handler; setup = acpi_ev_pci_config_region_setup; break; #endif ACPI_PCI_CONFIGURED is not defined if CONFIG_PCI is not enabled, thus the attempt to install the handler fails. Fix this by ifdef'ing ACPI_ADR_SPACE_PCI_CONFIG in the list of default address spaces. Fixes: bd23fac3eaaa ("ACPICA: Remove PCI bits from ACPICA when CONFIG_PCI is unset") CC: stable(a)vger.kernel.org # 5.0.x- Signed-off-by: Suraj Jitindar Singh <surajjs(a)amazon.com> --- drivers/acpi/acpica/evhandler.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/acpi/acpica/evhandler.c b/drivers/acpi/acpica/evhandler.c index 1c8cb6d924df..371093acb362 100644 --- a/drivers/acpi/acpica/evhandler.c +++ b/drivers/acpi/acpica/evhandler.c @@ -26,7 +26,9 @@ acpi_ev_install_handler(acpi_handle obj_handle, u8 acpi_gbl_default_address_spaces[ACPI_NUM_DEFAULT_SPACES] = { ACPI_ADR_SPACE_SYSTEM_MEMORY, ACPI_ADR_SPACE_SYSTEM_IO, +#ifdef ACPI_PCI_CONFIGURED ACPI_ADR_SPACE_PCI_CONFIG, +#endif ACPI_ADR_SPACE_DATA_TABLE }; -- 2.34.1

1 year, 5 months

3
3
0 0

2025

2024

2023

2022

2021

2020

2019

2018

2017

Linux-stable-mirror June 2024