From: "Yury Norov [NVIDIA]" <yury.norov@gmail.com>
[ Upstream commit 59f7d2138591ef8f0e4e4ab5f1ab674e8181ad3a ]
The function divides the number of online CPUs by num_core_siblings, and only later checks the divisor for zero. This implies a possible divide-by-zero runtime error. Fix it by moving the check before the division. This also saves one indentation level.
Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-3-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees. Here's my extensive analysis:
## Bug Fix Analysis
The commit fixes a **genuine divide-by-zero bug** in the `find_hw_thread_mask()` function. The code changes show:
1. **Original bug**: The division `num_cores_per_socket = node_affinity.num_online_cpus / affinity->num_core_siblings / node_affinity.num_online_nodes` occurs at lines 967-969 BEFORE checking if `num_core_siblings > 0` at line 972.
2. **The fix**: Moves the check `if (affinity->num_core_siblings == 0) return;` to lines 971-972 of the new code, BEFORE the division (now a separate assignment at lines 974-976), preventing the divide-by-zero.
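To make the reordering concrete, here is a minimal userspace sketch of the patched function. It is a model, not driver code: a `uint64_t` stands in for `cpumask_var_t` (so at most 64 CPUs), and the names `model_hw_thread_mask`, `mask_weight`, `mask_from` and `NCPUS` are hypothetical. Like the patch, it assumes `num_online_nodes` is nonzero, and the CPU numbering in the worked example (first threads on CPUs 0-7, their siblings on 8-15) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define NCPUS 64u /* hypothetical model size: one bit per CPU in a uint64_t */

/* Number of set bits (stands in for cpumask_weight()). */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int w = 0;

	while (mask) {
		mask &= mask - 1; /* clear the lowest set bit */
		w++;
	}
	return w;
}

/*
 * First set bit at or after start, or NCPUS if none.
 * mask_from(0, m) ~ cpumask_first(), mask_from(c + 1, m) ~ cpumask_next().
 */
static unsigned int mask_from(unsigned int start, uint64_t mask)
{
	unsigned int i;

	for (i = start; i < NCPUS; i++)
		if (mask & ((uint64_t)1 << i))
			return i;
	return NCPUS;
}

/* Hypothetical userspace model of the patched find_hw_thread_mask(). */
static uint64_t model_hw_thread_mask(uint64_t proc_mask, unsigned int hw_thread_no,
				     unsigned int num_online_cpus,
				     unsigned int num_core_siblings,
				     unsigned int num_online_nodes)
{
	uint64_t mask = proc_mask; /* cpumask_copy() happens before the check */
	unsigned int cores, possible, curr, i, shift;

	/* The fix: test the divisor before it is used. */
	if (num_core_siblings == 0)
		return mask;

	cores = num_online_cpus / num_core_siblings / num_online_nodes;

	/* Keep the first cores * nodes CPUs of the mask, clear the rest. */
	possible = mask_weight(mask);
	curr = mask_from(0, mask);
	for (i = 0; i < cores * num_online_nodes; i++)
		curr = mask_from(curr + 1, mask);
	for (; i < possible; i++) {
		mask &= ~((uint64_t)1 << curr); /* curr < NCPUS while i < possible */
		curr = mask_from(curr + 1, mask);
	}

	/*
	 * Select the requested HW thread (cpumask_shift_left()): bits shifted
	 * past the model width are dropped, so a shift >= NCPUS empties it.
	 */
	shift = cores * num_online_nodes * hw_thread_no;
	return shift >= NCPUS ? 0 : mask << shift;
}

/* Worked example: 2 nodes x 4 cores x 2-way SMT = 16 CPUs (assumed layout). */
static void example(void)
{
	assert(model_hw_thread_mask(0xFFFF, 0, 16, 2, 2) == 0x00FF);
	assert(model_hw_thread_mask(0xFFFF, 1, 16, 2, 2) == 0xFF00);
	/* num_core_siblings == 0: early return with the copied mask. */
	assert(model_hw_thread_mask(0xFFFF, 1, 16, 0, 2) == 0xFFFF);
}
```

In the worked example, 16 online CPUs / 2 siblings / 2 nodes gives 4 cores per socket, so the first 8 CPUs are kept and `hw_thread_no == 1` shifts them up by 8. With `num_core_siblings == 0` the model returns the copied mask instead of dividing by zero.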
## When the Bug Can Trigger
The `num_core_siblings` value is initialized as:
```c
cpumask_weight(topology_sibling_cpumask(cpumask_first(&node_affinity.proc.mask)))
```
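The commit message does not name a concrete trigger. On common architectures a CPU's sibling mask includes the CPU itself, so the weight is normally at least 1, even on single-core systems without SMT/hyperthreading or with SMT disabled at runtime. A zero divisor therefore points to an unusual topology, for example:
- Virtualized environments with unusual CPU topology
- Architectures or configurations where `topology_sibling_cpumask()` returns an empty mask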
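The initialization expression picks the first CPU in `proc.mask` and counts the CPUs in that CPU's sibling mask. The sketch below models this in userspace with 64-bit masks; `siblings[]` is a hypothetical stand-in for `topology_sibling_cpumask()`, and `model_core_siblings`, `mask_weight`, `mask_first` and `NCPUS` are illustrative names. In this model the result is 0 only when the chosen CPU's sibling mask is empty.

```c
#include <assert.h>
#include <stdint.h>

#define NCPUS 64u /* hypothetical model size: one bit per CPU in a uint64_t */

/* Number of set bits (stands in for cpumask_weight()). */
static unsigned int mask_weight(uint64_t mask)
{
	unsigned int w = 0;

	while (mask) {
		mask &= mask - 1; /* clear the lowest set bit */
		w++;
	}
	return w;
}

/* Lowest set bit, or NCPUS if the mask is empty (cf. cpumask_first()). */
static unsigned int mask_first(uint64_t mask)
{
	unsigned int i;

	for (i = 0; i < NCPUS; i++)
		if (mask & ((uint64_t)1 << i))
			return i;
	return NCPUS;
}

/*
 * Hypothetical model of the num_core_siblings initialization:
 * siblings[cpu] stands in for topology_sibling_cpumask(cpu).
 * Returning 0 for an empty proc_mask is a choice of this model only.
 */
static unsigned int model_core_siblings(uint64_t proc_mask,
					const uint64_t siblings[NCPUS])
{
	unsigned int first = mask_first(proc_mask);

	if (first >= NCPUS)
		return 0;
	return mask_weight(siblings[first]);
}

static void example(void)
{
	uint64_t sib[NCPUS] = { 0 };

	sib[0] = sib[1] = 0x3;	/* 2-way SMT: CPUs 0 and 1 are siblings */
	assert(model_core_siblings(0xF, sib) == 2);

	sib[0] = 0x1;		/* no SMT: the mask holds only CPU 0 itself */
	assert(model_core_siblings(0xF, sib) == 1);

	sib[0] = 0;		/* only an empty sibling mask yields 0 */
	assert(model_core_siblings(0xF, sib) == 0);
}
```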
## Stable Kernel Criteria Met
1. **Fixes a real bug**: ✓ - Prevents a kernel divide-by-zero crash
2. **Small and contained**: ✓ - 24 insertions and 20 deletions, all in one function
3. **No side effects**: ✓ - When `num_core_siblings == 0`, the early return leaves `hw_thread_mask` as a copy of `affinity->proc.mask`, the outcome the old `> 0` guard intended but never reached because the division ran first
4. **No architectural changes**: ✓ - Simple defensive programming fix
5. **Clear bug fix**: ✓ - Not a feature or optimization
6. **Low regression risk**: ✓ - Only moves the zero check; behavior for nonzero `num_core_siblings` is unchanged
## Impact Assessment
- **Severity**: Medium-High - Can cause a kernel oops or panic on affected systems
- **Affected systems**: HFI1 InfiniBand hardware on systems with specific CPU configurations
- **User impact**: System crash when the HFI1 driver computes CPU affinity in `find_hw_thread_mask()` on vulnerable configurations
The commit message clearly describes a possible divide-by-zero, and the code change unambiguously moves a zero-check before a division operation that uses that value as divisor. This is a textbook example of a bug fix that should be backported to stable kernels to prevent crashes on systems with certain CPU topologies.
 drivers/infiniband/hw/hfi1/affinity.c | 44 +++++++++++++++------------
 1 file changed, 24 insertions(+), 20 deletions(-)
diff --git a/drivers/infiniband/hw/hfi1/affinity.c b/drivers/infiniband/hw/hfi1/affinity.c
index 7ead8746b79b..f2c530ab85a5 100644
--- a/drivers/infiniband/hw/hfi1/affinity.c
+++ b/drivers/infiniband/hw/hfi1/affinity.c
@@ -964,31 +964,35 @@ static void find_hw_thread_mask(uint hw_thread_no, cpumask_var_t hw_thread_mask,
 				struct hfi1_affinity_node_list *affinity)
 {
 	int possible, curr_cpu, i;
-	uint num_cores_per_socket = node_affinity.num_online_cpus /
+	uint num_cores_per_socket;
+
+	cpumask_copy(hw_thread_mask, &affinity->proc.mask);
+
+	if (affinity->num_core_siblings == 0)
+		return;
+
+	num_cores_per_socket = node_affinity.num_online_cpus /
 					affinity->num_core_siblings /
 						node_affinity.num_online_nodes;
 
-	cpumask_copy(hw_thread_mask, &affinity->proc.mask);
-	if (affinity->num_core_siblings > 0) {
-		/* Removing other siblings not needed for now */
-		possible = cpumask_weight(hw_thread_mask);
-		curr_cpu = cpumask_first(hw_thread_mask);
-		for (i = 0;
-		     i < num_cores_per_socket * node_affinity.num_online_nodes;
-		     i++)
-			curr_cpu = cpumask_next(curr_cpu, hw_thread_mask);
-
-		for (; i < possible; i++) {
-			cpumask_clear_cpu(curr_cpu, hw_thread_mask);
-			curr_cpu = cpumask_next(curr_cpu, hw_thread_mask);
-		}
+	/* Removing other siblings not needed for now */
+	possible = cpumask_weight(hw_thread_mask);
+	curr_cpu = cpumask_first(hw_thread_mask);
+	for (i = 0;
+	     i < num_cores_per_socket * node_affinity.num_online_nodes;
+	     i++)
+		curr_cpu = cpumask_next(curr_cpu, hw_thread_mask);
 
-		/* Identifying correct HW threads within physical cores */
-		cpumask_shift_left(hw_thread_mask, hw_thread_mask,
-				   num_cores_per_socket *
-				   node_affinity.num_online_nodes *
-				   hw_thread_no);
+	for (; i < possible; i++) {
+		cpumask_clear_cpu(curr_cpu, hw_thread_mask);
+		curr_cpu = cpumask_next(curr_cpu, hw_thread_mask);
 	}
+
+	/* Identifying correct HW threads within physical cores */
+	cpumask_shift_left(hw_thread_mask, hw_thread_mask,
+			   num_cores_per_socket *
+			   node_affinity.num_online_nodes *
+			   hw_thread_no);
 }
 
 int hfi1_get_proc_affinity(int node)