On 6/25/26 08:10, Bryam Vargas via B4 Relay wrote:
> From: Bryam Vargas <hexlabsecurity(a)proton.me>
>
> begin_cpu_udmabuf() builds and caches ubuf->sg with an unserialised
> check-then-set, and end_cpu_udmabuf() reads the same field unlocked. The
> core invokes both cpu-access hooks without holding the reservation lock and
> DMA_BUF_IOCTL_SYNC is unlocked, so concurrent SYNC ioctls on a shared
> udmabuf fd race on ubuf->sg: two begins can both observe NULL and both call
> get_sg_table(), and the later store orphans the earlier table and its DMA
> mapping, which release_udmabuf() never frees. Each won race permanently
> leaks an sg_table and an unbalanced DMA mapping.
>
> Serialize both hooks under the buffer's reservation lock, as panfrost and
> panthor do. dma_buf_begin/end_cpu_access() already annotate might_lock() on
> that lock, so taking it here matches the documented contract.
> Single-threaded callers are unaffected.
>
> Fixes: 284562e1f348 ("udmabuf: implement begin_cpu_access/end_cpu_access hooks")
> Cc: stable(a)vger.kernel.org
> Signed-off-by: Bryam Vargas <hexlabsecurity(a)proton.me>
> ---
> Same leak-with-dangling-pointer class as CVE-2024-56712 (export_udmabuf()
> error path) -- a distinct site the 2024 fix does not cover.
>
> udmabuf is the only exporter that lazily builds its sg_table cache inside the
> cpu-access hook without serialising the check-then-set. The exporters that do
> comparable in-hook cache work all take a lock first: panfrost and panthor
> dma_resv_lock() (both hooks), omapdrm omap_obj->lock around its lazy page-get,
> the dma-heaps buffer->lock, and the TTM/GEM exporters (amdgpu, i915, xe) their
> object's reservation lock. tegra and videobuf2 take no lock here because they
> only sync an sg_table built earlier, so there is nothing to serialise.
>
> Confirmed with an out-of-tree A/B exercising the begin/begin race: this driver
> built as a module with get_sg_table()/put_sg_table() counting allocations
> against frees, driven by a userspace racer that creates 3000 udmabufs and fires
> DMA_BUF_IOCTL_SYNC(SYNC_START) from N threads on each shared fd.
>
> arm                              leaked sg_tables (of 3000 buffers)
> vulnerable, 4 threads            4761
> control, 1 thread                0
> patched (resv lock), 4 threads   0
>
> One sg_table and its DMA mapping leak per won race; the single-thread control
> does not leak, isolating the race; with the lock the lazy-init runs once per
> buffer (3000 allocations, zero leaked). end_cpu_udmabuf() is locked for the
> same field too: an unlocked end could otherwise observe the transient IS_ERR
> store begin makes before resetting ubuf->sg to NULL, and dereference it. In a
> tighter 5000-iteration loop the unpatched leak runs around 15-20 MB/s of slab.
> ---
> drivers/dma-buf/udmabuf.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index bced421c0d65..702ae13b97d1 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -226,6 +226,8 @@ static int begin_cpu_udmabuf(struct dma_buf *buf,
> struct device *dev = ubuf->device->this_device;
> int ret = 0;
>
> + dma_resv_lock(buf->resv, NULL);
Good catch, but we eventually wait for HW to finish while holding this lock,
so if possible, take it interruptibly here.
Apart from that, looks good to me,
Christian.
> +
> if (!ubuf->sg) {
> ubuf->sg = get_sg_table(dev, buf, direction);
> if (IS_ERR(ubuf->sg)) {
> @@ -238,6 +240,8 @@ static int begin_cpu_udmabuf(struct dma_buf *buf,
> dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> }
>
> + dma_resv_unlock(buf->resv);
> +
> return ret;
> }
>
> @@ -246,12 +250,18 @@ static int end_cpu_udmabuf(struct dma_buf *buf,
> {
> struct udmabuf *ubuf = buf->priv;
> struct device *dev = ubuf->device->this_device;
> + int ret = 0;
> +
> + dma_resv_lock(buf->resv, NULL);
>
> if (!ubuf->sg)
> - return -EINVAL;
> + ret = -EINVAL;
> + else
> + dma_sync_sgtable_for_device(dev, ubuf->sg, direction);
>
> - dma_sync_sgtable_for_device(dev, ubuf->sg, direction);
> - return 0;
> + dma_resv_unlock(buf->resv);
> +
> + return ret;
> }
>
> static const struct dma_buf_ops udmabuf_ops = {
>
> ---
> base-commit: 7eed1fb17959e721031555e5b5654083fe6a7d02
> change-id: 20260625-b4-disp-67d1f3db-0082918fdcb5
>
> Best regards,
> --
> Bryam Vargas <hexlabsecurity(a)proton.me>
>
>
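The locking rule in this patch can be illustrated outside the kernel. Below is a minimal userspace sketch, assuming POSIX threads: a pthread mutex stands in for buf->resv, and toy_ubuf, toy_get_sg_table(), toy_begin_cpu() and toy_end_cpu() are hypothetical names, not udmabuf code. It shows the whole check-then-set running under one lock, the lock's return value being propagated before ubuf->sg is touched (as dma_resv_lock_interruptible() would require, per the review above), and the table being built into a local variable so the shared field never holds a transient error value that an unlocked reader could trip over.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sg_table; only its lifetime matters. */
struct toy_sgt {
	int dummy;
};

/* Hypothetical stand-in for struct udmabuf and its reservation lock. */
struct toy_ubuf {
	pthread_mutex_t lock;	/* plays the role of buf->resv */
	struct toy_sgt *sg;	/* lazily built cache, like ubuf->sg */
	int allocs;		/* tables allocated; protected by lock */
};

/* get_sg_table() analogue: returns NULL on failure. Caller holds lock. */
static struct toy_sgt *toy_get_sg_table(struct toy_ubuf *ubuf)
{
	struct toy_sgt *sgt = calloc(1, sizeof(*sgt));

	if (sgt)
		ubuf->allocs++;
	return sgt;
}

/* begin_cpu_access analogue: the whole check-then-set runs under the lock. */
static int toy_begin_cpu(struct toy_ubuf *ubuf)
{
	/* The kernel would use dma_resv_lock_interruptible() here; like it,
	 * a failed acquisition is returned before ubuf->sg is touched. */
	int ret = pthread_mutex_lock(&ubuf->lock);

	if (ret)
		return -ret;

	if (!ubuf->sg) {
		/* Build into a local so ubuf->sg never holds an error value. */
		struct toy_sgt *sgt = toy_get_sg_table(ubuf);

		if (sgt)
			ubuf->sg = sgt;
		else
			ret = -ENOMEM;
	}
	/* The real hook would sync the table for the CPU here. */

	pthread_mutex_unlock(&ubuf->lock);
	return ret;
}

/* end_cpu_access analogue: reads the cached table under the same lock. */
static int toy_end_cpu(struct toy_ubuf *ubuf)
{
	int ret = 0;

	pthread_mutex_lock(&ubuf->lock);
	if (!ubuf->sg)
		ret = -EINVAL;
	/* else: the real hook would sync the table back to the device */
	pthread_mutex_unlock(&ubuf->lock);
	return ret;
}

static void *toy_racer(void *arg)
{
	toy_begin_cpu(arg);
	return NULL;
}

/* Race up to 64 begins on one shared buffer; return tables allocated. */
static int toy_race_begin(int nthreads)
{
	struct toy_ubuf ubuf = { .sg = NULL, .allocs = 0 };
	pthread_t tids[64];
	int i, created = 0, allocs;

	pthread_mutex_init(&ubuf.lock, NULL);
	for (i = 0; i < nthreads && i < 64; i++)
		if (pthread_create(&tids[created], NULL, toy_racer, &ubuf) == 0)
			created++;
	for (i = 0; i < created; i++)
		pthread_join(tids[i], NULL);

	/* The invariant the lock exists to guarantee. */
	allocs = ubuf.allocs;
	assert(allocs <= 1);
	free(ubuf.sg);
	pthread_mutex_destroy(&ubuf.lock);
	return allocs;
}
```

Because the lock covers the whole check-then-set, toy_race_begin() ends with a single allocation however many threads race, which matches the "lazy-init runs once per buffer" result reported for the patched arm. Removing the lock from toy_begin_cpu() would reopen the window in which two threads both observe sg == NULL.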
In a recent discussion with Philip and Danilo, the question came up of what
had already been tried, but never finished, to clean up the dma_fence
framework. So here are the different ideas I came up with but never fully
finished, with the patches themselves modernized and rebased on top of
drm-misc-next.
The main goal of these changes is to make it easier to implement dma_fence
backends and not to enforce unnecessary constraints on implementations.
As a first step, the locking around the dma_fence_ops.signaled callback is
made consistent by removing the dma_fence_is_signaled_locked() function.
This was mostly used by the backends themselves, but if polling the HW is
desired, the backends can call their own functions for this directly without
going through the dma-fence layer.
XE actually seems to be the only driver which makes use of that for a bit
more handling. For all other cases, testing the signaled flag should be enough.
Then the forced call to dma_fence_signal() is removed from the dma-fence
layer and moved into the backend implementations.
This allows the backend implementations to clean up after they have
signaled the fence. Such cleanup can include removing now-signaled fences
from lists, dropping references, starting work, etc.
Nouveau in particular seems to have a really messy workaround because of
that, involving DMA_FENCE_FLAG_USER_BITS and installing callbacks, because
the reference to the context couldn't be dropped directly after signaling.
As far as I can see, this can now be cleaned up.
In the long term, this should also allow reworking the error handling, e.g.
removing dma_fence_set_error() and instead giving the error as a mandatory
parameter to dma_fence_signal().
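The split of responsibilities described in the last two paragraphs can be sketched in userspace C. Everything here is hypothetical and only loosely modelled on dma_fence: toy_fence, toy_fence_ops, toy_backend and the helpers are not kernel API, and toy_fence_signal() taking the error as a mandatory argument illustrates the long-term idea above rather than an existing signature. The core helper only tests the signaled flag and asks the backend; the backend's callback polls its "hardware", signals the fence itself, and then does its cleanup in the same place.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FENCE_FLAG_SIGNALED (1u << 0)

struct toy_fence;

/* Hypothetical ops table, loosely modelled on dma_fence_ops.signaled. */
struct toy_fence_ops {
	bool (*signaled)(struct toy_fence *f);
};

/* Hypothetical backend state: what the "hardware" has completed. */
struct toy_backend {
	unsigned int hw_completed;	/* last seqno the HW finished */
	int pending;			/* fences still on the pending list */
};

struct toy_fence {
	const struct toy_fence_ops *ops;
	struct toy_backend *be;
	unsigned int seqno;
	unsigned int flags;
	int error;
	bool on_list;		/* still on the backend's pending list */
};

/* Hypothetical signal helper with the error as a mandatory argument. */
static void toy_fence_signal(struct toy_fence *f, int error)
{
	if (f->flags & TOY_FENCE_FLAG_SIGNALED)
		return;
	f->error = error;
	f->flags |= TOY_FENCE_FLAG_SIGNALED;
}

/* Core side: test the flag, otherwise ask the backend. The core does not
 * call toy_fence_signal() itself when the callback reports completion. */
static bool toy_fence_is_signaled(struct toy_fence *f)
{
	if (f->flags & TOY_FENCE_FLAG_SIGNALED)
		return true;
	return f->ops->signaled && f->ops->signaled(f);
}

/* Backend side: poll the HW (seqno wraparound ignored), signal, then
 * clean up immediately, e.g. list removal or dropping a reference. */
static bool toy_backend_signaled(struct toy_fence *f)
{
	if (f->be->hw_completed < f->seqno)
		return false;

	toy_fence_signal(f, 0);
	if (f->on_list) {
		assert(f->be->pending > 0);
		f->on_list = false;
		f->be->pending--;
	}
	return true;
}

static const struct toy_fence_ops toy_backend_ops = {
	.signaled = toy_backend_signaled,
};
```

Since the backend decides when the fence is signaled, the follow-up work (here, dropping the fence from its pending count) needs no extra callback or user flag bit to run after signaling.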
The last piece is to stop calling the enable_signaling callback with the
dma_fence lock held. This makes it possible for backends to acquire locks
that are semantically ordered outside of the dma_fence lock.
This is necessary to allow using the dma_fence inline lock in more cases;
previously, backends used a common external lock for their dma_fences, for
example to make it possible to remove fences from linked lists.
Please comment and review,
Christian.
On Tue, Jun 23, 2026 at 11:53:50PM +0100, David Laight wrote:
> On Tue, 23 Jun 2026 20:55:32 +0000
> Pranjal Shrivastava <praan(a)google.com> wrote:
>
> > On Tue, Jun 23, 2026 at 09:44:46AM +0100, David Laight wrote:
> >
> > Hi David,
> >
> > > On Tue, 23 Jun 2026 01:54:59 +0000
> > > David Hu <xuehaohu(a)google.com> wrote:
> > >
> > > > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > > > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > > > first entry, resulting in non-page-aligned DMA addresses for all
> > > > subsequent entries.
> > >
> > > There is a separate issue of whether this code is even needed at all.
> > > Where can transfers over 2G (never mind 4G) actually come from?
> > >
> > > The read, write and similar system calls limit transfers to INT_MAX
> > > (even on 64bit) and a lot of driver code will need fixing if longer
> > > lengths are allowed though.
> > > io_uring better enforce the same limits.
> > > So the transfers can come directly from userspace.
> > >
> > > Not only that but you also need a single physically contiguous buffer.
> > > Good luck allocating that!
> > >
> > > Now maybe there are some peer-to-peer places where the large buffer
> > > is device memory, but they will be unusual and probably need
> > > special treatment anyway.
> > >
> >
> > I agree that traditional VFS read/write face the MAX_RW_COUNT limit
> > (~2GB), and io_uring has its limits, but I'm a little confused by the
> > push to enforce these limits here in the SGL code?
> >
> > File I/O seems to be only one side of the picture. In my view, this fix
> > is necessary and certainly has a use-case:
> >
> > For example, the RDMA subsystem has the capability to import dmabufs [1],
> > which gives rise to use cases for dmabuf beyond standard file ops
> > (via VFS/io_uring).
> >
> > In these scenarios, GPU HBM can be exported as dmabufs. With recent GPUs,
> > HBM capacity can be in the order of hundreds of GBs [2]. RDMA can employ
> > infrastructure like the vfio-dmabuf-exporter [3] or similar dmabuf
> > exporters to frequently move huge blocks of data via P2PDMA.
>
> Ok, that explains where big buffers can come from.
> I just wasn't sure.
>
> > If we restrict incoming dmabuf transfers to fit within VFS-centric
> > limits (2GB), we impose unnecessary overhead on the RDMA stack, forcing
> > it to manage a significantly higher number of memory registrations. By
> > cleanly splitting these massive contiguous device buffers into
> > page-aligned SGL entries, we directly improve the efficiency of P2P
> > transfers and memory registration.
>
> But a divide by '4G - PAGE_SIZE' is also non-trivial (and, I think, affects
> a lot of I/O) when the quotient is always 1.
> Splitting into 2G chunks is a lot cheaper.
>
> > Since this change doesn't seem to have a negative impact on standard file
> > I/O or break existing VFS constraints, I'm curious why we shouldn't
> > support splitting these >4GB P2P transfers? Am I missing something?
>
> I was only wondering whether it was needed...
> It does bring up the question of why the >4GB transfers even need splitting.
> But that is another question.
Just a side note:
In our vision, we aim to transition DMABUF to use physical
addresses directly https://lore.kernel.org/all/0-v1-b5cab63049c0+191af-dmabuf_map_type_jgg@nvi…
and eliminate the scatter‑gather layer from the DMABUF path.
Thanks
>
> If you want to split large transfers into 4G-PAGE_SIZE blocks
> it is probably worth having a quick test that returns 1 for 'small' buffers.
>
> David
>
> >
> > Thanks,
> > Praan
> >
> > [1] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/infiniband/core/umem…
> > [2] https://nvdam.widen.net/s/fdvdqvfvj2/hopper-h200-nvl-product-brief (Table 2-2)
> > [3] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/vfio/pci/vfio_pci_dm…
> >
>
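David's two points, page alignment of the split and the cost of the division, can be shown together in a small userspace model. This is an illustrative sketch, not fill_sg_entry() itself: it assumes 4 KiB pages and a 32-bit segment length field, and toy_nr_segs()/toy_seg_len() are hypothetical names. Splitting at UINT_MAX leaves the first segment 4095 bytes past a page boundary, whereas splitting at UINT_MAX rounded down to a page ('4G - PAGE_SIZE') keeps every segment boundary page aligned; the early return skips the 64-bit division for the common single-segment case.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

/* Largest page-aligned length that fits a 32-bit segment length:
 * UINT32_MAX rounded down to a page, i.e. 4G - PAGE_SIZE (0xFFFFF000). */
#define TOY_MAX_SEG ((uint64_t)UINT32_MAX & ~(TOY_PAGE_SIZE - 1))

static_assert(TOY_MAX_SEG % TOY_PAGE_SIZE == 0, "split size must be page aligned");

/* Number of segments for one contiguous range of len bytes. */
static uint64_t toy_nr_segs(uint64_t len)
{
	if (len == 0)
		return 0;
	/* Fast path: almost every range fits in one segment, no division. */
	if (len <= TOY_MAX_SEG)
		return 1;
	return len / TOY_MAX_SEG + (len % TOY_MAX_SEG != 0);
}

/* Length of segment idx; all but the last are exactly TOY_MAX_SEG, so if
 * the range starts on a page boundary every segment does too. */
static uint64_t toy_seg_len(uint64_t len, uint64_t idx)
{
	uint64_t off;

	if (idx >= toy_nr_segs(len))
		return 0;
	off = idx * TOY_MAX_SEG;
	return len - off < TOY_MAX_SEG ? len - off : TOY_MAX_SEG;
}
```

For example, a 4 GiB range becomes one 0xFFFFF000-byte segment followed by one 4096-byte segment, and a 100 GiB device buffer needs 26 entries.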
On 6/24/26 14:52, Yousef Alhouseen wrote:
> UDMABUF_CREATE_LIST copies an array whose element count comes from
> userspace. The count is compared against list_limit, but list_limit is a
> signed module parameter while the count is u32.
We should probably just drop the sign from the module parameter instead.
I don't see a use case for negative values here.
Regards,
Christian.
>
> If the limit is raised too far or made negative, that comparison no
> longer bounds the count to a range where sizeof(*list) * count fits in
> the u32 temporary used for the copy length. A wrapped copy length lets
> memdup_user() copy fewer entries than udmabuf_create() subsequently
> walks, leading to out-of-bounds reads from the copied list.
>
> Take a positive snapshot of the module limit and use memdup_array_user()
> so the multiplication is checked before copying.
>
> Signed-off-by: Yousef Alhouseen <alhouseenyousef(a)gmail.com>
> ---
> drivers/dma-buf/udmabuf.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index bced421c0..b4078ec84 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -469,14 +469,15 @@ static long udmabuf_ioctl_create_list(struct file *filp, unsigned long arg)
> struct udmabuf_create_list head;
> struct udmabuf_create_item *list;
> int ret = -EINVAL;
> - u32 lsize;
> + int limit;
>
> if (copy_from_user(&head, (void __user *)arg, sizeof(head)))
> return -EFAULT;
> - if (head.count > list_limit)
> + limit = READ_ONCE(list_limit);
> + if (!head.count || limit <= 0 || head.count > limit)
> return -EINVAL;
> - lsize = sizeof(struct udmabuf_create_item) * head.count;
> - list = memdup_user((void __user *)(arg + sizeof(head)), lsize);
> + list = memdup_array_user((void __user *)(arg + sizeof(head)),
> + head.count, sizeof(*list));
> if (IS_ERR(list))
> return PTR_ERR(list);
>
> --
> 2.54.0
>
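Both failure modes in the commit message, and the unsigned-limit alternative suggested in the reply, fit in a small userspace model. This is a sketch under stated assumptions: toy_create_item is assumed to match the 24-byte layout of struct udmabuf_create_item, the toy_* helpers are hypothetical, and __builtin_mul_overflow() (a GCC/Clang builtin, which the kernel's check_mul_overflow() wraps) stands in for the check memdup_array_user() performs. With 24-byte entries, a count of 0x0AAAAAAB needs 0x100000008 bytes, which wraps to 8 in a u32, so memdup_user() would copy only 8 bytes while the caller still walks 0x0AAAAAAB entries; and a limit of -1 compared against a u32 count is converted to UINT_MAX, so the old check rejects nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match struct udmabuf_create_item: u32 memfd, u32 __pad,
 * u64 offset, u64 size (24 bytes). */
struct toy_create_item {
	uint32_t memfd;
	uint32_t pad;
	uint64_t offset;
	uint64_t size;
};

static_assert(sizeof(struct toy_create_item) == 24, "unexpected layout");

/* Old bound check: comparing a u32 count against a signed int limit
 * converts the limit to unsigned (implicitly, in the original
 * comparison), so a limit of -1 becomes UINT32_MAX. */
static bool toy_old_rejects(uint32_t count, int limit)
{
	return count > (uint32_t)limit;
}

/* Old copy length: computed, then truncated into a u32 temporary. */
static uint32_t toy_old_copy_len(uint32_t count)
{
	return (uint32_t)(sizeof(struct toy_create_item) * count);
}

/* Checked length in the spirit of memdup_array_user(): refuse instead of
 * wrapping. __builtin_mul_overflow() is a GCC/Clang builtin. */
static bool toy_checked_copy_len(uint64_t count, uint64_t size, uint64_t *len)
{
	return !__builtin_mul_overflow(count, size, len);
}

/* Bound check with an unsigned limit, as suggested in the review. */
static bool toy_new_rejects(uint32_t count, unsigned int limit)
{
	return count == 0 || count > limit;
}
```

With an unsigned limit there is no negative value to convert, and the checked multiplication makes the copy length either exact or an error, never silently shorter than count entries.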
The entity->last_scheduled field has always been set and read with
special RCU functions in addition to memory barriers. There is no
obvious reason for that, since the entity lock is available and taken at all
places that evaluate the last_scheduled field. The only exception is
drm_sched_entity_error(), which is not performance critical in any way.
Improve robustness, readability and maintainability by replacing RCU and
barriers with the lock.
As a preparatory step, and while at it, also guard spsc_queue_pop() with
the lock, since spsc_queue is deprecated and is supposed to be replaced
with a locked list.
Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
---
Tested with drm_sched unit tests, which all ran fine.
---
drivers/gpu/drm/scheduler/sched_entity.c | 49 +++++++++++-------------
include/drm/gpu_scheduler.h | 9 ++---
2 files changed, 26 insertions(+), 32 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c51101ec70c1..95b2c48a604a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -135,7 +135,6 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
entity->num_sched_list = num_sched_list;
entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
entity->rq = &sched_list[0]->rq;
- RCU_INIT_POINTER(entity->last_scheduled, NULL);
RB_CLEAR_NODE(&entity->rb_tree_node);
init_completion(&entity->entity_idle);
@@ -201,10 +200,10 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
struct dma_fence *fence;
int r;
- rcu_read_lock();
- fence = rcu_dereference(entity->last_scheduled);
+ spin_lock(&entity->lock);
+ fence = entity->last_scheduled;
r = fence ? fence->error : 0;
- rcu_read_unlock();
+ spin_unlock(&entity->lock);
return r;
}
@@ -288,8 +287,10 @@ void drm_sched_entity_kill(struct drm_sched_entity *entity)
wait_for_completion(&entity->entity_idle);
/* The entity is guaranteed to not be used by the scheduler */
- prev = rcu_dereference_check(entity->last_scheduled, true);
+ spin_lock(&entity->lock);
+ prev = entity->last_scheduled;
dma_fence_get(prev);
+ spin_unlock(&entity->lock);
while ((job = drm_sched_entity_queue_pop(entity))) {
struct drm_sched_fence *s_fence = job->s_fence;
@@ -381,8 +382,8 @@ void drm_sched_entity_fini(struct drm_sched_entity *entity)
entity->dependency = NULL;
}
- dma_fence_put(rcu_dereference_check(entity->last_scheduled, true));
- RCU_INIT_POINTER(entity->last_scheduled, NULL);
+ dma_fence_put(entity->last_scheduled);
+ WRITE_ONCE(entity->last_scheduled, NULL);
drm_sched_entity_stats_put(entity->stats);
}
EXPORT_SYMBOL(drm_sched_entity_fini);
@@ -523,18 +524,18 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
if (entity->guilty && atomic_read(entity->guilty))
dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
- dma_fence_put(rcu_dereference_check(entity->last_scheduled, true));
- rcu_assign_pointer(entity->last_scheduled,
- dma_fence_get(&sched_job->s_fence->finished));
+ spin_lock(&entity->lock);
+ dma_fence_put(entity->last_scheduled);
+ entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
- /*
- * If the queue is empty we allow drm_sched_entity_select_rq() to
- * locklessly access ->last_scheduled. This only works if we set the
- * pointer before we dequeue and if we a write barrier here.
+ /* A recent rework required taking the spinlock above. Since spsc_queue
+ * is scheduled for removal as per the DRM-TODO-list, we access it here
+ * locked already to prepare for that cleanup.
+ *
+ * TODO: Fully replace spsc_queue with a locked (h)list.
*/
- smp_wmb();
-
spsc_queue_pop(&entity->job_queue);
+ spin_unlock(&entity->lock);
drm_sched_rq_pop_entity(entity);
@@ -561,21 +562,15 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
if (spsc_queue_count(&entity->job_queue))
return;
- /*
- * Only when the queue is empty are we guaranteed that
- * drm_sched_run_job_work() cannot change entity->last_scheduled. To
- * enforce ordering we need a read barrier here. See
- * drm_sched_entity_pop_job() for the other side.
- */
- smp_rmb();
-
- fence = rcu_dereference_check(entity->last_scheduled, true);
+ spin_lock(&entity->lock);
+ fence = entity->last_scheduled;
/* stay on the same engine if the previous job hasn't finished */
- if (fence && !dma_fence_is_signaled(fence))
+ if (fence && !dma_fence_is_signaled(fence)) {
+ spin_unlock(&entity->lock);
return;
+ }
- spin_lock(&entity->lock);
sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
rq = sched ? &sched->rq : NULL;
if (rq != entity->rq) {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index d61c19e78182..176ff1f936cd 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -100,7 +100,8 @@ struct drm_sched_entity {
* @lock:
*
* Lock protecting the run-queue (@rq) to which this entity belongs,
- * @priority and the list of schedulers (@sched_list, @num_sched_list).
+ * @priority, @last_scheduled and the list of schedulers (@sched_list,
+ * @num_sched_list).
*/
spinlock_t lock;
@@ -202,11 +203,9 @@ struct drm_sched_entity {
/**
* @last_scheduled:
*
- * Points to the finished fence of the last scheduled job. Only written
- * by drm_sched_entity_pop_job(). Can be accessed locklessly from
- * drm_sched_job_arm() if the queue is empty.
+ * Points to the finished fence of the last scheduled job.
*/
- struct dma_fence __rcu *last_scheduled;
+ struct dma_fence *last_scheduled;
/**
* @last_user: last group leader pushing a job into the entity.
base-commit: 60b5fa6edfef867322fce7c8306e5c4b46211be7
--
2.54.0
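The locking rule this patch adopts can be modelled in userspace. This is a minimal sketch, assuming a pthread mutex in place of entity->lock; toy_entity, toy_fence and the toy_* helpers are hypothetical and only mirror the shape of drm_sched_entity_pop_job(), drm_sched_entity_select_rq() and drm_sched_entity_fini(). The point is that the put of the old fence, the get of the new one, and every reader's "has the previous job finished?" decision all happen under the same lock, so no RCU accessors or smp_wmb()/smp_rmb() pairing are needed to keep them consistent. Unlike the patch, which uses WRITE_ONCE() in drm_sched_entity_fini(), this model simply takes the lock there too.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fence: a plain refcount is enough because this model has
 * a single entity and every access happens under its lock. */
struct toy_fence {
	int refcount;
	bool signaled;
};

static struct toy_fence *toy_fence_get(struct toy_fence *f)
{
	if (f)
		f->refcount++;
	return f;
}

static void toy_fence_put(struct toy_fence *f)
{
	if (f) {
		assert(f->refcount > 0);
		f->refcount--;	/* a real put would free the fence at zero */
	}
}

/* Hypothetical entity: one lock guards last_scheduled, with no RCU
 * accessors and no smp_wmb()/smp_rmb() pairing. */
struct toy_entity {
	pthread_mutex_t lock;
	struct toy_fence *last_scheduled;
};

/* pop_job analogue: drop the old fence and cache the new one as a single
 * step with respect to every reader that takes the lock. */
static void toy_entity_pop_job(struct toy_entity *e, struct toy_fence *finished)
{
	pthread_mutex_lock(&e->lock);
	toy_fence_put(e->last_scheduled);
	e->last_scheduled = toy_fence_get(finished);
	pthread_mutex_unlock(&e->lock);
}

/* select_rq analogue: the entity may move to another scheduler only if
 * the previous job has finished; read and decision share one lock. */
static bool toy_entity_may_migrate(struct toy_entity *e)
{
	bool ok;

	pthread_mutex_lock(&e->lock);
	ok = !e->last_scheduled || e->last_scheduled->signaled;
	pthread_mutex_unlock(&e->lock);
	return ok;
}

/* fini analogue: release the cached fence. */
static void toy_entity_fini(struct toy_entity *e)
{
	pthread_mutex_lock(&e->lock);
	toy_fence_put(e->last_scheduled);
	e->last_scheduled = NULL;
	pthread_mutex_unlock(&e->lock);
}
```

The cost is one uncontended lock round trip per read, which is the trade-off the commit message argues is acceptable given that the lock is already taken at these call sites.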
On 12/06/2026 3:31 pm, Matt Evans wrote:
> Hi Kevin, Pranjal, (+Robin, hi!)
Oh hey there! :)
> On 12/06/2026 04:39, Tian, Kevin wrote:
>>> From: Pranjal Shrivastava <praan(a)google.com>
>>> Sent: Friday, June 12, 2026 2:38 AM
>>>
>>> On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
>>>> --- a/drivers/pci/Kconfig
>>>> +++ b/drivers/pci/Kconfig
>>>> @@ -206,11 +206,7 @@ config PCIE_TPH
>>>> config PCI_P2PDMA
>>>> bool "PCI peer-to-peer transfer support"
>>>> depends on ZONE_DEVICE
>>>> - #
>>>> - # The need for the scatterlist DMA bus address flag means PCI P2PDMA
>>>> - # requires 64bit
>>>> - #
>>>> - depends on 64BIT
>>>> + select PCI_P2PDMA_CORE
>>>> select GENERIC_ALLOCATOR
>>>> select NEED_SG_DMA_FLAGS
>>>> help
>>>
>>> Nit: Did we drop depends on 64BIT intentionally here? I guess the full
>>> PCI_P2PDMA stack still selects NEED_SG_DMA_FLAGS? IIRC,
>>> NEED_SG_DMA_FLAGS doesn't select 64BIT?
>>
>> seems that comment is stale. According to the commit msg:
>>
>> " it would make vfio-pci only available if CONFIG_ZONE_DEVICE is
>> present (e.g. 64-bit systems), "
>>
>> so it sounds like a redundant dependency, hence it was removed.
>
> This was intentional. In practice there is still a dependency on 64BIT
> for PCI_P2PDMA, but it is because of ZONE_DEVICE (and mem hotplug). The
> key need is for PCI_P2PDMA_CORE to be available on !64BIT for VFIO, but I
> didn't see a requirement from PCI_P2PDMA itself (as opposed to its
> dependencies). If I've missed one, I can put it back...
>
> But NEED_SG_DMA_FLAGS doesn't smell quite right; I see from comments in
>
> af2880ec44021 ("scatterlist: add dedicated config for DMA flags")
>
> that it assumes 64BIT, but it seems to be missing a "depends on 64BIT".
>
> Robin -- should that depend on 64BIT?
Indeed, looking at the history it seems like that was overlooked, but it
worked out at the time since the only selector of NEED_SG_DMA_FLAGS was
PCI_P2PDMA as you say. If we're now generalising then moving the
explicit 64BIT dependency to NEED_SG_DMA_FLAGS itself sounds like the
right thing to do.
Cheers,
Robin.