On Tue, Jun 23, 2026 at 11:53:50PM +0100, David Laight wrote:
> On Tue, 23 Jun 2026 20:55:32 +0000
> Pranjal Shrivastava <praan(a)google.com> wrote:
>
> > On Tue, Jun 23, 2026 at 09:44:46AM +0100, David Laight wrote:
> >
> > Hi David,
> >
> > > On Tue, 23 Jun 2026 01:54:59 +0000
> > > David Hu <xuehaohu(a)google.com> wrote:
> > >
> > > > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > > > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > > > first entry, resulting in non-page-aligned DMA addresses for all
> > > > subsequent entries.
> > >
> > > There is a separate issue of whether this code is even needed at all.
> > > Where can transfers over 2G (never mind 4G) actually come from?
> > >
> > > The read, write and similar system calls limit transfers to INT_MAX
> > > (even on 64bit) and a lot of driver code will need fixing if longer
> > > lengths are allowed, though.
> > > io_uring had better enforce the same limits.
> > > So the transfers can't come directly from userspace.
> > >
> > > Not only that but you also need a single physically contiguous buffer.
> > > Good luck allocating that!
> > >
> > > Now maybe there are some peer-to-peer places where the large buffer
> > > is device memory, but they will be unusual and probably need
> > > special treatment anyway.
> > >
> >
> > I agree that traditional VFS read/write face the MAX_RW_COUNT limit
> > (~2GB), and io_uring has its limits, but I'm a little confused by the
> > push to enforce these limits here in the SGL code?
> >
> > File I/O seems to be only one side of the picture. In my view, this fix
> > is necessary and certainly has a use-case:
> >
> > For example, the RDMA subsystem has the capability to import dmabufs [1],
> > which gives rise to use cases for dmabuf beyond standard file ops
> > (via VFS/io_uring).
> >
> > In these scenarios, GPU HBM can be exported as dmabufs. With recent GPUs,
> > HBM capacity can be in the order of hundreds of GBs [2]. RDMA can employ
> > infrastructure like the vfio-dmabuf-exporter [3] or similar dmabuf
> > exporters to frequently move huge blocks of data via P2PDMA.
>
> Ok, that explains where big buffers can come from.
> I just wasn't sure.
>
> > If we restrict incoming dmabuf transfers to fit within VFS-centric
> > limits (2GB), we impose unnecessary overhead on the RDMA stack, forcing
> > it to manage a significantly higher number of memory registrations. By
> > cleanly splitting these massive contiguous device buffers into
> > page-aligned SGL entries, we directly improve the efficiency of P2P
> > transfers and memory registration.
>
> But a divide by '4G - PAGE_SIZE' is also non-trivial and (I think) affects
> a lot of I/O where the quotient is always 1.
> Splitting into 2G chunks is a lot cheaper.
>
> > Since this change doesn't seem to have a negative impact on standard file
> > I/O or break existing VFS constraints, I'm curious why we shouldn't
> > support splitting these >4GB P2P transfers? Am I missing something?
>
> I was only wondering whether it was needed...
> It does bring up the question of why the >4GB transfers even need splitting.
> But that is another question.
Just a side note:
In our vision, we aim to transition DMABUF to use physical
addresses directly https://lore.kernel.org/all/0-v1-b5cab63049c0+191af-dmabuf_map_type_jgg@nvi…
and eliminate the scatter‑gather layer from the DMABUF path.
Thanks
>
> If you want to split large transfers into 4G-PAGE_SIZE blocks
> it is probably worth having a quick test that returns 1 for 'small' buffers.
>
> David
>
> >
> > Thanks,
> > Praan
> >
> > [1] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/infiniband/core/umem…
> > [2] https://nvdam.widen.net/s/fdvdqvfvj2/hopper-h200-nvl-product-brief (Table 2-2)
> > [3] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/vfio/pci/vfio_pci_dm…
> >
>
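For illustration, the arithmetic discussed in this thread can be sketched in user-space C. This is not the kernel's fill_sg_entry(); the names SKETCH_PAGE_SIZE, SKETCH_MAX_ALIGNED, entry_page_offset() and nr_entries() are hypothetical, and a 4 KiB page size is assumed. The sketch shows why a UINT_MAX chunk leaves the second entry misaligned, and what a "return 1 for small buffers" early exit ahead of the division could look like.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096ULL                             /* assumed page size */
#define SKETCH_MAX_ALIGNED ((1ULL << 32) - SKETCH_PAGE_SIZE)   /* "4G - PAGE_SIZE" */

/*
 * Offset within a page at which entry number idx starts when a
 * contiguous region beginning at base is cut into max_len-byte chunks.
 */
static uint64_t entry_page_offset(uint64_t base, uint64_t max_len, uint64_t idx)
{
	return (base + idx * max_len) % SKETCH_PAGE_SIZE;
}

/*
 * Number of entries needed for len bytes. The early return covers the
 * common case (quotient of 1) without performing a 64-bit division.
 */
static uint64_t nr_entries(uint64_t len, uint64_t max_len)
{
	if (len <= max_len)
		return 1;
	return (len + max_len - 1) / max_len;	/* round up */
}
```

With a page-aligned base, entry_page_offset(base, UINT32_MAX, 1) is 4095, while the same call with SKETCH_MAX_ALIGNED gives 0. An 8 GiB region needs three entries with either chunk size.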
On 6/24/26 14:52, Yousef Alhouseen wrote:
> UDMABUF_CREATE_LIST copies an array whose element count comes from
> userspace. The count is compared against list_limit, but list_limit is a
> signed module parameter while the count is u32.
We should probably just drop the sign from the module parameter instead.
I don't see a use case for negative values here.
Regards,
Christian.
>
> If the limit is raised too far or made negative, that comparison no
> longer bounds the count to a range where sizeof(*list) * count fits in
> the u32 temporary used for the copy length. A wrapped copy length lets
> memdup_user() copy fewer entries than udmabuf_create() subsequently
> walks, leading to out-of-bounds reads from the copied list.
>
> Take a positive snapshot of the module limit and use memdup_array_user()
> so the multiplication is checked before copying.
>
> Signed-off-by: Yousef Alhouseen <alhouseenyousef(a)gmail.com>
> ---
> drivers/dma-buf/udmabuf.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index bced421c0..b4078ec84 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -469,14 +469,15 @@ static long udmabuf_ioctl_create_list(struct file *filp, unsigned long arg)
> struct udmabuf_create_list head;
> struct udmabuf_create_item *list;
> int ret = -EINVAL;
> - u32 lsize;
> + int limit;
>
> if (copy_from_user(&head, (void __user *)arg, sizeof(head)))
> return -EFAULT;
> - if (head.count > list_limit)
> + limit = READ_ONCE(list_limit);
> + if (!head.count || limit <= 0 || head.count > limit)
> return -EINVAL;
> - lsize = sizeof(struct udmabuf_create_item) * head.count;
> - list = memdup_user((void __user *)(arg + sizeof(head)), lsize);
> + list = memdup_array_user((void __user *)(arg + sizeof(head)),
> + head.count, sizeof(*list));
> if (IS_ERR(list))
> return PTR_ERR(list);
>
> --
> 2.54.0
>
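For illustration, the wrap described in the commit message can be reproduced in user-space C. This is a sketch under stated assumptions, not the driver code: struct sketch_item is a hypothetical stand-in assumed to share the 24-byte layout of the create-list item, and copy_len_u32() and copy_len_checked() are made-up names. The second helper shows the kind of check the commit message expects from memdup_array_user(), namely rejecting the product when it does not fit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in; the 24-byte layout is an assumption. */
struct sketch_item {
	uint32_t memfd;
	uint32_t pad;
	uint64_t offset;
	uint64_t size;
};

/* Old pattern: the product is truncated to a 32-bit temporary. */
static uint32_t copy_len_u32(uint32_t count)
{
	return (uint32_t)(sizeof(struct sketch_item) * count);
}

/* Checked pattern: refuse the copy if count * elem overflows size_t. */
static bool copy_len_checked(size_t count, size_t elem, size_t *out)
{
	if (elem != 0 && count > SIZE_MAX / elem)
		return false;
	*out = count * elem;
	return true;
}
```

With count = 0x0AAAAAAB the true product is 0x100000008, so copy_len_u32() yields 8: less than one item would be copied while the caller goes on to walk 0x0AAAAAAB entries.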
The entity->last_scheduled field has always been set and read with
special RCU functions in addition to memory barriers. There is no
obvious reason for that, since the entity lock is available and taken at all
places that evaluate the last_scheduled field. The only exception is
drm_sched_entity_error(), which is not performance critical in any way.
Improve robustness, readability and maintainability by replacing RCU and
barriers with the lock.
As a preparatory step, while at it, also guard spsc_queue_pop() with
the lock, since spsc_queue is deprecated and supposed to be replaced
with a locked list.
Signed-off-by: Philipp Stanner <phasta(a)kernel.org>
---
Tested with drm_sched unit tests, which all ran fine.
---
drivers/gpu/drm/scheduler/sched_entity.c | 49 +++++++++++-------------
include/drm/gpu_scheduler.h | 9 ++---
2 files changed, 26 insertions(+), 32 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c51101ec70c1..95b2c48a604a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -135,7 +135,6 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
entity->num_sched_list = num_sched_list;
entity->sched_list = num_sched_list > 1 ? sched_list : NULL;
entity->rq = &sched_list[0]->rq;
- RCU_INIT_POINTER(entity->last_scheduled, NULL);
RB_CLEAR_NODE(&entity->rb_tree_node);
init_completion(&entity->entity_idle);
@@ -201,10 +200,10 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
struct dma_fence *fence;
int r;
- rcu_read_lock();
- fence = rcu_dereference(entity->last_scheduled);
+ spin_lock(&entity->lock);
+ fence = entity->last_scheduled;
r = fence ? fence->error : 0;
- rcu_read_unlock();
+ spin_unlock(&entity->lock);
return r;
}
@@ -288,8 +287,10 @@ void drm_sched_entity_kill(struct drm_sched_entity *entity)
wait_for_completion(&entity->entity_idle);
/* The entity is guaranteed to not be used by the scheduler */
- prev = rcu_dereference_check(entity->last_scheduled, true);
+ spin_lock(&entity->lock);
+ prev = entity->last_scheduled;
dma_fence_get(prev);
+ spin_unlock(&entity->lock);
while ((job = drm_sched_entity_queue_pop(entity))) {
struct drm_sched_fence *s_fence = job->s_fence;
@@ -381,8 +382,8 @@ void drm_sched_entity_fini(struct drm_sched_entity *entity)
entity->dependency = NULL;
}
- dma_fence_put(rcu_dereference_check(entity->last_scheduled, true));
- RCU_INIT_POINTER(entity->last_scheduled, NULL);
+ dma_fence_put(entity->last_scheduled);
+ WRITE_ONCE(entity->last_scheduled, NULL);
drm_sched_entity_stats_put(entity->stats);
}
EXPORT_SYMBOL(drm_sched_entity_fini);
@@ -523,18 +524,18 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
if (entity->guilty && atomic_read(entity->guilty))
dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);
- dma_fence_put(rcu_dereference_check(entity->last_scheduled, true));
- rcu_assign_pointer(entity->last_scheduled,
- dma_fence_get(&sched_job->s_fence->finished));
+ spin_lock(&entity->lock);
+ dma_fence_put(entity->last_scheduled);
+ entity->last_scheduled = dma_fence_get(&sched_job->s_fence->finished);
- /*
- * If the queue is empty we allow drm_sched_entity_select_rq() to
- * locklessly access ->last_scheduled. This only works if we set the
- * pointer before we dequeue and if we a write barrier here.
+ /* A recent rework required taking the spinlock above. Since spsc_queue
+ * is scheduled for removal as per the DRM-TODO-list, we access it here
+ * locked already to prepare for that cleanup.
+ *
+ * TODO: Fully replace spsc_queue with a locked (h)list.
*/
- smp_wmb();
-
spsc_queue_pop(&entity->job_queue);
+ spin_unlock(&entity->lock);
drm_sched_rq_pop_entity(entity);
@@ -561,21 +562,15 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
if (spsc_queue_count(&entity->job_queue))
return;
- /*
- * Only when the queue is empty are we guaranteed that
- * drm_sched_run_job_work() cannot change entity->last_scheduled. To
- * enforce ordering we need a read barrier here. See
- * drm_sched_entity_pop_job() for the other side.
- */
- smp_rmb();
-
- fence = rcu_dereference_check(entity->last_scheduled, true);
+ spin_lock(&entity->lock);
+ fence = entity->last_scheduled;
/* stay on the same engine if the previous job hasn't finished */
- if (fence && !dma_fence_is_signaled(fence))
+ if (fence && !dma_fence_is_signaled(fence)) {
+ spin_unlock(&entity->lock);
return;
+ }
- spin_lock(&entity->lock);
sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
rq = sched ? &sched->rq : NULL;
if (rq != entity->rq) {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index d61c19e78182..176ff1f936cd 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -100,7 +100,8 @@ struct drm_sched_entity {
* @lock:
*
* Lock protecting the run-queue (@rq) to which this entity belongs,
- * @priority and the list of schedulers (@sched_list, @num_sched_list).
+ * @priority, @last_scheduled and the list of schedulers (@sched_list,
+ * @num_sched_list).
*/
spinlock_t lock;
@@ -202,11 +203,9 @@ struct drm_sched_entity {
/**
* @last_scheduled:
*
- * Points to the finished fence of the last scheduled job. Only written
- * by drm_sched_entity_pop_job(). Can be accessed locklessly from
- * drm_sched_job_arm() if the queue is empty.
+ * Points to the finished fence of the last scheduled job.
*/
- struct dma_fence __rcu *last_scheduled;
+ struct dma_fence *last_scheduled;
/**
* @last_user: last group leader pushing a job into the entity.
base-commit: 60b5fa6edfef867322fce7c8306e5c4b46211be7
--
2.54.0
On 12/06/2026 3:31 pm, Matt Evans wrote:
> Hi Kevin, Pranjal, (+Robin, hi!)
Oh hey there! :)
> On 12/06/2026 04:39, Tian, Kevin wrote:
>>> From: Pranjal Shrivastava <praan(a)google.com>
>>> Sent: Friday, June 12, 2026 2:38 AM
>>>
>>> On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
>>>> --- a/drivers/pci/Kconfig
>>>> +++ b/drivers/pci/Kconfig
>>>> @@ -206,11 +206,7 @@ config PCIE_TPH
>>>> config PCI_P2PDMA
>>>> bool "PCI peer-to-peer transfer support"
>>>> depends on ZONE_DEVICE
>>>> - #
>>>> - # The need for the scatterlist DMA bus address flag means PCI
>>> P2PDMA
>>>> - # requires 64bit
>>>> - #
>>>> - depends on 64BIT
>>>> + select PCI_P2PDMA_CORE
>>>> select GENERIC_ALLOCATOR
>>>> select NEED_SG_DMA_FLAGS
>>>> help
>>>
>>> Nit: Did we drop depends on 64BIT intentionally here? I guess the full
>>> PCI_P2PDMA stack still selects NEED_SG_DMA_FLAGS? IIRC,
>>> NEED_SG_DMA_FLAGS doesn't select 64BIT?
>>
>> seems that comment is stale. According to the commit msg:
>>
>> " it would make vfio-pci only available if CONFIG_ZONE_DEVICE is
>> present (e.g. 64-bit systems), "
>>
>> so it sounds a redundant dependency hence is removed.
>
> This was intentional. In practice there is still a dependency on 64BIT
> for PCI_P2PDMA, but it is because of ZONE_DEVICE (and mem hotplug). The
> key need is PCI_P2PDMA_CORE is available on !64BIT for VFIO, but I
> didn't see a requirement from PCI_P2PDMA itself (as opposed to its
> dependencies). If I've missed one, I can put it back...
>
> But NEED_SG_DMA_FLAGS doesn't smell quite right; I see from comments in
>
> af2880ec44021 ("scatterlist: add dedicated config for DMA flags")
>
> that it assumes 64BIT, but it seems to be missing a "depends on 64BIT".
>
> Robin -- should that depend on 64BIT?
Indeed, looking at the history it seems like that was overlooked, but it
worked out at the time since the only selector of NEED_SG_DMA_FLAGS was
PCI_P2PDMA as you say. If we're now generalising then moving the
explicit 64BIT dependency to NEED_SG_DMA_FLAGS itself sounds like the
right thing to do.
Cheers,
Robin.
In a recent discussion with Philipp and Danilo the question came up of what
had already been tried, but never finished, to clean up the dma_fence framework.
So here are the different ideas I came up with but never fully finished,
with the patches themselves modernized and rebased on top of drm-misc-next.
The main goal of those changes is to make it easier to implement dma_fence
backends and to avoid enforcing unnecessary constraints on implementations.
As a first step the locking around the dma_fence_ops.signaled callback is
made consistent by removing the dma_fence_is_signaled_locked() function.
This was mostly used by the backends themselves, but if polling the HW is desired
the backends can call their own functions for this directly without going
through the dma-fence layer.
XE actually seems to be the only driver which makes use of that for a bit
more handling. For all other cases testing the signaled flag should be enough.
Then forcefully calling dma_fence_signaled() is removed from the dma-fence
layer and moved into the backend implementations.
This allows the backend implementations to clean up after they have
signaled the fence. Such cleanup can include removing now signaled fences
from lists, dropping references, starting work, etc.
Especially nouveau seems to have some really messy workaround because of
that, involving the DMA_FENCE_FLAG_USER_BITS and installing callbacks
because the reference to the context couldn't be dropped directly after
signaling. This can now be cleaned up as far as I can see.
In the long term this should also allow reworking the error handling, e.g.
removing dma_fence_set_error() and instead giving the error as a mandatory
parameter to dma_fence_signal().
Then the last piece is to stop calling the enable_signaling callback with the
dma_fence lock held. This makes it possible for backends to acquire locks
which are semantically ordered outside of the dma_fence lock.
This is necessary to allow using the dma_fence inline lock in more cases;
previously backends used some common external lock for their dma_fences to,
for example, make it possible to remove fences from linked lists.
Please comment and review,
Christian.
On Tue, 2026-06-23 at 15:33 +0100, André Draszik wrote:
> > However, if my issue were to be solved with barriers, the
> > test_and_set_bit() in dma_fence_signal_timestamp_locked() would have to
> > be replaced with the more weakly ordered test_bit() and set_bit(),
> > maybe creating other pitfalls.
>
> For the avoidance of doubt, I'm not saying that all the issues you raised
> can be solved by barriers instead of appropriate locks (I don't know enough
> about the code and issues in general here).
I'm not saying that you're saying that. I'm just cautioning you that
this change could be tricky.
>
> I do think however that appropriate locks will fix the ordering issue
> highlighted by sashiko (i.e. +1 for your argument). Barriers would fix this
> specific issue, too, but that is not a statement about any wider issues.
>
> > The ordering issue in the get_*_name() functions plays into that.
> > Setting the bit would then be done after setting the ops-pointer to
> > NULL. So one would have to try to move the NULL set, too.
> >
> > Long story short, this is painful and subtle.
> >
> > But I think what we are realizing over and over again is that dma_fence
> > has many subtleties to its API contract, and the implementation's
> > sparing use of spinlocks leads to workarounds where people take locks
> > manually or have to do an RCU dance.
> >
> > Note that Christian is strongly opposed to guarding everything with
> > locks, in part for supposedly occurring deadlocks in the fence callbacks
> > when the driver needs to take its own locks.
>
> ww_mutex could help against deadlocks, but might affect performance, in case
> these are all critical code paths (IDK).
You can't use sleepable locks in fences. They fire in interrupt context
left and right ;)
Besides, that wouldn't even solve the reported problem.
The tl;dr is:
there is fence_ops->enable_signaling(), which is currently being called
with the fence lock held. So the driver, in that callback, cannot take
a driver-specific lock IF there is another driver party (like an IRQ)
taking first the driver lock and then the fence lock.
Which is why Christian König wants to remove the fence lock being held
in enable_signaling().
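For illustration, the inversion described above can be modelled with a toy lock-order recorder in user-space C. This is only a sketch: the enum values and record_order() are hypothetical names, no real locks are taken, and only direct two-lock (ABBA) inversions are detected.

```c
#include <stdbool.h>

enum sketch_lock { FENCE_LOCK, DRIVER_LOCK, NR_LOCKS };

/* seen_before[a][b] is true once some path took b while holding a. */
static bool seen_before[NR_LOCKS][NR_LOCKS];

/*
 * Record "outer is held, then inner is taken". Returns false if the
 * opposite order was recorded earlier, i.e. an ABBA inversion.
 */
static bool record_order(enum sketch_lock outer, enum sketch_lock inner)
{
	if (seen_before[inner][outer])
		return false;
	seen_before[outer][inner] = true;
	return true;
}
```

An IRQ path that takes the driver lock and then the fence lock records record_order(DRIVER_LOCK, FENCE_LOCK). An enable_signaling() callback that runs with the fence lock held and then wants the driver lock corresponds to record_order(FENCE_LOCK, DRIVER_LOCK), which the recorder rejects.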
One reason why that, supposedly, is currently not a problem is that
without fence->inline_lock, you can protect the fctx with the same lock
and do fctx list manipulations in enable_signaling() with lock
protection.
If you have a big bowl of popcorn available, you could check out this
thread:
https://lore.kernel.org/dri-devel/20260608142436.265820-2-phasta@kernel.org/
;p
My own thinking is:
If everyone used inline_lock, and if we could rely on everyone being
able to do the necessary work in enable_signaling() without said lock-
inversion, then we could perfectly synchronize all actions related to
dma_fence, including driver and, thus, fence_ops unload.
The only thing blocking really might be enable_signaling (the other
callbacks already take the lock). The more difficult question would be
how to implement that in a backwards compatible manner, i.e., for those
who don't have inline_lock.
Another idea for the distant future might be to question the existence
of those callbacks. Userspace often is sort of decoupled from the
hardware fences through intermediate fences already.
>
> > The community discussion regarding that problem is currently in some
> > sort of dead end, where none of us seems to know what the correct path
> > forward is.
>
> Please ignore if the following doesn't make sense, I'm just a bystander :-)
> How about at least adding the required barriers and related changes, and
> taking it from there? This would solve some immediate and easy to hit
> issues on Arm64? If they turn out to be insufficient, code can still
> be changed.
>
I am in support of that, which is why I posted that RFC for feedback
about the appropriate memory barriers.
>
> BTW, thanks Philipp for all these details, much appreciated.
You're welcome. If you'd find a clever solution, probably everyone
would be happy.
P.
>
> Cheers,
> A.
On Tue, Jun 23, 2026 at 09:44:46AM +0100, David Laight wrote:
Hi David,
> On Tue, 23 Jun 2026 01:54:59 +0000
> David Hu <xuehaohu(a)google.com> wrote:
>
> > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > first entry, resulting in non-page-aligned DMA addresses for all
> > subsequent entries.
>
> There is a separate issue of whether this code is even needed at all.
> Where can transfers over 2G (never mind 4G) actually come from?
>
> The read, write and similar system calls limit transfers to INT_MAX
> (even on 64bit) and a lot of driver code will need fixing if longer
> lengths are allowed, though.
> io_uring had better enforce the same limits.
> So the transfers can't come directly from userspace.
>
> Not only that but you also need a single physically contiguous buffer.
> Good luck allocating that!
>
> Now maybe there are some peer-to-peer places where the large buffer
> is device memory, but they will be unusual and probably need
> special treatment anyway.
>
I agree that traditional VFS read/write face the MAX_RW_COUNT limit
(~2GB), and io_uring has its limits, but I'm a little confused by the
push to enforce these limits here in the SGL code?
File I/O seems to be only one side of the picture. In my view, this fix
is necessary and certainly has a use-case:
For example, the RDMA subsystem has the capability to import dmabufs [1],
which gives rise to use cases for dmabuf beyond standard file ops
(via VFS/io_uring).
In these scenarios, GPU HBM can be exported as dmabufs. With recent GPUs,
HBM capacity can be in the order of hundreds of GBs [2]. RDMA can employ
infrastructure like the vfio-dmabuf-exporter [3] or similar dmabuf
exporters to frequently move huge blocks of data via P2PDMA.
If we restrict incoming dmabuf transfers to fit within VFS-centric
limits (2GB), we impose unnecessary overhead on the RDMA stack, forcing
it to manage a significantly higher number of memory registrations. By
cleanly splitting these massive contiguous device buffers into
page-aligned SGL entries, we directly improve the efficiency of P2P
transfers and memory registration.
Since this change doesn't seem to have a negative impact on standard file
I/O or break existing VFS constraints, I'm curious why we shouldn't
support splitting these >4GB P2P transfers? Am I missing something?
Thanks,
Praan
[1] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/infiniband/core/umem…
[2] https://nvdam.widen.net/s/fdvdqvfvj2/hopper-h200-nvl-product-brief (Table 2-2)
[3] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/vfio/pci/vfio_pci_dm…