On Thu, Nov 9, 2023 at 1:30 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
On 2023/11/9 11:20, Mina Almasry wrote:
On Wed, Nov 8, 2023 at 2:56 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
Agreed, everything above is undoable.
But we might be able to do something like what folio is doing now: the mm subsystem still sees 'struct folio/page', but other subsystems like slab use 'struct slab', and there are still some common fields shared between 'struct folio' and 'struct slab'.
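(For reference, a much-simplified sketch of how mm/slab.h does that overlay today; the real struct has more fields, but the idea is that 'struct slab' shadows 'struct page' while static asserts pin the shared fields to matching offsets:)

struct slab {
        unsigned long __page_flags;     /* must line up with page->flags */
        struct kmem_cache *slab_cache;
        ...
        atomic_t __page_refcount;       /* must line up with page->_refcount */
};

#define SLAB_MATCH(pg, sl) \
        static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
SLAB_MATCH(flags, __page_flags);
SLAB_MATCH(_refcount, __page_refcount);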
In my eyes this is almost exactly what I suggested in RFC v1 and got immediately nacked with no room to negotiate. What we did for v1 is to allocate struct pages for dma-buf to make dma-bufs look like struct page to mm subsystem. Almost exactly what you're describing above.
Maybe the above is where we have a disagreement: do we still need to make dma-bufs look like struct page to the mm subsystem? IMHO, the answer is no. We might only need to make dma-bufs look like struct page to the net stack and the page pool. I think that is already what this patchset is trying to do; what I am suggesting is just to make it look even more like 'struct page' to the net stack and the page pool, in order to avoid most of the 'if' checking in both.
First, most of the checking in the net stack is skb_frag_not_readable(). dma-bufs are fundamentally not kmap()able and not readable, so we can't remove those checks no matter what we do, I think. Can we agree on that? If so, let's discuss removing most of the ifs in the page pool only.
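To illustrate the kind of check I mean (a sketch only; example_copy_frag() is a made-up caller, the helper name is the one this series adds):

/* Net stack paths that touch payload must bail out for devmem frags:
 * there is no kmap() for a dma-buf backed frag, so it cannot be
 * copied or linearized.
 */
static int example_copy_frag(const skb_frag_t *frag)
{
        if (skb_frag_not_readable(frag))
                return -EFAULT;

        /* kmap + copy is only valid for readable (struct page) frags */
        return 0;
}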
It's a no-go. I don't think renaming struct page to netmem is going to move the needle (it also re-introduces code-churn). What I feel like I learnt is that dma-bufs are not struct pages and can't be made to look like one, I think.
As in the netmem patchset, is devmem able to reuse the 'struct netmem' below, renaming it to 'struct page_pool_iov'?
I don't think so. For the reasons above, but also practically it immediately falls apart. Consider this field in netmem:
- @flags: The same as the page flags. Do not use directly.
dma-bufs don't have or support page flags, and making dma-bufs look like they support page flags or any page-like features (other than dma_addr) seems extremely unacceptable to mm folks.
As far as I can tell, since we limit devmem usage to the net stack, the below are the related mm function calls on 'struct page' for devmem:

page_ref_*(): page->_refcount does not need changing
Sorry, I don't understand. Are you suggesting we call page_ref_add() & page_ref_sub() on page_pool_iov? That is basically making page_pool_iov look like struct page to the mm stack, since page_ref_* are mm calls, which you say above we don't need to do. We will still need to special case this, no?
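The special-casing I'm referring to looks roughly like this (a sketch of the shape of the helpers in this series, not their exact bodies):

/* The page pool has to branch on the pointer type before it can touch
 * a refcount: mm's page_ref_sub() for real pages, our own counter for
 * page_pool_iov.
 */
static inline void page_pool_page_put_many(struct page *page, long count)
{
        if (page_is_page_pool_iov(page))
                page_pool_iov_put_many(page_to_page_pool_iov(page), count);
        else
                page_ref_sub(page, count);
}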
page_is_pfmemalloc(): this corresponds to page->pp_magic, and the devmem provider can set/unset it in its 'alloc_pages' ops.
page_is_pfmemalloc() has nothing to do with page->pp_magic. It checks page->lru.next to figure out whether this is a pfmemalloc page. page_pool_iov has no page->lru.next. We'd still need to special-case this, no?
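For reference, the mm implementation (lightly trimmed from include/linux/mm.h):

static inline bool page_is_pfmemalloc(const struct page *page)
{
        /*
         * lru.next has bit 1 set if the page is allocated from the
         * pfmemalloc reserves.  Callers may simply overwrite it if
         * they do not need to preserve that information.
         */
        return (uintptr_t)page->lru.next & BIT(1);
}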
page_to_nid(): we may need to handle it differently, somewhat like this patch does, as page_to_nid() may have different implementations depending on the kernel configuration.
So you're saying we need to handle page_to_nid() differently for devmem? So we're not going to be able to avoid the if statement.
page_pool_iov_put_many(): as mentioned in the other thread, if the net stack is not calling page_pool_page_put_many() directly, we can reuse napi_pp_put_page() for devmem too, and handle the special case for devmem in the 'release_page' ops.
page_pool_iov_put_many()/page_pool_iov_get_many() are called to do refcounting before the page is released back to the provider. I'm not seeing how we can handle the special case inside of 'release_page' - that's too late, as far as I can tell.
The only way to remove the if statements in the page pool is to implement what you said was not feasible in an earlier email. We would define this struct:
struct netmem {
        /* common fields */
        refcount_t refcount;
        bool is_pfmemalloc;
        int nid;
        ...

        union {
                struct devmem {
                        struct dmabuf_genpool_chunk_owner *owner;
                };
                struct page *page;
        };
};
Then, we would require all memory providers to allocate a struct netmem for the memory and set the common fields, including providers that are backed by struct pages. For devmem, netmem->page will be NULL, because there is no backing struct page.
If we do that, the page pool can ignore whether the underlying memory is page or devmem, because it can use the common fields, example:
/* page_ref_count replacement */
int netmem_ref_count(struct netmem *netmem)
{
        return refcount_read(&netmem->refcount);
}

/* page_ref_add replacement */
void netmem_ref_add(struct netmem *netmem)
{
        refcount_inc(&netmem->refcount);
}

/* page_to_nid replacement */
int netmem_nid(struct netmem *netmem)
{
        return netmem->nid;
}

/* page_is_pfmemalloc() replacement */
bool netmem_is_pfmemalloc(struct netmem *netmem)
{
        return netmem->is_pfmemalloc;
}

/* page_ref_sub replacement */
void netmem_ref_sub(struct netmem *netmem)
{
        if (refcount_dec_and_test(&netmem->refcount)) {
                /* Release the memory back to the provider.
                 * A struct page provider would do put_page(),
                 * devmem would do something else.
                 */
        }
}
I think this MAY BE technically feasible, but I'm not sure it's better:
1. It is a huge refactor of the page pool, with lots of code churn. The page pool currently uses page*; it would need to be completely converted to use netmem*.
2. It causes extra memory usage: a struct netmem needs to be allocated for every struct page.
3. It has minimal perf upside. The page_is_page_pool_iov() checks currently have minimal perf impact, and I demonstrated that to Jesper in RFC v2.
4. It also may not be technically feasible. I'm not sure how netmem interacts with skb_frag_t. I guess we replace struct page* bv_page with struct netmem* bv_page and make changes there (see the skb_frag_t definition quoted below).
5. Drivers need to be refactored to use netmem* instead of page*, unless we cast netmem* to page* before returning to the driver.
Possibly other downsides, these are what I could immediately think of.
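On point 4, for reference, skb_frag_t today is just a bio_vec, so the backing memory is a bare struct page pointer (trimmed from include/linux/skbuff.h and include/linux/bvec.h):

/* include/linux/skbuff.h */
typedef struct bio_vec skb_frag_t;

/* include/linux/bvec.h */
struct bio_vec {
        struct page     *bv_page;       /* would need to become struct netmem * */
        unsigned int    bv_len;
        unsigned int    bv_offset;
};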
If I'm still misunderstanding your suggestion, it may be time to send me a concrete code snippet of what you have in mind. I'm a bit confused at the moment because the only avenue I see to remove the if statements in the page pool is to define the struct that we agreed is not feasible in earlier emails.
So 'struct page' for normal memory and 'struct page_pool_iov' for devmem would share the common fields used by the page pool and net stack?
Are you suggesting that we'd cast a netmem* to a page* and call core mm APIs on it? It's basically what was happening with RFC v1, where things that are not struct pages were made to look like struct pages.
Also, there isn't much upside to what you're suggesting, I think. For example, I can align the refcount variable in struct page_pool_iov with the refcount in struct page so that this works:
put_page((struct page *)ppiov);
but it's a disaster, because put_page() will call __put_page() if the page is freed, and __put_page() will try to return the page to the buddy allocator!
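Roughly, the path looks like this (a simplified sketch; in current kernels put_page() actually goes through folios, but the end result is the same):

static inline void put_page(struct page *page)
{
        struct folio *folio = page_folio(page);

        /* When the refcount drops to zero, the memory is handed back
         * to the buddy allocator, which knows nothing about
         * page_pool_iov; for a casted ppiov this is memory corruption.
         */
        if (folio_put_testzero(folio))
                __folio_put(folio);
}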
As I suggested above, can we handle this in the devmem provider's 'release_page' ops, instead of calling put_page() directly for devmem?
And we might be able to reuse the 'flags', '_pp_mapping_pad' and '_mapcount' fields for a specific mem provider, which should be enough for devmem, which only requires a single pointer to its owner?
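(For context, the page-pool slice of struct page looks roughly like this today; trimmed from include/linux/mm_types.h, and the exact field set may shift between releases:)

struct page {
        unsigned long flags;
        union {
                struct {        /* page_pool used by netstack */
                        unsigned long pp_magic;
                        struct page_pool *pp;
                        unsigned long _pp_mapping_pad;
                        unsigned long dma_addr;
                        union {
                                unsigned long dma_addr_upper;
                                atomic_long_t pp_frag_count;
                        };
                };
                ...
        };
        ...
};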
All the above seems quite similar to RFC v1 again, using netmem instead of struct page. In RFC v1 we re-used zone_device_data() for the dma-buf owner equivalent.
As we have added a few checks to limit 'struct page' for devmem to being used only in the net stack, we can decouple 'struct page' for devmem from the mm subsystem, so zone_device_data() is not really needed, right?
If we can decouple 'struct page' for normal memory from the mm subsystem through the folio work in the future, then we may be able to define a more abstract structure for the page pool and net stack instead of reusing 'struct page' from mm.
--
Thanks,
Mina