On Mon, Oct 21, 2019 at 12:02 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
On 10/12/19 5:30 PM, Mina Almasry wrote:
A follow-up patch in this series adds hugetlb cgroup uncharge info to the file_region entries in resv->regions. The cgroup uncharge info may differ for different regions, so they can no longer be coalesced at region_add time. Disable region coalescing in region_add in this patch.
Behavior change:
Say a resv_map exists like this [0->1], [2->3], and [5->6].
Then a region_chg/add call comes in region_chg/add(f=0, t=5).
Old code would generate resv->regions: [0->5], [5->6]. New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5], [5->6].
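To make that concrete, here is a minimal userspace sketch of the non-coalescing insert (not the patch's code: the struct and list handling are simplified stand-ins, and the real code walks resv->regions under resv->lock and pulls entries from resv->region_cache rather than calling malloc()):

/* Userspace sketch only: a simplified stand-in for the kernel's
 * file_region handling, to show the non-coalescing behavior. */
#include <stdio.h>
#include <stdlib.h>

struct file_region {
	long from, to;
	struct file_region *next;	/* stand-in for the kernel list_head */
};

static struct file_region *mkregion(long f, long t, struct file_region *next)
{
	/* Sketch: allocation-failure handling omitted. */
	struct file_region *r = malloc(sizeof(*r));

	r->from = f;
	r->to = t;
	r->next = next;
	return r;
}

/* Fill every gap in [f, t) with its own entry and never merge with
 * neighbors, so per-region cgroup uncharge info can later be attached
 * to each entry independently. The list stays sorted and
 * non-overlapping, as resv->regions is in the kernel. */
static void add_no_coalesce(struct file_region **head, long f, long t)
{
	struct file_region **pp = head;
	long last = f;

	/* Skip entries entirely before the range. */
	while (*pp && (*pp)->to <= f)
		pp = &(*pp)->next;

	while (last < t) {
		if (!*pp || (*pp)->from >= t) {
			/* No more overlapping entries: one new region
			 * covers the rest of the range. */
			*pp = mkregion(last, t, *pp);
			last = t;
		} else {
			if (last < (*pp)->from) {
				/* Gap before the next existing entry
				 * becomes its own region. */
				*pp = mkregion(last, (*pp)->from, *pp);
				pp = &(*pp)->next;
			}
			last = (*pp)->to;
			pp = &(*pp)->next;
		}
	}
}

int main(void)
{
	/* The example above: [0->1], [2->3], [5->6]. */
	struct file_region *head =
		mkregion(0, 1, mkregion(2, 3, mkregion(5, 6, NULL)));

	add_no_coalesce(&head, 0, 5);

	for (struct file_region *r = head; r; r = r->next)
		printf("[%ld->%ld] ", r->from, r->to);
	printf("\n");	/* [0->1] [1->2] [2->3] [3->5] [5->6] */
	return 0;
}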
Special care needs to be taken to handle the resv->adds_in_progress variable correctly. In the past, only 1 region would be added for every region_chg and region_add call. But now, each call may add multiple regions, so we can no longer increment adds_in_progress by 1 in region_chg, or decrement adds_in_progress by 1 after region_add or region_abort. Instead, region_chg calls add_reservation_in_range() to count the number of regions needed and allocates those, and that info is passed to region_add and region_abort to decrement adds_in_progress correctly.
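The resulting calling convention looks roughly like the sketch below (shapes only, not the patch verbatim; something_else_fails is a placeholder for whatever later step, such as memory accounting, can fail between the two calls):

long chg, regions_needed;

/* region_chg now also reports how many file_region entries the
 * operation may consume, and pre-allocates that many into
 * resv->region_cache. */
chg = region_chg(resv, f, t, &regions_needed);
if (chg < 0)
	return chg;

if (something_else_fails) {
	/* Undo the adds_in_progress bump by the same count. */
	region_abort(resv, f, t, regions_needed);
	return -ENOSPC;
}

/* Consumes pre-allocated cache entries and decrements
 * adds_in_progress by regions_needed. */
region_add(resv, f, t, regions_needed);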
Signed-off-by: Mina Almasry <almasrymina@google.com>
Changes in v6:
- Fix bug in number of region_caches allocated by region_chg
 mm/hugetlb.c | 256 +++++++++++++++++++++++++++++----------------------
 1 file changed, 147 insertions(+), 109 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4a60d7d44b4c3..f9c1947925bb9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
<snip>
> -static long region_chg(struct resv_map *resv, long f, long t)
> +static long region_chg(struct resv_map *resv, long f, long t,
> +		       long *out_regions_needed)
>  {
> +	struct file_region *trg = NULL;
>  	long chg = 0;
>
> +	/* Allocate the maximum number of regions we're going to need for this
> +	 * reservation. The maximum number of regions we're going to need is
> +	 * (t - f) / 2 + 1, which corresponds to a region with alternating
> +	 * reserved and unreserved pages.
> +	 */
> +	*out_regions_needed = (t - f) / 2 + 1;
> +
>  	spin_lock(&resv->lock);
> -retry_locked:
> -	resv->adds_in_progress++;
> +
> +	resv->adds_in_progress += *out_regions_needed;
>
>  	/*
>  	 * Check for sufficient descriptors in the cache to accommodate
>  	 * the number of in progress add operations.
>  	 */
> -	if (resv->adds_in_progress > resv->region_cache_count) {
> -		struct file_region *trg;
> -
> -		VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 1);
> +	while (resv->region_cache_count < resv->adds_in_progress) {
>  		/* Must drop lock to allocate a new descriptor. */
> -		resv->adds_in_progress--;
>  		spin_unlock(&resv->lock);
> -
>  		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
>  		if (!trg)
>  			return -ENOMEM;
> @@ -393,9 +395,9 @@ static long region_chg(struct resv_map *resv, long f, long t)
>  		spin_lock(&resv->lock);
>  		list_add(&trg->link, &resv->region_cache);
>  		resv->region_cache_count++;
> -		goto retry_locked;
>  	}
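As a worked example of the bound: for the commit message's resv_map, region_chg(f=0, t=5) sets *out_regions_needed = (5 - 0) / 2 + 1 = 3, though only two new entries ([1->2] and [3->5]) end up being used. The bound is tight when existing reservations strictly alternate, e.g. f=1, t=6 over existing [2->3] and [4->5] needs exactly the three entries [1->2], [3->4], [5->6] = (6 - 1) / 2 + 1.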
I know that I suggested allocating the worst-case number of entries, but this is going to be too much of a hit for existing hugetlbfs users. It is not uncommon for DBs to have shared areas in excess of 1TB mapped by hugetlbfs. With this new scheme, the above while loop will allocate over a half million file region entries and end up only using one.
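To put rough numbers on that (assuming x86 2MB huge pages): a 1TB range is t - f = 524288 pages, so the loop pre-allocates (t - f) / 2 + 1 = 262145 file_region entries, and a 2TB range pushes that past half a million. At sizeof(struct file_region) = 32 bytes (a list_head plus two longs, before any cgroup fields are added), that is roughly 8-16MB of kmalloc'd memory per reservation call, almost all of it unused when the range maps to a single contiguous region.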
I think we need to step back and come up with a different approach. Let me give it some more thought before throwing out ideas that may waste more of your time. Sorry.
No problem at all. The other more reasonable option is to have it such that region_add allocates its own cache entries if it needs to, and the effect of that is that region_add may fail, so the callers must handle that possibility. Doesn't seem too difficult to handle.
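Roughly, that alternative would look like the sketch below (shape only: max_regions_needed() is a hypothetical helper standing in for whatever computes the count under the lock, and the add_reservation_in_range() call is reduced to its essentials):

static long region_add(struct resv_map *resv, long f, long t)
{
	long add;

	spin_lock(&resv->lock);
	while (max_regions_needed(resv, f, t) > resv->region_cache_count) {
		struct file_region *trg;

		/* Must drop the lock to allocate, then recheck. */
		spin_unlock(&resv->lock);
		trg = kmalloc(sizeof(*trg), GFP_KERNEL);
		if (!trg)
			return -ENOMEM;	/* the new failure callers must handle */
		spin_lock(&resv->lock);
		list_add(&trg->link, &resv->region_cache);
		resv->region_cache_count++;
	}

	add = add_reservation_in_range(resv, f, t);
	spin_unlock(&resv->lock);
	return add;
}

The upside is that nothing is allocated up front for the worst case; the downside is exactly that new -ENOMEM path.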
--
Mike Kravetz