Hi all,
Accounting only free pages is very inaccurate for low memory handling, so we have to be smarter here.
The patch set implements a new attribute, which is blended from various memory statistics. Vmevent can't expose all the kernel internals to the userland, as it would make internal Linux MM representation tied to the ABI. So the ABI itself was made very simple: just number of pages before we consider that we're low on memory, and the kernel takes care of the rest.
Thanks,
This complements GT and LT, making it possible to combine GE and LE operators. We'll use it for blended attributes: the special attributes will return either 0 or <threshold>, so to make two-way notifications we will pass LT | EQ bits.
Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- include/linux/vmevent.h | 6 +++++- mm/vmevent.c | 22 +++++++++++++++------- 2 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h index ca97cf0..aae0d24 100644 --- a/include/linux/vmevent.h +++ b/include/linux/vmevent.h @@ -27,9 +27,13 @@ enum { */ VMEVENT_ATTR_STATE_VALUE_GT = (1UL << 1), /* + * Sample value is equal to user-specified value + */ + VMEVENT_ATTR_STATE_VALUE_EQ = (1UL << 2), + /* * One-shot mode. */ - VMEVENT_ATTR_STATE_ONE_SHOT = (1UL << 2), + VMEVENT_ATTR_STATE_ONE_SHOT = (1UL << 3),
/* Saved state, used internally by the kernel for one-shot mode. */ __VMEVENT_ATTR_STATE_VALUE_WAS_LT = (1UL << 30), diff --git a/mm/vmevent.c b/mm/vmevent.c index 47ed448..9f1520b 100644 --- a/mm/vmevent.c +++ b/mm/vmevent.c @@ -87,28 +87,39 @@ static bool vmevent_match(struct vmevent_watch *watch) u32 state = attr->state; bool attr_lt = state & VMEVENT_ATTR_STATE_VALUE_LT; bool attr_gt = state & VMEVENT_ATTR_STATE_VALUE_GT; + bool attr_eq = state & VMEVENT_ATTR_STATE_VALUE_EQ;
if (!state) continue;
- if (attr_lt || attr_gt) { + if (attr_lt || attr_gt || attr_eq) { bool one_shot = state & VMEVENT_ATTR_STATE_ONE_SHOT; u32 was_lt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_LT; u32 was_gt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_GT; u64 value = vmevent_sample_attr(watch, attr); bool lt = value < attr->value; bool gt = value > attr->value; + bool eq = value == attr->value; bool was_lt = state & was_lt_mask; bool was_gt = state & was_gt_mask; + bool was_eq = was_lt && was_gt; bool ret = false;
- if (((attr_lt && lt) || (attr_gt && gt)) && !one_shot) + if (((attr_lt && lt) || (attr_gt && gt) || + (attr_eq && eq)) && !one_shot) return true;
- if (attr_lt && lt && was_lt) { + if (attr_eq && eq && was_eq) { return false; - } else if (attr_gt && gt && was_gt) { + } else if (attr_lt && lt && was_lt && !was_eq) { return false; + } else if (attr_gt && gt && was_gt && !was_eq) { + return false; + } else if (eq) { + state |= was_lt_mask; + state |= was_gt_mask; + if (attr_eq) + ret = true; } else if (lt) { state |= was_lt_mask; state &= ~was_gt_mask; @@ -119,9 +130,6 @@ static bool vmevent_match(struct vmevent_watch *watch) state &= ~was_lt_mask; if (attr_gt) ret = true; - } else { - state &= ~was_lt_mask; - state &= ~was_gt_mask; }
attr->state = state;
We'll need the argument for blended attributes, the attributes return either 0 or the threshold value (which is in attr).
Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- mm/vmevent.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/mm/vmevent.c b/mm/vmevent.c index 9f1520b..b312236 100644 --- a/mm/vmevent.c +++ b/mm/vmevent.c @@ -39,9 +39,11 @@ struct vmevent_watch { wait_queue_head_t waitq; };
-typedef u64 (*vmevent_attr_sample_fn)(struct vmevent_watch *watch); +typedef u64 (*vmevent_attr_sample_fn)(struct vmevent_watch *watch, + struct vmevent_attr *attr);
-static u64 vmevent_attr_swap_pages(struct vmevent_watch *watch) +static u64 vmevent_attr_swap_pages(struct vmevent_watch *watch, + struct vmevent_attr *attr) { #ifdef CONFIG_SWAP struct sysinfo si; @@ -54,12 +56,14 @@ static u64 vmevent_attr_swap_pages(struct vmevent_watch *watch) #endif }
-static u64 vmevent_attr_free_pages(struct vmevent_watch *watch) +static u64 vmevent_attr_free_pages(struct vmevent_watch *watch, + struct vmevent_attr *attr) { return global_page_state(NR_FREE_PAGES); }
-static u64 vmevent_attr_avail_pages(struct vmevent_watch *watch) +static u64 vmevent_attr_avail_pages(struct vmevent_watch *watch, + struct vmevent_attr *attr) { return totalram_pages; } @@ -74,7 +78,7 @@ static u64 vmevent_sample_attr(struct vmevent_watch *watch, struct vmevent_attr { vmevent_attr_sample_fn fn = attr_samplers[attr->type];
- return fn(watch); + return fn(watch, attr); }
static bool vmevent_match(struct vmevent_watch *watch)
This is specially "blended" attribute, the event triggers when kernel decides that we're close to the low memory threshold. Userspace should not expect very precise meaning of low memory situation, mostly, it's just a guess on the kernel's side.
Well, this is the same as userland should not know or care how exactly kernel manages the memory, or assume that memory management behaviour is a part of the "ABI". So, all the 'low memory' is just guessing, but we're trying to do our best. It might be that we will end up with two or three variations of 'low memory' thresholds, and all of them would be useful for different use cases.
For this implementation, we assume that there's a low memory situation for the N pages threshold when we have neither N pages of completely free pages, nor we have N reclaimable pages in the cache. This effectively means, that if userland expects to allocate N pages, it would consume all the free pages, and any further allocations (above N) would start draining caches.
In the worst case, prior to hitting the threshold, we might have only N pages in cache, and nearly no memory as free pages.
The same 'low memory' meaning is used in the current Android Low Memory Killer driver.
Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- include/linux/vmevent.h | 7 ++++++ mm/vmevent.c | 40 ++++++++++++++++++++++++++++++++++ tools/testing/vmevent/vmevent-test.c | 12 +++++++++- 3 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h index aae0d24..9bfa244 100644 --- a/include/linux/vmevent.h +++ b/include/linux/vmevent.h @@ -10,6 +10,13 @@ enum { VMEVENT_ATTR_NR_AVAIL_PAGES = 1UL, VMEVENT_ATTR_NR_FREE_PAGES = 2UL, VMEVENT_ATTR_NR_SWAP_PAGES = 3UL, + /* + * This is specially blended attribute, the event triggers + * when kernel decides that we're close to the low memory threshold. + * Don't expect very precise meaning of low memory situation, mostly, + * it's just a guess on the kernel's side. + */ + VMEVENT_ATTR_LOWMEM_PAGES = 4UL,
VMEVENT_ATTR_MAX /* non-ABI */ }; diff --git a/mm/vmevent.c b/mm/vmevent.c index b312236..d278a25 100644 --- a/mm/vmevent.c +++ b/mm/vmevent.c @@ -68,10 +68,50 @@ static u64 vmevent_attr_avail_pages(struct vmevent_watch *watch, return totalram_pages; }
+/* + * Here's some implementation details for the "low memory" meaning. + * + * (The explanation is not in the header file as userland should not + * know these details, nor it should assume that the meaning will + * always be the same. As well as it should not know how exactly kernel + * manages the memory, or assume that memory management behaviour is a + * part of the "ABI". So, all the 'low memory' is just guessing, but + * we're trying to do our best.) + * + * For this implementation, we assume that there's a low memory situation + * for the N pages threshold when we have neither N pages of completely + * free pages, nor we have N reclaimable pages in the cache. This + * effectively means, that if userland expects to allocate N pages, it + * would consume all the free pages, and any further allocations (above + * N) would start draining caches. + * + * In the worst case, prior hitting the threshold, we might have only + * N pages in cache, and nearly no memory as free pages. + */ +static u64 vmevent_attr_lowmem_pages(struct vmevent_watch *watch, + struct vmevent_attr *attr) +{ + int free = global_page_state(NR_FREE_PAGES); + int file = global_page_state(NR_FILE_PAGES) - + global_page_state(NR_SHMEM); /* TODO: account locked pages */ + int val = attr->value; + + /* + * For convenience we return 0 or attr value (instead of 0/1), it + * makes it easier for vmevent_match() to cope with blended + * attributes, plus userland might use the value to find out which + * threshold triggered. + */ + if (free < val && file < val) + return val; + return 0; +} + static vmevent_attr_sample_fn attr_samplers[] = { [VMEVENT_ATTR_NR_AVAIL_PAGES] = vmevent_attr_avail_pages, [VMEVENT_ATTR_NR_FREE_PAGES] = vmevent_attr_free_pages, [VMEVENT_ATTR_NR_SWAP_PAGES] = vmevent_attr_swap_pages, + [VMEVENT_ATTR_LOWMEM_PAGES] = vmevent_attr_lowmem_pages, };
static u64 vmevent_sample_attr(struct vmevent_watch *watch, struct vmevent_attr *attr) diff --git a/tools/testing/vmevent/vmevent-test.c b/tools/testing/vmevent/vmevent-test.c index fd9a174..c61aed7 100644 --- a/tools/testing/vmevent/vmevent-test.c +++ b/tools/testing/vmevent/vmevent-test.c @@ -33,7 +33,7 @@ int main(int argc, char *argv[])
config = (struct vmevent_config) { .sample_period_ns = 1000000000L, - .counter = 6, + .counter = 7, .attrs = { { .type = VMEVENT_ATTR_NR_FREE_PAGES, @@ -59,6 +59,13 @@ int main(int argc, char *argv[]) .type = VMEVENT_ATTR_NR_SWAP_PAGES, }, { + .type = VMEVENT_ATTR_LOWMEM_PAGES, + .state = VMEVENT_ATTR_STATE_VALUE_LT | + VMEVENT_ATTR_STATE_VALUE_EQ | + VMEVENT_ATTR_STATE_ONE_SHOT, + .value = phys_pages / 2, + }, + { .type = 0xffff, /* invalid */ }, }, @@ -108,6 +115,9 @@ int main(int argc, char *argv[]) case VMEVENT_ATTR_NR_SWAP_PAGES: printf(" VMEVENT_ATTR_NR_SWAP_PAGES: %Lu\n", attr->value); break; + case VMEVENT_ATTR_LOWMEM_PAGES: + printf(" VMEVENT_ATTR_LOWMEM_PAGES: %Lu\n", attr->value); + break; default: printf(" Unknown attribute: %Lu\n", attr->value); }
On Tue, May 1, 2012 at 4:26 PM, Anton Vorontsov anton.vorontsov@linaro.org wrote:
This is specially "blended" attribute, the event triggers when kernel decides that we're close to the low memory threshold. Userspace should not expect very precise meaning of low memory situation, mostly, it's just a guess on the kernel's side.
Well, this is the same as userland should not know or care how exactly kernel manages the memory, or assume that memory management behaviour is a part of the "ABI". So, all the 'low memory' is just guessing, but we're trying to do our best. It might be that we will end up with two or three variations of 'low memory' thresholds, and all of them would be useful for different use cases.
For this implementation, we assume that there's a low memory situation for the N pages threshold when we have neither N pages of completely free pages, nor we have N reclaimable pages in the cache. This effectively means, that if userland expects to allocate N pages, it would consume all the free pages, and any further allocations (above N) would start draining caches.
In the worst case, prior to hitting the threshold, we might have only N pages in cache, and nearly no memory as free pages.
The same 'low memory' meaning is used in the current Android Low Memory Killer driver.
Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org
I don't see why we couldn't add this. Minchan, thoughts?
include/linux/vmevent.h | 7 ++++++ mm/vmevent.c | 40 ++++++++++++++++++++++++++++++++++ tools/testing/vmevent/vmevent-test.c | 12 +++++++++- 3 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h index aae0d24..9bfa244 100644 --- a/include/linux/vmevent.h +++ b/include/linux/vmevent.h @@ -10,6 +10,13 @@ enum { VMEVENT_ATTR_NR_AVAIL_PAGES = 1UL, VMEVENT_ATTR_NR_FREE_PAGES = 2UL, VMEVENT_ATTR_NR_SWAP_PAGES = 3UL,
- /*
- * This is specially blended attribute, the event triggers
- * when kernel decides that we're close to the low memory threshold.
- * Don't expect very precise meaning of low memory situation, mostly,
- * it's just a guess on the kernel's side.
- */
- VMEVENT_ATTR_LOWMEM_PAGES = 4UL,
VMEVENT_ATTR_MAX /* non-ABI */ }; diff --git a/mm/vmevent.c b/mm/vmevent.c index b312236..d278a25 100644 --- a/mm/vmevent.c +++ b/mm/vmevent.c @@ -68,10 +68,50 @@ static u64 vmevent_attr_avail_pages(struct vmevent_watch *watch, return totalram_pages; }
+/*
- Here's some implementation details for the "low memory" meaning.
- (The explanation is not in the header file as userland should not
- know these details, nor it should assume that the meaning will
- always be the same. As well as it should not know how exactly kernel
- manages the memory, or assume that memory management behaviour is a
- part of the "ABI". So, all the 'low memory' is just guessing, but
- we're trying to do our best.)
- For this implementation, we assume that there's a low memory situation
- for the N pages threshold when we have neither N pages of completely
- free pages, nor we have N reclaimable pages in the cache. This
- effectively means, that if userland expects to allocate N pages, it
- would consume all the free pages, and any further allocations (above
- N) would start draining caches.
- In the worst case, prior hitting the threshold, we might have only
- N pages in cache, and nearly no memory as free pages.
- */
+static u64 vmevent_attr_lowmem_pages(struct vmevent_watch *watch,
- struct vmevent_attr *attr)
+{
- int free = global_page_state(NR_FREE_PAGES);
- int file = global_page_state(NR_FILE_PAGES) -
- global_page_state(NR_SHMEM); /* TODO: account locked pages */
- int val = attr->value;
- /*
- * For convenience we return 0 or attr value (instead of 0/1), it
- * makes it easier for vmevent_match() to cope with blended
- * attributes, plus userland might use the value to find out which
- * threshold triggered.
- */
- if (free < val && file < val)
- return val;
- return 0;
+}
static vmevent_attr_sample_fn attr_samplers[] = { [VMEVENT_ATTR_NR_AVAIL_PAGES] = vmevent_attr_avail_pages, [VMEVENT_ATTR_NR_FREE_PAGES] = vmevent_attr_free_pages, [VMEVENT_ATTR_NR_SWAP_PAGES] = vmevent_attr_swap_pages,
- [VMEVENT_ATTR_LOWMEM_PAGES] = vmevent_attr_lowmem_pages,
};
static u64 vmevent_sample_attr(struct vmevent_watch *watch, struct vmevent_attr *attr) diff --git a/tools/testing/vmevent/vmevent-test.c b/tools/testing/vmevent/vmevent-test.c index fd9a174..c61aed7 100644 --- a/tools/testing/vmevent/vmevent-test.c +++ b/tools/testing/vmevent/vmevent-test.c @@ -33,7 +33,7 @@ int main(int argc, char *argv[])
config = (struct vmevent_config) { .sample_period_ns = 1000000000L,
- .counter = 6,
- .counter = 7,
.attrs = { { .type = VMEVENT_ATTR_NR_FREE_PAGES, @@ -59,6 +59,13 @@ int main(int argc, char *argv[]) .type = VMEVENT_ATTR_NR_SWAP_PAGES, }, {
- .type = VMEVENT_ATTR_LOWMEM_PAGES,
- .state = VMEVENT_ATTR_STATE_VALUE_LT |
- VMEVENT_ATTR_STATE_VALUE_EQ |
- VMEVENT_ATTR_STATE_ONE_SHOT,
- .value = phys_pages / 2,
- },
- {
.type = 0xffff, /* invalid */ }, }, @@ -108,6 +115,9 @@ int main(int argc, char *argv[]) case VMEVENT_ATTR_NR_SWAP_PAGES: printf(" VMEVENT_ATTR_NR_SWAP_PAGES: %Lu\n", attr->value); break;
- case VMEVENT_ATTR_LOWMEM_PAGES:
- printf(" VMEVENT_ATTR_LOWMEM_PAGES: %Lu\n", attr->value);
- break;
default: printf(" Unknown attribute: %Lu\n", attr->value); } -- 1.7.9.2
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
On 05/01/2012 10:26 PM, Anton Vorontsov wrote:
This is specially "blended" attribute, the event triggers when kernel decides that we're close to the low memory threshold. Userspace should not expect very precise meaning of low memory situation, mostly, it's just a guess on the kernel's side.
Well, this is the same as userland should not know or care how exactly kernel manages the memory, or assume that memory management behaviour is a part of the "ABI". So, all the 'low memory' is just guessing, but we're trying to do our best. It might be that we will end up with two or three variations of 'low memory' thresholds, and all of them would
First of all, your calculation for available pages is very simple and it's very specific of recent mobile phone. But recent systems have various devices. For example,
SSD : very fast server SSD which has lots of internal ram so that write is very fast. thumb usb : very slow whihc has small ram
1) We can consider anon pages and dirty pages as available pages.
SSD thumb usb rootfs 0 swap 0
2) We can consider anon pages as available pages but dirty page doesn't
SSD thumb usb rootfs O swap 0
3) We can consider dirty pages as available pages but anon doesn't
SSD thumb usb rootfs O swap O
4) We can't consider dirty pages and anon pages as available pages
SSD thumb usb rootfs 0 swap 0
5) If we use zram as swap? 6) Another idea. If we use both zram and swap device(eMMC), then when zram is full, we writes zram pages into swap device with align cluster size?
I mean we can select various option to define low memory state.
be useful for different use cases.
Why should we do it in kernel side? If vmevent will have VMEVENT_ATTR_[FILE|MOCK|DIRTY|WRITEBACK|SHMEM|ANON|SWAP]_PAGES and so on which is needed by calculation, we can calculate it in userspace without forking /proc/vmstat to see it. So I think there is no problem to do it in userspace.
And even though we can solve above problem, it is possible to show up another new "blended" attribute in future and it will suffer same problem, again. So IMHO, let's leave vmevent as it is which is very raw attribute and let's do blended attribute in user space.
For this implementation, we assume that there's a low memory situation for the N pages threshold when we have neither N pages of completely free pages, nor we have N reclaimable pages in the cache. This effectively means, that if userland expects to allocate N pages, it would consume all the free pages, and any further allocations (above N) would start draining caches.
In the worst case, prior to hitting the threshold, we might have only N pages in cache, and nearly no memory as free pages.
The same 'low memory' meaning is used in the current Android Low Memory Killer driver.
Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org
include/linux/vmevent.h | 7 ++++++ mm/vmevent.c | 40 ++++++++++++++++++++++++++++++++++ tools/testing/vmevent/vmevent-test.c | 12 +++++++++- 3 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h index aae0d24..9bfa244 100644 --- a/include/linux/vmevent.h +++ b/include/linux/vmevent.h @@ -10,6 +10,13 @@ enum { VMEVENT_ATTR_NR_AVAIL_PAGES = 1UL, VMEVENT_ATTR_NR_FREE_PAGES = 2UL, VMEVENT_ATTR_NR_SWAP_PAGES = 3UL,
- /*
* This is specially blended attribute, the event triggers
* when kernel decides that we're close to the low memory threshold.
* Don't expect very precise meaning of low memory situation, mostly,
* it's just a guess on the kernel's side.
*/
- VMEVENT_ATTR_LOWMEM_PAGES = 4UL,
VMEVENT_ATTR_MAX /* non-ABI */ }; diff --git a/mm/vmevent.c b/mm/vmevent.c index b312236..d278a25 100644 --- a/mm/vmevent.c +++ b/mm/vmevent.c @@ -68,10 +68,50 @@ static u64 vmevent_attr_avail_pages(struct vmevent_watch *watch, return totalram_pages; } +/*
- Here's some implementation details for the "low memory" meaning.
- (The explanation is not in the header file as userland should not
- know these details, nor it should assume that the meaning will
- always be the same. As well as it should not know how exactly kernel
- manages the memory, or assume that memory management behaviour is a
- part of the "ABI". So, all the 'low memory' is just guessing, but
- we're trying to do our best.)
- For this implementation, we assume that there's a low memory situation
- for the N pages threshold when we have neither N pages of completely
- free pages, nor we have N reclaimable pages in the cache. This
- effectively means, that if userland expects to allocate N pages, it
- would consume all the free pages, and any further allocations (above
- N) would start draining caches.
- In the worst case, prior hitting the threshold, we might have only
- N pages in cache, and nearly no memory as free pages.
- */
+static u64 vmevent_attr_lowmem_pages(struct vmevent_watch *watch,
struct vmevent_attr *attr)
+{
- int free = global_page_state(NR_FREE_PAGES);
- int file = global_page_state(NR_FILE_PAGES) -
global_page_state(NR_SHMEM); /* TODO: account locked pages */
- int val = attr->value;
- /*
* For convenience we return 0 or attr value (instead of 0/1), it
* makes it easier for vmevent_match() to cope with blended
* attributes, plus userland might use the value to find out which
* threshold triggered.
*/
- if (free < val && file < val)
return val;
- return 0;
+}
static vmevent_attr_sample_fn attr_samplers[] = { [VMEVENT_ATTR_NR_AVAIL_PAGES] = vmevent_attr_avail_pages, [VMEVENT_ATTR_NR_FREE_PAGES] = vmevent_attr_free_pages, [VMEVENT_ATTR_NR_SWAP_PAGES] = vmevent_attr_swap_pages,
- [VMEVENT_ATTR_LOWMEM_PAGES] = vmevent_attr_lowmem_pages,
}; static u64 vmevent_sample_attr(struct vmevent_watch *watch, struct vmevent_attr *attr) diff --git a/tools/testing/vmevent/vmevent-test.c b/tools/testing/vmevent/vmevent-test.c index fd9a174..c61aed7 100644 --- a/tools/testing/vmevent/vmevent-test.c +++ b/tools/testing/vmevent/vmevent-test.c @@ -33,7 +33,7 @@ int main(int argc, char *argv[]) config = (struct vmevent_config) { .sample_period_ns = 1000000000L,
.counter = 6,
.attrs = { { .type = VMEVENT_ATTR_NR_FREE_PAGES,.counter = 7,
@@ -59,6 +59,13 @@ int main(int argc, char *argv[]) .type = VMEVENT_ATTR_NR_SWAP_PAGES, }, {
.type = VMEVENT_ATTR_LOWMEM_PAGES,
.state = VMEVENT_ATTR_STATE_VALUE_LT |
VMEVENT_ATTR_STATE_VALUE_EQ |
VMEVENT_ATTR_STATE_ONE_SHOT,
.value = phys_pages / 2,
},
},{ .type = 0xffff, /* invalid */ },
@@ -108,6 +115,9 @@ int main(int argc, char *argv[]) case VMEVENT_ATTR_NR_SWAP_PAGES: printf(" VMEVENT_ATTR_NR_SWAP_PAGES: %Lu\n", attr->value); break;
case VMEVENT_ATTR_LOWMEM_PAGES:
printf(" VMEVENT_ATTR_LOWMEM_PAGES: %Lu\n", attr->value);
break; default: printf(" Unknown attribute: %Lu\n", attr->value); }
On Fri, May 04, 2012 at 01:26:45PM +0900, Minchan Kim wrote: [...]
be useful for different use cases.
Why should we do it in kernel side?
Because currently you can't do this in userland, see below. Today this would be effectively the same as constantly reading /proc/vmstat, which is surely not friendly performance/context switches/battery wise.
If vmevent will have VMEVENT_ATTR_[FILE|MOCK|DIRTY|WRITEBACK|SHMEM|ANON|SWAP]_PAGES and so on which is needed by calculation, we can calculate it in userspace without forking /proc/vmstat to see it. So I think there is no problem to do it in userspace.
There are two problems.
1. Originally, the idea behind vmevent was that we should not expose all these mm details in vmevent, because it ties ABI with Linux internal memory representation;
2. If you have say a boolean '(A + B + C + ...) > X' attribute (which is exactly what blended attributes are), you can't just set up independent thresholds on A, B, C, ... and have the same effect.
(What we can do, though, is... introduce arithmetic operators in vmevent. :-D But then, at the end, we'll probably implement in-kernel forth-like stack machine, with vmevent_config array serving as a sequence of op-codes. ;-)
If we'll give up on "1." (Pekka, ping), then we need to solve "2." in a sane way: we'll have to add a 'NR_FILE_PAGES - NR_SHMEM - <todo-locked-file-pages>' attribute, and give it a name.
RECLAIMABLE_CACHE_PAGES maybe?
Thanks!
On Fri, May 4, 2012 at 10:38 AM, Anton Vorontsov anton.vorontsov@linaro.org wrote:
There are two problems.
- Originally, the idea behind vmevent was that we should not expose all
these mm details in vmevent, because it ties ABI with Linux internal memory representation;
- If you have say a boolean '(A + B + C + ...) > X' attribute (which is
exactly what blended attributes are), you can't just set up independent thresholds on A, B, C, ... and have the same effect.
(What we can do, though, is... introduce arithmetic operators in vmevent. :-D But then, at the end, we'll probably implement in-kernel forth-like stack machine, with vmevent_config array serving as a sequence of op-codes. ;-)
If we'll give up on "1." (Pekka, ping), then we need to solve "2." in a sane way: we'll have to add a 'NR_FILE_PAGES - NR_SHMEM - <todo-locked-file-pages>' attribute, and give it a name.
Well, no, we can't give up on (1) completely. That'd mean that eventually we'd need to change the ABI and break userspace. The difference between exposing internal details and reasonable abstractions is by no means black and white.
AFAICT, RECLAIMABLE_CACHE_PAGES is a reasonable thing to support. Can anyone come up with a reason why we couldn't do that in the future?
Pekka
If we'll give up on "1." (Pekka, ping), then we need to solve "2." in a sane way: we'll have to add a 'NR_FILE_PAGES - NR_SHMEM - <todo-locked-file-pages>' attribute, and give it a name.
Well, no, we can't give up on (1) completely. That'd mean that eventually we'd need to change the ABI and break userspace. The difference between exposing internal details and reasonable abstractions is by no means black and white.
AFAICT, RECLAIMABLE_CACHE_PAGES is a reasonable thing to support. Can anyone come up with a reason why we couldn't do that in the future?
It can. but the problem is, that is completely useless. Because of, 1) dirty pages writing-out is sometimes very slow and 2) libc and some important library's pages are critical important for running a system even though it is clean and reclaimable. In other word, kernel don't have an info then can't expose it.
On Mon, May 07, 2012 at 04:26:00AM -0400, KOSAKI Motohiro wrote:
If we'll give up on "1." (Pekka, ping), then we need to solve "2." in a sane way: we'll have to add a 'NR_FILE_PAGES - NR_SHMEM - <todo-locked-file-pages>' attribute, and give it a name.
Well, no, we can't give up on (1) completely. That'd mean that eventually we'd need to change the ABI and break userspace. The difference between exposing internal details and reasonable abstractions is by no means black and white.
AFAICT, RECLAIMABLE_CACHE_PAGES is a reasonable thing to support. Can anyone come up with a reason why we couldn't do that in the future?
It can. but the problem is, that is completely useless.
Surely it is useful. Could be not ideal, but you can't say that it is completely useless.
Because of, 1) dirty pages writing-out is sometimes very slow and
I don't see it as a unresolvable problem: we can exclude dirty pages, that's a nice idea actually.
Easily reclaimable cache pages = file_pages - shmem - locked pages - dirty pages.
The amount of dirty pages is configurable, which is also great.
Even more, we may introduce two attributes:
RECLAIMABLE_CACHE_PAGES and RECLAIMABLE_CACHE_PAGES_NOIO (which excludes dirty pages).
This makes ABI detached from the mm internals and still keeps a defined meaning of the attributes.
- libc and some important library's pages are critical important
for running a system even though it is clean and reclaimable. In other word, kernel don't have an info then can't expose it.
First off, I guess LRU would try to keep important/most used pages in the cache, as we try to never fully drain page cache to the zero mark.
Secondly, if we're really low on memory (which low memory notifications help to prevent) and kernel decided to throw libc's pages out of the cache, you'll get cache miss and kernel will have to read it back. Well, sometimes cache misses do happen, that's life. And if somebody really don't want this for the essential parts of the system, one have to mlock it (which eliminates your "kernel don't have an info" argument).
Btw, if you have any better strategy on helping userspace to define 'low memory' conditions, I'll readily try to implement it.
Thanks!
(5/7/12 8:15 AM), Anton Vorontsov wrote:
On Mon, May 07, 2012 at 04:26:00AM -0400, KOSAKI Motohiro wrote:
If we'll give up on "1." (Pekka, ping), then we need to solve "2." in a sane way: we'll have to add a 'NR_FILE_PAGES - NR_SHMEM - <todo-locked-file-pages>' attribute, and give it a name.
Well, no, we can't give up on (1) completely. That'd mean that eventually we'd need to change the ABI and break userspace. The difference between exposing internal details and reasonable abstractions is by no means black and white.
AFAICT, RECLAIMABLE_CACHE_PAGES is a reasonable thing to support. Can anyone come up with a reason why we couldn't do that in the future?
It can. but the problem is, that is completely useless.
Surely it is useful. Could be not ideal, but you can't say that it is completely useless.
Why? It doesn't work.
Because of, 1) dirty pages writing-out is sometimes very slow and
I don't see it as a unresolvable problem: we can exclude dirty pages, that's a nice idea actually.
Easily reclaimable cache pages = file_pages - shmem - locked pages
- dirty pages.
The amount of dirty pages is configurable, which is also great.
You don't understand the issue. The point is NOT a formula. The problem is, dirty and non-dirty pages aren't isolated in our kernel. Then, kernel start to get stuck far before non-dirty pages become empty. Lie notification always useless.
Even more, we may introduce two attributes:
RECLAIMABLE_CACHE_PAGES and RECLAIMABLE_CACHE_PAGES_NOIO (which excludes dirty pages).
This makes ABI detached from the mm internals and still keeps a defined meaning of the attributes.
Collection of craps are also crap. If you want to improve userland notification, you should join VM improvement activity. You shouldn't think nobody except you haven't think userland notification feature.
The problem is, Any current kernel vm statistics were not created for such purpose and don't fit.
Even though, some inaccurate and incorrect statistics fit _your_ usecase, they definitely don't fit other. And their people think it is bug.
- libc and some important library's pages are critical important
for running a system even though it is clean and reclaimable. In other word, kernel don't have an info then can't expose it.
First off, I guess LRU would try to keep important/most used pages in the cache, as we try to never fully drain page cache to the zero mark.
Yes, what do you want say?
Secondly, if we're really low on memory (which low memory notifications help to prevent) and kernel decided to throw libc's pages out of the cache, you'll get cache miss and kernel will have to read it back. Well, sometimes cache misses do happen, that's life. And if somebody really don't want this for the essential parts of the system, one have to mlock it (which eliminates your "kernel don't have an info" argument).
First off, "low memory" is very poor definition and we must not use it. It is multiple meanings. 1) System free memory is low. Some embedded have userland oom killer and they want to know _system_ status. 2) available memory is low. This is different from (1) when using NUMA, memcg or cpusets. And in nowadays, almost all x86 box have numa. This is userful for swap avoidance activity if we can implement correctly.
Secondly, we can't assume someone mlock to libc. Because of, Linux is generic purpose kernel. As far as you continue to talk about only user usecase, we can't agree you. "Users may have a workaround" don't make excuse to accept broken patch.
Btw, if you have any better strategy on helping userspace to define 'low memory' conditions, I'll readily try to implement it.
On Mon, May 07, 2012 at 03:19:50PM -0400, KOSAKI Motohiro wrote: [...]
You don't understand the issue.
Apparently.
The point is NOT a formula. The problem is, dirty and non-dirty pages aren't isolated in our kernel. Then, kernel start to get stuck far before non-dirty pages become empty. Lie notification always useless.
Ugh. I don't get it (yeah, see above :-), in what sense they're not isolated? In sense of isolate_lru_page and friends? Yes, they're not isolated, but how that makes the notifications untrustworthy?
I'm confused. Can you elaborate a bit?
Even more, we may introduce two attributes:
RECLAIMABLE_CACHE_PAGES and RECLAIMABLE_CACHE_PAGES_NOIO (which excludes dirty pages).
This makes ABI detached from the mm internals and still keeps a defined meaning of the attributes.
Collection of craps are also crap. If you want to improve userland notification, you should join VM improvement activity.
I'm all for improving VM, but please, be specific. I'm assuming there is currently some efforts on VM improvements, which I'm not aware of. Or there are some plans or thoughts on improvements -- please tell what are they.
You shouldn't think nobody except you haven't think userland notification feature.
That was never my assumption; surely many people have worked on userland notifications, and still, today we have none that would fully suite Android's or Nokia's (or "embedded people's") needs, right? ;-)
So, let's try solve things.
Memcg is currently not usable for us, and I explained why (the slab accounting for root cgroup thing: http://lkml.org/lkml/2012/4/30/115 ), any comments?
The problem is, Any current kernel vm statistics were not created for such purpose and don't fit.
OK, presuming current statistics don't fit, which ones should we implement? How do you see it?
Even though, some inaccurate and incorrect statistics fit _your_ usecase, they definitely don't fit other. And their people think it is bug.
I'm all for making a single solution for your and ours use cases, but you don't say anything specific.
(Btw, what are your use cases?)
- libc and some important library's pages are critical important
for running a system even though it is clean and reclaimable. In other word, kernel don't have an info then can't expose it.
First off, I guess LRU would try to keep important/most used pages in the cache, as we try to never fully drain page cache to the zero mark.
*1
Yes, what do you want say?
Secondly, if we're really low on memory (which low memory notifications help to prevent) and kernel decided to throw libc's pages out of the cache, you'll get cache miss and kernel will have to read it back. Well, sometimes cache misses do happen, that's life. And if somebody really don't want this for the essential parts of the system, one have to mlock it (which eliminates your "kernel don't have an info" argument).
First off, "low memory" is very poor definition and we must not use it.
OK.
It is multiple meanings.
- System free memory is low. Some embedded have userland
The 'free' has multiple meanings as well. For us, it is 'not-used-at-all-pages + the-pages-that-we-can-get-in-a- jiffy-and-not-disturb-things-much'.
The 'not-disturb-things-much' has a moot meaning as well, so all this should probably be tunable. Cool, so let's give the userspace all the needed statistics to decide on these meanings.
oom killer and they want to know _system_ status. 2) available memory is low. This is different from (1) when using NUMA, memcg or cpusets. And in nowadays, almost all x86 box have numa. This is userful for swap avoidance activity if we can implement correctly.
I don't get it: you don't see '1)' as a use case? You're saying that the meanings are different when using NUMA/memcg. If we don't use memcg, what statistics should we use?
OK, if you are hinting that memcg should be mandatory for proper statistics accounting, then please comment on the current memcg issues, which don't let us do '1)' via '2)'.
Secondly, we can't assume someone mlock to libc. Because of, Linux is generic purpose kernel.
You said that libc pages are important, implying that ideally they should never leave the page cache (i.e. we should not count the pages as 'easily reclaimable').
I answered that if are OK with "not guaranteed, but we'll do our best" strategy, then just don't let fully drain the caches, and then LRU will try keep "most important" pages (apparently libc) in the cache. *1 It is surely userland's task to maintain the needed amount of memory, and to do this efficiently we need..... notifications, that's right.
But if you want a guarantee, I guess mlock() is the only option -- it is the only way to tell the kernel that the pages are really not to be reclaimed.
So, in the light of 'easily reclaimable pages' statistics, what for was your libc point again? How would you solve "the libc problem"?
Thanks!
On Mon, May 7, 2012 at 10:19 PM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
Even more, we may introduce two attributes:
RECLAIMABLE_CACHE_PAGES and RECLAIMABLE_CACHE_PAGES_NOIO (which excludes dirty pages).
This makes ABI detached from the mm internals and still keeps a defined meaning of the attributes.
Collection of craps are also crap. If you want to improve userland notification, you should join VM improvement activity. You shouldn't think nobody except you haven't think userland notification feature.
The problem is, Any current kernel vm statistics were not created for such purpose and don't fit.
Even though, some inaccurate and incorrect statistics fit _your_ usecase, they definitely don't fit other. And their people think it is bug.
Well, yeah, if we are to report _number of pages_, the numbers better be meaningful.
That said, I think you are being unfair to Anton who's one of the few that's actually taking the time to implement this properly instead of settling for an out-of-tree hack.
On Tue, May 8, 2012 at 1:20 AM, Pekka Enberg penberg@kernel.org wrote:
On Mon, May 7, 2012 at 10:19 PM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
Even more, we may introduce two attributes:
RECLAIMABLE_CACHE_PAGES and RECLAIMABLE_CACHE_PAGES_NOIO (which excludes dirty pages).
This makes ABI detached from the mm internals and still keeps a defined meaning of the attributes.
Collection of craps are also crap. If you want to improve userland notification, you should join VM improvement activity. You shouldn't think nobody except you haven't think userland notification feature.
The problem is, Any current kernel vm statistics were not created for such purpose and don't fit.
Even though, some inaccurate and incorrect statistics fit _your_ usecase, they definitely don't fit other. And their people think it is bug.
Well, yeah, if we are to report _number of pages_, the numbers better be meaningful.
That said, I think you are being unfair to Anton who's one of the few that's actually taking the time to implement this properly instead of settling for an out-of-tree hack.
Unfair? But only I can talk about technical comment. To be honest, I really dislike I need say the same explanation again and again. A lot of people don't read past discussion. And as far as the patches take the same mistake, I must say the same thing. It is just PITA.
I don't disagree vmevent notification itself, but I must disagree lie notification. And also, To make just idea statistics doesn't make sense at all. How do an application choose the right events? If that depend on hardware configuration, userland developers can't write proper applications.
That's the reason why I often disagree at random new features.
On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
That said, I think you are being unfair to Anton who's one of the few that's actually taking the time to implement this properly instead of settling for an out-of-tree hack.
Unfair? But only I can talk about technical comment. To be honest, I really dislike I need say the same explanation again and again. A lot of people don't read past discussion. And as far as the patches take the same mistake, I must say the same thing. It is just PITA.
Unfair because you are trying to make it look as if Anton is only concerned with his specific use case. That's simply not true.
On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
I don't disagree vmevent notification itself, but I must disagree lie notification. And also, To make just idea statistics doesn't make sense at all. How do an application choose the right events? If that depend on hardware configuration, userland developers can't write proper applications.
That's exactly the problem we're trying to tackle here! We _want_ the ABI to provide sane, well-defined events that solve real world problems.
Pekka
On Tue, May 8, 2012 at 1:53 AM, Pekka Enberg penberg@kernel.org wrote:
On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
That said, I think you are being unfair to Anton who's one of the few that's actually taking the time to implement this properly instead of settling for an out-of-tree hack.
Unfair? But only I can talk about technical comment. To be honest, I really dislike I need say the same explanation again and again. A lot of people don't read past discussion. And as far as the patches take the same mistake, I must say the same thing. It is just PITA.
Unfair because you are trying to make it look as if Anton is only concerned with his specific use case. That's simply not true.
However current proposal certainly don't refer past discuss and don't work many environment.
On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
I don't disagree vmevent notification itself, but I must disagree lie notification. And also, To make just idea statistics doesn't make sense at all. How do an application choose the right events? If that depend on hardware configuration, userland developers can't write proper applications.
That's exactly the problem we're trying to tackle here! We _want_ the ABI to provide sane, well-defined events that solve real world problems.
Ok, sane. Then I take my time a little and review current vmevent code briefly. (I read vmevent/core branch in pekka's tree. please let me know if there is newer repositry)
I think following thing should be fixed.
1) sample_period is brain damaged idea. If people ONLY need to sampling stastics, they only need to read /proc/vmstat periodically. just remove it and implement push notification. _IF_ someone need unfrequent level trigger, just use "usleep(timeout); read(vmevent_fd)" on userland code. 2) VMEVENT_ATTR_STATE_ONE_SHOT is misleading name. That is effect as edge trigger shot. not only once. 3) vmevent_fd() seems sane interface. but it has name space unaware. maybe we discuss how to harmonize name space feature. No hurry. but we have to think that issue since at beginning. 4) Currently, vmstat have per-cpu batch and vmstat updating makes 3 second delay at maximum. This is fine for usual case because almost userland watcher only read /proc/vmstat per second. But, for vmevent_fd() case, 3 seconds may be unacceptable delay. At worst, 128 batch x 4096 x 4k pagesize = 2G bytes inaccurate is there. 5) __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from userland exporting files. When exporing kenrel internal, always silly gus used them and made unhappy. 6) Also vmevent_event must hide from userland. 7) vmevent_config::size must be removed. In 20th century, M$ API prefer to use this technique. But They dropped the way because a lot of application don't initialize size member and they can't use it for keeping upper compitibility. 8) memcg unaware 9) numa unaware 10) zone unaware
And, we may need vm internal change if we really need lowmem notification. current kernel don't have such info. _And_ there is one more big problem. Currently the kernel maintain memory per zone. But almost all userland application aren't aware zone nor node. Thus raw notification aren't useful for userland. In the other hands, total memory and total free memory is useful? Definitely No! Even though total free memory are lots, system may start swap out and oom invokation. If we can't oom invocation, this feature has serious raison d'etre issue. (i.e. (4), (8), (9) and (19) are not ignorable issue. I think)
On Tue, May 8, 2012 at 10:11 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
Ok, sane. Then I take my time a little and review current vmevent code briefly. (I read vmevent/core branch in pekka's tree. please let me know if there is newer repositry)
It's the latest one.
On Tue, May 8, 2012 at 10:11 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
- sample_period is brain damaged idea. If people ONLY need to
sampling stastics, they only need to read /proc/vmstat periodically. just remove it and implement push notification. _IF_ someone need unfrequent level trigger, just use "usleep(timeout); read(vmevent_fd)" on userland code.
That comes from a real-world requirement. See Leonid's email on the topic:
https://lkml.org/lkml/2012/5/2/42
- VMEVENT_ATTR_STATE_ONE_SHOT is misleading name. That is effect as
edge trigger shot. not only once.
Would VMEVENT_ATTR_STATE_EDGE_TRIGGER be a better name?
- vmevent_fd() seems sane interface. but it has name space unaware.
maybe we discuss how to harmonize name space feature. No hurry. but we have to think that issue since at beginning.
You mean VFS namespaces? Yeah, we need to take care of that.
- Currently, vmstat have per-cpu batch and vmstat updating makes 3
second delay at maximum. This is fine for usual case because almost userland watcher only read /proc/vmstat per second. But, for vmevent_fd() case, 3 seconds may be unacceptable delay. At worst, 128 batch x 4096 x 4k pagesize = 2G bytes inaccurate is there.
That's pretty awful. Anton, Leonid, comments?
- __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from userland
exporting files. When exporing kenrel internal, always silly gus used them and made unhappy.
Agreed. Anton, care to cook up a patch to do that?
- Also vmevent_event must hide from userland.
Why? That's part of the ABI.
- vmevent_config::size must be removed. In 20th century, M$ API
prefer to use this technique. But They dropped the way because a lot of application don't initialize size member and they can't use it for keeping upper compitibility.
It's there to support forward/backward ABI compatibility like perf does. I'm going to keep it for now but I'm open to dropping it when the ABI is more mature.
- memcg unaware
- numa unaware
- zone unaware
Yup.
And, we may need vm internal change if we really need lowmem notification. current kernel don't have such info. _And_ there is one more big problem. Currently the kernel maintain memory per zone. But almost all userland application aren't aware zone nor node. Thus raw notification aren't useful for userland. In the other hands, total memory and total free memory is useful? Definitely No! Even though total free memory are lots, system may start swap out and oom invokation. If we can't oom invocation, this feature has serious raison d'etre issue. (i.e. (4), (8), (9) and (19) are not ignorable issue. I think)
I'm guessing most of the existing solutions get away with approximations and soft limits because they're mostly used on UMA embedded machines.
But yes, we need to do better here.
(5/8/12 3:36 AM), Pekka Enberg wrote:
On Tue, May 8, 2012 at 10:11 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
Ok, sane. Then I take my time a little and review current vmevent code briefly. (I read vmevent/core branch in pekka's tree. please let me know if there is newer repositry)
It's the latest one.
On Tue, May 8, 2012 at 10:11 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
- sample_period is brain damaged idea. If people ONLY need to
sampling stastics, they only need to read /proc/vmstat periodically. just remove it and implement push notification. _IF_ someone need unfrequent level trigger, just use "usleep(timeout); read(vmevent_fd)" on userland code.
That comes from a real-world requirement. See Leonid's email on the topic:
I know, many embedded guys prefer such timer interval. I also have an experience similar logic when I was TV box developer. but I must disagree. Someone hope timer housekeeping complexity into kernel. but I haven't seen any justification.
- VMEVENT_ATTR_STATE_ONE_SHOT is misleading name. That is effect as
edge trigger shot. not only once.
Would VMEVENT_ATTR_STATE_EDGE_TRIGGER be a better name?
maybe.
- vmevent_fd() seems sane interface. but it has name space unaware.
maybe we discuss how to harmonize name space feature. No hurry. but we have to think that issue since at beginning.
You mean VFS namespaces? Yeah, we need to take care of that.
If we keep current vmevent_fd() design, we may need to create new namespace concept likes ipc namespace. current vmevent_fd() is not VFS based.
- Currently, vmstat have per-cpu batch and vmstat updating makes 3
second delay at maximum. This is fine for usual case because almost userland watcher only read /proc/vmstat per second. But, for vmevent_fd() case, 3 seconds may be unacceptable delay. At worst, 128 batch x 4096 x 4k pagesize = 2G bytes inaccurate is there.
That's pretty awful. Anton, Leonid, comments?
- __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from userland
exporting files. When exporing kenrel internal, always silly gus used them and made unhappy.
Agreed. Anton, care to cook up a patch to do that?
- Also vmevent_event must hide from userland.
Why? That's part of the ABI.
Ahhh, if so, I missed something. as far as I look, vmevent_fd() only depend on vmevent_config. which syscall depend on vmevent_evennt?
- vmevent_config::size must be removed. In 20th century, M$ API
prefer to use this technique. But They dropped the way because a lot of application don't initialize size member and they can't use it for keeping upper compitibility.
It's there to support forward/backward ABI compatibility like perf does. I'm going to keep it for now but I'm open to dropping it when the ABI is more mature.
perf api is not intended to use from generic applications. then, I don't think it will make abi issue. tool/perf is sane, isn't it? but vmevent_fd() is generic api and we can't trust all userland guy have sane, unfortunately.
- memcg unaware
- numa unaware
- zone unaware
Yup.
And, we may need vm internal change if we really need lowmem notification. current kernel don't have such info. _And_ there is one more big problem. Currently the kernel maintain memory per zone. But almost all userland application aren't aware zone nor node. Thus raw notification aren't useful for userland. In the other hands, total memory and total free memory is useful? Definitely No! Even though total free memory are lots, system may start swap out and oom invokation. If we can't oom invocation, this feature has serious raison d'etre issue. (i.e. (4), (8), (9) and (19) are not ignorable issue. I think)
I'm guessing most of the existing solutions get away with approximations and soft limits because they're mostly used on UMA embedded machines.
But yes, we need to do better here.
Hm. If you want vmevent makes depend on CONFIG_EMBEDDED, I have no reason to complain this feature. At that world, almost all applications _know_ their system configuration. then I don't think api misuse issue is big matter.
On Tue, May 8, 2012 at 10:50 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
- sample_period is brain damaged idea. If people ONLY need to
sampling stastics, they only need to read /proc/vmstat periodically. just remove it and implement push notification. _IF_ someone need unfrequent level trigger, just use "usleep(timeout); read(vmevent_fd)" on userland code.
That comes from a real-world requirement. See Leonid's email on the topic:
I know, many embedded guys prefer such timer interval. I also have an experience similar logic when I was TV box developer. but I must disagree. Someone hope timer housekeeping complexity into kernel. but I haven't seen any justification.
Leonid?
- Also vmevent_event must hide from userland.
Why? That's part of the ABI.
Ahhh, if so, I missed something. as far as I look, vmevent_fd() only depend on vmevent_config. which syscall depend on vmevent_evennt?
read(2). See tools/testing/vmevent/vmevent-test.c for an example how the ABI is used.
- vmevent_config::size must be removed. In 20th century, M$ API
prefer to use this technique. But They dropped the way because a lot of application don't initialize size member and they can't use it for keeping upper compitibility.
It's there to support forward/backward ABI compatibility like perf does. I'm going to keep it for now but I'm open to dropping it when the ABI is more mature.
perf api is not intended to use from generic applications. then, I don't think it will make abi issue. tool/perf is sane, isn't it? but vmevent_fd() is generic api and we can't trust all userland guy have sane, unfortunately.
The perf ABI is being used by other projects as well. We don't _depend_ on ->size being sane as much as use it to detect new features in the future.
But anyway, we can consider dropping once the ABI is more stable.
Hm. If you want vmevent makes depend on CONFIG_EMBEDDED, I have no reason to complain this feature. At that world, almost all applications _know_ their system configuration. then I don't think api misuse issue is big matter.
No, I don't want to do that. I was just commeting on why existing solutions are the way they are.
-----Original Message----- From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext Pekka Enberg Sent: 08 May, 2012 11:03 To: KOSAKI Motohiro Cc: Anton Vorontsov; Minchan Kim; Moiseichuk Leonid (Nokia-MP/Espoo); John
...
That comes from a real-world requirement. See Leonid's email on the topic:
I know, many embedded guys prefer such timer interval. I also have an experience similar logic when I was TV box developer. but I must disagree. Someone hope timer housekeeping complexity into kernel. but I haven't seen any justification.
Leonid?
The "usleep(timeout); read(vmevent_fd)" will eliminate opportunity to use vmevent API for mobile devices. Developers already have to use heartbeat primitives to align/sync timers and update code which is not always simple to do. But the idea is to have user-space wakeup only if we have something change in memory numbers, thus aligned timers will not help much in vmevent case due to memory situation may change a lot in short time. Short depends from software stack but usually it below 1s. To have use-time and wakeups on good level (below 50Hz by e.g. powertop) and allow cpu switch off timers of such short period like 1s are not allowed.
Leonid PS: Sorry, meetings prevent to do interesting things :( I am tracking conversation with quite low understanding how it will be useful for practical needs because user-space developers in 80% cases needs to track simply dirty memory changes i.e. modified pages which cannot be dropped.
On Tue, May 8, 2012 at 12:15 PM, leonid.moiseichuk@nokia.com wrote:
I am tracking conversation with quite low understanding how it will be useful for practical needs because user-space developers in 80% cases needs to track simply dirty memory changes i.e. modified pages which cannot be dropped.
The point is to support those cases in such a way that works sanely across different architectures and configurations.
-----Original Message----- From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext Pekka Enberg Sent: 08 May, 2012 12:20
...
On Tue, May 8, 2012 at 12:15 PM, leonid.moiseichuk@nokia.com wrote:
I am tracking conversation with quite low understanding how it will be useful for practical needs because user-space developers in 80% cases needs to track simply dirty memory changes i.e. modified pages which cannot
be dropped.
The point is to support those cases in such a way that works sanely across different architectures and configurations.
Which usually means you need to increase level of abstraction from hugepages, swapped, kernel reclaimable, slab allocated, active memory to used, cache and must-have memory which are common for all platforms. Do you have visibility what do you need to show and how do you will calculate it? Does it possible to do for system, group of processes or particular process?
I implemented system-wide variant because that was a simplest one and most urgent I need. But e.g. to say how much memory consumed particular process in Linux you need to use smaps. Probably vmevent need to have some scratches (aka roadmap) into this direction as well.
On Tue, May 08, 2012 at 10:36:31AM +0300, Pekka Enberg wrote: [...]
- VMEVENT_ATTR_STATE_ONE_SHOT is misleading name. That is effect as
edge trigger shot. not only once.
Would VMEVENT_ATTR_STATE_EDGE_TRIGGER be a better name?
[...]
- Currently, vmstat have per-cpu batch and vmstat updating makes 3
second delay at maximum. This is fine for usual case because almost userland watcher only read /proc/vmstat per second. But, for vmevent_fd() case, 3 seconds may be unacceptable delay. At worst, 128 batch x 4096 x 4k pagesize = 2G bytes inaccurate is there.
That's pretty awful. Anton, Leonid, comments?
[...]
- __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from userland
exporting files. When exporing kenrel internal, always silly gus used them and made unhappy.
Agreed. Anton, care to cook up a patch to do that?
KOSAKI-San, Pekka,
Much thanks for your reviews!
These three issues should be fixed by the following patches. One mm/ change is needed outside of vmevent...
And I'm looking into other issues you pointed out...
Thanks!
--- include/linux/vmevent.h | 10 +++---- include/linux/vmstat.h | 2 ++ mm/vmevent.c | 66 +++++++++++++++++++++++++++++------------------ mm/vmstat.c | 22 +++++++++++++++- 4 files changed, 68 insertions(+), 32 deletions(-)
This function forcibly flushes per-cpu vmstat diff counters to the global counters.
Note that we don't try to flush percpu pagesets, the pcp will be still flushed once per 3 seconds.
Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- include/linux/vmstat.h | 2 ++ mm/vmstat.c | 22 +++++++++++++++++++++- 2 files changed, 23 insertions(+), 1 deletion(-)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 65efb92..2a53896 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -200,6 +200,7 @@ extern void __inc_zone_state(struct zone *, enum zone_stat_item); extern void dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_zone_state(struct zone *, enum zone_stat_item);
+void refresh_vm_stats(void); void refresh_cpu_vm_stats(int); void refresh_zone_stat_thresholds(void);
@@ -253,6 +254,7 @@ static inline void __dec_zone_page_state(struct page *page,
#define set_pgdat_percpu_threshold(pgdat, callback) { }
+static inline void refresh_vm_stats(void) { } static inline void refresh_cpu_vm_stats(int cpu) { } static inline void refresh_zone_stat_thresholds(void) { }
diff --git a/mm/vmstat.c b/mm/vmstat.c index f600557..4a9d432 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -13,6 +13,7 @@ #include <linux/err.h> #include <linux/module.h> #include <linux/slab.h> +#include <linux/smp.h> #include <linux/cpu.h> #include <linux/vmstat.h> #include <linux/sched.h> @@ -434,7 +435,7 @@ EXPORT_SYMBOL(dec_zone_page_state); * with the global counters. These could cause remote node cache line * bouncing and will have to be only done when necessary. */ -void refresh_cpu_vm_stats(int cpu) +static void __refresh_cpu_vm_stats(int cpu, bool drain_pcp) { struct zone *zone; int i; @@ -456,11 +457,15 @@ void refresh_cpu_vm_stats(int cpu) local_irq_restore(flags); atomic_long_add(v, &zone->vm_stat[i]); global_diff[i] += v; + if (!drain_pcp) + continue; #ifdef CONFIG_NUMA /* 3 seconds idle till flush */ p->expire = 3; #endif } + if (!drain_pcp) + continue; cond_resched(); #ifdef CONFIG_NUMA /* @@ -495,6 +500,21 @@ void refresh_cpu_vm_stats(int cpu) atomic_long_add(global_diff[i], &vm_stat[i]); }
+void refresh_cpu_vm_stats(int cpu) +{ + __refresh_cpu_vm_stats(cpu, 1); +} + +static void smp_call_refresh_cpu_vm_stats(void *info) +{ + __refresh_cpu_vm_stats(smp_processor_id(), 0); +} + +void refresh_vm_stats(void) +{ + smp_call_function(smp_call_refresh_cpu_vm_stats, NULL, 1); +} + #endif
#ifdef CONFIG_NUMA
On Fri, 1 Jun 2012, Anton Vorontsov wrote:
This function forcibly flushes per-cpu vmstat diff counters to the global counters.
Why is it necessary to have a function that does not expire the pcps? Is that side effect important? We use refresh_vm_cpu_stats(cpu) in page_alloc.c already to flush the vmstat counters. Is the flushing of the pcps in 2 seconds insteads of 3 once really that important?
Also if we do this
Can we therefore also name the function in a different way like
flush_vmstats()
@@ -456,11 +457,15 @@ void refresh_cpu_vm_stats(int cpu) local_irq_restore(flags); atomic_long_add(v, &zone->vm_stat[i]); global_diff[i] += v;
if (!drain_pcp)
continue;
#ifdef CONFIG_NUMA /* 3 seconds idle till flush */ p->expire = 3; #endif
Erm. This should be
#ifdef CONFIG_NUMA if (drain_pcp) p->expire = 3; #endif
The construct using "continue" is weird.
}
if (!drain_pcp)
cond_resched();continue;
#ifdef CONFIG_NUMA /* @@ -495,6 +500,21 @@ void refresh_cpu_vm_stats(int cpu) atomic_long_add(global_diff[i], &vm_stat[i]); }
+void refresh_cpu_vm_stats(int cpu) +{
- __refresh_cpu_vm_stats(cpu, 1);
+}
Fold __refresh_cpu_vm_stats into this function and modify the caller of refresh_cpu_vm_stats instead.
(6/1/12 8:24 AM), Anton Vorontsov wrote:
This function forcibly flushes per-cpu vmstat diff counters to the global counters.
Note that we don't try to flush percpu pagesets, the pcp will be still flushed once per 3 seconds.
Signed-off-by: Anton Vorontsovanton.vorontsov@linaro.org
No.
This is insane. Your patch improved vmevent accuracy a _bit_. But instead of, decrease a performance of large systems. That's no good deal. 99% user never uses vmevent.
MOREOVER, this patch don't solve vmevent accuracy issue AT ALL. because of, a second is enough big to make large inaccuracy. Modern cpus are fast. Guys, the fact is, user land monitor can't use implicit batch likes vmstat. That's a reason why perf don't use vmstat. I suggest you see perf APIs. It may bring you good inspiration.
We'll need to use smp_function_call() in the sampling routines, and the call is not supposed to be called from the bottom halves. So, let's convert vmevent to dffered workqueues.
As a side effect, we also fix the swap reporting (we cannot call si_swapinfo from the interrupt context), i.e. the following oops should be fixed now:
================================= [ INFO: inconsistent lock state ] 3.4.0-rc1+ #37 Not tainted --------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes: (swap_lock){+.?...}, at: [<ffffffff8110449d>] si_swapinfo+0x1d/0x90 {SOFTIRQ-ON-W} state was registered at: [<ffffffff8107ca7f>] mark_irqflags+0x15f/0x1b0 [<ffffffff8107e5e3>] __lock_acquire+0x493/0x9d0 [<ffffffff8107f20e>] lock_acquire+0x9e/0x200 [<ffffffff813e9071>] _raw_spin_lock+0x41/0x50 [<ffffffff8110449d>] si_swapinfo+0x1d/0x90 [<ffffffff8117e7c8>] meminfo_proc_show+0x38/0x3f0 [<ffffffff81141209>] seq_read+0x139/0x3f0 [<ffffffff81174cc6>] proc_reg_read+0x86/0xc0 [<ffffffff8111c19c>] vfs_read+0xac/0x160 [<ffffffff8111c29a>] sys_read+0x4a/0x90 [<ffffffff813ea652>] system_call_fastpath+0x16/0x1b
Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- mm/vmevent.c | 49 ++++++++++++++++++++++++++++--------------------- 1 file changed, 28 insertions(+), 21 deletions(-)
diff --git a/mm/vmevent.c b/mm/vmevent.c index 381e9d1..4ca2a04 100644 --- a/mm/vmevent.c +++ b/mm/vmevent.c @@ -3,7 +3,7 @@ #include <linux/compiler.h> #include <linux/vmevent.h> #include <linux/syscalls.h> -#include <linux/timer.h> +#include <linux/workqueue.h> #include <linux/file.h> #include <linux/list.h> #include <linux/poll.h> @@ -34,7 +34,7 @@ struct vmevent_watch { struct vmevent_attr *config_attrs[VMEVENT_CONFIG_MAX_ATTRS];
/* sampling */ - struct timer_list timer; + struct delayed_work work;
/* poll */ wait_queue_head_t waitq; @@ -146,15 +146,13 @@ static bool vmevent_match(struct vmevent_watch *watch) }
/* - * This function is called from the timer context, which has the same - * guaranties as an interrupt handler: it can have only one execution - * thread (unlike bare softirq handler), so we don't need to worry - * about racing w/ ourselves. + * This function is called from a workqueue, which can have only one + * execution thread, so we don't need to worry about racing w/ ourselves. * - * We also don't need to worry about several instances of timers - * accessing the same vmevent_watch, as we allocate vmevent_watch - * together w/ the timer instance in vmevent_fd(), so there is always - * one timer per vmevent_watch. + * We also don't need to worry about several instances of us accessing + * the same vmevent_watch, as we allocate vmevent_watch together w/ the + * work instance in vmevent_fd(), so there is always one work per + * vmevent_watch. * * All the above makes it possible to implement the lock-free logic, * using just the atomic watch->pending variable. @@ -178,26 +176,35 @@ static void vmevent_sample(struct vmevent_watch *watch) atomic_set(&watch->pending, 1); }
-static void vmevent_timer_fn(unsigned long data) +static void vmevent_schedule_watch(struct vmevent_watch *watch) { - struct vmevent_watch *watch = (struct vmevent_watch *)data; + schedule_delayed_work(&watch->work, + nsecs_to_jiffies64(watch->config.sample_period_ns)); +} + +static struct vmevent_watch *work_to_vmevent_watch(struct work_struct *work) +{ + struct delayed_work *wk = to_delayed_work(work); + + return container_of(wk, struct vmevent_watch, work); +} + +static void vmevent_timer_fn(struct work_struct *work) +{ + struct vmevent_watch *watch = work_to_vmevent_watch(work);
vmevent_sample(watch);
if (atomic_read(&watch->pending)) wake_up(&watch->waitq); - mod_timer(&watch->timer, jiffies + - nsecs_to_jiffies64(watch->config.sample_period_ns)); + + vmevent_schedule_watch(watch); }
static void vmevent_start_timer(struct vmevent_watch *watch) { - init_timer_deferrable(&watch->timer); - watch->timer.data = (unsigned long)watch; - watch->timer.function = vmevent_timer_fn; - watch->timer.expires = jiffies + - nsecs_to_jiffies64(watch->config.sample_period_ns); - add_timer(&watch->timer); + INIT_DELAYED_WORK_DEFERRABLE(&watch->work, vmevent_timer_fn); + vmevent_schedule_watch(watch); }
static unsigned int vmevent_poll(struct file *file, poll_table *wait) @@ -259,7 +266,7 @@ static int vmevent_release(struct inode *inode, struct file *file) { struct vmevent_watch *watch = file->private_data;
- del_timer_sync(&watch->timer); + cancel_delayed_work_sync(&watch->work);
kfree(watch);
(6/1/12 8:24 AM), Anton Vorontsov wrote:
We'll need to use smp_function_call() in the sampling routines, and the call is not supposed to be called from the bottom halves. So, let's convert vmevent to dffered workqueues.
As a side effect, we also fix the swap reporting (we cannot call si_swapinfo from the interrupt context), i.e. the following oops should be fixed now:
================================= [ INFO: inconsistent lock state ] 3.4.0-rc1+ #37 Not tainted
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes: (swap_lock){+.?...}, at: [<ffffffff8110449d>] si_swapinfo+0x1d/0x90 {SOFTIRQ-ON-W} state was registered at: [<ffffffff8107ca7f>] mark_irqflags+0x15f/0x1b0 [<ffffffff8107e5e3>] __lock_acquire+0x493/0x9d0 [<ffffffff8107f20e>] lock_acquire+0x9e/0x200 [<ffffffff813e9071>] _raw_spin_lock+0x41/0x50 [<ffffffff8110449d>] si_swapinfo+0x1d/0x90 [<ffffffff8117e7c8>] meminfo_proc_show+0x38/0x3f0 [<ffffffff81141209>] seq_read+0x139/0x3f0 [<ffffffff81174cc6>] proc_reg_read+0x86/0xc0 [<ffffffff8111c19c>] vfs_read+0xac/0x160 [<ffffffff8111c29a>] sys_read+0x4a/0x90 [<ffffffff813ea652>] system_call_fastpath+0x16/0x1b
Signed-off-by: Anton Vorontsovanton.vorontsov@linaro.org
As I already told you, vmevent shouldn't deal a timer at all. It is NOT familiar to embedded world. Because of, time subsystem is one of most complex one on linux. Our 'time' is not simple concept. time.h says we have 5 possibilities user want, at least.
include/linux/time.h ------------------------------------------ #define CLOCK_REALTIME 0 #define CLOCK_MONOTONIC 1 #define CLOCK_MONOTONIC_RAW 4 #define CLOCK_REALTIME_COARSE 5 #define CLOCK_MONOTONIC_COARSE 6
And, some people want to change timer slack for optimize power consumption.
So, Don't reinventing the wheel. Just use posix tiemr apis.
On Thu, Jun 07, 2012 at 11:25:30PM -0400, KOSAKI Motohiro wrote: [...]
As I already told you, vmevent shouldn't deal a timer at all. It is NOT familiar to embedded world. Because of, time subsystem is one of most complex one on linux. Our 'time' is not simple concept. time.h says we have 5 possibilities user want, at least.
include/linux/time.h
#define CLOCK_REALTIME 0 #define CLOCK_MONOTONIC 1 #define CLOCK_MONOTONIC_RAW 4 #define CLOCK_REALTIME_COARSE 5 #define CLOCK_MONOTONIC_COARSE 6
And, some people want to change timer slack for optimize power consumption.
So, Don't reinventing the wheel. Just use posix tiemr apis.
I'm puzzled, why you mention posix timers in the context of the in-kernel user? And none of the posix timers are deferrable.
The whole point of vmevent is to be lightweight and save power. Vmevent is doing all the work in the kernel, and it uses deferrable timers/workqueues to save power, and it is a proper in-kernel API to do so.
If you're saying that we should set up a timer in the userland and constantly read /proc/vmstat, then we will cause CPU wake up every 100ms, which is not acceptable. Well, we can try to introduce deferrable timers for the userspace. But then it would still add a lot more overhead for our task, as this solution adds other two context switches to read and parse /proc/vmstat. I guess this is not a show-stopper though, so we can discuss this.
Leonid, Pekka, what do you think about the idea?
Thanks,
On Fri, Jun 8, 2012 at 9:58 AM, Anton Vorontsov anton.vorontsov@linaro.org wrote:
If you're saying that we should set up a timer in the userland and constantly read /proc/vmstat, then we will cause CPU wake up every 100ms, which is not acceptable. Well, we can try to introduce deferrable timers for the userspace. But then it would still add a lot more overhead for our task, as this solution adds other two context switches to read and parse /proc/vmstat. I guess this is not a show-stopper though, so we can discuss this.
Leonid, Pekka, what do you think about the idea?
That's exactly the kind of half-assed ABI that lead to people inventing out-of-tree lowmem notifiers in the first place.
I'd be more interested to know what people think of Minchan's that gets rid of vmstat sampling.
On Fri, Jun 08, 2012 at 10:03:24AM +0300, Pekka Enberg wrote:
On Fri, Jun 8, 2012 at 9:58 AM, Anton Vorontsov anton.vorontsov@linaro.org wrote:
If you're saying that we should set up a timer in the userland and constantly read /proc/vmstat, then we will cause CPU wake up every 100ms, which is not acceptable. Well, we can try to introduce deferrable timers for the userspace. But then it would still add a lot more overhead for our task, as this solution adds other two context switches to read and parse /proc/vmstat. I guess this is not a show-stopper though, so we can discuss this.
Leonid, Pekka, what do you think about the idea?
That's exactly the kind of half-assed ABI that lead to people inventing out-of-tree lowmem notifiers in the first place.
:-)
Well, at least powersaving-wise, the solution w/ userland deferred timers would be much better then just looping over /proc/vmstat each 100ms, and it is comparable to vmevent. Not pretty, but still would work.
I'd be more interested to know what people think of Minchan's that gets rid of vmstat sampling.
I answered to Minchan's post. The thing is that Minchan's idea is not a substitution for vmevent. To me it seems like a shrinker w/ some pre-filter.
Thanks,
-----Original Message----- From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org] Sent: 08 June, 2012 09:58
...
If you're saying that we should set up a timer in the userland and constantly read /proc/vmstat, then we will cause CPU wake up every 100ms, which is not acceptable. Well, we can try to introduce deferrable timers for the userspace. But then it would still add a lot more overhead for our task, as this solution adds other two context switches to read and parse /proc/vmstat. I guess this is not a show-stopper though, so we can discuss this.
Leonid, Pekka, what do you think about the idea?
Seems to me not nice solution. Generating/parsing vmstat every 100ms plus wakeups it is what exactly should be avoid to have sense to API. It also will cause page trashing because user-space code could be pushed out from cache if VM decide.
Thanks,
-- Anton Vorontsov Email: cbouatmailru@gmail.com
On Fri, Jun 8, 2012 at 3:05 AM, leonid.moiseichuk@nokia.com wrote:
-----Original Message----- From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org] Sent: 08 June, 2012 09:58
...
If you're saying that we should set up a timer in the userland and constantly read /proc/vmstat, then we will cause CPU wake up every 100ms, which is not acceptable. Well, we can try to introduce deferrable timers for the userspace. But then it would still add a lot more overhead for our task, as this solution adds other two context switches to read and parse /proc/vmstat. I guess this is not a show-stopper though, so we can discuss this.
Leonid, Pekka, what do you think about the idea?
Seems to me not nice solution. Generating/parsing vmstat every 100ms plus wakeups it is what exactly should be avoid to have sense to API.
No. I don't suggest to wake up every 100ms. I suggest to integrate existing subsystems. If you need any enhancement, just do it.
It also will cause page trashing because user-space code could be pushed out from cache if VM decide.
This is completely unrelated issue. Even if notification code is not swapped, userland notify handling code still may be swapped. So, if you must avoid swap, you must use mlock.
-----Original Message----- From: ext KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com] Sent: 08 June, 2012 10:11
..
No. I don't suggest to wake up every 100ms. I suggest to integrate existing subsystems. If you need any enhancement, just do it.
That will be non-trivial to hook all vmstat updates . Simple to use deferred timer.
It also will cause page trashing because user-space code could be pushed
out from cache if VM decide.
This is completely unrelated issue. Even if notification code is not swapped, userland notify handling code still may be swapped. So, if you must avoid swap, you must use mlock.
If you wakeup only by signal when memory situation changed you can be not mlocked. Mlocking uses memory very inefficient way and usually cannot be applied for apps which wants to be notified due to resources restrictions.
It also will cause page trashing because user-space code could be pushed
out from cache if VM decide.
This is completely unrelated issue. Even if notification code is not swapped, userland notify handling code still may be swapped. So, if you must avoid swap, you must use mlock.
If you wakeup only by signal when memory situation changed you can be not mlocked. Mlocking uses memory very inefficient way and usually cannot be applied for apps which wants to be notified due to resources restrictions.
That's your choice. If you don't need to care cache dropping, We don't enforce it. I only pointed out your explanation was technically incorrect.
-----Original Message----- From: ext KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com] Sent: 08 June, 2012 10:23
...
If you wakeup only by signal when memory situation changed you can be
not mlocked.
Mlocking uses memory very inefficient way and usually cannot be applied
for apps which wants to be notified due to resources restrictions.
That's your choice. If you don't need to care cache dropping, We don't enforce it. I only pointed out your explanation was technically incorrect.
My explanation is correct. That is an overhead you have to pay if start to use API based on polling from user-space and this overhead narrows API applicability. Moving all times/tracking to kernel avoid useless wakeups in user-space.
(6/8/12 3:28 AM), leonid.moiseichuk@nokia.com wrote:
-----Original Message----- From: ext KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com] Sent: 08 June, 2012 10:23
...
If you wakeup only by signal when memory situation changed you can be
not mlocked.
Mlocking uses memory very inefficient way and usually cannot be applied
for apps which wants to be notified due to resources restrictions.
That's your choice. If you don't need to care cache dropping, We don't enforce it. I only pointed out your explanation was technically incorrect.
My explanation is correct. That is an overhead you have to pay if start to use API based on polling from user-space and this overhead narrows API applicability. Moving all times/tracking to kernel avoid useless wakeups in user-space.
Wrong. CPU don't realized the running code belong to userspace or kernel. Every code just consume a power. That's why polling timer is wrong from point of power consumption view.
-----Original Message----- From: ext KOSAKI Motohiro [mailto:kosaki.motohiro@gmail.com] Sent: 08 June, 2012 10:33 To: Moiseichuk Leonid (Nokia-MP/Espoo)
..
Wrong. CPU don't realized the running code belong to userspace or kernel. Every code just consume a power. That's why polling timer is wrong from point of power consumption view.
??? We are talking about different things. User-space code could be dropped, distributed between several applications and has not deferred timers support. For polling API the user-space code has to be executed quite often. Localizing this code in kernel additionally allows to avoid vmsat/meminfo generation and parsing overhead as well.
On Fri, Jun 08, 2012 at 07:05:46AM +0000, leonid.moiseichuk@nokia.com wrote:
-----Original Message----- From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org] Sent: 08 June, 2012 09:58
...
If you're saying that we should set up a timer in the userland and constantly read /proc/vmstat, then we will cause CPU wake up every 100ms, which is not acceptable. Well, we can try to introduce deferrable timers for the userspace. But then it would still add a lot more overhead for our task, as this solution adds other two context switches to read and parse /proc/vmstat. I guess this is not a show-stopper though, so we can discuss this.
Leonid, Pekka, what do you think about the idea?
Seems to me not nice solution. Generating/parsing vmstat every 100ms plus wakeups it is what exactly should be avoid to have sense to API.
No, iff we implement deferred timers for userland, we would not wake up every 100 ms. The only additional overhead comparing to vmevent would be:
a) Two more context swtiches; b) Serialization/deserialization of /proc/vmstat.
It also will cause page trashing because user-space code could be pushed out from cache if VM decide.
This can solved by moving a "watcher" to a separate (daemon) process, and mlocking it. We do this in ulmkd.
Thanks,
-----Original Message----- From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org] Sent: 08 June, 2012 10:59
...
a) Two more context swtiches; b) Serialization/deserialization of /proc/vmstat.
It also will cause page trashing because user-space code could be pushed
out from cache if VM decide.
This can solved by moving a "watcher" to a separate (daemon) process, and mlocking it. We do this in ulmkd.
Right. It but it has drawbacks as well e.g. ensure that daemon scheduled properly and propagate reaction decision outside ulmkd. Also I understand your statement about "watcher" as probably you use one timer for daemon. Btw, in my variant (memnotify.c) I used only one timer, it is enough.
On Fri, Jun 08, 2012 at 08:16:04AM +0000, leonid.moiseichuk@nokia.com wrote:
-----Original Message----- From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org] Sent: 08 June, 2012 10:59
...
a) Two more context swtiches; b) Serialization/deserialization of /proc/vmstat.
It also will cause page trashing because user-space code could be pushed
out from cache if VM decide.
This can solved by moving a "watcher" to a separate (daemon) process, and mlocking it. We do this in ulmkd.
Right. It but it has drawbacks as well e.g. ensure that daemon scheduled properly and propagate reaction decision outside ulmkd.
No, ulmkd itself propagates the decision (i.e. it kills processes).
Here is how it works:
1. Android activity manager (it is tons of Java-code, runs inside a JVM) maintains list of applications and their "importance" index. This huge pile of code (and the whole JVM) of course can't be mlocked, and so it only maintains the list.
2. Once ulmkd (a separate low memory killer daemon, written in C) receives notification that system is low on memory, then it looks at the already prepared lists, and based on the processes' importance (and current free memory level) it kills appropriate apps.
Note that in-kernel LMK does absolutely the same as ulmkd, just in the kernel (and the "importance index" is passed to LMK as per-process oom_score_adj).
Also I understand your statement about "watcher" as probably you use one timer for daemon. Btw, in my variant (memnotify.c) I used only one timer, it is enough.
Not quite following.
In ulmkd I don't use timers at all, and by "watcher" I mean the some userspace daemon that receives lowmemory/pressure events (in our case it is ulmkd).
If we start "polling" on /proc/vmstat via userland deferred timers, that would be a single timer, just like in vmevent case. So, I'm not sure what is the difference?..
Thanks,
-----Original Message----- From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org] Sent: 08 June, 2012 11:41
...
Right. It but it has drawbacks as well e.g. ensure that daemon scheduled
properly and propagate reaction decision outside ulmkd.
No, ulmkd itself propagates the decision (i.e. it kills processes).
That is a decision "select & kill" :) Propagation of this decision required time. Not all processes could be killed. You may stuck in killing in some cases. ...
In ulmkd I don't use timers at all, and by "watcher" I mean the some userspace daemon that receives lowmemory/pressure events (in our case it is ulmkd).
Thanks for info.
If we start "polling" on /proc/vmstat via userland deferred timers, that would be a single timer, just like in vmevent case. So, I'm not sure what is the difference?..
Context switches, parsing, activity in userspace even memory situation is not changed. In kernel space you can use sliding timer (increasing interval) + shinker.
On Fri, Jun 08, 2012 at 08:57:13AM +0000, leonid.moiseichuk@nokia.com wrote:
-----Original Message----- From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org] Sent: 08 June, 2012 11:41
...
Right. It but it has drawbacks as well e.g. ensure that daemon scheduled
properly and propagate reaction decision outside ulmkd.
No, ulmkd itself propagates the decision (i.e. it kills processes).
That is a decision "select & kill" :) Propagation of this decision required time. Not all processes could be killed. You may stuck in killing in some cases.
Yeah. But since we have plenty of free memory (i.e. we're getting notified in advance), it's OK to be not super-fast.
And if we're losing control, OOMK will help us. (Btw, we can introduce "thrash killer" in-kernel driver. This would also help desktop use case, when the system is thrashing so hard that it becomes unresponsive, we'd better do something about it. When browser goes crazy on my laptop, I wish I had such a driver. :-) It takes forever to get OOM condition w/ 2GB swap space, slow hard drive and CPU all busy w/ moving pages back and forward.)
If we start "polling" on /proc/vmstat via userland deferred timers, that would be a single timer, just like in vmevent case. So, I'm not sure what is the difference?..
Context switches, parsing, activity in userspace even memory situation is not changed.
Sure, there is some additional overhead. I'm just saying that it is not drastic. It would be like 100 sprintfs + 100 sscanfs + 2 context switches? Well, it is unfortunate... but come on, today's phones are running X11 and Java. :-)
In kernel space you can use sliding timer (increasing interval) + shinker.
Well, w/ Minchan's idea, we can get shrinker notifications into the userland, so the sliding timer thing would be still possible.
Thanks,
-----Original Message----- From: ext Anton Vorontsov [mailto:cbouatmailru@gmail.com] Sent: 08 June, 2012 13:35
...
Context switches, parsing, activity in userspace even memory situation is
not changed.
Sure, there is some additional overhead. I'm just saying that it is not drastic. It would be like 100 sprintfs + 100 sscanfs + 2 context switches? Well, it is unfortunate... but come on, today's phones are running X11 and Java. :-)
Vmstat generation is not so trivial. Meminfo has even higher overhead. I just checked generation time using idling device and open/read test: - vmstat min 30, avg 94 max 2746 uSeconds - meminfo min 30, average 65 max 15961 uSeconds
In comparison /proc/version for the same conditions: min 30, average 41, max 1505 uSeconds
In kernel space you can use sliding timer (increasing interval) + shinker.
Well, w/ Minchan's idea, we can get shrinker notifications into the userland, so the sliding timer thing would be still possible.
Only as a post-schrinker actions. In case of memory stressing or close-to-stressing conditions shrinkers called very often, I saw up to 50 times per second.
On Fri, Jun 08, 2012 at 11:03:29AM +0000, leonid.moiseichuk@nokia.com wrote:
-----Original Message----- From: ext Anton Vorontsov [mailto:cbouatmailru@gmail.com] Sent: 08 June, 2012 13:35
...
Context switches, parsing, activity in userspace even memory situation is
not changed.
Sure, there is some additional overhead. I'm just saying that it is not drastic. It would be like 100 sprintfs + 100 sscanfs + 2 context switches? Well, it is unfortunate... but come on, today's phones are running X11 and Java. :-)
Vmstat generation is not so trivial. Meminfo has even higher overhead. I just checked generation time using idling device and open/read test:
- vmstat min 30, avg 94 max 2746 uSeconds
- meminfo min 30, average 65 max 15961 uSeconds
In comparison /proc/version for the same conditions: min 30, average 41, max 1505 uSeconds
Hm. I would expect that avg value for meminfo will be much worse than vmstat (meminfo grabs some locks).
OK, if we consider 100ms interval, then this would be like 0.1% overhead? Not great, but still better than memcg:
http://lkml.org/lkml/2011/12/21/487
:-)
Personally? I'm all for saving these 0.1% tho, I'm all for vmevent. But, for example, it's still broken for SMP as it is costly to update vm_stat. And I see no way to fix this.
So, I guess the right approach would be to find ways to not depend on frequent vm_stat updates (and thus reads).
userland deferred timers (and infrequent reads from vmstat) + "userland vm pressure notifications" looks promising for the userland solution.
For in-kernel solution it is all the same, a deferred timer that reads vm_stat occasionally (no pressure case) + in-kernel shrinker notifications for fast reaction under pressure.
In kernel space you can use sliding timer (increasing interval) + shinker.
Well, w/ Minchan's idea, we can get shrinker notifications into the userland, so the sliding timer thing would be still possible.
Only as a post-schrinker actions. In case of memory stressing or close-to-stressing conditions shrinkers called very often, I saw up to 50 times per second.
Well, yes. But in userland you would just poll/select on the shrinker notification fd, you won't get more than you can (or want to) process.
-----Original Message----- From: ext Anton Vorontsov [mailto:anton.vorontsov@linaro.org] Sent: 08 June, 2012 15:14 To: Moiseichuk Leonid (Nokia-MP/Espoo)
...
Hm. I would expect that avg value for meminfo will be much worse than vmstat (meminfo grabs some locks).
OK, if we consider 100ms interval, then this would be like 0.1% overhead? Not great, but still better than memcg:
That is difficult to win over memcg :) But in comparison to one syscall like read() for small structure for particular device the generation of meminfo is about 1000x times more expensive.
So, I guess the right approach would be to find ways to not depend on frequent vm_stat updates (and thus reads).
Agree.
userland deferred timers (and infrequent reads from vmstat) + "userland vm pressure notifications" looks promising for the userland solution.
On SMP, kernel updates vmstats only once per second, which makes vmevent unusable. Let's fix it by updating vmstats before sampling.
Reported-by: KOSAKI Motohiro kosaki.motohiro@gmail.com Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- mm/vmevent.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/mm/vmevent.c b/mm/vmevent.c index 4ca2a04..35fd0d5 100644 --- a/mm/vmevent.c +++ b/mm/vmevent.c @@ -2,6 +2,7 @@ #include <linux/atomic.h> #include <linux/compiler.h> #include <linux/vmevent.h> +#include <linux/mm.h> #include <linux/syscalls.h> #include <linux/workqueue.h> #include <linux/file.h> @@ -163,6 +164,9 @@ static void vmevent_sample(struct vmevent_watch *watch)
if (atomic_read(&watch->pending)) return; + + refresh_vm_stats(); + if (!vmevent_match(watch)) return;
On Fri, 1 Jun 2012, Anton Vorontsov wrote:
On SMP, kernel updates vmstats only once per second, which makes vmevent unusable. Let's fix it by updating vmstats before sampling.
Well this may increase your accuracy but there is no guarantee that an update to vm counters will not happen immediately after you have refreshed the counters for one processor or the other.
Also please consider the impact that a IPI broadcast will have on latency of other processors and to the function that is currently executing.
We just went through a round of getting rid of IPI broadcast because they create OS noise on processors.
...so that nobody would try to use the internally used bits.
Suggested-by: KOSAKI Motohiro kosaki.motohiro@gmail.com Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- include/linux/vmevent.h | 6 ++---- mm/vmevent.c | 9 +++++++-- 2 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h index aae0d24..b8ec0ac 100644 --- a/include/linux/vmevent.h +++ b/include/linux/vmevent.h @@ -35,10 +35,8 @@ enum { */ VMEVENT_ATTR_STATE_ONE_SHOT = (1UL << 3),
- /* Saved state, used internally by the kernel for one-shot mode. */ - __VMEVENT_ATTR_STATE_VALUE_WAS_LT = (1UL << 30), - /* Saved state, used internally by the kernel for one-shot mode. */ - __VMEVENT_ATTR_STATE_VALUE_WAS_GT = (1UL << 31), + __VMEVENT_ATTR_STATE_INTERNAL = (1UL << 30) | + (1UL << 31), };
struct vmevent_attr { diff --git a/mm/vmevent.c b/mm/vmevent.c index 35fd0d5..e64a92d 100644 --- a/mm/vmevent.c +++ b/mm/vmevent.c @@ -83,6 +83,11 @@ static u64 vmevent_sample_attr(struct vmevent_watch *watch, struct vmevent_attr return fn(watch, attr); }
+enum { + VMEVENT_ATTR_STATE_VALUE_WAS_LT = (1UL << 30), + VMEVENT_ATTR_STATE_VALUE_WAS_GT = (1UL << 31), +}; + static bool vmevent_match(struct vmevent_watch *watch) { struct vmevent_config *config = &watch->config; @@ -100,8 +105,8 @@ static bool vmevent_match(struct vmevent_watch *watch)
if (attr_lt || attr_gt || attr_eq) { bool one_shot = state & VMEVENT_ATTR_STATE_ONE_SHOT; - u32 was_lt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_LT; - u32 was_gt_mask = __VMEVENT_ATTR_STATE_VALUE_WAS_GT; + u32 was_lt_mask = VMEVENT_ATTR_STATE_VALUE_WAS_LT; + u32 was_gt_mask = VMEVENT_ATTR_STATE_VALUE_WAS_GT; u64 value = vmevent_sample_attr(watch, attr); bool lt = value < attr->value; bool gt = value > attr->value;
VMEVENT_ATTR_STATE_ONE_SHOT is misleading name. That is effect as edge trigger shot, not only once.
Suggested-by: KOSAKI Motohiro kosaki.motohiro@gmail.com Suggested-by: Pekka Enberg penberg@kernel.org Signed-off-by: Anton Vorontsov anton.vorontsov@linaro.org --- include/linux/vmevent.h | 4 ++-- mm/vmevent.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/vmevent.h b/include/linux/vmevent.h index b8ec0ac..b1c4016 100644 --- a/include/linux/vmevent.h +++ b/include/linux/vmevent.h @@ -31,9 +31,9 @@ enum { */ VMEVENT_ATTR_STATE_VALUE_EQ = (1UL << 2), /* - * One-shot mode. + * Edge trigger mode. */ - VMEVENT_ATTR_STATE_ONE_SHOT = (1UL << 3), + VMEVENT_ATTR_STATE_EDGE_TRIGGER = (1UL << 3),
__VMEVENT_ATTR_STATE_INTERNAL = (1UL << 30) | (1UL << 31), diff --git a/mm/vmevent.c b/mm/vmevent.c index e64a92d..46c1d18 100644 --- a/mm/vmevent.c +++ b/mm/vmevent.c @@ -104,7 +104,7 @@ static bool vmevent_match(struct vmevent_watch *watch) continue;
if (attr_lt || attr_gt || attr_eq) { - bool one_shot = state & VMEVENT_ATTR_STATE_ONE_SHOT; + bool edge = state & VMEVENT_ATTR_STATE_EDGE_TRIGGER; u32 was_lt_mask = VMEVENT_ATTR_STATE_VALUE_WAS_LT; u32 was_gt_mask = VMEVENT_ATTR_STATE_VALUE_WAS_GT; u64 value = vmevent_sample_attr(watch, attr); @@ -117,7 +117,7 @@ static bool vmevent_match(struct vmevent_watch *watch) bool ret = false;
if (((attr_lt && lt) || (attr_gt && gt) || - (attr_eq && eq)) && !one_shot) + (attr_eq && eq)) && !edge) return true;
if (attr_eq && eq && was_eq) {
On Fri, 1 Jun 2012, Anton Vorontsov wrote:
That's pretty awful. Anton, Leonid, comments?
[...]
- __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from userland
exporting files. When exporing kenrel internal, always silly gus used them and made unhappy.
Agreed. Anton, care to cook up a patch to do that?
KOSAKI-San, Pekka,
Much thanks for your reviews!
These three issues should be fixed by the following patches. One mm/ change is needed outside of vmevent...
And I'm looking into other issues you pointed out...
I applied patches 2, 4, and 5. The vmstat patch need ACKs from VM folks to enter the tree.
Pekka
On 06/04/2012 03:26 AM, Pekka Enberg wrote:
On Fri, 1 Jun 2012, Anton Vorontsov wrote:
That's pretty awful. Anton, Leonid, comments?
[...]
- __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from userland
exporting files. When exporing kenrel internal, always silly gus used them and made unhappy.
Agreed. Anton, care to cook up a patch to do that?
KOSAKI-San, Pekka,
Much thanks for your reviews!
These three issues should be fixed by the following patches. One mm/ change is needed outside of vmevent...
And I'm looking into other issues you pointed out...
I applied patches 2, 4, and 5. The vmstat patch need ACKs from VM folks to enter the tree.
Pekka
It's time to wrap it up. It seems some people include me have tried to improve vmevent But I hope let us convince why we need vmevent before further review/patch.
Recently I tried reclaim-latency based notifier to consider backed device speed I mentioned elsewhere thread. The working model is that measure reclaim time and if it doesn't finish requirement time which is configurable by admin, notify it to user or kill some thread but I didn't published yet because it's not easy for admin to control and several issues.
AFAIK, low memory notifier is started for replacing android lowmemory killer. At the same time, other folks want use it generally. As I look through android low memory killer, it's not too bad except some point.
1. It should not depend on shrink_slab. If we need, we can add some hook in vmscan.c directly instead of shrink_slab. 2. We can use out_of_memory instead of custom victim selection/killing function. If we need, we can change out_of_memory interface little bit for passing needed information to select victim. 3. calculation for available pages
1) and 2) would make android low memory killer very general and 3) can meet each folk's requirement, I believe.
Anton, I expect you already investigated android low memory killer so maybe you know pros and cons of each solution. Could you convince us "why we need vmevent" and "why can't android LMK do it?"
KOSAKI, AFAIRC, you are a person who hates android low memory killer. Why do you hate it? If it solve problems I mentioned, do you have a concern, still? If so, please, list up.
Android low memory killer is proved solution for a long time, at least embedded area(So many android phone already have used it) so I think improving it makes sense to me rather than inventing new wheel. Frankly speaking, I don't know vmevent's other use cases except low memory notification and didn't see any agreement about that with other guys.
I hope we get an agreement about vmevent before further enhance.
Thanks, all.
On Mon, Jun 4, 2012 at 11:45 AM, Minchan Kim minchan@kernel.org wrote:
KOSAKI, AFAIRC, you are a person who hates android low memory killer. Why do you hate it? If it solve problems I mentioned, do you have a concern, still? If so, please, list up.
Android low memory killer is proved solution for a long time, at least embedded area(So many android phone already have used it) so I think improving it makes sense to me rather than inventing new wheel.
VM events started out as *ABI cleanup* of Nokia's N9 Linux lowmem notifier. That's not reinventing the wheel.
On Mon, Jun 4, 2012 at 11:45 AM, Minchan Kim minchan@kernel.org wrote:
Frankly speaking, I don't know vmevent's other use cases except low memory notification and didn't see any agreement about that with other guys.
I think you are missing the point. "vmevent" is an ABI for delivering VM events to userspace. I started it because different userspaces do not agree what "low memory" means - for obvious reasons.
As for use cases, it'd be useful for VMs to be notified of "about to swap your pages soon" so that they can aggressively GC before entering GC-swapstorm hell. I also hear that something similar would be useful for KVM/QEMU but I don't know the details.
I really don't see how Android's "low memory killer" will be useful as a generic solution.
Pekka
On Mon, Jun 04, 2012 at 12:20:18PM +0300, Pekka Enberg wrote:
On Mon, Jun 4, 2012 at 11:45 AM, Minchan Kim minchan@kernel.org wrote:
KOSAKI, AFAIRC, you are a person who hates android low memory killer. Why do you hate it? If it solve problems I mentioned, do you have a concern, still? If so, please, list up.
Android low memory killer is proved solution for a long time, at least embedded area(So many android phone already have used it) so I think improving it makes sense to me rather than inventing new wheel.
VM events started out as *ABI cleanup* of Nokia's N9 Linux lowmem notifier. That's not reinventing the wheel.
On Mon, Jun 4, 2012 at 11:45 AM, Minchan Kim minchan@kernel.org wrote:
Frankly speaking, I don't know vmevent's other use cases except low memory notification and didn't see any agreement about that with other guys.
I think you are missing the point. "vmevent" is an ABI for delivering VM events to userspace. I started it because different userspaces do not agree what "low memory" means - for obvious reasons.
The part I dislike vmevent is to expose raw vmstat itself.
VMEVENT_ATTR_NR_SWAP_PAGES VMEVENT_ATTR_NR_FREE_PAGES VMEVENT_ATTR_NR_AVAIL_PAGES
And some calculation based on raw vmstat. We need more abstraction based on vmscan heuristic.
As for use cases, it'd be useful for VMs to be notified of "about to swap your pages soon" so that they can aggressively GC before entering
How do you detect it? It's a policy which is most important part on vmevent.
GC-swapstorm hell. I also hear that something similar would be useful for KVM/QEMU but I don't know the details.
I really don't see how Android's "low memory killer" will be useful as a generic solution.
Yes. we can't do it by current android LMK. I'm not big fan of androild LMK but we can improve it much by my suggestions and yours smart ideas, I believe.
Pekka
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
On Mon, Jun 04, 2012 at 05:45:06PM +0900, Minchan Kim wrote: [...]
AFAIK, low memory notifier is started for replacing android lowmemory killer. At the same time, other folks want use it generally. As I look through android low memory killer, it's not too bad except some point.
- It should not depend on shrink_slab. If we need, we can add some hook in vmscan.c directly instead of shrink_slab.
- We can use out_of_memory instead of custom victim selection/killing function. If we need, we can change out_of_memory interface little bit for passing needed information to select victim.
- calculation for available pages
- and 2) would make android low memory killer very general and 3) can meet each folk's requirement, I believe.
Anton, I expect you already investigated android low memory killer so maybe you know pros and cons of each solution. Could you convince us "why we need vmevent" and "why can't android LMK do it?"
Note that 1) and 2) are not problems per se, it's just implementation details, easy stuff. Vmevent is basically an ABI/API, and I didn't hear anybody who would object to vmevent ABI idea itself. More than this, nobody stop us from implementing in-kernel vmevent API, and make Android Lowmemory killer use it, if we want to.
The real problem is not with vmevent. Today there are two real problems:
a) Gathering proper statistics from the kernel. Both cgroups and vmstat have issues. Android lowmemory killer has the same problems w/ the statistics as vmevent, it uses vmstat, so by no means Android low memory killer is better or easier in this regard. (And cgroups has issues w/ slab accounting, plus some folks don't want memcg at all, since it has runtime and memory-wise costs.)
b) Interpreting this statistics. We can't provide one, universal "low memory" definition that would work for everybody. (Btw, your "levels" based low memory grading actually sounds the same as mine RECLAIMABLE_CACHE_PAGES and RECLAIMABLE_CACHE_PAGES_NOIO idea, i.e. http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/02751.html so personally I like the idea of level-based approach, based on available memory *cost*.)
So, you see, all these issues are valid for vmevent, cgroups and android low memory killer.
KOSAKI, AFAIRC, you are a person who hates android low memory killer. Why do you hate it? If it solve problems I mentioned, do you have a concern, still? If so, please, list up.
Android low memory killer is proved solution for a long time, at least embedded area(So many android phone already have used it) so I think improving it makes sense to me rather than inventing new wheel.
Yes, nobody throws Android lowmemory killer away. And recently I fixed a bunch of issues in its tasks traversing and killing code. Now it's just time to "fix" statistics gathering and interpretation issues, and I see vmevent as a good way to do just that, and then we can either turn Android lowmemory killer driver to use the vmevent in-kernel API (so it will become just a "glue" between notifications and killing functions), or use userland daemon.
Note that memcg has notifications as well, so it's another proof that there is a demand for this stuff outside of embedded world, and going with ad-hoc, custom "low memory killer" is simple and tempting approach, but it doesn't solve any real problems.
Frankly speaking, I don't know vmevent's other use cases except low memory notification
I won't speak for realistic use-cases, but that is what comes to mind:
- DB can grow its caches/in-memory indexes infinitely, and start dropping them on demand (based on internal LRU list, for example). No more guessed/static configuration for DB daemons? - Assuming VPS hosting w/ dynamic resources management, notifications would be useful to readjust resources? - On desktops, apps can drop their caches on demand if they want to and can avoid swap activity?
Thanks,
On Mon, Jun 04, 2012 at 04:38:12AM -0700, Anton Vorontsov wrote:
On Mon, Jun 04, 2012 at 05:45:06PM +0900, Minchan Kim wrote: [...]
AFAIK, low memory notifier is started for replacing android lowmemory killer. At the same time, other folks want use it generally. As I look through android low memory killer, it's not too bad except some point.
- It should not depend on shrink_slab. If we need, we can add some hook in vmscan.c directly instead of shrink_slab.
- We can use out_of_memory instead of custom victim selection/killing function. If we need, we can change out_of_memory interface little bit for passing needed information to select victim.
- calculation for available pages
- and 2) would make android low memory killer very general and 3) can meet each folk's requirement, I believe.
Anton, I expect you already investigated android low memory killer so maybe you know pros and cons of each solution. Could you convince us "why we need vmevent" and "why can't android LMK do it?"
Note that 1) and 2) are not problems per se, it's just implementation details, easy stuff. Vmevent is basically an ABI/API, and I didn't
My opinion is different. Things depends on only vmstat have limitation. IMHO. 1) of such POV is very important. If we don't do 1), low memory notifier must depend on vmstat only. But if we do 1), we can add some hooks in vmscan.c so we can control notifier more exactly/fine-grained.
As stupid example, if get_scan_count get the anon pages and it is about to reclaim inactive anon list, it's a kind of signal anon pages swap out so that we can notify it to qemu/kvm which will ballon.
Other example, if we get the high priority of scanning (ex, DEF_PRIORITY - 2) and get scanned value greater than lru size, it is sort of high memory pressure.
My point is that if we want level notifier approach, we need more tightly VM-coupled thing
hear anybody who would object to vmevent ABI idea itself. More than this, nobody stop us from implementing in-kernel vmevent API, and make Android Lowmemory killer use it, if we want to.
I guess other guys didn't have a interest or very busy. In this chance, I hope all guys have a consensus. At least, we need Andrew, Rik, David and KOSAKI's opinion.
The real problem is not with vmevent. Today there are two real problems:
a) Gathering proper statistics from the kernel. Both cgroups and vmstat have issues. Android lowmemory killer has the same problems w/ the statistics as vmevent, it uses vmstat, so by no means Android low memory killer is better or easier in this regard.
Right.
(And cgroups has issues w/ slab accounting, plus some folks don't want memcg at all, since it has runtime and memory-wise costs.)
As I mentioned earlier, we need more VM-tightly coupled policy, NOT vmstat.
b) Interpreting this statistics. We can't provide one, universal "low memory" definition that would work for everybody. (Btw, your "levels" based low memory grading actually sounds the same as mine RECLAIMABLE_CACHE_PAGES and RECLAIMABLE_CACHE_PAGES_NOIO idea, i.e. http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/02751.html so personally I like the idea of level-based approach, based on available memory *cost*.)
Yes. I like such abstraction like my suggestion. For new comer's information, Quote from my previoius mail " The idea is that we can make some levels in advane and explain it to user
Level 1: It a immediate response to user when kernel decide there are not fast-reclaimable pages any more. Level 2: It's rather slower response than level 1 but kernel will consider it as reclaimable target Level 3: It's slowest response because kernel will consider page needed long time to reclaim as reclaimable target.
It doesn't expose any internal of kernel and can implment it in internal. For simple example,
Level 1: non-mapped clean page Level 2: Level 1 + mapped clean-page Level 3: Level 2 + dirty pages
So users of vmevent_fd can select his level. Of course, latency sensitive application with slow stoarge would select Level 1. Some application might use Level 4(Level 3 + half of swap) because it has very fast storage.
And application receives event can make strategy folloing as.
When it receives level 1 notification, it could request to others if it can release their own buffers. When it receives level 2 notification, it could request to suicide if it's not critical application. When it receives level 3 notification, it could kill others.
It's a just example and my point is we should storage speed to make it general. "
So, you see, all these issues are valid for vmevent, cgroups and android low memory killer.
Agree.
KOSAKI, AFAIRC, you are a person who hates android low memory killer. Why do you hate it? If it solve problems I mentioned, do you have a concern, still? If so, please, list up.
Android low memory killer is proved solution for a long time, at least embedded area(So many android phone already have used it) so I think improving it makes sense to me rather than inventing new wheel.
Yes, nobody throws Android lowmemory killer away. And recently I fixed a bunch of issues in its tasks traversing and killing code. Now it's just time to "fix" statistics gathering and interpretation issues, and I see vmevent as a good way to do just that, and then we can either turn Android lowmemory killer driver to use the vmevent in-kernel API (so it will become just a "glue" between notifications and killing functions), or use userland daemon.
If you can define such level by only vmstat, I am OKAY and want to see how to glue vmevent and android LMK. But keep in mind about killing. killer should be a kernel, not user. https://lkml.org/lkml/2011/12/19/330
Note that memcg has notifications as well, so it's another proof that there is a demand for this stuff outside of embedded world, and going with ad-hoc, custom "low memory killer" is simple and tempting approach, but it doesn't solve any real problems.
Frankly speaking, I don't know vmevent's other use cases except low memory notification
I won't speak for realistic use-cases, but that is what comes to mind:
- DB can grow its caches/in-memory indexes infinitely, and start dropping them on demand (based on internal LRU list, for example). No more guessed/static configuration for DB daemons?
- Assuming VPS hosting w/ dynamic resources management, notifications would be useful to readjust resources?
- On desktops, apps can drop their caches on demand if they want to and can avoid swap activity?
I don't mean VMEVENT_ATTR_LOWMEM_PAGES but following as,
VMEVENT_ATTR_NR_FREE_PAGES VMEVENT_ATTR_NR_SWAP_PAGES VMEVENT_ATTR_NR_AVAIL_PAGES
I'm not sure how it is useful. Even VMEVENT_ATTR_LOWMEM_PAGES, I'm not sure it would be useful with only vmstat.
Thanks,
-- Anton Vorontsov Email: cbouatmailru@gmail.com
On Mon, Jun 04, 2012 at 09:17:22PM +0900, Minchan Kim wrote: [...]
But keep in mind about killing. killer should be a kernel, not user. https://lkml.org/lkml/2011/12/19/330
There you're discussing out-of-memory condition vs. low memory situation, and I don't see LMK as a substitution for the OOMK, or vice versa.
LMK triggers when we have plenty of free memory (and thus time); and OOMK is a last resort thing: it is super-fast, as the kernel may have literally nothing to do but start killing tasks.
Sure, if LMK won't react fast and low-memory turns into "out of memory", then OOMK will "help", which is totally fine. And it is no different from current in-kernel Android low memory killer (which gives no guarantees that OOMK will not trigger, it is still possible to make OOM condition, even with LMK enabled).
Anyway, personally I don't mind if LMK would live in the kernel, but I don't see it as a mandatory thing (unlike OOMK, which is kernel-only thing, of course).
Frankly speaking, I don't know vmevent's other use cases except low memory notification
I won't speak for realistic use-cases, but that is what comes to mind:
- DB can grow its caches/in-memory indexes infinitely, and start dropping them on demand (based on internal LRU list, for example). No more guessed/static configuration for DB daemons?
- Assuming VPS hosting w/ dynamic resources management, notifications would be useful to readjust resources?
- On desktops, apps can drop their caches on demand if they want to and can avoid swap activity?
I don't mean VMEVENT_ATTR_LOWMEM_PAGES but following as,
VMEVENT_ATTR_NR_FREE_PAGES VMEVENT_ATTR_NR_SWAP_PAGES VMEVENT_ATTR_NR_AVAIL_PAGES
I'm not sure how it is useful.
Yep, these raw values are mostly useless, and personally I don't use these plain attributes. I use heuristics, i.e. "blended" attributes. If we can come up with levels-based approach w/o using vmstat, well.. OK then.
But note that we actually want to know how much "free", "cheap to reclaim" and "can reclaim, but at cost" pages there are. It should not be just "ouch, we're out of cheap pages" signal, since the whole point of Android low memory killer is to act in advance, its idea is to try to free memory before the condition happens, and thus make the phone UI more smooth. And the threshold is actually load/HW specific, so I guess folks want to tune it (e.g. "trigger when there are XX MB of 'cheap to reclaim' pages left").
Thanks,
On Mon, Jun 4, 2012 at 4:35 PM, Anton Vorontsov cbouatmailru@gmail.com wrote:
I don't mean VMEVENT_ATTR_LOWMEM_PAGES but following as,
VMEVENT_ATTR_NR_FREE_PAGES VMEVENT_ATTR_NR_SWAP_PAGES VMEVENT_ATTR_NR_AVAIL_PAGES
I'm not sure how it is useful.
Yep, these raw values are mostly useless, and personally I don't use these plain attributes. I use heuristics, i.e. "blended" attributes. If we can come up with levels-based approach w/o using vmstat, well.. OK then.
That's what Nokia's lowmem notifier uses. We can probably drop them once we have something else they could use.
Hi Peakk,
On 06/05/2012 04:53 PM, Pekka Enberg wrote:
On Mon, Jun 4, 2012 at 4:35 PM, Anton Vorontsov cbouatmailru@gmail.com wrote:
I don't mean VMEVENT_ATTR_LOWMEM_PAGES but following as,
VMEVENT_ATTR_NR_FREE_PAGES VMEVENT_ATTR_NR_SWAP_PAGES VMEVENT_ATTR_NR_AVAIL_PAGES
I'm not sure how it is useful.
Yep, these raw values are mostly useless, and personally I don't use these plain attributes. I use heuristics, i.e. "blended" attributes. If we can come up with levels-based approach w/o using vmstat, well.. OK then.
That's what Nokia's lowmem notifier uses. We can probably drop them once we have something else they could use.
Next concern is that periodic timer of implementation. I think it would add direct hook in vmscan.c rather than peeking raw vmstat periodically by timer so we can control more fine-grained way without unnecessary overhead.
On Tue, Jun 5, 2012 at 11:00 AM, Minchan Kim minchan@kernel.org wrote:
That's what Nokia's lowmem notifier uses. We can probably drop them once we have something else they could use.
Next concern is that periodic timer of implementation. I think it would add direct hook in vmscan.c rather than peeking raw vmstat periodically by timer so we can control more fine-grained way without unnecessary overhead.
If the hooks are clean and it doesn't hurt the !CONFIG_VMEVENT case, I'm completely OK with that.
-----Original Message----- From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext Pekka Enberg Sent: 05 June, 2012 11:02 To: Minchan Kim
...
Next concern is that periodic timer of implementation. I think it would add direct hook in vmscan.c rather than peeking raw vmstat periodically by timer so we can control more fine-grained way
without unnecessary overhead.
If the hooks are clean and it doesn't hurt the !CONFIG_VMEVENT case, I'm completely OK with that.
On the previous iteration hooking vm was pointed as very bad idea, so in my version I installed shrinker to handle cases when we have memory pressure. Using deferred timer with adequate timeout (0.250 ms or larger) fully suitable for userspace and produce adequate overhead -> by nature such API should not be 100% accurate, anyhow applications cannot handle situation as good as kernel can provide, 0.5MB space accuracy, 100ms is maximum user-space require for 64-1024MB devices.
On 06/05/2012 05:16 PM, leonid.moiseichuk@nokia.com wrote:
-----Original Message----- From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext Pekka Enberg Sent: 05 June, 2012 11:02 To: Minchan Kim
...
Next concern is that periodic timer of implementation. I think it would add direct hook in vmscan.c rather than peeking raw vmstat periodically by timer so we can control more fine-grained way
without unnecessary overhead.
If the hooks are clean and it doesn't hurt the !CONFIG_VMEVENT case, I'm completely OK with that.
On the previous iteration hooking vm was pointed as very bad idea, so in my version I installed shrinker to handle cases when we have memory pressure. Using deferred timer with adequate timeout (0.250 ms or larger) fully suitable for userspace and produce adequate overhead -> by nature such API should not be 100% accurate, anyhow applications cannot handle situation as good as kernel can provide, 0.5MB space accuracy, 100ms is maximum user-space require for 64-1024MB devices.
I didn't follow previous iteration you mentioned so I don't know the history. I think it's a not good idea if LMN(low memory notifier) is needed by only embedded world. Maybe in that case, we might control it enough by only vmstat events but now we know many folks want it so we are trying to make it general. IMHO, for meeting various requirement, vmstat raw event isn't enough so we need direct hook in vmscan and should abstract it to some levels. Of course, VM guys should maintain it to work best as VM algorithm are changing.
(6/5/12 4:16 AM), leonid.moiseichuk@nokia.com wrote:
-----Original Message----- From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext Pekka Enberg Sent: 05 June, 2012 11:02 To: Minchan Kim
...
Next concern is that periodic timer of implementation. I think it would add direct hook in vmscan.c rather than peeking raw vmstat periodically by timer so we can control more fine-grained way
without unnecessary overhead.
If the hooks are clean and it doesn't hurt the !CONFIG_VMEVENT case, I'm completely OK with that.
On the previous iteration hooking vm was pointed as very bad idea, so in my version I installed shrinker to handle cases when we have memory pressure. Using deferred timer with adequate timeout (0.250 ms or larger) fully suitable for userspace and produce adequate overhead -> by nature such API should not be 100% accurate, anyhow applications cannot handle situation as good as kernel can provide, 0.5MB space accuracy, 100ms is maximum user-space require for 64-1024MB devices.
I believe that's bad idea. In fact, An "adequate" timeout depend on hardware, not application performance tendency. Thus, applications can't know "adequate" value.
Anton, I expect you already investigated android low memory killer so maybe you know pros and cons of each solution. Could you convince us "why we need vmevent" and "why can't android LMK do it?"
Note that 1) and 2) are not problems per se, it's just implementation details, easy stuff. Vmevent is basically an ABI/API, and I didn't hear anybody who would object to vmevent ABI idea itself. More than this, nobody stop us from implementing in-kernel vmevent API, and make Android Lowmemory killer use it, if we want to.
I never agree "it's mere ABI" discussion. Until the implementation is ugly, I never agree the ABI even if syscall interface is very clean.
The real problem is not with vmevent. Today there are two real problems:
a) Gathering proper statistics from the kernel. Both cgroups and vmstat have issues. Android lowmemory killer has the same problems w/ the statistics as vmevent, it uses vmstat, so by no means Android low memory killer is better or easier in this regard. (And cgroups has issues w/ slab accounting, plus some folks don't want memcg at all, since it has runtime and memory-wise costs.)
Right. android lowmemory killer is also buggy.
b) Interpreting this statistics. We can't provide one, universal "low memory" definition that would work for everybody. (Btw, your "levels" based low memory grading actually sounds the same as mine RECLAIMABLE_CACHE_PAGES and RECLAIMABLE_CACHE_PAGES_NOIO idea, i.e. http://lkml.indiana.edu/hypermail/linux/kernel/1205.0/02751.html so personally I like the idea of level-based approach, based on available memory *cost*.)
So, you see, all these issues are valid for vmevent, cgroups and android low memory killer.
KOSAKI, AFAIRC, you are a person who hates android low memory killer. Why do you hate it? If it solve problems I mentioned, do you have a concern, still? If so, please, list up.
Android low memory killer is proved solution for a long time, at least embedded area(So many android phone already have used it) so I think improving it makes sense to me rather than inventing new wheel.
Yes, nobody throws Android lowmemory killer away. And recently I fixed a bunch of issues in its tasks traversing and killing code. Now it's just time to "fix" statistics gathering and interpretation issues, and I see vmevent as a good way to do just that, and then we can either turn Android lowmemory killer driver to use the vmevent in-kernel API (so it will become just a "glue" between notifications and killing functions), or use userland daemon.
Huh? No? android lowmem killer is a "killer". it doesn't make any notification, it only kill memory hogging process. I don't think we can merge them.
Note that memcg has notifications as well, so it's another proof that there is a demand for this stuff outside of embedded world, and going with ad-hoc, custom "low memory killer" is simple and tempting approach, but it doesn't solve any real problems.
Wrong. memcg notification notify the event to _another_ mem cgroup's process. Then, it can avoid a notified process can't work well by swap thrashing. Its feature only share a weakness of vmevent api.
Frankly speaking, I don't know vmevent's other use cases except low memory notification
I won't speak for realistic use-cases, but that is what comes to mind:
- DB can grow its caches/in-memory indexes infinitely, and start dropping them on demand (based on internal LRU list, for example). No more guessed/static configuration for DB daemons?
They uses direct-io for fine grained cache control.
- Assuming VPS hosting w/ dynamic resources management, notifications would be useful to readjust resources?
How do they readjust? Now kvm/xen use balloon driver for dynamic resource adjustment. AND it work more fine than vmevent because it doesn't route userspace.
- On desktops, apps can drop their caches on demand if they want to and can avoid swap activity?
In this case, fallocate(VOLATILE) is work more better.
On Mon, Jun 04, 2012 at 04:05:05PM -0400, KOSAKI Motohiro wrote: [...]
Yes, nobody throws Android lowmemory killer away. And recently I fixed a bunch of issues in its tasks traversing and killing code. Now it's just time to "fix" statistics gathering and interpretation issues, and I see vmevent as a good way to do just that, and then we can either turn Android lowmemory killer driver to use the vmevent in-kernel API (so it will become just a "glue" between notifications and killing functions), or use userland daemon.
Huh? No? android lowmem killer is a "killer". it doesn't make any notification, it only kill memory hogging process. I don't think we can merge them.
KOSAKI, you don't read what I write. I didn't ever say that low memory killer makes any notifications, that's not what I was saying. I said that once we'll have a good "low memory" notification mechanism (e.g. vmevent), Android low memory killer would just use this mechanism. Be it userland notifications or in-kernel, doesn't matter much.
(6/4/12 6:39 PM), Anton Vorontsov wrote:
On Mon, Jun 04, 2012 at 04:05:05PM -0400, KOSAKI Motohiro wrote: [...]
Yes, nobody throws Android lowmemory killer away. And recently I fixed a bunch of issues in its tasks traversing and killing code. Now it's just time to "fix" statistics gathering and interpretation issues, and I see vmevent as a good way to do just that, and then we can either turn Android lowmemory killer driver to use the vmevent in-kernel API (so it will become just a "glue" between notifications and killing functions), or use userland daemon.
Huh? No? android lowmem killer is a "killer". it doesn't make any notification, it only kill memory hogging process. I don't think we can merge them.
KOSAKI, you don't read what I write. I didn't ever say that low memory killer makes any notifications, that's not what I was saying. I said that once we'll have a good "low memory" notification mechanism (e.g. vmevent), Android low memory killer would just use this mechanism. Be it userland notifications or in-kernel, doesn't matter much.
I don't disagree this. But this was not my point. I have to note why current android killer is NOT notification.
In fact, notification is a mere notification. There is no guarantee to success to kill. There are various reason to fail to kill. e.g. 1) Any shrinking activity need more memory. (that's the reason why now we only have memcg oom notification) 2) userland memory returning activity is not atomic nor fast. kernel might find another memory shortage before finishing memory returning. 3) system thrashing may bring userland process stucking 4) ... and userland bugs.
So, big design choice here. 1) vmevent is a just notification. it can't guarantee to prevent oom. or 2) to implement some trick (e.g. reserved memory for vmevent processes, kernel activity blocking until finish memory returing, etc)
On Fri, Jun 8, 2012 at 6:45 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
So, big design choice here. 1) vmevent is a just notification. it can't guarantee to prevent oom. or 2) to implement some trick (e.g. reserved memory for vmevent processes, kernel activity blocking until finish memory returing, etc)
Yes, vmevent is only for notification and will not *guarantee* OOM prevention. It simply tries to provide hints early enough to the userspace to so that OOM can be avoided if possible.
On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
Note that 1) and 2) are not problems per se, it's just implementation details, easy stuff. Vmevent is basically an ABI/API, and I didn't hear anybody who would object to vmevent ABI idea itself. More than this, nobody stop us from implementing in-kernel vmevent API, and make Android Lowmemory killer use it, if we want to.
I never agree "it's mere ABI" discussion. Until the implementation is ugly, I never agree the ABI even if syscall interface is very clean.
I don't know what discussion you are talking about.
I also don't agree that something should be merged just because the ABI is clean. The implementation must also make sense. I don't see how we disagree here at all.
On Tue, Jun 05, 2012 at 10:47:18AM +0300, Pekka Enberg wrote:
On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
Note that 1) and 2) are not problems per se, it's just implementation details, easy stuff. Vmevent is basically an ABI/API, and I didn't hear anybody who would object to vmevent ABI idea itself. More than this, nobody stop us from implementing in-kernel vmevent API, and make Android Lowmemory killer use it, if we want to.
I never agree "it's mere ABI" discussion. Until the implementation is ugly, I never agree the ABI even if syscall interface is very clean.
I don't know what discussion you are talking about.
I also don't agree that something should be merged just because the ABI is clean. The implementation must also make sense. I don't see how we disagree here at all.
BTW, I wasn't implying that vmevent should be merged just because it is a clean ABI, and I wasn't implying that it is clean, and I didn't propose to merge it at all. :-)
I just don't see any point in trying to scrap vmevent in favour of Android low memory killer. This makes no sense at all, since today vmevent is more useful than Android's solution. For vmevent we have contributors from Nokia, Samsung, and of course Linaro, plus we have an userland killer daemon* for Android (which can work with both cgroups and vmevent backends). So vmevent is more generic already.
To me it would make more sense if mm guys would tell us "scrap this all, just use cgroups and its notifications; fix cgroups' slab accounting and be happy". Well, I'd understand that.
Anyway, we all know that vmevent is 'work in progress', so nobody tries to push it, nobody asks to merge it. So far we're just discussing any possible solutions, and vmevent is a good playground.
So, question to Minchan. Do you have anything particular in mind regarding how the vmstat hooks should look like? And how all this would connect with cgroups, since KOSAKI wants to see it cgroups- aware...
p.s. http://git.infradead.org/users/cbou/ulmkd.git I haven't updated it for new vmevent changes, but still, its idea should be clear enough.
On 06/05/2012 05:39 PM, Anton Vorontsov wrote:
On Tue, Jun 05, 2012 at 10:47:18AM +0300, Pekka Enberg wrote:
On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
Note that 1) and 2) are not problems per se, it's just implementation details, easy stuff. Vmevent is basically an ABI/API, and I didn't hear anybody who would object to vmevent ABI idea itself. More than this, nobody stop us from implementing in-kernel vmevent API, and make Android Lowmemory killer use it, if we want to.
I never agree "it's mere ABI" discussion. Until the implementation is ugly, I never agree the ABI even if syscall interface is very clean.
I don't know what discussion you are talking about.
I also don't agree that something should be merged just because the ABI is clean. The implementation must also make sense. I don't see how we disagree here at all.
BTW, I wasn't implying that vmevent should be merged just because it is a clean ABI, and I wasn't implying that it is clean, and I didn't propose to merge it at all. :-)
I just don't see any point in trying to scrap vmevent in favour of Android low memory killer. This makes no sense at all, since today vmevent is more useful than Android's solution. For vmevent we have contributors from Nokia, Samsung, and of course Linaro, plus we have an userland killer daemon* for Android (which can work with both cgroups and vmevent backends). So vmevent is more generic already.
To me it would make more sense if mm guys would tell us "scrap this all, just use cgroups and its notifications; fix cgroups' slab accounting and be happy". Well, I'd understand that.
Anyway, we all know that vmevent is 'work in progress', so nobody tries to push it, nobody asks to merge it. So far we're just discussing any possible solutions, and vmevent is a good playground.
So, question to Minchan. Do you have anything particular in mind regarding how the vmstat hooks should look like? And how all this would connect with cgroups, since KOSAKI wants to see it cgroups- aware...
How about this?
It's totally pseudo code just I want to show my intention and even it's not a math. Totally we need more fine-grained some expression to standardize memory pressure. For it, we can use VM's several parameter, nr_scanned, nr_reclaimed, order, dirty page scanning ratio and so on. Also, we can aware of zone, node so we can pass lots of information to user space if they want it. For making lowmem notifier general, they are must, I think. We can have a plenty of tools for it.
And later as further step, we could replace it with memcg-aware after memcg reclaim work is totally unified with global page reclaim. Many memcg guys have tried it so I expect it works sooner or later but I'm not sure memcg really need it because memcg's goal is limit memory resource among several process groups. If some process feel bad about latency due to short of free memory and it's critical, I think it would be better to create new memcg group has tighter limit for latency and put the process into the group.
diff --git a/mm/vmscan.c b/mm/vmscan.c index eeb3bc9..eae3d2e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2323,6 +2323,32 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, }
/* + * higher dirty pages, higher pressure + * higher nr_scanned, higher pressure + * higher nr_reclaimed, lower pressure + * higher unmapped pages, lower pressure + * + * index toward 0 implies memory pressure is heavy. + */ +int lowmem_index(struct zone *zone, struct scan_control *sc) +{ + int pressure = (1000 * (sc->nr_scanned * (zone_page_state(zone, NR_FILE_DIRTY) + * dirty_weight + 1) - sc->nr_reclaimed - + zone_unmapped_file_pages(zone))) / + zone_reclaimable_page(zone); + + return 1000 - pressure; +} + +void lowmem_notifier(struct zone *zone, int index) +{ + if (lowmem_has_interested_zone(zone)) { + if (index < sysctl_lowmem_threshold) + notify(numa_node_id(), zone, index); + } +} + +/* * For kswapd, balance_pgdat() will work across all this node's zones until * they are all at high_wmark_pages(zone). * @@ -2494,6 +2520,7 @@ loop_again: !zone_watermark_ok_safe(zone, testorder, high_wmark_pages(zone) + balance_gap, end_zone, 0)) { + int index; shrink_zone(zone, &sc);
reclaim_state->reclaimed_slab = 0; @@ -2503,6 +2530,9 @@ loop_again:
if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1; + + index = lowmem_index(zone, &sc); + lowmem_notifier(zone, index);
p.s. http://git.infradead.org/users/cbou/ulmkd.git I haven't updated it for new vmevent changes, but still, its idea should be clear enough.
On Thu, Jun 07, 2012 at 11:41:27AM +0900, Minchan Kim wrote: [...]
How about this?
So, basically this is just another shrinker callback (well, it's very close to), but only triggered after some magic index crosses a threshold.
Some information beforehand: current ALMK is broken in the regard that it does not do what it is supposed to do according to its documentation. It uses shrinker notifiers (alike to your code), but kernel calls shrinker when there is already pressure on the memory, and ALMK's original idea was to start killing processes when there is [yet] no pressure at all, so ALMK was supposed to act in advance, e.g. "kill unneeded apps when there's say 64 MB free memory left, irrespective of the current pressure". ALMK doesn't do this currently, it only reacts to the shrinker.
So, the solution would be then two-fold:
1. Use your memory pressure notifications. They must be quite fast when we starting to feel the high pressure. (I see the you use zone_page_state() and friends, which is vm_stat, and it is updated very infrequently, but to get accurate notification we have to update it much more frequently, but this is very expensive. So KOSAKI and Christoph will complain. :-) 2. Plus use deferred timers to monitor /proc/vmstat, we don't have to be fast here. But I see Pekka and Leonid don't like it already, so we're stuck.
Thanks,
It's totally pseudo code just I want to show my intention and even it's not a math. Totally we need more fine-grained some expression to standardize memory pressure. For it, we can use VM's several parameter, nr_scanned, nr_reclaimed, order, dirty page scanning ratio and so on. Also, we can aware of zone, node so we can pass lots of information to user space if they want it. For making lowmem notifier general, they are must, I think. We can have a plenty of tools for it.
And later as further step, we could replace it with memcg-aware after memcg reclaim work is totally unified with global page reclaim. Many memcg guys have tried it so I expect it works sooner or later but I'm not sure memcg really need it because memcg's goal is limit memory resource among several process groups. If some process feel bad about latency due to short of free memory and it's critical, I think it would be better to create new memcg group has tighter limit for latency and put the process into the group.
diff --git a/mm/vmscan.c b/mm/vmscan.c index eeb3bc9..eae3d2e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2323,6 +2323,32 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining, } /*
- higher dirty pages, higher pressure
- higher nr_scanned, higher pressure
- higher nr_reclaimed, lower pressure
- higher unmapped pages, lower pressure
- index toward 0 implies memory pressure is heavy.
- */
+int lowmem_index(struct zone *zone, struct scan_control *sc) +{
int pressure = (1000 * (sc->nr_scanned * (zone_page_state(zone, NR_FILE_DIRTY)
* dirty_weight + 1) - sc->nr_reclaimed -
zone_unmapped_file_pages(zone))) /
zone_reclaimable_page(zone);
return 1000 - pressure;
+}
+void lowmem_notifier(struct zone *zone, int index) +{
if (lowmem_has_interested_zone(zone)) {
if (index < sysctl_lowmem_threshold)
notify(numa_node_id(), zone, index);
}
+}
+/*
- For kswapd, balance_pgdat() will work across all this node's zones until
- they are all at high_wmark_pages(zone).
@@ -2494,6 +2520,7 @@ loop_again: !zone_watermark_ok_safe(zone, testorder, high_wmark_pages(zone) + balance_gap, end_zone, 0)) {
int index; shrink_zone(zone, &sc);
reclaim_state->reclaimed_slab = 0; @@ -2503,6 +2530,9 @@ loop_again: if (nr_slab == 0 && !zone_reclaimable(zone)) zone->all_unreclaimable = 1;
index = lowmem_index(zone, &sc);
lowmem_notifier(zone, index);
p.s. http://git.infradead.org/users/cbou/ulmkd.git I haven't updated it for new vmevent changes, but still, its idea should be clear enough.
-- Kind regards, Minchan Kim
On 06/08/2012 04:49 PM, Anton Vorontsov wrote:
On Thu, Jun 07, 2012 at 11:41:27AM +0900, Minchan Kim wrote: [...]
How about this?
So, basically this is just another shrinker callback (well, it's very close to), but only triggered after some magic index crosses a threshold.
Some information beforehand: current ALMK is broken in the regard that it does not do what it is supposed to do according to its documentation. It uses shrinker notifiers (alike to your code), but kernel calls shrinker when there is already pressure on the memory, and ALMK's original idea was to start killing processes when there is [yet] no pressure at all, so ALMK was supposed to act in advance, e.g. "kill unneeded apps when there's say 64 MB free memory left, irrespective of the current pressure". ALMK doesn't do this currently, it only reacts to the shrinker.
When I hear your information, I feel it's a problem is in VM. VM's goal is to use available memory enough while it can reclaim used page as soon as possible so that user program should not feel big latency if there are enough easy reclaimable pages in VM.
So, when reclaim start firstly, maybe there are lots of reclaimable pages in VM so it can be reclaimed easily. Nontheless, if you feel it's very slow, in principle, it's a VM's problem. But I don't have been heard such latency complain from desktop people once there are lots of reclaimable pages.\
I admit there is big latency if we have lots of dirty pages while clean pages are almost out and backed devices are very slow which is known problem and several mm guys still have tried to solve it.
I admit you can argue what's the reclaimable pages easily. Normally, we can order it following as.
1. non-mapped clean cache page 2. mapped-clean cache page 3. non-mapped dirty cache page 4. mapped dirty cache page 5. anon pages, tmpfs/shmem pages.
So I want to make math by those and VM's additional information and user can configure weight because in some crazy system, swapout which is backed by SSD can be faster than dirty file page flush which is backed very slow rotation device.
And we can improve it by adding new LRU list - CFLRU[1] which would be good for swap in embedded device, too. If clean LRU is about to be short, it's a good indication on latency so we can trigger notification or start vmevent timer.
[1] http://staff.ustc.edu.cn/~jpq/paper/flash/2006-CASES-CFLRU-%20A%20Replacemen...
So, the solution would be then two-fold:
- Use your memory pressure notifications. They must be quite fast when we starting to feel the high pressure. (I see the you use zone_page_state() and friends, which is vm_stat, and it is updated
VM has other information like nr_reclaimed, nr_scanned, nr_congested, recent_scanned, recent_rotated, too. I hope we can make math by them and improve as we improve VM reclaimer.
very infrequently, but to get accurate notification we have to update it much more frequently, but this is very expensive. So KOSAKI and Christoph will complain. :-)
Reclaimer already have used that and if we need accuracy, we handled it like zone_watermark_ok_safe. If it's very inaccurate, VM should be fixed, too.
- Plus use deferred timers to monitor /proc/vmstat, we don't have to be fast here. But I see Pekka and Leonid don't like it already, so we're stuck.
Thanks,
-- Kind regards, Minchan Kim
On Fri, Jun 8, 2012 at 11:43 AM, Minchan Kim minchan@kernel.org wrote:
So, the solution would be then two-fold:
- Use your memory pressure notifications. They must be quite fast when
we starting to feel the high pressure. (I see the you use zone_page_state() and friends, which is vm_stat, and it is updated
VM has other information like nr_reclaimed, nr_scanned, nr_congested, recent_scanned, recent_rotated, too. I hope we can make math by them and improve as we improve VM reclaimer.
very infrequently, but to get accurate notification we have to update it much more frequently, but this is very expensive. So KOSAKI and Christoph will complain. :-)
Reclaimer already have used that and if we need accuracy, we handled it like zone_watermark_ok_safe. If it's very inaccurate, VM should be fixed, too.
Exactly. I don't know why people think pushing vmevents to userspace is going to fix any of the hard problems.
Anton, Lenoid, do you see any fundamental issues from userspace point of view with going forward what Minchan is proposing?
-----Original Message----- From: penberg@gmail.com [mailto:penberg@gmail.com] On Behalf Of ext Pekka Enberg Sent: 08 June, 2012 11:48 To: Minchan Kim
...
On Fri, Jun 8, 2012 at 11:43 AM, Minchan Kim minchan@kernel.org wrote:
So, the solution would be then two-fold:
- Use your memory pressure notifications. They must be quite fast
when we starting to feel the high pressure. (I see the you use zone_page_state() and friends, which is vm_stat, and it is updated
VM has other information like nr_reclaimed, nr_scanned, nr_congested, recent_scanned, recent_rotated, too. I hope we can make math by them and improve as we improve VM reclaimer.
very infrequently, but to get accurate notification we have to update it much more frequently, but this is very expensive. So KOSAKI and Christoph will complain. :-)
Reclaimer already have used that and if we need accuracy, we handled it like zone_watermark_ok_safe. If it's very inaccurate, VM should be fixed,
too.
Exactly. I don't know why people think pushing vmevents to userspace is going to fix any of the hard problems.
Anton, Lenoid, do you see any fundamental issues from userspace point of view with going forward what Minchan is proposing?
That good proposal but I have to underline that userspace could be interested not only in memory consumption stressed cases (pressure, vm watermarks ON etc.) but quite relaxed as well e.g. 60% durty pages are consumed - let's do not restart some daemons. In very stressed conditions user-space might be already dead.
Another interesting question which combination of VM page types could be recognized as interesting for tracking as Minchan correctly stated it depends from area. For me seems weights most likely will be -1, 0 or +1 to calculate resulting values and thesholds e.g. Active = {+1 * Active_Anon; +1 * Active_File} It will extend flexibility a lot.
On Fri, Jun 08, 2012 at 09:12:36AM +0000, leonid.moiseichuk@nokia.com wrote: [...]
Exactly. I don't know why people think pushing vmevents to userspace is going to fix any of the hard problems.
Anton, Lenoid, do you see any fundamental issues from userspace point of view with going forward what Minchan is proposing?
That good proposal but I have to underline that userspace could be interested not only in memory consumption stressed cases (pressure, vm watermarks ON etc.) but quite relaxed as well e.g. 60% durty pages are consumed - let's do not restart some daemons. In very stressed conditions user-space might be already dead.
Indeed. Minchan's proposal is good to get notified that VM is under stress.
But suppose some app allocates memory slowly (i.e. I scroll a large page on my phone, and the page is rendered piece by piece). So, in the end we're slowly but surely allocate a lot of memory. In that case Minchan's method won't notice that it's actually time to close some apps.
Then suppose someone calls me, the "phone" application is now starting, but since we're out of 'easy to reclaim' pages, it takes forever for the app to load, VM is now under huge stress, and surely we're *now* getting notified, but alas, it is too late. Call missed.
So, it's like measuring distance, velocity and acceleration. In Android case, among other things, we're interested in distance too! I.e. "how much exactly 'easy to reclaim' pages left", not only "how fast we're getting out of 'easy to reclaim' pages".
Another interesting question which combination of VM page types could be recognized as interesting for tracking as Minchan correctly stated it depends from area. For me seems weights most likely will be -1, 0 or +1 to calculate resulting values and thesholds e.g. Active = {+1 * Active_Anon; +1 * Active_File} It will extend flexibility a lot.
Exposing VM details to the userland? No good. :-)
On Fri, Jun 08, 2012 at 02:45:07AM -0700, Anton Vorontsov wrote:
On Fri, Jun 08, 2012 at 09:12:36AM +0000, leonid.moiseichuk@nokia.com wrote: [...]
Exactly. I don't know why people think pushing vmevents to userspace is going to fix any of the hard problems.
Anton, Lenoid, do you see any fundamental issues from userspace point of view with going forward what Minchan is proposing?
That good proposal but I have to underline that userspace could be interested not only in memory consumption stressed cases (pressure, vm watermarks ON etc.) but quite relaxed as well e.g. 60% durty pages are consumed - let's do not restart some daemons. In very stressed conditions user-space might be already dead.
Indeed. Minchan's proposal is good to get notified that VM is under stress.
But suppose some app allocates memory slowly (i.e. I scroll a large page on my phone, and the page is rendered piece by piece). So, in the end we're slowly but surely allocate a lot of memory. In that case Minchan's method won't notice that it's actually time to close some apps.
I can't understand. Why can't the approach catch the situation? Let's think about it.
There is 40M in CleanCache LRU which has easy-reclaimable pages and there is 10M free pages and 5M high watermark in system.
Your application start to consume free pages very slowly. So when your application consumed 5M, VM start to reclaim. So far, it's okay because we have 40M easy-reclaimable pages. And low memory notifier can start to notify so your dameon can do some action to get free pages.
I think it's not so late.
sidenote: It seems I live in the complete opposite place because you guys always start discussion when I am about going out of office. Please understand my late response. Maybe I will come back after weekend. :)
Thanks.
Then suppose someone calls me, the "phone" application is now starting, but since we're out of 'easy to reclaim' pages, it takes forever for the app to load, VM is now under huge stress, and surely we're *now* getting notified, but alas, it is too late. Call missed.
So, it's like measuring distance, velocity and acceleration. In Android case, among other things, we're interested in distance too! I.e. "how much exactly 'easy to reclaim' pages left", not only "how fast we're getting out of 'easy to reclaim' pages".
Another interesting question which combination of VM page types could be recognized as interesting for tracking as Minchan correctly stated it depends from area. For me seems weights most likely will be -1, 0 or +1 to calculate resulting values and thesholds e.g. Active = {+1 * Active_Anon; +1 * Active_File} It will extend flexibility a lot.
Exposing VM details to the userland? No good. :-)
-- Anton Vorontsov Email: cbouatmailru@gmail.com
On Fri, Jun 08, 2012 at 07:42:04PM +0900, Minchan Kim wrote: [...]
I can't understand. Why can't the approach catch the situation? Let's think about it.
There is 40M in CleanCache LRU which has easy-reclaimable pages and there is 10M free pages and 5M high watermark in system.
Your application start to consume free pages very slowly. So when your application consumed 5M, VM start to reclaim. So far, it's okay because we have 40M easy-reclaimable pages. And low memory notifier can start to notify so your dameon can do some action to get free pages.
Maybe I'm missing how would you use the shrinker. But the last time I tried on my (swap-less, FWIW) qemu test setup, I was not receiving any notifications from the shrinker until the system was almost (but not exactly) out of memory.
My test app was allocating all memory MB by MB, filling the memory with zeroes. So, what I was observing is that shrinker callback was called just a few MB before OOM, not every 'X' consumed MBs.
I think it's not so late.
sidenote: It seems I live in the complete opposite place because you guys always start discussion when I am about going out of office. Please understand my late response. Maybe I will come back after weekend. :)
Well, it's 4AM here. :-) Have a great weekend!
On 06/08/2012 08:14 PM, Anton Vorontsov wrote:
On Fri, Jun 08, 2012 at 07:42:04PM +0900, Minchan Kim wrote: [...]
I can't understand. Why can't the approach catch the situation? Let's think about it.
There is 40M in CleanCache LRU which has easy-reclaimable pages and there is 10M free pages and 5M high watermark in system.
Your application start to consume free pages very slowly. So when your application consumed 5M, VM start to reclaim. So far, it's okay because we have 40M easy-reclaimable pages. And low memory notifier can start to notify so your dameon can do some action to get free pages.
Maybe I'm missing how would you use the shrinker. But the last time I tried on my (swap-less, FWIW) qemu test setup, I was not receiving any notifications from the shrinker until the system was almost (but not exactly) out of memory.
My test app was allocating all memory MB by MB, filling the memory with zeroes. So, what I was observing is that shrinker callback was called just a few MB before OOM, not every 'X' consumed MBs.
Yes. page reclaimer doesn't make sure calling shrinker per MB. So if you want to be notified per your threshold, shrinker isn't good.
I didn't say I will use shrinker. I want to add hooks directly in vmscan.c's normal reclaim path, not shrinker.
Still, I want to implement it by level triggering like I mentioned. I will show you my concept if anybody doesn't interrupt me within a few weeks. :)
Thanks.
I think it's not so late.
sidenote: It seems I live in the complete opposite place because you guys always start discussion when I am about going out of office. Please understand my late response. Maybe I will come back after weekend. :)
Well, it's 4AM here. :-) Have a great weekend!
You win! :)
On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
- On desktops, apps can drop their caches on demand if they want to
and can avoid swap activity?
In this case, fallocate(VOLATILE) is work more better.
For some cases, yes, but probably not for all.
For example, if userspace doesn't know about "about to swap real soon" condition, it can continue to grow its caches making fallocate(VOLATILE) pretty much useless.
(6/5/12 3:52 AM), Pekka Enberg wrote:
On Mon, Jun 4, 2012 at 11:05 PM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
- On desktops, apps can drop their caches on demand if they want to and can avoid swap activity?
In this case, fallocate(VOLATILE) is work more better.
For some cases, yes, but probably not for all.
For example, if userspace doesn't know about "about to swap real soon" condition, it can continue to grow its caches making fallocate(VOLATILE) pretty much useless.
Fair enough. But, Please consider your scenario don't need vmevent(2), cat /proc/meminfo works enough. Only dropping activity need a quick work and just in time notification. cache growing limitation don't need.
On Fri, Jun 8, 2012 at 6:55 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
For some cases, yes, but probably not for all.
For example, if userspace doesn't know about "about to swap real soon" condition, it can continue to grow its caches making fallocate(VOLATILE) pretty much useless.
Fair enough. But, Please consider your scenario don't need vmevent(2), cat /proc/meminfo works enough. Only dropping activity need a quick work and just in time notification. cache growing limitation don't need.
Why do you keep bringing this up? Yes, you might get away with polling /proc/meminfo or /proc/vmstat for some specific cases. It does not make it a proper generic solution to the problem "vmevent" ABI tries to address.
(6/8/12 2:54 AM), Pekka Enberg wrote:
On Fri, Jun 8, 2012 at 6:55 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
For some cases, yes, but probably not for all.
For example, if userspace doesn't know about "about to swap real soon" condition, it can continue to grow its caches making fallocate(VOLATILE) pretty much useless.
Fair enough. But, Please consider your scenario don't need vmevent(2), cat /proc/meminfo works enough. Only dropping activity need a quick work and just in time notification. cache growing limitation don't need.
Why do you keep bringing this up? Yes, you might get away with polling /proc/meminfo or /proc/vmstat for some specific cases. It does not make it a proper generic solution to the problem "vmevent" ABI tries to address.
Guys, current vmevent _is_ a polling interface. It only is wrapped kernel timer.
On Fri, Jun 8, 2012 at 9:57 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
Guys, current vmevent _is_ a polling interface. It only is wrapped kernel timer.
Current implementation is like that but we don't intend to keep it that way. That's why having a separate ABI is so important.
KOSAKI, AFAIRC, you are a person who hates android low memory killer. Why do you hate it? If it solve problems I mentioned, do you have a concern, still? If so, please, list up.
1) it took tasklist_lock. (it was solved already) 2) hacky lowmem_deathpending_timeout 3) No ZONE awareness. global_page_state(), lowmem_minfree[] and shrink_slab interface don't realize real memory footprint. 4) No memcg, cpuset awareness. 5) strange score calculation. the score is calculated from anon_rss+file_rss, but oom killer only free anon_rss. 6) strange oom_score_adj overload 7) much duplicate code w/ oom_killer. we can make common helper function, I think. 8) John's fallocate(VOLATILE) is more cleaner.
On 05/08/2012 04:11 PM, KOSAKI Motohiro wrote:
On Tue, May 8, 2012 at 1:53 AM, Pekka Enberg penberg@kernel.org wrote:
On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
That said, I think you are being unfair to Anton who's one of the few that's actually taking the time to implement this properly instead of settling for an out-of-tree hack.
Unfair? But only I can talk about technical comment. To be honest, I really dislike I need say the same explanation again and again. A lot of people don't read past discussion. And as far as the patches take the same mistake, I must say the same thing. It is just PITA.
Unfair because you are trying to make it look as if Anton is only concerned with his specific use case. That's simply not true.
However current proposal certainly don't refer past discuss and don't work many environment.
On Tue, May 8, 2012 at 8:42 AM, KOSAKI Motohiro kosaki.motohiro@gmail.com wrote:
I don't disagree vmevent notification itself, but I must disagree lie notification. And also, To make just idea statistics doesn't make sense at all. How do an application choose the right events? If that depend on hardware configuration, userland developers can't write proper applications.
That's exactly the problem we're trying to tackle here! We _want_ the ABI to provide sane, well-defined events that solve real world problems.
Ok, sane. Then I take my time a little and review current vmevent code briefly. (I read vmevent/core branch in pekka's tree. please let me know if there is newer repositry)
I think following thing should be fixed.
- sample_period is brain damaged idea. If people ONLY need to
sampling stastics, they only need to read /proc/vmstat periodically. just remove it and implement push notification. _IF_ someone need unfrequent level trigger, just use "usleep(timeout); read(vmevent_fd)" on userland code. 2) VMEVENT_ATTR_STATE_ONE_SHOT is misleading name. That is effect as edge trigger shot. not only once. 3) vmevent_fd() seems sane interface. but it has name space unaware. maybe we discuss how to harmonize name space feature. No hurry. but we have to think that issue since at beginning. 4) Currently, vmstat have per-cpu batch and vmstat updating makes 3 second delay at maximum. This is fine for usual case because almost userland watcher only read /proc/vmstat per second. But, for vmevent_fd() case, 3 seconds may be unacceptable delay. At worst, 128 batch x 4096 x 4k pagesize = 2G bytes inaccurate is there. 5) __VMEVENT_ATTR_STATE_VALUE_WAS_LT should be removed from userland exporting files. When exporing kenrel internal, always silly gus used them and made unhappy. 6) Also vmevent_event must hide from userland. 7) vmevent_config::size must be removed. In 20th century, M$ API prefer to use this technique. But They dropped the way because a lot of application don't initialize size member and they can't use it for keeping upper compitibility. 8) memcg unaware 9) numa unaware 10) zone unaware
I would like to add a concern.
11) understand storage speed.
As I mentioned, system can have various storage type(SSD, disk, eMMC, ramfs) In some system, user can tolerate ramfs and SSD write or swapout. We should consdier that to make it really useful.
The problem is user can't know it in advance so it should be detected by kernel. Unfortunately, it's not easy now.
The idea is that we can make some levels in advane and explain it to user
Level 1: It a immediate response to user when kernel decide there are not fast-reclaimable pages any more. Level 2: It's rather slower response than level 1 but kernel will consider it as reclaimable target Level 3: It's slowest response because kernel will consider page needed long time to reclaim as reclaimable target.
It doesn't expose any internal of kernel and can implment it in internal. For simple example,
Level 1: non-mapped clean page Level 2: Level 1 + mapped clean-page Level 3: Level 2 + dirty pages
So users of vmevent_fd can select his level. Of course, latency sensitive application with slow stoarge would select Level 1. Some application might use Level 4(Level 3 + half of swap) because it has very fast storage.
And application receives event can make strategy folloing as.
When it receives level 1 notification, it could request to others if it can release their own buffers. When it receives level 2 notification, it could request to suicide if it's not critical application. When it receives level 3 notification, it could kill others.
It's a just example and my point is we should storage speed to make it general.
On Tue, May 8, 2012 at 11:32 AM, Minchan Kim minchan@kernel.org wrote:
The idea is that we can make some levels in advane and explain it to user
Level 1: It a immediate response to user when kernel decide there are not fast-reclaimable pages any more. Level 2: It's rather slower response than level 1 but kernel will consider it as reclaimable target Level 3: It's slowest response because kernel will consider page needed long time to reclaim as reclaimable target.
It doesn't expose any internal of kernel and can implment it in internal. For simple example,
Level 1: non-mapped clean page Level 2: Level 1 + mapped clean-page Level 3: Level 2 + dirty pages
So users of vmevent_fd can select his level.
I'm totally OK with pursuing something like this if people like Leonid and Anton think it's useful for their use-cases.
On Tue, 8 May 2012, KOSAKI Motohiro wrote:
- Currently, vmstat have per-cpu batch and vmstat updating makes 3
second delay at maximum.
Nope. The delay is one second only and it is limited to a certain amount per cpu. There is a bound on the inaccuracy of the counter. If you want to have them more accurate then the right approach is to limit the threshhold. The more accurate the higher the impact of cache line bouncing.
On Tue, May 08, 2012 at 01:42:05AM -0400, KOSAKI Motohiro wrote: [...]
Well, yeah, if we are to report _number of pages_, the numbers better be meaningful.
That said, I think you are being unfair to Anton who's one of the few that's actually taking the time to implement this properly instead of settling for an out-of-tree hack.
Unfair? But only I can talk about technical comment. To be honest, I really dislike I need say the same explanation again and again. A lot of people don't read past discussion. And as far as the patches take the same mistake, I must say the same thing. It is just PITA.
Note that just telling people that something is PITA doesn't help solve things (so people will come back to you with stupid questions over and over again). You can call people morons, idiots and dumbasses (that's all fine) but still finding a way to be productive. :-)
You could just give a link to a previous discussion, in which you think you explained all your concerns regarding cache handling issues, or memory notifications/statistics in general.
So, feel free to call me an idiot, but please expand your points a little bit or give a link to the discussion you're referring to?
Thanks,
(5/8/12 2:58 AM), Anton Vorontsov wrote:
On Tue, May 08, 2012 at 01:42:05AM -0400, KOSAKI Motohiro wrote: [...]
Well, yeah, if we are to report _number of pages_, the numbers better be meaningful.
That said, I think you are being unfair to Anton who's one of the few that's actually taking the time to implement this properly instead of settling for an out-of-tree hack.
Unfair? But only I can talk about technical comment. To be honest, I really dislike I need say the same explanation again and again. A lot of people don't read past discussion. And as far as the patches take the same mistake, I must say the same thing. It is just PITA.
Note that just telling people that something is PITA doesn't help solve things (so people will come back to you with stupid questions over and over again). You can call people morons, idiots and dumbasses (that's all fine) but still finding a way to be productive. :-)
You could just give a link to a previous discussion, in which you think you explained all your concerns regarding cache handling issues, or memory notifications/statistics in general.
So, feel free to call me an idiot, but please expand your points a little bit or give a link to the discussion you're referring to?
I don't think you are idiot. But I hope you test your patch before submitting. That just don't work especially on x86. Because of, all x86 box have multiple zone and summarized statistics (i.e. global_page_state() thing) don't work and can't prevent oom nor swapping.
and please see may previous mail.
On Tue, May 08, 2012 at 03:16:59AM -0400, KOSAKI Motohiro wrote: [...]
So, feel free to call me an idiot, but please expand your points a little bit or give a link to the discussion you're referring to?
I don't think you are idiot. But I hope you test your patch before submitting. That just don't work especially on x86. Because of, all x86 box have multiple zone and summarized statistics (i.e. global_page_state() thing) don't work and can't prevent oom nor swapping.
Now I think I understand you: we don't take into account that e.g. DMA zone is not usable by the normal allocations, and so if we're basing our calculations on summarized stats, it is indeed possible to get an OOM in such a case. On Android kernels nobody noticed the issue as it is used mainly on ARM targets, which might have CONFIG_ZONE_DMA=n.
So, this particular issue makes sense now.
Thanks!
On Tue, May 08, 2012 at 01:13:05AM -0700, Anton Vorontsov wrote:
On Tue, May 08, 2012 at 03:16:59AM -0400, KOSAKI Motohiro wrote: [...]
So, feel free to call me an idiot, but please expand your points a little bit or give a link to the discussion you're referring to?
I don't think you are idiot. But I hope you test your patch before submitting. That just don't work especially on x86. Because of, all x86 box have multiple zone and summarized statistics (i.e. global_page_state() thing) don't work and can't prevent oom nor swapping.
Now I think I understand you: we don't take into account that e.g. DMA zone is not usable by the normal allocations, and so if we're basing our calculations on summarized stats, it is indeed possible to get an OOM in such a case.
Oops. Looking into it more, I think I was wrong here: kernel will surely use pages from the DMA zone when we have no pages in normal zones.
So, I don't see how we can get OOM in that case.
Hm.
On Tue, May 1, 2012 at 4:24 PM, Anton Vorontsov anton.vorontsov@linaro.org wrote:
Accounting only free pages is very inaccurate for low memory handling, so we have to be smarter here.
Can you elaborate on what kind of problems there are with tracking free pages?
On Thu, May 03, 2012 at 11:10:12AM +0300, Pekka Enberg wrote:
On Tue, May 1, 2012 at 4:24 PM, Anton Vorontsov anton.vorontsov@linaro.org wrote:
Accounting only free pages is very inaccurate for low memory handling, so we have to be smarter here.
Can you elaborate on what kind of problems there are with tracking free pages?
Well, there's no problem with tracking itself, the word 'inaccurate' was probably misleading. Tracking just free pages is inaccurate for our "low memory" notification needs, but NR_FREE_PAGES tracking itself is fine.
The thing is that NR_FREE_PAGES accounts only completely unused (wasted) pages. Most of the time we have very low NR_FREE_PAGES, and lots of page cache and block buffers (i.e. NR_FILE_PAGES).
The file pages are easily reclaimable (except shmem/tmpfs and locked pages), so file pages may be considered as "somewhat free" pages.
The cache might contain very stale data (or not), so we have to maneuver between the two strategies: sacrifice caches, or start freeing memory (which prevents caches draining).
The strategy is described in the third patch in the series. It might be not ideal, but the logic itself is not part of the ABI (this is very similar "not ABI" rules as we have for OOM scoring logic), and is subject for changes.
Thanks,
linaro-kernel@lists.linaro.org