We've had a discussion in the Linaro storage team (Saugata, Venkat and me,
with Luca joining in on the discussion) about swapping to flash based media
such as eMMC. This is a summary of what we found and what we think should
be done. If people agree that this is a good idea, we can start working
on it.
The basic problem is that Linux without swap is sort of crippled: some
things either don't work at all (hibernate) or are not as efficient as they
should be (e.g. tmpfs). At the same time, the swap code seems to be rather
inappropriate for the algorithms used in most flash media today, causing
system performance to suffer drastically, and wearing out the flash hardware
much faster than necessary. In order to change that, we would be
implementing the following changes:
1) Try to swap out multiple pages at once, in a single write request. My
reading of the current code is that we always send pages one by one to
the swap device, while most flash devices have an optimum write size of
32 or 64 KB and some require an alignment of more than a page. Ideally
we would try to write an aligned 64 KB block all the time. Writing aligned
64 KB chunks often gives us ten times the throughput of linear 4 KB writes,
and going beyond 64 KB usually does not give any better performance.
2) Make variable sized swap clusters. Right now, the swap space is
organized in clusters of 256 pages (1MB), which is less than the typical
erase block size of 4 or 8 MB. We should try to make the swap cluster
aligned to erase blocks and have the size match to avoid garbage collection
in the drive. The cluster size would typically be set by mkswap as a new
option and interpreted at swapon time.
3) As Luca points out, some eMMC media would benefit significantly from
having discard requests issued for every page that gets freed from
the swap cache, rather than at the time just before we reuse a swap
cluster. This would probably have to become a configurable option
as well, to avoid the overhead of sending the discard requests on
media that don't benefit from this.
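
The arithmetic behind points 1 and 2 can be sketched in a few lines of
Python. This is only an illustration, not kernel code: the 4 KB page size,
the function names and the idea of grouping slot numbers in a dictionary are
all assumptions made for the example.

```python
PAGE_SIZE = 4096          # assumed page size in bytes
WRITE_CHUNK = 64 * 1024   # preferred aligned write size from point 1


def group_into_aligned_writes(slots, page_size=PAGE_SIZE, chunk=WRITE_CHUNK):
    """Group swap slot numbers by the aligned chunk they fall into.

    Returns (full, partial), two lists of slot lists. A group is "full"
    when every slot of its aligned chunk is present, so it could be sent
    as a single aligned write request.
    """
    per_chunk = chunk // page_size
    groups = {}
    for slot in sorted(set(slots)):
        groups.setdefault(slot // per_chunk, []).append(slot)
    full = [g for g in groups.values() if len(g) == per_chunk]
    partial = [g for g in groups.values() if len(g) < per_chunk]
    return full, partial


def cluster_pages(erase_block_bytes, page_size=PAGE_SIZE):
    """Number of pages in a swap cluster that matches one erase block."""
    if erase_block_bytes <= 0 or erase_block_bytes % page_size:
        raise ValueError("erase block must be a positive multiple of the page size")
    return erase_block_bytes // page_size


if __name__ == "__main__":
    full, partial = group_into_aligned_writes(list(range(16, 32)) + [40, 41])
    print(len(full), "full chunk(s),", len(partial), "partial chunk(s)")
    print("pages per 4 MB cluster:", cluster_pages(4 * 1024 * 1024))
```

With these assumptions a 64 KB chunk holds 16 pages, and a cluster matching
a 4 MB erase block holds 1024 pages instead of the current 256.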
Does this all sound appropriate for the Linux memory management people?
Also, does this sound useful to the Android developers? Would you
start using swap if we make it perform well and not destroy the drives?
Finally, does this plan match up with the capabilities of the
various eMMC devices? I know more about SD and USB devices and
I'm quite convinced that it would help there, but eMMC can be
more like an SSD in some ways, and the current code should be fine
for real SSDs.
Arnd
Hi all,
This is another resend of several task->mm fixes, the bugs I found
during the LMK code audit. Architectures were traversing the tasklist
in an unsafe manner, plus there are a few cases of unsafe access to
task->mm in general.
There were no objections on the previous resend, and the final words
were somewhere along the lines of "the patches are fine".
In v3:
- Dropped a controversial 'Make find_lock_task_mm() sparse-aware' patch;
- Reworded arm and sh commit messages, per Oleg Nesterov's suggestions;
- Added an optimization trick in clear_tasks_mm_cpumask(): take only
the rcu read lock, no need for the whole tasklist_lock.
Suggested by Peter Zijlstra.
In v2:
- introduced a small helper in cpu.c: most arches duplicate the
same [buggy] code snippet, so it's better to fix it and move the
logic into a common function.
Thanks,
--
Anton Vorontsov
Email: cbouatmailru(a)gmail.com
Hi all,
This is a respin of the previous patchset that tried to add a new
'cross' event type, which would trigger whenever a value crosses a
user-specified threshold in either direction, i.e. from the lesser-values
side to the greater-values side, and vice versa.
We use the event type in a userspace low-memory killer: we get a
notification when memory becomes low, so we start freeing memory by
killing unneeded processes, and we get another notification when memory
crosses the threshold from the other side, so we know that we have freed
enough memory.
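
The "both ways" semantics can be illustrated with a small Python sketch.
It is not the vmevent API; the function name, its arguments and the choice
to treat a value equal to the threshold as being on the greater side are
assumptions made for the example.

```python
def crossed(prev, cur, threshold):
    """Report whether a sampled value crossed the threshold.

    Returns "up" when the value moved from below the threshold to at or
    above it, "down" for the opposite move, and None when both samples
    are on the same side.
    """
    if prev < threshold <= cur:
        return "up"
    if cur < threshold <= prev:
        return "down"
    return None


if __name__ == "__main__":
    samples = [10, 20, 18, 12, 14]
    threshold = 15
    for prev, cur in zip(samples, samples[1:]):
        print(prev, "->", cur, crossed(prev, cur, threshold))
```

A low-memory killer built on this idea would start killing processes on one
direction of the crossing and stop on the other.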
There's also a fix for a bug that makes the kernel upset about sleeping
in atomic context.
Per Pekka's comments here comes v2. Changes:
- Added a one-shot mode plus a greater-than attribute; the two
additions make the equivalent of the cross-event type.
- In the bugfix patch I added some comments about implementation
details of the lock-free logic. Also, in the previous version of
the fix I forgot to remove 'struct mutex' from
'struct vmevent_watch'; this is now cleaned up.
As usual, the patches are against
git://github.com/penberg/linux.git vmevent/core
Thanks!
--
Anton Vorontsov
Email: cbouatmailru(a)gmail.com
[ Yeah OK... I'm really bad with this stuff. ]
January 23 to February 03
- More on-going study of the standalone b.L switcher code.
- Discussion with Le.chi Thu, Paul Larson and others about big.LITTLE
switcher testing requirements.
- Involved in the review cycle for a patch about the new generic ioremap
optimization from Pawel Moll which turned out to be bad and needed
a subsequent revert.
- Obtaining and setting up the b.L Fast Model license.
- Review of some RCU changes pushed down the platform idle code path by Paul E.
McKenney which I ended up NAKing.
- Review of a patch from Stephen Boyd to disable preemption when reading
CCSIDR on ARMv7 to which I suggested a simpler alternative.
- Review of the initial test plan for the big.LITTLE switcher.
- Review of the SA11x0 cleanups from RMK.
- Investigation and prodding sent to Andrew Lunn for fixing a few Kirkwood
breakages from recent consolidation changes.
- More design discussions around b.L switcher with people from ARM Ltd,
notably Robin Randhawa.
- Preparing for Linaro Connect.
February 06 to 10
- Attending Linaro Connect.
- Review of Rob Herring's series cleaning up and removing IRQ and
FIQ related macros from the kernel.
- Wrote an article about big.LITTLE switcher for LWN.
- Review of Marc Zyngier's series to add per SoC SMP and CPU hotplug
operations.
- Quickstart session with Dave Martin to run the ARM Fast Model for b.L.
February 13 to 24
- Refinement to my LWN article about b.L before publication.
- Review of a patch series preparing the kernel for being entered in hypervisor
mode by Dave Martin.
- Discussion (on IRC) between Dave Martin and myself about design changes
brought to the in-kernel b.L switcher.
- Start experimenting with the b.L fast model.
- Read and digested various documents about the ARM virtualization,
the GIC, etc.
- Produced some b.L project status to help project management transition
from Usman to Mounir.
- Posted a patch to add support for early console output via semihosting.
- Wrote the first part of the big.LITTLE write-up for the monthly member
report (Paul McKenney did the second part).
February 27 to March 02
- Away on vacation.
March 05 to 16
- Review of the initial Kirkwood conversion to FDT by jason(a)lakedaemon.net.
- Review of a patch series removing most instances of io.h by Rob Herring.
- Comments/suggestions on how to deal with unresponsive maintainers,
prompted by Amit Kucheria.
- Review of a patch by Stephen Warren to generalize U-Boot's uImage
wrapping in the kernel build.
- More experiments with the b.L software model, attempting to boot
an 8-core SMP system, running into cross cluster cache coherency problems.
Finally got it to boot, thanks to the ARM guys who provided the missing
clue.
- Look at the multi-cluster aware boot protocol patches by Lorenzo
Pieralisi. Some of it might be directly useful for the b.L switcher.
- Review of Dave Martin's patch series to facilitate custom opcode
injection.
- Improved a patch I posted months ago to remove the debugging restrictions
inside the devicemaps_init() function and pushed it upstream. Recent changes
to the kernel are making this patch very useful for people to debug
their own kernel.
- Quick review of the Cortex-M3 support by Uwe Kleine-König.
- Moved to the arm-soc tree to implement the in-kernel switcher as it
contains everything to boot a vexpress config with device tree on the
software model.
Nicolas
=== Pinctrl ===
* sent out pinctrl core add defer probe for gpio patch, merged by Linus.
* sent out pinctrl-imx v4 series patch
Already got Stephen Warren and Shawn Guo's ack. No more comments for a
few days. I assume Linus Walleij may pick it up soon.
* Implement a common API to handle pinctrl dummy state, merged by Linus.
* Reviewed some other pinctrl patches.
* Implement per pin mux and config for pinctrl subsystem. INPROGRESS
=== Plan ===
* send out a draft pinctrl per pin mux and config patch and discuss it
with Linus and Stephen.
* since the gpio base is dynamically allocated from DT, the existing pinctrl
gpio support implementation, which is based on a fixed gpio base map, may
no longer fit with DT.
Will think more about the solution and discuss it with Linus.
After a big move and other things, I can finally focus on the Linux
work. I now have high-speed internet in my new place, so using
'mumble' for conferencing is no longer a problem.
== Highlights ==
* We've got some 'looks fine' feedback on the userland LMK, and that's
reassuring.
The "bad news" is that there's not much enthusiasm overall from the
Android folks. That's understandable, as the kernel LMK driver works and
is already in the mainline^Wstaging kernel, so why bother. Well, it can't
live in staging/ forever, so we'd better hurry up.
* ulmkd's Makefile is again suitable for GNU/Linux builds
(in addition to Android/Linux). This makes it easier for me
to test, plus maybe there will be other users for the daemon.
* for_each_process and task->mm fixes finally merged into -mm.
I will need a small documentation update for the series, but
overall the series seems to be fine.
* Prepared a few fixes for the memcg slab accounting. The proposed
slab accounting feature looks like exactly what was needed, except
that it doesn't account slab for the root cgroup. If that's not
a design decision, then it can be improved. If it is by design, there
are two ways: a) drop cgroups support and go solely w/ the vmevent
infrastructure; b) try to push something like a 'memory.available'
attribute for memcg. 'a)' is easy, and 'b)' is probably what I'll
try to implement tomorrow. Once implemented, we'll have all
options ready, and so can mark cgroups as either fully suitable
for lowmem notifications or not suitable by design.
== Plans ==
* I wonder if I need to make a deep-dive into Android build system
and try to integrate ulmkd into Android image myself?
* Back to interactive governor improvements? Well, as far as I
recall, the story behind the interactive governor is very similar to
LMK: nobody likes cpufreq overall, and everyone wants generic power
management improvements for the scheduler. At least, we need to
get 'interactive vs. ondemand' cpufreq latency numbers. That
would be a good starting point for any other improvements. And
the problem with cpufreq latency measurements was that it takes
ages for the benchmark to complete.
--
Anton Vorontsov
Email: cbouatmailru(a)gmail.com
Hello Glauber,
On Fri, Apr 20, 2012 at 06:57:08PM -0300, Glauber Costa wrote:
> This is my current attempt at getting the kmem controller
> into a mergeable state. IMHO, all the important bits are there, and it should't
> change *that* much from now on. I am, however, expecting at least a couple more
> interactions before we sort all the edges out.
>
> This series works for both the slub and the slab. One of my main goals was to
> make sure that the interfaces we are creating actually makes sense for both
> allocators.
>
> I did some adaptations to the slab-specific patches, but the bulk of it
> comes from Suleiman's patches. I did the best to use his patches
> as-is where possible so to keep authorship information. When not possible,
> I tried to be fair and quote it in the commit message.
>
> In this series, all existing caches are created per-memcg after its first hit.
> The main reason is, during discussions in the memory summit we came into
> agreement that the fragmentation problems that could arise from creating all
> of them are mitigated by the typically small quantity of caches in the system
> (order of a few megabytes total for sparsely used caches).
> The lazy creation from Suleiman is kept, although a bit modified. For instance,
> I now use a locked scheme instead of cmpxcgh to make sure cache creation won't
> fail due to duplicates, which simplifies things by quite a bit.
>
> The slub is a bit more complex than what I came up with in my slub-only
> series. The reason is we did not need to use the cache-selection logic
> in the allocator itself - it was done by the cache users. But since now
> we are lazy creating all caches, this is simply no longer doable.
>
> I am leaving destruction of caches out of the series, although most
> of the infrastructure for that is here, since we did it in earlier
> series. This is basically because right now Kame is reworking it for
> user memcg, and I like the new proposed behavior a lot more. We all seemed
> to have agreed that reclaim is an interesting problem by itself, and
> is not included in this already too complicated series. Please note
> that this is still marked as experimental, so we have so room. A proper
> shrinker implementation is a hard requirement to take the kmem controller
> out of the experimental state.
>
> I am also not including documentation, but it should only be a matter
> of merging what we already wrote in earlier series plus some additions.
The patches look great, thanks a lot for your work!
I finally tried them, and after a few fixes the kmem accounting
seems to work fine with slab. The fixes will follow this email,
and if they're fine, feel free to fold them into your patches.
However, with slub I'm getting kernel hangs and various traces[1].
It seems that kernel memcg recurses when trying to call
memcg_create_cache_enqueue() -- it calls kmalloc_no_account()
which was introduced to not recurse into memcg, but looking
into 'slub: provide kmalloc_no_account' patch, I don't see
any difference between _no_account and ordinary kmalloc. Hm.
OK, slub apart... the accounting works with slab, which is great.
There's another, more generic question: is there any particular
reason why you don't want to account slab memory for root cgroup?
Personally I'm interested in kmem accounting because I use
memcg for low-memory notifications. I'm installing events
on the root's memory.usage_in_bytes, and the threshold values
are calculated like this:
total_ram - wanted_threshold
So, if we want to get a notification when there's 64 MB of memory
left on a 256 MB machine, we'd install an event on the 192 MB
mark (the good thing about usage_in_bytes is that it does
account file caches, so the formula is simple).
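
As a quick check of the formula, here is a minimal Python sketch. The
function name and the MB constant are made up for the example, and the
sketch only computes the number; it does not register anything with memcg.

```python
MB = 1024 * 1024


def usage_threshold(total_ram, wanted_free):
    """Usage mark (in bytes) at which only `wanted_free` bytes remain."""
    if not 0 <= wanted_free <= total_ram:
        raise ValueError("wanted_free must be between 0 and total_ram")
    return total_ram - wanted_free


if __name__ == "__main__":
    # 256 MB machine, notify when 64 MB is left.
    print(usage_threshold(256 * MB, 64 * MB) // MB, "MB")
```

For the 256 MB machine with 64 MB left this gives 256 - 64 = 192 MB.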
Obviously, without kmem accounting the formula can be very
imprecise when the kernel (e.g. hw drivers) itself starts using a
lot of memory. With root's slab accounting the problem
would be solved, but for some reason you deliberately do not
want to account it for the root cgroup. I suspect that there are
some performance concerns?..
Thanks,
[1]
BUG: unable to handle kernel paging request at ffffffffb2e80900
IP: [<ffffffff8105940c>] check_preempt_wakeup+0x3c/0x210
PGD 160d067 PUD 1611063 PMD 0
Thread overran stack, or stack corrupted
Oops: 0000 [#1] SMP
CPU 0
Pid: 943, comm: bash Not tainted 3.4.0-rc4+ #34 Bochs Bochs
RIP: 0010:[<ffffffff8105940c>] [<ffffffff8105940c>] check_preempt_wakeup+0x3c/0x210
RSP: 0018:ffff880006305ee8 EFLAGS: 00010006
RAX: 00000000000109c0 RBX: ffff8800071b4e20 RCX: ffff880006306000
RDX: 0000000000000000 RSI: 0000000006306028 RDI: ffff880007c109c0
RBP: ffff880006305f28 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880007c109c0
R13: ffff88000644ddc0 R14: ffff8800071b4e68 R15: 0000000000000000
FS: 00007fad1244c700(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffb2e80900 CR3: 00000000063b8000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 943, threadinfo ffff880006306000, task ffff88000644ddc0)
Stack:
0000000000000000 ffff88000644de08 ffff880007c109c0 ffff880007c109c0
ffff8800071b4e20 0000000000000000 0000000000000000 0000000000000000
ffff880006305f48 ffffffff81053304 ffff880007c109c0 ffff880007c109c0
Call Trace:
Code: 76 48 41 55 41 54 49 89 fc 53 48 89 f3 48 83 ec 18 4c 8b af e0 07 00 00 49 8d 4d 48 48 89 4d c8 49 8b 4d 08 4c 3b 75 c8 8b 71 18 <48> 8b 34 f5 c0 07 65 81 48 8b bc 30 a8 00 00 00 8b 35 3a 3f 5c
RIP [<ffffffff8105940c>] check_preempt_wakeup+0x3c/0x210
RSP <ffff880006305ee8>
CR2: ffffffffb2e80900
---[ end trace 78fa9c86bebb1214 ]---
--
Anton Vorontsov
Email: cbouatmailru(a)gmail.com
=== Highlights ===
* Interview with Deepak and one-on-one introductory discussion
- Deepak pointed me to relevant Linaro wiki pages with appropriate usage
information
* Discussion with Lee Jones and Deepak about the DT work. ab8500 power has
been assigned
* OnBoarding is nearing completion
* Spent some time to read through DT documents
* Lee Jones helped me get ready for building and testing. The setup to
start the work is ready.
* Attended session on "platform perimeter" hosted by Linus Walleij
- http://www.df.lth.se/~triad/papers/ESC-400Slides_Walleij.pdf
* Spent a good length of time with Niklas to get the DeviceTree work
transferred
=== Plans ===
* Complete ab8500 power DeviceTree assignments
* Spend some time on DT Spec study
* IT/Admin work to carry out in order to get the new laptop with the Ubuntu
distro.
* Complete the pending onboard activity
=== Issues ===
* The 2011-09 version of the linaro-media-create Python application did not
succeed in preparing the bundled image for flashing, so I migrated to
2012.04-1 and found the issue is fixed.
https://launchpad.net/ubuntu/+source/linaro-image-tools/2012.04-1/+build/34…
=== Highlights ===
* Tested Rafael's wakelock interface patches. Found a bug and sent a
fix, which he included.
* Submitted the volatile ranges patch for inclusion. Got some minor
feedback. Dave Chinner suggested I rework the patch so that it uses
fallocate rather than fadvise. I pushed back a bit to make sure that is
a consensus opinion, but will likely try to switch things over next week.
* After getting positive feedback from Arve, on my patch to convert
ashmem to use wakeup sources instead of the stubbed out wakelocks, I
submitted it and Greg included it into staging-next for 3.5
* Got a small RTC null pointer fix merged into tip/timers/urgent for 3.4
* Pinged the Android team on Anton's ulmkd proposal, got some
interesting feedback, and no outright objections.
* Submitted a talk to Linux Plumbers
* Reviewed some patches to introduce CLOCK_TAI functionality. Queued a
few community cleanups.
=== Plans ===
* Rework volatile ranges to use fallocate & resubmit to lkml
=== Issues ===
NA