We've had a discussion in the Linaro storage team (Saugata, Venkat and me, with Luca joining in on the discussion) about swapping to flash based media such as eMMC. This is a summary of what we found and what we think should be done. If people agree that this is a good idea, we can start working on it.
The basic problem is that Linux without swap is sort of crippled: some things either don't work at all (hibernate) or are not as efficient as they should be (e.g. tmpfs). At the same time, the swap code seems to be rather inappropriate for the algorithms used in most flash media today, causing system performance to suffer drastically and wearing out the flash hardware much faster than necessary. In order to change that, we would be implementing the following changes:
1) Try to swap out multiple pages at once, in a single write request. My reading of the current code is that we always send pages one by one to the swap device, while most flash devices have an optimum write size of 32 or 64 kb and some require an alignment of more than a page. Ideally we would try to write an aligned 64 kb block all the time. Writing aligned 64 kb chunks often gives us ten times the throughput of linear 4 kb writes, and going beyond 64 kb usually does not give any better performance (a small benchmark sketch illustrating this follows the list below).
2) Make variable sized swap clusters. Right now, the swap space is organized in clusters of 256 pages (1MB), which is less than the typical erase block size of 4 or 8 MB. We should try to make the swap cluster aligned to erase blocks and have the size match to avoid garbage collection in the drive. The cluster size would typically be set by mkswap as a new option and interpreted at swapon time.
3) As Luca points out, some eMMC media would benefit significantly from having discard requests issued for every page that gets freed from the swap cache, rather than at the time just before we reuse a swap cluster. This would probably have to become a configurable option as well, to avoid the overhead of sending the discard requests on media that don't benefit from this.
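To make the throughput claim in point 1 easy to check, here is a minimal userspace benchmark sketch, not kernel code: it compares linear 4 kb writes against aligned 64 kb writes using O_DIRECT. The device path, the 64 MB test size and the build line are assumptions; run it only against a scratch partition, since it overwrites data.

/* wbench.c - compare linear 4 kb vs aligned 64 kb O_DIRECT writes.
 * Device path and test size are assumptions; this overwrites data,
 * so point it at a scratch partition only.
 * Build: gcc -O2 -o wbench wbench.c (add -lrt on older glibc)
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double write_mbps(const char *dev, size_t chunk, size_t total)
{
    int fd = open(dev, O_WRONLY | O_DIRECT);
    struct timespec t0, t1;
    size_t done;
    void *buf;

    if (fd < 0) { perror(dev); exit(1); }
    if (posix_memalign(&buf, 4096, chunk)) exit(1);
    memset(buf, 0x5a, chunk);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (done = 0; done < total; done += chunk)
        if (pwrite(fd, buf, chunk, done) != (ssize_t)chunk) {
            perror("pwrite"); exit(1);
        }
    fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    free(buf);
    return total / ((t1.tv_sec - t0.tv_sec) +
                    (t1.tv_nsec - t0.tv_nsec) / 1e9) / (1024 * 1024);
}

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/mmcblk0p3"; /* assumed scratch area */
    size_t total = 64 * 1024 * 1024;

    printf("4 kb linear writes:   %6.1f MB/s\n", write_mbps(dev, 4 * 1024, total));
    printf("64 kb aligned writes: %6.1f MB/s\n", write_mbps(dev, 64 * 1024, total));
    return 0;
}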
Does this all sound appropriate for the Linux memory management people?
Also, does this sound useful to the Android developers? Would you start using swap if we make it perform well and not destroy the drives?
Finally, does this plan match up with the capabilities of the various eMMC devices? I know more about SD and USB devices and I'm quite convinced that it would help there, but eMMC can be more like an SSD in some ways, and the current code should be fine for real SSDs.
Arnd
(sorry for the duplicated email, this corrects the address of the android kernel team, please reply here)
On 30 March 2012 13:50, Arnd Bergmann arnd@arndb.de wrote:
1) Try to swap out multiple pages at once, in a single write request. My reading of the current code is that we always send pages one by one to the swap device, while most flash devices have an optimum write size of 32 or 64 kb and some require an alignment of more than a page. Ideally we would try to write an aligned 64 kb block all the time. Writing aligned 64 kb chunks often gives us ten times the throughput of linear 4kb writes, and going beyond 64 kb usually does not give any better performance.
Last I read, Transparent Huge Pages are still paged in and out a page at a time. Is this the case, or was it ever? If it is, should the paging system be extended to support THP, which would take care of the big block issues with flash media?
On Friday 30 March 2012, Zach Pfeffer wrote:
Last I read Transparent Huge Pages are still paged in and out a page at a time, is this or was this ever the case? If it is the case should the paging system be extended to support THP which would take care of the big block issues with flash media?
I don't think we ever want to get /that/ big. As I mentioned, going beyond 64kb does not improve throughput on most flash media. However, paging out 16MB causes a very noticeable delay of up to a few seconds on slow drives, which would be unacceptable to users.
Also, that would only deal with the rare case where the data you want to page out is actually in huge pages, not the common case.
Arnd
On 31 March 2012 04:24, Arnd Bergmann arnd@arndb.de wrote:
On Friday 30 March 2012, Zach Pfeffer wrote:
Last I read Transparent Huge Pages are still paged in and out a page at a time, is this or was this ever the case? If it is the case should the paging system be extended to support THP which would take care of the big block issues with flash media?
I don't think we ever want to get /that/ big. As I mentioned, going beyond 64kb does not improve throughput on most flash media. However, paging out 16MB causes a very noticeable delay of up to a few seconds on slow drives, which would be unacceptable to users.
Also, that would only deal with the rare case where the data you want to page out is actually in huge pages, not the common case.
What I had in mind was being able to swap out big contiguous buffers used by media and graphics engines in one go. This would allow devices to support multiple engines without needing to reserve contiguous memory for each device. They would instead share the contiguous memory. Only one multimedia engine could run at a time, but that would be an okay limitation given certain application domains (low end smart phones).
Arnd
On Fri, 30 Mar 2012, Arnd Bergmann wrote:
On Friday 30 March 2012, Arnd Bergmann wrote:
We've had a discussion in the Linaro storage team (Saugata, Venkat and me, with Luca joining in on the discussion) about swapping to flash based media such as eMMC. This is a summary of what we found and what we think should be done. If people agree that this is a good idea, we can start working on it. The basic problem is that Linux without swap is sort of crippled and some things either don't work at all (hibernate) or not as efficient as they should (e.g. tmpfs). At the same time, the swap code seems to be rather inappropriate for the algorithms used in most flash media today, causing system performance to suffer drastically, and wearing out the flash hardware much faster than necessary. In order to change that, we would be implementing the following changes:
- Try to swap out multiple pages at once, in a single write request. My
reading of the current code is that we always send pages one by one to the swap device, while most flash devices have an optimum write size of 32 or 64 kb and some require an alignment of more than a page. Ideally we would try to write an aligned 64 kb block all the time. Writing aligned 64 kb chunks often gives us ten times the throughput of linear 4kb writes, and going beyond 64 kb usually does not give any better performance.
My suspicion is that we suffer a lot from the "distance" between when we allocate swap space (add_to_swap getting the swp_entry_t to replace ptes by) and when we finally decide to write out a page (swap_writepage): intervening decisions can jumble the sequence badly.
I've not investigated to confirm that, but certainly it was the case two or three years ago, that we got much better behaviour in swapping shmem to flash, when we stopped giving it a second pass round the lru, which used to come in between the allocation and the writeout.
I believe that you'll want to start by implementing something like what Rik set out a year ago in the mail appended below. Adding another layer of indirection isn't always a pure win, and I think none of us have taken it any further since then; but sooner or later we shall need to, and your flash case might be just the prod needed.
With that change made (so swap ptes are just pointers into an intervening structure, where we record disk blocks allocated at the time of writeout), some improvement should come just from traditional merging by the I/O scheduler (deadline seems both better for flash and better for swap: one day it would be nice to work out how cfq can be tweaked better for swap).
Some improvement, but probably not enough, and you'd want to do something more proactive, like the mblk_io_submit stuff ext4 does these days.
Though they might prove to give the greatest benefit on flash, these kinds of changes should be good for conventional disk too.
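For reference, Hugh's aside about the deadline scheduler is a one-line sysfs setting per device. A minimal C sketch follows; the device name is an assumption and root is needed, and it is equivalent to echoing "deadline" into /sys/block/<dev>/queue/scheduler.

/* set_sched.c - select the deadline I/O scheduler for one block device.
 * The device name is an assumption; needs root.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "mmcblk0"; /* assumed device */
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", dev);
    f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    fputs("deadline", f);
    return fclose(f) ? 1 : 0;
}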
2) Make variable sized swap clusters. Right now, the swap space is organized in clusters of 256 pages (1MB), which is less than the typical erase block size of 4 or 8 MB. We should try to make the swap cluster aligned to erase blocks and have the size match to avoid garbage collection in the drive. The cluster size would typically be set by mkswap as a new option and interpreted at swapon time.
That gets to sound more flash-specific, and I feel less enthusiastic about doing things in bigger and bigger lumps. But if it really proves to be of benefit, it's easy enough to let you.
Decide the cluster size at mkswap time, or at swapon time, or by /sys/block/sda/queue parameters? Perhaps a /sys parameter should give the size, but a swapon flag decide whether to participate or not. Perhaps.
3) As Luca points out, some eMMC media would benefit significantly from having discard requests issued for every page that gets freed from the swap cache, rather than at the time just before we reuse a swap cluster. This would probably have to become a configurable option as well, to avoid the overhead of sending the discard requests on media that don't benefit from this.
I'm surprised, I wouldn't have contemplated a discard per page; but if you have cases where it can be proved of benefit, fine. I know nothing at all of eMMC.
Though as things stand, that swap_lock spinlock makes it difficult to find a good safe moment to issue a discard (you want the spinlock to keep it safe, but you don't want to issue "I/O" while holding a spinlock). Perhaps that difficulty can be overcome in a satisfactory way, in the course of restructuring swap allocation as Rik set out (Rik suggests freeing on swapin, that should make it very easy).
Hugh
From riel@redhat.com Sun Apr 10 17:50:10 2011
Date: Sun, 10 Apr 2011 20:50:01 -0400 From: Rik van Riel riel@redhat.com To: Linux Memory Management List linux-mm@kvack.org Subject: [LSF/Collab] swap cache redesign idea
On Thursday after LSF, Hugh, Minchan, Mel, Johannes and I were sitting in the hallway talking about yet more VM things.
During that discussion, we came up with a way to redesign the swap cache. During my flight home, I came up with ideas on how to use that redesign that may make the changes worthwhile.
Currently, the page table entries that have swapped out pages associated with them contain a swap entry, pointing directly at the swap device and swap slot containing the data. Meanwhile, the swap count lives in a separate array.
The redesign we are considering is moving the swap entry to the page cache radix tree for the swapper_space and having the pte contain only the offset into the swapper_space. The swap count info can also fit inside the swapper_space page cache radix tree (at least on 64 bits - on 32 bits we may need to get creative or accept a smaller max amount of swap space).
This extra layer of indirection allows us to do several things:
1) get rid of the virtual address scanning in swapoff; instead we just swap the data in and mark the pages as present in the swapper_space radix tree
2) free swap entries as they are read in, without waiting for the process to fault them in - this may be useful for memory types that have a large erase block
3) together with the defragmentation from (2), we can always do writes in large aligned blocks - the extra indirection will make it relatively easy to have special backend code for different kinds of swap space, since all the state can now live in just one place
4) skip writeout of zero-filled pages - this can be a big help for KVM virtual machines running Windows, since Windows zeroes out free pages; simply discarding a zero-filled page is not at all simple in the current VM, where we would have to iterate over all the ptes to free the swap entry before being able to free the swap cache page (I am not sure how that locking would even work)
with the extra layer of indirection, the locking for this scheme can be trivial - either the faulting process gets the old page, or it gets a new one, either way it'll be zero filled
5) skip writeout of pages the guest has marked as free - same as above, with the same easier locking
Only one real question remaining - how do we handle the swap count in the new scheme? On 64 bit systems we have enough space in the radix tree, on 32 bit systems maybe we'll have to start overflowing into the "swap_count_continued" logic a little sooner than we are now and reduce the maximum swap size a little?
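To make the proposed indirection concrete, here is a rough userspace model, with a flat array standing in for the swapper_space radix tree; all structure and function names are made up for illustration and nothing here mirrors actual kernel code.

/* Userspace model of the proposed indirection: the pte-level value is
 * only an offset into swapper_space, and the disk block is assigned at
 * writeout time. All names are illustrative, not kernel code.
 */
#include <stdint.h>
#include <stdio.h>

#define NSLOTS   1024
#define NO_BLOCK UINT32_MAX

struct swap_slot {
    uint32_t disk_block;  /* filled in at writeout, not at add_to_swap */
    uint16_t swap_count;  /* references from ptes */
};

static struct swap_slot swapper_space[NSLOTS];

/* What a swap pte would store in this scheme: just an offset. */
typedef uint32_t swp_offset_t;

static swp_offset_t model_add_to_swap(void)
{
    swp_offset_t i;

    for (i = 0; i < NSLOTS; i++)
        if (swapper_space[i].swap_count == 0) {
            swapper_space[i].disk_block = NO_BLOCK;
            swapper_space[i].swap_count = 1;
            return i;
        }
    return NO_BLOCK; /* swap full */
}

/* Disk blocks are chosen only here, so consecutive writeouts can be
 * packed into one large aligned request regardless of the order in
 * which the slots were handed out earlier. */
static void model_writeout(swp_offset_t off, uint32_t block)
{
    swapper_space[off].disk_block = block;
}

int main(void)
{
    swp_offset_t a = model_add_to_swap();
    swp_offset_t b = model_add_to_swap();

    model_writeout(a, 100);
    model_writeout(b, 101);
    printf("pte offsets %u,%u -> disk blocks %u,%u\n", a, b,
           swapper_space[a].disk_block, swapper_space[b].disk_block);
    return 0;
}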
On Saturday 31 March 2012, Hugh Dickins wrote:
On Fri, 30 Mar 2012, Arnd Bergmann wrote:
On Friday 30 March 2012, Arnd Bergmann wrote:
My suspicion is that we suffer a lot from the "distance" between when we allocate swap space (add_to_swap getting the swp_entry_t to replace ptes by) and when we finally decide to write out a page (swap_writepage): intervening decisions can jumble the sequence badly.
I've not investigated to confirm that, but certainly it was the case two or three years ago, that we got much better behaviour in swapping shmem to flash, when we stopped giving it a second pass round the lru, which used to come in between the allocation and the writeout.
I believe that you'll want to start by implementing something like what Rik set out a year ago in the mail appended below. Adding another layer of indirection isn't always a pure win, and I think none of us have taken it any further since then; but sooner or later we shall need to, and your flash case might be just the prod needed.
Thanks a lot for that pointer, that certainly sounds interesting. I guess we should first investigate in what order the pages normally get written out to flash. If they are not strictly in sequential order, the other improvements I suggested would be less effective as well.
Note that I'm not at all worried about reading pages back in from flash out of order, that tends to be harmless because reads are much rarer than writes on swap, and because only random writes require garbage collection inside of the flash (forcing up to 500ms delays on a single write occasionally), while reads are always uniformly fast.
- Make variable sized swap clusters. Right now, the swap space is
organized in clusters of 256 pages (1MB), which is less than the typical erase block size of 4 or 8 MB. We should try to make the swap cluster aligned to erase blocks and have the size match to avoid garbage collection in the drive. The cluster size would typically be set by mkswap as a new option and interpreted at swapon time.
That gets to sound more flash-specific, and I feel less enthusiastic about doing things in bigger and bigger lumps. But if it really proves to be of benefit, it's easy enough to let you.
Decide the cluster size at mkswap time, or at swapon time, or by /sys/block/sda/queue parameters? Perhaps a /sys parameter should give the size, but a swapon flag decide whether to participate or not. Perhaps.
I was thinking of mkswap time, because the erase block size is specific to the storage hardware and there is no reason to ever change it at run time, and we cannot always easily probe the value by looking at hardware registers (USB doesn't have the data, in SD cards it's usually wrong, and in eMMC it's sometimes wrong). I should also mention that it's not always a power of two: some drives that use TLC flash have three times the erase block size of the equivalent SLC flash, e.g. 3 MB or 6 MB.
I don't think that's a problem, but I might be missing something here. I have also encountered a few older drives that use some completely random erase block size, but they are very rare.
Also, I'm unsure what the largest cluster size is that we can realistically support. 8 MB sounds fairly large already, especially on systems that have less than 1 GB of RAM, as most of the ARM machines today do. For shingled hard drives, we would get very similar behavior to flash media, but the chunks would be even larger, on the order of 64 MB. If we can make those work, it would no longer be specific to flash, but it would also be a lot harder to do.
- As Luca points out, some eMMC media would benefit significantly from
having discard requests issued for every page that gets freed from the swap cache, rather than at the time just before we reuse a swap cluster. This would probably have to become a configurable option as well, to avoid the overhead of sending the discard requests on media that don't benefit from this.
I'm surprised, I wouldn't have contemplated a discard per page; but if you have cases where it can be proved of benefit, fine. I know nothing at all of eMMC.
My understanding is that some devices can arbitrarily map between physical flash pages (typically 4, 8, or 16kb) and logical sector numbers, instead of remapping on the much larger erase block granularity. In those cases, it makes sense to free up as many pages as possible on the drive, in order to give the hardware more room for reorganizing itself and doing background defragmentation of its free space.
Though as things stand, that swap_lock spinlock makes it difficult to find a good safe moment to issue a discard (you want the spinlock to keep it safe, but you don't want to issue "I/O" while holding a spinlock). Perhaps that difficulty can be overcome in a satisfactory way, in the course of restructuring swap allocation as Rik set out (Rik suggests freeing on swapin, that should make it very easy).
Luca was suggesting to use the disk->fops->swap_slot_free_notify callback from swap_entry_free(), which is currently only used in zram, but you're right, that would not work.
Another option would be batched discard as we do it for file systems: occasionally stop writing to swap space, scan for areas that have become available since the last discard, then send discard commands for those.
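For reference, the batched variant could be driven with the same interface userspace already has for whole-range discards, the BLKDISCARD ioctl. A minimal sketch follows; the device path and the byte range are assumptions, and the data in the range is destroyed.

/* discard.c - discard one byte range on a block device via BLKDISCARD.
 * The device path and the range are assumptions, and the data in the
 * range is destroyed. Build: gcc -O2 -o discard discard.c
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/mmcblk0p3"; /* assumed swap partition */
    uint64_t range[2] = {
        8ULL * 1024 * 1024, /* start: e.g. one freed 8 MB cluster */
        8ULL * 1024 * 1024, /* length */
    };
    int fd = open(dev, O_WRONLY);

    if (fd < 0) { perror(dev); return 1; }

    /* In the batched scheme, the swap map scan would collect a list of
     * such ranges and issue them here in one pass. */
    if (ioctl(fd, BLKDISCARD, range) < 0)
        perror("BLKDISCARD");

    close(fd);
    return 0;
}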
Arnd
On Mon, 2 Apr 2012, Arnd Bergmann wrote:
Another option would be batched discard as we do it for file systems: occasionally stop writing to swap space, scan for areas that have become available since the last discard, then send discard commands for those.
I'm not sure whether you've missed "swapon --discard", which switches on discard_swap_cluster() just before we allocate from a new cluster; or whether you're musing that it's no use to you because you want to repurpose the swap cluster to match erase block: I'm mentioning it in case you missed that it's already there (but few use it, since even done at that scale it's often more trouble than it's worth).
Hugh
On Monday 02 April 2012, Hugh Dickins wrote:
On Mon, 2 Apr 2012, Arnd Bergmann wrote:
Another option would be batched discard as we do it for file systems: occasionally stop writing to swap space, scan for areas that have become available since the last discard, then send discard commands for those.
I'm not sure whether you've missed "swapon --discard", which switches on discard_swap_cluster() just before we allocate from a new cluster; or whether you're musing that it's no use to you because you want to repurpose the swap cluster to match erase block: I'm mentioning it in case you missed that it's already there (but few use it, since even done at that scale it's often more trouble than it's worth).
I actually argued that discard_swap_cluster is exactly the right thing to do, especially when clusters match erase blocks on the less capable devices like SD cards.
Luca was arguing that on some hardware there is no point in ever submitting a discard just before we start reusing space, because at that point the hardware already discards the old data by overwriting the logical addresses with new blocks, while issuing a discard on all blocks as soon as they become available would make a bigger difference. I would be interested in hearing from Hyojin Jeong and Alex Lemberg what they think is the best time to issue a discard, because they would know about other hardware than Luca does.
Arnd
Dear Arnd
Hello,
I don't clearly understand the history of this e-mail thread because I joined in the middle of it, but I would like to make some comments about discard in the swap area. From the eMMC device's point of view, there is no information about which files are used in the system software (the Linux filesystem), so the eMMC has no way to know the addresses of data that has already been erased. If a discard CMD sends this information (the addresses of erased files) to the eMMC, the old data can be erased at the physical NAND level, freeing the space while minimizing internal merges.
I'm not sure exactly how Linux manages the swap area. If the host and the eMMC device have different views of which data is invalid, sending discards to the eMMC is good for I/O performance; it is the same as the general case of discard on a user partition formatted with a filesystem. As your e-mail mentioned, overwriting the logical addresses is another way to convey which data is invalid, but only for the overwritten area, and it is not the best way for the eMMC to manage the physical NAND array. In that case the eMMC has to trim the physical NAND array and do the write operation at the same time, which needs more latency. If the host sends discards with the invalid data addresses in advance, the eMMC can find the best way to manage the physical NAND pages before the host writes again. I'm not sure whether this addresses your concern; if you need more info, please let me know.
Best Regards Hyojin
On Thursday 05 April 2012, Hyojin Jeong wrote:
I'm not sure exactly how Linux manages the swap area. If the host and the eMMC device have different views of which data is invalid, sending discards to the eMMC is good for I/O performance; it is the same as the general case of discard on a user partition formatted with a filesystem. As your e-mail mentioned, overwriting the logical addresses is another way to convey which data is invalid, but only for the overwritten area, and it is not the best way for the eMMC to manage the physical NAND array. In that case the eMMC has to trim the physical NAND array and do the write operation at the same time, which needs more latency. If the host sends discards with the invalid data addresses in advance, the eMMC can find the best way to manage the physical NAND pages before the host writes again. I'm not sure whether this addresses your concern; if you need more info, please let me know.
One specific property of the linux swap code is that we write relatively large clusters (1 MB today) sequentially and only reuse them once all of the data in them has become invalid. Part of my suggestion was to increase that size to the erase block size of the underlying storage, e.g. 8MB for typical eMMC. Right now, we send a discard command just before reusing a swap cluster, for the entire cluster.
In my interpretation, this already means a typical device will never do a garbage collection of that erase block, because we never overwrite the erase block partially.
Luca suggested that we could send the discard command as soon as an individual 4kb page is freed, which would let the device reuse the physical erase block as soon as all the pages in that erase block have been freed over time, but my interpretation is that while this can help for global wear levelling, it does not help avoid any garbage collection.
Arnd
Hi Arnd,
Regarding time to issue discard/TRIM commands: It would be advised to issue the discard command immediately after deleting/freeing a SWAP cluster (i.e. as soon as it becomes available).
Regarding SWAP page size: Working with SWAP pages as large as possible would be recommended (preferably 64KB). Also, writing in a sequential manner as much as possible while swapping large quantities of data is advisable.
SWAP pages and corresponding transactions should be aligned to the SWAP page size (i.e. 64KB above); the alignment should correspond to the physical storage "LBA 0", i.e. to the first LBA of the storage device (and not to a logical/physical partition).
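A small sketch of the alignment arithmetic Alex describes: round an in-partition offset up so that its device-absolute address is a 64 KB multiple. The partition start offset would come from the partition table; the values below are only examples.

/* align.c - round a swap offset up so it is 64 KB aligned relative to
 * the whole device (LBA 0), not to the partition start. Values are
 * illustrative.
 */
#include <stdint.h>
#include <stdio.h>

#define ALIGN_BYTES (64 * 1024)

/* First in-partition offset >= part_off whose device-absolute address
 * is a multiple of ALIGN_BYTES; part_start is the partition's byte
 * offset from the start of the device. */
static uint64_t first_aligned(uint64_t part_start, uint64_t part_off)
{
    uint64_t absolute = part_start + part_off;
    uint64_t up = (absolute + ALIGN_BYTES - 1) / ALIGN_BYTES * ALIGN_BYTES;

    return up - part_start;
}

int main(void)
{
    /* e.g. a partition starting at sector 16065 (an old fdisk default),
     * which is not 64 KB aligned on the device */
    uint64_t part_start = 16065ULL * 512;

    printf("first 64 KB aligned in-partition offset: %llu bytes\n",
           (unsigned long long)first_aligned(part_start, 0));
    return 0;
}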
Thanks, Alex
On 2012-04-08 10:50 PM, Alex Lemberg wrote:
Hi Arnd,
Regarding time to issue discard/TRIM commands: It would be advised to issue the discard command immediately after deleting/freeing a SWAP cluster (i.e. as soon as it becomes available).
Is it still good with page size, not cluster size?
Regarding SWAP page size: Working with SWAP pages as large as possible would be recommended (preferably 64KB). Also, writing in a sequential manner as much as possible while swapping large quantities of data is advisable.
SWAP pages and corresponding transactions should be aligned to the SWAP page size (i.e. 64KB above); the alignment should correspond to the physical storage "LBA 0", i.e. to the first LBA of the storage device (and not to a logical/physical partition).
I'm curious whether the above comment is valid for Samsung and other eMMC. Hyojin, could you answer?
Thanks, Alex
Hi Minchan
How are you doing?
Regarding time to issue Discard/Trim: from the eMMC point of view, I believe that an immediate Discard/Trim CMD after deleting/freeing a SWAP cluster is always better for all general eMMC implementations.
Regarding swap page size: Actually, I can't guarantee the optimal size for the different eMMC devices in the industry, because it depends on the NAND page size and the firmware implementation inside the eMMC. In the case of SAMSUNG eMMC, an 8KB page size and a 512KB block size (erase unit) is the current implementation. I think that multiples of the 8KB page size, aligned to 512KB, are good for SAMSUNG eMMC. If the swap system uses 512KB pages and issues Discard/Trim aligned to 512KB, the eMMC gives the best performance as of today. However, a large page size in the swap partition may not be the best approach at the Linux system level. I'm not sure what the best page size is between the swap system and the eMMC device.
Best Regards Hyojin
On 2012-04-09 4:37 PM, Hyojin Jeong wrote:
Hi Minchan
How are you doing?
Pretty good :)
Regarding time to issue Discard/Trim: from the eMMC point of view, I believe that an immediate Discard/Trim CMD after deleting/freeing a SWAP cluster is always better for all general eMMC implementations.
The point of the question is whether page-sized discard is good or not. Luca and Arnd said some devices would benefit if we send the discard command to the eMMC as soon as Linux frees _a swap page_, rather than a batched cluster. But AFAIK, Samsung eMMC doesn't benefit from per-page discard, and I guess most eMMC devices aren't good at per-page discard either, because the FTL doesn't support full page mapping in such small devices. So I'm not sure we have to implement per-page discard, especially since the code would be rather complicated for the sake of a few devices.
Regarding swap page size: Actually, I can't guarantee the optimal size for the different eMMC devices in the industry, because it depends on the NAND page size and the firmware implementation inside the eMMC. In the case of SAMSUNG eMMC, an 8KB page size and a 512KB block size (erase unit) is the current implementation. I think that multiples of the 8KB page size, aligned to 512KB, are good for SAMSUNG eMMC. If the swap system uses 512KB pages and issues Discard/Trim aligned to 512KB, the eMMC gives the best performance as of today. However, a large page size in the swap partition may not be the best approach at the Linux system level. I'm not sure what the best page size is between the swap system and the eMMC device.
The variety is one of the challenges for removing GC in general. ;-( I don't like a manual setting through /sys/block/xxx because it requires the user to know the NAND page size and erase block size, which is not easy for a normal user to find out. Arnd, what's your plan to support the various flash storage devices effectively?
-- Kind regards, Minchan Kim
On Monday 09 April 2012, Minchan Kim wrote:
The variety is one of the challenges for removing GC in general. ;-( I don't like a manual setting through /sys/block/xxx because it requires the user to know the NAND page size and erase block size, which is not easy for a normal user to find out. Arnd, what's your plan to support the various flash storage devices effectively?
My preference would be to build the logic to detect the sizes into mkfs and mkswap and encode them in the superblock in new fields. I don't think we can trust any data that a device reports right now because operating systems have ignored it in the past and either someone has forgotten to update the fields after moving to new technology (eMMC), or the data can not be encoded correctly according to the spec (SD, USB).
System builders for embedded systems can then make sure that they get it right for the hardware they use, and we can try our best to help that process.
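To picture what that could look like on disk, here is a hypothetical sketch: mkswap stores the measured erase block size in what is currently padding in the swap header, and swapon derives the cluster size from it. The struct follows the info area of the kernel's union swap_header as far as can be told, but the erase_block_kb field does not exist today and everything here is illustrative only.

/* Hypothetical sketch: carry the erase block size in a currently
 * unused word of the swap header. The layout below mirrors the info
 * area of the kernel's union swap_header as far as can be told; the
 * erase_block_kb field is invented for illustration.
 */
#include <stdint.h>
#include <stdio.h>

struct swap_header_info {
    char          bootbits[1024];  /* space for disklabel etc. */
    uint32_t      version;
    uint32_t      last_page;
    uint32_t      nr_badpages;
    unsigned char sws_uuid[16];
    unsigned char sws_volume[16];
    uint32_t      erase_block_kb;  /* NEW (hypothetical): 0 = unknown,
                                    * else erase block size in KB,
                                    * e.g. 1536, 3072, 4096, 8192 */
    uint32_t      padding[116];    /* was 117 words of padding */
    uint32_t      badpages[1];
};

/* swapon would size the swap cluster from the stored value, falling
 * back to the current 1 MB cluster when the field is zero. */
static unsigned long cluster_pages(const struct swap_header_info *h,
                                   unsigned long page_kb)
{
    if (h->erase_block_kb == 0)
        return 1024 / page_kb;          /* today's 1 MB default */
    return h->erase_block_kb / page_kb;
}

int main(void)
{
    struct swap_header_info h = { .erase_block_kb = 6144 }; /* 6 MB TLC erase block */

    printf("swap cluster: %lu pages\n", cluster_pages(&h, 4));
    return 0;
}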
Arnd
On 2012-04-09 10:00 PM, Arnd Bergmann wrote:
My preference would be to build the logic to detect the sizes into mkfs and mkswap and encode them in the superblock in new fields. I don't think we can trust any data that a device reports right now because operating systems have ignored it in the past and either someone has forgotten to update the fields after moving to new technology (eMMC), or the data can not be encoded correctly according to the spec (SD, USB).
I don't think that's a good approach. How long does it take to detect such parameters? I guess it's not short, so mkfs/mkswap would become dramatically slower. If needed, let's maintain it as a separate tool.
If storage vendors break such fields, their devices won't work well on Linux, which is very popular in the mobile world today; users will not buy such vendors' devices and the company will be gone. Let's put that pressure on them and make the vendors keep their promise.
On Tuesday 10 April 2012, Minchan Kim wrote:
I don't think that's a good approach. How long does it take to detect such parameters? I guess it's not short, so mkfs/mkswap would become dramatically slower. If needed, let's maintain it as a separate tool.
I haven't come up with a way that is both fast and reliable. A very fast method is to time short read requests across potential erase block boundaries and see which ones are faster than others; this works on about 3 out of 4 devices.
For the other devices, I currently use a fairly manual process that times a lot of write requests and can take a long time.
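For reference, the fast method can be sketched roughly like this: time small O_DIRECT reads that straddle each candidate boundary and compare them with reads that stay inside the block; boundaries that are consistently slower to straddle are likely erase block (or allocation unit) edges. The device path and the candidate sizes are assumptions, and as said above this only works on some devices.

/* ebdetect.c - rough sketch of the "fast" method: time 8 kb O_DIRECT
 * reads that straddle candidate erase block boundaries. Device path
 * and candidate sizes are assumptions. Build: gcc -O2 -o ebdetect ebdetect.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define RDSIZE 8192
#define ROUNDS 64

static double read_usecs(int fd, void *buf, off_t off)
{
    struct timespec t0, t1;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ROUNDS; i++)
        if (pread(fd, buf, RDSIZE, off) != RDSIZE) {
            perror("pread"); exit(1);
        }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e6 +
            (t1.tv_nsec - t0.tv_nsec) / 1e3) / ROUNDS;
}

int main(int argc, char **argv)
{
    /* candidate erase block sizes in bytes, including non-power-of-two */
    static const long long cand[] = {
        1536 * 1024LL, 2048 * 1024LL, 3072 * 1024LL,
        4096 * 1024LL, 6144 * 1024LL, 8192 * 1024LL,
    };
    const char *dev = argc > 1 ? argv[1] : "/dev/mmcblk0"; /* assumed device */
    void *buf;
    unsigned i;
    int fd = open(dev, O_RDONLY | O_DIRECT);

    if (fd < 0) { perror(dev); return 1; }
    if (posix_memalign(&buf, 4096, RDSIZE)) return 1;

    for (i = 0; i < sizeof(cand) / sizeof(cand[0]); i++) {
        off_t edge = 4 * cand[i];      /* some multiple of the candidate */
        double across = read_usecs(fd, buf, edge - RDSIZE / 2);
        double within = read_usecs(fd, buf, edge + RDSIZE);

        printf("%5lld kb boundary: %7.1f us across, %7.1f us within\n",
               cand[i] / 1024, across, within);
    }
    close(fd);
    free(buf);
    return 0;
}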
If storage vendors break such fields, their devices won't work well on Linux, which is very popular in the mobile world today; users will not buy such vendors' devices and the company will be gone. Let's put that pressure on them and make the vendors keep their promise.
This could work for eMMC, yes.
The SD card standard makes it impossible to write the correct value for most devices, it only supports power-of-two values up to 4MB for SDHC, and larger values (I believe 8, 12, 16, 24, ... 64) for SDXC, but a lot of SDHC cards nowadays use 1.5, 3, 6 or 8 MB erase blocks.
Arnd
Hi All,
On Tuesday, April 10, 2012, Arnd Bergmann wrote:
On Tuesday 10 April 2012, Minchan Kim wrote:
I think it's not good approach. How long does it take to know such parameters? I guess it's not short so that mkfs/mkswap would be very long dramatically. If needed, let's maintain it as another tool.
I haven't come up with a way that is both fast and reliable. A very fast method is to time short read requests across potential erase block boundaries and see which ones are faster than others, this works on about 3 out of 4 devices.
For the other devices, I currently use a fairly manual process that times a lot of write requests and can take a long time.
If storage vendors break such fields, it doesn't work well on linux which is very popular on mobile world today and user will not use such vendor devices and company will be gone. Let's give such pressure to them and make vendor keep in promise.
This could work for eMMC, yes.
I like it ;)
On Monday 09 April 2012, Hyojin Jeong wrote:
If the swap system uses 512KB pages and issues Discard/Trim aligned to 512KB, the eMMC gives the best performance as of today. However, a large page size in the swap partition may not be the best approach at the Linux system level. I'm not sure what the best page size is between the swap system and the eMMC device.
Can you explain the significance of the 512KB size? I've seen devices report a 512KB erase size, although measurements clearly showed an erase block size of 8MB, and I do not understand this discrepancy.
Right now, we always send discards of 1MB clusters to the device, which does what you want, although I'm not sure if those clusters are naturally aligned to the start of the partition. Obviously this also requires aligning the start of the partition to the erase block size, but most devices should already get that right nowadays.
Arnd
Hugh,
Great topics. As per one of Rik's original points:
skip writeout of zero-filled pages - this can be a big help for KVM virtual machines running Windows, since Windows zeroes out free pages; simply discarding a zero-filled page is not at all simple in the current VM, where we would have to iterate over all the ptes to free the swap entry before being able to free the swap cache page (I am not sure how that locking would even work)
with the extra layer of indirection, the locking for this scheme can be trivial - either the faulting process gets the old page, or it gets a new one, either way it'll be zero filled
Since it's KVM's realm here, can't KSM simply solve the zero-filled pages problem, avoiding an unnecessary burden on the swap subsystem?
Cheers, Luca
On Mon, 2 Apr 2012, Luca Porzio (lporzio) wrote:
Great topics. As per one of Rik original points:
skip writeout of zero-filled pages - this can be a big help for KVM virtual machines running Windows, since Windows zeroes out free pages; simply discarding a zero-filled page is not at all simple in the current VM, where we would have to iterate over all the ptes to free the swap entry before being able to free the swap cache page (I am not sure how that locking would even work)
with the extra layer of indirection, the locking for this scheme can be trivial - either the faulting process gets the old page, or it gets a new one, either way it'll be zero filled
Since it's KVM's realm here, can't KSM simply solve the zero-filled pages problem, avoiding unnecessary burden for the swap subsystem?
I would expect that KSM already does largely handle this, yes. But it's also quite possible that I'm missing Rik's point.
Hugh
On 04/02/2012 10:58 AM, Hugh Dickins wrote:
On Mon, 2 Apr 2012, Luca Porzio (lporzio) wrote:
Great topics. As per one of Rik original points:
skip writeout of zero-filled pages - this can be a big help for KVM virtual machines running Windows, since Windows zeroes out free pages; simply discarding a zero-filled page is not at all simple in the current VM, where we would have to iterate over all the ptes to free the swap entry before being able to free the swap cache page (I am not sure how that locking would even work)
with the extra layer of indirection, the locking for this scheme can be trivial - either the faulting process gets the old page, or it gets a new one, either way it'll be zero filled
Since it's KVM's realm here, can't KSM simply solve the zero-filled pages problem, avoiding unnecessary burden for the swap subsystem?
I would expect that KSM already does largely handle this, yes. But it's also quite possible that I'm missing Rik's point.
Indeed, KSM handles it already.
However, it may be worthwhile for non-KVM users of transparent huge pages to discard zero-filled parts of pages (memory allocated by the kernel to the process, but never used).
Not just because it takes up swap space (writing to swap is easy, space is cheap), but because not swapping that memory back in later (because it is not used) will prevent us from re-building the transparent huge page...
On 30/03/12 21:50, Arnd Bergmann wrote:
(sorry for the duplicated email, this corrects the address of the android kernel team, please reply here)
On Friday 30 March 2012, Arnd Bergmann wrote:
We've had a discussion in the Linaro storage team (Saugata, Venkat and me, with Luca joining in on the discussion) about swapping to flash based media such as eMMC. This is a summary of what we found and what we think should be done. If people agree that this is a good idea, we can start working on it.
There is mtdswap.
Also the old Nokia N900 had swap to eMMC.
The last I heard was that swap was considered to be simply too slow on hand held devices.
As systems adopt more RAM, isn't there a decreasing demand for swap?
On Wednesday 04 April 2012, Adrian Hunter wrote:
On 30/03/12 21:50, Arnd Bergmann wrote:
(sorry for the duplicated email, this corrects the address of the android kernel team, please reply here)
On Friday 30 March 2012, Arnd Bergmann wrote:
We've had a discussion in the Linaro storage team (Saugata, Venkat and me, with Luca joining in on the discussion) about swapping to flash based media such as eMMC. This is a summary of what we found and what we think should be done. If people agree that this is a good idea, we can start working on it.
There is mtdswap.
Ah, very interesting. I wasn't aware of that. Obviously we can't directly use it on block devices that have their own garbage collection and wear leveling built into them, but it's interesting to see how this was solved before.
While we could build something similar that remaps blocks between an eMMC device and the logical swap space that is used by the mm code, my feeling is that it would be easier to modify the swap code itself to do the right thing.
Also the old Nokia N900 had swap to eMMC.
The last I heard was that swap was considered to be simply too slow on hand held devices.
That's the part that we want to solve here. It has nothing to do with handheld devices, but more with specific incompatibilities of the block allocation in the swap code vs. what an eMMC device expects to see for fast operation. If you write data in the wrong order on flash devices, you get long delays that you don't get when you do it the right way. The same problem exists for file systems, and is being addressed there as well.
As systems adopt more RAM, isn't there a decreasing demand for swap?
No. You would never be able to make hibernate work, no matter how much RAM you add ;-)
More seriously, the need for swap is not to work around the fact that we have too little memory, it's one of the fundamental assumptions of the mm subsystem that swap exists, and it's generally a good idea to have, so you treat file backed memory in the same way as anonymous memory.
Arnd
On 04/04/12 15:47, Arnd Bergmann wrote:
On Wednesday 04 April 2012, Adrian Hunter wrote:
On 30/03/12 21:50, Arnd Bergmann wrote:
(sorry for the duplicated email, this corrects the address of the android kernel team, please reply here)
On Friday 30 March 2012, Arnd Bergmann wrote:
We've had a discussion in the Linaro storage team (Saugata, Venkat and me, with Luca joining in on the discussion) about swapping to flash based media such as eMMC. This is a summary of what we found and what we think should be done. If people agree that this is a good idea, we can start working on it.
There is mtdswap.
Ah, very interesting. I wasn't aware of that. Obviously we can't directly use it on block devices that have their own garbage collection and wear leveling built into them, but it's interesting to see how this was solved before.
While we could build something similar that remaps blocks between an eMMC device and the logical swap space that is used by the mm code, my feeling is that it would be easier to modify the swap code itself to do the right thing.
Also the old Nokia N900 had swap to eMMC.
The last I heard was that swap was considered to be simply too slow on hand held devices.
That's the part that we want to solve here. It has nothing to do with handheld devices, but more with specific incompatibilities of the block allocation in the swap code vs. what an eMMC device expects to see for fast operation. If you write data in the wrong order on flash devices, you get long delays that you don't get when you do it the right way. The same problem exists for file systems, and is being addressed there as well.
As systems adopt more RAM, isn't there a decreasing demand for swap?
No. You would never be able to make hibernate work, no matter how much RAM you add ;-)
Have you considered making hibernate work without swap?
On Wed 2012-04-11 13:28:39, Adrian Hunter wrote:
On 04/04/12 15:47, Arnd Bergmann wrote:
On Wednesday 04 April 2012, Adrian Hunter wrote:
On 30/03/12 21:50, Arnd Bergmann wrote:
(sorry for the duplicated email, this corrects the address of the android kernel team, please reply here)
On Friday 30 March 2012, Arnd Bergmann wrote:
We've had a discussion in the Linaro storage team (Saugata, Venkat and me, with Luca joining in on the discussion) about swapping to flash based media such as eMMC. This is a summary of what we found and what we think should be done. If people agree that this is a good idea, we can start working on it.
There is mtdswap.
Ah, very interesting. I wasn't aware of that. Obviously we can't directly use it on block devices that have their own garbage collection and wear leveling built into them, but it's interesting to see how this was solved before.
While we could build something similar that remaps blocks between an eMMC device and the logical swap space that is used by the mm code, my feeling is that it would be easier to modify the swap code itself to do the right thing.
Also the old Nokia N900 had swap to eMMC.
The last I heard was that swap was considered to be simply too slow on hand held devices.
That's the part that we want to solve here. It has nothing to do with handheld devices, but more with specific incompatibilities of the block allocation in the swap code vs. what an eMMC device expects to see for fast operation. If you write data in the wrong order on flash devices, you get long delays that you don't get when you do it the right way. The same problem exists for file systems, and is being addressed there as well.
As systems adopt more RAM, isn't there a decreasing demand for swap?
No. You would never be able to make hibernate work, no matter how much RAM you add ;-)
Have you considered making hibernate work without swap?
It does work without swap. See the userland suspend packages; where you write the image is up to you.
Pavel
On Sat, Mar 31, 2012 at 2:44 AM, Arnd Bergmann arnd@arndb.de wrote:
We've had a discussion in the Linaro storage team (Saugata, Venkat and me, with Luca joining in on the discussion) about swapping to flash based media such as eMMC. This is a summary of what we found and what we think should be done. If people agree that this is a good idea, we can start working on it.
The basic problem is that Linux without swap is sort of crippled and some things either don't work at all (hibernate) or not as efficient as they should (e.g. tmpfs). At the same time, the swap code seems to be rather inappropriate for the algorithms used in most flash media today, causing system performance to suffer drastically, and wearing out the flash hardware much faster than necessary. In order to change that, we would be implementing the following changes:
- Try to swap out multiple pages at once, in a single write request. My
reading of the current code is that we always send pages one by one to the swap device, while most flash devices have an optimum write size of 32 or 64 kb and some require an alignment of more than a page. Ideally we would try to write an aligned 64 kb block all the time. Writing aligned 64 kb chunks often gives us ten times the throughput of linear 4kb writes, and going beyond 64 kb usually does not give any better performance.
It does make sense. I think we can batch will-be-swapped-out pages in shrink_page_list if they are allocated contiguous swap slots.
- Make variable sized swap clusters. Right now, the swap space is
organized in clusters of 256 pages (1MB), which is less than the typical erase block size of 4 or 8 MB. We should try to make the swap cluster aligned to erase blocks and have the size match to avoid garbage collection in the drive. The cluster size would typically be set by mkswap as a new option and interpreted at swapon time.
If we can find such big contiguous swap slots easily, it would be good. But I am not sure how often we can get such big slots. And maybe we have to improve search method for getting such big empty cluster.
- As Luca points out, some eMMC media would benefit significantly from
having discard requests issued for every page that gets freed from the swap cache, rather than at the time just before we reuse a swap cluster. This would probably have to become a configurable option as well, to avoid the overhead of sending the discard requests on media that don't benefit from this.
It's the opposite of 2). I don't know how many such eMMC devices there are. Normally, per-page discard isn't useful on most eMMC media. I am not sure we should implement per-page discard for such a minority of devices at the cost of increased code complexity due to locking issues.
Does this all sound appropriate for the Linux memory management people?
Also, does this sound useful to the Android developers? Would you start using swap if we make it perform well and not destroy the drives?
Finally, does this plan match up with the capabilities of the various eMMC devices? I know more about SD and USB devices and I'm quite convinced that it would help there, but eMMC can be more like an SSD in some ways, and the current code should be fine for real SSDs.
Arnd
On Friday 06 April 2012, Minchan Kim wrote:
On Sat, Mar 31, 2012 at 2:44 AM, Arnd Bergmann arnd@arndb.de wrote:
We've had a discussion in the Linaro storage team (Saugata, Venkat and me, with Luca joining in on the discussion) about swapping to flash based media such as eMMC. This is a summary of what we found and what we think should be done. If people agree that this is a good idea, we can start working on it.
The basic problem is that Linux without swap is sort of crippled and some things either don't work at all (hibernate) or not as efficient as they should (e.g. tmpfs). At the same time, the swap code seems to be rather inappropriate for the algorithms used in most flash media today, causing system performance to suffer drastically, and wearing out the flash hardware much faster than necessary. In order to change that, we would be implementing the following changes:
- Try to swap out multiple pages at once, in a single write request. My
reading of the current code is that we always send pages one by one to the swap device, while most flash devices have an optimum write size of 32 or 64 kb and some require an alignment of more than a page. Ideally we would try to write an aligned 64 kb block all the time. Writing aligned 64 kb chunks often gives us ten times the throughput of linear 4kb writes, and going beyond 64 kb usually does not give any better performance.
It does make sense. I think we can batch will-be-swapped-out pages in shrink_page_list if they are allocated contiguous swap slots.
But would that guarantee that all writes are the same size? While writing larger chunks would generally be helpful, in order to guarantee that the drive doesn't do any garbage collection, we would have to do all writes in aligned chunks. It would probably be enough to do this in 8kb or 16kb units for most devices over the next few years, but implementing it for 64kb should be the same amount of work and will get us a little bit further.
I'm not sure what we would do when there are less than 64kb available for pageout on the inactive list. The two choices I can think of are either not writing anything, or wasting the swap slots and filling up the data with zeroes.
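To make the two options more concrete, here is a rough user space illustration of the proposed write pattern, not the actual swap path: up to sixteen 4 kb pages get gathered into one 64 kb buffer that is written with a single aligned request, and any missing pages are zero-filled (the "waste the swap slots" variant). All names and sizes here are my own:

/* Illustration of batched, aligned swap-out: one 64 kb write instead of
 * up to sixteen separate 4 kb writes, with zero padding when fewer pages
 * are available. This is user space scaffolding, not the kernel's
 * swap_writepage() path. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define PAGE_BYTES   4096
#define CHUNK_BYTES  (64 * 1024)
#define CHUNK_PAGES  (CHUNK_BYTES / PAGE_BYTES)

/* Write 'npages' pages as one aligned 64 kb request at chunk index 'chunk'. */
static int swap_write_chunk(int fd, uint64_t chunk,
                            const void *pages[], int npages)
{
        static char buf[CHUNK_BYTES];
        int i;

        memset(buf, 0, sizeof(buf));            /* zero-pad unused slots */
        for (i = 0; i < npages && i < CHUNK_PAGES; i++)
                memcpy(buf + i * PAGE_BYTES, pages[i], PAGE_BYTES);

        /* one aligned request instead of npages small ones */
        if (pwrite(fd, buf, CHUNK_BYTES, (off_t)(chunk * CHUNK_BYTES)) != CHUNK_BYTES)
                return -1;
        return 0;
}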
- Make variable sized swap clusters. Right now, the swap space is
organized in clusters of 256 pages (1MB), which is less than the typical erase block size of 4 or 8 MB. We should try to make the swap cluster aligned to erase blocks and have the size match to avoid garbage collection in the drive. The cluster size would typically be set by mkswap as a new option and interpreted at swapon time.
If we can find such big contiguous swap slots easily, it would be good. But I am not sure how often we can get such big slots. And maybe we have to improve search method for getting such big empty cluster.
As long as there are clusters available, we should try to find them. When free space is too fragmented to find any unused cluster, we can pick one that has very little data in it, so that we reduce the time it takes to GC that erase block in the drive. While we could theoretically do active garbage collection of swap data in the kernel, it won't get more efficient than the GC inside of the drive. If we do this, it unfortunately means that we can't just send a discard for the entire erase block.
Arnd
On 2012-04-07 1:16 AM, Arnd Bergmann wrote:
On Friday 06 April 2012, Minchan Kim wrote:
On Sat, Mar 31, 2012 at 2:44 AM, Arnd Bergmann arnd@arndb.de wrote:
We've had a discussion in the Linaro storage team (Saugata, Venkat and me, with Luca joining in on the discussion) about swapping to flash based media such as eMMC. This is a summary of what we found and what we think should be done. If people agree that this is a good idea, we can start working on it.
The basic problem is that Linux without swap is sort of crippled and some things either don't work at all (hibernate) or not as efficient as they should (e.g. tmpfs). At the same time, the swap code seems to be rather inappropriate for the algorithms used in most flash media today, causing system performance to suffer drastically, and wearing out the flash hardware much faster than necessary. In order to change that, we would be implementing the following changes:
- Try to swap out multiple pages at once, in a single write request. My
reading of the current code is that we always send pages one by one to the swap device, while most flash devices have an optimum write size of 32 or 64 kb and some require an alignment of more than a page. Ideally we would try to write an aligned 64 kb block all the time. Writing aligned 64 kb chunks often gives us ten times the throughput of linear 4kb writes, and going beyond 64 kb usually does not give any better performance.
It does make sense. I think we can batch will-be-swapped-out pages in shrink_page_list if they are allocated contiguous swap slots.
But would that guarantee that all writes are the same size? While writing
Of course, not.
larger chunks would generally be helpful, in order to guarantee that the drive doesn't do any garbage collection, we would have to do all writes
And we should guarantee it, to avoid unnecessary swapout, or even OOM killing.
in aligned chunks. It would probably be enough to do this in 8kb or 16kb units for most devices over the next few years, but implementing it for 64kb should be the same amount of work and will get us a little bit further.
I understand from your statement that 64K is best for writing. What about the 8K and 16K? Could you elaborate on the relation between 8K, 16K and 64K?
I'm not sure what we would do when there are less than 64kb available for pageout on the inactive list. The two choices I can think of are either not writing anything, or wasting the swap slots and filling
Not writing anything will cause many unnecessary pages to be swapped out by the next priority of scanning, and we can't guarantee how long we would wait to queue up 64KB of anon pages. It might take longer than the GC time, so we need some deadline.
up the data with zeroes.
Zero padding would be a good solution, but I have a concern about WAP, so we need a smart policy.
To be honest, swap-out is normally an asynchronous operation, so it should not affect system latency as much as swap read, which is a synchronous operation. So under low memory pressure, we can queue swap-out pages up to 64KB and then batch write them out into an empty cluster. If we don't have any empty cluster under low memory pressure, we should write them out into a partial cluster. Maybe that doesn't affect system latency severely under low memory pressure.
If system memory pressure is high (and that should not be frequent), swap-out bandwidth becomes more important. So we can reserve some clusters for it, and I think we can use the page padding you mentioned in this case for reducing latency, if we can queue up to 64KB within a threshold time.
Swap read is also important. We have to investigate fragmentation of swap slots, because we disable swap readahead on non-rotating devices. That can create lots of holes in swap clusters and make it hard to find an empty cluster. So it might be better to enable swap readahead on non-rotating devices, too.
- Make variable sized swap clusters. Right now, the swap space is
organized in clusters of 256 pages (1MB), which is less than the typical erase block size of 4 or 8 MB. We should try to make the swap cluster aligned to erase blocks and have the size match to avoid garbage collection in the drive. The cluster size would typically be set by mkswap as a new option and interpreted at swapon time.
If we can find such big contiguous swap slots easily, it would be good. But I am not sure how often we can get such big slots. And maybe we have to improve search method for getting such big empty cluster.
As long as there are clusters available, we should try to find them. When free space is too fragmented to find any unused cluster, we can pick one that has very little data in it, so that we reduce the time it takes to GC that erase block in the drive. While we could theoretically do active garbage collection of swap data in the kernel, it won't get more efficient than the GC inside of the drive. If we do this, it unfortunately means that we can't just send a discard for the entire erase block.
We might need some compaction during idle time, but the WAP concern arises again. :(
Arnd
On Monday 09 April 2012, Minchan Kim wrote:
On 2012-04-07 1:16 AM, Arnd Bergmann wrote:
larger chunks would generally be helpful, in order to guarantee that the drive doesn't do any garbage collection, we would have to do all writes
And we should guarantee it, to avoid unnecessary swapout, or even OOM killing.
in aligned chunks. It would probably be enough to do this in 8kb or 16kb units for most devices over the next few years, but implementing it for 64kb should be the same amount of work and will get us a little bit further.
I understand from your statement that 64K is best for writing. What about the 8K and 16K? Could you elaborate on the relation between 8K, 16K and 64K?
From my measurements, there are three sizes that are relevant here:
1. The underlying page size of the flash: This used to be less than 4kb, which is fine when paging out 4kb mmu pages, as long as the partition is aligned. Today, most devices use 8kb pages and the number is increasing over time, meaning we will see more 16kb page devices in the future and presumably larger sizes after that. Writes that are not naturally aligned multiples of the page size tend to be a significant problem for the controller to deal with: in order to guarantee that a 4kb write makes it into permanent storage, the device has to write 8kb and the next 4kb write has to go into another 8kb page because each page can only be written once before the block is erased. At a later point, all the partial pages get rewritten into a new erase block, a process that can take hundreds of milliseconds and that we absolutely want to prevent from happening, as it can block all other I/O to the device. Writing all (flash) pages in an erase block sequentially usually avoids this, as long as you don't write to too many different erase blocks at the same time. Note that the page size depends on how the controller combines different planes and channels.
2. The super-page size of the flash: When you have multiple channels between the controller and the individual flash chips, you can write multiple pages simultaneously, which means that e.g. sending 32kb of data to the device takes roughly the same amount of time as writing a single 8kb page. Writing less than the super-page size when there is more data waiting to get written out is a waste of time, although the effects are much less drastic than writing data that is not aligned to pages, because it does not require garbage collection.
3. optimum write size: While writing larger amounts of data in a single request is usually faster than writing less, almost all devices I've seen have a sharp cut-off where increasing the size of the write does not actually help any more because of a bottleneck somewhere in the stack. Writing more than 64kb almost never improves performance and sometimes reduces performance.
From the measurements I've done, a typical profile could look like:
Size    Throughput
1KB     200KB/s
2KB     450KB/s
4KB     1MB/s
8KB     4MB/s    <== page size
16KB    8MB/s
32KB    16MB/s   <== superpage size
64KB    18MB/s   <== optimum size
128KB   17MB/s
...
8MB     18MB/s   <== erase block size
I'm not sure what we would do when there are less than 64kb available for pageout on the inactive list. The two choices I can think of are either not writing anything, or wasting the swap slots and filling
Not writing anything will cause many unnecessary pages to be swapped out by the next priority of scanning, and we can't guarantee how long we would wait to queue up 64KB of anon pages. It might take longer than the GC time, so we need some deadline.
up the data with zeroes.
Zero padding would be a good solution, but I have a concern about WAP, so we need a smart policy.
To be honest, I think swapout is normally asynchonous operation so that it should not affect system latency rather than swap read which is synchronous operation. So if system is low memory pressure, we can queue swap out pages up to 64KB and then batch write-out in empty cluster. If we don't have any empty cluster in low memory pressure, we should write out it in partial cluster. Maybe it doesn't affect system latency severely in low memory pressure.
The main thing that can affect system latency is garbage collection that blocks any other reads or writes for an extended amount of time. If we can avoid that, we've got the 95% solution.
Note that eMMC-4.5 provides a high-priority interrupt mechanism that lets us interrupt a write that has hit the garbage collection path, so we can send a more important read request to the device. This will not work on other devices though, and the patches for this are still under discussion.
If system memory pressure is high (and that should not be frequent), swap-out bandwidth becomes more important. So we can reserve some clusters for it, and I think we can use the page padding you mentioned in this case for reducing latency, if we can queue up to 64KB within a threshold time.
Swap-read is also important. We have to investigate fragmentation of swap slots because we disable swap readahead in non-rotation device. It can make lots of hole in swap cluster and it makes to find empty cluster. So for it, it might be better than enable swap-read in non-rotation devices, too.
Yes, reading in up to 64kb, or at least a superpage, would also help here, although reading in a single CPU page is not a problem either: it will take no more time than reading in a superpage.
- Make variable sized swap clusters. Right now, the swap space is
organized in clusters of 256 pages (1MB), which is less than the typical erase block size of 4 or 8 MB. We should try to make the swap cluster aligned to erase blocks and have the size match to avoid garbage collection in the drive. The cluster size would typically be set by mkswap as a new option and interpreted at swapon time.
If we can find such big contiguous swap slots easily, it would be good. But I am not sure how often we can get such big slots. And maybe we have to improve search method for getting such big empty cluster.
As long as there are clusters available, we should try to find them. When free space is too fragmented to find any unused cluster, we can pick one that has very little data in it, so that we reduce the time it takes to GC that erase block in the drive. While we could theoretically do active garbage collection of swap data in the kernel, it won't get more efficient than the GC inside of the drive. If we do this, it unfortunately means that we can't just send a discard for the entire erase block.
Might need some compaction during idle time but WAP concern raises again. :(
Sorry for my ignorance, but what does WAP stand for?
Arnd
On 2012-04-09 9:35 PM, Arnd Bergmann wrote:
On Monday 09 April 2012, Minchan Kim wrote:
On 2012-04-07 1:16 AM, Arnd Bergmann wrote:
larger chunks would generally be helpful, in order to guarantee that the drive doesn't do any garbage collection, we would have to do all writes
And we should guarantee it, to avoid unnecessary swapout, or even OOM killing.
in aligned chunks. It would probably be enough to do this in 8kb or 16kb units for most devices over the next few years, but implementing it for 64kb should be the same amount of work and will get us a little bit further.
I understand from your statement that 64K is best for writing. What about the 8K and 16K? Could you elaborate on the relation between 8K, 16K and 64K?
From my measurements, there are three sizes that are relevant here:
- The underlying page size of the flash: This used to be less than 4kb,
which is fine when paging out 4kb mmu pages, as long as the partition is aligned. Today, most devices use 8kb pages and the number is increasing over time, meaning we will see more 16kb page devices in the future and presumably larger sizes after that. Writes that are not naturally aligned multiples of the page size tend to be a significant problem for the controller to deal with: in order to guarantee that a 4kb write makes it into permanent storage, the device has to write 8kb and the next 4kb write has to go into another 8kb page because each page can only be written once before the block is erased. At a later point, all the partial pages get rewritten into a new erase block, a process that can take hundreds of miliseconds and that we absolutely want to prevent from happening, as it can block all other I/O to the device. Writing all (flash) pages in an erase block sequentially usually avoids this, as long as you don't write to many different erase blocks at the same time. Note that the page size depends on how the controller combines different planes and channels.
- The super-page size of the flash: When you have multiple channels
between the controller and the individual flash chips, you can write multiple pages simultaneously, which means that e.g. sending 32kb of data to the device takes roughly the same amount of time as writing a single 8kb page. Writing less than the super-page size when there is more data waiting to get written out is a waste of time, although the effects are much less drastic as writing data that is not aligned to pages because it does not require garbage collection.
- optimum write size: While writing larger amounts of data in a single
request is usually faster than writing less, almost all devices I've seen have a sharp cut-off where increasing the size of the write does not actually help any more because of a bottleneck somewhere in the stack. Writing more than 64kb almost never improves performance and sometimes reduces performance.
For our understanding, you mean we should do aligned writes as follows, if possible?
"Nand internal page size write(8K, 16K)" < "Super-page size write(32K) which considers parallel working with number of channel and plane" < some sequential big write (64K)
From the measurements I've done, a typical profile could look like:
Size    Throughput
1KB     200KB/s
2KB     450KB/s
4KB     1MB/s
8KB     4MB/s    <== page size
16KB    8MB/s
32KB    16MB/s   <== superpage size
64KB    18MB/s   <== optimum size
128KB   17MB/s
...
8MB     18MB/s   <== erase block size
I'm not sure what we would do when there are less than 64kb available for pageout on the inactive list. The two choices I can think of are either not writing anything, or wasting the swap slots and filling
Not writing anything will cause many unnecessary pages to be swapped out by the next priority of scanning, and we can't guarantee how long we would wait to queue up 64KB of anon pages. It might take longer than the GC time, so we need some deadline.
up the data with zeroes.
Zero padding would be a good solution, but I have a concern about WAP, so we need a smart policy.
To be honest, I think swapout is normally asynchonous operation so that it should not affect system latency rather than swap read which is synchronous operation. So if system is low memory pressure, we can queue swap out pages up to 64KB and then batch write-out in empty cluster. If we don't have any empty cluster in low memory pressure, we should write out it in partial cluster. Maybe it doesn't affect system latency severely in low memory pressure.
The main thing that can affect system latency is garbage collection that blocks any other reads or writes for an extended amount of time. If we can avoid that, we've got the 95% solution.
I see.
Note that eMMC-4.5 provides a high-priority interrupt mechanism that lets us interrupt a write that has hit the garbage collection path, so we can send a more important read request to the device. This will not work on other devices though, and the patches for this are still under discussion.
Nice feature, but I think the swap system doesn't need to consider such a feature. It should be handled by the I/O subsystem, like the I/O scheduler.
If system memory pressure is high (and that should not be frequent), swap-out bandwidth becomes more important. So we can reserve some clusters for it, and I think we can use the page padding you mentioned in this case for reducing latency, if we can queue up to 64KB within a threshold time.
Swap-read is also important. We have to investigate fragmentation of swap slots because we disable swap readahead in non-rotation device. It can make lots of hole in swap cluster and it makes to find empty cluster. So for it, it might be better than enable swap-read in non-rotation devices, too.
Yes, reading in up to 64kb or at least a superpage would also help here, although there is no problem reading in a single cpu page, it will still take no more time than reading in a superpage.
- Make variable sized swap clusters. Right now, the swap space is
organized in clusters of 256 pages (1MB), which is less than the typical erase block size of 4 or 8 MB. We should try to make the swap cluster aligned to erase blocks and have the size match to avoid garbage collection in the drive. The cluster size would typically be set by mkswap as a new option and interpreted at swapon time.
If we can find such big contiguous swap slots easily, it would be good. But I am not sure how often we can get such big slots. And maybe we have to improve search method for getting such big empty cluster.
As long as there are clusters available, we should try to find them. When free space is too fragmented to find any unused cluster, we can pick one that has very little data in it, so that we reduce the time it takes to GC that erase block in the drive. While we could theoretically do active garbage collection of swap data in the kernel, it won't get more efficient than the GC inside of the drive. If we do this, it unfortunately means that we can't just send a discard for the entire erase block.
Might need some compaction during idle time but WAP concern raises again. :(
Sorry for my ignorance, but what does WAP stand for?
I should have used a more general term. I mean write amplification, but WAF (Write Amplification Factor) is more popular. :(
Arnd
On Tuesday 10 April 2012, Minchan Kim wrote:
On 2012-04-09 9:35 PM, Arnd Bergmann wrote:
I understand from your statement that 64K is best for writing. What about the 8K and 16K? Could you elaborate on the relation between 8K, 16K and 64K?
From my measurements, there are three sizes that are relevant here:
- The underlying page size of the flash: This used to be less than 4kb,
which is fine when paging out 4kb mmu pages, as long as the partition is aligned. Today, most devices use 8kb pages and the number is increasing over time, meaning we will see more 16kb page devices in the future and presumably larger sizes after that. Writes that are not naturally aligned multiples of the page size tend to be a significant problem for the controller to deal with: in order to guarantee that a 4kb write makes it into permanent storage, the device has to write 8kb and the next 4kb write has to go into another 8kb page because each page can only be written once before the block is erased. At a later point, all the partial pages get rewritten into a new erase block, a process that can take hundreds of miliseconds and that we absolutely want to prevent from happening, as it can block all other I/O to the device. Writing all (flash) pages in an erase block sequentially usually avoids this, as long as you don't write to many different erase blocks at the same time. Note that the page size depends on how the controller combines different planes and channels.
- The super-page size of the flash: When you have multiple channels
between the controller and the individual flash chips, you can write multiple pages simultaneously, which means that e.g. sending 32kb of data to the device takes roughly the same amount of time as writing a single 8kb page. Writing less than the super-page size when there is more data waiting to get written out is a waste of time, although the effects are much less drastic as writing data that is not aligned to pages because it does not require garbage collection.
- optimum write size: While writing larger amounts of data in a single
request is usually faster than writing less, almost all devices I've seen have a sharp cut-off where increasing the size of the write does not actually help any more because of a bottleneck somewhere in the stack. Writing more than 64kb almost never improves performance and sometimes reduces performance.
For our understanding, you mean we have to do aligned-write as follows if possible?
"Nand internal page size write(8K, 16K)" < "Super-page size write(32K) which considers parallel working with number of channel and plane" < some sequential big write (64K)
In the definition I gave above, page size (8k, 16k) would be the only one that requires alignment. Writing 64k at an arbitrary 16k alignment should still give us the best performance in almost all cases and introduce no extra write amplification, while writing with less than page alignment causes significant write amplification and long latencies.
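In other words, only the flash page size needs to be treated as a hard alignment requirement. A trivial helper along these lines (my own sketch, with an assumed 16 kb flash page size) is all that would be needed to round a request out to that boundary:

/* Expand a byte range so both ends fall on flash page boundaries; only
 * this alignment matters for avoiding garbage collection, the 64 kb
 * grouping can start at any flash page. 16 kb is an assumed value. */
#include <stdint.h>

#define FLASH_PAGE_BYTES (16 * 1024)

static void align_to_flash_page(uint64_t *start, uint64_t *len)
{
        uint64_t begin = *start & ~(uint64_t)(FLASH_PAGE_BYTES - 1);
        uint64_t end = (*start + *len + FLASH_PAGE_BYTES - 1) &
                       ~(uint64_t)(FLASH_PAGE_BYTES - 1);

        *start = begin;
        *len = end - begin;
}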
Note that eMMC-4.5 provides a high-priority interrupt mechanism that lets us interrupt a write that has hit the garbage collection path, so we can send a more important read request to the device. This will not work on other devices though, and the patches for this are still under discussion.
Nice feature, but I think the swap system doesn't need to consider such a feature. It should be handled by the I/O subsystem, like the I/O scheduler.
Right, this is completely independent of swap. The current implementation of the patch set favours only reads that are done for page-in operations by interrupting any long-running writes when a more important read comes in. IMHO we should do the same for any synchronous read, but that discussion is completely orthogonal to having the swap device on emmc.
- Make variable sized swap clusters. Right now, the swap space is
organized in clusters of 256 pages (1MB), which is less than the typical erase block size of 4 or 8 MB. We should try to make the swap cluster aligned to erase blocks and have the size match to avoid garbage collection in the drive. The cluster size would typically be set by mkswap as a new option and interpreted at swapon time.
If we can find such big contiguous swap slots easily, it would be good. But I am not sure how often we can get such big slots. And maybe we have to improve search method for getting such big empty cluster.
As long as there are clusters available, we should try to find them. When free space is too fragmented to find any unused cluster, we can pick one that has very little data in it, so that we reduce the time it takes to GC that erase block in the drive. While we could theoretically do active garbage collection of swap data in the kernel, it won't get more efficient than the GC inside of the drive. If we do this, it unfortunately means that we can't just send a discard for the entire erase block.
Might need some compaction during idle time but WAP concern raises again. :(
Sorry for my ignorance, but what does WAP stand for?
I should have used a more general term. I mean write amplification, but WAF (Write Amplification Factor) is more popular. :(
D'oh. Thanks for the clarification. Note that the entire idea of increasing the swap cluster size to the erase block size is to *reduce* write amplification:
If we pick arbitrary swap clusters that are part of an erase block (or worse, span two partial erase blocks), sending a discard for one cluster does not allow the device to actually discard an entire erase block. Consider the best possible scenario where we have a 1MB cluster and 2MB erase blocks, all naturally aligned. After we have written the entire swap device once, all blocks are marked as used in the device, but some are available for reuse in the kernel. The swap code picks a cluster that is currently unused and sends a discard to the device, then fills the cluster with new pages. After that, we pick another swap cluster elsewhere. The erase block now contains 50% new and 50% old data and has to be garbage collected, so the device writes 2MB of data to another erase block. So, in order to write 1MB, the device has written 3MB and the write amplification factor is 3. Using 8MB erase blocks, it would be 9.
If we do the active compaction and increase the cluster size to the erase block size, there is no write amplification inside of the device (and no stalls from the garbage collection, which are the other concern), and we only need to write a few blocks again that are still valid in a cluster at the time we want to reuse it. On an ideal device, the write amplification for active compaction should be exactly the same as what we get when we write a cluster while some of the data in it is still valid and we skip those pages, while some devices might not like having to gc themselves. Doing the compaction in software means we have to spend CPU cycles on it, but we get to choose when it happens and don't have to block on the device during GC.
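The arithmetic in that example generalizes: with one cluster rewritten per erase block and the rest of the block copied by the device's garbage collector, the write amplification factor comes out as roughly 1 + (erase block size / cluster size). A quick back-of-the-envelope check, matching the numbers above:

/* Worst-case WAF when clusters are smaller than erase blocks: the host
 * writes one cluster, the device then rewrites the whole erase block
 * during garbage collection. */
#include <stdio.h>

int main(void)
{
        const double cluster_mb = 1.0;
        const double erase_block_mb[] = { 2.0, 4.0, 8.0 };

        for (int i = 0; i < 3; i++) {
                double waf = (cluster_mb + erase_block_mb[i]) / cluster_mb;

                printf("%g MB erase blocks, %g MB clusters: WAF = %g\n",
                       erase_block_mb[i], cluster_mb, waf);
        }
        return 0;
}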
Arnd
On Tue, Apr 10, 2012 at 08:32:51AM +0000, Arnd Bergmann wrote:
On Tuesday 10 April 2012, Minchan Kim wrote:
On 2012-04-09 9:35 PM, Arnd Bergmann wrote:
I understand from your statement that 64K is best for writing. What about the 8K and 16K? Could you elaborate on the relation between 8K, 16K and 64K?
From my measurements, there are three sizes that are relevant here:
- The underlying page size of the flash: This used to be less than 4kb,
which is fine when paging out 4kb mmu pages, as long as the partition is aligned. Today, most devices use 8kb pages and the number is increasing over time, meaning we will see more 16kb page devices in the future and presumably larger sizes after that. Writes that are not naturally aligned multiples of the page size tend to be a significant problem for the controller to deal with: in order to guarantee that a 4kb write makes it into permanent storage, the device has to write 8kb and the next 4kb write has to go into another 8kb page because each page can only be written once before the block is erased. At a later point, all the partial pages get rewritten into a new erase block, a process that can take hundreds of miliseconds and that we absolutely want to prevent from happening, as it can block all other I/O to the device. Writing all (flash) pages in an erase block sequentially usually avoids this, as long as you don't write to many different erase blocks at the same time. Note that the page size depends on how the controller combines different planes and channels.
- The super-page size of the flash: When you have multiple channels
between the controller and the individual flash chips, you can write multiple pages simultaneously, which means that e.g. sending 32kb of data to the device takes roughly the same amount of time as writing a single 8kb page. Writing less than the super-page size when there is more data waiting to get written out is a waste of time, although the effects are much less drastic as writing data that is not aligned to pages because it does not require garbage collection.
- optimum write size: While writing larger amounts of data in a single
request is usually faster than writing less, almost all devices I've seen have a sharp cut-off where increasing the size of the write does not actually help any more because of a bottleneck somewhere in the stack. Writing more than 64kb almost never improves performance and sometimes reduces performance.
For our understanding, you mean we have to do aligned-write as follows if possible?
"Nand internal page size write(8K, 16K)" < "Super-page size write(32K) which considers parallel working with number of channel and plane" < some sequential big write (64K)
In the definition I gave above, page size (8k, 16k) would be the only one that requires alignment. Writing 64k at an arbitrary 16k alignment should still give us the best performance in almost all cases and introduce no extra write amplification, while writing with less than page alignment causes significant write amplification and long latencies.
Note that eMMC-4.5 provides a high-priority interrupt mechanism that lets us interrupt a write that has hit the garbage collection path, so we can send a more important read request to the device. This will not work on other devices though, and the patches for this are still under discussion.
Nice feature, but I think the swap system doesn't need to consider such a feature. It should be handled by the I/O subsystem, like the I/O scheduler.
Right, this is completely independent of swap. The current implementation of the patch set favours only reads that are done for page-in operations by interrupting any long-running writes when a more important read comes in. IMHO we should do the same for any synchronous read, but that discussion is completely orthogonal to having the swap device on emmc.
2) Make variable sized swap clusters. Right now, the swap space is organized in clusters of 256 pages (1MB), which is less than the typical erase block size of 4 or 8 MB. We should try to make the swap cluster aligned to erase blocks and have the size match to avoid garbage collection in the drive. The cluster size would typically be set by mkswap as a new option and interpreted at swapon time.
If we can find such big contiguous swap slots easily, it would be good. But I am not sure how often we can get such big slots. And maybe we have to improve search method for getting such big empty cluster.
As long as there are clusters available, we should try to find them. When free space is too fragmented to find any unused cluster, we can pick one that has very little data in it, so that we reduce the time it takes to GC that erase block in the drive. While we could theoretically do active garbage collection of swap data in the kernel, it won't get more efficient than the GC inside of the drive. If we do this, it unfortunately means that we can't just send a discard for the entire erase block.
Might need some compaction during idle time but WAP concern raises again. :(
Sorry for my ignorance, but what does WAP stand for?
I should have used a more general term. I mean write amplification, but WAF (Write Amplification Factor) is more popular. :(
D'oh. Thanks for the clarification. Note that the entire idea of increasing the swap cluster size to the erase block size is to *reduce* write amplification:
If we pick arbitrary swap clusters that are part of an erase block (or worse, span two partial erase blocks), sending a discard for one cluster does not allow the device to actually discard an entire erase block. Consider the best possible scenario where we have a 1MB cluster and 2MB erase blocks, all naturally aligned. After we have written the entire swap device once, all blocks are marked as used in the device, but some are available for reuse in the kernel. The swap code picks a cluster that is currently unused and sends a discard to the device, then fills the cluster with new pages. After that, we pick another swap cluster elsewhere. The erase block now contains 50% new and 50% old data and has to be garbage collected, so the device writes 2MB of data to another erase block. So, in order to write 1MB, the device has written 3MB and the write amplification factor is 3. Using 8MB erase blocks, it would be 9.
If we do the active compaction and increase the cluster size to the erase block size, there is no write amplification inside of the device (and no stalls from the garbage collection, which are the other concern), and we only need to write a few blocks again that are still valid in a cluster at the time we want to reuse it. On an ideal device, the write amplification for active compaction should be exactly the same as what we get when we write a cluster while some of the data in it is still valid and we skip those pages, while some devices might not like having to gc themselves. Doing the compaction in software means we have to spend CPU cycles on it, but we get to choose when it happens and don't have to block on the device during GC.
Thanks for the detailed explanation. At the least, we need active compaction to avoid GC completely when we can't find an empty cluster and there are lots of holes. The indirection layer we discussed at the last LSF/MM could make slot changes for compaction easy. I think the way we find an empty cluster should be changed, because the current linear scan is not appropriate for a bigger cluster size.
I am looking forward to your work!
P.S) I'm afraid this work might reignite the endless war about what the host can do well vs. what the device can do well. If we can make this work, we don't need a costly eMMC FTL, just dumb bare NAND, a controller and simple firmware.
Arnd
On Wednesday 11 April 2012, Minchan Kim wrote:
On Tue, Apr 10, 2012 at 08:32:51AM +0000, Arnd Bergmann wrote:
I should have used a more general term. I mean write amplification, but WAF (Write Amplification Factor) is more popular. :(
D'oh. Thanks for the clarification. Note that the entire idea of increasing the swap cluster size to the erase block size is to *reduce* write amplification:
If we pick arbitrary swap clusters that are part of an erase block (or worse, span two partial erase blocks), sending a discard for one cluster does not allow the device to actually discard an entire erase block. Consider the best possible scenario where we have a 1MB cluster and 2MB erase blocks, all naturally aligned. After we have written the entire swap device once, all blocks are marked as used in the device, but some are available for reuse in the kernel. The swap code picks a cluster that is currently unused and sends a discard to the device, then fills the cluster with new pages. After that, we pick another swap cluster elsewhere. The erase block now contains 50% new and 50% old data and has to be garbage collected, so the device writes 2MB of data to another erase block. So, in order to write 1MB, the device has written 3MB and the write amplification factor is 3. Using 8MB erase blocks, it would be 9.
If we do the active compaction and increase the cluster size to the erase block size, there is no write amplification inside of the device (and no stalls from the garbage collection, which are the other concern), and we only need to write a few blocks again that are still valid in a cluster at the time we want to reuse it. On an ideal device, the write amplification for active compaction should be exactly the same as what we get when we write a cluster while some of the data in it is still valid and we skip those pages, while some devices might not like having to gc themselves. Doing the compaction in software means we have to spend CPU cycles on it, but we get to choose when it happens and don't have to block on the device during GC.
Thanks for detail explanation. At least, we need active compaction to avoid GC completely when we can't find empty cluster and there are lots of hole. Indirection layer we discussed last LSF/MM could help slot change by compaction easily. I think way to find empty cluster should be changed because current linear scan is not proper for bigger cluster size.
I am looking forward to your works!
P.S) I'm afraid this work might raise endless war, again which host can do well VS device can do well. If we can work out, we don't need costly eMMC FTL, just need dumb bare nand, controller and simple firmware.
IMHO, we should only distinguish between dumb and smart devices, defined as follows:
1. smart devices behave like all but the extremely cheap SSDs. They are optimized for 4KB random I/O, and the erase block size is not visible because there is a write cache and a flexible controller between the block device abstraction and the raw flash.
2. dumb devices have very visible effects that stem from a simplistic remapping layer that translates logical erase block numbers into physical erase blocks, and only a fixed number of those can be written at the same time before forcing GC. Writes smaller than page size are strongly discouraged here. There is no RAM to cache writes in the controller, but we still expect these devices to have a reasonable wear levelling policy. This covers almost all of today's eMMC, SD, USB and CF as well as some cheap ATA SSD.
A third category is of course spinning rust, but I think with the distinction for solid state media above, we have a pretty good grip on all existing media. As eMMC and UFS evolve over time, we might want to stick them into the first category, but I don't think we need more categories.
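To make that distinction usable, the tunables that came up in this thread could be grouped per swap device along these lines. This is only a sketch of mine, not an existing kernel interface, and all field names are invented:

/* Hypothetical per-device swap tuning derived from the two categories:
 * smart devices keep today's behaviour, dumb devices get erase block
 * sized clusters, batched aligned writes and optional per-page discard. */
#include <stdbool.h>
#include <stdint.h>

struct swap_flash_params {
        bool     smart;                 /* category 1: behaves like an SSD */
        uint32_t flash_page_size;       /* e.g. 8 kb or 16 kb */
        uint32_t optimum_write_size;    /* e.g. 64 kb */
        uint64_t cluster_size;          /* erase block size, e.g. 4 or 8 MB */
        bool     discard_per_page;      /* only for media that benefit */
};

static const struct swap_flash_params dumb_emmc_example = {
        .smart              = false,
        .flash_page_size    = 16 * 1024,
        .optimum_write_size = 64 * 1024,
        .cluster_size       = 8 << 20,
        .discard_per_page   = false,
};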
Arnd
On 04/12/2012 12:57 AM, Arnd Bergmann wrote:
On Wednesday 11 April 2012, Minchan Kim wrote:
On Tue, Apr 10, 2012 at 08:32:51AM +0000, Arnd Bergmann wrote:
I should have used a more general term. I mean write amplification, but WAF (Write Amplification Factor) is more popular. :(
D'oh. Thanks for the clarification. Note that the entire idea of increasing the swap cluster size to the erase block size is to *reduce* write amplification:
If we pick arbitrary swap clusters that are part of an erase block (or worse, span two partial erase blocks), sending a discard for one cluster does not allow the device to actually discard an entire erase block. Consider the best possible scenario where we have a 1MB cluster and 2MB erase blocks, all naturally aligned. After we have written the entire swap device once, all blocks are marked as used in the device, but some are available for reuse in the kernel. The swap code picks a cluster that is currently unused and sends a discard to the device, then fills the cluster with new pages. After that, we pick another swap cluster elsewhere. The erase block now contains 50% new and 50% old data and has to be garbage collected, so the device writes 2MB of data to another erase block. So, in order to write 1MB, the device has written 3MB and the write amplification factor is 3. Using 8MB erase blocks, it would be 9.
If we do the active compaction and increase the cluster size to the erase block size, there is no write amplification inside of the device (and no stalls from the garbage collection, which are the other concern), and we only need to write a few blocks again that are still valid in a cluster at the time we want to reuse it. On an ideal device, the write amplification for active compaction should be exactly the same as what we get when we write a cluster while some of the data in it is still valid and we skip those pages, while some devices might not like having to gc themselves. Doing the compaction in software means we have to spend CPU cycles on it, but we get to choose when it happens and don't have to block on the device during GC.
Thanks for detail explanation. At least, we need active compaction to avoid GC completely when we can't find empty cluster and there are lots of hole. Indirection layer we discussed last LSF/MM could help slot change by compaction easily. I think way to find empty cluster should be changed because current linear scan is not proper for bigger cluster size.
I am looking forward to your works!
P.S) I'm afraid this work might raise endless war, again which host can do well VS device can do well. If we can work out, we don't need costly eMMC FTL, just need dumb bare nand, controller and simple firmware.
IMHO, we should only distinguish between dumb and smart devices, defined as follows:
- smart devices behave like all but the extremely cheap SSDs. They are optimized
for 4KB random I/O, and the erase block size is not visible because there is a write cache and a flexible controller between the block device abstraction and the raw flash.
- dumb devices have very visible effects that stem from a simplistic remapping
layer that translates logical erase block numbers into physical erase blocks, and only a fixed number of those can be written at the same time before forcing GC. Writes smaller than page size are strongly discouraged here. There is no RAM to cache writes in the controller, but we still expect these devices to have a reasonable wear levelling policy. This covers almost all of today's eMMC, SD, USB and CF as well as some cheap ATA SSD.
Such dumb devices have a disadvantage: some users expect the device to manage everything by itself and some users don't, so someone like you will add smart features on the host to avoid GC, while someone else still believes that the eMMC by itself will do well enough that he can use any filesystem on it.
So conflicts happen.
Although we can solve several problems in using eMMC as swap, other partitions could still be used by filesystems which are not aware of the eMMC characteristics. Those could cause internal GC in the eMMC even if the swap code does everything right, so long latencies could still happen when we use it as swap.
A third category is of course spinning rust, but I think with the distinction for solid state media above, we have a pretty good grip on all existing media. As eMMC and UFS evolve over time, we might want to stick them into the first category, but I don't think we need more categories.
Arnd
I really like where this is going and would like to use the opportunity to plant a few ideas.
In contrast to rotational disks, read/write operation overhead and costs are not symmetric. While random reads are much faster on flash, the number of write operations is limited by wearout and garbage collection overhead. To further improve swapping on eMMC or similar flash media I believe that the following issues need to be addressed:
1) Limit average write bandwidth to eMMC to a configurable level to guarantee a minimum device lifetime
2) Aim for a low write amplification factor to maximize useable write bandwidth
3) Strongly favor read over write operations
Lowering write amplification (2) has been discussed in this email thread - and the only observation I would like to add is that significantly over-provisioning the internal swap space compared to the exported swap space can guarantee a lower write amplification factor with the indirection and GC techniques discussed.
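To illustrate why over-provisioning helps, here is a small stand-alone snippet based on a commonly used first-order FTL model (my addition, not from this thread; it ignores wear levelling and hot/cold separation): if the garbage-collected victim block still holds a fraction u of valid data, the device rewrites u of a block to gain (1 - u) of free space, so write amplification is 1 / (1 - u), and more spare space lowers the steady-state u.

#include <stdio.h>

int main(void)
{
    double valid_fraction[] = { 0.9, 0.75, 0.5, 0.25 }; /* u of the GC victim block */

    for (int i = 0; i < 4; i++) {
        double u = valid_fraction[i];
        printf("victim %2.0f%% valid -> write amplification %.1f\n",
               u * 100, 1.0 / (1.0 - u));
    }
    return 0;
}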
I believe the swap functionality is currently optimized for storage media where read and write costs are nearly identical. As this is not the case on flash, I propose splitting the anonymous inactive queue (at least conceptually) - keeping clean anonymous pages with swap slots on a separate queue, as the cost of swapping them out/in is only an inexpensive read operation. A variable similar to swappiness (or a more dynamic algorithm) could determine the preference for swapping out clean pages or dirty pages. (A similar argument could be made for splitting up the file inactive queue.)
The problem of limiting the average write bandwidth reminds me of enforcing cpu utilization limits on interactive workloads. Just as with cpu workloads - using the resources to the limit produces poor interactivity. When interactivity suffers too much I believe the only sane response for an interactive device is to limit usage of the swap device and transition into a low memory situation - and if needed - either allowing userspace to reduce memory usage or invoking the OOM killer. As a result low memory situations could not only be encountered on new memory allocations but also on workload changes that increase the number of dirty pages.
A wild idea to avoid some writes altogether is to see if de-duplication techniques can be used to (partially?) match pages previously written to swap. In case of unencrypted swap (or encrypted swap with a static key) swap pages on eMMC could even be re-used across multiple reboots. A simple version would just compare dirty pages with the data in their swap slots, as I suspect (but really don't know) that some user space algorithms (garbage collection?) dirty a page just temporarily - eventually reverting it to the previous content.
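The "simple version" could look roughly like this as a stand-alone user-space sketch (all names hypothetical, and the array stands in for the device; a real implementation would have to read back or checksum the slot contents instead):

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NR_SLOTS  4

static unsigned char swap_store[NR_SLOTS][PAGE_SIZE];  /* stands in for the swap device */

/* write the page to its slot only if the contents actually changed */
static int swap_out(int slot, const unsigned char *page)
{
    if (memcmp(swap_store[slot], page, PAGE_SIZE) == 0)
        return 0;                                   /* unchanged: skip the flash write */
    memcpy(swap_store[slot], page, PAGE_SIZE);      /* would be a real block write */
    return 1;
}

int main(void)
{
    unsigned char page[PAGE_SIZE] = { 0 };

    page[0] = 42;
    printf("first swap-out wrote:  %d\n", swap_out(0, page)); /* 1: data differs */
    printf("second swap-out wrote: %d\n", swap_out(0, page)); /* 0: write avoided */
    return 0;
}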
Stephan
On Monday 16 April 2012, Stephan Uphoff wrote:
opportunity to plant a few ideas.
In contrast to rotational disks, read/write operation overhead and costs are not symmetric. While random reads are much faster on flash, the number of write operations is limited by wearout and garbage collection overhead. To further improve swapping on eMMC or similar flash media I believe that the following issues need to be addressed:
1) Limit average write bandwidth to eMMC to a configurable level to guarantee a minimum device lifetime
2) Aim for a low write amplification factor to maximize useable write bandwidth
3) Strongly favor read over write operations
Lowering write amplification (2) has been discussed in this email thread - and the only observation I would like to add is that significantly over-provisioning the internal swap space compared to the exported swap space can guarantee a lower write amplification factor with the indirection and GC techniques discussed.
Yes, good point.
I believe the swap functionality is currently optimized for storage media where read and write costs are nearly identical. As this is not the case on flash, I propose splitting the anonymous inactive queue (at least conceptually) - keeping clean anonymous pages with swap slots on a separate queue, as the cost of swapping them out/in is only an inexpensive read operation. A variable similar to swappiness (or a more dynamic algorithm) could determine the preference for swapping out clean pages or dirty pages. (A similar argument could be made for splitting up the file inactive queue.)
I'm not sure I understand yet how this would be different from swappiness.
The problem of limiting the average write bandwidth reminds me of enforcing cpu utilization limits on interactive workloads. Just as with cpu workloads - using the resources to the limit produces poor interactivity. When interactivity suffers too much I believe the only sane response for an interactive device is to limit usage of the swap device and transition into a low memory situation - and if needed - either allowing userspace to reduce memory usage or invoking the OOM killer. As a result low memory situations could not only be encountered on new memory allocations but also on workload changes that increase the number of dirty pages.
While swap is just a special case for anonymous memory in writeback rather than file backed pages, I think what you want here is a tuning knob that decides whether we should discard a clean page or write back a dirty page under memory pressure. I have to say that I don't know whether we already have such a knob or whether we already treat them differently, but it is certainly a valid observation that on hard drives, discarding a clean page that is likely going to be needed again has about the same overhead as writing back a dirty page (i.e. one seek operation), while on flash the former would be much cheaper than the latter.
A wild idea to avoid some writes altogether is to see if de-duplication techniques can be used to (partially?) match pages previously written to swap.
Interesting! We already have KSM (kernel samepage merging) to do the same thing in memory, but I don't know how that works during swapout. It might already be there, waiting to get switched on, or might not be possible until we implement an extra remapping layer in swap as has been proposed. It's certainly worth remembering this as we work on the design for that remapping layer.
In case of unencrypted swap (or encrypted swap with a static key) swap pages on eMMC could even be re-used across multiple reboots. A simple version would just compare dirty pages with data in their swap slots as I suspect (but really don't know) that some user space algorithms (garbage collection?) dirty a page just temporarily - eventually reverting it to the previous content.
I think that would incur overhead for indexing the pages in swap space in a persistent way, something that by itself would contribute to write amplification because for every swapout, we would have to write both the page and the index (eventually), and that index would likely be a random write.
Thanks for your thoughts!
Arnd
Hi Arnd,
On Mon, Apr 16, 2012 at 12:59 PM, Arnd Bergmann arnd@arndb.de wrote:
On Monday 16 April 2012, Stephan Uphoff wrote:
opportunity to plant a few ideas.
In contrast to rotational disks, read/write operation overhead and costs are not symmetric. While random reads are much faster on flash, the number of write operations is limited by wearout and garbage collection overhead. To further improve swapping on eMMC or similar flash media I believe that the following issues need to be addressed:
1) Limit average write bandwidth to eMMC to a configurable level to guarantee a minimum device lifetime
2) Aim for a low write amplification factor to maximize useable write bandwidth
3) Strongly favor read over write operations
Lowering write amplification (2) has been discussed in this email thread - and the only observation I would like to add is that significantly over-provisioning the internal swap space compared to the exported swap space can guarantee a lower write amplification factor with the indirection and GC techniques discussed.
Yes, good point.
I believe the swap functionality is currently optimized for storage media where read and write costs are nearly identical. As this is not the case on flash, I propose splitting the anonymous inactive queue (at least conceptually) - keeping clean anonymous pages with swap slots on a separate queue, as the cost of swapping them out/in is only an inexpensive read operation. A variable similar to swappiness (or a more dynamic algorithm) could determine the preference for swapping out clean pages or dirty pages. (A similar argument could be made for splitting up the file inactive queue.)
I'm not sure I understand yet how this would be different from swappiness.
As I see it, swappiness determines the ratio for paging out file-backed as compared to anonymous, swap-backed pages. I would like to further be able to set the ratio for throwing away clean anonymous pages with swap slots (that are easy to read back in) as compared to writing out dirty anonymous pages to swap.
The problem of limiting the average write bandwidth reminds me of enforcing cpu utilization limits on interactive workloads. Just as with cpu workloads - using the resources to the limit produces poor interactivity. When interactivity suffers too much I believe the only sane response for an interactive device is to limit usage of the swap device and transition into a low memory situation - and if needed - either allowing userspace to reduce memory usage or invoking the OOM killer. As a result low memory situations could not only be encountered on new memory allocations but also on workload changes that increase the number of dirty pages.
While swap is just a special case for anonymous memory in writeback rather than file backed pages, I think what you want here is a tuning knob that decides whether we should discard a clean page or write back a dirty page under memory pressure. I have to say that I don't know whether we already have such a knob or whether we already treat them differently, but it is certainly a valid observation that on hard drives, discarding a clean page that is likely going to be needed again has about the same overhead as writing back a dirty page (i.e. one seek operation), while on flash the former would be much cheaper than the latter.
Exactly - as far as I can see there is no such knob. I mentioned splitting the anonymous inactive queue (into clean and dirty) as I believe it would make it easier to implement such a knob while maintaining as much LRU information as possible.
A wild idea to avoid some writes altogether is to see if de-duplication techniques can be used to (partially?) match pages previously written to swap.
Interesting! We already have KSM (kernel samepage merging) to do the same thing in memory, but I don't know how that works during swapout. It might already be there, waiting to get switched on, or might not be possible until we implement an extra remapping layer in swap as has been proposed. It's certainly worth remembering this as we work on the design for that remapping layer.
In case of unencrypted swap (or encrypted swap with a static key) swap pages on eMMC could even be re-used across multiple reboots. A simple version would just compare dirty pages with data in their swap slots as I suspect (but really don't know) that some user space algorithms (garbage collection?) dirty a page just temporarily - eventually reverting it to the previous content.
I think that would incur overhead for indexing the pages in swap space in a persistent way, something that by itself would contribute to write amplification because for every swapout, we would have to write both the page and the index (eventually), and that index would likely be a random write.
I agree - the overhead may be too big. Still, unless it is too energy intensive, I could see a case for an idle task that matches up anonymous pages to pre-existing swap data some time after reboot (and before memory is tight). Unless the memory layout is randomized, I expect many anonymous pages to end up with the same data boot after boot.
Thanks for your thoughts!
Arnd
Thanks for working on this
Stephan
On 04/17/2012 06:12 AM, Stephan Uphoff wrote:
Hi Arnd,
On Mon, Apr 16, 2012 at 12:59 PM, Arnd Bergmann arnd@arndb.de wrote:
On Monday 16 April 2012, Stephan Uphoff wrote:
opportunity to plant a few ideas.
In contrast to rotational disks, read/write operation overhead and costs are not symmetric. While random reads are much faster on flash, the number of write operations is limited by wearout and garbage collection overhead. To further improve swapping on eMMC or similar flash media I believe that the following issues need to be addressed:
1) Limit average write bandwidth to eMMC to a configurable level to guarantee a minimum device lifetime
2) Aim for a low write amplification factor to maximize useable write bandwidth
3) Strongly favor read over write operations
Lowering write amplification (2) has been discussed in this email thread - and the only observation I would like to add is that significantly over-provisioning the internal swap space compared to the exported swap space can guarantee a lower write amplification factor with the indirection and GC techniques discussed.
Yes, good point.
I believe the swap functionality is currently optimized for storage media where read and write costs are nearly identical. As this is not the case on flash, I propose splitting the anonymous inactive queue (at least conceptually) - keeping clean anonymous pages with swap slots on a separate queue, as the cost of swapping them out/in is only an inexpensive read operation. A variable similar to swappiness (or a more dynamic algorithm) could determine the preference for swapping out clean pages or dirty pages. (A similar argument could be made for splitting up the file inactive queue.)
I'm not sure I understand yet how this would be different from swappiness.
As I see it, swappiness determines the ratio for paging out file-backed as compared to anonymous, swap-backed pages. I would like to further be able to set the ratio for throwing away clean anonymous pages with swap slots (that are easy to read back in) as compared to writing out dirty anonymous pages to swap.
We can apply the rule to the file LRU list too, and we already have the ISOLATE_CLEAN mode to select victim pages in the LRU list, so it should work.
Selecting clean anonymous pages with a swap slot needs more investigation, though. Recently Dan had a question about it and Hugh answered it; see http://marc.info/?l=linux-mm&m=133462346928786&w=2
Hi Arnd,
On 04/17/2012 03:59 AM, Arnd Bergmann wrote:
On Monday 16 April 2012, Stephan Uphoff wrote:
opportunity to plant a few ideas.
In contrast to rotational disks, read/write operation overhead and costs are not symmetric. While random reads are much faster on flash, the number of write operations is limited by wearout and garbage collection overhead. To further improve swapping on eMMC or similar flash media I believe that the following issues need to be addressed:
1) Limit average write bandwidth to eMMC to a configurable level to guarantee a minimum device lifetime
2) Aim for a low write amplification factor to maximize useable write bandwidth
3) Strongly favor read over write operations
Lowering write amplification (2) has been discussed in this email thread - and the only observation I would like to add is that significantly over-provisioning the internal swap space compared to the exported swap space can guarantee a lower write amplification factor with the indirection and GC techniques discussed.
Yes, good point.
I believe the swap functionality is currently optimized for storage media where read and write costs are nearly identical. As this is not the case on flash, I propose splitting the anonymous inactive queue (at least conceptually) - keeping clean anonymous pages with swap slots on a separate queue, as the cost of swapping them out/in is only an inexpensive read operation. A variable similar to swappiness (or a more dynamic algorithm) could determine the preference for swapping out clean pages or dirty pages. (A similar argument could be made for splitting up the file inactive queue.)
I'm not sure I understand yet how this would be different from swappiness.
The problem of limiting the average write bandwidth reminds me of enforcing cpu utilization limits on interactive workloads. Just as with cpu workloads - using the resources to the limit produces poor interactivity. When interactivity suffers too much I believe the only sane response for an interactive device is to limit usage of the swap device and transition into a low memory situation - and if needed - either allowing userspace to reduce memory usage or invoking the OOM killer. As a result low memory situations could not only be encountered on new memory allocations but also on workload changes that increase the number of dirty pages.
While swap is just a special case for anonymous memory in writeback rather than file backed pages, I think what you want here is a tuning knob that decides whether we should discard a clean page or write back a dirty page under memory pressure. I have to say that I don't know whether we already have such a knob or whether we already treat them differently, but it is certainly a valid observation that on hard drives, discarding a clean page that is likely going to be needed again has about the same overhead as writing back a dirty page (i.e. one seek operation), while on flash the former would be much cheaper than the latter.
It seems to make sense considering the read/write asymmetry of flash, and there is a CFLRU (Clean-First LRU) [1] paper about it. You might already know it; if not, I hope it helps.
[1] http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&am...
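For what it's worth, a minimal user-space sketch of that clean-first idea, in the spirit of the CFLRU paper above (the window heuristic and all names are illustrative, not actual kernel code): within a window at the tail of the inactive list, prefer a clean page, since dropping it only costs a cheap flash read later, and fall back to the coldest dirty page otherwise.

#include <stdio.h>
#include <stdbool.h>

struct page_entry {
    int id;
    bool dirty;
};

/* pages[0] is the LRU tail (coldest); 'window' limits how far we search */
static int pick_victim(const struct page_entry *pages, int n, int window)
{
    int limit = window < n ? window : n;

    for (int i = 0; i < limit; i++)
        if (!pages[i].dirty)
            return i;               /* clean page: just drop it, no write */
    return 0;                       /* no clean page in the window: evict the tail */
}

int main(void)
{
    struct page_entry inactive[] = {
        { 1, true }, { 2, true }, { 3, false }, { 4, true },
    };
    int victim = pick_victim(inactive, 4, 3);

    printf("evicting page %d (%s)\n", inactive[victim].id,
           inactive[victim].dirty ? "dirty, must be written out" : "clean, just dropped");
    return 0;
}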
Stephan,
Good ideas. Some comments of mine below.
On Tuesday 17 April 2012, Stephan Uphoff wrote:
I really like where this is going and would like to use the opportunity to plant a few ideas.
In contrast to rotational disks, read/write operation overhead and costs are not symmetric. While random reads are much faster on flash, the number of write operations is limited by wearout and garbage collection overhead. To further improve swapping on eMMC or similar flash media I believe that the following issues need to be addressed:
1) Limit average write bandwidth to eMMC to a configurable level to guarantee a minimum device lifetime
2) Aim for a low write amplification factor to maximize useable write bandwidth
3) Strongly favor read over write operations
Lowering write amplification (2) has been discussed in this email thread - and the only observation I would like to add is that significantly over-provisioning the internal swap space compared to the exported swap space can guarantee a lower write amplification factor with the indirection and GC techniques discussed.
I believe the swap functionality is currently optimized for storage media where read and write costs are nearly identical. As this is not the case on flash, I propose splitting the anonymous inactive queue (at least conceptually) - keeping clean anonymous pages with swap slots on a separate queue, as the cost of swapping them out/in is only an inexpensive read operation. A variable similar to swappiness (or a more dynamic algorithm) could determine the preference for swapping out clean pages or dirty pages. (A similar argument could be made for splitting up the file inactive queue.)
I totally agree. Reads are inexpensive on flash-based devices, and as such a good swap algorithm (as well as a flash-oriented FS) should take this into account.
The problem of limiting the average write bandwidth reminds me of enforcing cpu utilization limits on interactive workloads. Just as with cpu workloads - using the resources to the limit produces poor interactivity.
I don't quite get your definition of an interactive workload, and I am not sure which technique for limiting resource utilization you have in mind. CGroups, for example, have not proven very reliable over time. Also, in my experience it has always been very difficult to correlate resource utilization stats with user interactivity. The only technique that has proven reliable over time is to do the work while the system is idle, which, to my understanding, is already done.
When interactivity suffers too much I believe the only sane response for an interactive device is to limit usage of the swap device and transition into a low memory situation - and if needed - either allowing userspace to reduce memory usage or invoking the OOM killer. As a result low memory situations could not only be encountered on new memory allocations but also on workload changes that increase the number of dirty pages.
I agree with your comments about the OOM killer (what is the point of swapping out a page if that process is going to be killed soon? That only increases the WAF on MMCs). In fact, one proposal here could be to somewhat mix the OOM index with page age. I would suggest first optimizing swap traffic for an MMC device and then starting to think about this.
A wild idea to avoid some writes altogether is to see if de-duplication techniques can be used to (partially?) match pages previously written so swap.
If you have such a situation, I think this is where KSM may help. It is my personal belief that, with a bit of work, the KSM algorithm can be extended to swapped-out pages too with little effort (at the expense of a small increase in read traffic, which is OK for flash-based storage devices).
In case of unencrypted swap (or encrypted swap with a static key) swap pages on eMMC could even be re-used across multiple reboots. A simple version would just compare dirty pages with data in their swap slots as I suspect (but really don't know) that some user space algorithms (garbage collection?) dirty a page just temporarily - eventually reverting it to the previous content.
This works against discarding or trimming a page, and as such the advantages of this technique need to be weighed against the performance gain of using the discard command.
Stephan
Cheers, Luca